One-Way Independent Samples Analysis of Variance
If we are interested in the relationship between a categorical IV and a continuous DV, the one-way analysis of variance (ANOVA) may be a suitable inferential technique. If the IV had only two levels (groups), we could just as well do a t-test, but the ANOVA allows us to have 2 or more categories. The null hypothesis tested is that μ1 = μ2 = ... = μk, that is, all k treatment groups have identical population means on the DV. The alternative hypothesis is that at least two of the population means differ from one another.
We start out by making two assumptions:
- Each of the k populations is normally distributed and
- Homogeneity of variance - each of the populations has the same variance, the IV does not
affect the variance in the DV. Thus, if the populations differ from one another they differ in
location (central tendency, mean).
The model we employ here states that each score on the DV has two components:
- the effect of the treatment (the IV, Groups) and
- error, which is anything else that affects the DV scores, such as individual differences among subjects, errors in measurement, and other extraneous variables.

That is, Yij = μ + τj + eij, or, Yij - μ = τj + eij.

The difference between the grand mean (μ) and the DV score of subject number i in group number j, Yij, is equal to the effect of being in treatment group number j, τj, plus error, eij. [Note that I am using i as the subscript for subject # and j for group #]
Computing ANOVA Statistics From Group Means and Variances, Equal n.
Let us work with the following contrived data set. We have randomly assigned five students to
each of four treatment groups, A, B, C, and D. Each group receives a different type of instruction in
the logic of ANOVA. After instruction, each student is given a 10 item multiple-choice test. Test
scores (# items correct) follow:
Group Scores Mean
A 1 2 2 2 3 2
B 2 3 3 3 4 3
C 6 7 7 7 8 7
D 7 8 8 8 9 8
Now, do these four samples differ enough from each other to reject the null hypothesis that
type of instruction has no effect on mean test performance? First, we use the sample data to
estimate the amount of error variance in the scores in the population from which the samples were
randomly drawn. That is variance (differences among scores) that is due to anything other than the
IV. One simple way to do this, assuming that you have an equal number of scores in each sample, is
to compute the average within group variance:

MSE = (s1² + s2² + ... + sk²) / k.

The among groups sum of squares is SSA = Σ nj(Mj - GM)², which simplifies to the computational form SSA = Σ(Tj²/nj) - G²/N, where Tj is the sum of the scores in group j and G is the grand sum of all N scores. With unequal sample sizes, pool the group variances weighting by sample size, pj = nj/N: MSE = Σ pj sj².
2. Obtain the Among Groups SS, Σ nj(Mj - GM)².

The GM = Σ pj Mj = .2556(4.85) + .2331(4.61) + .2707(4.61) + .2406(4.38) = 4.616.

Among Groups SS = 34(4.85 - 4.616)² + 31(4.61 - 4.616)² + 36(4.61 - 4.616)² + 32(4.38 - 4.616)² = 3.646.

With 3 df, MSA = 1.215, and F(3, 129) = 2.814, p = .042.
3. Before you get excited about this significant result, notice that the sample variances are not homogeneous. There is a negative correlation between sample mean and sample variance, due to a ceiling effect as the mean approaches its upper limit, 5. The ratio of the largest to the smallest variance is .793²/.360² = 4.852, which is significant beyond the .01 level with Hartley's maximum F-ratio statistic (a method for testing the null hypothesis that the variances are homogeneous). Although the sample sizes are close enough to equal that we might not worry about violating the homogeneity of variance assumption, for instructional purposes let us make some corrections for the heterogeneity of variance.
4. Box (1954, see our textbook) tells us the critical (.05) value for our F on this problem is somewhere between F(1, 30) = 4.17 and F(3, 129) = 2.675. Unfortunately our F falls in that range, so we don't know whether or not it is significant.
5. The Welch procedure (see the formulae in our textbook) is now our last resort, since we cannot transform the raw data (which we do not have).

W1 = 34 / .360² = 262.35, W2 = 31 / .715² = 60.638, W3 = 36 / .688² = 76.055, and W4 = 32 / .793² = 50.887.

The weighted grand mean is X' = Σ Wj Mj / Σ Wj = [262.35(4.85) + 60.638(4.61) + 76.055(4.61) + 50.887(4.38)] / [262.35 + 60.638 + 76.055 + 50.887] = 2125.44 / 449.93 = 4.724.
The numerator of F'' = Σ Wj(Mj - X')² / (k - 1) = [262.35(4.85 - 4.724)² + 60.638(4.61 - 4.724)² + 76.055(4.61 - 4.724)² + 50.887(4.38 - 4.724)²] / 3 = 3.988.
The denominator of F'' = 1 + [2(k - 2)/(k² - 1)] Σ [1/(nj - 1)][1 - Wj/ΣWj]²
= 1 + (4/15){(1/33)[1 - 262.35/449.93]² + (1/30)[1 - 60.638/449.93]² + (1/35)[1 - 76.055/449.93]² + (1/31)[1 - 50.887/449.93]²}
= 1 + (4/15)(.07532) = 1.020.

Thus, F'' = 3.988 / 1.020 = 3.910. Note that this F'' is greater than our standard F. Why? Well, notice that each group's contribution to the numerator is inversely related to its variance, thus increasing the contribution of Group 1, which had a mean far from the Grand Mean and a small variance.
We are not done yet; we still need to compute the adjusted error degrees of freedom: df'' = (k² - 1) / [3 Σ [1/(nj - 1)][1 - Wj/ΣWj]²] = 15 / [3(.07532)] = 66.38. Thus, F''(3, 66) = 3.910, p = .012.
Directional Hypotheses
I have never seen published research where the authors used ANOVA and employed a
directional test, but it is possible. Suppose you were testing the following directional hypotheses:
H0: The classification variable is not related to the outcome variable in the way specified in the alternative hypothesis
H1: μ1 > μ2 > μ3
The one-tailed p value that you obtain with the traditional F test tells you the probability of getting sample means as (or more) different from one another, in any order, as were those you obtained, were the truth that the population means are identical. Were the null true, the probability of your correctly predicting the order of the differences in the sample means is 1/k!, where k is the number of groups. By application of the multiplication rule of probability, the probability of your getting sample means as different from one another as they were, and in the order you predicted, is the one-tailed p divided by k!. If k is three, you take the one-tailed p and divide it by 3 x 2 = 6, a one-sixth tailed test. I know, that sounds strange. Lots of luck convincing the reviewers of your manuscript that you actually PREdicted the order of the means. They will think that you POSTdicted them.
Fixed vs. Random vs. Mixed Effects ANOVA
As in correlation/regression analysis, the IV in ANOVA may be fixed or random. If it is fixed,
the researcher has arbitrarily (based on es opinion, judgement, or prejudice) chosen k values of the
IV. E will restrict es generalization of the results to those k values of the IV. E has defined the
population of IV values in which e is interested as consisting of only those values e actually used,
thus, e has used the entire population of IV values. For example, I give subjects 0, 1, or 3 beers and
measure reaction time. I can draw conclusions about the effects of 0, 1, or 3 beers, but not about 2
beers, 4 beers, 10 beers, etc.
With a random effects IV, one randomly obtains levels of the IV, so the actual levels used
would not be the same if you repeated the experiment. For example, I decide to study the effect of
dose of phenylpropanolamine upon reaction time. I have my computer randomly select ten dosages
from a uniform distribution of dosages from zero to 100 units of the drug. I then administer those 10
dosages to my subjects, collect the data, and do the analyses. I may generalize across the entire
range of values (doses) from which I randomly selected my 10 values, even (by interpolation or
extrapolation) to values other than the 10 I actually employed.
[Two scatter plots: Score (0 to 10) plotted against Group (0 to 5).]
In a factorial ANOVA, one with more than one IV, you may have a mixed effects ANOVA -
one where one or more IVs is fixed and one or more is random.
Statistically, our one-way ANOVA does actually have two IVs, but one is sort of hidden. The
hidden IV is SUBJECTS. Does who the subject is affect the score on the DV? Of course it does, but
we count such effects as error variance in the one-way independent samples ANOVA. Subjects is a
random effects variable, or at least we pretend it is, since we randomly selected subjects from the
population of persons (or other things) to which we wish to generalize our results. In fact, if there is
not at least one random effects IV in your research, you don't need ANOVA or any other inferential
statistic. If all of your IVs are fixed, your data represent the entire population, not a random sample
therefrom, so your descriptive statistics are parameters and you need not infer what you already
know for sure.
ANOVA as a Regression Analysis: Eta-Squared and Omega-Squared
The ANOVA is really just a special case of a regression analysis. It can be represented as a
multiple regression analysis, with one dichotomous "dummy variable" for each treatment degree of
freedom (more on this in another lesson). It can also be represented as a bivariate, curvilinear
regression.
Here is a scatter plot for our ANOVA
data. Since the numbers used to code our
groups are arbitrary (the independent
variable being qualitative), I elected to use
the number 1 for Group A, 2 for Group D, 3
for Group C and 4 for Group B. Note that I
have used blue squares to plot the points
with a frequency of three and red triangles to
plot those with a frequency of one. The blue
squares are also the group means. I have
placed on the plot the linear regression line
predicting score from group. The regression
falls far short of significance, with the
SSRegression being only 1, for an r² of 1/138 = .007.
We could improve the fit of our
regression line to the data by removing the
restriction that it be a straight line, that is, by doing
a curvilinear regression. A quadratic regression line is based on a polynomial where the independent variables are Group and Group-squared, that is, Ŷ = a + b1X + b2X², more on this when we cover trend analysis. A quadratic function allows us one bend in the curve. Here is a plot of our data with a quadratic regression line.
Eta-squared (η²) is a curvilinear correlation coefficient. To compute it, one first uses a curvilinear equation to predict values of Y|X. You then compute the SSError as the sum of squared residuals between actual Y and predicted Y, that is, SSE = Σ(Y - Ŷ)². As usual,
SSTotal = Σ(Y - GM)², where GM is the grand mean, the mean of all scores in all groups. The SSRegression is then SSTotal - SSError. Eta squared is then SSRegression / SSTotal, the proportion of the SSTotal that is due to the curvilinear association with X. For our quadratic regression (which is highly significant), SSRegression = 126, η² = .913.
We could improve the fit a bit more by going to a cubic polynomial model (which adds Group-cubed to the quadratic model, allowing a second bending of the curve). Here is our scatter plot with the cubic regression line. Note that the regression line runs through all of the group means. This will always be the case when we have used a polynomial model of order = K - 1, where K = the number of levels of our independent variable. A cubic model has order = 3, since it includes three powers of the independent variable (Group, Group-squared, and Group-cubed). The SSRegression for the cubic model is 130, η² = .942. Please note that this SSRegression is exactly the same as that we computed earlier as the ANOVA SSAmong Groups. We have demonstrated that a polynomial regression with order = K - 1 is identical to the traditional one-way ANOVA.
Take a look at my document T = ANOVA = Regression.
Strength of Effect Estimates: Proportions of Variance Explained
We can employ η² as a measure of the magnitude of the effect of our ANOVA independent variable without doing the polynomial regression. We simply find SSAmongGroups / SSTotal from our ANOVA source table. This provides a fine measure of the strength of effect of our independent variable in our sample data, but it generally overestimates the population η². My programs Conf-Interval-R2-Regr.sas and CI-R2-SPSS.zip will compute an exact confidence interval about eta-squared. For our data η² = 130/138 = .94. A 95% confidence interval for the population parameter extends from .84 to .96. It might be better to report a 90% confidence interval here, more on that soon.
One well-known alternative is omega-squared, ω², which estimates the proportion of the variance in Y in the population which is due to variance in X:

ω² = [SSAmong - (K - 1)MSError] / [SSTotal + MSError].

For our data, ω² = [130 - (3).5] / [138 + .5] = .93.
Benchmarks for η²:
- .01 (1%) is small but not trivial
- .06 is medium
- .14 is large
A Word of Caution. Rosenthal has found that most psychologists misinterpret strength of effect estimates such as r² and ω². Rosenthal (1990, American Psychologist, 45, 775-777) used an example where a treatment (a small daily dose of aspirin) lowered patients' death rate so much that the researchers stopped the research prematurely and told the participants who were in the control condition to start taking a baby aspirin every day. So, how large was the effect of the baby aspirin? As an odds ratio it was 1.83, that is, the odds of a heart attack were 1.83 times higher in the placebo group than in the aspirin group. As a proportion of variance explained the effect size was .0011 (about one tenth of one percent).
One solution that has been proposed for dealing with r²-like statistics is to report their square root instead. For the aspirin study, we would report r = .033 (but that still sounds small to me).

Also, keep in mind that anything that artificially lowers error variance, such as using homogeneous subjects and highly controlled laboratory conditions, artificially inflates r², ω², etc. Thus, under highly controlled conditions, one can obtain a very high ω² even if outside the laboratory the IV accounts for almost none of the variance in the DV. In the field those variables held constant in the lab may account for almost all of the variance in the DV.
What Confidence Coefficient Should I Employ for η² and RMSSE?
If you want the confidence interval to be equivalent to the ANOVA F test of the effect (which employs a one-tailed, upper tailed, probability) you should employ a confidence coefficient of (1 - 2α). For example, for the usual .05 criterion of statistical significance, use a 90% confidence interval, not 95%. Please see my document Confidence Intervals for Squared Effect Size Estimates in ANOVA: What Confidence Coefficient Should be Employed?
Strength of Effect Estimates: Standardized Differences Among Means
When dealing with differences between or among group means, I generally prefer strength of effect estimators that rely on the standardized difference between means (rather than proportions of variance explained). We have already seen such estimators when we studied two-group designs (Hedges' g), but how can we apply this approach when we have more than two groups?
My favorite answer to this question is that you should just report estimates of Cohen's d for those contrasts (differences between means or sets of means) that are of most interest, that is, which are most relevant to the research questions you wish to address. Of course, I am also of the opinion that we would often be better served by dispensing with the ANOVA in the first place and proceeding directly to making those contrasts of interest without doing the ANOVA.
There is, however, another interesting suggestion. We could estimate the average value of Cohen's d for the groups in our research. There are several ways we could do this. We could, for example, estimate d for every pair of means, take the absolute values of those estimates, and then average them.
James H. Steiger (2004: Psychological Methods, 9, 164-182) has proposed the use of RMSSE (root mean square standardized effect) in situations like this. Here is how the RMSSE is calculated:

RMSSE = sqrt{ [1/(k - 1)] Σ(j=1 to k) [(Mj - GM)/sqrt(MSE)]² },

where k is the number of groups, Mj is a group mean, GM is the overall (grand) mean, and the standardizer is the pooled standard deviation, the square root of the within groups mean square, MSE (note that we are assuming homogeneity of variances). Basically what we are doing here is averaging the values of (Mj - GM)/SD, having squared them first (to avoid them summing to zero), dividing by among groups degrees of freedom (k - 1) rather than k, and then taking the square root to get back to un-squared (standard deviation) units.
Since the standardizer (sqrt of MSE) is constant across groups, we can simplify the expression above to

RMSSE = sqrt{ [1/(k - 1)] [Σ(Mj - GM)²] / MSE }.
For our original set of data, the sum of the squared deviations between group means and grand mean is (2-5)² + (3-5)² + (7-5)² + (8-5)² = 26. Notice that this is simply the among groups sum of squares (130) divided by n (5). Accordingly, RMSSE = sqrt{ [1/(4 - 1)] (26/.5) } = 4.16, a Godzilla-sized average standardized difference between group means.
We can place a confidence interval about our estimate of the average standardized difference
between group means. To do so we shall need the NDC program from Steiger's page at
http://www.statpower.net/Content/NDC/NDC.exe . Download and run that exe. Ask for a 90% CI and
give the values of F and df:
Click COMPUTE.
You are given the CI for lambda, the noncentrality parameter:
Now we transform this confidence interval to a confidence interval for RMSSE with the following transformation (applied to each end of the CI):

RMSSE = sqrt{ λ / [(k - 1)n] }.

For the lower boundary, this yields sqrt[120.6998 / (3)5] = 2.837, and for the upper boundary sqrt[436.3431 / (3)5] = 5.393. That is,
our estimate of the effect size is between King Kong-sized and beyond Godzilla-sized.
Steiger noted that a test of the null hypothesis that the population RMSSE is zero is equivalent to the standard ANOVA F test if the confidence interval is constructed with 100(1 - 2α)% confidence. For example, if the ANOVA were conducted with .05 as the criterion of statistical significance, then an equivalent confidence interval for the RMSSE should be at 90% confidence -- the RMSSE cannot be negative, after all. If the 90% confidence interval includes 0, then the ANOVA F falls short of significance; if it excludes 0, then the ANOVA F is significant.
Power Analysis
One-way ANOVA power analysis is detailed in our textbook. The effect size may be specified in terms of Στj², the sum of the squared treatment effects:

φ' = sqrt[ (Στj² / k) / σ²error ].

Cohen used the symbol f for this same statistic, and considered an f of .10 to represent a small effect, .25 a medium effect, and .40 a large effect. In terms of percentage of variance explained (η²), small is 1%, medium is 6%, and large is 14%.

For example, suppose that I wish to test the null hypothesis that for GRE-Q, the population means for undergraduates intending to major in social psychology, clinical psychology, and experimental psychology are all equal. I decide that the minimum nontrivial effect size is if each mean differs from the next by 20 points (about 1/5 σ). For example, means of 480, 500, and 520. The Στj² is then 20² + 0² + 20² = 800. Next we compute φ'. Assuming that the σ is about 100, φ' = sqrt[(800/3)/10000] = 0.163.

Suppose we have 11 subjects in each group. φ = φ' x sqrt(n) = .163 x sqrt(11) = .54. Treatment df = 2, error df = 3(11 - 1) = 30. From the noncentral F table in our textbook, for φ = .50, dft = 2, dfe = 30, α = .05, β = 90%, thus power = 10%.
How many subjects would be needed to raise power to 70%? β = .30. Go to the table, assuming that you will need enough subjects so that dfe = infinity. For β = .30, φ = 1.6. Now, n = (φ²)(k)(σe²) / Στj² = (1.6)²(3)(100)² / 800 = 96. Now, 96 subjects per group would give you, practically speaking, infinite df. If N came out so low that dfe < 30, you would re-do the analysis with a downwards-adjusted dfe.
One can define an effect size in terms of η². For example, if η² = 10%, then

φ' = sqrt[ η² / (1 - η²) ] = sqrt[ .10 / (1 - .10) ] = .33.
Suppose I had 6 subjects in each of four groups. If I employed an alpha-criterion of .05, how large [in terms of % variance in the DV accounted for by variance in the IV] would the effect need be for me to have a 90% chance of rejecting the null hypothesis? From the table, for dft = 3, dfe = 20, φ = 2.0 for β = .13, and φ = 2.2 for β = .07. By linear interpolation, for β = .10, φ = 2.0 + (3/6)(.2) = 2.1. φ' = φ / sqrt(n) = 2.1 / sqrt(6) = 0.857.
η² = φ'² / (1 + φ'²) = .857² / (1 + .857²) = 0.42, a very large effect!
Do note that this method of power analysis does not ignore the effect of error df, as did the methods employed in Chapter 8. If you were doing small sample power analyses for independent t-tests, you should use the methods shown here (with k = 2), which will give the correct power figures (since F = t², t's power must be the same as F's).
Make it easy on yourself. Use G*Power to do the power analysis.
APA-Style Summary Statement
Teaching method significantly affected the students' test scores, F(3, 16) = 86.66, MSE = 0.50, p < .001, η² = .942, 95% CI [.858, .956]. As shown in Table 1, .
Copyright 2012, Karl L. Wuensch - All rights reserved.
CI-Eta2-Alpha
Confidence Intervals for Squared Effect Size Estimates in ANOVA: What
Confidence Coefficient Should be Employed?
If you want the confidence interval to be equivalent to the ANOVA F test of the
effect (which employs a one-tailed, upper tailed, probability) you should employ a
confidence coefficient of (1 - 2). For example, for the usual .05 criterion of statistical
significance, use a 90% confidence interval, not 95%. This is illustrated below.
A two-way independent samples ANOVA was conducted and produced this
output:
Dependent Variable: PulseIncrease
Sum of
Source DF Squares Mean Square F Value Pr > F
Model 3 355.95683 118.65228 3.15 0.0249
Error 380 14295.21251 37.61898
Corrected Total 383 14651.16933
R-Square Coeff Var Root MSE pulse Mean
0.024295 190.8744 6.133431 3.213333
Source DF Anova SS Mean Square F Value Pr > F
Gender 1 186.0937042 186.0937042 4.95 0.0267
Image 1 63.6027042 63.6027042 1.69 0.1943
Gender*Image 1 106.2604167 106.2604167 2.82 0.0936
Eta-square and a corresponding 95% Confidence Interval will be computed for each effect. To put a confidence interval on the η² we need to compute an adjusted F. To adjust the F we first compute an adjusted error term. For the main effect of gender,

MSE = (SSTotal - SSEffect) / (dfTotal - dfEffect) = (14651 - 186.09) / (383 - 1) = 37.867.

In effect we are putting back into the error term all of the variance accounted for by other effects in our model. Now the adjusted F(1, 382) = MSGender / MSE = 186.09 / 37.867 = 4.914.
For main effects, one can also get the adjusted F by simply doing a one way
ANOVA with only the main effect of interest in the model:
proc ANOVA data=Katie; class Gender;
model PulseIncrease = Gender;
Dependent Variable: PulseIncrease
Sum of
Source DF Squares Mean Square F Value Pr > F
Model 1 186.09370 186.09370 4.91 0.0272
Error 382 14465.07563 37.86669
Corrected Total 383 14651.16933
R-Square Coeff Var Root MSE PulseIncrease Mean
0.012702 191.5018 6.153592 3.213333
Source DF Anova SS Mean Square F Value Pr > F
Gender 1 186.0937042 186.0937042 4.91 0.0272
Now use this adjusted F with the SAS or SPSS program for putting a confidence interval on R².
DATA ETA;
*****************************************************************************
*********************************
Construct Confidence Interval for Eta-Squared
*****************************************************************************
*********************************;
F= 4.914 ;
df_num = 1 ;
df_den = 382;
ncp_lower = MAX(0,fnonct (F,df_num,df_den,.975));
ncp_upper = MAX(0,fnonct (F,df_num,df_den,.025));
eta_squared = df_num*F/(df_den + df_num*F);
eta2_lower = ncp_lower / (ncp_lower + df_num + df_den + 1);
eta2_upper = ncp_upper / (ncp_upper + df_num + df_den + 1);
output; run; proc print; var eta_squared eta2_lower eta2_upper;
title 'Confidence Interval on Eta-Squared'; run;
-------------------------------------------------------------------------------------------------
Confidence Interval on Eta-Squared
Obs    eta_squared    eta2_lower    eta2_upper
1      0.012700       0             0.043552
SASLOG
NOTE: Invalid argument to function FNONCT at line 57 column 19.
F=4.914 df_num=1 df_den=382 ncp_lower=0 ncp_upper=17.485492855 eta_squared=0.0127004968
eta2_lower=0 eta2_upper=0.0435519917 _ERROR_=1 _N_=1
NOTE: Mathematical operations could not be performed at the following places. The results of the
operations have been set to missing values.
Each place is given by: (Number of times) at (Line):(Column).
Do not be concerned about this note. You will get it every time your CI includes zero -- the iterative
procedure bumps up against the wall at value = 0.
Notice that the confidence interval includes the value 0 even though the effect of
gender is significant at the .027 level. What is going on here? I think the answer can be
found in Steiger (2004).
Example 10: Consider a test of the hypothesis that the RMSSE (as defined in Equation 12) in an ANOVA is zero. This hypothesis test is one-sided because the RMSSE cannot be negative. To use a two-sided confidence interval to test this hypothesis at the .05 significance level, one should examine the 100(1 - 2α)% = 90% confidence interval for the RMSSE. If the confidence interval excludes zero, the null hypothesis will be rejected. This hypothesis test is equivalent to the standard ANOVA F test.
Well, R² (and η²) cannot be less than zero either. Accordingly, one can argue
that when putting a CI on an ANOVA effect that has been tested with the traditional .05
criterion of significance, that CI should be a 90% CI, not a 95% CI.
ncp_lower = MAX(0,fnonct (F,df_num,df_den,.95));
ncp_upper = MAX(0,fnonct (F,df_num,df_den,.05));
------------------------------------------------------------------------------------------------
Confidence Interval on Eta-Squared
Obs    eta_squared    eta2_lower    eta2_upper
1      0.012700       .000743843    0.037453
The 90% CI does not include zero. Let us try another case. Suppose you
obtained F(2, 97) = 3.09019. The obtained value of F here is exactly equal to the critical
value of F for alpha = .05.
F= 3.09019 ;
df_num = 2 ;
df_den = 97;
ncp_lower = MAX(0,fnonct (F,df_num,df_den,.95));
ncp_upper = MAX(0,fnonct (F,df_num,df_den,.05));
------------------------------------------------------------------------------------------------
Confidence Interval on Eta-Squared
eta_ eta2_ eta2_
Obs squared lower upper
1 0.059899 2.1519E-8 0.13743
Notice that the 90% CI does exclude zero, but barely. A 95% CI would include
zero.
Reference
Steiger, J. H. (2004). Beyond the F test: Effect size confidence intervals and tests of
close fit in the analysis of variance and contrast analysis. Psychological Methods, 9,
164-182.
Karl L. Wuensch, Dept. of Psychology, East Carolina Univ., Greenville, NC USA
September, 2009
Homogeneity of Variance Tests For Two or More Groups
We covered this topic for two-group designs earlier. Basically, one transforms
the scores so that between groups variance in the scores reflects differences in
variance rather than differences in means. Then one does a t test on the transformed
scores. If there are three or more groups, simply replace the t test with an ANOVA.
See the discussion in the Engineering Statistics Handbook. Levene suggested
transforming the scores by subtracting the within-group mean from each score and then
either taking the absolute value of each deviation or squaring each deviation. Both
versions are available in SAS. Brown and Forsythe recommended using absolute
deviations from the median or from a trimmed mean. Their Monte Carlo research
indicated that the trimmed mean was the best choice when the populations were heavy
in their tails and the median was the best choice when the populations were skewed.
The Brown and Forsythe method using the median is available in SAS. It would not be
very difficult to program SAS to use the trimmed means. O'Brien's test is also available
in SAS.
I provide here SAS code to illustrate homogeneity of variance tests. The data
are the gear data from the Engineering Statistics Handbook.
options pageno=min nodate formdlim='-';
title 'Homogeneity of Variance Tests';
title2 'See http://www.itl.nist.gov/div898/handbook/eda/section3/eda35a.htm';
run;
data Levene;
input Batch N; Do I=1 to N; Input GearDiameter @@; output; end;
cards;
1 10
1.006 0.996 0.998 1.000 0.992 0.993 1.002 0.999 0.994 1.000
2 10
0.998 1.006 1.000 1.002 0.997 0.998 0.996 1.000 1.006 0.988
3 10
0.991 0.987 0.997 0.999 0.995 0.994 1.000 0.999 0.996 0.996
4 10
1.005 1.002 0.994 1.000 0.995 0.994 0.998 0.996 1.002 0.996
5 10
0.998 0.998 0.982 0.990 1.002 0.984 0.996 0.993 0.980 0.996
6 10
1.009 1.013 1.009 0.997 0.988 1.002 0.995 0.998 0.981 0.996
7 10
0.990 1.004 0.996 1.001 0.998 1.000 1.018 1.010 0.996 1.002
8 10
0.998 1.000 1.006 1.000 1.002 0.996 0.998 0.996 1.002 1.006
9 10
1.002 0.998 0.996 0.995 0.996 1.004 1.004 0.998 0.999 0.991
10 10
0.991 0.995 0.984 0.994 0.997 0.997 0.991 0.998 1.004 0.997
*****************************************************************************
;
proc GLM data=Levene; class Batch;
model GearDiameter = Batch / ss1;
means Batch / hovtest=levene hovtest=BF hovtest=obrien;
title; run;
*****************************************************************************
;
proc GLM data=Levene; class Batch;
model GearDiameter = Batch / ss1;
means Batch / hovtest=levene(type=ABS); run;
Here are parts of the statistical output, with annotations:
Levene's Test for Homogeneity of GearDiameter Variance
ANOVA of Squared Deviations from Group Means
Sum of Mean
Source DF Squares Square F Value Pr > F
Batch 9 5.755E-8 6.394E-9 2.50 0.0133
Error 90 2.3E-7 2.556E-9
With the default Levene's test (using squared deviations), the groups differ significantly
in variances.
O'Brien's Test for Homogeneity of GearDiameter Variance
ANOVA of O'Brien's Spread Variable, W = 0.5
Sum of Mean
Source DF Squares Square F Value Pr > F
Batch 9 7.105E-8 7.894E-9 2.22 0.0279
Error 90 3.205E-7 3.562E-9
Also significant with O'Brien's Test.
Brown and Forsythe's Test for Homogeneity of GearDiameter Variance
ANOVA of Absolute Deviations from Group Medians
But not significant with the Brown & Forsythe test using absolute deviations from within-
group medians.
Sum of Mean
Source DF Squares Square F Value Pr > F
Batch 9 0.000227 0.000025 1.71 0.0991
Error 90 0.00133 0.000015
-------------------------------------------------------------------------------------------------
SAS will only let you do one Levene test per invocation of PROC GLM, so I ran
GLM a second time to get the Levene test with absolute deviations. As you can see
below, the difference in variances is significant with this test.
Levene's Test for Homogeneity of GearDiameter Variance
ANOVA of Absolute Deviations from Group Means
Sum of Mean
Source DF Squares Square F Value Pr > F
Batch 9 0.000241 0.000027 2.16 0.0322
Error 90 0.00112 0.000012
The One-Way ANOVA procedure in PASW also provides a test of homogeneity
of variance, as shown below.
Test of Homogeneity of Variances
GearDiameter
Levene Statistic df1 df2 Sig.
2.159 9 90 .032
Notice that the Levene test provided by PASW is that using absolute deviations from within-group means. The Brown-Forsythe test offered as an option is not their test of equality of variances; it is a robust test of differences among means, like the Welch test.
Return to Wuensch's Statistics Lessons Page
Karl L. Wuensch
May, 2010.
Omega-Squared.doc
Dear 6430 students,
We have discussed omega-squared as a less biased (than is eta-squared)
estimate of the proportion of variance explained by the treatment variable in
the population from which our sample data could be considered to be random.
Earlier this semester we discussed a very similar statistic, r-squared, and I
warned you about how this statistic can be inflated by high levels of extraneous
variable control. The same caution applies to eta-squared and omega-squared.
Here is a comment I posted to EDSTAT-L on this topic a few years back:
------------------------------------------------------------------------------
Date: Mon, 11 Oct 93 11:27:23 EDT
From: "Karl L. Wuensch" <PSWUENSC@ecuvm1>
To: Multiple recipients of list <edstat-l@jse.stat.ncsu.edu>
Subject: Omega-squared (was P Value)
Josh, backon@vms.huji.ac.il, noted:
>We routinely run omega squared on our data. Omega squared is one of the most
>frequently applied methods in estimating the proportion of the dependent
>variable accounted for by an independent variable, and is used to confirm the
>strength of association between variables in a population. ............
Omega-squared can also be misinterpreted. If the treatment is evaluated in
circumstances (the laboratory) where the influence of extraneous variables
(other variables that influence the dependent variable) is eliminated, then the
omega-squared will be inflated relative to the proportion of the variance in the
dependent variable due to the treatment in a (real) population where those
extraneous variables are not eliminated. Thus, a treatment that really accounts
for a trivial amount of the variance in the dependent variable out there in the
real world can produce a large omega-squared when computed from data collected
in the laboratory. To a great extent both P and omega-squared measure the
extent to which the researcher has been able to eliminate "error variance" when
collecting the data (but P is also greatly influenced by sample size).
Imagine that all your subjects were clones of one another with identical
past histories. All are treated in exactly the same way, except that for half
of them you clapped your hands in their presence ten minutes before measuring
whatever the dependent variable is. Because the subjects differ only on whether
or not you clapped your hands in their presence, if such clapping has any effect
at all, no matter how small, it accounts for 100% of the variance in your
sample. If the population to which you wish to generalize your results is not
one where most extraneous variance has been eliminated, your omega-squared may
be a gross overestimate of the magnitude of the effect. Do note that this
problem is not unique to omega-squared. Were you to measure the magnitude of
the effect as being the between groups difference in means divided by the within
groups standard deviation the same potential for inflation of effect would
exist.
Karl L. Wuensch, Dept. of Psychology, East Carolina Univ.
Greenville, NC 27858-4353, phone 919-757-6800, fax 919-757-6283
Bitnet Address: PSWUENSC@ECUVM1
Internet Address: PSWUENSC@ECUVM.CIS.ECU.EDU
========================================================================
Sender: edstat-l@jse.stat.ncsu.edu
From: Joe H Ward <joeward@tenet.edu>
Karl --- good comment!! My early research days were spent in an R-squared,
Omega-squared, Factor Analysis environment. My own observations say: "BEWARE
of those correlation-type indicators!!!" --- Joe
Joe Ward 167 East Arrowhead Dr.
San Antonio, TX 78228-2402 Phone: 210-433-6575 joeward@tenet.edu
MultComp.doc
One-Way Multiple Comparisons Tests
Error Rates
The error rate per comparison, α_pc, is the probability of making a Type I error on a
single comparison, assuming the null hypothesis is true.
The error rate per experiment, α_PE, is the expected number of Type I errors made
when making c comparisons, assuming that each of the null hypotheses is true. It is
equal to the sum of the per comparison alphas. If the per comparison alphas are
constant, then α_PE = c·α_pc.
The familywise error rate, α_fw, is the probability of making one or more Type I
errors in a family of c comparisons, assuming that each of the c null hypotheses is true.
If the comparisons are independent of one another (orthogonal), then
α_fw = 1 - (1 - α_pc)^c. For our example problem, evaluating four different teaching
methods, if we were to compare each treatment mean with each other treatment mean,
c would equal 6. If we were to assume those 6 comparisons to be independent of each
other (they are not), then α_fw = 1 - .95^6 = .26.
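The handout works these numbers by hand; as an illustrative check only, the familywise formula can be sketched in a few lines of Python:

```python
# Familywise error rate for c independent comparisons,
# each tested at per-comparison alpha: 1 - (1 - alpha)^c.
def familywise_alpha(alpha_pc, c):
    return 1 - (1 - alpha_pc) ** c

# Six pairwise comparisons among four means, each at alpha = .05:
print(round(familywise_alpha(0.05, 6), 2))  # 0.26, as in the text
```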
Multiple t tests
One could just use multiple t-tests to make each comparison desired, but one runs
the risk of greatly inflating the familywise error rate (the probability of making one or
more Type I errors in a family of c comparisons) when doing so. One may use a series
of protected t-tests in this situation. This procedure requires that one first do an
omnibus ANOVA involving all k groups. If the omnibus ANOVA is not significant, one
stops and no additional comparisons are done. If that ANOVA is significant, one makes
all the comparisons one wishes using t-tests. If you have equal sample sizes and
homogeneity of variance, you can use t = (M_i - M_j) / √(2·MSE / n), which pools the
error variance across all k groups, giving you N - k degrees of freedom. If you have
homogeneity of variance but unequal n's, use t = (M_i - M_j) / √(MSE(1/n_i + 1/n_j)).
MSE is the error mean square from the omnibus ANOVA. If you had heterogeneous
variances, you would need to compute separate variances t-tests, with adjusted df.
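The pooled-error t is simple arithmetic; here is an illustrative Python sketch (the handout itself works in SAS or by hand), using the teaching-methods example values MSE = .5 and n = 5 per group:

```python
from math import sqrt

def protected_t(mean_i, mean_j, mse, n_i, n_j):
    # Pooled-error t for one pairwise comparison following a significant
    # omnibus ANOVA (Fisher's LSD); MSE comes from the omnibus ANOVA,
    # so the df are N - k.
    return (mean_i - mean_j) / sqrt(mse * (1 / n_i + 1 / n_j))

# Group C (M = 7) versus Group A (M = 2):
print(round(protected_t(7, 2, 0.5, 5, 5), 2))  # 11.18, matching the text
```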
The procedure just discussed (protected t-tests) is commonly referred to as Fisher's
LSD test. LSD stands for Least Significant Difference. If you were making
comparisons for several pairs of means, and n was the same in each sample, you
could compute the smallest difference between a pair of means that would be significant.
The sum of squares for a contrast is SS_ψ = ψ̂² / Σ(c_j² / n_j), which, with equal n's,
simplifies to SS_ψ = n·ψ̂² / Σc_j². Each
contrast will have only one treatment df, so the contrast MS is the same as the contrast
SS. To get an F for the contrast just divide it by an appropriate MSE (usually that which
would be obtained were one to do an omnibus ANOVA on all k treatment groups).
For our example problem, suppose we want to compare combined groups C and D
with combined groups A and B. The A, B, C, D means are 2, 3, 7, 8, and the
coefficients are -.5, -.5, +.5, +.5. The contrast ψ̂ = -.5(2) - .5(3) + .5(7) + .5(8) = 5.
Note that the value of the contrast is quite simply the difference between the mean of
combined groups C and D (7.5) and the mean of combined groups A and B (2.5).
SS_ψ = 5(5²) / (.25 + .25 + .25 + .25) = 5(25) / 1 = 125.
A confidence interval for the contrast is ψ̂ ± t_crit·s_ψ, where
s_ψ = √(MSE·Σ(c_j² / n_j)). With equal sample sizes, this simplifies to
s_ψ = √(MSE·Σc_j² / n).
When one is constructing multiple confidence intervals, one can use Bonferroni to
adjust the per contrast alpha. Such intervals have been called simultaneous or joint
confidence intervals. For the contrast above, s_ψ = √(.5 / 5) = .3162. With no
adjustment of the per-comparison alpha, and df = 16, a 95% confidence interval is
5 ± 2.12(.3162), which extends from 4.33 to 5.67.
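The contrast value, its standard error, and the unadjusted interval can be checked with a short Python sketch (illustrative only; not part of the handout's SAS workflow):

```python
from math import sqrt

# AB-vs-CD contrast: means 2, 3, 7, 8; MSE = .5; n = 5 per group;
# two-tailed t critical on 16 df is 2.12.
means = [2, 3, 7, 8]
coefs = [-0.5, -0.5, 0.5, 0.5]
n, mse, t_crit = 5, 0.5, 2.12

psi = sum(c * m for c, m in zip(coefs, means))    # contrast value
se = sqrt(mse * sum(c ** 2 for c in coefs) / n)   # standard error of contrast
lo, hi = psi - t_crit * se, psi + t_crit * se
print(round(psi, 2), round(se, 4), round(lo, 2), round(hi, 2))
# 5.0 0.3162 4.33 5.67
```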
A population standardized contrast, d = ψ / σ, can be estimated by ψ̂ / s,
where s is the standard deviation of just one of the groups being compared (Glass's Δ),
the pooled standard deviation of the two groups being compared (Hedges' g), or the
pooled standard deviation of all of the groups (the square root of the MSE). For the
contrast above, g = 5 / √.5 = 7.07, a whopper effect.
SAS and other statistical software can be used to obtain the F for a specified
contrast. Having obtained a contrast F from your computer program, you can compute
g = √(F·Σ(c_j² / n_j)). For our contrast, g = √(250(.25 + .25 + .25 + .25) / 5) = 7.07.
An approximate confidence interval for a standardized contrast d can be
computed simply by taking the confidence interval for the contrast and dividing its
endpoints by the pooled standard deviation (square root of MSE). In this case the
confidence interval amounts to g ± t_crit·s_g, where s_g = √(Σ(c_j² / n_j)). For our
contrast, s_g = √((.25 + .25 + .25 + .25) / 5) = .447, and a 95% confidence interval is
7.07 ± 2.12(.447), running from 6.12 to 8.02. More simply, we take the unstandardized
confidence interval, which runs from 4.33 to 5.67, and divide each end by the standard
deviation (.707) and obtain 6.12 to 8.02.
At http://www.psy.unsw.edu.au/research/PSY.htm one can obtain PSY: A
Program for Contrast Analysis, by Kevin Bird, Dusan Hadzi-Pavlovic, and Andrew
Isaac. This program computes unstandardized and approximate standardized
confidence intervals for contrasts with between-subjects and/or within-subjects factors.
It will also compute simultaneous confidence intervals. Contrast coefficients are
provided as integers, and the program converts them to standard weights. For an
example of the use of the PSY program, see my document PSY: A Program for
Contrast Analysis.
An exact confidence interval for a standardized contrast involving
independent samples can be computed with my SAS program Conf_Interval-
Contrast.sas. Enter the contrast t (the square root of the contrast F, 15.81 for our
contrast), the df (16), the sample sizes (5, 5, 5, 5), and the standard contrast
coefficients (-.5, -.5, .5, .5) and run the program. You obtain a confidence interval that
extends from 4.48 to 9.64. Notice that this confidence interval is considerably wider
than that obtained by the earlier approximation.
One can also use η² or partial η² as a measure of the strength of a contrast,
and use my program Conf-Interval-R2-Regr.sas to construct a CI. For η², simply take
the SS_contrast and divide by the SS_Total. For our contrast, that yields
η² = 125/138 = .9058. To get the confidence interval for η² we need to compute a
modified contrast F, adding to the error term all variance not included in the contrast
and all degrees of freedom not included in the contrast.
Source SS df MS F
Teaching Method 130 3 43.33 86.66
AB vs. CD 125 1 125 250
Error 8+5=13 16+2=18 13/18=.722 173.077
Total 138 19
F(1, 18) = SS_contrast / [(SS_Total - SS_contrast) / (df_Total - df_contrast)]
= 125 / [(138 - 125) / (19 - 1)] = 125 / (13/18) = 173.077.
Feed that F and df to my SAS program and you obtain an η² of .9058 with a
confidence interval that extends from .78 to .94.
Alternatively, one can compute a partial η² as
SS_Contrast / (SS_Contrast + SS_Error) = 125 / (125 + 8) = .93985. Notice that this
excludes from the denominator all variance that is explained by differences among the
groups that are not captured by the tested contrast.
Source SS df MS F
Teaching Method 130 3 43.33 86.66
AB vs. CD 125 1 125 250
Error 8 16 0.50
Total 138 19
For partial η², enter the contrast F(1, 16) = 250 into my program and you obtain
partial η² = .93985 with a confidence interval extending from .85 to .96.
Orthogonal Contrasts
One can construct k - 1 orthogonal (independent) contrasts involving k means. If I
consider a_j to represent the contrast coefficients applied for one contrast and b_j those
for another, for the contrasts to be orthogonal it must be true that Σ(a_j·b_j / n_j) = 0.
If you have equal sample sizes, this simplifies to Σa_j·b_j = 0. Consider the following
set of contrast coefficients involving groups A, B, C, D, and E and equal sample sizes.
A B C D E
+.5 +.5 -1/3 -1/3 -1/3
+1 -1 0 0 0
0 0 +1 -.5 -.5
0 0 0 +1 -1
If we computed a SS for each of these contrasts and summed those SS, the sum
would equal the treatment SS which would be obtained in an omnibus ANOVA on the k
groups. This is beautiful, but not necessarily practical. The comparisons you make
should be meaningful, whether or not they form an orthogonal set.
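Checking a set of coefficients for mutual orthogonality is mechanical; here is an illustrative Python sketch applied to the four contrasts above (equal n assumed):

```python
from itertools import combinations

def orthogonal(a, b, ns=None, tol=1e-9):
    # Two contrasts are orthogonal when sum(a_j * b_j / n_j) = 0;
    # with equal n this reduces to sum(a_j * b_j) = 0.
    if ns is None:
        ns = [1] * len(a)
    return abs(sum(x * y / n for x, y, n in zip(a, b, ns))) < tol

sets = [[0.5, 0.5, -1/3, -1/3, -1/3],
        [1, -1, 0, 0, 0],
        [0, 0, 1, -0.5, -0.5],
        [0, 0, 0, 1, -1]]
print(all(orthogonal(a, b) for a, b in combinations(sets, 2)))  # True
```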
Studentized Range Procedures
There are a number of procedures available to make a posteriori (post hoc,
unplanned) multiple comparisons. When one will compare each group mean with each
other group mean, k(k - 1)/2 comparisons, one widely used procedure is the Student-
Newman-Keuls procedure. As is generally the case, this procedure adjusts downwards
the per comparison alpha to keep the alpha familywise at a specified value. It is a
layer technique, adjusting alpha downwards more when comparing extremely different
means than when comparing closer means, thus correcting for the tendency to
capitalize on chance by comparing extreme means, yet making it somewhat easier
(compared to non-layer techniques) to get significance when comparing less extreme
means.
To conduct a Student-Newman-Keuls (SNK) analysis:
a. Put the means in ascending order of magnitude.
b. r is the number of means spanned by a given comparison.
c. Start with the most extreme means (the lowest vs. the highest), where r = k.
d. Compute q with this formula: q = (M_i - M_j) / √(MSE / n), assuming equal
sample sizes and homogeneity of variance. MSE is the error mean square from an
overall ANOVA on the k groups. Do note that the SNK, and multiple comparison tests
in general, were developed as an alternative to the omnibus ANOVA. One is not
required to do the ANOVA first, and if one does do the ANOVA first it does not need
to be significant for one to do the SNK or most other multiple comparison procedures.
e. If the computed q equals or exceeds the tabled critical value for the studentized
range statistic, q(r, df), the two means compared are significantly different; you move
on to step g. The df is the df for MSE.
f. If q was not significant, stop and, if you have done an omnibus ANOVA and it
was significant, conclude that only the extreme means differ from one another.
g. If the q on the outermost layer is significant, next test the two pairs of means
spanning (k - 1) means. Note that r, and, thus the critical value of q, has decreased.
From this point on, underline all pairs not significantly different from one another, and
do not test any other pairs whose means are both underlined by the same line.
h. If there remain any pairs to test, move down to the next layer, etc. etc.
i. Any means not underlined by the same line are significantly different from one
another.
For our sample problem, which was presented in the previous handout, One-Way
Independent Samples Analysis of Variance, with alpha at .01:
1.
Group A B C D
Mean 2 3 7 8
2. n = 5; denominator = √(MSE / n) = √(.5 / 5) = 0.316
3. A vs D: r = 4, df = 16, q = (8 - 2) / .316 = 18.99, p < .01
4. r = 3
a. A vs C: q = (7 - 2) / .316 = 15.82, p < .01
b. B vs D: q = (8 - 3) / .316 = 15.82, p < .01
5. r = 2, q.01 = 4.13
a. A vs B: q = (3 - 2) / .316 = 3.16, p > .01
b. B vs C: q = (7 - 3) / .316 = 12.66, p < .01
c. C vs D: q = (8 - 7) / .316 = 3.16, p > .01
6.
Group A B C D
Mean 2 3 7 8
Means for A and B share an underline, as do means for C and D; means not
sharing an underline differ significantly.
What if there are unequal sample sizes? One solution is to use the harmonic
mean sample size computed across all k groups. That is, ñ = k / Σ(1/n_j). Another
solution is to compute for each comparison made the harmonic mean sample size of
the two groups involved in that comparison, that is, ñ = 2 / (1/n_i + 1/n_j). With the
first solution the effect of n (bigger n, more power) is spread out across groups. With
the latter solution comparisons involving groups with larger sample sizes will have
more power than those with smaller sample sizes.
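Both harmonic-mean solutions reduce to the same function; an illustrative Python sketch (the sample sizes 10, 20, 30, 40 below are hypothetical, not from the handout's example):

```python
def harmonic_n(*ns):
    # Harmonic mean sample size: the number of groups divided by the
    # sum of the reciprocal n's.
    return len(ns) / sum(1 / n for n in ns)

# Across all k groups, or for just the two groups in one comparison
# (hypothetical n's of 10, 20, 30, 40):
print(round(harmonic_n(10, 20, 30, 40), 2), round(harmonic_n(10, 40), 2))
# 19.2 16.0
```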
If you have disparate variances, you should compute a q that is very similar to the
separate variances t-test earlier studied. The formula is:
q = (M_i - M_j) / √((s_i²/n_i + s_j²/n_j) / 2) = t·√2,
where t is the familiar separate variances t-test. This procedure is known as the
Games and Howell procedure.
When using this unpooled variances q one should also adjust the degrees of
freedom downwards exactly as done with Satterthwaite's solution previously discussed.
Consult our textbook for details on how the SNK can have a familywise alpha that is
greatly inflated if the omnibus null hypothesis is only partly true.
Relationship Between q And Other Test Statistics
The studentized range statistic is closely related to t and to F. If one computes the
pooled-across-k-groups t, as done with Fisher's LSD, then q = t·√2. If one computes
an F from a planned comparison, then q = √(2F). For example, for the A vs C
comparison with our sample problem: t = (7 - 2) / √(2(.5)/5) = 5 / .447 = 11.18, and
q = 11.18·√2 = 15.82.
8
The contrast coefficients to compare A with C would be .5, 0, -.5, 0.
The contrast SS_ψ = n(Σa_j·M_j)² / Σa_j²
= 5(.5(2) + 0(3) - .5(7) + 0(8))² / (.25 + 0 + .25 + 0) = 5(6.25) / .5 = 62.5.
F = 62.5 / .5 = 125, and q = √(2(125)) = 15.82.
Tukey's (a) Honestly Significant Difference Test
This test is applied in exactly the same way that the Student-Newman-Keuls is, with
the exception that r is set at k for all comparisons. This test is more conservative (less
powerful) than the Student-Newman-Keuls.
Tukey's (b) Wholly Significant Difference Test
This test is a compromise between the Tukey (a) and the Newman-Keuls. For each
comparison, the critical q is set at the mean of the critical q were a Tukey (a) being
done and the critical q were a Newman-Keuls being done.
Ryan's Procedure (REGWQ)
This procedure, the Ryan / Einot and Gabriel / Welsch procedure, is based on the q
statistic, but adjusts the per comparison alpha in such a way (Howell provides details in
our text book) that the familywise error rate is maintained at the specified value (unlike
with the SNK) but power will be greater than with the Tukey (a). I recommend its use
with four or more groups. With three groups the REGWQ is identical to the SNK, and,
as you know, I recommend Fisher's procedure when you have three groups. With four
or more groups I recommend the REGWQ, but you can't do it by hand; you need a
computer (SAS and SPSS will do it).
Other Procedures
Dunn's Test (The Bonferroni t)
Since the familywise error rate is always less than or equal to the error rate per
experiment, α_fw ≤ c·α_pc, an inequality known as the Bonferroni inequality, one can be
sure that alpha familywise does not exceed some desired maximum value by using an
adjusted alpha per comparison that equals the desired maximum alpha familywise
divided by c, that is, α_pc = α_fw / c.
g = (M_i - M_j) / s_pooled, a very large effect.
An easier way to get the pooled standard deviation is to conduct an ANOVA
relating the test variable to the grouping variable. Here is SPSS output from such an
analysis:
Basic Concepts
We have already studied the one-way independent-samples ANOVA, which is used
when we have one categorical independent variable and one continuous dependent
variable. Research designs with more than one independent variable are much more
interesting than those with only one independent variable. When we have two categorical
independent variables (with nonexperimental research, these are better referred to as
factors, predictors, grouping variables, or classification variables), and one continuous
dependent variable (with nonexperimental research these are better referred to as criterion
variables, outcome variables, or response variables), with all combinations of levels of the first
independent variable with levels of the second independent variable (or factor) consider the
following design: we have measures of drunkenness for each of four groupsparticipants
given neither alcohol nor a barbiturate, participants given a vodka screwdriver but no
barbiturate, participants given a barbiturate tablet but not alcohol, and participants given both
alcohol and the barbiturate. We have a 2 x 2 factorial design, factor A being dose of alcohol
and factor B being dose of barbiturate. Suppose that our participants were some green alien
creatures that showed up at our party last week, and that we obtained the following means:
Alcohol
Barbiturate none one marginal
none 00 10 05
one 20 30 25
marginal 10 20 15
The 2 x 2 = 4 group means (0, 10, 20, 30) are called cell means. I can average cell
means to obtain marginal means, which reflect the effect of one factor ignoring the other
factor. For example, for factor A, ignoring B, participants who drank no alcohol averaged
(0 + 20) / 2 = 10 on our drunkenness scale; those who did drink averaged (10 + 30) / 2 = 20.
From such marginal means one can compute the main effect of a factor, its effect ignoring the
other factor. For factor A, that main effect is (20 - 10) = 10: participants who drank alcohol
averaged 10 units more drunk than those who didn't. For factor B, the main effect is
(25 - 5) = 20: the barbiturate tablet produced 20 units of drunkenness, on the average.
A simple main effect is the effect of one factor at a specified level of the other factor.
For example, the simple main effect of the vodka screwdriver for participants who took no
barbiturate is (10 - 0) = 10. For participants who did take a barbiturate, taking alcohol also
made them (30 - 20) = 10 units more drunk. In this case, the simple main effect of A at level 1
The dependent variable is what I have called the Executive Male Attitude. High
scores on this variable indicate that the respondent endorses statements such as
husbands should make all of the important decisions; a wife should do whatever her
husband wants; and only the husband should decide about major purchases.
------------------------------------------------------------------------------------------------------------
Additive Effect of Gender and Culture on Executive Male Attitude
The first graph illustrates hypothetical results in which men endorse this attitude more
than do women and in which the attitude is stronger in one culture than in the other.
Notice that there are main effects of both gender and culture, but no interaction.
------------------------------------------------------------------------------------------------------------
Interactive Effect of Gender and Culture on Executive Male Attitude
Faculty in ECU's Department of Psychology (Rosina Chia, John Childers, and
myself) have actually conducted research on this topic. Here is a graph illustrating the
actual effects found when comparing students here at ECU with university students in
Taiwan. Note the striking interaction -- while our male students are more likely to
endorse this attitude than are our female students, in Taiwan it is the female students
who strongly endorse the traditional Executive Male Attitude!
We shall test the hypotheses in factorial ANOVA in essentially the same way we
tested the one hypothesis in a one-way ANOVA. I shall assume that our samples are
strictly independent, not correlated with each other. The total sum of squares in the
dependent/outcome variable will be partitioned into two sources, a Cells or Model SS
[our model is Y = effect of level of A + effect of level of B + effect of interaction + error +
grand mean] and an Error SS.
The SS_Cells reflects the effect of the combination of the two grouping variables.
The SS_error reflects the effects of all else. If the cell sample sizes are all equal, then we
can simply partition the SS_Cells into three orthogonal (independent, nonoverlapping)
parts: SS_A, the effect of grouping variable A ignoring grouping variable B; SS_B, the
effect of B ignoring A; and SS_AxB, the interaction.
If the cell sample sizes are not equal, the design is nonorthogonal, that is, the
factors are correlated with one another, and the three effects (A, B, and A x B) overlap
with one another. Such overlap is a thorny problem which we shall avoid for now by
insisting on having equal cell n's. If you have unequal cell n's you may consider
randomly discarding a few scores to obtain equal n's or you may need to learn about
the statistical procedures available for dealing with such nonorthogonal designs.
Suppose that we are investigating the effects of gender and smoking history
upon the ability to smell an odourant thought to be involved in sexual responsiveness.
One grouping variable is the participant's gender, male or female. The other grouping
variable is the participant's smoking history: never smoked, smoked 2 packs a day for
10 years but have now stopped, stopped less than 1 month ago, between 1 month and
2 years ago, between 2 and 7 years ago, or between 7 and 12 years ago. Suppose we
have 10 participants in each cell, and we obtain the following cell and marginal totals
(with some means in parentheses):
SMOKING HISTORY
GENDER never < 1m 1 m - 2 y 2 y - 7 y 7 y - 12 y marginal
Male 300 200 220 250 280 1,250 (25)
Female 600 300 350 450 500 2,200 (44)
marginal 900 (45) 500 (25) 570 (28.5) 700 (35) 780 (39) 3,450
SS_Gender = (1,250² + 2,200²) / 50 - 3,450² / 100 = 128,050 - 119,025 = 9,025
SS_Smoke = (900² + 500² + 570² + 700² + 780²) / 20 - 119,025 = 124,165 - 119,025 = 5,140
Since the SS_Cells reflects the combined effect of Gender and Smoking History,
both their main effects and their interaction, we can compute the SS_interaction as a
residual: SS_interaction = SS_Cells - SS_Gender - SS_Smoke
= 15,405 - 9,025 - 5,140 = 1,240.
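The whole partitioning follows from the cell totals in the table above; an illustrative Python sketch of the arithmetic (the handout works it by hand):

```python
# Sums of squares from cell totals, 10 scores per cell (smell study).
male = [300, 200, 220, 250, 280]      # cell totals for men, by smoking level
female = [600, 300, 350, 450, 500]    # cell totals for women
n, N = 10, 100
grand = sum(male) + sum(female)       # 3,450
cm = grand ** 2 / N                   # correction for the mean, 119,025

ss_cells = sum(t ** 2 for t in male + female) / n - cm
ss_gender = (sum(male) ** 2 + sum(female) ** 2) / 50 - cm
ss_smoke = sum((m + f) ** 2 for m, f in zip(male, female)) / 20 - cm
ss_interaction = ss_cells - ss_gender - ss_smoke
print(ss_cells, ss_gender, ss_smoke, ss_interaction)
# 15405.0 9025.0 5140.0 1240.0
```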
As in the one-way design, total df = N - 1, and main effects df = number of levels
minus one, that is, (a - 1) and (b - 1). Interaction df is the product of main effect df,
(a - 1)(b - 1). Error df = (a)(b)(n - 1).
Mean squares are SS / df, and F ratios are obtained by dividing effect mean squares
by the error MS. Results are summarized in this source table:
Source SS df MS F p
A-gender 9025 1 9025 75.84 <.001
B-smoking history 5140 4 1285 10.80 <.001
AxB interaction 1240 4 310 2.61 .041
Error 10,710 90 119
Total 26,115 99
Analysis of Simple Main Effects
The finding of a significant interaction is often followed by testing of the simple
main effects of one factor at each level of the other. Let us first compare the two
genders at each level of smoking history, following our general rule for the
computation of effect sums of squares (note that each simple effect has its own CM):
Simple Main Effect of Gender at Each Level of Smoking History
SS_Gender for never smokers = (300² + 600²) / 10 - 900² / 20 = 4,500
SS_Gender, stopped < 1 m = (200² + 300²) / 10 - 500² / 20 = 500
SS_Gender, stopped 1 m - 2 y = (220² + 350²) / 10 - 570² / 20 = 845
SS_Gender, stopped 2 y - 7 y = (250² + 450²) / 10 - 700² / 20 = 2,000
SS_Gender, stopped 7 y - 12 y = (280² + 500²) / 10 - 780² / 20 = 2,420
Please note that the sum of the simple main effects SS for A (across levels of B)
will always equal the sum of SS_A and the SS_interaction: 4,500 + 500 + 845 + 2,000 + 2,420
= 10,265 = 9,025 + 1,240. Since gender is a two-level factor, each of these simple main
effects has 1 df, so MS = SS. For each we compute an F on 1, 90 df by dividing by the
MSE from the overall ANOVA:
Smoking History
never < 1m 1 m - 2 y 2 y - 7 y 7 y - 12 y
F(1, 90) 37.82 4.20 7.10 16.81 20.34
p <.001 .043 .009 <.001 <.001
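The simple-effect SS and F values can be reproduced from the cell totals; an illustrative Python sketch:

```python
# Simple main effect of Gender at each level of smoking history, tested
# against the pooled error term (MSE = 119, 90 df) from the overall ANOVA.
male = [300, 200, 220, 250, 280]      # cell totals, 10 scores per cell
female = [600, 300, 350, 450, 500]
mse, n = 119, 10

results = []
for m, f in zip(male, female):
    ss = (m ** 2 + f ** 2) / n - (m + f) ** 2 / (2 * n)  # SS = MS (1 df)
    results.append((round(ss), round(ss / mse, 2)))      # (SS, F)
print(results)
# [(4500, 37.82), (500, 4.2), (845, 7.1), (2000, 16.81), (2420, 20.34)]
```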
Some recommend using a MSE computed using only the scores involved in the
simple main effect being tested, that is, using individual error terms for each simple
main effect. This is especially recommended when the assumption of homogeneity of
variance is suspect. When I contrived these data I did so in such a way that there is
absolute homogeneity of variance: each cell has a variance of 119, so I used the MSE
from the overall ANOVA, the pooled error term.
The results indicate that the gender difference is significant at the .05 level at
every level of smoking history, but the difference is clearly greater at levels 1, 4, and 5
(those who have never smoked or quit over 2 years ago) than at levels 2 and 3 (recent
smokers).
Simple Main Effect of Smoking History at Each Level of Gender
Is smoking history significantly related to olfactory ability within each gender? Let
us test the simple main effects of smoking history at each level of gender:
SS_Smoking history for men = (300² + 200² + 220² + 250² + 280²) / 10 - 1,250² / 50 = 680
SS_Smoking history for women = (600² + 300² + 350² + 450² + 500²) / 10 - 2,200² / 50 = 5,700
Note that SS_B at A1 + SS_B at A2 = 680 + 5,700 = 6,380 = SS_B + SS_AxB
= 5,140 + 1,240. Since B has 5 levels, each of these simple main effects has 4 df, so the mean
squares are 680 / 4 = 170 for men, 5,700 / 4 = 1,425 for women. Smoking history has a
significant simple main effect for women, F(4, 90) = 11.97, p < .001, but not for men,
F(4, 90) = 1.43, p = .23.
Multiple comparisons
Since smoking history had a significant simple main effect for the women, we
might want to make some comparisons involving the five means in that simple main
effect. Rather than make all possible (10) pairwise comparisons, I elect to make only 4
comparisons: the never smoked group versus each other group. Although there is a
special procedure for the case where one (control) group is compared to each other
group, the Dunnett test, I shall use the Bonferroni t test instead. Holding familywise
alpha at .05 or less, my criterion to reject each null hypothesis becomes
α_pc = .05 / c = .05 / 4 = .0125. I will need the help of SAS to get the exact p values.
Here is a little SAS program that will obtain the p values for the t scores below:
options formdlim='-' pageno=min nodate;
data p;
T1 = 2*PROBT(-6.149, 90); T2 = 2*PROBT(-5.125, 90); T3 = 2*PROBT(-3.075, 90);
T4 = 2*PROBT(-2.050, 90);
proc print; run;
Notice that I entered each t score as a negative value, and then gave the df. Since
PROBT returns a one-tailed p, I multiplied by 2. The output from SAS is:
Obs T1 T2 T3 T4
1 2.1036E-8 .000001688 .002787218 0.043274
The denominator for each t will be √(119(1/10 + 1/10)) = 4.8785. The computed t
scores and p values are then:
Never Smoked vs Quit t(90) = (M_i - M_j) / 4.8785 p Significant?
< 1 m (60 - 30) / 4.8785 = 6.149 < .001 yes
1 m - 2 y (60 - 35) / 4.8785 = 5.125 < .001 yes
2 y - 7 y (60 - 45) / 4.8785 = 3.075 .0028 yes
7 y - 12 y (60 - 50) / 4.8785 = 2.050 .0433 no
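The t statistics themselves (though not the exact p values, which the handout gets from SAS's PROBT) can be checked with an illustrative Python sketch; each |t| is then judged against the Bonferroni criterion α_pc = .0125:

```python
from math import sqrt

# t for never-smoked women (M = 60) versus each ex-smoker group,
# using the pooled MSE = 119 and n = 10 per cell.
mse, n = 119, 10
se = sqrt(mse * (1 / n + 1 / n))     # common denominator, 4.8785
control = 60
others = [30, 35, 45, 50]            # means for < 1m, 1m-2y, 2y-7y, 7y-12y
ts = [round((control - m) / se, 2) for m in others]
print(round(se, 4), ts)
# 4.8785 [6.15, 5.12, 3.07, 2.05]
```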
As you can see, female ex-smokers' olfactory ability was significantly less than
that of women who never smoked for every group except the group that had stopped
smoking 7 to 12 years ago.
If the interaction were not significant (and sometimes, even if it were) we would
likely want to make multiple comparisons involving the marginal means of significant
factors with more than two levels. Let us do so for smoking history, again using the
Bonferroni t-test. I should note that, in actual practice, I would probably use the
REGWQ test. Since I shall be making ten comparisons, my adjusted per comparison
alpha will be, for a maximum familywise error rate of .05, .05/10 = .005. Again, I rely on
SAS to obtain the exact p values.
options formdlim='-' pageno=min nodate;
data p;
T12 = 2*PROBT(-5.80, 90); T13 = 2*PROBT(-4.78, 90); T14 = 2*PROBT(-2.90, 90);
T15 = 2*PROBT(-1.74, 90); T23 = 2*PROBT(-1.01, 90); T24 = 2*PROBT(-2.90, 90);
T25 = 2*PROBT(-4.06, 90); T34 = 2*PROBT(-1.88, 90); T35 = 2*PROBT(-3.04, 90);
T45 = 2*PROBT(-1.16, 90);
proc print; run;
Obs T12 T13 T14 T15 T23
1 9.7316E-8 .000006793 .004688757 0.085277 0.31520
Obs T24 T25 T34 T35 T45
1 .004688757 .000104499 0.063343 .003097811 0.24912
Level i vs j t Significant?
1 vs 2 (45 - 25) / √(119(1/20 + 1/20)) = 20 / 3.4496 = 5.80 yes
1 vs 3 (45 - 28.5) / 3.4496 = 4.78 yes
1 vs 4 (45 - 35) / 3.4496 = 2.90 yes
1 vs 5 (45 - 39) / 3.4496 = 1.74 no
2 vs 3 (28.5 - 25) / 3.4496 = 1.01 no
2 vs 4 (35 - 25) / 3.4496 = 2.90 yes
2 vs 5 (39 - 25) / 3.4496 = 4.06 yes
3 vs 4 (35 - 28.5) / 3.4496 = 1.88 no
3 vs 5 (39 - 28.5) / 3.4496 = 3.04 yes
4 vs 5 (39 - 35) / 3.4496 = 1.16 no
Note that the n's are 20 because 20 scores went into each mean.
Smoking History < 1 m 1 m - 2 y 2 y - 7 y 7 y - 12 y never
Mean 25^A 28.5^AB 35^BC 39^CD 45^D
Note. Means sharing a superscript are not significantly different from one another.
Contrasts in Factorial ANOVA
One can create contrasts in factorial ANOVA just as in one-way ANOVA. For
example, in a 2 x 2 ANOVA one contrast is that known as the main effect of the one
factor, another contrast is that known as the main effect of the other factor, and a third
contrast is that known as the interaction between the two factors. For effects (main or
interaction) with more than one df, the effect can be broken down into a set of
orthogonal one df contrasts.
The coefficients for an interaction contrast must be doubly centered in the sense
that the coefficients must sum to zero in every row and every column of the a x b matrix.
For example, consider a 2 x 2 ANOVA. The interaction has only one df, so there is only
one contrast available.
Coefficients Means
B1 B2 B1 B2
A1 1 -1 M11 M12
A2 -1 1 M21 M22
This contrast is M11 - M12 - M21 + M22. From one perspective, this contrast is the
combined cells on one diagonal (M11 + M22) versus the combined cells on the other
diagonal (M21 + M12). From another perspective, it is (M11 - M12) - (M21 - M22), that is,
the simple main effect of B at A1 versus the simple main effect of B at A2. From another
perspective it is (M11 - M21) - (M12 - M22), that is, the simple main effect of A at B1
versus the simple main effect of A at B2. All of this is illustrated in my program ANOVA-
Interact2x2.sas.
Now consider a 2 x 3 design. The interaction has two df and can be broken
down into two orthogonal interaction contrasts. For example, consider the contrast
coefficients in the table below:
        A x B, 12 vs 3            A x B, 1 vs 2
        B1     B2     B3          B1     B2     B3
A1       1      1     -2           1     -1      0
A2      -1     -1      2          -1      1      0
The contrast on the left side of the table compares the simple main effect of A at combined levels 1 and 2 of B with the simple main effect of A at level 3 of B. From another perspective, it compares the simple main effect of (combined B1 and B2) versus B3 at A1 with that same effect at A2. Put another way, it is the A x B interaction with levels 1 and 2 of B combined.
The contrast on the right side of the table compares the simple main effect of A at level 1 of B with the simple main effect of A at level 2 of B. From another perspective, it compares the simple main effect of B1 versus B2 (excluding level 3 of B) at A1 with that same effect at A2. Put another way, it is the A x B interaction with level 3 of B excluded.
If we had reason to want the coefficients on the left side of the table above to be
a standard set of weights, we would divide each by 2.
        A x B, 12 vs 3
        B1      B2      B3
A1      .5      .5      -1
A2     -.5     -.5       1
My program ANOVA-Interact2x3.sas illustrates the computation of these
interaction contrasts and more.
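The doubly-centered requirement and the application of an interaction contrast to cell means can be sketched in Python (the handout's own ANOVA-Interact2x3.sas does this in SAS). The cell means below are made up for illustration only.

```python
def contrast_value(coeffs, means):
    """Apply an a x b matrix of contrast coefficients to cell means.

    Verifies that the coefficients are doubly centered: they must sum
    to zero in every row and in every column.
    """
    for row in coeffs:
        assert abs(sum(row)) < 1e-9, "row does not sum to zero"
    for col in zip(*coeffs):
        assert abs(sum(col)) < 1e-9, "column does not sum to zero"
    return sum(c * m for crow, mrow in zip(coeffs, means)
               for c, m in zip(crow, mrow))

# the "A x B, 12 vs 3" contrast from the 2 x 3 table above
coeffs = [[ 1,  1, -2],
          [-1, -1,  2]]
# hypothetical cell means: rows = A1, A2; columns = B1, B2, B3
means = [[10, 12, 20],
         [11, 13, 15]]
print(contrast_value(coeffs, means))
```

A nonzero value indicates that the A x B interaction, with levels 1 and 2 of B combined, is present in these (hypothetical) means.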
Standardized Contrasts
As in one-way designs, one can compute standardized contrasts. Rex B. Kline
(Chapter 7 of Beyond Significance Testing, 2004, American Psychological Association)
notes that there is much disagreement regarding how to compute standardized
contrasts with data from a multifactorial design, and opines that
1. such estimates should be comparable to those that would be obtained from a one-
way design, and
2. changing the number of factors in the design should not necessarily change the
effect size estimates.
Adding factors to a design is, IMHO, not different from adding covariates. Should the additional variance explained by added factors be excluded from the denominator of the standardized contrast g? Imagine a 2 x 2 design, where A is type of therapy, B is sex of patient, and Y is post-treatment wellness. You want to compute g for the effect of type of therapy. The MSE excludes variance due to sex, but in the population of interest sex may naturally account for some of the variance in wellness, so using the root mean square error as the standardizer will underestimate the population standard deviation. It may be desirable to pool the SS_within-cells, SS_B, and SS_AxB to form an appropriate standardizer in a case like this. I'd just drop B and AxB from the model, run a one-way ANOVA, and use the root mean square error from that as the standardizer.
Kline argues that when a factor like sex is naturally variable in both the
population of interest and the sample then variance associated with it should be
included in the denominator of g. While I agree with this basic idea, I am not entirely
satisfied with it. Such a factor may be associated with more or less of the variance in
the sample than it is in the population of interest. In experimental research the
distribution of such a factor can be quite different in the experiment than it is in the
population of interest. For example, in the experiment there may be approximately
equal numbers of clients assigned to each of three therapies, but in the natural world
patients may be given the one therapy much more often than the others.
Now suppose that you are looking at the simple main effects of A (therapy) at
levels of B (sex). Should the standardizer be computed within-sex, in which case the
standardizer for men would differ from that for women, or should the standardizer be
pooled across sexes? Do you want each g to estimate d in a single-sex population, or
do you want a g for men that can be compared with the g for women without having to
consider the effect of the two estimators having different denominators?
Magnitude of Effect
Eta-squared and omega-squared can be computed for each effect in the model. With omega-squared, substitute the effect's df for the term (k-1) in the formula we used for the one-way design.
For the interaction, η² = 1,240/26,115 = .047, and ω² = (1,240 - 4(119)) / (26,115 + 119) = 764/26,234 = .029.
For Gender, η² = 9,025/26,115 = .346, and ω² = (9,025 - 1(119)) / 26,234 = 8,906/26,234 = .339.
For Smoking History, η² = 5,140/26,115 = .197, and ω² = (5,140 - 4(119)) / 26,234 = 4,664/26,234 = .178.
Gender clearly accounts for the greatest portion of the variance in ability to detect the scent, but smoking history also accounts for a great deal. Of course, were we to analyze the data from only the female participants, excluding the male participants (for whom the effect of smoking history was smaller and nonsignificant), the ω² for smoking history would be much larger.
Partial Eta-Squared. The value of η² for any one effect can be influenced by the number and magnitude of other effects in the model. For example, if we conducted our research on only women, the total variance in the criterion variable would be reduced by the elimination of the effects of gender and the Gender x Smoking interaction. If the effect of smoking remained unchanged, then the ratio SS_Smoking / SS_Total would be increased. One attempt to correct for this is to compute a partial eta-squared, partial η² = SS_Effect / (SS_Effect + SS_Error). In other words, the question answered by partial eta-squared is this: Of the variance that is not explained by other effects in the model, what proportion is explained by this effect?
For the interaction, partial η² = 1,240 / (1,240 + 10,710) = .104.
For gender, partial η² = 9,025 / (9,025 + 10,710) = .457.
For smoking history, partial η² = 5,140 / (5,140 + 10,710) = .324.
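The three effect-size measures follow directly from the sums of squares; here is a minimal Python sketch using the values from this example (SS_Total = 26,115, SS_Error = 10,710, MSE = 119).

```python
def eta_sq(ss_effect, ss_total):
    """Proportion of the total variance associated with the effect."""
    return ss_effect / ss_total

def partial_eta_sq(ss_effect, ss_error):
    """Proportion of (effect + error) variance associated with the effect."""
    return ss_effect / (ss_effect + ss_error)

def omega_sq(ss_effect, df_effect, mse, ss_total):
    """Less biased estimate; subtracts the df * MSE expected by chance."""
    return (ss_effect - df_effect * mse) / (ss_total + mse)

SS_TOTAL, SS_ERROR, MSE = 26_115, 10_710, 119
print(round(eta_sq(9_025, SS_TOTAL), 3))            # gender eta-squared
print(round(partial_eta_sq(9_025, SS_ERROR), 3))    # gender partial eta-squared
print(round(omega_sq(9_025, 1, MSE, SS_TOTAL), 3))  # gender omega-squared
```

Passing in the interaction or smoking-history SS and df reproduces the other values in the text.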
Notice that the partial eta-squared values are considerably larger than the eta-squared or omega-squared values. Clearly this statistic can be used to make a small effect look moderate in size or a moderate-sized effect look big. It is even possible to get partial eta-squared values that sum to greater than 1. That makes me a little uncomfortable. Even more discomforting, many researchers have incorrectly reported partial eta-squared as being regular eta-squared. Pierce, Block, and Aguinis (2004) found articles in prestigious psychological journals in which this error was made. Apparently the authors of these articles (which appeared in Psychological Science and other premier journals) were not disturbed by the fact that the values they reported indicated that they had accounted for more than 100% of the variance in the outcome variable; in one case, the authors claimed to have explained 204%. Oh my.
Confidence Intervals for η² and Partial η²
As was the case with one-way ANOVA, one can use my program Conf-Interval-R2-Regr.sas to put a confidence interval about eta-squared or partial eta-squared. You will, however, need to compute an adjusted F when putting the confidence interval on eta-squared, the F that would be obtained were all other effects excluded from the model. Note that I have computed 90% confidence intervals, not 95%. See this document.
Confidence Intervals for η². For the effect of gender, MSE = (SS_Total - SS_Gender) / (df_Total - df_Gender) = (26,115 - 9,025) / (99 - 1) = 174.39, and F = 9,025 / 174.39 = 51.752 on 1, 98 df. 90% CI [.22, .45].
For the effect of smoking, MSE = (26,115 - 5,140) / (99 - 4) = 220.79, and F = 1,285 / 220.79 = 5.82 on 4, 95 df. 90% CI [.005, .15].
For the effect of the interaction, MSE = (26,115 - 1,240) / (99 - 4) = 261.84, and F = 310 / 261.84 = 1.184 on 4, 95 df. 90% CI [.000, .17].
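The adjusted-F arithmetic (the F one would feed to Conf-Interval-R2-Regr.sas when putting a CI on η² rather than partial η²) can be sketched in Python with the SS values from this example.

```python
def adjusted_f(ss_effect, df_effect, ss_total, df_total):
    """F computed as if all other effects were excluded from the model.

    Returns (MS_effect, adjusted MSE, adjusted F); the denominator df
    is df_total - df_effect.
    """
    mse = (ss_total - ss_effect) / (df_total - df_effect)
    ms_effect = ss_effect / df_effect
    return ms_effect, mse, ms_effect / mse

ms, mse, f = adjusted_f(9_025, 1, 26_115, 99)   # gender
print(round(f, 2))                              # F on 1, 98 df
ms, mse, f = adjusted_f(5_140, 4, 26_115, 99)   # smoking history
print(round(f, 2))                              # F on 4, 95 df
```

Because the main effects' SS are folded back into the error term, the adjusted F for a small effect (like the interaction here) can be much smaller than the F from the full factorial model.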
Notice that the CI for the interaction includes zero, even though the interaction
was statistically significant. The reason for this is that the F testing the significance of
the interaction used a MSE that excluded variance due to the two main effects, but that
variance was included in the standardizer for our confidence interval.
Confidence Intervals for Partial η². If you give the program the unadjusted values for F and df, you get confidence intervals for partial eta-squared. Here they are for our data:
Gender: .33, .55
Smoking: .17, .41
Interaction: .002, .18 (notice that this CI does not include zero)
Eta-Squared or Partial Eta-Squared? Which one should you use? I am more
comfortable with eta-squared, but can imagine some situations where the use of partial
eta-squared might be justified. Kline (2004) has argued that when a factor like sex is
naturally variable in both the population of interest and the sample, then variance
associated with it should be included in the denominator of the strength of effect
estimator, but when a factor is experimental and not present in the population of
interest, then the variance associated with it may reasonably be excluded from the
denominator.
For example, suppose you are investigating the effect of experimentally
manipulated A (you create a lesion in the nucleus spurious of some subjects but not of
others) and subject characteristic B (sex of the subject). Experimental variable A does
not exist in the real-world population, subject characteristic B does. When estimating
the strength of effect of Experimental variable A, the effect of B should remain in the
denominator, but when estimating the strength of effect of subject characteristic B it
may be reasonable to exclude A (and the interaction) from the denominator.
To learn more about this controversial topic, read Chapter 7 in Kline (2004). You
can find my notes taken from this book on my Stat Help page, Beyond Significance
Testing.
Perhaps another example will help illustrate the difference between eta-squared and partial eta-squared. Here we have sums of squares for a two-way orthogonal ANOVA in which SS_A = SS_B = SS_AxB = SS_Error = 25. Eta-squared is η² = SS_Effect / (SS_A + SS_B + SS_AxB + SS_Error) = 25/100 = .25 for every effect. Eta-squared answers the question: of all the variance in Y, what proportion is (uniquely) associated with this effect?
Partial eta-squared is partial η² = SS_Effect / (SS_Effect + SS_Error) = 25/50 = .50 for every effect. Partial eta-squared answers the question: of the variance in Y that is not associated with any of the other effects in the model, what proportion is associated with this effect? Put another way, if all of the other effects were nil, what proportion of the variance in Y would be associated with this effect?
Notice that the values of eta-squared sum to .75, the full model eta-squared. The values of partial eta-squared sum to 1.5. Hot damn, we explained 150% of the variance!
Once you have covered multiple regression, you should compare the difference
between eta-squared and partial eta-squared with the difference between squared
semipartial correlation coefficients and squared partial correlation coefficients.
η² for Simple Main Effects. We have seen that the effect of smoking history is significant for the women, but how large is the effect among the women? From the cell sums and uncorrected sums of squares given on pages 1 and 2, one can compute the total sum of squares for the women; it is 11,055. We already computed the sum of squares for smoking history for the women, 5,700. Accordingly, η² = 5,700/11,055 = .52. Recall that this η² was only .20 for men and women together.
When using my Conf-Interval-R2-Regr.sas with simple effects one should use an F that was computed with an individual error term rather than with a pooled error term. If you use the data on pages 1 and 2 to conduct a one-way ANOVA for the effect of smoking history using only the data from the women, you obtain an F of 11.97 on (4, 45) degrees of freedom. Because there was absolute homogeneity of variance in my contrived data, this is the same value of F obtained with the pooled error term, but notice that the df are less than they were with the pooled error term. With this F and df my program gives you a 90% CI of [.29, .60]. For the men, η² = .11, 90% CI [0, .20].
Assumptions
The assumptions of the factorial ANOVA are essentially the same as those made
for the one-way design. We assume that in each of the a x b populations the dependent
variable is normally distributed and that variance in the dependent variable is constant
across populations.
Advantages of Factorial ANOVA
The advantages of the factorial design include:
1. Economy - if you wish to study the effects of 2 (or more) factors on the same
dependent variable, you can more efficiently use your participants by running them
in 1 factorial design than in 2 or more 1-way designs.
2. Power - if both factors A and B are going to be determinants of variance in your participants' dependent variable scores, then the factorial design should have a smaller error term (denominator of the F-ratio) than would a one-way ANOVA on just one factor. The variance due to B and due to AxB is removed from the MSE in the factorial design, which should increase the F for factor A (and thus increase power) relative to a one-way analysis where that B and AxB variance would be included in the error variance.
Consider the partitioning of the sums of squares illustrated to the right. SS_B = 15 and SSE = 85. Suppose there are two levels of B (an experimental manipulation) and a total of 20 cases. MS_B = 15, MSE = 85/18 = 4.722. The F(1, 18) = 15/4.722 = 3.176, p = .092. Woe to us, the effect of our experimental treatment has fallen short of statistical significance.
Now suppose that the subjects here consist of both men and women and that the
sexes differ on the dependent variable. Since sex is not included in the model,
variance due to sex is error variance, as is variance due to any interaction between
sex and the experimental treatment.
Let us see what happens if we include sex and the interaction in the model. SS_Sex = 25, SS_B = 15, SS_Sex*B = 10, and SSE = 50. Notice that the SSE has been reduced by removing from it the effects of sex and the interaction. The MS_B is still 15, but the MSE is now 50/16 = 3.125 and the F(1, 16) = 15/3.125 = 4.80, p = .044. Notice that excluding the variance due to sex and the interaction has reduced the error variance enough that now the main effect of the experimental treatment is significant.
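The arithmetic behind this power advantage can be sketched in Python, using the SS and df values from the illustration above.

```python
def f_ratio(ss_effect, df_effect, ss_error, df_error):
    """F for an effect given its SS and df and the error SS and df."""
    return (ss_effect / df_effect) / (ss_error / df_error)

# B tested against an error term that still contains sex and sex x B variance
f_one_way = f_ratio(15, 1, 85, 18)
print(round(f_one_way, 2))    # p = .092 in the text

# B tested after SS_Sex = 25 and SS_Sex*B = 10 are removed from the error term
f_factorial = f_ratio(15, 1, 50, 16)
print(round(f_factorial, 2))  # p = .044 in the text
```

The numerator is unchanged; only shrinking the MSE (at the cost of a few error df) raises the F.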
Of course, you could achieve the same reduction in error by simply holding the one factor constant in your experiment, for example, using only participants of one gender, but that would reduce your experiment's external validity (generalizability of effects across various levels of other variables). For example, if you used only male participants you would not know whether or not your effects generalize to female participants.
3. Interactions - if the effect of A does not generalize across levels of B, then including B in a factorial design allows you to study how A's effect varies across levels of B, that is, how A and B interact in jointly affecting the dependent variable.
Example of How to Write-Up the Results of a Factorial ANOVA
Results
Participants were given a test of their ability to detect the scent of a chemical
thought to have pheromonal properties in humans. Each participant had been classified
into one of five groups based on his or her smoking history. A 2 x 5, Gender x Smoking
History, ANOVA was employed, using a .05 criterion of statistical significance and a
MSE of 119 for all effects tested. There were significant main effects of gender, F(1, 90) = 75.84, p < .001, η² = .346, 90% CI [.22, .45], and smoking history, F(4, 90) = 10.80, p < .001, η² = .197, 90% CI [.005, .15], as well as a significant interaction between gender and smoking history, F(4, 90) = 2.61, p = .041, η² = .047, 90% CI [.00, .17]. As shown in Table 1, women were better able to detect this scent than were men,
and smoking reduced ability to detect the scent, with recovery of function being greater
the longer the period since the participant had last smoked.
Table 1. Mean ability to detect the scent.

                           Smoking History
Gender     < 1 m   1 m - 2 y   2 y - 7 y   7 y - 12 y   never   Marginal
Male        20        22          25           28         30       25
Female      30        35          45           50         60       44
Marginal    25        28          35           39         45
The significant interaction was further investigated with tests of the simple main
effect of smoking history. For the men, the effect of smoking history fell short of
statistical significance, F(4, 90) = 1.43, p = .23, η² = .113, 90% CI [.00, .20]. For the women, smoking history had a significant effect on ability to detect the scent, F(4, 90) = 11.97, p < .001, η² = .516, 90% CI [.29, .60]. This significant simple main effect was
followed by a set of four contrasts. Each group of female ex-smokers was compared
with the group of women who had never smoked. The Bonferroni inequality was
employed to cap the familywise error rate at .05 for this family of four comparisons. It
was found that the women who had never smoked had a significantly better ability to
detect the scent than did women who had quit smoking one month to seven years
earlier, but the difference between those who never smoked and those who had
stopped smoking more than seven years ago was too small to be statistically significant.
Please note that you could include your ANOVA statistics in a source table
(referring to it in the text of your results section) rather than presenting them as I have
done above. Also, you might find it useful to present the cell means in an interaction
plot rather than in a table of means. I have presented such an interaction plot below.
[Interaction plot: Mean Ability to Detect the Scent (y-axis, 10 to 60) by Smoking History (x-axis: < 1 m, 1 m - 2 y, 2 y - 7 y, 7 y - 12 y, never), with separate lines for Female and Male participants.]
Reference
Pierce, C. A., Block, R. A., & Aguinis, H. (2004). Cautionary note on reporting eta-squared values from multifactor designs. Educational and Psychological Measurement, 64, 916-924.
Copyright 2011, Karl L. Wuensch - All rights reserved.
Triv-Int.doc
Main Effects That Participate in Significant but Trivial Interactions
Some persons opine that one should never interpret a main effect when it
participates in a significant interaction. I disagree. One may have good reasons to
ignore an interaction. For example, the interaction may be statistically significant but of
trivial magnitude. One of my graduate students investigated how the degree of altruism
shown in social interactions is a function of the degree of kinship between the parties
involved in the interaction. He had reason to believe that there are cultural differences
in the relationship between degree of kinship and amount of altruism shown. His
primary analysis was a Culture x Kinship ANOVA. The kinship variable was a within-
subjects variable, and the sample sizes were large, so he had so much power that even
trivial effects could be detected and labeled statistically significant. The student was
overjoyed when both the main effect of degree of kinship and the interaction between
kinship and culture were significant beyond the .001 level. But look at the results that I
have plotted below (after reflecting the cell means -- originally, high scores indicated low
altruism, but that makes things confusing). From the plot, it is pretty clear to me that the
level of altruism increases with degree of kinship in both cultures, and that the shape of
this function in the Americans is not much different than it is in the Chinese. I asked the
student to report a magnitude of effect estimate for each effect. He chose to employ
η². For the main effect of degree of kinship, the η² was an enormous .75. For the interaction it was a trivial .004. Significant or not, I argue that the interaction is so small in magnitude that it can, even should be, ignored.
[Plot: mean altruism (y-axis, 10 to 260) by relationship to the other party (enemy, stranger, townsman, 2nd cousin, 1st kin), with separate lines for American and Chinese participants.]
CM = (ΣY)²/N = 11,250²/90 = 1,406,250.
SS_Cells = Σ(T_ij²/n_ij) - CM = 1550²/10 + 2200²/20 + 2700²/20 + 4800²/40 - CM = 1,422,750 - 1,406,250 = 16,500.
SS_School = Σ(T_i²/n_i) - CM = (1550 + 2200)²/(10 + 20) + (2700 + 4800)²/(20 + 40) - CM = 0.
SS_Gender = Σ(T_j²/n_j) - CM = (1550 + 2700)²/(10 + 20) + (2200 + 4800)²/(20 + 40) - CM = 12,500.
SS_School x Gender = SS_Cells - SS_School - SS_Gender = 16,500 - 0 - 12,500 = 4,000.
SS_error = SS_TOT - SS_Cells = 81,000 - 16,500 = 64,500.
Source SS df MS F p
School 0 1 0 0.0 1.000
Gender 12500 1 12500 16.6 < .001
Interaction 4000 1 4000 5.3 .024
Error 64500 86 750
Total 81000 89
Interaction Analysis:
SS_Gender at School 1 = 1550²/10 + 2200²/20 - (1550 + 2200)²/(10 + 20) = 13,500.
F(1, 86) = 13500 / 750 = 18, p < .001.
SS_Gender at School 2 = 2700²/20 + 4800²/40 - (2700 + 4800)²/(20 + 40) = 3,000.
F(1, 86) = 3000 / 750 = 4, p = .049.
Significant gender effects at both schools, but a greater difference between male
students and female students at School 1 than at School 2.
------------------------------------ OR -------------------------------------
SS_School for male students = 1550²/10 + 2700²/20 - (1550 + 2700)²/(10 + 20) = 2,666.6.
F(1, 86) = 2666.6 / 750 = 3.5, p = .06.
SS_School for female students = 2200²/20 + 4800²/40 - (2200 + 4800)²/(20 + 40) = 1,333.3.
F(1, 86) = 1333.3 / 750 = 1.7, p = .19.
Nonsignificant school differences for each gender, but trends in opposite
directions [Sch 1 > Sch 2 for male students, Sch 1 < Sch 2 for female students]
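The simple-main-effect sums of squares above all follow one pattern, which can be sketched in Python from the cell totals and ns given earlier.

```python
def simple_effect_ss(cells):
    """SS for a simple main effect from (cell total, n) pairs.

    E.g., gender at School 1 uses the male and female cells of School 1:
    sum of T^2/n over the cells, minus (grand total)^2 / (total n).
    """
    grand_total = sum(total for total, n in cells)
    grand_n = sum(n for total, n in cells)
    return sum(total ** 2 / n for total, n in cells) - grand_total ** 2 / grand_n

print(simple_effect_ss([(1550, 10), (2200, 20)]))   # gender at School 1
print(simple_effect_ss([(2700, 20), (4800, 40)]))   # gender at School 2
```

Dividing each SS by the pooled MSE of 750 reproduces the F values in the text.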
Traditional Unweighted Means ANOVA
One simple way to weight the cell means equally involves using the harmonic mean. In this case we compute Ñ = k / Σ(1/n_i).
For the data set Int (School x Gender), retain the previous sums and ns.
Ñ = 4 / (1/10 + 1/20 + 1/20 + 1/40) = 17.7.
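The harmonic mean sample size can be sketched in Python with the four cell ns from this data set.

```python
def harmonic_n(ns):
    """Harmonic mean sample size, N-tilde = k / sum(1/n_i)."""
    return len(ns) / sum(1.0 / n for n in ns)

print(round(harmonic_n([10, 20, 20, 40]), 2))   # approx 17.78
```

Note that the harmonic mean (17.78) is well below the arithmetic mean of the cell ns (22.5); it down-weights the influence of the large cells.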
We now adjust cell totals by multiplying cell means (M) by the harmonic sample size: Adjusted cell total = Ñ M.

                 Male      Female    Marginal Total
School 1         2755.5    1955.5    4711.1
School 2         2400      2133.3    4533.3
Marginal Total   5155.5    4088.8    9244.4
CM = (ΣX)² / (Ñ × #cells) = 9,244.4² / (4 × 17.7) = 1,201,777.7.
SS_School = Σ T_i²/(2Ñ) - CM = (4711.1² + 4533.3²) / (2 × 17.7) - CM = 444.4.
SS_Gender = Σ T_j²/(2Ñ) - CM = (5155.5² + 4088.8²) / (2 × 17.7) - CM = 16,000.
SS_Cells = Σ T_ij²/Ñ - CM = (2755.5² + 1955.5² + 2400² + 2133.3²) / 17.7 - CM = 20,444.4.
SS_School x Gender = SS_Cells - SS_School - SS_Gender = 20,444.4 - 444.4 - 16,000 = 4,000.
To find the SSE, find for each cell SS = ΣX² - (ΣX)²/n and then sum these across cells.
Assume the below cell sums and ns.

         School 1              School 2
         Male      Female      Male      Female
ΣX       1,550     2,200       2,700     4,800
ΣX²      248,000   256,000     379,000   604,250
n        10        20          20        40
SS11 = 248,000 - 1550²/10 = 7,750. SS12 = 256,000 - 2200²/20 = 14,000.
SS21 = 379,000 - 2700²/20 = 14,500. SS22 = 604,250 - 4800²/40 = 28,250.
The sum = SSE = 64,500. The MSE = the weighted average of the cell variances.
Source        SS       df   MS       F       p
School        444.4    1    444.4    0.59    .44
Gender        16,000   1    16,000   21.30   < .001
Interaction   4,000    1    4,000    5.30    .024
Error         64,500   86   750
Gender Interaction Analysis
SS_B at A_i = Σ_j T̃_ij²/Ñ - T̃_i²/(bÑ), where the T̃ are the adjusted cell and marginal totals and b is the number of levels of B.
SS_Gender at School 1 = (2755.5² + 1955.5²) / 17.7 - 4711.1² / (2 × 17.7) = 18,000.
SS_Gender at School 2 = (2400² + 2133.3²) / 17.7 - 4533.3² / (2 × 17.7) = 2,000.
SS_Gender at School 1 + SS_Gender at School 2 = SS_Gender + SS_School x Gender:
18,000 + 2,000 = 20,000 = 16,000 + 4,000.
F1 = 18000 / 750 = 24, p < .001. F2 = 2000 / 750 = 2.6, p = .11.
There is a significant gender difference at School 1, but not at School 2.
----------------- Or, School Interaction Analysis ----------------------
SS_School for male students = (2755.5² + 2400²) / 17.7 - 5155.5² / (2 × 17.7) = 3,555.5.
SS_School for female students = (1955.5² + 2133.3²) / 17.7 - 4088.8² / (2 × 17.7) = 888.8.
SS_School for males + SS_School for females = SS_School + SS_School x Gender:
3,555.5 + 888.8 = 4,444.4 = 444.4 + 4,000.
F_men = 3555.5 / 750 = 4.74, p = .032. F_women = 888.8 / 750 = 1.185, p = .28.
There is a significant school difference for men but not for women.
Reversal Paradox
We have seen that the School x Gender interaction present in the body weight data (from page 412 of the 3rd edition of Howell) results in there being no main effect of school if we use unweighted means, but a (small) main effect being indicated if we use weighted means. When we modified one cell mean to remove the interaction, choice of weighting method no longer affected the magnitude of the main effects. The cell frequencies in Howell's data were proportional, making school and gender orthogonal (independent).
Let me show you a strange thing that can happen when the cell frequencies are
not proportional.
                      Gender
         Male           Female         Marginal Means
School   M      n       M      n       weighted   unweighted
1        150    60      110    40      134        130
2        160    10      120    90      124        140
Note that there is no interaction, but that the cell frequencies indicate that gender
is correlated with school (School 1 has a higher proportion of male students than does
School 2). Weighted means indicate that body weight at School 1 exceeds that at
School 2, but unweighted means indicate that body weight at School 2 exceeds that at
School 1. Both make sense. School 1 has a higher mean body weight than School 2
because School 1 has a higher proportion of male students than does School 2, and
men weigh more than women. But the men at School 2 weigh more than do the men at
School 1 and the women at School 2 weigh more than do the women at School 1.
A reversal paradox occurs when two variables are positively related in aggregated data but negatively related within each level of a third variable (or are negatively related in the aggregate and positively related within each level of the third variable). Please read Messick and van de Geer's article on the reversal paradox (Psychol. Bull., 90, 582-593).
We have a reversal paradox here - in the aggregated data (weighted marginal means),
students at School 1 weigh more than do those at School 2, but within each Gender,
students at School 2 weigh more than those at School 1.
Copyright 2010, Karl L. Wuensch - All rights reserved.
Trend2.doc
Two-Way Independent Samples Trend Analysis
Imagine that we are continuing our earlier work (from the handout "One-
Way Independent Samples ANOVA with SAS") evaluating the effectiveness of
the drug Athenopram HBr. This time we have data from three different groups.
The groups differ with respect to the psychiatric condition for which the drug is
being employed. We wish to determine whether the dose-response curve is the
same across all three groups.
Download and run the file Trend2.sas from my SAS programs page. The
contrived data (created with SAS' normal random number generator) are within
the program. We have 100 scores (20 at each of the five doses) in each
diagnostic group. Our design is Diagnosis x Dose, 3 x 5. Diagnosis is a
qualitative variable, Dose is quantitative. Our dependent variable, as before, is a
measure of the patients' psychological illness after two months of
pharmacotherapy.
In the data step I compute the powers of the Dose variable necessary to conduct the analysis as a polynomial regression. If I had used an input statement of "INPUT DOSE DIAGNOS $ ILLNESS," I would have required 300 data lines, one for each participant, from "0 A 83" to "40 C 120." I only needed 30 data lines (two per cell) with the do loop I employed.
PROC MEANS and PROC PLOT are used to create a plot of the dose-
response curve for each diagnostic group, with the plotting symbols being the
letter representing diagnostic group. You should edit your output file in Word to
connect the plotted means with line segments. Look at that plot. The plot for group A is clearly quadratic, while those for groups B and C are largely linear, with some quadratic thrown in.
The first invocation of PROC GLM is used to conduct a standard 3 x 5
factorial ANOVA. Note that all three effects are significant. The interaction here
is not only significant, but also large in magnitude, with an η² of .14. Clearly we need to investigate this interaction.
The second invocation of PROC GLM obtains trend components for the
Dose variable and for the interaction between Dose and Diagnosis. Look at the
output. Sum the SS for the four trends for the main effect of Dose. You should
get the SS
Dose
from the previous analysis. The trends for the main effect of Dose
are an orthogonal partitioning of the SS for the main effect of Dose. Sum the SS
for the trend components of the interaction between Dose and Diagnosis. You
should get the SS
Dose x Diagnosis
from the previous analysis. The trends for the
interaction are an orthogonal partitioning of the SS for the interaction. We see
that the effect of Dose differs significantly across the diagnostic groups with
respect to its linear, quadratic, and cubic trends. Given these results, we should
look at the linear, quadratic, and cubic trends in the simple effects (the effect of
dose in each of the diagnostic groups).
ω² with both fixed and random effects, but others use the symbol ω² with fixed effects and the symbol ρ (the intraclass correlation coefficient) with random effects.
Statistically, there are three classification variables (factors in the language of ANOVA) in Lori's design. Using the variable names in her SPSS data file, they are CONDTN (FTF or CM), TnNo2 (team number), and Subjects. The Subjects variable is nested within the Team variable and the Team variable is nested within the Condtn variable -- each subject served on only one team and each team served in only one of the experimental conditions.
Lori concluded, correctly I believe, that the editor wanted her to conduct a one-way random effects ANOVA using process satisfaction (PrSat) as the dependent variable and teams (TnNo2) as the classification variable, and then report ω² as an estimate of the proportion of the variance in process satisfaction that is explained by
differences among the teams. Do note that in treating Teams as a random rather
than a fixed factor we are asserting that our teams represent a random sample from a
population of teams to which we wish to generalize our results. Put another way, we
are not just interested in the 40 teams on which we have data but rather on the entire
population of teams from which our sample of teams could be considered to be a
random sample. Of course, we are also treating Subjects as a random factor,
pretending that they really represent a random sample from the population of subjects
that is of interest to us. The experimental factor (Condtn) is a fixed factor -- we did
not randomly choose two experimental conditions from a population of conditions, we
deliberately chose these two conditions, the only two conditions in which we are
interested -- that is, on this factor we have the entire population of interest.
One-Way Random Effects ANOVA, ω² = .425, η² = .562
ω² = [(MS_A - MS_E)/n] / [MS_E + (MS_A - MS_E)/n], where MS_A is the mean square among teams, MS_E is the error (within teams) mean square, and n is the number of scores in each team. In the seventh edition of Howell's Methods text only two-way ANOVA is included, but he provides a link to a table of expected mean squares for three-way designs.
Now I obtain the random effects ANOVA, using my preferred statistical program,
SAS. Here is the procedural code I used, followed by the output:
proc glm; class tm_no2; model prsat = tm_no2; random tm_no2 / test; run;
------------------------------------------------------------------------------------------------
The SAS System 2
The GLM Procedure
Dependent Variable: PRSAT
Sum of
Source DF Squares Mean Square F Value Pr > F
Model 39 71.1840000 1.8252308 3.96 <.0001
Error 120 55.3600000 0.4613333
Corrected Total 159 126.5440000
R-Square Coeff Var Root MSE PRSAT Mean
0.562524 17.92125 0.679215 3.790000
Source DF Type III SS Mean Square F Value Pr > F
TM_NO2 39 71.18400000 1.82523077 3.96 <.0001
------------------------------------------------------------------------------------------------
The SAS System 4
The GLM Procedure
Tests of Hypotheses for Random Model Analysis of Variance
Dependent Variable: PRSAT
Source DF Type III SS Mean Square F Value Pr > F
TM_NO2 39 71.184000 1.825231 3.96 <.0001
Error: MS(Error) 120 55.360000 0.461333
------------------------------------------------------------------------------------------------
For a one-way ANOVA, the basic analysis for the random effect model is identical to that for the fixed effect model. The computation of the ω² does differ, however, between the fixed effect model and the random effect model. Computation of the ω² for this analysis is:

ω² = [(MS_A - MS_E)/n] / [MS_E + (MS_A - MS_E)/n] = [(1.825 - 0.461)/4] / [0.461 + (1.825 - 0.461)/4] = 0.341/0.802 = .425.
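The arithmetic above is easy to check with a short script; a minimal sketch using the mean squares from the SAS output:

```python
def omega_sq_random(ms_a, ms_e, n):
    """Omega-squared for a one-way random effects ANOVA.

    ms_a: mean square among groups; ms_e: within-group (error) mean square;
    n: number of scores per group (equal n assumed).
    """
    var_a = (ms_a - ms_e) / n       # estimated variance component for the factor
    return var_a / (var_a + ms_e)   # proportion of total variance due to the factor

# Values from the team analysis above: MS_A = 1.825, MS_E = 0.461, n = 4
print(round(omega_sq_random(1.825, 0.461, 4), 3))  # 0.425
```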
Omega-squared is considered to be superior to eta-squared for estimating the proportion of variance accounted for by an ANOVA factor. Eta-squared tends to overestimate that proportion -- but eta-squared is certainly easier to compute:

η² = SS_Effect / SS_Total = 71.184/126.544 = .563.

Notice that this is the R² reported by SAS.
One-Way Fixed Effects ANOVA, ω² = .419, η² = .563
For pedagogical purposes, I'll show the computation of ω² treating teams as a fixed variable. The estimated treatment variance is

(a - 1)(MS_A - MS_E)/(na) = (40 - 1)(1.825 - 0.461)/((4)(40)) = 0.332.

The term a is the number of levels of the team variable. The estimated total variance is equal to the estimated treatment variance plus the estimated error variance, 0.332 + MS_E = 0.332 + 0.461 = 0.793. Accordingly, ω² = .332/.793 = .419.
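The fixed-effects version can be sketched the same way (values again from the SAS output above):

```python
def omega_sq_fixed(ms_a, ms_e, n, a):
    """Omega-squared treating the factor as fixed.

    a: number of levels of the factor; n: number of scores per level.
    """
    var_treat = (a - 1) * (ms_a - ms_e) / (n * a)  # estimated treatment variance
    return var_treat / (var_treat + ms_e)          # treatment / estimated total variance

print(round(omega_sq_fixed(1.825, 0.461, 4, 40), 3))  # 0.419
```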
Two-Way Mixed Effects ANOVA
One should keep in mind that the ω² for teams, as computed above, includes the effect of the experimental treatment, Condtn -- that is, we have estimated the variance among the teams, but part of that variance is due to the fact that the two experimental groups of teams were treated differently, and the other part of it is due to other differences among teams (error, effects of extraneous variables, reflected in differences among subjects' scores within teams).
If one wanted to determine the variance of teams after excluding variance due to
the experimental condition, a mixed factorial ANOVA, Condtn x Teams (nested within
Condtn), could be conducted. Here is the SAS code and output for such an analysis,
treating Condtn as fixed and Teams as random:
proc glm; class condtn tm_no2; model prsat = condtn tm_no2(condtn);
random tm_no2(condtn) / test; run;
------------------------------------------------------------------------------------------------
The SAS System 6
The GLM Procedure
Dependent Variable: PRSAT
Sum of
Source DF Squares Mean Square F Value Pr > F
Model 39 71.1840000 1.8252308 3.96 <.0001
Error 120 55.3600000 0.4613333
Corrected Total 159 126.5440000
R-Square Coeff Var Root MSE PRSAT Mean
0.562524 17.92125 0.679215 3.790000
Source DF Type III SS Mean Square F Value Pr > F
CONDTN 1 33.48900000 33.48900000 72.59 <.0001
TM_NO2(CONDTN) 38 37.69500000 0.99197368 2.15 0.0009
------------------------------------------------------------------------------------------------
The SAS System 8
The GLM Procedure
Tests of Hypotheses for Mixed Model Analysis of Variance
Dependent Variable: PRSAT PRSAT
Source DF Type III SS Mean Square F Value Pr > F
CONDTN 1 33.489000 33.489000 33.76 <.0001
Error 38 37.695000 0.991974
Error: MS(TM_NO2(CONDTN))
Source DF Type III SS Mean Square F Value Pr > F
TM_NO2(CONDTN) 38 37.695000 0.991974 2.15 0.0009
Error: MS(Error) 120 55.360000 0.461333
Note that the SS for Teams from the previous analysis, 71.184, has been
partitioned into a SS for Condtn, 33.489, and a SS for Teams within conditions, 37.695.
The Total SS (126.544) is comprised of the SS for Condtn, 33.489, plus the SS for
Teams within conditions, 37.695, plus the SS for Subjects within teams within
conditions, 55.36. Do note that the analysis above uses the variance for teams within
conditions as the error variance for testing the effect of Condtn.
The η² for the entire effect of teams is 71.184/126.544 = .563. That part due to the experimental manipulation is 33.489/126.544 = .265.
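The partitioning is easy to verify from the sums of squares in the output above:

```python
# Sums of squares from the mixed-model SAS output:
# Condtn, Teams(Condtn), and Subjects within Teams within Condtn
ss_condtn, ss_teams_within, ss_subjects = 33.489, 37.695, 55.360

ss_teams = ss_condtn + ss_teams_within   # SS for teams ignoring condition
ss_total = ss_teams + ss_subjects        # corrected total SS

print(round(ss_teams, 3))             # 71.184
print(round(ss_total, 3))             # 126.544
print(round(ss_teams / ss_total, 3))  # 0.563, eta-squared for the whole effect of teams
print(round(ss_condtn / ss_total, 3)) # 0.265, the part due to the manipulation
```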
References
Howell, D. C. (2007). Statistical methods for psychology (6th ed.). Belmont, CA: Thomson Wadsworth.
Howell, D. C. (2010). Statistical methods for psychology (7th ed.). Belmont, CA: Cengage Wadsworth.
Copyright 2010, Karl L. Wuensch - All rights reserved.
ANOVA_Flow.doc
The Factorial ANOVA Is Done, Now What Do I Do?
After conducting a factorial ANOVA, one typically inspects the results of that ANOVA and then decides what additional analyses
are needed. It is often recommended that this take place in a top-down fashion, inspecting the highest-order interaction term first and
then moving down to interactions of the next lower order, and so on until reaching the main effects.
Two basic principles are:
If an interaction is significant, conduct tests of simple (conditional) effects to help explain the interaction, and
Effects which do not participate in higher-order interactions are easier to interpret than are those that do.
Consider a three-way analysis. If the triple interaction, AxBxC is significant, one might decide to test the simple (conditional)
interaction of AxB at each level of C. If the AxB interaction at C=1 is significant, one might then decide to test the simple, simple
(doubly conditional) main effects of A at each level of B for those cells where C=1. If the AxB interaction at C=2 is not significant, then
one is likely to want to look at the (simple main) effects of A and of B for those cells where C=2.
If the triple interaction is not significant, one next looks at the two-way interactions. Suppose that AxB is significant but the other two
interactions are not. The significant AxB interaction might then be followed by tests of the simple main effects of A at each level of B.
For each significant simple main effect of A, when there are more than two levels of A, one might want to conduct pairwise comparisons
or more complex contrasts among A's marginal means for the data at the specified level of B. Since the main effect of C does not
participate in any significant interactions, it can now be more simply interpreted -- if there are more than two levels of C, one might want
to conduct pairwise comparisons or more complex contrasts involving the marginal means of C.
In some situations one might be justified in interpreting main effects even when they do participate in significant interactions,
especially when those interactions are monotonic. For example, even though AxB is significant, if the direction of the effect of A is the
same at all levels of B, there may be some merit in talking about the main effect of A, ignoring B.
The most important thing to keep in mind is that the contrasts that are made (interactions, simple interactions, main effects,
simple main effects, contrasts involving marginal means, and so on) should be contrasts that help you answer questions of interest about
the data. My presentation here has been rather abstract, treating A, B, and C as generic factors. When A, B, and C are particular
variables, the recommendations given here may or may not make good sense. When they do not make good sense, do not follow them
-- make the comparisons that do make good sense!
B. J. White, graduate student in PSYC 6431 in the Spring of 2002, prepared the following ANOVA Flow Chart based on the
generic recommendations made above. Thanks, B.J.
LeastSq.doc
Least Squares Analyses of Variance and Covariance
One-Way ANOVA
Read Sections 1 and 2 in Chapter 16 of Howell. Run the program ANOVA1-LS.sas, which can be found on my SAS programs page. The data here are from Table 16.1 of Howell.
Dummy Variable Coding. Look at the values of X1-X3 in the data in the Data
Dummy section of the program file. X1 codes whether or not an observation is from
Group 1 (0 = no, 1 = yes), X2 whether or not it is from Group 2, and X3 whether or not it
is from Group 3. Only k-1 (4-1) dummy variables are needed, since an observation that is not in any of the first k-1 groups must be in the kth group. The dummy variable coding matrix is thus:
Group X1 X2 X3
1 1 0 0
2 0 1 0
3 0 0 1
4 0 0 0
For each dummy variable the partial coefficients represent a contrast between its group and the reference group (the one coded with all 0s); that is, X1's partials code Group 1 vs. Group 4, X2 codes Group 2 vs. Group 4, and X3 codes Group 3 vs. Group 4.
Look at the correlations among the Xs and note that with equal ns the
off-diagonal correlations are constant. Now look at the output from the regression
analysis. Note that the omnibus F of 4.455 is the same that would be obtained from a
traditional ANOVA. Also note the following about the partial statistics:
The intercept is the mean of the reference group.
For each X the b is the difference between its group's mean and the mean of the reference group. For example, the b for X1 is the mean for Group 1 minus the mean for Group 4, (8 - 6.33) = 1.67.
Do note that only Group 3 differs significantly from the reference group.
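These properties of dummy-variable coding are easy to verify with a small regression; a sketch with made-up scores (not Howell's data):

```python
import numpy as np

# Hypothetical scores for four groups; Group 4 is the reference group
groups = {1: [7, 8, 9], 2: [4, 5, 6], 3: [1, 2, 3], 4: [5, 6, 7]}

# Design matrix: intercept plus dummy variables X1-X3 (Group 4 coded 0, 0, 0)
rows, y = [], []
for g, scores in groups.items():
    for score in scores:
        rows.append([1, int(g == 1), int(g == 2), int(g == 3)])
        y.append(score)

b = np.linalg.lstsq(np.array(rows, float), np.array(y, float), rcond=None)[0]
# b[0] is the reference group's mean (6); each remaining coefficient is that
# group's mean minus the reference mean: 8-6 = 2, 5-6 = -1, 2-6 = -4
print(b.round(3))
```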
The F for any effect can be obtained by comparing the full model with a reduced model from which the terms coding that effect have been deleted:

F = [(R²_full - R²_reduced)/(f - r)] / [(1 - R²_full)/(N - f - 1)],

where f and r are the numbers of predictors in the full and reduced models.
The Model: B output is for a reduced model with the three terms coding the main effect of B deleted. You find the SS and η² for B by subtracting the appropriate reduced model statistics from the full model statistics. The Model: A output is for a reduced model with the one term coding the main effect of A deleted. Use this output to obtain the SS and η² for the main effect of A. Construct a source table and then compare the output of PROC ANOVA with the source table you obtained by comparing reduced effects-coded models with the full effects-coded model. The CLASS statement in PROC ANOVA and PROC GLM simply tells SAS which independent variables need to be dummy coded.
Nonorthogonal Analysis. ANOV2-LS-UnEq.sas uses the unequal ns data from
Table 16.5 of Howell. The coding scheme is the same as in the previous analysis.
Obtain sums-of-squares for A, B, and AxB in the same way as you did in the previous
analysis and you will have done an Overall and Spiegel Method I analysis. Do note that
the sums-of-squares do not sum to the total SS, since we have excluded variance that is
ambiguous. Each effect is partialled for every other effect. If you will compare your
results from such an analysis with those provided by the TYPE III SS computed by
PROC GLM you will see that they are identical.
Analysis of Covariance
Read Sections 16.5 through 16.11 in Howell and Chapter 6 in Tabachnick and
Fidell. As explained there, the ANCOV is simply a least-squares ANOVA where the
covariate or covariates are entered into the model prior to the categorical independent
variables so that the effect of each categorical independent variable is adjusted for the
covariate(s). Do note the additional assumptions involved in ANCOV (that each covariate has a linear relationship with the outcome variable and that the slope for that relationship does not change across levels of the categorical predictor variable(s)).
Carefully read Howell's cautions about interpreting analyses of covariance when subjects have not been randomly assigned to treatment groups. Run the programs ANCOV1.sas and ANCOV2.sas.
One-Way ANCOV. I am not going to burden you with doing ANCOV with PROC REG; I think you already have the basic idea of least-squares analyses mastered. Look at
ANCOV1.sas and its output. These data were obtained from Figure 2 in the article,
"Relationships among models of salary bias," by M. H. Birnbaum (1985, American
Psychologist, pp. 862-866) and are said to be representative of data obtained in various
studies of sex bias in faculty salaries. I did double the sample size from that displayed in
the plot from which I harvested the data. We can imagine that we have data from three different departments' faculty members: the professor's Gender (1 = male, 2 = female), an objective measure of the professor's QUALIFICations (a composite of things like number of publications, ratings of instruction, etc.), and SALARY (in thousands of 1985 dollars).
The data are plotted, using the symbol for gender as the plotting symbol. The plot
suggests three lines, one for each department (salaries being highest in the business
department and lowest in the sociology department), but that is not our primary interest.
Do note that salaries go up as qualifications go up. Also note that the Ms tend to be
plotted higher and more to the right than the Fs.
PROC ANOVA does two simple ANOVAs, one on the qualifications data (later to
be used as a covariate) and one on the salary data. Both are significant. This is going to
make the interpretation of the ANCOV difficult, since we will be adjusting group means
on the salary variable to remove the effect of the qualifications variable (the covariate),
but the groups differ on both. The interpretation would be more straightforward if the
groups did not differ on the covariate, in which case adjusting for the covariate would
simply reduce the error term, providing for a more powerful analysis. The error SS (1789.3) from the analysis on the covariate is that which Howell calls SS_e(c) when discussing making comparisons between pairs of adjusted means.
The first invocation of PROC GLM is used to test the homogeneity of regression
assumption. PROC ANOVA does not allow any continuous effects (such as a
continuous covariate). The model statement includes (when the bar notation is
expanded) the interaction term, Qualific*Gender. Some computing time is saved by
asking for only sequential (SS1) sums of squares. Were Qualific*Gender significant, we
would have a significant violation of the homogeneity of regression assumption (the
slopes of the lines for predicting salary from qualifications would differ significantly
between genders), which would, I opine, be a very interesting finding in its own right.
The second invocation of PROC GLM is used to obtain the slopes for predicting salary from qualifications within each level of Gender: QUALIFIC(GENDER). We
already know that these two slopes do not differ significantly, but I do find it interesting
that the slope for the male faculty is higher than that for the female faculty.
The third invocation of PROC GLM is used to do the Analysis of Covariance. The
correlation between salary and qualifications is significant (Type I p < .0001 -- evaluating
qualifications unadjusted for gender) and the genders do differ significantly after
adjusting for qualifications. The Estimate given for qualifications is the common (across
genders) slope used to adjust salary scores in both groups. The LSMEANS are
estimates of what the group means would be if the groups did not differ on qualifications.
If you have more than two groups, you will probably want to use the PDIFF option, for
example, LSMEANS GROUP / PDIFF. The matrix of p-values produced with the PDIFF option is for pairwise comparisons between adjusted means (with no adjustment of per-comparison alpha). You can adjust the alpha-criterion downwards (Bonferroni,
Sidak) if you are worried about familywise error rates.
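The adjustment behind LSMEANS can be sketched directly; a minimal example with hypothetical numbers (not the salary data):

```python
def adjusted_mean(group_dv_mean, group_cov_mean, grand_cov_mean, slope):
    """Covariate-adjusted (least-squares) mean for one group: slide the group's
    DV mean along the common within-group slope to the grand covariate mean."""
    return group_dv_mean - slope * (group_cov_mean - grand_cov_mean)

# Hypothetical group: DV mean 60, covariate mean 12; grand covariate mean 10;
# common within-group slope 2.5
print(adjusted_mean(60, 12, 10, 2.5))  # 55.0
```

Because this group is above average on the covariate, its mean is adjusted downward.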
We can estimate the magnitude of effect of gender with an eta-squared statistic, the ratio of the gender sum of squares to the total sum of squares, 268.364 / 3537 = .076. This is equivalent to the increase in R² when we add gender to a model for predicting salary from the covariate(s). The Proc Corr shows that r for predicting salary from qualifications is .60193. Proc GLM shows that the R² for predicting salary from qualifications and gender is .438189. Accordingly, eta-squared = .438189 - .60193² = .076. If men and women were equally qualified, 7.6% of the differences in salaries would
be explained by gender. Look back at the ANOVA comparing the genders on salary.
The eta-squared there was .306. If we ignore qualifications, 30.6% of the differences in
salaries is explained by gender (which is confounded with qualifications and other
unknown variables).
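The R²-increment arithmetic is easy to verify:

```python
r_cov = 0.60193      # r: salary predicted from qualifications alone (Proc Corr)
r2_full = 0.438189   # R-squared: salary from qualifications plus gender (Proc GLM)

eta_sq = r2_full - r_cov ** 2   # increase in R-squared when gender is added
print(round(eta_sq, 3))  # 0.076
```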
We could also estimate the magnitude of the effect with the standardized difference d, the difference between the adjusted group means divided by the pooled standard deviation.
Our results indicate that even when we statistically adjust for differences in
qualifications, men receive a salary significantly higher than that of women. This would
seem to be pretty good evidence of bias against women, but will the results look the
same if we view them from a different perspective? Look at the last invocation of PROC
GLM. Here we compared the genders on qualifications after removing the effect of
salary. The results indicate that when we equate the groups on salary the mean
qualifications of the men is significantly greater than that of the women. That looks like
bias too, but in the opposite direction. ANCOV is a slippery thing, especially when
dealing with data from a confounded design where the covariate is correlated not only
with the dependent variable but with the independent variable as well.
Two-Way ANCOV. Look at ANCOV2.sas and its output. The data are from Table
16.11 in Howell. The program is a straightforward extension of ANCOV1.sas to a
two-way design. First PROC ANOVA is used to evaluate treatment effects on the covariate (DISTRACT) and on the dependent variable (ERRORS). Were the design
unbalanced (unequal ns) you would need to use PROC GLM with Type III
sums-of-squares here. The model SS, 6051.8, from the ANOVA on the covariate is the SS_cells(c) from Howell's discussion of comparing adjusted means. The error SS from the same analysis, 54,285.8, is Howell's SS_e(c), and the SS_Smoke, 730, is Howell's SS_g(c).
PROC GLM is first used to test homogeneity of regression within cells and treatments. The DISTRACT*TASK F tests the null hypothesis that the slope for predicting ERRORS from DISTRACT is the same for all three types of task. The DISTRACT*SMOKE F tests slopes across smoking groups. The DISTRACT*TASK*SMOKE F tests the null that the slope is the same in every cell of the two-way design. Howell did not extract the DISTRACT*TASK and DISTRACT*SMOKE terms from the error term and he did not test them, although in the third edition of his text (p. 562) he admitted that a good case could be made for testing those effects (he wanted their 3 df in the error term). Our analysis indicates that we have no problem with heterogeneity of regression across cells, but notice that there is heterogeneity of regression across tasks and across smoking groups. PROC GLM is next used to obtain
the slopes for each cell. Ignore the "Biased" estimates for within-treatment slopes. Although these slopes do not differ enough across cells to produce significant heterogeneity of regression, inspection of the slopes shows why the DISTRACT*TASK effect was significant. Look at how high the slopes are for the cognitive task as compared to the other two tasks. Clearly the number of errors increased more rapidly with participants' level of distractibility with the cognitive task than with the other tasks,
especially for those nicotine junkies who had been deprived of their drug. You can also
see the (smaller) DISTRACT*SMOKE effect, with the slopes for the delayed smokers
(smokers who had not had a smoke in three hours) being larger than for the other
participants.
The next GLM does the ANCOV. Note that DISTRACT is significantly correlated with
ERRORS (p < .001, Type I SS). Remember that the Type I SS reported here does not
adjust the first term in the model (the covariate) for the later terms in the model. Howell
prefers to adjust the covariate for the other effects in the model, so he uses SPSS
unique (same as SAS Type III) SS to test the covariate. The common slope used to
adjust scores is 0.2925. TASK, SMOKE, and TASK*SMOKE all have significant effects
after we adjust for the covariate (Type III SS). Since the interaction is significant, we
need to do some simple main effects analyses. Look first at the adjusted cell means. If
you look at Figure 16.5 in Howell, you will see the interaction quite clearly. The effect of
smoking group is clearly greater with the cognitive task than with the other two tasks (for
which the lines are nearly flat). The very large main effect of type of task is obvious in
that plot too, with errors being much more likely with the cognitive task than with the
other two tasks.
If we ignore the interaction and look at the comparisons between marginal means
(using the PDIFF output, and not worrying about familywise error), we see that, for the
type of task variable, there were significantly more errors with the cognitive task than with
the other two types of tasks. On the smoking variable, we see that the nonsmokers
made significantly fewer errors than did those in the two groups of smokers.
The simple main effects analysis done with the data from the pattern recognition
task shows that the smoking groups did not differ significantly. The Type I SS for SMOKE gives us a test of the effect of smoking history ignoring the covariate, while the Type III SS gives us the test after adjusting for the covariate (an ANCOVA). The slope used to
adjust the scores on the pattern recognition test is 0.085, notably less than the 0.293
used in the factorial ANCOV.
When we look at the analysis of the data from the cognitive task, we see that the
smoking groups differ significantly whether we adjust for the covariate or not. The
nonsmokers made significantly fewer errors than did the participants in both the smoking
groups. Notice that the small difference between the two smoking groups using the
means as adjusted in the factorial analysis virtually disappears when using the
adjustment from this analysis, where the slope used for adjusting scores (0.537) is
notably more than it was with the factorial ANCOV or with the other two tasks. This is
due to the DISTRACT*TASK interaction which Howell chose to ignore, but we detected.
Finally, with the driving task, we see that the smoking groups differ significantly,
with the active smokers making significantly fewer errors than did the delayed smokers
and the nonsmokers. I guess the stimulant properties of nicotine are of some value
when driving.
Controlling Familywise Error When Using PDIFF
If the comparisons being made involve only three means, I recommend Fisher's procedure: that is, do not adjust the p values, but require that the main effect be statistically significant; if it is not, none of the pairwise differences are significant. If the
comparisons involve more than three means, you can tell SAS to adjust the p values to
control familywise error. For example, LSMEANS smoke / PDIFF ADJUST=TUKEY; would
apply a Tukey adjustment. Other adjustments available include BONferroni, SIDAK,
DUNNETT, and SCHEFFE.
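The two simplest of these adjustments can be sketched in a few lines; the formulas below are the usual Bonferroni (p times k, capped at 1) and Sidak (1 - (1 - p)^k) corrections for k comparisons:

```python
def bonferroni(p, k):
    """Bonferroni-adjusted p for one of k comparisons (capped at 1)."""
    return min(1.0, p * k)

def sidak(p, k):
    """Sidak-adjusted p for one of k comparisons."""
    return 1.0 - (1.0 - p) ** k

# Three pairwise comparisons among three means, raw p = .02
print(round(bonferroni(0.02, 3), 4))  # 0.06
print(round(sidak(0.02, 3), 4))       # 0.0588
```

Sidak is slightly less conservative than Bonferroni, as the example shows.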
References and Recommended Readings
Birnbaum, M. H. (1985). Relationships among models of salary bias. American Psychologist, 40, 862-866.
Howell, D. C. (2010). Statistical methods for psychology (7th ed.). Belmont, CA: Cengage Wadsworth. ISBN-10: 0-495-59784-8. ISBN-13: 978-0-495-59784-1.
Huck, S. W., & McLean, R. A. (1975). Using a repeated measures ANOVA to analyze the data from a pretest-posttest design: A potentially confusing task. Psychological Bulletin, 82, 511-518.
Maxwell, S. E., Delaney, H. D., & Dill. (1984). Another look at ANCOVA versus blocking. Psychological Bulletin, 95, 136-147.
Maxwell, S. E., Delaney, H. D., & Manheimer, J. M. (1985). ANOVA of residuals and ANCOVA: Correcting an illusion by using model comparisons and graphs. Journal of Educational and Behavioral Statistics, 10, 197-209. doi: 10.3102/10769986010003197
Rausch, J. R., Maxwell, S. E., & Kelley, K. (2003). Analytic methods for questions pertaining to a randomized pretest, posttest, follow-up design. Journal of Clinical Child and Adolescent Psychology, 32, 467-486.
Tabachnick, B. G., & Fidell, L. S. (2007). Using multivariate statistics (5th ed.). Boston: Allyn & Bacon. ISBN-10: 0205459382. ISBN-13: 9780205459384.
Example of Presentation of Results from One-Way ANCOV
The Pretest-Posttest x Groups Design: How to Analyze the Data
Matching and ANCOV with Confounded Variables
Copyright 2010 Karl L. Wuensch - All rights reserved.
Pretest-Posttest-ANCOV.doc
The Pretest-Posttest x Groups Design: How to Analyze the Data
You could ignore the pretest scores and simply compare the groups on the
posttest scores, but there is probably a good reason you collected the pretest scores in
the first place (such as a desire to enhance power), so I'll dismiss that option.
To illustrate the analyses I shall use the AirportSearch data, available at
http://core.ecu.edu/psyc/wuenschk/SPSS/SPSS-Data.htm . Do see the Description of
the data.
Mixed Factorial ANOVA
Treat the Pretest-Posttest contrast as a within-subjects factor and the groups as a
between-subjects factor. Since the within-subjects factor has only one degree of
freedom, the multivariate-approach results will be identical to the univariate-approach
results and sphericity will not be an issue.
Here is SPSS syntax and output.
GLM post pre BY race
/WSFACTOR=PostPre 2 Polynomial
/METHOD=SSTYPE(3)
/CRITERIA=ALPHA(.05)
/WSDESIGN=PostPre
/DESIGN=race.
Tests of Within-Subjects Contrasts
Measure:MEASURE_1
Source PostPre
Type III Sum of
Squares df Mean Square F Sig.
PostPre Linear 288.364 1 288.364 84.676 .000
PostPre * race Linear 76.364 1 76.364 22.424 .000
Error(PostPre) Linear 180.491 53 3.405
Tests of Between-Subjects Effects
Measure:MEASURE_1
Transformed Variable:Average
Source
Type III Sum of
Squares df Mean Square F Sig.
Intercept 1330.135 1 1330.135 307.006 .000
race 254.063 1 254.063 58.640 .000
Error 229.628 53 4.333
It is the interaction term that is of most interest in this analysis, and it is
significant. It indicates that the pre-post difference is not the same for Arab travelers as
it is for Caucasian travelers. To further investigate the interaction, one can compare the
groups on pretest only and posttest only and/or compare posttest with pretest
separately for the two groups. I'll do the latter here.
SORT CASES BY race.
SPLIT FILE SEPARATE BY race.
T-TEST PAIRS=post WITH pre (PAIRED)
/CRITERIA=CI(.9500)
/MISSING=ANALYSIS.
Arab Travelers
Paired Samples Statistics
a
Mean N Std. Deviation Std. Error Mean
Pair 1 Post-9-11 7.67 21 3.512 .766
Pre-9-11 2.62 21 1.161 .253
a. race = Arab
Paired Samples Correlations
a
N Correlation Sig.
Pair 1 Post-9-11 & Pre-9-11 21 .065 .778
a. race = Arab
Paired Samples Test
a
Paired Differences
95% Confidence Interval of the
Difference
t df Sig. (2-tailed)
Lower Upper
Pair 1 Post-9-11 - Pre-9-11 3.397 6.698 6.379 20 .000
a. race = Arab
As you can see, the pre-post difference was significant for the Arab travelers.
Caucasian Travelers
Paired Samples Statistics
a
Mean N Std. Deviation Std. Error Mean
Pair 1 Post-9-11 2.82 34 1.290 .221
Pre-9-11 1.21 34 1.572 .270
a. race = Caucasian
Paired Samples Correlations
a
N Correlation Sig.
Pair 1 Post-9-11 & Pre-9-11 34 .287 .099
a. race = Caucasian
Paired Samples Test
a
Paired Differences
95% Confidence Interval of the
Difference
t df Sig. (2-tailed)
Lower Upper
Pair 1 Post-9-11 - Pre-9-11 1.016 2.219 5.473 33 .000
a. race = Caucasian
As you can see, the (smaller) pre-post difference is significant for Caucasian
travelers too. We conclude that both groups were searched more often after 9/11, but
that the increase in searches was greater for Arab travelers than for Caucasian
travelers. I should also note that the pre-post correlations are not very impressive here.
Simple Comparison of the Difference Scores
We could simply compute Post minus Pre difference scores and then compare
the two groups on those difference scores. Here is the output from exactly such a
comparison.
COMPUTE Diff=post-pre.
VARIABLE LABELS Diff 'Post Minus Pre'.
EXECUTE.
Group Statistics
race N Mean Std. Deviation Std. Error Mean
Post Minus Pre Arab 21 5.0476 3.62596 .79125
Caucasian 34 1.6176 1.72354 .29558
Independent Samples Test
t-test for Equality of Means
t df Sig. (2-tailed)
Post Minus Pre Equal variances assumed 4.735 53 .000
Equal variances not assumed 4.061 25.669 .000
The increase in number of searches is significantly greater for Arab travelers
than for Caucasian travelers. Notice that the value of t is 4.735 on 53 df. An ANOVA
on the same contrast would yield an F with one df in the numerator and the same error
df. The value of F would be the square of the value of t. When you square our t here
you get F(1, 53) = 22.42. The one-tailed p for this F is identical to the two-tailed p for
the t. Yes, with ANOVA one properly employs a one-tailed p to evaluate a
nondirectional hypothesis. Look back at the source table for the mixed factorial
ANOVA. Notice that the F(1, 53) for the interaction term is 22.42. Is this mere
coincidence? No, it is a demonstration that a t test on difference scores is
absolutely equivalent to the test of the interaction term in a 2 x 2 mixed factorial
ANOVA. Many folks find this hard to believe, but it is easy to demonstrate, as I have
done above. Try it with any other Pre-Post x Two Groups design if you are not yet
convinced.
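The equivalence is easy to check numerically; a small simulation with hypothetical data, computing the interaction SS and its error term directly from the difference scores:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical pre/post scores for two independent groups
pre1, post1 = rng.normal(2, 1, 21), rng.normal(7, 3, 21)
pre2, post2 = rng.normal(1, 1, 34), rng.normal(3, 1, 34)
d1, d2 = post1 - pre1, post2 - pre2   # difference scores

# Pooled-variance t test comparing the groups on the difference scores
t, _ = stats.ttest_ind(d1, d2, equal_var=True)

# Interaction F from the 2 x 2 mixed ANOVA, computed from the same differences
n1, n2 = len(d1), len(d2)
grand = (d1.sum() + d2.sum()) / (n1 + n2)
ss_int = (n1 * (d1.mean() - grand) ** 2 + n2 * (d2.mean() - grand) ** 2) / 2
ss_err = (((d1 - d1.mean()) ** 2).sum() + ((d2 - d2.mean()) ** 2).sum()) / 2
F = ss_int / (ss_err / (n1 + n2 - 2))

print(np.isclose(t ** 2, F))  # True
```

Whatever data you feed it, the squared t equals the interaction F.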
The logical next step here would be to test, for each group, whether or not the
mean difference score differs significantly from zero. Here are such tests:
SORT CASES BY race.
SPLIT FILE SEPARATE BY race.
T-TEST
/TESTVAL=0
/MISSING=ANALYSIS
/VARIABLES=Diff
/CRITERIA=CI(.95).
One-Sample Test
a
Test Value = 0
t df Sig. (2-tailed) Mean Difference
95% Confidence Interval of the
Difference
Lower Upper
Post Minus Pre 6.379 20 .000 5.04762 3.3971 6.6981
a. race = Arab
One-Sample Test
a
Test Value = 0
t df Sig. (2-tailed) Mean Difference
95% Confidence Interval of the
Difference
Lower Upper
Post Minus Pre 5.473 33 .000 1.61765 1.0163 2.2190
a. race = Caucasian
Please do notice that the values of t, df, and p here are identical to those
obtained earlier with pre-post correlated samples t tests.
Analysis of Covariance
Here we treat the pretest scores as a covariate and the posttest scores as the outcome variable. Please note that this involves the assumption that the relationship
between pretest and posttest is linear and that the slope is identical in both groups.
These assumptions are easily evaluated. For example, the regression line for predicting posttest scores from pretest scores in the Arab group is PostTest = 7.148 + .198(PreTest), a slope of .198; for the Caucasian group it is PostTest = 2.539 + .236(PreTest), a slope of .236.
UNIANOVA post BY race WITH pre
/METHOD=SSTYPE(3)
/INTERCEPT=INCLUDE
/EMMEANS=TABLES(race) WITH(pre=MEAN)
/CRITERIA=ALPHA(.05)
/DESIGN=pre race.
Tests of Between-Subjects Effects
Dependent Variable:Post-9-11
Source
Type III Sum of
Squares df Mean Square F Sig.
Corrected Model 310.064
a
2 155.032 27.231 .000
Intercept 437.205 1 437.205 76.795 .000
pre 5.563 1 5.563 .977 .327
race 214.378 1 214.378 37.655 .000
Error 296.045 52 5.693
Total 1807.000 55
Corrected Total 606.109 54
a. R Squared = .512 (Adjusted R Squared = .493)
Estimated Marginal Means
race
Dependent Variable:Post-9-11
race Mean Std. Error
95% Confidence Interval
Lower Bound Upper Bound
Arab 7.469
a
.558 6.350 8.588
Caucasian 2.946
a
.427 2.088 3.803
a. Covariates appearing in the model are evaluated at the following
values: Pre-9-11 = 1.75.
We conclude that the two groups differ significantly on posttest scores after adjusting for the pretest scores. Notice that for these data the F for the effect of interest is larger with the ANCOV than in the mixed factorial ANOVA. In other words, the ANCOV had more power.
Which Analysis Should I Use?
If the difference scores are intrinsically meaningful (generally this will involve pretest and posttest both having been measured on the same metric), the simple comparison of the groups on mean difference scores is appealing and, as I have shown earlier, is mathematically identical to the mixed factorial. The ANCOV, however, generally has more power.
Huck and McLean (1975) addressed the issue of which type of analysis to use for the pretest-posttest control group design. They did assume that assignment to groups was random. They explained that it is the interaction term that is of interest if the mixed factorial ANOVA is employed and that a simple t test comparing the groups on pre-post difference scores is absolutely equivalent to such an ANOVA. They pointed out that the t test and the mixed factorial ANOVA are equivalent to an ANCOV (with pretest as covariate) if the linear correlation between pretest and posttest is positive and perfect. They argued that the ANCOV is preferred over the others due to the fact that it generally will have more power and can easily be adapted to resolve problems such as heterogeneity of regression (the groups differ with respect to slope for predicting the posttest from the pretest) and nonlinearity of the relationship between pretest and posttest.
Maxwell, Delaney, and Dill (1984) noted that under some conditions the ANCOV
is more powerful, under other conditions it is less powerful. Other points they made
include:
The mixed factorial ANOVA may employ data from a randomized blocks design
(where subjects have been matched/blocked up on one or more variables
thought to be well associated with the outcome variable) and then, within blocks,
randomly assigned to treatment groups, or it may employ data where no random
assignment was employed (as in my example, where subjects were not randomly
assigned a race/ethnicity). This matters. The randomized blocks design equates
the groups on the blocking variables.
If you can obtain scores on the concomitant variable (here the pretest) prior to
assigning subjects to groups, matching/blocking the subjects on that concomitant
variable and then randomly assigning subjects to treatment groups will enhance
power relative to ignoring the concomitant variable when assigning subjects to
treatment groups. Even with the randomized blocks design, one can use
ANCOV rather than mixed ANOVA for the analysis.
If the relationship between the concomitant variable and the outcome variable is
not linear, the ANCOV is problematic. You may want to consider transformations
to straighten up the (monotonic) nonlinear relationship or polynomial regression
analysis.
References and Recommended Readings
Huck, S. W., & McLean, R. A. (1975). Using a repeated measures ANOVA to
analyze the data from a pretest-posttest design: A potentially confusing task.
Psychological Bulletin, 82, 511-518.
Maxwell, S. E., Delaney, H. D., & Dill, C. A. (1984). Another look at ANCOVA
versus Blocking. Psychological Bulletin, 95, 136-147.
Rausch, J. R., Maxwell, S. E., & Kelley, K. (2003). Analytic methods for
questions pertaining to a randomized pretest, posttest, follow-up design. Journal
of Clinical Child and Adolescent Psychology, 32, 467-486.
Links
AERA-D Discussion: Don Burrill commenting on a study with pretest and
posttest but not random assignment.
Wuensch's Stats Lessons
o Least Squares Analyses of Variance and Covariance
Karl L. Wuensch
August, 2009
ws-anova.doc
An Introduction to Within-Subjects Analysis of Variance
In ANOVA a factor is either a between-subjects factor or a within-subjects factor. When the
factor is between-subjects the data are from independent samples, one sample of
outcome/dependent variable scores for each level of the factor. With such independent samples we
expect no correlation between the scores at any one level of the factor and those at any other level of
the factor. A within-subjects or repeated measures factor is one where we expect to have
correlated samples, because each subject is measured (on the dependent variable) at each level of
the factor.
The Data for this Lesson. An example of a within-subjects design is the migraine-headache
study described by Howell (Statistical Methods for Psychology, 6th ed., 2007, Table 14.3). The
dependent variable is duration of headaches (hours per week), measured five times. The
within-subjects factor is Weeks, when the measurement was taken, during the third or fourth week of
baseline recording (levels 1 and 2 of Week) or during the fourth, fifth, or sixth week of relaxation
training (levels 3, 4, and 5 of Week). The resulting five samples of scores are clearly not independent
samples; each is based on the same nine subjects. Since we expect the effect of individual differences
among subjects to exist across levels of the Week factor, we expect that the scores at each level of
Week will be positively correlated with those at each other level of Week; for example, we expect that
those who reported the greatest durations during the level 1 week will also tend to report the greatest
durations during the level 3 week.
Crossed and Nested Factors. When each subject is measured at each level of a factor we
say that Subjects is crossed with that factor. For our headache example, Subjects is crossed with
Week. Mathematically we treat Subjects as a factor, so we have a Week x Subjects factorial design
with only one score in each cell (each subject is measured once and only once at each level of Week).
In ANOVA each factor is either crossed with or nested within each other factor. When one
factor is nested within another then knowing the level of the nested factor tells you the level of the other
factor. The Subjects factor is said to be nested within between-subjects factors. For example, if I
randomly assigned ten subjects to each of five experimental groups I know that subjects 1-10 are at
level one of the between-subjects factor, 11-20 at level two, etc. If you ask me at what level of the
between-subjects factor is the score that is at level 35 of the Subjects factor, I can answer "four." If
the experimental factor were within-subjects (each subject tested in each of the five experimental
conditions) and you asked me, "This score is from subject number 5; at what level of the experimental
factor was it obtained?", I could not tell you.
Order Effects and Counterbalancing. Suppose that our within-subjects factor is not Week,
but rather some experimental manipulation, for example, the color of the computer screen (gray,
green, white, blue, or black) upon which I present the material the subject is to learn. Each subject is
tested with each color. A big problem with such a design is that the order of presentation of the
experimental conditions may confound the results. For example, were I to test each subject first with
the gray screen, then green, then white, then blue, and lastly black, the results (how well the subject
learned the material that was presented, such as a list of paired associates) might be contaminated by
practice effects (subjects get better at the task as time passes), fatigue effects (subjects get tired of
it all as time passes), and other such order effects. While one may ameliorate such problems by being
t = (M_i - M_j) / SQRT(MS_error(1/n_i + 1/n_j)), on 32 degrees of freedom, p < .01.
This is the same formula used for multiple comparisons involving a between-subjects factor,
except that the error MS is the interaction between Subjects and the Within-subjects factor. If you
wanted qs instead of ts (for example, doing a Student-Newman-Keuls analysis), you would just
multiply the obtained t by SQRT(2). For example, for Week 2 versus Week 3, t =
(22-9.33)/SQRT(7.2(1/9 + 1/9)) = 10.02, q = 10.02 * SQRT(2) = 14.16.
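The arithmetic for that comparison can be checked in a few lines of Python, using the rounded means (22 and 9.33), the MS error of 7.2, and n = 9 per week, as in the text:

```python
import math

# Pooled-error comparison of Week 2 versus Week 3; the error MS here is the
# Subjects x Week interaction MS (7.2 on 32 df).
ms_error, n = 7.2, 9
m2, m3 = 22.0, 9.33
t = (m2 - m3) / math.sqrt(ms_error * (1 / n + 1 / n))
q = t * math.sqrt(2)   # convert the t to a Studentized-range q
print(round(t, 2))     # 10.02
```

The q comes out a shade over 14.16; the small discrepancy from the text is just rounding of the means.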
Keppel (Design and Analysis, 1973, pages 408-421) recommends using individual rather than
pooled error terms and computes an F rather than a t. An individual error term estimates error
variance using only the scores for the two conditions being compared rather than all of the scores in all
conditions. Using Keppel's method on the Week 2 versus Week 3 data I obtained a contrast SS of 722
and error SS of 50, for an F(1, 8) = 722/6.25 = 115.52.
Within-Subjects Analysis with SAS
On Karl's SAS Programs page is the file WS-ANOVA.SAS; run it and save the program and
output. The data are within the program.
Univariate Data Format. The first data step has the data in a univariate setup. Notice that
there are 5 lines of data for each subject, one line for each week. The format is Subject #, Week #,
score on outcome variable, new line.
Here are the data as they appear on Dave Howell's web site:
Subject Wk1 Wk2 Wk3 Wk4 Wk5
1 21 22 8 6 6
2 20 19 10 4 4
3 17 15 5 4 5
4 25 30 13 12 17
5 30 27 13 8 6
6 19 27 8 7 4
7 26 16 5 2 5
8 17 18 8 1 5
9 26 24 14 8 9
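If you want to verify SAS's output by hand, the whole univariate partition can be reproduced from this table with standard-library Python. This sketch computes the Week, Subjects, and error (Week x Subjects) sums of squares and the omnibus F:

```python
# Univariate within-subjects partition for the migraine data above:
# a Week x Subjects layout with one score per cell.
data = [
    [21, 22,  8,  6,  6],
    [20, 19, 10,  4,  4],
    [17, 15,  5,  4,  5],
    [25, 30, 13, 12, 17],
    [30, 27, 13,  8,  6],
    [19, 27,  8,  7,  4],
    [26, 16,  5,  2,  5],
    [17, 18,  8,  1,  5],
    [26, 24, 14,  8,  9],
]
n, k = len(data), len(data[0])          # 9 subjects, 5 weeks
grand = sum(sum(row) for row in data)
cm = grand ** 2 / (n * k)               # correction for the mean
ss_total = sum(x * x for row in data for x in row) - cm
week_sums = [sum(row[j] for row in data) for j in range(k)]
ss_week = sum(s * s for s in week_sums) / n - cm
ss_subj = sum(sum(row) ** 2 for row in data) / k - cm
ss_error = ss_total - ss_week - ss_subj          # Week x Subjects interaction
f = (ss_week / (k - 1)) / (ss_error / ((n - 1) * (k - 1)))
print(round(f, 2))   # 85.04
```

The pieces match the values used later in this lesson: SS Weeks = 2449.2, SS error = 230.4, MS error = 7.2.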
The first invocation of PROC ANOVA does the analysis on the data in univariate setup.
proc anova; class subject week; model duration = subject week;
Since the model statement does not include the Subject x Week interaction, that interaction is
used as the error term, which is appropriate. We conclude that mean duration of headaches changed
significantly across the five weeks, F(4, 32) = 85.04, MSE = 7.2, p < .001.
Multivariate Data Format. The second data step has the data in multivariate format. There is
only one line of data for each subject: Subject number followed by outcome variable scores for each of
the five weeks.
Compare Week 2 with Week 3. The treatment started on the third week, so this would seem
to be an important contrast. The second ANOVA is a one-way within-subjects ANOVA using only the
Week 2 and Week 3 data.
proc anova; model week2 week3 = / nouni; repeated week 2 / nom;
The basic syntax for the model statement is this: On the left side list the variables and on the
right side list the groups (we have no groups). The nouni option stops SAS from reporting univariate
ANOVAs testing the null hypothesis that the population means for Week 2 and Week 3 are zero. The
repeated week 2 tells SAS that week is a repeated measures dimension with 2 levels. The nom
option stops SAS from reporting multivariate output.
Note that the F(1, 8) obtained is the 115.52 obtained earlier, by hand, using Keppel's method
(individual error terms).
proc means mean t prt; var d23 week1-week5;
In the data step I created a difference score, d23, coding the difference between Week 2 and
Week 3. The Means procedure provides a correlated t-test comparing Week 2 with Week 3 by testing
the null hypothesis that the appropriate difference-score has a mu of zero. Note that the square root of
the F just obtained equals this correlated t, 10.75. When doing pairwise comparisons Keppel's method
simplifies to a correlated t-test. I also obtained mean duration of headaches by week.
The easiest way to do pairwise comparisons for a within-subjects factor is to compute
difference-scores for each comparison and therefrom a correlated t for each comparison. If you
want to control familywise error rate (alpha), use the Bonferroni or the Sidak inequality to adjust
downwards your per-comparison alpha, or convert your ts into qs for procedures using the
Studentized range statistic, or square the ts to obtain Fs and use the Scheffe procedure to adjust the
critical F. The adjusted Scheffe critical F is simply (w-1) times the unadjusted critical F for the
within-subjects effect, where w is the number of levels of the within-subjects factor. If you want to do
Dunnett's test, just take the obtained correlated ts to Dunnett's table. Of course, all these methods
could also be applied to the ts computed with Howell's (pooled error) formula.
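Here is the difference-score shortcut applied to the Week 2 versus Week 3 comparison from Howell's data above; the square of the correlated t reproduces the individual-error-term F(1, 8):

```python
import math
import statistics

# Week 2 and Week 3 durations for the nine subjects (from the table above).
week2 = [22, 19, 15, 30, 27, 27, 16, 18, 24]
week3 = [8, 10, 5, 13, 13, 8, 5, 8, 14]
d = [a - b for a, b in zip(week2, week3)]   # difference scores
t = statistics.mean(d) / (statistics.stdev(d) / math.sqrt(len(d)))
print(round(t, 2), round(t ** 2, 2))        # 10.75 115.52
```

The 115.52 is exactly Keppel's F with the individual error term, confirming that the pairwise contrast reduces to a correlated t test.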
proc anova;model week1-week5= / nouni;repeated week 5 profile / summary printe;
The final ANOVA in the SAS program does the overall within-subjects ANOVA. It also does a
profile analysis, comparing each mean with the next mean, with individual error terms. Notice that
data from all five weeks are used in this analysis. The profile and summary options cause SAS to
contrast each week's mean with the mean of the following week and report the results in ANOVA tables.
The printe option provides a test of sphericity (and a bunch of other stuff to ignore).
Under Sphericity Tests, Orthogonal Components you find Mauchly's test of sphericity.
Significance of this test would indicate that the sphericity assumption has been violated. We have no
such problem with these data.
Under MANOVA test criteria no week effect are the results of the multivariate analysis.
Under Univariate Tests of Hypotheses is the univariate-approach analysis. Notice that we get the
same F etc. that we got with the earlier analysis with the data in univariate format.
SAS also gives us values of epsilon for both the Greenhouse-Geisser correction and the
Huynh-Feldt correction. These are corrections for violation of the assumption of sphericity. When
one of these has a value of 1 or more and Mauchly's test of sphericity is not significant we clearly do
not need to make any correction. The G-G correction is more conservative (less power) than the H-F
correction. If both the G-G and the H-F epsilons are near or above .75, it is probably best to use the H-F.
If we were going to apply the G-G or H-F correction, we would multiply both numerator and
denominator degrees of freedom by epsilon. SAS provides three p values, one with no adjustment,
one with the G-G adjustment, and one with the H-F adjustment. If we had applied the G-G adjustment
here, we could report the results like this: A one-way, repeated measures ANOVA was employed to
evaluate the change in duration of headaches across the five weeks. Degrees of freedom were
adjusted according to Greenhouse and Geisser to correct for any violation of the assumption of
sphericity. Duration of headaches changed significantly across the weeks, F(2.7, 21.9) = 85.04, MSE =
7.2, p < .001.
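A sketch of that df adjustment; the epsilon value here (.684) is inferred from the reported F(2.7, 21.9) rather than read from SAS output, so treat it as an assumption:

```python
# Multiply both df of the unadjusted F(4, 32) by epsilon.
epsilon = 0.684        # assumed; read the actual G-G epsilon from the SAS output
df_num = 4 * epsilon
df_den = 32 * epsilon
print(round(df_num, 1), round(df_den, 1))   # 2.7 21.9
```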
Under Analysis of Variance of Contrast Variables are the results of the profile analysis. Look
at CONTRAST VARIABLE: WEEK.2; this is the contrast between Week 2 and Week 3. For some
reason that escapes me, SAS reports contrast and error SS and MS that are each twice those obtained
when I do the contrasts by hand or with separate ANOVAs in SAS, but the F, df, and p are identical to
those produced by other means, so that is not a big deal.
For Week 2 vs Week 3 the F(1, 8) reported in the final analysis is 1444/12.5 = 115.52. When
we made this same contrast with a separate ANOVA the F was computed as 722/6.25 = 115.52.
Same F, same outcome, but doubled MS treatment and error.
If we were going to modify the contrast results to use a pooled error term, we would need to be
careful computing the contrast F. For Week 2 versus Week 3 the correct numerator is 722, not 1444,
to obtain a pooled F(1, 32) = 722/7.2 = 100.28. Do note that taking the square root of this F gives
10.01, within rounding error of the pooled-error t computed with Howells method.
Multivariate versus Univariate Approach
Notice that when the data are in the multivariate layout, SAS gives us both a multivariate
approach analysis (Manova Test Criteria) and a univariate approach analysis (Univariate Tests). The
multivariate approach has the distinct advantage of not requiring a sphericity assumption. With the
univariate approach one can adjust the degrees of freedom by multiplying them by epsilon, to
correct for violation of the sphericity assumption. We shall cover the multivariate approach analysis in
much greater detail later.
Omnibus Effect Size Estimates
We have partitioned the total sum of squares into three components: Weeks, subjects, and the
Weeks x Subjects interaction (error). We could compute eta-squared by dividing the sum of squares
for weeks by the total sum of squares. That would yield 2449.2/3166.3 = .774. An alternative is
partial eta-squared, in which the sum of squares for subjects is removed from the denominator. That
is,
partial eta-squared = SS_Conditions / (SS_Conditions + SS_Error) = 2449.2 / (2449.2 + 230.4) = .914.
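Both estimates follow directly from the sums of squares reported above (the subjects SS, 486.7, is obtained here by subtraction from the total):

```python
# Omnibus effect sizes for the within-subjects analysis of the migraine data.
ss_weeks, ss_subjects, ss_error = 2449.2, 486.7, 230.4
ss_total = ss_weeks + ss_subjects + ss_error        # 3166.3
eta_sq = ss_weeks / ss_total                        # classical eta-squared
partial_eta_sq = ss_weeks / (ss_weeks + ss_error)   # subjects SS removed
print(round(eta_sq, 3), round(partial_eta_sq, 3))   # 0.774 0.914
```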
Factorial ANOVA With One or More Within-Subjects Factors:
The Univariate Approach
AxBxS Two-Way Repeated Measures
CLASS A B S; MODEL Y=A|B|S;
TEST H=A E=AS;
TEST H=B E=BS;
TEST H=AB E=ABS;
MEANS A|B;
Ax(BxS) Mixed (B Repeated)
CLASS A B S; MODEL Y=A|B|S(A);
TEST H=A E=S(A);
TEST H=B AB E=BS(A);
MEANS A|B;
AxBx(CxS) Three-Way Mixed (C Repeated)
CLASS A B C S; MODEL Y=A|B|C|S(A B);
TEST H=A B AB E=S(A B);
TEST H=C AC BC ABC E=CS(A B);
MEANS A|B|C;
Ax(BxCxS) Mixed (B and C Repeated)
CLASS A B C S; MODEL Y=A|B|C|S(A);
TEST H=A E=S(A);
TEST H=B AB E=BS(A);
TEST H=C AC E=CS(A);
TEST H=BC ABC E=BCS(A);
MEANS A|B|C;
AxBxCxS All Within
CLASS A B C S; MODEL Y=A|B|C|S;
TEST H=A E=AS;
TEST H=B E=BS;
TEST H=C E=CS;
TEST H=AB E=ABS;
TEST H=AC E=ACS;
TEST H=BC E=BCS;
TEST H=ABC E=ABCS;
MEANS A|B|C;
Higher-Order Mixed or Repeated Model
Expand as needed, extrapolating from the above. Here is a general rule for finding the error term
for an effect: If the effect contains only between-subjects factors, the error term is
Subjects(nested within one or more factors). For any effect that includes one or more within-
subjects factors the error term is the interaction between Subjects and those one or more
within-subjects factors.
Copyright 2008, Karl L. Wuensch - All rights reserved.
MAN_RM1.doc
The Multivariate Approach to the One-Way Repeated Measures ANOVA
Analyses of variance which have one or more repeated measures/within
subjects factors have a SPHERICITY ASSUMPTION (the standard error of the
difference between pairs of means is constant across all pairs of means at one level of
the repeated factor versus another level of the repeated factor). Howell discusses
compound symmetry, a somewhat more restrictive assumption. There are
adjustments (of degrees of freedom) to correct for violation of the sphericity
assumption, but at a cost of lower power. A better solution might be a multivariate
approach to repeated measures designs, which does not have such a sphericity
assumption.
Consider the first experiment in Karl Wuensch's doctoral dissertation (see the
article, Fostering house mice onto rats and deer mice: Effects on response to species odors, Animal
Learning and Behavior, 20, 253-258). Wild-strain house mice were at birth
cross-fostered onto house-mouse (Mus), deer mouse (Peromyscus) or rat (Rattus)
nursing mothers. Ten days after weaning each subject was tested in an apparatus
that allowed it to enter tunnels scented with clean pine shavings or with shavings
bearing the scent of Mus, Peromyscus, or Rattus. One of the variables measured was
how long each subject spent in each of the four tunnels during a twenty minute test.
The data are in the file TUNNEL4b.DAT and a program to do the analysis in
MAN_RM1.SAS, both available on my web pages. Run the program. Time spent in
each tunnel is coded in variables T_clean, T_Mus, T_Pero, and T_Rat. TT_clean,
TT_Mus, TT_Pero, and TT_Rat are these same variables after a square root
transformation to normalize the within-cell distributions and to reduce heterogeneity of
variance.
proc anova; model TT_clean TT_mus TT_pero TT_rat = / nouni;
repeated scent 4 contrast(1) / summary printe;
proc means; var T_clean -- T_Rat;
Note that PROC ANOVA includes no CLASS statement and the MODEL
statement includes no grouping variable (since we have no between subjects factor).
The model statement does identify the multiple dependent variables, TT_clean,
TT_Mus, TT_Pero, and TT_Rat, and includes the NOUNI option to suppress irrelevant
output. The REPEATED statement indicates that we want a repeated measures
analysis, with SCENT being the name we give to the 4-level repeated factor
represented by the four transformed time variates. CONTRAST(1) indicates that
these four variates are to be transformed into three sets of difference scores, each
representing the difference between the subject's score on the 1st variate (tt_clean)
and one of the other variates; that is, clean versus Mus, clean versus Peromyscus,
and clean versus Rattus. I chose clean as the comparison variable for all others.
The noncentrality parameter is λ = n·m·f²/(1 - ρ), but G*Power is set up for us to
enter as Effect size f² the quantity m·f²/(1 - ρ) = 3(.01)/(1 - .79) = .143.
Boot up G*Power. Click Tests, Other F-Tests. Enter Effect size f² = 0.143,
Alpha = 0.05, N = 64, Numerator df = 2, and Denominator df = 128. Click Calculate.
G*Power shows that power = .7677.
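The conversion can be wrapped in a small helper; here m is the number of levels of the repeated factor, f2 the conventional Cohen effect size f², rho the mean correlation among the repeated measures, and epsilon the sphericity correction (1 when none is applied). The function name is mine, not G*Power's:

```python
def gpower_f2(m, f2, rho, epsilon=1.0):
    """Effect size f^2 as this old G*Power routine expects it."""
    return epsilon * m * f2 / (1 - rho)

print(round(gpower_f2(3, 0.01, 0.79), 3))                  # 0.143
# With an epsilon correction of .5 and rho = .45 (values used below):
print(round(gpower_f2(3, 0.0625, 0.45, epsilon=0.5), 2))   # 0.17
```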
Note that I have used, as my estimate of ρ, the mean of the three values observed
by Sheri. This may, or may not, be reasonable. Uncorrected, her numerator df = 2 and
her denominator df = 72. Corrected with epsilon, her Effect size f² = .0275, numerator
df = 1, and denominator df = 36. I enter these values into G*Power and obtain power
= .1625. Sheri needs more data, or needs to hope for a larger effect size. If she
assumes a medium sized effect, then her Effect size f² = ε·m·f²/(1 - ρ) =
.5(3)(.0625)/(1 - .45) = .17 and power jumps to .67.
The big problem here is the small value of ρ in Sheri's data; she is going to
need more data to get good power. With typical repeated measures data, ρ is larger,
and we can do well with relatively small sample sizes.
Multivariate Approach
The multivariate approach analysis does not require sphericity, and, when
sphericity is lacking, is usually more powerful than is the univariate analysis with
Greenhouse-Geisser or Huynh-Feldt correction. Refer to the G*Power online
instructions, Other F-Tests, Repeated Measures, Multivariate approach.
Since there are three groups, the numerator df = 2. The denominator df = n - p + 1,
where n is the number of cases and p is the number of dependent variables in the
MANOVA (one less than the number of levels of the repeated factor). For Sheri,
denominator df = 36 - 2 + 1 = 35.
For a small effect size, we need Effect size f² = 3(.01)/(1 - .45) = .055. As you
can see, G*Power tells me power is .2083, a little better than it was with the univariate
test corrected for lack of sphericity.
So, how many cases would Sheri need to raise her power to .80? This G*Power
routine will not solve for N directly, so you need to guess until you get it right. On each
guess you need to change the input values of N and denominator df. After a few
guesses, I found that Sheri needs 178 cases to get 80% power to detect a small effect.
A Simpler Approach
Ultimately, in most cases, one's primary interest is going to be focused on
comparisons between pairs of means. Why not just find the number of cases necessary
to have the desired power for those comparisons? With repeated measures designs I
generally avoid using a pooled error term for such comparisons. In other words, I use
simple correlated t tests for each such comparison. How many cases would Sheri
need to have an 80% chance of detecting a small effect, d = .2?
First we adjust the value of d to take into account the value of ρ. I shall use her
weakest link, the correlation of .27: d_Diff = d / SQRT(2(1 - ρ12)) =
.2 / SQRT(2(1 - .27)) = .166. Notice that the value of d went down after adjustment.
Usually ρ will exceed .5 and the adjusted d will be greater than the unadjusted d.
The approximate sample size needed is n = (2.8 / d_Diff)² = (2.8/.166)² = 285. I
checked this with G*Power. Click Tests, Other T tests. For Effect size f enter .166.
N = 285 and df = n - 1 = 284. Select two-tailed. Click Calculate. G*Power confirms
that power = 80%.
When N is small, G*Power will show that you need a larger N than indicated by
the approximation. Just feed values of N and df to G*Power until you find the N that
gives you the desired power.
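The two-step approximation just described is easy to script; note that using the rounded d_Diff of .166, as the text does, gives the reported N of 285:

```python
import math

# Adjust d for the correlation between the paired scores, then approximate the
# N needed for 80% power (two-tailed, alpha = .05) with the 2.8 rule of thumb.
d, rho = 0.2, 0.27
d_diff = d / math.sqrt(2 * (1 - rho))          # about .166
n = (2.8 / round(d_diff, 3)) ** 2              # about 284.5; round up
print(round(d_diff, 3), math.ceil(n))          # 0.166 285
```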
Copyright 2005, Karl L. Wuensch - All rights reserved.
Two x Two Within-Subject ANOVA Interaction = Correlated t on Difference Scores
Petra Schweinhardt at McGill University is planning research involving a 2 x 2
within-subjects ANOVA. Each case has four measurements, PostX1, PreX1, PostX2
and PreX2. X1 and X2 are different experimental treatments, Pre is the dependent
variable measured prior to administration of X, and Post is the dependent variable
following administration of X. Order effects are controlled by counterbalancing.
Petra wants to determine how many cases she needs to have adequate power to
detect the Time (Post versus Pre) by X (1 versus 2) interaction. She is using G*Power
3.1, and it is not obvious to her (or to me) how to do this. She suggested ANOVA,
Repeated Measures, within factors, but I think some tweaking would be necessary.
My first thought is that the interaction term in such a 2 x 2 ANOVA might be
equivalent to a t test between difference scores (I know for sure this is the case with
independent samples). To test this hunch, I contrived this data set:
Diff1 and Diff2 are Post-Pre difference scores for X1 and X2. Next I conducted
the usual 2 x 2 within-subjects ANOVA with these data:
COMPUTE Diff1=PostX1-PreX1.
EXECUTE.
COMPUTE Diff2=PostX2-PreX2.
EXECUTE.
GLM PostX1 PostX2 PreX1 PreX2
/WSFACTOR=Time 2 Polynomial X 2 Polynomial
/METHOD=SSTYPE(3)
/CRITERIA=ALPHA(.05)
/WSDESIGN=Time X Time*X.
Source Type III Sum of Squares df Mean Square F Sig.
Time 45.375 1 45.375 24.200 .004
Error(Time) 9.375 5 1.875
X 5.042 1 5.042 14.756 .012
Error(X) 1.708 5 .342
Time * X 1.042 1 1.042 1.404 .289
Error(Time*X) 3.708 5 .742
Next I conducted a correlated t test comparing the difference scores.
T-TEST PAIRS=Diff1 WITH Diff2 (PAIRED)
/CRITERIA=CI(.9500)
/MISSING=ANALYSIS.
Paired Samples Correlations
N Correlation Sig.
Pair 1 Diff1 & Diff2 6 .650 .163
Paired Samples Test
Paired Differences
                          Mean    Std. Deviation   Std. Error Mean
Pair 1  Diff1 - Diff2   -.83333          1.72240            .70317
95% Confidence Interval of the Difference: Lower = -2.64088, Upper = .97422
t = -1.185, df = 5, Sig. (2-tailed) = .289
As you can see, the correlated t test on the difference scores is absolutely
equivalent to the interaction test in the ANOVA. The square of the t ((-1.185)² = 1.404)
is equal to the interaction F and the p values are identical.
Having established this equivalence, my suggestion is that the required sample
size be determined as if one were simply doing a correlated t test. There are all sorts of
issues involving how to define effect sizes for within-subjects effects, but I shall not
address those here.
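The equivalence holds for any data set, as this sketch shows on hypothetical scores (Petra's contrived data are not reproduced above):

```python
import math
import random
import statistics

# For a 2 x 2 within-subjects design, the interaction contrast for subject i is
# L_i = (PostX1_i - PreX1_i) - (PostX2_i - PreX2_i), i.e., Diff1_i - Diff2_i.
random.seed(1)
n = 6
pre_x1 = [random.gauss(10, 2) for _ in range(n)]
post_x1 = [x + random.gauss(3, 1) for x in pre_x1]
pre_x2 = [random.gauss(10, 2) for _ in range(n)]
post_x2 = [x + random.gauss(2, 1) for x in pre_x2]
L = [(a - b) - (c - d) for a, b, c, d in zip(post_x1, pre_x1, post_x2, pre_x2)]

# Interaction F from the contrast (coefficients +1, -1, -1, +1; sum of c^2 = 4):
ss_int = sum(L) ** 2 / (4 * n)
ss_err = (sum(x * x for x in L) - sum(L) ** 2 / n) / 4
F = ss_int / (ss_err / (n - 1))

# Correlated t comparing the two sets of difference scores:
t = statistics.mean(L) / (statistics.stdev(L) / math.sqrt(n))
assert math.isclose(t ** 2, F)   # the squared t equals the interaction F
```

Whatever data are fed in, t² and the interaction F agree, which is why the sample-size question can be answered with the simpler paired-t machinery.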
G*Power shows me that Petra would need 54 cases to have a 95% chance of
detecting a medium-sized effect using the usual 5% criterion of statistical significance.
t tests - Means: Difference between two dependent means (matched pairs)
Analysis: A priori: Compute required sample size
Input: Tail(s) = Two
Effect size dz = 0.5
α err prob = 0.05
Power (1-β err prob) = 0.95
Output: Noncentrality parameter = 3.6742346
Critical t = 2.0057460
Df = 53
Total sample size = 54
Actual power = 0.9502120
We should be able to get this same result using the ANOVA, Repeated
Measures, within factors analysis in G*Power, as Petra suggested, and, in fact, we do:
F tests - ANOVA: Repeated measures, within factors
Analysis: A priori: Compute required sample size
Input: Effect size f = 0.25
α err prob = 0.05
Power (1-β err prob) = 0.95
Number of groups = 1
Repetitions = 2
Corr among rep measures = 0.5
Nonsphericity correction = 1
Output: Noncentrality parameter = 13.5000000
Critical F = 4.0230170
Numerator df = 1.0000000
Denominator df = 53.0000000
Total sample size = 54
Actual power = 0.9502120
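The agreement is no accident: the two noncentrality parameters describe the same test. A quick check (using λ = N·m·f²/(1 - ρ) for the one-group repeated measures F and δ = dz·√N for the paired t; δ² = λ):

```python
import math

N, dz = 54, 0.5
f, m, rho = 0.25, 2, 0.5
delta = dz * math.sqrt(N)            # paired-t noncentrality, 3.6742...
lam = N * m * f ** 2 / (1 - rho)     # repeated measures F noncentrality, 13.5
assert math.isclose(delta ** 2, lam)
```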
Karl L. Wuensch, Dept. of Psychology, East Carolina University, Greenville, NC.
August, 2009
Return to Wuensch's Stats Lessons Page
MAN_1W1B.doc
The A X (B X S) ANOVA: A Multivariate Approach
The A X (B X S) mixed ANOVA, where factor A is between/among subjects and factor
B is a repeated measures/within subjects factor, has a sphericity assumption (the same
assumption we discussed earlier when studying one-way repeated measures ANOVA). Our
example of such a design will be the first experiment in my dissertation, the same example we
used for the one-way analysis, but this time we shall not ignore the between subjects Nursing
groups variable. Run the program MAN_1W1B.sas from my SAS programs page. Variable
NURS is the Nursing Group variable, identifying the species of the subject's foster mother:
Mus, Peromyscus, or Rattus.
Note that nurs is identified in PROC ANOVA's CLASS statement (nurs is a
classification, categorical variable) and in the MODEL statement (nurs is a between subjects
independent variable). The MEANS statement is used to obtain means on each of the four
variates and to do LSD comparisons between nursing groups on each variate. Please note
that we had equal cell sizes in our data set. If we had unequal cell sizes, we would employ
PROC GLM, Type III sums of squares, instead of PROC ANOVA.
Simple Effects of the Between-Subjects Factor at Each Level of the Within-
Subjects Factor. Since I did not employ the nouni keyword, the first output given is the
simple effects of nurs at each level of Scent. These analyses indicate that the nursing groups
do not differ significantly from one another on time spent in the clean tunnel, F(2, 33) = 0.13, p
= .88, the Mus-scented tunnel, F(2, 33) = 1.39, p = .26, or the Peromyscus-scented tunnel F(2,
33) = 1.2, p = .31, but do on the Rattus-scented tunnel, F(2, 33) = 12.86, p < .0001. The LSD
comparisons, later in the output, show that the rat-reared mice spent significantly more time in
the rat-scented tunnel than did the other two groups of mice, which did not differ significantly
from each other. Do note that SAS has used individual error terms, one for each level of
Scent. In his chapter on repeated measures ANOVA, Howell explains how you could use a
possibly more powerful pooled error term instead.
Mauchly's criterion (W = .4297) indicates we have a serious lack of sphericity, so if we
were to take the univariate approach analysis, we would need to adjust the degrees of
freedom for both effects that involve the within-subjects factor, scent. If you compare this
analysis with the one-way analysis we previously did you will see that the univariate SS for
scent remains unchanged, but the error SS is reduced, due to the Scent x Nurs interaction being
removed from the error term.
The multivariate approach, which does not require sphericity, gives us significant
effects for both repeated effects. See Manova Test Criteria and Exact F Statistics for the
Hypothesis of no scent Effect. This tests the null hypothesis that the profile is flat when
collapsed across the groups; that is, a plot with mean time on the ordinate and scent of tunnel
We have already covered the one-way repeated measures design and the A x (B
x S) design. I shall not present computation of the (A x B x S) totally within-subjects
two-way design, since it is a simplification of the (A x B x C x S) design that I shall
address. If you need to do an (A x B x S) ANOVA, just drop the Factor C from the (A x
B x C x S) design.
A X B X (C X S) ANOVA
In this design only one factor, C, is crossed with subjects (is a within-subjects
factor), while the other two factors, A and B, are between-subjects factors.
Howell (page 495 of the 5th edition of Statistical Methods for Psychology)
presented a set of data with two between-subjects factors (Gender and Group) and one
within-subjects factor (Time). One group of adolescents attended a behavioral skills
training (BST) program designed to teach them how to avoid HIV infection. The other
group attended a traditional educational program (EC). The dependent variable which
we shall analyze is a measure of the frequency with which the participants used
condoms during intercourse. This variable is measured at four times: Prior to the
treatment, immediately after completion of the program, six months after completion of
the program, and 12 months after completion of the program.
SAS
Obtain the data file, MAN_1W2B.dat, from my StatData page and the program
file, MAN_1W2B.SAS, from my SAS programs page. Note that the first variable is
Gender, then Group number, then dependent variable scores at each level (pretest,
posttest, 6 month follow-up, and 12 month follow-up) of the within-subjects factor, Time.
Group|Gender factorially combines the two between-subjects factors, and Time 4
indicates that the variables Pretest, Posttest, FU6, and FU12 represent the 4-level
within-subjects factor, Time.
By not specifying NOUNI I had SAS compute Group|Gender univariate
ANOVAs on Pretest, Posttest, FU6, and FU12. These provide the simple interaction
tests of Group x Gender at each level of Time that we might use to follow up a significant
triple interaction, but our triple interaction is not significant. However, our Time x Group
interaction is significant, so we can use these univariate ANOVAs for simple main
effects tests of Group at each level of Time. Note that the groups differ significantly
only at the time of the 6 month follow-up, when the BST participants used condoms
more frequently (M = 18.8) than did the EC participants (M = 8.6).
An analysis may be doubly multivariate in at least two different ways. First, a
set of noncommensurate (not measured on the same scale) dependent variables
may be administered at two or more different times. For example, I measure
subjects' blood pressure, heart rate, cholesterol level, and percent body fat. Subjects
participate in a month-long cardiac fitness program or in some placebo activity. I
measure the four dependent variables just before the program starts, just after it ends,
a month after it ends, and a year after it ends. I have a Group x Time mixed design
with multiple dependent variables. I take the multivariate approach with respect to the
Time variable (to avoid the sphericity assumption), and I have multiple dependent
variables, so I have a doubly multivariate design. For each effect (Group x Time, Time,
Group) I obtain a test on an optimally weighted combination of the four dependent
variables (weighted to maximize that effect). If (and only if) the test from the doubly
multivariate analysis (which simultaneously analyzes all four dependent variables) is
significant, I then conduct tests of that effect on each dependent variable (one at a
time). This procedure may provide some protection against inflation of alpha with
multiple dependent variables. Suppose that only the time effect was significant. I
would then conduct singly multivariate analyses on each dependent variable, ignoring
the group and Group x Time tests in those analyses.
A second sort of doubly multivariate design exists when two or more
noncommensurate sets of commensurate dependent measures are obtained at
one time. Experiment 1 of Karl's dissertation, which we used as an example for the
multivariate approach to the (A x S) and the A x (B x S) ANOVAs, will serve as an
example. In addition to measuring how much time each subject spent in each of four
differently scented tunnels (one set of commensurate variates), I measured how many
visits each subject made to each tunnel (a second set) and each subject's latency to
first entry of each tunnel (a third set).
Obtain from my SPSS Data Page TUNNEL4b.sav. Bring it into SPSS. Copy the
following syntax to the syntax editor and run it:
manova v_clean to v_rat t_clean to t_rat L_clean to L_rat
by nurs(1,3) / wsfactors = scent(4) /
contrast(scent)=helmert /
measure = visits time latency /
print=transform signif(univ hypoth) error(sscp) / design .
The order of the variates in the MANOVA statement must be: the variate
representing the first dependent variable at level 1 of the within-subjects factor; the
variate representing the first dependent variable at level 2 of the within-subjects factor;
. . . the variate for the first dv at the last level of the within factor; the variate for the
second dv at level 1 of the within factor; and so on.
When one wishes to determine whether two or more groups differ significantly on
one or more optimally weighted linear combinations (canonical variates or discriminant
functions) of two or more normally distributed dependent variables, a one-way multivariate
analysis of variance (MANOVA) is appropriate. We have already studied Discriminant Function
Analysis, which is mathematically equivalent to a one-way MANOVA, so we are just
shifting perspective here.
A Manipulation Check for Michelle Plaster's Thesis
Download Plaster.dat from my StatData page and MANOVA1.sas from my SAS
Programs page. Run the program. The data are from Michelle Plaster's thesis, which
you may wish to look over (in Joyner or in the Psychology Department). The analyses
she did are very similar to, but not identical to, those we shall do for instructional
purposes. Male participants were shown a picture of one of three young women. Pilot
work had indicated that one woman was beautiful, another of average physical
attractiveness, and the third unattractive. Participants rated the woman they saw on
each of twelve attributes. Those ratings are our dependent variables. To simplify the
analysis, we shall deal with only the ratings of physical attractiveness (PHYATTR),
happiness (HAPPY), INDEPENdence, and SOPHISTication. The purpose of the
research was to investigate the effect of the defendant's physical attractiveness (and
some other variables which we shall ignore here) upon the sentence recommended in a
simulated jury trial. The MANOVA on the ratings served as a check on the
effectiveness of our manipulation of the physical attractiveness (PA) independent
variable.
Screening the Data
The overall means and standard deviations on the dependent variables were
obtained so that I could standardize them and then compute scores on the canonical
variates. Although you cannot see it in this output, there is heterogeneity of variance
on the physical attractiveness variable. The Fmax is 4.62. With n's approximately
equal, I'm not going to worry about it. If Fmax were larger and sample sizes less
nearly equal, I
might try data transformations to stabilize the variance or I might randomly discard
scores to produce equal n's and thus greater robustness.
One should look at the distributions of the dependent variables within each cell.
I have done so, (using SAS), but have not provided you with the statistical output. Such
output tends to fill many pages, and I generally do not print it, I just take notes from the
screen. Tests of significance employed in MANOVA do assume that the dependent
variables and linear combinations of the dependent variables are normal within each
cell. With large sample sizes (or small samples with approximately equal n's), the tests
are fairly robust to violations of this normality assumption. If λ is the eigenvalue for a
root, then θ = λ/(1 + λ), and Wilks' lambda can be computed from the thetas. For our
data, for the first root, θ = 1.767/2.767 = 0.6386, and, for the second root, θ =
.168/1.168 = 0.1438. To obtain Wilks' lambda, subtract each theta from one and then
find the product of these differences: Λ = (1 - .6386)(1 - .1438) = .309.
Pillai's Trace is the sum of all the thetas, .6386 + .1438 = .782.
Roy's greatest characteristic root is simply the eigenvalue for the first root.
Roy's gcr should be the most powerful test when the first root is much larger than the
other roots.
Each of these statistics' significance levels is approximated using F.
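The arithmetic above can be sketched in a few lines of Python. The two eigenvalues (1.767 and .168) are the roots reported in the text; everything else follows from them.

```python
# Computing the MANOVA test statistics from the eigenvalues of the roots.
# The two eigenvalues below are the roots reported in the text above.
eigenvalues = [1.767, 0.168]

# theta = lambda / (1 + lambda) for each root
thetas = [lam / (1 + lam) for lam in eigenvalues]

# Wilks' lambda: the product of (1 - theta) across the roots
wilks = 1.0
for theta in thetas:
    wilks *= (1 - theta)

# Pillai's trace: the sum of the thetas
pillai = sum(thetas)

# Roy's greatest characteristic root: the eigenvalue for the first root
roy = max(eigenvalues)

print(round(thetas[0], 4), round(thetas[1], 4))  # 0.6386 0.1438
print(round(wilks, 3))                           # 0.309
print(round(pillai, 3))                          # 0.782
```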
Univariate ANOVAs on the Canonical Variates and the Original Variables
Now, look back at the data step of our program. I used the total-sample means
and standard deviations to standardize the variables into Z_scores. I then computed,
for each subject, canonical variate scores, using the total-sample standardized
canonical correlation coefficients. CV1 is canonical variate 1, CV2 is canonical variate
2. I computed these canonical variate scores so I could use them as outcome variables
in univariate analyses of variance. I used PROC ANOVA to do univariate ANOVAs with
Fisher's LSD contrasts on these two canonical variates and also, for the benefit of
those who just cannot understand canonical variates, on each of the observed
variables.
Please note that for each canonical variate:
- If you take the SS among groups and divide by the SS within groups, you get the
eigenvalue reported earlier for that root.
- If you take the ratio of SS among groups to SS total, you get the root's squared
canonical correlation.
- The group means are the group centroids that would be reported with a
discriminant function analysis.
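The first two of those identities are easy to check numerically. The canonical variate scores below are made up for illustration; only the algebraic relationships matter.

```python
# A numerical check, with hypothetical canonical variate scores for three
# groups, of the identities: eigenvalue = SS-among / SS-within, and
# squared canonical correlation = SS-among / SS-total.
groups = {
    "beautiful":    [2.1, 1.8, 2.4, 1.9],
    "average":      [0.2, -0.1, 0.3, 0.0],
    "unattractive": [-1.9, -2.2, -1.8, -2.1],
}

all_scores = [s for g in groups.values() for s in g]
grand_mean = sum(all_scores) / len(all_scores)

# SS among groups: n_j * (group mean - grand mean)^2, summed over groups
ss_among = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
               for g in groups.values())
# SS within groups: squared deviations of scores from their own group mean
ss_within = sum((s - sum(g) / len(g)) ** 2
                for g in groups.values() for s in g)
# SS total: squared deviations from the grand mean
ss_total = sum((s - grand_mean) ** 2 for s in all_scores)

eigenvalue = ss_among / ss_within       # the root's eigenvalue
r2_canonical = ss_among / ss_total      # the root's squared canonical correlation

# The two are related by r2 = eigenvalue / (1 + eigenvalue)
print(round(eigenvalue, 2), round(r2_canonical, 4))
```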
I have never seen anyone else compute canonical scores like this and then do
pairwise comparisons on them, but it seems sensible to me, and such analysis has
been accepted in manuscripts I have published. As an example of this procedure, see
my summary of an article on Mock Jurors' Insanity Defense Verdict Selections.
Note that on the "beauty" canonical variate, the beautiful group's mean is
significantly higher than the other two means, which are not significantly different from
one another. On the happily independent variate, the average group is significantly
higher than the other two groups.
The univariate ANOVAs reveal that our manipulation significantly affected every
one of the ratings variables, with the effect on the physical attractiveness ratings being
very large (η² = .63). Compared to the other groups, the beautiful woman was rated
significantly more attractive and sophisticated, and the unattractive woman was rated
significantly less independent and happy.
Unsophisticated Use of MANOVA
Unsophisticated users of MANOVA usually rely on the Univariate F-tests to try
to interpret a significant MANOVA. These are also the same users who believe that the
purpose of a MANOVA is to guard against inflation of alpha across several DVs.
They promise themselves that they will not even look at the univariate ANOVAs for any
effect which is not significant in the MANOVA. The logic is essentially the same as that
in Fisher's LSD protected test for making pairwise comparisons between means
following a significant effect in ANOVA (this procedure has a lousy reputation -- not
conservative enough -- these days, but is actually a good procedure when only three
means are being compared; the procedure can also be generalized to other sorts of
analyses, most appropriately those involving effects with only two degrees of freedom).
In fact, this "protection" is afforded only if the overall null hypothesis is true (none of the
outcome variables is associated with the grouping variable), not if some outcome
variables are associated with the grouping variable and others are not. If one outcome
variable is very strongly associated with the grouping variable and another very weakly
(for practical purposes, zero effect), the probability of finding the very weak effect to be
"significant" is unacceptably high.
Were we so unsophisticated as to take such a "MANOVA-protected univariate
tests" approach, we would note that the MANOVA was significant, that univariate tests
on physically attractive, happy, independent, and sophisticated were significant, and
then we might do some pairwise comparisons between groups on each of these
"significant" dependent variables. I have included such univariate tests in my second
invocation of PROC ANOVA. Persons who are thinking of using MANOVA because
they have an obsession about inflated familywise alpha and they think MANOVA
somehow protects against this should consider another approach -- application of a
Bonferroni adjustment to multiple univariate ANOVAs. That is, dispense with the
MANOVA, do the several univariate ANOVAs, and evaluate each ANOVA using an
adjusted criterion of significance equal to the familywise alpha you can live with divided
by the number of tests in the family. For example, if I am willing to accept only a .05
probability of rejecting one or more of four true null hypotheses (four dependent
variables, four univariate ANOVAs), I use an adjusted criterion of .05/4 = .0125. For
each of the ANOVAs I declare the effect to be significant only if its p ≤ .0125. This will,
of course, increase the probability of making a Type II error (which is already the most
likely sort of error), so I am not fond of making the Bonferroni adjustment of the per-
comparison criterion of statistical significance. Read more on this in my document
Familywise Alpha and the Boogey Men.
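The Bonferroni decision rule just described amounts to very little arithmetic. The p values below are hypothetical, chosen only to illustrate the rule.

```python
# Bonferroni-adjusted univariate ANOVAs in place of a MANOVA.
# The p values are hypothetical, just to illustrate the decision rule.
familywise_alpha = 0.05
p_values = {"phyattr": 0.0001, "happy": 0.020, "indepen": 0.004, "sophist": 0.030}

# Adjusted criterion: familywise alpha divided by the number of tests
adjusted_criterion = familywise_alpha / len(p_values)   # .05 / 4 = .0125

for dv, p in p_values.items():
    verdict = "significant" if p <= adjusted_criterion else "not significant"
    print(f"{dv}: p = {p}, {verdict} at the adjusted criterion {adjusted_criterion}")
```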
While I have somewhat disparaged the protected test use of MANOVA, I
must confess that I sometimes employ it, especially in complex factorial designs where,
frankly, interpreting the canonical variates for higher order effects is a bit too
challenging for me. With a complex factorial analysis (usually from one of my
colleagues, as I usually have too much sense to embark on such ambitious projects) I
may simply note which of the effects are significant in the MANOVA and then use
univariate analyses to investigate those (and only those) effects in each of the
dependent variables. Aside from any protection against inflating familywise alpha, the
main advantage of this approach is that it may impress some reviewers of your
manuscript, multivariate analyses being popular these days.
The unsophisticated approach just described ignores the correlations among the
outcome variables and ignores the dimensions (canonical variates, discriminant
functions) upon which MANOVA found the groups to differ. This unsophisticated user
may miss what is really going on -- it is quite possible for none of the univariate tests to
be significant, but the MANOVA to be significant.
Relationship between MANOVA and DFA
I also conducted a discriminant function analysis on these data just to show you
the equivalence of MANOVA and DFA. Please note that the eigenvalues, canonical
correlations, loadings, and canonical coefficients are identical to those obtained with the
MANOVA.
SPSS MANOVA
SPSS has two routines for doing multivariate analysis of variance, the GLM routine
and the MANOVA routine. Let us first consider the GLM routine. Go to my SPSS Data
Page and download the SPSS data file, Plaster.sav. Bring it into SPSS and click
Analyze, General Linear Model, Multivariate. Move our outcome variables (phyattr,
happy, indepen, and sophist) into the dependent variables box and the grouping
variable (PA, manipulated physical attractiveness) into the fixed factor box.
Under Options, select descriptive statistics and homogeneity tests. Now, look at
the output. Note that you do get Box's M, which is not available in SAS. You also get
the tests of the significance of all roots simultaneously tested and Roy's test of only the
first root, as in SAS. Levene's test is used to test the significance of group differences
in variance on the original variables. Univariate ANOVAs are presented, as are basic
descriptive statistics. Missing, regrettably, are any canonical statistics, and these are not
available even if you select every optional statistic available with this routine. What a
bummer. Apparently the folks at SPSS have decided that people who can only point
and click will never do a truly multivariate analysis of variance, that they will only use what I
have called the unsophisticated approach, so they have made the canonical statistics
unavailable.
Fortunately, the canonical statistics are available from the SPSS MANOVA
routine. While this routine was once available on a point-and-click basis, it is now
available only by syntax -- that is, you must enter the program statements in plain text,
just like you would in SAS, and then submit those statements via the SPSS Syntax
Editor. Point your browser to my SPSS programs page and download the file
MANOVA1.sps to your hard drive or diskette. From the Windows Explorer or My
Computer, double click on the file. The SPSS Syntax Editor opens with the program
displayed. Edit the syntax file so that it points to the correct location of the Plaster.sav
file and then click on Run, All. The output will appear.
I did not ask SPSS to print the variance/covariance matrices for each cell, but I
did get the determinants of these matrices, which are used in Box's M. You may think
of the determinants as being indices of the generalized variance within a
variance/covariance matrix. For each cell, for our 4 dependent variable design, that
matrix is a 4 x 4 matrix with variances on the main diagonal and covariances (between
each pair of DVs) off the main diagonal. Box's M tests the null hypothesis that the
variance/covariance matrices in the population are identical across cells. If this null
hypothesis is false, the pooled variance/covariance matrix used by SPSS is
inappropriate. Box's M is notoriously powerful, and one generally doesn't worry unless
p < .001 and sample sizes are unequal. Using Pillai's trace (rather than Wilks' lambda) may
improve the robustness of the test in these circumstances. One can always randomly
discard scores to produce equal n's. Since our n's are nearly equal, I'll just use Pillai's
trace and not discard any scores.
Look at the pooled within cells (thus eliminating any influence of the grouping
variable) correlation matrix; it is now referred to as the WITHIN+RESIDUAL correlation
matrix. The dependent variables are generally correlated with one another. Bartlett's
test of sphericity tests the null hypothesis that in the population the correlation matrix
for the outcome variables is an identity matrix (each r_ij = 0). That is clearly not the case
with our variables. Bartlett's test is based on the determinant of the within-cells
correlation matrix. If the determinant is small, the null hypothesis is rejected. If the
determinant is very small (zero to several decimal places), then at least one of the
variables is nearly a perfect linear combination of the other variables. This creates a
problem known as multicollinearity. With multicollinearity your solution is suspect --
another random sample from the same population would be likely to produce quite
different results. When this problem arises, I recommend that you delete one or more
of the variables. If one of your variables is a perfect linear combination of the others
(for example, were you to include as variables SAT-Total, SAT-Math, and SAT-Verbal), the
analysis would crash, due to the singularity of a matrix which needs to be inverted as
part of the solution. If you have a multicollinearity problem but just must retain all of the
outcome variables, you can replace those variables with principal component scores.
See Principal Components Discriminant Function Analysis for an example.
For our data, SPSS tells us that the determinant of the within cells correlation
matrix has a log of -.37725. Using my calculator, e^(-.37725) = .686 -- that is,
the determinant is about .686, not near zero, so apparently no problem with multicollinearity.
Another way to check on multicollinearity is to compute the R² between each variable
and all the others (or 1 - R², the tolerance). This will help you identify which variable you
might need to delete to avoid the multicollinearity problem. I used SAS PROC
FACTOR to get the R²s, which are:
Prior Communality Estimates: SMC
PHYATTR HAPPY INDEPEN SOPHIST
0.136803 0.270761 0.133796 0.289598
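Both diagnostics can be sketched with NumPy on a hypothetical 4 x 4 within-cells correlation matrix (the values below are invented; the point is the computation, which uses the standard identity SMC_i = 1 - 1/(R⁻¹)_ii).

```python
import numpy as np

# Multicollinearity diagnostics on a hypothetical within-cells correlation
# matrix: its determinant, and each variable's squared multiple correlation
# (SMC) with the other variables, obtained from the inverse of the matrix.
R = np.array([
    [1.00, 0.30, 0.25, 0.20],
    [0.30, 1.00, 0.35, 0.40],
    [0.25, 0.35, 1.00, 0.30],
    [0.20, 0.40, 0.30, 1.00],
])

det = np.linalg.det(R)   # a determinant near zero would signal multicollinearity

# SMC_i = 1 - 1 / (R^-1)_ii ; tolerance is 1 - SMC
smc = 1 - 1 / np.diag(np.linalg.inv(R))
tolerance = 1 - smc

print(round(det, 3))
print(np.round(smc, 3))
```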
Note that the MANOVA routine has given us all of the canonical statistics that we
are likely to need to interpret our multivariate results. If you compare the canonical
coefficients and loadings to those obtained with SAS, you will find that each SPSS
coefficient equals minus one times the SAS coefficient. While SAS constructed
canonical variates measuring physical attractiveness (CV1) and
happiness/independence (CV2), SPSS constructed canonical variates measuring
physical unattractiveness and unhappiness/dependence. The standardized coefficients
presented by SPSS are computed using within group statistics.
Copyright 2008 Karl L. Wuensch - All rights reserved.
MANOVA2.doc
Factorial MANOVA
A factorial MANOVA may be used to determine whether or not two or more categorical
grouping variables (and their interactions) significantly affect optimally weighted linear
combinations of two or more normally distributed outcome variables. We have already
studied one-way MANOVA, and we previously expanded one-way ANOVA to factorial
ANOVA, so we should be well prepared to expand one-way MANOVA to factorial MANOVA.
The normality and homogeneity of variance assumptions we made for the factorial ANOVA
apply for the factorial MANOVA also, as does the homogeneity of dispersion matrices
assumption (variance/covariance matrices do not differ across cells) we made in one-way
MANOVA.
Obtain MANOVA2.sas from my SAS Programs page and run it. The data are from the
same thesis that provided us the data for our one-way MANOVA, but this time there are two
dependent variables: YEARS, length of sentence given the defendant by the mock-juror
subject, and SEVERITY, a rating of how serious the subject thought the defendant's crime
was. The PA independent variable was a physical attractiveness manipulation: the female
defendant presented to the mock jurors was beautiful, average, or ugly. The second
independent variable was CRIME, the type of crime the defendant committed, a burglary
(theft of items from the victim's room) or a swindle (conned the male victim).
Multivariate Interactions (pages 6 & 7 of the listing)
As in univariate factorial ANOVA, we shall generally inspect effects from higher order
down to main effects. For our 3 x 2 design, the PA X CRIME effect is the highest order
effect. We had some reason to expect this effect to be significant; others have found that
beautiful defendants get lighter sentences than do ugly defendants unless the crime was one
in which the defendant's beauty played a role (such as enabling her to more easily con our
male victim). We have had little luck replicating this interaction, and it has no significant
multivariate effect here (but Roy's greatest root, which tests only the first root, is nearly
significant, p = .07). We shall, however, for instructional purposes, assume that the
interaction is significant. What do we do following a significant multivariate interaction?
The unsophisticated way to investigate a multivariate interaction is first to determine
which of the dependent variables have significant univariate effects from that interaction.
Suppose that YEARS does and SERIOUS does not. We could then do univariate simple
main effects analysis (or, if we were dealing with a triple or even higher order interaction,
simple interaction analysis) on that dependent variable. This unsophisticated approach
totally ignores the correlations among the dependent variables and the optimally weighted
linear combinations of them that MANOVA worked so hard to obtain.
At first thought, it might seem reasonable to follow a significant multivariate PA X
CRIME interaction with multivariate simple main effects analysis, that is, do two one-way
MANOVAs: multivariate effect of PA upon YEARS and SERIOUS using data from the
burglary level of CRIME only, and another using data from the swindle level of CRIME only
(or alternatively, three one-way MANOVAs, one at each level of PA, with IV=CRIME). There
is, however, a serious complication with this strategy.
For each variable, you must decide whether it is for practical purposes
categorical (only a few values are possible) or continuous (many values are
possible). K = the number of values of the variable.
If both variables are categorical, go to the section Both Variables
Categorical on this page.
If both of the variables are continuous, go to the section Two Continuous
Variables on this page.
If one variable (I'll call it X) is categorical and the other (Y) is continuous,
go to the section Categorical X, Continuous Y on page 2.
Both Variables Categorical
The Pearson chi-square is appropriately used to test the null hypothesis
that two categorical (K ≥ 2) variables are independent of one another.
If each variable is dichotomous (K = 2), the phi coefficient (φ) is also
appropriate. If you can assume that each of the dichotomous variables
measures a normally distributed underlying construct, the tetrachoric
correlation coefficient is appropriate.
Two Continuous Variables
The Pearson product moment correlation coefficient (r) is used to
measure the strength of the linear association between two continuous variables.
To do inferential statistics on r you need to assume normality (and
homoscedasticity) in (across) X, Y, (X|Y), and (Y|X).
Linear regression analysis has less restrictive assumptions (no
assumptions on X, the fixed variable) for doing inferential statistics, such as
testing the hypothesis that the slope of the regression line for predicting Y from X
is zero in the population.
The Spearman rho is used to measure the strength of the monotonic
association between two continuous variables. It is no more than a Pearson r
computed on ranks and its significance can be tested just like r.
Kendall's tau coefficient (τ), which is based on the number of inversions
(across X) in the rankings of Y, can also be used with rank data, and its
significance can be tested.
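A minimal pure-Python sketch of the first two coefficients, on invented data chosen to be monotonic but not linear, shows that Spearman rho really is just a Pearson r computed on ranks.

```python
# Pearson r, and Spearman rho computed as Pearson r on ranks
# (hypothetical data; no tied values assumed).
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def ranks(v):
    # rank 1 = smallest value; assumes no ties
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 7.5, 9.9]   # monotonic increasing, not linear

r = pearson_r(x, y)                  # linear association: about .97 here
rho = pearson_r(ranks(x), ranks(y))  # 1.0 for any monotonic increasing y

print(round(r, 3), rho)
```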
I. What is Measurement?
A. Strict definition: any method by which a unique and reciprocal
correspondence is established between all or some of the magnitudes of a
kind and all or some of the numbers...
1. Magnitude: a particular amount of an attribute
2. Attribute: a measurable property
B. Example:
length
|---Y---Y--Y----Y----------------->
numbers
|---+---+--+----+----------------->
0   2   4  5    8
C. Loose definition: the assignment of numerals to objects or events
according to rules (presumably any rules). A numeral is any symbol other
than a word.
II. Scales of Measurement
A. Nominal Scale of Measurement
1. The function of nominal scales is to classify - numerals (symbols, such as
0, 1, A, Q, #, Z) are arbitrarily assigned to name objects/events or
qualitative classes of objects/events.
2. Does not meet the criteria of the strict definition of measurement.
3. Given two measurements, I can determine whether they are the same or
not, but I may not be able to tell whether the one object/event has more or
less of the measured attribute than does the other.
4. For example, I ask each member of the class to take all of es (his or her)
paper money, write es Banner ID number on each bill, and then put it all
in a paper bag I pass around class. I shake the bag well and pull out two
bills. From the Banner ID numbers on them I can tell whether they belong
to the same person or not.
B. Ordinal Scale of Measurement
1. The data here have the characteristics of a nominal scale and more.
When the objects/events are arranged in serial order with respect to the
property measured, that order is identical to the order of the
measurements.
Let: m(O_i) = our measurement of the amount of some attribute that object i has
t(O_i) = the true amount of that attribute that object i has
For a measuring scale to be ordinal, the following two criteria must be met:
1. m(O_1) ≠ m(O_2) only if t(O_1) ≠ t(O_2). If two measurements are unequal, the true
magnitudes (amounts of the measured attribute) are unequal.
2. m(O_1) > m(O_2) only if t(O_1) > t(O_2). If measurement 1 is larger than measurement
2, then object 1 has more of the measured attribute than object 2.
If the relationship between the Truth and our Measurements is positive monotonic
[whenever T increases, M increases], these criteria are met.
For a measuring scale to be interval, the above criteria must be met and also a third
criterion:
3. The measurement is some positive linear function of the Truth. That is, letting X_i stand
for t(O_i) to simplify the notation: m(O_i) = a + bX_i, b > 0.
Thus, we may say that t(O_1) - t(O_2) = t(O_3) - t(O_4) if m(O_1) - m(O_2) = m(O_3) - m(O_4),
since the latter is (a + bX_1) - (a + bX_2) = (a + bX_3) - (a + bX_4), so bX_1 - bX_2 = bX_3 - bX_4.
Thus, X_1 - X_2 = X_3 - X_4.
In other words, a difference of y units on the measuring scale represents the same true
amount of the attribute being measured regardless of where on the scale the measurements
are taken. Consider the following data on how long it takes a runner to finish a race:
Runner:            Joe   Lou     Sam   Bob   Wes
Rank:              1     2       3     4     5
(True) Time (sec): 60    60.001  65    75    80
Note that a linear function is monotonic, but a monotonic function is not necessarily linear. A
linear function has a constant slope. The slope of a monotonic function has a constant sign
(always positive or always negative) but not necessarily a constant magnitude.
In addition to the three criteria already mentioned, a fourth criterion is necessary for a scale
to be ratio:
4. a = 0, that is, m(O_i) = bX_i; that is, a true zero point.
If m(O) = 0, then 0 = bX, thus X must = 0 since b > 0. In other words, if an
object has a measurement of zero, it has absolutely none of the measured
attribute.
For ratio data, m(O_1) / m(O_2) = bX_1 / bX_2 = X_1 / X_2.
Thus, we can interpret the ratios of ratio measurements as the ratio of the true
magnitudes.
For nonratio, interval data, m(O_1) / m(O_2) = (a + bX_1) / (a + bX_2) ≠ X_1 / X_2, since
a ≠ 0.
When you worked Gas Law problems in high school [for example, volume of gas held
constant, pressure of gas at 10 degrees Celsius given, what is pressure if temperature is
raised to 20 degrees Celsius] you had to convert from Celsius to Kelvin because you were
working with ratios, but Celsius is not ratio [20 degrees Celsius is not twice as hot as 10
degrees Celsius], Kelvin is.
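The Gas Law example works out in a few lines: ratios taken on the Celsius scale (interval, a ≠ 0) are meaningless, while ratios on the Kelvin scale (ratio, true zero) reflect the true magnitudes.

```python
# Ratios are meaningful only on a ratio scale: convert Celsius (interval)
# to Kelvin (ratio, true zero) before forming the ratio.
def celsius_to_kelvin(c):
    return c + 273.15

t_low_c, t_high_c = 10.0, 20.0
naive_ratio = t_high_c / t_low_c                 # 2.0, but meaningless
true_ratio = celsius_to_kelvin(t_high_c) / celsius_to_kelvin(t_low_c)

print(naive_ratio)           # 2.0
print(round(true_ratio, 3))  # 1.035
```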
Additional Readings
See the two articles in BlackBoard, Articles, Scales of Measurement
Ratio versus Interval Scales of Measurement -- a graphical explanation of the
difference
PSYC 2101: Howell Chapters 1 & 2, a document from my undergraduate class,
includes material on scales of measurement and other very basic concepts.
Copyright 2010, Karl L. Wuensch - All rights reserved.
Ratio versus Interval Data
Imagine that you are vacationing in Canada, where, like in most of the world,
they use the Celsius scale of temperature. When you get up in the morning you tune in
the weather station and the forecaster says "The low this morning was 10 degrees, but
the forecast high this afternoon is 20 degrees, twice as hot." Well, 20 to 10 surely does
sound like a 2 to 1 ratio, as illustrated in the plot below:
However, the 0 value here is not a true zero; it does not represent the absence of
molecular motion. The true (absolute) zero point on the Celsius scale is -273, as shown
here:
The first plot is basically what I have elsewhere called a Gee Whiz plot: a big
chunk of the ordinate (vertical axis) was left out, creating the misperception of a much
larger difference in the height of the two bars.
To find the ratio of the two temperatures, you need to convert to a ratio scale, such
as the Kelvin scale. In degrees Kelvin the two temperatures are 283 and 293, and the
ratio is not 2 but rather 293/283 = 1.035. As a chart:
For those of you more familiar with the Fahrenheit scale, our two temperatures
are 50 and 68. The Fahrenheit scale, like the Celsius scale, is interval, not ratio, as the
zero is not true. The ratio of 68 to 50 is 1.36, also meaningless in this context.
Karl L. Wuensch, East Carolina University, Greenville, NC. May, 2010
Descript.doc
Descriptive Statistics
I. Frequency Distribution: a tallying of the number of times (frequency) each
score value (or interval of score values) is represented in a group of scores.
A. Ungrouped: frequency of each score value is given
Cumulative Cumulative
Statophobia Frequency Percent Frequency Percent
-----------------------------------------------------------------
0 9 1.51 9 1.51
1 17 2.86 26 4.37
2 15 2.52 41 6.89
3 35 5.88 76 12.77
4 38 6.39 114 19.16
5 93 15.63 207 34.79
6 67 11.26 274 46.05
7 110 18.49 384 64.54
8 120 20.17 504 84.71
9 47 7.90 551 92.61
10 43 7.23 594 99.83
11 1 0.17 595 100.00
B. Grouped: total range of scores divided into several (usually equal in width)
intervals, with frequency given for each interval.
1. Summarizes data, but involves loss of info, possible distortion
2. Usually 5-20 intervals
Nucophobia   Frequency   Percent   Cumulative Percent
0-9               13        2.1           2.1
10-19             17        2.8           4.9
20-29             30        4.9           9.8
30-39             45        7.3          17.1
40-49             36        5.9          23.0
50-59            144       23.5          46.5
60-69            129       21.0          67.5
70-79             81       13.2          80.8
80-89             57        9.3          90.0
90-100            61       10.0         100.0
Total            613      100.0
C. Percent: the percentage of scores at a given value or interval of values.
s_y² = Σ(Y - M)² / (N - 1) = SS / (N - 1)
3. s is a relatively unbiased estimate of population standard deviation
s = √s²
4. Since in a bell-shaped (normal) distribution nearly all of the scores fall
within plus or minus 3 standard deviations from the mean, when you have
a moderately sized sample of scores from such a distribution the standard
deviation should be approximately one-sixth of the range.
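That rule of thumb is easy to check by simulation; this sketch uses made-up normal scores with mean 50 and SD 10.

```python
# Checking the range rule of thumb on simulated normal scores
# (mean 50, SD 10); the SD should be roughly one-sixth of the range.
import random

random.seed(1)
scores = [random.gauss(50, 10) for _ in range(1000)]

mean = sum(scores) / len(scores)
sd = (sum((y - mean) ** 2 for y in scores) / (len(scores) - 1)) ** 0.5

rough_sd = (max(scores) - min(scores)) / 6
print(round(sd, 1), round(rough_sd, 1))  # both in the neighborhood of 10
```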
I. Example Calculations
Y     (Y-M)   (Y-M)²      z
5      +2       4       1.265
4      +1       1       0.633
3       0       0       0.000
2      -1       1      -0.633
1      -2       4      -1.265
Sum   15        0      10       0
Mean   3        0       2       0
Notice that the sum of the deviations of scores from their mean is zero (as
always). If you find the mean of the squared deviations, 10/5 = 2, you have
the variance, assuming that these five scores represent the entire population.
The population standard deviation is √2 = 1.414. Usually we shall consider
the data we have to be a sample. In this case the sample variance is 10/4 =
2.5 and the sample standard deviation is √2.5 = 1.581.
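The computations above can be verified with Python's statistics module, which offers both the population (divide by N) and sample (divide by N - 1) versions:

```python
import statistics as stats

scores = [5, 4, 3, 2, 1]

pop_var = stats.pvariance(scores)   # SS/N: 10/5 = 2
pop_sd = stats.pstdev(scores)       # sqrt(2), about 1.414
samp_var = stats.variance(scores)   # SS/(N-1): 10/4 = 2.5
samp_sd = stats.stdev(scores)       # sqrt(2.5), about 1.581

print(pop_var, round(pop_sd, 3), samp_var, round(samp_sd, 3))
```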
Notice that for this distribution the mean is 3 and the median is also 3. The
distribution is perfectly symmetric. Watch what happens when I replace the
score of 5 with a score of 40.
Distribution of Y: 40, 4, 3, 2, 1
Median = 3, Mean = 10. The mean is drawn in the direction of the (positive)
skew. The mean is somewhat deceptive here: notice that 80% of the scores
are below average (below the mean). That sounds fishy, eh?
V. Standard Scores
A. Take the scores from a given distribution and change them such that the new
distribution has a standard mean and a standard deviation
B. This transformation does not change the shape of the distribution
C. Z - Scores: a mean of 0, standard deviation of 1
Z = (Y - M) / s, the number of standard deviations the score is above or below the
mean. In the table above, I have computed z for each score by subtracting
the sample mean and dividing by the sample standard deviation.
D. Standard Score = Standard Mean + (z score)(Standard Standard Deviation).
Examples
Suzie Cueless has a z score of -2 on a test of intelligence. We want to
change this score to a standard IQ score, where the mean is 100 and the
standard deviation is 15. Suzie's IQ = 100 - (2)(15) = 70.
Gina Genius has a z score of +2.5 on a test of verbal aptitude. We want
to change this score to the type of standard score that is used on the SAT
tests, where the mean is 500 and the standard deviation is 100. Gina's
SAT-Verbal score is 500 + (2.5)(100) = 750.
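The standard-score formula in D can be written as a one-line function; the two conversions below reproduce the IQ and SAT examples:

```python
def standard_score(z, new_mean, new_sd):
    """Standard Score = Standard Mean + (z score)(Standard SD)."""
    return new_mean + z * new_sd

print(standard_score(-2.0, 100, 15))    # IQ scale: 70.0
print(standard_score(2.5, 500, 100))    # SAT scale: 750.0
```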
Return to Wuensch's Page of Statistics Lessons
Exercises Involving Descriptive Statistics
Copyright 2010, Karl L. Wuensch - All rights reserved.
The Three Quarter Rule
GeeWhiz.doc
[Figure: bar graph of Sales in $K, ordinate 0 to 25, for salespersons Baylen, Jackson, Jones, Smith, and Stern.]
It is recommended that one make the height of the highest point on the ordinate
about 3/4 of the length of the abscissa. Below is a simple sales plot for 5 salespersons
following that rule. Take a look at it and get a feel for how much these five differ in
sales.
[Figure: the same sales data, ordinate 0 to 25, with the width increased relative to the height.]
Now here is a plot of the same data, but with the width increased relative to the height.
This gives the perception that the salespersons differ less.
Below, on the left, is a plot of the same data, but now I have increased the
height relative to the width. Notice that this creates the perception that the
salespersons differ from one another more. On the right, I have applied a "Gee-Whiz,"
leaving out a big chunk of the lower portion of the ordinate, which makes the differences
among the salespersons appear greater still.
[Figures: left, the same sales data with height increased relative to width, ordinate 0 to 25; right, a "Gee-Whiz" version with the ordinate truncated to run from 17 to 23.]
[Figures: two renditions of the Reagan tax graph, plotting "Their Bill" versus "Our Bill" from 1982 to 1986; one version has no numbers on the ordinate, the other adds ordinate values 999 through 1006.]
Here is my rendition of a graph used by Ronald Reagan on July 27, 1981. It was
published in the New York Times, and elsewhere. His graph was better done than
mine, but mine captures many of the little "tricks" he used. The graph was designed to
show that the Republican (true blue) tax plan would save you money compared to the
Democratic (in the red) plan, over time. "Your Taxes" makes it personal. Notice that
there are no numbers on the ordinate, but a big attention-catching dollar sign is there.
The Republican plan is "Our" plan (yours and mine), and the Democratic plan is "Their"
plan (the bad guys). It looks like the Democratic plan would cost us a little less for a
couple of years, but then a lot more thereafter. But without any numbers on the
ordinate, we can't really make a fair comparison between the two plans. Might this
graph be a "Gee-Whiz?"
[Graph labels: "YOUR TAXES"; "ANNUAL FAMILY INCOME = $20,000"; a large dollar sign on the ordinate.]
Here I have added some numbers to the ordinate. If these were the correct numbers (I
do not know what the correct numbers are), then this graph is clearly a "Gee-Whiz", and
the difference between the two plans is trivial, only a few dollars.
Here is an ad placed by Quaker Oats. Gee Whiz! The graph makes it look like there is
a dramatic drop in cholesterol, but notice that the ordinate starts at 196. The drop
across four weeks is from 209 to 199. That is a drop of 10/209 = 4.8%.
Document revised October, 2006.
Return to Karl Wuensch's Stats Lessons Page
eda.doc
Exploratory Data Analysis (EDA)
John Tukey has developed a set of procedures collectively known as EDA. Two
of these procedures that are especially useful for producing initial displays of data are:
1. the Stem-and-Leaf Display, and 2. the Box-and-Whiskers Plot.
To illustrate EDA, consider the following set of pulse rates from 96 people:
66 60 64 64 64 76 82 70 60 78
92 82 90 70 62 60 68 68 70 70
68 76 68 72 98 60 80 80 104 70
92 80 90 64 78 60 60 70 66 76
70 52 74 78 70 68 66 80 62 56
58 68 60 48 78 86 68 90 76 70
94 90 64 68 68 80 70 72 80 60
68 99 60 74 56 86 64 86 64 68
76 74 70 77 80 72 88 94 78 70
78 78 55 62 74 58
Stem and Leaf Display
You first decide how wide each row (class interval) will be. I decided on an
interval width of 5, that is, I'll group together on one row all scores of 40-44; on another,
45-49; on another, 50-54, etc. I next wrote down the leading digits (most significant
digits) for each interval, starting with the lowest. These make up the stem of the display.
Next I tallied each score, placing its trailing digit (rightmost, least significant digit) in
the appropriate row to the right of the stem. These digits (each one representing one
score) make up the leaves of the display. Here is how the display looks now:
4 8
5 2
5 68658
6 0444020040020400442
6 68888686888888
7 0000200040002440204
7 6868688667888
8 220000000
8 6668
9 20200404
9 89
10 4
Notice that the leaves in each row are in the order that I encountered them when
reading the unordered raw data in rows from left to right. A more helpful display is one
in which the leaves within each row have been put in order.
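A stem-and-leaf display of this kind is easy to sketch in code. The version below (an illustration with made-up pulse rates, not the full data set above) uses an interval width of 5 and sorts the leaves within each row:

```python
from collections import defaultdict

def stem_and_leaf(scores, width=5):
    """One row per interval of the given width; leaves are the trailing digits."""
    rows = defaultdict(list)
    for score in sorted(scores):
        rows[score - score % width].append(score % 10)
    return [f"{start // 10:>3} | " + "".join(str(leaf) for leaf in rows[start])
            for start in sorted(rows)]

for line in stem_and_leaf([48, 52, 55, 56, 56, 58, 58, 60, 60, 64, 104]):
    print(line)
```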
Skewness
In everyday language, the terms skewed and askew are used to refer to
something that is out of line or distorted on one side. When referring to the shape of
frequency or probability distributions, skewness refers to asymmetry of the distribution.
A distribution with an asymmetric tail extending out to the right is referred to as
positively skewed or skewed to the right, while a distribution with an asymmetric tail
extending out to the left is referred to as negatively skewed or skewed to the left.
Skewness can range from minus infinity to positive infinity.
Karl Pearson (1895) first suggested measuring skewness by standardizing the
difference between the mean and the mode, that is, sk = (μ - mode) / σ. Population modes
are not well estimated from sample modes, but one can estimate the difference
between the mean and the mode as being three times the difference between the mean
and the median (Stuart & Ord, 1994), leading to the following estimate of skewness:
est. sk = 3(M - median) / s. Many statisticians use this measure but with the 3 eliminated,
that is, sk = (M - median) / s. This statistic ranges from -1 to +1. Absolute values above
0.2 indicate great skewness (Hildebrand, 1986).
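The (M - median)/s estimator is simple to compute; here is a minimal sketch (the samples below are made up for illustration):

```python
import statistics as stats

def median_skewness(data, multiplier=1):
    """Pearson's median skewness, (M - median)/s; pass multiplier=3 for the older form."""
    return multiplier * (stats.mean(data) - stats.median(data)) / stats.stdev(data)

print(median_skewness([1, 2, 3, 4, 5]))              # symmetric, so 0.0
print(round(median_skewness([1, 2, 2, 3, 10]), 3))   # positively skewed sample
```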
Skewness has also been defined with respect to the third moment about the
mean: γ₁ = Σ(X - μ)³ / (nσ³), which is simply the expected value of the distribution of cubed z
scores. Skewness measured in this way is sometimes referred to as Fisher's
skewness. When the deviations from the mean are greater in one direction than in the
other direction, this statistic will deviate from zero in the direction of the larger
deviations. From sample data, Fisher's skewness is most often estimated by:
g₁ = nΣz³ / ((n - 1)(n - 2)). For large sample sizes (n > 150), g₁ may be distributed
approximately normally, with a standard error of approximately √(6/n). While one could
use this sampling distribution to construct confidence intervals for or tests of hypotheses
about γ₁, there is rarely any value in doing so.
The most commonly used measures of skewness (those discussed here) may
produce surprising results, such as a negative value when the shape of the distribution
appears skewed to the right. There may be superior alternative measures not
commonly used (Groeneveld & Meeden, 1984).
Kurtosis
Kurtosis has been defined with respect to the fourth moment about the mean:
β₂ = Σ(X - μ)⁴ / (nσ⁴), the expected value of the distribution of Z scores which have
been raised to the 4th power.
β₂ is often referred to as Pearson's kurtosis, and β₂ - 3 (often symbolized with γ₂)
as kurtosis excess or Fisher's kurtosis, even though it was
Pearson who defined kurtosis as β₂ - 3. An unbiased estimator for γ₂ is
g₂ = n(n + 1)ΣZ⁴ / ((n - 1)(n - 2)(n - 3)) - 3(n - 1)² / ((n - 2)(n - 3)). For large
sample sizes (n > 1000), g₂ may be
distributed approximately normally, with a standard error of approximately √(24/n)
(Snedecor & Cochran, 1967). While one could use this sampling distribution to
construct confidence intervals for or tests of hypotheses about γ₂, there is rarely any
value in doing so.
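Assuming z scores computed with the sample standard deviation (N - 1 in the denominator), the g₁ and g₂ formulas above can be coded directly; this Python sketch parallels what the SAS programs mentioned later in this handout do:

```python
import statistics as stats

def g1_g2(data):
    """Sample estimates of Fisher's skewness (g1) and kurtosis excess (g2)."""
    n = len(data)
    m, s = stats.mean(data), stats.stdev(data)   # s divides SS by n - 1
    z = [(x - m) / s for x in data]
    sum_z3 = sum(v ** 3 for v in z)
    sum_z4 = sum(v ** 4 for v in z)
    g1 = n * sum_z3 / ((n - 1) * (n - 2))
    g2 = (n * (n + 1) * sum_z4 / ((n - 1) * (n - 2) * (n - 3))
          - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    return g1, g2

g1, g2 = g1_g2([1, 2, 3, 4, 5])   # a symmetric toy sample: g1 is zero
print(round(g1, 6), round(g2, 6))
```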
Pearson (1905) introduced kurtosis as a measure of how flat the top of a
symmetric distribution is when compared to a normal distribution of the same variance.
He referred to more flat-topped distributions (γ₂ < 0) as platykurtic, less flat-topped
distributions (γ₂ > 0) as leptokurtic, and equally flat-topped distributions as mesokurtic
(γ₂ ≈ 0). Kurtosis is actually more influenced by scores in the tails of the distribution
than scores in the center of a distribution (DeCarlo, 1997). Accordingly, it is often
appropriate to describe a leptokurtic distribution as fat in the tails and a platykurtic
distribution as thin in the tails.
Student (1927, Biometrika, 19, 160) published a cute description of kurtosis,
which I quote here: "Platykurtic curves have shorter tails than the normal curve of error
and leptokurtic longer tails. I myself bear in mind the meaning of the words by the
above memoria technica, where the first figure represents platypus and the second
kangaroos, noted for lepping." See Student's drawings.
Moors (1986) demonstrated that β₂ = Var(Z²) + 1. Accordingly, it may be best to
treat kurtosis as the extent to which scores are dispersed away from the shoulders of a
distribution, where the shoulders are the points where Z² = 1, that is, Z = ±1. Balanda
and MacGillivray (1988) wrote it is best to define kurtosis vaguely as the location- and
scale-free movement of probability mass from the shoulders of a distribution into its
centre and tails. If one starts with a normal distribution and moves scores from the
shoulders into the center and the tails, keeping variance constant, kurtosis is increased.
The distribution will likely appear more peaked in the center and fatter in the tails, like a
Laplace distribution (γ₂ = 3) or Student's t with few degrees of freedom (γ₂ = 6/(df - 4)).
Starting again with a normal distribution, moving scores from the tails and the
center to the shoulders will decrease kurtosis. A uniform distribution certainly has a flat
top, with γ₂ = -1.2, but γ₂ can reach a minimum value of -2 when two score values are
equally probable and all other score values have probability zero (a rectangular U
distribution, that is, a binomial distribution with n = 1, p = .5). One might object that the
rectangular U distribution has all of its scores in the tails, but closer inspection will
reveal that it has no tails, and that all of its scores are in its shoulders, exactly one
standard deviation from its mean. Values of g₂ less than that expected for a uniform
distribution (-1.2) may suggest that the distribution is bimodal (Darlington, 1970), but
bimodal distributions can have high kurtosis if the modes are distant from the shoulders.
One leptokurtic distribution we shall deal with is Students t distribution. The
kurtosis of t is infinite when df < 5, 6 when df = 5, 3 when df = 6. Kurtosis decreases
further (towards zero) as df increase and t approaches the normal distribution.
Kurtosis is usually of interest only when dealing with approximately symmetric
distributions. Skewed distributions are always leptokurtic (Hopkins & Weeks, 1990).
Among the several alternative measures of kurtosis that have been proposed (none of
which has often been employed), is one which adjusts the measurement of kurtosis to
remove the effect of skewness (Blest, 2003).
There is much confusion about how kurtosis is related to the shape of
distributions. Many authors of textbooks have asserted that kurtosis is a measure of the
peakedness of distributions, which is not strictly true.
It is easy to confuse low kurtosis with high variance, but distributions with
identical kurtosis can differ in variance, and distributions with identical variances can
differ in kurtosis. Here are some simple distributions that may help you appreciate that
kurtosis is, in part, a measure of tail heaviness relative to the total variance in the
distribution (remember the σ⁴ in the denominator).
Table 1.
Kurtosis for 7 Simple Distributions Also Differing in Variance
X freq A freq B freq C freq D freq E freq F freq G
05 20 20 20 10 05 03 01
10 00 10 20 20 20 20 20
15 20 20 20 10 05 03 01
Kurtosis -2.0 -1.75 -1.5 -1.0 0.0 1.33 8.0
Variance 25 20 16.6 12.5 8.3 5.77 2.27
Platykurtic Leptokurtic
When I presented these distributions to my colleagues and graduate students
and asked them to identify which had the least kurtosis and which the most, all said A
has the most kurtosis, G the least (excepting those who refused to answer). But in fact
A has the least kurtosis (-2 is the smallest possible value of kurtosis) and G the most.
The trick is to do a mental frequency plot where the abscissa is in standard deviation
units. In the maximally platykurtic distribution A, which initially appears to have all its
scores in its tails, no score is more than one σ away from the mean; that is, it has no
tails! In the leptokurtic distribution G, which seems only to have a few scores in its tails,
one must remember that those scores (5 & 15) are much farther away from the mean
(3.3 σ) than are the 5s & 15s in distribution A. In fact, in G nine percent of the scores
are more than three σ from the mean, much more than you would expect in a
mesokurtic distribution (like a normal distribution), thus G does indeed have fat tails.
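The mental plot can be checked by computing β₂, the mean of z⁴, directly from the Table 1 frequencies:

```python
def gamma2(values, freqs):
    """Population kurtosis excess: mean of z**4, minus 3."""
    n = sum(freqs)
    mean = sum(v * f for v, f in zip(values, freqs)) / n
    var = sum(f * (v - mean) ** 2 for v, f in zip(values, freqs)) / n
    beta2 = sum(f * (v - mean) ** 4 for v, f in zip(values, freqs)) / n / var ** 2
    return beta2 - 3

x = [5, 10, 15]
print(round(gamma2(x, [20, 0, 20]), 3))   # distribution A: -2.0
print(round(gamma2(x, [1, 20, 1]), 3))    # distribution G: 8.0
```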
If you were to ask SAS to compute kurtosis on the A scores in Table 1, you
would get a value less than -2.0, less than the lowest possible population kurtosis.
Why? SAS assumes your data are a sample and computes the g₂ estimate of
population kurtosis, which can fall below -2.0.
Sune Karlsson, of the Stockholm School of Economics, has provided me with the
following modified example which holds the variance approximately constant, making it
quite clear that a higher kurtosis implies that there are more extreme observations (or
that the extreme observations are more extreme). It is also evident that a higher
kurtosis also implies that the distribution is more single-peaked (this would be even
more evident if the sum of the frequencies was constant). I have highlighted the rows
representing the shoulders of the distribution so that you can see that the increase in
kurtosis is associated with a movement of scores away from the shoulders.
Table 2.
Kurtosis for Seven Simple Distributions Not Differing in Variance
X Freq. A Freq. B Freq. C Freq. D Freq. E Freq. F Freq. G
-6.6 0 0 0 0 0 0 1
-0.4 0 0 0 0 0 3 0
1.3 0 0 0 0 5 0 0
2.9 0 0 0 10 0 0 0
3.9 0 0 20 0 0 0 0
4.4 0 20 0 0 0 0 0
5 20 0 0 0 0 0 0
10 0 10 20 20 20 20 20
15 20 0 0 0 0 0 0
15.6 0 20 0 0 0 0 0
16.1 0 0 20 0 0 0 0
17.1 0 0 0 10 0 0 0
18.7 0 0 0 0 5 0 0
20.4 0 0 0 0 0 3 0
26.6 0 0 0 0 0 0 1
Kurtosis -2.0 -1.75 -1.5 -1.0 0.0 1.33 8.0
Variance 25 25.1 24.8 25.2 25.2 25.0 25.1
While it is unlikely that a behavioral researcher will be interested in questions that
focus on the kurtosis of a distribution, estimates of kurtosis, in combination with other
information about the shape of a distribution, can be useful. DeCarlo (1997) described
several uses for the g₂ statistic. When considering the shape of a distribution of scores,
it is useful to have at hand measures of skewness and kurtosis, as well as graphical
displays. These statistics can help one decide which estimators or tests should perform
best with data distributed like those on hand. High kurtosis should alert the researcher
to investigate outliers in one or both tails of the distribution.
Tests of Significance
Some statistical packages (including SPSS) provide both estimates of skewness
and kurtosis and standard errors for those estimates. One can divide the estimate by
its standard error to obtain a z test of the null hypothesis that the parameter is zero (as
would be expected in a normal population), but I generally find such tests of little use.
One may do an eyeball test on measures of skewness and kurtosis when deciding
whether or not a sample is normal enough to use an inferential procedure that
assumes normality of the population(s). If you wish to test the null hypothesis that the
sample came from a normal population, you can use a chi-square goodness of fit test,
comparing observed frequencies in ten or so intervals (from lowest to highest score)
with the frequencies that would be expected in those intervals were the population
normal. This test has very low power, especially with small sample sizes, where the
normality assumption may be most critical. Thus you may think your data close enough
to normal (not significantly different from normal) to use a test statistic which assumes
normality when in fact the data are too distinctly non-normal to employ such a test, the
nonsignificance of the deviation from normality resulting only from low power due to the
small sample size. SAS PROC UNIVARIATE will test such a null hypothesis for you, using
the more powerful Kolmogorov-Smirnov statistic (for larger samples) or the Shapiro-Wilk
statistic (for smaller samples). These have very high power, especially with large
sample sizes, in which case the normality assumption may be less critical for the test
statistic whose normality assumption is being questioned. These tests may tell you that
your sample differs significantly from normal even when the deviation from normality is
not large enough to cause problems with the test statistic which assumes normality.
SAS Exercises
Go to my StatData page and download the file EDA.dat. Go to my SAS-
Programs page and download the program file g1g2.sas. Edit the program so that the
INFILE statement points correctly to the folder where you located EDA.dat and then run
the program, which illustrates the computation of g₁ and g₂. Look at the program. The
raw data are read from EDA.dat and PROC MEANS is then used to compute g₁ and g₂.
The next portion of the program uses PROC STANDARD to convert the data to z
scores. PROC MEANS is then used to compute g₁ and g₂ on the z scores. Note that
standardization of the scores has not changed the values of g₁ and g₂. The next portion
of the program creates a data set with the z scores raised to the 3rd and the 4th powers.
The final step of the program uses these powers of z to compute g₁ and g₂ using the
formulas presented earlier in this handout. Note that the values of g₁ and g₂ are the
same as obtained earlier from PROC MEANS.
Go to my SAS-Programs page and download and run the file Kurtosis-
Uniform.sas. Look at the program. A DO loop and the UNIFORM function are used to
create a sample of 500,000 scores drawn from a uniform population which ranges from
0 to 1. PROC MEANS then computes mean, standard deviation, skewness, and
kurtosis. Look at the output. Compare the obtained statistics to the expected values for
the following parameters of a uniform distribution that ranges from a to b:
Parameter Expected Value
Mean (a + b) / 2
Standard Deviation √((b - a)² / 12)
Skewness 0
Kurtosis -1.2
Go to my SAS-Programs page and download and run the file Kurtosis-T.sas,
which demonstrates the effect of sample size (degrees of freedom) on the kurtosis of
the t distribution. Look at the program. Within each section of the program a DO loop is
used to create 500,000 samples of N scores (where N is 10, 11, 17, or 29), each drawn
from a normal population with mean 0 and standard deviation 1. PROC MEANS is then
used to compute Students t for each sample, outputting these t scores into a new data
set. We shall treat this new data set as the sampling distribution of t. PROC MEANS is
then used to compute the mean, standard deviation, and kurtosis of the sampling
distributions of t. For each value of degrees of freedom, compare the obtained statistics
with their expected values.
Mean: 0
Standard Deviation: √(df / (df - 2))
Kurtosis: 6 / (df - 4)
Download and run my program Kurtosis_Beta2.sas. Look at the program.
Each section of the program creates one of the distributions from Table 1 above and
then converts the data to z scores, raises the z scores to the fourth power, and
computes β₂ as the mean of z⁴. Subtract 3 from each value of β₂ and then compare the
resulting γ₂ to the value given in Table 1.
Download and run my program Kurtosis-Normal.sas. Look at the program. DO
loops and the NORMAL function are used to create 100,000 samples, each with 1,000
scores drawn from a normal population with mean 0 and standard deviation 1. PROC
MEANS creates a new data set with the g₁ and the g₂ statistics for each sample. PROC
MEANS then computes the mean and standard deviation (standard error) for skewness
and kurtosis. Compare the values obtained with those expected: 0 for the means, and
√(6/n) and √(24/n) for the standard errors.
References
Balanda, K. P., & MacGillivray, H. L. (1988). Kurtosis: A critical review. The American
Statistician, 42, 111-119.
Blest, D.C. (2003). A new measure of kurtosis adjusted for skewness. Australian & New
Zealand Journal of Statistics, 45, 175-179.
Darlington, R.B. (1970). Is kurtosis really "peakedness?" The American Statistician, 24(2),
19-22.
DeCarlo, L. T. (1997). On the meaning and use of kurtosis. Psychological Methods, 2, 292-307.
Groeneveld, R.A. & Meeden, G. (1984). Measuring skewness and kurtosis. The Statistician, 33,
391-399.
Hildebrand, D. K. (1986). Statistical thinking for behavioral scientists. Boston: Duxbury.
Hopkins, K.D. & Weeks, D.L. (1990). Tests for normality and measures of skewness and
kurtosis: Their place in research reporting. Educational and Psychological Measurement, 50,
717-729.
Loether, H. L., & McTavish, D. G. (1988). Descriptive and inferential statistics: An
introduction (3rd ed.). Boston: Allyn & Bacon.
Moors, J.J.A. (1986). The meaning of kurtosis: Darlington reexamined. The American
Statistician, 40, 283-284.
Pearson, K. (1895) Contributions to the mathematical theory of evolution, II: Skew variation in
homogeneous material. Philosophical Transactions of the Royal Society of London, 186,
343-414.
Pearson, K. (1905). Das Fehlergesetz und seine Verallgemeinerungen durch Fechner und
Pearson. A Rejoinder. Biometrika, 4, 169-212.
Snedecor, G.W. & Cochran, W.G. (1967). Statistical methods (6th ed.). Ames, IA: Iowa
State University Press.
Stuart, A. & Ord, J.K. (1994). Kendall's advanced theory of statistics. Vol. 1: Distribution
theory (6th ed.). London: Edward Arnold.
Wuensch, K. L. (2005). Kurtosis. In B. S. Everitt & D. C. Howell (Eds.), Encyclopedia of
statistics in behavioral science (pp. 1028 - 1029). Chichester, UK: Wiley.
Wuensch, K. L. (2005). Skewness. In B. S. Everitt & D. C. Howell (Eds.), Encyclopedia of
statistics in behavioral science (pp. 1855 - 1856). Chichester, UK: Wiley.
Links
http://core.ecu.edu/psyc/wuenschk/StatHelp/KURTOSIS.txt -- a log of email discussions on
the topic of kurtosis, most of them from the EDSTAT list.
http://core.ecu.edu/psyc/WuenschK/docs30/Platykurtosis.jpg -- distribution of final grades in
PSYC 2101 (undergrad stats), Spring, 2007.
Kurtosis slide show with histograms of the distributions presented in Table 2 above
Data from Table 2, in SPSS format
Copyright 2011, Karl L. Wuensch - All rights reserved.
Return to My Statistics Lessons Page
Normal-30.docx
The Normal Distribution
The normal, or Gaussian, distribution has played a prominent role in statistics. It
was originally investigated by persons interested in gambling or in the distribution of
errors made by persons observing astronomical events. It is still very important to
behavioral statisticians because:
1. Many variables are distributed approximately as the bell-shaped normal curve.
2. Many of the inferential procedures (the so-called parametric tests) we shall learn
assume that variables are normally distributed.
3. Even when a variable is not normally distributed, a distribution of sample sums or
means on that variable will be approximately normally distributed if sample size is
large.
4. Most of the special probability distributions we shall study approach a normal
distribution under some conditions.
5. The mathematics of the normal curve are well known and relatively simple. One can
find the probability that a score randomly sampled from a normal distribution will fall
within the interval a to b by integrating the normal probability density function (pdf)
from a to b. This is equivalent to finding the area under the curve between a and b,
assuming a total area of one.
Here is the probability density function known as the normal curve. F(Y) is the
probability density, aka the height of the curve at value Y.
F(Y) = (1 / (σ√(2π))) e^(-(Y - μ)² / (2σ²))
Notice that there are only two parameters in this pdf: the mean and the standard
deviation. Everything else on the right-hand side is a constant. If you know that a
distribution is normal and you know its mean and standard deviation, you know
everything there is to know about it. Normal distributions differ only with respect to their
means and their standard deviations.
Those who have not mastered integral calculus need not worry about integrating
the normal curve. You can use the computer to do it for you or make use of the normal
curve table in our textbook. This table is based on the standard normal curve (z),
which has a mean of 0 and a variance of 1. To use this table, one must first convert raw
scores to z-scores. A z-score is the number of standard deviations (σ or s) a score is
above or below the mean of a reference distribution: Z = (Y - μ) / σ.
For example, suppose we wish to know the percentile rank (PR, the percentage
of scores at or below a given score value) of a score of 85 on an IQ test with μ = 100,
σ = 15. Z = (85 - 100)/15 = -1.00. We then either integrate the normal curve from minus
infinity to minus one or go the table. On page 695 find the row with Z = 1.00 (ignore the
minus sign for now). Draw a curve, locate mean and -1.00 on the curve and shade the
area you want (the lower tail). The entry under Smaller Portion is the answer, .1587 or
15.87%.
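Rather than the printed table, one can integrate the standard normal curve with Python's math.erf, since Φ(z) = (1 + erf(z/√2))/2:

```python
from math import erf, sqrt

def phi(z):
    """Proportion of a standard normal distribution falling at or below z."""
    return (1 + erf(z / sqrt(2))) / 2

z = (85 - 100) / 15                  # IQ of 85
print(round(phi(z), 4))              # the "Smaller Portion": 0.1587
print(round(phi(1.0), 4))            # IQ of 115, "Larger Portion": 0.8413
print(round(2 * phi(1.96) - 1, 2))   # middle proportion between -1.96 and +1.96: 0.95
```

Python 3.8 and later also provide statistics.NormalDist(100, 15).cdf(85) for the same purpose.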
Suppose IQ = 115, Z = +1.00. Now the answer is under Larger Portion,
84.13%.
What percentage of persons have IQs between 85 and 130? The Z-scores are
-1.00 and +2.00. Between the -1.00 and the mean are 34.13%, with another 47.72
between the mean and +2.00, for a total of 81.85%.
What percentage have IQs between 115 and 130 ? The Z-scores are +1.00 and
+2.00. 97.72% are below +2.00, 84.13% are below +1.00, so the answer is 97.72 -
84.13 = 13.59%.
What score marks off the lower 10% of IQ scores ? Now we look in the Smaller
Portion column to find .1000 . The closest we can get is .1003 , which has Z = 1.28 .
We could do a linear interpolation between 1.28 and 1.29 to be more precise. Since we
are below the mean, Z = -1.28. What IQ has a Z of -1.28? X = μ + Zσ, so
IQ = 100 + (-1.28)(15) = 100 - 19.2 = 80.8.
What scores mark off the middle 50% of IQ scores? There will be 25% between
the mean and each Z-score, so we look for .2500. The closest Z is 0.67, so the middle
50% is between Z = -0.67 and Z = +0.67, which is IQ = 100 - (.67)(15) to 100 + (.67)(15)
= 90 through 110.
You should memorize the following important Z-scores:
The MIDDLE __ % FALL BETWEEN PLUS OR MINUS Z =
50 ------------------------- 0.67
68 ------------------------- 1.00
90 ------------------------- 1.645
95 ------------------------- 1.96
98 ------------------------- 2.33
99 ------------------------- 2.58
There are standard score systems (where raw scores are transformed to have a
preset mean and standard deviation) other than Z. For example, SAT scores and GRE
scores are generally reported with a system having μ = 500, σ = 100. A math SAT
score of 600 means that Z = (600 - 500)/100 = +1.00, just like an IQ of 115 means that
Z = (115 - 100)/15 = +1.00. Converting to Z allows one to compare the relative
standings of scores from distributions with different means and standard deviations.
Thus, you can compare apples with oranges, provided you have first converted to Z. Be
careful, however, when doing this, because the two reference groups may differ. For
example, a math SAT of 600 is not really equivalent to an IQ of 115, because the
persons who take SAT tests come from a population different from (brighter than) the
group with which IQ statistics were normed. (Also, math SAT and IQ tests measure
somewhat different things.)
Copyright 2011, Karl L. Wuensch - All rights reserved.
Designs.doc
An Introduction to Research Design
Bivariate Experimental Research
Let me start by sketching a simple picture of a basic bivariate (focus on two
variables) research paradigm.
IV stands for independent variable (also called the treatment), DV for
dependent variable, and EV for extraneous variable. In experimental research
we manipulate the IV and observe any resulting change in the DV. Because we are
manipulating it experimentally, the IV will probably assume only a very few values,
maybe as few as two. The DV may be categorical or may be continuous. The EVs are
variables other than the IV which may affect the DV. To be able to detect the effect of
the IV upon the DV, we must be able to control the EVs.
Consider the following experiment. I go to each of 100 classrooms on campus.
At each, I flip a coin to determine whether I will assign the classroom to Group 1 (level 1
of the IV) or to Group 2. The classrooms are my experimental units or subjects. In
psychology, when our subjects are humans, we prefer to refer to them as participants,
or respondents, but in statistics, the use of the word subjects is quite common, and I
shall use it as a generic term for experimental units. For subjects assigned to Group
1, I turn the room's light switch off. For Group 2 I turn it on. My DV is the brightness of
the room, as measured by a photographic light meter. EVs would include factors such
as time of day, season of the year, weather outside, condition of the light bulbs in the
room, etc.
Parametric statistical inference may take the form of:
1. Estimation: on the basis of sample data we estimate the value of some
parameter of the population from which the sample was randomly drawn.
2. Hypothesis Testing: We test the null hypothesis that a specified parameter
(I shall use θ to stand for the parameter being estimated) of the population has a
specified value.
One must know the sampling distribution of the estimator (the statistic used to
estimate θ; I shall use θ̂ to stand for the statistic used to estimate θ) to make full use
of the estimator. The sampling distribution of a statistic is the distribution that would be
obtained if you repeatedly drew samples of a specified size from a specified population
and computed θ̂ on each sample. In other words, it is the probability distribution of a
statistic.
Desirable Properties of Estimators Include:
1. Unbiasedness: θ̂ is an unbiased estimator of θ if its expected value equals
the value of the parameter being estimated, that is, if the mean of its sampling
distribution is θ. The sample mean and sample variance are unbiased estimators of the
population mean and population variance (but sample standard deviation is not an
unbiased estimator of population standard deviation).
For a discrete variable X, E(X), the expected value of X, is E(X) = ΣPᵢXᵢ. For
example, if 50% of the bills in a pot are one-dollar bills, 30% are two-dollar bills, 10%
are five-dollar bills, 5% are ten-dollar bills, 3% are twenty-dollar bills, and 2% are fifty-
dollar bills, the expected value for the value of what you get when you randomly select
one bill is .5(1) + .3(2) + .1(5) + .05(10) + .03(20) + .02(50) = $3.70. For a continuous
variable the basic idea of an expected value is the same as for a discrete variable, but a
little calculus is necessary to sum up the infinite number of products of P
i
(actually,
probability density) and X
i
.
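The expected-value computation above is easy to check with a few lines of code (a sketch; the values and probabilities are those of the pot-of-bills example):

```python
# E(X) = sum of X_i * P(X_i), using the pot-of-bills example above.
values = [1, 2, 5, 10, 20, 50]           # dollar value of each type of bill
probs = [.50, .30, .10, .05, .03, .02]   # probability of drawing each type

expected_value = sum(x * p for x, p in zip(values, probs))
print(round(expected_value, 2))  # 3.7
```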
Please note that the sample mean is an unbiased estimator of the population
mean, and the sample variance, s² = SS / (N − 1), is an unbiased estimator of the
population variance, σ². If we computed the estimator s² with N rather than (N − 1) in
the denominator, then the estimator would be biased. SS is the sum of the squared
deviations of scores from their mean, SS = Σ(Y − Mᵧ)².
The sample standard deviation is not, however, an unbiased estimator of the
population standard deviation (though it is the least biased estimator available to us). Consider
a hypothetical sampling distribution for the sample variance where half of the samples
have s² = 2 and half have s² = 4. Since the sample variance is unbiased, the
population variance must be the expected value of the sample variances,
σ² = .5(2) + .5(4) = 3, so σ = √3 ≈ 1.732. The expected value of the sample standard
deviations, however, is E(s) = .5(√2) + .5(√4) ≈ 1.707, which is less than 1.732.
a. σ(θ̂) is the standard error, the standard deviation of the sampling distribution of θ̂.
b. A 95% CI will extend from θ̂ − 1.96 σ(θ̂) to θ̂ + 1.96 σ(θ̂) if the sampling
distribution is normal. We would be 95% confident that our estimate-interval included
the true value of the estimated parameter (if we drew a very large number of samples,
95% of them would have θ̂ ± 1.96 σ(θ̂) intervals which would in fact include the true value of θ). If
CC = .95, a fair bet would be placing 19:1 odds in favor of the CI containing θ.
c. The value of Z will be 2.58 for a 99% CI, 2.33 for a 98% CI, 1.645 for a 90%
CI, 1.00 for a 68% CI, 2/3 for a 50% CI, other values obtainable from the normal curve
table.
d. Consider this very simple case. We know that a population of IQ scores is
normally distributed and has a σ of 15. We have randomly sampled one score and it is
110. When N = 1, the standard error of the mean, σ(M) = σ/√N, equals the population σ.
Thus, a 95% CI would be 110 ± 1.96(15). That is, we are 95% confident that μ is between 80.6
and 139.4.
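As a quick sketch (not part of the original handout), the IQ example can be reproduced in Python; `z = 1.96` is the two-tailed critical value used above:

```python
import math

def ci_mean(mean, sigma, n, z=1.96):
    """CI for mu when the population sigma is known: mean +/- z * sigma/sqrt(n).
    With n = 1 the standard error of the mean is just sigma, as in the example."""
    se = sigma / math.sqrt(n)
    return mean - z * se, mean + z * se

lo, hi = ci_mean(110, 15, 1)       # a single IQ score of 110, sigma = 15
print(round(lo, 1), round(hi, 1))  # 80.6 139.4
```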
Hypothesis Testing
A second type of inferential statistics is hypothesis testing. For parametric
hypothesis testing one first states a null hypothesis (H₀). The H₀ specifies that some
parameter has a particular value or has a value in a specified range of values. For
nondirectional hypotheses, a single value is stated, for example, μ = 100. For
directional hypotheses, a value of less than or equal to (or greater than or equal to)
some specified value is hypothesized, for example, μ ≤ 100.
The alternative hypothesis (H₁) is the antithetical complement of the H₀. If the
H₀ is μ = 100, the H₁ is μ ≠ 100. If H₀ is μ ≤ 100, H₁ is μ > 100. H₀: μ ≥ 100 implies
H₁: μ < 100. The H₀ and the H₁ are mutually exclusive and exhaustive: one, but not
both, must be true.
Very often the behavioral scientist wants to reject the H₀, for example that
μ ≤ 100 for IQ, hoping to show that the H₀ is false and assert
the H₁. We measure how well the data fit the H₀, given our a priori criterion for α. For an α of .05 or less this will be the
most extreme 5% of the normal curve, split into the two tails, 2.5% in each tail. The
rejection region would then include all values of Z less than or equal to −1.96 or greater
than or equal to +1.96. The nonrejection region would include all values of Z greater
than −1.96 but less than +1.96. The value of the test statistic at the boundary between
the nonrejection and the rejection regions is the critical value. Now we compute the
test statistic and locate it on the sampling distribution. If it falls in the rejection region
we conclude that p is less than or equal to our a priori criterion for α and we reject the
H₀. If it falls in the nonrejection region we conclude that p is greater than our a priori
criterion for α and we do not reject the H₀.
If you report an exact p, such as p = .0198, the reader can make such decisions,
even with α set at .01 or less.
Imagine that our p came out to be .057. Although we would not reject the H₀, it might
be misleading to simply report p > .05, which could lead readers not only to retain the H₀ but to
assert its truth or near truth. Using the traditional method, one would simply report p
> .05 and readers could not discriminate between the case when p = .057 and
that when p = .95.
Please notice that we could have decided whether or not to reject the H₀ on the
basis of the 95% CI we constructed earlier. Since our CC was 95%, we were using an α
of (1 − CC) = .05. Our CI for μ extended from 80.6 to 139.4, which does not include the
hypothesized value of 145. Since we are 95% confident that the true value of μ is
between 80.6 and 139.4, we can also be at least 95% confident (5% α) that μ is not 145
(or any other value less than 80.6 or more than 139.4), and reject the H₀. If our CI
included the value of μ hypothesized in the H₀, as it would if the H₀ were μ = 100,
then we could not reject the H₀.
The CI approach does not give you a p value with which quickly to assess the
likelihood that a Type I error was made. It does, however, give you a CI, which
hypothesis testing does not. I suggest that you give your readers both p and a (1 − α) CI
as well as your decision regarding rejection or nonrejection of the H₀.
You now know that α is the probability of rejecting a H₀ given that it is really
true, and that β is the probability of failing to reject the H₀ given that it is really false.
The lower one sets the criterion for α, the larger β will be, ceteris paribus, so one
should not just set α very low and think e has no chance of making any errors.
Possible Outcomes of Hypothesis Testing (and Their Conditional Probabilities)
IMHO, the null hypothesis is almost always wrong. Think of the alternative
hypothesis as being the signal that one is trying to detect. That signal typically is the
existence of a relationship between two things (events, variables, or linear combinations
of variables). Typically that thing really is there, but there may be too much noise
(variance from extraneous variables and other error) to detect it with confidence, or the
signal may be too weak to detect (like listening for the sound of a pin dropping) unless
almost all noise is eliminated.
                                      The True Hypothesis Is
Decision                        The H₁                  The H₀
Reject H₀, Assert H₁            correct decision        Type I error
                                (power)                 (α)
Retain H₀, Do not assert H₁     Type II error           correct decision
                                (β)                     (1 − α)
Think of the truth state as being two non-overlapping universes. You can be in
only one universe at a time, but may be confused about which one you are in
now.
You might be in the universe where the null hypothesis is true (very unlikely, but
you can imagine being there). In that universe there are only two possible
outcomes: you make a correct decision (do not detect the signal) or a Type I
error (detect a signal that does not exist). You cannot make a Type II error in
this universe.
You might be in the universe where the alternative hypothesis is correct, the
signal you seek to detect really is there. In that universe there are only two
possible outcomes: you make a correct decision or you make a Type II error.
You cannot make a Type I error in that universe.
Beta is the conditional probability of making a Type II error, failing to reject a
false null hypothesis. That is, if the null hypothesis is false (the signal you seek
to find is really there), β is the probability that you will fail to reject the null (you
will not detect the signal).
Power is the conditional probability of correctly rejecting a false null hypothesis.
That is, if the signal you seek to detect is really there, power is the probability
that you will detect it.
Power is greater with
o larger a priori alpha (increasing P(Type I error) also): if you
change how low p must get before you reject the null, you also change
beta and power.
o smaller sampling distribution variance (produced by larger sample size (n)
or smaller population variance): less noise.
o greater difference between the actual value of the tested parameter and
the value specified by the null hypothesis: a stronger signal.
o one-tailed tests (if the predicted direction, specified in the alternative
hypothesis, is correct): paying attention to the likely source of the signal.
o some types of tests (the t test) than others (the sign test): like a better sensory
system.
o some research designs (matched subjects) under some conditions
(matching variable correlated with the DV).
Suppose you are setting out to test a hypothesis. You want to know the
unconditional probability of making an error (Type I or Type II). That probability
depends, in large part, on the probability of being in the one or the other
universe, that is, on the probability of the null hypothesis being true. This
unconditional error probability is equal to α P(H₀ true) + β P(H₁ true).
The 2 x 2 matrix above is a special case of what is sometimes called a confusion
matrix. The reference to confusion has nothing to do with the fact that this matrix
confuses some students. Rather, it refers to the confusion inherent in predicting into
which of two (or more) categories an event falls or will fall. Substituting the language of
signal detection theory for that of hypothesis testing, our confusion matrix becomes:
                              Is the Signal Really There?
Prediction                    Signal is there           Signal is not there
Signal is there               True Positive (Hit)       False Positive
                              (power)                   (α)
Signal is not there           False Negative (Miss)     True Negative
                              (β)                       (1 − α)
Relative Seriousness of Type I and Type II Errors
Imagine that you are testing an experimental drug that is supposed to reduce
blood pressure, but is suspected of inducing cancer. You administer the drug to 10,000
rodents. Since you know that the tumor rate in these rodents is normally 10%, your H₀
is that the tumor rate in drug-treated rodents is 10% or less. That is, the H₀ is that the
drug is safe, it does not increase the cancer rate. The H₁ is that the drug does induce
cancer, that the tumor rate in treated rodents is greater than 10%. [Note that the H₀
always includes an =, but the H₁ never does.] A Type II error (failing to reject the H₀
of safety when the drug really does cause cancer) seems more serious than a Type I
error (rejecting the H₀ of safety when the drug really is safe). Remember that β is the
conditional probability of retaining the H₀ given that it is
false, and Power = 1 − β.
Now suppose we are testing the drug's effect on blood pressure. The H₀ is that
the mean decrease in blood pressure after giving the drug (pre-treatment BP minus
post-treatment BP) is less than or equal to zero (the drug does not reduce BP). The H₁
is that the mean decrease is greater than zero (the drug does reduce BP). Now a Type
I error (claiming the drug reduces BP when it actually does not) is clearly more
dangerous than a Type II error (not finding the drug effective when indeed it is), again
assuming that there are other effective treatments and ignoring things like your boss's
threat to fire you if you don't produce results that support es desire to market the drug.
You would want to set the criterion for α relatively low here.
Directional and Nondirectional Hypotheses
Notice that in these last two examples the H
. When we did a nondirectional test, this was the probability which we doubled
prior to comparing to the criterion for . Since we are now doing a one-tailed test, we
do not double the probability. Not doubling the probability gives us more power, since p
is more likely to be less than or equal to our -criterion if we dont need to double p
before comparing it to . In fact, we could reject the H
might be a Type II
error. Given the Publish or Perish atmosphere at many institutions, researchers may
bias (consciously or not) data collection and analysis. There is also a file drawer
problem. Imagine that each of 20 researchers is independently testing the same true
H₀. Each uses an α-criterion of .05. By chance, we would expect one of the 20 falsely
to reject the H₀. That one would joyfully mail es results off to be published. The other
19 would likely stick their nonsignificant results in a file drawer rather than an
envelope, or, if they did mail them off, they would likely be dismissed as being Type II
errors and would not be published, especially if the current Zeitgeist favored rejection of
that H₀. Keep in mind also that there are H₀ hypotheses
that are almost true, and rejecting them may be as serious as rejecting absolutely true H₀
hypotheses.
Return to Wuensch's Stats Lessons Page
Recommended Reading
Read More About Exact p Values
Much Confusion About p
The History of the .05 Criterion of Statistical Significance
The Most Dangerous Equation: σ(M) = σ/√n
Copyright 2010, Karl L. Wuensch - All rights reserved.
power1.doc
An Introduction to Power Analysis, N = 1
1.
H₀: μ ≤ 100    H₁: μ > 100    N = 1, σ = 15, Normal Distribution
For α = .10, Z critical = 1.28, X critical = 100 + 1.28(15) = 119.2; therefore, we reject H₀
if X ≥ 119.2.
Our chance of rejecting the H₀ is only 10% if the H₀ is true. What if the H₀ is wrong?
Suppose that the true μ is 110. Power = P(X ≥ 119.2 | μ = 110) = P(Z ≥ (119.2 − 110)/15)
= P(Z ≥ .61) = 27%. With the H₀ wrong by
10 points, 2/3 σ, we would have only a 27% chance of rejecting the H₀.
----------------------------------------------------------------------------------------------------------------------------------------
2.
Raise α to .20: Z critical = .84    X critical = 100 + .84(15) = 112.6
Power = P(Z ≥ (112.6 − 110) / 15) = P(Z ≥ .17) = 43%
Increasing α raises power, but at the expense of making Type I errors more
likely.
----------------------------------------------------------------------------------------------------------------------------------------
3.
Compare Nondirectional and Directional Hypotheses
A. Nondirectional: H₀: μ = 100, H₁: μ ≠ 100; with α = .10, reject H₀ if |Z| ≥ 1.645
X critical = 100 + 1.645(15) = 124.68 or 100 − 1.645(15) = 75.32
Assume that the true μ = 110
Power = P(X ≥ 124.68 OR X ≤ 75.32)
P(X ≥ 124.68) = P(Z ≥ (124.68 − 110) / 15) = P(Z ≥ .98) = .1635
P(X ≤ 75.32) = P(Z ≤ (75.32 − 110) / 15) = P(Z ≤ −2.31) = .0104
------
Power = 17%
----------------------------------------------------------------------------------------------------------------------------------------
B. Directional, correct prediction by H₁
See Example 1 on the first page; Power = 27%
C. Directional, incorrect prediction by H₁
H₀: μ ≥ 100    H₁: μ < 100
Z crit = −1.28    X crit = 100 − 1.28(15) = 80.8
Power* = P(X ≤ 80.8 | μ = 110) = P(Z ≤ (80.8 − 110) / 15) = P(Z ≤ −1.95) = 3%
Thus, if you can correctly predict the direction of the difference between the truth and the H₀,
directional hypotheses have a higher probability of rejecting the H₀ than do nondirectional hypotheses.
If you can't, nondirectional hypotheses are more likely to result in rejection of the H₀.
*One could argue that the probability here represented as Power is not Power at all, since the H₀
(μ ≥ 100) is in fact true (μ = 110). Power is the probability of rejecting H₀ given that H₀ is false.
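The power computations in these examples can be verified with a short script (a sketch; the normal CDF is built from `math.erf` rather than looked up in a table):

```python
import math

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu0, mu_true, sigma = 100, 110, 15  # values from the examples above (N = 1)

# Example 1: one-tailed, alpha = .10, reject H0 if X >= 100 + 1.28(15) = 119.2
power_one_tailed = 1 - phi((119.2 - mu_true) / sigma)
print(round(power_one_tailed, 2))  # 0.27

# Example A: nondirectional, alpha = .10, reject H0 if |Z| >= 1.645
hi = mu0 + 1.645 * sigma  # 124.675
lo = mu0 - 1.645 * sigma  # 75.325
power_two_tailed = (1 - phi((hi - mu_true) / sigma)) + phi((lo - mu_true) / sigma)
print(round(power_two_tailed, 2))  # 0.17
```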
Random Variable
A random variable is a real-valued function defined on a sample space.
The sample space is the set of all distinct outcomes possible for an experiment.
Function: two sets (well-defined collections of objects) whose members are paired so that each
member of the one set (the domain) is paired with one and only one member of the other set
(the range), although elements of the range may be paired with more than one element of the
domain.
The domain is the sample space; the range is a set of real numbers. A random variable is the
set of pairs created by pairing each possible experimental outcome with one and only one real
number.
Examples: a.) the outcome of rolling a die: ⚀ = 1, ⚁ = 2, ⚂ = 3, etc. (each outcome is paired
with only one number, and vice versa); b.) ⚀ = 1, ⚁ = 2, ⚂ = 1, etc. (each outcome is paired
with only one number (odd vs. even), but not vice versa); c.) the weight of each student in my
statistics class.
Probability and Probability Experiments
A probability experiment is a well-defined act or process that leads to a single well-defined
outcome. Example: toss a coin (H or T); roll a die.
The probability of an event, P(A), is the fraction of times that event will occur in an indefinitely
long series of trials of the experiment. This may be estimated:
Empirically: conduct the experiment many times and compute P(A) = N(A) / N(total), the sample
relative frequency of A. Roll a die 1000 times; even numbers appear 510 times; P(even) =
510/1000 = .51 or 51%.
Rationally or Analytically: make certain assumptions about the probabilities of the
elementary events included in outcome A and compute probability by rules of probability.
Assume each event 1, 2, 3, 4, 5, 6 on a die is equally likely. The sum of the probabilities of all
possible events must equal one. Then P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6. P(even)
= P(2) + P(4) + P(6) = 1/6 + 1/6 + 1/6 = 1/2 (addition rule) or 50%.
Subjectively: a measure of an individual's degree of belief assigned to a given event in
whatever manner. I think that the probability that ECU will win its opening game of the season
is 1/3 or 33%. This means I would accept 2:1 odds against ECU as a fair bet (if I bet $1 on
ECU and they win, I get $2 in winnings).
Independence, Mutual Exclusion, and Mutual Exhaustion
Two events are independent iff (if and only if) the occurrence or non-occurrence of the one
has no effect on the occurrence or non-occurrence of the other.
Two events are mutually exclusive iff the occurrence of the one precludes occurrence of the
other (both cannot occur simultaneously on any one trial).
Two or more events are mutually exhaustive iff, taken together, they include all of the possible
outcomes.
Now suppose I decide that order is not important, that is, I don't care whether the chocolate is
atop the vanilla or vice versa, etc. This is a combinations problem. Since the number of ways of
arranging X objects is X!, I simply divide the number of permutations by X!. That is, I find
N! / ((N − X)! X!) = 10! / (6! 4!) = (10 · 9 · 8 · 7 · 6!) / (6! · 4 · 3 · 2 · 1) = 210.
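The same count is available in Python's standard library; the sketch below mirrors the divide-the-permutations-by-X! argument:

```python
import math

n, x = 10, 4  # 10 flavors, 4 scoops, order irrelevant

# number of permutations divided by x! equals n! / ((n - x)! * x!)
combinations = math.perm(n, x) // math.factorial(x)
print(combinations)  # 210
assert combinations == math.comb(n, x)  # the library computes the same count
```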
Suppose I am assigning ID numbers to the employees at my ice cream factory. Each employee
will have a two-digit number. How many different two-digit numbers can I generate? A rule I can
apply is C^L, where C is the number of different characters available (10) and L is the length of the ID
number. There are 10² = 100 different ID numbers, but you already knew that. Suppose I decide to
use letters instead. There are 26 different characters in the alphabet we use, so there are 26² = 676
different two-character ID strings. Now suppose I decide to use both letters (A through Z) and numerals (0
through 9). Now there are 36² = 1,296 different two-character ID strings. Now suppose I decide to
use one- and two-character strings. Since there are 36 different one-character strings, I have 1,296 +
36 = 1,332 different ID strings of not more than two characters.
If I up the maximum number of characters to three, that gives me an additional 36³ = 46,656
strings, for a total of 46,656 + 1,332 = 47,988 different strings of one to three characters. Up it to one
to four characters and I have 1,679,616 + 47,988 = 1,727,604. Up it to one to five characters and you
get 60,466,176 + 1,727,604 = 62,193,780. Up it to one to six characters and we have 2,176,782,336
+ 62,193,780 = 2,238,976,116 different strings of from one to six characters. I guess it will be a while
until tinyurl needs to go to seven-character strings for their shortened URLs.
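The running totals above are easy to regenerate (a sketch assuming the 36-character alphabet used in the text):

```python
chars = 36  # A through Z plus 0 through 9

total = 0
for length in range(1, 7):
    total += chars ** length
    print(length, total)  # cumulative count of strings of length 1 .. length
```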
Probability FAQ - answers to frequently asked questions.
Return to Wuensch's Stats Lessons
Copyright 2012, Karl L. Wuensch - All rights reserved.
Binomial.docx
Testing Hypotheses with the Binomial Probability Distribution
A Binomial Experiment:
consists of n identical trials.
each trial results in one of two outcomes, a success or a failure.
the probabilities of success ( p ) and of failure ( q = 1 - p ) are constant across
trials.
trials are independent, not affected by the outcome of other trials.
Y is the number of successes in n trials.
P(Y = y) = [n! / (y! (n − y)!)] pʸ qⁿ⁻ʸ
P(Y = y) may also be determined by reference to a binomial table.
The binomial distribution has:
o mean μ = np
o variance σ² = npq
Testing Hypotheses
State null and alternative hypotheses
o the null hypothesis specifies the value of some population parameter. For
example, p = .5 (two-tailed, nondirectional; this coin is fair) or p ≤ .25
(one-tailed, directional; the student is merely guessing on a 4-choice multiple-
choice test).
o the alternative hypothesis, which the researcher often wants to support, is
the antithetical complement of the null. For example, p ≠ .5 (two-tailed,
the coin is biased) or p > .25 (one-tailed, the student is not merely
guessing, e knows the tested material).
Specify the sampling (probability) distribution and the test statistic (Y). Example:
the binomial distribution describes the probability that a single sample of n trials
would result in (Y = y) successes (if assumptions of binomial are true).
Set alpha at a level determined by how great a risk of a Type I error (falsely
rejecting a true null) you are willing to take. Traditional values of alpha are .05
and .01.
o For a one-tailed test
H₀: p ≤ .5; H₁: p > .5; n = 25, Y = 18;
P(Y ≥ 18 | n = 25, p = .5) = .022
APA-style summary: Mothers were allowed to smell two
articles of infants' clothing and asked to pick the one which
was their infant's. They were successful in doing so 72% of
the time, significantly more often than would be expected by
chance, exact binomial p (one-tailed) = .022.
H₀: p ≥ .5; H₁: p < .5; n = 25, Y = 18;
P(Y ≤ 18 | n = 25, p = .5) = .993 (note that the direction of
the P(Y ≤ y) matches that of H₁)
o For a two-tailed test
Compute a one-tailed P and double it.
H₀: p = .5; H₁: p ≠ .5; n = 25, Y = 18;
2P(Y ≥ 18) = 2(.022) = .044
H₀: p = .5; H₁: p ≠ .5; n = 25, Y = 7;
2P(Y ≤ 7) = 2(.022) = .044 (the direction of the P(Y ≤ y) is that which
gives the smaller p value; P(Y ≥ 7) = .993 and 2(.993) = 1.986,
obviously not a possible p).
If p ≤ alpha, reject H₀.
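The exact binomial p values used above can be computed directly (a sketch using only the standard library):

```python
from math import comb

def binom_p_upper(y, n, p=0.5):
    """Exact P(Y >= y | n, p), the one-tailed upper binomial probability."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(y, n + 1))

# Mothers identifying their infant's clothing: Y = 18 successes in n = 25 trials
p_one = binom_p_upper(18, 25)
print(round(p_one, 3))  # 0.022
# Two-tailed p: double the one-tailed value. This prints 0.043; the text's
# .044 comes from doubling the already-rounded .022.
print(round(2 * p_one, 3))  # 0.043
```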
If one has a relatively large sample (large enough to use a normal approximation
of the binomial parameter p), then one can construct a confidence interval about one's
estimate of the population proportion by using the following formula: p ± Z(α/2) √(pq/n). For
example, suppose we wish to estimate the proportion of persons who would vote for a
guilty verdict in a particular sexual harassment case. We shall use the data from a
study by Egbert, Moore, Wuensch, and Castellow (1992, Journal of Social Behavior
and Personality, 7: 569-579). Of 160 mock jurors of both sexes, 105 voted guilty and
55 voted not guilty. Our point estimate of the population proportion is simply our
sample proportion, 105 / 160 = .656. Is n large enough (given p and q) to use our
normal approximation, that is, is np ± 2√(npq) (which is essentially a 95% confidence
interval for the number of successes) within 0 and n? If we construct a 95% confidence
interval for p and the interval is within 0 and 1, then the normal approximation is OK. For a
95% confidence interval we compute:
.656 ± 1.96 √(.656(.344)/160) = .656 ± .074 → .582, .730.
Suppose we look at the proportions separately for female and male jurors.
Among the 80 female jurors, 58 (72.5%) voted guilty. For a 95% confidence interval we
compute: .725 ± 1.96 √(.725(.275)/80) = .725 ± .098 → .627, .823.
Among the 80 male jurors, 47 (58.8%) voted guilty. For a 95% confidence
interval we compute: .588 ± 1.96 √(.588(.412)/80) = .588 ± .108 → .480, .696. Do notice
that the confidence interval for the male jurors overlaps the confidence interval for the
female jurors.
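These three intervals can be reproduced with a small helper (a sketch; tiny last-digit differences from the text arise because the text rounds the sample proportion to three decimals before computing):

```python
import math

def prop_ci(successes, n, z=1.96):
    """Normal-approximation CI for a proportion: p-hat +/- z * sqrt(pq/n)."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return round(p - half, 3), round(p + half, 3)

print(prop_ci(105, 160))  # all jurors, near (.582, .730)
print(prop_ci(58, 80))    # female jurors: (0.627, 0.823)
print(prop_ci(47, 80))    # male jurors, near (.480, .696)
```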
There are several online calculators that will construct a confidence interval
around a proportion or percentage. Try the one at
http://www.dimensionresearch.com/resources/calculators/conf_prop.html . In Step 1
select the desired degree of confidence (95%). In Step 2 enter the total sample size.
In Step 3 enter the number of successes or the percentage of successes. Click
Calculate and you get the confidence interval for the percentage. If you prefer a
Bayesian approach, try the calculator at
http://www.causascientia.org/math_stat/ProportionCI.html .
The probability density function defining the chi-square distribution is given in the
chapter on Chi-square in Howell's text. Do not fear, we shall not have to deal directly
with that formula. You should know, however, that given that function, the mean of the
chi-square distribution is equal to its degrees of freedom and the variance is twice the
mean.
The chi-square distribution is closely related to the normal distribution. Imagine
that you have a normal population. Sample one score from the normal population and
compute Z² = (Y − μ)² / σ². Record that Z² and then sample another score, compute and
record another Z², repeating this process an uncountably large number of times. The
resulting distribution is a chi-square distribution on one degree of freedom.
Now, sample two scores from that normal distribution. Convert each into Z² and
then sum the two scores. Record the resulting sum. Repeat this process an
uncountably large number of times and you have constructed the chi-square distribution
on two degrees of freedom. If you used three scores in each sample, you would have
chi-square on three degrees of freedom. In other words,
χ²(n) = Σᵢ₌₁ⁿ Zᵢ² = Σ(Y − μ)² / σ².
Now, from the definition of variance, you know that the
numerator of this last expression, Σ(Y − μ)², is the sum of squares, the numerator of
the ratio we call a variance (sum of squares divided by n). From sample data we
estimate the population variance with the sample sum of squares divided by degrees of
freedom, (n − 1). That is, s² = Σ(Y − Ȳ)² / (n − 1). Multiplying both sides of this expression by (n
− 1), we see that Σ(Y − Ȳ)² = (n − 1)s². Taking our chi-square formula and substituting
(n − 1)s² for Σ(Y − Ȳ)², we obtain χ² = (n − 1)s² / σ², which can be useful for testing null
hypotheses about variances. You could create a chi-square distribution using this
modified formula: for chi-square on (n − 1) degrees of freedom, sample n scores from a
normal distribution, compute the sum of squares of that sample, divide by the known
population variance, and record the result. Repeat this process an uncountably large
number of times.
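The construction described above is easy to simulate (a sketch; with a fixed seed the sample mean and variance come out close to df and 2·df, the values given for the distribution's mean and variance):

```python
import random

random.seed(42)

def chisq_draw(df):
    """One draw from chi-square(df): the sum of df squared standard normals."""
    return sum(random.gauss(0, 1) ** 2 for _ in range(df))

df, reps = 10, 20000
draws = [chisq_draw(df) for _ in range(reps)]
mean = sum(draws) / reps
var = sum((x - mean) ** 2 for x in draws) / reps
print(round(mean, 1), round(var, 1))  # close to 10 and 20
```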
Given that the chi-square distribution is a sum of squared z-scores, and knowing
what you know about the standard normal distribution (mean and median are zero), for
chi-square on one df, what is the most probable value of chi-square (0)? What is the
smallest possible value (0)? Is the distribution skewed? In what direction (positive)?
Now, consider chi-square on 10 degrees of freedom. The only way you could get
a chi-square of zero is if each of the 10 squared z-scores were exactly zero. While zero
is still the most likely value for z from the standard normal distribution, it is not likely that
all 10 z-scores would be exactly zero.
Suppose we wish to test the null hypothesis that the variance in the heights of
male high school varsity basketball players is at least as great as that of the general
population of adult men, σ² = 6.25, based on a sample of N = 31 players:
H₀: σ² ≥ 6.25    H₁: σ² < 6.25
We compute the sample variance and find it to be 4.55. We next compute the
value of the test statistic, chi-square. [If we were repeatedly to sample 31 scores from
a normal population with a variance of 6.25 and on each compute (N − 1)S² / 6.25, we
would obtain the chi-square distribution on 30 df.]
χ² = df(S²) / σ², where df = N − 1
χ² = 30(4.55) / 6.25 = 21.84
The expected value of the chi-square (the mean of the sampling distribution)
were the null hypothesis true is its degrees of freedom, 30. Our computed chi-square
is less than that, but is it enough less than that for us to be confident in rejecting the null
hypothesis? We now need to obtain the p value. Since our alternative hypothesis
specified a < sign, we need to find P(χ² < 21.84 | df = 30). We go to the chi-square table,
which is a one-tailed, upper-tailed table (in Howell). For 30 df, 21.84 falls between
20.60, which marks off the upper 90%, and 24.48, which marks off the upper 75%.
Thus, the upper-tailed p is .75 < p < .90. But we need a lower-tailed p, given our
alternative hypothesis. To obtain the desired lower-tailed p, we simply subtract the
upper-tailed p from unity, obtaining .10 < p < .25. [If you integrate the chi-square
distribution you obtain the exact p = .14.] Using the traditional .05 criterion, we are
unable to reject the null hypothesis.
Our APA-style summary reads: A one-tailed chi-square test indicated that the
heights of male high school varsity basketball players (s² = 4.55) were not significantly
less variable than those of the general population of adult men (σ² = 6.25), χ²(30, N =
31) = 21.84, p = .14. I obtained the exact p from SAS. Note that I have specified the
variable (height), the subjects (basketball players), the status of the null hypothesis (not
rejected), the nature of the test (directional), the parameter of interest (variance), the
value of the relevant sample statistic (s²), the test statistic (χ²), the degrees of freedom
and N, the computed value of the test statistic, and an exact p. The phrase not
significantly less implies that I tested directional hypotheses, but I chose to be explicit
about having conducted a one-tailed test.
For a two-tailed test of nondirectional hypotheses, one simply doubles the
one-tailed p. If the resulting two-tailed p comes out above 1.0, as it would if you
doubled the upper-tailed p from the above problem, then you need to work with the
(doubled) p from the other tail. For the above problem the two-tailed p is .20 < p < .50.
An APA summary statement would read: A two-tailed chi-square test indicated that the
variance of male high school varsity basketball players' heights (s² = 4.55) was not
significantly different from that of the general population of adult men (σ² = 6.25), χ²(30,
N = 31) = 21.84, p = .28. Note that with a nonsignificant result my use of the phrase
not significantly different implies nondirectional hypotheses.
Suppose we were testing the alternative hypothesis that the population variance
is greater than 6.25. Assume we have a sample of 101 heights of men who have been
diagnosed as having one or more of several types of pituitary dysfunction. The
obtained sample variance is 7.95, which differs from 6.25 by the same amount, 1.7, that
our previous sample variance, 4.55, did, but in the opposite direction. Given our larger
sample size this time, we should expect to have a better chance of rejecting the null
hypothesis. Our computed chi-square is 127.2, yielding an (upper-tail) p of .025 < p <
.05, enabling us to reject the null hypothesis at the .05 level. Our APA-style summary
statement reads: A one-tailed chi-square test indicated that the heights of men with
pituitary dysfunction (s² = 7.95) were significantly more variable than those of the
general population of men (σ² = 6.25), χ²(100, N = 101) = 127.2, p = .034. Since I
rejected the null hypothesis (a significant result), I indicated the direction of the
obtained effect (significantly more variable than ...). Note that if we had used
nondirectional hypotheses our two-tailed p would be .05 < p < .10 and we could not
reject the null hypothesis with the usual amount of confidence (.05 criterion for α). In
that case my APA-style summary statement would read: A two-tailed chi-square test
indicated that the variance in the heights of men with pituitary dysfunction (s² = 7.95)
was not significantly different from that of the general population of men (σ² = 6.25),
χ²(100, N = 101) = 127.2, p = .069.
We can also place confidence limits on our estimation of a population variance.
For a 100(1 − α)% confidence interval for the population variance, compute:
( (N − 1)s² / b , (N − 1)s² / a )
where a and b are the α/2 and 1 − (α/2) fractiles of the chi-square distribution on
(N − 1) df. For example, for our sample of 101 pituitary patients, for a 90% confidence
interval, the .05 fractile (the value of chi-square marking off the lower 5%) is 77.93, and
the .95 fractile is 124.34. The confidence interval is 100(7.95)/124.34, 100(7.95)/77.93,
or 6.39 to 10.20. In other words, we are 90% confident that the population variance is
between 6.39 and 10.20. Technically, the interpretation of the confidence coefficient
(90%) is this: were we to repeatedly draw random samples and for each construct a
90% confidence interval, 90% of those intervals would indeed include the true value of
the estimated parameter (in this case, the population variance).
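Both the test statistic and the interval above can be checked in a few lines (a sketch; the fractiles 77.93 and 124.34 are taken from the text rather than computed):

```python
# Chi-square test of H0: sigma^2 >= 6.25 for the basketball-player sample
n, s2, sigma2 = 31, 4.55, 6.25
chi2 = (n - 1) * s2 / sigma2
print(round(chi2, 2))  # 21.84

# 90% CI for the variance of the pituitary sample (n = 101, s2 = 7.95);
# 77.93 and 124.34 are the .05 and .95 fractiles of chi-square on 100 df
n2, s2_2 = 101, 7.95
lower = (n2 - 1) * s2_2 / 124.34
upper = (n2 - 1) * s2_2 / 77.93
print(round(lower, 2), round(upper, 2))  # 6.39 10.2
```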
Please note that the application of chi-square for tests about variances is not
robust to violation of the normality assumption made when using such applications. When a
statistic is robust to violation of one of its assumptions, one can violate that
assumption considerably and still have a valid test.
Chi-Square Approximation of the Binomial Distribution
χ²(1) = (Y − μ)² / σ², where Y is from a normal population.
Consider Y = # of successes in a binomial experiment. With n large enough that
np ± 2√(npq) falls between 0 and n, the binomial distribution should be approximately
normal. Thus,
χ²(1) = (Y − np)² / (npq),
which can be shown to equal
(Y − np)² / (np) + [(n − Y) − nq]² / (nq).
Here is a proof (the not-so-obvious algebra referred to by Howell):
(n − Y) − nq = n − Y − n(1 − p) = np − Y, and since (a − b)² = (b − a)²,
[(n − Y) − nq]² = (np − Y)² = (Y − np)².
Thus,
(Y − np)² / (np) + [(n − Y) − nq]² / (nq) = q(Y − np)² / (npq) + p(Y − np)² / (npq)
= (Y − np)² / (npq), since q + p = 1.
Substituting O₁ for the number of successes, O₂ for the number of failures, and
E₁ = np and E₂ = nq for the expected frequencies,
χ²(1) = (O₁ − E₁)² / E₁ + (O₂ − E₂)² / E₂ = Σ (O − E)² / E.
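The algebra above can be checked numerically. This quick sketch simply verifies the identity for every possible Y at one arbitrary choice of n and p:

```python
# Verify (Y - np)^2/(npq) = (Y - np)^2/np + ((n - Y) - nq)^2/nq for all Y
n, p = 20, 0.3
q = 1 - p
for Y in range(n + 1):
    lhs = (Y - n * p) ** 2 / (n * p * q)
    rhs = (Y - n * p) ** 2 / (n * p) + ((n - Y) - n * q) ** 2 / (n * q)
    assert abs(lhs - rhs) < 1e-9
```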
The Correction for Continuity (Yates Correction) When Using Chi-square to
Approximate a Binomial Probability
Suppose that we wish to test the null hypothesis that 50% of ECU students favor
tuition increases to fund the acquisition of additional computers for student use at ECU.
The data are: in a random sample of three, not a single person favors the increase.
The null hypothesis is that binomial p = .50. The two-tailed exact significance level
(using the multiplication rule of probability) is 2 × .5³ = .25.
Using the chi-square distribution to approximate this binomial probability,
χ² = Σ (O − E)² / E = (0 − 1.5)² / 1.5 + (3 − 1.5)² / 1.5 = 3.00, p = .0833, not a very good
approximation. Remember that a one-tailed p is appropriate for nondirectional
hypotheses with this test, since the computed chi-square increases with increasing
(O - E) regardless of whether O > E or O < E.
Using the chi-square distribution with Yates correction for continuity:
χ² = Σ (|O − E| − .5)² / E = (|0 − 1.5| − .5)² / 1.5 + (|3 − 1.5| − .5)² / 1.5 = 1.33,
p = .25, a much better approximation.
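The three p values above can be reproduced side by side. A hedged Python sketch (the handout itself does this by hand); it assumes SciPy ≥ 1.7 for `binomtest`:

```python
from scipy.stats import binomtest, chi2

observed = [0, 3]       # 0 favor the increase, 3 disfavor
expected = [1.5, 1.5]   # under the null, binomial p = .50

x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
p_plain = chi2.sf(x2, df=1)            # ~.0833, a poor approximation

x2_yates = sum((abs(o - e) - 0.5) ** 2 / e for o, e in zip(observed, expected))
p_yates = chi2.sf(x2_yates, df=1)      # ~.248, much closer to exact

p_exact = binomtest(0, 3, 0.5).pvalue  # exact two-tailed p = .25
print(p_plain, p_yates, p_exact)
```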
Half-Tailed Tests
Suppose that you wanted to test directional hypotheses, with the alternative
hypothesis being that fewer than 50% of ECU students favor the increased tuition. For
the binomial p you would simply not double the one-tailed P(Y ≤ 0). For a directional
chi-square, with the direction correctly predicted in the alternative hypothesis, you take
the one-tailed p that is appropriate for a nondirectional test and divide it by the number
of possible orderings of the categorical frequencies. For this problem, we could have
had more favor than disfavor or more disfavor than favor, two possible orderings. This
is really just an application of the multiplication rule of probability. The one-tailed p₁
gives you the conditional probability of obtaining results as or more discrepant with the
null than are those you obtained. The probability of correctly guessing the direction of
the outcome, p₂, is 1/2. The joint probability of getting results as unusual as those you
obtained AND in the predicted direction is p₁p₂.
One-Sixth Tailed Tests
What if there were three categories, favor, disfavor, and don't care, and you
correctly predicted that the greatest number of students would disfavor, the next
greatest number would not care, and the smallest number would favor? [The null
hypothesis from which you would compute expected frequencies would be that 1/3
favor, 1/3 disfavor, and 1/3 don't care.] In that case you would divide your one-tailed p
by 3! = 6, since there are 6 possible orderings of three things.
The basic logic of the half-tailed and sixth-tailed tests presented here was
outlined by David Howell in the fourth edition of his Statistical Methods for Psychology
text, page 155. It can be generalized to other situations, for example, a one-way
ANOVA where one predicts a particular ordering of the group means.
Multicategory One-Way Chi Square
Suppose we wish to test the null hypothesis that Karl Wuensch gives twice as
many C's as B's, twice as many B's as A's, just as many D's as B's, and just as many
F's as A's in his undergraduate statistics classes. We decide on a nondirectional test
using a .05 criterion of significance. The observed frequencies are: A: 6, B: 24, C: 50,
D: 10, F: 10. Under this null hypothesis, given a total N of 100, the expected
frequencies are: 10, 20, 40, 20, 10, and χ² = 1.6 + 0.8 + 2.5 + 5 + 0 = 9.9; df = K − 1 = 4,
p = .042. We reject the null hypothesis.
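The goodness-of-fit computation above can be sketched in one call. A hedged Python illustration (SciPy assumed; the handout does not use Python):

```python
from scipy.stats import chisquare

observed = [6, 24, 50, 10, 10]    # A, B, C, D, F
expected = [10, 20, 40, 20, 10]   # H0 ratios 1:2:4:2:1 over N = 100
result = chisquare(observed, f_exp=expected)
print(result.statistic, result.pvalue)  # 9.9 and about .042
```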
There are additional analyses you could do to determine which parts of the null
hypothesis are (significantly) wrong. For example, under the null hypothesis one
expects that 10% of the grades will be A's. Six A's were observed. You could do a
binomial test of the null hypothesis that the proportion of A's is .10. Your two-tailed p
would be two times the probability of obtaining 6 or fewer A's if n = 100 and p = .10.
As an example of another approach, you could test the hypothesis that there are twice
as many C's as B's. Restricting your attention to the 50 + 24 = 74 C's and B's, you
would expect 2/3(74) = 49.33 C's and 1/3(74) = 24.67 B's. A one df chi-square (or an
exact binomial test) could be used to test this part of the omnibus null hypothesis.
Pearson Chi-Square Test for Contingency Tables.
For the dichotomous variables A and B, consider the below joint frequency
distribution [joint frequencies in the cells, marginal frequencies in the margins]. Imagine
that your experimental units are shoes belonging to members of a commune, that
variable A is whether the shoe belongs to a woman or a man, and that variable B is
whether the shoe has or has not been chewed by the dog that lives with the commune.
One of my graduate students actually had data like these for her 6430 personal data set
years ago. The observed cell counts are in bold font.
A = Gender of Shoe Owner
B = Chewed? Female Male
Yes 10 (15) 20 (15) 30
No 40 (35) 30 (35) 70
50 50 100
We wish to test the null hypothesis that A is independent of (not correlated with) B.
The marginal probabilities of being chewed are .3 chewed, .7 not. The marginal
probabilities for gender of the owner are .5, .5.
Using the multiplication rule to find the joint probability of (A = a) and (B = b),
assuming independence of A and B (the null hypothesis), we obtain .5(.3) = .15 and
.5(.7) = .35.
Multiplying each of these joint probabilities by the total N, we obtain the expected
frequencies, which I have entered in the table in parentheses. A short cut method to get
these expected frequencies is: For each cell, multiply the row marginal frequency by
the column marginal frequency and then divide by the total table N. For example, for
the upper left cell, E = 30(50)/100 = 15.
χ² = Σ (O − E)² / E = (10 − 15)² / 15 + (20 − 15)² / 15 + (40 − 35)² / 35 + (30 − 35)² / 35
= 4.762.
Shoes owned by male members of the commune were significantly more likely to
be chewed by the dog (40%) than were shoes owned by female members of the
commune (20%), χ²(1, N = 100) = 4.762, p = .029, odds ratio = 2.67, 95% CI [1.09,
6.02].
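The table analysis above can be reproduced with a contingency-table routine. A hedged Python sketch (SciPy assumed), shown without the Yates correction to match the hand computation:

```python
from scipy.stats import chi2_contingency

#            Female  Male
table = [[10, 20],   # chewed
         [40, 30]]   # not chewed

stat, p, df, expected_counts = chi2_contingency(table, correction=False)
print(stat, p)                      # 4.762 and about .029

odds_ratio = (20 / 30) / (10 / 40)  # male odds of a chewed shoe / female odds
print(odds_ratio)                   # about 2.67
```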
Yates Correction in 2 x 2 Contingency Tables
Don't make this correction unless you find yourself in the situation of having both
sets of marginals fixed rather than random. By fixed marginals, I mean that if you were
to repeat the data collection the marginal probabilities would be exactly the same. This
is almost never the case. There is one circumstance when it would be the case:
suppose that you dichotomized two continuous variables using a median split and then
ran a 2 x 2 chi-square. On each of the dichotomous variables each marginal probability
would be .5, and that would remain unchanged if you gathered the data a second time.
Misuses of the Pearson Chi-square
Independence of Observations. The observations in a contingency table
analyzed with the chi-square statistic are assumed to be independent of one another. If
they are not, the chi-square test is not valid. A common way in which this assumption is
violated is to count subjects in more than one cell. When I was studying ethology at
Miami University I attended a paper session where a graduate student was looking at
how lizards move in response to lighting conditions. He had a big terrarium with three
environmentally different chambers. Each day he counted how many lizards were in
each chamber and he repeated this observation each night. He conducted a Time of
Day x Chamber chi-square. Since each lizard was counted more than once, this
analysis was invalid.
Inclusion of Nonoccurrences. Every subject must be counted once and only
once in your contingency table. When dealing with a dichotomous variable, an ignorant
researcher might do a one-way analysis, excluding observations at one of the levels of
the dichotomous variable. Here is the Permanent Daylight Savings Time Attitude x
Rural/Urban example in Howell.
Twenty urban residents and twenty rural residents are asked whether or not they
favor making DST permanent, rather than changing to and from it annually: 17 rural
residents favor making DST permanent, 11 urban residents do. An inappropriate
analysis is a one-way χ² with expected probability of favoring DST the same for rural as
for urban residents.
        O    E    (|O − E| − .5)² / E
Rural   17   14   .4464
Urban   11   14   .4464
χ²(1, N = 28) = 0.893, p = .35
The appropriate analysis would include those who disfavor permanent DST.
            Favor Permanent DST
Residence    No    Yes
Rural         3    17
Urban         9    11
χ²(1, N = 40) = 4.29, p = .038
Normality. For the binomial or multinomial distribution to be approximately
normal, the sample size must be fairly large. Accordingly, there may be a problem with
chi-square tests done with small cell sizes. Your computer program may warn you if
many of the expected frequencies are small. You may be able to eliminate small
expected frequencies by getting more data, collapsing across (combining) categories, or
eliminating a category. Please do note that the primary effect of having small expected
frequencies is a reduction in power. If your results are significant in spite of having
small expected frequencies, there really is no problem, other than your being less
precise when specifying the magnitude of the effect than you would be if you had more
data.
Likelihood Ratio Tests
In traditional tests of significance, one obtains a significance level by computing
the probability of obtaining results as or more discrepant with the null hypothesis than
are those which were obtained. In a likelihood ratio test the approach is a bit different.
We obtain two likelihoods: The likelihood of getting the data that we did obtain were the
null hypothesis true, and the likelihood of getting the data we got under the exact
alternative hypothesis that would make our sample data as likely as possible. For
example, if we were testing the null hypothesis that half of the students at ECU are
female, p = .5, and our sample of 100 students included 65 women, then the alternative
hypothesis would be p = .65. When the alternative likelihood is much greater than the
null likelihood, we reject the null. We shall encounter such tests when we study
log-linear models next semester, which we shall employ to conduct multidimensional
contingency table analysis (where we have more than two categorical variables in our
contingency table).
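For the binomial example just given, the likelihood ratio statistic can be sketched as the familiar G statistic, G = 2 Σ O ln(O/E), which equals twice the log of the ratio of the two likelihoods. A hedged Python illustration (SciPy assumed for the p value):

```python
from math import log
from scipy.stats import chi2

# 65 women in a sample of 100; H0: p = .5 (expected 50/50).
# The alternative uses p-hat = .65, which maximizes the likelihood of the data.
observed = [65, 35]
expected = [50, 50]
G = 2 * sum(o * log(o / e) for o, e in zip(observed, expected))
p = chi2.sf(G, df=1)
print(G, p)  # G is about 9.14; the null is rejected
```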
Strength of Effect Estimates
I find phi an appealing estimate of the magnitude of effect of the relationship
between two dichotomous variables, and Cramér's phi appealing for use with tables
where at least one of the variables has more than two levels.
Odds ratios can also be very useful. Consider the results of some of my
research on attitudes about animals (Wuensch, K. L., & Poteat, G. M. Evaluating the
morality of animal research: Effects of ethical ideology, gender, and purpose. Journal of
Social Behavior and Personality, 1998, 13, 139-150). Participants were pretending to be
members of a university research ethics committee charged with deciding whether or
not to stop a particular piece of animal research which was alleged, by an animal rights
group, to be evil. After hearing the evidence and arguments of both sides, 140 female
participants decided to stop the research and 60 decided to let it continue. That is, the
odds that a female participant would stop the research were 140/60 = 2.33. Among
male participants, 47 decided to stop the research and 68 decided to let it continue, for
odds of 47/68 = 0.69. The ratio of these two odds is 2.33 / .69 = 3.38. In other words,
the women were more than 3 times as likely as the men to decide to stop the research.
Why form ratios of odds rather than ratios of probabilities? See my document
Odds Ratios and Probability Ratios.
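The odds-ratio arithmetic above is simple enough to sketch directly:

```python
# Odds ratio from the animal-research decisions reported above
odds_women = 140 / 60   # women: stop vs continue -> about 2.33
odds_men = 47 / 68      # men: stop vs continue -> about 0.69
odds_ratio = odds_women / odds_men
print(odds_ratio)       # about 3.38
```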
The Cochran-Mantel-Haenzel Statistic
New to the 7th edition of David Howell's Statistical Methods for Psychology
(2010) is coverage of the CMH statistic (pages 157-159). Howell provides data from a
1973 study of sex discrimination in graduate admissions at UC Berkeley. In Table
6.8 are data for each of six academic departments. For each department we have the
frequencies for a 2 x 2 contingency table, sex/gender of applicant by admissions
decision. At the bottom of this table are the data for a contingency table collapsing
across departments B through F and excluding data from department A. The data from
A were excluded because the relationship between sex and decision differed notably in
this department from what it was in the other departments.
The contingency tables for departments B through F are shown below. To the
right of each I have typed in the odds ratio showing how much more likely women were
to be admitted (compared to men). Notice that none of these differs much from 1 (men
and women admitted at the same rates).
The FREQ Procedure
Table 1 of Sex by Decision
Controlling for Dept=B
Sex Decision
Frequency
Row Pct Accept Reject Total
F 17 8 25 OR = (17/8)/(353/207)
68.00 32.00 = 1.25
M 353 207 560
63.04 36.96
Total 370 215 585
Table 2 of Sex by Decision
Controlling for Dept=C
Sex Decision
Frequency
Row Pct Accept Reject Total
F 202 391 593 OR = 0.88
34.06 65.94
M 120 205 325
36.92 63.08
Total 322 596 918
Table 3 of Sex by Decision
Controlling for Dept=D
Sex Decision
Frequency
Row Pct Accept Reject Total
F 131 244 375 OR = 1.09
34.93 65.07
M 138 279 417
33.09 66.91
Total 269 523 792
-------------------------------------------------------------------------------------------------
Table 4 of Sex by Decision
Controlling for Dept=E
Sex Decision
Frequency
Row Pct Accept Reject Total
F 94 299 393 OR = 0.82
23.92 76.08
M 53 138 191
27.75 72.25
Total 147 437 584
Table 5 of Sex by Decision
Controlling for Dept=F
Sex Decision
Frequency
Row Pct Accept Reject Total
F 24 317 341 OR = 1.21
7.04 92.96
M 22 351 373
5.90 94.10
Total 46 668 714
-------------------------------------------------------------------------------------------------
The CMH statistic is designed to test the hypothesis that there is no relationship
between rows and columns when you average across two or more levels of a third
variable (departments in this case). As you can see below, the data fit well with that
null.
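The CMH statistic can be sketched by hand from the five departmental tables. A hedged Python illustration (SciPy assumed for the p value); for each stratum it accumulates the deviation of the upper-left cell from its expectation and the hypergeometric variance of that cell:

```python
from scipy.stats import chi2

# Each table: [[F_accept, F_reject], [M_accept, M_reject]]
tables = [
    [[17, 8], [353, 207]],     # Dept B
    [[202, 391], [120, 205]],  # Dept C
    [[131, 244], [138, 279]],  # Dept D
    [[94, 299], [53, 138]],    # Dept E
    [[24, 317], [22, 351]],    # Dept F
]

num = 0.0  # sum over strata of (observed - expected) for the upper-left cell
den = 0.0  # sum over strata of the hypergeometric variance of that cell
for (a, b), (c, d) in tables:
    n = a + b + c + d
    num += a - (a + b) * (a + c) / n
    den += (a + b) * (c + d) * (a + c) * (b + d) / (n ** 2 * (n - 1))

cmh = num ** 2 / den  # no continuity correction, matching SAS's table scores
p = chi2.sf(cmh, df=1)
print(cmh, p)         # about 0.125 and .724, agreeing with the SAS output
```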
Summary Statistics for Sex by Decision
Controlling for Dept
Cochran-Mantel-Haenszel Statistics (Based on Table Scores)
Statistic Alternative Hypothesis DF Value Prob
1 Nonzero Correlation 1 0.1250 0.7237
2 Row Mean Scores Differ 1 0.1250 0.7237
3 General Association 1 0.1250 0.7237
Estimates of the Common Relative Risk (Row1/Row2)
Type of Study Method Value 95% Confidence Limits
Case-Control Mantel-Haenszel 0.9699 0.8185 1.1493
(Odds Ratio) Logit 0.9689 0.8178 1.1481
The Breslow-Day test is for the null hypothesis that the odds ratios do not differ
across levels of the third variable (department). As you can see below, that null is
retained here.
Breslow-Day Test for
Homogeneity of the Odds Ratios
Chi-Square 2.5582
DF 4
Pr > ChiSq 0.6342
Total Sample Size = 3593
-------------------------------------------------------------------------------------------------
Below is a contingency table analysis on the aggregated data (collapsed across
departments B through F). As you can see, these data indicate that there is significant
sex bias against women: the odds of a woman being admitted are significantly less than
the odds of a man being admitted.
Sex Decision
Frequency
Row Pct Accept Reject Total
F 508 1259 1767 OR = 0.69
28.75 71.25
M 686 1180 1866
36.76 63.24
Total 1194 2439 3633
Statistics for Table of Sex by Decision
Statistic DF Value Prob
Chi-Square 1 26.4167 <.0001
Likelihood Ratio Chi-Square 1 26.4964 <.0001
Phi Coefficient -0.0853
Sample Size = 3633
-------------------------------------------------------------------------------------------------
If we include department A in the analysis, we see that in that department there
was considerable sex bias in favor of women.
Table 1 of Sex by Decision
Controlling for Dept=A
Sex Decision
Frequency
Row Pct Accept Reject Total
F 89 19 108 OR = 2.86
82.41 17.59
M 512 313 825
62.06 37.94
Total 601 332 933
The CMH still falls short of significance, but
The FREQ Procedure
Summary Statistics for Sex by Decision
Controlling for Dept (A through F)
Cochran-Mantel-Haenszel Statistics (Based on Table Scores)
Statistic Alternative Hypothesis DF Value Prob
1 Nonzero Correlation 1 1.5246 0.2169
2 Row Mean Scores Differ 1 1.5246 0.2169
3 General Association 1 1.5246 0.2169
Estimates of the Common Relative Risk (Row1/Row2)
Type of Study Method Value 95% Confidence Limits
Case-Control Mantel-Haenszel 1.1053 0.9431 1.2955
(Odds Ratio) Logit 1.0774 0.9171 1.2658
notice that the Breslow-Day test now tells us that the odds ratios differ significantly
across departments.
Breslow-Day Test for
Homogeneity of the Odds Ratios
Chi-Square 18.8255
DF 5
Pr > ChiSq 0.0021
Total Sample Size = 4526
Here are the data aggregated across departments A through F. Note that these
aggregated data indicate significant sex bias against women.
Table of Sex by Decision
Sex Decision
Frequency
Row Pct Accept Reject Total
F 597 1278 1875 OR = 0.58
31.84 68.16
M 1198 1493 2691
44.52 55.48
Total 1795 2771 4566
Statistics for Table of Sex by Decision
Statistic DF Value Prob
Chi-Square 1 74.4567 <.0001
Likelihood Ratio Chi-Square 1 75.2483 <.0001
Cramer's V -0.1277
Howell mentions Simpson's paradox in connection with these data. Simpson's
paradox is said to have taken place when the direction of the association between two
variables (in this case, sex and admission) is in one direction at each level of a third
variable, but when you aggregate the data (collapse across levels of the third variable)
the direction of the association changes.
We shall see Simpson's paradox (also known as a reversal paradox) in other
contexts later, including ANOVA and multiple regression. See The Reversal Paradox
(Simpson's Paradox) and the SAS code used to produce the output above.
Kappa
If you wish to compute a measure of the extent to which two judges agree when
making categorical decisions, kappa can be a useful statistic, since it corrects for the
spuriously high apparent agreement that otherwise results when marginal probabilities
differ from one another considerably.
For example, suppose that each of two persons were observing children at play
and at a designated time or interval of time determining whether or not the target child
was involved in a fight. Furthermore, if the rater decided a fight was in progress, the
target child was classified as being the aggressor or the victim. Consider the following
hypothetical data:
                        Rater 2
Rater 1     No Fight    Aggressor   Victim     marginal
No Fight    70 (54.75)  3           2          75
Aggressor   2           6 (2.08)    5          13
Victim      1           7           4 (1.32)   12
marginal    73          16          11         100
The percentage of agreement here is pretty good, (70 + 6 + 4) / 100 = 80%, but
not all is rosy here. The raters have done a pretty good job of agreeing regarding
whether the child is fighting or not, but there is considerable disagreement between the
raters with respect to whether the child is the aggressor or the victim.
Jacob Cohen developed a coefficient of agreement, kappa, that corrects the
percentage of agreement statistic for the tendency to get high values by chance alone
when one of the categories is very frequently chosen by both raters. On the main
diagonal of the table above I have entered in parentheses the number of agreements
that would be expected by chance alone given the marginal totals. Each of these
expected frequencies is computed by taking the marginal total for the column the cell is
in, multiplying it by the marginal total for the row the cell is in, and then dividing by the
total count. For example, for the No Fight-No Fight cell, (73)(75) / 100 = 54.75. Kappa
is then computed as κ = (Σ O − Σ E) / (N − Σ E), where the O's are observed frequencies
on the main diagonal, the E's are expected frequencies on the main diagonal, and N is
the total count. For our data,
κ = (70 + 6 + 4 − 54.75 − 2.08 − 1.32) / (100 − 54.75 − 2.08 − 1.32) = 21.85 / 41.85 = 0.52,
which is not so impressive.
More impressive would be these data, for which kappa is 0.82:
                        Rater 2
Rater 1     No Fight    Aggressor   Victim     marginal
No Fight    70 (52.56)  0           2          72
Aggressor   2           13 (2.40)   1          16
Victim      1           2           9 (1.44)   12
marginal    73          15          12         100
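The kappa computation generalizes easily. A hedged Python sketch (the function name `cohen_kappa` is invented here) applied to both agreement tables above:

```python
def cohen_kappa(table):
    """Cohen's kappa from a square agreement table (rows = Rater 1, cols = Rater 2)."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    observed = sum(table[i][i] for i in range(len(table)))
    expected = sum(r * c / n for r, c in zip(row_totals, col_totals))
    return (observed - expected) / (n - expected)

first = [[70, 3, 2], [2, 6, 5], [1, 7, 4]]
second = [[70, 0, 2], [2, 13, 1], [1, 2, 9]]
print(cohen_kappa(first), cohen_kappa(second))  # about 0.52 and 0.82
```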
Power Analysis
G*Power uses Cohen's w as the effect size statistic for contingency table
analysis. Here are conventional benchmarks for that statistic.
Size of effect    w     odds ratio
small             .1    1.49
medium            .3    3.45
large             .5    9
Please read the following documents:
Constructing a Confidence Interval for the Standard Deviation
Chi-Square, One- and Two-Way -- more detail on the w statistic and power analysis,
in the document linked here.
Power Analysis for a 2 x 2 Contingency Table
Power Analysis for One-Sample Test of Variance (Chi-Square)
Return to Wuensch's Stats Lessons Page
Copyright 2011, Karl L. Wuensch - All rights reserved.
Three Flavors of Chi-Square: Pearson, Likelihood Ratio, and Wald
Here is a short SAS program and annotated output.
options pageno=min nodate formdlim='-';
proc format; value yn 1='Yes' 2='No'; value ww 1='Alone' 2='Partner'; run;
data duh; input Interest WithWhom count; cards;
1 1 51
1 2 16
2 1 21
2 2 1
;
proc freq; weight count; format Interest yn. WithWhom ww.;
table Interest*WithWhom / chisq nopercent nocol relrisk; run;
proc logistic; freq count; model WithWhom = Interest; run;
--------------------------------------------------------------------------------------------------
The SAS System 1
The FREQ Procedure
Table of Interest by WithWhom
Interest WithWhom
Frequency
Row Pct Alone Partner Total
Yes 51 16 67
76.12 23.88
No 21 1 22
95.45 4.55
Total 72 17 89
Statistics for Table of Interest by WithWhom
Statistic DF Value Prob
Chi-Square (Pearson) 1 4.0068 0.0453
Likelihood Ratio Chi-Square 1 5.0124 0.0252
Notice that the relationship is significant with both the Pearson and LR Chi-Square.
WARNING: 25% of the cells have expected counts less
than 5. Chi-Square may not be a valid test.
--------------------------------------------------------------------------------------------------
The SAS System 2
The FREQ Procedure
Statistics for Table of Interest by WithWhom
Estimates of the Relative Risk (Row1/Row2)
Type of Study Value 95% Confidence Limits
Case-Control (Odds Ratio) 0.1518 0.0189 1.2189
Cohort (Col1 Risk) 0.7974 0.6781 0.9378
Cohort (Col2 Risk) 5.2537 0.7385 37.3741
Sample Size = 89
Notice that although the Pearson and LR Chi-Square statistics were significant
beyond .05, the 95% confidence interval for the odds ratio includes the value one. As you
will soon see, this is because a more conservative Chi-Square, the Wald Chi-Square, is
used in constructing that confidence interval.
Since most people are uncomfortable with odds ratios between 0 and 1, I shall
invert the odds ratio, to 6.588, with a confidence interval extending from 0.820 to 52.910.
--------------------------------------------------------------------------------------------------
The SAS System 3
The LOGISTIC Procedure
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 5.0124 1 0.0252
Score (Pearson) 4.0068 1 0.0453
Wald 3.1461 1 0.0761
The values of the Pearson and the LR Chi-Square statistics are the same as
reported with Proc Freq. Notice that here we also get the conservative Wald Chi-Square,
and it falls short of significance. The Wald Chi-square is essentially a squared t, where t =
the value of the slope in the logistic regression divided by its standard error.
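For a single dichotomous predictor, the logistic slope equals the log odds ratio and its standard error is the square root of the summed reciprocal cell counts, so the Wald chi-square and its confidence interval can be sketched directly from the 2 x 2 table. A hedged Python illustration of that relationship (not the SAS computation itself):

```python
from math import exp, log, sqrt

a, b, c, d = 51, 16, 21, 1          # cell counts from the SAS program above
log_or = log((a / b) / (c / d))     # ln(0.1518); sign depends on which odds are on top
se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
wald = (log_or / se) ** 2           # squared z for the slope
ci = (exp(log_or - 1.96 * se), exp(log_or + 1.96 * se))
print(wald)  # about 3.15, near SAS's 3.1461
print(ci)    # about (0.019, 1.22), bracketing 1, like the SAS interval
```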
--------------------------------------------------------------------------------------------------
Odds Ratio Estimates
Point 95% Wald
Effect Estimate Confidence Limits
Interest 6.588 0.820 52.900
So we should not be surprised that the confidence interval, based upon the Wald
Chi-Square statistic, does include one.
Reporting the Strength of Effect Estimates for Simple Statistical Analyses
This document was prepared as a guide for my students in Experimental
Psychology. It shows how to present the results of a few simple but common statistical
analyses. It also shows how to compute commonly employed strength of effect
estimates.
Independent Samples T
When we learned how to do t tests (see T Tests and Related Statistics: SPSS),
you compared the mean amount of weight lost by participants who completed two
different weight loss programs. Here is SPSS output from that analysis:
Group Statistics

        GROUP   N    Mean    Std. Deviation   Std. Error Mean
LOSS    1       6    22.67   4.274            1.745
        2       12   13.25   4.093            1.181
The difference in the two means is statistically significant, but how large is it?
We can express the difference in terms of within-group standard deviations, that is, we
can compute the statistic commonly referred to as Cohen's d, but more appropriately
referred to as Hedges' g. Cohen's d is a parameter; Hedges' g is the statistic we use to
estimate d.
First we need to compute the pooled standard deviation. Convert the standard
deviations to sums of squares by squaring each and then multiplying by (n − 1). For
Group 1, (5)4.274² = 91.34. For Group 2, (11)4.093² = 184.28. Now compute the
pooled standard deviation this way:
s_pooled = √[(SS₁ + SS₂) / (n₁ + n₂ − 2)] = √[(91.34 + 184.28) / 16] = 4.15.
Finally, simply standardize the difference in means:
g = (M₁ − M₂) / s_pooled = (22.67 − 13.25) / 4.15 = 2.27, a very large effect.
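The computation above can be sketched from the summary statistics alone:

```python
from math import sqrt

# Hedges' g from the SPSS Group Statistics output above
n1, m1, sd1 = 6, 22.67, 4.274
n2, m2, sd2 = 12, 13.25, 4.093

ss1 = (n1 - 1) * sd1 ** 2                     # sum of squares, group 1 (~91.34)
ss2 = (n2 - 1) * sd2 ** 2                     # sum of squares, group 2 (~184.28)
s_pooled = sqrt((ss1 + ss2) / (n1 + n2 - 2))  # ~4.15
g = (m1 - m2) / s_pooled
print(s_pooled, g)  # about 4.15 and 2.27
```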
An easier way to get the pooled standard deviation is to conduct an ANOVA
relating the test variable to the grouping variable. Here is SPSS output from such an
analysis:
A reversal paradox occurs when two variables are positively related in aggregated
data but, within each level of a third variable, they are negatively related (or negatively
related in the aggregate and positively related within each level of the third variable).
See Messick and van de Geer's article on the reversal paradox (Psychol. Bull., 90, 582-593).
Later I shall discuss the reversal paradox in the context of ANOVA and multiple
regression. Here I have an example in the context of contingency table analysis.
At Zoo Univ. 15 of 100 women (15%) applying for admission to the graduate
program in Clinical Psychology are offered admission. One of 10 men (10%) applying
to the same program is offered admission. For the Experimental Psychology program,
6 of 10 women (60%) are offered admission, 50 of 100 men (50%) are offered
admission. For the department as a whole, (15 + 6)/(100 + 10) = 19% of the female
applicants are offered admission and (1 + 50)/(10 + 100) = 46% of the male applicants
are offered admission. Assuming that male and female applicants are equally qualified,
is there evidence of gender discrimination in admissions, and, if so, against which
gender?
Program                  Female Applicants              Male Applicants
Experimental Psychology  6 of 10 offered admission      50 of 100 offered admission
                         (60%)                          (50%)
Clinical Psychology      15 of 100 offered admission    1 of 10 offered admission
                         (15%)                          (10%)
Department as a whole    21 of 110 offered admission    51 of 110 offered admission
                         (19%)                          (46%)
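The reversal in the admissions arithmetic can be verified in a few lines:

```python
# Admission rates within each program and in the aggregate
women = {"Experimental": (6, 10), "Clinical": (15, 100)}
men = {"Experimental": (50, 100), "Clinical": (1, 10)}

for program in women:
    w_rate = women[program][0] / women[program][1]
    m_rate = men[program][0] / men[program][1]
    assert w_rate > m_rate  # women fare better within every program

w_all = sum(a for a, n in women.values()) / sum(n for a, n in women.values())
m_all = sum(a for a, n in men.values()) / sum(n for a, n in men.values())
print(w_all, m_all)      # 21/110 = .19 vs 51/110 = .46
assert w_all < m_all     # yet men fare better in the aggregate
```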
See also: The Reversal Paradox (Simpson's Paradox)
σ_M = σ / √N = 15 / √25 = 3.
For example, suppose we wish to test the H₀ that μ = 100. With M = 107,
z = (M − μ) / σ_M = (107 − 100) / 3 = +2.33.
Now, P(Z ≥ +2.33) = .0099; doubling for a two-tailed test, p = .0198. Thus, we
could reject the H₀ with α = .05 but not with α = .01. With directional hypotheses, H₀
being μ ≤ 100, H₁ being μ > 100, p = .0099 and we could reject
the H₀ even at .01.
With a one-tailed test, if the direction of the effect (sample mean > or < μ) is
as specified in the alternative hypothesis, one always uses the smaller portion
column of the normal curve table to obtain the p. If the direction is opposite that
specified in the alternative hypothesis, one uses the larger portion column. For a
two-tailed test, always use the smaller portion column and double the value that
appears there.
Confidence intervals may be constructed by taking the point estimate of μ and
going out the appropriate number of standard errors. The general formula is:
CI = M − CV(σ_M) to M + CV(σ_M),
where CV = the critical value for the appropriate sampling distribution. For our sample
problem, CI₉₅ = 107 − 1.96(15/√25) to 107 + 1.96(15/√25) = 101.12 to 112.88. Once we
have this confidence interval we can decide whether or not to reject a hypothesis about
simply by determining whether the hypothesized value of falls within the confidence
interval or not. The hypothesized value of 100 does not fall within 101.12 to 112.88, so
we could reject the hypothesis that = 100 with at least 95% confidence that is, with
alpha not greater than 1 - .95 = .05.
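The z test and interval above can be sketched numerically. A hedged Python illustration (SciPy assumed) using the values from the sample problem (M = 107, μ₀ = 100, σ = 15, N = 25):

```python
from math import sqrt
from scipy.stats import norm

M, mu0, sigma, N = 107, 100, 15, 25
se = sigma / sqrt(N)                 # 3
z = (M - mu0) / se                   # +2.33
p_two = 2 * norm.sf(z)               # about .02, cf. the .0198 above
ci = (M - 1.96 * se, M + 1.96 * se)
print(z, p_two, ci)                  # CI: 101.12 to 112.88
```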
Student's t. Population Standard Deviation Not Known
One big problem with what we have done so far is knowledge of the population σ. If
we really knew σ, we would likely also know μ, and thus not need to make
inferences about μ. The assumption we made above, that σ_IQ at ECU = 15, is probably
not reasonable. Assuming that ECU tends to admit brighter persons and not persons
with low IQ, the σ_IQ at ECU should be lower than that in the general population. We
shall usually need to estimate the population σ from the same sample data we use to
test the mean. Unfortunately, sample variance, SS / (N − 1), has a positively skewed
sampling distribution. Although unbiased [the mean of the distribution of sample
variances equals the population variance], more often than not sample s² will be smaller
than population σ² and sample s smaller than population σ.
Thus, the quantity t = (M − μ) / (s / √N) will tend to be larger than
z = (M − μ) / (σ / √N). The result of all this is that the sampling distribution of the test
statistic will not be normally distributed, but will rather be distributed as Student's t, a
distribution developed by Gosset (his employer, Guinness Brewers, did not allow him to
publish under his real name). For more information on Gosset, point your browser to:
http://www-gap.dcs.st-and.ac.uk/~history/Mathematicians/Gosset.html.
The Student t-distribution is plumper in its tails (representing a greater number of
extreme scores) than is the normal curve. Because the distribution of sample variances
is more skewed with small sample sizes than when N is large, the t distribution
becomes very nearly normal when N is large.
One of the parameters going into the probability density function of t is df, degrees
of freedom. We start out with df = N and then we lose one df for each parameter we
estimate when computing the standard error. We compute the sample standard error
as s_M = s / √N. When computing the sample s we estimate the population mean,
using (Y minus sample mean) rather than (Y minus μ) to compute the sum of squares.
That one estimation cost us one df, so df = N − 1. The fewer the df, the plumper the t is
in its tails, and accordingly the greater the absolute critical value of t. With infinite df, t
has the same critical value as Z.
Here is an abbreviated table of critical values of t marking off the upper 2.5% of the
area under the curve. Notice how the critical value is very large when df are small, but
approaches 1.96 (the critical value for z) as df increase.

df              1       2      3      10     30     100    ∞
Critical Value  12.706  4.303  3.182  2.228  2.042  1.984  1.960
When df are small, a larger absolute value of computed t is required to reject the null
hypothesis. Accordingly, low df translates into low power. When df are low, sample
size will be low too, and that also reduces power.
I shall illustrate the use of Student's t for testing a hypothesis about the mean score
that my students get on the math section of the Scholastic Aptitude Test. I shall use
self-report data provided by students who took my undergraduate statistics class
between 2000 and 2004. During that five year period the national mean score on the
math SAT was 516. For North Carolina students it was 503. For the 114 students on
whom I have data, the mean is 534.78 and the standard deviation is 93.385. I shall test
the null hypothesis that the mean of the population from which my students' scores
were randomly drawn is 516. I shall employ the usual .05 criterion of statistical
significance.
s_M = 93.385/√114 = 8.746, and t = (534.78 − 516)/8.746 = 2.147. If we were doing a one-tailed test but the predicted direction were wrong, p would be 1 minus the value for the one-tailed p with direction correct, that is, .975 < p < .99. We can use PASW or SAS to get the exact p, which is, for these data, .034.
A confidence interval should also be constructed: CI = M ± CV·s_M. For CC = 95%, α = 1 − .95 = .05, leaving .025 in the upper tail. From the t table for df = 100, CV = 1.984. CI₉₅ = 534.78 ± 1.984(8.746) = 517.43 to 552.13.
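The SAT-math example above can be sketched numerically (assuming SciPy; note the code uses the exact df = 113, where the text reads the table row for df = 100, so the interval differs slightly):

```python
# One-sample t test and 95% CI for the SAT-math data.
import math
from scipy.stats import t

n, mean, sd, mu0 = 114, 534.78, 93.385, 516
se = sd / math.sqrt(n)          # standard error of the mean, about 8.746
t_obs = (mean - mu0) / se       # about 2.147
df = n - 1
p = 2 * t.sf(t_obs, df)         # exact two-tailed p, about .034
cv = t.ppf(0.975, df)           # critical value for the 95% CI
ci95 = (mean - cv * se, mean + cv * se)
print(round(t_obs, 3), round(p, 3), [round(x, 2) for x in ci95])
```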
Effect Size
When you test a hypothesis about a population mean, you should report an estimate
of (μ − μ₀): here, M − μ₀ = 534.78 − 516 = 18.78. Note that this is the numerator of the t
ratio. To get a confidence interval for this difference, just take the confidence interval
for the mean and subtract the hypothesized mean from both the lower and the upper
limits. For our SAT data, the 95% confidence interval for (μ − μ₀) is 1.43 to 36.13.
When you are dealing with data where the unit of measurement is easily understood
by most persons (such as inches, pounds, dollars, etc.), reporting an effect size in that
unit of measurement is fine. Psychologists, however, typically deal with data where the
unit of measurement is not so easily understood (such as score on a personality test).
Accordingly, it is useful to measure effect size in standard deviation units. The
standardized effect size parameter for the one-sample design is δ = (μ − μ₀)/σ. I often
refer to this parameter simply as Cohen's d, to avoid confusion with the noncentrality
parameter, also symbolized with a lower-case delta. We can estimate δ with the statistic
d = (M − μ₀)/s = 18.78/93.385 = .20.
Two-Group Research
1. We wish to know whether two groups (samples) of scores (on some continuous OV,
outcome variable) are different enough from one another to indicate that the two
populations from which they were randomly drawn are also different from one another.
2. The two groups of scores are from research units (subjects) that differ with respect to
some dichotomous GV, grouping variable (treatment).
3. We shall compute an exact significance level, p, which represents the likelihood that
our two samples would differ from each other on the OV as much (or more) as they do, if in
fact the two populations from which they were randomly sampled are identical, that is, if
the dichotomous GV has no effect on the OV mean.
Research Designs
1. In the Independent Sampling Design (also known as the between-subjects design) we
have no good reason to believe there should be correlation between scores in the one
sample and those in the other. With experimental research this is also known as the
completely randomized design -- we not only randomly select our subjects but we also
randomly assign them to groups - the assignment of any one subject to group A is
independent of the assignment of any other subject to group A or B. Of course, if our
dichotomous GV is something not experimentally manipulated, such as subjects'
sex/gender, we do not randomly assign subjects to groups, but subjects may still be in
groups in such a way that we expect no correlation between the two groups' scores.
2. In the Matched Pairs Design (also called a split-plot design, a randomized blocks
design, or a correlated samples design) we randomly select pairs of subjects, with the
subjects matched on some extraneous variable (the matching variable) thought to be well
correlated with the dependent variable. Within each pair, one subject is randomly
assigned to group A, the other to group B. Again, our dichotomous GV may not be
experimentally manipulated, but our subjects may be matched up nevertheless; for
example, GV = sex, subjects = married couples.
a. If the matching variable is in fact well correlated with the dependent variable, the
matched pairs design should provide a more powerful test (greater probability of rejecting
the null hypothesis) than will the completely randomized design. If not, it may yield a less
powerful test.
b. One special case of the matched pairs design is the Within Subjects Design
(also known as the repeated measures design). Here each subject generates two scores:
one after treatment A, one after treatment B. Treatments are counterbalanced so that
half the subjects get treatment A first, the other half receiving treatment B first, hopefully
removing order effects.
Calculation of the Related Samples t
Suppose we measure each subject's reaction time once while sober and once while drunk, and test the null hypothesis that μ_sober = μ_drunk, that alcohol doesn't affect reaction time. We create a new variable, D. For each pair we compute D = X₁ − X₂. The null hypothesis becomes μ_D = 0 and we test it exactly as we previously tested one-mean hypotheses, including using one-tailed tests if appropriate [if the alternative hypothesis is μ_D > 0, that is, μ₁ > μ₂, or if it is μ_D < 0, μ₁ < μ₂].
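The difference-score approach can be sketched numerically (the paired reaction times below are made up for illustration; assumes SciPy):

```python
# Related samples t via difference scores: D = X1 - X2 per pair, test H0: mu_D = 0.
import math
from scipy.stats import t

sober = [210, 225, 198, 240, 215, 230]   # hypothetical reaction times (ms)
drunk = [230, 240, 215, 255, 228, 245]

d_scores = [s - d for s, d in zip(sober, drunk)]
n = len(d_scores)
mean_d = sum(d_scores) / n
sd_d = math.sqrt(sum((x - mean_d) ** 2 for x in d_scores) / (n - 1))
t_obs = mean_d / (sd_d / math.sqrt(n))   # one-sample t on the D scores
p = 2 * t.sf(abs(t_obs), n - 1)
print(round(t_obs, 2), round(p, 5))
```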
The independent sampling design is more complex. The sampling distribution is
the distribution of differences between means, which has a mean equal to μ₁ − μ₂. By the
variance sum law, the standard deviation of the sampling distribution, the Standard Error
of Differences Between Means, is the square root of the variance of the sampling
distribution:
σ_{M1-M2} = √(σ²_{M1} + σ²_{M2} − 2ρ·σ_{M1}·σ_{M2})
This formula for the standard error actually applies to both matched pairs and
independent sampling designs. The ρ (rho) is the correlation between scores in population
1 and scores in population 2. In matched pairs designs this should be positive and fairly
large, assuming that the variable used to match scores is itself positively correlated with
the dependent variable. That is, pairs whose Group 1 score is high should also have their
Group 2 score high, while pairs whose Group 1 score is low should have their Group 2
score low, relative to other within-group scores. The larger the ρ, the smaller the standard
error, and thus the more powerful the analysis (the more likely we are to reject a false null
hypothesis). Fortunately there is an easier way to compute the standard error with
matched pairs, the difference score approach we used earlier.
In the independent sampling design we assume that ρ = 0, so the standard error
becomes
σ_{M1-M2} = √(σ²₁/N₁ + σ²₂/N₂), and t = (M₁ − M₂) / s_{M1-M2}.
If n₁ ≠ n₂, the pooled variances standard error requires a more elaborate formula.
Given the homogeneity of variance assumption, we can better estimate the variance of the
two populations by using all (n₁ + n₂) scores than by using the n₁ and the n₂ scores
separately. This involves pooling the sums of squares when computing the standard error:
s_{M1-M2} = √[ (SS₁ + SS₂)/(n₁ + n₂ − 2) · (1/n₁ + 1/n₂) ]
Remember, SS = s²(n − 1).
The t that you obtain is then evaluated using df = n₁ + n₂ − 2.
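As a sketch (assuming SciPy), here is the pooled standard error and t computed from group summary statistics; the GPA numbers are the Howell data used later in this document, and the result differs slightly from the t reported there because these summaries are rounded:

```python
# Pooled-variances independent samples t from summary statistics.
import math
from scipy.stats import t

def pooled_t(m1, s1, n1, m2, s2, n2):
    ss1, ss2 = s1 ** 2 * (n1 - 1), s2 ** 2 * (n2 - 1)   # SS = s**2 * (n - 1)
    se = math.sqrt((ss1 + ss2) / (n1 + n2 - 2) * (1 / n1 + 1 / n2))
    df = n1 + n2 - 2
    t_obs = (m1 - m2) / se
    return t_obs, df, 2 * t.sf(abs(t_obs), df)

t_obs, df, p = pooled_t(2.82, .83, 33, 2.24, .81, 55)   # girls vs boys GPA
print(round(t_obs, 2), df, round(p, 4))
```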
If you cannot assume homogeneity of variance, then using the pooled variances
estimate is not reasonable. Instead, compute t using the separate variances error term,
s_{M1-M2} = √(s²₁/n₁ + s²₂/n₂).
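The separate variances t can be sketched the same way; the df below use the Welch-Satterthwaite approximation (the "Satterthwaite" method shown in the SAS output later in this document), a detail the text does not spell out:

```python
# Separate variances (Welch) t with Welch-Satterthwaite approximate df.
import math
from scipy.stats import t

def welch_t(m1, s1, n1, m2, s2, n2):
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    t_obs = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_obs, df, 2 * t.sf(abs(t_obs), df)

t_obs, df, p = welch_t(2.82, .83, 33, 2.24, .81, 55)   # rounded GPA summaries
print(round(t_obs, 2), round(df, 1), round(p, 4))
```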
For the two-group design, the standardized effect size parameter is δ = (μ₁ − μ₂)/σ. Notice that we are dealing with population parameters,
not sample statistics, when computing δ. In other words, δ is not an effect size estimate.
Nevertheless, most psychologists use the letter d to report what is really an estimate of δ.
You should memorize the following benchmarks for δ: .2 is small, .5 is medium, and .8 is large.
Our estimator is d = (M₁ − M₂)/s_pooled, where the pooled standard deviation is the square root of
the within-groups mean square (from a one-way ANOVA comparing the two groups). If
you have equal sample sizes, the pooled standard deviation is s_pooled = √[.5(s₁² + s₂²)]. If
you have unequal sample sizes, s_pooled = √[Σ p_j·s_j²], where for each group s_j² is the within-group
variance and p_j = n_j/N, the proportion of the total number of scores (in both groups,
N) which are in that group (n_j). You can also compute d as d = t√[(n₁ + n₂)/(n₁n₂)], where t is the
pooled variances independent samples t comparing the two group means.
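A sketch of these estimators with the Howell GPA summary statistics used just below (plain Python, no extra libraries; the pooled t value is the one computed from these rounded summaries):

```python
# Two equivalent routes to estimated d for the two-group design.
import math

m1, s1, n1 = 2.82, .83, 33   # girls
m2, s2, n2 = 2.24, .81, 55   # boys
N = n1 + n2

# unequal-n pooling: weight each group's variance by p_j = n_j / N
s_pooled = math.sqrt((n1 / N) * s1 ** 2 + (n2 / N) * s2 ** 2)
d = (m1 - m2) / s_pooled

# the same estimate from the pooled variances t (3.222 from these rounded summaries)
t_pooled = 3.222
d_from_t = t_pooled * math.sqrt((n1 + n2) / (n1 * n2))
print(round(s_pooled, 3), round(d, 2), round(d_from_t, 2))
```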
You can use the program Conf_Interval-d2.sas to obtain the confidence interval for
the standardized difference between means. It will require that you give the sample sizes
and the values of t and df. Use the pooled variances values of t and df. Why the pooled
variances t and df? See Confidence Intervals, Pooled and Separate Variances T. Also
see Standardized Difference Between Means, Independent Samples .
6
I shall illustrate using the Howell data (participants were students in Vermont),
comparing boys' GPA with girls' GPA. Please look at the computer output. For the girls,
M = 2.82, SD = .83, n = 33, and for the boys, M = 2.24, SD = .81, n = 55.
s_pooled = √[(33/88)(.83²) + (55/88)(.81²)] = .818, and d = (2.82 − 2.24)/.818 = .71. Also, d = 3.267·√[(33 + 55)/(33·55)] = .72, where 3.267 is the pooled variances t. For the CL statistic, Z = (2.82 − 2.24)/√(.83² + .81²) = .50, which yields a lower-tailed p of .69. That is, if one boy and one girl were randomly selected, the probability that the girl would have the higher GPA is .69. If you prefer odds, the odds of the girl having the higher GPA = .69/(1 − .69) = 2.23 to 1.
Point Biserial r versus Estimated d. Each of these has its advocates.
Regardless of which you employ, you should be aware that the ratio of the two sample
sizes can have a drastic effect on the value of the point-biserial r (and the square of that
statistic, which is η²), but does not affect the value of estimated d. See Effect of n₁/n₂ on
Estimated d and r_pb.
Correlated Samples Designs. You could compute d = (M₁ − M₂)/s_Diff, where s_Diff is
the standard deviation of the difference scores, but this would artificially inflate the size of
the effect, because the correlation between conditions will probably make s_Diff smaller than
the within-conditions standard deviation. You should instead treat the data as if they were
from independent samples. If you base your effect size estimate on the correlated
samples analysis, you will overestimate the size of the effect. You cannot use my
Conf_Interval-d2.sas program to construct a confidence interval for d when the data are
from correlated samples. See my document Confidence Interval for Standardized
Difference Between Means, Related Samples for details on how to construct an
approximate confidence interval for the standardized difference between related means.
Tests of Equivalence. Sometimes we want to test the hypothesis that the size of
an effect is not different from zero by more than a trivial amount. For example, we might
wish to test the hypothesis that the effect of a generic drug is equivalent to the effect of a
brand name drug. Please read my document Tests of Equivalence and Confidence
Intervals for Effect Sizes.
Testing Variances
One may be interested in determining the effect of some treatment upon variances
instead of or in addition to its effect on means. Suppose we have two different drugs, each
thought to be a good treatment for lowering blood cholesterol. Suppose that the mean
amount of cholesterol lowering for drug A was 40 with a variance of 100 and for drug B the
mean was 42 with a variance of 400. The difference in means is trivial compared to the
difference in variances. It appears that the effect of drug A does not vary much from
subject to subject, but drug B appears to produce very great lowering of blood cholesterol
for some subjects, but none (or even elevated cholesterol) for others. At this point the
researcher should start to look for the mystery variable which interacts with drug B to
determine whether B's effect is positive or negative.
To test the null hypothesis that the treatments do not differ in effect upon variance,
that is, σ²_A = σ²_B, one may use an F-test. Simply divide the larger variance by the smaller,
obtaining an F of 400/100 = 4.0. Suppose we had 11 subjects in Group A and 9 in Group
B. The numerator (variance for B) degrees of freedom is n_B − 1 = 8, the denominator
(variance for A) df is n_A − 1 = 10. From your statistical program [in SAS, p = 2*(1 -
PROBF(4, 8, 10));] you obtain the two-tailed probability for F(8, 10) = 4, which is p =
.044.
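The same computation can be sketched in Python (assuming SciPy, where `f.sf` plays the role of 1 − PROBF):

```python
# Two-tailed F test of two variances, mirroring the SAS PROBF calls in the text.
from scipy.stats import f

F = 400 / 100                        # larger variance over smaller
p_two_tailed = 2 * f.sf(F, 8, 10)    # mirrors p = 2*(1 - PROBF(4, 8, 10)), about .044
p_wrong_direction = f.sf(0.25, 10, 8)  # mirrors 1 - PROBF(.25, 10, 8), about .98
print(round(p_two_tailed, 3), round(p_wrong_direction, 2))
```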
8
We can do one-tailed tests of directional hypotheses about the relationship
between two variances. With directional hypotheses we must put in the numerator the
variance which we predicted (in the alternative hypothesis) would be larger, even if it isn't
larger. Suppose we did predict that σ²_B > σ²_A. F(8, 10) = 4.00; we don't double p, so p =
.022. What if we had predicted that σ²_A > σ²_B? F(10, 8) = 100/400 = 0.25. Since the p
for F(x, y) equals 1 minus the p for 1/F(y, x), our p equals 1 − .022 = .98, and the null
hypothesis looks very good. If you wish, you can use SAS to verify that p = 1 -
PROBF(.25, 10, 8); returns a value of .98.
Although F is often used as I have shown you here, it has a robustness problem in
this application. It is not robust to violations of its normality assumption. There are,
however, procedures that are appropriate even if the populations are not normal. Levene
suggested that for each score you find either the square or the absolute value of its
deviation from the mean of its group, and then run a standard t-test
comparing the transformed deviations in the one group with those in the other group.
Brown and Forsythe recommended using absolute deviation from the median or a trimmed
mean. Their Monte Carlo research indicated that the trimmed mean was the best choice
when the populations were heavy in their tails and the median was the best choice when
the populations were skewed. Levene's tests can be generalized to situations involving
more than two populations: just apply an ANOVA to the transformed data. Please consult
the document Levene Test for Equality of Variances. Another alternative, O'Brien's test, is
illustrated in the 4th edition of Howell's Statistical Methods for Psychology. As he notes in
the 5th edition, it has not been included in mainstream statistical computing packages.
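SciPy's `levene` function implements exactly these choices through its `center` argument ('mean' for Levene's original proposal, 'median' for Brown and Forsythe's, 'trimmed' for the trimmed mean); the data here are made up for illustration:

```python
# Levene / Brown-Forsythe tests of equality of variances.
from scipy.stats import levene

group_a = [12, 14, 13, 15, 14, 13, 12, 14]   # small spread
group_b = [5, 22, 9, 18, 2, 25, 7, 20]       # large spread
for center in ("mean", "median", "trimmed"):
    stat, p = levene(group_a, group_b, center=center)
    print(f"{center:>7}: W = {stat:.2f}, p = {p:.4f}")
```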
To test the null hypothesis of homogeneity of variance in two related (not
independent) samples, use E. J. G. Pitman's (A note on normal correlation, Biometrika,
1939, 31, 9-12) t: t = (F − 1)√(n − 2) / [2√(F(1 − r²))], where F is the ratio of the larger to the smaller sample
variance, n is the number of pairs of scores, r is the correlation between the scores in the
one sample and the scores in the other sample, and n − 2 is the df.
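A sketch of Pitman's test (assumes SciPy; the variances, correlation, and n below are hypothetical):

```python
# Pitman's t for equality of two correlated variances.
import math
from scipy.stats import t

def pitman_t(var1, var2, r, n):
    F = max(var1, var2) / min(var1, var2)   # larger over smaller sample variance
    t_obs = (F - 1) * math.sqrt(n - 2) / (2 * math.sqrt(F * (1 - r ** 2)))
    return t_obs, 2 * t.sf(t_obs, n - 2)    # two-tailed p on n - 2 df

t_obs, p = pitman_t(400, 100, r=.6, n=30)
print(round(t_obs, 2), round(p, 5))
```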
Testing Variances Prior to Testing Means. Some researchers have adopted the
bad habit of using a test of variances to help decide whether to use a pooled t test or a
separate variances t test. This is poor practice for several reasons.
- The test of variances will have very little power when sample size is small, and thus will not detect even rather large deviations from homogeneity of variance. It is with small sample sizes that pooled t is likely least robust to the homogeneity of variance assumption.
- The test of variances will have very much power when sample size is large, and thus will detect as significant even very small differences in variance, differences that are of no concern given the pooled t test's great robustness when sample sizes are large.
- Heterogeneity of variance is often accompanied by non-normal distributions, and some tests of variances are often not robust to their normality assumption.
- Box (1953) was an early critic of testing variances prior to conducting a test of means. He wrote that "to make the preliminary test on variances is rather like putting to sea in a rowing boat to find out whether conditions are sufficiently calm for an ocean liner to leave port."
Writing an APA Style Summary for Two-Group Research Results
Using our example data, a succinct summary statement should read something like
this: "Among Vermont school-children, girls' GPA (M = 2.82, SD = .83, N = 33) was
significantly higher than boys' GPA (M = 2.24, SD = .81, N = 55), t(65.9) = 3.24, p = .002, d
= .72. A 95% confidence interval for the difference between girls' and boys' mean GPA
runs from .23 to .95 in raw score units and from .27 to 1.16 in standardized units."
Please note the following important components of the summary statement:
- The subjects are identified.
- The variables are identified: sex and GPA.
- The group means, standard deviations, and sample sizes are given.
- Rejection of the null hypothesis is indicated (the difference is significant).
- The direction of the significant effect is indicated.
- The test statistic (t) is identified, and its degrees of freedom, computed value, and p-value are reported.
- An effect size estimate is reported.
- Confidence intervals are reported for the difference between means and for d.
The style for reporting the results of a correlated t test would be the same.
If the result were not significant, we would not emphasize the direction of the
difference between the group means, unless we were testing a directional hypothesis. For
example, among school-children in Vermont, the IQ of girls (M = 101.8, SD = 12.7, N = 33)
did not differ significantly from that of boys (M = 99.3, SD = 13.2, N = 55), t(69.7) = 0.879,
p = .38, d = .19. A 95% confidence interval for the difference between girls' and boys'
mean IQ runs from -3.16 to 8.14 in raw score units and from -.24 to .62 in standardized
units.
As an example of a nonsignificant test of a directional hypothesis: As predicted, the
GPA of students who had no social problems (M = 2.47, SD = 0.89, N = 78) was greater
than that of students who did have social problems (M = 2.39, SD = .61, N = 10), but this
difference fell short of statistical significance, one-tailed t(14.6) = 0.377, p = .36, d = .09. A
95% confidence interval for the difference between mean GPA of students with no social
problems and that of students with social problems runs from -.38 to .54 in raw score units
and from -.56 to .75 in standardized units.
References
Box, G. E. P. (1953). Non-normality and tests on variance. Biometrika, 40, 318-335.
Bradley, J. V. (1982). The insidious L-shaped distribution. Bulletin of the Psychonomic
Society, 20, 85-88.
Wuensch, K. L. (2009). The standardized difference between means: Much variance in
notation. Also, differences between g and r_pb as effect size estimates. Available here.
Zimmerman, D. W. (1996). Some properties of preliminary tests of equality of variances in
the two-sample location problem. Journal of General Psychology, 123, 217-231.
Zimmerman, D. W., & Zumbo, B. D. (2009). Hazards in choosing between pooled and
separate-variances t tests. Psicológica, 30, 371-390.
Summary of Effect Size Estimates -- Lee Becker at the Univ. of Colorado, Colorado
Springs
Two Groups and One Continuous Variable
The Moments of Students t Distribution
Copyright 2012, Karl L. Wuensch - All rights reserved.
Confidence Interval for Standardized Difference Between Means, Independent Samples
Here are results from an independent samples t test. One group consists of mice (Mus
musculus) who were reared by mice, the other group consists of mice who were reared by rats
(Rattus norvegicus). The dependent variable is the difference between the number of visits
the mouse made to a tunnel that smelled like another mouse and the number of visits to a
tunnel that smelled like rat.
--------------------------------------------------------------------------------------------------
The SAS System 1
Independent Samples T-Tests on Mouse-Rat Tunnel Difference Scores
Foster Mom is a Mouse or is a Rat
The TTEST Procedure
Variable: v_diff
Mom N Mean Std Dev Std Err Minimum Maximum
Mouse 32 14.8125 9.0320 1.5966 0 31.0000
Rat 16 -1.3125 8.4041 2.1010 -17.0000 17.0000
Diff (1-2) 16.1250 8.8321 2.7043
Mom Method Mean 95% CL Mean Std Dev 95% CL Std Dev
Mouse 14.8125 11.5561 18.0689 9.0320 7.2410 12.0078
Rat -1.3125 -5.7907 3.1657 8.4041 6.2082 13.0070
Diff (1-2) Pooled 16.1250 10.6816 21.5684 8.8321 7.3393 11.0930
Diff (1-2) Satterthwaite 16.1250 10.7507 21.4993
Method Variances DF t Value Pr > |t|
Pooled Equal 46 5.96 <.0001
Satterthwaite Unequal 32.141 6.11 <.0001
Equality of Variances
Method Num DF Den DF F Value Pr > F
Folded F 31 15 1.15 0.7906
Notice that you are given a pooled variances confidence interval and a separate
variances confidence interval. These are in raw units, not standardized units.
We may get a better feel for the size of the effect if we standardize it. I have two
programs available to do this.
Program One
title 'Compute 95% Confidence Interval for d, Standardized Difference Between Two
Independent Population Means';
Data CI;
/*
Replace tttt with the computed value of the independent samples t test.
Replace dd with the degrees of freedom for the independent samples t test.
Replace n1n with the sample size for the first group.
Replace n2n with the sample size for the second group.
*/
t= 5.96 ;
df = 46 ;
n1 = 32 ;
n2 = 16 ;
***********************************************************************************;
g = t/sqrt(n1*n2/(n1+n2));
ncp_lower = TNONCT(t,df,.975);
ncp_upper = TNONCT(t,df,.025);
d_lower = ncp_lower*sqrt((n1+n2)/(n1*n2));
d_upper = ncp_upper*sqrt((n1+n2)/(n1*n2));
output; run; proc print; var g d_lower d_upper; run;
The Output
Obs g d_lower d_upper
1 1.82487 1.11164 2.52360
Notice that both sides of the confidence interval indicate that the effect is quite large.
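The same interval can be computed in Python, assuming SciPy, whose `nctdtrinc` plays the role of SAS's TNONCT (solving for the noncentrality parameter of the noncentral t):

```python
# Noncentral-t confidence interval for d, mirroring Conf_Interval-d2.sas.
import math
from scipy.special import nctdtrinc

t_obs, df, n1, n2 = 5.96, 46, 32, 16
g = t_obs / math.sqrt(n1 * n2 / (n1 + n2))      # about 1.82487
scale = math.sqrt((n1 + n2) / (n1 * n2))
d_lower = nctdtrinc(df, .975, t_obs) * scale    # noncentrality with CDF .975 at t_obs
d_upper = nctdtrinc(df, .025, t_obs) * scale    # noncentrality with CDF .025 at t_obs
print(round(g, 5), round(d_lower, 4), round(d_upper, 4))
```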
Program 2
*This program computes a CI for the effect size in
a between-subject design with two groups.
m1 and m2 are the means for the two groups
s1 and s2 are the standard deviations for the two groups
n1 and n2 are the sample sizes for the two groups
prob is the confidence level;
*Downloaded from James Algina's webpage at http://plaza.ufl.edu/algina/ ;
data;
m1=14.8125 ;
m2= -1.3125 ;
s1=9.032 ;
s2=8.4041 ;
n1=32 ;
n2=16 ;
prob=.95;
v1=s1**2;
v2=s2**2;
pvar=((n1-1)*v1+(n2-1)*v2)/(n1+n2-2);
se=sqrt(pvar*(1/n1+1/n2));
nchat=(m1-m2)/se;
es=(m1-m2)/(sqrt(pvar));
df=n1+n2-2;
ncu=TNONCT(nchat,df,(1-prob)/2);
ncl=TNONCT(nchat,df,1-(1-prob)/2);
ll=(sqrt(1/n1+1/n2))*ncl;
ul=(sqrt(1/n1+1/n2))*ncu;
output;
proc print;
title1 'll is the lower limit and ul is the upper limit';
title2 'of a confidence interval for the effect size';
var es ll ul;
run;
The Output
ll is the lower limit and ul is the upper limit
of a confidence interval for the effect size
Obs es ll ul
1 1.82572 1.11239 2.52453
The minor differences between these results and those shown earlier are due to rounding error
from the value of t.
Do it with SPSS
Karl L. Wuensch, East Carolina University, Dept. of Psychology, 3. September, 2011.
Tests of Equivalence and Confidence Intervals for
Effect Sizes
Point or sharp null hypotheses specify that a parameter has a particular value -- for
example, (μ₁ − μ₂) = 0, or ρ = 0. Such null hypotheses are highly unlikely ever to be
true. They may, however, be close to true, and it may be more useful to test range or
loose null hypotheses that state that the value of the parameter of interest is close to a
hypothetical value. For example, one might test the null hypothesis that the difference
between the effect of drug G and that of drug A is so small that the drugs are essentially
equivalent. Biostatisticians do exactly this, and they call it bioequivalence testing.
Steiger (2004) presents a simple example of bioequivalence testing. Suppose that we
wish to determine whether or not generic drug G is bioequivalent to brand name drug
B. Suppose that the FDA defines bioequivalence as bioavailability within 20% of that of
the brand name drug. Let μ₁ represent the lower limit (bioavailability 20% less than that
of the brand name drug), μ₂ the upper limit (bioavailability 20% greater than that of the
brand name drug), and μ_G the bioavailability of the generic drug. A test of
bioequivalence amounts to pitting the following two hypotheses against one another:
H_NE: μ_G < μ₁ or μ_G > μ₂ -- the drugs are not equivalent
H_E: μ₁ ≤ μ_G ≤ μ₂ -- the drugs are equivalent -- note that this is a range hypothesis
In practice, this amounts to testing two pairs of directional hypotheses:
H₀: μ_G ≤ μ₁ versus H₁: μ_G > μ₁, and H₀: μ_G ≥ μ₂ versus H₁: μ_G < μ₂.
If both of these null hypotheses are rejected, then we conclude that the drugs are
equivalent. Alternatively, we can simply construct a confidence interval for μ_G -- if the
confidence interval falls entirely within μ₁ to μ₂, then bioequivalence is established.
Steiger (2004) opines that tests of equivalence (also described as tests of close fit)
have a place in psychology too, especially when we are interested in demonstrating that
an effect is trivial in magnitude. Steiger recommends the use of confidence intervals,
dispensing with the traditional NHST procedures (computation of test statistic, p value,
decision).
Suppose, for example, that we are interested in determining whether or not two
different therapies for anorexia are equivalent. Our criterion variable will be the average
amount of weight gained during a two month period of therapy. By how much would the
groups need to differ before we would say they differ by a nontrivial amount? Suppose we
decide that a difference of less than three pounds is trivial. The hypothesis that the
difference (D) is trivial in magnitude can be evaluated with two simultaneous one-sided
tests:
H₀: D ≤ −3 versus H₁: D > −3, and H₀: D ≥ +3 versus H₁: D < +3.
After obtaining our data, we simply construct a confidence interval for the difference
in the two means. If that confidence interval is entirely enclosed within the "range of
triviality," -3 to +3, then we retain the loose null hypothesis that the two therapies are
equivalent. What if the entire confidence interval is outside the range of triviality? I
assume we would then conclude that there is a nontrivial difference between the
therapies. If part of the confidence interval is within the range of triviality and part
outside the range, then we suspend judgment and wish we had obtained more data
and/or less error variance. Of course, if the confidence interval extended into the range
of triviality but not all the way to the point of no difference then we would probably want
to conclude that there is a difference but confess that it might be trivial.
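A minimal sketch of this decision rule (assumes SciPy; the difference, standard error, and df below are hypothetical):

```python
# Equivalence decision via a confidence interval on the raw mean difference.
from scipy.stats import t

m_diff, se_diff, df = 0.8, 0.9, 60     # hypothetical difference in mean weight gain (lb)
range_of_triviality = (-3.0, 3.0)      # differences smaller than 3 lb are trivial
cv = t.ppf(0.975, df)                  # 95% CI on the raw difference
ci = (m_diff - cv * se_diff, m_diff + cv * se_diff)
trivial = range_of_triviality[0] < ci[0] and ci[1] < range_of_triviality[1]
print([round(x, 2) for x in ci], trivial)   # CI inside the range -> retain equivalence
```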
Psychologists often use instruments which produce measurements in units that are
not as meaningful as pounds and inches. For example, suppose that we are interested
in studying the relationship between political affiliation and misanthropy. We treat
political affiliation as dichotomous (Democrat or Republican) and obtain a measure of
misanthropy on a 100 point scale. The point null is that mean misanthropy in
Democrats is exactly the same as that in Republicans. While this hypothesis is highly
unlikely to be true, it could be very close to true. Can we construct a loose null
hypothesis, like we did for the anorexia therapies? What is the smallest difference
between means on the misanthropy scale that we would consider to be nontrivial? Is a
5 point difference small, medium, or large? Faced with questions like this, we often
resort to using standardized measures of effect sizes. In this case, we could use
Cohen's d, the standardized difference between means. Suppose that we decide that
the smallest difference that would be nontrivial is d = .1. All we need to do is get our
data and then construct a confidence interval for d. If that interval is totally enclosed
within the range -.1 to .1, then we conclude that affiliates of the two parties are
equivalent in misanthropy, and if the entire confidence interval is outside the range, then
we conclude that there is a nontrivial difference between the parties.
So, how do we get a confidence interval for d? Regretfully, it is not as simple as
finding the confidence interval in the raw unit of measure and then dividing the upper
and lower limits by the pooled standard deviation. Because we are estimating both
means and standard deviations, we will be dealing with noncentral distributions (see
Cumming & Finch, 2001; Fidler & Thompson, 2001; Smithson, 2001). Iterative
computations that cannot reasonably be done by hand will be required. There are, out
there on the Internet, statistical programs designed to construct confidence intervals for
standardized effect size estimates, but I think it unlikely that such confidence intervals
will be commonly used unless and until they are incorporated in major statistical
packages such as SAS, SPSS, BMDP, Minitab, and so on. I have, on my SAS Program
Page and my SPSS Program Page, programs for constructing confidence intervals for
Cohen's d.
Steiger (2004) argues that when testing for close fit, the appropriate confidence
interval for testing range hypotheses is a 100(1 − 2α)% confidence interval. For example,
with the traditional .05 criterion, use a 90% confidence interval, not a 95% confidence
interval. His argument is that the estimated effect cannot be small in both directions, so
the confidence coefficient is relaxed to provide the same amount of power that would be
obtained with a one-sided test. I am not entirely comfortable with this argument,
especially after reading the Monte Carlo work by Serlin & Zumbo (2001).
References
Cumming, G., & Finch, S. (2001). A primer on the understanding, use, and
calculation of confidence intervals that are based on central and noncentral
distributions. Educational and Psychological Measurement, 61, 532-574.
Fidler, F., & Thompson, B. (2001). Computing correct confidence intervals for
ANOVA fixed- and random-effects effect sizes. Educational and Psychological
Measurement, 61, 575-604.
Smithson, M. (2001). Correct confidence intervals for various regression effect
sizes and parameters: The importance of noncentral distributions in computing
intervals. Educational and Psychological Measurement, 61, 605-632.
Serlin, R. C., & Zumbo, B. D. (2001). Confidence intervals for directional
decisions. Retrieved from
http://edtech.connect.msu.edu/searchaera2002/viewproposaltext.asp?propID=2678
on 20 February 2005.
Steiger, J. H. (2004). Beyond the F test: Effect size confidence intervals and
tests of close fit in the analysis of variance and contrast analysis. Psychological
Methods, 9, 164-182. Retrieved from
http://www.statpower.net/Steiger%20Biblio/Steiger04.pdf on 20. February, 2005.
This document most recently revised on the 5th of April, 2012.
Confidence Interval for Standardized Difference Between Means, Related Samples
You cannot use my Conf_Interval-d2.sas program to construct a confidence interval for d
when the data are from correlated samples. With correlated samples the distributions here are
very complex, not following the noncentral t. You can construct an approximate confidence
interval, g ± Z_cc·SE, where Z_cc is the z for the desired confidence coefficient and
SE = √[ 2(1 − r₁₂)/n + g²/(2(n − 1)) ].
McGraw and Wong (1992, Psychological Bulletin, 111: 361-365) proposed an effect size statistic
which for a two group design with a continuous dependent variable is the probability that a randomly
selected score from the one population will be greater than a randomly sampled score from the
other distribution. As an example they use sexual dimorphism in height among young adult humans.
National statistics are mean = 69.7 inches (SD= 2.8 ) for men, mean = 64.3 inches (SD = 2.6) for women.
If we assume that the distributions are normal, then the probability that a randomly selected man will be
taller than a randomly selected woman is 92%, thus the CL is 92%. They argue that the CL is a statistic
more likely to be understood by statistically naive individuals than are the other available effect size
statistics. I'll reserve judgment on that (naive persons have some funny ideas about probabilities), but it
may help you get a better feel for effect sizes. I assume they use sex differences because most of us
already have a pretty good feeling for how much the sexes differ on things like height and weight.
To calculate the CL with independent samples McGraw and Wong instruct us to compute Z = (X̄₁ − X̄₂)/√(S₁² + S₂²) and then find the probability of obtaining a Z less than the computed value. For the height example, Z = (69.7 − 64.3)/√(2.8² + 2.6²) = 1.41, and P(Z < 1.41) = 92%. Alternatively, one can compute the CL from the more common effect size statistic d. If we have weighted the two samples' variances equally when computing d, that is, d̂ = (X̄₁ − X̄₂)/√[(S₁² + S₂²)/2], then we can compute the Z for the CL as d divided by √2. For the height data, d̂ = (69.7 − 64.3)/√[(2.8² + 2.6²)/2] = 2.00, and Z = 2.00/√2 = 1.41. In their article McGraw and Wong computed d with a weighted (by sample sizes) mean variance, that is, d̂ = (X̄₁ − X̄₂)/√(p₁S₁² + p₂S₂²), where pᵢ = nᵢ/N. They used this weighted mean variance even though they were comparing men with women, where in the population there are about equal numbers in both groups (for some of their variables they had much more data from men than from women). I would have weighted the two variances equally.
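The CL arithmetic above is easy to verify with a short script. This is my own sketch, not part of the original handout; it uses only Python's standard library, and the function name is mine.

```python
from math import sqrt
from statistics import NormalDist

def common_language_es(mean1, sd1, mean2, sd2):
    """McGraw & Wong's CL: the probability that a random score from
    population 1 exceeds a random score from population 2, assuming
    both populations are normal."""
    z = (mean1 - mean2) / sqrt(sd1 ** 2 + sd2 ** 2)
    return NormalDist().cdf(z)

# Height example: men (M = 69.7, SD = 2.8) vs. women (M = 64.3, SD = 2.6)
cl = common_language_es(69.7, 2.8, 64.3, 2.6)
print(round(cl, 2))  # 0.92 -- a randomly chosen man is taller 92% of the time
```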
McGraw and Wong gave hypothetical examples of group differences in IQ (SD = 15) and
computed d, CL, and four other effect size statistics (including the binomial effect size display). I
reproduce the table without the other four effect size statistics but with my addition of examples
corresponding to Cohen's small (d = .2), medium (d = .5), and large (d = .8) effect sizes.
Mean 1   Mean 2   d      CL    Odds
100      100      0.00   50%   1
98.5     101.5    0.20   56%   1.27
96.25    103.75   0.50   64%   1.78
95       105      0.67   68%   2.12
94       106      0.80   72%   2.57
90       110      1.33   83%   4.88
85       115      2.00   92%   11.5
80       120      2.67   97%   32.3
75       125      3.33   99%   99
Power is the conditional probability that one will reject the null hypothesis given
that the null hypothesis is really false.
Imagine that we are evaluating the effect of a putative memory enhancing drug.
We have randomly sampled 25 people from a population known to be normally
distributed with a μ of 100 and a σ of 15. We administer the drug, wait a reasonable
time for it to take effect, and then test our subjects' IQ. Assume that we were so
confident in our belief that the drug would either increase IQ or have no effect that we
entertained directional hypotheses. Our null hypothesis is that after administering the
drug μ ≤ 100; our alternative hypothesis is μ > 100.
These hypotheses must first be converted to exact hypotheses. Converting the
null is easy: it becomes μ = 100. The alternative is more troublesome. If we knew that
the effect of the drug was to increase IQ by 15 points, our exact alternative hypothesis
would be μ = 115, and we could compute power, the probability of correctly rejecting the
false null hypothesis given that μ is really equal to 115 after drug treatment, not 100
(normal IQ). But if we already knew how large the effect of the drug was, we would not
need to do inferential statistics.
One solution is to decide on a minimum nontrivial effect size. What is the
smallest effect that you would consider to be nontrivial? Suppose that you decide that if
the drug increases IQ by 2 or more points, then that is a nontrivial effect, but if the
mean increase is less than 2 then the effect is trivial.
Now we can test the null of μ = 100 versus the alternative of μ = 102. Look at the
figure on the following page (if you are reading this on paper, in black and white, I
recommend that you obtain an electronic copy of this document from our BlackBoard
site and open it in Word so you can see the colors). Let the left curve represent the
distribution of sample means if the null hypothesis were true, μ = 100. This sampling
distribution has a μ = 100 and a σx̄ = 15/√25 = 3. Let the right curve represent the
sampling distribution if the exact alternative hypothesis is true, μ = 102. Its μ is 102
and, assuming the drug has no effect on the variance in IQ scores, its σx̄ = 15/√25 = 3.
The red area in the upper tail of the null distribution is α. Assume we are using a
one-tailed α of .05. How large would a sample mean need be for us to reject the null?
Since the upper 5% of a normal distribution extends from 1.645 standard errors above the μ up to
positive infinity, the sample mean IQ would need be 100 + 1.645(3) = 104.935 or more
to reject the null. What are the chances of getting a sample mean of 104.935 or more if
the alternative hypothesis is correct, if the drug increases IQ by 2 points? The area
under the alternative curve from 104.935 up to positive infinity represents that
probability, which is power. Assuming the alternative hypothesis is true, that μ = 102,
the probability of rejecting the null hypothesis is the probability of getting a sample mean
of 104.935 or more in a normal distribution with μ = 102, σ = 3. Z = (104.935 − 102)/3 =
0.98, and P(Z > 0.98) = .1635. That is, power is about 16%. If the drug really does
increase IQ by an average of 2 points, we have a 16% chance of rejecting the null. If its
effect is even larger, we have a greater than 16% chance.
Suppose we consider 5 the minimum nontrivial effect size. This will separate the
null and alternative distributions more, decreasing their overlap and increasing power.
Now, Z = (104.935 − 105)/3 = −0.02, and P(Z > −0.02) = .5080, or about 51%. It is easier to
detect large effects than small effects.
Suppose we conduct a 2-tailed test, since the drug could actually decrease IQ;
α is now split into both tails of the null distribution, .025 in each tail. We shall reject the
null if the sample mean is 1.96 or more standard errors away from the μ of the null
distribution. That is, if the mean is 100 + 1.96(3) = 105.88 or more (or if it is 100 −
1.96(3) = 94.12 or less) we reject the null. The probability of that happening if the
alternative is correct (μ = 105) is: Z = (105.88 − 105)/3 = .29, P(Z > .29) = .3859, power
= about 39%. We can ignore P(Z < (94.12 − 105)/3) = P(Z < −3.63) = very, very small.
Note that our power is less than it was with a one-tailed test. If you can correctly
predict the direction of effect, a one-tailed test is more powerful than a two-tailed
test.
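The normal-curve power calculations above can be scripted. Below is a minimal sketch of my own (not the author's), reproducing the one-tailed 2-point example and the two-tailed 5-point example; like the text, it counts only the upper rejection region.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()

def power_one_sample_z(mu0, mu1, sigma, n, alpha=0.05, tails=1):
    """Power of a z test on a sample mean; only the upper rejection
    region is counted (the lower-tail contribution is negligible here)."""
    se = sigma / sqrt(n)
    cutoff = mu0 + Z.inv_cdf(1 - alpha / tails) * se  # sample mean needed to reject
    return 1 - Z.cdf((cutoff - mu1) / se)

print(round(power_one_sample_z(100, 102, 15, 25, tails=1), 3))  # ~.164, the text's .1635
print(round(power_one_sample_z(100, 105, 15, 25, tails=2), 3))  # ~.385, the text's .3859
```

The tiny differences from the text come from the text's rounding of Z to two decimals before the table lookup.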
Consider what would happen if you increased sample size to 100. Now the
σx̄ = 15/√100 = 1.5. With the null and alternative distributions less plump, they should
overlap less, increasing power. With σx̄ = 1.5, the sample mean will need be 100 +
(1.96)(1.5) = 102.94 or more to reject the null. If the drug increases IQ by 5 points,
power is: Z = (102.94 − 105)/1.5 = −1.37, P(Z > −1.37) = .9147, or between 91 and
92%. Anything that decreases the standard error will increase power. This may
be achieved by increasing the sample size or by reducing the σ of the dependent
variable. The σ of the criterion variable may be reduced by reducing the influence of
extraneous variables upon the criterion variable (eliminating noise in the criterion
variable makes it easier to detect the signal, the grouping variable's effect on the
criterion variable).
Now consider what happens if you change α. Let us reduce α to .01. Now the
sample mean must be 2.58 or more standard errors from the null μ before we reject the
null. That is, 100 + 2.58(1.5) = 103.87. Under the alternative, Z = (103.87 − 105)/1.5 =
−0.75, P(Z > −0.75) = .7734 or about 77%, less than it was with α at .05, ceteris
paribus. Reducing α reduces power.
Please note that all of the above analyses have assumed that we have used a
normally distributed test statistic, Z = (X̄ − μ)/σx̄, with effect size d = (μ₁ − μ₀)/σ. For our IQ problem with minimum
nontrivial effect size at 5 IQ points, d = (105 − 100)/15 = 1/3. We combine d with N to
get δ, the noncentrality parameter. For the one sample test, δ = d√N. For our IQ
problem with N = 25, δ = (1/3)√25 = 1.67. Once δ is obtained, power is obtained using
the power table in our textbook. For a .05 two-tailed test, power = 36% for a δ of 1.60
and 40% for a δ of 1.70. By linear interpolation, power for δ = 1.67 is 36% + .7(40% −
36%) = 38.8%, within rounding error of the result we obtained using the normal curve.
For a one-tailed test, use the column in the table with α twice its one-tailed value. For α
of .05 one-tailed, use the .10 two-tailed column. For δ = 1.67, power then = 48% +
.7(52% − 48%) = 50.8%, the same answer we got with the normal curve.
If the sample size is large enough that there will be little difference between the t
distribution and the standard normal curve, then the solution we obtain using Howell's
table is good. You can use the GPower program to fine tune the solution you get using
Howell's table.
If we were not able to reject the null hypothesis in our research on the putative IQ
drug, and our power analysis indicated about 39% power, we would be in an awkward
position. Although we could not reject the null, we also could not accept it, given that
we only had a relatively small (39%) chance of rejecting it even if it were false. We
might decide to repeat the experiment using an n large enough to allow us to accept
the null if we cannot reject it. In my opinion, if 5% is a reasonable risk for a Type I
error (α), then 5% is also a reasonable risk for a Type II error (β), so let us use power =
1 − β = 95%. From the power table, to have power = .95 with α = .05, δ is 3.60.
n = (δ/d)². For a five-point minimum IQ effect, n = (3.6/(1/3))² = 116.64. Thus, if we repeat
the experiment with 117 subjects and still cannot reject the null, we can accept the null
and conclude that the drug has no nontrivial (≥ 5 IQ points) effect upon IQ. The null
hypothesis we are accepting here is a loose null hypothesis [95 < μ < 105] rather
than a sharp null hypothesis [μ = exactly 100]. Sharp null hypotheses are probably
very rarely ever true.
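The same sample-size calculation can be done without the power table by replacing the tabled δ = 3.60 with normal-curve quantiles (z for α/2 plus z for the desired power). A sketch of my own, under that normal approximation:

```python
from math import ceil
from statistics import NormalDist

Z = NormalDist()

def n_one_sample(d, power=0.95, alpha=0.05):
    """Required N for a one-sample test (normal approximation):
    delta = z_(1 - alpha/2) + z_power, then N = (delta / d)**2."""
    delta = Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)  # ~3.605, near the tabled 3.60
    return ceil((delta / d) ** 2)

print(n_one_sample(1 / 3))  # 117, matching 116.64 rounded up to whole subjects
print(n_one_sample(0.2))    # a minimum d of .20 demands far more subjects
```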
Others could argue with your choice of the minimum nontrivial effect size. Cohen
has defined a small effect as d = .20, a medium effect as d = .50, and a large effect as d
= .80. If you defined minimum d at .20, you would need even more subjects for 95%
power.
A third approach one can take is to find the smallest effect that one could have
detected with high probability given n. If that d is small, and the null hypothesis is not
rejected, then it is accepted. For example, I used 225 subjects in the IQ enhancer
study. For power = 95%, δ = 3.60 with α at .05 two-tailed, and d = δ/√N = 3.60/√225 = 0.24. If I
can't reject the null, I accept it, concluding that if the drug has any effect, it is a small
effect, since I had a 95% chance of detecting an effect as small as .24σ. The loose null
hypothesis accepted here would be that the population μ differs from 100 by less than
.24σ.
Two Independent Samples
For the Two Group Independent Sampling Design, δ = d√(n/2), where n = the
number of scores in one group, and both groups have the same n; d = (μ₁ − μ₂)/σ, where σ
is the standard deviation in either population, assuming σ is identical in both
populations.
If n₁ ≠ n₂, use the harmonic mean sample size, ñ = 2/(1/n₁ + 1/n₂).
For a fixed total N, the harmonic mean (and thus power) is higher the more
nearly equal n₁ and n₂ are. This is one good reason to use equal n designs. Other
good reasons are computational simplicity with equal n's and greater robustness to
violation of assumptions. Try computing the effective (harmonic) sample size for 100
subjects evenly split into two groups of 50 each and compare that with the effective
sample size obtained if you split them into 10 in one group and 90 in the other.
You should be able to rearrange the above formula for δ to solve for d or for n as
required.
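The harmonic-mean comparison suggested above works out like this (my sketch, not the author's):

```python
def harmonic_n(n1, n2):
    """Effective per-group sample size for an unequal-n two-group design."""
    return 2 / (1 / n1 + 1 / n2)

print(round(harmonic_n(50, 50), 2))  # 50.0 -- an even split keeps full efficiency
print(round(harmonic_n(10, 90), 2))  # 18.0 -- a 10/90 split wastes most of the 100 cases
```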
Consider the following a priori power analysis. We wish to compare the
Advanced Psychology GRE scores of students in general psychology master's programs
with those of students in clinical psychology master's programs. We decide that we will be
satisfied if we have enough data to have an 80% chance of detecting an effect of 1/3 of
a standard deviation, employing a .05 criterion of significance. How many scores do we
need in each group, if we have the same number of scores in each group? From the
power table, we obtain the value of 2.8 for δ. n = 2(δ/d)² = 2(2.8/(1/3))² = 141 scores in
each group, a total of 282 scores.
Consider the following a posteriori power analysis. We have available only 36
scores from students in clinical programs and 48 scores from students in general
programs. What are our chances of detecting a difference of 40 points (which is that
actually observed at ECU in 1981) if we use a .05 criterion of significance and the
standard deviation is 98? The standardized effect size, d, is 40/98 = .408. The
harmonic mean sample size is ñ = 2/(1/36 + 1/48) = 41.14.
δ = d√(ñ/2) = .408√(41.14/2) = 1.85. From our power table, power is 46% (halfway between .44 and
.48).
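Replacing the power table with a normal approximation, this a posteriori example can be scripted as follows (my own sketch; power ≈ P(Z > z_crit − δ) stands in for the table lookup):

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()

def power_two_group(d, n1, n2, alpha=0.05):
    """Approximate two-tailed power for two independent groups:
    delta = d * sqrt(n_harmonic / 2), power ~ P(Z > z_crit - delta)."""
    n_h = 2 / (1 / n1 + 1 / n2)          # harmonic mean sample size
    delta = d * sqrt(n_h / 2)
    return 1 - Z.cdf(Z.inv_cdf(1 - alpha / 2) - delta)

print(round(power_two_group(40 / 98, 36, 48), 2))  # 0.46, as read from the table
```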
Correlated Samples
The correlated samples t test is mathematically equivalent to a one-sample t test
conducted on the difference scores (for each subject, score under one condition less
score under the other condition). The greater
12
, the correlation between the scores in
the one condition and those in the second condition, the smaller the standard deviation
of the difference scores and the greater the power, ceteris paribus. By the variance
sum law, the standard deviation of the difference scores is
2 1 12
2
2
2
1
2 + =
Diff
.
If we assume equal variances, this simplifies to ) 1 ( 2 =
Diff
.
The reduction in the standard error should increase power relative to an
independent samples test with the same number of scores, but the correlated t has only
half the degrees of freedom as the independent t, which causes some loss of power.
The gain in power from reducing the standard error will generally greatly exceed the
loss of power due to loss of half of the degrees of freedom, but one could actually have
less power with the correlated t than with the independent t if sample size were low and
the ρ₁₂ low. Please see my document Correlated t versus Independent t.
When conducting a power analysis for the correlated samples design, we can
take into account the effect of ρ₁₂ by computing dDiff, an adjusted value of d:
dDiff = d/√(2(1 − ρ₁₂)), where d is the effect size as computed above with
independent samples. We can then compute power via δ = dDiff√n, or the required
sample size via n = (δ/dDiff)².
Please note that using the standard deviation of the difference scores, rather than the
standard deviation of the criterion variable, as the denominator of dDiff, is simply
Howell's method of incorporating into the analysis the effect of the correlation produced
by matching. If we were computing estimated d (Hedges g) as an estimate of the
standardized effect size given the obtained results, we would use the standard deviation
of the criterion variable in the denominator, not the standard deviation of the difference
scores. I should admit that on rare occasions I have argued that, in a particular
research context, it made more sense to use the standard deviation of the difference
scores in the denominator of g.
Consider the following a priori power analysis. I am testing the effect of a new
drug on performance on a task that involves solving anagrams. I want to have enough
power to be able to detect an effect as small as 1/5 of a standard deviation (d = .2) with
95% power. I consider Type I and Type II errors equally serious and am employing a
.05 criterion of statistical significance, so I want beta to be not more than .05. I shall use
a correlated samples design (within subjects) and two conditions (tested under the
influence of the drug and not under the influence of the drug). In previous research I
have found the correlation between conditions to be approximately .8.
dDiff = .2/√(2(1 − .8)) = .3162. n = (δ/dDiff)² = (3.6/.3162)² = 130.
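Here is the same a priori calculation as a script (mine, not the author's), again substituting the normal-curve quantile sum (≈3.605) for the tabled δ = 3.60:

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()

def n_correlated(d, rho, power=0.95, alpha=0.05):
    """Required pairs for a correlated-samples design: adjust d for the
    correlation between conditions, then n = (delta / d_diff)**2."""
    d_diff = d / sqrt(2 * (1 - rho))
    delta = Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)
    return ceil((delta / d_diff) ** 2)

print(n_correlated(0.2, 0.8))  # 130 pairs, as in the worked example
```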
Consider the following a posteriori power analysis. We assume that GRE Verbal
and GRE Quantitative scores are measured on the same metric, and we wish to
determine whether persons intending to major in experimental or developmental
psychology are equally skilled in things verbal and things quantitative. If we employ a
.05 criterion of significance, and if the true size of the effect is 20 GRE points (that was
the actual population difference the last time I checked it, with quantitative > verbal),
what are our chances of obtaining significant results if we have data on 400 persons?
We shall assume that the correlation between verbal and quantitative GRE is .60 (that is
what it was for social science majors the last time I checked). We need to know what
the standard deviation is for the dependent variable, GRE score. The last time I
checked, it was 108 for verbal, 114 for quantitative. Let us just average those and use
111. σDiff = 111√(2(1 − .6)) = 99.28. dDiff = 20/99.28 = .20145. δ = dDiff√n = .20145√400 = 4.03.
From the power table, power = 98%.
Pearson r
For a Pearson Correlation Coefficient, d = the size of the correlation coefficient in
the population, and δ = d√(n − 1), where n = the number of pairs of scores in the sample.
Consider the following a priori power analysis. We wish to determine whether or
not there is a correlation between misanthropy and support for animal rights. We shall
measure these attributes with instruments that produce scores for which it is reasonable
to treat the variables as continuous. How many respondents would we need to have a
95% probability of obtaining significant results if we employed a .05 criterion of
significance and if the true value of the correlation (in the population) was 0.2?
n = (δ/d)² + 1 = (3.6/.2)² + 1 = 325.
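A quick check of the Pearson r example, using the tabled δ = 3.60 exactly as the document does (my sketch):

```python
def n_pearson(rho, delta=3.60):
    """Pairs needed to detect a population correlation rho;
    delta = 3.60 is the tabled value for 95% power at alpha = .05 two-tailed."""
    return round((delta / rho) ** 2 + 1)

print(n_pearson(0.2))  # 325 pairs
```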
Type III Errors and Three-Choice Tests
Leventhal and Huynh (Psychological Methods, 1996, 1, 278-292) note that it is
common practice, following rejection of a nondirectional null, to conclude that the
direction of difference in the population is the same as what it is in the sample. This
procedure is what they call a "directional two-tailed test." They also refer to it as a
"three-choice test" (I prefer that language), in that the three hypotheses entertained are:
parameter = null value, parameter < null value, and parameter > null value. This makes
possible a Type III error: correctly rejecting the null hypothesis, but incorrectly inferring
the direction of the effect - for example, when the population value of the tested
parameter is actually more than the null value, getting a sample value that is so much
below the null value that you reject the null and conclude that the population value is
also below the null value. The authors show how to conduct a power analysis that
corrects for the possibility of making a Type III error. See my summary at:
http://core.ecu.edu/psyc/wuenschk/StatHelp/Type_III.htm
Copyright 2011, Karl L. Wuensch - All rights reserved.
Power-Example.doc
Examples of the Use of Power Analysis in Actual Research Projects
Two Conditions, Within-Subjects Design
Here is a real-life example of an a priori power analysis done by a graduate
student in our health psychology program. I think it serves as a good example of how
power analysis is an essential part of planning research.
I have a within subjects design. I am trying to predict the necessary sample size
(smallest) necessary to achieve adequate power (0.80 is fine). My problem now is I
don't know the expected value for the correlation between baseline scores and post-test
scores. Could you give me an idea about this --- or where I would get it from?
I am using prior research to estimate how large the effect will be in my research.
The stats from that prior research are:
Baseline: Mean = 41.7; SD = 9.93; Post: Mean = 32.8; SD = 4.94
The expected value for the correlation between baseline scores and post-test
scores must be estimated. Look at my document at
http://core.ecu.edu/psyc/wuenschk/docs30/Power-N.doc, the section Correlated
Samples T the table on page 5 shows how the required number of cases to achieve
80% power differs with both size of the effect and the correlation between conditions.
If the scores will be obtained from an instrument that has been used before, you
should be able to find, in the literature or from the researchers who have used that
instrument, an estimate of its reliability (Cronbach's alpha, for example). You could then
use that as an estimate of the baseline-posttest correlation. If others have used the
same dependent variable in pre-post designs, you could estimate the pre-post
correlation for your study as being about what it was in those other studies. If you are
still striking out, you can simply estimate the correlation as having a modest value, say
.7, and then after you have started collecting data check to see what the correlation is. If it
is much less than .7, then your study will be underpowered and you know you need to
increase the sample size beyond what you expected to need unless, of course, the
data show that the effect is also large enough to be able to detect with the sample size
you will obtain.
I suggest you obtain enough data to be able to detect an effect that is only
medium in size (one-half standard deviation) or, if you expect the effect to be small but
not trivial, small in size (one-fifth standard deviation). For the stats you provided, g is
about 1.1: g = (41.7 − 32.8)/√[(9.93² + 4.94²)/2] ≈ 1.1.
If we were computing g to report as an effect-size estimate, I would probably use
the baseline SD as the standardizer, that is, report Glass's delta rather than Hedges' g.
For a medium effect and rho = .7, ddiff = .5/sqrt(2(1 − .7)) = .645. Taking that to
G*Power, we find that you need 22 cases (each measured pre and post) to get 80%
power to detect a medium-sized effect.
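The arithmetic in this example can be sketched as follows (my code, not part of the correspondence). Note that the simple normal approximation for n runs a few cases below the exact noncentral-t answer of 22 that G*Power reports:

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()

d, rho = 0.5, 0.7                 # medium effect; .7 is the assumed pre-post correlation
d_diff = d / sqrt(2 * (1 - rho))
print(round(d_diff, 3))           # 0.645, as in the example

# Normal-approximation n for 80% power at alpha = .05 two-tailed;
# exact noncentral-t software such as G*Power gives 22 cases.
delta = Z.inv_cdf(0.975) + Z.inv_cdf(0.80)
print(ceil((delta / d_diff) ** 2))  # 19 by this approximation
```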
Two Groups, Independent Samples
Sylwia Mlynarska was working on her proposal (for her Science Fair project) and
the Institutional Review Board (which has to approve the proposal before she is allowed
to start collecting data) requested that she conduct a power analysis to determine how
many respondents she need recruit to answer her questionnaire. As I was serving as
her research mentor in this matter, I assisted her with this analysis. Below is a copy of
my correspondence with her.
From: Wuensch, Karl L.
Sent: Friday, June 02, 2000 1:04 PM
To: 'Sylwia Mlynarska'
Subject: A Priori Power Analysis
Sylwia, it so happens that the question you ask concerns exactly the topic that
we are covering in my undergraduate statistics class right now, so I am going to share
my response with the class.
PSYC 2101 students, Sylwia is a sophomore at the Manhattan Center for
Science and Mathematics High School in New York City. She is researching whether or
not ethnic groups differ on attitude towards animals (animal rights), using an instrument
of my construction. She is now in the process of obtaining approval (from the high
school's Institutional Review Board) of her proposal to conduct this research. I am
assisting her long-distance. Here is my response to her question about sample sizes:
Sylwia, the more subjects you have, the greater your power. Power is the
probability that you will find a difference, assuming that one really exists. Power is also
a function of the magnitude of the difference you seek to detect. If the difference is, in
fact, large, then you don't need many subjects to have a good chance of detecting it. If
it is small, then you do. Of course, you don't really know how large the difference
between ethnic groups is, so that makes it hard to plan. We assume that you will be
satisfied if your power is 80%. That is, if there really is a difference, you have an 80%
chance of detecting it statistically. Put another way, the odds of your finding the
difference are 4 to 1 in your favor. We also assume that you will be using the traditional
5% criterion (alpha) of statistical significance.
If the difference between two ethnic groups is small (defined by Cohen as
differing by 1/5 of a standard deviation), then to have 80% power you would need to
have 393 subjects in each ethnic group. If the difference is of medium size (1/2 of a
standard deviation), then you need only 64 subjects in each ethnic group. If the
difference is large (4/5 of a standard deviation), then you only need 26 subjects per
ethnic group.
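Those per-group figures come from Cohen's tables. A normal-approximation script of my own gets close; the exact t-based values quoted above run one or two higher for the medium and large cases:

```python
from math import ceil
from statistics import NormalDist

Z = NormalDist()

def n_per_group(d, power=0.80, alpha=0.05):
    """Per-group n for two independent groups (normal approximation):
    n = 2 * ((z_crit + z_power) / d)**2."""
    z = Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)
    return ceil(2 * (z / d) ** 2)

for d in (0.2, 0.5, 0.8):
    print(d, n_per_group(d))  # 393, 63, 25 vs. the quoted 393, 64, 26
```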
It is typically difficult to get enough subjects to have a good chance of detecting a
small effect, so we generally settle for getting enough to have a good chance of
detecting a medium effect -- but if you can get high power even for small effects, there
is this advantage: If your results fall short of statistical significance (you did not detect
the difference), you can still make a strong statement, you can say that the difference, if
it exists, must be quite small. Without great power, when your result is statistically
"nonsignificant," it is quite possible that a difference is present, but your research did
just not have sufficient power to detect it (this circumstance is referred to as a Type II
error).
Post Hoc Power Analysis As Part of a Critical Evaluation of Published Research
Michelle Marvier wrote a delightful article for the American Scientist (2001,
Ecology of Transgenic Crops, 89: 160-167). She noted that before a transgenic crop
(genetically modified) receives government approval, it must be shown to be relatively
4
safe. Then she went on to discuss an actual petition to the government. Calgene Inc.
submitted a petition for approval of a variety of Bt cotton (cotton which contains genes
from a bacterium that result in it producing a toxin that kills insects which prey upon
cotton). To test this transgenic crop's effects on friendly invertebrates found in the soil around
the plants (such as earthworms), they conducted research with a sample of four
subjects. The test period only lasted 14 days, and in that period the earthworms
exposed to the transgenic cotton plants gained 29.5% less weight than did control
earthworms. The difference between the groups was not statistically significant, which
the naive consumer might interpret to mean that the transgenic cotton had no influence
on the growth of the earthworms. Of course, with a sample size of only 4, the chances
of finding an undesirable effect of the transgenic cotton, assuming that such an effect
does exist, are very small. Dr. Marvier calculated that with the small sample sizes
employed in this research, the effect of the transgenic cotton would need to be quite
large (the exposed earthworms gaining less than half as much weight as the control
earthworms) to have a good (90%) chance (power) of detecting the effect (using the
conventional .05 level of significance, which is, in this case, IMHO, too small given the
relative risks of Type I vs Type II errors).
Karl L. Wuensch, Psychology, East Carolina University, December, 2007.
Power-N.doc
Estimating the Sample Size Necessary to Have Enough Power
How much data do you need -- that is, how many subjects should you include in
your research? If you do not consider the expenses of gathering and analyzing the data
(including any expenses incurred by the subjects), the answer to this question is very
simple -- the more data the better. The more data you have, the more likely you are to
reach a correct decision and the less error there will be in your estimates of parameters
of interest. The ideal would be to have data on the entire population of interest. In that
case you would be able to make your conclusions with absolute confidence (barring any
errors in the computation of the descriptive statistics) and you would not need any
inferential statistics.
Although you may sometimes have data on the entire population of interest,
more commonly you will consider the data on hand as a random sample of the
population of interest. In this case, you will need to employ inferential statistics, and
accordingly power becomes an issue. As you already know, the more data you have,
the more power you have, ceteris paribus. So, how many subjects do you need?
Before you can answer the question "how many subjects do I need?", you will
have to answer several other questions, such as:
How much power do I want?
What is the likely size (in the population) of the effect I am trying to detect, or,
what is the smallest effect size that I would consider of importance?
What criterion of statistical significance will I employ?
What test statistic will I employ?
What is the standard deviation (in the population) of the criterion variable?
For correlated samples designs, what is the correlation (in the population)
between groups?
In my opinion, if one considers Type I and Type II errors equally serious, then
one should have enough power to make β = α. If employing the traditional .05 criterion
of statistical significance, that would mean you should have 95% power. However,
getting 95% power usually involves expenses too great for behavioral researchers --
that is, it requires getting data on many subjects.
A common convention is to try to get at least enough data to have 80% power.
So, how do you figure out how many subjects you need to have the desired amount of
power? There are several methods, including:
You could buy an expensive, professional-quality software package to do the
power analysis.
You could buy an expensive, professional-quality book on power analysis and
learn to do the calculations yourself and/or to use power tables and figures to
estimate power.
You could try to find an interactive web page on the Internet that will do the
power analysis for you. I do not have a great deal of trust in this method.
You could download and use the GPower program, which is free, not too
difficult to use, and generally reliable (this is not to say that it is error free). For
an undetermined reason, this program will not run on my laptop, but it runs fine
on all my other computers.
You could use the simple guidelines provided in Jacob Cohen's "A Power Primer"
(Psychological Bulletin, 1992, 112, 155-159).
Here are minimum sample sizes for detecting small (but not trivial), medium, and
large sized effects for a few simple designs. I have assumed that you will employ the
traditional .05 criterion of statistical significance, and I have used Cohen's guidelines for
what constitutes a small, medium, or large effect.
Chi-Square, One- and Two-Way
Effect size is computed as w = √[Σᵢ₌₁ᵏ (P₁ᵢ − P₀ᵢ)²/P₀ᵢ], where k is the number of cells, P₀ᵢ is the
population proportion in cell i under the null hypothesis, and P₁ᵢ is the population
proportion in cell i under the alternative hypothesis. For example, suppose that you
plan to analyze a 2 x 2 contingency table. You decide that the smallest effect that you
would consider to be nontrivial is one that would be expected to produce a contingency
table like this, where the experimental variable is whether the subject received a
particular type of psychotherapy or just a placebo treatment and the outcome is whether
the subject reported having benefited from the treatment or not:
Experimental Group
Outcome    Treatment    Control
Positive      55           45
Negative      45           55
For this table, w = √[4(.275 − .25)²/.25] = .10.
For each cell in the table you compute the expected frequency under the null
hypothesis (P₀ᵢ) by multiplying the number of scores in the row in which that cell falls by
the number of scores in the column in which that cell falls and then dividing by the total
number of scores in the table. Then you divide by total N again to convert the expected
frequency to an expected proportion. For the table above the expected proportion will
be the same for every cell, (100)(100)/[(200)(200)] = .25. For each cell you also compute the
expected proportion under the alternative hypothesis (P₁ᵢ) by dividing the expected
number of scores in that cell by total N. For the table above that will give you the same
proportion for every cell, 55/200 = .275 or 45/200 = .225. The squared difference
between P₁ᵢ and P₀ᵢ, divided by P₀ᵢ, is the same in each cell, .0025. Sum that across four
cells and you get .01. The square root of .01 is .10. Please note that this is also the
value of phi.
In the treatment group, 55% of the patients reported a positive outcome. In the
control group only 45% reported a positive outcome. In the treatment group the odds of
reporting a positive outcome are 55 to 45, that is, 1.2222. In the control group the odds
are 45 to 55, that is, .8182. That yields an odds ratio of 1.2222/.8182 = 1.49. That is,
the odds of reporting a positive outcome are, for the treatment group, about one and a
half times higher than they are for the control group.
What if the effect is larger, like this:
Experimental Group
Outcome    Treatment   Control
Positive       65         35
Negative       35         65
Here w = √[4(.325 - .25)²/.25] = .30. Now the odds ratio is 3.45 and the phi is .3.
Or even larger, like this:
Experimental Group
Outcome    Treatment   Control
Positive       75         25
Negative       25         75
Here w = √[4(.375 - .25)²/.25] = .50. Now the odds ratio is 9 and the phi is .5.
Cohen considered a w of .10 to constitute a small effect, .3 a medium effect, and
.5 a large effect. Note that these are the same values indicated below for a Pearson r.
The required total sample size depends on the degrees of freedom, as shown in the
table below:
Effect Size
df Small Medium Large
1 785 87 26
2 964 107 39
3 1,090 121 44
4 1,194 133 48
5 1,293 143 51
6 1,362 151 54
The correspondence between phi and odds ratios depends on the distribution
of the marginals.
More on w.
Pearson r
Cohen considered a ρ of .1 to be small, .3 medium, and .5 large. You need 783
pairs of scores for a small effect, 85 for a medium effect, and 28 for a large effect. In
terms of percentage of variance explained, small is 1%, medium is 9%, and large is
25%.
One-Sample T Test
Effect size is computed as d = (μ₁ - μ₀)/σ, the difference between the population mean
and the null-hypothesized mean in standard deviation units. A d of .2 is considered
small, .5 medium, and .8 large. For 80% power you need 196 scores for a small effect,
33 for medium, and 14 for large.
Cohen's d is not affected by the ratio of n₁ to n₂, but some alternative measures
of magnitude of effect (r_pb and η²) are. See this document.
Independent Samples T, Pooled Variances.
Effect size is computed as d = (μ₁ - μ₂)/σ. A d of .2 is considered small, .5
medium, and .8 large. Suppose that you have populations with means of 10 and 12 and
a within-group standard deviation of 10. Then d = (12 - 10)/10 = .2. If the sample
size is large enough
that there will be little difference between the t distribution and the standard normal
curve, then we can obtain the value of δ (the noncentrality parameter) from a table
found in David Howell's statistics texts. With the usual nondirectional hypotheses and a
.05 criterion of significance, δ is 2.8 for power of 80%. You can use the G*Power
program to fine-tune the solution you get using Howell's table.
I constructed the table below using Howell's table and G*Power, assuming
nondirectional hypotheses and a .05 criterion of significance.
     Small effect             Medium effect            Large effect
 d    ρ    d_Diff   n     d    ρ    d_Diff   n     d    ρ    d_Diff   n
.2   .00   .141   393    .5   .00   0.354   65    .8   .00   0.566   26
.2   .50   .200   196    .5   .50   0.500   33    .8   .50   0.800   14
.2   .75   .283   100    .5   .75   0.707   18    .8   .75   1.131    8
.2   .90   .447    41    .5   .90   1.118    8    .8   .90   1.789    4
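The entries in this table can be approximately reproduced from the formulas in the surrounding text: d_Diff = d/√(2(1 - ρ)) and n ≈ (δ/d_Diff)² with δ = 2.8. A sketch (the function names are mine):

```python
import math

DELTA = 2.8  # delta for 80% power, two-tailed alpha = .05 (from Howell's table)

def d_diff(d, rho):
    """Effect size expressed on the difference-score metric."""
    return d / math.sqrt(2 * (1 - rho))

def approx_n(d, rho, delta=DELTA):
    """Large-sample approximation to the required number of pairs."""
    return (delta / d_diff(d, rho)) ** 2

# A couple of entries from the table above:
# d = .2, rho = .50 -> d_diff = .200, n about 196
# d = .5, rho = .75 -> d_diff = .707, n about 16 (the exact answer, 18, is a
#   bit larger because the approximation understates n when n is small)
```
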
IMHO, one should not include the effect of the correlation in one's calculation of d
with correlated samples. Consider a hypothetical case. We have a physiological
measure of arousal for which the mean and standard deviation, in our population of
interest, are 50 (M) and 10 (SD). We wish to evaluate the effect of an experimental
treatment on arousal. We decide that the smallest nontrivial effect would be one of 2
points, which corresponds to a standardized effect size of d = .20.
Now suppose that the correlation is .75. The SD of the difference scores would
be 7.071, and the d_Diff would be .28. If our sample means differed by exactly 2 points,
what would be our effect size estimate? Despite d_Diff being .28, the difference is still
just 2 points, which corresponds to a d of .20 using the original group standard
deviations, so, IMHO, we should estimate d as being .20.
Now suppose that the correlation is .9. The SD of the difference scores would be
4.472, and the d_Diff would be .45, but the difference is still just two points, so we
should not claim a larger effect just because the high correlation reduced the standard
deviation of the difference scores. We should still estimate d as being .20.
Note that the correlated samples t will generally have more power than an
independent samples t, holding the number of scores constant, as long as ρ₁₂ is not
very small or negative. With a small ρ₁₂ it is possible to get less power with the
correlated t than with the independent samples t; see this illustration. The correlated
samples t has only half the df of the independent samples t, making the critical value of t
larger. In most cases the reduction in the standard error will more than offset this loss
of df. Do keep in mind that if you want to have as many scores in a between-subjects
design as you have in a within-subjects design you will need twice as many cases.
One-Way Independent Samples ANOVA
Cohen's f (effect size) is computed as f = √[ Σ (μ_j - μ)² / k ] / σ_error, where μ_j is the population
mean for a single group, μ is the grand mean, k is the number of groups, and the error
variance is the mean within-group variance. This can also be computed as f = σ_means / σ_error,
where the numerator is the standard deviation of the population means and the
denominator is the within-group standard deviation.
We assume equal sample sizes and homogeneity of variance.
Suppose that the effect size we wish to use is one where the three population
means are 480, 500, and 520, with the within-group standard deviation being 100.
Using the first formula above, f = √[ (400 + 0 + 400) / 3(100)² ] = .163. Using the second formula,
the population standard deviation of the means (with k, not k-1, in the denominator) is
16.33, so f = 16.33/100 = .163. By the way, David Howell uses the symbol φ' instead
of f.
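Both routes to f can be checked with a few lines of Python (a sketch using the example values above):

```python
import math

means = [480, 500, 520]
sd_within = 100.0
k = len(means)
grand = sum(means) / k  # 500

# First formula: f = sqrt( sum((mu_j - mu)^2) / (k * sigma_error^2) )
f1 = math.sqrt(sum((m - grand) ** 2 for m in means) / (k * sd_within ** 2))

# Second formula: f = sigma_means / sigma_error (k, not k - 1, in the denominator)
sigma_means = math.sqrt(sum((m - grand) ** 2 for m in means) / k)  # 16.33
f2 = sigma_means / sd_within

# both give f = .163
```
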
You should be familiar with η² as the treatment variance expressed as a
proportion of the total variance. If η² is the treatment variance, then 1 - η² is the error
variance. With this in mind, we can define f² = η²/(1 - η²). For correlated samples, the
effect size computed on the difference scores is d_Diff = d/√(2(1 - ρ)); this is the
quantity G*Power calls dz.
Use the following settings:
Statistical test: Means: Difference between two dependent means (matched pairs)
Type of power analysis: A priori: Compute required sample size given α, power, and
effect size
Tail(s): Two
Effect size dz: .3162
α error prob: 0.05
Power (1 - β err prob): .95
Click Calculate. You will find that you need 132 pairs of scores.
Output: Noncentrality parameter = 3.632861
Critical t = 1.978239
Df = 131
Total sample size = 132
Actual power = 0.950132
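This output can be verified against the noncentral t distribution, since the noncentrality parameter is δ = dz·√n. A sketch (assumes scipy is available):

```python
import math
from scipy import stats

dz, n, alpha = 0.3162, 132, 0.05
df = n - 1                                # 131
delta = dz * math.sqrt(n)                 # about 3.633, matching the output above
t_crit = stats.t.ppf(1 - alpha / 2, df)   # about 1.978
# Two-tailed power: mass of the noncentral t beyond the critical values.
power = (1 - stats.nct.cdf(t_crit, df, delta)) + stats.nct.cdf(-t_crit, df, delta)
# power comes out near .950
```
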
Consider the following a posteriori power analysis. We assume that GRE Verbal
and GRE Quantitative scores are measured on the same metric, and we wish to
determine whether persons intending to major in experimental or developmental
psychology are equally skilled in things verbal and things quantitative. If we employ a
.05 criterion of significance, and if the true size of the effect is 20 GRE points (that was
the actual population difference the last time I checked it, with quantitative > verbal),
what are our chances of obtaining significant results if we have data on 400 persons?
We shall assume that the correlation between verbal and quantitative GRE is .60 (that is
what it was for social science majors the last time I checked). We need to know what
the standard deviation is for the dependent variable, GRE score. The last time I
checked, it was 108 for verbal, 114 for quantitative.
Change type of power analysis to Post hoc. Set the total sample size to 400.
Click on Determine. Select "from group parameters." Set the group means to 0 and
20 (or any other two means that differ by 20), the standard deviations to 108 and 114,
and the correlation between groups to .6. Click Calculate in this window to obtain the
effect size dz, .2100539.
Click Calculate and transfer to main window to move the effect size dz to the
main window. Click Calculate in the main window to compute the power. You will see
that you have 98% power.
Pearson r
Consider the following a priori power analysis. We wish to determine whether or
not there is a correlation between misanthropy and support for animal rights. We shall
measure these attributes with instruments that produce scores for which it is reasonable
to treat the variables as continuous. How many respondents would we need to have a
95% probability of obtaining significant results if we employed a .05 criterion of
significance and if the true value of the correlation (in the population) was 0.2?
Select the following options:
Test family: t tests
Statistical test: Correlation: Point biserial model (that is, a regression analysis)
Type of power analysis: A priori: Compute required sample size given α, power, and
effect size
Tails: Two
Effect size |r|: .2
α error prob: 0.05
Power (1 - β err prob): .95
Click Calculate and you will see that you need 314 cases.
t tests - Correlation: Point biserial model
Analysis: A priori: Compute required sample size
Input: Tail(s) = Two
Effect size |r| = .2
α err prob = 0.05
Power (1 - β err prob) = 0.95
Output: Noncentrality parameter = 3.617089
Critical t = 1.967596
Df = 312
Total sample size = 314
Actual power = 0.950115
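The noncentrality parameter in this output is δ = r√n / √(1 - r²); a sketch of the check (assumes scipy is available):

```python
import math
from scipy import stats

r, n, alpha = 0.2, 314, 0.05
df = n - 2                                       # 312
delta = r * math.sqrt(n) / math.sqrt(1 - r**2)   # about 3.617, as in the output
t_crit = stats.t.ppf(1 - alpha / 2, df)          # about 1.968
power = (1 - stats.nct.cdf(t_crit, df, delta)) + stats.nct.cdf(-t_crit, df, delta)
# about .950
```
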
Check out Steiger and Fouladi's R2 program, which will do power analysis (and
more) for correlation models, including multiple correlation.
Install G Power on Your Personal Computer
If you would like to install G*Power on your Windows computer, you can
download it from Universität Düsseldorf.
Return to Wuensch's Statistics Lessons Page
February, 2012.
G*Power: One-Way Independent Samples ANOVA
See the power analysis done by hand in my document One-Way Independent
Samples Analysis of Variance. Here I shall do it with G*Power.
We want to know how much power we would have for a three-group ANOVA
where we have 11 cases in each group and the effect size in the population is
f = .163. When we did it by hand, using the table in our text book, we found power =
10%. Boot up G*Power:
Click OK. Click OK again on the next window.
Click Tests, F-Test (Anova).
Under Analysis, select Post Hoc. Enter .163 as the Effect size f, .05 as the
Alpha, 33 as the Total sample size, and 3 as the number of Groups. Click Calculate.
G*Power tells you that power = .1146.
OK, how many subjects would you need to raise power to 70%? Under Analysis,
select A Priori, under Power enter .70, and click Calculate.
G*Power advises that you need 294 cases, evenly split into three groups, that is,
98 cases per group.
Alt-X, Discard to exit G*Power.
That was easy, wasn't it?
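That result can also be reproduced with the noncentral F distribution, using λ = f²·N (a sketch; assumes scipy is available):

```python
from scipy import stats

f, N, k, alpha = 0.163, 33, 3, 0.05
df1, df2 = k - 1, N - k                    # 2 and 30
lam = f**2 * N                             # noncentrality, about .877
F_crit = stats.f.ppf(1 - alpha, df1, df2)  # about 3.32
power = 1 - stats.ncf.cdf(F_crit, df1, df2, lam)
# roughly .11, in line with the .1146 reported above
```
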
Links
Karl Wuensch's Statistics Lessons
Internet Resources for Power Analysis
Karl L. Wuensch
Dept. of Psychology
East Carolina University
Greenville, NC USA
GPower3-ANOVA-Factorial.doc
G*Power: Factorial Independent Samples ANOVA
The analysis is done pretty much the same as it is with a one-way ANOVA.
Suppose we are planning research for which an A x B, 3 x 4 ANOVA would be
appropriate. We want to have enough data to have 80% power for a medium sized
effect. The omnibus analysis will include three F tests: one with 2 df in the numerator,
one with 3, and one with 6 (the interaction). We plan on having sample size constant
across cells.
Boot up G*Power and enter the options shown below:
Remember that Cohen suggested .25 as the value of f for a medium-sized effect.
The numerator df for the main effect of Factor A is (3-1)=2. The number of groups here
is the number of cells in the factorial design, 3 x 4 = 12. When you click Calculate you
see that you need a total N of 158. That works out to 13.2 cases per cell, so bump the
N up to 14(12) = 168.
What about Factor B and the interaction? There are (4-1)=3 df for the main
effect of Factor B, and when you change the numerator df to 3 and click Calculate
again you see that you need an N of 179 to get 80% power for that effect. The
interaction has 2(3)=6 df, and when you change the numerator df to 6 and click
Calculate you see that you need an N of 225 to have 80% power to detect a medium-
sized interaction. With equal sample sizes, that means you need 19 cases per cell, 228
total N.
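The bump-up arithmetic used above (rounding a required total N up to equal cell sizes) is just a ceiling; as a sketch:

```python
import math

def equal_cell_n(total_n, n_cells):
    """Round a required total N up to equal cell sizes."""
    per_cell = math.ceil(total_n / n_cells)
    return per_cell, per_cell * n_cells

# 3 x 4 design, 12 cells:
#   main effect of A: N = 158 -> 14 per cell, 168 total
#   interaction:      N = 225 -> 19 per cell, 228 total
```
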
Clearly you are not going to have the same amount of power for each of the
three effects. If your primary interest was in the main effects, you might go with a total
N that would give you the desired power for main effects but somewhat less than that
for the interaction. If, however, you have reason to expect an interaction, you will go for
the total N of 228. How much power would that give you for the main effects?
As you can see, you would have almost 93% power for A. If you change the
numerator df to 3 you will see that you would have 89.6% power for B.
If you click the Determine button you get a second window which allows you to
select the value of f by specifying a value of η² or partial η². Suppose you want to know
what f is for an effect that explains only 1% of the total variance. You tell G*Power
that the Variance explained by special effect is .01 and Error variance is .99. Click
Calculate and you get an f of .10. Recall that earlier I told you that an f of .10 is
equivalent to an η² of .01.
If you wanted to find f for an effect that accounted for 6% of the variance, you
would enter .06 (effect) and .94 (error) and get an f of .25 (a medium-sized effect).
Wait a minute. I have ignored the fact that the error variance in the factorial
ANOVA will be reduced by an amount equal to variance explained by the other factors
in the model, and that will increase power. Suppose that I have entered Factor B into
the model primarily as a categorical covariate. From past research, I have reason to
believe that Factor B will account for about 14% of the total variance (a large effect,
equivalent to an f of .40). I have no idea whether or not the interaction will explain much
variance, so I play it safe and assume it will explain no variance. When I calculate f I
should enter .06 (effect) and .80 (error: 1, less .06 for A and another .14 for B).
G*Power gives an f of .27, which I would then use in the power analysis for Factor A.
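The f-from-variance-proportions arithmetic used in the last few paragraphs is a one-liner (a sketch; the function name is mine):

```python
import math

def f_from_proportions(effect, error):
    """Cohen's f from proportions of total variance: f = sqrt(effect / error)."""
    return math.sqrt(effect / error)

# .01 / .99 -> f = .10 (small);  .06 / .94 -> f = .25 (medium)
# .06 / .80 -> f = .27 once B's 14% is removed from the error
```
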
The 2 x 2 ANOVA: A Query From Down Under
I work at the Department of Psychology, Macquarie University in Sydney
Australia. We're currently writing up a grant proposal and have to include power
calculations. I have a very simple question about G*Power analysis for a simple
experimental 2x2 ANOVA study. For such a design, I assume that the Numerator df
would be 1 for main effects and interactions. Does this mean that, unlike Cohen's power
analysis, G*Power 3 would give the same power estimate for main effects and the
interaction in a 2x2 ANOVA? I had always assumed that interactions would be more
difficult to detect than main effects, and that seems true for all multi-factorial designs
except a 2x2.
The key is the numerator df, and, as you note, they are all the same (1) in the 2 x
2 design, so your power will be constant across effects. You should, however, consider
what will follow if you have a significant interaction. Likely you will want to test simple
main effects. When planning it is probably best to assume that you might have enough
heterogeneity of variance to warrant using individual error terms rather than a pooled
error term. In that case, the tests of simple main effects are absolutely equivalent to
independent samples t tests, on half (assuming equal sample sizes) of the total data.
For example, if you decide to settle for 80% power for detecting a medium-sized
effect, you will need 128 cases (32 per cell).
F tests - ANOVA: Fixed effects, special, main effects and interactions
Analysis: A priori: Compute required sample size
Input: Effect size f = 0.25
α err prob = 0.05
Power (1 - β err prob) = 0.80
Numerator df = 1
Number of groups = 4
Output: Noncentrality parameter = 8.0000000
Critical F = 3.9175498
Denominator df = 124
Total sample size = 128
Actual power = 0.8013621
If you wish to test the effect of one factor at each level of the other factor, with
individual error terms, still settling for 80% power for a medium-sized effect, then you
will need 128 cases for each simple effect, that is a total of 256 cases (64 per cell).
Analysis: A priori: Compute required sample size
Input: Effect size f = 0.25
α err prob = 0.05
Power (1 - β err prob) = 0.80
Numerator df = 1
Number of groups = 2
Output: Noncentrality parameter = 8.0000000
Critical F = 3.9163246
Denominator df = 126
Total sample size = 128
Actual power = 0.8014596
t tests - Means: Difference between two independent means (two groups)
Analysis: A priori: Compute required sample size
Input: Tail(s) = Two
Effect size d = 0.5
α err prob = 0.05
Power (1 - β err prob) = 0.80
Allocation ratio N2/N1 = 1
Output: Noncentrality parameter = 2.8284271
Critical t = 1.9789706
Df = 126
Sample size group 1 = 64
Sample size group 2 = 64
Total sample size = 128
Actual power = 0.8014596
G*Power: 3-Way Factorial Independent Samples ANOVA
The analysis is done pretty much the same as it is with a two-way ANOVA.
Suppose we are planning research for which an A x B x C, 2 x 2 x 3 ANOVA would be
appropriate. We want to have enough data to have 80% power for a medium sized
effect. The omnibus analysis will include seven F tests: three with one df each (A, B,
and A x B) and four with two df each (C, A x C, B x C, and A x B x C). We plan on
having sample size constant across cells.
For the tests of A, B, and A x B:
F tests - ANOVA: Fixed effects, special, main effects and interactions
Analysis: A priori: Compute required sample size
Input: Effect size f = 0.25
α err prob = 0.05
Power (1 - β err prob) = .80
Numerator df = 1
Number of groups = 12
Output: Noncentrality parameter = 8.0000000
Critical F = 3.9228794
Denominator df = 116
Total sample size = 128
Actual power = 0.8009381
Remember that Cohen suggested .25 as the value of f for a medium-sized effect.
The number of groups here is the number of cells in the factorial design, 2 x 2 x 3 =
12. When you click Calculate you see that you need a total N of 128. That works out
to 10.67 cases per cell, so bump the N up to 11(12) = 132.
For the effects with 2 df:
F tests - ANOVA: Fixed effects, special, main effects and interactions
Analysis: A priori: Compute required sample size
Input: Effect size f = 0.25
α err prob = 0.05
Power (1 - β err prob) = .80
Numerator df = 2
Number of groups = 12
Output: Noncentrality parameter = 9.8750000
Critical F = 3.0580504
Denominator df = 146
Total sample size = 158
Actual power = 0.8016972
That works out to 13.2 cases per cell, so bump the N up to 14(12) = 168.
Suppose that you anticipate obtaining a significant triple interaction and following
that with analysis of the A x B simple interactions at each level of C. Playing it
conservative by using individual error terms, you will then need, at each level of C:
F tests - ANOVA: Fixed effects, special, main effects and interactions
Analysis: A priori: Compute required sample size
Input: Effect size f = 0.25
α err prob = 0.05
Power (1 - β err prob) = .80
Numerator df = 1
Number of groups = 4
Output: Noncentrality parameter = 8.0000000
Critical F = 3.9175498
Denominator df = 124
Total sample size = 128
Actual power = 0.8013621
That is 128/4 = 32 cases for each A x B cell. Since there are three levels of C,
the total sample size needed is now 3 x 128 = 384.
Suppose the A x B interaction were to be significant at one or more of the levels of
C. You likely would then test the simple, simple main effects of A at each level of B (or
vice versa). For each such comparison (which would involve only two cells):
F tests - ANOVA: Fixed effects, special, main effects and interactions
Analysis: A priori: Compute required sample size
Input: Effect size f = 0.25
α err prob = 0.05
Power (1 - β err prob) = .80
Numerator df = 1
Number of groups = 2
Output: Noncentrality parameter = 8.0000000
Critical F = 3.9163246
Denominator df = 126
Total sample size = 128
Actual power = 0.8014596
You need 128 scores, 64 per cell. Since we have a total of 12 cells, that works
out to 768 cases. You might end up deciding that you can get by with having less
power for detecting simple effects than for detecting effects in the omnibus analysis.
Suppose you ended up with 20 scores per cell, total N = 20(12) = 240. How
much power would you have for detecting medium-sized effects in the omnibus
analysis?
For the one df effects:
F tests - ANOVA: Fixed effects, special, main effects and interactions
Analysis: Post hoc: Compute achieved power
Input: Effect size f = 0.25
α err prob = 0.05
Total sample size = 240
Numerator df = 1
Number of groups = 12
Output: Noncentrality parameter = 15.0000000
Critical F = 3.8825676
Denominator df = 228
Power (1 - β err prob) = 0.9710633
For the two df effects:
F tests - ANOVA: Fixed effects, special, main effects and interactions
Analysis: Post hoc: Compute achieved power
Input: Effect size f = 0.25
α err prob = 0.05
Total sample size = 240
Numerator df = 2
Number of groups = 12
Output: Noncentrality parameter = 15.0000000
Critical F = 3.0354408
Denominator df = 228
Power (1 - β err prob) = 0.9411531
How much power would you have if you got down to the level of comparing one
cell with one other cell:
F tests - ANOVA: Fixed effects, special, main effects and interactions
Analysis: Post hoc: Compute achieved power
Input: Effect size f = 0.25
α err prob = 0.05
Total sample size = 40
Numerator df = 1
Number of groups = 2
Output: Noncentrality parameter = 2.5000000
Critical F = 4.0981717
Denominator df = 38
Power (1 - β err prob) = 0.3379390
Power-RM-ANOVA.doc
Power Analysis for One-Way Repeated Measures ANOVA
Univariate Approach
Colleague Caren Jordan was working on a proposal and wanted to know how
much power she would have if she were able to obtain 64 subjects. The proposed
design was a three group repeated measures ANOVA. I used G*Power to obtain the
answer for her. Refer to the online instructions, Other F-Tests, Repeated Measures,
Univariate approach. We shall use n = 64, m = 3 (number of levels of the repeated
factor), numerator df = 2 (m - 1), denominator df = 128 (n times m - 1), f² = .01 (a small
effect, the within-group ratio of effect variance to error variance), and ρ = .79 (the
correlation between scores at any one level of the repeated factor and scores at any
other level of the repeated factor). Her estimate of ρ was based on the test-retest
reliability of the instrument employed.
I have used Cohen's (1992, A power primer, Psychological Bulletin, 112,
155-159) guidelines for f², which are .01 = small, .0625 = medium, and .16 = large.
The noncentrality parameter is λ = n m f² / (1 - ρ), but G*Power is set up for us to
enter as Effect size f² the quantity m f² / (1 - ρ) = 3(.01)/(1 - .79) = .143.
Boot up G*Power. Click Tests, Other F-Tests. Enter Effect size f² = 0.143,
Alpha = 0.05, N = 64, Numerator df = 2, and Denominator df = 128. Click Calculate.
G*Power shows that power = .7677.
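The quantity entered as Effect size f² is a one-liner to compute (a sketch; the function name is mine):

```python
def rm_f2_entry(m, f2, rho):
    """Effect size f^2 to enter in the old G*Power repeated-measures
    (univariate approach) routine: m * f^2 / (1 - rho)."""
    return m * f2 / (1 - rho)

# m = 3 levels, f^2 = .01 (small), rho = .79 -> about .143
```
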
Note that I have used, as my estimate of ρ, the mean of the three values observed
by Sheri. This may, or may not, be reasonable. Uncorrected, her numerator df = 2 and
her denominator df = 72. Corrected with epsilon, her Effect size f² = .0275, numerator
df = 1, and denominator df = 36. I enter these values into G*Power and obtain
power = .1625. Sheri needs more data, or needs to hope for a larger effect size. If she
assumes a medium-sized effect, then the epsilon-corrected Effect size
f² = .5(3)(.0625)/(1 - .45) = .17 and power jumps to .67.
The big problem here is the small value of ρ in Sheri's data; she is going to
need more data to get good power. With typical repeated measures data, ρ is larger,
and we can do well with relatively small sample sizes.
Multivariate Approach
The multivariate approach analysis does not require sphericity, and, when
sphericity is lacking, is usually more powerful than is the univariate analysis with
Greenhouse-Geisser or Huynh-Feldt correction. Refer to the G*Power online
instructions, Other F-Tests, Repeated Measures, Multivariate approach.
Since there are three groups, the numerator df = 2. The denominator df = n - p + 1,
where n is the number of cases and p is the number of dependent variables in the
MANOVA (one less than the number of levels of the repeated factor). For Sheri,
denominator df = 36 - 2 + 1 = 35.
For a small effect size, we need Effect size f² = 3(.01)/(1 - .45) = .055. As you
can see, G*Power tells me power is .2083, a little better than it was with the univariate
test corrected for lack of sphericity.
So, how many cases would Sheri need to raise her power to .80? This G*Power
routine will not solve for N directly, so you need to guess until you get it right. On each
guess you need to change the input values of N and denominator df. After a few
guesses, I found that Sheri needs 178 cases to get 80% power to detect a small effect.
A Simpler Approach
Ultimately, in most cases, one's primary interest is going to be focused on
comparisons between pairs of means. Why not just find the number of cases necessary
to have the desired power for those comparisons? With repeated measures designs I
generally avoid using a pooled error term for such comparisons. In other words, I use
simple correlated t tests for each such comparison. How many cases would Sheri
need to have an 80% chance of detecting a small effect, d = .2?
First we adjust the value of d to take into account the value of ρ. I shall use her
weakest link, the correlation of .27: d_Diff = d/√(2(1 - ρ₁₂)) = .2/√(2(1 - .27)) = .166.
Notice that the value of d went down after adjustment. Usually ρ will exceed .5 and the
adjusted d will be greater than the unadjusted d.
The approximate sample size needed is n = (δ/d_Diff)² = (2.8/.166)² = 285. I
checked this with G*Power. Click Tests, Other T Tests. For Effect size f enter .166.
N = 285 and df = n - 1 = 284. Select two-tailed. Click Calculate. G*Power confirms
that power = 80%.
When N is small, G*Power will show that you need a larger N than indicated by
the approximation. Just feed values of N and df to G*Power until you find the N that
gives you the desired power.
Copyright 2005, Karl L. Wuensch - All rights reserved.
Power Analysis for an ANCOV
If you add one or more covariates to your ANOVA model, and they are well
correlated with the outcome variable, then the error term will be reduced and power will
be increased. The effect of the addition of covariates can be incorporated into the
power analysis in this way:
Adjusting the effect size statistic, f, such that the adjusted f = f/√(1 - r²), where r
is the correlation between the covariate (or set of covariates) and the outcome
variable.
Reducing the error df by one for each covariate added to the model.
Consider this example. I am using an ANOVA design to compare three
experimental groups. I want to know how many cases I need to detect a small effect
(f = .1). G*Power tells me I need 1,548 cases. Ouch, that is a lot of data.
Suppose I find a covariate that I can measure prior to manipulating the
experimental variable and which is known to be correlated .7 with the dependent
variable. The adjusted f for a small effect increases to f = .1/√(1 - .49) = .14.
Now I only need 792 cases. Do note that the error df here should be 788, not
789, but that one df is not going to make much difference, as shown below.
I used the Generic F Test routine with the noncentrality parameter from the
earlier run, and I dropped the denominator df to 788. The value of the critical F
increased ever so slightly, but the power did not change at all to six decimal places.
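The adjustment itself is tiny to code (a sketch; the derivation of f/√(1 - r²) appears later in this document):

```python
import math

def f_adjusted(f, r):
    """Adjusted Cohen's f when a covariate correlates r with the DV."""
    return f / math.sqrt(1 - r**2)

# f = .10, r = .7 -> adjusted f of about .14
```
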
My Earlier Discussion of this Topic
Clinical student Natalie Cross wants to conduct a 2 x 2 x 2 ANCOV with a single
covariate. How many subjects does she need to have 80% power with alpha set at .05
if the effect is medium in size?
I shall estimate N for tests of main effects only, not interactions or simple effects.
First, I assume that no covariate is employed. Cohen's f (effect size) is
computed as f = √[ Σ (μ_j - μ)² / k ] / σ_e. Cohen's guidelines suppose that a medium-sized
difference between two groups is one that equals one half the size of the within-group
(error) standard deviation, such as when μ₁ = 10, μ₂ = 12, μ = 11, and σ = 4. This
corresponds to a value of f equal to √[ (1² + 1²) / 2 ] / 4 = 0.25, which is exactly the
value of f which Cohen has suggested corresponds to a medium-sized effect in ANOVA.
Now, how many subjects would we need? φ = f√n, where n is the number of
scores in each group. From Appendix ncF in Howell (Statistical methods for
psychology (5th ed.), 2002, Belmont, CA: Wadsworth), we need a φ of about 2, so
n = (φ/f)² = (2/.25)² = 64. This matches exactly the amount indicated in Table 2 of
Cohen's "A Power Primer" (Psychological Bulletin, 1992, 112, 155-159).
Of course, when we factor in the effect of the covariate, we will have more power
for a fixed sample size, because the error variance (the variance of the dependent
variable scores after being adjusted for the covariate) will be reduced. The larger the
correlation between the covariate and the dependent variable (or, with multiple
covariates, the multiple R between covariates and the dependent variable), the greater
the reduction of error variance. We can estimate the error variance of the adjusted
scores this way: σ_Yadj = σ_Y √(1 - r²) (see http://www.psycho.uni-duesseldorf.de/aap/projects/gpower/reference/reference_manual_07.html#t4). If we
assume that the correlation between the covariate and the dependent variable is .5,
then σ_Yadj = 4 √(1 - .25) = 3.464.
Next we adjust the value of f to take into account the reduction in error variance
due to employing the covariate. Our adjusted f is computed as √[ (1² + 1²) / 2 ] / 3.464 = 0.29.
Our required sample size is now computed as n = (2/.29)² = 48, and, equivalently via the
t-test approach (with the adjusted d = .5/√(1 - .25) = .577), n = 2δ²/d² = 2(2.80)²/(.577)² = 47.
Natalie checked the database and found that the correlation between pre and
post test data was about .7. Using this value of r, our adjusted error standard deviation
is computed as σ_Yadj = 4 √(1 - .49) = 2.857, our adjusted f as √[ (1² + 1²) / 2 ] / 2.857 = 0.35,
and our required sample size as n = (2/.35)² = 33.
How Did Karl Get That Formula for the Adjusted f?
I assume that the covariate is not related to the ANOVA factor(s), but is related to
the part of Y that is not related to the factors (that is, to the error variance).
f² = [ Σ (μ_j - μ)² / k ] / σ²_e, and the adjusted error variance is σ²_e,adj = σ²_e (1 - r²). Substituting
the adjusted error variance in the denominator, f²_adj = [ Σ (μ_j - μ)² / k ] / [ σ²_e (1 - r²) ] = f² / (1 - r²).
Accordingly, f_adj = f/√(1 - r²).
When asked to provide a reference for this adjusted f, I was at a loss, since I had
never seen it before I derived it myself. Thanks to Google, however, I have now found
the same derivation in Rogers, W. T., & Hopkins, K. D. (1988). Power estimates in the
presence of a covariate and measurement error. Educational and Psychological
Measurement, 48, 647-656. doi: 10.1177/0013164488483008
Return to Wuensch's Stats Lessons Page
Karl L. Wuensch
Department of Psychology
East Carolina University
Greenville, NC 27858 USA
24 October 2009
Power Analysis For Correlation and Regression Models
R2.exe Correlation Model
The free R2 program, from James H. Steiger and Rachel T. Fouladi, can be used to do power analysis for testing the null hypothesis that R² (bivariate or multiple) is zero in the population of interest. You can download the program and the manual here.
Unzip the files and put them in the directory/folder R2. Navigate to the R2 directory and
run (double click) the file R2.EXE. A window will appear with R2 in white on a black
background. Hit any key to continue.
Bad News: R2 will not run on Windows 7 Home Premium, which does not
support DOS. It ran on XP just fine. It might run on Windows 7 Pro.
Good News: You can get a free DOS emulator, and R2 works just fine within
the virtual DOS machine it creates. See my document DOSBox.
Consider the research published in the article: Patel, S., Long, T. E.,
McCammon, S. L., & Wuensch, K. L. (1995). Personality and emotional correlates of
self-reported antigay behaviors. Journal of Interpersonal Violence, 10, 354-366. We
had data from 80 respondents. We wished to predict self-reported antigay behaviors
from five predictor variables. Suppose we wanted to know how much power we would
have if the population
2
was .13 (a medium sized effect according to Cohen).
Enter the letter O to get the Options drop down menu. Enter the letter P for
Power Analysis. Enter the letter N to bring up the sample size data entry window.
Enter 80 and hit the enter key. Enter the letter K to bring up the number of variables data entry window. Enter 6 and hit the enter key. Enter the letter R to bring up the ρ² data entry window. Enter .13 and hit the enter key. Enter the letter A to bring up the alpha entry window. Enter .05 and hit the enter key. The window should now look like this:
Enter G to begin computing. Hit any key to display the results.
Try substituting .02 (a small effect) for ρ² and you will see power shrink to 13%.
So, how many subjects would we need to have an 80% chance of rejecting the null hypothesis if the effect were small and we used the usual .05 criterion of statistical significance? Hit the O key to get the options and then the S key to initiate sample size calculation. K = 6, A = .05, R = .02, P = .8.
Enter G to begin computing. Hit any key to display the results.
G*Power Regression Model
The R2 program is designed for correlation analysis (all variables are random), not regression analysis (Y is random but the predictors are fixed). Under most circumstances you will get similar results from R2 and G*Power. For example, suppose I ask how much power I would have for a large effect, alpha = .05, n = 5, one predictor.
Using G*Power, correlation, point biserial
t tests - Correlation: Point biserial model
Analysis: Post hoc: Compute achieved power
Input: Tail(s) = Two
Effect size |r| = 0.5
α err prob = 0.05
Total sample size = 5
Output: Noncentrality parameter δ = 1.290994
Critical t = 3.182446
Df = 3
Power (1-β err prob) = 0.151938
Equivalently, using G*Power, Multiple regression, omnibus R²:
F tests - Multiple Regression: Omnibus (R² deviation from zero)
Analysis: Post hoc: Compute achieved power
Input: Effect size f² = 0.3333333
α err prob = 0.05
Total sample size = 5
Number of predictors = 1
Output: Noncentrality parameter λ = 1.666667
Critical F = 10.127964
Numerator df = 1
Denominator df = 3
Power (1-β err prob) = 0.151938
Here G*Power uses Cohen's f² effect size statistic, which is R²/(1 − R²). For a ρ of .5, that is .25/.75 = .3333333.
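The power value reported above can be reproduced with scipy's noncentral F distribution. This is a sketch; G*Power's convention of λ = f²·N for this test is assumed, and the function name is mine.

```python
from scipy.stats import f as f_dist, ncf

def regression_power(f2, n, k, alpha=0.05):
    """Power for the omnibus test that R-squared = 0, with k predictors."""
    df1, df2 = k, n - k - 1
    crit = f_dist.ppf(1 - alpha, df1, df2)   # critical F
    return ncf.sf(crit, df1, df2, f2 * n)    # noncentrality lambda = f2 * N

print(round(regression_power(0.25 / 0.75, 5, 1), 4))  # about .1519, as G*Power reports
```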
For a correlation model, the R2 program produces the following result
Return to Wuensch's Statistics Lessons Page
Karl L. Wuensch, Dept. of Psychology, East Carolina University, October, 2011.
G*Power for Change In R² in Multiple Linear Regression
Graduate student Ruchi Patel asked me how to determine how many cases
would be needed to achieve 80% power for detecting the interaction between two
predictors in a multiple linear regression. The interaction term is simply treated as
another predictor. I assumed that she wanted enough data to have 80% power and that
there were only three predictors, X1, X2, and their interaction. Here is the analysis:
Equivalently,
The method immediately above could also be used to determine the number of cases needed to have the desired probability of detecting the increase in R² that accompanies adding to the model a block of two or more predictors.
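For that kind of analysis, G*Power's effect size statistic is f² = (R²_full − R²_reduced)/(1 − R²_full). A quick sketch, with made-up R² values for illustration:

```python
def f2_increase(r2_full, r2_reduced):
    """Cohen's f-squared for the increase in R-squared from an added block."""
    return (r2_full - r2_reduced) / (1 - r2_full)

# Hypothetical: adding a block of predictors raises R-squared from .15 to .25
print(round(f2_increase(0.25, 0.15), 3))  # 0.133
```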
Return to Wuensch's Statistics Lessons Page
PowerAnalysis_Overview.doc
An Overview of Power Analysis
Power is the conditional probability that one will reject the null hypothesis given
that the null hypothesis is really false by a specified amount and given certain other
specifications, such as sample size and criterion of statistical significance (alpha). I
shall introduce power analysis in the context of a one sample test of the mean. After
that I shall move on to statistics more commonly employed.
There are several different sorts of power analyses; see Faul, Erdfelder, Lang, & Buchner (Behavior Research Methods, 2007, 39, 175-191) for descriptions of five types that can be computed using G*Power 3. I shall focus on a priori and a posteriori power analysis.
A Priori Power Analysis. This is an important part of planning research. You determine how many cases you will need to have a good chance of detecting an effect of a specified size with the desired amount of power. See my document Estimating the Sample Size Necessary to Have Enough Power for the number of cases required to have 80% power for common designs.
A Posteriori Power Analysis. Also known as post hoc power analysis. Here you find how much power you would have if you had a specified number of cases. It is a posteriori only in the sense that you provide the number of cases, as if you had already conducted the research. Like a priori power analysis, it is best used in the planning of research; for example, I am planning on obtaining data on 100 cases, and I want to know whether or not that would give me adequate power.
Retrospective Power Analysis. Also known as observed power. There are
several types, but basically this involves answering the following question: If I were to
repeat this research, using the same methods and the same number of cases, and if the
size of the effect in the population was exactly the same as it was in the present
sample, what would be the probability that I would obtain significant results? Many
have demonstrated that this question is foolish, that the answer tells us nothing of value,
and that it has led to much mischief. See this discussion from Edstat-L. I also
recommend that you read Hoenig and Heisey (The American Statistician, 2001, 55, 19-
24). A few key points:
- Some stat packs (SPSS) give you observed power even though it is useless.
- Observed power is perfectly correlated with the value of p; that is, it provides absolutely no new information that you did not already have.
- It is useless to conduct a power analysis AFTER the research has been completed. What you should be doing is calculating confidence intervals for effect sizes.
One Sample Test of Mean
Imagine that we are evaluating the effect of a putative memory-enhancing drug. We have randomly sampled 25 people from a population known to be normally distributed with a μ of 100 and a σ of 15. We administer the drug, wait a reasonable time for it to take effect, and then test our subjects' IQ. Assume that we were so confident in our belief that the drug would either increase IQ or have no effect that we entertained directional hypotheses. Our null hypothesis is that after administering the drug μ ≤ 100; our alternative hypothesis is μ > 100.
These hypotheses must first be converted to exact hypotheses. Converting the null is easy: it becomes μ = 100. The alternative is more troublesome. If we knew that the effect of the drug was to increase IQ by 15 points, our exact alternative hypothesis would be μ = 115, and we could compute power, the probability of correctly rejecting the false null hypothesis given that μ is really equal to 115 after drug treatment, not 100 (normal IQ). But if we already knew how large the effect of the drug was, we would not need to do inferential statistics.
One solution is to decide on a minimum nontrivial effect size. What is the smallest effect that you would consider to be nontrivial? Suppose that you decide that if the drug increases IQ by 2 or more points, then that is a nontrivial effect, but if the mean increase is less than 2 then the effect is trivial.
Now we can test the null of μ = 100 versus the alternative of μ = 102. Let the left curve represent the distribution of sample means if the null hypothesis were true, μ = 100. This sampling distribution has a μ = 100 and a σ_M = 15/√25 = 3. Let the right curve represent the sampling distribution if the exact alternative hypothesis is true, μ = 102. Its μ is 102 and, assuming the drug has no effect on the variance in IQ scores, its σ_M = 15/√25 = 3.
The red area in the upper tail of the null distribution is α. Assume we are using a one-tailed α of .05. How large would a sample mean need be for us to reject the null? Since the upper 5% of a normal distribution extends from 1.645σ above the μ up to positive infinity, the sample mean IQ would need be 100 + 1.645(3) = 104.935 or more to reject the null. What are the chances of getting a sample mean of 104.935 or more if the alternative hypothesis is correct, if the drug increases IQ by 2 points? The area under the alternative curve from 104.935 up to positive infinity represents that probability, which is power. Assuming the alternative hypothesis is true, that μ = 102, the probability of rejecting the null hypothesis is the probability of getting a sample mean of 104.935 or more in a normal distribution with μ = 102, σ = 3. Z = (104.935 − 102)/3 = 0.98, and P(Z > 0.98) = .1635. That is, power is about 16%. If the drug really does increase IQ by an average of 2 points, we have a 16% chance of rejecting the null. If its effect is even larger, we have a greater than 16% chance.
Suppose we consider 5 the minimum nontrivial effect size. This will separate the null and alternative distributions more, decreasing their overlap and increasing power. Now, Z = (104.935 − 105)/3 = −0.02, P(Z > −0.02) = .5080, or about 51%. It is easier to detect large effects than small effects.
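The two one-tailed calculations above can be reproduced with scipy. A minimal sketch (the function name is mine):

```python
from scipy.stats import norm

def one_tailed_power(mu_null, mu_alt, sigma, n, alpha=0.05):
    """Power of a one-tailed, one-sample z test."""
    se = sigma / n ** 0.5
    crit_mean = mu_null + norm.ppf(1 - alpha) * se  # sample mean needed to reject
    return norm.sf((crit_mean - mu_alt) / se)

print(round(one_tailed_power(100, 102, 15, 25), 3))  # about .16
print(round(one_tailed_power(100, 105, 15, 25), 3))  # about .51
```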
Suppose we conduct a 2-tailed test, since the drug could actually decrease IQ; α is now split into both tails of the null distribution, .025 in each tail. We shall reject the null if the sample mean is 1.96 or more standard errors away from the μ of the null distribution. That is, if the mean is 100 + 1.96(3) = 105.88 or more (or if it is 100 − 1.96(3) = 94.12 or less) we reject the null. The probability of that happening if the alternative is correct (μ = 105) is: Z = (105.88 − 105)/3 = .29, P(Z > .29) = .3859, power = about 39%. We can ignore P(Z < (94.12 − 105)/3) = P(Z < −3.63) = very, very small. Note that our power is less than it was with a one-tailed test. If you can correctly predict the direction of effect, a one-tailed test is more powerful than a two-tailed test.
Consider what would happen if you increased sample size to 100. Now the σ_M = 15/√100 = 1.5. With the null and alternative distributions less plump, they should overlap less, increasing power. With σ_M = 1.5, the sample mean will need be 100 + (1.96)(1.5) = 102.94 or more to reject the null. If the drug increases IQ by 5 points, power is: Z = (102.94 − 105)/1.5 = −1.37, P(Z > −1.37) = .9147, or between 91 and 92%. Anything that decreases the standard error will increase power. This may be achieved by increasing the sample size or by reducing the σ of the dependent variable. The σ of the criterion variable may be reduced by reducing the influence of extraneous variables upon the criterion variable (eliminating noise in the criterion variable makes it easier to detect the signal, the grouping variable's effect on the criterion variable).
Now consider what happens if you change α. Let us reduce α to .01. Now the sample mean must be 2.58 or more standard errors from the null μ before we reject the null. That is, 100 + 2.58(1.5) = 103.87. Under the alternative, Z = (103.87 − 105)/1.5 = −0.75, P(Z > −0.75) = 0.7734, or about 77%, less than it was with α at .05, ceteris paribus. Reducing α reduces power.
Please note that all of the above analyses have assumed that we have used a normally distributed test statistic, Z = (M − μ)/σ_M.

d_Diff = d/√(2(1 − r)), where d is the effect size as computed above, with independent samples.
Please note that using the standard error of the difference scores, rather than the standard deviation of the criterion variable, as the denominator of d_Diff is simply a means of incorporating into the analysis the effect of the correlation produced by matching. If we were computing estimated d (Hedges' g) as an estimate of the standardized effect size given the obtained results, we would use the standard deviation of the criterion variable in the denominator, not the standard deviation of the difference scores. I should admit that on rare occasions I have argued that, in a particular research context, it made more sense to use the standard deviation of the difference scores in the denominator of g.
Consider the following a priori power analysis. I am testing the effect of a new drug on performance on a task that involves solving anagrams. I want to have enough power to be able to detect an effect as small as 1/5 of a standard deviation (d = .2) with 95% power. I consider Type I and Type II errors equally serious and am employing a .05 criterion of statistical significance, so I want beta to be not more than .05. I shall use a correlated samples design (within subjects) and two conditions (tested under the influence of the drug and not under the influence of the drug). In previous research I have found the correlation between conditions to be approximately .8.
d_Diff = d/√(2(1 − r)) = .2/√(2(1 − .8)) = .3162.
Use the following settings:
Statistical test: Means: Difference between two dependent means (matched pairs)
Type of power analysis: A priori: Compute required sample size given α, power, and effect size
Tail(s): Two
Effect size dz: .3162
α err prob: 0.05
Power (1-β err prob): .95
Click Calculate. You will find that you need 132 pairs of scores.
Output: Noncentrality parameter δ = 3.632861
Critical t = 1.978239
Df = 131
Total sample size = 132
Actual power = 0.950132
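The same answer can be obtained from the noncentral t distribution in scipy. A sketch; the search-upward-from-small-n approach and the function name are mine.

```python
from scipy.stats import nct, t as t_dist

def pairs_needed(dz, alpha=0.05, power=0.95):
    """Smallest number of pairs for a two-tailed matched-pairs t test."""
    n = 2
    while True:
        df, nc = n - 1, dz * n ** 0.5             # noncentrality = dz * sqrt(n)
        tcrit = t_dist.ppf(1 - alpha / 2, df)
        achieved = nct.sf(tcrit, df, nc) + nct.cdf(-tcrit, df, nc)
        if achieved >= power:
            return n, achieved
        n += 1

n, achieved = pairs_needed(0.3162)
print(n, round(achieved, 4))  # 132 pairs, actual power about .9501
```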
Consider the following a posteriori power analysis. We assume that GRE Verbal
and GRE Quantitative scores are measured on the same metric, and we wish to
determine whether persons intending to major in experimental or developmental
psychology are equally skilled in things verbal and things quantitative. If we employ a
.05 criterion of significance, and if the true size of the effect is 20 GRE points (that was
the actual population difference the last time I checked it, with quantitative > verbal),
what are our chances of obtaining significant results if we have data on 400 persons?
We shall assume that the correlation between verbal and quantitative GRE is .60 (that is
what it was for social science majors the last time I checked). We need to know what
the standard deviation is for the dependent variable, GRE score. The last time I
checked, it was 108 for verbal, 114 for quantitative.
Change type of power analysis to Post hoc. Set the total sample size to 400.
Click on Determine. Select from group parameters. Set the group means to 0 and
20 (or any other two means that differ by 20), the standard deviations to 108 and 114,
and the correlation between groups to .6. Click Calculate in this window to obtain the
effect size dz, .2100539.
Click Calculate and transfer to main window to move the effect size dz to the
main window. Click Calculate in the main window to compute the power. You will see
that you have 98% power.
Type III Errors and Three-Choice Tests
Leventhal and Huynh (Psychological Methods, 1996, 1, 278-292) note that it is
common practice, following rejection of a nondirectional null, to conclude that the
direction of difference in the population is the same as what it is in the sample. This
procedure is what they call a "directional two-tailed test." They also refer to it as a
"three-choice test" (I prefer that language), in that the three hypotheses entertained are:
parameter = null value, parameter < null value, and parameter > null value. This makes
possible a Type III error: correctly rejecting the null hypothesis, but incorrectly inferring
the direction of the effect - for example, when the population value of the tested
parameter is actually more than the null value, getting a sample value that is so much
below the null value that you reject the null and conclude that the population value is
also below the null value. The authors show how to conduct a power analysis that
corrects for the possibility of making a Type III error. See my summary at:
http://core.ecu.edu/psyc/wuenschk/StatHelp/Type_III.htm
Bivariate Correlation/Regression Analysis
Consider the following a priori power analysis. We wish to determine whether or
not there is a correlation between misanthropy and support for animal rights. We shall
measure these attributes with instruments that produce scores for which it is reasonable
to treat the variables as continuous. How many respondents would we need to have a
95% probability of obtaining significant results if we employed a .05 criterion of
significance and if the true value of the correlation (in the population) was 0.2?
Select the following options:
Test family: t tests
Statistical test: Correlation: Point biserial model (that is, a regression analysis)
Type of power analysis: A priori: Compute required sample size given α, power, and effect size
Tail(s): Two
Effect size |r|: .2
α err prob: 0.05
Power (1-β err prob): .95
Click Calculate and you will see that you need 314 cases.
t tests - Correlation: Point biserial model
Analysis: A priori: Compute required sample size
Input: Tail(s) = Two
Effect size |r| = .2
α err prob = 0.05
Power (1-β err prob) = 0.95
Output: Noncentrality parameter δ = 3.617089
Critical t = 1.967596
Df = 312
Total sample size = 314
Actual power = 0.950115
Check out Steiger and Fouladi's R2 program, which will do power analysis (and more) for correlation models, including multiple correlation.
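G*Power's figure of 314 cases can be reproduced with the noncentral t distribution. A sketch; G*Power's noncentrality δ = r√n/√(1 − r²) for this model is assumed, and the function name is mine.

```python
from scipy.stats import nct, t as t_dist

def n_for_correlation(r, alpha=0.05, power=0.95):
    """Smallest n for a two-tailed test of rho = 0 (noncentral t method)."""
    n = 4
    while True:
        df = n - 2
        tcrit = t_dist.ppf(1 - alpha / 2, df)
        nc = r * n ** 0.5 / (1 - r ** 2) ** 0.5   # G*Power's delta
        achieved = nct.sf(tcrit, df, nc) + nct.cdf(-tcrit, df, nc)
        if achieved >= power:
            return n
        n += 1

print(n_for_correlation(0.2))  # 314, as G*Power reports
```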
One-Way ANOVA, Independent Samples
The effect size may be specified in terms of f: f = √[Σ(μ_j − μ)²/k]/σ_error. Cohen considered an f of .10 to represent a small effect, .25 a medium effect, and .40 a large effect. In terms of percentage of variance explained (η²), small is 1%, medium is 6%, and large is 14%.
Suppose that I wish to test the null hypothesis that for GRE-Q, the population means for undergraduates intending to major in social psychology, clinical psychology, and experimental psychology are all equal. I decide that the minimum nontrivial effect size is if each mean differs from the next by 20 points (about 1/5 σ). For example, means of 480, 500, and 520. The sum of the squared deviations between group means and grand mean is then 20² + 0² + 20² = 800. Next we compute f. Assuming that the σ is about 100, f = √[(800/3)/10000] = 0.163. Suppose we have 11 cases in each group.
OK, how many subjects would you need to raise power to 70%? Under Analysis,
select A Priori, under Power enter .70, and click Calculate.
G*Power advises that you need 294 cases, evenly split into three groups, that is,
98 cases per group.
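Both the f of .163 and the 294-case figure can be checked against scipy's noncentral F distribution. A sketch; λ = f²·N (G*Power's convention) is assumed.

```python
from math import sqrt
from scipy.stats import f as f_dist, ncf

means, sigma, k = [480, 500, 520], 100.0, 3   # from the example
grand = sum(means) / k
f_eff = sqrt(sum((m - grand) ** 2 for m in means) / k) / sigma

def anova_power(f_eff, k, n_total, alpha=0.05):
    """Power of the one-way independent-samples ANOVA F test."""
    df1, df2 = k - 1, n_total - k
    crit = f_dist.ppf(1 - alpha, df1, df2)
    return ncf.sf(crit, df1, df2, f_eff ** 2 * n_total)

print(round(f_eff, 3))                        # 0.163
print(round(anova_power(f_eff, 3, 294), 2))   # about .70, as G*Power advises
```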
Analysis of Covariance
If you add one or more covariates to your ANOVA model, and they are well
correlated with the outcome variable, then the error term will be reduced and power will
be increased. The effect of the addition of covariates can be incorporated into the
power analysis in this way:
- Adjusting the effect size statistic, f, such that the adjusted f is f_adj = f/√(1 − r²), where r is the correlation between the covariate (or set of covariates) and the outcome variable.
- Reducing the error df by one for each covariate added to the model.
Consider this example. I am using an ANOVA design to compare three experimental groups. I want to know how many cases I need to detect a small effect (f = .1). G*Power tells me I need 1,548 cases. Ouch, that is a lot of data.
Suppose I find a covariate that I can measure prior to manipulating the experimental variable and which is known to be correlated .7 with the dependent variable. The adjusted f for a small effect increases to f = .1/√(1 − .49) = .14.
Now I only need 792 cases. Do note that the error df here should be 788, not
789, but that one df is not going to make much difference, as shown below.
I used the Generic F Test routine with the noncentrality parameter from the
earlier run, and I dropped the denominator df to 788. The value of the critical F
increased ever so slightly, but the power did not change at all to six decimal places.
Factorial ANOVA, Independent Samples
The analysis is done pretty much the same as it is with a one-way ANOVA.
Suppose we are planning research for which an A x B, 3 x 4 ANOVA would be
appropriate. We want to have enough data to have 80% power for a medium sized
effect. The omnibus analysis will include three F tests one with 2 df in the numerator,
one with 3, and one with 6 (the interaction). We plan on having sample size constant
across cells.
Boot up G*Power and enter the options shown below:
Remember that Cohen suggested .25 as the value of f for a medium-sized effect.
The numerator df for the main effect of Factor A is (3-1)=2. The number of groups here
is the number of cells in the factorial design, 3 x 4 = 12. When you click Calculate you
see that you need a total N of 158. That works out to 13.2 cases per cell, so bump the
N up to 14(12) = 168.
What about Factor B and the interaction? There are (4-1)=3 df for the main effect of Factor B, and when you change the numerator df to 3 and click Calculate again you see that you need an N of 179 to get 80% power for that effect. The
interaction has 2(3)=6 df, and when you change the numerator df to 6 and click
Calculate you see that you need an N of 225 to have 80% power to detect a medium-
sized interaction. With equal sample sizes, that means you need 19 cases per cell, 228
total N.
Clearly you are not going to have the same amount of power for each of the
three effects. If your primary interest was in the main effects, you might go with a total
N that would give you the desired power for main effect but somewhat less than that for
the interaction. If, however, you have reason to expect an interaction, you will go for the
total N of 228. How much power would that give you for the main effects?
As you can see, you would have almost 93% power for A. If you change the
numerator df to 3 you will see that you would have 89.6% power for B.
If you click the Determine button you get a second window which allows you to select the value of f by specifying a value of η² or partial η². Suppose you want to know what f is for an effect that explains only 1% of the total variance. You tell G*Power that the Variance explained by special effect is .01 and the Error variance is .99. Click Calculate and you get an f of .10. Recall that earlier I told you that an f of .10 is equivalent to an η² of .01.
If you wanted to find f for an effect that accounted for 6% of the variance, you
would enter .06 (effect) and .94 (error) and get an f of .25 (a medium-sized effect).
Wait a minute. I have ignored the fact that the error variance in the factorial
ANOVA will be reduced by an amount equal to variance explained by the other factors
in the model, and that will increase power. Suppose that I have entered Factor B into
the model primarily as a categorical covariate. From past research, I have reason to
believe that Factor B will account for about 14% of the total variance (a large effect,
equivalent to an f of .40). I have no idea whether or not the interaction will explain much
variance, so I play it safe and assume it will explain no variance. When I calculate f I
should enter .06 (effect) and .80 (error 1 less .06 for A and another .14 for B).
G*Power gives an f of .27, which I would then use in the power analysis for Factor A.
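The three f values in this discussion follow from the same little formula. A sketch (the function name is mine):

```python
from math import sqrt

def cohens_f(var_effect, var_error):
    """Cohen's f from the proportions of variance entered into G*Power."""
    return sqrt(var_effect / var_error)

print(round(cohens_f(0.01, 0.99), 2))  # about .10
print(round(cohens_f(0.06, 0.94), 2))  # about .25
print(round(cohens_f(0.06, 0.80), 2))  # about .27 (error reduced by Factor B)
```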
ANOVA With Related Factors
The analysis here can be done with G*Power in pretty much the same way
described earlier for independent samples. There are two new parameters that you will
need to provide:
- the value of the correlation between scores at any one level of the related factor and any other level of the repeated factor. Assuming that this correlation is constant across pairs of levels is the sphericity assumption.
- ε -- this is a correction (applied to the degrees of freedom) to adjust for violation of the sphericity assumption. The df are literally multiplied by ε, which has an upper boundary of 1. There are two common ways to estimate ε, one developed by Greenhouse and Geisser, the other by Huynh and Feldt.
Here is the setup for a one-way repeated measures or randomized blocks ANOVA
with four levels of the factor:
We need 36 cases to have 95% power to detect a medium sized effect assuming
no problem with sphericity and a .5 correlation between repeated measures. Let us see
what happens if we have a stronger correlation between repeated measures:
Very nice. I guess your stats prof wasn't kidding when she pointed out the power benefit of having strong correlations across conditions, but what if you have a problem with the sphericity assumption? Let us assume that you suspect (from previous research) that epsilon might be as low as .6.
Notice the reduction in the degrees of freedom and the associated increase in the number of cases needed.
Instead of the traditional univariate approach ANOVA, one can analyze data from
designs with related factors with the newer multivariate approach, which does not
have a sphericity assumption. G*Power will do power analysis for this approach too.
Let us see how many cases we would need with that approach using the same input
parameters as the previous example.
Contingency Table Analysis (Two-Way)
Effect size is computed as w = √[Σ_{i=1..k} (P_1i − P_0i)²/P_0i]. k is the number of cells, P_0i is the population proportion in cell i under the null hypothesis, and P_1i is the population proportion in cell i under the alternative hypothesis. Cohen's benchmarks are:
- .1 is small but not trivial
- .3 is medium
- .5 is large
When the table is 2 x 2, w is identical to φ.
Suppose we are going to employ a 2 x 4 analysis. We shall use the traditional
5% criterion of statistical significance, and we think Type I and Type II errors equally
serious, and, accordingly, we seek to have 95% power for finding an effect that is small
but not trivial. As you see below, you need a lot of data to have a lot of power when
doing contingency table analysis.
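The w formula is easy to compute directly. A sketch, using hypothetical cell proportions of my own choosing:

```python
from math import sqrt

def cohens_w(p_null, p_alt):
    """Cohen's w from null and alternative cell proportions."""
    return sqrt(sum((p1 - p0) ** 2 / p0 for p0, p1 in zip(p_null, p_alt)))

# Hypothetical 2 x 2 table: equiprobable cells under the null
print(round(cohens_w([.25, .25, .25, .25], [.30, .20, .30, .20]), 2))  # 0.2
```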
MANOVA and DFA
Each effect will have as many roots (discriminant functions, canonical variates,
weighted linear combinations of the outcome variables) as it has treatment degrees of
freedom, or it will have as many roots as there are outcome variables, whichever is
fewer. The weights maximize the ratio SS_among_groups/SS_within_groups. If you were to compute, for each case, a canonical variate score and then conduct an ANOVA comparing the groups on that canonical variate, you would get the sums of squares in the ratio above. This ratio is called the eigenvalue (λ).
Theta is defined as θ = λ/(1 + λ).
One way to describe the association between two variables is to assume that the
value of the one variable is a linear function of the value of the other variable. If this
relationship is perfect, then it can be described by the slope-intercept equation for a
straight line, Y = a + bX. Even if the relationship is not perfect, one may be able to
describe it as nonperfect linear.
Distinction Between Correlation and Regression
Correlation and regression are very closely related topics. Technically, if the X
variable (often called the independent variable, even in nonexperimental research) is
fixed, that is, if it includes all of the values of X to which the researcher wants to
generalize the results, and the probability distribution of the values of X matches that in
the population of interest, then the analysis is a regression analysis. If both the X and
the Y variable (often called the dependent variable, even in nonexperimental research)
are random, free to vary (were the research repeated, different values and sample
probability distributions of X and Y would be obtained), then the analysis is a
correlation analysis. For example, suppose I decide to study the correlation between dose of alcohol (X) and reaction time. If I arbitrarily decide to use as values of X doses of 0, 1, 2, and 3 ounces of 190 proof grain alcohol and restrict X to those values, and have equal numbers of subjects at each level of X, then I've fixed X and do a regression analysis. If I allow X to vary randomly, for example, I recruit subjects from a local bar, measure their blood alcohol (X), and then test their reaction time, then a correlation analysis is appropriate.
In actual practice, when one is using linear models to develop a way to predict Y
given X, the typical behavioral researcher is likely to say she is doing regression
analysis. If she is using linear models to measure the degree of association between X
and Y, she says she is doing correlation analysis.
Scatter Plots
One way to describe a bivariate association is to prepare a scatter plot, a plot of
all the known paired X,Y values (dots) in Cartesian space. X is traditionally plotted on
the horizontal dimension (the abscissa) and Y on the vertical (the ordinate).
If all the dots fall on a straight line with a positive slope, the relationship is
perfect positive linear. Every time X goes up one unit, Y goes up b units. If all dots
fall on a negatively sloped line, the relationship is perfect negative linear.
COV(X, Y) = SSCP/(N − 1)
A major problem with COV is that it is affected not only by degree of linear
relationship between X and Y but also by the standard deviations in X and in Y. In fact,
the maximum absolute value of COV(X,Y) is the product σ_x σ_y. Imagine that you and I
each measured the height and weight of individuals in our class and then computed the
covariance between height and weight. You use inches and pounds, but I use miles
and tons. Your numbers would be much larger than mine, so your covariance would be
larger than mine, but the strength of the relationship between height and weight should
be the same for both of our data sets. We need to standardize the unit of measure of
our variables.
Pearson r
We can get a standardized index of the degree of linear association by dividing COV by the two standard deviations, removing the effect of the two univariate standard deviations. This index is called the Pearson product moment correlation coefficient, r for short, and is defined as r = COV(X, Y)/(s_x s_y) = 4/[1.581(3.162)] = .80. Pearson r may also be defined as a mean, r = ΣZ_x Z_y/N, where the Z-scores are computed using population standard deviations, √(SS/n).
Pearson r may also be computed as r = SSCP/√(SS_x SS_y) = 16/√[(10)(40)] = .8.
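Both routes to r can be verified from the example's summary statistics (SSCP = 16, SS_x = 10, SS_y = 40, n = 5). A sketch:

```python
from math import sqrt

sscp, ss_x, ss_y, n = 16, 10, 40, 5
cov = sscp / (n - 1)                                    # 4.0
s_x, s_y = sqrt(ss_x / (n - 1)), sqrt(ss_y / (n - 1))   # 1.581, 3.162
r_from_cov = cov / (s_x * s_y)
r_from_ss = sscp / sqrt(ss_x * ss_y)
print(round(r_from_cov, 2), round(r_from_ss, 2))  # 0.8 0.8
```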
Pearson r will vary from -1 to 0 to +1. If r = +1 the relationship is perfect positive,
and every pair of X,Y scores has Zx = Zy. If r = 0, there is no linear relationship. If r =
-1, the relationship is perfect negative and every pair of X,Y scores has Zx = -Zy.
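As an illustration (Python, not part of the original handout), the three formulas for r can be checked numerically with the burger/beer data used later in this handout (X = burgers, Y = beers):

```python
import math

# Burger (X) and beer (Y) data used throughout this handout
X = [5, 4, 3, 2, 1]
Y = [8, 10, 4, 6, 2]
n = len(X)
mx, my = sum(X) / n, sum(Y) / n

# Sum of cross products of deviations, and sums of squares
SSCP = sum((x - mx) * (y - my) for x, y in zip(X, Y))   # 16
SSx = sum((x - mx) ** 2 for x in X)                     # 10
SSy = sum((y - my) ** 2 for y in Y)                     # 40

# 1. r = COV / (sx * sy), using sample (n - 1) statistics
cov = SSCP / (n - 1)
sx, sy = math.sqrt(SSx / (n - 1)), math.sqrt(SSy / (n - 1))
r1 = cov / (sx * sy)

# 2. r = mean of Zx * Zy, with Z-scores from population (n) SDs
zx = [(x - mx) / math.sqrt(SSx / n) for x in X]
zy = [(y - my) / math.sqrt(SSy / n) for y in Y]
r2 = sum(a * b for a, b in zip(zx, zy)) / n

# 3. r = SSCP / sqrt(SSx * SSy)
r3 = SSCP / math.sqrt(SSx * SSy)

print(round(r1, 3), round(r2, 3), round(r3, 3))  # 0.8 0.8 0.8
```

All three routes give the same r = .8, as they must.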
Sample r is a Biased Estimator: It Underestimates ρ
E(r) ≈ ρ - ρ(1 - ρ²) / (2(n - 1)). If the ρ were .5 and the sample size 10, the expected value of r
would be about .5 - .5(1 - .5²) / (2(9)) = .479. An early approximately unbiased estimator is
est. ρ = r[1 + (1 - r²) / (2n)]. Since then there have been several other approximately
unbiased estimators. In 1958, Olkin and Pratt proposed an even less biased estimator,
est. ρ = r[1 + (1 - r²) / (2(n - 4))]. For our correlation, the Olkin & Pratt estimator has a value of
.8[1 + (1 - .8²) / (2(5 - 4))] = .944.
There are estimators even less biased than the Olkin & Pratt estimator, but I do
not recommend them because of the complexity of calculating them and because the
bias in the Olkin & Pratt estimator is already so small. For more details, see Shieh
(2010) and Zimmerman, Zumbo, & Williams (2003).
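These bias and correction formulas are easy to verify numerically; a small Python sketch (illustrative only, function names are mine):

```python
def expected_r(rho, n):
    """Approximate expected value of sample r; it underestimates rho."""
    return rho - rho * (1 - rho ** 2) / (2 * (n - 1))

def olkin_pratt(r, n):
    """Olkin & Pratt (1958) approximately unbiased estimator of rho."""
    return r * (1 + (1 - r ** 2) / (2 * (n - 4)))

print(round(expected_r(0.5, 10), 3))  # 0.479
print(round(olkin_pratt(0.8, 5), 3))  # 0.944
```

Both values match the worked examples above.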
Sample r² is a Biased Estimator: It Overestimates ρ²
E(r²) = 1 - (n - 2)(1 - ρ²) / (n - 1). If the ρ² were .25 (or any other value) and the sample
size 2, the expected value of r² would be 1 - (2 - 2)(1 - .25) / (2 - 1) = 1. See my document
What is R² When N = p + 1 (and df = 0)?
If the ρ² were .25 and the sample size 10, the expected value of r² would be
E(r²) = 1 - (10 - 2)(1 - .25) / (10 - 1) = .333. If the ρ² were .25 and the sample size 100, the
expected value of r² would be E(r²) = 1 - (100 - 2)(1 - .25) / (100 - 1) = .258. As you can see, the
bias decreases with increasing sample size.
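The shrinking bias is easy to see by evaluating the expectation formula at a few sample sizes (an illustrative Python sketch, not from the original handout):

```python
def expected_r2(rho2, n):
    """Expected value of sample r-squared; it overestimates rho-squared."""
    return 1 - (n - 2) * (1 - rho2) / (n - 1)

for n in (2, 10, 100):
    print(n, round(expected_r2(0.25, n), 3))
# 2   1.0
# 10  0.333
# 100 0.258
```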
For a relatively unbiased estimate of the population ρ², compute the shrunken r²,
1 - (1 - r²)(n - 1) / (n - 2) = 1 - (1 - .64)(4) / 3 = .52 for our data.
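The shrunken r² computation can be sketched in a couple of lines (illustrative Python, not from the original handout):

```python
def shrunken_r2(r2, n):
    """Shrunken (adjusted) r-squared for a bivariate correlation."""
    return 1 - (1 - r2) * (n - 1) / (n - 2)

# Beer/burger data: r = .8, so r-squared = .64, with n = 5
print(round(shrunken_r2(0.64, 5), 2))  # 0.52
```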
Factors Which Can Affect the Size of r
Range restrictions. If the range of X is restricted, r will usually fall (it can rise if
X and Y are related in a curvilinear fashion and a linear correlation coefficient has
inappropriately been used). This is very important when interpreting criterion-related
validity studies, such as one correlating entrance exam scores with grades after
entrance.
Extraneous variance. Anything causing variance in Y but not in X will tend to
reduce the correlation between X and Y. For example, with a homogeneous set of
subjects all run under highly controlled conditions, the r between alcohol intake and
reaction time might be +0.95, but if subjects were very heterogeneous and testing
conditions variable, r might be only +0.50. Alcohol might still have just as strong an
effect on reaction time, but the effects of many other extraneous variables (such as
sex, age, health, time of day, day of week, etc.) upon reaction time would dilute the
apparent effect of alcohol as measured by r.
Interactions. It is also possible that the extraneous variables might interact
with X in determining Y. That is, X might have one effect on Y if Z = 1 and a different
effect if Z = 2. For example, among experienced drinkers (Z = 1), alcohol might affect
reaction time less than among novice drinkers (Z = 2). If such an interaction is not
taken into account by the statistical analysis (a topic beyond the scope of this course),
the r will likely be smaller than it otherwise would be.
Assumptions of Correlation Analysis
There are no assumptions if you are simply using the correlation coefficient to
describe the strength of linear association between X and Y in your sample. If,
however, you wish to use t or F to test hypothesis about or place a confidence interval
about your estimate of , there are assumptions.
Bivariate Normality
It is assumed that the joint distribution of X,Y is bivariate normal. To see what
such a distribution looks like, try the Java Applet at
http://ucs.kuleuven.be/java/version2.0/Applet030.html . Use the controls to change
various parameters and rotate the plot in three-dimensional space.
In a bivariate normal distribution the following will be true:
1. The marginal distribution of Y ignoring X will be normal.
2. The marginal distribution of X ignoring Y will be normal.
3. Every conditional distribution of Y|X will be normal.
4. Every conditional distribution of X|Y will be normal.
Homoscedasticity
1. The variance in the conditional distributions of Y|X is constant across values of X.
2. The variance in the conditional distributions of X|Y is constant across values of Y.
Testing H0: ρ = 0
If we have X,Y data sampled randomly from some bivariate population of
interest, we may wish to test H0: ρ = 0 with t = r√(n - 2) / √(1 - r²), with df = n - 2.
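For the beer/burger data, this t works out as follows (an illustrative Python sketch, not part of the original handout); with df = 3, this t falls short of the .05 criterion, consistent with the p = .10 reported in the APA summary later:

```python
import math

r, n = 0.8, 5
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)  # df = n - 2 = 3
print(round(t, 3))  # 2.309
```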
You should remember that we used this formula earlier to demonstrate that the
independent samples t test is just a special case of a correlation analysis: if one of the
variables is dichotomous and the other continuous, computing the (point biserial) r and
testing its significance is absolutely equivalent to conducting an independent samples t
test. Keep this in mind when someone tells you that you can make causal inferences
from the results of a t test but not from the results of a correlation analysis; the two are
mathematically identical, so it does not matter which analysis you did. What does
matter is how the data were collected. If they were collected in an experimental manner
(manipulating the independent variable) with adequate control of extraneous variables,
you can make a causal inference. If they were gathered in a nonexperimental manner,
you cannot.
Putting a Confidence Interval on R or R²
It is a good idea to place a confidence interval around the sample value of r or r²,
but it is tedious to compute by hand. Fortunately, there is now available a free program
for constructing such confidence intervals. Please read my document Putting
Confidence Intervals on R² or R.
For our beer and burger data, a 95% confidence interval for r extends from -.28
to .99.
APA-Style Summary Statement
For our beer and burger data, our APA summary statement could read like this:
The correlation between my friends' burger consumption and their beer
consumption fell short of statistical significance, r(n = 5) = .8, p = .10,
95% CI [-.28, .99]. For some strange reason, the value of the computed t is not
generally given when reporting a test of the significance of a correlation coefficient. You
might want to warn your readers that a Type II error is quite likely here, given the small
sample size. Were the result significant, your summary statement might read
something like this: Among my friends, burger consumption was significantly
related to beer consumption, ..........
Power Analysis
Power analysis for r is exceptionally simple: δ = ρ√(n - 1), assuming that df are
large enough for t to be approximately normal. Cohen's benchmarks for effect sizes for
r are: .10 is small but not necessarily trivial, .30 is medium, and .50 is large (Cohen, J.
A Power Primer, Psychological Bulletin, 1992, 112, 155-159).
For our burger-beer data, how much power would we have if the effect size was
large in the population, that is, ρ = .50? δ = .5√4 = 1.00. From our power table, using
the traditional .05 criterion of significance, we then see that power is 17%. As stated
earlier, a Type II error is quite likely here. How many subjects would we need to have
95% power to detect even a small effect? Lots: n = (δ/ρ)² + 1 = (3.6/.1)² + 1 = 1297. That is a lot of
burgers and beer! See the document R2 Power Analysis.
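The two power computations above can be sketched as follows (illustrative Python, not part of the original handout; the δ = 3.60 needed for 95% power at α = .05 comes from a power table):

```python
import math

# delta for the burger-beer data if the population effect is large
rho, n = 0.5, 5
delta = rho * math.sqrt(n - 1)
print(delta)  # 1.0

# n needed for 95% power to detect a small effect (rho = .10):
# delta = 3.60 is taken from a power table for alpha = .05.
needed = round((3.60 / 0.10) ** 2 + 1)
print(needed)  # 1297
```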
Correcting for Measurement Error in Bivariate Linear Correlations
The following draws upon the material presented in this article:
Schmidt, F. L., & Hunter, J. E. (1996). Measurement error in psychological research:
Lessons from 26 research scenarios. Psychological Methods, 1, 199-223.
When one is using observed variables to estimate the correlation between the
underlying constructs which these observed variables measure, one should correct the
correlation between the observed variables for attenuation due to measurement error.
Such a correction will give you an estimate of what the correlation is between the two
constructs (underlying variables), that is, what the correlation would be if we able to
measure the two constructs without measurement error.
Measurement error results in less than perfect values for the reliability of an
instrument. To correct for the attenuation resulting from such lack of perfect reliability,
one can apply the following correction: rXtYt = rXY / √(rXX·rYY), where
rXtYt is our estimate of the correlation between the constructs, corrected for attenuation,
rXY is the observed correlation between X and Y in our sample,
rXX is the reliability of variable X, and
rYY is the reliability of variable Y.
Here is an example from my own research:
I obtained the correlation between misanthropy and attitude towards animals for
two groups, idealists (for whom I predicted there would be only a weak correlation) and
nonidealists (for whom I predicted a stronger correlation). The observed correlation was
.02 for the idealists, .36 for the nonidealists. The reliability (Cronbach alpha) was .91 for the
attitude towards animals instrument (which had 28 items) but only .66 for the
misanthropy instrument (not surprising, given that it had only 5 items). When we correct
the observed correlation for the nonidealists, we obtain rXtYt = .36 / √(.66(.91)) = .46, a much
more impressive correlation. When we correct the correlation for the idealists, the
corrected r is only .03.
I should add that Cronbach's alpha underestimates a test's reliability, so this
correction is an over-correction. It is preferable to use maximized lambda4 as the
estimate of reliability. Using lambda4 estimates of reliability, the corrected r is
rXtYt = .36 / √(.78(.93)) = .42.
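The correction for attenuation can be sketched in a few lines (illustrative Python, not from the original handout):

```python
import math

def disattenuate(r_xy, r_xx, r_yy):
    """Correct an observed correlation for attenuation due to unreliability."""
    return r_xy / math.sqrt(r_xx * r_yy)

# Nonidealists: observed r = .36, Cronbach alpha reliabilities .66 and .91
print(round(disattenuate(0.36, 0.66, 0.91), 2))  # 0.46
# Same observed r with maximized lambda4 reliabilities .78 and .93
print(round(disattenuate(0.36, 0.78, 0.93), 2))  # 0.42
```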
Testing Other Hypotheses
H0: ρ1 = ρ2
One may also test the null hypothesis that the correlation between X and Y in
one population is the same as the correlation between X and Y in another population.
See our textbook for the statistical procedures. One interesting and controversial
application of this test is testing the null hypothesis that the correlation between IQ and
Grades in school is the same for Blacks as it is for Whites. Poteat, Wuensch, and
Gregg (1988, Journal of School Psychology: 26, 59-68) were not able to reject that null
hypothesis.
H0: ρWX = ρWY
If you wish to compare the correlation between one pair of variables with that
between a second, overlapping pair of variables (for example, when comparing the
correlation between one IQ test and grades with the correlation between a second IQ
test and grades), use Williams' procedure explained in our textbook or use Hotelling's
more traditional solution, available from Wuensch and elsewhere. It is assumed that the
correlations for both pairs of variables have been computed on the same set of
subjects. Should you get seriously interested in this sort of analysis, consult this
reference: Meng, Rosenthal, & Rubin (1992). Comparing correlated correlation
coefficients. Psychological Bulletin, 111, 172-175.
H0: ρWX = ρYZ
If you wish to compare the correlation between one pair of variables with that
between a second (nonoverlapping) pair of variables, read the article by T. E.
Raghunathan, R. Rosenthal, and D. B. Rubin (Comparing correlated but
nonoverlapping correlations, Psychological Methods, 1996, 1, 178-183).
H0: ρ = nonzero value
Our textbook also shows how to test the null hypothesis that a correlation has a
particular value (not necessarily zero) and how to place confidence limits on our
estimation of a correlation coefficient. For example, we might wish to test the null
hypothesis that in grad. school the r between IQ and Grades is +0.5 (the value most
often reported for this correlation in primary and secondary schools) and then put 95%
confidence limits on our estimation of the population ρ.
Please note that these procedures require the same assumptions made for
testing the null hypothesis that the ρ is zero. There are, however, no assumptions
necessary to use r as a descriptive statistic, to describe the strength of linear
association between X and Y in the data you have.
Spearman rho
When one's data are ranks, one may compute the Spearman correlation for
ranked data, also called the Spearman ρ, which is computed and significance-tested
exactly as is Pearson r (if n < 10, find a special table for testing the significance of the
Spearman ρ). The Spearman ρ measures the linear association between pairs of
ranks. If one's data are not ranks, but one converts the raw data into ranks prior to
computing the correlation coefficient, the Spearman ρ measures the degree of
monotonicity between the original variables. If every time X goes up, Y goes up (the
slope of the line relating X to Y is always positive) there is a perfect positive monotonic
relationship, but not necessarily a perfect linear relationship (for which the slope would
have to be constant). Consider the following data:
X 1.0 1.9 2.0 2.9 3.0 3.1 4.0 4.1 5
Y 10 99 100 999 1,000 1,001 10,000 10,001 100,000
You should run the program Spearman.sas on my SAS Programs web page. It
takes these data and transforms them into ranks and then prints out the new data. The
first page of output shows the original data, the ranked data, and also the Y variable
after a base 10 log transformation. A plot of the raw data shows a monotonic but
distinctly nonlinear relationship. A plot of X by the log of Y shows a nearly perfect linear
relationship. A plot of the ranks shows a perfect relationship. PROC CORR is then used
to compute Pearson, Spearman, and Kendall tau correlation coefficients.
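The handout's SAS program does this with PROC CORR; the same idea can be sketched in Python (illustrative only, not part of the original handout). Because Y rises every time X rises, the ranks agree perfectly and the Spearman correlation is exactly 1, while the Pearson correlation on the raw data is below 1:

```python
def ranks(values):
    """Rank the values 1..n (no ties in these data)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def pearson(X, Y):
    n = len(X)
    mx, my = sum(X) / n, sum(Y) / n
    sscp = sum((x - mx) * (y - my) for x, y in zip(X, Y))
    ssx = sum((x - mx) ** 2 for x in X)
    ssy = sum((y - my) ** 2 for y in Y)
    return sscp / (ssx * ssy) ** 0.5

X = [1.0, 1.9, 2.0, 2.9, 3.0, 3.1, 4.0, 4.1, 5]
Y = [10, 99, 100, 999, 1000, 1001, 10000, 10001, 100000]

# Pearson on the raw data: high, but less than 1 (the relation is nonlinear)
print(round(pearson(X, Y), 3))
# Pearson on the ranks, i.e., Spearman rho: exactly 1
print(pearson(ranks(X), ranks(Y)))  # 1.0
```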
How Do Behavioral Scientists Use Correlation Analyses?
1. to measure the linear association between two variables without establishing
any cause-effect relationship.
2. as a necessary (and suggestive) but not sufficient condition to establish
causality. If changing X causes Y to change, then X and Y must be correlated (but the
correlation is not necessarily linear). X and Y may, however, be correlated without X
causing Y. It may be that Y causes X. Maybe increasing Z causes increases in both X
and Y, producing a correlation between X and Y with no cause-effect relationship
between X and Y. For example, smoking cigarettes is well known to be correlated with
health problems in humans, but we cannot do experimental research on the effect of
smoking upon humans' health. Experimental research with rats has shown a causal
relationship, but we are not rats. One alternative explanation of the correlation between
smoking and health problems in humans is that there is a third variable, or constellation
of variables (genetic disposition or personality), that is causally related to both smoking
and development of health problems. That is, if you have this disposition, it causes you
to smoke and it causes you to have health problems, creating a spurious correlation
between smoking and health problems; but the disposition that caused the smoking
would have caused the health problems whether or not you smoked. No, I do not
believe this model, but the data on humans cannot rule it out.
As another example of a third variable problem, consider the strike by PATCO,
the union of air traffic controllers back during the Reagan years. The union cited
statistics that air traffic controllers had much higher than normal incidence of
stress-related illnesses (hypertension, heart attacks, drug abuse, suicide, divorce, etc.). They
said that this was caused by the stress of the job, and demanded better benefits to deal
with the stress, no mandatory overtime, rotation between high stress and low stress job
positions, etc. The government crushed the strike (fired all controllers), invoking a third
variable explanation of the observed correlation between working in air traffic control
and these illnesses. They said that the air traffic controller profession attracted persons
of a certain disposition (Type A individuals, who are perfectionists who seem always to
be under time pressure), and these individuals would get those illnesses whether they
worked in air traffic or not. Accordingly, the government said, the problem was the fault
of the individuals, not the job. Maybe the government would prefer that we hire only
Type B controllers (folks who take it easy and don't get so upset when they see two
blips converging on the radar screen)!
3. to establish an instrument's reliability: a reliable instrument is one which will
produce about the same measurements when the same objects are measured
repeatedly, in which case the scores at one time should be well correlated with the
scores at another time (and have equivalent means and variances as well).
4. to establish an instrument's (criterion-related) validity: a valid instrument is
one which measures what it says it measures. One way to establish such validity is to
show that there is a strong positive correlation between scores on the instrument and an
independent measure of the attribute being measured. For example, the Scholastic
Aptitude Test was designed to measure individuals' ability to do well in college.
Showing that scores on this test are well correlated with grades in college establishes
the test's validity.
5. to do independent groups t-tests: if the X variable, groups, is coded 0,1 (or
any other two numbers) and we obtain the r between X and Y, a significance-test of the
hypothesis that = 0 will yield exactly the same t and p as the traditional pooled-
variances independent groups t-test. In other words, the independent groups t-test is
just a special case of correlation analysis, where the X variable is dichotomous and the
Y variable is normally distributed. The r is called a point-biserial r. It can also be
shown that the 2 x 2 Pearson Chi-square test is a special case of r. When both X and Y
are dichotomous, the r is called phi (φ).
6. One can measure the correlation between Y and an optimally weighted set of
two or more Xs. Such a correlation is called a multiple correlation. A model with
multiple predictors might well predict a criterion variable better than would a model with
just a single predictor variable. Consider the research reported by McCammon, Golden,
and Wuensch in the Journal of Research in Science Education, 1988, 25, 501-510.
Subjects were students in freshman and sophomore level Physics courses (only those
courses that were designed for science majors, no general education <football physics>
courses). The mission was to develop a model to predict performance in the course.
The predictor variables were CT (the Watson-Glaser Critical Thinking Appraisal), PMA
(Thurstone's Primary Mental Abilities Test), ARI (the College Entrance Exam Board's
Arithmetic Skills Test), ALG (the College Entrance Exam Board's Elementary Algebra
Skills Test), and ANX (the Mathematics Anxiety Rating Scale). The criterion variable
was subjects' scores on course examinations. Our results indicated that we could
predict performance in the physics classes much better with a combination of these
predictors than with just any one of them. At Susan McCammon's insistence, I also
separately analyzed the data from female and male students. Much to my surprise I
found a remarkable sex difference. Among female students every one of the predictors
was significantly related to the criterion; among male students none of the predictors
was. A posteriori searching of the literature revealed that Anastasi (Psychological
Testing, 1982) had noted a relatively consistent finding of sex differences in the
predictability of academic grades, possibly due to women being more conforming and
more accepting of academic standards (better students), so that women put maximal
effort into their studies, whether or not they like the course, and accordingly they work up
to their potential. Men, on the other hand, may be more fickle, putting forth maximum
effort only if they like the course, thus making it difficult to predict their performance
solely from measures of ability.
ANOVA, which we shall cover later, can be shown to be a special case of
multiple correlation/regression analysis.
7. One can measure the correlation between an optimally weighted set of Ys
and an optimally weighted set of Xs. Such an analysis is called canonical correlation,
and almost all inferential statistics in common use can be shown to be special cases of
canonical correlation analysis. As an example of a canonical correlation, consider the
research reported by Patel, Long, McCammon, & Wuensch (Journal of Interpersonal
Violence, 1995, 10, 354-366). We had two sets of data on a group of male
college students. The one set was personality variables from the MMPI. One of these
was the PD (psychopathically deviant) scale, Scale 4, on which high scores are
associated with general social maladjustment and hostility. The second was the MF
(masculinity/femininity) scale, Scale 5, on which low scores are associated with
stereotypical masculinity. The third was the MA (hypomania) scale, Scale 9, on which
high scores are associated with overactivity, flight of ideas, low frustration tolerance,
narcissism, irritability, restlessness, hostility, and difficulty with controlling impulses.
The fourth MMPI variable was Scale K, which is a validity scale on which high scores
indicate that the subject is clinically defensive, attempting to present himself in a
favorable light, and low scores indicate that the subject is unusually frank. The second
set of variables was a pair of homonegativity variables. One was the IAH (Index of
Attitudes Towards Homosexuals), designed to measure affective components of
homophobia. The second was the SBS, (Self-Report of Behavior Scale), designed to
measure past aggressive behavior towards homosexuals, an instrument specifically
developed for this study.
Our results indicated that high scores on the SBS and the IAH were associated
with stereotypical masculinity (low Scale 5), frankness (low Scale K), impulsivity (high
Scale 9), and general social maladjustment and hostility (high Scale 4). A second
relationship found showed that having a low IAH but high SBS (not being homophobic
but nevertheless aggressing against gays) was associated with being high on Scales 5
(not being stereotypically masculine) and 9 (impulsivity). This relationship seems to
reflect a general (not directed towards homosexuals) aggressiveness; in the words of
one of my graduate students, being "an equal opportunity bully."
Links: all recommended reading (in other words, know it for the test)
Biserial and Polychoric Correlation Coefficients
Comparing Correlation Coefficients, Slopes, and Intercepts
Confidence Intervals on R² or R
Contingency Tables with Ordinal Variables
Correlation and Causation
Cronbach's Alpha and Maximized Lambda4
Inter-Rater Agreement
Phi
Residuals Plots -- how to make them and interpret them
Tetrachoric Correlation -- what it is and how to compute it.
Shieh, G. (2010). Estimation of the simple correlation coefficient. Behavior Research
Methods, 42, 906-917.
Zimmerman, D. W., Zumbo, B. D., & Williams, R. H. (2003). Bias in estimation
and hypothesis testing of correlation. Psicológica, 24, 133-158.
Copyright 2011, Karl L. Wuensch - All rights reserved.
Regr6430.docx
Bivariate Linear Regression
Bivariate Linear Regression analysis involves finding the best fitting straight line to describe a
set of bivariate data. It is based upon the linear model, Y = a + bX + e. That is, every Y score is
made up of two components: a + bX, which is the linear effect of X upon Y, the value of Y given X if
X and Y were perfectly correlated in a linear fashion, and e, which stands for error. Error is simply
the difference between the actual value of Y and that value we would predict from the best fitting
straight line. That is, e = Y - Ŷ.
Sources of Error
There are three basic sources that contribute to the error term, e.
Error in the measurement of X and or Y or in the manipulation of X (experimental research).
The influence upon Y of variables other than X (extraneous variables), including variables
that interact with X.
Any nonlinear influence of X upon Y.
The Regression Line
The best fitting straight line, or regression line, is Ŷ = a + bX.
If r² is less than one (a nonperfect correlation), then predicted Zy regresses towards the
mean relative to Zx. Regression towards the mean refers to the fact that predicted Y will be closer
to the mean on Y than is known X to the mean on X, unless the relationship between X and Y is
perfect linear.
r      Zx    Predicted Zy
1.00   2     2: Just as far (2 sd) from mean Y as X is from mean X. No regression towards the mean.
0.50   2     1: Closer to mean Y (1 sd) than X is to mean X
0.00   2     0: Regression all the way to the mean
-0.50  2    -1: Closer to mean Y (1 sd) than X is to mean X
-1.00  2    -2: Just as far (2 sd) from mean Y as X is from mean X. No regression towards the mean.
The criterion used to find the best fitting straight line is the least squares criterion. That is,
we find a and b such that Σ(Y - Ŷ)² is as small as possible. The slope is
b = SSCP / SSx = r(sy / sx); for our data, b = 16/10 = 1.6.
Subject  Burgers, X  Beers, Y    Ŷ     (Y - Ŷ)  (Y - Ŷ)²  (Ŷ - My)²
1            5          8       9.2     -1.2     1.44     10.24
2            4         10       7.6      2.4     5.76      2.56
3            3          4       6.0     -2.0     4.00      0.00
4            2          6       4.4      1.6     2.56      2.56
5            1          2       2.8     -0.8     0.64     10.24
Sum         15         30      30.0      0.0    14.40     25.60
Mean         3          6
St. Dev. 1.581      3.162
The total sum of squares for Y is SSY = 4(3.162)² = 40.
Please note that if sy = sx, as would be the case if we changed all scores on Y and X to z-scores,
then r = b. This provides a very useful interpretation of r. Pearson r is the number of
standard deviations that predicted Y changes for each one standard deviation change in X. That is,
on the average, and in sd units, how much does Y change per sd change in X.
We compute the intercept, a, the predicted value of Y when X = 0, with a = My - b·Mx. For our
data, that is a = 6 - 1.6(3) = 1.2.
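The slope, intercept, and predicted values in the table above can be checked numerically (an illustrative Python sketch, not part of the original handout):

```python
X = [5, 4, 3, 2, 1]   # burgers
Y = [8, 10, 4, 6, 2]  # beers
n = len(X)
mx, my = sum(X) / n, sum(Y) / n

SSCP = sum((x - mx) * (y - my) for x, y in zip(X, Y))  # 16
SSx = sum((x - mx) ** 2 for x in X)                    # 10

b = SSCP / SSx   # slope: 16/10 = 1.6
a = my - b * mx  # intercept: 6 - 1.6(3) = 1.2
yhat = [round(a + b * x, 1) for x in X]
print(b, round(a, 3))  # 1.6 1.2
print(yhat)            # [9.2, 7.6, 6.0, 4.4, 2.8]
```

These are exactly the Ŷ values in the table above.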
There are two different linear regression lines that we could compute with a set of bivariate
data. The one I have already presented is for predicting Y from X. We could also find the least
squares regression line for predicting X from Y. This will usually not be the same line used for
predicting Y from X, since the line that minimizes Σ(Y - Ŷ)² is generally not the same line
that minimizes Σ(X - X̂)², unless r² equals one. We can find the regression line for predicting X from Y using the same
formulas we used for finding the line to predict Y from X, but we must substitute X for Y and Y for X in
the formulas. The a and b for intercept and slope may be subscripted with y.x to indicate Y predicted
from X or with x.y to indicate X predicted from Y.
The two regression lines are coincident (are the same line) only when the correlation is perfect
(r is +1.00 or -1.00). They always intersect at the point (Mx, My). When r = 0.00, the two regression
lines are Ŷ = My and X̂ = Mx, and the error sum of squares is SSE = Σ(Y - Ŷ)² = (1 - r²)SSy. Since SSE
would then = SSy, all of the variance in Y would be error variance. If r is nonzero we can do better
than just always predicting My. The mean square error is MSE = SSE / (n - 2), and the standard
error of estimate is sY.X = √MSE. For our beer-burger data, where the
total SS for Y was 40, SSE = (1 - .64)(40) = 14.4, MSE = 14.4/3 = 4.8, and the standard error of
estimate is √4.8 = 2.191.
SSE represents the SS in Y not due to the linear association between X and Y. SSregr, the
regression sum of squares, represents the SS in Y that is due to that linear association:
SSregr = Σ(Ŷ - My)² = SSy - SSE = r²·SSy, and SSy = SSregr + SSE.
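The partitioning of SSy into regression and error components can be verified numerically (illustrative Python, not from the original handout):

```python
import math

SSy, r2, n = 40, 0.8 ** 2, 5  # beer-burger data

SSE = (1 - r2) * SSy     # error SS: 14.4
MSE = SSE / (n - 2)      # mean square error: 4.8
se_est = math.sqrt(MSE)  # standard error of estimate: about 2.191
SS_regr = r2 * SSy       # regression SS: 25.6

print(round(SSE, 1), round(MSE, 1), round(se_est, 3), round(SS_regr, 1))
# 14.4 4.8 2.191 25.6
```

Note that SSE + SSregr = 14.4 + 25.6 = 40 = SSy, as the identity requires.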
2. If we wish to estimate an individual value of Y given X we need to widen the confidence
interval to include the variance of individual values of Y given X about the mean value of Y given X,
using this formula: CI = Ŷ ± t·sY.X·√(1 + 1/n + (X - Mx)²/SSx).
3. In both cases, the value of t is obtained from a table where df = n - 2 and (1 - cc) = the
level of significance for two-tailed test in the t table in our textbook. Recall that cc is the confidence
coefficient, the degree of confidence desired.
4. Constructing confidence intervals requires the same assumptions as testing hypotheses
about the slope.
5. The confidence intervals about the regression line will be bowed, with that for predicting
individual values wider than that for predicting average values.
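As an illustration (Python, not part of the original handout), here is the interval for an individual new Y at X = 3, the mean burger count; the critical t(3) = 3.182 is taken from a t table, and the other numbers come from the beer-burger computations above:

```python
import math

n, SSx, mx = 5, 10, 3
s_yx = math.sqrt(4.8)  # standard error of estimate
t_crit = 3.182         # t(.975) with df = n - 2 = 3, from a t table

x = 3                  # predict for a person who eats 3 burgers
yhat = 1.2 + 1.6 * x   # = 6.0
half_width = t_crit * s_yx * math.sqrt(1 + 1 / n + (x - mx) ** 2 / SSx)
print(round(yhat - half_width, 2), round(yhat + half_width, 2))
# -1.64 13.64
```

The interval is so wide that it even dips below zero beers, another symptom of the tiny sample size noted earlier.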
Testing Other Hypotheses
One may test the null hypothesis that the slope for predicting Y from X is the same in one
population as it is in another. For example, is the slope of the line relating blood cholesterol level to
blood pressure the same in women as it is in men? Howell explains how to test such a hypothesis in
our textbook. Please note that this is not equivalent to a test of the null hypothesis that two
correlation coefficients are equal.
The Y.X relationships in different populations may differ from one another in terms of slope,
intercept, and/or scatter about the regression line (error, 1 - r²). [See relevant plots] There are
methods (see Kleinbaum & Kupper, Applied Regression Analysis and Other Multivariable Methods,
Duxbury, 1978, Chapter 8) to test all three of these across two or more groups. Howell restricts his
attention to tests of slope and scatter with only two groups.
Suppose that for predicting blood pressure from blood cholesterol the slope were exactly the
same for men and for women. The intercepts need not be the same (men might have higher average
blood pressure than do women at all levels of cholesterol) and the rs may differ, for example, the
effect of extraneous variables might be greater among men than among women, producing more
scatter about the regression line for men, lowering r
2
for men. Alternatively, the rs may be identical
but the slopes different. For example, suppose the scatter about the regression line was identical for
men and women but that blood pressure increased more per unit increase in cholesterol for men than
for women.
Consider this case, where the slopes differ but the correlation coefficients do not:
Group   r    sx    sy    b = r(sy / sx)
A      .5    10    20    .5(20/10) = 1
B      .5    10    100   .5(100/10) = 5
Now, consider this case, where the correlation coefficients differ but the slopes do not:
Group   b    sx    sy    r = b(sx / sy)
A       1    10    20    1(10/20) = .5
B       1    10    100   1(10/100) = .1
See these data sets which have identical slopes for predicting Y from X but different correlation
coefficients.
Return to my Statistics Lessons page.
More on Assumptions in Correlation and Regression
Copyright 2012, Karl L. Wuensch - All rights reserved.
What is R² When N = p + 1 (and df = 0)?
N = 2 = p + 1: Two variables, two cases.
N is the number of cases and p is the number of predictor variables. I shall
represent the outcome variable with Y and the predictor variables with X
i
. The data for
all variables here were randomly sampled from a population where the correlation
between each pair of variables was exactly zero.
Model Summary
Model   R        R Square   Adjusted R Square   Std. Error of the Estimate
1       1.000a   1.000      .                   .
a. Predictors: (Constant), X1
With only two data points, you can fit them perfectly with a straight line no matter
where the two points are located in Cartesian space; accordingly, r² = 1.
N = 3 = p + 1: Three cases, three variables.
Model Summary
Model   R        R Square   Adjusted R Square   Std. Error of the Estimate
1       1.000a   1.000      .                   .
a. Predictors: (Constant), X2, X1
Coefficients(a)
Model         B           Std. Error   Beta     t    Sig.
1 (Constant)  8.079E-16   .000                  .    .
  X1          .318        .000         .842     .    .
  X2          -.136       .000         -.343    .    .
a. Dependent Variable: Y
We now have three data points in three dimensional space. We can fit them
perfectly with a plane. R² = 1.
I asked SPSS to save predicted scores. As you can see, prediction is perfect:
Here I plot Y versus predicted Y.
N = 4 = p + 1: Four cases, four variables. We are in hyperspace now.
Model Summary(b)
Model   R        R Square   Adjusted R Square   Std. Error of the Estimate
1       1.000a   1.000      .                   .
a. Predictors: (Constant), X3, X1, X2
b. Dependent Variable: Y
Again, prediction is perfect.
Coefficients(a)
Model         B       Std. Error   Beta      t    Sig.
1 (Constant)  4.886   .000                   .    .
  X1          -.008   .000         -.010     .    .
  X2          -.788   .000         -1.553    .    .
  X3          .326    .000         .844      .    .
a. Dependent Variable: Y
And so on.
Bottom line, when there only as many cases as there are variables, you can
perfectly predict any one of the variables from an optimally weighted linear combination
of the others. I should note that each of the variables must be a variable (have variance
> 0), not a constant (variance = 0).
Clearly, sample R² is an overestimate of population R² when the number of
cases is the same as the number of variables. When the number of cases is not much
more than the number of variables the overestimation will be less than in the extreme
cases above, but still enough that you might want to compute adjusted R² (also known
as shrunken R²). When the number of cases is very much greater than the number of
variables, overestimation of the value of R² will be trivial.
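A quick way to see the size of the shrinkage is to compute adjusted R² directly. A minimal Python sketch (not part of the original lesson; the function name is mine), using the usual 1 - (1 - R²)(N - 1)/(N - p - 1) adjustment:

```python
def adjusted_r2(r2, n, p):
    """Wherry-style adjusted (shrunken) R-squared."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# many more cases than variables: shrinkage is trivial
print(round(adjusted_r2(0.50, 1000, 3), 4))  # 0.4985
# barely more cases than variables: severe shrinkage (it can even go negative)
print(adjusted_r2(0.50, 6, 3))  # -0.25
```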
Back to Wuensch's Stats Lessons Page
Karl L. Wuensch
East Carolina University
February, 2009.
CI-R2.docx
Putting Confidence Intervals on R² or R
Giving a confidence interval for an R or R² is a lot more informative than just
giving the sample value and a significance level. So, how does one compute a
confidence interval for R or R²?
Bivariate Correlation
Benchmarks for ρ. Again, context can be very important.
.1 is small but not trivial
.3 is medium
.5 is large
Confidence Interval for ρ, Correlation Analysis. My colleagues and I
(Wuensch, K. L., Castellow, W. A., & Moore, C. H. Effects of defendant attractiveness
and type of crime on juridic judgment. Journal of Social Behavior and Personality, 1991,
6, 713-724) asked mock jurors to rate the seriousness of a crime and also to
recommend a sentence for a defendant who was convicted of that crime. The observed
correlation between seriousness and sentence was .555, n = 318, p < .001. We treat
both variables as random. Now we roll up our sleeves and prepare to do a bunch of
tedious arithmetic.
First we apply Fisher's transformation to the observed value of r. I shall use
Greek zeta (ζ) to symbolize the transformed r:
ζ = (0.5)ln[(1 + r)/(1 - r)] = (0.5)ln(1.555/.445) = (0.5)ln(3.494) = (0.5)(1.251) = 0.626.
We compute the standard error as SE = 1/√(n - 3) = 1/√315 = .05634. We compute a 95% confidence
interval for ζ: ζ ± 1.96(.05634). This gives us a confidence interval
extending from .51557 to .73643, but it is in transformed units, so we need to
untransform it.
r = (e^2ζ - 1)/(e^2ζ + 1). At the lower boundary, that gives us
r = (e^1.031 - 1)/(e^1.031 + 1) = 1.8039/3.8039 = .474, and at the upper boundary
r = (e^1.473 - 1)/(e^1.473 + 1) = 3.3617/5.3617 = .627.
What a bunch of tedious arithmetic that involved. We need a computer program
to do it for us. My Conf-Interval-r.sas program will do it all for you.
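The Conf-Interval-r.sas program does this in SAS; as a cross-check, here is a minimal Python sketch of the same arithmetic (not part of the original lesson), reproducing the interval for r = .555, n = 318:

```python
import math

def r_confidence_interval(r, n, z_crit=1.96):
    """95% confidence interval for a correlation, via Fisher's r-to-z."""
    zeta = 0.5 * math.log((1 + r) / (1 - r))   # Fisher transformation
    se = 1 / math.sqrt(n - 3)                  # standard error of zeta
    lo_z, hi_z = zeta - z_crit * se, zeta + z_crit * se
    # untransform each boundary back to the r metric
    back = lambda z: (math.exp(2 * z) - 1) / (math.exp(2 * z) + 1)
    return back(lo_z), back(hi_z)

lo, hi = r_confidence_interval(0.555, 318)
print(round(lo, 3), round(hi, 3))  # 0.474 0.627
```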
χ² = (N - 1)r² = 3215(.324)/383.444 = 2.717, within rounding error of the Linear-by-Linear
Association reported by SPSS.
We could also test the deviation from linearity by subtracting the linear χ² from the
overall χ²: 11.752 - 2.717 = 9.035. The df are also obtained by subtraction, overall
less linear = 2 - 1 = 1. P(χ² > 9.035 | df = 1) = .0026. There is a significant deviation
from linearity.
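A sketch of this partitioning in Python (hypothetical helper code, not part of the lesson; scipy supplies the chi-square p value):

```python
from scipy.stats import chi2

# linear-by-linear association: chi-square = (N - 1) * r**2
linear = 3215 * (0.324 / 383.444)   # (N - 1) times regression SS / total SS
# deviation from linearity: overall chi-square minus linear chi-square
deviation = 11.752 - linear
p = chi2.sf(deviation, df=1)        # df = overall (2) less linear (1)
print(round(linear, 3), round(deviation, 3), round(p, 4))  # 2.717 9.035 0.0026
```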
Now let us split the file by the direction of travel. If we consider only those going
down, there is a significant overall effect of weight category but not a significant linear
effect:
Chi-Square Testsb
                               Value    df   Asymp. Sig. (2-sided)
Pearson Chi-Square             8.639a   2    .013
Likelihood Ratio               9.091    2    .011
Linear-by-Linear Association   .001     1    .973
N of Valid Cases               1362
a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 23.09.
b. direct = 2 Descending
If we consider only those going up, there is a significant linear effect and the
deviation from linearity is not significant, χ²(1, N = 1855) = 2.626, p = .105:
Chi-Square Testsb
                               Value    df   Asymp. Sig. (2-sided)
Pearson Chi-Square             9.525a   2    .009
Likelihood Ratio               10.001   2    .007
Linear-by-Linear Association   6.899    1    .009
N of Valid Cases               1855
a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 13.21.
b. direct = 1 Ascending
Return to Wuensch's Stats Lessons Page
Karl L. Wuensch, East Carolina University, October, 2010.
CompareCorrCoeff.docx
Comparing Correlation Coefficients, Slopes, and Intercepts
Two Independent Samples
H0: ρ1 = ρ2
If you want to test the null hypothesis that the correlation between X and Y in one
population is the same as the correlation between X and Y in another population, you can
use the procedure developed by R. A. Fisher in 1921 (On the probable error of a
coefficient of correlation deduced from a small sample, Metron, 1, 3-32).
First, transform each of the two correlation coefficients in this fashion:
r' = (0.5)ln[(1 + r)/(1 - r)]
Second, compute the test statistic this way:
z = (r'1 - r'2) / √[1/(n1 - 3) + 1/(n2 - 3)]
Third, obtain p for the computed z.
Consider the research reported by Wuensch, K. L., Jenkins, K. W., & Poteat, G.
M. (2002). Misanthropy, idealism, and attitudes towards animals. Anthrozoös, 15, 139-
149.
The relationship between misanthropy and support for animal rights was compared
between two different groups of persons: persons who scored high on Forsyth's measure
of ethical idealism, and persons who did not score high on that instrument. For 91
nonidealists, the correlation between misanthropy and support for animal rights was .3639.
For 63 idealists the correlation was .0205. The test statistic,
z = (.3814 - .0205)/√(1/88 + 1/60) = 2.16, p =
.031, leading to the conclusion that the correlation in nonidealists is significantly higher
than it is in idealists.
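Garbin's FZT.exe program does this computation; as a cross-check, a minimal Python sketch (function name mine, not part of the lesson) reproducing the example:

```python
import math
from scipy.stats import norm

def compare_independent_rs(r1, n1, r2, n2):
    """Fisher's z test of H0: rho1 = rho2 for two independent samples."""
    rp1 = 0.5 * math.log((1 + r1) / (1 - r1))  # transform each r
    rp2 = 0.5 * math.log((1 + r2) / (1 - r2))
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (rp1 - rp2) / se
    return z, 2 * norm.sf(abs(z))              # two-tailed p

z, p = compare_independent_rs(0.3639, 91, 0.0205, 63)
print(round(z, 2), round(p, 3))  # 2.16 0.031
```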
Calvin P. Garbin of the Department of Psychology at the University of Nebraska has
authored a dandy document Bivariate Correlation Analyses and Comparisons which is
recommended reading. His web server has been a bit schizo lately, so you might find the
link invalid, sorry. Dr. Garbin has also made available a program (FZT.exe) for conducting
this Fisher's z test. Files with the .exe extension encounter a lot of prejudice on the
Internet these days, so you might have problems with that link too. If so, try to find it on his
web sites at http://psych.unl.edu/psycrs/statpage/ and
http://psych.unl.edu/psycrs/statpage/comp.html .
The test statistic is t = (b1 - b2)/s(b1-b2), evaluated on (N - 4) degrees of freedom.
If your regression program gives you the standard error of the slope (both SAS and
SPSS do), the standard error of the difference between the two slopes is most easily
computed as s(b1-b2) = √(s²b1 + s²b2) = √(.08140² + .09594²) = .1258.
Or, if you just love doing arithmetic, you can first find the pooled residual variance,
s²y.x = (SSE1 + SSE2)/(n1 + n2 - 4) = (24.0554 + 15.6841)/150 = .2649,
and then compute the standard error of the difference between slopes as
s(b1-b2) = √[s²y.x/SSX1 + s²y.x/SSX2] = √[.2649/(90)(.6732)² + .2649/(62)(.6712)²] = .1264,
within rounding error of what we got above.
Now we can compute the test statistic: t = (.3001 - .0153)/.1258 = 2.26.
This is significant on 150 df (p = .025), so we conclude that the slope in nonidealists
is significantly higher than that in idealists.
Please note that the test on slopes uses a pooled error term. If the variance in the
dependent variable is much greater in one group than in the other, there are alternative
methods. See Kleinbaum and Kupper (1978, Applied Regression Analysis and Other
Multivariable Methods, Boston: Duxbury, pages 101 & 102) for a large (each sample
n > 25) samples test, and K & K page 192 for a reference on other alternatives to pooling.
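The slope comparison above is easy to script; a Python sketch (variable names mine, not part of the lesson) using the values reported above:

```python
import math
from scipy.stats import t as t_dist

b1, b2 = 0.3001, 0.0153          # the two slopes (nonidealists, idealists)
se_b1, se_b2 = 0.08140, 0.09594  # their standard errors from the regression output
se_diff = math.sqrt(se_b1**2 + se_b2**2)
t = (b1 - b2) / se_diff
df = 91 + 63 - 4                 # n1 + n2 - 4
p = 2 * t_dist.sf(abs(t), df)    # two-tailed p
print(round(se_diff, 4), round(t, 2), round(p, 3))  # 0.1258 2.26 0.025
```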
H0: a1 = a2
The regression lines (one for nonidealists, the other for idealists) for predicting
support for animal rights from misanthropy could also differ in intercepts. Here is how to
test the null hypothesis that the intercepts are identical:
s(a1-a2) = √( s²y.x pooled [1/n1 + 1/n2 + M1²/SSX1 + M2²/SSX2] ),
t = (a1 - a2)/s(a1-a2), on df = n1 + n2 - 4.
M1 and M2 are the means on the predictor variable for the two groups (nonidealists and
idealists). If you ever need a large sample test that does not require homogeneity of
variance, see K & K pages 104 and 105.
For our data,
s(a1-a2) = √(.2649[1/91 + 1/63 + 2.3758²/(90)(.6732)² + 2.2413²/(62)(.6712)²]) = .3024,
and t = (1.626 - 2.404)/.3024 = -2.57.
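A sketch of the intercept comparison in Python (variable names mine, not part of the lesson), using the values from the example:

```python
import math

s2_pooled = 0.2649       # pooled residual variance computed above
n1, n2 = 91, 63
M1, M2 = 2.3758, 2.2413  # group means on the predictor
ss_x1 = 90 * 0.6732**2   # SS_X = (n - 1) * SD**2
ss_x2 = 62 * 0.6712**2
se = math.sqrt(s2_pooled * (1/n1 + 1/n2 + M1**2/ss_x1 + M2**2/ss_x2))
t = (1.626 - 2.404) / se # difference between the two intercepts
print(round(se, 3), round(t, 2))  # 0.302 -2.57
```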
H0: ρay = ρby
H0: ρWX = ρYZ
If you wish to compare the correlation between one pair of variables with that
between a second (nonoverlapping) pair of variables, read the article by T. E.
Raghunathan , R. Rosenthal, and D. B. Rubin (Comparing correlated but nonoverlapping
correlations, Psychological Methods, 1996, 1, 178-183). Also, see my example,
Comparing Correlated but Nonoverlapping Correlation Coefficients .
Return to my Statistics Lessons page.
Copyright 2011, Karl L. Wuensch - All rights reserved.
InterRater.doc
East Carolina University
Department of Psychology
Inter-Rater Agreement
Psychologists commonly measure various characteristics by having a rater
assign scores to observed people, other animals, other objects, or events. When using
such a measurement technique, it is desirable to measure the extent to which two or
more raters agree when rating the same set of things. This can be treated as a sort of
reliability statistic for the measurement procedure.
Continuous Ratings, Two Judges
Let us first consider a circumstance where we are comfortable with treating the
ratings as a continuous variable. For example, suppose that we have two judges rating
the aggressiveness of each of a group of children on a playground. If the judges agree
with one another, then there should be a high correlation between the ratings given by
the one judge and those given by the other. Accordingly, one thing we can do to
assess inter-rater agreement is to correlate the two judges' ratings. Consider the
following ratings (they also happen to be ranks) of ten subjects:
Subject 1 2 3 4 5 6 7 8 9 10
Judge 1 10 9 8 7 6 5 4 3 2 1
Judge 2 9 10 8 7 5 6 4 3 1 2
These data are available in the SPSS data set IRA-1.sav at my SPSS Data
Page. I used SPSS to compute the correlation coefficients, but SAS can do the same
analyses. Here is the dialog window from Analyze, Correlate, Bivariate:
The reliability of the mean of the judges' ratings can be estimated with the
Spearman-Brown formula, j(icc)/[1 + (j - 1)icc], where j is the number of judges and icc is the
intraclass correlation coefficient. I would think this statistic appropriate when the data
for our main study involves having j judges rate each subject.
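This step is a one-liner; a Python sketch (function name mine), using the single-measure intraclass correlation of .6961 from the example later in these notes:

```python
def effective_reliability(icc, j):
    """Spearman-Brown estimate of the reliability of the mean of j judges' ratings."""
    return j * icc / (1 + (j - 1) * icc)

print(round(effective_reliability(0.6961, 3), 3))  # 0.873
```

The result agrees with the Average Measure Intraclass Correlation (.8730) in the SPSS output shown later.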
Rank Data, More Than Two Judges
When our data are rankings, we don't have to worry about differences in
magnitude. In that case, we can simply employ Spearman rho or Kendall tau if there
are only two judges or Kendall's coefficient of concordance if there are three or more
judges. Consult pages 309 - 311 in David Howell's Statistics for Psychology, 7th edition,
for an explanation of Kendall's coefficient of concordance. Run the program
Kendall-Patches.sas, from my SAS programs page, as an example of using SAS to
compute Kendall's coefficient of concordance. The data are those from Howell, page
310. Statistic 2 is the Friedman chi-square testing the null hypothesis that the patches
do not differ from one another with respect to how well they are liked. This null
hypothesis is equivalent to the hypothesis that there is no agreement among the judges
with respect to how pleasant the patches are. To convert the Friedman chi-square to
Kendall's coefficient of concordance, we simply substitute into this equation:
W = χ²F/[J(n - 1)] = 33.889/[6(7)] = .807, where J is the number of judges and n is the number of
things being ranked.
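A quick check of the arithmetic above (a Python sketch, not part of the lesson; j and n here are taken from the worked denominator, 6(7) = 42):

```python
def kendalls_w(friedman_chi2, j, n):
    """Kendall's coefficient of concordance from the Friedman chi-square."""
    return friedman_chi2 / (j * (n - 1))

print(round(kendalls_w(33.889, j=6, n=8), 3))  # 0.807
```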
If the judges gave ratings rather than ranks, you must first convert the ratings
into ranks in order to compute the Kendall coefficient of concordance. An explanation
of how to do this with SAS is presented in my document "Nonparametric Statistics." You
would, of course, need to remember that ratings could be concordant in order but not in
magnitude.
Categorical Judgments
Please re-read pages 165 and 166 in David Howell's Statistical Methods for
Psychology, 7th edition. Run the program Kappa.sas, from my SAS programs page,
as an example of using SAS to compute kappa. It includes the data from page 166 of
Howell. Note that Cohen's kappa is appropriate only when you have two judges. If you
have more than two judges you may use Fleiss' kappa.
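Cohen's kappa is also easy to compute directly; a minimal Python sketch with a hypothetical two-judge agreement table (my own data, not Howell's):

```python
def cohens_kappa(table):
    """Cohen's kappa from a square agreement table (rows = judge 1, columns = judge 2)."""
    k = len(table)
    n = sum(sum(row) for row in table)
    po = sum(table[i][i] for i in range(k)) / n           # observed agreement
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    pe = sum(rows[i] * cols[i] for i in range(k)) / n**2  # chance agreement
    return (po - pe) / (1 - pe)

print(round(cohens_kappa([[20, 5], [10, 15]]), 3))  # 0.4
```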
Return to Wuensch's Statistics Lessons Page
Copyright 2010, Karl L. Wuensch - All rights reserved.
IntraClassCorrelation.doc
East Carolina University
Department of Psychology
The Intraclass Correlation Coefficient
Read pages 495 through 498 in David Howell's Statistical Methods for
Psychology, 7th edition.
Here I shall compute the same intraclass correlation coefficient that Howell did,
treating judges as a random (rather than fixed) variable. The basic analysis is a Judges
x Subjects repeated measures ANOVA. Here are the data, within SPSS:
Click Analyze, General Linear Model, Repeated Measures.
Name the Factor judge, indicate 3 levels, click Add.
ICC = (MSsubjects - MSerror) / [MSsubjects + (j - 1)MSerror + j(MSjudges - MSerror)/n].
Howell has more rounding error in his calculations than do I.
Doing the analysis as described above has pedagogical value, but if you just want to get
the intraclass correlation coefficient with little fuss, do it this way:
Click Analyze, Scale, Reliability Analysis. Scoot all three judges into the Items box.
Click Statistics. Ask for an Intraclass correlation coefficient, Two-Way Random model,
Type = Absolute Agreement.
Continue, OK.
Here is the output. I have set the intraclass correlation coefficient in bold font and
highlighted it.
****** Method 1 (space saver) will be used for this analysis ******
Intraclass Correlation Coefficient
Two-way Random Effect Model (Absolute Agreement Definition):
People and Measure Effect Random
Single Measure Intraclass Correlation = .6961*
95.00% C.I.: Lower = .0558 Upper = .9604
F = 214.0000 DF = ( 4, 8.0) Sig. = .0000 (Test Value = .0000 )
Average Measure Intraclass Correlation = .8730
95.00% C.I.: Lower = .1480 Upper = .9864
F = 214.0000 DF = ( 4, 8.0) Sig. = .0000 (Test Value = .0000 )
*: Notice that the same estimator is used whether the interaction effect
is present or not.
Reliability Coefficients
N of Cases = 5.0 N of Items = 3
Alpha = .9953
Addendum
A correspondent at SUNY, Albany, provided me with data on the grades given by
faculty grading comprehensive examinations for doctoral students in his unit and asked
if I could provide assistance in estimating the inter-rater reliability. I computed the ICC
as well as the simple Pearson r between each rater and each other rater. The single
measure ICC was exactly equal to the mean of the Pearson r coefficients. Interesting,
and probably not mere coincidence.
See Also:
Enhancement of Reliability Analysis -- Robert A. Yaffee
Choosing an Intraclass Correlation Coefficient -- David P. Nichols
Return to Wuensch's Statistics Lessons Page
Revised March, 2010.
Pearson R and Phi
Pearson r computed on two dichotomous variables is a phi coefficient. To test
the significance of such a phi coefficient one generally uses a chi-square statistic, which
can be computed as χ² = nφ². For the contingency table presented below,
χ² = 30(.305/7.5) = 1.22 (r² is the ratio of the regression SS to the total SS). This
chi-square is evaluated on one degree of freedom. Do notice that the p value provided with
the usual test of significance for a Pearson correlation coefficient is off a bit (.285 as
compared to the .269 obtained from the chi-square).
Regression

Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .202a   .041       .006                .50690
a. Predictors: (Constant), B

ANOVAb
Model          Sum of Squares   df   Mean Square   F       Sig.
1 Regression   .305             1    .305          1.189   .285a
  Residual     7.195            28   .257
  Total        7.500            29
a. Predictors: (Constant), B
b. Dependent Variable: A
Correlations
                          B
A   Pearson Correlation   .202
    Sig. (2-tailed)       .285
    N                     30

φ = .202
Crosstabs

A * B Crosstabulation (Count)
            B
A           1.00   2.00   Total
1.00        10     5      15
2.00        7      8      15
Total       17     13     30

Chi-Square Tests
                     Value   df   Asymp. Sig. (2-sided)
Pearson Chi-Square   1.222   1    .269
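As a check on the arithmetic, phi and the chi-square can be computed directly from the cell counts above (a Python sketch, not part of the lesson):

```python
import math
from scipy.stats import chi2

# the crosstabulation above: a, b = first row; c, d = second row
a, b, c, d = 10, 5, 7, 8
n = a + b + c + d
phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
chisq = n * phi**2            # chi-square = n * phi-squared
p = chi2.sf(chisq, df=1)
print(round(phi, 3), round(chisq, 3), round(p, 3))  # 0.202 1.222 0.269
```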
Return to Wuensch's Introductory Lesson on Bivariate Linear Correlation
Residual-Plots-SPSS.doc
Producing and Interpreting Residuals Plots in SPSS
In a linear regression analysis it is assumed that the distribution of the residuals,
(Y - Ŷ), is normal.
I shall compare the Wilcoxon rank-sum statistic with the independent samples t-
test to illustrate the differences between typical nonparametric tests and their parametric
equivalents.
Independent Samples t test              Wilcoxon Rank-Sum Test
H0: μ1 = μ2                             H0: Population 1 = Population 2
Assumptions:                            None for general test, but often assume:
  Normal populations                      Equal shapes
  Homogeneity of variance                 Equal dispersions
  (but not for separate variances test)
Both tests are appropriate for determining whether or not there is a significant
association between a dichotomous variable and a continuous variable with
independent samples data. Note that with the independent samples t test the null
hypothesis focuses on the population means. If you have used the general form of the
nonparametric hypothesis (without assuming that the populations have equal shapes
and equal dispersions), rejection of that null hypothesis simply means that you are
confident that the two populations differ on one or more of location, shape, or
dispersion. If, however, we are willing to assume that the two populations have identical
shapes and dispersions, then we can interpret rejection of the nonparametric null
hypothesis as indicating that the populations differ in location. With these equal shapes
and dispersions assumptions the nonparametric test is quite similar to the parametric
test. In many ways the nonparametric tests we shall study are little more than
parametric tests on rank-transformed data. The nonparametric tests we shall study are
especially sensitive to differences in medians.
If your data indicate that the populations are not normally distributed, then a
nonparametric test may be a good alternative, especially if the populations do appear to
be of the same non-normal shape. If, however, the populations are approximately
normal but heterogeneous in variance, I would recommend a separate variances t-test
over a nonparametric test. If you cannot assume equal dispersions with the
nonparametric test, then you cannot interpret rejection of the nonparametric null
hypothesis as due solely to differences in location.
Conducting the Wilcoxon Rank-Sum Test
Rank the data from lowest to highest. If you have tied scores, assign all of them
the mean of the ranks for which they are tied. Find the sum of the ranks for each group.
If n1 = n2, then the test statistic, WS, is the smaller of the two sums of ranks. Go to the
table (starts on page 715 of Howell) and obtain the one-tailed (lower tailed) p. For a
= . This procedure does not require that you first conduct the
omnibus test, and should you first conduct the omnibus test, you may make the
Bonferroni comparisons whether or not that omnibus test is significant. Suppose that k
= 4 and you wish to make all 6 pairwise comparisons (1-2, 1-3, 1-4, 2-3, 2-4, 3-4) with a
maximum familywise alpha of .05. Your adjusted criterion is .05 divided by 6, .0083.
For each pairwise comparison you obtain an exact p, and if that exact p is less than or
equal to the adjusted criterion, you declare that difference to be significant. Do note that
the cost of such a procedure is a great reduction in power (you are trading an increased
risk of Type II error for a reduced risk of Type I error).
Here is a summary statement for the problem on page 684 of Howell: Kruskal-
Wallis ANOVA indicated that type of drug significantly affected the number of problems
solved, H(2, N = 19) = 10.36, p = .006. Pairwise comparisons made with Wilcoxon's
rank-sum test revealed that ......... Basic descriptive statistics (means, medians,
standard deviations, sample sizes) would be presented in a table.
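The Bonferroni procedure described above can be sketched with scipy's Mann-Whitney U test (equivalent to the Wilcoxon rank-sum test); the groups and scores here are hypothetical, chosen only for illustration:

```python
from itertools import combinations
from scipy.stats import mannwhitneyu

# hypothetical scores for k = 3 groups
groups = {"A": [1, 2, 2, 3, 4], "B": [3, 4, 5, 5, 6], "C": [7, 8, 8, 9, 10]}
pairs = list(combinations(groups, 2))
criterion = 0.05 / len(pairs)   # Bonferroni-adjusted criterion
results = {}
for g1, g2 in pairs:
    u, p = mannwhitneyu(groups[g1], groups[g2], alternative="two-sided")
    results[(g1, g2)] = p
    print(g1, g2, round(p, 4), "significant" if p <= criterion else "ns")
```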
Friedman's ANOVA
This test is appropriate to test the significance of the association between a
categorical variable (k ≥ 2) and a continuous variable with randomized blocks data
(related samples). While Friedman's test could be employed with k = 2, usually
Wilcoxon's signed-ranks test would be employed if there were only two groups.
Subjects have been matched (blocked) on some variable or variables thought to be
correlated with the continuous variable of primary interest. Within each block the
continuous variable scores are ranked. Within each condition (level of the categorical
variable) you sum the ranks and substitute in the formula on page 685 of Howell. As
with the Kruskal-Wallis, obtain p from chi-square on k - 1 degrees of freedom, using an
upper-tailed p for nondirectional hypotheses, adjusting it with k! for directional
hypotheses. Pairwise comparisons could be accomplished employing Wilcoxon signed-
ranks tests, with Fisher's or Bonferroni's procedure to guard against inflated familywise
alpha.
Friedman's ANOVA is closely related to Kendall's coefficient of concordance.
For the example on page 685 of Howell, the Friedman test asks whether the rankings
are the same for the three levels of visual aids. Kendall's coefficient of concordance, W,
would measure the extent to which the blocks agree in their rankings: W = χ²F/[N(k - 1)].
Here is a sample summary statement for the problem on page 685 of Howell:
Friedman's ANOVA indicated that judgments of the quality of the lectures were
significantly affected by the number of visual aids employed, χ²F(2, n = 17) = 10.94, p =
.004. Pairwise comparisons with Wilcoxon signed-ranks tests indicated that
....................... Basic descriptive statistics would be presented in a table.
Power
It is commonly opined that the primary disadvantage of the nonparametric
procedures is that they have less power than does the corresponding parametric test.
The reduction in power is not, however, great, and if the assumptions of the parametric
test are violated, then the nonparametric test may be more powerful.
Everything You Ever Wanted to Know About Six But Were Afraid to Ask
You may have noticed that the numbers 2, 3, 4, 6, 12, and 24 commonly appear
as constants in the formulas for nonparametric test statistics. This results from the fact
that the sum of the integers from 1 to n is equal to n(n + 1) / 2.
Effect Size Estimation
Please read my document Nonparametric Effect Size Estimators .
Using SAS to Compute Nonparametric Statistics
Run the program Nonpar.sas from my SAS programs page. Print the output and
the program file.
The first analysis is a Wilcoxon Rank Sum Test, using the birthweight data also
used by Howell (page 676) to illustrate this procedure. SAS gives us the sum of scores
for each group. That sum for the smaller group is the statistic which Howell calls WS
(100). Note that SAS does not report the W'S statistic (52), but it is easily computed by
hand: W'S = 152 - 100 = 52. Please remember that the test statistic which
psychologists report is the smaller of WS and W'S. SAS does report both a normal
approximation (z = 2.088, p = .037) and an exact (not approximated) p = .034. The z
differs slightly from that reported by Howell because SAS employs a correction for
continuity (reducing by .5 the absolute value of the numerator of the z ratio).
The next analysis is a Wilcoxon Matched Pairs Signed-Ranks Test using the
data from page 682 of Howell. Glucose-Saccharine difference scores are computed
and then fed to Proc Univariate. Among the many other statistics reported with Proc
Univariate, there is the Wilcoxon Signed-Ranks Test. For the data employed here, you
will see that SAS reports S = 53.5, p = .004. S, the signed-rank statistic, is the
absolute value of T - n(n + 1)/4, where T is the sum of the positive ranks or the negative
ranks.
S is the difference between the expected and the obtained sums of ranks. You
know that the sum of the ranks from 1 to n is n(n + 1)/2. Under the null hypothesis, you
expect the sum of the positive ranks to equal the sum of the negative ranks, so you
expect each of those sums of ranks to be half of n(n + 1)/2. For the data we analyzed
here, the sum of the ranks 1 through 16 = 136, and half of that is 68. The observed sum
of positive ranks is 121.5, and the observed sum of negative ranks is 14.5. The
difference between 68 and 14.5 (or between 121.5 and 68) is 53.5, the value of S
reported by SAS.
To get T from S, just subtract the absolute value of S from the expected value for
the sum of ranks, that is, T = n(n + 1)/4 - |S|. Alternatively, just report S instead of T and
be prepared to explain what S is to the ignorant psychologists who review your
manuscript.
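The S-to-T conversion is a one-liner; a Python sketch (function name mine) using the values above (n = 16 pairs, S = 53.5):

```python
def t_from_s(s, n):
    """Recover Wilcoxon's T from the signed-rank statistic S reported by SAS."""
    return n * (n + 1) / 4 - abs(s)

print(t_from_s(53.5, 16))  # 14.5, the smaller sum of ranks
```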
If you needed to conduct several signed-ranks tests, you might not want to
produce all of the output that you get by default with Proc Univariate. See my program
WilcoxonSignedRanks.sas on my SAS programs page to see how to get just the
statistics you want and nothing else.
Note that a Binomial Sign Test is also included in the output of Proc Univariate.
SAS reports M = 5, p = .0213. M is the difference between the expected number of
negative signs and the obtained number of negative signs. Since we have 16 pairs of
scores, we expect, under the null, to get 8 negative signs. We got 3 negative signs, so
M = 8 - 3 = 5. The p here is the probability of getting an event as or more unusual than 3
successes on 16 binomial trials when the probability of a success on each trial is .5.
Another way to get this probability with SAS is: Data p; p = 2*PROBBNML(.5, 16, 3);
proc print; run;
Next is a Kruskal-Wallis ANOVA, using Howells data on effect of stimulants
and depressants on problem solving (page 684). Do note that the sums and means
reported by SAS are for the ranked data. Following the overall test, I conducted
pairwise comparisons with Wilcoxon Rank Sum tests. Note how I used the subsetting
IF statement to create the three subsets necessary to do the pairwise comparisons.
The last analysis is Friedmans Rank Test for Correlated Samples, using
Howells data on the effect of visual aids on rated quality of lectures (page 685). Note
that I first had to use Proc Rank to create a data set with ranked data. Proc Freq then
provides the Friedman statistic as a Cochran-Mantel-Haenszel Statistic. One might
want to follow the overall analysis with pairwise comparisons, but I have not done so
here.
I have also provided an alternative rank analysis for the data just analyzed with
the Friedman procedure. Note that I simply conducted a factorial ANOVA on the rank
data, treating the blocking variable as a second independent variable. One advantage
of this approach is that it makes it easy to get the pairwise comparisons -- just include
the LSMEANS command with the PDIFF option. The output from LSMEANS includes
the mean ranks and a matrix of p values for tests comparing each group's mean rank
with each other group's mean rank.
References
Gaito, J. (1980). Measurement scales and statistics: Resurgence of an old
misconception. Psychological Bulletin, 87, 564-567. doi:10.1037/0033-2909.87.3.564
Howell, D. C. (2010). Statistical methods for psychology (7th ed.). Belmont, CA:
Cengage Wadsworth.
Nanna, M. J., & Sawilowsky, S. S. (1998). Analysis of Likert scale data in disability and
medical rehabilitation research. Psychological Methods, 3, 55-67.
doi:10.1037/1082-989X.3.1.55
Return to Wuensch's Statistics Lessons Page
Copyright 2012, Karl L. Wuensch - All rights reserved.
East Carolina University
Department of Psychology
Nonparametric Effect Size Estimators
As you know, the American Psychological Association now emphasizes the reporting of effect
size estimates. Since the unit of measure for most criterion variables used in psychological research
is arbitrary, standardized effect size estimates, such as Hedges' g, η², and ω², are popular. What is
one to use when the analysis has been done with nonparametric methods? This query is addressed
in the document A Call for Greater Use of Nonparametric Statistics, pages 13-15. The authors
(Leech & Onwuegbuzie) note that researchers who employ nonparametric analysis generally either
do not report effect size estimates or report parametric effect size estimates such as estimated
Cohen's d. It is, however, known that these effect size estimates are adversely affected by
departures from normality and heterogeneity of variances, so they may not be well advised for use
with the sort of data which generally motivates a researcher to employ nonparametric analysis.
There are a few nonparametric effect size estimates (see Leech & Onwuegbuzie), but they are
not well-known and they are not available in the typical statistical software package.
Remember that nonparametric procedures do not test the same null hypothesis that a
parametric t test or ANOVA tests. The nonparametric null hypothesis is that the populations being
compared are identical in all aspects -- not just in location. If you are willing to assume that the
populations do not differ in dispersion or shape, then you can interpret a significant difference as a
difference in locations. I shall assume that you are making such assumptions.
With respect to the two independent samples design (comparing means), the following might
make sense, but I have never seen them done:
A d like estimator calculated by taking the difference in group mean ranks and dividing by the
standard deviation of the ranks.
Another d like estimator calculated by taking the difference between the group median scores
and dividing by the standard deviation of the scores.
An eta-squared like estimator calculated as the squared point-biserial correlation between
groups and the ranks.
Grissom and Kim (2012) have suggested some effect size estimators for use in association
with nonparametric statistics. For the two-group independent samples design, they suggest that
one obtain the Mann-Whitney U statistic and then divide it by the product of the two sample sizes.
That is, p̂ = U/(na nb). This statistic estimates the probability that a score randomly drawn from
population a will be greater than a score randomly drawn from population b. If your stats package
does not compute U, but rather computes the Wilcoxon Rank Sum Statistic, you can get
U = W - ns(ns + 1)/2, where ns is the smaller of na and nb. If there are tied ranks, you may add to U
one half the number of ties.
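A sketch of these two formulas in Python (function names mine; the W value and sample sizes below are hypothetical, chosen only for illustration):

```python
def u_from_w(w, ns, nl):
    """Mann-Whitney U from the rank-sum W of the smaller group (ns <= nl)."""
    return w - ns * (ns + 1) / 2

def prob_superiority(u, na, nb):
    """Grissom & Kim's estimator of P(score from a > score from b)."""
    return u / (na * nb)

u = u_from_w(52, 8, 10)   # hypothetical W = 52 with n's of 8 and 10
print(u, round(prob_superiority(u, 8, 10), 3))  # 16.0 0.2
```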
For the two related samples design, associated with the Binomial Sign Test and the Wilcoxon
Signed Ranks Test, Grissom and Kim (2012) recommend PSdep, the probability that in a randomly
sampled pair of scores (one matched pair of scores) the score from Condition B (the condition which
most frequently has the higher score) will be greater than the score from Condition A (the
condition which most frequently has the lower score). When computing either the sign test or the
signed ranks test, one first finds the B-A difference scores. To obtain PSdep, one simply divides
the number of positive difference scores by the total number of matched pairs. That is,
PSdep = n+/N, where n+ is the number of positive difference scores. If there are ties, one can simply
discard the ties (reducing N) or add to the numerator one half the number of ties.
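A sketch of PSdep with hypothetical difference scores (the function and the data are mine, not Grissom and Kim's; ties are handled by discarding them):

```python
def ps_dep(differences):
    """Grissom & Kim's PS-dep from B - A difference scores; ties are discarded."""
    pos = sum(1 for d in differences if d > 0)
    ties = sum(1 for d in differences if d == 0)
    return pos / (len(differences) - ties)

diffs = [2, 5, 1, 3, -1, 4, 0, 2]  # hypothetical: 6 positive, 1 negative, 1 tie
print(round(ps_dep(diffs), 3))  # 0.857
```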
You can find SAS code for computing two nonparametric effect size estimates in the document
Robust Effect Size Estimates and Meta-Analytic Tests of Homogeneity (Hogarty & Kromrey,
SAS Users Group International Conference, Indianapolis, April, 2000).
I posted a query about nonparametric effect size estimators on EDSTAT-L and got a few
responses, which I provide here.
Leech (2002) suggested reporting nonparametric effect size indices, such as Vargha &
Delaney's A or Cliff's d. (Leech (2002). A Call for Greater Use of Nonparametric Statistics. Paper
presented at the Annual Meeting of the Mid-South Educational Research Association, Chattanooga,
TN, November 6-8.)
John Mark, Regions University.
----------------------------------------------------
See the chapter titled "Effect sizes for ordinal categorical variables" in Grissom and Kim
(2005). Effect sizes for research. Lawrence Erlbaum.
dale
If you find any good Internet resources on this topic, please do pass them on to me so I can
include them here. Thanks.
Reference
Grissom, R. J., & Kim, J. J. (2012). Effect sizes for research: Univariate and multivariate
applications (2nd ed.). New York, NY: Taylor & Francis.
Return to Dr. Wuensch's Statistics Lessons Page.
Contact Information for the Webmaster,
Dr. Karl L. Wuensch
This document most recently revised on the 6th of April, 2012.
Screening Data
Many of my students have gotten spoiled because I nearly always provide them
with clean data. By clean data I mean data which has already been screened to
remove out-of-range values, transformed to meet the assumptions of the analysis to be
done, and otherwise made ready to use. Sometimes these spoiled students have quite
a shock when they start working with dirty data sets, like those they are likely to
encounter with their thesis, dissertation, or other research projects. Cleaning up a data
file is like household cleaning jobs: it can be tedious, and few people really enjoy doing
it, but it is vitally important to do. If you don't clean your dishes, scrub the toilet, and
wash your clothes, you get sick. If you don't clean up your research data file, your data
analysis will produce sick (schizophrenic) results. You may be able to afford to pay
someone to clean your household. Paying someone to clean up your data file can be
more expensive (once you know how to do it you may be willing to do it for others, if
compensated well; I charge $200 an hour, the same price I charge for programming,
interpreting, and writing).
Missing Data
With some sorts of research it is not unusual to have cases for which there are
missing data for some but not all variables. There may or there may not be a pattern to
the missing data. The missing data may be classified as MCAR, MAR, or MNAR.
Missing Not at Random (MNAR)
Some cases are missing scores on our variable of interest, Y.
Suppose that Y is the salary of faculty members.
Missingness on Y is related to the actual value of Y.
Of course, we do not know that, since we do not know the values of Y for cases
with missing data.
For example, faculty with higher salaries may be more reluctant to provide their
income.
If we estimate mean faculty salary with the data we do have on hand it will be a
biased estimate.
There is some mechanism which is causing missingness, but we do not know
what it is.
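The bias described above is easy to demonstrate by simulation. This is a minimal Python sketch (hypothetical salary numbers, not real data): high earners are made much more likely to be missing, and the mean of the observed data underestimates the true mean.

```python
# Sketch: MNAR missingness biases the estimate of mean faculty salary.
import random

random.seed(1)
salaries = [random.gauss(60000, 10000) for _ in range(10000)]
true_mean = sum(salaries) / len(salaries)

# MNAR: the probability of being missing depends on the value of Y itself;
# here, 80% of faculty earning over 70,000 decline to report their salary.
observed = [y for y in salaries if not (y > 70000 and random.random() < 0.8)]
obs_mean = sum(observed) / len(observed)

print(round(true_mean), round(obs_mean))   # observed mean is biased downward
```

Because the mechanism depends on the unobserved values themselves, no amount of analysis of the observed data alone can remove this bias.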
Missing At Random (MAR)
Missingness on Y is not related to the true value of Y itself or is related to Y only
through its relationship with another variable or set of variables, and we have
scores on that other variable or variables for all cases.
The term multivariate statistics is appropriately used to include all statistics where
there are more than two variables simultaneously analyzed. You are already familiar with
bivariate statistics such as the Pearson product moment correlation coefficient and the
independent groups t-test. A one-way ANOVA with 3 or more treatment groups might also be
considered a bivariate design, since there are two variables: one independent variable and one
dependent variable. Statistically, one could consider the one-way ANOVA as either a bivariate
curvilinear regression or as a multiple regression with the K level categorical independent
variable dummy coded into K-1 dichotomous variables.
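That equivalence can be sketched with contrived data (a minimal Python illustration, not part of the original lesson): dummy code a K = 3 group variable into K - 1 = 2 dichotomous predictors, regress Y on them, and the regression R-squared equals the ANOVA eta-squared.

```python
# Sketch: one-way ANOVA as multiple regression with dummy-coded groups.
import numpy as np

y = np.array([1., 2, 3, 4, 5, 6, 7, 8, 9])
group = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])    # K = 3 groups of 3

# Dummy code the factor into K - 1 = 2 dichotomies (group 2 is the reference).
X = np.column_stack([np.ones_like(y),
                     (group == 0).astype(float),
                     (group == 1).astype(float)])

b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
r2 = 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()

# eta-squared from the ANOVA decomposition: SS_between / SS_total
ss_between = sum(3 * (y[group == g].mean() - y.mean()) ** 2 for g in (0, 1, 2))
eta2 = ss_between / ((y - y.mean()) ** 2).sum()
print(round(r2, 4), round(eta2, 4))   # the two are identical
```

The dummy variables span exactly the group-membership information, so the regression reproduces the group means and the two proportions of variance coincide.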
Independent vs. Dependent Variables
We shall generally continue to make use of the terms independent variable and
dependent variable, but shall find the distinction between the two somewhat blurred in
multivariate designs, especially those observational rather than experimental in nature.
Classically, the independent variable is that which is manipulated by the researcher. With such
control, accompanied by control of extraneous variables through means such as random
assignment of subjects to the conditions, one may interpret the correlation between the
dependent variable and the independent variable as resulting from a cause-effect relationship
from independent (cause) to dependent (effect) variable. Whether the data were collected by
experimental or observational means is NOT a consideration in the choice of an analytic tool.
Data from an experimental design can be analyzed with either an ANOVA or a regression
analysis (the former being a special case of the latter) and the results interpreted as
representing a cause-effect relationship regardless of which statistic was employed. Likewise,
observational data may be analyzed with either an ANOVA or a regression analysis, and the
results cannot be unambiguously interpreted with respect to causal relationship in either case.
We may sometimes find it more reasonable to refer to independent variables as
predictors, and dependent variables as response-, outcome-, or criterion-variables.
For example, we may use SAT scores and high school GPA as predictor variables when
predicting college GPA, even though we wouldn't want to say that SAT causes college GPA. In
general, the independent variable is that which one considers the causal variable, the prior
variable (temporally prior or just theoretically prior), or the variable on which one has data from
which to make predictions.
Descriptive vs. Inferential Statistics
While psychologists generally think of multivariate statistics in terms of making
inferences from a sample to the population from which that sample was randomly or
representatively drawn, sometimes it may be more reasonable to consider the data that one
has as the entire population of interest. In this case, one may employ multivariate descriptive
statistics (for example, a multiple regression to see how well a linear model fits the data) without
worrying about any of the assumptions (such as homoscedasticity and normality of conditionals
or residuals) associated with inferential statistics. That is, multivariate statistics, such as R²,
can be used as descriptive statistics. In any case, psychologists rarely ever randomly sample.
The third
was the MA (hypomania) scale, Scale 9, on which high scores are associated with overactivity,
flight of ideas, low frustration tolerance, narcissism, irritability, restlessness, hostility, and
difficulty with controlling impulses. The fourth MMPI variable was Scale K, which is a validity
scale on which high scores indicate that the subject is clinically defensive, attempting to
present himself in a favorable light, and low scores indicate that the subject is unusually frank.
The second set of variables was a pair of homonegativity variables. One was the IAH (Index of
Attitudes Towards Homosexuals), designed to measure affective components of homophobia.
The second was the SBS, (Self-Report of Behavior Scale), designed to measure past
aggressive behavior towards homosexuals, an instrument specifically developed for this study.
With luck, we can interpret the weights (or, even better, the loadings, the correlations
between each canonical variable and the variables in its set) so that each of our canonical
variates represents some underlying dimension (that is causing the variance in the observed
variables of its set). We may also think of a canonical variate as a superordinate variable,
made up of the more molecular variables in its set. After constructing the first pair of canonical
variates we attempt to construct a second pair that will explain as much as possible of the
(residual) variance in the observed variables, variance not explained by the first pair of
canonical variates. Thus, each canonical variate of the Xs is orthogonal to (independent of)
each of the other canonical variates of the Xs and each canonical variate of the Ys is
orthogonal to each of the other canonical variates of the Ys. Construction of canonical
variates continues until you can no longer extract a pair of canonical variates that accounts for a
significant proportion of the variance. The maximum number of pairs possible is the smaller of
the number of X variables or number of Y variables.
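The computation behind those canonical correlations can be sketched in Python (simulated data, not Patel's; this is not part of the original lesson): the squared canonical correlations are the eigenvalues of Ryy⁻¹RyxRxx⁻¹Rxy, and the number of pairs is the smaller of the two set sizes.

```python
# Sketch: canonical correlations from the blocks of the correlation matrix.
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 4))                         # e.g., four personality scales
Y = np.column_stack([X[:, 0] + rng.normal(size=n),  # one Y related to the Xs
                     rng.normal(size=n)])           # one Y unrelated to them

R = np.corrcoef(np.column_stack([X, Y]), rowvar=False)
Rxx, Rxy = R[:4, :4], R[:4, 4:]
Ryx, Ryy = R[4:, :4], R[4:, 4:]

# Squared canonical correlations: eigenvalues of Ryy^-1 Ryx Rxx^-1 Rxy
M = np.linalg.solve(Ryy, Ryx) @ np.linalg.solve(Rxx, Rxy)
canon_r = np.sqrt(np.clip(np.sort(np.linalg.eigvals(M).real)[::-1], 0, 1))
print(np.round(canon_r, 2))   # min(4, 2) = 2 canonical correlations, largest first
```

With these simulated data the first canonical correlation is substantial (the first Y was built from the first X) and the second hovers near zero, just as the residual-variance logic in the text predicts.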
In the Patel et al. study both of the canonical correlations were significant. The first
canonical correlation indicated that high scores on the SBS and the IAH were associated with
stereotypical masculinity (low Scale 5), frankness (low Scale K), impulsivity (high Scale 9), and
general social maladjustment and hostility (high Scale 4). The second canonical correlation
indicated that having a low IAH but high SBS (not being homophobic but nevertheless
aggressing against gays) was associated with being high on Scales 5 (not being stereotypically
masculine) and 9 (impulsivity). The second canonical variate of the homonegativity variables
seems to reflect a general (not directed towards homosexuals) aggressiveness.
PRINCIPAL COMPONENTS AND FACTOR ANALYSIS
Here we start out with one set of variables. The variables are generally correlated with
one another. We wish to reduce the (large) number of variables to a smaller number of
components or factors (I'll explain the difference between components and factors when we
study this in detail) that capture most of the variance in the observed variables. Each factor (or
component) is estimated as being a linear (weighted) combination of the observed variables.
We could extract as many factors as there are variables, but generally most of them would
contribute little, so we try to get a few factors that capture most of the variance. Our initial
extraction generally includes the restriction that the factors be orthogonal, independent of one
another.
Consider the analysis reported by Chia, Wuensch, Childers, Chuang, Cheng, Cesar-
Romero, & Nava in the Journal of Social Behavior and Personality, 1994, 9, 249-258. College
students in Mexico, Taiwan, and the US completed a 45 item Cultural Values Survey. A
principal components analysis produced seven components (each a linear combination of the
45 items) which explained in the aggregate 51% of the variance in the 45 items. We could
have explained 100% of the variance with 45 components, but the purpose of the PCA is to
explain much of the variance with relatively few components. Imagine a plot in seven
dimensional space with seven perpendicular (orthogonal) axes. Each axis represents one
component. For each variable I plot a point that represents its loading (correlation) with each
component. With luck I'll have seven clusters of dots in this hyperspace (one for each
component). I may be able to improve my solution by rotating the axes so that each one more
nearly passes through one of the clusters. I may do this by an orthogonal rotation (keeping
the axes perpendicular to one another) or by an oblique rotation. In the latter case I allow the
axes to vary from perpendicular, and as a result, the components obtained are no longer
independent of one another. This may be quite reasonable if I believe the underlying
dimensions (that correspond to the extracted components) are correlated with one another.
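The extraction step can be sketched in Python (a toy six-item example, not the 45-item Cultural Values Survey; not part of the original lesson): components are eigenvectors of the correlation matrix, a loading is the correlation between an item and a component, and each eigenvalue over the number of items gives the proportion of variance that component explains.

```python
# Sketch: principal components from the correlation matrix of six items
# built from two latent dimensions.
import numpy as np

rng = np.random.default_rng(42)
n = 300
f1, f2 = rng.normal(size=(2, n))                      # two latent dimensions
items = np.column_stack([f1 + .5 * rng.normal(size=n) for _ in range(3)] +
                        [f2 + .5 * rng.normal(size=n) for _ in range(3)])

R = np.corrcoef(items, rowvar=False)
eigval, eigvec = np.linalg.eigh(R)                    # eigh returns ascending order
order = np.argsort(eigval)[::-1]
eigval, eigvec = eigval[order], eigvec[:, order]

loadings = eigvec * np.sqrt(eigval)                   # item-component correlations
prop_var = eigval / eigval.sum()                      # proportion of variance each
print(np.round(prop_var, 2))                          # first two dominate
```

We could keep all six components and explain 100% of the variance, but here the first two capture most of it, which is exactly the data-reduction goal described above.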
With luck (or after having tried many different extractions/rotations), I'll come up with a
set of loadings that can be interpreted sensibly (that may mean finding what I expected to find).
From consideration of which items loaded well on which components, I named the components
Family Solidarity (respect for the family), Executive Male (men make decisions, women are
homemakers), Conscience (important for family to conform to social and moral standards),
Equality of the Sexes (minimizing sexual stereotyping), Temporal Farsightedness (interest in
the future and the past), Independence (desire for material possessions and freedom), and
Spousal Employment (each spouse should make decisions about his/her own job). Now, using
weighting coefficients obtained with the analysis, I computed for each subject a score that
estimated how much of each of the seven dimensions e had. These component scores were
then used as dependent variables in 3 x 2 x 2, Culture x Sex x Age (under 20 vs. over 20)
ANOVAs. US students (especially the women) stood out as being sexually egalitarian, wanting
independence, and, among the younger students, placing little importance on family solidarity.
The Taiwanese students were distinguished by scoring very high on the temporal
farsightedness component but low on the conscience component. Among Taiwanese students
the men were more sexually egalitarian than the women and the women more concerned with
independence than were the men. The Mexican students were like the Taiwanese in being
concerned with family solidarity but not with sexual egalitarianism and independence, but like
the US students in attaching more importance to conscience and less to temporal
farsightedness. Among the Mexican students the men attached more importance to
independence than did the women.
Factor analysis also plays a prominent role in test construction. For example, I factor
analyzed subjects' scores on the 21 items in Patel's SBS discussed earlier. Although the
instrument was designed to measure a single dimension, my analysis indicated that three
dimensions were being measured. The first factor, on which 13 of the items loaded well,
seemed to reflect avoidance behaviors (such as moving away from a gay, staring to
communicate disapproval of proximity, and warning gays to keep away). The second factor (six
items) reflected aggression from a distance (writing anti-gay graffiti, damaging a gay's property,
making harassing phone calls). The third factor (two items) reflected up-close aggression
(physical fighting). Despite this evidence of three factors, item analysis indicated that the
instrument performed well as a measure of a single dimension. Item-total correlations were
good for all but two items. Cronbach's alpha was .91, a value which could not be increased by
deleting from the scale any of the items. Cronbach's alpha is considered a measure of the
reliability or internal consistency of an instrument. It can be thought of as the mean of all
possible split-half reliability coefficients (correlations between scores on half of the items vs.
the other half of the items, with the items randomly split into halves) with the Spearman-Brown
correction (a correction for the reduction in the correlation due to having only half as many
items contributing to each score used in the split-half reliability correlation coefficient; reliability
tends to be higher with more items, ceteris paribus). Please read the document Cronbach's
Alpha and Maximized Lambda4. Follow the instructions there to conduct an item analysis with
SAS and with SPSS. Bring your output to class for discussion.
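If you want to see the arithmetic behind alpha laid bare, here is a minimal Python sketch (simulated items, not the SBS data; the lesson's own item analyses are done in SAS and SPSS) using the variance form of the coefficient: alpha = k/(k-1) × (1 - sum of item variances / variance of the total score).

```python
# Sketch: Cronbach's alpha from simulated item scores.
import numpy as np

rng = np.random.default_rng(7)
n, k = 200, 10                                # 200 subjects, 10 items
true_score = rng.normal(size=n)
items = true_score[:, None] + rng.normal(size=(n, k))   # item = true + error

item_vars = items.var(axis=0, ddof=1)
total_var = items.sum(axis=1).var(ddof=1)     # variance of the total score
alpha = k / (k - 1) * (1 - item_vars.sum() / total_var)
print(round(alpha, 2))
```

With ten items whose reliable variance equals their error variance, alpha lands near .90, and (ceteris paribus) adding more parallel items would push it higher, as the Spearman-Brown logic in the text implies.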
MULTIPLE REGRESSION
In a standard multiple regression we have one continuous Y variable and two or more
continuous X variables. Actually, the X variables may include dichotomous variables and/or
categorical variables that have been dummy coded into dichotomous variables. The goal is to
construct a linear model that minimizes error in predicting Y. That is, we wish to create a linear
combination of the X variables that is maximally correlated with the Y variable. We obtain
standardized regression coefficients (β weights, predicted Z_Y = β1·Z1 + β2·Z2 + ... + βp·Zp) that
represent how large an effect each X has on Y above and beyond the effect of the other Xs in
the model. We may use some a priori hierarchical structure to build the model (enter first X1,
then X2, then X3, etc., each time seeing how much adding the new X improves the model, or,
start with all Xs, then first delete X1, then delete X2, etc., each time seeing how much deletion
of an X affects the model). We may just use a statistical algorithm (one of several sorts of
stepwise selection) to build what we hope is the best model using some subset of the total
number of X variables available.
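Hierarchical entry can be sketched with contrived data (a minimal Python illustration, not part of the original lesson; the variable roles are hypothetical): fit the model at each step and watch the increment in R-squared as a new predictor is added.

```python
# Sketch: hierarchical entry of predictors, tracking the R-squared increment.
import numpy as np

def r_squared(X, y):
    """R-squared from an OLS fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ b
    return 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()

rng = np.random.default_rng(3)
n = 400
x1 = rng.normal(size=n)                     # e.g., high school grades
x2 = .6 * x1 + .8 * rng.normal(size=n)      # e.g., SATV, correlated with x1
y = x1 + .5 * x2 + rng.normal(size=n)       # criterion, e.g., college GPA

r2_step1 = r_squared(x1[:, None], y)                      # enter x1 first
r2_step2 = r_squared(np.column_stack([x1, x2]), y)        # then add x2
print(round(r2_step1, 3), round(r2_step2 - r2_step1, 3))  # increment for x2
```

Because x2 is correlated with x1, its increment at step 2 is much smaller than its zero-order relationship with y, which is exactly the "above and beyond" logic described above.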
For example, I may wish to predict college GPA from high school grades, SATV, SATQ,
score on a "why I want to go to college" essay, and quantified results of an interview with an
admissions officer. Since some of these measures are less expensive than others, I may wish
to give them priority for entry into the model. I might also give more theoretically important
variables priority. I might also include sex and race as predictors. I can also enter interactions
between variables as predictors, for example, SATM x SEX, which would be literally
represented by an X that equals the subjects SATM score times es sex code (typically 0 vs. 1
or 1 vs. 2). I may fit nonlinear models by entering transformed variables such as LOG(SATM)
or SAT². We shall explore lots of such fun stuff later.
As an example of a multiple regression analysis, consider the research reported by
McCammon, Golden, and Wuensch in the Journal of Research in Science Teaching, 1988, 25,
501-510. Subjects were students in freshman and sophomore level Physics courses (only
those courses that were designed for science majors, no general education "football physics"
courses). The mission was to develop a model to predict performance in the course. The
predictor variables were CT (the Watson-Glaser Critical Thinking Appraisal), PMA (Thurstone's
Primary Mental Abilities Test), ARI (the College Entrance Exam Board's Arithmetic Skills Test),
ALG (the College Entrance Exam Board's Elementary Algebra Skills Test), and ANX (the
Mathematics Anxiety Rating Scale). The criterion variable was subjects' scores on course
examinations. All of the predictor variables were significantly correlated with one another and
with the criterion variable. A simultaneous multiple regression yielded a multiple R of .40
(which is more impressive if you consider that the data were collected across several sections
of different courses with different instructors). Only ALG and CT had significant semipartial
correlations (indicating that they explained variance in the criterion that was not explained by
any of the other predictors). Both forward and backwards selection analyses produced a
model containing only ALG and CT as predictors. At Susan McCammon's insistence, I also
separately analyzed the data from female and male students. Much to my surprise I found a
remarkable sex difference. Among female students every one of the predictors was
significantly related to the criterion; among male students none of the predictors was. There
were only small differences between the sexes on variance in the predictors or the criterion, so
it was not a case of there not being sufficient variability among the men to support covariance
between their grades and their scores on the predictor variables. A posteriori searching of the
literature revealed that Anastasi (Psychological Testing, 1982) had noted a relatively consistent
finding of sex differences in the predictability of academic grades, possibly due to women being
more conforming and more accepting of academic standards (better students), so that women
put maximal effort into their studies, whether or not they like the course, and accordingly they
work up to their potential. Men, on the other hand, may be more fickle, putting forth maximum
effort only if they like the course, thus making it difficult to predict their performance solely from
measures of ability.
STRUCTURAL EQUATION MODELING (SEM)
This is a special form of hierarchical multiple regression analysis in which the researcher
specifies a particular causal model in which each variable affects one or more of the other
variables both directly and through its effects upon intervening variables. The less complex
models use only unidirectional paths (if X1 has an effect on X2, then X2 cannot have an effect
on X1) and include only measured variables. Such an analysis is referred to as a path
analysis. Patel's data, discussed earlier, were originally analyzed (in her thesis) with a path
analysis. The model was that the MMPI scales were noncausally correlated with one another
but had direct causal effects on both IAH and SBS, with IAH having a direct causal effect on
SBS. The path analysis was not well received by
reviewers the first journal to which we sent the
manuscript, so we reanalyzed the data with the
atheoretical canonical correlation/regression analysis
presented earlier and submitted it elsewhere.
Reviewers of that revised manuscript asked that we
supplement the canonical correlation/regression
analysis with a hierarchical multiple regression
analysis (essentially a path analysis).
In a path analysis one obtains path
coefficients, measuring the strength of each path
(each causal or noncausal link between one variable and another) and one assesses how well
the model fits the data. The arrows from e represent error variance (the effect of variables not
included in the model). One can compare two different models and determine which one better
fits the data. Our analysis indicated that the only significant paths were from MF to IAH (.40)
and from MA (.25) and IAH (.4) to SBS.
SEM can include latent variables (factors), constructs that are not directly measured
but rather are inferred from measured variables (indicators). Confirmatory factor analysis
can be considered a special case of SEM. In confirmatory factor analysis the focus is on
testing an a priori model of the factor structure of a group of measured variables. Tabachnick
and Fidell (5th edition) present an example (pages 732-749) in which the tested model
hypothesizes that intelligence in learning disabled children, as estimated by the WISC, can be
represented by two factors (possibly correlated with one another) with a particular simple
structure (relationship between the indicator variables and the factors).
The relationships between latent variables are referred to as the structural part of a
model (as opposed to the measurement part, which is the relationship between latent variables
and measured variables). As an example of SEM including latent variables, consider the
research by Greenwald and Gillmore (Journal of Educational Psychology, 1997, 89, 743-751)
on the validity of student ratings of instruction (check out my review of this research). Their
analysis indicated that when students expect to get better grades in a class they work less on
that class and evaluate the course and the instructor more favorably. The indicators (measured
variables) for the Workload latent variable were questions about how much time the students
spent on the course and how challenging it was. Relative expected grade (comparing the
grade expected in the rated course with that the student usually got in other courses) was a
more important indicator of the Expected Grade latent variable than was absolute expected
grade. The Evaluation latent variable was indicated by questions about challenge, whether or
not the student would take this course with the same instructor if e had it to do all over again,
and assorted items about desirable characteristics of the instructor and course.
Greenwalds research suggests that instructors who have lenient grading policies will
get good evaluations but will not motivate their students to work hard enough to learn as much
as they do with instructors whose less lenient grading policies lead to more work but less
favorable evaluations.
I have avoided becoming very involved with SEM. Only twice have I decided that a path
analysis was an appropriate way to analyze the data from research in which I was involved, and
only once was the path analysis accepted as being appropriate for publication. Part of my
reluctance to embrace SEM stems from my not being comfortable with the notion that showing
good fit between an observed correlation matrix and one's theoretical model really confirms
that model. It is always quite possible that an untested model would fit the data as well or
better than the tested model.
[Figure: path diagram for the Greenwald & Gillmore model, with latent variables Workload,
Expected Grade, and Evaluation, and indicators Relative Expected Grade, Absolute Expected
Grade, Hours Worked per Credit Hour, Intellectual Challenge, Take Same Instructor Again?,
and Characteristics of Instructor & Course.]
I use PROC REG and PROC IML to do path analysis, which requires that I understand
fairly well the math underlying the relatively simple models I have tested with path analysis.
Were I to do more sophisticated analyses (those including latent variables and/or bidirectional
paths), I would need to employ software specially designed to do complex SEM. Information
about such software is available at: http://core.ecu.edu/psyc/wuenschk/StructuralSoftware.htm.
DISCRIMINANT FUNCTION ANALYSIS
This is essentially a multiple regression where the Y variable is a categorical variable.
You wish to develop a set of discriminant functions (weighted combinations of the predictors)
that will enable you to predict into which group (level of the categorical variable) a subject falls,
based on es scores on the X variables (several continuous variables, maybe with some
dichotomous and/or dummy coded variables). The total possible number of discriminant
functions is one less than the number of groups, or the number of predictor variables,
whichever is less. Generally only a few of the functions will do a good job of discriminating
group membership. The second function, orthogonal to the first, uses variance not already
used by the first, the third uses the residuals from the first and second, etc. One may think of
the resulting functions as dimensions on which the groups differ, but one must remember that
the weights are chosen to maximize the discrimination among groups, not to make sense to
you. Standardized discriminant function coefficients (weights) and loadings (correlations
between discriminant functions and predictor variables) may be used to label the functions.
One might also determine how well a function separates each group from all the rest to help
label the function. It is possible to do hierarchical/stepwise analysis and factorial (more than
one grouping variable) analysis.
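The core extraction can be sketched in Python (simulated data, certainly not the IRS model; not part of the original lesson): the discriminant functions are eigenvectors of W⁻¹B, where W and B are the within- and between-group SSCP matrices, and at most min(K - 1, p) of the eigenvalues are nonzero.

```python
# Sketch: Fisher's discriminant functions as eigenvectors of W^-1 B.
import numpy as np

rng = np.random.default_rng(5)
means = np.array([[0., 0, 0], [2, 0, 0], [0, 2, 0]])   # K = 3 groups, p = 3
X = np.vstack([m + rng.normal(size=(50, 3)) for m in means])
g = np.repeat([0, 1, 2], 50)

grand = X.mean(axis=0)
W = np.zeros((3, 3))   # within-group SSCP
B = np.zeros((3, 3))   # between-group SSCP
for k in (0, 1, 2):
    d = X[g == k] - X[g == k].mean(axis=0)
    W += d.T @ d
    B += 50 * np.outer(X[g == k].mean(axis=0) - grand,
                       X[g == k].mean(axis=0) - grand)

eigval = np.sort(np.linalg.eigvals(np.linalg.solve(W, B)).real)[::-1]
print(np.round(eigval, 3))   # only min(K - 1, p) = 2 are meaningfully nonzero
```

With three groups the between-group matrix has rank two, so the third eigenvalue is zero: two functions exhaust the discriminable differences, just as the counting rule in the text says.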
As a rather nasty example, consider what the IRS does with the data they collect from
random audits of taxpayers. From each taxpayer they collect data on a number of predictor
variables (gross income, number of exemptions, amount of deductions, age, occupation, etc.)
and one classification variable, is the taxpayer a cheater (underpaid es taxes) or honest. From
these data they develop a discriminant function model to predict whether or not a return is likely
fraudulent. Next year their computers automatically test every return, and if yours fits the profile
of a cheater, you are called up for a discriminant analysis audit. Of course, the details of the
model are a closely guarded secret, since if a cheater knew the discriminant function e could
prepare his return with the maximum amount of cheating that would result in es (barely) not
being classified as a cheater.
As another example, consider the research done by Poulson, Braithwaite, Brondino, and
Wuensch (1997, Journal of Social Behavior and Personality, 12, 743-758). Subjects watched a
simulated trial in which the defendant was accused of murder and was pleading insanity. There
was so little doubt about his having killed the victim that none of the jurors voted for a verdict of
not guilty. Aside from not guilty, their verdict options were Guilty, NGRI (not guilty by reason of
insanity), and GBMI (guilty but mentally ill). Each mock juror filled out a questionnaire,
answering 21 questions (from which 8 predictor variables were constructed) about es attitudes
about crime control, the insanity defense, the death penalty, the attorneys, and es assessment
of the expert testimony, the defendant's mental status, and the possibility that the defendant
could be rehabilitated. To avoid problems associated with multicollinearity among the 8
predictor variables (they were very highly correlated with one another, and such multicollinearity
can cause problems in a multivariate analysis), the scores on the 8 predictor variables were
subjected to a principal components analysis, with the resulting orthogonal components used
as predictors in a discriminant analysis. The verdict choice (Guilty, NGRI, or GBMI) was the
criterion variable.
Both of the discriminant functions were significant. The first function discriminated
between jurors choosing a guilty verdict and subjects choosing a NGRI verdict. Believing that
the defendant was mentally ill, believing the defense's expert testimony more than the
prosecution's, being receptive to the insanity defense, opposing the death penalty, believing
that the defendant could be rehabilitated, and favoring lenient treatment were associated with
rendering a NGRI verdict. Conversely, the opposite orientation on these factors was associated
with rendering a guilty verdict. The second function separated those who rendered a GBMI
verdict from those choosing Guilty or NGRI. Distrusting the attorneys (especially the
prosecution attorney), thinking rehabilitation likely, opposing lenient treatment, not being
receptive to the insanity defense, and favoring the death penalty were associated with
rendering a GBMI verdict rather than a guilty or NGRI verdict.
MULTIVARIATE ANALYSIS OF VARIANCE, MANOVA
This is essentially a DFA turned around. You have two or more continuous Ys and one
or more categorical Xs. You may also throw in some continuous Xs (covariates, giving you a
MANCOVA, multivariate analysis of covariance). The most common application of MANOVA in
psychology is as a device to guard against inflation of familywise alpha when there are
multiple dependent variables. The logic is the same as that of the protected t-test, where an
omnibus ANOVA on your K-level categorical X must be significant before you make pairwise
comparisons among your K groups means on Y. You do a MANOVA on your multiple Ys. If it
is significant, you may go on and do univariate ANOVAs (one on each Y), if not, you stop. In a
factorial analysis, you may follow-up any effect which is significant in the MANOVA by doing
univariate analyses for each such effect.
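The omnibus test statistic for a one-way MANOVA can be sketched in Python (simulated data, not the jury-trial data; not part of the original lesson): Wilks' lambda is det(W)/det(B + W), with small values meaning the groups differ on the set of Ys jointly.

```python
# Sketch: Wilks' lambda for a one-way MANOVA with two dependent variables.
import numpy as np

rng = np.random.default_rng(11)
n_per = 40
means = np.array([[0., 0], [1, 0], [0, 1]])   # 3 groups, 2 dependent variables
Y = np.vstack([m + rng.normal(size=(n_per, 2)) for m in means])
g = np.repeat([0, 1, 2], n_per)

grand = Y.mean(axis=0)
W = np.zeros((2, 2))   # within-group (error) SSCP
B = np.zeros((2, 2))   # between-group (effect) SSCP
for k in (0, 1, 2):
    d = Y[g == k] - Y[g == k].mean(axis=0)
    W += d.T @ d
    B += n_per * np.outer(Y[g == k].mean(axis=0) - grand,
                          Y[g == k].mean(axis=0) - grand)

wilks = np.linalg.det(W) / np.linalg.det(B + W)
print(round(wilks, 3))   # well below 1 when group membership matters jointly
```

Only if this omnibus statistic is significant would one go on to the univariate ANOVAs, in the protected-test spirit described above.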
As an example, consider the MANOVA I did with data from a simulated jury trial with
Taiwanese subjects (see Wuensch, Chia, Castellow, Chuang, & Cheng, Journal of Cross-
Cultural Psychology, 1993, 24, 414-427). The same experiment had earlier been done with
American subjects. Xs consisted of whether or not the defendant was physically attractive, sex
of the defendant, type of alleged crime (swindle or burglary), culture of the defendant (American
or Chinese), and sex of subject (juror). Ys consisted of length of sentence given the
defendant, rated seriousness of the crime, and ratings on 12 attributes of the defendant. I did
two MANOVAs, one with length of sentence and rated seriousness of the crime as Ys, one with
ratings on the 12 attributes as Ys. On each I first inspected the MANOVA. For each effect
(main effect or interaction) that was significant on the MANOVA, I inspected the univariate
analyses to determine which Ys were significantly associated with that effect. For those that
were significant, I conducted follow-up analyses such as simple interaction analyses and simple
main effects analyses. A brief summary of the results follows: Female subjects gave longer
sentences for the crime of burglary, but only when the defendant was American; attractiveness
was associated with lenient sentencing for American burglars but with stringent sentencing for
American swindlers (perhaps subjects thought that physically attractive swindlers had used their
attractiveness in the commission of the crime and thus were especially deserving of
punishment); female jurors gave more lenient sentences to female defendants than to male
defendants; American defendants were rated more favorably (exciting, happy, intelligent,
sociable, strong) than were Chinese defendants; physically attractive defendants were rated
more favorably (attractive, calm, exciting, happy, intelligent, warm) than were physically
unattractive defendants; and the swindler was rated more favorably (attractive, calm, exciting,
independent, intelligent, sociable, warm) than the burglar.
In MANOVA the Ys are weighted to maximize the correlation between their linear
combination and the Xs. A different linear combination (canonical variate) is formed for each
effect (main effect or interaction; in fact, a different linear combination is formed for each
treatment df; thus, if an independent variable consists of four groups, three df, there are three
different linear combinations constructed to represent that effect, each orthogonal to the
others). Standardized discriminant function coefficients (weights for predicting X from the
Ys) and loadings (for each linear combination of Ys, the correlations between the linear
combination and the Ys themselves) may be used better to define the effects of the factors and
their interactions. One may also do a stepdown analysis where one enters the Ys in an a
priori order of importance (or based solely on statistical criteria, as in stepwise multiple
regression). At each step one evaluates the contribution of the newly added Y, above and
beyond that of the Ys already entered.
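Loadings, as defined above, are simply correlations between a linear combination of the Ys and the individual Ys themselves. Here is a small Python sketch of that computation (illustrative only, not output from any package used in these documents; the data, weights, and function name are made up, and NumPy is assumed to be available):

```python
import numpy as np

def loadings(Y, weights):
    """Correlations between a linear combination of the Ys (a canonical
    variate) and each Y itself -- the 'loadings' described above."""
    variate = Y @ weights
    return np.array([np.corrcoef(variate, Y[:, j])[0, 1]
                     for j in range(Y.shape[1])])

# Hypothetical data: 6 cases on 3 dependent variables, arbitrary weights.
rng = np.random.default_rng(0)
Y = rng.normal(size=(6, 3))
w = np.array([0.7, 0.2, -0.4])
r = loadings(Y, w)   # one loading per Y, each between -1 and +1
```

Note that if the weight vector puts all its weight on one Y, that Y's loading is exactly 1, since a variable correlates perfectly with itself.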
As an example of an analysis which uses more of the multivariate output than was used
with the example two paragraphs above, consider again the research done by Moore,
Wuensch, Hedges, and Castellow (1994, discussed earlier under the topic of log-linear
analysis). Recall that we manipulated the physical attractiveness and social desirability of the
litigants in a civil case involving sexual harassment. In each of the experiments in that study we
had subjects fill out a rating scale, describing the litigant (defendant or plaintiff) whose attributes
we had manipulated. This analysis was essentially a manipulation check, to verify that our
manipulations were effective. The rating scales were nine-point scales, for example,
Awkward Poised
1 2 3 4 5 6 7 8 9
There were 19 attributes measured for each litigant. The data from the 19 variables
were used as dependent variables in a three-way MANOVA (social desirability manipulation,
physical attractiveness manipulation, gender of subject). In the first experiment, in which the
physical attractiveness and social desirability of the defendant were manipulated, the MANOVA
produced significant effects for the social desirability manipulation and the physical
attractiveness manipulation, but no other significant effects. The canonical variate maximizing
the effect of the social desirability manipulation loaded most heavily (r > .45) on the ratings of
sociability (r = .68), intelligence (r = .66), warmth (r = .61), sensitivity (r = .50), and kindness (r =
.49). Univariate analyses indicated that compared to the socially undesirable defendant, the
socially desirable defendant was rated significantly more poised, modest, strong, interesting,
sociable, independent, warm, genuine, kind, exciting, sexually warm, secure, sensitive, calm,
intelligent, sophisticated, and happy. Clearly the social desirability manipulation was effective.
The canonical variate that maximized the effect of the physical attractiveness
manipulation loaded heavily only on the physical attractiveness ratings (r = .95), all the other
loadings being less than .35. The mean physical attractiveness ratings were 7.12 for the
physically attractive defendant and 2.25 for the physically unattractive defendant. Clearly the
physical attractiveness manipulation was effective. Univariate analyses indicated that this
manipulation had significant effects on several of the ratings variables. Compared to the
physically unattractive defendant, the physically attractive defendant was rated significantly
more poised, strong, interesting, sociable, physically attractive, warm, exciting, sexually warm,
secure, sophisticated, and happy.
In the second experiment, in which the physical attractiveness and social desirability of
the plaintiff were manipulated, similar results were obtained. The canonical variate maximizing
the effect of the social desirability manipulation loaded most heavily (r > .45) on the ratings of
intelligence (r = .73), poise (r = .68), sensitivity (r = .63), kindness (r = .62), genuineness (r =
.56), warmth (r = .54), and sociability (r = .53). Univariate analyses indicated that compared to
the socially undesirable plaintiff the socially desirable plaintiff was rated significantly more
favorably on all nineteen of the adjective scale ratings.
The canonical variate that maximized the effect of the physical attractiveness
manipulation loaded heavily only on the physical attractiveness ratings (r = .84), all the other
loadings being less than .40. The mean physical attractiveness ratings were 7.52 for the
physically attractive plaintiff and 3.16 for the physically unattractive plaintiff. Univariate
analyses indicated that this manipulation had significant effects on several of the ratings
variables. Compared to the physically unattractive plaintiff the physically attractive plaintiff was
rated significantly more poised, interesting, sociable, physically attractive, warm, exciting,
sexually warm, secure, sophisticated, and happy.
LOGISTIC REGRESSION
Logistic regression is used to predict a categorical (usually dichotomous) variable from a
set of predictor variables. With a categorical dependent variable, discriminant function analysis
is usually employed if all of the predictors are continuous and nicely distributed; logit analysis is
usually employed if all of the predictors are categorical; and logistic regression is often chosen
if the predictor variables are a mix of continuous and categorical variables and/or if they are not
nicely distributed (logistic regression makes no assumptions about the distributions of the
predictor variables). Logistic regression has been especially popular with medical research in
which the dependent variable is whether or not a patient has a disease.
For a logistic regression, the predicted dependent variable is the estimated probability
that a particular subject will be in one of the categories (for example, the probability that Suzie
Cue has the disease, given her set of scores on the predictor variables).
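That predicted probability comes from applying the logistic function to a weighted sum of the predictor scores. A minimal Python sketch of the prediction equation (the intercept and coefficients below are made up for illustration, not fitted values from any study discussed here):

```python
import math

def predicted_probability(intercept, coefs, x):
    """Logistic model: p = 1 / (1 + e^-(b0 + b1*x1 + ... + bk*xk)).
    Returns the estimated probability of membership in the target category."""
    logit = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical coefficients and scores for one subject:
p = predicted_probability(-1.0, [0.5, -0.3], [2.0, 1.0])   # about .43
```

When the logit (the weighted sum) is zero, the predicted probability is exactly .5; large positive logits push the probability toward 1, large negative logits toward 0.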
As an example of the use of logistic regression in psychological research, consider the
research done by Wuensch and Poteat and published in the Journal of Social Behavior and
Personality, 1998, 13, 139-150. College students (N = 315) were asked to pretend that they
were serving on a university research committee hearing a complaint against animal research
being conducted by a member of the university faculty. Five different research scenarios were
used: Testing cosmetics, basic psychological theory testing, agricultural (meat production)
research, veterinary research, and medical research. Participants were asked to decide
whether or not the research should be halted. An ethical inventory was used to measure
participants' idealism (persons who score high on idealism believe that ethical behavior will
always lead only to good consequences, never to bad consequences, and never to a mixture of
good and bad consequences) and relativism (persons who score high on relativism reject the
notion of universal moral principles, preferring personal and situational analysis of behavior).
Since the dependent variable was dichotomous (whether or not the respondent decided
to halt the research) and the predictors were a mixture of continuous and categorical variables
(idealism score, relativism score, participants' gender, and the scenario given), logistic
regression was employed. The scenario variable was represented by k - 1 dichotomous dummy
variables, each representing the contrast between the medical scenario and one of the other
scenarios. Idealism was negatively associated and relativism positively associated with support
for animal research. Women were much less accepting of animal research than were men.
Support for the theoretical and agricultural research projects was significantly less than that for
the medical research.
In a logistic regression, odds ratios are commonly employed to measure the strength of
the partial relationship between one predictor and the dependent variable (in the context of the
other predictor variables). It may be helpful to consider a simple univariate odds ratio first.
Among the male respondents, 68 approved continuing the research, 47 voted to stop it, yielding
odds of 68 / 47. That is, approval was 1.45 times more likely than nonapproval. Among female
respondents, the odds were 60 / 140. That is, approval was only .43 times as likely as was
nonapproval. Inverting these odds (odds less than one are difficult for some people to
comprehend), among female respondents nonapproval was 2.33 times as likely as approval.
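These odds computations are easy to verify; here is a quick Python sketch (not part of the original analysis, and the function name is mine):

```python
def odds(successes, failures):
    # odds = frequency of one outcome relative to the other outcome
    return successes / failures

# Counts from the animal-research study described above:
men_odds = odds(68, 47)      # approval vs. nonapproval among men, ~1.45
women_odds = odds(60, 140)   # approval vs. nonapproval among women, ~0.43
odds_ratio = men_odds / women_odds   # ~3.38
```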
The ratio of these odds, (68 / 47) / (60 / 140) = 3.38, indicates that approval was 3.38 times
more likely among male respondents than among female respondents.
The squared Euclidian distance between two cases is Σ(Xi - Yi)², the sum across
variables (from i = 1 to v) of the squared difference between the score on variable i for the one
case (Xi) and the score on variable i for the other case (Yi). At the next step SPSS recomputes
all the distances between entities (cases and clusters) and then groups together the two with
the smallest distance. When one or both of the entities is a cluster, SPSS computes the
averaged squared Euclidian distance between members of the one entity and members of the
other entity. This continues until all cases have been grouped into one giant cluster. It is up to
the researcher to decide when to stop this procedure and accept a solution with k clusters. K
can be any number from 1 to the number of cases.
SPSS produces both tables and graphics that help the analyst follow the process and
decide which solution to accept. I obtained 2, 3, and 4 cluster solutions. In the k = 2 solution
the one cluster consisted of all the adjunct faculty (excepting one) and the second cluster
consisted of everybody else. I compared the two clusters (using t tests) and found that,
compared to the regular faculty, the adjuncts had significantly lower salary, experience, course load, rank,
and number of publications.
In the k = 3 solution the group of regular faculty was split into two groups, with one
group consisting of senior faculty (those who have been in the profession long enough to get a
decent salary and lots of publications) and the other group consisting of junior faculty (and a
few older faculty who just never did the things that get one merit pay increases). I used plots
of means to show that the senior faculty had greater salary, experience, rank, and number of
publications than did the junior faculty.
In the k = 4 solution the group of senior faculty was split into two clusters. One cluster
consisted of the acting chair of the department (who had a salary and a number of publications
considerably higher than the others) and the other cluster consisting of the remaining senior
faculty (excepting those few who had been clustered with the junior faculty).
There are other ways of measuring the distance between clusters and other methods of
doing the clustering. For example, one can do divisive hierarchical clustering, in which one
starts out with all cases in one big cluster and then splits off cases into new clusters until every
case is a cluster all by itself.
Aziz and Zickar (2006: A cluster analysis investigation of workaholism as a syndrome,
Journal of Occupational Health Psychology, 11, 52-62) is a good example of the use of cluster
analysis with psychological data. Some have defined workaholism as being high in work
involvement, high in drive to work, and low in work enjoyment. Aziz and Zickar obtained
measures of work involvement, drive to work, and work enjoyment and conducted a cluster
analysis. One of the clusters in the three-cluster solution did look like workaholics: high in
work involvement and drive to work but low in work enjoyment. A second cluster consisted of
positively engaged workers (high on work involvement and work enjoyment) and a third
consisted of unengaged workers (low in involvement, drive, and enjoyment).
There are numerous other multivariate techniques and various modifications of those I
have briefly described here. I have, however, covered those you are most likely to encounter in
psychology. We are now ready to go into each of these in greater detail. The general Gestalt
you obtain from studying these techniques should enable you to learn other multivariate
techniques that you may encounter as you zoom through the hyperspace of multivariate
research.
Hyperlinks
Multivariate Effect Size Estimation supplemental chapter from Kline, R. B.
(2004). Beyond significance testing: Reforming data analysis methods in
behavioral research. Washington, DC: American Psychological Association.
Statistics Lessons
MANOVA, Familywise Error, and the Boogey Man
SAS Lessons
SPSS Lessons
Endnote
A high Scale 5 score indicates that the individual is more like members of the other gender
than are most people. A man with a high Scale 5 score lacks stereotypical masculine interests,
and a woman with a high Scale 5 score has interests that are stereotypically masculine. Low
Scale 5 scores indicate stereotypical masculinity in men and stereotypical femininity in women.
MMPI Scale scores are T-scores; that is, they have been standardized to mean 50, standard
deviation 10. The normative group was residents of Minnesota in the 1930s. The MMPI-2 was
normed on what should be a group more representative of US residents.
Copyright 2010 Karl L. Wuensch - All rights reserved.
Canonical Correlation
In a canonical correlation (multiple multiple correlation) one has two or more X
variables and two or more Y variables. The goal is to describe the relationships
between the two sets of variables. You find the canonical weights (coefficients) a1, a2,
a3, ..., ap to be applied to the p X variables and b1, b2, b3, ..., bm to be applied to the
m Y variables in such a way that the correlation between CVX1 and CVY1 is maximized.
CVX1 = a1X1 + a2X2 + ... + apXp. CVY1 = b1Y1 + b2Y2 + ... + bmYm. CVX1 and
CVY1 are the first canonical variates, and their correlation is the sample canonical
correlation coefficient for the first pair of canonical variates. The residuals are then
analyzed in the same fashion to find a second pair of canonical variates, CVX2 and
CVY2, whose weights are chosen to maximize the correlation between CVX2 and CVY2,
using only the variance remaining after the variance due to the first pair of canonical
variates has been removed from the original variables. This continues until a
"significance" cutoff is reached or the maximum number of pairs (which equals the
smaller of m and p) has been found.
You may think of the pairs of canonical variates as representing superordinate
constructs. For each pair this construct is estimated as a linear combination of the
variables, where the sole criterion for choosing one linear combination over another is
maximizing the correlation between the two canonical variates. The resulting constructs
may not be easily interpretable as representing an underlying dimension of interest.
Since each pair of canonical variates is calculated from the residuals of the pair(s)
extracted earlier, the resulting canonical variates are orthogonal. The underlying
dimensions in which you are interested may, however, be related to one another.
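One standard way to compute the sample canonical correlations is via QR decompositions of the centered data matrices followed by a singular value decomposition. A Python/NumPy sketch of that computation (an illustration of the arithmetic, not the output of any package mentioned in these documents; it assumes NumPy is available):

```python
import numpy as np

def canonical_correlations(X, Y):
    """Sample canonical correlations between two sets of variables,
    computed via QR + SVD on the centered data. Returns min(p, m)
    values, largest first -- one per pair of canonical variates."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(Xc)   # orthonormal basis for the X column space
    Qy, _ = np.linalg.qr(Yc)   # orthonormal basis for the Y column space
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)

# When the Ys are an exact linear transform of the Xs, both canonical
# correlations are 1 (made-up data):
X = np.array([[1., 0.], [0., 1.], [1., 1.], [2., 1.], [0., 3.]])
Y = X @ np.array([[1., 2.], [3., 4.]])
s = canonical_correlations(X, Y)
```

Note that the number of canonical correlations returned equals the smaller of p and m, matching the statement above about the maximum number of pairs.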
To learn about canonical correlation, we shall reproduce the analysis reported by
Patel, Long, McCammon, & Wuensch (Journal of Interpersonal Violence, 1995, 10, 354-
366). We had two sets of data on a group of male college students. The one set
was personality variables from the MMPI. One of these was the PD (psychopathically
deviant) scale, Scale 4, on which high scores are associated with general social
maladjustment, rebelliousness, antisocial behavior, criminal behavior, impulsive acting
out, insensitivity, hostility, and difficulties with interpersonal relationships (family, school,
and authority figures). The second was the MF (masculinity/femininity) scale, Scale 5,
on which low scores are associated with traditional masculinity - being easy-going,
cheerful, practical, coarse, adventurous, lacking insight into own motives, preferring
action to thought, overemphasizing strength and physical prowess, having a narrow
range of interests, and harboring doubts about one's own masculinity and identity. The
third was the MA (hypomania) scale, Scale 9, on which high scores are associated with
overactivity, emotional lability, flight of ideas, being easily bored, having low frustration
tolerance, narcissism, difficulty inhibiting impulses, thrill-seeking, irritability,
restlessness, and aggressiveness. The fourth MMPI variable was Scale K, which is a
validity scale on which high scores indicate that the subject is clinically defensive,
attempting to present himself in a favorable light, and low scores indicate that the subject is
being unusually frank and self-critical.
ClusterAnalysis-SPSS.docx
Cluster Analysis With SPSS
I have never had research data for which cluster analysis was a technique I
thought appropriate for analyzing the data, but just for fun I have played around with
cluster analysis. I created a data file where the cases were faculty in the Department of
Psychology at East Carolina University in the month of November, 2005. The variables
are:
Name -- Although faculty salaries are public information under North Carolina
state law, I thought it best to assign each case a fictitious name.
Salary -- annual salary in dollars, from the university report available in OneStop.
FTE -- full time equivalent work load for the faculty member.
Rank -- where 1 = adjunct, 2 = visiting, 3 = assistant, 4 = associate, 5 = professor
Articles -- number of published scholarly articles, excluding things like comments
in newsletters, abstracts in proceedings, and the like. The primary source for
these data was the faculty member's online vita. When that was not available,
the data in the University's Academic Publications Database was used, after
eliminating duplicate entries.
Experience -- number of years working as a full time faculty member in a
Department of Psychology. If the faculty member did not have employment
information on his or her web page, then other online sources were used; for
example, from the publications database I could estimate the year of first
employment as being the year of first publication.
In the data file but not used in the cluster analysis are also
ArticlesAPD -- number of published articles as listed in the university's Academic
Publications Database. There were a lot of errors in this database, but I tried to
correct them (for example, by adjusting for duplicate entries).
Sex -- I inferred biological sex from physical appearance.
I have saved, annotated, and placed online the statistical output from the
analysis. You may wish to look at it while reading through this document.
Conducting the Analysis
Start by bringing ClusterAnonFaculty.sav into SPSS. Now click Analyze,
Classify, Hierarchical Cluster. Identify Name as the variable by which to label cases
and Salary, FTE, Rank, Articles, and Experience as the variables. Indicate that you
want to cluster cases rather than variables and want to display both statistics and plots.
Click Statistics and indicate that you want to see an Agglomeration schedule with
2, 3, 4, and 5 cluster solutions. Click Continue.
Click Plots and indicate that you want a Dendrogram and a vertical Icicle plot with
2, 3, and 4 cluster solutions. Click Continue.
Click Method and indicate that you want to use the Between-groups linkage
method of clustering, squared Euclidian distances, and variables standardized to z
scores (so each variable contributes equally). Click Continue.
Click Save and indicate that you want to save, for each case, the cluster to which
the case is assigned for 2, 3, and 4 cluster solutions. Click Continue, OK.
SPSS starts by standardizing all of the variables to mean 0, variance 1. This
results in all the variables being on the same scale and being equally weighted.
In the first step SPSS computes for each pair of cases the squared Euclidian
distance between the cases. This is quite simply Σ(Xi - Yi)², the sum across
variables (from i = 1 to v) of the squared difference between the score on variable i
for the one case (Xi) and the score on variable i for the other case (Yi). The two
cases which are separated by the smallest Euclidian distance are identified and then
classified together into the first cluster. At this point there is one cluster with two
cases in it.
Next SPSS re-computes the squared Euclidian distances between each entity
(case or cluster) and each other entity. When one or both of the compared entities
is a cluster, SPSS computes the averaged squared Euclidian distance between
members of the one entity and members of the other entity. The two entities with
the smallest squared Euclidian distance are classified together. SPSS then re-
computes the squared Euclidian distances between each entity and each other
entity and the two with the smallest squared Euclidian distance are classified
together. This continues until all of the cases have been clustered into one big
cluster.
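The agglomeration procedure just described can be sketched in Python. This is an illustration of the logic (average linkage on squared Euclidian distances), not the SPSS implementation; it assumes the variables have already been standardized, and the five cases below are made up:

```python
def sq_euclid(x, y):
    # squared Euclidian distance: sum over variables of squared differences
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def average_linkage(cases, k):
    """Repeatedly merge the two closest entities (cases or clusters) until
    k clusters remain. The distance between two entities is the average
    squared Euclidian distance between their members."""
    clusters = [[i] for i in range(len(cases))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = sum(sq_euclid(cases[i], cases[j])
                        for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]   # merge the closest pair
        del clusters[b]
    return clusters

# Five made-up cases on two standardized variables:
cases = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
three = average_linkage(cases, 3)   # [[0, 1], [2, 3], [4]]
two = average_linkage(cases, 2)     # [[0, 1, 2, 3], [4]]
```

Running the loop all the way to k = 1 reproduces the "one giant cluster" endpoint described above; stopping earlier gives the 2, 3, or 4 cluster solutions.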
Look at the Agglomeration Schedule. On the first step SPSS clustered case 32
with 33. The squared Euclidian distance between these two cases is 0.000. At
stages 2-4 SPSS creates three more clusters, each containing two cases. At stage
5 SPSS adds case 39 to the cluster that already contains cases 37 and 38. By the
43rd stage all cases have been clustered into one entity.
Look at the Vertical Icicle. For the two cluster solution you can see that one
cluster consists of ten cases (Boris through Willy, followed by a column with no Xs).
These were our adjunct (part-time) faculty (excepting one) and the second cluster
consists of everybody else.
For the three cluster solution you can see the cluster of adjunct faculty and the
others split into two. Deanna through Mickey were our junior faculty and Lawrence
through Rosalyn our senior faculty.
For the four cluster solution you can see that one case (Lawrence) forms a
cluster of his own.
Look at the dendrogram. It displays essentially the same information that is found
in the agglomeration schedule but in graphic form.
Look back at the data sheet. You will find three new variables. CLU2_1 is
cluster membership for the two cluster solution, CLU3_1 for the three cluster
solution, and CLU4_1 for the four cluster solution. Remove the variable labels and
then label the values for CLU2_1
and CLU3_1.
Let us see how the two clusters in the two cluster solution differ from one another
on the variables that were used to cluster them.
The output shows that the adjunct cluster has lower mean salary, FTE, rank,
number of published articles, and years of experience.
Now compare the three clusters from the three cluster solution. Use One-Way
ANOVA and ask for plots of group means.
The plots of means show nicely the differences between the clusters.
Predicting Salary from FTE, Rank, Publications, and Experience
Now, just for fun, let us try a little multiple regression. We want to see how
faculty salaries are related to FTEs, rank, number of published articles, and years of
experience.
Ask for part and partial correlations and for Casewise diagnostics for All cases.
The output shows that each of our predictors has a medium to large positive
zero-order correlation with salary, but only FTE and rank have significant partial
effects. In the Casewise Diagnostics table you are given for each case the
standardized residual (I think that any whose absolute value exceeds 1 is worthy of
inspection by the persons who set faculty salaries), the actual salary, the salary
predicted by the model, and the difference, in $, between actual salary and predicted
salary.
If you split the file by sex and repeat the regression analysis you will see some
interesting differences between the model for women and the model for men. The
partial effect of rank is much greater for women than for men. For men the partial
effect of articles is positive and significant, but for women it is negative. That is,
among our female faculty, the partial effect of publication is to lower one's salary.
Clustering Variables
Cluster analysis can be used to cluster variables instead of cases. In this case
the goal is similar to that in factor analysis: to get groups of variables that are
similar to one another. Again, I have yet to use this technique in my research, but it
does seem interesting.
We shall use the same data earlier used for principal components and factor
analysis, FactBeer.sav. Start out by clicking Analyze, Classify, Hierarchical Cluster.
Scoot into the variables box the same seven variables we used in the components
and factors analysis. Under Cluster select Variables.
Click Statistics and then Continue.
Click Plots and then Continue.
Click Method and then Continue, OK.
I have saved, annotated, and placed online the statistical output from the
analysis. You may wish to look at it while reading through the remainder of this
document.
Look at the proximity matrix. It is simply the intercorrelation matrix. We start out
with each variable being an element of its own. Our first step is to combine the two
elements that are closest, that is, the two variables that are most highly correlated.
As you can see from the proximity matrix, that is color and aroma (r = .909). Now we
have six elements: one cluster and five variables not yet clustered.
In Stage 2, we cluster the two closest of the six remaining elements: size
and alcohol (r = .904). Look at the agglomeration schedule. As you can see, the
first stage involved clustering variables 5 and 6 (color and aroma), and the second
stage involved clustering variables 2 and 3 (size and alcohol).
In Stage 3, variable 7 (taste) is added to the cluster that already contains
variables 5 (color) and 6 (aroma).
In Stage 4, variable 1 (cost) is added to the cluster that already contains
variables 2 (size) and 3 (alcohol). We now have three elements: two clusters, each
with three variables, and one variable not yet clustered.
In Stage 5, the two clusters are combined, but note that they are not very similar,
the similarity coefficient being only .038. At this point we have two elements, the
reputation variable all alone and the six remaining variables clumped into one
cluster.
The remaining plots show pretty much the same as what I have illustrated with
the proximity matrix and agglomeration schedule, but in what might be a more easily
digested format.
I prefer the three cluster solution here. Do notice that reputation is not clustered
until the very last step, as it was negatively correlated with the remaining variables.
Recall that in the components and factor analyses it did load (negatively) on the two
factors (quality and cheap drunk).
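The idea of treating the intercorrelation matrix as a proximity matrix can be sketched in Python (the data below are made up to stand in for the beer ratings, with color and aroma built to covary; the function names are mine):

```python
import math

def pearson_r(x, y):
    # ordinary Pearson correlation between two variables
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def proximity_matrix(variables):
    """Intercorrelations among the variables -- the 'proximity' used when
    clustering variables. Merge the pair with the largest r first."""
    names = list(variables)
    return {(a, b): pearson_r(variables[a], variables[b])
            for a in names for b in names if a < b}

# Made-up scores on three variables for five respondents:
data = {"color": [1, 2, 3, 4, 5],
        "aroma": [2, 3, 3, 5, 5],
        "cost":  [5, 4, 4, 2, 1]}
prox = proximity_matrix(data)
closest_pair = max(prox, key=lambda pair: prox[pair])  # ('aroma', 'color')
```

Just as in the beer example, the first merge takes the two most highly correlated variables; a variable that correlates negatively with everything else (like reputation above) would not be merged until the very end.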
Karl L. Wuensch
East Carolina University
Department of Psychology
Greenville, NC 27858-4353
United Snakes of America
June, 2011
More SPSS Lessons
More Lessons on Statistics
Curvi.docx
Curvilinear Bivariate Regression
You are now familiar with linear bivariate regression analysis. What do you do if
the relationship between X and Y is curvilinear? It may be possible to get a good analysis
with our usual techniques if we first straighten up the relationship with data
transformations.
You may have a theory or model that indicates the nature of the nonlinear effect.
For example, if you had data relating the physical intensity of some stimulus to the
psychologically perceived intensity of the stimulus, Fechner's law would suggest a
logarithmic function (Stevens would suggest a power function). To straighten out this log
function all you would need to do is take the log of the physical intensity scores and then
complete the regression analysis using transformed physical intensity scores to predict
psychological intensity scores. For another example, suppose you have monthly sales
data for each of 25 consecutive months of a new business. You remember having been
taught about exponential growth curves in a business or a biology class, so you do the
regression analysis for predicting the log of monthly sales from the number of months the
firm has been in business.
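For the exponential-growth example just described, taking the log of sales makes the relationship with months exactly linear, so an ordinary regression on the transformed scores recovers the growth parameters. A Python sketch with made-up sales figures:

```python
import math

# Hypothetical monthly sales following exponential growth: sales = 100 * 1.2^month
months = list(range(1, 26))
sales = [100 * 1.2 ** m for m in months]

# month vs. log(sales) is exactly linear, since
# log(100 * 1.2^m) = log(100) + m * log(1.2)
log_sales = [math.log(s) for s in sales]

# ordinary least-squares slope and intercept for log(sales) on month
n = len(months)
mx = sum(months) / n
my = sum(log_sales) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(months, log_sales))
         / sum((x - mx) ** 2 for x in months))
intercept = my - slope * mx
# slope recovers log(1.2); intercept recovers log(100)
```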
In other cases you will have no such model, you simply discover (from the scatter
plot) that the relationship is curvilinear. Here are some suggestions for straightening up
the line, assuming that the relationship is monotonic.
A. If the curve for predicting Y from X is a negatively accelerated curve, a curve of
decreasing returns, one where the positive slope decreases as X increases, try
transforming X with the following: √X, LOG(X), 1/X. Prepare a scatter plot for each of
these and choose the one that best straightens the line (and best assists in meeting the
assumptions of any inferential statistics you are doing).
B. If the curve for predicting Y from X is a positively accelerated curve, one where
the positive slope increases as X increases, try: √Y, LOG(Y), 1/Y.
C. A polynomial model may also be fit: Ŷ = a + b1X + b2X² + b3X³. With a
quadratic, the slope for predicting Y from X changes direction once; with a cubic it
changes direction twice.
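The payoff from adding a polynomial term shows up as an increase in R². A Python sketch (assuming NumPy is available; the U-shaped data are made up to echo the ladybug example discussed below, and the function name is mine):

```python
import numpy as np

def r_squared(y, yhat):
    # proportion of variance in y explained by the predictions yhat
    ss_error = float(((y - yhat) ** 2).sum())
    ss_total = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_error / ss_total

# Made-up U-shaped data: y is an exact quadratic function of x
x = np.arange(10, dtype=float)
y = (x - 5.0) ** 2

linear = np.polyval(np.polyfit(x, y, 1), x)        # straight-line fit
quadratic = np.polyval(np.polyfit(x, y, 2), x)     # adds the squared term

r2_linear = r_squared(y, linear)        # poor fit to a U-shaped relation
r2_quadratic = r_squared(y, quadratic)  # essentially perfect here
```

The sequential logic of polynomial regression is just this comparison repeated: each added power is judged by how much it increases R² beyond the lower-order model.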
Please run the program Curvi.sas from my SAS Programs page. This provides an
example of how to do a polynomial regression with SAS. The data were obtained from
scatterplots in an article by N. H. Copp (Animal Behavior, 31, 424-430). Ladybugs tend
to form large winter aggregations, clinging to one another in large clumps, perhaps to
stay warm. In the laboratory, Copp observed, at various temperatures, how many beetles
(in groups of 100) were free (not aggregated). For each group tested, we have the
temperature at which they were tested and the number of ladybugs that were free. Note
that in the data step I create the powers of the temperature variable (temp2, temp3, and
temp4) as well as the log of the temperature variable (LogTemp) and the log of the free
variable (LogFree).
Please note that a polynomial regression analysis is a sequential analysis. One
first evaluates a linear model. Then one adds a quadratic term and decides whether or
not addition of such a term is justified. Then one adds a cubic term and decides whether
or not such an addition is justified, etc.
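The decision at each step can be made with an F ratio: divide the increase in the model SS produced by the added term by the error MS of the larger model. A tiny Python sketch, using the values reported for the ladybug data later in this handout (the function name is mine):

```python
def f_for_added_term(ss_increase, ms_error):
    """F ratio testing the increase in the model SS produced by adding
    one term (e.g., temperature squared), against the error MS of the
    larger model."""
    return ss_increase / ms_error

# Adding the quadratic term to the ladybug model raised the model SS
# by 310.09, and the error MS was 6.058:
F = f_for_added_term(310.09, 6.058)   # about 51.19
```

With one added term, this F equals the square of the t reported for that term (here, 7.15² ≈ 51).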
Proc Corr is used to evaluate the effects of ranking both variables (Spearman rho)
and the effect of a log transformation on temperature and/or on number of free ladybugs.
The output shows that none of these transformations helps.
Proc Reg is used to test five different models and to prepare scatterplots with the
regression line drawn on the plot. The VAR statement is used to list all of the variables
that will be used in the models that are specified.
The LINEAR model replicates the analysis which Copp reported. Note that there is
a strong (r² = .615) and significant (t = 7.79, p < .001) linear relationship between
temperature and number of free ladybugs. Note my use of the plot subcommand to
produce a plot with number of free ladybugs on the ordinate and temperature (Celsius) on
the abscissa. I asked that the data points be plotted with the asterisk symbol. I also
asked that a second plot, predicted number of free ladybugs (p.) versus temperature,
plotted with the symbol x, be overlaid. This results in a plot where the regression line is
indicated by the xs and the data points by the asterisks. If you look at that plot, you see
that the ladybugs did aggregate more at low temperatures than at high temperatures.
That plot also suggests that the data would be better fit with a quadratic function, one
whose slope increases as temperature either increases or decreases from the point where
the ladybugs are most aggregated.
The QUADRATIC model adds temperature squared to the model. SS1 is used to
obtain Type I (sequential) sums of squares and SCORR1(SEQTESTS) is used to obtain
squared sequential semipartial correlation coefficients and sequential tests of the predictor
variables.
The output (page 4) shows that addition of the quadratic component has increased
the model SS from 853.27 (that for the linear model) to 1163.36, an increase of 310.09.
Dividing this increase in the model SS by the error MS gives 310.09 / 6.058 = 51.19,
the F ratio testing the effect of adding temperature squared, which is shown to be a
significant effect (t = 7.15, p < .001).
Adding temperature squared increased the proportion of explained variance
from .6150 (r² for the linear model) to .8385 (R² for the quadratic model), an increase of
.2235, the squared semipartial correlation coefficient for the quadratic term.
The plot shows that aggregation of the ladybugs is greatest at about 5 to 10
degrees Celsius (the mid to upper 40s Fahrenheit). When it gets warmer than that, the
ladybugs start dispersing, but they also start dispersing when it gets cooler than that.
Perhaps ladybugs are threatened by temperatures below freezing, so the dispersal at the
coldest temperatures represents their attempt to find a warmer place to aggregate.
The CUBIC model adds temperature cubed. The output (page 6) shows that this
component is significant (t = 2.43, p = .02), but that it has not explained much more
variance in aggregation (the R² has increased by only .02281). At this point I might decide
that adding the cubic component is not justified (because it adds so little to the model),
even though it is statistically significant. The second bend in the curve provided by a
cubic model is not apparent in the plot, but there is an apparent flattening of the line at low
temperatures. It would be really interesting to see what would happen if the ladybugs
were tested at temperatures even lower than those employed by Copp.
The QUARTIC model adds temperature raised to the 4th power. The output (page
8) shows that the quartic component is not significant (t = 0.26, p = .80).
The LOG_X model shows that a log function describes the data less well than does
a simple linear function. As shown in the plot, the bend in the curve does not match that
in the data.
Below is an example of how to present results of a polynomial regression. I used
SPSS/PASW to produce the figure.
Forty groups of ladybugs (100 ladybugs per group) were tested at temperatures
ranging from -2 to 34 degrees Celsius. In each group we counted the number of ladybugs
which were free (not aggregated). A polynomial regression analysis was employed to fit
the data with an appropriate model. To be retained in the final model, a component had to
be statistically significant at the .05 level and account for at least 2% of the variance in the
number of free ladybugs. The model adopted was a cubic model, Free Ladybugs =
13.607 + .085 Temperature - .022 Temperature² + .001 Temperature³, F(3, 36) = 74.50,
p < .001, η² = .86, 90% CI [.77, .89]. Table 1 shows the contribution of each component at
the point where it entered the model. It should be noted that a quadratic model fit the data
nearly as well as did the cubic model.
Table 1.
Number of Free Ladybugs Related to Temperature
Component      SS   df     t        p    sr²
Linear        853    1  7.79   < .001   .61
Quadratic     310    1  7.15   < .001   .22
Cubic          32    1  2.43     .020   .02
As shown in Figure 1, the ladybugs were most aggregated at temperatures of 18
degrees or less. As temperatures increased beyond 18 degrees, there was a rapid rise in
the number of free ladybugs.
Current research in our laboratory is directed towards evaluating the response of
ladybugs tested at temperatures lower than those employed in the currently reported
research. It is anticipated that the ladybugs will break free of aggregations as
temperatures fall below freezing, since remaining in such a cold location could kill a
ladybug.
In its most general form, the GLM (General Linear Model) relates a set of p
predictor variables (X1 through Xp) to a set of q criterion variables (Y1 through Yq). We
shall now briefly survey two special cases of the GLM, bivariate correlation/regression
and multiple correlation/regression.
The Univariate Mean: A One Parameter (a) Model
If there is only one Y and no X, then the GLM simplifies to the computation of a
mean. We apply the least squares criterion to reduce the squared deviations
between Y and predicted Y to the smallest value possible for a linear model. The
prediction equation is Ŷ = Ȳ. Error in prediction is estimated by
s = √[ Σ(Y - Ȳ)² / (n - 1) ].
Bivariate Regression: A Two Parameter (a and b) Model
If there is only one X and only one Y, then the GLM simplifies to the simple
bivariate linear correlation/regression with which you are familiar. We apply the least
squares criterion to reduce the squared deviations between Y and predicted Y to the
smallest value possible for a linear model. That is, we find a and b such that for
Ŷ = a + bX, the Σ(Y - Ŷ)² = Σe² is as small as possible,
where e is the "error" term, the deviation of Y from predicted Y. The coefficient "a" is
the Y-intercept, the value of Y when X = 0 (the intercept was the mean of Y in the one
parameter model above), and "b" is the slope, the average amount of change in Y per
unit change in X. Error in prediction is estimated by
s(est Y) = √[ Σ(Y - Ŷ)² / (n - 2) ].
Although the model is linear, that is, specifies a straight line relationship
between X and Y, it may be modified to test nonlinear models. For example, if you
think that the function relating Y to X is quadratic, you employ the model
Y = a + b1X + b2X² + e.
It is often more convenient to work with variables that have all been standardized
to some common mean and some common SD (standard deviation) such as 0, 1 (Z-
scores). If scores are so standardized, the intercept, "a," drops out (becomes zero) and
the standardized slope, the number of standard deviations that predicted Y changes for
each change of one SD in X, is commonly referred to as β. In a bivariate regression, β
is the Pearson r. If r = 1, then each change in X of one SD is associated with a one SD
change in predicted Y.
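As a quick check of these claims (the handout's examples use SAS; this is an illustrative Python sketch with made-up data), the least-squares a and b can be computed directly, and the standardized slope can be verified to equal Pearson r:

```python
import numpy as np

# Small illustrative data set (assumed, not from the handout)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Least-squares estimates: b = cov(X, Y) / var(X), a = mean(Y) - b * mean(X)
b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
a = y.mean() - b * x.mean()

# With both variables standardized, the slope (beta) is just Pearson r
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
beta = np.cov(zx, zy, ddof=1)[0, 1]  # var(zx) = 1, so the covariance is the slope
r = np.corrcoef(x, y)[0, 1]

print(b, a, beta, r)  # beta and r are identical
```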
The multiple regression model is Ŷ = a + b1X1 + b2X2 + ... + bpXp, or,
employing standardized scores, predicted ZY = β1Z1 + β2Z2 + ... + βpZp.
Suppose that we have developed a model for predicting graduate students'
Grade Point Average. We had data from 30 graduate students on the following
variables: GPA (graduate grade point average), GREQ (score on the quantitative
section of the Graduate Record Exam, a commonly used entrance exam for graduate
programs), GREV (score on the verbal section of the GRE), MAT (score on the Miller
Analogies Test, another graduate entrance exam), and AR, the Average Rating that the
student received from 3 professors who interviewed the student prior to making
admission decisions. GPA can exceed 4.0, since this university attaches pluses and
minuses to letter grades.
Later I shall show you how to use SAS to conduct a multiple regression analysis
like this. Right now I simply want to give you an example of how to present the results
of such an analysis. You can expect to receive from me a few assignments in which I
ask you to conduct a multiple regression analysis and then present the results. I
suggest that you use the examples below as your models when preparing such
assignments.
Table 1.
Graduate Grade Point Averages Related to Criteria Used When Making
Admission Decisions (N = 30).
                              Zero-Order r
Variable      AR      MAT     GREV     GREQ      GPA       β     sr²        b
GREQ                                            .611*    .32*    .07    .0040
GREV                                   .468*    .581*     .21    .03    .0015
MAT                           .426*     .267    .604*    .32*    .07    .0209
AR                   .525*    .405*    .508*    .621*     .20    .02    .1442
Intercept = -1.738
Mean        3.57    67.00    575.3    565.3     3.31
SD          0.84     9.25     83.0     48.6     0.60               R² = .64*
*p < .05
Multiple linear regression analysis was used to develop a model for predicting
graduate students' grade point average from their GRE scores (both verbal and
quantitative), MAT scores, and the average rating the student received from a panel of
professors following that student's pre-admission interview with those professors. Basic
descriptive statistics and regression coefficients are shown in Table 1.
It is not totally unreasonable to conduct a multiple regression analysis by hand,
as long as you have only two predictor variables. Consider the analysis we did in PSYC
6430 (with the program CorrRegr.sas) predicting attitudes towards animals (AR) from
idealism and misanthropy. SAS gave us the following univariate and bivariate statistics:
Variable N Mean Std Dev Sum Minimum Maximum
ar 154 2.37969 0.53501 366.47169 1.21429 4.21429
ideal 154 3.65024 0.53278 562.13651 2.30000 5.00000
misanth 154 2.32078 0.67346 357.40000 1.00000 4.00000
Pearson Correlation Coefficients, N = 154
Prob > |r| under H0: Rho=0
ar ideal misanth
ar 1.00000 0.05312 0.22064
0.5129 0.0060
ideal 0.05312 1.00000 -0.13975
0.5129 0.0839
Let us first obtain the beta weights for a standardized regression equation,
predicted ZY = β1Z1 + β2Z2.

β1 = (ry1 - ry2 r12) / (1 - r12²) = (.05312 - .22064(-0.13975)) / (1 - (-0.13975)²) = .0856

β2 = (ry2 - ry1 r12) / (1 - r12²) = (.22064 - .05312(-0.13975)) / (1 - (-0.13975)²) = .2326
Now for the unstandardized equation,
Ŷ = a + b1X1 + b2X2, where bi = βi (sy / si) and a = Ȳ - Σ bi X̄i.

b1 = .0856(.535) / .53278 = .086          b2 = .2326(.535) / .67346 = .185

a = 2.38 - .086(3.65) - .185(2.32) = 1.637
The Multiple R²

R²y.12 = (ry1² + ry2² - 2 ry1 ry2 r12) / (1 - r12²), or, more generally,

R²y.12...p = Σ βi ryi   (for p ≥ 2 predictors)

For our data, R²y.12 = .0856(.05312) + .2326(.22064) = .0559.
Semipartials

sr1 = (ry1 - ry2 r12) / √(1 - r12²)          sr2 = (ry2 - ry1 r12) / √(1 - r12²)

pr1² = sr1² / (1 - ry2²)

More generally, pri² = sri² / (1 - R²y.12...(i)...p) for p ≥ 2, where the R² in the
denominator is for predicting Y from all of the predictors except Xi.
Tests of significance of partials

H0: Pop. β = 0.  The standard error is SEβ = √[ (1 - R²y.12) / ((1 - r12²)(N - p - 1)) ],
and t = β / SEβ, df = N - p - 1 (but easier to get with the next formula).

H0: Pop. sr = 0.  t = sr √[ (N - p - 1) / (1 - R²) ], df = N - p - 1.
t1 = .085 √[ (154 - 2 - 1) / (1 - .0559) ] = 1.07      t2 = .231 √[ 151 / (1 - .0559) ] = 2.91      df = 151
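That test can be sketched in Python (values from the example above):

```python
import math

def t_for_sr(sr, N, p, R2):
    """t testing H0: population semipartial r = 0, with df = N - p - 1."""
    return sr * math.sqrt((N - p - 1) / (1 - R2))

# Values from the attitude/idealism/misanthropy example above
N, p, R2 = 154, 2, 0.0559
t1 = t_for_sr(0.085, N, p, R2)
t2 = t_for_sr(0.231, N, p, R2)
print(round(t1, 2), round(t2, 2))
```

The second t comes out near 2.92 here rather than the hand-computed 2.91, simply because the semipartial was rounded to three decimals before being plugged in.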
A test of H0: Pop. sr = 0 is identical to a test of H0: Pop. β = 0 or a test of H0: Pop. b = 0.
Shrunken (Adjusted) R²

Shrunken R² = 1 - (1 - R²)(N - 1) / (N - p - 1) = 1 - (1 - .0559)(153) / 151 = .043.

Please re-read this document about shrunken R².
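As a one-line sketch in Python:

```python
def shrunken_r2(R2, N, p):
    """Shrunken (adjusted) R^2 = 1 - (1 - R^2)(N - 1)/(N - p - 1)."""
    return 1 - (1 - R2) * (N - 1) / (N - p - 1)

# The example above: R^2 = .0559, N = 154, p = 2 predictors
print(round(shrunken_r2(0.0559, 154, 2), 3))  # 0.043
```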
ANOVA summary table:

Source        SS             df          MS    F    p
Regression    R²·SSY         p
Error         (1 - R²)SSY    N - p - 1
Total         SSY            N - 1

F tests the null hypothesis that the population R² = 0.
For our example, SSY = (N - 1)s² = 153(.53501)² = 43.794.
SSregr = .0559(43.794) = 2.447
SSerror = 43.794 - 2.447 = 41.347
F(2, 151) = 4.47, p = .013
standard error of estimate = √MSE = 0.52
For a hierarchical analysis,
R²y.12...p = ry1² + sr²2(1) + sr²3(12) + ... + sr²p(12...p-1)
Copyright 2009, Karl L. Wuensch - All rights reserved.
Suppress.docx
Redundancy and Suppression in Trivariate Regression Analysis
Redundancy
In the behavioral sciences, one's nonexperimental data often result in the two
predictor variables being redundant with one another with respect to their covariance
with the dependent variable. Look at the ballantine above, where we are predicting
verbal SAT from family income and parental education. Area b represents the
redundancy. For each X, sri and βi will be smaller than ryi, and the sum of the squared
semipartial rs will be less than the multiple R². Because of the redundancy between
the two predictors, the sum of the squared semipartial correlation coefficients (areas a
and c, the unique contributions of each predictor), is less than the squared multiple
correlation coefficient, R²y.12 (area a + b + c).
Classical Suppression

[Ballantine diagram of Y, X1, and X2 omitted.]
Look at the ballantine above. Suppose that Y is score on a statistics
achievement test, X1 is score on a speeded quiz in statistics (slow readers don't have
enough time), and X2 is a test of reading speed. The ry1 = .38, ry2 = 0, r12 = .45.
R²y.12 = .38² / (1 - .45²) = .181, and R = .4255, greater than ry1. Adding X2 to X1
increased R² by .181 - .38² = .0366. That is, sr2² = .0366 and sr2 = .19. The sum of the
two squared semipartial rs, .181 + .0366 = .218, is greater than the R²y.12.
β1 = (.38 - 0(.45)) / (1 - .45²) = .476, greater than ry1.
β2 = (0 - .38(.45)) / (1 - .45²) = -.214.

The correlation between Y and X1 is increased (relative to ry1) by removing from X1
the irrelevant variance due to X2; the variance that is left in X1·2 is more correlated
with Y than is unpartialled X1.
r²y1 = .38² = .144, which is less than sr1² = .144 / (1 - .45²) = .181 = R²y.12.
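All of the quantities used in these suppression examples follow directly from the three correlations; a small helper (a Python sketch, not part of the original handout) reproduces the classical-suppression numbers, and the same function works for the net- and cooperative-suppression examples that follow:

```python
def trivariate(r_y1, r_y2, r_12):
    """Betas, squared semipartials, and R^2 for two standardized predictors."""
    denom = 1 - r_12 ** 2
    beta1 = (r_y1 - r_y2 * r_12) / denom
    beta2 = (r_y2 - r_y1 * r_12) / denom
    sr1_sq = (r_y1 - r_y2 * r_12) ** 2 / denom
    sr2_sq = (r_y2 - r_y1 * r_12) ** 2 / denom
    R2 = beta1 * r_y1 + beta2 * r_y2
    return beta1, beta2, sr1_sq, sr2_sq, R2

# Classical suppression example: r_y1 = .38, r_y2 = 0, r_12 = .45
beta1, beta2, sr1_sq, sr2_sq, R2 = trivariate(0.38, 0.0, 0.45)
print(round(beta1, 3), round(beta2, 3), round(sr2_sq, 4), round(R2, 3))
```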
Velicer (see Smith et al., 1992) wrote that suppression exists when a predictor's
usefulness is greater than its squared zero-order correlation with the criterion variable.
Usefulness is the squared semipartial correlation for the predictor.
Our X1 has r²Y1 = .144 -- all by itself it explains 14.4% of the variance in Y. When
added to a model that already contains X2, X1 increases the R² by .181; that is,
sr1² = .181. X1 is more useful in the multiple regression than all by itself. Likewise,
sr2² = .0366 > 0 = r²y2. That is, X2 is more useful in the multiple regression than all by
itself.
Net Suppression

[Ballantine diagram of Y, X1, and X2 omitted.]
Look at the ballantine above. Suppose Y is the amount of damage done to a
building by a fire, X1 is the severity of the fire, and X2 is the number of fire fighters sent
to extinguish the fire. The ry1 = .65, ry2 = .25, and r12 = .70.
β1 = (.65 - .25(.70)) / (1 - .70²) = .93, greater than ry1.
β2 = (.25 - .65(.70)) / (1 - .70²) = -.40.
Note that β2 has a sign opposite that of ry2. It is always the X which has the
smaller ryi which ends up with a β of opposite sign. Each β falls outside of the range
0 ↔ ryi, which is always true with any sort of suppression.
Again, the sum of the two squared semipartials is greater than is the squared
multiple correlation coefficient: R²y.12 = .505, sr1² = .505 - r²y2 = .4425,
sr2² = .505 - r²y1 = .0825, and .4425 + .0825 = .525 > .505.
Again, each predictor is more useful in the context of the other predictor than all
by itself: sr1² = .4425 > r²Y1 = .4225 and sr2² = .0825 > r²Y2 = .0625.
For our example, number of fire fighters, although slightly positively correlated
with amount of damage, functions in the multiple regression primarily as a suppressor of
variance in X1 that is irrelevant to Y (the shaded area in the Venn diagram). Removing
that irrelevant variance increases the β for X1.
Looking at it another way, treating severity of fire as the covariate, when we
control for severity of fire, the more fire fighters we send, the less the amount of damage
suffered in the fire. That is, for the conditional distributions where severity of fire is held
constant at some set value, sending more fire fighters reduces the amount of damage.
Please note that this is an example of a reversal paradox, where the sign of the
correlation between two variables in aggregated data (ignoring a third variable) is
opposite the sign of that correlation in segregated data (within each level of the third
variable). I suggest that you (re)read the article on this phenomenon by Messick and
van de Geer [Psychological Bulletin, 90, 582-593].
Cooperative Suppression
R² will be maximally enhanced when the two Xs correlate negatively with one
another but positively with Y (or positively with one another and negatively with Y) so
that when each X is partialled from the other its remaining variance is more correlated
with Y: both predictors' β, pr, and sr increase in absolute magnitude (and retain the
same sign as ryi). Each predictor suppresses variance in the other that is irrelevant to Y.
Consider this contrived example: We seek variables that predict how much the
students in an introductory psychology class will learn (Y). Our teachers are all
graduate students. X1 is a measure of the graduate student's level of mastery of
general psychology. X2 is a measure of how strongly the students agree with
statements such as "This instructor presents simple, easy to understand explanations of
the material," "This instructor uses language that I comprehend with little difficulty," etc.
Suppose that ry1 = .30, ry2 = .25, and r12 = -.35.
β1 = (.30 - .25(-.35)) / (1 - .35²) = .442.      β2 = (.25 - .30(-.35)) / (1 - .35²) = .405.

sr1 = (.30 - .25(-.35)) / √(1 - .35²) = .414.    sr2 = (.25 - .30(-.35)) / √(1 - .35²) = .379.

R²y.12 = Σ βi ryi = .30(.442) + .25(.405) = .234. Note that the sum of the squared
semipartials, .414² + .379² = .171 + .144 = .315, exceeds the squared multiple
correlation coefficient, .234.
Again, each predictor is more useful in the context of the other predictor than all
by itself: sr1² = .171 > r²Y1 = .09 and sr2² = .144 > r²Y2 = .062.
Summary
When βi falls outside the range of 0 ↔ ryi, suppression is taking place. This is
Cohen & Cohen's definition of suppression. As noted above, Velicer defined
suppression in terms of a predictor having a squared semipartial correlation coefficient
that is greater than its squared zero-order correlation with the criterion variable.

If one ryi is zero or close to zero, it is classic suppression, and the sign of the β
for the X with a nearly zero ryi will be opposite the sign of r12.

When neither X has ryi close to zero but one has a β opposite in sign from its ryi
and the other a β greater in absolute magnitude but of the same sign as its ryi, net
suppression is taking place.

If both Xs have absolute βi > ryi, but of the same sign as ryi, then cooperative
suppression is taking place.
Although unusual, beta weights can even exceed one when cooperative
suppression is present.
References
Cohen, J., & Cohen, P. (1975). Applied multiple regression/correlation for the
behavioral sciences. New York, NY: Wiley. [This handout has drawn heavily
from Cohen & Cohen.]
Smith, R. L., Ager, J. W., Jr., & Williams, D. L. (1992). Suppressor variables in
multiple regression/correlation. Educational and Psychological Measurement,
52, 17-29. doi:10.1177/001316449205200102
Wuensch, K. L. (2008). Beta weights greater than one!
http://core.ecu.edu/psyc/wuenschk/MV/multReg/Suppress-BetaGT1.doc
Copyright 2012, Karl L. Wuensch - All rights reserved.
Example of Three Predictor Multiple Regression/Correlation Analysis: Checking
Assumptions, Transforming Variables, and Detecting Suppression
The data are from Guber, D.L. (1999). Getting what you pay for: The debate over
equity in public school expenditures. Journal of Statistics Education, 7, 1-8. The
research units are the fifty states in the USA. We shall pretend they represent a
random sample from a population of interest. The criterion variable is mean SAT in the
state. The predictors are Expenditure ($ spent per student), Salary (mean salary of
teachers), and Teacher/Pupil Ratio. If we consider the predictor variables to be fixed
(the regression model), then we do not worry about the shape of the distributions of the
predictor variables. If we consider the predictor variables to be random (the correlation
model) we do. It turns out that each of the predictors has a distinct positive skewness
which can be greatly reduced by a negative reciprocal transformation.
Here are descriptive statistics and the zero-order correlations for the untransformed variables:
Descriptive Statistics

                    N   Minimum   Maximum   Skewness (SE)    Kurtosis (SE)
Expenditure        50      3.66      9.77    1.107 (.337)     1.279 (.662)
Expend_nr          50      -.27      -.10    -.109 (.337)      .009 (.662)
salary             50     25.99     50.05     .757 (.337)      .028 (.662)
Salary_nr          50      -.04      -.02     .090 (.337)     -.620 (.662)
SAT                50    844.00   1107.00     .236 (.337)    -1.309 (.662)
TeachPerPup        50     13.80     24.30    1.334 (.337)     2.583 (.662)
TeachPerPup_nr     50      -.07      -.04     .490 (.337)      .220 (.662)
Correlations (N = 50)

                 Expenditure    salary   TeachPerPup      SAT
Expenditure           1         .870**      -.371**    -.381**
salary              .870**        1          -.001     -.440**
TeachPerPup        -.371**      -.001          1          .081
SAT                -.381**     -.440**        .081          1

**. Correlation is significant at the 0.01 level (2-tailed).
Here is a regression analysis with the untransformed variables. I asked SPSS
for a plot of the standardized residuals versus the standardized predicted scores. I also
asked for a histogram of the residuals.
Model Summary: R = .458, R² = .210, adjusted R² = .158, standard error of the
estimate = 68.65. Predictors: (Constant), TeachPerPup, salary, Expenditure.
Dependent variable: SAT.
ANOVA (dependent variable SAT; predictors: (Constant), TeachPerPup, salary, Expenditure)

Source        Sum of Squares   df   Mean Square     F     Sig.
Regression        57495.745     3     19165.248   4.066   .012
Residual         216811.9      46      4713.303
Total            274307.7      49
If you compare the beta weights with the zero-order correlations, it is obvious that
we have some suppression taking place. The beta for expenditure is positive but the
zero-order correlation between SAT and expenditure was negative. For the other two
predictors the value of beta exceeds the value of their zero-order correlation with SAT.
Here is a histogram of the residuals with a normal curve superimposed:
The residuals appear to be approximately normally distributed. The plot of
standardized residuals versus standardized predicted scores will allow us visually to
check for heterogeneity of variance, nonlinear trends, and normality of the residuals
across values of the predicted variable. I have drawn in the regression line (error = 0). I
see no obvious problems here.
Coefficients (dependent variable SAT)

                       B     Std. Error    Beta        t    Sig.
(Constant)      1069.234        110.925             9.639   .000
Expenditure       16.469         22.050    .300      .747   .459
salary            -8.823          4.697   -.701    -1.878   .067
TeachPerPup        6.330          6.542    .192      .968   .338
Under the homoscedasticity assumption there should be no correlation between
the predicted scores and error variance. The vertical spread of the dots in the plot
above should not vary as we move left to right. I squared the residuals and correlated
them with the predicted values. If the residuals were increasing in variance as the
predicted values increase this correlation would be positive. It is close to zero,
confirming my eyeball conclusion that there is no problem with that fairly common sort
of heteroscedasticity.
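That screening step can be sketched in Python (the handout used SPSS; the data here are simulated and homoscedastic by construction):

```python
import numpy as np

# Simulated regression data (illustrative; homoscedastic by construction)
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 2 + 0.5 * x + rng.normal(size=100)

# Fit the regression and compute residuals
slope, intercept = np.polyfit(x, y, 1)
predicted = intercept + slope * x
residuals = y - predicted

# Correlate the squared residuals with the predicted values; a value near
# zero is consistent with homoscedasticity (no fan-shaped spread)
r_het = np.corrcoef(predicted, residuals ** 2)[0, 1]
print(round(r_het, 3))
```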
The correlation between Predicted_SAT and the squared residuals was .093,
close to zero. Now let us look at the results using the transformed data.
Correlations (N = 50)

                   Expend_nr   Salary_nr   TeachPerPup_nr      SAT
Expend_nr               1        .816**        -.425**      -.398**
Salary_nr            .816**        1             .015       -.467**
TeachPerPup_nr      -.425**      .015              1           .089
SAT                 -.398**     -.467**          .089            1

**. Correlation is significant at the 0.01 level (2-tailed).
The correlation matrix looks much like it did with the untransformed data.
The R² has increased a bit.
No major changes caused by the transformation, which is comforting. Trust me
that the residuals plots still look OK too.
I wonder what high school teachers would think about the negative relationship
between average state salary for teachers and average state SAT score? If we want
better education should we lower teacher salaries? There is an important state
characteristic that we should have but have not included in our model. Check out the
JSE article to learn what that characteristic is.
Now, can we figure out what sort of suppression is going on here?
Model Summary: R = .482, R² = .232, adjusted R² = .182, standard error of the
estimate = 67.65. Predictors: (Constant), TeachPerPup_nr, Salary_nr, Expend_nr.
Dependent variable: SAT.
ANOVA (dependent variable SAT; predictors: (Constant), TeachPerPup_nr, Salary_nr, Expend_nr)

Source        Sum of Squares   df   Mean Square     F     Sig.
Regression        63771.502     3     21257.167   4.644   .006
Residual         210536.2      46      4576.873
Total            274307.7      49
Coefficients (dependent variable SAT)

                         B     Std. Error    Beta        t    Sig.
(Constant)         850.240        130.425             6.519   .000
Expend_nr          367.276        692.442    .181      .530   .598
Salary_nr        -9823.521       4920.257   -.618    -1.997   .052
TeachPerPup_nr    1805.969       2031.564    .176      .889   .379
It looks like the expenditures variable is suppressing irrelevant variance in one or
both or a linear combination of the other two predictors. Put another way, if we hold
constant the effects of teacher salary and number of teachers per pupil, then the
relationship between expenditures and SAT goes from negative to positive. Maybe the
money is best spent on things other than hiring more teachers or better paid teachers?
Let us look at two-predictor models.
No suppression between expenditures and teacher salary.
A little bit of classical suppression here, but not dramatic.
Coefficients (dependent variable SAT)

                     Beta        r
Expend_nr            .181    -.398
Salary_nr           -.618    -.467
TeachPerPup_nr       .176     .089
Coefficients (dependent variable SAT)

                     Beta        r
Expend_nr           -.049    -.398
Salary_nr           -.428    -.467
Coefficients (dependent variable SAT)

                     Beta        r
Expend_nr           -.439    -.398
TeachPerPup_nr      -.097     .089
A little bit of cooperative suppression here, but not dramatic.
Maybe the expenditures variable is suppressing irrelevant variance in a linear
combination of teacher salary and teacher/pupil ratio. I predicted SAT from salary and
teacher/pupil ratio and saved the predicted scores as predicted23. Those predicted
scores are a linear combination of teacher salary and teacher/pupil ratio, with lower
salaries and higher teacher/pupil ratios being associated with higher SAT scores. When
I correlate predicted23 with SAT I get .477, the R for SAT predicted from salary and
teacher/pupil ratio. Watch what happens when I add the expenditures variable to the
predicted23 combination.
As you can see, the expenditures variable suppresses irrelevant variance in the
predicted23 combination of the other two predictor variables. When you hold total
amount of expenditures constant, there is an increase in the predictive value of a linear
combination of teacher salary and teacher/pupil ratio.
Karl L. Wuensch
East Carolina University, Dept. of Psychology
March, 2011
Coefficients (dependent variable SAT)

                     Beta        r
Salary_nr           -.469    -.467
TeachPerPup_nr       .096     .089
Coefficients (dependent variable SAT)

                     Beta        r
predicted23          .586     .477
Expend_nr            .122    -.398
Invert.docx
Inverting Matrices: Determinants and Matrix Multiplication
Determinants
Square matrices have determinants, which are useful in other matrix operations,
especially inversion.
For a second-order square matrix,

    A = | a11  a12 |
        | a21  a22 |

the determinant of A is |A| = a11a22 - a12a21.
Consider the following bivariate raw data matrix:
Subject # 1 2 3 4 5
X 12 18 32 44 49
Y 1 3 2 4 5
from which the following XY variance-covariance matrix is obtained:
X Y
X 256 21.5
Y 21.5 2.5
r = COVXY / (SX SY) = 21.5 / √(256 × 2.5) = .85

|A| = 256(2.5) - 21.5(21.5) = 177.75
Think of the variance-covariance matrix as containing information about the two
variables the more variable X and Y are, the more information you have. The total
amount of information you have is reduced, however, by any redundancy between X
and Y that is, to the extent that you have covariance between X and Y you have less
total information. The determinant of a matrix is sometimes called its generalized
variance, the total amount of information you have about variance in the scores, after
removing the redundancy between the variables look at how we just computed the
determinant the product of the variances (information) less the product of the
covariances (redundancy).
Now think of the information in the X scores as being represented by the width of
a rectangle, and the information in the Y scores represented by the height of the
rectangle. The area of this rectangle is the total amount of information you have.

An identity matrix has ones on its main diagonal and zeros elsewhere; for example,
the 3 × 3 identity matrix is

    1  0  0
    0  1  0
    0  0  1
With scalars, multiplication by the inverse yields the scalar identity, 1: a(1/a) = 1.
Multiplying by an inverse is equivalent to division: a(1/b) = a/b.
The inverse of a 2 × 2 matrix is

    A⁻¹ = (1 / |A|) ×  |  a22  -a12 |
                       | -a21   a11 |

which for our example is

    (1 / 177.75) ×  |   2.5  -21.5 |
                    | -21.5   256  |
Multiplying a scalar by a matrix is easy - simply multiply each matrix element by the
scalar, thus,

    A⁻¹ = |  .014064698  -.120956399 |
          | -.120956399  1.440225035 |
Now to demonstrate that AA⁻¹ = A⁻¹A = I; but multiplying matrices is not so easy.
For a 2 × 2,

    | a  b |   | w  x |     | aw+by  ax+bz |     | row1·col1  row1·col2 |
    | c  d | × | y  z |  =  | cw+dy  cx+dz |  =  | row2·col1  row2·col2 |

    | 256   21.5 |   |  .014064698  -.120956399 |     | 1  0 |
    | 21.5   2.5 | × | -.120956399  1.440225035 |  =  | 0  1 |
Third-Order Determinant and Matrix Multiplication
The determinant of a third-order square matrix,

    A = | a11  a12  a13 |
        | a21  a22  a23 |
        | a31  a32  a33 |

is

|A| = a11a22a33 + a12a23a31 + a13a21a32 - a31a22a13 - a32a23a11 - a33a21a12
Matrix multiplication for a 3 × 3:

    | a  b  c |   | r  s  t |     | ar+bu+cx  as+bv+cy  at+bw+cz |
    | d  e  f | × | u  v  w |  =  | dr+eu+fx  ds+ev+fy  dt+ew+fz |
    | g  h  i |   | x  y  z |     | gr+hu+ix  gs+hv+iy  gt+hw+iz |

That is, the element in row i, column j of the product is row i of the first matrix
times column j of the second matrix.
Isn't this fun? Aren't you glad that SAS will do matrix algebra for you? Copy the
little program below into the SAS editor and submit it.
SAS Program
Proc IML;
reset print;
XY ={
256 21.5,
21.5 2.5};
determinant = det(XY);
inverse = inv(XY);
identity = XY*inverse;
quit;
Look at the program statements. The reset print statement makes SAS display
each matrix as it is created. When defining a matrix, one puts brackets about the data
points and commas at the end of each row of the matrix.
Look at the output. The first matrix is the variance-covariance matrix from this
handout. Next is the determinant of that matrix, followed by the inverted variance-
covariance matrix. The last matrix is, within rounding error, an identity matrix, obtained
by multiplying the variance-covariance matrix by its inverse.
SAS Output
XY 2 rows 2 cols (numeric)
256 21.5
21.5 2.5
DETERMINANT 1 row 1 col (numeric)
177.75
INVERSE 2 rows 2 cols (numeric)
0.0140647 -0.120956
-0.120956 1.440225
IDENTITY 2 rows 2 cols (numeric)
1 -2.22E-16
-2.08E-17 1
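The same computations can be done in Python with numpy (a sketch mirroring the Proc IML program above):

```python
import numpy as np

# The variance-covariance matrix from this handout
XY = np.array([[256.0, 21.5],
               [21.5, 2.5]])

determinant = np.linalg.det(XY)   # 256(2.5) - 21.5(21.5) = 177.75
inverse = np.linalg.inv(XY)
identity = XY @ inverse           # the identity matrix, within rounding error

print(round(determinant, 2))
print(np.round(inverse, 6))
print(np.round(identity, 6))
```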
Copyright 2011, Karl L. Wuensch - All rights reserved.
MultReg-Matrix.docx
Using Matrix Algebra to do Multiple Regression
Before we had computers to assist us, we relied on matrix algebra to solve
multiple regressions. You have some appreciation of how much arithmetic is involved in
matrix algebra, so you can imagine how tedious the solution is. We shall use SAS to do
that arithmetic for us. Consider the research I have done involving the relationship
between a person's attitudes about animals, idealism, and misanthropy. I also have, for
the same respondents, relativism scores and gender. Below is a correlation matrix and,
in the last two rows, a table of simple descriptive statistics for these variables.
Persons who score high on the idealism dimension believe that ethical behavior
will always lead only to good consequences, never to bad consequences, and never to
a mixture of good and bad consequences. Persons who score high on the relativism
dimension reject the notion of universal moral principles, preferring personal and
situational analysis of behavior. Persons who score high on the misanthropy dimension
dislike humans. Gender was coded 1 for female, 2 for male. High scores on the
attitude variable indicate that the respondent supports animal rights and does not
support research on animals. There were 153 respondents.
idealism relativism misanthropy gender attitude
idealism 1.0000 -0.0870 -0.1395 -0.1011 0.0501
relativism -0.0870 1.0000 0.0525 0.0731 0.1581
misanthropy -0.1395 0.0525 1.0000 0.1504 0.2259
gender -0.1011 0.0731 0.1504 1.0000 -0.1158
attitude 0.0501 0.1581 0.2259 -0.1158 1.0000
mean 3.64926 3.35810 2.32157 1.18954 2.37276
standard dev. 0.53439 0.57596 0.67560 0.39323 0.52979
1. The first step is to obtain R_iy, the column vector of correlations between each X_i and Y.

ideal    0.0501
relat    0.1581
misanth  0.2259
gender  -0.1158
2. Next we obtain R_ii, the matrix of correlations among the Xs.
ideal relat misanth gender
ideal 1.0000 -0.0870 -0.1395 -0.1011
relat -0.0870 1.0000 0.0525 0.0731
misanth -0.1395 0.0525 1.0000 0.1504
gender -0.1011 0.0731 0.1504 1.0000
3. Now we invert R_ii. You don't really want to do this by hand, do you?

4. The standardized regression weights (beta weights) are obtained as β = R_ii^-1 R_iy. For our data they are .0837, .1636, .2526, and -.1573.

5. For the raw-score model, the intercept is a = Ybar - Σ b_i·Xbar_i.
6. To obtain the squared multiple correlation coefficient, R² = Σ r_iy·β_i. For our
data, that is 0.0501(.0837) + 0.1581(.1636) + 0.2259(.2526) + (-0.1158)(-.1573) =
.1053.
7. Test the significance of the R². For our data, s_y = 0.52979, n = 153, so
SS_Y = 152(0.52979)² = 42.663.
The regression sum of squares, SS_regr = R²·SS_Y = .1053(42.663) = 4.492.
The error sum of squares, SS_error = SS_Y - SS_regr = 42.663 - 4.492 = 38.171.
Source SS df MS F
Regression 4.492 4 1.123 4.353
Residual 38.171 148 0.258
Total 42.663 152
This is significant at about .002. We could go on to test the significance of
the partials and obtain partial or semipartial correlation coefficients, but frankly, that is
just more arithmetic than I can stand. Let us stop at this point. The main objective of
this handout is to help you appreciate how matrices and matrix algebra are essential
when computing multiple regressions, and I hope that I have already made that point
adequately.
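The chain of computations in this handout can be sketched in Python/NumPy (an illustration, not part of the original SAS workflow; the correlation values are those tabled above):

```python
import numpy as np

# Correlations of each predictor with attitude (R_iy) and among the
# predictors (R_ii), copied from the matrix tabled above.
r_iy = np.array([0.0501, 0.1581, 0.2259, -0.1158])   # ideal, relat, misanth, gender
R_ii = np.array([[ 1.0000, -0.0870, -0.1395, -0.1011],
                 [-0.0870,  1.0000,  0.0525,  0.0731],
                 [-0.1395,  0.0525,  1.0000,  0.1504],
                 [-0.1011,  0.0731,  0.1504,  1.0000]])

beta = np.linalg.solve(R_ii, r_iy)   # standardized weights, R_ii^-1 R_iy
R2 = r_iy @ beta                     # squared multiple correlation, about .1053

n, f = 153, 4
F = (R2 / f) / ((1 - R2) / (n - f - 1))   # about 4.35 on 4 and 148 df
```

Using `solve` rather than explicitly inverting R_ii does the same algebra more stably, which is exactly why we let the computer do this arithmetic.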
reg-diag.doc
Regression Diagnostics
Run the program RegDiag.sas, available at my SAS Programs Page. The data
are within the program. We have data on the following variables:
SpermCount sperm count for one man, gathered during copulation
Together percentage of time the man has spent with his mate recently
LastEjac time since the man's last ejaculation
We are interested in predicting sperm counts from the other two variables.
Proc Univariate is used to screen the three variables. We find an outlier on
LastEjac: a man who apparently went 168 hours without an ejaculation. We
investigate and conclude that this data point is valid, but it does cause the LastEjac
variable to be distinctly skewed. We apply a square root transformation which works
marvelously.
The multiple regression is disappointingly nonsignificant. Inspection of the
residuals, as explained below, does reveal a troublesome case that demands
investigation.
The diagnostic statistics appear on page 10 of the output. For each observation
we are given the actual score on SpermCount, the predicted SpermCount, and the
standard error of prediction. The standard error of prediction could be used to put
confidence intervals about predicted scores.
Detection of Outliers among the Independent Variables
LEVERAGE, h_i or Hat Diagonal, is used to detect outliers among the predictor
variables. It varies from 1/n to 1.0 with a mean of (p + 1)/n. Kleinbaum et al. describe
leverage as a measure of the geometric distance of the i-th predictor point
(X_i1, X_i2, ..., X_ik) from the center point (Xbar_1, Xbar_2, ..., Xbar_k) of the
predictor space. The SAS manual
cites Belsley, Kuh, and Welsch's (1980) Regression Diagnostics text, suggesting that
one investigate observations with Hat greater than 2p/n, where n is the number of
observations used to fit the model, and p is the number of parameters in the model.
They present an example with 10 observations, two predictors, and the intercept, noting
that a HAT cutoff of 0.60 should be used. Our model has three parameters, and we
have 11 observations, so our cutoff would be 2(3)/11 = .55. Observations # 5 and 7
seem worthy of investigation. Case 5 had a very high time since last ejaculation, and
case 7 had a very low time together. Investigation reveals the data to be valid.
RSTUDENT is the studentized deleted residual: each residual is standardized using
an error variance estimate, s_(i), computed with the i-th observation deleted from the
data. This prevents the i-th observation
from influencing these statistics, resulting in unusual observations being more likely to
stick out like a sore thumb. Kleinbaum et al. refer to this statistic as the jackknife
residual and note that it is distributed exactly as a t on n - k - 2 degrees of freedom, as
opposed to n - k -1 degrees of freedom for the Studentized (nondeleted) residuals. The
SAS manual (SAS/STAT Users Guide, Version 6, fourth edition, chapter on the REG
procedure) suggests that one attend to observations which have absolute values of
RSTUDENT greater than 2 (observations whose score on the dependent variable is a
large distance from the regression surface). Using that criterion, observation # 11
demands investigation: the predicted sperm count is much lower than the actual
sperm count.
Measuring the Extent to Which an Observation Influences the Location of the
Regression Surface
COOK'S D is used to measure INFLUENCE, the extent to which an observation
is affecting the location of the regression surface, a function of both its distance and its
leverage. Cook suggested that one check observations whose D is greater than the
median value of F on p and n - p degrees of freedom. David Howell (Statistical
Methods for Psychology, sixth edition, 2007, page 518) suggests investigating any D >
1.00. By Howell's criterion, observation # 11 has an influence worthy of our attention.
The Cov Ratio measures how much change there is in the determinant of the
covariance matrix of the estimates when one deletes a particular observation. The SAS
manual says Belsley et al. suggest investigating observations with ABS(Cov Ratio - 1) >
3*p/n. The Dffits statistic is very similar to Cook's D. The SAS manual says Belsley et
al. suggest investigating observations with Dffits > 2SQRT(p/n). The SAS manual
suggests a simple cutoff of 2. Dfbetas measure the influence of an observation on a
single parameter (intercept or slope). The SAS manual says Belsley et al. recommend
a general cutoff of > 2 or a size-adjusted cutoff of > 2/SQRT(n). Case number 11 is
suspect here too, with great influence on the slope for time since last ejaculation.
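The leverage computation described above can be sketched in Python/NumPy. The numbers below are contrived for illustration (they are NOT the actual RegDiag.sas data), loosely modeled on the two predictors in this lesson:

```python
import numpy as np

# Design matrix: intercept plus two hypothetical predictors
# (hours since last ejaculation, percent time together)
X = np.column_stack([
    np.ones(11),
    [12., 15., 10., 14., 60., 13., 2., 16., 11., 14., 13.],
    [40., 45., 50., 42., 44., 47., 5., 43., 46., 41., 44.],
])
H = X @ np.linalg.inv(X.T @ X) @ X.T       # the hat matrix
leverage = np.diag(H)                      # h_i for each observation

n, p = X.shape                             # n cases, p parameters (with intercept)
cutoff = 2 * p / n                         # Belsley, Kuh, & Welsch rule: 2p/n
flagged = np.where(leverage > cutoff)[0]   # cases worth investigating
```

Note that the leverages always sum to p, so a few extreme cases (like the contrived case with 60 hours since last ejaculation) necessarily pull leverage away from the rest.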
We investigate case number 11 and discover that the participant had not
followed the instructions for gathering the data. We decide to discard case number 11
and reanalyze the data. Case number 11 was, by the way, contrived by me for this
lesson, but the data for cases 1 through 10 are the actual data used in the research
presented in this article:
Baker, R. R., & Bellis, M. A. (1989). Number of sperm in human ejaculates
varies in accordance with sperm competition theory. Animal Behaviour, 37, 867-869.
Back to Wuensch's Stats Lesson Page
Copyright 2007, Karl L. Wuensch, All Rights Reserved
Stepwise.doc
Stepwise Multiple Regression
Your introductory lesson for multiple regression with SAS involved developing a
model for predicting graduate students' Grade Point Average. We had data from 30
graduate students on the following variables: GPA (graduate grade point average),
GREQ (score on the quantitative section of the Graduate Record Exam, a commonly
used entrance exam for graduate programs), GREV (score on the verbal section of the
GRE), MAT (score on the Miller Analogies Test, another graduate entrance exam), and
AR, the Average Rating that the student received from 3 professors who interviewed
em prior to making admission decisions. GPA can exceed 4.0, since this university
attaches pluses and minuses to letter grades. We used a simultaneous multiple
regression, entering all of the predictors at once. Now we shall learn how to conduct
stepwise regressions, where variables are entered and/or deleted according to
statistical criteria. Please run the program STEPWISE.SAS from my SAS Programs
page.
Forward Selection
In a forward selection analysis we start out with no predictors in the model. Each
of the available predictors is evaluated with respect to how much R² would be increased
by adding it to the model. The one which will most increase R² will be added if it meets
the statistical criterion for entry. With SAS the statistical criterion is the significance
level for the increase in the R² produced by addition of the predictor. If no predictor
meets that criterion, the analysis stops. If a predictor is added, then the second step
involves re-evaluating all of the available predictors which have not yet been entered
into the model. If any satisfy the criterion for entry, the one which most increases R² is
added. This procedure is repeated until there remain no more predictors that are
eligible for entry.
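The selection logic just described can be sketched in Python/NumPy. This is a bare-bones illustration, not SAS's implementation: to stay dependency-free it uses a fixed F-to-enter cutoff rather than SAS's SLENTRY significance level, and the tiny data set is contrived:

```python
import numpy as np

def r_squared(columns, y):
    """R^2 from an ordinary least-squares fit (intercept included)."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    return 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()

def forward_select(predictors, y, f_to_enter=4.0):
    """At each step, enter the candidate that most increases R^2,
    provided its F-to-enter exceeds the cutoff; otherwise stop."""
    entered, r2, n = [], 0.0, len(y)
    while True:
        candidates = [name for name in predictors if name not in entered]
        if not candidates:
            break
        trial_r2 = {name: r_squared([predictors[q] for q in entered + [name]], y)
                    for name in candidates}
        best = max(trial_r2, key=trial_r2.get)
        k = len(entered) + 1                     # predictors in the trial model
        F = (trial_r2[best] - r2) / ((1 - trial_r2[best]) / (n - k - 1))
        if F < f_to_enter:                       # no eligible predictor remains
            break
        entered.append(best)
        r2 = trial_r2[best]
    return entered

# Contrived illustration: y depends on x1 but not on x2, so only x1 enters.
x1 = np.arange(10.0)
x2 = np.array([2., 5., 1., 4., 0., 3., 6., 2., 5., 1.])
y = 3.0 * x1 + np.array([.01, -.01] * 5)
order = forward_select({"x1": x1, "x2": x2}, y)
```

After x1 is entered, x2's F-to-enter is tiny because it is unrelated to what is left of y, so the procedure stops with only x1 in the model.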
Look at the program. The first model (A:) asks for a forward selection analysis.
The SLENTRY= value specifies the significance level for entry into the model. The
defaults are 0.50 for forward selection and 0.15 for fully stepwise selection. I set the
entry level at .05 -- I think that is unreasonably low for a forward selection analysis, but I
wanted to show you a possible consequence of sticking with the .05 criterion.
Look at the output. The Statistics for Entry on page 1 show that all four
predictors meet the criterion for entry. The one which most increases R² is the Average
Rating, so that variable is entered. Now look at the Step 2 Statistics for Entry. The F
values there test the null hypotheses that entering a particular predictor will not change
the R² at all. Notice that all of these F values are less than they were at Step 1,
because each of the predictors is somewhat redundant with the AR variable which is
now in the model. Now look at the Step 3 Statistics for Entry. The F values there are
down again, reflecting additional redundancy with the now-entered GRE_Verbal
predictor. Neither predictor available for entry meets the criterion for entry, so the
analysis stops.
The analysis discussed in this document is appropriate when one wishes to
determine whether the linear relationship between one continuously distributed criterion
variable and one or more continuously distributed predictor variables differs across
levels of a categorical variable. For example, school psychologists often are interested
in whether the predictive validity of a test varies across different groups of children.
Poteat, Wuensch, and Gregg (1988) investigated the relationship between IQ scores
(WISC-R full scale, the predictor variable) and grades in school (the criterion variable) in
independent samples of black and white students who had been referred for special
education evaluation. Within each group (black students and white students) a linear
model for predicting grades from IQ was developed. These two models were then
compared with respect to slopes, intercepts, and scatter about the regression line.
Such an analysis, when done by a school psychologist, is commonly referred to as a
Potthoff (1966) analysis. Poteat et al. found no significant differences between the two
groups they compared, and argued that the predictive validity of the WISC-R does not
differ much between white and black students in the referred population from which the
samples were drawn.
In the simplest case, a Potthoff analysis is essentially a multiple regression
analysis of the following form: Y = a + b1·C + b2·G + b3·CG, where Y is the criterion
variable, C is the continuously distributed predictor variable, G is the dichotomous
grouping variable, and CG is the interaction between C and G. Grouping variables are
commonly dummy coded with K - 1 dichotomous variables (see Chapter 16 of Howell,
2010, for a good introduction to ANOVA and ANCOV as multiple regressions). In the
case where there are only two groups, only one such dummy variable is necessary.
I shall illustrate a Potthoff analysis using data from some of my previous research
on ethical ideology, misanthropy, and attitudes about animals. Clearly this has nothing
to do with differential predictive validity of tests used by school psychologists, but otherwise
the analysis is the same as that which school psychologists call a Potthoff analysis.
First I shall describe the source of the data.
One day as I sat in the living room watching the news on TV there was a story
about some demonstration by animal rights activists. I found myself agreeing with them
to a greater extent than I normally do. While pondering why I found their position more
appealing than usual that evening, I noted that I was also in a rather misanthropic mood
that day. Watching the evening news tends to do that to me; it reminds me of how
selfish, myopic, and ignorant humans are. It occurred to me that there might be an
association between misanthropy and support for animal rights. When evaluating the
ethical status of an action that does some harm to a nonhuman animal, I generally do a
cost/benefit analysis, weighing the benefit to humankind against the cost of harm done
to the nonhuman. When doing such an analysis, if one does not think much of
humankind (is misanthropic), e is unlikely to be able to justify harming nonhumans. To
the extent that one does not like humans, one will not be likely to think that benefits to
humans can justify doing harm to nonhumans.
f is the number of predictors in the full model, r is the number of predictors in the
reduced model. The numerator degrees of freedom is (f - r), the denominator df is
(n - f - 1). The full model MSE is identical to the pooled error variance one would use
for comparing slopes with Howell's t-test (s²_y.x on page 258).

For our data, F(2, 150) = (4.05237 - 2.13252) / [(3 - 1)(.26493)] = 3.623. For the
test of coincidence, F on 4, 745 df, p < .001.
The full model output already shows us that the slopes do not differ significantly,
since p = .1961 for the interaction term.
Testing the null hypothesis of equal intercepts,

F(2, 745) = (1927227.450 - 1926004.152) / [(5 - 3)(124.817)] = 4.900, p = .008.
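This model-comparison F can be wrapped in a small Python helper (an illustrative sketch; the sums of squares and MSE are the flounder values given above):

```python
def partial_F(ss_reg_full, ss_reg_reduced, f, r, mse_full):
    """F for comparing nested models: a full model with f predictors versus a
    reduced model with r predictors, on (f - r) and (n - f - 1) df."""
    return (ss_reg_full - ss_reg_reduced) / ((f - r) * mse_full)

# Test of equal intercepts for the flounder data, using the values above
F_intercepts = partial_F(1927227.450, 1926004.152, f=5, r=3, mse_full=124.817)
# F_intercepts is about 4.90 on 2 and 745 df
```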
Since the slopes do not differ significantly, but the intercepts do, the group
means must differ. When comparing the groups we can either ignore the covariate or
control for it. Look at the ANCOV output. The weights are significantly correlated with
the lengths (p < .001) and the locations differ significantly in lengths, after controlling for
weights (p < .001). The flounder in the sound are significantly longer than those in the
rivers.
Mean Length, Controlling for Weight

Location        Mean Length
Pamlico Sound    347.16 A
Pamlico River    341.98 B
Tar River        338.92 B

Note. Groups with the same letter in their subscripts do not differ
significantly at the .05 level.
Lastly, the ANOVA compares the groups on lengths ignoring weights. The
pattern of results differs when weight is ignored.
Mean Length, Ignoring Weight

Location        Mean Length
Pamlico River    347.29 A
Pamlico Sound    344.73 A
Tar River        296.60 B

Note. Groups with the same letter in their subscripts do not differ
significantly at the .05 level.
A Better Approach When Both Predictors are Continuous
It is usually a bad idea to categorize a continuous variable prior to analysis. For
an introduction to testing interactions between continuous predictor variables, see my
document Continuous Moderator Variables in Multiple Regression Analysis.
References
DeShon, R. P., & Alexander, R. A. (1996). Alternative procedures for testing regression
slope homogeneity when group error variances are unequal. Psychological
Methods, 1, 261-277.
Forsyth, D. R. (1980). A taxonomy of ethical ideologies. Journal of Personality and
Social Psychology, 39, 175-184.
Howell, D. C. (2007). Statistical methods for psychology (6th ed.). Belmont, CA:
Thomson Wadsworth.
Kleinbaum, D. G., & Kupper, L. L. (1978). Applied regression analysis and other
multivariable methods. Boston: Duxbury.
Poteat, G. M., Wuensch, K. L., & Gregg, N. B. (1988). An investigation of differential
prediction with the WISC-R. Journal of School Psychology, 26, 59-68.
Potthoff, R. F. (1966). Statistical aspects of the problem of biases in psychological
tests (Institute of Statistics Mimeo Series No. 479). Chapel Hill: University of North
Carolina, Department of Statistics.
- Learn How to Use SPSS to Make a Dandy Scatterplot Displaying These Results
- Return to Wuensch's Statistics Lessons Page
The url for this document is
http://core.ecu.edu/psyc/wuenschk/MV/MultReg/Potthoff.pdf.
Copyright 2011, Karl L. Wuensch, All Rights Reserved
Logistic-SPSS.docx
Binary Logistic Regression with PASW/SPSS
Logistic regression is used to predict a categorical (usually dichotomous) variable
from a set of predictor variables. With a categorical dependent variable, discriminant
function analysis is usually employed if all of the predictors are continuous and nicely
distributed; logit analysis is usually employed if all of the predictors are categorical; and
logistic regression is often chosen if the predictor variables are a mix of continuous and
categorical variables and/or if they are not nicely distributed (logistic regression makes
no assumptions about the distributions of the predictor variables). Logistic regression
has been especially popular with medical research in which the dependent variable is
whether or not a patient has a disease.
For a logistic regression, the predicted dependent variable is a function of the
probability that a particular subject will be in one of the categories (for example, the
probability that Suzie Cue has the disease, given her set of scores on the predictor
variables).
Description of the Research Used to Generate Our Data
As an example of the use of logistic regression in psychological research,
consider the research done by Wuensch and Poteat and published in the Journal of
Social Behavior and Personality, 1998, 13, 139-150. College students (N = 315) were
asked to pretend that they were serving on a university research committee hearing a
complaint against animal research being conducted by a member of the university
faculty. The complaint included a description of the research in simple but emotional
language. Cats were being subjected to stereotaxic surgery in which a cannula was
implanted into their brains. Chemicals were then introduced into the cats' brains via the
cannula and the cats given various psychological tests. Following completion of testing,
the cats' brains were subjected to histological analysis. The complaint asked that the
researcher's authorization to conduct this research be withdrawn and the cats turned
over to the animal rights group that was filing the complaint. It was suggested that the
research could just as well be done with computer simulations.
In defense of his research, the researcher provided an explanation of how steps
had been taken to assure that no animal felt much pain at any time, an explanation that
computer simulation was not an adequate substitute for animal research, and an
explanation of what the benefits of the research were. Each participant read one of five
different scenarios which described the goals and benefits of the research. They were:
- COSMETIC -- testing the toxicity of chemicals to be used in new lines of hair
care products.
- THEORY -- evaluating two competing theories about the function of a particular
nucleus in the brain.
- MEAT -- testing a synthetic growth hormone said to have the potential of
increasing meat production.
Our model is ln(ODDS) = ln[Y/(1 - Y)] = a + bX, where Y is the predicted probability of
deciding to continue the research, 1 - Y is the predicted probability of the other decision,
and X is our predictor variable, gender. Some statistical programs (such as SAS)
predict the event which is coded with
the smaller of the two numeric codes. By the way, if you have ever wondered what is
"natural" about the natural log, you can find an answer of sorts at
http://www.math.toronto.edu/mathnet/answers/answers_13.html.
Our model will be constructed by an iterative maximum likelihood procedure.
The program will start with arbitrary values of the regression coefficients and will
construct an initial model for predicting the observed data. It will then evaluate errors in
such prediction and change the regression coefficients so as to make the likelihood of the
observed data greater under the new model. This procedure is repeated until the model
converges -- that is, until the differences between the newest model and the previous
model are trivial.
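This iterative procedure can be sketched with Newton-Raphson updates in Python/NumPy (an illustration of the idea, not SPSS's actual implementation; the cell counts are reconstructed from the crosstabulation later in this handout, where 60 of 200 women and 68 of 115 men chose to continue):

```python
import numpy as np

# Gender (0 = female, 1 = male) and decision (1 = continue the research)
x = np.concatenate([np.zeros(200), np.ones(115)])
y = np.concatenate([np.ones(60), np.zeros(140), np.ones(68), np.zeros(47)])
X = np.column_stack([np.ones_like(x), x])      # design matrix: intercept, gender

beta = np.zeros(2)                             # start with arbitrary coefficients
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ beta))        # current predicted probabilities
    step = np.linalg.solve(X.T @ (X * (p * (1 - p))[:, None]),  # Hessian
                           X.T @ (y - p))                       # gradient
    beta = beta + step                         # Newton-Raphson update
    if np.abs(step).max() < 1e-8:              # changes are trivial: converged
        break

a, b = beta   # close to the SPSS estimates, a = -.847 and b = 1.217
```

After a handful of iterations the coefficient changes become trivial and the estimates match the SPSS output shown below.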
Open the data file at http://core.ecu.edu/psyc/wuenschk/SPSS/Logistic.sav.
Click Analyze, Regression, Binary Logistic. Scoot the decision variable into the
Dependent box and the gender variable into the Covariates box. The dialog box should
now look like this:
Click OK.
Look at the statistical output. We see that there are 315 cases used in the
analysis.
The Block 0 output is for a model that includes only the intercept (which SPSS
calls the constant). Given the base rates of the two decision options (187/315 = 59%
decided to stop the research, 41% decided to allow it to continue), and no other
information, the best strategy is to predict, for every case, that the subject will decide to
stop the research. Using that strategy, you would be correct 59% of the time.
Case Processing Summary

Unweighted Cases(a)                        N     Percent
Selected Cases   Included in Analysis     315     100.0
                 Missing Cases              0        .0
                 Total                    315     100.0
Unselected Cases                            0        .0
Total                                     315     100.0

a. If weight is in effect, see classification table for the total number of cases.
Under Variables in the Equation you see that the intercept-only model is
ln(odds) = -.379. If we exponentiate both sides of this expression we find that our
predicted odds [Exp(B)] = .684. That is, the predicted odds of deciding to continue the
research is .684. Since 128 of our subjects decided to continue the research and 187
decided to stop the research, our observed odds are 128/187 = .684.
Now look at the Block 1 output. Here SPSS has added the gender variable as a
predictor. Omnibus Tests of Model Coefficients gives us a Chi-Square of 25.653 on
1 df, significant beyond .001. This is a test of the null hypothesis that adding the gender
variable to the model has not significantly increased our ability to predict the decisions
made by our subjects.
Under Model Summary we see that the -2 Log Likelihood statistic is 399.913.
This statistic measures how poorly the model predicts the decisions -- the smaller
the statistic, the better the model. Although SPSS does not give us this statistic for the
model that had only the intercept, I know it to be 425.666. Adding the gender variable
reduced the -2 Log Likelihood statistic by 425.666 - 399.913 = 25.653, the χ² statistic
we just discussed in the previous paragraph. The Cox & Snell R² can be interpreted
like R² in a multiple regression, but cannot reach a maximum value of 1. The
Nagelkerke R² can reach a maximum of 1.
Classification Table(a,b)

                                   Predicted
                               decision          Percentage
Observed                       stop   continue   Correct
Step 0   decision   stop        187      0        100.0
                    continue    128      0           .0
         Overall Percentage                        59.4

a. Constant is included in the model.
b. The cut value is .500

Variables in the Equation

                     B      S.E.     Wald    df   Sig.   Exp(B)
Step 0   Constant  -.379    .115   10.919     1   .001    .684
Omnibus Tests of Model Coefficients

                  Chi-square   df   Sig.
Step 1   Step       25.653      1   .000
         Block      25.653      1   .000
         Model      25.653      1   .000
The Variables in the Equation output shows us that the regression equation is
ln(ODDS) = -.847 + 1.217·Gender.
We can now use this model to predict the odds that a subject of a given gender
will decide to continue the research. The odds prediction equation is ODDS = e^(a+bX).
If our subject is a woman (gender = 0), then ODDS = e^(-.847+1.217(0)) = e^(-.847) = 0.429.
That is, a woman is only .429 as likely to decide to continue the research as she is to
decide to stop the research. If our subject is a man (gender = 1), then
ODDS = e^(-.847+1.217(1)) = e^(.37) = 1.448. That is, a man is 1.448 times more likely to decide
to continue the research than to decide to stop the research.
We can easily convert odds to probabilities. For our women,
Y = ODDS/(1 + ODDS) = 0.429/1.429 = 0.30. That is, our model predicts that 30% of women will
decide to continue the research. For our men, Y = ODDS/(1 + ODDS) = 1.448/2.448 = 0.59. That is,
our model predicts that 59% of men will decide to continue the research.
The Variables in the Equation output also gives us the Exp(B). This is better
known as the odds ratio predicted by the model. This odds ratio can be computed by
raising the base of the natural log to the b-th power, where b is the slope from our
logistic regression equation. For our model, e^1.217 = 3.376. That tells us that the
model predicts that the odds of deciding to continue the research are 3.376 times higher
for men than they are for women. For the men, the odds are 1.448, and for the women
they are 0.429. The odds ratio is 1.448 / 0.429 = 3.376.
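These conversions are easy to verify in Python (an illustrative check using the coefficients from the SPSS output above):

```python
import math

a, b = -0.847, 1.217                     # coefficients from the logistic regression

odds_women = math.exp(a + b * 0)         # e^-.847, about 0.429
odds_men = math.exp(a + b * 1)           # e^.37, about 1.448
p_women = odds_women / (1 + odds_women)  # about .30
p_men = odds_men / (1 + odds_men)        # about .59
odds_ratio = math.exp(b)                 # Exp(B), about 3.376
```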
The results of our logistic regression can be used to classify subjects with
respect to what decision we think they will make. As noted earlier, our model leads to
the prediction that the probability of deciding to continue the research is 30% for women
and 59% for men. Before we can use this information to classify subjects, we need to
Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1          399.913(a)             .078                   .106

a. Estimation terminated at iteration number 3 because parameter estimates
changed by less than .001.

Variables in the Equation(a)

                     B      S.E.     Wald    df   Sig.   Exp(B)
Step 1   gender    1.217    .245   24.757     1   .000   3.376
         Constant  -.847    .154   30.152     1   .000    .429

a. Variable(s) entered on step 1: gender.
have a decision rule. Our decision rule will take the following form: If the probability of
the event is greater than or equal to some threshold, we shall predict that the event will
take place. By default, SPSS sets this threshold to .5. While that seems reasonable, in
many cases we may want to set it higher or lower than .5. More on this later. Using the
default threshold, SPSS will classify a subject into the Continue the Research category
if the estimated probability is .5 or more, which it is for every male subject. SPSS will
classify a subject into the Stop the Research category if the estimated probability is
less than .5, which it is for every female subject.
The Classification Table shows us that this rule allows us to correctly classify
68 / 128 = 53% of the subjects where the predicted event (deciding to continue the
research) was observed. This is known as the sensitivity of prediction, the P(correct |
event did occur), that is, the percentage of occurrences correctly predicted. We also
see that this rule allows us to correctly classify 140 / 187 = 75% of the subjects where
the predicted event was not observed. This is known as the specificity of prediction,
the P(correct | event did not occur), that is, the percentage of nonoccurrences correctly
predicted. Overall our predictions were correct 208 out of 315 times, for an overall
success rate of 66%. Recall that it was only 59% for the model with intercept only.
We could focus on error rates in classification. A false positive would be
predicting that the event would occur when, in fact, it did not. Our decision rule
predicted a decision to continue the research 115 times. That prediction was wrong 47
times, for a false positive rate of 47 / 115 = 41%. A false negative would be predicting
that the event would not occur when, in fact, it did occur. Our decision rule predicted a
decision not to continue the research 200 times. That prediction was wrong 60 times,
for a false negative rate of 60 / 200 = 30%.
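The classification rates just described follow directly from the counts in the classification table (a quick Python check of the arithmetic):

```python
# Counts from the classification table (cut value .500)
stop_stop, stop_continue = 140, 47   # observed stop: predicted stop / continue
cont_stop, cont_continue = 60, 68    # observed continue: predicted stop / continue
n = stop_stop + stop_continue + cont_stop + cont_continue   # 315

sensitivity = cont_continue / (cont_stop + cont_continue)   # 68/128, P(correct | event occurred)
specificity = stop_stop / (stop_stop + stop_continue)       # 140/187, P(correct | no event)
overall = (stop_stop + cont_continue) / n                   # 208/315
false_pos_rate = stop_continue / (stop_continue + cont_continue)  # 47/115
false_neg_rate = cont_stop / (stop_stop + cont_stop)              # 60/200
```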
It has probably occurred to you that you could have used a simple Pearson Chi-
Square Contingency Table Analysis to answer the question of whether or not there is
a significant relationship between gender and decision about the animal research. Let
us take a quick look at such an analysis. In SPSS click Analyze, Descriptive
Statistics, Crosstabs. Scoot gender into the rows box and decision into the columns
box. The dialog box should look like this:
Classification Table(a)

                                   Predicted
                               decision          Percentage
Observed                       stop   continue   Correct
Step 1   decision   stop        140     47         74.9
                    continue     60     68         53.1
         Overall Percentage                        66.0

a. The cut value is .500
Now click the Statistics box. Check Chi-Square and then click Continue.
Now click the Cells box. Check Observed Counts and Row Percentages and
then click Continue.
Back on the initial page, click OK.
In the Crosstabulation output you will see that 59% of the men and 30% of the
women decided to continue the research, just as predicted by our logistic regression.
You will also notice that the Likelihood Ratio Chi-Square is 25.653 on 1 df, the
same test of significance we got from our logistic regression, and the Pearson Chi-
Square is almost the same (25.685). If you are thinking, "Hey, this logistic regression is
nearly equivalent to a simple Pearson Chi-Square," you are correct, in this simple case.
Remember, however, that we can add additional predictor variables, and those
additional predictors can be either categorical or continuous -- you can't do that with a
simple Pearson Chi-Square.
Multiple Predictors, Both Categorical and Continuous
Now let us conduct an analysis that will better tap the strengths of logistic
regression. Click Analyze, Regression, Binary Logistic. Scoot the decision variable
into the Dependent box and gender, idealism, and relatvsm into the Covariates box.
gender * decision Crosstabulation

                                     decision
                                  stop    continue    Total
gender   Female   Count            140       60        200
                  % within gender  70.0%    30.0%     100.0%
         Male     Count             47       68        115
                  % within gender  40.9%    59.1%     100.0%
Total             Count            187      128        315
                  % within gender  59.4%    40.6%     100.0%
Chi-Square Tests

                       Value      df   Asymp. Sig. (2-sided)
Pearson Chi-Square    25.685(b)    1          .000
Likelihood Ratio      25.653       1          .000
N of Valid Cases         315

a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected
count is 46.73.
Click Options and check Hosmer-Lemeshow goodness of fit and CI for exp(B)
95%.
Continue, OK. Look at the output.
In the Block 1 output, notice that the -2 Log Likelihood statistic has dropped to
346.503, indicating that our expanded model is doing a better job at predicting decisions
than was our one-predictor model. The R² statistics have also increased.
We can test the significance of the difference between any two models, as long
as one model is nested within the other. Our one-predictor model had a
Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1          346.503(a)             .222                   .300

a. Estimation terminated at iteration number 4 because parameter estimates
changed by less than .001.
-2 Log Likelihood statistic of 399.913. Adding the ethical ideology variables (idealism
and relatvsm) produced a decrease of 53.41. This difference is a χ² on 2 df (one df for
each predictor variable).
To determine the p value associated with this χ², just click Transform,
Compute. Enter the letter p in the Target Variable box. In the Numeric Expression
box, type 1-CDF.CHISQ(53.41,2). The dialog box should look like this:
Click OK and then go to the SPSS Data Editor, Data View. You will find a new
column, p, with the value of .00 in every cell. If you go to the Variable View and set the
number of decimal points to 5 for the p variable you will see that the value of p
is .00000. We conclude that adding the ethical ideology variables significantly
improved the model, χ²(2, N = 315) = 53.41, p < .001.
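That p value can also be checked by hand: for 2 df the chi-square survival function has the closed form exp(-x/2), so SPSS's 1-CDF.CHISQ(53.41,2) reduces to a single exponential (a quick Python check):

```python
import math

# Drop in -2 Log Likelihood when the two ideology variables are added
chi2 = 399.913 - 346.503          # 53.41 on 2 df

# For df = 2, P(chi-square > x) = exp(-x/2) exactly, which is the quantity
# SPSS computes as 1-CDF.CHISQ(53.41,2).
p = math.exp(-chi2 / 2)           # about 2.5e-12, displayed by SPSS as .00000
```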
Note that our overall success rate in classification has improved from 66% to
71%.
The Hosmer-Lemeshow test evaluates the null hypothesis that there is a linear
relationship between the predictor variables and the log odds of the criterion variable.
Cases are arranged in order by their predicted probability on the criterion variable.
These ordered cases are then divided into ten groups (lowest decile [prob < .1] to
highest decile [prob > .9]). Each of these groups is then divided into two groups on the
basis of actual score on the criterion variable. This results in a 2 x 10 contingency table.
Expected frequencies are computed based on the assumption that there is a linear
relationship between the weighted combination of the predictor variables and the log
odds of the criterion variable. For the outcome = no (decision = stop for our data)
column, the expected frequencies will run from high (for the lowest decile) to low (for the
highest decile). For the outcome = yes column the frequencies will run from low to high.
Classification Table(a)

                                         Predicted
                                   decision            Percentage
          Observed               stop    continue        Correct
Step 1    decision   stop         151       36            80.7
                     continue      55       73            57.0
          Overall Percentage                              71.1
a. The cut value is .500
A chi-square statistic is computed comparing the observed frequencies with those
expected under the linear model. A nonsignificant chi-square indicates that the data fit
the model well.
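The statistic can be reproduced by hand from the observed and expected frequencies in the Contingency Table for Hosmer and Lemeshow Test that SPSS prints. A quick Python check, with the values copied from that output:

```python
# Observed/expected frequencies for decision = "stop" and decision = "continue"
# in each of the ten deciles of the Hosmer-Lemeshow contingency table.
cells = [
    (29, 29.331, 3, 2.669),
    (30, 27.673, 2, 4.327),
    (28, 25.669, 4, 6.331),
    (20, 23.265, 12, 8.735),
    (22, 20.693, 10, 11.307),
    (15, 18.058, 17, 13.942),
    (15, 15.830, 17, 16.170),
    (10, 12.920, 22, 19.080),
    (12, 9.319, 20, 22.681),
    (6, 4.241, 21, 22.759),
]
# Pearson-style chi-square summed over all 20 cells of the 10 x 2 table
hl = sum((o1 - e1) ** 2 / e1 + (o2 - e2) ** 2 / e2 for o1, e1, o2, e2 in cells)
print(round(hl, 2))  # 8.81, matching SPSS's 8.810 within rounding; df = 10 - 2 = 8
```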
Using a K > 2 Categorical Predictor
We can use a categorical predictor that has more than two levels. For our data,
the stated purpose of the research is such a predictor. While SPSS can dummy code
such a predictor for you, I prefer to set up my own dummy variables. You will need K-1
dummy variables to represent K groups. Since we have five levels of purpose of the
research, we shall need 4 dummy variables. Each of the subjects will have a score of
either 0 or 1 on each of the dummy variables. For each dummy variable a score of 0
will indicate that the subject does not belong to the group represented by that dummy
variable and a score of 1 will indicate that the subject does belong to the group
represented by that dummy variable. One of the groups will not be represented by a
dummy variable. If it is reasonable to consider one of your groups as a reference
group to which each other group should be compared, make that group the one
which is not represented by a dummy variable.
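The coding scheme described above can be sketched in a few lines of Python; the variable names match the handout's dummies, but the function itself is purely illustrative:

```python
# K-1 dummy coding for the five-level scenario variable,
# with "medical" as the reference group.
LEVELS = ["cosmetic", "theory", "meat", "veterin"]  # medical gets no dummy

def dummy_code(scenario):
    """Return the four dummy-variable scores for one subject's scenario."""
    return {level: int(scenario == level) for level in LEVELS}

print(dummy_code("cosmetic"))  # {'cosmetic': 1, 'theory': 0, 'meat': 0, 'veterin': 0}
print(dummy_code("medical"))   # all four dummies are 0: the reference group
```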
I decided that I wanted to compare each of the cosmetic, theory, meat, and
veterinary groups with the medical group, so I set up a dummy variable for each of the
groups except the medical group. Take a look at our data in the data editor. Notice
that the first subject has a score of 1 for the cosmetic dummy variable and 0 for the
other three dummy variables. That subject was told that the purpose of the research
was to test the safety of a new ingredient in hair care products. Now scoot to the
bottom of the data file. The last subject has a score of 0 for each of the four dummy
variables. That subject was told that the purpose of the research was to evaluate a
treatment for a debilitating disease that afflicts humans of college age.

Hosmer and Lemeshow Test

Step   Chi-square   df   Sig.
1        8.810       8   .359

Contingency Table for Hosmer and Lemeshow Test

             decision = stop        decision = continue
            Observed  Expected     Observed  Expected     Total
Step 1  1      29      29.331          3       2.669        32
        2      30      27.673          2       4.327        32
        3      28      25.669          4       6.331        32
        4      20      23.265         12       8.735        32
        5      22      20.693         10      11.307        32
        6      15      18.058         17      13.942        32
        7      15      15.830         17      16.170        32
        8      10      12.920         22      19.080        32
        9      12       9.319         20      22.681        32
        10      6       4.241         21      22.759        27
Click Analyze, Regression, Binary Logistic and add to the list of covariates the
four dummy variables. You should now have the decision variable in the Dependent
box and all of the other variables (but not the p value column) in the Covariates box.
Click OK.
The Block 0 Variables not in the Equation output shows how much the -2LL would drop
if a single predictor were added to the model (which already has the intercept).
Look at the output, Block 1. Under Omnibus Tests of Model Coefficients we
see that our latest model is significantly better than a model with only the intercept.
Under Model Summary we see that our R² statistics have increased again and
the -2 Log Likelihood statistic has dropped from 346.503 to 338.060. Is this drop
statistically significant? The χ² is the difference between the two -2 log likelihood
values, 8.443, on 4 df (one df for each dummy variable). Using 1-CDF.CHISQ(8.443,4),
we obtain an upper-tailed p of .0766, short of the usual standard of statistical
significance. I shall, however, retain these dummy variables, since I have an a priori
interest in the comparison made by each dummy variable.
Variables not in the Equation

                               Score   df   Sig.
Step 0  Variables  gender     25.685    1   .000
                   idealism   47.679    1   .000
                   relatvsm    7.239    1   .007
                   cosmetic     .003    1   .955
                   theory      2.933    1   .087
                   meat         .556    1   .456
                   veterin      .013    1   .909
        Overall Statistics    77.665    7   .000

Omnibus Tests of Model Coefficients

                Chi-square   df   Sig.
Step 1  Step      87.506      7   .000
        Block     87.506      7   .000
        Model     87.506      7   .000

Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1          338.060(a)             .243                  .327
a. Estimation terminated at iteration number 5 because parameter estimates changed by less than .001.
In the Classification Table, we see a small increase in our overall success rate,
from 71% to 72%.
I would like you to compute the values for Sensitivity, Specificity, False Positive
Rate, and False Negative Rate for this model, using the default .5 cutoff.
- Sensitivity -- percentage of occurrences correctly predicted
- Specificity -- percentage of nonoccurrences correctly predicted
- False Positive Rate -- percentage of predicted occurrences which are incorrect
- False Negative Rate -- percentage of predicted nonoccurrences which are incorrect
Remember that the predicted event was a decision to continue the research.
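As a worked illustration of those definitions, the four statistics can be computed in Python from the counts in the model's classification table at the .5 cutoff (152 and 35 observed stops, 54 and 74 observed continues):

```python
# Classification counts from the seven-predictor model (cutoff = .5):
# rows = observed decision, columns = predicted decision.
stop_stop, stop_cont = 152, 35  # observed "stop" predicted stop / continue
cont_stop, cont_cont = 54, 74   # observed "continue" predicted stop / continue

sensitivity = 100 * cont_cont / (cont_cont + cont_stop)  # occurrences correctly predicted
specificity = 100 * stop_stop / (stop_stop + stop_cont)  # nonoccurrences correctly predicted
false_pos = 100 * stop_cont / (stop_cont + cont_cont)    # predicted occurrences that are wrong
false_neg = 100 * cont_stop / (cont_stop + stop_stop)    # predicted nonoccurrences that are wrong

print(round(sensitivity, 1), round(specificity, 1),
      round(false_pos, 1), round(false_neg, 1))  # 57.8 81.3 32.1 26.2
```

Note that sensitivity and specificity match the 57.8% and 81.3% row percentages SPSS prints for that table.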
Under Variables in the Equation we are given regression coefficients and odds
ratios.

Classification Table(a)

                                         Predicted
                                   decision            Percentage
          Observed               stop    continue        Correct
Step 1    decision   stop         152       35            81.3
                     continue      54       74            57.8
          Overall Percentage                              71.7
a. The cut value is .500

Variables in the Equation

                                                        95.0% C.I. for EXP(B)
                       B      Wald    df   Sig.   Exp(B)    Lower    Upper
Step 1(a)  gender     1.255  20.586    1   .000    3.508    2.040    6.033
           idealism   -.701  37.891    1   .000     .496     .397     .620
           relatvsm    .326   6.634    1   .010    1.386    1.081    1.777
           cosmetic   -.709   2.850    1   .091     .492     .216    1.121
           theory    -1.160   7.346    1   .007     .314     .136     .725
           meat       -.866   4.164    1   .041     .421     .183     .966
           veterin    -.542   1.751    1   .186     .581     .260    1.298
           Constant   2.279   4.867    1   .027    9.766
a. Variable(s) entered on step 1: gender, idealism, relatvsm, cosmetic, theory, meat, veterin.

We are also given a statistic I have ignored so far, the Wald Chi-Square statistic,
which tests the unique contribution of each predictor in the context of the other
predictors -- that is, holding constant the other predictors -- that is, eliminating any
overlap between predictors. Notice that each predictor meets the conventional .05
standard for statistical significance, except for the dummy variables for cosmetic
research and for veterinary research. I should note that the Wald χ² has been criticized
for being too conservative, that is, lacking adequate power. An alternative would be to
test the significance of each predictor by eliminating it from the full model and testing
the significance of the increase in the -2 log likelihood statistic for the reduced model.
That would, of course, require that you construct p+1 models, where p is the number of
predictor variables.
Let us now interpret the odds ratios.
- The .496 odds ratio for idealism indicates that the odds of approval are more than
cut in half for each one point increase in respondents' idealism scores. Inverting this
odds ratio for easier interpretation, for each one point increase on the idealism scale
there was a doubling of the odds that the respondent would not approve the
research.
- Relativism's effect is smaller, and in the opposite direction, with a one point increase
on the nine-point relativism scale being associated with the odds of approving the
research increasing by a multiplicative factor of 1.39.
- The odds ratios of the scenario dummy variables compare each scenario except
medical to the medical scenario. For the theory dummy variable, the .314 odds ratio
means that the odds of approval of theory-testing research are only .314 times those
of medical research.
- Inverted odds ratios for the dummy variables coding the effect of the scenario
variable indicated that the odds of approval for the medical scenario were 2.38 times
higher than for the meat scenario and 3.22 times higher than for the theory scenario.
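The inversions described above are simple reciprocals of the Exp(B) values. A quick Python check (note the small discrepancy for the theory scenario, where the handout's 3.22 comes from unrounded coefficients):

```python
# Exp(B) values from the Variables in the Equation table
or_idealism = 0.496
print(round(1 / or_idealism, 2))  # 2.02: odds of NOT approving roughly double per point

# For the scenario dummies, inverting compares medical to the coded group.
or_meat = 0.421
print(round(1 / or_meat, 2))      # 2.38: medical approval odds vs. meat
or_theory = 0.314
print(round(1 / or_theory, 2))    # 3.18 from the rounded Exp(B) (3.22 from unrounded values)
```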
Let us now revisit the issue of the decision rule used to determine into which
group to classify a subject given that subject's estimated probability of group
membership. While the most obvious decision rule would be to classify the subject into
the target group if p > .5 and into the other group if p < .5, you may well want to choose
a different decision rule given the relative seriousness of making one sort of error (for
example, declaring a patient to have breast cancer when she does not) or the other sort
of error (declaring the patient not to have breast cancer when she does).
Repeat our analysis with classification done with a different decision rule. Click
Analyze, Regression, Binary Logistic, Options. In the resulting dialog window,
change the Classification Cutoff from .5 to .4.
Click Continue, OK.
Now SPSS will classify a subject into the "Continue the Research" group if the
estimated probability of membership in that group is .4 or higher, and into the "Stop the
Research" group otherwise. Take a look at the classification output and see how the
change in cutoff has changed the classification results. Fill in the table below to
compare the two models with respect to classification statistics.
Value                  Cutoff = .5   Cutoff = .4
Sensitivity
Specificity
False Positive Rate
False Negative Rate
Overall % Correct
SAS makes it much easier to see the effects of the decision rule on sensitivity
etc. Using the ctable option, one gets output like this:
------------------------------------------------------------------------------------
The LOGISTIC Procedure
Classification Table
Correct Incorrect Percentages
Prob Non- Non- Sensi- Speci- False False
Level Event Event Event Event Correct tivity ficity POS NEG
0.160 123 56 131 5 56.8 96.1 29.9 51.6 8.2
0.180 122 65 122 6 59.4 95.3 34.8 50.0 8.5
0.200 120 72 115 8 61.0 93.8 38.5 48.9 10.0
0.220 116 84 103 12 63.5 90.6 44.9 47.0 12.5
0.240 113 93 94 15 65.4 88.3 49.7 45.4 13.9
0.260 110 100 87 18 66.7 85.9 53.5 44.2 15.3
0.280 108 106 81 20 67.9 84.4 56.7 42.9 15.9
0.300 105 108 79 23 67.6 82.0 57.8 42.9 17.6
0.320 103 115 72 25 69.2 80.5 61.5 41.1 17.9
0.340 100 118 69 28 69.2 78.1 63.1 40.8 19.2
0.360 97 120 67 31 68.9 75.8 64.2 40.9 20.5
0.380 96 124 63 32 69.8 75.0 66.3 39.6 20.5
0.400 94 130 57 34 71.1 73.4 69.5 37.7 20.7
0.420 88 134 53 40 70.5 68.8 71.7 37.6 23.0
0.440 86 140 47 42 71.7 67.2 74.9 35.3 23.1
0.460 79 141 46 49 69.8 61.7 75.4 36.8 25.8
0.480 75 144 43 53 69.5 58.6 77.0 36.4 26.9
0.500 71 147 40 57 69.2 55.5 78.6 36.0 27.9
0.520 69 152 35 59 70.2 53.9 81.3 33.7 28.0
0.540 67 157 30 61 71.1 52.3 84.0 30.9 28.0
0.560 65 159 28 63 71.1 50.8 85.0 30.1 28.4
0.580 61 159 28 67 69.8 47.7 85.0 31.5 29.6
0.600 56 162 25 72 69.2 43.8 86.6 30.9 30.8
0.620 50 165 22 78 68.3 39.1 88.2 30.6 32.1
0.640 48 166 21 80 67.9 37.5 88.8 30.4 32.5
0.660 43 170 17 85 67.6 33.6 90.9 28.3 33.3
0.680 40 170 17 88 66.7 31.3 90.9 29.8 34.1
0.700 36 173 14 92 66.3 28.1 92.5 28.0 34.7
0.720 30 177 10 98 65.7 23.4 94.7 25.0 35.6
0.740 28 178 9 100 65.4 21.9 95.2 24.3 36.0
0.760 23 180 7 105 64.4 18.0 96.3 23.3 36.8
0.780 22 180 7 106 64.1 17.2 96.3 24.1 37.1
0.800 18 181 6 110 63.2 14.1 96.8 25.0 37.8
0.820 17 182 5 111 63.2 13.3 97.3 22.7 37.9
0.840 13 184 3 115 62.5 10.2 98.4 18.8 38.5
0.860 12 185 2 116 62.5 9.4 98.9 14.3 38.5
0.880 8 185 2 120 61.3 6.3 98.9 20.0 39.3
0.900 7 185 2 121 61.0 5.5 98.9 22.2 39.5
0.920 5 187 0 123 61.0 3.9 100.0 0.0 39.7
0.940 1 187 0 127 59.7 0.8 100.0 0.0 40.4
0.960 1 187 0 127 59.7 0.8 100.0 0.0 40.4
0.980 0 187 0 128 59.4 0.0 100.0 . 40.6
------------------------------------------------------------------------------------
The classification results given by SAS are a little less impressive because SAS
uses a jackknifed classification procedure. Classification results are biased when the
coefficients used to classify a subject were developed, in part, with data provided by
that same subject. SPSS' classification results do not remove such bias. With
jackknifed classification, SAS eliminates the subject currently being classified when
computing the coefficients used to classify that subject. Of course, this procedure is
computationally more intense than that used by SPSS. If you would like to learn more
about conducting logistic regression with SAS, see my document at
http://core.ecu.edu/psyc/wuenschk/MV/MultReg/Logistic-SAS.doc.
Beyond An Introduction to Logistic Regression
I have left out of this handout much about logistic regression. We could consider
logistic regression with a criterion variable with more than two levels, with that variable
being either qualitative or ordinal. We could consider testing of interaction terms. We
could consider sequential and stepwise construction of logistic models. We could talk
about detecting outliers among the cases, dealing with multicollinearity and nonlinear
relationships between predictors and the logit, and so on. If you wish to learn more
about logistic regression, I recommend, as a starting point, Chapter 10 in Using
Multivariate Statistics, 5th edition, by Tabachnick and Fidell (Pearson, 2007).
Presenting the Results
Let me close with an example of how to present the results of a logistic
regression. In the example below you will see that I have included both the multivariate
analysis (logistic regression) and univariate analysis. I assume that you all already
know how to conduct the univariate analyses I present below.
Table 1

Effect of Scenario on Percentage of Participants Voting to Allow the Research to
Continue and Participants' Mean Justification Score

Scenario      Percentage Support
Theory               31
Meat                 37
Cosmetic             40
Veterinary           41
Medical              54
As shown in Table 1, only the medical research received support from a majority
of the respondents. Overall a majority of respondents (59%) voted to stop the research.
Logistic regression analysis was employed to predict the probability that a participant
would approve continuation of the research. The predictor variables were participants'
gender, idealism, relativism, and four dummy variables coding the scenario. A test of
the full model versus a model with intercept only was statistically significant, χ²(7, N =
315) = 87.51, p < .001. The model was able correctly to classify 73% of those who
approved the research and 70% of those who did not, for an overall success rate of
71%.
Table 2 shows the logistic regression coefficient, Wald test, and odds ratio for
each of the predictors. Employing a .05 criterion of statistical significance, gender,
idealism, relativism, and two of the scenario dummy variables had significant partial
effects. The odds ratio for gender indicates that when holding all other variables
constant, a man is 3.5 times more likely to approve the research than is a woman.
Inverting the odds ratio for idealism reveals that for each one point increase on the nine-
point idealism scale there is a doubling of the odds that the participant will not approve
the research. Although significant, the effect of relativism was much smaller than that of
idealism, with a one point increase on the nine-point relativism scale being associated
with the odds of approving the research increasing by a multiplicative factor of 1.39.
The scenario variable was dummy coded using the medical scenario as the reference
group. Only the theory and the meat scenarios were approved significantly less than
the medical scenario. Inverted odds ratios for these dummy variables indicate that the
odds of approval for the medical scenario were 2.38 times higher than for the meat
scenario and 3.22 times higher than for the theory scenario.
Table 2

Logistic Regression Predicting Decision From Gender, Ideology, and Scenario

Predictor       B      Wald χ²      p      Odds Ratio
Gender         1.25     20.59    < .001       3.51
Idealism      -0.70     37.89    < .001       0.50
Relativism     0.33      6.63      .01        1.39
Scenario
  Cosmetic    -0.71      2.85      .091       0.49
  Theory      -1.16      7.35      .007       0.31
  Meat        -0.87      4.16      .041       0.42
  Veterinary  -0.54      1.75      .186       0.58
Univariate analysis indicated that men were significantly more likely to approve
the research (59%) than were women (30%), χ²(1, N = 315) = 25.68, p < .001; that
those who approved the research were significantly less idealistic (M = 5.87, SD = 1.23)
than those who didn't (M = 6.92, SD = 1.22), t(313) = 7.47, p < .001; that those who
approved the research were significantly more relativistic (M = 6.26, SD = 0.99) than
those who didn't (M = 5.91, SD = 1.19), t(313) = 2.71, p = .007; and that the omnibus
effect of scenario fell short of significance, χ²(4, N = 315) = 7.44, p = .11.
Interaction Terms
Interaction terms can be included in a logistic model. When the variables in an
interaction are continuous they probably should be centered. Consider the following
research: Mock jurors are presented with a criminal case in which there is some doubt
about the guilt of the defendant. For half of the jurors the defendant is physically
attractive, for the other half she is plain. Half of the jurors are asked to recommend a
verdict without having deliberated, the other half are asked about their recommendation
only after a short deliberation with others. The deliberating mock jurors were primed
with instructions predisposing them to change their opinion if convinced by the
arguments of others. We could use a logit analysis here, but elect to use a logistic
regression instead. The article in which these results were published is: Patry, M. W.
(2008). Attractive but guilty: Deliberation and the physical attractiveness bias.
Psychological Reports, 102, 727-733.
The data are in Logistic2x2x2.sav at my SPSS Data Page. Download the data
and bring them into SPSS. Each row in the data file represents one cell in the three-
way contingency table. Freq is the number of scores in the cell.
Tell SPSS to weight cases by Freq. Data, Weight Cases:
Analyze, Regression, Binary Logistic. Slide Guilty into the Dependent box and
Delib and Plain into the Covariates box. Highlight both Delib and Plain in the pane on
the left and then click the >a*b> box.
This creates the interaction term. It could also be created by simply creating a
new variable, Interaction = Delib*Plain.
Under Options, ask for the Hosmer-Lemeshow test and confidence intervals on
the odds ratios.
You will find that the odds ratios are .338 for Delib, 3.134 for Plain, and 0.030 for
the interaction.
Those who deliberated were less likely to suggest a guilty verdict (15%) than
those who did not deliberate (66%), but this (partial) effect fell just short of statistical
significance in the logistic regression (but a 2 x 2 chi-square would show it to be
significant).
Plain defendants were significantly more likely (43%) than physically attractive
defendants (39%) to be found guilty. This effect would fall well short of statistical
significance with a 2 x 2 chi-square.
We should not pay much attention to the main effects, given that the interaction
is powerful.
The interaction odds ratio can be simply computed, by hand, from the cell
frequencies.
- For those who did deliberate, the odds of a guilty verdict are 1/29 when the
defendant was plain and 8/22 when she was attractive, yielding a conditional
odds ratio of 0.09483.
- For those who did not deliberate, the odds of a guilty verdict are 27/8 when the
defendant was plain and 14/13 when she was attractive, yielding a conditional
odds ratio of 3.1339.
- The interaction odds ratio is simply the ratio of these conditional odds ratios --
that is, .09483/3.1339 = 0.030.

Variables in the Equation

                                                         95.0% C.I. for EXP(B)
                           Wald   df   Sig.   Exp(B)      Lower     Upper
Step 1(a)  Delib           3.697   1   .054     .338       .112     1.021
           Plain           4.204   1   .040    3.134      1.052     9.339
           Delib by Plain  8.075   1   .004     .030       .003      .338
           Constant         .037   1   .847    1.077
a. Variable(s) entered on step 1: Delib, Plain, Delib * Plain.

Plain * Guilty Crosstabulation (Delib = Yes)

                                       Guilty
                                   No      Yes      Total
Plain   Attractive  Count           22       8        30
                    % within Plain 73.3%   26.7%    100.0%
        Plain       Count           29       1        30
                    % within Plain 96.7%    3.3%    100.0%
Total               Count           51       9        60
                    % within Plain 85.0%   15.0%    100.0%
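These hand calculations can be verified in a few lines of Python, using the cell counts from the two crosstabulations:

```python
# Odds of a guilty verdict in each cell (guilty count / not-guilty count)
odds_delib_plain = 1 / 29      # deliberated, plain defendant
odds_delib_attract = 8 / 22    # deliberated, attractive defendant
odds_nodelib_plain = 27 / 8    # no deliberation, plain defendant
odds_nodelib_attract = 14 / 13 # no deliberation, attractive defendant

cond_or_delib = odds_delib_plain / odds_delib_attract       # conditional OR, deliberated
cond_or_nodelib = odds_nodelib_plain / odds_nodelib_attract # conditional OR, not deliberated
interaction_or = cond_or_delib / cond_or_nodelib            # ratio of conditional ORs

print(round(cond_or_delib, 5), round(cond_or_nodelib, 4),
      round(interaction_or, 3))  # 0.09483 3.1339 0.03, matching Delib by Plain Exp(B)
```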
Follow-up analysis shows that among those who did not deliberate the plain
defendant was found guilty significantly more often than the attractive defendant, χ²(1, N
= 62) = 4.353, p = .037, but among those who did deliberate the attractive defendant
was found guilty significantly more often than the plain defendant, χ²(1, N = 60) = 6.405,
p = .011.

Plain * Guilty Crosstabulation (Delib = No)

                                       Guilty
                                   No      Yes      Total
Plain   Attractive  Count           13      14        27
                    % within Plain 48.1%   51.9%    100.0%
        Plain       Count            8      27        35
                    % within Plain 22.9%   77.1%    100.0%
Total               Count           21      41        62
                    % within Plain 33.9%   66.1%    100.0%

Interaction Between a Dichotomous Predictor and a Continuous Predictor

Suppose that I had some reason to suspect that the effect of idealism differed
between men and women. I can create the interaction term just as shown above.
Variables in the Equation
B S.E. Wald df Sig. Exp(B)
Step 1
a
idealism -.773 .145 28.572 1 .000 .461
gender -.530 1.441 .135 1 .713 .589
gender by idealism .268 .223 1.439 1 .230 1.307
Constant 4.107 .921 19.903 1 .000 60.747
a. Variable(s) entered on step 1: idealism, gender, gender * idealism .
As you can see, the interaction falls short of significance.
Partially Standardized B Weights and Odds Ratios

The value of a predictor's B (and the associated odds ratio) is highly dependent
on the unit of measure. Suppose I am predicting whether or not an archer hits the
target. One predictor is distance to the target. Another is how much training the archer
has had. Suppose I measure distance in inches and training in years. I would not
expect much of an increase in the logit when decreasing distance by an inch, but I
would expect a considerable increase when increasing training by a year. Suppose I
measured distance in miles and training in seconds. Now I would expect a large B for
distance and a small B for training. For purposes of making comparisons between the
predictors, it may be helpful to standardize the B weights.

Suppose that a third predictor is the archer's score on a survey of political
conservatism and that a photo of Karl Marx appears on the target. The unit of measure
here is not intrinsically meaningful -- how much is a one point change in score on this
survey? Here too it may be helpful to standardize the predictors. Menard (The
American Statistician, 2004, 58, 218-223) discussed several ways to standardize B
weights. I favor simply standardizing the predictor, which can be simply accomplished
by converting the predictor scores to z scores or by multiplying the unstandardized B
weight by the predictor's standard deviation. While one could also standardize the
dichotomous outcome variable (group membership), I prefer to leave that
unstandardized.
In research here at ECU, Cathy Hall gathered data that is useful in predicting
who will be retained in our engineering program. Among the predictor variables are
high school GPA, score on the quantitative section of the SAT, and one of the Big Five
personality measures, openness to experience. Here are the results of a binary logistic
regression predicting retention from high school GPA, quantitative SAT, and openness
(you can find more detail here).
              Unstandardized           Standardized
Predictor     B       Odds Ratio      B       Odds Ratio
HS-GPA      1.296       3.656       0.510       1.665
SAT-Q       0.006       1.006       0.440       1.553
Openness    0.100       1.105       0.435       1.545
The novice might look at the unstandardized statistics and conclude that SAT-Q
and openness to experience are of little utility, but the standardized coefficients show
that not to be true. The three predictors differ little in their unique contributions to
predicting retention in the engineering program.
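The relationship between the unstandardized and standardized statistics can be sketched in Python. The implied predictor standard deviations below are back-computed from the table, so they are approximations rather than values from the original data:

```python
import math

# (unstandardized B, partially standardized B) from the table above;
# a partially standardized B is just B times the predictor's standard deviation.
predictors = {
    "HS-GPA": (1.296, 0.510),
    "SAT-Q": (0.006, 0.440),
    "Openness": (0.100, 0.435),
}
for name, (b, b_std) in predictors.items():
    sd = b_std / b  # implied predictor SD (approximate, from rounded table values)
    print(name, round(sd, 2), round(math.exp(b_std), 3))  # OR for a 1-SD increase
```

Exponentiating each standardized B reproduces the standardized odds ratios in the table (1.665, 1.553, 1.545), which are directly comparable across predictors.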
Practice Your Newly Learned Skills
Now that you know how to do a logistic regression, you should practice those
skills. I have presented below three exercises designed to give you a little practice.
Exercise 1: What is Beautiful is Good, and Vice Versa
Castellow, Wuensch, and Moore (1990, Journal of Social Behavior and
Personality, 5, 547-562) found that physically attractive litigants are favored by jurors
hearing a civil case involving alleged sexual harassment (we manipulated physical
attractiveness by controlling the photos of the litigants seen by the mock jurors). Guilty
verdicts were more likely when the male defendant was physically unattractive and
when the female plaintiff was physically attractive. We also found that jurors rated the
physically attractive litigants as more socially desirable than the physically unattractive
litigants -- that is, more warm, sincere, intelligent, kind, and so on. Perhaps the jurors
treated the physically attractive litigants better because they assumed that physically
attractive people are more socially desirable (kinder, more sincere, etc.).
Our next research project (Egbert, Moore, Wuensch, & Castellow, 1992, Journal
of Social Behavior and Personality, 7, 569-579) involved our manipulating (via character
witness testimony) the litigants' social desirability but providing mock jurors with no
information on physical attractiveness. The jurors treated litigants described as socially
desirable more favorably than they treated those described as socially undesirable.
However, these jurors also rated the socially desirable litigants as more physically
attractive than the socially undesirable litigants, despite having never seen them! Might
our jurors be treating the socially desirable litigants more favorably because they
assume that socially desirable people are more physically attractive than are socially
undesirable people?
We next conducted research in which we manipulated both the physical
attractiveness and the social desirability of the litigants (Moore, Wuensch, Hedges, &
Castellow, 1994, Journal of Social Behavior and Personality, 9, 715-730). Data from
selected variables from this research project are in the SPSS data file found at
http://core.ecu.edu/psyc/wuenschk/SPSS/Jury94.sav. Please download that file now.
You should use SPSS to predict verdict from all of the other variables. The
variables in the file are as follows:
- VERDICT -- whether the mock juror recommended a not guilty (0) or a guilty (1)
verdict -- that is, finding in favor of the defendant (0) or the plaintiff (1)
- ATTRACT -- whether the photos of the defendant were physically unattractive (0)
or physically attractive (1)
- GENDER -- whether the mock juror was female (0) or male (1)
- SOCIABLE -- the mock juror's rating of the sociability of the defendant, on a 9-
point scale, with higher representing greater sociability
- WARMTH -- ratings of the defendant's warmth, 9-point scale
- KIND -- ratings of defendant's kindness
- SENSITIV -- ratings of defendant's sensitivity
- INTELLIG -- ratings of defendant's intelligence
You should also conduct bivariate analysis (Pearson Chi-Square and
independent samples t-tests) to test the significance of the association between each
predictor and the criterion variable (verdict). You will find that some of the predictors
have significant zero-order associations with the criterion but are not significant in the
full model logistic regression. Why is that?
You should find that the sociability predictor has an odds ratio that indicates that
the odds of a guilty verdict increase as the rated sociability of the defendant increases --
but one would expect that greater sociability would be associated with a reduced
probability of being found guilty, and the univariate analysis indicates exactly that (mean
sociability was significantly higher with those who were found not guilty). How is it
possible for our multivariate (partial) effect to be opposite in direction to that indicated by
our univariate analysis? You may wish to consult the following documents to help
understand this:
Redundancy and Suppression
Simpson's Paradox
Exercise 2: Predicting Whether or Not Sexual Harassment Will Be Reported
Download the SPSS data file found at
http://core.ecu.edu/psyc/wuenschk/SPSS/Harass-Howell.sav. This file was obtained
from David Howell's site,
http://www.uvm.edu/~dhowell/StatPages/Methods/DataMethods5/Harass.dat. I have
added value labels to a couple of the variables. You should use SPSS to conduct a
logistic regression predicting the variable "reported" from all of the other variables. Here
is a brief description for each variable:
- REPORTED -- whether (1) or not (0) an incident of sexual harassment was
reported
- AGE -- age of the victim
- MARSTAT -- marital status of the victim -- 1 = married, 2 = single
- FEMinist ideology -- the higher the score, the more feminist the victim
- OFFENSUV -- offensiveness of the harassment -- higher = more offensive
I suggest that you obtain, in addition to the multivariate analysis, some bivariate
statistics, including independent samples t-tests, a Pearson chi-square contingency
table analysis, and simple Pearson correlation coefficients for all pairs of variables.
Exercise 3: Predicting Who Will Drop-Out of School
Download the SPSS data file found at
http://core.ecu.edu/psyc/wuenschk/SPSS/Dropout.sav. I simulated these data based on
the results of research by David Howell and H. R. Huessy (Pediatrics, 76, 185-190).
You should use SPSS to predict the variable "dropout" from all of the other variables.
Here is a brief description for each variable:
- DROPOUT -- whether the student dropped out of high school before graduating -
- 0 = No, 1 = Yes.
- ADDSC -- a measure of the extent to which each child had exhibited behaviors
associated with attention deficit disorder. These data were collected while the
children were in the 2nd, 4th, and 5th grades and combined into one variable in the
present data set.
- REPEAT -- did the child ever repeat a grade -- 0 = No, 1 = Yes.
- SOCPROB -- was the child considered to have had more than the usual number
of social problems in the 9th grade -- 0 = No, 1 = Yes.
I suggest that you obtain, in addition to the multivariate analysis, some bivariate
statistics, including independent samples t-tests, a Pearson chi-square contingency
table analysis, and simple Pearson correlation coefficients for all pairs of variables.
Imagine that you were actually going to use the results of your analysis to decide which
children to select as "at risk" for dropping out before graduation. Your intention is, after
identifying those children, to intervene in a way designed to make it less likely that they
will drop out. What cutoff would you employ in your decision rule?
Copyright 2009, Karl L. Wuensch - All rights reserved.
- Logistic Regression With SAS
- Why Not Let SPSS Do The Dummy Coding of Categorical Predictors?
- Statistics Lessons
- SPSS Lessons
- More Links
- Letters From Former Students -- some continue to use my online lessons when
they go on to doctoral programs
MediationModels.docx
Statistical Tests of Models That Include Mediating Variables
Consider a model that proposes that some independent variable (X) is correlated
with some dependent variable (Y) not because it exerts some direct effect upon the
dependent variable, but because it causes changes in an intervening or mediating
variable (M), and then the mediating variable causes changes in the dependent
variable. Psychologists tend to refer to the X → M → Y relationship as mediation.
Sociologists tend to speak of the indirect effect of X on Y through M.

[Path diagram: X → M and M → Y, plus the direct path X → Y]
MacKinnon, Lockwood, Hoffman, West, and Sheets (A comparison of methods
to test mediation and other intervening variable effects, Psychological Methods, 2002, 7,
83-104) reviewed 14 different methods that have been proposed for testing models that
include intervening variables. They grouped these methods into three general
approaches.
Causal Steps. This is the approach that has most directly descended from the
work of Judd, Baron, and Kenny and which has most often been employed by
psychologists. Using this approach, the criteria for establishing mediation, which are
nicely summarized by David Howell (Statistical Methods for Psychology, 6th ed., page
528) are:
X must be correlated with Y.
X must be correlated with M.
M must be correlated with Y, holding constant any direct effect of X on Y.
When the effect of M on Y is removed, X is no longer correlated with Y (complete
mediation) or the correlation between X and Y is reduced (partial mediation).
Each of these four criteria is tested separately in the causal steps method:
First you demonstrate that the zero-order correlation between X and Y (ignoring
M) is significant.
Next you demonstrate that the zero-order correlation between X and M (ignoring
Y) is significant.
The Sobel test statistic is αβ divided by its first-order standard error,
sqrt(β²sα² + α²sβ²), where α is the regression coefficient for predicting the
mediator from the independent variable, sα is the
standard error for that coefficient, β is the standardized or unstandardized partial
regression coefficient for predicting Y from M controlling for X, and sβ is the standard
error for that coefficient. Since most computer programs give the standard errors for the
unstandardized but not the standardized coefficients, I shall employ the unstandardized
coefficients in my computations (using an interactive tool found on the Internet) below.
An alternative standard error is Aroian's (1944) second-order exact solution,
sqrt(β²sα² + α²sβ² + sα²sβ²). Another alternative is Goodman's (1960) unbiased solution,
in which the rightmost addition sign becomes a subtraction sign:
sqrt(β²sα² + α²sβ² − sα²sβ²).
In his text, Dave Howell employed Goodman's solution, but he made a potentially
serious error -- for the M → Y path he employed a zero-order coefficient and standard error
when he should have employed the partial coefficient and standard error.
MacKinnon et al. gave some examples of hypotheses and models that include
intervening variables. One was that of Ajzen & Fishbein (1980), in which intentions are
hypothesized to intervene between attitudes and behavior. I shall use here an example
involving data relevant to that hypothesis. Ingram, Cope, Harju, and Wuensch
(Applying to graduate school: A test of the theory of planned behavior. Journal of Social
Behavior and Personality, 2000, 15, 215-226) tested a model which included three
independent variables (attitude, subjective norms, and perceived behavior control),
one mediator (intention), and one dependent variable (behavior). I shall simplify that
model here, dropping subjective norms and perceived behavioral control as
independent variables. Accordingly, the mediation model (with standardized path
coefficients) is:
[Path diagram: Attitude → Intention, β = .767; Intention → Behavior, β = .245;
direct effect Attitude → Behavior, β = .337.]
Let us first consider the causal steps approach:
Attitude is significantly correlated with behavior, r = .525.
Attitude is significantly correlated with intention, r = .767.
The partial effect of intention on behavior, holding attitude constant, falls short
of statistical significance, β = .245, p = .16.
The direct effect of attitude on behavior (removing the effect of intention) also
falls short of statistical significance, β = .337, p = .056.
The causal steps approach does not, here, provide strong evidence of mediation,
given lack of significance of the partial effect of intention on behavior. If sample size
were greater, however, that critical effect would, of course, be statistically significant.
Now I calculate the Sobel/Aroian/Goodman tests. The statistics which I need are
the following:
The zero-order unstandardized regression coefficient for predicting the
mediator (intention) from the independent variable (attitude). That coefficient
= .423.
The standard error for that coefficient = .046.
The partial, unstandardized regression coefficient for predicting the
dependent variable (behavior) from the mediator (intention) holding constant
the independent variable (attitude). That regression coefficient = 1.065.
The standard error for that coefficient = .751.
For Aroian's second-order exact solution,
TS = αβ / sqrt(β²sα² + α²sβ² + sα²sβ²)
   = (.423)(1.065) / sqrt((1.065)²(.046)² + (.423)²(.751)² + (.046)²(.751)²) = 1.3935.
What a tedious calculation that was. I just lost interest in showing you
how to calculate the Sobel and the Goodman statistics by hand. Let us use Kris
Preacher's dandy tool at http://quantpsy.org/sobel/sobel.htm . Just enter alpha
(a), beta (b), and their standard errors and click Calculate:
Coefficients(a)

Model 1       Unstandardized B   Std. Error   Standardized Beta       t    Sig.
(Constant)    3.390              1.519                            2.231    .030
attitude       .423               .046              .767          9.108    .000

a. Dependent Variable: intent
Coefficients(a)

Model 1       Unstandardized B   Std. Error   Standardized Beta       t    Sig.
(Constant)     .075              9.056                             .008    .993
attitude       .807               .414              .337          1.950    .056
intent        1.065               .751              .245          1.418    .162

a. Dependent Variable: behav
Even easier (with a little bit of rounding error), just provide the t statistics for
alpha and beta and click Calculate:
The results indicate (for each of the error terms) a z of about 1.40 with a p of
about .16. Again, our results do not provide strong support for the mediation
hypothesis.
MacKinnon et al. (1998) Distribution of
Suppose that I am at a party. You are on your way to the party, late. Your friend
asks you, "Do you suppose that Karl has already had too many beers?" Based on past
experience with me at such parties, your prior probability of my having had too many
beers is p(B) = 0.2. The probability that I have not had too many beers is p(not-B) = 0.8,
giving prior odds of Ω = p(B) / p(not-B) = 0.2/0.8 = 0.25 (inverting the odds, the probability
that I have not had too many beers is 4 times the probability that I have). Ω is Greek
omega.
Now, what data could we use to revise your prior probability of my having had
too many beers? How about some behavioral data. Suppose that your friend tells you
that, based on her past experience, the likelihood that I behave awfully at a party if I
have had too many beers is 30%, that is, the conditional probability p(A|B) = 0.3.
According to her, if I have not been drinking too many beers, there is only a 3% chance
of my behaving awfully, that is, the likelihood p(A|not-B) = 0.03. Drinking too many beers
raises the probability of my behaving awfully ten-fold, that is, the
likelihood ratio L = p(A|B) / p(A|not-B) = 0.30/0.03 = 10.
From the multiplication rule of probability, you know that
p(B ∩ A) = p(A)p(B|A) = p(B)p(A|B), so it follows that
p(B|A) = p(B ∩ A) / p(A).
From the addition rule, you know that p(A) = p(B ∩ A) + p(not-B ∩ A), since B and
not-B are mutually exclusive. Thus,
p(B|A) = p(B ∩ A) / [p(B ∩ A) + p(not-B ∩ A)].
From the multiplication rule you know that
p(B ∩ A) = p(B)p(A|B) and p(not-B ∩ A) = p(not-B)p(A|not-B), so
p(B|A) = p(B)p(A|B) / [p(B)p(A|B) + p(not-B)p(A|not-B)]. This is Bayes' theorem, as applied to the
probability of my having had too many beers given that I have been observed behaving
awfully. Stated in words rather than symbols:
posterior probability_i = (prior probability_i × likelihood_i) /
(prior probability_i × likelihood_i + prior probability_j × likelihood_j), and
P(H1|D) = P(H1)P(D|H1) / P(D) = P(H1)P(D|H1) / [P(H0)P(D|H0) + P(H1)P(D|H1)].
The P(H0|D) and P(H1|D) are posterior probabilities, the probability that the null or
the alternative is true given the data. The P(H0) and P(H1) are prior probabilities,
the probability that the null or the alternative is
true prior to considering the new data. The P(D|H0) and P(D|H1) are the likelihoods, the
probabilities of the data given one or the other hypothesis.
As before, the posterior odds equal the prior odds multiplied by the likelihood ratio L, that is,
P(H1|D) / P(H0|D) = [P(H1) / P(H0)] × [P(D|H1) / P(D|H0)].
In classical hypothesis testing, one considers only P(D|H0), or more precisely, the
probability of obtaining sample data as or more discrepant with the null than are those on
hand, that is, the obtained significance level, p, and if that p is small enough, one
rejects the null hypothesis and asserts the alternative hypothesis. One may mistakenly
believe e has estimated the probability that the null hypothesis is true, given the
obtained data, but e has not done so. The Bayesian does estimate the probability that
the null hypothesis is true given the obtained data, P(H0|D), and if that probability is
sufficiently small, e rejects the null hypothesis in favor of the alternative hypothesis. Of
course, how small is sufficiently small depends on an informed consideration of the
relative seriousness of making one sort of error (rejecting H0 in favor of H1) versus
another sort of error (retaining H0 rather than H1).
Suppose that we are interested in testing the following two hypotheses about the
IQ of a particular population: H0: μ = 100 versus H1: μ = 110. I consider the two
hypotheses equally likely, and dismiss all other possible values of μ, so the prior
probability of the null is .5 and the prior probability of the alternative is also .5.
I obtain a sample of 25 scores from the population of interest. I assume it is
normally distributed with a standard deviation of 15, so the standard error of the mean
is 15/5 = 3. The obtained sample mean is 107. I compute for each hypothesis
z_i = (M − μ_i) / σ_M. For H0, z = (107 − 100)/3 = 2.33. The likelihood is obtained from
the height of the normal curve at that z, which you can compute from the normal density,
e^(−z²/2) / sqrt(2·pi),
where pi is approx. 3.1416, or you can use a normal curve table, SAS, SPSS, or other
statistical program. Using PDF.NORMAL(2.333333333333,0,1) in SPSS, I obtain
.0262/2 = .0131. In the same way I obtain the likelihood p(D|H1) = .1210. The
P(D) = P(H0)P(D|H0) + P(H1)P(D|H1) = .5(.0131) + .5(.1210) = .06705. Our revised,
posterior probabilities are:
P(H0|D) = P(H0)P(D|H0) / P(D) = .5(.0131) / .06705 = .0977, and
P(H1|D) = .5(.1210) / .06705 = .9023.
Before we gathered our data we thought the two hypotheses equally likely, that
is, the odds were 1/1. Our posterior odds are .9023/.0977 = 9.24. That is, after
gathering our data we now think that H1 is more than 9 times more likely than is H0.
The likelihood ratio is P(D|H1) / P(D|H0) = .1210/.0131 = 9.24. Multiplying the prior odds ratio (1)
by the likelihood ratio gives us the posterior odds. When the prior odds = 1 (the null
and the alternative hypotheses are equally likely), then the posterior odds is equal to
the likelihood ratio. When intuitively revising opinion, humans often make the mistake
of assuming that the prior probabilities are equal.
If we are really paranoid about rejecting null hypotheses, we may still retain the
null here, even though we now think the alternative about nine times more likely.
Suppose we gather another sample of 25 scores, and this time the mean is 106. We
can use these new data to revise the posterior probabilities from the previous analysis.
For these new data, the likelihoods are p(D|H0) = .027 and p(D|H1) = .0820. Using the
posterior probabilities from the first analysis as the new priors,
P(H0|D) = .0977(.027) / [.0977(.027) + .9023(.0820)] = .00264/.07663 = .0344, and
P(H1|D) = .9023(.0820) / .07663 = .9655.
The likelihood ratio is .082/.027 = 3.037. The newly revised posterior odds is
.9655/.0344= 28.1. The prior odds 9.24, times the likelihood ratio, 3.037, also gives the
posterior odds, 9.24(3.037) = 28.1. With the posterior probability of the null at .0344,
we should now be confident in rejecting it.
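The full updating chain for both samples can be verified with a short Python sketch (my check, not part of the lesson; like the text, it takes each likelihood to be the normal ordinate divided by 2, and tiny rounding differences from the hand calculations remain):

```python
import math

def density(z):
    """Ordinate (height) of the standard normal curve at z."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

se = 15 / math.sqrt(25)              # sigma_M = 3
prior = {100: 0.5, 110: 0.5}         # H0: mu = 100, H1: mu = 110, equally likely

for sample_mean in (107, 106):       # the two successive sample means
    # likelihood of each hypothesis, halved as in the text's SPSS computation
    like = {mu: density((sample_mean - mu) / se) / 2 for mu in prior}
    p_d = sum(prior[mu] * like[mu] for mu in prior)
    prior = {mu: prior[mu] * like[mu] / p_d for mu in prior}  # posterior -> new prior

print(round(prior[100], 4), round(prior[110], 4))   # about .0344 and .9655
print(round(prior[110] / prior[100], 1))            # posterior odds, about 28
```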
The Bayesian approach seems to give us just what we want, the probability of
the null hypothesis given our data. So what's the rub? The rub is, to get that posterior
probability we have to come up with the prior probability of the null being true. If you
and I disagree on that prior probability, given the same data, we arrive at different
posterior probabilities. Bayesians are less worried about this than are traditionalists,
since the Bayesian thinks of probability as being subjective, one's degree of belief
about some hypothesis, event, or uncertain quantity. The traditionalist thinks of a
probability as being an objective quantity, the limit of the relative frequency of the event
across an uncountably large number of trials (which, of course, we can never know, but
we can estimate by rational or empirical means). Advocates of Bayesian statistics are
often quick to point out that as evidence accumulates there is a convergence of the
posterior probabilities of those who started with quite different prior probabilities.
Bayesian Confidence Intervals
In Bayesian statistics a parameter, such as μ, is thought of as a random variable
with its own distribution rather than as a constant. That distribution is thought of as
representing our knowledge about what the true value of the parameter may be, and
the mean of that distribution is our best guess for the true value of the parameter. The
wider the distribution, the greater our ignorance about the parameter. The precision
(prc) of the distribution is the inverse of its variance, so the greater the precision, the
greater our knowledge about the parameter.
Our prior distribution of the parameter may be noninformative or informative. A
noninformative prior will specify that all possible values of the parameter are equally
likely. The range of possible values may be fixed (for example, from 0 to 1 for a
proportion) or may be infinite. Such a prior distribution will be rectangular, and if the
range is not fixed, of infinite variance. An informative prior distribution specifies a
particular nonuniform shape for the distribution of the parameter, for example, a
binomial, normal, or t distribution centered at some value. When new data are
gathered, they are used to revise the prior distribution. The mean of the revised
(posterior) distribution becomes our new best guess of the exact value of the
parameter. We can construct a Bayesian confidence interval and opine that the
probability that the true value of the parameter falls within that interval is cc, where cc is
the confidence coefficient (typically 95%).
Suppose that we are interested in the mean of a population for which we confess
absolute ignorance about the value of μ prior to gathering the data, but we are willing to
assume a normal distribution. We obtain 100 scores and compute the sample mean
to be 107 and the sample variance 200. The precision of this sample result is the
inverse of its squared standard error of the mean. That is,
prc = 1 / s²_M = n / s² = 100/200 = .5. The
95% Bayesian confidence interval is identical to the traditional confidence interval, that
is, M ± 1.96 s_M = 107 ± 1.96 sqrt(200/100) = 107 ± 2.77, that is, from 104.23 to 109.77.
Now suppose that additional data become available. We have 81 scores with a
mean of 106, a variance of 243, and a precision of 81/243 = 1/3. Our prior distribution
has (from the first sample) a mean of 107 and a precision of .5. Our posterior
distribution will have a mean that is a weighted combination of the mean of the prior
distribution and that of the new sample. The weights will be based on the precisions:
M2 = [prc1 / (prc1 + prcS)] M1 + [prcS / (prc1 + prcS)] MS
   = [.5 / (.5 + .33)] 107 + [.33 / (.5 + .33)] 106 = 106.6.
The precision of the revised (posterior) distribution for μ is simply the sum of the
prior and sample precisions: prc2 = prc1 + prcS = .5 + .33 = .83. The variance of the
revised distribution is just the inverse of its precision, 1.2. Our new Bayesian
confidence interval is 106.6 ± 1.96 sqrt(1.2) = 106.6 ± 2.15, that is, from 104.46 to 108.75.
Of course, if more data come in, we revise our distribution for μ again. Each time
the width of the confidence interval will decline, reflecting greater precision, more
knowledge about μ.
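The same updating arithmetic is easy to check in Python (my sketch, not part of the lesson):

```python
import math

# First sample: n = 100, mean 107, variance 200. Second sample: n = 81, mean 106, variance 243.
prc1, m1 = 100 / 200, 107        # prior precision and mean (from the first sample)
prc_s, m_s = 81 / 243, 106       # precision and mean of the new sample

post_prc = prc1 + prc_s                                # precisions add
post_mean = (prc1 * m1 + prc_s * m_s) / post_prc       # precision-weighted mean
post_var = 1 / post_prc                                # variance = inverse of precision

half = 1.96 * math.sqrt(post_var)                      # 95% interval half-width
print(round(post_mean, 1), round(post_var, 2))
print(round(post_mean - half, 2), round(post_mean + half, 2))
```

This reproduces the posterior mean 106.6, variance 1.2, and an interval of roughly 104.46 to 108.75 (give or take rounding in the hand calculation).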
Copyright 2008 Karl L. Wuensch - All rights reserved.
dfa2.doc
Two Group Discriminant Function Analysis
In DFA one wishes to predict group membership from a set of (usually continuous)
predictor variables. In the most simple case one has two groups and p predictor variables. A
linear discriminant equation, D_i = a + b1 X1 + b2 X2 + ... + bp Xp, is constructed such that the
two groups differ as much as possible on D. That is, the weights are chosen so that were you
to compute a discriminant score (D_i) for each subject and then do an ANOVA on D, the ratio
of the between groups sum of squares to the within groups sum of squares is as large as
possible. The value of this ratio is the eigenvalue. "Eigen" can be translated from the
German as "own," "peculiar," "original," "singular," etc. Check out the page at
http://core.ecu.edu/psyc/wuenschk/StatHelp/eigenvalue.txt for a discussion of the origins of
the term eigenvalue.
Read the following article, which has been placed on reserve in Joyner:
Castellow, W. A., Wuensch, K. L., & Moore, C. H. (1990). Effects of physical attractiveness of
the plaintiff and defendant in sexual harassment judgments. Journal of Social Behavior
and Personality, 5, 547-562.
The data for this analysis are those used for the research presented in that article.
They are in the SPSS data file Harass90.sav. Download it from my SPSS-Data page and
bring it into SPSS. To do the discriminant analysis, click Analyze, Classify, Discriminant.
Place the Verdict variable into the Grouping Variable box and define the range from 1 to 2.
Place the 22 rating scale variables (D_excit through P_happy) in the Independents box. We
are using the ratings the jurors gave defendant and plaintiff to predict the verdict. Under
Statistics, ask for Means, Univariate ANOVAs, Box's M, Fisher's Coefficients, and
Unstandardized Coefficients. Under Classify, ask for Priors Computed From Group Sizes and
for a Summary Table. Under Save, ask that the discriminant scores be saved.
Now look at the output. The means show that when the defendant was judged not
guilty he was rated more favorably on all 11 scales than when he was judged guilty. When the
defendant was judged not guilty the plaintiff was rated less favorably on all 11 scales than
when a guilty verdict was returned. The Tests of Equality of Group Means show that the
groups differ significantly on every variable except plaintiff excitingness, calmness,
independence, and happiness.
The discriminant function, in unstandardized units (Canonical Discriminant Function
Coefficients), is D = -0.064 + .083 D_excit + ...... + .029 P_happy. The group centroids
(mean discriminant scores) are -0.785 for the Guilty group and 1.491 for those jurors who
decided the defendant was not guilty. High scores on the discriminant function are associated
with the juror deciding to vote not guilty.
The posterior probability of membership in group i, given the discriminant score, is
p(G_i|D) = p(G_i) p(D|G_i) / Σ(i = 1 to g) p(G_i) p(D|G_i).
Each subject's discriminant score is used to determine the posterior probabilities of
being in each of the two groups. The subject is then classified (predicted) to be in the group
with the higher posterior probability.
By default, SPSS assumes that all groups have equal prior probabilities. For two
groups, each prior = 1/2, for three, 1/3, etc. I asked SPSS to use the group relative frequencies
as priors, which should result in better classification.
Another way to classify subjects is to use Fisher's classification function
coefficients. For each subject a D is computed for each group and the subject classified into
the group for which es D is highest. To compute a subject's D1 you would multiply es scores
on the 22 rating scales by the indicated coefficients and sum them and the constant. For es
D2 you would do the same with the coefficients for Group 2. If D1 > D2 then you classify the
subject into Group 1; if D2 > D1, then you classify em into Group 2.
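The rule itself is just "compute each group's classification function and take the largest." A small Python sketch with three predictors; all coefficients and scores below are made up for illustration (the real ones would come from SPSS's Fisher's coefficients table):

```python
# Hypothetical Fisher classification function coefficients for two groups:
# a constant plus one weight per predictor (illustrative values only).
group_fns = {
    1: {"constant": -5.0, "weights": [0.8, 1.2, -0.3]},
    2: {"constant": -2.0, "weights": [0.1, 0.9, 0.6]},
}

def classify(scores):
    """Compute each group's D and classify into the group with the highest D."""
    d = {g: f["constant"] + sum(w * x for w, x in zip(f["weights"], scores))
         for g, f in group_fns.items()}
    return max(d, key=d.get)

print(classify([5, 3, 1]))   # hypothetical subject's predictor scores -> 1
print(classify([1, 1, 5]))   # -> 2
```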
The classification results table shows that we correctly classified 88% of the subjects.
To evaluate how good this is we should compare 88% with what would be expected by
chance. By just randomly classifying half into group 1 and half into group 2 you would
expect to get .5(.655) + .5(.345) = 50% correct. Given that the marginal distribution of Verdict
is not uniform, you would do better by randomly putting 65.5% into group 1 and 34.5% into
group 2 (probability matching), in which case you would expect to be correct .655(.655) +
.345(.345) = 54.8% of the time. Even better would be to probability maximize by just
placing every subject into the most likely group, in which case you would be correct 65.5% of
the time. We can do significantly better than any of these by using our discriminant function.
Assumptions: Multivariate normality of the predictors is assumed. One may hope
that large sample sizes make the DFA sufficiently robust that one does not worry about
moderate departures from normality. One also assumes that the variance-covariance
matrix of the predictor variables is the same in all groups (so we can obtain a pooled
matrix to estimate error variance). Box's M tests this assumption and indicates a problem
with our example data. For validity of significance tests, one generally does not worry about
this if sample sizes are equal, and with unequal sample sizes one need not worry unless the p
< .001. The DFA is thought to be very robust and Box's M is very sensitive. Non-normality
also tends to lower the p for Box's M. The classification procedures are not, however, as
robust as the significance tests are. One may need to transform variables or do a quadratic
DFA (SPSS won't do this) or ask that separate rather than pooled variance-covariance
matrices be used. Pillai's criterion (rather than Wilks' lambda) may provide additional robustness
for significance testing -- although not available with SPSS discriminant, this criterion is
available with SPSS MANOVA.
ANOVA on D. Conduct an ANOVA comparing the verdict groups on the discriminant
function. Then you can demonstrate that the DFA eigenvalue is equal to the ratio of the
SSbetween to SSwithin from that ANOVA and that the ratio of SSbetween to SStotal is the squared
canonical correlation coefficient from the DFA.
Correlation Between Groups and D. Correlate the discriminant scores with the
verdict variable. You will discover that the resulting point biserial correlation coefficient is the
canonical correlation from the DFA.
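Both relationships are easy to verify numerically. The discriminant scores below are made up for illustration (not the Harass90 data); the algebra is what matters:

```python
import math

# Hypothetical discriminant scores for two verdict groups (illustration only)
g1 = [-1.0, -0.5, -1.5]
g2 = [1.0, 1.5, 0.5]
scores = g1 + g2
groups = [1] * len(g1) + [2] * len(g2)

def mean(xs):
    return sum(xs) / len(xs)

gm = mean(scores)
ss_between = len(g1) * (mean(g1) - gm) ** 2 + len(g2) * (mean(g2) - gm) ** 2
ss_within = (sum((x - mean(g1)) ** 2 for x in g1)
             + sum((x - mean(g2)) ** 2 for x in g2))
ss_total = sum((x - gm) ** 2 for x in scores)

eigenvalue = ss_between / ss_within        # the DFA eigenvalue
r_squared = ss_between / ss_total          # squared canonical correlation

def pearson(x, y):
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

r_pb = pearson(scores, groups)             # point biserial of D with group
print(round(eigenvalue, 3), round(r_squared, 4), round(r_pb ** 2, 4))
# 6.0 0.8571 0.8571 -- the point biserial squared equals SSbetween/SStotal
```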
SAS: Obtain the data file Harass90.dat from my StatData page and the program
DFA2.sas from my SAS Programs Page. Run the program. This program uses SAS to do
essentially the same analysis we just did with SPSS. Look at the output from PROC REG. It
did a multiple regression to predict group membership (1, 2) from the rating scales. Notice
that the SSmodel / SSerror = the eigenvalue from the DFA, and that the SSerror / SStotal = the Wilks
lambda from the DFA. The square root of the R² equals the canonical correlation from the DFA.
The unstandardized discriminant function coefficients (raw canonical coefficients) are equal to
the standardized discriminant function coefficients (pooled within-class standardized canonical
coefficients) divided by the pooled (within-group) standard deviations.
Note also that the DFA's discriminant function coefficients are a linear transformation of
the multiple regression b weights (multiply each by 4.19395 and you get the unstandardized
discriminant function coefficients). I do not know what determines the value of this constant; I
determined it empirically for this set of data.
Return to Wuensch's Statistics Lessons Page
Copyright 2008 Karl L. Wuensch - All rights reserved.
DFA3.doc
Discriminant Function Analysis with Three or More Groups
With more than two groups one can obtain more than one discriminant function. The
first DF is that which maximally separates the groups (produces the largest ratio of
among-groups to within groups SS on the resulting D scores). The second DF, orthogonal
to the first, maximally separates the groups on variance not yet explained by the first DF.
One can find a total of K-1 (number of groups minus 1) or p (number of predictor variables)
orthogonal discriminant functions, whichever is smaller.
We shall use the data from Experiment 1 of my dissertation to illustrate a
discriminant function analysis with three groups. The analysis I reported when I published
this research was a doubly multivariate repeated measures ANOVA (see Wuensch, K. L.,
Fostering house mice onto rats and deer mice: Effects on response to species odors.
Animal Learning and Behavior, 1992, 20, 253-258). Wild-strain house mice were, at birth,
cross-fostered onto house-mouse (Mus), deer mouse (Peromyscus) or rat (Rattus) nursing
mothers. Ten days after weaning, each subject was tested in an apparatus that allowed it
to enter tunnels scented with clean pine shavings or with shavings bearing the scent of
Mus, Peromyscus, or Rattus. One of the variables measured was the number of visits to
each tunnel during a twenty minute test. Also measured were how long each subject spent
in each of the four tunnels and the latency to first visit of each tunnel. We shall use the
visits data for our discriminant function analysis.
The data are in the SPSS data file, TUNNEL4b.sav. Download it from my SPSS-
Data page. The variables in this data file are:
NURS (nursing group, 1 for Mus reared, 2 for Peromyscus reared, and 3 for Rattus
reared)
V1, V2, V3, and V4 (labeled Clean-V, Mus-V, Pero-V, and Rat-V, these are the raw
data for number of visits to the clean, Mus-scented, Peromyscus-scented, and
Rattus-scented tunnels)
V_Clean, V_Mus, V_Pero, and V_Rat (the visits data after a square root
transformation to reduce positive skewness and stabilize the variances)
T1, T2, T3, and T4 (time in seconds spent in each tunnel)
T_Clean, T_Mus, T_Pero, and T_Rat (the time data after a square root
transformation to reduce positive skewness)
L1, L2, L3, and L4 (the latency data in seconds) and
L_Clean, L_Mus, L_Pero, and L_Rat (the latency data after a log transformation to
reduce positive skewness).
For this lesson we shall use only the NURS variable and the visits variables.
Obtaining Means and Standard Deviations for the Untransformed Data
Open the TUNNEL4b.sav file in SPSS. Click Analyze, Compare Means, Means.
SPSS will do stepwise DFA. You simply specify which method you wish to employ
for selecting predictors. The most economical method is the Wilks lambda method, which
selects predictors that minimize Wilks' lambda. As with stepwise multiple regression, you
may set the criteria for entry and removal (F criteria or p criteria), or you may take the
defaults.
Imagine that you are working as a statistician for the Internal Revenue Service. You
are told that another IRS employee has developed four composite scores (X1 - X4), easily
computable from the information that taxpayers provide on their income tax returns and from
other databases to which the IRS has access. These composite scores were developed in
the hope that they would be useful for discriminating tax cheaters from other persons. To
see if these composite scores actually have any predictive validity, the IRS selects a random
sample of taxpayers and audits their returns. Based on this audit, each taxpayer is placed
into one of three groups: Group 1 is persons who overpaid their taxes by a considerable
amount, Group 2 is persons who paid the correct amount, and Group 3 is persons who
underpaid their taxes by a considerable amount. X1 through X4 are then computed for each
of these taxpayers. You are given a data file with group membership, X1, X2, X3, and X4 for
each taxpayer, with an equal number of subjects in each group. Your job is to use
discriminant function analysis to develop a pair of discriminant functions (weighted sums of
X1 through X4) to predict group membership. You use a fully stepwise selection procedure to
develop a (maybe) reduced (less than four predictors) model. You employ the WILKS
method of selecting variables to be entered or deleted, using the default p criterion for
entering and removing variables.
Your data file is DFA-STEP.sav, which is available on Karl's SPSS-Data page --
download it and then bring it into SPSS. To do the DFA, click Analyze, Classify, and
then put Group into the Grouping Variable box, defining its range from 1 to 3. Put X1
through X4 in the Independents box, and select the stepwise method.
Positive Assortative Mating on the Main Diagonal
Our research hypothesis was that individuals would prefer to mate with others similar to
themselves (in this case, of the same religion). Look at the main diagonal (upper left cell,
middle cell, lower right cell) of the Relig_S x Relig_M table. Most of the counts are in that
diagonal, which represents individuals who want mates of the same religion as their own. If we
sum the counts on the main diagonal, we see that 185 (or 92.5%) of our respondents said they
want their mates to be of the same religion as their religion. How many respondents would we
expect to be on this main diagonal if there was no correlation between Relig_S and Relig_M?
The answer to that question is simple: Just sum the expected frequencies in that main
diagonal -- in the absence of any correlation, we expect 108 (54%) of our respondents to be on
that main diagonal. Now we can employ an exact binomial test of the null hypothesis that the
proportion of respondents desiring mates with the same religion as their own is what would be
expected given independence of self religion and ideal mate religion (binomial p = .54). The
one-tailed p is the P(Y ≥ 185 | n = 200, p = .54). Back in PSYC 6430 you learned how to use
SAS to get binomial probabilities. In the little program below, I obtained the P(Y ≤ 184 | n =
200, p = .54), subtracted that from 1 to get the P(Y ≥ 185 | n = 200, p = .54), and then doubled
that to get the two-tailed significance level. The SAS output shows that p < .001.
data p;
LE184 = PROBBNML(.54, 200, 184);
GE185 = 1 - LE184;
p = 2*GE185;
proc print; run;
Obs LE184 GE185 p
1 1 0 0
You can also use SPSS to get an exact binomial probability. See my document
Obtaining Significance Levels with SPSS.
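The same exact binomial test can also be computed in Python with only the standard library (my illustration, mirroring the SAS program above; `math.comb` requires Python 3.8+):

```python
from math import comb

n, p = 200, 0.54

# P(Y >= 185): sum the exact binomial probabilities for k = 185 .. 200
ge185 = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(185, n + 1))
two_tailed = 2 * ge185

print(two_tailed < 0.001)   # True, matching the SAS result: p < .001
```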
The unpartialled chi-square on Relig_S x Gender is also significant. The column percentages in
the table make it fairly clear that this effect is due to men being much more likely than women
to have no religion.
SAS Catmod
options pageno=min nodate formdlim='-';
data Religion;
input Relig_Self Relig_Mate Sex count;
cards;
1 1 1 20.5
1 1 2 7.5
1 2 1 0.5
1 2 2 0.5
1 3 1 1.5
1 3 2 1.5
2 1 1 1.5
2 1 2 1.5
2 2 1 86.5
2 2 2 49.5
2 3 1 3.5
2 3 2 2.5
3 1 1 0.5
3 1 2 1.5
3 2 1 2.5
3 2 2 3.5
3 3 1 8.5
3 3 2 15.5
proc catmod;
weight count;
model Relig_Self*Relig_Mate*Sex = _response_;
Loglin Relig_Self|Relig_Mate|Sex;
run;
Submit this code to obtain the analysis of the saturated model in SAS.
Karl L. Wuensch, Dept. of Psychology, East Carolina University, Greenville, NC 27858 USA
March, 2012
Links
Return to Wuensch's Statistics Lessons Page
Download the SPSS Output
Log-Linear Contingency Table Analysis, Two-Way
Three-Way Nonhierarchical Log-Linear Analysis: Escalators and Obesity
Four Variable LOGIT Analysis: The 1989 Sexual Harassment Study
Log3N.doc
Three-Way Nonhierarchical Log-Linear Analysis: Escalators and Obesity
Hierarchical analyses are the norm when one is doing multidimensional log-linear
analyses. The backwards elimination tests of significance are available because each
reduced model is nested within the next more complex model. With nonhierarchical
analysis one can exclude lower-order effects that are contained within retained
higher-order effects. One might wish to evaluate two nonhierarchical models when one
is not nested within the other. One cannot test the significance of the difference
between two such nonhierarchical models, but one can assess the adequacy of fit of
each such model.
The Data and the Program
We shall use data which I captured from the article "Stairs, Escalators, and
Obesity," by Meyers et al. (Behavior Modification 4: 355-359). A copy of the article is
available within BlackBoard. The (nonhierarchical) LOGLINEAR procedure is not
available by point and click in SPSS; you must use syntax. Since I needed to issue the
loglinear command by syntax, I also issued the rest of the commands by syntax. The
syntax file is ESCALATE.SPS on my SPSS Programs page, and the data file is
ESCALATE.SAV on my SPSS Data page. Download both. After downloading, double-
click on the syntax file and PASW will boot and the syntax file will be displayed in the
syntax editor.
You will need to edit the File statement so that it points to the correct location of
the data file on your computer. Run the syntax file to produce the output. I exported
the output to an rtf document, edited it, and then converted it to a pdf document. You
can obtain the pdf document at http://core.ecu.edu/psyc/wuenschk/MV/LogLin/Log3N-
SPSS-Output.pdf.
An Initial Run with the Hiloglinear Procedure
Look at the program. With Hiloglinear I asked that the tests of partial
associations and the parameter estimates be listed. I did not ask for the frequencies or
residuals. Look at the output. The three-way interaction is significant. When the
highest-order effect is significant, one may attempt to eliminate one or more of the
lower-order effects while retaining the higher-order effect. The partial chi-squares may
suggest which effects to try deleting, and one can try deleting any effect which does not
have at least one highly significant parameter estimate.
For our data, every partial chi-square is significant, but the Weight x Device
effect has a relatively small χ², so I'll try removing it. Looking at the parameter
estimates, Weight x Device (neither parameter is significant) and Direct (not significant
at .01) appear to be candidates for removal.
Building a Reduced Model with the Loglinear Procedure
I used Loglinear to test two models, one with Weight x Device removed and one
with Direct removed. In both cases the goodness-of-fit chi-square was significant,
meaning that the reduced models do not fit the data well. This is in part due to the
great power we have with large sample sizes. We can look at the residuals to see
where the fit is poor. For the model with Weight x Device removed, none of the
standardized residuals is very large (> 2), but three are large enough to warrant
inspection (> 1). The model predicted that:
15.45 Obese folks would be observed Ascending Stairs, but only 10 were;
19.45 Obese folks would be observed Descending Stairs, but only 14 were; and
72.04 folks of normal weight would be observed Ascending Stairs, but 82 were.
For the model with Direct removed, the residuals are generally small, but two
cells have residuals worthy of some attention. For the Obese, the model predicted that:
14.3 would be observed Ascending Stairs, only 10 were, and
9.7 would be observed Descending Stairs, but 14 were.
Comparing Nested Models
When we have two models that are nested, with Model A being a subset of
Model B, with all of the effects in Model A also in Model B, then we can test the
significance of the difference between those two models. The difference
2
will equal
the Model A goodness-of-fit
2
minus the Model B goodness-of-fit
2
, with df equal to
the difference between the two models' df. We do have two such pairs, the full model
versus that with Weight x Device removed and the full model versus that with Direct
removed. Since the full model always has
2
= 0 and df = 0, the difference chi-squares
are the reduced model chi-squares, and they are significant.
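The difference test just described is easy to sketch in Python (a hypothetical helper, not part of the SPSS/SAS analyses in this lesson); it assumes SciPy is available for the chi-square survival function.

```python
from scipy.stats import chi2

def nested_model_test(gof_reduced, df_reduced, gof_full, df_full):
    """Difference chi-square test for two nested log-linear models.

    The difference between the two goodness-of-fit chi-squares is itself
    distributed as chi-square, with df equal to the difference in df.
    """
    diff_chi2 = gof_reduced - gof_full
    diff_df = df_reduced - df_full
    return diff_chi2, diff_df, chi2.sf(diff_chi2, diff_df)

# Versus the saturated (full) model, which always has chi-square = 0 on df = 0,
# the difference chi-square is just the reduced model's goodness-of-fit chi-square.
stat, df, p = nested_model_test(3.84, 1, 0.0, 0)
```

With a reduced-model goodness-of-fit χ² of 3.84 on 1 df, the difference test gives p of about .05, the usual critical boundary.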
The Triple Interaction
Now we shall try to explain the triple interaction by looking at "simple two-way"
interactions at each level of the third variable. I decided to look at the Weight x Device
interaction at each level of Direction, but could have just as well looked at Weight x
Direction at each level of Device or Device x Direction at each level of Weight.
Look at the tables produced by the first Crosstabs command. I reproduce here
the row percentages for the Stairs column.
Percentage Using Stairs Within Each Weight x Direction Combination

                    Direction
Weight        Ascending   Descending
Obese            4.7         14.7
Overweight       3.9         27.8
Normal           7.6         23.1
The Direction x Device interaction is obvious here, with many more people using
the stairs going down than going up. Were we inspecting Direction x Device at each
level of Weight, we would do three 2 x 2 Direction x Device analyses, each determining
whether the rate of stairway use was significantly higher when descending than when
ascending for a given weight category. For example, among the obese, is 14.7%
significantly different from 4.7%? I decided to look at Weight x Device interaction at
each level of Direction. Crosstabs gave us the LR χ² for Weight x Device at each
direction, and they are both significant.
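For a single two-way table, the likelihood-ratio chi-square that Crosstabs reports can be computed directly. The sketch below (Python with NumPy/SciPy, not the SPSS syntax used in this lesson) applies the usual formula, G² = 2 Σ observed × ln(observed/expected).

```python
import numpy as np
from scipy.stats import chi2

def lr_chi2(table):
    """Likelihood-ratio (G-squared) test of association for a two-way table."""
    table = np.asarray(table, dtype=float)
    # Expected counts under independence: (row total x column total) / grand total
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    mask = table > 0  # cells with zero observed count contribute nothing
    g2 = 2.0 * np.sum(table[mask] * np.log(table[mask] / expected[mask]))
    df = (table.shape[0] - 1) * (table.shape[1] - 1)
    return g2, chi2.sf(g2, df)

# A table whose rows are exactly proportional shows no association: G2 = 0.
g2, p = lr_chi2([[10, 20], [30, 60]])
```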
Breaking Up Each 3 x 2 Interaction Into Three 2 x 2 Interactions
To understand each 3 x 2 (Weight x Device) interaction better, I broke each into
three 2 x 2 interactions. If you will look at the program, you will see that I changed
WEIGHT(1,3) to WEIGHT(1,2) to get the comparison between the Obese (level 1) and
the Overweight (level 2). When ascending, they do not differ significantly on
percentage using the stairs, but when descending they do, with the overweight using
the stairs more often than do the obese.
The Obese versus Normal comparisons both fall short of significance, but just
barely. Note, in the program, how I used the TEMPORARY and the MISSING VALUES
commands to construct these comparisons. I declared the value 2 to be a missing
value for the weight variable, so when I indicated WEIGHT(1,3), the comparison was
only between weight group 1 and weight group 3. The TEMPORARY command made
this declaration of missing value status valid for only one procedure.
For Overweight versus Normal, the normal weight folks are significantly more
likely to use the stairs than are the overweight when ascending, but when descending
the overweight persons use the stairs more than do the normal weight persons, with the
difference not quite reaching statistical significance.
[Figure: Percentage Use of Staircase Rather than Escalator Among Three Weight
Groups, with separate bars for Ascending and Descending within each of the Obese,
Overweight, and Normal groups; vertical axis is percentage, 0 to 30.]
In summary, people use the stairs much less than the escalator, especially when
going up. The overweight are the least likely to use the stairs when going up, but the
most likely to use the stairs when going down. Perhaps these people know they have a
weight problem, know they need exercise, so they resolve to use the stairs more often,
but using them going up is just asking too much.
Karl L. Wuensch
Dept. of Psychology
East Carolina University
Greenville, NC 27858
November, 2009
SAS code to do the Model 1 analysis
How to get people to use the stairs -- http://www.youtube.com/watch?v=2lXh2n0aPyw
Return to Wuensch's Statistics Lessons Page
Logit.doc
Four Variable LOGIT Analysis: The 1989 Sexual Harassment Study
The Data and the Program
In the file HARASS89.sav on Karl's SPSS Data Page are cell data from a mock
jury study done by C. Moore et al. early in 1989. Download the data file. Every variable
is categorical: Verdict (1 = guilty, 2 = not guilty), Gender (1 = male, 2 = female), Plattr (1
= the plaintiff is low in physical attractiveness, 2 = high in physical attractiveness), and
Deattr (1 = defendant is low in physical attractiveness, 2 = high). The cell frequencies
are provided by the Freq variable. The female plaintiff in this civil case has accused the
male defendant of sexually harassing her. We wish to determine whether our
outcome/dependent variable, Verdict, is affected by (associated with) Plattr, Deattr,
Gender, and/or any combination of these three categorical predictor/independent
variables. Download the SPSS program file, LOGIT.sps, from Karl's SPSS Programs
Page. Edit the syntax so the Get command points correctly to the location of the data
file on your computer and then run the program.
A Screening Run with Hiloglinear
Let us first ignore the fact that we consider one of the variables a dependent
variable and do a hierarchical backwards elimination analysis. We shall pay special
attention to the effects which include our dependent variable, Verdict, in this analysis.
Note that while the one-way effects are as a group significant (due solely to the fact that
guilty verdicts were more common than not guilty), the two-way and higher-order effects
are not. This is exactly what we should expect, since most of these effects were
experimentally made zero or near zero by our random assignment of participants to
groups. We randomly assigned female and male participants to have an attractive or
an unattractive plaintiff and, independent of that assignment, to have an attractive or an
unattractive defendant, so, we made the effects involving only Gender, Plattr, and
Deattr zero or near zero.
Hiloglinear's "Tests of PARTIAL associations" indicated significant effects of
Verdict, Gender x Verdict, and Plattr x Deattr x Verdict. The estimated parameters for
Verdict and Gender x Verdict are significant, and that for Plattr x Deattr x Verdict is very
close to significance. The backwards elimination procedure led to a model that
includes the Plattr x Deattr x Verdict effect and the Gender x Verdict effect. Since this
is a hierarchical model, all lower-order effects included in the retained higher-order
effects are also retained, that is, the model also includes the effects of Plattr x Deattr,
Plattr x Verdict, Deattr x Verdict, Plattr, Deattr, Verdict, and Gender. Note that many of
these effects are effects that we experimentally made zero or near zero by our random
assignment to treatment groups. The model fits the data well, as indicated by the high
p for the goodness-of-fit χ².
A Saturated Logit Analysis
The Hiloglinear analysis was employed simply to give us some suggestions
regarding which of the effects we want to include in our Logit Analysis. In a logit
analysis we retain only effects that include the dependent variable, Verdict. The partial
association tests suggest a model with Verdict, Gender x Verdict, and
Plattr x Deattr x Verdict. We could just start out with every effect that includes Verdict
and then evaluate various reduced models, using standardized parameter estimates (Z)
to guide our selection of effects to be deleted (if an effect has no parameter with a large
absolute Z, delete it, then evaluate the reduced model). When deciding between any
two particular models, we may test the significance of their differences if and only if one
is nested within the other. Of course, each model is automatically compared to the fully
saturated model with the goodness-of-fit χ² SPSS gives us, and we don't want to accept
a model that is significantly bad in fit.
Instead of using just the three effects suggested by Hiloglinear's partial
association tests, I entered every effect containing Verdict to do a saturated logit
analysis. Note the syntax: LOGLINEAR dependent variable BY independent variables.
As always, the saturated model has perfect fit.
A Backwards Elimination Nonhierarchical Logit Analysis
I inspected the Z scores for tests of parameters in the saturated model, looking
for an effect to delete. Verdict x Plattr, with its Z of .024, was chosen. I employed
Loglinear again, leaving Verdict x Plattr out of the DESIGN statement, to evaluate the
reduced model; this analysis is not included in the program you ran. The
goodness-of-fit χ² was incremented from 0 to .00059, a trivial, nonsignificant increase,
df = 1, p = .981.
The smallest |Z| in the reduced model was -.306, for Verdict x Gender x Plattr, so
I deleted that effect, increasing the χ² to .094, which was still nonsignificantly different
from the saturated model, df = 2, p = .954. Again, this analysis is not included in the
program you ran.
I next deleted Verdict x Gender x Deattr, Z = -.431, increasing χ² to .280, still not
significantly ill fitting, df = 3, p = .964. Next I removed Verdict x Deattr, Z = -1.13,
increasing χ² to 1.567, df = 4, p = .815. Next out was Verdict x Gender x Plattr x Deattr,
Z = -1.31, χ² = 3.283, p = .656. I have omitted from the program the four models
between the saturated model and the Verdict, Verdict x Gender, Verdict x Plattr x Deattr
model, just to save paper. I made my decisions (and took notes) looking at these
models on my computer screen, not printing them.
If you look at the standardized residuals for the Verdict, Verdict x Gender, Verdict
x Plattr x Deattr model, you will see that there are no problems, not even a hint of any
problem (no standardized residual of 1 or more).
Going Too Far
The way I have been deleting effects, each model is nested within all previous
models, so I can test the significance of the difference between one model and any that
preceded it with a χ² that equals the difference between the two models' goodness-of-fit
chi-squares. The df = the number of effects deleted from the one model to obtain the
other model. The null hypothesis is that the two models fit the data equally well. Since
the .05 critical value for χ² on 1 df is 3.84, I was on the watch for an increase of this
magnitude in the goodness-of-fit χ² produced by removing one effect.
Verdict x Plattr x Deattr had a significant Z-value of -2.17 in the current model,
but since that was the smallest Z-value, I removed it. The goodness-of-fit χ² jumped a
significant 4.84, from 3.28 to 8.12. This was enough to convince me to leave
Verdict x Plattr x Deattr in the model, even though the Verdict, Verdict x Gender model
was not significantly different from the saturated model (due to large df), df = 6, p = .23.
Removal of the Verdict x Plattr x Deattr effect also resulted in increased residuals, four
of the cells having standardized residuals greater than 1.
Just to complete my compulsion, I tried one last step (not included in the
program you ran), removing Verdict x Gender, producing an ill fitting one-parameter
(Verdict) model, χ² = 15.09, df = 7, p = .035.
Our Final Model
So, we are left with a model containing Verdict, Verdict x Gender, and
Verdict x Plattr x Deattr. Do note that this is exactly the model suggested by the partial
association tests in our screening run with Hiloglinear. Now we need to interpret the
model we have selected.
Our structural model is ln(cell freq_vgpd) = μ + λ_v + λ_vg + λ_vpd. Consider the
parameter for the effect of Verdict. A value of 0 would indicate that there were equal
numbers of guilty and not guilty votes -- the odds would be 1/1 = 1, and the natural log
of 1 is 0. Our model's estimate of the parameter for Verdict = Guilty is .363, which is
significantly greater than zero.
Odds
In our sample there were 110 guilty votes and 56 not guilty, for odds = 110/56 =
1.96. Our model predicts the odds to be e^(2(.3626)) = 2.07, a pretty good estimate.
Of course, we could also predict the odds of a not guilty vote using the parameter for
Verdict = Not_Guilty: e^(2(-.3626)) = .484, not a bad estimate. (The constant 4 follows
from the four conditional odds just discussed). The log of an odds ratio is called a logit,
thus, "logit analysis."
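The arithmetic for the observed and model-predicted odds can be checked in a few lines of Python (a sketch; the parameter value .3626 is taken from the text above):

```python
import math

guilty, not_guilty = 110, 56
observed_odds = guilty / not_guilty       # 110/56, about 1.96

lam = 0.3626                              # parameter estimate for Verdict = Guilty
predicted_odds = math.exp(2 * lam)        # about 2.07
predicted_odds_ng = math.exp(2 * -lam)    # about .484 for Not Guilty

logit = math.log(observed_odds)           # the log odds, i.e., the logit
```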
Using Crosstabs to Help Interpret the Significant Effects
For the Verdict x Plattr x Deattr triple interaction I decided to inspect the Verdict x
Plattr effect at each level of Deattr. Look at the Crosstabs output. When the defendant
was unattractive, guilty verdicts were rendered more often when the plaintiff was
attractive (70%) than when she was unattractive (55%), but the difference between
these percentages fell short of significance (p = .154 by the likelihood ratio test). When
the defendant was handsome, the results were just the opposite, guilty verdicts being
more likely when the plaintiff was unattractive (77%) than when she was beautiful
(62%), but the simple effect again fell short of significance.
Some people just cannot understand how a higher-order effect can be significant
when none of its simple effects is. Although we understand this stems from the simple
effects having opposite signs, perhaps we should try looking at the interaction from
another perspective, the Verdict x Deattr effect at each level of Plattr. Look at the
Crosstabs output. When the plaintiff was unattractive, handsome defendants were
found guilty significantly more often (77%) than unattractive defendants (55%),
p = .026. When the plaintiff was beautiful, attractiveness of the defendant had no
significant effect upon the verdict, 70% versus 62%, p = .48.
[Figure: Percentage Guilty as a function of Defendant attractiveness (Unattractive,
Attractive), with separate lines for Unattractive Plaintiff and Attractive Plaintiff; vertical
axis is percentage guilty, 50 to 80.]
Next, look at the Verdict x Gender Crosstabs output. Significantly more of the
female jurors (76%) found the defendant guilty than did the male jurors (57%), p = .008.
The likelihood ratio test reported here is one that ignores all the other effects in the full
model.
The 1990 Sexual Harassment Study
The results of the 1989 sexual harassment study were never published. There
was a serious problem with the stimulus materials that made the physical attractiveness
manipulation inadequate. We never even bothered submitting that study to any
journal. We considered it a pilot study and did additional pilot work that led to better
stimulus materials. The research conducted with these better stimulus materials has
been published -- "Effects of physical attractiveness of the plaintiff and defendant in
sexual harassment judgments" by Wilbur A. Castellow, Karl L. Wuensch, & Charles H.
Moore (1990), Journal of Social Behavior and Personality, 5, 547-562.
The program file, LOGIT2.sps, along with the data file, HARASS90.sav, will
produce the logit analysis that is reported in the article. Download the files, run the
program, and look over the output until you understand the statistics reported in the
article. The results are not as complex as they were in the pilot study. The 1-way and
2-way effects are significant, but the higher order effects are not. Verdict, Defendant
Attractiveness x Verdict, and Plaintiff Attractiveness x Verdict have significant partial
chi-squares and significant parameters. The backwards elimination procedure led to a
model with Defendant Attractiveness x Verdict and Plaintiff Attractiveness x Verdict,
and, because it is a hierarchical procedure, those effects included therein, namely
Verdict, Defendant Attractiveness, and Plaintiff Attractiveness.
As explained in the article, nonhierarchical logit analysis was then used to test a
model including only effects that involved the verdict (dependent) variable. The
saturated model (all effects) produced significant parameters only for Verdict,
Defendant Attractiveness x Verdict, and Plaintiff Attractiveness x Verdict. A reduced
model containing only these three effects fit the data well, as indicated by the
nonsignificant goodness-of-fit test. All three retained parameters remained significant
in the reduced model.
The output from Crosstabs helps explain the significant effects. The effect of verdict is
due to guilty verdicts being significantly more frequent (66%) than not guilty (34%)
verdicts. The two interactions each show that physically attractive persons are favored
over physically unattractive persons.
Karl L. Wuensch Dept. of Psychology East Carolina University Greenville, NC 27858
September, 2009
http://core.ecu.edu/psyc/wuenschk/MV/LogLin/Logit-SPSS-Output.pdf --
annotated PASW output
SAS Catmod code
Return to Wuensch's Statistics Lessons Page
Reliability-Validity-Scaling.docx
A Brief Introduction to Reliability, Validity, and Scaling
Reliability
Simply put, a reliable measuring instrument is one which gives you the same
measurements when you repeatedly measure the same unchanged objects or events.
We shall briefly discuss here methods of estimating an instrument's reliability. The
theory underlying this discussion is that which is sometimes called classical
measurement theory. The foundations for this theory were developed by Charles
Spearman (1904, "General intelligence," objectively determined and measured.
American Journal of Psychology, 15, 201-293).
If a measuring instrument were perfectly reliable, then it would have a perfect
positive (r = +1) correlation with the true scores. If you measured an object or event
twice, and the true scores did not change, then you would get the same measurement
both times.
We theorize that our measurements contain random error, but that the mean
error is zero. That is, some of our measurements have errors that make them lower than
the true scores, but others have errors that make them higher than the true scores, with
the sum of the score-decreasing errors being equal to the sum of the score-increasing
errors. Accordingly, random error will not affect the mean of the measurements, but it
will increase the variance of the measurements.
Our definition of reliability is

r_XX = σ²_T / σ²_M = σ²_T / (σ²_T + σ²_E) = r²_TM

where σ²_T is the variance of the true scores, σ²_M is the variance of the
measurements, and σ²_E is the variance of the errors.
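A quick simulation (a Python/NumPy sketch, with arbitrary illustrative variances) shows that this ratio of variances matches the squared correlation between the measurements and the true scores:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true = rng.normal(0.0, 2.0, n)    # true-score variance = 4
error = rng.normal(0.0, 1.0, n)   # error variance = 1, mean error = 0
measured = true + error           # each measurement = true score + random error

theoretical = 4.0 / (4.0 + 1.0)   # sigma_T^2 / (sigma_T^2 + sigma_E^2) = .80
r_tm = np.corrcoef(true, measured)[0, 1]
# r_tm squared is (approximately) the theoretical reliability, .80
```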
The Data
Download the KJ.dat data file from my StatData page. These are the data from
the research that was reported in:
Wuensch, K. L., Jenkins, K. W., & Poteat, G. M. (2002). Misanthropy, idealism,
and attitudes towards animals. Anthrozoös, 15, 139-149.
A summary of the research can be found at Misanthropy, Idealism, and Attitudes
About Animals.
SAS
To illustrate the computation of Cronbach's alpha with SAS, I shall use the data
set we used back in PSYC 6430 when learning to do correlation/regression analysis on
SAS, KJ.dat. Columns 1 through 10 contain the participants' responses to the first ten
items, which constitute the idealism scale.
Obtain the program file, Alpha.sas, from my SAS Programs page. The simple
way to get Cronbach's alpha is to use the NOMISS and ALPHA options with PROC
CORR. I have also included an illustration of how Cronbach's alpha can be computed
from the item variances and an illustration of how to compute maximized λ4 (lambda4)
and estimated maximized λ4.
SPSS
Boot up SPSS and click File, Read Text Data. Point the wizard to the KJ.dat file.
On step one, indicate that there is no predefined format. On step two, indicate that the
data file is of FIXED WIDTH, and that there are no variable names at the top. On step
three indicate that the data start on line one, there is one line per case, and all cases
should be read. On step four you will see a screen like that atop the next page. To
indicate that the scores for item one are in column one, item two in column two, etc.,
just click between columns one and two, two and three, and so on through ten and eleven.
Coefficient alpha can be computed from the variances as

α = [n / (n - 1)] × (1 - Σσ²_I / σ²_T)

where the ratio of variances is the sum of the n item variances
divided by the total test variance. The second part of the SAS program illustrates the
application of this method for computing coefficient alpha.
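The same variance-based computation can be sketched in Python (a hypothetical helper, equivalent in intent to the SAS illustration mentioned above):

```python
import numpy as np

def cronbach_alpha(items):
    """Coefficient alpha from item variances.

    items: 2-D array, rows = respondents, columns = items.
    alpha = [n/(n-1)] * (1 - sum of item variances / total test variance)
    """
    items = np.asarray(items, dtype=float)
    n = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (n / (n - 1)) * (1 - item_vars.sum() / total_var)
```

For two perfectly parallel items (identical columns), this returns an alpha of exactly 1.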
Maximized λ4 (Lambda4)
H. G. Osburn (Coefficient alpha and related internal consistency reliability
coefficients, Psychological Methods, 2000, 5, 343-355) noted that coefficient alpha is a
lower bound to the true reliability of a measuring instrument, and that it may seriously
underestimate the true reliability. Osburn used Monte Carlo techniques to study a
variety of alternative methods of estimating reliability from internal consistency. His
conclusion was that maximized lambda4 was the most consistently accurate of the
techniques.
λ4 is computed as is coefficient alpha, but on only one pair of split-halves of the
instrument. To obtain maximized λ4, one simply computes λ4 for all possible split-
halves and then selects the largest obtained value of λ4. The problem is that the
number of possible split-halves is .5(2n)! / (n!)² for a test with 2n items. If there are only four
or five items, this is not so bad. For example, suppose we decide to use only the first
four items in the idealism instrument. Look at the third part of the SAS program. In the
data step I create the three possible split-halves and the total test score for a scale
comprised of only the first four items of the idealism measure. Proc Corr is used to
obtain an alpha coefficient of .717. Then I use Proc Corr three more times, obtaining
the correlations for each pair of split halves. Those correlations are .49581, .66198, and
.53911. When I apply the Spearman-Brown correction (taking into account that each
split half has only half as many items as does the total instrument), λ4 = 2r / (1 + r), the
values of λ4 are .6629, .7966, and .7005. The largest of these, .7966, is maximized λ4.
The average of these, .72, is Cronbach's alpha.
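The Spearman-Brown arithmetic above can be reproduced with a short Python sketch, using the three split-half correlations reported in the text:

```python
def spearman_brown(r):
    """Correct a split-half r for the halves being half-length tests."""
    return 2 * r / (1 + r)

halves = [0.49581, 0.66198, 0.53911]      # the three split-half correlations
lam4 = [spearman_brown(r) for r in halves]
max_lam4 = max(lam4)                      # maximized lambda4
alpha = sum(lam4) / len(lam4)             # mean of all split-halves = alpha
```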
If you have an even number of items, λ4 = 2[1 - (σ²_A + σ²_B) / σ²_T], where the three
variances are for half A, half B, and the total test. My program computes the
variances for each half and the total variance, and then λ4 is computed for each split-
half using these variances. The maximized λ4 is .796. I also computed coefficient alpha
as the mean of the three possible pairs of split-halves. Note that the value obtained is the
same as that reported by Proc Corr.
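The variance form of λ4 is easy to verify in Python (a sketch with a toy split):

```python
import numpy as np

def lambda4_from_variances(half_a, half_b):
    """lambda4 = 2 * [1 - (var_A + var_B) / var_Total] for one split-half."""
    a = np.asarray(half_a, dtype=float)
    b = np.asarray(half_b, dtype=float)
    var_t = (a + b).var(ddof=1)  # variance of the total test score
    return 2.0 * (1.0 - (a.var(ddof=1) + b.var(ddof=1)) / var_t)

# Two perfectly parallel halves yield lambda4 = 1 (perfect split-half reliability).
perfect = lambda4_from_variances([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0])
```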
If you have an odd number of items, use the method which actually computes r
for each split-half. For example, I had data from a 5-item misanthropy scale. I created
10 split-halves (in each, one set had only 2 items, the other had 3 items). The
correlations, with the Spearman-Brown correction, were .586, .718, .623, .765, .684,
.686, .776, .687, .453, and .625. The highest corrected correlation, .776, is the
maximized λ4. The mean of the 10 corrected correlations is Cronbach's alpha, .66.
Estimating Maximized λ4
When there are more than just a few items, there are just too many possible split-
halves to be computing λ4 on each one. For our 10-item idealism scale, there are 126
possible split-halves, so don't even think about computing λ4 for each. There are
methods for estimating maximized λ4. If you are interested in such methods, read my
document Estimating Maximized Lambda4.
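The count of distinct split-halves, .5(2n)!/(n!)², can be checked in Python:

```python
from math import factorial

def n_split_halves(k):
    """Number of distinct split-halves for a test with k = 2n items."""
    n = k // 2
    return factorial(k) // (2 * factorial(n) ** 2)

# A 4-item test has only 3 split-halves; a 10-item scale already has 126.
```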
I have archived some EDSTAT-L discussions of Cronbach's alpha. They are
available at Discussion of Cronbach's Alpha On the EDSTAT-L List.
Cronbach's Alpha When There Are Only Two Items on the Scale
A colleague, while reviewing a manuscript, asked about Cronbach's alpha for a
scale with only two items. The authors had reported the simple r between the two
items. My response was that the alpha would be higher than that because of the
Spearman-Brown correction. To verify that, I used SPSS to compute r and alpha for two
items selected from a scale that measures forgiveness. The alpha for the full scale is
well over .9.
Correlations
                             RAof3     Aof7
RAof3  Pearson Correlation   1         .679**
       Sig. (2-tailed)                 .000
       N                     468       468
There is only one possible split-half when there are only two items, and the r for
that split-half is, here, .679. Applying the Spearman-Brown correction,
alpha = 2r / (1 + r) = 2(.679) / 1.679 = .81, which is what SPSS reports:
Reliability Statistics
Cronbach's Alpha    N of Items
.807                2

Return to Wuensch's Stats Lessons Page
Copyright 2011, Karl L. Wuensch - All rights reserved.
PCA-SPSS.docx
Principal Components Analysis - PASW
In principal components analysis (PCA) and factor analysis (FA) one wishes to
extract from a set of p variables a reduced set of m components or factors that accounts
for most of the variance in the p variables. In other words, we wish to reduce a set of p
variables to a set of m underlying superordinate dimensions.
These underlying factors are inferred from the correlations among the p
variables. Each factor is estimated as a weighted sum of the p variables. The ith factor
is thus

F_i = W_i1 X_1 + W_i2 X_2 + ... + W_ip X_p

One may also express each of the p variables as a linear combination of the m
factors,

X_j = A_1j F_1 + A_2j F_2 + ... + A_mj F_m + U_j

where U_j is the variance that is unique to variable j, variance that cannot be explained
by any of the common factors.
Goals of PCA and FA
One may do a PCA or FA simply to reduce a set of p variables to m
components or factors prior to further analyses on those m factors. For example,
Ossenkopp and Mazmanian (Physiology and Behavior, 34: 935-941) had 19 behavioral
and physiological variables from which they wished to predict a single criterion variable,
physiological response to four hours of cold-restraint. They first subjected the 19
predictor variables to a FA. They extracted five factors, which were labeled Exploration,
General Activity, Metabolic Rate, Behavioral Reactivity, and Autonomic Reactivity.
They then computed for each subject scores on each of the five factors. That is, each
subject's set of scores on 19 variables was reduced to a set of scores on 5 factors.
These five factors were then used as predictors (of the single criterion) in a stepwise
multiple regression.
One may use FA to discover and summarize the pattern of intercorrelations
among variables. This is often called Exploratory FA. One simply wishes to group
together (into factors) variables that are highly correlated with one another, presumably
because they all are influenced by the same underlying dimension (factor). One may
also then operationalize (invent a way to measure) the underlying dimension by a linear
combination of the variables that contributed most heavily to the factor.
If one has a theory regarding what basic dimensions underlie an observed event,
one may engage in Confirmatory Factor Analysis. For example, if I believe that
performance on standardized tests of academic aptitude represents the joint operation
of several basically independent faculties, such as Thurstone's Verbal Comprehension,
Word Fluency, Simple Arithmetic, Spatial Ability, Associative Memory, Perceptual
Speed, and General Reasoning, rather than one global intelligence factor, then I may
wish to confirm that structure with a confirmatory factor analysis. Consider the partial
correlation between any pair of variables, X_i and X_j, with the effects of all the other
variables removed, r_ij.12..(i)..(j)..p.
A large partial correlation indicates that the variables involved share variance that
is not shared by the other variables in the data set. Kaiser's Measure of Sampling
Adequacy (MSA) for a variable X_i is the ratio of the sum of the squared simple rs
between X_i and each other X to (that same sum plus the sum of the squared partial rs
between X_i and each other X). Recall that squared rs can be thought of as variances.
MSA = Σ r²_ij / (Σ r²_ij + Σ pr²_ij)
Small values of MSA indicate that the correlations between X_i and the other
variables are unique, that is, not related to the remaining variables outside each simple
correlation.
correlation. Kaiser has described MSAs above .9 as marvelous, above .8 as
meritorious, above .7 as middling, above .6 as mediocre, above .5 as miserable, and
below .5 as unacceptable.
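Since the partial correlations can be obtained from the inverse of the correlation matrix, Kaiser's MSA is straightforward to compute. Here is a NumPy sketch (a hypothetical helper, not the SAS or SPSS output discussed here):

```python
import numpy as np

def kaiser_msa(R):
    """Kaiser's MSA for each variable in correlation matrix R.

    MSA_i = sum of squared simple r's involving X_i, divided by that sum
    plus the sum of squared partial r's (all other variables partialled out).
    """
    R = np.asarray(R, dtype=float)
    Rinv = np.linalg.inv(R)
    d = np.sqrt(np.diag(Rinv))
    partial = -Rinv / np.outer(d, d)  # partial correlations from the inverse
    np.fill_diagonal(partial, 0.0)
    simple = R - np.eye(len(R))       # simple r's with the diagonal zeroed
    num = (simple ** 2).sum(axis=1)
    return num / (num + (partial ** 2).sum(axis=1))

# Three equally correlated variables get identical, middling MSAs.
vals = kaiser_msa([[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]])
```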
The MSA option in SAS PROC FACTOR [Enter PROC FACTOR MSA;] gives
you a matrix of the partial correlations, the MSA for each variable, and an overall MSA
computed across all variables. Variables with small MSAs should be deleted prior to FA
or the data set supplemented with additional relevant variables which one hopes will be
correlated with the offending variables.
For our sample data the partial correlation matrix looks like this:
COST SIZE ALCOHOL REPUTAT COLOR AROMA TASTE
COST 1.00 .54 -.11 -.26 -.10 -.14 .11
SIZE .54 1.00 .81 .11 .50 .06 -.44
ALCOHOL -.11 .81 1.00 -.23 -.38 .06 .31
REPUTAT -.26 .11 -.23 1.00 .23 -.29 -.26
COLOR -.10 .50 -.38 .23 1.00 .57 .69
AROMA -.14 .06 .06 -.29 .57 1.00 .09
TASTE .11 -.44 .31 -.26 .69 .09 1.00
___________________________________________________________
MSA .78 .55 .63 .76 .59 .80 .68
OVERALL MSA = .67
These MSAs may not be marvelous, but they aren't low enough to make me
drop any variables (especially since I have only seven variables, already an
unrealistically low number).
The PASW output includes the overall MSA in the same table as the (useless)
Bartlett's test of sphericity.
The partial correlations (each multiplied by minus 1) are found in the Anti-Image
Correlation Matrix. On the main diagonal of this matrix are the MSAs for the individual
variables.
Anti-image Matrices
Anti-image Correlation
          cost    size    alcohol reputat color   aroma   taste
cost      .779a  -.543    .105    .256    .100    .135   -.105
size     -.543    .550a  -.806   -.109   -.495    .061    .435
alcohol   .105   -.806    .630a   .226    .381   -.060   -.310
reputat   .256   -.109    .226    .763a  -.231    .287    .257
color     .100   -.495    .381   -.231    .590a  -.574   -.693
aroma     .135    .061   -.060    .287   -.574    .801a  -.087
taste    -.105    .435   -.310    .257   -.693   -.087    .676a
a. Measures of Sampling Adequacy (MSA)
Extracting Principal Components
We are now ready to extract principal components. We shall let the computer do
most of the work, which is considerable. From p variables we can extract p
components. This will involve solving p equations with p unknowns. The variance in
the correlation matrix is repackaged into p eigenvalues. Each eigenvalue represents
the amount of variance that has been captured by one component.
Each component is a linear combination of the p variables. The first component
accounts for the largest possible amount of variance. The second component, formed
from the variance remaining after that associated with the first component has been
extracted, accounts for the second largest amount of variance, etc. The principal
components are extracted with the restriction that they are orthogonal. Geometrically
they may be viewed as dimensions in p-dimensional space where each dimension is
perpendicular to each other dimension.
KMO and Bartlett's Test
Kaiser-Meyer-Olkin Measure of Sampling Adequacy            .665
Bartlett's Test of Sphericity    Approx. Chi-Square      1637.9
                                 df                          21
                                 Sig.                      .000

Each of the p variables' variance is standardized to one. Each factor's eigenvalue
may be compared to 1 to see how much more (or less) variance it represents than does
a single variable. With p variables there is p x 1 = p variance to distribute. The principal
components extraction will produce p components which in the aggregate account for
all of the variance in the p variables. That is, the sum of the p eigenvalues will be equal
to p, the number of variables. The proportion of variance accounted for by one
component equals its eigenvalue divided by p.
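The repackaging can be seen concretely in a few lines of Python/NumPy (a sketch of mine, using a made-up four-variable correlation matrix): the eigenvalues of any p x p correlation matrix sum to p, and each eigenvalue divided by p is that component's proportion of variance.

```python
import numpy as np

# Hypothetical 4-variable correlation matrix (any valid one will do).
R = np.array([[1.0, 0.7, 0.2, 0.1],
              [0.7, 1.0, 0.3, 0.2],
              [0.2, 0.3, 1.0, 0.6],
              [0.1, 0.2, 0.6, 1.0]])
p = R.shape[0]

eigenvalues = np.linalg.eigvalsh(R)[::-1]  # sorted largest first
proportions = eigenvalues / p

# The p eigenvalues repackage all of the variance: they sum to p
# (the trace of R), and the proportions of variance sum to 1.
```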
For our beer data, here are the eigenvalues and proportions of variance for the
seven components:
Deciding How Many Components to Retain
So far, all we have done is to repackage the variance from p correlated variables
into p uncorrelated components. We probably want to have fewer than p components.
If our p variables do share considerable variance, several of the p components should
have large eigenvalues and many should have small eigenvalues. One needs to decide
how many components to retain. One handy rule of thumb is to retain only components
with eigenvalues of one or more. That is, drop any component that accounts for less
variance than does a single variable. Another device for deciding on the number of
components to retain is the scree test. This is a plot with eigenvalues on the ordinate
and component number on the abscissa. Scree is the rubble at the base of a sloping
cliff. In a scree plot, scree is those components that are at the bottom of the sloping plot
of eigenvalues versus component number. The plot provides a visual aid for deciding at
what point including additional components no longer increases the amount of variance
accounted for by a nontrivial amount. Here is the scree plot produced by PASW:
Initial Eigenvalues
Component    Total    % of Variance    Cumulative %
1            3.313        47.327           47.327
2            2.616        37.369           84.696
3             .575         8.209           92.905
4             .240         3.427           96.332
5             .134         1.921           98.252
6             .090         1.221           99.473
7             .040          .527          100.000
Extraction Method: Principal Component Analysis.
For our beer data, only the first two components have eigenvalues greater than
1. There is a big drop in eigenvalue between component 2 and component 3. On a
scree plot, components 3 through 7 would appear as scree at the base of the cliff
composed of components 1 and 2. Together components 1 and 2 account for 85% of
the total variance. We shall retain only the first two components.
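The eigenvalue-one rule is easy to apply mechanically; the scree-plot judgment, of course, stays human. A small Python sketch (mine), using the beer-data eigenvalues reported above:

```python
# Eigenvalues from the beer-data PCA reported above.
eigenvalues = [3.313, 2.616, 0.575, 0.240, 0.134, 0.090, 0.037]
p = len(eigenvalues)

# Kaiser's rule: retain components with eigenvalue >= 1, i.e., components
# accounting for at least as much variance as a single variable.
retained = [ev for ev in eigenvalues if ev >= 1.0]
n_retained = len(retained)          # 2 components for these data

# Proportion of total variance accounted for by the retained components.
cumulative = sum(retained) / p      # about .85
```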
I often find it useful to try at least three different solutions, and then decide
among them which packages the variance in a way most pleasing to me. Here I could
try a one component, a two component, and a three component solution.
Loadings, Unrotated and Rotated
Another matrix of interest is the loading matrix, also known as the factor
pattern matrix or the component matrix. The entries in this matrix, loadings, are
correlations between the components and the variables. Since the two components are
orthogonal, these correlation coefficients are also beta weights, that is,
X_j = A_1j F_1 + A_2j F_2 + U_j, thus A_1j equals the number of standard deviations that X_j
changes for each one standard deviation change in Factor 1. Here is the loading matrix
for our beer data:
[Scree plot: eigenvalue (0.0 to 3.5) on the ordinate versus component number (1 to 7) on the abscissa]
As you can see, almost all of the variables load well on the first component, all
positively except reputation. The second component is more interesting, with three large
positive loadings and three large negative loadings. Component 1 seems to reflect
concern for economy and quality versus reputation. Component 2 seems to reflect
economy versus quality.
Remember that each component represents an orthogonal (perpendicular)
dimension. Fortunately, we retained only two dimensions, so I can plot them on paper.
If we had retained more than two components, we could look at several pairwise plots
(two components at a time).
For each variable I have plotted in the vertical dimension its loading on
component 1, and in the horizontal dimension its loading on component 2. Wouldn't it
be nice if I could rotate these axes so that the two dimensions passed more nearly
through the two major clusters (COST, SIZE, ALCOHOL and COLOR, AROMA, TASTE)?
Imagine that the two axes are perpendicular wires joined at the origin (0,0) with a pin. I
rotate them, preserving their perpendicularity, so that the one axis passes through or
near the one cluster, the other through or near the other cluster. The number of
degrees by which I rotate the axes is the angle PSI. For these data, rotating the axes
-40.63 degrees has the desired effect.
Here is the loading matrix after rotation:
Component Matrixa
            Component
            1        2
COLOR      .760    -.576
AROMA      .736    -.614
REPUTAT   -.735    -.071
TASTE      .710    -.646
COST       .550     .734
ALCOHOL    .632     .699
SIZE       .667     .675
Extraction Method: Principal Component Analysis.
a. 2 components extracted.
Rotated Component Matrixa
            Component
            1        2
TASTE      .960    -.028
AROMA      .958     .010
COLOR      .952     .060
SIZE       .070     .947
ALCOHOL    .020     .942
COST      -.061     .916
REPUTAT   -.512    -.533
Extraction Method: Principal Component Analysis.
Rotation Method: Varimax with Kaiser Normalization.
a. Rotation converged in 3 iterations.
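Because only two components were retained, the rotation is just a rotation of the plane. As a check (a Python/NumPy sketch of mine, not part of the handout), multiplying the unrotated loadings by the 2 x 2 rotation matrix for psi = -40.63 degrees reproduces the rotated loadings within rounding:

```python
import numpy as np

# Unrotated loadings (Component Matrix above). Rows are
# COLOR, AROMA, REPUTAT, TASTE, COST, ALCOHOL, SIZE.
A = np.array([[ .760, -.576],
              [ .736, -.614],
              [-.735, -.071],
              [ .710, -.646],
              [ .550,  .734],
              [ .632,  .699],
              [ .667,  .675]])

psi = np.deg2rad(-40.63)
T = np.array([[np.cos(psi), -np.sin(psi)],
              [np.sin(psi),  np.cos(psi)]])

# Post-multiplying by the rotation matrix gives the rotated loadings;
# e.g., the COLOR row comes out near (.952, .06), matching the
# Rotated Component Matrix.
rotated = A @ T
```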
Number of Components in the Rotated Solution
I generally will look at the initial, unrotated, extraction and make an initial
judgment regarding how many components to retain. Then I will obtain and inspect
rotated solutions with that many, one less than that many, and one more than that many
components. I may use a "meaningfulness" criterion to help me decide which solution
to retain: if a solution leads to a component which is not well defined (has no or very
few variables loading on it) or which just does not make sense, I may decide not to
accept that solution.
One can err in the direction of extracting too many components (overextraction)
or too few components (underextraction). Wood, Tataryn, and Gorsuch (1996,
Psychological Methods, 1, 354-365) have studied the effects of under- and over-
extraction in principal factor analysis with varimax rotation. They used simulation
methods, sampling from populations where the true factor structure was known. They
found that overextraction generally led to less error (differences between the structure
of the obtained factors and that of the true factors) than did underextraction. Of course,
extracting the correct number of factors is the best solution, but it might be a good
strategy to lean towards overextraction to avoid the greater error found with
underextraction.
Wood et al. did find one case in which overextraction was especially problematic:
the case where the true factor structure is that there is only a single factor, there are
no unique variables (variables which do not share variance with others in the data set),
and where the statistician extracts two factors and employs a varimax rotation (the type
I used with our example data). In this case, they found that the first unrotated factor had
loadings close to those of the true factor, with only low loadings on the second factor.
However, after rotation, factor splitting took place for some of the variables: the
obtained solution grossly underestimated their loadings on the first factor and
overestimated them on the second factor. That is, the second factor was imaginary and
the first factor was corrupted. Interestingly, if there were unique variables in the data
set, such factor splitting was not a problem. The authors suggested that one include
unique variables in the data set to avoid this potential problem. I suppose one could do
this by including "filler" items on a questionnaire. The authors recommend using a
random number generator to create the unique variables or manually inserting into the
correlation matrix variables that have a zero correlation with all others. These unique
variables can be removed for the final analysis, after determining how many factors to
retain.
Explained Variance
The PASW output also gives the variance explained by each component after
the rotation. The variance explained is equal to the sum of squared loadings (SSL)
across variables. For component 1 that is (.76² + .74² + ... + .67²) = 3.31 = its
eigenvalue before rotation, and (.96² + .96² + ... + (-.51)²) = 3.02 after rotation. For
component 2 the SSLs are 2.62 and 2.91. After rotation component 1 accounted for
3.02/7 = 43% of the total variance and 3.02 / (3.02 + 2.91) = 51% of the variance
distributed between the two components. After rotation the two components together
account for (3.02 + 2.91)/7 = 85% of the total variance.
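These sums of squared loadings can be verified directly from the rotated loading matrix. A quick Python check (my sketch, using the rotated loadings reported above):

```python
# Rotated loadings for (TASTE, AROMA, COLOR, SIZE, ALCOHOL, COST, REPUTAT).
loadings = [(.960, -.028), (.958, .010), (.952, .060), (.070, .947),
            (.020, .942), (-.061, .916), (-.512, -.533)]

# Variance explained by a component after rotation = the sum of its
# squared loadings (SSL) across the variables.
ssl1 = sum(a ** 2 for a, b in loadings)   # about 3.02
ssl2 = sum(b ** 2 for a, b in loadings)   # about 2.91

p = len(loadings)
total_after_rotation = (ssl1 + ssl2) / p  # about .85 of the total variance
```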
The SSLs for components can be used to help decide how many components to
retain. An after rotation SSL is much like an eigenvalue. A rotated component with an
SSL of 1 accounts for as much of the total variance as does a single variable. One may
want to retain and rotate a few more components than indicated by the eigenvalue-of-1-or-
more criterion. Inspection of the retained components' SSLs after rotation should tell
you whether or not they should be retained. Sometimes a component with an
eigenvalue > 1 will have a postrotation SSL < 1, in which case you may wish to drop it
and ask for a smaller number of retained components.
You also should look at the postrotation loadings to decide how well each
retained component is defined. If only one variable loads heavily on a component, that
component is not well defined. If only two variables load heavily on a component, the
component may be reliable if those two variables are highly correlated with one another
but not with the other variables.
Naming Components
Total Variance Explained (Rotation Sums of Squared Loadings)
Component    Total    % of Variance    Cumulative %
1            3.017        43.101           43.101
2            2.912        41.595           84.696
Extraction Method: Principal Component Analysis.
Now let us look at the rotated loadings again and try to name the two
components. Component 1 has heavy loadings (>.4) on TASTE, AROMA, and COLOR
and a moderate negative loading on REPUTATION. I'd call this component
AESTHETIC QUALITY. Component 2 has heavy loadings on large SIZE, high
ALCOHOL content, and low COST and a moderate negative loading on REPUTATION.
I'd call this component CHEAP DRUNK.
Communalities
Let us also look at the SSL for each variable across factors. Such an SSL is
called a communality. This is the amount of the variable's variance that is accounted
for by the components (since the loadings are correlations between variables and
components and the components are orthogonal, a variable's communality represents
the R² of the variable predicted from the components). For our beer data the
communalities are COST, .84; SIZE, .90; ALCOHOL, .89; REPUTAT, .55; COLOR, .91;
AROMA, .92; and TASTE, .92.
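A communality is the same arithmetic applied across a row of the loading matrix instead of down a column. A sketch (mine), again from the rotated loadings above:

```python
# Rotated loadings for each variable on the two retained components.
loadings = {"TASTE": (.960, -.028), "AROMA": (.958, .010),
            "COLOR": (.952, .060), "SIZE": (.070, .947),
            "ALCOHOL": (.020, .942), "COST": (-.061, .916),
            "REPUTAT": (-.512, -.533)}

# Communality = SSL across the retained components for one variable;
# with orthogonal components this is the R-squared of the variable
# predicted from the components.
communality = {v: a ** 2 + b ** 2 for v, (a, b) in loadings.items()}
# e.g., COST comes out near .84 and REPUTAT near .55, as reported above.
```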
Orthogonal Versus Oblique Rotations
The rotation I used on these data is the VARIMAX rotation. It is the most
commonly used rotation. Its goal is to minimize the complexity of the components by
making the large loadings larger and the small loadings smaller within each component.
There are other rotational methods. QUARTIMAX rotation makes large loadings larger
and small loadings smaller within each variable. EQUAMAX rotation is a compromise
that attempts to simplify both components and variables. These are all orthogonal
rotations, that is, the axes remain perpendicular, so the components are not correlated
with one another.
It is also possible to employ oblique rotational methods. These methods do not
produce orthogonal components. Suppose you have done an orthogonal rotation and
you obtain a rotated loadings plot that looks like this:
Communalities
          Initial   Extraction
COST      1.000     .842
SIZE      1.000     .901
ALCOHOL   1.000     .889
REPUTAT   1.000     .546
COLOR     1.000     .910
AROMA     1.000     .918
TASTE     1.000     .922
Extraction Method: Principal Component Analysis.
The cluster of points midway between
axes in the upper left quadrant indicates that
a third component is present. The two
clusters in the upper right quadrant indicate
that the data would be better fit with axes
that are not orthogonal. Axes drawn
through those two clusters would not be
perpendicular to one another. We shall
return to the topic of oblique rotation later.
Return to Multivariate Analysis with PASW
Continue on to Factor Analysis with PASW
Copyright 2011, Karl L. Wuensch - All rights reserved.
Factor Analysis - PASW
First Read Principal Components Analysis.
The methods we have employed so far attempt to repackage all of the variance
in the p variables into principal components. We may wish to restrict our analysis to
variance that is common among variables. That is, when repackaging the variables'
variance we may wish not to redistribute variance that is unique to any one variable.
This is Common Factor Analysis. A common factor is an abstraction, a hypothetical
dimension that affects at least two of the variables. We assume that there is also one
unique factor for each variable, a factor that affects that variable but does not affect any
other variables. We assume that the p unique factors are uncorrelated with one another
and with the common factors. It is the variance due to these unique factors that we
shall exclude from our FA.
Iterated Principal Factors Analysis
The most common sort of FA is principal axis FA, also known as principal
factor analysis. This analysis proceeds very much like that for a PCA. We eliminate
the variance due to unique factors by replacing the 1's on the main diagonal of the
correlation matrix with estimates of the variables' communalities. Recall that a variable's
communality, its SSL across components or factors, is the amount of the variable's
variance that is accounted for by the components or factors. Since our factors will be
common factors, a variable's communality will be the amount of its variance that is
common rather than unique. The R² between a variable and all other variables is most
often used initially to estimate a variable's communality.
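That squared multiple correlation (SMC) can be pulled straight from the inverse of the correlation matrix: SMC_i = 1 - 1/(R⁻¹)_ii. A minimal Python sketch (mine, with a made-up correlation matrix standing in for the beer data):

```python
import numpy as np

# Hypothetical three-variable correlation matrix.
R = np.array([[1.0, 0.5, 0.4],
              [0.5, 1.0, 0.3],
              [0.4, 0.3, 1.0]])

# Initial communality estimate for variable i: its squared multiple
# correlation with all other variables, SMC_i = 1 - 1/inv(R)_ii.
smc = 1.0 - 1.0 / np.diag(np.linalg.inv(R))

# Principal axis factoring replaces the 1's on the main diagonal
# with these estimates before extracting factors.
R_reduced = R.copy()
np.fill_diagonal(R_reduced, smc)
```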
Using the beer data, change the extraction method to principal axis:
When you ran the FactOut.sas program, it produced an output file,
FactBeer.dat, which contained factor scores, SES, and GROUP data for each subject.
Suppose that we wished to compare the two groups on the remaining three variables
using a discriminant function analysis. Here is how to do that analysis:
First, import the data file. Click File, Read Text Data, and point to the
FactBeer.dat file. Tell the wizard that there is no predefined format, the file is delimited,
no variable names at top, data starts on line 1, each line is a case, read all cases, and
the delimiter is a space. Name the first variable AesthetQ, the second CheapDr, the
third SES, and the fourth Group. Click Analyze, Classify, Discriminant. Put Group in
the Grouping Variable box, and AesthetQ, CheapDr, and SES in the Independents box.
Enter the independents together. For Statistics, ask for Means, Univariate ANOVAs,
and Box's M. Under Classify, ask for Computation of Priors From Group Sizes, a
Summary Table, and Separate-Groups Plots.
Look at the output. The Group Statistics and Tests of Equality of Group Means
show that, compared to Group 1, Group 2 is significantly more interested in the
aesthetic quality of their beer, significantly less interested in getting a cheap drunk, and
significantly higher in SES.
The discriminant analysis produces a weighted linear combination of the three
independent variables on which the two groups differ as much as possible. This
weighted linear combination is called the discriminant function, D. If we were to
conduct an ANOVA comparing the two groups on this weighted linear combination, the
ratio of the between-groups SS to the within-groups SS would be the eigenvalue (the
value which is maximized by the weights obtained for the linear combination). The
canonical correlation is the square root of the ratio of the between-groups SS to the
total SS. The square of this quantity is the same as eta-squared in ANOVA.
Accordingly, for our data, group membership accounts for 64% of the variance in
AesthetQ, CheapDr, and SES.
Wilks' lambda is used to test the significance of this canonical correlation. The
smaller the Wilks' lambda, the smaller the p value. SPSS uses a chi-square statistic to
approximate the p value. It is significant for our data -- that is, our groups differ
significantly on an optimally weighted combination of AesthetQ, CheapDr, and SES --
or, from another perspective, using the discriminant function to predict group
membership from scores on AesthetQ, CheapDr, and SES is significantly better than
just guessing group membership. Box's M tests one of the assumptions of this test of
significance. The nonsignificant result for our data tells us that we have no problem with
the assumption that our two groups have equal variance/covariance matrices.
The standardized discriminant function coefficients (the standardized
weighting coefficients for computing D) and the loadings (in the structure matrix, the
correlations between the predictor variables and D) indicate that SES is most important
Developed by Sewall Wright, path analysis is a method employed to determine
whether or not a multivariate set of nonexperimental data fits well with a particular (a
priori) causal model. Elazar J. Pedhazur (Multiple Regression in Behavioral Research,
2nd edition, Holt, Rinehart and Winston, 1982) has a nice introductory chapter on path
analysis which is recommended reading for anyone who intends to use path analysis.
This lecture draws heavily upon the material in Pedhazur's book.
Consider the path diagram presented in Figure 1.
Figure 1
[Path diagram: variables 1 (SES) and 2 (IQ) are exogenous, with r12 = .3; path coefficients p31 = .398, p32 = .041, p41 = .009, p42 = .501, p43 = .416; error terms E3 and E4 point to the endogenous variables 3 and 4]
Each circle represents a variable. We have data on each variable for each subject. In
this diagram SES and IQ are considered to be exogenous variables -- their variance is
assumed to be caused entirely by variables not in the causal model. The connecting
line with arrows at both ends indicates that the correlation between these two variables
will remain unanalyzed because we choose not to identify one variable as a cause of
the other variable. Any correlation between these variables may actually be causal (1
causing 2 and/or 2 causing 1) and/or may be due to 1 and 2 sharing common causes.
Q = (1 - R²F) / (1 - R²R). For our
example, Q = (1 - .437)/(1 - .297) = .801. A perfect fit would give a Q of one; less than
perfect fits yield Q's less than one. Q can also be computed simply by taking the
product of the squares of the error path coefficients for the full model and dividing by
the product of the squares of the error path coefficients for the reduced model. For our
model, Q = [(.968)²(.775)²] / [(.968)²(.866)²] = .801.
Finally, we compute the test statistic, W = -(N - d) ln(Q) where N = sample
size (let us assume N = 100), d = number of overidentifying restrictions (number of
paths eliminated from the full model to yield the reduced model), and ln = natural
logarithm. For our data, W = -(100 - 1) * ln(.801) = 21.97. W is evaluated with the
chi-square distribution on d degrees of freedom. For our W, p <.0001 and we conclude
that the model does not fit the data well. If you compute W for Figure 4A or 4B you will
obtain W = 0, p = 1.000, indicating perfect fit for those models.
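The Q and W computations are simple enough to script. Here is a Python sketch (mine; the chi-square p value itself would come from a table or a stats library):

```python
import math

def goodness_of_fit_W(r2_full, r2_reduced, n, d):
    """Specification-error test for an overidentified path model.
    d = number of overidentifying restrictions (paths dropped)."""
    Q = (1 - r2_full) / (1 - r2_reduced)
    W = -(n - d) * math.log(Q)   # evaluated as chi-square on d df
    return Q, W

# The example above: R-squared full = .437, reduced = .297, N = 100, d = 1.
Q, W = goodness_of_fit_W(0.437, 0.297, 100, 1)
# Q comes out near .801 and W near 21.97, as in the text.
```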
The overidentified model (Figure 5) we tested had only one restriction (no p23),
so we could have more easily tested it by testing the significance of p23 in Figure 3B.
Our overidentified models may, however, have several such restrictions, and the
method I just presented allows us simultaneously to test all those restrictions. Consider
the overidentified model in Figure 6. The just-identified model in Figure 1 is an
appropriate model for computing R²F. Please note that the model in Figure 1A (with IQ
endogenous rather than exogenous) would not be appropriate, since that model would
include an E2 path, but our overidentified model has no E2 path (paths are drawn from
extraneous variables to endogenous variables but not to exogenous variables). In the
just-identified model E3 = sqrt(1 - R²3.12) = .911 and E4 = sqrt(1 - R²4.123) = .710. For our
overidentified model E3 = sqrt(1 - R²3.1) = .912 and E4 = sqrt(1 - R²4.23) = .710.
R²F = 1 - (.911)²(.710)² = .582. R²R = 1 - (.912)²(.710)² = .581. Q = (1 - .582)/(1 - .581)
= .998. If N = 100, W = -(100 - 2) * ln(.998) = 0.196 on 2 df (we eliminated two paths), p
= .91, so our overidentified model fits the data well.
Let us now evaluate three different overidentified models, all based on the same
correlation matrix. Variable V is a measure of the civil rights attitudes of the Voters in
116 congressional districts (N = 116). Variable C is the civil rights attitude of the
Congressman in each district. Variable P is a measure of the congressmen's
Perceptions of their constituents' attitudes towards civil rights. Variable R is a measure
of the congressmen's Roll Call behavior on civil rights. Suppose that three different a
priori models have been identified (each based on a different theory), as shown in
Figure 8.
Figure 8
Just-Identified Model
[Path diagram: variables V, P, C, R; error paths E_C = .867, E_P = .595, E_R; remaining coefficients .498, .324, .075, .560, .366, .507, .556]
For the just-identified model, R²F = 1 - (.867)²(.595)²(.507)² = .9316.
Model X
[Path diagram: variables V, P, C, R; error paths E_C = .766, E_P = .675, E_R; remaining coefficients .643, .327, .613, .510, .738]
For Model X, R²R = 1 - (.766)²(.675)²(.510)² = .9305.
Model Y
[Path diagram: variables V, P, C, R; error paths E_C = .867, E_P = .766, E_R; remaining coefficients .498, .327, .613, .643, .510]
For Model Y, R²R = 1 - (.867)²(.766)²(.510)² = .8853.
Model Z
[Path diagram: variables V, P, C, R; error paths E_C = .867, E_P = .595, E_R; remaining coefficients .498, .327, .613, .366, .510, .556]
For Model Z, R²R = 1 - (.867)²(.595)²(.510)² = .9308.
Goodness of fit: Q_X = (1 - .9316)/(1 - .9305) = .0684/.0695 = .9842.
Q_Y = .0684/(1 - .8853) = .5963.
Q_Z = .0684/(1 - .9308) = .9884.
It appears that Models X and Z fit the data better than does Model Y.
W_X = -(116 - 2) ln(.9842) = 1.816. On 2 df (two restrictions, no paths from V to
C or to R), p = .41. We do not reject the null hypothesis that Model X fits the data.
W_Y = -(116 - 2) ln(.5963) = 58.939, on 2 df, p < .0001. We do reject the null
and conclude that Model Y does not fit the data.
W_Z = -(116 - 1) ln(.9884) = 1.342, on 1 df (only one restriction, no V to R direct
path), p = .25. We do not reject the null hypothesis that Model Z fits the data. It seems
that Model Y needs some revision, but Models X and Z have passed this test.
Sometimes one can test the null hypothesis that one reduced model fits the data
as well as does another reduced model. One of the models must be nested within the
other -- that is, it can be obtained by eliminating one or more of the paths in the other.
For example, Model Y is exactly like Model Z except that the path from V to P has been
eliminated in Model Y. Thus, Y is nested within Z. To test the null, we compute
W = -(N - d) ln[(1 - M1)/(1 - M2)], where M2 is for the model that is nested within the
other, and d is the number of paths eliminated from the other model to obtain the
nested model. For Z versus Y, on d = 1 df, W = -(116 - 1) * ln[(1 - .9308)/(1 - .8853)]
= 58.112, p < .0001. We reject the null hypothesis. Removing the path from V to P
significantly reduced the fit between the model and the data.
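All of these W tests follow the same pattern, so they script easily. A Python sketch (mine), reusing the numbers from the text:

```python
import math

def W(r2_fuller, r2_nested, n, d):
    # W = -(N - d) * ln[(1 - M1)/(1 - M2)], evaluated as chi-square on d df,
    # where M2 belongs to the model nested within the other.
    return -(n - d) * math.log((1 - r2_fuller) / (1 - r2_nested))

N = 116
W_x = W(0.9316, 0.9305, N, 2)   # about 1.82,  2 df
W_y = W(0.9316, 0.8853, N, 2)   # about 58.94, 2 df
W_z = W(0.9316, 0.9308, N, 1)   # about 1.34,  1 df

# Nested comparison: Model Y (V-to-P path dropped) is nested within Model Z.
W_zy = W(0.9308, 0.8853, N, 1)  # about 58.11, 1 df
```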
Return to Wuensch's Stats Lessons Page
Copyright 2008, Karl L. Wuensch - All rights reserved.
Conducting a Path Analysis With SPSS/AMOS
Download the PATH-INGRAM.sps data file from my SPSS data page and then
bring it into SPSS. The data are those from the research that led to this publication:
Ingram, K. L., Cope, J. G., Harju, B. L., & Wuensch, K. L. (2000). Applying to graduate
school: A test of the theory of planned behavior. Journal of Social Behavior and
Personality, 15, 215-226.
Obtain the simple correlations among the variables:
Correlations
Attitude SubNorm PBC Intent Behavior
Attitude Pearson Correlation 1.000 .472 .665 .767 .525
SubNorm Pearson Correlation .472 1.000 .505 .411 .379
PBC Pearson Correlation .665 .505 1.000 .458 .496
Intent Pearson Correlation .767 .411 .458 1.000 .503
Behavior Pearson Correlation .525 .379 .496 .503 1.000
One can conduct a path analysis with a series of multiple regression analyses.
We shall test a model corresponding to Ajzen's Theory of Planned Behavior; look at
the model presented in the article cited above, which is available online. Notice that the
final variable, Behavior, has paths to it only from Intention and PBC. To find the
coefficients for those paths we simply conduct a multiple regression to predict Behavior
from Intention and PBC. Here is the output.
Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .585a   .343       .319                13.74634
a. Predictors: (Constant), PBC, Intent

ANOVAb
Model          Sum of Squares   df   Mean Square   F        Sig.
1  Regression    5611.752        2    2805.876     14.849   .000a
   Residual     10770.831       57     188.962
   Total        16382.583       59
a. Predictors: (Constant), PBC, Intent
b. Dependent Variable: Behavior
Coefficientsa
                Unstandardized Coefficients   Standardized
Model           B         Std. Error          Beta           t        Sig.
1  (Constant)   -11.346   10.420                             -1.089   .281
   Intent         1.520     .525              .350            2.894   .005
   PBC             .734     .264              .336            2.781   .007
a. Dependent Variable: Behavior
The Beta weights are the path coefficients leading to Behavior: .336 from PBC
and .350 from Intention.
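Because these path coefficients are standardized regression weights, they can also be computed directly from the correlation matrix: beta = Rxx⁻¹ rxy. A Python/NumPy check (my sketch), using the correlations printed above:

```python
import numpy as np

# Correlations among the predictors of Behavior (Intent, PBC) and
# their correlations with Behavior, from the correlation table above.
R_xx = np.array([[1.000, 0.458],
                 [0.458, 1.000]])
r_xy = np.array([0.503, 0.496])

# Standardized regression weights (the path coefficients to Behavior):
# solve R_xx @ beta = r_xy.
beta = np.linalg.solve(R_xx, r_xy)
# beta comes out near (.350, .336) for Intent and PBC, matching the output.
```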
In the model Intention has paths to it from Attitude, Subjective Norm, and
Perceived Behavioral Control, so we predict Intention from Attitude, Subjective Norm,
and Perceived Behavioral Control. Here is the output:
Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .774a   .600       .578                2.48849
a. Predictors: (Constant), PBC, SubNorm, Attitude

ANOVAb
Model          Sum of Squares   df   Mean Square   F        Sig.
1  Regression     519.799        3    173.266      27.980   .000a
   Residual       346.784       56      6.193
   Total          866.583       59
a. Predictors: (Constant), PBC, SubNorm, Attitude
b. Dependent Variable: Intent
Coefficientsa
                Unstandardized Coefficients   Standardized
Model           B        Std. Error           Beta           t        Sig.
1  (Constant)   3.906    1.828                                2.137   .037
   Attitude      .444     .064                .807            6.966   .000
   SubNorm       .029     .031                .095             .946   .348
   PBC          -.064     .059               -.126           -1.069   .290
a. Dependent Variable: Intent
The path coefficients leading to Intention are: .807 from Attitude, .095 from
Subjective Norms, and -.126 from Perceived Behavioral Control.
AMOS
Now let us use AMOS. The data file is already open in SPSS. Click Analyze,
AMOS 16. In the AMOS window which will open click File, New:
Click on the Draw observed variables icon which I have circled on the image
above. Move the cursor over into the drawing space on the right. Hold down the left
mouse button while you move the cursor to draw a rectangle. Release the mouse
button and move the cursor to another location and draw another rectangle. Annoyed
that you can't draw five rectangles of the same dimensions? Do it this way instead:
Draw one rectangle. Now click the Duplicate Objects icon, boxed in black in the
image below, point at that rectangle, hold down the left mouse button while you move to
the desired location for the second rectangle, and release the mouse button.
Draw five rectangles arranged something like this:
You can change the shape of the rectangles later, using the Change the shape
of objects tool (boxed in green in the image above), and you can move the rectangles
later using the Move objects tool (boxed in blue in the image above).
Click on the List variables in data set icon (boxed in orange in the image
above). From the window that results, drag and drop variable names to the boxes.
A more cumbersome way to do this is: Right-click the rectangle, select Object
Properties, then enter in the Object Properties window the name of the observed
variable. Close the window and enter variable names in the remaining rectangles in the
same way.
Click on the Draw paths icon (the single-headed arrow boxed in purple in the
image below) and then draw a path from Attitude to Intent (hold down the left mouse
button at the point you wish to start the path and then drag it to the ending point and
release the mouse button). Also draw paths from SubNorm to Intent, PBC to Intent,
PBC to Behavior, and Intent to Behavior.
Click on the Draw Covariances icon (the double-headed arrow boxed in purple
in the image above) and draw a path from SubNorm to Attitude. Draw another from
PBC to SubNorm and one from PBC to Attitude. You can use the Change the shape of
objects tool (boxed in green in the image above) to increase or decrease the arc of
these paths just select that tool, put the cursor on the path to be changed, hold down
the left mouse button, and move the mouse.
Click on the Add a unique variable to an existing variable icon (boxed in red in
the image above) and then move the cursor over the Intent variable and click the left
mouse button to add the error variable. Do the same to add an error variable to the
Behavior variable. Right-click the error circle leading to Intent, Select Object Properties,
and name the variable e1. Name the other error circle e2.
Click the Analysis properties icon -- to display the Analysis Properties
window. Select the Output tab and ask for the output shown below.
Click on the Calculate estimates icon . In the Save As window browse
to the desired folder and give the file a name. Click Save.
Change the Parameter Formats setting (boxed in red in the image below) to
Standardized estimates if it is not already set that way. Click the View the output path
diagram icon (boxed in red in the image below) and zap, you get the path analysis
diagram.
Click the View text icon to see extensive text output from the analysis.
The Copy to Clipboard icon (in green, above) can be used to copy the output to
another document via the clipboard. Click the Options icon (in red, above) to select
whether you want to view/copy just part of the output or all of the output.
Here are some parts of the output with my comments in green:
Variable Summary (Group number 1)
Your model contains the following variables (Group number 1)
Observed, endogenous variables
Intent
Behavior
Observed, exogenous variables
Attitude
PBC
SubNorm
Unobserved, exogenous variables
e1
e2
Variable counts (Group number 1)
Number of variables in your model: 7
Number of observed variables: 5
Number of unobserved variables: 2
Number of exogenous variables: 5
Number of endogenous variables: 2
Parameter summary (Group number 1)
Weights Covariances Variances Means Intercepts Total
Fixed 2 0 0 0 0 2
Labeled 0 0 0 0 0 0
Unlabeled 5 3 5 0 0 13
Total 7 3 5 0 0 15
Models
Default model (Default model)
Notes for Model (Default model)
Computation of degrees of freedom (Default model)
Number of distinct sample moments: 15
Number of distinct parameters to be estimated: 13
Degrees of freedom (15 - 13): 2
Result (Default model)
Minimum was achieved
Chi-square = .847
Degrees of freedom = 2
Probability level = .655
This Chi-square tests the null hypothesis that the overidentified (reduced) model
fits the data as well as does a just-identified (full, saturated) model. In a just-identified
model there is a direct path (not through an intervening variable) from each variable to
each other variable. When you delete one or more of the paths you obtain an
overidentified model. The nonsignificant Chi-square here indicates that the fit between
our overidentified model and the data is not significantly worse than the fit between the
just-identified model and the data. You can see the just-identified model here. While
one might argue that nonsignificance of this Chi-square indicates that the reduced
model fits the data well, even a well-fitting reduced model will be significantly different
from the full model if sample size is sufficiently large. A good fitting model is one that
can reproduce the original variance-covariance matrix (or correlation matrix) from the
path coefficients, in much the same way that a good factor analytic solution can
reproduce the original correlation matrix with little error.
Maximum Likelihood Estimates
Do note that the parameters are estimated by maximum likelihood (ML) methods rather
than by ordinary least squares (OLS) methods. OLS methods minimize the squared
deviations between values of the criterion variable and those predicted by the model.
ML (an iterative procedure) attempts to maximize the likelihood that obtained values of
the criterion variable will be correctly predicted.
Standardized Regression Weights: (Group number 1 - Default model)
Estimate
Intent   <--- SubNorm    .095
Intent   <--- PBC       -.126
Intent   <--- Attitude   .807
Behavior <--- Intent     .350
Behavior <--- PBC        .336
The path coefficients above match those we obtained earlier by multiple regression.
Correlations: (Group number 1 - Default model)
Estimate
Attitude <--> PBC .665
Attitude <--> SubNorm .472
PBC <--> SubNorm .505
Above are the simple correlations between exogenous variables.
Squared Multiple Correlations: (Group number 1 - Default model)
Estimate
Intent .600
Behavior .343
Above are the squared multiple correlation coefficients we saw in the two multiple
regressions.
The total effect of one variable on another can be divided into direct effects (no
intervening variables involved) and indirect effects (through one or more intervening
variables). Consider the effect of PBC on Behavior. The direct effect is .336 (the path
coefficient from PBC to Behavior). The indirect effect, through Intention, is computed as
the product of the path coefficient from PBC to Intention and the path coefficient from
Intention to Behavior, (-.126)(.350) = -.044. The total effect is the sum of the direct and
indirect effects, .336 + (-.044) = .292.
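As a check, this decomposition can be recomputed directly from the standardized coefficients in the output tables (a minimal sketch; the variable names are mine, not AMOS's):

```python
# Decomposition of effects for PBC -> Behavior, recomputed from the
# standardized coefficients reported in the AMOS output above.
p_pbc_intent = -.126      # PBC -> Intent
p_intent_behavior = .350  # Intent -> Behavior
p_pbc_behavior = .336     # PBC -> Behavior (direct)

indirect = p_pbc_intent * p_intent_behavior  # through Intent
total = p_pbc_behavior + indirect

print(round(indirect, 3))  # -0.044
print(round(total, 3))     # 0.292
```

The results match the Standardized Total and Indirect Effects tables below.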
Standardized Total Effects (Group number 1 - Default model)
SubNorm PBC Attitude Intent
Intent .095 -.126 .807 .000
Behavior .033 .292 .282 .350
Standardized Direct Effects (Group number 1 - Default model)
SubNorm PBC Attitude Intent
Intent .095 -.126 .807 .000
Behavior .000 .336 .000 .350
Standardized Indirect Effects (Group number 1 - Default model)
SubNorm PBC Attitude Intent
Intent .000 .000 .000 .000
Behavior .033 -.044 .282 .000
Model Fit Summary
CMIN
Model NPAR CMIN DF P CMIN/DF
Default model 13 .847 2 .655 .424
Saturated model 15 .000 0
Independence model 5 134.142 10 .000 13.414
NPAR is the number of parameters in the model. In the saturated (just-identified) model
there are 15 parameters: 5 variances (one for each variable) and 10 path coefficients.
For our tested (default) model there are 13 parameters; we dropped two paths. For
the independence model (one in which all of the paths have been deleted) there are five
parameters (the variances of the five variables).
CMIN is a Chi-square statistic comparing the tested model and the independence model
with the saturated model. We saw the former a bit earlier. CMIN/DF, the relative chi-
square, is an index of how much the fit of data to model has been reduced by dropping
one or more paths. One rule of thumb is to decide you have dropped too many paths if
this index exceeds 2 or 3.
RMR, GFI
Model RMR GFI AGFI PGFI
Default model 3.564 .994 .957 .133
Saturated model .000 1.000
Independence model 36.681 .471 .207 .314
RMR, the root mean square residual, is an index of the amount by which the estimated
(by your model) variances and covariances differ from the observed variances and
covariances. Smaller is better, of course.
GFI, the goodness of fit index, tells you what proportion of the variance in the sample
variance-covariance matrix is accounted for by the model. This should exceed .9 for a
good model. For the full model it will be a perfect 1. AGFI (adjusted GFI) is an
alternate GFI index in which the value of the index is adjusted for the number of
parameters in the model. The fewer the number of parameters in the model relative to
the number of data points (variances and covariances in the sample variance-
covariance matrix), the closer the AGFI will be to the GFI. With the PGFI (P is for
parsimony), the index is adjusted to reward simple models and to penalize models in
which few paths have been deleted. Note that for our data the PGFI is larger for the
independence model than for our tested model.
Baseline Comparisons
Model                NFI Delta1   RFI rho1   IFI Delta2   TLI rho2     CFI
Default model              .994       .968        1.009      1.046   1.000
Saturated model           1.000                   1.000               1.000
Independence model         .000       .000         .000       .000    .000
These goodness of fit indices compare your model to the independence model rather
than to the saturated model. The Normed Fit Index (NFI) is simply the difference
between the two models' chi-squares divided by the chi-square for the independence
model. For our data, that is (134.142 - .847)/134.142 = .994. Values of .9 or higher
(some say .95 or higher) indicate good fit. The Comparative Fit Index (CFI) uses a
similar approach (with a noncentral chi-square) and is said to be a good index for use
even with small samples. It ranges from 0 to 1, like the NFI, and .95 (or .9 or higher)
indicates good fit.
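The NFI arithmetic can be verified directly from the chi-square values in the CMIN table above (a quick sketch):

```python
# NFI = (chi-square of the independence model minus chi-square of the
# tested model) divided by the chi-square of the independence model.
chi_default = .847        # tested (default) model
chi_independence = 134.142

nfi = (chi_independence - chi_default) / chi_independence
print(round(nfi, 3))  # 0.994
```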
Parsimony-Adjusted Measures
Model PRATIO PNFI PCFI
Default model .200 .199 .200
Saturated model .000 .000 .000
Independence model 1.000 .000 .000
PRATIO is the ratio of how many paths you dropped to how many you could have
dropped (all of them). The Parsimony Normed Fit Index (PNFI), is the product of NFI
and PRATIO, and PCFI is the product of the CFI and PRATIO. The PNFI and PCFI are
intended to reward those whose models are parsimonious (contain few paths).
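These products can be recomputed from the values reported above (a small check; the degrees of freedom stand in for the counts of dropped and droppable paths):

```python
# Parsimony-adjusted indices recomputed from values reported above.
df_default, df_independence = 2, 10   # paths dropped vs. paths droppable
nfi, cfi = .994, 1.000                # from the Baseline Comparisons table

pratio = df_default / df_independence
pnfi = pratio * nfi
pcfi = pratio * cfi
print(round(pratio, 3), round(pnfi, 3), round(pcfi, 3))  # 0.2 0.199 0.2
```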
RMSEA
Model RMSEA LO 90 HI 90 PCLOSE
Default model .000 .000 .200 .693
Independence model .459 .391 .529 .000
The Root Mean Square Error of Approximation (RMSEA) estimates lack of fit compared
to the saturated model. RMSEA of .05 or less indicates good fit, and .08 or less
adequate fit. LO 90 and HI 90 are the lower and upper ends of a 90% confidence
interval on this estimate. PCLOSE is the p value testing the null that RMSEA is no
greater than .05.
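A common point-estimate formula for RMSEA, sqrt(max(chi-square - df, 0) / (df(N - 1))), reproduces both rows of the table above with N = 60. A sketch (the formula choice is my assumption; AMOS's internals may differ in detail):

```python
import math

def rmsea(chisq, df, n):
    # Point estimate: sqrt(max(chisq - df, 0) / (df * (n - 1))).
    # When chisq <= df the estimate is 0 (model fits at least as well
    # as expected by chance).
    return math.sqrt(max(chisq - df, 0) / (df * (n - 1)))

print(round(rmsea(.847, 2, 60), 3))      # 0.0   (default model)
print(round(rmsea(134.142, 10, 60), 3))  # 0.459 (independence model)
```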
HOELTER
Model                HOELTER .05   HOELTER .01
Default model                418           642
Independence model             9            11
If your sample were larger than this you would reject the null hypothesis that your model
fit the data just as well as does the saturated model.
The Just-Identified Model
Our Reduced Model
Matrix Input
AMOS will accept as input a correlation matrix (accompanied by standard
deviations and sample sizes) or a variance/covariance matrix. The SPSS syntax below
would input such a matrix:
MATRIX DATA VARIABLES=ROWTYPE_ Attitude SubNorm PBC Intent Behavior.
BEGIN DATA
N 60 60 60 60 60
SD 6.96 12.32 7.62 3.83 16.66
CORR 1
CORR .472 1
CORR .665 .505 1
CORR .767 .411 .458 1
CORR .525 .379 .496 .503 1
END DATA.
After running the syntax you would just click Analyze, AMOS, and proceed as
before. If you had the correlations but not the standard deviations, you could just
specify a value of 1 for each standard deviation. You would not be able to get the
unstandardized coefficients, but they are generally not of interest anyhow.
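If you ever need the covariance matrix itself, it can be rebuilt from the correlations and standard deviations in the MATRIX DATA block above, since cov(i, j) = r(i, j) * sd(i) * sd(j). A minimal sketch:

```python
# Rebuilding the variance/covariance matrix from the correlations and
# standard deviations given in the MATRIX DATA block above.
names = ["Attitude", "SubNorm", "PBC", "Intent", "Behavior"]
sd = [6.96, 12.32, 7.62, 3.83, 16.66]
r = [[1.000, .472, .665, .767, .525],
     [.472, 1.000, .505, .411, .379],
     [.665, .505, 1.000, .458, .496],
     [.767, .411, .458, 1.000, .503],
     [.525, .379, .496, .503, 1.000]]

cov = [[r[i][j] * sd[i] * sd[j] for j in range(5)] for i in range(5)]
print(round(cov[0][0], 2))  # 48.44, the variance of Attitude (6.96 squared)
```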
AMOS Files
Amos creates several files during the course of conducting a path analysis. Here
is what I have learned about them, mostly by trial and error.
.amw = a path diagram, with coefficients etc.
.amp = table output, with all the statistical output details. Open it with the AMOS
file manager.
.AmosOutput = looks the same as .amp, but takes up more space on the drive.
.AmosTN = thumbnail image of the path diagram
*.bk# = probably a backup file
Notes
To bring a path diagram into Word, just Edit, Copy to Clipboard, and then paste it
into Word.
If you pull up an .amw path diagram but have not specified an input data file,
you cannot alter the diagram and re-analyze the data. The .amw file includes the
coefficients etc., but not the input data.
If you input an altered data file and then call up the original .amw, you can
Calculate Estimates again and get a new set of coefficients etc. WARNING: when you
exit you will find that the old .amp and .AmosOutput have been updated with the
results of the analysis on the modified data. The original .amw file remains unaltered.
Links
Lesson by Garson at NCSU
Introduction to Path Analysis - maybe more than you want to know.
Wuensch's Stats Lessons Page
Karl L. Wuensch
Dept. of Psychology
East Carolina University
Greenville, NC 27858-4353
October, 2008
SEM-Intro.doc
An Introduction to Structural Equation Modeling (SEM)
SEM is a combination of factor analysis and multiple regression. It also goes by
the aliases causal modeling and analysis of covariance structure. Special cases of
SEM include confirmatory factor analysis and path analysis. You are already familiar
with path analysis, which is SEM with no latent variables.
The variables in SEM are measured (observed, manifest) variables (indicators)
and factors (latent variables). I think of factors as weighted linear combinations that we
have created/invented. Those who are fond of SEM tend to think of them as underlying
constructs that we have discovered.
Even though no variables may have been manipulated, variables and factors in
SEM may be classified as independent variables or dependent variables. Such
classification is made on the basis of a theoretical causal model, formal or informal.
The causal model is presented in a diagram where the names of measured variables
are within rectangles and the names of factors in ellipses. Rectangles and ellipses are
connected with lines having an arrowhead on one (unidirectional causation) or two (no
specification of direction of causality) ends.
Dependent variables are those which have one-way arrows pointing to them and
independent variables are those which do not. Dependent variables have residuals (are
not perfectly related to the other variables in the model) indicated by es (errors) pointing
to measured variables and ds pointing to latent variables.
The SEM can be divided into two parts. The measurement model is the part
which relates measured variables to latent variables. The structural model is the part
that relates latent variables to one another.
Statistically, the model is evaluated by comparing two variance/covariance
matrices. From the data a sample variance/covariance matrix is calculated. From this
matrix and the model an estimated population variance/covariance matrix is computed.
If the estimated population variance/covariance matrix is very similar to the known
sample variance/covariance matrix, then the model is said to fit the data well. A Chi-
square statistic is computed to test the null hypothesis that the model does fit the data
well. There are also numerous goodness of fit estimators designed to estimate how
well the model fits the data.
Sample Size. As with factor analysis, you should have lots of data when
evaluating a SEM. As usual, there are several rules of thumb. For a simple model, 200
cases might be adequate. When relationships among components of the model are
strong, 10 cases per estimated parameter may be adequate.
Assumptions. Multivariate normality is generally assumed. It is also assumed
that relationships between variables are linear, but powers of variables may be included
in the model to test polynomial relationships.
Problems. If one of the variables is a perfect linear combination of the other
variables, a singular matrix (which cannot be inverted) will cause the analysis to crash.
Multicollinearity can also be a problem.
An Example. Consider the model presented in Figure 14.4 of Tabachnick and
Fidell [Tabachnick, B. G., & Fidell, L. S. (2007). Using multivariate statistics (5th ed.).
Boston: Allyn & Bacon.]. There are five measurement variables (in rectangles) and two
latent variables (in ellipses). Two of the variables are considered independent (and
shaded), the others are considered dependent. From each latent variable there is a
path pointing to two indicators. From one measured variable (SenSeek) there is a path
pointing to a latent variable (SkiSat). Each measured variable has an error path leading
to it. Each latent variable has a disturbance path leading to it.
Parameters. The parameters of the model are regression coefficients for paths
between variables and variances/covariances of independent variables. Parameters
may be fixed to a certain value (usually 0 or 1) or may be estimated. In the
diagram, an asterisk (*) represents a parameter to be estimated. A "1" indicates that
the parameter has been fixed to the value 1. When two variables are not connected
by a path, the coefficient for that path is fixed at 0.
Tabachnick and Fidell used EQS to arrive at the final model displayed in their
Figure 14.5.
Model Identification. An identified model is one for which each of the
estimated parameters has a unique solution. To determine whether the model is
identified or not, compare the number of data points to the number of parameters to be
estimated. Since the input data set is the sample variance/covariance matrix, the
number of data points is the number of variances and covariances in that matrix, which
can be calculated as m(m + 1)/2, where m is the number of measured variables. For
T&F's example the number of data points is 5(6)/2 = 15.
If the number of data points equals the number of parameters to be estimated,
then the model is just-identified, or saturated. Such a model will fit the data
perfectly, and thus is of little use, although it can be used to estimate the values of the
coefficients for the paths.
If there are fewer data points than parameters to be estimated then the model is
underidentified. In this case the parameters cannot be estimated, and the
researcher needs to reduce the number of parameters to be estimated by deleting or
fixing some of them.
When the number of data points is greater than the number of parameters to be
estimated, then the model is overidentified, and the analysis can proceed.
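The counting above can be sketched in a few lines (using the 13-parameter path model from earlier in this document as the example; those counts are taken from that section):

```python
# Counting data points for a model with m measured variables and the
# resulting degrees of freedom: positive df means overidentified,
# zero means just-identified, negative means underidentified.
def data_points(m):
    return m * (m + 1) // 2  # distinct variances and covariances

m, n_parameters = 5, 13
df = data_points(m) - n_parameters
print(data_points(m), df)  # 15 2
```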
Identification of the Measurement Model. The scale of each independent
variable must be fixed to a constant (typically to 1, as in z scores) or to that of one of the
measured variables (a marker variable, one that is thought to be exceptionally well
related to this latent variable and not to other latent variables in the model). To fix
the scale to that of a measured variable one simply fixes to 1 the regression coefficient
for the path from the latent variable to the measured variable. Most often the scale of
dependent latent variables is set to that of a measured variable. The scale of
independent latent variables may be set to 1 or to the variance of a measured variable.
The measurement portion of the model will probably be identified if:
There is only one latent variable, it has at least three indicators that load on it,
and the errors of these indicators are not correlated with one another.
There are two or more latent variables, each has at least three indicators that
load on it, and the errors of these indicators are not correlated, each indicator
loads on only one factor, and the factors are allowed to covary.
There are two or more latent variables, but there is a latent variable on which
only two indicators load, the errors of the indicators are not correlated, each
indicator loads on only one factor, and none of the variances or covariances between
factors is zero.
Identification of the Structural Model. This portion of the model may be identified
if:
None of the latent dependent variables predicts another latent dependent
variable.
When a latent dependent variable does predict another latent dependent
variable, the relationship is recursive, and the disturbances are not correlated.
A relationship is recursive if the causal relationship is unidirectional (one line
pointing from the one latent variable to the other). In a nonrecursive
relationship there are two lines between a pair of variables, one pointing from
A to B and the other from B to A. Correlated disturbances are indicated by
being connected with a single line with arrowhead on each end.
When there is a nonrecursive relationship between latent dependent variables
or disturbances, spend some time with: Bollen, K.A. (1989). Structural
equations with latent variables. New York: John Wiley & Sons -- or hire an
expert in SEM.
If your model is not identified, the SEM program will throw an error and then you
must tinker with the model until it is identified.
Estimation. The analysis uses an iterative procedure to minimize the
differences between the sample variance/covariance matrix and the estimated
population variance/covariance matrix. Maximum Likelihood (ML) estimation is that most
frequently employed. Among the techniques available in the software used in this
course (SAS and AMOS), the ML and Generalized Least Squares (GLS) techniques
have fared well in Monte Carlo comparisons of techniques.
Fit. With large sample sizes, the Chi-square testing the null that the model fits
the data well may be significant even when the fit is good. Accordingly there has
been great interest in developing estimates of fit that do not rely on tests of
significance. In fact, there has been so much interest that there are dozens of such
indices of fit. Tabachnick and Fidell discuss many of these fit indices. You can also
find some discussion of them in my document Conducting a Path Analysis With
SPSS/AMOS.
Model Modification and Comparison. You may wish to evaluate two nested
models. Model R is nested within Model F if Model R can be created by deleting
one or more of the parameters from Model F. The significance of the difference in fit
can be tested with a simple Chi-square statistic. The value of this Chi-square equals
the Chi-square fit statistic for Model F minus the Chi-square statistic for Model R.
The degrees of freedom equal degrees of freedom for Model F minus degrees of
freedom for Model R. A nonsignificant Chi-square indicates that removal of the
parameters that are estimated in Model F but not in Model R did not significantly
reduce the fit of the model to the data.
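The difference test is easy to compute by hand. A stdlib-only sketch, using the chi-square values from the ski-satisfaction example later in this document (for differences of more than 1 df you would need a chi-square tail function such as scipy.stats.chi2.sf):

```python
import math

def chisq_diff_p(chi_r, df_r, chi_f, df_f):
    """p value for the chi-square difference test of Model R nested
    within Model F. The closed form erfc(sqrt(x / 2)) is valid only
    for a 1-df difference."""
    dchi, ddf = chi_r - chi_f, df_r - df_f
    assert ddf == 1, "this shortcut handles only a 1-df difference"
    return math.erfc(math.sqrt(dchi / 2))

# Reduced model chi2(4) = 8.814, fuller model chi2(3) = 2.053:
p = chisq_diff_p(8.814, 4, 2.053, 3)
print(p < .05)  # True: the extra parameter significantly improves fit
```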
The Lagrange Multiplier Test (LM) can be used to determine whether or not the
model fit would be significantly improved by estimating (rather than fixing) an
additional parameter. The Wald Test can be used to determine whether or not
deleting a parameter would significantly reduce the fit of the model. The Wald test is
available in SAS Calis, but not in AMOS. One should keep in mind that adding or
deleting a parameter will likely change the effect of adding or deleting other
parameters, so parameters should be added or deleted one at a time. It is
recommended that one add parameters before deleting parameters.
Reliability of Measured Variables. The variance in each measured variable is
assumed to stem from variance in the underlying latent variable. Classically, the
variance of a measured variable can be partitioned into true variance (that related to
the true variable) and (random) error variance. The reliability of a measured variable
is the ratio of true variance to total (true + error) variance. In SEM the reliability of a
measured variable is estimated by a squared correlation coefficient, which is the
proportion of variance in the measured variable that is explained by variance in the
latent variable(s).
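When an indicator loads on a single latent variable, that squared correlation is just the squared standardized loading. A tiny illustration (the .76 loading is a hypothetical value of my own, chosen only to show the arithmetic):

```python
# Reliability of an indicator that loads on one latent variable:
# the squared standardized loading; the remainder is error variance.
loading = .76  # hypothetical standardized loading
reliability = loading ** 2
error_variance = 1 - reliability
print(round(reliability, 2), round(error_variance, 2))  # 0.58 0.42
```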
Return to Wuensch's Stats Lessons Page
SEM with AMOS
SEM with SAS Proc Calis
Intro to SEM Garson at NC State
Karl L. Wuensch
Dept. of Psychology, East Carolina University, Greenville, NC USA
November, 2009
SEM-Ski-Amos.doc
SEM With Amos: Ski Satisfaction
The data for this example come from Tabachnick and Fidell (4th ed.). The
variance/covariance matrix is in the file SkiSat-VarCov.txt, which you should download
from my StatData page. Note that the data are different in the 5th edition of T&F.
Start by booting Amos Graphics. File, New to start a new diagram. Click File,
Data Files, File Name. Select SkiSat-VarCov.txt.
Click Open. Click View Data if you wish to peek at the data you have selected.
Click OK.
Click the Draw a Latent Variable icon once. Put the cursor
where you wish to draw the ellipse for the first latent variable and click
again. Click again once for each variable you wish to relate to the
first latent variable.
Click the Draw a Latent Variable icon again and place the
ellipse for the second latent variable. Add two observed variables
associated with this latent variable. Use Move Object to relocate the
objects as desired. If an arrow will not locate as you wish it, delete it
(X icon) and then redraw it (arrow icon).
Click File, Save As and save your model before too
much time passes; that way, if AMOS decides to nuke your
model then you can get it back from the saved .amw file. I try to
remember to save my model frequently while I am working on
it.
Click the Draw Observed Variables icon and locate an
observed variable near the second (right) latent variable.
Place an arrow going from it to the second latent variable.
Click the List Variables in Data Set icon and then drag
each variable name to the appropriate rectangle. You will find
that the rectangles are not large enough to hold the variable
names. Use the Change the Shape of Objects tool to enlarge
the rectangles.
Right click the error circle that goes to numyrs. Select Object Properties. Enter
e1 as the variable name. In the same way name the other three error circles.
Use Object Properties to name the first latent variable (that on the left) LoveSki.
Click the Parameters tab and set the variance to 1. Name the second latent variable
SkiSat, but do not fix its variance. Draw an arrow from LoveSki to SkiSat.
Click the Add a Unique Variable to an Existing Variable icon and then click the
SkiSat ellipse. Move and resize the error circle and name it d2.
Compare your diagram with that in Tabachnick and Fidell. Notice that AMOS
has fixed the coefficient from LoveSki to numyrs at 1. That is not necessary, as we
fixed the variance of LoveSki to 1. Right click that arrow and select Object Properties.
Delete the 1 under Regression Weight.
Click the Analysis Properties icon. Under the Output tab select the stats you
want, as indicated below.
To start the analysis, just click the Calculate Estimates icon.
Click Proceed with the analysis.
Click the View the output path diagram to see the path diagram with values of
the estimated parameters placed on the arrows. Notice that you can select
unstandardized or standardized estimates.
The standardized regression coefficients are printed beside each path. Beside
each dependent variable is printed the r² relating that variable to the latent variable(s).
Click the View Text icon to see much more extensive output from the analysis.
Under Notes for model: Result you see that the null that the model does fit the data
well is not rejected, χ²(4) = 8.814, p = .066.
Under Estimates you see both unstandardized and standardized regression
weights. Many of the elements of the output are hyperlinks. For example, if you click
on the .399 estimate for the standardized regression weight for SkiSat <--- senseek, you
get the message "When senseek goes up by one standard deviation, SkiSat goes up by
.399 standard deviations."
The p values in the regression weights table are for tests of the null that the
regression coefficient is zero. Those in the variances tables are for tests of the null that
the variance is zero.
In the squared multiple correlations table the .328 for SkiSat indicates that 32.8%
of the variance in that latent variable is explained by its predictors (LoveSki and
SenSeek).
Look at the standardized residual covariances table. The elements in this table
represent differences between the sample variance/covariance table and the estimated
population variance/covariance table. The residuals for two covariances are
distressingly large: SenSeek-numyrs and SenSeek-dayski. We might want to
modify the model to reduce these residuals.
Total, direct, and indirect effects have the same meaning they had in path
analysis. For example, consider the standardized direct effect of LoveSki on FoodSat:
it is zero, as there is no path connecting those two variables. The standardized indirect
effect of LoveSki on FoodSat is the product of the standardized path coefficients leading
from LoveSki to FoodSat, that is, .411(.601) = .247. Of course, the total effects equal
the sum of the direct and indirect effects.
Under Modification Indices we see that the LM test indicates that allowing
LoveSki and SenSeek to covary would reduce the goodness of fit Chi-square by about
5.57. Since this involves only one parameter, this Chi-square could be evaluated on
one degree of freedom. It is significant. That is, adding this one parameter to the
model should significantly increase the fit of the model to the data.
Under Model Fit you see values of the many fit statistics. The Comparative Fit
Index (CFI) is supposed to be good with small samples, and we certainly have a small
sample here. Its value is .919. Values greater than .95 indicate good fit. The Root
Mean Square Error of Approximation (RMSEA) is .110. Values less than .06 indicate
good fit, and values greater than .10 indicate poor fit.
Modification of the Model
Our model does not fit the data very well.
Let us try adding the parameter recommended by the LM,
the path from SenSeek to LoveSki. Edit the diagram to look like
that below. Notice that LoveSki is now a latent dependent
variable. Also notice the following changes:
LoveSki no longer has its variance fixed to 1: AMOS
warned me not to constrain its variance to 1 if I wanted to
draw a path to it from SenSeek. Accordingly, I fixed the
regression coefficient from LoveSki to NumYrs at 1, giving
LoveSki the same variance as NumYrs. I had noticed earlier that
LoveSki and NumYrs were very well correlated.
I added a disturbance for LoveSki, as it is now a latent dependent variable.
After making the indicated changes in the model, click the Calculate Estimates
icon and then view the output path diagram with standardized estimates.
[Path diagram for the modified model: SenSeek now has a path to LoveSki as well as to
SkiSat; LoveSki has indicators numyrs and dayski, SkiSat has indicators snowsat and
foodsat; errors e1-e4 and disturbances d1 and d2 are attached.]
Click the View Text icon and look at the results. The goodness of fit Chi-square
is now only 2.053 on 3 degrees of freedom. Previously it was 8.814 on 4 degrees of
freedom. The change of fit Chi-square is 8.814 2.053 = 6.761 on (4 3) = 1 degrees
of freedom. Adding the path from SenSeek to LoveSki significantly increased the fit of
the model with the data.
Notice that the standardized residual matrix no longer has any very large
elements. Among the fit indices, the CFI has increased from .919 to 1.000 and the
RMSEA has decreased from .110 to 0.000, both indicating better fit.
Return to Wuensch's Stats Lessons Page
An Introduction to Structural Equation Modeling (SEM)
SEM with SAS Proc Calis
Karl L. Wuensch
Dept. of Psychology, East Carolina University, Greenville, NC USA
October, 2008
Tender-Heartedness and Attitude About Animal Research
First Things First
Go here and download to your computer (not open in your browser) the Power
Point slide for creating the diagram.
Download the data using the SEM Tenderheartedness link on my SPSS Data
Page. The data are in the form of a variance/covariance matrix. If you are going
to use SAS Calis instead of Amos, download ARC-VarCov.dat from my StatData
Page.
Description of the Data
Four of the items on Forsyth's Ethics Position Questionnaire seem to measure
tender-heartedness. The items are:
Risks to another should never be tolerated, irrespective of how small the risks
might be.
It is never necessary to sacrifice the welfare of others.
If an action could harm an innocent other, then it should not be done.
The existence of potential harm to others is always wrong, irrespective of the
benefits to be gained.
These items were included in the idealism subscale employed in the research
reported in:
Wuensch, K. L., & Poteat, G. M. (1998). Evaluating the morality of animal
research: Effects of ethical ideology, gender, and purpose. Journal of Social
Behavior and Personality, 13, 139-150.
The subjects were also asked to answer a question about how justified a
particular case of animal research was (higher scores = more justified) and asked
whether or not that research should be allowed to continue (0 = no, 1 = yes).
Your Assignment
Your assignment is to use the data gathered in this research to test the SEM
model diagrammed below. It includes two latent variables: Tender Heartedness (high
scores = high tender heartedness) and Animal Research (high scores = favor animal
research). Obtain standardized estimates, squared multiple correlations, and estimates
of variance.
In a Word document, enter your answers to each of the following questions.
Immediately following each answer, paste in the relevant part of the text output from
Amos or SAS.
1. Is the Chi-square test (of the null that the fit is as good as that of the saturated
model -- perfect) significant?
2. Do any of the regression weights fall short of statistical significance?
3. Which three paths have the highest absolute standardized regression weights?
4. Interpret the association between Tender Heartedness and Animal Research and
specify the proportion of the variance in Animal Research that is explained by
Tender Heartedness.
5. Which of the indicators for Tender Heartedness has the highest reliability?
6. What is the relationship between the estimated reliabilities and the standardized
weights of the paths leading to the indicators from the latent variables?
7. Does the GFI indicate that the model fits the data well?
8. Does the RMSEA indicate that the model fits the data well?
9. You know that the Chi-square test will be significant even when the fit is good if
you have a sufficiently large sample. How large would the sample need be here
for the Chi-square to be significant at the .05 level?
Rather than using the diagram created by Amos, I would like you to use a
diagram I drew in PowerPoint. Open it in PowerPoint. Click on each "vari" (in a text
box) and replace "vari" with the estimated variance of the error or disturbance. Click on
each ".rr" and enter the estimated reliability of the indicator variable. Click on each "sw"
and enter the estimated standardized weight. After you have finished entering the
parameter estimates and reliabilities, File, Save As, and select a graphics format (png,
jpg, or gif). Then open your Word doc and Insert, Picture the diagram into the Word
document below your answers to the nine questions posed above.
Print the Word document and bring it to me in class on Wednesday the 18th of
November, 2009.
Karl L. Wuensch, November, 2009
CFA-AMOS.doc
Confirmatory Factor Analysis With AMOS
Please read pages 732 through 749 in
Tabachnick, B. G., & Fidell, L. S. (2007). Using multivariate statistics (5th ed.).
Boston: Allyn & Bacon. ISBN-10: 0205459382. ISBN-13: 9780205459384
(Students should already have this text from the prerequisite PSYC 7431
course).
The data for this lesson are available at T & F's data site and also from my SPSS
data page, file CFA-Wisc.sav. Download the file and bring it into SPSS and pass
it to Amos. Alternatively, you can just boot Amos Graphics, click Select data
files, and then select CFA-Wisc.sav. Minor culling has already taken place, as
described in the textbook.
Draw two latent variables, one with six indicators and one with five
indicators, like this:
Click List variables in data set and then drag the names of the measured
variables into the rectangles, as shown below.
Using Object Properties, name the errors and factors, fix both factors to variance
1, remove the fixed path coefficients, and draw a two-headed arrow like this:
Click Analysis properties and select the desired output:
Click Calculate estimates. Click View the output path diagram and View
Text.
Look at the diagram with standardized estimates. Note that the solution is the
same as that shown in T & F Figure 14.10. The error variances for the measured
variables, shown on the left in Figure 14.10, are simply 1 minus the value of R²
shown in the Amos diagram. For example, for Info, 1 - .58 = .42.
Look at the text output. The null hypothesis of good fit is rejected, but this may
be simply from having too much power. The fit indices are OK. GFI (.931) exceeds .9,
CFI (.941) does not quite reach the .95 standard, and RMSEA (.06) is between good
(.05) and adequate (.08).
The Standardized Residual Covariances are large for Comp-Pictcomp and Digit-
Coding. The Modification Indices for Covariances suggest linkage between e2 (error
in Comp) and Performance IQ; perhaps we need a path from Performance IQ to Comp.
The Modification Indices for Regression Weights suggest linkage between Comp and
(Object and Pictcomp), both of which are connected with Performance IQ. Again, this
suggests a path from Performance IQ to Comp. Let us add that path and see what
happens.
Diagram for Model 1.
Diagram for Model 2.
Look at the text output for the second model. The model fit Chi-square has
dropped from 70.236 to 60.296, a drop of 9.94, which, on one df, is significant. Adding
that path from Performance IQ to Comp has significantly improved the fit of the model.
GFI has increased from .931 to .94, CFI from .941 to .960, and RMSEA has dropped
from .06 to .05.
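The significance of that drop can be verified by hand. Here is a quick check in Python (not part of the Amos workflow; it uses the closed-form tail probability for a chi-square on 1 df, P(X > x) = erfc(sqrt(x/2))):

```python
import math

# Chi-square difference test for the two nested CFA models.
# For a chi-square variate on 1 df, P(X > x) = erfc(sqrt(x / 2)).
chisq_model_1 = 70.236
chisq_model_2 = 60.296
drop = chisq_model_1 - chisq_model_2   # 9.94, on 1 df

p = math.erfc(math.sqrt(drop / 2))
print(round(drop, 2))  # 9.94
print(p < .05)         # True: the added path significantly improved fit
```

The same tail probability could be obtained from any chi-square CDF routine.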
Notice that the path from Performance IQ to Coding is not statistically significant.
Perhaps we should just drop that variable. Drop it and see what happens.
With Coding out of the model, the goodness of fit Chi-square is no longer
significant, χ²(33) = 45.018, p = .079. GFI has increased from .94 to .952, CFI from
.960 to .974, and RMSEA has dropped from .05 to .046.
Links
CFA Using AMOS Indiana Univ.
CFA Using SAS Calis
Wuensch's Stats Lessons
Karl L. Wuensch
Dept. of Psychology, East Carolina University, Greenville, NC 27858 USA
November, 2008
Drawing an SEM Diagram in Power Point
Open Power Point.
Select blank content layout.
Display the Drawing toolbar.
Click on the rectangle icon for a measured variable or the ellipse icon for a
latent variable. Then move the cursor to where you want to draw the shape. Hold down
the left mouse button and move the mouse to modify the shape of the object. Release
the mouse button when you are satisfied.
Don't like the color inside the object? Right click on the shape, select Format
AutoShape, and select a different color or no fill. In the Format AutoShape window
you can check Default for new objects.
You can resize the object by putting the cursor on one of the little circles on its
border; the cursor will change to a double-headed arrow. Hold down the left mouse
button and drag the border in the desired direction.
You can move the object by putting the cursor on a border of the shape (not on
one of those little circles); the cursor will change to a four-headed arrow. Hold down
the left mouse button and drag the object to the desired location.
You can rotate the object by putting the cursor on the little green circle; the
cursor will change to a circular arrow. Hold down the left mouse button and move the
mouse left or right.
Want to add some text inside the object? Click on the Text Box icon, move
the cursor to the interior of the shape, and draw the text box. You can now type in the
text box and format the text as you wish.
To draw an arrow from one object to another, click on the arrow icon,
place the cursor on the border of the one object, hold down the left mouse
button, and pull the arrowhead to the border of the other object.
You can draw curved lines as well, but I find it challenging. Click
AutoShapes, Lines, Curved. Move the cursor to one desired location, hold
down the left mouse button, and drag the line to the other desired location.
Double click to stop drawing. Now click Draw and select Edit Points. Grab
the straight line in its middle and pull it to curve it. Want to put an arrowhead
or heads on it? Select the line and then click the Arrow Style icon and select
the type of arrow you want.
To enter the values of estimated parameters, just put a text box
where you want it and enter the value inside the box. Text boxes can be
rotated too, so when entering a coefficient on a diagonal path you can rotate
the text box to match the slope of the path.
If you want to duplicate an object, put the cursor on its border (see the four-headed
arrow above), right click, and select Copy. Then right click and select Paste. Then move the
copy to the new location. This is very handy if, for example, you want several objects all
of the same size and shape.
You may also group objects together so that you can manipulate them
collectively as if they were a single object. First, hold down the left mouse button at one
edge of the area to be grouped and move the mouse to select all of the objects to be
grouped. When you are satisfied that you have correctly selected the objects, release
the mouse button and click on Draw, Group. Suppose you want to make a copy of the
grouped objects. Put the cursor on the border of one of the grouped objects, right click,
select copy. Right click, select paste. Grab the copied group of objects and move it to
the desired location.
When you are satisfied that your diagram is as good as it is going to get, select
File, Save As. Save it in a graphic format (png, jpg, gif). You can insert the saved
picture into a Word document.
Here is an example of a diagram created by my colleague John Finch. I need to
ask him how he drew those curved arrows. He may know a better way to do that than I
have described here, or he may just have much better drawing skills than do I.
Back to SEM Lessons
HLM-Intro.doc
An Introduction to Hierarchical Linear Modeling
The data are those described in the following article:
Singer, J. D. (1998). Using SAS proc mixed to fit multilevel models, hierarchical
models, and individual growth models. Journal of Educational and Behavioral
Statistics, 24, 323-355.
There are data for 7,185 students (Level 1) in 160 schools (Level 2). I shall use
MathAch as the Level 1 outcome variable.
Download the data using a link at my page at
http://core.ecu.edu/psyc/wuenschk/MV/HLM/HLM.htm . I shall use the .csv file myself;
I converted it to .xls before bringing it into SAS.
Model 1: Unconditional Means, Intercepts Only
After SAS has imported the data, submit this program which will estimate
parameters for a model that includes only the outcome variable and intercepts.
title 'Model 1: Unconditional Means Model, Intercepts Only';
options formdlim='-' pageno=min nodate;
proc mixed data = hsb12 covtest noclprint;
class School;
model MathAch = / solution;
random intercept / subject = School;
run;
Level 1 Equation. Yij = β0j + eij. That is, the score of the i-th case in the j-th
school is due to the intercept for the j-th school and error for the i-th case in the j-th
school. Although I usually use a for the intercept, here I use β0 for the intercept. τ00
will be an estimate of the variance in the school intercepts; the more the schools differ
in mean MathAch, the greater this variance should be.
Level 2 Equation. β0j = γ00 + u0j. That is, the school intercepts are due to the
average intercept across schools plus the effect (on the intercept) of being in school j.
Combined Equation. Substitute γ00 + u0j (from the Level 2 equation) for β0j in
the Level 1 equation and you get Yij = γ00 + u0j + eij.
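To make the combined equation concrete, here is a small simulation of the model in Python (illustrative only: the number of schools, the group size, the true variances, and the seed are my choices, not values from the HSB data):

```python
import random

random.seed(1)

gamma_00 = 12.6   # fixed effect: average intercept across schools
tau_00 = 8.6      # true variance of the school effects u0j
sigma2 = 39.1     # true variance of the student-level errors eij

# Y_ij = gamma_00 + u_0j + e_ij
scores = []
for j in range(200):                      # 200 simulated schools
    u_0j = random.gauss(0, tau_00 ** 0.5)
    for i in range(50):                   # 50 students per school
        e_ij = random.gauss(0, sigma2 ** 0.5)
        scores.append(gamma_00 + u_0j + e_ij)

grand_mean = sum(scores) / len(scores)
print(round(grand_mean, 1))  # should land close to gamma_00 = 12.6
```

Fitting such simulated data with PROC MIXED would recover estimates near the true τ00 and σ² used above.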
Fixed Effects. These are effects that are constant across schools. They are
specified in the model statement (see the SAS code above). Since no variable follows
the = sign in model MathAch = /, the only fixed parameter is the average intercept
across schools, which SAS automatically includes in the model. This effect is
symbolized with γ00 in the boxed equations. Remember that the outcome variable is
MathAch.
Random Effects. These are effects that vary across schools, u0j and eij. I shall
estimate their variance.
Look at the Output. Under Solution for Fixed Effects we find an estimate of
the average intercept across schools, 12.637. That it differs significantly from zero is of
no interest (unless zero is a meaningful point on the scale of the outcome variable).
Under Covariance Parameter Estimates, the Random Effects, you see that
the variance in intercepts across schools is estimated to be 8.6097 and it is significantly
greater than zero (this is a one-tailed test, since a variance cannot be less than 0 unless
you have been drinking too much). This tells us that the schools differ significantly in
intercepts (means). The error variance (differences among students within schools) is
estimated to be 39.1487, also significantly greater than zero.
Intraclass Correlation. We can use this coefficient to estimate the proportion of
the variance in MathAch that is due to differences among schools. To get this
coefficient we simply take the estimated variance due to schools and divide by the sum
of that same variance plus the error variance, that is, 8.6097 / (8.6097 + 39.1487) =
18%.
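That arithmetic is easy to check. This snippet (Python, outside the SAS workflow) reproduces the intraclass correlation from the two covariance parameter estimates:

```python
tau_00 = 8.6097    # estimated variance of intercepts across schools
sigma2 = 39.1487   # estimated error variance within schools

icc = tau_00 / (tau_00 + sigma2)
print(round(icc, 4))  # 0.1803 -- about 18% of the variance lies between schools
```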
Model 2: Including a Level 2 Predictor in the Model
I shall use MeanSES (the mean SES by school) as a predictor of MathAch.
MeanSES has been centered/transformed to have a mean of zero (by subtracting the
grand mean from each score).
Level 1 Equation. Same as before.
Level 2 Equation. β0j = γ00 + γ01(MeanSESj) + u0j. That is, the school
intercepts/means are due to the average intercept across schools, the effect of being in
a school with the MeanSES of school j, and the effect of everything else (error,
extraneous variables) on which school j differs from the other schools.
Combined Equation. Substituting the right hand side of the Level 2 equation
into the Level 1 equation, we get Yij = [γ00 + γ01(MeanSESj)] + [u0j + eij]. The
parameters within the brackets on the left are fixed, those on the right are random.
SAS Code. Add this code to your program and submit it (highlight just this code
before you click the running person).
title 'Model 2: Including Effects of School (Level 2) Predictors';
title2 '-- predicting MathAch from MeanSES'; run;
proc mixed covtest noclprint;
class school;
model MathAch = MeanSES / solution ddfm = bw;
random intercept / subject = school;
run;
Notice the addition of ddfm = bw. This results in SAS using the
between/within method of computing denominator df for tests of fixed effects. Why do
this? Because Singer says so.
Look at the Output, Fixed Effects. Under Solution for Fixed Effects, we see
that the equation for predicting MathAch is 12.6495 + 5.8635(School MeanSES - Grand
MeanSES); remember that MeanSES is centered about zero. That is, for each
one point increase in a school's MeanSES, MathAch rises by 5.9 points. For a school
with average MeanSES, the predicted MathAch would be the intercept, 12.6.
Grab your calculator and divide the estimated slope for MeanSES by its standard
error, retaining all decimal points. Square the resulting value of t. You will get the value
of F reported under Type 3 Tests of Fixed Effects.
Notice that the df for the fixed effect of MeanSES = 158 (number of schools
minus 2). Without the ddfm = bw the df would have been 7025. The t distribution is
not much different with 7025 df than with 158 df, so this really would not have
mattered much.
Look at the Output, Random Effects. The value of the covariance parameter
estimate for the (error) variance within schools has changed little, but that for the
difference in intercepts/means across schools has decreased dramatically, from 8.6097
to 2.6357, a reduction of 5.974. That is, MeanSES explains 5.974/8.6097 = 69% of the
differences among schools in MathAch.
Even after accounting for variance explained by MeanSES, the MathAch scores
differ significantly across schools (z = 6.53). Our estimate of this residual variance is
2.6357. Add to that our estimate of (error) variance among students within schools
(39.1578) and we have 41.7935 units of variance not yet explained. Of that not yet
explained variance, 2.6357/41.7935 = 6.3% remains available to be explained by some
other (not yet introduced into the model) Level 2 predictor. Clearly most of the variance
not yet explained is within schools, that is, at Level 1, so let's introduce a Level 1
predictor in our next model.
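The two proportions in this paragraph can be recomputed directly from the covariance parameter estimates (a quick Python check, not part of the SAS output):

```python
tau_00_model_1 = 8.6097   # intercept variance, unconditional means model
tau_00_model_2 = 2.6357   # intercept variance after adding MeanSES
sigma2_model_2 = 39.1578  # within-school error variance, Model 2

# Proportion of the school-to-school differences explained by MeanSES.
explained = (tau_00_model_1 - tau_00_model_2) / tau_00_model_1
print(round(explained, 2))  # 0.69

# Share of the still-unexplained variance that sits at Level 2.
remaining = tau_00_model_2 / (tau_00_model_2 + sigma2_model_2)
print(round(remaining, 3))  # 0.063
```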
Model 3: Including a Level 1 Predictor in the Model
Suppose that instead of entering MeanSES into the model I entered SES, the
socio-economic-status of individual students.
Level 1 Equation. Yij = β0j + β1j(SESij) + eij. That is, a student's score is due to
the intercept/mean for his/her school, the within-school effect of SES (these
slopes may differ across schools), and error. To facilitate interpretation, I shall
subtract from each student's SES score the mean SES score for the school in
which that student is enrolled. Thus, Yij = β0j + β1j(SESij - MSESj) + eij. In the
SAS code this centered SES is represented by the variable cSES.
Level 2 Equations. Each random effect (excepting error within schools) will
require a separate Level 2 equation. Here I need one for the random intercept and one
for the random slope.
For the random intercept, β0j = γ00 + u0j. That is, the intercept for school j is the
sum of the grand intercept across schools and the effect (on intercept) of being in
school j.
For the random slope, β1j = γ10 + u1j. That is, the slope for predicting MathAch
from SES is, in school j, the grand slope (across all groups) plus the effect (on slope) of
being in school j.
Combined Equation. Substituting the right hand expressions in the Level 2
equations for the corresponding elements in the Level 1 equation yields
Yij = [γ00 + γ10(SESij - MSESj)] + [u0j + u1j(SESij - MSESj) + eij]. The fixed effects are
within the brackets on the left, the random effects to the right.
SAS Code. Add this code to your program and submit it.
title 'Model 3: Including Effects of Student-Level Predictors';
title2 '--predicting MathAch from cSES';
data hsbc; set hsb12; cSES = SES - MeanSES;
run;
proc mixed data = hsbc noclprint covtest noitprint;
class School;
model MathAch = cSES / solution ddfm = bw notest;
random intercept cSES / subject = School type = un;
run;
Note the computation of cSES, student SES centered about the mean SES for
the student's school. Just as noclprint suppresses the printing of class information,
noitprint suppresses printing of information about the iterations. Type = un indicates
you are imposing no structure, allowing all parameters to be determined by the data.
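The group-mean centering done by the cSES = SES - MeanSES data step can be sketched in Python (the school labels and SES values below are invented for illustration):

```python
# Toy data: (school, SES) pairs for two hypothetical schools.
records = [("A", 0.2), ("A", 0.6), ("A", 1.0), ("B", -0.5), ("B", 0.1)]

# Compute each school's mean SES (the MeanSES variable in the HSB data).
by_school = {}
for school, ses in records:
    by_school.setdefault(school, []).append(ses)
mean_ses = {s: sum(v) / len(v) for s, v in by_school.items()}

# cSES: each student's SES centered about his or her own school's mean.
cses = [(school, round(ses - mean_ses[school], 2)) for school, ses in records]
print(cses)  # [('A', -0.4), ('A', 0.0), ('A', 0.4), ('B', -0.3), ('B', 0.3)]
```

Centered this way, cSES sums to zero within every school, which is what makes the school intercept interpretable as the school mean.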
Look at the Output, Fixed Effects. Under Solution for Fixed Effects, see that
the estimated MathAch for a student whose SES is average for his or her school is
12.6493. The average slope, across schools, for predicting MathAch from SES is
2.1932, which is significantly different from zero.
Look at the Output, Random Effects. Under Covariance Parameter
Estimates we see that the UN(1,1) estimate is 8.6769 and is significantly greater than
zero. This is an estimate of the variance (across schools) for the first parameter, the
intercept. That it is significantly greater than zero tells us that there remains variance,
across schools, in MathAch, even after controlling for cSES.
The UN(2,1) estimates the covariance between the first parameter and the
second, that is, between the school intercepts and school slopes. This (with a two-
tailed test) falls well short of significance. There is no evidence that the slope for
predicting MathAch from cSES depends on a school's average value of MathAch.
The UN(2,2) estimates the variance in the second parameter, cSES. The
estimated variance, .694, is significantly greater than zero. In other words, the slope for
predicting MathAch from cSES differs across schools.
The unconditional means model (the first model) estimated the within-schools
variance in MathAch to be 39.1487. Our most recent model shows that within-schools
variance is 36.7006 after taking out the effect of cSES. Accordingly, cSES accounted
for 39.1487 - 36.7006 = 2.4481 units of variance, or 2.4481/39.1487 = 6.25% of the
within-school variance.
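A quick check of that variance-explained arithmetic (Python, outside the SAS workflow):

```python
sigma2_model_1 = 39.1487  # within-school variance, unconditional means model
sigma2_model_3 = 36.7006  # within-school variance after adding cSES

units = sigma2_model_1 - sigma2_model_3
proportion = units / sigma2_model_1
print(round(units, 4))       # 2.4481 units of variance accounted for by cSES
print(round(proportion, 4))  # 0.0625, i.e., 6.25% of the within-school variance
```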
Model 4: A Model with Predictors at Both Levels and All Interactions
Here I add to the model the variable sector, where 0 = public school and 1 =
Catholic school. Notice in the SAS code that the model also includes interactions
among predictors. More on this later.
SAS Code.
title 'Model 4: Model with Predictors From Both Levels and Interactions';
proc mixed data = hsbc noclprint covtest noitprint;
class School;
model mathach = MeanSES sector cSES MeanSES*Sector
MeanSES*cSES Sector*cSES MeanSES*Sector*cSES
/ solution ddfm = bw notest;
random intercept cSES / subject = School type = un;
run;
Look at the Output, Fixed Effects. MeanSES x Sector and MeanSES x Sector
x cSES are not significant. Without further comment I shall drop these from the model.
Model 5: A Model with Predictors at Both Levels and Selected Interactions
I provide more comment on this model.
Level 1 Equation. Yij = β0j + β1j(cSESij) + eij.
Level 2 Equations.
For the random intercept, β0j = γ00 + γ01(MeanSESj) + γ02(Sectorj) + u0j. That is, the
intercept/mean for a school's MathAch is due to the grand mean, the linear effect of
MeanSES, the effect of Sector, and the effect of being in school j.
For the random slope, β1j = γ10 + γ11(MeanSESj) + γ12(Sectorj) + u1j. That is, the
slope for predicting MathAch from SES, in school j, is affected by the grand slope
(across all schools), the effect of being in a school with the MeanSES that school j has,
the effect of being in a Catholic school, and the effect of everything else on which the
schools differ.
Combined Equation.
Yij = [γ00 + γ01(MeanSESj) + γ02(Sectorj) + γ10(cSESij) + γ11(MeanSESj)(cSESij) +
γ12(Sectorj)(cSESij)] + [u0j + u1j(cSESij) + eij]. Aren't you glad you remember that
algebra you learned in ninth grade?
SAS Code. Add this code to your program and submit it.
title 'Model 5: Model with Two Interactions Deleted';
title2 '--predicting mathach from meanses, sector, cses and ';
title3 'cross level interaction of meanses and sector with cses'; run;
proc mixed data = hsbc noclprint covtest noitprint;
class School;
model MathAch = MeanSES Sector cSES MeanSES*cSES Sector*cSES
/ solution ddfm = bw notest;
random intercept cSES / subject = School type = un;
proc means mean q1 q3 min max skewness kurtosis; var MeanSES Sector cSES;
run;
Look at the Output, Fixed Effects. All of the fixed effects are significant. Sector
is new to this model. The main effect of sector tells us that a one point increase in
sector is associated with a 1.2 point increase in MathAch. Since public schools were
coded with 0 and Catholic schools with 1, this means that higher MathAch is
associated with the school's being Catholic. Keep in mind that this is above and
beyond other effects in the model. Also new to this model are the interactions with
cSES. The MeanSES x cSES interaction indicates that the slopes for predicting
MathAch from cSES differ across levels of MeanSES. The Sector x cSES interaction
indicates that the slopes for predicting MathAch from cSES differ between public and
Catholic schools. Note that Singer reported that she tested for a MeanSES x Sector
interaction and a MeanSES x cSES x Sector interaction but found them not to be
significant.
I created separate regression equations for the public and the Catholic schools
by substituting 0 and 1 for the values of sector. For the public schools, that yields
MathAch = 12.11 + 5.34(MeanSES) + 1.22(0) + 2.94(cSES) + 1.04(MeanSES)(cSES) -
1.64(cSES)(0). For the Catholic schools, that yields MathAch = 12.11 + 5.34(MeanSES)
+ 1.22(1) + 2.94(cSES) + 1.04(MeanSES)(cSES) - 1.64(cSES)(1). These simplify to:
Public: 12.11 + 5.34(MeanSES) + 2.94(cSES) + 1.04(MeanSES)(cSES)
Catholic: 13.33 + 5.34(MeanSES) + 1.30(cSES) + 1.04(MeanSES)(cSES)
As you can see, MathAch is significantly higher in the Catholic schools and the effect
of cSES on MathAch is significantly greater in the public schools.
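Substituting Sector = 0 versus 1 into the fixed-effects equation can be scripted; this Python sketch (not part of the lesson's SAS workflow) just automates the algebra shown above, using the coefficients reported in the text:

```python
def predicted_mathach(mean_ses, sector, cses):
    # Model 5 fixed effects, with Sector coded 0 = public, 1 = Catholic.
    return (12.11 + 5.34 * mean_ses + 1.22 * sector + 2.94 * cses
            + 1.04 * mean_ses * cses - 1.64 * sector * cses)

# Simplified intercepts and cSES slopes at MeanSES = 0:
public_intercept = predicted_mathach(0, 0, 0)
catholic_intercept = predicted_mathach(0, 1, 0)
public_slope = predicted_mathach(0, 0, 1) - public_intercept
catholic_slope = predicted_mathach(0, 1, 1) - catholic_intercept

print(round(public_intercept, 2), round(public_slope, 2))      # 12.11 2.94
print(round(catholic_intercept, 2), round(catholic_slope, 2))  # 13.33 1.3
```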
The MeanSES x cSES interaction can be illustrated by preparing a plot of the
relationship between MathAch and cSES at each of two or three levels of MeanSES;
for example, when MeanSES is at its first quartile, its second quartile, and its third quartile.
Italassi could be used to illustrate this interaction interactively, but it is hard to move that
slider in the published article.
At the mean for sector (.493), MathAch = 12.11 + 5.34(MeanSES) + 1.22(.493) +
2.94(cSES) + 1.04(MeanSES)(cSES) - 1.64(cSES)(.493) =
12.71 + 5.34(MeanSES) + 2.13(cSES) + 1.04(MeanSES)(cSES).
At Q1 for MeanSES (-.32), MathAch = 12.71 - 1.71 + 2.13(cSES) - 0.33(cSES) =
11.00 + 1.80(cSES).
At Q2 for MeanSES (.006), MathAch = 12.71 + .03 + 2.13(cSES) + .006(cSES) =
12.74 + 2.14(cSES).
At Q3 for MeanSES (.33), MathAch = 12.71 + 1.76 + 2.13(cSES) + 0.34(cSES) =
14.47 + 2.47(cSES).
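These conditional equations, and the predicted values tabled below, can be reproduced with a short loop. As in the text, the conditional intercept and slope are rounded to two decimals before predicting (Python sketch, not part of the original SAS workflow):

```python
def conditional_equation(mean_ses, sector=.493):
    # Conditional intercept and cSES slope at a given MeanSES,
    # evaluated at the mean of Sector (.493).
    intercept = 12.11 + 5.34 * mean_ses + 1.22 * sector
    slope = 2.94 + 1.04 * mean_ses - 1.64 * sector
    return round(intercept, 2), round(slope, 2)

predictions = {}
for label, q in [("Q1", -.32), ("Q2", .006), ("Q3", .33)]:
    a, b = conditional_equation(q)
    low, high = a + b * -3, a + b * 3          # MathAch at cSES = -3 and +3
    predictions[label] = (round(low, 2), round(high, 2), round(high - low, 2))
    print(label, predictions[label])
```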
For each of these conditional regressions I shall predict MathAch at two values
of cSES (-3 and +3) and then produce an overlay plot with the three lines.
Here is the table of predicted values:

MeanSES    cSES = -3    cSES = +3    Difference
Q1         5.60         16.40        10.80
Q2         6.32         19.16        12.84
Q3         7.06         21.88        14.82
Here is a plot of the relationship between cSES and MathAch at each of three
levels of MeanSES. Notice that the slope increases as MeanSES increases.
[Figure: plot of MathAch (0 to 25) against cSES (-3 to +3), with one line for each of
MeanSES = Q1, Q2, and Q3.]
Look at the Output, Random Effects. The estimate for the differences in
intercepts across schools, UN(1,1), remains significant, but now the estimate for
the differences across schools in slope (for predicting MathAch from cSES), UN(2,2), is
small and not significant, as is the estimate for the covariance between intercepts and
slopes, UN(2,1). Perhaps I should trim the model, removing those components.
Model 6: Trimmed
In this model I remove the random effect of cSES slopes (and thus also the
covariance between those slopes and the intercepts). Because there is only one
random effect, I no longer need to use type = un.
SAS Code.
title 'Model 6: Simpler Model Without cSES Slopes';
proc mixed data = hsbc noclprint covtest noitprint;
class School;
model MathAch = MeanSES Sector cSES MeanSES*cSES Sector*cSES / solution
ddfm = bw notest;
random intercept / subject = School;
run;
data pvalue;
df = 2; p = 1 - probchi(1.1, 2);
run;
proc print data = pvalue noobs; run;
Look at the Output. The model for the fixed effects is the same in this model as it was
in the previous model. Accordingly, we can compare the two models' fit with the data by
comparing their fit statistics.
Fit Statistics

                             Model 5    Model 6 (trimmed)
-2 Res Log Likelihood        46503.7    46504.8
AIC (smaller is better)      46511.7    46508.8
AICC (smaller is better)     46511.7    46508.8
BIC (smaller is better)      46524.0    46514.9
The fit is (slightly) better in the trimmed model by the AIC, AICC, and BIC.
Trimming the model has increased the -2 Res Log Likelihood statistic by only 1.1. We
can evaluate the significance of this change with a Chi-square on 2 df (one df for each
parameter trimmed, the slope variance and the Slope x Intercept covariance). As you
can see, deleting those two parameters has not significantly affected the fit of the
model to the data.
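That likelihood-ratio comparison can be verified by hand: for 2 df the chi-square tail probability has the closed form exp(-x/2), which is exactly what the 1 - probchi(1.1, 2) step in the SAS code computes. A quick Python check:

```python
import math

# Difference in -2 Res Log Likelihood between the full and trimmed models.
change = 46504.8 - 46503.7   # 1.1, on 2 df

# For a chi-square variate on 2 df, P(X > x) = exp(-x / 2).
p = math.exp(-change / 2)
print(round(p, 3))  # 0.577 -- far from significant, so the trimming is justified
```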
Return to Wuensch's Stats Lessons Page
Karl L. Wuensch, East Carolina Univ., Dept. of Psychology, Greenville, NC 27858, USA
December, 2008
Many thanks to Dr. Cecelia Valrie, who introduced me to this topic. That said,
any mistakes in this document are mine, not hers.