The analysis of variance technique is probably the most widely used of all statistical
techniques, and one which varies in complexity from a simple one-way classification of data to
compare the means in two groups (equivalent to the t-test on unpaired observations) to
complicated multi-way classifications involving sub-groups of unequal sizes. It has been defined
by R. A. Fisher as the separation of the variance ascribable to one group of causes from the
variance ascribable to the other groups. More simply, it can be described as the partitioning of
the total variance among observations into several components, each of which can be ascribed to
a cause. A few examples will clarify the concepts.
Suppose we have measurements of weight (X) in three groups of subjects numbering 10,
15 and 20, respectively. Let $\bar{X}_1$, $\bar{X}_2$ and $\bar{X}_3$ denote the three group means and
$\bar{X}$ the grand mean of all 45 observations. The total variation about the grand mean can
then be split as follows:

$$\sum (X - \bar{X})^2 = \left\{ \sum (X - \bar{X}_1)^2 + \sum (X - \bar{X}_2)^2 + \sum (X - \bar{X}_3)^2 \right\} + \left\{ 10(\bar{X}_1 - \bar{X})^2 + 15(\bar{X}_2 - \bar{X})^2 + 20(\bar{X}_3 - \bar{X})^2 \right\}$$

Each sum in the first pair of braces runs over the observations of the corresponding group.
In the above equation, the first expression in braces denotes the variability of the
observations in each group from the corresponding group mean (within groups) and the
second expression in braces denotes the variation between the three group means
(between groups). From first principles (see notes on standard deviation), the former has
(10-1) + (15-1) + (20-1) = 42 degrees of freedom, and the latter has (3-1) = 2 degrees of freedom.
The ratio of the latter variance (between groups) to the former (within groups) follows an
F-distribution (under the null hypothesis that the three groups are samples from the same
population), and its significance can therefore be assessed by reference to the variance ratio
tables (see appendices to notes on t-test).
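As a sketch of how the variance ratio is formed for this layout, each sum of squares is first divided by its degrees of freedom. The symbols $S_B$ and $S_W$ (between-groups and within-groups sums of squares), the group index $g$ and the group sizes $n_g$ are notation introduced here for illustration and do not come from the original notes.

```latex
S_B = \sum_{g=1}^{3} n_g (\bar{X}_g - \bar{X})^2 ,
\qquad
S_W = \sum_{g=1}^{3} \sum_{i=1}^{n_g} (X_{gi} - \bar{X}_g)^2 ,
\qquad
(n_1, n_2, n_3) = (10, 15, 20)

F = \frac{S_B / 2}{S_W / 42}
```

Under the null hypothesis, this ratio is referred to the F-distribution with 2 and 42 degrees of freedom.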
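Numerical example
                A        B        C
                4        0        1
                5        2        0
                3        1        0
                3        2        1
                4        2        2
                2        2        2
             ------   ------   ------
Mean           3.5      1.5      1.0
             ------   ------   ------
Grand Mean     2.0
Total variation amongst 18 observations = $\sum (X - 2)^2$ = 34 .....................(1)
Variation within treatments = $\sum (X - 3.5)^2 + \sum (X - 1.5)^2 + \sum (X - 1.0)^2$ = 5.5 + 3.5 + 4.0 = 13 .....................(2)
Variation between treatments = (1) - (2) = 34 - 13 = 21 .....................(3)
-----------------------------------------------------------------------------
Source of variation             Sum of squares     d.f.     Mean square
-----------------------------------------------------------------------------
Between treatments                    21             2        10.5
Within treatment (Residual)           13            15         0.867
-----------------------------------------------------------------------------
Total                                 34            17
-----------------------------------------------------------------------------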
Since P < 0.05 for Between Treatments, we conclude that the three treatments A, B and C
yield significantly different responses. If we have a specific interest in
comparing A with B, A with C and B with C, we take the square root of the Residual Mean
Square of 0.867, which is 0.93, as the estimate of the standard deviation and employ a t-test.
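The arithmetic of this example can be checked with a short Python sketch. It uses only the eighteen values from the table above; the function names (`one_way_anova`, `t_statistic`) are hypothetical helpers written for this illustration, and the t statistic is formed with the residual standard deviation, as the text suggests.

```python
from math import sqrt

# Responses under treatments A, B and C, copied from the table above.
groups = {
    "A": [4, 5, 3, 3, 4, 2],
    "B": [0, 2, 1, 2, 2, 2],
    "C": [1, 0, 0, 1, 2, 2],
}


def mean(values):
    return sum(values) / len(values)


def one_way_anova(groups):
    """Split the total sum of squares into between- and within-group parts."""
    all_values = [x for values in groups.values() for x in values]
    grand_mean = mean(all_values)
    total_ss = sum((x - grand_mean) ** 2 for x in all_values)
    within_ss = sum(
        sum((x - mean(values)) ** 2 for x in values)
        for values in groups.values()
    )
    between_ss = sum(
        len(values) * (mean(values) - grand_mean) ** 2
        for values in groups.values()
    )
    between_df = len(groups) - 1
    within_df = len(all_values) - len(groups)
    between_ms = between_ss / between_df
    within_ms = within_ss / within_df
    return {
        "total_ss": total_ss,
        "between_ss": between_ss,
        "within_ss": within_ss,
        "between_df": between_df,
        "within_df": within_df,
        "between_ms": between_ms,
        "within_ms": within_ms,
        "f_ratio": between_ms / within_ms,
    }


def t_statistic(a, b, residual_ms):
    """Compare two treatment means using the residual standard deviation."""
    standard_error = sqrt(residual_ms) * sqrt(1 / len(a) + 1 / len(b))
    return (mean(a) - mean(b)) / standard_error


result = one_way_anova(groups)
t_ab = t_statistic(groups["A"], groups["B"], result["within_ms"])

for name, value in result.items():
    print(f"{name}: {value:.3f}")
print(f"t (A vs B): {t_ab:.2f}")
```

The resulting t value would be referred to the t-distribution on the residual degrees of freedom (15 here), since the standard deviation is estimated from the residual mean square.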
$$(X - \bar{X}) = (\bar{X}_P - \bar{X}) + (\bar{X}_M - \bar{X}) + (X - \bar{X}_P - \bar{X}_M + \bar{X}) \qquad (1)$$
From this it follows that the total variation can be split up into three components:
i) Variation due to consistent differences between patients in their overall SGOT
activity (first term on right hand side of equation above; this has 49 degrees of
freedom)
ii) Variation due to duration of treatment with pyrazinamide (second term above; this
has 12 degrees of freedom)
iii) The influence of duration of treatment on SGOT activity may not be the same on
all the patients; it could vary from patient to patient (Interaction), and this,
together with errors of the test, is another component of the total variation (third
term above; this has 49 x 12 = 588 degrees of freedom).
A comparison of (i) with (iii) provides a test of the hypothesis that patients are homogeneous in
their SGOT activity, and a comparison of (ii) with (iii) provides a test of the hypothesis that the
duration of treatment with pyrazinamide has no effect on SGOT activity.
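The degrees of freedom quoted in (i) to (iii) can be reproduced with a small sketch. The layout of 50 patients and 13 time points is an assumption inferred from the stated 49 and 12 degrees of freedom, and `two_way_df` is a hypothetical helper name.

```python
def two_way_df(n_patients, n_time_points):
    """Degrees of freedom for a patients x time points layout, one value per cell."""
    patients_df = n_patients - 1
    time_points_df = n_time_points - 1
    # Interaction (plus error of the test) takes what remains of the total.
    interaction_df = patients_df * time_points_df
    total_df = n_patients * n_time_points - 1
    return patients_df, time_points_df, interaction_df, total_df


# Assumed layout: 50 patients, each measured at 13 time points.
print(two_way_df(50, 13))  # (49, 12, 588, 649)
```

The three components add up to the total, 49 + 12 + 588 = 649, which is one less than the number of observations.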
Numerical example
Clotting time (minutes) of plasma from eight subjects, treated by four methods
---------------------------------------------------------------------------------
                           Treatment
Subject        ---------------------------------
                  1        2        3        4        Total    Mean ($\bar{X}_S$)
---------------------------------------------------------------------------------
   1             8.4      9.4      9.8     12.2        39.8       9.95
   2            12.8     15.2     12.9     14.4        55.3      13.82
   3             9.6      9.1     11.2      9.8        39.7       9.92
   4             9.8      8.8      9.9     12.0        40.5      10.12
   5             8.4      8.2      8.5      8.5        33.6       8.40
   6             8.6      9.9      9.8     10.9        39.2       9.80
   7             8.9      9.0      9.2     10.4        37.5       9.38
   8             7.9      8.1      8.2     10.0        34.2       8.55
---------------------------------------------------------------------------------
Total           74.4     77.7     79.5     88.2       319.8
---------------------------------------------------------------------------------
Mean ($\bar{X}_T$)   9.30     9.71     9.94    11.02                   9.99
---------------------------------------------------------------------------------
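Total sum of squares = $\sum (X - 9.99)^2$ = (8.4 - 9.99)^2 + (12.8 - 9.99)^2 + ......... + (10.0 - 9.99)^2 = 106
Subjects sum of squares = $4 \sum (\bar{X}_S - 9.99)^2$ = 79, where $\bar{X}_S$ denotes the eight subject means
Treatments sum of squares = $8 \sum (\bar{X}_T - 9.99)^2$ = 13, where $\bar{X}_T$ denotes the four treatment means
Residual = 106 - 79 - 13 = 14
--------------------------------------------------------------------------------------------------------
Source of variation        Sum of squares     d.f.     Mean square
--------------------------------------------------------------------------------------------------------
Subjects                         79             7        11.29
Treatments                       13             3         4.33
Residual                         14            21         0.67
--------------------------------------------------------------------------------------------------------
Total                           106            31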
--------------------------------------------------------------------------------------------------------
Since P < 0.01 for both Subjects and Treatments, we conclude that there are consistent
differences between subjects in their average clotting time, and that the four treatments studied
have significantly different effects. If we now wish to compare any two treatments (e.g. 1 vs 4),
we take the square root of the residual mean square of 0.67, which is 0.82, as the estimate of the
standard deviation and undertake a t-test.
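The clotting-time analysis can be reproduced with the following Python sketch. It works from the unrounded sums of squares, so its mean squares differ slightly from the rounded figures quoted in the text (for example, a residual mean square of about 0.66 rather than 0.67). The function name `two_way_anova` is a hypothetical helper, not part of any library.

```python
from math import sqrt

# Clotting times (minutes): one row per subject, one column per treatment.
clotting = [
    [8.4, 9.4, 9.8, 12.2],
    [12.8, 15.2, 12.9, 14.4],
    [9.6, 9.1, 11.2, 9.8],
    [9.8, 8.8, 9.9, 12.0],
    [8.4, 8.2, 8.5, 8.5],
    [8.6, 9.9, 9.8, 10.9],
    [8.9, 9.0, 9.2, 10.4],
    [7.9, 8.1, 8.2, 10.0],
]


def two_way_anova(table):
    """Partition the total sum of squares into rows, columns and residual."""
    n_rows, n_cols = len(table), len(table[0])
    grand_mean = sum(sum(row) for row in table) / (n_rows * n_cols)
    row_means = [sum(row) / n_cols for row in table]
    col_means = [sum(row[j] for row in table) / n_rows for j in range(n_cols)]

    total_ss = sum((x - grand_mean) ** 2 for row in table for x in row)
    rows_ss = n_cols * sum((m - grand_mean) ** 2 for m in row_means)
    cols_ss = n_rows * sum((m - grand_mean) ** 2 for m in col_means)
    residual_ss = total_ss - rows_ss - cols_ss

    rows_df, cols_df = n_rows - 1, n_cols - 1
    residual_df = rows_df * cols_df
    residual_ms = residual_ss / residual_df
    return {
        "total_ss": total_ss,
        "subjects_ss": rows_ss,
        "treatments_ss": cols_ss,
        "residual_ss": residual_ss,
        "residual_df": residual_df,
        "residual_ms": residual_ms,
        "f_subjects": (rows_ss / rows_df) / residual_ms,
        "f_treatments": (cols_ss / cols_df) / residual_ms,
        "treatment_means": col_means,
    }


anova = two_way_anova(clotting)

# t-test for treatment 1 vs treatment 4, using the residual standard deviation.
n_subjects = len(clotting)
means = anova["treatment_means"]
standard_error = sqrt(anova["residual_ms"]) * sqrt(2 / n_subjects)
t_1_vs_4 = (means[3] - means[0]) / standard_error

for name in ("total_ss", "subjects_ss", "treatments_ss", "residual_ss",
             "residual_ms", "f_subjects", "f_treatments"):
    print(f"{name}: {anova[name]:.2f}")
print(f"t (treatment 4 vs 1): {t_1_vs_4:.2f}")
```

Because every subject receives every treatment, the standard error of a difference between two treatment means is based on eight observations per mean, and the t value is referred to the residual degrees of freedom (21 here).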
The splitting up of the total variation into various components is purely a numerical
exercise and can be done under all circumstances. However, for the significance tests to be valid
and to permit proper interpretation, certain conditions must be satisfied. To appreciate what
these are, it is necessary to understand the underlying mathematical model. For the completely
randomized design (CRD), it is

$$Y_{ij} = \mu + T_i + e_{ij}$$

where
Yij is the outcome variable in the j-th patient on the i-th treatment,
μ is the general effect,
Ti is the effect of the i-th treatment, and
eij is a random element of error in the j-th patient on the i-th treatment.

For the randomized block design (RBD), it is

$$Y_{ij} = \mu + T_i + P_j + e_{ij}$$

where
Yij is the outcome variable in the j-th patient on the i-th treatment,
μ and Ti are as above,
Pj is the effect of the j-th patient, and
eij is a random element of error in the j-th patient on the i-th treatment.

The conditions to be satisfied are:
1) the variance of eij must be homogeneous, i.e. it must be random variation only,
unaffected by treatment (or patient),
2) the additive model above must be valid; for instance, in the case of the RBD, the
addition of the appropriate treatment effect and the appropriate patient effect to
the general effect and a random component must yield the outcome variable for
that patient,
3) the random component eij must follow a Normal distribution with an expectation
of zero (and a constant variance; see (1) above).
Minor departures from these assumptions do not have serious consequences for the
interpretation of the Analysis of Variance results. If there are gross violations,
a transformation of the data (e.g. taking logarithms of drug concentrations or
square roots of counts) can sometimes overcome the difficulty. A professional
statistician or any standard statistics textbook may be consulted for further
details.
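The additive model in condition (2) can be made concrete with the clotting-time data from the numerical example. The sketch below, written under the assumption that subjects play the role of patients, estimates the general effect, the subject effects and the treatment effects from the means, and treats what is left over as the random component. The variable names are illustrative only.

```python
# Clotting times (minutes): one row per subject, one column per treatment.
clotting = [
    [8.4, 9.4, 9.8, 12.2],
    [12.8, 15.2, 12.9, 14.4],
    [9.6, 9.1, 11.2, 9.8],
    [9.8, 8.8, 9.9, 12.0],
    [8.4, 8.2, 8.5, 8.5],
    [8.6, 9.9, 9.8, 10.9],
    [8.9, 9.0, 9.2, 10.4],
    [7.9, 8.1, 8.2, 10.0],
]

n_subjects = len(clotting)
n_treatments = len(clotting[0])

# General effect: the grand mean of all observations.
grand_mean = sum(sum(row) for row in clotting) / (n_subjects * n_treatments)

# Subject and treatment effects: deviations of the means from the grand mean.
subject_effects = [sum(row) / n_treatments - grand_mean for row in clotting]
treatment_effects = [
    sum(row[j] for row in clotting) / n_subjects - grand_mean
    for j in range(n_treatments)
]

# Residual = observation - (general + subject effect + treatment effect).
residuals = [
    [
        clotting[i][j]
        - (grand_mean + subject_effects[i] + treatment_effects[j])
        for j in range(n_treatments)
    ]
    for i in range(n_subjects)
]

residual_ss = sum(e ** 2 for row in residuals for e in row)
largest_residual = max(abs(e) for row in residuals for e in row)

print(f"residual sum of squares: {residual_ss:.2f}")
print(f"largest absolute residual: {largest_residual:.2f}")
```

The squared residuals add up to the residual sum of squares of the analysis of variance. Inspecting the residuals themselves (for example, looking for a spread that grows with the size of the observation) is one informal way to judge whether a transformation might be needed.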
*************