
Analysis of variance (ANOVA)

• The difference between two sample means can be studied through the
standard error of the difference of the means of the two samples or through
Student's t-test, but a difficulty arises when we wish to examine the
significance of the differences among more than two sample means at
once. Analysis of variance helps us test whether more than two
population means can be considered equal.
• Analysis of variance enables us to test for the significance of the
differences among more than two sample means.
• Using analysis of variance, we will be able to make inferences about
whether our samples are drawn from populations having the same mean.
• Sir R. A. Fisher originated the technique of analysis of variance.
• The analysis of variance is essentially a technique for testing
groups of data for homogeneity. It is a method of
analyzing the variance to which a response is subject into
components corresponding to the various sources of variation: there may
be variation between the samples, or there may be variation within the
sample items. Thus, the technique of analysis of variance consists in
splitting the variance, for analytical purposes, into its components.
Normally the variance (the total variance) is divided into two parts:
1. Variance between samples,
2. Variance within samples; such that
Total variance = Variance between samples + Variance within samples
Three steps in analysis of variance
Analysis of variance consists of three steps.
1. Determine a first estimate of the population variance from the variance
among (between) the sample means.
2. Determine a second estimate of the population variance from the
variance within the samples.
3. Compare the two estimates. If they are approximately equal in value,
accept the null hypothesis.
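These three steps can be sketched in Python. The numbers below are made-up illustrative data (three samples of three observations each), not taken from the text:

```python
# Sketch of the three ANOVA steps, using made-up illustrative data:
# k = 3 samples, each of size n = 3.
samples = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
k, n = len(samples), len(samples[0])

means = [sum(s) / n for s in samples]
grand_mean = sum(means) / k

# Step 1: first estimate of the population variance, from the variance
# among the sample means: n * (variance of the k sample means)
between = n * sum((m - grand_mean) ** 2 for m in means) / (k - 1)

# Step 2: second estimate, from the variance within the samples:
# the mean of the k sample variances
within = sum(
    sum((y - m) ** 2 for y in s) / (n - 1) for s, m in zip(samples, means)
) / k

# Step 3: compare the two estimates; if their ratio is close to 1,
# the null hypothesis of equal population means is plausible
F = between / within
print(between, within, F)  # → 3.0 1.0 3.0
```

With these made-up numbers the two estimates differ by a factor of three; whether such a ratio is significant is judged against the F-distribution, as discussed later in this chapter.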
Assumption

In order to use analysis of variance, we must assume that each of the
samples is drawn from a normal population and that each of these
populations has the same variance.
Suppose that the experimenter has available the results of k independent
random samples, each of size n, from k different populations, and wishes to test the
hypothesis that the means of these k populations are equal.

If we denote the jth observation in the ith sample by y_ij, the general scheme
for a one-way classification is as follows:

Sample        Observations                      Mean
Sample 1      y_11  y_12  …  y_1j  …  y_1n      ȳ_1
Sample 2      y_21  y_22  …  y_2j  …  y_2n      ȳ_2
  ⋮
Sample i      y_i1  y_i2  …  y_ij  …  y_in      ȳ_i
  ⋮
Sample k      y_k1  y_k2  …  y_kj  …  y_kn      ȳ_k
Grand mean                                      ȳ

where ȳ is the overall mean (grand mean) of all observations.
To test the hypothesis that the samples were obtained from k populations with
equal means, we make the assumption that we are dealing with normal
populations having equal variances.

If µ_i denotes the mean of the ith population and σ² denotes the common
variance of the k populations, we can express each observation y_ij as µ_i plus
the value of a random component. Thus, the model for the
observations is given by:

y_ij = µ_i + ε_ij    for i = 1, 2, …, k; j = 1, 2, …, n

where the ε_ij are independent, normally distributed random variables with zero
means and the common variance σ².

To permit the generalization of this model to more complicated kinds of
situations, it is customary to replace µ_i by µ + α_i, where µ is the mean of
the µ_i (that is, the grand mean) and α_i is the effect of the ith treatment, such that
Σ_{i=1}^{k} α_i = 0. Note that we have written the mean of the ith population as
µ_i = µ + α_i and imposed the condition Σ_{i=1}^{k} α_i = 0, so that the mean of the µ_i
equals the grand mean µ.

The null hypothesis we shall want to test is that the population means are all
equal, that is, µ_1 = µ_2 = … = µ_k, or equivalently

H₀: α_1 = α_2 = … = α_k = 0

Correspondingly, the alternative hypothesis is that
the treatment effects are not all zero, that is,

H₁: α_i ≠ 0 for at least one value of i

To test the null hypothesis that the k population means are all equal, we shall
compare two estimates of σ²: one based on the variation among (between) the
sample means, and one based on the variation within the samples.

The Variation Within The Samples

Since by assumption each sample comes from a population having the
variance σ², this variance can be estimated by any one of the sample
variances

s_i² = Σ_{j=1}^{n} (y_ij − ȳ_i)² / (n − 1)

and, hence, also by their mean

σ̂_W² = Σ_{i=1}^{k} s_i² / k = Σ_{i=1}^{k} Σ_{j=1}^{n} (y_ij − ȳ_i)² / [k(n − 1)]

Note that each of the sample variances s_i² is based on (n − 1) degrees of freedom.
Hence, σ̂_W² is based on k(n − 1) degrees of freedom.
The Variance Among (Between) The Sample Means

The variance of the k sample means is given by

s_ȳ² = Σ_{i=1}^{k} (ȳ_i − ȳ)² / (k − 1)

As we know, the standard error of the sample mean is σ_ȳ = σ/√n, so the
population variance is σ² = n·σ_ȳ², where σ_ȳ² is the variance among the
sample means. We do not know σ_ȳ², but we can calculate the
variance among the k sample means, s_ȳ², and substitute it for σ_ȳ².

Then we have the estimated population variance, that is, the variance between the
samples:

σ̂_B² = n·s_ȳ² = n·Σ_{i=1}^{k} (ȳ_i − ȳ)² / (k − 1)

and it is based on (k − 1) degrees of freedom.

If the null hypothesis is true, it can be shown that σ̂_W² and σ̂_B² are
independent estimates of σ², and it follows that

F = σ̂_B² / σ̂_W²

is a value of a random variable having the F-distribution with (k − 1) and k(n − 1)
degrees of freedom.

The null hypothesis will be rejected if F exceeds F_α with (k − 1) and k(n − 1)
degrees of freedom.

Furthermore, the sample variance of all nk observations is given by

s² = Σ_{i=1}^{k} Σ_{j=1}^{n} (y_ij − ȳ)² / (nk − 1)

Identity for one-way analysis of variance

Theorem:

Σ_{i=1}^{k} Σ_{j=1}^{n} (y_ij − ȳ)² = Σ_{i=1}^{k} Σ_{j=1}^{n} (y_ij − ȳ_i)² + n·Σ_{i=1}^{k} (ȳ_i − ȳ)²

It is customary to refer to the expression on the left-hand side of this
identity as the Total Sum of Squares (SST), to the first term on the
right-hand side as the Error Sum of Squares (SSE), and to the second term on the
right-hand side as the Treatment Sum of Squares, SS(Tr).

F = [SS(Tr)/(k − 1)] / [SSE/(k(n − 1))]
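The identity in the theorem above can be checked numerically. The sketch below uses made-up data (two samples of size three) and computes each term directly from its definition:

```python
# Numerical check of the identity SST = SSE + SS(Tr), with made-up data:
# k = 2 samples of size n = 3.
samples = [[3, 5, 7], [2, 4, 6]]
k, n = len(samples), len(samples[0])

means = [sum(s) / n for s in samples]
grand = sum(sum(s) for s in samples) / (k * n)

# Left-hand side: total sum of squares about the grand mean
sst = sum((y - grand) ** 2 for s in samples for y in s)

# First right-hand term: error sum of squares, within the samples
sse = sum((y - m) ** 2 for s, m in zip(samples, means) for y in s)

# Second right-hand term: treatment sum of squares, between the samples
sstr = n * sum((m - grand) ** 2 for m in means)

print(sst, sse + sstr)  # → 17.5 17.5 (the identity holds)
F = (sstr / (k - 1)) / (sse / (k * (n - 1)))
print(F)  # → 0.375
```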
To simplify the calculation of the various sums of squares, we usually use the
following computing formulas:

SST = Σ_{i=1}^{k} Σ_{j=1}^{n} y_ij² − C

SS(Tr) = Σ_{i=1}^{k} T_i²/n − C

where C, called the correction term, is given by C = T²/(kn).

In these formulas, T_i is the total of the n observations in the ith sample, whereas
T is the grand total of all kn observations. The Error Sum of Squares,
SSE, is then obtained by:

SSE = SST − SS(Tr)
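The computing formulas can be sketched as follows; the data are made-up illustrative numbers (k = 3 samples of size n = 3):

```python
# Shortcut computing formulas for the one-way ANOVA sums of squares,
# using made-up illustrative data: k = 3 samples of size n = 3.
samples = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
k, n = len(samples), len(samples[0])

totals = [sum(s) for s in samples]  # T_i, the sample totals
T = sum(totals)                     # T, the grand total of all kn observations
C = T ** 2 / (k * n)                # the correction term

sst = sum(y ** 2 for s in samples for y in s) - C   # SST
sstr = sum(t ** 2 for t in totals) / n - C          # SS(Tr)
sse = sst - sstr                                    # SSE = SST - SS(Tr)

print(sst, sstr, sse)  # → 12.0 6.0 6.0
```

Because SSE is obtained by subtraction, only SST and SS(Tr) need to be computed from the raw data.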
The results obtained in analyzing the total sum of squares into its components
are summarized by means of the following kind of analysis of variance
table:

Source of      Degrees of     Sum of      Mean Square                 F
Variation      Freedom        Squares
Treatments     k − 1          SS(Tr)      MS(Tr) = SS(Tr)/(k − 1)     MS(Tr)/MSE
Error          k(n − 1)       SSE         MSE = SSE/[k(n − 1)]
Total          nk − 1         SST

Example: A company wants to compare the cleansing action of three
detergents on the basis of the following whiteness readings made on 15
swatches of white cloth, which were first soiled with ink and then washed in
an agitator-type machine with the respective detergents:

Detergent 1 77 81 71 76 80
Detergent 2 72 58 74 66 70
Detergent 3 76 85 82 80 77
Test at the 0.01 level of significance whether the differences among the means
of the whiteness readings are significant.

Detergent 1        Detergent 2        Detergent 3
  y      y²          y      y²          y      y²
 77    5929         72    5184         76    5776
 81    6561         58    3364         85    7225
 71    5041         74    5476         82    6724
 76    5776         66    4356         80    6400
 80    6400         70    4900         77    5929
385   29707        340   23280        400   32054

The grand total is T = 385 + 340 + 400 = 1125, and the total of the squared
observations is 29707 + 23280 + 32054 = 85041. Hence C = 1125²/15 = 84375,
SST = 85041 − 84375 = 666, SS(Tr) = (385² + 340² + 400²)/5 − 84375 = 390,
and SSE = 666 − 390 = 276.

Source of      Degrees of          Sum of      Mean Square              F
Variation      Freedom             Squares
Treatments     3 − 1 = 2           390         MS(Tr) = 390/2 = 195     195/23 = 8.48
Error          3(5 − 1) = 12       276         MSE = 276/12 = 23
Total          5·3 − 1 = 14        666

The table F value at the 0.01 level for (2, 12) df is 6.93. As the calculated F value
is greater than the table value, i.e., 8.48 > 6.93, we reject the null hypothesis at
the 0.01 level of significance.
Hence, the three detergents are not equally effective.
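The arithmetic of this example can be reproduced with a short script; the readings are the fifteen whiteness values from the table above:

```python
# Reproduce the detergent example: k = 3 detergents, n = 5 readings each.
detergents = [
    [77, 81, 71, 76, 80],   # Detergent 1
    [72, 58, 74, 66, 70],   # Detergent 2
    [76, 85, 82, 80, 77],   # Detergent 3
]
k, n = len(detergents), len(detergents[0])

T = sum(sum(d) for d in detergents)                   # grand total, 1125
C = T ** 2 / (k * n)                                  # correction term, 84375
sst = sum(y ** 2 for d in detergents for y in d) - C  # SST = 666
sstr = sum(sum(d) ** 2 for d in detergents) / n - C   # SS(Tr) = 390
sse = sst - sstr                                      # SSE = 276

# F = MS(Tr) / MSE = (390/2) / (276/12) = 195/23
F = (sstr / (k - 1)) / (sse / (k * (n - 1)))
print(round(F, 2))  # → 8.48, exceeding the 0.01 table value of 6.93
```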

The F-Distribution

The F-distribution is skewed. Generally it is skewed to the right and tends to
become more symmetrical as the numbers of degrees of freedom in the
numerator and denominator increase. The F-distribution has a single mode. The
shape of the distribution depends on the number of degrees of freedom in both
the numerator and the denominator of the F-ratio. The first number is the number of
degrees of freedom in the numerator of the F-ratio; the second is the number of
degrees of freedom in the denominator.
Fig 11.8 Pg 597 Rubin

Degrees of Freedom
• As we have mentioned, each F-distribution has a pair of degrees of
freedom: one for the numerator of the F-ratio and the other for the
denominator.
• While calculating the variance between the sample means, we used the
sample means ȳ_i, one for each sample, to calculate s_ȳ². In the above
example, once we knew two of these sample means, the third was
automatically determined and could not be freely specified. Thus, one df
is lost when we calculate the variance between samples. Hence, the
number of degrees of freedom for the numerator of the F-ratio is always
one fewer than the number of samples.

Number of degrees of freedom in the numerator of the F-ratio = (number of samples − 1) = (k − 1)

• For the denominator, we calculated the variance within the
samples, and we used all three samples. For each sample, we used its
observations to calculate the variance for that sample. Once we knew
the sample mean and all but one of the deviations from it, the last was
automatically determined and could not be freely specified. Thus, we
lost 1 df in the calculations for each sample. In the above example
(three samples of sizes 5, 5, and 6), we lost 1 df for each sample,
leaving us with 4, 4, and 5 df in the samples. Because we had three
samples, we were left with 4 + 4 + 5 = 13 df, which could also be
calculated as 5 + 5 + 6 − 3 = 13. Thus,

Number of degrees of freedom in the denominator of the F-ratio = (N − k)

where N is the total number of observations in all samples and k is the
number of samples.

The F-Table
For analysis of variance, we shall use an F-table in which the columns
represent the number of degrees of freedom for the numerator and the rows
represent the degrees of freedom for the denominator. Suppose we are testing
a hypothesis at the 0.05 level of significance, using the F-distribution, with 2
degrees of freedom for the numerator and 13 for the denominator. The value
we find in the F-table is 3.81 (first look in the column, then in the row).
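When a printed table is not at hand, the critical value can be approximated numerically. The sketch below is an illustrative approach, not a standard library call: it integrates the F density with Simpson's rule (assuming the numerator df is at least 2, so the density is finite at zero) and inverts the CDF by bisection:

```python
from math import gamma

def f_pdf(x, d1, d2):
    # Density of the F-distribution with (d1, d2) degrees of freedom.
    c = gamma((d1 + d2) / 2) / (gamma(d1 / 2) * gamma(d2 / 2))
    return (c * (d1 / d2) ** (d1 / 2) * x ** (d1 / 2 - 1)
            * (1 + d1 * x / d2) ** (-(d1 + d2) / 2))

def f_cdf(x, d1, d2, steps=2000):
    # Simpson's-rule integration of the density from 0 to x.
    # Assumes d1 >= 2, so the density is finite at x = 0.
    h = x / steps
    total = f_pdf(0.0, d1, d2) + f_pdf(x, d1, d2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f_pdf(i * h, d1, d2)
    return total * h / 3

def f_critical(alpha, d1, d2):
    # Bisection for the value F_alpha whose right-tail area is alpha.
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if f_cdf(mid, d1, d2) < 1 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(f_critical(0.05, 2, 13), 2))  # → 3.81, matching the table
```

In practice one would read the printed table or use a statistics library; the point here is only that F_α is the point with right-tail area α under the F density with the stated degrees of freedom.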

Critical Value of the F-distribution

Usually F-tables give the critical value of F for the right-tailed test; the right-
tail area determines the critical region. Thus, the significant value F_α at the
level of significance α and degrees of freedom (n₁, n₂), where n₁ is the number of
degrees of freedom in the numerator and n₂ the number of degrees of freedom in
the denominator, is read from the table, as shown in the figure.
Pg. 877 Gupta and Kapoor

If the calculated F-ratio is greater than the table value, we reject the null
hypothesis; otherwise we accept it.

Statement of Hypotheses
Null hypothesis: there is no significant difference between the population
means.
In the example above, suppose the director of training wants to test at the 0.05
level the hypothesis that there are no differences among the three training
methods.
We set up the null hypothesis as H₀: µ₁ = µ₂ = µ₃, against the alternative that
the three means are not all equal.

Analysis of variance Table

For k samples of sizes n_j, with N = n₁ + … + n_k total observations:

Source of    Sum of squares                         df       Mean square         Test Statistic
variation
Between      SSB = Σ_{j=1}^{k} n_j (ȳ_j − ȳ)²       k − 1    MSB = SSB/(k − 1)   F = MSB/MSW
Within       SSW = Σ_{j=1}^{k} Σ_i (y_ij − ȳ_j)²    N − k    MSW = SSW/(N − k)
Total        SST = SSB + SSW                         N − 1
