
One-Way Analysis of Variance Example

A manager wishes to determine whether the mean times required to complete a certain task differ for the three levels of employee training. He randomly selected 10 employees at each of the three levels of training (Beginner, Intermediate, and Advanced). Do the data provide sufficient evidence to indicate that the mean times required to complete the task differ for at least two of the three levels of training? The data are summarized in the table below.

Level of Training    Advanced   Intermediate   Beginner
n                    10         10             10
x̄ (mean time)        24.2       27.1           30.2
s²                   21.54      18.64          17.76

H0: The mean times required to complete the task do not differ for the three levels of training (μB = μI = μA). Ha: The mean times required to complete the task differ for at least two of the three levels of training. Assumptions: The samples were drawn independently and randomly from the three populations. The time required to complete the task is normally distributed for each of the three levels of training. The populations have equal variances.

Test Statistic: F = MST/MSE, with df = (2, 27).

RR: Reject H0 if F > F.05(2, 27) = 3.35, or equivalently if the p-value < .05.

Calculations:

SST = 10(24.2 - 27.167)² + 10(27.1 - 27.167)² + 10(30.2 - 27.167)² = 180.067, where 27.167 = (24.2 + 27.1 + 30.2)/3 is the grand mean.

SSE = 9(21.54) + 9(18.64) + 9(17.76) = 521.46

Source       df    SS        MS       F
Treatments    2    180.067   90.033   4.662
Error        27    521.46    19.313
Total        29    702.527

Decision: Reject H0, since F = 4.662 > 3.35. Conclusion: There is sufficient evidence to indicate that the mean times required to complete the task differ for at least two of the three levels of training.
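To verify these figures by machine, here is a minimal Python sketch (assuming scipy is available; the variable names are my own, not part of the original handout) that reproduces the F test from the summary statistics alone:

```python
# One-way ANOVA from summary statistics (n, mean, s^2 per group).
# A sketch of the hand calculation above.
from scipy import stats

ns = [10, 10, 10]                  # sample sizes
means = [24.2, 27.1, 30.2]         # group means
variances = [21.54, 18.64, 17.76]  # group sample variances

grand_mean = sum(n * m for n, m in zip(ns, means)) / sum(ns)

sst = sum(n * (m - grand_mean) ** 2 for n, m in zip(ns, means))  # 180.067
sse = sum((n - 1) * v for n, v in zip(ns, variances))            # 521.46

df_t = len(ns) - 1        # 2
df_e = sum(ns) - len(ns)  # 27

f_stat = (sst / df_t) / (sse / df_e)      # 4.662
p_value = stats.f.sf(f_stat, df_t, df_e)  # upper-tail area, ~0.018
print(f_stat, p_value)
```

Since the computed p-value falls below .05, the code agrees with the table-based rejection above.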

Which pairs of means differ? The Bonferroni Test is done for all possible pairs of means.

Decision rule: Reject H0 if the confidence interval for the difference in means does not contain 0. c = number of pairs = p(p-1)/2 = 3(2)/2 = 3

Each Bonferroni interval uses t at α/(2c) = .05/6 ≈ .0083 with 27 df: t.0083 = 2.554. (This value is not in the t table; it was obtained from a computer program.)

Since t.010 < t.0083 < t.005 (2.473 < 2.554 < 2.771), use t.005 when using a table: if you reject the null hypothesis with t.005 = 2.771, you will also reject it with t.0083 = 2.554.
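The quoted critical value is easy to reproduce; a short sketch assuming scipy (not part of the original handout):

```python
# Bonferroni critical value: upper-tail t at alpha/(2c) with the error df.
from scipy import stats

alpha, c, df = 0.05, 3, 27
print(stats.t.ppf(1 - alpha / (2 * c), df))  # ~2.55, the quoted 2.554 up to rounding
```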

There is sufficient evidence to indicate that the mean response time for the advanced level of training is less than the mean response time for the beginning level. There is not sufficient evidence to indicate that the mean response time for the intermediate level differs from the mean response time of either of the other two levels.

Simple Linear Regression Example

In this lesson, we apply regression analysis to some fictitious data, and we show how to interpret the results of our analysis. Note: Regression computations are usually handled by a software package or a graphing calculator. For this example, however, we will do the computations "manually", since the gory details have educational value.

Problem Statement

Last year, five randomly selected students took a math aptitude test before they began their statistics course. The Statistics Department has three questions: What linear regression equation best predicts statistics performance, based on math aptitude scores? If a student made an 80 on the aptitude test, what grade would we expect her to make in statistics? How well does the regression equation fit the data?

How to Find the Regression Equation

In the table below, the xi column shows scores on the aptitude test. Similarly, the yi column shows statistics grades. The last two rows show sums and mean scores that we will use to conduct the regression analysis.

Student   xi    yi    (xi - x̄)   (yi - ȳ)   (xi - x̄)²   (yi - ȳ)²   (xi - x̄)(yi - ȳ)
1         95    85      17          8          289          64           136
2         85    95       7         18           49         324           126
3         80    70       2         -7            4          49           -14
4         70    65      -8        -12           64         144            96
5         60    70     -18         -7          324          49           126
Sum      390   385                             730         630           470
Mean      78    77

The regression equation is a linear equation of the form: ŷ = b0 + b1x. To conduct a regression analysis, we need to solve for b0 and b1. Computations are shown below.

b1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)² = 470/730 = 0.644

b0 = ȳ - b1 x̄ = 77 - (0.644)(78) = 26.768

Therefore, the regression equation is: ŷ = 26.768 + 0.644x.
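As a cross-check on the hand computation, here is a short numpy sketch (variable names are mine) that recovers b1 and b0 both from the formulas above and from numpy's built-in fit:

```python
# Least-squares slope and intercept for the five-student example.
import numpy as np

x = np.array([95, 85, 80, 70, 60])  # aptitude scores
y = np.array([85, 95, 70, 65, 70])  # statistics grades

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(b1, b0)  # ~0.644 and ~26.77

slope, intercept = np.polyfit(x, y, 1)  # same answer from numpy's fit
```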

How to Use the Regression Equation

Once you have the regression equation, using it is a snap. Choose a value for the independent variable (x), perform the computation, and you have an estimated value (ŷ) for the dependent variable. In our example, the independent variable is the student's score on the aptitude test. The dependent variable is the student's statistics grade. If a student made an 80 on the aptitude test, the estimated statistics grade would be:

ŷ = 26.768 + 0.644x = 26.768 + 0.644(80) = 26.768 + 51.52 = 78.288

Warning: When you use a regression equation, do not use values for the independent variable that are outside the range of values used to create the equation. That is called extrapolation, and it can produce unreasonable estimates. In this example, the aptitude test scores used to create the regression equation ranged from 60 to 95. Therefore, only use values inside that range to estimate statistics grades. Using values outside that range (less than 60 or greater than 95) is problematic.

How to Find the Coefficient of Determination

Whenever you use a regression equation, you should ask how well the equation fits the data. One way to assess fit is to check the coefficient of determination, which can be computed from the following formula:

R² = { (1/N) Σ[(xi - x̄)(yi - ȳ)] / (σx σy) }²

where N is the number of observations used to fit the model, Σ is the summation symbol, xi is the x value for observation i, x̄ is the mean x value, yi is the y value for observation i, ȳ is the mean y value, σx is the standard deviation of x, and σy is the standard deviation of y. Computations for the sample problem of this lesson are shown below.

σx = sqrt[ Σ(xi - x̄)² / N ] = sqrt(730/5) = sqrt(146) = 12.083

σy = sqrt[ Σ(yi - ȳ)² / N ] = sqrt(630/5) = sqrt(126) = 11.225

R² = { (1/N) Σ[(xi - x̄)(yi - ȳ)] / (σx σy) }² = [ (1/5)(470) / (12.083 × 11.225) ]² = (94 / 135.632)² = (0.693)² = 0.48

A coefficient of determination equal to 0.48 indicates that about 48% of the variation in statistics grades (the dependent variable) can be explained by the relationship to math aptitude scores (the independent variable). This would be considered a good fit to the data, in the sense that it would substantially improve an educator's ability to predict student performance in statistics class.
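For completeness, the R² computation also takes only a few lines; a numpy sketch (names mine) that follows the formula above and cross-checks it against the squared correlation coefficient:

```python
# Coefficient of determination via the population-SD formula (divide by N).
import numpy as np

x = np.array([95, 85, 80, 70, 60])
y = np.array([85, 95, 70, 65, 70])
N = len(x)

sx = np.sqrt(np.sum((x - x.mean()) ** 2) / N)  # 12.083
sy = np.sqrt(np.sum((y - y.mean()) ** 2) / N)  # 11.225

r = (1 / N) * np.sum((x - x.mean()) * (y - y.mean())) / (sx * sy)
print(r ** 2)                        # ~0.48
print(np.corrcoef(x, y)[0, 1] ** 2)  # same value, via the correlation matrix
```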

Two-Way Analysis of Variance

We have examined the one-way ANOVA, but we have only considered one factor at a time. Remember, a factor is an independent variable (IV), so we have only been considering experiments in which one independent variable was being manipulated. We are now going to move up a level in complexity and consider two factors (i.e., two independent variables) simultaneously. These two IVs can both be between-groups factors, both be repeated-measures factors, or form a mixed design; the mixed design has one between-groups IV and one repeated-measures IV. Each IV can also be a true experimental manipulation or a quasi-experimental grouping (i.e., one in which there was no random assignment and only pre-existing groups are compared).

If a significant F-value is found for one IV, then this is referred to as a significant main effect. However, when two or more IVs are considered simultaneously, there is also always an interaction between the IVs, which may or may not be significant. An interaction may be defined as follows: there is an interaction between two factors if the effect of one factor depends on the levels of the second factor. When the two factors are identified as A and B, the interaction is identified as the A X B interaction.

Often the best way of interpreting and understanding an interaction is by a graph. A two-factor ANOVA with a nonsignificant interaction can be represented by two approximately parallel lines, whereas a significant interaction results in a graph with non-parallel lines. Because two lines will rarely be exactly parallel, the significance test on the interaction is also a test of whether the two lines diverge significantly from being parallel.

If only two IVs (A and B, say) are being tested in a factorial ANOVA, then there is only one interaction (A X B). If there are three IVs being tested (A, B and C, say), then this would be a three-way ANOVA, and there would be three two-way interactions (A X B, A X C, and B X C) and one three-way interaction (A X B X C). The complexity of the analysis increases markedly as the number of IVs increases beyond three. Only rarely will you come across factorial ANOVAs with more than four IVs.

A word on interpreting interactions and main effects in ANOVA. Many texts, including Ray (p. 198), stipulate that you should interpret the interaction first. If the interaction is not significant, you can then examine the main effects without needing to qualify them because of the interaction. If the interaction is significant, you cannot examine the main effects alone, because the main effects do not tell the complete story. Most statistics texts follow this line. But I will explain my pet grievance against it! It seems to me that it makes more sense to tell the simple story first and then the more complex story, ending the explanation at the level of complexity you wish to convey to the reader. In the two-way case, I prefer to examine each of the main effects first and then the interaction. If the interaction is not significant, the most complete story is told by the main effects. If the interaction is significant, then the most complete story is told by the interaction, and in a two-way ANOVA this is the story you would most use to describe the results (because a two-way interaction is not too difficult to understand).
One consequence of the difference between the two approaches: if, for example, you ran a four-way ANOVA and the four-way interaction (i.e., A X B X C X D) was significant, you would not be able to examine any of the lower-order interactions even if you wanted to! The most complex significant interaction would tell the most complete story, and so this is the one you would have to describe. Describing a four-way interaction is exceedingly difficult; it would most likely not represent the relationships you were intending to examine, and it would not hold the reader's attention for very long. With the other approach, you would describe the main effects first, then the first-order interactions (i.e., A X B, A X C, A X D, B X C, B X D, C X D), and then the higher-order interactions only if you were interested in them! You can stop at the level of complexity you wish to convey to the reader. Another exception to the rule of always describing the most complex relationship first is if you have a specific research question about the main effects. In your analysis and discussion you need to address the particular hypotheses you made about the research scenario, and if these are main effects, then so be it! However, not all texts appear to agree with this approach either! For the sake of your peace of mind, and assessment, in this unit there will be no examination questions or assignment marks riding on whether you interpret the interaction first or the main effects first. You do need to realise that in a two-way ANOVA, if there is a significant interaction, then this is the story most representative of the research results (i.e., it tells the most complete story and is not too complex to understand).
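To make the parallel-lines picture described above concrete, here is a small matplotlib sketch; the cell means are made-up illustrative values, not data from this unit:

```python
# Interaction plot: roughly parallel lines suggest no A x B interaction.
import matplotlib.pyplot as plt

levels_a = ["A1", "A2", "A3"]  # levels of factor A on the x-axis
means_b1 = [10, 14, 18]        # hypothetical cell means at B = level 1
means_b2 = [12, 16, 20]        # parallel: the effect of A is the same at B2

plt.plot(levels_a, means_b1, marker="o", label="B1")
plt.plot(levels_a, means_b2, marker="o", label="B2")
plt.ylabel("Mean response")
plt.legend()
plt.title("Parallel lines: no interaction")
plt.show()
```

Redrawing the plot with non-parallel (e.g. crossing) lines gives the visual signature of a significant interaction.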

One-Way Analysis of Variance (from Wikipedia, the free encyclopedia)

In statistics, one-way analysis of variance (abbreviated one-way ANOVA) is a technique used to compare means of two or more samples (using the F distribution). This technique can be used only for numerical data [1]. The ANOVA tests the null hypothesis that samples in two or more groups are drawn from the same population. To do this, two estimates are made of the population variance. These estimates rely on various assumptions (see below). The ANOVA produces an F-statistic, the ratio of the variance calculated among the means to the variance within the samples. If the group means are drawn from the same population, the variance between the group means should be lower than the variance of the samples, following the central limit theorem. A higher ratio therefore implies that the samples were drawn from different populations [2].

The degrees of freedom for the numerator is I - 1, where I is the number of groups (means), e.g. I levels of urea fertiliser application in a crop. The degrees of freedom for the denominator is N - I, where N is the total of all the sample sizes.

Typically, however, the one-way ANOVA is used to test for differences among at least three groups, since the two-group case can be covered by a t-test (Gosset, 1908). When there are only two means to compare, the t-test and the F-test are equivalent; the relation between ANOVA and t is given by F = t².

Assumptions

The results of a one-way ANOVA can be considered reliable as long as the following assumptions are met: the response variable is normally distributed (or approximately normally distributed); samples are independent; variances of populations are equal; responses for a given group are independent and identically distributed normal random variables (not a simple random sample (SRS)).
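The F = t² relation is easy to verify numerically; a small scipy sketch (the two samples are arbitrary illustrative numbers, not from any source):

```python
# Two groups: the one-way ANOVA F equals the square of the pooled t statistic.
from scipy import stats

a = [24, 26, 23, 25, 27]
b = [29, 31, 28, 30, 32]

t, p_t = stats.ttest_ind(a, b)  # pooled-variance two-sample t-test
f, p_f = stats.f_oneway(a, b)   # one-way ANOVA on the same two groups
print(t ** 2, f)  # identical values
print(p_t, p_f)   # identical p-values
```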

ANOVA is a relatively robust procedure with respect to violations of the normality assumption [3]. If the data are ordinal, a non-parametric alternative such as the Kruskal-Wallis one-way analysis of variance should be used.

Analysis of Variance (ANOVA) for Factorial Combinations of Treatments

Elsewhere on this site we have dealt with ANOVA for simple comparisons of treatments. We can also use ANOVA for combinations of treatments, where two factors (e.g. pH and temperature) are applied in every possible combination. These are called factorial designs, and we can analyse them even if we do not have replicates. This type of analysis is called TWO-WAY ANOVA.

Suppose that we have grown one bacterium in broth culture at 3 different pH levels and 4 different temperatures. We have 12 flasks in all, but no replicates. Growth was measured by optical density (O.D.). Construct a table as follows (O.D. is given in fictitious whole numbers here for convenience).

Temp °C   pH 5.5   pH 6.5   pH 7.5
25          10       19       40
30          15       25       45
35          20       30       55
40          15       22       40

Then calculate the following (see the worked example and the output from Microsoft "Excel").

(a) Σx, Σx², (Σx)²/n, and x̄ for each column in the table.
(b) Σx, Σx², (Σx)²/n, and x̄ for each row.

(c) Find the grand total by adding all Σx for columns (it should be the same for rows). Square this grand total and then divide by uv, where u is the number of data entries in each row, and v is the number of data entries in each column. Call this value D; in our example it is (336)²/12 = 9408.

(d) Find the sum of Σx² values for columns; call this A. It will be the same for Σx² of rows. In our example it is 11570.

(e) Find the sum of (Σx)²/n values for columns; call this B. In our example it is 11304.

(f) Find the sum of (Σx)²/n values for rows; call this C. In our example it is 9646.

(g) Set out a table of analysis of variance as follows:

Source of variance   Sum of squares   Degrees of freedom*      Mean square (= SS/df)
Between columns      B - D (1896)     u - 1 (= 2)              948
Between rows         C - D (238)      v - 1 (= 3)              79.3
Residual             *** (28)         (u - 1)(v - 1) (= 6)     4.67
Total                A - D (2162)     (uv) - 1 (= 11)          196.5

[* Where u is the number of data entries in each row, and v is the number of data entries in each column; note that the total df is always one fewer than the total number of entries in the table of data. *** Obtained by subtracting the between-columns and between-rows sums of squares from the total sum of squares.]

Now do a variance ratio test to obtain F values:

(1) For between columns (pH): F = between-columns mean square / residual mean square = 948 / 4.67 = 203.
(2) For between rows (temperature): F = between-rows mean square / residual mean square = 79.3 / 4.67 = 17.0.

In each case, consult a table of F (p = 0.05 or p = 0.01 or p = 0.001) where u is the between-treatments df (columns or rows, as appropriate) and v is the residual df. If the calculated F value exceeds the tabulated value then the treatment effect (temperature or pH) is significant. In our example, for the effect of pH (u is 2 degrees of freedom, v is 6 df) the critical F value at p = 0.05 is 5.14. In fact, we have a significant effect of pH at p = 0.001. For the effect of temperature (u is 3 degrees of freedom, v is 6 df) the critical F value at p = 0.05 is 4.76. We find that the effect of temperature is significant at p = 0.01.

Worked example:

             pH 5.5   pH 6.5   pH 7.5   Σx (rows)   n (= u)   x̄       Σx²     (Σx)²/n
25°C           10       19       40        69          3      23      2061     1587
30°C           15       25       45        85          3      28.33   2875     2408
35°C           20       30       55       105          3      35      4325     3675
40°C           15       22       40        77          3      25.67   2309     1976
Σx (columns)   60       96      180       336 (grand total)                    Total C = 9646
n (= v)         4        4        4
x̄              15       24       45
Σx²           950     2370     8250     Total A = 11570
(Σx)²/n       900     2304     8100     Total B = 11304
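The bookkeeping in steps (a)-(g) is mechanical, so it translates directly into a few lines of numpy. This sketch (variable names follow the letters used above; it is not from the original page) reproduces the Excel figures below exactly, without the intermediate rounding of the hand calculation:

```python
# Two-way ANOVA without replication, following steps (a)-(g) above.
import numpy as np

od = np.array([[10, 19, 40],   # 25 C
               [15, 25, 45],   # 30 C
               [20, 30, 55],   # 35 C
               [15, 22, 40]])  # 40 C
v, u = od.shape  # v = entries per column (4), u = entries per row (3)

D = od.sum() ** 2 / od.size          # (336)^2 / 12 = 9408
A = (od ** 2).sum()                  # 11570
B = (od.sum(axis=0) ** 2 / v).sum()  # columns: 11304
C = (od.sum(axis=1) ** 2 / u).sum()  # rows: ~9646.7 (rounded to 9646 above)

ss_cols, ss_rows, ss_total = B - D, C - D, A - D
ss_resid = ss_total - ss_cols - ss_rows

ms_cols = ss_cols / (u - 1)                # 948
ms_rows = ss_rows / (v - 1)                # ~79.56
ms_resid = ss_resid / ((u - 1) * (v - 1))  # ~4.56

print(ms_cols / ms_resid, ms_rows / ms_resid)  # F ~208.1 (pH), ~17.46 (temp)
```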

Below, we see a print-out of this analysis from "Excel". We select Anova: Two-Factor Without Replication from the analysis tools package. Note that the Anova table gives Source of Variation separately for Rows, Columns and Error (= Residual).

         pH 5.5   pH 6.5   pH 7.5
25°C       10       19       40
30°C       15       25       45
35°C       20       30       55
40°C       15       22       40

Anova: Two-Factor Without Replication

SUMMARY    Count   Sum   Average    Variance
Row 1        3      69   23         237
Row 2        3      85   28.33333   233.3333
Row 3        3     105   35         325
Row 4        3      77   25.66667   166.3333
Column 1     4      60   15         16.66667
Column 2     4      96   24         22
Column 3     4     180   45         50

ANOVA
Source of Variation   SS         df   MS         F          P-value    F crit
Rows                  238.6667    3   79.55556   17.46341   0.00228    4.757055
Columns               1896        2   948        208.0976   2.87E-06   5.143249
Error                 27.33333    6   4.555556
Total                 2162       11

Of interest, another piece of information is revealed by this analysis: the effects of temperature do not interact with the effects of pH. In other words, a change of temperature does not change the response to pH, and vice versa. We can deduce this because the residual (error) mean square (MS) is small compared with the mean squares for temperature (rows) or pH (columns). [A low residual mean square tells us that most variation in the data is accounted for by the separate effects of temperature and pH.]

But suppose that our data were as follows:

Temp °C   pH 5.5   pH 6.5   pH 7.5
25          10       19       40
30          15       25       30
35          20       30       25
40          25       22       10

Here an increase of temperature increases growth at low pH but decreases growth at high pH. If we analysed these data we would probably find no significant effect of temperature or pH, because these factors interact to influence growth. The residual mean square would be very large. This type of result is not uncommon - for example, patients' age might affect their susceptibility to levels of stress. Inspection of our data strongly suggests that there is interaction. To analyse it, we would need to repeat the experiment with two replicates, then use a slightly more complex analysis of variance to test for (1) separate temperature effects, (2) separate pH effects, and (3) significant effects of interaction. As an example, below is shown a print-out from "Excel" of the following table, where I have assumed that we did the experiment above with replication.

Temp °C      pH 5.5   pH 6.5   pH 7.5
25, rep 1       9        18       36
25, rep 2      11        20       44
30, rep 1      13        23       27
30, rep 2      17        27       33
35, rep 1      18        27       23
35, rep 2      22        33       27
40, rep 1      22        20        7
40, rep 2      28        24       13
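A spreadsheet is not the only option here. As a sketch (assuming pandas and statsmodels are installed; the column names are my own choice), the replicated data can be analysed in Python, testing both main effects and the interaction in one call:

```python
# Two-way ANOVA with replication: main effects of temperature and pH,
# plus the temperature x pH interaction.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

temps = [25, 30, 35, 40]
reps = {  # replicated O.D. readings from the table above, keyed by pH
    5.5: [(9, 11), (13, 17), (18, 22), (22, 28)],
    6.5: [(18, 20), (23, 27), (27, 33), (20, 24)],
    7.5: [(36, 44), (27, 33), (23, 27), (7, 13)],
}
rows = [(t, ph, od)
        for ph, pairs in reps.items()
        for t, pair in zip(temps, pairs)
        for od in pair]
df = pd.DataFrame(rows, columns=["temp", "ph", "od"])

model = ols("od ~ C(temp) * C(ph)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # F and p for temp, pH, and temp:pH
```

With these data the temp:ph line of the output should carry a large F, confirming the interaction that is already obvious on inspection.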
