Statistics Guide
This Statistics Guide is a companion to GraphPad Prism 5. Available for both Mac and Windows, Prism makes it very easy to graph and analyze scientific data. Download a free demo from www.graphpad.com. The focus of this Guide is on helping you understand the big ideas behind statistical tests, so you can choose a test and interpret the results. Only about 10% of this Guide is specific to Prism, so you may find it very useful even if you use another statistics program. The companion Regression Guide explains how to fit curves with Prism. Both of these Guides contain exactly the same information as the Help system that comes with Prism 5, including the free demo version. You may also view the Prism Help on the web at: http://graphpad.com/help/prism5/prism5help.html
Copyright 2007, GraphPad Software, Inc. All rights reserved. GraphPad Prism and Prism are registered trademarks of GraphPad Software, Inc. GraphPad is a trademark of GraphPad Software, Inc. Citation: H.J. Motulsky, Prism 5 Statistics Guide, 2007, GraphPad Software Inc., San Diego CA, www.graphpad.com. To contact GraphPad Software, email support@graphpad.com or sales@graphpad.com.
Contents
I. Statistical principles
The big picture
    When do you need statistical calculations?
    Extrapolating from 'sample' to 'population'
    Why statistics can be hard to learn
    The need for independent samples
    Ordinal, interval and ratio variables

P Values
    What is a P value?
    Common misinterpretation of a P value
    One-tail vs. two-tail P values
    Advice: Use two-tailed P values
    Advice: How to interpret a small P value
    Advice: How to interpret a large P value

Outliers
    What is an outlier?
    Advice: Beware of identifying outliers manually
    Detecting outliers with Grubbs' test
    Statistical tests that are robust to the presence of outliers
    Dealing with outliers in Prism

    How to: Repeated measures one-way ANOVA
    Interpreting results: Repeated measures one-way ANOVA
    Analysis checklist: Repeated-measures one-way ANOVA

V. Two-way ANOVA
    Key concepts: Two-way ANOVA
    Entering two-way ANOVA data
    Choosing two-way ANOVA
    Point of confusion: ANOVA with a quantitative factor
    Q&A: Two-way ANOVA
www.graphpad.com
I. Statistical principles
Before you can choose statistical tests or make sense of the results, you need to know some of the basic principles of statistics. We try to make this as painless as possible by avoiding equations and focusing on the practical issues that often impede effective data analysis.
Statistics lies at the interface of math and science

Statistics is a branch of math, so to truly understand the basis of statistics you need to delve into the mathematical details. However, you don't need to know much math to use statistics effectively and to correctly interpret the results. Many statistics books tell you more about the mathematical basis of statistics than you need to know to use statistical methods effectively. The focus here is on selecting statistical methods and making sense of the results, so this presentation uses very little math. If you are a math whiz who thinks in terms of equations, you'll want to learn statistics from a more mathematical book.
Definitions
A categorical variable, also called a nominal variable, has mutually exclusive, but not ordered, categories. For example, your study might compare five different genotypes. You can code the five genotypes with numbers if you want, but the order is arbitrary and any calculations (for example, computing an average) would be meaningless.

An ordinal variable is one where the order matters but not the difference between values. For example, you might ask patients to express the amount of pain they are feeling on a scale of 1 to 10. A score of 7 means more pain than a score of 5, and that is more than a score of 3. But the difference between the 7 and the 5 may not be the same as that between the 5 and the 3. The values simply express an order. Another example would be movie ratings, from * to *****.

An interval variable is one where the difference between two values is meaningful. The difference between a temperature of 100 degrees and 90 degrees is the same difference as between 90 degrees and 80 degrees.

A ratio variable has all the properties of an interval variable, but also has a clear definition of 0.0. When the variable equals 0.0, there is none of that variable. Variables like height, weight and enzyme activity are ratio variables. Temperature, expressed in F or C, is not a ratio variable. A temperature of 0.0 on either of those scales does not mean 'no temperature'. However, temperature in degrees Kelvin is a ratio variable, as 0.0 degrees Kelvin really does mean 'no temperature'. Another counterexample is pH. It is not a ratio variable, as pH=0 just means a 1 molar concentration of H+, and the definition of molar is fairly arbitrary. A pH of 0.0 does not mean 'no acidity' (quite the opposite!).

When working with ratio variables, but not interval variables, you can look at the ratio of two measurements. A weight of 4 grams is twice a weight of 2 grams, because weight is a ratio variable. A temperature of 100 degrees C is not twice as hot as 50 degrees C, because temperature C is not a ratio variable. A pH of 3 is not twice as acidic as a pH of 6, because pH is not a ratio variable.

The categories are not as clear cut as they sound. What kind of variable is color? In some experiments, different colors would be regarded as nominal. But if color is quantified by wavelength, then color would be considered a ratio variable. The classification scheme really is somewhat fuzzy.
What is OK to compute
OK to compute...                                   Nominal   Ordinal   Interval   Ratio
Frequency distribution                             Yes       Yes       Yes        Yes
Median and percentiles                             No        Yes       Yes        Yes
Sum or difference                                  No        No        Yes        Yes
Mean, standard deviation, standard error of mean   No        No        Yes        Yes
Ratio, or coefficient of variation                 No        No        No         Yes
Does it matter?
It matters if you are taking an exam in statistics, because this is the kind of concept that is easy to test for. Does it matter for data analysis? The concepts are mostly pretty obvious, but putting names on different kinds of variables can help prevent mistakes like taking the average of a group of postal (zip) codes, or taking the ratio of two pH values. Beyond that, putting labels on the different kinds of variables doesn't really help you plan your analyses or interpret the results.
The average weight is 10 milligrams, the weight of 10 μL of water (at least on earth). The distribution is flat, with no hint of a Gaussian distribution. Now let's make the experiment more complicated. We pipette twice and weigh the result. On average, the weight will now be 20 milligrams. But you expect the errors to cancel out some of the time. The figure below shows what you get.
Each pipetting step has a flat random error. Add them up, and the distribution is not flat. For example, you'll get weights near 21 mg only if both pipetting steps err substantially in the same direction, and that is rare. Now let's extend this to ten pipetting steps, and look at the distribution of the sums.
The distribution looks a lot like an ideal Gaussian distribution. Repeat the experiment 15,000 times rather than 1,000 and you get even closer to a Gaussian distribution.
This simulation demonstrates a principle that can also be mathematically proven. Scatter will approximate a Gaussian distribution if your experimental scatter has numerous sources that are additive and of nearly equal weight, and the sample size is large. The Gaussian distribution is a mathematical ideal. Few biological distributions, if any, really follow the Gaussian distribution. The Gaussian distribution extends from negative infinity to positive infinity. If the weights in the example above really were to follow a Gaussian distribution, there would be some chance (albeit very small) that the weight is negative. Since weights can't be negative, the distribution cannot be exactly Gaussian. But it is close enough to Gaussian to make it OK to use statistical methods (like t tests and regression) that assume a Gaussian distribution.
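This simulation is easy to reproduce. The sketch below (Python; the ±1 mg uniform error per 10 mg step is an assumed magnitude, since the text does not state one) sums ten flat pipetting errors and shows how the sums pile up near the middle instead of staying flat:

```python
import random

random.seed(1)

# Each pipetting step aims for 10 mg but misses by a uniformly
# distributed (flat) error; +/-1 mg is an assumed magnitude.
def pipette_once():
    return 10.0 + random.uniform(-1.0, 1.0)

# Sum ten steps and repeat the "experiment" 15,000 times.
weights = [sum(pipette_once() for _ in range(10)) for _ in range(15000)]

mean = sum(weights) / len(weights)
print(round(mean, 1))  # close to 100 mg

# Unlike a flat distribution, the sums cluster in the middle: far more
# land within 1 mg of 100 than land below 93 or above 107.
central = sum(1 for w in weights if abs(w - 100) < 1)
extreme = sum(1 for w in weights if w < 93 or w > 107)
print(central, extreme)
```

A histogram of `weights` looks strikingly bell-shaped, even though each individual error is flat.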
The graph that follows shows the relationship between the standard deviation and a Gaussian distribution. The area under a probability distribution represents the entire population, so the area under a portion of a probability distribution represents a fraction of the population. In the graph on the left, the green (shaded) portion extends from one SD below the mean to one SD above the mean. The green area is about 68% of the total area, so a bit more than two thirds of the values are in the interval mean plus or minus one SD. The graph on the right shows that about 95% of values lie within two standard deviations of the mean.
This graph points out that interpreting the mean and SD can be misleading if you assume the data are Gaussian when that assumption isn't true.
Computing the SD
How is the SD calculated?
1. Compute the square of the difference between each value and the sample mean.
2. Add those values up.
3. Divide the sum by N-1. This is called the variance.
4. Take the square root to obtain the standard deviation.
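These four steps translate directly into code. A minimal sketch in Python (the sample values are invented for illustration):

```python
import math
import statistics

values = [23, 31, 25, 30, 27]  # hypothetical sample

mean = sum(values) / len(values)
squared_diffs = [(v - mean) ** 2 for v in values]   # step 1
total = sum(squared_diffs)                          # step 2
variance = total / (len(values) - 1)                # step 3: divide by N-1
sd = math.sqrt(variance)                            # step 4

# Python's statistics.stdev uses the same N-1 formula
assert math.isclose(sd, statistics.stdev(values))
print(round(sd, 2))  # 3.35
```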
Why N-1?

Why divide by N-1 rather than N in the third step above? In step 1, you compute the difference between each value and the mean of those values. You don't know the true mean of the population; all you know is the mean of your sample. Except for the rare cases where the sample mean happens to equal the population mean, the data will be closer to the sample mean than they will be to the true population mean. So the value you compute in step 2 will probably be a bit smaller (and can't be larger) than what it would be if you used the true population mean in step 1. To make up for this, we divide by N-1 rather than N. But why N-1? If you knew the sample mean and all but one of the values, you could calculate what that last value must be. Statisticians say there are N-1 degrees of freedom.
If you want to know the standard deviation of the values in cells B1 through B10, use this formula in Excel: =STDEV(B1:B10). That function computes the SD using N-1 in the denominator. If you want to compute the SD using N in the denominator (see above), use Excel's STDEVP() function.
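If you work in Python instead of Excel, the standard library's statistics module draws the same distinction (the sample values here are made up):

```python
import statistics

values = [23, 31, 25, 30, 27]  # hypothetical sample

print(statistics.stdev(values))   # N-1 in the denominator, like Excel's STDEV()
print(statistics.pstdev(values))  # N in the denominator, like Excel's STDEVP()
```

The N-1 version is always a bit larger, and the gap shrinks as the sample grows.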
The standard deviation computed from the five values shown in the graph above is 18.0. But the true standard deviation of the population from which the values were sampled might be quite different. Since N=5, the 95% confidence interval extends from 10.8 (0.60*18.0) to 51.7 (2.87*18.0). When you compute a SD from only five values, the upper 95% confidence limit for the SD is almost five times the lower limit. Most people are surprised that small samples define the SD so poorly. Random sampling can have a huge impact with small data sets, resulting in a calculated standard deviation quite far from the true population standard deviation. Note that the confidence intervals are not symmetrical. Why? Since the SD is always a positive number, the lower confidence limit can't be less than zero. This means that the upper confidence interval usually extends further above the sample SD than the lower limit extends below the sample SD. With small samples, this asymmetry is quite noticeable. If you want to compute these confidence intervals yourself, use these Excel equations (N is sample size; alpha is 0.05 for 95% confidence, 0.01 for 99% confidence, etc.): Lower limit: =SD*SQRT((N-1)/CHIINV((alpha/2), N-1)) Upper limit: =SD*SQRT((N-1)/CHIINV(1-(alpha/2), N-1))
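The same confidence limits can be computed outside Excel. A sketch in Python (assuming SciPy is installed; chi2.isf plays the role of Excel's CHIINV):

```python
import math
from scipy.stats import chi2

def sd_confidence_interval(sd, n, alpha=0.05):
    """CI for a standard deviation, based on the chi-square distribution."""
    df = n - 1
    lower = sd * math.sqrt(df / chi2.isf(alpha / 2, df))
    upper = sd * math.sqrt(df / chi2.isf(1 - alpha / 2, df))
    return lower, upper

# The example above: SD = 18.0 computed from N = 5 values
low, high = sd_confidence_interval(18.0, 5)
print(round(low, 1), round(high, 1))  # 10.8 51.7
```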
The 67% confidence interval extends approximately one SEM in each direction from the mean. The 95% confidence interval extends approximately two SEMs from the mean in each direction. The multipliers are not actually 1.0 and 2.0, but rather are values that come from the t distribution and depend on sample size. With small samples, and certainly when N is less than ten, those rules of thumb are not very accurate.
When you are trying to emphasize small and unimportant differences in your data, show your error bars as standard errors and hope that your readers think they are standard deviations. When you are trying to cover up large differences, show the error bars as standard deviations and hope that your readers think they are standard errors.

Steve Simon (in jest), http://www.childrens-mercy.org/stats/weblog2005/standarderror.asp
Confidence intervals
How is it possible that the CI of a mean does not include the true mean?
The upper panel below shows ten sets of data (N=5), randomly drawn from a Gaussian distribution with a mean of 100 and a standard deviation of 35. The lower panel shows the 95% CI of the mean for each sample.
Because these are simulated data, we know the exact value of the true population mean (100), so we can ask whether or not each confidence interval includes that true population mean. In the data set second from the right in the graphs above, the 95% confidence interval does not include the true mean of 100 (dotted line). When analyzing data, you don't know the population mean, so you can't know whether a particular confidence interval contains the true population mean or not. All you know is that there is a 95% chance that the confidence interval includes the population mean, and a 5% chance that it does not.
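You can verify that 95% figure by simulation. A sketch (Python, with NumPy and SciPy assumed available) that repeats the sampling many times and counts how often the interval captures the true mean:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Draw many samples of N=5 from the population used above (mean 100,
# SD 35) and count how often the 95% CI of the mean captures 100.
n, trials, covered = 5, 2000, 0
multiplier = stats.t.ppf(0.975, n - 1)   # ~2.776 for N=5
for _ in range(trials):
    sample = rng.normal(100, 35, n)
    sem = sample.std(ddof=1) / np.sqrt(n)
    m = sample.mean()
    if m - multiplier * sem <= 100 <= m + multiplier * sem:
        covered += 1

print(covered / trials)  # close to 0.95
```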
N      Multiplier
2      12.706
3      4.303
5      2.776
10     2.262
25     2.064
50     2.010
100    1.984
500    1.965
N      =TINV(0.05, N-1)

The samples shown in the graph above had five values. So the lower confidence limit from one of those samples is computed as the mean minus 2.776 times the SEM, and the upper confidence limit is computed as the mean plus 2.776 times the SEM. The last line in the table above shows the equation to use to compute the multiplier in Excel. A common rule of thumb is that the 95% confidence interval is computed from the mean plus or minus two SEMs. With large samples, that rule is very accurate. With small samples, the CI of a mean is much wider than suggested by that rule of thumb.
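The same multipliers can be computed in Python (assuming SciPy is installed); t.ppf plays the role of Excel's TINV:

```python
from scipy.stats import t

def ci_multiplier(n, confidence=0.95):
    """Two-tailed critical value of Student's t, as in =TINV(0.05, N-1)."""
    alpha = 1 - confidence
    return t.ppf(1 - alpha / 2, df=n - 1)

print(round(ci_multiplier(5), 3))    # 2.776, matching the table
print(round(ci_multiplier(500), 3))  # 1.965, near the large-sample value
```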
The graph shows three samples (of different size) all sampled from the same population. With the small sample on the left, the 95% confidence interval is similar to the range of the data. But only a tiny fraction of the values in the large sample on the right lie within the confidence interval. This makes sense. The 95% confidence interval defines a range of values that you can be 95% certain contains the population mean. With large samples, you know that mean with much more precision than you do with a small sample, so the confidence interval is quite narrow when computed from a large sample. Don't misinterpret the confidence interval as the range that contains 95% of the values.
The confidence interval you compute depends on the data you happened to collect. If you repeated the experiment, your confidence interval would almost certainly be different. So it is OK to ask about the probability that the interval contains the population mean. Does it matter? It seems to me that it makes little difference, but some statisticians think this is a critical distinction.
One approach is to create a 90% confidence interval instead of 95%. The 90% CI for the difference in eradication rate extends from -5.9% to 8.4%. Since we are less confident that it includes the true value, it doesn't extend as far as the 95% interval. We can restate this as a one-sided 95% confidence limit: we are 95% sure that the true difference is greater than -5.9%. Thus, we are 95% sure that the new drug has an eradication rate not more than 5.9% worse than that of the standard drug. In this example of testing noninferiority, it makes sense to express a one-sided confidence interval as the lower limit only. In other situations, it can make sense to express a one-sided confidence limit as an upper limit only. For example, in toxicology you may care only about the upper confidence limit. GraphPad Prism does not compute one-sided confidence intervals directly. But, as the example shows, it is easy to create the one-sided intervals yourself. Simply ask Prism to create a 90% confidence interval for the value you care about. If you only care about the lower limit, say that you are 95% sure the true value is higher than that (90%) lower limit. If you only care about the upper limit, say that you are 95% sure the true value is lower than the (90%) upper limit.
Reference
1. S. J. Pocock, The pros and cons of noninferiority trials, Fundamental & Clinical Pharmacology, 17: 483-490 (2003).
P Values
What is a P value?
Suppose that you've collected data from two samples of animals treated with different drugs. You've measured an enzyme in each animal's plasma, and the means are different. You want to know whether that difference is due to an effect of the drug, that is, whether the two populations have different means. Observing different sample means is not enough to persuade you to conclude that the populations have different means. It is possible that the populations have the same mean (i.e., that the drugs have no effect on the enzyme you are measuring) and that the difference you observed between sample means occurred only by chance. There is no way you can ever be sure whether the difference you observed reflects a true difference or whether it simply occurred in the course of random sampling. All you can do is calculate probabilities. The first step is to state the null hypothesis: that the drug really does not affect the outcome you are measuring (so all differences are due to random sampling). The P value is a probability, with a value ranging from zero to one, that answers this question (which you probably never thought to ask): In an experiment of this size, if the populations really have the same mean, what is the probability of observing at least as large a difference between sample means as was, in fact, observed?
For example, a P value of 0.03 means: There is a 3% chance of observing a difference as large as you observed even if the two population means are identical (the null hypothesis is true). Equivalently: random sampling from identical populations would lead to a difference smaller than you observed in 97% of experiments, and larger than you observed in 3% of experiments.
Wrong:
There is a 97% chance that the difference you observed reflects a real difference between populations, and a 3% chance that the difference is due to chance.
This latter statement is a common mistake. If you have a hard time understanding the difference between the correct and incorrect definitions, read the Bayesian perspective on page 40.
Two-tail P value
The two-tail P value answers this question: Assuming the null hypothesis is true, what is the chance that randomly selected samples would have means as far apart as (or further than) you observed in this experiment with either group having the larger mean?
One-tail P value
To interpret a one-tail P value, you must predict which group will have the larger mean before collecting any data. The one-tail P value answers this question: Assuming the null hypothesis is true, what is the chance that randomly selected samples would have means as far apart as (or further than) observed in this experiment with the specified group having the larger mean? A one-tail P value is appropriate only when previous data, physical limitations or common sense tell you that a difference, if any, can only go in one direction. The issue is not whether you expect a difference to exist; that is what you are trying to find out with the experiment. The issue is whether you should interpret increases and decreases in the same manner. You should only choose a one-tail P value when both of the following are true:

1. You predicted which group would have the larger mean (or proportion) before you collected any data.
2. If the other group had ended up with the larger mean, even if it was quite a bit larger, you would have attributed that difference to chance and called the difference 'not statistically significant'.
Deciding between the last two possibilities is a matter of scientific judgment, and no statistical calculations will help you decide.
Extremely significant?
Once you have set a threshold significance level (usually 0.05), every result leads to a conclusion of either "statistically significant" or "not statistically significant". Some statisticians feel very strongly that the only acceptable conclusion is 'significant' or 'not significant', and oppose the use of adjectives or asterisks to describe levels of statistical significance. Many scientists are not so rigid, and prefer to use adjectives such as 'very significant' or 'extremely significant'. Prism uses this approach, as shown in the table. These definitions are not entirely standard. If you report the results in this way, you should define the symbols in your figure legend.
Summary    P value
***        P < 0.001
**         0.001 to 0.01
*          0.01 to 0.05
ns         P > 0.05
A. Prior probability = 10%

                              Drug really works    Drug really doesn't work    Total
P < 0.05, significant                80                      45                 125
P > 0.05, not significant            20                     855                 875
Total                               100                     900                1000
B. Prior probability = 80%

                              Drug really works    Drug really doesn't work    Total
P < 0.05, significant               640                      10                 650
P > 0.05, not significant           160                     190                 350
Total                               800                     200                1000
C. Prior probability = 1%

                              Drug really works    Drug really doesn't work    Total
P < 0.05, significant                 8                      50                  58
P > 0.05, not significant             2                     940                 942
Total                                10                     990                1000
The totals at the bottom of each column are determined by the prior probability, which depends on the context of your experiment. The prior probability equals the fraction of the experiments that are in the leftmost column. To compute the number of experiments in each row, use the definitions of power and alpha. Of the drugs that really work, you won't obtain a P value less than 0.05 in every case. You chose a sample size to obtain a power of 80%, so 80% of the truly effective drugs yield significant P values and 20% yield not significant P values. Of the drugs that really don't work (middle column), you won't get not significant results in every case. Since you defined statistical significance to be P<0.05 (alpha=0.05), you will see a "statistically significant" result in 5% of experiments performed with drugs that are really inactive and a not significant result in the other 95%. If the P value is less than 0.05, so the results are statistically significant, what is the chance that the drug is, in fact, active? The answer is different for each experiment.

Experiments with P < 0.05 and...

Prior probability    Drug really works    Drug really doesn't work    Fraction where drug really works
10%                        80                      45                      80/125 = 64%
80%                       640                      10                     640/650 = 98%
1%                          8                      50                       8/58 = 14%
For experiment A, the chance that the drug is really active is 80/125 or 64%. If you observe a statistically significant result, there is a 64% chance that the difference is real and a 36% chance that the difference simply arose in the course of random sampling. For experiment B, there is a 98.5% chance that the difference is real. In contrast, if you observe a significant result in experiment C, there is only a 14% chance that the result is real and an 86% chance that it is due to random sampling. For experiment C, the vast majority of significant results are due to chance.
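The pattern across these three experiments follows directly from the prior probability, the power (80%) and alpha (0.05). A minimal sketch of the arithmetic in Python:

```python
def fraction_real(prior, power=0.80, alpha=0.05):
    """Of the experiments with P < 0.05, what fraction involve a drug
    that really works?  (true positives / all positives)"""
    true_pos = prior * power          # drugs that work and reach P < 0.05
    false_pos = (1 - prior) * alpha   # inactive drugs that reach P < 0.05
    return true_pos / (true_pos + false_pos)

for prior in (0.10, 0.80, 0.01):
    # prints 64%, 98% and 14%, matching experiments A, B and C
    print(f"prior {prior:.0%}: {fraction_real(prior):.0%} of significant results are real")
```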
You can't interpret a P value in a vacuum. Your interpretation depends on the context of the experiment. Interpreting results requires common sense, intuition, and judgment.
We repeated the entire simulation three times. These simulations were done comparing two groups with identical population means. So any "statistically significant" result we obtain must be a coincidence, a Type I error. The graph plots P value on the Y axis vs. sample size (per group) on the X axis. The green shaded area at the bottom of the graph shows P values less than 0.05, so deemed "statistically significant".
Experiment 1 (green) reached a P value less than 0.05 when N=7, but the P value is higher than 0.05 for all other sample sizes. Experiment 2 (red) reached a P value less than 0.05 when N=61, and also when N=88 or 89. The Experiment 3 (blue) curve hit a P value less than 0.05 when N=92 to N=100. If we followed the sequential approach, we would have stopped when N=7 in the first (green) experiment, so we would never have seen the dotted parts of its curve. We would have stopped the second (red) experiment when N=61, and the third (blue) experiment when N=92. In all three cases, we would have declared the results to be "statistically significant". Since these simulations were created for values where the true mean in both groups was identical, any declaration of "statistical significance" is a Type I error. If the null hypothesis is true (the two population means are identical) we expect to see this kind of Type I error in 5% of experiments (if we use the traditional definition of alpha=0.05 so P values less than 0.05 are declared to be
significant). But with this sequential approach, all three of our experiments resulted in a Type I error. If you extended the experiment long enough (infinite N), all experiments would eventually reach statistical significance. Of course, in some cases you would eventually give up even without "statistical significance". But this sequential approach will produce "significant" results in far more than 5% of experiments, even if the null hypothesis were true, and so this approach is invalid. Note: There are special statistical techniques for analyzing data sequentially, adding more subjects if the results are ambiguous and stopping if the results are clear. Look up 'sequential' or 'adaptive' methods in advanced statistics books to learn more.
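The inflation is easy to demonstrate by simulation. A sketch (Python, with NumPy and SciPy assumed available; the population parameters and stopping rule are invented for illustration) that keeps adding subjects under a true null hypothesis and stops at the first P < 0.05:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def sequential_experiment(max_n=50, alpha=0.05):
    """Add one subject per group at a time; stop as soon as P < alpha.
    Both groups come from the same population, so the null is true."""
    a = list(rng.normal(100, 15, 3))
    b = list(rng.normal(100, 15, 3))
    while len(a) < max_n:
        if stats.ttest_ind(a, b).pvalue < alpha:
            return True          # declared "significant": a Type I error
        a.append(rng.normal(100, 15))
        b.append(rng.normal(100, 15))
    return stats.ttest_ind(a, b).pvalue < alpha

sims = 500
type_i = sum(sequential_experiment() for _ in range(sims))
print(type_i / sims)  # well above the nominal 0.05
```

Even with this modest cap of 50 subjects per group, the observed Type I error rate is several times the nominal 5%.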
Statistical power
GraphPad StatMate
GraphPad Prism does not compute statistical power or sample size, but the companion program GraphPad StatMate does.
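As a sketch of what such a power calculation involves, here is one standard textbook approach for an unpaired t test, based on the noncentral t distribution (Python, with SciPy assumed available; this is not necessarily StatMate's exact algorithm):

```python
import math
from scipy.stats import nct, t

def unpaired_t_power(diff, sd, n_per_group, alpha=0.05):
    """Power of a two-tailed unpaired t test to detect a true difference `diff`,
    given the population SD and the number of subjects per group."""
    df = 2 * n_per_group - 2
    ncp = diff / (sd * math.sqrt(2 / n_per_group))   # noncentrality parameter
    t_crit = t.ppf(1 - alpha / 2, df)
    # chance that |t| exceeds the critical value when the true difference is diff
    return (1 - nct.cdf(t_crit, df, ncp)) + nct.cdf(-t_crit, df, ncp)

# Classic benchmark: a half-SD difference with 64 subjects per group
# gives roughly 80% power at alpha = 0.05.
print(round(unpaired_t_power(diff=0.5, sd=1.0, n_per_group=64), 2))
```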
The two means were almost identical, and a t test gave a very high P value. We concluded that the platelets of people with hypertension do not have an altered number of alpha2 receptors. These negative data can be interpreted in terms of confidence intervals or using power analyses. The two are equivalent and are just alternative ways of thinking about the data.
All studies have a high power to detect "big" differences and a low power to detect "small" differences, so power graphs all have the same shape. Interpreting the graph depends on putting the results into a scientific context. Here are two alternative interpretations of the results:

We really care about receptors in the heart, kidney, brain and blood vessels, not the ones in the platelets (which are much more accessible). So we will only pursue these results (do more studies) if the difference was at least 50%. The mean number of receptors per platelet is about 260, so we would only be seriously interested in these results if the difference exceeded half of that, or 130. From the graph above, you can see that this study had extremely high power to detect a difference of 130 receptors/platelet. In other words, if the difference really was that big, this study (given its sample size and variability) would almost certainly have found a statistically significant difference. Therefore, this study gives convincing negative results.

Hey, this is hypertension. Nothing is simple. No effects are large. We've got to follow every lead we can. It would be nice to find differences of 50% (see above) but realistically, given the heterogeneity of hypertension, we can't expect to find such a large difference. Even if the difference was only 20%, we'd still want to do follow-up experiments. Since the mean number of receptors per platelet is 260, this means we would want to find a difference of about 50 receptors per platelet. Reading off the graph (or the table), you can see that the power of this experiment to find a difference of 50 receptors per cell was only about 50%. This means that even if there really were a difference this large, this particular experiment (given its sample size and scatter) had only a 50% chance of finding a statistically significant result. With such low power, we really can't conclude very much from this experiment.
A reviewer or editor making such an argument could convincingly argue that there is no point publishing negative data with such low power to detect a biologically interesting result. As you can see, the interpretation of power depends on how large a difference you think would be scientifically or practically important to detect. Different people may reasonably reach different conclusions. Note that it doesn't help at all to look up the power of a study to detect the difference actually observed.
Advice: Don't compute the power to detect the difference actually observed
It is never possible to just ask "what is the power of this experiment?". Rather, you must ask "what is the power of this experiment to detect an effect of some specified size?". Which effect size should you use? How large a difference should you be looking for? It only makes sense to do a power analysis when you think about the data scientifically. It isn't purely a statistical question, but rather a scientific one. Some programs try to take the thinking out of the process by computing only a single value for power. These programs compute the power to detect the effect size (or difference, relative risk, etc.) actually observed in that experiment. The result is sometimes called observed power, and the procedure is sometimes called a post-hoc power analysis or retrospective power analysis. Prism does not do this. Here is the reason: If your study reached a conclusion that the difference is not statistically significant, then by definition its power to detect the effect actually observed is very low. You learn nothing new by such a calculation. You already know that the difference was not statistically significant, and now you know that the power of the study to detect that particular difference is low. Not helpful. What would be helpful is to know the power of the study to detect some hypothetical difference that you think would have been scientifically or clinically worth detecting. GraphPad StatMate can help you make these calculations (for some kinds of experimental designs), but you must decide how large a difference you would have cared about. That requires a scientific judgment that cannot be circumvented by statistical computations. The articles listed below discuss the futility of post-hoc power analyses.
References
M Levine and MHH Ensom, Post Hoc Power Analysis: An Idea Whose Time Has Passed, Pharmacotherapy 21:405-409, 2001.

SN Goodman and JA Berlin, The Use of Predicted Confidence Intervals When Planning Experiments and the Misuse of Power When Interpreting the Results, Annals of Internal Medicine 121:200-206, 1994.
RV Lenth, Some Practical Guidelines for Effective Sample Size Determination, The American Statistician 55:187-193, 2001.
Multiple comparisons
Many scientific studies generate more than one P value. Some studies in fact generate hundreds of P values. Interpreting multiple P values is difficult. If you test several independent null hypotheses and leave the threshold at 0.05 for each comparison, the chance of obtaining at least one statistically significant result is greater than 5% (even if all null hypotheses are true). This graph shows the problem. The probability on the Y axis is computed from N on the X axis using this equation: 100(1.00 - 0.95^N).
Remember the unlucky number 13. If you perform 13 independent experiments, your chances are about 50:50 of obtaining at least one 'significant' P value (<0.05) just by chance.
Example
Let's consider an example. You compare control and treated animals, and you measure the level of three different enzymes in the blood plasma. You perform three separate t tests, one for each enzyme, and use the traditional cutoff of alpha=0.05 for declaring each P value to be significant. Even if the treatment doesn't actually do anything, there is a 14% chance that one or more of your t tests will be statistically significant. To keep the overall chance of a false significant conclusion at 5%, you need to lower the threshold for each t test to 0.0170. If you compare 10 different enzyme levels with 10 t tests, the chance of obtaining at least one significant P value by chance alone, even if the treatment really does nothing, is 40%. Unless you correct for the multiple comparisons, it is easy to be fooled by the results.

Now let's say you test ten different enzymes, at three time points, in two species, with four pretreatments. You can make lots of comparisons, and you are almost certain to find that some of them are 'significant', even if all the null hypotheses are really true.

You can only account for multiple comparisons when you know about all the comparisons made by the investigators. If you report only significant differences, without reporting the total number of comparisons, others will not be able to properly evaluate your results. Ideally, you should plan all your analyses before collecting data, and then report all the results.

Distinguish between studies that test a hypothesis and studies that generate a hypothesis. Exploratory analyses of large databases can generate hundreds or thousands of P values, and scanning these can generate intriguing research hypotheses. But you can't test hypotheses using the same data that prompted you to consider them. You need to test your new hypotheses with fresh data.
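The arithmetic behind these numbers is easy to check. Here is a minimal sketch in Python (the function names are ours; the per-test threshold uses the Sidak formula, which is one standard way to derive the 0.0170 figure):

```python
# Chance of at least one "significant" P value among n independent
# comparisons, when every null hypothesis is true and alpha = 0.05.
def familywise_error(n, alpha=0.05):
    return 1 - (1 - alpha) ** n

# Per-comparison threshold (Sidak correction) that keeps the
# overall chance of a false significant conclusion at 5%.
def sidak_threshold(n, alpha=0.05):
    return 1 - (1 - alpha) ** (1 / n)

print(familywise_error(3))    # about 0.14 (three t tests)
print(familywise_error(10))   # about 0.40 (ten t tests)
print(familywise_error(13))   # about 0.49 (the "unlucky 13" rule)
print(sidak_threshold(3))     # about 0.0170
```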
"Significant" "Not significant" Total No difference. Null hypothesis true A difference truly exists Total C A+C D B+D C+D A+B+C+D A B A+B
thus a 95% chance that none of the comparisons will lead to a 'significant' conclusion. The 5% applies to the entire experiment, so it is sometimes called an experimentwise error rate or familywise error rate. The advantage of this approach is that you are far less likely to be misled by false conclusions of 'statistical significance'. The disadvantage is that you need to use a stricter threshold of significance, so you will have less power to detect true differences.
You must decide how large a difference has to be in order to be considered scientifically or clinically relevant.
In any experiment, you expect to almost always see some difference in outcome when you apply two treatments. So the question is not whether the two treatments lead to exactly the same outcome. Rather, the question is whether the outcomes are close enough to be clinically or scientifically indistinguishable. How close is that? There is no way to answer that question generally. The answer depends on the scientific or clinical context of your experiment. To ask questions about equivalence, you first have to define a range of treatment effects that you consider to be scientifically or clinically trivial. This is an important decision that must be made totally on scientific or clinical grounds.
You can test for equivalence using either a confidence interval or P value approach
Statistical methods have been developed for testing for equivalence. You can use either a confidence interval or a P value approach 57 .
In the experiment shown on top, even the limit of the confidence interval lies within the zone of indifference. You can conclude (with 95% confidence) that the two treatments are equivalent. In the experiment shown on the bottom, the confidence interval extends beyond the zone of indifference. Therefore, you cannot conclude that the treatments are equivalent. You also cannot conclude that the treatments are not equivalent, as the observed treatment effect is inside the zone of indifference. With data like these, you simply cannot make any conclusion about equivalence.
equivalent, but rather that the difference is just barely large enough to be outside the zone of scientific or clinical indifference. In the figure above, define the null hypothesis to be that the true effect equals the effect denoted by the dotted line. Then ask: If that null hypothesis were true, what is the chance (given sample size and variability) of observing an effect as small as, or smaller than, the one observed? If the P value is small, you reject the null hypothesis of nonequivalence, and so conclude that the treatments are equivalent. If the P value is large, then the data are consistent with the null hypothesis of nonequivalent effects. Since you only care about the chance of obtaining an effect so much lower than the null hypothesis (and wouldn't do the test if the difference were higher), you use a one-tail P value.

The graph above is plotted with the absolute value of the effect on the horizontal axis. If you plotted the treatment effect itself, you would have two dotted lines, symmetric around the 0 point, one showing a positive treatment effect and the other showing a negative treatment effect. You would then have two different null hypotheses, each tested with a one-tail test. You'll see this referred to as the Two One-Sided Tests (TOST) procedure.
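As a rough illustration of the two-one-sided-tests idea, here is a large-sample sketch that uses a normal (z) approximation. A real equivalence test would use t distributions matched to the actual study design, so treat the function names, the example numbers, and the normal approximation as assumptions for illustration only:

```python
import math

def norm_cdf(z):
    # Standard normal cumulative distribution, via the error function.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def tost_equivalent(effect, se, margin, alpha=0.05):
    """Two one-sided tests, large-sample z approximation.

    effect: observed treatment effect
    se:     standard error of that effect
    margin: the zone of indifference is (-margin, +margin)
    """
    # One-sided test of H0: true effect >= +margin
    p_upper = norm_cdf((effect - margin) / se)
    # One-sided test of H0: true effect <= -margin
    p_lower = 1 - norm_cdf((effect + margin) / se)
    # Conclude equivalence only if BOTH nonequivalence nulls are rejected.
    return p_upper < alpha and p_lower < alpha

print(tost_equivalent(0.1, 0.2, 0.5))  # True: effect clearly inside the zone
print(tost_equivalent(0.4, 0.3, 0.5))  # False: cannot conclude equivalence
```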
Confused about the switch from 90% confidence intervals to conclusions with 95% certainty? Good. That means you are paying attention. It is confusing!
Outliers
What is an outlier?
When analyzing data, you'll sometimes find that one value is far from the others. Such a value is called an outlier, a term that is usually not defined rigorously.
The graph above was created via simulation. The values in all ten data sets are randomly sampled from a Gaussian distribution with a mean of 50 and a SD of 15. But most people would conclude that the lowest value in data set A is an outlier. Maybe also the high value in data set J. Most people are unable to appreciate random variation, and tend to find 'outliers' too often.
other values? If the P value is small, you conclude that the deviation of the outlier from the other values is statistically significant, and that the value most likely came from a different population.
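An outlier test of this kind is commonly implemented as Grubbs' test (naming the method is our assumption; the chunk above describes the general approach). A minimal sketch of the test statistic; the critical value it is compared against must come from published Grubbs tables or software, so it is left out here:

```python
import statistics

def grubbs_statistic(values):
    # Ratio of the most extreme deviation from the mean to the sample SD.
    # Large values suggest the extreme point came from a different population.
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return max(abs(v - m) for v in values) / s

data = [50, 48, 52, 49, 51, 20]   # hypothetical sample with one low value
g = grubbs_statistic(data)
print(round(g, 2))  # 2.03 -- compare to the tabled critical value for N=6
```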
B Iglewicz and DC Hoaglin, How to Detect and Handle Outliers (ASQC Basic References in Quality Control, Vol. 16), American Society for Quality Control, 1993.

V Barnett and T Lewis, Outliers in Statistical Data (Wiley Series in Probability and Mathematical Statistics), John Wiley & Sons, 1994.
Prism will display excluded values in blue italics and append an asterisk. Such values will be ignored by all analyses and graphs. We suggest that you document your reasons for excluding values in a floating note. Note that sometimes outliers provide very useful information. Think about the source of variation before excluding an outlier.
fitting method to be "automatic outlier elimination".

5. Go to the Constrain tab, and constrain the slope to equal 0.

6. Go to the Weights tab and check the value of the ROUT coefficient. We recommend setting it to 1%. Increasing the value tells Prism to be more aggressive when defining a point to be an outlier; decreasing the value makes Prism more conservative about defining points to be outliers.

7. Click OK. Go to the results sheet.

8. At the bottom of the tabular results, Prism will report how many outliers it excluded. You can view a table of these values as a separate results page.

9. If you want to exclude these values from graphs or t tests or ANOVAs, go to the data table, select the questionable values, right-click and choose Exclude. The values will appear in blue italics, and will be ignored by all graphs and analyses.
Reference
HJ Motulsky and RE Brown, Detecting outliers when fitting data with nonlinear regression: a new method based on robust nonlinear regression and the false discovery rate, BMC Bioinformatics 7:123, 2006. Download from http://www.biomedcentral.com/1471-2105/7/123.
Nonparametric tests
When you see those values, it seems obvious that the treatment drastically increases the value being measured. But let's analyze these data with the Mann-Whitney test 118 (a nonparametric test to compare two unmatched groups). This test only sees ranks. So you enter the data above into Prism, but the Mann-Whitney calculations only see the ranks:

Control: 1, 3, 2
Treated: 4, 6, 5
The Mann-Whitney test then asks: if the ranks were randomly shuffled between control and treated, what is the chance of obtaining the three lowest ranks in one group and the three highest ranks in the other? The nonparametric test only looks at rank, ignoring the fact that the treated values aren't just higher, but are a whole lot higher. The answer, the two-tail P value, is 0.10. Using the traditional significance level of 5%, these results are not significantly different. This example shows that with N=3 in each group, the Mann-Whitney test can never obtain a P value less than 0.05. In other words, with three subjects in each group and the conventional definition of 'significance', the Mann-Whitney test has zero power. In contrast, with large samples, the Mann-Whitney test has almost as much power as the t test. To learn more about the relative power of nonparametric and conventional tests with large sample sizes, look up the term "Asymptotic Relative Efficiency" in an advanced statistics book.
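That two-tail P value of 0.10 can be verified by brute force. This sketch enumerates every possible assignment of the six ranks to two groups of three, which is exactly what the exact Mann-Whitney calculation does for small samples:

```python
from itertools import combinations

ranks = range(1, 7)          # the six values, reduced to ranks 1..6
observed = sum([1, 2, 3])    # control got the three lowest ranks: sum = 6
mid = sum(ranks) * 3 / 6     # expected rank sum for a group of 3: 10.5

# Rank sums for every way of picking 3 of the 6 ranks as "control".
all_sums = [sum(c) for c in combinations(ranks, 3)]

# Two-tail P value: fraction of shufflings at least as extreme as observed.
p = sum(1 for s in all_sums if abs(s - mid) >= abs(observed - mid)) / len(all_sums)
print(p)  # 0.1 -- only 2 of the 20 possible shufflings are this extreme
```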
Large samples
The decision to choose a parametric or nonparametric test matters less with huge samples (say greater than 100 or so). If you choose a parametric test and your data are not really Gaussian, you haven't lost much as the parametric tests are robust to violation of the Gaussian assumption, especially if the sample sizes are equal (or nearly so). If you choose a nonparametric test, but actually do have Gaussian data, you haven't lost much as nonparametric tests have nearly as much power as parametric tests when the sample size is large. Normality tests work well with large samples, which contain enough data to let you make reliable inferences about the shape of the distribution of the population from which the data were drawn.
Summary
                                        Large samples (>100 or so)                   Small samples (<12 or so)
Parametric tests on nongaussian data    OK. Tests are robust.                        Results are misleading. Tests not robust.
Nonparametric test on Gaussian data     OK. Almost as powerful as parametric tests.  Results are misleading. Tests are not powerful.
Normality test                          Useful.                                      Not useful. Little power to discriminate between Gaussian and non-Gaussian populations.
Column statistics
Prism's column statistics analysis computes descriptive statistics of each data set, tests for normality, and tests whether the mean of a column differs from a hypothetical value.
Inferences If you have a theoretical reason for expecting the data to be sampled from a population with a specified mean, choose the one-sample t test 77 to test that assumption. Or choose the nonparametric Wilcoxon signed-rank test 78 .
Statistic                 Meaning
Minimum                   The smallest value.
25th percentile           25% of values are lower than this.
Median                    Half the values are lower; half are higher.
75th percentile           75% of values are lower than this.
Maximum                   The largest value.
Mean                      The average.
Standard Deviation        Quantifies variability or scatter.
Standard Error of Mean    Quantifies how precisely the mean is known.
95% confidence interval   Given some assumptions, there is a 95% chance that this range includes the true overall mean.
Coefficient of variation  The standard deviation divided by the mean.
Geometric mean            Compute the logarithm of all values, compute the mean of the logarithms, and then take the antilog. It is a better measure of central tendency when data follow a lognormal distribution (long tail).
Skewness                  Quantifies how symmetrical the distribution is. A symmetrical distribution has a skewness of 0.
Kurtosis                  Quantifies whether the shape of the data distribution matches the Gaussian distribution. A Gaussian distribution has a kurtosis of 0.
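Most of these descriptive statistics are easy to compute directly. A minimal sketch using Python's standard library (the sample data and variable names are ours):

```python
import math
import statistics

data = [1, 2, 3, 4, 5]
n = len(data)

mean = statistics.mean(data)                 # the average
median = statistics.median(data)             # half lower, half higher
sd = statistics.stdev(data)                  # sample standard deviation
sem = sd / math.sqrt(n)                      # standard error of the mean
cv = sd / mean                               # coefficient of variation
geo_mean = statistics.geometric_mean(data)   # antilog of the mean of the logs

print(mean, median, round(sd, 3), round(sem, 3), round(cv, 3), round(geo_mean, 3))
# 3 3 1.581 0.707 0.527 2.605
```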
Normality tests
Normality tests 76 are performed for each column of data. Each normality test reports a P value that answers this question: If you randomly sample from a Gaussian population, what is the probability of obtaining a sample that deviates from a Gaussian distribution as much (or more so) as this sample does? A small P value is evidence that your data were sampled from a nongaussian distribution. A large P value means that your data are consistent with a Gaussian distribution (but certainly does not prove that the distribution is Gaussian).
Normality tests are less useful than some people guess. With small samples, the normality tests don't have much power to detect nongaussian distributions. Prism won't even try to compute a normality test with fewer than seven values. With large samples, it doesn't matter so much if data are nongaussian, since the t tests and ANOVA are fairly robust to violations of this assumption. Normality tests can help you decide when to use nonparametric tests, but the decision should not be an automatic one 64 .
Inferences
A one-sample t test 77 compares the mean of each column of numbers against a hypothetical mean that you provide. The P value answers this question: If the data were sampled from a Gaussian population with a mean equal to the hypothetical value you entered, what is the chance of randomly selecting N data points and finding a mean as far (or further) from the hypothetical value as observed here? If the P value is small 35 (usually defined to mean less than 0.05), then it is unlikely that the discrepancy you observed between the sample mean and hypothetical mean is due to a coincidence arising from random sampling. The nonparametric Wilcoxon signed-rank test 78 is similar, but does not assume a Gaussian distribution. It asks whether the median of each column differs from a hypothetical median you entered.
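The arithmetic behind the one-sample t test is simple; here is a minimal sketch (the data are hypothetical, and the P value itself would then come from the t distribution with N-1 degrees of freedom, via tables or software):

```python
import math
import statistics

def one_sample_t(values, hypothetical_mean):
    # t ratio: distance of the sample mean from the hypothetical mean,
    # measured in units of the standard error of the mean.
    n = len(values)
    sem = statistics.stdev(values) / math.sqrt(n)
    t = (statistics.mean(values) - hypothetical_mean) / sem
    return t, n - 1   # t ratio and degrees of freedom

t, df = one_sample_t([48, 52, 50, 49, 51], 47)
print(round(t, 2), df)  # 4.24 4
```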
The graph shows one hundred values sampled from a population that follows a lognormal distribution. The left panel plots the data on a linear (ordinary) axis. Most of the data points are piled up at the bottom of the graph, where you can't really see them. The right panel plots the data with a logarithmic scale on the Y axis. On a log axis, the distribution appears symmetrical. The median and geometric mean are near the center of the data cluster (on a log scale), but the mean is much higher, being pulled up by some very large values.

Why is there no 'geometric median'? You would compute such a value by converting all the data to logarithms, finding their median, and then taking the antilog of that median. The result would be identical to the median of the actual data, since the median works by finding percentiles (ranks) and not by manipulating the raw data.
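The claim that a 'geometric median' would add nothing can be checked directly. In this sketch (with made-up data), the antilog of the median of the logs equals the plain median:

```python
import math
import statistics

data = [1, 2, 4, 8, 100]

median = statistics.median(data)
# "Geometric median": antilog of the median of the logarithms.
geometric_median = math.exp(statistics.median(math.log(x) for x in data))

print(median, round(geometric_median, 10))  # 4 4.0 -- the same value
```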
often used in biology, and is not computed by Prism.

Mode

The mode is the value that occurs most commonly. It is not useful with measured values assessed with at least several digits of accuracy, as most values will be unique. It can be useful with variables that can only have integer values. While the mode is often included in lists like this, the mode doesn't always assess the center of a distribution. Imagine a medical survey where one of the questions is "How many times have you had surgery?" In many populations, the most common answer will be zero, so that is the mode. In this case, some values will be higher than the mode, but none lower, so the mode is not a way to quantify the center of the distribution.
that shows every value. Note that there is no ambiguity about how to compute the median. All programs do it the same way.
expected with a Gaussian distribution, and computes a single P value from the sum of these discrepancies. It is a versatile and powerful normality test, and is recommended. Note that D'Agostino developed several normality tests. The one used by Prism is the "omnibus K2" test.

An alternative is the Shapiro-Wilk normality test. We prefer the D'Agostino-Pearson test for two reasons. One reason is that, while the Shapiro-Wilk test works very well if every value is unique, it does not work well when several values are identical. The other reason is that the basis of the Shapiro-Wilk test is hard to understand.

Earlier versions of Prism offered only the Kolmogorov-Smirnov test. We still offer this test (for consistency) but no longer recommend it. It computes a P value from a single value: the largest discrepancy between the cumulative distribution of the data and a cumulative Gaussian distribution. This is not a very sensitive way to assess normality, and we now agree with this statement: "The Kolmogorov-Smirnov test is only a historical curiosity. It should never be used." 1

The Kolmogorov-Smirnov method as originally published assumes that you know the mean and SD of the overall population (perhaps from prior work). When analyzing data, you rarely know the overall population mean and SD. You only know the mean and SD of your sample. To compute the P value, therefore, Prism uses the Dallal and Wilkinson approximation to Lilliefors' method (Am. Statistician, 40:294-296, 1986). Since that method is only accurate with small P values, Prism simply reports P>0.10 for large P values.
Reference
1. RB D'Agostino, "Tests for Normal Distribution" in Goodness-of-Fit Techniques, edited by RB D'Agostino and MA Stephens, Marcel Dekker, 1986.
Prism also reports the 95% confidence interval for the difference between the actual and hypothetical mean. You can be 95% sure that this range includes the true difference.
Assumptions
The one-sample t test assumes that you have sampled your data from a population that follows a Gaussian distribution. While this assumption is not too important with large samples, it is important with small sample sizes, especially when N is less than 10. If your data do not come from a Gaussian distribution, you have three options. Your best option is to transform the values to make the distribution more Gaussian, perhaps by transforming all values to their reciprocals or logarithms. Another choice is to use the Wilcoxon signed rank nonparametric test instead of the t test. A final option is to use the t test anyway, knowing that the t test is fairly robust to departures from a Gaussian distribution with large samples.

The one-sample t test also assumes that the errors are independent 12 . The term error refers to the difference between each value and the group mean. The results of a t test only make sense when the scatter is random: whatever factor caused a value to be too high or too low affected only that one value. Prism cannot test this assumption.
the same. You just have no compelling evidence that they differ. If you have small samples, the Wilcoxon test has little power. In fact, if you have five or fewer values, the Wilcoxon test will always give a P value greater than 0.05, no matter how far the sample median is from the hypothetical median.
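The claim about five or fewer values can be verified by enumeration. With N=5 there are only 2^5 = 32 equally likely sign patterns under the null hypothesis, so the smallest possible two-tail P value is 2/32 = 0.0625, which is above 0.05. A sketch:

```python
from itertools import product

n = 5
# Under the null hypothesis, each value is equally likely to fall above or
# below the hypothetical median, so all 2^n sign patterns are equally likely.
# The signed-rank statistic W is the sum of the ranks of the positive values.
w_values = [sum(rank for rank, sign in zip(range(1, n + 1), signs) if sign > 0)
            for signs in product([-1, 1], repeat=n)]

w_max = n * (n + 1) // 2   # most extreme case: every value on one side
p_min = 2 * sum(1 for w in w_values if w >= w_max) / len(w_values)
print(p_min)  # 0.0625 -- never below 0.05, no matter what the data show
```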
Assumptions
The Wilcoxon signed rank test does not assume that the data are sampled from a Gaussian distribution. However, it does assume that the data are distributed symmetrically around the median. If the distribution is asymmetrical, the P value will not tell you much about whether the median is different from the hypothetical value. Like all statistical tests, the Wilcoxon signed rank test assumes that the errors are independent 12 . The term error refers to the difference between each value and the group median. The results of a Wilcoxon test only make sense when the scatter is random: any factor that causes a value to be too high or too low affects only that one value.
Frequency Distributions
1. Enter data
Choose a Column table, and a column scatter graph. If you are not ready to enter your own data, choose the sample data set for frequency distributions.
graph below shows a frequency distribution on the left, and a cumulative distribution of the same data on the right, both plotting the number of values in each bin.
The main advantage of cumulative distributions is that you don't need to decide on a bin width. Instead, you can tabulate the exact cumulative distribution as shown below. The data set had 250 values, so this exact cumulative distribution has 250 points, making it a bit ragged.
Relative or absolute frequencies?

Select Relative frequencies to determine the fraction (or percent) of values in each bin, rather than the actual number of values in each bin. For example, if 15 of 45 values fall into a bin, the relative frequency is 0.33 or 33%. If you choose both cumulative and relative frequencies, you can plot the distribution using a probabilities axis. When graphed this way, a Gaussian distribution is linear.

Bin width

If you choose a cumulative frequency distribution, we suggest that you create an exact distribution. In this case, you don't choose a bin width, as each value is plotted individually. To create an ordinary frequency distribution, you must decide on a bin width. If the bin width is too large, there will be only a few bins, so you will not get a good sense of how the values distribute. If the bin width is too small, many bins might have only a few values (or none), and the number of values in adjacent bins can randomly fluctuate so much that you will not get a sense of how the data are distributed. How many bins do you need? Partly it depends on your goals, and partly it depends on sample size.
If you have a large sample, you can have more bins and still have a smooth frequency distribution. One rule of thumb is aim for a number of bins equal to the log base 2 of sample size. Prism uses this as one of its two goals when it generates an automatic bin width (the other goal is to make the bin width be a round number). The figures below show the same data with three different bin widths. The graph in the middle displays the distribution of the data. The one on the left has too little detail, while the one on the right has too much detail.
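The log-base-2 rule of thumb is easy to apply. A minimal sketch (Prism's additional rounding of the width to a "round number" is not reproduced here, so the width below is just the raw range divided by the bin count):

```python
import math

def suggest_bins(values):
    # Rule of thumb: number of bins ~ log2 of the sample size.
    n_bins = max(1, round(math.log2(len(values))))
    width = (max(values) - min(values)) / n_bins
    return n_bins, width

data = list(range(250))            # 250 hypothetical values
n_bins, width = suggest_bins(data)
print(n_bins)  # 8, since log2(250) is about 7.97
```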
Replicates

If you entered replicate values, Prism can either place each replicate into its appropriate bin, or average the replicates and place only the mean into a bin.

All values too small to fit in the first bin are omitted from the analysis. You can also enter an upper limit to omit larger values from the analysis.

How to graph

See these examples 84 .
Prism can only make frequency distributions from numerical data. It can handle categorical data, but only if the categories are entered as values.
The last two graphs look very similar, but the graph on the right is a bar graph, while the one in the middle is an XY graph plotting bars or spikes instead of symbols. The graph in the middle has X values so you can fit a Gaussian distribution 86 to it. The graph on the right has no X values (just category names, which happen to be numbers), so it is not possible to fit a curve.
If you choose to tabulate the results as fractions or percentages, then Prism also offers you (from the bottom part of the Parameters dialog for frequency distributions) the choice of plotting on a probability axis. If your data were drawn from a Gaussian distribution, they will appear linear when the cumulative distribution is plotted on a probability axis. Prism uses standard values to label the Y axis, and you cannot adjust these. This graph is very similar to a Q-Q plot.
The term histogram is used inconsistently. We use the term to mean a graph of a frequency distribution which is usually a bar graph. Some people use the term histogram to refer to any bar graph, even those that don't plot frequency distributions.
The results depend to some degree on which value you picked for bin width, so we recommend fitting the cumulative distribution as explained below.
Describing curves
The second derivative is the derivative of the derivative curve. The integral is the cumulative area under the curve. The integral at any value X equals the area under the curve for all values less than X. This analysis integrates a curve, resulting in another curve showing cumulative area. Another Prism analysis computes a single value for the area under the curve 89 . Prism uses the trapezoid rule 89 to integrate curves. The X values of the results are the same as the X values of the data you are analyzing. The first Y value of the results equals a value you specify (usually 0.0). For each subsequent row, the resulting Y value equals the previous result plus the area added to the curve by this point. This added area equals the difference between the X values times the average of the previous and this Y value.
Smoothing a curve
If you import a curve from an instrument, you may wish to smooth the data to improve the appearance of a graph. Since you lose data when you smooth a curve, you should not smooth a curve prior to nonlinear regression or other analyses. Smoothing is not a method of data analysis, but purely a way to create a more attractive graph. Prism gives you two ways to adjust the smoothness of the curve: you choose the number of neighboring points to average and the 'order' of the smoothing polynomial. Since the only goal of smoothing is to make the curve look better, you can simply try a few settings until you like the appearance of the results. If the settings are too high, you lose some peaks, which get smoothed away. If the settings are too low, the curve is not smooth enough. The right balance is subjective; use trial and error. The results table has fewer rows than the original data.
In Prism, a curve is simply a series of connected XY points, with equally spaced X values. The left part of the figure above shows two of these points and the baseline as a dotted line. The area under that portion of the curve, a trapezoid, is shaded. The middle portion of the figure shows how Prism computes the area. The two triangles in the middle panel have the same area, so the area of the trapezoid on the left is the same as the area of the rectangle on the right (whose area is easier to calculate). The area, therefore, is ΔX·(Y1 + Y2)/2. Prism uses this formula repeatedly for each adjacent pair of points defining the curve.
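The repeated trapezoid formula takes only a few lines of code. A minimal sketch of cumulative integration as described above (the function name is ours):

```python
def cumulative_trapezoid(xs, ys, start=0.0):
    # Running area under a curve defined by connected XY points.
    # Each step adds (difference in X) * (average of the two Y values).
    areas = [start]
    for i in range(1, len(xs)):
        areas.append(areas[-1] + (xs[i] - xs[i - 1]) * (ys[i] + ys[i - 1]) / 2)
    return areas

# Integrating y = x from 0 to 2: the running areas are 0, 0.5, and 2.0.
print(cumulative_trapezoid([0, 1, 2], [0, 1, 2]))  # [0.0, 0.5, 2.0]
```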
Row statistics
You can choose to compute the SD 18 , SEM 22 , or %CV for each row (which is what you'll usually want) or for the entire table.
Choosing a t test
The t test analysis compares the means (or medians) of two groups. If your goal is to compare the mean (or median) of one group with a hypothetical value, use the column statistics analysis 69 instead. Prism offers five related tests that compare two groups. To choose among these tests, answer three questions:
Nonparametric test?
Nonparametric tests 64 , unlike t tests, are not based on the assumption that the data are sampled from a Gaussian distribution 14 . But nonparametric tests have less power 65 , and report only P values but not confidence intervals. Deciding when to use a nonparametric test is not straightforward 67 .
Equal variances?
If your data are not paired and you are not choosing a nonparametric test, you must decide whether to accept the assumption that the two samples come from populations with the same standard deviations (same variances). This is a standard assumption, and you should accept it unless you have good reason not to. If you check the option for Welch's correction, the analysis will not assume equal variances (but will have less power).
Summary of tests
Test                     Paired?   Nonparametric?   Welch correction?
Unpaired t               No        No               No
Welch's t                No        No               Yes
Paired t                 Yes       No               No
Mann-Whitney             No        Yes              No
Wilcoxon matched pairs   Yes       Yes              No
I want to compare the mean survival time in two groups. But some subjects are still alive, so I don't know how long they will live. How can I do a t test on survival times? You should use special methods designed to compare survival curves 217 . Don't run a t test on survival times.
Unpaired t test
Enter the data for each group into a separate column. The two groups do not have to have the same number of values, and it's OK to leave some cells empty.
Stop. If you want to compare three or more groups, don't use t tests repeatedly. Instead, use one-way ANOVA 133 followed by multiple comparison 158 post tests.
2. Choose t tests from the list of column analyses.

3. On the t test dialog, choose the unpaired t test. Choose the Welch's correction if you don't want to assume the two sets of data are sampled from populations with equal variances, and you are willing to accept the loss of power that comes with that choice. That choice is used rarely, so don't check it unless you are quite sure.
1. Enter data
From the Welcome (or New Table and graph) dialog, choose the Grouped tab. (The t test is usually done from data tables formatted for column data, but Prism doesn't let you create column tables with subcolumns. Instead, create a Grouped table and enter data on one row). Choose an interleaved bar graph, and choose to enter the data as Mean, SD and N (or as Mean, SEM and N).
Enter the data all on one row. Because there is only one row, the data really only has one grouping variable even though entered on a table formatted for grouped data.
Stop. If you want to compare three or more groups, don't use t tests repeatedly. Instead, use one-way ANOVA 133 followed by multiple comparison 158 post tests.
2. Choose t tests from the list of Column analyses.
3. On the t test dialog, choose the unpaired t test. Check the option for Welch's correction if you don't want to assume the two sets of data are sampled from populations with equal variances, and you are willing to accept the loss of power that comes with that choice. That option is rarely used, so don't check it unless you are quite sure.
Be sure to mention in the figure, figure legend, or methods section whether the error bars represent SD or SEM (what's the difference?).

To add the asterisks representing significance level, copy from the results table and paste onto the graph. This creates a live link, so if you edit or replace the data, the number of asterisks may change (or change to 'ns'). Use the drawing tool to add the line below the asterisks, then right-click and set the arrow heads to 'half tick down'.

To make your graph simple to understand, we strongly recommend avoiding log axes, Y axes that start at any value other than zero, and discontinuous Y axes.
P value
The P value is used to ask whether the difference between the means of two groups is likely to be due to chance. It answers this question: If the two populations really had the same mean, what is the chance that random sampling would result in means as far apart (or more so) than observed in this experiment? It is traditional, but not necessary and often not useful, to use the P value to make a simple statement about whether or not the difference is statistically significant. You will interpret the results differently depending on whether the P value is small or large.
t ratio
To calculate a P value for an unpaired t test, Prism first computes a t ratio. The t ratio is the difference between sample means divided by the standard error of the difference, calculated by combining the SEMs of the two groups. If the difference is large compared to the SE of the difference, then the t ratio will be large (or a large negative number), and the P value is small. The sign of the t ratio indicates only which group had the larger mean. The P value is derived from the absolute value of t.

Prism reports the t ratio so you can compare with other programs, or examples in text books. In most cases, you'll want to focus on the confidence interval and P value, and can safely ignore the value of the t ratio.

For the unpaired t test, the number of degrees of freedom (df) equals the total sample size minus 2. Welch's t test (a modification of the t test which doesn't assume equal variances) calculates df from a complicated equation.
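The t ratio and df computations described above can be sketched in a few lines of Python. This is a minimal illustration, not Prism's actual code; the function names are ours. The Welch df comes from the Welch-Satterthwaite equation (the "complicated equation" mentioned above).

```python
import math

def unpaired_t(group1, group2):
    """Classic unpaired t ratio: difference between means divided by the
    pooled standard error of the difference; df = total sample size - 2."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = sum(group1) / n1, sum(group2) / n2
    ss1 = sum((x - m1) ** 2 for x in group1)
    ss2 = sum((x - m2) ** 2 for x in group2)
    df = n1 + n2 - 2
    pooled_var = (ss1 + ss2) / df            # combines scatter of both groups
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m1 - m2) / se_diff, df

def welch_t(group1, group2):
    """Welch's t test: does not assume equal variances; df comes from the
    Welch-Satterthwaite equation and is usually not an integer."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = sum(group1) / n1, sum(group2) / n2
    v1 = sum((x - m1) ** 2 for x in group1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in group2) / (n2 - 1)
    se_diff = math.sqrt(v1 / n1 + v2 / n2)
    df = (v1 / n1 + v2 / n2) ** 2 / (
        (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return (m1 - m2) / se_diff, df
```

A negative t ratio simply means the first group had the smaller mean; the P value would be derived from the absolute value of t.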
Prism tests the assumption of equal variances with an F test. The F ratio is the larger variance divided by the smaller variance, so F is always greater than (or possibly equal to) 1.0. The P value then asks: If the two populations really had identical variances, what is the chance of obtaining an F ratio this big or bigger?

Note: Don't mix up the P value testing for equality of the variances (standard deviations) of the groups with the P value testing for equality of the means. The latter P value is the one that answers the question you most likely were thinking about when you chose the t test.

If the P value is large (>0.05), you conclude that there is no evidence that the variances differ. If the P value is small, you conclude that the variances differ significantly. Then what? There are four answers.

Ignore the result. With equal, or nearly equal, sample sizes (and moderately large samples), the assumption of equal variances is not a crucial assumption and the t test works pretty well even with unequal standard deviations. In other words, the t test is remarkably robust to violations of that assumption so long as the sample size isn't tiny and the sample sizes aren't far apart.

Go back and rerun the t test, checking the option to do the modified Welch t test that allows for unequal variances. While this sounds sensible, Moser and Stevens (Amer. Statist. 46:19-21, 1992) have shown that it is not a good idea to first look at the F test to compare variances, and then switch to the modified (Welch) t test when the P value is less than 0.05. If you think your groups might have different variances, you should always use this modified test (and thus lose some power).

Transform your data (often to logs or reciprocals) in an attempt to equalize the variances, and then run the t test on the transformed results. Logs are especially useful, and are worth thinking about.

Conclude that the two populations are different; that the treatment had an effect. In many experimental contexts, the finding of different variances is as important as the finding of different means. If the variances are truly different, then the populations are different regardless of what the t test concludes about differences between the means. This may be the most important conclusion from the experiment, so think about what it might mean before using one of the other approaches listed above.
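The F ratio itself is easy to compute by hand. Here is a small sketch (our own helper, assuming sample variances with n-1 denominators); the P value would then come from the F distribution with the two returned df values.

```python
def f_ratio(group1, group2):
    """F ratio for comparing two variances: the larger sample variance
    divided by the smaller, so F is always >= 1.0.
    Returns (F, numerator df, denominator df)."""
    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    v1, v2 = var(group1), var(group2)
    if v1 >= v2:
        return v1 / v2, len(group1) - 1, len(group2) - 1
    return v2 / v1, len(group2) - 1, len(group1) - 1

# The second group's variance is four times the first group's, so F = 4.
f, dfn, dfd = f_ratio([1, 2, 3, 4], [2, 4, 6, 8])
```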
The graphs above plot the sample data for an unpaired t test. We prefer the graph on the left which shows each individual data point. This shows more detail, and is easier to interpret, than the bar graph on the right.
Graphing tips
The scatter plot shows a horizontal line at the mean. If you choose the nonparametric Mann-Whitney test, you'll probably want to plot the median instead (a choice in the Format Graph dialog). Prism lets you turn off the horizontal line altogether.

The horizontal line with caps is easy to draw. Draw a line using the tool in the Draw section of the toolbar. Then double click that line to bring up the Format Object dialog, where you can add the caps.

The text objects "P=" and "95% CI of Difference" were created separately from the values pasted from the results. Click the text "T" button, then click on the graph and type the text.

Don't forget to state somewhere how the error bars are calculated. We recommend plotting the mean and SD if you analyze with an unpaired t test, and the median and interquartile range if you use the nonparametric Mann-Whitney test.

If you choose a bar graph, don't use a log scale on the Y axis. The whole point of a bar graph is that viewers can compare the heights of the bars. If the scale is linear (ordinary), the relative height of the bars is the same as the ratio of values measured in the two groups. If one bar is twice the height of the other, its value is twice as high. If the axis is logarithmic, this relationship does not hold. If your data don't show well on a linear axis, either show a table with the values, or plot a graph with individual symbols for each data point (which works fine with a log axis). For the same reason, make sure the axis starts at Y=0 and has no discontinuities. The whole idea of a bar graph is to compare heights of bars, so don't do anything that destroys the relationship between bar height and value.
Even if you want to avoid correcting for multiple comparisons, you should still do the comparison as part of one-way ANOVA to take advantage of the extra degrees of freedom that brings you.

Do both columns contain data? If you want to compare a single set of experimental data with a theoretical value (perhaps 100%), don't fill a column with that theoretical value and perform an unpaired t test. Instead, use a one-sample t test.

Do you really want to compare means? The unpaired t test compares the means of two groups. It is possible to have a tiny P value (clear evidence that the population means are different) even if the two distributions overlap considerably. In some situations (for example, assessing the usefulness of a diagnostic test), you may be more interested in the overlap of the distributions than in differences between means.

If you chose a one-tail P value, did you predict correctly? If you chose a one-tail P value, you should have predicted which group would have the larger mean before collecting any data. Prism does not ask you to record this prediction, but assumes that it is correct. If your prediction was wrong, then ignore the P value reported by Prism and state that P>0.50.
Paired t test
1. Enter data
From the Welcome (or New Table and graph) dialog, choose the Column tab, and then a before-after graph. If you are not ready to enter your own data, choose sample data and choose: t test - Paired.
Enter the data for each group into a separate column, with matched values on the same row. If you leave any missing values, that row will simply be ignored. Optionally, enter row labels to identify the source of the data for each row (e.g., subject's initials).
Stop. If you want to compare three or more groups, use repeated-measures one-way ANOVA (not multiple t tests).
2. Choose t tests from the list of column analyses.
3. On the t test dialog, choose the paired t test.
A before-after graph shows all the data. This example plots each subject as an arrow to clearly show the direction from 'before' to 'after', but you may prefer to plot just lines, or lines with symbols. Avoid using a bar graph, since it can only show the mean and SD of each group, and not the individual changes.

To add the asterisks representing significance level, copy from the results table and paste onto the graph. This creates a live link, so if you edit or replace the data, the number of asterisks may change (or change to 'ns'). Use the drawing tool to add the line below the asterisks, then right-click and set the arrow heads to 'half tick down'. Read more about graphing a paired t test.
If the P value for the normality test is low, you have evidence that your pairs were not sampled from a population where the differences follow a Gaussian distribution. Read more about interpreting normality tests. If your data fail the normality test, you have two options. One option is to transform the values (perhaps to logs or reciprocals) to make the distribution of differences follow a Gaussian distribution. Another choice is to use the Wilcoxon matched pairs nonparametric test instead of the t test.
P value
The P value is used to ask whether the difference between the means of two groups is likely to be due to chance. It answers this question: If the two populations really had the same mean, what is the chance that random sampling would result in means as far apart (or more so) than observed in this experiment? It is traditional, but not necessary and often not useful, to use the P value to make a simple statement about whether or not the difference is statistically significant. You will interpret the results differently depending on whether the P value is small or large.
t ratio
The paired t test compares two paired groups. It calculates the difference between each set of pairs and analyzes that list of differences based on the assumption that the differences in the entire population follow a Gaussian distribution. First, Prism calculates the difference between each set of pairs, keeping track of sign. If the value in column B is larger, then the difference is positive. If the value in column A is larger, then the difference is negative. The t ratio for a paired t test is the mean of these differences divided by the standard error of the differences. If the t ratio is large (or is a large negative number) the P value will be small. The number of degrees of freedom equals the number of pairs minus 1. Prism calculates the P value from the t ratio and the number of degrees of freedom.
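As an illustration of the computation just described (hypothetical data; the function name is ours), the paired t ratio is simply the mean of the pairwise differences divided by the standard error of those differences:

```python
import math
import statistics

def paired_t(column_a, column_b):
    """Paired t ratio: mean of the pairwise differences (B - A, keeping
    sign) divided by the standard error of those differences.
    df = number of pairs - 1."""
    diffs = [b - a for a, b in zip(column_a, column_b)]
    n = len(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / se, n - 1

t, df = paired_t([10, 12, 9], [13, 15, 11])
```

If the value in column B is larger, the differences (and hence t) are positive; Prism would compute the P value from this t ratio and df.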
Prism quantifies the effectiveness of pairing by calculating the Pearson correlation coefficient, r. From r, Prism calculates a P value that answers this question: If the two groups really are not correlated at all, what is the chance that randomly selected subjects would have a correlation coefficient as large (or larger) as observed in your experiment? The P value is one-tailed, as you are not interested in the possibility of observing a strong negative correlation.

If the pairing was effective, r will be positive and the P value will be small. This means that the two groups are significantly correlated, so it made sense to choose a paired test. If the P value is large (say, larger than 0.05), you should question whether it made sense to use a paired test. Your choice of whether to use a paired test or not should not be based on this one P value alone, but also on the experimental design and the results you have seen in other similar experiments.

If r is negative, it means that the pairing was counterproductive. You expect the values of the pairs to move together: if one is higher, so is the other. Here, the opposite is true: if one has a higher value, the other has a lower value. Most likely this is just a matter of chance. If r is close to -1, you should review your experimental design, as this is a very unusual result.
The graph above shows the sample data for a paired t test. Note the following: Since the data are paired, the best way to show the data is via a before-after graph, as shown on the left. A bar graph showing the average value before and the average value after really doesn't properly display the results from a paired experiment.
The graph uses arrows to show the sequence from Before to After. You may prefer to show just the lines with no arrowheads; choose in the Format Graph dialog. The P value was copied and pasted from the paired t test analysis.

The paired t test first computes the difference between pairs. The graph on the right shows these differences, which were computed using the Remove Baseline analysis. The confidence interval for the difference between means shown on the right graph was copied and pasted from the paired t test results.
To perform a ratio t test with Prism, follow these steps:

1. Starting from your data table, click Analyze and choose Transform. On the Transform dialog, choose the transform Y=log(Y).
2. From the results of this transform, click Analyze and choose to do a t test. On the t test dialog, choose a paired t test. Notice that you are chaining the transform analysis with the t test analysis.
3. Interpret the P value: If there really were no differences between control and treated values, what is the chance of obtaining a ratio as far from 1.0 as was observed? If the P value is small, you have evidence that the ratio between the paired values is not 1.0.
4. Manually compute the antilog of each end of the confidence interval of the difference between the means of the logarithms. The result is the 95% confidence interval of the ratio of the two means, which is much easier to interpret. (This will make more sense after you read the example below.)
Example
You measure the Km of a kidney enzyme (in nM) before and after a treatment. Each experiment was done with renal tissue from a different animal.
Control      4.2     2.5     6.5
Treated      8.7     4.9    13.1
Difference   4.5     2.4     6.6
Ratio        0.483   0.510   0.496
The P value from a conventional paired t test is 0.07. The difference between control and treated is not consistent enough to be statistically significant. This makes sense because the paired t test looks at differences, and the differences are not very consistent. The 95% confidence interval for the difference between control and treated Km values is -0.72 to 9.72, which includes zero.

The ratios are much more consistent. It is not appropriate to analyze the ratios directly. Because ratios are inherently asymmetrical, you'll get a different answer depending on whether you analyze the ratio of treated/control or control/treated. You'll get different P values testing the null hypothesis that the ratio really equals 1.0.

Instead, we analyze the log of the ratio, which is the same as the difference between log(treated) and log(control). Using Prism, click Analyze, pick Transform, and choose Y=log(Y). Then from the results table, click Analyze again and choose t test, then paired t test. The P value is 0.0005. Looked at this way, the treatment has an effect that is highly statistically significant.

The t test analysis reports the difference between the means, which is really the difference between the means of log(control) and log(treated), or -0.3043. Take the antilog of this value (10 to that power) to obtain the ratio of the two means, which is 0.496. In other words, the control values are about half the treated values.

It is always important to look at confidence intervals as well as P values. Here the 95% confidence interval extends from -0.3341 to -0.2745. Take the antilog (10 to the power) of both numbers to get the confidence interval of the ratio, which extends from 0.463 to 0.531. We are 95% sure the control values are between 46% and 53% of the treated values. You might want to look at the ratios the other way around.
If you entered the treated values into column A and the control values into column B, the results would have been the reciprocal of what we just computed. It is easy enough to compute the reciprocal of the ratio at both ends of its confidence interval. On average, the treated values are 2.02 times larger than the control values, and the 95% confidence interval extends from 1.88 to 2.16. Analyzed with a paired t test, the results were very ambiguous. But when the data are analyzed with a ratio t test, the results are very persuasive: the treatment doubled the Km of the enzyme.
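The arithmetic of this example can be reproduced with a short script. This is a sketch with our own function name, and it assumes exactly three pairs: the two-tailed 95% critical t for df = 2 (4.3027) is hard-coded rather than looked up from a table.

```python
import math
import statistics

def ratio_t_summary(control, treated):
    """Ratio t test sketch: run a paired analysis on log10-transformed
    values, then antilog the results back to a ratio scale.
    Assumes exactly three pairs (critical t for df = 2 is hard-coded)."""
    logs = [math.log10(c) - math.log10(t) for c, t in zip(control, treated)]
    n = len(logs)
    mean_log = statistics.mean(logs)                 # mean log10(ratio)
    sem_log = statistics.stdev(logs) / math.sqrt(n)
    margin = 4.3027 * sem_log                        # two-tailed 95%, df = 2
    # Antilog (10 to the power) converts back to a ratio of means.
    return 10 ** mean_log, (10 ** (mean_log - margin),
                            10 ** (mean_log + margin))

ratio, ci = ratio_t_summary([4.2, 2.5, 6.5], [8.7, 4.9, 13.1])
# ratio ≈ 0.496: the control values are about half the treated values
```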
control value is larger. With these data, you'll get more consistent results if you perform a ratio t test.
Mann-Whitney test
1. Enter data
From the Welcome (or New Table and graph) dialog, choose the Column tab, and then choose a scatter plot with a line at the median. If you are not ready to enter your own data, choose sample data and choose: t test - unpaired.
Since the Mann-Whitney test compares the sum-of-ranks in the two groups, the test needs your raw data. It is not possible to perform a Mann-Whitney test if you entered your data as mean and SD (or SEM). Enter the data for each group into a separate column. The two groups do not have to have the same number of values, and it's OK to leave some cells empty. Since the data are unmatched, it makes no sense to enter any row titles.
Stop. If you want to compare three or more groups, use the Kruskal-Wallis test.
2. Choose t tests from the list of column analyses.
3. On the t test dialog, choose the Mann-Whitney test.
Graphing notes:

A scatter plot shows every point. If you have more than several hundred points, a scatter plot can become messy, so it makes sense to plot a box-and-whiskers graph instead. We suggest avoiding bar graphs, as they show less information than a scatter plot, yet are no easier to comprehend.

The horizontal lines mark the medians. Set this choice (medians rather than means) on the Welcome dialog, or change it on the Format Graph dialog.

To add the asterisks representing significance level, copy from the results table and paste onto the graph. This creates a live link, so if you edit or replace the data, the number of asterisks may change (or change to 'ns'). Use the drawing tool to add the line below the asterisks, then right-click and set the arrow heads to 'half tick down'.
The graph shows each value obtained from control and treated subjects. The two-tail P value from the Mann-Whitney test is 0.0288, so you conclude that there is a statistically significant difference between the groups. But the two medians, shown by the horizontal lines, are identical. The Mann-Whitney test compared the distributions of ranks, which are quite different in the two groups. It is not correct, however, to say that the Mann-Whitney test asks whether the two groups come from populations with different distributions. The two groups in the graph below clearly come from different distributions, but the P value from the Mann-Whitney test is high (0.46).
The Mann-Whitney test compares sums of ranks -- it does not compare medians and does not compare distributions. To interpret the test as a comparison of medians, you have to make an additional assumption -- that the distributions of the two populations have the same shape, even if they are shifted (have different medians). With this assumption, if the Mann-Whitney test reports a small P value, you can conclude that the medians are different.
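To make the rank-sum idea concrete, here is a small sketch (the helper names are ours). It ranks the pooled values, sums the ranks of one group, and computes an exact two-tailed P value by enumerating every possible split of the pooled values; exact enumeration is practical only for tiny samples.

```python
from itertools import combinations

def rank_sum(group, other):
    """Sum of ranks for `group` when both groups are ranked together.
    Tied values get the average of the ranks they span."""
    pooled = sorted(group + other)
    def rank(v):
        first = pooled.index(v)          # 0-based position of first copy
        count = pooled.count(v)
        return first + (count + 1) / 2   # average of ranks first+1..first+count
    return sum(rank(v) for v in group)

def mann_whitney_exact_p(group1, group2):
    """Exact two-tailed P value by enumerating every way the pooled values
    could have been split into groups of these sizes (tiny samples only)."""
    pooled = group1 + group2
    n, n1 = len(pooled), len(group1)
    expected = n1 * (n + 1) / 2          # rank sum expected under the null
    obs_dev = abs(rank_sum(group1, group2) - expected)
    splits = list(combinations(range(n), n1))
    extreme = 0
    for idx in splits:
        g1 = [pooled[i] for i in idx]
        g2 = [pooled[i] for i in range(n) if i not in idx]
        if abs(rank_sum(g1, g2) - expected) >= obs_dev - 1e-9:
            extreme += 1
    return extreme / len(splits)
```

With two completely separated groups of three values each, only 2 of the 20 possible splits are as extreme as the observed one, so the exact two-tailed P value is 0.10 -- which is why such small samples can never reach P < 0.05 with this test.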
Are the data sampled from non-Gaussian populations? By selecting a nonparametric test, you have avoided assuming that the data were sampled from Gaussian distributions, but there are drawbacks to using a nonparametric test. If the populations really are Gaussian, the nonparametric tests have less power (are less likely to give you a small P value), especially with small sample sizes. Furthermore, Prism (along with most other programs) does not calculate confidence intervals when calculating nonparametric tests. If the distribution is clearly not bell-shaped, consider transforming the values to create a Gaussian distribution and then using a t test.
Wilcoxon matched pairs test

1. Enter data

From the Welcome (or New Table and graph) dialog, choose the Column tab, and then a before-after graph. If you are not ready to enter your own data, choose sample data and choose: t test - Paired.
Enter the data for each group into a separate column, with matched values on the same row. If you leave any missing values, that row will simply be ignored. Optionally, enter row labels to identify the source of the data for each row (e.g., subject's initials).
Stop. If you want to compare three or more groups, use Friedman's test.
2. Choose t tests from the list of column analyses.
3. On the t test dialog, choose the Wilcoxon matched-pairs test.
A before-after graph shows all the data. This example plots each subject as an arrow to clearly show the direction from 'before' to 'after', but you may prefer to plot just lines, or lines with symbols. Avoid using a bar graph, since it can only show the mean and SD of each group, and not the individual changes.
To add the asterisks representing significance level, copy from the results table and paste onto the graph. This creates a live link, so if you edit or replace the data, the number of asterisks may change (or change to 'ns'). Use the drawing tool to add the line below the asterisks, then right-click and set the arrow heads to 'half tick down'.
Are you comparing exactly two groups? Use the Wilcoxon test only to compare two groups. To compare three or more matched groups, use the Friedman test followed by post tests. It is not appropriate to perform several Wilcoxon tests, comparing two groups at a time.

If you chose a one-tail P value, did you predict correctly? If you chose a one-tail P value, you should have predicted which group would have the larger median before collecting any data. Prism does not ask you to record this prediction, but assumes that it is correct. If your prediction was wrong, then ignore the P value reported by Prism and state that P>0.50.

Are the data clearly sampled from non-Gaussian populations? By selecting a nonparametric test, you have avoided assuming that the data were sampled from Gaussian distributions. But there are drawbacks to using a nonparametric test. If the populations really are Gaussian, the nonparametric tests have less power (are less likely to give you a small P value), especially with small sample sizes. Furthermore, Prism (along with most other programs) does not calculate confidence intervals when calculating nonparametric tests. If the distribution is clearly not bell-shaped, consider transforming the values (perhaps to logs or reciprocals) to create a Gaussian distribution and then using a t test.

Are the differences distributed symmetrically? The Wilcoxon test first computes the difference between the two values in each row, and analyzes only the list of differences. The Wilcoxon test does not assume that those differences are sampled from a Gaussian distribution. However, it does assume that the differences are distributed symmetrically around their median.
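The differencing-and-ranking procedure just described can be sketched as follows (hypothetical data; the function name is ours). Pairs with zero difference are dropped, which is the common convention:

```python
def wilcoxon_w(before, after):
    """Wilcoxon matched-pairs sketch: rank the absolute pairwise
    differences (tied pairs, i.e. zero differences, are dropped), then
    sum the ranks of the positive and negative differences separately."""
    diffs = [b - a for a, b in zip(before, after) if b != a]
    abs_sorted = sorted(abs(d) for d in diffs)
    def rank(v):
        first = abs_sorted.index(v)
        count = abs_sorted.count(v)
        return first + (count + 1) / 2   # average rank for ties
    w_plus = sum(rank(abs(d)) for d in diffs if d > 0)
    w_minus = sum(rank(abs(d)) for d in diffs if d < 0)
    return w_plus, w_minus

w_plus, w_minus = wilcoxon_w([10, 12, 9, 15], [13, 15, 8, 20])
```

The two rank sums always total n(n+1)/2 for n nonzero differences; a P value would then come from the exact Wilcoxon distribution of the smaller sum.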
Repeated measures?
Choose a repeated measures test when the columns of data are matched. Here are some examples:

You measure a variable in each subject several times, perhaps before, during and after an intervention.

You recruit subjects as matched groups, matched for variables such as age, ethnic group, and disease severity.

You run a laboratory experiment several times, each time with several treatments handled in parallel. Since you anticipate experiment-to-experiment variability, you want to analyze the data in such a way that each experiment is treated as a matched set.

Matching should not be based on the variable you are comparing. If you are comparing blood pressures in three groups, it is OK to match based on age or zip code, but it is not OK to match based on blood pressure.

The term repeated measures applies strictly when you give treatments repeatedly to one subject (the first example above). The other two examples are called randomized block experiments (each set of subjects is called a block, and you randomly assign treatments within each block). The analyses are identical for repeated measures and randomized block experiments, and Prism always uses the term repeated measures.
Nonparametric test?
Nonparametric tests, unlike ANOVA, are not based on the assumption that the data are sampled from a Gaussian distribution. But nonparametric tests have less power, and they report only P values, not confidence intervals. Deciding when to use a nonparametric test is not straightforward.
Test summary
Test                              Matched   Nonparametric
Ordinary one-way ANOVA            No        No
Repeated measures one-way ANOVA   Yes       No
Kruskal-Wallis test               No        Yes
Friedman test                     Yes       Yes
I only know the group means, not the SD or SEM. Can I run ANOVA? No. ANOVA compares the difference among group means with the scatter within the groups, taking into account sample size. If you only know the means, there is no possible way to do any statistical comparison.

Can I use a normality test to make the choice of when to use a nonparametric test? This is not a good idea. Choosing when to use a nonparametric test is not straightforward, and you can't really automate the process.

I want to compare three groups. The outcome has two possibilities, and I know the fraction of each possible outcome in each group. How can I compare the groups? Not with ANOVA. Enter your data into a contingency table and analyze it with a chi-square test.

What does 'one-way' mean? One-way ANOVA, also called one-factor ANOVA, determines how a response is affected by one factor. For example, you might measure a response to three different drugs. In this example, drug treatment is the factor. Since there are three drugs, the factor is said to have three levels. If you measure response to three different drugs at two time points, then you have two factors: drug and time. One-way ANOVA would not be helpful; use two-way ANOVA instead. If you measure response to three different drugs at two time points with subjects from two age ranges, then you have three factors: drug, time and age. Prism does not perform three-way ANOVA, but other programs do. If there are only two levels of one factor -- say male vs. female, or control vs. treated -- then you should use a t test. One-way ANOVA is used when there are three or more groups (although the underlying math is the same for a t test and one-way ANOVA with two groups).

What does 'repeated measures' mean? How is it different from 'randomized block'? The term repeated measures strictly applies only when you give treatments repeatedly to each subject, and the term randomized block is used when you randomly assign treatments within each group (block) of matched subjects. The analyses are identical for repeated-measures and randomized block experiments, and Prism always uses the term repeated measures.
One-way ANOVA
1. Enter data
You can enter the actual values you want to analyze (raw data) or the mean and SD (or SEM) of each group. Enter raw data From the Welcome (or New Table and graph) dialog, choose a vertical scatter plot from the Column tab. If you aren't ready to enter your own data, choose to use sample data, and choose: One-way ANOVA - Ordinary.
Enter the data for each group into a separate column. The groups do not have to be the same size (it's OK to leave some cells empty). Since the data are unmatched, it makes no sense to enter any row titles.
Enter averaged data

Prism also lets you perform one-way ANOVA with data entered as mean, SD (or SEM), and N. This can be useful if you are entering data from another program or publication. From the Welcome (or New Table and graph) dialog, choose the Grouped tab. That's right. Even though you want to do one-way ANOVA, you should pick the Grouped tab (since the Column table doesn't allow for entry of mean and SD or SEM). Then choose any graph from the top row, and choose to enter the data as Mean, SD and N or as Mean, SEM and N. Entering N is essential.
Enter the data all on one row. Because there is only one row, the data really have only one grouping variable, even though they are entered on a grouped table.
2. Choose one-way ANOVA from the list of column analyses.
3. On the ANOVA dialog, choose ordinary one-way ANOVA.
P value
The P value answers this question: If all the populations really have the same mean (the treatments are ineffective), what is the chance that random sampling would result in means as far apart (or more so) as observed in this experiment? If the overall P value is large, the data do not give you any reason to conclude that the means differ. Even if the population means were equal, you would not be surprised to find sample means this far apart just by chance. This is not the same as saying that the true means are the same. You just don't have compelling evidence that they differ. If the overall P value is small, then it is unlikely that the differences you observed are due to random sampling. You can reject the idea that all the populations have identical means. This doesn't mean that every mean differs from every other mean, only that at least one differs from the rest. Look at the results of post tests to identify where the differences are.
ANOVA partitions the variability among all the values into one component that is due to variability among group means (due to the treatment) and another component that is due to variability within the groups (also called residual variation).

Variability within groups (within the columns) is quantified as the sum of squares of the differences between each value and its group mean. This is the residual sum-of-squares. Variation among groups (due to treatment) is quantified as the sum of the squares of the differences between the group means and the grand mean (the mean of all values in all groups). Adjusted for the size of each group, this becomes the treatment sum-of-squares.

Each sum-of-squares is associated with a certain number of degrees of freedom (df, computed from the number of subjects and the number of groups), and the mean square (MS) is computed by dividing the sum-of-squares by the appropriate number of degrees of freedom.

The F ratio is the ratio of two mean square values. If the null hypothesis is true, you expect F to have a value close to 1.0 most of the time. A large F ratio means that the variation among group means is more than you'd expect to see by chance. You'll see a large F ratio both when the null hypothesis is wrong (the data are not sampled from populations with the same mean) and when random sampling happened to end up with large values in some groups and small values in others. The P value is determined from the F ratio and the two values for degrees of freedom shown in the ANOVA table.
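The partitioning described above fits in a few lines (a sketch with our own function name, not Prism's code):

```python
def one_way_anova_f(groups):
    """Partition variability: treatment sum-of-squares (group means vs the
    grand mean, weighted by group size) and residual sum-of-squares
    (values vs their own group mean); each divided by its df gives a mean
    square, and F = MS_treatment / MS_residual."""
    all_vals = [v for g in groups for v in g]
    grand_mean = sum(all_vals) / len(all_vals)
    means = [sum(g) / len(g) for g in groups]
    ss_treat = sum(len(g) * (m - grand_mean) ** 2
                   for g, m in zip(groups, means))
    ss_resid = sum((v - m) ** 2
                   for g, m in zip(groups, means) for v in g)
    df_treat = len(groups) - 1
    df_resid = len(all_vals) - len(groups)
    f = (ss_treat / df_treat) / (ss_resid / df_resid)
    return f, df_treat, df_resid

f, df_treat, df_resid = one_way_anova_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
```

The P value would then come from the F distribution with df_treat and df_resid degrees of freedom, exactly as in the ANOVA table.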
different) as they are in your experiment? If the P value is small, you must decide whether you will conclude that the standard deviations of the two populations are different. Obviously Bartlett's test is based only on the values in this one experiment. Think about data from other similar experiments before making a conclusion.

If you conclude that the populations have different variances, you have four choices:

Conclude that the populations are different. In many experimental contexts, the finding of different standard deviations is as important as the finding of different means. If the standard deviations are truly different, then the populations are different regardless of what ANOVA concludes about differences among the means. This may be the most important conclusion from the experiment.

Transform the data to equalize the standard deviations, and then rerun the ANOVA. Often you'll find that converting values to their reciprocals or logarithms will equalize the standard deviations and also make the distributions more Gaussian.

Use a modified ANOVA that does not assume that all standard deviations are equal. Prism does not provide such a test.

Ignore Bartlett's test and interpret the ANOVA results as usual. Bartlett's test is very sensitive to deviations from a Gaussian distribution, more sensitive than the ANOVA calculations are. A low P value from Bartlett's test may be due to data that are not Gaussian, rather than due to unequal variances. Since ANOVA is fairly robust to non-Gaussian data (at least when sample sizes are equal), some statisticians suggest ignoring Bartlett's test, especially when the sample sizes are equal (or nearly so).

Why not switch to the nonparametric Kruskal-Wallis test? While nonparametric tests do not assume Gaussian distributions, the Kruskal-Wallis test (and other nonparametric tests) does assume that the shape of the data distribution is the same in each group.
So if your groups have very different standard deviations and so are not appropriate for one-way ANOVA, they should not be analyzed by the Kruskal-Wallis test either.

Some suggest using Levene's median test instead of Bartlett's test. Prism doesn't do this test (yet), but it isn't hard to do with Excel (combined with Prism). To do Levene's test, first create a new table where each value is defined as the absolute value of the difference between the actual value and the median of its group. Then run a one-way ANOVA on this new table. The idea is that by taking the absolute difference between each value and its group median, you've gotten rid of the differences between group averages. (Why not subtract means rather than medians? In fact, that was Levene's idea, but others have shown that the median works better.) So if this ANOVA comes up with a small P value, it must be responding to different scatter (SD) in different groups. If the Levene P value is small, then don't believe the results of the overall one-way ANOVA. See an example on pages 325-327 of Glantz. Read more about the general topic of assumption checking after ANOVA in this article by Andy Karp.
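The recipe above (replace each value with its absolute deviation from the group median, then run an ordinary one-way ANOVA) is short enough to sketch in Python. This is our own illustration with invented data, not Prism's implementation:

```python
# Sketch of Levene's median test (Brown-Forsythe variant); illustrative data.

def median(values):
    s = sorted(values)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def anova_f(groups):
    """Ordinary one-way ANOVA F ratio."""
    allv = [x for g in groups for x in g]
    grand = sum(allv) / len(allv)
    ssb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ssw = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    return (ssb / (len(groups) - 1)) / (ssw / (len(allv) - len(groups)))

def levene_median_f(groups):
    # Replace each value with |value - group median|, then run ANOVA on
    # the transformed table. A large F suggests unequal scatter.
    transformed = [[abs(x - median(g)) for x in g] for g in groups]
    return anova_f(transformed)

groups = [[10, 11, 12], [5, 15, 25], [9, 10, 11]]  # middle group has far more scatter
f = levene_median_f(groups)  # compare to an F distribution with k-1 and N-k df
```

The resulting F ratio is judged against the F distribution exactly as in an ordinary one-way ANOVA; a small P value means the groups differ in scatter, not in central value.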
R squared
R2 is the fraction of the overall variance (of all the data, pooling all the groups) attributable to differences among the group means. It compares the variability among group means with the variability within the groups. A large value means that a large fraction of the variation is due to the treatment that defines the groups. The R2 value is calculated from the ANOVA table and equals the between group sum-of-squares divided by the total sum-of-squares. Some programs (and books)
don't bother reporting this value. Others refer to it as η2 (eta squared) rather than R2. It is a descriptive statistic that quantifies the strength of the relationship between group membership and the variable you measured.
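The computation is a one-liner. Using the sums of squares from the worked one-way ANOVA example later in this chapter (treatment SS = 15050, total SS = 15690):

```python
# R squared from a one-way ANOVA table: between-group SS over total SS.
ss_between = 15050.0   # treatment (between columns) sum of squares
ss_total = 15690.0     # total sum of squares
r_squared = ss_between / ss_total
# About 0.9592: roughly 96% of the overall variation is accounted
# for by differences among the group means.
```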
value to be too high or too low affects only that one value. Prism cannot test this assumption. You must think about the experimental design. For example, the errors are not independent if you have six values in each group, but these were obtained from two animals in each group (in triplicate). In this case, some factor may cause all triplicates from one animal to be high or low.

Do you really want to compare means? One-way ANOVA compares the means of three or more groups. It is possible to have a tiny P value (clear evidence that the population means are different) even if the distributions overlap considerably. In some situations (for example, assessing the usefulness of a diagnostic test) you may be more interested in the overlap of the distributions than in differences between means.

Is there only one factor? One-way ANOVA compares three or more groups defined by one factor. For example, you might compare a control group with a drug treatment group and a group treated with drug plus antagonist. Or you might compare a control group with five different drug treatments. Some experiments involve more than one factor. For example, you might compare three different drugs in men and women. There are two factors in that experiment: drug treatment and gender. These data need to be analyzed by two-way ANOVA, also called two-factor ANOVA.

Is the factor fixed rather than random? Prism performs Type I ANOVA, also known as fixed-effect ANOVA. This tests for differences among the means of the particular groups you have collected data from. Type II ANOVA, also known as random-effect ANOVA, assumes that you have randomly selected groups from an infinite (or at least large) number of possible groups, and that you want to reach conclusions about differences among ALL the groups, even the ones you didn't include in this experiment. Type II random-effects ANOVA is rarely used, and Prism does not perform it.
Do the different columns represent different levels of a grouping variable? One-way ANOVA asks whether the value of a single variable differs significantly among three or more groups. In Prism, you enter each group in its own column. If the different columns represent different variables, rather than different groups, then one-way ANOVA is not an appropriate analysis. For example, one-way ANOVA would not be helpful if column A was glucose concentration, column B was insulin concentration, and column C was the concentration of glycosylated hemoglobin.
1. Enter data
From the Welcome (or New Table and graph) dialog, choose the Column tab and a before-after graph. If you aren't ready to enter your own data, choose to use sample data, and choose: One-way ANOVA - Repeated-measures.
Enter the data for each group into a separate column. Enter data for each subject (or each matched set of results) on a separate row. Optionally, identify the subjects with row titles. If you leave any values blank, Prism will not be able to perform repeated measures ANOVA.
2. Choose one-way ANOVA from the list of column analyses.

3. On the ANOVA dialog, choose repeated measures one-way ANOVA.
P value
The P value answers this question: If all the populations really have the same mean (the treatments are ineffective), what is the chance that random sampling would result in means as far apart (or more so) as observed in this experiment?

If the overall P value is large, the data do not give you any reason to conclude that the means differ. Even if the true means were equal, you would not be surprised to find means this far apart just by chance. This is not the same as saying that the true means are the same. You just don't have compelling evidence that they differ.

If the overall P value is small, then it is unlikely that the differences you observed are due to random sampling. You can reject the idea that all the populations have identical means. This doesn't mean that every mean differs from every other mean, only that at least one differs from the rest. Look at the results of post tests to identify where the differences are.
compare a control group with a drug treatment group and a group treated with drug plus antagonist. Or you might compare a control group with five different drug treatments. Some experiments involve more than one factor. For example, you might compare three different drugs in men and women. There are two factors in that experiment: drug treatment and gender. Similarly, there are two factors if you wish to compare the effect of drug treatment at several time points. These data need to be analyzed by two-way ANOVA, also called two-factor ANOVA.

Is the factor fixed rather than random? Prism performs Type I ANOVA, also known as fixed-effect ANOVA. This tests for differences among the means of the particular groups you have collected data from. Type II ANOVA, also known as random-effect ANOVA, assumes that you have randomly selected groups from an infinite (or at least large) number of possible groups, and that you want to reach conclusions about differences among ALL the groups, even the ones you didn't include in this experiment. Type II random-effects ANOVA is rarely used, and Prism does not perform it.

Can you accept the assumption of circularity or sphericity? Repeated-measures ANOVA assumes that the random error truly is random. A random factor that causes a measurement in one subject to be a bit high (or low) should have no effect on the next measurement in the same subject. This assumption is called circularity or sphericity. It is closely related to another term you may encounter, compound symmetry. Repeated-measures ANOVA is quite sensitive to violations of the assumption of circularity. If the assumption is violated, the P value will be too low. One way to violate this assumption is to make the repeated measurements in too short a time interval, so that random factors that cause a particular value to be high (or low) don't wash away or dissipate before the next measurement.
To avoid violating the assumption, wait long enough between treatments so the subject is essentially the same as before the treatment. When possible, also randomize the order of treatments. You only have to worry about the assumption of circularity when you perform a repeated-measures experiment, where each row of data represents repeated measurements from a single subject. It is impossible to violate the assumption with randomized block experiments, where each row of data represents data from a matched set of subjects. Prism does not attempt to detect whether you violated this assumption, but some other programs do.
Kruskal-Wallis test
1. Enter data
From the Welcome (or New Table and graph) dialog, choose any graph from the One-way tab. We suggest plotting a scatter graph showing every point, with a line at the median. If you aren't ready to enter your own data, choose to use sample data, and choose: One-way ANOVA - Ordinary.
Enter the data for each group into a separate column. The groups do not have to be the same size (it's OK to leave some cells empty). Since the data are unmatched, it makes no sense to enter any row titles.
2. Choose one-way ANOVA from the list of column analyses.

3. On the ANOVA dialog, choose the Kruskal-Wallis test.
Tied values
The Kruskal-Wallis test was developed for data that are measured on a continuous scale, so you expect every value you measure to be unique. But occasionally two or more values are the same. When the Kruskal-Wallis calculations convert the values to ranks, these values tie for the same rank, so each is assigned the average of the two (or more) ranks for which they tie. Prism uses a standard method to correct for ties when it computes the Kruskal-Wallis statistic.

Unfortunately, there isn't a standard method to get a P value from the statistic when there are ties. If your samples are small and no two values are identical (no ties), Prism calculates an exact P value. If your samples are large or if there are ties, it approximates the P value from the chi-square distribution. The approximation is quite accurate with large samples, but the exact test is only exact when there are no ties.

With medium size samples, Prism can take a long time to calculate the exact P value. While it does the calculations, Prism displays a progress dialog and you can press Cancel to interrupt the calculations if an approximate P value is good enough for your purposes.

If you have large sample sizes and a few ties, no problem. But with small data sets or lots of ties, we're not sure how meaningful the P values are. One alternative: Divide your response into a few
categories, such as low, medium and high. Then use a chi-square test to compare the groups.
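The midrank convention described above (tied values each receive the average of the ranks they span) can be sketched in a few lines of Python. This is an illustration of the standard convention, not Prism's code:

```python
# Sketch of midrank assignment: tied values each get the average of
# the ranks they would have occupied.

def midranks(values):
    ranks = [0.0] * len(values)
    order = sorted(range(len(values)), key=lambda i: values[i])
    i = 0
    while i < len(order):
        # Find the run of values tied with values[order[i]].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + 1 + j + 1) / 2  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

print(midranks([3, 5, 5, 8]))  # the two 5s tie for ranks 2 and 3 -> each gets 2.5
```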
Analysis checklist
Before interpreting the results, review the analysis checklist.
populations really are Gaussian, the nonparametric tests have less power (are less likely to detect a true difference), especially with small sample sizes. Furthermore, Prism (along with most other programs) does not calculate confidence intervals when calculating nonparametric tests. If the distribution is clearly not bell-shaped, consider transforming the values (perhaps to logs or reciprocals) to create a Gaussian distribution and then using ANOVA.

Do you really want to compare medians? The Kruskal-Wallis test compares the medians of three or more groups. It is possible to have a tiny P value (clear evidence that the population medians are different) even if the distributions overlap considerably.

Are the shapes of the distributions identical? The Kruskal-Wallis test does not assume that the populations follow Gaussian distributions. But it does assume that the shapes of the distributions are identical. The medians may differ (that is what you are testing for), but the test assumes that the shapes of the distributions are identical. If two groups have very different distributions, consider transforming the data to make the distributions more similar.
Friedman's test
1. Enter data
From the Welcome (or New Table and graph) dialog, choose the Column tab and a before-after graph. If you aren't ready to enter your own data, choose to use sample data, and choose: One-way ANOVA - Repeated-measures.
Enter the data for each group into a separate column. Enter data for each subject (or each matched set of results) on a separate row. Optionally, identify the subjects with row titles. If you leave any values blank, Prism will not be able to perform Friedman's test.
2. Choose one-way ANOVA from the list of column analyses.

3. On the ANOVA dialog, choose Friedman's test.
Tied values
If two or more values in the same row are identical, it is impossible to calculate an exact P value, so Prism computes an approximate P value. If your samples are small and there are no ties, Prism calculates an exact P value. If your samples are large, it calculates the P value from a Gaussian approximation. The term Gaussian has to do with the distribution of sums of ranks and does not imply that your data need to follow a Gaussian distribution. With medium size samples, Prism can take a long time to calculate the exact P value. You can interrupt the calculations if an approximate P value meets your needs.
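To make concrete what Friedman's test computes (rank the values within each subject's row, then compare the rank sums across treatments), here is a minimal Python sketch with invented data and no within-row ties; it is an illustration, not Prism's implementation:

```python
# Sketch of the Friedman statistic (assumes no ties within a row);
# illustrative data: 3 subjects (rows) x 3 treatments (columns).

def friedman_statistic(rows):
    n = len(rows)      # number of subjects (matched sets)
    k = len(rows[0])   # number of treatments
    rank_sums = [0] * k
    for row in rows:
        # Rank the values within this row only: 1 = smallest.
        order = sorted(range(k), key=lambda j: row[j])
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank
    # Chi-square approximation of the Friedman statistic.
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3 * n * (k + 1)

rows = [[10, 12, 14],
        [11, 10, 15],
        [ 9, 13, 16]]
fr = friedman_statistic(rows)  # compare to a chi-square distribution with k-1 df
```

Because only within-row ranks enter the statistic, differences between subjects drop out entirely, which is exactly why matching helps.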
have P<0.05. The 5% chance does not apply to each comparison but rather to the entire family of comparisons.
so you may receive different recommendations depending on who you ask. Here are our recommendations:
Select Dunnett's test if one column represents control data and you wish to compare all other columns to that control column but not to each other.

Select the test for linear trend if the columns are arranged in a natural order (e.g., dose or time) and you want to test whether there is a trend such that values increase (or decrease) as you move from left to right across columns.

Select the Bonferroni test for selected pairs of columns when you only wish to compare certain column pairs. You must select those pairs based on experimental design and ideally should specify the pairs of interest before collecting any data. If you base your decision on the results (e.g., compare the smallest with the largest mean), then you have effectively compared all columns, and it is not appropriate to use the test for selected pairs.

If you want to compare all pairs of columns, choose the Tukey test. This is actually the Tukey-Kramer test, which includes the extension by Kramer to allow for unequal sample sizes.

We recommend that you avoid the Newman-Keuls test and the Bonferroni test when used to compare every pair of groups.
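The Bonferroni approach for a small set of pre-selected pairs is simple enough to state in code. This sketch (our own illustration, not Prism's implementation) multiplies each per-comparison P value by the number of planned comparisons, capping the result at 1.0:

```python
# Bonferroni correction for a small set of pre-selected comparisons.

def bonferroni(p_values):
    k = len(p_values)
    return [min(1.0, p * k) for p in p_values]

# Three pre-selected pairs of columns (hypothetical raw P values):
raw = [0.004, 0.020, 0.300]
adjusted = bonferroni(raw)  # approximately [0.012, 0.06, 0.9]
# Each adjusted value can be compared directly to 0.05 while keeping
# the 5% error rate for the whole family of comparisons.
```

This is why the correction gets harsher as you add comparisons: with six pairs instead of three, every raw P value is multiplied by six.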
test as the post test following a Kruskal-Wallis test, and don't give it an exact name.
Planned comparisons
What are planned comparisons?
The term planned comparison is used when you focus in on a few scientifically sensible comparisons. You don't do every possible comparison. And you don't decide which comparisons to do after looking at the data. Instead, you decide -- as part of the experimental design -- to only make a few comparisons. Some statisticians recommend not correcting for multiple comparisons when you make only a few planned comparisons. The idea is that you get some bonus power as a reward for having planned a focussed study. Prism always corrects for multiple comparisons, without regard for whether the comparisons were planned or post hoc. But you can get Prism to do the planned comparisons for you once you realize that a planned comparison is identical to a Bonferroni corrected comparison for selected pairs of means, when there is only one pair to compare.
Here are the results of one-way ANOVA and Tukey multiple comparison tests comparing every group with every other group.
One-way analysis of variance
  P value                                  P<0.0001
  P value summary                          ***
  Are means signif. different? (P < 0.05)  Yes
  Number of groups                         4
  F                                        62.69
  R squared                                0.9592

ANOVA Table                     SS      df    MS
  Treatment (between columns)   15050    3    5015
  Residual (within columns)       640    8      80
  Total                         15690   11

Tukey's Multiple Comparison Test   Mean Diff.    q        P value    95% CI of diff
  Control vs Treated                 22.67       4.389    P > 0.05   -0.7210 to 46.05
  Control vs Con. wo vehicle         -0.3333     0.06455  P > 0.05   -23.72 to 23.05
  Control vs Blank                   86.33       16.72    P < 0.001  62.95 to 109.7
  Treated vs Con. wo vehicle         -23         4.454    P > 0.05   -46.39 to 0.3877
  Treated vs Blank                   63.67       12.33    P < 0.001  40.28 to 87.05
  Con. wo vehicle vs Blank           86.67       16.78    P < 0.001  63.28 to 110.1
The overall ANOVA has a very low P value, so you can reject the null hypothesis that all data were sampled from groups with the same mean. But that really isn't very helpful. The fourth column is a negative control, so of course it has much lower values than the others. The ANOVA P value answers a question that doesn't really need to be asked. Tukey's multiple comparison tests were used to compare all pairs of means (table above). You only
care about the first comparison -- control vs. treated -- which is not statistically significant (P>0.05). These results don't really answer the question your experiment set out to ask. The Tukey multiple comparison tests apply the 5% significance level to the entire family of six comparisons. But five of those six comparisons don't address scientifically valid questions. You expect the blank values to be much lower than the others. If that weren't the case, you wouldn't have bothered with the analysis, since the experiment hadn't worked. Similarly, if the control with vehicle (first column) were much different from the control without vehicle (column 3), you wouldn't have bothered with the analysis of the rest of the data. These are control measurements, designed to make sure the experimental system is working. Including these in the ANOVA and post tests just reduces your power to detect the difference you care about.
The difference is statistically significant with P<0.05, and the 95% confidence interval for the difference between the means extends from 5.826 to 39.51. When you report the results, be sure to mention that your P values and confidence intervals are not corrected for multiple comparisons, so the P values and confidence intervals apply individually to each value you report and not to the entire family of comparisons. In this example, we planned to make only one comparison. If you planned to make more than one comparison, you would need to do them one at a time with Prism (so it doesn't do the Bonferroni correction for multiple comparisons). First do one comparison and note the results. Then change to compare a different pair of groups. When you report the results, be sure to explain that you are doing planned comparisons and so have not corrected the P values or confidence intervals for multiple comparisons.
equals 4.301. There are six data points in the two groups being compared, so four degrees of freedom. The P value is 0.0126, and the 95% confidence interval for the difference between the two means ranges from 8.04 to 37.3.
For this example, the standard error of the difference between the means of column 1 and column 2 is 7.303. Now compute the t ratio as the difference between means (22.67) divided by the standard error of that difference (7.303). So t=3.104. Since the MSerror is computed from all the data, the number of degrees of freedom is the same as the number of residual degrees of freedom in the ANOVA table, 8 in this example (total number of values minus number of groups). The corresponding P value is 0.0146. The 95% confidence interval extends from the observed difference between means by a distance equal to the SE of the difference (7.303) times the critical value from the t distribution for 95% confidence and 8 degrees of freedom (2.306). So the 95% confidence interval for the difference extends from 5.826 to 39.51.
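The arithmetic in the paragraph above can be reproduced directly. The values are taken from the worked example (MSerror = 80, three replicates per group, mean difference = 22.67, critical t of 2.306 for 95% confidence with 8 df); only the pooled standard error formula is added:

```python
import math

# Planned comparison using the pooled MSerror from the ANOVA table.
ms_error = 80.0    # residual mean square from the ANOVA table
n1 = n2 = 3        # replicates in each of the two compared groups
mean_diff = 22.67  # difference between the two group means
t_crit = 2.306     # t distribution, 95% confidence, 8 df

# SE of the difference pools scatter from ALL groups via MSerror.
se_diff = math.sqrt(ms_error * (1.0 / n1 + 1.0 / n2))  # about 7.303
t_ratio = mean_diff / se_diff                          # about 3.104
ci_low = mean_diff - t_crit * se_diff                  # about 5.83
ci_high = mean_diff + t_crit * se_diff                 # about 39.51
```

Note that using MSerror (rather than pooling only the two groups being compared) is what gives the comparison 8 degrees of freedom instead of 4.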
difficult to articulate exactly what null hypotheses the Newman-Keuls test actually tests. Confidence intervals are more informative than significance levels, but the Newman-Keuls test cannot generate confidence intervals. Although Prism still offers the Newman-Keuls test (for compatibility with prior versions), we recommend that you use the Tukey test instead. Unfortunately, the Tukey test has less power. This means that the Tukey test concludes that the difference between two groups is 'not statistically significant' in some cases where the Newman-Keuls test concludes that the difference is 'statistically significant'.
Tests Prism does not offer because many consider them obsolete
Fisher's LSD

While Fisher's Least Significant Difference (LSD) test is of historical interest as the first post test ever developed, it is no longer recommended. The other tests are better, and Prism does not offer the Fisher LSD test. Fisher's LSD test does not correct for multiple comparisons as the other post tests do. The other tests can be used even if the overall ANOVA yields a "not significant" conclusion. They set the 5% significance level for the entire family of comparisons -- so there is only a 5% chance that any one or more comparisons will be declared "significant" if the null hypothesis is true. The Fisher's LSD post test can only be used if the overall ANOVA has a P value less than 0.05. This first step sort of controls the false positive rate for the entire family of comparisons. But when doing each individual comparison, it sets the 5% significance level to apply to each individual comparison, rather than to the family of comparisons. This means it is easier to find statistical significance with the Fisher LSD test than with other post tests (it has more power), but it also means it is too easy to be misled by false positives (you'll get bogus 'significant' results in more than 5% of experiments).

Duncan's test

This test is adapted from the Newman-Keuls method. Like the Newman-Keuls method, Duncan's test does not control the familywise error rate at the specified alpha level. It has more power than the other post tests, but only because it doesn't control the error rate properly. Few statisticians, if any, recommend this test.
confidence intervals for the differences between group means. (Let us know if you would like to see this in a future version of Prism.)

False Discovery Rate

The concept of the False Discovery Rate is a major advance in statistics. But it is really only useful when you have calculated a large number of P values from independent comparisons and now have to decide which P values are small enough to follow up further. It is not used as a post test following one-way ANOVA.
If the overall ANOVA finds no significant difference among groups, are the post test results valid?
You may find it surprising, but all the post tests offered by Prism are valid even if the overall ANOVA did not find a significant difference among means. It is certainly possible that the post tests of Bonferroni, Tukey, Dunnett, or Newman-Keuls can find significant differences even when the overall ANOVA showed no significant differences among groups. These post tests are more focussed, so have power to find differences between groups even when the overall ANOVA is not significant. "An unfortunate common practice is to pursue multiple comparisons only when the null hypothesis of homogeneity is rejected." (Hsu, page 177) There are two exceptions, but these are for tests that Prism does not offer. Scheffe's test is intertwined with the overall F test. If the overall ANOVA has a P value greater than 0.05, then no post test using Scheffe's method will find a significant difference. Another exception is Fisher's Least Significant Difference (LSD) test (which Prism does not offer). In its original form (called the restricted Fisher's LSD test) the post tests are performed only if the overall ANOVA finds a
statistically significant difference among groups. But this LSD test is outmoded, and no longer recommended.
Are the results of the overall ANOVA useful at all? Or should I only look at post tests?
ANOVA tests the overall null hypothesis that all the data come from groups that have identical means. If that is your experimental question -- does the data provide convincing evidence that the means are not all identical -- then ANOVA is exactly what you want. More often, your experimental questions are more focussed and answered by multiple comparison tests (post tests). In these cases, you can safely ignore the overall ANOVA results and jump right to the post test results. Note that the multiple comparison calculations all use the mean-square result from the ANOVA table. So even if you don't care about the value of F or the P value, the post tests still require that the ANOVA table be computed.
The q or t ratio
Each Bonferroni comparison is reported with its t ratio. Each comparison with the Tukey, Dunnett, or Newman-Keuls post test is reported with a q ratio. We include it so people can check our results against text books or other programs. The value of q won't help you interpret the results. For a historical reason (but no logical reason), the q ratio reported by the Tukey (and Newman-Keuls) test and the one reported by Dunnett's test differ by a factor of the square root of 2, so they cannot be directly compared.
Significance
Statistically significant is not the same as scientifically important. Before interpreting the P value or confidence interval, you should think about the size of the difference you are looking for. How large a difference would you consider to be scientifically important? How small a difference would you consider to be scientifically trivial? Use scientific judgment and common sense to answer these questions. Statistical calculations cannot help, as the answers depend on the context of the experiment.
Compared to comparing two groups with a t test, is it always harder to find a 'significant' difference when I use a post test following ANOVA?
Post tests control for multiple comparisons. The significance level doesn't apply to each comparison, but rather to the entire family of comparisons. In general, this makes it harder to reach significance. This is really the main point of multiple comparisons, as it reduces the chance of being fooled by differences that are due entirely to random sampling.

But post tests do more than set a stricter threshold of significance. They also use the information from all of the groups, even when comparing just two. They use the information in the other groups to get a better measure of variation. Since the scatter is determined from more data, there are more degrees of freedom in the calculations, and this usually offsets some of the increased strictness mentioned above.

In some cases, the effect of increasing the df overcomes the effect of controlling for multiple comparisons. In these cases, you may find a 'significant' difference in a post test where you wouldn't find it doing a simple t test. In the example below, comparing groups 1 and 2 by unpaired t test yields a two-tail P value of 0.0122. If we set our threshold of 'significance' for this example to 0.01, the results are not 'statistically significant'. But if you compare all three groups with one-way
ANOVA, and follow with a Tukey post test, the difference between groups 1 and 2 is statistically significant at the 0.01 significance level.

Group 1: 34, 38, 29
Group 2: 43, 45, 56
Group 3: 48, 49, 47
V. Two-way ANOVA
You measured a response to three different drugs in both men and women. Is the response affected by drug? By gender? Are the two intertwined? These are the kinds of questions that two-way ANOVA answers.
Of course, the ANOVA results will be identical no matter which way you enter the data. But the choice does matter, as it influences how the graph will appear and how Prism can do multiple comparison post tests.
appearance and color. If you enter data as shown in the first approach above, men and women will appear in bars of different color, with three bars of each color representing the three time points. This is illustrated in the left panel below. The right panel below shows a graph of the second data table shown above. There is one bar color and fill for Before, another for During, and another for After. Men and Women appear as two bars of identical appearance.
Repeated measures

Choose a repeated-measures analysis when the experiment used paired or matched subjects. Prism can calculate repeated-measures two-way ANOVA with matching by either row or column, but not both. This is sometimes called a mixed model. You can only choose repeated measures when you have entered two or more side-by-side values into subcolumns for each row and dataset.

Post tests

Following two-way ANOVA, there are many possible multiple comparison tests that can help you focus in on what is really going on. However, Prism performs only the post tests biologists use most frequently. At each row, Prism will compare each column with every other column, or compare every column with a control column.

Variable names

If you enter names for the factors that define the rows and columns, the results will be easier to follow.
points, you don't expect to find any difference then. Your experiment really is designed to ask about later time points. In this situation, you expect an interaction, so finding a small P value for interaction does not help you understand your data. ANOVA pays no attention to the order of your time points (or doses). If you randomly scramble the time points or doses, two-way ANOVA would report identical results. In other words, ANOVA ignores the entire point of the experiment, when one of the factors is quantitative.
2. Enter data
Here are the sample data:
Note that one value is blank. It is fine to have some missing values, but you must have at least one value in each row for each data set. The following table cannot be analyzed by two-way ANOVA because there are no data for treated women. But it doesn't matter much that there are only two (not three) replicates for control men and treated men.
If you are entering mean, SD (or SEM), and N, you must enter N (the number of replicates), but it is OK if N is not always the same.
2. Choose Two-way ANOVA from the list of grouped analyses.
3. Select 'No matching' because your data don't involve repeated measures. Choose post tests if they will help you interpret your results, and enter the names of the factors that define columns and rows.
ANOVA table
The ANOVA table breaks down the overall variability between measurements (expressed as the sum of squares) into four components: Interactions between row and column. These are differences between rows that are not the same at each column, equivalent to variation between columns that is not the same at each row. Variability among columns. Variability among rows.
Residual or error. Variation among replicates not related to systematic differences between rows and columns. The ANOVA table shows how the sum of squares is partitioned into the four components. Most scientists will skip these results, which are not especially informative unless you have studied statistics in depth. For each component, the table shows sum-of-squares, degrees of freedom, mean square, and the F ratio. Each F ratio is the ratio of the mean-square value for that source of variation to the residual mean square (with repeated-measures ANOVA, the denominator of one F ratio is the mean square for matching rather than residual mean square). If the null hypothesis is true, the F ratio is likely to be close to 1.0. If the null hypothesis is not true, the F ratio is likely to be greater than 1.0. The F ratios are not very informative by themselves, but are used to determine P values.
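To make the partitioning concrete, here is a short Python sketch of the same arithmetic for a balanced two-way design with replication. This is an illustration of the method, not Prism's own code, and the data values are hypothetical; scipy is used only to turn F ratios into P values.

```python
# Balanced two-way ANOVA with replication: partition the total
# sum of squares and form F ratios, as described above.
# The data values below are hypothetical.
import numpy as np
from scipy import stats

# data[row][col] holds the replicates for one cell: 2 rows x 2 columns, n = 3
data = np.array([
    [[24.0, 28.0, 26.0], [35.0, 39.0, 37.0]],   # row 1 (e.g. men)
    [[30.0, 32.0, 34.0], [21.0, 25.0, 23.0]],   # row 2 (e.g. women)
])
r, c, n = data.shape
grand = data.mean()

# Sum of squares for each component of the ANOVA table
ss_row = c * n * ((data.mean(axis=(1, 2)) - grand) ** 2).sum()
ss_col = r * n * ((data.mean(axis=(0, 2)) - grand) ** 2).sum()
cell_means = data.mean(axis=2)
ss_inter = n * ((cell_means - grand) ** 2).sum() - ss_row - ss_col
ss_resid = ((data - cell_means[:, :, None]) ** 2).sum()   # within-cell scatter

# Degrees of freedom
df_row, df_col = r - 1, c - 1
df_inter, df_resid = df_row * df_col, r * c * (n - 1)

# Each F ratio is MS(effect) / MS(residual); P values come from the F distribution
for name, ss, df in [("Interaction", ss_inter, df_inter),
                     ("Row factor", ss_row, df_row),
                     ("Column factor", ss_col, df_col)]:
    F = (ss / df) / (ss_resid / df_resid)
    print(f"{name}: SS={ss:.1f} df={df} F={F:.2f} P={stats.f.sf(F, df, df_resid):.4f}")
```

The four sums of squares add up exactly to the total sum of squares, which is the partition the ANOVA table reports.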
P values
Two-way ANOVA partitions the overall variance of the outcome variable into three components, plus a residual (or error) term. Therefore it computes P values that test three null hypotheses (repeated-measures two-way ANOVA adds yet another P value).

Interaction P value
The null hypothesis is that there is no interaction between columns (data sets) and rows. More precisely, the null hypothesis states that any systematic differences between columns are the same for each row and that any systematic differences between rows are the same for each column. Often the test of interaction is the most important of the three tests. If columns represent drugs and rows represent gender, then the null hypothesis is that the differences between the drugs are consistent for men and women. The P value answers this question: If the null hypothesis is true, what is the chance of randomly sampling subjects and ending up with as much (or more) interaction as you have observed? The graph on the left below shows no interaction: the treatment has about the same effect in males and females. The graph on the right, in contrast, shows a huge interaction: the effect of the treatment is completely different in males (where treatment increases the concentration) and females (where it decreases the concentration). In this example, the treatment effect goes in the opposite direction for males and females. But the test for interaction does not test whether the effect goes in different directions. It tests whether the average treatment effect is the same for each row (each gender, for this example).
Testing for interaction requires that you enter replicate values or mean and SD (or SEM) and N. If you entered only a single value for each row/column pair, Prism assumes that there is no interaction, and continues with the other calculations. Depending on your experimental design, this assumption may or may not make sense. Tip: If the interaction is statistically significant, you won't learn much from the other two P values. Instead, focus on the multiple comparison post tests.
Column factor P value
The null hypothesis is that the mean of each column (totally ignoring the rows) is the same in the overall population, and that all differences we see between column means are due to chance. In the example graphed above, results for control and treated were entered in different columns (with males and females entered in different rows). The null hypothesis is that the treatment was ineffective, so control and treated values differ only due to chance. The P value answers this question: If the null hypothesis is true, what is the chance of randomly obtaining column means as different (or more so) than you have observed? In the example shown in the left graph above, the P value for the column factor (treatment) is 0.0002. The treatment has an effect that is statistically significant. In the example shown in the right graph above, the P value for the column factor (treatment) is very high (0.54). On average, the treatment effect is indistinguishable from random variation. But this P value is not meaningful in this example. Since the interaction P value is low, you know that the effect of the treatment is not the same at each row (each gender, for this example). In fact, for this example, the treatment has opposite effects in males and females. Accordingly, asking about the overall, average treatment effect doesn't make any sense.

Row factor P value
The null hypothesis is that the mean of each row (totally ignoring the columns) is the same in the overall population, and that all differences we see between row means are due to chance. In the example above, the rows represent gender, so the null hypothesis is that the mean response is the same for men and women. The P value answers this question: If the null hypothesis is true, what is the chance of randomly obtaining row means as different (or more so) than you have observed?
In both examples above, the P value for the row factor (gender) is very low.
Post tests
Prism can compare columns at each row using multiple comparison post tests.
The graph above shows three ways to plot the sample data for two-way ANOVA. The graphs on the left and middle interleave the data sets. This is set on the second tab of the Format Graphs dialog. In this case, the data sets are defined by the figure legend, and the groups (rows) are defined by the labels on the X axis. The graph on the right has the data sets grouped. In this graph, the labels on the X axis show the row title -- one per bar. You can use the "number format" choice in the Format Axes dialog to change this to Column titles -- one per set of bars. With this choice, there wouldn't be much point in also having the legend shown in the box, and you would need to define the side by side bars ("serum starved" vs "normal culture" for this example) in the figure legend. The graph on the left has the appearance set as a column dot plot. The other two graphs have the appearance set as bars with error bars plotted from the mean and SD. I prefer the column dot plot as it shows all the data, without taking up more space and without being harder to interpret. Don't forget to include in the figure legend whether the error bars are SD or SEM or something different.
Missing values
If some values are missing, two-way ANOVA calculations are challenging. Prism uses the method detailed in SA Glantz and BK Slinker, Primer of Applied Regression and Analysis of Variance, McGraw-Hill, 1990. This method converts the ANOVA problem to a multiple regression problem and then displays the results as ANOVA. Prism performs the multiple regression three times, each time presenting columns, rows, and interaction to the multiple regression procedure in a different order. Although it calculates each sum-of-squares three times, Prism only displays the sum-of-squares for the factor entered last into the multiple regression equation. These are called Type III sum-of-squares. Prism cannot perform repeated-measures two-way ANOVA if any values are missing. It is OK to have different numbers of subjects in each group, so long as you have complete data (at each time point or dose) for each subject.
(random variability and interaction can't be distinguished unless you measure replicates). Instead, Prism assumes that there is no interaction and only tests for row and column effects. If this assumption is not valid, then the P values for row and column effects won't be meaningful.
Some experiments involve more than two factors. For example, you might compare three different drugs in men and women at four time points. There are three factors in that experiment: drug treatment, gender, and time. These data need to be analyzed by three-way ANOVA, also called three-factor ANOVA. Prism does not perform three-way ANOVA.

Are both factors fixed rather than random?
Prism performs fixed-effects ANOVA (often called Model I ANOVA). This tests for differences among the means of the particular groups you have collected data from. Different calculations are needed if you randomly selected groups from an infinite (or at least large) number of possible groups, and want to reach conclusions about differences among ALL the groups, even the ones you didn't include in this experiment.
The table above shows example data testing the effects of three doses of drugs in control and treated animals. The decision to use repeated measures depends on the experimental design.
dose). Then you injected dose 1 and made the next measurement, then dose 2 and measured again. Then you gave the animal the experimental treatment, waited an appropriate period of time, and made the three measurements again. Finally, you repeated the experiment with another animal (Y2). So a single animal provided data from both Y1 subcolumns (23, 34, 43 and 28, 41, 56). Prism cannot perform two-way ANOVA with repeated measures in both directions, and so cannot analyze the data from this experiment.
Enter the data putting matched values in the same row, in corresponding columns. In the sample data shown below, the two circled values represent repeated measurements in one animal.
2. Choose Two-way ANOVA from the list of grouped analyses.
3. Choose the option that specifies that matched values are spread across a row.
Arrange your data so the data sets (columns) represent different levels of one factor, and different rows represent different levels of the other factor. The sample data set compares five time points in two subjects under control conditions and two after treatment.
Each subcolumn (one is marked with a red arrow above) represents repeated measurements on a single subject.
Missing values
It is OK if some treatments have more subjects than others. In this case, some subcolumns will be entirely blank. But you must enter a measurement for each row for each subject. Prism cannot handle missing measurements with repeated measures ANOVA.
2. Choose Two-way ANOVA from the list of grouped analyses.
3. Choose the option that specifies that matched values are stacked in a subcolumn.
Also, choose post tests if they will help you interpret your results, and enter the names of the factors that define columns and rows.
The appearance (for all data sets) should be 'Each replicate'. If you plot the replicates as 'Staggered', Prism will move them right or left to prevent overlap. In this example, none of the points overlap, so 'Staggered' and 'Aligned' look the same. Check the option to plot 'one line for each subcolumn'.
pooled value that assesses variability in all the groups. If you assume that variability really is the same in all groups (with any differences due to chance), this gives you more power. This makes sense, as you get to use data from all time points to assess variability, even when comparing only two times. Prism does this automatically. But with repeated measures, it is common for scatter to increase with time, so later treatments give a more variable response than earlier treatments. In this case, some argue that the post tests are misleading.
subjects (or a matched set of experiments). Repeated-measures ANOVA is quite sensitive to violations of the assumption of circularity. If the assumption is violated, the P value will be too low. You'll violate this assumption when the repeated measurements are made too close together, so that random factors that cause a particular value to be high (or low) don't wash away or dissipate before the next measurement. To avoid violating the assumption, wait long enough between treatments so the subject is essentially the same as before the treatment. Also randomize the order of treatments, when possible. Some advanced programs do special calculations to detect, and correct for, violations of the circularity or sphericity assumption. Prism does not.

Consider alternatives to repeated-measures two-way ANOVA
Two-way ANOVA may not answer the questions your experiment was designed to address. Consider alternatives.
Prism will compare the two columns at each row. For this example, Prism's built-in post tests compare the two columns at each row, thus asking:
- Do the control responses differ between men and women?
- Do the agonist-stimulated responses differ between men and women?
- Do the responses in the presence of both agonist and antagonist differ between men and women?
If these questions match your experimental aims, Prism's built-in post tests will suffice. Many biological experiments compare two responses at several time points or doses, and Prism's built-in post tests are just what you need for these experiments. But you may wish to perform different post tests. In this example, based on the experimental design above, you might want to ask these questions:
- For men, is the agonist-stimulated response different than control? (Did the agonist work?)
- For women, is the agonist-stimulated response different than control?
- For men, is the agonist response different than the response in the presence of agonist plus antagonist? (Did the antagonist work?)
- For women, is the agonist response different than the response in the presence of agonist plus antagonist?
- For men, is the response in the presence of agonist plus antagonist different than control? (Does the antagonist completely block the agonist response?)
- For women, is the response in the presence of agonist plus antagonist different than control?
One could imagine making many more comparisons, but we'll make just these six. The fewer comparisons you make, the more power you'll have to find differences, so it is important to focus on the comparisons that make the most sense. But you must choose the comparisons based on experimental design and the questions you care about. Ideally, you should pick the comparisons before you see the data. It is not appropriate to choose the comparisons you are interested in after seeing the data.
The confidence interval for the difference between two means is:

(mean1 - mean2) ± t* × sqrt(MSresidual × (1/N1 + 1/N2))

In this equation, mean1 and mean2 are the means of the two groups you are comparing, and N1 and N2 are their sample sizes. MSresidual is reported in the ANOVA results and is the same for all post tests.
The variable t* is the critical value from the Student t distribution, using the Bonferroni correction for multiple comparisons. When making a single confidence interval, t* is the value of the t ratio that corresponds to a two-tail P value of 0.05 (or whatever significance level you chose). If you are making six comparisons, t* is the t ratio that corresponds to a P value of 0.05/6, or 0.00833. Find the value using this Excel formula: =TINV(0.00833,6), which equals 3.863. The first parameter is the significance level corrected for multiple comparisons; the second is the number of degrees of freedom for the ANOVA ('residual' for regular two-way ANOVA, 'subject' for repeated measures). The value of t* will be the same for each comparison. Its value depends on the degree of confidence you desire, the number of degrees of freedom in the ANOVA, and the number of comparisons you made. To determine significance levels, calculate for each comparison:

t = |mean1 - mean2| / sqrt(MSresidual × (1/N1 + 1/N2))
The variables are the same as those used in the confidence interval calculations, but notice the key difference: here, you calculate a t ratio for each comparison, and then use it to determine the significance level (as explained in the next paragraph). When computing a confidence interval, you choose a confidence level (95% is standard) and use that to determine a fixed value from the t distribution, which we call t*. Note that the numerator is the absolute value of the difference between means, so the t ratio will always be positive. To determine the significance level, compare the t ratio computed for each comparison against the critical value, which we abbreviate t*. For example, to determine whether the comparison is significant at the 5% level (P<0.05), compare the t ratios computed for each comparison to the t* value calculated for a confidence interval of 95% (equivalent to a significance level of 5%, or a P value of 0.05), corrected for the number of comparisons and taking into account the number of degrees of freedom. As shown above, this value is 3.863. If a t ratio is greater than t*, then that comparison is significant at the 5% significance level. To determine whether a comparison is significant at the stricter 1% level, calculate the t* value corresponding to a confidence interval of 99% (P value of 0.01) with six comparisons and six degrees of freedom. First divide 0.01 by 6 (the number of comparisons), which gives 0.001667. Then use the Excel formula =TINV(0.001667,6) to find the critical t ratio of 5.398. Each comparison that has a t ratio greater than 5.398 is significant at the 1% level. Tip: All these calculations can be performed using a free QuickCalcs web calculator.
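The same calculations can be done in Python, where scipy's t.ppf stands in for Excel's TINV (TINV takes a two-tailed probability, while t.ppf takes a cumulative one, so the tail area must be halved). The MSresidual value below is an assumption used only for illustration — read the real value from your own ANOVA table.

```python
# Bonferroni post tests after two-way ANOVA, following the text above.
from math import sqrt
from scipy import stats

alpha, n_comparisons, df = 0.05, 6, 6
# Excel: =TINV(0.05/6, 6). TINV is two-tailed, so halve the tail area for ppf.
t_star = stats.t.ppf(1 - (alpha / n_comparisons) / 2, df)   # about 3.863

def post_test(mean1, mean2, n1, n2, ms_residual):
    """Return the t ratio and the Bonferroni-corrected CI for one comparison."""
    se = sqrt(ms_residual * (1 / n1 + 1 / n2))
    t_ratio = abs(mean1 - mean2) / se
    diff = mean1 - mean2
    return t_ratio, (diff - t_star * se, diff + t_star * se)

# Comparison 1 from the example (men, agonist vs. control).
# ms_residual here is hypothetical -- read it from the ANOVA results.
t_ratio, ci = post_test(176.0, 98.5, 2, 2, ms_residual=78.4)
```

A comparison is significant at the 5% level whenever its t ratio exceeds t_star.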
Example
For this example, here are the values you need to do the calculations (or enter into the web calculator).
Comparison                      Mean1   Mean2   N1   N2
1: Men. Agonist vs. control     176.0    98.5    2    2
2: Women. Agonist vs. control   206.5   100.0    2    2
3: Men. Agonist vs. Ag+Ant      176.0   116.0    2    2
4: Women. Agonist vs. Ag+Ant    206.5   121.0    2    2
5: Men. Control vs. Ag+Ant       98.5   116.0    2    2
6: Women. Control vs. Ag+Ant    100.0   121.0    2    2

Comparison                      95% CI of difference
1: Men. Agonist vs. control     +43.3 to +111.7
2: Women. Agonist vs. control   +72.3 to +140.7
3: Men. Agonist vs. Ag+Ant      +25.8 to +94.2
4: Women. Agonist vs. Ag+Ant    +51.3 to +119.7
5: Men. Control vs. Ag+Ant      -51.7 to +16.7
6: Women. Control vs. Ag+Ant    -55.2 to +13.2
The calculations account for multiple comparisons. This means that the 95% confidence level applies to all the confidence intervals: you can be 95% sure that all the intervals include the true value. The 95% probability applies to the entire family of confidence intervals, not to each individual interval. Similarly, if the null hypothesis were true (that all groups really have the same mean, and all observed differences are due to chance), there would be a 95% chance that all comparisons would be not significant, and a 5% chance that any one or more of the comparisons would be deemed statistically significant with P < 0.05. For the sample data, we conclude that the agonist increases the response in both men and women. The combination of antagonist plus agonist decreases the response down to a level that is indistinguishable from the control response.
In some cases, the lower limit calculated using that equation is less than zero. If so, set the lower limit to 0.0. Similarly, if the calculated upper limit is greater than 1.0, set the upper limit to 1.0. This method works very well. For any values of S and N, there is close to a 95% chance that the interval contains the true proportion. With some values of S and N, the degree of confidence can be a bit less than 95%, but it is never less than 92%. Where did the numbers 2 and 4 in the equation come from? Those values are actually z²/2 and z², where z is a critical value from the Gaussian distribution. Since 95% of all values of a normal distribution lie within 1.96 standard deviations of the mean, z = 1.96 (which we round to 2.0) for 95% confidence intervals. Note that the confidence interval is centered on p', which is not the same as p, the proportion of experiments that were successful. If p is less than 0.5, p' is higher than p. If p is greater than 0.5, p' is less than p. This makes sense, as the confidence interval can never extend below zero or above one. So the center of the interval is between p and 0.5. One of the GraphPad QuickCalcs free web calculators computes the confidence interval of a proportion using both methods.
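The adjusted (modified) Wald method is easy to code. This sketch uses the rounded z = 2 version discussed above, so p' = (S + 2)/(N + 4) and the margin of error is 2·sqrt(p'(1 - p')/(N + 4)), with the limits clipped to the 0-1 range:

```python
# Modified Wald 95% CI for a proportion, with the limits clipped to [0, 1].
from math import sqrt

def adjusted_wald_95ci(successes, n):
    p_prime = (successes + 2) / (n + 4)           # interval is centered on p'
    margin = 2 * sqrt(p_prime * (1 - p_prime) / (n + 4))
    return max(0.0, p_prime - margin), min(1.0, p_prime + margin)

lo, hi = adjusted_wald_95ci(0, 10)   # 0 successes in 10 trials
```

For 0 successes in 10 trials the raw lower limit is negative, so it is clipped to 0.0, and the upper limit comes out near 33%.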
References
1. Clopper CJ, Pearson ES. The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika 26:404-413, 1934.
2. Newcombe RG. Two-sided confidence intervals for the single proportion: comparison of seven methods. Statistics in Medicine 17:857-872, 1998.
3. Agresti A, Coull BA. Approximate is better than "exact" for interval estimation of binomial proportions. The American Statistician 52:119-126, 1998.
A shortcut equation for a confidence interval when the numerator equals zero
JA Hanley and A Lippman-Hand (1) devised a simple shortcut equation for estimating the 95% confidence interval of a proportion when the numerator is zero. If you observe zero events in N trials, you can be 95% sure that the true rate is less than 3/N. To compute the usual 95% confidence interval (which really gives you 97.5% confidence), estimate the upper limit as 3.5/N. This equation is so simple, you can do it by hand in a few seconds. Here is an example. You observe 0 dead cells in 10 cells you examined. What is the 95% confidence interval for the true proportion of dead cells? The exact 95% CI is 0.00% to 30.83%. The adjusted Wald equation gives a 95% confidence interval of 0.00% to 32.61%. The shortcut equation computes an upper confidence limit of 3.5/10, or 35%. With such small N, the shortcut equation overestimates the confidence limit, but it is useful as an estimate you can calculate instantly. Another example: you have observed no adverse drug reactions in the first 250 patients treated with a new antibiotic. What is the confidence interval for the true rate of drug reactions? The exact confidence interval extends from 0% to 1.46% (95% CI). The shortcut equation computes the upper limit as 3.5/250, or 1.40%. With large N, the shortcut equation is quite accurate.
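The shortcut is trivial to code, and it can be checked against an exact Clopper-Pearson upper limit, which for zero successes reduces to a quantile of a beta distribution (a standard identity, not something taken from Prism):

```python
# 'Rule of three' shortcut vs. the exact upper confidence limit
# when 0 events are observed in n trials.
from scipy import stats

def shortcut_upper(n):
    """Hanley & Lippman-Hand shortcut for the usual 95% CI."""
    return 3.5 / n

def exact_upper(n, conf=0.95):
    """Clopper-Pearson upper limit for 0 successes: 1 - (alpha/2)**(1/n)."""
    alpha = 1 - conf
    return stats.beta.ppf(1 - alpha / 2, 1, n)

print(shortcut_upper(10), exact_upper(10))     # 0.35 vs. about 0.308
print(shortcut_upper(250), exact_upper(250))   # 0.014 vs. about 0.0146
```

As the text notes, the shortcut overshoots for small N and is very close for large N.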
Reference
1. Hanley JA, Lippman-Hand A: "If nothing goes wrong is everything all right? Interpreting zero numerators". Journal of the American Medical Association 249(13): 1743-1745, 1983.
Contingency tables
test results in the other. For data from prospective and experimental studies, the top row usually represents exposure to a risk factor or treatment, and the bottom row is for controls. The left column usually tabulates the number of individuals with disease; the right column is for those without the disease. In case-control retrospective studies, the left column is for cases; the right column is for controls. The top row tabulates the number of individuals exposed to the risk factor; the bottom row is for those not exposed.
Logistic regression
Contingency tables analyze data where the outcome is categorical and where there is one independent (grouping) variable that is also categorical. If your experimental design is more complicated, you need to use logistic regression, which Prism does not offer. Logistic regression also handles a categorical outcome, but can accommodate multiple independent variables, which can be categorical or numerical. To continue the example above, imagine you want to compare the incidence of leukemia in people who were, or were not, exposed to EMF, but want to account for gender, age, and family history of leukemia. You can't use a contingency table for this kind of analysis, but would use logistic regression.
2. Enter data
Most contingency tables have two rows (two groups) and two columns (two possible outcomes), but Prism lets you enter tables with any number of rows and columns. You must enter data in the form of a contingency table. Prism cannot cross-tabulate raw data to create a contingency table. For calculation of P values, the order of rows and columns does not matter. But it does matter for calculations of relative risk, odds ratio, etc. Use the sample data to see how the data should be organized. Be sure to enter data as a contingency table. The categories defining the rows and columns must be mutually exclusive, with each subject (or experimental unit) contributing to one cell only. In each cell, enter the number of subjects actually observed. Don't enter averages, percentages or rates.
If your experimental design matched patients and controls, you should not analyze your data with contingency tables. Instead you should use McNemar's test. This test is not offered by Prism, but it is calculated by the free QuickCalcs web calculators available on www.graphpad.com.
3. Analyze
From the data table, click Analyze on the toolbar, and choose Chi-square (and Fisher's exact) test.
If your table has exactly two rows and two columns, Prism will offer you several choices:
We suggest you always choose Fisher's exact test with a two-tailed P value.
Your choice of additional calculations will depend on experimental design. Calculate an odds ratio from retrospective case-control data, sensitivity (etc.) from a study of a diagnostic test, and relative risk and difference between proportions from prospective and experimental studies.

If your table has more than two rows or two columns
If your table has two columns and three or more rows, choose the chi-square test or the chi-square test for trend. The test for trend asks whether there is a linear trend between row number and the fraction of subjects in the left column. It only makes sense when the rows are arranged in a natural order (such as by age, dose, or time) and are equally spaced. With contingency tables with more than two rows or columns, Prism always calculates the chi-square test. You have no choice. Extensions to Fisher's exact test have been developed for larger tables, but Prism doesn't offer them.
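For readers who analyze tables outside Prism, scipy offers both tests. The counts below are hypothetical; with a table this small, the approximate and exact P values differ noticeably:

```python
# Chi-square (with Yates' correction) vs. Fisher's exact test on a
# small 2x2 contingency table of hypothetical counts.
from scipy import stats

table = [[8, 2],    # e.g. exposed:   disease / no disease
         [1, 9]]    # e.g. unexposed: disease / no disease

chi2, p_chi2, dof, expected = stats.chi2_contingency(table, correction=True)
odds_ratio, p_fisher = stats.fisher_exact(table, alternative="two-sided")
```

Both tests reject the null hypothesis here, but with counts this small the exact P value is the one to report.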
Conventional advice
In the days before computers were readily available, people analyzed contingency tables by hand, or with a calculator, using chi-square tests. But the chi-square test is only an approximation. The Yates' continuity correction is designed to make the chi-square approximation better, but it overcorrects and so gives a P value that is too large (too 'conservative'). With large sample sizes, Yates' correction makes little difference, and the chi-square test works very well. With small sample sizes, chi-square is not accurate, with or without Yates' correction. Fisher's exact test, as its name implies, always gives an exact P value and works fine with small sample sizes. Fisher's test (unlike chi-square) is very hard to calculate by hand, but is easy to compute with a computer. Most statistical books advise using it instead of the chi-square test.

Fisher's test: an exactly correct answer to the wrong question?
As its name implies, Fisher's exact test gives an exactly correct answer no matter what sample size you use. But some statisticians conclude that Fisher's test gives the exact answer to the wrong question, so its result is also an approximation to the answer you really want. The problem is that Fisher's test is based on the assumption that the row and column totals are fixed by the experiment. In fact, the row totals (but not the column totals) are fixed by the design of a prospective study or an experiment, the column totals (but not the row totals) are fixed by the design of a retrospective case-control study, and only the overall N (but neither the row nor the column totals) is fixed in a cross-sectional experiment. Since the constraints of your study design don't match the constraints of Fisher's test, you could question whether the exact P value produced by Fisher's test actually answers the question you had in mind. An alternative to Fisher's test is the Barnard test. Fisher's test is said to be 'conditional' on the row and column totals, while Barnard's test is not. Mehta and Senchaudhuri explain the difference and why Barnard's test has more power. Berger modified this test to one that is easier to calculate yet more powerful. He also provides an online web calculator so you can try it yourself. It is worth noting that Fisher convinced Barnard to repudiate his test! At this time, we do not plan to implement Barnard's or Berger's test in Prism. There certainly does not seem to be any consensus among statisticians that these tests are preferred. But let us know if you disagree.
In this example, disease progressed in 28% of the placebo-treated patients and in 16% of the AZT-treated subjects. The difference between proportions (P1 - P2) is 28% - 16% = 12%. The relative risk is 16%/28% = 0.57: a subject treated with AZT has 57% of the risk of disease progression of a subject treated with placebo. The word 'risk' is not always appropriate; think of the relative risk as simply the ratio of proportions.
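The same arithmetic in a few lines of Python (the raw counts are hypothetical, chosen to reproduce the 28% and 16% progression rates in the example):

```python
# Difference between proportions and relative risk from a prospective study.
def risk_summary(events_exposed, n_exposed, events_control, n_control):
    p_exposed = events_exposed / n_exposed
    p_control = events_control / n_control
    difference = p_control - p_exposed       # P1 - P2
    relative_risk = p_exposed / p_control    # simply the ratio of proportions
    return difference, relative_risk

# 16 of 100 AZT-treated vs. 28 of 100 placebo-treated (hypothetical counts)
diff, rr = risk_summary(16, 100, 28, 100)
print(round(diff, 2), round(rr, 2))   # 0.12 and 0.57
```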
Odds ratio
If your data represent the results of a case-control retrospective study, choose to report the results as an odds ratio. If the disease or condition you are studying is rare, you can interpret the odds ratio as an approximation of the relative risk. With case-control data, direct calculations of the relative risk or the difference between proportions should not be performed, as the results are not meaningful.
If any cell has a zero, Prism adds 0.5 to all cells before calculating the relative risk, odds ratio, or P1-P2 (to prevent division by zero).
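A sketch of that calculation, including the 0.5 correction (the example counts are hypothetical):

```python
# Odds ratio from a 2x2 table [[a, b], [c, d]], adding 0.5 to every
# cell when any cell is zero, to avoid division by zero.
def odds_ratio(a, b, c, d):
    if 0 in (a, b, c, d):
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
    return (a * d) / (b * c)
```

For example, odds_ratio(10, 5, 2, 8) returns 8.0, while odds_ratio(0, 5, 2, 8) works from the corrected counts (0.5, 5.5, 2.5, 8.5) instead of dividing by zero.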
Positive predictive value: The fraction of people with positive tests who actually have the condition.
Negative predictive value: The fraction of people with negative tests who actually don't have the condition.
Likelihood ratio: If you have a positive test, how many times more likely are you to have the disease? If the likelihood ratio equals 6.0, then someone with a positive test is six times more likely to have the disease than someone with a negative test. The likelihood ratio equals sensitivity/(1.0 - specificity).
The sensitivity, specificity and likelihood ratios are properties of the test. The positive and negative predictive values are properties of both the test and the population you test. If you use a test in two populations with different disease prevalence, the predictive values will be different. A test that is very useful in a clinical setting (high predictive values) may be almost worthless as a screening test. In a screening test, the prevalence of the disease is much lower so the predictive value of a positive test will also be lower.
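This point is easy to demonstrate numerically. In the sketch below (all counts hypothetical), the same test — 90% sensitive and 90% specific — is applied in a clinic where half the patients have the disease and in a screening population where 1% do, and the predictive value of a positive test collapses in the second setting:

```python
# Sensitivity, specificity, predictive values and likelihood ratio
# from a 2x2 table of diagnostic-test results.
def diagnostic_summary(tp, fp, fn, tn):
    sensitivity = tp / (tp + fn)              # property of the test
    specificity = tn / (tn + fp)              # property of the test
    ppv = tp / (tp + fp)                      # depends on prevalence
    npv = tn / (tn + fn)                      # depends on prevalence
    lr_positive = sensitivity / (1 - specificity)
    return sensitivity, specificity, ppv, npv, lr_positive

# Same test, two populations (hypothetical counts):
clinic = diagnostic_summary(tp=90, fp=10, fn=10, tn=90)       # 50% prevalence
screening = diagnostic_summary(tp=9, fp=99, fn=1, tn=891)     # 1% prevalence
print(round(clinic[2], 2), round(screening[2], 2))   # PPV: 0.9 vs. 0.08
```

Sensitivity, specificity and the likelihood ratio are identical in the two settings; only the predictive values change.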
If you want to learn more, SISA provides a detailed discussion with references. Also see the section on Fisher's test in Categorical Data Analysis by Alan Agresti. It is a very confusing topic, which explains why different statisticians (and thus different software companies) use different methods.
The chi-square test for trend is only meaningful when the categories have a natural order, such as age, duration, or time. Otherwise, ignore the results of the chi-square test for trend and only consider the results of the regular chi-square test.
Censored data
Creating a survival curve is not quite as easy as it sounds. The difficulty is that you rarely know the survival time for each subject. Some subjects may still be alive at the end of the study. You know how long they have survived so far, but don't know how long they will survive in the future. Others drop out of the study -- perhaps they moved to a different city or wanted to take a medication disallowed on the protocol. You know they survived a certain length of time on the protocol, but don't know how long they survived after that (or do know, but can't use the information because they weren't following the experimental protocol). In both cases, information about these patients is said to be censored. You definitely don't want to eliminate these censored observations from your analyses -- you just need to account for them properly.

The term censored seems to imply that the subject did something inappropriate. But that isn't the case. The term simply means that you don't know, or can't use, survival beyond a certain point. Prism automatically accounts for censored data when it creates and compares survival curves.
Enter 1 into the Y column for rows where the subject died (or the event occurred) at the time shown in the X column. Enter 0 into the rows where the subject was censored at that time. Every subject in a survival study either dies or is censored.

Enter subjects for each treatment group into a different Y column. Place the X values for the subjects of the first group at the top of the table, with the Y codes in the first Y column. Place the X values for the second group of subjects beneath those for the first group (X values do not have to be sorted, and the X column may well contain the same value more than once). Place the corresponding Y codes in the second Y column, leaving the first column blank. In the example below, data for group A were entered in the first 14 rows, and data for group B started in row 15.
If the treatment groups are intrinsically ordered (perhaps increasing dose), maintain that order when entering data. Make sure that the progression from column A to column B to column C follows the natural order of the treatment groups. If the treatment groups don't have a natural order, it doesn't matter how you arrange them. Entering data for survival studies can be tricky. See answers to common questions, an example of a clinical study, and an example of an animal study.
Survival analysis works differently than other analyses in Prism. When you choose a survival table, Prism automatically analyzes your data. You don't need to click the Analyze button.
How do I enter data for subjects still alive at the end of the study?
Those subjects are said to be censored. You know how long they survived so far, but don't know what will happen later. X is the number of days (or months) they were followed. Y is the code for censored observations, usually zero.

What if two or more subjects died at the same time?
Each subject must be entered on a separate row. Enter the same X value on two (or more) rows.

How do I enter data for a subject who died of an unrelated cause?
Different investigators handle this differently. Some treat a death as a death, no matter what the cause. Others treat death from an unrelated cause as a censored observation. Ideally, this decision should be made in the study design. If the study design is ambiguous, you should decide how to handle these data before unblinding the study.

Do the X values have to be entered in order?
No. You can enter the data in any order you want.

How does Prism distinguish between subjects who are alive at the end of the study and those who dropped out of the study?
It doesn't. In either case, the observation is censored. You know the patient was alive and on the protocol for a certain period of time. After that you either can't know (patient still alive) or can't use (patient stopped following the protocol) the information. Survival analysis calculations treat all censored subjects in the same way. Until the time of censoring, censored subjects contribute toward the calculation of percent survival. After the time of censoring, they are essentially missing data.

I already have a life table showing percent survival at various times. Can I enter this table into Prism?
No. Prism can only analyze survival data if you enter a survival time for each subject. Prism cannot analyze data entered as a life table.

Can I enter a starting and ending date, rather than a duration?
No. You must enter the number of days (or months, or some other unit of time). Use a spreadsheet to subtract dates to calculate duration.

How do I handle data for subjects that were enrolled but never treated?
Most clinical studies follow the intention-to-treat rule. You analyze the data assuming each subject got the treatment they were assigned to receive.
If the subject died right after enrollment, should I enter the patient with X=0?
No. The time must exceed zero for all subjects.
Prism does not allow you to enter beginning and ending dates. You must enter elapsed time. You
can calculate the elapsed time in Excel (by simply subtracting one date from the other; Excel automatically presents the result as a number of days).

Unlike many programs, you don't enter a code for the treatment (control vs. treated, in this example) into a column in Prism. Instead, you use separate columns for each treatment, and enter codes for death or censoring into that column.

There are three different reasons for the censored observations in this study. Three of the censored observations are subjects still alive at the end of the study; we don't know how long they will live. Subject 7 moved away from the area and thus left the study protocol. Even if we knew how much longer that subject lived, we couldn't use the information, since he was no longer following the study protocol. We know that subject 7 lived 733 days on the protocol, and either don't know, or know but can't use, what happened after that. Subject 10 died in a car crash. Different investigators handle this differently. Some define a death to be a death, no matter what the cause. Others would define a death from a clearly unrelated cause (such as a car crash) to be a censored observation. We know the subject lived 703 days on the treatment. We don't know how much longer he would have lived on the treatment, since his life was cut short by a car accident.

Note that the order of the rows is entirely irrelevant to survival analysis. These data are entered in order of enrollment date, but you can enter them in any order you want.
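The same date subtraction Excel performs can be sketched in Python's standard library; the dates below are hypothetical, chosen only to illustrate the calculation:

```python
from datetime import date

# Hypothetical enrollment and event dates for one subject
enrolled = date(2005, 3, 14)
event = date(2007, 2, 15)

# Subtracting two dates yields a timedelta; .days is what you
# would enter as the X value in Prism
elapsed_days = (event - enrolled).days
```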
Note that the five control animals are each entered on a separate row, with the time entered as 28 (the number of days you observed the animals) and with Y entered as 0 to denote a censored observation. The observations on these animals are said to be censored because we only know that they lived for at least 28 days. We don't know how much longer they would have lived, because the study ended. The five treated animals also are entered one per row, with Y=1 when they died and Y=0 for the two animals still alive at the end of the study.
Input
The default choices are to use the code '1' for deaths and '0' for censored subjects, and these are almost universal. But you can choose different codes by entering values here. The codes must be integers.
Output
The choices of how to tabulate the results (percentages or fractions, deaths or survival) can also be made on the Format Graph dialog.

If you choose to plot 95% confidence intervals, Prism 5 gives you two choices. The default is a transformation method, which plots asymmetrical confidence intervals. The alternative is symmetrical Greenwood intervals. The asymmetrical intervals are more valid, and we recommend choosing them. The only reason to choose symmetrical intervals is to be consistent with results computed by prior versions of Prism.

Note that the 'symmetrical' intervals won't always plot symmetrically. The intervals are computed by adding and subtracting a calculated value from the percent survival. At this point the intervals are always symmetrical, but may go below 0 or above 100. In these cases,
Prism trims the intervals so that they cannot go below 0 or above 100, resulting in an interval that appears asymmetrical.

Prism always compares survival curves by performing both the log-rank (Mantel-Cox) test and the Gehan-Breslow-Wilcoxon test. P values from both tests are reported. You can't choose which test is computed (Prism always does both), but you should choose which test you want to report.
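The symmetrical Greenwood intervals and the trimming described above can be sketched with a minimal Kaplan-Meier calculation. This is an illustration under stated assumptions, not Prism's implementation; it assumes at least one subject survives each death time:

```python
import math

def kaplan_meier(subjects):
    """subjects: list of (time, event) pairs; event=1 for death, 0 for censored."""
    death_times = sorted({t for t, e in subjects if e == 1})
    surv = 1.0
    gsum = 0.0                                     # Greenwood's running sum
    curve = []
    for t in death_times:
        at_risk = sum(1 for ti, _ in subjects if ti >= t)
        deaths = sum(1 for ti, e in subjects if ti == t and e == 1)
        surv *= 1 - deaths / at_risk               # Kaplan-Meier product limit
        gsum += deaths / (at_risk * (at_risk - deaths))
        se = surv * math.sqrt(gsum)                # Greenwood standard error
        lo = max(0.0, (surv - 1.96 * se) * 100)    # symmetrical interval,
        hi = min(100.0, (surv + 1.96 * se) * 100)  # trimmed to [0, 100]
        curve.append((t, surv * 100, lo, hi))
    return curve

# Five subjects: deaths at times 1, 3, 4; censored at 2 and 5
curve = kaplan_meier([(1, 1), (2, 0), (3, 1), (4, 1), (5, 0)])
```

The `max`/`min` clamps are the trimming step the text describes: the raw interval is symmetrical around percent survival, but once it is trimmed at 0 or 100 it appears asymmetrical on the graph.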
If your prediction was wrong, the one-tail P value equals 1.0 minus half the two-tail P value. This value will be greater than 0.50, and you must conclude that the survival difference is not statistically significant.
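The one-tail rule stated above reduces to a tiny helper: halve the two-tail P value when the observed difference went in the predicted direction, and otherwise report 1.0 minus half the two-tail value.

```python
def one_tail_p(two_tail_p, prediction_correct):
    # prediction_correct: True if the observed difference went in the
    # direction you predicted before collecting data
    return two_tail_p / 2 if prediction_correct else 1.0 - two_tail_p / 2
```

For a two-tail P of 0.04, the one-tail P is 0.02 if the prediction was right, and 0.98 (clearly not significant) if it was wrong.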
Hazard ratio
If you compare two survival curves, Prism reports the hazard ratio and its 95% confidence interval. Hazard is defined as the slope of the survival curve, a measure of how rapidly subjects are dying. The hazard ratio compares two treatments. If the hazard ratio is 2.0, then the rate of deaths in one treatment group is twice the rate in the other group.

The computation of the hazard ratio assumes that the ratio is consistent over time, and that any differences are due to random sampling. So Prism reports a single hazard ratio, not a different hazard ratio for each time interval. Prism 5 computes the hazard ratio and its confidence interval using the Mantel-Haenszel method. Prism 4 computed the hazard ratio itself (but not the confidence interval) by the log-rank method. The two are usually quite similar. If the hazard ratio is not consistent over time, the value that Prism reports for the hazard ratio will not be useful. If two survival curves cross, the hazards are certainly not consistent.
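To make the idea concrete, here is a sketch of the simpler log-rank estimate of the hazard ratio, HR = (O1/E1)/(O2/E2), where O is the observed and E the expected number of deaths in each group. (The text attributes this method to Prism 4; Prism 5's Mantel-Haenszel computation differs slightly. This is an illustration, not Prism's code.)

```python
def logrank_hazard_ratio(group1, group2):
    """Each group: list of (time, event) pairs; event=1 for death, 0 for censored."""
    death_times = sorted({t for t, e in group1 + group2 if e == 1})
    expected1 = 0.0
    for t in death_times:
        n1 = sum(1 for ti, _ in group1 if ti >= t)   # at risk in group 1
        n2 = sum(1 for ti, _ in group2 if ti >= t)   # at risk in group 2
        d = sum(1 for ti, e in group1 + group2 if ti == t and e == 1)
        expected1 += d * n1 / (n1 + n2)              # deaths expected in group 1
    observed1 = sum(e for _, e in group1)
    observed2 = sum(e for _, e in group2)
    expected2 = (observed1 + observed2) - expected1
    return (observed1 / expected1) / (observed2 / expected2)

hr = logrank_hazard_ratio([(1, 1), (2, 1)], [(3, 1), (4, 0)])
```

Note that this single number is only meaningful under the proportional-hazards assumption discussed above.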
1. Define the significance level that you want to apply to the entire family of comparisons. This is conventionally set to 0.05.
2. Count the number of comparisons you are making, and call this value K. See the next section, which discusses some ambiguities.
3. Compute the Bonferroni-corrected threshold that you will use for each individual comparison. This equals the family-wise significance level (defined in step 1 above, usually 0.05) divided by K.
4. If a P value is less than this Bonferroni-corrected threshold, then the comparison can be said to be 'statistically significant'.
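The four steps above can be sketched as a short function, assuming the conventional family-wise level of 0.05:

```python
def bonferroni_significant(p_values, family_alpha=0.05):
    k = len(p_values)                  # step 2: the number of comparisons, K
    threshold = family_alpha / k       # step 3: corrected per-comparison threshold
    return [p < threshold for p in p_values]  # step 4: flag significant comparisons

# Three comparisons, so the corrected threshold is 0.05/3, about 0.0167
flags = bonferroni_significant([0.001, 0.02, 0.04])
```

With these example P values, only the first comparison (P = 0.001) falls below the corrected threshold, even though all three would be below 0.05 on their own.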
would be misleading, for example, if many patients quit the study because they were too sick to come to clinic, or because they stopped taking medication because they felt well.

Does average survival stay constant during the course of the study?
Many survival studies enroll subjects over a period of several years. The analysis is only meaningful if you can assume that the average survival of the first few patients is not different from the average survival of the last few subjects. If the nature of the disease or the treatment changes during the study, the results will be difficult to interpret.

Is the assumption of proportional hazards reasonable?
The log-rank test is only strictly valid when the survival curves have proportional hazards. This means that the rate of dying in one group is a constant fraction of the rate of dying in the other group. This assumption has proven to be reasonable for many situations. It would not be reasonable, for example, if you are comparing a medical therapy with a risky surgical therapy. At early times, the death rate might be much higher in the surgical group. At later times, the death rate might be greater in the medical group. Since the hazard ratio is not consistent over time (the assumption of proportional hazards is not reasonable), these data should not be analyzed with a log-rank test.

Were the treatment groups defined before data collection began?
It is not valid to divide a single group of patients (all treated the same) into two groups based on whether or not they responded to treatment (tumor got smaller, lab tests got better). By definition, the responders must have lived long enough to see the response. And they may have lived longer anyway, regardless of treatment. When you compare groups, the groups must be defined before data collection begins.
As shown above, survival curves are usually plotted as staircases. Each death is shown as a drop in survival. In the left panel, the data are plotted with tick symbols. The ticks at the times of death are lost within the vertical parts of the staircase; you see the ticks clearly only at the times when a subject's data were censored. The example has two censored subjects in the treated group between 100 and 150 days. The graph on the right plots the data as circles, so every subject is visible.
Showing error bars or error envelopes makes survival graphs more informative, but also more cluttered. The graph on the left above shows staircase error envelopes that enclose the 95% confidence interval for survival. This shows the actual survival data very well, as a staircase, but it is cluttered. The graph on the right shows error bars that represent the standard error of the percent survival. To prevent the error bars from being superimposed on the staircase curve, the points are connected by regular lines rather than by staircases.
ROC Curves
In the ROC dialog, designate which columns have the control and patient results, and choose to see the results (sensitivity and 1-specificity) expressed as fractions or percentages. Don't forget to check the option to create a new graph. Note that Prism doesn't ask whether an increased or decreased test value is abnormal. Instead, you tell Prism which column of data is for controls and which is for patients, and it figures out automatically whether the patients tend to have higher or lower test results.
Area
The area under a ROC curve quantifies the overall ability of the test to discriminate between those individuals with the disease and those without the disease. A truly useless test (one no better at identifying true positives than flipping a coin) has an area of 0.5. A perfect test (one that has zero
false positives and zero false negatives) has an area of 1.00. Your test will have an area between those two values. Even if you choose to plot the results as percentages, Prism reports the area as a fraction.

While it is clear that the area under the curve is related to the overall ability of a test to correctly identify normal versus abnormal, it is not so obvious how to interpret the area itself. There is, however, a very intuitive interpretation. If patients have higher test values than controls, the area represents the probability that a randomly selected patient will have a higher test result than a randomly selected control. If patients tend to have lower test results than controls, the area represents the probability that a randomly selected patient will have a lower test result than a randomly selected control.

For example: if the area equals 0.80, on average a patient will have a more abnormal test result than 80% of the controls. If the test were perfect, every patient would have a more abnormal test result than every control, and the area would equal 1.00. If the test were worthless, no better at identifying normal versus abnormal than chance, then one would expect that half of the controls would have a higher test value than a patient known to have the disease and half would have a lower test value, so the area under the curve would be 0.5.

The area under a ROC curve can never be less than 0.50. If the area is first calculated as less than 0.50, Prism will reverse the definition of abnormal from a higher test value to a lower test value. This adjustment will result in an area under the curve that is greater than 0.50.
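The intuitive interpretation above can be verified by brute force: count, over all patient/control pairs, how often the patient scores higher (counting ties as half). This sketch uses made-up test values:

```python
def auc_by_pairs(patients, controls):
    # Fraction of patient/control pairs where the patient scores higher;
    # ties count as half a win
    wins = sum(1.0 if p > c else 0.5 if p == c else 0.0
               for p in patients for c in controls)
    return wins / (len(patients) * len(controls))

area = auc_by_pairs(patients=[4, 5, 6, 7], controls=[1, 2, 3, 5])
```

Here the area comes out to 0.90625: a randomly chosen patient has a higher test value than a randomly chosen control about 91% of the time, which is exactly what the area under the ROC curve for these data would report.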
P Value
Prism completes your ROC curve evaluation by reporting a P value that tests the null hypothesis that the area under the curve really equals 0.50. In other words, the P value answers this question: If the test diagnosed disease no better than flipping a coin, what is the chance that the area under the ROC curve would be as high (or higher) than what you observed?

If your P value is small, as it usually will be, you may conclude that your test actually does discriminate between abnormal patients and normal controls. If the P value is large, it means your diagnostic test is no better than flipping a coin to diagnose patients. Presumably, you wouldn't collect enough data to create a ROC curve until you were sure your test actually can diagnose the disease, so high P values should occur very rarely.
Prism calculates z = (AUC - 0.5)/SEarea and then determines P from the z ratio (normal distribution). In the numerator, we subtract 0.5 because that is the area predicted by the null hypothesis. The denominator is the SE of the area, which Prism reports.
4. If you investigated many pairs of methods with indistinguishable ROC curves, you would expect the distribution of z to be centered at zero with a standard deviation of 1.0. To calculate a two-tail P value, therefore, use the following Microsoft Excel function:

=2*(1-NORMSDIST(z))

The method described above is appropriate when you compare two ROC curves with data collected from different subjects. A different method is needed to compare ROC curves when both laboratory tests were evaluated in the same group of patients and controls. Prism does not compare paired ROC curves. To account for the correlation between the areas under your two curves, use the method described by Hanley, J.A., and McNeil, B.J. (1983), Radiology 148:839-843. Accounting for the correlation leads to a larger z value and, thus, a smaller P value.
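The z-to-P conversion above can be sketched without Excel; `math.erfc` plays the role of `1 - NORMSDIST` here. The example area and standard error are made-up values for illustration:

```python
import math

def auc_p_value(auc, se_area):
    z = (auc - 0.5) / se_area        # 0.5 is the area under the null hypothesis
    # Two-tail P = 2 * (1 - Phi(|z|)), written via the complementary error function
    return math.erfc(abs(z) / math.sqrt(2))

p = auc_p_value(auc=0.80, se_area=0.10)   # z = 3.0, so P is about 0.0027
```

An area exactly at the null value (0.50) gives z = 0 and P = 1.0, as expected.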
Designate the columns with the data (usually A and B), and choose how to plot the data. You can plot the difference, the ratio, or the percent difference. If the difference between methods is consistent, regardless of the average value, you'll probably want to plot the difference. If the difference gets larger as the average gets larger, it can make more sense to plot the ratio or the
percent difference.
The 95% confidence limits of the bias are shown as two dotted lines. They are created using Additional ticks and grid lines on the Format axes dialog. The position of each line was 'hooked' to an analysis constant created by the Bland-Altman analysis.
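The dotted lines described above are easy to compute by hand: the bias is the mean of the differences between the two methods, and the lines (often called 95% limits of agreement in the Bland-Altman literature) sit at the bias plus or minus 1.96 standard deviations of the differences. A minimal sketch, using made-up measurements:

```python
import statistics

def bland_altman(method_a, method_b):
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = statistics.mean(diffs)            # average difference between methods
    sd = statistics.stdev(diffs)             # SD of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Hypothetical paired measurements by two methods
bias, lower, upper = bland_altman([10, 12, 14, 16], [9, 11, 15, 15])
```

Each point on the Bland-Altman plot itself is simply the average of a pair on the X axis against its difference (or ratio, or percent difference) on the Y axis.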
The origin of the graph was moved to the lower left (and offset) on the first tab of the Format Axes dialog.
Index
- * -
Asterisks to denote significance 39
- A -
Advice: Choosing a multiple comparisons test 159
Alpha, defined 38
Altman (Bland-Altman plot) 244
Analysis checklist: Bland-Altman results 247
Analysis checklist: Column statistics 71
Analysis checklist: Contingency tables 214
Analysis checklist: Friedman's test 157
Analysis checklist: Kruskal-Wallis test 152
Analysis checklist: Mann-Whitney test 124
Analysis checklist: One-way ANOVA 142
Analysis checklist: Paired t test 116
Analysis checklist: Repeated measures two-way ANOVA 195
Analysis checklist: Repeated-measures one-way ANOVA 147
Analysis checklist: ROC curves 243
Analysis checklist: Survival analysis 234
Analysis checklist: Two-way ANOVA 182
Analysis checklist: Unpaired t test 107
Analysis checklist: Wilcoxon matched pairs test 130
Analysis of variance, one-way 136
Analysis of variance, two-way 175
ANOVA table, two-way 177
ANOVA, one-way 136
ANOVA, one-way, instructions 136
ANOVA, one-way, interpreting results 139
ANOVA, repeated measures one-way 144
ANOVA, two-way 175
ANOVA, two-way repeated measures 184
ANOVA, two-way, instructions 175
Area under the curve 89
Asterisks, to denote significance 39
- B -
Barnard test 210
Basement, analogy about power 46
Bayesian perspective on statistical significance 40
Berger's test 210
Beta, defined 45
Bias, in Bland-Altman results 246
Bin width in frequency distributions 81
Bland-Altman plot 244
Bonferroni post tests, after two-way ANOVA 198
Bonferroni tests after one-way ANOVA 159
Breslow test 229
- C -
Case-control study 206
Censored data, defined 217
Central Limit Theorem 17
Chi-square test or Fisher's test? 210
Choosing a t test 95
CI of mean 26
Circularity 195
Coefficient of variation 75
Column statistics, how-to 69
Compound symmetry 195
Confidence interval of a mean 26
Confidence interval of a proportion 203
Confidence interval of proportion 203
Confidence intervals, advice 31
Confidence intervals, one sided 31
Confidence intervals, principles 29
Contingency tables 206
Cross-sectional study 206
Cumulative Gaussian distribution, fitting 86
CV, coefficient of variation 75
- D -
D'Agostino normality test 69
D'Agostino-Pearson normality test 76
Derivative of curve 88
Diagnostic test, accuracy of 212
Duncan's test 163
Dunnett's test 159
Dunn's post test 151
- E -
Equivalence 56
Error bars, on survival curves 236
Error bars, overlapping 106
Error, defined 14
- F -
F test for unequal variance 103
False Discovery Rate 54
Family-wise significance level 54
FDR 54
Fisher's LSD 163
Fisher's test or chi-square test? 210
Fixed vs. random factors 195
Frequency distribution, how to 81
Friedman's test 154
- G -
Gaussian distribution, fitting to frequency distribution 86
Gaussian distribution, origin of 15
Gehan-Breslow-Wilcoxon test to compare survival curves 229
Geometric mean 69, 72
Graphing hints: Contingency tables 215
Graphing hints: Survival curves 236
Graphing tips: Frequency distributions 84
Graphing tips: Paired t 113
Graphing tips: Repeated measures two-way ANOVA 191
Graphing tips: Two-way ANOVA 180
Graphing tips: Unpaired t 105
Grubbs' outlier test 60
Guilty or not guilty? Analogy to understand hypothesis testing. 42
- H -
Harmonic mean 72
Hazard ratio 229
Holm's test 163
How to: Bland-Altman plot 244
How to: Column statistics 69
How to: Contingency table analysis 208
How to: Frequency distribution 81
How to: Friedman test 154
How to: Kruskal-Wallis test 149
How to: Mann-Whitney test 118
How to: One-way ANOVA 136
How to: Paired t test 109
How to: Repeated measures one-way ANOVA 144
How to: ROC curve 240
How to: Survival analysis 218
How to: Two-way ANOVA 175
How to: Two-way ANOVA. Repeated measures by column 188
How to: Two-way ANOVA. Repeated measures by row 186
How to: Unpaired t test from averaged data 100
How to: Unpaired t test from raw data 98
How to: Wilcoxon matched pairs test 126
Hypothesis testing, defined 38
- I -
Independent samples, need for 12
Integral of a curve 88
Interaction, in two way ANOVA 177
Interpreting results: Bland-Altman 246
Interpreting results: Chi-square 213
Interpreting results: Comparing >2 survival curves 231
Interpreting results: Comparing two survival curves 229
Interpreting results: Fisher's exact test 213
Interpreting results: Friedman test 156
Interpreting results: Kaplan-Meier curves 228
Interpreting results: Kruskal-Wallis test 151
Interpreting results: Mann-Whitney test 122
Interpreting results: Normality test 76
Interpreting results: One-sample t test 77
Interpreting results: One-way ANOVA 139
Interpreting results: Paired t 112
Interpreting results: Relative risk and odds ratio 211
Interpreting results: Repeated measures one-way ANOVA 146
Interpreting results: Repeated measures two-way ANOVA 194
Interpreting results: ROC curves 241
Interpreting results: Sensitivity and specificity 212
Interpreting results: Two-way ANOVA 177
Interpreting results: Unpaired t 103
Interpreting results: Wilcoxon signed rank test 78
Interquartile range 74
Interval variables 12
- K -
Kaplan-Meier curves, interpreting 228
Key concepts: Survival curves 217
Key concepts: Contingency tables 206
Key concepts: Multiple comparison tests 158
Key concepts: Receiver-operator characteristic (ROC) curves 239
Key concepts: t tests 95
Kolmogorov-Smirnov normality test 69, 76
Kruskal-Wallis test 149
- L -
Likelihood ratio 212
Log-rank (Mantel-Cox) test 229
Logrank test 231
Logrank test for trend 231
- M -
Mann-Whitney test 118
Mantel-Cox comparison of survival curves 229
Mean, geometric 72
Mean, harmonic 72
Mean, trimmed 72
Mean, Winsorized 72
Median 72
Median survival ratio 229
Medians, does the Mann-Whitney test compare? 123
Method comparison plot 244
Mode 72
Multiple comparison test, choosing 159
Multiple comparison tests after two-way ANOVA 198
Multiple comparison tests, after two-way ANOVA 197
Multiple comparisons 52
Multiple comparisons of survival curves 232
Multiple comparisons, after one-way ANOVA 158
- N -
Newman-Keuls 163
Nominal variables 12
Nonparametric tests, choosing 67
Nonparametric tests, power of 65
Nonparametric tests, why Prism doesn't choose automatically 64
Nonparametric, defined 64
Normal distribution, defined 14
Normality tests, choosing 69
Normality tests, interpreting 76
Not guilty or guilty? Analogy to understand hypothesis testing. 42
Null hypothesis, defined 33
- O -
Odds ratio 211
One sample t test, choosing 69
One-sample t test, interpreting 77
One-sided confidence intervals 31
One-tail vs. two-tail P value 34
One-way ANOVA 136
One-way ANOVA, instructions 136
One-way ANOVA, interpreting results 139
Ordinal variables 12
Outlier, defined 59
Outliers, danger of identifying manually 60
Outliers, testing with Grubbs' test 60
- P -
P value, common misinterpretation 33
P value, defined 33
P value, interpreting when large 36
P value, interpreting when small 35
P values, one- vs. two-tail 34
Paired data, defined 95
Paired t test 109
Parameters: Area under the curve 89
Parameters: Bland-Altman plot 244
Parameters: Column statistics 69
Parameters: Contingency tables 208
Parameters: Frequency distribution 81
Parameters: One-way ANOVA 133
Parameters: ROC curve 240
Parameters: Smooth, differentiate or integrate a curve 88
Parameters: Survival analysis 218, 226
Parameters: t tests (and nonparametric tests) 95
Parameters: Two-way ANOVA 171
Percentiles 74
Planned comparisons 160
Population, defined 10
Post test after one-way ANOVA, choosing 159
Post test for trend 160
Post tests, after comparing survival curves 232
Post tests, after one-way ANOVA 158
Post tests, after two-way ANOVA 197, 198
Power of nonparametric tests 65
Power, defined 45
Predictive value 212
Probability vs. statistics 11
Probability Y axis 86
Proportion, confidence interval of 203
Proportional hazards 234
Proportional hazards regression 217
Prospective 206
- R -
Random vs. fixed factors 195
Randomized block, defined 184
Ratio t test 114
Ratio variables 12
Receiver-operator characteristic (ROC) curves 239
Relative risk 211
Repeated measures two-way ANOVA 184
Repeated measures vs. "randomized block" 184
Repeated measures, defined 184
Repeated measures, one-way ANOVA 133
Repeated-measures one-way ANOVA 144
Results: Wilcoxon matched pairs test 129
Retrospective case-control study 206
ROC curves 239
ROUT method of outlier detection 62
- S -
Sample, defined 10
Savitsky, method to smooth curves 88
Scheffe's test 163
SD vs. SEM 24
SD, interpreting 18
SEM vs. SD 24
SEM, computing 22, 23
Sensitivity 212
Sequential approach to P values 42
Shapiro-Wilk normality test 69, 76
Significant, defined 38
Simulation, to demonstrate Gaussian distribution 15
Skewness 76
Smoothing a curve 88
Specificity 212
Sphericity 195
Standard deviation, computing 20
Standard deviation, confidence interval of 21
Standard deviation, interpreting 18
Statistics vs. probability 11
STDEV function of Excel 20
Stevens' categories of variables 12
Student-Newman-Keuls 163
Survival analysis 216
- T -
t test, one sample 69
t test, paired 109
t test, unpaired 98
t test, use logs to compare ratios 114
Trend, post test for 160
Trimmed mean 72
Tukey test 159
Two-tail vs. one-tail P value 34
Two-way ANOVA 168, 175
Two-way ANOVA, instructions 175
Two-way ANOVA, interaction 177
Two-way ANOVA, repeated measures 184
Type I, II (and III) errors 47
- U -
Unpaired t test 98
- W -
Wilcoxon (used to compare survival curves) 229
Wilcoxon matched pairs test 126
Wilcoxon signed rank test, choosing 69
Wilcoxon signed rank test, interpreting results 78
Winsorized mean 72