
Pearson's Test

The Nonmathematician Series

Thomas Gamsjäger

Meitner Monographs


Abstract
Whereas inferential testing of interval data is widely employed, data on other scales of measurement
occasionally appear to fall by the wayside. Even though classical significance testing is hampered by its
own limitations, p-value generating methods are nevertheless also available to be used on
categorical/nominal or ordinal data. In particular, the question of whether empirically observed univariate
categorical frequencies conform to or deviate, with statistical significance, from expectation can be
suitably addressed with Pearson's chi-squared goodness-of-fit test. Due to the inherent
simplicity of its algorithm, neither outstanding statistical knowledge nor dedicated statistical software is
needed to subject data to this kind of statistical inference.

Introduction
The hypothesis that many researchers are not overly confident when it comes to dragging their hard-won
data through the forbidding machinery of statistical software might not appear too far-fetched. The
evidence was critically appraised as early as 20 years ago.1 Still, most branches of
science are just unthinkable without the proper use of statistical tests to decide whether the results
confirm the initially stated hypothesis or not. Especially the widely applied testing for significance by
comparing the mean values of different groups along with their corresponding distributions requires
interval data.6, 17 But this type of data, which renders itself suitable for common algebraic operations like
the calculation of the mean itself (e.g. blood pressure), is not always available. The other types include,
in particular, nominal and ordinal scales for which the appropriate statistical tests appear considerably
less well known. Even though the qualitative character of nominal data can by definition carry only a
limited amount of inherent information, certain parameters are most properly expressed in just such
terms. In medicine, a common example is the distinction 'dead' vs. 'alive', or the condition of a patient
upon discharge may be described as 'improved/unchanged/worse'. After the categorisation of individual
cases into groups using a nominal parameter, the number of cases in each group can be counted and
displayed in a histogram. This kind of incidence data can then be subjected to rigorous analytical
procedures just like its brethren from the interval camp.
The hallmark of statistical analysis appears to be still the testing for significance using a cut-off value of
typically p < 0.05, the core of which has been repeatedly criticised, even more than 60 years ago.8 But
until the final verdict is available, is it possible to test nominal incidence data for classical statistical
significance? The answer is unequivocally yes: Enter the chi-squared goodness-of-fit test, or for short,
Pearson's test. Though the concept may appear complicated, it is perhaps surprising that no
fancy statistical software package is needed to calculate it.

Pearson, who?
Karl Pearson (1857-1936) was an eminent English mathematician. After studying mathematics in
Cambridge, he pursued his wide-ranging interests including, among others, Darwinism, German
literature and even Roman law, which induced him to travel widely and to stay for extended periods,
especially in Berlin and Heidelberg.9, 11 He has been aptly described as a 'thoroughly restless intellectual'.15

Figure 1. Karl Pearson in 1910

From 1881 onwards, Pearson returned his focus to his primary subject, now with a very strong
biometric inclination, and held successive positions at King's College London and University College
London. Together with Walter Frank Raphael Weldon, he founded the still-appearing journal
Biometrika. During his scientific career, Pearson made major foundational contributions to the field of
mathematical statistics and is even regarded as the founder of the discipline in its modern form.11, 14
Pearson's paper on the chi-squared test appeared in 1900.13

Chi-squared, what?
'Chi' (pronounced /ˈkaɪ/) stands for the Greek letter χ, which was chosen to lend its name to a
distinctive kind of distribution. It was first described in 1875-76 by Friedrich Robert Helmert as the
distribution of the sum of squares of k independent standard normal random variables, representing a
special case of the gamma distribution. Its single parameter k specifies the degrees of freedom and
determines the shape of the distribution. As k increases towards infinity, the chi-squared distribution
closely approaches a normal distribution.3, 14 A single such sum is calculated using the following formula:

Q = Z₁² + Z₂² + … + Zₖ²

Or in a more compact format:

Q = Σ Zᵢ²  (summing over i = 1, …, k)
On calculating this sum Q repeatedly and in great numbers, the tabulated frequencies of the results give the chi-squared distribution. But the good thing is that these mildly bewildering mathematical aspects serve as a
background only and are fortunately not of immediate relevance in using this approach for inferential
testing.
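Though none of these mathematical details are needed to apply the test, the distribution itself is easy to explore empirically. The following Python sketch (the function name is an illustrative choice, not from the original) repeatedly draws k standard normal variables and sums their squares; the mean of the resulting Q values should settle near k, the theoretical mean of a chi-squared distribution with k degrees of freedom.

```python
import random

def simulate_q(k, n_samples, seed=42):
    """Draw n_samples values of Q = Z1^2 + ... + Zk^2
    for independent standard normal variables Z."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))
            for _ in range(n_samples)]

# A chi-squared distribution with k degrees of freedom has mean k,
# so the sample mean should settle near k for large n_samples.
qs = simulate_q(k=3, n_samples=100_000)
mean_q = sum(qs) / len(qs)
```

With k = 3 the sample mean lands close to 3; increasing k also shifts the distribution towards the symmetric, normal-like shape mentioned above.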

Goodness-of-fit test
How, then, is the test applied in practice? First, let us state the aim again: We want to determine
whether the values of nominal incidence data differ significantly from what is expected. This is best
illustrated with an example.

In the course of one month a general ward in a hospital admitted 300 patients, of which 162 were
female and 138 male. As the expected values would be 150 patients in each group, do these two
empirically determined values differ significantly from the expected ones using a p-value of 0.05? Or in
more scientific parlance: The null hypothesis H0 assumes no significant difference whereas the
alternative hypothesis H1 does just the opposite.

Before solving this example, it is time to define the conditions of the test:16, 19
- A single categorical or nominal variable. (The case of a single variable is also characterised by the
term 'univariate'.) Within this single variable two or more groups can be tested, which are
represented by the number of cases in each group. In the example above the categorical/nominal
variable is 'gender' with the two groups 'female' and 'male'.
- Mutual exclusion of the observations. One observation can only be found in one group and not in
another.
- Independence of observations.
- Use of actual numbers, not of percentages.
- Total probability = 1.
- No group has a number of expected cases less than 1 and no more than 20 % of the groups have
expected numbers of cases less than 5. (The use of Yates's correction for continuity in such
circumstances remains controversial.5)
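The count-based condition at the end of this list can be checked mechanically. A minimal sketch, assuming nothing beyond the rule as stated (the function name is an illustrative choice):

```python
def expected_counts_ok(expected):
    """Rule of thumb from the conditions above: no expected count
    below 1, and at most 20 % of groups with an expected count below 5."""
    if any(e < 1 for e in expected):
        return False
    small = sum(1 for e in expected if e < 5)
    return small / len(expected) <= 0.20

# The worked example below, with 150 expected cases per group,
# satisfies the rule comfortably.
```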

How the redoubtable Karl Pearson arrived at his solution can happily be relegated to the true
statisticians. What we need here is rather a convenient way to calculate the results. A statistical
software package certainly does the trick in ideal manner. Amazingly, such a device is not absolutely
necessary. A very conventional spreadsheet or even a pocket calculator (or nowadays a cell phone, for
that matter) is all that is needed. We just have to calculate the chi-squared test statistic χ²:2, 4, 18

χ² = Σ (O − E)²/E

O … number of observed cases
E … number of expected cases

In words: We take the squared difference between the number of observed and expected cases in each
group and divide it by the number of expected cases. This we repeat for each group and take the sum of
the respective results. To do just that, a table containing our data comes in handy:


Table 1. Tabulation of observed and expected cases

Gender   Number of observed cases O   Expected cases (%)   Number of expected cases E
Female   162                          50                   150
Male     138                          50                   150
Sum      300                          100                  300

The actual calculation of the test statistic χ² is as follows:

χ² = (162 − 150)²/150 + (138 − 150)²/150 = 144/150 + 144/150 = 1.92

Now we have to determine the degrees of freedom k, which is even easier, as this is the number of
groups minus 1:

k = number of groups − 1

In our case there are two groups (female and male):

k = 2 − 1 = 1

All that is left to do is looking up these two results χ² = 1.92 and k = 1 in a table of corresponding χ² and
p-values (readily available on the internet20). Here is a small section of such a table:

Table 2. Chi-squared look-up table

Degrees of     χ² value at p-value of
freedom k      0.30    0.20    0.10    0.05    0.01
1              1.07    1.64    2.71    3.84    6.64
2              2.41    3.22    4.60    5.99    9.21
3              3.66    4.64    6.25    7.82    11.64

As our result for k is 1, only the first row is relevant. In this row our value of χ² = 1.92 lies between
1.64 and 2.71, with a corresponding p-value range between 0.20 and 0.10. Therefore, the level of
significance at p = 0.05 is not reached.
The final verdict in our example: Even though the numbers of observed cases in both gender groups
deviate considerably from the expectation of a 50:50 split, these empirical data do not reach statistical
significance (at the p = 0.05 level).
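For readers who would rather let a few lines of code do the arithmetic, the whole first example fits into a short Python sketch (the function name is an illustrative choice, not part of the original):

```python
def chi_squared_statistic(observed, expected):
    """Pearson's chi-squared goodness-of-fit statistic: the sum of
    (O - E)^2 / E over all groups."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hospital ward example: 162 female and 138 male patients,
# with 150 expected in each group.
stat = chi_squared_statistic([162, 138], [150, 150])  # 1.92
k = 2 - 1  # degrees of freedom: number of groups minus 1
```

Comparing stat = 1.92 against the k = 1 row of the look-up table reproduces the verdict above.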

Dependence on the number of observations


In statistics, more data points usually lead to more robust results and the level of statistical significance
is more easily reached. The chi-squared goodness-of-fit test conforms to this rule just as well. To
illustrate this we expand our imaginary hospital ward example by increasing the number of
observations by a factor of 10.


Table 3. Tabulation of observed and expected cases

Gender   Number of observed cases O   Expected cases (%)   Number of expected cases E
Female   1,620                        50                   1,500
Male     1,380                        50                   1,500
Sum      3,000                        100                  3,000

The new result for our test statistic: χ² = 19.2


Degrees of freedom: k = 1
In the first row of the look-up table the value of 19.2 lies well to the right of 6.64. Therefore, the
statistical significance is even 'better' than p = 0.01.
Even though the two examples share identical ratios, statistical testing yields markedly different results.
The actual numbers are of paramount importance, which definitely precludes the use of percentages as
input in the calculation.
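This dependence on the raw counts can be demonstrated directly in code. The sketch below (an illustrative fragment, not from the original) computes both versions and notes why the statistic scales linearly with the counts:

```python
def chi_squared_statistic(observed, expected):
    """Pearson's chi-squared goodness-of-fit statistic: the sum of
    (O - E)^2 / E over all groups."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

small = chi_squared_statistic([162, 138], [150, 150])      # 1.92
large = chi_squared_statistic([1620, 1380], [1500, 1500])  # 19.2

# Scaling every count by a factor c scales the statistic by c, since
# (cO - cE)^2 / (cE) = c * (O - E)^2 / E. Percentages, which discard
# the absolute counts, would therefore give a wrong test statistic.
```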

More than two groups


A univariate variable is not limited to having only two groups. Accordingly, the chi-squared goodness-of-fit test can handle any number of them. A worked example highlights the case:
The following table contains (hypothetical) long-term data for the conditions of patients discharged
from a hospital ward:


Table 4. Example data (expected)

Status         Percentage
Improved       91.5 %
Unchanged      5 %
Deteriorated   2 %
Dead           1.5 %

The observed data of the ward under investigation are the following:

Table 5. Example data (observed)

Status         Number of cases
Improved       308
Unchanged      16
Deteriorated   10
Dead           11

Do these observed numbers conform to the long-term average or is there any significant deviation
(p=0.05)?

Again, we tabulate what we have got:


Table 6. Tabulation of observed and expected cases

Status         Number of observed cases O   Expected cases (%)   Number of expected cases E
Improved       308                          91.5                 315.675
Unchanged      16                           5                    17.25
Deteriorated   10                           2                    6.9
Dead           11                           1.5                  5.175
Sum            345                          100                  345

Calculation of the test statistic χ²:

χ² = (308 − 315.675)²/315.675 + (16 − 17.25)²/17.25 + (10 − 6.9)²/6.9 + (11 − 5.175)²/5.175
   = 58.906/315.675 + 1.563/17.25 + 9.61/6.9 + 33.931/5.175
   = 8.227

Now there are four groups to heed in the determination of the degrees of freedom:

k = 4 − 1 = 3

In the look-up table, the test statistic of 8.227 lies in the row for k = 3 between the values for p = 0.05
(7.82) and p = 0.01 (11.64). Therefore, the intended level of significance is reached, which leads to the
conclusion that the observed numbers of cases do indeed deviate significantly from the long-term averages.
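The four-group calculation follows the same pattern; the expected counts are derived from the long-term percentages. Again an illustrative Python sketch (the function name is not from the original):

```python
def chi_squared_statistic(observed, expected):
    """Pearson's chi-squared goodness-of-fit statistic: the sum of
    (O - E)^2 / E over all groups."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [308, 16, 10, 11]         # improved, unchanged, deteriorated, dead
percentages = [91.5, 5.0, 2.0, 1.5]  # long-term expectations
total = sum(observed)                # 345 discharged patients
expected = [total * p / 100 for p in percentages]

stat = chi_squared_statistic(observed, expected)  # about 8.227
k = len(observed) - 1                             # 3 degrees of freedom
```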

Conclusion
Scientific investigation is a demanding occupation. Toil (and, occasionally, tears) are necessary
prerequisites. Against this background, it is all the more an intriguing finding that all the laboriously
gathered data typically get boiled down to only one parameter, the p-value. A narrow line, usually,
albeit arbitrarily, drawn at 0.05, decides whether the whole undertaking was worth the effort or
not, an ongoing and, unfortunately, undecided discussion.10, 12 The contentious issue of publication bias
only compounds this sometimes dire situation.7
These academic debates notwithstanding, the p-value can still defend its beleaguered position. Whereas
interval data optimally comply with the demands of classical p-value producing inferential testing, their
categorical kin appear to fall through the occasional crack in the common scientist's statistical toolbox.
But that need not be the case. In fact, Pearson's chi-squared goodness-of-fit test is a very good choice to
subject univariate categorical frequency data to statistical scrutiny. It compares the number of observed
cases with the number of expected cases, quantitatively weighing how 'good' the empirically found
data 'fit' a given reference. As an additional bonus, all this can be accomplished without the heavy
lifting usually associated with statistical software packages. And, most importantly, Pearson's test
provides us with the familiar and trusted (as we have seen, not always rightly so) p-value signifying the
cut-off between statistical relevance and scientific oblivion.


References
1) Altman DG. The scandal of poor medical research. BMJ 1994; 308: 283-284
2) Bithell JF. Statistical inference. In: Ahrens W, Pigeot I (eds.). Handbook of epidemiology. Springer
2014, p. 953
3) Boslaugh S. Statistics in a nutshell. O'Reilly Media 2012, p. 125
4) Ibid., p. 129
5) Ibid., p. 131
6) Carlberg C. Statistical analysis. Pearson Education 2011, p. 12
7) Dwan K, Gamble C, Williamson PR et al. Systematic review of the empirical evidence of study
publication bias and outcome reporting bias: an updated review. PLoS ONE 2013; 8(7): e66844
8) Greenwood M. The statistician and medical research. BMJ 1948; 2: 467-468
9) Hardy A, Magnello ME. Statistical methods in epidemiology: Karl Pearson, Ronald Ross, Major
Greenwood and Austin Bradford Hill, 1900-1945. Soz Präventivmed 2002; 47: 80-89
10) Lew MJ. To P or not to P: on the evidential nature of P-values and their place in scientific inference.
arXiv:1311.0081v1
11) Norton BJ. Karl Pearson and statistics: The social origins of scientific innovation. Social Studies of
Science 1978; 8: 3-34
12) Nuzzo R. Statistical errors. Nature 2014; 506: 150-152
13) Pearson K. On the criterion that a given system of deviations from the probable in the case of a
correlated system of variables is such that it can be reasonably supposed to have arisen from
random sampling. Philosophical Magazine 1900; 50: 157-175
14) Plackett RL. Karl Pearson and the chi-squared test. Int Stat Rev 1983; 51: 59-72
15) Porter TM. Karl Pearson: The scientific life in a statistical age. Princeton University Press 2004, p. 1
16) Sheskin DJ. Handbook of parametric and nonparametric statistical procedures. CRC Press 2003, p.
219
17) Stevens SS. On the theory of scales of measurement. Science 1946; 103: 677-680
18) Van den Broeck J, Brestoff JR. Epidemiology: Principles and practical guidelines. Springer 2013, p.
449
19) Verma JP. Data analysis in management with SPSS software. Springer 2013, p. 73
20) A good table can be found at: http://www.medcalc.org/manual/chi-square-table.php

Figures
Figure 1. Public domain image.
Author
Dr. Thomas Gamsjäger, University Hospital St. Pölten-Lilienfeld, Propst-Führer-Straße 4, 3100 St. Pölten, Austria
Date of publication
1 January 2015
Citation
Gamsjäger T. Pearson's test. Meitner Monographs 2015
