
CHI-SQUARE DISTRIBUTION AND ITS APPLICATIONS

Introduction
The science of statistics comprises two parts: descriptive statistics and inferential statistics.
Descriptive statistics deals with the data and information base of analysis. It is basically
concerned with the condensation and preparation of data for presentation through classification
and tabulation, and with summarising the data through summary statistics such as the measures of
central tendency, measures of variation, etc. It also deals with the graphical presentation of the
data. Graphs portray the basic features of the frequency distribution. Descriptive statistics can
also deal with the information base pertaining to qualitative attributes, which fall within the
purview of non-parametric statistics.
As against descriptive statistics, inferential statistics goes deeper to fathom what underlies the
data. For this, inferential statistics focuses on the analysis of data in order to arrive at
conclusions or inferences. It basically deals with i) problems of estimation of the values of the
population parameters; and ii) hypothesis testing. The hypotheses may relate to a) population
parametric values; b) the representative character of one or more samples, which may involve the
evaluation of the differences between the sample estimates and true values; or c) the
relationships between two or more variables. Inferential statistics has the theory of probability
as its pivotal base. Probability theory may, however, be applied either to parametric or to
non-parametric statistics. Hence, both descriptive and inferential statistics are further
classifiable into i) parametric statistics; and ii) non-parametric statistics. Parametric
statistics revolves around variables which are cardinally measurable, whereas non-parametric
statistics deals with nominally or ordinally measurable qualitative attributes of the phenomena
under study.
The following chart portrays this classification:

Parametric and Non-parametric Statistics

Statistics
    Descriptive Statistics
        Parametric Descriptive Statistics
        Non-Parametric Descriptive Statistics
    Inferential Statistics
        Parametric Inferential Statistics
        Non-Parametric Inferential Statistics

Sampling theory constitutes a large part of parametric statistics, which uses numerous test
statistics. The test statistics discussed earlier enable us to test the differences between the
sample and population means, the differences between two sample means, the equality of more than
two sample means, the equality of sample and/or population variances, etc. All these tests are
based on the assumption that the variate values either have a normal distribution or, under
certain conditions, a good approximation to it. The z and t tests are the important tests in
parametric statistics, and the F test complements them. These tests are frequently used to
evaluate one, two or multiple sample estimates. The causal relations between two or more
cardinally measured variables may also be evaluated by these tests. But numerous research
investigations focus exclusively on ordinally or even nominally measured qualitative attributes
and their classification into distinct categories. Since the population in such cases has no
parameters as such, there is no need for parametric sample estimates. The research investigations
in such cases do, however, involve the testing of hypotheses. These hypotheses may pertain to one
or more samples. Besides, the study may relate to one, two or even more attributes. In cases
involving two or more attributes, the focus of analysis may be the inter-relations among the
attributes under investigation. Rigorous statistical analysis of such data obviously needs
non-parametric methods. Non-parametric statistics is now well developed, and several
non-parametric tests and methods of statistical analysis are available to fulfil the research
requirements. The choice of any of these tests depends on the nature of the measurement, the
conditions of the problem setting and the nature of the sample(s).
χ² Distribution
The χ² distribution was discovered by Helmert in 1875. Subsequently, Karl Pearson developed it in
1900, independently of Helmert. The χ² distribution constitutes a part of non-parametric
statistics of major importance, and χ² is probably the most widely used test statistic in social
science research. Its large-scale use in research is accounted for by the versatility of this test
statistic on the one hand, and by the newer uses that researchers have continuously evolved on the
other. The χ² test mostly suits the following problems and uses:
i) the nominal scale of measurement is used in one sample;
ii) the higher, ordinal scale of measurement is involved;
iii) the goodness of the fit of some curve or mathematical/statistical function or model has to be
tested. Karl Pearson developed χ² as a test of the goodness of fit of a mathematical/statistical
distribution. For example, one may have data relating to the monthly earnings of 500 employees of
different managerial cadres in a company. The company may be interested in knowing whether the
distribution of earnings follows the normal, binomial, multinomial or hypergeometric distribution.
χ² comes in handy for answering such questions, and the answer may also help the company in
sorting out some knotty issues of HRM that revolve around the earning profiles. A related problem
is to evaluate the significance of the differences between the observed and expected/theoretical
values;
iv) it is useful not only in one-sample studies but also for studying two or more samples;
v) the test may also be used to evaluate the statistical significance of some value or
characteristic;
vi) the test may be used for evaluating multi-collinearity in econometric models (Koutsoyiannis,
198 );
vii) the values of the test statistic applied for the testing of homoskedasticity tend to be
distributed like χ² with k-1 degrees of freedom, if these are normally and independently
distributed (Kane, p. 373);
viii) the test is used for evaluating the association between attributes; and
ix) the value of χ² is also used for the calculation of the coefficient of contingency.
χ² may also be defined as the sum of squares of n independently and normally distributed
variables x_i, i = 1, 2, ..., n, when these have been standardized as follows:

$$z_i = \frac{x_i - \mu_i}{\sigma_i} \sim N(0, 1)$$

$$\chi^2 = \sum_{i=1}^{n} z_i^2 = \sum_{i=1}^{n} \frac{(x_i - \mu_i)^2}{\sigma_i^2}$$

with n degrees of freedom. The degrees of freedom equal the number of independent variables.
The distribution is skewed towards the right; it starts from the origin and extends to +∞ along
the right tail. As n increases, χ² tends more and more towards symmetry.
The mean and variance of the χ² distribution are given by
E(χ²) = n
V(χ²) = 2n
Thus, the mean equals the number of independent variables (the degrees of freedom) and the
variance is twice that number.
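The two moments can be checked numerically. The following is only a minimal simulation sketch: the function name, the choice of 5 degrees of freedom, the number of draws and the seed are all illustrative assumptions, not part of the original text. It builds χ² values as sums of squared standard normal variates and compares their sample mean and variance with n and 2n.

```python
import random
import statistics


def simulate_chi_square(df, draws, seed=0):
    """Return `draws` simulated values, each the sum of `df` squared N(0,1) variates."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df)) for _ in range(draws)]


# Illustrative (assumed) settings: 5 degrees of freedom, 20,000 simulated values.
samples = simulate_chi_square(df=5, draws=20000)
sample_mean = statistics.mean(samples)           # should lie close to n = 5
sample_variance = statistics.pvariance(samples)  # should lie close to 2n = 10
print(sample_mean, sample_variance)
```

The simulated figures will differ slightly from 5 and 10 because of sampling fluctuations; the gap narrows as the number of draws is increased.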
x 2 (n a)1 / 6(2k 5)

The χ² distribution is of great importance in the statistical testing of hypotheses. It is used to
test the correspondence between the sample data and expected values or frequencies, especially
those which relate to the variance, the independence of classification and the rank order.
It is interesting to note that the expected/theoretical values may be generated from alternative
assumptions or theoretical frameworks. Each option may furnish an opportunity to formulate and
evaluate the empirical validity of a set of hypotheses rather than focusing on one single
hypothesis. This advantage may not be available from most of the other tests. The advantage is
in-built in the formulation of the definition given above. This test enables us to compare not
only one sample value, such as the mean or variance, but the whole set of sample values with the
corresponding theoretical or expected values. It is also considered to be one of the several
non-parametric test statistics. Thus, this test statistic is unique in so far as it is capable of
being used to test hypotheses relating to both qualitative and quantitative data. In some cases,
χ² offers an alternative to the Binomial distribution. The Binomial distribution suits the problem
best if the small size of the sample precludes the use of the χ² test. But in large samples, χ² is
an alternative to the Binomial distribution, specially for problems having binary values or such
events as success or failure.
Definition:

$$\chi^2 = \sum \frac{(f_0 - f_e)^2}{f_e}$$

where f0 refers to the actually observed values/frequencies, fe to the expected
values/frequencies, and n is the sample size. If the distribution of the sum of the independent
normal variates with zero mean is itself normally distributed, the sum of the squares of these
variates has the Chi-square distribution with n-1 degrees of freedom. This property furnishes the
general definition and, hence, the following equation of this distribution:
$$Y = y_0 \, e^{-\frac{1}{2}\chi^2} \, (\chi^2)^{\frac{v-2}{2}} \qquad (1)$$

where v = n-1 shows the degrees of freedom and Y is a function of these degrees of freedom.
Since the degrees of freedom are generally exogenously given, there is no need to estimate them
as a distributional parameter. This makes the χ² distribution parameter-free. The shape of the
distribution obviously depends upon the degrees of freedom, which is the only parameter of the
distribution.
For n=2, the distribution will be inverse exponential:

$$Y = y_0 \, e^{-\frac{1}{2}\chi^2} \, (\chi^2)^{-\frac{1}{2}}$$

If n is large, it can be shown that the quantity

$$\sum \frac{(f_0 - f_e)^2}{f_e}$$

is distributed like Chi-square with n-1 degrees of freedom.
The following diagram shows the Chi square curves for three different values of n: n=2, n=4, and
n=8

[Figure: Chi-square distribution curves for n=2, n=4 and n=8]

These curves show that i) for each value of n, there exists one χ² distribution. Thus, like the t
distribution, Chi-square is a family of distributions. One distribution corresponds to each given
value of the degrees of freedom, ranging from 1 to infinity; ii) for one degree of freedom, χ² is
approximated by a hyperbola; iii) as the degrees of freedom change, the shape of the curve
changes; iv) as the degrees of freedom increase, a peak appears; v) the larger the n, the further
towards the right the peak shifts.
The χ² distribution, for purposes of calculation, may be defined differently for different cases.
If one is interested in evaluating the goodness of the fit of a function or the differences
between actual and expected frequencies, χ² can be defined by the following relation:

$$\chi^2 = \sum \frac{(f_0 - f_e)^2}{f_e} \qquad (2)$$

Properties
The χ² distribution has the following characteristic features:
i) Chi-square is a sum of squares and, therefore, its value can never be negative. The
distribution, in fact, ranges from zero to positive infinity;
ii) The distribution is uni-modal and positively skewed, with its longer tail towards the right.
For v degrees of freedom (v ≥ 2), the mode is located at the point where Chi-square equals v-2.
The curve falls gradually and ultimately approaches zero as Chi-square tends towards infinity;
iii) The χ² curve is a Pearsonian type III curve;
iv) As the size of the sample increases, the distribution approaches normality. Fisher showed
that, if n>30, the quantity $\sqrt{2\chi^2} - \sqrt{2v-1}$, where v is the number of degrees of
freedom, is distributed normally with zero mean and unit standard deviation. Therefore, it is a
normal deviate, having values of 1.96 and 2.58 at the 5 and 1 per cent probability levels. This
quantity can be both positive and negative; it is distributed like a usual normal variate. Thus,
the 5 and 1 per cent significance points of χ², like those of other test statistics, refer to
both the tails of the distribution;
v) A sum of several independent Chi-squares is also distributed as Chi-square, and its degrees of
freedom are equal to the sum of the degrees of freedom of the component Chi-squares (for proof,
see Kenney and Keeping);
vi) Generally, Chi-square could be defined as

$$\chi^2 = \frac{\sum (x_i - \bar{x})^2}{\sigma^2} \qquad (3)$$

that is, it is the ratio of the sum of the squared deviations of the sample values from the
sample mean to the population variance. The greater the departure of the data from the stipulated
hypothesis, the greater will be the value of Chi-square and the greater will be the probability
of the rejection of the hypothesis that the data are in consonance with a priori theory;
vii) The Chi-square distribution is a continuous distribution even though the actual frequencies
of occurrence may be discontinuous. This introduces an error, especially if the frequency is
small. A frequency of less than 5 is considered to be small. The following alternative solutions
are available to overcome this limitation:
a) Fisher suggested that the expected frequency should not be less than five in any cell or
class. This limitation may be overcome, since several classes having low frequencies can be
clubbed together if their individual expected frequencies are less than five;
b) Yates correction may also be used. According to this procedure, ½ is added to the small
frequencies, while ½ is deducted from the large frequencies so as to keep the marginal totals
unchanged. This is, however, applicable only to 2x2 tables (see Yule and Kendall);
c) In some cases, it may be possible to arrive at an exact solution, which renders this
correction unnecessary; and
viii) As pointed out earlier, the equation of Chi-square does not involve any parameter of the
population. Hence, it does not depend upon the form of the population distribution. Since it does
not contain any population parameter, it is known as a non-parametric statistic.
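Fisher's large-sample approximation mentioned in the properties above can be sketched as a one-line computation. The function name and the input values (χ² = 50 with 35 degrees of freedom) are hypothetical illustrations, not data from the text; the resulting deviate is simply compared with the normal values of 1.96 and 2.58 quoted above.

```python
import math


def fisher_normal_deviate(chi_square, df):
    """Approximate normal deviate sqrt(2*chi2) - sqrt(2*df - 1), intended for df > 30."""
    return math.sqrt(2 * chi_square) - math.sqrt(2 * df - 1)


# Hypothetical illustration: chi-square of 50 with 35 degrees of freedom.
z = fisher_normal_deviate(50, 35)
exceeds_5_percent_point = abs(z) > 1.96
print(round(z, 4), exceeds_5_percent_point)
```

The approximation is useful mainly when the degrees of freedom lie beyond the range of the printed Chi-square table.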

Conditions For The Application of Chi-Square Test:


i) The sample must be large; the sample should preferably contain 50 or more items even though
the number of cells or class intervals may be small. Aggregation and classification generally
reduce the number of cells;
ii) The N individual items in the sample must have been drawn independently;
iii) The number of cells must be neither too small nor too large. It is preferable to have the
class intervals or cells in the range of 5 to 20;
iv) The constraints to which the cell frequencies are subjected must be linear. The researcher
can exercise his choice in favour of formulating the constraints so as to satisfy the condition
of linearity;
v) The cell frequencies must not be small. In any case, no cell frequency should be less than
five. It is preferable to have 10 or more as the smallest value of the cell frequency. This
condition can easily be satisfied by clubbing several classes and aggregating the corresponding
frequencies together in case their frequencies are less than five.
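The clubbing of classes described in condition v) can be sketched as follows. This is one possible pooling rule, assumed here for illustration (adjacent classes are merged from the first class onwards, and any small leftover tail is folded into the last pooled class); the function name and the frequencies are hypothetical, not taken from the text.

```python
def pool_small_classes(observed, expected, minimum=5.0):
    """Merge adjacent classes until every pooled expected frequency is >= minimum."""
    pooled_obs, pooled_exp = [], []
    acc_obs, acc_exp = 0, 0.0
    for f_o, f_e in zip(observed, expected):
        acc_obs += f_o
        acc_exp += f_e
        if acc_exp >= minimum:  # the pooled class is large enough: close it
            pooled_obs.append(acc_obs)
            pooled_exp.append(acc_exp)
            acc_obs, acc_exp = 0, 0.0
    if acc_exp > 0:  # leftover tail classes that are still too small
        if pooled_exp:
            pooled_obs[-1] += acc_obs
            pooled_exp[-1] += acc_exp
        else:
            pooled_obs.append(acc_obs)
            pooled_exp.append(acc_exp)
    return pooled_obs, pooled_exp


# Hypothetical frequencies for seven classes; the first and last are too small.
observed = [2, 6, 14, 20, 12, 5, 1]
expected = [1.5, 6.5, 15.0, 18.0, 12.0, 5.5, 1.5]
pooled_obs, pooled_exp = pool_small_classes(observed, expected)
print(pooled_obs, pooled_exp)
```

Each merger reduces the number of classes by one, so the degrees of freedom must be counted from the pooled classes and not from the original ones.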

Numerical Examples:
If the population variance is known or its value is postulated, Chi-square statistic can be used to
determine whether the sample estimate of the population variance differs from the actual
population variance by more than what is warranted by the sampling fluctuations. For such
problems, Chi-square is defined as follows:

$$\chi^2 = \frac{nS^2}{\sigma^2} = \frac{\sum (x - \bar{x})^2}{\sigma^2}$$

where n is the sample size, S² is the sample estimate of the variance, x̄ is the sample mean, and
σ² is the population variance.
Test of Variances
Example 1: A random sample of the heights of 20 males is collected from the records of an army
recruitment board, which gives a variance of 3.21. Is it consistent with the population variance of
2.79?

$$\chi^2 = \frac{nS^2}{\sigma^2} = \frac{20 \times 3.21}{2.79} = \frac{64.2}{2.79} = 23.01$$

For 19 d.f., the 5% value of Chi-square is 30.14, which is greater than the calculated value.
Hence, the difference between the true population variance and its sample estimate could have
arisen from the sampling fluctuations.
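The arithmetic of Example 1 may be reproduced with a short sketch. The function and variable names are illustrative assumptions; the critical value is the table value quoted in the text and is passed in as a constant rather than computed.

```python
def variance_chi_square(n, sample_variance, population_variance):
    """Chi-square = n * S^2 / sigma^2, following the definition used in the text."""
    return n * sample_variance / population_variance


chi2_heights = variance_chi_square(20, 3.21, 2.79)
CRITICAL_5_PERCENT_19_DF = 30.14  # table value quoted in Example 1
reject_heights = chi2_heights > CRITICAL_5_PERCENT_19_DF
print(round(chi2_heights, 2), reject_heights)
```

Since the calculated value does not exceed the table value, the sketch reaches the same conclusion as the worked example.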
Example 2: Ten random samples of 50 persons each were selected for an opinion poll, with the
following numbers of votes favouring a particular candidate: 25, 25, 27, 27, 28, 29, 30, 31, 33,
33. Is this an unusual variation to expect? In the opinion poll, the respondents are given two
choices (binary: success or failure): either they favour the particular candidate or they do not.
If they favour the candidate, the outcome is denoted by S, and if they do not, by S̄. The
probabilities of the occurrence of S and S̄ may be shown by p and q respectively. The occurrences
of S and S̄ may be calculated from the binomial distribution (q+p)^n. The overall sample size, if
all the ten sub-samples are pooled together, is N×n = 10×50 = 500.
The variance of the binomial distribution is given by npq, where n is the sample size, p is the
probability of success and q is the probability of failure in any trial. The mean number of
successes, that is, the expected value, is np. For testing the acceptability of the sample
variance, equation 3 has to be used:

$$\chi^2 = \frac{nS^2}{\sigma^2} = \frac{\sum (x_i - \bar{x})^2}{npq}$$

Total number of votes in all the samples favouring the particular candidate =
(25+25+27+27+28+29+30+31+33+33) = 288. The total number of voters interviewed = 50×10 = 500.
Hence, the actual probability of any voter favouring the particular candidate, in all the samples
taken together, is p = 288/500 = .576, and q = 1-p = .424.
The mean number of successes in the pooled sample = np = 500 × .576 = 288.

$$\sum (x - \bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{N}$$

and

$$\sum x^2 = 625+625+729+729+784+841+900+961+1089+1089 = 8372$$

With n = 50 in each sample,

$$\chi^2 = \frac{8372 - 288 \times 288 / 10}{(50)(.576)(.424)} = \frac{77.6}{12.2112} = 6.35$$

For 9 d.f., the 5% table value of Chi-square is 16.919. The table value is much greater than the
calculated value. Hence, the evidence is in support of the hypothesis that the variation could
have arisen from the sampling fluctuations.
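The steps of Example 2 can be sketched in a few lines. The function name is an illustrative assumption; the vote counts and the sample size of 50 are those given in the example, and p is estimated from the pooled samples exactly as in the text.

```python
def binomial_dispersion_chi_square(counts, n_per_sample):
    """Sum of squared deviations of the counts from their mean, divided by npq."""
    total = sum(counts)
    p = total / (n_per_sample * len(counts))  # pooled estimate of p
    q = 1 - p
    mean = total / len(counts)
    sum_sq_dev = sum((x - mean) ** 2 for x in counts)
    return sum_sq_dev / (n_per_sample * p * q)


votes = [25, 25, 27, 27, 28, 29, 30, 31, 33, 33]
chi2_votes = binomial_dispersion_chi_square(votes, 50)
df_votes = len(votes) - 1
CRITICAL_5_PERCENT_9_DF = 16.919  # table value quoted in Example 2
print(round(chi2_votes, 2), df_votes, chi2_votes > CRITICAL_5_PERCENT_9_DF)
```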

The Test of Goodness of Fit:


The most common use of the Chi-square test is to assess the goodness of the fit of a theoretical
distribution/curve or a mathematical function to a given set of data. The hypothesis to be tested
is whether the observed sample frequencies are in consonance with the expected frequencies, or
whether the values predicted by the function are acceptable. The test statistic is

$$\chi^2 = \sum \frac{(f_0 - f_e)^2}{f_e}$$

where f0 denotes the observed frequency and fe is the expected/theoretical frequency/value. The
expected frequencies may be obtained by any one of the following methods: they may be derived
from some theoretical distribution like the binomial distribution; they may be obtained by
applying the principle of independence, as in the case of a 2X2 frequency table; they may be
obtained from expected ratios such as the sex ratio; or they may be based on empirical data. It
may also be assumed that the frequencies occur randomly. If a mathematical/econometric function
has been used (See Prakash and Subramanian, 2006), the values predicted by this function may be
used.
Example 3: In 120 throws of a single die, the following results are obtained:

Number (X):       1     2     3     4     5     6    Total
Frequency (f):   30    25    18    10    22    15     120

Do these frequencies discredit the hypothesis of equal probability of each number?
If the die is unbiased, the probability of each of the six numbers is 1/6 and the expected
frequency of each number/face is np = 120/6 = 20.

$$\chi^2 = \frac{(30-20)^2}{20} + \frac{(25-20)^2}{20} + \frac{(18-20)^2}{20} + \frac{(10-20)^2}{20} + \frac{(22-20)^2}{20} + \frac{(15-20)^2}{20} = \frac{100+25+4+100+4+25}{20} = \frac{258}{20} = 12.9$$

with v = 6-1 = 5 degrees of freedom. The 5% value of Chi-square for 5 d.f. is 11.07, which is
less than the calculated value. Hence, the hypothesis of an unbiased die, with equal probability
of each number, is rejected.
The evidence does not support the thesis that the die is unbiased.
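The goodness-of-fit calculation for the die may be sketched as below. The function name is an illustrative assumption; the critical value of 11.07 for 5 d.f. is taken from standard Chi-square tables and entered as a constant rather than computed.

```python
def goodness_of_fit_chi_square(observed, expected):
    """Sum of (f0 - fe)^2 / fe over all classes."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same number of classes")
    return sum((f_o - f_e) ** 2 / f_e for f_o, f_e in zip(observed, expected))


die_observed = [30, 25, 18, 10, 22, 15]
die_expected = [sum(die_observed) / 6] * 6  # 20 per face under equal probability
chi2_die = goodness_of_fit_chi_square(die_observed, die_expected)
df_die = len(die_observed) - 1
CRITICAL_5_PERCENT_5_DF = 11.07  # standard table value, assumed here
print(round(chi2_die, 2), df_die, chi2_die > CRITICAL_5_PERCENT_5_DF)
```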
χ² For Contingency Table
Example 4: The following table shows the classification of the employed male graduates by
occupation, related to the occupation of their fathers.

Number of Youth Reporting

Youth's                     Father's Occupation
Occupation        Unskilled    Skilled    White Collar    Total
White Collar           8          39           56           103
Skilled               34         118           38           190
Unskilled              5          16           10            31
Total                 47         173          104           324
Test the association between the occupation of the youth and that of their fathers.
The probability of the occurrence of two independent events is the product of the probabilities
of their individual occurrences. If n is the sample size, n(io) is the total frequency of the
i-th row, n(oj) is the total frequency of the j-th column and n(ij) is the frequency of the
i-j-th cell, the expected frequency of this cell is given by

$$n_e(ij) = \frac{n(io)\, n(oj)}{n}$$

If we assume independence between the various cell frequencies, that is, that the occupation
chosen by the sons is independent of the occupation of their fathers, the probability of the
i-j-th cell frequency is p(ij) = p(i) × p(j). But p(i) = n(io)/n and p(j) = n(oj)/n. Hence,

$$p(ij) = \frac{n(io)\, n(oj)}{n \cdot n}$$

This probability multiplied by the sample size gives the expected frequency:

$$n\,p(ij) = \frac{n(io)\, n(oj)}{n}$$

Hence,

$$\chi^2 = \sum_i \sum_j \frac{\{n(ij) - n_e(ij)\}^2}{n_e(ij)}$$

Thus, the expected frequencies are calculated with the help of the grand total and the marginal
frequencies. If we have p rows and q columns, the number of degrees of freedom is given by
(p-1)(q-1). If the null hypothesis of the independence of the cell frequencies is rejected, the
relationship between the given attributes is estimated by the coefficient of contingency:

$$C = \sqrt{\frac{\chi^2}{n + \chi^2}}$$
In the given example, we will have the expected frequencies as given below: n(11) =
47X103/324 = 14.9, n(12) = 103X173/324 = 55, n(13) = 103X104/324 = 33.1, n(21) =
190X47/324 = 27.6, n(22) = 190X173/324 = 101.4, n(23) = 190X104/324 = 61, n(31) =
31X47/324 = 4.5, n(32) = 31X173/324 = 16.6, n(33) = 31X104/324 = 9.9. The value of χ² will,
therefore, be given by

$$\chi^2 = \frac{(8-14.9)^2}{14.9} + \frac{(39-55)^2}{55} + \frac{(56-33.1)^2}{33.1} + \frac{(34-27.6)^2}{27.6} + \frac{(118-101.4)^2}{101.4} + \frac{(38-61)^2}{61} + \frac{(5-4.5)^2}{4.5} + \frac{(16-16.6)^2}{16.6} + \frac{(10-9.9)^2}{9.9} = 36.645$$

For d.f. = (3-1)(3-1) = 2X2 = 4, the 5% value of Chi-square is 9.488, which is less than the
calculated value of χ². Hence, the evidence is against the hypothesis of independence. Now the
question is: what is the degree of relationship between the occupation of fathers and sons?

The Coefficient of Contingency is

$$C = \sqrt{\frac{36.645}{324 + 36.645}} = 0.32$$

while the maximum value of the coefficient for a 3X3 table is $\sqrt{2/3}$, about 0.82. Hence the
calculated value of the contingency coefficient is moderate. It is neither too low, nor too high.
The empirical evidence suggests that there is only a moderate relation between the occupation of
fathers and sons.
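The contingency-table procedure of Example 4 may be sketched as follows. The function name is an illustrative assumption. Unlike the hand calculation, the sketch keeps the expected frequencies unrounded, so its χ² (about 36.74) differs slightly from the 36.645 obtained with expected frequencies rounded to one decimal place; the coefficient of contingency still rounds to 0.32.

```python
import math


def contingency_chi_square(table):
    """Return (chi-square, degrees of freedom, contingency coefficient C) for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # n(io) * n(oj) / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    c = math.sqrt(chi2 / (n + chi2))
    return chi2, df, c


# Rows: youth's occupation (white collar, skilled, unskilled).
# Columns: father's occupation (unskilled, skilled, white collar).
occupations = [
    [8, 39, 56],
    [34, 118, 38],
    [5, 16, 10],
]
chi2_occ, df_occ, c_occ = contingency_chi_square(occupations)
print(round(chi2_occ, 2), df_occ, round(c_occ, 2))
```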

Test of the Coefficient of Association


2X2 Frequency Table. A 2X2 frequency table is a special case of the pxq contingency table. Such a
table is used to illustrate the following points: i) the formula for calculating Chi-square
without correction; ii) the calculation of χ² with Yates correction for small frequencies; and
iii) an exact solution for Chi-square. Let the 2X2 table be as follows:

           A       B       Total
C          a       b       a+b
D          c       d       c+d
Total      a+c     b+d     n

Chi-square for such a table is given by:

$$\chi^2 = \frac{(a+b+c+d)(ad-bc)^2}{(a+b)(c+d)(a+c)(b+d)}$$

where n = a+b+c+d
Example 5: Find the value of Chi-square for the following table without Yates correction for the
small frequencies:

           A       B       Total
C          10      4       14
D          2       8       10
Total      12      12      24

Calculated value of Chi-Square:

$$\chi^2 = \frac{24\,(8 \times 10 - 2 \times 4)^2}{12 \times 12 \times 14 \times 10} = \frac{24 \times 72 \times 72}{12 \times 12 \times 14 \times 10} = \frac{216}{35} = 6.17$$

The value of Chi-square at the 5% level for (r-1)(c-1) = (2-1)(2-1) = 1x1 = 1 d.f. is 3.841.
This is less than the observed or calculated value. Hence, the attributes are not independent.
The results suggest that the attributes are associated.
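Alternatively, the expected frequencies may be calculated as follows:

$$n(11) = \frac{12 \times 14}{24} = 7, \quad n(12) = \frac{12 \times 14}{24} = 7, \quad n(21) = \frac{12 \times 10}{24} = 5, \quad n(22) = \frac{12 \times 10}{24} = 5$$

Then, the value of Chi-square is

$$\chi^2 = \frac{(10-7)^2}{7} + \frac{(4-7)^2}{7} + \frac{(2-5)^2}{5} + \frac{(8-5)^2}{5} = \frac{18}{7} + \frac{18}{5} = \frac{90+126}{35} = \frac{216}{35} = 6.17$$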

Results remain the same irrespective of the procedure followed. Since two of the cell frequencies
are small, we may apply the Yates correction by raising the small frequencies by ½ and by
decreasing the large frequencies by ½. The table thus corrected will be as given below:

           A       B       Total
C          9.5     4.5     14
D          2.5     7.5     10
Total      12      12      24

Then,

$$\chi^2 = \frac{24\,(9.5 \times 7.5 - 4.5 \times 2.5)^2}{12 \times 12 \times 14 \times 10} = \frac{24 \times 60 \times 60}{12 \times 12 \times 14 \times 10} = 4.29$$

Without correction, the term (ad-bc)² = 72X72 = 5184, which is now reduced to 60X60 = 3600 by
the correction. This reduction in the value of the numerator accounts for the reduction in the
value of Chi-square from 6.17 to 4.29. But the calculated value is still greater than the table
value. Hence, it is again significant at the 5% probability level and the inference drawn earlier
is not altered.
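Both versions of the 2X2 calculation can be sketched with one function. The function name and the `yates` flag are illustrative assumptions. Shifting the four cells by ½ towards their expected values reduces |ad-bc| by n/2, which is the form used in the sketch; for the table of Example 5 this gives 72 - 12 = 60, as in the hand calculation.

```python
def chi_square_2x2(a, b, c, d, yates=False):
    """Chi-square for a 2x2 table [[a, b], [c, d]], optionally with Yates correction."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Moving each cell by 1/2 towards its expected value lowers |ad - bc| by n/2.
        diff = max(diff - n / 2, 0.0)
    return n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


chi2_plain = chi_square_2x2(10, 4, 2, 8)
chi2_yates = chi_square_2x2(10, 4, 2, 8, yates=True)
CRITICAL_5_PERCENT_1_DF = 3.841  # table value quoted in Example 5
print(round(chi2_plain, 2), round(chi2_yates, 2), chi2_yates > CRITICAL_5_PERCENT_1_DF)
```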
Derivation of Expected Values
Expected values may be derived in alternative ways. Some scientific principle may be used to
estimate the expected values. Alternatively, the empirical evidence may be used to evolve the
criteria for generating expected values. The evidence may be used to determine the relative
frequencies or probabilities, as we did in solving Example 4. Some statistical distribution or
mathematical/econometric model may also be used to determine the theoretical values.
Scientific Principle as the Base
Example 6: In an experiment of pea breeding, Mendel obtained the following frequencies of
seeds:

Round and Yellow    Wrinkled Yellow    Round Green    Wrinkled Green    Total
      315                 101              108              32           556

The theory predicts that the frequencies should be in the proportions 9:3:3:1. Examine the
consistency of the data with this expectation.
According to the theoretical prediction of the relative shares of the 4 types of seeds in a total
of 16, the expected frequencies will be

$$\frac{9 \times 556}{16}, \quad \frac{3 \times 556}{16}, \quad \frac{3 \times 556}{16}, \quad \frac{1 \times 556}{16}$$

that is, 313, 104, 104 and 35 respectively (rounded to whole numbers).
The value of

$$\chi^2 = \frac{(315-313)^2}{313} + \frac{(101-104)^2}{104} + \frac{(108-104)^2}{104} + \frac{(32-35)^2}{35} = \frac{4}{313} + \frac{9}{104} + \frac{16}{104} + \frac{9}{35} = 0.51$$

But the 5% value of Chi-square for 3 d.f. is 7.815, which is much greater than the observed
value. Hence, the difference between the observed and the theoretical frequencies is not
significant. The data are thus consistent with the theoretical expectation.
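The derivation of expected frequencies from a theoretical ratio can be sketched as below. The function names are illustrative assumptions. The sketch computes χ² twice: once with the expected frequencies rounded to whole numbers, as in the hand calculation (0.51), and once with the unrounded values 312.75, 104.25, 104.25 and 34.75 (about 0.47). The conclusion is the same in both cases.

```python
def expected_from_ratio(total, ratio):
    """Split `total` in proportion to the terms of `ratio`."""
    parts = sum(ratio)
    return [total * r / parts for r in ratio]


def goodness_of_fit_chi_square(observed, expected):
    """Sum of (f0 - fe)^2 / fe over all classes."""
    return sum((f_o - f_e) ** 2 / f_e for f_o, f_e in zip(observed, expected))


pea_observed = [315, 101, 108, 32]  # round yellow, wrinkled yellow, round green, wrinkled green
pea_expected = expected_from_ratio(sum(pea_observed), [9, 3, 3, 1])
pea_expected_rounded = [round(e) for e in pea_expected]

chi2_pea_exact = goodness_of_fit_chi_square(pea_observed, pea_expected)
chi2_pea_rounded = goodness_of_fit_chi_square(pea_observed, pea_expected_rounded)
print(pea_expected_rounded, round(chi2_pea_rounded, 2), round(chi2_pea_exact, 2))
```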
Example 7: Ten random samples of 100 items each have been collected. Are the following
frequencies of the males and females consistent with an expectation of equal division of sexes?

Sample:     1     2     3     4     5     6     7     8     9    10
Male       40    52    49    50    43    48    42    45    41    51
Female     60    48    51    50    57    52    58    55    59    49

Expected number of males or females = np = ½ x 100 = 50 in each sample. The differences between
the observed and expected frequencies are the same for males and females in each sample. Hence,
Chi-square = 2(100+4+1+0+49+4+64+25+81+1)(1/50) = 2x329/50 = 13.16.
But the 5% value of Chi-square for 10 d.f. is 18.307, which is greater than the observed value.
Therefore, the data do not discredit the hypothesis of equal proportions of the males and females
in the population.
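The calculation of Example 7 may be sketched as follows. Variable names are illustrative assumptions. The sketch also assumes that every sample contains exactly 100 items, so the female count is taken as 100 minus the male count in each sample; this is the reading under which the squared deviations 100, 4, 1, 0, 49, 4, 64, 25, 81 and 1 used in the example apply to both sexes.

```python
males = [40, 52, 49, 50, 43, 48, 42, 45, 41, 51]
SAMPLE_SIZE = 100            # assumed identical for all ten samples
expected = SAMPLE_SIZE / 2   # 50 males and 50 females expected per sample

chi2_sexes = 0.0
for m in males:
    f = SAMPLE_SIZE - m      # assumption: females = 100 - males
    chi2_sexes += (m - expected) ** 2 / expected + (f - expected) ** 2 / expected

sum_sq_dev_males = sum((m - expected) ** 2 for m in males)
print(sum_sq_dev_males, round(chi2_sexes, 2))
```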
Books Recommended:
1. Croxton and Cowden: Applied General Statistics.
2. Kenney, J.F. and Keeping, E.S.: Mathematics of Statistics, Part I & II, Affiliated East-West
Press Pvt. Ltd., New Delhi.
3. Rosander, A.C.: Elementary Principles of Statistics, Affiliated East-West Press Pvt. Ltd.,
New Delhi.
4. Yamane, Taro: Statistics.
5. Yule and Kendall: Introduction to the Theory of Statistics.
6. Weatherburn, C.E.: A First Course in Mathematical Statistics, Cambridge University
Press, London.
7. Kane, Edward J.: Economic Statistics & Econometrics, Harper & Row, New York,
Evanston & London and John Weatherhill, Inc., Tokyo.
8. Koutsoyiannis
9. Cooper, Donald R. and Schindler, Pamela S.: Business Research Methods, Tata McGraw
Hill, New Delhi.
10. Levine, David M., Krehbiel, Timothy L. and Berenson, Mark L.: Business Statistics,
Pearson Education Asia, 2001.
11. Prakash and Subramanian (2006): Determination of Share Prices: Analysis of a Select
Group of Indian Companies, forthcoming in Finance India.
