Associations Between Categorical Variables

Case where both the explanatory (independent)
variable and the response (dependent) variable
are qualitative (Chapter 7 includes the case
where both are binary, i.e. have 2 levels)

Association: The distributions of responses
differ among the levels of the explanatory
variable (e.g. Party affiliation by gender)
Contingency Tables
Cross-tabulations of frequency counts where the
rows (typically) represent the levels of the
explanatory variable and the columns represent
the levels of the response variable.
Numbers within the table represent the numbers
of individuals falling in the corresponding
combination of levels of the two variables
Row and column totals are called the marginal
distributions for the two variables
Example - Cyclones Near Antarctica
Period of Study: September 1973 - May 1975
Explanatory Variable: Region (40-49, 50-59, 60-79
degrees South latitude)
Response: Season (Autumn (4), Winter (5), Spring (4), Summer (8);
number of months in parentheses)
Units: Cyclones in the study area
The observed cyclones are treated as a random
sample of all cyclones that could have occurred
Source: Howarth (1983), "An Analysis of the Variability of Cyclones around Antarctica and Their
Relation to Sea-Ice Extent," Annals of the Association of American Geographers, Vol. 73, pp. 519-537
Region\Season   Autumn   Winter   Spring   Summer   Total
40-49S             370      452      273      422    1517
50-59S             526      624      513     1059    2722
60-79S             980     1200      995     1751    4926
Total             1876     2276     1781     3232    9165
For each region (row) we can compute the percentage of storms
occurring during each season: the conditional distribution of
season within that region. Of the 1517 cyclones in the 40-49S band,
370 occurred in Autumn, a proportion of 370/1517 = .244, or 24.4%
as a percentage.
Region\Season   Autumn   Winter   Spring   Summer   Total% (n)
40-49S            24.4     29.8     18.0     27.8   100.0 (1517)
50-59S            19.3     22.9     18.8     38.9   100.0 (2722)
60-79S            19.9     24.4     20.2     35.5   100.0 (4926)
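The row percentages above can be reproduced with a few lines of code. The following is a minimal sketch in plain Python; the names `counts`, `row_percentages` and `conditional` are illustrative choices and do not come from the slides.

```python
# Observed cyclone counts from the table above
# (rows: latitude band, columns: season).
seasons = ["Autumn", "Winter", "Spring", "Summer"]
counts = {
    "40-49S": [370, 452, 273, 422],
    "50-59S": [526, 624, 513, 1059],
    "60-79S": [980, 1200, 995, 1751],
}


def row_percentages(row):
    """Return each cell as a percentage of its row total."""
    total = sum(row)
    return [100.0 * cell / total for cell in row]


# Conditional distribution of season within each region.
conditional = {region: row_percentages(row) for region, row in counts.items()}

for region, pcts in conditional.items():
    cells = "  ".join(f"{s}: {p:.1f}" for s, p in zip(seasons, pcts))
    print(f"{region}  {cells}  (n = {sum(counts[region])})")
```

Reporting the row total next to each row of percentages follows the guideline of including the explanatory-category sample sizes.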
[Figure: Graphical Conditional Distributions for Regions. Bar chart of the percentage of cyclones within region (vertical axis "regpct", 10 to 40) by season (Autumn, Winter, Spring, Summer), with bars for each region (40-49S, 50-59S, 60-79S).]
Guidelines for Contingency Tables
Compute percentages for the response (column)
variable within the categories of the explanatory
(row) variable. Note that in journal articles, rows
and columns may be interchanged.
Divide the cell counts by the row (explanatory
category) total and multiply by 100 to obtain a
percent. The row percents will add to 100.
Give title and clearly define variables and
categories.
Include row (explanatory) total sample sizes
Independence & Dependence
Statistically Independent: Population conditional
distributions of one variable are the same across
all levels of the other variable
Statistically Dependent: Conditional Distributions
are not all equal
When testing, researchers typically wish to
demonstrate dependence (alternative hypothesis),
and wish to refute independence (null hypothesis)
Pearson's Chi-Square Test
Can be used for nominal or ordinal explanatory
and response variables
Variables can have any number of distinct levels
Tests whether the distribution of the response
variable is the same for each level of the
explanatory variable ($H_0$: No association between
the variables)
r = # of levels of explanatory variable
c = # of levels of response variable
Intuition behind test statistic
Obtain marginal distribution of outcomes for
the response variable
Apply this common distribution to all levels of
the explanatory variable, by multiplying each
proportion by the corresponding sample size
Measure the difference between actual cell
counts and the expected cell counts in the
previous step
Notation to obtain test statistic
Rows represent explanatory variable (r levels)
Cols represent response variable (c levels)

          1         2         ...   c         Total
1         n_{11}    n_{12}    ...   n_{1c}    n_{1.}
2         n_{21}    n_{22}    ...   n_{2c}    n_{2.}
...       ...       ...       ...   ...       ...
r         n_{r1}    n_{r2}    ...   n_{rc}    n_{r.}
Total     n_{.1}    n_{.2}    ...   n_{.c}    n_{..}
Observed frequency ($f_o$): The number of
individuals falling in a particular cell
Expected frequency ($f_e$): The number we would
expect in that cell, given the sample sizes
observed in the study and the assumption of
independence.
Computed by multiplying the row total and the
column total, and dividing by the overall sample
size: $f_e = n_{i.} n_{.j} / n_{..}$ for the cell in row i and column j.
Applies the overall marginal probability of the
response category to the sample size of the explanatory
category
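As an illustration of the expected-frequency rule (row total times column total, divided by the overall sample size), here is a minimal sketch in plain Python applied to the cyclone counts. The function name `expected_counts` is an assumption made for this example.

```python
# Observed counts (rows: 40-49S, 50-59S, 60-79S;
# columns: Autumn, Winter, Spring, Summer).
observed = [
    [370, 452, 273, 422],
    [526, 624, 513, 1059],
    [980, 1200, 995, 1751],
]


def expected_counts(table):
    """Expected counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


expected = expected_counts(observed)
for row in expected:
    print("  ".join(f"{fe:7.1f}" for fe in row))
```

Each row of expected counts sums to the observed row total, because the same marginal distribution of the response is applied to every explanatory category.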
Large-sample test (all $f_e$ > 5)
$H_0$: Variables are statistically independent
(No association between variables)
$H_a$: Variables are statistically dependent
(Association exists between variables)
Test Statistic: $\chi^2_{obs} = \sum \frac{(f_o - f_e)^2}{f_e}$
P-value: Area above $\chi^2_{obs}$ in the chi-squared
distribution with (r-1)(c-1) degrees of
freedom. (Critical values in Table 8.5)
Observed Cell Counts ($f_o$):
Region\Season   Autumn   Winter   Spring   Summer   Total
40-49S             370      452      273      422    1517
50-59S             526      624      513     1059    2722
60-79S             980     1200      995     1751    4926
Total             1876     2276     1781     3232    9165
Note that overall, (1876/9165)100% = 20.47% of all cyclones
occurred in Autumn. If we apply that percentage to the 1517 that
occurred in the 40-49S band, we would expect (0.2047)(1517) = 310.5
to have occurred in the first cell of the table. The full table of $f_e$:
Region\Season   Autumn   Winter   Spring   Summer   Total
40-49S           310.5    376.7    294.8    535.0    1517
50-59S           557.2    676.0    529.0    959.9    2722
60-79S          1008.3   1223.3    957.3   1737.1    4926
Total             1876     2276     1781     3232    9165
Computation of $\chi^2_{obs}$:
Region   Season   fo     fe       (fo-fe)^2   ((fo-fe)^2)/fe
40-49S   Autumn   370    310.5    3540.25     11.4017713
40-49S   Winter   452    376.7    5670.09     15.0520042
40-49S   Spring   273    294.8    475.24      1.61207598
40-49S   Summer   422    535.0    12769       23.8672897
50-59S   Autumn   526    557.2    973.44      1.74702082
50-59S   Winter   624    676.0    2704        4
50-59S   Spring   513    529.0    256         0.48393195
50-59S   Summer   1059   959.9    9820.81     10.2310762
60-79S   Autumn   980    1008.3   800.89      0.79429733
60-79S   Winter   1200   1223.3   542.89      0.44379138
60-79S   Spring   995    957.3    1421.29     1.4846861
60-79S   Summer   1751   1737.1   193.21      0.11122561
Total                                         71.2291706
$H_0$: Seasonal distribution of cyclone occurrences
is independent of latitude band
$H_a$: Seasonal distributions of cyclone occurrences
differ among latitude bands
Test Statistic: $\chi^2_{obs} = 71.2$
P-value: Area in the chi-squared distribution with
(3-1)(4-1) = 6 degrees of freedom above 71.2
From Table 8.5, $P(\chi^2 > 22.46) = .001$, so $P < .001$
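The test above can be checked numerically. The sketch below, in plain Python, computes the Pearson statistic from unrounded expected counts and then the upper-tail area. The p-value helper uses a closed form that holds only for an even number of degrees of freedom (which covers the 6 df here); it is an assumption of this example rather than a general routine, and both function names are illustrative.

```python
import math

observed = [
    [370, 452, 273, 422],
    [526, 624, 513, 1059],
    [980, 1200, 995, 1751],
]


def chi_square_statistic(table):
    """Return (Pearson chi-squared statistic, degrees of freedom)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, fo in enumerate(row):
            fe = row_totals[i] * col_totals[j] / n  # expected count
            stat += (fo - fe) ** 2 / fe
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


def chi2_upper_tail_even_df(x, df):
    """Upper-tail area of the chi-squared distribution, EVEN df only.

    For df = 2m: P(X > x) = exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive, even df")
    half = x / 2.0
    terms = (half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * sum(terms)


stat, df = chi_square_statistic(observed)
p_value = chi2_upper_tail_even_df(stat, df)
print(f"chi-squared = {stat:.3f}, df = {df}, p-value = {p_value:.3g}")
```

Because the expected counts are not rounded to one decimal, the statistic comes out slightly below the hand computation (about 71.19 rather than 71.23). Where SciPy is available, `scipy.stats.chi2_contingency` performs the same test for any degrees of freedom.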
SPSS Output - Cyclone Example
REGION * SEASON Crosstabulation
                                        SEASON
REGION                      Autumn   Winter   Spring   Summer    Total
40-49S   Count                 370      452      273      422     1517
         Expected Count      310.5    376.7    294.8    535.0   1517.0
         % within REGION     24.4%    29.8%    18.0%    27.8%   100.0%
50-59S   Count                 526      624      513     1059     2722
         Expected Count      557.2    676.0    529.0    959.9   2722.0
         % within REGION     19.3%    22.9%    18.8%    38.9%   100.0%
60-79S   Count                 980     1200      995     1751     4926
         Expected Count     1008.3   1223.3    957.3   1737.1   4926.0
         % within REGION     19.9%    24.4%    20.2%    35.5%   100.0%
Total    Count                1876     2276     1781     3232     9165
         Expected Count     1876.0   2276.0   1781.0   3232.0   9165.0
         % within REGION     20.5%    24.8%    19.4%    35.3%   100.0%

Chi-Square Tests
                                 Value        df   Asymp. Sig. (2-sided)
Pearson Chi-Square               71.189 (a)    6   .000
Likelihood Ratio                 71.337        6   .000
Linear-by-Linear Association     23.418        1   .000
N of Valid Cases                 9165
a. 0 cells (.0%) have expected count less than 5. The
minimum expected count is 294.79.
The Asymp. Sig. (2-sided) column is the P-value.
Misuses of chi-squared Test
Expected frequencies too small (all
expected counts should be above 5; this is not
necessary for the observed counts)
Dependent samples (the same individuals
are in each row; see McNemar's test)
Can be used for nominal or ordinal
variables, but more powerful methods exist
for when both variables are ordinal and a
directional association is hypothesized
Residual Analysis
Once dependence has been determined from a chi-squared
test, we are often interested in determining which
cells contributed
Residual: $f_o - f_e$ measures the difference between
the observed and expected counts
Positive implies observed more than expected
A residual's practical importance depends on the level of $f_e$
Adjusted Residual (computed for each cell):
$\frac{f_o - f_e}{\sqrt{f_e (1 - \text{row proportion})(1 - \text{column proportion})}}$
Adjusted residuals above 3 in absolute value give strong evidence against independence in
that cell
Adjusted residuals are computed in the following table.
Row proportion for Region 40-49S: 1517/9165 = 0.1655
Column proportion for Season Autumn: 1876/9165 = 0.2047
Region   Season   fo     fe       row prop   col prop   adj res
40-49S   Autumn   370    310.5    0.1655     0.2047     4.144837
40-49S   Winter   452    376.7    0.1655     0.2483     4.898484
40-49S   Spring   273    294.8    0.1655     0.1943     -1.54843
40-49S   Summer   422    535      0.1655     0.3526     -6.64664
50-59S   Autumn   526    557.2    0.297      0.2047     -1.76769
50-59S   Winter   624    676      0.297      0.2483     -2.75125
50-59S   Spring   513    529      0.297      0.1943     -0.92433
50-59S   Summer   1059   959.9    0.297      0.3526     4.741291
60-79S   Autumn   980    1008.3   0.5375     0.2047     -1.4695
60-79S   Winter   1200   1223.3   0.5375     0.2483     -1.12983
60-79S   Spring   995    957.3    0.5375     0.1943     1.996065
60-79S   Summer   1751   1737.1   0.5375     0.3526     0.609481
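The adjusted residuals in the table can be recomputed as follows. This is a minimal sketch in plain Python, with `adjusted_residuals` as an illustrative name; it uses unrounded expected counts and proportions, so results agree with the table only to about two decimals.

```python
import math

observed = [
    [370, 452, 273, 422],
    [526, 624, 513, 1059],
    [980, 1200, 995, 1751],
]


def adjusted_residuals(table):
    """(fo - fe) / sqrt(fe * (1 - row proportion) * (1 - column proportion))."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    result = []
    for i, row in enumerate(table):
        out = []
        for j, fo in enumerate(row):
            fe = row_totals[i] * col_totals[j] / n
            row_prop = row_totals[i] / n
            col_prop = col_totals[j] / n
            denom = math.sqrt(fe * (1 - row_prop) * (1 - col_prop))
            out.append((fo - fe) / denom)
        result.append(out)
    return result


adj = adjusted_residuals(observed)

# Flag the cells whose adjusted residual exceeds 3 in absolute value.
flagged = [(i, j) for i, row in enumerate(adj)
           for j, value in enumerate(row) if abs(value) > 3]
print(flagged)
```

With these data the flagged cells are Autumn, Winter and Summer in the 40-49S band and Summer in the 50-59S band, matching the entries above 3 in absolute value in the table.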
2x2 Tables
Each variable has 2 levels
Explanatory Variable: Groups (typically
based on demographics, exposure, or treatment)
Response Variable: Outcome (typically
presence or absence of a characteristic)
Measures of association
Relative Risk (Prospective Studies)
Odds Ratio (Prospective or Retrospective)
Absolute Risk (Prospective Studies)
2x2 Tables - Notation
                Outcome Present   Outcome Absent   Group Total
Group 1         n_{11}            n_{12}           n_{1.}
Group 2         n_{21}            n_{22}           n_{2.}
Outcome Total   n_{.1}            n_{.2}           n_{..}
Relative Risk
Ratio of the probability that the outcome
characteristic is present for one group, relative
to the other
Sample proportions with characteristic from
groups 1 and 2:
$\hat{\pi}_1 = \frac{n_{11}}{n_{1.}}$, $\hat{\pi}_2 = \frac{n_{21}}{n_{2.}}$
Estimated Relative Risk: $RR = \frac{\hat{\pi}_1}{\hat{\pi}_2}$
95% Confidence Interval for Population
Relative Risk:
$(RR\,(e^{-1.96\sqrt{v}}),\ RR\,(e^{1.96\sqrt{v}}))$
where $e = 2.71828$ and $v = \frac{1-\hat{\pi}_1}{n_{11}} + \frac{1-\hat{\pi}_2}{n_{21}}$
Relative Risk: Interpretation
Conclude that the probability that the outcome
is present is higher (in the population) for group
1 if the entire interval is above 1
Conclude that the probability that the outcome
is present is lower (in the population) for group
1 if the entire interval is below 1
Do not conclude that the probability of the
outcome differs for the two groups if the
interval contains 1
Example - Coccidioidomycosis and
TNFα-antagonists
Research Question: Risk of developing
Coccidioidomycosis associated with arthritis
therapy?
Groups: Patients receiving tumor necrosis
factor α (TNFα) versus patients not receiving
TNFα (all patients arthritic)
         COC   No COC   Total
TNFα       7      240     247
Other      4      734     738
Total     11      974     985
Source: Bergstrom, et al (2004)
Group 1: Patients on TNFα
Group 2: Patients not on TNFα
$\hat{\pi}_1 = \frac{7}{247} = .0283$, $\hat{\pi}_2 = \frac{4}{738} = .0054$
$RR = \frac{\hat{\pi}_1}{\hat{\pi}_2} = \frac{.0283}{.0054} = 5.24$
$v = \frac{1-.0283}{7} + \frac{1-.0054}{4} = .3874$
95% CI: $(5.24\,e^{-1.96\sqrt{.3874}},\ 5.24\,e^{1.96\sqrt{.3874}}) = (1.55, 17.76)$
Entire CI above 1, so conclude higher risk if on TNFα
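The relative risk calculation can be sketched in plain Python as below. The function name `relative_risk_ci` and its argument names are assumptions made for this example; the arithmetic follows the estimate and interval formulas given for the relative risk.

```python
import math


def relative_risk_ci(n11, n1_total, n21, n2_total, z=1.96):
    """Return (RR, lower, upper) for the population relative risk.

    n11, n21: counts with the outcome present in groups 1 and 2
    n1_total, n2_total: group sample sizes
    """
    p1 = n11 / n1_total
    p2 = n21 / n2_total
    rr = p1 / p2
    v = (1 - p1) / n11 + (1 - p2) / n21
    margin = z * math.sqrt(v)
    return rr, rr * math.exp(-margin), rr * math.exp(margin)


# Group 1: on TNF-alpha (7 of 247); Group 2: not on TNF-alpha (4 of 738).
rr, lower, upper = relative_risk_ci(7, 247, 4, 738)
print(f"RR = {rr:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

The code does not round the sample proportions before dividing, so it gives RR near 5.23 and an interval near (1.54, 17.71), slightly different from the hand calculation based on the rounded values .0283 and .0054. The conclusion is the same: the whole interval lies above 1.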
Odds Ratio
Odds of an event is the probability it occurs
divided by the probability it does not occur
Odds ratio is the odds of the event for group 1
divided by the odds of the event for group 2
Sample odds of the outcome for each group:
$odds_1 = \frac{n_{11}/n_{1.}}{n_{12}/n_{1.}} = \frac{n_{11}}{n_{12}}$, $odds_2 = \frac{n_{21}}{n_{22}}$
Estimated Odds Ratio:
$OR = \frac{odds_1}{odds_2} = \frac{n_{11}/n_{12}}{n_{21}/n_{22}} = \frac{n_{11} n_{22}}{n_{12} n_{21}}$
95% Confidence Interval for
Population Odds Ratio:
$(OR\,(e^{-1.96\sqrt{v}}),\ OR\,(e^{1.96\sqrt{v}}))$
where $e = 2.71828$ and $v = \frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}$
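Odds Ratio: Interpretation
Conclude that the probability that the outcome
is present is higher (in the population) for group
1 if the entire interval is above 1
Conclude that the probability that the outcome
is present is lower (in the population) for group
1 if the entire interval is below 1
Do not conclude that the probability of the
outcome differs for the two groups if the
interval contains 1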
Example - NSAIDs and GBM
Case-Control Study (Retrospective)
Cases: 137 Self-Reporting Patients with Glioblastoma
Multiforme (GBM)
Controls: 401 Population-Based Individuals matched to
cases with respect to demographic factors
                 GBM Present   GBM Absent   Total
NSAID User                32          138     170
NSAID Non-User           105          263     368
Total                    137          401     538
Source: Sivak-Sears, et al (2004)
$OR = \frac{32(263)}{138(105)} = \frac{8416}{14490} = 0.58$
$v = \frac{1}{32} + \frac{1}{138} + \frac{1}{105} + \frac{1}{263} = 0.0518$
95% CI: $(0.58\,e^{-1.96\sqrt{0.0518}},\ 0.58\,e^{1.96\sqrt{0.0518}}) = (0.37, 0.91)$
Interval is entirely below 1, so NSAID
use appears to be lower among
cases than controls
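A minimal sketch of the odds ratio and its interval in plain Python follows. The function name `odds_ratio_ci` is an illustrative assumption; the cell labels follow the 2x2 notation, with group 1 as NSAID users and "outcome present" as GBM present.

```python
import math


def odds_ratio_ci(n11, n12, n21, n22, z=1.96):
    """Return (OR, lower, upper) for the population odds ratio."""
    odds_ratio = (n11 * n22) / (n12 * n21)
    v = 1 / n11 + 1 / n12 + 1 / n21 + 1 / n22
    margin = z * math.sqrt(v)
    return (odds_ratio,
            odds_ratio * math.exp(-margin),
            odds_ratio * math.exp(margin))


# Rows: NSAID user, NSAID non-user; columns: GBM present, GBM absent.
odds_ratio, lower, upper = odds_ratio_ci(32, 138, 105, 263)
print(f"OR = {odds_ratio:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

The odds ratio is the same whether the table is read by rows or by columns, which is why it can be estimated from a retrospective (case-control) design such as this one.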
Absolute Risk
Difference between the proportions with the
outcome characteristic in the 2 groups
Sample proportions with characteristic from
groups 1 and 2:
$\hat{\pi}_1 = \frac{n_{11}}{n_{1.}}$, $\hat{\pi}_2 = \frac{n_{21}}{n_{2.}}$
Estimated Absolute Risk: $AR = \hat{\pi}_1 - \hat{\pi}_2$
95% Confidence Interval for Population
Absolute Risk:
$AR \pm 1.96\sqrt{\frac{\hat{\pi}_1(1-\hat{\pi}_1)}{n_{1.}} + \frac{\hat{\pi}_2(1-\hat{\pi}_2)}{n_{2.}}}$
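Absolute Risk: Interpretation
Conclude that the probability that the outcome
is present is higher (in the population) for group
1 if the entire interval is positive
Conclude that the probability that the outcome
is present is lower (in the population) for group
1 if the entire interval is negative
Do not conclude that the probability of the
outcome differs for the two groups if the
interval contains 0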

Example - Coccidioidomycosis and TNFα-antagonists
Group 1: Patients on TNFα
Group 2: Patients not on TNFα
$\hat{\pi}_1 = \frac{7}{247} = .0283$, $\hat{\pi}_2 = \frac{4}{738} = .0054$
$AR = \hat{\pi}_1 - \hat{\pi}_2 = .0283 - .0054 = .0229$
95% CI: $.0229 \pm 1.96\sqrt{\frac{.0283(.9717)}{247} + \frac{.0054(.9946)}{738}} = .0229 \pm .0213 = (0.0016, 0.0442)$
Interval is entirely positive, so TNFα is associated
with higher risk
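The absolute risk interval can be verified with a short plain-Python sketch. The function name `absolute_risk_ci` is an assumption made for this example.

```python
import math


def absolute_risk_ci(n11, n1_total, n21, n2_total, z=1.96):
    """Return (AR, lower, upper) for the difference in proportions."""
    p1 = n11 / n1_total
    p2 = n21 / n2_total
    ar = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1_total + p2 * (1 - p2) / n2_total)
    return ar, ar - z * se, ar + z * se


# Group 1: on TNF-alpha (7 of 247); Group 2: not on TNF-alpha (4 of 738).
ar, lower, upper = absolute_risk_ci(7, 247, 4, 738)
print(f"AR = {ar:.4f}, 95% CI = ({lower:.4f}, {upper:.4f})")
```

Unlike the relative risk and odds ratio intervals, this interval is symmetric about the estimate and is compared with 0 rather than 1.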
Ordinal Explanatory and Response
Variables
Pearson's Chi-square test can be used to test
associations among ordinal variables, but more
powerful methods exist
When theories exist that the association is
directional (positive or negative), measures exist
to describe and test for these specific alternatives
from independence:
Gamma
Kendall's $\tau_b$
Concordant and Discordant Pairs
Concordant Pairs - Pairs of individuals where one
individual scores higher on both ordered variables
than the other individual
Discordant Pairs - Pairs of individuals where one
individual scores higher on one ordered variable
and the other individual scores higher on the other
C = # Concordant Pairs, D = # Discordant Pairs
Under Positive association, expect C > D
Under Negative association, expect C < D
Under No association, expect C ≈ D
Example - Alcohol Use and Sick Days
Alcohol Risk (Without Risk, Hardly any Risk,
Some to Considerable Risk)
Sick Days (0, 1-6, 7+)
Concordant Pairs - Pairs of respondents where one
scores higher on both alcohol risk and sick days
than the other
Discordant Pairs - Pairs of respondents where one
scores higher on alcohol risk and the other scores
higher on sick days
Source: Hermansson, et al (2003)

ALCOHOL * SICKDAYS Crosstabulation (Count)
                                    SICKDAYS
ALCOHOL                   0 days   1-6 days   7+ days   Total
Without Risk                 347        113       145     605
Hardly any Risk              154         63        56     273
Some-Considerable Risk        52         25        34     111
Total                        553        201       235     989
Concordant Pairs: Each individual in a
given cell is concordant with each individual
in cells Southeast of theirs
Discordant Pairs: Each individual in a given
cell is discordant with each individual in
cells Southwest of theirs
$C = 347(63+56+25+34) + 113(56+34) + 154(25+34) + 63(34) = 83164$
$D = 145(154+63+52+25) + 113(154+52) + 56(52+25) + 63(52) = 73496$
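The Southeast/Southwest counting rule can be written as a short loop. This is a minimal sketch in plain Python for a table whose rows and columns are both in increasing order; the function name `concordant_discordant` is an illustrative choice.

```python
# ALCOHOL (rows, increasing risk) by SICKDAYS (columns, increasing days).
table = [
    [347, 113, 145],
    [154, 63, 56],
    [52, 25, 34],
]


def concordant_discordant(table):
    """Count concordant and discordant pairs in an ordered two-way table."""
    n_rows, n_cols = len(table), len(table[0])
    concordant = discordant = 0
    for i in range(n_rows):
        for j in range(n_cols):
            # Compare cell (i, j) with every cell in a lower row.
            for k in range(i + 1, n_rows):
                for m in range(n_cols):
                    pairs = table[i][j] * table[k][m]
                    if m > j:      # Southeast: higher on both variables
                        concordant += pairs
                    elif m < j:    # Southwest: higher on one, lower on other
                        discordant += pairs
    return concordant, discordant


C, D = concordant_discordant(table)
print(C, D)
```

Pairs that fall in the same row or the same column are tied on one variable and are counted in neither C nor D.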
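Measures of Association
Goodman and Kruskal's Gamma:
$\hat{\gamma} = \frac{C - D}{C + D}$, with $-1 \le \hat{\gamma} \le 1$
Kendall's $\tau_b$:
$\hat{\tau}_b = \frac{C - D}{0.5\sqrt{(n^2 - \sum_i n_{i.}^2)(n^2 - \sum_j n_{.j}^2)}}$
where n is the overall sample size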
When there's no association between the ordinal variables,
the population-based values of these measures are 0.
Statistical software packages provide these tests.
$\hat{\gamma} = \frac{C - D}{C + D} = \frac{83164 - 73496}{83164 + 73496} = 0.0617$
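Both measures can be computed from C, D and the table margins. The sketch below is plain Python with illustrative names; C and D are taken as given from the hand calculation, and the tau-b denominator uses the standard form with a square root over the product of the two margin terms.

```python
import math

# ALCOHOL by SICKDAYS counts, with the pair counts from the example.
table = [
    [347, 113, 145],
    [154, 63, 56],
    [52, 25, 34],
]
C, D = 83164, 73496


def gamma_and_tau_b(table, concordant, discordant):
    """Return (gamma, Kendall's tau-b) for an ordered two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    diff = concordant - discordant
    gamma = diff / (concordant + discordant)
    row_term = n ** 2 - sum(r ** 2 for r in row_totals)
    col_term = n ** 2 - sum(c ** 2 for c in col_totals)
    tau_b = diff / (0.5 * math.sqrt(row_term * col_term))
    return gamma, tau_b


gamma, tau_b = gamma_and_tau_b(table, C, D)
print(f"gamma = {gamma:.4f}, tau-b = {tau_b:.4f}")
```

Tau-b is smaller in magnitude than gamma here because its denominator also accounts for pairs tied on one variable, while gamma ignores ties.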
Symmetric Measures
                                       Value   Asymp. Std. Error (a)   Approx. T (b)   Approx. Sig.
Ordinal by Ordinal   Kendall's tau-b    .035   .030                    1.187           .235
                     Gamma              .062   .052                    1.187           .235
N of Valid Cases                         989
a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.
