Measures of Association:
Cross-tabulation and Correlation
Section two
Dr. Paul Bottomley (Room F03)
bottomleypa@cf.ac.uk
Readings: Silver, pp. 89-103, 106-110.
Measures of Association
Analysis of relationships between two variables forms the
next logical step beyond descriptive statistics.
What is the (i) nature and (ii) strength of the relationship?
(iii) Will it hold for the wider population, not just our sample?
Level of measurement      Measure of association         Example data (X, Y)
Metric (interval/ratio)   Pearson's correlation          (4, 302), (6, 145), (7, 56)
Ordinal                   Spearman's correlation (rho)   (2, 7), (6, 8), (7, 4)
Nominal                   Phi, for 0/1 data              (1, 1), (0, 0), (1, 0)
An Introduction to Hypothesis Testing
A hypothesis is a fancy word for a prediction about a population
parameter that is tested by a sample statistic.
We first set up the null hypothesis that no relationship exists in the
population, then use a statistic to measure how close our sample
results are to what would be expected if the null is true.
Typically, we want to reject the null hypothesis but must have
strong empirical evidence to do so.
Only if an empirical result is unlikely to be due to sampling error /
chance do we regard it as being strong or statistically significant.
In the social sciences, a result is conventionally considered significant
when a result at least this extreme would occur less than 5% of the time
if the null hypothesis were true.
In other words, the significance level is a risk factor: it tells us the
chance of being wrong if we reject a null hypothesis that is actually true.
Contingency Tables & Nominal Data
A contingency table lists all possible outcomes of two
variables and the number of times (frequency) they occur.
Suitable for nominal data, but metric data can also be reduced
to a few categories: gender vs. money spent on snacks at the cinema.
Recent Corporate Identity Changes
As the sample is random, the row and column totals are our best
guide to the chance of events occurring in the population.
Calculating Expected Frequencies
What is the chance of finding a customer who is male?
(a) 157 out of 320 customers are men.
What is the chance of a customer liking the blue card?
(b) 60 out of 320 customers prefer the blue credit card.
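The two chances above can be combined into an expected frequency. The following is a minimal sketch, assuming gender and card preference are independent (which is what the null hypothesis claims); the variable names are illustrative and not from the slides.

```python
# Counts taken from the slide: 320 customers, 157 men, 60 who prefer the blue card.
n_total = 320
n_male = 157
n_blue = 60

p_male = n_male / n_total   # chance that a customer is male
p_blue = n_blue / n_total   # chance that a customer prefers the blue card

# Under independence the joint chance is the product of the two chances,
# so the expected count is n * p_male * p_blue, i.e. (157 * 60) / 320.
expected_male_blue = n_total * p_male * p_blue
print(round(expected_male_blue, 2))
```

This is the same shortcut as (row total * column total) / grand total.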
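\[ \chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} \]
where O_i is the observed frequency and E_i the expected frequency in cell i.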
Because differences can be positive or negative, we square
them so that they don't cancel each other out.
Repeat for each cell in the table, then sum together.
Some More Arithmetic
Cell    Oi    Ei                          Oi - Ei   (Oi - Ei)^2 / Ei
a       18    (157 *  60) / 320 = 29.44   -11.44    4.44
b       53    (157 * 117) / 320 = 57.40    -4.40    0.34
c       86    (157 * 143) / 320 = 70.16    15.84    3.58
d       42    (163 *  60) / 320 = 30.56    11.44    4.28
e       64    (163 * 117) / 320 = 59.60     4.40    0.33
f       57    (163 * 143) / 320 = 72.84   -15.84    3.44
Total   320                               Chi-sq =  16.41
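The arithmetic in the table can be checked with a short script. This is a sketch only: the layout of cells a to f as a 2 x 3 table (men in the first row, the blue card in the first column) is inferred from the row and column totals used in the Ei formulas, and the variable names are my own.

```python
# Observed counts, arranged as inferred from the table (cells a-c, then d-f).
observed = [
    [18, 53, 86],   # row total 157 (men, per the slide)
    [42, 64, 57],   # row total 163 (assumed to be women)
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi_sq = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n   # expected frequency Ei
        chi_sq += (o - e) ** 2 / e              # this cell's contribution

print(round(chi_sq, 2))
```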
How do we interpret this statistic? We compare it against a
critical value to find out how unusual it is, assuming the null is true.
Chi-Squared: Interpreting the Result
The value of our chi-squared statistic = 16.41.
If the null hypothesis (H0), namely that there is no relationship between
gender and color preference, were true, would a value this large occur
less than 5% of the time? In other words, is the chance of wrongly
rejecting a true H0 below 5%?
Critical Value
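As a rough check on this comparison, here is a sketch that works out the 5% critical value for our example. It assumes the usual degrees of freedom for a contingency table, (rows - 1) x (columns - 1) = 2 here, and relies on the fact that for exactly 2 degrees of freedom the chi-squared tail probability has the closed form exp(-x / 2). For other degrees of freedom a statistical table or library would be needed. Variable names are illustrative.

```python
import math

chi_sq = 16.41                  # statistic from the worked example
rows, cols = 2, 3               # gender x card color
df = (rows - 1) * (cols - 1)    # degrees of freedom
alpha = 0.05                    # 5% significance level

# Valid for df = 2 only: P(chi-squared > x) = exp(-x / 2).
critical_value = -2 * math.log(alpha)   # value exceeded 5% of the time under H0
p_value = math.exp(-chi_sq / 2)         # chance of a statistic this large under H0

reject_null = chi_sq > critical_value
print(df, round(critical_value, 2), reject_null)
```

Because 16.41 is well above the critical value of about 5.99, the null hypothesis of no relationship is rejected at the 5% level.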
Cramér's V
\[ V = \sqrt{\frac{\chi^2 / n}{\min(r - 1,\; c - 1)}} = \sqrt{\frac{16.41 / 320}{2 - 1}} = 0.226 \]
Phi coefficient
\[ \phi = \sqrt{\frac{\chi^2}{n}} \]
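Both strength measures follow directly from the chi-squared statistic. A minimal sketch using the numbers from the worked example (variable names are my own):

```python
import math

chi_sq = 16.41      # chi-squared statistic from the example
n = 320             # sample size
rows, cols = 2, 3   # dimensions of the contingency table

phi = math.sqrt(chi_sq / n)

# Cramér's V rescales by the smaller of (rows - 1) and (cols - 1).
cramers_v = math.sqrt((chi_sq / n) / min(rows - 1, cols - 1))

print(round(phi, 3), round(cramers_v, 3))
```

Here min(r - 1, c - 1) = 1, so phi and V coincide; they differ only when both variables have more than two categories.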
What's Making Our Kids Materialistic?
[Figure: scatterplot of car price (0 to 35,000) against Age of Car/Years (0 to 25)]
Linearity and non-linearity: the price drop slows down with age
(we can't have negatively priced cars).
Outlying data: there are a few atypical data points.
Negative Correlations of Varying Strength
[Figure: four scatterplots showing negative correlations of r = -0.9, r = -0.7, r = -0.5 and r = -0.3]
Sales (Y)   Price (X)
248         891
79          1295
74          1451
62          580
56          1192
48          1285
34          757
[Figure: scatterplot of Sales (Y) against Price (X)]
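\[ r = \frac{n \sum XY - \sum X \sum Y}{\sqrt{\left[ n \sum X^2 - \left( \sum X \right)^2 \right] \left[ n \sum Y^2 - \left( \sum Y \right)^2 \right]}} \]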
n refers to the number of pairs of data points.
As usual, the best way to calculate this is by using a table.
The magic numbers that we substitute into the formula are
the respective column totals.
B&O: Correlation of Sales with Price
Sales (Y)   Price (X)   XY        Y^2       X^2
284         800         227200    80656     640000
248         891         220968    61504     793881
79          1295        102305    6241      1677025
74          1451        107374    5476      2105401
62          580         35960     3844      336400
56          1192        66752     3136      1420864
48          1285        61680     2304      1651225
34          757         25738     1156      573049
Sum
885         8251        847977    164317    9197845
\[ B = n \sum X^2 - \left( \sum X \right)^2 = 8 \times 9197845 - (8251)^2 = 5503759 \]
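The whole calculation can be reproduced from the eight data pairs. This is a sketch under one naming assumption: the slides only name the term B, so the labels a (numerator) and c (the Y term under the square root) are my own.

```python
import math

# Data pairs from the table: sales (Y) and price (X).
sales = [284, 248, 79, 74, 62, 56, 48, 34]
price = [800, 891, 1295, 1451, 580, 1192, 1285, 757]

n = len(sales)                                    # number of pairs
sum_x = sum(price)
sum_y = sum(sales)
sum_xy = sum(x * y for x, y in zip(price, sales))
sum_x2 = sum(x * x for x in price)
sum_y2 = sum(y * y for y in sales)

a = n * sum_xy - sum_x * sum_y   # numerator (label assumed)
b = n * sum_x2 - sum_x ** 2      # the term called B on the slide
c = n * sum_y2 - sum_y ** 2      # matching term for Y (label assumed)

r = a / math.sqrt(b * c)
print(round(r, 3))
```

With these figures r comes out at about -0.30, a fairly weak negative correlation between price and sales.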
[Figure: scatterplot of Y against X] Strong non-linear association of X with
Y gives a weak or zero value of r
[Figure: scatterplot of Y against X] Strong linear association between X
and Y weakened to zero by an outlier
ALWAYS DRAW A PICTURE!
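Both pitfalls can be reproduced with tiny invented data sets. The numbers below are made up purely for illustration (they are not the data behind the slide's plots), and pearson_r is a hypothetical helper that applies the column-total formula for r.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r computed from column totals."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Invented data: a perfect U-shaped (non-linear) relationship, y = (x - 4)^2.
x_curve = list(range(9))
y_curve = [(x - 4) ** 2 for x in x_curve]
r_curve = pearson_r(x_curve, y_curve)

# Invented data: a perfect straight line ...
x_line = [1, 2, 3, 4, 5]
y_line = [1, 2, 3, 4, 5]
r_line = pearson_r(x_line, y_line)

# ... and the same line with a single outlier added at (7, 0).
r_outlier = pearson_r(x_line + [7], y_line + [0])

print(r_curve, r_line, r_outlier)
```

In both problem cases r is zero even though Y clearly depends on X, which only a plot would reveal.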
Cautionary Tale #2
Limited range decreases the apparent correlation.
Missing middle increases the apparent correlation.
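These two effects can be seen on a small invented data set. The sketch below uses made-up numbers (Y tracks X with a fixed +3 / -3 wobble), and both pearson_r and subset are hypothetical helpers introduced only for this illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r computed from column totals."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Invented data: x = 0..20, y = x + 3 for even x and x - 3 for odd x.
x_full = list(range(21))
y_full = [x + 3 if x % 2 == 0 else x - 3 for x in x_full]

def subset(keep):
    """Return the (xs, ys) pairs whose x value passes the keep() test."""
    pairs = [(x, y) for x, y in zip(x_full, y_full) if keep(x)]
    return [p[0] for p in pairs], [p[1] for p in pairs]

r_full = pearson_r(x_full, y_full)
r_limited = pearson_r(*subset(lambda x: 8 <= x <= 12))      # limited range
r_gap = pearson_r(*subset(lambda x: x <= 4 or x >= 16))     # missing middle

print(round(r_limited, 2), round(r_full, 2), round(r_gap, 2))
```

With these invented numbers r is roughly 0.90 over the full range, falls to about 0.43 when only the middle of the range is sampled, and rises to about 0.94 when the middle is left out.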
[Figure, left panel: performance (%) against Aptitude Test Score, r = +]
[Figure, right panel: Unwanted Pregnancy against Socio-Economic Status, r = + ???]
Z variable: Missing?