
Full-time MBA: Business Statistics (BST510)

Measures of Association:
Cross-tabulation and Correlation

Section two
Dr. Paul Bottomley (Room F03)
bottomleypa@cf.ac.uk
Readings: Silver, pp. 89-103, 106-110.
Measures of Association
Analysis of relationships between two variables forms the
next logical step beyond descriptive statistics.
What is the (i) nature and (ii) strength of the relationship, and
(iii) will it hold for the wider population, not just our sample?
Metric (Int./Ratio):          Ordinal:                        Nominal:
Pearson's Correlation         Spearman's Correlation (rho)    Phi (for 0 / 1 data)

Price/£   Sales/1000s         Satisfaction   Income band      Gender       Level (Clerk=0,
          per week            (1-7)          (1-10)           (m=0/f=1)    Manager=1)
3         250                 3              2                0            1
4         302                 2              7                1            1
6         145                 6              8                0            0
7         56                  7              4                1            0
An Introduction to Hypothesis Testing
A hypothesis is a fancy word for a prediction about a population
parameter that is tested by a sample statistic.
We first set up the null hypothesis that no relationship exists in the
population, then use a statistic to measure how close our sample
results are to what would be expected if the null is true.
Typically, we want to reject the null hypothesis but must have
strong empirical evidence to do so.
Only if an empirical result is unlikely to be due to sampling error /
chance do we regard it as being strong or statistically significant.
In social sciences, a result is considered significant when a result
at least this extreme would occur less than 5% of the time if the
null hypothesis were true.
In other words, the significance level is a risk factor - it tells us the
chance of wrongly rejecting the null hypothesis when it is in fact true.
Contingency Tables & Nominal Data
A contingency table lists all possible outcomes of two
variables and the number of times (frequency) they occur.

Why use them?
Widely used by market researchers.
Simple to conduct.
Easy to interpret.
Strong link to managerial action.

Examples:
Guests' complaints across a chain of hotels.
Adoption of technology (internet) and social class.
Businessmen's type of airline ticket and nature of flight
(domestic / international).

Suitable for nominal data but metric data also can be reduced
to a few categories: gender vs. money on snacks at cinema.
Recent Corporate Identity Changes

Henderson & Cote (1998), Journal of Marketing.
Ralf Van Der Lans et al. (2009), Marketing Science.
A Contingency Table
A bank designing a new credit card asks a random sample
of 320 customers about their color preferences.
The contingency table shows the observed frequencies of
the two nominal variables.
18 men prefer the blue credit card.
          Blue   Red   Green
Male        18    53      86
Female      42    64      57

Q: Null hypothesis: there is no relationship between gender
and color preference. (But we'd like to reject this pop. claim!)
A Contingency Table (2)
The same table with row and column totals added:

             Blue   Red   Green   Row Total
Male           18    53      86         157
Female         42    64      57         163
Col. Total     60   117     143         320
Chi-Squared Contingency Test
Statistical issue: are the relative frequencies (proportions) in
each group so different that it is unlikely that they came from
the same population? (Let's look along each row).
Test logic: Is there a meaningful difference between the
observed frequencies (in Table) and expected frequencies
assuming that the null hypothesis is true (no relationship)?

Q: How many customers would we expect to find who are (i)
male, and (ii) prefer the blue card, if there is no relationship
(independence) and (iii) given a sample of 320 participants?

As the sample is random, the row and column totals are our best
guide to the chance of events occurring in the population.
Calculating Expected Frequencies
What is the chance of finding a customer who is male?
(a) 157 out of 320 customers are men.
What is the chance of a customer liking the blue card?
(b) 60 out of 320 customers prefer the blue credit card.

If we multiply (a) by (b), we find the probability that a customer
is both male AND prefers blue (range 0 to 1).
But we want the expected number of customers, so multiply the
probability by the sample size.

$$E = \frac{157}{320} \times \frac{60}{320} \times 320 = \frac{\text{Row Total} \times \text{Column Total}}{\text{Sample Size}} = \frac{157 \times 60}{320} = 29.44 \text{ customers}$$
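The same rule applies to every cell of the table. A minimal Python sketch, assuming the observed counts are stored as a nested list (the name `expected_frequencies` is illustrative and not part of the lecture):

```python
# Observed counts from the credit-card example.
# Rows: Male, Female. Columns: Blue, Red, Green.
observed = [
    [18, 53, 86],
    [42, 64, 57],
]


def expected_frequencies(table):
    """Expected count per cell under independence:
    row total * column total / sample size."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


expected = expected_frequencies(observed)
for label, row in zip(["Male", "Female"], expected):
    print(label, [round(e, 2) for e in row])
```

The first entry of the Male row reproduces the 29.44 customers worked out by hand.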
Some Arithmetic
Cell   Oi    Ei                            Oi - Ei   (Oi - Ei)^2 / Ei
a      18    (157 *  60) / 320 = 29.44
b      53    (157 * 117) / 320 = 57.40
c      86    (157 * 143) / 320 = 70.16
d      42    (163 *  60) / 320 = 30.56
e      64    (163 * 117) / 320 = 59.60
f      57    (163 * 143) / 320 = 72.84
Sum    320

Oi = Observed cell frequency, Ei = Expected cell frequency.

Expected frequencies show how the sample would be allocated
between cells if there was no relationship (null hypothesis is true).
Chi-Squared Independence Test
The chi-squared statistic (χ²) tests whether two nominal variables
are independent (no relationship) or are related.
It depends on the extent that the observed frequencies (Oi)
differ from their expected frequencies (Ei). (i = 1 to 6 cells).

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

Because differences can be positive or negative, we square
them so that they don't cancel each other out.
Repeat for each cell in the table, then sum together.
Some More Arithmetic
Cell   Oi    Ei                            Oi - Ei   (Oi - Ei)^2 / Ei
a      18    (157 *  60) / 320 = 29.44     -11.44    4.44
b      53    (157 * 117) / 320 = 57.40      -4.40    0.34
c      86    (157 * 143) / 320 = 70.16      15.84    3.58
d      42    (163 *  60) / 320 = 30.56      11.44    4.28
e      64    (163 * 117) / 320 = 59.60       4.40    0.33
f      57    (163 * 143) / 320 = 72.84     -15.84    3.44
Sum    320                                 Chi-sq =  16.41
How do we interpret this statistic? We compare it against a
critical value to find out how unusual it is, assuming the null is true.
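The arithmetic in the table can be checked in a few lines. This is a sketch only; the function name `chi_squared` is an illustrative choice, and it works from unrounded expected frequencies, so the final digits may differ slightly from hand calculations based on rounded values.

```python
# Observed counts: rows Male, Female; columns Blue, Red, Green.
observed = [
    [18, 53, 86],
    [42, 64, 57],
]


def chi_squared(table):
    """Sum of (O - E)^2 / E over all cells, with E taken from
    the row and column totals (independence assumed)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n
            statistic += (o - e) ** 2 / e
    return statistic


chi_sq = chi_squared(observed)
print(round(chi_sq, 2))
```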
Chi-Squared: Interpreting the Result
The value of our chi-squared statistic = 16.41.
Would a statistic this large occur less than 5% of the time if the
null hypothesis (H0) were true? That is, is the chance of wrongly
rejecting H0, namely that there is no relationship between gender
and color preference, < 5%?

[Figure: chi-squared distribution. Values below the critical value fall in the "do not reject H0" region; values above it fall in the "reject H0" region.]

If our statistic > critical value, reject null hypothesis.
If our statistic < critical value, retain null hypothesis.
Chi-Squared: Critical Value & Result
To test whether our statistic is significant, we compare it against
the critical value which we find in tables.
Step 1: determine the level of significance (alpha = 0.05).
Step 2: determine the degrees of freedom (v).
v = (r - 1)(c - 1) = (2 rows - 1) x (3 columns - 1) = 2

[Figure: chi-squared distribution with the critical value (5.991) separating the "do not reject H0" region from the "reject H0" region; the test statistic (16.41) lies in the rejection region.]

As our statistic (16.41) is > critical value (5.991), we reject the
null hypothesis, so gender and color preference are related.
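For 2 degrees of freedom the chi-squared distribution has a simple closed form (upper-tail probability exp(-x / 2)), so the tabulated critical value can be reproduced with the standard library alone. This shortcut is an addition for illustration and holds only when df = 2; other degrees of freedom need tables or a statistics library. The function names are hypothetical.

```python
import math


def chi2_critical_df2(alpha):
    """Critical value for a chi-squared test with exactly 2 degrees
    of freedom: solves exp(-x / 2) = alpha for x."""
    return -2.0 * math.log(alpha)


def chi2_upper_tail_df2(statistic):
    """Probability of a statistic at least this large when H0 is true
    (2 degrees of freedom only)."""
    return math.exp(-statistic / 2.0)


critical = chi2_critical_df2(0.05)
tail_probability = chi2_upper_tail_df2(16.41)
reject_h0 = 16.41 > critical

print(round(critical, 3), reject_h0)
```

The tail probability for 16.41 is far below 0.05, which agrees with the decision to reject the null hypothesis.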
Strength of Association: Cramér's V
Chi-sq test shows that the variables are related, but not how.
But we can determine the strength of the association.

$$V = \sqrt{\frac{\chi^2 / n}{\min(r - 1,\; c - 1)}} = \sqrt{\frac{16.41 / 320}{2 - 1}} = 0.226$$

min(r - 1, c - 1) uses the smaller dimension of the contingency table.
Our table has 2 rows and 3 columns, so select (r - 1).
Cramér's V (0 to 1). 0.1 = weak, 0.3 = moderate, 0.5 = strong.
Weak / moderate association between gender and color preference.
Follow-up comparisons using 2 x 2 tables can identify the
source of the difference (e.g. red vs. blue, green vs. red?).
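The Cramér's V calculation can be written as a short function. A minimal sketch, with an illustrative function name and the chi-squared value taken from the worked example:

```python
import math


def cramers_v(chi_sq, n, n_rows, n_cols):
    """Cramér's V: sqrt((chi-squared / n) / min(rows - 1, cols - 1))."""
    smaller_dimension = min(n_rows - 1, n_cols - 1)
    return math.sqrt((chi_sq / n) / smaller_dimension)


# Credit-card example: chi-squared = 16.41, n = 320, 2 rows x 3 columns.
v = cramers_v(16.41, 320, 2, 3)
print(round(v, 3))
```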
Assumptions and Limitations
All observations must be drawn from a random sample.
With stratified and quota samples, survey design issues influence
expected frequencies (% male / female recruited).

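Observations must be unique to one cell.
Counter-example: social class and people playing football or rugby
(someone who plays both would be counted in two cells).

Row percentages for the credit-card example:

          Blue     Red      Green
Male      11.5%    33.8%    54.8%
Female    25.8%    39.3%    35.0%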
Test based on frequencies. Not valid if cells are percentages
but useful for interpretation purposes.
Need to know the sample size (n) to convert this information.
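The percentages in this table are row percentages, so converting them back to counts needs the number of men and women sampled (157 and 163 in the credit-card example). A small sketch with illustrative variable names:

```python
# Row percentages as reported, and the row totals from the original table.
row_percentages = {
    "Male": [11.5, 33.8, 54.8],
    "Female": [25.8, 39.3, 35.0],
}
row_totals = {"Male": 157, "Female": 163}

# Multiply each percentage by its row total and round to whole customers.
counts = {
    group: [round(pct / 100 * row_totals[group]) for pct in percentages]
    for group, percentages in row_percentages.items()
}
print(counts)
```

The recovered counts match the observed frequencies, which can then be used in the chi-squared test.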
Assumptions and Limitations (2)
Avoid small expected frequencies (Ei).
Malhotra and Birks: No expected frequencies < 5.
Cochran: Expected frequencies < 5 in fewer than 20% of cells.
Silver: No expected frequencies < 3.

Solution #1: increase sample size (n).
If n = 320 (640), expect 29.44 (58.88) men to prefer the blue credit card.

Solution #2: combine adjacent cells together if meaningful.
Money on snacks, ice cream: merge "£2.50 up to £5.00" and "£5.00+" into 1 cell.

Solution #3: use Fisher's exact test! (beyond this course).
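The three rules of thumb can be checked mechanically once the expected frequencies are known. A sketch under the thresholds quoted above; the function names and the dictionary keys are hypothetical, and doubling every count is used here only to mimic the n = 640 case.

```python
def expected_frequencies(table):
    """Expected counts under independence."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def check_expected(expected):
    """Apply the three rules of thumb to a table of expected counts."""
    cells = [e for row in expected for e in row]
    share_below_5 = sum(e < 5 for e in cells) / len(cells)
    return {
        "malhotra_birks": min(cells) >= 5,   # no expected frequency < 5
        "cochran": share_below_5 < 0.20,     # < 5 in fewer than 20% of cells
        "silver": min(cells) >= 3,           # no expected frequency < 3
    }


observed = [[18, 53, 86], [42, 64, 57]]
report = check_expected(expected_frequencies(observed))

# Doubling every count doubles the sample size and every expected frequency.
doubled = [[2 * o for o in row] for row in observed]
doubled_expected = expected_frequencies(doubled)
print(report, round(doubled_expected[0][0], 2))
```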


Cinema Snacks: Anyone for
Popcorn, Ice-cream or Hotdogs?
Contingency tables are useful for describing the sample.
Q: How many first year female students were surveyed? A: 43.
Q: What percentage of male students are in their 3rd year? A: 23.2%

Do male and female students spend equal amounts on
snacks when they go to the cinema?
Q: Is this chi-squared test valid?
A: No: two cells have expected frequencies (counts) < 5.
Solution: combine cells £2.50 - £5.00 and £5.00+ into 1 category.

Chi-squared test = 7.005 (df = 2).

Conc: Only a 3% chance of getting this result if the null hypothesis
is true. So reject the null: male and female students don't spend
the same on popcorn / ice-cream; women spend more.
The 2 x 2 Contingency Table
Many statisticians argue that an adjustment should be
made with 2 x 2 tables. (Silver outlines 18 of these!)
Stage 1: Apply adjustment factor (Yates's correction).

$$\chi^2 = \sum_i \frac{\left[\,|O_i - E_i| - 0.5\,\right]^2}{E_i}$$

Vertical bars tell us to take the absolute difference, then
subtract 0.5. This is a conservative test.
Stage 2: Assess strength of the relationship. Cramér's V
simplifies to the phi coefficient.

$$\phi = \sqrt{\frac{\chi^2}{n}}$$
What's Making Our Kids Materialistic?

TV         Affluent   Deprived   Row_Tot
Yes             146        189       335
No              156         68       224
Col_Tot         302        257       559

A study examining the impact of media on materialism and
well-being found that nearly 60% of children had a TV in their
bedrooms.
Q: Determine whether there is a relationship between TV
in bedrooms and social class, or are they unrelated?
Media and Childhood Materialism
Cell   Oi     Ei       |Oi - Ei| - 0.5   [|Oi - Ei| - 0.5]^2 / Ei
a      146    180.98   34.48              6.57
b      189    154.02   34.48              7.72
c      156    121.02   34.48              9.83
d       68    102.98   34.48             11.55
Sum    559                     Chi-sq =  35.66

Test statistic: Chi-squared = 35.66.
Critical value from Tables (5%, d.f. = 1) = 3.841
Test statistic > critical value, so reject null hypothesis.
Having a TV in a child's bedroom is related to social class.
Aside: continuous variables can be split into two-groups,
but test lacks power & difficult to know where to divide.
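The Yates-corrected statistic and the phi coefficient for the TV example can be reproduced as follows. This is an illustrative sketch with a hypothetical function name; phi is not reported on the slide and is computed here only to show Stage 2 of the procedure.

```python
import math

# Rows: TV in bedroom (Yes, No). Columns: Affluent, Deprived.
observed_tv = [
    [146, 189],
    [156, 68],
]


def yates_chi_squared(table):
    """Chi-squared for a 2 x 2 table with Yates's correction:
    sum of (|O - E| - 0.5)^2 / E over the four cells."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n
            statistic += (abs(o - e) - 0.5) ** 2 / e
    return statistic


chi_sq_tv = yates_chi_squared(observed_tv)
n_tv = sum(sum(row) for row in observed_tv)
phi = math.sqrt(chi_sq_tv / n_tv)
print(round(chi_sq_tv, 2), round(phi, 2))
```

With these figures phi comes out at roughly 0.25, which on the scale quoted earlier for Cramér's V would be a weak to moderate association.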
Pearson's Correlation Coefficient
Pearson's correlation (r) summarizes the association between two
metric variables, in terms of:
Its nature (positive or negative)
Its strength (magnitude)
The correlation ranges from -1 to +1 (0 = unrelated).

Do you think the following variables are correlated; if so how?
Advertising and brand sales
Weekend box-office revenue and cinema screenings
People's height and success in chosen career

But, are these relationships linear (big assumption)? A scatter plot
provides a useful first step to investigate this.
A Typical Scatter Plot
[Figure: scatter plot, "Price of Mazda Cars in Australia in 1991" (from http://www.statsci.org/data/index.html). X axis: Age of Car / Years (0 to 25). Y axis: Price of Car / $Au (0 to 45000).]

This plot tells us about:
Groups and patterns in data. Older cars are cheaper, so
no classic cars here!
Linearity and non-linearity. Price drop slows down
with age (we can't have negatively priced cars).
Outlying data. There are a few atypical data points.
Negative Correlations of Varying Strength
[Figure: four scatter plots showing negative correlations of decreasing strength: r = -0.9, r = -0.7, r = -0.5 and r = -0.3.]

Q: What do correlations of (i) 1, and (ii) zero look like?


What About the Relationship Between
Sales and Prices of B&O Televisions?
Don't just simply undertake the calculation. Ask:

What sort of relationship do you expect?
Why do you expect a relationship (theory)?
All hypotheses are made before any data is collected -
theory driven, not after having checked the scatter plot!

What do you know about the quality of the data?
Is the relationship linear or non-linear?
Are there any outliers? Let's check the scatter plot.
Silver (1996)
Scatter Plot: Demand For Selected
Models of B&O Televisions
Sales (Y)   Price (X)
      284         800
      248         891
       79        1295
       74        1451
       62         580
       56        1192
       48        1285
       34         757

[Figure: scatter plot of Sales (Y, 0 to 300) against Price (X, 500 to 1500) for the eight models.]

Is there evidence of a linear relationship?
Is there evidence of any obvious outliers?
But beware reading too much into small samples.
Calculating Pearson's Correlation
While scatter plots are useful, we often want to precisely assess
the extent of the linear relationship between two metric variables.

$$r = \frac{\text{Covariance of } X \text{ and } Y}{\text{Std.Dev. } X \times \text{Std.Dev. } Y}$$

The covariance shows how the two variables (X and Y) change
with each other. Specifically:
How two variables move together around their respective means.
Covariance is not very useful in itself. It needs to be adjusted to
make comparisons easier (units of measurement, sample size).
Covariance of Sales and Price
Covariance: multiply the deviations of each pair of data
points from their respective means, sum and then average.

$$\text{Cov}(XY) = \frac{1}{n} \sum_i (X_i - \bar{X})(Y_i - \bar{Y})$$

Positive covariance: positive (neg.) deviations of X are associated
with positive (neg.) deviations of Y.
Negative covariance: negative deviations of X are associated
with pos. dev. of Y (and vice versa).

[Figure: scatter plot of Sales (Y) against Price (X), divided into four quadrants by the means of X and Y: A (top left), B (top right), C (bottom left), D (bottom right).]

Quadrants B & C = Cov(XY) > 0
Quadrants A & D = Cov(XY) < 0
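Applied to the B&O data, the covariance definition looks like this in code. A minimal sketch using the 1/n form given above; the function name is illustrative.

```python
# B&O television data from the scatter plot slide.
sales = [284, 248, 79, 74, 62, 56, 48, 34]
price = [800, 891, 1295, 1451, 580, 1192, 1285, 757]


def covariance(xs, ys):
    """Average product of deviations from the two means (divides by n)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


cov_xy = covariance(price, sales)
print(round(cov_xy, 2))
```

The result is negative, so on balance the points fall in quadrants A and D. Its size depends on the units of price and sales, which is why it is standardized into r.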
Pearson's Correlation Coefficient (r)
To calculate Pearson's r, we divide the covariance by the
standard deviation of X times the standard deviation of Y.

$$r = \frac{n \sum XY - \sum X \sum Y}{\sqrt{\left[n \sum X^2 - \left(\sum X\right)^2\right]\left[n \sum Y^2 - \left(\sum Y\right)^2\right]}}$$

n refers to the number of pairs of data points.
As usual, the best way to calculate this is by using a table.
The magic numbers that we substitute into the formula are
the respective column totals.
B&O: Correlation of Sales with Price
Sales (Y)   Price (X)        XY      Y^2       X^2
      284         800    227200    80656    640000
      248         891    220968    61504    793881
       79        1295    102305     6241   1677025
       74        1451    107374     5476   2105401
       62         580     35960     3844    336400
       56        1192     66752     3136   1420864
       48        1285     61680     2304   1651225
       34         757     25738     1156    573049
Sum   885        8251    847977   164317   9197845

Y = Sales, because we assume that price (X) causes sales (Y).


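Pearson's Correlation Coefficient (r)

$$r = \frac{n \sum XY - \sum X \sum Y}{\sqrt{\left[n \sum X^2 - \left(\sum X\right)^2\right]\left[n \sum Y^2 - \left(\sum Y\right)^2\right]}} = \frac{A}{\sqrt{B \cdot C}}$$

A = n ΣXY - ΣX ΣY = 8 x 847977 - (885 x 8251) = -518319
B = n ΣX^2 - (ΣX)^2 = 8 x 9197845 - (8251)^2 = 5503759
C = n ΣY^2 - (ΣY)^2 = 8 x 164317 - (885)^2 = 531311

$$r = \frac{-518319}{\sqrt{5503759 \times 531311}} = -0.303$$

As price goes up, sales go down. But buyers of B&O
are not very price sensitive
(small absolute value).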
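The A, B and C terms can be checked directly from the raw data. A sketch with hypothetical function names (`abc_terms`, `pearson_r`); it follows the computational formula on the slide rather than any library routine.

```python
import math

sales = [284, 248, 79, 74, 62, 56, 48, 34]             # Y
price = [800, 891, 1295, 1451, 580, 1192, 1285, 757]   # X


def abc_terms(xs, ys):
    """Return the A, B and C terms of the computational formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    a = n * sum_xy - sum_x * sum_y
    b = n * sum_x2 - sum_x ** 2
    c = n * sum_y2 - sum_y ** 2
    return a, b, c


def pearson_r(xs, ys):
    """Pearson's r = A / sqrt(B * C)."""
    a, b, c = abc_terms(xs, ys)
    return a / math.sqrt(b * c)


print(abc_terms(price, sales))
print(round(pearson_r(price, sales), 3))
```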
Interpreting Pearson's Correlation
From a textbook / small sample perspective
r = +0.7 or -0.7 is a strong correlation
r = +0.5 or -0.5 is a moderate correlation
r = +0.3 or -0.3 is a weak correlation

From a practical / researcher's perspective
Cohen's effect sizes: 0.1 = small, 0.3 = medium and > 0.5 = large.
Cohen's guidelines are based on realistic / reasonably sized samples.
More faith in Cohen's meta-analysis of 1000s of studies.

Pearson's r is a standardized measure of association:
Useful for comparative purposes.
Unit-free: changing prices from pounds to pence gives the same r.
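The unit-free property is easy to demonstrate: rescaling price by a factor of 100 leaves r unchanged. A short sketch, assuming the B&O prices are in pounds; the helper `pearson_r` is illustrative.

```python
import math

sales = [284, 248, 79, 74, 62, 56, 48, 34]
price_pounds = [800, 891, 1295, 1451, 580, 1192, 1285, 757]
price_pence = [p * 100 for p in price_pounds]  # same prices, different unit


def pearson_r(xs, ys):
    """Pearson's r via the computational formula."""
    n = len(xs)
    a = n * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)
    b = n * sum(x * x for x in xs) - sum(xs) ** 2
    c = n * sum(y * y for y in ys) - sum(ys) ** 2
    return a / math.sqrt(b * c)


r_pounds = pearson_r(price_pounds, sales)
r_pence = pearson_r(price_pence, sales)
print(round(r_pounds, 3), round(r_pence, 3))
```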
Cautionary Tale #1

Pearson's correlation only measures
the extent of a linear relationship.

[Figure: two scatter plots of Y against X.]
First plot: strong non-linear association of X with
Y ---> weak or zero value of r.
Second plot: strong linear association between X
and Y weakened to zero by an outlier.

ALWAYS DRAW A PICTURE!
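Both effects can be reproduced with made-up numbers. The data below are invented purely for illustration (they are not the points plotted on the slide), and `pearson_r` is a hypothetical helper.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r via the computational formula."""
    n = len(xs)
    a = n * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)
    b = n * sum(x * x for x in xs) - sum(xs) ** 2
    c = n * sum(y * y for y in ys) - sum(ys) ** 2
    return a / math.sqrt(b * c)


# 1) A perfect U-shaped (non-linear) relationship: Y = (X - 4)^2.
x_curve = list(range(9))
y_curve = [(x - 4) ** 2 for x in x_curve]
r_curve = pearson_r(x_curve, y_curve)

# 2) A perfect straight line, then the same line plus one outlier.
x_line = [1, 2, 3, 4, 5, 6, 7]
y_line = [1, 2, 3, 4, 5, 6, 7]
r_line = pearson_r(x_line, y_line)
r_outlier = pearson_r(x_line + [8], y_line + [0])

print(r_curve, r_line, round(r_outlier, 2))
```

Y is fully determined by X in the first data set, yet r is zero; in the second, a single outlying point pulls a perfect correlation down to about one third.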
Cautionary Tale #2
Limited range decreases the apparent correlation.
Missing middle increases the apparent correlation.

[Figure, left: performance (%) against Aptitude Test Score (0 to 100). Limited range problem.]
[Figure, right: Unwanted Pregnancy against Socio-Economic Status. Missing middle problem.]


Cancer Survival Rates: Postcode Lottery

10% point difference in survival rates -> Daily Mail
"Postcode Lottery".
Based on the number of patients diagnosed in
England 1991-2006, and followed up in 2007.
What about relative age-adjusted survival rates?
Source: ONS (07.09.2010)
Cautionary Tale #3
Correlation is not proof of causation (omitted variables). With
cross-sectional data, both X and Y are measured at the same point in time.

[Figure: X variable (red wine consumption) and Y variable (personal health) are positively correlated (r = +). A possibly missing Z variable is positively correlated (r = +) with both.]

PIMS (Harvard): strong positive correlation between market
share (X) and firm performance (Y). But was it an illusion?
Partial correlations control for 3rd factors (beyond this course).
Another Hypothesis Test:
Is Pearson's Correlation Significant?
A little recap
We may wish to test the significance level of the correlation
when working with a sample rather than a whole population.

We may wish to ensure that it is unlikely that the effect we
are seeing occurs because of sampling error / chance (when
there is no relationship between X and Y in the population).

We want the chance of being wrong when saying there is a
relationship between the two variables to be less than 5% -
the social science standard of proof or risk factor.
Another Hypothesis Test:
Is Pearson's Correlation Significant (2)?
Null hypothesis H0: There is no linear association between
X and Y (in the population: r = 0).

Alternative hypothesis H1: (choose 1 of 3). There is:
A) a linear association between X and Y (r ≠ 0).
B) a positive linear association between X and Y (r > 0).
C) a negative linear association between X and Y (r < 0).
But you decide before collecting the data (theory-driven).

A) is a 2-tailed test, B) and C) are 1-tailed tests [this denotes the tail(s) of the
distribution where the rejection region(s) is found, see later].
Hypothesis Testing: Significance of the
Correlation Coefficient (3)?
H0: There is no linear association between X and Y (r = 0).
H1: There is a negative association between X and Y (r < 0).
1. Determine significance level (chance of being wrong if
reject H0). [alpha = 0.05].
2. Determine the degrees of freedom (v = n - 2) = 6.
3. Determine if it is a one-tailed or two-tailed test (one-tailed here).
4. Determine critical value from table (c.v. = 0.6215).
5. If |r| > critical value, reject the null hypothesis, but it is not
in this case (|-0.303| = 0.303 < 0.6215), so retain null (r = 0).
For B&O customers, we cannot conclude that sales and prices are related.
The association could be due to sampling error / chance.
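The tabulated critical value of r is linked to the t distribution by a standard identity, r_crit = t_crit / sqrt(t_crit^2 + df). The sketch below uses that identity as an addition to the slide's method; the constant 1.943 is the usual one-tailed 5% t-table value for 6 degrees of freedom, typed in by hand as an assumption, and the function names are hypothetical.

```python
import math

T_CRIT_ONE_TAILED_5PCT_DF6 = 1.943  # assumed: standard t-table entry


def critical_r(t_crit, df):
    """Convert a critical t value into the equivalent critical r."""
    return t_crit / math.sqrt(t_crit ** 2 + df)


def r_to_t(r, n):
    """t statistic for testing H0: no linear association (df = n - 2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


r, n = -0.303, 8
r_crit = critical_r(T_CRIT_ONE_TAILED_5PCT_DF6, n - 2)
t_stat = r_to_t(r, n)

# One-tailed test of H1: r < 0, so reject only if r is below -r_crit.
reject_h0 = r < -r_crit
print(round(r_crit, 3), round(t_stat, 2), reject_h0)
```

Either comparison (r against the critical r, or t against the critical t) leads to the same decision: retain the null hypothesis.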
Thought For the Day
This area requires skilled practitioners. Not people
simply wanting to apply formulae to get answers.

Answers depend on formulating questions well, applying
theory, using reliable data, being aware of assumptions
implicit in the methods and knowing what the measure is
doing and whether it is appropriate for your purposes.
Silver (p.166)

In summary: You should have two metric variables.
It works best for straight line relationships.
It works best if there are no serious outliers.
But it does not truly consider causation.
