FT Mba Section 2 Associations 2016-17 Sva

Full-time MBA: Business Statistics (BST510)
Measures of Association:
Cross-tabulation and Correlation
Section two
Dr. Paul Bottomley (Room F03)
bottomleypa@cf.ac.uk
Readings: Silver, pp. 89-103,106-110.
Measures of Association
Analysis of relationships between two variables forms the
next logical step beyond descriptive statistics.
What is the (i) nature and (ii) strength of the relationship?
(iii) will it hold for the wider population, not just our sample?
Metric (Int./Ratio): Pearson's Correlation
Ordinal: Spearman's Correlation (rho)
Nominal: Phi (for 0 / 1 data)

Metric                          Ordinal                           Nominal
Price/£  Sales/1000s per week   Satisfaction (1-7)  Income band (1-10)  Gender (m=0/f=1)  Level (Clerk=0, Manager=1)
3        250                    3                   2                   0                 1
4        302                    2                   7                   1                 1
6        145                    6                   8                   0                 0
7        56                     7                   4                   1                 0
An Introduction to Hypothesis Testing
A hypothesis is a fancy word for a prediction about a population
parameter that is tested by a sample statistic.
We first set up the null hypothesis that no relationship exists in the
population, then use a statistic to measure how close our sample
results are to what would be expected if the null is true.
Typically, we want to reject the null hypothesis but must have
strong empirical evidence to do so.
Only if an empirical result is unlikely to be due to sampling error /
chance do we regard it as being strong or statistically significant.
In social sciences, a significant result is one where a result at least as
extreme as ours would occur less than 5% of the time if the null hypothesis were true.
In other words, the significance level is a risk factor - it tells us the
chance of wrongly rejecting the null hypothesis when it is in fact true.
Contingency Tables & Nominal Data
A contingency table lists all possible outcomes of two
variables and the number of times (frequency) they occur.
Widely used by market researchers.
Simple to conduct.
Easy to interpret.
Strong link to managerial action.
Examples:
Guests' complaints across a chain of hotels.
Adoption of technology (internet) and social class.
Businessmen's type of airline ticket and nature of flight (domestic / international).
Suitable for nominal data but metric data also can be reduced
to a few categories: gender vs. money on snacks at cinema.
Recent Corporate Identity Changes
Henderson & Cote (1998), Journal of Marketing.
Ralf Van Der Lans et al. (2009), Marketing Science.
A Contingency Table
A bank designing a new credit card asks a random sample
of 320 customers about their color preferences.
The contingency table shows the observed frequencies of
the two nominal variables.
18 men prefer the blue credit card.
            Blue  Red  Green  Row Total
Male          18   53     86        157
Female        42   64     57        163
Col. Total    60  117    143        320
Q: Null hypothesis: there is no relationship between gender
and color preference. (But we'd like to reject this population claim!)
Chi-Squared Contingency Test
Statistical issue: are the relative frequencies (proportions) in
each group so different that it is unlikely that they came from
the same population? (Let's look along each row).
Test logic: Is there a meaningful difference between the
observed frequencies (in Table) and expected frequencies
assuming that the null hypothesis is true (no relationship)?
Q: How many customers would we expect to find who are (i)
male, and (ii) prefer the blue card, if there is no relationship
(independent) and (iii) given a sample of 320 participants?
As the sample is random, the row and column totals are our best
guide to the chance of events occurring in the population.
Calculating Expected Frequencies
What is the chance of finding a customer who is male?
(a) 157 out of 320 customers are men.
What is the chance of a customer liking the blue card?
(b) 60 out of 320 customers prefer the blue credit card.
If we multiply (a) by (b), we find the probability that a customer
is both male AND prefers blue (range 0 to 1).
But we want the expected number of customers, so multiply the
probability by the sample size.
$$E = \frac{157}{320} \times \frac{60}{320} \times 320 = 29.44 \text{ customers}$$
$$\text{Expected Frequency} = \frac{\text{Row Total} \times \text{Column Total}}{\text{Sample Size}}$$
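The same rule can be applied to every cell at once. The following is a minimal Python sketch, not part of the original slides; the function name `expected_frequencies` and the list-of-rows layout are illustrative assumptions.

```python
# Observed counts from the credit card example
# (rows: Male, Female; columns: Blue, Red, Green).
observed = [
    [18, 53, 86],
    [42, 64, 57],
]

def expected_frequencies(table):
    """Return (row total x column total) / sample size for every cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

expected = expected_frequencies(observed)
for label, row in zip(["Male", "Female"], expected):
    print(label, [round(e, 2) for e in row])
```

The Male / Blue cell reproduces the hand calculation of 29.44 customers.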
Some Arithmetic
Cell  Oi   Ei                           Oi - Ei  (Oi - Ei)^2/Ei
a     18   (157 *  60) / 320 = 29.44
b     53   (157 * 117) / 320 = 57.40
c     86   (157 * 143) / 320 = 70.16
d     42   (163 *  60) / 320 = 30.56
e     64   (163 * 117) / 320 = 59.60
f     57   (163 * 143) / 320 = 72.84
Sum   320
Oi = Observed cell frequency, Ei = Expected cell frequency.
Expected frequencies show how the sample would be allocated
between cells if there were no relationship (null hypothesis is true).
Chi-Squared Independence Test
The chi-squared statistic (χ²) tests whether two nominal variables
are independent (no relationship) or are related.
It depends on the extent to which the observed frequencies (Oi)
differ from their expected frequencies (Ei) (i = 1 to 6 cells).
$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$
Because differences can be positive or negative, we square
them so that they don't cancel each other out.
Repeat for each cell in the table, then sum together.
Some More Arithmetic
Cell  Oi   Ei                           Oi - Ei  (Oi - Ei)^2/Ei
a     18   (157 *  60) / 320 = 29.44    -11.44   4.44
b     53   (157 * 117) / 320 = 57.40     -4.40   0.34
c     86   (157 * 143) / 320 = 70.16     15.84   3.58
d     42   (163 *  60) / 320 = 30.56     11.44   4.28
e     64   (163 * 117) / 320 = 59.60      4.40   0.33
f     57   (163 * 143) / 320 = 72.84    -15.84   3.44
Sum   320                                Chi-sq  16.41
How do we interpret this statistic? We compare it against a
critical value to find out how unusual it is, assuming the null is true.
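The table arithmetic above can be checked with a short script. This is a sketch under the assumption that the table is held as a list of rows; the helper name `chi_squared` is hypothetical and not taken from the slides.

```python
# Observed counts (rows: Male, Female; columns: Blue, Red, Green).
observed = [[18, 53, 86], [42, 64, 57]]

def chi_squared(table):
    """Sum (O - E)^2 / E over all cells, with E = row total x column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n
            stat += (o - e) ** 2 / e
    return stat

chi_sq = chi_squared(observed)
print(round(chi_sq, 2))  # 16.41
```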
Chi-Squared: Interpreting the Result
The value of our chi-squared statistic = 16.41.
Would a statistic this large occur less than 5% of the time if the
null hypothesis (H0) were true? That is, is the chance of wrongly rejecting H0
(no relationship between gender and color preference) < 5%?
[Figure: chi-squared distribution. Values below the critical value lie in the "Do not reject H0" region; values above it lie in the "Reject H0" region.]
If our statistic > critical value, reject null hypothesis.
If our statistic < critical value, retain null hypothesis.
Chi-Squared: Critical Value & Result
To test whether our statistic is significant, we compare it against
the critical value which we find in tables.
Step 1: determine the level of significance (α = 0.05).
Step 2: determine the degrees of freedom (v).
v = (r - 1)(c - 1) = (2 rows - 1) x (3 columns - 1) = 2
[Figure: chi-squared distribution. The critical value (5.991) separates the "Do not reject H0" and "Reject H0" regions; the test statistic (16.41) lies in the rejection region.]
As our statistic (16.41) is > critical value (5.991), we reject the
null hypothesis, so gender and color preference are related.
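The tabulated critical value can be reproduced for this particular case. The sketch below relies on a known property of the chi-squared distribution with exactly 2 degrees of freedom, where the upper-tail probability is exp(-x / 2), so the critical value is -2 ln(α). It is not a general routine, and the function names are illustrative assumptions.

```python
import math

def degrees_of_freedom(rows, cols):
    """v = (r - 1)(c - 1) for an r x c contingency table."""
    return (rows - 1) * (cols - 1)

def critical_value_df2(alpha):
    """Chi-squared critical value; valid ONLY for 2 degrees of freedom."""
    return -2.0 * math.log(alpha)

df = degrees_of_freedom(2, 3)
critical = critical_value_df2(0.05)
chi_sq = 16.41  # test statistic from the worked example

print(df, round(critical, 3))            # 2 5.991
print("reject H0:", chi_sq > critical)   # reject H0: True
```

For other degrees of freedom, the value still has to come from tables or a statistics library.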
Strength of Association: Cramér's V
Chi-sq test shows that the variables are related, but not how.
But we can determine the strength of the association.
$$V = \sqrt{\frac{\chi^2 / n}{\min(r - 1,\; c - 1)}} = \sqrt{\frac{16.41 / 320}{2 - 1}} = 0.226$$
min(r - 1, c - 1) is the smaller dimension of the contingency table. Our table has 2
rows and 3 columns, so select (r - 1).
Cramér's V (0 to 1). 0.1 = weak, 0.3 = moderate, 0.5 = strong.
Weak / moderate association between gender and color preference.
Follow-up comparisons using 2 x 2 tables can identify the
source of the difference (e.g. red vs. blue, green vs. red?).
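As a quick check of the Cramér's V arithmetic, here is a minimal sketch; the function name `cramers_v` and its argument order are assumptions made for illustration.

```python
import math

def cramers_v(chi_sq, n, rows, cols):
    """Cramér's V: sqrt((chi-squared / n) / min(rows - 1, cols - 1))."""
    return math.sqrt((chi_sq / n) / min(rows - 1, cols - 1))

v = cramers_v(chi_sq=16.41, n=320, rows=2, cols=3)
print(round(v, 3))  # 0.226
```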
Assumptions and Limitations
All observations must be drawn from a random sample.
With stratified and quota samples, survey design issues influence
expected frequencies (% male / female recruited).
Observations must be unique to one cell.
E.g. social class and people playing football or rugby (someone who plays both would fall into two cells).
          Blue   Red    Green
Male      11.5%  33.8%  54.8%
Female    25.8%  39.3%  35.0%
The test is based on frequencies. It is not valid if cells hold percentages,
although percentages are useful for interpretation purposes.
Need to know the sample size (n) to convert percentages back into frequencies.
Assumptions and Limitations (2)
Avoid small expected frequencies (Ei).
Malhotra and Birks: No expected frequencies < 5.
Cochran: Expected frequencies < 5 in fewer than 20% of cells.
Silver: No expected frequencies < 3.
Solution #1: increase sample size (n).
If n = 320 (640), expect 29.44 (58.88) men to prefer the blue credit card.
Solution #2: combine adjacent cells together if meaningful.
Money on snacks, ice cream: merge £2.50 up to £5.00 and £5.00+ into 1 cell.
Solution #3: use Fisher's exact test! (beyond this course).
Cinema Snacks: Anyone for
Popcorn, Ice-cream or Hotdogs?
Contingency tables are useful for describing the sample.
Q: How many first year female students were surveyed? A: 43.
Q: What percentage of male students are in their 3rd year? A: 23.2%
Do male and female students spend equal amounts on
snacks when they go to the cinema?
Q: Is this chi-squared test valid?
A: Two cells have expected frequencies (counts) < 5.
Solution: combine cells £2.50 - £5.00 and £5.00+ into 1 category.
Chi-squared test = 7.005 (df = 2).
Conc: Only a 3% chance of getting a result this extreme if the null hypothesis
is true. So reject null: male and female students don't spend
the same on popcorn / ice-cream; women spend more.
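The "3% chance" quoted above can be verified by hand for this case. With exactly 2 degrees of freedom, the upper-tail probability of a chi-squared statistic x is exp(-x / 2). The sketch below uses that special case only; the function name is an illustrative assumption.

```python
import math

def p_value_df2(chi_sq):
    """Upper-tail probability of chi-squared; valid ONLY for df = 2."""
    return math.exp(-chi_sq / 2.0)

p = p_value_df2(7.005)
print(round(p, 3))            # 0.03
print("reject H0:", p < 0.05)  # reject H0: True
```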
The 2 x 2 Contingency Table
Many statisticians argue that an adjustment should be
made with 2 x 2 tables. (Silver outlines 18 of these!)
Stage 1: Apply adjustment factor (Yates's correction).
$$\chi^2 = \sum_{i} \frac{\left[\,|O_i - E_i| - 0.5\,\right]^2}{E_i}$$
Vertical bars tell us to take the absolute difference, then
subtract 0.5. This is a conservative test.
Stage 2: Assess strength of the relationship. Cramér's V
simplifies to phi (the phi coefficient).
$$\phi = \sqrt{\frac{\chi^2}{n}}$$
What's Making Our Kids Materialistic?
TV        Affluent  Deprived  Row_Tot
Yes            146       189      335
No             156        68      224
Col_Tot        302       257      559
A study examining the impact of media on materialism and
well-being found that nearly 60% of children had a TV in their
bedrooms.
Q: Determine whether there is a relationship between TV
in bedrooms and social class, or are they unrelated?
Media and Childhood Materialism
Cell  Oi   Ei      |Oi - Ei| - 0.5  [|Oi - Ei| - 0.5]^2/Ei
a     146  180.98  34.48             6.57
b     189  154.02  34.48             7.72
c     156  121.02  34.48             9.83
d      68  102.98  34.48            11.55
Sum   559                   Chi-sq  35.66
Test statistic: Chi-squared = 35.66.
Critical value from tables (5%, d.f. = 1) = 3.841.
Test statistic > critical value, so reject null hypothesis.
TVs in children's bedrooms are related to social class.
Aside: continuous variables can be split into two groups,
but the test lacks power and it is difficult to know where to divide.
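The two stages for a 2 x 2 table (Yates's correction, then phi) can be sketched as follows. The function name `yates_chi_squared` is hypothetical. The slides do not report a value for phi in this example, so the script prints it without a claimed result.

```python
import math

# Observed counts (rows: TV Yes, TV No; columns: Affluent, Deprived).
observed = [[146, 189], [156, 68]]

def yates_chi_squared(table):
    """Sum (|O - E| - 0.5)^2 / E over the four cells of a 2 x 2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n
            stat += (abs(o - e) - 0.5) ** 2 / e
    return stat

n = sum(sum(row) for row in observed)
chi_sq = yates_chi_squared(observed)
phi = math.sqrt(chi_sq / n)  # Stage 2: strength of the relationship

print(round(chi_sq, 2))  # 35.66
print(round(phi, 3))
```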
Pearson's Correlation Coefficient
Pearson's correlation (r) summarizes the association between two
metric variables, in terms of:
Its nature (positive or negative)
Its strength (magnitude)
The correlation ranges from -1 to +1 (0 = unrelated).
Do you think the following variables are correlated; if so how?
Advertising and brand sales
Weekend box-office revenue and cinema screenings
People's height and success in chosen career
But, are these relationships linear (big assumption)? A scatter plot
provides a useful first step to investigate this.
A Typical Scatter Plot
[Figure: scatter plot, "Price of Mazda Cars in Australia in 1991" (from http://www.statsci.org/data/index.html ). X-axis: Age of Car/Years (0 to 25). Y-axis: Price of Car/$Au (0 to 45000).]
This plot tells us about:
Groups and patterns in data: older cars are cheaper, so no classic cars here!
Linearity and non-linearity: the price drop slows down with age (we can't have negatively priced cars).
Outlying data: there are a few atypical data points.
Negative Correlations of Varying Strength
[Figure: four scatter plots showing negative correlations of r = -0.9, r = -0.7, r = -0.5 and r = -0.3.]
Q: What do correlations of (i) 1, and (ii) zero look like?

What About the Relationship Between
Sales and Prices of B&O Televisions?
Don't just simply undertake the calculation. Ask:
What sort of relationship do you expect?
Why do you expect a relationship (theory)?
All hypotheses are made before any data is collected -
theory driven, without having checked the scatterplot first!
What do you know about the quality of the data?
Is the relationship linear or non-linear?
Are there any outliers? Let's check the scatter plot.
Silver (1996)
Scatter Plot: Demand For Selected
Models of B&O Televisions
Sales (Y)  Price (X)
284         800
248         891
 79        1295
 74        1451
 62         580
 56        1192
 48        1285
 34         757
[Figure: scatter plot of Sales (Y, 0 to 300) against Price (X, 500 to 1500).]
Is there evidence of a linear relationship?
Is there evidence of any obvious outliers?
But beware reading too much into small samples.
Calculating Pearson's Correlation
While scatter plots are useful, we often want to precisely assess
the extent of the linear relationship between two metric variables.
$$r = \frac{\text{Cov}(XY)}{\text{Std.Dev.}(X) \times \text{Std.Dev.}(Y)}$$
The covariance shows how the two variables (X and Y) change
with each other. Specifically:
How two variables move together around their respective means.
Covariance is not very useful in itself. It needs to be adjusted to
make comparisons easier (units of measurement, sample size).
Covariance of Sales and Price
Covariance: multiply the deviations of each pair of data
points from their respective means, sum and then average.
$$\text{Cov}(XY) = \frac{1}{n} \sum_{i} (X_i - \bar{X})(Y_i - \bar{Y})$$
[Figure: scatter plot of Sales (Y) against Price (X), divided into four quadrants by the means of X and Y: A (top left), B (top right), C (bottom left), D (bottom right).]
Positive covariance: positive (neg.) deviations of X are associated
with positive (neg.) deviations of Y.
Negative covariance: negative deviations of X are associated
with pos. dev. of Y (and vice versa).
Quadrants B & C = Cov(XY) > 0
Quadrants A & D = Cov(XY) < 0
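The covariance of the B&O sales and price data can be computed directly from the definition. This is a minimal sketch; the function name `covariance` is an assumption. It divides by n, as in the formula on this slide, rather than by n - 1.

```python
# B&O data: Sales (Y) and Price (X) for eight television models.
sales = [284, 248, 79, 74, 62, 56, 48, 34]
prices = [800, 891, 1295, 1451, 580, 1192, 1285, 757]

def covariance(xs, ys):
    """Average product of paired deviations from the means (divisor n)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

cov = covariance(prices, sales)
print(round(cov, 2))
print("negative covariance:", cov < 0)
```

The sign is negative, so most points lie in quadrants A and D. The magnitude depends on the units of sales and price, which is why it is then standardized into r.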
Pearson's Correlation Coefficient (r)
To calculate Pearson's r, we divide the covariance by the
standard deviation of X times the standard deviation of Y.
$$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - \left(\sum X\right)^2\right]\left[n\sum Y^2 - \left(\sum Y\right)^2\right]}}$$
n refers to the number of pairs of data points.
As usual, the best way to calculate this is by using a table.
The magic numbers that we substitute into the formula are
the respective column totals.
B&O: Correlation of Sales with Price
Sales (Y)  Price (X)  XY      Y^2     X^2
284         800       227200   80656   640000
248         891       220968   61504   793881
 79        1295       102305    6241  1677025
 74        1451       107374    5476  2105401
 62         580        35960    3844   336400
 56        1192        66752    3136  1420864
 48        1285        61680    2304  1651225
 34         757        25738    1156   573049
Sum
885        8251       847977  164317  9197845
Y = Sales, because we assume that prices (X) cause sales (Y).

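Pearson's Correlation Coefficient (r)
$$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - \left(\sum X\right)^2\right]\left[n\sum Y^2 - \left(\sum Y\right)^2\right]}} = \frac{A}{\sqrt{B \cdot C}}$$
$$A = n\sum XY - \sum X \sum Y = 8 \times 847977 - (885 \times 8251) = -518319$$
$$B = n\sum X^2 - \left(\sum X\right)^2 = 8 \times 9197845 - (8251)^2 = 5503759$$
$$C = n\sum Y^2 - \left(\sum Y\right)^2 = 8 \times 164317 - (885)^2 = 531311$$
$$r = \frac{-518319}{\sqrt{5503759 \times 531311}} = -0.303$$
As price goes up, sales go down. But buyers of B&O
are not very price sensitive (small absolute value).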
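The A, B and C terms can be reproduced from the column totals with a few lines of Python. This is a sketch of the computational formula above; the function name `pearson_terms` is an illustrative assumption.

```python
import math

sales = [284, 248, 79, 74, 62, 56, 48, 34]               # Y
prices = [800, 891, 1295, 1451, 580, 1192, 1285, 757]    # X

def pearson_terms(xs, ys):
    """Return (A, B, C) built from the column totals of the table."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    a = n * sum_xy - sum_x * sum_y
    b = n * sum_x2 - sum_x ** 2
    c = n * sum_y2 - sum_y ** 2
    return a, b, c

a, b, c = pearson_terms(prices, sales)
r = a / math.sqrt(b * c)
print(a, b, c)      # -518319 5503759 531311
print(round(r, 3))  # -0.303
```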
Interpreting Pearson's Correlation
From a textbook / small sample perspective:
r = 0.7 or -0.7 is a strong correlation
r = 0.5 or -0.5 is a moderate correlation
r = 0.3 or -0.3 is a weak correlation
From a practical / researcher's perspective:
Cohen's effect sizes: 0.1 = small, 0.3 = medium and > 0.5 = large.
Cohen's guidelines are based on realistic / reasonable sized samples.
More faith in Cohen's meta-analysis of 1000s of studies.
Pearson's r is a standardized measure of association:
Useful for comparative purposes.
Unit-free: changing prices from £s to pence gives the same r.
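The unit-free property can be demonstrated with the B&O data by rescaling the prices. The sketch below assumes the listed prices are in pounds and converts them to pence; `pearson_r` is a hypothetical helper that divides the covariance by the product of the standard deviations.

```python
import math

sales = [284, 248, 79, 74, 62, 56, 48, 34]
prices_pounds = [800, 891, 1295, 1451, 580, 1192, 1285, 757]  # assumed to be in pounds
prices_pence = [p * 100 for p in prices_pounds]

def pearson_r(xs, ys):
    """Covariance divided by (std. dev. of X times std. dev. of Y)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sd_x * sd_y)

r_pounds = pearson_r(prices_pounds, sales)
r_pence = pearson_r(prices_pence, sales)
print(round(r_pounds, 3), round(r_pence, 3))  # -0.303 -0.303
```

The covariance grows by a factor of 100 when prices are in pence, but so does the standard deviation of X, so the ratio is unchanged.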
Cautionary Tale #1
Pearson's correlation only measures the extent of a linear relationship.
[Figure, top panel: curved scatter of Y against X.]
Strong non-linear association of X with Y ---> weak or zero value of r.
[Figure, bottom panel: linear scatter of Y against X with one outlier.]
Strong linear association between X and Y weakened to zero by an outlier.
ALWAYS DRAW A PICTURE!
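A small made-up example shows how a perfect non-linear relationship can give r = 0. The data below are hypothetical (a symmetric U-shape, Y = (X - 4)^2) and are not taken from the slides; `pearson_r` is an illustrative helper.

```python
import math

# Hypothetical data: Y is fully determined by X, but not linearly.
xs = list(range(9))                # 0, 1, ..., 8
ys = [(x - 4) ** 2 for x in xs]    # 16, 9, 4, 1, 0, 1, 4, 9, 16

def pearson_r(xs, ys):
    """Pearson's r from the column totals (computational formula)."""
    n = len(xs)
    a = n * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)
    b = n * sum(x * x for x in xs) - sum(xs) ** 2
    c = n * sum(y * y for y in ys) - sum(ys) ** 2
    return a / math.sqrt(b * c)

r = pearson_r(xs, ys)
print(r)  # 0.0
```

Knowing X tells you Y exactly here, yet r reports no association, which is why the scatter plot comes first.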
Cautionary Tale #2
Limited range decreases the apparent correlation.
Missing middle increases the apparent correlation.
[Figure, left panel: performance (%) against Aptitude Test Score. Limited range problem.]
[Figure, right panel: Unwanted Pregnancy against Socio-Economic Status. Missing middle problem.]

Cancer Survival Rates: Postcode Lottery
10% point difference in survival rates -> Daily Mail "Postcode Lottery".
Based on the number of patients diagnosed in England 1991-2006, and
followed up in 2007.
What about relative age-adjusted survival rates?
Source: ONS (07.09.2010)
Cautionary Tale #3
Correlation is not proof of causation (omitted variables). With cross-sectional
data, both X and Y are measured at the same point in time.
[Diagram: X variable (red wine consumption) and Y variable (personal health) are positively correlated (r = +). A missing Z variable may be positively correlated (r = +) with both.]
PIMS (Harvard): strong positive correlation between market
share (X) and firm performance (Y). But was it an illusion?
Partial correlations control for 3rd factors (beyond this course).
Another Hypothesis Test:
Is Pearson's Correlation Significant?
A little recap
We may wish to test the significance level of the correlation
when working with a sample rather than a whole population.
We may wish to ensure that it is unlikely that the effect we
are seeing occurs because of sampling error / chance (when
there is no relationship between X and Y in the population).
We want the chance of wrongly saying there is a relationship
between the two variables, when none exists in the population, to be less than 5% -
the social science standard of proof or risk factor.
Another Hypothesis Test:
Is Pearson's Correlation Significant (2)?
Null hypothesis H0: There is no linear association between
X and Y (in the population: r = 0).
Alternative hypothesis H1: (choose 1 of 3). There is:
A) a linear association between X and Y (r ≠ 0).
B) a positive linear association between X and Y (r > 0).
C) a negative linear association between X and Y (r < 0).
But you decide before collecting the data (theory-driven).
A) 2-tailed test, B) and C) 1-tailed tests [denote tail(s) of the
distribution where the rejection region(s) is found, see later].
Hypothesis Testing: Significance of the
Correlation Coefficient (3)?
H0: There is no linear association between X and Y (r = 0).
H1: There is a negative association between X and Y (r < 0).
1. Determine significance level (chance of being wrong if
reject H0). [alpha = 0.05].
2. Determine the degrees of freedom (v = n - 2) = 6.
3. Determine if it is a one-tailed or two-tailed test (one-tailed here).
4. Determine critical value from table (c.v. = 0.6215).
5. If |r| > critical value, reject the null hypothesis, but it is not
in this case (|-0.303| = 0.303 < 0.6215), so retain null (r = 0).
For B&O customers, we cannot conclude that sales and prices are related.
The association could be due to sampling error / chance.
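The tabulated critical value of r can be cross-checked against the t distribution. The slides do not use this route; the sketch below applies the standard relationship t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom, and assumes the one-tailed 5% critical t value for 6 degrees of freedom (1.943) taken from standard t tables. Function names are illustrative.

```python
import math

T_CRIT = 1.943  # assumed: one-tailed 5% critical t for df = 6, from t tables

def critical_r(t_crit, df):
    """Convert a critical t value into the matching critical value of r."""
    return t_crit / math.sqrt(t_crit ** 2 + df)

def t_statistic(r, n):
    """t statistic for testing whether a sample correlation differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

n, r = 8, -0.303
r_crit = critical_r(T_CRIT, n - 2)
t = t_statistic(r, n)

print(round(r_crit, 3))                  # close to the tabulated 0.6215
print("reject H0:", abs(r) > r_crit)     # reject H0: False
print("same decision via t:", abs(t) > T_CRIT)
```

Both routes lead to the same decision: with only eight pairs of data points, a correlation of -0.303 is not large enough to reject the null hypothesis.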
Thought For the Day
This area requires skilled practitioners. Not people
simply wanting to apply formulae to get answers.
Answers depend on formulating questions well, applying
theory, using reliable data, being aware of assumptions
implicit in the methods and knowing what the measure is
doing and whether it is appropriate for your purposes.
Silver (p.166)
In summary: You should have two metric variables.
It works best for straight line relationships.
It works best if there are no serious outliers.
But it does not truly consider causation.
