
Do heavier people burn more energy? Does wine consumption cause a decrease in heart disease?

These questions reflect a desire to understand the relationship between two variables. What we need:
1. A plot/graph to view the relationship
2. Characteristics to describe
3. Measures of the characteristics
4. A method to make inferences about the relationship

Correlation & Regression

The graph: a Scatter Plot


[Figure: scatter plot axes. Y axis: response variable (dependent variable). X axis: explanatory variable (independent variable).]



Do heavier people burn more energy?
Response: metabolic rate
Explanatory: weight or mass

Does wine consumption cause a decrease in heart disease?
Response: death rate from heart disease
Explanatory: wine consumption

Do heavier people burn more energy? Lean body mass vs. metabolic rate
[Figure: scatter plot of Rate (cal) against Mass (kg).]


Is wine good for your heart? Wine consumption vs. heart disease rate (per 100,000)
[Figure: scatter plot of hrt_death rate against Alcohol (wine consumption).]


Interpreting: characteristics to look for
Patterns:
  Form (clusters, scatter, linear, ...)
  Direction (positive, negative)
  Strength (how closely the points follow the form)
Deviations:
  Outliers
Interpret the last two scatter plots.

Options to consider: Adding a categorical variable


Scatter plot: relationship between quantitative variables
Form: linear is probably the most common form
Strength: we measure the strength of a linear relationship numerically because our eyes can deceive us!

[Figure: two scatter plots, each captioned "Strength?"]

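Correlation
Correlation measures the direction and strength of a linear relationship.

Standardised value of each x: $z_x = (x_i - \bar{x}) / s_x$
Standardised value of each y: $z_y = (y_i - \bar{y}) / s_y$
Correlation is an average product of standardised values: $r = \frac{1}{n-1} \sum z_x z_y$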
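The "average product of standardised values" can be sketched in a few lines of Python. This is only an illustration: it assumes the usual sample convention of dividing by n - 1, and it reuses the Alcohol and hrt_deat columns from the residual table later in these notes. The function name `correlation` is made up for this example.

```python
from statistics import mean, stdev

# Wine consumption (Alcohol) and heart disease death rate per 100,000
# (hrt_deat), copied from the residual table later in these notes.
alcohol = [2.50, 3.90, 2.90, 2.40, 2.90, 0.80, 9.10, 0.80, 0.70, 7.90,
           1.80, 1.90, 0.80, 6.50, 1.60, 5.80, 1.30, 1.20, 2.70]
heart = [211, 167, 131, 191, 220, 297, 71, 211, 300, 107,
         167, 266, 227, 86, 207, 115, 285, 199, 172]


def correlation(xs, ys):
    """Sum of products of standardised values, divided by n - 1 (assumed convention)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)        # sample standard deviations
    z_x = [(x - x_bar) / s_x for x in xs]  # standardised x values
    z_y = [(y - y_bar) / s_y for y in ys]  # standardised y values
    return sum(zx * zy for zx, zy in zip(z_x, z_y)) / (n - 1)


r = correlation(alcohol, heart)
print(round(r, 3))  # -0.843, matching the Pearson correlation reported for the wine data
```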


Correlation = r
Requires quantitative variables
Describes linear relationships only
r has no units
r is always between -1 and 1
Positive r = positive association
Negative r = negative association
r = 0 means no linear association
r is influenced by outliers

Do heavier people burn more energy? Lean body mass vs. metabolic rate
[Figure: scatter plot of Rate (cal) against Mass (kg).]

Correlations: Mass (kg), Rate (cal)
Pearson correlation of Mass (kg) and Rate (cal) = 0.865
P-Value = 0.000

Weight (mass) vs. metabolic rate
[Figure: scatter plot of Rate (cal) against Mass (kg); males plotted as +, females as o.]

Correlations: Mass (kg)_F, Rate (cal)_F
Pearson correlation of Mass (kg)_F and Rate (cal)_F = 0.876
Correlations: Mass (kg)_M, Rate (cal)_M
Pearson correlation of Mass (kg)_M and Rate (cal)_M = 0.592

Is wine good for your heart? Wine consumption vs. heart disease rate (per 100,000)
[Figure: scatter plot of hrt_death rate against Alcohol (wine consumption).]

Correlations: Alcohol, hrt_death rate
Pearson correlation of Alcohol and hrt_death rate = -0.843


Heart disease death rate vs. wine consumption (outliers removed)
[Figure: scatter plot of hrt death rate against Alc wine consumption.]

Correlations: Alc wine consumption, hrt death rate
Pearson correlation of Alc wine consumption and hrt death rate = -0.648

Linear relationships: using a LINE


Is wine good for your heart? Wine consumption vs. heart disease rate (per 100,000)
[Figure: scatter plot of hrt_death rate against Alcohol (wine consumption).]

We can summarise an overall linear form with a line. The best line is called the Regression Line.

A regression line describes how a response variable changes as an explanatory variable changes. We can now predict a value of y when given an x.
Fitted regression line: death rate vs. wine consumption
death rate = 260.563 - 22.9688 wine consumption
S = 37.8786   R-Sq = 71.0 %   R-Sq(adj) = 69.3 %
[Figure: fitted line plot of death rate against wine consumption.]

What would be the death rate due to heart disease if the average daily consumption of wine was 3 glasses? 191.66 deaths per 100,000
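The prediction for 3 glasses is just the fitted equation evaluated at x = 3. A minimal sketch, using the coefficients from the fitted line above (the function name `predict_death_rate` is made up for this example):

```python
# Coefficients taken from the fitted regression line in these notes.
INTERCEPT = 260.563
SLOPE = -22.9688


def predict_death_rate(wine_consumption):
    """Predicted heart disease deaths per 100,000 for a given wine consumption."""
    return INTERCEPT + SLOPE * wine_consumption


prediction = predict_death_rate(3)
print(f"{prediction:.2f} deaths per 100,000")  # 191.66 deaths per 100,000
```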


How do we determine the regression line?


We want the vertical distances from the points (observed) to the line (predicted) to be as small as possible. This means our error in predicting y is small.


Calculating the line


We will use the method of least squares to calculate the line. The least-squares regression line is the line that makes the sum of the squares of the vertical distances as small as possible.

Equation of the line: $\hat{y} = a + bx$ (read "y hat")
b is the slope (change in y when x increases by one unit): $b = r \frac{s_y}{s_x}$
a is the y intercept (value of y when x is 0): $a = \bar{y} - b\bar{x}$
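As a rough check, the slope and intercept formulas can be applied to the wine data (the Alcohol and hrt_deat columns from the residual table later in these notes). This is a sketch, not the Minitab procedure; the helper name `least_squares_line` is made up, and r is computed with the sample (n - 1) convention.

```python
from statistics import mean, stdev

# Alcohol (x) and hrt_deat (y) columns from the residual table in these notes.
alcohol = [2.50, 3.90, 2.90, 2.40, 2.90, 0.80, 9.10, 0.80, 0.70, 7.90,
           1.80, 1.90, 0.80, 6.50, 1.60, 5.80, 1.30, 1.20, 2.70]
heart = [211, 167, 131, 191, 220, 297, 71, 211, 300, 107,
         167, 266, 227, 86, 207, 115, 285, 199, 172]


def least_squares_line(xs, ys):
    """Return (a, b) using b = r * s_y / s_x and a = y_bar - b * x_bar."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    # Correlation r: sum of cross-deviations scaled by (n - 1) * s_x * s_y.
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x      # slope
    a = y_bar - b * x_bar  # intercept
    return a, b


a, b = least_squares_line(alcohol, heart)
print(f"death rate = {a:.3f} + ({b:.4f}) * wine consumption")
```

The result should agree with the regression equation reported in these notes, up to rounding.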


[Figure: fitted regression line, death rate vs. wine consumption.]

The regression equation is death rate = 260.563 - 22.9688 wine consumption S = 37.8786 R-Sq = 71.0 % R-Sq(adj) = 69.3 %

Analysis of Variance
Source      DF       SS       MS        F      P
Regression   1  59813.6  59813.6  41.6881  0.000
Error       17  24391.4   1434.8
Total       18  84204.9
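The summary figures in the output (S, R-Sq, R-Sq(adj), F) can be recovered from the Analysis of Variance table with standard arithmetic. The sketch below uses only the sums of squares and degrees of freedom printed above; the variable names are made up for this example.

```python
from math import sqrt

# Values read from the Analysis of Variance table in these notes.
ss_regression, ss_error, ss_total = 59813.6, 24391.4, 84204.9
df_regression, df_error, df_total = 1, 17, 18

ms_regression = ss_regression / df_regression  # mean square for regression
ms_error = ss_error / df_error                 # mean square for error
f_statistic = ms_regression / ms_error         # F = MS(regression) / MS(error)
s = sqrt(ms_error)                             # S, the residual standard error
r_squared = ss_regression / ss_total           # R-Sq as a fraction
r_squared_adj = 1 - ms_error / (ss_total / df_total)  # R-Sq(adj) as a fraction

print(f"F = {f_statistic:.4f}, S = {s:.4f}")
print(f"R-Sq = {100 * r_squared:.1f} %, R-Sq(adj) = {100 * r_squared_adj:.1f} %")
```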


Facts about regression.


1. There is a clear distinction between the response variable and the explanatory variable.
2. Correlation and slope: a change of one standard deviation in x corresponds to a change of r standard deviations in y.
3. The least-squares regression line passes through the point $(\bar{x}, \bar{y})$.
4. Some variation (spread) in y can be accounted for by changes in x when there is a linear relationship. The square of the correlation coefficient is the fraction of the variation in y values that is explained by changes in x.

$r^2 = \frac{\text{variation in y due to x}}{\text{total variation in observed y}}$ = coefficient of determination

[Figure: fitted regression line, death rate vs. wine consumption.]

The regression equation is death rate = 260.563 - 22.9688 wine consumption S = 37.8786 R-Sq = 71.0 % R-Sq(adj) = 69.3 %

R-sq can have a value between 0 and 1.



VARIATION OF DEPENDENT Y


Residuals
The leftovers from least-squares regression. Deviations from the overall pattern are important. The deviations in regression are the scatter of points about the line. The vertical distances from the points to the line are called residuals; they are the variation left over after a regression line is fit.

Residual = observed y - predicted y

$\text{residual} = y - \hat{y}$


The regression equation is death rate = 260.563 - 22.9688 wine consumption s = 37.8786 R-Sq = 71.0 % R-Sq(adj) = 69.3 %

The residuals are:

Obs  Alcohol  hrt_deat     Fit  SE Fit  Residual  St Resid
  1     2.50    211.00  203.14    8.89      7.86      0.21
  2     3.90    167.00  170.99    9.23     -3.99     -0.11
  3     2.90    131.00  193.95    8.70    -62.95     -1.71
  4     2.40    191.00  205.44    8.97    -14.44     -0.39
  5     2.90    220.00  193.95    8.70     26.05      0.71
  6     0.80    297.00  242.19   11.76     54.81      1.52
  7     9.10     71.00   51.55   23.29     19.45      0.65 X
  8     0.80    211.00  242.19   11.76    -31.19     -0.87
  9     0.70    300.00  244.49   12.00     55.51      1.55
 10     7.90    107.00   79.11   19.39     27.89      0.86
 11     1.80    167.00  219.22    9.72    -52.22     -1.43
 12     1.90    266.00  216.92    9.57     49.08      1.34
 13     0.80    227.00  242.19   11.76    -15.19     -0.42
 14     6.50     86.00  111.27   15.11    -25.27     -0.73
 15     1.60    207.00  223.81   10.06    -16.81     -0.46
 16     5.80    115.00  127.34   13.15    -12.34     -0.35
 17     1.30    285.00  230.70   10.64     54.30      1.49
 18     1.20    199.00  233.00   10.85    -34.00     -0.94
 19     2.70    172.00  198.55    8.77    -26.55     -0.72

The mean of residuals is always equal to 0
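To see this on the wine data, the sketch below refits the least-squares line from the Alcohol and hrt_deat columns, computes residual = observed y - predicted y, and checks that the residuals average to zero (up to floating-point error). The zero mean holds for a least-squares line that includes an intercept; the helper name `fit_line` is made up for this example.

```python
from statistics import mean

# Alcohol (x) and hrt_deat (y) columns from the residual table in these notes.
alcohol = [2.50, 3.90, 2.90, 2.40, 2.90, 0.80, 9.10, 0.80, 0.70, 7.90,
           1.80, 1.90, 0.80, 6.50, 1.60, 5.80, 1.30, 1.20, 2.70]
heart = [211, 167, 131, 191, 220, 297, 71, 211, 300, 107,
         167, 266, 227, 86, 207, 115, 285, 199, 172]


def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + b * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b


a, b = fit_line(alcohol, heart)
residuals = [y - (a + b * x) for x, y in zip(alcohol, heart)]

print(round(residuals[0], 2))        # residual for observation 1
print(abs(mean(residuals)) < 1e-9)   # True: the residuals average to zero
```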



Residual Plots
[Figure: plot of residuals versus Alcohol (response is hrt_deat).]

Do we have any influential points here?

Things to look for:

1. A curved pattern means the relationship is not linear.
2. Increasing or decreasing spread about the line
3. Individual points with large residuals
4. Individual points that are extreme in the x direction
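One way to check item 4 numerically is to compute each observation's leverage, which grows with its distance from the mean of x. Leverage and the cutoff used here are not part of the original slides: the cutoff 3p/n (with p = 2 fitted coefficients) is a common rule of thumb and is an assumption of this sketch.

```python
from statistics import mean

# Alcohol column from the residual table in these notes.
alcohol = [2.50, 3.90, 2.90, 2.40, 2.90, 0.80, 9.10, 0.80, 0.70, 7.90,
           1.80, 1.90, 0.80, 6.50, 1.60, 5.80, 1.30, 1.20, 2.70]

n = len(alcohol)
x_bar = mean(alcohol)
s_xx = sum((x - x_bar) ** 2 for x in alcohol)

# Leverage of each observation in simple linear regression.
leverage = [1 / n + (x - x_bar) ** 2 / s_xx for x in alcohol]

# Assumed rule of thumb: flag leverage above 3p/n, with p = 2 (intercept and slope).
cutoff = 3 * 2 / n
flagged = [i + 1 for i, h in enumerate(leverage) if h > cutoff]

print(flagged)  # [7]: the observation with Alcohol = 9.10
```

Under this assumed cutoff only observation 7 is flagged, which is the same observation marked with an X in the residual table.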


Ideal residual pattern

Curvature: a linear fit is not appropriate

Increasing variation


[Figure: fitted regression line (death rate = 260.563 - 22.9688 wine consumption; S = 37.8786, R-Sq = 71.0 %, R-Sq(adj) = 69.3 %) shown beside the plot of residuals versus Alcohol (response is hrt_deat).]

Regression Plot
C6 = 280.215 - 33.7666 C5
S = 40.0879   R-Sq = 42.0 %   R-Sq(adj) = 37.5 %
[Figure: regression plot of C6 against C5, shown beside the plot of residuals versus C5 (response is C6).]


Attention!! Caution!!
1. Correlation and regression describe only linear relationships.
2. r and r-sq are not resistant: outliers can change them substantially.
3. Do not extrapolate! Extrapolation means using the line to predict y for x values outside the range of the data.
4. Correlations based on averages are too high when applied to individuals. If the data have been averaged, the values of correlation and regression cannot be used with un-averaged values (i.e., average alcohol consumption per country, not individuals).
5. Lurking variables, like the male/female variable in the weight vs. energy data and the possible Mediterranean variable in the wine data.
6. Correlation/association is not causation.
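The danger of extrapolating can be seen with the wine regression line itself. In this sketch the observed range of wine consumption (0.7 to 9.1) is read from the Alcohol column of the residual table; the value 12 and the helper names are made up for illustration.

```python
# Coefficients from the fitted regression line in these notes.
INTERCEPT = 260.563
SLOPE = -22.9688

# Smallest and largest Alcohol values in the residual table.
X_MIN, X_MAX = 0.70, 9.10


def predict(x):
    """Predicted heart disease deaths per 100,000 at wine consumption x."""
    return INTERCEPT + SLOPE * x


def is_extrapolation(x):
    """True when x lies outside the range of the observed data."""
    return not (X_MIN <= x <= X_MAX)


for x in (3, 12):
    print(x, round(predict(x), 2), "extrapolation" if is_extrapolation(x) else "within data")
```

At a consumption of 12 the line predicts a negative death rate, which is impossible; the line was only ever fitted to values between 0.7 and 9.1.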