These questions reflect a desire to understand the relationship between two variables. What we need:
1. A plot/graph to view the relationship
2. Characteristics to describe
3. Measures of the characteristics
4. A method to make inferences about the relationship
Do heavier people burn more energy?
Response: metabolic rate
Explanatory: weight or mass

Does wine consumption cause a decrease in heart disease?
Response: death rate from heart disease
Explanatory: wine consumption
Correlation & Regression
Do heavier people burn more energy? Lean body mass vs. metabolic rate
[Figure: scatter plot of Rate (cal) against Mass (kg)]
Is wine good for your heart? Wine consumption vs. heart disease rate (per 100,000)
[Figure: scatter plot of hrt_death rate against Alcohol (wine consumption)]
Interpreting a scatter plot: characteristics to look for
Patterns:
   Form (clusters, scatter, linear, ...)
   Direction (positive, negative)
   Strength (how closely the points follow the form)
Deviations:
   Outliers
Interpret the last two scatter plots.
Scatter plot: relationship between quantitative variables
Form: linear is probably the most common form
Strength: we can measure the strength of a linear relationship, because our eyes can deceive us!!!
[Figure: two scatter plots, each captioned "Strength?"]
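Correlation
measures the direction and strength of a linear relationship

Correlation is an average product of standardised values:
$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$
where $(x_i - \bar{x})/s_x$ is the standardised value of each x and $(y_i - \bar{y})/s_y$ is the standardised value of each y.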
Correlation = r
   Requires quantitative variables
   Describes linear relationships only
   r has no units
   r is always between -1 and 1
   Positive r = positive association
   Negative r = negative association
   r = 0 means no linear association
   r is influenced by outliers
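The definition of r as an average product of standardised values can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own computation: the function name `correlation` and the small data set are invented here, and the sample standard deviation (divisor n - 1) is assumed.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Pearson r as the average product of standardised values."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (divisor n - 1)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

# Hypothetical data, invented for illustration (NOT the lecture's data set).
mass = [35.0, 40.0, 45.0, 50.0, 55.0, 60.0]
rate = [1000.0, 1150.0, 1200.0, 1400.0, 1450.0, 1700.0]

r = correlation(mass, rate)

# r has no units: converting kg to g leaves it unchanged.
r_rescaled = correlation([m * 1000 for m in mass], rate)

# Reversing the direction of the association flips the sign of r.
r_flipped = correlation(mass, [-y for y in rate])
```

Because each variable is standardised before the products are averaged, changing the units of either variable cannot change r.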
Do heavier people burn more energy? Lean body mass vs. metabolic rate
[Figure: scatter plot of Rate (cal) against Mass (kg)]
Correlations: Mass (kg), Rate (cal)
Pearson correlation of Mass(kg) and Rate(cal) = 0.865
P-Value = 0.000
[Figure: scatter plot of Rate (cal) against Mass (kg), plotted by sex (legend: Females o)]
Correlations: Mass (kg)_F, Rate (cal)_F
Pearson correlation of Mass(kg)_F and Rate(cal)_F = 0.876
Correlations: Mass (kg)_M, Rate (cal)_M
Pearson correlation of Mass (kg)_M and Rate (cal)_M = 0.592
Is wine good for your heart? Wine consumption vs. heart disease rate (per 100,000)
[Figure: scatter plot of hrt_death rate against Alcohol (wine consumption)]
Correlations: Alcohol, heart_death rate
Pearson correlation of Alcohol and hrt_death rate = -0.843
[Figure: second scatter plot of heart death rate against wine consumption]
Correlations: Wine consumption, heart death rate
Pearson correlation of Wine consumption and hrt death rate = -0.648
[Figure: scatter plot of hrt_death rate against Alcohol (wine consumption), with a fitted line]
We can summarise an overall linear form with a line. The best line is called the Regression Line.
A regression line describes how a response variable changes as an explanatory variable changes. We can now predict a value of y when given an x.
Fitted regression line: death rate vs. wine consumption
death rate = 260.563 - 22.9688 wine consumption
S = 37.8786   R-Sq = 71.0 %   R-Sq(adj) = 69.3 %
[Figure: fitted line plot of death rate against wine consumption]
What would be the death rate due to heart disease if the average daily consumption of wine was 3 glasses? 191.66 deaths per 100,000
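The prediction is obtained by substituting x = 3 into the fitted line. A minimal sketch, using the coefficients quoted in the fitted line; the function name `predict_death_rate` is made up for this illustration.

```python
# Coefficients copied from the fitted regression line quoted in the slides.
INTERCEPT = 260.563   # a: predicted death rate when wine consumption is 0
SLOPE = -22.9688      # b: change in predicted death rate per unit of wine consumption

def predict_death_rate(wine_consumption):
    """Predicted heart-disease deaths per 100,000 (hypothetical helper name)."""
    return INTERCEPT + SLOPE * wine_consumption

prediction = predict_death_rate(3)
print(round(prediction, 2))  # 191.66
```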
$\hat{y} = a + bx$
Equation of the line ($\hat{y}$ is read "y hat")
b is the slope (rate of change in y when x increases)
a is the y-intercept (value of y when x is 0)
$b = r \dfrac{s_y}{s_x}$
$a = \bar{y} - b\bar{x}$
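The slope and intercept formulas can be tried out on a small data set. The sketch below is an illustration only: the function name `fit_line` and the five data points are invented (they are not the wine data), and sample standard deviations are assumed.

```python
from statistics import mean, stdev

def fit_line(xs, ys):
    """Least-squares line via b = r * s_y / s_x and a = y_bar - b * x_bar."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    # r as the average product of standardised values
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x        # slope
    a = y_bar - b * x_bar    # intercept
    return a, b

# Hypothetical data, invented for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [250.0, 210.0, 200.0, 170.0, 140.0]

a, b = fit_line(x, y)

# The fitted line always passes through the point (x_bar, y_bar).
y_at_mean = a + b * mean(x)
```

For these made-up points the slope comes out at about -26 and the intercept at about 272, and the line passes through the point of means (3, 194).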
[Figure: fitted line plot of death rate against wine consumption]
The regression equation is
death rate = 260.563 - 22.9688 wine consumption
S = 37.8786   R-Sq = 71.0 %   R-Sq(adj) = 69.3 %

Analysis of Variance
Source        DF   SS         MS         F         P
Regression     1   59813.6    59813.6    41.6881   0.000
Error         17   24391.4     1434.8
Total         18   84204.9
$r^2$ = coefficient of determination: the proportion of the variation of the dependent variable y that is explained by the regression on x.
[Figure: fitted line plot of death rate against wine consumption]
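The summary statistics in the regression output can be recovered from the Analysis of Variance table. The sketch below copies the sums of squares and degrees of freedom from that table; the variable names are chosen here for readability and are not Minitab's.

```python
from math import sqrt

# Values copied from the Analysis of Variance table in the regression output.
ss_regression = 59813.6
ss_error = 24391.4
ss_total = 84204.9
df_regression, df_error, df_total = 1, 17, 18

# Coefficient of determination: share of the variation in y explained by x.
r_squared = ss_regression / ss_total                    # about 0.710 (R-Sq = 71.0 %)

ms_regression = ss_regression / df_regression
ms_error = ss_error / df_error                          # about 1434.8
f_statistic = ms_regression / ms_error                  # about 41.69
s = sqrt(ms_error)                                      # about 37.88 (S in the output)

# Adjusted R-Sq compares mean squares instead of sums of squares.
r_squared_adj = 1 - ms_error / (ss_total / df_total)    # about 0.693 (69.3 %)
```

As a cross-check, squaring the reported correlation of -0.843 gives about 0.711, which agrees with R-Sq up to rounding.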
Residuals
the left-overs from least-squares regression
Deviations from the overall pattern are important. The deviations in regression are the scatter of points about the line. The vertical distances from the line to the points are called residuals, and they are the left-over variation after a regression line is fit.
Residual = observed y - predicted y
$\text{residual} = y - \hat{y}$
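A short sketch of computing residuals, assuming a least-squares fit on a small invented data set (not the wine data). The helper name `least_squares` is hypothetical.

```python
from statistics import mean

def least_squares(xs, ys):
    """Return intercept a and slope b of the least-squares line."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    return a, b

# Hypothetical data, invented for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [250.0, 210.0, 200.0, 170.0, 140.0]

a, b = least_squares(x, y)
predicted = [a + b * xi for xi in x]

# residual = observed y - predicted y
residuals = [yi - y_hat for yi, y_hat in zip(y, predicted)]
```

For a least-squares line with an intercept, the residuals always sum to zero, so a residual plot is centred on the horizontal line at 0.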
Residual Plots
Things to look for:
1. A curved pattern, which means the relationship is not linear
2. Increasing/decreasing spread about the line
3. Individual points with large residuals
4. Individual points that are extreme in the x direction
[Figure: Residuals Versus Alcohol (response is hrt_deat)]
[Figure: fitted line plot of death rate against wine consumption beside the plot of residuals against Alcohol, annotated "Increasing variation"]
Regression Plot
C6 = 280.215 - 33.7666 C5
S = 40.0879   R-Sq = 42.0 %   R-Sq(adj) = 37.5 %
[Figure: regression plot of C6 against C5, beside Residuals Versus C5 (response is C6)]
Attention!! Caution!!
1. Correlation and regression describe only linear relationships.
2. r and r-sq are not resistant: they are strongly affected by outliers.
3. Do not extrapolate!!! What is extrapolation? Predicting y for x values outside the range of the data.
4. Correlations based on averages are too high when applied to individuals: if the data have been averaged, the values of correlation and regression cannot be used with un-averaged values (i.e., average alcohol consumption per country, not individuals).
5. Lurking variables: like the male/female variable in the weight vs. energy data and the possible Mediterranean variable in the wine data.
6. Correlation/association is not causation.
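Cautions 2 and 3 can be made concrete with a small sketch. The outlier data are invented for illustration. The extrapolation example reuses the fitted wine line and assumes that x = 12 lies outside the observed data, since the fitted line plot's axis runs from 0 to 9.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Pearson r as the average product of standardised values."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Caution 2: r is not resistant. Hypothetical, nearly linear data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
r_clean = correlation(x, y)                         # close to 1

# Adding a single outlying point pulls r down sharply.
r_with_outlier = correlation(x + [7.0], y + [0.0])  # far below r_clean

# Caution 3: do not extrapolate. Coefficients from the fitted wine line.
def predict_death_rate(wine_consumption):
    return 260.563 - 22.9688 * wine_consumption

# x = 12 is assumed to be outside the data; the prediction is a negative
# death rate, which is impossible.
extrapolated = predict_death_rate(12)
```

One added point is enough to change the picture that r gives, and a line that fits well inside the data can produce nonsense outside it.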