Overview; Relationship between two variables; Correlation; Regression; The simple linear regression model; Parameter estimation; Interpretation of correlation coefficient; Coefficient of determination, $R^2$; Prediction; Regression diagnostics; Worked example; Multiple linear regression
Overview
We consider regression analysis, which is used to explain variation in market share, sales, brand preference, etc. This may use explanatory variables such as advertising, price, distribution and product quality. Starting with correlation, we proceed to the simple linear model, followed by multiple linear regression.
[Figure: scatter plot for the unemployment/crime dataset; axis label: number of offences]
[Figure: two scatter plots of y against x]
Correlation
Correlation measures the strength of the linear relationship between two variables, each measured on an interval scale.
Positive correlation: the two variables tend to vary in the same direction.
Negative correlation: the two variables tend to vary in opposite directions.
Perfect correlation: the data points all lie exactly on a straight line.
Correlation
If there exists a perfect linear relationship between X and Y, we can represent it using an equation of the form
$$Y = \alpha + \beta X$$
where $\alpha$ represents the intercept of the line and $\beta$ represents the slope or gradient of the line.
Examples of anticipated correlation:
Variables                                   Correlation
Height & weight                             Positive
Rainfall & sunshine hours                   Negative
Ice cream sales & sun cream sales           Positive
Hours of study & exam mark                  Positive
Car's petrol consumption & goals scored     Zero
Correlation
Positive correlation: large X with large Y; small X with small Y.
Negative correlation: large X with small Y; small X with large Y.
However, since X and Y may have widely different numerical values, we need to take this into account. We do this by considering how far each of the two scores is from its mean.
Correlation
So, we are interested in the degree to which variations in variable values are related to each other. Our basis for the measurement of correlation is
$$\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}$$
Unfortunately, this measure is extremely sensitive to the units in which the variables are measured. We would prefer a measure of correlation that remains the same regardless of the units of measurement (e.g. days, hours, minutes or seconds).
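The sensitivity to units can be seen in a small sketch. The data below are hypothetical (not from the slides), and the function name `cross_deviation_sum` is introduced here purely for illustration. Converting x from days to hours multiplies the sum by 24, even though the relationship between the variables is unchanged.

```python
def cross_deviation_sum(xs, ys):
    """Return the sum over i of (x_i - x_bar) * (y_i - y_bar)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))


# Hypothetical data: a duration in days and some response.
days = [1.0, 2.0, 3.0, 4.0, 5.0]
response = [2.0, 4.1, 5.9, 8.2, 9.8]

# The same durations expressed in hours.
hours = [24.0 * d for d in days]

s_days = cross_deviation_sum(days, response)
s_hours = cross_deviation_sum(hours, response)

# The shortcut form: sum of x_i * y_i minus n * x_bar * y_bar.
n = len(days)
shortcut = sum(x * y for x, y in zip(days, response)) - n * (sum(days) / n) * (sum(response) / n)

print(s_days, s_hours, shortcut)
```

The measure changes by the conversion factor, which is why a unit-free measure is preferable.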
Correlation
So, we use the following to measure the correlation for (sample) data:
$$r = \frac{\sum x_i y_i - n \bar{x} \bar{y}}{\sqrt{\left(\sum x_i^2 - n \bar{x}^2\right)\left(\sum y_i^2 - n \bar{y}^2\right)}}$$
Correlation
Returning to the unemployment/crime dataset:
$$\sum x_i = 19979, \quad \sum x_i^2 = 36695129, \quad \sum y_i = 66803, \quad \sum y_i^2 = 374471231, \quad \sum x_i y_i = 113784494$$
Since $n = 12$, we have $\bar{x} = 19979/12 = 1664.92$ and $\bar{y} = 66803/12 = 5566.92$.
Hence the (sample) correlation coefficient, r, is $r = 0.861$.
Of course, in practice we can use software like SPSS to calculate r for us!
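As a check on the arithmetic, the following minimal sketch computes r directly from the summary sums quoted for the unemployment/crime dataset. The function name `correlation_from_sums` is an illustrative choice, not something taken from the slides or from SPSS.

```python
import math


def correlation_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Sample correlation coefficient computed from summary sums."""
    x_bar = sum_x / n
    y_bar = sum_y / n
    s_xy = sum_xy - n * x_bar * y_bar   # numerator
    s_xx = sum_x2 - n * x_bar ** 2      # variation in x
    s_yy = sum_y2 - n * y_bar ** 2      # variation in y
    return s_xy / math.sqrt(s_xx * s_yy)


r = correlation_from_sums(
    n=12,
    sum_x=19979,
    sum_y=66803,
    sum_x2=36695129,
    sum_y2=374471231,
    sum_xy=113784494,
)
print(round(r, 3))  # 0.861
```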
Correlation
The (sample) correlation coefficient, r, takes values between $-1$ and $1$, i.e. $-1 \le r \le 1$.
$r > 0$ indicates positive correlation, with $r = 1$ indicating perfect positive correlation.
$r < 0$ indicates negative correlation, with $r = -1$ indicating perfect negative correlation.
The closer $|r|$ is to 1, the stronger the linear relationship.
Regression
Here we introduce simple linear regression. This is only part of a very large topic in statistical analysis. In the simple model, we have two variables, Y and X:
Y is the dependent (or response) variable: that which we are trying to explain.
X is the independent (or explanatory) variable: the factor we think influences Y.
Assume a true (population) linear relationship between a response variable y and an explanatory variable x of the approximate form:
$$y = \alpha + \beta x$$
$\alpha$ and $\beta$ are fixed, but unknown, population parameters.
$\alpha$ is the y-intercept.
$\beta$ is the slope of the line.
We seek to estimate $\alpha$ and $\beta$ using (paired) sample data $(x_i, y_i)$, $i = 1, \ldots, n$.
Particularly in business, we would not expect a perfect linear relationship between the two variables. Hence we modify this basic model to
$$y = \alpha + \beta x + \varepsilon$$
where $\varepsilon$ is some random perturbation from the initial approximate line. In other words, each y observation almost lies on the postulated line, but jumps off the line according to the random variable $\varepsilon$, often referred to as the error term.
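To make the role of the error term concrete, here is a minimal simulation sketch. It assumes normally distributed errors with standard deviation `sigma`. The parameter values and the function name `simulate` are hypothetical, chosen only for illustration.

```python
import random


def simulate(intercept, slope, xs, sigma, seed=0):
    """Generate y = intercept + slope * x + error, with error ~ N(0, sigma^2).

    Normal errors are an assumption made for this sketch.
    """
    rng = random.Random(seed)
    return [intercept + slope * x + rng.gauss(0.0, sigma) for x in xs]


xs = [float(x) for x in range(1, 11)]

# With sigma = 0 every observation lies exactly on the line.
on_line = simulate(1.0, 2.0, xs, sigma=0.0)

# With sigma > 0 each observation "jumps off" the line by a random amount.
noisy = simulate(1.0, 2.0, xs, sigma=1.5)

for x, y_exact, y_noisy in zip(xs, on_line, noisy):
    print(f"x={x:4.1f}  line={y_exact:6.2f}  observed={y_noisy:6.2f}")
```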
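The least squares estimator for $\alpha$ is
$$\hat{\alpha} = \bar{y} - \hat{\beta} \bar{x}$$
where $\hat{\beta}$ is the least squares estimator for $\beta$. Hence the line of best fit has equation
$$\hat{y} = \hat{\alpha} + \hat{\beta} x$$
Again, this is routinely calculated in SPSS.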
Example
Returning to the unemployment/crime dataset:
$$\sum x_i = 19979, \quad \sum x_i^2 = 36695129, \quad \sum y_i = 66803, \quad \sum y_i^2 = 374471231, \quad \sum x_i y_i = 113784494$$
Since $n = 12$, we have $\bar{x} = 19979/12 = 1664.92$ and $\bar{y} = 66803/12 = 5566.92$, hence
$$\hat{\beta} = \frac{\sum x_i y_i - n \bar{x} \bar{y}}{\sum x_i^2 - n \bar{x}^2} = \frac{113784494 - (12 \times 1664.92 \times 5566.92)}{36695129 - (12 \times 1664.92^2)} = 0.7468$$
Example
We estimate the intercept to be
$$\hat{\alpha} = \bar{y} - \hat{\beta} \bar{x} = 5566.92 - 0.7468 \times 1664.92 = 4323.6$$
Hence the least squares regression line is
$$\hat{y} = 4323.6 + 0.7468x$$
Note the $\hat{y}$ notation, where the hat denotes an estimated value.
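The estimates can be reproduced from the summary sums with a short sketch. The function name `least_squares_from_sums` is an illustrative choice. Because this version does not round the means to two decimal places, its output differs slightly in the later digits from the hand calculation (roughly 0.7469 and 4323.4 rather than 0.7468 and 4323.6).

```python
def least_squares_from_sums(n, sum_x, sum_y, sum_x2, sum_xy):
    """Return (intercept, slope) of the least squares line from summary sums."""
    x_bar = sum_x / n
    y_bar = sum_y / n
    slope = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
    intercept = y_bar - slope * x_bar
    return intercept, slope


intercept, slope = least_squares_from_sums(
    n=12,
    sum_x=19979,
    sum_y=66803,
    sum_x2=36695129,
    sum_xy=113784494,
)
print(f"intercept = {intercept:.1f}, slope = {slope:.4f}")
```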
In the case of perfect correlation between X and Y, we can predict Y directly and exactly from X. In the case of zero correlation between X and Y, knowledge of X tells us nothing about Y. Here we consider measuring the extent to which the values of one variable can be used to predict the values of another where the correlation is neither $-1$, nor 0, nor 1.
$$\sum_{i=1}^{n} (y_i - \bar{y})^2$$
Coefficient of determination, $R^2$
We can assess the overall fit of a model using $R^2$. This measures the proportion of the total variability in the response variable explained by the model. This statistic is known as the coefficient of determination and is defined as
$$R^2 = \frac{\text{ESS}}{\text{TSS}}, \qquad 0 \le R^2 \le 1$$
where ESS is the explained sum of squares and TSS is the total sum of squares.
The closer $R^2$ is to 1, the better the explanatory power of the model.
Note that $R^2 = r^2$ for a simple linear model, so we can also compute it from r (the correlation coefficient).
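The identity between ESS/TSS and the squared correlation can be checked numerically. The sketch below uses hypothetical data and takes ESS to be the sum of squared deviations of the fitted values from the mean of y, and TSS to be the sum of squared deviations of the observations from that mean. The helper `fit_line` is introduced here for illustration.

```python
import math


def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 6.8]

intercept, slope = fit_line(xs, ys)
fitted = [intercept + slope * x for x in xs]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
tss = sum((y - y_bar) ** 2 for y in ys)        # total sum of squares
ess = sum((f - y_bar) ** 2 for f in fitted)    # explained sum of squares
r_squared = ess / tss

# Correlation coefficient computed separately, for comparison.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
r = s_xy / math.sqrt(s_xx * tss)

print(r_squared, r ** 2)
```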
Coefficient of determination, $R^2$
Returning to the crime/unemployment dataset, let's assign Y and X as follows:
Y = number of offences
X = unemployment
The least squares regression line was $\hat{y} = 4323.6 + 0.7468x$.
The correlation coefficient was 0.861, therefore $R^2 = 0.861^2 = 0.7413$.
This means we can explain 74.13% of the variation in number of offences using unemployment.
Prediction
One of the purposes in calculating the line of best fit is prediction. Specifically, for some value of x, we can provide a prediction for y. So, returning to the example, how many offences would you predict if there were 2000 unemployed people in a city area?
Answer: just substitute the desired value of x into the least squares regression line:
$$\hat{y} = 4323.6 + 0.7468 \times 2000 = 5817$$
Prediction
Provided we are predicting y for an x value that is within the range of the available x data, we can be fairly confident in the prediction. This is what we call interpolation.
However, if we base our prediction on an x value outside the available x data, then we should view the prediction with caution. This would be an example of extrapolation, which is risky since the relationship between x and y may change for such values of x.
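A prediction routine can flag extrapolation automatically. In the sketch below the fitted line is the one from the example, but the observed range of x (`x_min`, `x_max`) is hypothetical, since the individual data values are not listed here. The function name `predict` is likewise an illustrative choice.

```python
def predict(intercept, slope, x, x_min, x_max):
    """Return the predicted y and whether x lies within the observed x range."""
    y_hat = intercept + slope * x
    is_interpolation = x_min <= x <= x_max
    return y_hat, is_interpolation


# x_min and x_max are assumed values for illustration, not taken from the data.
y_hat, inside = predict(4323.6, 0.7468, 2000, x_min=1000, x_max=2500)
print(round(y_hat), "interpolation" if inside else "extrapolation: treat with caution")

y_far, inside_far = predict(4323.6, 0.7468, 5000, x_min=1000, x_max=2500)
print(round(y_far), "interpolation" if inside_far else "extrapolation: treat with caution")
```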
Regression diagnostics
The usefulness of a fitted regression model rests on a basic assumption: $E(y) = \alpha + \beta x$.
Furthermore, inference such as hypothesis tests, confidence intervals and predictive intervals only makes sense if the error terms are (approximately) independent and normal with constant variance $\sigma^2$.
Therefore it is important to check that these conditions are met in practice; this task is called regression diagnostics.
Basic idea: look into the residuals $\hat{\varepsilon}_i$ or the normalised residuals $\hat{\varepsilon}_i / \hat{\sigma}$.
Regression diagnostics
What to look for?
Do the residuals manifest IID normal behaviour?
Is the scatter plot of $\hat{\varepsilon}_i$ versus $x_i$ patternless?
Is the scatter plot of $\hat{\varepsilon}_i$ versus $\hat{y}_i$ patternless?
Is the scatter plot of $\hat{\varepsilon}_i$ versus $i$ patternless?
If you see trends, periodic patterns or increasing variation in any one of the above scatter plots, it is very likely that at least one assumption is violated!
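The quantities needed for these plots can be computed as in the following sketch. The data are hypothetical, and the estimate of the error standard deviation uses n - 2 degrees of freedom, which is a common convention assumed here rather than stated above. The function names are illustrative.

```python
import math


def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


def residual_diagnostics(xs, ys):
    """Return fitted values, residuals and normalised residuals."""
    intercept, slope = fit_line(xs, ys)
    fitted = [intercept + slope * x for x in xs]
    residuals = [y - f for y, f in zip(ys, fitted)]
    # Assumption: sigma is estimated from the residual sum of squares
    # divided by n - 2.
    sigma_hat = math.sqrt(sum(e ** 2 for e in residuals) / (len(xs) - 2))
    normalised = [e / sigma_hat for e in residuals]
    return fitted, residuals, normalised


# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.3, 3.8, 6.1, 8.2, 9.7, 12.1, 14.2, 15.8]

fitted, residuals, normalised = residual_diagnostics(xs, ys)

# Columns to plot against each other: index i, x_i, fitted value, residual.
for i, (x, f, e, z) in enumerate(zip(xs, fitted, residuals, normalised), start=1):
    print(f"{i:>2}  x={x:4.1f}  fitted={f:7.3f}  residual={e:7.3f}  normalised={z:6.2f}")
```

Plotting the residual column against each of the other columns gives the scatter plots listed above.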
Regression diagnostics
Two other issues in regression diagnostics: outliers and influential observations.
Outlier: an unusually small or unusually large $y_i$ which lies outside the majority of observations.
An outlier is often caused by an error in either sampling or recording data. If so, we should correct it before proceeding with the regression analysis.
If an observation which looks like an outlier indeed belongs to the sample and no errors in sampling or recording were discovered, we may use a more complex model or distribution to accommodate this outlier. For example, stock returns often exhibit extreme values and they often cannot be modelled satisfactorily by a normal regression model.
Regression diagnostics
Influential observation: an $x_i$ which is far away from the other $x_i$s. Such an observation may have a large influence on the fitted regression line.
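One simple way to gauge influence is to fit the line with and without the suspect observation and compare the results. The sketch below does this with hypothetical data in which the last x value lies far from the rest. The comparison approach and the helper name `fit_line` are illustrative assumptions, not a prescribed diagnostic.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


# Hypothetical data: the last observation has an x far from the others.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 8.0]

intercept_with, slope_with = fit_line(xs, ys)
intercept_without, slope_without = fit_line(xs[:-1], ys[:-1])

print(f"with the point:    y = {intercept_with:.2f} + {slope_with:.2f} x")
print(f"without the point: y = {intercept_without:.2f} + {slope_without:.2f} x")
```

In this example the single distant point changes the slope substantially.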
We fit a regression model:
$$\text{Cisco} = \alpha + \beta \, \text{S\&P500} + \varepsilon$$
Rationale: part of the fluctuation in Cisco returns was driven by the fluctuation of the S&P500 returns.
[SPSS output: Coefficients table. Columns: Unstandardized Coefficients (B, Std. Error), Standardized Coefficients (Beta), Sig. Rows: (Constant), Cisco. Footnote a: Dependent Variable: SP500. Surviving cell values: .227, .015, -.012, .064]
Worked example
R       R Square   Adjusted R Square   Std. Error of the Estimate
.687    .472       .470                1.01964
When testing the statistical significance of regression coefficients, we just need to look at the p-value. The smaller the p-value, the more significant the result, i.e. the stronger the evidence that the true parameter value is different from zero. In practice, we treat p-values smaller than 0.05 as being statistically significant (at the 5% significance level).
$R^2 = 47.2\%$ of the variation of Cisco stock may be explained by the variation of the S&P500 index, or in other words 47.2% of the risk in Cisco stock is the market-related risk (see CAPM below).
CAPM: a simple asset pricing model in finance:
$$y_i = \alpha + \beta x_i + \varepsilon_i$$
where $y_i$ is a stock return and $x_i$ is a market return at time $i$.
Total risk:
$$\frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2 = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Market-related risk:
$$\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = \hat{\beta}^2 \, \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
Firm-specific risk:
$$\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
$\beta$ measures the market-related (or systematic) risk of the stock.
Market-related risk is unavoidable, while firm-specific risk may be diversified away through hedging.
Variance is a simple measure (and one of the most frequently used) for risk in finance.
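The split of total variance into a market-related part and a firm-specific part can be verified numerically. The returns below are hypothetical (they are not the Cisco and S&P500 data), and the function names are introduced only for this sketch.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


def risk_decomposition(market_returns, stock_returns):
    """Return (total, market_related, firm_specific) variances."""
    n = len(market_returns)
    intercept, slope = fit_line(market_returns, stock_returns)
    x_bar = sum(market_returns) / n
    y_bar = sum(stock_returns) / n
    fitted = [intercept + slope * x for x in market_returns]
    total = sum((y - y_bar) ** 2 for y in stock_returns) / n
    market_related = slope ** 2 * sum((x - x_bar) ** 2 for x in market_returns) / n
    firm_specific = sum((y - f) ** 2 for y, f in zip(stock_returns, fitted)) / n
    return total, market_related, firm_specific


# Hypothetical returns (in percent), for illustration only.
market_returns = [0.5, -1.2, 0.8, 1.5, -0.3, 0.1, -0.9, 1.1]
stock_returns = [1.0, -2.5, 0.9, 3.2, -1.1, 0.8, -1.0, 1.9]

total, market_related, firm_specific = risk_decomposition(market_returns, stock_returns)
print(f"total = {total:.4f}")
print(f"market-related = {market_related:.4f}")
print(f"firm-specific = {firm_specific:.4f}")
print(f"share explained by the market = {market_related / total:.3f}")
```

The share printed on the last line is the $R^2$ of the fitted line.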
Previously we saw simple linear regression, which had one explanatory variable. Often one explanatory variable is not enough to explain variation in the response variable, so we add more linear explanatory variables.
Remember, the aim of statistics is prediction and decision making. In order to make the best predictions and decisions, we need to use the best models. This often means making more complex models, adding more explanation. But not too complex (Occam's razor).
We can visualise up to n = 3
Multiple linear regression uses least squares estimation, like simple linear regression. That is, we minimise the sum of the squared residuals in all dimensions. This sounds tricky, but fortunately software (SPSS etc.) takes care of that for us.
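For readers curious about what the software does, the following minimal sketch minimises the sum of squared residuals by solving the normal equations with Gaussian elimination. This is one standard approach, shown here as an assumption: it is not necessarily how SPSS computes the estimates. The data are constructed so that y = 1 + 2 x1 + 3 x2 exactly, so the fit should recover those coefficients.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations act on both sides at once.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_multiple(rows, ys):
    """Least squares coefficients [intercept, b1, ..., bk] via the normal equations."""
    design = [[1.0] + list(row) for row in rows]  # prepend the intercept column
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical data with two explanatory variables and no noise.
rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 6.0), (6.0, 5.0)]
ys = [1.0 + 2.0 * x1 + 3.0 * x2 for x1, x2 in rows]

coefficients = fit_multiple(rows, ys)
print([round(c, 6) for c in coefficients])
```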