
Correlation and Regression

Dr. McGahagan - Stat 1040


Assume that we have data on scores on the verbal SAT of entering freshmen and we want to predict their GPA for the first year. We know that the correlation between SAT scores and freshman GPA is 0.6. We also have data on the means and standard deviations of both SAT and GPA:

            MEAN    SD
    SAT     500     100
    GPA     2.5     0.5

    r = 0.6

a) Forecast the expected GPA of a student who has scored 450 on the verbal SAT.

Notation: E[GPA | SAT = 450] means the expected value of GPA given a SAT score of 450.

Solution: The student is Z(SAT) = (450 - 500) / 100 = -0.5 S.D.s away from the mean (half a S.D. below the mean). He will be expected to have a GPA below the mean as well, but because of the regression effect, his expected GPA is less than half a S.D. below the average GPA. The exact prediction will be:

    Z(GPA) = 0.6 * Z(SAT)

so the expected value of his GPA, given a SAT score half a S.D. below the mean, will be 0.3 S.D.s below the mean of the GPAs, since:

    Z(GPA) = 0.6 * (-0.5) = -0.3

To translate this into an expected GPA, note that we want to find the value k so that:

    Z(GPA) = (k - Mean(GPA)) / SD(GPA) = -0.3

    k - Mean(GPA) = -0.3 * SD(GPA)
    k - 2.5 = -0.3 * (0.5)
    k = 2.5 - 0.15 = 2.35

To make sure that you haven't neglected the negative sign on the standardized scores, remember that the expected GPA of the student should be BELOW the average of 2.5 -- if your answer was 2.65, something obviously went wrong.
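The arithmetic in part (a) can be checked in R. This is only a sketch: the variable names (mean_sat, z_gpa and so on) are made up for illustration and do not come from the handout.

```r
# Summary statistics from the problem statement
mean_sat <- 500
sd_sat   <- 100
mean_gpa <- 2.5
sd_gpa   <- 0.5
r        <- 0.6

sat   <- 450
z_sat <- (sat - mean_sat) / sd_sat   # standardized SAT score
z_gpa <- r * z_sat                   # regression effect: shrink toward the mean by r

# Convert the standardized prediction back to GPA units
expected_gpa <- mean_gpa + z_gpa * sd_gpa
print(expected_gpa)
```

Changing sat to any other score gives the corresponding forecast without redoing the algebra by hand.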

b) Calculate the student's chance of getting above an average of 3.0.

Solution: We will have to calculate the root mean square error (also known as the standard error) of our forecast in order to work this one through. The standard error of the forecast will be about:

    SD(GPA) * sqrt(1 - r^2) = 0.5 * sqrt(1 - 0.6^2) = 0.5 * sqrt(1 - 0.36) = 0.5 * sqrt(0.64) = 0.4

(Note: "about" because the precise calculation of forecast errors is a little more complicated, but a lot more confusing, and we won't worry about the precision in an introductory course.)

    Z(3.0 | E[GPA] = 2.35) = (3.0 - E[GPA]) / RMSE = (3.0 - 2.35) / 0.4 = 0.65 / 0.4 = 1.625

Looking up 1.60 in the tables, we find that the central area under a normal curve from -1.60 to +1.60 is 89.04 percent, so the two-tail area is 100 - 89.04 = 10.96 percent, and the chance of being above a GPA of 3.0 is half of that, or 5.48 percent.

c) Calculate the student's chance of getting below a GPA average of 1.8. Do it before reading further!

You should get Z = (1.80 - 2.35) / 0.4 = -0.55 / 0.4 = -1.375, and a central area (using 1.35 for the Z-score) of 82.30 percent. The chance of a GPA below 1.8 is (100 - 82.30) / 2 = 17.70 / 2 = 8.85 percent.

d) Calculate a 90 percent confidence interval for your forecast.

First, find the appropriate multiplier for an interval that contains 90 percent of the data in the middle. Use the normal table to find:

    z       Height    Area
    1.65    10.23     90.11

The confidence interval will be

    2.35 +/- 1.65 * RMSE = 2.35 +/- 1.65 * (0.4) = 2.35 +/- 0.66 = 1.69 to 3.01

REGRESSION METHOD

Note that these steps amount to finding the appropriate z-scores for what we want to predict by using the formula:

    r * Zx = Zy

Y will be the dependent variable or response variable that we try to predict. X will be the independent variable or explanatory variable or treatment variable that we know about. In our case, Y is GPA and X is the SAT score.

Let's spell that out a bit more:

    r * (X - Mean(X)) / SDx = (Y - Mean(Y)) / SDy

or:

    r * SDy / SDx = (Y - Mean(Y)) / (X - Mean(X))

Is the formula familiar? Graphing a point and drawing a line from the point of means (Mean(X), Mean(Y)) may help.
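Parts (b) through (d) can also be worked without the printed table, using R's pnorm and qnorm. The sketch below uses illustrative variable names of my own. Because R does not round the z-scores to 1.60 and 1.35 the way the table lookup does, the tail chances come out a little different from the hand answers (roughly 5.2 and 8.5 percent).

```r
expected_gpa <- 2.35                      # point forecast for SAT = 450
rmse         <- 0.5 * sqrt(1 - 0.6^2)     # SD(GPA) * sqrt(1 - r^2), about 0.4

# (b) chance of a GPA above 3.0: upper tail of a normal curve centered on the forecast
p_above <- pnorm(3.0, mean = expected_gpa, sd = rmse, lower.tail = FALSE)

# (c) chance of a GPA below 1.8: lower tail
p_below <- pnorm(1.8, mean = expected_gpa, sd = rmse)

# (d) 90 percent interval: qnorm(0.95) leaves 5 percent in each tail
z90      <- qnorm(0.95)
interval <- expected_gpa + c(-1, 1) * z90 * rmse

print(round(100 * c(p_above, p_below), 2))   # chances in percent
print(round(interval, 2))                    # lower and upper ends of the interval
```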

The formula will give us the slope of a line from the point of means to any point on the regression line; that is, it will give us the slope coefficient b in the formula:

    E[Y | X] = a + b X

(read the first part as "the expected value of Y GIVEN X").

If we apply this to our data, we will have

    b = 0.6 * (0.5 / 100) = 0.3 / 100 = 0.003

What about the intercept coefficient a? To find that, remember that the regression line passes through the point of means (X = 500, Y = 2.5), so we can write:

    Mean(GPA) = a + 0.003 * Mean(SAT)
    2.5 = a + 0.003 * (500)
    2.5 = a + 1.5, or a = 1.0

Hence our regression equation is:

    E[Y | X] = 1.0 + 0.003 X

which works for the prediction that we just made:

    E[Y | X = 450] = 1.0 + 0.003 * (450) = 1.0 + 1.35 = 2.35

and works for any other value of X.

What is the predicted GPA of a student who scored 600 on the SAT?

Answer:

    E[Y | X = 600] = 1.0 + 0.003 * (600) = 1.0 + 1.8 = 2.8
    90 percent confidence interval = 2.8 +/- 1.65 * (0.4) = 2.8 +/- 0.66 = 2.14 to 3.46

You could also work this out by noting that he scored 1 SD above the average SAT, and would be expected to score 0.6 SDs above the average GPA; since the SD of GPAs is 0.5, this translates into 0.3 points above the average of 2.5. But having a regression equation makes it easy to make a lot of forecasts with less work.
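As a rough sketch, the slope and intercept can be computed straight from the five summary statistics in R. The function name predict_gpa is hypothetical, chosen only for this illustration.

```r
mean_sat <- 500
sd_sat   <- 100
mean_gpa <- 2.5
sd_gpa   <- 0.5
r        <- 0.6

slope     <- r * sd_gpa / sd_sat            # b = r * SDy / SDx
intercept <- mean_gpa - slope * mean_sat    # line passes through the point of means

# Hypothetical helper: expected GPA for one or more SAT scores
predict_gpa <- function(sat) intercept + slope * sat

forecasts <- predict_gpa(c(450, 600))
print(c(slope = slope, intercept = intercept))
print(forecasts)
```

Because predict_gpa accepts a vector, a whole list of SAT scores can be forecast in one call.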

SIMULATION and Computer Output

The actual scores for the above problem could be simulated in R with the command:

    scatter(.6, 5000, 500, 2.5, 100, 0.5)

(where we create 5000 observations with a correlation of 0.6, with the means being 500 for the X variable and 2.5 for the Y variable, and with the S.D. of X being 100 and the S.D. of Y being 0.5).

Since the numbers are being randomly generated, the actual means, SDs and correlation may differ slightly -- for the data set I generated, I had means of 500.98 and 2.499, SDs of 98.69 and 0.4991, and a correlation coefficient of 0.6034. The plot, with added regression line, looks like this:
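The scatter() command is not part of base R, and its definition is not shown in this handout. The sketch below is therefore NOT that function. It is a hypothetical stand-in, simulate_scores, that takes the same six arguments in the same order and builds correlated normal data from two independent sets of standard normal draws. The seed is arbitrary, so the sample statistics will not match the ones quoted above.

```r
# Hypothetical stand-in for scatter(); an assumption, not the original function.
simulate_scores <- function(r, n, mean_x, mean_y, sd_x, sd_y) {
  z_x <- rnorm(n)                               # standardized X
  z_y <- r * z_x + sqrt(1 - r^2) * rnorm(n)     # standardized Y, correlation r with X
  data.frame(x = mean_x + sd_x * z_x,
             y = mean_y + sd_y * z_y)
}

set.seed(1040)                                  # arbitrary seed, for reproducibility
scores <- simulate_scores(0.6, 5000, 500, 2.5, 100, 0.5)
SAT <- scores$x
GPA <- scores$y

# Sample statistics will be close to, but not exactly, the targets
print(c(mean(SAT), mean(GPA), sd(SAT), sd(GPA), cor(SAT, GPA)))
```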

The plot was generated with the command:

    > plot(SAT, GPA, col="blue", pch=".")

The regression line is the one developed earlier:

    E[GPA | SAT] = 1.0 + 0.003 SAT

but the fact that the randomly generated data have slightly different means, SDs and correlation will result in a computer estimate that is also slightly different.

To run a regression in R, we create a linear model with the lm command and assign the output to a variable:

    > model.GPA <- lm(GPA ~ SAT)

Getting the coefficients is simple in R. Typing the name of the variable:

    > model.GPA

results in the computer returning:

    Coefficients:
    (Intercept)          SAT
       0.970531     0.003052

with the user being expected to fill in the equation:

    E[GPA | SAT] = 0.970531 + 0.003052 SAT

So our forecast if the SAT score is 600 is for a GPA of 0.970531 + 0.003052 * 600 = 2.801731.

To add the regression line to the plot, we use the command:

    > abline(model.GPA, col="red", lwd=3)

Most programs provide a lot more information immediately; for R you have to ask:

    > summary(model.GPA)

The full output of the command is given below. The most important other number to read is the Residual standard error, the phrase in the R language which means root mean square error (in the textbook language) or the standard error of the regression (in the language of many other texts and programs). This is of course what enables us to place a confidence interval around any point forecast.

For example, in the case of a SAT score of 600, the exact calculation for this data set would have a 75 percent confidence interval of:

    2.801731 +/- 1.15 * 0.3981 = 2.3439 to 3.2595

(Do you see where the 1.15 is coming from? Look up a central area of 75 percent or so.)

    Call:
    lm(formula = GPA ~ SAT)

    Residuals:
          Min        1Q    Median        3Q       Max
    -1.416397 -0.275427 -0.009415  0.265172  1.498115

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept) 9.705e-01  2.913e-02   33.32   <2e-16
    SAT         3.052e-03  5.704e-05   53.50   <2e-16

    Residual standard error: 0.3981 on 4998 degrees of freedom
    Multiple R-squared: 0.3641, Adjusted R-squared: 0.364
    F-statistic: 2862 on 1 and 4998 DF, p-value: < 2.2e-16

Note that the coefficients are in scientific notation:

    the estimate of the intercept is 9.705 * 10^-1 or 0.9705, as in the previous output
    the estimate of the slope is 3.052 * 10^-3 or 0.003052, as in the previous output

The information on the residuals is helpful in getting rough ideas of the magnitude of the mistakes. It is probably more helpful to draw a histogram or density plot of the residuals.

The next most important number on the output is the residual standard error, which other regression packages call the root mean square error or RMSE (as our text does) or the standard error of the regression. In any case, it is the one to look at in constructing a confidence interval for the forecast.
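The pieces needed for a forecast interval can be pulled out of the fitted model rather than copied from the printout. The sketch below regenerates its own simulated data with an arbitrary seed (an assumption, so the numbers will be close to but not identical to the output shown), then builds a 75 percent interval for a SAT score of 600.

```r
# Simulated stand-in data; the seed and construction are assumptions
set.seed(1040)
n   <- 5000
z_x <- rnorm(n)
z_y <- 0.6 * z_x + sqrt(1 - 0.6^2) * rnorm(n)
SAT <- 500 + 100 * z_x
GPA <- 2.5 + 0.5 * z_y

model.GPA <- lm(GPA ~ SAT)
b   <- coef(model.GPA)               # named vector: (Intercept), SAT
rse <- summary(model.GPA)$sigma      # the "Residual standard error" line

point <- unname(b["(Intercept)"] + b["SAT"] * 600)   # point forecast at SAT = 600
mult  <- qnorm(0.875)                # 75 percent in the middle leaves 12.5 percent per tail
interval75 <- point + c(-1, 1) * mult * rse

print(round(mult, 2))
print(round(c(point, interval75), 4))
```

The multiplier from qnorm(0.875) is where the 1.15 in a 75 percent interval comes from.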
[Note: in more advanced courses, you will learn that the standard error of the forecast will be a bit larger than the SE of the regression, even if you are trying to predict within the range of the data which went into the regression, and possibly a lot larger if you are trying to predict outside that range. You should treat the RMSE as a minimum margin for error]
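To see the point of this note in practice, R's predict() can return a prediction interval that accounts for the extra uncertainty. The sketch below is an illustration under assumptions of my own: it uses a deliberately small simulated sample (30 students, arbitrary seed) so that the difference is visible, and compares the width of the rough RMSE-based interval with the width of the interval from predict() at a SAT score inside the data (500) and one far outside it (900).

```r
# Small simulated sample; the size and seed are assumptions for illustration
set.seed(1040)
n   <- 30
z_x <- rnorm(n)
z_y <- 0.6 * z_x + sqrt(1 - 0.6^2) * rnorm(n)
SAT <- 500 + 100 * z_x
GPA <- 2.5 + 0.5 * z_y

model.small <- lm(GPA ~ SAT)
rse <- summary(model.small)$sigma

# Rough 90 percent interval width: 2 * z * RMSE, the same at every SAT score
rough_width <- 2 * qnorm(0.95) * rse

# Prediction intervals from R, inside and outside the range of the data
new_students <- data.frame(SAT = c(500, 900))
exact <- predict(model.small, newdata = new_students,
                 interval = "prediction", level = 0.90)
exact_width <- exact[, "upr"] - exact[, "lwr"]

print(rough_width)
print(exact_width)   # both wider than the rough width; widest at SAT = 900
```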

The R-squared is the square of the correlation coefficient.

    > cor(SAT, GPA)            = .6034208
    > .6034208 * .6034208      = 0.3641167

The t-statistics are test statistics for the hypothesis that the true values of the coefficients are equal to zero; the important one is the test of the slope coefficient, since if the regression line has a slope of zero, the regression would explain nothing.

The t-statistic is a slight modification of the Z statistic or standardized normal score. The modification is important when there are only a very few observations, but the difference is minimal when we have more than 20 or 30 observations.

What the t-statistic on the slope coefficient is saying is that the standardized distance of the coefficient from zero is

    0.003052 / 0.00005704 = 53.50

You will have a hard time finding this number in any normal table; the chance of something being 53.5 standard deviations away from its mean is for all practical purposes zero. Note that the text table goes up only to 4.45 standard deviations; the probability of an observation outside the range of Mean +/- 4.45 SDs is 100 - 99.9991 = 0.0009 percent, or less than one-thousandth of one percent.

A standard rule of thumb in the interpretation of regressions is that t-statistics larger than 2 indicate a significant coefficient. If the t-statistic is less than 1, the coefficient may not be significantly different than zero.
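Both relationships can be confirmed from a fitted model. The sketch below again uses simulated stand-in data with an arbitrary seed (an assumption), so the particular values differ from the printout, but the identities hold for any simple regression.

```r
# Simulated stand-in data; the seed and construction are assumptions
set.seed(1040)
n   <- 5000
z_x <- rnorm(n)
z_y <- 0.6 * z_x + sqrt(1 - 0.6^2) * rnorm(n)
SAT <- 500 + 100 * z_x
GPA <- 2.5 + 0.5 * z_y

fit   <- summary(lm(GPA ~ SAT))
coefs <- fit$coefficients            # matrix: Estimate, Std. Error, t value, Pr(>|t|)

# t value = estimate divided by its standard error
t_slope <- coefs["SAT", "Estimate"] / coefs["SAT", "Std. Error"]
print(all.equal(t_slope, coefs["SAT", "t value"]))

# R-squared = square of the correlation coefficient
print(all.equal(cor(SAT, GPA)^2, fit$r.squared))
```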

A visual guide to the standard error of the regression


For this purpose, I'm using monthly data on the BAA corporate bond rate and the Federal Funds (overnight bank lending) rate from 1960 through 2010. The relation is important because the Federal Reserve targets the Federal Funds rate when it makes an interest rate decision, but really wants to influence the rates at which companies borrow, such as the BAA rate.

Basic statistics can be found by the commands:

    mean(BAA)
    sd(BAA)
    cor(FEDFUNDS, BAA)

                MEAN      SD
    BAA         8.6230    2.7434
    FEDFUNDS    5.7585    3.4329

    Correlation = 0.790756

A regression can be run and assigned to the variable model.BAA (this is necessary if you want to ask for a full summary and to keep the regression results around without running it again):

    model.BAA <- lm(BAA ~ FEDFUNDS)
    summary(model.BAA)

gives the table of coefficients (abbreviated display below) as:

                   Estimate    Std. Error    t value
    (Intercept)    4.97740     0.13303       37.42
    FEDFUNDS       0.63309     0.01984       31.91

    Residual standard error = 1.685
    R-squared = .6253

You should of course be able to:

a. extract the equation
b. explain whether the coefficient on FEDFUNDS is significantly different than zero, and what the importance of this is
c. check the relation of the R-squared of the regression to the correlation coefficient
d. make a point prediction for the BAA rate if the Fed Funds rate is set at 8 percent
e. place a 90 percent confidence interval around that point prediction

Try all this before turning the page.

Answers:

a. The equation is E[BAA | FEDFUNDS] = 4.97740 + .63309 * FEDFUNDS

b. The coefficient on Fed Funds is significantly different than zero -- the standard error is only 0.01984, so the t value is .63309 / .01984 = 31.91. If zero were the true value of the coefficient, our estimate would be 31.91 standard error units away from it; the probability of being 31.91 standard error units away from zero is so low that there is no table value for it. You could tease the probability from R by asking for pnorm(-31.91), which will give you the tail area to the left of -31.91. It is 9.7e-224, which is scientific notation for 9.7 * 10^-224 -- as close to zero as you can get.

c. R-squared = .6253, cor(FEDFUNDS, BAA) = .7907564 and .7907564 * .7907564 = .6252957

d. Use the equation in (a) to make a point prediction:

    E[BAA | FEDFUNDS = 8] = 4.97740 + .63309 * 8 = 4.97740 + 5.06472 = 10.04212

e. Place a 90 percent confidence interval around the prediction. The 90 percent confidence interval is given by

    Point prediction +/- 1.65 * Residual standard error

with the 1.65 coming from the tables -- look for a central area of 90 percent. The computer could be used to find the exact multiplier: the qnorm(.95) command (normal quantile) gives the z-score of the point which cuts off an area of 95 percent to the left of that point, which is what you want in order to find an area of 90 percent in the center. You would find qnorm(.95) = 1.644854.

    10.04212 +/- 1.65 * 1.685 = 10.04212 +/- 2.78025 = from 7.26187 to 12.82237

One way to get a sense of where the residual standard error (RMSE) is coming from is to take a slice through the plotted data, say Fed Funds rates from 7.25 to 8.75 percent, and look at the variation of BAA within that narrow slice. The regression told us it should be about 1.685. We can do this by:

a. finding which values of FEDFUNDS are between 7.25 and 8.75:

    idx <- which((FEDFUNDS > 7.25) & (FEDFUNDS < 8.75))

b. defining new variables by selecting subsets of FEDFUNDS and BAA by that index:

    fedfunds <- FEDFUNDS[idx]
    baa <- BAA[idx]

(note that R is case sensitive, so baa is not the same variable as BAA)

You will find that sd(baa) = 1.678656. This isn't quite the 1.685 for the residual standard error, but the residual standard error takes all the residual points into account, not just the slice we concentrated on.

The plots on the next page may help you see this. The first graph shows the regression line E[BAA | FEDFUNDS] = 4.9774 + .6331 * FEDFUNDS and adds two vertical lines at 7.25 and 8.75. The second graph focuses in on the slice of data contained inside those two green lines.

Note that the regression line has a value of about 10 at fedfunds = 8.0, but it is possible to find data points in the slice about 3 points above or almost 3 points below the line, as we might expect if the standard deviation were 1.5 or so. If we calculate the SD of the baa variable (the values of BAA within the slice) we find it to be 1.679, very close to the whole regression's residual standard error of 1.685.
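The bond data are not reproduced in this handout, so the slice idea cannot be rerun here on BAA and FEDFUNDS. As a rough illustration only, the sketch below applies the same which() technique to simulated SAT and GPA data (arbitrary seed; the slice from 550 to 650 is my own choice) and compares the SD of GPA inside the slice with the residual standard error of the whole regression.

```r
# Simulated stand-in data, NOT the bond data; the seed and slice are assumptions
set.seed(1040)
n   <- 5000
z_x <- rnorm(n)
z_y <- 0.6 * z_x + sqrt(1 - 0.6^2) * rnorm(n)
SAT <- 500 + 100 * z_x
GPA <- 2.5 + 0.5 * z_y

rse <- summary(lm(GPA ~ SAT))$sigma      # residual standard error, whole data set

# Take a narrow vertical slice of the scatter plot
idx <- which((SAT > 550) & (SAT < 650))
gpa <- GPA[idx]                          # lower case: GPA values inside the slice

sd_slice <- sd(gpa)
print(c(points_in_slice = length(idx), sd_slice = sd_slice, rse = rse))
```

The slice SD should land near the residual standard error, usually a little above it, because the regression line still rises slightly across the width of the slice.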
