
Abstract

These notes cover the third chapter of Introduction to Statistical Learning, which concerns the use of
linear regression. They also include remarks, applications, and proofs which are not included in the book.

Notes 2: Linear Regression


Chapters one and two presented our most general problem as determining a function of the predictor
variables that predicts the outcome of the response variable:

Y = f(X) + \epsilon

In this chapter we discuss one of the simplest and most commonly used strategies, which is to fit a line of
best fit through a set of data in order to obtain a linear model:

Y = \beta_0 + \beta_1 X    (1)

This is referred to as regressing Y onto X. The \beta_n represent the parameters of the linear model; \hat{\beta}_n is
used to denote the coefficient estimates derived from a particular set of data.
The simplest case involves a set of paired values; however, we can use several predictor variables if we
choose to.
The quality of the fit is measured using the residual sum of squares:

RSS = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2    (2)

Theorem 0.1. The RSS functional above is minimized when the coefficient terms are taken as:

\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}    (3)

\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}    (4)

Where:

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i

Proof 0.1. The theorem is proved by taking the partial derivative of RSS with respect to both \hat{\beta}_0 and \hat{\beta}_1:

\frac{\partial RSS}{\partial \hat{\beta}_0} = -2 \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)

Setting the derivative equal to zero and solving for \hat{\beta}_0:

\hat{\beta}_0 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{\beta}_1 x_i) = \bar{y} - \hat{\beta}_1 \bar{x}

Finding \hat{\beta}_1 involves proceeding in a similar fashion:

\frac{\partial RSS}{\partial \hat{\beta}_1} = -2 \sum_{i=1}^{n} x_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)

Setting this equal to zero:

\sum_{i=1}^{n} (y_i x_i - \hat{\beta}_0 x_i - \hat{\beta}_1 x_i^2) = 0

\hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} (y_i x_i - \hat{\beta}_0 x_i)

\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (y_i x_i - \hat{\beta}_0 x_i)}{\sum_{i=1}^{n} x_i^2}

Substituting \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} and rearranging gives the form stated in the theorem:

\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
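As a quick check of equations (3) and (4), the closed-form estimates can be computed directly and compared with R's lm() function. The following is a minimal sketch using simulated data; the true intercept and slope (2 and 3) are my own choices, not from the text:

# Simulated example: y = 2 + 3x + noise
set.seed(1)
x = rnorm(100)
y = 2 + 3 * x + rnorm(100)
# Closed-form estimates from equations (3) and (4)
beta1.hat = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0.hat = mean(y) - beta1.hat * mean(x)
c(beta0.hat, beta1.hat)   # should match the coefficients reported by lm()
coef(lm(y ~ x))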

An important thing to keep in mind when discussing linear regression is the significance of the particular
data set we use. While we might suppose a platonic linear model which explains the data, the sample of data
we use will change the model we recover, so that it does not exactly match this platonic relationship.
Population regression line - the best approximation to the true linear relationship between the response
and predictor variables.

Consider the case where we generate 100 different sets of twenty data points: we would get 100 slightly
different regression lines. However, as we use more and more data, these estimates of what the relationship
actually is will get closer and closer to the truth.
This general concept also holds for other descriptions of the data which we might use. For instance,
we don't have access to the population mean \mu of Y, but we do have the sample mean:

\hat{\mu} = \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i

As we increase the number of samples from which the sample mean is extracted, the value of the sample
mean will approach \mu. This is a quality known as unbiasedness. An important question, though, is how well
this sample mean estimates the actual mean: given that we have an estimate \hat{\mu}, how likely is it to be a
substantial over/under-estimation of \mu? To answer this, we define the standard error of \hat{\mu}:
Var(\hat{\mu}) = SE(\hat{\mu})^2 = \frac{\sigma^2}{n}    (5)

In this case, \sigma is defined as the standard deviation of each y_i. Var(\hat{\mu}) gives us the average amount that
\hat{\mu} differs from the actual value of \mu. It decreases as n increases. This stands to reason if we assume that the
data is unbiased.
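As a sanity check on equation (5), we can simulate many samples and verify that the standard deviation of the resulting sample means is roughly \sigma/\sqrt{n}. This is a small sketch of my own, not from the text; the values of \sigma, n, and the mean are arbitrary:

set.seed(1)
sigma = 2
n = 50
# Draw 10,000 samples of size n and record each sample mean
sample.means = replicate(10000, mean(rnorm(n, mean = 5, sd = sigma)))
sd(sample.means)    # empirical standard error of the sample mean
sigma / sqrt(n)     # theoretical value from equation (5)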
We can also estimate how close the coefficients \hat{\beta}_0 and \hat{\beta}_1 are to the true values using the same principle.
In this case, the standard errors are calculated as:

SE(\hat{\beta}_0)^2 = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right], \qquad SE(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}    (6)
In this case, \sigma^2 = Var(\epsilon). In general, it is not known from the data, but we can estimate it by finding the
residual standard error:

RSE = \sqrt{\frac{RSS}{n-2}}    (7)
The main significance of the standard errors is that they can be used to compute a range that, with a given
probability, contains the true unknown value of the parameter. This is referred to as a confidence
interval. For instance, a 95% confidence interval is a range of values such that, with 95% probability, the
range will include the true value of the parameter.
For instance, the 95% confidence interval for \beta_1 is approximately:

\hat{\beta}_1 \pm 2 \cdot SE(\hat{\beta}_1)    (8)

Similarly, the 95% confidence interval for \beta_0 is:

\hat{\beta}_0 \pm 2 \cdot SE(\hat{\beta}_0)    (9)
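In R, confint() reports these intervals for a fitted model. A minimal sketch using the Boston data set from the chapter's lab (introduced later in these notes); confint() uses the exact t-quantile, so its output differs slightly from the \pm 2 \cdot SE approximation of (8) and (9):

library(MASS)                   # provides the Boston data set
lm.fit = lm(medv ~ lstat, data = Boston)
confint(lm.fit, level = 0.95)   # exact 95% intervals for the intercept and slope
# Approximate intervals from equations (8) and (9)
est = coef(summary(lm.fit))     # columns: Estimate, Std. Error, t value, Pr(>|t|)
est[, "Estimate"] - 2 * est[, "Std. Error"]
est[, "Estimate"] + 2 * est[, "Std. Error"]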

0.1  Hypothesis Testing

The confidence intervals and standard errors are used to perform hypothesis tests on the coefficients of a model.
These hypothesis tests take the following form:
The null hypothesis H_0 states that there is no relationship between X and Y.
The alternative hypothesis H_1 states that there is some relationship between X and Y.
Under these assumptions, the null hypothesis presumes that \beta_1 = 0, and the alternative that \beta_1 \neq 0.
Whether a coefficient meets the requirements of the hypothesis test depends on the level of confidence we
desire (90%? 99%?) and on the size of SE(\hat{\beta}_0) or SE(\hat{\beta}_1). We determine whether it meets a
particular confidence level using the t-statistic, which measures the number of standard deviations \hat{\beta}_1 is
away from 0:
t = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)}    (10)
Furthermore, the probability that we would get a value at least as large as |t|, assuming \beta_1 = 0, is called the p-value.
A small p-value indicates that it is unlikely that sampling error (dumb luck) would produce the observed association
between the predictor and response variables. If our p-value is small enough, we can reject the null hypothesis.
The alternative is failing to reject the null hypothesis.
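The coefficient table printed by summary() carries out exactly this test for each coefficient. A minimal sketch, again using the Boston fit from the chapter's lab:

library(MASS)
lm.fit = lm(medv ~ lstat, data = Boston)
summary(lm.fit)$coefficients
# Each row gives the estimate, its standard error, the t-statistic from (10),
# and the p-value; a tiny p-value for lstat lets us reject H0: beta1 = 0.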

0.2  Evaluating Model Fit

Two main statistics are used to assess the fit of a model when we are dealing with one predictor variable. These are the
RSE and the R^2 statistic.

The RSE is an estimate of the standard deviation of \epsilon. It is the average amount that the response variable
deviates from the true regression line. It is calculated as:

RSE = \sqrt{\frac{RSS}{n-2}} = \sqrt{\frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}    (11)
It is considered a measure of the lack of fit of the model to the data. If the predictions are very close to the
observed values, then (11) will be very small. Also note that a single observation that is very far off can make
the RSE large; it does not have to be all of the observations that are off.

R^2 is an alternative to the RSE because it allows us to get around the fact that the RSE is measured in the units
of Y. As a result, we don't always know what counts as a large value of the RSE. R^2 fixes this by being
calculated as a proportion:

R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}    (12)

where:

TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2    (13)
R^2 has the advantage of always lying between 0 and 1. RSS measures the amount of variation which the
regression model fails to explain, while TSS measures the total amount of variance in the response Y.
While R^2 has this scaling advantage, it can still be difficult to determine whether a given value indicates a
good fit. It tends to depend on the subject: a good R^2 in a social science would be considered too small for
something like physics.
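Both statistics are reported by summary() for a fitted model. A short sketch, assuming the Boston fit used in the lab later in these notes:

library(MASS)
lm.fit = lm(medv ~ lstat, data = Boston)
summary(lm.fit)$sigma        # the RSE of equation (11)
summary(lm.fit)$r.squared    # the R^2 of equation (12)
sqrt(sum(residuals(lm.fit)^2) / (nrow(Boston) - 2))   # (11) computed by hand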

Correlation is another method of measuring the fit. It is the measure of the linear relationship between
X and Y. The correlation is defined as:
Cor(X, Y) = r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}    (14)
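For simple regression with a single predictor, the squared correlation equals R^2, so cor() provides a quick check of the fit. A minimal sketch using the Boston data from the lab:

library(MASS)
r = cor(Boston$lstat, Boston$medv)                   # equation (14)
r^2                                                  # equals R^2 in simple regression
summary(lm(medv ~ lstat, data = Boston))$r.squared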

0.3  Pitfalls of Regression Testing

There is a standard set of problems which may arise when doing regression testing which must be kept in
mind. These are less related to the model fit discussed above. Rather, the presence of any of the following
may indicate problems with the assumptions on which the model is based. They include:

Non-linearity of the Response-Predictor Relationship indicates that our assumption about the form of
the model is faulty. For instance, the relationship may not be linear, but may instead be of the form:

Y = \beta_0 + \beta_1 \log(X)
How do we test for non-linearity?
To identify non-linearity, the text recommends plotting the residuals and then plotting a fitted line to the
trend shown by them. If the trend is linear, then the error terms should be more or less uniformly distributed
and there should be no pattern to the errors indicating a better fit in some areas than others.
To demonstrate, I loaded the Boston dataset discussed at the end of the chapter and then used linear
regression in order to predict medv given the predictor variable lstat:
library(MASS)    # provides the Boston data set
attach(Boston)
lm.fit = lm(medv ~ lstat)
plot(lm.fit)     # diagnostic plots, including residuals vs. fitted values

This can be improved by simply switching over the linear factor to a logarithmic one:

lm.fit = lm(medv ~ log(lstat))
plot(lm.fit)

This mostly compensates for the nonlinearity of the data.

Correlation of Error Terms often comes up when working with time-series data sets. Regression
assumes that the error terms are uncorrelated with one another. Often the error at one data point will
influence the error in the next observation for the series. Correlation in the error terms can cause us to be
too confident in the model because:
1. Confidence intervals will be narrower than they should be.
2. p-values may be lower than they should be.
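The text does not give code for this check; one quick sketch of my own is to look at the autocorrelation of the residuals, which should show no significant lags if the errors are uncorrelated:

library(MASS)
lm.fit = lm(medv ~ lstat, data = Boston)
# Autocorrelation of the residuals in observation order; correlated errors
# would show up as bars well outside the dashed confidence band
acf(residuals(lm.fit), main = "Autocorrelation of residuals")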

Non-constant variance in the error term occurs when the error term increases with the size
of the response variable. One way to identify non-constant variance in the error (heteroscedasticity) is to
plot the residuals for the regression model. If the residuals have a funnel shape, then the error term has
non-constant variance.
library(zoo)
plot(lm.fit$residuals)
# Rolling mean and standard deviation of the residuals over a window of 50 observations
seriesaverages = rollapply(lm.fit$residuals, width = 50, FUN = mean)
seriessd = rollapply(lm.fit$residuals, width = 50, FUN = sd)
lines(seriesaverages, col = "red", lwd = 2)
lines(seriesaverages + seriessd, col = "blue", lwd = 1)
lines(seriesaverages - seriessd, col = "blue", lwd = 1)
In this case we see that this example produces residuals with roughly constant variance.

Outliers occur when the value of an observation deviates greatly from what the model predicts. Unlike
the issues of fit mentioned earlier, this does not necessarily condemn the model. Instead, it may mean that
the data includes mistakes, or that there is some underlying cause in the phenomenon which lies behind
the deviation. However, outliers do have to be dealt with, since a small number of large outliers can undermine an
otherwise accurate model, for instance by inflating the RSE or lowering the R^2.
The identification of outliers can generally be performed by plotting the residuals and noting especially
large elements. If we want a more systematic way of determining the outliers for a data set, we can instead
calculate the studentized residuals by taking a residual e_i and dividing it by its standard error:
\tilde{e}_i = \frac{e_i}{SE(e_i)}    (15)
The text states that an absolute value greater than 3 indicates that the point is an outlier.
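R computes studentized residuals with rstudent(), as in the chapter's lab. A short sketch that flags points whose absolute studentized residual exceeds 3, per the rule of thumb above:

library(MASS)
lm.fit = lm(medv ~ lstat, data = Boston)
stud.res = rstudent(lm.fit)                  # studentized residuals, equation (15)
plot(stud.res, ylab = "Studentized residuals")
abline(h = c(-3, 3), lty = 2)
which(abs(stud.res) > 3)                     # indices of potential outliers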

High leverage is similar to the aforementioned outlier issue, except that it occurs when the predictor
value is deviant instead of the response value. A point with high leverage is one with an unusual predictor
value, which can pull the regression line toward it. In fact, a point with high leverage can have as much of an
impact on the quality of the regression line as an outlier.
To make matters worse, a high leverage point can be almost impossible to spot when multiple linear
regression is used. In this case the leverage of the point may be spread over several different variables, or it
may not be possible to identify it by eyeballing the graph at all.
In this case, we can calculate the leverage statistic for the observation. This statistic measures how unusual
a point's predictor value is, and hence how much influence it can exert on the fit:
Lev(x_j) = \frac{1}{n} + \frac{(x_j - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}    (16)
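R reports the leverage statistic for every observation via hatvalues(). A short sketch, assuming the Boston fit from the lab:

library(MASS)
lm.fit = lm(medv ~ lstat, data = Boston)
lev = hatvalues(lm.fit)       # leverage statistic of equation (16)
plot(lev, ylab = "Leverage")
which.max(lev)                # observation with the highest leverage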

Collinearity occurs when two predictor variables are related to one another. One example of this comes
from Frank Torchio's paper on the impact of liquidity on the expected return of a business, in which he
investigates the collinearity between the size of the business and its liquidity.
The effect of collinearity is that it causes the standard errors of the coefficient terms to grow. As a result,
when we calculate the t-statistic by dividing a coefficient term by its standard error, we may fail to reject
the null hypothesis.
One simple way to identify collinearity between variables is to create a correlation matrix of the
predictors. A high absolute value for a correlation can indicate collinearity. However, we can have the same
problem here that we had with leverage, in that several variables may be collinear (multicollinear) in a subtle
way which isn't detectable from pairwise correlations.
A numeric way to identify multicollinearity is to calculate the variance inflation factor, the ratio of the
variance of a coefficient in the full multiple regression to its variance in a regression containing only that variable:

VIF(\hat{\beta}_j) = \frac{1}{1 - R^2_{X_j | X_{-j}}}    (17)

where R^2_{X_j | X_{-j}} represents the R^2 resulting from a regression of X_j onto all of the other predictor variables. If
this value is close to one, then collinearity is present and the VIF will be large.
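As in the chapter's lab, the vif() function from the car package computes this directly from a fitted multiple regression. A minimal sketch; the particular Boston predictors below are my own choice, for illustration:

library(MASS)
library(car)                                    # provides vif()
lm.fit = lm(medv ~ lstat + age + tax, data = Boston)
vif(lm.fit)                                     # large values flag collinearity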

1  Multiple Regression

If we are concerned with more than one predictor variable, we encounter the issue of how to construct a
model. A simple way to go about it would be to regress the response variable onto each predictor variable
separately. The problem with this is that if the predictor variables are correlated, each regression model will
capture some of the same aspects of the other variables. For instance, the example in the textbook uses
television, radio, and newspaper advertising as predictor variables; these variables all share the similar aspect
of being part of media advertising in general.
Instead, we use multiple linear regression, which uses n predictor variables simultaneously:
Y = \beta_0 + \sum_{i=1}^{n} \beta_i X_i + \epsilon    (18)
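In R, fitting such a model just means adding more terms to the lm() formula (or using "." for all predictors), echoing the chapter's lab; the choice of Boston predictors below is mine, for illustration:

library(MASS)
lm.fit = lm(medv ~ lstat + age, data = Boston)   # two predictors
summary(lm.fit)
lm.fit.all = lm(medv ~ ., data = Boston)         # all predictors at once
summary(lm.fit.all)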

1.1  General Measures for Model Effectiveness
