Regression analysis is probably the most used tool in statistics. Regression deals with
modeling how one variable (called a response) is related to one or more other variables
(called predictors or regressors). Before introducing regression models involving two
or more variables, we first return to the very simple model introduced in Chapter 1
to set up the basic ideas and notation.
1 A Simple Model
Consider once again the fill-weights in the cup-a-soup example. For sake of illustration,
consider the first 10 observations from the data set:
236.93, 237.39, 239.67, 236.39, 239.67, 237.26, 239.27, 237.46, 239.08, 237.83.
Note that although the filling machine is set to fill each cup to a specified weight, the
actual weights vary from cup to cup. Let y_1, y_2, . . . , y_n denote the fill-weights for our
sample (i.e. y_1 = 236.93, y_2 = 237.39, etc., and n = 10). The model we introduced in
Chapter 1 that incorporates the variability is

y_i = μ + ε_i    (1)

where ε_i is a random error representing the deviation of the ith fill-weight from the
average fill-weight of all cups (μ). Equation (1) is a very simple example of a statistical
model. It involves a random component (ε_i) and a deterministic component (μ). The
population mean μ is a parameter of the model and the other parameter in (1) is the
variance of the random error, which we shall denote by σ² (sigma-squared).
Let us now consider the problem of estimating the population mean μ in (1). The
technique we will use for (1) is called least-squares and it is easy to generalize to more
complicated regression models. A natural and intuitive way of estimating the true
value of the population mean is to simply take the average of the measurements:

ȳ = (1/n) Σ_{i=1}^n y_i.

Why should we use ȳ to estimate μ? There are many reasons why ȳ is a good estimator
of μ, but the reason we shall focus on is that ȳ is the best estimator of μ in terms of
having the smallest mean squared error. That is, given the 10 measurements above,
we can ask: which value of μ makes the sum of squared deviations

Σ_{i=1}^n (y_i − μ)²    (2)
Chapter 5: Regression Models 119
the smallest? That is, what is the least-squares estimator of μ? The answer to this
question can be found by doing some simple calculus. Consider the following function
of μ:

f(μ) = Σ_{i=1}^n (y_i − μ)².

From calculus, we know that to find the extrema of a function, we can take the
derivative of the function, set it equal to zero, and solve for the argument of the
function. Thus,

(d/dμ) f(μ) = −2 Σ_{i=1}^n (y_i − μ) = 0,

which gives μ̂ = ȳ.
(One can check that the 2nd derivative of this function is positive so that setting the
first derivative to zero determines a value of μ that minimizes the sum of squares.)
The hat notation (i.e. μ̂) is used to denote an estimator of a parameter. This is
a standard notational practice in statistics. Thus, we use μ̂ = ȳ to estimate the
unknown population mean μ. Note that μ̂ is not the true value of μ but simply an
estimator based on 10 data points.
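As a quick numerical check (an illustrative Python/NumPy sketch, not part of the original text), the least-squares estimate of μ for the ten fill-weights is simply their average:

```python
import numpy as np

# The ten cup-a-soup fill-weights from the sample
y = np.array([236.93, 237.39, 239.67, 236.39, 239.67,
              237.26, 239.27, 237.46, 239.08, 237.83])

# The least-squares estimate of mu is the sample mean
mu_hat = y.mean()

# Nudging mu in either direction can only increase f(mu) = sum((y - mu)^2),
# confirming the sample mean minimizes the sum of squared deviations
f = lambda mu: np.sum((y - mu) ** 2)
assert f(mu_hat) <= f(mu_hat + 0.01) and f(mu_hat) <= f(mu_hat - 0.01)

print(round(mu_hat, 3))  # sample mean of the fill-weights
```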
Now we shall re-do the computation using matrix notation. This will seem unnecessarily
complicated, but once we have a solution worked out, we can re-apply it to
many other much more complicated models very easily. Data usually comes to us in
the form of arrays of numbers, typically in computer files. Therefore, a natural and
easy way to handle data (particularly large sets of data) is to use the power of matrix
computations. Take the fill-weight measurements y_1, y_2, . . . , y_n and stack them into
a vector and denote this vector by a boldfaced y:
y = (y_1, y_2, . . . , y_10)ᵀ = (236.93, 237.39, 239.67, 236.39, 239.67, 237.26, 239.27, 237.46, 239.08, 237.83)ᵀ.
Now let X denote a column vector of ones and ε denote the error terms ε_i stacked
into a vector:

X = (1, 1, . . . , 1)ᵀ  and  ε = (ε_1, ε_2, . . . , ε_n)ᵀ.
Then we can re-write our very simple model (1) in matrix/vector form as

y = Xμ + ε,

that is, (y_1, . . . , y_n)ᵀ = μ(1, . . . , 1)ᵀ + (ε_1, . . . , ε_n)ᵀ.
The least-squares criterion is to minimize the sum of squared errors

(y − Xμ)ᵀ(y − Xμ) = yᵀy − 2μXᵀy + μ²XᵀX.

Taking the derivative of this with respect to μ and setting the derivative equal to zero
gives

−2Xᵀy + 2μXᵀX = 0.

Solving for μ gives

μ̂ = (XᵀX)⁻¹Xᵀy.    (4)
The solution given by equation (4) is the least squares solution and this formula
holds for a wide variety of models as we shall see.
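For the simple model (1), where X is a single column of ones, formula (4) reduces to the sample mean. A short numerical check (an illustrative NumPy sketch, not from the original text):

```python
import numpy as np

y = np.array([236.93, 237.39, 239.67, 236.39, 239.67,
              237.26, 239.27, 237.46, 239.08, 237.83])
X = np.ones((len(y), 1))   # design matrix: a single column of ones

# Least-squares solution (4): mu_hat = (X'X)^{-1} X'y
mu_hat = np.linalg.solve(X.T @ X, X.T @ y)[0]

# When X is a column of ones, (X'X)^{-1} X'y is exactly the sample mean
assert np.isclose(mu_hat, y.mean())
```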
It seems reasonable that the miles per gallon of a car is related to the weight of the
car. Our goal is to model the relationship between these two variables. A scatterplot
of the data is shown in Figure 1.
As can be seen from the figure, there appears to be a linear relationship between the
MPG (y) and the weight of the car (x). Heavier cars tend to have lower gas mileage.
A deterministic model for this data is given by

y_i = β_0 + β_1 x_i

where y_i is MPG for the ith car and x_i is the corresponding weight of the car. The
two parameters are β_0, which is the y-intercept, and β_1, which is the slope of the line.
However, this model is inadequate because it forces all the points to lie exactly on
a line. From Figure 1, we clearly see that the points do follow a linear pattern,
but the points do not all fall exactly on a line. Thus, a better model will include
a random component for the error which allows for points to scatter about the line.
The following model is called a simple linear regression model:

y_i = β_0 + β_1 x_i + ε_i.    (5)

The least-squares fit is found by determining the values of β_0 and β_1 that minimize
the sum of squared errors:

Σ_{i=1}^n (y_i − β_0 − β_1 x_i)².
Graphically, this corresponds to finding the line minimizing the sum of squared vertical
differences between the observed MPGs and the corresponding values on the line
as shown in Figure 2.
Returning to our matrix and vector notation, we can write

y = (27.5, 27.2, 34.1, 35.1, 31.8, 22.0)ᵀ

and

X = [1 2.560; 1 2.300; 1 1.975; 1 1.915; 1 2.020; 1 2.815]

(writing the 6×2 design matrix row by row, separated by semicolons), so that the
simple linear regression model can be written as

y = Xβ + ε.    (6)
The least-squares estimator β̂ makes the residual vector r orthogonal to the columns
of X (the normal equations):

Xᵀr = Xᵀ(y − ŷ) = Xᵀ(y − Xβ̂) = 0.

Therefore,

(XᵀX)⁻¹ = [7.9651 −3.4443; −3.4443 1.5212].
Figure 3: The geometry of least-squares. The vector y is projected onto the space
spanned by the columns of the design matrix X, denoted by X1 and X2 in the figure.
The projected value is the vector of fitted values ŷ (denoted by yhat in the figure).
The difference between y and ŷ is the vector of residuals r.
Also,

Xᵀy = [1 1 1 1 1 1; 2.560 2.300 1.975 1.915 2.020 2.815] y = (177.70, 393.69)ᵀ.
So, the least squares estimators of the intercept and slope are:

β̂ = (XᵀX)⁻¹Xᵀy = [7.9651 −3.4443; −3.4443 1.5212] (177.70, 393.69)ᵀ = (59.4180, −13.1622)ᵀ.

From this computation, we find that the least squares estimator of the y-intercept is
β̂_0 = 59.4180, the estimated slope is β̂_1 = −13.1622, and the prediction equation
is given by

ŷ = 59.418 − 13.1622x.
Note that the estimated y-intercept of β̂_0 = 59.418 does not have any meaningful
interpretation in this example. The y-intercept corresponds to the average y value
when x = 0, i.e. the MPG for a car that weighs zero pounds. It makes no sense to
estimate the mileage of a car with a weight of zero. Typically in regression examples
the intercept will not be meaningful unless data is collected for values of x near
zero. Since there is no such thing as cars weighing zero pounds, the intercept has no
meaningful interpretation in this example.
The slope β_1 is generally the parameter of primary interest in a simple linear regression.
The slope represents the average change in the response for a unit change in
the regressor. In the car example, the estimated slope of β̂_1 = −13.1622 indicates
that for each additional thousand pounds of weight of a car we would expect to see
a reduction of about 13 miles per gallon on average.
Multiplying out the matrices in (8) we get the following formulas for the least squares
estimates in simple linear regression:

β̂_0 = ȳ − β̂_1 x̄

β̂_1 = SS_xy / SS_xx

where

SS_xy = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ)

and

SS_xx = Σ_{i=1}^n (x_i − x̄)².
That is, the estimator of the slope is the sample covariance between the x's and y's
divided by the sample variance of the x's. In multiple regression, when there is more
than one regressor variable, the formulas for the least squares estimators become
extremely complicated unless you stick with the matrix notation.
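We can confirm numerically that the scalar formulas agree with the matrix solution. The following sketch (illustrative NumPy, using the six-car data above) computes the slope and intercept both ways:

```python
import numpy as np

wt = np.array([2.560, 2.300, 1.975, 1.915, 2.020, 2.815])  # weight (1000s of lbs)
mpg = np.array([27.5, 27.2, 34.1, 35.1, 31.8, 22.0])

# Scalar formulas for the least-squares slope and intercept
SSxy = np.sum((wt - wt.mean()) * (mpg - mpg.mean()))
SSxx = np.sum((wt - wt.mean()) ** 2)
b1 = SSxy / SSxx
b0 = mpg.mean() - b1 * wt.mean()

# Matrix solution (X'X)^{-1} X'y for comparison
X = np.column_stack([np.ones_like(wt), wt])
beta = np.linalg.solve(X.T @ X, X.T @ mpg)

assert np.allclose([b0, b1], beta)   # the two routes give the same estimates
print(round(b0, 4), round(b1, 4))    # about 59.418 and -13.1622
```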
The matrix notation also allows us to compute quite easily the standard errors of the
least squares estimators as well as the covariance between the estimators. First, let
us show that the least squares estimators are unbiased for the corresponding model
parameters. Before doing so, note that in a designed experiment, the values of the
regressor are typically fixed by the experimenter and therefore are not considered
random. On the other hand, because y_i = β_0 + β_1 x_i + ε_i and ε_i is a random variable,
y_i is also a random variable. Computing, we get

E[β̂] = E[(XᵀX)⁻¹Xᵀy] = (XᵀX)⁻¹XᵀE[y] = (XᵀX)⁻¹XᵀXβ = β

since E[ε] = 0. Therefore, the least squares estimators are unbiased for the population
parameters β.
Many statistical software packages have built-in functions that will perform regression
analysis. We can also use matrix computations directly to reproduce the output
generated above for the car mileage example.
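The following sketch carries out the matrix calculations for the car mileage example (an illustrative Python/NumPy version; the original text used Matlab for this computation):

```python
import numpy as np

wt = np.array([2.560, 2.300, 1.975, 1.915, 2.020, 2.815])
mpg = np.array([27.5, 27.2, 34.1, 35.1, 31.8, 22.0])
n = len(mpg)

X = np.column_stack([np.ones(n), wt])   # design matrix with intercept column
XtX_inv = np.linalg.inv(X.T @ X)        # (X'X)^{-1}
beta = XtX_inv @ X.T @ mpg              # least-squares estimates (intercept, slope)

fitted = X @ beta                       # fitted values y-hat
resid = mpg - fitted                    # residuals r
mse = resid @ resid / (n - 2)           # MS_res, unbiased estimate of sigma^2
cov_beta = mse * XtX_inv                # estimated covariance matrix of beta-hat
se_slope = np.sqrt(cov_beta[1, 1])      # standard error of the slope

print(beta)       # about [59.418, -13.162]
print(mse)        # about 2.346
print(se_slope)   # about 1.889
```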
Cov(β̂) = E[(β̂ − β)(β̂ − β)ᵀ]
= E[((XᵀX)⁻¹Xᵀy − (XᵀX)⁻¹XᵀE[y])((XᵀX)⁻¹Xᵀy − (XᵀX)⁻¹XᵀE[y])ᵀ]
= E[((XᵀX)⁻¹Xᵀε)((XᵀX)⁻¹Xᵀε)ᵀ]
= (XᵀX)⁻¹XᵀE[εεᵀ]X(XᵀX)⁻¹
= (XᵀX)⁻¹Xᵀ(σ²I)X(XᵀX)⁻¹    (where I is the identity matrix)
= σ²(XᵀX)⁻¹.
The main point of this derivation is that the covariance matrix of the least-squares
estimators is

σ²(XᵀX)⁻¹    (9)

where σ² is the variance of the error term in the simple linear regression model.
Formula (9) holds for a wide variety of regression models, including polynomial regression,
analysis of variance, and analysis of covariance. The only assumption needed
for (9) to hold is that the errors are uncorrelated and all have the same variance.
Formula (9) indicates that we need an estimate for the last remaining parameter
of the simple linear regression model (5), and that is the error variance σ². Since
ε_i = y_i − β_0 − β_1 x_i and the ith residual is r_i = y_i − β̂_0 − β̂_1 x_i, a natural estimate of
the error variance is

σ̂² = MS_res = SS_res / (n − 2)
where

SS_res = Σ_{i=1}^n r_i²

is the Sum of Squares for the Residuals and MS_res stands for the Mean Squared
Residual (or mean squared error (MSE)). We divide by n − 2 in the mean squared
residual so as to make it an unbiased estimator of σ²: E[MS_res] = σ². We lose two
degrees of freedom for estimating the slope β_1 and the intercept β_0. Therefore, the
degrees of freedom associated with the mean squared residual is n − 2.
Returning to the car example, we have

y = (27.5, 27.2, 34.1, 35.1, 31.8, 22.0)ᵀ,
ŷ = (25.7229, 29.1450, 33.4227, 34.2125, 32.8304, 22.3665)ᵀ,
r = (1.7771, −1.9450, 0.6773, 0.8875, −1.0304, −0.3665)ᵀ
where r is the vector of residuals (note that the residuals sum to zero, analogously
with E[ε_i] = 0). Computing, we get MS_res = 2.346 and the estimated covariance
matrix for β̂ is

σ̂²(XᵀX)⁻¹ = 2.3460 [7.9651 −3.4443; −3.4443 1.5212] = [18.6859 −8.0802; −8.0802 3.5687].
The numbers in the diagonal of the covariance matrix give the estimated variances of
β̂_0 and β̂_1. Therefore, the slope of the regression line is estimated to be β̂_1 = −13.1622
with estimated variance σ̂²_{β̂_1} = 3.5687. Taking the square-root of this variance gives
the estimated standard error of the slope

σ̂_{β̂_1} = √3.5687 = 1.8891

which will be used for making inferential statements about the slope.
Note that the estimated covariance between the estimated intercept and the estimated
slope is −8.0802. Does it seem intuitive that the estimated slope and intercept will
be negatively correlated when the regressor values (the x_i's) are all positive?
To test the null hypothesis H_0: β_1 = β_10, we standardize the difference between the
estimated slope and its hypothesized value,

t = (β̂_1 − β_10) / σ̂_{β̂_1},

and reject H_0 when this standardized difference is large (away from the null hypothesis).
Assuming the error terms ε_i are independent with a normal distribution, this
test statistic has a t-distribution on n − 2 degrees of freedom when the null hypothesis
is true. If we are performing a test using a significance level α, then we would reject
H_0 at significance level α if

t > t_α when H_a: β_1 > β_10,
t < −t_α when H_a: β_1 < β_10,
t > t_{α/2} or t < −t_{α/2} when H_a: β_1 ≠ β_10.
A common hypothesis of interest is whether the slope differs significantly from zero. If the
slope β_1 is zero, then the response does not depend on the regressor. The test statistic
in this case reduces to t = β̂_1 / σ̂_{β̂_1}.
Car Example continued ... We can test if the mileage of a car is related (linearly)
to the weight of the car. In other words, we want to test H_0: β_1 = 0 versus
H_a: β_1 ≠ 0. Let us test this hypothesis using significance level α = 0.05. Since there
are n = 6 observations, we will reject H_0 if the test statistic is larger in absolute
value than t_{α/2} = t_{.05/2} = t_{.025} = 2.7764, which can be found in the t-table under n − 2 =
6 − 2 = 4 degrees of freedom. Recall that β̂_1 = −13.1622 with estimated standard
error σ̂_{β̂_1} = √3.5687 = 1.8891. Computing, we find that

t = β̂_1 / σ̂_{β̂_1} = −13.1622 / 1.8891 = −6.9674.

Since |t| = |−6.9674| = 6.9674 > t_{α/2} = 2.7764, we reject H_0 and conclude that the
slope differs from zero using a significance level α = 0.05. In other words, the MPG
of a car depends on the weight of the car.
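The arithmetic of this test can be checked directly. A short Python sketch (illustrative; the slope estimate, its variance, and the critical value 2.7764 are the quantities quoted above):

```python
import math

b1 = -13.1622              # estimated slope from the car example
se_b1 = math.sqrt(3.5687)  # estimated standard error of the slope
t_crit = 2.7764            # t_{.025} critical value on 4 degrees of freedom

t = b1 / se_b1             # test statistic for H0: beta_1 = 0
reject = abs(t) > t_crit   # two-sided rejection rule at alpha = 0.05

print(round(t, 4), reject)  # about -6.9674, True
```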
We can also compute a p-value for this test as

p-value = 2P(T > |t|)    (2-tailed p-value)

where T represents a t random variable on n − 2 degrees of freedom and t represents the
observed value of the test statistic. The factor 2 is needed because this is a two-sided
test: we reject H_0 for large values of t in either the positive or negative direction. The
computed p-value in this example (using degrees of freedom equal to 4) is 2P(T >
6.9674) = 2(0.0011) = 0.0022. Thus, we have very strong evidence that the slope
differs from zero.
Hypothesis tests can be performed for the intercept β_0 as well, but this is not as
common. The test statistic for testing H_0: β_0 = β_00 is

t = (β̂_0 − β_00) / σ̂_{β̂_0},
Extension (mm) Force (N)
3.072 181.063
3.154 185.313
3.238 191.375
3.322 196.609
3.403 201.406
3.487 207.594
3.569 212.984
3.652 217.641
3.737 223.609
3.816 228.203
3.902 234.422
Figure 4 shows a scatterplot of the raw data. The relation appears to be linear.
Figure 5 shows the raw data again in the left panel along with the fitted regression
Figure 4: Scatterplot of Force (in Newtons) versus extension (in mm.) for an external
fixator used to hold a broken bone in place.
line ŷ = −17.703 + 64.531x. The points in the plot are tightly clustered about the
regression line indicating that almost all the variability in y is accounted for by the
regression relation (see the discussion of R² below).
A residual plot is shown in the right panel of Figure 5. The residuals should not
exhibit any structure, and a plot of residuals is useful for assessing whether the specified
model is adequate for the data. The slope is estimated to be β̂_1 = 64.531 and the
estimated standard error of the slope is found to be σ̂_{β̂_1} = 0.465. A (1 − α)100%
confidence interval for the slope is given by

Confidence Interval for the Slope: β̂_1 ± t_{α/2} σ̂_{β̂_1},

where the degrees of freedom for the t-critical value is given by n − 2. The estimated
standard error of the slope can be found as before by taking the square root of the second
diagonal element of the covariance matrix σ̂²(XᵀX)⁻¹. For the fixator experiment,
let us compute a 95% confidence interval for the stiffness (β_1). The sample size is
n = 11 and the critical value is t_{α/2} = t_{.05/2} = t_{.025} = 2.26216 for n − 2 = 11 − 2 = 9
degrees of freedom. The 95% confidence interval for the stiffness is

64.531 ± 2.26216(0.465)

which gives an interval of [63.479, 65.583]. With 95% confidence we estimate that the
stiffness of the external fixator lies between 63.479 and 65.583 Newtons/mm.
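The interval arithmetic can be sketched in a few lines (illustrative Python; the slope estimate, standard error, and t critical value are the ones quoted above):

```python
b1 = 64.531        # estimated stiffness (slope), Newtons/mm
se_b1 = 0.465      # estimated standard error of the slope
t_crit = 2.26216   # t_{.025} on n - 2 = 9 degrees of freedom

margin = t_crit * se_b1
ci = (b1 - margin, b1 + margin)   # 95% confidence interval for the slope

print(round(ci[0], 3), round(ci[1], 3))  # about 63.479 and 65.583
```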
Problems
1. Box, Hunter, & Hunter (1978) report on an experiment looking at how y, the
dispersion of an aerosol (measured as the reciprocal of the number of particles
Figure 5: Left Panel shows the scatterplot of the fixator data along with the least-squares
regression line. The right panel shows a plot of the residuals versus the
fitted values ŷ_i to evaluate the fit of the model.
per unit volume) depends on x, the age of the aerosol (in minutes). The data
are given in the following table:
y x
6.16 8
9.88 22
14.35 35
24.06 40
30.34 57
32.17 73
42.18 78
43.23 87
48.76 98
Fit a simple linear regression model to this data by performing the following
steps:
a) Write out the design matrix X for this data and the vector y of responses.
b) Compute XᵀX.
c) Compute (XᵀX)⁻¹.
d) Compute the least squares estimates of the y-intercept and slope: β̂ = (XᵀX)⁻¹Xᵀy.
e) Plot the data along with the fitted regression line.
f) Compute the mean squared error from the least-squares regression line:
σ̂² = MSE = (y − ŷ)ᵀ(y − ŷ)/(n − 2).
g) Compute the estimated covariance matrix for the estimated regression coefficients: σ̂²(XᵀX)⁻¹.
h) Does the age of the aerosol affect the dispersion of the aerosol? Perform
a hypothesis test using significance level α = 0.05 to answer this question.
Set up the null and alternative hypotheses in terms of the parameter of
interest, determine the critical region, compute the test statistic, and state
your decision. In plain English, write out the conclusion of the test.
i) Find a 95% confidence interval for the slope of the regression line.
2. Consider the crystal growth data in the notes. In this example, x = time the
crystal grew and y = weight of the crystal (in grams). It seems reasonable that
at time zero, the crystal would weigh zero grams since it has not started growing
yet. In fact, the estimated regression line has a y-intercept near zero. Find the
least squares estimator of β_1 in the no-intercept model y_i = β_1 x_i + ε_i in two
different ways:

a) Find the value of β_1 that minimizes

Σ_{i=1}^n (y_i − β_1 x_i)².

Note: Solve this algebraically without using the data from the actual experiment.
b) Write out the design matrix for the no-intercept model and compute b_1 =
(XᵀX)⁻¹XᵀY. Does this give the same solution as part (a)?
Example. An experiment was conducted to study how the weight (in grams) of a
crystal varies according to how long (in hours) the crystal grows (Graybill and Iyer,
1994). The data are given in the following table:
Weight Hours
0.08 2
1.12 4
4.43 6
4.98 8
4.92 10
7.18 12
5.57 14
8.40 16
8.81 18
10.81 20
11.16 22
10.12 24
13.12 26
15.04 28
Clearly as the crystal grows, the weight increases. We can use the slope of the
estimated least squares regression line as an estimate of the linear growth rate. A
direct computation shows

XᵀX = [14 210; 210 4060]

and

β̂ = (0.0014, 0.5034)ᵀ.
The raw data along with the fitted regression line are shown in Figure 6. From the
estimated slope, we can state that the crystals grow at a rate of 0.5034 grams per hour.
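The crystal-growth fit can be reproduced from the table (an illustrative NumPy sketch using the fourteen weight/hours pairs above):

```python
import numpy as np

hours = np.arange(2, 29, 2)   # 2, 4, ..., 28 hours
weight = np.array([0.08, 1.12, 4.43, 4.98, 4.92, 7.18, 5.57,
                   8.40, 8.81, 10.81, 11.16, 10.12, 13.12, 15.04])

X = np.column_stack([np.ones(len(hours)), hours])
beta = np.linalg.solve(X.T @ X, X.T @ weight)   # (X'X)^{-1} X'y

# The intercept is near zero; the slope is the growth rate in grams/hour
print(np.round(beta, 4))  # about [0.0014, 0.5034]
```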
We now turn to the question of using the estimated regression model to estimate the
mean response at a given value of x or predict a new value of y for a given value of
x. Note that estimating a mean response and predicting a new response are different
goals.
Suppose we want to estimate the mean weight of a crystal that has grown for x = 15
hours. The question is: what is the average weight of all crystals that have grown for
x = 15 hours? Note this is a hypothetical population. If we were to set up a production
process where we grow crystals for 15 hours, what would be the average weight of the
resulting crystals? In order to estimate the mean response at x = 15 hours, we use
ŷ = β̂_0 + β̂_1 x, plugging in x = 15.
On the other hand, if we want to predict the weight of a single crystal that has grown
for x = 15 hours, we would also use ŷ = β̂_0 + β̂_1 x with x = 15, just as we did
for estimating a mean response. Note that although estimating a mean response and
predicting a new response are two different goals, we use ŷ in each case. The difference
statistically between estimating a mean response and predicting a new response lies
in the uncertainty associated with each. A confidence interval for a mean response
will be narrower than a prediction interval for a new response. The reason why is that
a mean response for a given x value is a fixed quantity: it is an expected value of the
response for a given x value, known as a conditional mean. A 95% prediction interval
for a new response must be wide enough to contain 95% of the future responses at a
given x value. The confidence interval for a mean response only needs to contain the
mean of all responses for a given x with 95% confidence. The following two formulas
give the confidence interval for a mean response and a prediction interval for a new
response at a given value x_0 for the predictor:
ŷ ± t_{α/2} √( MS_res (1, x_0)(XᵀX)⁻¹(1, x_0)ᵀ )    Confidence Interval for Mean Response (10)

and

ŷ ± t_{α/2} √( MS_res (1 + (1, x_0)(XᵀX)⁻¹(1, x_0)ᵀ) )    Prediction Interval for New Response (11)
where the t-critical value t_{α/2} is based on n − 2 degrees of freedom. Note that in both
formulas,

(1, x_0)(XᵀX)⁻¹(1, x_0)ᵀ

corresponds to a 1×2 vector (1, x_0) times a 2×2 matrix (XᵀX)⁻¹ times the 2×1
transpose of (1, x_0). The prediction interval is wider than the confidence interval due
to the added 1 underneath the radical in the prediction interval. Formulas (10)
and (11) generalize easily to the multiple regression setting when there is more than
one predictor variable.
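To see the difference between the two intervals numerically, the following sketch computes both at x_0 = 15 hours for the crystal data (illustrative NumPy; the t critical value 2.1788 on 12 degrees of freedom is taken from a standard t-table, not from the text):

```python
import numpy as np

hours = np.arange(2, 29, 2).astype(float)
weight = np.array([0.08, 1.12, 4.43, 4.98, 4.92, 7.18, 5.57,
                   8.40, 8.81, 10.81, 11.16, 10.12, 13.12, 15.04])
n = len(weight)

X = np.column_stack([np.ones(n), hours])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ weight
resid = weight - X @ beta
ms_res = resid @ resid / (n - 2)      # MS_res for the crystal fit

x0 = np.array([1.0, 15.0])            # the vector (1, x0) with x0 = 15 hours
y0 = x0 @ beta                        # point estimate y-hat at x0
h = x0 @ XtX_inv @ x0                 # the term (1, x0)(X'X)^{-1}(1, x0)'
t_crit = 2.1788                       # t_{.025} on n - 2 = 12 degrees of freedom

ci_half = t_crit * np.sqrt(ms_res * h)        # half-width of interval (10)
pi_half = t_crit * np.sqrt(ms_res * (1 + h))  # half-width of interval (11)

assert pi_half > ci_half   # the prediction interval is always wider
```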
The confidence interval for the mean response can be rewritten after multiplying out
the terms to get

ŷ ± t_{α/2} √( MS_res (1/n + (x_0 − x̄)²/SS_xx) ).
From this formula, one can see that the confidence interval for a mean response (and
also the prediction interval) is narrowest when x_0 = x̄. Figure 7 shows both the
confidence intervals for mean responses and prediction intervals for new responses at
each x value. The lower and upper ends of these intervals plotted for all x values
form an upper and lower band shown in Figure 7. The solid curve corresponds to
a confidence band and is narrower than the prediction band, which is plotted by
the dashed curve. Both bands are narrowest at the point (x̄, ȳ) (the least squares
regression line always passes through the point (x̄, ȳ)). Note that in this example, all
of the actual weight measurements (the y_i's) lie inside the 95% prediction bands as
seen in Figure 7.
A note of caution is in order when using regression models for prediction. Using
an estimated model to extrapolate outside the range where data was collected to fit
the model is very dangerous. Often a straight line is a reasonable model relating a
response y to a predictor (or regressor) x over a short interval of x values. However,
over a broader range of x values, the response may be markedly nonlinear, and the
straight-line fit over the small interval, when extrapolated over a larger interval, can
give very poor or even downright nonsensical predictions. It is not unusual, for instance,
that as the regressor variable gets larger (or smaller), the response may level off
and approach an asymptote. One such example is illustrated in Figure 8 showing a
scatterplot of the winning times in the Boston marathon for men (open circles) and
women (solid circles) each year. Also plotted are the least squares regression lines fitted
to the data for men and women. If we were to extrapolate into the future using the
straight line fit, then we would eventually predict that the fastest female runner would
beat the fastest male runner. Not only that, the predicted times in the future for both
men and women would eventually become negative, which is clearly impossible. It may
be that the female champion will eventually beat the male champion at some point in
the future, but we cannot use these models to predict this because these models were
fit using data from the past. We do not know for sure what sort of model is applicable
Figure 7: Crystal growth data with the estimated regression line along with the
95% confidence band for estimated mean responses (solid curves) and 95% prediction
band for predicted responses (dashed curves).
for future winning times. In fact, the straight line models plotted in Figure 8 are not
even valid for the data shown. For instance, the data for the women show a rapid
improvement in winning times over the first several years women were allowed to run
the race, but then the winning times flatten out, indicating that a threshold is being
reached for the fastest possible time the race can be run. This horizontal asymptote
effect is evident for both males and females.
Problems
y x
215 19
629 38
1034 57
1475 76
1925 95
Figure 8: Winning times in the Boston Marathon versus year for men (open circles)
and women (solid circles). Also plotted are the least-squares regression lines for the
men and women champions.
7 Coefficient of Determination R²
The quantity SS_res is a measure of the variability in the response y after factoring
out the dependence on the regressor x. A measure of total variability in the response
measured without regard to x is

SS_yy = Σ_{i=1}^n (y_i − ȳ)².

A useful statistic for measuring the proportion of variability in the y's accounted for
by the regressor x is the coefficient of determination R², sometimes known simply
as the R-squared:

R² = 1 − SS_res / SS_yy.    (12)
In the car mileage example, SS_yy = 123.27 and SS_res = 9.38:

R² = 1 − 9.38/123.27 = 0.924.
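This R² computation can be verified directly from the data (an illustrative NumPy sketch using the six observations from the mileage example):

```python
import numpy as np

wt = np.array([2.560, 2.300, 1.975, 1.915, 2.020, 2.815])
mpg = np.array([27.5, 27.2, 34.1, 35.1, 31.8, 22.0])

X = np.column_stack([np.ones_like(wt), wt])
beta = np.linalg.solve(X.T @ X, X.T @ mpg)
resid = mpg - X @ beta

ss_res = resid @ resid                    # residual sum of squares, about 9.38
ss_yy = np.sum((mpg - mpg.mean()) ** 2)   # total sum of squares, about 123.27
r_squared = 1 - ss_res / ss_yy            # formula (12)

print(round(r_squared, 3))  # about 0.924
```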
In the fixator example, the points are more tightly clustered about the regression line
and the corresponding coefficient of determination is R² = 0.9995, which is higher
than for the car mileage example (compare the plots in Figure 1 with Figure 4).
By definition, R² is always between zero and one:

0 ≤ R² ≤ 1.

If R² is close to one, then most of the variability in y is explained by the regression
model.
R² is often reported when summarizing a regression model. R² can also be computed
in multiple regression (when there is more than one regressor variable) using the same
formula above. Many times a high R² is considered an indication that one has a
good model, since most of the variability in the response is explained by the regressor
variables. In fact, some experimenters use R² to compare various models. However,
this can be problematic. R² always increases (or at least does not decrease) when
you add regressors to a model. Thus, choosing a model based on the largest R² can
lead to models with too many regressors.
Another note of caution regarding R² is that a large value of R² does not necessarily
mean that the fitted model is correct. It is not unusual to obtain a large R² when
there is a fairly strong non-linear trend in the data.
In simple linear regression, the coefficient of determination R² turns out to be the
square of the sample correlation r.
8 Residual Analysis
The regression models considered so far are simple linear regression models, where it
is assumed that the mean response y is a linear function of the regressor x. This is
a very simple model and appears to work quite well in many examples. Even if the
actual relation of y on x is non-linear, fitting a straight line model may provide a good
approximation if we restrict the range of x to a small interval. In practice, one should
not assume that a simple linear model will be sufficient for fitting data (except in
special cases where there is a theoretical justification for a straight line model). Part
of the problem in regression analysis is to determine an appropriate model relating
the response y to the predictor x.
Recall that the simple linear regression model is y_i = β_0 + β_1 x_i + ε_i where ε_i is a
mean zero random error. After fitting the model, the residuals r_i = y_i − ŷ_i mimic
the random error. A useful diagnostic to assess how well a model fits the data is to
plot the residuals versus the fitted values (ŷ_i). Such plots should show no structure.
If there is evidence of structure in the residual plot, then it is likely that the fitted
regression model is not the correct model. In such cases, a more complicated model
may need to be fitted to the data, such as a polynomial model (see below) or a
nonlinear regression model (not covered here).
It is customary to plot the residuals versus the fitted values instead of residuals versus
the actual y_i values. The reason is that the residuals are uncorrelated with the fitted
Figure 9: Left Panel: Scatterplot of the full fixator data set and fitted regression line.
Right Panel: The corresponding residual plot.
values. Recall from the geometric derivation of the least squares estimators that the
vector of residuals is orthogonal to the vector of fitted values (see Figure 3).
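This orthogonality can be checked numerically; the following sketch uses the car data, but any least-squares fit would behave the same way (illustrative NumPy, not from the original text):

```python
import numpy as np

wt = np.array([2.560, 2.300, 1.975, 1.915, 2.020, 2.815])
mpg = np.array([27.5, 27.2, 34.1, 35.1, 31.8, 22.0])

X = np.column_stack([np.ones_like(wt), wt])
beta = np.linalg.solve(X.T @ X, X.T @ mpg)
fitted = X @ beta
resid = mpg - fitted

# X'r = 0: the residuals are orthogonal to every column of X,
# hence to the fitted values (which lie in the column space of X)
assert np.allclose(X.T @ resid, 0)
assert np.isclose(fitted @ resid, 0)
```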
A word of caution is needed here. Humans are very adept at picking out patterns.
Sometimes, a scatterplot of randomly generated variates (i.e. noise) will show what
appears to be a pattern. However, if the plot was generated by just random noise, then
the patterns are superficial. The same problem can occur when examining a residual
plot. One must be careful of finding structure in a residual plot when there really is
no structure. Analyzing residual plots is an art that improves with lots of practice.
Example (Fixator example continued). When the external fixator example was
introduced earlier, only a subset of the full data set was used to estimate the stiffness
of the fixator. Figure 9 shows (in the left panel) a scatterplot of the full data set,
including values near zero recorded when the machine was first turned on. Also plotted is the
least squares regression line. From this picture, it appears as if a straight line model
would fit the data well. However, the right panel shows the corresponding residual
plot, which reveals a fairly strong structure indicating that a straight line does not fit
the full data set well.
Example. Fuel efficiency data was obtained on 32 automobiles from the 1973-74
models by Motor Trends US Magazine. The response of interest is miles per gallon
(mpg) of the automobiles. Figure 10 shows a scatterplot of mpg versus horsepower.

Figure 10: Scatterplot of Motor Trends car data: Miles per gallon (mpg) versus Gross
Horsepower for 32 different brands of cars.

Figure 10 shows that increasing horsepower corresponds to lower fuel efficiency. A
simple linear regression model was fit to the data and the fitted line is shown in the
left panel of Figure 11. The coefficient of determination for this fit is R² = 0.6024. A
closer look at the data indicates a slight non-linear trend. The right panel of Figure 11
shows a residual plot versus fitted values. The residual plot indicates that there may
be a problem with the straight line fit: the residuals to the left and right are positive
and the residuals in the middle are mostly negative. This U-shaped pattern is
indicative of a poor fit. To solve the problem, a different type of model needs to be
considered, or perhaps a transformation of one or both variables may work.
Figure 11: Left Panel: Scatterplot of MPG versus horsepower along with the fitted
regression line. Right panel: Residual plot versus the fitted values ŷ_i.
x_1 = Cobalt content
x_2 = Temperature.
The data from this experiment are given in the following table:
Figure 12: Anscombe simple linear regression data. Four very different data sets
yielding exactly the same least squares regression line.
Cobalt Content   Temperature   Surface Area
0.6 200 90.6
0.6 250 82.7
0.6 400 58.7
0.6 500 43.2
0.6 600 25
1.0 200 127.1
1.0 250 112.3
1.0 400 19.6
1.0 500 17.8
1.0 600 9.1
2.6 200 53.1
2.6 250 52
2.6 400 43.4
2.6 500 42.4
2.6 600 31.6
2.8 200 40.9
2.8 250 37.9
2.8 400 27.5
2.8 500 27.3
2.8 600 19
A general model relating the surface area y to cobalt content x_1 and temperature x_2
is

y_i = f(x_i1, x_i2) + ε_i

where ε_i is a random error and f is some unknown function; x_i1 is the cobalt content
for the ith unit in the data and x_i2 is the corresponding ith temperature measurement,
for i = 1, . . . , n. We can try approximating f by a first order Taylor series
approximation to get the following multiple regression model:

y_i = β_0 + β_1 x_i1 + β_2 x_i2 + ε_i.    (13)

If (13) is not adequate to model the response y, then we could try a higher order
Taylor series approximation such as:

y_i = β_0 + β_1 x_i1 + β_2 x_i2 + β_11 x_i1² + β_22 x_i2² + β_12 x_i1 x_i2 + ε_i.    (14)
The work required for finding the least squares estimators of the coefficients and the variances and covariances of these estimated parameters has already been done. The form of the least-squares solution from the simple linear regression model holds for the multiple regression model. This is where the matrix approach to the problem really pays off, because working out the details without using matrix algebra is very tedious. Consider a multiple linear regression model with k regressors, x1, ..., xk:

yi = β0 + β1 xi1 + β2 xi2 + ··· + βk xik + εi.    (15)

The least squares estimators are given by

β̂ = (β̂0, β̂1, ..., β̂k)' = (X'X)⁻¹ X'y    (16)

just as in equation (8), where

        | 1  x11  x12  ...  x1k |
        | 1  x21  x22  ...  x2k |
X =     | .   .    .         .  |    (17)
        | 1  xn1  xn2  ...  xnk |
Note that the design matrix X has a column of ones as its first column for the intercept term β0, just like in simple linear regression. The covariance matrix of the estimated coefficients is given by

σ²(X'X)⁻¹

just as in equation (9), where σ² is the error variance, which is again estimated by the mean squared residual

MSres = σ̂² = Σ_{i=1}^n (yi − ŷi)² / (n − k − 1).

Note that the degrees of freedom associated with the mean squared residual is n − k − 1, since we lose k + 1 degrees of freedom estimating the intercept and the k coefficients for the k regressors.
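The estimator (16) and the mean squared residual can be computed in a few lines. Below is a minimal sketch in Python with NumPy (for illustration only; the function name and interface are ours, and in practice a statistics package would be used):

```python
import numpy as np

def fit_multiple_regression(regressors, y):
    """Least squares fit of y = b0 + b1*x1 + ... + bk*xk.

    regressors: (n, k) array whose columns are x1, ..., xk.
    Returns (beta_hat, ms_res, cov_beta).
    """
    n, k = regressors.shape
    X = np.column_stack([np.ones(n), regressors])  # design matrix (17)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # equation (16)
    resid = y - X @ beta_hat
    ms_res = resid @ resid / (n - k - 1)           # mean squared residual
    cov_beta = ms_res * np.linalg.inv(X.T @ X)     # estimates sigma^2 (X'X)^(-1)
    return beta_hat, ms_res, cov_beta
```

Using np.linalg.solve for the estimate avoids explicitly inverting X'X, which is numerically preferable; the explicit inverse is formed only for the covariance matrix.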
When the data is fit to a multiple regression model using height and weight as regressors, how do we interpret the resulting coefficients? The coefficient for height tells us how much longer the catheter needs to be for each additional inch of height of the child, provided the weight of the child stays constant. But the taller the child, the heavier the child tends to be. Figure 13 shows a scatterplot of weight versus height for the n = 12 children from this experiment. The plot shows a very strong linear relationship between height and weight. The correlation between height and weight is r = 0.9611. This large correlation complicates the interpretation of the regression coefficients. The problem of correlated regressor variables is known as collinearity (or multicollinearity), and it is an important problem that one needs to be aware of when fitting multiple regression models. We return to this example in the collinearity section below.
Fortunately, in designed experiments where the engineer has complete control over the regressor variables, data may be collected in such a fashion that the estimated regression coefficients are uncorrelated. In such situations, the estimated coefficients can then be easily interpreted. To make this happen, one needs to choose the values of the regressors so that the off-diagonal terms in the (X'X)⁻¹ matrix (from (9)) that correspond to covariances between estimated coefficients of the regressors are all zero.
Cobalt Example Continued. Let us return to model (13) and estimate the parameters of the model. Using the data in the cobalt example, we can construct the design matrix X and compute

β̂ = (X'X)⁻¹ X'y

      |   20      35      7800   |⁻¹ |  961.2  |
    = |   35      79.8    13650  |   | 1471.8  |
      |  7800    13650  3490000  |   | 309415  |

      |  124.8788 |
    = |  −11.3369 |.
      |   −0.1461 |
The matrix computations are tedious, but software packages like Matlab can perform these for us. Nonetheless, it is a good idea to understand exactly what the computer software packages are computing for us when we feed data into them. The mean squared residual for this data is σ̂² = MSres = 445.12939, and the estimated covariance matrix of the estimated coefficients is

                | 246.8702   −41.9933   −0.3875  |
σ̂²(X'X)⁻¹ =    | −41.9933    23.9962    0.0000  |.
                |  −0.3875     0.0000    0.00099 |

Note that this was a well-designed experiment because the estimated coefficients for cobalt content (β̂1) and temperature (β̂2) are uncorrelated: the covariance between them is zero, as can be seen in the covariance matrix.
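As a sketch of what such software computes, the estimates and their covariance matrix can be reproduced directly from the data table with NumPy (illustrative only; Matlab or SAS would normally be used):

```python
import numpy as np

# cobalt content, temperature, and surface area from the data table
cobalt = np.repeat([0.6, 1.0, 2.6, 2.8], 5)
temp = np.tile([200.0, 250.0, 400.0, 500.0, 600.0], 4)
area = np.array([90.6, 82.7, 58.7, 43.2, 25.0,
                 127.1, 112.3, 19.6, 17.8, 9.1,
                 53.1, 52.0, 43.4, 42.4, 31.6,
                 40.9, 37.9, 27.5, 27.3, 19.0])

X = np.column_stack([np.ones(20), cobalt, temp])
beta_hat = np.linalg.solve(X.T @ X, X.T @ area)
print(beta_hat)  # approximately (124.879, -11.337, -0.146)

resid = area - X @ beta_hat
ms_res = resid @ resid / (20 - 2 - 1)
cov_beta = ms_res * np.linalg.inv(X.T @ X)
# cov_beta[1, 2] is zero: the slope estimates are uncorrelated by design
```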
H0: β1 = β2 = ··· = βk = 0
versus
Ha: βj ≠ 0 for at least one j.
The basic idea behind the testing procedure is to partition all the variability in the response into two pieces: variability due to the regression relation and variability due to the random error term. This is why the procedure is called Analysis of Variance (ANOVA). The total variability in the response is represented by the total sum of squares (SSyy):

SSyy = Σ_{i=1}^n (yi − ȳ)².

The regression sum of squares,

SSreg = Σ_{i=1}^n (ŷi − ȳ)²,

represents the variability in the yi's explained by the multiple regression model, and the residual sum of squares,

SSres = Σ_{i=1}^n (yi − ŷi)²,

represents the leftover variability due to random error. Note that for an individual measurement,

yi − ȳ = (ŷi − ȳ) + (yi − ŷi).

If we square both sides of this equation and sum over all n observations, we will get SSyy on the left-hand side. On the right-hand side (after doing some algebra) we will get SSreg + SSres only, because the cross product terms sum to zero. This gives us the well-known variance decomposition formula:

SSyy = SSreg + SSres.    (18)
If all the βj's are zero, then the regression model will explain very little of the variability in the response, in which case SSreg will be small and SSres will be large, relatively speaking. The ANOVA test then simply compares these components of variance with each other. However, in order to make the sums of squares comparable, we need to first divide each sum of squares by its respective degrees of freedom. The degrees of freedom for the regression sum of squares is simply the number of β's in the model (the intercept plus k regressors) minus one:

dfreg = k.

The degrees of freedom for the residual sum of squares is

dfres = n − k − 1.

Dividing each sum of squares by its degrees of freedom gives the mean squares MSreg = SSreg/k and MSres = SSres/(n − k − 1). The ANOVA test statistic is the ratio of the mean squares, which is denoted by F:

F = MSreg / MSres.    (19)
Assuming that the errors in model (15) are independent and follow a normal distribution, the test statistic F follows an F-distribution on k numerator degrees of freedom and n − k − 1 denominator degrees of freedom. If the null hypothesis is true (i.e. all βj's equal zero), then F takes a value close to one on average. However, if the null hypothesis is false, then the regression mean square tends to be larger than the error mean square, and F will be larger than one on average. Therefore, the F-test rejects the null hypothesis for large values of F. How large does F have to be? If the null hypothesis is true, then the F test statistic follows an F distribution. Therefore, we use critical values from the F-distribution to determine the critical region for the F-test, which can be looked up in an F-table (see pages 202-204 in the Appendix).
For a simple linear regression model yi = β0 + β1 xi + εi, we used a t-test to test H0: β1 = 0 versus Ha: β1 ≠ 0. One can also perform an ANOVA F-test of this hypothesis and, in fact, the two tests are equivalent. One can show with a little algebra that the square of the t-test statistic is equal to the ANOVA F-test statistic in a simple linear regression, i.e. t² = (β̂1/se(β̂1))² = F.
Analysis of Variance

                              Sum of       Mean
Source             DF         Squares      Square       F Value   Pr > F
Model               2         11947        5973.5         13.42   0.0003
Error              17         7567.19968    445.12939
Corrected Total    19         19514
In this ANOVA table, the Corrected Total sum of squares is SSyy. Note that adding the Model and Error sums of squares gives the corrected total sum of squares: 11947 + 7567.19968 = 19514 (rounded to the nearest integer), showing the variance decomposition given in (18). Likewise, the model and error degrees of freedom add up to equal the corrected total degrees of freedom: 2 + 17 = 19 = n − 1. Finally, the last column of the ANOVA table gives the p-value of the test, p = 0.0003. Because the p-value is less than α = 0.05, we reject the null hypothesis. In fact, the p-value is very small, indicating very strong evidence that at least one of the regression coefficients differs from zero.
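The entries of the ANOVA table follow directly from the sums of squares; here is a quick sketch (SciPy is used only to evaluate the F-distribution tail probability):

```python
from scipy.stats import f as f_dist

ss_reg, df_reg = 11947.0, 2        # Model row
ss_res, df_res = 7567.19968, 17    # Error row

ms_reg = ss_reg / df_reg           # regression mean square
ms_res = ss_res / df_res           # residual mean square
F = ms_reg / ms_res                # test statistic (19)
p_value = f_dist.sf(F, df_reg, df_res)
print(F, p_value)  # F is about 13.4, p is about 0.0003
```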
Once we have determined that at least one of the coefficients is non-zero, t-tests are typically conducted for the individual coefficients in the model:

H0: βj = 0 versus Ha: βj ≠ 0

for j = 1, ..., k. The test statistic for each of these hypotheses is a t-test statistic of the form

t = β̂j / se(β̂j),

where the estimated standard errors se(β̂j) of the estimated regression coefficients are found by taking the square roots of the diagonal elements of the σ̂²(X'X)⁻¹ covariance matrix. The t-statistics follow a t-distribution on n − k − 1 degrees of freedom (i.e. the residual degrees of freedom) when the null hypothesis is true. The following table generated by SAS summarizes the tests for individual coefficients:
Parameter Estimates

                        Parameter     Standard
Variable         DF     Estimate      Error        t Value   Pr > |t|
Intercept         1     124.8788      15.7121        7.95     <.0001
Cobalt            1     -11.3369       4.8986       -2.31     0.0335
Temperature       1      -0.1461       0.0315       -4.64     0.0002
This table gives the results of two-tailed tests of whether or not the regression coefficients are zero. The last column gives the p-values, all of which are less than α = 0.05, indicating that both the cobalt content and the temperature are significant in modeling the surface area of the hydroxide catalyst. Because the experiment was designed so that the estimated slope parameters for cobalt and temperature are uncorrelated, the interpretation of the coefficients is made easier. Note that both coefficients are negative, indicating that increasing cobalt content and temperature results in lower surface area on average. The intercept of the model does not have a clear interpretation in this example because data was not collected at values of cobalt and temperature near zero.
It is possible to test more general hypotheses for individual regression coefficients, such as H0: βj = βj0 for some hypothesized value βj0 of the jth regression coefficient. The t-test statistic then becomes

t = (β̂j − βj0) / se(β̂j),

and a 100(1 − α)% confidence interval for an individual coefficient βj is given by

β̂j ± t_{α/2} se(β̂j).

However, individual confidence regions can be misleading when the estimated coefficients are correlated. A better method is to determine a joint confidence region.
11 Confidence Ellipsoids
Figure 14: A 95% confidence ellipse for the regression coefficients β1 and β2 for cobalt content and temperature.
The following formula (e.g. see Johnson and Wichern, 1998, page 285) defines a confidence ellipsoid in the regression setting: a 100(1 − α)% confidence region for β is given by the set of β in R^(k+1) that satisfy

(β̂ − β)' (X'X) (β̂ − β) ≤ (k + 1) MSres F_{k+1, n−k−1, α}.

The confidence region described by this inequality is an elliptically shaped region (see Chapter 4). In the previous example, if we want a confidence region for only (β1, β2)', the coefficients of cobalt content and temperature, let

      | 0  0 |
A =   | 1  0 |
      | 0  1 |

so that A'(X'X)⁻¹A is the 2 × 2 block of (X'X)⁻¹ corresponding to β1 and β2. The confidence region is then given by the set of (β1, β2)' in R² satisfying the following inequality:

(β̂1 − β1, β̂2 − β2) [A'(X'X)⁻¹A]⁻¹ (β̂1 − β1, β̂2 − β2)' ≤ 2 MSres F_{2, n−k−1, α}.
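The quadratic form on the left-hand side is easy to evaluate numerically. The sketch below plugs in the X'X matrix and mean squared residual from the cobalt example to check whether a candidate point (b1, b2) falls inside the 95% confidence ellipse:

```python
import numpy as np
from scipy.stats import f as f_dist

XtX = np.array([[20.0, 35.0, 7800.0],
                [35.0, 79.8, 13650.0],
                [7800.0, 13650.0, 3490000.0]])
beta_hat = np.array([-11.3369, -0.1461])  # cobalt and temperature slopes
ms_res = 445.13                           # mean squared residual, n = 20, k = 2
A = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])                # picks out beta1 and beta2
V = A.T @ np.linalg.inv(XtX) @ A          # 2 x 2 slope block of (X'X)^(-1)

def in_ellipse(b1, b2, alpha=0.05):
    d = beta_hat - np.array([b1, b2])
    lhs = d @ np.linalg.solve(V, d)
    return lhs <= 2 * ms_res * f_dist.ppf(1 - alpha, 2, 17)

print(in_ellipse(-11.3369, -0.1461))  # True: the center is inside
print(in_ellipse(0.0, 0.0))           # False: consistent with the significant F-test
```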
The major and minor axes of the confidence ellipse in Figure 14 are parallel to the β1 and β2 axes because the estimates of these parameters are uncorrelated due to the design of the experiment. The estimators of the regression coefficients will not always be uncorrelated, particularly with observational data. The next example illustrates an experimental data set where the regression coefficients are correlated.
Hald Cement Data Example. An experiment was conducted to measure the heat from cement based on differing levels of the ingredients that make up the cement. This is the well-known Hald cement data (Hald, 1932), which has been used to illustrate concepts from multiple regression. The data are in the following table:
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
The F-test statistic equals 85.07 and the associated p-value is less than 0.0001, indicating that there is very strong statistical evidence that at least one of the coefficients for x1 and x2 is non-zero. The R² from this fitted model is R² = 0.9445, indicating that most of the variability in the response is explained by the regression relationship with x1 and x2. The estimated parameters, standard errors, and t-tests for equality to zero are tabulated below:
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Both parameter estimates differ significantly from zero, as indicated by the p-values < 0.0001 for x1 and x2. The estimated coefficient for x1 (amount of tricalcium aluminate) is β̂1 = −1.53258 and the estimated coefficient for x2 (amount of tricalcium silicate) is β̂2 = 2.16559, which seems to indicate that the amount of heat produced decreases with increasing levels of tricalcium aluminate and increases with increasing levels of tricalcium silicate. However, x1 and x2 are correlated with one another (sample correlation r = 0.7307) and thus we cannot easily interpret the coefficients due to this correlation. The estimated covariance matrix of β̂ is

                | 101.8276   −1.2014    1.9098 |
σ̂²(X'X)⁻¹ =    |  −1.2014    0.0148   −0.0276 |
                |   1.9098   −0.0276    0.0964 |
indicating that β̂1 and β̂2 are negatively correlated with one another as well (since the covariance between them is −0.0276). Figure 15 shows a 95% confidence ellipse for the coefficients β1 and β2. Also shown is a rectangular confidence region obtained by taking the Cartesian product of individual confidence intervals for β1 and β2 separately. The fact that the ellipse tilts downward from left to right indicates that the estimated coefficients are negatively correlated. In addition, points that lie in the rectangular confidence region but not inside the ellipse are not really plausible values for the coefficients, because they do not lie inside the elliptical confidence region. To illustrate this point, the marked point in the plot lies inside the confidence rectangle and would seem to be a plausible value for (β1, β2)'. However, since this point lies outside the confidence ellipse, it is not a plausible value with 95% confidence.
12 Polynomial Regression.
A special case of multiple regression is polynomial regression. If a scatterplot of y versus x indicates a nonlinear relationship, then fitting a straight line is not appropriate. For example, the response y may be a quadratic function of x. If the functional relationship between x and y is unknown, then a Taylor series approximation may provide a reasonable way of modeling the data, which would entail fitting a polynomial to the data. The polynomial regression model can be expressed as

yi = β0 + β1 xi + β2 xi² + ··· + βk xi^k + εi.    (20)

To see that (20) is a special case of the multiple regression model, simply define the multiple regressor variables to be x1 = x, x2 = x², ..., xk = x^k. Although (20) is a polynomial in x, it is still linear in the parameters βj.
Figure 15: A 95% confidence ellipse for the regression coefficients β1 and β2 for the Hald cement data.
Temp Acid
100 1.93
125 2.22
150 2.85
175 2.69
200 3.01
225 3.82
250 3.91
275 3.65
300 3.71
325 3.40
350 3.71
375 2.57
400 2.71
A scatterplot of sulfuric acid amount versus temperature is shown in Figure 16. Clearly the relationship between acid and temperature is nonlinear. One of the goals of the study is to determine the optimal temperature that will result in the highest yield of sulfuric acid. Figure 16 suggests that a quadratic model may fit the data well: yi = β0 + β1 xi + β2 xi² + εi. The design matrix for the quadratic model is
    | 1   x1   x1² |     | 1   100    10000 |
    | 1   x2   x2² |     | 1   125    15625 |
X = | .    .    .  |  =  | 1   150    22500 |
    | 1   xn   xn² |     | 1   175    30625 |
                         | 1   200    40000 |
                         | 1   225    50625 |
                         | 1   250    62500 |
                         | 1   275    75625 |
                         | 1   300    90000 |
                         | 1   325   105625 |
                         | 1   350   122500 |
                         | 1   375   140625 |
                         | 1   400   160000 |
Figure 17 shows the scatterplot of the data along with the fitted quadratic curve ŷ = −1.141878 + 0.035519x − 0.000065x². In order to estimate the temperature that produces the highest yield of sulfuric acid, simply take the derivative of this quadratic function, set it equal to zero, and solve for x.
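Setting the derivative β̂1 + 2β̂2x to zero gives x* = −β̂1/(2β̂2). The calculation can be sketched with NumPy (illustrative; np.polyfit returns coefficients with the highest degree first):

```python
import numpy as np

temp = np.array([100, 125, 150, 175, 200, 225, 250,
                 275, 300, 325, 350, 375, 400], dtype=float)
acid = np.array([1.93, 2.22, 2.85, 2.69, 3.01, 3.82, 3.91,
                 3.65, 3.71, 3.40, 3.71, 2.57, 2.71])

b2, b1, b0 = np.polyfit(temp, acid, 2)  # least squares quadratic fit
x_opt = -b1 / (2.0 * b2)                # vertex of the fitted parabola
print(x_opt)  # roughly 273 degrees for this data
```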
One of the shortcomings of polynomial regression is that the fitted models can become very unstable when fitting higher order polynomials. The estimated covariance matrix of the coefficients becomes nearly singular for high-degree fits, so the individual coefficient estimates are poorly determined.
Figure 17: Sulfuric acid y versus temperature x along with the quadratic polynomial fit ŷ = −1.141878 + 0.035519x − 0.000065x².
In order to determine the appropriate degree to use when fitting a polynomial, one can choose the model that leads to the smallest residual mean square MSres. The residual sum of squares SSres gets smaller as the degree of the polynomial gets larger. However, each time we add another term to the model, we lose a degree of freedom for estimating the residual mean square. Recall that the residual mean square is defined by dividing the residual sum of squares by n − k − 1, where k equals the degree of the polynomial (e.g. k = 2 for a quadratic model). The following table gives the R² and MSres for different polynomial fits:
Degree R2 M Sres
1 0.1877 0.3785
2 0.8368 0.0837
3 0.8511 0.0848
4 0.8632 0.0876
5 0.8683 0.0964
6 0.8687 0.1122
7 0.8884 0.1144
As seen in the table, the R² increases with the degree of the polynomial. However, for polynomials of degree 3 and higher, the increase in R² is very slight. The residual mean square is smallest for the quadratic fit, indicating that the quadratic fit is best according to this criterion.
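The table can be reproduced by fitting polynomials of increasing degree and computing MSres = SSres/(n − k − 1) for each; a sketch (np.polyfit may warn about poor numerical conditioning at the higher degrees, which is itself a symptom of the instability noted above):

```python
import numpy as np

temp = np.array([100, 125, 150, 175, 200, 225, 250,
                 275, 300, 325, 350, 375, 400], dtype=float)
acid = np.array([1.93, 2.22, 2.85, 2.69, 3.01, 3.82, 3.91,
                 3.65, 3.71, 3.40, 3.71, 2.57, 2.71])
n = len(acid)
ss_yy = np.sum((acid - acid.mean()) ** 2)  # total sum of squares

for k in range(1, 8):
    coef = np.polyfit(temp, acid, k)
    ss_res = np.sum((acid - np.polyval(coef, temp)) ** 2)
    r2 = 1.0 - ss_res / ss_yy
    ms_res = ss_res / (n - k - 1)          # penalizes lost degrees of freedom
    print(k, round(r2, 4), round(ms_res, 4))
```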
13 Collinearity
One of the most serious problems in a multiple regression setting is (multi)collinearity: the regressor variables being correlated with one another. Collinearity causes many problems in a multiple regression model. One of the main problems is that if collinearity is severe, the estimated regression coefficients are very unstable and cannot be interpreted. In the heart catheter data set described above, the height and weight of the children were highly correlated (r = 0.9611). Figure 19 shows a 3-dimensional plot of the height and weight as well as the response variable y = length. The goal of the least-squares fitting procedure is to determine the best-fitting plane through these points. However, the points lie roughly along a straight line in the height-weight plane. Consequently, fitting the regression plane is analogous to trying to build a table when all the legs of the table lie roughly in a straight line: the result is a very wobbly table. Ordinarily, tables are designed so that the legs are far apart and spread out over the surface of the table. When fitting a regression surface to highly correlated regressors, the resulting fit is very unstable. Slight changes in the values of the regressor variables can lead to dramatic differences in the estimated parameters. Consequently, the standard errors of the estimated regression coefficients tend to be inflated. In fact, it is quite common for none of the regression coefficients to differ significantly from zero when individual t-tests are computed for regression coefficients. Additionally, the regression coefficients can have the wrong sign: one may obtain a negative slope coefficient when instead a positive coefficient is expected.
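The inflation of standard errors is easy to demonstrate on synthetic data (everything below is made up for illustration): two regressors are generated with correlation near one, mimicking the height and weight situation, and the variance inflation factor 1/(1 − r²) is computed:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 12
x1 = rng.normal(size=n)
x2 = 0.97 * x1 + 0.15 * rng.normal(size=n)  # nearly collinear with x1
y = 1.0 + 2.0 * x1 + 2.0 * x2 + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x1, x2])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
ms_res = resid @ resid / (n - 3)
se = np.sqrt(ms_res * np.diag(np.linalg.inv(X.T @ X)))

r = np.corrcoef(x1, x2)[0, 1]
vif = 1.0 / (1.0 - r ** 2)  # variance inflation factor for x1 (and for x2)
print(r, vif, se)           # high r, large VIF, inflated slope standard errors
```

A VIF well above 10 is a common rule-of-thumb warning sign; dropping one of the two regressors shrinks the standard error of the remaining slope dramatically.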
Heart Catheter Example. A multiple regression was used to model the length of the catheter (y) with regressors height (x1) and weight (x2):

yi = β0 + β1 xi1 + β2 xi2 + εi.
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
The overall F-test indicates that at least one of the slope parameters β1 and/or β2 differs significantly from zero (p-value = 0.0004). Furthermore, the coefficient of determination is R² = 0.8254, indicating that height and weight explain most of the variability in the required catheter length. However, the parameter estimates, standard errors, t-test statistics, and p-values shown below indicate that neither β1 nor β2 differs significantly from zero.
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
The apparent paradox, that the overall F-test says at least one of the coefficients differs from zero whereas the individual t-tests say neither differs from zero, is due to the collinearity problem. This phenomenon is quite common when collinearity is present. A 95% confidence ellipse for the estimated regression coefficients of height and weight, shown in Figure 20, shows that the estimated regression coefficients are highly correlated, because the confidence ellipse is quite eccentric.
Another major problem when collinearity is present is that the fitted model will not be able to produce reliable predictions for values of the regressors away from the range of the regressor values. If the data change just slightly, then the predicted values outside the range of the data can change dramatically when collinearity is present; just think of the wobbly table analogy.
In a well-designed experiment, the values of the regressor variables will be orthogonal, which means that the covariances between estimated coefficients will be zero.
Figure 20: A 95% confidence ellipse for the coefficients of height and weight in the heart catheter example.
- The easiest solution is to simply drop regressors from the model. In the heart catheter example, height and weight are highly correlated. Therefore, if we fit a model using only height as a regressor, then we do not gain much additional information by adding weight to the model. In fact, if we fit a simple linear regression using only height as a regressor, the coefficient of height is highly significant (p-value < 0.0001) and the R² = 0.7971 is just slightly less than if we had used both height and weight as regressors. (Fitting a regression using only weight as a regressor yields an R² = 0.8181.) When there are several regressors, the problem becomes one of trying to decide which subset of regressors works best and which regressors to throw out. Oftentimes, there may be several different subsets of regressors that work reasonably well.
- Collect more data, with the goal of spreading out the values of the regressors so they do not form the picket fence type pattern seen in Figure 19. Collecting more data may not solve the problem in situations where the experimenter has no control over the relationships between the regressors, as in the heart catheter example.
If any structure is apparent in the residual plot, then the model needs to be reconsidered. Perhaps quadratic terms and interaction terms such as x1x2 need to be incorporated into the model. A useful diagnostic tool is to also plot residuals versus each of the regressor variables to see if any structure shows up in these plots as well.
As indicated by the Anscombe data set (Figure 12), the fitted regression surface can be highly influenced by just a few observations. In the Anscombe data set, it was easy to see the offending point, but in a multiple regression setting, it can be difficult to determine which, if any, points may be strongly affecting the fit of the regression. There are several diagnostic tools to help with this problem, and advanced textbooks on regression analysis discuss these tools.
First, some terminology: it is important to note that up to this point in this chapter, the regression models such as (13) and (14) are both examples of what are called linear models, because the response y is a linear function of the parameters (i.e., the β's), even though (14) has quadratic terms (such as xi1²). This terminology may be a bit confusing, since a model with quadratic terms is still called a linear model. Other examples of linear models are

yi = β0 + β1 log(xi1) + εi

and

yi = β0 + β1 (xi1 / xi2²) + εi,

because once again, these equations are linear in terms of the βj's. In each of these cases, we can find the least squares estimators of the βj's by simply applying formula (16). That is, there exists a closed-form solution for the least-squares estimators.
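For instance, the log model above is fit by placing log(xi1) in the design matrix and applying (16); a sketch on made-up data that roughly follow y = 2 + 3 log x:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])
y = np.array([2.1, 4.0, 6.2, 8.1, 10.4, 12.3])  # made-up responses

X = np.column_stack([np.ones(len(x)), np.log(x)])  # linear in b0 and b1
b0, b1 = np.linalg.solve(X.T @ X, X.T @ y)
print(b0, b1)  # close to 2 and 3 for this toy data
```

The model is nonlinear in x but linear in the parameters, so ordinary least squares still applies with the transformed regressor.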
A nonlinear regression model is one where the response is related to the model parameters in a nonlinear fashion. These types of models are very common in engineering applications. A couple of nonlinear regression model examples are

yi = β1 e^(β2 xi1) + εi

and

yi = (β0 + β1 xi1) / (β2 xi2 + β3 xi3) + εi.

Neither of these two models is linear in the parameters, and thus they are not linear models. To find the least squares estimators in such cases, one can either try to find a closed-form solution, which typically does not exist, or use an iterative numerical procedure.
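As one illustration of an iterative algorithm, the exponential model above can be fit with the nonlinear least squares routine scipy.optimize.curve_fit (the data and starting values here are made up; starting values matter in nonlinear fitting):

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, b1, b2):
    return b1 * np.exp(b2 * x)  # nonlinear in the parameter b2

x = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5])
y = np.array([2.1, 1.6, 1.2, 0.95, 0.72, 0.55])  # roughly 2*exp(-0.55x)

(b1, b2), cov = curve_fit(model, x, y, p0=(1.0, -0.1))
print(b1, b2)  # near 2 and -0.55 for this toy data
```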
All of the regression models introduced up to this point impose a specific functional relationship between the response and the predictor(s) (e.g. straight line, quadratic, exponential, etc.). In many applications, the investigator may want to let the data determine the shape of the functional response. In this case, one can use a nonparametric regression approach. The idea behind some of these approaches is to predict the response at a point x by taking a local average of the responses in a neighborhood of values around x.
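A minimal sketch of one such local-average (kernel) estimator, using Gaussian weights chosen here purely for illustration:

```python
import numpy as np

def local_average(x0, x, y, bandwidth):
    """Estimate the response at x0 by a weighted average of nearby y's."""
    w = np.exp(-0.5 * ((x - x0) / bandwidth) ** 2)  # Gaussian weights
    return np.sum(w * y) / np.sum(w)

x = np.linspace(0.0, 10.0, 50)
y = np.sin(x)  # noiseless toy response
print(local_average(5.0, x, y, bandwidth=0.5))  # tracks sin near x0 = 5
```

The bandwidth controls the size of the neighborhood: a small bandwidth follows the data closely, while a large one produces a smoother but more biased curve.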
Another popular approach is to use what are called spline models, where pieces of polynomials (usually cubic polynomials) are joined together to form a smooth curve.
When the response is binary, a popular model is the logistic regression model, which models the probability of success as a function of x:

p(x) = e^(β0 + β1 x) / (1 + e^(β0 + β1 x)),    (21)

which usually produces an S-shaped regression curve. As with nonlinear regression, iterative algorithms are needed to estimate the logistic regression parameters.
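Because (21) is nonlinear in β0 and β1, the estimates are computed iteratively; the sketch below uses Newton-Raphson (one common choice, which statistical packages automate) on made-up binary data:

```python
import numpy as np

# made-up binary responses: failures at low x, successes at high x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0])
X = np.column_stack([np.ones(len(x)), x])

beta = np.zeros(2)
for _ in range(25):                       # Newton-Raphson iterations
    p = 1.0 / (1.0 + np.exp(-X @ beta))   # current fitted probabilities
    W = p * (1.0 - p)                     # iteration weights
    grad = X.T @ (y - p)                  # score vector
    hess = X.T @ (X * W[:, None])         # observed information matrix
    beta = beta + np.linalg.solve(hess, grad)

print(beta)  # beta[1] > 0: probability of success increases with x
```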
Problems
Analysis of Variance
Source DF SS MS F P
Regression 3 1890.41 630.14 59.90 0.000
Error 17 178.83 10.52
Total 20 2069.24
Compute the coefficient of determination R².
e) Compute the 3 t-test statistics for testing whether the regression coefficients β1, β2, and β3 differ significantly from zero. Which coefficients differ significantly from zero (use α = 0.05 to make your decision for each test)?
f) Another regression model was fit to the data by dropping the acid concentration regressor (x3). The ANOVA table from fitting a regression using only x1 and x2 is given here:

Source       DF   SS        MS       F       P
Regression        1880.44   940.22   89.64   0.000
Error
Total             2069.24

Fill in the missing values for the degrees of freedom, the sum of squares for the residuals, and the MSE. What is the coefficient of determination R² for this reduced model? How does it compare to the R² from the full model using all three regressor variables?
a) Write down the design matrix X for fitting the model yi = β0 + β1 xi1 + β2 xi2 + εi.
b) Can you express the third column of X as a linear combination of the first two columns?
c) What goes wrong when we fit the above model? Running this model using the statistical software SAS gives the following message: "Model is not full rank. Least-squares solutions for the parameters are not unique." What does this mean?
d) Find the least-squares line for regressing y on x1 alone and regressing y on x2 alone. How are the slope estimates related in these two models?
References
Anscombe, F. J. (1973). Graphs in statistical analysis. The American Statistician 27:17-21.
Said, A., Hassan, E., El-Awad, A., El-Salaam, K., and El-Wahab, M. (1994). Structural changes and surface properties of CoxFe3−xO4 spinels. Journal of Chemical Technology and Biotechnology 60:161-170.