Regression analysis is probably the most used tool in statistics. Regression deals with
modeling how one variable (called a response) is related to one or more other variables
(called predictors or regressors). Before introducing regression models involving two
or more variables, we first return to the very simple model introduced in Chapter 1
to set up the basic ideas and notation.
1 A Simple Model
Consider once again the fill-weights in the cup-a-soup example. For sake of illustration,
consider the first 10 observations from the data set:
236.93, 237.39, 239.67, 236.39, 239.67, 237.26, 239.27, 237.46, 239.08, 237.83.
Note that although the filling machine is set to fill each cup to a specified weight, the
actual weights vary from cup to cup. Let y_1, y_2, . . . , y_n denote the fill-weights for our
sample (i.e. y_1 = 236.93, y_2 = 237.39, etc., and n = 10). The model we introduced in
Chapter 1 that incorporates the variability is

y_i = μ + ε_i    (1)

where ε_i is a random error representing the deviation of the ith fill-weight from the
average fill-weight of all cups (μ). Equation (1) is a very simple example of a statistical
model. It involves a random component (ε_i) and a deterministic component (μ). The
population mean μ is a parameter of the model and the other parameter in (1) is the
variance of the random error, which we shall denote by σ² (sigma-squared).
Let us now consider the problem of estimating the population mean μ in (1). The
technique we will use for (1) is called least-squares and it is easy to generalize to more
complicated regression models. A natural and intuitive way of estimating the true
value of the population mean is to simply take the average of the measurements:

ȳ = (1/n) Σ_{i=1}^n y_i.

Why should we use ȳ to estimate μ? There are many reasons why ȳ is a good estimator
of μ, but the reason we shall focus on is that ȳ is the best estimator of μ in terms of
having the smallest mean squared error. That is, given the 10 measurements above,
we can ask: which value of μ makes the sum of squared deviations

Σ_{i=1}^n (y_i − μ)²    (2)
Chapter 5: Regression Models 119
the smallest? That is, what is the least-squares estimator of μ? The answer to this
question can be found by doing some simple calculus. Consider the following function
of μ:

f(μ) = Σ_{i=1}^n (y_i − μ)².

From calculus, we know that to find the extrema of a function, we can take the
derivative of the function, set it equal to zero, and solve for the argument of the
function. Thus,

(d/dμ) f(μ) = −2 Σ_{i=1}^n (y_i − μ) = 0,

which gives μ̂ = ȳ.
(One can check that the 2nd derivative of this function is positive so that setting the
first derivative to zero determines a value of μ that minimizes the sum of squares.)
The hat notation (i.e. μ̂) is used to denote an estimator of a parameter. This is
a standard notational practice in statistics. Thus, we use μ̂ = ȳ to estimate the
unknown population mean μ. Note that μ̂ is not the true value of μ but simply an
estimator based on 10 data points.
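As a quick numerical check (an illustrative Python/NumPy sketch, not part of the original text), the least-squares estimate of μ for the ten fill-weights is simply their average:

```python
import numpy as np

# The ten cup-a-soup fill-weights from the sample
y = np.array([236.93, 237.39, 239.67, 236.39, 239.67,
              237.26, 239.27, 237.46, 239.08, 237.83])

# The least-squares estimate of mu is the sample mean
mu_hat = y.mean()

# Nudging mu in either direction can only increase f(mu) = sum((y - mu)^2),
# confirming the sample mean minimizes the sum of squared deviations
f = lambda mu: np.sum((y - mu) ** 2)
assert f(mu_hat) <= f(mu_hat + 0.01) and f(mu_hat) <= f(mu_hat - 0.01)

print(round(mu_hat, 3))  # sample mean of the fill-weights
```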
Now we shall re-do the computation using matrix notation. This will seem unnecessarily
complicated, but once we have a solution worked out, we can re-apply it to
many other much more complicated models very easily. Data usually comes to us in
the form of arrays of numbers, typically in computer files. Therefore, a natural and
easy way to handle data (particularly large sets of data) is to use the power of matrix
computations. Take the fill-weight measurements y_1, y_2, . . . , y_n and stack them into
a vector and denote this vector by a boldfaced y:
y = (y_1, y_2, . . . , y_10)ᵀ = (236.93, 237.39, 239.67, 236.39, 239.67, 237.26, 239.27, 237.46, 239.08, 237.83)ᵀ.
Now let X denote a column vector of ones and ε denote the error terms ε_i stacked
into a vector:

X = (1, 1, . . . , 1)ᵀ  and  ε = (ε_1, ε_2, . . . , ε_n)ᵀ.
Then we can re-write our very simple model (1) in matrix/vector form as

y = Xμ + ε,

that is, (y_1, . . . , y_n)ᵀ = μ(1, . . . , 1)ᵀ + (ε_1, . . . , ε_n)ᵀ.
The least-squares criterion is to minimize the sum of squared errors

(y − Xμ)ᵀ(y − Xμ) = yᵀy − 2μXᵀy + μ²XᵀX.

Taking the derivative of this with respect to μ and setting the derivative equal to zero
gives

−2Xᵀy + 2μXᵀX = 0.

Solving for μ gives

μ̂ = (XᵀX)⁻¹Xᵀy.    (4)
The solution given by equation (4) is the least squares solution and this formula
holds for a wide variety of models as we shall see.
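For the simple model (1), where X is a single column of ones, formula (4) reduces to the sample mean. A short numerical check (an illustrative NumPy sketch, not from the original text):

```python
import numpy as np

y = np.array([236.93, 237.39, 239.67, 236.39, 239.67,
              237.26, 239.27, 237.46, 239.08, 237.83])
X = np.ones((len(y), 1))   # design matrix: a single column of ones

# Least-squares solution (4): mu_hat = (X'X)^{-1} X'y
mu_hat = np.linalg.solve(X.T @ X, X.T @ y)[0]

# When X is a column of ones, (X'X)^{-1} X'y is exactly the sample mean
assert np.isclose(mu_hat, y.mean())
```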
It seems reasonable that the miles per gallon of a car is related to the weight of the
car. Our goal is to model the relationship between these two variables. A scatterplot
of the data is shown in Figure 1.
As can be seen from the figure, there appears to be a linear relationship between the
MPG (y) and the weight of the car (x). Heavier cars tend to have lower gas mileage.
A deterministic model for this data is given by

y_i = β_0 + β_1 x_i

where y_i is MPG for the ith car and x_i is the corresponding weight of the car. The
two parameters are β_0, which is the y-intercept, and β_1, which is the slope of the line.
However, this model is inadequate because it forces all the points to lie exactly on
a line. From Figure 1, we clearly see that the points do follow a linear pattern,
but the points do not all fall exactly on a line. Thus, a better model will include
a random component for the error which allows for points to scatter about the line.
The following model is called a simple linear regression model:

y_i = β_0 + β_1 x_i + ε_i.    (5)

The least-squares fit is found by determining the values of β_0 and β_1 that minimize
the sum of squared errors:

Σ_{i=1}^n (y_i − β_0 − β_1 x_i)².
Graphically, this corresponds to finding the line minimizing the sum of squared vertical
differences between the observed MPGs and the corresponding values on the line
as shown in Figure 2.
Returning to our matrix and vector notation, we can write

y = (27.5, 27.2, 34.1, 35.1, 31.8, 22.0)ᵀ

and

X = [1 2.560; 1 2.300; 1 1.975; 1 1.915; 1 2.020; 1 2.815]

(writing the 6×2 design matrix row by row, separated by semicolons), so that the
simple linear regression model can be written as

y = Xβ + ε.    (6)
The least-squares estimator β̂ makes the residual vector r orthogonal to the columns
of X (the normal equations):

Xᵀr = Xᵀ(y − ŷ) = Xᵀ(y − Xβ̂) = 0.

Therefore,

(XᵀX)⁻¹ = [7.9651 −3.4443; −3.4443 1.5212].
Figure 3: The geometry of least-squares. The vector y is projected onto the space
spanned by the columns of the design matrix X, denoted by X1 and X2 in the figure.
The projected value is the vector of fitted values ŷ (denoted by yhat in the figure).
The difference between y and ŷ is the vector of residuals r.
Also,

Xᵀy = [1 1 1 1 1 1; 2.560 2.300 1.975 1.915 2.020 2.815] y = (177.70, 393.69)ᵀ.
So, the least squares estimators of the intercept and slope are:

β̂ = (XᵀX)⁻¹Xᵀy = [7.9651 −3.4443; −3.4443 1.5212] (177.70, 393.69)ᵀ = (59.4180, −13.1622)ᵀ.

From this computation, we find that the least squares estimator of the y-intercept is
β̂_0 = 59.4180, the estimated slope is β̂_1 = −13.1622, and the prediction equation
is given by

ŷ = 59.418 − 13.1622x.
Note that the estimated y-intercept of β̂_0 = 59.418 does not have any meaningful
interpretation in this example. The y-intercept corresponds to the average y value
when x = 0, i.e. the MPG for a car that weighs zero pounds. It makes no sense to
estimate the mileage of a car with a weight of zero. Typically in regression examples
the intercept will not be meaningful unless data is collected for values of x near
zero. Since there is no such thing as cars weighing zero pounds, the intercept has no
meaningful interpretation in this example.
The slope β_1 is generally the parameter of primary interest in a simple linear regression.
The slope represents the average change in the response for a unit change in
the regressor. In the car example, the estimated slope of β̂_1 = −13.1622 indicates
that for each additional thousand pounds of weight of a car we would expect to see
a reduction of about 13 miles per gallon on average.
Multiplying out the matrices in (8) we get the following formulas for the least squares
estimates in simple linear regression:

β̂_0 = ȳ − β̂_1 x̄

β̂_1 = SS_xy / SS_xx

where

SS_xy = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ)

and

SS_xx = Σ_{i=1}^n (x_i − x̄)².
That is, the estimator of the slope is the sample covariance between the x's and y's
divided by the sample variance of the x's. In multiple regression, when there is more
than one regressor variable, the formulas for the least squares estimators become
extremely complicated unless you stick with the matrix notation.
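We can confirm numerically that the scalar formulas agree with the matrix solution. The following sketch (illustrative NumPy, using the six-car data above) computes the slope and intercept both ways:

```python
import numpy as np

wt = np.array([2.560, 2.300, 1.975, 1.915, 2.020, 2.815])  # weight (1000s of lbs)
mpg = np.array([27.5, 27.2, 34.1, 35.1, 31.8, 22.0])

# Scalar formulas for the least-squares slope and intercept
SSxy = np.sum((wt - wt.mean()) * (mpg - mpg.mean()))
SSxx = np.sum((wt - wt.mean()) ** 2)
b1 = SSxy / SSxx
b0 = mpg.mean() - b1 * wt.mean()

# Matrix solution (X'X)^{-1} X'y for comparison
X = np.column_stack([np.ones_like(wt), wt])
beta = np.linalg.solve(X.T @ X, X.T @ mpg)

assert np.allclose([b0, b1], beta)   # the two routes give the same estimates
print(round(b0, 4), round(b1, 4))    # about 59.418 and -13.1622
```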
The matrix notation also allows us to compute quite easily the standard errors of the
least squares estimators as well as the covariance between the estimators. First, let
us show that the least squares estimators are unbiased for the corresponding model
parameters. Before doing so, note that in a designed experiment, the values of the
regressor are typically fixed by the experimenter and therefore are not considered
random. On the other hand, because y_i = β_0 + β_1 x_i + ε_i and ε_i is a random variable,
y_i is also a random variable. Computing, we get

E[β̂] = E[(XᵀX)⁻¹Xᵀy] = (XᵀX)⁻¹XᵀE[y] = (XᵀX)⁻¹XᵀXβ = β

since E[ε] = 0. Therefore, the least squares estimators are unbiased for the population
parameters β.
Many statistical software packages have built-in functions that will perform regression
analysis. We can also use matrix computations directly to reproduce the output
generated above for the car mileage example.
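The following sketch carries out the matrix calculations for the car mileage example (an illustrative Python/NumPy version; the original text used Matlab for this computation):

```python
import numpy as np

wt = np.array([2.560, 2.300, 1.975, 1.915, 2.020, 2.815])
mpg = np.array([27.5, 27.2, 34.1, 35.1, 31.8, 22.0])
n = len(mpg)

X = np.column_stack([np.ones(n), wt])   # design matrix with intercept column
XtX_inv = np.linalg.inv(X.T @ X)        # (X'X)^{-1}
beta = XtX_inv @ X.T @ mpg              # least-squares estimates (intercept, slope)

fitted = X @ beta                       # fitted values y-hat
resid = mpg - fitted                    # residuals r
mse = resid @ resid / (n - 2)           # MS_res, unbiased estimate of sigma^2
cov_beta = mse * XtX_inv                # estimated covariance matrix of beta-hat
se_slope = np.sqrt(cov_beta[1, 1])      # standard error of the slope

print(beta)       # about [59.418, -13.162]
print(mse)        # about 2.346
print(se_slope)   # about 1.889
```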
Cov(β̂) = E[(β̂ − β)(β̂ − β)ᵀ]
= E[((XᵀX)⁻¹Xᵀy − (XᵀX)⁻¹XᵀE[y])((XᵀX)⁻¹Xᵀy − (XᵀX)⁻¹XᵀE[y])ᵀ]
= E[((XᵀX)⁻¹Xᵀε)((XᵀX)⁻¹Xᵀε)ᵀ]
= (XᵀX)⁻¹XᵀE[εεᵀ]X(XᵀX)⁻¹
= (XᵀX)⁻¹Xᵀ(σ²I)X(XᵀX)⁻¹    (where I is the identity matrix)
= σ²(XᵀX)⁻¹.
The main point of this derivation is that the covariance matrix of the least-squares
estimators is

σ²(XᵀX)⁻¹    (9)

where σ² is the variance of the error term in the simple linear regression model.
Formula (9) holds for a wide variety of regression models, including polynomial regression,
analysis of variance, and analysis of covariance. The only assumption needed
for (9) to hold is that the errors are uncorrelated and all have the same variance.
Formula (9) indicates that we need an estimate for the last remaining parameter
of the simple linear regression model (5), and that is the error variance σ². Since
ε_i = y_i − β_0 − β_1 x_i and the ith residual is r_i = y_i − β̂_0 − β̂_1 x_i, a natural estimate of
the error variance is

σ̂² = MS_res = SS_res / (n − 2)
where

SS_res = Σ_{i=1}^n r_i²

is the Sum of Squares for the Residuals and MS_res stands for the Mean Squared
Residual (or mean squared error (MSE)). We divide by n − 2 in the mean squared
residual so as to make it an unbiased estimator of σ²: E[MS_res] = σ². We lose two
degrees of freedom for estimating the slope β_1 and the intercept β_0. Therefore, the
degrees of freedom associated with the mean squared residual is n − 2.
Returning to the car example, we have

y = (27.5, 27.2, 34.1, 35.1, 31.8, 22.0)ᵀ,
ŷ = (25.7229, 29.1450, 33.4227, 34.2125, 32.8304, 22.3665)ᵀ,
r = (1.7771, −1.9450, 0.6773, 0.8875, −1.0304, −0.3665)ᵀ
where r is the vector of residuals (note that the residuals sum to zero, analogously
with E[ε_i] = 0). Computing, we get MS_res = 2.346 and the estimated covariance
matrix for β̂ is

σ̂²(XᵀX)⁻¹ = 2.3460 [7.9651 −3.4443; −3.4443 1.5212] = [18.6859 −8.0802; −8.0802 3.5687].
The numbers in the diagonal of the covariance matrix give the estimated variances of
β̂_0 and β̂_1. Therefore, the slope of the regression line is estimated to be β̂_1 = −13.1622
with estimated variance σ̂²_{β̂_1} = 3.5687. Taking the square-root of this variance gives
the estimated standard error of the slope

σ̂_{β̂_1} = √3.5687 = 1.8891

which will be used for making inferential statements about the slope.
Note that the estimated covariance between the estimated intercept and the estimated
slope is −8.0802. Does it seem intuitive that the estimated slope and intercept will
be negatively correlated when the regressor values (the x_i's) are all positive?
To test the null hypothesis H_0: β_1 = β_10, we standardize the difference between the
estimated slope and its hypothesized value,

t = (β̂_1 − β_10) / σ̂_{β̂_1},

and reject H_0 when this standardized difference is large (away from the null hypothesis).
Assuming the error terms ε_i are independent with a normal distribution, this
test statistic has a t-distribution on n − 2 degrees of freedom when the null hypothesis
is true. If we are performing a test using a significance level α, then we would reject
H_0 at significance level α if

t > t_α when H_a: β_1 > β_10,
t < −t_α when H_a: β_1 < β_10,
t > t_{α/2} or t < −t_{α/2} when H_a: β_1 ≠ β_10.
A common hypothesis of interest is whether the slope differs significantly from zero. If the
slope β_1 is zero, then the response does not depend on the regressor. The test statistic
in this case reduces to t = β̂_1 / σ̂_{β̂_1}.
Car Example continued ... We can test if the mileage of a car is related (linearly)
to the weight of the car. In other words, we want to test H_0: β_1 = 0 versus
H_a: β_1 ≠ 0. Let us test this hypothesis using significance level α = 0.05. Since there
are n = 6 observations, we will reject H_0 if the test statistic is larger in absolute
value than t_{α/2} = t_{.05/2} = t_{.025} = 2.7764, which can be found in the t-table under n − 2 =
6 − 2 = 4 degrees of freedom. Recall that β̂_1 = −13.1622 with estimated standard
error σ̂_{β̂_1} = √3.5687 = 1.8891. Computing, we find that

t = β̂_1 / σ̂_{β̂_1} = −13.1622 / 1.8891 = −6.9674.

Since |t| = |−6.9674| = 6.9674 > t_{α/2} = 2.7764, we reject H_0 and conclude that the
slope differs from zero using a significance level α = 0.05. In other words, the MPG
of a car depends on the weight of the car.
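The arithmetic of this test can be checked directly. A short Python sketch (illustrative; the slope estimate, its variance, and the critical value 2.7764 are the quantities quoted above):

```python
import math

b1 = -13.1622              # estimated slope from the car example
se_b1 = math.sqrt(3.5687)  # estimated standard error of the slope
t_crit = 2.7764            # t_{.025} critical value on 4 degrees of freedom

t = b1 / se_b1             # test statistic for H0: beta_1 = 0
reject = abs(t) > t_crit   # two-sided rejection rule at alpha = 0.05

print(round(t, 4), reject)  # about -6.9674, True
```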
We can also compute a p-value for this test as

p-value = 2P(T > |t|)    (2-tailed p-value)

where T represents a t random variable on n − 2 degrees of freedom and t represents the
observed value of the test statistic. The factor 2 is needed because this is a two-sided
test: we reject H_0 for large values of t in either the positive or negative direction. The
computed p-value in this example (using degrees of freedom equal to 4) is 2P(T >
6.9674) = 2(0.0011) = 0.0022. Thus, we have very strong evidence that the slope
differs from zero.
Hypothesis tests can be performed for the intercept β_0 as well, but this is not as
common. The test statistic for testing H_0: β_0 = β_00 is

t = (β̂_0 − β_00) / σ̂_{β̂_0},
Extension (mm) Force (N)
3.072 181.063
3.154 185.313
3.238 191.375
3.322 196.609
3.403 201.406
3.487 207.594
3.569 212.984
3.652 217.641
3.737 223.609
3.816 228.203
3.902 234.422
Figure 4 shows a scatterplot of the raw data. The relation appears to be linear.
Figure 5 shows the raw data again in the left panel along with the fitted regression
Figure 4: Scatterplot of Force (in Newtons) versus extension (in mm.) for an external
fixator used to hold a broken bone in place.
line ŷ = −17.703 + 64.531x. The points in the plot are tightly clustered about the
regression line indicating that almost all the variability in y is accounted for by the
regression relation (see the discussion of R² below).
A residual plot is shown in the right panel of Figure 5. The residuals should not
exhibit any structure, and a plot of residuals is useful for assessing whether the specified
model is adequate for the data. The slope is estimated to be β̂_1 = 64.531 and the
estimated standard error of the slope is found to be σ̂_{β̂_1} = 0.465. A (1 − α)100%
confidence interval for the slope is given by

Confidence Interval for the Slope: β̂_1 ± t_{α/2} σ̂_{β̂_1},

where the degrees of freedom for the t-critical value is given by n − 2. The estimated
standard error of the slope can be found as before by taking the square root of the second
diagonal element of the covariance matrix σ̂²(XᵀX)⁻¹. For the fixator experiment,
let us compute a 95% confidence interval for the stiffness (β_1). The sample size is
n = 11 and the critical value is t_{α/2} = t_{.05/2} = t_{.025} = 2.26216 for n − 2 = 11 − 2 = 9
degrees of freedom. The 95% confidence interval for the stiffness is

64.531 ± 2.26216(0.465)

which gives an interval of [63.479, 65.583]. With 95% confidence we estimate that the
stiffness of the external fixator lies between 63.479 and 65.583 Newtons/mm.
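The interval arithmetic can be sketched in a few lines (illustrative Python; the slope estimate, standard error, and t critical value are the ones quoted above):

```python
b1 = 64.531        # estimated stiffness (slope), Newtons/mm
se_b1 = 0.465      # estimated standard error of the slope
t_crit = 2.26216   # t_{.025} on n - 2 = 9 degrees of freedom

margin = t_crit * se_b1
ci = (b1 - margin, b1 + margin)   # 95% confidence interval for the slope

print(round(ci[0], 3), round(ci[1], 3))  # about 63.479 and 65.583
```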
Problems
1. Box, Hunter, & Hunter (1978) report on an experiment looking at how y, the
dispersion of an aerosol (measured as the reciprocal of the number of particles
Figure 5: Left Panel shows the scatterplot of the fixator data along with the least-squares
regression line. The right panel shows a plot of the residuals versus the
fitted values ŷ_i to evaluate the fit of the model.
per unit volume) depends on x, the age of the aerosol (in minutes). The data
are given in the following table:
y x
6.16 8
9.88 22
14.35 35
24.06 40
30.34 57
32.17 73
42.18 78
43.23 87
48.76 98
Fit a simple linear regression model to this data by performing the following
steps:
a) Write out the design matrix X for this data and the vector y of responses.
b) Compute XᵀX.
c) Compute (XᵀX)⁻¹.
d) Compute the least squares estimates of the y-intercept and slope: β̂ = (XᵀX)⁻¹Xᵀy.
e) Plot the data along with the fitted regression line.
f) Compute the mean squared error from the least-squares regression line:
σ̂² = MSE = (y − ŷ)ᵀ(y − ŷ)/(n − 2).
g) Compute the estimated covariance matrix for the estimated regression coefficients: σ̂²(XᵀX)⁻¹.
h) Does the age of the aerosol affect the dispersion of the aerosol? Perform
a hypothesis test using significance level α = 0.05 to answer this question.
Set up the null and alternative hypotheses in terms of the parameter of
interest, determine the critical region, compute the test statistic, and state
your decision. In plain English, write out the conclusion of the test.
i) Find a 95% confidence interval for the slope of the regression line.
2. Consider the crystal growth data in the notes. In this example, x = time the
crystal grew and y = weight of the crystal (in grams). It seems reasonable that
at time zero, the crystal would weigh zero grams since it has not started growing
yet. In fact, the estimated regression line has a y-intercept near zero. Find the
least squares estimator of β_1 in the no-intercept model y_i = β_1 x_i + ε_i in two
different ways:

a) Find the value of β_1 that minimizes

Σ_{i=1}^n (y_i − β_1 x_i)².

Note: Solve this algebraically without using the data from the actual experiment.
b) Write out the design matrix for the no-intercept model and compute b_1 =
(XᵀX)⁻¹XᵀY. Does this give the same solution as part (a)?
Example. An experiment was conducted to study how the weight (in grams) of a
crystal varies according to how long (in hours) the crystal grows (Graybill and Iyer,
1994). The data are given in the following table:
Weight Hours
0.08 2
1.12 4
4.43 6
4.98 8
4.92 10
7.18 12
5.57 14
8.40 16
8.81 18
10.81 20
11.16 22
10.12 24
13.12 26
15.04 28
Clearly as the crystal grows, the weight increases. We can use the slope of the
estimated least squares regression line as an estimate of the linear growth rate. A
direct computation shows

XᵀX = [14 210; 210 4060]

and

β̂ = (0.0014, 0.5034)ᵀ.
The raw data along with the fitted regression line are shown in Figure 6. From the
estimated slope, we can state that the crystals grow at a rate of 0.5034 grams per hour.
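The crystal-growth fit can be reproduced from the table (an illustrative NumPy sketch using the fourteen weight/hours pairs above):

```python
import numpy as np

hours = np.arange(2, 29, 2)   # 2, 4, ..., 28 hours
weight = np.array([0.08, 1.12, 4.43, 4.98, 4.92, 7.18, 5.57,
                   8.40, 8.81, 10.81, 11.16, 10.12, 13.12, 15.04])

X = np.column_stack([np.ones(len(hours)), hours])
beta = np.linalg.solve(X.T @ X, X.T @ weight)   # (X'X)^{-1} X'y

# The intercept is near zero; the slope is the growth rate in grams/hour
print(np.round(beta, 4))  # about [0.0014, 0.5034]
```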
We now turn to the question of using the estimated regression model to estimate the
mean response at a given value of x or predict a new value of y for a given value of
x. Note that estimating a mean response and predicting a new response are different
goals.
Suppose we want to estimate the mean weight of a crystal that has grown for x = 15
hours. The question is: what is the average weight of all crystals that have grown for
x = 15 hours? Note this is a hypothetical population. If we were to set up a production
process where we grow crystals for 15 hours, what would be the average weight of the
resulting crystals? In order to estimate the mean response at x = 15 hours, we use
ŷ = β̂_0 + β̂_1 x, plugging in x = 15.
On the other hand, if we want to predict the weight of a single crystal that has grown
for x = 15 hours, we would also use ŷ = β̂_0 + β̂_1 x with x = 15, just as we did
for estimating a mean response. Note that although estimating a mean response and
predicting a new response are two different goals, we use ŷ in each case. The difference
statistically between estimating a mean response and predicting a new response lies
in the uncertainty associated with each. A confidence interval for a mean response
will be narrower than a prediction interval for a new response. The reason why is that
a mean response for a given x value is a fixed quantity: it is an expected value of the
response for a given x value, known as a conditional mean. A 95% prediction interval
for a new response must be wide enough to contain 95% of the future responses at a
given x value. The confidence interval for a mean response only needs to contain the
mean of all responses for a given x with 95% confidence. The following two formulas
give the confidence interval for a mean response and a prediction interval for a new
response at a given value x_0 for the predictor:
ŷ ± t_{α/2} √( MS_res (1, x_0)(XᵀX)⁻¹(1, x_0)ᵀ )    Confidence Interval for Mean Response (10)

and

ŷ ± t_{α/2} √( MS_res (1 + (1, x_0)(XᵀX)⁻¹(1, x_0)ᵀ) )    Prediction Interval for New Response (11)
where the t-critical value t_{α/2} is based on n − 2 degrees of freedom. Note that in both
formulas,

(1, x_0)(XᵀX)⁻¹(1, x_0)ᵀ

corresponds to a 1×2 vector (1, x_0) times a 2×2 matrix (XᵀX)⁻¹ times the 2×1
transpose of (1, x_0). The prediction interval is wider than the confidence interval due
to the added 1 underneath the radical in the prediction interval. Formulas (10)
and (11) generalize easily to the multiple regression setting when there is more than
one predictor variable.
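To see the difference between the two intervals numerically, the following sketch computes both at x_0 = 15 hours for the crystal data (illustrative NumPy; the t critical value 2.1788 on 12 degrees of freedom is taken from a standard t-table, not from the text):

```python
import numpy as np

hours = np.arange(2, 29, 2).astype(float)
weight = np.array([0.08, 1.12, 4.43, 4.98, 4.92, 7.18, 5.57,
                   8.40, 8.81, 10.81, 11.16, 10.12, 13.12, 15.04])
n = len(weight)

X = np.column_stack([np.ones(n), hours])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ weight
resid = weight - X @ beta
ms_res = resid @ resid / (n - 2)      # MS_res for the crystal fit

x0 = np.array([1.0, 15.0])            # the vector (1, x0) with x0 = 15 hours
y0 = x0 @ beta                        # point estimate y-hat at x0
h = x0 @ XtX_inv @ x0                 # the term (1, x0)(X'X)^{-1}(1, x0)'
t_crit = 2.1788                       # t_{.025} on n - 2 = 12 degrees of freedom

ci_half = t_crit * np.sqrt(ms_res * h)        # half-width of interval (10)
pi_half = t_crit * np.sqrt(ms_res * (1 + h))  # half-width of interval (11)

assert pi_half > ci_half   # the prediction interval is always wider
```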
The confidence interval for the mean response can be rewritten after multiplying out
the terms to get

ŷ ± t_{α/2} √( MS_res (1/n + (x_0 − x̄)²/SS_xx) ).
From this formula, one can see that the confidence interval for a mean response (and
also the prediction interval) is narrowest when x_0 = x̄. Figure 7 shows both the
confidence intervals for mean responses and prediction intervals for new responses at
each x value. The lower and upper ends of these intervals plotted for all x values
form an upper and lower band shown in Figure 7. The solid curve corresponds to
a confidence band and is narrower than the prediction band, which is plotted by
the dashed curve. Both bands are narrowest at the point (x̄, ȳ) (the least squares
regression line always passes through the point (x̄, ȳ)). Note that in this example, all
of the actual weight measurements (the y_i's) lie inside the 95% prediction bands as
seen in Figure 7.
A note of caution is in order when using regression models for prediction. Using
an estimated model to extrapolate outside the range where data was collected to fit
the model is very dangerous. Often a straight line is a reasonable model relating a
response y to a predictor (or regressor) x over a short interval of x values. However,
over a broader range of x values, the response may be markedly nonlinear, and the
straight-line fit over the small interval, when extrapolated over a larger interval, can
give very poor or even downright nonsensical predictions. It is not unusual, for instance,
that as the regressor variable gets larger (or smaller), the response may level off
and approach an asymptote. One such example is illustrated in Figure 8 showing a
scatterplot of the winning times in the Boston marathon for men (open circles) and
women (solid circles) each year. Also plotted are the least squares regression lines fitted
to the data for men and women. If we were to extrapolate into the future using the
straight line fit, then we would eventually predict that the fastest female runner would
beat the fastest male runner. Not only that, the predicted times in the future for both
men and women would eventually become negative, which is clearly impossible. It may
be that the female champion will eventually beat the male champion at some point in
the future, but we cannot use these models to predict this because these models were
fit using data from the past. We do not know for sure what sort of model is applicable
Figure 7: Crystal growth data with the estimated regression line along with the
95% confidence band for estimated mean responses (solid curves) and 95% prediction
band for predicted responses (dashed curves).
for future winning times. In fact, the straight line models plotted in Figure 8 are not
even valid for the data shown. For instance, the data for the women show a rapid
improvement in winning times over the first several years women were allowed to run
the race, but then the winning times flatten out, indicating that a threshold is being
reached for the fastest possible time the race can be run. This horizontal asymptote
effect is evident for both males and females.
Problems
y x
215 19
629 38
1034 57
1475 76
1925 95
Figure 8: Winning times in the Boston Marathon versus year for men (open circles)
and women (solid circles). Also plotted are the least-squares regression lines for the
men and women champions.
7 Coefficient of Determination R²
The quantity SS_res is a measure of the variability in the response y after factoring
out the dependence on the regressor x. A measure of total variability in the response
measured without regard to x is

SS_yy = Σ_{i=1}^n (y_i − ȳ)².

A useful statistic for measuring the proportion of variability in the y's accounted for
by the regressor x is the coefficient of determination R², sometimes known simply
as the R-squared:

R² = 1 − SS_res / SS_yy.    (12)
In the car mileage example, SS_yy = 123.27 and SS_res = 9.38:

R² = 1 − 9.38/123.27 = 0.924.
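This R² computation can be verified directly from the data (an illustrative NumPy sketch using the six observations from the mileage example):

```python
import numpy as np

wt = np.array([2.560, 2.300, 1.975, 1.915, 2.020, 2.815])
mpg = np.array([27.5, 27.2, 34.1, 35.1, 31.8, 22.0])

X = np.column_stack([np.ones_like(wt), wt])
beta = np.linalg.solve(X.T @ X, X.T @ mpg)
resid = mpg - X @ beta

ss_res = resid @ resid                    # residual sum of squares, about 9.38
ss_yy = np.sum((mpg - mpg.mean()) ** 2)   # total sum of squares, about 123.27
r_squared = 1 - ss_res / ss_yy            # formula (12)

print(round(r_squared, 3))  # about 0.924
```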
In the fixator example, the points are more tightly clustered about the regression line
and the corresponding coefficient of determination is R² = 0.9995, which is higher
than for the car mileage example (compare the plots in Figure 1 with Figure 4).
By definition, R² is always between zero and one:

0 ≤ R² ≤ 1.

If R² is close to one, then most of the variability in y is explained by the regression
model.
R² is often reported when summarizing a regression model. R² can also be computed
in multiple regression (when there is more than one regressor variable) using the same
formula above. Many times a high R² is considered an indication that one has a
good model, since most of the variability in the response is explained by the regressor
variables. In fact, some experimenters use R² to compare various models. However,
this can be problematic. R² always increases (or at least does not decrease) when
you add regressors to a model. Thus, choosing a model based on the largest R² can
lead to models with too many regressors.
Another note of caution regarding R² is that a large value of R² does not necessarily
mean that the fitted model is correct. It is not unusual to obtain a large R² when
there is a fairly strong non-linear trend in the data.
In simple linear regression, the coefficient of determination R² turns out to be the
square of the sample correlation r.
8 Residual Analysis
The regression models considered so far are simple linear regression models, where it
is assumed that the mean response y is a linear function of the regressor x. This is
a very simple model and appears to work quite well in many examples. Even if the
actual relation of y on x is non-linear, fitting a straight line model may provide a good
approximation if we restrict the range of x to a small interval. In practice, one should
not assume that a simple linear model will be sufficient for fitting data (except in
special cases where there is a theoretical justification for a straight line model). Part
of the problem in regression analysis is to determine an appropriate model relating
the response y to the predictor x.
Recall that the simple linear regression model is y_i = β_0 + β_1 x_i + ε_i where ε_i is a
mean zero random error. After fitting the model, the residuals r_i = y_i − ŷ_i mimic
the random error. A useful diagnostic to assess how well a model fits the data is to
plot the residuals versus the fitted values (ŷ_i). Such plots should show no structure.
If there is evidence of structure in the residual plot, then it is likely that the fitted
regression model is not the correct model. In such cases, a more complicated model
may need to be fitted to the data, such as a polynomial model (see below) or a
nonlinear regression model (not covered here).
It is customary to plot the residuals versus the fitted values instead of residuals versus
the actual y_i values. The reason is that the residuals are uncorrelated with the fitted
Figure 9: Left Panel: Scatterplot of the full fixator data set and fitted regression line.
Right Panel: The corresponding residual plot.
values. Recall from the geometric derivation of the least squares estimators that the
vector of residuals is orthogonal to the vector of fitted values (see Figure 3).
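This orthogonality can be checked numerically; the following sketch uses the car data, but any least-squares fit would behave the same way (illustrative NumPy, not from the original text):

```python
import numpy as np

wt = np.array([2.560, 2.300, 1.975, 1.915, 2.020, 2.815])
mpg = np.array([27.5, 27.2, 34.1, 35.1, 31.8, 22.0])

X = np.column_stack([np.ones_like(wt), wt])
beta = np.linalg.solve(X.T @ X, X.T @ mpg)
fitted = X @ beta
resid = mpg - fitted

# X'r = 0: the residuals are orthogonal to every column of X,
# hence to the fitted values (which lie in the column space of X)
assert np.allclose(X.T @ resid, 0)
assert np.isclose(fitted @ resid, 0)
```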
A word of caution is needed here. Humans are very adept at picking out patterns.
Sometimes, a scatterplot of randomly generated variates (i.e. noise) will show what
appears to be a pattern. However, if the plot was generated by just random noise, then
the patterns are superficial. The same problem can occur when examining a residual
plot. One must be careful of finding structure in a residual plot when there really is
no structure. Analyzing residual plots is an art that improves with lots of practice.
Example (Fixator example continued). When the external fixator example was
introduced earlier, only a subset of the full data set was used to estimate the stiffness
of the fixator. Figure 9 shows (in the left panel) a scatterplot of the full data set,
including values near zero recorded when the machine was first turned on. Also plotted is the
least squares regression line. From this picture, it appears as if a straight line model
would fit the data well. However, the right panel shows the corresponding residual
plot, which reveals a fairly strong structure indicating that a straight line does not fit
the full data set well.
Example. Fuel efficiency data was obtained on 32 automobiles from the 1973-74
models by Motor Trends US Magazine. The response of interest is miles per gallon
(mpg) of the automobiles. Figure 10 shows a scatterplot of mpg versus horsepower.

Figure 10: Scatterplot of Motor Trends car data: Miles per gallon (mpg) versus Gross
Horsepower for 32 different brands of cars.

Figure 10 shows that increasing horsepower corresponds to lower fuel efficiency. A
simple linear regression model was fit to the data and the fitted line is shown in the
left panel of Figure 11. The coefficient of determination for this fit is R² = 0.6024. A
closer look at the data indicates a slight non-linear trend. The right panel of Figure 11
shows a residual plot versus fitted values. The residual plot indicates that there may
be a problem with the straight line fit: the residuals to the left and right are positive
and the residuals in the middle are mostly negative. This U-shaped pattern is
indicative of a poor fit. To solve the problem, a different type of model needs to be
considered, or perhaps a transformation of one or both variables may work.
Figure 11: Left Panel: Scatterplot of MPG versus horsepower along with the fitted
regression line. Right panel: Residual plot versus the fitted values ŷ_i.
x_1 = Cobalt content
x_2 = Temperature.
The data from this experiment are given in the following table:
Figure 12: Anscombe simple linear regression data. Four very different data sets
yielding exactly the same least squares regression line.
Cobalt Content   Temperature   Surface Area
0.6 200 90.6
0.6 250 82.7
0.6 400 58.7
0.6 500 43.2
0.6 600 25
1.0 200 127.1
1.0 250 112.3
1.0 400 19.6
1.0 500 17.8
1.0 600 9.1
2.6 200 53.1
2.6 250 52
2.6 400 43.4
2.6 500 42.4
2.6 600 31.6
2.8 200 40.9
2.8 250 37.9
2.8 400 27.5
2.8 500 27.3
2.8 600 19
A general model relating the surface area y to cobalt content x_1 and temperature x_2
is

y_i = f(x_i1, x_i2) + ε_i

where ε_i is a random error and f is some unknown function; x_i1 is the cobalt content
for the ith unit in the data and x_i2 is the corresponding ith temperature measurement,
for i = 1, . . . , n. We can try approximating f by a first order Taylor series
approximation to get the following multiple regression model:

y_i = β_0 + β_1 x_i1 + β_2 x_i2 + ε_i.    (13)

If (13) is not adequate to model the response y, then we could try a higher order
Taylor series approximation such as:

y_i = β_0 + β_1 x_i1 + β_2 x_i2 + β_11 x_i1² + β_22 x_i2² + β_12 x_i1 x_i2 + ε_i.    (14)
The work required for finding the least squares estimators of the coefficients and the variances and covariances of these estimated parameters has already been done. The form of the least-squares solution from the simple linear regression model holds for the multiple regression model. This is where the matrix approach to the problem really pays off, because working out the details without using matrix algebra is very tedious. Consider a multiple linear regression model with k regressors, x1, ..., xk:

yi = β0 + β1 xi1 + β2 xi2 + ··· + βk xik + εi.    (15)

The least squares estimators are given by

β̂ = (β̂0, β̂1, ..., β̂k)' = (X'X)⁻¹ X'y    (16)

just as in equation (8), where

        | 1  x11  x12  ...  x1k |
        | 1  x21  x22  ...  x2k |
X =     | .   .    .         .  |    (17)
        | 1  xn1  xn2  ...  xnk |
Note that the design matrix X has a column of ones as its first column for the intercept term β0, just like in simple linear regression. The covariance matrix of the estimated coefficients is given by

σ²(X'X)⁻¹

just as in equation (9), where σ² is the error variance, which is again estimated by the mean squared residual

MSres = σ̂² = Σ_{i=1}^n (yi − ŷi)² / (n − k − 1).

Note that the degrees of freedom associated with the mean squared residual is n − k − 1, since we lose k + 1 degrees of freedom estimating the intercept and the k coefficients for the k regressors.
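The estimator (16) and the mean squared residual can be computed in a few lines. Below is a minimal sketch in Python with NumPy (for illustration only; the function name and interface are ours, and in practice a statistics package would be used):

```python
import numpy as np

def fit_multiple_regression(regressors, y):
    """Least squares fit of y = b0 + b1*x1 + ... + bk*xk.

    regressors: (n, k) array whose columns are x1, ..., xk.
    Returns (beta_hat, ms_res, cov_beta).
    """
    n, k = regressors.shape
    X = np.column_stack([np.ones(n), regressors])  # design matrix (17)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # equation (16)
    resid = y - X @ beta_hat
    ms_res = resid @ resid / (n - k - 1)           # mean squared residual
    cov_beta = ms_res * np.linalg.inv(X.T @ X)     # estimates sigma^2 (X'X)^(-1)
    return beta_hat, ms_res, cov_beta
```

Using np.linalg.solve for the estimate avoids explicitly inverting X'X, which is numerically preferable; the explicit inverse is formed only for the covariance matrix.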
When the data is fit to a multiple regression model using height and weight as regressors, how do we interpret the resulting coefficients? The coefficient for height tells us how much longer the catheter needs to be for each additional inch of height of the child, provided the weight of the child stays constant. But the taller the child, the heavier the child tends to be. Figure 13 shows a scatterplot of weight versus height for the n = 12 children from this experiment. The plot shows a very strong linear relationship between height and weight. The correlation between height and weight is r = 0.9611. This large correlation complicates the interpretation of the regression coefficients. The problem of correlated regressor variables is known as collinearity (or multicollinearity), and it is an important problem that one needs to be aware of when fitting multiple regression models. We return to this example in the collinearity section below.
Fortunately, in designed experiments where the engineer has complete control over the regressor variables, data may be collected in such a fashion that the estimated regression coefficients are uncorrelated. In such situations, the estimated coefficients can then be easily interpreted. To make this happen, one needs to choose the values of the regressors so that the off-diagonal terms in the (X'X)⁻¹ matrix (from (9)) that correspond to covariances between estimated coefficients of the regressors are all zero.
Cobalt Example Continued. Let us return to model (13) and estimate the parameters of the model. Using the data in the cobalt example, we can construct the design matrix X and compute

β̂ = (X'X)⁻¹ X'y

      |   20      35      7800   |⁻¹ |  961.2  |
    = |   35      79.8    13650  |   | 1471.8  |
      |  7800    13650  3490000  |   | 309415  |

      |  124.8788 |
    = |  −11.3369 |.
      |   −0.1461 |
The matrix computations are tedious, but software packages like Matlab can perform these for us. Nonetheless, it is a good idea to understand exactly what the computer software packages are computing for us when we feed data into them. The mean squared residual for this data is σ̂² = MSres = 445.12939, and the estimated covariance matrix of the estimated coefficients is

                | 246.8702   −41.9933   −0.3875  |
σ̂²(X'X)⁻¹ =    | −41.9933    23.9962    0.0000  |.
                |  −0.3875     0.0000    0.00099 |

Note that this was a well-designed experiment because the estimated coefficients for cobalt content (β̂1) and temperature (β̂2) are uncorrelated: the covariance between them is zero, as can be seen in the covariance matrix.
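As a sketch of what such software computes, the estimates and their covariance matrix can be reproduced directly from the data table with NumPy (illustrative only; Matlab or SAS would normally be used):

```python
import numpy as np

# cobalt content, temperature, and surface area from the data table
cobalt = np.repeat([0.6, 1.0, 2.6, 2.8], 5)
temp = np.tile([200.0, 250.0, 400.0, 500.0, 600.0], 4)
area = np.array([90.6, 82.7, 58.7, 43.2, 25.0,
                 127.1, 112.3, 19.6, 17.8, 9.1,
                 53.1, 52.0, 43.4, 42.4, 31.6,
                 40.9, 37.9, 27.5, 27.3, 19.0])

X = np.column_stack([np.ones(20), cobalt, temp])
beta_hat = np.linalg.solve(X.T @ X, X.T @ area)
print(beta_hat)  # approximately (124.879, -11.337, -0.146)

resid = area - X @ beta_hat
ms_res = resid @ resid / (20 - 2 - 1)
cov_beta = ms_res * np.linalg.inv(X.T @ X)
# cov_beta[1, 2] is zero: the slope estimates are uncorrelated by design
```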
H0: β1 = β2 = ··· = βk = 0
versus
Ha: βj ≠ 0 for at least one j.
The basic idea behind the testing procedure is to partition all the variability in the response into two pieces: variability due to the regression relation and variability due to the random error term. This is why the procedure is called Analysis of Variance (ANOVA). The total variability in the response is represented by the total sum of squares (SSyy):

SSyy = Σ_{i=1}^n (yi − ȳ)².

The regression sum of squares,

SSreg = Σ_{i=1}^n (ŷi − ȳ)²,

represents the variability in the yi's explained by the multiple regression model, and the residual sum of squares,

SSres = Σ_{i=1}^n (yi − ŷi)²,

represents the leftover variability due to random error. Note that for an individual measurement,

yi − ȳ = (ŷi − ȳ) + (yi − ŷi).

If we square both sides of this equation and sum over all n observations, we will get SSyy on the left-hand side. On the right-hand side (after doing some algebra) we will get SSreg + SSres only, because the cross product terms sum to zero. This gives us the well-known variance decomposition formula:

SSyy = SSreg + SSres.    (18)
If all the βj's are zero, then the regression model will explain very little of the variability in the response, in which case SSreg will be small and SSres will be large, relatively speaking. The ANOVA test then simply compares these components of variance with each other. However, in order to make the sums of squares comparable, we need to first divide each sum of squares by its respective degrees of freedom. The degrees of freedom for the regression sum of squares is simply the number of β's in the model (the intercept plus k regressors) minus one:

dfreg = k.

The degrees of freedom for the residual sum of squares is

dfres = n − k − 1.

Dividing each sum of squares by its degrees of freedom gives the mean squares MSreg = SSreg/k and MSres = SSres/(n − k − 1). The ANOVA test statistic is the ratio of the mean squares, which is denoted by F:

F = MSreg / MSres.    (19)
Assuming that the errors in model (15) are independent and follow a normal distribution, the test statistic F follows an F-distribution on k numerator degrees of freedom and n − k − 1 denominator degrees of freedom. If the null hypothesis is true (i.e. all βj's equal zero), then F takes a value close to one on average. However, if the null hypothesis is false, then the regression mean square tends to be larger than the error mean square, and F will be larger than one on average. Therefore, the F-test rejects the null hypothesis for large values of F. How large does F have to be? If the null hypothesis is true, then the F test statistic follows an F distribution. Therefore, we use critical values from the F-distribution to determine the critical region for the F-test, which can be looked up in an F-table (see pages 202-204 in the Appendix).
For a simple linear regression model yi = β0 + β1 xi + εi, we used a t-test to test H0: β1 = 0 versus Ha: β1 ≠ 0. One can also perform an ANOVA F-test of this hypothesis and, in fact, the two tests are equivalent. One can show with a little algebra that the square of the t-test statistic is equal to the ANOVA F-test statistic in a simple linear regression, i.e. t² = (β̂1/se(β̂1))² = F.
Analysis of Variance

                              Sum of       Mean
Source             DF         Squares      Square       F Value   Pr > F
Model               2         11947        5973.5         13.42   0.0003
Error              17         7567.19968    445.12939
Corrected Total    19         19514
In this ANOVA table, the Corrected Total sum of squares is SSyy. Note that adding the Model and Error sums of squares gives the corrected total sum of squares: 11947 + 7567.19968 = 19514 (rounded to the nearest integer), showing the variance decomposition given in (18). Likewise, the model and error degrees of freedom add up to equal the corrected total degrees of freedom: 2 + 17 = 19 = n − 1. Finally, the last column of the ANOVA table gives the p-value of the test, p = 0.0003. Because the p-value is less than α = 0.05, we reject the null hypothesis. In fact, the p-value is very small, indicating very strong evidence that at least one of the regression coefficients differs from zero.
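The entries of the ANOVA table follow directly from the sums of squares; here is a quick sketch (SciPy is used only to evaluate the F-distribution tail probability):

```python
from scipy.stats import f as f_dist

ss_reg, df_reg = 11947.0, 2        # Model row
ss_res, df_res = 7567.19968, 17    # Error row

ms_reg = ss_reg / df_reg           # regression mean square
ms_res = ss_res / df_res           # residual mean square
F = ms_reg / ms_res                # test statistic (19)
p_value = f_dist.sf(F, df_reg, df_res)
print(F, p_value)  # F is about 13.4, p is about 0.0003
```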
Once we have determined that at least one of the coefficients is non-zero, t-tests are typically conducted for the individual coefficients in the model:

H0: βj = 0 versus Ha: βj ≠ 0

for j = 1, ..., k. The test statistic for each of these hypotheses is a t-test statistic of the form

t = β̂j / se(β̂j),

where the estimated standard errors se(β̂j) of the estimated regression coefficients are found by taking the square roots of the diagonal elements of the σ̂²(X'X)⁻¹ covariance matrix. The t-statistics follow a t-distribution on n − k − 1 degrees of freedom (i.e. the residual degrees of freedom) when the null hypothesis is true. The following table generated by SAS summarizes the tests for individual coefficients:
Parameter Estimates

                        Parameter     Standard
Variable         DF     Estimate      Error        t Value   Pr > |t|
Intercept         1     124.8788      15.7121        7.95     <.0001
Cobalt            1     -11.3369       4.8986       -2.31     0.0335
Temperature       1      -0.1461       0.0315       -4.64     0.0002
This table gives the results of two-tailed tests of whether or not the regression coefficients are zero. The last column gives the p-values, all of which are less than α = 0.05, indicating that both the cobalt content and the temperature are significant in modeling the surface area of the hydroxide catalyst. Because the experiment was designed so that the estimated slope parameters for cobalt and temperature are uncorrelated, the interpretation of the coefficients is made easier. Note that both coefficients are negative, indicating that increasing cobalt content and temperature results in lower surface area on average. The intercept of the model does not have a clear interpretation in this example because data was not collected at values of cobalt and temperature near zero.
It is possible to test more general hypotheses for individual regression coefficients, such as H0: βj = βj0 for some hypothesized value βj0 of the jth regression coefficient. The t-test statistic then becomes

t = (β̂j − βj0) / se(β̂j),

and a 100(1 − α)% confidence interval for an individual coefficient βj is given by

β̂j ± t_{α/2} se(β̂j).

However, individual confidence regions can be misleading when the estimated coefficients are correlated. A better method is to determine a joint confidence region.
11 Confidence Ellipsoids
Figure 14: A 95% confidence ellipse for the regression coefficients β1 and β2 for cobalt content and temperature.
The following formula (e.g. see Johnson and Wichern, 1998, page 285) defines a confidence ellipsoid in the regression setting: a 100(1 − α)% confidence region for β is given by the set of β in R^(k+1) that satisfy

(β̂ − β)' (X'X) (β̂ − β) ≤ (k + 1) MSres F_{k+1, n−k−1, α}.

The confidence region described by this inequality is an elliptically shaped region (see Chapter 4). In the previous example, if we want a confidence region for only (β1, β2)', the coefficients of cobalt content and temperature, let

      | 0  0 |
A =   | 1  0 |
      | 0  1 |

so that A'(X'X)⁻¹A is the 2 × 2 block of (X'X)⁻¹ corresponding to β1 and β2. The confidence region is then given by the set of (β1, β2)' in R² satisfying the following inequality:

(β̂1 − β1, β̂2 − β2) [A'(X'X)⁻¹A]⁻¹ (β̂1 − β1, β̂2 − β2)' ≤ 2 MSres F_{2, n−k−1, α}.
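The quadratic form on the left-hand side is easy to evaluate numerically. The sketch below plugs in the X'X matrix and mean squared residual from the cobalt example to check whether a candidate point (b1, b2) falls inside the 95% confidence ellipse:

```python
import numpy as np
from scipy.stats import f as f_dist

XtX = np.array([[20.0, 35.0, 7800.0],
                [35.0, 79.8, 13650.0],
                [7800.0, 13650.0, 3490000.0]])
beta_hat = np.array([-11.3369, -0.1461])  # cobalt and temperature slopes
ms_res = 445.13                           # mean squared residual, n = 20, k = 2
A = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])                # picks out beta1 and beta2
V = A.T @ np.linalg.inv(XtX) @ A          # 2 x 2 slope block of (X'X)^(-1)

def in_ellipse(b1, b2, alpha=0.05):
    d = beta_hat - np.array([b1, b2])
    lhs = d @ np.linalg.solve(V, d)
    return lhs <= 2 * ms_res * f_dist.ppf(1 - alpha, 2, 17)

print(in_ellipse(-11.3369, -0.1461))  # True: the center is inside
print(in_ellipse(0.0, 0.0))           # False: consistent with the significant F-test
```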
The major and minor axes of the confidence ellipse in Figure 14 are parallel to the β1 and β2 axes because the estimates of these parameters are uncorrelated due to the design of the experiment. The estimators of the regression coefficients will not always be uncorrelated, particularly with observational data. The next example illustrates an experimental data set where the regression coefficients are correlated.
Hald Cement Data Example. An experiment was conducted to measure the heat from cement based on differing levels of the ingredients that make up the cement. This is the well-known Hald cement data (Hald, 1932), which has been used to illustrate concepts from multiple regression. The data are in the following table:
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
The F-test statistic equals 85.07 and the associated p-value is less than 0.0001, indicating that there is very strong statistical evidence that at least one of the coefficients for x1 and x2 is non-zero. The R² from this fitted model is R² = 0.9445, indicating that most of the variability in the response is explained by the regression relationship with x1 and x2. The estimated parameters, standard errors, and t-tests for equality to zero are tabulated below:
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Both parameter estimates differ significantly from zero, as indicated by the p-values < 0.0001 for x1 and x2. The estimated coefficient for x1 (amount of tricalcium aluminate) is β̂1 = −1.53258 and the estimated coefficient for x2 (amount of tricalcium silicate) is β̂2 = 2.16559, which seems to indicate that the amount of heat produced decreases with increasing levels of tricalcium aluminate and increases with increasing levels of tricalcium silicate. However, x1 and x2 are correlated with one another (sample correlation r = 0.7307) and thus we cannot easily interpret the coefficients due to this correlation. The estimated covariance matrix of β̂ is

                | 101.8276   −1.2014    1.9098 |
σ̂²(X'X)⁻¹ =    |  −1.2014    0.0148   −0.0276 |
                |   1.9098   −0.0276    0.0964 |
indicating that β̂1 and β̂2 are negatively correlated with one another as well (since the covariance between them is −0.0276). Figure 15 shows a 95% confidence ellipse for the coefficients β1 and β2. Also shown is a rectangular confidence region obtained by taking the Cartesian product of individual confidence intervals for β1 and β2 separately. The fact that the ellipse tilts downward from left to right indicates that the estimated coefficients are negatively correlated. In addition, points that lie in the rectangular confidence region but not inside the ellipse are not really plausible values for the coefficients, because they do not lie inside the elliptical confidence region. To illustrate this point, the marked point in the plot lies inside the confidence rectangle and would seem to be a plausible value for (β1, β2)'. However, since this point lies outside the confidence ellipse, it is not a plausible value with 95% confidence.
12 Polynomial Regression.
A special case of multiple regression is polynomial regression. If a scatterplot of y versus x indicates a nonlinear relationship, then fitting a straight line is not appropriate. For example, the response y may be a quadratic function of x. If the functional relationship between x and y is unknown, then a Taylor series approximation may provide a reasonable way of modeling the data, which would entail fitting a polynomial to the data. The polynomial regression model can be expressed as

yi = β0 + β1 xi + β2 xi² + ··· + βk xi^k + εi.    (20)

To see that (20) is a special case of the multiple regression model, simply define the multiple regressor variables to be x1 = x, x2 = x², ..., xk = x^k. Although (20) is a polynomial in x, it is still linear in the parameters βj.
Figure 15: A 95% confidence ellipse for the regression coefficients β1 and β2 for the Hald cement data.
Temp Acid
100 1.93
125 2.22
150 2.85
175 2.69
200 3.01
225 3.82
250 3.91
275 3.65
300 3.71
325 3.40
350 3.71
375 2.57
400 2.71
A scatterplot of sulfuric acid amount versus temperature is shown in Figure 16. Clearly the relationship between acid and temperature is nonlinear. One of the goals of the study is to determine the optimal temperature that will result in the highest yield of sulfuric acid. Figure 16 suggests that a quadratic model may fit the data well: yi = β0 + β1 xi + β2 xi² + εi. The design matrix for the quadratic model is
    | 1   x1   x1² |     | 1   100    10000 |
    | 1   x2   x2² |     | 1   125    15625 |
X = | .    .    .  |  =  | 1   150    22500 |
    | 1   xn   xn² |     | 1   175    30625 |
                         | 1   200    40000 |
                         | 1   225    50625 |
                         | 1   250    62500 |
                         | 1   275    75625 |
                         | 1   300    90000 |
                         | 1   325   105625 |
                         | 1   350   122500 |
                         | 1   375   140625 |
                         | 1   400   160000 |
Figure 17 shows the scatterplot of the data along with the fitted quadratic curve ŷ = −1.141878 + 0.035519x − 0.000065x². In order to estimate the temperature that produces the highest yield of sulfuric acid, simply take the derivative of this quadratic function, set it equal to zero, and solve for x.
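Setting the derivative β̂1 + 2β̂2x to zero gives x* = −β̂1/(2β̂2). The calculation can be sketched with NumPy (illustrative; np.polyfit returns coefficients with the highest degree first):

```python
import numpy as np

temp = np.array([100, 125, 150, 175, 200, 225, 250,
                 275, 300, 325, 350, 375, 400], dtype=float)
acid = np.array([1.93, 2.22, 2.85, 2.69, 3.01, 3.82, 3.91,
                 3.65, 3.71, 3.40, 3.71, 2.57, 2.71])

b2, b1, b0 = np.polyfit(temp, acid, 2)  # least squares quadratic fit
x_opt = -b1 / (2.0 * b2)                # vertex of the fitted parabola
print(x_opt)  # roughly 273 degrees for this data
```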
One of the shortcomings of polynomial regression is that the fitted models can become very unstable when fitting higher order polynomials. The estimated covariance matrix of the coefficients becomes nearly singular for high-degree fits, so the individual coefficient estimates are poorly determined.
Figure 17: Sulfuric acid y versus temperature x along with the quadratic polynomial fit ŷ = −1.141878 + 0.035519x − 0.000065x².
In order to determine the appropriate degree to use when fitting a polynomial, one can choose the model that leads to the smallest residual mean square MSres. The residual sum of squares SSres gets smaller as the degree of the polynomial gets larger. However, each time we add another term to the model, we lose a degree of freedom for estimating the residual mean square. Recall that the residual mean square is defined by dividing the residual sum of squares by n − k − 1, where k equals the degree of the polynomial (e.g. k = 2 for a quadratic model). The following table gives the R² and MSres for different polynomial fits:
Degree R2 M Sres
1 0.1877 0.3785
2 0.8368 0.0837
3 0.8511 0.0848
4 0.8632 0.0876
5 0.8683 0.0964
6 0.8687 0.1122
7 0.8884 0.1144
As seen in the table, the R² increases with the degree of the polynomial. However, for polynomials of degree 3 and higher, the increase in R² is very slight. The residual mean square is smallest for the quadratic fit, indicating that the quadratic fit is best according to this criterion.
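The table can be reproduced by fitting polynomials of increasing degree and computing MSres = SSres/(n − k − 1) for each; a sketch (np.polyfit may warn about poor numerical conditioning at the higher degrees, which is itself a symptom of the instability noted above):

```python
import numpy as np

temp = np.array([100, 125, 150, 175, 200, 225, 250,
                 275, 300, 325, 350, 375, 400], dtype=float)
acid = np.array([1.93, 2.22, 2.85, 2.69, 3.01, 3.82, 3.91,
                 3.65, 3.71, 3.40, 3.71, 2.57, 2.71])
n = len(acid)
ss_yy = np.sum((acid - acid.mean()) ** 2)  # total sum of squares

for k in range(1, 8):
    coef = np.polyfit(temp, acid, k)
    ss_res = np.sum((acid - np.polyval(coef, temp)) ** 2)
    r2 = 1.0 - ss_res / ss_yy
    ms_res = ss_res / (n - k - 1)          # penalizes lost degrees of freedom
    print(k, round(r2, 4), round(ms_res, 4))
```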
13 Collinearity
One of the most serious problems in a multiple regression setting is (multi)collinearity: the regressor variables being correlated with one another. Collinearity causes many problems in a multiple regression model. One of the main problems is that if collinearity is severe, the estimated regression coefficients are very unstable and cannot be interpreted. In the heart catheter data set described above, the height and weight of the children were highly correlated (r = 0.9611). Figure 19 shows a 3-dimensional plot of the height and weight as well as the response variable y = length. The goal of the least-squares fitting procedure is to determine the best-fitting plane through these points. However, the points lie roughly along a straight line in the height-weight plane. Consequently, fitting the regression plane is analogous to trying to build a table when all the legs of the table lie roughly in a straight line: the result is a very wobbly table. Ordinarily, tables are designed so that the legs are far apart and spread out over the surface of the table. When fitting a regression surface to highly correlated regressors, the resulting fit is very unstable. Slight changes in the values of the regressor variables can lead to dramatic differences in the estimated parameters. Consequently, the standard errors of the estimated regression coefficients tend to be inflated. In fact, it is quite common for none of the regression coefficients to differ significantly from zero when individual t-tests are computed for regression coefficients. Additionally, the regression coefficients can have the wrong sign: one may obtain a negative slope coefficient when instead a positive coefficient is expected.
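The inflation of standard errors is easy to demonstrate on synthetic data (everything below is made up for illustration): two regressors are generated with correlation near one, mimicking the height and weight situation, and the variance inflation factor 1/(1 − r²) is computed:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 12
x1 = rng.normal(size=n)
x2 = 0.97 * x1 + 0.15 * rng.normal(size=n)  # nearly collinear with x1
y = 1.0 + 2.0 * x1 + 2.0 * x2 + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x1, x2])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
ms_res = resid @ resid / (n - 3)
se = np.sqrt(ms_res * np.diag(np.linalg.inv(X.T @ X)))

r = np.corrcoef(x1, x2)[0, 1]
vif = 1.0 / (1.0 - r ** 2)  # variance inflation factor for x1 (and for x2)
print(r, vif, se)           # high r, large VIF, inflated slope standard errors
```

A VIF well above 10 is a common rule-of-thumb warning sign; dropping one of the two regressors shrinks the standard error of the remaining slope dramatically.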
Heart Catheter Example. A multiple regression was used to model the length of the catheter (y) with regressors height (x1) and weight (x2):

yi = β0 + β1 xi1 + β2 xi2 + εi.
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
The overall F-test indicates that at least one of the slope parameters β1 and/or β2 differs significantly from zero (p-value = 0.0004). Furthermore, the coefficient of determination is R² = 0.8254, indicating that height and weight explain most of the variability in the required catheter length. However, the parameter estimates, standard errors, t-test statistics, and p-values shown below indicate that neither β1 nor β2 differs significantly from zero.
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
The apparent paradox, that the overall F-test says at least one of the coefficients differs from zero whereas the individual t-tests say neither differs from zero, is due to the collinearity problem. This phenomenon is quite common when collinearity is present. A 95% confidence ellipse for the estimated regression coefficients of height and weight, shown in Figure 20, shows that the estimated regression coefficients are highly correlated, because the confidence ellipse is quite eccentric.
Another major problem when collinearity is present is that the fitted model will not be able to produce reliable predictions for values of the regressors away from the range of the regressor values. If the data change just slightly, then the predicted values outside the range of the data can change dramatically when collinearity is present; just think of the wobbly table analogy.
In a well-designed experiment, the values of the regressor variables will be orthogonal, which means that the covariances between estimated coefficients will be zero.
Figure 20: A 95% confidence ellipse for the coefficients of height and weight in the heart catheter example.
- The easiest solution is to simply drop regressors from the model. In the heart catheter example, height and weight are highly correlated. Therefore, if we fit a model using only height as a regressor, then we do not gain much additional information by adding weight to the model. In fact, if we fit a simple linear regression using only height as a regressor, the coefficient of height is highly significant (p-value < 0.0001) and the R² = 0.7971 is just slightly less than if we had used both height and weight as regressors. (Fitting a regression using only weight as a regressor yields an R² = 0.8181.) When there are several regressors, the problem becomes one of trying to decide which subset of regressors works best and which regressors to throw out. Oftentimes, there may be several different subsets of regressors that work reasonably well.
- Collect more data, with the goal of spreading out the values of the regressors so they do not form the picket fence type pattern seen in Figure 19. Collecting more data may not solve the problem in situations where the experimenter has no control over the relationships between the regressors, as in the heart catheter example.
If any structure is apparent in the residual plot, then the model needs to be reconsidered. Perhaps quadratic terms and interaction terms such as x1x2 need to be incorporated into the model. A useful diagnostic tool is to also plot residuals versus each of the regressor variables to see if any structure shows up in these plots as well.
As indicated by the Anscombe data set (Figure 12), the fitted regression surface can be highly influenced by just a few observations. In the Anscombe data set, it was easy to see the offending point, but in a multiple regression setting, it can be difficult to determine which, if any, points may be strongly affecting the fit of the regression. There are several diagnostic tools to help with this problem, and advanced textbooks on regression analysis discuss these tools.
First, some terminology: it is important to note that up to this point in this chapter, the regression models such as (13) and (14) are both examples of what are called linear models, because the response y is a linear function of the parameters (i.e., the β's), even though (14) has quadratic terms (such as xi1²). This terminology may be a bit confusing, since a model with quadratic terms is still called a linear model. Other examples of linear models are

yi = β0 + β1 log(xi1) + εi

and

yi = β0 + β1 (xi1 / xi2²) + εi,

because once again, these equations are linear in terms of the βj's. In each of these cases, we can find the least squares estimators of the βj's by simply applying formula (16). That is, there exists a closed-form solution for the least-squares estimators.
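For instance, the log model above is fit by placing log(xi1) in the design matrix and applying (16); a sketch on made-up data that roughly follow y = 2 + 3 log x:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])
y = np.array([2.1, 4.0, 6.2, 8.1, 10.4, 12.3])  # made-up responses

X = np.column_stack([np.ones(len(x)), np.log(x)])  # linear in b0 and b1
b0, b1 = np.linalg.solve(X.T @ X, X.T @ y)
print(b0, b1)  # close to 2 and 3 for this toy data
```

The model is nonlinear in x but linear in the parameters, so ordinary least squares still applies with the transformed regressor.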
A nonlinear regression model is one where the response is related to the model parameters in a nonlinear fashion. These types of models are very common in engineering applications. A couple of nonlinear regression model examples are

yi = β1 e^(β2 xi1) + εi

and

yi = (β0 + β1 xi1) / (β2 xi2 + β3 xi3) + εi.

Neither of these two models is linear in the parameters, and thus they are not linear models. To find the least squares estimators in such cases, one can either try to find a closed-form solution, which typically does not exist, or use an iterative numerical procedure.
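As one illustration of an iterative algorithm, the exponential model above can be fit with the nonlinear least squares routine scipy.optimize.curve_fit (the data and starting values here are made up; starting values matter in nonlinear fitting):

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, b1, b2):
    return b1 * np.exp(b2 * x)  # nonlinear in the parameter b2

x = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5])
y = np.array([2.1, 1.6, 1.2, 0.95, 0.72, 0.55])  # roughly 2*exp(-0.55x)

(b1, b2), cov = curve_fit(model, x, y, p0=(1.0, -0.1))
print(b1, b2)  # near 2 and -0.55 for this toy data
```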
All of the regression models introduced up to this point impose a specific functional relationship between the response and the predictor(s) (e.g. straight line, quadratic, exponential, etc.). In many applications, the investigator may want to let the data determine the shape of the functional response. In this case, one can use a nonparametric regression approach. The idea behind some of these approaches is to predict the response at a point x by taking a local average of the responses in a neighborhood of values around x.
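A minimal sketch of one such local-average (kernel) estimator, using Gaussian weights chosen here purely for illustration:

```python
import numpy as np

def local_average(x0, x, y, bandwidth):
    """Estimate the response at x0 by a weighted average of nearby y's."""
    w = np.exp(-0.5 * ((x - x0) / bandwidth) ** 2)  # Gaussian weights
    return np.sum(w * y) / np.sum(w)

x = np.linspace(0.0, 10.0, 50)
y = np.sin(x)  # noiseless toy response
print(local_average(5.0, x, y, bandwidth=0.5))  # tracks sin near x0 = 5
```

The bandwidth controls the size of the neighborhood: a small bandwidth follows the data closely, while a large one produces a smoother but more biased curve.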
Another popular approach is to use what are called spline models, where pieces of polynomials (usually cubic polynomials) are joined together to form a smooth curve.
When the response is binary, a popular model is the logistic regression model, which models the probability of success as a function of x:

p(x) = e^(β0 + β1 x) / (1 + e^(β0 + β1 x)),    (21)

which usually produces an S-shaped regression curve. As with nonlinear regression, iterative algorithms are needed to estimate the logistic regression parameters.
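Because (21) is nonlinear in β0 and β1, the estimates are computed iteratively; the sketch below uses Newton-Raphson (one common choice, which statistical packages automate) on made-up binary data:

```python
import numpy as np

# made-up binary responses: failures at low x, successes at high x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0])
X = np.column_stack([np.ones(len(x)), x])

beta = np.zeros(2)
for _ in range(25):                       # Newton-Raphson iterations
    p = 1.0 / (1.0 + np.exp(-X @ beta))   # current fitted probabilities
    W = p * (1.0 - p)                     # iteration weights
    grad = X.T @ (y - p)                  # score vector
    hess = X.T @ (X * W[:, None])         # observed information matrix
    beta = beta + np.linalg.solve(hess, grad)

print(beta)  # beta[1] > 0: probability of success increases with x
```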
Problems
Analysis of Variance
Source DF SS MS F P
Regression 3 1890.41 630.14 59.90 0.000
Error 17 178.83 10.52
Total 20 2069.24
Compute the coefficient of determination R².
e) Compute the 3 t-test statistics for testing whether the regression coefficients β1, β2, and β3 differ significantly from zero. Which coefficients differ significantly from zero (use α = 0.05 to make your decision for each test)?
f) Another regression model was fit to the data by dropping the acid concentration regressor (x3). The ANOVA table from fitting a regression using only x1 and x2 is given here:

Source       DF   SS        MS       F       P
Regression        1880.44   940.22   89.64   0.000
Error
Total             2069.24

Fill in the missing values for the degrees of freedom, the sum of squares for the residuals, and the MSE. What is the coefficient of determination R² for this reduced model? How does it compare to the R² from the full model using all three regressor variables?
a) Write down the design matrix X for fitting the model yi = β0 + β1 xi1 + β2 xi2 + εi.
b) Can you express the third column of X as a linear combination of the first two columns?
c) What goes wrong when we fit the above model? Running this model using the statistical software SAS gives the following message: "Model is not full rank. Least-squares solutions for the parameters are not unique." What does this mean?
d) Find the least-squares line for regressing y on x1 alone and regressing y on x2 alone. How are the slope estimates related in these two models?
References
Anscombe, F. J. (1973). Graphs in statistical analysis. The American Statistician 27:17-21.
Said, A., Hassan, E., El-Awad, A., El-Salaam, K., and El-Wahab, M. (1994). Structural changes and surface properties of CoxFe3−xO4 spinels. Journal of Chemical Technology and Biotechnology 60:161-170.