
Least Squares Regression

Fitting a Line to Bivariate Data


Linear Relationships
Avg. occupants per car
1980: 6/car
1990: 3/car
2000: 1.5/car
By the year 2010 every fourth car will have nobody in it!

Food for Thought
Kind of mathematical relationship between year and avg. no. of occupants per car?
Why might relationship break down by 2010?
Basic Terminology
Scatterplots, correlation: interested in
association between 2 variables (assign
x and y arbitrarily)
Least squares regression: does one
quantitative variable explain or cause
changes in another variable?
Basic Terminology (cont.)
Explanatory variable: explains or
causes changes in the other variable;
the x variable. (independent variable)
Response variable: the y -variable; it
responds to changes in the x - variable.
(dependent variable)
Examples
Fertilizer (x) → corn yield (y)
Advertising $ (x) → store income (y)
Drug dose (x) → blood pressure (y)
Daily temperature (x) → natural gas demand (y)
Change in min wage (x) → unemployment rate (y)
Simplest Relationship
Simplest equation that describes the
dependence of variable y on variable x

y = b0 + b1x
linear equation
graph is line with slope b1 and y-
intercept b0
Graph
[Figure: graph of the line y = b0 + b1x, with y-intercept b0 at x = 0 and slope b1 = rise/run]
Notation
Data: \( (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n) \)
Draw the line \( y = b_0 + b_1 x \) through the scatterplot; the point on the line corresponding to \( x_i \) is
\( \hat{y}_i = b_0 + b_1 x_i \); \( \hat{y}_i \) is the value of y predicted by the line \( y = b_0 + b_1 x \) when \( x = x_i \);
\( y_i \) is the observed value of y when \( x = x_i \).
Observed y, Predicted y
[Figure: FUEL CONSUMPTION vs CAR WEIGHT. The data point (2.7, 3.6) shows the observed y = 3.6 at x = 2.7; the predicted y at x = 2.7 is the height of the line there, yhat = a + bx = a + b*2.7]
Scatterplot: Fuel Consumption vs Car Weight
[Figure: scatterplot of fuel consumption (gal/100 miles) against car weight (1000 lbs). Best line?]
Scatterplot with least squares prediction line
[Figure: FUEL CONSUMPTION vs CAR WEIGHT. Fuel consumption (gal/100 miles) against weight (1000 lbs), with fitted line y = 1.639x - 0.3631 and r^2 = 0.9538]
How do we draw the line?
Residuals
The i-th residual is the vertical deviation of the i-th data point from the line:
i-th residual = observed y - predicted y
\[ y_i - \hat{y}_i = y_i - (b_0 + b_1 x_i) \]
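As a small numeric illustration of the residual definition, the sketch below uses two points from this deck's car-weight data and the fitted line y = -0.3631 + 1.639x reported on the slides. The function names `predict` and `residual` are illustrative choices, not part of the original material.

```python
# Coefficients of the fitted line reported on the slides (car weight example).
B0 = -0.3631
B1 = 1.639


def predict(x, b0=B0, b1=B1):
    """Predicted y on the line y = b0 + b1*x."""
    return b0 + b1 * x


def residual(x, y, b0=B0, b1=B1):
    """Observed y minus predicted y: the vertical deviation from the line."""
    return y - predict(x, b0, b1)


# Two observed points from the car-weight data.
below = residual(2.7, 3.6)  # point lies below the line: negative residual
above = residual(1.9, 3.1)  # point lies above the line: positive residual
print(round(below, 4), round(above, 4))
```

A point above the line gives a positive residual and a point below it gives a negative one, which is why residuals are squared before being added up.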
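Residuals: graphically
[Figure: Graphical Display of Residuals. Each residual \( e_i = Y_i - \hat{Y}_i \) is the vertical gap at \( X_i \) between the observed \( Y_i \) and the line; points above the line have positive residuals, points below have negative residuals]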
Criterion for choosing what
line to draw: method of least
squares
The method of least squares chooses
the line that makes the sum of squares
of the residuals as small as possible
This line has the slope b1 and intercept b0 that minimize
\[ \sum_{i=1}^{n} [\, y_i - (b_0 + b_1 x_i) \,]^2 \]
for the given observations \( (x_i, y_i) \)
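To make the criterion concrete, here is a minimal sketch that evaluates the sum of squared residuals for a few candidate lines. The data are the income and expenditure values from this deck's worked example; the two alternative lines are arbitrary choices made up purely for comparison.

```python
def sum_squared_residuals(xs, ys, b0, b1):
    """Sum over i of [y_i - (b0 + b1*x_i)]^2 for the line y = b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Income (x) and consumption expenditure (y), in $1,000s.
income = [1, 5, 9, 13, 17]
expenditure = [7, 6, 9, 8, 10]

# The least squares line from the worked example, plus two arbitrary rivals.
candidates = {
    "least squares (6.2, 0.2)": (6.2, 0.2),
    "shifted down (6.0, 0.2)": (6.0, 0.2),
    "steeper (5.0, 0.3)": (5.0, 0.3),
}

sse = {name: sum_squared_residuals(income, expenditure, b0, b1)
       for name, (b0, b1) in candidates.items()}
for name, value in sse.items():
    print(f"{name}: {value:.2f}")
```

Trying a handful of lines only illustrates the idea; the least squares formulas give the minimizing slope and intercept directly, with no search.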


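Least Squares Line y = b0 + b1x: Slope b1 and Intercept b0
Data: \( (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n) \)
slope: \( b_1 = r \dfrac{s_y}{s_x} \)
y-intercept: \( b_0 = \bar{y} - b_1 \bar{x} \)
where
\( s_x = \sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} \) is the standard deviation of \( x_1, x_2, \ldots, x_n \)
\( s_y = \sqrt{\dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}} \) is the standard deviation of \( y_1, y_2, \ldots, y_n \)
\( r = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y} \) is the correlation between x and y
\( SSE = \sum_{i=1}^{n} y_i^2 - b_0 \sum_{i=1}^{n} y_i - b_1 \sum_{i=1}^{n} x_i y_i \)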
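The formulas for the slope, intercept, and correlation translate almost line by line into code. The following is a minimal sketch in plain Python; the function name `least_squares_line` is a hypothetical choice, and the data passed in are the income and expenditure values from this deck's worked example.

```python
from math import sqrt


def least_squares_line(xs, ys):
    """Return (b0, b1, r) using b1 = r*sy/sx and b0 = ybar - b1*xbar."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sx = sqrt(sxx / (n - 1))          # sample standard deviation of x
    sy = sqrt(syy / (n - 1))          # sample standard deviation of y
    r = sxy / ((n - 1) * sx * sy)     # correlation between x and y
    b1 = r * sy / sx
    b0 = ybar - b1 * xbar
    return b0, b1, r


b0, b1, r = least_squares_line([1, 5, 9, 13, 17], [7, 6, 9, 8, 10])
print(round(b0, 3), round(b1, 3), round(r, 3))
```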
Example: Income vs Consumption Expenditure
(both in $1,000's)
Income (x)   Consumption Expenditure (y)
1            7
5            6
9            9
13           8
17           10
Questions
Construct scatterplot; determine if linear
model is appropriate. If so
find the least squares prediction line
Estimate consumption expenditure in a
household with an income of (i) $6,000
(ii) $25,000. Comfortable with
estimates?
Compute the residuals
Scatterplot
[Figure: Consumption Expenditure. Scatterplot of expenditure ($1,000's) against household income ($1,000's)]
Solution
Inc. x    Exp. y    xi-xbar   (xi-xbar)^2   yi-ybar   (yi-ybar)^2   (xi-xbar)(yi-ybar)
1         7         -8        64            -1        1             8
5         6         -4        16            -2        4             8
9         9          0         0             1        1             0
13        8          4        16             0        0             0
17        10         8        64             2        4             16
col. sum  45   40    0        160            0        10            32

\( \bar{x} = \dfrac{45}{5} = 9 \); \( \bar{y} = \dfrac{40}{5} = 8 \); \( s_x = \sqrt{\dfrac{160}{4}} = \sqrt{40} = 6.325 \)
\( s_y = \sqrt{\dfrac{10}{4}} = \sqrt{2.5} = 1.581 \); \( r = \dfrac{32}{4(6.325)(1.581)} = .8 \)
Calculations
\( b_1 = r \dfrac{s_y}{s_x} = .8 \left( \dfrac{1.581}{6.325} \right) = .2 \);
\( b_0 = \bar{y} - b_1 \bar{x} = 8 - .2(9) = 8 - 1.8 = 6.2 \)
least squares prediction line:
\( \hat{y} = 6.2 + .2x \)
least squares prediction line
\( \hat{y} = b_0 + b_1 x = 6.2 + .2x \)
income $6,000, x = 6: \( \hat{y} = 6.2 + .2(6) = 7.4 \) ($7,400)
income $25,000, x = 25: \( \hat{y} = 6.2 + .2(25) = 11.2 \) ($11,200)
Least Squares Prediction Line
[Figure: Consumption Expenditure. Expenditure ($1,000's) against household income ($1,000's), with fitted line y = 6.2 + 0.2x]
Consumption Expenditure Prediction When x=$6,000
[Figure: Consumption Expenditure. Expenditure ($1,000's) against household income ($1,000's), with fitted line y = 6.2 + 0.2x; at x = 6 the predicted expenditure is 7.4]
Consumption Expenditure Prediction When x=$25,000
[Figure: Consumption Expenditure. Expenditure ($1,000's) against household income ($1,000's), with fitted line y = 6.2 + 0.2x extended to x = 25, where the predicted expenditure is 11.2]
The least squares line always goes through the point with coordinates \( (\bar{x}, \bar{y}) \)
[Figure: Least Squares Line Goes Through \( (\bar{x}, \bar{y}) \). Consumption expenditure against income, with line y = 0.2x + 6.2 passing through \( (\bar{x}, \bar{y}) = (9, 8) \)]
C. Compute the Residuals
Inc. x   ConE y   yhat = 6.2+.2x   y - yhat   (y - yhat)^2
1        7        6.4               .6         .36
5        6        7.2              -1.2        1.44
9        9        8                 1          1
13       8        8.8              -.8         .64
17       10       9.6               .4         .16
sum of residuals = 0; sum of (residuals)^2 = 3.6
Residuals
[Figure: Consumption Expenditure. Expenditure ($1,000's) against household income ($1,000's), with fitted line y = 6.2 + 0.2x and the residuals drawn as vertical gaps between the points and the line]
Income Residual Plot
[Figure: Income Residual Plot. Residuals plotted against income]
Sum of residuals, sum of (residuals)^2
Note that
* \( \sum \text{residuals} = 0 \)
  \( \sum (\text{residuals})^2 = 3.6 \)
* From formula in box on p. 7:
\( SSE = \sum y_i^2 - b_0 \sum y_i - b_1 \sum x_i y_i = 330 - 6.2(40) - .2(392) = 330 - 248 - 78.4 = 3.6 \)
Any other line drawn through the scatterplot will have
\( \sum (\text{residuals})^2 > 3.6 \)
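The shortcut formula for SSE can be checked against the direct definition. The sketch below does this for the income and expenditure data, using the least squares coefficients 6.2 and 0.2 from the worked example; the variable names are illustrative.

```python
income = [1, 5, 9, 13, 17]
expenditure = [7, 6, 9, 8, 10]
b0, b1 = 6.2, 0.2  # least squares line from the worked example

# Direct definition: sum of squared residuals.
sse_direct = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(income, expenditure))

# Shortcut: sum(y^2) - b0*sum(y) - b1*sum(x*y).
sum_y2 = sum(y * y for y in expenditure)
sum_y = sum(expenditure)
sum_xy = sum(x * y for x, y in zip(income, expenditure))
sse_shortcut = sum_y2 - b0 * sum_y - b1 * sum_xy

print(sum_y2, sum_y, sum_xy)
print(round(sse_direct, 6), round(sse_shortcut, 6))
```

The shortcut agrees with the direct sum only when b0 and b1 are the least squares values; for any other line, use the direct definition.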
Car Weight, Fuel
Consumption Example, cont.
(xi, yi): (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3)
(2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)

[Figure: FUEL CONSUMPTION vs CAR WEIGHT. Scatterplot of fuel consumption (gal/100 miles) against weight (1000 lbs)]
Wt (x)    Fuel (y)   xi-xbar   (xi-xbar)^2   yi-ybar   (yi-ybar)^2   (xi-xbar)(yi-ybar)
3.4       5.5         .5        .25           1.11      1.2321        .555
3.8       5.9         .9        .81           1.51      2.2801        1.359
4.1       6.5         1.2       1.44          2.11      4.4521        2.532
2.2       3.3        -.7        .49          -1.09      1.1881        .763
2.6       3.6        -.3        .09          -.79       .6241         .237
2.9       4.6         0         0             .21       .0441         0
2.0       2.9        -.9        .81          -1.49      2.2201        1.341
2.7       3.6        -.2        .04          -.79       .6241         .158
1.9       3.1        -1.0       1            -1.29      1.6641        1.29
3.4       4.9         .5        .25           .51       .2601         .255
col. sum  29   43.9   0         5.18          0         14.589        8.49


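Calculations
\( \bar{x} = 2.9 \); \( \bar{y} = 4.39 \); \( s_x = \sqrt{\dfrac{5.18}{9}} = .7587 \);
\( s_y = \sqrt{\dfrac{14.589}{9}} = 1.2732 \); \( r = \dfrac{8.49}{9(.7587)(1.2732)} = .9766 \)
slope \( b_1 = r \dfrac{s_y}{s_x} = .9766 \left( \dfrac{1.2732}{.7587} \right) = 1.639 \)
intercept \( b_0 = \bar{y} - b_1 \bar{x} = 4.39 - 1.639(2.9) = -.3631 \)
least squares prediction line \( \hat{y} = b_0 + b_1 x = -.3631 + 1.639x \)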
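The slope can also be read straight off the column sums of the table, because r(s_y/s_x) simplifies algebraically to the sum of cross-products divided by the sum of squared x-deviations. The sketch below checks this with the column sums from the car-weight table; the variable names are illustrative.

```python
from math import sqrt

n = 10
sxx = 5.18     # column sum of (x_i - xbar)^2
syy = 14.589   # column sum of (y_i - ybar)^2
sxy = 8.49     # column sum of (x_i - xbar)(y_i - ybar)
xbar, ybar = 2.9, 4.39

sx = sqrt(sxx / (n - 1))
sy = sqrt(syy / (n - 1))
r = sxy / ((n - 1) * sx * sy)

b1_from_r = r * sy / sx      # slope as written on the slide
b1_from_sums = sxy / sxx     # algebraically the same quantity
b0 = ybar - b1_from_sums * xbar

print(round(sx, 4), round(sy, 4), round(r, 4))
print(round(b1_from_r, 4), round(b1_from_sums, 4), round(b0, 4))
```

Working from the column sums avoids carrying rounded values of r, s_x, and s_y through the calculation.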
Scatterplot with least squares prediction line
[Figure: FUEL CONSUMPTION vs CAR WEIGHT. Fuel consumption (gal/100 miles) against weight (1000 lbs), with fitted line y = 1.639x - 0.3631 and r^2 = 0.9538]
The Least Squares Line Always goes Through \( (\bar{x}, \bar{y}) \)
[Figure: FUEL CONSUMPTION vs CAR WEIGHT. Fuel consumption (gal/100 miles) against weight (1000 lbs), with line y = 1.639x - 0.3631 passing through \( (\bar{x}, \bar{y}) = (2.9, 4.39) \)]
Using the least squares line for prediction.
Fuel consumption of 3,000 lb car? (x=3)
\( \hat{y} = -.3631 + 1.639(3) = 4.5539 \)
[Figure: Fuel Consumption vs Car Weight: Scatterplot and Least Squares Line. Line y = -0.3631 + 1.639x with the predicted point (3.0, 4.5539) marked]
Be Careful!
Fuel consumption of 500 lb car? (x = .5)
\( \hat{y} = -.3631 + 1.639(.5) = .4564 \) (219 mpg)
[Figure: FUEL CONSUMPTION vs CAR WEIGHT. Fuel consumption (gal/100 miles) against weight (1000 lbs), with fitted line y = 1.639x - 0.3631 and r^2 = 0.9538]
x = .5 is outside the range of the x-data that we used to determine the least squares line
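One simple safeguard is to check whether a new x value falls inside the range of the x-data before trusting the prediction. The sketch below does this for the car-weight data and the fitted line from the slides; the function name `predict_fuel` and the wording of the warning are illustrative assumptions.

```python
# Car weights (1000 lbs) used to fit the line on the slides.
weights = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
B0, B1 = -0.3631, 1.639  # fitted line from the slides


def predict_fuel(weight):
    """Return (predicted gal/100 miles, True if weight is inside the data range)."""
    in_range = min(weights) <= weight <= max(weights)
    return B0 + B1 * weight, in_range


for w in (3.0, 0.5):
    gal_per_100, ok = predict_fuel(w)
    mpg = 100 / gal_per_100
    note = "interpolation" if ok else "extrapolation, do not trust"
    print(f"x = {w}: {gal_per_100:.4f} gal/100 miles ({mpg:.0f} mpg), {note}")
```

The range check only flags extrapolation; it does not say how far outside the data the line stops being reasonable.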
Avoid GIGO! Evaluating the least
squares line
1. Create scatterplot. Approximately
linear?
2. Calculate r^2, the square of the
correlation coefficient
3. Examine residual plot
r^2: The Variation Accounted For
The square of the correlation coefficient r gives important information for evaluating the usefulness of the least squares line
\( -1 \le r \le 1 \) implies \( 0 \le r^2 \le 1 \)
The square of the correlation coefficient, r^2, is the fraction of the variation in y that is explained by the least squares regression of y on x.
Put another way, r^2 is the fraction of the variation in y that is explained by the variation in x.
Example: car weight, fuel
consumption
x=car weight, y=fuel consumption
r^2 = (.9766)^2 ≈ .95
About 95% of the variation in fuel
consumption (y) is explained by the
linear relationship between car weight
(x) and fuel consumption (y).
What else affects fuel consumption?
Driver, size of engine, tires, road, etc.
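The "fraction of variation explained" reading of r^2 can be checked numerically: it equals 1 minus the ratio of the leftover variation (sum of squared residuals) to the total variation in y. The sketch below computes it that way for the car-weight data from the slides; the variable names are illustrative.

```python
weights = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]

n = len(weights)
xbar = sum(weights) / n
ybar = sum(fuel) / n
sxx = sum((x - xbar) ** 2 for x in weights)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(weights, fuel))

# Least squares line for these data.
b1 = sxy / sxx
b0 = ybar - b1 * xbar

# Total variation in y, and the variation left over after fitting the line.
sst = sum((y - ybar) ** 2 for y in fuel)
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(weights, fuel))

r_squared = 1 - sse / sst
print(round(r_squared, 4))
```

The result should agree with the r^2 value reported on the fitted-line plot, up to rounding.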
Example: SAT scores
[Figure: SAT Mean per State vs % Seniors Taking Test. Mean SAT score against % of seniors taking the test, with fitted line y = -2.2375x + 1023.4 and R^2 = 0.7542]
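SAT scores: calculations
\( \bar{x} = 33.882 \), \( s_x = 24.103 \), \( \bar{y} = 947.549 \), \( s_y = 62.1 \), \( r = -.868 \)
\( b_1 = r \dfrac{s_y}{s_x} \), \( b_0 = \bar{y} - b_1 \bar{x} \)
slope \( b_1 = -.868 \left( \dfrac{62.1}{24.103} \right) = -2.23635 \)
intercept \( b_0 = 947.549 - (-2.236)(33.882) = 1023.309 \)
least squares prediction line \( \hat{y} = 1023.309 - 2.236x \)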
SAT scores: result
\( r^2 = (-.868)^2 = .7534 \)
[Figure: SAT Mean per State vs % Seniors Taking Test. Mean SAT score against % of seniors taking the test, with fitted line y = -2.2375x + 1023.4 and R^2 = 0.7542]
If 57% of NC seniors take the SAT, the predicted mean score is
\( \hat{y} = 1023.309 - 2.23635(57) = 895.84 \)
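When only summary statistics are available, as in the SAT example, the line can still be built from the means, standard deviations, and correlation. The sketch below repeats the SAT calculation at full floating-point precision; the function name `predict_mean_sat` is a hypothetical choice.

```python
# Summary statistics from the SAT slide.
xbar, sx = 33.882, 24.103   # % of seniors taking the test
ybar, sy = 947.549, 62.1    # mean SAT score
r = -0.868                  # negative: scores fall as participation rises

b1 = r * sy / sx
b0 = ybar - b1 * xbar


def predict_mean_sat(pct_taking):
    """Predicted state mean SAT score for a given % of seniors taking the test."""
    return b0 + b1 * pct_taking


print(round(b1, 5), round(b0, 3), round(predict_mean_sat(57), 2))
```

The intercept and prediction differ from the slide's figures in the second decimal place because the slide rounds the slope to -2.236 before computing the intercept.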
Residuals
residual = observed y - predicted y
= \( y - \hat{y} \)
Properties of residuals
1. The residuals always sum to 0 (therefore
the mean of the residuals is 0)
2. The least squares line always goes
through the point \( (\bar{x}, \bar{y}) \)
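Both properties are easy to confirm numerically. The sketch below checks them for the income and expenditure example, using the least squares line 6.2 + 0.2x from the slides; a small tolerance absorbs floating-point rounding.

```python
income = [1, 5, 9, 13, 17]
expenditure = [7, 6, 9, 8, 10]
b0, b1 = 6.2, 0.2  # least squares line from the worked example

residuals = [y - (b0 + b1 * x) for x, y in zip(income, expenditure)]
xbar = sum(income) / len(income)
ybar = sum(expenditure) / len(expenditure)

# Property 1: residuals sum to 0 (up to floating-point rounding).
sums_to_zero = abs(sum(residuals)) < 1e-9
print(sums_to_zero)

# Property 2: the line evaluated at xbar returns ybar.
passes_through_means = abs((b0 + b1 * xbar) - ybar) < 1e-9
print(passes_through_means)
```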
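Graphically
[Figure: the residual \( e_i = y_i - \hat{y}_i \) drawn as the vertical gap at \( x_i \) between the observed \( y_i \) and the predicted \( \hat{y}_i \) on the line]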
Residual Plot
Residuals help us determine if fitting a least
squares line to the data makes sense
When a least squares line is appropriate, it
should model the underlying relationship;
nothing interesting should be left behind
We make a scatterplot of the residuals in the
hope of finding
NOTHING!
Car Wt/ Fuel Consump: Residuals
CAR WT.   FUEL CONSUMP.   Pred FUEL CONSUMP.   Residuals
3.4       5.5             5.209498069           0.290501931
3.8       5.9             5.865096525           0.034903475
4.1       6.5             6.356795367           0.143204633
2.2       3.3             3.242702703           0.057297297
2.6       3.6             3.898301158          -0.298301158
2.9       4.6             4.39                  0.21
2         2.9             2.914903475          -0.014903475
2.7       3.6             4.062200772          -0.462200772
1.9       3.1             2.751003861           0.348996139
3.4       4.9             5.209498069          -0.309498069
Example: Car wt/fuel consump. residual plot page 13
[Figure: RESIDUALS vs WT(X). Residuals plotted against car weight]
SAT Residuals
[Figure: %TAKE Residual Plot. Residuals plotted against %TAKE]
Linear Relationship?
[Figure: Linear(?). Scatterplot of Y against X, for X from -4 to 8]
Garbage In Garbage Out
[Figure: GIGO. Scatterplot of Y against X with fitted line y = 4x + 11]
Residual Plot Clue to GIGO
[Figure: Residual Plot. Residuals plotted against the X variable]
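The kind of pattern a residual plot reveals can be imitated with made-up numbers. The sketch below fits a least squares line to hypothetical curved data, y = x^2 for x from -4 to 8; these are not the data behind the slide's plot, which are not given, and were chosen only because they cover the same x-range and bend in a similar way.

```python
# Hypothetical curved data (NOT the slide's data): y = x^2 for x = -4..8.
xs = list(range(-4, 9))
ys = [x * x for x in xs]

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n
b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
      / sum((x - xbar) ** 2 for x in xs))
b0 = ybar - b1 * xbar

residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# A good fit leaves residuals scattered with no pattern. Here the signs run
# +, then -, then +: a U shape, which signals that a line is the wrong model.
signs = "".join("+" if e > 0 else "-" for e in residuals)
print(f"fitted line: y = {b1:g}x + {b0:g}")
print(signs)
```

The residuals still sum to zero, as they always do for a least squares line, so the sum alone cannot expose the problem; the systematic run of signs does.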
[Figure: GIGO. The scatterplot with fitted line y = 4x + 11 shown together with its residual plot against the X variable]
