
Example R Session 8

CSSS 508 Spring 2005

Multiple Linear Regression in R using lm()
First take a look at the documentation on the lm() function:
> help(lm)

Create some data and take a look at it:


> x <- sample(1:10,20,replace=T)
> x
 [1]  3  7 10 10  7  5  9  7  7  7  7  2  1  5  5  4  3  8 10  5
> hist(x)
> y <- 5 + 2 * x + rnorm(20,mean=0,sd=2)
> y
 [1] 13.150517 17.799619 25.175995 25.270884 20.816128 13.530085 24.203779
 [8] 18.610446 20.566607 20.508134 16.320147 12.186057  6.623761 15.219714
[15] 16.515930 11.589159 14.301640 19.966737 21.825570 17.718439
> plot(x,y)
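
Because sample() and rnorm() generate random draws, the numbers in your session will not match the ones shown here. A minimal sketch, if you want a reproducible session (the seed value 508 is an arbitrary choice):

> set.seed(508)   # any fixed integer makes the random draws repeatable
> x <- sample(1:10,20,replace=T)
> y <- 5 + 2 * x + rnorm(20,mean=0,sd=2)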

Fit a linear regression model to the data. Note the syntax for the model: to the left of the ~ is y, the dependent variable; to the right are the independent variables: in this example there is only one independent variable, x. The intercept is always included in the model by default, or it can be explicitly specified with y ~ 1 + x. To exclude the intercept from the model (usually a bad idea), the syntax is y ~ -1 + x.
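
For instance, the following calls illustrate the formula variants just described (a sketch; the first two fit the identical model):

> lm(y ~ x)       # intercept included by default
> lm(y ~ 1 + x)   # intercept written out explicitly; same model
> lm(y ~ -1 + x)  # no intercept: the line is forced through the origin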

> result.lm <- lm(y~x)

The following demonstrates various useful functions that can be applied to the lm object (object of class lm) that is returned by the lm() function.
First is summary(), which returns the basic results of a linear regression fit.
> summary(result.lm)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8265 -1.5524  0.8373  1.4672  2.0514 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   7.0780     1.0094   7.012 1.52e-06 ***
x             1.7241     0.1523  11.317 1.29e-09 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 1.762 on 18 degrees of freedom
Multiple R-Squared: 0.8768,     Adjusted R-squared: 0.8699 
F-statistic: 128.1 on 1 and 18 DF,  p-value: 1.289e-09
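
The object returned by summary() can also be queried directly when you need one quantity rather than the whole printed table. A sketch (the available component names can be listed with names(summary(result.lm))):

> summary(result.lm)$r.squared      # Multiple R-squared
> summary(result.lm)$sigma          # residual standard error
> summary(result.lm)$coefficients   # the coefficient table as a numeric matrix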


To get an analysis of variance table, use anova():

> anova(result.lm)
Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value    Pr(>F)    
x          1 397.72  397.72  128.07 1.289e-09 ***
Residuals 18  55.90    3.11                      
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

To get the estimates of the coefficients of the model, use coef():

> coef(result.lm)
(Intercept)           x 
   7.077995    1.724094 

To get the values of the dependent variable predicted by the model, use fitted():
> fitted(result.lm)
        1         2         3         4         5         6 
12.250276 19.146652 24.318934 24.318934 19.146652 15.698464 
        7         8         9        10        11        12 
22.594840 19.146652 19.146652 19.146652 19.146652 10.526183 
       13        14        15        16        17        18 
 8.802089 15.698464 15.698464 13.974370 12.250276 20.870746 
       19        20 
24.318934 15.698464 

To get the variance/covariance matrix of the parameters (the expected variability in the coefficients over repeated sampling), use vcov():
> vcov(result.lm)
            (Intercept)           x
(Intercept)   1.0189274 -0.14158217
x            -0.1415822  0.02321019
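
Note the connection to summary(): the standard errors in the coefficient table are the square roots of the diagonal of this matrix. A quick check (a sketch):

> sqrt(diag(vcov(result.lm)))   # reproduces the Std. Error column: 1.0094, 0.1523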

Residuals are the difference between the observed and predicted values of the outcome variable.
Standardized residuals (stdres() in library(MASS)) have been scaled by the square root of their variance and have a standard deviation equal to one. It is suggested that a case with a standardized residual larger than about +/- 2.5 should be investigated as a potential outlier, since such a value would be expected to occur by chance only about 1% of the time. Cook's distance measures the change in the fitted regression coefficients if an observation were dropped from the regression. An observation with Cook's D over 1 is worth investigating. (Simonoff, J.S. (2003) Analyzing Categorical Data, Springer-Verlag New York, Inc., pp. 36-39)
> residuals(result.lm)
         1          2          3          4          5          6 
 0.9002409 -1.3470334  0.8570613  0.9519508  1.6694764 -2.1683791 
         7          8          9         10         11         12 
 1.6089396 -0.5362055  1.4199549  1.3614821 -2.8265051  1.6598746 
        13         14         15         16         17         18 
-2.1783278 -0.4787503  0.8174658 -2.3852110  2.0513639 -0.9040092 
        19         20 
-2.4933639  2.0199750 

> library(MASS)
> stdres(result.lm)
         1          2          3          4          5          6          7 
 0.5451305 -0.7867510  0.5318112  0.5906907  0.9750778 -1.2684782  0.9693375 
         8          9         10         11         12         13         14 
-0.3131773  0.8293417  0.7951900 -1.6508543  1.0374046 -1.4220285 -0.2800637 
        15         16         17         18         19         20 
 0.4782086 -1.4134003  1.2421797 -0.5339493 -1.5471459  1.1816634 

> cooks.distance(result.lm)
          1           2           3           4           5           6 
0.020612025 0.018378182 0.027675723 0.034143202 0.028229711 0.050482094 
          7           8           9          10          11          12 
0.059764914 0.002912114 0.020421844 0.018774557 0.080918046 0.114645942 
         13          14          15          16          17          18 
0.327026699 0.002460849 0.007174738 0.090360529 0.107025820 0.011888862 
         19          20 
0.234232173 0.043808552 
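
A sketch of how the two cutoffs above could be applied directly. With this particular sample neither rule flags any case, since the largest standardized residual in absolute value is about 1.65 and the largest Cook's D is about 0.33:

> which(abs(stdres(result.lm)) > 2.5)    # cases with large standardized residuals
> which(cooks.distance(result.lm) > 1)   # cases with large Cook's distance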

Anatomy of the lm object


The lm object contains some information that you can access directly once you know that it is there. To
get a list of the attributes in an lm object, use the names() function:
> names(result.lm)
 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "xlevels"       "call"          "terms"         "model"        

To access these attributes, use the $ syntax:


> result.lm$coefficients
(Intercept)           x 
   7.077995    1.724094 
> result.lm$residuals
         1          2          3          4          5          6          7 
 0.9002409 -1.3470334  0.8570613  0.9519508  1.6694764 -2.1683791  1.6089396 
         8          9         10         11         12         13         14 
-0.5362055  1.4199549  1.3614821 -2.8265051  1.6598746 -2.1783278 -0.4787503 
        15         16         17         18         19         20 
 0.8174658 -2.3852110  2.0513639 -0.9040092 -2.4933639  2.0199750 

> result.lm$fitted.values
        1         2         3         4         5         6         7         8 
12.250276 19.146652 24.318934 24.318934 19.146652 15.698464 22.594840 19.146652 
        9        10        11        12        13        14        15        16 
19.146652 19.146652 19.146652 10.526183  8.802089 15.698464 15.698464 13.974370 
       17        18        19        20 
12.250276 20.870746 24.318934 15.698464 
> result.lm$df.residual
[1] 18
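
As a check that these pieces fit together, the fitted values can be recomputed by hand from the coefficients (a sketch; as.numeric() strips the names so the comparison succeeds):

> b <- result.lm$coefficients
> all.equal(as.numeric(b[1] + b[2] * x), as.numeric(result.lm$fitted.values))
[1] TRUE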


Raw data:
> result.lm$model
y x
1 13.150517 3
2 17.799619 7
3 25.175995 10
4 25.270884 10
5 20.816128 7
6 13.530085 5
7 24.203779 9
8 18.610446 7
9 20.566607 7
10 20.508134 7
11 16.320147 7
12 12.186057 2
13 6.623761 1
14 15.219714 5
15 16.515930 5
16 11.589159 4
17 14.301640 3
18 19.966737 8
19 21.825570 10
20 17.718439 5

Plotting and the lm object


Try the plot function on the result of the linear fit:
> plot(result.lm)

To get the four plots on one page:
> par(mfrow=c(2,2))
> plot(result.lm)

The plot function is actually a generic function: it identifies what kind of object is passed to it and then dispatches to the plot method designed for that kind of object. See, for example:
> help(plot)
> help(plot.default)
> help(plot.lm)
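
You can verify the dispatch yourself (a sketch; the list returned by methods() depends on which packages are loaded):

> class(result.lm)
[1] "lm"
> methods(plot)   # lists the available plot methods, including plot.lm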

The plot function also knows how to handle the residuals:


> plot(result.lm$residuals)

The following is a demonstration of the predict() function.


First plot the data and add the fitted line in blue (note that the abline() function can be passed the lm
result object), and then add the predicted points for each x-value in red:
> plot(x,y)
> abline(result.lm,col="blue")
> points(x,result.lm$fitted.values,col="red",pch=20,cex=1.5)

To get values predicted by the model for values of the covariates (independent variables) that are not in
the original sample, use the predict() function. The predict function requires that you pass it the lm
object and a dataframe containing the values of the covariates for which you want a predicted value. So
if I want a predicted value for x=5.4, I pass data.frame(x=5.4) to put my x value into dataframe
format.
> predict(result.lm,data.frame(x=5.4))
[1] 16.38810

Now I plot this predicted point on my graph in purple.

> points(5.4,16.38810,col="purple",pch=20,cex=2)

If I want predicted values for a series (set) of x values:

> new.x <- seq(1.5,10)
> new.x
[1] 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
> pred.new.x <- predict(result.lm,data.frame(x=new.x))
> points(new.x,pred.new.x,col="green",pch=20,cex=1.75)
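
predict() can also attach interval estimates through its interval argument (see help(predict.lm) for details). A sketch:

> predict(result.lm, data.frame(x=5.4), interval="confidence")   # interval for the mean response
> predict(result.lm, data.frame(x=5.4), interval="prediction")   # wider interval for a new observation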

The following is a demonstration of a linear model with one continuous and one categorical
variable and an interaction term.
First I write a function to create some data, read the function into my workspace, and run the function,
storing its result in a dataframe called test.
cat.interact <- function()
{
    one <- runif(100,10,20)            # continuous covariate
    two <- sample(1:3,100,replace=T)   # categorical covariate with three levels
    outcome <- rep(0,100)
    for (i in 1:100)
    {
        if (two[i] == 1)
            outcome[i] <- 3 + 2 * one[i] + rnorm(1,mean=0,sd=3)   # slope 2 in group 1
        else if (two[i] == 2)
            outcome[i] <- 2 + 5 * one[i] + rnorm(1,mean=0,3)      # slope 5 in group 2
        else
            outcome[i] <- 1 - one[i] + rnorm(1,mean=0,3)          # slope -1 in group 3
    }
    return(data.frame(outcome=outcome,contin=one,categ=two))
}

> source("make.data.w8.r")
> test <- cat.interact()
> head(test)
    outcome   contin categ
1 -16.64759 18.07304     3
2 -15.76270 16.38989     3
3 -12.20210 11.66220     3
4  34.98955 14.96454     1
5  70.65761 12.61652     2
6 -10.26925 12.41431     3

Next I plot all pairwise combinations of the variables using the function pairs()
> pairs(test)


Fit a model with just the continuous variable. Note that the fit is very poor (R-Squared = 0.0448) because the model is misspecified (it leaves out an important variable). [As an exercise, add the fitted line from this model to the scatterplot of the raw data.]
> fit <- lm(outcome~contin,data=test)
> summary(fit)

Call:
lm(formula = outcome ~ contin, data = test)

Residuals:
    Min      1Q  Median      3Q     Max 
-74.085 -38.154  -1.899  37.822  53.961 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   -9.028     21.586  -0.418   0.6767  
contin         3.011      1.404   2.144   0.0345 *
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 38.46 on 98 degrees of freedom
Multiple R-Squared: 0.0448,     Adjusted R-squared: 0.03506 
F-statistic: 4.597 on 1 and 98 DF,  p-value: 0.03451

Before adding the categorical variable (categ), I need to make it a categorical variable. At this point,
it is a numeric variable. Categorical variables in R are called factors. The as.factor() function will
change a variable into a factor.
> is.numeric(test$categ)
[1] TRUE
> is.factor(test$categ)
[1] FALSE
> test$categ <- as.factor(test$categ)
> is.factor(test$categ)
[1] TRUE

This categorical variable has three levels:


> levels(test$categ)
[1] "1" "2" "3"

Now fit the model including both independent variables:


> fit <- lm(outcome~contin+categ,data=test)
> summary(fit)

Call:
lm(formula = outcome ~ contin + categ, data = test)

Residuals:
     Min       1Q   Median       3Q      Max 
-21.0664  -4.8608   0.1922   4.7240  19.0128 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -3.2810     4.3134  -0.761    0.449    
contin        2.4546     0.2745   8.942 2.78e-14 ***
categ2       44.7337     1.7854  25.055  < 2e-16 ***
categ3      -47.8674     1.8976 -25.226  < 2e-16 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 7.507 on 96 degrees of freedom
Multiple R-Squared: 0.9644,     Adjusted R-squared: 0.9632 
F-statistic: 865.6 on 3 and 96 DF,  p-value: < 2.2e-16

This fit is MUCH better (R-Squared = 0.9644).
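
Because the continuous-only model is nested inside this one, the improvement can also be tested formally with anova(). A sketch, where fit.contin and fit.both are hypothetical names for refits of the two models above:

> fit.contin <- lm(outcome ~ contin, data=test)
> fit.both <- lm(outcome ~ contin + categ, data=test)
> anova(fit.contin, fit.both)   # F test for the added categ coefficients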


For the categorical variable, categ, the reference category is 1. So the model for category 1 is:

    -3.2810 + 2.4546 * contin

for category 2 it is:

    -3.2810 + (2.4546 * contin) + 44.7337 = 41.4527 + 2.4546 * contin

and for category 3 it is:

    -3.2810 + (2.4546 * contin) - 47.8674 = -51.1484 + 2.4546 * contin
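
The reference category is simply the first level of the factor; if a different baseline is more natural, relevel() changes it before refitting. A sketch on a copy of the data (test2 is a hypothetical name):

> test2 <- test
> test2$categ <- relevel(test2$categ, ref="2")     # make category 2 the baseline
> coef(lm(outcome ~ contin + categ, data=test2))   # coefficients now contrast against category 2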

To look at this fit on a plot:


> max(test$outcome)
[1] 104.7463
> min(test$outcome)
[1] -24.14960
> max(test$contin)
[1] 19.95094
> min(test$contin)
[1] 10.10331
> plot(test$contin[test$categ==1],test$outcome[test$categ==1],ylim=c(-30,110),
+ xlim=c(10,20),pch=20,col="red",xlab="contin",ylab="outcome")
> points(test$contin[test$categ==2],test$outcome[test$categ==2],pch=20,col="blue")
> points(test$contin[test$categ==3],test$outcome[test$categ==3],pch=20,col="green")
> abline(a=-3.2810,b=2.4546,col="red")
> abline(a=41.4527,b=2.4546,col="blue")
> abline(a=-51.1484,b=2.4546,col="green")

[Figure: plot of outcome against contin (x axis 10 to 20, y axis -20 to 80), points colored by categ, with the three parallel fitted lines from this model overlaid.]

Now fit the correct model with the interaction term.
> fit <- lm(outcome ~ contin + categ + contin*categ, data=test)
> summary(fit)
Call:
lm(formula = outcome ~ contin + categ + contin * categ, data = test)

Residuals:
      Min        1Q    Median        3Q       Max 
-6.649368 -2.079444 -0.005302  1.954777  8.645110 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -0.07365    2.98313  -0.025    0.980    
contin         2.24075    0.19588  11.439   <2e-16 ***
categ2         2.54845    4.02113   0.634    0.528    
categ3         2.10395    4.39219   0.479    0.633    
contin:categ2  2.76049    0.26144  10.559   <2e-16 ***
contin:categ3 -3.32198    0.28796 -11.536   <2e-16 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 3.023 on 94 degrees of freedom
Multiple R-Squared: 0.9943,     Adjusted R-squared: 0.994 
F-statistic:  3303 on 5 and 94 DF,  p-value: < 2.2e-16


This fit is virtually perfect (R-Squared of .9943).
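
Incidentally, the formula contin + categ + contin*categ is redundant: the * operator already expands to the main effects plus the interaction, and R silently drops the duplicated terms. The same model can therefore be written more compactly (a sketch):

> fit <- lm(outcome ~ contin * categ, data=test)   # equivalent to contin + categ + contin:categ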


The model for category 1 (categ2=0, categ3=0) is:

    -0.07365 + 2.24075 * contin

for category 2 (categ2=1, categ3=0) it is:

    -0.07365 + 2.54845 + (2.24075 + 2.76049) * contin = 2.4748 + 5.00124 * contin

and for category 3 (categ2=0, categ3=1) it is:

    -0.07365 + 2.10395 + (2.24075 - 3.32198) * contin = 2.0303 - 1.08123 * contin
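
Rather than working these out by hand, the per-category intercepts and slopes can be assembled straight from the coefficient vector (a sketch, using the coefficient names shown in the summary above):

> b <- coef(fit)
> b["(Intercept)"] + c(0, b["categ2"], b["categ3"])             # intercepts for categories 1, 2, 3
> b["contin"] + c(0, b["contin:categ2"], b["contin:categ3"])    # slopes for categories 1, 2, 3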

You can see from the plot that the fit is a bit better still, since the interaction term allows the slopes to differ between the groups defined by the categ variable:
> plot(test$contin[test$categ==1],test$outcome[test$categ==1],ylim=c(-30,110),
+ xlim=c(10,20),pch=20,col="red",xlab="contin",ylab="outcome")
> points(test$contin[test$categ==2],test$outcome[test$categ==2],pch=20,col="blue")
> points(test$contin[test$categ==3],test$outcome[test$categ==3],pch=20,col="green")
> abline(a=-0.07365,b=2.24075,col="red")
> abline(a=2.4748,b=5.00124,col="blue")
> abline(a=2.0303,b=-1.08123,col="green")

[Figure: plot of outcome against contin (x axis 10 to 20, y axis -20 to 80), points colored by categ, with the three fitted lines from the interaction model, now with different slopes.]
