Example R Session 8
CSSS 508 Spring 2005
Multiple Linear Regression in R using lm()
First take a look at the documentation on the lm() function:
> help(lm)
Fit a linear regression model to the data. Note the syntax for the model formula: to the left of the ~ is y, the dependent variable; to the right are the independent variables. In this example there is only one independent variable, x. The intercept is included in the model by default, or it can be specified explicitly with y ~ 1 + x. To exclude the intercept from the model (usually a bad idea), the syntax is
y ~ -1 + x
The following demonstrates various useful functions that can be applied to the lm object
(an object of class lm) that is returned by the lm() function.
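The fitting step itself does not appear in the transcript; result.lm is assumed to have been created along these lines (x and y here are hypothetical numeric vectors, not the course data):

```r
# Hypothetical example data: any two numeric vectors of equal length work
x <- c(3, 7, 10, 10, 7, 5, 9, 7, 7, 7)
y <- 7 + 1.7 * x + rnorm(10, mean = 0, sd = 2)

# Fit the simple linear regression y ~ x and store the lm object
result.lm <- lm(y ~ x)
```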
First is summary() which returns the basic results of a linear regression fit.
> summary(result.lm)
Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8265 -1.5524  0.8373  1.4672  2.0514 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   7.0780     1.0094   7.012 1.52e-06 ***
x             1.7241     0.1523  11.317 1.29e-09 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
http://csde.washington.edu/training/courses/csss508/
To get the values of the dependent variable predicted by the model, use fitted():
> fitted(result.lm)
        1         2         3         4         5         6 
12.250276 19.146652 24.318934 24.318934 19.146652 15.698464 
        7         8         9        10        11        12 
22.594840 19.146652 19.146652 19.146652 19.146652 10.526183 
       13        14        15        16        17        18 
 8.802089 15.698464 15.698464 13.974370 12.250276 20.870746 
       19        20 
24.318934 15.698464 
To get the variance/covariance matrix of the parameters (the expected variability in the coefficients over repeated sampling), use vcov():
> vcov(result.lm)
            (Intercept)           x
(Intercept)   1.0189274 -0.14158217
x            -0.1415822  0.02321019
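The standard errors reported by summary() are the square roots of the diagonal of this matrix; as a quick check (assuming result.lm as above):

```r
sqrt(diag(vcov(result.lm)))
# (Intercept)           x
#      1.0094      0.1523   <- matches the Std. Error column of summary()
```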
Residuals are the differences between the observed and predicted values of the outcome variable.
Standardized residuals (stdres() in library(MASS)) have been scaled by the square root of their
variance and have a standard deviation equal to one. A case with a standardized residual larger
than about +/- 2.5 should be investigated as a potential outlier, since such a value would be
expected to occur by chance only about 1% of the time. Cook's distance measures the change in the
fitted regression coefficients if an observation were dropped from the regression. An observation
with Cook's D over 1 is worth investigating. (Simonoff, J.S. (2003) Analyzing Categorical Data,
Springer-Verlag New York, Inc., pp. 36-39)
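These rules of thumb translate directly into code; a small sketch using the diagnostics shown below:

```r
library(MASS)                          # provides stdres()

which(abs(stdres(result.lm)) > 2.5)    # potential outliers
which(cooks.distance(result.lm) > 1)   # influential observations

# For these data both checks come back empty: the largest
# |standardized residual| is about 1.65 and the largest Cook's D about 0.33.
```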
> residuals(result.lm)
         1          2          3          4          5          6 
 0.9002409 -1.3470334  0.8570613  0.9519508  1.6694764 -2.1683791 
         7          8          9         10         11         12 
 1.6089396 -0.5362055  1.4199549  1.3614821 -2.8265051  1.6598746 
        13         14         15         16         17         18 
-2.1783278 -0.4787503  0.8174658 -2.3852110  2.0513639 -0.9040092 
        19         20 
-2.4933639  2.0199750 
> library(MASS)
> stdres(result.lm)
         1          2          3          4          5          6          7 
 0.5451305 -0.7867510  0.5318112  0.5906907  0.9750778 -1.2684782  0.9693375 
         8          9         10         11         12         13         14 
-0.3131773  0.8293417  0.7951900 -1.6508543  1.0374046 -1.4220285 -0.2800637 
        15         16         17         18         19         20 
 0.4782086 -1.4134003  1.2421797 -0.5339493 -1.5471459  1.1816634 
> cooks.distance(result.lm)
          1           2           3           4           5           6 
0.020612025 0.018378182 0.027675723 0.034143202 0.028229711 0.050482094 
          7           8           9          10          11          12 
0.059764914 0.002912114 0.020421844 0.018774557 0.080918046 0.114645942 
         13          14          15          16          17          18 
0.327026699 0.002460849 0.007174738 0.090360529 0.107025820 0.011888862 
         19          20 
0.234232173 0.043808552 
"effects"
"qr"
"terms"
"rank"
"df.residual"
"model"
3
4
5
6
7
0.8570613 0.9519508 1.6694764 -2.1683791 1.6089396
10
11
12
13
14
1.3614821 -2.8265051 1.6598746 -2.1783278 -0.4787503
17
18
19
20
2.0513639 -0.9040092 -2.4933639 2.0199750
> result.lm$fitted.values
        1         2         3         4         5         6         7         8 
12.250276 19.146652 24.318934 24.318934 19.146652 15.698464 22.594840 19.146652 
        9        10        11        12        13        14        15        16 
19.146652 19.146652 19.146652 10.526183  8.802089 15.698464 15.698464 13.974370 
       17        18        19        20 
12.250276 20.870746 24.318934 15.698464 
> result.lm$df.residual
[1] 18
Raw data:
> result.lm$model
y x
1 13.150517 3
2 17.799619 7
3 25.175995 10
4 25.270884 10
5 20.816128 7
6 13.530085 5
7 24.203779 9
8 18.610446 7
9 20.566607 7
10 20.508134 7
11 16.320147 7
12 12.186057 2
13 6.623761 1
14 15.219714 5
15 16.515930 5
16 11.589159 4
17 14.301640 3
18 19.966737 8
19 21.825570 10
20 17.718439 5
The plot function is actually a generic function: R identifies the class of the object passed to it and
then dispatches to the plot method designed for that class. See, for example,
> help(plot)
> help(plot.default)
> help(plot.lm)
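A quick way to see the dispatch at work (class() is the standard R function for inspecting an object's class):

```r
class(result.lm)   # "lm", so plot(result.lm) dispatches to plot.lm()
plot(result.lm)    # produces the standard regression diagnostic plots
```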
To get values predicted by the model for values of the covariates (independent variables) that are not in
the original sample, use the predict() function. The predict function requires that you pass it the lm
object and a dataframe containing the values of the covariates for which you want a predicted value. So
if I want a predicted value for x=5.4, I pass data.frame(x=5.4) to put my x value into dataframe
format.
> predict(result.lm,data.frame(x=5.4))
[1] 16.38810
> points(5.4,16.38810,col="purple",pch=20,cex=2)
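predict() also accepts several new covariate values at once, and its interval argument (standard in predict.lm) returns confidence or prediction intervals:

```r
newx <- data.frame(x = c(2.5, 5.4, 8.1))
predict(result.lm, newx)                           # point predictions
predict(result.lm, newx, interval = "confidence")  # CI for the mean response
predict(result.lm, newx, interval = "prediction")  # interval for a new observation
```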
The following is a demonstration of a linear model with one continuous and one categorical
variable and an interaction term.
First I write a function to create some data, read the function into my workspace, and run the function,
storing its result in a dataframe called test.
cat.interact <- function()
{
  one <- runif(100,10,20)
  two <- sample(1:3,100,replace=T)
  outcome <- rep(0,100)
  for (i in 1:100)
  {
    if (two[i] == 1)
      outcome[i] <- 3 + 2 * one[i] + rnorm(1,mean=0,sd=3)
    else if (two[i] == 2)
      outcome[i] <- 2 + 5 * one[i] + rnorm(1,mean=0,3)
    else
      outcome[i] <- 1 - one[i] + rnorm(1,mean=0,3)
  }
  return(data.frame(outcome=outcome,contin=one,categ=two))
}
> source("make.data.w8.r")
> test <- cat.interact()
> head(test)
    outcome   contin categ
1 -16.64759 18.07304     3
2 -15.76270 16.38989     3
3 -12.20210 11.66220     3
4  34.98955 14.96454     1
5  70.65761 12.61652     2
6 -10.26925 12.41431     3
Next I plot all pairwise combinations of the variables using the function pairs()
> pairs(test)
First fit a model with just the continuous variable. Note that the fit is very poor (R-squared = 0.0448)
because the model is misspecified (it leaves out an important variable). [As an exercise, add the fitted
line from this model to the scatterplot of the raw data.]
> fit <- lm(outcome~contin,data=test)
> summary(fit)
Call:
lm(formula = outcome ~ contin, data = test)
Residuals:
    Min      1Q  Median      3Q     Max 
-74.085 -38.154  -1.899  37.822  53.961 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   -9.028     21.586  -0.418   0.6767  
contin         3.011      1.404   2.144   0.0345 *
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 38.46 on 98 degrees of freedom
Multiple R-Squared: 0.0448,     Adjusted R-squared: 0.03506 
F-statistic: 4.597 on 1 and 98 DF,  p-value: 0.03451 
Before adding the categorical variable (categ), I need to make it a categorical variable. At this point,
it is a numeric variable. Categorical variables in R are called factors. The as.factor() function will
change a variable into a factor.
> is.numeric(test$categ)
[1] TRUE
> is.factor(test$categ)
[1] FALSE
> test$categ <- as.factor(test$categ)
> is.factor(test$categ)
[1] TRUE
Now refit the model including the (now factor) variable categ:

> fit <- lm(outcome ~ contin + categ, data=test)
> summary(fit)

Call:
lm(formula = outcome ~ contin + categ, data = test)

Residuals:
     Min       1Q   Median       3Q      Max 
-21.0664  -4.8608   0.1922   4.7240  19.0128 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -3.2810     4.3134  -0.761    0.449    
contin        2.4546     0.2745   8.942 2.78e-14 ***
categ2       44.7337     1.7854  25.055  < 2e-16 ***
categ3      -47.8674     1.8976 -25.226  < 2e-16 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 7.507 on 96 degrees of freedom
Multiple R-Squared: 0.9644,     Adjusted R-squared: 0.9632 
F-statistic: 865.6 on 3 and 96 DF,  p-value: < 2.2e-16 
[Figure: scatterplot of the raw data, outcome against contin]
Now fit the correct model with the interaction term.
> fit <- lm(outcome ~ contin + categ + contin*categ, data=test)
> summary(fit)
Call:
lm(formula = outcome ~ contin + categ + contin * categ, data = test)
Residuals:
      Min        1Q    Median        3Q       Max 
-6.649368 -2.079444 -0.005302  1.954777  8.645110 
Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -0.07365    2.98313  -0.025    0.980    
contin         2.24075    0.19588  11.439   <2e-16 ***
categ2         2.54845    4.02113   0.634    0.528    
categ3         2.10395    4.39219   0.479    0.633    
contin:categ2  2.76049    0.26144  10.559   <2e-16 ***
contin:categ3 -3.32198    0.28796 -11.536   <2e-16 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 3.023 on 94 degrees of freedom
Multiple R-Squared: 0.9943,     Adjusted R-squared: 0.994 
F-statistic:  3303 on 5 and 94 DF,  p-value: < 2.2e-16 
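The group-specific intercepts and slopes implied by these coefficients can be recovered from the coefficient vector (a sketch using the fitted object above):

```r
b <- coef(fit)

# Baseline group (categ == 1): intercept and slope as reported
b["(Intercept)"];               b["contin"]
# Group 2: add the categ2 offsets
b["(Intercept)"] + b["categ2"]; b["contin"] + b["contin:categ2"]
# Group 3: add the categ3 offsets
b["(Intercept)"] + b["categ3"]; b["contin"] + b["contin:categ3"]
```

These are the intercept/slope pairs passed to abline() when plotting the fitted lines.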
You can see from the plot that the fit is much better, since the interaction term allows the slopes to
differ between the groups defined by the categ variable:
> plot(test$contin[test$categ==1],test$outcome[test$categ==1],ylim=c(-30,110),
+      xlim=c(10,20),pch=20,col="red",xlab="contin",ylab="outcome")
> points(test$contin[test$categ==2],test$outcome[test$categ==2],pch=20,col="blue")
> points(test$contin[test$categ==3],test$outcome[test$categ==3],pch=20,col="green")
> abline(a=-0.07365,b=2.24075,col="red")
> abline(a=2.4748,b=5.00124,col="blue")
> abline(a=2.0303,b=-1.08123,col="green")

[Figure: outcome plotted against contin, points colored by categ, with the three fitted lines]