Example R Session 8
CSSS 508 Spring 2005
Multiple Linear Regression in R using lm()
First take a look at the documentation on the lm() function:
> help(lm)
Fit a linear regression model to the data. Note the syntax for the model formula: to the left of the ~ is y, the dependent variable; to the right are the independent variables. In this example there is only one independent variable, x. The intercept is included in the model by default, or it can be specified explicitly with y ~ 1 + x. To exclude the intercept from the model (usually a bad idea), the syntax is
y ~ -1 + x
The following demonstrates various useful functions that can be applied to the lm object
(an object of class lm) that is returned by the lm() function.
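The fitting step itself does not appear in the transcript; result.lm is assumed to have been created along these lines (x and y here are hypothetical numeric vectors, not the course data):

```r
# Hypothetical example data: any two numeric vectors of equal length work
x <- c(3, 7, 10, 10, 7, 5, 9, 7, 7, 7)
y <- 7 + 1.7 * x + rnorm(10, mean = 0, sd = 2)

# Fit the simple linear regression y ~ x and store the lm object
result.lm <- lm(y ~ x)
```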
First is summary() which returns the basic results of a linear regression fit.
> summary(result.lm)
Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8265 -1.5524  0.8373  1.4672  2.0514 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   7.0780     1.0094   7.012 1.52e-06 ***
x             1.7241     0.1523  11.317 1.29e-09 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
http://csde.washington.edu/training/courses/csss508/
To get the values of the dependent variable predicted by the model, use fitted():
> fitted(result.lm)
        1         2         3         4         5         6 
12.250276 19.146652 24.318934 24.318934 19.146652 15.698464 
        7         8         9        10        11        12 
22.594840 19.146652 19.146652 19.146652 19.146652 10.526183 
       13        14        15        16        17        18 
 8.802089 15.698464 15.698464 13.974370 12.250276 20.870746 
       19        20 
24.318934 15.698464 
To get the variance/covariance matrix of the parameters (the expected variability in the coefficients over repeated sampling), use vcov():
> vcov(result.lm)
            (Intercept)           x
(Intercept)   1.0189274 -0.14158217
x            -0.1415822  0.02321019
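The standard errors reported by summary() are the square roots of the diagonal of this matrix; as a quick check (assuming result.lm as above):

```r
sqrt(diag(vcov(result.lm)))
# (Intercept)           x
#      1.0094      0.1523   <- matches the Std. Error column of summary()
```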
Residuals are the differences between the observed and predicted values of the outcome variable.
Standardized residuals (stdres() in library(MASS)) have been scaled by the square root of their
variance and have a standard deviation equal to one. A case with a standardized residual larger
than about +/- 2.5 should be investigated as a potential outlier, since such a value would be
expected to occur by chance only about 1% of the time. Cook's distance measures the change in the
fitted regression coefficients if an observation were dropped from the regression. An observation
with Cook's D over 1 is worth investigating. (Simonoff, J.S. (2003) Analyzing Categorical Data,
Springer-Verlag New York, Inc., pp. 36-39)
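These rules of thumb translate directly into code; a small sketch using the diagnostics shown below:

```r
library(MASS)                          # provides stdres()

which(abs(stdres(result.lm)) > 2.5)    # potential outliers
which(cooks.distance(result.lm) > 1)   # influential observations

# For these data both checks come back empty: the largest
# |standardized residual| is about 1.65 and the largest Cook's D about 0.33.
```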
> residuals(result.lm)
         1          2          3          4          5          6 
 0.9002409 -1.3470334  0.8570613  0.9519508  1.6694764 -2.1683791 
         7          8          9         10         11         12 
 1.6089396 -0.5362055  1.4199549  1.3614821 -2.8265051  1.6598746 
        13         14         15         16         17         18 
-2.1783278 -0.4787503  0.8174658 -2.3852110  2.0513639 -0.9040092 
        19         20 
-2.4933639  2.0199750 
> library(MASS)
> stdres(result.lm)
         1          2          3          4          5          6          7 
 0.5451305 -0.7867510  0.5318112  0.5906907  0.9750778 -1.2684782  0.9693375 
         8          9         10         11         12         13         14 
-0.3131773  0.8293417  0.7951900 -1.6508543  1.0374046 -1.4220285 -0.2800637 
        15         16         17         18         19         20 
 0.4782086 -1.4134003  1.2421797 -0.5339493 -1.5471459  1.1816634 
> cooks.distance(result.lm)
          1           2           3           4           5           6 
0.020612025 0.018378182 0.027675723 0.034143202 0.028229711 0.050482094 
          7           8           9          10          11          12 
0.059764914 0.002912114 0.020421844 0.018774557 0.080918046 0.114645942 
         13          14          15          16          17          18 
0.327026699 0.002460849 0.007174738 0.090360529 0.107025820 0.011888862 
         19          20 
0.234232173 0.043808552 
"effects"
"qr"
"terms"
"rank"
"df.residual"
"model"
3
4
5
6
7
0.8570613 0.9519508 1.6694764 -2.1683791 1.6089396
10
11
12
13
14
1.3614821 -2.8265051 1.6598746 -2.1783278 -0.4787503
17
18
19
20
2.0513639 -0.9040092 -2.4933639 2.0199750
> result.lm$fitted.values
        1         2         3         4         5         6         7         8 
12.250276 19.146652 24.318934 24.318934 19.146652 15.698464 22.594840 19.146652 
        9        10        11        12        13        14        15        16 
19.146652 19.146652 19.146652 10.526183  8.802089 15.698464 15.698464 13.974370 
       17        18        19        20 
12.250276 20.870746 24.318934 15.698464 
> result.lm$df.residual
[1] 18
Raw data:
> result.lm$model
y x
1 13.150517 3
2 17.799619 7
3 25.175995 10
4 25.270884 10
5 20.816128 7
6 13.530085 5
7 24.203779 9
8 18.610446 7
9 20.566607 7
10 20.508134 7
11 16.320147 7
12 12.186057 2
13 6.623761 1
14 15.219714 5
15 16.515930 5
16 11.589159 4
17 14.301640 3
18 19.966737 8
19 21.825570 10
20 17.718439 5
The plot function is actually a generic function: R identifies the class of the object passed to it and
then dispatches to the plot method designed for that class. See, for example,
> help(plot)
> help(plot.default)
> help(plot.lm)
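A quick way to see the dispatch at work (class() is the standard R function for inspecting an object's class):

```r
class(result.lm)   # "lm", so plot(result.lm) dispatches to plot.lm()
plot(result.lm)    # produces the standard regression diagnostic plots
```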
To get values predicted by the model for values of the covariates (independent variables) that are not in
the original sample, use the predict() function. The predict function requires that you pass it the lm
object and a dataframe containing the values of the covariates for which you want a predicted value. So
if I want a predicted value for x=5.4, I pass data.frame(x=5.4) to put my x value into dataframe
format.
> predict(result.lm,data.frame(x=5.4))
[1] 16.38810
> points(5.4,16.38810,col="purple",pch=20,cex=2)
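predict() also accepts several new covariate values at once, and its interval argument (standard in predict.lm) returns confidence or prediction intervals:

```r
newx <- data.frame(x = c(2.5, 5.4, 8.1))
predict(result.lm, newx)                           # point predictions
predict(result.lm, newx, interval = "confidence")  # CI for the mean response
predict(result.lm, newx, interval = "prediction")  # interval for a new observation
```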
The following is a demonstration of a linear model with one continuous and one categorical
variable and an interaction term.
First I write a function to create some data, read the function into my workspace, and run the function,
storing its result in a dataframe called test.
cat.interact <- function()
{
  one <- runif(100,10,20)
  two <- sample(1:3,100,replace=T)
  outcome <- rep(0,100)
  for (i in 1:100)
  {
    if (two[i] == 1)
      outcome[i] <- 3 + 2 * one[i] + rnorm(1,mean=0,sd=3)
    else if (two[i] == 2)
      outcome[i] <- 2 + 5 * one[i] + rnorm(1,mean=0,3)
    else
      outcome[i] <- 1 - one[i] + rnorm(1,mean=0,3)
  }
  return(data.frame(outcome=outcome,contin=one,categ=two))
}
> source("make.data.w8.r")
> test <- cat.interact()
> head(test)
    outcome   contin categ
1 -16.64759 18.07304     3
2 -15.76270 16.38989     3
3 -12.20210 11.66220     3
4  34.98955 14.96454     1
5  70.65761 12.61652     2
6 -10.26925 12.41431     3
Next I plot all pairwise combinations of the variables using the function pairs()
> pairs(test)
First fit a model with just the continuous variable. Note that the fit is very poor (R-squared = 0.0448)
because the model is misspecified (it leaves out an important variable). [As an exercise, add the fitted
line from this model to the scatterplot of the raw data.]
> fit <- lm(outcome~contin,data=test)
> summary(fit)
Call:
lm(formula = outcome ~ contin, data = test)
Residuals:
    Min      1Q  Median      3Q     Max 
-74.085 -38.154  -1.899  37.822  53.961 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   -9.028     21.586  -0.418   0.6767  
contin         3.011      1.404   2.144   0.0345 *
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 38.46 on 98 degrees of freedom
Multiple R-Squared: 0.0448,     Adjusted R-squared: 0.03506 
F-statistic: 4.597 on 1 and 98 DF,  p-value: 0.03451 
Before adding the categorical variable (categ), I need to make it a categorical variable. At this point,
it is a numeric variable. Categorical variables in R are called factors. The as.factor() function will
change a variable into a factor.
> is.numeric(test$categ)
[1] TRUE
> is.factor(test$categ)
[1] FALSE
> test$categ <- as.factor(test$categ)
> is.factor(test$categ)
[1] TRUE
Now refit the model including the (now factor) variable categ:

> fit <- lm(outcome ~ contin + categ, data=test)
> summary(fit)

Call:
lm(formula = outcome ~ contin + categ, data = test)

Residuals:
     Min       1Q   Median       3Q      Max 
-21.0664  -4.8608   0.1922   4.7240  19.0128 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -3.2810     4.3134  -0.761    0.449    
contin        2.4546     0.2745   8.942 2.78e-14 ***
categ2       44.7337     1.7854  25.055  < 2e-16 ***
categ3      -47.8674     1.8976 -25.226  < 2e-16 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 7.507 on 96 degrees of freedom
Multiple R-Squared: 0.9644,     Adjusted R-squared: 0.9632 
F-statistic: 865.6 on 3 and 96 DF,  p-value: < 2.2e-16 
[Figure: scatterplot of the raw data, outcome against contin]
Now fit the correct model with the interaction term.
> fit <- lm(outcome ~ contin + categ + contin*categ, data=test)
> summary(fit)
Call:
lm(formula = outcome ~ contin + categ + contin * categ, data = test)
Residuals:
      Min        1Q    Median        3Q       Max 
-6.649368 -2.079444 -0.005302  1.954777  8.645110 
Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -0.07365    2.98313  -0.025    0.980    
contin         2.24075    0.19588  11.439   <2e-16 ***
categ2         2.54845    4.02113   0.634    0.528    
categ3         2.10395    4.39219   0.479    0.633    
contin:categ2  2.76049    0.26144  10.559   <2e-16 ***
contin:categ3 -3.32198    0.28796 -11.536   <2e-16 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 3.023 on 94 degrees of freedom
Multiple R-Squared: 0.9943,     Adjusted R-squared: 0.994 
F-statistic:  3303 on 5 and 94 DF,  p-value: < 2.2e-16 
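The group-specific intercepts and slopes implied by these coefficients can be recovered from the coefficient vector (a sketch using the fitted object above):

```r
b <- coef(fit)

# Baseline group (categ == 1): intercept and slope as reported
b["(Intercept)"];               b["contin"]
# Group 2: add the categ2 offsets
b["(Intercept)"] + b["categ2"]; b["contin"] + b["contin:categ2"]
# Group 3: add the categ3 offsets
b["(Intercept)"] + b["categ3"]; b["contin"] + b["contin:categ3"]
```

These are the intercept/slope pairs passed to abline() when plotting the fitted lines.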
You can see from the plot that the fit is much better, since the interaction term allows the slopes to
differ between the groups defined by the categ variable:
> plot(test$contin[test$categ==1],test$outcome[test$categ==1],ylim=c(-30,110),
+      xlim=c(10,20),pch=20,col="red",xlab="contin",ylab="outcome")
> points(test$contin[test$categ==2],test$outcome[test$categ==2],pch=20,col="blue")
> points(test$contin[test$categ==3],test$outcome[test$categ==3],pch=20,col="green")
> abline(a=-0.07365,b=2.24075,col="red")
> abline(a=2.4748,b=5.00124,col="blue")
> abline(a=2.0303,b=-1.08123,col="green")

[Figure: outcome plotted against contin, points colored by categ, with the three fitted lines]