
Winter Term Lecture 9:

Standard multiple regression

Hannah Wong
School of Health Policy and Management
Faculty of Health
Correction to Lecture 8 Slide 6 of 34
The least squares estimates of the regression coefficients, a
and b, describing the relationship between BMI and total
cholesterol are computed as follows:

b = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²   and   a = ȳ − b·x̄

(Keep a and b to 5 decimal places, since we will use them to calculate ŷ.)

The estimate of the slope (b = 6.49297) represents the change


in total cholesterol relative to a one-unit change in BMI. For
example, if we compare two participants whose BMIs differ by
one unit, we would expect their total cholesterol levels to
differ by approximately 6.49297 units (with the person with
the higher BMI having the higher total cholesterol). The
equation of the regression line is ŷ = 27.99262 + 6.49297x.
Introduction
Multiple linear regression is an extension of simple bivariate
linear regression to allow for more than one independent
variable. That is, instead of using only a single independent
variable x to explain the variation in y, you can simultaneously
use several independent (or predictor) variables. By using
more than one independent variable, we should do a better
job of explaining the variation in y and hence be able to make
more accurate predictions.

In multiple regression, researchers use several predictor


variables (x1, x2, x3, …) in order to see how they relate to, or
predict, an outcome variable (y).

Number of predictors
The number of independent variables that should be used in
your model is limited by the number of participants or
observations.

A rule of thumb used by some researchers is to limit the


number of independent variables to one independent variable
for every 10 observations. Thus, if you have 50 observations,
this rough guide would suggest that you should limit your
regression model to a maximum of five independent variables.
Having too few participants for the number of variables leads
to overfitting. This means that the results fit the sample far
better than they do the population.

Standard Multiple Regression
The general formula for multiple regression is:

ŷ = a + b1x1 + b2x2 + b3x3 + … + bkxk


ŷ = the predicted or expected value of the dependent variable

xi (x1 through xk) = the distinct independent or predictor


variables

bi (b1 through bk) = the distinct estimated regression


coefficients. Each regression coefficient represents the
change in y relative to a one-unit change in the respective
independent variable holding the remaining independent
variables constant.
a = the constant or intercept: the value of ŷ when all of the
independent variables are equal to 0.
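
As an aside (not from the original slides), here is a minimal Python sketch of fitting such a model with statsmodels; the data file and the column names x1, x2, x3 and y are purely illustrative assumptions:

    # Fit y-hat = a + b1*x1 + b2*x2 + b3*x3 by ordinary least squares.
    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_csv("example_data.csv")           # hypothetical data file
    X = sm.add_constant(df[["x1", "x2", "x3"]])    # adds the intercept term a
    y = df["y"]

    model = sm.OLS(y, X).fit()
    print(model.params)      # the intercept a (const) and the b coefficients
    print(model.summary())   # R-square, adjusted R-square, overall F test, t tests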
Testing the usefulness of the linear regression model
Is the regression equation that uses information provided by
the predictor variables x1, x2, xk substantially better than the
simple predictor that does not rely on any of the x-values?

This question is answered using an overall F test with the


hypothesis:
H0: β1 = β2 = … = βk = 0
versus
H1: at least one of β1, β2, …, βk is not 0.

If the null hypothesis is true, none of the independent


variables x1, x2, …, xk is linearly related to y, and therefore the
model is invalid and not useful. If at least one βi is not equal
to 0, the model does have some validity.
Categorical independent variables
Independent variables in regression models can be
continuous, dichotomous or categorical. For example, it might
be of interest to assess whether there is a difference in total
cholesterol by race/ethnicity.

We must create indicator variables to represent the different


comparison groups (eg different racial/ethnic groups). The set
of indicator variables (also called dummy variables) are
considered in the multiple regression model simultaneously as
a set of independent variables. For example, suppose that
participants indicate which of the following best represents
their race/ethnicity: White, Black, Hispanic, or Other Race.
This categorical variable has four response options.

Categorical independent variables
To consider race/ethnicity as a predictor in a regression
model, we create a set of dummy variables. To create the set
of dummy variables, we decide on a reference group or
category. In this example, the reference group is the racial
group that we will compare the other groups against.

We will create three dummy variables to compare with the


reference group. For each dummy variable, we code 1 for
participants who are in that group (eg are of the specific
race/ethnicity of interest), and all others are coded 0.

Categorical independent variables
In the multiple regression model, the regression coefficients
associated with each of the dummy variables (representing in
this example each race/ethnicity group) are interpreted as the
expected difference in the mean of the outcome variable for
that race/ethnicity as compared to the reference group,
holding all other predictors constant.

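
For illustration only (not from the lecture data), a minimal Python sketch of creating the three dummy variables with White as the reference group; the variable and category names here are assumptions:

    import pandas as pd

    df = pd.DataFrame({"race": ["White", "Black", "Hispanic", "Other", "White"]})

    # One 0/1 indicator per category; dropping the White column makes White
    # the reference group, so each remaining dummy is compared against White.
    dummies = pd.get_dummies(df["race"], prefix="race", dtype=int)
    dummies = dummies.drop(columns=["race_White"])
    print(dummies)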
Example 1
An observational study is conducted to investigate risk factors
associated with infant birth weight. The study involves 832
pregnant women. Each woman provides demographic and
clinical data and is followed through the outcome of the
pregnancy. At the time of delivery, the infant's birth weight is
measured, in grams, as is their gestational age, in weeks.
Birth weights vary widely and range from 404 to 5400 grams.
The mean birth weight is 3367.83 grams, with a standard
deviation of 537.21 grams. Investigators wish to determine
whether there are differences in birth weight by infant gender,
gestational age, mother's age, and mother's race. In the study
sample, 421/832 (50.6%) of the infants are male, and the
mean gestational age at birth is 39.49 weeks with a standard
deviation of 1.81 weeks (range: 22-43 weeks).
Example 1
The mean mother's age is 30.83 years with a standard
deviation of 5.76 years (range: 17-45 years). Approximately
49% of the mothers are white, 41% are Hispanic, 5% are black,
and 5% identify themselves as other race. A multiple
regression analysis is performed relating infant gender,
gestational age in weeks, mother's age in years, and three
dummy or indicator variables reflecting mother's race to birth
weight. The results are shown below.

Example 1
Multiple Regression Analysis
Reg. Coefficient t P-value
Intercept -3850.92 -11.56 .000
Male infant 174.79 6.06 .000
Gestational age (weeks) 179.89 22.35 .000
Mother's age (years) 1.38 0.47 .6361
Black race -138.46 -1.93 .0535
Hispanic race -13.07 -0.37 .7103
Other race -68.67 -1.05 .2918

Many of the independent variables are statistically


significantly associated with birth weight. Male infants are
approximately 175 grams heavier than female infants
(reference group), adjusting for gestational age, mother's age,
and mother's race.
Example 1
Gestational age is highly significant (p<.001), with each
additional gestational week associated with an increase of
179.89 grams in birth weight, holding infant gender, mother's
age, and mother's race constant.

Mother's age does not reach statistical significance (p =


.6361).

Mother's race is modeled as a set of three dummy or indicator


variables. In this analysis, white race is the reference group (it
is the race not listed in the output). There are no statistically
significant differences in birth weight in infants born to black
versus white mothers or to Hispanic versus white mothers or
to women who identify themselves as other race as compared
to white, holding infant gender, gestational age, and mother's
age constant.
Example 2
Hanna et al. have data on hand washing in nurses (dataset not
presented). They used four scalar variables as predictor
variables to see how well they predict how often the nurses
washed their hands (y = frequency of hand washing):
x1 = perceived importance of hand washing
x2 = perceived risk to self
x3 = perceived risk to others
x4 = the degree to which nurses thought their workplace
assisted hand washing

Here are the results of the multiple regression analysis.

SPSS Output: Descriptive statistics

Descriptive Statistics
Mean Std Deviation N
Frequency of hand washing 7.69 2.04 76
Perceived importance of hand washing 9.52 1.26 76

Perceived risk to self 8.74 1.63 76


Perceived risk to others 9.47 1.07 76
Workplace assisted hand washing 8.96 2.03 76

SPSS output: variables in the regression equation
The variables entered/removed table simply confirms the
variables which have been entered in the equation, ie the four
predictor variables.

In the SPSS output, Method = enter simply means the


multiple regression that is being performed is a standard
multiple regression. This is the simplest type of multiple
regression, and is the default in SPSS.
Variables Entered/Removedb
Model Variables entered Variables removed Method
1 Workplace assisted hand washing, Enter
Risk to self, Perceived importance
of hand washing, Risk to othersa
a. All requested variables entered.
b. Dependent Variable: frequency of hand washing
SPSS output: model summary
The model summary table shows us:
Model Summary
Model R R square Adjusted R square Std. error of the estimate
1 .638a .407 .373 1.61770
a. Predictors: (Constant), workplace assisted hand washing, risk to self, perceived importance of hand
washing, risk to others

R = .638, which is a strong correlation.


R2 = .407 which shows that 40.7% of the variance in scores on
frequency of hand washing can be explained by the variation in
scores on the four predictor variables.
Adjusted R2 is needed in order to generalize to the population.
The formula for adjusted R2 takes into account the number of
variables and sample size, and adjusts R2 downwards. So now we
can expect approximately 37.3% of the variance in the frequency
of hand washing to be explained by the four predictor variables.
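
As a quick check (not shown in the slides), the adjusted R2 can be reproduced from R2, the sample size n and the number of predictors k:

    # Adjusted R-square = 1 - (1 - R^2) * (n - 1) / (n - k - 1)
    r2, n, k = 0.407, 76, 4
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    print(round(adj_r2, 3))   # 0.374; SPSS reports .373 because it uses the unrounded R-square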
SPSS output: ANOVA
The ANOVA shows us that the four predictor variables
together significantly predict frequency of hand washing
behaviour in nurses; at least one of the predictor variables is
contributing significant information for the prediction of the
outcome variable y (F4,71= 12.17, p < .001).

ANOVAb
Model Sum of squares df Mean square F Sig
1 Regression 127.421 4 31.855 12.173 .000a
Residual 185.803 71 2.617
Total 313.224 75
a. Predictors: (Constant), workplace assisted hand washing, risk to self, perceived
importance of hand washing, risk to others
b. Dependent Variable: frequency of hand washing

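
A quick check (not shown in the slides): the F statistic is the ratio of the regression mean square to the residual mean square in the ANOVA table:

    ms_regression, ms_residual = 31.855, 2.617
    f_stat = ms_regression / ms_residual
    print(round(f_stat, 2))   # 12.17, with df = 4 and 71, matching the table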
SPSS output: individual coefficients
To see the separate effects of the predictor variables on the
outcome variable, we look at the coefficients table.
Coefficients(a)
                                    B      Std error   Beta      t      Sig    95% CI for B (lower, upper)
(Constant)                       -5.397     2.009        -     -2.686   .009   (-9.403, -1.391)
Perceived importance              .403      .160       .248     2.523   .014   (.084, .721)
Risk to self                      .215      .134       .171     1.602   .114   (-.052, .482)
Risk to others                    .453      .211       .237     2.147   .035   (.032, .874)
Workplace assisted hand washing   .345      .095       .343     3.630   .001   (.156, .535)
B and Std error are the unstandardised coefficients; Beta is the standardised coefficient.
a. Dependent Variable: frequency of hand washing
Individual Coefficients
The Constant, a
The value of the constant, a is -5.397. This is not of great
interest to us, unless we want to use the regression formula to
predict scores in a new sample of people.

Unstandardized b weights
Let's look at the perceived importance of hand washing.
The unstandardized coefficient, b1, is .403. This means that
for every one-unit increase in perceived importance of hand
washing (x1), the predicted frequency of hand washing (ŷ)
rises by .403. Thus, for every one-point increase in
perceived importance of hand washing, the frequency of
hand washing rises by almost half a point.

Standardized beta weights
Standardized beta weights
Beta weights are the standardized b weights or standardized
regression coefficients. Recall, in standardization, we create z-
scores. For each variable, we calculate the mean and standard
deviation across all observations. Then, for each observation,
we subtract out the mean and divide by the standard
deviation for that variable.
So, if our regression equation is ŷ = a + b1x1 + b2x2 + b3x3, the
equation fitted to the standardized (z-score) variables, which gives
the standardized regression coefficients (denoted beta), is:

ẑy = beta1·zx1 + beta2·zx2 + beta3·zx3

A simpler formula for converting an unstandardized
coefficient bi to a standardized one is:

betai = bi × (sxi / sy)

where sxi and sy are the standard deviations of xi and y.
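
As a quick check (not shown in the slides), the conversion formula can be verified for x1 using the standard deviations from the descriptive statistics table:

    # beta_i = b_i * (SD of x_i) / (SD of y)
    b1, sd_x1, sd_y = 0.403, 1.26, 2.04   # from the coefficients and descriptives tables
    beta1 = b1 * sd_x1 / sd_y
    print(round(beta1, 3))                # 0.249, matching the reported beta of .248 up to rounding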
Standardized beta weights
The standardized coefficients then have a slightly different
meaning. In our example, beta1 = 0.248. Thus, a 1 standard
deviation increase in x1 (perceived importance of hand washing)
results in a 0.248 standard deviation (approximately a quarter
of a standard deviation) increase in the dependent variable,
frequency of hand washing.
The betas are comparable because they all refer to a 1
standard deviation change in their respective independent
variables rather than a one unit change.

The t-value = 2.52 and is statistically significant (p =.014).

The coefficients table (continued)
The confidence intervals relate to the unstandardized b
weights. So although b = .403, in the general population we
would expect, with a 95% likelihood, that the true value of b
would fall somewhere between .084 and .721.

Of all the predictors, the strongest predictor is found by


looking at the standardized beta weights. In this example, the
strongest predictor is the perception of the degree to which
the nurses thought their workplace assists hand-washing, as
this has a beta weight of .343.

The perceived risk to self, in the presence of the other


predictors, is not statistically significant, i.e. it does not predict
the frequency of hand washing.
The regression equation
ŷ = a + b1x1 + b2x2 + b3x3 + b4x4

Here :
ŷ = predicted frequency of hand washing
x1 = perceived importance of hand washing
x2 = risk to self
x3 = risk to others
x4 = workplace assists hand washing

So in the example it is:


ŷ = -5.397 + .403x1 + .215x2 + .453x3 + .345x4

The regression equation
ŷ = -5.397 + .403x1 + .215x2 + .453x3 + .345x4

ŷ = predicted frequency of hand washing


x1 = perceived importance of hand washing, x2 = risk to self
x3 = risk to others, x4 = workplace assists hand washing

Interpretation
If perceived importance of hand washing, risk to self, and
risk to others are all held constant, predicted frequency of
hand washing increases an estimated .345 units for every
one unit increase in workplace assists hand washing.
We estimate that two groups of subjects with the same
perceived importance of hand washing, risk to self, and
risk to others, who differ by one unit of workplace assists
hand washing, will have predicted frequencies of hand
washing that differ on average by .345 units.
Predicting an individuals score
Assume that a new participant, Kelly, has scores on the
predictor variables, but that we have no information on the
frequency of her hand washing. We can predict (estimate her
likely) score on the frequency of her hand washing, by using
her x scores:

perceived importance of hand washing = 9


risk to self = 8
risk to others = 8
workplace assists hand washing = 10

Predicted frequency of hand washing:

ŷ = -5.397 + .403(9) + .215(8) + .453(8) + .345(10)
  = 7.024
  = 7.02 (rounded to 2 decimal places)
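
The same prediction written as a short Python calculation (coefficients taken from the regression output above):

    a = -5.397
    b = [0.403, 0.215, 0.453, 0.345]   # b1..b4 from the coefficients table
    kelly_x = [9, 8, 8, 10]            # Kelly's scores on x1..x4

    y_hat = a + sum(bi * xi for bi, xi in zip(b, kelly_x))
    print(round(y_hat, 2))             # 7.02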
Visualizing multiple regression
Recall, r2 tells us how much variation in the outcome variable
y is explained by the independent variables.

For example, if r = .50, then r2 = .25. We can then say that


25% of the variation in scores on one variable is explained, or
accounted for, by the variation in other scores.

Visualizing multiple regression
The figure shows the overlapping
areas of variance, and the areas a, b,
c and d are all areas of shared
variance. What it shows is that
predictor x1 shares unique variance
with the outcome variable y (area a)
and unique variance with predictor
x2 (area b). In addition, predictor x2 shares unique variance with
the outcome variable y (area c). The shared variance for x1 and
x2 and the outcome y is illustrated by area d in the middle.

In multiple regression, the R represents all the shared variance


between the predictor variables and the outcome. So the
variance represented in multiple R contains areas a, d and c.
Collinearity
It might seem tempting to proceed in multiple regression by
entering all of the independent variables of possible interest
into the regression equation.
However, this may not produce desirable results, often
because of collinearity among the independent variables.

Collinearity (also called


multicollinearity and intercorrelation)
is a condition that exists when the
independent variables are correlated
with one another, ie x1 and x2 have
large overlap.

Collinearity
The adverse effect of collinearity is that independent variables
in a regression are so highly correlated that it becomes
difficult or impossible to distinguish their individual or unique
effects on the dependent variable.

How can you tell whether a regression analysis exhibits


collinearity? Look for these clues:
If there are a lot of predictor
variables, all highly correlated, the
value of R2 (and F) might be large
and statistically significant,
indicating a good fit, but the
individual predictors are
nonsignificant (small t statistics).
Collinearity
The signs of the regression coefficients are contrary to
what you would intuitively expect the contributions of
those variables to be. For example, if an independent
variable is by itself positively related to the dependent
variable, then you would expect the coefficient associated
with this variable to be positive.
The correlation table shows you which predictor variables
are highly correlated with each other and with the
response y (see the sketch below).
Minimizing the effect of collinearity is often easier than
correcting it. The statistics practitioner must try to include
independent variables that are independent of each other.

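
A minimal Python sketch of the correlation-table check mentioned above; the data file and predictor column names are illustrative assumptions:

    import pandas as pd

    df = pd.read_csv("example_data.csv")        # hypothetical data file
    predictors = df[["x1", "x2", "x3", "x4"]]   # illustrative predictor names

    # Pairwise correlations among the predictors; values near +1 or -1 flag
    # pairs of independent variables that overlap heavily (collinearity).
    print(predictors.corr())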
Example of collinearity
Open the Excel data file: 2300 Winter Term Lec 9 (worksheet: ibs).
In the Excel file, there are six symptoms of IBS and records for
21 patients. Note, we've already broken the rule of thumb for
the number of predictors (the maximum should be 2). We will
proceed, however, to illustrate collinearity.

We want to see how much variance in depression (outcome


variable) is accounted for by the IBS symptoms. We therefore
carry out a standard multiple regression analysis.

Example of collinearity

In this sample, we've explained 78% of the variance, ie 78% of
the variation in depression scores is accounted for by the
variation in the six symptoms of IBS. We can see that this is
significant (F6,14 = 8.08, p = .001).
Example of collinearity
Let's look at the individual predictors.

Apart from abdominal pain, none of the explanatory variables


seem to explain anything! This is an important finding in itself,
because it means that each symptom, on its own (apart from
abdominal pain), isn't that important to depression; it's the
combined symptoms that matter.
Example of collinearity
The reason why we have such a high R2 is that R2 includes all
the variance, whereas the individual results include only the
unique variance. Because the symptoms are highly correlated
with each other, as you would expect, the unique variance is
low.

Confounding
Multiple regression analysis can be used to assess whether
confounding exists.
Confounding is present when the relationship between a risk
factor and an outcome is distorted by a third variable, called
the confounding variable. A confounding variable or
confounder is one that is related to the main risk factor or
predictor of interest and also to the outcome.

Multiple regression analysis can be used to identify


confounding as follows: suppose we have a risk factor, which
we denote x1 (eg obesity is a risk factor for heart disease) and
an outcome or dependent variable (heart disease), which we
denote y.
Confounding
We estimate a simple linear regression equation relating the
risk factor (the independent variable) to the dependent
variable as ŷ = a + b1x1,
where b1 is the estimated regression coefficient that
quantifies the association between the main risk factor and
the outcome.

Suppose we now want to assess whether a third variable (eg


age) is a confounder. We denote the potential confounder x2
and then estimate a multiple linear regression equation as
ŷ = a + b1x1 + b2x2

In the multiple linear regression equation, b1 is the estimated


regression coefficient that quantifies the association between
the risk factor x1 and the outcome, adjusted for x2.
Confounding
We can assess confounding by assessing the extent to which
the regression coefficient associated with the main risk factor
changes after adjusting for the potential confounder.

In this case, we compare b1 from the simple linear regression


model to b1 from the multiple regression model.

As an informal rule, if the regression coefficient from the


simple linear regression model changes by more than 10%,
then x2 is said to be a confounder.

Once a variable is identified as a potential confounder, we can


then use multiple linear regression analysis to estimate the
association between the risk factor and the outcome adjusting
for that confounder.
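
A minimal Python sketch of this informal 10% rule, fitting the crude and adjusted models and comparing the two estimates of b1; the file and column names are illustrative assumptions:

    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_csv("example_data.csv")   # hypothetical data file

    # Crude model: outcome y on the risk factor x1 only.
    crude = sm.OLS(df["y"], sm.add_constant(df[["x1"]])).fit()

    # Adjusted model: add the potential confounder x2.
    adjusted = sm.OLS(df["y"], sm.add_constant(df[["x1", "x2"]])).fit()

    b1_crude = crude.params["x1"]
    b1_adjusted = adjusted.params["x1"]
    pct_change = (b1_adjusted - b1_crude) / b1_crude * 100

    # Informal rule: a change of more than 10% in either direction
    # suggests that x2 is a confounder of the x1-y association.
    print(round(pct_change, 1))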
Confounding
The test of significance of the regression coefficient associated
with the risk factor can be used to assess whether the
association between the risk factor is statistically significant
after accounting for one or more confounding variables.

Note that collinearity can be viewed as an extreme case of


confounding.

Example of confounding
Suppose we want to assess the association between BMI and
systolic blood pressure using data collected. A total of n =
3539 participants attended the exam and their mean systolic
blood pressure is 127.3 with a standard deviation of 19.0. The
mean BMI in the sample was 28.2 with a standard deviation of
5.3. A simple linear regression analysis reveals the results in
the table below. The simple linear regression model is
ŷ = 108.28 + 0.67(BMI). The association between BMI and
systolic blood pressure is also statistically significant (p <.001).

Multiple Regression Analysis


Reg. Coefficient t P-value
Intercept 108.28 62.61 .000
BMI 0.67 11.06 .000
Example of confounding
Suppose we now want to assess whether age (a continuous
variable, measured in years), gender, and treatment for
hypertension (yes/no) are potential confounders and, if so,
appropriately account for these using multiple linear
regression analysis. For analytic purposes, treatment for
hypertension is coded as 1 = yes and 0 = no. Gender is coded
as 1 = male and 0 = female. A multiple regression analysis
reveals the results in the table below.
Multiple Regression Analysis
Reg. Coefficient t P-value
Intercept 68.15 26.33 .000
BMI 0.58 10.30 .000
Age 0.65 20.22 .000
Male gender 0.94 1.58 .1133
Treatment for hypertension 6.44 9.74 .000
Example of confounding
The multiple regression model is
ŷ = 68.15 + 0.58(BMI) + 0.65(Age) + 0.94(Male gender)
+ 6.44(Treatment for hypertension)

Notice that the association between BMI and systolic blood


pressure is smaller (0.58 versus 0.67) after adjustment for age,
gender, and treatment for hypertension. BMI remains
statistically significantly associated with systolic blood
pressure (p<.001) but the magnitude of the association is
lower after adjustment. The regression coefficient decreases
by 13.43% ((0.58 − 0.67)/0.67 = −0.1343). Using the informal rule (ie a
change in the coefficient in either direction by 10% or more),
we meet the criteria for confounding. Thus, part of the
association between BMI and systolic blood pressure is
explained by age, gender, and treatment for hypertension.
Example of confounding
The t and p-values for each of the regression coefficients can
be used to assess the statistical significance of each
independent variable. Assessing only the p-values suggests
that BMI, age and treatment for hypertension are equally
statistically significant. The magnitude of the t statistics
provides another means to judge relative importance of the
independent variables. In this example, age is the most
significant independent variable, followed by BMI, treatment
for hypertension, and then male gender (in fact, male gender,
on its own, is not statistically significant, ie it does not predict
systolic blood pressure).

Example of confounding
The multiple regression model produces an estimate of the
association between BMI and systolic blood pressure that
accounts for differences in systolic blood pressure due to age,
gender, and treatment for hypertension.

A one-unit increase in BMI is associated with a 0.58-unit increase


in systolic blood pressure, holding age, gender, and treatment for
hypertension constant. Each additional year of age is associated
with a 0.65-unit increase in systolic blood pressure, holding BMI,
gender, and treatment for hypertension constant. Men have
higher systolic blood pressures, by approximately 0.94 units,
holding BMI, age, and treatment for hypertension constant; and
persons on treatment for hypertension have higher systolic
blood pressures, by approximately 6.44 units, holding BMI, age,
and gender constant.
Example of confounding
The multiple regression equation can be used to estimate the
systolic blood pressure as a function of a participant's BMI,
age, gender, and treatment for hypertension status. For
example, we can estimate the blood pressure of a 50 year old
male, with a BMI of 25, who is not on treatment for
hypertension as follows:

ŷ = 68.15 + 0.58(25) + 0.65(50) + 0.94(1) + 6.44(0) = 116.09

We can estimate the blood pressure of a 50 year old female,


with a BMI of 25, who is on treatment for hypertension as
follows:

ŷ = 68.15 + 0.58(25) + 0.65(50) + 0.94(0) + 6.44(1) = 121.59


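
The two predictions above written as a short Python helper (coefficients from the multiple regression table):

    def predict_sbp(bmi, age, male, treated):
        # Coefficients from the multiple regression table on the earlier slide.
        return 68.15 + 0.58 * bmi + 0.65 * age + 0.94 * male + 6.44 * treated

    print(round(predict_sbp(bmi=25, age=50, male=1, treated=0), 2))   # 116.09
    print(round(predict_sbp(bmi=25, age=50, male=0, treated=1), 2))   # 121.59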
Using multiple regression in an exploratory-type way
Multiple regression is often used in an exploratory-type way.

The dean of the MBA program wants to raise admissions


standards by developing a method that more accurately
predicts how well an applicant will perform in the MBA
program. She believes that the primary determinants of
success are the following: Undergraduate grade point average
(GPA), Graduate Management Admissions Test (GMAT) score,
and number of years of work experience. Develop a plan to
decide which applicants to admit.

Open excel datafile: 2300 Winter Term Lec 9 (worksheet:


mba).

Using multiple regression in an exploratory-type way
Select Analyze, Regression, Linear
Move the predictor variables from the left into the
Independent(s) box.
Move the variable to be predicted (mba_gpa) to the
Dependent box
Press Statistics, choose Estimates, Confidence intervals, Model
Fit and Descriptives
Click on Continue, then OK.
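
For comparison (not part of the lecture), a rough Python equivalent of these SPSS steps; the file extension and the predictor column names are assumptions, while mba_gpa is the dependent variable named above:

    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_excel("2300 Winter Term Lec 9.xlsx", sheet_name="mba")   # file name assumed
    X = sm.add_constant(df[["undergrad_gpa", "gmat", "work_years"]])      # illustrative names
    model = sm.OLS(df["mba_gpa"], X).fit()
    print(model.summary())   # model fit, ANOVA F test, coefficients with 95% CIs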

SPSS output

44.5% (R2 adj) of the variation in MBA program GPA is


explained by the variation in the three independent or
predictor variables.

SPSS output

The ANOVA shows us that the three predictor variables


together significantly predict MBA program GPA (F3,85= 24.48,
p < .001).

SPSS output

Standardized beta coefficients showed that GMAT score (beta


= .650, p<.001) and years of work experience (beta = .238, p
=.004) contributed most to the prediction of MBA program
GPA, but there is no evidence of a linear relationship between
undergraduate GPA and MBA program GPA (p = .602). These
results suggest that admissions should be based on the GMAT
score as well as the number of years of work experience.
