Hannah Wong
School of Health Policy and Management
Faculty of Health
Correction to Lecture 8 Slide 6 of 34
The least squares estimates of the regression coefficients, a
and b, describing the relationship between BMI and total
cholesterol are computed as follows:
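For reference, the standard least squares estimates for a simple linear regression $\hat{y} = a + bx$ can be written as below. Here x would be BMI and y total cholesterol; the symbols r (sample correlation) and s_x, s_y (sample standard deviations) are conventional notation assumed for this sketch, not taken from the slide.

```latex
b = r\,\frac{s_y}{s_x}
  = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2},
\qquad
a = \bar{y} - b\,\bar{x}
```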
Number of predictors
The number of independent variables that should be used in
your model is limited by the number of participants or
observations.
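A minimal sketch of how such a limit might be applied. The ratio of about 10 observations per predictor is an assumption inferred from the collinearity example later in this lecture (21 patients, at most 2 predictors); the function name `max_predictors` is hypothetical.

```python
def max_predictors(n_observations, per_predictor=10):
    """Largest number of predictors allowed under an 'n per predictor' rule of thumb.

    per_predictor=10 is an assumed ratio, not a value stated on the slide.
    """
    return n_observations // per_predictor


print(max_predictors(21))  # IBS example: 21 patients
print(max_predictors(76))  # hand-washing example: N = 76
```

Under this assumed ratio, the hand-washing model with four predictors and 76 participants stays within the limit.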
Standard Multiple Regression
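The general formula for multiple regression is:
$$\hat{y} = a + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$$
where $\hat{y}$ is the predicted value of the outcome, a is the constant
(intercept), and $b_1, \dots, b_k$ are the regression coefficients for
the k predictors $x_1, \dots, x_k$.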
Categorical independent variables
To consider race/ethnicity as a predictor in a regression
model, we create a set of dummy variables. To create the set
of dummy variables, we decide on a reference group or
category. In this example, the reference group is the racial
group that we will compare the other groups against.
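The dummy coding described above can be sketched as follows. The category labels follow Example 1 in this lecture, with white as the reference group; the five-person list and the helper name `dummy_code` are hypothetical and only illustrate the mechanics.

```python
def dummy_code(values, reference):
    """Return one 0/1 indicator column per non-reference category."""
    levels = sorted(set(values) - {reference})
    return {level: [1 if v == level else 0 for v in values] for level in levels}


# Hypothetical sample of five mothers
race = ["white", "Hispanic", "black", "other", "white"]
dummies = dummy_code(race, reference="white")

for level, column in dummies.items():
    print(level, column)
```

A variable with four categories yields three dummy variables; members of the reference group score 0 on all of them.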
In the multiple regression model, the regression coefficients
associated with each of the dummy variables (representing in
this example each race/ethnicity group) are interpreted as the
expected difference in the mean of the outcome variable for
that race/ethnicity as compared to the reference group,
holding all other predictors constant.
Example 1
An observational study is conducted to investigate risk factors
associated with infant birth weight. The study involves 832
pregnant women. Each woman provides demographic and
clinical data and is followed through the outcome of the
pregnancy. At the time of delivery, the infant's birth weight is
measured, in grams, as is the gestational age, in weeks.
Birth weights vary widely and range from 404 to 5400 grams.
The mean birth weight is 3367.83 grams, with a standard
deviation of 537.21 grams. Investigators wish to determine
whether there are differences in birth weight by infant gender,
gestational age, mother's age, and mother's race. In the study
sample, 421/832 (50.6%) of the infants are male, and the
mean gestational age at birth is 39.49 weeks with a standard
deviation of 1.81 weeks (range: 22-43 weeks).
The mothers' mean age is 30.83 years with a standard
deviation of 5.76 years (range: 17-45 years). Approximately
49% of the mothers are white, 41% are Hispanic, 5% are black,
and 5% identify themselves as other race. A multiple
regression analysis is performed relating infant gender,
gestational age in weeks, mother's age in years, and three
dummy or indicator variables reflecting mother's race to birth
weight. The results are shown below.
Multiple Regression Analysis
                           Reg. Coefficient        t   P-value
Intercept                          -3850.92   -11.56      .000
Male infant                          174.79     6.06      .000
Gestational age (weeks)              179.89    22.35      .000
Mother's age (years)                   1.38     0.47     .6361
Black race                          -138.46    -1.93     .0535
Hispanic race                        -13.07    -0.37     .7103
Other race                           -68.67    -1.05     .2918
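To make the table concrete, the sketch below plugs the reported coefficients into the regression equation. White is the reference race because it is the only group without a dummy variable. The infant profile used (male, 40 weeks, mother aged 30) is hypothetical, as are the names `COEF` and `predict_birth_weight`.

```python
# Coefficients copied from the regression table above
COEF = {
    "intercept": -3850.92,
    "male": 174.79,
    "gest_age_weeks": 179.89,
    "mother_age_years": 1.38,
    "black": -138.46,
    "hispanic": -13.07,
    "other": -68.67,
}


def predict_birth_weight(male, gest_age_weeks, mother_age_years, race="white"):
    """Predicted birth weight in grams; white is the reference (all dummies 0)."""
    y = COEF["intercept"]
    y += COEF["male"] * male
    y += COEF["gest_age_weeks"] * gest_age_weeks
    y += COEF["mother_age_years"] * mother_age_years
    if race != "white":
        y += COEF[race]  # only the matching race dummy equals 1
    return y


# Hypothetical infants that differ only in mother's race
boy_white = predict_birth_weight(1, 40, 30, "white")
boy_black = predict_birth_weight(1, 40, 30, "black")
print(round(boy_white, 2))              # predicted grams, reference group
print(round(boy_black - boy_white, 2))  # equals the Black race coefficient
```

The difference between the two predictions is exactly the dummy coefficient, which is what "expected difference compared to the reference group, holding all other predictors constant" means in practice.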
SPSS Output: Descriptive statistics
Descriptive Statistics
                                       Mean   Std Deviation    N
Frequency of hand washing              7.69            2.04   76
Perceived importance of hand washing   9.52            1.26   76
SPSS output: variables in the regression equation
The variables entered/removed table simply confirms which
variables have been entered in the equation, i.e. the four
predictor variables.
ANOVA (b)
Model          Sum of squares   df   Mean square        F   Sig
1 Regression          127.421    4        31.855   12.173   .000 (a)
  Residual            185.803   71         2.617
  Total               313.224   75
a. Predictors: (Constant), workplace assisted hand washing, risk to self,
   perceived importance of hand washing, risk to others
b. Dependent Variable: frequency of hand washing
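As a check on how the ANOVA table fits together, the sketch below recomputes the mean squares and F from the sums of squares and degrees of freedom. It also derives the proportion of variance explained (regression SS divided by total SS), which is computed here from the table rather than quoted from SPSS output.

```python
# Values copied from the ANOVA table above
ss_regression, df_regression = 127.421, 4
ss_residual, df_residual = 185.803, 71

ss_total = ss_regression + ss_residual
ms_regression = ss_regression / df_regression  # mean square = SS / df
ms_residual = ss_residual / df_residual
f_statistic = ms_regression / ms_residual      # F = MS regression / MS residual
r_squared = ss_regression / ss_total           # share of variance explained

print(round(ms_regression, 3), round(ms_residual, 3))
print(round(f_statistic, 3))
print(round(r_squared, 3))
```

The four predictors together account for roughly 41% of the variance in frequency of hand washing.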
SPSS output: individual coefficients
To see the separate effects of the predictor variables on the
outcome variable, we look at the coefficients table.
Coefficients (a)
B and Std error are the unstandardised coefficients; Beta is the
standardised coefficient; the bounds are the 95% confidence interval for B.
Model                                    B  Std error   Beta       t   Sig  95% CI lower  95% CI upper
1 (Constant)                        -5.397      2.009         -2.686  .009        -9.403        -1.391
  Perceived importance                .403       .160   .248   2.523  .014          .084          .721
  Risk to self                        .215       .134   .171   1.602  .114         -.052          .482
  Risk to others                      .453       .211   .237   2.147  .035          .032          .874
  Workplace assisted hand washing     .345       .095   .343   3.630  .001          .156          .535
a. Dependent Variable: frequency of hand washing
Individual Coefficients
The Constant, a
The value of the constant, a, is -5.397. This is not of great
interest to us unless we want to use the regression formula to
predict scores in a new sample of people.
Unstandardized b weights
Let's look at the perceived importance of hand washing.
The unstandardized coefficient, $b_1$, is .403. This means that
for every one-unit increase in perceived importance of hand
washing ($x_1$), the predicted frequency of hand washing ($\hat{y}$)
rises by .403, holding the other predictors constant. Thus, for
every one-point increase in perceived importance of hand
washing, the predicted frequency of hand washing rises by
about four tenths of a point.
Standardized beta weights
Beta weights are the standardized b weights or standardized
regression coefficients. Recall, in standardization, we create z-
scores. For each variable, we calculate the mean and standard
deviation across all observations. Then, for each observation,
we subtract out the mean and divide by the standard
deviation for that variable.
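So, if our regression equation is $\hat{y} = a + b_1 x_1 + b_2 x_2 + b_3 x_3$, the
equation used to find standardized regression coefficients
(denoted beta) is:
$$z_{\hat{y}} = \beta_1 z_{x_1} + \beta_2 z_{x_2} + \beta_3 z_{x_3}$$
where each $\beta_j = b_j \, s_{x_j} / s_y$. There is no constant term
because every standardized variable has a mean of zero.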
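A short sketch of the link between the two kinds of coefficient: a beta weight is the unstandardized b rescaled by the ratio of the predictor's standard deviation to the outcome's. The inputs are the rounded values reported in this lecture for perceived importance of hand washing (b = .403, SD = 1.26) and frequency of hand washing (SD = 2.04); the function name `standardized_beta` is hypothetical.

```python
def standardized_beta(b, sd_x, sd_y):
    """Convert an unstandardized coefficient b into a beta weight."""
    return b * sd_x / sd_y


# Perceived importance of hand washing, using the rounded published values
beta_importance = standardized_beta(b=0.403, sd_x=1.26, sd_y=2.04)
print(round(beta_importance, 3))
```

The result agrees with the Beta of .248 in the SPSS coefficients table to within rounding of the inputs.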
The coefficients table (continued)
The confidence intervals relate to the unstandardized b
weights. So although the sample estimate is b = .403, we can be
95% confident that the true value of b in the general
population lies somewhere between .084 and .721.
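The t statistics and confidence bounds in the coefficients table can be reproduced from B and its standard error, as sketched below. The critical value 1.994 is an assumption: it is the approximate two-sided 95% t value for the 71 residual degrees of freedom in the ANOVA table, hard-coded here to avoid a statistics library. The function name `t_and_ci` is hypothetical.

```python
T_CRIT_DF71 = 1.994  # assumed: approximate t(0.975) with 71 degrees of freedom


def t_and_ci(b, se, t_crit=T_CRIT_DF71):
    """Return (t statistic, lower bound, upper bound) for a coefficient."""
    t = b / se
    return t, b - t_crit * se, b + t_crit * se


# Perceived importance of hand washing: B = .403, Std error = .160
t_imp, lower_imp, upper_imp = t_and_ci(0.403, 0.160)
print(round(t_imp, 2), round(lower_imp, 3), round(upper_imp, 3))

# Constant: B = -5.397, Std error = 2.009
t_const, lower_const, upper_const = t_and_ci(-5.397, 2.009)
print(round(t_const, 2), round(lower_const, 3), round(upper_const, 3))
```

Small differences from the SPSS values in the third decimal place come from using rounded inputs.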
The regression equation
$\hat{y} = -5.397 + .403x_1 + .215x_2 + .453x_3 + .345x_4$
Here:
$\hat{y}$ = predicted frequency of hand washing
$x_1$ = perceived importance of hand washing
$x_2$ = risk to self
$x_3$ = risk to others
$x_4$ = workplace assisted hand washing
Interpretation
If perceived importance of hand washing, risk to self, and
risk to others are all held constant, predicted frequency of
hand washing increases by an estimated .345 units for every
one-unit increase in workplace assisted hand washing.
We estimate that two groups of subjects with the same
perceived importance of hand washing, risk to self, and
risk to others, who differ by one unit of workplace assisted
hand washing, will have predicted frequency of hand
washing levels that differ on average by .345 units.
Predicting an individual's score
Assume that a new participant, Kelly, has scores on the
predictor variables, but that we have no information on the
frequency of her hand washing. We can predict (estimate) her
likely score on frequency of hand washing by substituting
her x scores into the regression equation.
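A minimal sketch of such a prediction, using the fitted coefficients from the coefficients table. Kelly's actual scores are not given here, so the four x values below are hypothetical placeholders chosen only to show the arithmetic; the function name `predict_hand_washing` is also hypothetical.

```python
def predict_hand_washing(x1, x2, x3, x4):
    """Predicted frequency of hand washing from the fitted equation.

    x1 = perceived importance, x2 = risk to self,
    x3 = risk to others, x4 = workplace assisted hand washing.
    """
    return -5.397 + 0.403 * x1 + 0.215 * x2 + 0.453 * x3 + 0.345 * x4


# HYPOTHETICAL scores, not the values from the lecture
predicted = predict_hand_washing(x1=9, x2=5, x3=5, x4=6)
print(round(predicted, 2))
```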
Visualizing multiple regression
The figure shows the overlapping areas of variance; areas a, b,
c and d are all areas of shared variance. Predictor x1 shares
unique variance with the outcome variable y (area a) and unique
variance with predictor x2 (area b). In addition, predictor x2
shares unique variance with the outcome variable y (area c).
The variance shared by x1, x2 and the outcome y is illustrated
by area d in the middle.
Collinearity
The adverse effect of collinearity is that independent variables
in a regression are so highly correlated that it becomes
difficult or impossible to distinguish their individual or unique
effects on the dependent variable.
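One simple way to screen for this problem is to inspect the correlations among the predictors themselves, as sketched below. The symptom names and values are hypothetical toy data, and the 0.8 cutoff is an assumed threshold for illustration rather than a rule taken from this lecture.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def flag_collinear_pairs(predictors, threshold=0.8):
    """List predictor pairs whose absolute correlation reaches the threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(predictors.items(), 2):
        r = pearson_r(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# HYPOTHETICAL predictors: "bloating" is almost exactly twice "pain"
toy = {
    "pain": [1, 2, 3, 4, 5, 6],
    "bloating": [2, 4, 6, 8, 10, 12.5],
    "fatigue": [3, 1, 4, 1, 5, 2],
}
flagged = flag_collinear_pairs(toy)
print(flagged)
```

When a pair is flagged, the regression cannot cleanly separate the two predictors' individual contributions to the outcome.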
Example of collinearity
Open the Excel data file: 2300 Winter Term Lec 9 (worksheet: ibs).
The file contains six symptoms of IBS and records for
21 patients. Note that we've already broken the rule of thumb for
the number of predictors (with 21 patients, the maximum should
be 2). We will proceed, however, to illustrate collinearity.
Confounding
Multiple regression analysis can be used to assess whether
confounding exists.
Confounding is present when the relationship between a risk
factor and an outcome is distorted by a third variable, called
the confounding variable. A confounding variable, or
confounder, is one that is related both to the main risk factor
or predictor of interest and to the outcome.
Example of confounding
Suppose we want to assess the association between BMI and
systolic blood pressure using data collected at an exam. A total
of n = 3539 participants attended the exam; their mean systolic
blood pressure is 127.3 with a standard deviation of 19.0, and
the mean BMI in the sample is 28.2 with a standard deviation of
5.3. A simple linear regression analysis reveals the results in
the table below. The simple linear regression model is
$\hat{y} = 108.28 + 0.67\,(\text{BMI})$, where $\hat{y}$ is predicted
systolic blood pressure. The association between BMI and
systolic blood pressure is statistically significant (p < .001).
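As a quick check on the fitted model, the sketch below evaluates it at the sample mean BMI and for a five-unit difference in BMI. The function name `predict_sbp` is hypothetical; the coefficients and the mean BMI come from the example above.

```python
def predict_sbp(bmi):
    """Predicted systolic blood pressure from the simple regression model."""
    return 108.28 + 0.67 * bmi


at_mean_bmi = predict_sbp(28.2)                # sample mean BMI
five_unit_gap = predict_sbp(30) - predict_sbp(25)

print(round(at_mean_bmi, 1))
print(round(five_unit_gap, 2))
```

The prediction at the mean BMI lands close to the sample mean systolic blood pressure of 127.3, as expected for a least squares line, which passes through the point of means (the small gap reflects rounded coefficients).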
The multiple regression model produces an estimate of the
association between BMI and systolic blood pressure that
accounts for differences in systolic blood pressure due to age,
gender, and treatment for hypertension.
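One common way to judge whether confounding is present is to compare the crude coefficient with the adjusted one. The sketch below assumes a cutoff of a 10% change, a widely used rule of thumb that is not stated in this text. The crude BMI coefficient (0.67) comes from the simple regression in this example; the adjusted value is a hypothetical placeholder, since the multiple regression results are not reproduced here.

```python
def percent_change(crude, adjusted):
    """Percent change in a coefficient after adjusting for other variables."""
    return 100 * (crude - adjusted) / crude


crude_bmi = 0.67     # from the simple linear regression
adjusted_bmi = 0.50  # HYPOTHETICAL adjusted coefficient, for illustration only

change = percent_change(crude_bmi, adjusted_bmi)
suggests_confounding = abs(change) >= 10  # assumed 10% rule of thumb

print(round(change, 1), suggests_confounding)
```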
Using multiple regression in an exploratory-type way
Select Analyze, Regression, Linear.
Move the predictor variables from the left into the
Independent(s) box.
Move the variable to be predicted (mba_gpa) into the
Dependent box.
Press Statistics, then choose Estimates, Confidence intervals,
Model fit and Descriptives.
Click on Continue, then OK.
SPSS output