Logistic Regression

Generalized linear models

Family of regression models. The outcome variable determines the choice of model.

Outcome       Model
Continuous    Linear regression
Counts        Poisson regression
Survival      Cox model
Binomial      Logistic regression

Uses: control of confounding; model building and risk prediction.
What is Logistic Regression?

A form of regression that allows the prediction of discrete variables from a mix of continuous and discrete predictors.
Logistic regression is often used because the relationship between the DV (a discrete variable) and a predictor is non-linear.
Example: the probability of heart disease changes very little with a ten-point difference among people with low blood pressure, but a ten-point change can mean a drastic change in the probability of heart disease among people with high blood pressure.
Logistic regression
Models the relationship between a set of variables xi
  dichotomous (yes/no)
  categorical (social class, ...)
  continuous (age, ...)
and a dichotomous (binary) variable Y.
A dichotomous outcome is the most common situation in biology and epidemiology.
We code the dichotomous variable:
  0: no disease, survived, non-smoker, female, ...
  1: disease, died, smoker, male, ...
We code with 1 the outcome we want to predict.
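Linear regression
$y = \beta_0 + \beta_1 x + \varepsilon$
The error $\varepsilon$ is normally distributed, with mean 0 and constant variance (i.e., homogeneity of variance).
Binary dependent variable: y = 0 or y = 1
  y = 1: positive response, with probability P
  y = 0: negative response, with probability Q = (1 - P)
With a binary y the error can take only two values, so it cannot be normally distributed:
$\varepsilon = 1 - \beta_0 - \beta_1 x$ when y = 1
$\varepsilon = 0 - \beta_0 - \beta_1 x$ when y = 0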
Applications
Compare two (success) probabilities with correction for prognostic factors (clinical trials)
Determine which risk factors are important and which are not (epidemiology)
Determine the dose-response relation (toxicology)
Example 1: Is smoking a predictor of CHD?

Design: 60 patients
Two groups: smokers and non-smokers
  Smokers = 1
  Non-smokers = 0
Outcome variable: coronary heart disease (CHD), yes/no
Research question: does the occurrence of CHD differ between smokers and non-smokers?
Example 1

                 CHD +     CHD -     total
smoking +        17 (a)     7 (c)    24 (m)
smoking -         9 (b)    27 (d)    36 (n)
total            26 (r)    34 (s)    60 (N)

Analysis: Student t test for proportions, comparing p(CHD+) in smokers with p(CHD+) in non-smokers: t = 3.53, p < 0.01; or a chi-square test.

Dichotomous outcome variable Y (0/1):
[Figure: CHD (0/1) plotted against smoking (0/1)]
Data transformation is required!
Example 1

$\text{Odds}_{CHD}(\text{smokers}) = \frac{a/m}{c/m} = \frac{a}{c} = \frac{17}{7} = 2.429$
A smoker is 2.429 times as likely to have CHD as not to have CHD.

$\text{Odds}_{CHD}(\text{non-smokers}) = \frac{b/n}{d/n} = \frac{b}{d} = \frac{9}{27} = 0.333$
A non-smoker is 0.333 times as likely to have CHD as not to have CHD.
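
As a minimal sketch, the two odds can be recomputed in Python from the counts in the Example 1 table. The function name `odds` is illustrative and not part of the original material.

```python
def odds(events, non_events):
    """Odds = number with the event / number without the event."""
    return events / non_events

# Counts from the Example 1 table, using its cell labels
a, c = 17, 7    # smokers: CHD+, CHD-
b, d = 9, 27    # non-smokers: CHD+, CHD-

odds_smokers = odds(a, c)
odds_nonsmokers = odds(b, d)
print(round(odds_smokers, 3), round(odds_nonsmokers, 3))
```

Because the row totals m and n cancel, only the cell counts are needed.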
Odds
Odds for an event:
$\text{odds} = \frac{p}{1 - p}$
$\log(\text{odds}) = \log\left(\frac{p}{1 - p}\right)$
p is the probability that the event occurs.
The greater the odds of an event, the greater the probability that the event occurs.
Logit transformation
The logit transformation gives a linear relationship between the probability of the observed event, expressed as log odds, and the value of the independent variable x:
$\log(\text{odds}) = \log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x$
The model is similar to the simple regression model, but:
  the distribution is binomial, not normal
  the coefficients are not estimated in the same way as the coefficients a and b of the linear regression model
Logit transformation
Logit: the natural logarithm (ln) of the odds that the observed event (coded 1) occurs
  also referred to as the log odds
  the logit scale is continuous and behaves similarly to the z-score scale
    p = 0.50, logit = 0
    p = 0.70, logit = 0.85
    p = 0.30, logit = -0.85
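
The logit values for a few probabilities can be checked with a short Python sketch. The helper name `logit` is an illustrative choice; only the standard library is used.

```python
import math

def logit(p):
    """Natural log of the odds p / (1 - p), defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))

# The logit is symmetric around p = 0.5, where it equals 0
for p in (0.30, 0.50, 0.70):
    print(p, round(logit(p), 3))
```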
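Logistic regression model

The equation for P(y = 1) with one predictor is:
$\log(\text{odds}) = \log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x$
$\frac{p}{1 - p} = e^{\beta_0 + \beta_1 x}$
$p = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$ (for the population)
$p = \frac{e^{b_0 + b_1 x}}{1 + e^{b_0 + b_1 x}}$ (for the sample)
e ≈ 2.718; p = P(y = 1); x = predictor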
Logistic regression model

The equation for P(y = 1) with several predictors is:
$p = \frac{e^{b_0 + b_1 x_1 + b_2 x_2 + \dots}}{1 + e^{b_0 + b_1 x_1 + b_2 x_2 + \dots}}$
e ≈ 2.718; p = P(y = 1)
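
A minimal Python sketch of this formula follows. The function name `predict_probability` and the coefficient values in the usage lines are hypothetical and chosen only for illustration; they do not come from either example.

```python
import math

def predict_probability(b0, coefs, xs):
    """P(y = 1) = e^z / (1 + e^z), with z = b0 + b1*x1 + b2*x2 + ..."""
    z = b0 + sum(b * x for b, x in zip(coefs, xs))
    return math.exp(z) / (1.0 + math.exp(z))

# Hypothetical coefficients and predictor values, for illustration only
p = predict_probability(-2.0, [0.8, 0.05], [1, 30])
print(round(p, 3))
```

Whatever the value of z, the result stays strictly between 0 and 1, which is why this form suits a binary outcome.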
Interpretation of the coefficients b0 and b1
b0: needed for the equation but of no importance for interpretation; it is the value of the log odds when the predictor equals 0
b1: a measure of the association between the predictor and the log odds of the event of interest
  b1 > 0: positive association
  b1 = 0: no association
  b1 < 0: negative association
Interpretation of the coefficient b1
b1 is the amount by which the log odds of the event of interest change when the predictor x changes by one unit.
Example:
  person 1, predictor (x) = k
  person 2, predictor (x) = k + 1

The equations for the log odds are:
  log(odds of the event for person 2) = b0 + b1 (k + 1)
  log(odds of the event for person 1) = b0 + b1 (k)
Expanding:
  log(odds of the event for person 2) = b0 + b1 (k) + b1
  log(odds of the event for person 1) = b0 + b1 (k)

The log odds of the event of interest for person 2, whose predictor is x = k + 1, therefore differ from the log odds for person 1, whose predictor is x = k, by the value of the coefficient b1:
  b1 = log(odds of the event for person 2) - log(odds of the event for person 1)
  b1 = log(odds of the event for person 2 / odds of the event for person 1)
  b1 = log(odds ratio)
  odds ratio (OR) = $e^{b_1}$

b1 = 0: the odds and the probability of the event of interest are the same for all values of x ($e^{b_1}$ = OR = 1)
b1 > 0: the odds and the probability of the event of interest increase as x increases ($e^{b_1}$ = OR > 1)
b1 < 0: the odds and the probability of the event of interest decrease as x increases ($e^{b_1}$ = OR < 1)
Example 1: Odds ratio

$\text{Odds}_{CHD}(\text{smokers}) = \frac{a/m}{c/m} = \frac{a}{c} = \frac{17}{7} = 2.429$
$\text{Odds}_{CHD}(\text{non-smokers}) = \frac{b/n}{d/n} = \frac{b}{d} = \frac{9}{27} = 0.333$
$\text{Odds ratio (OR)} = \frac{2.429}{0.333} = 7.286$

Interpretation: the odds of CHD among smokers are 7.29 times the odds among non-smokers.
Odds ratio (relative odds, cross-product ratio)

The odds ratio (OR) is the ratio of the odds of prior exposure in the group in which the event of interest is present (coded 1) to the odds of prior exposure in the group in which it is absent (coded 0):

                      event present (+)    event absent (-)    total
exposure yes (+)      a                    c                   m (a + c)
exposure no (-)       b                    d                   n (b + d)
total                 r (a + b)            s (c + d)           N (a + b + c + d)

Odds of the event among the exposed: (a/m) / (c/m) = a/c
Odds of the event among the unexposed: (b/n) / (d/n) = b/d
Odds ratio: (a/c) / (b/d) = ad/bc
The ratio of the exposure odds, (a/b) / (c/d), gives the same value ad/bc.
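
The cross-product form can be checked against the ratio of the two odds with a short Python sketch. It uses the Example 1 counts; the function name `odds_ratio` is illustrative.

```python
def odds_ratio(a, b, c, d):
    """Cross-product odds ratio ad / bc for a 2x2 table with cells a, b, c, d."""
    return (a * d) / (b * c)

# Example 1: a = smokers with CHD, c = smokers without CHD,
#            b = non-smokers with CHD, d = non-smokers without CHD
a, b, c, d = 17, 9, 7, 27

ratio_of_odds = (a / c) / (b / d)
cross_product = odds_ratio(a, b, c, d)
print(round(ratio_of_odds, 3), round(cross_product, 3))
```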
Interpretation of coefficients
Odds (smokers) = 2.429, ln(odds) = 0.887
Odds (non-smokers) = 0.333, ln(odds) = -1.099
The model for this example is
$\ln\left(\frac{p}{1 - p}\right) = b_0 + b_1 x$
For non-smokers (x = 0) we have
$\ln\left(\frac{p}{1 - p}\right) = b_0 + b_1 \cdot 0 = b_0$
The estimate of the intercept, $b_0$, is therefore the log odds for non-smokers:
$\ln\left(\frac{p}{1 - p}\right) = b_0 = -1.099$
Interpretation of coefficients
The estimate of the slope is the difference between the log odds for smokers and the log odds for non-smokers:
$b_1 = \ln\left(\frac{p_1}{1 - p_1}\right) - \ln\left(\frac{p_0}{1 - p_0}\right) = 0.887 - (-1.099) = 1.986$

The fitted model is: log(odds) = -1.099 + 1.986x
The odds ratio is:
$\frac{\text{Odds}_{\text{smokers}}}{\text{Odds}_{\text{non-smokers}}} = \frac{e^{-1.099 + 1.986}}{e^{-1.099}} = e^{1.986} = 7.286$
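
To see that a maximum-likelihood fit gives the same intercept and slope as the log odds computed by hand, here is a sketch of a Newton-Raphson fit on the grouped Example 1 data. This is an illustrative implementation under stated assumptions, not a description of how SPSS computes its estimates; the function name `fit_logistic_grouped` and the fixed iteration count are choices made for this sketch.

```python
import math

def fit_logistic_grouped(groups, iterations=25):
    """Newton-Raphson ML fit of log(odds) = b0 + b1*x on grouped binary data.

    groups: list of (x, events, total) tuples.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0           # gradient of the log-likelihood
        i00 = i01 = i11 = 0.0   # information matrix (negative Hessian)
        for x, events, total in groups:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            resid = events - total * p
            w = total * p * (1.0 - p)
            g0 += resid
            g1 += x * resid
            i00 += w
            i01 += x * w
            i11 += x * x * w
        det = i00 * i11 - i01 * i01
        # Newton step: solve the 2x2 system (information matrix) * delta = gradient
        b0 += (i11 * g0 - i01 * g1) / det
        b1 += (i00 * g1 - i01 * g0) / det
    return b0, b1

# Example 1: (smoking, CHD cases, group size)
groups = [(1, 17, 24), (0, 9, 36)]
b0, b1 = fit_logistic_grouped(groups)
print(round(b0, 3), round(b1, 3), round(math.exp(b1), 3))
```

With a single binary predictor the model is saturated, so the fitted intercept equals ln(9/27) and the fitted slope equals the log of the odds ratio.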
Logistic regression in SPSS

In the menu, click Analyze, point to Regression, then point to Binary Logistic... and click.
Dependent: chd
Covariates: smoking
Method: Enter
Then click Continue and OK.

Example 1 in SPSS
Point to the variable labeled chd and move it to the box labeled Dependent by clicking the arrow.
Point to the variable labeled smoking and move it to the box labeled Covariates by clicking the arrow.
Method: Enter

Example 1 in SPSS
Click Options, check CI for exp(B), and click Continue.
Then click OK.
Example 1 in SPSS - Output

Case Processing Summary
We see that there are 60 cases used in the analysis.

Unweighted Cases (a)                            N     Percent
Selected Cases      Included in Analysis       60     100.0
                    Missing Cases               0        .0
                    Total                      60     100.0
Unselected Cases                                0        .0
Total                                          60     100.0
a. If weight is in effect, see classification table for the total number of cases.

Classification Table (a, b), Step 0
                        Predicted CHD
Observed                0        1        Percentage Correct
CHD    0               34        0        100.0
       1               26        0           .0
Overall Percentage                          56.7
a. Constant is included in the model.
b. The cut value is .500

The Block 0 output is for a model that includes only the intercept (which SPSS calls the constant). Given the base rates of the two CHD outcomes (34/60 = 56.7% without CHD, 43.3% with CHD) and no other information, the best strategy is to predict, for every case, that the subject does not have CHD. Using that strategy, you would be correct 56.7% of the time.

Under Variables in the Equation you see that the intercept-only model is ln(odds) = -0.268. The predicted odds of having CHD, Exp(B), are 0.765 (26/34) for every subject, because this model does not yet include smoking.

Variables in the Equation
                     B        S.E.    Wald     df    Sig.    Exp(B)
Step 0   Constant    -.268    .261    1.060    1     .303    .765

Omnibus Tests of Model Coefficients gives us a chi-square of 12.645 on 1 df, significant beyond 0.001. This is a test of the null hypothesis that adding the smoking variable to the model has not significantly increased our ability to predict CHD in our subjects.

Omnibus Tests of Model Coefficients
                   Chi-square    df    Sig.
Step 1    Step     12.645        1     .000
          Block    12.645        1     .000
          Model    12.645        1     .000
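
The model chi-square is the drop in -2 log likelihood when smoking is added to the intercept-only model. The sketch below reproduces it from the Example 1 counts. It assumes, as holds for a single binary predictor, that each fitted probability equals the observed proportion in its group; the function name `neg2_log_likelihood` is illustrative.

```python
import math

def neg2_log_likelihood(groups):
    """-2 log-likelihood when each group is predicted by its own event rate.

    groups: list of (events, total) pairs.
    """
    ll = 0.0
    for events, total in groups:
        p = events / total
        ll += events * math.log(p) + (total - events) * math.log(1.0 - p)
    return -2.0 * ll

neg2ll_null = neg2_log_likelihood([(26, 60)])             # intercept only
neg2ll_model = neg2_log_likelihood([(17, 24), (9, 36)])   # smoking in the model
chi_square = neg2ll_null - neg2ll_model
print(round(neg2ll_null, 3), round(neg2ll_model, 3), round(chi_square, 3))
```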

Under Model Summary we see that the -2 Log Likelihood statistic is 69.463. This statistic measures how poorly the model predicts the outcome: the smaller the statistic, the better the model. The Cox & Snell R² can be interpreted like R² in a multiple regression, but it cannot reach a maximum value of 1. The Nagelkerke R² can reach a maximum of 1.

Model Summary
Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1       69.463               .190                    .255
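
Both R² values can be recovered from the reported statistics. The sketch below uses the standard definitions of the Cox & Snell and Nagelkerke measures, which are not spelled out in the slides, and treats the -2 log likelihood of the intercept-only model as the sum of the reported model value and the model chi-square.

```python
import math

n = 60                   # cases in Example 1
neg2ll_model = 69.463    # from the Model Summary table
chi_square = 12.645      # model chi-square from the Omnibus Tests table
neg2ll_null = neg2ll_model + chi_square   # intercept-only model (derived)

# Cox & Snell: 1 - (L0 / L1)^(2/n), written in terms of -2 log likelihoods
cox_snell = 1.0 - math.exp(-chi_square / n)
# Nagelkerke: Cox & Snell divided by its maximum attainable value
nagelkerke = cox_snell / (1.0 - math.exp(-neg2ll_null / n))
print(round(cox_snell, 3), round(nagelkerke, 3))
```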

The Variables in the Equation output shows us that the regression equation is
$\log(\text{odds}) = -1.099 + 1.986 \cdot \text{smoking}$

Variables in the Equation (PUSENJE is the smoking variable)
                                                                        95.0% C.I. for EXP(B)
                        B        S.E.    Wald      df    Sig.    Exp(B)    Lower     Upper
Step 1 (a)  PUSENJE     1.986    .591    11.274    1     .001    7.286     2.286     23.223
            Constant    -1.099   .385    8.147     1     .004    .333
a. Variable(s) entered on step 1: PUSENJE.

Wald chi-square: significance of the coefficients in a model
$\text{Wald } \chi^2 = \left(\frac{\text{coefficient}}{SE}\right)^2$
df = 1, $\chi^2_{0.05;\,1} = 3.841$
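
The Wald statistic for smoking can be reproduced from the Example 1 counts. The sketch below assumes the usual large-sample standard error of a log odds ratio in a 2x2 table, sqrt(1/a + 1/b + 1/c + 1/d), which is not stated in the slides, and uses the fact that a chi-square variable with 1 df is a squared standard normal to get the p-value.

```python
import math

a, b, c, d = 17, 9, 7, 27   # Example 1 counts

b1 = math.log((a * d) / (b * c))             # log odds ratio (slope)
se = math.sqrt(1/a + 1/b + 1/c + 1/d)        # assumed 2x2 standard error of the log OR
wald = (b1 / se) ** 2
p_value = math.erfc(math.sqrt(wald / 2.0))   # upper tail of chi-square with df = 1
significant = wald > 3.841                   # critical value at alpha = 0.05
print(round(se, 3), round(wald, 3), significant)
```

Squaring the rounded table values (1.986 / 0.591) gives a slightly different number, because SPSS computes the statistic from unrounded estimates.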

The Variables in the Equation output also gives us Exp(B), the odds ratio predicted by the model. For PUSENJE (smoking), B = 1.986 and Exp(B) = 7.286:
$OR = e^{1.986} = 7.286$

We can now use this model to predict the odds that a subject has CHD. The odds prediction equation is
$\text{odds} = e^{a + bx}$
If our subject is a non-smoker (smoking = 0), then
$\text{odds} = e^{-1.099 + 1.986(0)} = e^{-1.099} = 0.333$
A non-smoker is only 0.333 times as likely to have CHD as not to have CHD.
If our subject is a smoker (smoking = 1), then
$\text{odds} = e^{-1.099 + 1.986(1)} = e^{0.887} = 2.428$
A smoker is 2.428 times as likely to have CHD as not to have CHD.

Convert odds to probability: p = odds / (1 + odds)
Non-smokers: p = 0.333 / (1 + 0.333) = 0.250 = 25%. The probability that a non-smoker will have CHD is 25%.
Smokers: p = 2.428 / (1 + 2.428) = 0.708 = 70.8%. The probability that a smoker will have CHD is 70.8%.
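
The two conversions can be sketched in Python using the fitted coefficients from Example 1. The function name `odds_to_probability` is illustrative.

```python
import math

def odds_to_probability(odds):
    """p = odds / (1 + odds)."""
    return odds / (1.0 + odds)

b0, b1 = -1.099, 1.986   # fitted coefficients from Example 1

odds_nonsmoker = math.exp(b0 + b1 * 0)
odds_smoker = math.exp(b0 + b1 * 1)
p_nonsmoker = odds_to_probability(odds_nonsmoker)
p_smoker = odds_to_probability(odds_smoker)
print(round(p_nonsmoker, 3), round(p_smoker, 3))
```

These predicted probabilities match the observed proportions 9/36 and 17/24 up to rounding of the coefficients.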
Example 2: Risk factors for CHD

Determine whether age, smoking, obesity and cholesterol are risk factors for CHD.
If they are risk factors, determine how strong their effect is.
Variables:
  CHD: 0 = CHD absent; 1 = CHD present; dependent variable, nominal scale (binary)
  Age: 0 = < 50 years; 1 = > 50 years; predictor, categorical variable, nominal scale (binary)
  Smoking: 0 = non-smoker; 1 = smoker; predictor, categorical variable, nominal scale (binary)
  Obesity: 0 = not obese; 1 = obese; predictor, categorical variable, nominal scale (binary)
  Cholesterol: continuous values; predictor, ratio scale
Example 2 - Logistic regression

Makes it possible to estimate an equation that expresses the relationship between a binary outcome and one or more influencing factors (predictors):
  probability of CHD and age
  probability of CHD and smoking
  probability of CHD and obesity
  probability of CHD and cholesterol
  probability of CHD and age + smoking + obesity + cholesterol
and, if of interest:
  probability of CHD and age + smoking
  probability of CHD and age + obesity
  probability of CHD and age + cholesterol
  probability of CHD and smoking + obesity
  probability of CHD and smoking + cholesterol
  probability of CHD and obesity + cholesterol
Example 2 in SPSS
CHD: risk factor Age

Variables in the Equation
                        B        S.E.    Wald     df    Sig.    Exp(B)
Step 1 (a)  AGE         1.810    .588    9.485    1     .002    6.111
            Constant    -1.299   .461    7.958    1     .005    .273
a. Variable(s) entered on step 1: AGE.

b0 = -1.299 (Constant), b1 = 1.810 (AGE)
$OR = e^{1.810} = 6.111$
People older than 50 years have 6.11 times higher odds of developing CHD than people younger than 50 years.

Model Summary
Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1       71.437               .163                    .219
Example 2 in SPSS
CHD: risk factor Smoking

Variables in the Equation
                        B        S.E.    Wald      df    Sig.    Exp(B)
Step 1      PUSENJE     1.986    .591    11.274    1     .001    7.286
            Constant    -1.099   .385    8.147     1     .004    .333

$OR = e^{1.986} = 7.286$
Smokers have 7.29 times higher odds of developing CHD than non-smokers.

Model Summary
Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1       69.463               .190                    .255
Example 2 in SPSS
CHD: risk factor Obesity

Variables in the Equation
                        B        S.E.    Wald     df    Sig.    Exp(B)
Step 1 (a)  OBESITY     1.176    .553    4.520    1     .034    3.241
            Constant    -.734    .351    4.368    1     .037    .480
a. Variable(s) entered on step 1: OBESITY.

$OR = e^{1.176} = 3.241$
Obese people have 3.24 times higher odds of developing CHD than non-obese people.
Example 2 in SPSS
CHD: risk factor Cholesterol

$OR = e^{0.696} = 2.005$
When cholesterol increases by one unit (1 mmol/L), the odds of developing CHD increase 2.005-fold.
Example 2
Click Options, check CI for exp(B) and Hosmer-Lemeshow goodness-of-fit, and click Continue.
Then click OK.

Example 2
Point to the variable labeled chd and move it to the box labeled Dependent by clicking the arrow.
Point to the variables labeled smoking, then obesity, age and cholesterol, and move them to the box labeled Covariates by clicking the arrow.
Method: Enter

The -2 Log Likelihood statistic has dropped to 55.86, indicating that our expanded model does a better job of predicting CHD than the one-predictor model. The R² statistics have also increased.

The Hosmer-Lemeshow test evaluates the null hypothesis that the model fits the data, i.e. that the observed and predicted numbers of events agree across groups of subjects ordered by predicted probability. A non-significant result (here Sig. = .694) gives no evidence of poor fit.

Hosmer and Lemeshow Test
Step    Chi-square    df    Sig.
1       5.583         8     .694
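
The Sig. value is the upper-tail probability of the chi-square statistic. For an even number of degrees of freedom this tail has a closed form, which the sketch below uses so that only the standard library is needed; the function name `chi2_sf_even_df` is illustrative, and the restriction to even df is a limitation of this sketch, not of the test.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    # Sum of (x/2)^i / i! for i = 0 .. df/2 - 1
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

p_four_predictors = chi2_sf_even_df(5.583, 8)   # Hosmer-Lemeshow, Method Enter
print(round(p_four_predictors, 3))
```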

Variables in the Equation
                                                                         95.0% C.I. for EXP(B)
                         B        S.E.     Wald     df    Sig.    Exp(B)    Lower     Upper
Step 1 (a)  SMOKING      2.067    .695     8.843    1     .003    7.899     2.023     30.840
            OBESITY      1.126    .675     2.785    1     .095    3.084     .822      11.575
            AGE          1.615    .781     4.280    1     .039    5.027     1.089     23.216
            CHOLESTE     .314     .334     .886     1     .347    1.369     .712      2.633
            Constant     -4.482   2.013    4.960    1     .026    .011
a. Variable(s) entered on step 1: SMOKING, OBESITY, AGE, CHOLESTE.

Comparison of the one-predictor models with the four-predictor model
               one-predictor model      four-predictor model
               OR        p              OR        p
smoking        7.286     < 0.05         7.899     < 0.05
obesity        3.241     < 0.05         3.084     > 0.05
age            6.111     < 0.05         5.027     < 0.05
cholesterol    2.005     < 0.05         1.369     > 0.05
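
The confidence limits for Exp(B) in the four-predictor table can be approximated from B and S.E. as exp(B ± 1.96 · SE). The sketch below applies this to the rounded table values, so the results differ slightly from the SPSS output; the function name `exp_b_interval` is illustrative.

```python
import math

def exp_b_interval(b, se, z=1.96):
    """Odds ratio exp(B) with an approximate 95% Wald interval exp(B +/- z*SE)."""
    return math.exp(b), math.exp(b - z * se), math.exp(b + z * se)

# (B, S.E.) pairs from the four-predictor Variables in the Equation table
rows = {
    "SMOKING": (2.067, 0.695),
    "OBESITY": (1.126, 0.675),
    "AGE": (1.615, 0.781),
    "CHOLESTE": (0.314, 0.334),
}
intervals = {name: exp_b_interval(b, se) for name, (b, se) in rows.items()}

for name, (odds_ratio, lower, upper) in intervals.items():
    includes_one = lower < 1.0 < upper
    print(name, round(odds_ratio, 3), round(lower, 3), round(upper, 3), includes_one)
```

An interval that includes 1 corresponds to p > 0.05 in the comparison table, which is the case for obesity and cholesterol in the four-predictor model.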
Example 2 in SPSS: Method Forward: Wald

Point to the variable labeled chd and move it to the box labeled Dependent by clicking the arrow.
Point to the variables labeled smoking, then obesity, age and cholesterol, and move them to the box labeled Covariates by clicking the arrow.
Method: Forward: Wald
Example 2 in SPSS: Method Forward: Wald - Output

Variables in the Equation
                                                                         95.0% C.I. for EXP(B)
                         B        S.E.    Wald      df    Sig.    Exp(B)    Lower     Upper
Step 1 (a)  SMOKING      1.986    .591    11.274    1     .001    7.286     2.286     23.223
            Constant     -1.099   .385    8.147     1     .004    .333
Step 2 (b)  SMOKING      2.092    .669    9.776     1     .002    8.104     2.183     30.080
            AGE          1.924    .675    8.129     1     .004    6.846     1.824     25.688
            Constant     -2.239   .636    12.376    1     .000    .107
a. Variable(s) entered on step 1: SMOKING.
b. Variable(s) entered on step 2: AGE.

Variables not in the Equation
                                        Score     df    Sig.
Step 1    Variables    OBESITY          3.769     1     .052
                       AGE              9.234     1     .002
                       CHOLESTE         6.060     1     .014
          Overall Statistics            12.654    3     .005
Step 2    Variables    OBESITY          3.247     1     .072
                       CHOLESTE         1.262     1     .261
          Overall Statistics            4.106     2     .128

Model Summary
Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1       69.463               .190                    .255
2       60.020               .308                    .413

Hosmer and Lemeshow Test
Step    Chi-square    df    Sig.
2       .053          2     .974

Determine whether age, smoking, obesity and cholesterol are risk factors for CHD.
If they are risk factors, determine how strong their effect is.
Variables:
  CHD: 0 = CHD absent; 1 = CHD present; dependent variable, nominal scale (binary)
  Age: continuous values; predictor, ratio scale
  Smoking: 0 = non-smoker; 1 = smoker; predictor, categorical variable, nominal scale (binary)
  Obesity (BMI): continuous values; predictor, ratio scale
  Cholesterol: continuous values; predictor, ratio scale

Hosmer and Lemeshow Test
Step    Chi-square    df    Sig.
1       6.370         8     .606

Variables in the Equation
                                                                          95.0% C.I. for EXP(B)
                         B         S.E.     Wald      df    Sig.    Exp(B)    Lower     Upper
Step 1 (a)  SMOKING      2.569     .867     8.788     1     .003    13.054    2.388     71.358
            BMI          .297      .125     5.680     1     .017    1.346     1.054     1.719
            YEARS        .106      .038     7.603     1     .006    1.111     1.031     1.198
            CHOLESTE     -.031     .396     .006      1     .938    .970      .446      2.107
            Constant     -14.624   4.594    10.134    1     .001    .000
a. Variable(s) entered on step 1: SMOKING, BMI, YEARS, CHOLESTE.

Model Summary
Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1       64.361               .256                    .343
2       50.473               .410                    .550
3       43.261               .477                    .639

Hosmer and Lemeshow Test
Step    Chi-square    df    Sig.
1       2.687         8     .952
2       4.078         8     .850
3       6.346         8     .609

Variables in the Equation
                                                                          95.0% C.I. for EXP(B)
                         B         S.E.     Wald      df    Sig.    Exp(B)    Lower     Upper
Step 1 (a)  YEARS        .085      .024     12.268    1     .000    1.089     1.038     1.142
            Constant     -4.744    1.339    12.558    1     .000    .009
Step 2 (b)  SMOKING      2.566     .784     10.724    1     .001    13.016    2.802     60.461
            YEARS        .101      .029     12.337    1     .000    1.106     1.046     1.171
            Constant     -6.703    1.763    14.451    1     .000    .001
Step 3 (c)  SMOKING      2.558     .854     8.973     1     .003    12.910    2.421     68.831
            BMI          .298      .125     5.681     1     .017    1.347     1.054     1.720
            YEARS        .104      .034     9.515     1     .002    1.110     1.039     1.186
            Constant     -14.739   4.365    11.402    1     .001    .000
a. Variable(s) entered on step 1: YEARS.
b. Variable(s) entered on step 2: SMOKING.
c. Variable(s) entered on step 3: BMI.