
BUILDING PREDICTIVE MODELS OF ELECTION RESULTS – an application of Logistic Regression

2009


Abstract

This document attempts to teach students logistic regression with the help of a simple real-world example. The 2009 Lok Sabha election results for Karnataka were analyzed to assess whether there was any dependence between the available information about the candidates and the final election results. The only significant variables were "Political Party" and "Movable Assets" of the politician in question.

Table of Contents

Introduction
Tips to getting started
Objective
Data Understanding
Data Preparation
Modeling
Evaluation
Conclusion
Further Discussion
INTRODUCTION:

The compulsory disclosure of information about the background of candidates in an election makes sure that voters have sufficient information about the candidates, enabling them to make an informed choice while casting their votes. This information includes assets and liabilities as well as criminal antecedents, if any. Thus, a fairly large amount of data on candidate backgrounds has become available. It is interesting to see whether the information thus made available made any difference to the outcome of the elections. It is also important to see whether it would be possible to use this information to predict, or forecast, the results of the elections.




TIPS TO GETTING STARTED:

Business Understanding:

This is the initial phase, focusing on understanding the project objectives and requirements from a business perspective. The available information about the candidates can be used as data to fit a model that determines which candidate will win the election. Such a model could also help voters make a fair choice in selecting their leader. Thus, the objective of this paper is to develop predictive models that could be used for predicting the outcomes of an election.

 


Data Understanding:

The data on the profiles of the candidates in the 2009 Karnataka Lok Sabha election were taken from myneta.com for this paper. Elections were held in 28 constituencies in 2009, with 421 candidates altogether. The election result for each candidate (win or loss) is used as the dependent variable for the predictive models; it is treated as a binary categorical variable. In addition, a number of variables on which information was available are used as independent variables. These variables included:













• Age of the candidate
• Gender
• Educational qualification
• Number of candidates in the constituency
• Type of the political party
• Win or loss of the candidate
• Movable assets
• Immovable assets
• Total assets of the candidate
• Whether the candidate has any liabilities to the government
• Whether the candidate has any liabilities to financial institutions
• Whether the candidate had committed any crimes
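The variable list above can be mirrored in code. The sketch below builds a tiny illustrative sample (the values and column names are hypothetical, not rows from the myneta.com data) just to show the structure the analysis expects: one binary dependent variable and a mix of categorical and numeric regressors.

```python
import pandas as pd

# A minimal illustrative sample (values are hypothetical, not from the
# actual myneta.com data) holding the variables listed above.
candidates = pd.DataFrame({
    "age": [52, 61, 45],
    "gender": ["M", "F", "M"],
    "edu": ["graduate", "post graduate", "high school"],
    "polparty": ["BJP", "Congress", "Independent"],
    "movasset": [1, 0, 1],          # owns movable assets: yes = 1, no = 0
    "immovasset": [1, 1, 0],        # owns immovable assets: yes = 1, no = 0
    "totalasset": [16763000, 100000, 630000],
    "govt_dues": [0, 0, 1],         # liabilities to the government
    "bank_dues": [1, 0, 0],         # liabilities to financial institutions
    "crime": [0, 0, 1],             # any criminal record
    "winloss": [1, 0, 0],           # dependent variable: win = 1, loss = 0
})

# The dependent variable is binary; the regressors are a mixture of
# categorical and numeric columns.
print(candidates.dtypes)
print(candidates["winloss"].value_counts())
```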

Data preparation:

Here the dependent variable and several of the independent variables are categorical, so the categorical variables need to be transformed using dummy variables. The statistical software will automatically transform the categorical variables into dummy variables at the start of the analysis. Given below is a description of the variables used in this paper to build the model.
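The dummy coding that the statistical software performs automatically can be reproduced by hand. A minimal sketch using pandas, with hypothetical column values (not the original data file):

```python
import pandas as pd

# Hypothetical slice of the candidate data; polparty and edu are
# categorical, so they must be expanded into dummy (0/1) columns.
df = pd.DataFrame({
    "polparty": ["BJP", "Congress", "Independent", "BJP"],
    "edu": ["graduate", "primary", "graduate", "post graduate"],
    "winloss": [1, 0, 0, 0],
})

# drop_first=True drops one level per categorical variable, making the
# dropped level the reference category (as the software does internally).
encoded = pd.get_dummies(df, columns=["polparty", "edu"], drop_first=True)
print(encoded.columns.tolist())
```

A category with k levels thus contributes k − 1 dummy columns, and the coefficient of each dummy is interpreted relative to the dropped reference level.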












DEPENDENT VARIABLE
winloss: win = 1, loss = 0 (dichotomous variable)

INDEPENDENT VARIABLES

NAME: DESCRIPTION

(DEMOGRAPHIC CHARACTERISTICS)
ID: Candidate ID
Age: Age of the candidate
Gen: Male = 1, Female = 0 (dichotomous variable)
Edu: Educational level of the candidate, divided into 5 categories: primary = 1, high school = 2, pre-university = 3, graduate = 4, post graduate = 5

(POLITICAL FACTORS)
No. of candidates: Number of candidates in the constituency, binned into 4 categories: <=10, 11 to 15, 16 to 20, above 20
PolParty: Political party of the candidate: BJP = 1, Congress = 2, independent = 3, JD = 4, other national party = 5, other regional party = 6

(OWNERSHIP)
Movasset: Whether the candidate owns any movable assets: yes = 1, no = 0 (dichotomous variable)
Immovasset: Whether the candidate owns any immovable assets: yes = 1, no = 0
Totlasset: Total assets of the candidate (continuous variable)

(LIABILITIES)
GovtDues: Whether the candidate has any government dues or not: yes = 1, no = 0
BankDues: Whether the candidate has any bank dues or not: yes = 1, no = 0

(OTHER FACTORS)
Crime: Whether the candidate had committed any crime or not: yes = 1, no = 0




Modeling:

Since the dependent variable is categorical in nature, the usual predictive models that revolve around regression techniques could be used for prediction in this specific case. Logistic regression is ideal when we have a mixture of numerical and categorical regressors.


A brief description of the technique is given below.

Logistic regression is a multiple regression with an outcome (or dependent) variable that is a categorical dichotomy and explanatory variables that can be either continuous or categorical. In other words, the interest is in predicting which of two possible events is going to happen given certain other information.

The dependent variable in logistic regression is usually dichotomous; that is, it can take the value 1 with a probability of success θ, or the value 0 with a probability of failure 1 − θ. This type of variable is called a Bernoulli (or binary) variable. The independent or predictor variables in logistic regression can take any form: logistic regression makes no assumption about the distribution of the independent variables. They do not have to be normally distributed, linearly related, or of equal variance within each group. The relationship between the predictor and response variables is not a linear function in logistic regression; instead, the logistic regression function is used, which is the logit transformation of θ:

log[θ / (1 − θ)] = α + β1X1 + β2X2 + … + βkXk

where α = the constant of the equation and βi = the coefficients of the predictor variables.



An alternative form of the logistic regression equation is:

θ = e^(α + β1X1 + … + βkXk) / (1 + e^(α + β1X1 + … + βkXk))


The estimation of the parameters in logistic regression analysis is done through maximum likelihood techniques. The idea behind the method is to find the parameters that make the observed values most likely to have occurred; i.e., it maximises the probability of obtaining the sample we got.
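The maximum likelihood idea can be made concrete with a small sketch: Newton-Raphson iterations on the log-likelihood of a synthetic data set (not the election data) recover the parameters that generated it. The data, parameter values and iteration count below are all illustrative.

```python
import numpy as np

# Maximum-likelihood fit of a logistic regression by Newton-Raphson,
# on a small synthetic data set (not the election data).
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
true_alpha, true_beta = -1.0, 2.0
p = 1.0 / (1.0 + np.exp(-(true_alpha + true_beta * x)))
y = rng.binomial(1, p)

X = np.column_stack([np.ones(n), x])   # design matrix with intercept
coef = np.zeros(2)
for _ in range(25):
    eta = X @ coef
    mu = 1.0 / (1.0 + np.exp(-eta))    # fitted probabilities theta
    grad = X.T @ (y - mu)              # gradient of the log-likelihood
    W = mu * (1.0 - mu)
    hess = X.T @ (X * W[:, None])      # observed information matrix
    coef = coef + np.linalg.solve(hess, grad)

alpha_hat, beta_hat = coef
print(alpha_hat, beta_hat)             # should be close to -1 and 2
```

Statistical packages perform essentially this iteration (iteratively reweighted least squares) when they report the fitted coefficients.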


The process by which coefficients are tested for significance, for inclusion in or elimination from the model, involves several different techniques. Some of them are the Wald test, the likelihood-ratio test and the Hosmer-Lemeshow goodness-of-fit test.



The Hosmer and Lemeshow chi-square test of goodness of fit is the recommended test for the overall fit of a binary logistic regression model. If the p-value of the H-L goodness-of-fit test is greater than .05, as we want for well-fitting models, we fail to reject the null hypothesis that there is no difference between observed and model-predicted values, implying that the model's estimates fit the data at an acceptable level. That is, well-fitting models show nonsignificance on the H-L goodness-of-fit test, indicating that the model predictions are not significantly different from the observed values.
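A sketch of how the Hosmer-Lemeshow statistic itself is computed (the grouping into ten bins by predicted probability is the conventional choice; the simulated data below are illustrative):

```python
import numpy as np

# A sketch of the Hosmer-Lemeshow goodness-of-fit statistic: sort cases
# by predicted probability, split them into g groups, and compare the
# observed number of events in each group with the number the model expects.
def hosmer_lemeshow(y, p_hat, g=10):
    order = np.argsort(p_hat)
    stat = 0.0
    for idx in np.array_split(order, g):   # roughly equal-sized bins
        n_k = len(idx)
        obs = y[idx].sum()                 # observed events in the bin
        exp = p_hat[idx].sum()             # model-expected events in the bin
        pbar = exp / n_k
        stat += (obs - exp) ** 2 / (n_k * pbar * (1.0 - pbar))
    return stat   # compared to a chi-square with g - 2 degrees of freedom

# For a well-calibrated model the statistic is small relative to the
# chi-square reference distribution.
rng = np.random.default_rng(1)
p_hat = rng.uniform(0.05, 0.95, size=400)
y = rng.binomial(1, p_hat)
print(hosmer_lemeshow(y, p_hat))
```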


Evaluation:

The data were analyzed using statistical software. A sample of the candidate data is shown below.



ID  polparty  crime  edu  age  Ttlasset   liabilities  gen  winloss
 1      4       0     2    2   54406000        0        1      0
 2      4       0     2    2     100000        0        1      0
 3      4       0     4    1    1000000        0        1      0
 4      5       0     5    3     406400        0        1      0
 5      4       0     4    1          0        1        1      0
 6      5       0     5    2    6371000        1        1      0
 7      4       1     1    3    7578500        1        1      0
 8      1       0     5    4   16763000        1        1      1
 9      4       0     1    1     600000        1        1      0
10      2       0     1    4   30411328        1        1      0
11      4       0     5    4    3380000        0        1      0
12      4       0     1    2          0        0        1      0
13      6       0     4    3   11415000        1        1      0
14      4       0     4    1     630000        1        1      0
15      4       0     2    1      20000        0        1      0
16      4       0     5    2    8000000        0        1      0
17      4       0     3    3     195000        0        1      0
18      6       0     3    4    1173000        1        1      0
19      4       0     4    3     865000        0        1      0


















This part of the output describes the "null model", which is a model with no predictors and just the intercept.





Classification Table(a,b)

                               Predicted
                          WINLOSS       Percentage
Observed                  0      1      Correct
Step 0  WINLOSS    0    390      0      100.0
                   1     28      0         .0
        Overall Percentage               93.3
a. Constant is included in the model.
b. The cut value is .500



This gives the percentage of cases for which the dependent variable was correctly predicted given the model; here it is 93.3%.




This is the Wald chi-square test, which tests the null hypothesis that the constant equals 0. This hypothesis is rejected because the p-value (listed in the column called "Sig.") is smaller than the critical p-value of .05 (or .01). Hence, we conclude that the constant is not 0.
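A minimal sketch of the Wald chi-square for a single coefficient: the statistic is the squared ratio of the estimate to its standard error. The estimate and standard error used below are hypothetical, not the constants from the software output.

```python
import math

# The Wald chi-square for a single coefficient is (estimate / s.e.)^2.
def wald_chi_square(coef, se):
    z = coef / se
    chi_sq = z * z
    # Two-sided p-value from the standard normal distribution
    # (equivalently, a chi-square with 1 degree of freedom).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return chi_sq, p_value

chi_sq, p = wald_chi_square(coef=3.958, se=1.427)   # hypothetical s.e.
print(chi_sq, p)   # a p-value below .05 rejects H0: coefficient = 0
```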



 






 



This section contains the overall test of the model (in the "Hosmer-Lemeshow Test" table) and the coefficients and odds ratios (in the "Variables in the Equation" table).

Cox & Snell R Square and Nagelkerke R Square are pseudo R-squares; logistic regression does not have an exact equivalent to the R-squared found in OLS regression. Here the Cox & Snell R Square is 0.276 and the Nagelkerke R Square is 0.712, indicating an improvement from the null model to the fitted model.
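Both pseudo R-squares are simple functions of the log-likelihoods of the null and fitted models. The sketch below uses illustrative log-likelihood values chosen so that the results come out close to the figures reported above; the actual log-likelihoods were not given in the output shown.

```python
import math

# Cox & Snell: 1 - (L_null / L_model)^(2/n), computed on the log scale.
# Nagelkerke rescales Cox & Snell by its maximum attainable value so
# that it can reach 1.
def pseudo_r_squares(ll_null, ll_model, n):
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    max_cs = 1.0 - math.exp(2.0 * ll_null / n)   # upper bound of Cox & Snell
    nagelkerke = cox_snell / max_cs
    return cox_snell, nagelkerke

# Illustrative log-likelihoods (back-derived, not from the output),
# with n = 418 cases as in the classification tables.
cs, nk = pseudo_r_squares(ll_null=-102.5, ll_model=-35.0, n=418)
print(round(cs, 3), round(nk, 3))
```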


Here the null hypothesis is that there is no difference between observed and model-predicted values. Since the p-value of the H-L goodness-of-fit test is greater than .05, we fail to reject this hypothesis, implying that the model fits the data at an acceptable level.


Classification Table(a)

                               Predicted
                          WINLOSS       Percentage
Observed                  0      1      Correct
Step 1  WINLOSS    0    383      7       98.2
                   1     10     18       64.3
        Overall Percentage               95.9
a. The cut value is .500


This table shows how many cases are correctly predicted (383 cases are observed to be 0 and are correctly predicted to be 0; 18 cases are observed to be 1 and are correctly predicted to be 1), and how many cases are not correctly predicted (7 cases are observed to be 0 but are predicted to be 1; 10 cases are observed to be 1 but are predicted to be 0). The overall percentage of cases that are correctly predicted by the model (in this case, the full model that we specified) is 95.9%. This percentage has increased from 93.3% for the null model to 95.9% for the full model.
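The percentages in the classification table follow directly from the four counts (383 + 7 + 10 + 18 = 418 cases in the analysis); a quick check:

```python
# Recomputing the percentages in the classification table from the raw
# counts reported by the software.
true_neg, false_pos = 383, 7     # observed 0: predicted 0 / predicted 1
false_neg, true_pos = 10, 18     # observed 1: predicted 0 / predicted 1

total = true_neg + false_pos + false_neg + true_pos
accuracy = 100.0 * (true_neg + true_pos) / total
correct_0 = 100.0 * true_neg / (true_neg + false_pos)
correct_1 = 100.0 * true_pos / (false_neg + true_pos)
print(round(accuracy, 1), round(correct_0, 1), round(correct_1, 1))
# 95.9 98.2 64.3
```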



From the above table we see that the variables POLPARTY (political party) and MOVASSET(1) (movable assets) are statistically significant, since their p-values are less than the critical p-value of 0.05. There is no single coefficient listed for POLPARTY, because it is not itself a variable in the model; rather, the dummy variables which code for POLPARTY have coefficients. However, the coefficients of the individual dummies are not statistically significant; the statistic given on the POLPARTY row tells you whether the dummies that represent POLPARTY, taken together, are statistically significant. Thus, the type of political party and movable assets are important in explaining the winning of a candidate. The other variables do not seem to have any effect at all.






Both effects can be interpreted as follows:

- Here the reference group of the variable POLPARTY is level 6. So, changing from level 6 to levels 1, 2, 3, 4 or 5 increases the probability of winning the election.
- The ownership of movable assets increases the probability of winning the election.




Now the predicted model is given by:

log[θ / (1 − θ)] = −23.477 + 23.583 POLPARTY(1) + 21.317 POLPARTY(2) + 19.112 POLPARTY(3) − 0.087 POLPARTY(4) + 3.958 MOVASSET(1)

Suppose we want to compare the probability of winning of a candidate A with movable assets and political party changing from level 6 to level 1 (here level 1 indicates BJP and level 6 indicates other regional parties) with the probability of winning of a candidate B without movable assets and political party changing from level 6 to level 1.

Predicted logit for candidate A:

log[θ / (1 − θ)] = −23.477 + 23.583 + 3.958(1) = 4.064

Thus, Prob(win) = e^4.064 / (1 + e^4.064) = 0.9831


Predicted logit for candidate B:

log[θ / (1 − θ)] = −23.477 + 23.583 + 3.958(0) = 0.106

Therefore, Prob(win) = e^0.106 / (1 + e^0.106) = 0.5265



From this, we can conclude that a candidate with movable assets has a higher probability of winning the election than a candidate without movable assets.
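The two predicted probabilities can be verified directly from the fitted equation (both candidates are BJP, so POLPARTY(1) = 1 and the other party dummies are 0):

```python
import math

# Reproducing the two predicted probabilities from the fitted model:
# candidate A (BJP, owns movable assets) vs. candidate B (BJP, none).
def win_probability(has_movable_assets):
    # -23.477 is the constant, 23.583 is the POLPARTY(1) coefficient,
    # 3.958 is the MOVASSET(1) coefficient from the fitted equation.
    logit = -23.477 + 23.583 + 3.958 * has_movable_assets
    return 1.0 / (1.0 + math.exp(-logit))

print(round(win_probability(1), 4))   # candidate A: 0.9831
print(round(win_probability(0), 4))   # candidate B: 0.5265
```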










CONCLUSIONS:

The disclosure of the backgrounds of candidates for elections in India resulted in providing voters with sufficient information. While this information was primarily meant to enable voters to make a well-informed choice, its availability also made it possible to build effective predictive models for forecasting election results. Logistic regression was used to build the predictive models for the Karnataka Lok Sabha elections. The important variables in predicting election outcomes are the type of political party and movable assets.



Questions for Further Discussion

1. What will happen to the predicted log odds if the coefficients of the predictor variables are negative?
2. Will there be any change in the model if we consider more independent variables, such as the number of crimes committed by the candidate, whether the candidate belongs to the ruling party, whether the constituency was reserved for Scheduled Caste and Scheduled Tribe candidates, or whether the candidate belongs to the incumbent party in the specific constituency?
3. Compare the model with other data mining techniques such as artificial neural networks and classification trees.
4. How many categorical independent variables are there in the model?
5. Is there any significance test for the fit of the model other than the Hosmer-Lemeshow test? If so, explain briefly.
6. Can we use simple linear regression instead of logistic regression? Why or why not?
7. What is the Wald chi-square test?
8. How many continuous independent variables are there in the model? What are their names?
9. How are the parameters estimated in a logistic regression model?
10. What is the dependent variable in the model? Is it a binary or continuous variable?
11. How many regressors are significant in the model?
12. What are the coefficients of the significant variables?
13. Give the interpretation of the significant predictors.
14. What are the Cox & Snell R Square and Nagelkerke R Square?
















