
Naïve Bayes Classifier

Ke Chen

COMP24111 Machine Learning

Outline

Background
  Probability Basics
  Probabilistic Classification
Naïve Bayes
  Principle and Algorithms
  Example: Play Tennis
Relevant Issues
Summary

Background

There are three ways to establish a classifier:

a) Model a classification rule directly
   Examples: k-NN, decision trees, perceptron, SVM

b) Model the probability of class membership given the input data
   Example: perceptron with the cross-entropy cost

c) Build a probabilistic model of the data within each class
   Examples: naïve Bayes, model-based classifiers

a) and b) are examples of discriminative classification; c) is an example of generative classification; b) and c) are both examples of probabilistic classification.

Probability Basics

Prior, conditional and joint probability for random variables

Prior probability: $P(X)$

Conditional probability: $P(X_1|X_2),\ P(X_2|X_1)$

Joint probability: $X = (X_1, X_2),\ P(X) = P(X_1, X_2)$

Relationship: $P(X_1, X_2) = P(X_2|X_1)P(X_1) = P(X_1|X_2)P(X_2)$

Independence: $P(X_2|X_1) = P(X_2),\ P(X_1|X_2) = P(X_1),\ P(X_1, X_2) = P(X_1)P(X_2)$

Bayes' rule:
$$P(C|X) = \frac{P(X|C)P(C)}{P(X)} \qquad \text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}}$$

Probability Basics

Quiz: We have two six-sided dice. When they are rolled, the following events can occur: (A) die 1 lands on side 3, (B) die 2 lands on side 1, and (C) the two dice sum to eight. Answer the following questions (a checking sketch follows):

1) $P(A) = ?$
2) $P(B) = ?$
3) $P(C) = ?$
4) $P(A|B) = ?$
5) $P(C|A) = ?$
6) $P(A, B) = ?$
7) $P(A, C) = ?$
8) Is $P(A, C)$ equal to $P(A)\,P(C)$?
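
One way to check your answers is to enumerate all 36 equally likely outcomes. A minimal Python sketch (my own code, not part of the slides), assuming fair, independent dice:

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes (die 1, die 2) of two fair six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(event) under the uniform distribution over outcomes."""
    return Fraction(sum(event(o) for o in outcomes), len(outcomes))

def cond(event, given):
    """Conditional probability P(event | given)."""
    kept = [o for o in outcomes if given(o)]
    return Fraction(sum(event(o) for o in kept), len(kept))

A = lambda o: o[0] == 3           # die 1 lands on side 3
B = lambda o: o[1] == 1           # die 2 lands on side 1
C = lambda o: o[0] + o[1] == 8    # the two dice sum to eight

print(prob(A), prob(B), prob(C))          # 1/6, 1/6, 5/36
print(cond(A, B), cond(C, A))             # 1/6, 1/6
AC = lambda o: A(o) and C(o)              # joint event: only (3, 5)
print(prob(lambda o: A(o) and B(o)))      # 1/36
print(prob(AC))                           # 1/36
print(prob(AC) == prob(A) * prob(C))      # False: A and C are dependent
```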
Probabilistic Classification

Establishing a probabilistic model for classification

Discriminative model: model the posterior $P(C|X)$ directly, where $C = c_1, \dots, c_L$ and $X = (X_1, \dots, X_n)$.

[Figure: a discriminative probabilistic classifier takes an input $x = (x_1, x_2, \dots, x_n)$ and outputs the posterior probabilities $P(c_1|x), P(c_2|x), \dots, P(c_L|x)$.]
Probabilistic Classification

Establishing a probabilistic model for classification (cont.)

Generative model: model the class-conditional distribution $P(X|C)$, where $C = c_1, \dots, c_L$ and $X = (X_1, \dots, X_n)$.

[Figure: one generative probabilistic model per class; given an input $x = (x_1, x_2, \dots, x_n)$, the model for class $i$ outputs the likelihood $P(x|c_i)$, for classes $1, 2, \dots, L$.]
Probabilistic Classification

MAP classification rule (MAP: Maximum A Posteriori)

Assign $x$ to $c^*$ if
$$P(C = c^*|X = x) > P(C = c|X = x) \quad \text{for all } c \neq c^*,\ c \in \{c_1, \dots, c_L\}$$

Generative classification with the MAP rule

Apply Bayes' rule to convert the class-conditional likelihoods into posterior probabilities:
$$P(C = c_i|X = x) = \frac{P(X = x|C = c_i)\,P(C = c_i)}{P(X = x)} \propto P(X = x|C = c_i)\,P(C = c_i) \quad \text{for } i = 1, 2, \dots, L$$
Then apply the MAP rule.
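
In code, generative classification with the MAP rule is just an argmax over likelihood times prior. A minimal sketch with made-up numbers (the priors and likelihoods below are purely illustrative, not from the slides):

```python
# Hypothetical numbers purely for illustration:
priors = [0.5, 0.3, 0.2]          # P(C = c_i)
likelihoods = [0.01, 0.05, 0.02]  # P(X = x | C = c_i)

# P(C = c_i | X = x) is proportional to P(X = x | C = c_i) * P(C = c_i);
# the evidence P(X = x) is the same for every class, so the argmax ignores it.
scores = [lik * pri for lik, pri in zip(likelihoods, priors)]
c_star = scores.index(max(scores))
print(c_star)  # 1, since 0.015 > 0.005 and 0.015 > 0.004
```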
Naïve Bayes

Bayes classification:
$$P(C|X) \propto P(X|C)\,P(C) = P(X_1, \dots, X_n|C)\,P(C)$$
Difficulty: learning the joint probability $P(X_1, \dots, X_n|C)$.

Naïve Bayes classification

Assumption: all input features are conditionally independent given the class!
$$\begin{aligned} P(X_1, X_2, \dots, X_n|C) &= P(X_1|X_2, \dots, X_n, C)\,P(X_2, \dots, X_n|C) \\ &= P(X_1|C)\,P(X_2, \dots, X_n|C) \\ &= P(X_1|C)\,P(X_2|C) \cdots P(X_n|C) \end{aligned}$$

MAP classification rule: for $x = (x_1, x_2, \dots, x_n)$, assign the label $c^*$ if
$$[P(x_1|c^*) \cdots P(x_n|c^*)]\,P(c^*) > [P(x_1|c) \cdots P(x_n|c)]\,P(c) \quad \text{for all } c \neq c^*,\ c \in \{c_1, \dots, c_L\}$$

Naïve Bayes

Algorithm: Discrete-valued Features

Learning Phase: given a training set S,
For each target value $c_i$ ($c_i = c_1, \dots, c_L$):
  $\hat{P}(C = c_i) \leftarrow$ estimate $P(C = c_i)$ with examples in S
  For every feature value $x_{jk}$ of each feature $X_j$ ($j = 1, \dots, F$; $k = 1, \dots, N_j$):
    $\hat{P}(X_j = x_{jk}|C = c_i) \leftarrow$ estimate $P(X_j = x_{jk}|C = c_i)$ with examples in S

Test Phase: given an unknown instance $x = (a_1, \dots, a_n)$, assign the label $c^*$ if
$$[\hat{P}(a_1|c^*) \cdots \hat{P}(a_n|c^*)]\,\hat{P}(c^*) > [\hat{P}(a_1|c) \cdots \hat{P}(a_n|c)]\,\hat{P}(c) \quad \text{for all } c \neq c^*,\ c \in \{c_1, \dots, c_L\}$$

A runnable sketch of these two phases appears below.
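
A minimal sketch of the learning and test phases for discrete features (my own code and naming, not from the slides; it assumes each example is a pair of a feature dictionary and a label, and omits the smoothing discussed later under Relevant Issues):

```python
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (features: dict, label). Returns the priors and an
    estimator p(value, feature, label) for P(X_feature = value | C = label)."""
    label_counts = Counter(label for _, label in examples)
    priors = {c: k / len(examples) for c, k in label_counts.items()}

    cond = defaultdict(lambda: defaultdict(Counter))  # cond[c][feature][value] -> count
    for feats, label in examples:
        for f, v in feats.items():
            cond[label][f][v] += 1

    def p(value, feature, label):
        # Relative frequency of this feature value among examples of this class.
        return cond[label][feature][value] / label_counts[label]

    return priors, p

def predict(feats, priors, p):
    """MAP rule: pick the class maximizing P(c) * prod_f P(x_f | c)."""
    def score(c):
        s = priors[c]
        for f, v in feats.items():
            s *= p(v, f, c)
        return s
    return max(priors, key=score)
```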
Example

Example: Play Tennis

[Table: the Play Tennis training set of 14 examples, each with features Outlook, Temperature, Humidity and Wind, and the label PlayTennis.]
Example

Learning Phase

Outlook      Play=Yes  Play=No
Sunny        2/9       3/5
Overcast     4/9       0/5
Rain         3/9       2/5

Temperature  Play=Yes  Play=No
Hot          2/9       2/5
Mild         4/9       2/5
Cool         3/9       1/5

Humidity     Play=Yes  Play=No
High         3/9       4/5
Normal       6/9       1/5

Wind         Play=Yes  Play=No
Strong       3/9       3/5
Weak         6/9       2/5

P(Play=Yes) = 9/14    P(Play=No) = 5/14

Example of the Naïve Bayes Classifier

The weather data, with counts and probabilities:

outlook      yes  no
sunny        2/9  3/5
overcast     4/9  0/5
rainy        3/9  2/5

temperature  yes  no
hot          2/9  2/5
mild         4/9  2/5
cool         3/9  1/5

humidity     yes  no
high         3/9  4/5
normal       6/9  1/5

windy        yes  no
false        6/9  2/5
true         3/9  3/5

play         yes   no
             9/14  5/14

A new day:

outlook  temperature  humidity  windy  play
sunny    cool         high      true   ?

Likelihood of yes: (2/9) × (3/9) × (3/9) × (3/9) × (9/14) ≈ 0.0053
Likelihood of no: (3/5) × (1/5) × (4/5) × (3/5) × (5/14) ≈ 0.0206

Therefore, the prediction is No.

Example

Test Phase

Given a new instance, predict its label:
x = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)

Look up the tables obtained in the learning phase:

P(Outlook=Sunny|Play=Yes) = 2/9        P(Outlook=Sunny|Play=No) = 3/5
P(Temperature=Cool|Play=Yes) = 3/9     P(Temperature=Cool|Play=No) = 1/5
P(Humidity=High|Play=Yes) = 3/9        P(Humidity=High|Play=No) = 4/5
P(Wind=Strong|Play=Yes) = 3/9          P(Wind=Strong|Play=No) = 3/5
P(Play=Yes) = 9/14                     P(Play=No) = 5/14

Decision making with the MAP rule:

P(Yes|x) ∝ [P(Sunny|Yes)P(Cool|Yes)P(High|Yes)P(Strong|Yes)]P(Play=Yes) ≈ 0.0053
P(No|x) ∝ [P(Sunny|No)P(Cool|No)P(High|No)P(Strong|No)]P(Play=No) ≈ 0.0206

Since P(Yes|x) < P(No|x), we label x as No.
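
The arithmetic above is easy to verify; a quick check of the two scores (my own code, using the looked-up fractions from the learning-phase tables):

```python
p_yes, p_no = 9/14, 5/14

# Class-conditional probabilities for (Sunny, Cool, High, Strong).
like_yes = (2/9) * (3/9) * (3/9) * (3/9)
like_no = (3/5) * (1/5) * (4/5) * (3/5)

score_yes = like_yes * p_yes   # ~0.0053
score_no = like_no * p_no      # ~0.0206
print(score_yes, score_no)
print("No" if score_no > score_yes else "Yes")   # No
```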



Naïve Bayes

Algorithm: Continuous-valued Features

A continuous feature takes numberless (infinitely many) values, so its conditional probability cannot be stored in a table.

Instead, the conditional probability is often modelled with the normal distribution:
$$\hat{P}(X_j|C = c_i) = \frac{1}{\sqrt{2\pi}\,\sigma_{ji}} \exp\left(-\frac{(X_j - \mu_{ji})^2}{2\sigma_{ji}^2}\right)$$
where $\mu_{ji}$ is the mean (average) and $\sigma_{ji}$ the standard deviation of the values of feature $X_j$ over the examples for which $C = c_i$.

Learning Phase: for $X = (X_1, \dots, X_n)$ and $C = c_1, \dots, c_L$,
Output: $n \times L$ normal distributions and $\hat{P}(C = c_i)$ for $i = 1, \dots, L$

Test Phase: given an unknown instance $x = (a_1, \dots, a_n)$,
Instead of looking up tables, calculate the conditional probabilities with the normal distributions obtained in the learning phase
Apply the MAP rule to make a decision
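
A minimal sketch of the two pieces the learning phase needs for a continuous feature (my own helper names, assuming the sample standard deviation is used, as in the example that follows):

```python
import math

def gaussian_pdf(x, mu, sigma):
    # Normal density N(x; mu, sigma^2), used as P(X_j = x | C = c_i).
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def fit_gaussian(values):
    # Estimate (mu, sigma) from the values of one feature within one class,
    # dividing by N - 1 for the variance (sample estimate).
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / (n - 1)
    return mu, math.sqrt(var)
```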

Naïve Bayes

Example: Continuous-valued Features

Temperature is naturally a continuous-valued feature.
Yes: 25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8
No: 27.3, 30.1, 17.4, 29.5, 15.1

Estimate the mean and standard deviation for each class:
$$\mu = \frac{1}{N}\sum_{n=1}^{N} x_n, \qquad \sigma^2 = \frac{1}{N - 1}\sum_{n=1}^{N} (x_n - \mu)^2$$
$$\mu_{\text{Yes}} = 21.64,\quad \sigma_{\text{Yes}} = 2.35 \qquad \mu_{\text{No}} = 23.88,\quad \sigma_{\text{No}} = 7.09$$
(The quoted standard deviations are the sample estimates, i.e. dividing by $N - 1$.)

Learning Phase: output two Gaussian models for $P(\text{temp}|C)$:
$$\hat{P}(x|\text{Yes}) = \frac{1}{2.35\sqrt{2\pi}} \exp\left(-\frac{(x - 21.64)^2}{2 \cdot 2.35^2}\right)$$
$$\hat{P}(x|\text{No}) = \frac{1}{7.09\sqrt{2\pi}} \exp\left(-\frac{(x - 23.88)^2}{2 \cdot 7.09^2}\right)$$
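
These estimates are easy to reproduce; a quick check (statistics.stdev uses the $N - 1$ sample estimate, which matches the values above):

```python
import statistics

yes = [25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8]
no = [27.3, 30.1, 17.4, 29.5, 15.1]
print(statistics.mean(yes), statistics.stdev(yes))  # ~21.64, ~2.35
print(statistics.mean(no), statistics.stdev(no))    # ~23.88, ~7.09
```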


Relevant Issues

Violation of the Independence Assumption

For many real-world tasks,
$$P(X_1, \dots, X_n|C) \neq P(X_1|C) \cdots P(X_n|C)$$
Nevertheless, naïve Bayes works surprisingly well anyway!

Zero Conditional Probability Problem

If no training example contains the feature value $X_j = a_{jk}$ for class $c_i$, then $\hat{P}(X_j = a_{jk}|C = c_i) = 0$. In this circumstance, the whole product vanishes during test:
$$\hat{P}(x_1|c_i) \cdots \hat{P}(a_{jk}|c_i) \cdots \hat{P}(x_n|c_i) = 0$$

For a remedy, conditional probabilities are re-estimated with
$$\hat{P}(X_j = a_{jk}|C = c_i) = \frac{n_c + mp}{n + m}$$
where
$n_c$: number of training examples for which $X_j = a_{jk}$ and $C = c_i$
$n$: number of training examples for which $C = c_i$
$p$: prior estimate (usually, $p = 1/t$ for $t$ possible values of $X_j$)
$m$: weight given to the prior (number of "virtual" examples, $m \geq 1$)
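
A one-function sketch of this m-estimate (my own code), applied to the zero count from the Play Tennis tables, where Outlook=Overcast never occurs with Play=No (0 of 5 examples, and Outlook has t = 3 possible values):

```python
from fractions import Fraction

def m_estimate(n_c, n, t, m=1):
    """Smoothed P(X_j = a_jk | C = c_i) with prior p = 1/t and m virtual examples."""
    p = Fraction(1, t)
    return (n_c + m * p) / (n + m)

print(m_estimate(0, 5, 3))   # 1/18 instead of 0
```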



Summary

Naïve Bayes is built on the conditional independence assumption.

Training is very easy and fast: it just requires considering each attribute in each class separately.

Testing is straightforward: just look up tables, or calculate conditional probabilities with the estimated distributions.

Naïve Bayes is a popular generative model:
  Its performance is competitive with most state-of-the-art classifiers, even when the independence assumption is violated
  It has many successful applications, e.g., spam mail filtering
  It is a good candidate for a base learner in ensemble learning
  Apart from classification, naïve Bayes can do more