
10/29/13

A crash course in probability and Naïve Bayes classification

Chapter 9

Probability theory

Random variable: a variable whose possible values are numerical
outcomes of a random phenomenon.

Examples: a person's height, the outcome of a coin toss

Distinguish between discrete and continuous variables.

The distribution of a discrete random variable:
  The probabilities of each value it can take.
  Notation: P(X = xi).
  These numbers satisfy:

    Σ_i P(X = xi) = 1

Probability theory

A joint probability distribution for two variables is a table.

[Table illustrating joint, marginal, and conditional probabilities]

If the two variables are binary, how many parameters does it have?

Probability theory

What about the joint probability of d variables, P(X1, …, Xd)?

How many parameters does it have if each variable is binary?


Probability theory

The Rules of Probability

Marginalization:

  P(X) = Σ_Y P(X, Y)

Product Rule:

  P(X, Y) = P(Y | X) P(X)

Independence: X and Y are independent if P(Y|X) = P(Y)
This implies P(X,Y) = P(X) P(Y)

Optimal Classification

Optimal predictor (Bayes classifier):

  y* = argmax_Y P(Y | X)

Equivalently,

  y* = argmax_Y P(X | Y) P(Y)
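
To make these rules concrete, here is a minimal Python sketch (not from
the slides) that applies marginalization and the product rule to a
made-up joint distribution over two binary variables; all numbers are
assumptions chosen only so the table sums to 1.

# A made-up joint distribution P(X, Y) over two binary variables; entries sum to 1.
joint = {
    (0, 0): 0.30, (0, 1): 0.10,
    (1, 0): 0.20, (1, 1): 0.40,
}

# Marginalization: P(X = x) = sum over y of P(X = x, Y = y)
p_x = {x: sum(p for (xv, _), p in joint.items() if xv == x) for x in (0, 1)}

# Product rule rearranged: P(Y = y | X = x) = P(X = x, Y = y) / P(X = x)
p_y_given_x = {(y, x): joint[(x, y)] / p_x[x] for x in (0, 1) for y in (0, 1)}

print(p_x)                  # P(X=0) = 0.4, P(X=1) = 0.6
print(p_y_given_x[(1, 1)])  # P(Y = 1 | X = 1) = 0.4 / 0.6 ≈ 0.667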

Using probability in learning

Interested in: P(Y | X)
X – features, Y – labels

For example, when classifying spam, we could estimate
P(Y | Viagara, lottery)

We would then classify an example as spam if P(Y | X) > 0.5.

However, it's usually easier to model P(X | Y).

Maximum likelihood

Fit a probabilistic model P(x | θ) to data
Estimate θ

Given independent identically distributed (i.i.d.) data
X = (x1, x2, …, xn)

Likelihood:
  P(X | θ) = P(x1 | θ) P(x2 | θ) ⋯ P(xn | θ)

Log likelihood:
  ln P(X | θ) = Σ_{i=1}^n ln P(xi | θ)

Maximum likelihood solution: parameters θ that maximize ln P(X | θ)

Example: coin toss

Estimate the probability p that a coin lands "Heads" using the result
of n coin tosses, h of which resulted in heads.

The likelihood of the data:

  P(X | θ) = p^h (1 − p)^(n − h)

Log likelihood:

  ln P(X | θ) = h ln p + (n − h) ln(1 − p)

Taking a derivative and setting it to 0:

  ∂ ln P(X | θ) / ∂p = h/p − (n − h)/(1 − p) = 0
  ⇒ p = h/n

Bayes' rule

From the product rule:

  P(Y, X) = P(Y | X) P(X)

and:

  P(Y, X) = P(X | Y) P(Y)

Therefore:

  P(Y | X) = P(X | Y) P(Y) / P(X)

This is known as Bayes' rule.
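
A quick numerical check of the coin-toss result above (a sketch, with
assumed counts n = 10 and h = 7): maximizing the log likelihood
h ln p + (n − h) ln(1 − p) over a grid recovers p = h/n.

import math

n, h = 10, 7   # assumed counts: 10 tosses, 7 heads

def log_likelihood(p):
    # ln P(X | theta) = h ln p + (n - h) ln(1 - p)
    return h * math.log(p) + (n - h) * math.log(1 - p)

# Evaluate on a fine grid of candidate values for p and keep the best one.
grid = [i / 1000 for i in range(1, 1000)]
p_best = max(grid, key=log_likelihood)

print(p_best, h / n)   # 0.7 0.7 -- the grid maximum matches the closed form p = h/n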

Bayes' rule

  P(Y | X) = P(X | Y) P(Y) / P(X)

Here P(Y | X) is the posterior, P(X | Y) is the likelihood, and P(Y)
is the prior:

  posterior ∝ likelihood × prior

P(X) can be computed as:

  P(X) = Σ_Y P(X | Y) P(Y)

but it is not important for inferring a label.

Maximum a-posteriori and maximum likelihood

The maximum a posteriori (MAP) rule:

  y_MAP = argmax_Y P(Y | X) = argmax_Y P(X | Y) P(Y) / P(X) = argmax_Y P(X | Y) P(Y)

If we ignore the prior distribution or assume it is uniform, we obtain
the maximum likelihood rule:

  y_ML = argmax_Y P(X | Y)

A classifier that has access to P(Y | X) is a Bayes optimal classifier.
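
A small sketch of Bayes' rule and the MAP decision with made-up
numbers; the prior and class-conditional likelihoods below are
assumptions, not values from the slides.

# Assumed prior and class-conditional likelihoods for one feature vector x.
prior = {"spam": 0.3, "ham": 0.7}            # P(Y)
likelihood = {"spam": 0.20, "ham": 0.05}     # P(X = x | Y)

unnormalized = {y: likelihood[y] * prior[y] for y in prior}   # P(X | Y) P(Y)
evidence = sum(unnormalized.values())                         # P(X) = sum_Y P(X|Y) P(Y)
posterior = {y: unnormalized[y] / evidence for y in prior}    # P(Y | X), Bayes' rule

y_map = max(unnormalized, key=unnormalized.get)   # argmax_Y P(X | Y) P(Y)
y_ml = max(likelihood, key=likelihood.get)        # argmax_Y P(X | Y), ignores the prior

print(posterior)       # {'spam': ~0.63, 'ham': ~0.37}
print(y_map, y_ml)     # spam spam -- dividing by P(X) never changes the argmax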


Naïve Bayes classifier

Learning the Optimal Classifier

Task: predict whether or not a picnic spot is enjoyable

Training data: X = (X1  X2  X3  …  …  Xd)   Y,   n rows

Let's learn P(Y|X) – how many parameters?
  Prior: P(Y = y) for all y:                  K − 1 if K labels
  Likelihood: P(X = x | Y = y) for all x, y:  (2^d − 1)K if d binary features

Naïve Bayes classifier

We would like to model P(X | Y), where X is a feature vector,
and Y is its associated label.

How many parameters?
  Prior: P(Y):            k − 1 if k classes
  Likelihood: P(X | Y):   (2^d − 1)k for binary features

Naïve Bayes classifier

We would like to model P(X | Y), where X is a feature vector,
and Y is its associated label.

Simplifying assumption: conditional independence: given the
class label the features are independent, i.e.

  P(X | Y) = P(x1 | Y) P(x2 | Y) ⋯ P(xd | Y)

How many parameters now?  dk + k − 1

Naïve Bayes classifier

Naïve Bayes decision rule:

  y_NB = argmax_Y P(X | Y) P(Y) = argmax_Y [ ∏_{i=1}^d P(xi | Y) ] P(Y)

If conditional independence holds, NB is an optimal classifier!
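
A minimal sketch of this decision rule for binary features, assuming
the class priors P(Y) and per-feature probabilities P(xi = 1 | Y) have
already been estimated; the parameter values below are placeholders
(they coincide with the e-mail example that appears later in the deck).

import math

# Placeholder parameters for two classes and d = 3 binary features.
prior = {"+": 0.5, "-": 0.5}                 # P(Y)
theta = {"+": [0.5, 0.75, 0.25],             # theta[y][i] = P(x_i = 1 | Y = y)
         "-": [0.75, 0.25, 0.25]}

def nb_predict(x):
    # Return argmax_Y P(Y) * prod_i P(x_i | Y), computed in log space for stability.
    scores = {}
    for y in prior:
        score = math.log(prior[y])
        for xi, t in zip(x, theta[y]):
            score += math.log(t if xi == 1 else 1 - t)
        scores[y] = score
    return max(scores, key=scores.get)

print(nb_predict([1, 1, 0]))   # '+' for these parameters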


Training a Naïve Bayes classifier

Training data: feature matrix X (n x d) and labels y1, …, yn

Maximum likelihood estimates:

  Class prior:  P̂(y) = |{i : yi = y}| / n

  Likelihood:   P̂(xi | y) = P̂(xi, y) / P̂(y)
                          = (|{i : Xij = xi, yi = y}| / n) / (|{i : yi = y}| / n)

Example 9.4: Prediction using a naive Bayes model
(Flach, Machine Learning: Making Sense of Data, p. 276)

Email classification

Suppose our vocabulary contains three words a, b and c, and we use a
multivariate Bernoulli model for our e-mails, with parameters

  θ⊕ = (0.5, 0.67, 0.33)      θ⊖ = (0.67, 0.33, 0.33)

This means, for example, that the presence of b is twice as likely in
spam (+), compared with ham.

The e-mail to be classified contains words a and b but not c, and hence
is described by the bit vector x = (1, 1, 0). We obtain likelihoods

  P(x | ⊕) = 0.5 · 0.67 · (1 − 0.33) = 0.222
  P(x | ⊖) = 0.67 · 0.33 · (1 − 0.33) = 0.148

The ML classification of x is thus spam.
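
The arithmetic of Example 9.4 can be checked directly; this sketch uses
the parameter vectors and bit vector given above. Note that with the
rounded values 0.67 and 0.33 the first product comes out as 0.224
rather than the 0.222 reported on the slide, which uses the exact
fractions 2/3 and 1/3.

theta_spam = (0.5, 0.67, 0.33)   # P(word present | spam) for a, b, c
theta_ham = (0.67, 0.33, 0.33)   # P(word present | ham)
x = (1, 1, 0)                    # the e-mail contains a and b but not c

def bernoulli_likelihood(x, theta):
    # P(x | theta) = prod_i theta_i^x_i * (1 - theta_i)^(1 - x_i)
    p = 1.0
    for xi, t in zip(x, theta):
        p *= t if xi == 1 else 1 - t
    return p

p_spam = bernoulli_likelihood(x, theta_spam)
p_ham = bernoulli_likelihood(x, theta_ham)
print(round(p_spam, 3), round(p_ham, 3))    # 0.224 0.148
print("spam" if p_spam > p_ham else "ham")  # spam -- the ML classification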

Example: Table 9.1 – Training data for naive Bayes
(Flach, Machine Learning: Making Sense of Data, p. 280)

Email classification: training data

  E-mail  #a  #b  #c  Class      E-mail  a?  b?  c?  Class
  e1       0   3   0    +        e1       0   1   0    +
  e2       0   3   3    +        e2       0   1   1    +
  e3       3   0   0    +        e3       1   0   0    +
  e4       2   3   0    +        e4       1   1   0    +
  e5       4   3   0    −        e5       1   1   0    −
  e6       4   0   3    −        e6       1   0   1    −
  e7       3   0   0    −        e7       1   0   0    −
  e8       0   0   0    −        e8       0   0   0    −

(left) A small e-mail data set described by count vectors. (right) The
same data set described by bit vectors. What are the parameters of the
model?


Example: Table 9.1, continued

Applying the maximum likelihood estimates above to the bit-vector data:

  P(+) = 0.5,    P(−) = 0.5
  P(a|+) = 0.5,  P(a|−) = 0.75
  P(b|+) = 0.75, P(b|−) = 0.25
  P(c|+) = 0.25, P(c|−) = 0.25

Comments on Naïve Bayes

Usually features are not conditionally independent, i.e.

  P(X | Y) ≠ P(x1 | Y) P(x2 | Y) ⋯ P(xd | Y)

And yet it is one of the most widely used classifiers. Easy to train!
It often performs well even when the assumption is violated.

Domingos, P., & Pazzani, M. (1997). Beyond Independence: Conditions for
the Optimality of the Simple Bayesian Classifier. Machine Learning, 29,
103-130.
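
These estimates can be reproduced from the bit-vector representation of
Table 9.1; a minimal sketch:

# Bit-vector data from Table 9.1: (a?, b?, c?) and class label.
data = [((0, 1, 0), "+"), ((0, 1, 1), "+"), ((1, 0, 0), "+"), ((1, 1, 0), "+"),
        ((1, 1, 0), "-"), ((1, 0, 1), "-"), ((1, 0, 0), "-"), ((0, 0, 0), "-")]

n = len(data)
classes = ["+", "-"]
words = ["a", "b", "c"]

# Class priors: P(y) = |{i : y_i = y}| / n
prior = {y: sum(1 for _, yi in data if yi == y) / n for y in classes}

# Per-word presence probabilities: P(word_j = 1 | y)
cond = {}
for y in classes:
    rows = [x for x, yi in data if yi == y]
    cond[y] = {w: sum(x[j] for x in rows) / len(rows) for j, w in enumerate(words)}

print(prior)       # {'+': 0.5, '-': 0.5}
print(cond["+"])   # {'a': 0.5, 'b': 0.75, 'c': 0.25}
print(cond["-"])   # {'a': 0.75, 'b': 0.25, 'c': 0.25}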

When there are few training examples

What if you never see a training example where x1 = a when y = spam?

  P(x | spam) = P(a | spam) P(b | spam) P(c | spam) = 0

What to do?

Add "virtual" examples for which x1 = a when y = spam.
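
The "virtual examples" idea corresponds to Laplace (add-one) smoothing
of the counts. A minimal sketch, where using one virtual example per
feature value (alpha = 1) is an assumption rather than something fixed
by the slides:

def smoothed_estimate(count_feature_and_class, count_class, num_values=2, alpha=1):
    """Estimate P(x_i = v | y) with alpha 'virtual' examples per feature value.

    count_feature_and_class: number of training examples with x_i = v and label y
    count_class: number of training examples with label y
    num_values: number of values the feature can take (2 for binary features)
    """
    return (count_feature_and_class + alpha) / (count_class + alpha * num_values)

# Never saw x1 = a in spam: the raw estimate would be 0/4 = 0, zeroing the whole product.
print(smoothed_estimate(0, 4))   # 1/6 ≈ 0.167 instead of 0
print(smoothed_estimate(3, 4))   # 4/6 ≈ 0.667 instead of 0.75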


Naïve Bayes for continuous variables

Need to talk about continuous distributions!

Continuous Probability Distributions

The probability of the random variable assuming a value within some
given interval from x1 to x2 is defined to be the area under the graph
of the probability density function between x1 and x2.

[Figures: density functions f(x) of a uniform and a normal/Gaussian
distribution, with the area between x1 and x2 shaded]

The Gaussian (normal) distribution

  N(x | µ, σ²) = (1 / √(2πσ²)) exp( −(x − µ)² / (2σ²) )

Expectations

  Discrete variables:    E[x] = Σ_x x P(x)
  Continuous variables:  E[x] = ∫ x p(x) dx

  Conditional expectation (discrete):
    E[x | y] = Σ_x x P(x | y)

  Approximate expectation (discrete and continuous):
    E[x] ≈ (1/N) Σ_{n=1}^N x_n


Properties of the Gaussian distribution

  68.26% of values lie within µ ± 1σ
  95.44% of values lie within µ ± 2σ
  99.72% of values lie within µ ± 3σ

[Figure: normal density centered at µ, with the ±1σ, ±2σ and ±3σ
intervals marked]

Standard Normal Distribution

A random variable having a normal distribution with a mean of 0 and a
standard deviation of 1 is said to have a standard normal probability
distribution.

Standard Normal Probability Distribution

Converting to the Standard Normal Distribution:

  z = (x − µ) / σ

We can think of z as a measure of the number of standard deviations
x is from µ.

Gaussian Parameter Estimation

Likelihood function (for i.i.d. observations x1, …, xN):

  p(x1, …, xN | µ, σ²) = ∏_{n=1}^N N(x_n | µ, σ²)

Maximum (Log) Likelihood

Maximizing the log likelihood

  ln p(x1, …, xN | µ, σ²) = Σ_{n=1}^N ln N(x_n | µ, σ²)

with respect to µ and σ² gives

  µ_ML = (1/N) Σ_{n=1}^N x_n        σ²_ML = (1/N) Σ_{n=1}^N (x_n − µ_ML)²

Example: [figure not extracted]

Gaussian models

Assume we have data that belongs to three classes, and assume a
likelihood that follows a Gaussian distribution.

Gaussian Naïve Bayes

Likelihood function:

  P(Xi = x | Y = yk) = (1 / (√(2π) σik)) exp( −(x − µik)² / (2σik²) )

Need to estimate a mean and a variance for each feature in each class.
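
A minimal sketch of Gaussian naïve Bayes under these assumptions:
estimate a mean and variance for each feature in each class, then score
each class by its prior times the product of per-feature Gaussian
densities (in log space). The tiny two-feature dataset is made up for
illustration.

import math
from collections import defaultdict

# Made-up 2-feature training data: (x1, x2), class label.
data = [((6.0, 180.0), "m"), ((5.9, 190.0), "m"), ((5.6, 170.0), "m"),
        ((5.0, 100.0), "f"), ((5.5, 150.0), "f"), ((5.4, 130.0), "f")]

# Per-class maximum likelihood estimates: mean and variance of each feature.
stats, prior = {}, {}
by_class = defaultdict(list)
for x, y in data:
    by_class[y].append(x)
for y, rows in by_class.items():
    prior[y] = len(rows) / len(data)
    means = [sum(col) / len(rows) for col in zip(*rows)]
    variances = [sum((v - m) ** 2 for v in col) / len(rows)
                 for col, m in zip(zip(*rows), means)]
    stats[y] = (means, variances)

def gaussian_log_pdf(x, mean, var):
    # log of (1 / sqrt(2*pi*var)) * exp(-(x - mean)^2 / (2*var))
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def predict(x):
    scores = {}
    for y in stats:
        means, variances = stats[y]
        scores[y] = math.log(prior[y]) + sum(
            gaussian_log_pdf(xi, m, v) for xi, m, v in zip(x, means, variances))
    return max(scores, key=scores.get)

print(predict((5.8, 175.0)))   # 'm' for this made-up data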


Summary

Naïve Bayes classifier:
  - What's the assumption
  - Why we make it
  - How we learn it

Naïve Bayes for discrete data

Gaussian naïve Bayes
