
10/29/13

A crash course in probability and Naïve Bayes classification

Chapter 9

Probability theory

Random variable: a variable whose possible values are numerical
outcomes of a random phenomenon.

Examples: a person's height, the outcome of a coin toss

Distinguish between discrete and continuous variables.

The distribution of a discrete random variable:
  The probabilities of each value it can take.
  Notation: P(X = xi).
  These numbers satisfy:

    Σ_i P(X = xi) = 1

Probability theory

A joint probability distribution for two variables is a table.

[Table illustrating joint, marginal, and conditional probabilities]

If the two variables are binary, how many parameters does it have?

Probability theory

What about the joint probability of d variables, P(X1, …, Xd)?

How many parameters does it have if each variable is binary?


Probability theory

The Rules of Probability

Marginalization:

  P(X) = Σ_Y P(X, Y)

Product Rule:

  P(X, Y) = P(Y | X) P(X)

Independence: X and Y are independent if P(Y|X) = P(Y)
This implies P(X,Y) = P(X) P(Y)

Optimal Classification

Optimal predictor (Bayes classifier):

  y* = argmax_Y P(Y | X)

Equivalently,

  y* = argmax_Y P(X | Y) P(Y)
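
To make these rules concrete, here is a minimal Python sketch (not from
the slides) that applies marginalization and the product rule to a
made-up joint distribution over two binary variables; all numbers are
assumptions chosen only so the table sums to 1.

# A made-up joint distribution P(X, Y) over two binary variables; entries sum to 1.
joint = {
    (0, 0): 0.30, (0, 1): 0.10,
    (1, 0): 0.20, (1, 1): 0.40,
}

# Marginalization: P(X = x) = sum over y of P(X = x, Y = y)
p_x = {x: sum(p for (xv, _), p in joint.items() if xv == x) for x in (0, 1)}

# Product rule rearranged: P(Y = y | X = x) = P(X = x, Y = y) / P(X = x)
p_y_given_x = {(y, x): joint[(x, y)] / p_x[x] for x in (0, 1) for y in (0, 1)}

print(p_x)                  # P(X=0) = 0.4, P(X=1) = 0.6
print(p_y_given_x[(1, 1)])  # P(Y = 1 | X = 1) = 0.4 / 0.6 ≈ 0.667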

Using probability in learning

Interested in: P(Y | X)
X – features, Y – labels

For example, when classifying spam, we could estimate
P(Y | Viagara, lottery)

We would then classify an example as spam if P(Y | X) > 0.5.

However, it's usually easier to model P(X | Y).

Maximum likelihood

Fit a probabilistic model P(x | θ) to data
Estimate θ

Given independent identically distributed (i.i.d.) data
X = (x1, x2, …, xn)

Likelihood:
  P(X | θ) = P(x1 | θ) P(x2 | θ) ⋯ P(xn | θ)

Log likelihood:
  ln P(X | θ) = Σ_{i=1}^n ln P(xi | θ)

Maximum likelihood solution: parameters θ that maximize ln P(X | θ)

Example: coin toss

Estimate the probability p that a coin lands "Heads" using the result
of n coin tosses, h of which resulted in heads.

The likelihood of the data:

  P(X | θ) = p^h (1 − p)^(n − h)

Log likelihood:

  ln P(X | θ) = h ln p + (n − h) ln(1 − p)

Taking a derivative and setting it to 0:

  ∂ ln P(X | θ) / ∂p = h/p − (n − h)/(1 − p) = 0
  ⇒ p = h/n

Bayes' rule

From the product rule:

  P(Y, X) = P(Y | X) P(X)

and:

  P(Y, X) = P(X | Y) P(Y)

Therefore:

  P(Y | X) = P(X | Y) P(Y) / P(X)

This is known as Bayes' rule.
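
A quick numerical check of the coin-toss result above (a sketch, with
assumed counts n = 10 and h = 7): maximizing the log likelihood
h ln p + (n − h) ln(1 − p) over a grid recovers p = h/n.

import math

n, h = 10, 7   # assumed counts: 10 tosses, 7 heads

def log_likelihood(p):
    # ln P(X | theta) = h ln p + (n - h) ln(1 - p)
    return h * math.log(p) + (n - h) * math.log(1 - p)

# Evaluate on a fine grid of candidate values for p and keep the best one.
grid = [i / 1000 for i in range(1, 1000)]
p_best = max(grid, key=log_likelihood)

print(p_best, h / n)   # 0.7 0.7 -- the grid maximum matches the closed form p = h/n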

Bayes' rule

  P(Y | X) = P(X | Y) P(Y) / P(X)

Here P(Y | X) is the posterior, P(X | Y) is the likelihood, and P(Y)
is the prior:

  posterior ∝ likelihood × prior

P(X) can be computed as:

  P(X) = Σ_Y P(X | Y) P(Y)

but it is not important for inferring a label.

Maximum a-posteriori and maximum likelihood

The maximum a posteriori (MAP) rule:

  y_MAP = argmax_Y P(Y | X) = argmax_Y P(X | Y) P(Y) / P(X) = argmax_Y P(X | Y) P(Y)

If we ignore the prior distribution or assume it is uniform, we obtain
the maximum likelihood rule:

  y_ML = argmax_Y P(X | Y)

A classifier that has access to P(Y | X) is a Bayes optimal classifier.
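
A small sketch of Bayes' rule and the MAP decision with made-up
numbers; the prior and class-conditional likelihoods below are
assumptions, not values from the slides.

# Assumed prior and class-conditional likelihoods for one feature vector x.
prior = {"spam": 0.3, "ham": 0.7}            # P(Y)
likelihood = {"spam": 0.20, "ham": 0.05}     # P(X = x | Y)

unnormalized = {y: likelihood[y] * prior[y] for y in prior}   # P(X | Y) P(Y)
evidence = sum(unnormalized.values())                         # P(X) = sum_Y P(X|Y) P(Y)
posterior = {y: unnormalized[y] / evidence for y in prior}    # P(Y | X), Bayes' rule

y_map = max(unnormalized, key=unnormalized.get)   # argmax_Y P(X | Y) P(Y)
y_ml = max(likelihood, key=likelihood.get)        # argmax_Y P(X | Y), ignores the prior

print(posterior)       # {'spam': ~0.63, 'ham': ~0.37}
print(y_map, y_ml)     # spam spam -- dividing by P(X) never changes the argmax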


Naïve Bayes classifier

Learning the Optimal Classifier

Task: predict whether or not a picnic spot is enjoyable

Training data: X = (X1  X2  X3  …  …  Xd)   Y,   n rows

Let's learn P(Y|X) – how many parameters?
  Prior: P(Y = y) for all y:                  K − 1 if K labels
  Likelihood: P(X = x | Y = y) for all x, y:  (2^d − 1)K if d binary features

Naïve Bayes classifier

We would like to model P(X | Y), where X is a feature vector,
and Y is its associated label.

How many parameters?
  Prior: P(Y):            k − 1 if k classes
  Likelihood: P(X | Y):   (2^d − 1)k for binary features

Naïve Bayes classifier

We would like to model P(X | Y), where X is a feature vector,
and Y is its associated label.

Simplifying assumption: conditional independence: given the
class label the features are independent, i.e.

  P(X | Y) = P(x1 | Y) P(x2 | Y) ⋯ P(xd | Y)

How many parameters now?  dk + k − 1

Naïve Bayes classifier

Naïve Bayes decision rule:

  y_NB = argmax_Y P(X | Y) P(Y) = argmax_Y [ ∏_{i=1}^d P(xi | Y) ] P(Y)

If conditional independence holds, NB is an optimal classifier!
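
A minimal sketch of this decision rule for binary features, assuming
the class priors P(Y) and per-feature probabilities P(xi = 1 | Y) have
already been estimated; the parameter values below are placeholders
(they coincide with the e-mail example that appears later in the deck).

import math

# Placeholder parameters for two classes and d = 3 binary features.
prior = {"+": 0.5, "-": 0.5}                 # P(Y)
theta = {"+": [0.5, 0.75, 0.25],             # theta[y][i] = P(x_i = 1 | Y = y)
         "-": [0.75, 0.25, 0.25]}

def nb_predict(x):
    # Return argmax_Y P(Y) * prod_i P(x_i | Y), computed in log space for stability.
    scores = {}
    for y in prior:
        score = math.log(prior[y])
        for xi, t in zip(x, theta[y]):
            score += math.log(t if xi == 1 else 1 - t)
        scores[y] = score
    return max(scores, key=scores.get)

print(nb_predict([1, 1, 0]))   # '+' for these parameters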


Training a Naïve Bayes classifier

Training data: feature matrix X (n x d) and labels y1, …, yn

Maximum likelihood estimates:

  Class prior:  P̂(y) = |{i : yi = y}| / n

  Likelihood:   P̂(xi | y) = P̂(xi, y) / P̂(y)
                          = (|{i : Xij = xi, yi = y}| / n) / (|{i : yi = y}| / n)

Example 9.4: Prediction using a naive Bayes model
(Flach, Machine Learning: Making Sense of Data, p. 276)

Email classification

Suppose our vocabulary contains three words a, b and c, and we use a
multivariate Bernoulli model for our e-mails, with parameters

  θ⊕ = (0.5, 0.67, 0.33)      θ⊖ = (0.67, 0.33, 0.33)

This means, for example, that the presence of b is twice as likely in
spam (+), compared with ham.

The e-mail to be classified contains words a and b but not c, and hence
is described by the bit vector x = (1, 1, 0). We obtain likelihoods

  P(x | ⊕) = 0.5 · 0.67 · (1 − 0.33) = 0.222
  P(x | ⊖) = 0.67 · 0.33 · (1 − 0.33) = 0.148

The ML classification of x is thus spam.
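
The arithmetic of Example 9.4 can be checked directly; this sketch uses
the parameter vectors and bit vector given above. Note that with the
rounded values 0.67 and 0.33 the first product comes out as 0.224
rather than the 0.222 reported on the slide, which uses the exact
fractions 2/3 and 1/3.

theta_spam = (0.5, 0.67, 0.33)   # P(word present | spam) for a, b, c
theta_ham = (0.67, 0.33, 0.33)   # P(word present | ham)
x = (1, 1, 0)                    # the e-mail contains a and b but not c

def bernoulli_likelihood(x, theta):
    # P(x | theta) = prod_i theta_i^x_i * (1 - theta_i)^(1 - x_i)
    p = 1.0
    for xi, t in zip(x, theta):
        p *= t if xi == 1 else 1 - t
    return p

p_spam = bernoulli_likelihood(x, theta_spam)
p_ham = bernoulli_likelihood(x, theta_ham)
print(round(p_spam, 3), round(p_ham, 3))    # 0.224 0.148
print("spam" if p_spam > p_ham else "ham")  # spam -- the ML classification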

Example: Table 9.1 – Training data for naive Bayes
(Flach, Machine Learning: Making Sense of Data, p. 280)

Email classification: training data

  E-mail  #a  #b  #c  Class      E-mail  a?  b?  c?  Class
  e1       0   3   0    +        e1       0   1   0    +
  e2       0   3   3    +        e2       0   1   1    +
  e3       3   0   0    +        e3       1   0   0    +
  e4       2   3   0    +        e4       1   1   0    +
  e5       4   3   0    −        e5       1   1   0    −
  e6       4   0   3    −        e6       1   0   1    −
  e7       3   0   0    −        e7       1   0   0    −
  e8       0   0   0    −        e8       0   0   0    −

(left) A small e-mail data set described by count vectors. (right) The
same data set described by bit vectors. What are the parameters of the
model?


Example: Table 9.1, continued

Applying the maximum likelihood estimates above to the bit-vector data:

  P(+) = 0.5,    P(−) = 0.5
  P(a|+) = 0.5,  P(a|−) = 0.75
  P(b|+) = 0.75, P(b|−) = 0.25
  P(c|+) = 0.25, P(c|−) = 0.25

Comments on Naïve Bayes

Usually features are not conditionally independent, i.e.

  P(X | Y) ≠ P(x1 | Y) P(x2 | Y) ⋯ P(xd | Y)

And yet it is one of the most widely used classifiers. Easy to train!
It often performs well even when the assumption is violated.

Domingos, P., & Pazzani, M. (1997). Beyond Independence: Conditions for
the Optimality of the Simple Bayesian Classifier. Machine Learning, 29,
103-130.
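
These estimates can be reproduced from the bit-vector representation of
Table 9.1; a minimal sketch:

# Bit-vector data from Table 9.1: (a?, b?, c?) and class label.
data = [((0, 1, 0), "+"), ((0, 1, 1), "+"), ((1, 0, 0), "+"), ((1, 1, 0), "+"),
        ((1, 1, 0), "-"), ((1, 0, 1), "-"), ((1, 0, 0), "-"), ((0, 0, 0), "-")]

n = len(data)
classes = ["+", "-"]
words = ["a", "b", "c"]

# Class priors: P(y) = |{i : y_i = y}| / n
prior = {y: sum(1 for _, yi in data if yi == y) / n for y in classes}

# Per-word presence probabilities: P(word_j = 1 | y)
cond = {}
for y in classes:
    rows = [x for x, yi in data if yi == y]
    cond[y] = {w: sum(x[j] for x in rows) / len(rows) for j, w in enumerate(words)}

print(prior)       # {'+': 0.5, '-': 0.5}
print(cond["+"])   # {'a': 0.5, 'b': 0.75, 'c': 0.25}
print(cond["-"])   # {'a': 0.75, 'b': 0.25, 'c': 0.25}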

When there are few training examples

What if you never see a training example where x1 = a when y = spam?

  P(x | spam) = P(a | spam) P(b | spam) P(c | spam) = 0

What to do?

Add "virtual" examples for which x1 = a when y = spam.
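
The "virtual examples" idea corresponds to Laplace (add-one) smoothing
of the counts. A minimal sketch, where using one virtual example per
feature value (alpha = 1) is an assumption rather than something fixed
by the slides:

def smoothed_estimate(count_feature_and_class, count_class, num_values=2, alpha=1):
    """Estimate P(x_i = v | y) with alpha 'virtual' examples per feature value.

    count_feature_and_class: number of training examples with x_i = v and label y
    count_class: number of training examples with label y
    num_values: number of values the feature can take (2 for binary features)
    """
    return (count_feature_and_class + alpha) / (count_class + alpha * num_values)

# Never saw x1 = a in spam: the raw estimate would be 0/4 = 0, zeroing the whole product.
print(smoothed_estimate(0, 4))   # 1/6 ≈ 0.167 instead of 0
print(smoothed_estimate(3, 4))   # 4/6 ≈ 0.667 instead of 0.75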


Naïve Bayes for continuous variables

Need to talk about continuous distributions!

Continuous Probability Distributions

The probability of the random variable assuming a value within some
given interval from x1 to x2 is defined to be the area under the graph
of the probability density function between x1 and x2.

[Figures: density functions f(x) of a uniform and a normal/Gaussian
distribution, with the area between x1 and x2 shaded]

The Gaussian (normal) distribution

  N(x | µ, σ²) = (1 / √(2πσ²)) exp( −(x − µ)² / (2σ²) )

Expectations

  Discrete variables:    E[x] = Σ_x x P(x)
  Continuous variables:  E[x] = ∫ x p(x) dx

  Conditional expectation (discrete):
    E[x | y] = Σ_x x P(x | y)

  Approximate expectation (discrete and continuous):
    E[x] ≈ (1/N) Σ_{n=1}^N x_n


Properties of the Gaussian distribution

  68.26% of values lie within µ ± 1σ
  95.44% of values lie within µ ± 2σ
  99.72% of values lie within µ ± 3σ

[Figure: normal density centered at µ, with the ±1σ, ±2σ and ±3σ
intervals marked]

Standard Normal Distribution

A random variable having a normal distribution with a mean of 0 and a
standard deviation of 1 is said to have a standard normal probability
distribution.

Standard Normal Probability Distribution

Converting to the Standard Normal Distribution:

  z = (x − µ) / σ

We can think of z as a measure of the number of standard deviations
x is from µ.

Gaussian Parameter Estimation

Likelihood function (for i.i.d. observations x1, …, xN):

  p(x1, …, xN | µ, σ²) = ∏_{n=1}^N N(x_n | µ, σ²)

Maximum (Log) Likelihood

Maximizing the log likelihood

  ln p(x1, …, xN | µ, σ²) = Σ_{n=1}^N ln N(x_n | µ, σ²)

with respect to µ and σ² gives

  µ_ML = (1/N) Σ_{n=1}^N x_n        σ²_ML = (1/N) Σ_{n=1}^N (x_n − µ_ML)²

Example: [figure not extracted]

Gaussian models

Assume we have data that belongs to three classes, and assume a
likelihood that follows a Gaussian distribution.

Gaussian Naïve Bayes

Likelihood function:

  P(Xi = x | Y = yk) = (1 / (√(2π) σik)) exp( −(x − µik)² / (2σik²) )

Need to estimate a mean and a variance for each feature in each class.
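
A minimal sketch of Gaussian naïve Bayes under these assumptions:
estimate a mean and variance for each feature in each class, then score
each class by its prior times the product of per-feature Gaussian
densities (in log space). The tiny two-feature dataset is made up for
illustration.

import math
from collections import defaultdict

# Made-up 2-feature training data: (x1, x2), class label.
data = [((6.0, 180.0), "m"), ((5.9, 190.0), "m"), ((5.6, 170.0), "m"),
        ((5.0, 100.0), "f"), ((5.5, 150.0), "f"), ((5.4, 130.0), "f")]

# Per-class maximum likelihood estimates: mean and variance of each feature.
stats, prior = {}, {}
by_class = defaultdict(list)
for x, y in data:
    by_class[y].append(x)
for y, rows in by_class.items():
    prior[y] = len(rows) / len(data)
    means = [sum(col) / len(rows) for col in zip(*rows)]
    variances = [sum((v - m) ** 2 for v in col) / len(rows)
                 for col, m in zip(zip(*rows), means)]
    stats[y] = (means, variances)

def gaussian_log_pdf(x, mean, var):
    # log of (1 / sqrt(2*pi*var)) * exp(-(x - mean)^2 / (2*var))
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def predict(x):
    scores = {}
    for y in stats:
        means, variances = stats[y]
        scores[y] = math.log(prior[y]) + sum(
            gaussian_log_pdf(xi, m, v) for xi, m, v in zip(x, means, variances))
    return max(scores, key=scores.get)

print(predict((5.8, 175.0)))   # 'm' for this made-up data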


Summary

Naïve Bayes classifier:
  - What's the assumption
  - Why we make it
  - How we learn it

Naïve Bayes for discrete data

Gaussian naïve Bayes
