A crash course in probability and Naïve Bayes classification

Probability theory

Random variable: a variable whose possible values are numerical outcomes of a random phenomenon.
Examples: a person's height, the outcome of a coin toss
Marginal Probability
Marginalization: P(X) = Σ_Y P(X, Y)
Product Rule: P(X, Y) = P(Y | X) P(X) = P(X | Y) P(Y)
Independence: X and Y are independent if P(Y|X) = P(Y)
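A minimal sketch of the marginalization, product rule, and independence definitions above, in Python; the joint-probability table is made up for illustration:

```python
# A toy joint distribution P(X, Y) over X in {0, 1} and Y in {0, 1}.
# The numbers are made up for illustration; any table summing to 1 works.
joint = {
    (0, 0): 0.3, (0, 1): 0.3,
    (1, 0): 0.2, (1, 1): 0.2,
}

# Marginalization: P(X = x) = sum over y of P(X = x, Y = y)
p_x = {x: sum(p for (xi, _), p in joint.items() if xi == x) for x in (0, 1)}

# Product rule: P(Y = y | X = x) = P(X = x, Y = y) / P(X = x)
p_y_given_x = {(y, x): joint[(x, y)] / p_x[x] for x in (0, 1) for y in (0, 1)}

# Independence check: X and Y are independent if P(Y | X) = P(Y) for all x, y
p_y = {y: sum(p for (_, yi), p in joint.items() if yi == y) for y in (0, 1)}
independent = all(abs(p_y_given_x[(y, x)] - p_y[y]) < 1e-12
                  for x in (0, 1) for y in (0, 1))
print(p_x, p_y, independent)   # here P(Y|X) = P(Y), so X and Y are independent
```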
Optimal predictor (Bayes classifier):
f*(x) = argmax_y P(Y = y | X = x)
Equivalently,
f*(x) = argmax_y P(X = x | Y = y) P(Y = y)
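A tiny sketch of this decision rule in Python; the priors and class-conditional probabilities are made up for illustration:

```python
# Made-up class priors and class-conditional probabilities P(x | y).
prior = {"+": 0.5, "-": 0.5}
likelihood = {("+", "x1"): 0.7, ("-", "x1"): 0.2,
              ("+", "x2"): 0.3, ("-", "x2"): 0.8}

def bayes_predict(x):
    # argmax over y of P(x | y) P(y); dividing by P(x) would not change the argmax
    return max(prior, key=lambda y: likelihood[(y, x)] * prior[y])

print(bayes_predict("x1"))  # "+"
print(bayes_predict("x2"))  # "-"
```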
P(Y | X) = P(X | Y) P(Y) / P(X)
This is known as Bayes' rule.

Maximum likelihood estimate of the probability of heads p, given h heads in n coin flips. Taking a derivative and setting it to 0:
∂ ln P(X | θ) / ∂p = h/p − (n − h)/(1 − p) = 0  ⇒  p = h/n
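A small numerical check of the maximum-likelihood result p = h/n, assuming the data are independent coin flips (the flip sequence is made up):

```python
import math

flips = [1, 0, 1, 1, 0, 1, 1, 0]   # made-up coin-toss data: 1 = heads, 0 = tails
h, n = sum(flips), len(flips)

# Closed-form maximum-likelihood estimate derived above: p = h / n
p_mle = h / n

# Check numerically: the log-likelihood ln P(data | p) = h ln p + (n - h) ln(1 - p)
# is maximized at p = h / n.
def log_likelihood(p):
    return h * math.log(p) + (n - h) * math.log(1 - p)

grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=log_likelihood)
print(p_mle, p_grid)   # both are 0.625
```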
Learning the Optimal Classifier
We would like to model P(X | Y), where X is a feature vector,
and Y is its associated label.
Naïve Bayes parameter estimates, obtained by counting:
P̂(y) = |{i : yi = y}| / n
P̂(xi | y) = P̂(xi, y) / P̂(y) = (|{i : Xij = xi, yi = y}| / n) / (|{i : yi = y}| / n)
Email classification: training data

E-mail  #a  #b  #c  Class      E-mail  a?  b?  c?  Class
e1       0   3   0    +        e1       0   1   0    +
e2       0   3   3    +        e2       0   1   1    +
e3       3   0   0    +        e3       1   0   0    +
e4       2   3   0    +        e4       1   1   0    +
e5       4   3   0    -        e5       1   1   0    -
e6       4   0   3    -        e6       1   0   1    -
e7       3   0   0    -        e7       1   0   0    -
e8       0   0   0    -        e8       0   0   0    -

(left) A small e-mail data set described by count vectors. (right) The same data set described by bit vectors. What are the parameters of the model?
Peter Flach (University of Bristol), Machine Learning: Making Sense of Data, August 25, 2012, 277 / 349.

Estimated parameters for the bit-vector data:
P(+) = 0.5, P(-) = 0.5
P(a|+) = 0.5, P(a|-) = 0.75
P(b|+) = 0.75, P(b|-) = 0.25
P(c|+) = 0.25, P(c|-) = 0.25

Usually features are not conditionally independent, i.e.
P(X|Y) ≠ P(x1|Y) P(x2|Y) · · · P(xd|Y)
And yet, naïve Bayes is one of the most widely used classifiers. It is easy to train!

Domingos, P., & Pazzani, M. (1997). Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier. Machine Learning, 29, 103-130.
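A minimal Python sketch of estimating these naïve Bayes parameters by counting on the bit-vector data set above; it reproduces the values listed:

```python
# Bit-vector e-mail data set from the table above: (a?, b?, c?), class
data = [
    ((0, 1, 0), "+"), ((0, 1, 1), "+"), ((1, 0, 0), "+"), ((1, 1, 0), "+"),
    ((1, 1, 0), "-"), ((1, 0, 1), "-"), ((1, 0, 0), "-"), ((0, 0, 0), "-"),
]
n = len(data)
classes = ["+", "-"]

# P^(y) = |{i : yi = y}| / n
prior = {y: sum(1 for _, yi in data if yi == y) / n for y in classes}

# P^(xj = 1 | y) = |{i : Xij = 1, yi = y}| / |{i : yi = y}|
cond = {}
for y in classes:
    rows = [x for x, yi in data if yi == y]
    for j, name in enumerate("abc"):
        cond[(name, y)] = sum(x[j] for x in rows) / len(rows)

print(prior)                                # {'+': 0.5, '-': 0.5}
print(cond[("a", "+")], cond[("a", "-")])   # 0.5 0.75
print(cond[("b", "+")], cond[("b", "-")])   # 0.75 0.25
print(cond[("c", "+")], cond[("c", "-")])   # 0.25 0.25
```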
When there are few training examples
What if you never see a training example where x1 = a when y = spam?
P(x | spam) = P(a | spam) P(b | spam) P(c | spam) = 0
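A short sketch of the problem: one unseen feature value drives the whole product to zero, no matter what the other features say. The add-one (Laplace) smoothing line at the end is a standard fix, not something derived above; the counts are made up:

```python
# Counts of x1 = a among spam e-mails, from a hypothetical training set
count_a_and_spam = 0     # never saw x1 = a together with y = spam
count_spam = 10

# Unsmoothed estimate: P(a | spam) = 0, so the naive Bayes product is 0
p_a_spam = count_a_and_spam / count_spam
p_b_spam, p_c_spam = 0.6, 0.3          # made-up values for the other features
print(p_a_spam * p_b_spam * p_c_spam)  # 0.0 -- the other evidence is ignored

# Add-one (Laplace) smoothing, a standard fix: pretend each value was seen once
k = 2   # number of values x1 can take (e.g., a binary feature)
p_a_spam_smoothed = (count_a_and_spam + 1) / (count_spam + k)
print(p_a_spam_smoothed)               # 1/12, no longer exactly zero
```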
[Figures: the density f(x) of the Uniform distribution, constant between x1 and x2, and of the Normal/Gaussian distribution.]
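For reference, the standard density formulas behind those figures, written as a small Python sketch (standard textbook forms, not taken from the slides):

```python
import math

def uniform_pdf(x, x1, x2):
    """Uniform density: constant 1/(x2 - x1) on [x1, x2], zero elsewhere."""
    return 1.0 / (x2 - x1) if x1 <= x <= x2 else 0.0

def normal_pdf(x, mu, sigma):
    """Normal/Gaussian density with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

print(uniform_pdf(0.5, 0.0, 2.0))   # 0.5
print(normal_pdf(0.0, 0.0, 1.0))    # ~0.3989
```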
Conditional expectation (discrete):
E[f | y] = Σ_x P(x | y) f(x)
Approximate expectation (discrete and continuous):
E[f] ≈ (1/N) Σ_{n=1}^{N} f(x_n), where x_1, …, x_N are samples drawn from P(x)
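A minimal sketch of the approximate-expectation formula, averaging over samples; the distribution (a standard normal) and the function f(x) = x² are chosen only for illustration:

```python
import random

random.seed(0)

# Approximate E[f(X)] by an average over N samples drawn from P(x).
# Here P is a standard normal and f(x) = x**2, so the true value is 1.
N = 100_000
samples = [random.gauss(0.0, 1.0) for _ in range(N)]
approx = sum(x ** 2 for x in samples) / N
print(approx)   # close to 1.0
```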
[Figure: the Gaussian density over x, with µ and µ ± 1σ, µ ± 2σ, µ ± 3σ marked on the horizontal axis.]

z = (x − µ) / σ

Likelihood function
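A small sketch tying the standardization formula to a Gaussian likelihood: fit µ and σ by maximum likelihood (sample mean and standard deviation), compute z-scores, and evaluate the log-likelihood of the data under N(µ, σ²); the data values are made up:

```python
import math

data = [4.9, 5.1, 5.0, 4.8, 5.2]       # made-up measurements

# Maximum-likelihood fit of a Gaussian: mu = sample mean, sigma^2 = sample variance
mu = sum(data) / len(data)
sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / len(data))

# z-score: how many standard deviations a point lies from the mean
z = [(x - mu) / sigma for x in data]

# Log-likelihood of the data under N(mu, sigma^2)
def log_normal_pdf(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

log_lik = sum(log_normal_pdf(x, mu, sigma) for x in data)
print(mu, sigma, z, log_lik)
```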
Summary
Naïve Bayes classifier:
- What's the assumption
- Why we make it
- How we learn it
Naïve Bayes for discrete data
Gaussian naïve Bayes
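Since the summary mentions Gaussian naïve Bayes, here is a minimal sketch of the idea: one Gaussian per class and per feature, combined under the naïve independence assumption. The training data are made up for illustration:

```python
import math

# Made-up 2-feature training data: (features, label)
train = [
    ((5.0, 1.2), "+"), ((5.2, 1.0), "+"), ((4.8, 1.1), "+"),
    ((6.1, 0.4), "-"), ((5.9, 0.5), "-"), ((6.3, 0.3), "-"),
]
classes = ["+", "-"]

# Fit one Gaussian per class and per feature (naive independence assumption)
def fit(rows):
    params = []
    for j in range(len(rows[0])):
        vals = [r[j] for r in rows]
        mu = sum(vals) / len(vals)
        sigma = math.sqrt(sum((v - mu) ** 2 for v in vals) / len(vals)) or 1e-6
        params.append((mu, sigma))
    return params

model = {y: fit([x for x, yi in train if yi == y]) for y in classes}
prior = {y: sum(1 for _, yi in train if yi == y) / len(train) for y in classes}

def log_gauss(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def predict(x):
    # argmax over y of log P(y) + sum_j log N(x_j | mu_jy, sigma_jy)
    score = lambda y: math.log(prior[y]) + sum(
        log_gauss(xj, mu, s) for xj, (mu, s) in zip(x, model[y]))
    return max(classes, key=score)

print(predict((5.1, 1.1)))   # "+"
print(predict((6.0, 0.4)))   # "-"
```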