
CSE291D Lecture 2:

Bayesian Inference
• This topic is so important that we have a very
special guest lecturer today:

https://www.youtube.com/watch?v=Zr_xWfThjJ0
Suppose you're on a game show, and you're given the choice of three doors:
Behind one door is a car; behind the others, goats.

You pick a door, say No. 1, and the host, who knows what's behind the doors, opens
another door, say No. 3, which has a goat. He then says to you, "Do you want to pick
door No. 2?"

Is it to your advantage to switch your choice?

Bayesian Inference
• If the winning door was chosen arbitrarily, why would
the probability of winning be anything other than 1/3?

• The host will not open the door with the car, so their
choice gives us valuable information. We need to
update our beliefs about where the car is in light of our
observation of the host’s choice.

• Bayesian inference is a quantitative framework for the process of updating beliefs based on observed data.

Bayesian inference and AI
• Bayesian inference predates AI by around 200 years
– Thomas Bayes’ paper published 1763
– John McCarthy coined the term AI in 1955

"the science and engineering of making intelligent machines”

• Yet, Bayesian inference fits AI like a glove


– We would like to build intelligent agents that learn from
the world and make intelligent decisions
– Bayesian inference tells us how an intelligent agent should
update its beliefs in order to act
Learning Outcomes
By the end of the lesson, you should be able to:

• Explain the differences between Bayesian and frequentist interpretations of probability.

• Contrast the following estimation methods according to how they deal with uncertainty: maximum likelihood estimation (MLE), maximum a posteriori (MAP) estimation, and fully Bayesian inference.

• Calculate Bayesian posterior distributions for simple models.

• Make predictions for unobserved data using the posterior predictive distribution.

Answer
• All are true, but the answer is B).

The posterior is always available, but is not always tractable to compute. Computational and algorithmic advances were what made Bayesian methods more feasible.

Answer
• D). Laplace was one of the first to promote
Bayesian ideas, after Bayes.

Lindley and de Finetti were proponents in the 20th century. Fisher is perhaps the most important founder of frequentist statistics.

Answer
• D). The posterior predictive distribution is
used to predict new data.

The prior and posterior are used to predict unobservable parameters.

While the likelihood is a model for observable data, it isn’t used directly for prediction.

Bayes’ rule
• Our goal:
To infer the probability of a hypothesis H, given evidence E.
This probability distribution is called the posterior distribution, P(H | E).

• We have:
Prior beliefs about H, encoded as a probability distribution: the prior distribution, P(H)

A model for how the data were generated, given the hypothesis: the likelihood, P(E | H)

Bayes’ rule

• Bayes’ rule allows us to combine observed evidence and prior beliefs to obtain the posterior,

P(H | E) = P(E | H) P(H) / P(E)

where P(E) is a normalizing constant.
Deconstructing Bayes’ rule

P(H | E) = P(H, E) / P(E)            (definition of conditional probability)

P(H, E) = P(E | H) P(H)              (product rule)

P(H | E) = P(E | H) P(H) / P(E)      (plug in equation for joint)

P(E) = Σ_H P(E | H) P(H)             (sum rule)
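As a concrete illustration of the derivation above, a minimal Python sketch of Bayes’ rule over a discrete set of hypotheses (the spam-filtering hypotheses and all numbers here are hypothetical):

# Bayes' rule over a discrete set of hypotheses H, given evidence E.
# Hypothetical example: is an email spam, given that it contains the word "prize"?

prior = {"spam": 0.3, "ham": 0.7}             # P(H)
likelihood = {"spam": 0.20, "ham": 0.01}      # P(E | H): Pr(word appears | H)

# Sum rule: P(E) = sum over H of P(E | H) P(H)   (the normalizing constant)
evidence = sum(likelihood[h] * prior[h] for h in prior)

# Bayes' rule: P(H | E) = P(E | H) P(H) / P(E)
posterior = {h: likelihood[h] * prior[h] / evidence for h in prior}

print(posterior)   # {'spam': ~0.90, 'ham': ~0.10}

The same three ingredients (prior, likelihood, normalizing constant) appear in every Bayesian calculation in this lecture.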
The Monty Hall problem

The Monty Hall problem

The contestant should always switch doors.

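A quick way to convince yourself is simulation. A minimal Python sketch of the game (door numbering and helper function are just for illustration); switching wins about 2/3 of the time, staying about 1/3:

import random

def play(switch):
    # One round of the game; returns True if the contestant wins the car.
    doors = [0, 1, 2]
    car = random.choice(doors)
    pick = random.choice(doors)
    # The host opens a door that is neither the contestant's pick nor the car.
    opened = random.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Switch to the remaining unopened door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

n = 100_000
print("win rate if you switch:", sum(play(True) for _ in range(n)) / n)   # ~0.667
print("win rate if you stay:  ", sum(play(False) for _ in range(n)) / n)  # ~0.333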
Probability and Inference
[Figure: probability runs from the data generating process to the observed data; inference runs from the observed data back to the data generating process]
Figure based on one by Larry Wasserman, "All of Statistics"
Bayesian Statistics

• A statistical methodology in which:

– Probabilities are interpreted as “degrees of belief”

– Model parameters are random variables

– To infer model parameters, we update our prior beliefs based on observed data via Bayes’ rule

Bayesian and Frequentist
Interpretations of Probability
• Bayesian interpretation:
– Probability refers to degrees of belief
– A probability distribution quantifies our uncertainty
– Typically subjective
• Frequentist interpretation:
– Probabilities are long-run frequencies
– I.e. if I repeat this experiment many times, what proportion of the time
will the event occur?

• “The probability that a coin will land heads is 0.5”


– Bayesian: we believe the coin is equally likely to land heads or tails on the
next toss.
– Frequentist: if we flip the coin many times, we expect it to land heads half
the time.

Limitations of Frequentist Probabilities
• We cannot assign frequentist probabilities to many “degree of belief”
statements that we’d like to use in our everyday lives, e.g.

– The probability that Shakespeare’s plays were written by Francis Bacon

– The probability that a particular signature on a particular check is genuine

– The probability that a particular email is spam

Model parameters
• In Bayesian statistics, model parameters are random variables. It is meaningful to talk about their probabilities,

p(θ) (prior)        p(θ | D) (posterior)

• In frequentist statistics, model parameters are fixed, unknown quantities. In this paradigm it is not meaningful to assign them probabilities.

• Frequentists often write the likelihood as p(D; θ) rather than p(D | θ) to emphasize this.
Inferring parameters
• As a Bayesian, all you have to do is write down your prior beliefs, write down your likelihood, and apply Bayes’ rule,

p(θ | D) = p(D | θ) p(θ) / p(D)

Elements of Bayesian Inference

p(θ | D) = p(D | θ) p(θ) / p(D)

Posterior: p(θ | D)      Likelihood: p(D | θ)      Prior: p(θ)
Marginal likelihood (a.k.a. model evidence): p(D) = ∫ p(D | θ) p(θ) dθ

p(D) is a normalization constant that does not depend on the value of θ. It is the probability of the data under the model, marginalizing over all possible θ’s.
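To make these elements concrete, a small Python sketch for a coin-flip model with a discretized uniform prior over θ = Pr(heads); the data (7 heads in 10 flips) are hypothetical:

import numpy as np

# Hypothetical data: 7 heads out of 10 flips.
heads, flips = 7, 10

# Discretize theta = Pr(heads) and put a uniform prior on the grid.
theta = np.linspace(0.01, 0.99, 99)
prior = np.ones_like(theta) / len(theta)                   # p(theta)

# Likelihood of the data under each theta (binomial; constant factor omitted).
likelihood = theta**heads * (1 - theta)**(flips - heads)   # p(D | theta)

# Marginal likelihood: p(D) = sum_theta p(D | theta) p(theta)  (up to the omitted constant)
evidence = np.sum(likelihood * prior)

# Posterior: p(theta | D) = p(D | theta) p(theta) / p(D)
posterior = likelihood * prior / evidence

print("posterior mean of theta:", np.sum(theta * posterior))   # ~0.67

The grid over θ is used only so that the marginal likelihood is a simple sum; a conjugate Beta prior would give the same answers in closed form.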
Subjectiveness of Bayesian Inference
• Bayesian inference is sometimes criticized for being
subjective. The choice of prior may affect the
conclusions, and so two Bayesians may get different
answers to the same query.

• But the choice of model (likelihood) and other modeling decisions are also subjective decisions!

• Bayesian statistics forces you to state ALL of your assumptions, and then tells you what your conclusions should be based on data + assumptions.
“You cannot do inference
without making assumptions!”

Sir David J. C. MacKay


As a Bayesian, and as a frequentist, what would you
estimate is the probability, if it is defined, that:

1. Einstein had a cup of coffee in 1916?
2. Einstein had a cup of coffee in 2016?
3. You will win the car in the Monty Hall problem if you switch doors?
4. Google DeepMind’s AlphaGo program beats professional Go player Lee Sedol in their next game, supposing that there will be a rematch? (AlphaGo won the recent match 4-1.)
5. There is intelligent life on other worlds?
• A simple Bayesian analysis shows that the fact that life
occurred early in the Earth’s history is not strong evidence
that life is common in the universe
• Key to this conclusion is the uncertainty about the rate at which life emerges, which is estimated from only one data point (Earth)

Fully Bayesian inference:
modeling uncertainty

[Figure: posterior distributions over θ for increasing numbers of observations]

• The posterior tells us how uncertain we are in our estimate of θ. As we observe more data, the posterior generally becomes more concentrated around the true θ.
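Numerically, assuming a Bernoulli model with a uniform Beta(1, 1) prior (so the posterior is Beta(1 + heads, 1 + tails) in closed form) and idealized data with about 70% heads, the posterior standard deviation shrinks as n grows:

import math

true_theta = 0.7
for n in [5, 50, 500]:
    heads = round(true_theta * n)                  # idealized data: ~70% heads
    a, b = 1 + heads, 1 + (n - heads)              # Beta(1,1) prior -> Beta(a, b) posterior
    mean = a / (a + b)
    std = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    print(f"n={n:4d}  posterior mean={mean:.3f}  posterior std={std:.3f}")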
Point Estimates
• Maximum likelihood estimator (MLE):

θ_MLE = argmax_θ p(D | θ)

• Maximum a posteriori estimator (MAP):

θ_MAP = argmax_θ p(θ | D) = argmax_θ p(D | θ) p(θ)

The prior acts as a regularizer!
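A small sketch of the contrast, assuming a Bernoulli likelihood with a Beta(α, β) prior (hypothetical data: 3 flips, all heads):

# Hypothetical data: 3 coin flips, all heads.
heads, flips = 3, 3

# MLE: maximize p(D | theta)  ->  heads / flips for a Bernoulli model.
theta_mle = heads / flips                                     # = 1.0

# MAP with a Beta(alpha, beta) prior: maximize p(D | theta) p(theta).
# Closed form for the Bernoulli/Beta case: (heads + alpha - 1) / (flips + alpha + beta - 2)
alpha, beta = 2.0, 2.0                                        # mild prior belief that the coin is fair
theta_map = (heads + alpha - 1) / (flips + alpha + beta - 2)  # = 0.8

print("MLE:", theta_mle, " MAP:", theta_map)

The prior acts like pseudo-counts: with only 3 observations it pulls the estimate away from the extreme value 1.0, and its influence fades as more data arrive.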
The full posterior distribution
can be very useful

The mode (MAP estimate) is unrepresentative of the distribution

MAP estimate can result in overfitting

Answer
• B) and C) are both good answers.

Bayesian methods are most valuable when there is uncertainty with respect to some parameters due to lack of data. Although there is a lot of data overall, there is typically little data per user (for recommender systems, especially in the cold start setting), and per document (for topic models).

Making Predictions
• The posterior predictive distribution is the distribution of new, unseen data, given the observed data.
• E.g., given that we have seen 10 coin flips come up heads, and one come up tails, the posterior predictive distribution gives us the probability that the next coin flip is heads or tails.

• We obtain this by marginalizing over the parameters,

p(x_new | D) = ∫ p(x_new | θ) p(θ | D) dθ
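A sketch of this marginalization for the 10-heads, 1-tails example, reusing the discretized coin model from earlier (uniform prior over θ):

import numpy as np

# Observed data: 10 heads, 1 tail.
heads, tails = 10, 1

theta = np.linspace(0.01, 0.99, 99)
prior = np.ones_like(theta) / len(theta)          # uniform prior p(theta)
likelihood = theta**heads * (1 - theta)**tails    # p(D | theta)
posterior = likelihood * prior
posterior /= posterior.sum()                      # p(theta | D)

# p(next flip = heads | D) = sum_theta Pr(heads | theta) p(theta | D),
# and Pr(heads | theta) is just theta, so this is the posterior mean of theta.
p_next_heads = np.sum(theta * posterior)
print("Pr(next flip is heads | data):", round(p_next_heads, 3))   # ~0.846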
Simulating the posterior predictive
• We can rewrite the posterior predictive as the expected value of the likelihood of the new data point with respect to the posterior:

p(x_new | D) = E_{θ ~ p(θ | D)}[ p(x_new | θ) ]

• To simulate new data, draw a θ from the posterior, then draw a data point from the likelihood model given θ
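A minimal sketch of that two-step simulation, reusing the same discretized posterior (10 heads, 1 tail, uniform prior):

import numpy as np

rng = np.random.default_rng(0)

heads, tails = 10, 1
theta = np.linspace(0.01, 0.99, 99)
posterior = theta**heads * (1 - theta)**tails     # with a uniform prior, posterior is proportional to the likelihood
posterior /= posterior.sum()

def simulate_flip():
    t = rng.choice(theta, p=posterior)            # step 1: draw theta from the posterior
    return rng.random() < t                       # step 2: draw a flip given theta (True = heads)

flips = [simulate_flip() for _ in range(100_000)]
print("simulated Pr(heads):", np.mean(flips))     # close to ~0.846 from the previous sketch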
Posterior Predictive Example
• Suppose we know a coin is either:
– An unbiased coin A, with Pr(heads) = 0.5, or
– a biased coin B, with Pr(heads) = 0.8.

• We have a uniform prior, Pr(A) = Pr(B) = 0.5

• We observe X = (heads, heads, heads, tails).

• What is the probability that the next flip is heads?


Posterior Predictive: Solution
• Compute the posterior over the choice of coin using Bayes’ rule:

Pr(X | A) = 0.5^3 × 0.5 = 0.0625,   Pr(X | B) = 0.8^3 × 0.2 = 0.1024
Pr(A | X) = (0.0625 × 0.5) / (0.0625 × 0.5 + 0.1024 × 0.5) ≈ 0.379,   Pr(B | X) ≈ 0.621

• Predict the next data point using the posterior predictive distribution:

Pr(heads | X) = Pr(heads | A) Pr(A | X) + Pr(heads | B) Pr(B | X) ≈ 0.5 × 0.379 + 0.8 × 0.621 ≈ 0.69
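The same computation in a few lines of Python, as a check on the arithmetic above:

priors = {"A": 0.5, "B": 0.5}
p_heads = {"A": 0.5, "B": 0.8}

# Likelihood of X = (heads, heads, heads, tails) under each coin.
lik = {c: p_heads[c]**3 * (1 - p_heads[c]) for c in priors}

# Posterior over the choice of coin via Bayes' rule.
evidence = sum(lik[c] * priors[c] for c in priors)
posterior = {c: lik[c] * priors[c] / evidence for c in priors}

# Posterior predictive probability that the next flip is heads.
print(sum(p_heads[c] * posterior[c] for c in priors))   # ~0.686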
Announcements
• Homework 1 is due 4/14/2016

• I’ll start recording participation using Poll Everywhere starting next week, so please sign up for an account (and participate in polls!).

• Piazza will also be factored into participation grades.
