
CSE291D Lecture 2:

Bayesian Inference
• This topic is so important that we have a very
special guest lecturer today:

https://www.youtube.com/watch?v=Zr_xWfThjJ0
Suppose you're on a game show, and you're given the choice of three doors:
Behind one door is a car; behind the others, goats.

You pick a door, say No. 1, and the host, who knows what's behind the doors, opens
another door, say No. 3, which has a goat. He then says to you, "Do you want to pick
door No. 2?"

Is it to your advantage to switch your choice?

Bayesian Inference
• If the winning door was chosen arbitrarily, why would
the probability of winning be anything other than 1/3?

• The host will not open the door with the car, so their
choice gives us valuable information. We need to
update our beliefs about where the car is in light of our
observation of the host’s choice.

• Bayesian inference is a quantitative framework for the process of updating beliefs based on observed data.

Bayesian inference and AI
• Bayesian inference predates AI by around 200 years
– Thomas Bayes’ paper published 1763
– John McCarthy coined the term AI in 1955

"the science and engineering of making intelligent machines”

• Yet, Bayesian inference fits AI like a glove


– We would like to build intelligent agents that learn from
the world and make intelligent decisions
– Bayesian inference tells us how an intelligent agent should
update its beliefs in order to act
Learning Outcomes
By the end of the lesson, you should be able to:

• Explain the differences between Bayesian and frequentist interpretations of probability.

• Contrast the following estimation methods according to how they deal with uncertainty: maximum likelihood estimation (MLE), maximum a posteriori (MAP) estimation, and fully Bayesian inference.

• Calculate Bayesian posterior distributions for simple models.

• Make predictions for unobserved data using the posterior predictive distribution.

Answer
• All are true, but the answer is B).

The posterior is always available, but is not always tractable to compute. Computational and algorithmic advances were what made Bayesian methods more feasible.

Answer
• D). Laplace was one of the first to promote
Bayesian ideas, after Bayes.

Lindley and de Finetti were proponents in the 20th century. Fisher is perhaps the most important founder of frequentist statistics.

Answer
• D). The posterior predictive distribution is
used to predict new data.

The prior and posterior are used to predict unobservable parameters.

While the likelihood is a model for observable data, it isn’t used directly for prediction.

Bayes’ rule
• Our goal:
To infer the probability of a hypothesis H, given evidence E.
This probability distribution is called the posterior distribution, P(H | E).

• We have:
Prior beliefs about H, encoded as a probability distribution: the prior distribution, P(H)

A model for how the data were generated, given the hypothesis: the likelihood, P(E | H)

Bayes’ rule

• Bayes’ rule allows us to combine observed evidence and prior beliefs to obtain the posterior,

P(H | E) = P(E | H) P(H) / P(E)

where P(E) is a normalizing constant.
Deconstructing Bayes’ rule

P(H | E) = P(H, E) / P(E)            (definition of conditional probability)

P(H, E) = P(E | H) P(H)              (product rule)

P(H | E) = P(E | H) P(H) / P(E)      (plug in equation for joint)

P(E) = Σ_H P(E | H) P(H)             (sum rule)
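As a concrete illustration of the derivation above, a minimal Python sketch of Bayes’ rule over a discrete set of hypotheses (the spam-filtering hypotheses and all numbers here are hypothetical):

# Bayes' rule over a discrete set of hypotheses H, given evidence E.
# Hypothetical example: is an email spam, given that it contains the word "prize"?

prior = {"spam": 0.3, "ham": 0.7}             # P(H)
likelihood = {"spam": 0.20, "ham": 0.01}      # P(E | H): Pr(word appears | H)

# Sum rule: P(E) = sum over H of P(E | H) P(H)   (the normalizing constant)
evidence = sum(likelihood[h] * prior[h] for h in prior)

# Bayes' rule: P(H | E) = P(E | H) P(H) / P(E)
posterior = {h: likelihood[h] * prior[h] / evidence for h in prior}

print(posterior)   # {'spam': ~0.90, 'ham': ~0.10}

The same three ingredients (prior, likelihood, normalizing constant) appear in every Bayesian calculation in this lecture.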
The Monty Hall problem

The Monty Hall problem

The contestant should always switch doors.

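A quick way to convince yourself is simulation. A minimal Python sketch of the game (door numbering and helper function are just for illustration); switching wins about 2/3 of the time, staying about 1/3:

import random

def play(switch):
    # One round of the game; returns True if the contestant wins the car.
    doors = [0, 1, 2]
    car = random.choice(doors)
    pick = random.choice(doors)
    # The host opens a door that is neither the contestant's pick nor the car.
    opened = random.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Switch to the remaining unopened door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

n = 100_000
print("win rate if you switch:", sum(play(True) for _ in range(n)) / n)   # ~0.667
print("win rate if you stay:  ", sum(play(False) for _ in range(n)) / n)  # ~0.333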
Probability and Inference
[Figure: probability runs from the data generating process to the observed data; inference runs from the observed data back to the data generating process]
Figure based on one by Larry Wasserman, "All of Statistics"
Bayesian Statistics

• A statistical methodology in which:

– Probabilities are interpreted as “degrees of belief”

– Model parameters are random variables

– To infer model parameters, we update our prior beliefs based on observed data via Bayes’ rule

Bayesian and Frequentist
Interpretations of Probability
• Bayesian interpretation:
– Probability refers to degrees of belief
– A probability distribution quantifies our uncertainty
– Typically subjective
• Frequentist interpretation:
– Probabilities are long-run frequencies
– I.e. if I repeat this experiment many times, what proportion of the time
will the event occur?

• “The probability that a coin will land heads is 0.5”


– Bayesian: we believe the coin is equally likely to land heads or tails on the
next toss.
– Frequentist: if we flip the coin many times, we expect it to land heads half
the time.

Limitations of Frequentist Probabilities
• We cannot assign frequentist probabilities to many “degree of belief”
statements that we’d like to use in our everyday lives, e.g.

– The probability that Shakespeare’s plays were written by Francis Bacon

– The probability that a particular signature on a particular check is genuine

– The probability that a particular email is spam

Model parameters
• In Bayesian statistics, model parameters are random variables. It is meaningful to talk about their probabilities,

p(θ) (prior)        p(θ | D) (posterior)

• In frequentist statistics, model parameters are fixed, unknown quantities. In this paradigm it is not meaningful to assign them probabilities.

• Frequentists often write the likelihood as p(D; θ) rather than p(D | θ) to emphasize this.
Inferring parameters
• As a Bayesian, all you have to do is write down your prior beliefs, write down your likelihood, and apply Bayes’ rule,

p(θ | D) = p(D | θ) p(θ) / p(D)

Elements of Bayesian Inference

p(θ | D) = p(D | θ) p(θ) / p(D)

Posterior: p(θ | D)      Likelihood: p(D | θ)      Prior: p(θ)
Marginal likelihood (a.k.a. model evidence): p(D) = ∫ p(D | θ) p(θ) dθ

p(D) is a normalization constant that does not depend on the value of θ. It is the probability of the data under the model, marginalizing over all possible θ’s.
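To make these elements concrete, a small Python sketch for a coin-flip model with a discretized uniform prior over θ = Pr(heads); the data (7 heads in 10 flips) are hypothetical:

import numpy as np

# Hypothetical data: 7 heads out of 10 flips.
heads, flips = 7, 10

# Discretize theta = Pr(heads) and put a uniform prior on the grid.
theta = np.linspace(0.01, 0.99, 99)
prior = np.ones_like(theta) / len(theta)                   # p(theta)

# Likelihood of the data under each theta (binomial; constant factor omitted).
likelihood = theta**heads * (1 - theta)**(flips - heads)   # p(D | theta)

# Marginal likelihood: p(D) = sum_theta p(D | theta) p(theta)  (up to the omitted constant)
evidence = np.sum(likelihood * prior)

# Posterior: p(theta | D) = p(D | theta) p(theta) / p(D)
posterior = likelihood * prior / evidence

print("posterior mean of theta:", np.sum(theta * posterior))   # ~0.67

The grid over θ is used only so that the marginal likelihood is a simple sum; a conjugate Beta prior would give the same answers in closed form.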
Subjectiveness of Bayesian Inference
• Bayesian inference is sometimes criticized for being
subjective. The choice of prior may affect the
conclusions, and so two Bayesians may get different
answers to the same query.

• But the choice of model (likelihood) and other modeling decisions are also subjective decisions!

• Bayesian statistics forces you to state ALL of your assumptions, and then tells you what your conclusions should be based on data + assumptions.
“You cannot do inference
without making assumptions!”

Sir David J. C. MacKay


As a Bayesian, and as a frequentist, what would you
estimate is the probability, if it is defined, that:

1. Einstein had a cup of coffee in 1916?
2. Einstein had a cup of coffee in 2016?
3. You will win the car in the Monty Hall problem if you switch doors?
4. Google DeepMind’s AlphaGo program beats professional Go player Lee Sedol in their next game, supposing that there will be a rematch? (AlphaGo won the recent match 4-1.)
5. There is intelligent life on other worlds?
• A simple Bayesian analysis shows that the fact that life
occurred early in the Earth’s history is not strong evidence
that life is common in the universe
• Key to this conclusion is the uncertainty about the rate at which life emerges, which is estimated from only one data point (Earth)

Fully Bayesian inference:
modeling uncertainty

[Figure: posterior distributions over θ for increasing numbers of observations]

• The posterior tells us how uncertain we are in our estimate of θ. As we observe more data, the posterior generally becomes more concentrated around the true θ.
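Numerically, assuming a Bernoulli model with a uniform Beta(1, 1) prior (so the posterior is Beta(1 + heads, 1 + tails) in closed form) and idealized data with about 70% heads, the posterior standard deviation shrinks as n grows:

import math

true_theta = 0.7
for n in [5, 50, 500]:
    heads = round(true_theta * n)                  # idealized data: ~70% heads
    a, b = 1 + heads, 1 + (n - heads)              # Beta(1,1) prior -> Beta(a, b) posterior
    mean = a / (a + b)
    std = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    print(f"n={n:4d}  posterior mean={mean:.3f}  posterior std={std:.3f}")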
Point Estimates
• Maximum likelihood estimator (MLE):

θ_MLE = argmax_θ p(D | θ)

• Maximum a posteriori estimator (MAP):

θ_MAP = argmax_θ p(θ | D) = argmax_θ p(D | θ) p(θ)

The prior acts as a regularizer!
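A small sketch of the contrast, assuming a Bernoulli likelihood with a Beta(α, β) prior (hypothetical data: 3 flips, all heads):

# Hypothetical data: 3 coin flips, all heads.
heads, flips = 3, 3

# MLE: maximize p(D | theta)  ->  heads / flips for a Bernoulli model.
theta_mle = heads / flips                                     # = 1.0

# MAP with a Beta(alpha, beta) prior: maximize p(D | theta) p(theta).
# Closed form for the Bernoulli/Beta case: (heads + alpha - 1) / (flips + alpha + beta - 2)
alpha, beta = 2.0, 2.0                                        # mild prior belief that the coin is fair
theta_map = (heads + alpha - 1) / (flips + alpha + beta - 2)  # = 0.8

print("MLE:", theta_mle, " MAP:", theta_map)

The prior acts like pseudo-counts: with only 3 observations it pulls the estimate away from the extreme value 1.0, and its influence fades as more data arrive.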
The full posterior distribution
can be very useful

The mode (MAP estimate) is unrepresentative of the distribution

MAP estimate can result in overfitting

Answer
• B) and C) are both good answers.

Bayesian methods are most valuable when there is uncertainty with respect to some parameters due to lack of data. Although there is a lot of data overall, there is typically little data per user (for recommender systems, especially in the cold start setting), and per document (for topic models).

Making Predictions
• The posterior predictive distribution is the distribution of new, unseen data, given the observed data.
• E.g., given that we have seen 10 coin flips come up heads, and one come up tails, the posterior predictive distribution gives us the probability that the next coin flip is heads or tails.

• We obtain this by marginalizing over the parameters,

p(x_new | D) = ∫ p(x_new | θ) p(θ | D) dθ
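A sketch of this marginalization for the 10-heads, 1-tails example, reusing the discretized coin model from earlier (uniform prior over θ):

import numpy as np

# Observed data: 10 heads, 1 tail.
heads, tails = 10, 1

theta = np.linspace(0.01, 0.99, 99)
prior = np.ones_like(theta) / len(theta)          # uniform prior p(theta)
likelihood = theta**heads * (1 - theta)**tails    # p(D | theta)
posterior = likelihood * prior
posterior /= posterior.sum()                      # p(theta | D)

# p(next flip = heads | D) = sum_theta Pr(heads | theta) p(theta | D),
# and Pr(heads | theta) is just theta, so this is the posterior mean of theta.
p_next_heads = np.sum(theta * posterior)
print("Pr(next flip is heads | data):", round(p_next_heads, 3))   # ~0.846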
Simulating the posterior predictive
• We can rewrite the posterior predictive as the expected value of the likelihood of the new data point with respect to the posterior:

p(x_new | D) = E_{θ ~ p(θ | D)}[ p(x_new | θ) ]

• To simulate new data, draw a θ from the posterior, then draw a data point from the likelihood model given θ
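A minimal sketch of that two-step simulation, reusing the same discretized posterior (10 heads, 1 tail, uniform prior):

import numpy as np

rng = np.random.default_rng(0)

heads, tails = 10, 1
theta = np.linspace(0.01, 0.99, 99)
posterior = theta**heads * (1 - theta)**tails     # with a uniform prior, posterior is proportional to the likelihood
posterior /= posterior.sum()

def simulate_flip():
    t = rng.choice(theta, p=posterior)            # step 1: draw theta from the posterior
    return rng.random() < t                       # step 2: draw a flip given theta (True = heads)

flips = [simulate_flip() for _ in range(100_000)]
print("simulated Pr(heads):", np.mean(flips))     # close to ~0.846 from the previous sketch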
Posterior Predictive Example
• Suppose we know a coin is either:
– An unbiased coin A, with Pr(heads) = 0.5, or
– a biased coin B, with Pr(heads) = 0.8.

• We have a uniform prior, Pr(A) = Pr(B) = 0.5

• We observe X = (heads, heads, heads, tails).

• What is the probability that the next flip is heads?


Posterior Predictive: Solution
• Compute the posterior over the choice of coin using Bayes’ rule:

Pr(X | A) = 0.5^3 × 0.5 = 0.0625,   Pr(X | B) = 0.8^3 × 0.2 = 0.1024
Pr(A | X) = (0.0625 × 0.5) / (0.0625 × 0.5 + 0.1024 × 0.5) ≈ 0.379,   Pr(B | X) ≈ 0.621

• Predict the next data point using the posterior predictive distribution:

Pr(heads | X) = Pr(heads | A) Pr(A | X) + Pr(heads | B) Pr(B | X) ≈ 0.5 × 0.379 + 0.8 × 0.621 ≈ 0.69
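The same computation in a few lines of Python, as a check on the arithmetic above:

priors = {"A": 0.5, "B": 0.5}
p_heads = {"A": 0.5, "B": 0.8}

# Likelihood of X = (heads, heads, heads, tails) under each coin.
lik = {c: p_heads[c]**3 * (1 - p_heads[c]) for c in priors}

# Posterior over the choice of coin via Bayes' rule.
evidence = sum(lik[c] * priors[c] for c in priors)
posterior = {c: lik[c] * priors[c] / evidence for c in priors}

# Posterior predictive probability that the next flip is heads.
print(sum(p_heads[c] * posterior[c] for c in priors))   # ~0.686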
Announcements
• Homework 1 is due 4/14/2016

• I’ll start recording participation using Poll Everywhere starting next week, so please sign up for an account (and participate in polls!).

• Piazza will also be factored into participation grades.
