
Introduction to Bayesian Estimation

May 31, 2013

The Plan

1. What is Bayesian statistics? How is it different from frequentist methods?
2. The 4 Bayesian Principles
3. Overview of the main concepts in Bayesian analysis

Main reading: Ch. 1 in Gary Koop's Bayesian Econometrics and Christian P. Robert's The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation

Probability and statistics: What's the difference?

Probability is a branch of mathematics

- There is little disagreement about whether the theorems follow from the axioms

Statistics is an inversion problem: What is a good probabilistic description of the world, given the observed outcomes?

- There is some disagreement about how we interpret data/observations

Why probabilistic models?

Is the world characterized by randomness?

- Is the weather random?
- Is a coin flip random?
- ECB interest rates?

What is the meaning of probability, randomness and uncertainty?

Two main schools of thought:

- The classical (or frequentist) view is that probability corresponds to the frequency of occurrence in repeated experiments
- The Bayesian view is that probabilities are statements about our state of knowledge, i.e. a subjective view

The difference has implications for how we interpret estimated statistical models, and there is no general agreement about which approach is better.

Frequentist vs Bayesian statistics


Variance of estimator vs variance of parameter

- Bayesians think of parameters as having distributions, while frequentists conduct inference by thinking about the variance of an estimator

Frequentist confidence intervals:

- If the point estimates are the truth, and with repeated draws from the population of equal sample length, what is the interval that the estimate of θ lies in 95 per cent of the time?

Bayesian probability intervals:

- Conditional on the observed data, a prior distribution of θ and a functional form (i.e. model), what is the shortest interval that contains θ with 95% probability?

The main conceptual difference is that Bayesians condition on the data while frequentists design procedures that work well ex ante.

Frequentist vs Bayesian statistics


Coin flip example: After flipping a coin 10 times and counting the number of heads, what is the probability that the next flip will come up heads?

- A Bayesian with a uniform prior would say that the probability of the next flip coming up heads is (x + 1)/12, where x is the number of times the flip came up heads in the observed sample
- A frequentist cannot answer this question: a frequentist confidence interval can only be used to compute the probability of having observed the sample conditional on a given null hypothesis about the probability of heads vs tails
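The uniform-prior answer can be checked numerically. The sketch below uses hypothetical data (7 heads in 10 flips) and compares the closed-form posterior predictive probability of heads, (x + 1)/(n + 2), against a simulation that draws θ from the Beta posterior and then flips once:

```python
import random

# Hypothetical data: x = 7 heads observed in n = 10 flips
n, x = 10, 7

# With a uniform Beta(1,1) prior, the posterior is Beta(1 + x, 1 + n - x),
# and the posterior predictive probability of heads is its mean:
predictive = (x + 1) / (n + 2)  # Laplace's rule of succession

# Check by simulation: draw theta from the posterior, then flip once
random.seed(0)
S = 200_000
heads = 0
for _ in range(S):
    theta = random.betavariate(1 + x, 1 + n - x)
    heads += random.random() < theta

print(predictive, heads / S)
```

The two numbers agree to about two decimals, which is what Monte Carlo error of this sample size allows.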

Bayesian statistics

"Bayesianism has obviously come a long way. It used to be that you could tell a Bayesian by his tendency to hold meetings in isolated parts of Spain and his obsession with coherence, self-interrogations, and other manifestations of paranoia. Things have changed..."
Peter Clifford, 1993

Quote arrived here via Jesus Fernandez-Villaverde (U Penn)

Bayesian statistics

Bayesians used to be fringe types, but are now more or less the mainstream in macro

- This is largely due to increased computing power

Bayesian methods have several advantages:

- They facilitate incorporating information from outside of the sample
- Easy to compute confidence/probability intervals of functions of parameters
- Good small sample properties

The subjective view of probability

What does subjective mean?

- Coin flip: What is prob(heads) after the flip but before looking?

Important: Probabilities can be viewed as statements about our knowledge of a parameter, even if the parameter is a constant

- E.g. treating risk aversion as a random variable does not imply that we think of it as varying across time

The subjective view of probability

Is a subjective view of probability useful?

- Yes, it allows people to incorporate information from outside the sample
- Yes, because subjectivity is always there (it is just more hidden in frequentist methods)
- Yes, because one can use "objective", or non-informative, priors
- And: Sensitivity to priors can be checked

4 Principles of Bayesian Statistics

1. The Likelihood Principle
2. The Sufficiency Principle
3. The Conditionality Principle
4. The Stopping Rule Principle

These are principles that appeal to common sense.

The Likelihood Principle

All the information about a parameter θ from an experiment is contained in the likelihood function

The Likelihood Principle


The likelihood principle can be derived from two other principles:

The Sufficiency Principle

- If two samples imply the same sufficient statistic (given a model that admits such a thing), then they should lead to the same inference about θ
- Example: Two samples from a normal distribution with the same mean and variance should lead to the same inference about the parameters μ and σ²

The Conditionality Principle

- If two possible experiments on the parameter θ are available, and one of the experiments is chosen with probability p, then the inference on θ should only depend on the chosen experiment

The Stopping Rule Principle

The Stopping Rule Principle is an implication of the Likelihood Principle:

If a sequence of experiments, E1, E2, ..., is directed by a stopping rule, τ, which indicates when the experiment should stop, inference about θ should depend on τ only through the resulting sample.

The Stopping Rule Principle: Example


In an experiment to find out the fraction θ of the population that watched a TV show, a researcher found 9 viewers and 3 non-viewers.
Two possible models:
1. The researcher interviewed 12 people and thus observed x ~ B(12, θ)
2. The researcher questioned N people until he obtained 3 non-viewers, with N ~ Neg(3, 1 − θ)

The likelihood principle implies that the inference about θ should be identical if either model was used. A frequentist would draw different conclusions depending on whether he thought (1) or (2) generated the experiment.
In fact, H0: θ = 0.5 (against θ > 0.5) would be rejected at the 5% level if the experiment is (2) but not if it is (1).
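The two frequentist p-values can be computed by hand, which makes the conflict concrete. This sketch tests H0: θ = 0.5 against θ > 0.5 one-sided, under both sampling schemes:

```python
from math import comb

# One-sided test of H0: theta = 0.5 against theta > 0.5,
# given 9 viewers and 3 non-viewers.

# Model 1: binomial, n = 12 interviews fixed in advance.
# p-value = P(9 or more viewers out of 12 | theta = 0.5)
p_binom = sum(comb(12, k) for k in range(9, 13)) / 2**12

# Model 2: negative binomial, sample until 3 non-viewers.
# p-value = P(12 or more interviews needed | theta = 0.5)
#         = P(at most 2 non-viewers in the first 11 interviews)
p_negbin = sum(comb(11, k) for k in range(3)) / 2**11

print(round(p_binom, 4), round(p_negbin, 4))  # 0.073 vs 0.0327
```

The same data yield p ≈ 0.073 under (1) and p ≈ 0.033 under (2), so the null is rejected at the 5% level only under the negative binomial scheme, even though the likelihoods are proportional.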

The Stopping Rule Principle

"I learned the stopping rule principle from Professor Barnard, in conversation in the summer of 1952. Frankly, I thought it a scandal that anyone in the profession could advance an idea so patently wrong, even as today I can scarcely believe that some people resist an idea so patently right."
Savage (1962)

Quote (again) arrived here via Jesus Fernandez-Villaverde (U Penn)

The main components in Bayesian inference

Bayes Rule

Deriving Bayes Rule:

p(A, B) = p(A | B) p(B)

Symmetrically,

p(A, B) = p(B | A) p(A)

Implying

p(A | B) p(B) = p(B | A) p(A)

or

p(A | B) = p(B | A) p(A) / p(B)

which is known as Bayes Rule.
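The rule can be verified on a small discrete example. The joint distribution below is hypothetical; the point is just that inverting p(B | A) through Bayes rule gives a proper conditional distribution p(A | B):

```python
# Hypothetical two-valued example: prior p(A) and conditional p(B | A)
pA = {0: 0.7, 1: 0.3}
pB_given_A = {0: {0: 0.9, 1: 0.1},
              1: {0: 0.2, 1: 0.8}}

# p(B) by the law of total probability
pB = {b: sum(pB_given_A[a][b] * pA[a] for a in pA) for b in (0, 1)}

# Bayes rule: p(A | B) = p(B | A) p(A) / p(B)
pA_given_B = {b: {a: pB_given_A[a][b] * pA[a] / pB[b] for a in pA}
              for b in (0, 1)}

print(pA_given_B[1])  # posterior over A after observing B = 1
```

Observing B = 1 moves the probability of A = 1 from the prior 0.3 up to 0.24/0.31 ≈ 0.77, and each conditional distribution still sums to one.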

Data and parameters

The purpose of Bayesian analysis is to use the data y to learn about the parameters θ

- Parameters of a statistical model
- Or anything not directly observed

Replace A and B in Bayes rule with θ and y to get

p(θ | y) = p(y | θ) p(θ) / p(y)

The probability density p(θ | y) then describes what we know about θ, given the data.

Parameters as random variables


Since p(θ | y) is a density function, we thus treat θ as a random variable with a distribution.

- This is the subjective view of probability
- Express subjective probabilities through the prior

More importantly, the "randomness" of a parameter is subjective in the sense of depending on what information is available to the person making the statement.

The density p(θ | y) is proportional to the likelihood function times the prior:

p(θ | y) ∝ p(y | θ) × p(θ)
(posterior ∝ likelihood × prior)

But what is the prior and the likelihood function?
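The proportionality can be made operational with a simple grid approximation. The sketch below (hypothetical Bernoulli data, uniform prior) multiplies likelihood by prior pointwise and normalizes so the result integrates to one:

```python
# Bernoulli data: x successes in n trials (hypothetical numbers)
n, x = 10, 7

# Grid of theta values on (0, 1), spacing 1/500
G = 500
grid = [(i + 0.5) / G for i in range(G)]

prior = [1.0 for _ in grid]                      # uniform prior p(theta)
like = [t**x * (1 - t)**(n - x) for t in grid]   # likelihood p(y | theta)
unnorm = [l * p for l, p in zip(like, prior)]    # likelihood x prior

# Normalize so the grid approximation integrates to one
norm = sum(unnorm) / G
post = [u / norm for u in unnorm]

post_mean = sum(t * p for t, p in zip(grid, post)) / G
print(round(post_mean, 3))
```

With a uniform prior the exact posterior is Beta(x + 1, n − x + 1), whose mean (x + 1)/(n + 2) ≈ 0.667 the grid recovers closely.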

The prior density p()

Sometimes we know more about the parameters than what the data tells us, i.e. we have some prior information.
The prior density p(θ) summarizes what we know about a parameter before observing the data y.

- For a DSGE model, we may have information about deep parameters
  - The range of some parameters may be restricted by theory, e.g. risk aversion should be positive
  - The discount rate is the inverse of average real interest rates
  - Price stickiness can be measured by surveys
- We may know something about the mean of a process

Choosing priors

How influential the priors will be for the posterior is a choice:

- It is always possible to choose priors such that a given result is achieved no matter what the sample information is (i.e. dogmatic priors)
- It is also possible to choose priors such that they do not influence the posterior (i.e. so-called non-informative priors)

Important: Priors are a choice and must be motivated.

Choosing prior distributions


The beta distribution is a good choice when the parameter is in [0, 1]:

P(x) = x^(a−1) (1 − x)^(b−1) / B(a, b)

where

B(a, b) = (a − 1)! (b − 1)! / (a + b − 1)!

(for integer a, b). It is easier to parameterize using the expressions for the mean, mode and variance:

μ = a / (a + b),    x̂ = (a − 1) / (a + b − 2)

σ² = a b / [(a + b)² (a + b + 1)]
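Inverting the mean and variance formulas gives the (a, b) that match a desired prior mean and standard deviation, which is how beta priors are set in practice. A minimal sketch (the target mean 0.5 and s.d. 0.1 are illustrative numbers):

```python
def beta_from_moments(mu, sd):
    """Solve for Beta(a, b) parameters matching a given mean and s.d.

    From mu = a/(a+b) and sigma^2 = ab/((a+b)^2 (a+b+1)):
    a + b = mu (1 - mu) / sigma^2 - 1, then split by the mean.
    """
    s = mu * (1 - mu) / sd**2 - 1   # s = a + b
    return mu * s, (1 - mu) * s

a, b = beta_from_moments(0.5, 0.1)

# Verify by plugging back into the moment formulas
mean = a / (a + b)
var = a * b / ((a + b)**2 * (a + b + 1))
print(a, b, mean, var)  # a = b = 12, mean 0.5, variance 0.01
```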

Examples of beta distributions holding mean fixed

[Figure: beta densities with the same mean but different variances, over the interval 0.1 to 0.9]

Examples of beta distributions holding s.d. fixed

[Figure: beta densities with the same standard deviation but different means, over the interval 0.1 to 0.9]

Choosing prior distributions


The inverse gamma distribution is a good choice when the parameter is positive:

P(x) = [b^a / Γ(a)] (1/x)^(a+1) exp(−b/x)

where Γ(a) = (a − 1)! for integer a.
Again, easier to parameterize using the expressions for the mean, mode and variance:

μ = b / (a − 1),  a > 1;    x̂ = b / (a + 1)

σ² = b² / [(a − 1)² (a − 2)],  a > 2
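As with the beta case, the moment formulas can be inverted to pick (a, b) from a target mean and standard deviation. A small sketch, with illustrative target values:

```python
def invgamma_from_moments(mu, sd):
    """Solve for inverse gamma (a, b) matching a given mean and s.d.

    From mu = b/(a-1) and sigma^2 = b^2/((a-1)^2 (a-2)) = mu^2/(a-2):
    a = (mu/sd)^2 + 2 (which guarantees a > 2), then b = mu (a - 1).
    """
    a = (mu / sd)**2 + 2
    b = mu * (a - 1)
    return a, b

a, b = invgamma_from_moments(0.5, 0.25)

# Verify by plugging back into the moment formulas
mean = b / (a - 1)
var = b**2 / ((a - 1)**2 * (a - 2))
print(a, b, mean, var)  # a = 6, b = 2.5, mean 0.5, variance 0.0625
```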

Examples of inverse gamma distributions

[Figure: examples of inverse gamma densities, x from 0 to 10]

Conjugate Priors

Conjugate priors are particularly convenient:

- Combining distributions that are members of a conjugate family results in a new distribution that is a member of the same family

Useful, but only in so far as the priors are chosen to actually reflect prior beliefs rather than just for analytical convenience
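The beta/binomial pair is the standard example of conjugacy: a Beta(a, b) prior combined with binomial data gives a Beta posterior with the counts simply added in. A sketch with hypothetical prior and data:

```python
# Beta prior is conjugate to the binomial likelihood:
# Beta(a, b) prior + (x successes, n - x failures) -> Beta(a + x, b + n - x)
a0, b0 = 2.0, 2.0          # hypothetical prior
n, x = 10, 7               # hypothetical data

a1, b1 = a0 + x, b0 + n - x
prior_mean = a0 / (a0 + b0)
post_mean = a1 / (a1 + b1)

# The posterior mean is a weighted average of prior mean and sample mean,
# with weight on the data growing in the sample size n
w = n / (a0 + b0 + n)
blend = w * (x / n) + (1 - w) * prior_mean
print(post_mean, blend)  # identical: 9/14 both ways
```

The weighted-average form makes explicit how the prior gets dominated by the data as n grows.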

Improper Priors

Improper priors are priors that are not probability density functions in the sense that they do not integrate to 1.

- Can still be used as a form of uninformative prior for some types of analysis
- The uniform distribution U(−∞, ∞) is popular
- The mode of the posterior then coincides with the MLE

The likelihood function

The likelihood function is the probability density of observing the data y conditional on a statistical model and the parameters θ.

- For instance, if the data y conditional on θ is normally distributed, we have

p(y | θ) = (2π)^(−T/2) |Σ|^(−1/2) exp( −(1/2) (y − μ)′ Σ^(−1) (y − μ) )

where μ and Σ are either elements or functions of θ.
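For the common special case Σ = σ²I (i.i.d. observations), the normal likelihood above reduces to a simple sum, which is how it is usually coded. A minimal sketch with made-up data:

```python
import math

def gaussian_loglik(y, mu, sigma2):
    """Log-likelihood of i.i.d. normal data, i.e. Sigma = sigma2 * I.

    log p(y | theta) = -(T/2) log(2 pi sigma2)
                       - (1/(2 sigma2)) * sum_t (y_t - mu)^2
    """
    T = len(y)
    ss = sum((v - mu)**2 for v in y)
    return -0.5 * T * math.log(2 * math.pi * sigma2) - 0.5 * ss / sigma2

y = [0.2, -0.5, 1.1, 0.4]  # hypothetical data
print(gaussian_loglik(y, 0.0, 1.0))
```

Working on the log scale avoids the numerical underflow that the product form produces for long samples.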

The posterior density p( | y )

The posterior density is often the object of fundamental interest in Bayesian estimation.

- The posterior is the result.
- We are interested in the entire conditional distribution of the parameters

Model comparison

Bayesian statistics is not concerned with rejecting models, but with assigning probabilities to different models

- We can index different models by i = 1, 2, ..., m

p(θ | y, Mi) = p(y | θ, Mi) p(θ | Mi) / p(y | Mi)

Model comparison
The posterior model probability is given by

p(Mi | y) = p(y | Mi) p(Mi) / p(y)

where p(y | Mi) is called the marginal likelihood. It can be computed from

∫ p(θ | y, Mi) dθ = ∫ [ p(y | θ, Mi) p(θ | Mi) / p(y | Mi) ] dθ

by using that ∫ p(θ | y, Mi) dθ = 1, so that

p(y | Mi) = ∫ p(y | θ, Mi) p(θ | Mi) dθ

The posterior odds ratio

The posterior odds ratio is the relative probability of two models conditional on the data:

p(Mi | y) / p(Mj | y) = [ p(y | Mi) p(Mi) ] / [ p(y | Mj) p(Mj) ]

It is made up of the prior odds ratio p(Mi)/p(Mj) and the Bayes factor p(y | Mi)/p(y | Mj).
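In the beta/binomial case the marginal likelihood integral has a closed form, p(y | M) = C(n, x) B(a + x, b + n − x) / B(a, b), so Bayes factors can be computed exactly. The sketch below compares two hypothetical "models" that differ only in their prior on θ:

```python
from math import comb, lgamma, exp

def log_beta(a, b):
    """log B(a, b) via log-gamma, for numerical stability."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def marginal_lik(n, x, a, b):
    """Marginal likelihood p(y | M): binomial data, Beta(a, b) prior."""
    return comb(n, x) * exp(log_beta(a + x, b + n - x) - log_beta(a, b))

n, x = 10, 7                       # hypothetical data
mL1 = marginal_lik(n, x, 1, 1)     # M1: uniform prior on theta
mL2 = marginal_lik(n, x, 20, 20)   # M2: prior tightly centred on 0.5

bayes_factor = mL1 / mL2
post_odds = bayes_factor * 1.0     # equal prior odds p(M1)/p(M2) = 1
print(mL1, mL2, post_odds)
```

Under the uniform prior the marginal likelihood is exactly 1/(n + 1), a useful sanity check on the formula.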

Concepts of Bayesian analysis

Prior densities, likelihood functions, posterior densities, posterior odds ratios, Bayes factors etc. are part of almost all Bayesian analysis.

- Data, the choice of prior densities and the likelihood function are the inputs into the analysis
- Posterior densities etc. are the outputs

The rest of the course is about how to choose these inputs and how to compute the outputs.

Bayesian computation

More specifics about the outputs:

- Posterior mean E(θ | y) = ∫ θ p(θ | y) dθ
- Posterior variance var(θ | y) = E(θ² | y) − [E(θ | y)]²
- Posterior prob(θi > 0)

These objects can all be written in the form

E(g(θ) | y) = ∫ g(θ) p(θ | y) dθ

where g(θ) is the function of interest.

Posterior simulation
There are only a few cases when the expected value of functions of interest can be derived analytically.

- Instead, we rely on posterior simulation and Monte Carlo integration.
- Posterior simulation consists of constructing a sample from the posterior distribution p(θ | y)
- Monte Carlo integration then uses that

ĝ_S = (1/S) Σ_{s=1}^{S} g(θ^(s))

and that lim_{S→∞} ĝ_S = E(g(θ) | y), where θ^(s) is a draw from the posterior distribution.
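Monte Carlo integration is easy to demonstrate when the posterior can be sampled directly. Here the posterior is taken to be Beta(8, 4) (e.g. 7 heads and 3 tails under a uniform prior, a hypothetical example), and two choices of g(θ) are averaged over draws:

```python
import random

random.seed(1)

# Suppose the posterior p(theta | y) is Beta(8, 4)
S = 100_000
draws = [random.betavariate(8, 4) for _ in range(S)]

# Monte Carlo estimates of E(g(theta) | y) for two functions g:
post_mean = sum(draws) / S                       # g(theta) = theta
prob_gt_half = sum(d > 0.5 for d in draws) / S   # g(theta) = 1(theta > 0.5)

print(round(post_mean, 3), round(prob_gt_half, 3))
```

The first estimate converges to the exact posterior mean 8/12; the second is a posterior probability of the kind frequentist confidence intervals cannot deliver.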

The end product of Bayesian statistics


Most of Bayesian econometrics consists of simulating distributions of parameters using numerical methods.

- A simulated posterior is a numerical approximation to the distribution p(y | θ) p(θ)
- This is useful since the distribution p(y | θ) p(θ) (by Bayes rule) is proportional to p(θ | y)
- We rely on ergodicity, i.e. that the moments of the constructed sample correspond to the moments of the distribution p(θ | y)

The most popular procedure to simulate the posterior is called the Random Walk Metropolis Algorithm.
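A minimal sketch of the Random Walk Metropolis algorithm, ahead of its full treatment later in the course. It only needs the unnormalized posterior p(y | θ) p(θ); here the target is again a Beta(8, 4) posterior (an illustrative stand-in), the step size 0.2 is an arbitrary tuning choice, and the burn-in length is likewise ad hoc:

```python
import math
import random

random.seed(0)

def log_post(theta):
    """Unnormalized log posterior: Beta(8, 4) used as a stand-in target."""
    if not 0 < theta < 1:
        return -math.inf
    return 7 * math.log(theta) + 3 * math.log(1 - theta)

theta = 0.5          # initial value
step = 0.2           # random walk step size (a tuning parameter)
draws = []
for _ in range(50_000):
    prop = theta + random.gauss(0, step)          # random walk proposal
    # Accept with probability min(1, post(prop) / post(theta))
    if math.log(random.random()) < log_post(prop) - log_post(theta):
        theta = prop
    draws.append(theta)                           # keep current state either way

sample = draws[5_000:]                            # discard burn-in
print(round(sum(sample) / len(sample), 2))
```

The sample mean settles near the true posterior mean 8/12 ≈ 0.67, relying only on evaluations of the unnormalized density.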

Sampling from a distribution

Example:

- Standard Normal distribution N(0, 1)

If the distribution is ergodic, the sample shares all the properties of the true distribution asymptotically

- Why don't we just compute the moments directly?
  - When we can, we should (as in the Normal distribution's case)
  - Not always possible, either for tractability reasons or for computational burden

Sample from Standard Normal

[Figure: trace plot of 100 draws from N(0, 1)]

Sample from Standard Normal

[Figure: histogram of the draws from N(0, 1)]

Sample from Standard Normal

[Figure: trace plot of a much larger sample from N(0, 1), on the order of 10⁴ draws]

Sample from Standard Normal

[Figure: histogram of the larger sample from N(0, 1)]


What is a Markov Chain?

A process for which the distribution of next-period variables is independent of the past, once we condition on the current state

- A Markov chain is a stochastic process with the Markov property
- "Markov chain" is sometimes taken to mean only processes with a countable number of states
  - Here, the term will be used in the broader sense

Markov Chain Monte Carlo methods provide a way of generating samples that share the properties of the target density (i.e. the object of interest)

What is a Markov Chain?


Markov property
p(j+1 | j ) = p(j+1 | j , j1 , j2 , ..., 1 )
Transition density function p(j+1 | j ) describes the distribution of
j+1 conditional on j .
I

In most applications, we know the conditional transition


density and
 can figure out unconditional properties like E ()
and E 2

MCMC methods can be used to do the opposite: Determine a


particular conditional transition density such that the
unconditional distribution converges to that of the target
distribution.
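The usual direction, from a known transition density to unconditional properties, can be illustrated with a two-state chain. The transition matrix below is hypothetical; the long-run visit frequencies converge to the stationary distribution π solving π = πP, which is the ergodicity MCMC relies on:

```python
import random

random.seed(0)

# Two-state Markov chain, P[i][j] = p(next state = j | current state = i)
P = [[0.9, 0.1],
     [0.5, 0.5]]
# Stationary distribution solves pi = pi P: here pi = (5/6, 1/6)

state = 0
visits = [0, 0]
T = 200_000
for _ in range(T):
    # Draw the next state from the transition density of the current state
    state = 0 if random.random() < P[state][0] else 1
    visits[state] += 1

print(visits[0] / T, visits[1] / T)  # long-run frequencies near (0.833, 0.167)
```

MCMC runs this logic in reverse: it constructs a transition density whose stationary distribution is the posterior of interest.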

Summing up

- The subjective view of probability
- The Bayesian Principles
- Bayesians condition on the data
- Priors and likelihood functions are chosen by the researcher and must be motivated
- The posterior density is the main object of interest
- We often need to use numerical methods to approximate the posterior density

Next time: Regression model with natural conjugate priors

That's it for today.
