
Introduction to Bayesian Estimation

May 31, 2013

The Plan

1. What is Bayesian statistics? How is it different from frequentist methods?
2. The 4 Bayesian Principles
3. Overview of the main concepts in Bayesian analysis

Main reading: Ch. 1 in Gary Koop's Bayesian Econometrics and Christian P. Robert's The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation

Probability and statistics: What's the difference?

Probability is a branch of mathematics

- There is little disagreement about whether the theorems follow from the axioms

Statistics is an inversion problem: What is a good probabilistic description of the world, given the observed outcomes?

- There is some disagreement about how we interpret data/observations

Why probabilistic models?

Is the world characterized by randomness?

- Is the weather random?
- Is a coin flip random?
- ECB interest rates?

What is the meaning of probability, randomness and uncertainty?

Two main schools of thought:

- The classical (or frequentist) view is that probability corresponds to the frequency of occurrence in repeated experiments
- The Bayesian view is that probabilities are statements about our state of knowledge, i.e. a subjective view

The difference has implications for how we interpret estimated statistical models, and there is no general agreement about which approach is better.

Frequentist vs Bayesian statistics


Variance of estimator vs variance of parameter

- Bayesians think of parameters as having distributions, while frequentists conduct inference by thinking about the variance of an estimator

Frequentist confidence intervals:

- If the point estimates are the truth, and with repeated draws from the population of equal sample length, what is the interval that the estimate of θ lies in 95 per cent of the time?

Bayesian probability intervals:

- Conditional on the observed data, a prior distribution of θ and a functional form (i.e. model), what is the shortest interval that contains θ with 95% probability?

The main conceptual difference is that Bayesians condition on the data while frequentists design procedures that work well ex ante.

Frequentist vs Bayesian statistics


Coin flip example: After flipping a coin 10 times and counting the number of heads, what is the probability that the next flip will come up heads?

- A Bayesian with a uniform prior would say that the probability of the next flip coming up heads is (x + 1)/12, where x is the number of times the flip came up heads in the observed sample
- A frequentist cannot answer this question: a frequentist confidence interval can only be used to compute the probability of having observed the sample conditional on a given null hypothesis about the probability of heads vs tails
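The uniform-prior answer can be checked numerically. The sketch below uses hypothetical data (7 heads in 10 flips) and compares the closed-form posterior predictive probability of heads, (x + 1)/(n + 2), against a simulation that draws θ from the Beta posterior and then flips once:

```python
import random

# Hypothetical data: x = 7 heads observed in n = 10 flips
n, x = 10, 7

# With a uniform Beta(1,1) prior, the posterior is Beta(1 + x, 1 + n - x),
# and the posterior predictive probability of heads is its mean:
predictive = (x + 1) / (n + 2)  # Laplace's rule of succession

# Check by simulation: draw theta from the posterior, then flip once
random.seed(0)
S = 200_000
heads = 0
for _ in range(S):
    theta = random.betavariate(1 + x, 1 + n - x)
    heads += random.random() < theta

print(predictive, heads / S)
```

The two numbers agree to about two decimals, which is what Monte Carlo error of this sample size allows.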

Bayesian statistics

"Bayesianism has obviously come a long way. It used to be that you could tell a Bayesian by his tendency to hold meetings in isolated parts of Spain and his obsession with coherence, self-interrogations, and other manifestations of paranoia. Things have changed..."
Peter Clifford, 1993

Quote arrived here via Jesus Fernandez-Villaverde (U Penn)

Bayesian statistics

Bayesians used to be fringe types, but are now more or less the mainstream in macro

- This is largely due to increased computing power

Bayesian methods have several advantages:

- They facilitate incorporating information from outside of the sample
- Easy to compute confidence/probability intervals of functions of parameters
- Good small sample properties

The subjective view of probability

What does subjective mean?

- Coin flip: What is prob(heads) after the flip but before looking?

Important: Probabilities can be viewed as statements about our knowledge of a parameter, even if the parameter is a constant

- E.g. treating risk aversion as a random variable does not imply that we think of it as varying across time

The subjective view of probability

Is a subjective view of probability useful?

- Yes, it allows people to incorporate information from outside the sample
- Yes, because subjectivity is always there (it is just more hidden in frequentist methods)
- Yes, because one can use "objective", or non-informative, priors
- And: Sensitivity to priors can be checked

4 Principles of Bayesian Statistics

1. The Likelihood Principle
2. The Sufficiency Principle
3. The Conditionality Principle
4. The Stopping Rule Principle

These are principles that appeal to common sense.

The Likelihood Principle

All the information about a parameter θ from an experiment is contained in the likelihood function

The Likelihood Principle


The likelihood principle can be derived from two other principles:

The Sufficiency Principle

- If two samples imply the same sufficient statistic (given a model that admits such a thing), then they should lead to the same inference about θ
- Example: Two samples from a normal distribution with the same mean and variance should lead to the same inference about the parameters μ and σ²

The Conditionality Principle

- If two possible experiments on the parameter θ are available, and one of the experiments is chosen with probability p, then the inference on θ should only depend on the chosen experiment

The Stopping Rule Principle

The Stopping Rule Principle is an implication of the Likelihood Principle:

If a sequence of experiments, E1, E2, ..., is directed by a stopping rule, τ, which indicates when the experiment should stop, inference about θ should depend on τ only through the resulting sample.

The Stopping Rule Principle: Example


In an experiment to find out the fraction θ of the population that watched a TV show, a researcher found 9 viewers and 3 non-viewers.
Two possible models:
1. The researcher interviewed 12 people and thus observed x ~ B(12, θ)
2. The researcher questioned N people until he obtained 3 non-viewers, with N ~ Neg(3, 1 − θ)

The likelihood principle implies that the inference about θ should be identical if either model was used. A frequentist would draw different conclusions depending on whether he thought (1) or (2) generated the experiment.
In fact, H0: θ = 0.5 (against θ > 0.5) would be rejected at the 5% level if the experiment is (2) but not if it is (1).
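The two frequentist p-values can be computed by hand, which makes the conflict concrete. This sketch tests H0: θ = 0.5 against θ > 0.5 one-sided, under both sampling schemes:

```python
from math import comb

# One-sided test of H0: theta = 0.5 against theta > 0.5,
# given 9 viewers and 3 non-viewers.

# Model 1: binomial, n = 12 interviews fixed in advance.
# p-value = P(9 or more viewers out of 12 | theta = 0.5)
p_binom = sum(comb(12, k) for k in range(9, 13)) / 2**12

# Model 2: negative binomial, sample until 3 non-viewers.
# p-value = P(12 or more interviews needed | theta = 0.5)
#         = P(at most 2 non-viewers in the first 11 interviews)
p_negbin = sum(comb(11, k) for k in range(3)) / 2**11

print(round(p_binom, 4), round(p_negbin, 4))  # 0.073 vs 0.0327
```

The same data yield p ≈ 0.073 under (1) and p ≈ 0.033 under (2), so the null is rejected at the 5% level only under the negative binomial scheme, even though the likelihoods are proportional.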

The Stopping Rule Principle

"I learned the stopping rule principle from Professor Barnard, in conversation in the summer of 1952. Frankly, I thought it a scandal that anyone in the profession could advance an idea so patently wrong, even as today I can scarcely believe that some people resist an idea so patently right."
Savage (1962)

Quote (again) arrived here via Jesus Fernandez-Villaverde (U Penn)

The main components in Bayesian inference

Bayes Rule

Deriving Bayes Rule:

p(A, B) = p(A | B) p(B)

Symmetrically,

p(A, B) = p(B | A) p(A)

Implying

p(A | B) p(B) = p(B | A) p(A)

or

p(A | B) = p(B | A) p(A) / p(B)

which is known as Bayes Rule.
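The rule can be verified on a small discrete example. The joint distribution below is hypothetical; the point is just that inverting p(B | A) through Bayes rule gives a proper conditional distribution p(A | B):

```python
# Hypothetical two-valued example: prior p(A) and conditional p(B | A)
pA = {0: 0.7, 1: 0.3}
pB_given_A = {0: {0: 0.9, 1: 0.1},
              1: {0: 0.2, 1: 0.8}}

# p(B) by the law of total probability
pB = {b: sum(pB_given_A[a][b] * pA[a] for a in pA) for b in (0, 1)}

# Bayes rule: p(A | B) = p(B | A) p(A) / p(B)
pA_given_B = {b: {a: pB_given_A[a][b] * pA[a] / pB[b] for a in pA}
              for b in (0, 1)}

print(pA_given_B[1])  # posterior over A after observing B = 1
```

Observing B = 1 moves the probability of A = 1 from the prior 0.3 up to 0.24/0.31 ≈ 0.77, and each conditional distribution still sums to one.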

Data and parameters

The purpose of Bayesian analysis is to use the data y to learn about the parameters θ

- Parameters of a statistical model
- Or anything not directly observed

Replace A and B in Bayes rule with θ and y to get

p(θ | y) = p(y | θ) p(θ) / p(y)

The probability density p(θ | y) then describes what we know about θ, given the data.

Parameters as random variables


Since p(θ | y) is a density function, we thus treat θ as a random variable with a distribution.

- This is the subjective view of probability
- Express subjective probabilities through the prior

More importantly, the "randomness" of a parameter is subjective in the sense of depending on what information is available to the person making the statement.

The density p(θ | y) is proportional to the likelihood function times the prior:

p(θ | y) ∝ p(y | θ) × p(θ)
(posterior ∝ likelihood × prior)

But what is the prior and the likelihood function?
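The proportionality can be made operational with a simple grid approximation. The sketch below (hypothetical Bernoulli data, uniform prior) multiplies likelihood by prior pointwise and normalizes so the result integrates to one:

```python
# Bernoulli data: x successes in n trials (hypothetical numbers)
n, x = 10, 7

# Grid of theta values on (0, 1), spacing 1/500
G = 500
grid = [(i + 0.5) / G for i in range(G)]

prior = [1.0 for _ in grid]                      # uniform prior p(theta)
like = [t**x * (1 - t)**(n - x) for t in grid]   # likelihood p(y | theta)
unnorm = [l * p for l, p in zip(like, prior)]    # likelihood x prior

# Normalize so the grid approximation integrates to one
norm = sum(unnorm) / G
post = [u / norm for u in unnorm]

post_mean = sum(t * p for t, p in zip(grid, post)) / G
print(round(post_mean, 3))
```

With a uniform prior the exact posterior is Beta(x + 1, n − x + 1), whose mean (x + 1)/(n + 2) ≈ 0.667 the grid recovers closely.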

The prior density p()

Sometimes we know more about the parameters than what the data tells us, i.e. we have some prior information.
The prior density p(θ) summarizes what we know about a parameter before observing the data y.

- For a DSGE model, we may have information about deep parameters
  - The range of some parameters may be restricted by theory, e.g. risk aversion should be positive
  - The discount rate is the inverse of average real interest rates
  - Price stickiness can be measured by surveys
- We may know something about the mean of a process

Choosing priors

How influential the priors will be for the posterior is a choice:

- It is always possible to choose priors such that a given result is achieved no matter what the sample information is (i.e. dogmatic priors)
- It is also possible to choose priors such that they do not influence the posterior (i.e. so-called non-informative priors)

Important: Priors are a choice and must be motivated.

Choosing prior distributions


The beta distribution is a good choice when the parameter is in [0, 1]:

P(x) = x^(a−1) (1 − x)^(b−1) / B(a, b)

where

B(a, b) = (a − 1)! (b − 1)! / (a + b − 1)!

(for integer a, b). It is easier to parameterize using the expressions for the mean, mode and variance:

μ = a / (a + b),    x̂ = (a − 1) / (a + b − 2)

σ² = a b / [(a + b)² (a + b + 1)]
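Inverting the mean and variance formulas gives the (a, b) that match a desired prior mean and standard deviation, which is how beta priors are set in practice. A minimal sketch (the target mean 0.5 and s.d. 0.1 are illustrative numbers):

```python
def beta_from_moments(mu, sd):
    """Solve for Beta(a, b) parameters matching a given mean and s.d.

    From mu = a/(a+b) and sigma^2 = ab/((a+b)^2 (a+b+1)):
    a + b = mu (1 - mu) / sigma^2 - 1, then split by the mean.
    """
    s = mu * (1 - mu) / sd**2 - 1   # s = a + b
    return mu * s, (1 - mu) * s

a, b = beta_from_moments(0.5, 0.1)

# Verify by plugging back into the moment formulas
mean = a / (a + b)
var = a * b / ((a + b)**2 * (a + b + 1))
print(a, b, mean, var)  # a = b = 12, mean 0.5, variance 0.01
```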

Examples of beta distributions holding mean fixed

[Figure: beta densities with the same mean but different variances, over the interval 0.1 to 0.9]

Examples of beta distributions holding s.d. fixed

[Figure: beta densities with the same standard deviation but different means, over the interval 0.1 to 0.9]

Choosing prior distributions


The inverse gamma distribution is a good choice when the parameter is positive:

P(x) = [b^a / Γ(a)] (1/x)^(a+1) exp(−b/x)

where Γ(a) = (a − 1)! for integer a.
Again, easier to parameterize using the expressions for the mean, mode and variance:

μ = b / (a − 1),  a > 1;    x̂ = b / (a + 1)

σ² = b² / [(a − 1)² (a − 2)],  a > 2
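As with the beta case, the moment formulas can be inverted to pick (a, b) from a target mean and standard deviation. A small sketch, with illustrative target values:

```python
def invgamma_from_moments(mu, sd):
    """Solve for inverse gamma (a, b) matching a given mean and s.d.

    From mu = b/(a-1) and sigma^2 = b^2/((a-1)^2 (a-2)) = mu^2/(a-2):
    a = (mu/sd)^2 + 2 (which guarantees a > 2), then b = mu (a - 1).
    """
    a = (mu / sd)**2 + 2
    b = mu * (a - 1)
    return a, b

a, b = invgamma_from_moments(0.5, 0.25)

# Verify by plugging back into the moment formulas
mean = b / (a - 1)
var = b**2 / ((a - 1)**2 * (a - 2))
print(a, b, mean, var)  # a = 6, b = 2.5, mean 0.5, variance 0.0625
```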

Examples of inverse gamma distributions

[Figure: examples of inverse gamma densities, x from 0 to 10]

Conjugate Priors

Conjugate priors are particularly convenient:

- Combining distributions that are members of a conjugate family results in a new distribution that is a member of the same family

Useful, but only in so far as the priors are chosen to actually reflect prior beliefs rather than just for analytical convenience
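The beta/binomial pair is the standard example of conjugacy: a Beta(a, b) prior combined with binomial data gives a Beta posterior with the counts simply added in. A sketch with hypothetical prior and data:

```python
# Beta prior is conjugate to the binomial likelihood:
# Beta(a, b) prior + (x successes, n - x failures) -> Beta(a + x, b + n - x)
a0, b0 = 2.0, 2.0          # hypothetical prior
n, x = 10, 7               # hypothetical data

a1, b1 = a0 + x, b0 + n - x
prior_mean = a0 / (a0 + b0)
post_mean = a1 / (a1 + b1)

# The posterior mean is a weighted average of prior mean and sample mean,
# with weight on the data growing in the sample size n
w = n / (a0 + b0 + n)
blend = w * (x / n) + (1 - w) * prior_mean
print(post_mean, blend)  # identical: 9/14 both ways
```

The weighted-average form makes explicit how the prior gets dominated by the data as n grows.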

Improper Priors

Improper priors are priors that are not probability density functions in the sense that they do not integrate to 1.

- Can still be used as a form of uninformative prior for some types of analysis
- The uniform distribution U(−∞, ∞) is popular
- The mode of the posterior then coincides with the MLE

The likelihood function

The likelihood function is the probability density of observing the data y conditional on a statistical model and the parameters θ.

- For instance, if the data y conditional on θ is normally distributed, we have

p(y | θ) = (2π)^(−T/2) |Σ|^(−1/2) exp( −(1/2) (y − μ)′ Σ^(−1) (y − μ) )

where μ and Σ are either elements or functions of θ.
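For the common special case Σ = σ²I (i.i.d. observations), the normal likelihood above reduces to a simple sum, which is how it is usually coded. A minimal sketch with made-up data:

```python
import math

def gaussian_loglik(y, mu, sigma2):
    """Log-likelihood of i.i.d. normal data, i.e. Sigma = sigma2 * I.

    log p(y | theta) = -(T/2) log(2 pi sigma2)
                       - (1/(2 sigma2)) * sum_t (y_t - mu)^2
    """
    T = len(y)
    ss = sum((v - mu)**2 for v in y)
    return -0.5 * T * math.log(2 * math.pi * sigma2) - 0.5 * ss / sigma2

y = [0.2, -0.5, 1.1, 0.4]  # hypothetical data
print(gaussian_loglik(y, 0.0, 1.0))
```

Working on the log scale avoids the numerical underflow that the product form produces for long samples.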

The posterior density p( | y )

The posterior density is often the object of fundamental interest in Bayesian estimation.

- The posterior is the result.
- We are interested in the entire conditional distribution of the parameters

Model comparison

Bayesian statistics is not concerned with rejecting models, but with assigning probabilities to different models

- We can index different models by i = 1, 2, ..., m

p(θ | y, Mi) = p(y | θ, Mi) p(θ | Mi) / p(y | Mi)

Model comparison
The posterior model probability is given by

p(Mi | y) = p(y | Mi) p(Mi) / p(y)

where p(y | Mi) is called the marginal likelihood. It can be computed from

∫ p(θ | y, Mi) dθ = ∫ [ p(y | θ, Mi) p(θ | Mi) / p(y | Mi) ] dθ

by using that ∫ p(θ | y, Mi) dθ = 1, so that

p(y | Mi) = ∫ p(y | θ, Mi) p(θ | Mi) dθ

The posterior odds ratio

The posterior odds ratio is the relative probability of two models conditional on the data:

p(Mi | y) / p(Mj | y) = [ p(y | Mi) p(Mi) ] / [ p(y | Mj) p(Mj) ]

It is made up of the prior odds ratio p(Mi)/p(Mj) and the Bayes factor p(y | Mi)/p(y | Mj).
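In the beta/binomial case the marginal likelihood integral has a closed form, p(y | M) = C(n, x) B(a + x, b + n − x) / B(a, b), so Bayes factors can be computed exactly. The sketch below compares two hypothetical "models" that differ only in their prior on θ:

```python
from math import comb, lgamma, exp

def log_beta(a, b):
    """log B(a, b) via log-gamma, for numerical stability."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def marginal_lik(n, x, a, b):
    """Marginal likelihood p(y | M): binomial data, Beta(a, b) prior."""
    return comb(n, x) * exp(log_beta(a + x, b + n - x) - log_beta(a, b))

n, x = 10, 7                       # hypothetical data
mL1 = marginal_lik(n, x, 1, 1)     # M1: uniform prior on theta
mL2 = marginal_lik(n, x, 20, 20)   # M2: prior tightly centred on 0.5

bayes_factor = mL1 / mL2
post_odds = bayes_factor * 1.0     # equal prior odds p(M1)/p(M2) = 1
print(mL1, mL2, post_odds)
```

Under the uniform prior the marginal likelihood is exactly 1/(n + 1), a useful sanity check on the formula.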

Concepts of Bayesian analysis

Prior densities, likelihood functions, posterior densities, posterior odds ratios, Bayes factors etc. are part of almost all Bayesian analysis.

- Data, the choice of prior densities and the likelihood function are the inputs into the analysis
- Posterior densities etc. are the outputs

The rest of the course is about how to choose these inputs and how to compute the outputs.

Bayesian computation

More specifics about the outputs:

- Posterior mean E(θ | y) = ∫ θ p(θ | y) dθ
- Posterior variance var(θ | y) = E(θ² | y) − [E(θ | y)]²
- Posterior prob(θi > 0)

These objects can all be written in the form

E(g(θ) | y) = ∫ g(θ) p(θ | y) dθ

where g(θ) is the function of interest.

Posterior simulation
There are only a few cases when the expected value of functions of interest can be derived analytically.

- Instead, we rely on posterior simulation and Monte Carlo integration.
- Posterior simulation consists of constructing a sample from the posterior distribution p(θ | y)
- Monte Carlo integration then uses that

ĝ_S = (1/S) Σ_{s=1}^{S} g(θ^(s))

and that lim_{S→∞} ĝ_S = E(g(θ) | y), where θ^(s) is a draw from the posterior distribution.
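Monte Carlo integration is easy to demonstrate when the posterior can be sampled directly. Here the posterior is taken to be Beta(8, 4) (e.g. 7 heads and 3 tails under a uniform prior, a hypothetical example), and two choices of g(θ) are averaged over draws:

```python
import random

random.seed(1)

# Suppose the posterior p(theta | y) is Beta(8, 4)
S = 100_000
draws = [random.betavariate(8, 4) for _ in range(S)]

# Monte Carlo estimates of E(g(theta) | y) for two functions g:
post_mean = sum(draws) / S                       # g(theta) = theta
prob_gt_half = sum(d > 0.5 for d in draws) / S   # g(theta) = 1(theta > 0.5)

print(round(post_mean, 3), round(prob_gt_half, 3))
```

The first estimate converges to the exact posterior mean 8/12; the second is a posterior probability of the kind frequentist confidence intervals cannot deliver.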

The end product of Bayesian statistics


Most of Bayesian econometrics consists of simulating distributions of parameters using numerical methods.

- A simulated posterior is a numerical approximation to the distribution p(y | θ) p(θ)
- This is useful since the distribution p(y | θ) p(θ) (by Bayes rule) is proportional to p(θ | y)
- We rely on ergodicity, i.e. that the moments of the constructed sample correspond to the moments of the distribution p(θ | y)

The most popular procedure to simulate the posterior is called the Random Walk Metropolis Algorithm.
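A minimal sketch of the Random Walk Metropolis algorithm, ahead of its full treatment later in the course. It only needs the unnormalized posterior p(y | θ) p(θ); here the target is again a Beta(8, 4) posterior (an illustrative stand-in), the step size 0.2 is an arbitrary tuning choice, and the burn-in length is likewise ad hoc:

```python
import math
import random

random.seed(0)

def log_post(theta):
    """Unnormalized log posterior: Beta(8, 4) used as a stand-in target."""
    if not 0 < theta < 1:
        return -math.inf
    return 7 * math.log(theta) + 3 * math.log(1 - theta)

theta = 0.5          # initial value
step = 0.2           # random walk step size (a tuning parameter)
draws = []
for _ in range(50_000):
    prop = theta + random.gauss(0, step)          # random walk proposal
    # Accept with probability min(1, post(prop) / post(theta))
    if math.log(random.random()) < log_post(prop) - log_post(theta):
        theta = prop
    draws.append(theta)                           # keep current state either way

sample = draws[5_000:]                            # discard burn-in
print(round(sum(sample) / len(sample), 2))
```

The sample mean settles near the true posterior mean 8/12 ≈ 0.67, relying only on evaluations of the unnormalized density.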

Sampling from a distribution

Example:

- Standard Normal distribution N(0, 1)

If the distribution is ergodic, the sample shares all the properties of the true distribution asymptotically

- Why don't we just compute the moments directly?
  - When we can, we should (as in the Normal distribution's case)
  - Not always possible, either for tractability reasons or for computational burden

Sample from Standard Normal

[Figure: trace plot of 100 draws from N(0, 1)]

Sample from Standard Normal

[Figure: histogram of the draws from N(0, 1)]

Sample from Standard Normal

[Figure: trace plot of a much larger sample from N(0, 1), on the order of 10⁴ draws]

Sample from Standard Normal

[Figure: histogram of the larger sample from N(0, 1)]


What is a Markov Chain?

A process for which the distribution of next-period variables is independent of the past, once we condition on the current state

- A Markov chain is a stochastic process with the Markov property
- "Markov chain" is sometimes taken to mean only processes with a countable number of states
  - Here, the term will be used in the broader sense

Markov Chain Monte Carlo methods provide a way of generating samples that share the properties of the target density (i.e. the object of interest)

What is a Markov Chain?


Markov property
p(j+1 | j ) = p(j+1 | j , j1 , j2 , ..., 1 )
Transition density function p(j+1 | j ) describes the distribution of
j+1 conditional on j .
I

In most applications, we know the conditional transition


density and
 can figure out unconditional properties like E ()
and E 2

MCMC methods can be used to do the opposite: Determine a


particular conditional transition density such that the
unconditional distribution converges to that of the target
distribution.
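The usual direction, from a known transition density to unconditional properties, can be illustrated with a two-state chain. The transition matrix below is hypothetical; the long-run visit frequencies converge to the stationary distribution π solving π = πP, which is the ergodicity MCMC relies on:

```python
import random

random.seed(0)

# Two-state Markov chain, P[i][j] = p(next state = j | current state = i)
P = [[0.9, 0.1],
     [0.5, 0.5]]
# Stationary distribution solves pi = pi P: here pi = (5/6, 1/6)

state = 0
visits = [0, 0]
T = 200_000
for _ in range(T):
    # Draw the next state from the transition density of the current state
    state = 0 if random.random() < P[state][0] else 1
    visits[state] += 1

print(visits[0] / T, visits[1] / T)  # long-run frequencies near (0.833, 0.167)
```

MCMC runs this logic in reverse: it constructs a transition density whose stationary distribution is the posterior of interest.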

Summing up

- The subjective view of probability
- The Bayesian Principles
- Bayesians condition on the data
- Priors and likelihood functions are chosen by the researcher and must be motivated
- The posterior density is the main object of interest
- We often need to use numerical methods to approximate the posterior density

Next time: Regression model with natural conjugate priors

That's it for today.
