Contents

6 Conjugate families
   6.1 Binomial - beta prior
       6.1.1 Uninformative priors
   6.2 Gaussian (unknown mean, known variance)
       6.2.1 Uninformative prior
   6.3 Gaussian (known mean, unknown variance)
       6.3.1 Uninformative prior
   6.4 Gaussian (unknown mean, unknown variance)
       6.4.1 Completing the square
       6.4.2 Marginal posterior distributions
       6.4.3 Uninformative priors
   6.5 Multivariate Gaussian (unknown mean, known variance)
       6.5.1 Completing the square
       6.5.2 Uninformative priors
   6.6 Multivariate Gaussian (unknown mean, unknown variance)
       6.6.1 Completing the square
       6.6.2 Inverted-Wishart kernel
       6.6.3 Marginal posterior distributions
       6.6.4 Uninformative priors
   6.7 Bayesian linear regression
       6.7.1 Known variance
       6.7.2 Unknown variance
       6.7.3 Uninformative priors
   6.8 Bayesian linear regression with general error structure
6 Conjugate families
Conjugate families arise when the likelihood times the prior produces a recognizable posterior kernel
\[ p(\theta \mid y) \propto \ell(\theta \mid y)\, p(\theta) \]
where the kernel is the characteristic part of the distribution function that depends on the random variable(s) (the part excluding any normalizing constants). For example, the density function for a univariate Gaussian or normal is
\[ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2\sigma^2}(x-\mu)^2\right] \]
and its kernel (for \sigma known) is
\[ \exp\left[-\frac{1}{2\sigma^2}(x-\mu)^2\right] \]
as \frac{1}{\sqrt{2\pi\sigma^2}} is a normalizing constant. Now, we discuss a few common conjugate family results^1 and uninformative prior results to connect with classical results.

^1 A more complete set of conjugate families is summarized in chapter 7 of Accounting and Causal Effects: Econometric Challenges as well as tabulated in an appendix at the end of this chapter.
6.1 Binomial - beta prior

A binomial likelihood for s successes in n draws
\[ \ell(\theta \mid s; n) = \binom{n}{s} \theta^s (1-\theta)^{n-s}, \qquad s = \sum_{i=1}^n y_i, \quad y_i = \{0, 1\} \]
combines with a beta(a, b) prior
\[ p(\theta) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\, \theta^{a-1} (1-\theta)^{b-1} \]
to yield
\[ p(\theta \mid y) \propto \theta^s (1-\theta)^{n-s}\, \theta^{a-1} (1-\theta)^{b-1} = \theta^{s+a-1} (1-\theta)^{n-s+b-1} \]
the kernel of a beta(s + a, n - s + b) distribution.

6.1.1 Uninformative priors

A uniform prior for \theta, p(\theta) \propto 1 (that is, beta(\theta; 1, 1)), may serve as the uninformative prior^2 and yields a beta(s + 1, n - s + 1) posterior whose kernel matches the likelihood.

^2 Some would utilize Jeffreys prior, p(\theta) \propto beta(\theta; 1/2, 1/2), which is invariant to transformation, as the uninformative prior.
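As a quick numerical check, the update can be computed with scipy. This is a minimal sketch in which the data y and the hyperparameters a and b are hypothetical:

```python
import numpy as np
from scipy import stats

# hypothetical Bernoulli draws and beta(a, b) hyperparameters
y = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
a, b = 2.0, 2.0
n, s = len(y), int(y.sum())

# conjugate update: beta(a + s, b + n - s)
posterior = stats.beta(a + s, b + n - s)
print(posterior.mean())          # posterior mean, (a + s) / (a + b + n)
print(posterior.interval(0.95))  # central 95% credible interval
```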
6.2 Gaussian (unknown mean, known variance)

A Gaussian or normal likelihood for a single draw y with unknown mean \mu and known variance \sigma^2
\[ \ell(\mu \mid y, \sigma^2) \propto \exp\left[-\frac{1}{2}\frac{(y-\mu)^2}{\sigma^2}\right] \]
combines with a Gaussian or normal prior for \mu given \sigma^2 with prior mean \mu_0 and prior variance \sigma_0^2
\[ p(\mu \mid \sigma^2; \mu_0, \sigma_0^2) \propto \exp\left[-\frac{1}{2}\frac{(\mu-\mu_0)^2}{\sigma_0^2}\right] \]
or, writing \sigma_0^2 \equiv \sigma^2/\kappa_0, we have
\[ p(\mu \mid \sigma^2; \mu_0, \sigma^2/\kappa_0) \propto \exp\left[-\frac{1}{2}\frac{\kappa_0(\mu-\mu_0)^2}{\sigma^2}\right] \]
to yield
\[ p(\mu \mid y, \sigma^2, \mu_0, \sigma^2/\kappa_0) \propto \exp\left[-\frac{1}{2}\left\{\frac{(y-\mu)^2}{\sigma^2} + \frac{\kappa_0(\mu-\mu_0)^2}{\sigma^2}\right\}\right] \]
\[ \propto \exp\left[-\frac{1}{2\sigma^2}\left\{y^2 + \kappa_0\mu_0^2 - 2\mu(\kappa_0\mu_0 + y) + \mu^2 + \kappa_0\mu^2\right\}\right] \]
Any terms not involving \mu are constants and can be discarded as they are absorbed on normalization of the posterior
\[ p(\mu \mid y, \sigma^2, \mu_0, \sigma^2/\kappa_0) \propto \exp\left[-\frac{1}{2\sigma^2}\left\{(\kappa_0+1)\mu^2 - 2\mu(\kappa_0\mu_0 + y)\right\}\right] \]
Completing the square (add and subtract \frac{(\kappa_0\mu_0+y)^2}{\kappa_0+1}), dropping the term subtracted (as it's all constants), and factoring out (\kappa_0+1) gives
\[ p(\mu \mid y, \sigma^2, \mu_0, \sigma^2/\kappa_0) \propto \exp\left[-\frac{\kappa_0+1}{2\sigma^2}\left(\mu - \frac{\kappa_0\mu_0+y}{\kappa_0+1}\right)^2\right] \]
Finally, we have
\[ p(\mu \mid y, \sigma^2, \mu_0, \sigma^2/\kappa_0) \propto \exp\left[-\frac{1}{2}\frac{(\mu-\mu_1)^2}{\tau_1^2}\right] \]
where
\[ \mu_1 = \frac{\kappa_0\mu_0 + y}{\kappa_0+1} = \frac{\frac{\kappa_0}{\sigma^2}\mu_0 + \frac{1}{\sigma^2}y}{\frac{\kappa_0}{\sigma^2} + \frac{1}{\sigma^2}} \qquad \text{and} \qquad \tau_1^2 = \frac{\sigma^2}{\kappa_0+1} = \frac{1}{\frac{\kappa_0}{\sigma^2} + \frac{1}{\sigma^2}} \]
or the posterior distribution of the mean given the data and priors is Gaussian or normal. Notice, the posterior mean, \mu_1, weights the data and prior beliefs by their relative precisions.

For a sample of n exchangeable draws, the likelihood is
\[ \ell(\mu \mid y, \sigma^2) \propto \prod_{i=1}^n \exp\left[-\frac{1}{2}\frac{(y_i-\mu)^2}{\sigma^2}\right] \]
combined with the above prior yields
\[ p(\mu \mid y, \sigma^2, \mu_0, \sigma^2/\kappa_0) \propto \exp\left[-\frac{1}{2}\frac{(\mu-\mu_n)^2}{\tau_n^2}\right] \]
where
\[ \mu_n = \frac{\kappa_0\mu_0 + n\bar{y}}{\kappa_0+n} = \frac{\frac{\kappa_0}{\sigma^2}\mu_0 + \frac{n}{\sigma^2}\bar{y}}{\frac{\kappa_0}{\sigma^2} + \frac{n}{\sigma^2}} \qquad \text{and} \qquad \tau_n^2 = \frac{\sigma^2}{\kappa_0+n} = \frac{1}{\frac{\kappa_0}{\sigma^2} + \frac{n}{\sigma^2}} \]
That is, the posterior distribution of the mean given the data and priors is again Gaussian or normal and the posterior mean, \mu_n, weights the data and priors by their relative precisions.
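A small sketch of this precision-weighted update (the data, mu0, kappa0, and sigma2 below are hypothetical):

```python
import numpy as np

def normal_known_var_posterior(y, sigma2, mu0, kappa0):
    """Posterior N(mu_n, sigma2 / (kappa0 + n)) for the mean given
    known variance sigma2 and prior N(mu0, sigma2 / kappa0)."""
    n, ybar = len(y), np.mean(y)
    mu_n = (kappa0 * mu0 + n * ybar) / (kappa0 + n)   # precision-weighted mean
    tau2_n = sigma2 / (kappa0 + n)                    # posterior variance
    return mu_n, tau2_n

# hypothetical data
rng = np.random.default_rng(0)
y = rng.normal(1.0, 2.0, size=50)
print(normal_known_var_posterior(y, sigma2=4.0, mu0=0.0, kappa0=1.0))
```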
6.2.1 Uninformative prior

An uninformative prior for \mu is uniform, p(\mu) \propto 1, so the likelihood alone
\[ \ell(\mu \mid y, \sigma^2) \propto \prod_{i=1}^n \exp\left[-\frac{1}{2}\frac{(y_i-\mu)^2}{\sigma^2}\right] \]
\[ \propto \exp\left[-\frac{1}{2\sigma^2}\left\{\sum_{i=1}^n y_i^2 - 2n\mu\bar{y} + n\mu^2\right\}\right] \]
\[ \propto \exp\left[-\frac{1}{2\sigma^2}\left\{\sum_{i=1}^n y_i^2 - n\bar{y}^2 + n(\mu-\bar{y})^2\right\}\right] \]
\[ \propto \exp\left[-\frac{1}{2\sigma^2}\, n(\mu-\bar{y})^2\right] \]
determines the posterior
\[ p(\mu \mid \sigma^2, y) \propto \exp\left[-\frac{n(\mu-\bar{y})^2}{2\sigma^2}\right] \]
which is the kernel for a Gaussian or normal N(\mu \mid \sigma^2, y; \bar{y}, \frac{\sigma^2}{n}), the classical result.
6.3 Gaussian (known mean, unknown variance)

A Gaussian likelihood with known mean \mu and unknown variance \sigma^2
\[ \ell(\sigma^2 \mid y, \mu) \propto \prod_{i=1}^n (\sigma^2)^{-\frac{1}{2}} \exp\left[-\frac{1}{2}\frac{(y_i-\mu)^2}{\sigma^2}\right] \]
combines with an inverted-gamma(a, b) prior
\[ p(\sigma^2; a, b) \propto (\sigma^2)^{-(a+1)} \exp\left[-\frac{b}{\sigma^2}\right] \]
to yield an inverted-gamma\left(\frac{n+2a}{2}, b + \frac{1}{2}t\right) posterior distribution where
\[ t = \sum_{i=1}^n (y_i-\mu)^2 \]
Alternatively, we can employ a scaled, inverted chi-square(\nu_0, \sigma_0^2) prior^3
\[ p(\sigma^2; \nu_0, \sigma_0^2) \propto (\sigma^2)^{-\left(\frac{\nu_0}{2}+1\right)} \exp\left[-\frac{\nu_0\sigma_0^2}{2\sigma^2}\right] \]
and combine with the above likelihood to yield
\[ p(\sigma^2 \mid y) \propto (\sigma^2)^{-\left(\frac{n+\nu_0}{2}+1\right)} \exp\left[-\frac{\nu_0\sigma_0^2 + t}{2\sigma^2}\right] \]
an inverted chi-square\left(\nu_0 + n, \frac{\nu_0\sigma_0^2 + t}{\nu_0+n}\right).

^3 \nu_0\sigma_0^2/X, where X is a chi-square(\nu_0) random variable, is a scaled, inverted-chi square(\nu_0) random variable.

6.3.1 Uninformative prior

An uninformative prior for the variance, p(\sigma^2) \propto (\sigma^2)^{-1}, combines with the likelihood to yield
\[ p(\sigma^2 \mid y) \propto (\sigma^2)^{-\left(\frac{n}{2}+1\right)} \exp\left[-\frac{t}{2\sigma^2}\right] \]
an inverted chi-square\left(n, \frac{t}{n}\right).
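A sketch of the inverted-gamma update using scipy's invgamma, whose scale argument matches the b + t/2 term in the kernel above; the data and hyperparameters are hypothetical:

```python
import numpy as np
from scipy import stats

# hypothetical data with known mean mu
rng = np.random.default_rng(1)
mu = 0.0
y = rng.normal(mu, 1.5, size=40)
n, t = len(y), np.sum((y - mu) ** 2)

# inverted-gamma(a, b) prior -> inverted-gamma((n + 2a)/2, b + t/2) posterior
a, b = 3.0, 2.0
post = stats.invgamma((n + 2 * a) / 2, scale=b + t / 2)
print(post.mean())  # posterior mean of sigma^2
```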
6.4 Gaussian (unknown mean, unknown variance)

With both the mean and variance unknown, the Gaussian likelihood is
\[ \ell(\mu, \sigma^2 \mid y) \propto \prod_{i=1}^n (\sigma^2)^{-\frac{1}{2}} \exp\left[-\frac{1}{2}\frac{(y_i-\mu)^2}{\sigma^2}\right] \]
Expanding and rewriting the likelihood gives
\[ \ell(\mu, \sigma^2 \mid y) \propto (\sigma^2)^{-\frac{n}{2}} \exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^n \left(y_i^2 - 2y_i\mu + \mu^2\right)\right] \]
Adding and subtracting \bar{y} inside the square,
\[ \ell(\mu, \sigma^2 \mid y) \propto (\sigma^2)^{-\frac{n}{2}} \exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^n \left(y_i^2 - 2y_i\bar{y} + \bar{y}^2 + \bar{y}^2 - 2\bar{y}\mu + \mu^2\right)\right] \]
or
\[ \ell(\mu, \sigma^2 \mid y) \propto (\sigma^2)^{-\frac{n}{2}} \exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^n \left\{(y_i-\bar{y})^2 + (\bar{y}-\mu)^2\right\}\right] \]
\[ \propto (\sigma^2)^{-\frac{n}{2}} \exp\left[-\frac{1}{2\sigma^2}\left\{(n-1)s^2 + n(\bar{y}-\mu)^2\right\}\right] \]
where s^2 = \frac{1}{n-1}\sum_{i=1}^n (y_i-\bar{y})^2. The above likelihood combines with a Gaussian or normal\left(\mu \mid \sigma^2; \mu_0, \sigma^2/\kappa_0\right)-inverted chi-square\left(\sigma^2; \nu_0, \sigma_0^2\right) prior
\[ p\left(\mu \mid \sigma^2; \mu_0, \sigma^2/\kappa_0\right) p\left(\sigma^2; \nu_0, \sigma_0^2\right) \propto (\sigma^2)^{-\frac{1}{2}} \exp\left[-\frac{\kappa_0(\mu-\mu_0)^2}{2\sigma^2}\right] (\sigma^2)^{-\left(\frac{\nu_0}{2}+1\right)} \exp\left[-\frac{\nu_0\sigma_0^2}{2\sigma^2}\right] \]
\[ = (\sigma^2)^{-\left(\frac{\nu_0+3}{2}\right)} \exp\left[-\frac{\nu_0\sigma_0^2 + \kappa_0(\mu-\mu_0)^2}{2\sigma^2}\right] \]
to yield the joint posterior
\[ p\left(\mu, \sigma^2 \mid y; \mu_0, \sigma^2/\kappa_0, \nu_0, \sigma_0^2\right) \propto (\sigma^2)^{-\left(\frac{n+\nu_0+3}{2}\right)} \exp\left[-\frac{1}{2\sigma^2}\left\{\nu_0\sigma_0^2 + (n-1)s^2 + \kappa_0(\mu-\mu_0)^2 + n(\mu-\bar{y})^2\right\}\right] \]
with
\[ \kappa_n = \kappa_0 + n \]
\[ \nu_n = \nu_0 + n \]
\[ \nu_n\sigma_n^2 = \nu_0\sigma_0^2 + (n-1)s^2 + \frac{\kappa_0 n}{\kappa_0+n}(\mu_0-\bar{y})^2 \]

6.4.1 Completing the square

The expression for the joint posterior is written by completing the square. Completing the weighted square for \mu centered around
\[ \mu_n = \frac{1}{\kappa_0+n}(\kappa_0\mu_0 + n\bar{y}), \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^n y_i \]
gives
\[ (\kappa_0+n)(\mu-\mu_n)^2 = (\kappa_0+n)\mu^2 - 2(\kappa_0+n)\mu\mu_n + (\kappa_0+n)\mu_n^2 = (\kappa_0+n)\mu^2 - 2\mu(\kappa_0\mu_0+n\bar{y}) + (\kappa_0+n)\mu_n^2 \]
while expanding the exponent includes the square plus additional terms as follows
\[ \kappa_0(\mu-\mu_0)^2 + n(\mu-\bar{y})^2 = \kappa_0\mu^2 - 2\kappa_0\mu_0\mu + \kappa_0\mu_0^2 + n\mu^2 - 2n\bar{y}\mu + n\bar{y}^2 \]
\[ = (\kappa_0+n)\mu^2 - 2\mu(\kappa_0\mu_0+n\bar{y}) + \kappa_0\mu_0^2 + n\bar{y}^2 \]
Hence,
\[ \kappa_0(\mu-\mu_0)^2 + n(\mu-\bar{y})^2 = (\kappa_0+n)(\mu-\mu_n)^2 - \frac{1}{\kappa_0+n}(\kappa_0\mu_0+n\bar{y})^2 + \kappa_0\mu_0^2 + n\bar{y}^2 \]
Expand and simplify the last terms:
\[ \kappa_0\mu_0^2 + n\bar{y}^2 - \frac{(\kappa_0\mu_0+n\bar{y})^2}{\kappa_0+n} = \frac{\kappa_0 n}{\kappa_0+n}(\mu_0-\bar{y})^2 \]
so that
\[ \kappa_0(\mu-\mu_0)^2 + n(\mu-\bar{y})^2 = (\kappa_0+n)(\mu-\mu_n)^2 + \frac{\kappa_0 n}{\kappa_0+n}(\mu_0-\bar{y})^2 \]
Therefore, the joint posterior is
\[ p\left(\mu, \sigma^2 \mid y; \mu_0, \sigma^2/\kappa_0, \nu_0, \sigma_0^2\right) \propto (\sigma^2)^{-\left(\frac{n+\nu_0+3}{2}\right)} \exp\left[-\frac{1}{2\sigma^2}\left\{\nu_0\sigma_0^2 + (n-1)s^2 + \frac{\kappa_0 n}{\kappa_0+n}(\mu_0-\bar{y})^2 + (\kappa_0+n)(\mu-\mu_n)^2\right\}\right] \]
or, factoring,
\[ p\left(\mu, \sigma^2 \mid y; \mu_0, \sigma^2/\kappa_0, \nu_0, \sigma_0^2\right) \propto (\sigma^2)^{-\left(\frac{\nu_n}{2}+1\right)} \exp\left[-\frac{\nu_n\sigma_n^2}{2\sigma^2}\right] (\sigma^2)^{-\frac{1}{2}} \exp\left[-\frac{(\kappa_0+n)(\mu-\mu_n)^2}{2\sigma^2}\right] \]
so the conditional posterior for the mean is Gaussian or normal\left(\mu \mid \sigma^2; \mu_n, \frac{\sigma^2}{\kappa_0+n}\right).
6.4.2 Marginal posterior distributions

The marginal posterior for the mean, \mu, on integrating out \sigma^2, is a noncentral, scaled Student t\left(\mu; \mu_n, \frac{\sigma_n^2}{\kappa_n}, \nu_n\right)^4
\[ p\left(\mu; \mu_n, \tfrac{\sigma_n^2}{\kappa_n}, \nu_n\right) \propto \left[\nu_n + \frac{\kappa_n(\mu-\mu_n)^2}{\sigma_n^2}\right]^{-\frac{\nu_n+1}{2}} \]
or
\[ p\left(\mu; \mu_n, \tfrac{\sigma_n^2}{\kappa_n}, \nu_n\right) \propto \left[1 + \frac{\kappa_n(\mu-\mu_n)^2}{\nu_n\sigma_n^2}\right]^{-\frac{\nu_n+1}{2}} \]
and the marginal posterior for the variance is a scaled, inverted chi-square
\[ p\left(\sigma^2; \nu_n, \sigma_n^2\right) \propto (\sigma^2)^{-\left(\frac{\nu_n}{2}+1\right)} \exp\left[-\frac{\nu_n\sigma_n^2}{2\sigma^2}\right] \]
Derivation of the marginal posterior for the mean, \mu, is as follows. Let z = \frac{A}{2\sigma^2} where
\[ A = \nu_0\sigma_0^2 + (n-1)s^2 + \frac{\kappa_0 n}{\kappa_0+n}(\mu_0-\bar{y})^2 + (\kappa_0+n)(\mu-\mu_n)^2 = \nu_n\sigma_n^2 + (\kappa_0+n)(\mu-\mu_n)^2 \]
The marginal posterior for the mean, \mu, integrates \sigma^2 out of the joint posterior
\[ p(\mu \mid y) = \int_0^\infty p\left(\mu, \sigma^2 \mid y\right) d\sigma^2 \propto \int_0^\infty (\sigma^2)^{-\frac{n+\nu_0+3}{2}} \exp\left[-\frac{A}{2\sigma^2}\right] d\sigma^2 \]
Utilizing \sigma^2 = \frac{A}{2z} (so that d\sigma^2 = -\frac{A}{2z^2}\, dz),
\[ p(\mu \mid y) \propto \int_0^\infty \left(\frac{A}{2z}\right)^{-\frac{n+\nu_0+3}{2}} \exp[-z]\, \frac{A}{2z^2}\, dz \propto A^{-\frac{n+\nu_0+1}{2}} \int_0^\infty z^{\frac{n+\nu_0+1}{2}-1} \exp[-z]\, dz \]
The integral \int_0^\infty z^{\frac{n+\nu_0+1}{2}-1} \exp[-z]\, dz is a constant since it is the kernel of a gamma density and therefore can be ignored when deriving the kernel of the marginal posterior for the mean
\[ p(\mu \mid y) \propto A^{-\frac{n+\nu_0+1}{2}} \propto \left[\nu_n\sigma_n^2 + (\kappa_0+n)(\mu-\mu_n)^2\right]^{-\frac{n+\nu_0+1}{2}} \propto \left[1 + \frac{(\kappa_0+n)(\mu-\mu_n)^2}{\nu_n\sigma_n^2}\right]^{-\frac{n+\nu_0+1}{2}} \]
which is the kernel for a noncentral, scaled Student t\left(\mu; \mu_n, \frac{\sigma_n^2}{\kappa_0+n}, n+\nu_0\right).

^4 t = \frac{\mu-\mu_n}{\sigma_n/\sqrt{\kappa_n}} has a standard Student-t(\nu_n) distribution, p(t) \propto \left[1 + \frac{t^2}{\nu_n}\right]^{-\frac{\nu_n+1}{2}}.
Similarly, the joint posterior factors as
\[ p\left(\mu, \sigma^2 \mid y\right) = p\left(\mu \mid \sigma^2, y\right) p\left(\sigma^2 \mid y\right) \]
so marginalization for \sigma^2 is achieved by integrating out \mu
\[ p\left(\sigma^2 \mid y\right) = \int p\left(\sigma^2 \mid y\right) p\left(\mu \mid \sigma^2, y\right) d\mu \]
Since only the conditional posterior involves \mu, the marginal posterior for \sigma^2 is immediate.
\[ p\left(\mu, \sigma^2 \mid y\right) \propto (\sigma^2)^{-\frac{n+\nu_0+3}{2}} \exp\left[-\frac{A}{2\sigma^2}\right] \propto (\sigma^2)^{-\frac{n+\nu_0+2}{2}} \exp\left[-\frac{\nu_n\sigma_n^2}{2\sigma^2}\right] (\sigma^2)^{-\frac{1}{2}} \exp\left[-\frac{(\kappa_0+n)(\mu-\mu_n)^2}{2\sigma^2}\right] \]
Integrating out \mu yields
\[ p\left(\sigma^2 \mid y\right) \propto (\sigma^2)^{-\frac{n+\nu_0+2}{2}} \exp\left[-\frac{\nu_n\sigma_n^2}{2\sigma^2}\right] \int (\sigma^2)^{-\frac{1}{2}} \exp\left[-\frac{(\kappa_0+n)(\mu-\mu_n)^2}{2\sigma^2}\right] d\mu \propto (\sigma^2)^{-\left(\frac{\nu_n}{2}+1\right)} \exp\left[-\frac{\nu_n\sigma_n^2}{2\sigma^2}\right] \]
since the Gaussian integral over \mu contributes only a factor proportional to (\sigma^2)^{\frac{1}{2}}, which cancels; this is the scaled, inverted chi-square(\nu_n, \sigma_n^2) kernel claimed above.
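The posterior parameters and the scaled Student t marginal for the mean can be computed directly. A sketch under hypothetical data and prior settings:

```python
import numpy as np
from scipy import stats

def nix_update(y, mu0, kappa0, nu0, sigma02):
    """Normal-inverted chi-square update for unknown mean and variance."""
    n, ybar = len(y), np.mean(y)
    s2 = np.var(y, ddof=1)
    kappa_n, nu_n = kappa0 + n, nu0 + n
    mu_n = (kappa0 * mu0 + n * ybar) / kappa_n
    nu_sigma2_n = nu0 * sigma02 + (n - 1) * s2 \
        + kappa0 * n / kappa_n * (mu0 - ybar) ** 2
    return mu_n, kappa_n, nu_n, nu_sigma2_n / nu_n

# marginal posterior for mu: scaled Student t(mu_n, sigma_n^2 / kappa_n, nu_n)
rng = np.random.default_rng(2)
y = rng.normal(2.0, 1.0, size=30)
mu_n, kappa_n, nu_n, sigma2_n = nix_update(y, mu0=0.0, kappa0=1.0,
                                           nu0=1.0, sigma02=1.0)
marginal_mu = stats.t(df=nu_n, loc=mu_n, scale=np.sqrt(sigma2_n / kappa_n))
print(marginal_mu.interval(0.95))
```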
6.4.3 Uninformative priors

An uninformative joint prior is
\[ p\left(\mu, \sigma^2\right) \propto \frac{1}{\sigma^2} \]
and the joint posterior is
\[ p\left(\mu, \sigma^2 \mid y\right) \propto (\sigma^2)^{-\left(\frac{n}{2}+1\right)} \exp\left[-\frac{1}{2\sigma^2}\left\{(n-1)s^2 + n(\mu-\bar{y})^2\right\}\right] \]
\[ \propto (\sigma^2)^{-\left[\frac{n-1}{2}+1\right]} \exp\left[-\frac{(n-1)s^2}{2\sigma^2}\right] (\sigma^2)^{-\frac{1}{2}} \exp\left[-\frac{n(\mu-\bar{y})^2}{2\sigma^2}\right] \]
The conditional posterior for \mu given \sigma^2 is Gaussian\left(\bar{y}, \frac{\sigma^2}{n}\right), and the marginal posterior for \sigma^2 is a scaled, inverted chi-square(n-1, s^2). For the marginal posterior of the mean, let z = \frac{A}{2\sigma^2} where A = (n-1)s^2 + n(\mu-\bar{y})^2. Now integrate \sigma^2 out of the joint posterior following the transformation of variables.
\[ p(\mu \mid y) \propto \int_0^\infty (\sigma^2)^{-\left(\frac{n}{2}+1\right)} \exp\left[-\frac{A}{2\sigma^2}\right] d\sigma^2 \propto A^{-\frac{n}{2}} \int_0^\infty z^{\frac{n}{2}-1} e^{-z}\, dz \]
As before, the integral involves the kernel of a gamma density and therefore is a constant which can be ignored. Hence,
\[ p(\mu \mid y) \propto A^{-\frac{n}{2}} \propto \left[(n-1)s^2 + n(\mu-\bar{y})^2\right]^{-\frac{n}{2}} \propto \left[1 + \frac{n(\mu-\bar{y})^2}{(n-1)s^2}\right]^{-\frac{(n-1)+1}{2}} \]
which we recognize as the kernel of a noncentral, scaled Student t\left(\mu; \bar{y}, \frac{s^2}{n}, n-1\right).
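A short numerical confirmation that this uninformative-prior marginal reproduces the classical t interval (the data are hypothetical):

```python
import numpy as np
from scipy import stats

# with p(mu, sigma^2) proportional to 1/sigma^2, the marginal posterior for mu
# is Student t(ybar, s^2/n, n - 1), numerically identical to the classical
# t interval for the mean
rng = np.random.default_rng(3)
y = rng.normal(5.0, 2.0, size=25)
n, ybar, s = len(y), y.mean(), y.std(ddof=1)
print(stats.t(df=n - 1, loc=ybar, scale=s / np.sqrt(n)).interval(0.95))
```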
6.5 Multivariate Gaussian (unknown mean, known variance)

A multivariate Gaussian likelihood with unknown mean and known variance is
\[ \ell(\mu \mid y, \Sigma) \propto \exp\left[-\frac{1}{2}\sum_{i=1}^n (y_i-\mu)^T \Sigma^{-1} (y_i-\mu)\right] \]
where superscript T refers to transpose, y_i and \mu are k-length vectors and \Sigma is a k \times k variance-covariance matrix. A Gaussian prior for the mean vector, \mu, with prior mean, \mu_0, and prior variance, \Sigma_0, is
\[ p(\mu \mid \Sigma; \mu_0, \Sigma_0) \propto \exp\left[-\frac{1}{2}(\mu-\mu_0)^T \Sigma_0^{-1} (\mu-\mu_0)\right] \]
The product of the likelihood and prior yields the kernel of a multivariate Gaussian posterior distribution for the mean
\[ p(\mu \mid \Sigma, y; \mu_0, \Sigma_0) \propto \exp\left[-\frac{1}{2}(\mu-\mu_0)^T \Sigma_0^{-1} (\mu-\mu_0)\right] \exp\left[-\frac{1}{2}\sum_{i=1}^n (y_i-\mu)^T \Sigma^{-1} (y_i-\mu)\right] \]

6.5.1 Completing the square

Expanding the exponent gives
\[ (\mu-\mu_0)^T \Sigma_0^{-1} (\mu-\mu_0) + \sum_{i=1}^n (y_i-\mu)^T \Sigma^{-1} (y_i-\mu) \]
\[ = \mu^T\left(\Sigma_0^{-1} + n\Sigma^{-1}\right)\mu - 2\mu^T\left(\Sigma_0^{-1}\mu_0 + n\Sigma^{-1}\bar{y}\right) + \mu_0^T\Sigma_0^{-1}\mu_0 + \sum_{i=1}^n y_i^T\Sigma^{-1}y_i \]
Defining
\[ \mu_n = \left(\Sigma_0^{-1} + n\Sigma^{-1}\right)^{-1}\left(\Sigma_0^{-1}\mu_0 + n\Sigma^{-1}\bar{y}\right) \]
leads to
\[ \mu^T\left(\Sigma_0^{-1} + n\Sigma^{-1}\right)\mu - 2\mu^T\left(\Sigma_0^{-1} + n\Sigma^{-1}\right)\mu_n + \mu_n^T\left(\Sigma_0^{-1} + n\Sigma^{-1}\right)\mu_n = (\mu-\mu_n)^T\left(\Sigma_0^{-1} + n\Sigma^{-1}\right)(\mu-\mu_n) \]
Thus, adding and subtracting \mu_n^T\left(\Sigma_0^{-1} + n\Sigma^{-1}\right)\mu_n in the exponent completes the square (with three extra terms).
\[ (\mu-\mu_0)^T \Sigma_0^{-1} (\mu-\mu_0) + \sum_{i=1}^n (y_i-\mu)^T \Sigma^{-1} (y_i-\mu) \]
\[ = (\mu-\mu_n)^T\left(\Sigma_0^{-1} + n\Sigma^{-1}\right)(\mu-\mu_n) - \mu_n^T\left(\Sigma_0^{-1} + n\Sigma^{-1}\right)\mu_n + \mu_0^T\Sigma_0^{-1}\mu_0 + \sum_{i=1}^n y_i^T\Sigma^{-1}y_i \]
The latter three terms are constants (they do not involve \mu) and are absorbed on normalization, so the posterior kernel is
\[ p(\mu \mid \Sigma, y; \mu_0, \Sigma_0) \propto \exp\left[-\frac{1}{2}(\mu-\mu_n)^T\left(\Sigma_0^{-1} + n\Sigma^{-1}\right)(\mu-\mu_n)\right] \]
Hence, the posterior for the mean has expected value and variance
\[ E[\mu \mid y, \Sigma, \mu_0, \Sigma_0] = \mu_n = \left(\Sigma_0^{-1} + n\Sigma^{-1}\right)^{-1}\left(\Sigma_0^{-1}\mu_0 + n\Sigma^{-1}\bar{y}\right) \]
\[ Var[\mu \mid y, \Sigma, \mu_0, \Sigma_0] = \left(\Sigma_0^{-1} + n\Sigma^{-1}\right)^{-1} \]
As in the univariate case, the data and prior beliefs are weighted by their relative precisions.
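A sketch of the multivariate precision-weighted update (the data, prior mean mu0, and prior variance Sigma0 are hypothetical):

```python
import numpy as np

def mvn_known_cov_posterior(Y, Sigma, mu0, Sigma0):
    """Posterior mean and variance for mu with known Sigma:
    Var = (Sigma0^-1 + n Sigma^-1)^-1; the mean weights prior and data
    by their relative precisions."""
    n, ybar = Y.shape[0], Y.mean(axis=0)
    P0, P = np.linalg.inv(Sigma0), np.linalg.inv(Sigma)
    V = np.linalg.inv(P0 + n * P)
    m = V @ (P0 @ mu0 + n * P @ ybar)
    return m, V

# hypothetical bivariate data
rng = np.random.default_rng(4)
Y = rng.multivariate_normal([1.0, -1.0], np.eye(2), size=100)
print(mvn_known_cov_posterior(Y, np.eye(2), np.zeros(2), 10 * np.eye(2)))
```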
6.5.2 Uninformative priors

With a uniform prior for \mu, the posterior is proportional to the likelihood
\[ \ell(\mu \mid \Sigma, y) \propto \exp\left[-\frac{1}{2}\sum_{i=1}^n (y_i-\mu)^T \Sigma^{-1} (y_i-\mu)\right] \]
Expanding the exponent and adding and subtracting n\bar{y}^T\Sigma^{-1}\bar{y} (to complete the square) gives
\[ \sum_{i=1}^n (y_i-\mu)^T \Sigma^{-1} (y_i-\mu) = \sum_{i=1}^n y_i^T\Sigma^{-1}y_i - 2n\mu^T\Sigma^{-1}\bar{y} + n\mu^T\Sigma^{-1}\mu + n\bar{y}^T\Sigma^{-1}\bar{y} - n\bar{y}^T\Sigma^{-1}\bar{y} \]
\[ = n(\bar{y}-\mu)^T\Sigma^{-1}(\bar{y}-\mu) + \sum_{i=1}^n y_i^T\Sigma^{-1}y_i - n\bar{y}^T\Sigma^{-1}\bar{y} \]
The latter two terms are constants, hence, the posterior kernel is
\[ p(\mu \mid \Sigma, y) \propto \exp\left[-\frac{n}{2}(\bar{y}-\mu)^T\Sigma^{-1}(\bar{y}-\mu)\right] \]
which is Gaussian or normal N\left(\bar{y}, \frac{1}{n}\Sigma\right), the classical result.
6.6 Multivariate Gaussian (unknown mean, unknown variance)

A multivariate Gaussian likelihood with both mean vector and variance matrix unknown is
\[ \ell(\mu, \Sigma \mid y) \propto |\Sigma|^{-\frac{n}{2}} \exp\left[-\frac{1}{2}\sum_{i=1}^n (y_i-\mu)^T \Sigma^{-1} (y_i-\mu)\right] \]
\[ \propto |\Sigma|^{-\frac{n}{2}} \exp\left[-\frac{1}{2}\left\{\sum_{i=1}^n (y_i-\bar{y})^T\Sigma^{-1}(y_i-\bar{y}) + n(\bar{y}-\mu)^T\Sigma^{-1}(\bar{y}-\mu)\right\}\right] \]
\[ \propto |\Sigma|^{-\frac{n}{2}} \exp\left[-\frac{1}{2}\left\{(n-1)s^2 + n(\bar{y}-\mu)^T\Sigma^{-1}(\bar{y}-\mu)\right\}\right] \]
where s^2 = \frac{1}{n-1}\sum_{i=1}^n (y_i-\bar{y})^T\Sigma^{-1}(y_i-\bar{y}). The likelihood combines with a Gaussian-inverted Wishart prior
\[ p\left(\mu \mid \Sigma; \mu_0, \tfrac{\Sigma}{\kappa_0}\right) p(\Sigma; \Lambda, \nu) \propto |\Sigma|^{-\frac{1}{2}} \exp\left[-\frac{\kappa_0}{2}(\mu-\mu_0)^T\Sigma^{-1}(\mu-\mu_0)\right] |\Lambda|^{\frac{\nu}{2}} |\Sigma|^{-\frac{\nu+k+1}{2}} \exp\left[-\frac{1}{2}tr\left(\Lambda\Sigma^{-1}\right)\right] \]
to yield the joint posterior
\[ p(\mu, \Sigma \mid y) \propto |\Lambda|^{\frac{\nu}{2}} |\Sigma|^{-\frac{\nu+n+k+1}{2}} \exp\left[-\frac{1}{2}tr\left(\Lambda\Sigma^{-1}\right)\right] |\Sigma|^{-\frac{1}{2}} \exp\left[-\frac{1}{2}\left\{(n-1)s^2 + n(\bar{y}-\mu)^T\Sigma^{-1}(\bar{y}-\mu) + \kappa_0(\mu-\mu_0)^T\Sigma^{-1}(\mu-\mu_0)\right\}\right] \]

6.6.1 Completing the square

Completing the square in \mu within the exponent gives
\[ (n-1)s^2 + n(\bar{y}-\mu)^T\Sigma^{-1}(\bar{y}-\mu) + \kappa_0(\mu-\mu_0)^T\Sigma^{-1}(\mu-\mu_0) \]
\[ = (n-1)s^2 + n\bar{y}^T\Sigma^{-1}\bar{y} - 2n\mu^T\Sigma^{-1}\bar{y} + n\mu^T\Sigma^{-1}\mu + \kappa_0\mu^T\Sigma^{-1}\mu - 2\kappa_0\mu^T\Sigma^{-1}\mu_0 + \kappa_0\mu_0^T\Sigma^{-1}\mu_0 \]
\[ = (n-1)s^2 + (\kappa_0+n)\mu^T\Sigma^{-1}\mu - 2\mu^T\Sigma^{-1}(\kappa_0\mu_0+n\bar{y}) + (\kappa_0+n)\mu_n^T\Sigma^{-1}\mu_n - (\kappa_0+n)\mu_n^T\Sigma^{-1}\mu_n + \kappa_0\mu_0^T\Sigma^{-1}\mu_0 + n\bar{y}^T\Sigma^{-1}\bar{y} \]
\[ = (n-1)s^2 + (\kappa_0+n)(\mu-\mu_n)^T\Sigma^{-1}(\mu-\mu_n) + \frac{\kappa_0 n}{\kappa_0+n}(\mu_0-\bar{y})^T\Sigma^{-1}(\mu_0-\bar{y}) \]
Hence, the joint posterior can be rewritten as
\[ p(\mu, \Sigma \mid y) \propto |\Lambda|^{\frac{\nu}{2}} |\Sigma|^{-\frac{\nu+n+k+1}{2}} \exp\left[-\frac{1}{2}\left\{tr\left(\Lambda\Sigma^{-1}\right) + (n-1)s^2 + \frac{\kappa_0 n}{\kappa_0+n}(\mu_0-\bar{y})^T\Sigma^{-1}(\mu_0-\bar{y})\right\}\right] \]
\[ \times\; |\Sigma|^{-\frac{1}{2}} \exp\left[-\frac{\kappa_0+n}{2}(\mu-\mu_n)^T\Sigma^{-1}(\mu-\mu_n)\right] \]

6.6.2 Inverted-Wishart kernel

Since x^T\Sigma^{-1}x = tr\left(x^T\Sigma^{-1}x\right), tr\left(x^T x\right) = tr\left(x x^T\right), and
\[ tr\left(x^T\Sigma^{-1}x\right) = tr\left(\Sigma^{-1}x x^T\right) = tr\left(x x^T\Sigma^{-1}\right) \]
the above joint posterior can be rewritten as a N\left(\mu; \mu_n, (\kappa_0+n)^{-1}\Sigma\right)-inverted-Wishart\left(\Sigma^{-1}; \nu+n, \Lambda_n\right)
\[ p(\mu, \Sigma \mid y) \propto |\Lambda_n|^{\frac{\nu+n}{2}} |\Sigma|^{-\frac{\nu+n+k+1}{2}} \exp\left[-\frac{1}{2}tr\left(\Lambda_n\Sigma^{-1}\right)\right] |\Sigma|^{-\frac{1}{2}} \exp\left[-\frac{\kappa_0+n}{2}(\mu-\mu_n)^T\Sigma^{-1}(\mu-\mu_n)\right] \]
where
\[ \mu_n = \frac{1}{\kappa_0+n}(\kappa_0\mu_0 + n\bar{y}) \]
and
\[ \Lambda_n = \Lambda + \sum_{i=1}^n (y_i-\bar{y})(y_i-\bar{y})^T + \frac{\kappa_0 n}{\kappa_0+n}(\bar{y}-\mu_0)(\bar{y}-\mu_0)^T \]
Now, it's apparent the conditional posterior for \mu given \Sigma is N\left(\mu_n, (\kappa_0+n)^{-1}\Sigma\right)
\[ p(\mu \mid \Sigma, y) \propto \exp\left[-\frac{\kappa_0+n}{2}(\mu-\mu_n)^T\Sigma^{-1}(\mu-\mu_n)\right] \]
and the conditional posterior for the variance is inverted-Wishart, I\text{-}W\left(\Sigma^{-1}; \nu+n, \Lambda_n\right). The marginal posterior for the mean, derived below, is Student t with scale matrix
\[ \frac{\Lambda_n}{(\kappa_0+n)(\nu+n-k+1)} \]
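A sketch computing the Gaussian-inverted Wishart posterior parameters (\mu_n, \kappa_n, \nu_n, \Lambda_n) under hypothetical data and prior settings:

```python
import numpy as np

def niw_update(Y, mu0, kappa0, nu0, Lambda0):
    """Normal-inverted Wishart update: returns (mu_n, kappa_n, nu_n, Lambda_n)."""
    n, ybar = Y.shape[0], Y.mean(axis=0)
    S = (Y - ybar).T @ (Y - ybar)          # sum of squares about ybar
    kappa_n, nu_n = kappa0 + n, nu0 + n
    mu_n = (kappa0 * mu0 + n * ybar) / kappa_n
    d = (ybar - mu0)[:, None]
    Lambda_n = Lambda0 + S + (kappa0 * n / kappa_n) * (d @ d.T)
    return mu_n, kappa_n, nu_n, Lambda_n

rng = np.random.default_rng(5)
Y = rng.multivariate_normal([0.0, 0.0], [[2.0, 0.5], [0.5, 1.0]], size=200)
print(niw_update(Y, np.zeros(2), 1.0, 4.0, np.eye(2)))
```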
6.6.3 Marginal posterior distributions

Marginalization of the mean derives from the following identities (see Box and Tiao [1973], pp. 427, 441). Let Z be an m \times m positive definite symmetric matrix consisting of \frac{1}{2}m(m+1) distinct random variables z_{ij} (i, j = 1, \ldots, m; i \leq j). And let q > 0 and B be an m \times m positive definite symmetric matrix. Then, for a distribution of the z_{ij} with kernel
\[ p(Z) \propto |Z|^{\frac{1}{2}q - 1}, \qquad Z > 0 \]
the first identity is
\[ \int_{Z>0} |Z|^{\frac{1}{2}q-1} \exp\left[-\frac{1}{2}tr(ZB)\right] dZ = |B|^{-\frac{1}{2}(q+m-1)}\, 2^{\frac{1}{2}(q+m-1)m}\, \Gamma_m\left(\frac{q+m-1}{2}\right) \qquad (I.1) \]
where
\[ \Gamma_p(b) = \pi^{\frac{p(p-1)}{4}} \prod_{\alpha=1}^p \Gamma\left(b + \frac{1-\alpha}{2}\right), \qquad b > \frac{p-1}{2} \]
and
\[ \Gamma(z) = \int_0^\infty t^{z-1} e^{-t}\, dt \]
or, for integer n, \Gamma(n) = (n-1)!.

The second identity involves the relationship between determinants that allows us to express the above as a quadratic form. The identity is
\[ |I_k - PQ| = |I_l - QP| \qquad (I.2) \]
To work with \Sigma^{-1}, transform the joint posterior as
\[ p\left(\mu, \Sigma^{-1} \mid y\right) = p(\mu, \Sigma \mid y)\, ||J|| \]
where ||J|| is the (absolute value of the) determinant of the Jacobian
\[ ||J|| = \left|\left|\frac{\partial\left(\sigma_{11}, \sigma_{12}, \ldots, \sigma_{kk}\right)}{\partial\left(\sigma^{11}, \sigma^{12}, \ldots, \sigma^{kk}\right)}\right|\right| = |\Sigma|^{k+1} \]
Since
\[ p(\mu, \Sigma \mid y) \propto |\Sigma|^{-\frac{\nu+n+k+1}{2}} \exp\left[-\frac{1}{2}tr\left(\Lambda_n\Sigma^{-1}\right)\right] |\Sigma|^{-\frac{1}{2}} \exp\left[-\frac{\kappa_0+n}{2}(\mu-\mu_n)^T\Sigma^{-1}(\mu-\mu_n)\right] = |\Sigma|^{-\frac{\nu+n+k+2}{2}} \exp\left[-\frac{1}{2}tr\left(S(\mu)\Sigma^{-1}\right)\right] \]
where S(\mu) = \Lambda_n + (\kappa_0+n)(\mu-\mu_n)(\mu-\mu_n)^T, we have
\[ p\left(\mu, \Sigma^{-1} \mid y\right) \propto |\Sigma|^{-\frac{\nu+n+k+2}{2}} |\Sigma|^{\frac{2k+2}{2}} \exp\left[-\frac{1}{2}tr\left(S(\mu)\Sigma^{-1}\right)\right] = |\Sigma^{-1}|^{\frac{\nu+n-k}{2}} \exp\left[-\frac{1}{2}tr\left(S(\mu)\Sigma^{-1}\right)\right] \]
Now, applying the first identity (with Z = \Sigma^{-1}, \frac{1}{2}q - 1 = \frac{\nu+n-k}{2}, and B = S(\mu)) yields
\[ \int_{\Sigma^{-1}>0} p\left(\mu, \Sigma^{-1} \mid y\right) d\Sigma^{-1} \propto |S(\mu)|^{-\frac{1}{2}(\nu+n+1)} \]
\[ \propto \left|\Lambda_n + (\kappa_0+n)(\mu-\mu_n)(\mu-\mu_n)^T\right|^{-\frac{1}{2}(\nu+n+1)} \propto \left|I + (\kappa_0+n)\Lambda_n^{-1}(\mu-\mu_n)(\mu-\mu_n)^T\right|^{-\frac{1}{2}(\nu+n+1)} \]
and the second identity (I.2) gives
\[ p(\mu \mid y) \propto \left[1 + (\kappa_0+n)(\mu-\mu_n)^T\Lambda_n^{-1}(\mu-\mu_n)\right]^{-\frac{1}{2}(\nu+n+1)} \]
the kernel of a multivariate Student t_k\left(\mu; \mu_n, \frac{\Lambda_n}{(\kappa_0+n)(\nu+n-k+1)}, \nu+n-k+1\right).
6.6.4 Uninformative priors

An uninformative joint prior is
\[ p(\mu, \Sigma) \propto |\Sigma|^{-\frac{k+1}{2}} \]
and the joint posterior is
\[ p(\mu, \Sigma \mid y) \propto |\Sigma|^{-\frac{k+1}{2}} |\Sigma|^{-\frac{n}{2}} \exp\left[-\frac{1}{2}\left\{(n-1)s^2 + n(\bar{y}-\mu)^T\Sigma^{-1}(\bar{y}-\mu)\right\}\right] \]
\[ = |\Sigma|^{-\frac{n+k+1}{2}} \exp\left[-\frac{1}{2}\left\{(n-1)s^2 + n(\bar{y}-\mu)^T\Sigma^{-1}(\bar{y}-\mu)\right\}\right] = |\Sigma|^{-\frac{n+k+1}{2}} \exp\left[-\frac{1}{2}tr\left(S(\mu)\Sigma^{-1}\right)\right] \]
where now S(\mu) = \sum_{i=1}^n (\bar{y}-y_i)(\bar{y}-y_i)^T + n(\bar{y}-\mu)(\bar{y}-\mu)^T. Then, the conditional posterior for \mu given \Sigma is N\left(\bar{y}, n^{-1}\Sigma\right)
\[ p(\mu \mid \Sigma, y) \propto \exp\left[-\frac{n}{2}(\mu-\bar{y})^T\Sigma^{-1}(\mu-\bar{y})\right] \]
The marginal posterior for \mu is derived analogous to the above informed conjugate prior case. Rewriting the posterior in terms of \Sigma^{-1} yields
\[ p\left(\mu, \Sigma^{-1} \mid y\right) \propto |\Sigma|^{-\frac{n+k+1}{2}} \exp\left[-\frac{1}{2}tr\left(S(\mu)\Sigma^{-1}\right)\right] |\Sigma|^{\frac{2k+2}{2}} = |\Sigma^{-1}|^{\frac{n-k-1}{2}} \exp\left[-\frac{1}{2}tr\left(S(\mu)\Sigma^{-1}\right)\right] \]
so that, by the first identity,
\[ p(\mu \mid y) \propto \int_{\Sigma^{-1}>0} |\Sigma^{-1}|^{\frac{n-k-1}{2}} \exp\left[-\frac{1}{2}tr\left(S(\mu)\Sigma^{-1}\right)\right] d\Sigma^{-1} \propto |S(\mu)|^{-\frac{n}{2}} \]
\[ \propto \left|\sum_{i=1}^n (\bar{y}-y_i)(\bar{y}-y_i)^T + n(\bar{y}-\mu)(\bar{y}-\mu)^T\right|^{-\frac{n}{2}} \propto \left|I + n\left[\sum_{i=1}^n (\bar{y}-y_i)(\bar{y}-y_i)^T\right]^{-1}(\bar{y}-\mu)(\bar{y}-\mu)^T\right|^{-\frac{n}{2}} \]
The second identity (I.2) identifies the marginal posterior for \mu as (multivariate) Student t_k\left(\mu; \bar{y}, n^{-1}s^2, n-k\right)
\[ p(\mu \mid y) \propto \left[1 + \frac{n}{n-k}(\bar{y}-\mu)^T \left(s^2\right)^{-1} (\bar{y}-\mu)\right]^{-\frac{n}{2}} \]
where (n-k)s^2 = \sum_{i=1}^n (\bar{y}-y_i)(\bar{y}-y_i)^T. The marginal posterior for the variance is I\text{-}W\left(\Sigma^{-1}; n, \Lambda_n\right) where now \Lambda_n = \sum_{i=1}^n (\bar{y}-y_i)(\bar{y}-y_i)^T.
6.7 Bayesian linear regression

Suppose the data generating process (DGP) is
\[ y = X\beta + \varepsilon, \qquad \varepsilon \sim N\left(0, \sigma^2 I_n\right) \]
where X is an n \times p full column rank matrix of (weakly exogenous) regressors.

6.7.1 Known variance

For known variance, a Gaussian prior for \beta is
\[ p\left(\beta \mid \sigma^2\right) \sim N\left(\beta_0, \sigma^2 V_0\right) \]
where we can think of V_0 = \left(X_0^T X_0\right)^{-1} as if we had a prior sample (y_0, X_0) such that
\[ \beta_0 = \left(X_0^T X_0\right)^{-1} X_0^T y_0 \]
then the conditional posterior for \beta is
\[ p\left(\beta \mid \sigma^2, y, X; \beta_0, V_0\right) \sim N\left(\bar{\beta}, \bar{V}\right) \]
where
\[ \bar{\beta} = \left(X_0^T X_0 + X^T X\right)^{-1}\left(X_0^T X_0 \beta_0 + X^T X b\right), \qquad b = \left(X^T X\right)^{-1} X^T y \]
and
\[ \bar{V} = \sigma^2 \left(X_0^T X_0 + X^T X\right)^{-1} \]
Substituting for \beta_0 and b,
\[ \bar{\beta} = \left(X_0^T X_0 + X^T X\right)^{-1}\left(X_0^T X_0 \left(X_0^T X_0\right)^{-1} X_0^T y_0 + X^T X \left(X^T X\right)^{-1} X^T y\right) = \left(X_0^T X_0 + X^T X\right)^{-1}\left(X_0^T y_0 + X^T y\right) \]
Since the DGP is
\[ y_0 = X_0\beta + \varepsilon_0, \qquad \varepsilon_0 \sim N\left(0, \sigma^2 I_{n_0}\right) \]
\[ y = X\beta + \varepsilon, \qquad \varepsilon \sim N\left(0, \sigma^2 I_n\right) \]
then
\[ \bar{\beta} = \left(X_0^T X_0 + X^T X\right)^{-1}\left(X_0^T X_0\beta + X_0^T\varepsilon_0 + X^T X\beta + X^T\varepsilon\right) \]
\[ E\left[\bar{\beta} \mid X, X_0\right] = \left(X_0^T X_0 + X^T X\right)^{-1}\left(X_0^T X_0 + X^T X\right)\beta = \beta \]
Hence,
\[ \bar{\beta} - E\left[\bar{\beta} \mid X, X_0\right] = \left(X_0^T X_0 + X^T X\right)^{-1}\left(X_0^T\varepsilon_0 + X^T\varepsilon\right) \]
so that
\[ Var\left[\bar{\beta} \mid X, X_0\right] = E\left[\left(\bar{\beta}-\beta\right)\left(\bar{\beta}-\beta\right)^T \mid X, X_0\right] \]
\[ = E\left[\left(X_0^T X_0 + X^T X\right)^{-1}\left(X_0^T\varepsilon_0 + X^T\varepsilon\right)\left(\varepsilon_0^T X_0 + \varepsilon^T X\right)\left(X_0^T X_0 + X^T X\right)^{-1} \mid X, X_0\right] \]
\[ = \left(X_0^T X_0 + X^T X\right)^{-1}\left(X_0^T\, E\left[\varepsilon_0\varepsilon_0^T\right] X_0 + X^T\, E\left[\varepsilon\varepsilon^T\right] X\right)\left(X_0^T X_0 + X^T X\right)^{-1} \]
\[ = \sigma^2\left(X_0^T X_0 + X^T X\right)^{-1}\left(X_0^T X_0 + X^T X\right)\left(X_0^T X_0 + X^T X\right)^{-1} = \sigma^2\left(X_0^T X_0 + X^T X\right)^{-1} \]
(the cross-product terms vanish as \varepsilon_0 and \varepsilon are independent). Now, let's backtrack and derive the conditional posterior as the product of conditional priors and the likelihood function. The likelihood function for known variance is
\[ \ell\left(\beta \mid \sigma^2, y, X\right) \propto \exp\left[-\frac{1}{2\sigma^2}(y-X\beta)^T(y-X\beta)\right] \]
and the conditional prior is
\[ p\left(\beta \mid \sigma^2\right) \propto \exp\left[-\frac{1}{2\sigma^2}(\beta-\beta_0)^T V_0^{-1}(\beta-\beta_0)\right] \]
The conditional posterior is the product of the prior and likelihood
\[ p\left(\beta \mid \sigma^2, y, X\right) \propto \exp\left[-\frac{1}{2\sigma^2}\left\{(y-X\beta)^T(y-X\beta) + (\beta-\beta_0)^T V_0^{-1}(\beta-\beta_0)\right\}\right] \]
\[ = \exp\left[-\frac{1}{2\sigma^2}\left\{y^T y - 2y^T X\beta + \beta^T X^T X\beta + \beta^T X_0^T X_0\beta - 2\beta_0^T X_0^T X_0\beta + \beta_0^T X_0^T X_0\beta_0\right\}\right] \]
The first and last terms in the exponent do not involve \beta (are constants) and can be ignored as they are absorbed through normalization. This leaves
\[ p\left(\beta \mid \sigma^2, y, X\right) \propto \exp\left[-\frac{1}{2\sigma^2}\left\{\beta^T\left(X_0^T X_0 + X^T X\right)\beta - 2\left(y^T X + \beta_0^T X_0^T X_0\right)\beta\right\}\right] \]
For comparison, expand the claimed posterior kernel, p\left(\beta \mid \sigma^2, y, X\right) \sim N\left(\bar{\beta}, \bar{V}\right):
\[ \exp\left[-\frac{1}{2}\left(\beta-\bar{\beta}\right)^T \bar{V}^{-1}\left(\beta-\bar{\beta}\right)\right] = \exp\left[-\frac{1}{2\sigma^2}\left\{\beta^T\left(X_0^T X_0 + X^T X\right)\beta - 2\bar{\beta}^T\left(X_0^T X_0 + X^T X\right)\beta + \bar{\beta}^T\left(X_0^T X_0 + X^T X\right)\bar{\beta}\right\}\right] \]
\[ = \exp\left[-\frac{1}{2\sigma^2}\left\{\beta^T\left(X_0^T X_0 + X^T X\right)\beta - 2\left(X_0^T X_0\beta_0 + X^T y\right)^T\beta + \bar{\beta}^T\left(X_0^T X_0 + X^T X\right)\bar{\beta}\right\}\right] \]
The last term in the exponent is all constants (does not involve \beta) so it's absorbed through normalization and disregarded for comparison of kernels. Hence,
\[ p\left(\beta \mid \sigma^2, y, X\right) \propto \exp\left[-\frac{1}{2\sigma^2}\left\{\beta^T\left(X_0^T X_0 + X^T X\right)\beta - 2\left(y^T X + \beta_0^T X_0^T X_0\right)\beta\right\}\right] \]
as claimed.
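A sketch of this known-variance posterior using the prior-sample interpretation; X0 stands in as a hypothetical prior design matrix:

```python
import numpy as np

def posterior_beta_known_var(y, X, sigma2, beta0, X0):
    """Known-variance posterior N(beta_bar, sigma2 (X0'X0 + X'X)^-1),
    treating the prior as a prior sample (y0, X0) with
    beta0 = (X0'X0)^-1 X0'y0."""
    A0, A = X0.T @ X0, X.T @ X
    b = np.linalg.solve(A, X.T @ y)          # classical estimate
    beta_bar = np.linalg.solve(A0 + A, A0 @ beta0 + A @ b)
    V_bar = sigma2 * np.linalg.inv(A0 + A)
    return beta_bar, V_bar

rng = np.random.default_rng(6)
X = np.column_stack([np.ones(80), rng.normal(size=80)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 1, size=80)
X0 = np.column_stack([np.ones(10), rng.normal(size=10)])  # hypothetical prior design
print(posterior_beta_known_var(y, X, 1.0, np.zeros(2), X0))
```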
Uninformative priors

If the prior for \beta is uniformly distributed conditional on known variance, \sigma^2, p\left(\beta \mid \sigma^2\right) \propto 1, then it's as if X_0^T X_0 \rightarrow 0 (the information matrix for the prior is null) and the posterior for \beta is
\[ p\left(\beta \mid \sigma^2, y, X\right) \sim N\left(b, \sigma^2\left(X^T X\right)^{-1}\right) \]
equivalent to the classical parameter estimators.

To see this intuition holds, recognize combining the likelihood with the uninformative prior indicates the posterior is proportional to the likelihood.
\[ p\left(\beta \mid \sigma^2, y, X\right) \propto \exp\left[-\frac{1}{2\sigma^2}(y-X\beta)^T(y-X\beta)\right] \]
Expanding this expression yields
\[ p\left(\beta \mid \sigma^2, y, X\right) \propto \exp\left[-\frac{1}{2\sigma^2}\left\{y^T y - 2y^T X\beta + \beta^T X^T X\beta\right\}\right] \]
The first term in the exponent doesn't depend on \beta and can be dropped as it's absorbed via normalization. This leaves
\[ p\left(\beta \mid \sigma^2, y, X\right) \propto \exp\left[-\frac{1}{2\sigma^2}\left\{-2y^T X\beta + \beta^T X^T X\beta\right\}\right] \]
Now, write p\left(\beta \mid \sigma^2, y, X\right) \sim N\left(b, \sigma^2\left(X^T X\right)^{-1}\right)
\[ p\left(\beta \mid \sigma^2, y, X\right) \propto \exp\left[-\frac{1}{2\sigma^2}(\beta-b)^T X^T X(\beta-b)\right] \]
and expand
\[ p\left(\beta \mid \sigma^2, y, X\right) \propto \exp\left[-\frac{1}{2\sigma^2}\left\{\beta^T X^T X\beta - 2b^T X^T X\beta + b^T X^T Xb\right\}\right] \]
The last term in the exponent doesn't depend on \beta and is absorbed via normalization. This leaves
\[ p\left(\beta \mid \sigma^2, y, X\right) \propto \exp\left[-\frac{1}{2\sigma^2}\left\{\beta^T X^T X\beta - 2b^T X^T X\beta\right\}\right] \]
\[ \propto \exp\left[-\frac{1}{2\sigma^2}\left\{\beta^T X^T X\beta - 2y^T X\left(X^T X\right)^{-1} X^T X\beta\right\}\right] = \exp\left[-\frac{1}{2\sigma^2}\left\{\beta^T X^T X\beta - 2y^T X\beta\right\}\right] \]
As this latter expression matches the simplified likelihood expression, the demonstration is complete: p\left(\beta \mid \sigma^2, y, X\right) \sim N\left(b, \sigma^2\left(X^T X\right)^{-1}\right).
6.7.2 Unknown variance

In the usual case where the variance as well as the regression coefficients, \beta, are unknown, the likelihood function can be expressed as
\[ \ell\left(\beta, \sigma^2 \mid y, X\right) \propto \sigma^{-n} \exp\left[-\frac{1}{2\sigma^2}(y-X\beta)^T(y-X\beta)\right] \]
Rewriting y = Xb + e, where b = \left(X^T X\right)^{-1} X^T y and e = y - Xb are estimates of \beta and \varepsilon, respectively, implies y - X\beta = e - X(\beta-b) and
\[ \ell\left(\beta, \sigma^2 \mid y, X\right) \propto \sigma^{-n} \exp\left[-\frac{1}{2\sigma^2}\left\{e^T e - 2(\beta-b)^T X^T e + (\beta-b)^T X^T X(\beta-b)\right\}\right] \]
Since X^T e = 0, this is
\[ \ell\left(\beta, \sigma^2 \mid y, X\right) \propto \sigma^{-n} \exp\left[-\frac{1}{2\sigma^2}\left\{e^T e + (\beta-b)^T X^T X(\beta-b)\right\}\right] \]
or
\[ \ell\left(\beta, \sigma^2 \mid y, X\right) \propto \sigma^{-n} \exp\left[-\frac{1}{2\sigma^2}\left\{(n-p)s^2 + (\beta-b)^T X^T X(\beta-b)\right\}\right] \]
where s^2 = \frac{1}{n-p} e^T e.^5

The conjugate prior for linear regression is the Gaussian\left(\beta \mid \sigma^2; \beta_0, \sigma^2\Lambda_0^{-1}\right)-inverse chi-square\left(\sigma^2; \nu_0, \sigma_0^2\right)
\[ p\left(\beta \mid \sigma^2; \beta_0, \sigma^2\Lambda_0^{-1}\right) p\left(\sigma^2; \nu_0, \sigma_0^2\right) \propto \sigma^{-p} \exp\left[-\frac{1}{2\sigma^2}(\beta-\beta_0)^T\Lambda_0(\beta-\beta_0)\right] \left(\sigma^2\right)^{-\left(\frac{\nu_0}{2}+1\right)} \exp\left[-\frac{\nu_0\sigma_0^2}{2\sigma^2}\right] \]
The joint posterior is then
\[ p\left(\beta, \sigma^2 \mid y, X; \beta_0, \sigma^2\Lambda_0^{-1}, \nu_0, \sigma_0^2\right) \propto \sigma^{-n} \exp\left[-\frac{(n-p)s^2}{2\sigma^2}\right] \exp\left[-\frac{(\beta-b)^T X^T X(\beta-b)}{2\sigma^2}\right] \]
\[ \times\; \sigma^{-p} \exp\left[-\frac{(\beta-\beta_0)^T\Lambda_0(\beta-\beta_0)}{2\sigma^2}\right] \left(\sigma^2\right)^{-\left(\frac{\nu_0}{2}+1\right)} \exp\left[-\frac{\nu_0\sigma_0^2}{2\sigma^2}\right] \]
which can be rewritten as
\[ p\left(\beta, \sigma^2 \mid y, X; \beta_0, \sigma^2\Lambda_0^{-1}, \nu_0, \sigma_0^2\right) \propto \left(\sigma^2\right)^{-\left[\frac{\nu_0+n}{2}+1\right]} \exp\left[-\frac{\nu_n\sigma_n^2}{2\sigma^2}\right] \sigma^{-p} \exp\left[-\frac{1}{2\sigma^2}\left(\beta-\bar{\beta}_n\right)^T\Lambda_n\left(\beta-\bar{\beta}_n\right)\right] \]
where
\[ \bar{\beta}_n = \left(\Lambda_0 + X^T X\right)^{-1}\left(\Lambda_0\beta_0 + X^T Xb\right) \]
\[ \Lambda_n = \Lambda_0 + X^T X \]
and
\[ \nu_n\sigma_n^2 = \nu_0\sigma_0^2 + (n-p)s^2 + \beta_0^T\Lambda_0\beta_0 + b^T X^T Xb - \bar{\beta}_n^T\Lambda_n\bar{\beta}_n \]
The derivation of the above joint posterior follows from the matrix version of completing the square where \Lambda_0 and X^T X are square, symmetric, full rank p \times p matrices. The exponents from the prior for the mean and likelihood are
\[ (\beta-\beta_0)^T\Lambda_0(\beta-\beta_0) + (\beta-b)^T X^T X(\beta-b) \]
Expanding and rearranging gives
\[ \beta^T\left(\Lambda_0 + X^T X\right)\beta - 2\left(\Lambda_0\beta_0 + X^T Xb\right)^T\beta + \beta_0^T\Lambda_0\beta_0 + b^T X^T Xb \qquad (6.1) \]
The latter two terms are constants not involving \beta (and can be ignored when writing the kernel for the conditional posterior) which we'll add to when we complete the square. Now, write out the square centered around \bar{\beta}_n = \left(\Lambda_0 + X^T X\right)^{-1}\left(\Lambda_0\beta_0 + X^T Xb\right)
\[ \left(\beta-\bar{\beta}_n\right)^T\left(\Lambda_0 + X^T X\right)\left(\beta-\bar{\beta}_n\right) = \beta^T\left(\Lambda_0 + X^T X\right)\beta - 2\bar{\beta}_n^T\left(\Lambda_0 + X^T X\right)\beta + \bar{\beta}_n^T\left(\Lambda_0 + X^T X\right)\bar{\beta}_n \]
Substitute for \bar{\beta}_n in the second term on the right-hand side and the first two terms are identical to the two terms in equation (6.1). Hence, the exponents from the prior for the mean and likelihood in (6.1) are equal to
\[ \left(\beta-\bar{\beta}_n\right)^T\left(\Lambda_0 + X^T X\right)\left(\beta-\bar{\beta}_n\right) - \bar{\beta}_n^T\left(\Lambda_0 + X^T X\right)\bar{\beta}_n + \beta_0^T\Lambda_0\beta_0 + b^T X^T Xb \]
or (in the form analogous to the univariate Gaussian case) the constant terms can be written
\[ \beta_0^T\Lambda_0\beta_0 + b^T X^T Xb - \bar{\beta}_n^T\Lambda_n\bar{\beta}_n = \left(\beta_0 - b\right)^T\left[\Lambda_0^{-1} + \left(X^T X\right)^{-1}\right]^{-1}\left(\beta_0 - b\right) \]

^5 For the univariate (intercept only) case,
\[ \ell\left(\mu, \sigma^2 \mid y, X\right) \propto \sigma^{-n} \exp\left[-\frac{1}{2\sigma^2}\left\{(n-p)s^2 + (\beta-b)^T X^T X(\beta-b)\right\}\right] \]
becomes
\[ \ell\left(\mu, \sigma^2 \mid y, X\right) = \sigma^{-n} \exp\left[-\frac{1}{2\sigma^2}\left\{(n-1)s^2 + n(\mu-\bar{y})^2\right\}\right] \]
where \beta = \mu, b = \left(X^T X\right)^{-1} X^T y = \bar{y}, p = 1, and X^T X = n.
Stacked regression

Equivalently, think of the prior as arising from a prior "sample" (y_0, X_0) with
\[ \beta_0 = \left(X_0^T X_0\right)^{-1} X_0^T y_0 \]
Then, we combine this initial "evidence" with new evidence to update our beliefs in the form of the posterior. Not surprisingly, the posterior mean is a weighted average of the two "samples" where the weights are based on the relative precision of the two "samples".
Marginal posterior distributions

The marginal posterior for \beta, on integrating out \sigma^2, is a noncentral, scaled multivariate Student t_p\left(\bar{\beta}_n, \sigma_n^2\Lambda_n^{-1}, \nu_0+n\right)
\[ p(\beta \mid y, X) \propto \left[\nu_n\sigma_n^2 + \left(\beta-\bar{\beta}_n\right)^T\Lambda_n\left(\beta-\bar{\beta}_n\right)\right]^{-\frac{\nu_0+n+p}{2}} \propto \left[1 + \frac{1}{\nu_n\sigma_n^2}\left(\beta-\bar{\beta}_n\right)^T\Lambda_n\left(\beta-\bar{\beta}_n\right)\right]^{-\frac{\nu_0+n+p}{2}} \]
and the marginal posterior for \sigma^2 is inverted chi-square\left(\nu_n, \sigma_n^2\right). Derivation of the marginal posterior for \beta is as follows. Let z = \frac{A}{2\sigma^2} where A = \nu_n\sigma_n^2 + \left(\beta-\bar{\beta}_n\right)^T\Lambda_n\left(\beta-\bar{\beta}_n\right). Then
\[ p(\beta \mid y) = \int_0^\infty p\left(\beta, \sigma^2 \mid y\right) d\sigma^2 \propto \int_0^\infty \left(\sigma^2\right)^{-\frac{n+\nu_0+p+2}{2}} \exp\left[-\frac{A}{2\sigma^2}\right] d\sigma^2 \]
Utilizing \sigma^2 = \frac{A}{2z} and dz = -\frac{A}{2\left(\sigma^2\right)^2}\, d\sigma^2 or d\sigma^2 = -\frac{A}{2z^2}\, dz (-1 and 2 are constants and can be ignored when deriving the kernel),
\[ p(\beta \mid y) \propto \int_0^\infty \left(\frac{A}{2z}\right)^{-\frac{n+\nu_0+p+2}{2}} \exp[-z]\, \frac{A}{2z^2}\, dz \propto A^{-\frac{n+\nu_0+p}{2}} \int_0^\infty z^{\frac{n+\nu_0+p}{2}-1} \exp[-z]\, dz \]
The integral \int_0^\infty z^{\frac{n+\nu_0+p}{2}-1} \exp[-z]\, dz is a constant since it is the kernel of a gamma density and therefore can be ignored when deriving the kernel of the marginal posterior for \beta
\[ p(\beta \mid y) \propto A^{-\frac{n+\nu_0+p}{2}} \propto \left[\nu_n\sigma_n^2 + \left(\beta-\bar{\beta}_n\right)^T\Lambda_n\left(\beta-\bar{\beta}_n\right)\right]^{-\frac{n+\nu_0+p}{2}} \propto \left[1 + \frac{\left(\beta-\bar{\beta}_n\right)^T\Lambda_n\left(\beta-\bar{\beta}_n\right)}{\nu_n\sigma_n^2}\right]^{-\frac{n+\nu_0+p}{2}} \]
freedom so that the joint prior is p , 2 2
.
The joint posterior is
2 [n/2+1]
1
T
2
p , | y
exp 2 (y X) (y X)
2
T 1 T
Since y = Xb + e where b = X X
X y, the joint posterior can be
written
2 [n/2+1]
1
T
2
2
T
p , | y
exp 2 (n p) s + ( b) X X ( b)
2
Or, factoring into the conditional posterior for and marginal for 2 , we
have
p , 2 | y p 2 | y p | 2 , y
[(np)/2+1]
2
2
exp n2
2
1
T
p exp 2 ( b) X T X ( b)
2
27
where
2n = (n p) s2
1
Hence, the conditional posterior for given 2 is Gaussian b, 2 X T X
.
1
The marginal posterior for is multivariate Student tp ; b, s2 X T X
,n p ,
the classical estimator. Derivation of the marginal posterior for is analoT
A
2
gous to that above. Let z = 2
X T X ( b).
2 where A = (n p) s +( b)
2
Integrating out of the joint posterior produces the marginal posterior
for .
p ( | y)
p , 2 | y d 2
2 n+2
A
2
exp 2 d 2
2
Substitution yields
p ( | y)
A 2
A
2z
n+2
2
A
exp [z] dz
2z 2
z 2 1 exp [z] dz
As before, the integral involves the kernel of a gamma distribution, a constant which can be ignored. Therefore, we have
n
p ( | y) A 2
n2
T
(n p) s2 + ( b) X T X ( b)
n2
T
( b) X T X ( b)
1+
(n p) s2
1
which is multivariate Student tp ; b, s2 X T X
,n p .
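A numerical illustration that, with the uninformative prior, the marginal Student t for each coefficient reproduces the classical confidence interval (the data are hypothetical):

```python
import numpy as np
from scipy import stats

# with p(beta, sigma^2) proportional to 1/sigma^2, the marginal posterior for
# each beta_j is Student t with n - p degrees of freedom, the classical result
rng = np.random.default_rng(7)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.5, -1.0]) + rng.normal(0, 2, size=n)

p = X.shape[1]
b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s2 = e @ e / (n - p)
V = s2 * np.linalg.inv(X.T @ X)      # scale matrix s^2 (X'X)^-1
for j in range(p):
    t_j = stats.t(df=n - p, loc=b[j], scale=np.sqrt(V[j, j]))
    print(t_j.interval(0.95))        # matches the classical CI
```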
6.8 Bayesian linear regression with general error structure

Suppose the DGP is
\[ y = X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \Sigma) \]
so that premultiplying by \Sigma^{-\frac{1}{2}}, where \Sigma^{-\frac{T}{2}}\Sigma^{-\frac{1}{2}} = \Sigma^{-1}, transforms the model to
\[ \Sigma^{-\frac{1}{2}} y = \Sigma^{-\frac{1}{2}} X\beta + \Sigma^{-\frac{1}{2}}\varepsilon, \qquad \Sigma^{-\frac{1}{2}}\varepsilon \sim N(0, I_n) \]
For known \Sigma, a conditional Gaussian prior with mean \beta_0 and variance V_\beta yields the conditional posterior
\[ p(\beta \mid \Sigma, y, X; \beta_0, V_\beta) \sim N\left(\bar{\beta}, \bar{V}\right) \]
where
\[ \bar{\beta} = \left(X_0^T\Sigma_0^{-1}X_0 + X^T\Sigma^{-1}X\right)^{-1}\left(X_0^T\Sigma_0^{-1}X_0\beta_0 + X^T\Sigma^{-1}X\hat{\beta}\right) = \left(V_\beta^{-1} + X^T\Sigma^{-1}X\right)^{-1}\left(V_\beta^{-1}\beta_0 + X^T\Sigma^{-1}X\hat{\beta}\right) \]
\[ \hat{\beta} = \left(X^T\Sigma^{-1}X\right)^{-1} X^T\Sigma^{-1}y \]
and
\[ \bar{V} = \left(X_0^T\Sigma_0^{-1}X_0 + X^T\Sigma^{-1}X\right)^{-1} = \left(V_\beta^{-1} + X^T\Sigma^{-1}X\right)^{-1} \]
as if the prior derives from a prior sample (y_0, X_0) with error variance \Sigma_0 and V_\beta^{-1} = X_0^T\Sigma_0^{-1}X_0. It is instructive to once again backtrack to develop the conditional posterior distribution. The likelihood function for known variance is
\[ \ell(\beta \mid \Sigma, y, X) \propto \exp\left[-\frac{1}{2}(y-X\beta)^T\Sigma^{-1}(y-X\beta)\right] \]
Conditional Gaussian priors are
\[ p(\beta \mid \Sigma) \propto \exp\left[-\frac{1}{2}(\beta-\beta_0)^T V_\beta^{-1}(\beta-\beta_0)\right] \]
The conditional posterior is the product of the prior and likelihood
\[ p(\beta \mid \Sigma, y, X) \propto \exp\left[-\frac{1}{2}\left\{(y-X\beta)^T\Sigma^{-1}(y-X\beta) + (\beta-\beta_0)^T V_\beta^{-1}(\beta-\beta_0)\right\}\right] \]
\[ = \exp\left[-\frac{1}{2}\left\{y^T\Sigma^{-1}y - 2y^T\Sigma^{-1}X\beta + \beta^T X^T\Sigma^{-1}X\beta + \beta^T V_\beta^{-1}\beta - 2\beta_0^T V_\beta^{-1}\beta + \beta_0^T V_\beta^{-1}\beta_0\right\}\right] \]
The first and last terms in the exponent do not involve \beta (are constants) and can be ignored as they are absorbed through normalization. This leaves
\[ p(\beta \mid \Sigma, y, X) \propto \exp\left[-\frac{1}{2}\left\{\beta^T\left(V_\beta^{-1} + X^T\Sigma^{-1}X\right)\beta - 2\left(y^T\Sigma^{-1}X + \beta_0^T V_\beta^{-1}\right)\beta\right\}\right] \]
For comparison, expand the claimed posterior kernel, p(\beta \mid \Sigma, y, X) \sim N\left(\bar{\beta}, \bar{V}\right):
\[ \exp\left[-\frac{1}{2}\left(\beta-\bar{\beta}\right)^T\bar{V}^{-1}\left(\beta-\bar{\beta}\right)\right] = \exp\left[-\frac{1}{2}\left\{\beta^T\left(V_\beta^{-1} + X^T\Sigma^{-1}X\right)\beta - 2\bar{\beta}^T\left(V_\beta^{-1} + X^T\Sigma^{-1}X\right)\beta + \bar{\beta}^T\left(V_\beta^{-1} + X^T\Sigma^{-1}X\right)\bar{\beta}\right\}\right] \]
\[ = \exp\left[-\frac{1}{2}\left\{\beta^T\left(V_\beta^{-1} + X^T\Sigma^{-1}X\right)\beta - 2\left(y^T\Sigma^{-1}X + \beta_0^T V_\beta^{-1}\right)\beta + \bar{\beta}^T\left(V_\beta^{-1} + X^T\Sigma^{-1}X\right)\bar{\beta}\right\}\right] \]
The last term in the exponent is all constants (does not involve \beta) so it's absorbed through normalization and disregarded for comparison of kernels. Hence, the kernels match,
\[ p(\beta \mid \Sigma, y, X) \propto \exp\left[-\frac{1}{2}\left\{\beta^T\left(V_\beta^{-1} + X^T\Sigma^{-1}X\right)\beta - 2\left(y^T\Sigma^{-1}X + \beta_0^T V_\beta^{-1}\right)\beta\right\}\right] \]
as claimed.
Now, suppose the mean and variance are unknown, where each draw is an element of the y vector and X is an n \times p matrix of regressors. A Gaussian likelihood is
\[ \ell(\beta, \Sigma \mid y, X) \propto |\Sigma|^{-\frac{n}{2}} \exp\left[-\frac{1}{2}(y-X\beta)^T\Sigma^{-1}(y-X\beta)\right] \]
\[ \propto |\Sigma|^{-\frac{n}{2}} \exp\left[-\frac{1}{2}\left\{(y-Xb)^T\Sigma^{-1}(y-Xb) + (b-\beta)^T X^T\Sigma^{-1}X(b-\beta)\right\}\right] \]
\[ \propto |\Sigma|^{-\frac{n}{2}} \exp\left[-\frac{1}{2}\left\{(n-p)s^2 + (b-\beta)^T X^T\Sigma^{-1}X(b-\beta)\right\}\right] \]
where b = \left(X^T\Sigma^{-1}X\right)^{-1} X^T\Sigma^{-1}y and s^2 = \frac{1}{n-p}(y-Xb)^T\Sigma^{-1}(y-Xb). Combine the likelihood with a Gaussian-inverted Wishart prior
\[ p(\beta \mid \Sigma; \beta_0, \Sigma_\beta)\, p\left(\Sigma^{-1}; \Lambda, \nu\right) \propto \exp\left[-\frac{1}{2}(\beta-\beta_0)^T\Sigma_\beta^{-1}(\beta-\beta_0)\right] |\Lambda|^{\frac{\nu}{2}} |\Sigma|^{-\frac{\nu+p+1}{2}} \exp\left[-\frac{1}{2}tr\left(\Lambda\Sigma^{-1}\right)\right] \]
where tr(\cdot) is the trace of the matrix, it is as if \Sigma_\beta = \left(X_0^T\Sigma_0^{-1}X_0\right)^{-1}, and \nu is degrees of freedom, to produce the joint posterior
\[ p(\beta, \Sigma \mid y, X) \propto |\Lambda|^{\frac{\nu}{2}} |\Sigma|^{-\frac{\nu+n+p+1}{2}} \exp\left[-\frac{1}{2}tr\left(\Lambda\Sigma^{-1}\right)\right] \exp\left[-\frac{1}{2}\left\{(n-p)s^2 + (b-\beta)^T X^T\Sigma^{-1}X(b-\beta) + (\beta-\beta_0)^T\Sigma_\beta^{-1}(\beta-\beta_0)\right\}\right] \]

Completing the square

Completing the square involves the matrix analog to the univariate unknown mean and variance case. Consider the exponent (in braces)
\[ (n-p)s^2 + (b-\beta)^T X^T\Sigma^{-1}X(b-\beta) + (\beta-\beta_0)^T\Sigma_\beta^{-1}(\beta-\beta_0) \]
\[ = (n-p)s^2 + b^T X^T\Sigma^{-1}Xb - 2\beta^T X^T\Sigma^{-1}Xb + \beta^T X^T\Sigma^{-1}X\beta + \beta^T\Sigma_\beta^{-1}\beta - 2\beta^T\Sigma_\beta^{-1}\beta_0 + \beta_0^T\Sigma_\beta^{-1}\beta_0 \]
\[ = (n-p)s^2 + \beta^T\left(\Sigma_\beta^{-1} + X^T\Sigma^{-1}X\right)\beta - 2\beta^T\bar{V}^{-1}\bar{\beta} + b^T X^T\Sigma^{-1}Xb + \beta_0^T\Sigma_\beta^{-1}\beta_0 \]
\[ = (n-p)s^2 + \beta^T\bar{V}^{-1}\beta - 2\beta^T\bar{V}^{-1}\bar{\beta} + b^T X^T\Sigma^{-1}Xb + \beta_0^T\Sigma_\beta^{-1}\beta_0 \]
where
\[ \bar{\beta} = \bar{V}\left(\Sigma_\beta^{-1}\beta_0 + X^T\Sigma^{-1}Xb\right) \]
and \bar{V} = \left(\Sigma_\beta^{-1} + X^T\Sigma^{-1}X\right)^{-1}. Variation in \beta around \bar{\beta} is
\[ \left(\beta-\bar{\beta}\right)^T\bar{V}^{-1}\left(\beta-\bar{\beta}\right) = \beta^T\bar{V}^{-1}\beta - 2\beta^T\bar{V}^{-1}\bar{\beta} + \bar{\beta}^T\bar{V}^{-1}\bar{\beta} \]
The first two terms are identical to two terms in the posterior involving \beta (and there is apparently no recognizable kernel for \Sigma from these expressions). The joint posterior is
\[ p(\beta, \Sigma \mid y, X) \propto |\Lambda|^{\frac{\nu}{2}} |\Sigma|^{-\frac{\nu+n+p+1}{2}} \exp\left[-\frac{1}{2}tr\left(\Lambda\Sigma^{-1}\right)\right] \]
\[ \times\; \exp\left[-\frac{1}{2}\left\{(n-p)s^2 - \bar{\beta}^T\bar{V}^{-1}\bar{\beta} + b^T X^T\Sigma^{-1}Xb + \beta_0^T\Sigma_\beta^{-1}\beta_0 + \left(\beta-\bar{\beta}\right)^T\bar{V}^{-1}\left(\beta-\bar{\beta}\right)\right\}\right] \]
Gathering terms involving \beta, the conditional posterior for \beta given \Sigma is N\left(\bar{\beta}, \bar{V}\right), or
\[ p(\beta \mid \Sigma, y, X) \propto \exp\left[-\frac{1}{2}\left(\beta-\bar{\beta}\right)^T\bar{V}^{-1}\left(\beta-\bar{\beta}\right)\right] \]
Inverted-Wishart kernel

Now, we gather all terms involving \Sigma and write the conditional posterior for \Sigma.
\[ p(\Sigma \mid \beta, y, X) \propto |\Sigma|^{-\frac{\nu+n+p+1}{2}} \exp\left[-\frac{1}{2}\left\{tr\left(\Lambda\Sigma^{-1}\right) + (n-p)s^2 + (b-\beta)^T X^T\Sigma^{-1}X(b-\beta)\right\}\right] \]
\[ \propto |\Sigma|^{-\frac{\nu+n+p+1}{2}} \exp\left[-\frac{1}{2}\left\{tr\left(\Lambda\Sigma^{-1}\right) + (y-Xb)^T\Sigma^{-1}(y-Xb) + (b-\beta)^T X^T\Sigma^{-1}X(b-\beta)\right\}\right] \]
\[ \propto |\Sigma|^{-\frac{\nu+n+p+1}{2}} \exp\left[-\frac{1}{2}tr\left\{\left(\Lambda + (y-Xb)(y-Xb)^T + X(b-\beta)(b-\beta)^T X^T\right)\Sigma^{-1}\right\}\right] \]
That is, the conditional posterior is inverted-Wishart
\[ p(\Sigma \mid \beta, y) \propto |\Lambda_n|^{\frac{\nu+n}{2}} |\Sigma|^{-\frac{\nu+n+p+1}{2}} \exp\left[-\frac{1}{2}tr\left(\Lambda_n\Sigma^{-1}\right)\right] \]
where
\[ \Lambda_n = \Lambda + (y-Xb)(y-Xb)^T + X(b-\beta)(b-\beta)^T X^T \]
With conditional posteriors in hand, we can employ McMC strategies (namely, a Gibbs sampler) to draw inferences around the parameters of interest, \beta and \Sigma. That is, we sequentially draw \beta conditional on \Sigma and \Sigma, in turn, conditional on \beta. We discuss McMC strategies (both the Gibbs sampler and its generalization, the Metropolis-Hastings algorithm) later.
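A minimal Gibbs sketch of these sequential conditional draws. The prior settings (beta0, Vbeta, Lambda, nu) are hypothetical, and with an unrestricted n x n Sigma the sketch is illustrative only, since in practice Sigma would be parameterized more parsimoniously:

```python
import numpy as np
from scipy import stats

def gibbs_regression(y, X, beta0, Vbeta, Lambda, nu, n_draws=500, seed=0):
    """Alternate beta | Sigma ~ N(beta_bar, V_bar) and
    Sigma | beta ~ inverted Wishart(nu + n, Lambda_n)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    Vb_inv = np.linalg.inv(Vbeta)
    Sigma = np.eye(n)                                    # starting value
    betas = []
    for _ in range(n_draws):
        # draw beta | Sigma from N(beta_bar, V_bar)
        Si = np.linalg.inv(Sigma)
        V_bar = np.linalg.inv(Vb_inv + X.T @ Si @ X)
        beta_bar = V_bar @ (Vb_inv @ beta0 + X.T @ Si @ y)
        beta = rng.multivariate_normal(beta_bar, V_bar)
        # draw Sigma | beta from inverted Wishart(nu + n, Lambda_n) with
        # Lambda_n = Lambda + (y - Xb)(y - Xb)' + X(b - beta)(b - beta)'X'
        b = np.linalg.solve(X.T @ Si @ X, X.T @ Si @ y)  # GLS estimate given Sigma
        e = (y - X @ b)[:, None]
        d = (X @ (b - beta))[:, None]
        Lambda_n = Lambda + e @ e.T + d @ d.T
        Sigma = stats.invwishart(df=nu + n, scale=Lambda_n).rvs(random_state=rng)
        betas.append(beta)
    return np.array(betas)
```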
Uninformative priors

An uninformative joint prior is
\[ p(\beta, \Sigma) \propto |\Sigma|^{-\frac{n+1}{2}} \]
and the joint posterior is
\[ p(\beta, \Sigma \mid y, X) \propto |\Sigma|^{-\frac{n+1}{2}} |\Sigma|^{-\frac{1}{2}} \exp\left[-\frac{1}{2}\left\{(n-p)s^2 + (b-\beta)^T X^T\Sigma^{-1}X(b-\beta)\right\}\right] \propto |\Sigma|^{-\frac{n+2}{2}} \exp\left[-\frac{1}{2}tr\left(S(\beta)\Sigma^{-1}\right)\right] \]
where S(\beta) = (y-Xb)(y-Xb)^T + X(b-\beta)(b-\beta)^T X^T. The conditional posterior for \beta given \Sigma is N\left(b, \left(X^T\Sigma^{-1}X\right)^{-1}\right)
\[ p(\beta \mid \Sigma, y, X) \propto \exp\left[-\frac{1}{2}(\beta-b)^T X^T\Sigma^{-1}X(\beta-b)\right] \]
and the conditional posterior for \Sigma is inverted-Wishart
\[ p(\Sigma \mid \beta, y) \propto |\Lambda_n|^{\frac{n}{2}} |\Sigma|^{-\frac{n+2}{2}} \exp\left[-\frac{1}{2}tr\left(\Lambda_n\Sigma^{-1}\right)\right] \]
where
\[ \Lambda_n = (y-Xb)(y-Xb)^T + X(b-\beta)(b-\beta)^T X^T \]
As with informed priors, a Gibbs sampler (sequential draws from the conditional posteriors) can be employed to draw inferences for the uninformative prior case.
Next, we discuss posterior simulation, a convenient and flexible strategy
for drawing inference from the evidence and (conjugate) priors.
Appendix: conjugate families

Discrete data:

beta-binomial (focal parameter: p)
  prior: beta, p(p) \propto p^{a-1}(1-p)^{b-1}
  likelihood: binomial, \ell(p \mid y) \propto p^s(1-p)^{n-s}, s = \sum_{i=1}^n y_i
  posterior: beta, p(p \mid y) \propto p^{a+s-1}(1-p)^{b+n-s-1}

gamma-Poisson (focal parameter: \lambda)
  prior: gamma, p(\lambda) \propto \lambda^{a-1}e^{-b\lambda}
  likelihood: Poisson, \ell(\lambda \mid y) \propto \lambda^s e^{-n\lambda}, s = \sum_{i=1}^n y_i
  posterior: gamma, p(\lambda \mid y) \propto \lambda^{a+s-1}e^{-(b+n)\lambda}

gamma-exponential (focal parameter: \lambda)
  prior: gamma, p(\lambda) \propto \lambda^{a-1}e^{-b\lambda}
  likelihood: exponential, \ell(\lambda \mid y) \propto \lambda^n e^{-\lambda s}, s = \sum_{i=1}^n y_i
  posterior: gamma, p(\lambda \mid y) \propto \lambda^{a+n-1}e^{-(b+s)\lambda}

beta-negative binomial (focal parameter: p)
  prior: beta, p(p) \propto p^{a-1}(1-p)^{b-1}
  likelihood: negative binomial, \ell(p \mid y) \propto p^{nr}(1-p)^s, s = \sum_{i=1}^n y_i
  posterior: beta, p(p \mid y) \propto p^{a+nr-1}(1-p)^{b+s-1}

beta-binomial-hypergeometric (focal parameter: k, the unknown population success count; N is the known population size, n the known sample size, and x the known sample success count; \binom{n}{x} = \frac{n!}{x!(n-x)!})
  prior: beta-binomial, p(k) = \binom{N}{k}\frac{\Gamma(a+k)\,\Gamma(b+N-k)\,\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)\,\Gamma(a+b+N)}, k = 0, 1, 2, \ldots, N
  likelihood: hypergeometric (sampling without replacement), \ell(k \mid x) = \binom{k}{x}\binom{N-k}{n-x}\Big/\binom{N}{n}
  posterior: beta-binomial, p(k \mid x) = \binom{N-n}{k-x}\frac{\Gamma(a+k)\,\Gamma(b+N-k)\,\Gamma(a+b+n)}{\Gamma(a+x)\,\Gamma(b+n-x)\,\Gamma(a+b+N)}, k = x, x+1, \ldots, x+N-n

multinomial-Dirichlet (focal parameter: \theta, a vector)
  prior: Dirichlet, p(\theta) \propto \prod_{i=1}^K \theta_i^{a_i-1}
  likelihood: multinomial, \ell(\theta \mid y) \propto \theta_1^{s_1}\cdots\theta_K^{s_K}, s_i = \sum_j y_{ij}
  posterior: Dirichlet, p(\theta \mid y) \propto \prod_{i=1}^K \theta_i^{a_i+s_i-1}
Continuous data:

Pareto-uniform (focal parameter: w, the unknown upper bound; the lower bound is known)
  prior: Pareto, p(w) = \frac{a b^a}{w^{a+1}}
  likelihood: uniform, \ell(w \mid x) \propto \frac{1}{w^n}, w > \max(x_i)
  posterior: Pareto, p(w \mid x) = \frac{(a+n)\max[b, x_i]^{a+n}}{w^{a+n+1}}

Pareto-Pareto (focal parameter: \theta, the unknown precision; known shape)
  prior: Pareto, p(\theta) = \frac{a b^a}{\theta^{a+1}}, \theta > b
  likelihood: Pareto, \ell(\theta \mid x) \propto \theta^n, 0 < \theta < \min(x_i)
  posterior: Pareto, p(\theta \mid x) = \frac{(a-n)\, b^{a-n}}{\theta^{a-n+1}}, a > n, \theta > b

gamma-Pareto (focal parameter: \alpha, the unknown shape; \theta_0, the known precision)
  prior: gamma, p(\alpha) = \frac{\alpha^{a-1}e^{-\alpha/b}}{b^a\,\Gamma(a)}, \alpha > 0
  likelihood: Pareto, \ell(\alpha \mid x) \propto \frac{\alpha^n\,\theta_0^{n\alpha}}{m^{\alpha+1}}, m = \prod_{i=1}^n x_i
  posterior: gamma, p(\alpha \mid x) = \frac{\alpha^{a+n-1}e^{-\alpha/b'}}{(b')^{a+n}\,\Gamma(a+n)}, \frac{1}{b'} = \frac{1}{b} + \log\frac{m}{\theta_0^n}

inverse gamma-gamma (focal parameter: \theta, the unknown rate; \alpha, the known shape)
  prior: inverse gamma, p(\theta) = \frac{\theta^{-a-1}e^{-1/(b\theta)}}{\Gamma(a)\, b^a}
  likelihood: gamma, \ell(\theta \mid x) \propto \theta^{-n\alpha}e^{-s/\theta}, s = \sum_{i=1}^n x_i
  posterior: inverse gamma, p(\theta \mid x) = \frac{\theta^{-(a+n\alpha)-1}e^{-1/(b'\theta)}}{\Gamma(a+n\alpha)\,(b')^{a+n\alpha}}, b' = \frac{b}{1+bs}

conjugate prior-gamma (focal parameter: \alpha, the unknown shape; \beta, the known rate)
  prior: nonstandard, p(\alpha) \propto \frac{a^{\alpha-1}\beta^{\alpha c}}{\Gamma(\alpha)^b}, a, b, c > 0, \alpha > 0
  likelihood: gamma, \ell(\alpha \mid x) \propto \frac{m^{\alpha-1}\beta^{n\alpha}}{\Gamma(\alpha)^n}, m = \prod_{i=1}^n x_i, x_i > 0
  posterior: nonstandard, p(\alpha \mid x) \propto \frac{(am)^{\alpha-1}\beta^{\alpha(c+n)}}{\Gamma(\alpha)^{b+n}}
Continuous data (normal):

normal-normal (focal parameter: \mu; \sigma^2 known)
  prior: normal, p(\mu) \propto \exp\left[-\frac{(\mu-\mu_0)^2}{2\tau_0^2}\right], \tau_0^2 = \frac{\sigma^2}{\kappa_0}
  likelihood: normal, \ell(\mu \mid y) \propto \prod_{i=1}^n \exp\left[-\frac{(y_i-\mu)^2}{2\sigma^2}\right] = \exp\left[-\frac{ss}{2\sigma^2}\right], ss = \sum_{i=1}^n (y_i-\mu)^2
  posterior: normal, p(\mu \mid y) \propto \exp\left[-\frac{(\mu-\mu_n)^2}{2\tau_n^2}\right], \mu_n = \frac{\kappa_0\mu_0+n\bar{y}}{\kappa_0+n}, \tau_n^2 = \frac{\sigma^2}{\kappa_0+n}

inverse gamma-normal (focal parameter: \sigma^2; \mu known)
  prior: inverse gamma, p\left(\sigma^2\right) \propto \left(\sigma^2\right)^{-(a+1)}\exp\left[-\frac{b}{\sigma^2}\right]
  likelihood: normal, \ell\left(\sigma^2 \mid y\right) \propto \left(2\pi\sigma^2\right)^{-n/2}\exp\left[-\frac{ss}{2\sigma^2}\right], ss = \sum_{i=1}^n (y_i-\mu)^2
  posterior: inverse gamma, p\left(\sigma^2 \mid y\right) \propto \left(\sigma^2\right)^{-\left(\frac{n+2a}{2}+1\right)}\exp\left[-\frac{b+\frac{1}{2}ss}{\sigma^2}\right]

normal inverse gamma-normal (focal parameters: \mu, \sigma^2)
  prior: normal\left(\mu \mid \sigma^2\right)-inverse gamma\left(\sigma^2\right), p\left(\mu, \sigma^2\right) \propto \kappa_0^{1/2}\sigma^{-1}\exp\left[-\frac{\kappa_0(\mu-\mu_0)^2}{2\sigma^2}\right]\left(\sigma^2\right)^{-(a+1)}\exp\left[-\frac{b}{\sigma^2}\right]
  likelihood: normal, \ell\left(\mu, \sigma^2 \mid y\right) \propto \left(2\pi\sigma^2\right)^{-n/2}\exp\left[-\frac{ss}{2\sigma^2}\right], ss = \sum_{i=1}^n (y_i-\bar{y})^2
  joint posterior: normal\left(\mu \mid \sigma^2, y\right)-inverse gamma\left(\sigma^2 \mid y\right), \propto \sigma^{-1}\exp\left[-\frac{\kappa_0'(\mu-\mu_0')^2}{2\sigma^2}\right]\left(\sigma^2\right)^{-(a'+1)}\exp\left[-\frac{b'}{\sigma^2}\right]
  marginal posterior for \mu: Student t, p(\mu \mid y) \propto \left[1 + \frac{\kappa_0'(\mu-\mu_0')^2}{2b'}\right]^{-\frac{2a'+1}{2}}
  marginal posterior for \sigma^2: inverse gamma, p\left(\sigma^2 \mid y\right) \propto \left(\sigma^2\right)^{-(a'+1)}\exp\left[-\frac{b'}{\sigma^2}\right]
  where a' = a + \frac{n}{2}, \kappa_0' = \kappa_0 + n, \mu_0' = \frac{\kappa_0\mu_0+n\bar{y}}{\kappa_0+n}, and b' = b + \frac{1}{2}ss + \frac{\kappa_0 n(\bar{y}-\mu_0)^2}{2(\kappa_0+n)}
Continuous data (continued):

bilateral bivariate Pareto-uniform (focal parameters: l, u, the unknown lower and upper bounds)
  prior: bilateral bivariate Pareto, p(l, u) \propto \frac{a(a+1)(r_2-r_1)^a}{(u-l)^{a+2}}, l < r_1, u > r_2
  likelihood: uniform, \ell(l, u \mid x) \propto \frac{1}{(u-l)^n}
  posterior: bilateral bivariate Pareto, p(l, u \mid x) \propto \frac{(a+n)(a+n+1)\left(r_2'-r_1'\right)^{a+n}}{(u-l)^{a+n+2}}, r_1' = \min(r_1, x_i), r_2' = \max(r_2, x_i)

normal-lognormal (focal parameter: \mu; \sigma^2 known)
  prior: normal, p(\mu) \propto \exp\left[-\frac{(\mu-\mu_0)^2}{2\tau_0^2}\right], \tau_0^2 = \frac{\sigma^2}{\kappa_0}
  likelihood: lognormal, \ell(\mu \mid y) \propto \prod_{i=1}^n \exp\left[-\frac{(\log y_i-\mu)^2}{2\sigma^2}\right] = \exp\left[-\frac{lss}{2\sigma^2}\right]
  posterior: normal, p(\mu \mid y) \propto \exp\left[-\frac{(\mu-\mu_n)^2}{2\tau_n^2}\right], \mu_n = \frac{\kappa_0\mu_0+\sum_{i=1}^n \log y_i}{\kappa_0+n}, \tau_n^2 = \frac{\sigma^2}{\kappa_0+n}

inverse gamma-lognormal (focal parameter: \sigma^2; \mu known)
  prior: inverse gamma, p\left(\sigma^2\right) \propto \left(\sigma^2\right)^{-(a+1)}\exp\left[-\frac{b}{\sigma^2}\right]
  likelihood: lognormal, \ell\left(\sigma^2 \mid y\right) \propto \left(2\pi\sigma^2\right)^{-n/2}\exp\left[-\frac{lss}{2\sigma^2}\right], lss = \sum_{i=1}^n (\log y_i-\mu)^2
  posterior: inverse gamma, p\left(\sigma^2 \mid y\right) \propto \left(\sigma^2\right)^{-\left(\frac{n+2a}{2}+1\right)}\exp\left[-\frac{b+\frac{1}{2}lss}{\sigma^2}\right]
Continuous data: multivariate normal, normal inverted Wishart-multivariate normal (focal parameters: \mu, \Sigma)
  prior: multivariate normal\left(\mu \mid \Sigma\right) times inverted Wishart\left(\Sigma\right),
  p(\mu, \Sigma) \propto |\Sigma|^{-\frac{1}{2}}\exp\left[-\frac{\kappa_0}{2}(\mu-\mu_0)^T\Sigma^{-1}(\mu-\mu_0)\right]|\Lambda|^{\frac{\nu}{2}}|\Sigma|^{-\frac{\nu+k+1}{2}}\exp\left[-\frac{tr\left(\Lambda\Sigma^{-1}\right)}{2}\right]
  likelihood: multivariate normal, \ell(\mu, \Sigma \mid y) \propto |\Sigma|^{-\frac{n}{2}}\exp\left[-\frac{1}{2}\left\{(n-1)s^2 + n(\bar{y}-\mu)^T\Sigma^{-1}(\bar{y}-\mu)\right\}\right]
  joint posterior: multivariate normal\left(\mu \mid \Sigma, y\right) times inverted Wishart\left(\Sigma \mid y\right),
  p(\mu, \Sigma \mid y) \propto |\Sigma|^{-\frac{1}{2}}\exp\left[-\frac{\kappa_0+n}{2}(\mu-\mu_n)^T\Sigma^{-1}(\mu-\mu_n)\right]|\Sigma|^{-\frac{\nu+n+k+1}{2}}\exp\left[-\frac{tr\left(\Lambda_n\Sigma^{-1}\right)}{2}\right]
  marginal posterior for \mu: multivariate Student t, p(\mu \mid y) \propto \left[1 + (\kappa_0+n)(\mu-\mu_n)^T\Lambda_n^{-1}(\mu-\mu_n)\right]^{-\frac{1}{2}(\nu+n+1)}
  marginal posterior for \Sigma: inverted Wishart, p(\Sigma \mid y) \propto |\Sigma|^{-\frac{\nu+n+k+1}{2}}\exp\left[-\frac{tr\left(\Lambda_n\Sigma^{-1}\right)}{2}\right]
  where \mu_n = \frac{\kappa_0\mu_0+n\bar{y}}{\kappa_0+n}, s^2 = \frac{1}{n-1}\sum_{i=1}^n (y_i-\bar{y})^T\Sigma^{-1}(y_i-\bar{y}), and
  \Lambda_n = \Lambda + \sum_{i=1}^n (y_i-\bar{y})(y_i-\bar{y})^T + \frac{\kappa_0 n}{\kappa_0+n}(\mu_0-\bar{y})(\mu_0-\bar{y})^T
Continuous data: linear regression, normal inverse chi square-normal (focal parameters: \beta, \sigma^2)
  prior: normal\left(\beta \mid \sigma^2\right) times inverse chi square\left(\sigma^2\right),
  p\left(\beta, \sigma^2\right) \propto \sigma^{-p}\exp\left[-\frac{1}{2\sigma^2}(\beta-\beta_0)^T\Lambda_0(\beta-\beta_0)\right]\left(\sigma^2\right)^{-\left(\frac{\nu_0}{2}+1\right)}\exp\left[-\frac{\nu_0\sigma_0^2}{2\sigma^2}\right]
  likelihood: normal, \ell\left(\beta, \sigma^2 \mid y, X\right) \propto \sigma^{-n}\exp\left[-\frac{1}{2\sigma^2}\left\{e^T e + (\beta-b)^T X^T X(\beta-b)\right\}\right]
  joint posterior: normal\left(\beta \mid \sigma^2, y, X\right) times inverse chi square\left(\sigma^2 \mid y, X\right),
  p\left(\beta, \sigma^2 \mid y, X\right) \propto \sigma^{-p}\exp\left[-\frac{1}{2\sigma^2}\left(\beta-\bar{\beta}_n\right)^T\Lambda_n\left(\beta-\bar{\beta}_n\right)\right]\left(\sigma^2\right)^{-\left[\frac{\nu_0+n}{2}+1\right]}\exp\left[-\frac{\nu_n\sigma_n^2}{2\sigma^2}\right]
  marginal posterior for \beta: Student t\left(\beta \mid y, X\right), \propto \left[1 + \frac{\left(\beta-\bar{\beta}_n\right)^T\Lambda_n\left(\beta-\bar{\beta}_n\right)}{\nu_n\sigma_n^2}\right]^{-\frac{\nu_0+n+p}{2}}
  marginal posterior for \sigma^2: inverse chi square, \propto \left(\sigma^2\right)^{-\left[\frac{\nu_0+n}{2}+1\right]}\exp\left[-\frac{\nu_n\sigma_n^2}{2\sigma^2}\right]
  where \bar{\beta}_n = \left(\Lambda_0+X^T X\right)^{-1}\left(\Lambda_0\beta_0+X^T Xb\right), b = \left(X^T X\right)^{-1}X^T y, \Lambda_n = \Lambda_0+X^T X,
  \nu_n\sigma_n^2 = \nu_0\sigma_0^2 + e^T e + \beta_0^T\Lambda_0\beta_0 + b^T X^T Xb - \bar{\beta}_n^T\Lambda_n\bar{\beta}_n, and \nu_n = \nu_0+n
6. Conjugate families
continuous data:
linear regression
with general variance
normal inverted Wishart-normal
,
prior (, )
T
exp 12 ( 0 ) 1
( 0 )
normal ( | )
inverted Wishart ()
normal likelihood (, | y, X)
normal
conditional posterior
normal p ( | , y, X)
inverted Wishart ( | , y, X)
where
s2 =
1
np
+p+1
2
|| ||
n
2
||
tr (1 )
exp
2
T
exp 12 (n p) s2 + ( b) X T 1 X ( b)
exp 12 V1
||
+n
2
+n+p+1
2
||
tr (n 1 )
exp
2
(y Xb) 1 (y Xb) ,
1
T 1
V = 1
+
X
X
,
1 T 1
b = X T 1 X
X y,
1
T 1
T 1
= 1
+
X
+
X
Xb
,
0
T
n = + (y Xb) (y Xb) + b X T X b ,
1
and = X0T 1
0 X0