You are on page 1of 82

Solutions to Selected Problems In:

Detection, Estimation, and Modulation Theory: Part I


by Harry L. Van Trees
John L. Weatherwax

October 1, 2013
Introduction
Here youll nd some notes that I wrote up as I worked through this excellent book. Ive
worked hard to make these notes as good as I can, but I have no illusions that they are perfect.
If you feel that that there is a better way to accomplish or explain an exercise or derivation
presented in these notes; or that one or more of the explanations is unclear, incomplete,
or misleading, please tell me. If you nd an error of any kind technical, grammatical,
typographical, whatever please tell me that, too. Ill gladly add to the acknowledgments
in later printings the name of the rst person to bring each problem to my attention.
Special thanks (most recent comments are listed rst) to Jeong-Min Choi and Hemant Saggar
for their corrections involving chapter 2.
All comments (no matter how small) are much appreciated. In fact, if you nd these notes
useful I would appreciate a contribution in the form of a solution to a problem that is not yet
worked in these notes. Sort of a take a penny, leave a penny type of approach. Remember:
pay it forward.

wax@alum.mit.edu
1
Chapter 2 (Classical Detection and Estimation Theory)
Notes On The Text
Notes on the Bayes Criterion
Given the books Eq. 8 we have
R = P
0
C
00
_
Z
0
p(R|H
0
)dR+ P
0
C
10
_
ZZ
0
p(R|H
0
)dR
= P
1
C
01
_
Z
0
p(R|H
1
)dR+ P
1
C
11
_
ZZ
0
p(R|H
1
)dR. (1)
We can use _
Z
0
p(R|H
0
)dR +
_
ZZ
0
p(R|H
0
)dR = 1 ,
to replace all integrals over Z Z
0
with (one minus) integrals over Z
0
. We get
R = P
0
C
00
_
Z
0
p(R|H
0
)dR + P
0
C
10
_
1
_
Z
0
p(R|H
0
)dR
_
= P
1
C
01
_
Z
0
p(R|H
1
)dR + P
1
C
11
_
1
_
Z
0
p(R|H
1
)dR
_
= P
0
C
10
+ P
1
C
11
+
_
Z
0
{P
1
(C
01
C
11
)p(R|H
1
) P
0
(C
10
C
00
)p(R|H
0
)} dR (2)
If we introduce the probability of false alarm P
F
, the probability of detection P
D
, and the
probability of a miss P
M
, as dened in the book, we nd that Rgiven via Equation 1 becomes
when we use
_
Z
0
p(R|H
0
)dR +
_
Z
1
p(R|H
0
)dR =
_
Z
0
p(R|H
0
)dR + P
F
= 1
R = P
0
C
10
+ P
1
C
11
+ P
1
(C
01
C
11
)P
M
P
0
(C
10
C
00
)(1 P
F
) . (3)
Since P
0
= 1 P
1
we can consider R computed in Equation 3 as a function of the prior
probability P
1
with the following manipulations
R(P
1
) = (1 P
1
)C
10
+ P
1
C
11
+ P
1
(C
01
C
11
)P
M
(1 P
1
)(C
10
C
00
)(1 P
F
)
= C
10
(C
10
C
00
)(1 P
F
) + P
1
[C
10
+ C
11
+ (C
01
C
11
)P
M
+ (C
10
C
00
)(1 P
F
)]
= C
00
+ (C
10
C
00
)P
F
+ P
1
[C
11
C
00
+ (C
01
C
11
)P
M
(C
10
C
00
)P
F
]
= C
00
(1 P
F
) + C
10
P
F
+ P
1
[(C
11
C
10
) + (C
01
C
11
)P
M
(C
10
C
00
)P
F
] . (4)
Recall that for the Bayes decision test our decision regions Z
0
and Z
1
are determined via
(R) >
P
0
(C
10
C
00
)
P
1
(C
01
C
11
)
,
for H
1
. Thus if P
0
changes the decision regions Z
0
and Z
1
change (via the above expression)
and thus both P
F
and P
M
change since they depend on Z
0
and Z
1
. Lets assume that we
specify a decision boundary that then denes classication regions Z
0
and Z
1
. These
decision regions correspond to a specic value of P
1
denoted via P

1
. Note that P

1
is not the
true prior probability of the class H
1
but is simply an equivalent prior probability that one
could use in the likelihood ratio test to obtain the same decision regions Z
0
and Z
1
. The
book denotes R
B
(P
1
) to be the expression given via Equation 4 where P
F
and P
M
changes
in concert with P
1
. The book denotes R
F
(P
1
) to be the expression given by Equation 4 but
where P
F
and P
M
are xed and are held constant as we change the value of P
1
.
In the case where we do not x P
F
and P
M
we can evaluate R
B
(P
1
) at its two end points of
P
1
. If P
1
= 0 then from Equation 4 R
B
(0) = C
00
(1 P
F
) + C
10
P
F
. When we have P
1
= 0
then we see that

P
0
(C
10
C
00
)
P
1
(C
01
C
11
)
+,
for all R the function (R) is always less than , and all classications are H
0
. Thus Z
1
is
the empty set and P
F
= 0. Thus we get R
B
(0) = C
00
.
The other extreme is when P
1
= 1. In that case P
0
= 0 so = 0 and we would have that
(R) > 0 for all R implying that all points are classied as H
1
. This implies that P
M
= 0.
The expression for R
B
(P
1
) from Equation 4 is given by
R
B
(1) = C
00
(1 P
F
) + C
10
P
F
+ (C
11
C
00
) (C
10
C
00
)P
F
= C
11
,
when we simplify. These to values for R(0) and R(1) give the end point conditions on
R
B
(P
1
) seen in Figure 2.7. If we do not know the value of P
1
then one can still design a
hypothesis test by specifying values of P
M
and P
F
such that the coecient of P
1
in the
expression for R
F
(P
1
) in Equation 4 vanishes. The idea behind this procedure is that this
will make R
F
a horizontal line for all values of P
1
which is an upper bound on the Bayes
risk. To make the coecient of P
1
vanish requires that
(C
11
C
00
) + (C
01
C
11
)P
M
(C
10
C
00
)P
F
= 0 . (5)
This is also known as the minimax test. The decision threshold can be introduced into the
denitions of P
F
and P
M
and the above gives an equation that can be used to determine its
value. If we take C
00
= C
11
= 0 and introduce the shorthands C
01
= C
M
(the cost of a miss)
and C
01
= C
F
(the cost of a false alarm) so we get our constant minimax risk of
R
F
= C
F
P
F
+ P
1
[C
M
P
M
C
F
P
F
] = P
0
C
F
P
F
+ P
1
C
M
P
M
. (6)
Receiver operating characteristics: Example 1 (Gaussians with dierent means)
Under H
1
each sample R
i
can be written as R
i
= m+n
i
with n
i
a Gaussian random variable
(with mean 0 and variance
2
). Thus R
i
N(m,
2
). The statistic l which is the sum of
individual random variables is also normal. The mean of l is given by the sum of the N
means (multiplied by the scaling factor
1

N
) or
_
1

N
_
(Nm) =

m,
and a variance given by the sum of the N variances (multiplied by the square of the scaling
factor
1

N
) or
_
1
N
2
_
N
2
= 1 .
These two arguments have shown that l N
_

m, 1
_
.
We now derive Pr() for the case where we have measurements from Gaussians with dierent
means (Example 1). To do that we need to note the following symmetry identity about
erfc

(X) function. We have that


erfc

(X)
_

X
1

2
exp
_

x
2
2
_
dx
= 1
_
X

2
exp
_

x
2
2
_
dx = 1
_

X
1

2
exp
_

x
2
2
_
dx
= 1 erfc

(X) . (7)
Then using this result we can derive the given expression for Pr() under the case when
P
0
= P
1
=
1
2
and = 1. We have
Pr() =
1
2
(P
F
+ P
M
) since = 1 this becomes
=
1
2
_
erfc

_
d
2
_
+ 1 P
D
_
=
1
2
_
erfc

_
d
2
_
+ 1 erfc

d
2
__
=
1
2
_
erfc

_
d
2
_
+ 1
_
1 erfc

_
d
2
___
= erfc

_
d
2
_
, (8)
the expression given in the book.
Receiver operating characteristics: Example 2 (Gaussians with
2
0
=
2
1
)
Following the arguments in the book we end up wanting to evaluate the expression P
F
=
Pr(r
2
1
+ r
2
2
|H
0
). By denition this is just the integral over the region of r
1
r
2
space
where r
2
1
+ r
2
2
hold true. This is
P
F
=
_
2
=0
_

Z=

p(R
1
|H
0
)P(R
2
|H
0
)dR
1
dR
2
.
When we put in the expressions for p(R
1
|H
0
) and P(R
2
|H
0
) we see why converting to polar
coordinates is helpful. We have
P
F
=
_
2
=0
_

Z=

_
1
_
2
2
0
e

1
2
R
2
1

2
0
__
1
_
2
2
0
e

1
2
R
2
2

2
0
_
dR
1
dR
2
=
1
2
2
0
_
2
=0
_

Z=

exp
_

1
2
R
2
1
+ R
2
2

2
0
_
dR
1
dR
2
.
When we change to polar we have the dierential of area change via dR
1
dR
2
= ZddZ and
thus get for P
F
the following
P
F
=
1
2
2
0
_
2
=0
_

Z=

Z exp
_

1
2
Z
2

2
0
_
ddZ =
1

2
0
_

Z=

Z exp
_

1
2
Z
2

2
0
_
dZ .
If we let v =
Z
2
2
2
0
then dv =
Z

2
0
dZ so P
F
becomes
P
F
=
1

2
0
_

2
2
0

2
0
e
v
dv = e
v

2
2
0
= e


2
2
0
, (9)
the expression for P
F
given in the book. For P
D
the only thing that changes in the calculation
is that the normal has a variance of
2
1
rather than
2
0
. Making this change gives the
expression for P
D
in the book.
We can compute the ROC curve for this example by writing = 2
2
0
ln(P
F
) and putting
this into the expression for P
D
. Where we nd
P
D
= exp
_


2
2
1
_
= exp
_

2
0

2
1
ln(P
F
)
_
.
This gives
ln(P
D
) =

2
0

2
1
ln(P
F
) or P
D
= P

2
0

2
1
F
. (10)
From Equation 10 if

2
1

2
0
increases then

2
0

2
1
decreases, thus P

2
0
/
2
1
F
get larger (since P
F
is less
than 1). This in tern makes P
D
gets larger.
Notes on properties of ROC curves
Recall that a randomized rule applied between thresholds at two points on the ROC curve
(say A and B) allows a system designer to obtain (P
F
, P
D
) performance for all points on
the straight line between A and B. This comment allows one to argue that a ROC curve
must be concave down. For if the ROC curve were concave up, then by using a randomized
rule this linear approximation would have better performance than the ROC curve. Since
we know that the ROC curve expresses the optimal performance characteristics this is a
contradiction.
Notes on the M hypothesis decision problem
In this section we derive (with more detail) some of the results presented in the book in the
case where there are at total of M hypothesis to choose from. We start with the denition
of the Bayes risk R or
R =
M1

j=0
_
P
j
M1

i=0
C
ij
P(choose i|j is true)
_
=
M1

j=0
_
P
j
M1

i=0
C
ij
_
Z
i
p(R|H
j
)dR
_
=
M1

i=0
M1

j=0
P
j
C
ij
_
Z
i
p(R|H
j
)dR. (11)
Lets consider the case where there are three classes M = 3 and expand our R where we
replace the integration region over the correct regions with their complement in terms of
Z i.e.
R = P
0
C
00
_
Z
0
=ZZ
1
Z
2
p(R|H
0
)dR + P
0
C
10
_
Z
1
p(R|H
0
)dR + P
0
C
20
_
Z
2
p(R|H
0
)dR
= P
1
C
01
_
Z
0
p(R|H
1
)dR + P
1
C
11
_
Z
1
=ZZ
0
Z
2
p(R|H
1
)dR + P
1
C
21
_
Z
2
p(R|H
1
)dR
= P
2
C
02
_
Z
0
p(R|H
2
)dR + P
2
C
12
_
Z
1
p(R|H
2
)dR + P
2
C
22
_
Z
2
=ZZ
0
Z
1
p(R|H
2
)dR. (12)
If we then simplify by breaking the intervals over segments into their component pieces for
example we simplify the rst integral above as
_
ZZ
1
Z
2
p(R|H
0
)dR = 1
_
Z
1
p(R|H
0
)dR
_
Z
2
p(R|H
0
)dR.
Doing this in three places gives us
R = P
0
C
00
+ P
1
C
11
+ P
2
C
22
+
_
Z
0
{P
1
(C
01
C
11
)p(R|H
1
) + P
2
(C
02
C
22
)p(R|H
2
)}dR
+
_
Z
1
{P
0
(C
00
+ C
10
)p(R|H
0
) + P
2
(C
12
C
22
)p(R|H
2
)}dR
+
_
Z
2
{P
0
(C
00
+ C
20
)p(R|H
0
) + P
2
(C
11
+ C
21
)p(R|H
1
)}dR
= P
0
C
00
+ P
1
C
11
+ P
2
C
22
+
_
Z
0
{P
2
(C
02
C
22
)p(R|H
2
) + P
1
(C
01
C
11
)p(R|H
1
)}dR
+
_
Z
1
{P
0
(C
10
C
00
)p(R|H
0
) + P
2
(C
12
C
22
)p(R|H
2
)}dR
+
_
Z
2
{P
0
(C
20
C
00
)p(R|H
0
) + P
1
(C
21
C
11
)p(R|H
1
)}dR. (13)
If we dene the integrands of the above integrals at the point R as I
1
, I
2
, and I
3
such that
I
0
(R) = P
2
(C
02
C
22
)p(R|H
2
) + P
1
(C
01
C
11
)p(R|H
1
)
I
1
(R) = P
0
(C
10
C
00
)p(R|H
0
) + P
2
(C
12
C
22
)p(R|H
2
) (14)
I
2
(R) = P
0
(C
20
C
00
)p(R|H
0
) + P
1
(C
21
C
11
)p(R|H
1
) .
Then the optimal decision is made based on the relative magnitude of I
i
(R). For example,
our decision rule should be
if I
0
(R) min(I
1
(R), I
2
(R)) decide 0
if I
1
(R) min(I
0
(R), I
2
(R)) decide 1 (15)
if I
2
(R) min(I
0
(R), I
1
(R)) decide 2 .
Based on the results above it seems that for a general M decision hypothesis Bayes test we
can write the risk as
R =
M1

i=0
P
i
C
ii
+
M1

i=0
_
Z
i
_
M1

j=0;j=i
P
j
(C
ij
C
jj
)p(R|H
j
)
_
dR.
Note that in the above expression the rst term,

M1
i=0
P
i
C
ii
, is a xed cost and cannot be
changed regardless of the decision region selected. The second term in R or
M1

i=0
_
Z
i
_
M1

j=0;j=i
P
j
(C
ij
C
jj
)p(R|H
j
)
_
dR,
is the average cost accumulated when we incorrectly assign a sample to the regions i =
0, 1, 2, , M 1. Thus we should dene Z
i
to be the points R such that the integrand
evaluated at that R is smaller than all other possible integrand. Thus if we dene
I
i
(R)
M1

j=0;j=i
P
j
(C
ij
C
jj
)p(R|H
j
) ,
a point R will be classied as from Z
i
if it has I
i
(R) the smallest from all possible values of
I

(R). That is we classify R as from H


i
when
I
i
(R) min(I
j
(R)) for 0 j M 1 . (16)
The book presents this decision region as the equations for the three class case M = 3 as
P
1
(C
01
C
11
)
1
(R) > P
0
(C
10
C
00
) + P
2
(C
12
C
02
)
2
(R) then H
1
or H
2
(17)
P
2
(C
02
C
22
)
2
(R) > P
0
(C
20
C
00
) + P
1
(C
21
C
01
)
1
(R) then H
2
or H
1
(18)
P
2
(C
12
C
22
)
2
(R) > P
0
(C
20
C
10
) + P
1
(C
21
C
11
)
1
(R) then H
2
or H
0
. (19)
We can determine the decision regions in
1
and
2
space when we replace all inequalities
with equalities. In that case each of the above equalities would then be a line. We can then
solve for the values of
1
and
2
that determine the intersection point by solving these three
equations for
1
and
2
. In addition, the linear decision regions in
1
and
2
space can be
plotted by taking each inequality as an equality and plotting the given lines.
Notes on a degenerate test
For a three class classication problem with cost assignments given by
C
12
= C
21
= 0 (20)
C
01
= C
10
= C
20
= C
02
= C (21)
C
00
= C
11
= C
22
= 0 , (22)
when we use Equation 17 we get
if P
1
C
1
(R) > P
0
C + P
2
(C)
2
(R) then H
1
or H
2
else H
0
or H
2
,
While Equation 18 gives
if P
2
C
2
(R) > P
0
C + P
1
(C)
1
(R) then H
2
or H
1
else H
0
or H
1
,
If we divide both of these by C we see that they are equivalent to
if P
1

1
(R) + P
2

2
(R) > P
0
then H
1
or H
2
else H
0
or H
2
if P
1

1
(R) + P
2

2
(R) > P
0
then H
2
or H
1
else H
0
or H
1
.
These two expressions combine to give the single expression
if P
1

1
(R) + P
2

2
(R) > P
0
then H
1
or H
2
else H
0
.
Notes on a dummy hypothesis test
We take P
0
= 0 and then P
1
+ P
2
= 1 with C
12
= C
02
and C
21
= C
01
. Then when we put
these simplications into the M dimensional decision problem we get
P
1
(C
21
C
11
)
1
(R) > 0
P
2
(C
12
C
22
)
2
(R) > 0
P
2
(C
12
C
22
)
2
(R) > P
1
(C
21
C
11
) . (23)
The rst two equations state that we should pick H
1
or H
2
depending on the magnitudes of
the costs.
Notes on Random Parameters: Bayes Estimation with a Uniform Cost
We start with the denition risk given a cost function C() and a method at estimating A
(i.e. the function a(R)) given by
R
uniform
=
_

dRp(R)
_

dAC(A a(R))p(A|R) ,
where C() is the uniform cost which is 1 except in a window of size centered around
where C() is zero. That is
_

dAC(A a(R))p(A|R) = 1
_
|A a(R)|

2
dAp(A|R)
= 1
_
a(R)+

2
a(R)

2
p(A|R)dA, (24)
which is used to derive the equation for R
uniform
in the book.
Notes on Estimation Theory: Example 2
The expression for p(A|R) via the books equation 141 follow from the arguments given in
the book. Once that expression is accepted we can manipulate it by rst writing it as
p(A|R) = k

(R) exp
_

1
2
_
1

2
n
N

i=1
R
2
i

2A

2
n
N

i=1
R
i
+
NA
2

2
n
+
A
2

2
a
__
.
Note that the coecient of A
2
in the above is
N

2
n
+
1

2
a
. If we dene

2
p

_
1

2
a
+
N

2
n
_
1
=
_

2
n
+ N
2
a

2
a

2
n
_
1
=

2
a

2
n
N
2
a
+
2
n
,
then the expression for p(A|R) becomes
p(A|R) = k

(R) exp
_

1
2
2
n
N

i=1
R
2
i
_
exp
_

1
2
_

2A

2
n
N

i=1
R
i
+
A
2

2
p
__
= k

(R) exp
_

1
2
2
n
N

i=1
R
2
i
_
exp
_

1
2
2
p
_
A
2

2
2
p

2
n
_
N

i=1
R
i
_
A
__
= k

(R) exp
_

1
2
2
n
N

i=1
R
2
i
_
exp
_
_
_

1
2
2
p
_
_
A
2

2
2
p

2
n
_
N

i=1
R
i
_
A +
_

2
p

N
i=1
R
i

2
n
_
2

2
p

N
i=1
R
i

2
n
_
2
_
_
_
_
_
= k

(R) exp
_

1
2
2
n
N

i=1
R
2
i
_
exp
_
_
_

1
2
2
p
_
A

2
p

N
i=1
R
i

2
n
_
2
_
_
_
exp
_
_
_

1
2
2
p
_
_

2
p

N
i=1
R
i

2
n
_
2
_
_
_
_
_
= k

(R) exp
_
_
_

1
2
2
p
_
A

2
p

N
i=1
R
i

2
n
_
2
_
_
_
.
Note that the mean value of the above density can be observed by inspection where we have
a
ms
(R) =

2
p

2
n
N

i=1
R
i
=
_

2
a

2
a
+

2
n
N
__
1
N
N

i=1
R
i
_
. (25)
Notes on the optimality of the mean-square estimator
The next section of the text will answer the question, about what is the risk if we use a
dierent estimator, say a, rather than the one that we argue is optimal or a
ms
. We start
with the denition of the Bayes risk in using a or
R
B
( a|R) = E
a
[C( a a)|R] =
_

C( a a)p(a|R)da
If we write this in terms of a
ms
with z a a
ms
= a E[a|R] we have a = z + a
ms
and
since p(a|R) = p(z|R) i.e. that the densities of a and z are the same the above is given by
_

C( a a
ms
z)p(z|R)dz . (26)
If p(z|R) = p(z|R) the above is equal to
_

C( a a
ms
+ z)p(z|R)dz . (27)
If the cost function is symmetric C(a

) = C(a

) the above is equal to


_

C( a
ms
a z)p(z|R)dz . (28)
Again using symmetry of the a posteriori density p(z|R) we get
_

C( a
ms
a + z)p(z|R)dz . (29)
We now add 1/2 of Equation 27 and 1/2 of Equation 29 and the convexity of C to get
R
B
( a|R) =
1
2
E
z
[C(z + a a
ms
)|R] +
1
2
E
z
[C(z + a
ms
a)|R]
= E
z
_
1
2
C(z + ( a
ms
a)) +
1
2
C(z ( a
ms
a))

R
_
E
z
_
C
_
1
2
(z + ( a
ms
a)) +
1
2
(z ( a
ms
a))
_

R
_
= E
z
[C(z)|R] = R
B
( a
ms
|R) . (30)
While all of this manipulations may seem complicated my feeling that the take away from
this is that the risk of using any estimator a = a
ms
will be larger (or worse) than using a
ms
when the cost function is convex. This is a strong argument for using the mean-squared cost
function above others.
Notes on Estimation Theory: Example 3: a nonlinear dependence on a
Now the books equation 137 used to compute the MAP estimate is
l(A)
A

A= a(R)
=
ln(p
r|a
(R|A))
A

A= a(R)
+
ln(p
a
(A))
A

A= a(R)
= 0 . (31)
This is equivalent to nding a(R) such that
ln(p
a|r
(A|R))
A

A= a(R)
= 0 . (32)
For this example from the functional form for p
a|r
(A|R) (now containing a nonlinear function
in a) in we have our MAP estimate given by solving the following
ln(p
a|r
(A|R))
A
=

A
_
ln(k(R))
1
2
1

2
n
N

i=1
[R
i
s(A)]
2

1
2
A
2

2
a
_
=
1

2
n
N

i=1
[R
i
s(A)]
_

ds(A)
dA
_

2
a
= 0 ,
equation for A. When we do this (and calling the solution a
map
(R)) we have
a
map
(R) =

2
a

2
n
_
N

i=1
[R
i
s(A)]
_
s(A)
A

A= amap(R)
(33)
which is the books equation 161.
Notes on Estimation Theory: Example 4
When the parameter A has a exponential distribution
p
a
(A) =
_
e
A
A > 0
0 otherwise
,
and the likelihood is given by a Poisson distribution the posteriori distribution looks like
p
a|n
(A|N) =
Pr(n = N|a = A)p
a
(A)
Pr(n = N)
=
1
Pr(n = N)
_
A
N
N!
e
A
_
e
A
= k(N)A
N
exp((1 + )A) . (34)
To nd k(N) such that this density integrate to one we need to evaluate
k(N)
_

0
A
N
exp((1 + )A)dA.
To do so let v = (1 + )A so dv = (1 + )dA to get
k(N)
_

0
_
v
1 +
_
N
e
v
dv
1 +
=
k(N)
(1 + )
N+1
_

0
v
N
e
v
dv =
k(N)N!
(1 + )
N+1
.
To make this equal one we need that
k(N) =
(1 + )
N+1
N!
. (35)
Now that we have the expression for k(N) we can evaluate a
ms
(N). We nd
a
ms
(N)
_

0
Ap(A|N)dA
=
(1 + )
N+1
N!
_

0
A
N+1
e
(1+)A
dA
=
(1 + )
N+1
N!

1
(1 + )
N+2
_

0
v
N+1
e
v
dv
=
N + 1
+ 1
. (36)
To evaluate a
map
(N) we rst note from Equation 34 that
ln(p(A|N)) = N ln(A) A(1 + ) + ln(k(N)) ,
so that setting the rst derivative equal to zero we get
ln(p(A|N))
A
=
N
A
(1 + ) = 0 .
Solving for A we get a
map
(N)
a
map
(N) =
N
1 +
. (37)
Nonrandom Parameter Estimation: The expression Var[ a(R) A]
We can compute an equivalent representation of Var[ a(R) A] using its denition when A
is non-random as (but R is due to measurement noise) and E[ a(A)] = A+ B(A) as
Var[ a(R) A] = E[( a(R) A E[ a(R) A])
2
]
= E[( a(R) A E[ a(R)] + A)
2
]
= E[( a(R) A B(A))
2
]
= E[( a(R) A)
2
] 2E[ a(R) A]B(A) + B(A)
2
.
Now the expectation of the second term is given by
E[ a(R) A] = E[ a(R)] A
= A+ B(A) A = B(A) ,
so using this the above becomes
Var[ a(R) A] = E[( a(R) A)
2
] 2B(A)
2
+ B(A)
2
= E[( a(R) A)
2
] B(A)
2
. (38)
which is equation 173 in the book.
Notes on The Cramer-Rao Inequality Derivation
The Schwarz inequality is
_
f(x)g(x)dx

_
f(x)g(x)dx

__
f(x)
2
dx
_
1/2
__
g(x)
2
dx
_
1/2
. (39)
We will have equality if f(x) = kg(x). If we take for f and g the functions
f(R)
ln(p
r|a
(R|A))
A
_
p
r|a
(R|A)
g(R)
_
p
r|a
(R|A)( a(R) R) ,
as the component functions in the Schwarz inequality then we nd a right-hand-side (RHS)
of this inequality given by
RHS =
_
_ _
ln(p
r|a
(R|A))
A
_
2
p
r|a
(R|A)dR
_
1/2 __
p
r|a
(R|A)( a(R) A)
2
dR
_
1/2
,
and a left-hand-side (LHS) given by
LHS =
_
ln(p
r|a
(R|A))
A
p
r|a
(R|A)( a(R) A)dR.
From the derivation in the book this LHS expression is on the right-hand-side is equal to
the value of 1. Squaring both sides of the resulting inequality
1
_
_ _
ln(p
r|a
(R|A))
A
_
2
p
r|a
(R|A)dR
_
1/2 __
p
r|a
(R|A)( a(R) A)
2
dR
_
1/2
,
gives the books equation 186. Simply dividing by the integral with the derivative and
recognizing that these integrals are expectations gives
E[( a(R) A)
2
]
_
E
_
ln(p
r|a
(R|A))
A
_
2
_
1
, (40)
which is one formulation of the Crammer-Rao lower bound on the value of the expression
Var[ a(R) A] and is the books equation 188. From the above proof we will have an ecient
estimator (one that achieve the Cramer-Rao lower bound) and the Schwarz inequality is
tight when f = kg or in this case
ln(p
r|a
(R|A))
A
_
p
r|a
(R|A) = k(A)
_
p
r|a
(R|A) ( a(R) R) .
or
ln(p
r|a
(R|A))
A
= k(A)( a(R) A) . (41)
If we can write our estimator a(R) in this form then we can state that we have an ecient
estimator.
Notes on Example 2: Using The Cramer-Rao Inequality
The expression for p(R|A) for this example is given via the books equation 139 or
p
r|a
(R|A) =
N

i=1
1

2
n
exp
_

(R
i
A)
2
2
2
n
_
. (42)
The logarithm of this is then given by
ln(p
r|a
(R|A)) =
N
2
ln(2) N ln(
n
)
1
2
2
n
N

i=1
(R
i
A)
2
.
To nd the maximum likelihood solution we need to nd the maximum of the above expres-
sion with respect to the variable A. The A derivative of this expression is given by
ln(p
r|a
(R|A))
A
=
2
2
2
n
N

i=1
(R
i
A) =
1

2
n
_
N

i=1
R
i
AN
_
=
N

2
n
_
1
N
N

i=1
R
i
A
_
. (43)
Setting this equal to zero and solving for A we get
a
ml
(R) =
1
N
N

i=1
R
i
. (44)
An ecient estimator (equal to the Crammer-Rao lower bound) will have
ln(p
r|a
(R|A))
A
= k(A)( a(R) A) .
we see from Equation 43 that our estimator a
ml
(R) is of this form. As we have an ecient
estimator we can evaluate the variance of it by using the Crammer-Rao inequality as an
equality. The needed expression in the Crammer-Rao inequality is

2
ln(p
r|a
(R|A))
A
2
=
N

2
n
. (45)
Thus we nd
Var[ a
ml
(R) A] =
_
E
_

2
ln(p
r|a
(R|A))
A
2
__
1
=
_
N

2
n
_
1
=

2
n
N
(46)
which is the books equation 201.
Notes on Example 4: Using The Cramer-Rao Inequality
The likelihood of a for Example 4 is a Poisson random variable, given by the books equa-
tion 162 or
Pr(nevents|a = A) =
A
n
n!
exp(A) for n = 0, 1, 2, . . . . (47)
The maximum likelihood estimate of A, after we observe the number n events, is given by
nding the maximum of the density above Pr(nevents|a = A). We can do this by setting
ln(p(n=N|A))
A
equal to zero and solving for A. This derivative is

A
ln(Pr(n = N|A)) =

A
(N ln(A) Aln(N!))
=
N
A
1 =
1
A
(N A) . (48)
Setting this equal to zero and solving for A gives
a
ml
(N) = N . (49)
Note that in Equation 48 we have written
ln(p(n=N|A))
A
in the form k(A)( a A) and thus
a
ml
is an ecient estimator (one that achieves the Crammer-Rao bounds). Computing the
variance of this estimator using this method we then need to compute

2
ln(p(n = N|A))
A
2
=
N
A
2
.
Thus using this we have
Var[ a
ml
(N) A] =
1
E
_

2
ln(p(R|A))
A
2
_ =
1
E
_
N
A
2
_ =
A
2
E[N]
=
A
2
A
= A. (50)
A bit of explanation might be needed for these manipulations. In the above E[N] is the
expectation of the observation N with A a xed parameter. The distribution of N with A
a xed parameter is a Poisson distribution with mean A given by Equation 47. From facts
about the Poisson distribution this expectation is A.
Note that these results can be obtained from MAP estimates in the case where our prior
information is innitely weak. For example, in example 2 weak prior information means
that we should take
a
in the MAP estimate of a. Using Equation 25 since a
map
(R) =
a
ms
(R) for this example this limit gives
a
map
(R) =

2
a

2
a
+ (
2
n
/N)
_
1
N
N

i=1
R
i
_

1
N
N

i=1
R
i
.
which matches the maximum likelihood estimate of A as shown in Equation 44.
In example 4, since A is distributed as an exponential with parameter it has a variance
given by Var[A] =
1

2
see [1], so to remove any prior dependence in the MAP estimate we
take 0. In that case the MAP estimate of A given by Equation 37 limits to
a
map
=
N
1 +
N ,
which is the same as the maximum likelihood estimate Equation 49.
Notes on Example 3: Using The Cramer-Rao Inequality
Consider the expression for p(A|R) for this example given in the books equation 160
p(A|R) = k(R) exp
_

1
2
_
1

2
n
N

i=1
[R
i
s(A)]
2
+
1

2
a
A
2
__
. (51)
From this expression for p(A|R) to get p(R|A) we would need to drop p
a
(A) exp
_

A
2
2
2
a
_
.
When we do this and then take the logarithm of p(R|A) we get
ln(p(R|A)) = ln(k

(R))
1
2
2
n
N

i=1
[R
i
s(A)]
2
.
To compute a
ml
(R) we compute
ln(p(R|A))
A
, set this expression equal to zero and then solve
for a
ml
(R). We nd the needed equation to solve given by
1

2
n
_
s(A)
A
_
_
1
N
N

i=1
R
i
s(A)
_

A= a
ml
(R)
= 0 . (52)
To satisfy this equation either
ds
dA
= 0 is zero or s(A) =
1
N

N
i=1
R
i
. The second equation
has a solution for a
ml
(R) given by
a
ml
(R) = s
1
_
1
N
N

i=1
R
i
_
, (53)
which is the books equation 209. If this estimates is unbiased we can evaluate the Kramer-
Rao lower bound on Var[ a
ml
(R) A] by computing the second derivative

2
ln(p(R|A))
A
2
=
1

2
n

2
s
A
2
_
N

i=1
[R
i
s(A)]
_
+
1

2
n
s
A
_
N
s
A
_
=
1

2
n

2
s
A
2
_
N

i=1
[R
i
s(A)]
_

2
n
_
s
A
_
2
. (54)
Taking the expectation of the above expression and using the fact that E(R
i
s(A)) =
E(n
i
) = 0 the rst term in the above expression vanishes and we are left with the expectation
of the second derivative of the log-likelihood given by

2
n
_
s
A
_
2
.
Using this expectation the Cramer-Rao lower bound gives
Var[ a
ml
(R) A]
1
E
_

2
p(R|A)
A
2
_ =

2
n
N
_
ds(A)
dA
_
2
. (55)
We can see why we need to divide by the derivative squared when computing the variance
of a nonlinear transformation form the following simple example. If we take Y = s(A) and
Taylor expand Y about the point A = A
A
where Y
A
= s(A
A
) we nd
Y = Y
A
+
ds
dA

A=A
A
(AA
A
) + O((AA
A
)
2
) .
Computing Y Y
A
we then have
Y Y
A
= (AA
A
)
ds
dA

A=A
A
.
From this we can easily compute the variance of our nonlinear function Y in terms of the
variance of the input and nd
Var[Y Y
A
] =
_
ds(A)
dA

A=A
A
_
2
Var[AA
A
] .
Which shows that a nonlinear transformation expands the variance of the mapped variable
Y by a multiple of the derivative of the mapping.
Notes on The Cramer-Rao Bound in Estimating a Random Parameter
Starting with the conditional expectation of the error given A given by
B(A) =
_

[ a(R) A]p(R|A)dR, (56)


when we multiply by the a priori density of A or p(A) we get
p(A)B(A) =
_

[ a(R) A]p(R|A)p(A)dR.
Taking the A derivative of the above gives
d
dA
p(A)B(A) =
_

p(R, A)dR +
_

p(R, A)
A
[ a(R) A]dR. (57)
Next we integrate the above over all space to get
0 = 1 +
_

p(R, A)
A
[ a(R) A]dR. (58)
Then using the Schwarz inequality as was done on Page 13 we can get the stated lower bound
on the variance of our estimator a(R). The Schwarz inequality will hold with equality if and
only if

2
ln(p(R, A))
A
2
= k . (59)
Since p(R, A) = p(A|R)p(R) we have ln(p(R, A)) = ln(p(A|R)) + ln(p(R)) so Equation 59
becomes

2
ln(p(R, A))
A
2
=

2
ln(p(A|R))
A
2
= k .
Then integrating this expression twice gives that p(A|R) must satisfy
p(A|R) = exp(kA
2
+ c
1
(R)A+ c
2
(R)) .
Notes on the proof that

2
i
= Var[ a
i
(R) A
i
] J
ii
Lets verify some of the elements of E[xx
T
]. We nd
E[x
1
x
2
] =
_

( a
1
(R) A
1
)
ln(p(R|A))
A
1
p(R|A)dR
=
_

a
1
(R)
p(R|A)
A
1
dR A
1
_

p(R|A)
A
1
dR
= 1 A
1

A
1
_

p(R|A)dR using the books equation 264 for the rst term
= 1 A
1

A
1
1 = 1 0 = 1 . (60)
Notes on the general Gaussian problem: Case 3
The book has shown that l(R) m
T
QR and the transformation from primed to un-
primed variables looks like
m = W
1
m

and R = W
1
R

,
thus in the primed coordinate system we have
l(R

) = m

T
W
T
QW
1
R

.
Recall that W
T
is the matrix containing the eigenvectors of K as its column values, since Q
is dened as Q = K
1
we can conclude that
KW
T
= W
T
is the same as Q
1
W
T
= W
T
,
so inverting both sides gives W
T
Q =
1
W
T
. Multiply this last expression by W
1
on the
right gives W
T
QW
1
=
1
W
T
W
1
. Since the eigenvectors are orthogonal WW
T
= I
so W
T
W
1
= I and we obtain
W
T
QW
1
=
1
.
Using this expression we see that l(R

) becomes
l(R

) = m

1
R

=
N

i=1
m
i

i
. (61)
In the same way we nd that d
2
becomes
d
2
= m
T
Qm = m

T
W
T
QW
1
w

= m

1
m

=
N

i=1
(m

i
)
2

i
. (62)
If > 0 then m

11
= 0 and m

12
= 1 will maximize d
2
. In terms of m
11
and m
12
this means
that
m
11
+ m
12

2
= 0 and
m
11
m
12

2
= 1 .
Solving for m
11
and m
12
we get
_
m
11
m
12
_
=
1

2
_
1
1
_
=
2
Note that
2
is the eigenvector
corresponding to the smaller eigenvalue (when > 0).
If < 0 then m

11
= 1 and m

12
= 0 will maximize d
2
. In terms of m
11
and m
12
this means
that
m
11
+ m
12

2
= 1 and
m
11
m
12

2
= 0 .
Solving for m
11
and m
12
we get
_
m
11
m
12
_
=
1

2
_
1
1
_
=
1
Note that
1
is again the eigen-
vector corresponding to the smaller eigenvalue (when < 0).
When m
1
= m
2
= m we get for the H
1
decision boundary
1
2
(R m)
T
Q
0
(R m)
1
2
(R m)
T
Q
1
(R m) > ln() +
1
2
ln |K
1
|
1
2
ln |K
0
|

.
We can write the left-hand-side of the above as
1
2
_
(R m)
T
Q
0
(R m)
T
Q
1

(R m) =
1
2
(R m)
T
(Q
0
Q
1
)(R m) . (63)
Q
n
=
1

2
n
_
I +
1

2
n
K
s
_
1
=
1

2
n
[I H] ,
so
_
I +
1

2
n
K
s
_
1
= I H ,
solving for H we get
H = I
_
I +
1

2
n
K
s
_
1
(64)
=
_
I +
1

2
n
K
s
_
1
__
I +
1

2
n
K
s
_
I
_
=
_
I +
1

2
n
K
s
_
1
_
1

2
n
K
s
_
=
_

2
n
I + K
s
_
1
K
s
.
Also factor the inverse out on the right of Equation 64 to get
_
1

2
n
__
I +
1

2
n
K
s
_
1
= K
s
(
2
n
I + K
s
)
1
.
Notice that from the functional forms of Q
0
and Q
1
we can write this as
I
_
I +
1

2
n
K
s
_
1
=
2
n
Q
0

2
n
Q
1
=
2
n
Q, (65)
We have
P
F
=
_

(2
N/2

N
n
(N/2))
1
L
N/21
e
L/2
2
n
dL = 1
_

0
(2
N/2

N
n
(N/2))
1
L
N/21
e
L/2
2
n
dL,
Since the integrand is a density and must integrate to one. If we assume N is even and let
M =
N
2
1 (an integer) then

_
N
2
_
= (M + 1) = M! .
Lets change variables in the integrand by letting x =
L
2
2
n
(so that dx =
dL
2
2
n
) and then the
expression for P
F
becomes
P
F
= 1
_

2
2
n
0
(2
N/2

N
n
)
1
_
1
M!
_
(2
2
n
)
N
2
1
x
M
e
x
(2
2
n
)dx
= 1
_

2
2
n
0
x
M
M!
e
x
dx. (66)
Using the same transformation of the integrand used above (i.e. letting x =
L
2
2
n
) we can
write P
F
in terms of x as
P
F
=
_

x
M
M!
e
x
dx.
We next integrate this by parts M times as
P
F
=
1
M!
_
x
M
e
x

+ M
_

x
M1
e
x
dx
_
=
1
M!
_

M
e

+ M
_

x
M1
e
x
dx
_
=

M
M!
e

+
1
(M 1)!
_

x
M1
e
x
dx rst integration by parts
=

M
M!
e

+
1
(M 1)!
_
x
M1
e
x

+ (M 1)
_

x
M2
e
x
dx
_
=

M
M!
e

+

M1
(M 1)!
e

+
1
(M 2)!
_

x
M2
e
x
dx second integration by parts
= e

_
2

k=M,M1,M2,

k
k!
_
+
1
(M (M 1))
_

xe
x
dx
= e

_
M

k=2

k
k!
_
xe
x

+
_

e
x
dx = e

_
M

k=2

k
k!
_
+

+ e

.
Thus
P
F
= e

k=0

k
k!
. (67)
If

1 and M is not too large then the largest term is



M
M!
is the largest and we can
factor it out to get
P
F
=
(

)
M
e

M!
M

k=0
(

)
kM
_
M!
k!
_
=
(

)
M
e

M!
_
1 +
M

+
M(M 1)

2
+
M(M 1)(M 2)

3
+
_
. (68)
If we drop the terms after the second in the above expansion and recall that (1+x)
1
1x
when x 1 we can write
P
F

(

)
M
e

M!
_
1
M

_
1
. (69)
Problem Solutions
The conventions of this book dictate that lower case letters (like y) indicate a random variable
while capital case letters (like Y ) indicate a particular realization of the random variable y.
To maintain consistency with the book Ill try to stick to that notation. This is mentioned
because other books use the opposite convention like [1] which could introduce confusion.
Problem 2.2.1 (a Likelihood Ratio Test (LRT))
Part (1): We assume a hypothesis of
H
1
: r = s + n (70)
H
0
: r = n, (71)
where both s and n are exponentially distributed. For example for s we have
p
s
(S) =
_
ae
aS
S 0
0 S < 0
. (72)
A similar expression holds for p
n
(N). Now from properties of the exponential distribution
the mean of s is
1
a
and the mean of n is
1
b
. We hope in a physical application that the mean of
s is larger than the mean of n. Thus we should expect that b > a. Ill assume this condition
in what follows.
We now need to compute p(R|H
0
) and p(R|H
1
). The density is p(R|H
0
) is already given.
To compute p(R|H
1
) we can use the fact that the probability density function (PDF) of the
sum of two random variables is the convolution of the individual PDFs or
p(R|H
1
) =
_

p
s
(R n)p
n
(n)dn.
Since the domain of n is n 0 the function p
n
(N) vanishes for N < 0 and the lower limit on
the above integral becomes 0. The upper limit is restricted by recognizing that the argument
to p
s
(Rn) will be negative for n suciently larger. For example, when Rn < 0 or n > R
the density p
S
(R n) will cause the integrand to vanish. Thus we need to evaluate
p(R|H
1
) =
_
R
0
p
s
(R n)f
n
(n)dn =
_
R
0
ae
a(Rn)
be
bn
dn
=
ab
b a
(e
aR
e
bR
) ,
when we integrate and simplify some. This function has a domain given by 0 < R < and
is zero otherwise. As an aside, we can check that the above density integrates to one (as it
must). We have
_

0
ab
b a
(e
aR
e
bR
)dR =
ab
b a
_
e
aR
a
+
e
bR
b

0
= 0
ab
b a
_

1
a
+
1
b
_
= 1 ,
when we simplify. The likelihood ratio then decides H
1
when
(R) =
ab
ba
(e
aR
e
bR
)
be
bR
>
P
0
(C
10
C
00
)
P
1
(C
01
C
11
)
,
and H
0
otherwise. We can write the above LRT as
a
b a
(e
(ba)R
1) > .
If we solve the above for R we get
R >
1
b a
ln
__
b a
a
_
+ 1
_
.
If R is not larger than we declare H
0
.
Part (2): We would replace in the above expression with
P
0
(C
10
C
00
)
P
1
(C
01
C
11
)
.
Part (3): For Neyman-Pearson test (in the standard form) we x a value of
P
F
Pr(say H
1
|H
0
is true)
say and seek to maximize P
D
. Since we have shown that the LRT in this problem is
equivalent to R > we have
Pr(say H
1
|H
0
is true) = Pr(R > |H
0
is true) .
We can calculate the right-hand-side as
Pr(R > |H
0
is true) =
_

p
n
(N)dN
=
_

be
bN
dN = b
_

e
bN
b

= e
b
.
At this equals P
F
we can write as a function of P
F
as =
1
b
ln(P
F
).
Problem 2.2.2 (exponential and Gaussian hypothesis test)
Part (1): We nd
(R) =
p(R|H
1
)
p(R|H
0
)
=

2
2
exp
_
|R| +
1
2
R
2
_
=
_

2
exp
_
1
2
(R
2
2|R| + 1)
1
2
_
=
_

2
exp
_
1
2
(|R| 1)
2

1
2
_
=
_

2
e

1
2
exp
_
1
2
(|R| 1)
2
_
.
Part (2): The LRT says to decide H
1
when (R) > and decide H
0
otherwise. From the
above expression for (R) this can be written as
1
2
(|R| 1)
2
> ln
_
_
2

e
1
2

_
,
or simplifying some
|R| >

_
2 ln
_
_
2

e
1
2

_
+ 1 .
If we plot the two densities we get the result shown in Figure 1. See the caption on that plot
for a description.
Problem 2.2.3 (nonlinear hypothesis test)
Our two hypothesis are
H
1
: y = x
2
H
0
: y = x
3
,
where x N(0, ). The LRT requires calculating the ratio
p(Y |H
1
)
p(Y |H
0
)
which we will do by
calculating each of the conditional densities p(Y |H
0
) and p(Y |H
1
). For this problem, the
0 1 2 3 4 5
0
.0
0
.1
0
.2
0
.3
0
.4
x
p
H
0
Figure 1: Plots of the density
1

2
exp
_

1
2
R
2
_
(in black) and
1
2
exp(|R|) (in red). Notice
that the exponential density has fatter tails than the Gaussian density. A LRT where if |R|
is greater than a threshold we declare H
1
makes sense since the exponential density (from
H
1
) is much more likely to have large valued samples.
distribution function for y under the hypothesis H
0
is given by
P(Y |H
0
) = Pr{y Y |H
0
} = Pr{x
2
Y |H
0
}
= Pr{

Y x

Y |H
0
} = 2
_
+

Y
0
1

2
2
exp
_


2
2
2
_
d .
The density function for Y (under H
0
) is the derivative of this expression with respect to Y .
We nd
p(Y |H
0
) =
2

2
2
exp
_

Y
2
2
__
1
2

Y
_
=
1

2
2
Y
exp
_

Y
2
2
_
.
Next the distribution function for y under the hypothesis H
1
is given by
P(Y |H
1
) = Pr{y Y |H
1
} = Pr{x
3
Y |H
1
}
= Pr{x Y
1/3
|H
1
} =
_
Y
1/3

2
2
exp
_


2
2
2
_
d .
Again the density function for y (under H
1
) is the derivative of this expression with respect
to Y . We nd
p(Y |H
1
) =
1

2
2
exp
_

Y
2/3
2
2
__
1
3Y
2/3
_
=
1
3

2
2
Y
2/3
exp
_

Y
2/3
2
2
_
.
Using these densities the LRT then gives
(Y ) =
1
3Y
2/3
exp
_

Y
2/3
2
2
_
1
Y
1/2
exp
_

Y
2
2
_ = Y
1/6
exp
_

1
2
2
(Y
2/3
+ Y )
_
.
After receiving the measurement Y , the decision as to whether H
0
or H
1
occurred is based
on the value of (Y ) dened above. If (Y ) >
P
0
(C
10
C
00
)
P
1
(C
01
C
11
)
then we say H
1
occurred
otherwise we say H
0
occurred.
Problem 2.2.4 (another nonlinear hypothesis test)
The distribution function for y|H
0
can be computed as
P(Y |H
0
) = Pr{x
2
Y |H
0
}
= Pr{

Y x

Y |H
0
} =
_

Y
1

2
2
exp
_

( m)
2
2
2
_
d .
The density function for y|H
0
is the derivative of this expression. To evaluate that derivative
we will use the identity
d
dt
_
(t)
(t)
f(x, t)dx =
d
dt
f(, t)
d
dt
f(x, t) +
_
(t)
(t)
f
t
(x, t)dx. (73)
With this we nd
p(Y |H
0
) =
1

2
2
_
exp
_

Y m)
2
2
2
_
_
1
2

Y
_
exp
_

Y m)
2
2
2
_
_

1
2

Y
_
_
=
1

2
2
_
1
2

Y
_
_
exp
_

Y m)
2
2
2
_
+ exp
_

Y + m)
2
2
2
__
.
The distribution function for y|H
1
can be computed as
P(Y |H
1
) = Pr{e
x
Y |H
1
}
= Pr{x ln(Y )|H
1
} =
_
ln(Y )

2
2
exp
_

( m)
2
2
2
_
d .
Taking the derivative to get p(y|H
1
) we have
p(Y |H
1
) =
1

2
2
exp
_

(ln(Y ) m)
2
2
2
__
1
Y
_
.
Using these two expressions we nd the likelihood ratio test would be if
p(Y |H
1
)
p(Y |H
0
)
> then
decide H
1
otherwise decide H
0
.
Problem 2.2.5 (testing samples with dierent variances)
Part (1-2): Note that this problem is exactly the same as considered in Example 2 in the
book but where we have taken K independent observations rather than N. Thus all formulas
derived in the book are valid here after we replace N with K. The LRT expressions desired
for this problem are then given by the books Eq. 29 or Eq. 31.
Part (3): Given the decision region l(R) > for H
1
and the opposite inequality for deciding
H
0
, we can write P
F
and P
M
= 1 P
D
as
P
F
= Pr(choose H
1
|H
0
is true) = Pr
_
K

i=1
R
2
i
> |H
0
is true
_
P
D
= Pr(choose H
1
|H
1
is true) = Pr
_
K

i=1
R
2
i
> |H
1
is true
_
.
The book discusses how to evaluate these expressions in section 6.
Part (5): According to the book, when C
00
= C
11
= 0 the minimax criterion is C
M
P
M
=
C
F
P
F
. If C
M
= C
F
then the minimax criterion reduces to P
M
= P
F
or 1 P
D
= P
F
. Since
both P
D
and P
F
are functions of we would solve the above expression for to determine
the threshold to use in the minimax LRT.
Problem 2.2.6 (multiples of the mean)
Part (1): Given the two hypothesis H
0
and H
1
to compute the LRT we need to have the
conditional probabilities p(R|H
0
) and p(R|H
1
). The density for p(R|H
0
) is the same as that
of p
n
(N). To determine p(R|H
1
) we note that as m
1
is xed and b N(0,
b
) that the
product bm
1
N(0, m
1

b
). Adding an independent zero mean random variable n gives
another Gaussian random variable back with a larger variance. Thus the distribution of
R|H
1
given by
R|H
1
= bm
1
+ n N
_
0,
_

2
b
m
2
1
+
2
n
_
.
Using the above density we have that the LRT is given by
(R)
p(R|H
1
)
p(R|H
0
)
=
1

m
2
1

2
b
+
2
n
exp
_

R
2
2(m
2
1

2
b
+
2
n
)
_
1

2
n
exp
_

R
2
2
2
n
_
=


2
n
m
2
1

2
b
+
2
n
exp
_
1
2
m
2
1

2
b
(m
2
1

2
b
+
2
n
)
2
n
R
2
_
,
when we simplify. We pick H
1
when (R) > or
R
2
>
_
2(m
2
1

2
b
+
2
n
)
2
n
m
2
1

2
b
_
ln
_

m
2
1

2
b
+
2
n

2
n
_
.
Taking the square root we decide H
1
when
|R| >

_
_
2(m
2
1

2
b
+
2
n
)
2
n
m
2
1

2
b
_
ln
_

m
2
1

2
b
+
2
n

2
n
_
. (74)
0.0 0.2 0.4 0.6 0.8 1.0
0
.
0
0
.
2
0
.
4
0
.
6
0
.
8
1
.
0
P_F
P
_
D
Figure 2: The ROC curve for Problem 2.2.6. The minimum probability of error is denoted
with a red marker.
The optimal processor takes the absolute value of the number R and compares its value to
a threshold .
Part (2): To draw the ROC curve we need to compute P
F
and P
D
. We nd
P
F
=
_
|R|>
p(R|H
0
)dR = 2
_

R=
p(R|H
0
)dR = 2
_

R=
1

2
n
exp
_

1
2
R
2

2
n
_
dR.
Let v =
R
n
so that dR =
n
dv and we nd
P
F
= 2
_

v=

n
1

2
exp
_

1
2
v
2
_
dv = 2erfc

n
_
For P
D
we nd
P
D
= 2
_

R=
p(R|H
1
)dR = 2
_

R=
1
_
m
2
1

2
b
+
2
n
exp
_

1
2
R
2
m
2
1

2
b
+
2
n
_
dR
= 2
_

v=

m
2
1

2
b
+
2
n
1

2
exp
_

1
2
v
2
_
dv = 2erfc

_

_
m
2
1

2
b
+
2
n
_
.
To plot the ROC curve we plot the points (P
F
, P
D
) as a function of . To show an example
of this type of calculation we need to specify some parameters values. Let
n
= 1,
b
= 2,
and m
1
= 5. With these parameters in the R code chap 2 prob 2.2.6.R we obtain the plot
shown in Figure 2.
Part (3): When we have P
0
= P
1
=
1
2
we nd
Pr() = P
0
P
F
+ P
1
P
M
=
1
2
P
F
() +
1
2
(1 P
D
()) =
1
2
+
1
2
(P
F
() P
D
()) .
The minimum probability of error (MPE) rule has C
00
= C
11
= 0 and C
10
= C
01
= 1 and
thus = 1. With this value of we would evaluate Equation 74 to get the value of of the
MPE decision threshold. This threshold gives Pr() = 0.1786.
Problem 2.2.7 (a communication channel)
Part (1): Let S
1
and S
2
be the events that our data was generated via the source 1 or 2
respectively. Then we want to compute P
F
Pr(Say S
2
|S
1
) and P
D
Pr(Say S
1
|S
2
). Now
all binary decision problems are likelihood ratio problems where we must compute
p(R|S
2
)
p(R|S
1
)
.
We will assume that R in this case is a sequence of N outputs from the communication
system. That is an example R (for N = 9) might look like
R =
_
a a b a b b a a a

.
We assume that the source is held constant for the entire length of the sequence of R. Let
r
i
be one of the N samples of the vector R. Since each of the r
i
outputs are independent of
the others given the source we can evaluate p(R|S
1
) as
p(R|S
1
) =
N

i=1
p(r
i
|S
1
) = p(r = a|S
1
)
Na
p(r = b|S
1
)
NNa
.
Here N
a
is the number of a output and N N
a
= N
b
is the number of b output in our
sample of N total outputs. A similar type of an expression will hold for p(R|S
2
). Based on
these expressions we need to compute p(r|S
i
) which we can do from the numbers given in
the problem. Let I
0
and I
1
be the events that the given source emits a 0 or a 1. Then we
have
p(r = a|S
1
) = p(r = a|I
0
, S
1
)p(I
0
|S
1
) + p(r = a|I
1
, S
1
)p(I
1
|S
1
) = (0.4)(0.5) + (0.7)(0.5) = 0.55
p(r = b|S
1
) = p(r = b|I
0
, S
1
)p(I
0
|S
1
) + p(r = b|I
1
, S
1
)p(I
1
|S
1
) = (0.6)(0.5) + (0.3)(0.5) = 0.45
p(r = a|S
2
) = p(r = a|I
0
, S
2
)p(I
0
|S
2
) + p(r = a|I
1
, S
2
)p(I
1
|S
2
) = (0.4)(0.4) + (0.7)(0.6) = 0.58
p(r = b|S
2
) = p(r = b|I
0
, S
2
)p(I
0
|S
2
) + p(r = b|I
1
, S
2
)p(I
1
|S
2
) = (0.6)(0.4) + (0.3)(0.6) = 0.42 .
With what we have thus far the LRT (will say to decide S
2
) if
(R)
p(r = a|S
2
)
Na
p(r = b|S
2
)
NNa
p(r = a|S
1
)
Na
p(r = b|S
1
)
NNa
=
_
p(r = a|S
2
)
p(r = a|S
1
)
_
Na
_
p(r = b|S
1
)
p(r = b|S
2
)
_
Na
_
p(r = b|S
2
)
p(r = b|S
1
)
_
N
=
_
p(r = a|S
2
)
p(r = b|S
2
)
p(r = b|S
1
)
p(r = a|S
1
)
_
Na
_
p(r = b|S
2
)
p(r = b|S
1
)
_
N
> .
For the numbers in this problem we nd
p(r = a|S
2
)
p(r = b|S
2
)
p(r = b|S
1
)
p(r = a|S
1
)
=
(0.58)(0.45)
(0.42)(0.55)
= 1.129870 > 1 .
0.0 0.2 0.4 0.6 0.8 1.0
0
.
0
0
.
2
0
.
4
0
.
6
0
.
8
1
.
0
P_F
P
_
D
0.0 0.2 0.4 0.6 0.8 1.0
0
.
0
0
.
2
0
.
4
0
.
6
0
.
8
1
.
0
P_F
P
_
D
Figure 3: The ROC curve for Problem 2.2.7. Left: When N = 1. Right: When N = 10.
A vertical line is drawn at the desired value of P
F
= = 0.25.
We can solve for N
a
(and dont need to ip the inequality since the log is positive) to nd
N
a
>

_
p(r=b|S
1
)
p(r=b|S
2
)
_
N
ln
_
p(r=a|S
2
)
p(r=b|S
2
)
p(r=b|S
1
)
p(r=a|S
1
)
_ .
Since N
a
can only be integer valued for N
a
= 0, 1, 2, , N we only need to consider integer
values of say for
I
= 0, 1, 2, , N + 1. Note that at the limit of
I
= 0 the LRT N
a
0
is always true, so we always declare S
2
and obtain the point (P
F
, P
D
) = (1, 1). The limit of

I
= N +1 the LRT of N
a
N +1 always fails so we always declare S
1
and obtain the point
(P
F
, P
D
) = (0, 0). Since N
a
is the count of the number of as from N it is a binomial random
variable (under both S
1
and S
2
) and once
I
is specied, we have P
F
and P
D
given by
P
F
= Pr{N
a

I
|S
1
} =
N

k=
I
_
N
k
_
p(r = a|S
1
)
k
p(r = b|S
1
)
Nk
P
D
= Pr{N
a

I
|S
2
} =
N

k=
I
_
N
k
_
p(r = a|S
2
)
k
p(r = b|S
2
)
Nk
.
To plot the ROC curve we evaluate P
F
and P
D
for various values of
I
. This is done in
the R code chap 2 prob 2.2.7.R. Since the value of P
F
= = 0.25 does not exactly fall on
a integral value for
I
we must use a randomized rule to achieve the desired performance.
Since we are not told what N is (the number of samples observed we will consider two cases
N = 1 and N = 10.
In the case N = 1 the desired value P
F
= = 0.25 falls between the two points P
F
(
I
=
2) = 0 and P
F
(
I
= 1) = 0.55. To get the target value of 0.25 we need to introduce the
probability that we will use the threshold
I
= 2 as p
=2
. The complement of this probability
or 1 p
=2
is the probability that we use the threshold
I
= 1. Then to get the desired false
alarm rate we need to take p
=2
to satisfy
p
=2
P
F
(
I
= 2) + (1 p
=2
)P
F
(
I
= 1) = 0.25 .
Putting in what we know for P
F
(
I
= 2) = 0 and P
F
(
I
= 1) = 0.55 this gives p
=2
= 0.54.
The randomized procedure that gets P
F
= while maximizing P
D
is to observe N
a
and then
with a probability p
=2
= 0.54 return the result from the test N
a
2 (which will always be
false causing us to returning S
1
). With probability 1 p
=2
= 0.45 return the result from
the test N
a
1.
In the case where N = 10 we nd that = 0.25 falls between the two points P
F
(
I
= 8) =
0.09955965 and P
F
(
I
= 7) = 0.2660379. We again need a randomized rule where we have
to pick p
=8
such that
p
=8
P
F
(
I
= 8) + (1 p
=8
)P
F
(
I
= 7) = 0.25 .
Solving this gives p

I
=8
= 0.096336. The randomized decision to get P
F
= 0.25 while
maximizing P
D
is of the same form as in the N = 1 case. That is to observe N
a
and then
with a probability p
=8
= 0.096336 return the result from the test N
a
8. With probability
1 p
=8
= 0.9036635 return the result from the test N
a
7.
Problem 2.2.8 (a Cauchy hypothesis test)
Part (1): For the given densities the LRT would state that if
(R) =
p(X|H
1
)
p(X|H
0
)
=
1 + (X a
0
)
2
1 + (X a
1
)
2
=
1 + X
2
1 + (X 1)
2
> ,
we decide H
1
(and H
0
otherwise). We can write the above as a quadratic expression in X
on the left-hand-side as
(1 )X
2
+ 2X + 1 2 > 0 .
Using the quadratic formula we can nd the values of X where the left-hand-side of this
expression equals zero. We nd
X

=
2
_
4
2
4(1 )(1 2)
2(1 )
=

_
(1 3 +
2
)
1
. (75)
In order for a real value of X above to exist we must have that the expression 13+
2
< 0.
If that expression is not true then (1 )X
2
+ 2X + 1 2 is either always positive or
always negative. This expression 1 3 +
2
is zero at the values
3

9 4
2
=
3

5
2
= {0.381966, 2.618034} .
For various values of we have
If < 0.381966 (say = 0) then 1 3 +
2
> 0 and (1 )X
2
+ 2X + 1 2 is
always positive indicating we always choose H
1
. This gives the point (P
F
, P
D
) = (1, 1)
on the ROC curve.
If > 2.618034 (say = 3) then 1 3 +
2
> 0 and (1 )X
2
+ 2X + 1 2 is
always negative indicating we always choose H
0
. This gives the point (P
F
, P
D
) = (0, 0)
on the ROC curve.
If 0.381966 < < 2.618034 (say = 0.75) then 1 3 +
2
< 0 and there are two
points X, given by Equation 75, where (1 )X
2
+ 2X + 1 2 changes sign. Note
that from Equation 75 X

< X
+
if < 1. For example, if = 0.75 then the two points
X

and X
+
are
X

= 6.316625 and X
+
= 0.3166248 .
When X < X

one nds that the expression (1 )X


2
+ 2X + 1 2 is always
positive so we choose H
1
, when X

< X < X
+
the expression is negative so we choose
H
0
, and when X > X
+
the expression is positive again so we choose H
0
.
If > 1 say = 1.25 then the two points X

and X
+
are
X

= 9.358899 and X
+
= 0.641101 .
When X < X
+
one nds that the expression (1 )X
2
+ 2X + 1 2 is always
negative so we choose H
0
, when X
+
< X < X

the expression is positive so we choose


H
1
, and when X > X

the expression is positive again so we choose H


1
.
Part (2): Using the above information we can express P
F
and P
D
as a function of . We
will increase from 0 to + and plot the point (P
F
, P
D
) as a function of .
For all points 0 < < 0.381966 we get the ROC point (P
F
, P
D
) = (1, 1). After we increase
past > 2.618034 we get the ROC point (P
F
, P
D
) = (0, 0). For values of between 0.381966
and 1.0 these two values we have
P
F
= Pr(choose H
1
|H
0
is true) =
_
X

p(X|H
0
)dX +
_

X
+
p(X|H
0
)dX
= 1
_
X
+
X

1
(1 + X
2
)
dX = 1
1

tan
1
(X
+
) +
1

tan
1
(X

)
P
D
= Pr(choose H
1
|H
1
is true) = 1
_
X
+
X

1
(1 + (X 1)
2
)
dX
= 1
1

tan
1
(X
+
1) +
1

tan
1
(X

1) .
In the case where between 1.0 and 2.618034 we would have
P
F
=
_
X

X
+
p(X|H
0
)dX =
1

tan
1
(X

)
1

tan
1
(X
+
)
P
D
=
_
X

X
+
p(X|H
1
)dX =
1

tan
1
(X

1)
1

tan
1
(X
+
1) .
All of these calculations are done in the R script chap 2 prob 2.2.8.R. When that script is
run we get the result shown in Figure 4.
0.0 0.2 0.4 0.6 0.8 1.0
0
.0
0
.2
0
.4
0
.6
0
.8
1
.0
P_F
P
_
D
Figure 4: Plots of the ROC curve for Problem 2.2.8.
Problem 2.2.9 (coin ipping)
For this problem we assume that a coin is ipped N times and we are told the number of
heads N
H
. We wish to decide if the last ip of the coin landed heads (denoted as H
1
) or
tails (denoted H
0
). Since this is a binary decision problem we need to evaluate p(R|H
i
) for
some received measurement vector R. The most complete set of information one could have
for the N coin ips would be the sequence of complete outcomes. For example in the case
where N = 9 we might observe the sequence
R =
_
h h t h t t h h h

.
Lets derive the LRT under the case of complete observability and then show that all that is
needed to make a decision is N
H
. In the same way as Problem 2.2.7 on page 28 we have
p(R|H
1
) = P
N
H
1
(1 P
1
)
NN
H
p(R|H
0
) = P
N
H
0
(1 P
0
)
NN
H
.
Using these the LRT is given by
p(R|H
1
)
p(R|H
0
)
=
P
N
H
1
(1 P
1
)
NN
H
P
N
H
0
(1 P
0
)
NN
H
> ,
decide H
1
(otherwise decide H
0
).
Problem 2.2.10 (a Poisson counting process)
Part (1): For a Poisson counting process, under the hypothesis H
0
and H
1
, the probabilities
we have n events by the time T are given by
Pr{N(T) = n|H
0
} =
e
k
0
T
(k
0
T)
n
n!
Pr{N(T) = n|H
1
} =
e
k
1
T
(k
1
T)
n
n!
.
Part (2): In the case where P
0
= P
1
=
1
2
, C
00
= C
11
= 0, and equal error costs C
10
= C
01
we have =
P
0
(C
10
C
00
)
P
1
(C
01
C
11
)
= 1 and our LRT says to decide H
1
when
Pr{N(T) = n|H
1
}
Pr{N(T) = n|H
0
}
= e
k
1
T
e
k
0
T
_
k
1
k
0
_
n
> 1 .
If we assume that k
1
> k
0
indicating that the second source has events that happen more
frequently. Then the LRT can be written as
n >
T(k
1
k
0
)
ln
_
k
1
k
0
_ .
Since n can only take on integer values the only possible values for are 0, 1, 2, . Thus
our LRT reduces to n
I
for
I
= 0, 1, 2, . Note we consider the case where n =
I
to
cause us to state that the event H
1
occurred.
Part (3): To determine the probability of error we use
Pr() = P
0
P
F
+ P
1
P
M
= P
0
P
F
+ P
1
(1 P
D
) =
1
2
+
1
2
(P
F
P
D
) .
Here we would have
P
F
= Pr{Say H
1
|H
0
} =

i=
I
e
k
0
T
(k
0
T)
i
i!
P
D
= Pr{Say H
1
|H
1
} =

i=
I
e
k
1
T
(k
1
T)
i
i!
.
The above could be plotted as a function of
I
in the (P
F
, P
D
) plane to obtain the ROC
curve for this problem.
Problem 2.2.11 (adding up Gaussian random variables)
For a LRT we need to compute p(Y |H
0
) and p(Y |H
1
) since Y is our observed variable. For
the two hypothesis given we have
p(Y |H
1
) = p(Y |N 1)
= p(Y |N = 0)P(N = 0) + p(Y |N = 1)P(N = 1)
=
1

2
e

1
2
y
2

2
e

+
1

2
2
e

1
2
y
2
2
2
e

.
and for p(Y |H
0
) we have
p(Y |H
0
) = p(Y |N > 1) =

k=2
p(Y |N = k)P(N = k)
=

k=2
1

2
_
(k + 1)
2
e

1
2
y
2
(k+1)
2
_

k
e

k!
_
.
With these densities the LRT is simply to decide H
1
if
p(Y |H
1
)
p(Y |H
0
)
> and H
0
otherwise.
Problem 2.2.13 (the expectation of (R))
To begin, recall that (R)
p(R|H
1
)
p(R|H
0
)
.
Part (1): We have
E(
n+1
|H
0
) =
_

n+1
(R)p(R|H
0
)dR =
_ _
p(R|H
1
)
p(R|H
0
)
_
n+1
p(R|H
0
)dR
=
_ _
p(R|H
1
)
p(R|H
0
)
_
n
p(R|H
1
)dR = E(
n
|H
1
) .
Part (2): We have
E(|H
0
) =
_
p(R|H
1
)
p(R|H
0
)
p(R|H
0
)dR =
_
p(R|H
1
)dR = 1 .
Part (3): Recall that Var(
2
|H
0
) = E(
2
|H
0
) E(|H
0
)
2
from #1 in this problem we have
shown that E(
2
|H
0
) = E(|H
1
). From #2 of this problem E(|H
0
) = 1 so
E(|H
0
)
2
= 1
2
= 1 = E(|H
0
) ,
thus
Var(|H
0
) = E(|H
1
) E(|H
0
) .
We can work this problem in a dierent way. Consider the dierence
E(|H
1
) E(|H
0
) =
_
p(R|H
1
)
p(R|H
0
)
p(R|H
1
)dR
_
p(R|H
1
)
p(R|H
0
)
p(R|H
0
)dR
=
_
p(R|H
1
)
p(R|H
0
)
_
p(R|H
1
)
p(R|H
0
)
1
_
p(R|H
0
)dR
=
_
(R)((R) 1)p(R|H
0
)dR
= E(
2
|H
0
) E(|H
0
) using #1 from this problem
= E(|H
1
) E(|H
0
) .
Problem 2.2.14 (some mathematical results)
Part (2): From Part (3) of the previous problem we have that
Var(|H
0
) = E(|H
1
) E(|H
0
) = E(|H
1
) 1
=
_
p(R|H
1
)
p(R|H
0
)
p(R|H
1
)dR 1 .
Thus to evaluate Var(|H
0
) + 1 we need to evaluate the integral
I =
_
p(R|H
1
)
p(R|H
0
)
p(R|H
1
)dR.
Using the books Eq. 24 we nd
I =
_
exp
_
m

2
N

i=1
R
i

Nm
2
2
2
_
p(R|H
1
)dR
= e

Nm
2
2
2
_
R
1
,R
2
, ,R
N1
,R
N
e
m

N
i=1
R
i
_
N

i=1
1

2
e

(R
i
m)
2
2
2
_
dR
1
dR
2
dR
N1
dR
N
= e

Nm
2
2
2
N

i=1
_
R
i
1

2
exp
_
m

2
R
i

(R
i
m)
2
2
2
_
dR
i
.
The exponent can be simplied as
m

2
R
i

R
2
i
2
2
+
2R
i
m
2
2

m
2
2
2
=
R
2
i
2
2
+
2R
i
m

2

m
2
2
2
=
1
2
2
(R
i
2m)
2
+
3m
2
2
2
.
With this the integral above becomes
I = e

Nm
2
2
2
N

i=1
e
3m
2
2
2
_
R
i
1

2
exp
_

1
2
2
(R
i
2m)
2
_
dR
i
= e

Nm
2
2
2
e
3Nm
2
2
2
= e
Nm
2

2
.
Taking the logarithm of this last expression gives
Nm
2

2
which is the denition of d
2
showing
the requested result.
Problem 2.2.15 (bounds on erfc

(X))
Part (1): Recall that from the denition of the function erfc

(X) we have that


erfc

(X) =
_

X
1

2
e
x
2
/2
dx.
We will integrate this by parts by writing it as
1

2
_

X
1
x
(xe
x
2
/2
)dx.
Then using the integration by parts lemma
_
vdu = vu
_
udv with
v =
1
x
and du = xe
x
2
/2
dx,
where
dv =
1
x
2
dx and u = e
x
2
/2
,
we have erfc

(X) given by
erfc

(X) =
1

2
_

1
x
e

x
2
2

_

X
1
x
2
e

x
2
2
dx
_
=
1

2
_
1
X
e

X
2
2

_

X
1
x
2
e

x
2
2
dx
_
. (76)
Since the second integral term
1

2
_

X
1
x
2
e

x
2
2
dx is positive (and nonzero) if we drop this
term from the above sum we will have an expression that is larger in value that erfc

(X) or
an upper bound. This means that
erfc

(X) <
1

2X
e

X
2
2
. (77)
We can continue by integrating this second term by parts as
_

X
1
x
3
(xe

x
2
2
)dx =
1
x
3
e

x
2
2

_

X
_

3
x
4
_
_
e

x
2
2
_
dx
=
1
X
3
e

X
2
2

_

X
3
x
4
e

x
2
2
dx.
Remembering the factor of
1

2
and combining these results we have shown that
erfc

(X) =
1

2X
e

X
2
2

2X
3
e

X
2
2
+
1

2
_

X
3
x
4
e

x
2
2
dx. (78)
Since the last expression is positive (and nonzero) if we drop it the remaining terms will then
sum to something smaller than erfc

(X). Thus we have just shown


1

2X
e

X
2
2

2X
3
e

X
2
2
< erfc

(X) . (79)
Combining expressions 77 and 79 we get the desired result.
Part (2): We will sketch the solution to this part of the problem but not work it out in
full. Starting from Equation 76 we will derive a recursive expression for the second integral
which will expand to give the series presented in the book. To begin, in that integral we let
t =
x
2
2
, so x =

2t and dx =
1

2t
1/2
dt and the integral becomes
_

X
2
2
(

2t
1/2
)
2
e
t
1

2t
1/2
dt = 2
3/2
_

X
2
2
t
3/2
e
t
dt .
Thus with this we have written erfc

(X) as
erfc

(X) =
1

2
1
X
e

X
2
2

22
3/2
_

X
2
2
t
(
1
2
+1)
e
t
dt .
We will derive an asymptotic expansion to this second integral. Dene I
n
(x) as
I
n
(x)
_

x
t
(
1
2
+n)
e
t
dt .
The integral we initially have is I
1
(
X
2
2
). We can write I
n
(x) recursively using integration by
part as
I
n
(x) = t
(
1
2
+n)
e
t

x
+
_

x

_
1
2
+ n
_
t
(
1
2
+n+1)
e
t
dt
= x
(n+
1
2
)
e
x

_
1
2
+ n
__

x
t
(
1
2
+n+1)
e
t
dt
= x
(n+
1
2
)
e
x

_
n +
1
2
_
I
n+1
(x) . (80)
Let x
X
2
2
. Then using this we have shown that we can write erfc

(X) as
erfc(X) =
1

2
1
X
e

X
2
2
1

22
3/2
I
1
( x)
=
1

2
1
X
e

X
2
2
1

22
3/2
_
x

3
2 e
x

3
2
I
2
( x)
_
=
1

2
1
X
e

X
2
2
1

22
3/2
_
x

3
2 e
x

3
2
_
x

5
2 e
x

5
2
I
3
( x)
__
=
1

2
1
X
e

X
2
2
1

22
3/2
_
x

3
2 e
x

3
2
x

5
2 e
x
+
3 5
2
2
I
3
( x)
_
=
1

2
1
X
e

X
2
2
1

22
3/2
_
x

3
2 e
x

3
2
x

5
2 e
x
+
3 5
2
2
x

7
2 e
x

3 5 7
2
3
I
4
( x)
_
=
1

2
1
X
e

X
2
2
1

22
3/2
_
N

k=1
(1)
k1
3 5 7 (2k 1)
2
k1
x
(k+
1
2
)
e
x
_
+ (1)
N+1
3 5 7 (2N + 1)

22
3/2
2
N1
I
N+1
( x) .
When we put in what we know for x we get
erfc(X) =
1

2
1
X
e

X
2
2
e

X
2
2

2X
_
N

k=1
(1)
k1
3 5 7 (2k 1)
X
2k
_
+ (1)
N
3 5 7 (2N + 1)
2
N1
I
N+1
( x)
=
1

2
1
X
e

X
2
2
_
1 +
N

k=1
(1)
k
1 3 5 7 (2k 1)
X
2k
_
+ (1)
N
3 5 7 (2N + 1)
2
N1
I
N+1
_
X
2
2
_
. (81)
Problem 2.2.16 (an upper bound on the complementary error function)
Part (1): From the denition of the function erfc

(X) we have that


erfc

(X) =
_

X
1

2
e
x
2
/2
dx,
which we can simplify by the following change of variable. Let v = x X (then dv = dx)
and the above becomes
erfc

(X) =
_

0
1

2
e
(v+X)
2
/2
dv
=
_

0
1

2
e
(v
2
+2vX+X
2
)/2
dv
=
e

X
2
2

2
_

0
e

v
2
2
e
vX
dv

X
2
2

2
_

0
e

v
2
2
dv ,
since e
vX
1 for all v [0, ) and X > 0. Now because of the identity
_

0
e

v
2
2
dv =
_

2
,
we see that the above becomes
erfc

(X)
1
2
e

X
2
2
,
as we were to show.
Part (2): We want to compare the expression derived above with the bound
erfc

(X) <
1

2X
e

X
2
2
. (82)
Note that these two bounds only dier in the coecient of the exponential of
X
2
2
. These
two coecients will be equal at the point X

when
1

2X

=
1
2
X

=
_
2

= 0.7978 .
Once we know this value where the coecients are equal we know that when X > 0.7978 then
Equation 82 would give a tighter bound since in that case
1

2X
<
1
2
, while if X < 0.7978
the bound
erfc

(X) <
1
2
e

X
2
2
,
is better. One can observe the truth in these statements by looking at the books Figure 2.10
which plots three approximations to erfc

(X). In that gure one can visually observe that


1
2
e

X
2
2
is a better approximation to erfc

X than
1

2
e

X
2
2
is in the range 0 < X < 0.7978.
The opposite statement holds for X > 0.7978
X
Y
4
4 2 0 2 4

2
0
2
4
4
Figure 5: For = 4, the points that satisfy () outside of the black curve, points that
satisfy
subset
() outside of the green curve, and points that satisfy
super
() outside of the
red curve. When integrated over these regions we obtain exact, lower bound, and upper
bound values for P
F
and P
D
.
Problem 2.2.17 (multidimensional LRTs)
Part (1): For the given densities we have that when
(X
1
, X
2
) =
2
2
0
4
1

0
_
exp
_

1
2
X
2
1
_
1

2
1

2
0
__
+ exp
_

1
2
X
2
2
_
1

2
1

2
0
___
=

0
2
1
_
exp
_
1
2
X
2
1
_

2
1

2
0

2
0

2
1
__
+ exp
_
1
2
X
2
2
_

2
1

2
0

2
0

2
1
___
> ,
we decide H
1
. Solving for the function of X
1
and X
2
in terms of the parameter of this
problem we have
exp
_
1
2
X
2
1
_

2
1

2
0

2
0

2
1
__
+ exp
_
1
2
X
2
2
_

2
1

2
0

2
0

2
1
__
>
2
1

0
(83)
For what follows lets assume that
1
>
0
. This integration region is like a nonlinear ellipse
in that the points that satisfy this inequality are outside of a squashed ellipse. See Figure 5
where we draw the contour (in black) of Equation 83 when = 4. The points outside of this
squashed ellipse are the ones that satisfy ().
Part (2): The probabilities we are looking for will satisfy
P
F
= Pr{Say H
1
|H
0
} =
_
()
p(X
1
, X
2
|H
0
)dX
1
dX
2
P
D
= Pr{Say H
1
|H
1
} =
_
()
p(X
1
, X
2
|H
1
)dX
1
dX
2
.
Here () is the region of the (X
1
, X
2
) plane that satises Equation 83. Recalling the
identity that X < e
X
when X > 0 the region of the (X
1
, X
2
) plane where
1
2
X
2
1
_

2
1

2
0

2
0

2
1
_
+
1
2
X
2
2
_

2
1

2
0

2
0

2
1
_
> , (84)
will be a smaller set than (). The last statement is the fact that X
1
and X
2
need to be
larger to make their squared sum bigger than . Thus there are point (X
1
, X
2
) closer to
the origin where Equation 83 is true while Equation 84 is not. Thus if we integrate over this
new region rather than the original () region we will have an lower bound on P
F
and P
D
.
Thus
P
F

_

subset
()
p(X
1
, X
2
|H
0
)dX
1
dX
2
P
D

_

subset
()
p(X
1
, X
2
|H
1
)dX
1
dX
2
.
Here
subset
() is the region of the (X
1
, X
2
) plane that satises Equation 84. This region
can be written as the circle
X
2
1
+ X
2
2
>
2
0

2
1

2
0
.
See Figure 5 where we draw the contour (in green) of Equation 84 when = 4. The points
outside of this circle are the ones that belong to
subset
() and would be integrated over to
obtaining the approximations to P
F
and P
D
for the value of = 4.
One might think that one could then integrate in polar coordinates to evaluate these integrals.
This appears to be true for the lower bound approximation for P
F
(where we integrate against
p(X
1
, X
2
|H
0
) which has an analytic form with the polar expression X
2
1
+X
2
2
) but the integral
over p(X
1
, X
2
|H
1
) due to its functional form (even in polar) appears more dicult. If anyone
knows how to integrate this analytically please contact me.
To get an upper bound on P
F
and P
D
we want to construct a region of the (X
1
, X
2
) plane
that is a superset of the points in (). We can do this by considering the internal polytope
or the box one gets by taking X
1
= 0 and solving for the two points X
2
on Equation 83
(and the same thing for X
2
) and connecting these points by straight lines. For example,
when we take = 4 and solve for these four points we get
(1.711617, 0) , (0, 1.711617) , (1.711617, 0) , (0, 1.711617) .
One can see lines connecting these points drawn Figure 5. Let the points in the (X
1
, X
2
)
space outside of these lines be denoted
super
(). This then gives the bounds
P
F
<
_
super()
p(X
1
, X
2
|H
0
)dX
1
dX
2
P
D
<
_
super()
p(X
1
, X
2
|H
1
)dX
1
dX
2
.
The R script chap 2 prob 2.2.17.R performs plots needed to produce these gures.
Problem 2.2.18 (more multidimensional LRTs)
Part (1): From the given densities in this problem we see that H
1
represents the idea
that one of the coordinates of the vector X will have a non zero mean of m. The probability
that the coordinate that has a non zero mean is given by p
k
. The LRT for this problem is
p(X|H
1
)
p(X|H
0
)
=
M

k=1
p
k
exp
_

(X
k
m)
2
2
2
_
exp
_
X
2
k
2
2
_
=
M

i=1
p
k
exp
_
2X
k
mm
2
2
2
_
= e

m
2

2
M

i=1
p
k
e
m

2
X
k
.
If the above ratio is greater than some threshold we decide H
1
otherwise we decide H
0
.
Part (2): If M = 2 and p
1
= p
2
=
1
2
the above LRT becomes
1
2
e

m
2

2
(e
m

2
X
1
+ e
m

2
X
2
) > .
Thus we decide H
1
when
e
m

2
X
1
+ e
m

2
X
2
> 2e
m
2

2
. (85)
Note that > 0. For the rest of the problem we assume that m > 0. Based on this
expression we will decide H
1
when the magnitude of (X
1
, X
2
) is large and the inequality
in Equation 85 is satised. For example, when = 4 the region of the (X
1
, X
2
) plane
classied as H
1
are the points to the North and East of the boundary line (in black) in
Figure 6. The points classied as H
0
are the points to the South-West in Figure 6. Part
(3): The exact expressions for P
F
and P
D
involve integrating over the region of (X
1
, X
2
)
space dened by Equation 85 but with dierent integrands. For P
F
the integrand is p(X|H
0
)
and for P
D
the integrand is p(X|H
1
).
We can nd a lower bound for P
F
and P
D
by noting that for points on the decision boundary
threshold we have
e
m

2
X
1
= e
m

2
X
2
Thus
e
m

2
X
1
< or X
1
<

2
m
ln() .
X
Y
4
4 2 0 2 4

2
0
2
4
Figure 6: For = 4, the points that are classied as H
1
are to the North and East past
the black boundary curve. Points classied as H
0
are in the South-West direction and are
below the black boundary curve. A lower bound for P
F
and P
D
can be obtained by
integrating to the North and East of the green curve. A upper bound for P
F
and P
D
can be
obtained by integrating to the North and East of the red curve.
The same expression holds for X
2
. Thus if we integrate instead over the region of X
1
>

2
m
ln() and X
2
>

2
m
ln() we will get a lower bound for P
F
and P
D
. To get an upper bound
we nd the point where the line X
1
= X
2
intersects the decision boundary. This is given at
the location of
2e
m

2
X
1
= or X
1
=

2
m
ln
_

2
_
.
Using this we can draw an integration region to compute an upper bound for P
F
and P
D
.
We would integrate over the (X
1
, X
2
) points to the North-East of the red curve in Figure 6.
The R script chap 2 prob 2.2.18.R performs plots needed to produce these gures.
Problem 2.2.19 (dierent means and covariances)
Part (1): For this problem, we have N samples from two dierent hypothesis each of which
has a dierent mean m
k
and variance
2
k
. Given these densities the LRT test for this problem
is given by
(R) =
p(R|H
1
)
p(R|H
0
)
=
N

i=1

1
exp
_

(R
i
m
1
)
2
2
2
1
+
(R
i
m
0
)
2
2
2
0
_
=
_

1
_
N
exp
_

1
2
2
1
N

i=1
(R
i
m
1
)
2
+
1
2
2
0
N

i=1
(R
i
m
0
)
2
_
=
_

1
_
N
exp
_

1
2
2
1
N

i=1
(R
2
i
2m
1
R
i
+ m
2
1
) +
1
2
2
0
N

i=1
(R
2
i
2R
i
m
0
m
2
0
)
_
=
_

1
_
N
exp
_

1
2
_
1

2
1

2
0
_
N

i=1
R
2
i
+
_
m
1

2
1

m
0

2
0
_
N

i=1
R
i

Nm
2
1
2
2
1
+
Nm
2
0
2
2
0
_
=
_

1
_
N
e
_

N
2
_
m
2
1

2
1

m
2
0

2
0
__
exp
_

1
2
_
1

2
1

2
0
_
l

+
_
m
1

2
1

m
0

2
0
_
l

_
.
We decide H
1
if the above ratio is greater than our threshold . The above can be written

1
2
_
1

2
1

2
0
_
l

+
_
m
1

2
1

m
0

2
0
_
l

> ln
_

0
_
N
e
_
N
2
_
m
2
1

2
1

m
2
0

2
0
___
. (86)
The above is the expression for a line in the (l

, l

) space. We take the right-hand-side of


the above expression be equal to (a parameter we can change to study dierent possible
detection trade os).
Part (2): If m
0
=
1
2
m
1
> 0 and
0
= 2
1
the above LRT, when written in terms of m
1
and

1
, becomes if
7m
1
8
2
1
l

3
8
2
1
l

> ,
then we decide H
1
. Note that this decision region is a line in (l

, l

) space. An example of
this decision boundary is drawn in Figure 7 for m
1
= 1,
1
=
1
2
, and = 4.
L_ALPHA
L
_
B
E
T
A
4

3 2 1 0 1 2 3

1
0
1
2
3
Figure 7: For m
1
= 1,
1
=
1
2
, and = 4, the points that are classied as H
1
are the ones
in the South-East direction across the black decision boundary.
Problem 2.2.20 (specications of dierent means and variances)
Part (1): When m
0
= 0 and
0
=
1
Equation 86 becomes
m
1

2
1
l

> .
If we assume that m
1
> 0 this is equivalent to l

>

2
1
m
1
, which is a vertical line in the l

, l

plane. Points to the right of the constant



2
1
m
1
are classied as H
1
and points to the left of
that point are classied as H
0
. To compute the ROC we have
P
F
=
_

2
1
m
1
p(L|H
0
)dL
P
D
=
_

2
1
m
1
p(L|H
1
)dL.
Recall that L in this case is l

N
i=1
R
i
we can derive the densities p(L|H
0
) and p(L|H
1
)
from the densities for R
i
in each case. Under H
0
each R
i
is a Gaussian with mean 0 and
variance
2
1
. Thus

N
i=1
R
i
is a Gaussian with mean 0 and variance N
2
1
. Under H
1
each R
i
is a Gaussian with a mean of m
1
and a variance
2
1
. Thus

N
i=1
R
i
is a Gaussian with mean
0.0 0.2 0.4 0.6 0.8 1.0
0
.
0
0
.
2
0
.
4
0
.
6
0
.
8
1
.
0
P_F
P
_
D
0.0 0.2 0.4 0.6 0.8 1.0
0
.
0
0
.
2
0
.
4
0
.
6
0
.
8
1
.
0
P_F
P
_
D
Figure 8: Left: The ROC curve for Problem 2.2.20. Part 1. Right: The ROC curve for
Problem 2.2.20. Part 2.
Nm
1
and variance N
2
1
. Thus we have
P
F
=
_

2
1
m
1
1
_
2N
2
1
e

1
2
L
2
N
2
1
dL
=
_

Nm
1
1

2
e

1
2
V
2
dV = erfc

_

1

Nm
1
_
P
D
=
_

2
1
m
1
1
_
2N
2
1
e

1
2
(LNm
1
)
2
N
2
1
dL
=
_

Nm
1

Nm
1

1
1

2
e

1
2
V
2
dV = erfc

_

1

Nm
1

Nm
1

1
_
.
In the R script chap 2 prob 2.2.20.R we plot the given ROC curve for m
1
= 1,
1
=
1
2
, and
N = 3. When this script is run it produces the plot given in Figure 8 (left).
Part (2): When m
0
= m
1
= 0,
2
1
=
2
s
+
2
n
, and
2
n
=
2
0
Equation 86 becomes

1
2
_
1

2
s
+
2
n

2
n
_
l

> ,
or simplifying some we get
l

>
2
2
n
(
2
s
+
2
n
)

2
s
. (87)
This is a horizontal line in the l

, l

plane. Lets dene the constant on the right-hand-side


as

. Points in the l

, l

plane above this constant are classied as H


1
and points below
this constant are classied as H
0
. To compute the ROC we have
P
F
=
_

p(L|H
0
)dL
P
D
=
_

p(L|H
1
)dL.
Recall that L in this case is l

N
i=1
R
2
i
. We can derive the densities p(L|H
0
) and p(L|H
1
)
from the densities for R
i
under H
0
and H
1
.
Under H
0
each R
i
is a Gaussian random variable with a mean of 0 and a variance of
2
n
.
Thus
1

2
n

N
i=1
R
2
i
is a chi-squared random variable with N degrees of freedom and we should
write Equation 87 as
l

2
n
>
2(
2
s
+
2
n
)

2
s
,
so
P
F
=
_

2(
2
s
+
2
n
)

2
s

p(L|H
0
)dL.
Here p(L|H
0
) is the chi-squared probability density with N degrees of freedom.
Under H
1
each R
i
is a Gaussian with a mean of 0 and a variance
2
s
+
2
n
. Thus
1

2
s
+
2
n

N
i=1
R
2
i
is another chi-squared random variable with N degrees of freedom and we should write
Equation 87 as
l

2
s
+
2
n
>
2
2
n

2
s
,
so
P
D
=
_

2
2
n

2
s

p(L|H
1
)dL.
Here p(L|H
1
) is again the chi-squared probability density with N degrees of freedom.
In the R script chap 2 prob 2.2.20.R we plot the given ROC curve for
n
= 1,
s
= 2, and
N = 3. When this script is run it produces the plot given in Figure 8 (right).
Problem 2.2.21 (error between to points)
Part (1): Since the book does not state exactly how we should compare the true impact
point denoted via (x, y, z) and either of the two target located at the points (x
0
, y
0
, z
0
) and
(x
1
, y
1
, z
1
). If we consider as our measurement the normalized squared distance between the
impact point and a target point say (x
0
, y
0
, z
0
) under hypothesis H
0
then the distribution of
this sum is given by a chi-squared distribution with three degrees of freedom. Thus if our
impact point is at (x, y, z) and we compute
D
2
=
(x x
i
)
2

2
+
(y y
i
)
2

2
+
(z z
i
)
2

2
, (88)
for i = 0, 1 to perform our hypothesis test we can use a chi-squared distribution for the
distributions p(D
2
|H
0
) and p(D
2
|H
1
).
If we want to use the distance (rather than the square distance) as our measure of deviation
between the impact point and one of the hypothesis points, it turns out that the Euclidean
distance between two points is given by a chi distribution (not chi-squared). That is if X
i
are normal random variables with means
i
and variance
2
i
then
Y =

_
N

i=1
_
X
i

i
_
2
,
is given by a chi distribution. The probability density function for a chi distribution looks
like
f(x; N) =
2
1
N
2
x
N1
e

x
2
2

_
N
2
_ . (89)
If we remove the mean
i
from the expression for Y we get the noncentral chi distribution.
That is if Z looks like
Z =

_
N

i=1
_
X
i

i
_
2
,
and we dene =
_

N
i=1
_

i
_
2
then the probability density function for Z is given by
f(x; N, ) =
e

x
2
+
2
2
x
N

(x)
N/2
IN
2
1
(x) , (90)
where I

(z) is the modied Bessel function of the rst kind. Since the chi distribution is
a bit more complicated than the chi-squared we will consider the case where we use and
use expression Equation 88 to measure distances. For reference, the chi-squared probability
density looks like
f(x; N) =
x
N
2
1
e

x
2
2
N
2

_
N
2
_ . (91)
Using this the LRT for H
1
against H
0
look like
p(R|H
1
)
p(R|H
0
)
=
_
(xx
1
)
2

2
+
(yy
1
)
2

2
+
(zz
1
)
2

2
_N
2
1
_
(xx
0
)
2

2
+
(yy
0
)
2

2
+
(zz
0
)
2

2
_N
2
1
exp
_

1
2
_
(x x
1
)
2

2
+
(y y
1
)
2

2
+
(z z
1
)
2

2
_
+
1
2
_
(x x
0
)
2

2
+
(y y
0
)
2

2
+
(z z
0
)
2

2
__
=
_
(x x
1
)
2
+ (y y
1
)
2
+ (z z
1
)
2
(x x
0
)
2
+ (y y
0
)
2
+ (z z
0
)
2
_
N
2
1
exp
_

1
2
2
_
(x x
1
)
2
(x x
0
)
2
+ (y y
1
)
2
(y y
0
)
2
+ (z z
1
)
2
(z z
0
)
2
_
_
.
We would decide H
1
when this ratio is greater than a threshold .
Part (2): The time variable is another independent measurement and the density function
for the combined time and space measurement would simply be the product of the spatial
density functions for H
1
and H
0
discussed above and the Gaussian for time.
Problem 2.3.1 (the dimension of the M hypothesis Bayes test)
Part (1): The general M hypothesis test is solved by computing the minimum of M expres-
sions as demonstrated in Equation 16. As there are M expressions to nd the minimum of
these M expressions requires M1 comparisons i.e. we have a decision space that is M1.
Part (2): In the next problem we show that the decision can make based on
i
, which can
be computed in terms
k
(R) when we divide by p(R|H
0
).
Problem 2.3.2 (equivalent form for the Bayes test)
Part (1): When we use P
j
p(R|H
j
) = p(H
j
|R)p(R) we can write Equation 11 as
R =
M1

i=0
M1

j=0
C
ij
_
Z
i
p(H
j
|R)p(R)dR =
M1

i=0
_
Z
i
p(R)
M1

j=0
C
ij
p(H
j
|R)dR.
Given a sample R the risk is given by evaluating the above over the various Z
i
s. We can
make this risk as small as possible by picking Z
i
such that it only integrates over points R
where the integrand above is as small as possible. That means pick R to be from class H
i
if
p(R)
M1

j=0
C
ij
p(H
j
|R) < p(R)
M1

j=0
C
i

j
p(H
j
|R) for all i

= i .
We can drop p(R) from both sides to get the optimal decision to pick the smallest value of
(over i)
M1

j=0
C
ij
p(H
j
|R) .
This is the denition of
i
and is what we minimize to nd the optimal decision rule.
Part (2): When the costs are as given we see that

i
=
M1

j=0;j=i
Cp(H
j
|R) = C(1 p(H
i
|R)) .
Thus minimizing
i
is the same as maximizing p(H
i
|R) as a function of i.
Problem 2.3.3 (Gaussian decision regions)
Part (1): The minimum probability of error is the same as picking the class/hypothesis
corresponding to the largest a-posterior probability. Thus we need to compute
p(H
k
|R) for k = 1, 2, 3, 4, 5 .
Since each hypothesis is equally likely the above is equivalent to picking the class with the
maximum likelihood or the expression p(R|H
k
) which is the largest. The decision boundaries
are then the points where two likelihood functions p(R|H
k
) and p(R|H
k
) meet. For example,
the decision boundary between H
1
and H
2
is given when p(R|H
1
) = p(R|H
2
) or when we
simplify that we get
|R + 2m| = |R + m| .
By geometry we must have R+m < 0 and R+2m > 0 so we need to solve R+2m = (R+m)
or R =
3
2
m. In general, the decision boundary will be the midpoint between each Gaussians
mean and are located at
3m
2
,
m
2
,
m
2
, and
3m
2
.
Part (2): We can compute the probability of error in the following way
Pr() = 1 P(correct decision)
= 1
_

3m
2

p(R|H
1
)dR
_

m
2

3m
2
p(R|H
2
)dR

_
+
m
2

m
2
p(R|H
3
)dR
_
+
3m
2
m
2
p(R|H
4
)dR
_

3m
2
p(R|H
5
)dR.
These integrals could be evaluated by converting to expressions involving the error func-
tion.
Problem 2.3.4 (Gaussians with dierent variances)
Part (1): As in Problem 2.3.4 the requested criterion is equivalent to the maximum likeli-
hood classication. That is given R we pick the hypothesis H
i
such that p(R|H
i
) is largest.
Part (2): If
2

= 2
2

and

= m our conditional densities look like


p(R|H
1
) =
1

2m
exp
_

R
2
2m
2
_
p(R|H
2
) =
1

2m
exp
_

(R m)
2
2m
2
_
p(R|H
3
) =
1

2(

2m)
exp
_

R
2
2(2m
2
)
_
.
If we plot these three densities on the R-axis for m = 2.5 we get the plot in Figure 9. We
now need to calculate the decision boundaries or the points where we would change the
10 5 0 5 10 15
0
.
0
0
0
.
0
5
0
.
1
0
0
.
1
5
x
H
1
Figure 9: The three densities for Problem 2.3.4 for m = 2.5. Here p(R|H
1
) is plotted in red,
p(R|H
2
) is plotted in black, and p(R|H
3
) is plotted in green. The hypothesis, H
i
, that is
chosen for any value of R corresponds to the index of the largest likelihood p(R|H
i
) at the
given point R. The decision regions are drawn in the gure as light gray lines.
decision made. From the plot it looks like the classication is to pick green or H
3
then
red or H
1
, then black or H
2
, and nally green or H
3
again. Each of these points where
the decision changes is given by solving equations of the form p(R|H
i
) = p(R|H
j
). We will
compute these decision boundaries here. For the rst decision boundary we need to nd R
such that p(R|H
3
) = p(R|H
1
) or
1

2
exp
_

R
2
2(2m
2
)
_
= exp
_

R
2
2m
2
_
.
When we take the logarithm and solve for R we nd R = m
_
2 ln(2) = 2.943525 when
we consider with m = 2.5. Lets denote this numeric value as m
13
. For the second decision
boundary we need to nd R such that p(R|H
1
) = p(R|H
2
) or

R
2
2m
2
=
(R m)
2
2m
2
or R =
m
2
.
We will denote this boundary as m
12
. Finally, for the third decision boundary we need to
nd R such that p(R|H
2
) = p(R|H
3
) or
exp
_

(R m)
2
2m
2
_
=
1

2
exp
_

R
2
2(2m
2
)
_
.
This can be written as
R
2
4mR + 2m
2
(1 ln(2)) = 0 ,
or solving for R we get
R = 2m

2m
_
1 + 8 ln(2) = {4.045, 14.045} ,
when m = 2.5. We would take the larger value (and denote it m
23
) since we are looking for
a positive decision boundary.
Part (3): To evaluate this probability we will compute
Pr() = 1 Pr(correct) ,
with Pr(correct) given by
Pr(correct) =
_
m
13

p(R|H
3
)dR +
_
m
12
m
13
p(R|H
1
)dR+
_
m
23
m
12
p(R|H
2
)dR +
_

m
23
p(R|H
3
)dR.
Some of this algebra is done in the R code chap 2 prob 2.3.4.R.
Problem 2.3.5 (hypothesis testing with 2 dimensional densities)
Part (1): Since this is a three class problem we can directly use the results where the decision
region is written in terms of likelihood ratios
1
(R) and
2
(R) as found in Equations 17, 18,
and 19. To do that we need to compute these likelihood ratios. We nd

1
(R)
p(R|H
2
)
p(R|H
1
)
=

11

21

12

22
exp
_

1
2
_
R
2
1

2
12
+
R
2
2

2
22

R
2
1

2
11

R
2
2

2
21
__
=

2
n

n
_

2
s
+
2
n
exp
_

1
2
_
1

2
s
+
2
n

2
n
_
l
1

1
2
_
1

2
n

2
n
_
l
2
_
=

n
_

2
s
+
2
n
exp
_

2
s
2
2
n
(
2
s
+
2
n
)
l
1
_
.
For
2
(R) we have

2
(R)
p(R|H
3
)
p(R|H
1
)
=

11

21

13

23
exp
_

1
2
_
R
2
1

2
13
+
R
2
2

2
23

R
2
1

2
11

R
2
2

2
21
__
=

2
n

n
_

2
s
+
2
n
exp
_

1
2
_
1

2
n

2
n
_
l
1

1
2
_
1

2
s
+
2
n

2
n
_
l
2
_
=

n
_

2
s
+
2
n
exp
_

2
s
2
2
n
(
2
s
+
2
n
)
l
2
_
.
The class priors are specied as P
0
= 1 2p and P
1
= P
2
= p so using Equations 17, 18,
and 19 we have that the decision region is based on
if p(1 0)
1
(R) > (1 2p)(1 0) + p( 1)
2
(R) then H
1
or H
2
else H
0
or H
2
if p(1 0)
2
(R) > (1 2p)(1 0) + p( 1)
1
(R) then H
2
or H
1
else H
0
or H
1
if p( 0)
2
(R) > (1 2p)(1 1) + p( 1)
1
(R) then H
2
or H
0
else H
1
or H
0
.
Simplifying these some we get
if
1
(R) >
1
p
2 + ( 1)
2
(R) then H
1
or H
2
else H
0
or H
2
if
2
(R) >
1
p
2 + ( 1)
1
(R) then H
2
or H
1
else H
0
or H
1
if
2
(R) >
1
(R) then H
2
or H
0
else H
1
or H
0
.
To nd the decision regions in the l
1
l
2
plane we can rst make the inequalities above
equal to equalities to nd the decision boundaries. One decision boundary is easy, when we
put in what we know for for
1
(R) and
2
(R) the last equation becomes l
2
= l
1
, which is a
diagonal line in the l
1
l
2
space. To determine the full decision boundaries lets take values
for p and say p = 0.25 and = 0.75 and plot the rst two decision boundaries above. This
is done in Figure 10. Note that the rst and second equation are symmetric (give the other
equation) if we switch l
1
and l
2
. Thus the decision boundary in the l
1
l
2
space expressed
by these two expressions is the same. Based on the inequalities in the above expressions we
would get the classication regions given in Figure 10. Some of this algebra and our plots is
performed in the R code chap 2 prob 2.3.5.R.
l_1
l_
2
0
10 5 0 5 10

1
0

5
0
5
1
0
0
0

H_1 H_0
H_0
H_2
Figure 10: The decision boundaries for Problem 2.3.5 with the nal classication label
denoted in each region.
Problem 2.3.6 (a discrete M class decision problem)
Part (1): We are given
Pr(r = n|H
m
) =
k
n
m
n!
e
km
=
(km)
n
e
km
n!
,
for m = 1, 2, 3, . . . , M and k a xed positive constant. Thus each hypothesis involves the
expectation of progressively more samples being observed. For example, the mean number
of events for the hypothesis H
1
, H
2
, H
3
, H
4
, . . . are k, 2k, 3k, 4k, . . . . Since each hypothesis
is equally likely and the coecients of the errors are the same the optimal decision criterion
corresponds to picking the hypothesis with the maximum likelihood i.e. we pick H
m
if
Pr(r = n|H
m
) =
(km)
n
e
km
n!
Pr(r = n|H
m
) =
(km

)
n
e
km

n!
, (92)
for all m

= m. For the value of k = 3 we plot Pr(r = n|H


m
) as a function of n for
several values of m in Figure 11. In that plot you can see that for progressively larger values
of r we would select larger hypothesis values H
m
. In the plot presented, as r, increased
we would decide on the hypothesis H
1
, H
2
, H
3
, . Our decision as to when the selected
hypothesis changes is when two sequential likelihood functions have equal values. If we take
the inequality in Equation 92 as an equality, the boundaries of the decision region between
two hypothesis m and m

are where (canceling the factor n!)


(km)
n
e
km
= (km

)
n
e
km

.
0 5 10 15 20
0
.
0
0
0
.
0
5
0
.
1
0
0
.
1
5
0
.
2
0
r
H
1
H_1
H_2
H_3
H_4
H_5
Figure 11: Plots of the likelihoods Pr(r = n|H
m
) as a function of r in Problem 2.3.6 for
various hypothesis or values of m. The optimal decision (given a measurement r) is to take
the hypothesis that has the largest likelihood.
Solving for n (the decision boundary between H
m
and H
m
) in the above we get
n =
k(mm

)
ln
_
m
m

_ .
Since the change in hypothesis happens between sequential H
m
and H
m
we have that m =
m

+ 1 and the above simplies to


n
m

m
=
k
ln
_
1 +
1
m

_ .
We can evaluate this expression for m

= 1 to nd the locations where we switch between


H
1
and H
2
, for m

= 2 to nd the location where we switch between H


2
and H
3
, etc. For
m

= 1, 2, 3, 4, 5 (here M = 5) we nd
[1] 4.328085 7.398910 10.428178 13.444260 16.454445
These numbers match very well the cross over points in Figure 11. Since the number of counts
received must be a natural number we would round the numbers above to the following
H
1
: 1 r 4
H
2
: 5 r 7
H
3
: 8 r 10
.
.
.
H
m
:
_
k
ln
_
1 +
1
m1
_
_
+ 1 r
_
k
ln
_
1 +
1
m
_
_
(93)
.
.
.
H
M
:
_
k
ln
_
1 +
1
M1
_
_
r .
Part (2): We would calculate Pr() = 1 Pr(correct) where Pr(correct) is calculated by
summing over regions in r where we would make the correct classication. This is given by
Pr(correct) =
M

m=1

nZm
Pr(r = n|H
m
)
=
4

n=1
Pr(r = n|H
1
) +
7

n=5
Pr(r = n|H
2
) +
10

n=8
Pr(r = n|H
3
) + +
nu

n=n
l
Pr(r = n|H
m
) ,
where the lower and upper summation endpoints n
l
and n
u
are based on the decision bound-
ary computed as given in Equation 93.
Problem 2.3.7 (M-hypothesis classication dierent mean vectors)
Part (1): For a three class problem we need to consider Equations 17, 18, and 19 to decide
the decision boundaries. Thus we should compute

1
(R) =
p(R|H
1
)
p(R|H
0
)
=
exp
_

1
2
(rm
1
)
T
(rm
1
)

2
_
exp
_

1
2
(rm
0
)
T
(rm
0
)

2
_
= exp
_

1
2
2
_
(r m
1
)
T
(r m
1
) (r m
0
)
T
(r m
0
)

_
= exp
_

1
2
2
_
2r
T
(m
0
m
1
) + m
T
1
m
1
m
T
0
m
0

_
= exp
_

1
2
2
_
m
T
1
m
1
m
T
0
m
0
_
_
exp
_
1

2
r
T
(m
1
m
0
)
_
= exp
_

1
2
2
_
m
T
1
m
1
m
T
0
m
0
_
_
e
l
1
,
with l
1
dened as
l
1
=
3

i=1
c
i
r
i
=
1

2
3

i=1
(m
1i
m
0i
)r
i
so c
i
=
1

2
(m
1i
m
0i
) .
In the same way we have

2
(R) =
p(R|H
2
)
p(R|H
0
)
= exp
_

1
2
2
_
m
T
2
m
2
m
T
0
m
0
_
_
exp
_
1

2
r
T
(m
2
m
0
)
_
= exp
_

1
2
2
_
m
T
2
m
2
m
T
0
m
0
_
_
e
l
2
,
with l
2
dened as
l
2
=
3

i=1
d
i
r
i
=
1

2
3

i=1
(m
2i
m
0i
)r
i
so d
i
=
1

2
(m
2i
m
0i
) .
Part (2): For the given cost assignment given here when we use Equations 17, 18, and 19
(and cancel out the common cost) we have
P
1

1
(R) > P
0
P
2

2
(R) then H
1
or H
2
else H
0
or H
2
P
2

2
(R) > P
0
then H
2
or H
1
else H
0
or H
1
P
2

2
(R) > P
0
+ P
1

1
(R) then H
2
or H
0
else H
1
or H
0
.
To make the remaining problem simpler, we will specify values for P
0
= P
1
= P
2
=
1
3
and
values for m
0
, m
1
, m
2
, and so that every expression in the decision boundary can be
evaluated numerically. This is done in the R script chap 2 prob 2.3.6.R and the result
plotted in Figure 12.
Problem 2.4.1 (estimation in r = ab + n)
Part (1): The maximum a posterior estimate of a is derived from Bayes rule
p(A|R) =
p(R|A)p(A)
p(R)
,
from which we see that we need to compute p(R|A) and p(A). Now when the variable a is
given the expression we are observing or r = ab + n is the sum of two independent random
variables ab and n. These two terms are independent normal random variables with zero
means and variances a
2

2
b
and
2
n
respectively. The variance of r is then the sum of the
variance of ab and n. With these arguments we have that the densities we need in the above
given by
p(R|A) =
1

2(A
2

2
b
+
2
n
)
1/2
exp
_

R
2
2(A
2

2
b
+
2
n
)
_
p(A) =
1

2
a
exp
_

A
2
2
2
a
_
.
l_1
l_
2
0
10 5 0 5 10

1
0

5
0
5
1
0
0 0
H_0 H_1
H_1 H_2
Figure 12: Plots of decision regions for Problem 2.3.6. The regions into which we would
classify each set are labeled.
Note to calculate our estimate a
map
(R) we dont need to explicitly calculate p(R), since it is
not a function of A. From the above densities
ln(p(R|A)) =
1
2
ln(2)
1
2
ln(A
2

2
b
+
2
n
)
R
2
2(A
2

2
b
+
2
n
)
and
ln(p(A)) =
1
2
ln(2) ln(
a
)
A
2
2
2
a
,
We now need to nd the value A such that
ln(p(A|R))
A
= 0 .
This equation is

A
2
b
A
2

2
b
+
2
n
+
A
2
b
R
2
(A
2

2
b
+
2
n
)
2

A

2
a
= 0 .
One solution to the above is A = 0. If A = 0 and let v A
2

2
b
+
2
n
then the equation above
is equivalent to

2
b
v
+

2
b
R
2
v
2

1

2
a
= 0 ,
or
v
2
+
2
a

2
b
v
2
a

2
b
R
2
= 0 .
Using the quadratic equation the solution for v is given by
v =

2
a

2
b

_

4
a

4
b
+ 4
2
a

2
b
R
2
2
=

2
a

2
b

2
a

2
b
_
1 +
4R
2

2
a

2
b
2
.
From the denition of v we expect v > 0 thus we must take the expression with the positive
sign. Now that we know v we can solve for A
2

2
b
+
2
n
= v for A from which we would nd
two additional solutions for a
map
(R).
Part (2): If we desire to simultaneously nd a
map
and

b
map
then we must treat our two
unknowns a and b in a vector (a, b) and then nd them both at the same time. Thus our
unknown A is now the vector (a, b) and we form the log-likelihood l given by
l = ln(p(R|A, B)) + ln(p(A, B)) ln(p(R))
= ln(p(R|A, B)) + ln(p(A)) + ln(p(B)) ln(p(R))
= ln
_
1

2
n
e

1
2
(RAB)
2

2
n
_
+ ln
_
1

2
a
e

1
2
A
2

2
a
_
+ ln
_
1

2
b
e

1
2
B
2

2
b
_
ln(p(R)) .
We want to maximize l with respect to A and B so we take derivatives of l with respect to
A and B, set the result equal to zero, and then solve for A and B. The A and B derivative
of l are given by
(R AB)B

2
n

2
b
= 0
(R AB)A

2
n

2
a
= 0 .
One solution to these equations is A = B = 0. This is the only solution for if A and B are
both non-zero then the above two equations become
R AB =

2
n

2
b
and R AB =

2
n

2
a
,
thus R AB is equal to two dierent things implying contradicting equations. Even if

2
b
=
2
a
so that the two equations above are identical, we would then have only one equation
for the two unknowns A and B. Thus it looks like in this case that the only solutions are
a
map
=

b
map
= 0. This is dierent than the estimate of a
map
we obtained in Part (1) above.
Part (3-a): For the expression for r = a +

k
i=1
b
i
+ n we have that
p(R|A) =
1

2(k
2
b
+
2
n
)
1/2
exp
_

(R A)
2
2(k
2
b
+
2
n
)
_
.
Thus
ln(p(R|A)) = ln
_
1

2(k
2
b
+
2
n
)
1/2
_

(R A)
2
2(k
2
b
+
2
n
)
.
With l ln(p(R|A)) + ln(p(A)) ln(p(R)) taking the A derivative of l, and setting it equal
to zero, gives
R A
k
2
b
+
2
n

2
a
= 0 .
If we solve the above for A we nd
A =

2
a
R

2
a
+ k
2
b
+
2
n
, (94)
for the expression for a
map
(R).
Part (3-b): In this case we have A = [a, b
1
, b
2
, , b
k1
, b
k
]
T
and the densities we need are
given by
p(R|A) =
1

2
n
exp
_

(R A

l
i=1
B
i
)
2
2
2
n
_
p(A) =
_
1

2
a
e

A
2
2
2
a
_
_
l

i=1
1

2
b
e

B
2
i
2
2
b
_
.
Then computing the log-likelihood l ln(p(R|A)) + ln(p(A)) ln(p(R)) we nd
l =
(R A

l
i=1
B
i
)
2
2
2
n

A
2
2
2
a

1
2
2
b
l

i=1
B
2
i
+ constants .
If we take the derivative of l with respect to A and set it equal to zero we get
1

2
n
_
R A
l

i=1
B
i
_

2
a
= 0 .
If we take the derivative of l with respect to B
i
and set it equal to zero we get
1

2
n
_
R A
l

=1
B
i

B
i

2
b
= 0 ,
for i = 1, 2, , k 1, k. As a system of equations this is
_

_
_
1

2
n
+
1

2
a
_
1

2
n
1

2
n

1

2
n
1

2
n
1

2
n
_
1

2
n
+
1

2
b
_
1

2
n

1

2
n
1

2
n
1

2
n
1

2
n
_
1

2
n
+
1

2
b
_

1

2
n
1

2
n
.
.
.
.
.
.
.
.
.
1

2
n
1

2
n
1

2
n

_
1

2
n
+
1

2
b
_
1

2
n
1

2
n
1

2
n
1

2
n

1

2
n
_
1

2
n
+
1

2
b
_
_

_
_

_
A
B
1
B
2
.
.
.
B
k1
B
k
_

_
=
R

2
n
_

_
1
1
1
.
.
.
1
1
_

_
This system must have a solution that depends on k (the number of b
i
terms). While it
might be hard to solve the full system above analytically we can solve smaller systems, look
for a pattern and hope it generalizes. If anyone knows how to solve this system analytically
as is please contact me. When k = 1 we have the system
_
1

2
n
+
1

2
a
1

2
n
1

2
n
1

2
n
+
1

2
b
_
_
A
B
1
_
=
R

2
n
_
1
1
_
.
Solving for A and B we get
A =

n

b
R

b
+
b

n
+
a

n
and B
1
=

n

a
R

b
+
b

n
+
a

n
.
Here we have introduced the precision which is dened as one over the variance. For
instance we have
a

1

2
a
,
b

1

2
b
, and
n

1

2
n
. If we do the same thing when k = 2 we nd
solutions for A, B
1
, and B
2
given by
A =

n

b
R

b
+ 2
a

n
+
b

n
and B
1
= B
2
=

n

a
R

b
+ 2
a

n
+
b

n
.
If we do the same thing for k = 3 we get
A =

n

b
R

b
+ 3
a

n
+
b

n
and B
1
= B
2
= B
3
=

n

a
R

b
+ 3
a

n
+
b

n
.
For general k it looks like the solution is
A =

n

b
R

b
+ k
a

n
+
b

n
and B
1
= B
2
= = B
k1
= B
k
=

n

a
R

b
+ k
a

n
+
b

n
.
See the Mathematical le chap 2 prob 2 4 1.nb where we solve the above linear system
for several values of k. If we take the above expression for A (assuming k terms B
i
in the
summation) and multiply by
1
na
b
on top and bottom of the fraction we get
A =

2
a
R
1
n
+
k

b
+
1
a
=

2
a
R

2
n
+ k
2
b
+
2
a
,
which is the same as Equation 94.
Problem 2.4.2 (a rst reproducing density example)
Part (1): Before any measurements are made we are told that the density of is a gamma
distribution dened by
p(|n

, l

) =
l

n
(n

)
e
l

n1
for 0 , (95)
and is zero for < 0. This density has an expectation and variance [1] given by
E() =
n

(96)
Var() =
n

2
. (97)
Part (2): After the rst measurement is made we need to compute the a posterior distri-
bution of using Bayes rule
p(|X) =
p(X|)p()
p(X)
.
For the densities given we have
p(|X) =
1
p(X)
_
e
X
l

n
(n

)
e
l

n1
_
=
_
l

n
p(X)(n

)
_
e
(l+X)

n
.
In the above expression the factor p(X) just contributes to the normalization constant in
that p(|X) when integrated over valid values of should evaluate to one. Now notice that
the functional form for p(|X) derived above is of the same form as a gamma distribution
just like the priori distribution p() was. That is (with the proper normalization) we see
that p(|X) must be
p(|X) =
(l

+ X)
n+1
(n

+ 1)
e
(l+X)

n
.
Then

ms
is just the mean of with the above density. From Equations 96 and 97 we see
that for the a posteriori density that we have here

ms
=
n

+ 1
l

+ X
and E[(

ms
)
2
] =
n

+ 1
(l

+ X)
2
.
Part (3): When we now have several measurements X
i
the calculation of the a posteriori
density is as follows
p(|X) =
p(X|)p()
p(X)
=
1
p(X)
_
n

i=1
p(X
i
|)
_
p() =
1
p(X)
_
n

i=1
e
X
i
_
_
l

n
(n

)
e
l

n1
_
=
l

n
p(X)(n

)
exp
_

_
l

+
n

i=1
X
i
__

n+n1
.
Again we recognize this is a gamma distribution with dierent parameters so the normaliza-
tion constant can be determined and the full a posteriori density is
p(|X) =
(l

n
i=1
X
i
)
n+n
(n + n

)
exp
_

_
l

+
n

i=1
X
i
__

n+n1
.
Again using Equations 96 and 97 our estimates of and its variance are now given by

ms
=
n

+ n
l

n
i=1
X
i
and E[(

ms
)
2
] =
n

+ n
(l

n
i=1
X
i
)
2
.
Part (4): We now compute

map
for the density p(|X) computed in Part 3 above. Taking
the derivative of ln(p(|X)) and using the l

and n

notation in the book we have


d
d
ln(p(|X)) =
d
d
ln
_
l

(n

)
e
l

1
_
=
d
d
[l

+ (n

1) ln()] = l

+
(n

1)

.
Setting this expression equal to zero and solving for then gives

map
=
n

1
l

=
n

+ n 1
l

n
i=1
X
i
.
Note that this is dierent from

ms
computed in Part (3) by the term
1
l+

n
i=1
X
i
.
Problem 2.4.3 (normal conjugate densities)
Part (1): If we assume that the priori distribution of a is N
_
m
0
,
n
k
0
_
, then the a posteriori
density is given using Bayes rule and the density for p(R|A). We nd
p(A|R) =
p(R|A)p(A)
p(R)
=
1
p(R)
_
1

2n
exp
_

(R A)
2
2
2
n
__
_
_
_
_
k
0

2n
exp
_
_
_
_

(A m
0
)
2
2
_

2
n
k
2
0
_
_

_
_
_
_
_
=
k
0
2
2
n
p(R)
exp
_

1
2
2
n
_
(R A)
2
+k
2
0
(A m
0
)
2
_
_
= exp
_

1
2
2
n
_
A
2
2RA +R
2
+ k
2
0
A
2
2k
2
0
m
0
A+ k
2
0
m
2
0
_
_
=

exp
_

1
2
2
n
_
(1 + k
2
0
)A
2
2(R + k
2
0
m
0
)A
_
_
=

exp
_
_
_
_
_
_
_

1
2
_

2
n
1+k
2
0
_
_
A
2
2
_
R +k
2
0
m
0
1 + k
2
0
_
A
_
_
_
_
_
_
_
_
=

exp
_
_
_
_
_
_
_

1
2
_

2
n
1+k
2
0
_
_
A
2
2
_
R + k
2
0
m
0
1 +k
2
0
_
A+
_
R +k
2
0
m
0
1 + k
2
0
_
2

_
R + k
2
0
m
0
1 +k
2
0
_
2
_
_
_
_
_
_
_
_
=

exp
_
_
_
_
_
_
_

1
2
_

2
n
1+k
2
0
_
_
A
_
R +k
2
0
m
0
1 + k
2
0
__
2
_
_
_
_
_
_
_
.
Here ,

, and

are constants that dont depend on A. Notice that this expression is a


normal density with a mean and variance given by
m
1
=
R + k
2
0
m
0
1 + k
2
0
and
2
1
=

2
n
1 + k
2
0
.
Part (2): If we now have N independent observations of Rdenoted R
1
, R
2
, R
3
, , R
N1
, R
N
then the likelihood p(R|A) is now given by the product of the individual densities or
p(R|A) =
N

i=1
1

2
n
exp
_

(R
i
A)
2
2
2
n
_
.
Using this the a posteriori density is given by
p(A|R) = exp
_

1
2n
N

i=1
(R
i
A)
2
_
exp
_
_
_
_

(A m
0
)
2
2
_

2
n
k
2
0
_
_

_
= exp
_

1
2n
_
N

i=1
(R
2
i
2AR
i
+ A
2
)
__
exp
_
_
_
_

(A
2
2Am
0
+m
2
0
)
2
_

2
n
k
2
0
_
_

_
=

exp
_

1
2n
_
NA
2
2A
N

i=1
R
i
__
exp
_
_
_
_

(A
2
2Am
0
)
2
_

2
n
k
2
0
_
_

_
=

exp
_

1
2
_
_
N

2
n
+
k
2
0

2
n
_
A
2
2
_

N
i=1
R
i

2
n
+
m
0
k
2
0

2
n
_
A
__
=

exp
_
_
_
_

1
2
_

2
n
N+k
2
0
_
_
A
2

2
n
_

2
n
N + k
2
0
_
_
N

i=1
R
i
+ m
0
k
2
0
_
A
_
_

_
=

exp
_
_
_
_

1
2
_

2
n
N+k
2
0
_
_
A
_
1
N +k
2
0
_
_
N

i=1
R
i
+m
0
k
2
0
__
2
_

_
.
When we put the correct normalization we see that the above is a normal density with mean
and variance given by
m
N
=
m
0
k
2
0
+ N
_
1
N

N
i=1
R
i
_
k
2
0
+ N
and
2
N
=

2
n
N + k
2
0
.
Problem 2.4.4 (more conjugate priors)
Part (1): For the densities given we nd that
p(A|R) =
p(R|A)p(A)
p(R)
=
1
p(R)
_
N

i=1
A
1/2
(2)
1/2
exp
_

A
2
(R
i
m)
2
_
_
_
cA
k
1
2
1
exp
_

1
2
Ak
1
k
2
__
=
_
c
p(R)(2)
N/2
_
A
N
2
+
k
1
2
1
exp
_

1
2
Ak
1
k
2

A
2
N

i=1
(R
i
m)
2
_
=
_
c
p(R)(2)
N/2
_
A
N
2
+
k
1
2
1
exp
_

1
2
A
_
k
1
k
2
+
N

i=1
(R
i
m)
2
__
. (98)
This will have the same form as the a priori distribution for A with parameters k

1
and k

2
if
k

1
2
1 =
k
1
+ N
2
1 or k

1
= k
1
+ N ,
and
k

1
k

2
= k
1
k
2
+
N

i=1
(R
i
m)
2
.
Solving for k

2
(since we know the value of k

1
) we get
k

2
=
1
k

1
_
k
1
k
2
+ N
_
1
N
N

i=1
(R
i
m)
2
__
.
These are the same expressions in the book.
Part (2): To nd a
ms
we must nd the conditional mean of the density given by Equation 98
which is of the same form of the a priori distribution (but with dierent parameters). From
the above, the a posteriori distribution has the form
p(A|k

1
, k

2
) = cA
k

1
2
1
exp
_

1
2
Ak

1
k

2
_
.
As the gamma distribution looks like
p(A|, r) =
1
(r)
(A)
r1
e
A
, (99)
we see that the a posteriori distribution, computed above, is also a gamma distribution with
=
1
2
k

1
k

2
and r =
k

1
2
. Using the known expression for the mean of a gamma distribution in
the form of Equation 99 we have
E[A] =
r

=
k

1
2
1
2
k

1
k

2
=
1
k

2
.
Thus the conditional mean of our a posteriori distribution is given by
a
ms
=
k

1
k
1
k
2
+ Nw
=
k
1
+ N
k
1
k
2
+ Nw
.
Problem 2.4.5 (recursive estimation)
Part (1): Recall that from Problem 2.4.3 we have that with K observations of R
i
that
p(A|R) = p(A|{R
i
}
K
i=1
) is a normal density with a mean m
K
and a variance
2
K
given by
m
K
=
m
0
k
2
0
+ Kl
K + k
2
0
and
2
K
=

2
n
K + k
2
0
with l =
1
K
K

i=1
R
i
. (100)
To make the a priori model for this problem match the one from Problem 2.4.3 we need to
take m
0
= 0 and
n
k
0
=
a
or k
0
=
n
a
. Thus the mean m
K
of the a posteriori distribution for
a becomes (after K measurements)
m
K
= a
ms
=
K
_
1
K

K
i=1
R
i
_
K +

2
n

2
a
=
_
1
1 +

2
n
K
2
a
__
1
K
K

i=1
R
i
_
. (101)
Part (2): The MAP estimator takes the priori information on a to be innitely weak i.e.

2
a
+. If we take that limit in Equation 101 we get a
map
=
1
K

K
i=1
R
i
.
Part (3): The mean square error is the integral of the conditional variance over all possible
values for R i.e. the books equation 126 we get
R
ms
=
_

dRp(R)
_

[A a
ms
(R)]p(A|R)dA =
_

dRp(R)
_

2
n
K + k
2
0
_
=

2
n
K + k
2
0
,
or the conditional variance again (this true since the conditional variance is independent of
the measurement R).
Part (4-a): Arguments above show that each density over only the rst j measurements or
p(A|{R
i
}
j
i=1
) should be normal with a mean and a variance which we will denote by
a
j
(R
1
, R
2
, , R
j
) and
2
j
.
We can now start a recursive estimation procedure with our initial mean and variance esti-
mates given by a
0
= 0 and
2
0
=
2
a
and using Bayes rule to link mean and variance estimates
as new measurements arrive. For example, we will use
p(A|R
1
, R
2
, , R
j1
, R
j
) =
p(A|R
1
, R
2
, , R
j1
)p(R
j
|A, R
1
, R
2
, , R
j1
)
p(R
1
, R
2
, , R
j1
, R
j
)
=
p(A|R
1
, R
2
, , R
j1
)p(R
j
|A)
p(R
1
, R
2
, , R
j1
, R
j
)
.
This is (using the densities dened above)
p(A|R
1
, R
2
, , R
j1
, R
j
) =
1
p(R
1
, , R
j
)
_
1

2
j1
exp
_

1
2
2
j1
(A a
j1
)
2
__
_
1

2n
exp
_

1
2
2
n
(A R
j
)
2
__
=
1
2
j1
np(R
1
, , R
j
)
exp
_

1
2
2
j1
(A
2
2 a
j1
A + a
2
j1
)
1
2
2
n
(A
2
2R
j
A+ R
2
j
)
_
= exp
_

1
2
__
1

2
j1
+
1

2
n
_
A
2
2
_
a
j1

2
j1
+
R
j

2
n
_
A
__
= exp
_
_
_
_
_
_
_

1
2
_
1

2
j1
+
1

2
n
_
_
_
_
_
A
2
2
_
a
j1

2
j1
+
R
j

2
n
_
_
1

2
j1
+
1

2
j
_A
_

_
_
_
_
_
_
_
_
.
From the above expression we will dene the variance of p(A|{R
i
}
j
i=1
) or
2
j
as
1

2
j

2
j1
+
1

2
n
so
2
j
=

2
j1

2
n

2
n
+
2
j1
. (102)
Using this the above becomes
p(A|R
1
, R
2
, , R
j1
, R
j
) =

exp
_

1
2
2
j
_
A
_
a
j1

2
j1
+
R
j

2
n
_

2
j
_
2
_
.
With the correct normalization this is a normal density with a mean given by
a
j
(R
1
, R
2
, , R
j1
, R
j
) =
_
a
j1

2
j1
+
R
j

2
n
_

2
j
=

2
n

2
n
+
2
j1
a
j1
(R
1
, R
2
, , R
j1
) +

2
j1

2
n
+
2
j1
R
j
.
This expresses the new mean a
j
as a function of a
j1
,
2
j1
and the new measurement R
j
.
From Equation 102 with j = 1 and
2
0
=
2
a
when we iterate a few of these terms we nd
1

2
1
=
1

2
a
+
1

2
n
1

2
2
=
1

2
1
+
1

2
n
=
1

2
a
+
2

2
n
1

2
3
=
1

2
a
+
2

2
n
+
1

2
n
=
1

2
a
+
3

2
n
.
.
.
1

2
j
=
1

2
a
+
j

2
n
.
Problem 2.4.6
Part (1): Using Equation 29 we can write our risk as
R( a|R) =
_

C( a
ms
a + Z)p(Z|R)dZ .
Introduce the variable Y = a
ms
a + Z and the above becomes
_

C(Y )p(Y + a a
ms
|R)dY , (103)
which is the books P.4.
Part (2): We rst notice that due to symmetry the MS risk is
R( a
ms
|R) =
_

C(Z)p(Z|R)dZ = 2
_

0
C(Z)p(Z|R)dZ ,
and that using Equation 103 we can write this as
R( a|R) =
_
0

C(Z)p(Z + a a
ms
|R)dZ +
_

0
C(Z)p(Z + a a
ms
|R)dZ
=
_

0
C(Z)p(Z + a a
ms
|R)dZ +
_

0
C(Z)p(Z + a a
ms
|R)dZ
=
_

0
C(Z)p(Z a + a
ms
|R)dZ +
_

0
C(Z)p(Z + a a
ms
|R)dZ
Then R R( a|R)R( a
ms
|R) can be computed by subtracting the two expressions above.
Note that this would give the expression stated in the book except that the product of the
densities would need to be subtraction.
Problem 2.4.7 (bias in the variance calculation?)
We are told to assume that R
i
N(m,
2
) for all i and that R
i
and R
j
are uncorrelated.
We rst compute some expectation of powers of R
i
. Recalling the fact from probability of
E[X
2
] = Var(X) + E[X]
2
,
and the assumed distribution of the R
i
we have that
E[R
i
] = m
E[R
i
R
j
] =
_
E[R
i
]E[R
j
] = m
2
i = j
Var(R
i
) + E[R
i
]
2
=
2
+ m
2
i = j
,
note that we have used the fact that the R
i
are uncorrelated in the evaluation of E[R
i
R
j
]
when i = j. Next dene

R =
1
n
n

i=1
R
i
,
using this we can write our expression for V as
V =
1
n
n

i=1
(R
i


R)
2
=
1
n
n

i=1
(R
2
i
2

RR
i
+

R
2
) (104)
=
1
n
n

i=1
R
2
i
2

R
n
n

i=1
R
i
+
n
n

R
2
=
1
n
n

i=1
R
2
i


R
2
. (105)
We now ask if V is an unbiased estimator of
2
or whether E[V ] =
2
. Thus we take the
expectation of V given by Equation 105. We nd
E[V ] =
1
n
n

i=1
E[R
2
i
] E[

R
2
] =
1
n
n(
2
+ m
2
) E[

R
2
] =
2
+ m
2
E[

R
2
] .
To evaluate this we now need to compute the expectation of

R
2
. From the denition of

R
we have

R
2
=
_
1
n
n

i=1
R
i
_
2
=
1
n
2

i,j
R
i
R
j
.
Thus we have
E[

R
2
] =
1
n
2

i,j
E[R
i
R
j
] =
1
n
2
_
n(
2
+ m
2
) + n(n 1)m
2
_
=

2
n
+ m
2
. (106)
With this expression we can now compute E[V ] to nd
E[V ] =
2
+ m
2

2
n
+ m
2
_
=
2
(n 1)
n
=
2
. (107)
Thus the given expression for V is not unbiased.
Lets check that if we normalize correctly (divide the sum in Equation 104 by n 1 rather
than n) we get an unbiased estimate of
2
. Let this estimate of the variance be denoted by

V . Using similar steps as in the above we get

V
1
n 1
n

i=1
(R
i


R)
2
=
1
n 1
n

i=1
R
2
i
2

R
n 1
n

i=1
R
i
+
n
n 1

R
2
=
1
n 1
n

i=1
R
2
i

n
n 1

R
2
.
In this case then we nd
E[

V ] =
1
n 1
n

i=1
E[R
2
i
]
n
n 1
E[

R
2
]
=
1
n 1
n(
2
+ m
2
)
n
n 1
_

2
n
+ m
2
_
=
2
,
when we simplify. Thus
1
n1

n
i=1
(R
i


R)
2
is an unbiased estimator of
2
.
Problem 2.4.8 (ML estimation of a binomial distribution)
We are told that R is distributed as a binomial random variable with parameters (A, n).
This means that the probability we observe the value of R after n trials is given by
p(R|A) =
_
n
R
_
A
R
(1 A)
nR
for 0 R n.
We desire to estimate the probability of success, A, from the measurement R.
Part (1): To compute the maximum likelihood (ML) estimate of A we compute
a
ml
(R) = argmax
A
p(R|A) = argmax
A
_
n
R
_
A
R
(1 A)
nR
.
To compute this maximum we can take the derivative of p(R|A) with respect to A, set the
resulting expression equal to zero and solve for A. We nd the derivative equal to
_
n
R
_
_
RA
R1
(1 A)
nR
+ A
R
(n R)(1 A)
nR1
(1)
_
= 0 .
Dividing by A
R1
(1 A)
nR1
to get
R(1 A) + A(n R)(1) = 0 ,
and solving for A gives or ML estimate of
a
ml
(R) =
R
n
. (108)
Lets compute the bias and variance of this estimate of A. The bias, b(A), is dened as
b(A) = E[ a A|A] = E[ a|A] A
= E
_
R
n

A
_
A =
1
n
E[R|A] A.
Now since R is drawn from a binomial random variable with parameters (n, A), the expecta-
tion of R is An, from which we see that the above equals zero and our estimator is unbiased.
To study the conditional variance of our error (dened as e = a A) consider

2
e
(A) = E[(e E[e])
2
|A] = E[e
2
|A] = E[( a A)
2
|A]
= E[
_
1
n
R A
_
2
|A] =
1
n
2
E[(R nA)
2
|A]
=
1
n
2
(nA(1 A)) =
A(1 A)
n
. (109)
In the above we have used the result that the variance of a binomial random variable with
parameters (n, A) is nA(1 A). In developing a ML estimator A is not considered random
and as such the above expression is the desired variance of our estimator.
Part (2): Eciency means that our estimator is unbiased and must satisfy the Cramer-Rao
lower bound. One form of which is
Var[ a(R) A] =
1
E
_
_
ln(p(R|A))
A
_
2
_ . (110)
For the density considered here the derivative of the above is given by
ln(p(R|A))
A
=
R
A

n R
1 A
=
R nA
A(1 A)
. (111)
Squaring this expression gives
R
2
2nAR + n
2
A
2
A
2
(1 A)
2
.
Taking the expectation with respect to R and recalling that E[R] = nA and
E[R
2
] = Var[R] + E[R]
2
= nA(1 A) + n
2
A
2
we have
(nA(1 A) + n
2
A
2
) 2nA(nA) + n
2
A
2
A
2
(1 A)
2
=
n
A(1 A)
,
when we simplify. As this is equal to 1/
2
e
(A) computed in Equation 109 this estimator is
ecient. Another way to see this argument is to note that in addition Equation 111 can be
written in the eciency required form of
ln(p(R|A))
A
= [ a(R) A]k(A) .
by writing it as
ln(p(R|A))
A
=
_
R
n
A
_ _
n
A(1 A)
_
.
Problem 2.4.11 (ML estimation of the mean and variance of a Gaussian)
Part (1): For the maximum likelihood estimate we need to nd values of (A
1
, A
2
) that
maximizes
p(R|A
1
, A
2
) =
n

i=1
p(R
i
|A
1
, A
2
) =
n

i=1
1
(2A
2
)
1/2
exp
_

(R
i
A
1
)
2
2A
2
_
=
1
(2)
n/2
1
A
n/2
2
exp
_

1
2A
2
n

i=1
(R
i
A
1
)
2
_
.
Taking logarithm of the above expression we desire to maximize we get

n
2
ln(2)
n
2
ln(A
2
)
1
2A
2
n

i=1
(R
i
A
1
)
2
.
To maximize this we must evaluate

A
[ln(p(R|A))]|
A= a
ml
(R)
= 0 .
As a system of equations this is

A
[ln(p(R|A))] =
_
1
A
2

n
i=1
(R
i
A
1
)

n
2
1
A
2
+
1
2A
2
2

n
i=1
(R
i
A
1
)
2
_
= 0 .
The rst equation gives
a
1
=
1
n
n

i=1
R
i
.
When we put this expression into the second equation and solve for A
2
we get
a
2
=
1
n
n

i=1
(R
i
a
1
)
2
.
Part (2): The estimator for A
1
is not biased while the estimator for A
2
is biased as was
shown in Problem 2.4.7 above.
Part (3): The estimators are seen to be coupled since the estimate of A
2
depends on the
value of A
1
.
Problem 2.4.12 (sending a signal)
Part (1): The likelihood for A = (A
1
, A
2
) for this problem is given by
p(R|A) = p(R
1
|A)p(R
2
|A)
=
_
1

2
n
exp
_

1
2
2
n
(R
1
S
1
)
2
___
1

2
n
exp
_

1
2
2
n
(R
2
S
2
)
2
__
.
The logarithm of this expression is given by
ln(p(R|A)) =
1
2
2
n
(R
1
S
1
)
2

1
2
2
n
(R
2
S
2
)
2
+ constants .
To nd the maximum likelihood estimate of A we need to take the derivatives of ln(p(R|A))
with respect to A
1
and A
2
. We nd
ln(p(R|A))
A
1
=
1

2
n
(R
1
S
1
)
S
1
A
1
+
1

2
n
(R
2
S
2
)
S
2
A
1
=
1

2
n
(R
1
S
1
)x
11
+
1

2
n
(R
2
S
2
)x
21
and
ln(p(R|A))
A
2
=
1

2
n
(R
1
S
1
)
S
1
A
2
+
1

2
n
(R
2
S
2
)
S
2
A
2
=
1

2
n
(R
1
S
1
)x
12
+
1

2
n
(R
2
S
2
)x
22
.
Each derivative would need to be equated to zero and the resulting system solved for A
1
and
A
2
. To do this we rst solve for S and then with these known values we solve for A. As a
rst step write the above as
x
11
S
1
+ x
21
S
2
= x
11
R
1
+ x
21
R
2
x
12
S
1
+ x
22
S
2
= x
12
R
1
+ x
22
R
2
.
In matrix notation this means that S is given by
_
S
1
S
2
_
=
_
x
11
x
21
x
12
x
22
_
1
_
x
11
x
21
x
12
x
22
_ _
R
1
R
2
_
=
_
R
1
R
2
_
.
Here we have assumed the needed matrix inverses exits. Now that we know the estimate for
S the estimate for A is given by
_
A
1
A
2
_
=
_
x
11
x
12
x
21
x
22
_
1
_
S
1
S
2
_
=
_
x
11
x
12
x
21
x
22
_
1
_
R
1
R
2
_
.
We now seek to answer the question as to whether a
1
(R) and a
2
(R) are unbiased estimators
of A
1
and A
2
. Consider the expectation of the vector
_
a
1
(R)
a
2
(R)
_
. We nd
E
_
a
1
(R)
a
2
(R)
_
=
_
x
11
x
12
x
21
x
22
_
1
E
_
R
1
R
2
_
=
_
x
11
x
12
x
21
x
22
_
1
E
_
S
1
S
2
_
=
_
x
11
x
12
x
21
x
22
_
1
_
x
11
x
12
x
21
x
22
_ _
A
1
A
2
_
=
_
A
1
A
2
_
,
showing that our estimator is unbiased.
Part (2): Lets compute the variance of our estimator
_
a
1
(R)
a
2
(R)
_
. To do this we rst let
matrix
_
x
11
x
12
x
21
x
22
_
be denoted as X and note that
_
a
1
(R) A
1
a
2
(R) A
2
_
= X
1
_
R
1
R
2
_
X
1
X
_
A
1
A
2
_
= X
1
__
R
1
R
2
_
X
_
A
1
A
2
__
= X
1
_
X
_
A
1
A
2
_
X
_
A
1
A
2
_
+
_
n
1
n
2
__
= X
1
_
n
1
n
2
_
.
Using this we know that
Var
_
a
1
(R)
a
2
(R)
_
= X
1
_
Var
_
n
1
n
2
__
X
T
.
Since
Var
_
n
1
n
2
_
=
_

2
n
0
0
2
n
_
=
2
n
I ,
the above becomes
Var
_
a
1
(R)
a
2
(R)
_
=
2
n
_
x
11
x
12
x
21
x
22
_
1
_
x
11
x
12
x
21
x
22
_
1
=
2
n
_
1
x
11
x
22
x
12
x
21
_
x
22
x
12
x
21
x
11
___
1
x
11
x
22
x
12
x
21
_
x
22
x
12
x
21
x
11
__
=

2
n
(x
11
x
22
x
12
x
21
)
2
_
x
2
22
+ x
2
12
x
22
x
21
x
12
x
11
x
22
x
21
x
12
x
11
x
2
21
+ x
2
11
_
. (112)
Part (3): Lets compute the Cramer-Rao bound for any unbiased estimator. To do this we
need to evaluate the matrix J which has components given by
J
ij
= E
_
ln(p(R|A))
A
i
ln(p(R|A))
A
j
_
= E
_

2
ln(p(R|A))
A
i
A
j
_
.
Using the above expressions we compute

2
ln(p(R|A))
A
1
2
=
1

2
n
x
11
S
1
A
1

2
n
x
21
S
2
A
1
=
1

2
n
(x
2
11
+ x
2
21
)

2
ln(p(R|A))
A
1
A
2
=
1

2
n
x
11
S
1
A
2

2
n
x
21
S
2
A
2
=
1

2
n
(x
11
x
12
+ x
21
x
22
)

2
ln(p(R|A))
A
2
2
=
1

2
n
x
12
S
1
A
2

2
n
x
22
S
2
A
2
=
1

2
n
(x
2
12
+ x
2
22
) .
Thus as a matrix we nd that J is given by
J =
1

2
n
_
x
2
11
+ x
2
21
x
11
x
12
+ x
21
x
22
x
11
x
12
+ x
21
x
22
x
2
12
+ x
2
22
_
.
The Cramer-Rao bound involves J
1
which we compute to be
J
1
=

2
n
(x
2
11
+ x
2
21
)(x
2
12
+ x
2
22
) (x
11
x
12
+ x
21
x
22
)
2
_
x
2
11
+ x
2
22
(x
11
x
12
+ x
21
x
22
)
(x
11
x
12
+ x
21
x
22
) x
2
11
+ x
2
21
_
.
We can see how this expression compares with the one derived in Equation 112 if we expand
the denominator D of the above fraction to get
D = x
2
11
x
2
12
+ x
2
11
x
2
22
+ x
2
21
x
2
12
+ x
2
21
x
2
22
(x
2
11
x
2
12
+ 2x
11
x
12
x
21
x
22
+ x
2
21
x
2
22
)
= x
2
11
x
2
22
2x
11
x
12
x
21
x
22
+ x
2
21
x
2
12
= (x
11
x
22
x
12
x
21
)
2
.
Thus the above expression matches that from Equation 112 and we have that our estimate
is ecient.
Problem 2.4.14 (a Poisson random variable)
Part (1): Here we are told that P(R|A) =
A
R
e
A
R!
and A is nonrandom. The logarithm is
given by
ln(P(R|A)) = Rln(A) Aln(R!) .
The rst two derivatives of the log-likelihood above are given by
ln(P(R|A))
A
=
R
A
1

2
ln(P(R|A))
A
2
=
R
A
2
.
By the Cramer-Rao bound we have that
Var[ a(R)]
1
E
_

2
ln(P(R|A))
A
2
_ .
Thus we need to evaluate the expectation above. We nd
E
_

2
ln(P(R|A))
A
2
_
= E
_
R
A
2
_
=
1
A
2
E[R] =
A
A
2
=
1
A
,
using the fact that E[A] = A see [1]. Thus we see that Var[ a(R)] A.
Part (2): In this case we have
p(R|A) =
n

i=1
p(R
i
|A) =
n

i=1
_
A
R
i
e
A
R
i
!
_
=
1

n
i=1
R
i
!
e
nA
A

n
i=1
R
i
.
Now the logarithm of the above gives
ln(p(R|A)) = nA +
_
n

i=1
R
i
_
ln(A) ln
_
n

i=1
R
i
!
_
.
To compute the maximum likelihood estimate of A we take the rst derivative of the log-
likelihood, set the result equal to zero and solve for A. This means that we need to solve
ln(p(R|A))
A
= n +

n
i=1
R
i
A
= 0 .
This means A is given by
A =
1
n
n

i=1
R
i
.
To show that this is ecient we will write this as k(A)( a(R) A) as
n
A
_
1
n
n

i=1
R
i
A
_
.
Thus a(R) =
1
n

n
i=1
R
i
is an ecient estimator.
Problem 2.4.15 (Cauchy random variables)
Part (1): Note that we have
p(R|A) =
1

n
n

i=1
1
1 + (R
i
A)
2
,
so that
ln(p(R|A)) =
n

i=1
ln(1 + (R
i
A)
2
) nln() .
Taking the rst derivative of the above with respect to A we nd
ln(p(R|A))
A
=
n

i=1
2(R
i
A)
1 + (R
i
A)
2
.
The second derivative is then given by

2
ln(p(R|A))
A
2
= 2
n

i=1
_
1
1 + (R
i
A)
2
+
(R
i
A)
2
(1 + (R
i
A)
2
)
2
_
= 2
n

i=1
1 + (R
i
A)
2
(1 + (R
i
A)
2
)
2
.
By the Cramer-Rao inequality we have that
Var[ a(R)]
1
E
_

2
ln((R|A))
A
2
_ .
Thus we need to evaluate the expectation in the denominator of the above fraction. We nd
E
_

2
ln(p(R|A))
A
2
_
= 2
n

i=1
_

(1 + (R
i
A)
2
)
(1 + (R
i
A)
2
)
2
p(R
i
|A)dR
i
=
2

i=1
_

dR
i
(1 + (R
i
A)
2
)
3
+
_

(R
i
A)
2
dR
i
(1 + (R
i
A)
2
)
3
_
=
2

i=1
_

dv
(1 + v
2
)
3
+
_

v
2
dv
(1 + v
2
)
3
_
.
When we evaluate the above two integrals using Mathematica we nd
E
_

2
ln(p(R|A))
A
2
_
=
2

i=1
_

3
8
+
_

8
_
_
=
2n

2
8
_
=
n
2
.
Thus the Cramer-Rao inequality then gives Var[ a(R)]
2
n
as we were to show.
Problem 2.4.16 (correlated Gaussian random variables)
Part (1): We have
p(R|) =
n

i=1
p(R
i1
, R
i2
|)
=
n

i=1
_
1
2(1
2
)
1/2
exp
_

(R
2
i1
2R
i1
R
i2
+ R
2
i2
)
2(1
2
)
__
=
1
(2)
n
(1
2
)
n/2
exp
_

1
2(1
2
)
n

i=1
(R
2
i1
2R
i1
R
i2
+ R
2
i2
)
_
.
Taking the logarithm of the above we get
ln(p(R|)) = nln(2)
n
2
ln(1
2
)
1
2(1
2
)
n

i=1
(R
2
i1
2R
i1
R
i2
+ R
2
i2
) .
The ML estimate is given by solving
ln(p(R|))

= 0 for . The needed rst derivative is


ln(p(R|))

=
n
2
(2)
(1
2
)
+
(2)
2(1
2
)
2
n

i=1
(R
2
i1
2R
i1
R
i2
+ R
2
i2
)
1
2(1
2
)
n

i=1
(2R
i1
R
i2
)
=
n
1
2
+
1
1
2
n

i=1
R
i1
R
i2


(1
2
)
2
n

i=1
(R
2
i1
2R
i1
R
i2
+ R
2
i2
) .
To simplify expression this lets dene some sums. We introduce
S
11

1
n
n

i=1
R
2
i1
, S
12

1
n
n

i=1
R
i1
R
i2
, S
22

1
n
n

i=1
R
2
i2
.
With these the expression for
ln(p(R|))

now becomes
ln(p(R|))

=
n
1
2
+
n
1
2
S
12

n
(1
2
)
2
(S
11
2S
12
+ S
22
)
=
n
(1
2
)
2
_

3
+ S
11
+ (1 +
2
)S
12
S
22

. (113)
For later work we will need to compute the second derivative of ln(p(R|)) with respect to
. Using Mathematica we nd

2
ln(p(R|))

2
=
n
(1
2
)
3
_
1 +
4
+ (1 + 3
2
)S
11
+ (6 2
3
)S
12
+ (1 + 3
2
)S
22

. (114)
The equation for the ML estimate of is given by setting Equation 113 equal to zero and
solving for .
Part (2): One form of the Cramer-Rao inequality bounds the variance of a(R) by a function
of the expectation of the expression

2
ln(p(R|))

2
. In this problem, the samples R
i
are mean
zero, variance one, and correlated with correlation which means that
E[R
i1
] = E[R
i2
] = 0 , E[R
2
i1
] = E[R
2
i2
] = 1 , and E[R
i1
R
i2
] = .
Using these with the denitions of S
11
, S
22
, and S
12
we get that
E[S
11
] = E[S
22
] = 1 and E[S
12
] = .
Using these expressions, the expectation of Equation 114 becomes
E
_

2
ln(p(R|))

2
_
= n
1 +
2
(1
2
)
2
.
Thus using the second derivative form of the Cramer-Rao inequality or
E[( a(R) A)
2
]
_
E
_

2
ln(p
r|a
(R|A))
A
2
__
1
, (115)
we see that in this case we have
E[( a(R) A)
2
]
(1
2
)
2
n(1 +
2
)
.
Problem 2.4.17 (the Cramer-Rao inequality for biased estimates)
If A is nonrandom then
E[ a(R) A] =
_
p(R|A)[ a(R) A]dR =
_
p(R|A) a(R)dRA = (A+B(A)) A = B(A) .
Taking the A derivative of both sides gives
_

A
{p(R|A)[ a(R) A]}dR = B

(A) .
Following the steps in the books proof of the Cramer-Rao bound we get an equation similar
to the books 185 of
_

_
ln(p(R|A))
A
_
p(R|A)
_
_
_
p(R|A)( a(R) A)
_
dR = 1 + B

(A) .
Using the Schwarz inequality 39 on the integral on the left-hand-side we get that
1+B

(A)
_
_

_
ln(p(R|A))
A
_
p(R|A)
_
2
dR
_
1/2 __

_
_
p(R|A)( a(R) A)
_
2
dR
_
1/2
.
Squaring both sides we get
(1 + B

(A))
2

_
_

_
ln(p(R|A))
A
_
p(R|A)
_
2
dR
_
Var[ a(R)] .
If we solve for Var[ a(R)] we get
Var[ a(R)]
(1 + B

(A))
2
E
_
_
ln(p(R|A))
A
_
2
_ ,
which is the expression we desired to obtain.
Problem 2.4.21 (an alternative derivation of the Cramer-Rao inequality)
Part (1): The first component of $E[z] = 0$ is true because $\hat{a}(R)$ is an unbiased estimator. That the second component of $E[z]$ is zero can be shown to be true by considering
\[
\int \frac{\partial \ln(p(R|A))}{\partial A} p(R|A) \, dR = \int \frac{\partial p(R|A)}{\partial A} \, dR = \frac{\partial}{\partial A} \int p(R|A) \, dR = \frac{\partial}{\partial A} (1) = 0 .
\]
Part (2): The covariance matrix is $\Lambda_z = E[z z^T]$. The needed outer product is given by
\[
z z^T =
\begin{bmatrix} \hat{a}(R) - A \\[2pt] \frac{\partial \ln(p(R|A))}{\partial A} \end{bmatrix}
\begin{bmatrix} \hat{a}(R) - A & \frac{\partial \ln(p(R|A))}{\partial A} \end{bmatrix}
=
\begin{bmatrix}
(\hat{a}(R) - A)^2 & (\hat{a}(R) - A) \frac{\partial \ln(p(R|A))}{\partial A} \\[4pt]
(\hat{a}(R) - A) \frac{\partial \ln(p(R|A))}{\partial A} & \left( \frac{\partial \ln(p(R|A))}{\partial A} \right)^2
\end{bmatrix} .
\]
Following the same steps leading to Equation 60 we can show that the expectation of the (1,2) and (2,1) terms is one and that $E[z z^T]$ takes the form
\[
E[z z^T] =
\begin{bmatrix}
E[(\hat{a}(R) - A)^2] & 1 \\[4pt]
1 & E\left[ \left( \frac{\partial \ln(p(R|A))}{\partial A} \right)^2 \right]
\end{bmatrix} .
\]
As $\Lambda_z$ is positive semidefinite we must have $|\Lambda_z| \geq 0$ which means that
\[
E[(\hat{a}(R) - A)^2] \, E\left[ \left( \frac{\partial \ln(p(R|A))}{\partial A} \right)^2 \right] \geq 1 ,
\]
or
\[
E[(\hat{a}(R) - A)^2] \geq \frac{1}{E\left[ \left( \frac{\partial \ln(p(R|A))}{\partial A} \right)^2 \right]} ,
\]
which is the Cramer-Rao inequality. If equality holds in the Cramer-Rao inequality then this means that $|\Lambda_z| = 0$.
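To see this matrix structure numerically, consider the illustrative (assumed) case $R_i \sim N(A, \sigma^2)$ with $\hat{a}$ the sample mean; then $E[zz^T]$ should be approximately $\begin{bmatrix} \sigma^2/n & 1 \\ 1 & n/\sigma^2 \end{bmatrix}$, whose determinant is zero because the sample mean is an efficient estimator:

# Sketch (assumed example): empirical covariance of z = [a_hat - A, d/dA ln p(R|A)]
# for R_i ~ N(A, sigma^2) with a_hat the sample mean.
import numpy as np

rng = np.random.default_rng(2)
A, sigma, n, trials = 1.0, 2.0, 25, 100000

R = rng.normal(A, sigma, size=(trials, n))
a_hat = R.mean(axis=1)
score = (R - A).sum(axis=1) / sigma**2       # d/dA ln p(R|A)

Z = np.column_stack([a_hat - A, score])
print(Z.mean(axis=0))                        # approximately [0, 0]
print(Z.T @ Z / trials)                      # approximately [[sigma^2/n, 1], [1, n/sigma^2]]
# Its determinant is near zero: equality holds in the CR bound for this estimator.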
Problem 2.4.22 (a derivation of the random Cramer-Rao inequality)
Note that we can show that $E[z] = 0$ in the same way as was done in Problem 2.4.21. We now compute $z z^T$ and find
=
_
a(R) A
ln(p(R,A))
A
_
_
a(R) A
ln(p(R,A))
A
_
=
_
_
( a(R) A)
2
( a(R) A)
ln(p(R,A))
A
( a(R) A)
ln(p(R,A))
A
_
ln(p(R,A))
A
_
2
_
_
.
Consider the expectation of the (1, 2) component of zz
T
where using integration by parts we
nd
_
( a(R) A)
ln(p(R, A))
A
p(R, A)dRdA =
_
( a(R) A)
p(R, A)
A
dRdA
= ( a(R) A)p(R, A)|

_
(1)p(R, A)dRdA
= 0 +
_
p(R, A)dRdA = 1 .
Using this we nd that E[zz
T
] is given by
E[zz
T
] =
_
_
E[( a(R) A)
2
] 1
1 E
_
_
ln(p(R,A))
A
_
2
_
_
_
.
As
z
is positive semidenite we must have |
z
| 0 which as in the previous problem means
that
E[( a(R) A)
2
]
1
E
_
_
ln(p(R,A))
A
_
2
_ ,
which is the Cramer-Rao inequality when a is a random variable.
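As an illustrative (assumed) example of this random-parameter bound, take $A \sim N(0, \sigma_a^2)$ and $R_i | A \sim N(A, \sigma^2)$; then $E[(\partial \ln p(R,A) / \partial A)^2] = n/\sigma^2 + 1/\sigma_a^2$ and the posterior (conditional) mean attains the resulting bound. A short Monte Carlo sketch:

# Sketch (assumed example): Bayesian CR bound for A ~ N(0, sig_a^2), R_i | A ~ N(A, sig^2).
# Bound: E[(a_hat - A)^2] >= 1 / (n/sig^2 + 1/sig_a^2), attained by the posterior mean.
import numpy as np

rng = np.random.default_rng(3)
sig, sig_a, n, trials = 1.0, 2.0, 10, 200000

A = rng.normal(0.0, sig_a, size=trials)
R = rng.normal(A[:, None], sig, size=(trials, n))

a_hat = (R.sum(axis=1) / sig**2) / (n / sig**2 + 1 / sig_a**2)   # posterior mean of A
print(np.mean((a_hat - A)**2))                # empirical mean-square error
print(1.0 / (n / sig**2 + 1 / sig_a**2))      # the bound; the two should agree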
Problem 2.4.23 (the Bhattacharyya bound)
Part (1): Note that $E[z] = 0$ for the first and second component as in previous problems. Note that for $i \geq 1$ we have
\[
E\left[ \frac{1}{p(R|A)} \frac{\partial^i p(R|A)}{\partial A^i} \right]
= \int \frac{1}{p(R|A)} \frac{\partial^i p(R|A)}{\partial A^i} \, p(R|A) \, dR
= \frac{\partial^i}{\partial A^i} \int p(R|A) \, dR = \frac{\partial^i}{\partial A^i} (1) = 0 .
\]
Thus $E[z] = 0$ for all components and $\Lambda_z = E[z z^T]$.
Part (2): The first row of the outer product $z z^T$ has components that look like
\[
(\hat{a}(R) - A)^2 , \quad
\left( \frac{\hat{a}(R) - A}{p(R|A)} \right) \frac{\partial p(R|A)}{\partial A} , \quad
\left( \frac{\hat{a}(R) - A}{p(R|A)} \right) \frac{\partial^2 p(R|A)}{\partial A^2} , \quad \ldots , \quad
\left( \frac{\hat{a}(R) - A}{p(R|A)} \right) \frac{\partial^i p(R|A)}{\partial A^i} , \quad \ldots
\]
for $i \leq N$. When we evaluate the expectation for terms in this row we find, integrating by parts,
\[
E\left[ \left( \frac{\hat{a}(R) - A}{p(R|A)} \right) \frac{\partial^i p(R|A)}{\partial A^i} \right]
= \int (\hat{a}(R) - A) \frac{\partial^i p(R|A)}{\partial A^i} \, dR
= \left. (\hat{a}(R) - A) \frac{\partial^{i-1} p(R|A)}{\partial A^{i-1}} \right|_{-\infty}^{\infty} - \int (-1) \frac{\partial^{i-1} p(R|A)}{\partial A^{i-1}} \, dR
\]
\[
= \frac{\partial^{i-1}}{\partial A^{i-1}} \int p(R|A) \, dR = \frac{\partial^{i-1}}{\partial A^{i-1}} (1) = 0 \quad \text{for } i \geq 2 ,
\]
while for $i = 1$ the same calculation gives $\int p(R|A) \, dR = 1$.
Thus $\Lambda_z$ has its (1,1) element given by $E[(\hat{a}(R) - A)^2]$, its (1,2) element given by 1, and all other elements in the first row equal to zero. The first column of $\Lambda_z$ is similar. Denote the lower-right corner of $\Lambda_z$ as $\tilde{J}$ and note that $\tilde{J}$ has elements for $i \geq 1$ and $j \geq 1$ given by
\[
\tilde{J}_{ij} = E\left[ \frac{1}{p(R|A)} \frac{\partial^i p(R|A)}{\partial A^i} \cdot \frac{1}{p(R|A)} \frac{\partial^j p(R|A)}{\partial A^j} \right]
= \int \frac{1}{p(R|A)} \frac{\partial^i p(R|A)}{\partial A^i} \frac{\partial^j p(R|A)}{\partial A^j} \, dR .
\]
We know that $\Lambda_z$ is nonnegative definite. Expressing the fact that $|\Lambda_z| \geq 0$ by expanding the determinant about the first row gives
\[
\sigma_\epsilon^2 |\tilde{J}| - \text{cofactor}(\tilde{J}_{11}) \geq 0 ,
\]
where $\sigma_\epsilon^2 \equiv E[(\hat{a}(R) - A)^2]$. Solving for $\sigma_\epsilon^2$ we get
\[
\sigma_\epsilon^2 \geq \frac{\text{cofactor}(\tilde{J}_{11})}{|\tilde{J}|} = \tilde{J}^{11} ,
\]
the (1,1) element of $\tilde{J}^{-1}$; this is the same procedure as in the proof of the Cramer-Rao bound for nonrandom variables.
Part (3): When $N = 1$ then $\Lambda_z = E[z z^T]$ looks like
\[
\Lambda_z =
\begin{bmatrix}
\sigma_\epsilon^2 & 1 \\[4pt]
1 & E\left[ \left( \frac{1}{p(R|A)} \frac{\partial p(R|A)}{\partial A} \right)^2 \right]
\end{bmatrix}
=
\begin{bmatrix}
\sigma_\epsilon^2 & 1 \\[4pt]
1 & E\left[ \left( \frac{\partial \ln(p(R|A))}{\partial A} \right)^2 \right]
\end{bmatrix} ,
\]
so $\tilde{J} = E\left[ \left( \frac{\partial \ln(p(R|A))}{\partial A} \right)^2 \right]$ and thus
\[
\sigma_\epsilon^2 \geq \tilde{J}^{11} = \frac{1}{E\left[ \left( \frac{\partial \ln(p(R|A))}{\partial A} \right)^2 \right]} ,
\]
which is the Cramer-Rao inequality.
Part (4): Informally, as $N$ increases the matrix $\tilde{J}$ gains additional rows and columns, and the (1,1) element of its inverse, $\tilde{J}^{11}$, can only stay the same or increase; since $\sigma_\epsilon^2 \geq \tilde{J}^{11}$, a larger $N$ therefore gives an equal or tighter bound.
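To make this concrete, the sketch below (illustrative only; it assumes a single observation $R \sim N(A,1)$ and the sympy package) computes the $N = 2$ Bhattacharyya matrix $\tilde{J}$. In this case the Cramer-Rao bound is already attained, so the extra term gives no improvement and $\tilde{J}^{11}$ stays equal to the $N = 1$ value:

# Sketch (assumed example): N = 2 Bhattacharyya matrix for a single R ~ N(A, 1).
import sympy as sp

R, A = sp.symbols('R A', real=True)
p = sp.exp(-(R - A)**2 / 2) / sp.sqrt(2 * sp.pi)

def E(expr):
    # expectation of expr with respect to R ~ N(A, 1)
    return sp.integrate(sp.simplify(expr * p), (R, -sp.oo, sp.oo))

g1 = sp.diff(p, A) / p       # (1/p) dp/dA
g2 = sp.diff(p, A, 2) / p    # (1/p) d^2 p/dA^2
J = sp.Matrix([[E(g1 * g1), E(g1 * g2)], [E(g2 * g1), E(g2 * g2)]])
print(J)                     # Matrix([[1, 0], [0, 2]])
print(J.inv()[0, 0])         # 1, the same as the N = 1 (Cramer-Rao) bound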
Problem 2.4.27 (some vector derivatives)
Part (1): We have defined
\[
\nabla_x =
\begin{bmatrix}
\frac{\partial}{\partial x_1} \\[2pt]
\frac{\partial}{\partial x_2} \\
\vdots \\
\frac{\partial}{\partial x_n}
\end{bmatrix} , \tag{116}
\]
and we want to show
\[
\nabla_x (A^T B) = (\nabla_x A^T) B + (\nabla_x B^T) A .
\]
Let's compute the left-hand-side of this expression. We find that $A^T B$ is given by
\[
A^T B =
\begin{bmatrix} A_1 & A_2 & \cdots & A_n \end{bmatrix}
\begin{bmatrix} B_1 \\ B_2 \\ \vdots \\ B_n \end{bmatrix}
= A_1 B_1 + A_2 B_2 + \cdots + A_n B_n ,
\]
a scalar. With this we see that
\[
\nabla_x (A^T B) =
\begin{bmatrix}
\frac{\partial}{\partial x_1} (A_1 B_1 + A_2 B_2 + \cdots + A_n B_n) \\
\frac{\partial}{\partial x_2} (A_1 B_1 + A_2 B_2 + \cdots + A_n B_n) \\
\vdots \\
\frac{\partial}{\partial x_n} (A_1 B_1 + A_2 B_2 + \cdots + A_n B_n)
\end{bmatrix}
=
\begin{bmatrix}
\frac{\partial A_1}{\partial x_1} B_1 + \frac{\partial A_2}{\partial x_1} B_2 + \cdots + \frac{\partial A_n}{\partial x_1} B_n \\
\frac{\partial A_1}{\partial x_2} B_1 + \frac{\partial A_2}{\partial x_2} B_2 + \cdots + \frac{\partial A_n}{\partial x_2} B_n \\
\vdots \\
\frac{\partial A_1}{\partial x_n} B_1 + \frac{\partial A_2}{\partial x_n} B_2 + \cdots + \frac{\partial A_n}{\partial x_n} B_n
\end{bmatrix}
+
\begin{bmatrix}
A_1 \frac{\partial B_1}{\partial x_1} + A_2 \frac{\partial B_2}{\partial x_1} + \cdots + A_n \frac{\partial B_n}{\partial x_1} \\
A_1 \frac{\partial B_1}{\partial x_2} + A_2 \frac{\partial B_2}{\partial x_2} + \cdots + A_n \frac{\partial B_n}{\partial x_2} \\
\vdots \\
A_1 \frac{\partial B_1}{\partial x_n} + A_2 \frac{\partial B_2}{\partial x_n} + \cdots + A_n \frac{\partial B_n}{\partial x_n}
\end{bmatrix}
\]
\[
=
\begin{bmatrix}
\frac{\partial A_1}{\partial x_1} & \frac{\partial A_2}{\partial x_1} & \cdots & \frac{\partial A_n}{\partial x_1} \\
\frac{\partial A_1}{\partial x_2} & \frac{\partial A_2}{\partial x_2} & \cdots & \frac{\partial A_n}{\partial x_2} \\
\vdots & \vdots & & \vdots \\
\frac{\partial A_1}{\partial x_n} & \frac{\partial A_2}{\partial x_n} & \cdots & \frac{\partial A_n}{\partial x_n}
\end{bmatrix}
\begin{bmatrix} B_1 \\ B_2 \\ \vdots \\ B_n \end{bmatrix}
+
\begin{bmatrix}
\frac{\partial B_1}{\partial x_1} & \frac{\partial B_2}{\partial x_1} & \cdots & \frac{\partial B_n}{\partial x_1} \\
\frac{\partial B_1}{\partial x_2} & \frac{\partial B_2}{\partial x_2} & \cdots & \frac{\partial B_n}{\partial x_2} \\
\vdots & \vdots & & \vdots \\
\frac{\partial B_1}{\partial x_n} & \frac{\partial B_2}{\partial x_n} & \cdots & \frac{\partial B_n}{\partial x_n}
\end{bmatrix}
\begin{bmatrix} A_1 \\ A_2 \\ \vdots \\ A_n \end{bmatrix} .
\]
If we define $\nabla_x A^T$ as the matrix
\[
\nabla_x A^T =
\begin{bmatrix}
\frac{\partial}{\partial x_1} \\[2pt]
\frac{\partial}{\partial x_2} \\
\vdots \\
\frac{\partial}{\partial x_n}
\end{bmatrix}
\begin{bmatrix} A_1 & A_2 & \cdots & A_n \end{bmatrix}
=
\begin{bmatrix}
\frac{\partial A_1}{\partial x_1} & \frac{\partial A_2}{\partial x_1} & \cdots & \frac{\partial A_n}{\partial x_1} \\
\frac{\partial A_1}{\partial x_2} & \frac{\partial A_2}{\partial x_2} & \cdots & \frac{\partial A_n}{\partial x_2} \\
\vdots & \vdots & & \vdots \\
\frac{\partial A_1}{\partial x_n} & \frac{\partial A_2}{\partial x_n} & \cdots & \frac{\partial A_n}{\partial x_n}
\end{bmatrix} ,
\]
then we can write the above as $(\nabla_x A^T) B + (\nabla_x B^T) A$, and our derivative relationship is then
\[
\nabla_x (A^T B) = (\nabla_x A^T) B + (\nabla_x B^T) A . \tag{117}
\]
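A quick numerical spot-check of Equation 117 (a sketch only, with assumed matrices): take $A(x) = Mx$ and $B(x) = Kx$, so that $\nabla_x A^T = M^T$ and $\nabla_x B^T = K^T$, and compare a central-difference gradient of $A(x)^T B(x)$ against $(\nabla_x A^T) B + (\nabla_x B^T) A$:

# Sketch: finite-difference check of Equation 117 with A(x) = M x and B(x) = K x.
import numpy as np

rng = np.random.default_rng(4)
n = 5
M = rng.standard_normal((n, n))
K = rng.standard_normal((n, n))
x = rng.standard_normal(n)

f = lambda y: (M @ y) @ (K @ y)        # the scalar A(x)^T B(x)

eps = 1e-6
fd_grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(n)])
rhs = M.T @ (K @ x) + K.T @ (M @ x)    # (grad_x A^T) B + (grad_x B^T) A

print(np.max(np.abs(fd_grad - rhs)))   # should be tiny (finite-difference error only)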
Part (2): Use Part (1) from this problem with $B = x$ where we would get
\[
\nabla_x (A^T x) = (\nabla_x A^T) x + (\nabla_x x^T) A .
\]
If $A$ does not depend on $x$ then $\nabla_x A^T = 0$ and we have $\nabla_x (A^T x) = (\nabla_x x^T) A$. We now need to evaluate $\nabla_x x^T$ where we find
\[
\nabla_x x^T =
\begin{bmatrix}
\frac{\partial x_1}{\partial x_1} & \frac{\partial x_2}{\partial x_1} & \cdots & \frac{\partial x_n}{\partial x_1} \\
\frac{\partial x_1}{\partial x_2} & \frac{\partial x_2}{\partial x_2} & \cdots & \frac{\partial x_n}{\partial x_2} \\
\vdots & \vdots & & \vdots \\
\frac{\partial x_1}{\partial x_n} & \frac{\partial x_2}{\partial x_n} & \cdots & \frac{\partial x_n}{\partial x_n}
\end{bmatrix}
= I ,
\]
the identity matrix. Thus we have $\nabla_x (A^T x) = A$.

Part (3): Note from the dimensions given, $x^T C$ is a $(1 \times n)(n \times m) = 1 \times m$ sized matrix. We can write the product $x^T C$ as
\[
x^T C =
\begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix}
\begin{bmatrix}
c_{11} & c_{12} & \cdots & c_{1m} \\
c_{21} & c_{22} & \cdots & c_{2m} \\
\vdots & \vdots & & \vdots \\
c_{n1} & c_{n2} & \cdots & c_{nm}
\end{bmatrix}
=
\begin{bmatrix} \sum_{j=1}^{n} x_j c_{j1} & \sum_{j=1}^{n} x_j c_{j2} & \cdots & \sum_{j=1}^{n} x_j c_{jm} \end{bmatrix} .
\]
We now compute $\nabla_x (x^T C)$ to get
\[
\nabla_x (x^T C) =
\begin{bmatrix}
\frac{\partial}{\partial x_1} \left( \sum_{j=1}^{n} x_j c_{j1} \right) & \frac{\partial}{\partial x_1} \left( \sum_{j=1}^{n} x_j c_{j2} \right) & \cdots & \frac{\partial}{\partial x_1} \left( \sum_{j=1}^{n} x_j c_{jm} \right) \\
\frac{\partial}{\partial x_2} \left( \sum_{j=1}^{n} x_j c_{j1} \right) & \frac{\partial}{\partial x_2} \left( \sum_{j=1}^{n} x_j c_{j2} \right) & \cdots & \frac{\partial}{\partial x_2} \left( \sum_{j=1}^{n} x_j c_{jm} \right) \\
\vdots & \vdots & & \vdots \\
\frac{\partial}{\partial x_n} \left( \sum_{j=1}^{n} x_j c_{j1} \right) & \frac{\partial}{\partial x_n} \left( \sum_{j=1}^{n} x_j c_{j2} \right) & \cdots & \frac{\partial}{\partial x_n} \left( \sum_{j=1}^{n} x_j c_{jm} \right)
\end{bmatrix}
=
\begin{bmatrix}
c_{11} & c_{12} & \cdots & c_{1m} \\
c_{21} & c_{22} & \cdots & c_{2m} \\
\vdots & \vdots & & \vdots \\
c_{n1} & c_{n2} & \cdots & c_{nm}
\end{bmatrix}
= C .
\]

Part (4): This result is a specialization of Part (3) where $C = I$.
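A similar finite-difference spot-check of the Part (3) result $\nabla_x (x^T C) = C$ (illustrative only, with an assumed random $C$):

# Sketch: finite-difference check that the matrix of derivatives d(x^T C)_j / dx_i equals C.
import numpy as np

rng = np.random.default_rng(5)
n, m = 4, 3
C = rng.standard_normal((n, m))
x = rng.standard_normal(n)

eps = 1e-6
J = np.array([((x + eps * e) @ C - (x - eps * e) @ C) / (2 * eps) for e in np.eye(n)])
print(np.max(np.abs(J - C)))           # should be tiny, confirming grad_x (x^T C) = C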
Problem 2.4.28 (some vector derivatives of quadratic forms)
Part (1): Write our derivative as
\[
\nabla_x Q = \nabla_x (A^T(x) \Lambda A(x)) = \nabla_x (A^T(x) \Lambda^{1/2} \Lambda^{1/2} A(x)) = \nabla_x \left( (\Lambda^{1/2} A(x))^T (\Lambda^{1/2} A(x)) \right) .
\]
Then using Problem 2.4.27 with the vectors $A$ and $B$ defined as $A \equiv \Lambda^{1/2} A(x)$ and $B \equiv \Lambda^{1/2} A(x)$ we get
\[
\nabla_x Q = (\nabla_x [\Lambda^{1/2} A(x)]^T) \Lambda^{1/2} A(x) + (\nabla_x [\Lambda^{1/2} A(x)]^T) \Lambda^{1/2} A(x)
= 2 (\nabla_x [\Lambda^{1/2} A(x)]^T) \Lambda^{1/2} A(x)
\]
\[
= 2 (\nabla_x A(x)^T) \Lambda^{1/2} \Lambda^{1/2} A(x)
= 2 (\nabla_x A(x)^T) \Lambda A(x) , \tag{118}
\]
as we were to show.

Part (2): If $A(x) = Bx$ then using Problem 2.4.27 where we computed $\nabla_x x^T$ we find
\[
\nabla_x A^T = \nabla_x (x^T B^T) = (\nabla_x x^T) B^T = B^T .
\]
Using this and the result from Part (1) we get
\[
\nabla_x Q = 2 B^T \Lambda B x .
\]

Part (3): If $Q = x^T x$ then $A(x) = x$ (and $\Lambda = I$), so from Part (4) of Problem 2.4.27 we have $\nabla_x x^T = I$, which when we use it with Equation 118 derived above gives
\[
\nabla_x Q = 2x ,
\]
as we were to show.
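Finally, a numerical spot-check of Equation 118 for the case $A(x) = Bx$ with a symmetric $\Lambda$ (a sketch with assumed matrices, not part of the original solution), confirming $\nabla_x Q = 2 B^T \Lambda B x$:

# Sketch: finite-difference check that grad_x [(Bx)^T Lam (Bx)] = 2 B^T Lam B x
# for a symmetric positive definite Lam.
import numpy as np

rng = np.random.default_rng(6)
n = 4
B = rng.standard_normal((n, n))
x = rng.standard_normal(n)
S = rng.standard_normal((n, n))
Lam = S @ S.T + n * np.eye(n)          # a symmetric positive definite Lambda

Q = lambda y: (B @ y) @ (Lam @ (B @ y))

eps = 1e-6
fd_grad = np.array([(Q(x + eps * e) - Q(x - eps * e)) / (2 * eps) for e in np.eye(n)])
print(np.max(np.abs(fd_grad - 2 * B.T @ Lam @ B @ x)))   # should be tiny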