
ISSN 0032-9460, Problems of Information Transmission, 2009, Vol. 45, No. 4, pp. 295–308. © Pleiades Publishing, Inc., 2009.
Original Russian Text © V.V. Prelov, 2009, published in Problemy Peredachi Informatsii, 2009, Vol. 45, No. 4, pp. 3–17.
INFORMATION THEORY

Mutual Information of Several Random Variables
and Its Estimation via Variation¹
V. V. Prelov
Kharkevich Institute for Information Transmission Problems, RAS, Moscow
prelov@iitp.ru
Received May 12, 2009
Abstract—We obtain some upper and lower bounds for the maximum of mutual information of
several random variables via the variational distance between the joint distribution of these random
variables and the product of its marginal distributions. In this connection, some properties of the
variational distance between probability distributions of this type are derived. We show that in
some special cases the estimates of the maximum of mutual information obtained here are optimal
or asymptotically optimal. Some results of this paper generalize the corresponding results
of [1–3] to the multivariate case.
DOI: 10.1134/S0032946009040012

1. INTRODUCTION
Let P = {p_i} and Q = {q_i}, i ∈ N = {1, 2, ...}, be two discrete probability distributions. Recall
that the (information) divergence is defined as

  D(P || Q) = \sum_i p_i \ln \frac{p_i}{q_i},

and the variational distance V(P, Q) between P and Q is the L_1 distance between them; i.e.,

  V(P, Q) = \sum_i |p_i - q_i|

(though sometimes the variational distance between P and Q is defined as half the L_1 distance
between them).
There is extensive literature devoted to the investigation of relationships between D(P || Q) and
V(P, Q) (see, e.g., [4] and references therein). Here we only mention the so-called Pinsker inequality

  \frac{1}{2} V^2(P, Q) \le D(P || Q),                                                    (1)

though in his original paper [5] Pinsker proved a weaker inequality D(P || Q) \ge c V^2(P, Q) with a
constant c < 1/2. Note also that in general it is impossible to upper estimate D(P || Q) via V(P, Q)
without some additional conditions on the probability distributions P and Q, since D(P || Q) can
be arbitrarily large while V(P, Q) is arbitrarily small.
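As a small numerical illustration of these two quantities, the following Python sketch (the distributions are arbitrary toy choices, not taken from the text) checks inequality (1) on one pair of distributions and exhibits a family of pairs for which V(P, Q) shrinks while D(P || Q) grows without bound.

    import math

    def kl_divergence(p, q):
        """D(P || Q) = sum_i p_i ln(p_i / q_i); infinite if some q_i = 0 < p_i."""
        total = 0.0
        for pi, qi in zip(p, q):
            if pi > 0:
                if qi == 0:
                    return math.inf
                total += pi * math.log(pi / qi)
        return total

    def variation(p, q):
        """V(P, Q) = sum_i |p_i - q_i|, the L1 distance used in the text."""
        return sum(abs(pi - qi) for pi, qi in zip(p, q))

    # Pinsker inequality (1): (1/2) V^2(P, Q) <= D(P || Q).
    p = [0.5, 0.3, 0.2]
    q = [0.4, 0.4, 0.2]
    assert 0.5 * variation(p, q) ** 2 <= kl_divergence(p, q)

    # D(P || Q) can be large while V(P, Q) is small: give Q an exponentially
    # small mass on a symbol that carries mass delta under P.
    for delta in (0.1, 0.07, 0.05):
        p = [1 - delta, delta]
        q2 = delta * math.exp(-1.0 / delta ** 2)
        q = [1 - q2, q2]
        print(f"V = {variation(p, q):.3f}   D = {kl_divergence(p, q):.1f}")
    # V shrinks (about 2*delta) while D grows (about 1/delta).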
In this paper, we consider a special case where P is the joint distribution of several discrete random
variables and Q is the direct product of the distributions of these random variables. Let X_1, ..., X_n
be discrete random variables ranging in finite or countable sets I_i, i = 1, ..., n, respectively.
¹ Supported in part by the Russian Foundation for Basic Research, project no. 09-01-00536.

Since in what follows we consider only discrete random variables and operate only with probability
distributions of such random variables, we usually assume without loss of generality that I_i =
{1, 2, ..., N_i}, i = 1, ..., n, where the N_i are given integers; moreover, some of them may be infinite.
Denote by

  I(X_1; ...; X_n) := D(P_{X_1...X_n} || P_{X_1} × ··· × P_{X_n})                                            (2)

the information divergence, and by

  Δ(X_1, ..., X_n) := V(P_{X_1...X_n}, P_{X_1} × ··· × P_{X_n})                                              (3)

the variational distance between the joint distribution P_{X_1...X_n} of the random variables X_1, ..., X_n
and the product P_{X_1} × ··· × P_{X_n} of their marginal distributions. The quantity I(X_1; ...; X_n) is
usually called the mutual information of X_1, ..., X_n (see, e.g., [6]), which in the special case n = 2
coincides with the standard mutual information I(X_1; X_2) of two random variables. The quantities
defined above satisfy the inequality

  \frac{1}{2} Δ^2(X_1, ..., X_n) \le I(X_1; ...; X_n),                                                       (4)

which is a special case of inequality (1).
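The quantities in (2)–(4) are straightforward to evaluate for a small joint table. The following Python sketch (the 2 × 3 joint distribution is an arbitrary illustrative choice) computes I(X_1; X_2) and Δ(X_1, X_2) directly from definitions (2) and (3) and checks inequality (4).

    import itertools
    import math

    def marginals(joint, shape):
        """Marginal distributions p^(k) of a joint pmf given as {(i_1,...,i_n): prob}."""
        margs = [[0.0] * size for size in shape]
        for idx, prob in joint.items():
            for k, i in enumerate(idx):
                margs[k][i] += prob
        return margs

    def mutual_information(joint, shape):
        """I(X_1;...;X_n) = D(P_{X_1...X_n} || P_{X_1} x ... x P_{X_n}), definition (2)."""
        margs = marginals(joint, shape)
        return sum(prob * math.log(prob / math.prod(margs[k][i] for k, i in enumerate(idx)))
                   for idx, prob in joint.items() if prob > 0)

    def delta(joint, shape):
        """Delta(X_1,...,X_n) = V(P_{X_1...X_n}, P_{X_1} x ... x P_{X_n}), definition (3)."""
        margs = marginals(joint, shape)
        return sum(abs(joint.get(idx, 0.0) - math.prod(margs[k][i] for k, i in enumerate(idx)))
                   for idx in itertools.product(*(range(size) for size in shape)))

    # An arbitrary 2 x 3 joint distribution of (X_1, X_2).
    joint = {(0, 0): 0.20, (0, 1): 0.05, (0, 2): 0.15,
             (1, 0): 0.10, (1, 1): 0.30, (1, 2): 0.20}
    I, d = mutual_information(joint, (2, 3)), delta(joint, (2, 3))
    print(I, d)
    assert 0.5 * d ** 2 <= I          # inequality (4)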
Consider the quantities

  I_ε(X_1; ...; X_n) := \sup_{Y : Δ(X_1,...,X_n,Y) \le ε} I(X_1; ...; X_n; Y),                               (5)

where the supremum is over all discrete random variables Y defined by conditional distributions
P_{Y | X_1...X_n} such that Δ(X_1, ..., X_n, Y) ≤ ε. In addition, note that I_ε(X_1; ...; X_n) is defined only
for ε ≥ Δ(X_1, ..., X_n), since it is easily seen that Δ(X_1, ..., X_n, Y) ≥ Δ(X_1, ..., X_n) for any Y (see
also Section 3).
For given integers N_1, ..., N_n, define

  I_ε^{(N_1,...,N_n)} := \sup_{X_i : |I_i| = N_i, i = 1,...,n} I_ε(X_1; ...; X_n),                           (6)

where |I| denotes the cardinality of a set I.


The main goal of this paper is to obtain upper and lower estimates for the quantities I_ε(X_1; ...; X_n)
and I_ε^{(N_1,...,N_n)} introduced above and to investigate their asymptotics for various behavior of ε. Some
results obtained here generalize the corresponding results of [1–3] to arbitrary n. When deriving
the estimates mentioned above, we use some properties, obtained in this paper, of the variational
distance between the joint distribution of several random variables and the product of their marginal
distributions.
The main results of the paper are stated in Section 2. Some results connected with properties
of the variational distance are described in Section 3. Section 4 contains proofs of the main statements
of the paper. Proofs of Lemma 2 and Corollaries 1 and 2 are postponed to the Appendix.
2. MAIN RESULTS
According to definition (2), the mutual information I(X_1; ...; X_n; Y) of the random variables X_1, ...,
X_n, Y can be represented as

  I(X_1; ...; X_n; Y) = \sum_{i=1}^{n} H(X_i) + H(Y) - H(X_1, ..., X_n, Y)
                      = \sum_{i=1}^{n} H(X_i) - H(X_1, ..., X_n) + I((X_1, ..., X_n); Y),                    (7)


where, as usual, H(·) and H(· | ·) denote the entropy and conditional entropy of the corresponding
random variables, respectively. Therefore, considering (X_1, ..., X_n) as a single random variable,
one can use some results of [1–3] for estimating sup_Y I(X_1; ...; X_n; Y) via ε under the assumption
that V(P_{X_1...X_n Y}, P_{X_1...X_n} × P_Y) ≤ ε. However, our aim is to estimate sup_Y I(X_1; ...; X_n; Y) via ε
provided that Δ(X_1, ..., X_n, Y) = V(P_{X_1...X_n Y}, P_{X_1} × ··· × P_{X_n} × P_Y) ≤ ε, and there is no direct
dependence between V(P_{X_1...X_n Y}, P_{X_1...X_n} × P_Y) and V(P_{X_1...X_n Y}, P_{X_1} × ··· × P_{X_n} × P_Y).
We will see in Section 3 that V(P_{X_1...X_n Y}, P_{X_1...X_n} × P_Y) can be both larger and smaller than
V(P_{X_1...X_n Y}, P_{X_1} × ··· × P_{X_n} × P_Y), though in the special case where the random variables
X_1, ..., X_n are independent, V(P_{X_1...X_n Y}, P_{X_1...X_n} × P_Y) = V(P_{X_1...X_n Y}, P_{X_1} × ··· × P_{X_n} × P_Y).
Thus, in general we cannot directly apply the results of [1–3]; however, as will be seen in Section 4,
some methods of those papers can be partially used in the case considered here.
Let us introduce some necessary notation to state our results. The joint and marginal probability
distributions of the random variables X_1, ..., X_n are denoted by

  p_{i_1...i_n} := Pr{X_1 = i_1, ..., X_n = i_n},      p^{(k)}_{i_k} := Pr{X_k = i_k},   i_k ∈ I_k,   k = 1, ..., n,      (8)

respectively. Let

  Δ̂(X_1, ..., X_n) := \max_Y Δ(X_1, ..., X_n, Y),                                                            (9)

where the maximum is over all random variables Y. In Section 3 (see Lemma 1), it is shown that

  Δ̂(X_1, ..., X_n) = 2 \Big( 1 - \sum_{i_1,...,i_n} p_{i_1...i_n} \, p^{(1)}_{i_1} \cdots p^{(n)}_{i_n} \Big).           (10)

Assume that the vectors (p^{(1)}_{i_1}, ..., p^{(n)}_{i_n}) are ordered in such a way that

  (p^{(1)}_{i_1}, ..., p^{(n)}_{i_n}) \succeq (p^{(1)}_{j_1}, ..., p^{(n)}_{j_n})   if   \prod_{k=1}^{n} p^{(k)}_{i_k} \le \prod_{k=1}^{n} p^{(k)}_{j_k}.

For a given vector s = (s_1, ..., s_n), s_k ∈ I_k, k = 1, ..., n, introduce the set

  D_s := \{ (i_1, ..., i_n) : (p^{(1)}_{i_1}, ..., p^{(n)}_{i_n}) \succ (p^{(1)}_{s_1}, ..., p^{(n)}_{s_n}) \}           (11)
and the quantities

  K_s := \sum_{(i_1,...,i_n) \in D_s} \frac{(1 - p_{i_1...i_n}) \prod_{k=1}^{n} p^{(k)}_{i_k}}{1 - \prod_{k=1}^{n} p^{(k)}_{i_k}} \, \ln \frac{1}{\prod_{k=1}^{n} p^{(k)}_{i_k}},          (12)

  L_s := \frac{(1 - p_{s_1...s_n}) \prod_{k=1}^{n} p^{(k)}_{s_k}}{1 - \prod_{k=1}^{n} p^{(k)}_{s_k}} \, \ln \frac{1}{\prod_{k=1}^{n} p^{(k)}_{s_k}}.                                      (13)

Moreover, let

  M := \sum_{i_1,...,i_n} \frac{\big( p_{i_1...i_n} - \prod_{k=1}^{n} p^{(k)}_{i_k} \big) \prod_{k=1}^{n} p^{(k)}_{i_k}}{1 - \prod_{k=1}^{n} p^{(k)}_{i_k}} \, \ln \frac{1}{\prod_{k=1}^{n} p^{(k)}_{i_k}}.          (14)

In what follows, we always assume that H(X_1, ..., X_n) < ∞.
Before formulating our first statement on an upper bound for the function I_ε(X_1; ...; X_n), note
that I_ε(X_1; ...; X_n) = \sum_{i=1}^{n} H(X_i) if ε ≥ Δ̂(X_1, ..., X_n). Indeed, this equality immediately follows
from (7) if we put Y = (X_1, ..., X_n). Therefore, to study the behavior of I_ε(X_1; ...; X_n), we may
restrict ourselves to the case Δ(X_1, ..., X_n) ≤ ε < Δ̂(X_1, ..., X_n).

Proposition 1. For any ε, Δ(X_1, ..., X_n) ≤ ε < Δ̂(X_1, ..., X_n), we have

  I_ε(X_1; ...; X_n) \le K_s + x L_s + M,                                                                    (15)

where the real number x, 0 ≤ x < 1, and the vector s = (s_1, ..., s_n), s_k ∈ I_k, k = 1, ..., n, are
defined by

  \sum_{(i_1,...,i_n) \in D_s} (1 - p_{i_1...i_n}) \, p^{(1)}_{i_1} \cdots p^{(n)}_{i_n} + x (1 - p_{s_1...s_n}) \, p^{(1)}_{s_1} \cdots p^{(n)}_{s_n} = ε/2,          (16)

and Δ(X_1, ..., X_n), Δ̂(X_1, ..., X_n), D_s, K_s, L_s, and M are defined in (3) and (10)–(14).


(N1 ,...,Nn )

Proofs of this and subsequent propositions are given in Section 4. An upper bound for I
is given in the following proposition.
n
.
Nk , we have the inequality
Proposition 2. For any , 0 < 2(1 1/N ), N =
k=1

I(N1 ,...,Nn )

ln(N 1) + h
,
2
2

(17)

.
where h(x) = x ln x (1 x) ln(1 x) is the binary entropy function, and
I(N1 ,...,Nn ) = ln N

if

2(1 1/N ).

(18)

The lower bound given in the following proposition, though not optimal, is asymptotically
optimal in some cases.

Proposition 3. For any ε, Δ(X_1, ..., X_n) ≤ ε < Δ̂(X_1, ..., X_n), we have

  I_ε(X_1; ...; X_n) \ge \sum_{i=1}^{n} H(X_i) - \frac{Δ̂(X_1, ..., X_n) - ε}{Δ̂(X_1, ..., X_n) - Δ(X_1, ..., X_n)} \, H(X_1, ..., X_n).          (19)
Remark 1. In the special case where the random variables X_1, ..., X_n are independent, one can
easily verify that the upper and lower bounds for I_ε(X_1; ...; X_n) and I_ε^{(N_1,...,N_n)} given in
Propositions 1–3 coincide with the corresponding bounds for these quantities obtained in [1–3] if the vector
(X_1, ..., X_n) is considered as a single discrete random variable ranging in the set I = I_1 × ··· × I_n.
In particular, this observation allows us to claim that for any ε, 0 ≤ ε ≤ 2(1 − 1/N), with
N = \prod_{k=1}^{n} N_k we have

  I_ε(X_1; ...; X_n) = \frac{ε N}{2(N - 1)} \ln N                                                            (20)

if X_1, ..., X_n are independent and each X_i takes N_i different values with equal probability.
Remark 2. Note that for ε = Δ(X_1, ..., X_n) (i.e., for the minimum value of ε), the lower
estimate (19) reduces to the inequality I_ε(X_1; ...; X_n) ≥ I(X_1; ...; X_n). At first sight, it seems that
this estimate is tight, i.e., there should be equality instead of the inequality, since it is obvious
that Δ(X_1, ..., X_n, Y) = Δ(X_1, ..., X_n) if Y does not depend on the collection of random variables
X_1, ..., X_n, and therefore I_ε(X_1; ...; X_n) ≥ I(X_1; ...; X_n). However, actually we have the
strict inequality I_ε(X_1; ...; X_n) > I(X_1; ...; X_n) if the random variables X_1, ..., X_n are dependent,
since there exists a random variable Y such that Δ(X_1, ..., X_n, Y) = Δ(X_1, ..., X_n) and at the
same time Y depends on the collection of random variables X_1, ..., X_n (see Section 3, Lemma 1),
and therefore we obviously have

  I_ε(X_1; ...; X_n) \ge I(X_1; ...; X_n; Y) = I(X_1; ...; X_n) + I((X_1, ..., X_n); Y) > I(X_1; ...; X_n).

Note also that Propositions 1–3 imply the two corollaries stated below, which are proved in the
Appendix.

Corollary 1. We have the asymptotic relations

  I_ε^{(N_1,...,N_n)} = \frac{ε}{2} \ln N \, (1 + o(1)),      N := \prod_{k=1}^{n} N_k \to \infty,           (21)

and

  \frac{n ε}{2(n + 1)} \ln \frac{1}{ε} + O(ε) \le I_ε^{(N_1,...,N_n)} \le \frac{ε}{2} \ln \frac{1}{ε} + O(ε),      ε \to 0.          (22)
Before formulating the second corollary, recall that I_ε(X_1; ...; X_n) was defined as
\sup_{Y : Δ(X_1,...,X_n,Y) \le ε} I(X_1; ...; X_n; Y). It seems natural to consider the more general quantities
I_ε^{(m)}(X_1; ...; X_n) defined by the equality

  I_ε^{(m)}(X_1; ...; X_n) := \sup_{Y_1,...,Y_m : Δ(X_1,...,X_n,Y_1,...,Y_m) \le ε} I(X_1; ...; X_n; Y_1; ...; Y_m).          (23)

However, as the following statement shows, definition (23) is in fact senseless.

Corollary 2. For any random variables X_1, ..., X_n, any ε > Δ(X_1, ..., X_n), and any integer
m ≥ 2, we have

  I_ε^{(m)}(X_1; ...; X_n) = ∞.                                                                              (24)
3. SOME PROPERTIES OF A SPECIAL TYPE OF VARIATIONAL DISTANCES
In this section, some properties of the variational distance between the joint distribution of several
random variables and the product of their marginal distributions are considered. This type of
variational distance is sometimes called the Kolmogorov distance (see, e.g., [7]). Some of these
properties are used in the proofs of our main results, and others, though not directly used in the
proofs, are of independent interest.
Along with the function Δ(X_1, ..., X_n) introduced above (see (3)), consider also the functions

  Δ((X_1, ..., X_k), X_{k+1}, ..., X_n) := V(P_{X_1...X_n}, P_{X_1...X_k} × P_{X_{k+1}} × ··· × P_{X_n}),          (25)

where k takes integer values from 1 to n. In particular, for k = 1 we obtain the former function,
i.e.,

  Δ((X_1), X_2, ..., X_n) = Δ(X_1, ..., X_n).

The quantities

  Δ(X_1, ..., X_k, (X_{k+1}, ..., X_m), X_{m+1}, ..., X_n)

are defined similarly. When defining such quantities, one should take into account that it is necessary
to consider the random vector (X_{k+1}, ..., X_m) as a single random variable whose probability
distribution is the joint distribution of the collection of random variables X_{k+1}, ..., X_m.
Let us list several simple properties of these quantities.

– If the random variables X_1, ..., X_k are independent, then

    Δ((X_1, ..., X_k), X_{k+1}, ..., X_n) = Δ(X_1, ..., X_n);

– If the random variables X_{k+1}, ..., X_n and the vector (X_1, ..., X_k) are jointly independent,
  then Δ((X_1, ..., X_k), X_{k+1}, ..., X_n) = 0. In particular, if X_1, ..., X_n are independent, then
  Δ(X_1, ..., X_n) = 0;

– For any integers k, 1 ≤ k ≤ n − 1, and m ≥ 0, we have

    Δ(X_1, ..., X_k, (X_{k+1}, ..., X_n)) \le Δ(X_1, ..., X_k, (X_{k+1}, ..., X_{n+m}))                      (26)

  and

    Δ(X_1, ..., X_k) \le Δ(X_1, ..., X_n);                                                                   (27)

  moreover, Δ(X_1, ..., X_k) = Δ(X_1, ..., X_n) if the random variables X_{k+1}, ..., X_n and the vector
  (X_1, ..., X_k) are jointly independent;

– If X_i ≡ X with probability 1 for all i = 1, ..., n, then for all integers k, 1 ≤ k ≤ n − 1, we have

    Δ((X_1, ..., X_k), X_{k+1}, ..., X_n) = 2 \Big( 1 - \sum_i p_i^{n-k+1} \Big);                            (28)

  in particular,

    Δ(X_1, ..., X_n) = Δ(X, ..., X) = 2 \Big( 1 - \sum_i p_i^{n} \Big),                                      (29)

  where p_i = Pr{X = i}, i = 1, 2, ....


All these properties easily follow from definitions (3) and (25). Let us prove, for example,
inequality (26). We have

  Δ(X_1, ..., X_k, (X_{k+1}, ..., X_n)) = \sum_{i_1,...,i_n} \Big| p_{i_1...i_n} - p^{(1)}_{i_1} \cdots p^{(k)}_{i_k} \, p_{i_{k+1}...i_n} \Big|
    = \sum_{i_1,...,i_n} \Big| \sum_{i_{n+1},...,i_{n+m}} \big( p_{i_1...i_{n+m}} - p^{(1)}_{i_1} \cdots p^{(k)}_{i_k} \, p_{i_{k+1}...i_{n+m}} \big) \Big|
    \le \sum_{i_1,...,i_{n+m}} \Big| p_{i_1...i_{n+m}} - p^{(1)}_{i_1} \cdots p^{(k)}_{i_k} \, p_{i_{k+1}...i_{n+m}} \Big| = Δ(X_1, ..., X_k, (X_{k+1}, ..., X_{n+m})).
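Formulas (28) and (29) are also convenient to check numerically. The sketch below (Python; the distribution of X and the value of n are arbitrary illustrative choices) compares a direct evaluation of Δ(X, ..., X) from definition (3) with the closed form (29).

    import itertools
    import math

    p = [0.5, 0.3, 0.2]                # distribution of X (arbitrary illustrative choice)
    n = 3                              # number of copies X_1 = ... = X_n = X

    # Joint pmf of (X_1,...,X_n): all the mass sits on the diagonal.
    def joint(idx):
        return p[idx[0]] if len(set(idx)) == 1 else 0.0

    # Direct evaluation of Delta(X,...,X) = sum |joint - product of marginals|.
    direct = sum(abs(joint(idx) - math.prod(p[i] for i in idx))
                 for idx in itertools.product(range(len(p)), repeat=n))

    closed_form = 2 * (1 - sum(pi ** n for pi in p))        # formula (29)
    print(direct, closed_form)
    assert math.isclose(direct, closed_form)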

Some other (less obvious) properties of the variational distance for the considered class of probability
distributions are given in the following lemma. Some of these properties were already mentioned
in Section 2.

Lemma 1. The following statements are valid:
(1) The function Δ̂(X_1, ..., X_n) defined in (9) satisfies equality (10); moreover,

  Δ̂(X_1, ..., X_n) = Δ(X_1, ..., X_n, (X_1, ..., X_n)) = 2 \Big( 1 - \sum_{i_1,...,i_n} p_{i_1...i_n} \, p^{(1)}_{i_1} \cdots p^{(n)}_{i_n} \Big);          (30)

(2) For any m ≥ 2, we have

  \sup_{Y_1,...,Y_m} Δ(X_1, ..., X_n, Y_1, ..., Y_m) = 2;                                                    (31)

(3) If the random variables X_1, ..., X_n are dependent, then there exists a random variable Y
such that Δ(X_1, ..., X_n, Y) = Δ(X_1, ..., X_n) but Y depends on X_1, ..., X_n.
Before proving this lemma, we make two remarks.

Remark 3. The function Δ(X_1, ..., X_n) can be both greater and smaller than Δ((X_1, ..., X_k),
X_{k+1}, ..., X_n), depending on the probability distribution of the random variables X_1, ..., X_n.
A similar statement is also valid for the functions Δ̂(X_1, ..., X_n) and

  Δ̂((X_1, ..., X_k), X_{k+1}, ..., X_n) := \max_Y Δ((X_1, ..., X_k), X_{k+1}, ..., X_n, Y).

Indeed, it easily follows from (28)–(30) that in the case where X_i ≡ X, i = 1, ..., n, and X is a
nondegenerate random variable, we have

  Δ(X_1, ..., X_n) > Δ((X_1, ..., X_k), X_{k+1}, ..., X_n)

and

  Δ̂(X_1, ..., X_n) > Δ̂((X_1, ..., X_k), X_{k+1}, ..., X_n)

for any n ≥ 3 and k, 2 ≤ k ≤ n − 1. These inequalities are valid for most joint probability
distributions. However, in some cases the opposite inequalities hold, as the following example
shows. Let X and Y be random variables, each of them taking two possible values 1 and 2, and let
their joint distribution be given by the formulas

  p_{11} := Pr{X = 1, Y = 1} = p^2 + ε,      p_{22} := Pr{X = 2, Y = 2} = q^2 - ε,
  p_{12} := Pr{X = 1, Y = 2} = Pr{X = 2, Y = 1} =: p_{21} = pq,

where p > 0, q > 0, p + q = 1, and ε > 0 is sufficiently small. Let Z = (X, Y). Then, using
equality (30), we get

  Δ̂(X, Y) = Δ(X, Y, Z) = 2 \Big( 1 - \sum_{i,j} p_{ij} p_i q_j \Big),

where

  p_1 := p_{11} + p_{12} = p + ε,      p_2 := 1 - p_1 = q - ε,
  q_1 := p_{11} + p_{21} = p + ε,      q_2 := 1 - q_1 = q - ε,

and

  Δ̂((X, Y)) = Δ((X, Y), Z) = 2 \Big( 1 - \sum_{i,j} p_{ij}^2 \Big).

It is easy to see that

  \sum_{i,j} p_{ij} p_i q_j = p^4 + q^4 + 2 p^2 q^2 + ε (2p^3 - 2p^2 q + 2pq^2 - 2q^3 + p^2 - q^2) + O(ε^2),      ε \to 0,

and

  \sum_{i,j} p_{ij}^2 = p^4 + q^4 + 2 p^2 q^2 + ε (2p^2 - 2q^2) + O(ε^2),      ε \to 0.

Therefore, if p is rather close to 1 and ε is sufficiently small, then Δ̂(X, Y) < Δ̂((X, Y)).
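The example of this remark can be verified numerically. The sketch below (Python; the particular values p = 0.9 and ε = 0.001 are an arbitrary choice within the range indicated above) evaluates Δ̂(X, Y) and Δ̂((X, Y)) via (30) and confirms the inequality Δ̂(X, Y) < Δ̂((X, Y)).

    p, eps = 0.9, 1e-3                 # p close to 1 and eps small, as in the remark
    q = 1 - p
    joint = [[p * p + eps, p * q],
             [p * q,       q * q - eps]]
    px = [joint[0][0] + joint[0][1], joint[1][0] + joint[1][1]]     # marginal of X
    qy = [joint[0][0] + joint[1][0], joint[0][1] + joint[1][1]]     # marginal of Y

    # Delta_hat(X, Y) from (30): 2 (1 - sum_{i,j} p_ij p_i q_j).
    dh_xy = 2 * (1 - sum(joint[i][j] * px[i] * qy[j] for i in range(2) for j in range(2)))
    # Delta_hat((X, Y)): the pair treated as one random variable, so (30) gives
    # 2 (1 - sum_{i,j} p_ij^2).
    dh_pair = 2 * (1 - sum(joint[i][j] ** 2 for i in range(2) for j in range(2)))

    print(dh_xy, dh_pair)
    assert dh_xy < dh_pair             # the opposite inequality exhibited above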
Remark 4. Let X_1, ..., X_n be a collection of random variables with a given joint probability
distribution. Denote by X̃_1, ..., X̃_n a system of independent random variables with the same marginal
distributions as the former one. It turns out that Δ̂(X_1, ..., X_n) can be both larger and smaller
than Δ̂(X̃_1, ..., X̃_n), depending on the joint distribution of the random variables X_1, ..., X_n. Indeed,
consider two examples for the case n = 2, denoting as usual p_{ij} := Pr{X = i, Y = j},
p_i := Pr{X = i}, and q_j := Pr{Y = j}, i, j ∈ {1, 2}.
Let

  p_{ij} := 1/4 + ε   if i = j,      p_{ij} := 1/4 - ε   otherwise.

Then, using (30), we obtain

  Δ̂(X, Y) = 2 \Big( 1 - \sum_{i,j} p_{ij} p_i q_j \Big) = 3/2

and

  Δ̂(X̃, Ỹ) = 2 \Big( 1 - \sum_i p_i^2 \sum_j q_j^2 \Big) = 3/2

for any ε, −1/4 ≤ ε ≤ 1/4, so that in this case we have Δ̂(X, Y) = Δ̂(X̃, Ỹ).
Let

  p_{11} = p_{22} := 1/4 + ε,      p_{12} := 1/4,      p_{21} := 1/4 - 2ε.

Then we have

  Δ̂(X, Y) = 2 (3/4 - 2ε^2 + 4ε^3)

and

  Δ̂(X̃, Ỹ) = 2 (3/4 - 2ε^2 - 4ε^4),

so that Δ̂(X, Y) > Δ̂(X̃, Ỹ) if 0 < ε < 1/8, and Δ̂(X, Y) < Δ̂(X̃, Ỹ) if −1/4 < ε < 0.

Proof of Lemma 1. (1) Let us prove the first claim of the lemma. To this end, we upper
estimate Δ(X_1, ..., X_{n+1}) for an arbitrary random variable X_{n+1} given the joint distribution of the
random variables X_1, ..., X_n. Consider the set A := { (i_1, ..., i_{n+1}) : p_{i_1...i_{n+1}} > p^{(1)}_{i_1} \cdots p^{(n+1)}_{i_{n+1}} }.
Then, using definition (3), we get

  Δ(X_1, ..., X_{n+1}) = 2 \sum_{A} \big( p_{i_1...i_{n+1}} - p^{(1)}_{i_1} \cdots p^{(n+1)}_{i_{n+1}} \big)
    \le 2 \sum_{A} \big( p_{i_{n+1} | i_1...i_n} \, p_{i_1...i_n} - p^{(1)}_{i_1} \cdots p^{(n)}_{i_n} \, p_{i_1...i_{n+1}} \big)
    = 2 \sum_{A} p_{i_{n+1} | i_1...i_n} \big( p_{i_1...i_n} - p_{i_1...i_n} \, p^{(1)}_{i_1} \cdots p^{(n)}_{i_n} \big)
    \le 2 \Big( 1 - \sum_{i_1,...,i_{n+1}} p_{i_{n+1} | i_1...i_n} \, p_{i_1...i_n} \, p^{(1)}_{i_1} \cdots p^{(n)}_{i_n} \Big)
    = 2 \Big( 1 - \sum_{i_1,...,i_n} p_{i_1...i_n} \, p^{(1)}_{i_1} \cdots p^{(n)}_{i_n} \Big).                                    (32)

On the other hand, we have

  Δ̂(X_1, ..., X_n) \ge Δ(X_1, ..., X_n, (X_1, ..., X_n)) = 2 \Big( 1 - \sum_{i_1,...,i_n} p_{i_1...i_n} \, p^{(1)}_{i_1} \cdots p^{(n)}_{i_n} \Big).          (33)

Inequalities (32) and (33) immediately imply (30).



(2) To prove equality (31), note that

  \sup_{Y_1,...,Y_m} Δ(X_1, ..., X_n, Y_1, ..., Y_m) \ge \sup_{X_{n+1}} Δ̂(X_1, ..., X_{n+1}),

where X_{n+1} is independent of X_1, ..., X_n and is uniformly distributed in a set of M elements.
Then, using (30), we obtain

  Δ̂(X_1, ..., X_{n+1}) = 2 \Big( 1 - \sum_{i_1,...,i_{n+1}} p_{i_1...i_{n+1}} \, p^{(1)}_{i_1} \cdots p^{(n+1)}_{i_{n+1}} \Big)
    = 2 \Big( 1 - \frac{1}{M} \sum_{i_1,...,i_n} p_{i_1...i_n} \, p^{(1)}_{i_1} \cdots p^{(n)}_{i_n} \Big) \to 2    as    M \to \infty,

from which (31) follows.


(3) Now we prove the third claim of the lemma. Since, by the condition, the random variables
X_1, ..., X_n are dependent, in the set of values of these random variables there exist two vectors
(i'_1, ..., i'_n) and (i''_1, ..., i''_n) such that

  Pr{X_1 = i'_1, ..., X_n = i'_n} > \prod_{k=1}^{n} Pr{X_k = i'_k} > 0

and

  0 < Pr{X_1 = i''_1, ..., X_n = i''_n} < \prod_{k=1}^{n} Pr{X_k = i''_k}.

Now consider a random variable Y taking two equiprobable values, so that Pr{Y = 1} =
Pr{Y = 2} = 1/2, and define the joint probability distribution of the random variables X_1, ..., X_n
and Y as follows:

  Pr{X_1 = i_1, ..., X_n = i_n, Y = 1} = (1/2) ·
      Pr{X_1 = i'_1, ..., X_n = i'_n} + δ        if (i_1, ..., i_n) = (i'_1, ..., i'_n),
      Pr{X_1 = i''_1, ..., X_n = i''_n} - δ      if (i_1, ..., i_n) = (i''_1, ..., i''_n),
      Pr{X_1 = i_1, ..., X_n = i_n}              otherwise,

and

  Pr{X_1 = i_1, ..., X_n = i_n, Y = 2} = (1/2) ·
      Pr{X_1 = i'_1, ..., X_n = i'_n} - δ        if (i_1, ..., i_n) = (i'_1, ..., i'_n),
      Pr{X_1 = i''_1, ..., X_n = i''_n} + δ      if (i_1, ..., i_n) = (i''_1, ..., i''_n),
      Pr{X_1 = i_1, ..., X_n = i_n}              otherwise.

It is easy to verify that for this definition of a joint distribution of the random variables X_1, ..., X_n
and Y, both the joint distribution of X_1, ..., X_n and the distribution of Y remain the same; moreover,
Δ(X_1, ..., X_n, Y) = Δ(X_1, ..., X_n) if δ > 0 is sufficiently small. Hence, the third statement of
the lemma follows, since the random variable Y obviously depends on X_1, ..., X_n for any δ ≠ 0.
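The construction used in this proof is easy to check on a concrete example. The sketch below (Python; n = 2, the particular dependent pair, and δ = 0.01 are arbitrary illustrative choices) verifies that the perturbed joint distribution leaves the marginals and the variational distance unchanged while making Y dependent on (X_1, X_2).

    # A dependent pair (X_1, X_2) (arbitrary illustrative choice); both marginals are (1/2, 1/2).
    p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
    p1 = p2 = [0.5, 0.5]

    def variation(joint, product):
        return sum(abs(joint.get(key, 0.0) - product[key]) for key in product)

    # Delta(X_1, X_2): joint distribution versus product of marginals.
    prod2 = {(i, j): p1[i] * p2[j] for i in range(2) for j in range(2)}
    base = variation(p, prod2)

    # The construction of the proof: cell (0,0) has p > p1*p2, cell (0,1) has p < p1*p2.
    delta = 0.01                        # "sufficiently small" perturbation
    pxy = {}
    for (i, j), pij in p.items():
        shift = delta if (i, j) == (0, 0) else (-delta if (i, j) == (0, 1) else 0.0)
        pxy[(i, j, 1)] = 0.5 * (pij + shift)
        pxy[(i, j, 2)] = 0.5 * (pij - shift)

    # Marginals of X_1, X_2 and of Y (values 1, 2 with probability 1/2 each) are unchanged.
    prod3 = {(i, j, y): p1[i] * p2[j] * 0.5 for i in range(2) for j in range(2) for y in (1, 2)}
    print(base, variation(pxy, prod3))  # the two variational distances coincide
    # yet Y depends on (X_1, X_2): Pr{X_1=0, X_2=0, Y=1} = 0.205 != 0.4 * 0.5.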


4. PROOF OF THE MAIN STATEMENTS


Proof of Proposition 1. First of all, note that the proof of this proposition is quite similar to
the proof of the main theorem of [1] and generalizes it to the multivariate case considered here.
Our aim is to upper estimate the quantity I_ε(X_1; ...; X_n) defined in (5). Taking into account (7),
we observe that it is required to find (more precisely, to upper estimate) the maximum of the
functional

  F({p_{i_1...i_n | j}}, {q_j}) := \sum_i H(X_i) + \sum_j q_j \sum_{i_1,...,i_n} p_{i_1...i_n | j} \ln p_{i_1...i_n | j}          (34)

over all probability distributions {q_j} and all conditional distributions {p_{i_1...i_n | j}} satisfying the
following conditions for all possible values of i_1, ..., i_n and j:

  0 \le q_j \le 1,      \sum_j q_j = 1,      0 \le p_{i_1...i_n | j} \le 1,                                  (35)

  \sum_{i_1,...,i_n} p_{i_1...i_n | j} = 1,      \sum_j q_j \, p_{i_1...i_n | j} = p_{i_1...i_n},            (36)

  \sum_j q_j \sum_{i_1,...,i_n} \big| p_{i_1...i_n | j} - p^{(1)}_{i_1} \cdots p^{(n)}_{i_n} \big| \le ε,    (37)

where {p_{i_1...i_n}} and {p^{(k)}_{i_k}}, k = 1, ..., n, are the joint and marginal distributions of the random
variables X_1, ..., X_n, respectively.
Denote by {q*_j} and {p*_{i_1...i_n | j}} the optimal distributions maximizing the functional (34) under
conditions (35)–(37). The following lemma, generalizing Lemma 2 from [1], describes the class of
probability distributions to which the optimal distributions {p*_{i_1...i_n | j}} belong.
Lemma 2. For any j, the optimal distribution {p*_{i_1...i_n | j}} has one of the following three forms:
– there exists a vector (î_1, ..., î_n) such that p*_{î_1...î_n | j} = 1 and p*_{i_1...i_n | j} = 0 for all
  (i_1, ..., i_n) ≠ (î_1, ..., î_n);
– p*_{i_1...i_n | j} = \prod_{k=1}^{n} p^{(k)}_{i_k} for all i_1, ..., i_n;
– there exists a unique vector (î_1, ..., î_n) such that \prod_{k=1}^{n} p^{(k)}_{î_k} < p*_{î_1...î_n | j} < 1 and
  p*_{i_1...i_n | j} ∈ { 0, \prod_{k=1}^{n} p^{(k)}_{i_k} } for all (i_1, ..., i_n) ≠ (î_1, ..., î_n).

A proof of this lemma is given in the Appendix. Now we find the maximum of the functional
F({p_{i_1...i_n | j}}, {q_j}), skipping the condition \sum_{i_1,...,i_n} p_{i_1...i_n | j} = 1 but assuming that the other
conditions in (35)–(37) are fulfilled. Then Lemma 2 implies that to find the maximum, we may restrict
ourselves to collections of probabilities p_{i_1...i_n | j} ∈ { 0, \prod_{k=1}^{n} p^{(k)}_{i_k}, 1 }, since we replace possible
values of optimal conditional probabilities that are greater than \prod_{k=1}^{n} p^{(k)}_{i_k} by 0 (or 1), which only
increases the value of the functional F({p_{i_1...i_n | j}}, {q_j}). Therefore, denoting

  α := α_{i_1...i_n} := \sum_{j : p_{i_1...i_n | j} = p^{(1)}_{i_1} \cdots p^{(n)}_{i_n}} q_j                (38)

and using (34), we get

  I_ε(X_1; ...; X_n) \le \max \sum_{i_1,...,i_n} (α_{i_1...i_n} - 1) \Big( \prod_{k=1}^{n} p^{(k)}_{i_k} \Big) \ln \Big( \prod_{k=1}^{n} p^{(k)}_{i_k} \Big),          (39)

where the maximum is over all collections {q_j} and {p_{i_1...i_n | j}} under the above conditions.

To find this maximum, define

  β := β_{i_1...i_n} := p^{(1)}_{i_1} \cdots p^{(n)}_{i_n} \sum_{j : p_{i_1...i_n | j} = 0} q_j,      γ := γ_{i_1...i_n} := \sum_{j : p_{i_1...i_n | j} = 1} q_j,          (40)

and express α through β. To this end, note that we have the equalities

  α \, p^{(1)}_{i_1} \cdots p^{(n)}_{i_n} + γ = p_{i_1...i_n},      α + γ + β / \big( p^{(1)}_{i_1} \cdots p^{(n)}_{i_n} \big) = 1,

which follow from conditions (35) and (36) and definitions (38) and (40). These equalities imply

  α = 1 - \frac{β_{i_1...i_n}}{p^{(1)}_{i_1} \cdots p^{(n)}_{i_n} \big( 1 - p^{(1)}_{i_1} \cdots p^{(n)}_{i_n} \big)} + \frac{p^{(1)}_{i_1} \cdots p^{(n)}_{i_n} - p_{i_1...i_n}}{1 - p^{(1)}_{i_1} \cdots p^{(n)}_{i_n}}.          (41)

Now note that the quantities β = β_{i_1...i_n} defined in (40) must satisfy the inequalities

  \sum_{i_1,...,i_n} β_{i_1...i_n} \le ε/2,      β_{i_1...i_n} \le p^{(1)}_{i_1} \cdots p^{(n)}_{i_n} \big( 1 - p_{i_1...i_n} \big).          (42)

Indeed, the first inequality in (42) follows from (37), and the second one follows from definition (40)
and the second equality in (36). Therefore, by optimizing the right-hand side of (39) over the β_{i_1...i_n}
and taking into account (41) and conditions (42), we derive the required estimate (15).

Proof of Proposition 2. For any j = 1, 2, ..., consider the quantities

  Δ_j := V(P_{X̃_1...X̃_n}, P_{X_1...X_n | Y=j}) = \sum_{i_1,...,i_n} \big| P_{X̃_1...X̃_n}(i_1, ..., i_n) - P_{X_1...X_n | Y=j}(i_1, ..., i_n) \big|,

where X̃_1, ..., X̃_n are independent random variables such that for all i = 1, ..., n the probability
distribution of each random variable X̃_i coincides with that of X_i. Then

  I(X_1; ...; X_n; Y) = \sum_{i=1}^{n} H(X_i) - H(X_1, ..., X_n | Y)
    = H(X̃_1, ..., X̃_n) - H(X_1, ..., X_n | Y)
    = \sum_j P_Y(j) \big[ H(X̃_1, ..., X̃_n) - H(X_1, ..., X_n | Y = j) \big]
    \le \sum_j P_Y(j) \Big[ \frac{Δ_j}{2} \ln(N - 1) + h\Big(\frac{Δ_j}{2}\Big) \Big],

where N := N_1 \cdots N_n. Here we have used the inequality (see [7])

  | H(U) - H(V) | \le \frac{Δ}{2} \ln(N - 1) + h\Big(\frac{Δ}{2}\Big),

where Δ := V(P_U, P_V), and N is the cardinality of the range of each of the random variables U
and V. Now note that

  \sum_j P_Y(j) \, Δ_j = \sum_{i_1,...,i_n,j} \big| P_{X_1}(i_1) \cdots P_{X_n}(i_n) P_Y(j) - P_{X_1...X_n Y}(i_1, ..., i_n, j) \big|
    = V(P_{X_1...X_n Y}, P_{X_1} × ··· × P_{X_n} × P_Y) \le ε.



Therefore, taking into account the concavity of the function h(x), we conclude that

  I(X_1; ...; X_n; Y) \le \sum_j P_Y(j) \Big[ \frac{Δ_j}{2} \ln(N - 1) + h\Big(\frac{Δ_j}{2}\Big) \Big]
    \le \Big( \sum_j P_Y(j) \frac{Δ_j}{2} \Big) \ln(N - 1) + h\Big( \sum_j P_Y(j) \frac{Δ_j}{2} \Big)
    \le \frac{ε}{2} \ln(N - 1) + h\Big(\frac{ε}{2}\Big),

since the function \frac{x}{2} \ln(N - 1) + h\big(\frac{x}{2}\big) is monotone increasing on the segment [0, 2(1 − 1/N)].

Proof of Proposition 3. Let X_1, ..., X_n be random variables ranging in finite or countable
sets I_1, ..., I_n, respectively, and having the joint distribution {p_{i_1...i_n}}. Consider a random variable Y
ranging in the set J = {0} ∪ (I_1 × ··· × I_n) and having the following joint probability distribution
with X_1, ..., X_n:

  Pr{X_1 = i_1, ..., X_n = i_n, Y = 0} = (1 - λ) p_{i_1...i_n},
  Pr{X_1 = i_1, ..., X_n = i_n, Y = (i_1, ..., i_n)} = λ p_{i_1...i_n},
  Pr{X_1 = i_1, ..., X_n = i_n, Y = (i'_1, ..., i'_n)} = 0    if (i'_1, ..., i'_n) ≠ (i_1, ..., i_n),

where we choose the parameter λ in such a way that Δ(X_1, ..., X_n, Y) = ε. Using the definition of
Δ(X_1, ..., X_n, Y) and equality (10), it is easy to verify that

  Δ(X_1, ..., X_n, Y) = (1 - λ) Δ(X_1, ..., X_n) + λ Δ̂(X_1, ..., X_n),

from which we conclude that

  λ = \frac{ε - Δ(X_1, ..., X_n)}{Δ̂(X_1, ..., X_n) - Δ(X_1, ..., X_n)}.

Now, using the obvious estimate

  I_ε(X_1; ...; X_n) \ge I(X_1; ...; X_n; Y) = \sum_{i=1}^{n} H(X_i) - (1 - λ) H(X_1, ..., X_n),

we immediately obtain the desired inequality (19).
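The construction of Y in this proof can also be checked numerically. The sketch below (Python; n = 2, the particular dependent pair, and the value of ε between Δ and Δ̂ are arbitrary illustrative choices) builds Y as above, verifies that Δ(X_1, X_2, Y) = ε, and evaluates I(X_1; X_2; Y) against the right-hand side of (19).

    import math

    # A dependent pair (X_1, X_2) with uniform marginals (arbitrary illustrative choice).
    p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
    p1 = p2 = [0.5, 0.5]
    prod = {key: p1[key[0]] * p2[key[1]] for key in p}

    delta_low = sum(abs(p[key] - prod[key]) for key in p)          # Delta(X_1, X_2)
    delta_hat = 2 * (1 - sum(p[key] * prod[key] for key in p))     # formula (10)

    eps = 0.5 * (delta_low + delta_hat)                            # some eps in [Delta, Delta_hat)
    lam = (eps - delta_low) / (delta_hat - delta_low)              # the lambda of the proof

    # Y = 0 with probability 1 - lambda, Y = (X_1, X_2) with probability lambda, hence
    # H(X_1, X_2 | Y) = (1 - lambda) H(X_1, X_2).
    def entropy(probs):
        return -sum(w * math.log(w) for w in probs if w > 0)

    H1, H2, H12 = entropy(p1), entropy(p2), entropy(list(p.values()))
    info = H1 + H2 - (1 - lam) * H12                               # I(X_1; X_2; Y)
    rhs_19 = H1 + H2 - (delta_hat - eps) / (delta_hat - delta_low) * H12
    print(info, rhs_19)                                            # the two coincide

    # Direct check that Delta(X_1, X_2, Y) = eps.
    joint_y = {(key, 0): (1 - lam) * p[key] for key in p}
    joint_y.update({(key, key): lam * p[key] for key in p})
    py = {0: 1 - lam, **{key: lam * p[key] for key in p}}
    dist = sum(abs(joint_y.get((key, y), 0.0) - prod[key] * py[y]) for key in p for y in py)
    print(dist, eps)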

APPENDIX
Proof of Lemma 2. The proof of this lemma is quite similar to that of Lemmas 1 and 2
from [1], and therefore we only outline the main arguments for our case. Let a collection of
probability distributions {p*_{i_1...i_n | j}, q*_j} be optimal, maximizing the functional F({p_{i_1...i_n | j}}, {q_j})
under conditions (35)–(37). Assume the contrary: let the claim of Lemma 2 be not valid, i.e., let
there exist two different vectors (i'_1, ..., i'_n) and (i''_1, ..., i''_n) such that for some j_0 we have

  p*_{i'_1...i'_n | j_0} ∉ \Big\{ 0, \prod_{k=1}^{n} p^{(k)}_{i'_k} \Big\},      p*_{i''_1...i''_n | j_0} ∉ \Big\{ 0, \prod_{k=1}^{n} p^{(k)}_{i''_k} \Big\}.

For a sufficiently small δ, consider two new conditional probability distributions, {p'_{i_1...i_n | j_0}} and
{p''_{i_1...i_n | j_0}}, defined by the equalities

  p'_{i'_1...i'_n | j_0} = p*_{i'_1...i'_n | j_0} + δ,      p'_{i''_1...i''_n | j_0} = p*_{i''_1...i''_n | j_0} - δ,      p'_{i_1...i_n | j_0} = p*_{i_1...i_n | j_0}   in all other cases,

and

  p''_{i'_1...i'_n | j_0} = p*_{i'_1...i'_n | j_0} - δ,      p''_{i''_1...i''_n | j_0} = p*_{i''_1...i''_n | j_0} + δ,      p''_{i_1...i_n | j_0} = p*_{i_1...i_n | j_0}   in all other cases.

It is easy to verify that the conditional distribution {p*_{i_1...i_n | j_0}} is the half-sum of the distributions
{p'_{i_1...i_n | j_0}} and {p''_{i_1...i_n | j_0}}; moreover, we have

  \sum_{i_1,...,i_n} \Big| p*_{i_1...i_n | j_0} - \prod_{k=1}^{n} p^{(k)}_{i_k} \Big|
    = \frac{1}{2} \sum_{i_1,...,i_n} \Big| p'_{i_1...i_n | j_0} - \prod_{k=1}^{n} p^{(k)}_{i_k} \Big| + \frac{1}{2} \sum_{i_1,...,i_n} \Big| p''_{i_1...i_n | j_0} - \prod_{k=1}^{n} p^{(k)}_{i_k} \Big|.

Then it is easy to show that the conditional distribution {p*_{i_1...i_n | j_0}} cannot be included in the
collection of optimal distributions {p*_{i_1...i_n | j}, q*_j} maximizing the functional F({p_{i_1...i_n | j}}, {q_j}). This
claim generalizes Lemma 1 from [1] and is almost obvious: it suffices to split the state j_0 into
two states j'_0 and j''_0 (each with probability q*_{j_0}/2) and consider the two distributions {p'_{i_1...i_n | j_0}}
and {p''_{i_1...i_n | j_0}} instead of {p*_{i_1...i_n | j_0}}. This change increases the value of the functional F.

Proof of Corollary 1. Asymptotic equality (21) directly follows from the upper estimate (17)
and equality (20). The upper estimate in (22) also follows from (17). To prove the lower bound
in (22), we apply inequality (19), putting X_i ≡ X, i = 1, ..., n, with probability 1, where X
is a random variable taking two values with probabilities δ and 1 − δ, respectively. Choosing
δ = ε/(2(n + 1)), noting that Δ(X, ..., X) = nε/(n + 1) + O(ε^2) and Δ̂(X, ..., X) = ε + O(ε^2)
(see (29) and (30)), and using (19), we get

  I_ε^{(N_1,...,N_n)} \ge I_ε(X; ...; X) \ge n \, h\Big( \frac{ε}{2(n + 1)} \Big) + O(ε) = \frac{n ε}{2(n + 1)} \ln \frac{1}{ε} + O(ε),      ε \to 0,

as desired.

Proof of Corollary 2. First of all, note that definitions (5) and (23) imply

  I_ε^{(m)}(X_1; ...; X_n) \ge \sup_Y I_ε(X_1; ...; X_n; Y) \ge I_ε(X_1; ...; X_n; Z)

for any m ≥ 2, where the random variable Z is independent of X_1, ..., X_n and takes N equiprobable
values. Now we lower estimate I_ε(X_1; ...; X_n; Z) by inequality (19), so that

  I_ε(X_1; ...; X_n; Z) \ge \Big( \sum_{i=1}^{n} H(X_i) + H(Z) \Big) - \frac{Δ̂(X_1, ..., X_n, Z) - ε}{Δ̂(X_1, ..., X_n, Z) - Δ(X_1, ..., X_n, Z)} \, H(X_1, ..., X_n, Z),

from which we easily conclude that I_ε(X_1; ...; X_n; Z) → ∞ as N → ∞. Indeed, the latter follows
from the above estimate if we take into account that Δ(X_1, ..., X_n, Z) = Δ(X_1, ..., X_n) (since the
random variable Z is independent of X_1, ..., X_n), Δ̂(X_1, ..., X_n, Z) → 2 as N → ∞ (which easily
follows from (30)), and H(Z) = ln N.

REFERENCES
1. Pinsker, M.S., On Estimation of Information via Variation, Probl. Peredachi Inf., 2005, vol. 41, no. 2,
   pp. 3–8 [Probl. Inf. Trans. (Engl. Transl.), 2005, vol. 41, no. 2, pp. 71–75].
2. Prelov, V.V., On Inequalities between Mutual Information and Variation, Probl. Peredachi Inf., 2007,
   vol. 43, no. 1, pp. 15–27 [Probl. Inf. Trans. (Engl. Transl.), 2007, vol. 43, no. 1, pp. 12–23].
3. Prelov, V.V. and van der Meulen, E.C., Mutual Information, Variation, and Fano's Inequality, Probl.
   Peredachi Inf., 2008, vol. 44, no. 3, pp. 19–32 [Probl. Inf. Trans. (Engl. Transl.), 2008, vol. 44, no. 3,
   pp. 185–197].
4. Fedotov, A.A., Harremoës, P., and Topsøe, F., Refinements of Pinsker's Inequality, IEEE Trans. Inform.
   Theory, 2003, vol. 49, no. 6, pp. 1491–1498.
5. Pinsker, M.S., Informatsiya i informatsionnaya ustoichivost' sluchainykh velichin i protsessov, Probl.
   Peredachi Inf., issue 7, Moscow: Akad. Nauk SSSR, 1960. Translated under the title Information and
   Information Stability of Random Variables and Processes, San Francisco: Holden-Day, 1964.
6. Csiszár, I. and Körner, J., Information Theory: Coding Theorems for Discrete Memoryless Systems, New
   York: Academic; Budapest: Akadémiai Kiadó, 1981. Translated under the title Teoriya informatsii:
   teoremy kodirovaniya dlya diskretnykh sistem bez pamyati, Moscow: Mir, 1985.
7. Zhang, Z., Estimating Mutual Information via Kolmogorov Distance, IEEE Trans. Inform. Theory, 2007,
   vol. 53, no. 9, pp. 3280–3282.
