
Entropy

Let X be a discrete random variable with alphabet $\mathcal{X}$ and probability mass function $p(x) = \Pr\{X = x\}$, $x \in \mathcal{X}$. The entropy of the variable X is defined by

$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x)$$

The logarithm can be in any base, but normally base 2 is used; the unit of the entropy is then called bits. If we use base b in the logarithm, we denote the entropy by $H_b(X)$. We can easily convert between entropies in different bases:

$$H_b(X) = (\log_b a) H_a(X)$$

By convention $0 \log 0 = 0$, since $y \log y \to 0$ as $y \to 0$. The entropy is a measure of the information content of a random variable.
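
As a quick numerical illustration, here is a minimal Python sketch (with a made-up pmf) that computes the entropy in bits and checks the base-conversion identity $H_2(X) = (\log_2 e) H_e(X)$.

```python
import math

def entropy(pmf, base=2):
    """Entropy -sum p log p of a pmf given as a list of probabilities."""
    return -sum(p * math.log(p, base) for p in pmf if p > 0)  # convention: 0 log 0 = 0

p = [0.5, 0.25, 0.125, 0.125]        # example pmf (illustrative values)
H2 = entropy(p, base=2)               # entropy in bits
He = entropy(p, base=math.e)          # entropy in nats

print(H2)                             # 1.75 bits
print(math.log(math.e, 2) * He)       # same value via H_2(X) = (log_2 e) H_e(X)
```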


Entropy, cont.
The entropy is always non-negative:

$$H(X) \ge 0$$

Proof: Since $0 \le p(x) \le 1$, we have $-\log p(x) \ge 0$.

The entropy never exceeds the logarithm of the alphabet size:

$$H(X) \le \log |\mathcal{X}|$$

with equality if and only if the distribution is uniform. Proof: see a later slide.


Joint and conditional entropy


The joint entropy of a pair of discrete random variables (X, Y) with joint probability mass function p(x, y) is defined by

$$H(X, Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y)$$

The conditional entropy of Y given X is defined as

$$
\begin{aligned}
H(Y \mid X) &= \sum_{x \in \mathcal{X}} p(x) H(Y \mid X = x) \\
            &= -\sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y \mid x) \log p(y \mid x) \\
            &= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y \mid x)
\end{aligned}
$$
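
A minimal sketch (assuming a small made-up joint pmf on a binary alphabet) of how H(X, Y) and H(Y | X) can be computed directly from these definitions.

```python
import math

# Made-up joint pmf p(x, y) on a 2x2 alphabet, stored as a dict
p_xy = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.125, (1, 1): 0.125}

def joint_entropy(p_xy):
    """H(X,Y) = -sum_{x,y} p(x,y) log p(x,y), in bits."""
    return -sum(p * math.log2(p) for p in p_xy.values() if p > 0)

def conditional_entropy(p_xy):
    """H(Y|X) = -sum_{x,y} p(x,y) log p(y|x), in bits."""
    p_x = {}
    for (x, _), p in p_xy.items():
        p_x[x] = p_x.get(x, 0.0) + p          # marginal p(x)
    return -sum(p * math.log2(p / p_x[x])
                for (x, _), p in p_xy.items() if p > 0)

print(joint_entropy(p_xy))        # H(X,Y)
print(conditional_entropy(p_xy))  # H(Y|X)
```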


Chain rule
The joint entropy can be written as the sum

$$H(X, Y) = H(X) + H(Y \mid X)$$

Proof:

$$
\begin{aligned}
H(X, Y) &= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y) \\
        &= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x) p(y \mid x) \\
        &= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x) - \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y \mid x) \\
        &= -\sum_{x \in \mathcal{X}} p(x) \log p(x) - \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y \mid x) \\
        &= H(X) + H(Y \mid X)
\end{aligned}
$$
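
The identity is easy to check numerically; a small sketch (with an arbitrary example joint pmf) verifying H(X, Y) = H(X) + H(Y | X):

```python
import math

p_xy = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}   # example joint pmf

def H(probs):
    """Entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

p_x = {}
for (x, _), p in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + p              # marginal p(x)

H_xy = H(p_xy.values())
H_x = H(p_x.values())
H_y_given_x = -sum(p * math.log2(p / p_x[x]) for (x, _), p in p_xy.items())

print(abs(H_xy - (H_x + H_y_given_x)) < 1e-12)   # True: chain rule holds
```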


Chain rule, cont.


We of course also have $H(X, Y) = H(Y) + H(X \mid Y)$.

The chain rule can easily be generalized to larger collections of random variables. Let $X_1, X_2, \ldots, X_n$ be drawn according to $p(x_1, x_2, \ldots, x_n)$. Then we have that

$$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1)$$


Relative entropy
The relative entropy or Kullback-Leibler distance between two probability mass functions p(x) and q(x) is defined by

$$D(p \| q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}$$

We use the conventions $0 \log \frac{0}{0} = 0$, $0 \log \frac{0}{q} = 0$ and $p \log \frac{p}{0} = \infty$.

The relative entropy is always non-negative, and zero if and only if p = q. Note that the relative entropy is not a true distance: it is not symmetric and it does not satisfy the triangle inequality.
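
A small sketch (with two made-up pmfs on the same alphabet) computing D(p||q) in bits, illustrating that it is non-negative, zero for p = q, and asymmetric:

```python
import math

def kl_divergence(p, q):
    """D(p||q) = sum_x p(x) log(p(x)/q(x)) in bits, with the convention 0 log(0/q) = 0."""
    assert all(qi > 0 for pi, qi in zip(p, q) if pi > 0), "D(p||q) is infinite if q(x)=0 where p(x)>0"
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [1/3, 1/3, 1/3]

print(kl_divergence(p, q))   # >= 0
print(kl_divergence(q, p))   # generally different: D(p||q) != D(q||p)
print(kl_divergence(p, p))   # 0, since p = q
```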


Mutual information
The mutual information between random variables X and Y with joint probability mass function p(x, y) and marginal probability mass functions p(x) and p(y) is defined as

$$I(X; Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x) p(y)} = D(p(x, y) \| p(x) p(y))$$

The mutual information is a measure of the amount of information that one random variable contains about another random variable.
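
A sketch (assuming an illustrative joint pmf) computing I(X; Y) directly as the relative entropy between p(x, y) and p(x)p(y):

```python
import math

p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}   # example joint pmf

# Marginals p(x) and p(y)
p_x, p_y = {}, {}
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

# I(X;Y) = sum_{x,y} p(x,y) log[ p(x,y) / (p(x) p(y)) ]
I = sum(p * math.log2(p / (p_x[x] * p_y[y]))
        for (x, y), p in p_xy.items() if p > 0)
print(I)   # positive here, since X and Y are dependent
```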


Mutual information, cont.


The mutual information can be rewritten as

$$
\begin{aligned}
I(X; Y) &= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x) p(y)} \\
        &= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x \mid y)}{p(x)} \\
        &= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x) + \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x \mid y) \\
        &= H(X) - H(X \mid Y)
\end{aligned}
$$

or symmetrically as

$$I(X; Y) = H(Y) - H(Y \mid X)$$

Thus X says as much about Y as Y says about X.


Mutual information, cont.


Note that $I(X; Y) = I(Y; X)$ and

$$I(X; X) = H(X) - H(X \mid X) = H(X)$$

Using the chain rule, we also note that

$$I(X; Y) = H(X) + H(Y) - H(X, Y)$$
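
These identities can also be checked numerically; a sketch (with a made-up joint pmf) verifying I(X; Y) = H(X) + H(Y) - H(X, Y):

```python
import math

p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}   # example joint pmf

def H(probs):
    """Entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

p_x, p_y = {}, {}
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

I = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())
print(abs(I - (H(p_x.values()) + H(p_y.values()) - H(p_xy.values()))) < 1e-12)   # True
```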


Mutual information, cont.


The conditional mutual information of random variables X and Y given Z is defined by

$$I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z)$$

Chain rule for mutual information:

$$I(X_1, X_2, \ldots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y \mid X_{i-1}, X_{i-2}, \ldots, X_1)$$


Jensen's inequality
A function f is said to be convex over an interval (a, b) if for every $x_1, x_2 \in (a, b)$ and $0 \le \lambda \le 1$

$$f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2)$$

The function is said to be strictly convex if equality holds only for $\lambda = 0$ or $\lambda = 1$. A function f is (strictly) concave if $-f$ is (strictly) convex.

Jensen's inequality: If f is a convex function and X is a random variable, then

$$E f(X) \ge f(E X)$$

If f is strictly convex, then equality implies that X = EX with probability 1, i.e. X is a constant.
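
A quick numerical illustration (with the strictly convex function f(x) = x^2 and an arbitrary discrete X) of E f(X) >= f(EX):

```python
# Jensen's inequality for the strictly convex function f(x) = x**2:
# E[f(X)] >= f(E[X]), with equality only when X is constant.
xs = [1, 2, 6]              # values of X (illustrative)
ps = [0.2, 0.5, 0.3]        # their probabilities

EX = sum(p * x for p, x in zip(ps, xs))
EfX = sum(p * x**2 for p, x in zip(ps, xs))

print(EfX, ">=", EX**2)     # 13.0 >= 9.0, so the inequality holds
```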


Information inequality
Let p(x), q(x), $x \in \mathcal{X}$, be two probability mass functions. Then

$$D(p \| q) \ge 0$$

with equality if and only if p(x) = q(x) for all x.

Proof: Let $A = \{x : p(x) > 0\}$. Then

$$
\begin{aligned}
D(p \| q) &= \sum_{x \in A} p(x) \log \frac{p(x)}{q(x)} = -\sum_{x \in A} p(x) \log \frac{q(x)}{p(x)} \\
          &\ge -\log \sum_{x \in A} p(x) \frac{q(x)}{p(x)} = -\log \sum_{x \in A} q(x) \\
          &\ge -\log \sum_{x \in \mathcal{X}} q(x) = -\log 1 = 0
\end{aligned}
$$

Since log t is a strictly concave function of t, we have equality if and only if q(x)/p(x) is constant everywhere, i.e. p(x) = q(x).

Mutual information, entropy


For any two random variables X and Y we have

$$I(X; Y) \ge 0$$

with equality if and only if X and Y are independent.

Proof: $I(X; Y) = D(p(x, y) \| p(x) p(y)) \ge 0$, with equality if and only if p(x, y) = p(x) p(y), i.e. X and Y are independent.

The entropy is bounded by the logarithm of the alphabet size:

$$H(X) \le \log |\mathcal{X}|$$

Proof: Let $u(x) = \frac{1}{|\mathcal{X}|}$ be the uniform probability mass function over $\mathcal{X}$ and let p(x) be the probability mass function for X. Then

$$D(p \| u) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{u(x)} = \log |\mathcal{X}| - H(X)$$

Since $D(p \| u) \ge 0$ we get the inequality.
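
The relation $D(p \| u) = \log |\mathcal{X}| - H(X)$ used in the proof can be checked numerically (with a made-up pmf):

```python
import math

p = [0.5, 0.2, 0.2, 0.1]                 # example pmf, alphabet size 4
u = [1 / len(p)] * len(p)                # uniform pmf over the same alphabet

H = -sum(pi * math.log2(pi) for pi in p)
D = sum(pi * math.log2(pi / ui) for pi, ui in zip(p, u))

print(abs(D - (math.log2(len(p)) - H)) < 1e-12)   # True: D(p||u) = log|X| - H(X)
print(H <= math.log2(len(p)))                     # True: H(X) <= log|X|
```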



Conditional entropy
Conditioning reduces entropy:

$$H(X \mid Y) \le H(X)$$

with equality if and only if X and Y are independent.

Proof: $0 \le I(X; Y) = H(X) - H(X \mid Y)$.

Knowing another random variable Y reduces (on average) the uncertainty of the variable X.


Markov chains
Random variables X, Y, Z are said to form a Markov chain in that order (denoted X → Y → Z) if the conditional distribution of Z depends only on Y and is conditionally independent of X, i.e. if the joint probability mass function can be written as

$$p(x, y, z) = p(x) p(y \mid x) p(z \mid y)$$

X → Y → Z if and only if X and Z are conditionally independent given Y. Markovity gives

$$p(x, z \mid y) = \frac{p(x, y, z)}{p(y)} = \frac{p(x, y) p(z \mid y)}{p(y)} = p(x \mid y) p(z \mid y)$$

X → Y → Z implies that Z → Y → X. If Z = f(Y) then X → Y → Z.


Data processing inequality


If X → Y → Z then

$$I(X; Y) \ge I(X; Z)$$

Proof: By the chain rule, we can expand the mutual information in two ways:

$$I(X; Y, Z) = I(X; Z) + I(X; Y \mid Z) = I(X; Y) + I(X; Z \mid Y)$$

Since X and Z are conditionally independent given Y, we have $I(X; Z \mid Y) = 0$. Since $I(X; Y \mid Z) \ge 0$ we get the inequality. We have equality if and only if $I(X; Y \mid Z) = 0$, i.e. X → Z → Y also forms a Markov chain.

This means that no processing of Y, deterministic or random, can increase the amount of information that Y contains about X.
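
A sketch (with arbitrary made-up transition probabilities) that builds a small binary Markov chain X → Y → Z, computes the mutual informations exactly from the joint pmf p(x)p(y|x)p(z|y), and confirms I(X; Y) ≥ I(X; Z):

```python
import math
from itertools import product

# Binary Markov chain X -> Y -> Z with made-up distributions
p_x = {0: 0.4, 1: 0.6}
p_y_given_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}   # p(y|x)
p_z_given_y = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.1, 1: 0.9}}   # p(z|y)

# Joint pmf p(x,y,z) = p(x) p(y|x) p(z|y)
p_xyz = {(x, y, z): p_x[x] * p_y_given_x[x][y] * p_z_given_y[y][z]
         for x, y, z in product([0, 1], repeat=3)}

def marginal(p_xyz, keep):
    """Marginal pmf over the coordinates listed in `keep`."""
    out = {}
    for xyz, p in p_xyz.items():
        key = tuple(xyz[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

def mutual_information(p_ab):
    """I between the two coordinates of a joint pmf {(a, b): prob}, in bits."""
    p_a, p_b = {}, {}
    for (a, b), p in p_ab.items():
        p_a[a] = p_a.get(a, 0.0) + p
        p_b[b] = p_b.get(b, 0.0) + p
    return sum(p * math.log2(p / (p_a[a] * p_b[b]))
               for (a, b), p in p_ab.items() if p > 0)

I_xy = mutual_information(marginal(p_xyz, (0, 1)))
I_xz = mutual_information(marginal(p_xyz, (0, 2)))
print(I_xy >= I_xz)   # True: data processing inequality
```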


Law of large numbers


The law of large numbers states that for independent, identically distributed (i.i.d.) random variables, $\frac{1}{n} \sum_{i=1}^{n} X_i$ is close to the expected value EX for large n.

Definition: Given a sequence of random variables $X_1, X_2, \ldots$ we say that the sequence converges to a random variable X

1. In probability if for every $\varepsilon > 0$, $\Pr\{|X_n - X| > \varepsilon\} \to 0$

2. In mean square if $E(X_n - X)^2 \to 0$

3. With probability 1 (or almost surely) if $\Pr\{\lim_{n \to \infty} X_n = X\} = 1$
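
A small simulation (fair die rolls, chosen arbitrarily) showing the running sample mean approaching EX = 3.5 as n grows:

```python
import random

random.seed(0)
# X uniform on {1,...,6}, so EX = 3.5; the sample mean should approach 3.5
n = 100000
samples = [random.randint(1, 6) for _ in range(n)]

for m in (10, 100, 1000, 10000, 100000):
    print(m, sum(samples[:m]) / m)   # sample mean gets close to 3.5 for large m
```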


Asymptotic equipartition property


If the random variables $X_1, X_2, \ldots$ are i.i.d. with pmf p(x), then

$$-\frac{1}{n} \log p(X_1, X_2, \ldots, X_n) \to H(X) \quad \text{in probability}$$

Proof: Functions of independent random variables are also independent random variables. Since the $X_i$ are i.i.d., so are $\log p(X_i)$. Thus, by the law of large numbers,

$$-\frac{1}{n} \log p(X_1, X_2, \ldots, X_n) = -\frac{1}{n} \sum_i \log p(X_i) \to -E \log p(X) = H(X) \quad \text{in probability}$$
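
A simulation sketch (Bernoulli source with P(X = 1) = 0.3, chosen arbitrarily) showing -1/n log p(X_1, ..., X_n) concentrating around H(X):

```python
import math
import random

random.seed(0)
q = 0.3                                   # P(X = 1); H(X) = -q log q - (1-q) log(1-q)
H = -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def normalized_log_prob(n):
    """Draw X_1,...,X_n i.i.d. Bernoulli(q) and return -1/n log2 p(X_1,...,X_n)."""
    xs = [1 if random.random() < q else 0 for _ in range(n)]
    logp = sum(math.log2(q) if x == 1 else math.log2(1 - q) for x in xs)
    return -logp / n

print("H(X) =", H)
for n in (10, 100, 1000, 100000):
    print(n, normalized_log_prob(n))      # approaches H(X) as n grows
```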


Typical sets
The typical set $A_\varepsilon^{(n)}$ with respect to p(x) is the set of sequences $(x_1, x_2, \ldots, x_n)$ with the property

$$2^{-n(H(X)+\varepsilon)} \le p(x_1, x_2, \ldots, x_n) \le 2^{-n(H(X)-\varepsilon)}$$

The set $A_\varepsilon^{(n)}$ has the following properties:

1. If $(x_1, x_2, \ldots, x_n) \in A_\varepsilon^{(n)}$ then $H(X) - \varepsilon \le -\frac{1}{n} \log p(x_1, x_2, \ldots, x_n) \le H(X) + \varepsilon$

2. $\Pr\{A_\varepsilon^{(n)}\} > 1 - \varepsilon$ for sufficiently large n

3. $|A_\varepsilon^{(n)}| \le 2^{n(H(X)+\varepsilon)}$

4. $|A_\varepsilon^{(n)}| \ge (1 - \varepsilon) 2^{n(H(X)-\varepsilon)}$ for sufficiently large n

The typical set has probability near 1, all elements of the typical set are nearly equiprobable, and the number of elements in the set is nearly $2^{nH}$, as illustrated by the sketch below.
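
A brute-force sketch (Bernoulli source with P(X = 1) = 0.3, n = 12, ε = 0.1, all values chosen arbitrarily) that enumerates every length-n sequence, collects the typical set and reports its probability and size against the bounds above:

```python
import math
from itertools import product

q, n, eps = 0.3, 12, 0.1                     # source P(X=1)=q, block length, epsilon
H = -q * math.log2(q) - (1 - q) * math.log2(1 - q)

typical_prob, typical_size = 0.0, 0
for xs in product([0, 1], repeat=n):         # enumerate all 2^n sequences
    k = sum(xs)
    p = q**k * (1 - q)**(n - k)              # p(x_1,...,x_n)
    if 2**(-n * (H + eps)) <= p <= 2**(-n * (H - eps)):
        typical_prob += p
        typical_size += 1

print(typical_prob)                            # Pr of the typical set; tends to 1 as n grows (still modest for n = 12)
print(typical_size, "<=", 2**(n * (H + eps)))  # property 3: |A| <= 2^{n(H+eps)}
```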

Typical sets, cont.


Property 1 follows directly from the definition of the typical set. Property 2 follows from the AEP proof. Property 3 follows from

$$1 = \sum_{\mathbf{x} \in \mathcal{X}^n} p(\mathbf{x}) \ge \sum_{\mathbf{x} \in A_\varepsilon^{(n)}} p(\mathbf{x}) \ge \sum_{\mathbf{x} \in A_\varepsilon^{(n)}} 2^{-n(H(X)+\varepsilon)} = 2^{-n(H(X)+\varepsilon)} |A_\varepsilon^{(n)}|$$

and thus $|A_\varepsilon^{(n)}| \le 2^{n(H(X)+\varepsilon)}$.


Typical sets, cont.


Finally, for sufficiently large n we have $\Pr\{A_\varepsilon^{(n)}\} > 1 - \varepsilon$. Thus property 4 follows from

$$1 - \varepsilon < \Pr\{A_\varepsilon^{(n)}\} \le \sum_{\mathbf{x} \in A_\varepsilon^{(n)}} 2^{-n(H(X)-\varepsilon)} = 2^{-n(H(X)-\varepsilon)} |A_\varepsilon^{(n)}|$$

and thus $|A_\varepsilon^{(n)}| \ge (1 - \varepsilon) 2^{n(H(X)-\varepsilon)}$.
