
The Ergodic Theory of Discrete Sample Paths

Paul C. Shields

Graduate Studies in Mathematics


Volume 13

American Mathematical Society

Editorial Board: James E. Humphreys, David Sattinger, Julius L. Shaneson, Lance W. Small (chair)
1991 Mathematics Subject Classification. Primary 28D20, 28D05, 94A17; Secondary 60F05, 60G17, 94A24.
ABSTRACT. This book is about finite-alphabet stationary processes, which are important in physics, engineering, and data compression. The book is designed for use in graduate courses, seminars or self study for students or faculty with some background in measure theory and probability theory.

Library of Congress Cataloging-in-Publication Data

Shields, Paul C. The ergodic theory of discrete sample paths / Paul C. Shields. p. cm. (Graduate studies in mathematics, ISSN 1065-7339; v. 13) Includes bibliographical references and index. ISBN 0-8218-0477-4 (alk. paper) 1. Ergodic theory. 2. Measure-preserving transformations. 3. Stochastic processes. I. Title. II. Series. QA313.555 1996 96-20186 519.2'32dc20 CIP

Copying and reprinting. Individual readers of this publication, and nonprofit libraries acting for them, are permitted to make fair use of the material, such as to copy a chapter for use in teaching or research. Permission is granted to quote brief passages from this publication in reviews, provided the customary acknowledgment of the source is given. Republication, systematic copying, or multiple reproduction of any material in this publication (including abstracts) is permitted only under license from the American Mathematical Society. Requests for such permission should be addressed to the Assistant to the Publisher, American Mathematical Society, P.O. Box 6248, Providence, Rhode Island 02940-6248. Requests can also be made by e-mail to reprint-permission@ams.org.

© Copyright 1996 by the American Mathematical Society. All rights reserved. The American Mathematical Society retains all rights except those granted to the United States Government. Printed in the United States of America. The paper used in this book is acid-free and falls within the guidelines established to ensure permanence and durability. Printed on recycled paper. 10 9 8 7 6 5 4 3 2 1    01 00 99 98 97 96

Contents
Preface

I Basic concepts
  I.1 Stationary processes
  I.2 The ergodic theory model
  I.3 The ergodic theorem
  I.4 Frequencies of finite blocks
  I.5 The entropy theorem
  I.6 Entropy as expected value
  I.7 Interpretations of entropy
  I.8 Stationary coding
  I.9 Process topologies
  I.10 Cutting and stacking

II Entropy-related properties
  II.1 Entropy and coding
  II.2 The Lempel-Ziv algorithm
  II.3 Empirical entropy
  II.4 Partitions of sample paths
  II.5 Entropy and recurrence times

III Entropy for restricted classes
  III.1 Rates of convergence
  III.2 Entropy and joint distributions
  III.3 The d-admissibility problem
  III.4 Blowing-up properties
  III.5 The waiting-time problem

IV B-processes
  IV.1 Almost block-independence
  IV.2 The finitely determined property
  IV.3 Other B-process characterizations
Bibliography

Index


Preface
This book is about finite-alphabet stationary processes, which are important in physics, engineering, and data compression. The book is designed for use in graduate courses, seminars, or self-study for students or faculty with some background in measure theory and probability theory. The focus is on the combinatorial properties of typical finite sample paths drawn from a stationary, ergodic process. A primary goal, only partially realized, is to develop a theory based directly on sample path arguments, with minimal appeals to the probability formalism. A secondary goal is to give a careful presentation of the many models for stationary finite-alphabet processes that have been developed in probability theory, ergodic theory, and information theory.

The two basic tools for a sample path theory are a packing lemma, which shows how "almost" packings of integer intervals can be extracted from coverings by overlapping subintervals, and a counting lemma, which bounds the number of n-sequences that can be partitioned into long blocks subject to the condition that most of them are drawn from collections of known size. These two simple ideas, introduced by Ornstein and Weiss in 1980, immediately yield the two fundamental theorems of ergodic theory, namely, the ergodic theorem of Birkhoff and the entropy theorem of Shannon, McMillan, and Breiman. The packing and counting ideas yield more than these two classical results, however, for in combination with the ergodic and entropy theorems and further simple combinatorial ideas they provide powerful tools for the study of sample paths. Much of Chapter I and all of Chapter II are devoted to the development of these ideas.

The classical process models are based on independence ideas and include the i.i.d. processes, Markov chains, instantaneous functions of Markov chains, and renewal and regenerative processes. An important and simple class of such models is the class of concatenated-block processes, that is, the processes obtained by independently concatenating fixed-length blocks according to some block distribution and randomizing the start. Related models are obtained by block coding and randomizing the start, or by stationary coding, an extension of the instantaneous function concept which allows the function to depend on both past and future. All these models and more are introduced in the first two sections of Chapter I. Further models, including the weak Bernoulli processes and the important class of stationary codings of i.i.d. processes, are discussed in Chapter III and Chapter IV.

Of particular note in the discussion of process models is how ergodic theorists think of a stationary process, namely, as a measure-preserving transformation on a probability space, together with a partition of the space. This point of view, introduced in Section 1.2, leads directly to Kakutani's simple geometric representation of a process in terms of a recurrent event, a representation that not only simplifies the discussion of stationary renewal and regenerative processes but generalizes these concepts to the case where times between recurrences are not assumed to be independent, but only stationary. A further generalization, given in Section 1.10, leads to a powerful method for constructing examples known as cutting and stacking.

The book has four chapters. The first chapter, which is half the book, is devoted to the basic tools, including the Kolmogorov and ergodic theory models for a process, the ergodic theorem and its connection with empirical distributions, the entropy theorem and its interpretations, a method for converting block codes to stationary codes, the weak topology and the even more important d-metric topology, and the cutting and stacking method. Properties related to entropy which hold for every ergodic process are discussed in Chapter II. These include entropy as the almost-sure bound on per-symbol compression, Ziv's proof of asymptotic optimality of the Lempel-Ziv algorithm via his interesting concept of individual sequence entropy, the relation between entropy and partitions of sample paths into fixed-length blocks, or partitions into distinct blocks, or partitions into repeated blocks, and the connection between entropy and recurrence times and entropy and the growth of prefix trees. Properties related to entropy which hold only for restricted classes of processes are discussed in Chapter III, including rates of convergence for frequencies and entropy, the estimation of joint distributions in both the variational metric and the d-metric, a connection between entropy and a-neighborhoods, and a connection between entropy and waiting times. Several characterizations of the class of stationary codings of i.i.d. processes are given in Chapter IV, including the almost block-independence, finitely determined, very weak Bernoulli, and blowing-up characterizations. Some of these date back to the original work of Ornstein and others on the isomorphism problem for Bernoulli shifts, although the almost block-independence and blowing-up ideas are more recent.

This book is an outgrowth of the lectures I gave each fall in Budapest from 1989 through 1994, both as special lectures and seminars at the Mathematics Institute of the Hungarian Academy of Sciences and as courses given in the Probability Department of Eötvös Loránd University. In addition, lectures on parts of the penultimate draft of the book were presented at the Technical University of Delft in the fall of 1995. The audiences included ergodic theorists, information theorists, and probabilists, as well as combinatorialists and people from engineering and other mathematics disciplines, ranging from undergraduate and graduate students through post-docs and junior faculty to senior professors and researchers.

Many standard topics from ergodic theory are omitted or given only cursory treatment, in part because the book is already too long and in part because they are not close to the central focus of this book. These topics include topological dynamics, smooth dynamics, random fields, K-processes, combinatorial number theory, general ergodic theorems, and continuous time and/or space theory. Likewise little or nothing is said about such standard information theory topics as rate distortion theory, divergence-rate theory, channel theory, redundancy, algebraic coding, and multi-user theory.

Some specific stylistic guidelines were followed in writing this book. Proofs are sketched first, then given in complete detail. With a few exceptions, the sections in Chapters II, III, and IV are approximately independent of each other, conditioned on the material in Chapter I.
Theorems and lemmas are given names that include some information about content, for example, the entropy theorem rather than the Shannon-McMillan-Breiman theorem. Likewise, suggestive names are used for concepts, such as building blocks (a name proposed by Zoli Györfi) and column structures (as opposed to gadgets). Also numbered displays are often (informally) given names similar to those used for LaTeX labels. Only those references that seem directly related to the topics discussed are included. Exercises that extend the ideas are given at the end of most sections; these range in difficulty from quite easy to quite hard.

I am indebted to many people for assistance with this project. Imre Csiszár and Katalin Marton not only attended most of my lectures but critically read parts of the manuscript at all stages of its development and discussed many aspects of the book with me. Much of the material in Chapter III as well as the blowing-up ideas in Chapter IV are the result of joint work with Kati. It was Imre's suggestion that led me to include the discussions of renewal and regenerative processes, and his criticisms led to many revisions of the cutting and stacking discussion. In addition I had numerous helpful conversations with Benjy Weiss, Don Ornstein, Aaron Wyner, and Jacob Ziv. Others who contributed ideas and/or read parts of the manuscript include Gábor Tusnády, Bob Burton, Jacek Serafin, Dave Neuhoff, György Michaletzky, Gusztáv Morvai, and Nancy Morrison. Last, but far from least, I am much indebted to my two Toledo graduate students, Shaogang Xu and Xuehong Li, who learned ergodic theory by carefully reading almost all of the manuscript at each stage of its development, in the process discovering numerous errors and poor ways of saying things.

No project such as this can be free from errors and incompleteness. A list of errata as well as a forum for discussion will be available on the Internet at the following web address.

http://www.math.utoledo.edu/~pshields/ergodic.html

I am grateful to the Mathematics Institute of the Hungarian Academy for providing me with many years of space in a comfortable and stimulating environment, as well as to the Institute and to the Probability Department of Eötvös Loránd University for the many lecture opportunities. My initial lectures in 1989 were supported by a Fulbright lectureship. Much of the work for this project was supported by NSF grants DMS-8742630 and DMS-9024240 and by a joint NSF-Hungarian Academy grant, MTA-NSF project 37.

This book is dedicated to my son, Jeffrey.

Paul C. Shields
Toledo, Ohio
March 22, 1996

Chapter I Basic concepts.


Section 1.1
Stationary processes.

A (discrete-time, stochastic) process is a sequence $X_1, X_2, \ldots, X_n, \ldots$ of random variables defined on a probability space $(X, \Sigma, \mu)$. The process has alphabet $A$ if the range of each $X_i$ is contained in $A$. In this book the focus is on finite-alphabet processes, so, unless stated otherwise, "process" means a discrete-time finite-alphabet process. Also, unless it is clear from the context or explicitly stated otherwise, "measure" will mean "probability measure" and "function" will mean "measurable function" with respect to some appropriate $\sigma$-algebra on a probability space. The cardinality of a finite set $A$ is denoted by $|A|$.

The sequence $a_m, a_{m+1}, \ldots, a_n$, where each $a_i \in A$, is denoted by $a_m^n$. The set of all such $a_m^n$ is denoted by $A_m^n$, except when $m = 1$, when $A^n$ is used. The $k$-th order joint distribution of the process $\{X_k\}$ is the measure $\mu_k$ on $A^k$ defined by the formula
$$\mu_k(a_1^k) = \mathrm{Prob}(X_1^k = a_1^k), \quad a_1^k \in A^k.$$
When no confusion will result the subscript $k$ on $\mu_k$ may be omitted. The set of joint distributions $\{\mu_k\colon k \geq 1\}$ is called the distribution of the process. The distribution of a process can, of course, also be defined by specifying the start distribution, $\mu_1$, and the successive conditional distributions
$$\mu_k(a_k \mid a_1^{k-1}) = \mathrm{Prob}(X_k = a_k \mid X_1^{k-1} = a_1^{k-1}) = \frac{\mu_k(a_1^k)}{\mu_{k-1}(a_1^{k-1})}.$$

The distribution of a process is thus a family of probability distributions, one for each $k$. The sequence cannot be completely arbitrary, however, for implicit in the definition of process is that the following consistency condition must hold for each $k \geq 1$:
$$\mu_k(a_1^k) = \sum_{a_{k+1}} \mu_{k+1}(a_1^{k+1}), \quad a_1^k \in A^k. \tag{1}$$
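The consistency condition is easy to check numerically for small alphabets. The following is a minimal sketch, not from the book: it stores joint distributions as dictionaries keyed by tuples and verifies condition (1) by summing out the last coordinate; the function names and data layout are assumptions made for the illustration.

```python
from itertools import product

def marginalize_last(mu_next):
    """Sum out the last coordinate of a (k+1)-th order distribution,
    giving the k-th order distribution it induces."""
    mu = {}
    for word, p in mu_next.items():
        mu[word[:-1]] = mu.get(word[:-1], 0.0) + p
    return mu

def is_consistent(mu_k, mu_next, tol=1e-12):
    """Check condition (1): mu_k(a_1^k) equals the sum of mu_{k+1}(a_1^{k+1})
    over the last letter a_{k+1}."""
    induced = marginalize_last(mu_next)
    keys = set(mu_k) | set(induced)
    return all(abs(mu_k.get(w, 0.0) - induced.get(w, 0.0)) < tol for w in keys)

# Example: first- and second-order distributions of a fair-coin process.
A = (0, 1)
mu1 = {(a,): 0.5 for a in A}
mu2 = {w: 0.25 for w in product(A, repeat=2)}
print(is_consistent(mu1, mu2))  # True
```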

A process is considered to be defined by its joint distributions, that is, the particular space on which the functions $X_n$ are defined is not important; all that really matters in probability theory is the distribution of the process. Thus one is free to choose the underlying space $(X, \Sigma, \mu)$ on which the $X_n$ are defined in any convenient manner, as long as the joint distributions are left unchanged. An important instance of this idea is the Kolmogorov model for a process, which represents it as the sequence of coordinate functions on the space of infinite sequences drawn from the alphabet $A$, equipped with a Borel measure constructed from the consistency conditions (1).

The rigorous construction of the Kolmogorov representation is carried out as follows. Let $A^\infty$ denote the set of all infinite sequences
$$x = \{x_i\}, \quad x_i \in A,\ 1 \leq i < \infty.$$

The cylinder set determined by $a_m^n$, denoted by $[a_m^n]$, is the subset of $A^\infty$ defined by
$$[a_m^n] = \{x\colon x_i = a_i,\ m \leq i \leq n\}.$$


Let $\mathcal{C}_n$ be the collection of cylinder sets defined by sequences that belong to $A^n$, let $\mathcal{C} = \cup_n \mathcal{C}_n$, and let $\mathcal{R}_n = \mathcal{R}(\mathcal{C}_n)$ denote the ring generated by $\mathcal{C}_n$. The sequence $\{\mathcal{R}_n\}$ is increasing, and its union $\mathcal{R} = \mathcal{R}(\mathcal{C})$ is the ring generated by all the cylinder sets. Two important properties are summarized in the following lemma.
Lemma 1.1.1

(a) Each set in $\mathcal{R}_n$ is a finite disjoint union of cylinder sets from $\mathcal{C}_n$.

(b) If $\{B_n\} \subset \mathcal{R}$ is a decreasing sequence of sets with empty intersection, then there is an $N$ such that $B_n = \emptyset$, $n \geq N$.

Proof. The proof of (a) is left as an exercise. Part (b) is an application of the finite intersection property for sequences of compact sets, since the space $A^\infty$ is compact in the product topology and each set in $\mathcal{R}$ is closed, from (a). Thus the lemma is established.
Let $\Sigma$ be the $\sigma$-algebra generated by the cylinder sets $\mathcal{C}$. The collection $\Sigma$ can also be defined as the $\sigma$-algebra generated by $\mathcal{R}$, or by the compact sets. The members of $\Sigma$ are commonly called the Borel sets of $A^\infty$. For each $n \geq 1$ the coordinate function $\hat{X}_n\colon A^\infty \to A$ is defined by $\hat{X}_n(x) = x_n$, $x \in A^\infty$. The Kolmogorov representation theorem states that every process with alphabet $A$ can be thought of as the coordinate function process $\{\hat{X}_n\}$, together with a Borel measure on $A^\infty$.
Theorem 1.1.2 (Kolmogorov representation theorem.)
If $\{\mu_k\}$ is a sequence of measures for which the consistency conditions (1) hold, then there is a unique Borel probability measure $\mu$ on $A^\infty$ such that $\mu([a_1^k]) = \mu_k(a_1^k)$, for each $k$ and each $a_1^k$. In other words, if $\{X_n\}$ is a process with finite alphabet $A$, there is a unique Borel measure $\mu$ on $A^\infty$ for which the sequence of coordinate functions $\{\hat{X}_n\}$ has the same distribution as $\{X_n\}$.

Proof. The process $\{X_n\}$ defines a set function $\mu$ on the collection $\mathcal{C}$ of cylinder sets by the formula
$$\mu([a_1^n]) = \mathrm{Prob}(X_i = a_i,\ 1 \leq i \leq n). \tag{2}$$
The consistency conditions (1), together with Lemma 1.1.1(a), imply that $\mu$ extends to a finitely additive set function on the ring $\mathcal{R} = \mathcal{R}(\mathcal{C})$ generated by the cylinder sets. The finite intersection property, Lemma 1.1.1(b), implies that $\mu$ can be extended to a unique countably additive measure on the $\sigma$-algebra $\Sigma$ generated by $\mathcal{R}$. Equation (2) translates into the statement that $\{\hat{X}_n\}$ and $\{X_n\}$ have the same joint distribution. This proves Theorem 1.1.2.

The sequence of coordinate functions $\{\hat{X}_n\}$ on the probability space $(A^\infty, \Sigma, \mu)$ will be called the Kolmogorov representation of the process $\{X_n\}$. The measure $\mu$ will be called the Kolmogorov measure of the process, or the Kolmogorov measure of the sequence $\{\mu_k\}$. Process and measure language will often be used interchangeably; for example, "let $\mu$ be a process" means "let $\mu$ be the Kolmogorov measure for some process $\{X_n\}$."

As noted earlier, a process is considered to be defined by its joint distributions, that is, the particular space on which the functions $X_n$ are defined is not important. The Kolmogorov model is simply one particular way to define a space and a sequence of functions with the given distributions. Another useful model is the complete measure model, that is, one in which subsets of sets of measure 0 are measurable. The Kolmogorov measure on $A^\infty$ extends to a complete measure on the completion $\hat{\Sigma}$ of the Borel sets $\Sigma$ relative to $\mu$. Furthermore, completion has no effect on joint distributions, for $\mu([a_1^k]) = \mu_k(a_1^k)$ for all $k \geq 1$ and all $a_1^k$, and uniqueness is preserved, that is, different processes have Kolmogorov measures with different completions. Many ideas are easier to express and many results are easier to establish in the framework of complete measures; thus, whenever it is convenient to do so the complete Kolmogorov model will be used in this book, though often without explicitly saying so. In particular, "Kolmogorov measure" is taken to refer to either the measure defined by Theorem 1.1.2, or to its completion, whichever is appropriate to the context.

A process is stationary if the joint distributions do not depend on the choice of time origin, that is,

$$\mathrm{Prob}(X_j = a_j,\ m \leq j \leq n) = \mathrm{Prob}(X_{j+1} = a_j,\ m \leq j \leq n), \tag{3}$$

for all $m$, $n$, and $a_m^n$. Stationary finite-alphabet processes serve as models in many settings of interest, including physics, data transmission and storage, and statistics. The phrase "stationary process" will mean a stationary, discrete-time, finite-alphabet process, unless stated otherwise.

The statement that a process $\{X_n\}$ with finite alphabet $A$ is stationary translates into the statement that the Kolmogorov measure $\mu$ is invariant under the shift transformation $T$. The (left) shift $T\colon A^\infty \to A^\infty$ is defined by

$$(Tx)_n = x_{n+1}, \quad x \in A^\infty,\ n \geq 1,$$
which is expressed in symbolic form by
$$x = (x_1, x_2, \ldots) \implies Tx = (x_2, x_3, \ldots).$$


The shift transformation is continuous relative to the product topology on $A^\infty$, and defines a set transformation $T^{-1}$ by the formula
$$T^{-1}B = \{x\colon Tx \in B\}.$$
Note that
$$T^{-1}[a_m^n] = [b_{m+1}^{n+1}], \quad b_{i+1} = a_i,\ m \leq i \leq n,$$
so that the set transformation $T^{-1}$ maps a cylinder set onto a cylinder set. It follows that $T^{-1}B$ is a Borel set for each Borel set $B$, which means that $T$ is a Borel measurable mapping. The condition that the process be stationary is just the condition that $\mu([a_m^n]) = \mu([b_{m+1}^{n+1}])$, that is, $\mu(B) = \mu(T^{-1}B)$ for each cylinder set $B$, which is, in turn, equivalent to the condition that $\mu(B) = \mu(T^{-1}B)$ for each Borel set $B$. The condition

$$\mu(B) = \mu(T^{-1}B), \quad B \in \Sigma,$$


is usually summarized by saying that $T$ preserves the measure $\mu$ or, alternatively, that $\mu$ is $T$-invariant. In summary, the shift $T$ is Borel measurable. It preserves a Borel probability measure $\mu$ if and only if the sequence of coordinate functions $\{\hat{X}_n\}$ is a stationary process on the probability space $(A^\infty, \Sigma, \mu)$.

The intuitive idea of stationarity is that the mechanism generating the random variables doesn't change in time. It is often convenient to think of a process as having run for a long time before the start of observation, so that only nontransient effects, that is, those that do not depend on the time origin, are seen. Such a model is well interpreted by another model, called the doubly-infinite sequence model or two-sided model. Let $Z$ denote the set of integers and let $A^Z$ denote the set of doubly-infinite $A$-valued sequences, that is, the set of all sequences $x = \{x_n\}$, where each $x_n \in A$ and $n \in Z$ ranges from $-\infty$ to $\infty$. The shift $T\colon A^Z \to A^Z$ is defined by $(Tx)_n = x_{n+1}$, $n \in Z$. The concept of cylinder set extends immediately to this new setting. Furthermore, if $\{\mu_k\}$ is a set of measures for which the consistency conditions (1) and the stationarity conditions (3) hold, then there is a unique $T$-invariant Borel measure $\mu$ on $A^Z$ such that $\mu([a_1^k]) = \mu_k(a_1^k)$, for each $k \geq 1$ and each $a_1^k$. The projection of $\mu$ onto $A^\infty$ is, of course, the Kolmogorov measure on $A^\infty$ for the process defined by $\{\mu_k\}$.

In summary, for stationary processes there are two standard representations, one as the Kolmogorov measure $\mu$ on $A^\infty$, the other as the shift-invariant extension to the space $A^Z$. The latter model will be called the two-sided Kolmogorov model of a stationary process. Note that the shift $T$ for the two-sided model is invertible, that is, $T$ is one-to-one and onto and $T^{-1}$ is measure preserving. (The two-sided shift is, in fact, a homeomorphism on the compact space $A^Z$.) From a physical point of view it does not matter in the stationary case whether the one-sided or two-sided model is used. Proofs are sometimes simpler if the two-sided model is used, for invertibility of the shift on $A^Z$ makes things easier; such results will then apply to the one-sided model in the stationary case. In some cases where such a simplification is possible, only the proof for the invertible case will be given. In the stationary case, "Kolmogorov measure" is taken to refer to either the one-sided measure or its completion, or to the two-sided measure or its completion, as appropriate to the context.

Remark 1.1.3
Much of the success of modern probability theory is based on use of the Kolmogorov model, for asymptotic properties can often be easily formulated in terms of subsets of the infinite product space $A^\infty$. In addition, the model provides a concrete uniform setting for all processes; different processes with the same alphabet $A$ are distinguished by having different Kolmogorov measures on $A^\infty$. When completeness is needed the complete model can be used, and, in the stationary case, when invertibility is useful the two-sided model on $A^Z$ can be used. As will be shown in later sections, even more can be gained in the stationary case by retaining the abstract concept of a stationary process as a sequence of random functions with a specified joint distribution on an arbitrary probability space.

Remark 1.1.4
While this book is primarily concerned with finite-alphabet processes, it is sometimes desirable to consider countable-alphabet processes, or even continuous-alphabet processes. For example, return-time processes associated with finite-alphabet processes are, except in trivial cases, countable-alphabet processes, and functions of uniformly distributed i.i.d. processes will be useful in Chapter IV. The Kolmogorov theorem extends easily to the countable case, as well as to the continuous-alphabet i.i.d. cases that will be considered.

I.1.a

Examples.

Some standard examples of finite-alphabet processes will now be presented.

Example 1.1.5 (Independent processes.)
The simplest example of a stationary finite-alphabet process is an independent, identically distributed (i.i.d.) process. A sequence of random variables $\{X_n\}$ is independent if
$$\mathrm{Prob}(X_n = a_n \mid X_1^{n-1} = a_1^{n-1}) = \mathrm{Prob}(X_n = a_n)$$
holds for all $n > 1$ and all $a_1^n \in A^n$. It is identically distributed if $\mathrm{Prob}(X_n = a) = \mathrm{Prob}(X_1 = a)$ holds for all $n \geq 1$ and all $a \in A$. An independent process is stationary if and only if it is identically distributed. The i.i.d. condition holds if and only if the product formula
$$\mu_k(a_1^k) = \prod_{i=1}^{k} \mu_1(a_i)$$
holds. Such a measure is called a product measure defined by $\mu_1$. Thus a process is i.i.d. if and only if its Kolmogorov measure is the product measure defined by some distribution on $A$.
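As a quick illustration, here is a minimal sketch, not from the book, of the product formula: it computes $\mu_k(a_1^k)$ for an i.i.d. process from a first-order distribution $\mu_1$; the dictionary representation is an assumption made for the example.

```python
from functools import reduce

def product_measure(mu1, word):
    """mu_k(a_1^k) for an i.i.d. process: the product of mu_1(a_i) over the word."""
    return reduce(lambda p, a: p * mu1[a], word, 1.0)

# A biased coin: mu_1(0) = 0.7, mu_1(1) = 0.3.
mu1 = {0: 0.7, 1: 0.3}
print(product_measure(mu1, (0, 1, 1)))  # 0.7 * 0.3 * 0.3 = 0.063
```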

Example 1.1.6 (Markov chains.)
The simplest examples of finite-alphabet dependent processes are the Markov processes. A sequence of random variables, $\{X_n\}$, is a Markov chain if
$$\mathrm{Prob}(X_n = a_n \mid X_1^{n-1} = a_1^{n-1}) = \mathrm{Prob}(X_n = a_n \mid X_{n-1} = a_{n-1}) \tag{4}$$
holds for all $n \geq 2$ and all $a_1^n \in A^n$. A Markov chain is said to be homogeneous, or to have stationary transitions, if $\mathrm{Prob}(X_n = b \mid X_{n-1} = a)$ does not depend on $n$, in which case the $|A| \times |A|$ matrix $M$ defined by
$$M_{ab} = M_{b|a} = \mathrm{Prob}(X_n = b \mid X_{n-1} = a)$$
is called the transition matrix of the chain. The start distribution is the (row) vector $\mu_1 = \{\mu_1(a)\}$, where $\mu_1(a) = \mathrm{Prob}(X_1 = a)$, $a \in A$. The start distribution is called a stationary distribution of a homogeneous chain if the matrix product $\mu_1 M$ is equal to the row vector $\mu_1$. If this holds, then a direct calculation shows that, indeed, the process is stationary. The condition that a Markov chain be a stationary process is that it have stationary transitions and that its start distribution is a stationary distribution. Unless stated otherwise, "stationary Markov process" will mean a Markov chain with stationary transitions for which $\mu_1 M = \mu_1$ and for which $\mu_1(a) > 0$ for all $a \in A$. The positivity condition rules out transient states and is included so as to avoid trivialities.
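The stationarity condition $\mu_1 M = \mu_1$ is easy to exhibit numerically. The following minimal sketch, not from the book, finds a stationary row vector $\pi$ as a left eigenvector of a two-state transition matrix and checks invariance; the use of numpy and the particular matrix are assumptions of the example.

```python
import numpy as np

M = np.array([[0.9, 0.1],
              [0.4, 0.6]])          # transition matrix of a two-state chain

# Solve pi M = pi with sum(pi) = 1: pi is a left eigenvector of M for eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(M.T)
pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
pi = pi / pi.sum()
print(pi)                            # approximately [0.8, 0.2]
print(np.allclose(pi @ M, pi))       # True: the chain started from pi is stationary

# Stationarity of the process: Prob(X_1 = a, X_2 = b) = pi[a] * M[a, b],
# and the marginal distribution of X_2 is again pi, since pi M = pi.
mu2 = pi[:, None] * M
print(np.allclose(mu2.sum(axis=0), pi))  # True
```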

Example 1.1.7 (Multistep Markov chains.)
A generalization of the Markov property allows dependence on $k$ steps in the past. A process is called $k$-step Markov if
$$\mathrm{Prob}(X_n = a_n \mid X_1^{n-1} = a_1^{n-1}) = \mathrm{Prob}(X_n = a_n \mid X_{n-k}^{n-1} = a_{n-k}^{n-1}) \tag{5}$$
holds for all $n > k$ and for all $a_1^n$. The two conditions for stationarity are, first, that transitions be stationary, that is,
$$\mu(a_n \mid a_{n-k}^{n-1}) = \mathrm{Prob}(X_n = a_n \mid X_{n-k}^{n-1} = a_{n-k}^{n-1})$$
does not depend on $n$ as long as $n > k$, and, second, that $\mu_k M^{(k)} = \mu_k$, where $\mu_k(a_1^k) = \mathrm{Prob}(X_1^k = a_1^k)$, $a_1^k \in A^k$, and $M^{(k)}$ is the $|B| \times |B|$ matrix defined by
$$M^{(k)}_{a_1^k, b_1^k} = \begin{cases} \mathrm{Prob}(X_{k+1} = b_k \mid X_1^k = a_1^k), & \text{if } b_1^{k-1} = a_2^k, \\ 0, & \text{otherwise}, \end{cases}$$
and $B = \{a_1^k\colon \mu_k(a_1^k) > 0\}$. Note that probabilities for a $k$-th order stationary chain can be calculated by thinking of it as a first-order stationary chain with alphabet $B$.
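The device of treating the last $k$ symbols as the state of a first-order chain translates directly into a simulator. Here is a minimal sketch, not from the book; the toy conditional probability table and function names are assumptions of the example.

```python
import random

def step_kth_order(history, cond_prob, rng=random.Random(0)):
    """One step of a k-step Markov chain: the next symbol depends only on the
    last k symbols, so the k-tuple 'history' acts as the state of a
    first-order chain with alphabet B."""
    dist = cond_prob[history]                 # {symbol: probability}
    symbols, probs = zip(*dist.items())
    nxt = rng.choices(symbols, probs)[0]
    return history[1:] + (nxt,), nxt          # new state in B, emitted symbol

# A 2-step binary chain: the next bit is 1 with probability 0.9 after (1, 1),
# and with probability 0.2 otherwise (an assumed toy table).
cond_prob = {h: ({1: 0.9, 0: 0.1} if h == (1, 1) else {1: 0.2, 0: 0.8})
             for h in [(0, 0), (0, 1), (1, 0), (1, 1)]}

state, path = (0, 0), []
for _ in range(10):
    state, symbol = step_kth_order(state, cond_prob)
    path.append(symbol)
print(path)
```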
Example 1.1.8 (Finite-state processes.)
Let $A$ and $S$ be finite sets. An $A$-valued process $\{Y_n\}$ is said to be a finite-state process with state space $S$ if
$$\mathrm{Prob}(s_n = s, Y_n = b \mid s_1^{n-1}, Y_1^{n-1}) = \mathrm{Prob}(s_n = s, Y_n = b \mid s_{n-1}), \tag{6}$$
for all $n > 1$. In other words, $\{Y_n\}$ is a finite-state process if there is a finite-alphabet process $\{s_n\}$ such that the pair process $\{X_n = (s_n, Y_n)\}$ is a Markov chain. Note that the finite-state process $\{Y_n\}$ is stationary if the chain $\{X_n\}$ is stationary, but $\{Y_n\}$ is not, in general, a Markov chain, see Exercise 2. The Markov chain $\{X_n\}$ is often referred to as the underlying Markov process or hidden Markov process associated with $\{Y_n\}$.

In ergodic theory, any finite-alphabet process is called a finite-state process, and what is here called a finite-state process is called a function (or lumping) of a Markov chain. In information theory, what is here called a finite-state process is sometimes called a Markov source, following the terminology used in [15]. Here "Markov" will be reserved for processes that satisfy the Markov property (4) or its more general form (5), and "finite-state process" will be reserved for processes satisfying property (6).


Example 1.1.9 (Stationary coding.)
The concept of finite-state process can be generalized to allow dependence on the past and future, an idea most easily expressed in terms of the two-sided representation of a stationary process. Suppose $A$ and $B$ are finite sets. A Borel measurable mapping $F\colon A^Z \to B^Z$ is a stationary coder if
$$F(T_A x) = T_B F(x), \quad \forall x \in A^Z,$$
where $T_A$ and $T_B$ denote the shifts on the respective two-sided sequence spaces $A^Z$ and $B^Z$. The coder $F$ transports a Borel measure $\mu$ on $A^Z$ into the measure $\nu = \mu \circ F^{-1}$, defined for Borel subsets $C$ of $B^Z$ by
$$\nu(C) = \mu(F^{-1}(C)).$$
The encoded process $\nu$ is said to be a stationary coding of $\mu$ with coder $F$. Note that a stationary coding of a stationary process is automatically a stationary process.

Define the function $f\colon A^Z \to B$ by the formula $f(x) = F(x)_0$, that is, $f(x)$ is just the 0-th coordinate of $y = F(x)$. The function $f$ is called the time-zero coder associated with $F$. The stationary coder $F$ and its time-zero coder $f$ are connected by the formula
$$y_n = f(T_A^n x), \quad x \in A^Z,\ n \in Z,$$
that is, the image $y = F(x)$ is the sequence defined by $y_n = f(T_A^n x)$. Thus the stationary coding idea can be expressed in terms of the sequence-to-sequence coder $F$, sometimes called the full coder, or in terms of the sequence-to-symbol coder $f$, otherwise known as the per-symbol or time-zero coder. The word "coder" or "encoder" may refer to either $F$ or $f$.

A case of particular interest occurs when the time-zero coder $f$ depends on only a finite number of coordinates of the input sequence $x$, that is, when there is a nonnegative integer $w$ such that $f(x)$ depends only on $x_{-w}^{w}$. In this case it is common practice to write $f(x) = f(x_{-w}^{w})$ and say that $f$ (or its associated full coder $F$) is a finite coder with window half-width $w$. The encoded sequence $y = F(x)$ is defined by the formula
$$y_n = f(x_{n-w}^{n+w}), \quad n \in Z,$$
that is, the value $y_n$ depends only on $x_n$ and the values of its $w$ past and future neighbors. In the special case when $w = 0$ the encoding is said to be instantaneous, and the encoded process $\{Y_n\}$ defined by $Y_n = f(X_n)$ is called an instantaneous function or simply a function of the $\{X_n\}$ process.

Finite coding is sometimes called sliding-block or sliding-window coding, for one can think of the coding as being done by sliding a window of width $2w + 1$ along $x$. At time $n$ the window will contain
$$x_{n-w}, x_{n-w+1}, \ldots, x_n, \ldots, x_{n+w},$$
which determines the value $y_n$ through the function $f$. To determine $y_{n+1}$, the window is shifted to the right (that is, the contents of the window are shifted to the left), eliminating $x_{n-w}$ and bringing in $x_{n+w+1}$.


The key property of finite codes is that the inverse image of a cylinder set of length $n$ is a finite union of cylinder sets of length $n + 2w$, where $w$ is the window half-width of the code. Such a finite coder $f$ defines a mapping, also denoted by $f$, from $A_{m-w}^{n+w}$ to $B_m^n$, by the formula
$$f(x_{m-w}^{n+w}) = y_m^n, \quad \text{where } y_j = f(x_{j-w}^{j+w}),\ m \leq j \leq n.$$
If $F\colon A^Z \to B^Z$ is the full coder defined by $f$, then
$$F^{-1}([y_m^n]) = \bigcup_{\{x_{m-w}^{n+w}\colon\, f(x_{m-w}^{n+w}) = y_m^n\}} [x_{m-w}^{n+w}],$$
so that, in particular, $F^{-1}([y_m^n])$ is measurable with respect to the $\sigma$-algebra $\Sigma(X_{m-w}^{n+w})$ generated by the random variables $X_{m-w}, X_{m-w+1}, \ldots, X_{n+w}$. Furthermore, for any measure $\mu$, cylinder set probabilities for the encoded measure $\nu = \mu \circ F^{-1}$ are given by the formula
$$\nu([y_m^n]) = \mu(F^{-1}[y_m^n]) = \sum_{\{x_{m-w}^{n+w}\colon\, f(x_{m-w}^{n+w}) = y_m^n\}} \mu([x_{m-w}^{n+w}]). \tag{7}$$
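As an illustration of sliding-window coding, here is a minimal sketch, not from the book: it applies a time-zero coder with window half-width $w$ to a finite stretch of a sample path, so output symbols are produced only where the full window is available; the function names and the particular coder are assumptions of the example.

```python
def sliding_block_code(x, f, w):
    """Apply a time-zero coder f with window half-width w to the list x.
    Output symbol y[n] depends on x[n-w], ..., x[n+w], so the first and
    last w positions of x produce no output."""
    return [f(tuple(x[n - w:n + w + 1])) for n in range(w, len(x) - w)]

# Example: w = 1 and f marks positions where the window is strictly increasing.
f = lambda window: 1 if window[0] < window[1] < window[2] else 0
x = [0, 1, 2, 1, 0, 1, 2, 3]
print(sliding_block_code(x, f, w=1))  # [1, 0, 0, 0, 1, 1]
```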

A process $\{Y_n\}$ is a finite coding of a process $\{X_n\}$ if it is a stationary coding of $\{X_n\}$ with finite width time-zero coder. Note that a finite-state process is merely a finite coding of a Markov chain in which the window half-width is 0, that is, a function of a Markov chain. A stationary coding of an i.i.d. process is called a B-process. More generally, a finite-alphabet process is called a B-process if it is a stationary coding of some i.i.d. process with finite or infinite alphabet. Much more will be said about B-processes in later parts of this book, especially in Chapter IV, where several characterizations will be discussed.

Example 1.1.10 (Block codings.)
Another type of coding, called block coding, destroys stationarity. It is frequently used in practice, however, and is often easier to analyze than is stationary coding. An $N$-block code is a function $C_N\colon A^N \to B^N$, where $A$ and $B$ are finite sets. Such a code can be used to map an $A$-valued process $\{X_n\}$ into a $B$-valued process $\{Y_n\}$ by applying $C_N$ to consecutive nonoverlapping blocks of length $N$, that is,
$$Y_{jN+1}^{(j+1)N} = C_N\big(X_{jN+1}^{(j+1)N}\big), \quad j = 0, 1, 2, \ldots.$$
The process $\{Y_n\}$ is called the $N$-block coding of the process $\{X_n\}$ defined by the $N$-block code $C_N$. The Kolmogorov measures, $\mu$ and $\nu$, of the respective processes, $\{X_n\}$ and $\{Y_n\}$, are connected by the formula $\nu = \mu \circ F^{-1}$, where $y = F(x)$ is the (measurable) mapping defined by
$$y_{jN+1}^{(j+1)N} = C_N\big(x_{jN+1}^{(j+1)N}\big), \quad j = 0, 1, 2, \ldots.$$

If the process $\{X_n\}$ is stationary, the $N$-block coding $\{Y_n\}$ is not, in general, stationary, although it is $N$-stationary, that is, it is invariant under the $N$-fold shift $T^N$, where $T = T_B$, the shift on $B^\infty$. There is a simple way to convert an $N$-stationary process, such as the encoded process $\{Y_n\}$, into a stationary process, $\{\tilde{Y}_n\}$. The process $\{\tilde{Y}_n\}$ is defined by selecting an integer $u \in [1, N]$ according to the uniform distribution and defining $\tilde{Y}_i = Y_{u+i-1}$, $i = 1, 2, \ldots$. This method is called "randomizing the start." The relation between the Kolmogorov measures, $\nu$ and $\tilde{\nu}$, of the respective processes, $\{Y_n\}$ and $\{\tilde{Y}_n\}$, is expressed by the formula
$$\tilde{\nu}(C) = \frac{1}{N}\sum_{i=0}^{N-1} \nu(T^{-i}C),$$
that is, $\tilde{\nu}$ is the average of $\nu$, together with its first $N - 1$ shifts. The method of randomizing the start clearly generalizes to convert any $N$-stationary process into a stationary process.

In summary, an $N$-block code $C_N$ induces a block coding of a stationary process $\{X_n\}$ onto an $N$-stationary process $\{Y_n\}$, which can then be converted to a stationary process, $\{\tilde{Y}_n\}$, by randomizing the start. In general, the final stationary process $\{\tilde{Y}_n\}$ is not a stationary encoding of the original process $\{X_n\}$ and many useful properties of $\{X_n\}$ may get destroyed. For example, block coding introduces a periodic structure in the encoded process, $\{Y_n\}$, a structure which is inherited, with a random delay, by the process $\{\tilde{Y}_n\}$. In Section 1.8, a general procedure will be developed for converting a block code to a stationary code by inserting spacers between blocks so that the resulting process is a stationary coding of the original process, and as such inherits many of its properties.
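Here is a minimal sketch, not from the book, of block coding followed by randomizing the start: an N-block code is applied to consecutive nonoverlapping blocks, and then a uniformly chosen offset u in [1, N] is dropped from the front of the output; the particular code and alphabet are assumptions of the example.

```python
import random

def n_block_coding(x, code_N, N):
    """Apply an N-block code to consecutive nonoverlapping N-blocks of x."""
    y = []
    for j in range(len(x) // N):
        y.extend(code_N(x[j * N:(j + 1) * N]))
    return y

def randomize_start(y, N, rng):
    """Drop u-1 initial symbols, u uniform on {1, ..., N}, giving a sample
    path of the stationary process obtained by randomizing the start."""
    u = rng.randint(1, N)
    return y[u - 1:]

# Example: N = 3 and the block code reverses each 3-block.
code_N = lambda block: block[::-1]
rng = random.Random(0)
x = [rng.choice([0, 1]) for _ in range(12)]   # a sample path of an i.i.d. input
y = n_block_coding(x, code_N, 3)
print(randomize_start(y, 3, rng))
```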

Example 1.1.11 (Concatenated-block processes.)
Processes can also be constructed by concatenating $N$-blocks drawn independently at random according to a given distribution, with stationarity produced by randomizing the start. There are several equivalent ways to describe such a process. In random variable terms, suppose $X_1^N = (X_1, \ldots, X_N)$ is a random vector with values in $A^N$. Let $\{Y_n\}$ be the $A$-valued process defined by the requirement that the blocks $Y_{(j-1)N+1}^{jN}$ be independent and have the distribution of $X_1^N$. The process $\{\tilde{Y}_n\}$ obtained from the $N$-stationary process $\{Y_n\}$ by randomizing the start is called the concatenated-block process defined by the random vector $X_1^N$. The concatenated-block process is characterized by the following two conditions.

(i) $\{\tilde{Y}_n\}$ is stationary.


(ii) There is a random variable $U$, uniformly distributed on $\{1, 2, \ldots, N\}$, such that, conditioned on $U = u$, the sequence of blocks $\{\tilde{Y}_{u+(j-1)N}^{u+jN-1}\colon j = 1, 2, \ldots\}$ is i.i.d. with the distribution of $X_1^N$.

An alternative formulation of the same idea in terms of measures is obtained by starting with a probability measure $\mu$ on $A^N$, forming the product measure $\mu^*$ on $(A^N)^\infty$, transporting this to an $N$-stationary measure $\mu^* \circ \phi^{-1}$ on $A^\infty$ via the mapping $\phi(w) = x = w(1)w(2)\cdots$, where
$$x_{(j-1)N+1}^{jN} = w(j), \quad j = 1, 2, \ldots,$$
then averaging to obtain the measure $\bar{\mu}$ defined by the formula
$$\bar{\mu}(B) = \frac{1}{N}\sum_{i=0}^{N-1} \mu^*\big(\phi^{-1}(T^{-i}B)\big), \quad B \in \Sigma.$$


The measure $\bar{\mu}$ is called the concatenated-block process defined by the measure $\mu$ on $A^N$. The process defined by $\bar{\mu}$ has the properties (i) and (ii), hence this measure terminology is consistent with the random variable terminology.

A third formulation, which is quite useful, represents a concatenated-block process as a finite-state process, that is, a function of a Markov chain. To describe this representation fix a measure $\mu$ on $A^N$ and let $\{\hat{Y}_n\}$ be the (stationary) Markov chain with alphabet $S = A^N \times \{1, 2, \ldots, N\}$, defined by the following transition and starting rules.

(a) If $i < N$, $(a_1^N, i)$ can only go to $(a_1^N, i + 1)$.

(b) $(a_1^N, N)$ goes to $(b_1^N, 1)$ with probability $\mu(b_1^N)$, for each $b_1^N \in A^N$.

(c) The process is started by selecting $(a_1^N, i)$ with probability $\mu(a_1^N)/N$.

The process $\{\tilde{Y}_n\}$ defined by setting $\tilde{Y}_n = a_j$, if $\hat{Y}_n = (a_1^N, j)$, is the same as the concatenated-block process based on the block distribution $\mu$. (See Exercise 7.) Concatenated-block processes will play an important role in many later discussions in this book.
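The finite-state description translates directly into a simulator. The following minimal sketch, not from the book, generates a sample path of a concatenated-block process by running the chain of rules (a)-(c) and emitting the letter selected by the counter; the block distribution and helper names are assumptions of the example.

```python
import random

def concatenated_block_path(block_dist, N, length, rng):
    """Sample path of the concatenated-block process defined by block_dist,
    a dict mapping N-tuples to probabilities, using the Markov chain whose
    states are (block, position) pairs with a uniformly random start."""
    blocks, probs = zip(*block_dist.items())
    block = rng.choices(blocks, probs)[0]
    i = rng.randint(1, N)               # randomize the start: rule (c)
    path = []
    for _ in range(length):
        path.append(block[i - 1])       # emit the i-th letter of the current block
        if i < N:
            i += 1                      # rule (a): advance within the block
        else:
            block = rng.choices(blocks, probs)[0]  # rule (b): draw a fresh block
            i = 1
    return path

block_dist = {(0, 0, 1): 0.5, (1, 1, 0): 0.5}   # an assumed distribution on A^3
print(concatenated_block_path(block_dist, 3, 15, random.Random(2)))
```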
Example 1.1.12 (Blocked processes.)
Associated with a stationary process $\{X_n\}$ and a positive integer $N$ are two different stationary processes, $\{Z_n\}$ and $\{W_n\}$, defined, for $n \geq 1$, by
$$Z_n = (X_{(n-1)N+1}, X_{(n-1)N+2}, \ldots, X_{nN}); \quad W_n = (X_n, X_{n+1}, \ldots, X_{n+N-1}).$$

The $Z$ process is called the nonoverlapping $N$-block process determined by $\{X_n\}$, while the $W$ process is called the overlapping $N$-block process determined by $\{X_n\}$. Each is stationary if $\{X_n\}$ is stationary. If the process $\{X_n\}$ is i.i.d. then its nonoverlapping blockings are i.i.d., but overlapping clearly destroys independence. The overlapping $N$-blocking of an i.i.d. or Markov process is Markov, however. In fact, if $\{X_n\}$ is Markov with transition matrix $M_{ab}$, then the overlapping $N$-block process is Markov with alphabet $B = \{a_1^N\colon \mu(a_1^N) > 0\}$ and transition matrix defined by
$$M_{a_1^N, b_1^N} = \begin{cases} M_{a_N b_N}, & \text{if } b_1^{N-1} = a_2^N, \\ 0, & \text{otherwise.} \end{cases}$$
Note also that if $\{X_n\}$ is Markov with transition matrix $M_{ab}$, then the nonoverlapping $N$-block process is Markov with alphabet $B$ and transition matrix defined by
$$M_{a_1^N, b_1^N} = M_{a_N b_1}\prod_{i=1}^{N-1} M_{b_i b_{i+1}}.$$

A related process of interest, $\{Y_n\}$, called the $N$-th term process, selects every $N$-th term from the $\{X_n\}$ process, that is, $Y_n = X_{(n-1)N+1}$, $n \in Z$. If $\{X_n\}$ is Markov with transition matrix $M$, then $\{Y_n\}$ will be Markov with transition matrix $M^N$, the $N$-th power of $M$.
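A minimal sketch, not from the book, of the overlapping and nonoverlapping N-block constructions, applied to a list representing an initial stretch of a sample path; the helper names are assumptions of the example.

```python
def nonoverlapping_blocks(x, N):
    """Z_n = (X_{(n-1)N+1}, ..., X_{nN}): consecutive disjoint N-blocks."""
    return [tuple(x[n:n + N]) for n in range(0, len(x) - N + 1, N)]

def overlapping_blocks(x, N):
    """W_n = (X_n, ..., X_{n+N-1}): the sliding window of length N."""
    return [tuple(x[n:n + N]) for n in range(len(x) - N + 1)]

x = [0, 1, 1, 0, 1, 0]
print(nonoverlapping_blocks(x, 2))  # [(0, 1), (1, 0), (1, 0)]
print(overlapping_blocks(x, 2))     # [(0, 1), (1, 1), (1, 0), (0, 1), (1, 0)]
```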


I.1.b

Probability tools.

Two elementary results from probability theory will be frequently used, the Markov inequality and the Borel-Cantelli principle.

Lemma 1.1.13 (The Markov inequality.)
Let $f$ be a nonnegative, integrable function on a probability space $(X, \Sigma, \mu)$. If $\int f\,d\mu \leq c\delta$ then $f(x) \leq c$, except for a set of measure at most $\delta$.

Lemma 1.1.14 (The Borel-Cantelli principle.)
If $\{C_n\}$ is a sequence of measurable sets in a probability space $(X, \Sigma, \mu)$ such that $\sum_n \mu(C_n) < \infty$, then for almost every $x$ there is an $N = N(x)$ such that $x \notin C_n$, $n \geq N$.

In general, a property $P$ is said to be measurable if the set of all $x$ for which $P(x)$ is true is a measurable set. If $\{P_n\}$ is a sequence of measurable properties then
(a) $P_n(x)$ holds eventually almost surely, if for almost every $x$ there is an $N = N(x)$ such that $P_n(x)$ is true for $n \geq N$.

(b) $P_n(x)$ holds infinitely often, almost surely, if for almost every $x$ there is an increasing sequence $\{n_i\}$ of integers, which may depend on $x$, such that $P_{n_i}(x)$ is true for $i = 1, 2, \ldots$.

For example, the Borel-Cantelli principle is often expressed by saying that if $\sum_n \mu(C_n) < \infty$ then $x \notin C_n$, eventually almost surely. Almost-sure convergence is often established using the following generalization of the Borel-Cantelli principle.

Lemma 1.1.15 (The iterated Borel-Cantelli principle.)
Suppose $\{G_n\}$ and $\{B_n\}$ are two sequences of measurable sets such that $x \in G_n$, eventually almost surely, and $x \notin B_n \cap G_n$, eventually almost surely. Then $x \notin B_n$, eventually almost surely.

The proof of this is left to the exercises. In many applications, the fact that $x \notin B_n \cap G_n$, eventually almost surely, is established by showing that $\sum_n \mu(B_n \cap G_n) < \infty$, in which case the iterated Borel-Cantelli principle is, indeed, just a generalized Borel-Cantelli principle. Frequent use will be made of various equivalent forms of almost sure convergence, summarized as follows.

Lemma 1.1.16
The following are equivalent for measurable functions on a probability space.
(a) $f_n \to f$, almost surely.

(b) $|f_n(x) - f(x)| < \epsilon$, eventually almost surely, for every $\epsilon > 0$.

(c) Given $\epsilon > 0$, there is an $N$ and a set $G$ of measure at least $1 - \epsilon$, such that $|f_n(x) - f(x)| < \epsilon$, $x \in G$, $n \geq N$.


As several of the preceding examples suggest, a process is often specified as a function of some other process. Probabilities for the new process can then be calculated by using the inverse image to transfer back to the old process. Sometimes it is useful to go in the opposite direction, e. g., first see what is happening on a set of probability 1 in the old process then transfer to the new process. A complication arises, namely, that Borel images of Borel sets may not be Borel sets. For nice spaces, such as product spaces, this is not a serious problem, for such images are always measurable with respect to the completion of the image measure, [5]. This fact is summarized by the following lemma.

Lemma 1.1.17 (The Borel mapping lemma.)
Let $F$ be a Borel function from $A^\infty$ into $B^\infty$, let $\mu$ be a Borel measure on $A^\infty$, let $\nu = \mu \circ F^{-1}$, and let $\hat{\nu}$ be the completion of $\nu$. If $X$ is a Borel set such that $\mu(X) = 1$, then $F(X)$ is measurable with respect to the completion $\hat{\nu}$ of $\nu$ and $\hat{\nu}(F(X)) = 1$.
A similar result holds with either $A^\infty$ or $B^\infty$ replaced, respectively, by $A^Z$ or $B^Z$. Use will also be made of two almost trivial facts about the connection between cardinality and probability, namely, that lower bounds on probability give upper bounds on cardinality, and upper bounds on cardinality "almost" imply lower bounds on probability. For ease of later reference these are stated here as the following lemma.

Lemma 1.1.18 (Cardinality bounds.)
Let $\mu$ be a probability measure on the finite set $A$, let $B \subset A$, and let $\alpha$ be a positive number.

(a) If $\mu(a) \geq \alpha$ for each $a \in B$, then $|B| \leq 1/\alpha$.

(b) For $b \in B$, $\mu(b) \geq \alpha/|B|$, except for a subset of $B$ of measure at most $\alpha$.

Deeper results from probability theory, such as the martingale theorem, the central limit theorem, the law of the iterated logarithm, and the renewal theorem, will not play a major role in this book, though they may be used in various examples, and sometimes the martingale theorem is used to simplify an argument.

I.1.c

Exercises.

1. Prove Lemma 1.1.18.

2. Give an example of a function of a finite-alphabet Markov chain that is not Markov of any order. (Include a proof that your example is not Markov of any order.)

3. If $\mu([a_1^n] \cap [b_{n+m+1}^{n+m+k}]) = \mu([a_1^n])\,\mu([b_{n+m+1}^{n+m+k}])$, for all $n$ and $k$, and all $a_1^n$ and $b_{n+m+1}^{n+m+k}$, then a stationary process is said to be $m$-dependent. Show that a finite coding of an i.i.d. process is $m$-dependent for some $m$. (How is $m$ related to window half-width?)

4. Let $\{U_n\}$ be i.i.d. with each $U_n$ uniformly distributed on $[0, 1]$. Define $X_n = 1$ if $U_n > U_{n-1}$, otherwise $X_n = 0$. Show that $\{X_n\}$ is 1-dependent, and is not a finite coding of a finite-alphabet i.i.d. process. (Hint: show that such a coding would have the property that there is a number $c > 0$ such that $\mu(x_1^n) \geq c^n$, if $\mu(x_1^n) \neq 0$. Then show that the probability of $n$ consecutive 0's is $1/(n + 1)!$.)


5. Show that the process constructed in the preceding exercise is not Markov of any order.

6. Establish the Kolmogorov representation theorem for countable alphabet processes.

7. Show that the finite-state representation of a concatenated-block process in Example 1.1.11 satisfies the two conditions (i) and (ii) of that example.

8. A measure $\mu$ on $A^n$ defines a measure $\mu^{(N)}$ on $A^N$, where $N = Kn + r$, $0 \leq r < n$, by the formula
$$\mu^{(N)}(x_1^{Kn+r}) = \mu(x_{Kn+1}^{Kn+r})\prod_{k=0}^{K-1} \mu(x_{kn+1}^{kn+n}).$$
The measures $\{\mu^{(N)}\colon N = 1, 2, \ldots\}$ satisfy the Kolmogorov consistency conditions, hence have a common extension $\mu^*$ to $A^\infty$. Show that the concatenated-block process $\bar{\mu}$ defined by $\mu$ is an average of shifts of $\mu^*$, that is, $\bar{\mu}(B) = (1/n)\sum_{i=1}^{n}\mu^*(T^{-i}B)$, for each Borel subset $B \subset A^\infty$.

9. Prove the iterated Borel-Cantelli principle, Lemma 1.1.15.

Section 1.2 The ergodic theory model.


The Kolmogorov measure for a stationary process is preserved by the shift $T$ on the sequence space. This suggests the possibility of using ergodic theory ideas in the study of stationary processes. Ergodic theory is concerned with the orbits $x, Tx, T^2x, \ldots$ of a transformation $T\colon X \to X$ on some given space $X$. In many cases of interest there is a natural probability measure preserved by $T$, relative to which information about orbit structure can be expressed in probability language. Finite measurements on the space $X$, which correspond to finite partitions of $X$, then give rise to stationary processes. This model, called the transformation/partition model for a stationary process, is the subject of this section and is the basis for much of the remainder of this book.

The Kolmogorov model for a stationary process implicitly contains the transformation/partition model. The shift $T$ on the sequence space, $X = A^\infty$ (or on the space $A^Z$), is the transformation, and the partition is $P = \{P_a\colon a \in A\}$, where for each $a \in A$, $P_a = \{x\colon x_1 = a\}$. The partition $P$ is called the Kolmogorov partition associated with the process. The sequence of random variables and the joint distributions are expressed in terms of the shift and Kolmogorov partition, as follows. First, associate with the partition $P$ the random variable $X_P$ defined by $X_P(x) = a$ if $a$ is the label of the member of $P$ to which $x$ belongs, that is, $x \in P_a$. The coordinate functions, $\{X_n\}$, are given by the formula
$$X_n(x) = X_P(T^{n-1}x), \quad n \geq 1. \tag{1}$$

The process can therefore be described as follows: Pick a point $x \in X$, that is, an infinite sequence, at random according to the Kolmogorov measure, and, for each $n$, let $X_n = X_n(x)$ be the label of the member of $P$ to which $T^{n-1}x$ belongs. Since to say that $T^{n-1}x \in P_a$ is the same as saying that $x \in T^{-n+1}P_a$, cylinder sets and joint distributions may be expressed by the respective formulas,
$$[a_1^n] = \bigcap_{i=1}^{n} T^{-i+1}P_{a_i},$$
and
$$\mu_n(a_1^n) = \mu([a_1^n]) = \mu\Big(\bigcap_{i=1}^{n} T^{-i+1}P_{a_i}\Big). \tag{2}$$
In summary, the coordinate functions, the cylinder sets in the Kolmogorov representation, and the joint distributions can all be expressed directly in terms of the Kolmogorov partition and the shift transformation.
The concept of stationary process is formulated in terms of the abstract concepts of measure-preserving transformation and partition, as follows. Let $(X, \Sigma, \mu)$ be a probability space. A mapping $T\colon X \to X$ is said to be measurable if $T^{-1}B \in \Sigma$, for all $B \in \Sigma$, and measure preserving if it is measurable and if

$$\mu(T^{-1}B) = \mu(B), \quad B \in \Sigma.$$
A partition
$$P = \{P_a\colon a \in A\}$$
of $X$ is a finite, disjoint collection of measurable sets, indexed by a finite set $A$, whose union has measure 1, that is, $X - \cup_a P_a$ is a null set. (In some situations, countable partitions, that is, partitions into countably many sets, are useful.) Associated with the partition $P = \{P_a\colon a \in A\}$ is the random variable $X_P$ defined by $X_P(x) = a$ if $x \in P_a$. The random variable $X_P$ and the measure-preserving transformation $T$ together define a process by the formula

$$X_n(x) = X_P(T^{n-1}x), \quad n \geq 1. \tag{3}$$

The $k$-th order distribution $\mu_k$ of the process $\{X_n\}$ is given by the formula
$$\mu_k(a_1^k) = \mu\Big(\bigcap_{i=1}^{k} T^{-i+1}P_{a_i}\Big), \tag{4}$$

the direct analogue of the sequence space formula, (2). The process $\{X_n\colon n \geq 1\}$ defined by (3), or equivalently, by (4), is called the process defined by the transformation $T$ and partition $P$, or, more simply, the $(T, P)$-process. The sequence $\{x_n = X_n(x)\colon n \geq 1\}$, defined for a point $x \in X$ by the formula $T^{n-1}x \in P_{x_n}$, is called the $(T, P)$-name of $x$.

The $(T, P)$-process may also be described as follows. Pick $x \in X$ at random according to the $\mu$-distribution and let $X_1(x)$ be the label of the set in $P$ to which $x$ belongs. Then apply $T$ to $x$ to obtain $Tx$ and let $X_2(x)$ be the label of the set in $P$ to which $Tx$ belongs. Continuing in this manner, the values,
$$X_1(x), X_2(x), X_3(x), \ldots, X_n(x), \ldots$$

tell to which set of the partition the corresponding member of the random orbit $x, Tx, T^2x, \ldots, T^{n-1}x, \ldots$ belongs. Of course, in the Kolmogorov representation a random point $x$ is a sequence in $A^\infty$ or $A^Z$, and the $(T, P)$-name of $x$ is the same as $x$ (or the forward part of $x$ in the two-sided case).
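As a concrete illustration, not from the book, the following minimal sketch computes the first $n$ symbols of a $(T, P)$-name: $T$ is taken to be an irrational rotation of the unit interval and $P$ the two-set partition $\{[0, 1/2), [1/2, 1)\}$; both choices, and the function names, are assumptions made purely for the example.

```python
import math

def t_p_name(x, T, label, n):
    """First n symbols of the (T, P)-name of x: the label of the cell
    containing x, Tx, T^2 x, ..., T^(n-1) x."""
    name = []
    for _ in range(n):
        name.append(label(x))
        x = T(x)
    return name

alpha = math.sqrt(2) - 1                 # irrational rotation amount
T = lambda x: (x + alpha) % 1.0          # rotation of [0, 1), preserves Lebesgue measure
label = lambda x: 0 if x < 0.5 else 1    # partition P = {[0, 1/2), [1/2, 1)}
print(t_p_name(0.1, T, label, 12))
```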


The $(T, P)$-process concept is, in essence, just an abstract form of the stationary coding concept, in that a partition $P$ of $(X, \Sigma, \mu)$ gives rise to a measurable function $F\colon X \to B^\infty$ which carries $\mu$ onto the Kolmogorov measure $\nu$ of the $(T, P)$-process, such that $F(Tx) = T_B F(x)$, where $T_B$ is the shift on $B^\infty$. The mapping $F$ extends to a stationary coding to $B^Z$ in the case when $X = A^Z$ and $T$ is the shift. Conversely, a stationary coding $F\colon A^Z \to B^Z$ carrying $\mu$ onto $\nu$ determines a partition $P = \{P_b\colon b \in B\}$ of $A^Z$ such that $\nu$ is the Kolmogorov measure of the $(T_A, P)$-process. The partition $P$ is defined by $P_b = \{x\colon f(x) = b\}$. See Exercise 2 and Exercise 3.

In summary, a stationary process can be thought of as a shift-invariant measure on a sequence space, or, equivalently, as a measure-preserving transformation $T$ and partition $P$ of an arbitrary probability space. The ergodic theory point of view starts with the transformation/partition concept, while modern probability theory starts with the sequence space concept.

I.2.a Ergodic processes.


It is natural to study the orbits of a transformation by looking at its action on invariant sets, since once an orbit enters an invariant set it never leaves it. In particular, the natural object of study becomes the restriction of the transformation to sets that cannot be split into nontrivial invariant sets. This leads to the concept of ergodic transformation.

A measurable set $B$ is said to be $T$-invariant if $TB \subset B$. The space $X$ is $T$-decomposable if it can be expressed as the disjoint union $X = X_1 \cup X_2$ of two measurable invariant sets, each of positive measure. The condition that $TX_i \subset X_i$, $i = 1, 2$, translates into the statement that $T^{-1}X_i = X_i$, $i = 1, 2$, hence to say that the space is indecomposable is to say that if $T^{-1}B = B$ then $\mu(B)$ is 0 or 1. It is standard practice to use the word "ergodic" to mean that the space is indecomposable. A measure-preserving transformation $T$ is said to be ergodic if

$$T^{-1}B = B \implies \mu(B) = 0 \text{ or } \mu(B) = 1. \tag{5}$$

The following lemma contains several equivalent formulations of the ergodicity condition. The notation $C \Delta D = (C - D) \cup (D - C)$ denotes symmetric difference, and $U_T$ denotes the operator on functions defined by $(U_T f)(x) = f(Tx)$, where the domain can be taken to be any one of the $L^p$-spaces or the space of measurable functions.

Lemma 1.2.1 (Ergodicity equivalents.)
The following are equivalent for a measure-preserving transformation $T$ on a probability space.

(a) $T$ is ergodic.

(b) $T^{-1}B \subset B \implies \mu(B) = 0$ or $\mu(B) = 1$.

(c) $T^{-1}B \supset B \implies \mu(B) = 0$ or $\mu(B) = 1$.

(d) $\mu(T^{-1}B \,\Delta\, B) = 0 \implies \mu(B) = 0$ or $\mu(B) = 1$.

(e) $U_T f = f$, a.e., implies that $f$ is constant, a.e.


Proof. The equivalence of the first two follows from the fact that if $T^{-1}B \subset B$ and $C = \cap_{n \geq 0} T^{-n}B$, then $T^{-1}C = C$ and $\mu(C) = \mu(B)$. The proofs of the other equivalences are left to the reader.

Remark 1.2.2
In the particular case when $T$ is invertible, that is, when $T$ is one-to-one and for each measurable set $C$ the set $TC$ is measurable and has the same measure as $C$, the conditions for ergodicity can be expressed in terms of the action of $T$, rather than $T^{-1}$. In particular, an invertible $T$ is ergodic if and only if any $T$-invariant set has measure 0 or 1. Also, note that if $T$ is invertible then $U_T$ is a unitary operator on $L^2$.

A stationary process is ergodic if the shift in the Kolmogorov representation is ergodic relative to the Kolmogorov measure. As will be shown in Section 1.4, to say that a stationary process is ergodic is equivalent to saying that measures of cylinder sets can be determined by counting limiting relative frequencies along a sample path $x$, for almost every $x$. Furthermore, a shift-invariant measure which is not ergodic can be represented as an average of other shift-invariant measures. Thus the concept of ergodic process, which is natural from the transformation point of view, is equivalent to an important probability concept.

I.2.b

Examples of ergodic processes.

Examples of ergodic processes include i.i.d. processes, irreducible Markov chains and functions thereof, stationary codings of ergodic processes, concatenated-block processes, and some, but not all, processes obtained from a block coding of an ergodic process by randomizing the start. These and other examples will now be discussed.

Example 1.2.3 (The Baker's transformation.) A simple geometric example provides a transformation and partition for which the resulting process is the familiar coin-tossing process. Let X = [0, 1) x [0, 1) denote the unit square and define a transformation T by

$$T(s, t) = \begin{cases} (2s,\ t/2), & \text{if } s < 1/2, \\ (2s - 1,\ (t + 1)/2), & \text{if } s \geq 1/2. \end{cases}$$

The transformation T is called the Baker's transformation since its action can be described as follows. (See Figure 1.2.4.)
1. Cut the unit square into two columns of equal width.

2. Squeeze each column down to height 1/2 and stretch it to width 1.

3. Place the right rectangle on top of the left to obtain a square.

Figure 1.2.4 The Baker's transformation.


The Baker's transformation $T$ preserves Lebesgue measure in the square, for dyadic subsquares (which generate the Borel field) are mapped into rectangles of the same area. To obtain the coin-tossing process define the two-set partition $P = \{P_0, P_1\}$ by setting
$$P_0 = \{(s, t)\colon s < 1/2\}, \quad P_1 = \{(s, t)\colon s \geq 1/2\}. \tag{6}$$

To assist in showing that the $(T, P)$-process is, indeed, the binary, symmetric, i.i.d. process, some useful partition notation and terminology will be developed. Let $P = \{P_a\colon a \in A\}$ be a partition of the probability space $(X, \Sigma, \mu)$. The distribution of $P$ is the probability distribution $\{\mu(P_a),\ a \in A\}$. The join, $P \vee Q$, of two partitions $P$ and $Q$ is their common refinement, that is, the partition
$$P \vee Q = \{P_a \cap Q_b\colon P_a \in P,\ Q_b \in Q\}.$$
The join $\vee_1^k P^{(i)}$ of a finite sequence $P^{(1)}, P^{(2)}, \ldots, P^{(k)}$ of partitions is their common refinement, defined inductively by $\vee_1^k P^{(i)} = (\vee_1^{k-1} P^{(i)}) \vee P^{(k)}$. Two partitions $P$ and $Q$ are independent if the distribution of the restriction of $P$ to each $Q_b \in Q$ does not depend on $b$, that is, if
$$\mu(P_a \cap Q_b) = \mu(P_a)\mu(Q_b), \quad \forall a, b.$$


The sequence of partitions $\{P^{(i)}\colon i \geq 1\}$ is said to be independent if $P^{(k+1)}$ and $\vee_1^k P^{(i)}$ are independent for each $k \geq 1$.

For the Baker's transformation and partition (6), Figure 1.2.5 illustrates the partitions $P$, $TP$, $T^2P$, $T^{-1}P$, along with the join of these partitions. Note that $T^2P$ partitions each set of $T^{-1}P \vee P \vee TP$ into exactly the same proportions as it partitions the entire space $X$. This is precisely the meaning of independence. In particular, it can be shown that the partition $T^nP$ is independent of $\vee_0^{n-1} T^iP$, for each $n \geq 1$, so that $\{T^iP\}$ is an independent sequence. This, together with the fact that each of the two sets $P_0$, $P_1$ has measure 1/2, implies, of course, that the $(T, P)$-process is just the coin-tossing process.

Figure 1.2.5 The join of $P$, $TP$, $T^2P$, and $T^{-1}P$.

In general, the i.i.d. process $\mu$, defined by the first-order distribution $\mu_1(a)$, $a \in A$, is given by a generalized Baker's transformation, described as follows. (See Figure 1.2.6.)

1. Cut the unit square into $|A|$ columns, labeled by the letters of $A$, such that the column labeled $a$ has width $\mu_1(a)$.

2. For each $a \in A$ squeeze the column labeled $a$ down and stretch it to obtain a rectangle of height $\mu_1(a)$ and width 1.

3. Stack the rectangles to obtain a square.

The corresponding Baker's transformation $T$ preserves Lebesgue measure. The partition $P$ into columns defined in step 1 of the definition of $T$ then produces the desired i.i.d. process.

Figure 1.2.6 The generalized Baker's transformation.


Example 1.2.7 (Ergodicity for i.i.d. processes.) To show that a process is ergodic it is sometimes easier to verify a stronger property, called mixing. A transformation is mixing if (7)
n ,froo

lim ,u(T -nC n D) = ,u(C),u(D), C, D

E.

A stationary process is mixing if the shift is mixing for its Kolmogorov measure. Mixing clearly implies ergodicity, for if T -1 C = C then T -ncnD=CnD, for all sets D and positive integers n, so that the mixing property, (7), implies that p, (C nD) = Since this holds for all sets D, the choice D = C gives ,i(C) = ,u(C) 2 and hence ,u(C) is 0 or 1. Suppose it is a product measure on A'. To show that the mixing condition (7) holds for the shift first note that it is enough to establish the condition for any two sets in a generating algebra, hence it is enough to show that it holds for any two cylinder sets. But this is easy, for if C = [a7] and N > 0 then
T-N =
A +1

bNA -i = a1,

1 < < n.

Thus, if D = [br] and N > m then T -N C and D depend on values of x i for indices i in disjoint sets of integers. Since the measure is product measure this means that

p,(T -Nc n D) = ,a(T-N C),a(D) =


Thus i.i.d. measures satisfy the mixing condition. Example 1.2.8 (Ergodicity for Markov chains.) The location of the zero entries, if any, in the transition matrix M determine whether a Markov process is ergodic. The results are summarized here; the reader may refer to standard probability books or to the extensive discussion in [29] for details. The stochastic matrix M is said to be irreducible, if for any pair i, j there is a sequence io, i 1 , , in with io = i and in = j such that > 0, in = 0, 1, , n 1.

SECTION 1.2. THE ERGODIC THEORY MODEL.

19

This is just the assertion that for any pair i, j of states, if the chain is at state i at some time then there is a positive probability that it will be in state j at some later time. If M is irreducible there is a unique probability vector it such that irM = 7r. Furthermore, each entry of it is positive and
N
(8) Noc)

liM

n=1

where each row of the k x k limit matrix P is equal to jr. In fact, the limit of the averages of powers can be shown to exist for an arbitrary finite stochastic matrix M. In the irreducible case, there is only one probability vector IT such that 7rM = it and the limit matrix P has it all its rows equal to 7r. The condition 7rM = it shows that the Markov chain with start distribution Ai = it and transition matrix M is stationary. An irreducible Markov chain is ergodic. To prove this it is enough to show that
N (9) LlM oo

C n D) =

for any cylinder sets C and D, hence for any measurable sets C and D. This condition, which is weaker than the mixing condition (7), implies ergodicity, for if T -1 C = C and D = C, then the limit formula (9) gives /..t(C) = 1u(C) 2 , so that C must have measure 0 or 1. To establish (9) for an irreducible chain, let D = [4] and C = [cri and write + Tkn = [bnn ++ kk-Fr ], where bn_f_k+i = ci, 1 < i < m. Thus when k > 0 the sets T -k-n C and D depend on different coordinates. Summing over all the ways to get from d, to bn+k+1 yields the product

p.,([4] n [btni 1 -_ - :Vinj) = uvw,


where
n-1 7r (di)11Md,d, +1 =

acin),
Mc1,1kbn+k+i

(n-Flc-1 i=n

JJ

Md,d, +1

= 11

n+k+m-1 i=n+k-F1

Note that V, which is the probability of transition from dn to bn+k+1 = Cl in equal to MI ci = [Mk ]d c the dn ci term in the k-th power of M, and hence
.-1
(10) itt([d]

steps, is

n ii n [bn

1) = ttacril i) A I

mc,c,+i. 1=,

The sequence ML.1 1 converges in the sense of Cesaro to (Ci), by (8), which establishes (9), since ,u(C) = This proves that irreducible Markov chains are ergodic.

20

CHAPTER I. BASIC CONCEPTS.

The converse is also true, at least with the additional assumption being used in this book that every state has positive probability. Indeed, if every state has positive probability and if Pi = {x: xi = j}, then the set B = T -1 Pi U T -2 PJ U has positive probability and satisfies T -1 B D B. Thus if the chain is ergodic then ,u(B) = 1, so that, ,u(B n Pi ) > 0, for any state i. But if it(B n Pi ) > 0, then ,u(T -nP n Pi ) > 0, for some n, which means that transition from i to j in n steps occurs with positive probability, and hence the chain is irreducible. In summary, Proposition 1.2.9 A stationary Markov chain is ergodic if and only if its transition matrix is irreducible. If some power of M is positive (that is, has all positive entries) then M is certainly irreducible. In this case, the Cesaro limit theorem (8) can be strengthened to limN, M N = P. The argument used to prove (9) then shows that the shift T must be mixing. The converse is also true, again, assuming that all states have positive probability. Proposition 1.2.10 A stationary Markov chain is mixing if and only if some power of its transition matrix has all positive entries.

Remark 1.2.11
In some probability books, for example, Feller's book, [11], "ergodic" for Markov chains is equivalent to the condition that some power of the transition matrix be positive. To be consistent with the general concept of ergodic process as used in this book, "ergodic" for Markov chains will mean merely irreducible with "mixing Markov" reserved for the additional property that some power of the transition matrix has all positive entries.

Example 1.2.12 (Codings and ergodicity.)


Stationary coding preserves the ergodicity property. This follows easily from the definitions, for suppose F: A z B z is a stationary encoder, suppose it is the Kolmogorov measure of an ergodic A-valued process, and suppose v = p. o F -1 is the Kolmogorov measure of the encoded B-valued process. If C is a shift-invariant subset of B z then F-1 C is a shift-invariant subset of Az so that ,u(F-1 C) is 0 or 1. Since v(C) = p,(F-1 C) it follows that v(C) is 0 or 1, and hence that v is the Kolmogorov measure of an ergodic process. It is not important that the domain of F be the sequence space A z ; any probability space will do. Thus, if T is ergodic the (T, 2)-process is ergodic for any finite partition P. It should be noted, however, that the (T, 2)-process can be ergodic even though T is not. (See Exercise 5, below.) A simple extension of the above argument shows that stationary coding also preserves the mixing property, see Exercise 19. A finite-state process is a stationary coding with window width 0 of a Markov chain. Thus, if the chain is ergodic then the finite-state process will also be ergodic. Likewise, a finite-state process for which the underlying Markov chain is mixing must itself be mixing. A concatenated-block process is always ergodic, since it can be represented as a stationary coding of an irreducible Markov chain; see Example 1.1.11. The underlying

SECTION 1.2. THE ERGODIC THEORY MODEL.

21

Markov chain is not generally mixing, however, for it has a periodic structure due to
the blocking. Since this periodic structure is inherited, up to a shift, concatenated-block processes are not mixing except in special cases. A transformation T on a probability space (X, E, p.) is said to be totally ergodic if every power TN is ergodic. If T is the shift on a sequence space then T is totally ergodic if and only if for each N, the nonoverlapping N-block process defined by
Zn = (X (n-1)N-1-11 X (n-1)N+2,
9

Xn

is ergodic. (See Example 1.1.12.) Since the condition that F(TA X) = TB F (x), for all x, implies that F (Ti x) = T: F(x), for all x and all N, it follows that a stationary coding of a totally ergodic process must be totally ergodic. As noted in Section 1.1, N-block codings destroy stationarity, but a stationary process can be constructed from the encoded process by randomizing the start, see Example 1.1.10. The final randomized-start process may not be ergodic, however, even if the original process was ergodic. For example, let p. give measure 1/2 to each of the two sequences 1010... and 0101 ..., that is, p is the stationary Markov measure with transition matrix M and start distribution 7r given, respectively, by
m

ro 1 L o j'

rl

1]

Let y be the encoding of p. defined by the 2-block code C(01) = 00, C(10) = 11, so that y is concentrated on the two sequences 000... and 111 .... The measure -17 obtained by randomizing the start is, in this case, the same as y, hence is not an ergodic process. A condition insuring ergodicity of the process obtained by N-block coding and randomizing the start is that the original process be ergodic relative to the N-shift T N . The proof of this is left to the reader. In particular, applying a block code to a totally ergodic process and randomizing the start produces an ergodic process. Example 1.2.13 (Rotation processes.) Let a be a fixed real number and let T be defined on X = [0, 1) by the formula
Tx = x ED a,

where ED indicates addition modulo 1. The mapping T is called translation by a. It is also called rotation, since X can be thought of as the unit circle by identifying x with the angle 27rx, so that translation by a becomes rotation by 27ra. A subinterval of the circle corresponds to a subinterval or the complement of a subinterval in X, hence the word "interval" can be used for subsets of X that are connected or whose complements are connected. The transformation T is one-to-one and maps intervals onto intervals of the same length, hence preserves Lebesgue measure p,. (The measure-preserving property is often established by proving, as in this case, that it holds on a family of sets that generates the a-algebra.) Any partition 7, of the interval gives rise to a process; such processes are called translation processes, (or rotation processes if the circle representation is used.) As an example, let P consist of the two intervals P0 = [0, 1/2), P 1 = [1/2, 1), which correspond to the upper and lower halves of the circle in the circle representation. The (T, P)-process {X n } is then described as follows. Pick a point x at random in the unit interval according to the uniform (Lebesgue) measure. The value X(x) = Xp(Tn -1 x) is then 0 or 1, depending on whether Tn - lx = x ED (n 1)a belongs to P0 or P1 . The following proposition is basic.

22 Proposition 1.2.14 T is ergodic if and only if a is irrational.

CHAPTER I. BASIC CONCEPTS.

Proof It is left to the reader to show that T is not ergodic if a is rational. Assume that a is irrational. Two proofs of ergodicity will be given. The first proof is based on the
following classical result of Kronecker.

Proposition 1.2.15 (Kronecker's theorem.) If a is irrational the forward orbit {Tnx: n > 1} is dense in X, for each x.
Proof To establish Kronecker's theorem let F be the closure of the forward orbit, {Tx: n > 1 } , and suppose that F is not dense in X. Then there is an interval I
of positive length which is maximal with respect to the property of not meeting F. Furthermore, for each positive integer n, TI cannot be the same as I, since a is irrational, and hence the maximality of I implies that n I = 0. It follows that is a disjoint sequence and
00 1 = ,u ( [0, 1))

a.

E n=1

00,

which is a contradiction. This proves Kronecker's theorem.

Proof of Proposition 1.2.14 continued. To proceed with the proof that T is ergodic if a is irrational, suppose A is an invariant set of positive measure. Given c > 0 choose an interval I such that
/2(1) < E

and /..t(A

n I) > (1

Kronecker's theorem, applied to the end points of I, produces integers n1 <n2 < < n k such that {Tni / } is a disjoint sequence and Ek i=1 tt(Tni /) > 1 2E. The assumption that i,t(A n I) > (1 6) gives
it (A)

E
E)

n Trig)
(1 0(1 2E).

i= 1

Thus ,u(A) = 1, which completes the proof of Proposition 1.2.14. The preceding proof is typical of proofs of ergodicity for many transformations defined on the unit interval. One first shows that orbits are dense, then that small disjoint intervals can be placed around each point in the orbit. Such a technique does not work in all cases, but does establish ergodicity for many transformations of interest. A generalization of this idea will be used in Section 1.10 to establish ergodicity of a large class of transformations. An alternate proof of ergodicity can be obtained using Fourier series. Suppose f is square integrable and has Fourier series E an e brinx . The Fourier series of g(x) = f (T x)

is
E a, e2-n(x+a)

a, e2"ina e 27rinx

SECTION 1.2. THE ERGODIC THEORY MODEL.

23

If g(x) = f(x), a. e., then an = an e 2lri" holds for each integer n. If a is irrational 0 unless n = 0, which means that a, = 0, unless and an = an e 27' then e2lrin" n=0, which, in turn, implies that the function f is constant, a. e. In other words, in the irrational case, the rotation T has no invariant functions and hence is ergodic. In general, translation of a compact group will be ergodic with respect to Haar measure if orbits are dense. For example, consider the torus T, that is, the product of two intervals, T = [0, 1) x [0, 1), with addition mod 1, (or, alternatively, the product of two circles.) The proof of the following is left to the reader.

Proposition 1.2.16 The following are equivalent for the mapping T: (x, y) 1- (x a, y El) p) (i) {Tn(x, y)} is dense for any pair (x, y). (ii) T is ergodic. (iii) a and 13 are rationally independent.

I.2.c The return-time picture.


A general return-time concept has been developed in ergodic theory, which, along with a suggestive picture, provides an alternative and often simpler view of two standard probability models, the renewal processes and the regenerative processes. Let T be a measure-preserving transformation on the probability space (X, E, IL) and let B be a measurable subset of X of positive measure. If x E B, let n(x) be the least positive integer n such that Tnx E B, that is, n(x) is the time of first return to B.

Theorem 1.2.17 (The Poincare recurrence theorem.)


Return to B is almost certain, that is, n(x) < oc, almost surely. Proof To prove this, define the sets Bn = E B: n(x) = n}, for n > 1, and define = {X E B: Tx B,Vn > O. These sets are clearly disjoint and have union B, and, furthermore, are measurable, for B 1 = B n 7-1 B and

Bn =

(X

B)

n B n T -n B, n > 2.

Furthermore, if x E 1100 then Tx g Bc.,, for all n > 1, from which it follows that the sequence of sets {T B} is disjoint. Since they all have the same measure as B and p(X) = 1, it follows that p(B,,o ) = 0. This proves Theorem 1.2.17. To simplify further discussion it will be assumed in the remainder of this subsection that T is invertible. (Some extensions to the noninvertible case are outlined in the exercises.) For each n, the sets

(11)

Bn , TB,

, T n-1 Bn

are disjoint and have the same measure. Furthermore, only the first set Bn in this sequence meets B, while applying T to the last set Tn -1 Bn returns it to B, that is, TBn C B. The return-time picture is obtained by representing the sets (11) as intervals, one above the other. Furthermore, by reassembling within each level it can be assumed that T just moves points directly upwards one level. (See Figure 1.2.18). Points in the top level of each column are moved by T to the base B in some manner unspecified by the picture.

24

CHAPTER I. BASIC CONCEPTS.

Tx

B1

B2

B3

B4

Figure 1.2.18 Return-time picture.


Note that the set Util=i Bn is just the union of the column whose base is Bn , so that the picture is a representation of the set (12) U7 =0 TB

This set is T-invariant and has positive measure. If T is ergodic then it must have measure 1, in which case, the picture represents the whole space, modulo a null set, of course. The picture suggests the following terminology. The ordered set

Bn , TB,

, T 11-1 13,

is called the column C = Cn with base Bn, width w(C) = kt(B,z ), height h(C) = n, levels L i = Ti-1 Bn , 1 < i < n, and top T'' B. Note that the measure of column C is just its width times its height, that is, p,(C) = h(C)w(C). Various quantities of interest can be easily expressed in terms of the return-time picture. For example, the return-time distribution is given by Prob(n(x) =nIxEB) and the expected return time is given by w (Ca )

Em w(C, n

E(n(x)lx E B) =
(13)

h(C,i )Prob(n(x) =nix E B)

h(C)w(C)

m w(Cm )

In the ergodic case the latter takes the form

(14)

E(n(x)ix E B) =

1 Em w(Cm )

since En h(C)w(C) is the measure of the set (12), which equals 1, when T is ergodic. This formula is due to Kac, [23]. The transformation I' = TB defined on B by the formula fx = Tn(x)x is called the transformation induced by T on the subset B. The basic theorem about induced transformations is due to Kakutani.

SECTION I.2. THE ERGODIC THEORY MODEL.


-

25

Theorem 1.2.19 (The induced transformation theorem.) If p(B) > 0 the induced transformation preserves the conditional measure p(.IB) and is ergodic if T is ergodic. Proof If C c B, then x E f- 1C if and only if there is an n > 1 such that x Tnx E C. This translates into the equation
00

Bn and

F-1 C =U(B n n=1

n TAC),

which shows that T-1 C is measurable for any measurable C c B, and also that
/2(fi
n

= E ,u(B,, n T'C),
n=1

00

since Bn n An = 0, if m 0 n, with m, n > 1. More is true, namely, (15)

T n Ba nT ni Bm = 0, m,n > 1, m n,

by the definition of "first return" and the assumption that T is invertible. But this implies 00 E,(T.Bo
n=1

= E p(Bn ) = AO),
n=1
,

co

which, together with (15) yields E eu(B, n T'C) = E iu(Tnii n n c) = u(C). Thus the induced transformation f preserves the conditional measure p.(.1B). The induced transformation is ergodic if the original transformation is ergodic. The picture in Figure 1.2.18 shows why this is so. If f --1 C, then each C n Bn can be pushed upwards along its column to obtain the set

c.

oo n-1
Ti(c n Bn) D=UU n=1 i=0

But p,(T DAD) = 0 and T is invertible, so the measure of D must be 0 or 1, which, in turn, implies that kt(C1B) is 0 or 1, since p,((D n B),LC) = O. This completes the proof of the induced-transformation theorem. El

1.2.c.1 Processes associated with the return-time picture.


Several processes of interest are associated with the induced transformation and the return-time picture. It will be assumed throughout this discussion that T is an invertible, ergodic transformation on the probability space (X, E, p,), partitioned into a finite partition P = {Pa : a E A}; that B is a set of positive measure; that n(x) = minfn: Tnx E BI, x E B, is the return-time function; and that f is the transformation induced on B by T. Two simple processes are connected to returns to B. The first of these is {R,}, the (T, B)-process defined by the partition B = { B, X B}, with B labeled by 1 and X B labeled by 0, in other words, the binary process defined by

Rn(x) = xB(T n-l x), n?: 1,

26

CHAPTER I. BASIC CONCEPTS.

where xB denotes the indicator function of B. The (T, 5)-process is called the generalized renewal process defined by T and B. This terminology comes from the classical definition of a (stationary) renewal process as a binary process in which the times between occurrences of l's are independent and identically distributed, with a finite expectation, the only difference being that now these times are not required to be i.i.d., but only stationary with finite expected return-time. The process {R,,} is a stationary coding of the (T, 2)-process with time-zero encoder XB, hence it is ergodic. Any ergodic, binary process which is not the all 0 process is, in fact, the generalized renewal process defined by some transformation T and set B; see Exercise 20. The second process connected to returns to B is {Ri } , the (7." , g)-process defined by the (countable) partition

{Bi, B2, },
of B, where B, is the set of points in B whose first return to B occurs at time n. The process {k } is called the return-time process defined by T and B. It takes its values in the positive integers, and has finite expectation given by (14). Also, it is ergodic, since I' is ergodic. Later, it will be shown that any ergodic positive-integer valued process with finite expectation is the return-time process for some transformation and partition; see Theorem 1.2.24. The generalized renewal process {R n } and the return-time process {kJ } are connected, for conditioned on starting in B, the times between successive occurrences of l's in {Rn} are distributed according to the return-time process. In other words, if R 1 = 1 and the sequence {S1 } of successive returns to 1 is defined inductively by setting So = 1 and

= min{n >

Rn = 1},

j > 1,

then the sequence of random variables defined by

(16)

= Si Sf _ 1 , j> 1

has the distribution of the return-time process. Thus, the return-time process is a function of the generalized renewal process. Except for a random shift, the generalized renewal process {R,,} is a function of the return-time process, for given a random sample path R = {k } for the return-time process, a random sample path R = {R n } for the generalized renewal process can be expressed as a concatenation (17) R = ii)(1)w(2)w(3)

of blocks, where, for m > 1, w(m) = 01 _ 1 1, that is, a block of O's of length kn -1, followed by a 1. The initial block tt,(1) is a tail of the block w(1) = 0 /1-1-1 1. The only real problem is how to choose the start position in w(1) in such a way that the process is stationary, in other words how to determine the waiting time r until the first 1 occurs. The start position problem is solved by the return-time picture and the definition of the (T, 2)-process. The (T, 2)-process is defined by selecting x E X at random according to the it-measure, then setting X(x) = a if and only if Tx E Pa . In the ergodic case, the return-time picture is a representation of X, hence selecting x at random is the same as selecting a random point in the return-time picture. But this, in turn, is the same as selecting a column C at random according to column measure

SECTION 1.2. THE ERGODIC THEORY MODEL.

27

then selecting j at random according to the uniform distribution on 11, 2, ... , h (C)}, then selecting x at random in the j-th level of C. In summary,

Theorem 1.2.20 The generalized renewal process {R,} and the return-time process {km } are connected via the successive-returns formula, (16), and the concatenation representation, (17) with initial block th(1) = OT-1 1, where r = R j 1 and j is uniformly distributed on [1,
The distribution of r can be found by noting that

r (x) = jx E
from which it follows that
00

Prob(r = i) =E p.,(T 4-1 n=i

E w (C ) .
ni

The latter sum, however, is equal to Prob(x E B and n(x) > i), which gives the alternative form 1 Prob (r = i) = (18) Prob (n(x) ijx E B) , E(n(x)lx E B) since E(n(x)lx E B) = 1/ p.(B) holds for ergodic processes, by (13). The formula (18) was derived in [11, Section XIII.5] for renewal processes by using generating functions. As the preceding argument shows, it is a very general and quite simple result about ergodic binary processes with finite expected return time. The return-time process keeps track of returns to the base, but may lose information about what is happening outside B. Another process, called the induced process, carries along such information. Let A* be the set all finite sequences drawn from A and define the countable partition 13 = {P: w E A* } of B, where, for w =

E Bk: T i-1 X E

Pao 1 <j

kl.

The (?,)-process is called the process induced by the (T, P)-process on the set B. The relation between the (T, P)-process { X, } and its induced (T, f 5)-process {km } parallels the relationship (17) between the generalized renewal process and the returntime process. Thus, given a random sample path X = {W,} for the induced process, a random sample path x = {X,} for the (T, P)-process can be expressed as a concatenation

(19)

x = tv(1)w(2)w(3) ,

into blocks, where, for m > 1, w(m) = X m , and the initial block th(1) is a tail of the block w(1) = X i . Again, the only real problem is how to choose the start position in w(1) in such a way that the process is stationary, in other words how to determine the length of the tail fi)(1). A generalization of the return-time picture provides the answer. The generalized version of the return-time picture is obtained by further partitioning the columns of the return-time picture, Figure 1.2.18, into subcolumns according to the conditional distribution of (T, P)-names. Thus the column with base Bk is partitioned

28

CHAPTER I. BASIC CONCEPTS.

into subcolumns CL, labeled by k-length names w, so that the subcolumn corresponding to w = a k has width (ct, T - i+ 1 Pa n BO,
,

the ith-level being labeled by a,. Furthermore, by reassembling within each level of a subcolumn, it can again be assumed that T moves points directly upwards one level. Points in the top level of each subcolumn are moved by T to the base B in some manner unspecified by the picture. For example, in Figure 1.2.21, the fact that the third subcolumn over B3, which has levels labeled 1, 1, 0, is twice the width of the second subcolumn, indicates that given that a return occurs in three steps, it is twice as likely to visit 1, 1, 0 as it is to visit 1, 0, 0 in its next three steps.
o o I
0 1 1

o
i

o
i

o .

I I
1

1o 1

I I

I
0

I I 1

I 1 I I

o
0

I I

1Ii

10

I I 1 I

o o

11

B1

B2

B3

B4

Figure 1.2.21 The generalized return-time picture. The start position problem is solved by the generalized return-time picture and the definition of the (T, 2)-process. The (T, 2)-process is defined by selecting x E X at random according to the ii-measure, then setting X, (x) = a if and only if 7' 1 x E Pa . In the ergodic case, the generalized return-time picture is a representation of X, hence selecting x at random is the same as selecting a random point in the return-time picture. But this, in turn, is the same as selecting a column Cu, at random according to column measure p.(Cu,), then selecting j at random according to the uniform distribution on {1, 2, ... , h(w), where h(w) = h(C), then selecting x'at random in the j-th level of C. This proves the following theorem.
Theorem 1.2.22 The process {X,..,} has the stationary concatenation representation (19) in terms of the induced process {54}, if and only if the initial block is th(1) = w(1)i h. (w(l)) , where j is uniformly distributed on [1, h(w(1))].

A special case of the induced process occurs when B is one of the sets of P, say B = Bb. In this case, the induced process outputs the blocks that occur between successive occurrences of the symbol b. In general case, however, knowledge of where the blocking occurs may require knowledge of the entire past and future, or may even be fully hidden.

I.2.c.2 The tower construction.


The return-time construction has an inverse construction, called the tower construction, described as follows. Let T be a measure-preserving mapping on the probability

SECTION 1.2. THE ERGODIC THEORY MODEL.

29

space (X, E, p.), and let f be a measurable mapping from X into the positive integers - be the subset of the product space X x Ar Ar = { 1, 2, ...}, of finite expected value. Let k defined by = {(x , 1 < i f (x)}. A useful picture is obtained by defining Bn = {x: f (x) = n}, and for each n > 1, stacking the sets Bn x {1}, B x {2}, Bn x {n} (20) as a column in order of increasing i, as shown in Figure 1.2.23.

Tx
B3 X

{2}

B2 X

111

B3 X

Bi X {0}

B2 X {0}

B3 X {0}

Figure 1.2.23 The tower transformation.


The measure it is extended to a measure on X, by thinking of each set in (20) as a copy of Bn with measure A(B), then normalizing to get a probability measure, the normalizing factor being

n ( B n ) = E(f), n=1
the expected value of f. In other words, B C X is defined to be measurable if and only if (x, i) E T8} n Bn E E, n > 1, i < n, with its measure given by

Tt(k) =

2-, 0

,u({x:

, E 131 )

n 13,0 .

E (f ) n=1 i=1

A transformation T is obtained by mapping points upwards, with points in the top of each column sent back to the base according to T. The formal definition is

T (x, i) =

+ 1) i < f (x) i = f (x). (T x , 0)

The transformation T is easily shown to preserve the measure Ti and to be invertible (and ergodic) if T is invertible (and ergodic). The transformation T is called the tower extension of T by the (height) function f. The tower extension T induces the original transformation T on the base B = X x {0}. This and other ways in which inducing and tower extensions are inverse operations are explored in the exercises. As an application of the tower construction it will be shown that return-time processes are really just stationary positive-integer valued processes with finite expected value.

30

CHAPTER I. BASIC CONCEPTS.

Theorem 1.2.24 An ergodic, positive-integer valued process {kJ} with E(R 1 ) < oc is the return-time process for some transformation and set. In particular, fkil defines a stationary, binary process via the concatenation-representation (17), with the distribution of the waiting time r until the first I occurs given by (18). Proof Let p, be the two-sided Kolmogorov measure on the space .Ar z of doubly-infinite positive-integer valued sequences, and note that the shift T on Az is invertible and ergodic and the function f (x) =X has finite expectation. Let T be the tower extension of T by the function f and let B = x x {1}. A point (x, 1) returns to B = X x L1} at the point (T x , 1) at time f (x), that is, the return-time process defined by T and B has the same distribution as {k1 }. This proves the theorem.
A stationary process {X,} is regenerative if there is a renewal process {R,} such that the future joint process {(Xn, Ra ):n > t} is independent of the past joint process {(X, n , R,n ): n < t}, given that Rt = 1. The induced transformation and the tower construction can be used to prove the following theorem, see Exercise 24. (Recall that if Y, = f (X n ), for all n, then {Yn } is an instantaneous function of {Xn}.)

Theorem 1.2.25 An ergodic process is regenerative if and only if it is an instantaneous function of an induced process whose return-time process is

I.2.d

Exercises.

1. Complete the proof of Lemma 1.2.1.


2. Let T be an invertible measure-preserving transformation and P = {Pa : a E Al a finite partition of the probability space (X, E, /2). Show that there is a measurable mapping F: X 1- A z which carries p. onto the (two-sided) Kolmogorov measure y of the (T, P)-process, such that F(Tx) = TAF(x), where TA is the shift on A z . (Thus, a partition gives rise to a stationary coding.) 3. Suppose F: A z F--> Bz is a stationary coding which carries p, onto v. Determine a partition P of A z such that y is the Kolmogorov measure of the (TA, P)-process. (Thus, a stationary coding gives rise to a partition.) 4. Let T be a measure-preserving transformation. A nonnegative measurable function f is said to be T-subinvariant, if f(Tx) < f (x). Show that a T-subinvariant function is almost-surely invariant. (Hint: consider g(x) = min{ f (x), N}.) 5. Show that the (T, P)-process can be ergodic even if T is not. (Hint: start with a suitable nonergodic Markov chain and lump states to produce an ergodic chain, then use Exercise 2.) 6. Show that the (T, P)-process can be mixing even if T is only ergodic. 7. Let Zn = ( X (n -1 )N+19 41-1)N+2, , XnN) be the nonoverlapping N-block process, let Wn = (Xn, Xn+19 Xn+N-1) be the overlapping N-block process, and let Yn = X(n-1)N+1 be the N-th term process defined by the (T, P)-process {Xn}. (a) Show that {Zn } is the (TN,

P-IP)-process.

SECTION 1.2. THE ERGODIC THEORY MODEL. (b) Show that {147, } is the (T, v7P -1 P)-process. (c) Show that {Y } is the (T N , 'P)-process.

31

8. Let T be the shift on X = {-1, 1} Z , and let ,u be the product measure on X defined by p,(-1) = p,(1) = 1/2. Let Y=Xx X, with the product measure y = p, x and define S(x, y) = (Tx, Txoy).
(a) Show that S preserves v. (b) Let P be the time-zero partition of X and let Q = P x P. Give a formula expressing the (S, Q)-name of (x, y) in terms of the coordinates of x and y. (The (S, Q)-process is called "random walk with random scenery.")

9. Generalize the preceding exercise by showing that if T is a measure-preserving transformation on a probability space (X, p,) and {Tx : x c X) is a family of measure-preserving transformations on a probability space (Y, y), such that the mapping (x, y) i- (Tx, T ry) is measurable, then S(x , y) = (Tx, Try) is a measure-preserving mapping on the product space (X x Y, x y). S is called a skew product. If T, = R, for every x, then S is called the direct product of T
and R.

10. Show that a concatenated-block process is ergodic. 11. Show that the overlapping n-blocking of an ergodic process is ergodic. 12. Show that the nonoverlapping n-blocking of an ergodic process may fail to be ergodic. 13. Insertion of random spacers between blocks can be used to change a concatenatedblock process into a mixing process. Fix a measure p, on AN, a symbol a E A N , and number p E (0, 1). Let {Z,n } be the stationary Markov chain with alphabet S = A N x {0, 1, 2, ... , N}, defined by the following transition rules.
(i) If i < N, (ar , i) can only go to (ar , i

1).

(ii) (ar,, N) goes to (br , 0) with probability p ite(br ), for each br E AN. (iii) (ar , N) goes to (br , 1) with probability (1 p),u(br), for each br E AN.
(a) Show that there is indeed a unique stationary Markov chain with these properties. (Hint: the transition matrix is irreducible.) (b) Show that the process {} defined by setting 4, = a if I'm = (ar , 0) and Ym = ai , if Y, = (ar , j), j > 0, is a mixing finite-state process.

14. Show that if T is mixing then it is totally ergodic. 15. Show that an ergodic rotation is not mixing, but is totally ergodic. 16. Show that if T is mixing then so is the direct product T x T: X x X where X x X is given the measure ,u x ,u.
X x X,

17. Show directly, that is, without using Proposition 6, that if a is irrational then the direct product (x, y) 1- (x a, y a) is not ergodic.

32

CHAPTER I. BASIC CONCEPTS.

18. Prove that even if T is not invertible, the induced transformation is measurable, preserves the conditional measure, and is ergodic if T is ergodic. (Hint: obtain a picture like Figure 1.2.18 for T -1 , then use this to guide a proof.) 19. Prove that a stationary coding of a mixing process is mixing. (Hint: use the formula F-1 (C n T ' D) = F-1 C n TA- i F -1 D.) 20. Show that if {X,} is a binary, ergodic process which is not identically 0, then it is the generalized renewal process for some transformation T and set B. (Hint: let T be the shift in the two-sided Kolmogorov representation and take B = Ix: xo = 1 1.) 21. Prove: a tower i'' over T induces on the base a transformation isomorphic to T. 22. Show that the tower defined by the induced transformation function n(x) is isomorphic to T.
?

I' and return-time

23. Let T be an invertible measure-preserving transformation and P = {Pa : a E A} a finite partition of the probability space (X, E, pt). Let X be the tower over X defined by f, and let S = T be the tower transformation. Show how to extend P to a partition 2 of X, such that the (T, P)-process is an instantaneous coding of the induced (,?, d)-process. 24. Prove Theorem 1.2.25. 25. Define P = {Pa : a E AI and 2 = {Qb: b Ea,b lit(Pa n Qb) it(Pa)P,(Qb)1 ._ E.
E

BI to be 6-independent if

(a) Show that p and 2 are independent if and only if they are 6-independent for each 6 > 0. (b) Show that if P and 2 are 6-independent then Ea iii(Pa I Qb) 11 (Pa)i < Ag, except for a set of Qb's of total measure at most ,g. (c) Show that if Ea lia(PaiQb) ii(Pa)i < E, except for a set of Qb's of total measure at most 6, then P and 2 are 3E-independent.

SECTION 1.3. THE ERGODIC THEOREM.

33

Section 1.3 The ergodic theorem.


The ergodic theorem extends the strong law of large numbers from i.i.d. processes to the general class of stationary processes. Theorem 1.3.1 (The ergodic theorem.) If T is a measure-preserving transformation on a probability space (X, E, p,) and if f is integrable then the average (1/n) EL, f (T i-1 x) converges almost surely and in L 1 -norm to a T -invariant function f*(x). The ergodic theorem in the almost-sure form presented here is due to G. D. Birkhoff and is often called Birkhoff's ergodic theorem or the individual ergodic theorem. The L'-convergence implies that f f* d 1a = f f d ia, since

f1 n f(T i-l x) d,u = 1- E f f(T i-l x)d,u = f f -E


n

n i=1

by the measure-preserving property of T. Thus if T is ergodic then the limit function f*, since it is T-invariant, must be almost-surely equal to the constant value f f d,u,. In particular, if f is taken to be 0,1-valued and T to be ergodic the following form of the ergodic theorem is obtained. Theorem 1.3.2 (Ergodic theorem: binary ergodic form.) If {X n } is a binary ergodic process then the average (X 1 +X2+. ..+Xn )In converges almost surely to the constant value E(X 1 ).This binary version of the ergodic theorem is sufficient for many results in this book, but the more general version will also be needed and is quite useful in many situations not treated in this book. The proof of the ergodic theorem will be based on a rather simple combinatorial result discussed in the next subsection. The combinatorial idea is not a merely step on the way to the ergodic theorem, however, for it is an important tool in its own right and will be used frequently in later parts of this book.

I.3.a Packings from coverings.


A general technique for extracting "almost packings" of integer intervals from certain kinds of "coverings" of the natural numbers will be described in this subsection. In this discussion intervals are subsets of the natural numbers Ar = {1, 2, ...}, of the form [n, m] = (j E n < j < m). A strong cover C of Af is defined by an integer-valued function n m(n) for which m(n) > n, and consists of all intervals of the form [n, m(n)], n E H. (The word "strong" is used to indicate that every natural number is required to be the left endpoint of a member of the cover.) A strong cover C has a packing property, namely, there is a subcover C' whose members are disjoint. This is a trivial observation; just set C' = {[n,, m(n)]}, where n1 = 1 and n ii = 1 + m(n), i > 1. The finite problem has a different character, for, unless the function m(n) is severely restricted, it may not be possible, even asymptotically, to pack an initial segment [1, K] by disjoint subcollections of a given strong cover C of the natural numbers. If it is only required, however, there be a disjoint subcollection that fills most of [1, K], then a positive and useful result is possible.

34

CHAPTER I. BASIC CONCEPTS.

To motivate the positive result, suppose all the intervals of C have the same length, say L, and apply a sequential greedy algorithm, that is, start from the left and select successive disjoint intervals, stopping when within L of the end. This produces the disjoint collection C' = {[iL +1,(i +1)L]: 0 < i < (K L)IL} which covers all but at most the final L 1 members of [1, K]. In particular, if 6 > 0 is given and K > LIS then all but at most a 6-fraction is covered. The desired positive result is just an extension of this idea to the case when most of the intervals have length bounded by some L <8 K. Let C be a strong cover of the natural numbers N. The interval [1, K] is said to be (L, 6)-strongly-covered by C if

1ln E [1, K]: m(n) n +1 > L}1 < 6.


A collection of subintervals C' of the interval [1, K] is called a (1 6)-packing of [1, K] if the intervals in C' are disjoint and their union has cardinality at least (1 . Lemma 1.3.3 (The packing lemma.) Let C be a strong cover of N and let S > 0 be given. If K > LIS and if [1, K] is (L, 8)-strongly-covered by C, then there is a subcollection C' c C which is a (1 26)packing of [1, K]. Proof By hypothesis K > LIS and (1) iin E K]: m(n) n +1 > L11 < SK.

The construction of the (1 26)-packing will proceed sequentially from left to right, selecting the first interval of length no more than L that is disjoint from the previous selections, stopping when within L of the end of [1, K]. To carry this out rigorously set m(0) = no = 0 and, by induction, define
n i = minfj E [1+ m(n i _i), K L]: m(j) j +1 < L}.

The construction stops after I = I (C, K) steps if m(n i ) > K L or there is no j E[1 m(ni), K L] for which m(j) j + 1 < L. The claim is that C' = {[n i ,m(n i )]: 1 < i < I} is a (1 26)-packing of [1, K]. The intervals are disjoint, by construction, and are contained in [1, K], since m(n i )n i +1 < L, and hence m(n i ) < K. Thus it is only necessary to show that the union 1,1 of the [ni ,m(n i )] has length at least (1 25)K. The interval (K L, K] has length at most L 1 < SK, so that I(K L,K]U]I< 6K. For the interval [1, K L], the definition of the n i implies the following fact. (2) If j E [I, K L] Li then m(j) j + 1 > L.

The (L, 6)-strong-cover assumption, (1), thus guarantees that 1[1, K L]-1,ll< 6K. This completes the proof of the packing lemma.

SECTION 1.3. THE ERGODIC THEOREM.

35

Remark 1.3.4
In most applications of the packing lemma all that matters is the result, but in the proof of the general ergodic theorem use will be made of the explicit construction given in the proof of Lemma 1.3.3, including the description (2) of those indices that are neither within L of the end of the interval nor are contained in sets of the packing. Some extensions of the packing lemma will be mentioned in the exercises.

I.3.b

The binary, ergodic process proof.

A proof of the ergodic theorem for the binary ergodic process case will be given first, as it illustrates the ideas in simplest form. In this case ,u, is invariant and ergodic with respect to the shift T on {0, 1}, and the goal is to show that (3) lirn x i = ,a(1), a.s, n*o n i=1
i n

where A(1) = A{x: x 1 = 1} = E(X1). Suppose (3) is false. Then either the limit superior of the averages is too large on a set of positive measure or the limit inferior is too small on a set of positive measure. Without loss of generality the first of these possibilities can be assumed, and hence there is an E > 0 such that the set
{n

B = x : lirn sup E xi > ,u, ( 1 ) n-+oo n i=i has positive measure. Since
lim sup
n>oo

+E

xi +

+ xn

= =sup
n>oo

x2 + .. + Xn+1

the set B is T-invariant and therefore p.(B) = 1. (Here is where the ergodicity assumption is used.) Now suppose x E B. Since B is T-invariant, for each integer n there will be a first integer m(n) > n such that
X n Xn+1 + + Xm(n)

m(n) n + 1

> p,(1) + E.

Thus the collection C(x) = lln, m(n)]: n E JVI is a (random) strong cover of the natural numbers .A'. Furthermore, each interval in C(x) has the property that the average of the xi over that interval is too big by a fixed amount. These intervals overlap, however, so it is not easy to see what happens to the average over a fixed, large interval. The packing lemma can be used to reduce the problem to a nonoverlapping interval problem, for, combined with a simple observation about almost-surely finite variables, it will imply that if K is large enough then with high probability, most of the terms xi , i < K are contained in disjoint intervals over which the average is too big. But, if the average over most such disjoint blocks is too big, then the average over the entire interval [1, K] must also be too big by a somewhat smaller fixed amount. Since this occurs with high probability it implies that the expected value of the average over the entire set A K must be too big, which is a contradiction.

36

CHAPTER I. BASIC CONCEPTS.

A bit of preparation is needed before the packing lemma can be applied. First, since the random variable m(1) is almost-surely finite, it is bounded except for a set of small probability. Thus given 3 > 0 there is a number L such that if D = {x: m(1) > L}, then ,u(D) <62. Second, note that the function g K(x) =
i=1

IC

where x denotes the indicator function, has integral ,u(D) < 6 2 , since T is measure preserving. The Markov inequality implies that for each K, the set GK = {X: gic(x) < 3} has measure at least 1 - 8. The definitions of D and GK imply that if x E GK then C(x) is an (L, 3)-strongcovering of [1,K]. Thus the packing lemma implies that if K > LI 8 and x E GK then there is a subcollection Cf (x) = {[ni,m(ni)]: i < I (x)} C C(X) which is a (1 - 23)-packing of [1, K]. Since the xi. are nonnegative, by assumption, and since the intervals in C'(x) are disjoint, the sum over the intervals in C'(x) lower bounds the sum over the entire interval, that is,
1(x) m(n)

(4) j=1

x; >
i=1 j_n 1

(l

_ 28)K [p,(1) + c].

Note that while the collection C(x) and subcollection C'(x) both depend on x, the lower bound is independent of x, as long as x E GK. Thus, taking expected values yields
(K E iE =1 x

> (1 - 26)K ku(1) + El u(GK) > (1 - 26)(1 (5)K [p.(1)+ c] ,

so that
1 K p. (1) = E (Ex .) > (1 - 23)(1 6)[,a(1) + c], J=1 which cannot be true for all S. This completes the proof of (3), and thereby the proof of Theorem 1.3.2, the binary, ergodic process form of the ergodic theorem.

I.3.c

The proof in the general case.

The preceding argument will now be extended to obtain the general ergodic theorem. The following lemma generalizes the essence of what was actually proved; it will be used to obtain the general theorem. Lemma 1.3.5 Let T be a measure-preserving transformation on (X, E, ,a), let f let a be an arbitrary real number, and define the set 1 n B = {x: lim sup - E f (T i-1 x) > al. n i=i
Then fB f (x) d(x) > a g(B).
E

L i (X, E, ,u),

SECTION 1.3. THE ERGODIC THEOREM.

37

Proof Note that in the special case where the process is ergodic, where f is the indicator function of a set C and a = p(C) c, this lemma is essentially what was just proved. The same argument can be used to show that Lemma 1.3.5 is true in the ergodic case for bounded functions. Only a bit more is needed to handle the general case, where f is allowed to be unbounded, though integrable. The lemma is clearly true if p(B) = 0, so it can be assumed that ,i(B) > O. The set B is T-invariant and the restriction of T to it preserves the conditional measure ,u(. 1B). Thus, it is enough to prove

fB f (x) c 1 ii.(x1B) > a.


For x E B and n E A1 there is a least integer m(n) > n such that (5) Em i= (n n ) f (T i-1 m(n)n+1 > a.

Since B is T-invariant, the collection C(x) = fin, m(n)]: n E AO is a (random) strong cover of the natural numbers Js/. As before, given 3 > 0 there is an L such that if D=
E B: m(1) > L }

then ,u(D1B) <82 and for every K the set 1 G K = {X E B: 1

-7

i=1

E xD(Ti-ix)

6}

has conditional measure at least 1 B. Furthermore, if K > LIB then the packing lemma can be applied and the argument used to prove (4) yields a lower bound of the form

E f( Ti-lx )
j=1 (6) where R(K , x) = >

R(K , x) +

E E f (T i-1 x)
i=1 j=ni

1(x) m(ni)

R(K , + (1 28)K a,

E
jE[1,1qU[n,,m(n,)]

f (T x)

is the sum of f(Ti -l x) over the indices that do not belong to the [ne,m(n i )] intervals. (Keep in mind that the collection {[ni,m(ni)]} depends on x.) In the bounded case, say 1f(x)1 < M, the sum R(K , x) is bounded from below by 2M KS and the earlier argument produces the desired conclusion. In the unbounded case, a bit more care is needed to control the effect of R(K, x) as well as the effect of integrating over GK in place of X. An integration, together with the lower bound (6), does give
1 v-IC s

f (x) dia(x1B) = (7) a'

i=i
II -I- 12 + 13,

(1 8)a A(G KIB)

38 where

CHAPTER I. BASIC CONCEPTS.

K fBGic j=1
12 =

f(Ti -l x) dp,(x1B),

1 ___
K

f (Ti -1 x) c 1 ii,(x1B), E GK jE[1,K L] U[ni,M(ri,)]


GK

jE(KL,K]U[ni,m(n,)]

13= (T i-1 x) d,u(xIB). f

The measure-preserving property of T gives

f3-Gic f (T ./ x) clitt(xIB) = fT_J(B-GK) f(x)


Since f E L I and since all the sets T -i(B GK) have the same measure, which is upper bounded by 8, all of the terms in 11, and hence 11 itself, will be small if 8 is small enough. The integral 13 is also easy to bound for it satisfies
1

dp,(xiB),

1131 15- k- B jE(KL,K]

and hence the measure-preserving property of T yields


1131

1 fl(x) dp,(xim,

which is small if K is large enough, for any fixed L. To bound 12, recall from the proof of the packing lemma, see (2), that if j E [1, K L] m(ni)] then m(j) j 1 > L. In the current setting this translates into the statement that if j E [1, K L] U[ni , m(n)] then Ti -l x E D. Thus

1121 5_ -17

1 f vN K

. .B

xD(T' - x)1 f (P -1 x)I dwxiB),

and the measure-preserving property of T yields the bound

1121 5_
which is small if L is large enough. In summary, in the inequality

f D1 f (x)1 clia(x1B).

f(x) d ii(x B) ?. (1

kt(G

+ 11 +

13,

see (7), the final three terms can all be made arbitrarily small, so that passage to the appropriate limit shows that indeed fB f (x) dp,(x1B) > a, which completes the proof of Lemma 1.3.5.

SECTION 1.3. THE ERGODIC THEOREM.

39

Proof of the Birkhoff ergodic theorem: general case.


To prove almost-sure convergence first note that Lemma 1.3.5 remains true for any T-invariant subset of B. Thus if 1 n 1 n f (T i x) C = x: lim inf f (T i x) <a < fi < lim sup n-400 n i=1 n,o0 n i=1
{

then

fi,u(C)<
Jc

f d

which can only be true if ti(C) = O. This proves almost sure convergence. To complete the discussion of the ergodic theorem L' -convergence must be established. To do this define the operator UT on L I by the formula,
UT f (x) = f (Tx), xE X, f EL I .

The operator UT is linear and is an isometry because T is measure preserving, that is, UT f = if Ii where 11f II = f Ifl diu is the L'-norm. Thus the averaging operator
An f

1 E n n i=i T

is linear and is a contraction, that is, IA f II If Il f E L I . Moreover, if f is a bounded function, say I f(x)I < M, almost surely, then A n f (x)I < M, almost surely. Since almost-sure convergence holds, the dominated convergence theorem implies L' convergence of A n f for bounded functions. But if f, g E L I then for all m, n,
A n f Am f 5-

g)11 + ii Ag Amg ii 2 11 f gil + 11Ang A m g iI


g) A ni (f

The L ,-convergence of A, f follows immediately from this inequality, for g can be taken to be a bounded approximation to f. Since, as was just shown, Ag converges in L'-norm, and since the average An f converges almost surely to a limit function f*, it follows that f* E L I and that A n f converges in L'-norm to f*. This completes the proof of the Birkhoff ergodic theorem. 1=1

Remark 1.3.6
Birkhoff's proof was based on a proof of Lemma 1.3.5. Most subsequent proofs of the ergodic theorem reduce it to a stronger result, called the maximal ergodic theorem, whose statement is obtained by replacing lirn sup by sup in Lemma 1.3.5. There are several different proofs of the maximal theorem, including the short, elegant proof due to Garsia that appears in many books on ergodic theory. The proof given here is less elegant, but it is much closer in spirit to the general partitioning, covering, and packing ideas which have been used recently in ergodic and information theory and which will be used frequently in this book. The packing lemma is a simpler version of a packing idea used to establish entropy theorems for amenable groups in [51]; its application to prove the ergodic theorem as given here was developed by this author in 1980 and published much later in [68]. Other proofs that were stimulated by [51] are given in [25, 27]. Proofs of the maximal theorem and Kingman's subadditive ergodic theorem which use ideas from the packing lemma will be sketched in the exercises, as well as von Neumann's elegant functional analytic proof of L 2-convergence. The reader is referred to Krengel's book for historical commentary and numerous other ergodic theorems, [32].

40

CHAPTER I. BASIC CONCEPTS.

I.3.d

Extensions of the packing lemma.

The given proof of the ergodic theorem was based on the purely combinatorial packing lemma. Several later results will be proved using extensions of the packing lemma. Now that the ergodic theorem is available, however, the packing lemma and its relatives can be reformulated in a more convenient way as lemmas about stopping times. A (generalized) stopping time is a measurable function r defined on the probability space (X, E, ,u) with values in the extended natural numbers .Af U ool. Note that the concept of stopping time used here does not require that X be a sequence space, nor that the set {x: r(x) = nl be measurable with respect to the first n coordinates, which is a common requirement in probability theory. The stopping time idea was used implicitly in the proof of the ergodic theorem, for the first time m that rin f (Ti -1 x) > ma is a stopping time, see (5). Suppose it is a stationary measure on A'. A stopping time r is p,-almost-surely finite if ,u ({x: r(x) = o }) = O. An almost-surely finite stopping time r has the property that for almost every x, the stopping time for each shift 7' 1 x is finite. Thus almost-sure finiteness guarantees that, for almost every x, the function r(Tn -l x) is finite for every n. The interval [n,r(Tn- lx)-F n 1] is called the stopping interval for x, starting at n. In particular, for an almost-surely finite stopping time, the collection C = C(x, r) = {[n, t(Tn -1 x) n 1]: n
E

of stopping intervals is, almost surely, a (random) strong cover of the natural numbers

A.
Frequent use will be made of the following almost-sure stopping-time packing result for ergodic processes. Lemma 1.3.7 (The ergodic stopping-time packing lemma.) If r is an almost-surely finite stopping time for an ergodic process t then for any > 0 and almost every x there is an N = N (3, x) such that if n > N then [1,n] is (1 3)-packed by intervals from C(x, r).

Proof Since r (x) is almost-surely finite, it is bounded except for a set of small measure, hence, given 8 > 0, there is an integer L such that
p({x: r(x) > LI) < 6/4. Let D be the set of all x for which r(x) > L. By the ergodic theorem, lim almost surely, so that if

1 n
1=1

x D (T i-1 x) d p(x) = ,u(D) <6/4,

Gn =
then x
E

ix

1 x---11 s
:

2x D (Ti - ix)< 6/ 41

G , eventually almost surely.

SECTION 1.3. THE ERGODIC THEOREM.

41

Suppose x E G and n > 4L/3. The definition of G, implies that r(T -(' -1) x) < L for at least (1 314)n indices i in the interval [1, n] and hence for at least (1 812)n indices i in the interval [1, n L]. But this means that [1, n] is (L, 1 3/2)-stronglycovered by C(x, r). The packing lemma produces a (L, 1 3/2)-packing of [1, n] by a subset of C(x, r). This proves the lemma. Two variations on these general ideas will also be useful later, a partial-packing result that holds in the case when the stopping time is assumed to be finite only on a set of positive measure, and a two-packings variation. These are stated as the next two lemmas, with the proofs left as exercises. For a stopping time r which is not almost-surely infinite, put F(r) = {x: r(x) < ocl, and define C(x, r) to be the set of all intervals [n, r(r-1 n 1], for which Tn -l x E F(r).

Lemma 1.3.8 (The partial-packing lemma.)


Let p, be an ergodic measure on A, let r be a stopping time such that ,u(F(r)) > y > O. For almost every x, there is an N = N (x, r) such that if n > N, then [1, n] has a y -packing C' c C(x, r)

Lemma 1.3.9 (The two-packings lemma.)


Let {[ui, vd: i E I} be a disjoint collection of subintervals of [1, K] of total length aK, and let {[si , ti]: j E J} be another disjoint collection of subintervals of [1, K] of total length 13 K . Suppose each [ui , v,] has length m, and each [s1 , ti ] has length at least M, where M > m13. Let I* be the set of indices i E I such that [ui, yi] meets at least one of the [s i , ti ]. Then the disjoint collection {[s i , ti ]: j E J } U flub i E I*1 has total length at least (a + p - 23)K.

I.3.e

Exercises.

1. A (1 3)-packing C' of [1, n] is said to be separated if there is at least one integer between each interval in C'. Let p, be an ergodic measure on A', and let r be an almost-surely finite stopping time which is almost-surely bounded below by a positive integer M > 1/8. Show that for almost every x, there is an integer N= x) such that if n > N then [1, n] has a separated (1 28)-packing C' c C(x, r). (Hint: define F'(x) = r(x) + 1 and apply the ergodic stopping-time lemma to F.) 2. Show that a separated (1 S)-packing of [1, n] is determined by its complement. Show that this need not be true if some intervals are not separated by at least one integer. 3. Prove the partial-packing lemma. 4. Formulate the partial-packing lemma as a strictly combinatorial lemma about the natural numbers. 5. Prove the two-packings lemma. (Hint: use the fact that the cardinality of J is at most K I M to estimate the total length of those [u, , yi ] that meet the boundary of some [si, ti].)

42

CHAPTER I. BASIC CONCEPTS.

6. Show that if r is an almost-surely finite stopping time for a stationary process p, and G n (r , 8) is the set of all x such that [1, n ] is (1 (5)-packed by intervals from oc. C(x, r), then ii,(Gn (r, 8)) 1, as n 7. Show that if ti is ergodic and if f is a nonnegative measurable, but not integrable f (T i-1 x) converge almost surely to oc. function, then the averages (1/N)

E7

f con8. The mean ergodic theorem of von Neumann asserts that (1/n) EL I verges in L2-norm, for any L 2 function f. Assume T is invertible so that UT is
unitary. (a) Show that the theorem is true for f E F = If E L 2 : UT f = f} and for f E .A4 = (I UT)L 2 . (b) Show that .F+M is dense in L 2 . (Hint: show that its orthogonal complement is 0.) (c) Deduce von Neumann's theorem from the two preceding results. (d) Deduce that L' -convergenceholds, if f
E

LI .

(e) Extend the preceding arguments to the case when it is only assumed that T is measure preserving.

9. Assume T is measure preserving and f is integrable, and for each n, let Un be the set of x's for whichEL -01 f (T i x) > 0. Fix N and let E = Un <N Un . One form of the maximal ergodic theorem asserts that JE f (x) di(x) > 0. The ideas of the following proof are due to R.Jones, [33].
(a) For bounded f, show that
r(x)-1

E i=o

f (T i x)xE(T i x) ?_ 0, for some r(x) <


E Un ,

N.

r(x) = 1, x 0 E.) (b) For bounded f and L > N, show that

(Hint: let r(x) = n, x

E f (T i x)xE(T i x) i=o

(N

fII.

(Hint: sum from 0 to r(x) 1, then from r(x) to r(T) 1, continuing until within N of L.) (c) Show that the theorem holds for bounded functions. (d) Show that the theorem holds for integrable functions. (e) For B = {x: sup E71-01 f (T i x) > a}, show that h f da a,u(B).

10. Assume T is measure preserving and { g,} is a sequence of integrable functions such that gn+,,(x) < g(x) g ni (Tn x). Kingman's subadditive ergodic theorem asserts that gn (x)/ n converges almost surely to an invariant function g(x) > oc. The ideas of the following proof are due to M.Steele, [80].
(a) Prove the theorem under the additional assumption that gn < 0. (Hint: first show that g(x) = liminf g n (x)In is an invariant function, then apply packing.)

SECTION 1.4. FREQUENCIES OF FINITE BLOCKS.

43
(Hint: let

(b) Prove the theorem by reducing it to the case when g < O. g 1 (x) = gni (x) ET -1 gi (T i x ).) (c) Show that if a = inf, f and a = fg d

d,u < oc, then convergence is also in L 1 -norm, +

(d) Show that the same conclusions hold if gn < 0 and gn+8+,n (x) < cc. g,(Tn+gx) + y8 , where y8 0 as g

Section 1.4

Frequencies of finite blocks.

An important consequence of the ergodic theorem for stationary finite alphabet processes is that relative frequencies of overlapping k-blocks in n-blocks converge almost surely as n > oc. In the ergodic case, limiting relative frequencies converge almost surely to the corresponding probabilities, a property characterizing ergodic processes. In the nonergodic case, the existence of limiting frequencies allows a stationary process to be represented as an average of ergodic processes. These and related ideas will be discussed in this section.

I.4.a

Frequencies for ergodic processes.


nk+1 Palic Wiz =

The frequency of the block ail' in the sequence Jet' is defined for n > k by the formula

E
i=1

x fa

(T i-1 x),

where x [4] denotes the indicator function of the cylinder set [an. The frequency can also be expressed in the alternate form

f(4x) = I{i

[1, n k + 1]:

where I I denotes cardinality, that is, f(a14) is obtained by sliding a window of length k along x'it and counting the number of times af is seen in the window. The relative frequency is defined by dividing the frequency by the maximum possible number of occurrences, n k 1, to obtain

Pk(aflx7) =

f (4x)

= I{i E [1, n k + 1]: xi +k-1 = 4}1


nk1

nk+1

If x and k < n are fixed the relative frequency defines a measure pk 14) on A " , called the empirical distribution of overlapping k-blocks, or the (overlapping) k-type of x. When k is understood, the subscript k on pk(ali( 14) may be omitted. The limiting (relative) frequency of di in the infinite sequence x is defined by

(1)

p(ainx) = lim p(ainx),


oo

provided, of course, that this limit exists. A sequence x is said to be (frequency) typical for the process p. if each block cif appears in x with limiting relative frequency equal to 11,(4). Thus, x is typical for tt if for each k the empirical distribution of each k-block converges to its theoretical probability p,(4).

44

CHAPTER I. BASIC CONCEPTS.

Let T(A) denote the set of sequences that are frequency typical for A. A basic result for ergodic processes is that almost every sequence is typical. Thus the entire structure of an ergodic process can be almost surely recovered merely by observing limiting relative frequencies along a single sample path. Theorem 1.4.1 (The typical-sequence theorem.) If A is ergodic then A(T (A)) = 1, that is, for almost every x,

lim p(aIN) =
for all k and all a.

Proof In the ergodic case,


nk+1

p(afixiz) -

n - k +1 E

k (T i X _ ' ul'

' 1

x),

which, by the ergodic theorem, converges almost surely to f x [4] dp, = p(a), for fixed 4. In other words, there is a set B(a) of measure 0, such that p(alicixtiz) -> 1),(4), for x B(a lic), as n -> co. Since there are only a countable number of possible af, the set
00

B=U
k=1 ak i EAk

B(a 1`)
XE

has measure 0 and convergence holds for all k and all al', for all measure 1. This proves the theorem. The converse of the preceding theorem is also true.

- B, a set of [=:]

Theorem 1.4.2 (The typical-sequence converse.) Suppose A is a stationary measure such that for each k and for each block di', the limiting relative frequencies p(ali`Ix) exist and are constant in x, p.-almost surely. Then A is an ergodic measure.

Proof If the limiting relative frequency p(a'nx) exists and is a constant c, with probability 1, then this constant c must equal p,(4), since the ergodic theorem includes L I -convergence of the averages to a function with the same mean. Thus the hypotheses imply that the formula
x--, n
liM n

i=1

x8

holds for almost all x for any cylinder set B. Multiplying each side of this equation by the indicator function xc (x), where C is an arbitrary measurable set, produces
liM
n)00

fl

E X B (T i-l x)X c (x)=


i=1

and an integration then yields

(2)

lim A(T -i B n i=1

E
n

n=

SECTION 1.4. FREQUENCIES OF FINITE BLOCKS.

45

The latter holds for any cylinder set B and measurable set C. The usual approximation argument then shows that formula (2) holds for any measurable sets B and C. But if B = B then the formula gives it(B) = A(B) 2 , so that p(B) must be 0 or 1. This proves Theorem 1.4.2. Note that as part of the preceding proof the following characterizations of ergodicity were established. Theorem 1.4.3 The following are equivalent. (i) T is ergodic. (ii) lima pi,(iii) limn x ,(Ti-lx) = tt(E), a.e., for each E in a generating algebra. A(T -i A n B) = ,u(A)A(B), for each A and B ma generating algebra.

Some finite forms of the typical sequence theorems will be discussed next. Sequences of length n can be partitioned into two classes, G, and B, so that those in Gn will have "good" k-block frequencies and the "bad" set Bn has low probability when n is large. The precise definition of the "good" set, which depends on k and a measure of error, is

(3)

G n (k,

6) =

p(afix) ,L(a)1 < c, a

The "bad" set is the complement, B n (k, 6) = An G n (k, E). The members of the "good" set, G ,(k, c), are often called the frequency-(k, c)-typical sequences, or when k and c are understood, just the typical sequences. Theorem 1.4.4 (The "good set" form of ergodicity.) (a) If ji is ergodic then 4 (b) If lima p.n (G n (k, ergodic.
6))

Gn (k,

E),

eventually almost surely, for every k and c > O.

= 1, for every integer k > 0 and every c > 0, then p, is

Proof Part (a) is just a finite version of the typical-sequence theorem, Theorem 1.4.1. The condition that A n (G n (k, E)) converge to 1 for any E > 0 is just the condition that p(41x7) converges in probability to ii(4). Since p(an.q) is bounded and converges almost surely to some limit, the limit must be almost surely constant, and hence (b) follows from Theorem 1.4.2. It is also useful to have the following finite-sequence form of the typical-sequence characterization of ergodic processes. Theorem 1.4.5 (The finite form of ergodicity.) A measure IL is ergodic if and only if for each block ail' and E > 0 there is an N = N(a, c) such that if n > N then there is a collection C, C An such that tt(Cn) > 1
E.

(ii) If x,

C, then 1/3(41fi') p(aini)I < c.

46

CHAPTER I. BASIC CONCEPTS.

Proof If it is ergodic, just set Cn = G n (k, E), the "good" set defined by (3), and use Theorem 1.4.4. To establish the converse fix a il' and E > 0, then define G n = ix: IP(a inxi) P(a inx)I < EI, and use convergence in probability to select n > N(a, c) such that ,u(an) > 1 Cn satisfies (i) and (ii) put En = {x: 4 E Cn} and note that ,a(a n n e . n ) > 1 2E. Thus, IP(ailx) P(ailZ)I 3E, X,
E _ _
n

E.

If

n El.n ,

and hence p(a Ix) is constant almost surely, proving the theorem.

I.4.b

The ergodic theorem and covering properties.

The ergodic theorem is often used to derive covering properties of finite sample paths for ergodic processes. In these applications, a set Bn c An with desired properties is determined in some way, then the ergodic theorem is applied to show that, eventually oc, there are approximately tt(Bn )M indices i E [1, M n] for almost surely as M which x: +n-1 E Bn . This simple consequence of the ergodic theorem can be stated as either a limit result or as an approximation result.

Theorem 1.4.6 (The covering theorem.) If 12, is an ergodic process with alphabet A, and if Bn is a subset of
measure, then liM Pn(BniXi ) M).co
In other words, for any 8 > 0,

An of positive

1-1 (Bn), a.s.

(4)

E [1, M n]:

E 13n}1 ML(13n)1 <8M,

eventually almost surely. Proof Apply the ergodic theorem to the indicator function of Bn . In many situations the set 13, is chosen to have large measure, in which case the conclusion (4) is expressed, informally, by saying that, eventually almost surely, xr is "mostly strongly covered" by blocks from Bn . A sequence xr is said to be (1 8)strongly-covered by B n if I {i E [1, M n that is, if pn (B n iXr) < b.

1]: xn-1 B n }i < 8M,

Theorem 1.4.7 (The almost-covering principle.) If kt is an ergodic process with alphabet A, and if 8, C An, satisfies A(13 n ) > 1 then xr is eventually almost surely (1 (5)-strongly-covered by r3n.

SECTION 1.4. FREQUENCIES OF FINITE BLOCKS.

47

The almost strong-covering idea and a related almost-packing idea are discussed in more detail in Section L7. The following example illustrates one simple use of almost strong-covering. Note that the ergodic theorem is used twice in the example, first to select the set 13n , and then to obtain almost strong-covering by 13n . Such "doubling" is common in applications. Example 1.4.8 Let p. be an ergodic process and fix a symbol a E A of positive probability. Let F (xr) and L(xr) denote the first and last occurrences of a E Xr, that is,

F(xr) = min{i E [1, M]: x i = a} L(x) = max{i E [1, M]: x i = a},


with the convention F(41) = L(41 ) = 0, if no a appears in xr. The almost-covering idea will be used to show that

(5)
To prove this, let

lim kr,00

L(41 ) F(41 )

= 1, almost surely.

Bn = {a7: ai = a, for some i

[1, nil.

Given 8 > 0, choose n so large that A(B n ) > 1 3. Such an n exists by the ergodic theorem. The almost-covering principle, Theorem 1.4.7, implies that xr is eventually almost surely (1 3)-strongly-covered by Bn . Note, however, that if xr is (1 6)strongly-covered by 13n , then at least one member of 13n must start within 3M of the beginning of xr and at least one member of 1:3n must start within 8M of the end of xr. But if this is so and M > n/3, then F(41 ) < 23M and L(xr) > (1 3)M, so that L(xr) F(41) > (1 3 8 )M. Since 3 is arbitrary, the desired result, (5), follows. It follows easily from (5), together with the ergodic theorem, that the expected waiting time between occurrences of a along a sample path is, in the limit, almost surely equal to 1/p,(a), see Exercise 2. This result also follows easily from the return-time picture discussed in Section I.2.c.

I.4.c

The ergodic decomposition.

In the stationary nonergodic case, limiting relative frequencies of all orders exist almost surely, but the limit measure varies from sequence to sequence. The sequences that produce the same limit measure can be grouped into classes and the process measure can then be represented as an average of the limit measures, almost all of which will, in fact, be ergodic. These ideas will be made precise in this section. The set of all x for which the limit

p(ainx) = lim p(44)


exists for all k and all di will be denoted by E. Thus is the set of all sequences for which all limiting relative frequencies exist for all possible blocks of all possible sizes. The set is shift invariant and any shift-invariant measure has its support in This support result, which is a simple consequence of the ergodic theorem, is stated as the following theorem.

48
Theorem 1.4.9 If it is a shift-invariant measure then A ( C) = 1.

CHAPTER I. BASIC CONCEPTS.

Proof Since
p (all' 14) = n ..... k
1

i=1 the ergodic theorem implies that p(aflx) = limn p(aii`ix) exists almost surely, for each k and 4. Since there are only a countable number of the al` , the complement of ,C must El have measure 0 with respect to 1.1,, which establishes the theorem.
Theorem 1.4.9 can be strengthened to assert that a shift-invariant measure kt must actually be supported by the sequences for which the limiting frequencies not only exist but define ergodic measures. To make this assertion precise first note that the ergodic theorem gives the formula (6) ,u(alic) = f p(af Ix) dp,(x).

i E x [ ak,,(Ti. ix),

n k+ 1

The frequencies p(41.)c) can be thought of as measures, for if x E

r then the formula

,u,x (a) = p(ainx), 4 E Ak


defines a measure itx on sequences of length k, for each k, which can be extended, by the Kolmogorov consistency theorem, to a Borel measure i.tx on A. The measure ,ux will be called the (limiting) empirical measure determined by x. Two sequences x and y are said to be frequency-equivalent if /ix = A y . Let n denote the projection onto frequency equivalence classes. The projection x is measurable and transforms the measure ,u onto a measure w = ,u o x -1 on equivalence classes. Next define /27(x) = ktx, which is well-defined on the frequency equivalence class of x. Formula (6) can then be expressed in the form (7) ,u(B) = f 1.1.,(x )(B) dco(7(x)), B E E.

This formula indeed holds for any cylinder set, by (6), and extends by countable additivity to any Borel set B. Let E denote the set of all sequences x E L for which the measure ti,x is ergodic. Of course, each frequency-equivalence class is either entirely contained in E or disjoint from E. Theorem 1.4.9 may now be expressed in the following much stronger form. Theorem 1.4.10 ( The ergodic decomposition theorem.) If p. is shift-invariant then p,(E) = 1 and hence the representation (7) takes the form, e In other words, a shift-invariant measure always has its support in the set of sequences whose empirical measures are ergodic. The ergodic measures, ,u, (x), x E E, are called the ergodic components of kt and the formula represents p, as an average of its ergodic components. An excellent discussion of ergodic components and their interpretation in communications theory can be found in [ 1 9 ]. The usual measure theory argument, together with the finite form of ergodicity, Theorem 1.4.5, shows that to prove the ergodic decomposition theorem it is enough to prove the following lemma.

(B)= f p.,( x )(B) da)(7r(x)), BE E.

SECTION 1.4. FREQUENCIES OF FINITE BLOCKS.

49

Lemma 1.4.11 Given c > 0 and 4 there is a set X1 with ii,(Xi) > 1 E, such that for any x E X1 there is an N such that if n > N there is a collection C, c A n , with /L(C) > 1 c, such that the frequency of occurrence of 4 in any two members of C, differs by no more than E. Proof Fix E > 0, fix a block 4, and fix a sequence y E L, the set of sequences with limiting frequencies of all orders. Let L i = Li(y) denote the set of all x E L such that

(8)

ip(a'nx) p(41y)l < E/4.

Note, in particular, that the relative frequency of occurrence of al in any two members of Li differ by no more than E/2. oo, hence For each x E Li, the sequence {p(al 14)} converges to p(41x), as n there is an integer N and a set L2 C Li of measure at least (1 E2 )A(L1) such that

(9)

IP(414) p(alx)I

< E/4, x E L2, n > N.

Fix n > N and put Cn = {x: x E L2 }. The conditions (9) and (8) guarantee that the relative frequency of occurrence of 4 in any two members x7, itiz of Cn differ by no more E. The set L1 is invariant since p(ainx) = p(4`17' x), for any x E L. Thus the conditional measure it (. ILi) is shift-invariant and

1 ,u(xilLi) = A(CI)
In particular,
tt(CnICI) = ti

Li

P(xilz) dtt(z), x' iz E An.

1 (L1) L1 P(Crilz) d iu(z) > 1 6 2

so the Markov inequality yields a set L3 C L2 such that /4L3) > (1 E)/L(L 1 ) and for which

.q.c.

E p(xz) ?. 1 - E, Z E r3,

which translates into the statement that p(C) > 1 E, z E L3. The unit interval [0, 1] is bounded so L can be covered by a finite number of the sets Li(y) and hence the lemma is proved. This completes the proof of the ergodic decomposition theorem, Theorem 1.4.10. 0 Remark 1.4.12 The ergodic decomposition can also be viewed as an application of the ChoquetBishop-deLeeuw theorem of functional analysis, [57]. Indeed, the set Ps of shift-invariant probability measures is compact in the weak*-topology (obtained by thinking of continuous functions as linear functionals on the space of all measures via the mapping it i f f dit.) The extreme points of Ps correspond exactly to the ergodic measures, see Exercise 1 of Section 1.9.

50

CHAPTER I. BASIC CONCEPTS.

I.4.d

Exercises

1. Let qk 14) be the empirical distribution of nonoverlapping k-blocks in 4, that is,

qk (alnx;1 ) = I{/ E [0, t):


where n = tk r, 0 < r < k.

(a) Show that if T is totally ergodic then qk (ali`14)

,u(x lic), almost surely.

(b) Show that the preceding result is not true if T is not totally ergodic.

2. Combine (5) with the ergodic theorem to establish that for an ergodic process the expected waiting time between occurrences of a symbol a E A is just 111,t(a). (Hint: show that the average time between occurrences of a in 4 is close to 1/pi (a14).) 3. Let {X: n > 1} be a stationary process. The reversed process {Yn : n > 1} is the , P)-process, where P is the Kolmogorov partition in the two-sided representation of {X,: n > 1}. Show that {Xn } is ergodic if and only its reversed process is ergodic. (Hint: combine Exercise 2 with Theorem 1.4.5.) 4. Use the central limit theorem to show that the "random walk with random scenery" process is ergodic. (Hint: with high probability (x k , z, m+k ) and (4, zt) are looking at different parts of y. See Exercise 8 in Section 1.2 for a definition of this process.) 5. Let P be the time-0 partition for an ergodic Markov chain with period d = 3. (a) Determine the ergodic components of the (T 3 , P)-process. (b) Determine the ergodic components of the (T 3 , P V TP y T 2P)-process. 6. Show that an ergodic finite-state process is a function of an ergodic Markov chain. (It is an open question whether a mixing finite-state process is a function of a mixing Markov chain.) 7. Let T be an ergodic rotation of the circle and P the two-set partition of the unit circle into upper and lower halves. Describe the ergodic components of the (T x T ,P x P)-process. 8. This exercise illustrates a direct method for transporting a sample-path theorem from an ergodic process to a sample-path theorem for a (nonstationary) function of the process. " be the set of all x E [a] such that Tnx E [a], for infinitely Fix a E A and let Aa R(x) = {R 1 (X), R2(x), .1 by many n. Define the mapping x

(x) = min{m > 1: x,n = a} 1


R1 (x) =

minfm >

(x): x m = a}

E Rico
i=1

(This is just the return-time process associated with the set B = [a].) A measure on Aa " transports to the measure y = ,u o R -1 on Aroo. Assume p. is the conditional measure on [a] defined by an ergodic process. Show that

SECTION 1.5. THE ENTROPY THEOREM. (a) The mapping x


R(X) is Borel.

51

(b) R(T R ' (x ) x) = SR(x), where S is the shift on

(e) If T is the set of the set of frequency-typical sequences for j, then v(R(T)) = 1. (Hint: use the Borel mapping lemma, Lemma 1.1.17.)
(d) (1/n) En i R i - 11 p,(a), v-almost surely. (Hint: use the preceding result.)

Section 1.5 The entropy theorem.


For stationary processes, the measure p.(4) is nonincreasing in n, and, except for some interesting cases, has limit 0; see Exercise 4c. The entropy theorem asserts that, for ergodic processes, the decrease is almost surely exponential in n, with a constant exponential rate called the entropy or entropy-rate, of the process, and denoted by h =
h(t).

Theorem 1.5.1 (The entropy theorem.) Let it be an ergodic measure for the shift T on the space A', where A is finite. There is a nonnegative number h = h(A) such that

n oo n

1 lim log

,u,(4)

= h, almost surely.

In the theorem and henceforth, log means the base 2 logarithm, and the natural logarithm will be denoted by ln. The proof of the entropy theorem will use the packing lemma, some counting arguments, and the concept of the entropy of a finite distribution. The entropy of a probability distribution 7(a) on a finite set A is defined by
H(z) =

E 71- (a) log 7(a).


aEA

Let

1 1 h n (x) = log n ti,(4)

1 log p,(4), n

so that if A n is the measure on An defined by tt then H(t) = nE(hn (x)). The next three lemmas contain the facts about the entropy function and its connection to counting that will be needed to prove the entropy theorem.
Lemma 1.5.2 The entropy function H(7) is concave in 7 and attains its maximum value log I AI only for the uniform distribution, z(a) 111AI.
Proof An elementary calculus exercise.

Lemma 1.5.3 E(17,(x)) _< log I AI.


Proof Since 1/(.4) = nE(h n (x)), the lemma follows from the preceding lemma.

CI

52

CHAPTER I. BASIC CONCEPTS.

Lemma 1.5.4 (The combinations bound.)


(n ) denotes the number of combinations of n objects taken k at a time and k 3 < 1/2 then n 2nH(3) , k)
If

k<n6(

where H(3) = 3 log3 (1 3) log(1 3). Proof The function q log3 (1 q) log(1 8) is increasing in the interval 0 < q < 3

and hence

2-nH(S) < 5k(1 _ n ) and summing gives k

on-k , k

< ms.

Multiplying by (
2-nH(8) E k<nS

n k

) 'E

k<n3 (

n k

) (5k 0 _ on-k

n ) ic (1 _ 5 k

on-k = 19

which proves the lemma.

The function H(S) = 3 log 8(1S) log(1-3) is called the binary entropy function. It is just the entropy of the binary distribution with ti(1) = S. It can be shown, see [7], that the binary entropy function is the correct (asymptotic) exponent in the number of binary sequences of length n that have no more than Sn ones, that is I lim log n-,00 n k<rz b

E( n ) .= H (3) k

I.5.a

The proof of the entropy theorem.


h(x) = lim inf hn (x),
n- Co

Define where h(x) = (1/n) log it(x'lz). The first task is to show that h(x) is constant, almost surely. To establish this note that if y = Tx then
[Yi ] = [X3 +1 ] j

so that a(y) > It(x

1 ) and hence h(Tx) < h(x),

that is, h(x) is subinvariant. Thus h(Tx) = h(x), almost surely, since subinvariant functions are almost surely invariant, see Exercise 4, Section 1.2. Since T is assumed to be ergodic the invariant function h(x) must indeed be constant, almost surely. Define h to be the constant such that
h = lim inf hn (x),
n-).00

almost surely.

The goal is to show that h must be equal to lim sup, hn (x), almost surely. Towards this end, let c be a positive number. The definition of limit inferior implies that for almost

SECTION 1.5. THE ENTROPY THEOREM.

53

all x the inequality h(x) < h + E holds for infinitely many values of n, a fact expressed in exponential form as (1) A(4) >

infinitely often, almost surely. The goal is to show that "infinitely often, almost surely" can be replaced by "eventually, almost surely," for a suitable multiple of E.

Three ideas are used to complete the proof. The first idea is a packing idea. Eventually almost surely, most of a sample path is filled by disjoint blocks of varying lengths for each of which the inequality (1) holds. This is a simple application of the ergodic stopping-time packing lemma. The second idea is a counting idea. The set of sample paths of length K that can be mostly filled by disjoint, long subblocks for which (1) holds cannot have cardinality exponentially much larger than 2 /"+) , for large enough K. Indeed, if these subblocks are to be long, then there are not too many ways to specify their locations, and if they mostly fill, then once their locations are specified, there are not too many ways the parts outside the long blocks can be filled. Most important of all is that once locations for these subblocks are specified, then since a location of length n can be filled in at most 2n(h+E) ways if (1) is to hold, there will be a total of at most 2 K(h+E ) ways to fill all the locations. The third idea is a probability idea. If a set of K-length sample paths has cardinality only a bit more than 2K (h+E) , then it is very unlikely that a sample path in the set has probability exponentially much smaller than 2 K (h+E ) . This is just an application of the fact that upper bounds on cardinality "almost" imply lower bounds on probability, Lemma I.1.18(b).

To fill in the details of the first idea, let S be a positive number and M > 1/B an integer, both to be specified later. For each K > M, let GK(S, M) be the set of all that are (1 23)-packed by disjoint blocks of length at least M for which the inequality (1) holds. In other words, xr E G K (8 , M) if and only if there is a collection

xr

S = S(xr) = {[n i , m i [}
of disjoint subintervals of [1, K] with the following properties.

(a) m

n 1 +1 > M, [ni, m i l E S.

(b) p(x) >

ni+1)(h+c),

[n,, m,] E S.

(c) E [ ,..,,Es (mi n, + 1) > (1 28)K.

An application of the packing lemma produces the following result.

54

CHAPTER I. BASIC CONCEPTS.

Lemma 1.5.5

GK

(3, M), eventually almost surely.

Proof Define T. (x) to be the first time n > M such that p,([xrii]) >n(h+E) Since r is measurable and (1) holds infinitely often, almost surely, r is an almost surely finite stopping time. An application of the ergodic stopping-time lemma, Lemma 1.3.7, then yields the lemma. LI
The second idea, the counting idea, is expressed as the following lemma.

Lemma 1.5.6 There is a 3 > 0 and an M> 116 such that IG large K.

, M)1 <2K (h+26) ,

for all sufficiently

Proof A collection S = {[ni , m i ll of disjoint subintervals of [1, K], will be called a skeleton if it satisfies the requirement that mi ni 1 > M, for each i, and if it covers all but a 26-fraction of [1, K], that is,
(2)

E (mi _ ni 1) >_ (1 23)K.


[n,,m,JES

A sequence xr is said to be compatible with such a skeleton S if 7i ) > 2-(m, --ni+i)(h+e) kt(x n for each i. The bound of the lemma will be obtained by first upper bounding the number of possible skeletons, then upper bounding the number of sequences xr that are compatible with a given skeleton. The product of these two numbers is an upper bound for the cardinality of G K (6, M) and a suitable choice of 6 will then establish the lemma. First note that the requirement that each member of a skeleton S have length at least M, means that Si < KIM, and hence there are at most KIM ways to choose the starting points of the intervals in S. Thus the number of possible skeletons is upper bounded by (3)

K k

H(1/111)

k<KIM\

where the upper bound is provided by Lemma 1.5.4, with H() denoting the binary entropy function. Fix a skeleton S = {[ni,mi]}. The condition that .)C(C be compatible with S means that the compatibility condition (4) it(x,7 1 ) >

must hold for each [ni , m i ] E S. For a given [ni , m i ] the number of ways xn m , can be chosen so that the compatibility condition (4) holds, is upper bounded by 2(m1n's +1)(h+E) , by the principle that lower bounds on probability imply upper bounds on cardinality, Lemma I.1.18(a). Thus, the number of ways x j can be chosen so that j E Ui [n and so that the compatibility conditions hold is upper bounded by
nK h-1-)
L

(fl

SECTION 1.5. THE ENTROPY THEOREM.

55

Outside the union of the [ni , mi] there are no conditions on xi . Since, however, there are fewer than 28K such j these positions can be filled in at most IA 1 26K ways. Thus, there are at most
IAl23K 2K (h+E)

sequences compatible with a given skeleton S = {[n i , m i ]}. Combining this with the bound, (3), on the number of possible skeletons yields

(5)

1G K(s, m)i

< 2. 1(- H(1lm) oi ncs 2 K(h+E)

Since the binary entropy function H (1 M) approaches 0 as M 00 and since IA I is finite, the numbers 8 > 0 and M > 1/8 can indeed be chosen so that IGK (8, M) < 0 2K(1 +2' ) , for all sufficiently large K. This completes the proof of Lemma 1.5.6. Fix 6 > 0 and M > 1/6 for which Lemma 1.5.6 holds, put GK = Gl{(8, M), and let BK be the set of all xr for which ,u(x(c ) < 2 -1C(h+36) . Then tt(Bic

n GK) < IGKI 2K(h+3E)

5 2ICE

holds for all sufficiently large K. Thus, xr g BK n GK, eventually almost surely, by the Borel-Cantelli principle. Since xt E GK, eventually almost surely, the iterated almost-sure principle, Lemma 1.1.15, implies that xr g BK, eventually almost surely, that is,

lim sup h K (x) < h + 3e, a.s.


K>oo

In summary, for each e > 0, h = lim inf h K (x) < lim sup h K (x) < h 3e, a.s.,
K*co

which completes the proof of the entropy theorem, since c is arbitrary.

Remark 1.5.7 The entropy theorem was first proved for Markov processes by Shannon, with convergence in probability established by McMillan and almost-sure convergence later obtained by Breiman, see [4] for references to these results. In information theory the entropy theorem is called the asymptotic equipartition property, or AEP. In ergodic theory it has been traditionally known as the Shannon-McMillan-Breiman theorem. The more descriptive name "entropy theorem" is used in this book. The proof given is due to Ornstein and Weiss, [51], and appeared as part of their extension of ergodic theory ideas to random fields and general amenable group actions. A slight variant of their proof, based on the separated packing idea discussed in Exercise 1, Section 1.3, appeared in [68].

I.5.b Exercises.
1. Prove the entropy theorem for the i.i.d. case by using the product formula on then taking the logarithm and using the strong law of large numbers. This yields the formula h = Ea ,u(a) log it(a). 2. Use the idea suggested by the preceding exercise to prove the entropy theorem for ergodic Markov chains. What does it give for the value of h?

56

CHAPTER I. BASIC CONCEPTS.


3. Suppose for each k, Tk is a subset of Ac of cardinality at most 2k". A sequence 4 is said to be (K, 8, {T})-packed if it can be expressed as the concatenation 4 = w(1) . w(t), such that the sum of the lengths of the w(i) which belong to Ur_K 'Tk is at least (1 8)n. Let G be the set of all (K, 8, {T})-packed sequences 4 and let E be positive number. Show that if K is large enough, if S is small enough, and if n is large enough relative to K and S, then IG n I < 2n(a+E) . 4. Assume p is ergodic and define c(x) = lima

,44), x

A.

(a) Show that c(x) is almost surely a constant c. (b) Show that if c > 0 then p. is concentrated on a finite set. (c) Show that if p. is mixing then c(x) = 0 for every x.

Section 1.6

Entropy as expected value.

Entropy for ergodic processes, as defined by the entropy theorem, is given by the almost-sure limit 1 1 log h= n>oo n Entropy can also be thought of as the limit of the expected value of the random quantity (1/n) log A(4). The expected value formulation of entropy will be developed in this section.

I.6.a

The entropy of a random variable.

Let X be a finite-valued random variable with distribution defined by p(x) = X E A. The entropy of X is defined as the expected value of the random variable log p(X), that is,

Prob(X = x),

H(X) =

xEA

E p(x) log -1-= p(x)

p(x) log p(x). xEA

The logarithm base is 2 and the conventions Olog 0 = 0 and log0 = oo are used. If p is the distribution of X, then H(p) may be used in place of H(X). For a pair (X, Y) of random variables with a joint distribution p(x, y) = Prob(X = x, Y = y), the notation H(X,Y)-= p(x , y) log p(x, y) will be used, a notation which extends to random vectors. Most of the useful properties of entropy depend on the concavity of the logarithm function. One way to organize the concavity idea is expressed as follows.

Lemma 1.6.1 If p and q are probability k-vectors then

_E pi log p, <
with equality if and only if p = q.

pi log qi ,

SECTION 1.6. ENTROPY AS EXPECTED VALUE.

57

Proof The natural logarithm is strictly concave so that, ln x < x 1, with equality if and only if x = 0. Thus
qj

= 0,

with equality if and only if qi = pi , log x = (In x)/(1n 2).

1 < i 5_ k.

This proves the lemma, since

E,

The proof only requires that q be a sub-probability vector, that is nonnegative with qi 5_ 1. The sum

Dcplio =

pi in

PI
qi

is called the (informational) divergence, or cross-entropy, and the preceding lemma is expressed in the following form.

Lemma 1.6.2 (The divergence inequality.) If p is a probability k-vector and q is a sub-probability k-vector then D(pliq) > 0, with equality if and only if p = q.
A further generalization of the lemma, called the log-sum inequality, is included in the exercises. The basic inequalities for entropy are summarized in the following theorem.

Theorem 1.6.3 (Entropy inequalities.) (a) Positivity. H(X) > 0, with equality if and only if X is constant.

(b) Boundedness. If X has k values then H(X) < log k, with equality if and only if
each p(x) = 11k.

(c) Subadditivity. H(X, Y) < H(X) H(Y), with equality if and only if X and Y
are independent. Proof Positivity is easy to prove, while boundedness is obtained from Lemma 1.6.1 by setting Px = P(x), qx ----- 1/k. To establish subadditivity note that
H(X,Y).

Ep (x, y) log p(x, y),


X.

then replace p(x , y) in the logarithm factor by p(x)p(y), and use Lemma 1.6.1 to obtain the inequality H(X,Y) < H(X)-}- H(Y), with equality if and only if p(x, y) p(x)p(y), that is, if and only if X and Y are independent. This completes the proof of the theorem. El The concept of conditional entropy provides a convenient tool for organizing further results. If p(x , y) is a given joint distribution, with corresponding conditional distribution p(xly) = p(x, y)1 p(y), then H ain

=_

p(x, y) log p(xiy) =

p(x, y) log

p(x, y)

POO

58

CHAPTER I. BASIC CONCEPTS.

is called the conditional entropy of X, given Y. (Note that this is a slight variation on standard probability language which would call Ex p(xIy) log p(xly) the conditional entropy. In information theory, however, the common practice is to take expected values with respect to the marginal, p(y), as is done here.) The key identity for conditional entropy is the following addition law.

(1)

H(X, Y) = H(Y)+ H(X1Y).

This is easily proved using the additive property of the logarithm, log ab = log a+log b. The previous unconditional inequalities extend to conditional entropy as follows. (The proofs are left to the reader.)

Theorem 1.6.4 (Conditional entropy inequalities.) (a) Positivity. H (X IY) > 0, with equality if and only if X is a function of Y. (b) Boundedness. H(XIY) < H (X) with equality if and only if X and Y are independent. (c) Subadditivity. H((X, Y)IZ) < H(XIZ)+ H(YIZ), and Y are conditionally independent given Z. with equality if and only if X

A useful fact is that conditional entropy H(X I Y) increases as more is known about the first variable and decreases as more is known about the second variable, that is, for any functions f and g,

Lemma 1.6.5 H(f(X)In < H(XIY) < 1-1 (X1g(n).


The proof follows from the concavity of the logarithm function. This can be done directly (left to the reader), or using the partition formulation of entropy which is developed in the following paragraphs. The entropy of a random variable X really depends only on the partition Px = {Pa: a E A} defined by Pa = fx: X(x) = a), which is called the partition defined by X. The entropy H(1)) is defined as H (X), where X is any random variable such that Px = P. Note that the join Px V Py is just the partition defined by the vector (X, Y) so that H(Px v Pr) = H(X, Y). The conditional entropy of P relative to Q is then defined by H(PIQ) = H(1) V Q) H (Q). The partition point of view provides a useful geometric framework for interpretation of the inequalities in Lemma 1.6.5, because the partition Px is a refinement of the partition Pf(x), since each atom of Pf(x) is a union of atoms of P. The inequalities in Lemma 1.6.5 are expressed in partition form as follows.

Lemma 1.6.6 (a) If P refines Q then H(QIR) < H(PIR). (b) If R. refines S then H(PIS) > 11(P 1R).

SECTION I.6. ENTROPY AS EXPECTED VALUE.

59

Proof The proof of (a) is accomplished by manipulating with entropy formulas, as follows.

H (P IR)

(i)

(P v 21 1Z) H(QITZ) + H(PIQ v 1Z) H (OR).

The equality (i) follows from the fact that P refines Q, so that P = Pv Q; the equality (ii) is just the general addition law; and the inequality (iii) uses the fact that H (P I VR,) > O. Let Pc denote the partition of the set C defined by restricting the sets in P to C, where the conditional measure, it (. IC) is used on C. To prove (h) it is enough to consider the case when R. is obtained from S by splitting one atom of S into two pieces, say Ra = Sa , a 0 b; Sb = Rb U Rb, The quantity H(P IR.) can be expressed as

-EEtt(Pt n R a ) log
t ab

it(Pt

n Ra) +

(4)11 (Psb liZsb )

Ewpt n Ra ) log It(Pt n Ra) + it(Sb)1-1(Psb ).


t a0b

II(Ra)

The latter is the same as H(PIS), which establishes inequality (b).

I.6.b

The entropy of a process.

The entropy of a process is defined by a suitable passage to the limit. The n-th order entropy of a sequence {X1, X2, ...} of A-valued random variables is defined by

H(r)

= E p(x)log 1
ATEAn

= p(x'iL)log p(4). 11) P(x1 xnEan

The process entropy is defined by passing to a limit, namely,

(2)

1 1 1 H ({X n }) = lim sup H((X7) = lim sup p(4) log n n p(4) . n n xn

In the stationary case the limit superior is a limit. This is a consequence of the following basic subadditivity property of nonnegative sequences. Lemma 1.6.7 (The subadditivity lemma.) If {an} is a sequence of nonnegative numbers which is subadditive, that is, an-Fm < an + am , then limn an ln exists and equals infn an /n. Proof Let a = infn an /n. Given c > 0 choose n so that an < n(a 6). If m > n write m = np +.r, 0 < r < n. Subadditivity gives anp < pan , so that if b = sup ,n a, then another use of subadditivity yields am < anp + a,. < pan + b.

60

CHAPTER I. BASIC CONCEPTS.

Division by m, and the fact that np/m > 1, as in > pc, gives lim sup ani /m < a + E. CI This proves the lemma. The subadditivity property for entropy, Theorem I.6.3(c), gives

H(X7 +m) < H(X7)+ H(rnTin).


If the process is stationary then H(X n + -FT) = H(r) so the subadditivity lemma with an = H(X7) implies that the limit superior in (2) is a limit. An alternative formula for the process entropy in the stationary case is obtained by using the general addition law, H(X, Y) = H(Y) H(X1Y), to produce the formula

H(X n ) H(X1,1 )= H(X0IX:n1), n > 0.


The right-hand side is decreasing in n, from Lemma 1.6.5, and the simple fact from analysis that if an > 0 and an+ i an decreases to b then an /n > b, can then be applied to give H({X n })= lim H (X01X1 nI ) , (3)
n--4co

a formula often expressed in the suggestive form

(4)

H({X n }) = H (X01X_1 0 ) .

The next goal is to show that for ergodic processes, the process entropy H is the same as the entropy-rate h of the entropy theorem. Towards this end, the following lemma will be useful in controlling entropy on small sets.

Lemma 1.6.8 (The subset entropy-bound.)

_ E p(a) log p(a) _< p(B) log IBI p(B)log p(B),


aEB

for any B c A.
Proof Let p(. 1B) denote the conditional measure on B and use the trivial bound to obtain P(a1B)log p(aIB) ._ - _ log IBIaEB

E
1

The left-hand side is the same as

( p(B) from which the result follows.

aEB

E P(a)log p(a) + log p(B)) ,


0

Let p, be an ergodic process with alphabet A, and process entropy H, and let h be the entropy-rate of A as given by the entropy theorem, that is urnlog
n

,u (x '0

= h, a.e.

SECTION 1.6. ENTROPY AS EXPECTED VALUE.

61

Theorem 1.6.9 Process entropy H and entropy-rate h are the same for ergodic processes.
Proof First assume that h > 0 and fix E such that 0 < E < h. Define
G, = {4: 2n(h+E) < 1,1 (4) < 2n(hE)}

and let B, = A n G. Then

H (X7) =

E ,(x ) log ,u(xtil) A" = E ,u (4 ) log ,u(xliz) E ,u(4) log p.(4).


B,, G,

The two sums will be estimated separately. The subset entropy-bound, Lemma 1.6.8, gives

(5)

E A (x) log ii(x) < np,(Bn )log IA 1 ,i(B)log(B),


8

since I B I < IAIn . After division by n, this part must go to 0 as n > oc, since the entropy theorem implies that ti(B) goes to O. On the set G, the following holds

n(h

e) _< log ,u(4) <_ n(h c),

so multiplication by ,u(4)/n,u(G n ) and summing produces

(6)

(h 6)

1
G,.

As n > co the measure of G, approaches 1, from the entropy theorem, so the sum converges to H({X}). Thus

which proves the theorem in the case when h > O. If h = 0 then define G, = {4: ,u(4) > 2 ne . The bound (5) still holds, while D only the upper bound in (6) matters and the theorem again follows.
}

Remark 1.6.10 Since for ergodic processes, the process entropy H({X,}) is the same as the entropyrate h of the entropy theorem, both are often simply called the entropy. A detailed mathematical exposition and philosophical interpretation of entropy as a measure of information can be found in Billingsley's book, [4]. The Csiszdr-Krner book, [7], contains an excellent discussion of combinatorial and communication theory aspects of entropy, in the i.i.d. case. The recent books by Gray, [18], and Cover and Thomas, [6], discuss many information-theoretic aspects of entropy for the general ergodic process.

62

CHAPTER I. BASIC CONCEPTS.

I.6.c

The entropy of i.i.d. and Markov processes.

Entropy formulas for i.i.d. and Markov processes can be derived from the entropy theorem by using the ergodic theorem to directly estimate p.(4), see Exercises 1 and 2 in Section 1.5. Here they will be derived from the definition of process entropy. Let {X, } be an i.i.d. process and let p(x) = Prob(Xi = x). The additivity of entropy for independent random variables, Theorem 1.6.3(c), gives

H (XI') = H(Xi)
which yields the entropy formula

H (X2) +

H(X n ) = nH (X i ),

(7)

H({X,}) = H(X i ) =

p(x) log p(x).

An alternate proof can be given by using the conditional limit formula, (3), and the fact that H(Xl Y) = H(X) if X and Y are independent, to obtain

H({X,}) = lip H (X0IX1,!) = H(X0)


Now suppose {X n } is a Markov chain with stationary vector p and transition matrix M. The Markov property implies that

Prob (X0 = alX) = Prob (X0 = alX_i) , a


from which it follows that

A,

H (X 0 1

= H(X01 X -1)

Thus the conditional entropy formula for the entropy of a process, (3), along with a direct calculation of H(X0IX_ 1 ) yields the following formula for the entropy of a Markov chain.

(8)

H({X n }) = H(X0IX_ i ) =

E 04 log Mii .
,

Recall that a process {Xn } is Markov of order k if

Prob (X0 = aiX) = Prob (X0 = alX:k1 ) , a

A.

The argument used in the preceding paragraph shows that if {X,} is Markov of order k then

1/({X}) = H(X0IX:1) =

n > k.

This condition actually implies that the process is Markov of order k.

Theorem 1.6.11 (The Markov order theorem.) A stationary process {X n } is Markov of order k if and only if H(X0IX:) = H(X0IX: n1 ), n > k.
Proof The conditional addition law gives

H((X o , Xi n k+1 )1X1) = H(X: n k+1 1X11) H(X 0 IXI kl

Xink+1).

SECTION 1.6. ENTROPY AS EXPECTED VALUE.

63

The second term on the right can be replaced by H(X01X: k1 ), provided H(X01X1 k1 ) = H(X0 1X1n1 ), for n > k. The equality condition of the subadditivity principle, Theok+1 are conditionally indepenrem I.6.4(c), can then be used to conclude that X0 and X:_ n dent given XI/. If this is true for every n > k then the process must be Markov of order k. This completes the proof of the theorem. In general the conditional entropy function Hk = H(X0IX: k1 ) is nonincreasing in k. To say that the process is Markov of some order is to say that Hk is eventually constant. If the true order is k*, then Hks_i > fik* = Hk , k > k*. This fact can be used to estimate the order of a Markov chain from observation of a sample path.

I.6.d

Entropy and types.

The entropy of an empirical distribution gives the exponent for a bound on the number of sequences that could have produced that empirical distribution. This fact is useful in large deviations theory, and will be useful in some of the deeper interpretations of entropy to be given in later chapters. The empirical distribution or type of a sequence x E A" is the probability distribution Pi = pl (. 14) on A defined by the relative frequency of occurrence of each symbol in the sequence 4, that is,
n

Ifi:

Pi(alxi) =

= an ,

a E A.

Two sequences xi' and y'11 are said to be type-equivalent if they have the same type, that is, if each symbol a appears in x'; the same number of times it appears in Type-equivalence classes are called type classes. The type class of xi' will be denoted by T WO or by 7, where the latter stresses the type p i = pi (. 14), rather than a particular sequence that defines the type. Thus, x'1' E T; if and only if each symbol a appears in xi' exactly n pi(a) times. The empirical (first-order) entropy of xi' is the entropy of p i (- 14), that is,

17(130= ii(pl(. 14)) = a

Pi(aixi)log Pi(aI4).

The following purely combinatorial result bounds the size of a type class in terms of the empirical entropy of the type.

Theorem 1.6.12 (The type class bound.) < 2r1(131), for any type p i .
-

Proof First note that if Qn is a product measure on A n then (9)


Q" (x)= ji Q(xi) i=1

= fl
aEA

War P1(a) , x

rn Tpn,. In

since np i (a) is the number of times the symbol a appears in a given 4 particular, a product measure is always constant on each type class.

64

CHAPTER 1. BASIC CONCEPTS.


A type pi defines a product measure Pn on An by the formula

Pn (z?) =

pi (zi), z E A'

Replacing Qn by Pi in the product formula, (9), produces

Pn(xiz) = aEA

pi (a)nPl(a) ,

E r pni ,

which, after taking the logarithm and rewriting, yields

Pn(x7) = 2n17(PI) ,

In other words, Pn (4) has the constant value 2' -17(P' ) on the type class of x7, and hence = ypn, 12 n 11( v 1 ) But Pi is a probability distribution so that Pi (Tpni ) < 1. This establishes Theorem 1.6.12. The bound 2' 17, while not tight, is the correct asymptotic exponential bound for the size of a type class, as shown in the Csiszdr-Krner book, [7]. The upper bound is all that will be used in this book. Later use will also be made of the fact that there are only polynomially many type classes, as stated in the following theorem. Theorem 1.6.13 (The number-of-types bound.) The number of possible types is at most (n + 1)1A I.

Proof This follows from the fact that for each a E A, the only possible values for , n. npi (alfii) are 0, 1,
The concept of type extends to the empirical distribution of overlapping k-blocks. The empirical overlapping k-block distribution or k-type of a sequence x7 E A' is the probability distribution Pk = pk(' 14) on A k defined by the relative frequency of occurrence of each k-block in thesen [i quenk ce +ii'i th_a i t ki_ S,1
=
E

I
-41

n k

A k.

Two sequences x;' and y' are said to be k-type-equivalent if they have the same k-type, and k-type-equivalence classes are called k-type classes. The k-type class of x i" will be denoted by Tpnk , where Pk = Pk ( Ix), that is, 4 E T, if and only if each block all' appears in x exactly (n k + 1) pk (alic) times. The bound on the number of types, Theorem 1.6.13, extends immediately to k-types. Theorem 1.6.14 (The number of k-types bound.) The number of possible k-types is at most (n k + Note, in particular, that if k is fixed the number of possible k-types grows polynomially in n, while if k is also growing, but satisfies k < a logiAi n, with a < 1, then the number of k-types is of lower order than IA In.

SECTION 1.6. ENTROPY AS EXPECTED VALUE.

65

Estimating the size of a k-type class is a bit trickier, since the k-type measures the frequency of overlapping k-blocks. A suitable bound can be obtained, however, by considering the (k 1)-st order Markov measure defined by the k-type plc ( 14), that is, the stationary (k 1)-st order Markov chain ri 0c -1) with the empirical transition function

(ak 14 -1 ) =

Pk(ainxii ) f k-1 ir uk IA 1 ) E bk pk

and with the stationary distribution given by The Markov chain ri(k-1) has entropy
(k
-

"Aar') =

Ebk Pk(ar i

1)

-E 1.-j(a k Or 1 )Ti(a ki -1 ) log 75(akiar) E Pk(akil4) log Ti(ak lari ),


ak
ak

by formula (8). The entropy FP") is called the empirical (k 1)-st order Markov entropy of 4. Note that it is constant on the k-type class 'Tk(xiii) = Tp k of all sequences that have the same k-type Pk as xrii. Theorem 1.6.15 (The k-type-class bound.) I7k (fil)1 < (n
Proof First consider the case k = 2. If Q is Markov a direct calculation yields the formula

Q(4)

= Q(xi)n Q(bi a)(nl)p2(ab)


a,b

xr it c Tp2.

This formula, with Q replaced b y (2-1) = rf,(1) and after suitable rewrite using the definition of W I) becomes

1 TV I) (Xlii ) = Ti-(1) (X02(n-1)1 ,


and hence
P

2- (n- 1>ii(1 ) IT P2 I 1n

since ii:(1) (xi) > 1/(n 1). This yields the desired bound for the first-order case, since P1) (4) is a probability measure on A n . The proof for general k is obtained by an obvious extension of the argument.

I.6.e

Exercises.
, n, and a

1. Prove the log-sum inequality: If a, > 0, b, > 0, i = 1, 2, E ai , b = E bi , then E ai log(ai I bi ) ?_ a log(a/b).

2. Let P be the time-0 partition for an ergodic Markov chain with period d = 3 and entropy H > O. (a) Determine the entropies of the ergodic components of the (T3 , 2)-process.

66

CHAPTER I. BASIC CONCEPTS.

(b) Determine the entropies of the ergodic components of the (T 3 , p V Tp T22)-process. 3. Let p be an ergodic process and let a be a positive number. Show there is an N = N(a) such that if n > N then there is a set Fn , measurable with respect to xn oc , such that ,u(Fn ) > 1 a and so that if xn E Fn then
27" 11, (4) < t(xIx)

2",i(4).

(Hint: apply the ergodic theorem to f (x) = log //,(xi Ix), and use the equality of process entropy and entropy-rate.)

4. Let T be the shift on X = {-1,1} z , and let ji be the product measure on X defined by ,u0(-1) = p,(1) = 1/2. Let Y = X x X, with the product measure y = j x it, and define S(x, y) = (T x , Toy). Let p be the time-zero partition of X and let Q = P x P. Find the entropy of the (S, Q)-process, which is the "random walk with random scenery" process discussed in Exercise 8 of Section 1.2. (Hint: recurrence of simple random walk implies that all the sites of y have been almost surely visited by the past of the walk.) 5. Prove that the entropy of an n-stationary process {Yn } is the same as the entropy the stationary process {X} obtained by randomizing the start. (Hint: let S be uniformly distributed on [1, n] and note that H (X I v , S) = H H (SIXr) = H(S)+ H(X l iv IS) for any N> n, so H(Xr)1N H(Xr IS)1N. Then show that I-1(r IS = s) is the same as the unshifted entropy of Xs/v+1 .) 6. The divergence for k-set partitions is D(P Q) =

Ei It(Pi)log(p,(Pi)1 A(Q i )).

(a) Show that D (P v 7Z I Q v > Q) for any partition R.. (b) Prove Pinsker's inequality, namely, that D (P Il Q) > (1 / 2 ln 2)IP Q1 2 , where jp - QI 1,u(Pi ) - bt 1 a2012 . (Hint: use part (a) to reduce to the two set case, then use calculus.)

7. Show that the process entropy of a stationary process and its reversed process are the same. (Hint: use stationarity.)

Section 1.7

Interpretations of entropy.

Two simple, useful interpretations of entropy will be discussed in this section. The first is an expression of the entropy theorem in exponential form, which leads to the concept of entropy-typical sequence, and to the related building-block concept. The second interpretation is the connection between entropy and expected code length for the special class of prefix codes.

I.7.a

Entropy-typical sequences.

Let p. be an ergodic process of entropy h. For E > 0 and n > 1 define En(E) 1 4: 2n(h+E) < tt (x in) < 2n(h-6)} The set T(E) is called the set of (n, )-entropy-typical sequences, or simply the set of entropy-typical sequences, if n and c are understood. The entropy theorem may be expressed by saying that xril is eventually almost surely entropy-typical, that is,

SECTION 1.7. INTERPRETATIONS OF ENTROPY.


Theorem 1.7.1 (The typical-sequence form of entropy.) For each E > 0, x E T(e), eventually almost surely.

67

The convergence in probability form, limn ,u(Tn (E)) = 1, is known as the asymptotic equipartition property or AEP, in information theory. The phrase "typical sequence," has different meaning in different contexts. Sometimes it means entropy-typical sequences, as defined here, sometimes it means frequencytypical sequences, as defined in Section 1.4, sometimes it means sequences that are both frequency-typical and entropy-typical, and sometimes it is just shorthand for those sequences that are likely to occur. The context usually makes clear the notion of typicality being used. Here the focus is on the entropy-typical idea. The members of the entropy-typical set T(e) all have the lower bound 2 n(h+f) on their probabilities. Since the total probability is at most 1, this fact yields an upper bound on the cardinality of Tn , namely,
Theorem 1.7.2 (The entropy-typical cardinality bound.) The set of entropy-typical sequences satisfies 17-n (01 < 2n (h+E)

Thus, even though there are I A In possible sequences of length n, the measure is eventually mostly concentrated on a set of sequences of the (generally) much smaller cardinality 2n(h+6) This fact is of key importance in information theory and plays a major role in many applications and interpretations of the entropy concept. The preceding theorem provides an upper bound on the cardinality of the set of typical sequences and depends on the fact that typical sequences have a lower bound on their probabilities. Typical sequences also have an upper bound on their probabilities, which leads to the fact that too-small sets cannot be visited too often.
Theorem 1.7.3 (The too-small set principle.) < 2n(h--), n > 1, then .7q If C, c A n and ICI _

Ci', eventually almost surely.

Proof Since .X1' E 'T,(E12), eventually almost surely, it is enough to show that x

g C, n'T,i (e/ 2), eventually almost surely. The cardinality bound on Cn and the probability upper bound 2 2) on members of Tn (E/2) combine to give the bound
(Cn

n Tn(E/2)) <

2n(12+02n(hE/2) < 2nc/2

Thus ,u(Cn

n 'T,(E12)) is summable in n, and the theorem is established.

fl

Another useful formulation of entropy, suggested by the upper bound, Theorem 1.7.2, on the number of entropy-typical sequences, expresses the connection between entropy and coverings. For a > 0 define the (n, a)-covering number by Arn (a) = minfICI: C c An, and p(C)> a}, that is, Ar(a) is the minimum number of sequences of length n needed to fill an afraction of the total probability. A good way to think of .I\rn (a) is given by the following algorithm for its calculation.
(Al) List the n-sequences in decreasing order of probability.

(A2) Count down the list until the first time a total probability of at least a is reached.

68

CHAPTER I. BASIC CONCEPTS.

The covering number Arn (a) is the count obtained in (A2). The connection with entropy is given by the following theorem. Theorem 1.7.4 (The covering-exponent theorem.) For each a > 0, the covering exponent (1/n)logArn (a) converges to h, as n

oc.

Proof Fix a E (0, 1) and c > O. Since the measure of the set of typical sequences goes to I as n oc, the measure ,u(T,i (E)) eventually exceeds a. When this happens N,Jce) < iTn (01 < 2 n0-1-0 , by Theorem 1.7.2, and hence, lim supn (l/n) log Nn (a) .5_ h. On the other hand, suppose, for each n, ,i(C) > a and I C,,I = (a). If n is large enough then A(T,(6) n cn ) > a(1 c). The fact that p.(x) < 2 n(hE) , for E rn (E then implies that
ACTn(E)

p.(4) E x;Ern (onc,

< 2n(hE) 1 -6 - n GI,


)

and hence

Nn (a) = ICn > IT(e) n Cn > 2n (h-E'A(Tn(E) n cn) > 2n(h-E)a(i 6), which proves the theorem.

fl

The covering exponent idea is quite useful as a tool for entropy estimation, see, for example, the proof in Section I.7.e, below, that a rotation process has entropy O. The connection between coverings and entropy also has as a useful approximate form, which will be discussed in the following paragraphs. Let d(a, b) be the (discrete) metric on A, defined by 0 d (a , b) = J 1 1 and extend to the metric d on A', defined by

if a = b otherwise,

1 dn (4 , yri`) = n i=1

yi)

The metric cin is also known as the per-letter Hamming distance. The distance dn (x, S) from 4 to a set S C An is defined by

dn (x' , S) = min{dn (4,

y;' E S},

and the 6-neighborhood or 6-blowup of S is defined by

[S ], =

S) < 81.

It is important that a small blowup does not increase size by more than a small exponential factor. To state this precisely let H(8) = 6 log 6 (1 6) log(1 8) denote the binary entropy function.

SECTION 1.7. INTERPRETATIONS OF ENTROPY.


Lemma 1.7.5 (The blowup bound.) The 6-blowup of S satisfies

69

I[S]81

ISI2 n11(6) (1A1

1) n3 . < n(h+2) 2

In particular, given c > 0 there is a S > 0 such that Ir1n(c)131

for all n.

Proof Given x, there are at most nS positions i in which xi can be changed to create a member of [ { x7 } 13. Each such position can be changed in IA 1 ways. Thus, the _ 2nH(3) (IAI 1) n8 , which combinations bound, Lemma 1.5.4, implies that I[{4}]51 < yields the stated bound. < 2n(h-1-) since (IAI 1)n8 < 20 log IAI and since Slog IA I and H(8) Since IT(e)1 _ < 2n (h+26) 0, it follows that if 8 is small enough, then Irrn(E)ls1 _ both go to 0 as El will hold, for all n. This proves the blowup-bound lemma.

The blowup form of the covering number is defined by

(1)

Arn (a, 8) = min{ICI: C c

An,

and //([C]3 )

a},

that is, it is the minimum size of an n-set for which the 8-blowup covers an a-fraction of the probability. It is left to the reader to prove that the limit of (1/n) log Nen (a, 8) oc. An application of Lemma 1.7.5, together with Theorem 1.7.4 then exists as n establishes that 1 lim lim log Afn (a, 8) = h.
6>0 n> oo

I.7.b

The building-block concept.

An application of the ergodic theorem shows that frequency-typical sequences must consist mostly of n-blocks which are entropy-typical, provided only that n is not too small. In particular, the entropy-typical n-sequences can be thought of as the "building blocks," from which longer sequences are made by concatenating typical sequences, with occasional spacers inserted between the blocks. This idea and several useful consequences of it will now be developed. Fix a collection 5, c An , thought of as the set of building blocks. Also fix an integer M > n and 8 > O. A sequence xr is said to be (1 0-built-up from the building blocks B, if there is an integer I = I(41 ) and a collection llni, mi]: i < 1), of disjoint n-length subintervals of [1, M], such that

(a) Eil.=1 (mi n i + 1) > (1


(b) E

I3,,

i c I.

In the special case when 8 = 0 and M is a multiple of n, to say that 41 is 1-built-up from Bn is the same as saying that xr is a concatenation of blocks from B. If 8 > 0 then the notion of (1-0-built-up requires that xr be a concatenation of blocks from 13,, with spacers allowed between the blocks, subject only to the requirement that the total length of the spacers be at most SM. Both the number I and the intervals [ni , m i ] for which x , is required to be a member of 13,, are allowed to depend on the sequence xr. The reader should also note the concepts of blowup and built-up are quite different. The

70

CHAPTER I. BASIC CONCEPTS.

blowup concept focuses on creating sequences by making a small density of otherwise arbitrary changes, while the built-up concept only allows changes in the spaces between the building blocks, but allows arbitrary selection of the blocks from a fixed collection. An important fact about the building-block concept is that if S is small and n is large, then the set of M-sequences that can be (1 6)-built-up from a given set 8 C An of building blocks cannot be exponentially much larger in cardinality than the set of all sequences that can be formed by selecting M I n sequences from Bn and concatenating them without spacers. The proof of this fact, which is stated as the following lemma, is similar in spirit to the proof of the key bound used to establish the entropy theorem, Lemma 1.5.6, though simpler since now the blocks all have a fixed length. As usual, H(8) = 8 log6 (1 8) log(1 S) denotes the binary entropy function. Lemma 1.7.6 (The built-up set bound.) Let DM be the set of all sequences xr that can be (1 6)-built-up from a given collection B c An. Then

I Dm I

IB.1 2mH (0 1 A im s.

In particular, if B, = 7,(c), the set of entropy-typical n-sequences relative to c, and if

is small enough then 11)m 1 < 2m(h +26) . Proof The number of ways to select a family {[n i , m i ]} of disjoint n-length subintervals that cover all but a 6-fraction of [1, M] is upper bounded by the number of ways to select at most SM points from a set with M members, namely,

Y1 E
<

which is, in turn, upper bounded by 2 A411(3) , by the combinations bound, Lemma 1.5.4. For a fixed configuration of locations, say {[n i ,m i ],i E / } , for members of Bn , the number ways to fill these with members of B is upper bounded by IBnl i <
,3 Min

and the number of ways to fill the places that are not in Ui [ni,m i ] is upper bounded by IA 1 3m . Thus

I Dm I <

ni

Min

IAIM,

which is the desired bound. The bound for entropy-typical building blocks follows immediately from the fact that I T(e)l 2 n(h+f) . This establishes the lemma.
The building-block idea is closely related to the packing/covering ideas discussed in Section I.4.b. A sequence xr is said to be (1 6)-strongly-covered by B C An if

E [1, M n +1]:

Bn }l < 8(M n +1).

The argument used to prove the packing lemma, Lemma 1.3.3, can be used to show that almost strongly-covered sequences are also almost built-up, provided M is large enough. This is stated in precise form as the following lemma.

SECTION 1.7. INTERPRETATIONS OF ENTROPY.

71

Lemma 1.7.7 (The building-block lemma.) If xr is (1-812)-strongly-covered by 13,, and if M > 2n/8, then xr is (1-0-built-up
from Bn Proof Put mo = 0 and for i > 0, define ni to be least integer k > m i _ i such that
ric+n--1 E 13, stopping when "" Ic (1 3/2)-strongly-covered by

within n of the end of xr. The assumption that xr is 13, implies there are at most 6M/2 indices j < M n 1 which are not contained in one of the [ni , m i ], while the condition that M > 2n18 implies there at most 3M/2 indices j E [M n 1, M] which are not contained in one of the [ni , mi]. The lemma is therefore established.

Remark 1.7.8 The preceding two lemmas are strictly combinatorial. In combination with the ergodic and entropy theorems they provide powerful tools for analyzing the combinatorial properties of partitions of sample paths, a subject to be discussed in the next chapter. In particular, if B, = Tn (e), and n is large enough, then ,u,(7,,()) 1. The almost-covering principle, Theorem 1.4.7, implies that eventually almost surely xr is almost stronglycovered by sequences from T(e), hence eventually almost surely mostly built-up from members of 7; (e). A suitable application of the built-up set lemma, Lemma 1.7.6, implies that the set Dm of sequences that are mostly built-up from the building blocks T(e) will eventually have cardinality not exponentially much larger than 2/4(h+6) The sequences in Dm are not necessarily entropy-typical, for no upper and lower bounds on their probabilities are given; all that is known is that the members of Dm are almost built-up from entropy-typical n-blocks, where n is a fixed large number, and that xr is eventually in DM. This result, which can be viewed as the entropy version of the finite sequence form of the frequency-typical sequence characterization of ergodic processes, Theorem 1.4.5, will be surprisingly useful later.

I.7.c

Entropy and prefix codes.

In a typical data compression problem a given finite sequence 4, drawn from some finite alphabet A, is to be mapped into a binary sequence bf = b1, b2, , b.c, whose length r may depend on 4 , in such a way that the source sequence x'1' is recoverable from knowledge of the encoded sequence bf.. The goal is to use as little storage space as possible, that is, to make the code length as short as possible, at least in some average sense. Of course, there are many possible source sequences, often of varying lengths, so that a typical code must be designed to encode a large number of different sequences and accurately decode them. A standard model is to think of a source sequence as a (finite) sample path drawn from some ergodic process. For this reason, in information theory a stationary, ergodic process with finite alphabet A is usually called a source. In this section the focus will be on a special class of codes, known as prefix codes, for which there is a close connection between code length and source entropy. In the following discussion B* denotes the set of all finite-length binary sequences and t(w) denotes the length of a member w E B*. B*. A code is said to be faithful or A (binary) code on A is a mapping C: A noiseless if it is one-to-one. An image C (a) is called a codeword and the range of C

72

CHAPTER I. BASIC CONCEPTS.

is called the codebook. The function that assigns to each a E A the length of the code word C (a) is called the length function of the code and denoted by (IC) or by r if C is understood. Formally, it is the function defined by ,C(aIC) = (C (a)), a E C. The expected length of a code C relative to a probability distribution a on A is E(L) = Ea ,C(aIC)p,(a). The entropy of p, is defined by the formula

H (A) =

E p(a) log
aEA

that is, H(j) is just the expected value of the random variable log ,u(X), where X has the distribution A. (As usual, base 2 logarithms are used.) Without further assumptions about the code there is little connection between entropy and expected code length. For codes that satisfy a simple prefix condition, however, entropy is a lower bound to code length, a lower bound which is almost tight. To develop the prefix code idea some notation and terminology and terminology will be introduced. For u = u7 E B* and u = vr E B*, the concatenation uv is the sequence of length n m defined by

(uv) i =

ui

1<i<n n < i < n m.

A nonempty word u is called a prefix of a word w if w = u v. The prefix u is called a proper prefix of w if w = u v where y is nonempty. A nonempty set W c B* is prefix free if no member of W is a proper prefix of any member of W. A prefix-free set W has the property that it can be represented as the leaves of the (labeled, directed) binary tree T(W), defined by the following two conditions. (i) The vertex set V(W) of T (W) is the set of prefixes of members of W, together with the empty word A. (ii) If y E V(W) has the form v = ub, where b from u to y, labeled by b. (See Figure 1.7.9.)
E

B then there is a directed edge

Figure 1.7.9 The binary tree for W = {00, 0101, 101, 0100, 100 } .
Note that the root of the tree is the empty word and the depth d(v) of the vertex is just the length of the word v. In particular, since W is prefix free, the words w E W correspond to leaves L(T) of the tree T (W). An important, and easily proved, property of binary trees is that the sum of 2 ci(v) over all leaves y of the tree can never exceed 1. For prefix-free sets W of binary sequences this fact takes the form

(2)
..w

< 1,

SECTION 1.7. INTERPRETATIONS OF ENTROPY.

73

where t(w) denotes the length of w E W, a result known as the Kraft inequality. A geometric proof of this inequality is sketched in Exercise 10. The Kraft inequality has the following converse. Lemma 1.7.10 If 1 < < f2 < < it is a nondecreasing sequence of positive integers such that E, < 1 then there is a prefix-free set W = {w(i): 1 < i < t} such that t(w(i)) = ti , i E [1, t]. Proof Define i and j to be equivalent if ti = Li and let {G1, , Gs } be the equivalence classes, written in order of increasing length L(r) = L, where t i E Gr . The Kraft inequality then takes the form

E I Gr I2-Lo-) < 1.
r=1

Assign the indices in G1 to binary words of length L(1) in some one-to-one way, which is possible since I Gil < 2" ) . Assign G2 in some one-to-one way to binary words of length L(2) that do not have already assigned words as prefixes. This is possible since 1 GI 12L(1) + I G2 1 2 L(2) < 1, so that IG21 2") IG11 2L(2)L(1),

where the second term on the right gives the number of words of length L(2) that have prefixes already assigned. An inductive continuation of this assignment argument clearly establishes the lemma. I=1 A code C is a prefix code if a = a whenever C(a) is a prefix of C(a). Thus C is a prefix code if and only if C is one-to-one and the range of C is prefix-free. The codewords of a prefix code are therefore just the leaves of a binary tree, and, since the code is one-to-one, the leaf corresponding to C(a) can be labeled with a or with C(a). For prefix codes, the Kraft inequality (3)
aEA

1,

holds. For prefix codes, entropy is always a lower bound on average code length, a lower bound that is "almost" tight. This fundamental connection between entropy and prefix codes was discovered by Shannon. Theorem 1.7.11 (Entropy and prefix codes.) Let j..t be a probability distribution on A.

(i) H(p) E4 (C(.1C)), for any prefix code C.


(ii) There is a prefix code C such that EIL (C(.1C)) < H(g) + 1. Proof Let C be a prefix code. A little algebra on the expected code length yields
E A (C(.1C)) =

Er(aiC)Wa) =
a a

log 2paic),,, (, )

E p,(a) log
a

2-C(a1C)

E,u (a ) log ,u(a).


a

74

CHAPTER I. BASIC CONCEPTS.

The second term is just H(g), while the divergence inequality, Lemma 1.6.2, implies that the first term is nonnegative, since the measure v defined by v(a) = 2 C(a I C) satisfies v(A) < 1, by the Kraft inequality. This proves (i). To prove part (ii) define L(a) = 1- - log 1,t(a)1, a E A, where rx1 denotes the least integer > x, and note that

E 2 C (a) < E
a a

2logg(a)

E a E

1.

Lemma 1.7.10 produces a prefix code C(a) = w(a), a Since, however, ,C(a) < 1 log p(a), it follows that =
a

A, whose length function is L.

E(1- log tt(a))p,(a)


a

E p(a) log p(a) =1+ H(p).


a

Thus part (ii) is proved, which completes the proof of the theorem. Next consider n-codes, that is, mappings Cn : A n 1> B* from source sequences of length n to binary words. In this case, a faithful n-code Cn : An B* is a prefix n-code if and only if its range is prefix free. Of interest for n-codes is per-symbol code length ,C(a71Cn )/n and expected per-symbol code length E(C(.lCn ))1n. Let /in be a probability measure on A n with per-symbol entropy 1-1,(A n ) = H(i)/n. Theorem I.7.11(i) takes the form 1 (4) H(iin)

for any prefix n-code Cn , while Theorem I.7.11(ii) asserts the existence of a prefix n-code Cn such that 1 1 (5) E (reIC n)) 5- H n (P, n) . n " n A sequence {Cn } , such that, for each n, Cn is a prefix n-code with length function ,C(4) = ,C(4 IQ, will be called a prefix-code sequence. The (asymptotic) rate of such a code sequence, relative to a process {X n } with Kolmogorov measure IL is defined by
= liM sup
nco

E (C(IC n))

The two results, (4) and (5) then yield the following asymptotic results. Theorem 1.7.12 (Process entropy and prefix-codes.) If p, is a stationary process with process entropy H then (a) There is a prefix code sequence {C n } such that 1-4({G}) < H. (b) There is no prefix code sequence {C n } such that 1Z 4 ({C n }) < H.
Thus "good" codes exist, that is, it is possible to compress as well as process entropy in the limit, and "too-good" codes do not exist, that is, no sequence of codes can asymptotically compress more than process entropy. In the next chapter, almost-sure versions of these two results will be obtained for ergodic processes.

SECTION 1.7. INTERPRETATIONS OF ENTROPY.

75

Remark 1.7.13 A prefix code with length function ,C(a) = F log ,u(a)1 will be called a Shannon code. Shannon's theorem implies that a Shannon code is within 1 of being the best possible prefix code in the sense of minimizing expected code length. A somewhat more complicated coding procedure, called Huffman coding, produces prefix codes that minimize expected code length. Huffman codes have considerable practical significance, but for n-codes the per-letter difference in expected code length between a Shannon code and a Huffman code is at most 1/n, which is asymptotically negligible. For this reason no use will be made of Huffman coding in this book, since, for Shannon codes, the code length function L(a) = 1 log (a)1 is closely related to the function log p,(a), whose expected value is entropy.

I.7.d

Converting faithful codes to prefix codes.

Next it will be shown that, as far as asymptotic results are concerned, there is no loss of generality in assuming that a faithful code is a prefix code. The key to this is the fact that a faithful n-code can always be converted to a prefix code by adding a header (i.e, a prefix) to each codeword to specify its length length. This can be done in such a way that the length of the header is (asymptotically) negligible compared to total codeword length, so that asymptotic results about prefix codes automatically apply to the weaker concept of faithful code. The following is due to Elias, [8].

Lemma 1.7.14 (The Elias-code lemma.) There is a prefix code :11,2, B*, such that i(e(n)) = log n + o(log n). Any prefix code with this property is called an Elias code.
Proof The code word assigned to n is a concatenation of three binary sequences,

C(n) = u(n)v(n)w(n).

The third part w(n) is the usual binary representation of n, so that, for example, w(12) = 1100. The second part v(n) is the binary representation of the length of w(n), so that, for example, v(12) = 100. The first part u(n) is just a sequence of O's of length equal to the length of v(n), so that, for example, u(12)=000. Thus E(12) = 0001001100. The code is a prefix code, for if
u(n)v(n)w(n) = u(m)v(in)w(m)w', then, u(n) = u(m), since both consist only of O's and the first bit of both v(n) and v(m) is a 1. But then v(n) = v(m), since both have length equal to the length of u(n). This means that w(n) = w(m), since both have length specified by v(n), so that w' is empty and n = m. The length of w(n) is flog(n +1)1, the length of the binary representation of n, while both u(n) and v(n) have length equal to

f log(1 + Flog(n

1 )1)1,

the length of the binary representation of Flog(n + 1)1. The desired bound f(E(n)) = log n + o(log n) follows easily, so the lemma is established. CI

76

CHAPTER I. BASIC CONCEPTS.

Given a faithful n-code Cn with length function Le ICn ), a prefix n-code Cn * is obtained by the formula

(6)

C: (4 ) = ECC(.4 I Cn ))Cn (4), x lii E An ,

where e is an Elias prefix code on the integers. For example, if C,(4) = 001001100111, a word of length 12, then

C:(4) = 0001001100, 001001100111,


where, for ease of reading, a comma was inserted between the header information, e (C(4 ICn )), and the code word C, (4). The decoder reads through the header information and learns where C, (4) starts and that it is 12 bits long. This enables the decoder to determine the preimage .x11 , since Cn was assumed to be faithful. The code CnK, which is clearly a prefix code, will be called an Elias extension of C. In the example, the header information is almost as long as Cn (4), but header length becomes negligible relative to the length of Cn (4) as codeword length grows, by Lemma 1.7.14. Thus Theorem 1.7.12(b) extends to the following somewhat sharper form. (The definition of faithful-code sequence is obtained by replacing the word "prefix" by the word "faithful," in the definition of prefix-code sequence.)

Theorem 1.7.15 If ,u is a stationary process with entropy-rate H there is no faithful-code sequence such that R IL ({C n }) < H.

Remark 1.7.16
Another application of the Elias prefix idea converts a prefix-code sequence {C n } into a single prefix code C: A* i where C is defined by

C(4) = e(n)C(4),

X tiz E An,

n = 1, 2, ...

The header tells the decoder which codebook to apply to decode the received message.

I.7.e

Rotation processes have entropy O.

As an application of the covering-exponent interpretation of entropy it will be shown that ergodic rotation processes have entropy 0. Let T be the transformation defined by T:xi--> x ea, x E [0, 1), where a is irrational and ED is addition modulo 1, and let P be a finite partition of the unit interval [0, 1).

Proposition 1.7.17 The (T, P)-process has entropy O.


The proof for the special two set partition P = (Po, PO, where Po = [0, 0.5), and P1 = [0.5, 1), will be given in detail. The argument generalizes easily to arbitrary partitions into subintervals, and, with a bit more effort, to general partitions (by approximating by partitions into intervals). The proof will be based on direct counting arguments rather than the entropy formalism. Geometric arguments will be used to estimate the number of possible (T, P)-names of length n. These will produce an upper bound of the form 2nEn , where En > 0 as

SECTION 1.7. INTERPRETATIONS OF ENTROPY.

77

n oo. The result then follows from the covering-exponent interpretation of entropy, Theorem 1.7.4. The key to the proof is the following uniform distribution property of irrational rotations.

Proposition 1.7.18 (The uniform distribution property.)


Inn
E [1, tz]: Tlx E

for any interval I c [0, 1) and any x E [0, 1), where A denotes Lebesgue measure. Ergodicity asserts the truth of the proposition for almost every x. The fact that rotations are rigid motions allows the almost-sure result to be extended to a result that holds for every x. The proof is left as an exercise.

The key to the proof to the entropy 0 result is that, given z and x, some power y = Tx of x must be close to z. But translation is a rigid motion so that Tmy and Trnz will be close for all m, and hence the (T-P)-names of y and z will agree most of the time. Thus every name is obtained by changing the name of some fixed z in a small fraction of places and shifting a bounded amount. In particular, there cannot be exponentially very many names. To make the preceding argument precise define the metric Ix - yli = minflx - y1,11 and the pseudometric
dn(x, y) = dn(x , x - y11, x, y E [0, 1),

=-

1 n d(xi, Yi), n

where d(a, b) = 0, if a = b, and d(a,b) = 1, if a 0 b, and {x n } and {yn} denote the (T, P)-names of x, y E [0, 1), respectively. The first lemma shows that cln (x, y) is continuous relative to Ix - Y 1 1.

Lemma 1.7.19
E, if Ix - Y11 < 6/4 and n > N. Proof Suppose lx - yli < 14. Consider the case when Ix Y Ii = lx - yi. Without loss of generality it can be assumed that x < y. Let I = [x, y]. The names of x and y disagree at time j if and only if Tx and T y belong to different atoms of the partition P, which occurs when and only when either T --1 0 E I or (0.5) E I. The uniform distribution property, Proposition 1.7.18, can be applied to T -1 , which is rotation by -a, to provide an N = N(x, y) such that if n > N then
Given
E

> 0 there is an N such that dn (x, y) <

1{j E

E I or T -1 (0.5) E

<E

This implies that dn (x, y) < e, n > N. The case when lx - yll = 11 + x - yl can be treated in a similar manner. Compactness then shows that N can be chosen to be independent of x and y. The next lemma is just the formal statement that the name of any point x can be shifted by a bounded amount to obtain a sequence close to the name of z = 0 in the sense of the pseudometric dn.

78 Lemma 1.7.20

CHAPTER I. BASIC CONCEPTS.

Given E > 0, there is an integer M and an integer K > M such that if x E [0, 1) there is a j = j(x) E [1, M] such that if Z = 0 and y = Tx then
i, yk ) < 69 k > K. dk (zk

Given E > 0, choose N from the preceding lemma so that if n > N and lx yli < 6/4 then dn (x , y) < E. Given x, apply Kronecker's theorem, Proposition 1.2.15, to obtain a least positive integer j = j(x) such that IT ] x Ok < 6/4. Since small changes in x do not change j(x) there is a number M such that j(x) < M for all x. With K = M N the lemma follows.
Proof

Now for the final details of the proof that the entropy of the (T, P)-process is 0. Given E > 0, determine M and K > M from the preceding lemma, let n > K+M and let z = O. The unit interval X can be partitioned into measurable sets Xi = Ix : j(x) = j}, j < M, where j(x) is given by the preceding lemma. Note that
dnj(Z,T i X) < E, X E X.

The number of binary sequences of length n j that can be obtained by changing the (n j)-name of z in at most c(n j) places is upper bounded by 2 (n i )h(E) , where

h(6) = doge (1 6) log(1 6), is the binary entropy function, according to the combinations bound, Lemma 1.5.4. There are at most 2i possible values for the first j places of any name. Thus an upper bound on the number of possible names of length n for members of Xi is

(7)

2(n Dh(e) 114 : x E Xill < 2 i

IA lnh(6)

Since there only M possible values for j, the number of possible rotation names of 0 length n is upper bounded by M2m2nh(), for n > K + M. Since h() > 0 as E 1=1 this completes the proof that the (T, 2)-process has entropy 0.

I.7.f Exercises.
1. Give a direct proof of the covering-exponent theorem, Theorem 1.7.4, based on the algorithm (Al ,A2) given for the computation of the covering number. (Hint: what does the entropy theorem say about the first part of the list?)
2. Show that limn (1/n) log.Arn (a, 6) exists, where

Nn (a, 6)

is defined in (1).

3. Let g be the concatenated-block process defined by a measure it on A n . Let H(A) denote the entropy of p,. (a) Show that Ann ( '2) < (P,) log n. + (b) Show that H,n (g) > mH(A). (Hint: there is at least one way and at most n1441 2n ways to express a given 4 in the form uw(l)w(2)... w(k 1)v, where each w(i) E An and u and v are the tail and head, respectively, of words w(0), w(k) E An.)

SECTION 1.8. STATIONARY CODING.

79

4. Use the covering exponent concept to show that the entropy of an ergodic nstationary process is the same as the entropy of the stationary process obtained by randomizing its start. 5. Let p, a shift-invariant ergodic measure on A z and let B c A z be a measurable set of positive measure. Show, using the entropy theorem, that the entropy of the induced process is 1/(A)//t(B). 6. Suppose the waiting times between occurrences of "1" for a binary process it are independent and identically distributed. Find an expression for the entropy of it, in terms of the entropy of the waiting-time distribution. 7. Let {Ifn } be the process obtained from the stationary ergodic process {X} by applying the block code CN and randomizing the start. Suppose {1',} is ergodic. Show that H({Y}) < 8. Show that if C: A n H B* is a prefix code with length function L then there is a prefix code e: A" 1* B* with length function E such that
(a) (x) <L(x) + 1, for all x'i' E A n
.

(b) 44) 5_ 2 2n log IA I, for all xri' E An. 9. Prove Proposition 1.7.18. 10. Let C be a prefix code on A with length function L. Associate with each a E A the dyadic subinterval of the unit interval [0, 1] of length 2 L(a ) whose left endpoint has dyadic expansion C (a). Show that the resulting set of intervals is disjoint, and that this implies the Kraft inequality (2). Use this idea to prove that if Ea 2 C(a) < 1, then there is prefix code with length function L(a). 11. Show, using Theorem 1.7.4, that the entropy-rate of an ergodic process and its reversed process are the same.

Section 1.8

Stationary coding.

Stationary coding was introduced in Example 1.1.9. Two basic results will be established in this section. The first is that stationary codes can be approximated by finite codes. Often a property can be established by first showing that it holds for finite codes, then passing to a suitable limit. The second basic result is a technique for creating stationary codes from block codes. In some cases it is easy to establish a property for a block coding of a process, then extend using the block-to-stationary code construction to obtain a desired property for a stationary coding.

I.8.a Approximation by finite codes.


The definition and notation associated with stationary coding will be reviewed and the basic approximation by finite codes will be established in this subsection. Recall that a stationary coder is a measurable mapping F: A z 1--> B z such that

F(TA X ) = TB F (x), VX

E AZ,

80

CHAPTER I. BASIC CONCEPTS.

where TA and TB denote the shifts on the respective two-sided sequence spaces A z and B z . Given a (two-sided) process with alphabet A and Kolmogorov measure ,U A , the encoded process has alphabet B and Kolmogorov measure A B = /LA o F. The associated time-zero coder is the function f: A z 1--* B defined by the formula f (x) = F (x )o. Associated with the time-zero coder f: Az 1---> B is the partition P = 19 (f) = {Pb: h E B} of A z , defined by Pb = {X: f(x) = 1, } , that is,
(1)
X E h

if and only if f (x) = b.

Note that if y = F(x) then yn = f (Tnx) if and only if Tx E Pyn . In other words, y is the (T, 2)-name of x and the measure A B is the Kolmogorov measure of the (T, P)process. The partition 2( f) is called the partition defined by the encoder f or, simply, the encoding partition. Note also that a measurable partition P = {Pb: h E BI defines an encoder f by the formula (1), such that P = 2(f ), that is, P is the encoding partition for f. Thus there is a one-to-one correspondence between time-zero coders and measurable finite partitions of Az . In summary, a stationary coder can be thought of as a measurable function F: A z iB z such that F oTA = TB o F , or as a measurable function f: A z 1--* B, or as a measurable partition P = {Pb: h E B } . The descriptions are connected by the relationships
f (x) = F (x)0 , F (x) n = f (T n x), Pb = lx: f (x) = b).

A time-zero coder f is said to be finite (with window half-width w) if f (x) = f (i), whenever .0'. = i" vu,. In this case, the notation f (xw u.,) is used instead of f (x). A key fact for finite coders is that
=

U {x, n 71 w wl ,
( x ,' u t u,' ) = .
, fl

and, as shown in Example 1.1.9, for any stationary measure ,u, cylinder set probabilities for the encoded measure y = ,u o F -1 are given by the formula

(2)

y()1) = it(F-1 [yi])

E it(xi +:) .
f (4 - :)=y' it

The principal result in this subsection is that time-zero coders are "almost finite", in the following sense.

Theorem 1.8.1 (Finite-coder approximation.) If f: Az 1--* B is a time-zero coder, if bt is a shift-invariant measure on A z , and if Az 1---> B such that E > 0, then there is a finite time-zero encoder

7:

1.i (ix: f (x) 0 7(x)}) _< c. Proof Let P = {Pb: b E B) be the encoding partition for f. Since p is a finite partition of A z into measurable sets, and since the measurable sets are generated by the finite cylinder sets, there is a positive integer w and a partition 1 5 = {Pb } with the following two properties
"../

(a) Each Pb is union of cylinder sets of the form

SECTION I8. STATIONARY CODING.

81

(h)
I../

Eb kt(PbA 7)1,) < E.


r',I 0.ss

Let f be the time-zero encoder defined by f(x) = b, if [xi'.] c Pb. Condition (a) assures that f is a finite encoder, and p,({x: f (x) 0 f (x))) < c follows from condition I=1 (b). This proves the theorem.

Remark 1.8.2
The preceding argument only requires that the coded process be a finite-alphabet process. In particular, any stationary coding onto a finite-alphabet process of any i.i.d. process of finite or infinite alphabet can be approximated arbitrarily well by finite codes. As noted earlier, the stationary coding of an ergodic process is ergodic. A consequence of the finite-coder approximation theorem is that entropy for ergodic processes is not increased by stationary coding. The proof is based on direct estimation of the number of sequences of length n needed to fill a set of fixed probability in the encoded process.

Theorem 1.8.3 (Stationary coding and entropy.)


If y = it o F-1 is a stationary encoding of an ergodic process ii then h(v)

First consider the case when the time-zero encoder f is finite with window half-width w. By the entropy theorem, there is an N such that if n > N, there is a set Cn c An i +w w of measure at least 1/2 and cardinality at most 2 (n2w+1)(h( A )+6) The image f (C n ) is a subset of 13 1 of measure at least 1/2, by formula (2). Furthermore, because mappings cannot increase cardinality, the set f (C n ) has cardinality at most 2(n+21D+1)(h(4)+E) . Thus 1 - log I f (Cn)i h(11) + 6 + 8n, n where Bn >. 0, as n -> oo, since w is fixed. It follows that h(v) < h(tt), since entropy equals the asymptotic covering rate, Theorem 1.7.4. In the general case, given c > 0 there is a finite time-zero coder f such that A (ix: f (x) 0 f (x)}) < E. Let F and F denote the sample path encoders and let y = it o F -1 and 7/ = ki o F -1 denote the Kolmogorov measures defined by f and f, respectively. It has already been established that finite codes do not increase entropy, so that h(71) < h(i). Thus, there is an n and a collection 'e" c Bn such that 71(E) > 1 - E and II Let C = [C], be the 6-blowup of C, that is,
Proof

C = { y: dn(Yi, 3711 ) 5_

E,

for some

3,T;

E '

el

The blowup bound, Lemma 1.7.5, implies that

ICI < where


3(E) = 0(E).

Let

15 = {x: T(x)7

El} and D = {x: F(x)7

CI

be the respective pull-backs to AZ. Since ki (ix: f (x) 0 7(x)}) < _ E2 , the Markov inequality implies that the set G = {x: dn (F(x)7,F(x)7) _< cl

82

CHAPTER I. BASIC CONCEPTS.

has measure at least 1 c, so that p(G n 5 ) > 1 2E, since ,u(15) = 75(e) > 1 E. By definition of G and D, however, Gniic D, so that

e > 1 2c. v(C) = g(D) > p(G n ij)


The bound ICI < 2n (h(A )+E+3(E )) , and the fact that entropy equals the asymptotic covering rate, Theorem 1.7.4, then imply that h(v) < h(u). This completes the proof of Theorem 1.8.3.

Example 1.8.4 (Stationary coding preserves mixing.)


A simple argument, see Exercise 19, can be used to show that stationary coding preserves mixing. Here a proof based on approximation by finite coders will be given. While not as simple as the earlier proof, it gives more direct insight into why stationary coding preserves mixing and is a nice application of coder approximation ideas. The a-field generated by the cylinders [4], for m and n fixed, that is, the cr-field generated by the random variables Xm , Xm+ i, , X n , will be denoted by E(Xm n ). As noted earlier, the i-fold shift of the cylinder set [cini n ] is the cylinder set T[a] = [c.:Ln ], where ci +i = ai, m < j <n. Let v = o F-1 be a stationary encoding of a mixing process kt. First consider the case when the time-zero encoder f is finite, with window width 2w + 1, say. The coordinates yri' of y = F(x) depend only on the coordinates xn i +:. Thus for g > 2w, the intersection [Yi ] n [ym mLg: i n] is the image under F of C n Tm D where C and D are both measurable with respect to E(X7 +:). If p, is mixing then, given E > 0 and n > 1, there is an M such that

jp,(c n

D) ,u(C),u(D)I < c, C, D E

Earn,
El in

m > M,

which, in turn, implies that

n [y:Z +1]) v([yi])v(Ey:VDI

M, g > 2w.

Thus v is mixing. In the general case, suppose v = p, o F -1 is a stationary coding of the mixing process with stationary encoder F and time-zero encoder f, and fix n > 1. Given E > 0, and a7 and Pi', let S be a positive number to be specified later and choose a finite encoder with time-zero encoder f, such that A({x: f(x) f . Thus,

col) <S

pax: F(x)7

fr(x)71)

/tax:

f (T i x)

f (Ti x)i)

f(x)

f (x)}), < n3,

by stationarity. But n is fixed, so that if (5 is small enough then v(4) will be so close to (a) and v(b7) so close to i(b) that

(3)

Iv(a7)v(b7) j) (a7) 13 (bi)I

E/ 3 , P(x)21:71) < 2n8,

for all ar il and b7. Likewise, for any m > 1,

p,

F(x)7

P(x)7, or F(x)m m -Vi

SECTION 1.8. STATIONARY CODING.


so that (4)

83

Ivaan n T - m[b7]) j)([4] n T -m[b71)1

6/3,

provided only that 8 is small enough, uniformly in m for all ail and b7. Since 1'5 is mixing, being a finite coding of ,u, it is true that

n T -m[bn) i)([4])Dab7DI < E/3,


provided only that m is large enough, and hence, combining this with (3) and (4), and using the triangle inequality yields I vga7] n T -ni[b7]) vga7])v([14])1 < c, for all sufficiently large m, provided only that 8 is small enough. Thus, indeed, stationary coding preserves the mixing property.

I.8.b

From block to stationary codes.

As noted in Example 1.1.10, an N-block code CN: A N B N can be used to map an A-valued process {X} into a B-valued process {Y} by applying CN to consecutive nonoverlapping blocks of length N, i iN+1
v (f-1-1)N
-

ii(j+1)N)
iN+1

j = 0, 1, 2, ...

If {X,} is stationary, then a stationary process {4} is obtained by randomizing the start, i.e., by selecting an integer U E [1, N] according to the uniform distribution and defining = Y+-i, j = 1, 2, .... The final process {Yn } is stationary, but it is not, except in rare cases, a stationary coding of {X}, and nice properties of {X,}, such as mixing or even ergodicity, may get destroyed. A method for producing stationary codes from block codes will now be described. The basic idea is to use an event of small probability as a signal to start using the block code. The block code is then applied to successive N-blocks until within N of the next occurrence of the event. If the event has small enough probability, sample paths will be mostly covered by nonoverlapping blocks of length exactly N to which the block code is applied.

Lemma 1.8.5 (Block-to-stationary construction.)


Let 1,1, be an ergodic measure on A z and let C: A N B N be an N-block code. Given c > 0 there is a stationary code F = Fc: A z A z such that for almost every X E A Z there is an increasing sequence { ni :i E Z}, which depends on x, such that Z= ni+i) and (i) ni+1ni < N, i E Z. (ii) If .I is the set of indices i such that [ni , +1) c [n, n] and ni+i n i < N, then ihn supn (1/2n)E iEjn ni+ 1 ni < c, almost surely.
(iii) If n i+1 ni = N then y: n n +'- ' = C(xn:'), n -1 where y = F(x).

84

CHAPTER I. BASIC CONCEPTS.

Proof Let D be a cylinder set such that 0 < ,a(D) < 6/N, and let G be the set of all x E A z for which Tmx E D for infinitely many positive and negative values of m. The set G is measurable and has measure 1, by the ergodic theorem. For x E G, define mo = mo(x) to be the least nonnegative integer m such that Tx E D, then extend to obtain an increasing sequence mi = m i (x), j E Z such that Tinx E D if and only if m = mi, for some j. The next step is to split each interval [mi, m 1+1) into nonoverlapping blocks of length N, starting with mi, plus a final remainder block of length shorter than N, in case mi+i mi is not exactly divisible by N. In other words, for each j let qi and ri be nonnegative integers such that mi +imi = qiN d-ri, 0 <ri <N, and form the disjoint collection 2-x (mi) of left-closed, right-open intervals [m i , mi N), [m 1 d-N,m i +2N), [miqiN, mi+i ).

All but the last of these have length exactly N, while the last one is either empty or has length ri <N. The definition of G guarantees that for x E G, the union Ui/x (mi) is a partition of Z. The random partition U i/x (m i ) can then be relabeled as {[n i , ni+1 ), i E Z}, where ni = ni(x), i E Z. If x G, define ni = i, i E Z. By construction, condition (i) certainly holds for every x E G. Furthermore, the ergodic theorem guarantees that the average distance between m i and m ii is at least N/6, so that (ii) also holds, almost surely. The encoder F = Fc is defined as follows. Let b be a fixed element of B, called the filler symbol, and let bj denote the sequence of length 1 < j < N, each of whose terms is b. If x is sequence for which Z = U[ni , ni+i), then y = F (x) is defined by the formula bni+i n, if n ii ni <N yn n :+i 1 = if n i+i n i = N. C (4 +1-1 )

This definition guarantees that property (iii) holds for x E G, a set of measure 1. For x G, define F(x) i = b, i E Z. The function F is certainly measurable and satisfies F(TAX) = F(x), for all x. This completes the proof of Lemma 1.8.5. The blocks of length N are called coding blocks and the blocks of length less than N are called filler or spacer blocks. Any stationary coding F of 1,1, for which (i), (ii), and (iii) hold is called a stationary coding with 6-fraction of spacers induced by the block code CN. There are, of course, many different processes satisfying the conditions of the lemma, since there are many ways to parse sequences so that properties (i), (ii), and (iii) hold, for example, any event of small enough probability can be used and how the spacer blocks are coded is left unspecified in the lemma statement. The terminology applies to any of these processes. Remark 1.8.6 Lemma 1.8.5 was first proved in [40], but it is really only a translation into process language of a theorem about ergodic transformations first proved by Rohlin, [60]. Rohlin's theorem played a central role in Ornstein's fundamental work on the isomorphism problem for Bernoulli shifts, [46, 63].

SECTION 1.8. STATIONARY CODING.

85

I.8.c

A string-matching example.

The block-to-stationary code construction of Lemma 1.8.5 provides a powerful tool for making counterexamples. As an illustration of the method a string matching example will be constructed. The string-matching problem is of interest in DNA modeling and is defined as follows. For x E An let L(4) be the length of the longest block appearing at least twice in x1% that is, L(x

) = max{k:

+k i xss+

xttk, for some O <s<t<nk} .

The problem is to determine the asymptotic behavior of L(4) along sample paths drawn from a stationary, finite-alphabet ergodic process. For i.i.d. and Markov processes it is known that L(x7) = 0(log n), see [26, 2]. A proof for the i.i.d. case using coding ideas is suggested in Exercise 2, Section 11.1, and extended to a larger class of processes in Theorem 11.5.5. A question arose as to whether such a log n type bound could hold for the general ergodic case, at least if the process is assumed to be mixing. A negative solution to the question was provided in [71], where it was shown that for any ergodic process {X,} and any positive function X(n) for which n -1 1,.(n) > 0, there is a nontrivial stationary coding, {Yn }, of {X}, and an increasing unbounded sequence {ni I, such that

L(Y;') > 1, almost surely. lim sup i-400 (ni)


In particular, if the start process {X n } is i.i.d, the {Y} process will be mixing. It can, furthermore, be forced to have entropy as close to the entropy of {X n } as desired. The proof for the binary case and growth rate (n) = n 112 will be given here, as a simple illustration of the utility of block-to-stationary constructions. The first observation is that if stationarity is not required, the problem is easy to solve merely by periodically inserting blocks of O's to get bad behavior for one value of n, then nesting to obtain bad behavior for infinitely many n. To make this precise, for each n > 64, define the function Cn : A n H {0, 1 " by changing the first 4n 1 i2 terms into all O's and the final 4n 1 12 terms into all O's, that is, Cn (4) = y;', where
}

(5)

0 Yi = 1 xi

i <4n 112 or i > n 4n 112 otherwise.

The n-block coding {I'm ) of the process {X,} defined by

yin
clearly has the property

on+1 = cn(X in o - -on+1) , j ?- 1,


-Fin) > n 1/2 , uyss+

for any starting place s, since two disjoint blocks of O's of length at least n 112 must appear in any set of n consecutive places. The construction is iterated to obtain an infinite sequence of such n's, as follows. Let

pc }
.

fy ,
.

indicate that {Y,} is the process obtained from {X,,,} by the encoding 17(jin 1)n+1 = cn(X; 1)/2+1), j > 1, where C is the zero-inserter defined by the formula (5). Let

86

CHAPTER I. BASIC CONCEPTS.

{n i : i > 11 be an increasing sequence of integers, to be specified later, and define the sequence of processes by starting with any process {Y(0)} and inductively defining {Y(i)} for i > 1 by the formula {Y --4Ecn {Y(i)m}.

For any start process {Y(0),,}, the i-th iterate {Y(i) m } has the property

L(Y(i): +7') > n/2 , 1 < j <


for all s, since O's inserted at any stage are not changed in subsequent stages. In particular, each coordinate is changed at most once, which, in turn, guarantees that a limit process 1/2 {Y(oo),} exists for which L(Y(oo) :, + 71 ) > n. , j > 1. Furthermore, the final process has a positive frequency of l's, provided the n i increase rapidly enough. The preceding construction is conceptually quite simple but clearly destroys stationarity. Randomizing the start at each stage will restore stationarity, but mixing properties are lost, and it is not easy to see how to pass to a limit process. An appropriate use of block-to-stationary coding, Lemma 1.8.5, at each stage converts the block codes into stationary codings, which will, in turn, guarantee that the final limit process is a stationary coding of the starting process with the desired string-matching properties, hence, in particular, mixing is not destroyed. The rigorous construction of a stationary coding is carried out as follows. For n > 64, let Cn be the zero-inserter code defined by (5), let {n i } be an increasing sequence of natural numbers, and let {ci } be a decreasing sequence of positive numbers, both to be specified later. Given the ergodic process {X n } (now thought of as a two-sided process, since stationary coding will be used), the sequence of processes {Y(i)} is defined by setting {Y(0)} = {X,n } and inductively defining {Y(i) m } for i > 1 by the formula (cn ,E,) {YU l)m} -14 {Y(i)ml,

where the notation indicates that {Y(j) m } is constructed from {Y(i 1),n } by stationary coding using the codebook Cn, with a limiting Et-fraction of spacers. At each stage the same filler symbol, b = 0, will be used, so that the code at each stage never changes a 0 into a 1, that is, for all m E Z and i > 1,
(6) Y(i 1) m = 0 =z= Y (i) m = O.

Furthermore, the choice of filler means that L(Y(i)?`) > rt,1./2 , since this always occurs if Y(i)?' is the concatenation uv, where u is the end of a codeword and v the beginning of a codeword, and long matches are even more likely if Y(i)?' is the concatenation uf v, where u is the end of a codeword, y is the beginning of a codeword, and f is filler. Combining this observation with the fact that O's are not changed, (6), yields (7) L(Y(i)7 J ) > ni ll2 , < j.

Since stationary codings of stationary codings are themselves stationary codings, each process {Y(i) m } is a stationary coding of the original process {X,n }. For each i > 1, let Fi denote the stationary encoder Fi : {0, 11 z {0, 1} z that maps {X,n } to {Y(i) m }. Property (6) guarantees that no coordinate is changed more than once, and hence there is a limit code F defined by the formula [F(x)]m = lim[F i (x)] m , mEZ, xE 0, UZ
{

SECTION 1.9. PROCESS TOPOLOGIES.

87

The limit process {Y(oo)z } defined by Y(oo),n = lim Y(i), m E Z is the stationary coding of the initial process {X,n } defined by the stationary encoder F. Since (7) hold j > 1. Furthermore, for each i, the limit process has the property L(Y(oo)) > the limiting density of changes can be forced to be as small as desired, merely by making the ni increase rapidly enough and the Ei go to zero rapidly enough. In particular, the entropy of the final process will be as close to that of the original process as desired. The above is typical of stationary coding constructions. First a sequence of block codes is constructed to produce a (nonstationary) limit process with the properties desired, then Lemma 1.8.5 is applied to produce a stationary process with the same limiting properties.

I.8.d

Exercises.

1. Show that if t is mixing and 8 > 0, then the sequence {ni(x)} of Lemma 1.8.5 can be chosen so that Prob(x 1 ' = an < 8 , for each al iv E AN . (Hint: replace D by TD for n large.) 2. Suppose C: A N B N is an N-code such that Ei,(dN(x fv , C(4)) < E, where it is mixing. Show that there is a stationary code F such that limn Ett (dn (4, F(x)7) < < E 1-1(un o 2E and such that h(o., o (Hint: use the preceding exercise.)

Section 1.9 Process topologies.


Two useful ways to measure the closeness of stationary processes are described in this section. One concept, called the weak topology, declares two processes to be close if their joint distributions are close for a long enough time. The other concept, called the dtopology, declares two ergodic processes to be close if only a a small limiting density of changes are needed to convert a typical sequence for one process into a typical sequence for the other process. The weak topology is separable, compact, and easy to describe, but it has two important defects, namely, weak limits of ergodic processes need not be ergodic and entropy is not weakly continuous. On the other hand, entropy is d-continuous and the class of ergodic processes is d-closed, as are other classes of interest. The d-metric is not so easy to define, the d-distance between processes is often difficult to calculate, and the d-topology is nonseparable. Many of the deeper recent results in ergodic theory depend on the d-metric, however, and it plays a central role in the theory of stationary codings of i.i.d. processes, a theory to be presented in Chapter 4.

I.9.a

The weak topology.

p(x).

The collection of (Borel) probability measures on a compact space X is denoted by Of primary interest are the two cases, X = Anand X = A", where A is a finite set. The collection P(A") is just the collection of processes with alphabet A. The subcollection of stationary processes is denoted by P s (A") and the subcollection of stationary ergodic processes is denoted by 7'e (A). A sequence {,u (n) } of measures in P(A") converges weakly to a measure it if

lim p, (") (a lic) =


n-4.00

88

CHAPTER I. BASIC CONCEPTS.

for all k and all'. The weak topology on P(A") is the topology defined by weak convergence. It is the weakest topology for which each of the mappings p ,u(C) is continuous for each cylinder set C. The weak topology is Hausdorff since a measure is determined by its values on cylinder sets. The weak topology is a metric topology, relative to the metric D(g, y) where ,) v(a 101, iii - vik = E liu(a/1
at

denotes the k-th order distributional (variational) distance. If k is understood then I I may be used in place of I lk. The weak topology is a compact topology. Indeed, for any sequence {,u.1) } of probability measures and any all', the sequence {,u,1) (a)} is bounded, hence has a convergent subsequence. Since there are only countably many 4, the usual diagonalization procedure produces a subsequence {n i } such that { 0 (a)} converges to, say, p,(4), as i > oc, for every k and every 4. It is easy to check that the limit function p, is a probability measure. Furthermore, the limit p, is stationary if each A (n ) is stationary, so the class of stationary measures is also compact in the weak topology.
The weak limit of ergodic measures need not be ergodic. One way to see this is to note that on the class of stationary Markov processes, weak convergence coincides with convergence of the entries in the transition matrix and stationary vector. In particular, if ,u (n) is the Markov process with transition matrix

M= n

[ 1 1/n 1/n

1/n 1 1 1/n j

then p weakly to the (nonergodic) process p. that puts weight 1/2 on each of the two sequences (0, 0, ...) and (1, 1, ...). A second defect in the weak topology is that entropy is not weakly continuous. For example, for each n, let x(n) be the concatenation of the members of {0, 1}n in some order and let p,2) be the concatenated-block process defined by the single n2'1 -block x(n). The process p, (n ) has entropy 0, yet lim ,u (n) (4) = 2-k , V k,

so that {,u (n) } converges weakly to the coin-tossing process, which has entropy log 2. Entropy is weakly upper semicontinuous, however. Theorem 1.9.1 If p (n ) converges weakly to p, and each ,u(n) is stationary, then lim sup, H(a) < H(p). Proof For any stationary measure p, let Hi,(X01X: ) denote the conditional entropy of X0 relative to X=1, where X k is distributed according to p, and recall that 111,(X 0 1X:1) decreases to H(,u,) as k > oc. Thus, given, E > 0 there is a k such that HA (X0I < H(p) . Now suppose ,u (n ) converges weakly to p.. Since Ho (Xo l XII) depends continuously on the probabilities ,u(a k i +1 ), it must be true that
1-1/,(0(X 0 IX11) < H(A)-1- 2e,

SECTION 1.9. PROCESS TOPOLOGIES.

89

for all sufficiently large n. But if this is true for some n, then H(g, (n) ) < H(,u,) + 2E, LI since Ii(g (n ) ) < 1-11,(n)(X01X11) is always true. This proves Theorem 1.9.1.

I.9.b

The d-metric.

The d-distance between two ergodic processes is the minimal limiting (upper) density of changes needed to convert a typical sequence for one process into a typical sequence for the other. A precise formulation of this idea and some equivalent versions are given in this subsection. The per-letter Hamming distance is defined by
1 x---N n O d(4 , y ) = d(xi, yi), where d(a,b) = { 1 n i =1

a=b a 0 b. This extends to the

It measures the fraction of changes needed to convert 4 into pseudometric ci(x, y) on A, defined by
d(x, y) = lirn sup d(4 , y;i)

Thus ci(x, y) is the limiting (upper) density of changes needed to convert the (infinite) sequence x into the (infinite) sequence y, that is, a(x, y) is the limiting per-letter Hamming distance between x and y. Let T(g) denote the set of frequency-typical sequences for the stationary process it, that is, the set of all sequences for which the limiting empirical measure is equal to p,. By the typical-sequence theorem, Theorem 1.4.1, and its converse, Theorem 1.4.2, the measure p, is ergodic if and only if g(T(g)) = 1. The (minimal) limiting (upper) density of changes needed to convert y into a g-typical sequence is given by
d(y, , T (p,)) = inf{d(x, y): x
E T(11,)}.

Theorem 1.9.2 If p, and y are ergodic processes, then d( y,T (p)) is v-almost surely constant.

Proof The function f (y) = 41(y ,T (p)) is invariant since ii(y, , x) = d(T y, Tx), hence f (y) is almost surely constant. The constant of the theorem is denoted by d(g, y) and is called the d-distance between the ergodic processes it and v. Note, in particular, that for v-almost every v-typical sequence y, the d-distance between p, and y is the limiting density of changes needed to convert y into a g-typical sequence. (The fact that d(g, v), as defined, is actually a metric on the class of ergodic processes is a consequence of an alternative description to be given later.)
Example 1.9.3 (The d-distance for i.i.d. processes.) To illustrate the d-distance concept the distance between two binary i.i.d. processes will be calculated. For 0 < p < 1, let g (P) denote the binary i.i.d. process such that p is the probability that X 1 = 1. Suppose 0 < p <q < 1/2, let x be a ,u (P) -typical sequence, and let y be a g (q) -typical sequence. The sequence x contains a limiting p-fraction of l's while the sequence q contains a limiting q-fraction of l's, so that x and y must

90

CHAPTER I. BASIC CONCEPTS.

disagree in at least a limiting (g p)-fraction of places, that is, a(x, y) > g p. Thus it is enough to find, for each p9' ) -typical x, one ,u (q ) -typical y such that ii(x, y) = g p. Fix a ,u (P ) -typical x, and let r be the solution to the equation g = p(1 r) (1 p)r . Let Z be an i.i.d. random sequence with the distribution of the Afr ) -process and let Y = x ED Z, where e is addition modulo 2. For any set S of natural numbers, the values {Zs : s E SI are independent of the values {Zs : s 0 S}, and hence, Y is ,u (q ) -typical, for almost any Z. Since Z is also almost surely p (r ) -typical, there is therefore at least one ,u (r) -typical z such that y = x Ef) z is ,)typical. Thus there is indeed at least one ,u, (q ) -typical y such that cl(x, y) = g p, and hence ii(A (P) , p (q ) ) = g p. d -distance This result is also a simple consequence of the more general definition of . to be given later in this section, see Exercise 5. Example 1.9.4 ( Weakly close, d-far-apart Markov chains.) As another example of the d-distance idea it will be shown that the d-distance between two binary Markov chains whose transition matrices are close can nevertheless be close to 1/2. Let p, and y be the binary Markov chains with the respective transition matrices

M.[

1p [0 1 1 N= p 1' 1

where p is small, say p = 10-50 . The y-measure is concentrated on the alternating sequence y = 101010... and its shift Ty. On the other hand, a typical /L-sequence x has blocks of alternating O's and l's, separated by an extra 0 or 1 about every 1050 places. Thus d- (x, y) 1/2, since about half the time the alternation in x is in phase with the alternation in y, which produces no errors, and about half the time the alternations are out of phase, which produces only errors. Likewise, a(x, Ty) 1/2, so that ci(x , 7" (v)) 1/2, for ,u-almost all x. (With a little more effort it can be shown that d(,u, y) is exactly 1/2; see Exercise 4.) These results show that the d-topology is not the same as the weak topology, for kt converges weakly to y as p > 0. Later it will be shown that the class of mixing processes is a-closed, hence the nonmixing process v cannot be the d-limit of any sequence of mixing processes.

1.9.b.1

The joining definition of d.

The definition of d-distance given by Theorem 1.9.2 serves as a useful guide to intuition, but it is not always easy to use in proofs and it doesn't apply to nonergodic istance as a processes. To remedy these defects an alternative formulation of the d-d limit of a distance dn un , vn ) between the corresponding n - th order distributions will be developed. This new process distance will also be denoted by d(A, v). In a later subsection it will be shown that this new process distance is the constant given by Theorem 1.9.2.
(

A key concept used in the definition of the 4-metric is the "join" of two measures. A joining of two probability measures A and y on a finite set A is a measure A. on A x A that has ,u and y as marginals, that is,

(1)

,a(a) =

E X(a, b); v(b) = E X(a, b).


b

A simple geometric model is useful in thinking about the joining concept. First each of the two measures, kt and y, is represented by a partition of the unit square into horizontal

SECTION 1.9. PROCESS TOPOLOGIES.

91

strips, with the width of a strip equal to the measure of the symbol it represents. The joining condition ,u(a) = Eb X(a, 6) means that, for each a, the ,u-rectangle corresponding to a can be partitioned into horizontal rectangles {R(a,b):b E A} such that R(a,b) has area X(a, b). The second joining condition, v(b) = Ea X.(a, 6), means that the total mass of the rectangles {R(a, b): a E A} is exactly the y-measure of the rectangle corresponding to b. (See Figure 1.9.5.) In other words, a joining is just a rule for cutting each A-rectangle into subrectangles and reassembling them to obtain the y-rectangles.

R(0,a) R(0,c) R(1,a)

R(0,a)

a
R(1,a) R(2,a) R(1b) R(1,b) R(2a)

R(2,b) R(2c)

R(2,b) R(0,c) R(2,c)

Figure 1.9.5 Joining as reassembly of subrectangles. Of course, one can also think of cutting up the y-rectangles and reassembling to give the A-rectangles. Also, the use of the unit square and rectangles is for simplicity only, for a finite distribution can always be represented as a partition of any nonatomic probability space into sets whose masses are given by the distribution. (A space is nonatomic if any subset of positive measure contains subsets of any smaller measure.) A representation of p, on a nonatomic probability space (Z, a) is just a measure-preserving mapping ct. from (Z, a) to (A, A), that is, a measurable mapping such that a(0 -1 (a)) = p(a), a E A. A joining X can then be represented as a pair of measure-preserving mappings, (/) from (Z, a) to (A, A), and ik from (Z, a) to (A, y), such that X is given by the formula

(2)

X(a i , ai )

a (q5-1 (ai )

n ( - 1 (c)).

Turning now to the definition of y), where p, and v are probability measures on An , let J(11, y) denote the set of joinings of p, and v. The 4-metric is defined by cin(ti, V) = min

Ejdn(4 ,

where ex denotes expectation with respect to X. The minimum is attained, since expectation is continuous with respect to the distributional distance IX 5,1In, relative to which Jn(Itt, V) is a compact subset of the space 7'(A x A n ). A measure X on A n x A n is said to realize v) if it is a joining of A and y such that E Vd n (x7, yti')) = cin v). The function 4(A, y) satisfies the triangle inequality and is strictly positive if p, y, hence is a metric on the class P(Afl) of measures on A", see Exercise 2.

92

CHAPTER I. BASIC CONCEPTS.

The d-distance between two processes is defined for the class P(A) of probability measures on A" by passing to the limit, that is,

d(it, Y) = lirn sup an(An, un).


n).co

Note that d(p,, y) is a pseudometric on the class P(A), since d, is a metric on and hence d defines a topology on P(A"). In the stationary case, the limit superior is, in fact, both the limit and the supremum.

Theorem 1.9.6 If i and v are stationary then


C1 (1-1 , v) = SUP dn (An, vn) = limiin(An.vn)

Proof The proof depends on a superadditivity inequality, which is, in turn, a consequence of the definition of dn as an minimum, namely,

(3)

vilz) m

(p,:tr,, v'nV) < ( n + m) di,n (1t7+m , q+m),

where pi denotes the measure induced by p, on the set AL. The proof of this is left to Exercise 3. In the stationary case, p: + 41 = pr, so the preceding inequality takes the form nd + mdm < n + m)dn+m,
(

which implies that the limsup is in fact both a limit and the supremum. This establishes Theorem 1.9.6. LI A consequence of the preceding result is that the d-pseudometric is actually a metric on the class P(A) of stationary measures. Indeed, if d(p,, v) = 0 and p and y are stationary, then, since in the stationary case the limit superior is a supremum, dn (p,n , un ) must be 0, for all n, which implies that A = vn for all n, since iin(A, y) is a metric, Exercise 2, which, in turn, implies that p = v. The d-distance between stationary processes is defined as a limit of n-th order dndistance, which is, in turn, defined as a minimization of expected per-letter Hamming distance over joinings. It is useful to know that the d-distance can be defined directly in terms of stationary joining measures on the product space A" x A". A joining A of p and v is just a measure on A x A" with p, and y as marginals, that is,

p(B) = A.(B x

v(B) = X(A" x B), B

E.

The set of all stationary joinings of p and y will be denoted by Js(p,, y).

Theorem 1.9.7 (Stationary joinings and d-distance.) If p, and v are stationary then
(4) cl(p, v) =
min ex(d(xi, yl)).

SECTION 1.9. PROCESS TOPOLOGIES.

93

Proof The existence of the minimum follows from the fact that the set .1s(p., y) of stationary joinings is weakly closed and hence compact, and from the fact that e ,k (d(x i , Yi)) is weakly continuous, for it depends only on the first order distribution. It is also easy to see that d cannot exceed the minimum of Ex (d(x l , yi)), for if A E Js(, y) then stationarity implies that

ex(d(xi, yl)) = ex n (dn (4, y rit )), and the latter dominates an (IL, y), since A n belongs to -In(un, va). It takes a bit more effort to show that d cannot be less than the right-hand side in (4). Towards this end, for each n choose A n E Jn(tin, va), such that
an(An, vn) = ex(c1,(4 ,

The goal is to construct a stationary measure A E y) from the collection {A}, such that ex (d(Xi , yi ) = lima Cin (tt. v). This is done by forming, for each n, the concatenatedblock process defined by A n , then taking a weakly convergent subsequence to obtain the desired limit process . The details are given in the following paragraphs. For each n, let ),.01) denote the concatenated-block process defined by An. As noted in Exercise 8 of Section 1.1, the measure 7,(n) is given by the formula

(5)

i "(n) (B) =

n 1=1

where A. (n) is the measure on (A x A ) defined by A((4, yr)) = K-1


(

k=0

(t v knn -l"n kk"kn-I-1

.kn-Fn
Ykn+1 //

t( y Kn+r
"n kk""Kn-1-1

)1(n+1 , / ,

if N = Kn+r, 0<r <n.The definition of

A (n)

implies that if 0 < i < n j then

x A c )) = Xn(T -` [4] x A) = tingi [xn) = since An has A n as marginal and it is shift invariant. This, together with the averaging formula, (5), implies that lim,. (") (C x = A(C), for any cylinder set C, which proves the first of the following equalities. (a) (6) (b) (c)
n-+co

lim 5:(n)(B x lim 3.(n) (A

= bt(B), B
x B) = ,u(B), B

E E

E. E.
V).

lim e-)(d(xi, Yi)) = litridn (tt, V) = Ci(p.,

The proof of (b) is similar to the proof of (a). To prove (c), let A.7 be projection of A n onto its i-th coordinate, that is, the measure on A x A defined by A.7 (a, b) =

(uav, u'bvi),

94
and note that

CHAPTER L BASIC CONCEPTS.

ex n (dn(x tiz , y7)) =


since 3:(n) (a, b) = (1/n)

1
n

x,,y,

E d (xi , Yi)X7 (xi , yi) = e7.0)(d (xi , yi)),


This proves (c), since cin(A, u) = e), (dn(4 )11 ))

Ei x7 (a, b).

To complete the proof select a convergent subsequence of 1-(n) . The limit X must be stationary, and is a joining of h and y, by conditions (6a) and (6b). Thus X E Js(h, y). Condition (6c) guarantees that ii(tt, y) = EA,(d(xi, Yi)).

I.9.b.2

Empirical distributions and joinings.

The pair (x, yril) defines three empirical k-block distributions. Two of these, pk (af 14) and pk (biny), are obtained by thinking of 4 and Ai as separate sequences, and the third, pk ((alic , 14)1(4 , yril)), is obtained by thinking of the pair (4 , y) as a single sequence. A simple, but important fact, is that plc ( 1(4 , ) ) is a joining of pk (. and plc ( ly7), since, for example, each side of

E(n k

1)p((af, b 101(x7,

= (n k

1)p(aii` I x7)

counts the number of times af appears in x7. Note also, that (7) dn (x, yri') = E p , ( . 07 , y,p) (d(a, b)).

Limit versions of these finite ideas as n oc will be needed. One difficulty is that limits may not exist for a given pair (x, y). A diagonalization argument, however, yields an increasing sequence fn ./ J such that each of the limits

(8)

lim p(afix),
I ICX)

j>00

lim p((a , bk i )1(x n i1,

)),

Fix such a n1 and denote the Kolmogorov measures exists for all k and all al` and defined by the limits in (8) as TI,(. 1x), "i.j'e ly), and 3.- (. 1(x, y)), respectively. By dropping to a further subsequence, if necessary, it can be assumed that the limit i i )) , y) = lim dni ((x in i , yn also exists. The limit results that will be needed are summarized in the following lemma. Lemma 1.9.8 (The empirical-joinings lemma.) The measures 171,1), and): are all stationary, and X' is a joining of g, and more, ir(x , y) = E(d(ai , b1 )).
Proof Note that Further-

M.

{ }

0< (n k + 2) p(414) (n

E p(a'nfi!) <1,
al

since the only difference between the two counts is that a' 2` might be the first k 1 terms of 4, hence is not counted in the second term. Passing to the limit it follows that

SECTION 1.9. PROCESS TOPOLOGIES.

95

is stationary. Likewise, "if and 1- are stationary. The joining result follows from the 0 corresponding finite result, while the d-equality is a consequence of (7). The empirical-joinings lemma can be used in conjunction with the ergodic decomposition to show that the d-distance between ergodic processes is actually realized by an ergodic joining, a strengthening of the stationary-joining theorem, Theorem 1.9.7.

Theorem 1.9.9 (Ergodic joinings and a-distance.) If ,u and y are ergodic then there is an ergodic joining A of ,u and y such that Au, y) = ex(d(xi , yi))
Proof Let A be a stationary joining of g and y for which c/(u, y) = e),(d(xi let
yi)),

and

(9)

A=

A(x ) da)(71- (x, y))

be the ergodic decomposition of A, as given by the ergodic decomposition theorem, Theorem 1.4.10. In this expression it is the projection onto frequency-equivalence classes of pairs of sequences, co is a probability measure on the set of such equivalence classes, is the empirical measure defined by (x, y) E A" x A', and A., (,, y) is ergodic for almost every (x, y). Furthermore, since is a stationary joining of p and y, it is concentrated on those pairs (x, y) for which x is g-typical and y is v-typical. Since, by definition of X,(,, )) , the pair (x, y) is 4( X ,)-typical, it follows that
X

),

T (v), (x, y)

(A,7(x,y))

for A-almost every pair (x, y). But if this holds for a given pair, then X 7r(x. ,) must be a stationary joining of g and y, by the empirical-joinings lemma, Lemma 1.9. 8 - , and hence d(u, y) < E),( , ) (d(xi, yi)). Thus,

(10)

d(u, y) < Ex(x ,y) (d(x i , y i )), A-almost surely.

The integral expression (9) implies, however, that

x(d(xi

yi)) =

yl))

da)(7(x, y)),

since d(xi, yl) depends only on the first order distribution and expectation is linear. This, combined with (10), means that

a(i.t, y) = EA..(x.y) (d(xi, yl )), -almost surely,


since it was assumed that dCu, = ex(d(xi, 371)). But, A(x) is ergodic and is a joining of ,u and v, for -almost every pair (x, y). This establishes Theorem 1.9.9. Li Next it will be shown that the d-distance defined as a minimization of expected perletter Hamming distance over joinings, Theorem 1.9.7, is, in the ergodic case, the same as the constant given by Theorem 1.9.2.

96

CHAPTER I. BASIC CONCEPTS.

Theorem 1.9.10 (Typical sequences and d-distance.) If ,u and v are ergodic then d(,u, v) = d(y, T (A)), v-almost surely. Proof First it will be shown that
(11) (x, y) E T(A) x T (v) d(p,, v) d(x , y).

Let x be ti-typical and y be v-typical, and choose an increasing sequence {n 1 } such that )) converges to a limit A,(a, bt) for each k and each al' and bt, and h1 , p((at, blf)1(x n oc. The empirical-joinings so that dnj (x l i , y 1 ') converges to a limit d(x, y), as j lemma implies that is stationary and has p, and v as marginals, and hence
d(p, v) E-i- (d(Xi, Y1)).

Since the latter is equal to 2( x, y), also by the empirical-joinings lemma, and 2(x, y) < d(x , y), the proof of (11) is finished. Next it will be shown that there is a subset X c A of v-measure 1 such that
(12) d(y,T (p)) = d(p., v), V y E X.

Here is why this is so. - Theorem 1.9.9 implies that there is an ergodic joining for v). The ergodic theorem guarantees that for A-almost all which e(d(x i , y i )) = pairs (x, y), the following two properties hold.
(a) cln (4 , y';') =
n

E(d(xi,Y1))=a(u,v). i=1

(b) (x, y) E TOO T(i)

T (v).

In particular, there is a set X c A' such that v(X) = 1 and such that for y E X there is at least one x E T(A) for which both (a) and (b) hold. This establishes the desired result (12). The lower bound result, (11), together with the almost sure limit result, (12), yields LIJ the desired result, Theorem 1.9.10.

I.9.b.3 Properties and interpretations of the d-distance.


As earlier, the distributional (or variational) distance between n-th order distributions is given by Ii vin = Elp, (a ) v(a7)I and a measure is said to be TN -invariant or N-stationary if it is invariant under the N-th power of the shift T. The vector obtained from xri' be deleting all but the terms with indices j i < j2 < < j,n , will be denoted by x n (jr), or by x(jr) when n is understood, that is, For convenience, random variable notation will be used in stating d-distance results, for example, 17) stands for an (A, v), if p, is the distribution of X and v the to indicate the random vector distribution of Yf'. Another convenient notation is X' , conditioned on the value r of some auxiliary random variable R..

SECTION 1.9. PROCESS TOPOLOGIES.

97

Lemma 1.9.11 (Properties of the d-distance.)

(a) dn

v) _< (1/2)Ip, - vi n , with equality when n = 1.

(b) d, is a complete metric on the space of measures on An.


(c) (X (ir), YU))

yr, < md,(X (j

r), Y (if n ))

n - m.

(d) If ,u and v are T N -invariant, then ci(u, v) = lim cin (,u , y).
(e) d(u, y) d(p. o T', y o 7-1 ).

(f) If the blocks {X i ort on+1 } and Olin 1) , 1 } are independent of each other and of past blocks, then mcl,(XF n Yr n ) = dn(X i ort1)n+1 ,17(iinon+i).
(g) ("in(X7, IT) < illr,YD, for any finite random variable R. r P(R = r)d(r

Proof A direct calculation shows that


IL - Yin = 2- 2 Emin(u(a7), v(a)),

and that Ex (dn (a7 , 14)) X({(a7 , 14): a7 b7 } ) = 1 - E ga7,a7 ) , for any joining X of bt and v. Moreover, there is always at least one joining X* of i and y for which X(4, = min(A(a7), v(a)), see Exercise 8, and hence,
dn(A, y)

(dn (di% b7)) < - 1A - v In .


2

In the case when n = 1, Ex(d(a, b)) > 1 - Ea ga, a), since d(a, b) = 1, unless a = b. This completes the proof of property (a). The fact that cln (A, y) satisfies the triangle inequality and is positive if p, y is left as an exercise. Completeness follows from (a), together with the fact that the space of measures is compact and complete relative to the metric ii - Via . The left-hand inequality in (c) follows from the fact that summing out in a joining of X, Yf' gives a joining of X(ffn),Y(j im), and mdm (x(r),y(jr)) 5_ ndn (x 11 , The right-hand inequality in (c) is a consequence of the fact that any joining X of X(r) and Y(ffn) can be extended to a joining of X and To establish this extension property, let {6, i2, , in _m } denote the natural ordering of the complement {1, 2, ... , n } , Li } and for each ( x (jr ) , y( j,"1 )), let

yr.

X(x(jr),y(j"))(X

Y(iinm ))

be a joining of the conditional random variables, X(i in-m )/x( jim) and Y(i n i -m )/y(jr). The function defined by

x*(q, yri') = x(x(r), yuln))x(x(q).,(ir(xcili -m), yoin -m,


is a joining of X and Yln which extends X. Furthermore, nE),.(d(xr il , )711')) < mEVd m (x(iln ), y(jr)) n - m,

98

CHAPTER I. BASIC CONCEPTS.

i - `n )) < 1. Minimizing over A yields the right-hand inequality since dn ,n (x(irm ), y(i n in property (c). Property (d) for the case N = 1 was proved earlier; the extension to N > 1 is straightforward. Property (e) follows from (c). } and {Y(; 1)n-1-1 } , for each To prove property (f), choose a joining Ai of {Xiiin j, such that
Exj (dn (X17_ 1)nl' Yii;2-1)n+1 )) = i (;-1)n+1 , y(ijn-on+1)

The product A of the A is a joining of Xr and Yr, by the assumed independence of the blocks, and hence
nz

Manin (X7 m , Yintn )

Edna (./2-1)n+1' y(i.in-on+1) ,


j=1

since
mnEx (c1,,,(xrn , yrn)) =

E nE j (dn (x - 1)n-1-1 , Yilj.tt1)n+1))


i

j=1

The reverse inequality follows from the superadditivity property, (3), and hence property (f) is established. The proof of Property (g) is left to the reader. This completes the discussion of Lemma 1.9.11. Another metric, equivalent to the 4-metric, is obtained by replacing expected values by probabilities of large deviations. Let A n (c) = dn (xiz , y) < cl, and define (13) *(/./.., y) = min minfc > 0: A.(A n (c)) > 1 cl. dn
A.E.4, 01,0

The function dn * ( u, y) is a metric on the space P(An) of probability measures on A. It is one form of a metric commonly known as the Prohorov metric, and is uniformly equivalent to the 4-metric, with bounds given in the following lemma. Lemma 1.9.12 [d:Gt, v)] 2 5_ 4 (A, y) 5_ 24(u, v). Proof Let A be a joining of 1.t and y such that dn (A, y) = E),(d n (f; )1)) = a. The Markov inequality implies that X(A n ( N/Fe)) > 1 NATt, and hence du, y) Likewise, if A E y) is such that X(An(a)) > 1 a and d:(,u, y) = ct, then
E(dn (4, yiii)) <

E (x7, y7)EA,

dn(xii , yDA.(xi' y) ct
(a)

20/,

and hence cl,(,u, y) < E.(dn (4 , yriz)) < 2a. This completes the proof of the lemma. The preceding lemma is useful in both directions; versions of each are stated as the following two lemmas. Part (b) of each version is based on the mapping interpretation of joinings, (2).

SECTION 1.9. PROCESS TOPOLOGIES. Lemma 1.9.13

99

(a) If there is a joining of it and u such that d(4 , y) < c, except for a set of A.-measure at most c, then cin (it, y) < 2E.
(b) If 4 and are measure-preserving mappings from a nonatomic probability space (Z, a) into (An, ,u) and (An, v), respectively, such that dn ( ( Z) (z)) < E, except for a set of a-measure at most c, then iin (t.t, y) < 2e.

Lemma 1.9.14
(a) If anCu, y) < e 2 and ,un (B) > 1 c, then Y n ([B],) > 1 2e, where [ME = dn(Y z B) EL (b) If cin (i.t, y) < c2, and (Z, a) is a nonatomic probability space, there are measurepreserving mappings (/) and from (Z, a) into (A', p.) and (A", v), respectively, such that a(lz: dn((z),*(z)) > el) 5_ E.

I.9.b.4 Ergodicity, entropy, mixing, and d limits.


In this section it will be shown that the d-limit of ergodic (mixing) processes is ergodic (mixing) and that entropy is d-continuous. The key to these and many other limit results is that do -closeness implies that sets of large measure for one distribution must have large blowup with respect to the other distribution, which is just Lemma 1.9.14. As before, for 3 > 0, let [B]s = {y7: dn(y ri' , B) 3), denote the 3-neighborhood or 3-blow-up of B c A". Theorem 1.9.15 (Ergodicity and d-limits.) The a-limit of ergodic processes is ergodic.
Proof Suppose ,u is the d-limit of ergodic processes. By the ergodic theorem the relative

frequency p(eil`ix) converges almost surely to a limit p(anx). By Theorem 1.4.2 it is enough to show that p(aii`lx) is constant in x, almost surely, for every a. This follows from the fact that if v is ergodic and close to i, then a small blowup of the set of vtypical sequences must have large ti-measure, and small blowups don't change empirical frequencies very much. To show that small blowups don't change frequencies very much, fix k and note that if )7 4-k---1 = all' and 4 +k-1 0 al', then yi 0 x i for at least one j E [i, i + k 1], and hence (14) (n k + 1)p(41),11 ) < (n k + 1)p(a in.q) + knd,(4 , )1), where the extra factor of k on the last term is to account for the fact that a j for which yi 0 x i can produce as many as k indices i for which y :+ k -1 _ ak 1 and X tj. 1c-1 0 ak 1' Let c be a positive number and use (14) to choose a positive number S < c, and N1 so that if n > N1 then
(15) d( x, 11
)

I p(44) p(any)1 < c.

100

CHAPTER I. BASIC CONCEPTS.

Let y be an ergodic measure such that cl(A, y) < 3 2 and hence an (11,n , vn ) < 82, n > Because y is ergodic N1 , since for stationary measures d is the supremum of the there is an N2 > Ni such that if n > N2, the set

an .

Bn = fx ril : I P(414) v(4)1 <


has vn measure at least 1-6. Fix n > N2 and apply Lemma 1.9.14 to obtain A n ([13, ] 5 ) > 1 2 8 . Note, however, that if y E [A]s then there is a sequence x E Bn such that dn (fi! , <8 and hence I 19 (alic 1-741 ) IY7)1 < 6 by (15). Likewise if 5,1' E [B, 8 then there is a sequence i E Bn such that
]

14) P(a lic l51)1 < E.


The definition of Bn and the two preceding inequalities yield

p(451)1 < 4E,

Al, 57 11 E [B, 8.
]

Since [B] large An -measure, the finite sequence characterization of ergodicity, D Theorem 1.4.5, implies that A is ergodic, completing the proof of Theorem 1.9.15.

Theorem 1.9.16 (Entropy and d-limits.) Entropy is a-continuous on the class of ergodic processes. Proof Now the idea is that if p, and y are d-close then a small blowup does not increase the set of p.-entropy-typical sequences by very much, and has large v-measure. Let pc be an ergodic process with entropy h = h(i). Given E > 0, choose N, and, for each < 2i1(h+E) and p(T) n > N, choose Tn c An such that ITn I _ 1 as n co. Let 8 be a positive number such that un bi for all n > N. (Such a 3 exists by the blowup-bound lemma, Lemma 1.7.5.) Choose > N such that p(T) > 1 3, for n > N I . Let y be an ergodic process such that a(A, y) < 8 2 . Lemma 1.9.14 implies that y([7-,]6 ) > 1 23, for all n > N, so the covering-exponent interpretation of entropy, Theorem 1.7.4, yields h(v) < h(p) + 2E. Likewise, for sufficiently large n, v gives measure at least I 3 to a subset of An of cardinality at most 2n (n( v)+E ) and hence El h(,u) < h (v) 2E. This completes the proof of Theorem 1.9.16. Mixing is also preserved in the passage to d-limits. Theorem 1.9.17 (Mixing and a-limits.) The d-limit of mixing processes is mixing. Proof The proof is similar to the proof, via finite-coder approximation, that stationary coding preserves mixing, see Example 19. Suppose A is the a-limit of mixing processes, fix all and VII, each of positive p.-measure, and fix E > 0. Given 8 > 0, let y be a mixing process such that d(A , y) < 3 2 , and let A. be a stationary joining of p, and y such that
Evdi (xi , Yi)) = Ex(cin (4, y;')) < 82.

SECTION 1.9. PROCESS TOPOLOGIES.

101

This implies that d(x': , y7) < 8 except for a set of A measure at most 8, so that if has A-measure at most 8. Thus by making xt; < 1/n then the set {(x'il , even smaller it can be supposed that A(a7) is so close to v(a) and p.(1,11') so close to v(b?) that liu(dil )[1(b7) v(a'12 )v(b)I Likewise, for any m > 1,

- iz X({(xr +n , y'1 ): 4 0 yrii or x: V X({(xi, yi): xi 0 4 } )

y: .- .;1 })
m+n

X({(x,m n F-7,

xm _11
,n

ym m:++:rill)

28,

so that if 8 <112n is small enough, then

1,u([4] n

T - m[bn) vaan

n T - m[b7D1

6/3,

for all m > 1. Since u is mixing, however,

Iv(a7)v(b7) vgan

n T - m[b7])1

6/3,

for all sufficiently large m, and hence, the triangle inequality yields

ii([4]

n T - m[brill)1

for all sufficiently large m, provided only that 6 is small enough. Thus p, must be mixing, which establishes the theorem. This section will be closed with an example, which shows, in particular, that the

ii-topology is nonseparable.
Example 1.9.18 (Rotation processes are generally d-far-apart.) Let S and T be the transformations of X = [0, 1) defined, respectively, by Sx = x

a, T x = x

fi,

where, as before,

e denotes

addition modulo 1.

Proposition 1.9.19
Let A be an (S x T)-invariant measure on X x X which has Lebesgue measure as marginal on each factor If a and fi are rationally independent then A must be the product measure. Proof "Rationally independent" means that ka mfi is irrational for any two rationals k and m with (k, m) (0, 0). Let C and D be measurable subsets of X. The goal is to show that A.(C x D) =
It is enough to prove this when C and D are intervals and p.(C) = 1/N, where N is an integer. Given E > 0, let C 1 be a subinterval of C of length (1 E)1N and let

E = C i x D, F = X x D.

102

CHAPTER I. BASIC CONCEPTS.

Since a and fi are rationally independent, the two-dimensional version of Kronecker's theorem, Proposition 1.2.16, can be applied, yielding integers m 1 , m2, , rnN such that if V denotes the transformation S x T then Vms E and X(FA) < 2E, where F' = U 1 Vrn1E .
E -
n vrni

E = 0, if i j,

It follows that X(E) = X(P)/N is within 26/N of A.(F)/ N = 1,t(C)A(D). Let obtain X(C x D) = This completes the proof of Proposition 1.9.19.

0 to

El

Now let P be the partition of the unit interval that consists of the two intervals Pc, = [0, 1/2), Pi = [1/2, 1). It is easy to see that the mapping that carries x into its (T, 2)-name {x} is an invertible mapping of the unit interval onto the space A z the Kolmogorov measure. This fact, together withwhicaresLbgumont Proposition 1.9.19, implies that the only joining of the (T, 2)-process and the (S, 2)process is the product joining, and this, in turn, implies that the d-distance between these two processes is 1/2. This shows in particular that the class of ergodic processes is not separable, for, in fact, even the translation (rotation) subclass is not separable. It can be shown that the class of all processes that are stationary codings of i.i.d. processes is d-separable, see Exercise 3 in Section W.2.

I.9.c

Exercises.

1. A measure ,u E P(X) is extremal if it cannot be expressed in the form p = Ai 0 ia2. (a) Show that if p is ergodic then p, is extremal. (Hint: if a = ta1 -F(1 t)A2, apply the Radon-Nikodym theorem to obtain gi (B) = fB fi diu and show that each fi is T-invariant.) (b) Show that if p is extremal, then it must be ergodic. (Hint: if T -1 B = B then p, is a convex sum of the T-invariant conditional measures p(.iB) and peiX B).) 2. Show that l') is a complete metric by showing that

(a) The triangle inequality holds. (Hint: if X joins X and /7, and X* joins 17 and Zri', then E,T yi)A*(z7ly'll) joins X and 4.) (b) If 24(X?, ri) = 0 then Xi; and 1111 have the same distribution. (c) The metric d(X7, 17) is complete. 3. Prove the superadditivity inequality (3). 4. Let p, and y be the binary Markov chains with the respective transition matrices
m [

P p

1p p 1'

N [0 1 1 o]'

Let fi be the Markov process defined by M2.

SECTION 1.10. CUTTING AND STACKING.

103

v n = X2n, n = 1, 2, ..., then, almost (a) Show that if x is typical for it, and v surely, y is typical for rt.
(b) Use the result of part (a) to show d(,u, y) = 1/2, if 0 < p < 1. 5. Use Lemma 1.9.11(f) to show that if it. and y are i.i.d. then d(P., y) = (1/2)1A v I i. (This is a different method for obtaining the d-distance for i.i.d. processes than the one outlined in Example 1.9.3.) 6. Suppose y is ergodic and rt, is the concatenated-block process defined by A n on A n . (p ii ) (Hint: g is concentrated on shifts of sequences Show that d(v ,g) = --n.n, ,.--n, that are typical for the product measure on (An ) defined by lin .)

a.

7. Prove property (d) of Lemma 1.9.11.

* (a7 , a7) = min(p,(a7), v(a)). 8. Show there is a joining A.* of ,u and y such that Xn
9. Prove that ikt v in = 2 2 E a7 min(i.t(a7), v(a)). 10. Two sets C, D C A k are a-separated if C n [D], = 0. Show that if the supports of 121 and yk are a-separated then dk(Ilk.vk) > a. 11. Suppose ,u, and y are ergodic and d(i.t, for sufficiently large n there are subsets v(Dn ) > 1 E, and dn (x7, y7) > a 1`) ' Ak and Pk('IY1 c ) ' Vk, and k Pk('IX 1 much smaller than a, by (7)) y) > a. Show that if E > 0 is given, then Cn and Dn of A n such that /L(C) > 1 e, e, for x7 E Cn , yli' E D. (Hint: if is large enough, then d(4 , y ) cannot be

12. Let (Y, y) be a nonatomic probability space and suppose *: Y 1-4- A n is a measurable mapping such that dn (un , y o * -1 ) < S. Show that there is a measurable mapping q5: y H A n such that A n = V curl and such that Ev(dn(O(Y), fr(y))) <8.

Section 1.10 Cutting and stacking.


Concatenated-block processes and regenerative processes are examples of processes with block structure. Sample paths are concatenations of members of some fixed collection S of blocks, that is, finite-length sequences, with both the initial tail of a block and subsequent block probabilities governed by a product measure it* on S' . The assumption that te is a product measure is not really necessary, for any stationary measure A* on S" leads to a stationary measure p, on Aw whose sample paths are infinite concatenations of members of S, provided only that expected block length is finite. Indeed, the measure tt is just the measure given by the tower construction, with base S", measure IL*, and transformation given by the shift on S, see Section 1.2.c.2. It is often easier to construct counterexamples by thinking directly in terms of block structures, first constructing finite blocks that have some approximate form of the final desired property, then concatenating these blocks in some way to obtain longer blocks in which the property is approximated, continuing in this way to obtain the final process as a suitable limit of finite blocks. A powerful method for organizing such constructions will be presented in this section. The method is called "cutting and stacking," a name suggested by the geometric idea used to go from one stage to the next in the construction.

104

CHAPTER I. BASIC CONCEPTS.

Before going into the details of cutting and stacking, it will be shown how a stationary measure ,u* on 8" gives rise to a sequence of pictures, called columnar representations, and how these, in turn, lead to a description of the measure on A'.

I.10.a

The columnar representations.

Fix a set 8 c A* of finite length sequences drawn from a finite alphabet A. The members of the initial set S c A* are called words or first-order blocks. The length of a word w E S is denoted by t(w). The members w7 of the product space 8" are called nth order blocks. The length gel') of an n-th order block wrii satisfies (w) = E, t vi ). The symbol w7 has two interpretations, one as the n-tuple w1, w2, , w,, in 8", the other as the concatenation w 1 w2 to, in AL, where L = E, t(wi ). The context usually makes clear which interpretation is in use. The space S" consists of the infinite sequences iv?"' = w1, w2, ..., where each Wi E S. A (Borel) probability measure 1.1* on S" which is invariant under the shift on S' will be called a block-structure measure if it satisfies the finite expected-length condition,
(

E(t(w)) =

Ef (w),uT(w ) <
WES

onto 8, while, in general, Here 14 denotes the projection of of ,u* onto 8'. Note, by the way, that stationary gives E v( w7 ))

denotes the projection

. E t (w ) ,u.: (w7 ) . nE(t(w)).


to7Esn

Blocks are to be concatenated to form A-valued sequences, hence it is important to have a distribution that takes into account the length of each block. This is the probability measure , on S defined by the formula
A( w)

w)te (w)

E(fi

,WE

8,

where E(ti) denotes the expected length of first-order blocks. The formula indeed defines a probability distribution, since summing t(w),u*(w) over w yields the expected block length E(t 1 ). The measure is called the linear mass distribution of words since, in the case when p,* is ergodic, X(w) is the limiting fraction of the length of a typical concatenation w1 w2 ... occupied by the word w. Indeed, using f (w1w7) to denote the number of times w appears in w7 E S", the fraction of the length of w7 occupied by w is given by

t(w)f (wIw7)
t(w7)

= t(w)

f(w1w7) n
n f(w7)

t(w)11,* (w)
E(t 1 )

= A.(w), a. s.,

since f (w1w7)1 n --- ,e(w) and t(will)In E(t1), almost surely, by the ergodic theorem applied to lu*. The ratio r(w) = kt * (w)/E(ti) is called the width or thickness of w. Note that A(w) = t(w)r(w), that is, linear mass = length x width.

SECTION 1.10. CUTTING AND STACKING.

105

The unit interval can be partitioned into subintervals indexed by 21) E S such that length of the subinterval assigned to w is X(w). Thus no harm comes from thinking of X as Lebesgue measure on the unit interval. A more useful representation is obtained by subdividing the interval that corresponds to w into t(w) subintervals of width r (w), labeling the i-th subinterval with the i-th term of w, then stacking these subintervals into a column, called the column associated with w. This is called the first - order columnar representation of (5 00 , ,u*). Figure 1.10.1 shows the first-order columnar representation of S = {v, w}, where y = 01, w = 011, AIM = 1/3, and /4(w) = 2/3. Ti x

X V w

Figure 1.10.1 The first-order columnar representation of (S, AT). In the columnar representation, shifting along a word corresponds to moving intervals one level upwards. This upward movement can be accomplished by a point mapping, namely, the mapping T1 that moves each point one level upwards. This mapping is not defined on the top level of each column, but it is a Lebesgue measure-preserving map from its domain to its range, since a level is mapped linearly onto the next level, an interval of the same width. (This is also shown in Figure I.10.1.) In summary, the columnar representation not only carries full information about the distribution A.(w), or alternatively, /4(w) = p,*(w), but shifting along a block can be represented as the transformation that moves points one level upwards, a transformation which is Lebesgue measure preserving on its domain. The first-order columnar representation is determined by S and the first-order distribution AI (modulo, of course, the fact that there are many ways to partition the unit interval into subintervals and stack them into columns of the correct sizes.) Conversely, the columnar representation determines S and Al% The information about the final process is only partial since the picture gives no information about how to get from the top of a column to the base of another column, in other words, it does not tell how the blocks are to be concatenated. The first-order columnar representation is, of course, closely related to the tower representation discussed in Section I.2.c.2, the difference being that now the emphasis is on the width distribution and the partially defined transformation T that moves points upwards. Information about how first-order blocks are concatenated to form second-order blocks is given by the columnar representation of the second-order distribution ,q(w?). Let r (w?) = p(w)/E(2) be the width of w?, where E(t2) is the expected length of w?, with respect to ,4, and let X(w?) = t(w?)r(w?) be the second-order linear mass. The second-order blocks w? are represented as columns of disjoint subintervals of the unit interval of width r (14) and height aw?), with the i-th subinterval labeled by the i-th term of the concatenation tv?. A key observation, which gives the name "cutting and stacking" to the whole procedure that this discussion is leading towards, is that

106

CHAPTER I. BASIC CONCEPTS.

Second-order columns can be be obtained by cutting each first-order column into subcolumns, then stacking these in pairs.
Indeed, the total mass in the second-order representation contributed by the first f(w) levels of all the columns that start with w is

t(w) x--, E aw yr ( ww2 ) . .1,--(t2) 2_, . le un-v2)\


(

aw* * (w)
= 2E( ii)

tV2

_ 1 toor(w), 2

which is exactly half the total mass of w in the first-order columnar representation. Likewise, the total mass contributed by the top t(w) levels of all the columns that end with w is

Et(w)r(w1w)= Ea(lV t2))


WI

Pf(W1W)

2E(ti) =t(w* *(w)

Thus half the mass of a first-order column goes to the top parts and half to the bottom parts of second-order columns, so, as claimed, it is possible to cut the first-order columns into subcolumns and stack them in pairs so as to obtain the second-order columnar representation. Figure 1.10.2 shows how the second-order columnar representation of S = {v, w}, where ,4(vv) = 1/9u;(vw) = ,u,;(wv) = 2/9, 16(ww) = 4/9, can be built by cutting and stacking the first-order representation shown in Figure 1.10.1.

2 " (w)r(w).

..._ C
i

***********

E ****** B
VW

A B C
V

H E F G ***************4****+**********
w

A
VV

F G ****** ***********
WV
WW

Figure 1.10.2 The second-order representation via cutting and stacking. The significance of the fact that second-order columns can be built from first-order columns by appropriate cutting and stacking is that this guarantees that the transformation T2 defined for the second-order picture by mapping points directly upwards one level extends the mapping T1 that was defined for the first-order picture by mapping points directly upwards one level. Indeed, if y is directly above x in some column of the firstorder representation, then it will continue to be so in any second-order representation that is built from the first-order representation by cutting and stacking its columns. Note also the important property that the set of points where T2 is undefined has only half the measure of the set where T1 is defined, for the total width of the second-order columns is 1/E(t2), which is just half the total width 1/E(t 1 ) of the first-order columns. One can now proceed to define higher order columnar representations. In general, the 2m-th order representation can be produced by cutting the columns of the m-th order columnar representation into subcolumns and stacking them in pairs. If this cutting and stacking method is used then the following two properties hold.

SECTION 1.10. CUTTING AND STACKING. (a) The associated column transformation T2m extends the transformation Tm .

107

(b) The set where T2m is defined has one-half the measure of the set where Tft, is undefined. In particular, if cutting and stacking is used to go from stage to stage, then the set of transformations {TA will have a common extension T defined for almost all x in the unit interval, and the transformation T will preserve Lebesgue measure. A general form of this fact, Theorem 1.10.3, will be proved later. The labeling of levels by symbols from A provides a partition P = {Pa : a E AI of the unit interval, where Pa is the union of all the levels labeled a. The (T, P)-process is the same as the stationary process it defined by A*, see Corollary 1.10.5. The difference between the building of columnar representations and the general cutting and stacking method is really only a matter of approach and emphasis. The columnar representation idea starts with the final block-structure measure ,u* and uses it to construct a sequence of partial transformations of the unit interval each of which extends the previous ones. Their common extension is a Lebesgue measure-preserving transformation on the entire interval, which, in turn, gives the desired stationary A-valued process. Thus, in essence, it is a way of representing something already known. The cutting and stacking idea focuses directly on the geometric concept of cutting and stacking labeled columns, using the desired goal (typically to make an example of a process with some sample path property) to guide how to cut up the columns of one stage and reassemble to create the next stage. The goal is still to produce a Lebesgue measure-preserving transformation on the unit interval. Of course, column structures define distributions on blocks, and cutting and stacking extends these to joint distributions on higher level blocks, but this all happens in the background while the user focuses on the combinatorial properties needed at each stage to produce the desired final process. Thus, in essence, it is a way of building something new. The cutting and stacking language and theory will be rigorously developed in the following subsections. Some applications of its use to construct examples will be given in Chapter 3.

I.10.b

The basic cutting and stacking vocabulary.

A column C = (L1, L2, .. , Lac)) of length, or height, f(C) is a nonempty, ordered, disjoint, collection of subintervals of the unit interval of equal positive width (C). The interval L1 is called the base or bottom, the interval Lt(c) is called the top, and the interval L i is called the j-th level of the column. A labeling of C is a mapping Ar from {1, 2, ... , f(C)} into the finite alphabet A. The vector Ar(C) = (H(1), Ar(2), , Ar(E(C))) is called the name of C, and for each j, N(j) = Ar(L i ) is called the name or label of L. The support of a column C is the union of its levels L. Its measure X(C) is the Lebesgue measure X of the support of C, which is of course given by X(C) = i(C)r (C). Two columns are said to be disjoint if their supports are disjoint. A column structure S is a nonempty collection of mutually disjoint, labeled columns. The suggestive name, column structure, will be used in this book, rather than the noninformative name, gadget, which has been commonly used in ergodic theory. Note that the members of a column structure are labeled columns, hence, unless stated otherwise,

108

CHAPTER I. BASIC CONCEPTS.

terminology should be interpreted as statements about sets of labeled columns. Thus the union S US' of two column structures S and S' consists of the labeled columns of each, and S c S' means that S is a substructure of S', that is, each labeled column in S is a labeled column in S' with the same labeling. An exception to this is the terminology "disjoint column structures," which is taken as shorthand for "column structures with disjoint support," where the support of a column structure is the union of the supports of its columns. The width r(S) of a column structure is the sum of the widths of its columns. The base or bottom of a column structure is the union of the bases of its columns and its top is the union of the tops of its columns. Note that the base and top have the same - is the normalized column Lebesgue measure, namely, r(S). The width distribution 7 width r (C) r (C) "f(C) = EVES r (D) (45) The measure A(S) of a column structure is the Lebesgue measure of its support, in other words, A(S) = Ex(c).Et(c),(c).
CES

CES

Note that X(S) < 1 since columns always consist of disjoint subintervals of the unit interval, and column structures have disjoint columns. In particular, this means that expected column height with respect to the width distribution is finite, for

CES

(c ) < Ei(c)-r(c) Ex . ) c cyf( a E r(S) r (S)


,

r (5)

< 00.

A column C = (L 1 L2, , Le (C )) defines a transformation T = T, called its upward map. If a column is pictured by drawing L i+ 1 directly above L1 then T maps points directly upwards one level, and is not defined on the top level. The precise definition is that T maps Li in a linear and order-preserving manner onto L i+ i, for 1 < j < f(C). Note that T is one-to-one and its inverse maps points downwards. The transformation T = Ts defined by a column structure S is the union of the upward maps defined by its columns. It is called the mapping or upward mapping defined by the column structure S. Each point below the base of S is mapped one level directly upwards by T. T is not defined on the top of S and is a Lebesgue measure-preserving mapping from all but the top to all but the bottom of S. Columns are cut into subcolumns by slicing them vertically. This idea can be made precise in terms of subcolumns and column partitions as follows. A subcolumn of L'2 , , L'e ), such that the following hold. C = (L1, L2, . , Li) is a column C' = (a) For each j, is a subinterval of L i with the same label as L.

(b) The distance from the left end point of L'i to the left end point of L 1 does not depend on j. Note that implicit in the definition is that a subcolumn always has the same height as the column. (An alternative definition of subcolumn in terms of the upward map is given in Exercise 1.) A (column) partition of a column C is a finite or countable collection

{C(i) = (L(i, 1), L(i, 2) ... , L(i, t(C)): i

I}

SECTION 1.10. CUTTING AND STACKING.

109

of disjoint subcolumns of C, the union of whose supports is the support of C. Partitioning corresponds to cutting, thus cutting C into subcolumns according to a distribution 7r is the same as finding a column partition {C(i)} of C such that 7r(i) = t(C(i))1 .1- (C), for each i. The cutting idea extends to column structures. A column partitioning of a column structure S is a column structure S' with the same support as S such that each column of S' is a subcolumn of a column of S. Thus, a column partitioning is formed by partitioning each column of S in some way, and taking the collection of the resulting subcolumns. The stacking idea is defined precisely as follows. Let C 1 = (L(1, j): 1 < j < t(C1)) and C2 = (L(2, j): 1 < j < t(C2)) be disjoint labeled columns of the same width. The stacking of C2 on top of C 1 is denoted by C1* C2 and is the labeled column with levels L i defined by = L(1, j) 1 < j < f(C1) L(2, j f(C1)) t(C1) < j f(Ci) -e(C2),

Li

with the label of C1* C2 defined to be the concatenation vw of the label v of C 1 and the label w of C2. Note that the width of C1 * C2 is the same as the width of C1, which is also the same as the width of C2. Longer stacks are defined inductively by
Ci *C2* Ck Ck+i = (C1 * C2 * Ck) * Ck+1.

A column structure S' is a stacking of a column structure S, if they have the same support and each column of S' is a stacking of columns of S. A column structure S' is built by cutting and stacking from a column structure S if it is a stacking of a column partitioning of S. Thus, for example, the second-order columnar representation of (S, ,a*), where S c A*, is built by cutting each of the first-order columns into subcolumns and stacking these in pairs. In general, the columns of S' are permitted to be stackings of variable numbers of subcolumns of S. The basic fact about stacking is that it extends upward maps. Indeed, the upward map Tco,c, defined by stacking C2 on top of C1 agrees with the upward map Tc wherever it is defined, and with the upward map TC 2 wherever it is defined, and extends their union, since Te,,,c, is now defined on the top of C1. This is also true of the upward mapping T = Ts defined by a column structure S, namely, it is extended by the upward map T = Ts , for any S' built by cutting and stacking from S. In summary, the key properties of the upward maps defined by column structures are summarized as follows.
,

(i) The domain of Ts is the union of all except the top of S. (ii) The range of Ts is the union of all except the bottom of S. (iii) The upward map Ts is a Lebesgue measure-preserving mapping from its domain to its range. (iv) If S' is built by cutting and stacking from S, then Ts , extends T8.

110

CHAPTER I. BASIC CONCEPTS.

I.10.c The final transformation and process.


The goal of cutting and stacking operations is to construct a measure-preserving transformation on the unit interval. This is done by starting with a column structure 8(1) with support 1 and applying cutting and stacking operations to produce a new structure 8(2). Cutting and stacking operations are then applied to 8(2) to produce a third structure 8(3). Continuing in this manner a sequence {S(m)} of column structures is obtained for which each member is built by cutting and stacking from the preceding one. If the tops shrink to 0, then a common extension T of the successive upward maps is defined for almost all x in the unit interval and preserves Lebesgue measure. Together with the partition defined by the names of levels, the transformation produces a stationary process. A precise formulation of this result will be presented in this section, along with the basic formula for estimating the joint distributions of the final process from the sequence of column structures, and a condition for ergodicity. A sequence {S(m)} of column structures is said to be complete if the following hold. (a) For each m > 1, S(m + 1) is built by cutting and stacking from S(m).

(b) X(S(1)) = 1 and the (Lebesgue) measure of the top of S(m) goes to 0 as m > oc.

Theorem 1.10.3 (The complete-sequences theorem.) If {8(m)} is complete then the collection {Ts (n )} has a common extension T defined for almost every x E [0, 1]. Furthermore, T is invertible and preserves Lebesgue
measure. Proof For almost every x E [0, 1] there is an M = M (x) such that x lies in an interval below the top of S(M), since the top shrinks to 0 as m oc. Further cutting and stacking preserves this relationship of being below the top and produces extensions of Ts(m), so that the transformation T defined by Tx = Ts(m)x is defined for almost all x and extends every Ts (m) . Likewise, the common extension of the inverses of the Ts (m) is defined for almost all x; it is clearly the inverse of T. If B is a subinterval of a level in a column of some S(m) which is not the base of that column, then T -1 B = Ts -(m i ) B is an interval of the same length as B. Since such intervals generate the Borel sets on the unit interval, it follows that T is measurable and preserves Lebesgue measure. This completes the proof of Theorem 1.10.3. El The transformation T is called the transformation defined by the complete sequence {S(m)}. The label structure defines the partition P = {Pa : a E A), where Pa is the union of all the levels of all the columns of 8(1) that have the name a. The (T, P)-process {X,} and its Kolmogorov measure it are, respectively, called the process and Kolmogorov measure defined by the complete sequence {S(m)}. The process {X,} is described by selecting a point x at random according to the uniform distribution on the unit interval and defining X, (x) to be the index of the member of P to which 7 1 x belongs. This is equivalent to picking a random x in the support of 8(1), then choosing m so that x belongs to the j-th level, say L i , of a column C E S(m) of height t(C) > j n 1, and defining X, (x) to be the name of level Li+n _1 of C. (Such an m exists eventually almost surely, since the tops are shrinking to 0.) The k-th order joint distribution of the process it defined by a complete sequence {S(m)} can be directly estimated by the relative frequency of occurrence of 4 in a

SECTION L10. CUTTING AND STACKING.

111

column name, averaged over the Lebesgue measure of the column name. This estimate is exact for k = 1 for any m and for k > 1 is asymptotically correct as m oc. The relative frequency of occurrence of all' in a labeled column C is defined by
pk(a inC) =

Ili

E [1, f(C) k

1 : x 1' 1 =
]

where xf (c) is the name of C, that is, pk(lC) is the empirical overlapping k-block distribution pk ixf (c) ) defined by the name of C. Theorem 1.10.4 (Estimation of joint distributions.) If t is the measure defined by the complete sequence, {8(m)}, then (1)
= lim

E
CES

pk(ailC)X(C),

k E Ak 1

(m)

Proof In this proof C will denote either a column or its support; the context will make clear which is intended. Let Pa be the union of all the levels of all the columns in 8(1) that have the name a. Then, of course, ,u(a) = X(Pa), which is, in turn, given by

p,(apx(c),
CES (1)

since f(C)p i (aIC) just counts the number of times a appears in the name of C, and hence X(Pa fl C)

= [t(c)pi (aiC)j r(c) = pi(aIC)X(C).

The same argument applies to each S(m), establishing (1) for the case k = 1. For k > 1, the only error in using (1) to estimate Wa l') comes from the fact that pk (ainC) ignores the final k 1 levels of the column C, a negligible effect when m is large for then most of the mass must be in long columns, since the top has small measure. To make this precise, for x E [0, 1], let {..x,} E A Z be the sequence defined by the relation Tnx E Px n E Z, and let

Pa k

x: x =

The quantity (t(C) k 1)pk(a lic IC) counts the number of levels below the top k 1 levels of C that are contained in Pal k , and hence
IX(Paf n C) p k (c41C)A(C)1 < 2(k 1)r(C),

since the top k 1 levels of C have measure (k 1)r(C). The desired result (1) now follows since the sum of the widths t(C) over the columns of (m) was assumed to go oc. This completes the proof of Theorem 1.10.4. to 0, as n An application of the estimation formula (1) is connected to the earlier discussion on of the sequence of columnar representations defined by a block-structure measure where S c A*. For each m let 8(m) denote the 2''-order columnar representation. Without loss of generality it can be assumed that 8(m +1) is built by cutting and stacking from S(m). Furthermore, X(8(1)) = 1 and the measure of the top of S(m + 1) is half the measure of the top of S(m), so that {S(m)} is a complete sequence. Let itt be the

112

CHAPTER I. BASIC CONCEPTS.

measure defined by this sequence. The sequence {S(m)} will be called the standard cutting and stacking representation of the measure it. Next let T be the tower transformation defined by the base measure it*, with the shift S on S" as base transformation and height function defined by f (wi , .) = t (w 1) Also let /72, be the Kolmogorov measure of the (T, 7 3)-process, where P = {Pa } is the partition defined by letting P a be the set of all pairs (wr) , j) for which ai = a, where tovi) w = ai .

Corollary 1.10.5 The tower construction and the standard representation produce the same measures, that is, p. = l.
Proof Let pk(afiw7) be the empirical overlapping k-block distribution defined by the sequence w E S n , thought of as the concatenation wi w2 wn E AL, where L = Ei aw 1 ). Since this is clearly the same as pk(a inC) where C is the column with name w, it is enough to show that

(2)

= lim

E
WES

pk (ak l it4)A.(C), a

E Ak .

For k = 1 the sum is constant in n, since the start distribution for is obtained by selecting w 1 at random according to the distribution AT, then selecting the start position according to the uniform distribution on [1, awl)]. The proof for k > 1 is left as an exercise. 0 Next the question of the ergodicity of the final process will be addressed. One way to make certain that the process defined by a complete sequence is ergodic is to make sure that the transformation defined by the sequence is ergodic relative to Lebesgue measure. A condition for this, which is sufficient for most purposes, is that later stage structures become almost independent of earlier structures. These ideas are developed in the following paragraphs. In the following discussion C denotes either a column or its support. For example, A(B I C) denotes the conditional measure X(B n c)/x(c) of the intersection of the set B with the support of the column C. The column structures S and S' are 6-independent if

(3)

E E Ige n D) X(C)(D)I
CES DES'

6.

In other words, two column structures S and S' are 6-independent if and only if the partition into columns defined by S and the partition into columns defined by S' are 6-independent. The sequence {S(m)} of column structures is asymptotically independent if for each m and each e > 0, there is a k > 1 such that S(m) and S(m k) are 6-independent. Note, by the way, that these independence concepts do not depend on how the columns are labeled, but only the column distributions. Related concepts are discussed in the exercises.

SECTION I10. CUTTING AND STACKING.

113

Theorem 1.10.6 (Complete sequences and ergodicity.)


A complete asymptotically independent sequence defines an ergodic process.

Proof Let T be the Lebesgue measure-preserving transformation of the unit interval defined by the complete, asymptotically independent sequence { 8(m)}. It will be shown that T is ergodic, which implies that the (T, P)-process is ergodic for any partition P. Let B be a measurable set of positive measure such that T -1 B = B. The goal is to show that A(B) = 1. Towards this end note that since the top of S(m) shrinks to 0, the widths of its intervals must also shrink to 0, and hence the collection of all the levels of all the columns in all the S(m) generates the a-algebra. Thus, given E > 0 there is an m and a level L, say level j, of some C E 8(M) such that

X(B n L) > 0 This implies that the entire column C is filled to within (1 E 2 ) by B, that is,

(4)

(B n c)

> 1 - E 2 )x(c),
(

since T k (B n L) = B n TkL and T k L sweeps out the entire column as k ranges from j + 1 to aW) j. Fix C and choose M so large that S(m) and S(M) are EX(C)-independent, so that, in particular, the following holds.

(5)   Σ_{D∈S(M)} |λ(D | C) − λ(D)| ≤ ε.

Let F be the set of all D ∈ S(M) for which λ(B | C ∩ D) > 1 − ε. The set C ∩ D is a union of levels of D, since S(M) is built by cutting and stacking from S(m), so if D ∈ F, then there must be at least one level of D which is at least (1 − ε) filled by B. The argument used to prove (4) then shows that the entire column D must be (1 − ε) filled by B. In summary,

(6)   λ(B ∩ D) > (1 − ε)λ(D),   D ∈ F.

Since λ(B | C) = Σ_D λ(B | C ∩ D) λ(D | C), the Markov inequality and the fact that λ(B | C) ≥ 1 − ε² imply that

Σ_{D∉F} λ(D | C) ≤ ε,

which together with the condition (5) implies

Σ_{D∉F} λ(D) < 2ε.

Thus summing over D ∈ F and using (6) yields λ(B) > 1 − 3ε. Since ε was arbitrary, this shows that λ(B) = 1, and hence T must be ergodic. This completes the proof of Theorem 1.10.6. □

It is often easier to make successive stages approximately independent than it is to force asymptotic independence, hence the following stronger form of the complete sequences and ergodicity theorem is quite useful.


Theorem 1.10.7 (Complete sequences and ergodicity: strong form.)


If {S(m)} is a complete sequence such that for each m, S(m) and S(m + 1) are ε_m-independent, where ε_m → 0 as m → ∞, then {S(m)} defines an ergodic process.

Proof Assume T⁻¹B = B and λ(B) > 0. The only real modification that needs to be made in the preceding proof is to note that if {L_i} is a disjoint collection of column levels such that

(7)   λ(B^c ∩ (∪_i L_i)) = Σ_i λ(B^c | L_i)λ(L_i) < ε²δ,

where B^c is the complement of B, then by the Markov inequality there is a subcollection with total measure at least δ for which

λ(B ∩ L_i) > (1 − ε²)λ(L_i).

Let S' be the set of all columns C for which some level has this property. The support of S' has measure at least δ, and the sweeping-out argument used to prove (4) shows that

(8)   λ(B ∩ C) > (1 − ε²)λ(C),   C ∈ S'.

Thus, taking δ = λ(B)/2, and m so large that ε_m is smaller than both ε² and (λ(B)/2)², there is a set of levels of columns in S(m) such that (7) holds, and there must be at least one C ∈ S(m) for which (8) holds and for which

Σ_{D∈S(m+1)} |λ(D | C) − λ(D)| < ε.

The argument used in the preceding theorem then gives λ(B) > 1 − 3ε. This proves Theorem 1.10.7. □

The freedom in building ergodic processes via cutting and stacking lies in the arbitrary nature of the cutting and stacking rules. The user is free to vary which columns are to be cut and in what order they are to be stacked, as well as how substructures are to become well-distributed in later substructures. There are practical limitations, of course, in the complexity of the description needed to go from one stage to the next. A bewildering array of examples has been constructed, however, using only a few simple techniques for going from one stage to the next. A few of these constructions will be described in later chapters of this book. The next subsection will focus on a simple form of cutting and stacking, known as independent cutting and stacking, which, in spite of its simplicity, can produce a variety of counterexamples when applied to substructures.

I.10.d

Independent cutting and stacking.

Independent cutting and stacking is the geometric version of the product measure idea. In this discussion, as earlier, the same notation will be used for a column or its support, with the context making clear which meaning is intended. A column structure S can be stacked independently on top of a labeled column C of the same width, provided that they have disjoint supports. This gives a new column structure denoted by C * S and defined as follows.

(i) Partition C into subcolumns {C_i} so that τ(C_i) = τ(D_i), where D_i is the i-th column of S.


(ii) Stack Di on top of Ci to obtain the new column, Ci *Di .


The new column structure C * S consists of all the columns C_i * D_i. It is called the independent stacking of S onto C. (See Figure 1.10.8.)

Figure 1.10.8 Stacking a column structure independently onto a column.

To say precisely what it means to independently cut and stack one column structure on top of another column structure the concept of copy is useful. A column structure S is said to be a copy of size α of a column structure S' if there is a one-to-one correspondence between columns such that corresponding columns have the same height and the same labeling, and the ratio of the width of a column of S to the width of its corresponding column in S' is α. In other words, a scaling of one structure is isomorphic to the other, where two column structures S and S' are said to be isomorphic if there is a one-to-one correspondence between columns such that corresponding columns have the same height, width, and name. Note, by the way, that a copy has the same width distribution as the original. A column structure S can be cut into copies {S_i: i ∈ I} according to a distribution π on I, by partitioning each column of S according to the distribution π, and letting S_i be the column structure that consists of the i-th subcolumn of each column of S.

Let S' and S be disjoint column structures with the same width. The independent cutting and stacking of S' onto S is denoted by S * S' and is defined for S = {C_i} as follows. (See Figure 1.10.9.)

(i) Cut S' into copies {S'_i} so that τ(S'_i) = τ(C_i).

(ii) For each i, stack S'_i independently onto C_i, obtaining C_i * S'_i.

The column structure S * S' is the union of the column structures C_i * S'_i.

Figure 1.10.9 Stacking a structure independently onto a structure.

An alternative description of the columns of S * S' may be given as follows.

(i) Cut each C'_j ∈ S' into subcolumns {C'_{ij}} such that τ(C'_{ij}) = τ(C'_j)τ(C_i)/τ(S'), for each C_i ∈ S.

(ii) Cut each column C_i ∈ S into subcolumns {C_{ij}} such that τ(C_{ij}) = τ(C_i)τ(C'_j)/τ(S').

The column structure S * S' consists of all the C_{ij} * C'_{ij}. In particular, in the finite column case, the number of columns of S * S' is the product of the number of columns of S and the number of columns of S'. The key property of independent cutting and stacking, however, is that width distributions multiply, that is,

τ(C_{ij} * C'_{ij})/τ(S * S') = (τ(C_i)/τ(S)) · (τ(C'_j)/τ(S')),

since τ(S * S') = τ(S). This formula expresses the probabilistic meaning of independent cutting and stacking: cut and stack so that width distributions multiply.

A column structure can be cut into copies of itself and these copies stacked to form a new column structure. The M-fold independent cutting and stacking of a column structure S is defined by cutting S into M copies {S_m: m = 1, 2, ..., M} of itself of equal width and successively independently cutting and stacking them to obtain

S_1 * S_2 * ⋯ * S_M,

where the latter is defined inductively by S_1 * ⋯ * S_M = (S_1 * ⋯ * S_{M−1}) * S_M. Two-fold independent cutting and stacking is indicated in Figure 1.10.10.
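The width-multiplication rule is easy to simulate. The Python sketch below is an illustration, not the book's construction: columns are simplified to (name, width) pairs with heights implicit in the names, independent cutting and stacking concatenates names and multiplies relative widths, and S(m+1) = S(m) * S(m) is obtained by first cutting S(m) into two copies of half width.

    def total_width(S):
        return sum(w for _, w in S)

    def independent_cut_and_stack(S, Sp):
        # S * S': widths satisfy tau(C_i * C'_j) = tau(C_i) * tau(C'_j) / tau(S').
        tp = total_width(Sp)
        return [(name + name_p, w * w_p / tp) for name, w in S for name_p, w_p in Sp]

    def self_star(S):
        # S(m+1) = S(m) * S(m): cut into two copies of half width, stack one on the other.
        half = [(name, w / 2) for name, w in S]
        return independent_cut_and_stack(half, half)

    # Two unit-height columns '0' and '1', each of width 1/2 (a coin-tossing structure).
    S = [('0', 0.5), ('1', 0.5)]
    S2 = self_star(S)   # columns '00', '01', '10', '11', each of relative width 1/4
    assert abs(total_width(S2) - total_width(S) / 2) < 1e-12

As the assertion records, the new structure has total width τ(S)/2 while the relative widths of its columns are the products of the relative widths of the pieces, which is exactly the multiplication rule displayed above.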

Figure 1.10.10 Two-fold independent cutting and stacking.

Note that τ(S_1 * ⋯ * S_M) = τ(S)/M, and, in the case when S has only finitely many columns, the number of columns of S_1 * ⋯ * S_M is the M-th power of the number of columns of S. The columns of S_1 * ⋯ * S_M have names that are M-fold concatenations of column names of the initial column structure S. The independent cutting and stacking construction contains more than just this concatenation information, for it carries with it the information about the distribution of the concatenations, namely, that they are independently concatenated according to the width distribution of S.

Successive applications of repeated independent cutting and stacking, starting with a column structure S, produce the sequence {S(m)}, where S(1) = S, and

S(m + 1) = S(m) * S(m),   m ≥ 1.


The sequence {S(m)} is called the sequence built (or generated) from S by repeated independent cutting and stacking. Note that S(m) is isomorphic to the 2^{m−1}-fold independent


cutting and stacking of S, so that its columns are just concatenations of 2^{m−1} subcolumns of the columns of S, selected independently according to the width distribution of S. In the case when λ(S) = 1, the (T, P)-process defined by the sequence generated from S by repeated independent cutting and stacking is called the process built from S by repeated independent cutting and stacking. This is, of course, just another way to define the regenerative processes, for they are precisely the processes built by repeated independent cutting and stacking from the columnar representation of their first-order block-structure distributions, Exercise 8. Ergodicity is guaranteed by the following theorem.

Theorem 1.10.11 (Repeated independent cutting and stacking.)
If {S(m)} is built from S by repeated independent cutting and stacking, then given ε > 0, there is an m such that S and S(m) are ε-independent. In particular, if λ(S) = 1 then the process built from S by repeated independent cutting and stacking is ergodic.

Proof The proof for the case when S has only a finite number of columns will be given here; the countable case is left to Exercise 2. The notation D = C_1 C_2 ⋯ C_k will mean that the column D was formed by taking, for each i, a subcolumn of C_i ∈ S, and stacking these in order of increasing i. For k = 2^{m−1}, D = C_1 C_2 ⋯ C_k ∈ S(m), and C ∈ S, define

p(C | C_1^k) = (1/k) |{i ∈ [1, k]: C_i = C}|,

so that, in particular, k p(C | C_1^k) is the total number of occurrences of C in the sequence C_1^k. By the law of large numbers,

(9)   p(C | C_1^k) → τ(C)/τ(S)

and

(10)   (1/k) ℓ(D) = (1/k) Σ_{i=1}^{k} ℓ(C_i) → E(ℓ(S)),

both with probability 1, as k → ∞, where E(ℓ(S)) is the expected height of the columns of S, with respect to the width distribution. Division of λ(C ∩ D) = k p(C | C_1^k) ℓ(C) τ(D) by the product of λ(C) = ℓ(C)τ(C) and λ(D) = ℓ(D)τ(D) yields

λ(C ∩ D)/(λ(C)λ(D)) = k p(C | C_1^k)/(ℓ(D)τ(C)) → 1,

with probability 1, as k → ∞, by (9) and (10), since τ(S)E(ℓ(S)) = λ(S) = 1, by assumption. This establishes that indeed S and S(m) are eventually ε-independent. Note that for each m_0, {S(m): m ≥ m_0} is built from S(m_0) by repeated independent cutting and stacking, and hence the proof shows that {S(m)} is asymptotically independent, so that if λ(S) = 1 then the process defined by {S(m)} is ergodic. This completes the proof of Theorem 1.10.11. □

The following examples indicate how some standard processes can be built by cutting and stacking. More interesting examples of cutting and stacking constructions are given in Chapter III. The simplest of these, an example of a process with an arbitrary rate of


convergence for frequencies, see Section III.1.c, is recommended as a starting point for the reader unfamiliar with cutting and stacking constructions. It shows how repeated independent cutting and stacking on separate substructures can be used to make a process "look like another process on part of the space," so as to approximate the desired property, and how to "mix one part slowly into another," so as to guarantee ergodicity for the final process.

Example 1.10.12 (I.i.d. processes.)
Let S be the column structure that consists of two columns each of height 1 and width 1/2, labeled '0' and '1'. The process built from S by repeated independent cutting and stacking is just the coin-tossing process, that is, the binary i.i.d. process in which 0's and 1's are equally likely. By starting with a partition of the unit interval into disjoint intervals labeled by the finite set A, repeated independent cutting and stacking produces the A-valued i.i.d. process with the probability of a equal to the length of the interval assigned to a.

Example 1.10.13 (Markov processes.)
The easiest way to construct an ergodic Markov chain {X_n} via cutting and stacking is to think of it as the regenerative process defined by a recurrent state, say a. Let S be the set of all a_1^k for which a_i ≠ a, 1 ≤ i < k, and a_k = a. Define

μ_1*(a_1^k) = Prob(X_1^k = a_1^k | X_0 = a),   a_1^k ∈ S,

and let μ* be the product measure on S^∞ defined by μ_1*. The process built from the first-order columnar representation of (S, μ_1*) by repeated independent cutting and stacking is the same as the Markov process {X_n}.

Example 1.10.14 (Hidden Markov chains.)
A hidden Markov process {X_n} is a function X_n = f(Y_n) of a Markov chain {Y_n}. (These are also known as finite-state processes.) If {Y_n} is ergodic, then {X_n} is regenerative, hence can be represented as the process built by repeated independent cutting and stacking. In fact, one can first represent the ergodic Markov chain {(X_n, Y_n)} as a process built by repeated independent cutting and stacking, then drop the second coordinates of each label.

Example 1.10.15 (Finite initial structures.)
If the initial structure S has only a finite number of columns, the process {X_n} built from it by repeated independent cutting and stacking is a hidden Markov chain. To see why this is so, assume that the labeling set A does not contain the symbols 0 and 1 and relabel the column levels by changing the label of level L to (label, 0) if L is not the top of its column, and to (label, 1) if L is the top of its column. It is easy to see that the process {Y_n} built from this new column structure by repeated independent cutting and stacking is Markov of order no more than the maximum length of its columns. Since dropping the second coordinates of each new label produces the old labels, it follows that X_n is a function of Y_n. Note, in particular, that if the original structure has two columns with heights differing by 1, then {Y_n} is mixing Markov, and hence {X_n} is a function of a mixing Markov chain.

Remark 1.10.16
The cutting and stacking ideas originated with von Neumann and Kakutani. For a discussion of this and other early work see [12]. They have since been used to construct


numerous counterexamples, some of which are mentioned in [73]. The latter also includes several of the results presented here, most of which have long been part of the folklore of the subject. A sufficient condition for a cutting and stacking construction to produce a stationary coding of an i.i.d. process is given in [75].
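To make the regenerative picture of Example 1.10.13 concrete, the following Python sketch (illustrative only; the transition matrix and state labels are not from the text) generates a sample path by concatenating independent return blocks to a recurrent state a, which is the kind of path produced by repeated independent cutting and stacking of the first-order columnar representation.

    import random

    def return_block(P, states, a, rng):
        # Run the chain from state a until it next returns to a; report the visited block, ending at a.
        block, s = [], a
        while True:
            s = rng.choices(states, weights=P[s])[0]
            block.append(s)
            if s == a:
                return block

    def regenerative_path(P, states, a, num_blocks, rng=random):
        # Concatenate independently drawn return blocks, mirroring independent cutting and stacking.
        path = []
        for _ in range(num_blocks):
            path.extend(return_block(P, states, a, rng))
        return path

    # Illustrative two-state chain; P[s] lists the transition probabilities out of state s.
    P = {0: [0.9, 0.1], 1: [0.5, 0.5]}
    sample = regenerative_path(P, [0, 1], a=0, num_blocks=100)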

I.10.e Exercises
1. Show that if T is the upward map defined by a column C, then a subcolumn has the form (D, TD, T²D, ..., T^{ℓ(C)−1}D), where D is a subset of the base B.

2. Prove Theorem 1.10.11 for the case when S has countably many columns.

3. Show that formula (2) holds for k > 1. (Hint: the tower with base transformation S^n defines the same process μ̃. Then apply the k = 1 argument.)

4. A column C ∈ S is (1 − ε)-well-distributed in S' if Σ_{D∈S'} |λ(D | C) − λ(D)| ≤ ε. A sequence {S(m)} is asymptotically well-distributed if for each m, each C ∈ S(m), and each ε > 0, there is an M = M(m, C, ε) such that C is (1 − ε)-well-distributed in S(n) for n ≥ M.

(a) Suppose S' is built by cutting and stacking from S. Show that for C ∈ S and D ∈ S', the conditional probability λ(D | C) is the fraction of C that was cut into slices to put into D.

(b) Show that the asymptotic independence property is equivalent to the asymptotic well-distribution property.

(c) Show that if S and S' are ε-independent, then Σ_{C∈S} |λ(C | D) − λ(C)| ≤ √ε, except for a set of D ∈ S' of total measure at most √ε.

(d) Show that if Σ_{C∈S} |λ(C | D) − λ(C)| ≤ ε, except for a set of D ∈ S' of total measure at most ε, then S' and S are 2ε-independent.

5. Let μ* be a block-structure measure on S^∞, where S ⊂ A*, and let {S(m)} be the standard cutting and stacking representation of the stationary A-valued process defined by (S, μ*). Show that {S(m)} is asymptotically independent if and only if μ* is totally ergodic. (Hint: for S = S(m), Exercise 4c implies that μ* is m-ergodic.)

6. Suppose T(m) ⊂ S(m), for each m, and λ(T(m)) → 1 as m → ∞.

(a) Show that {S(m)} is asymptotically independent if and only if {T(m)} is asymptotically independent.

(b) Suppose that for each m there exists R(m) ⊂ S(m), disjoint from T(m), and an integer M_m such that T(m + 1) is built by M_m-fold independent cutting and stacking of T(m) ∪ R(m). Suppose also that {S(m)} is complete. Show that if {M_m} increases fast enough then the process defined by {S(m)} is ergodic.

7. Suppose S has measure 1 and only finitely many columns.

(a) Show that if the top of each column is labeled '1', with all other levels labeled '0', then the process built from S by repeated independent cutting and stacking is Markov of some order.


(b) Show that if two columns of S have heights differing by 1, then the process built from S by repeated independent cutting and stacking is a mixing finite-state process.

(c) Verify that the process constructed in Example 1.10.13 is indeed the same as the Markov process {X_n}.

8. Show that a process is regenerative if and only if it is the process built from a column structure S of measure 1 by repeated independent cutting and stacking.

Chapter II Entropy-related properties.


Section II.1 Entropy and coding.

As noted in Section I.7.c, entropy provides an asymptotic lower bound on expected per-symbol code length for prefix-code sequences and faithful-code sequences, a lower bound which is "almost" tight, at least asymptotically in source word length n, see Theorem I.7.12 and Theorem I.7.15. Two issues left open by these results will be addressed in this section.

The first issue is the universality problem. The sequence of Shannon codes compresses to entropy in the limit, but its construction depends on knowledge of the process, knowledge which may not be available in practice. It will be shown that there are universal codes, that is, code sequences which compress to the entropy in the limit, for almost every sample path from any ergodic process. The second issue is the almost-sure question. The entropy lower bound is an expected-value result, hence it does not preclude the possibility that there might be code sequences that beat entropy infinitely often on a set of positive measure. It will be shown that this cannot happen.

An n-code is a mapping C_n: A^n → {0, 1}*, where {0, 1}* is the set of finite-length binary sequences. The code length function ℒ(· | C_n) of the code is the function that assigns to each x_1^n the length ℓ(C_n(x_1^n)) of the code word C_n(x_1^n). When C_n is understood, ℒ(x_1^n) will be used instead of ℒ(x_1^n | C_n). A code sequence is a sequence {C_n: n = 1, 2, ...}, where each C_n is an n-code. If each C_n is one-to-one, the code sequence is called a faithful-code sequence, while if each C_n is a prefix code it is called a prefix-code sequence.

The Shannon code construction provides a prefix-code sequence {C_n} for which ℒ(x_1^n) = ⌈−log μ(x_1^n)⌉, so that, in particular, ℒ(x_1^n)/n → h, almost surely, by the entropy theorem. In general, however, if {C_n} is a Shannon-code sequence for μ, and ν is some other ergodic process, then ℒ(x_1^n)/n may fail to converge on a set of positive ν-measure, see Exercise 1. A code sequence {C_n} is said to be universally asymptotically optimal or, more simply, universal if

lim sup_{n→∞} ℒ(x_1^n)/n ≤ h(μ),   almost surely,

for every ergodic process μ, where h(μ) denotes the entropy of μ.
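As a small illustration of the Shannon code-length function ℒ(x_1^n) = ⌈−log μ(x_1^n)⌉ and the Kraft inequality it satisfies, here is a minimal Python sketch; the biased-coin product measure and the names used are ours, chosen only for the example.

    import math
    from itertools import product

    def shannon_lengths(mu_n):
        # Shannon code-length function: L(x) = ceil(-log2 mu(x)) for words of positive measure.
        return {x: math.ceil(-math.log2(p)) for x, p in mu_n.items() if p > 0}

    # Product measure of a biased coin (probability 0.7 of a '1') on blocks of length 4.
    p = 0.7
    mu4 = {''.join(b): math.prod(p if s == '1' else 1 - p for s in b)
           for b in product('01', repeat=4)}
    lengths = shannon_lengths(mu4)
    assert sum(2 ** -L for L in lengths.values()) <= 1          # Kraft inequality
    expected = sum(mu4[x] * lengths[x] for x in mu4)            # within 1 bit of the block entropy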


Theorem II.1.1 (Universal codes exist.)
There is a prefix-code sequence {C_n} such that lim sup_n ℒ(x_1^n)/n ≤ h(μ), almost surely, for any ergodic process μ.

Theorem II.1.2 (Too-good codes do not exist.)
If {C_n} is a faithful-code sequence and μ is an ergodic measure with entropy h, then lim inf_n ℒ(x_1^n)/n ≥ h, almost surely.

In other words, it is possible to (universally) compress to entropy in the limit, almost surely for any ergodic process, but for no ergodic process is it possible to beat entropy infinitely often on a set of positive measure.

A counting argument, together with some results about entropy for Markov chains, will be used to establish the universal code existence theorem, Theorem II.1.1. The nonexistence theorem, Theorem II.1.2, will follow from the entropy theorem, together with a surprisingly simple lower bound on prefix-code word length. A second proof of the existence theorem, based on entropy estimation ideas of Ornstein and Weiss, will be discussed in Section II.3.d. A third proof of the existence theorem, based on the Lempel-Ziv coding algorithm, will be given in Section II.2.

Of interest, also, is the fact that both the existence and nonexistence theorems can be established using only the existence of process entropy, which depends only on subadditivity of entropy as expected value, while the existence of decay-rate entropy is given by the much deeper entropy theorem. The proof of the existence theorem to be given actually shows the existence of codes that achieve at least the process entropy. A direct coding argument, which is based on a packing idea and is similar in spirit to some later proofs, will then be used to show that it is not possible to beat process entropy in the limit on a set of positive measure. These two results also provide an alternative proof of the entropy theorem. While no simpler than the earlier proof, they show clearly that the basic existence and nonexistence theorems for codes are, in essence, together equivalent to the entropy theorem, thus further sharpening the connection between entropy and coding. In addition, the direct coding construction extends to the case of semifaithful codes, for which a controlled amount of distortion is allowed, as shown in [49].

II.1.a Universal codes exist.

A prefix-code sequence {C_n} will be constructed such that for any ergodic μ,

lim sup_n ℒ(x_1^n)/n ≤ H,   a.s.,

where H = H(μ) denotes the process entropy of μ. Since process entropy H is equal to the decay-rate h given by the entropy theorem, this will show that good codes exist.

The code C_n utilizes the empirical distribution of k-blocks, that is, the k-type, for suitable choice of k as a function of n, with code performance established by using the type counting results discussed in Section I.6.d. The code is a two-part code, that is, the codeword C_n(x_1^n) is a concatenation of two binary blocks. The first block gives the index of the k-type of the sequence x_1^n, relative to some enumeration of possible k-types; it has fixed length which depends only on the number of type classes. The second block gives the index of the particular sequence x_1^n in its k-type class, relative to some enumeration of the k-type class; its length depends on the size of the type class. Distinct words x_1^n and y_1^n either have different k-types, in which case the first blocks of C_n(x_1^n) and C_n(y_1^n)


will be different, or they have the same k-types but different indices in their common k-type class, hence the second blocks of C_n(x_1^n) and C_n(y_1^n) will differ. Since the first block has fixed length, the code C_n is a prefix code. If k does not grow too rapidly with n, say, k ~ (1/2) log_{|A|} n, then the log of the number of possible k-types is negligible relative to n, so asymptotic code performance depends only on the asymptotic behavior of the cardinality of the k-type class of x_1^n. An entropy argument shows that the log of this class size cannot be asymptotically larger than nH.

A slight modification of the definition of k-type, called the circular k-type, will be used as it simplifies the final entropy argument, yet has negligible effect on code performance. Given a sequence x_1^n and an integer k ≤ n, let x_1^{n+k−1} = x_1^n x_1^{k−1}, the concatenation of x_1^n with x_1^{k−1}, that is, x_1^n is extended periodically for k − 1 more terms. The circular k-type is the measure p̃_k = p̃_k(· | x_1^n) on A^k defined by the relative frequency of occurrence of each k-block in the sequence x_1^{n+k−1}, that is,

p̃_k(a_1^k | x_1^n) = (1/n) |{i ∈ [1, n]: x_i^{i+k−1} = a_1^k}|,   a_1^k ∈ A^k.

The circular k-type is just the usual k-type of the sequence x_1^{n+k−1}, so that the bounds, Theorem I.6.14 and Theorem I.6.15, on the number of k-types and on the size of a k-type class yield the following bounds

(1)   Ñ(k, n) ≤ (n + 1)^{|A|^k},

(2)   |T̃_k(x_1^n)| ≤ (n + 1) 2^{n H_{k−1, x_1^n}},

on the number Ñ(k, n) of circular k-types that can be produced by sequences of length n, and on the cardinality |T̃_k(x_1^n)| of the set of all n-sequences that have the same circular k-type as x_1^n, where H_{k−1, x_1^n} denotes the entropy of the (k − 1)-order Markov chain defined by p̃_k(· | x_1^n). The gain in using the circular k-type is compatibility in k, that is,

p̃_{k−1}(a_1^{k−1} | x_1^n) = Σ_{a_k} p̃_k(a_1^k | x_1^n),

which, in turn, implies the entropy inequality

(3)   H_{k−1, x_1^n} ≤ H_{i, x_1^n},   i ≤ k − 1.

Now suppose k = k(n) is the greatest integer in (1/2) log_{|A|} n and let C_n be the two-part code defined by first transmitting the index of the circular k-type of x_1^n, then transmitting the index of x_1^n in its circular k-type class, that is, C_n(x_1^n) = b_1^m b_{m+1}^{m+t}, where b_1^m is a fixed-length binary sequence specifying the index of the circular k-type of x_1^n, relative to some fixed enumeration of the set of possible circular k-types, and b_{m+1}^{m+t} is a variable-length binary sequence specifying the index of the particular sequence x_1^n relative to some particular enumeration of T̃_k(x_1^n). (Variable length is needed for the second part, because the size of the type class depends on the type.) Total code length satisfies

ℒ(x_1^n) ≤ log Ñ(k, n) + log |T̃_k(x_1^n)| + 2,

where the two extra bits are needed in case the logarithms are not integers.
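The two quantities driving the code length, the circular k-type and the empirical Markov entropy H_{k−1, x_1^n}, are straightforward to compute. The Python sketch below is an illustration (the function names are ours); it uses the compatibility property above, so the (k − 1)-block marginal is again a circular type.

    from collections import Counter
    import math

    def circular_k_type(x, k):
        # Empirical k-block distribution of x extended periodically by its first k-1 symbols.
        n = len(x)
        ext = x + x[:k - 1]
        counts = Counter(ext[i:i + k] for i in range(n))
        return {block: c / n for block, c in counts.items()}

    def empirical_markov_entropy(x, k):
        # H_{k-1, x}: entropy (bits) of the (k-1)-order Markov chain defined by the circular k-type.
        assert k >= 2
        pk, pk1 = circular_k_type(x, k), circular_k_type(x, k - 1)
        return -sum(p * math.log2(p / pk1[block[:-1]]) for block, p in pk.items())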


The bound (1) on the number of circular k-types, together with the assumption that k ≤ (1/2) log_{|A|} n, implies that it takes at most

1 + √n log(n + 1)

bits to specify a given circular k(n)-type, relative to some fixed enumeration of circular k(n)-types. This is, of course, negligible relative to n. The bound (2) on the size of a type class implies that at most

1 + n H_{k(n)−1, x_1^n} + log(n + 1)

bits are needed to specify a given member of T̃_{k(n)}(x_1^n), relative to a given enumeration of the type class. Thus,

lim sup_n ℒ(x_1^n)/n ≤ lim sup_n H_{k(n)−1, x_1^n},

so that to show the existence of good codes it is enough to show that for any ergodic measure μ,

(4)   lim sup_n H_{k(n)−1, x_1^n} ≤ H,   almost surely,

where H is the process entropy of μ.

Fix an ergodic measure μ with process entropy H. Given ε > 0, choose K such that H_{K−1} ≤ H + ε, where, in this discussion, H_{K−1} = H(X_K | X_1^{K−1}) denotes the process entropy of the Markov chain of order K − 1 defined by the conditional probabilities μ(a_1^K)/μ(a_1^{K−1}). Since K is fixed, the ergodic theorem implies that p̃_K(a_1^K | x_1^n) → μ(a_1^K), almost surely, as n → ∞, and hence H_{K−1, x_1^n} → H_{K−1}, almost surely. In particular, for almost every x there is an integer N = N(x, ε) such that

H_{K−1, x_1^n} ≤ H + 2ε,   n ≥ N.

For all sufficiently large n, k(n) will exceed K. Once this happens the entropy inequality (3) combines with the preceding inequality to yield

H_{k(n)−1, x_1^n} ≤ H + 2ε.

Since ε is arbitrary, this shows that lim sup_n H_{k(n)−1, x_1^n} ≤ H, almost surely, establishing the desired bound (4). The proof that good codes exist is now finished, for as noted earlier, process entropy H and decay-rate h are the same for ergodic processes. □

II.1.b Too-good codes do not exist.

A faithful-code sequence can be converted to a prefix-code sequence with no change in asymptotic performance by using the Elias header technique, described in Section I.7.d of Chapter I. Thus it is enough to prove Theorem II.1.2 for the case when each C_n is a prefix code. Two quite different proofs of Theorem II.1.2 will be given. The first proof uses a combination of the entropy theorem and a simple lower bound on the pointwise behavior of prefix codes, a bound due to Barron, [3]. The second proof, which was developed in [49] and does not make use of the entropy theorem, is based on an explicit code construction which is closely connected to the packing ideas used in the proof of the entropy theorem.
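The Elias header technique relies on a prefix code for the natural numbers. One standard construction with length log j + O(log log j) is the Elias delta code, sketched below in Python purely as an illustration; the e(·) used later in this section may be any code of this type.

    def elias_gamma(j):
        # Prefix code on positive integers: (len(bin(j)) - 1) zeros followed by the binary expansion of j.
        b = format(j, 'b')
        return '0' * (len(b) - 1) + b

    def elias_delta(j):
        # Gamma-code the bit-length of j, then append j's bits without the leading 1;
        # the total length is floor(log2 j) + 2*floor(log2(floor(log2 j) + 1)) + 1 = log j + O(log log j).
        b = format(j, 'b')
        return elias_gamma(len(b)) + b[1:]

    assert elias_delta(1) == '1' and elias_delta(9) == '00100001'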


The following lemma, due to Barron, is valid for any process, stationary or not.

Lemma II.1.3 (The almost-sure code-length bound.)
Let {C_n} be a prefix-code sequence and let μ be a Borel probability measure on A^∞. If {a_n} is a sequence of positive numbers such that Σ_n 2^{−a_n} < ∞, then

(5)   ℒ(x_1^n) + log μ(x_1^n) ≥ −a_n,   eventually almost surely.

Proof For each n define

B_n = {x_1^n: ℒ(x_1^n) + log μ(x_1^n) ≤ −a_n}.

Using the relation ℒ(x_1^n) = −log 2^{−ℒ(x_1^n)}, this can be rewritten as

B_n = {x_1^n: μ(x_1^n) ≤ 2^{−ℒ(x_1^n)} 2^{−a_n}},

which yields

μ(B_n) = Σ_{x_1^n ∈ B_n} μ(x_1^n) ≤ 2^{−a_n} Σ_{x_1^n ∈ B_n} 2^{−ℒ(x_1^n)} ≤ 2^{−a_n},


since Σ_{x_1^n} 2^{−ℒ(x_1^n)} ≤ 1, by the Kraft inequality for prefix codes. The lemma follows from the Borel-Cantelli lemma. □

Barron's inequality (5) with a_n = 2 log n, and after division by n, yields

lim inf_{n→∞} ℒ(x_1^n)/n ≥ lim inf_{n→∞} −(1/n) log μ(x_1^n),

almost surely, for any ergodic measure μ. But −(1/n) log μ(x_1^n) converges almost surely to h, the entropy of μ. This completes the proof of Theorem II.1.2. □

II.1.c Nonexistence: second proof.

A second proof of the nonexistence of too-good codes, Theorem II.1.2, will now be given, a proof which uses the existence of good codes but makes no use of the entropy theorem. The basic idea is that if a faithful-code sequence indeed compresses more than entropy infinitely often on a set of positive measure, then, with high probability, long enough sample paths can be partitioned into disjoint words, such that those words that can be coded too well cover at least a fixed fraction of sample path length, while the complement can be compressed almost to entropy, thereby producing a code whose expected length per symbol is less than entropy by a fixed amount.

Let μ be a fixed ergodic process with process entropy H, and assume {C_n} is a prefix-code sequence such that lim inf_n ℒ(x_1^n | C_n)/n < H on a set of positive measure, with respect to μ. (As noted earlier, a faithful-code sequence with this property can be replaced by a prefix-code sequence with the same property, since adding headers to specify length has no effect on the asymptotics.) Thus there are positive numbers, ε and γ, such that if

B_j = {x_1^j: ℒ(x_1^j | C_j) ≤ j(H − ε)},

then μ(∩_n ∪_{j≥n} B_j) ≥ γ.


Let δ < 1/2 be a positive number to be specified later. The existence theorem, Theorem II.1.1, provides a k > 1/δ and a prefix code C̃_k, with length function ℒ̃(x_1^k) = ℓ(C̃_k(x_1^k)), such that the set

D_k = {x_1^k: ℒ̃(x_1^k) ≤ k(H + δ)}

has measure at least 1 − δ.

A version of the packing lemma will be used to show that long enough sample paths can, with high probability, be partitioned into disjoint words, such that the words that belong to B_M ∪ B_{M+1} ∪ ⋯, for some large enough M, cover at least a γ-fraction of total length, while the words that are neither in D_k nor in B_M ∪ B_{M+1} ∪ ⋯ cover at most a 3δ-fraction of total length. Sample paths with this property can be encoded by sequentially encoding the words, using the too-good code C_j on those words that belong to B_j for some j ≥ M, using the good code C̃_k on those words that belong to D_k, and using a single-letter code on the remaining symbols, plus headers to tell which code is being used. If δ is small enough and such sample paths are long enough, this produces a single code with expected length smaller than expected entropy, contradicting Shannon's lower bound.

To make precise the desired sample path partition concept, let M ≥ 2k/δ be a positive integer. For n ≥ M, a concatenation x_1^n = w(1)w(2)⋯w(t) is said to be a (γ, δ)-good representation of x_1^n if the following three conditions hold.

(a) Each w(i) belongs to A, or belongs to D_k, or belongs to B_j for some j ≥ M.

(b) The total length of the w(i) that belong to A is at most 3δn.

(c) The total length of the w(i) that belong to B_M ∪ B_{M+1} ∪ ⋯ is at least γn.

Let G(n) be the set of all sequences x_1^n that have a (γ, δ)-good representation. It will be shown later that

(6)   x_1^n ∈ G(n), eventually almost surely.

Here it will be shown how a prefix n-code can be constructed by taking advantage of the structure of sequences in G(n), a code that will eventually beat entropy for a suitable choice of δ and M, given that (6) holds. The coding result is stated as the following lemma.

Lemma II.1.4
For n ≥ M, there is a prefix n-code C_n* whose length function ℒ*(x_1^n) satisfies

(7)   ℒ*(x_1^n) ≤ n(H − ε'),   x_1^n ∈ G(n),

for all sufficiently large n, for a suitable choice of ε' > 0, δ > 0, and M ≥ 2k/δ. Furthermore, ℒ*(x_1^n)/n will be uniformly bounded above by a constant on the complement of G(n), independent of n.

The lemma is enough to contradict Shannon's lower bound on expected code length, for the uniform boundedness of ℒ*(x_1^n)/n on the complement of G(n) combines with the properties (6) and (7) to yield the expected value bound

E(ℒ*(X_1^n)) ≤ n(H − ε'/2),


for all sufficiently large n. But this, in turn, yields

E(ℒ*(X_1^n)) < H(μ_n),

for all sufficiently large n, since (1/n)H(μ_n) → H, thereby contradicting Shannon's lower bound on expected code length, Theorem I.7.11(i). □
Proof of Lemma II.1.4.
The code C_n*(x_1^n) is defined for sequences x_1^n ∈ G(n) by first determining a (γ, δ)-good representation x_1^n = w(1)w(2)⋯w(t), then coding the w(i) in order of increasing i, using the too-good code on those words that have length at least M, using the good code C̃_k on those words that have length k, and using some fixed code, say, F: A → B^d, where d ≥ log |A|, on the words of length 1. Appropriate headers are also inserted so the decoder can tell which code is being used, and, for x_1^n ∉ G(n), symbols are encoded separately using the fixed code F.

A precise definition of C_n* is given as follows. For x_1^n ∉ G(n), the per-symbol code

C_n*(x_1^n) = 0 F(x_1)F(x_2)⋯F(x_n)

is used. Note that ℒ*(x_1^n)/n is uniformly bounded on the complement of G(n). If x_1^n ∈ G(n), then a (γ, δ)-good representation x_1^n = w(1)w(2)⋯w(t) is determined and the code is defined by C_n*(x_1^n) = 1 v(1)v(2)⋯v(t), where

(8)   v(i) = 0 F(w(i)) if w(i) has length 1,   v(i) = 10 C̃_k(w(i)) if w(i) ∈ D_k,   v(i) = 11 e(j) C_j(w(i)) if w(i) ∈ B_j for some j ≥ M.

Here e(·) is an Elias code, that is, a prefix code on the natural numbers for which the length of e(j) is log j + o(log j). The code C_n* is clearly a prefix code, since the initial header specifies whether x_1^n ∈ G(n) or not, while the other headers specify which one of the prefix-code collection {F, C̃_k, C_M, C_{M+1}, ...} is being applied to a given word.

The goal is to show that on sequences in G(n) the code C_n* compresses more than entropy by a fixed amount, for suitable choice of δ and M ≥ 2k/δ. This is so because, as shown in the following paragraphs, the principal contribution to code length comes from the encodings C_j(w(i)) of the w(i) that belong to ∪_{j≥M} B_j and the encodings C̃_k(w(i)) of the w(i) that belong to D_k. By the definition of the B_j, the code word C_j(w(i)) is no longer than j(H − ε), so that if L is the total length of those w(i) that belong to B_M ∪ B_{M+1} ∪ ⋯, then it requires at most L(H − ε) bits to encode all such w(i). Likewise, by the definition of D_k, a code word C̃_k(w(i)) has length at most k(H + δ), so that, since the total length of such w(i) is at most n − L, it takes at most
L(H − ε) + (n − L)(H + δ)

bits to encode all those w(i) that belong to either D_k or to ∪_{j≥M} B_j. But L ≥ γn, by property (c) in the definition of G(n), so that, ignoring the headers needed to specify which code is being used as well as the lengths of the words that belong to ∪_{j≥M} B_j, the contribution


to total code length from encoding those w(i) that belong to either D_k or to ∪_{j≥M} B_j is upper bounded by

(9)   n(H + δ − γ(ε + δ)) ≤ n(H + δ − γε).

Each word w(i) of length 1 requires d bits to encode, and there are at most 3δn such words, by property (b) in the definition of G(n), so that, still ignoring header lengths, the total length needed to encode all the w(i) is upper bounded by

(10)   n(H + δ − γε + 3dδ).

If δ is small enough, then δ − γε + 3dδ will be negative, so essentially all that remains is to show that the headers require few bits, relative to n, provided δ is small enough and M is large enough. The 1-bit headers used to indicate that w(i) has length 1 require a total of at most 3δn bits, since the total number of such words is at most 3δn, by property (b) of the definition of G(n). There are at most n/k words w(i) that belong to D_k or to ∪_{j≥M} B_j, since k ≤ M and hence such words must have length at least k. By assumption, k > 1/δ, so these 2-bit headers contribute at most 2δn to total code length. The words w(i) in ∪_{j≥M} B_j need an additional header to specify their length and hence which C_j is to be used. Since an Elias code is used to do this, there is a constant K such that at most

Σ_{w(i)∈∪_{j≥M}B_j} [log ℓ(w(i)) + K]

bits are required to specify all these lengths. But (1/x) log x is a decreasing function for x > e, so that

Σ_{w(i)∈∪_{j≥M}B_j} log ℓ(w(i)) ≤ (log M / M) n,

since each such w(i) has length at least M ≥ 2k/δ, which exceeds e since it was assumed that δ < 1/2 and k > 1/δ. Thus, if it is assumed that log M ≤ δM, then at most Kδn bits are needed to specify the lengths of the w(i) that belong to ∪_{j≥M} B_j. In summary, ignoring the one bit needed to tell that x_1^n ∈ G(n), which is certainly negligible, the total contribution of all the headers as well as the length specifiers is upper bounded by

3δn + 2δn + Kδn,

for any choice of 0 < δ < 1/2, provided k > 1/δ, M ≥ 2k/δ and log M ≤ δM. Adding this bound to the bound (10) obtained by ignoring the headers gives the bound

ℒ*(x_1^n) ≤ n[H + δ(6 + 3d + K) − γε]

on total code length for members of G(n). Since d and K are constants, it follows that if δ is chosen small enough, then

ℒ*(x_1^n) ≤ n(H − ε'),   x_1^n ∈ G(n),

for suitable choice of δ, ε', and M ≥ 2k/δ, which establishes the lemma. □

Of course, the lemma depends on the truth of the assertion that x_1^n ∈ G(n), eventually almost surely, provided only that M ≥ 2k/δ. To establish this, choose L > M so large that if B = ∪_{j=M}^{L} B_j then μ(B) ≥ γ. The partial packing lemma, Lemma I.3.8,


provides, eventually almost surely, a γ-packing of x_1^n by blocks drawn from B. On the other hand, by the ergodic theorem, eventually almost surely there are at least (1 − δ)n indices i ∈ [1, n − k + 1] that are starting places of blocks from D_k, that is, x_1^n is (1 − δ)-strongly-covered by D_k. But the building-block lemma, Lemma I.7.7, then implies that x_1^n is eventually almost surely (1 − 2δ)-packed by k-blocks drawn from D_k. In particular, eventually almost surely, x_1^n has both a γ-packing by blocks drawn from B and a (1 − 2δ)-packing by blocks drawn from D_k.

To say that x_1^n has both a γ-packing by blocks drawn from B and a (1 − 2δ)-packing by blocks drawn from D_k is to say that there is a disjoint collection {[u_i, v_i]: i ∈ I} of k-length subintervals of [1, n] of total length at least (1 − 2δ)n for which each x_{u_i}^{v_i} ∈ D_k, and a disjoint collection {[s_j, t_j]: j ∈ J} of M-length or longer subintervals of [1, n] of total length at least γn, for which each x_{s_j}^{t_j} ∈ B. Since each [s_j, t_j] has length at least M, the cardinality of J is at most n/M, so the total length of those [u_i, v_i] that meet the boundary of at least one [s_j, t_j] is at most 2nk/M, which is upper bounded by δn, since it was assumed that M ≥ 2k/δ. Thus if I* is the set of indices i ∈ I such that [u_i, v_i] meets none of the [s_j, t_j], then the collection

{[s_j, t_j]: j ∈ J} ∪ {[u_i, v_i]: i ∈ I*}

is disjoint and covers at least a (1 − 3δ)-fraction of [1, n]. (This argument is essentially just the proof of the two-packings lemma, Lemma I.3.9, suggested in Exercise 5, Section I.3.) The words

{x_{s_j}^{t_j}: j ∈ J} ∪ {x_{u_i}^{v_i}: i ∈ I*},

together with the length 1 words {x_s} defined by those

s ∉ ∪_{j∈J} [s_j, t_j] ∪ ∪_{i∈I*} [u_i, v_i],

provide a representation x_1^n = w(1)w(2)⋯w(t) for which the properties (a), (b), and (c) hold.

The preceding paragraph can be summarized by saying that if x_1^n has both a γ-packing by blocks drawn from B and a (1 − 2δ)-packing by blocks drawn from D_k then x_1^n must belong to G(n). But this completes the proof that x_1^n ∈ G(n), eventually almost surely, since, eventually almost surely, x_1^n has both a γ-packing by blocks drawn from B and a (1 − 2δ)-packing by blocks drawn from D_k. This completes the second proof that too-good codes do not exist. □

II.1.d Another proof of the entropy theorem.


The entropy theorem can be deduced from the existence and nonexistence theorems connecting process entropy and faithful-code sequences, as follows. Fix an ergodic measure μ with process entropy H, and let {C_n} be a prefix-code sequence such that lim sup_n ℒ(x_1^n)/n ≤ H, a.s. Fix ε > 0 and define

D_n = {x_1^n: ℒ(x_1^n) ≤ n(H + ε)},

so that x_1^n ∈ D_n, eventually almost surely, and |D_n| ≤ 2^{n(H+ε)}, since C_n is invertible. Thus, if

B_n = {x_1^n: μ(x_1^n) ≤ 2^{−n(H+2ε)}},

then

μ(B_n ∩ D_n) ≤ |D_n| 2^{−n(H+2ε)} ≤ 2^{−nε}.

Therefore x_1^n ∉ B_n ∩ D_n, eventually almost surely, and so x_1^n ∉ B_n, eventually almost surely, establishing the upper bound

lim sup_n (1/n) log (1/μ(x_1^n)) ≤ H,   a.s.

For the lower bound define U_n = {x_1^n: μ(x_1^n) ≥ 2^{−n(H−ε)}}, and note that |U_n| ≤ 2^{n(H−ε)}. Thus there is a one-to-one function φ from U_n into binary sequences of length no more than 2 + n(H − ε), such that the first symbol in φ(x_1^n) is always a 0. Let C̄_n be the code obtained by adding the prefix 1 to the code C_n. Define C̃_n(x_1^n) = φ(x_1^n), x_1^n ∈ U_n, and C̃_n(x_1^n) = C̄_n(x_1^n), x_1^n ∉ U_n. The resulting code C̃_n is invertible so that Theorem II.1.2 guarantees that x_1^n ∉ U_n, eventually almost surely. This proves that

lim inf_n (1/n) log (1/μ(x_1^n)) ≥ H,   a.s.,

and completes this proof of the entropy theorem. □

II.1.e Exercises
1. This exercise explores what happens when the Shannon code for one process is used on the sample paths of another process.

(a) For each n let μ_n be the projection of the Kolmogorov measure of unbiased coin tossing onto A^n and let C_n be a Shannon code for μ_n. Show that if ν ≠ μ is ergodic then ℒ(x_1^n)/n cannot converge in probability to the entropy of ν.

(b) Let D(μ_n‖ν_n) = Σ_{x_1^n} μ_n(x_1^n) log(μ_n(x_1^n)/ν_n(x_1^n)) be the divergence of μ_n from ν_n, and let ℒ(x_1^n) be the length function of a Shannon code with respect to ν_n. Show that the expected value of ℒ(x_1^n) with respect to μ_n is H(μ_n) + D(μ_n‖ν_n).

(c) Show that if μ is the all 0 process, then there is a renewal process ν such that lim sup_n D(μ_n‖ν_n)/n = ∞ and lim inf_n D(μ_n‖ν_n)/n = 0.

2. Let L(x_1^n) denote the length of the longest string that appears twice in x_1^n. Show that if μ is i.i.d. then there is a constant C such that lim sup_n L(x_1^n)/log n ≤ C, almost surely. (Hint: code the second occurrence of the longest string by telling how long it is, where it starts, and where it occurred earlier. Code the remainder by using the Shannon code. Add suitable headers to each part to make the entire code into a prefix code and apply Barron's lemma, Lemma II.1.3.)

3. Suppose C_n* is a one-to-one function defined on a subset S of A^n, whose range is prefix free. Show that there is a prefix n-code C_n whose length function satisfies ℒ(x_1^n) ≤ Kn on the complement of S, where K is a constant independent of n, and such that C_n(x_1^n) = 1C_n*(x_1^n), for x_1^n ∈ S. Such a code C_n is called a bounded extension of C_n* to A^n.


Section II.2 The Lempel-Ziv algorithm.


An important coding algorithm was invented by Lempel and Ziv in 1975, [91]. In finite versions it is the basis for many popular data compression packages, and it has been extensively analyzed in various finite and limiting forms. Ziv's proof, [90], that the Lempel-Ziv (LZ) algorithm compresses to entropy in the limit will be given in this section. A second proof, which is due to Ornstein and Weiss, [53], and uses several ideas of independent interest, is given in Section II.4.

The LZ algorithm is based on a parsing procedure, that is, a way to express an infinite sequence x as a concatenation x = w(1)w(2)⋯ of variable-length blocks, called words. In its simplest form, called here (simple) LZ parsing, the word-formation rule can be summarized by saying: the next word is the shortest new word. To be precise, x is parsed inductively according to the following rules.

(a) The first word w(1) consists of the single letter x_1.

(b) Suppose w(1)⋯w(j) = x_1^{n_j}.

(i) If x_{n_j+1} ∉ {w(1), ..., w(j)} then w(j + 1) consists of the single letter x_{n_j+1}.

(ii) Otherwise, w(j + 1) = x_{n_j+1}^{m+1}, where m is the least integer, m ≥ n_j + 1, such that x_{n_j+1}^{m} ∈ {w(1), ..., w(j)} and x_{n_j+1}^{m+1} ∉ {w(1), ..., w(j)}.

Thus, for example, 11001010001000100... parses into

1, 10, 0, 101, 00, 01, 000, 100, ...

where, for ease of reading, the words are separated by commas. Note that the parsing is sequential, that is, later words have no effect on earlier words, and therefore an initial segment of length n can be expressed as

(1)   x_1^n = w(1)w(2)⋯w(C) y,

where the final block y is either empty or is equal to some w(j), for j ≤ C. The parsing defines a prefix n-code C_n, called the (simple) Lempel-Ziv (LZ) code, by noting that each new word is really only new because of its final symbol, hence it is specified by giving a pointer to where the part before its final symbol occurred earlier, together with a description of its final symbol. To describe the LZ code C_n precisely, let ⌈·⌉ denote the least integer function and let f: {0, 1, 2, ..., n} → B^{⌈log n⌉} and g: A → B^{⌈log |A|⌉} be fixed one-to-one functions. If x_1^n is parsed as in (1), the LZ code maps it into the concatenation

C_n(x_1^n) = b(1)b(2)⋯b(C)b(C + 1)

of the binary words b(1), b(2), ..., b(C), b(C + 1), defined according to the following rules.


(a) If j ≤ C and w(j) has length 1, then b(j) = 0 g(w(j)).

(b) If j ≤ C and i < j is the least integer for which w(j) = w(i)a, a ∈ A, then b(j) = 1 f(i) 0 g(a).

(c) If y is empty, then b(C + 1) is empty; otherwise b(C + 1) = 1 f(i), where i is the least integer such that y = w(i).

Part (a) requires at most |A|(⌈log |A|⌉ + 1) bits, part (b) requires at most C(⌈log n⌉ + ⌈log |A|⌉ + 2) bits, and part (c) requires at most ⌈log n⌉ + 1 bits, so total code length ℒ(x_1^n) is upper bounded by

(C + 1) log n + αC + β,

where α and β are constants. The dominant term is C log n, so to establish universality it is enough to prove the following theorem of Ziv, [90], in which C(x_1^n) now denotes the number of new words in the simple parsing (1).
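For readers who want to experiment, here is a minimal Python sketch of simple LZ parsing ("the next word is the shortest new word") and the word count C(x_1^n); the function name is ours, and the final word may repeat an earlier one, as allowed by (1).

    def lz_parse(x):
        # Simple LZ parsing: each word is the shortest block not previously seen as a word.
        words, seen, i = [], set(), 0
        while i < len(x):
            j = i + 1
            while j <= len(x) and x[i:j] in seen:
                j += 1
            w = x[i:j]          # if x ends inside an old word, w is the final repeated block y of (1)
            words.append(w)
            seen.add(w)
            i = j
        return words

    # The example from the text:
    assert lz_parse('11001010001000100') == ['1', '10', '0', '101', '00', '01', '000', '100']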

Theorem II.2.1 (The LZ convergence theorem.)
If μ is an ergodic process with entropy h, then (1/n)C(x_1^n) log n → h, almost surely.

Of course, it is enough to prove that entropy is an upper bound, since no code can beat entropy in the limit, by Theorem II.1.2. Ziv's proof that entropy is the correct almost-sure upper bound is based on an interesting extension of the entropy idea to individual sequences, together with a proof that this individual sequence entropy is an asymptotic upper bound on (1/n)C(x_1^n) log n for every sequence, and a proof that, for ergodic processes, the individual sequence entropy is almost surely upper bounded by the entropy given by the entropy theorem.

Ziv's concept of entropy for an individual sequence begins with a simpler idea, called topological entropy, which is the growth rate of the number of observed strings of length k, as k → ∞. The k-block universe of x is the set U_k(x) of all a_1^k that appear as a block of consecutive symbols in x, that is,

U_k(x) = {a_1^k: x_i^{i+k−1} = a_1^k for some i ≥ 1}.

The topological entropy of x is defined by

h(x) = lim_{k→∞} (1/k) log |U_k(x)|,


a limit which exists, by subadditivity, Lemma I.6.7, since |U_{m+k}(x)| ≤ |U_m(x)| |U_k(x)|. (The topological entropy of x, as defined here, is the same as the usual topological entropy of the orbit closure of x, see [84, 34] for discussions of this more general concept.) Topological entropy takes into account only the number of k-strings that occur in x, but gives no information about frequencies of occurrence. For example, if x is a typical sequence for the binary i.i.d. process with p ∈ (0, 1) equal to the probability of a 1, then every finite string occurs with positive probability, hence its topological entropy h(x) is log 2 = 1. The entropy given by the entropy theorem is h(p) = −p log p − (1 − p) log(1 − p), which depends strongly on the value of p, for it takes into account the frequency with which strings occur.
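The finite-n analogue of these quantities is easy to compute; the following Python sketch (function names are ours) evaluates the k-block universe of a finite sample and the corresponding (1/k) log |U_k|, which for large k approximates the topological entropy.

    import math

    def k_block_universe(x, k):
        # U_k(x_1^n): the set of k-blocks occurring as consecutive symbols of the sample.
        return {x[i:i + k] for i in range(len(x) - k + 1)}

    def h_k(x, k):
        # (1/k) log2 |U_k|; by subadditivity the limit in k exists for an infinite sequence.
        return math.log2(len(k_block_universe(x, k))) / k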


The concept of Ziv-entropy is based on the observation that strings with too small frequency of occurrence can be eliminated by making a small limiting density of changes. The natural distance concept for this idea is d̄(x, y), defined as before by

d̄(x, y) = lim sup_n d_n(x_1^n, y_1^n),

where

d_n(x_1^n, y_1^n) = (1/n) |{i ∈ [1, n]: x_i ≠ y_i}|

is per-letter Hamming distance. The Ziv-entropy of a sequence x is denoted by H(x) and is defined by

H(x) = lim_{ε→0} inf {h(y): d̄(x, y) ≤ ε}.

Ziv established the LZ convergence theorem by proving the following two theorems,
which also have independent interest.

Theorem II.2.2 (The LZ upper bound.)
lim sup_n (1/n) C(x_1^n) log n ≤ H(x),   x ∈ A^∞.

Theorem II.2.3 (The Ziv-entropy theorem.)
If μ is an ergodic process with entropy h then H(x) ≤ h, almost surely.

These two theorems show that lim sup_n (1/n)C(x_1^n) log n ≤ h, almost surely, which, combined with the fact that it is not possible to beat entropy in the limit, Theorem II.1.2, leads immediately to the LZ convergence theorem, Theorem II.2.1.

The proof of the upper bound, Theorem II.2.2, will be carried out via three lemmas. The first gives a simple bound (which is useful in its own right), the second gives topological entropy as an upper bound, and the third establishes a d̄-perturbation bound. The first lemma obtains a crude upper bound by a simple worst-case analysis.

Lemma II.2.4 (The crude bound.)
There exists δ_n → 0 such that

(1/n) C(x_1^n) log n ≤ (1 + δ_n) log |A|,   x_1^n ∈ A^n.

Proof The most words occur in the case when the LZ parsing contains all the words of length 1, all the words of length 2, ..., up to some length, say m, that is, when

n = Σ_{j=1}^{m} j|A|^j and C = Σ_{j=1}^{m} |A|^j.

In this case, n ≤ (m + 1)|A|^{m+1} and C ≤ |A|^{m+1}/(|A| − 1), and hence

(1/n) C log n ∼ log |A|.


To establish the general case in the form stated, first choose m such that

Σ_{j=1}^{m−1} j|A|^j < n ≤ Σ_{j=1}^{m} j|A|^j.

The desired upper bound then follows from the bounds

(2)   Σ_{j=1}^{m} t^j ≤ t^{m+1}/(t − 1),   t^m ≤ (1/m) Σ_{j=1}^{m} j t^j ≤ m t^m,

all of which are valid for t > 1 and are simple consequences of the well-known formula Σ_{j=0}^{m} t^j = (t^{m+1} − 1)/(t − 1). □

The second bounding result extends the crude bound by noting that short words don't matter in the limit, and hence the rate of growth of the number of words of length k, as k → ∞, that is, the topological entropy, must give an upper bound.

Lemma II.2.5 (The topological entropy bound.)
For each x ∈ A^∞, there exists δ_n → 0, such that

(1/n) C(x_1^n) log n ≤ (1 + δ_n) h(x).

Proof Let h_k(x) = (1/k) log |U_k(x)|, for k ≥ 1, and choose m such that

Σ_{j=1}^{m} j 2^{j h_j(x)} < n ≤ Σ_{j=1}^{m+1} j 2^{j h_j(x)}.

The bounds (2), together with the fact that h_j(x) ≤ h(x) + ε_j, where ε_j → 0 as j → ∞, yield the desired result. □

The third lemma asserts that the topological entropy of any sequence in a small d̄-neighborhood of x is (almost) an upper bound on code performance. At first glance the result is quite unexpected, for the LZ parsing may be radically altered by even a few changes in x.

Lemma II.2.6 (The perturbation bound.)
Given x and ε > 0, there is a δ > 0 such that if d̄(x, y) ≤ δ, then

lim sup_{n→∞} (1/n) C(x_1^n) log n ≤ h(y) + ε.

Proof Define the word-length function by ℓ(w) = j, for w ∈ A^j. Fix δ > 0 and suppose d̄(x, y) ≤ δ. Let x = w(1)w(2)⋯ be the LZ parsing of x and parse y as y = v(1)v(2)⋯, where ℓ(v(i)) = ℓ(w(i)), i = 1, 2, .... The word w(i) will be called well-matched if

d_{ℓ(w(i))}(w(i), v(i)) ≤ √δ,


otherwise poorly-matched. Let C_1(x_1^n) be the number of poorly-matched words and let C_2(x_1^n) be the number of well-matched words in the LZ parsing of x. By the Markov inequality, the poorly-matched words cover less than a limiting √δ-fraction of x, and hence it can be supposed that for all sufficiently large n, the total length of the poorly-matched words in the LZ parsing of x_1^n is at most n√δ. Thus, Lemma II.2.4 gives the bound

(3)   (1/n) C_1(x_1^n) log n ≤ (1 + α_n)√δ log |A|,

where lim_n α_n = 0. Now consider the well-matched words. For each k, let G_k(x) be the set of well-matched w(i) of length k, and let G_k(y) be the set of all v(i) for which w(i) ∈ G_k(x). The cardinality of G_k(y) is at most 2^{k h_k(y)}, since this is the total number of words of length k in y. Since d_k(w(i), v(i)) ≤ √δ, for w(i) ∈ G_k(x), the blowup-bound lemma, Lemma I.7.5, together with the fact that h_k(y) → h(y), yields

|G_k(x)| ≤ 2^{k(h(y) + f(δ))},

where f(δ) → 0 as δ → 0. Lemma II.2.5 then gives

C_2(x_1^n) ≤ (1 + β_n) (n/log n) (h(y) + f(δ)),

where lim_n β_n = 0. This bound, combined with (3) and the fact that f(δ) → 0, implies the desired result, completing the proof of Lemma II.2.6. □

The LZ upper bound, Theorem II.2.2, follows immediately from Lemma II.2.6. □

The desired inequality, H(x) ≤ h, almost surely, is an immediate consequence of the following lemma.

Lemma II.2.7
Let μ be an ergodic process with entropy h. For any frequency-typical x and ε > 0, there is a sequence y such that d̄(x, y) ≤ ε and h(y) ≤ h + ε.

Proof Fix ε > 0, and choose k so large that there is a set T_k ⊂ A^k of cardinality less than 2^{k(h+ε)} and measure more than 1 − ε. Fix a frequency-typical sequence x. The idea for the proof is that x_i^{i+k−1} ∈ T_k for all but a limiting ε-fraction of indices i, hence the same must be true for nonoverlapping k-blocks for at least one shift s ∈ [0, k − 1]. The resulting nonoverlapping blocks that are not in T_k can then be replaced by a single fixed block to obtain a sequence close to x with topological entropy close to h.

To make the preceding argument rigorous, first note that

lim_{n→∞} (1/n) |{i ∈ [0, n − 1]: x_{i+1}^{i+k} ∈ T_k}| > 1 − ε,

since x was assumed to be frequency-typical. Thus there must be an integer s ∈ [0, k − 1] such that

lim inf_{n→∞} (1/n) |{i ∈ [0, n − 1]: x_{ik+s+1}^{ik+s+k} ∈ T_k}| > 1 − ε,


which can be expressed by saying that x is the concatenation

x = u w(1)w(2)⋯,

where the initial block u has length s, each w(i) has length k, and

(4)   lim inf_{n→∞} (1/n) |{i ≤ n: w(i) ∈ T_k}| > 1 − ε.

Fix a ∈ A and let a^k denote the sequence of length k, all of whose members are a. The sequence y is defined as the concatenation

y = u v(1)v(2)⋯,

where

v(i) = w(i) if w(i) ∈ T_k, and v(i) = a^k otherwise.

Condition (4) and the definition of y guarantee that d̄(x, y) ≤ ε. If M is large relative to k, a member of the M-block universe U_M(y) of y has the form q v(j)v(j+1)⋯v(j+m−1) r, where q and r have length at most k, and (M − 2k)/k ≤ m ≤ M/k. Since each v(i) ∈ T_k ∪ {a^k}, a set of cardinality at most 1 + 2^{k(h+ε)}, the built-up set bound, Lemma I.7.6, implies that

h_M(y) ≤ h + ε + δ_M,

where δ_M → 0, and hence h(y) ≤ h + ε. This completes the proof of Lemma II.2.7, and, thereby, the proof of the Ziv-entropy theorem, Theorem II.2.3. □

Remark II.2.8
Variations in the parsing rules lead to different forms of the LZ algorithm. One version, which has also been extensively analyzed, defines a word to be new only if it has been seen nowhere in the past (recall that in simple parsing, "new" meant "not seen as a word in the past"). In this alternative version, new words tend to be longer, which improves code performance, but more bookkeeping is required, for now the location of the start of the initial old word, as well as its length, must be encoded. Nevertheless, the rate of growth in the number of new words still determines limiting code length and the upper bounding theorem, Theorem II.2.2, remains true, see Exercise 1.

Remark II.2.9
Another variation, also discussed in the exercises, includes the final symbol of the new word in the next word. This modification, known as the LZW algorithm, produces slightly more rapid convergence and allows some simplification in the design of practical coding algorithms. It is used in several data compression packages.

Remark II.2.10
For computer implementation of LZ-type algorithms some bound on memory growth is needed. One method is to use some form of the LZ algorithm until a fixed number, say C_0, of words have been seen, after which the next word is defined to be the longest block that appears in the list of previously seen words. Another version, called the sliding-window version, starts in much the same way, until it reaches the n_0-th term, then defines


the next word as the longest block that appears somewhere in the past n_0 terms. Such algorithms cannot achieve entropy for every ergodic process, but do perform well for processes which have constraints on memory growth, and can be shown to approach optimality as window size or tree storage approach infinity, see [87, 88].

II.2.a Exercises

1. Show that the version of LZ discussed in Remark II.2.8 achieves entropy in the limit. (Hint: if the successive word lengths are ℓ_1, ℓ_2, ..., ℓ_C, then Σ log ℓ_i ≤ C log(n/C).)

2. Show that the LZW version discussed in Remark II.2.9 achieves entropy in the limit.

3. Show that simple LZ parsing defines a binary tree in which the number of words corresponds to the number of nodes in the tree.

Section II.3 Empirical entropy.


In many coding procedures a sample path x_1^n is partitioned into blocks, then each block is coded separately. The blocks may have fixed length, which can depend on n, or variable length. Furthermore, the coding of a block may depend only on the block, or it may depend on the past or even the entire sample path, e.g., through the empirical distribution of blocks in the sample path. In this section, a beautiful and deep result due to Ornstein and Weiss, [52], about the covering exponent of the empirical distribution of fixed-length blocks will be established. The theorem, here called the empirical-entropy theorem, suggests an entropy-estimation algorithm and a specific coding technique, both of which will also be discussed. Furthermore, the method of proof leads to useful techniques for analyzing variable-length codes, such as the Lempel-Ziv code, as well as other entropy-related ideas. These will be discussed in later sections. The empirical-entropy theorem has a nonoverlapping-block form which will be discussed here, and an overlapping-block form, see Exercise 1. The nonoverlapping k-block distribution is defined by

\[
q_k(a_1^k \mid x_1^n) = \frac{|\{i \in [0, m-1]: x_{ik+1}^{ik+k} = a_1^k\}|}{m},
\]

where n = km + r, 0 ≤ r < k. The theorem is concerned with the number of k-blocks it takes to cover a large fraction of a sample path, asserting that eventually almost surely only a small exponential factor more than 2^{kh} such k-blocks is enough, and that a small exponential factor less than 2^{kh} is not enough, provided only that n ≥ 2^{kh}.
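As an illustration only, here is a small Python sketch of the nonoverlapping empirical distribution q_k(·|x_1^n) just defined; the function name is not from the text.

```python
from collections import Counter

def nonoverlapping_block_distribution(x, k):
    """The empirical distribution q_k(.|x_1^n) of nonoverlapping k-blocks:
    x is cut into m = n // k consecutive k-blocks (a final remainder of
    length r < k is ignored) and relative frequencies are returned."""
    m = len(x) // k
    counts = Counter(x[i * k:(i + 1) * k] for i in range(m))
    return {block: c / m for block, c in counts.items()}

# q_2 for a short binary sample path of length 10 (m = 5 blocks).
for block, p in sorted(nonoverlapping_block_distribution("0110100110", 2).items()):
    print(block, p)
```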


Theorem II.3.1 (The empirical-entropy theorem.) Let μ be an ergodic measure with entropy h > 0. For each ε > 0 and each k there is a set T_k(ε) ⊂ A^k for which |T_k(ε)| ≤ 2^{k(h+ε)}, such that for almost every x there is a K = K(ε, x) such that if k ≥ K and n ≥ 2^{kh}, then

(a) q_k(T_k(ε) | x_1^n) > 1 − ε.

(b) q_k(B | x_1^n) < ε, for any B ⊂ A^k for which |B| ≤ 2^{k(h−ε)}.

The theorem is, in essence, an empirical-distribution form of entropy as covering exponent, Theorem I.7.4, for the latter says that for given ε and all k large enough, there is a set T_k(ε) ⊂ A^k such that |T_k(ε)| ≤ 2^{k(h+ε)} and μ_k(T_k(ε)) > 1 − ε, and there is no set B ⊂ A^k such that |B| ≤ 2^{k(h−ε)} and μ(B) > ε. The ergodic theorem implies that for fixed k and mixing μ, the empirical distribution q_k(·|x_1^n) is eventually almost surely close to the true distribution μ_k, hence, in particular, for fixed k, the empirical and true distributions will eventually almost surely have approximately the same covering properties. The surprising aspect of the empirical-entropy theorem is its assertion that even if k is allowed to grow with n, subject only to the condition that k ≤ (1/h) log n, the empirical measure q_k(·|x_1^n) eventually almost surely has the same covering properties as μ_k, in spite of the fact that it may not otherwise be close to μ_k in variational or even d̄_k-distance. For example, for unbiased coin-tossing the measure q_k(·|x_1^n) is not close to μ_k in variational distance for the case n = k2^k, see Exercise 3.

II.3.a Strong-packing.

Part (a) of the empirical-entropy theorem is a consequence of a very general result about parsings of "typical" sample paths. Eventually almost surely, if the short words cover a small enough fraction, then most of the path is covered by words that come from fixed collections of exponential size almost determined by entropy. A fixed-length version of the result is all that is needed for part (a) of the entropy theorem, but the more general result about parsings into variable-length blocks will be developed here, as it will be useful in later sections. As before, a finite sequence w is called a word, word length is denoted by ℓ(w), and a parsing of x_1^n is an ordered collection P = {w(1), w(2), ..., w(t)} of words for which x_1^n = w(1)w(2)···w(t). Fix a sequence of sets {T_k}, with T_k ⊂ A^k, for k ≥ 1. A parsing x_1^n = w(1)w(2)···w(t) is (1 − ε)-built-up from {T_k}, or (1 − ε)-packed by {T_k}, if the words that belong to ∪T_k cover at least a (1 − ε)-fraction of n, that is,

\[
\sum_{w(i) \in \cup_k \mathcal{T}_k} \ell(w(i)) \geq (1-\epsilon)n.
\]
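For concreteness, the built-up condition can be checked mechanically; the following Python sketch is illustrative only, with the collections {T_k} supplied as ordinary sets keyed by block length.

```python
def is_built_up(words, collections, eps):
    """Check the (1 - eps)-built-up property: the words of the parsing that
    belong to the union of the fixed collections {T_k} must cover at least
    a (1 - eps)-fraction of the total length.  `collections` maps a block
    length k to the set T_k."""
    n = sum(len(w) for w in words)
    covered = sum(len(w) for w in words if w in collections.get(len(w), set()))
    return covered >= (1 - eps) * n

# Toy collections T_2 and T_3; the parsing below is (1 - 0.3)-built-up.
T = {2: {"01", "10"}, 3: {"010"}}
print(is_built_up(["01", "10", "11", "010"], T, 0.3))   # True
```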

A sequence x_1^n is (K, ε)-strongly-packed by {T_k} if any parsing x_1^n = w(1)w(2)···w(t) for which

\[
\sum_{\ell(w(i)) < K} \ell(w(i)) \leq \frac{\epsilon n}{2}
\]

is (1 − ε)-built-up from {T_k}. Strong-packing captures the essence of the idea that if the long words of a parsing cover most of the sample path, those long words that come from the fixed collections {T_k} must cover most of the sample path. The key result is the existence of collections for which the T_k are of exponential size determined by entropy and such that eventually almost surely x_1^n is almost strongly-packed.

Lemma II.3.2 (The strong-packing lemma.) Let μ be an ergodic process of entropy h and let ε be a positive number. There is an integer K = K(ε) and, for each k ≥ K, a set T_k = T_k(ε) ⊂ A^k, such that both of the following hold.

(a) |T_k| ≤ 2^{k(h+ε)}, for k ≥ K.

(b) x_1^n is eventually almost surely (K, ε)-strongly-packed by {T_k}.
Proof The idea of the proof is quite simple. The entropy theorem provides an integer m and a set C_m ⊂ A^m of measure close to 1 and cardinality at most 2^{m(h+δ)}. By the ergodic theorem, eventually almost surely most indices in x_1^n are starting places of m-blocks from C_m. But if such an x_1^n is partitioned into words then, by the Markov inequality, most of x_1^n must be covered by those words which themselves have the property that most of their indices are starting places of members of C_m. If a word is long enough, however, and most of its indices are starting places of members of C_m, then the word is mostly built-up from C_m, by the packing lemma. The collection T_k of words of length k that are mostly built-up from C_m has cardinality only a small exponential factor more than 2^{kh}, by the built-up set lemma.

To make the outline into a precise proof, fix ε > 0, and let δ be a positive number to be specified later. The entropy theorem provides an m for which the set

\[
C_m = \{a_1^m: \mu(a_1^m) \geq 2^{-m(h+\delta)}\}
\]

has measure at least 1 − δ²/4. For k ≥ m, let T_k be the set of sequences of length k that are (1 − δ)-built-up from C_m. By the built-up set lemma, Lemma I.7.6, it can be supposed that δ is small enough to guarantee that |T_k| ≤ 2^{k(h+ε)}, k ≥ m. By making δ smaller, if necessary, it can be supposed that δ < ε/2.

It remains to show that eventually almost surely x_1^n is (K, ε)-strongly-packed by {T_k} for a suitable K. This is a consequence of the following three observations.

(i) The ergodic theorem implies that for almost every x there is an N = N(x) such that for n ≥ N, the sequence x_1^n has the property that x_i^{i+m−1} ∈ C_m for at least (1 − δ²/2)n indices i ∈ [1, n − m + 1], that is, x_1^n is (1 − δ²/2)-strongly-covered by C_m.

(ii) If x_1^n is (1 − δ²/2)-strongly-covered by C_m and parsed as x_1^n = w(1)w(2)···w(t), then, by the Markov inequality, the words w(i) that are not (1 − δ/2)-strongly-covered by C_m cannot have total length more than δn ≤ εn/2.

(iii) If w(i) is (1 − δ/2)-strongly-covered by C_m, and if ℓ(w(i)) ≥ 2/δ, then, by the packing lemma, Lemma I.3.3, w(i) is (1 − δ)-packed by C_m, that is, w(i) ∈ T_{ℓ(w(i))}.


From the preceding it is enough to take K ≥ 2/δ, for if this is so, if n ≥ N(x), and if the parsing x_1^n = w(1)w(2)···w(t) satisfies

\[
\sum_{\ell(w(i)) < K} \ell(w(i)) \leq \frac{\epsilon n}{2},
\]

then the set of w(i) for which ℓ(w(i)) ≥ K and w(i) ∈ ∪T_k must have total length at least (1 − ε)n. This completes the proof of Lemma II.3.2.

II.3.b Proof of the empirical-entropy theorem.

Proof of part (a). Fix ε > 0 and let δ < 1 be a positive number to be specified later. The strong-packing lemma provides K and T_k ⊂ A^k, such that |T_k| ≤ 2^{k(h+ε)} for k ≥ K, and such that x_1^n is (K, δ)-strongly-packed by {T_k}, eventually almost surely. Fix x such that x_1^n is (K, δ)-strongly-packed by {T_k} for n ≥ N(x), and choose K(x) ≥ K such that if k ≥ K(x) and n ≥ 2^{kh}, then n ≥ N(x) and k ≤ δn. Fix k ≥ K(x) and n ≥ 2^{kh}, and suppose x_1^n = w(1)w(2)···w(t)w(t+1), where ℓ(w(i)) = k, i ≤ t, and ℓ(w(t+1)) < k. All the blocks, except possibly the last one, are longer than K, while the last one has length less than δn, since k ≤ δn. The definition of strong-packing implies that

\[
(n - k + 1)\, q_k(\mathcal{T}_k \mid x_1^n) \geq (1 - 2\delta)n,
\]

since the left side is just the fraction of x_1^n that is covered by members of T_k. Thus, dividing by n − k + 1 produces

\[
q_k(\mathcal{T}_k \mid x_1^n) \geq (1 - 2\delta)(1 - \delta) \geq 1 - \epsilon,
\]

for suitable choice of δ. This completes the proof of part (a) of the empirical-entropy theorem.

Proof of part (b). An informal description of the proof will be given first. Suppose x_1^n is too-well covered by a too-small collection B of k-blocks, for some k. A simple two-part code can be constructed that takes advantage of the existence of such sets B. The first part is an encoding of some listing of B; this contributes asymptotically negligible length if |B| is exponentially smaller than 2^{kh} and n ≥ 2^{kh}. The second part encodes successive k-blocks by giving their index in the listing of B, if the block belongs to B, or by applying a fixed good k-block code if the block does not belong to B. If B is exponentially smaller than 2^{kh} then fewer than kh bits are needed to code each block in B, so that if B covers too much the code will beat entropy.

The good k-code used on the blocks that are not in B comes from part (a) of the empirical-entropy theorem, which supplies for each k a set T_k ⊂ A^k of cardinality roughly 2^{kh}, such that eventually almost surely most of the k-blocks in an n-length sample path belong to T_k, provided only that n ≥ 2^{kh}. If a block belongs to T_k − B then its index in some ordering of T_k is transmitted; this requires roughly kh bits. If the block does

SECTION 11.3. EMPIRICAL ENTROPY.

141

not belong to T_k ∪ B then it is transmitted term by term using some fixed 1-block code; since such blocks cover at most a small fraction of the sample path this contributes little to overall code length.

To proceed with the rigorous proof, fix ε > 0 and let δ be a positive number to be specified later. Part (a) of the theorem provides a set T_k ⊂ A^k of cardinality at most 2^{k(h+δ)}, for each k, and, for almost every x, an integer K(x) such that if k ≥ K(x) and n ≥ 2^{kh} then q_k(T_k|x_1^n) > 1 − δ. Let K be a positive integer to be specified later and for n ≥ 2^{Kh} let B(n) be the set of x_1^n for which there is some k in the interval [K, (log n)/h] for which there is a set B ⊂ A^k such that the following two properties hold.

(i) q_k(T_k|x_1^n) ≥ 1 − δ.

(ii) |B| ≤ 2^{k(h−ε)} and q_k(B|x_1^n) ≥ ε.

By using the suggested coding argument it will be shown that if 8 is small enough and K large enough then
x_1^n ∉ B(n), eventually almost surely.

This fact implies part (b) of the empirical-entropy theorem. Indeed, the definition of {T_k} implies that for almost every x there is an integer K(x) ≥ K such that

\[
q_k(\mathcal{T}_k \mid x_1^n) > 1 - \delta, \quad \text{for } K(x) \leq k \leq \frac{\log n}{h},
\]

so that if K(x) ≤ k ≤ (log n)/h then either x_1^n ∈ B(n), or the following holds.

(1) If B ⊂ A^k and |B| ≤ 2^{k(h−ε)}, then q_k(B|x_1^n) < ε.

If x_1^n ∉ B(n), eventually almost surely, then for almost every x there is an N(x) such that x_1^n ∉ B(n), for n ≥ N(x), so that if k is enough larger than K(x) to guarantee that 2^{kh} ≥ N(x), then property (1) must hold for all n ≥ 2^{kh}, which implies part (b) of the empirical-entropy theorem.

It will be shown that the suggested code C_n beats entropy on B(n) for all sufficiently large n, if δ is small enough and K is large enough, which establishes that x_1^n ∉ B(n), eventually almost surely, since no prefix-code sequence can beat entropy infinitely often on a set of positive measure.

Several auxiliary codes are used in the formal definition of C_n. Let ⌈·⌉ denote the upper integer function. A faithful single letter code, say, F: A → {0, 1}^{⌈log |A|⌉}, is needed along with its extension to length m ≥ 1, defined by

\[
F_m(x_1^m) = F(x_1)F(x_2)\cdots F(x_m).
\]

Also needed for each k is a fixed-length faithful coding of T_k, say

\[
G_k: \mathcal{T}_k \to \{0, 1\}^{\lceil k(h+\delta)\rceil},
\]

and a fixed-length faithful code for each B ⊂ A^k of cardinality at most 2^{k(h−ε)}, say,

\[
M_{k,B}: B \to \{0, 1\}^{\lceil k(h-\epsilon)\rceil}.
\]


Finally, for each k and B ⊂ A^k, let φ_k(B) be the concatenation in some order of {F_k(b_1^k): b_1^k ∈ B}, and let E be an Elias prefix code on the integers, that is, a prefix code such that the length of E(n) is log n + o(log n).

The rigorous definition of C_n is as follows. If x_1^n ∉ B(n) then C_n(x_1^n) = 0 F_n(x_1^n). If x_1^n ∈ B(n), then an integer k ∈ [K, (log n)/h] and a set B ⊂ A^k of cardinality at most 2^{k(h−ε)} are determined such that q_k(B|x_1^n) ≥ ε, and x_1^n is parsed as

\[
x_1^n = w(1)w(2)\cdots w(t)w(t+1),
\]

where w(i) has length k for i ≤ t, and w(t+1) has length r ∈ [0, k). The code is defined as the concatenation

\[
C_n(x_1^n) = 1\,0^k 1\, E(|B|)\,\phi_k(B)\,v(1)v(2)\cdots v(t)v(t+1),
\]

where

(2)
\[
v(i) = \begin{cases}
00\, M_{k,B}(w(i)) & \text{if } w(i) \in B,\\
01\, G_k(w(i)) & \text{if } w(i) \in \mathcal{T}_k - B,\\
11\, F_k(w(i)) & \text{if } w(i) \notin \mathcal{T}_k \cup B,\ i \leq t,\\
11\, F_r(w(t+1)) & \text{if } i = t+1.
\end{cases}
\]

The code is a prefix code, since, reading from the left, the first bit tells whether x_1^n ∈ B(n), and, given that x_1^n ∈ B(n), the block 0^k 1 determines k. Once k is known, E(|B|) determines the size of B, and φ_k(B) determines B. The two-bit header on each v(i) specifies whether w(i) belongs to B, to T_k − B, or to neither of these two sets, from which w(i) can then be determined by using the appropriate inverse, since each of the codes M_{k,B}, G_k, F_k, and F_r are now known.

For x_1^n ∈ B(n), the principal contribution to total code length L(x_1^n) = L(x_1^n|C_n) comes from the encoding of the w(i) that belong to T_k ∪ B, a fact stated as the following lemma.

Lemma II.3.3

(3)
\[
L(x_1^n) \leq tk\,q_k(h - \epsilon) + tk(1 - q_k)(h + \delta) + (\delta + \alpha_K)O(n), \quad x_1^n \in B(n),
\]

where α_K → 0 as K → ∞, and q_k = q_k(B|x_1^n), for the block length k and set B ⊂ A^k used to define C_n(x_1^n).

The lemma is sufficient to complete the proof of part (b) of the empirical-entropy theorem. This is because

\[
tk\,q_k(h - \epsilon) + tk(1 - q_k)(h + \delta) \leq n\bigl(h + \delta - \epsilon(\epsilon + \delta)\bigr),
\]

since q_k ≥ ε, for x_1^n ∈ B(n), so that if K is large enough and δ small enough then

\[
L(x_1^n) \leq n(h - \epsilon^2/2), \quad x_1^n \in B(n),
\]

for all sufficiently large n. But this implies that x_1^n ∉ B(n), eventually almost surely, since no sequence of codes can beat entropy infinitely often on a set of positive measure, and as noted earlier, this, in turn, implies part (b) of the empirical-entropy theorem.


Proof of Lemma II.3.3.

Ignoring for the moment the contribution of the headers used to describe k and B and to tell which code is being applied to each k-block, as well as the extra bits that might be needed to round k(h − ε) and k(h + δ) up to integers, the number of bits used to encode the blocks in T_k ∪ B is given by

\[
t\,q_k(B|x_1^n)\,k(h - \epsilon) + \bigl(t - t\,q_k(B|x_1^n)\bigr)\,k(h + \delta),
\]

since there are t q_k(B|x_1^n) blocks that belong to B, each of which requires k(h − ε) bits, and (t − t q_k(B|x_1^n)) blocks in T_k − B, each of which requires k(h + δ) bits. This gives the dominant terms in (3).

The header 1 0^k 1 E(|B|) that describes k and the size of B has length

(4) 2 + k + log|B| + o(log|B|),

which is certainly o(n), since log|B| ≤ k(h − ε) and k ≤ (log n)/h. It takes at most k(1 + log|A|) bits to encode each w(i) ∉ T_k ∪ B, as well as to encode the final r-block if r > 0, and there are at most 1 + δt such blocks, hence at most δO(n) bits are needed to encode them. Adding this to the o(n) bound of (4) then yields the δO(n) term of the lemma.

The encoding φ_k(B) of B takes k⌈log|A|⌉|B| ≤ k(1 + log|A|)2^{k(h−ε)} bits, which is, in turn, upper bounded by

(5) K(1 + log|A|)2^{−Kε} n,

provided K ≥ (ε ln 2)^{−1}, since k ≥ K and n ≥ 2^{kh}. There are at most t + 1 blocks, each requiring a two-bit header to tell which code is being used, as well as a possible extra bit to round up k(h − ε) and k(h + δ) to integers; together these contribute at most 3(t + 1) bits to total length, a quantity which is at most 6n/K, since t ≤ n/k, and n ≥ k ≥ K. Thus, with

\[
\alpha_K = K(1 + \log|A|)2^{-K\epsilon} + \frac{6}{K} \to 0, \quad \text{as } K \to \infty,
\]

the encoding of φ_k(B) and the headers contributes at most nα_K to total length. This completes the proof of Lemma II.3.3, thereby establishing part (b) of the empirical-entropy theorem.

Remark II.3.4
The original proof, [52], of part (b) of the empirical-entropy theorem used a counting argument to show that the cardinality of the set B(n) must eventually be smaller than 2^{nh} by a fixed exponential factor. The argument parallels the one used to prove the entropy theorem and depends on a count of the number of ways to select the bad set B. The coding proof given here is simpler and fits in nicely with the general idea that coding cannot beat entropy in the limit.

II.3.c Entropy estimation.


A problem of interest is the entropy-estimation problem. Given a sample path x_1, x_2, ..., x_n from an unknown ergodic process μ, the goal is to estimate the entropy h of μ. A simple procedure is to determine the empirical distribution of nonoverlapping k-blocks, and then take H(q_k)/k as an estimate of h. If k is fixed and n → ∞, then H(q_k)/k will converge almost surely to H(k)/k, which, in turn, converges to h as k → ∞, at least if it is assumed that μ is totally ergodic. Thus, at least for a totally ergodic μ, there is some choice of k = k(n) as a function of n for which the estimate will converge almost surely to h.

At first glance, the choice of k(n) would appear to be very dependent on the measure μ, because, for example, there is no universal choice of k = k(n) with k(n) → ∞ for which the empirical distribution q_{k(n)}(·|x_1^n) is close to the true distribution μ_{k(n)} for every mixing process μ. The empirical-entropy theorem does, however, imply a universal choice for the entropy-estimation problem; for example, the choice k(n) ~ log n works for any binary ergodic process. The general result may be stated as follows.


Theorem II.3.5 (The entropy-estimation theorem.) If μ is an ergodic measure of entropy h > 0, if k(n) → ∞ as n → ∞, and if k(n) ≤ (1/h) log n, then

(6)
\[
\lim_{n\to\infty} \frac{1}{k(n)} H\bigl(q_{k(n)}(\cdot \mid x_1^n)\bigr) = h, \quad \text{a.s.}
\]

In particular, if k(n) ~ log_{|A|} n then (6) holds for any ergodic measure μ with alphabet A, while if k(n) ~ log log n, then it holds for any finite-alphabet ergodic process.

Proof Fix ε > 0 and let {T_k(ε) ⊂ A^k: k ≥ 1} satisfy part (a) of the empirical-entropy theorem. Let U_k be the set of all a_1^k ∈ T_k(ε) for which q_k(a_1^k|x_1^n) ≤ 2^{−k(h+2ε)}, so that for all large enough k and all n ≥ 2^{kh},
(7)
\[
q_k(U_k \mid x_1^n) \leq |\mathcal{T}_k(\epsilon)|\, 2^{-k(h+2\epsilon)} \leq 2^{k(h+\epsilon)} 2^{-k(h+2\epsilon)} = 2^{-k\epsilon}.
\]

Next let V_k be the set of all a_1^k for which q_k(a_1^k|x_1^n) ≥ 2^{−k(h−2ε)}, and note that |V_k| ≤ 2^{k(h−2ε)}. Part (b) of the empirical-entropy theorem implies that for almost every x there is a K(x) such that

\[
q_k(V_k \mid x_1^n) < \epsilon, \quad \text{for } k \geq K(x),\ n \geq 2^{kh}.
\]

This bound, combined with the bound (7) and part (a) of the empirical-entropy theorem, implies that for almost every x, there is a K_1(x) such that

\[
q_k(G_k \mid x_1^n) \geq 1 - 3\epsilon, \quad \text{for } k \geq K_1(x),\ n \geq 2^{kh},
\]

where G_k = T_k(ε) − U_k − V_k. In summary, the set of a_1^k for which

\[
2^{-k(h+2\epsilon)} \leq q_k(a_1^k \mid x_1^n) \leq 2^{-k(h-2\epsilon)}
\]

has q_k(·|x_1^n) measure at least 1 − 3ε. The same argument that was used to show that entropy-rate is the same as decay-rate entropy, Theorem I.6.9, can now be applied to complete the proof of Theorem II.3.5.

Remark II.3.6 The entropy-estimation theorem is also true with the overlapping-block distribution in place of the nonoverlapping-block distribution, see Exercise 2.
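As an illustration of Theorem II.3.5, the following Python sketch computes the plug-in estimate H(q_{k(n)}(·|x_1^n))/k(n). It is not from the text: the function name is illustrative, and the default choice k(n) ≈ (1/2) log_2 n is just one convenient choice for binary data (it satisfies k(n) ≤ (1/h) log n since h ≤ 1).

```python
import math
import random
from collections import Counter

def entropy_estimate(x, k=None):
    """Plug-in estimate H(q_k(.|x_1^n))/k of the entropy h, where q_k is the
    nonoverlapping empirical k-block distribution.  Default block length:
    k(n) ~ (1/2) log2 n, which satisfies k(n) <= (1/h) log n for binary data."""
    n = len(x)
    if k is None:
        k = max(1, int(math.log2(n) / 2))
    m = n // k
    counts = Counter(x[i * k:(i + 1) * k] for i in range(m))
    H = -sum((c / m) * math.log2(c / m) for c in counts.values())
    return H / k

# Fair-coin data: the estimate should be close to h = 1 bit per symbol.
random.seed(0)
sample = "".join(random.choice("01") for _ in range(1 << 16))
print(entropy_estimate(sample))
```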


II.3.d Universal coding.

A universal code for the class of ergodic processes is a faithful code sequence {C_n} such that for any ergodic process μ,

\[
\lim_{n\to\infty} \frac{L(x_1^n)}{n} = h, \quad \text{almost surely},
\]

where h is the entropy of μ. Such a code sequence was constructed in Section II.1.a, see the proof of Theorem II.1.1, and it was also shown that the Lempel-Ziv algorithm gives a universal code. The empirical-entropy theorem provides another way to construct a universal code. The steps of the code are as follows.

Step 1. Partition x_1^n into blocks of length k = k(n) ~ (1/2) log_{|A|} n.

Step 2. Transmit a list (the code book) of these k-blocks in order of decreasing frequency of occurrence in x_1^n.

Step 3. Encode successive k-blocks in x_1^n by giving the index of the block in the code book.

Step 4. Encode the final block, if k does not divide n, with a per-symbol code.
The number of bits needed to transmit the code book is short, relative to n, since k ~ (1/2) log_{|A|} n. High-frequency blocks appear near the front of the code book and therefore have small indices, so that the empirical-entropy theorem guarantees good performance, since the number of bits needed to transmit an index is of the order of magnitude of the logarithm of the index.

To make this sketch into a rigorous construction, fix {k(n)} for which k(n) ~ (1/2) log n. (For simplicity it is assumed that the alphabet is binary.) The code word C_n(x_1^n) is a concatenation of two binary sequences, a header of fixed length m = k2^k, where k = k(n), followed by a sequence whose length depends on x. The header b_1^m is a concatenation of all the members of {0, 1}^k, subject only to the rule that a_1^k precedes ã_1^k whenever
\[
q_k(a_1^k \mid x_1^n) > q_k(\tilde{a}_1^k \mid x_1^n).
\]

The sequence {v(j) = b_{jk+1}^{(j+1)k}: j = 0, 1, ..., 2^k − 1} is called the code book. Suppose n = tk + r, r ∈ [0, k), and x_1^n = w(1)w(2)···w(t)v, where each w(i) has length k. For 1 ≤ i ≤ t and 0 ≤ j < 2^k, define the address function by the rule

\[
A(w(i)) = j, \quad \text{if } w(i) = v(j),
\]

that is, w(i) is the j-th word in the code book. Let E be a prefix code on the natural numbers such that the length of the word E(j) is log j + o(log j), that is, an Elias code. Define b_{m+1}^{m+L} to be the concatenation of the E(A(w(i))) in order of increasing i, where L is the sum of the lengths of the E(A(w(i))). Finally, define b_{m+L+1}^{m+L+r} to be x_{tk+1}^n. The code C_n(x_1^n) is defined as the concatenation

\[
C_n(x_1^n) = b_1^m\, b_{m+1}^{m+L}\, b_{m+L+1}^{m+L+r}.
\]


The code length L(x_1^n) = m + L + r depends on x_1^n, since L depends on the distribution q_k(·|x_1^n), but the use of the Elias code insures that C_n is a prefix code.
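The following Python sketch, not from the text, mimics the body of this code: the blocks that occur in x_1^n are listed by decreasing frequency (blocks of frequency zero sit at the end of the full code book and never need an address, so their omission does not change the body length), and each block is replaced by an Elias-style code of its index. The Elias delta code is used only as a concrete prefix code with length log j + o(log j); all names are illustrative.

```python
import math
import random
from collections import Counter

def elias_delta(j):
    """Elias delta code for an integer j >= 1; its length is
    log2(j) + O(log log j), i.e. log j + o(log j)."""
    b = bin(j)[2:]                     # binary expansion of j, leading 1
    nb = bin(len(b))[2:]               # binary expansion of its bit length
    return "0" * (len(nb) - 1) + nb + b[1:]

def codebook_body_length(x, k):
    """Bits in the body of the code-book code: list occurring k-blocks by
    decreasing nonoverlapping frequency, then replace each block by the Elias
    code of its 1-based index.  The fixed-length header and the final partial
    block, which contribute o(n) bits, are omitted."""
    m = len(x) // k
    blocks = [x[i * k:(i + 1) * k] for i in range(m)]
    order = {b: i + 1 for i, (b, _) in enumerate(Counter(blocks).most_common())}
    return sum(len(elias_delta(order[b])) for b in blocks)

# Per-symbol rate on fair-coin data; it approaches h = 1 only slowly, as n
# (and hence k ~ (1/2) log2 n) grows.
random.seed(0)
sample = "".join(random.choice("01") for _ in range(1 << 15))
k = max(1, int(math.log2(len(sample)) / 2))
print(codebook_body_length(sample, k) / len(sample))
```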


The empirical-entropy theorem will be used to show that for any ergodic μ,

(8)
\[
\lim_{n\to\infty} \frac{L(x_1^n)}{n} = h, \quad \text{a.s.},
\]

where h is the entropy of μ. To establish this, let {T_k(ε) ⊂ A^k} be the sequence given by the empirical-entropy theorem. For each n, let G_k denote the first 2^{k(h+ε)} members of the code book, where k = k(n). Note that |G_k| ≥ |T_k(ε)| and, furthermore,

\[
q_k(G_k \mid x_1^n) \geq q_k(\mathcal{T}_k(\epsilon) \mid x_1^n),
\]

since G_k is a set of k-sequences of largest q_k-probability whose cardinality is at most 2^{k(h+ε)}. The empirical-entropy theorem provides, for almost every x, an integer K(x) such that q_k(G_k|x_1^n) ≥ q_k(T_k(ε)|x_1^n) ≥ 1 − ε for k ≥ K(x). Thus, for k ≥ K(x), at least a (1 − ε)-fraction of the binary addresses E(A(w(i))) will refer to members of G_k and thus have lengths bounded above by k(h + ε) + o(log n). For those w(i) ∉ G_k, the crude bound k + o(log n) will do, and hence

\[
L(x_1^n) \leq (1 - \epsilon)n(h + \epsilon) + \epsilon n + o(n),
\]

for k ≥ K(x). This proves that

\[
\limsup_{n\to\infty} \frac{L(x_1^n)}{n} \leq h, \quad \text{a.s.}
\]

The reverse inequality follows immediately from the fact that too-good codes do not exist, Theorem II.1.2, but it is instructive to note that it also follows from the second part of the empirical-entropy theorem. In fact, let B_k be the first 2^{k(h−ε)} sequences in the code book for x. The empirical-entropy theorem guarantees that, eventually almost surely, there are at most εt indices i for which w(i) ∈ B_k. These are the only k-blocks that have addresses shorter than k(h − ε) + o(log n), so that

\[
\liminf_{n\to\infty} \frac{L(x_1^n)}{n} \geq (h - \epsilon)(1 - \epsilon), \quad \text{a.s.}
\]

Remark II.3.7 The universal code construction is drawn from [49], which also includes universal coding results for coding in which some distortion is allowed. A second, and in some ways even simpler, algorithm which does not require that the code book be listed in any specific order was obtained in [43], and is described in Exercise 4. Another application of the empirical-entropy theorem will be given in the next chapter, in the context of the problem of estimating the measure μ from observation of a finite sample path, see Section III.3.

II.3.e Exercises
1. Show that the empirical-entropy theorem is true with the overlapping block distribution,
\[
p_k(a_1^k \mid x_1^n) = \frac{|\{i \in [1, n-k+1]: x_i^{i+k-1} = a_1^k\}|}{n-k+1},
\]
in place of the nonoverlapping block distribution q_k(·|x_1^n). (Hint: reduce to the nonoverlapping case for some small shift of the sequence.)


2. Show that the entropy-estimation theorem is true with the overlapping block distribution in place of the nonoverlapping block distribution.

3. Show that the variational distance between q_k(·|x_1^n) and μ_k is asymptotically almost surely lower bounded by e^{−1} for the case when n = k2^k and μ is unbiased coin-tossing. (Hint: if M balls are thrown at random into M boxes, then the expected fraction of empty boxes is asymptotic to e^{−1}.)

4. Another simple universal coding procedure is suggested by the empirical-entropy theorem. For each k ≤ n construct a code C_{k,n} as follows. Express x_1^n as the concatenation w(1)w(2)···w(q)v of k-blocks {w(i)}, plus a possible final block v of length less than k. Make a list in some order of the k-blocks that occur. Transmit k and the list, then code successive w(i) by using a fixed-length code; such a code requires at most k(1 + log|A|) bits per word. Append some coding of the final block v. Call this C_{k,n}(x_1^n) and let L_k(x_1^n) denote the length of C_{k,n}(x_1^n). The final code C_n(x_1^n) transmits the shortest of the codes {C_{k,n}(x_1^n)}, along with a header to specify the value of k. In other words, let k_min be the first value of k ∈ [1, n] at which L_k(x_1^n) achieves its minimum. The code C_n(x_1^n) is the concatenation of E(k_min) and C_{k_min,n}(x_1^n). Show that {C_n} is a universal prefix-code sequence.

Section II.4 Partitions of sample paths.


An interesting connection between entropy and partitions of sample paths into variable-length blocks was established by Ornstein and Weiss, [53]. They show that, eventually almost surely, for any partition into distinct words, most of the sample path is covered by words that are not much shorter than (log n)/h, and for any partition into words that have been seen in the past, most of the sample path is covered by words that are not much longer than (log n)/h. Their results were motivated by an attempt to better understand the Lempel-Ziv algorithm, which partitions into distinct words, except possibly for the final word, such that all but the last symbol of each word has been seen before.

As in earlier discussions, a word w is a finite sequence of symbols drawn from the alphabet A and the length of w is denoted by ℓ(w). A sequence x_1^n is said to be parsed (or partitioned) into the (ordered) set of words {w(1), w(2), ..., w(t)} if it is the concatenation

(1) x_1^n = w(1)w(2)···w(t).

If w(i) ≠ w(j), for i ≠ j, then (1) is called a parsing (or partition) into distinct words. For example,

000110110100 = [000][110][1101][00]

is a partition into distinct words, while

000110110100 = [00][0110][1101][00]

is not. Partitions into distinct words have the asymptotic property that most of the sample path must be contained in those words that are not too short relative to entropy.
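A quick mechanical check of the distinct-words condition, using the two parsings above; this is an illustrative sketch, not part of the text.

```python
def is_distinct_parsing(x, words):
    """Check that `words` is a parsing of x into distinct words: the words
    concatenate to x and no word occurs twice."""
    return "".join(words) == x and len(set(words)) == len(words)

x = "000110110100"
print(is_distinct_parsing(x, ["000", "110", "1101", "00"]))   # True
print(is_distinct_parsing(x, ["00", "0110", "1101", "00"]))   # False: "00" repeats
```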


Theorem II.4.1 (The distinct-words theorem.) Let μ be ergodic with entropy h > 0, and let ε > 0 be given. For almost every x ∈ A^∞ there is an N = N(ε, x) such that if n ≥ N and x_1^n = w(1)w(2)···w(t) is a partition into distinct words, then

\[
\sum_{\ell(w(i)) \leq (h+\epsilon)^{-1}\log n} \ell(w(i)) \leq \epsilon n.
\]

The (simple) Lempel-Ziv algorithm discussed in Section II.2 parses a finite sequence into distinct words, except, possibly, for the final word, w(C + 1). If this final word is not empty, it is the same as a prior word, and hence ℓ(w(C + 1)) ≤ n/2. Thus the distinct-words theorem implies that, eventually almost surely, most of the sample path is covered by blocks that are at least as long as (h + ε)^{−1} log(n/2). But the part covered by shorter words contains only a few words, by the crude bound, Lemma II.2.4, so that

(2)
\[
\limsup_{n\to\infty} \frac{C(x_1^n)\log n}{n} \leq h, \quad \text{a.s.}
\]

To discuss partitions into words that have been seen in the past, some way to get started is needed. For this purpose, let F be a fixed finite collection of words, called the start set. A partition x_1^n = w(1)w(2)···w(t) is called a partition (parsing) into repeated words, if each w(i) is either in F or has been seen in the past, meaning that there is an index j ≤ ℓ(w(1)) + ··· + ℓ(w(i − 1)) such that w(i) = x_j^{j+ℓ(w(i))−1}. Partitions into repeated words have the asymptotic property that most of the sample path must be contained in words that are not too long relative to entropy.

Theorem II.4.2 (The repeated-words theorem.) Let μ be ergodic with entropy h > 0, and let 0 < ε < h and a start set F be given. For almost every x ∈ A^∞ there is an N = N(ε, x) such that if n ≥ N, and x_1^n = w(1)w(2)···w(t) is a partition into repeated words, then

\[
\sum_{\ell(w(i)) \geq (h-\epsilon)^{-1}\log n} \ell(w(i)) \leq \epsilon n.
\]

It is not necessary to require exact repetition in the repeated-words theorem. For example, it is sufficient if all but the last symbol appears earlier, see Exercise 1. With this modification, the repeated-words theorem applies to the versions of the Lempel-Ziv algorithm discussed in Section II.2, so that eventually almost surely most of the sample path must be covered by words that are no longer than (h − ε)^{−1} log n, which, in turn, implies that

(3)
\[
\liminf_{n\to\infty} \frac{C(x_1^n)\log n}{n} \geq h, \quad \text{a.s.}
\]

Thus the distinct-words theorem and the modified repeated-words theorem together provide an alternate proof of the LZ convergence theorem, Theorem II.2.1. Of course, as noted earlier, the lower bound (3) is a consequence of the fact that entropy is an upper bound, (2), and the fact that too-good codes do not exist, Theorem II.1.2.
The proofs of both theorems make use of the strong-packing lemma, Lemma 11.3.2, this time for parsings into words of variable length.


II.4.a Proof of the distinct-words theorem.


The distinct-words theorem is a consequence of the strong-packing lemma, together with some simple facts about partitions into distinct words. The first observation is that the words must grow in length, because there are at most |A|^k words of length k. The proof of this simple fact is left to the reader.

Lemma II.4.3 Given K and δ > 0, there is an N such that if n ≥ N and x_1^n = w(1)w(2)···w(t) is a partition into distinct words, then

\[
\sum_{\ell(w(i)) < K} \ell(w(i)) \leq \delta n.
\]

The second observation is that if the distinct words mostly come from sets that grow in size at a fixed exponential rate α, then words that are too short relative to (log n)/α cannot cover too much.

Lemma II.4.4 For each k, suppose G_k ⊂ A^k satisfies |G_k| ≤ 2^{αk}, where α is a fixed positive number, and let G = ∪_k G_k. Given ε > 0 there is an N such that if n ≥ N and x_1^n = w(1)w(2)···w(t) is a partition into distinct words such that

\[
\sum_{w(i) \notin G} \ell(w(i)) \leq \epsilon n,
\]

then

\[
\sum_{\ell(w(i)) \leq (1-\epsilon)(\log n)/\alpha} \ell(w(i)) \leq 2\epsilon n.
\]

Proof First consider the case when all the words are required to come from G. The fact that the words are distinct then implies that

\[
\sum_{\ell(w(i)) \leq (1-\epsilon)(\log n)/\alpha} \ell(w(i)) \;\leq\; \sum_{k \leq (1-\epsilon)(\log n)/\alpha} k\,|G_k|.
\]

Since |G_k| ≤ 2^{αk}, the sum on the right is upper bounded by

\[
\frac{(1-\epsilon)\log n}{\alpha}\,\exp_2\bigl((1-\epsilon)\log n + \alpha\bigr) = O\bigl(n^{1-\epsilon}\log n\bigr) = o(n),
\]

by an application of the standard bound

\[
\sum_{\ell=1}^{t} \ell\, m^{\ell} \leq t\, m^{t+1}, \quad t \geq 1.
\]

The general case follows since it is enough to estimate the sum over the too-short words that belong to G. This proves the lemma.

To continue with the proof of the distinct-words theorem, let δ be a positive number to be specified later. The strong-packing lemma, Lemma II.3.2, yields an integer K and {T_k ⊂ A^k: k ≥ 1} such that |T_k| ≤ 2^{k(h+δ)}, k ≥ K,


and such that eventually almost surely x_1^n is (K, δ)-strongly-packed by {T_k}, i.e.,

\[
\sum_{w(i) \in \cup_{k\geq K} \mathcal{T}_k} \ell(w(i)) \geq (1 - \delta)n,
\]

for any parsing P of x_1^n for which

(4)
\[
\sum_{\ell(w(i)) < K} \ell(w(i)) \leq \frac{\delta n}{2}.
\]

Suppose x_1^n is (K, δ)-strongly-packed by {T_k}. Suppose also that n is so large that, by Lemma II.4.3, property (4) holds for any parsing of x_1^n into distinct words. Thus, given a parsing x_1^n = w(1)w(2)···w(t) into distinct words, strong-packing implies that

\[
\sum_{w(i) \in \cup_{k\geq K} \mathcal{T}_k} \ell(w(i)) \geq (1 - \delta)n,
\]

and therefore Lemma II.4.4 can be applied with α = h + δ to yield

\[
\sum_{\ell(w(i)) \leq (1-\delta)(\log n)/(h+\delta)} \ell(w(i)) \leq 2\delta n,
\]

provided only that n is large enough. The distinct-words theorem follows, since δ could have been chosen in advance to be so small that 2δ ≤ ε and

\[
\frac{\log n}{h+\epsilon} \leq \frac{(1-\delta)\log n}{h+\delta}.
\]

II.4.b Proof of the repeated-words theorem.

It will be shown that if the repeated-words theorem is false, then a prefix-code sequence can be constructed which beats entropy by a fixed amount infinitely often on a set of positive measure. The idea is to code each too-long word of a repeated-word parsing by telling where it occurs in the past and how long it is (this is, of course, the basic idea of the version of the Lempel-Ziv code suggested in Remark II.2.8). If a good code is used on the complement of the too-long words and if the too-long words cover too much, then overall code length will be shorter than entropy allows. As in the proof of the empirical-entropy theorem, the existence of good codes is guaranteed by the strong-packing lemma, this time applied to variable-length parsing.

Throughout this discussion μ will be a fixed ergodic process with positive entropy h, and ε < h will be a given positive number. A block of consecutive symbols in x_1^n of length more than (h − ε)^{−1} log n will be said to be too long. The first idea is to merge the words that are between the too-long words. Suppose x_1^n = w(1)w(2)···w(t) is some given parsing, for which s of the w(i) are too long. Label these too-long words in increasing order of appearance as V(1), V(2), ..., V(s).

Let u_1 be the concatenation of all the words that precede V(1); for 1 < i ≤ s, let u_i be the concatenation of all the words that come between V(i − 1) and V(i); and let u_{s+1} be the concatenation of all the words that follow V(s). In this way, x_1^n is expressed as the concatenation

\[
x_1^n = u_1 V(1) u_2 V(2) \cdots u_s V(s) u_{s+1}.
\]

Such a representation is called a too-long representation of x_1^n, with the too-long words {V(1), ..., V(s)} and fillers {u_i}. Let B(n) be the set of all sequences x_1^n for which there is an s and a too-long representation x_1^n = u_1 V(1) u_2 V(2) ··· u_s V(s) u_{s+1} with the following two properties.

(i) Σ_j ℓ(V(j)) ≥ εn.

(ii) Each V(j) has been seen in the past.

To say V(j) = x_{n_j+1}^{m_j} has been seen in the past is to say that there is an index i ∈ [0, n_j) such that V(j) = x_{i+1}^{i+m_j-n_j}. Since the start set F is a fixed finite set, it can be supposed that when n is large enough no too-long word belongs to F, and therefore, to prove the repeated-words theorem, it is enough to prove that x_1^n ∉ B(n), eventually almost surely.

The idea is to code sequences in B(n) by telling where the too-long words occurred earlier, but to make such a code compress too much a good way to compress the fillers is needed. The strong-packing lemma provides the good codes to be used on the fillers. Let δ be a positive number to be specified later. An application of the strong-packing lemma, Lemma II.3.2, provides an integer K and, for each k ≥ K, a set T_k ⊂ A^k of cardinality at most 2^{k(h+δ)}, such that eventually almost surely x_1^n is (K, δ)-strongly-packed by {T_k}. Let G(n) be the set of all x_1^n that are (K, δ)-strongly-packed by {T_k}. Since x_1^n ∈ G(n) eventually almost surely, to complete the proof of the repeated-words theorem it is enough to prove that if K is large enough, then x_1^n ∉ B(n) ∩ G(n), eventually almost surely.

A code C_n is constructed as follows. Sequences not in B(n) ∩ G(n) are coded using some fixed single-letter code on each letter separately. If x_1^n ∈ B(n) ∩ G(n), a too-long representation

\[
x_1^n = u_1 V(1) u_2 V(2) \cdots u_s V(s) u_{s+1}
\]

is determined for which each too-long V(j) is seen somewhere in its past and the total length of the {V(j)} is at least εn. The words in the too-long representation are coded sequentially using the following rules.

(a) Each filler u_i ∈ ∪_{k≥K} T_k is coded by specifying its length and giving its index in the set T_k to which it belongs.

(b) Each filler u_i ∉ ∪_{k≥K} T_k is coded by specifying its length and applying a fixed single-letter code to each letter separately.

(c) Each too-long V(j) is coded by specifying its length and the start position of its earlier occurrence.


An Elias code is used to encode the block lengths, that is, a prefix code E on the natural numbers such that the length of the code word assigned to j is log j + o(log j). Two-bit headers are appended to each block to specify which of the three types of code is being used. With a one-bit header to tell whether or not x_1^n belongs to B(n) ∩ G(n), the code becomes a prefix code.

For x_1^n ∈ B(n) ∩ G(n), the principal contribution to total code length L(x_1^n) = L(x_1^n|C_n) comes from telling where each V(j) occurs in the past, which requires ⌈log n⌉ bits for each of the s too-long words, and from specifying the index of each u_i ∈ ∪_{k≥K} T_k, which requires ℓ(u_i)⌈h + δ⌉ bits per word. This fact is stated in the form needed as the following lemma.

Lemma II.4.5 If log K ≤ δK, then

(5)
\[
L(x_1^n) = s\log n + \Bigl(\sum_{u_j \in \cup_{k\geq K}\mathcal{T}_k} \ell(u_j)\Bigr)(h+\delta) + \delta O(n),
\]

uniformly for x_1^n ∈ B(n) ∩ G(n).

The lemma is sufficient to show that x_1^n ∉ B(n) ∩ G(n), eventually almost surely. The key to this, as well as to the lemma itself, is that the number s of too-long words, while it depends on x_1^n, must satisfy the bound

(6)
\[
s \leq \frac{\sum_j \ell(V(j))}{\log n}\,(h - \epsilon),
\]

since, by definition, a word is too long if its length is at least (h − ε)^{−1} log n. By assumption, Σ_j ℓ(V(j)) ≥ εn, so that (5) and (6) yield the code-length bound

\[
L(x_1^n) \leq n\bigl(h + \delta - \epsilon(\epsilon + \delta)\bigr) + \delta O(n),
\]

for x_1^n ∈ B(n) ∩ G(n), and hence if δ is small enough then

\[
L(x_1^n) \leq n(h - \epsilon^2/2), \quad x_1^n \in B(n) \cap G(n),
\]

for all sufficiently large n, which, since it is not possible to beat entropy by a fixed amount infinitely often on a set of positive measure, shows that, indeed, x_1^n ∉ B(n) ∩ G(n), eventually almost surely, completing the proof of the repeated-words theorem.

Proof of Lemma II.4.5.

It is enough to show that the encoding of the fillers that do not belong to ∪T_k, as well as the encoding of the lengths of all the words, plus the two-bit headers needed to tell which code is being used and the extra bits that might be needed to round up log n and h + δ to integers, require a total of at most δO(n) bits.

For the fillers that do not belong to ∪T_k, first note that there are at most s + 1 fillers, so the bound

\[
s \leq \frac{\sum_i \ell(V(i))}{\log n}(h - \epsilon) \leq \frac{n(h-\epsilon)}{\log n}
\]

implies that

\[
\sum_{\ell(u_j) < K} \ell(u_j) \leq K\Bigl(1 + \frac{n(h-\epsilon)}{\log n}\Bigr),
\]


which is o(n) for fixed K. In particular, once n is large enough to make this sum less than δn/2, the assumption that x_1^n ∈ G(n), that is, that x_1^n is (K, δ)-strongly-packed by {T_k}, implies that the words longer than K that do not belong to ∪T_k cover at most a δ-fraction of x_1^n. Since it takes ℓ(u)⌈log|A|⌉ bits to encode a filler u ∉ ∪T_k, the total number of bits required to encode all such fillers is at most

\[
\Bigl(\sum_{u_j \notin \cup\mathcal{T}_k} \ell(u_j)\Bigr)\lceil\log|A|\rceil \leq \delta n \lceil\log|A|\rceil,
\]

which is δO(n). Finally, the encoding of the lengths contributes

\[
\sum_j \ell\bigl(E(\ell(u_j))\bigr) + \sum_i \ell\bigl(E(\ell(V(i)))\bigr)
\]

bits, the dominant terms of which can be expressed as

\[
\sum_{\ell(u_j) < K} \log \ell(u_j) + \sum_{\ell(u_j) \geq K} \log \ell(u_j) + \sum_i \log \ell(V(i)).
\]

The first sum is at most (log K)(s + 1), which is o(n) for K fixed. Since it can be assumed that the too-long words have length at least K, which can be assumed to be at least e, and since x^{−1} log x is decreasing for x ≥ e, the second and third sums together contribute at most

\[
\frac{\log K}{K}\, n
\]

to total length, which is δn, since it was assumed that log K ≤ δK. Finally, since there are at most 2s + 1 words, the total length contributed by the two-bit headers required to tell which type of code is being used, plus the possible extra bits needed to round up log n and h + δ to integers, is upper bounded by 3(2s + 1), which is o(n) since s ≤ n(h − ε)/(log n). This completes the proof of Lemma II.4.5.

Remark II.4.6
The original proof, [53], of the repeated-words theorem parallels the proof of the entropy theorem, using a counting argument to show that the set B(n) ∩ G(n) must eventually be smaller than 2^{nh} by a fixed exponential factor. The key bound is that the number of ways to select s disjoint subintervals of [1, n] of lengths ℓ_i ≥ (log n)/(h − ε) is upper bounded by exp((h − ε)Σ_i ℓ_i + o(n)). This is equivalent to showing that the coding of the too-long repeated blocks, by telling where they occurred earlier and their lengths, requires at most (h − ε)Σ_i ℓ_i + o(n) bits. The coding proof is given here as it fits in nicely with the general idea that coding cannot beat entropy in the limit.

II.4.c Exercises

1. Extend the repeated-words theorem to the case when it is only required that all but the last k symbols appeared earlier.

2. Extend the distinct-words and repeated-words theorems to the case where agreement in all but a fixed number of places is required.


3. Carry out the details of the proof that (2) follows from the distinct-words theorem.

4. Carry out the details of the proof that (3) follows from the repeated-words theorem.

5. What can be said about distinct and repeated words in the entropy 0 case?

6. Let μ be the Kolmogorov measure of the binary, equiprobable, i.i.d. process (i.e., unbiased coin-tossing).

   (a) Show that eventually almost surely, no parsing of x_1^n into repeated words contains a word longer than 4 log n. (Hint: use Barron's lemma, Lemma II.1.3.)

   (b) Show that eventually almost surely, there are fewer than n^{1−ε/2} words longer than (1 + ε) log n in a parsing of x_1^n into repeated words.

   (c) Show that eventually almost surely, the words longer than (1 + ε) log n in a parsing of x_1^n into repeated words have total length at most n^{1−ε/2} log n.

Section II.5 Entropy and recurrence times.

An interesting connection between entropy and recurrence times for ergodic processes was discovered by Wyner and Ziv, [86]; see also the earlier work of Willems, [85]. Wyner and Ziv showed that the logarithm of the waiting time until the first n terms of a sequence x occur again in x is asymptotic to nh, in probability. This is a sharpening of the fact that the average recurrence time is 1/μ(x_1^n), whose logarithm is asymptotic to nh, almost surely. An almost-sure form of the Wyner-Ziv result was established by Ornstein and Weiss, [53]. These results, along with an application to a prefix-tree problem, will be discussed in this section.

The definition of the recurrence-time function is

\[
R_n(x) = \min\{m \geq 1: x_{m+1}^{m+n} = x_1^n\}.
\]

The theorem to be proved is

Theorem II.5.1 (The recurrence-time theorem.) For any ergodic process μ with entropy h,

\[
\lim_{n\to\infty} \frac{1}{n}\log R_n(x) = h, \quad \text{almost surely}.
\]

Some preliminary results are easy to establish. Define the upper and lower limits,

\[
\bar{r}(x) = \limsup_{n\to\infty} \frac{1}{n}\log R_n(x), \qquad r(x) = \liminf_{n\to\infty} \frac{1}{n}\log R_n(x).
\]

Since R_{n−1}(Tx) ≤ R_n(x), both the upper and lower limits are subinvariant, that is,

\[
\bar{r}(Tx) \leq \bar{r}(x), \quad\text{and}\quad r(Tx) \leq r(x),
\]


hence, both are almost surely invariant. Since the measure is assumed to be ergodic, both upper and lower limits are almost surely constant, and hence, there are constants r̄ and r such that r̄(x) = r̄ and r(x) = r, almost surely. Furthermore, r ≤ r̄, so the desired result can be established by showing that r ≥ h and r̄ ≤ h.

The bound r̄ ≤ h will be proved via a nice argument due to Wyner and Ziv. The bound r ≥ h was obtained by Ornstein and Weiss by an explicit counting argument similar to the one used to prove the entropy theorem. In the proof given below, a more direct coding construction is used to show that if recurrence happens too soon infinitely often, then there is a prefix-code sequence which beats entropy in the limit, which is impossible, by Theorem II.1.2. Another proof, based on a different coding idea communicated to the author by Wyner, is outlined in Exercise 1.

To establish r̄ ≤ h, fix ε > 0 and define

\[
D_n = \{x: R_n(x) \geq 2^{n(h+\epsilon)}\}, \quad n \geq 1.
\]

The goal is to show that x ∉ D_n, eventually almost surely. It is enough to prove this under the additional assumption that x_1^n is "entropy typical." Indeed, if

\[
\mathcal{T}_n = \{x: \mu(x_1^n) \geq 2^{-n(h+\epsilon/2)}\},
\]

then x ∈ T_n eventually almost surely, so it is enough to prove

(1) x ∉ D_n ∩ T_n, eventually almost surely.

To establish this, fix a_1^n and consider only those x ∈ D_n for which x_1^n = a_1^n, that is, the set D_n(a_1^n) = D_n ∩ [a_1^n]. Note, however, that if x ∈ D_n(a_1^n), then T^j x ∉ [a_1^n], for j ∈ [1, 2^{n(h+ε)} − 1], by the definition of D_n(a_1^n), and hence the sets

\[
D_n(a_1^n),\ T^{-1}D_n(a_1^n),\ \ldots,\ T^{-2^{n(h+\epsilon)}+1}D_n(a_1^n)
\]

must be disjoint. Since these sets all have the same measure, disjointness implies that

(2)
\[
\mu(D_n(a_1^n)) \leq 2^{-n(h+\epsilon)}.
\]

It is now a simple matter to get from (2) to the desired result, (1). Indeed, the cardinality of the projection onto A^n of D_n ∩ T_n is upper bounded by the cardinality of the projection of T_n, which is upper bounded by 2^{n(h+ε/2)}, so that (2) yields

\[
\mu(D_n \cap \mathcal{T}_n) \leq 2^{n(h+\epsilon/2)}\, 2^{-n(h+\epsilon)} = 2^{-n\epsilon/2}.
\]

The Borel-Cantelli principle implies that x ∉ D_n ∩ T_n eventually almost surely, so the iterated almost sure principle yields the desired result (1).

The inequality r ≥ h will be proved by showing that if it is not true, then eventually almost surely, sample paths are mostly covered by long disjoint blocks which are repeated too quickly. A code that compresses more than entropy can then be constructed by specifying where to look for the first later occurrence of such blocks.


To develop the covering idea suppose r < h − ε, where ε > 0. A block x_s^t of consecutive symbols in a sequence x_1^n is said to recur too soon in x_1^n if x_s^t = x_{s+k}^{t+k} for some k in the interval

\[
[1, 2^{(h-\epsilon)(t-s+1)}),
\]

for which t + k ≤ n. The least such k is called the distance from x_s^t to its next occurrence in x_1^n.

Let m be a positive integer and δ be a positive number, both to be specified later. For n ≥ m, a concatenation

\[
x_1^n = u_1 V(1) u_2 V(2) \cdots u_J V(J) u_{J+1}
\]

is said to be a (δ, m)-too-soon-recurrent representation of x_1^n, if the following hold.

(a) Each V(j) has length at least m and recurs too soon in x_1^n.

(b) The sum of the lengths of the filler words u_i is at most 3δn.

Let G(n) be the set of all x_1^n that have a (δ, m)-too-soon-recurrent representation. It will be shown later that a consequence of the assumption that r < h − ε is that

(3) x_1^n ∈ G(n), eventually almost surely.

Here it will be shown how a prefix n-code C_n can be constructed which, for suitable choice of m and δ, compresses the members of G(n) too well for all large enough n. The code C_n uses a faithful single-letter code F: A → {0, 1}^d, where d ≤ 2 + log|A|, such that each code word starts with a 0. The code also uses an Elias code E, that is, a prefix code on the natural numbers such that ℓ(E(j)) = log j + o(log j). Sequences not in G(n) are coded by applying F to successive symbols. For x_1^n ∈ G(n) a (δ, m)-too-soon-recurrent representation x_1^n = u_1 V(1) u_2 V(2) ··· u_J V(J) u_{J+1} is determined and successive words are coded using the following rules.

(i) Each filler word u_i is coded by applying F to its successive symbols.

(ii) Each V(j) is coded using a prefix 1, followed by E(ℓ(V(j))), followed by E(k_j), where k_j is the distance from V(j) to its next occurrence in x_1^n.

The one-bit headers on F and in the encoding of each V(j) are used to specify which code is being used. A one-bit header to tell whether or not x_1^n ∈ G(n) is also used to guarantee that C_n is a prefix n-code. The key facts here are that log k_j ≤ ℓ(V(j))(h − ε), by the definition of "too-soon recurrent," and that the principal contribution to total code length is Σ_j log k_j ≤ n(h − ε), a result stated as the following lemma.

Lemma II.5.2

(4)
\[
L(x_1^n) \leq n(h - \epsilon) + n(3d\delta + \alpha_m), \quad x_1^n \in G(n),
\]

for n ≥ m, where α_m → 0 as m → ∞.


The lemma implies the desired result, for by choosing δ small enough and m large enough, the lemma yields L(x_1^n) ≤ n(h − ε/2) on G(n), for all large enough n. But too-good codes do not exist, so that μ(G(n)) must go to 0, contradicting the fact (3) that x_1^n ∈ G(n), eventually almost surely, a fact that follows from the assumption that r < h − ε. Thus r ≥ h, establishing the recurrence-time theorem.

Proof of Lemma II.5.2. The Elias encoding of the distance to the next occurrence of V(j) requires

(5)
\[
\ell(V(j))(h - \epsilon) + o(\ell(V(j)))
\]

bits, by the definition of "too-soon recurrent." Summing the first term over j yields the principal term n(h − ε) in (4). Since each V(j) is assumed to have length at least m, there are at most n/m such V(j), and hence the sum over j of the second term in (5) is upper bounded by nβ_m, where β_m → 0 as m → ∞.

The Elias encoding of the length ℓ(V(j)) requires log ℓ(V(j)) + o(log ℓ(V(j))) bits. Summing on j ≤ n/m, the second terms contribute at most nβ_m bits while the total contribution of the first terms is at most

\[
\sum_j \log \ell(V(j)) \leq \frac{\log m}{m}\, n,
\]

since (1/x) log x is decreasing for x ≥ e. The one-bit headers on the encoding of each V(j) contribute at most n/m bits to total code length. In summary, the complete encoding of the too-soon recurring words V(j) requires a total of at most n(h − ε) + nα_m bits, where

\[
\alpha_m = 2\beta_m + (1 + \log m)/m \to 0, \quad \text{as } m \to \infty.
\]

Finally, the encoding of each filler u_i requires dℓ(u_i) bits, and the total length of the fillers is at most 3δn, by property (b) of the definition of G(n). Thus the complete encoding of all the fillers requires at most 3dδn bits. This completes the proof of Lemma II.5.2.

It remains to be shown that x_1^n ∈ G(n), eventually almost surely, for fixed m and δ. This is a covering/packing argument. To carry it out, assume r < h − ε and, for each n ≥ 1, define

\[
B_n = \{x: R_n(x) \leq 2^{n(h-\epsilon)}\}.
\]

Since liminf_n (1/n) log R_n(x) = r, there is an M such that

\[
\mu(B) \geq 1 - \delta, \quad \text{for } B = \bigcup_{n=M}^{\infty} B_n.
\]

The ergodic theorem implies that (1/N)Σ_{i=1}^{N} 1_B(T^{i-1}x) ≥ 1 − δ, eventually almost surely, and hence the packing lemma implies that for almost every x, there is an N(x) ≥ M/δ such that if n ≥ N(x), there is an integer I = I(n, x) and a collection {[n_i, m_i]: i ≤ I} of disjoint subintervals of [1, n], each of length at least m and at most M, for which the following hold.

(a) For each i ≤ I,

\[
x_{n_i}^{m_i} = x_j^{j+m_i-n_i}, \quad \text{for some } j \in [n_i + 1,\ n_i + 2^{(m_i-n_i+1)(h-\epsilon)}].
\]

(b) Σ_i (m_i − n_i + 1) ≥ (1 − 2δ)n.

In addition, it can be supposed that n ≥ 2^{M(h−ε)}/δ, so that if J is the largest index i for which n_i + 2^{M(h−ε)} ≤ n, condition (b) can be replaced by

(c) Σ_{i≤J} (m_i − n_i + 1) ≥ (1 − 3δ)n.

The sequence x_1^n can then be expressed as a concatenation

(6)
\[
x_1^n = u_1 V(1) u_2 V(2) \cdots u_J V(J) u_{J+1},
\]

where each V(j) = x_{n_j}^{m_j} recurs too soon, the total length of the V(j) is at least (1 − 3δ)n, and where each u_i is a (possibly empty) word, that is, x_1^n ∈ G(n). This completes the proof that x_1^n ∈ G(n), eventually almost surely, for fixed m and δ, if it is assumed that r < h − ε, and thereby finishes the proof of the recurrence-time theorem.

Remark II.5.3
The recurrence-time theorem suggests a Monte Carlo method for estimating entropy. Let k ~ α log n, where α < 1, select indices i_1, ..., i_t independently at random in [1, n], and take (1/(kt)) Σ_j log R_k(T^{i_j} x) as an estimate of h. This works reasonably well for suitably nice processes, provided n is large, and can serve, for example, as a quick way to test the performance of a coding algorithm.
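A rough Python sketch of this Monte Carlo procedure follows; the function names, the number of trials, and the choice k ≈ (1/3) log_2 n are illustrative assumptions, not prescriptions from the text.

```python
import math
import random

def recurrence_time(x, i, k):
    """R_k(T^i x): the least m >= 1 with x[i+m : i+m+k] == x[i : i+k],
    searching only within the available sample (None if no recurrence)."""
    block = x[i:i + k]
    m = 1
    while i + m + k <= len(x):
        if x[i + m:i + m + k] == block:
            return m
        m += 1
    return None

def monte_carlo_entropy(x, k, trials=50, seed=0):
    """Monte Carlo estimate of h: average (1/k) log2 R_k(T^i x) over
    randomly chosen starting positions i."""
    rng = random.Random(seed)
    vals, attempts = [], 0
    while len(vals) < trials and attempts < 10 * trials:
        attempts += 1
        i = rng.randrange(0, len(x) - k)
        r = recurrence_time(x, i, k)
        if r is not None:            # skip blocks that never recur in the sample
            vals.append(math.log2(r) / k)
    return sum(vals) / len(vals) if vals else float("nan")

# Fair-coin data with k ~ (1/3) log2 n; the estimate is near h = 1, with a
# downward bias that shrinks as k grows.
random.seed(1)
sample = "".join(random.choice("01") for _ in range(1 << 15))
k = max(1, int(math.log2(len(sample)) / 3))
print(monte_carlo_entropy(sample, k))
```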

II.5.a Entropy and prefix trees.

Closely related to the recurrence concept is the prefix-tree concept. A sequence x ∈ A^∞ together with its first n − 1 shifts, Tx, T²x, ..., T^{n−1}x, defines a tree as follows. For each 0 ≤ i < n, let W(i) = W(i, n, x) be the shortest prefix of T^i x which is not a prefix of T^j x, for any j ≠ i, 0 ≤ j < n. Since, by definition, the set {W(i): i = 0, 1, ..., n − 1} is prefix-free, it defines a tree, T_n(x), called the n-th order prefix tree determined by x. For example,

x = 001010010010..., Tx = 01010010010..., T²x = 1010010010...,
T³x = 010010010..., T⁴x = 10010010...,


where underlining is used to indicate the respective prefixes,

W(0) = 00, W(1) = 0101, W(2) = 101, W(3) = 0100, W(4) = 100.

The prefix tree T_5(x) is shown in Figure I.7.9, page 72.
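The depths L_i^n(x) = ℓ(W(i)) can be computed directly from the definition by comparing shifts; the following brute-force Python sketch (illustrative only, not from the text) reproduces the example above.

```python
def prefix_depths(x, n):
    """The depths L_i^n(x) = length of W(i), the shortest prefix of T^i x
    that is not a prefix of any other T^j x, 0 <= j < n.  Assumes the given
    string is long enough for every W(i) to be determined."""
    depths = []
    for i in range(n):
        k = 1
        # grow the prefix until no other shift shares it
        while any(j != i and x[j:j + k] == x[i:i + k] for j in range(n)):
            k += 1
        depths.append(k)
    return depths

# The example from the text: x = 001010010010... with n = 5 shifts gives
# W(0) = 00, W(1) = 0101, W(2) = 101, W(3) = 0100, W(4) = 100.
print(prefix_depths("001010010010", 5))   # [2, 4, 3, 4, 3]
```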

Prefix trees are used to model computer storage algorithms, in which context they are usually called suffix trees. In practice, algorithms for encoding are often formulated in terms of tree structures, as tree searches are easy to program. In particular, as indicated in Remark II.5.7, below, the simplest form of Lempel-Ziv parsing grows a subtree of the prefix tree. This led Grassberger, [16], to propose that use of the full prefix tree might lead to more rapid compression, as it reflects more of the structure of the data. He suggested that if the depth of node W(i) is the length L_i^n(x) = ℓ(W(i)), then average depth should be asymptotic to (1/h) log n, at least in probability. While this is true in the i.i.d. case, and some other cases, see Theorem II.5.5, it is not true in the general ergodic case. What is true in the general case, however, is the following theorem.
Theorem II.5.4 (The prefix-tree theorem.) If μ is an ergodic process of entropy h > 0, then for every ε > 0 and almost every x, there is an integer N = N(x, ε) such that for n ≥ N, all but εn of the numbers L_i^n(x)/log n are within ε of 1/h.

Note that the prefix-tree theorem implies a trimmed-mean result, namely, for any ε > 0, the (1 − ε)-trimmed mean of the set {L_i^n(x)} is almost surely asymptotic to (log n)/h. The (1 − ε)-trimmed mean of a set of cardinality M is obtained by deleting the largest Mε/2 and smallest Mε/2 numbers in the set and taking the mean of what is left.

The {L_i^n} are bounded below, so average depth (1/n)Σ_i L_i^n(x) will be asymptotic to (log n)/h in those cases where there is a constant C such that max_i L_i^n(x) ≤ C log n, eventually almost surely. This is equivalent to the string-matching bound L(x_1^n) = O(log n), eventually almost surely, see Exercise 2, where, as earlier, L(x_1^n) is the length of the longest string that appears twice in x_1^n, since if L_i^n(x) = k then, by definition of L(x_1^n), there is a j ∈ [1, n], j ≠ i, such that x_j^{j+k-2} = x_i^{i+k-2}.

There is a class of processes, namely, the finite energy processes, for which a simple coding argument gives L(x_1^n) = O(log n), eventually almost surely. An ergodic process μ has finite energy if there are constants c < 1 and K such that

\[
\mu(x_{t+1}^{t+L} \mid x_1^t) \leq K c^L,
\]

for all t ≥ 1, for all L ≥ 1, and for all x_1^{t+L} of positive measure.
Theorem II.5.5 (The finite-energy theorem.) If μ is an ergodic finite energy process with entropy h > 0 then there is a constant C such that L(x_1^n) ≤ C log n, eventually almost surely, and hence

\[
\lim_{n\to\infty} \frac{1}{n\log n}\sum_{i=1}^{n} L_i^n(x) = \frac{1}{h}, \quad \text{almost surely}.
\]

The i.i.d. processes, mixing Markov processes, and functions of mixing Markov processes all have finite energy. Another type of finite energy process is obtained by adding i.i.d. noise to a given ergodic process, such as, for example, the binary process defined by X_n = Y_n + Z_n (mod 2), where {Y_n} is an arbitrary binary ergodic process and {Z_n} is binary i.i.d. and independent of {Y_n}. (Adding noise is often called "dithering," and is a frequently used technique in image compression.)

The mean length is certainly asymptotically almost surely no smaller than (log n)/h; in general, however, there is no way to keep the longest W(i) from being much larger than O(log n), which kills the possibility of a general limit result for the mean. One type of counterexample is stated as follows.

Theorem II.5.6
There is a stationary coding F of the unbiased coin-tossing process μ, and an increasing sequence {n_k}, such that

(7)
\[
\lim_{k\to\infty} \frac{\sum_{i=0}^{n_k-1} L_i^{n_k}(x)}{n_k \log n_k} = \infty, \quad \text{almost surely},
\]

with respect to the measure ν = μ ∘ F^{−1}.

The prefix-tree and finite-energy theorems will be proved and the counterexample construction will be described in the following subsections.

Remark II.5.7
The simplest version of Lempel-Ziv parsing, x = w(1)w(2)···, in which the next word is the shortest new word, grows a sequence of |A|-ary labeled trees, where the next word w(i), if it has the form w(i) = w(j)a, for some j < i and a ∈ A, is assigned to the node corresponding to a, whose predecessor node was labeled w(j). (See Exercise 3, Section II.2.) The limitations of computer memory mean that at some stage the tree can no longer be allowed to grow. From that point on, several choices are possible, each of which produces a different algorithm. In the simplest of these, mentioned in Remark II.2.10, the tree grows to a certain size then remains fixed. The remainder of the sequence is parsed into words, each of which is one symbol shorter than the shortest new word. Subsequent compression is obtained by giving the successive indices of these words in the fixed tree. Such finite LZ-algorithms do not compress to entropy in the limit, but do achieve something reasonably close to optimal compression for most finite sequences for nice classes of sources; see [87, 88] for some finite-version results. Performance can be improved in some cases by designing a finite tree that better reflects the actual structure of the data. Prefix trees may perform better than the LZW tree, for they tend to have more long words, which means more compression.

II.5.a.1 Proof of the prefix-tree theorem.

For ε > 0, define the sets

L(x, n) = {i ∈ [0, n-1]: L_i^n(x) < (1-ε)(log n)/h},

U(x, n) = {i ∈ [0, n-1]: L_i^n(x) > (1+ε)(log n)/h}.

It is enough to prove the following two results.

(8)   |L(x, n)| ≤ εn, eventually almost surely.

(9)   |U(x, n)| ≤ εn, eventually almost surely.

The fact that most of the prefixes cannot be too short, (8), can be viewed as an overlapping form of the distinct-words theorem, Theorem II.4.1, and is a consequence of the fact that, for fixed n, the W(i, n) are distinct. The fact that most of the prefixes cannot be too long, (9), is a consequence of the recurrence-time theorem, Theorem II.5.1, for the longest prefix is only one longer than a word that has appeared twice.
The details of the proof that most of the words cannot be too short will be given first. To control the location of the prefixes an "almost uniform eventual typicality" idea will be needed. For δ > 0 and each k, define

T_k(δ) = {x_1^k: μ(x_1^k) ≥ 2^{-k(h+δ)}},

so that |T_k(δ)| ≤ 2^{k(h+δ)}, and so that x_1^k ∈ T_k(δ), eventually almost surely, by the entropy theorem. Since x_1^k ∈ T_k(δ), eventually almost surely, there is an M such that the set

G(M) = {x: x_1^k ∈ T_k(δ), for all k ≥ M}

has measure greater than 1 - δ. The ergodic theorem implies that, for almost every x, the shift T^i x belongs to G(M) for all but at most a limiting δ-fraction of the indices i. This result is summarized in the following "almost uniform eventual typicality" form.

Lemma II.5.8 For almost every x there is an integer N(x) such that if n ≥ N(x) then, for all but at most 2δn indices i ∈ [0, n), the i-fold shift T^i x belongs to G(M).
To proceed with the proof that words cannot be too short, fix ε > 0. For a given sequence x and integer n, an index i < n will be called good if L_i^n(x) ≥ M and W(i) = W(i, n, x) ∈ ∪_{k≥M} T_k(δ). Since the prefixes W(i) are all distinct, there are at most |A|^M indices i < n such that L_i^n(x) < M. Thus, by making the N(x) of Lemma II.5.8 larger, if necessary, it can be supposed that if n ≥ N(x) then the set of non-good indices will have cardinality at most 3δn, which can be assumed to be less than εn/2. Suppose k ≥ M and consider the set of good indices i for which L_i^n(x) = k. There are at most 2^{k(h+δ)} such indices, because the corresponding W(i, n, x) are distinct members of the collection T_k(δ), which has cardinality at most 2^{k(h+δ)}. Hence, for any J ≥ M, there are at most (constant)·2^{J(h+δ)} good indices i for which M ≤ L_i^n(x) ≤ J. Thus if n ≥ N(x) is sufficiently large and δ is sufficiently small, there will be at most εn/2 good indices i for which L_i^n(x) < (1-ε)(log n)/h. This, combined with the bound on the number of non-good indices of the preceding paragraph, completes the proof that eventually almost surely most of the prefixes cannot be too short, (8).

Next the upper bound result, (9), will be established by using the fact that, except for the final letter, each W(i, n, x) appears at least twice. If a too-long block appears twice, then it has a too-short return-time. The recurrence-time theorem shows that this can only happen rarely. Both a forward and a backward recurrence-time theorem will be needed. The corresponding recurrence-time functions are respectively defined by
F_k(x) = min{m ≥ 1: x_{m+1}^{m+k} = x_1^k},
B_k(x) = min{m ≥ 1: x_{-m-k+1}^{-m} = x_{-k+1}^{0}}.
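On a finite sample the forward recurrence time is easy to estimate; the backward time is essentially the same computation applied to the reversed sequence. A minimal Python sketch (illustrative names, fair-coin data) that also shows (1/k) log F_k settling near the entropy h = 1:

    import math, random

    def forward_recurrence(x, k):
        # Least m >= 1 with x[m:m+k] == x[0:k]; None if no recurrence occurs
        # inside the finite sample.
        first = x[:k]
        for m in range(1, len(x) - k + 1):
            if x[m:m + k] == first:
                return m
        return None

    random.seed(2)
    x = [random.randint(0, 1) for _ in range(1 << 18)]   # fair coin, h = 1
    for k in (4, 8, 12, 16):
        m = forward_recurrence(x, k)
        if m is not None:
            print(k, round(math.log2(m) / k, 2))   # roughly h for large k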

The recurrence-time theorem directly yields the forward result,

lim_{k→∞} (1/k) log F_k(x) = h, a.s.,

and when applied to the reversed process it yields the backward result,

lim_{k→∞} (1/k) log B_k(x) = h, a.s.,

since the reversed process is ergodic and has the same entropy as the original process. (See Exercise 7, Section I.6, and Exercise 11, Section I.7.) These limit results imply that there is a K, and two sets F and B, each of measure at most ε/4, such that if k ≥ K, then log F_k(x) ≥ kh(1 + ε/4)^{-1}, x ∉ F, and log B_k(x) ≥ kh(1 + ε/4)^{-1}, x ∉ B. The ergodic theorem implies that for almost every x there is an integer N(x) such that if n ≥ N(x) then T^i x ∈ F for at most εn/3 indices i ∈ [0, n-1], and T^i x ∈ B for at most εn/3 indices i ∈ [0, n-1]. Fix such an x and assume n ≥ N(x). Let k be the least integer that exceeds (1 + ε/2)(log n)/h. By making n larger, if necessary, it can be supposed that both of the following hold.

(a) K ≤ (1 + ε/2)(log n)/h.

(b) k + 1 ≤ (1 + ε)(log n)/h.


Assume i ∈ U(x, n), that is, L_i^n(x) > (1 + ε)(log n)/h. Condition (b) implies that L_i^n(x) - 1 ≥ k, and hence, by the definition of W(i, n, x) as the shortest prefix of T^i x that is not a prefix of any other T^j x, for j < n, there is an index j ∈ [0, n) such that j ≠ i and x_j^{j+k-1} = x_i^{i+k-1}. The two cases i < j and i > j will be discussed separately.

Case 1. i < j. In this case, the k-block starting at i recurs in the future within the next n steps, that is, F_k(T^i x) ≤ n, so that

(1/k) log F_k(T^i x) ≤ (log n)/k < h(1 + ε/4)^{-1},

which means that T^i x ∈ F.

Case 2. j < i. If i + k ≤ n, this means that B_k(T^{i+k} x) ≤ n, which means that T^{i+k} x ∈ B.

In summary, if i ∈ U(x, n) then T^i x ∈ F, T^{i+k} x ∈ B, or i + k > n. Thus, if n also satisfies k ≤ εn/3, then |U(x, n)| ≤ εn. This completes the proof of the prefix-tree theorem.

II.5.a.2 Proof of the finite-energy theorem.

The idea of the proof is suggested by earlier code constructions, namely, code the second occurrence of the longest repeated block by telling its length and where it was seen earlier, and use an optimal code on the rest of the sequence. An upper bound on the length of the longest repeated block is obtained by comparing this code with an optimal code. Here "optimal code" means Shannon code. The key to making this method yield the desired bound is Barron's almost-sure code-length bound, Lemma 5, which asserts that no prefix code can asymptotically beat the Shannon code by more than 2 log n. A Shannon code for a measure μ is a prefix code such that ℓ(w) = ⌈log(1/μ(w))⌉, for each w, see Remark I.7.13. A version of Barron's bound sufficient for the current result asserts that for any prefix-code sequence {C_n},

(10)   ℓ(C_n(x_1^n)) + log μ(x_1^n) ≥ -2 log n, eventually almost surely.
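Several integer codes have the length behaviour ℓ(e(n)) = log n + o(log n) used below; one standard example is the Elias delta code, sketched here in Python. This is an illustration, not necessarily the exact code intended in the text.

    def elias_delta(n):
        # Prefix-free binary code for n >= 1 of length log2(n) + O(log log n).
        assert n >= 1
        b = bin(n)[2:]             # binary expansion of n (starts with '1')
        lb = bin(len(b))[2:]       # binary expansion of the number of bits of n
        return '0' * (len(lb) - 1) + lb + b[1:]

    for n in (1, 2, 17, 1000, 10**6):
        w = elias_delta(n)
        print(n, w, len(w))        # code length grows like log2(n) plus lower-order terms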


The details of the coding idea can be filled in as follows. Suppose a block of length L occurs at position s, and then again at position t + 1 > s in the sequence x_1^n. The block x_1^t is encoded by specifying its length t, then using a Shannon code with respect to the given measure μ. The block x_{t+1}^{t+L} is encoded by specifying s and L. The final block x_{t+L+1}^n is encoded using a Shannon code with respect to the conditional measure μ(· | x_1^{t+L}). The prefix property is guaranteed by encoding the lengths with the Elias code e on the natural numbers, that is, a prefix code for which ℓ(e(n)) = log n + o(log n). Since t, s, and L must be encoded, total code length satisfies

(11)   3 log n + log(1/μ(x_1^t)) + log(1/μ(x_{t+L+1}^n | x_1^{t+L})),

where the extra o(log n) terms required by the Elias codes and the extra bits needed to round up the logarithms to integers are ignored, as they are asymptotically negligible. To compare code length with the length of the Shannon code of x_1^n with respect to μ, the probability μ(x_1^n) is factored to produce

log μ(x_1^n) = log μ(x_1^t) + log μ(x_{t+1}^{t+L} | x_1^t) + log μ(x_{t+L+1}^n | x_1^{t+L}).

This is now added to the code length (11) to yield

3 log n + log μ(x_{t+1}^{t+L} | x_1^t).

Thus Barron's lemma gives

3 log n + log μ(x_{t+1}^{t+L} | x_1^t) ≥ -2 log n,

to which an application of the finite energy assumption, μ(x_{t+1}^{t+L} | x_1^t) ≤ K c^L, yields

limsup_{n→∞} L(x_1^n)/(log n) ≤ -5/log c, almost surely.

This completes the proof of the finite-energy theorem, since c < 1.

II.5.a.3 A counterexample to the mean conjecture.

Let μ denote the measure on two-sided binary sequences {0, 1}^Z given by unbiased coin-tossing, that is, the shift-invariant measure defined by requiring that μ(a_1^n) = 2^{-n}, for each n and each a_1^n ∈ A^n. It will be shown that given ε > 0, there is a measurable shift-invariant function F: {0, 1}^Z → {0, 1}^Z and an increasing sequence {n_k} of positive integers with the following properties.

(a) If y = F(x) then, with probability greater than 1 - 2^{-k}, there is an i in the range 0 ≤ i ≤ n_k - n_k^{3/4}, such that y_{i+j} = 0, 0 ≤ j < n_k^{3/4}.

(b) μ({x: x_0 ≠ (F(x))_0}) < ε.

The process ν = μ ∘ F^{-1} will then satisfy the conditions of Theorem II.5.6, for property (a) guarantees that eventually almost surely L_i^{n_k}(x) ≥ n_k^{3/4}/2 for at least n_k^{3/4}/2 indices i < n_k, so that Σ_{i<n_k} L_i^{n_k}(x) ≥ n_k^{5/4}, and hence (7) holds.


Property (b) is not really needed, but it is added to emphasize that such a ν can be produced by making an arbitrarily small limiting density of changes in the sample paths of the i.i.d. process μ; hence, in particular, the entropy of ν can be as close to log 2 as desired. The coding F can be constructed by only a minor modification of the coding used in the string-matching example, discussed in Section I.8.c, as an illustration of the block-to-stationary coding method. In that example, blocks of 0's of length 2 n_k^{1/2} were created. Replacing 2 n_k^{1/2} by n_k^{3/4} will produce the coding F needed here.

Remark II.5.9
The prefix tree theorem was proved in [72], which also contains the counterexample construction given here. The coding proof of the finite-energy theorem is new; after its discovery it was learned that the theorem was proved using a different method in [31]. The asymptotics of the length of the longest and shortest word in the Grassberger tree are discussed in [82] for processes satisfying memory decay conditions.

II.5.b Exercises.

1. Another proof, due to Wyner, that r ≥ h is based on the following coding idea. A function f: A^∞ → B* will be said to be conditionally invertible if x_1^n = y_1^n whenever f(x_{-∞}^n) = f(y_{-∞}^n) and x_{-∞}^0 = y_{-∞}^0.

(a) Show that liminf_n ℓ(f(x_{-∞}^n))/n ≥ h, almost surely, for any conditionally invertible f and any ergodic measure μ of entropy h. (Hint: let B_n = {x_{-∞}^n: μ(x_1^n | x_{-∞}^0) ≥ 2^{-n(H+ε)}} and C_n = {x_{-∞}^n: ℓ(f(x_{-∞}^n)) ≤ n(H - 2ε)}, where H is the entropy-rate of μ, and show that Σ_n μ(B_n ∩ C_n | x_{-∞}^0) is finite.)

(b) Deduce that the entropy of a process and its reversed process are the same.

(c) Define f(x_{-∞}^n) to be the minimum m ≥ n for which x_{-m+1}^{-m+n} = x_1^n and apply the preceding result.

(d) The preceding argument establishes the recurrence result for the reversed process. Show that this implies the result for μ.

2. Show that if L(x_1^n) = O(log n), eventually almost surely, then max_i L_i^n(x) = O(log n), eventually almost surely. (Hint: L(x_1^{2n}) ≤ K log n implies max_i L_i^n(x) ≤ K log n.)

3. Show that dithering an ergodic process produces a finite-energy process.

Chapter III Entropy for restricted classes.


Section III.1 Rates of convergence.

Many of the processes studied in classical probability theory, such as i.i.d. processes, Markov processes, and finite-state processes, have exponential rates of convergence for frequencies of all orders and for entropy, but some standard processes, such as renewal processes, do not have exponential rates. In the general ergodic case, no uniform rate is possible, that is, given any convergence rate for frequencies or for entropy, there is an ergodic process that does not satisfy the rate. These results will be developed in this section. In this section μ denotes a stationary, ergodic process with alphabet A, with μ_k denoting the projection onto A^k defined by μ_k(a_1^k) = μ({x: x_1^k = a_1^k}). The k-th order empirical distribution, or k-type, is the measure p_k = p_k(· | x_1^n) on A^k defined by

p_k(a_1^k | x_1^n) = |{i ∈ [1, n-k+1]: x_i^{i+k-1} = a_1^k}| / (n-k+1),   a_1^k ∈ A^k.

The distance between μ_k and p_k will be measured using the distributional (variational) distance, defined by

|μ_k - p_k| = Σ_{a_1^k} |μ_k(a_1^k) - p_k(a_1^k | x_1^n)|.
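For a concrete feel for these quantities, here is a minimal Python sketch that computes the empirical k-block distribution and its variational distance from the true distribution for a fair-coin sample (names are illustrative):

    import itertools, random
    from collections import Counter

    def empirical_k_blocks(x, k):
        # Overlapping empirical k-block distribution p_k(. | x_1^n).
        counts = Counter(tuple(x[i:i + k]) for i in range(len(x) - k + 1))
        total = len(x) - k + 1
        return {b: c / total for b, c in counts.items()}

    def variational_distance(p, q):
        keys = set(p) | set(q)
        return sum(abs(p.get(b, 0.0) - q.get(b, 0.0)) for b in keys)

    random.seed(3)
    n, k = 100000, 5
    x = [random.randint(0, 1) for _ in range(n)]
    p_k = empirical_k_blocks(x, k)
    mu_k = {b: 2.0 ** (-k) for b in itertools.product((0, 1), repeat=k)}  # true fair-coin measure
    print(round(variational_distance(p_k, mu_k), 4))   # small for fixed k and large n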

The ergodic theorem implies that if k and ε > 0 are fixed then

μ_n({x_1^n: |μ_k - p_k(· | x_1^n)| ≥ ε}) → 0,

as n → ∞. Likewise, the entropy theorem implies that

μ_n({x_1^n: 2^{-n(h+ε)} ≤ μ_n(x_1^n) ≤ 2^{-n(h-ε)}}) → 1,

as n → ∞. The rate of convergence problem is to determine the rates at which these convergences take place. A mapping r_k: R^+ × Z^+ → R^+, where R^+ denotes the positive reals and Z^+ the positive integers, is called a rate function for frequencies for a process μ if, for each fixed k and ε > 0,

μ_n({x_1^n: |μ_k - p_k(· | x_1^n)| ≥ ε}) ≤ r_k(ε, n),
and r_k(ε, n) → 0 as n → ∞. An ergodic process is said to have exponential rates for frequencies if there is a rate function such that for each ε > 0 and k, (-1/n) log r_k(ε, n) is bounded away from 0. (The determination of the best value of the exponent is known as "large deviations" theory; while ideas from that theory will be used, it will not be the focus of interest here.) A mapping r: R^+ × Z^+ → R^+ is called a rate function for entropy for a process μ of entropy h, if for each ε > 0,

μ_n({x_1^n: 2^{-n(h+ε)} ≤ μ_n(x_1^n) ≤ 2^{-n(h-ε)}}) ≥ 1 - r(ε, n),

and r(ε, n) → 0 as n → ∞. An ergodic process has exponential rates for entropy if it has a rate function for entropy such that for each ε > 0, (-1/n) log r(ε, n) is bounded away from 0. Exponential rates of convergence for i.i.d. processes will be established in the next subsection, then extended to Markov and other processes that have appropriate asymptotic independence properties. Examples of processes without exponential rates, and counterexamples for more general rates, will be discussed in the final subsection.

Remark III.1.1
An unrelated concept with a similar name, called speed of convergence, is concerned with what happens in the ergodic theorem when (1/n) Σ_i f(T^i x) is replaced by (1/a_n) Σ_i f(T^i x), for some unbounded, nondecreasing sequence {a_n}. See [32] for a discussion of this topic.

III.1.a Exponential rates for i.i.d. processes.

In this section the following theorem will be proved.

Theorem III.1.2 (Exponential rates for i.i.d. processes.) If μ is i.i.d., then it has exponential rates for frequencies and for entropy.
The theorem is proved by establishing the first-order case. The overlapping-block problem is then reduced to the nonoverlapping-block problem, which is treated by applying the first-order result to the larger alphabet of blocks. Exponential rates for entropy follow immediately from the first-order frequency theorem, for the fact that μ is a product measure implies that μ(x_1^n) is a continuous function of p_1(· | x_1^n). The first-order theorem is a consequence of the bound given in the following lemma, which is, in turn, a combination of a large deviations bound due to Hoeffding and Sanov and an inequality of Pinsker, [22, 62, 58].

Lemma III.1.3 (First-order rate bound.) There is a positive constant c such that

(1)   μ({x_1^n: |p_1(· | x_1^n) - μ_1| ≥ ε}) ≤ (n+1)^{|A|} 2^{-ncε²},

for any n, for any finite set A, and for any i.i.d. process μ with alphabet A.

Proof. A direct calculation, using the fact that μ is a product measure, produces

(2)   μ(x_1^n) = Π_{a∈A} μ(a)^{n p_1(a | x_1^n)} = 2^{-n(H(p_1) + D(p_1 ‖ μ_1))},

where

H(p_1) = H(p_1(· | x_1^n)) = -Σ_a p_1(a | x_1^n) log p_1(a | x_1^n)

is the entropy of the empirical 1-block distribution, and

D(p_1 ‖ μ_1) = Σ_a p_1(a | x_1^n) log [p_1(a | x_1^n)/μ_1(a)]

is the divergence of p_1(· | x_1^n) relative to μ_1. The desired result will follow from an application of the theory of type classes discussed in I.6.d. The type class of x_1^n is the set T(x_1^n) = {y_1^n: p_1(· | y_1^n) = p_1(· | x_1^n)}. The two key facts, Theorem I.6.13 and Theorem I.6.12, are

(a) |T(x_1^n)| ≤ 2^{n H(p_1)}.

(b) There are at most (n+1)^{|A|} type classes.

Fact (a) in conjunction with the product formula (2) produces the bound

(3)   μ(T(x_1^n)) ≤ 2^{-n D(p_1 ‖ μ_1)}.

The bad set B(n, ε) = {x_1^n: |p_1(· | x_1^n) - μ_1| ≥ ε} can be partitioned into disjoint sets of the form B(n, ε) ∩ T(x_1^n), and hence the bound (3) on the measure of the type class, together with fact (b), produces the bound

μ(B(n, ε)) ≤ (n+1)^{|A|} 2^{-n D*},

where

D* = min {D(p_1(· | x_1^n) ‖ μ_1): |p_1(· | x_1^n) - μ_1| ≥ ε}.

Since D(P ‖ Q) ≥ |P - Q|²/(2 ln 2) is always true, see Pinsker's inequality, Exercise 6, Section I.6, it follows that

D* ≥ (1/(2 ln 2)) |p_1(· | x_1^n) - μ_1|² ≥ ε²/(2 ln 2),

which completes the proof of Lemma III.1.3. □
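The identity (2) is easy to verify numerically. A minimal Python sketch, under an assumed two-letter i.i.d. measure (the alphabet and probabilities are arbitrary choices for illustration):

    import math, random
    from collections import Counter

    random.seed(4)
    mu = {'a': 0.3, 'b': 0.7}                        # an assumed i.i.d. measure
    n = 200
    x = random.choices(list(mu), weights=list(mu.values()), k=n)

    p1 = {a: c / n for a, c in Counter(x).items()}   # empirical 1-block distribution
    H = -sum(p * math.log2(p) for p in p1.values())
    D = sum(p * math.log2(p / mu[a]) for a, p in p1.items())

    log_mu_x = sum(math.log2(mu[a]) for a in x)      # log2 of the product-measure probability
    print(round(log_mu_x, 6), round(-n * (H + D), 6))   # the two numbers agree, as in (2)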

Lemma III.1.3 gives the desired exponential-rate theorem for first-order frequencies, since (n+1)^{|A|} 2^{-ncε²} = 2^{-n(cε² - δ_n)}, where δ_n → 0 as n → ∞. The extension to k-th order frequencies is carried out by reducing the overlapping-block problem to k separate nonoverlapping-block problems. To assist in this task some notation and terminology will be developed. For n > 2k define integers t = t(n, k) and r ∈ [0, k) such that n = tk + k + r. For each s ∈ [0, k-1], the sequence x_1^n can be expressed as the concatenation

(4)   x_1^n = w(0)w(1)⋯w(t)w(t+1),

where the first word w(0) has length s < k, the next t words, w(1), …, w(t), all have length k, and the final word w(t+1) has length n - tk - s. This is called the s-shifted, k-block parsing of x_1^n. The sequences x_1^n and y_1^n are said to be (s, k)-equivalent if their


s-shifted, k-block parsings have the same (ordered) set {w(1), …, w(t)} of k-blocks. The (s, k)-equivalence class of x_1^n is denoted by S_k(x_1^n, s). The s-shifted (nonoverlapping) empirical k-block distribution p_k^s = p_k^s(· | x_1^n) is the distribution on A^k defined by

p_k^s(a_1^k | x_1^n) = |{j ∈ [1, t]: w(j) = a_1^k}| / t,

where x_1^n is parsed as in (4). It is, of course, constant on the (s, k)-equivalence class S_k(x_1^n, s). The overlapping-block measure p_k is (almost) an average of the measures p_k^s, where "almost" is needed to account for end effects, a result summarized as the following lemma.
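The relation between the overlapping distribution p_k and the shifted nonoverlapping distributions p_k^s is easy to see numerically; a minimal Python sketch (illustrative names; end effects make the agreement approximate, as the lemma below makes precise):

    from collections import Counter

    def shifted_nonoverlapping(x, k, s):
        # p_k^s: distribution of the full k-blocks in the s-shifted parsing of x.
        t = (len(x) - s) // k
        c = Counter(tuple(x[s + j * k: s + (j + 1) * k]) for j in range(t))
        return {b: v / t for b, v in c.items()}

    def overlapping(x, k):
        c = Counter(tuple(x[i:i + k]) for i in range(len(x) - k + 1))
        total = len(x) - k + 1
        return {b: v / total for b, v in c.items()}

    x = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0] * 500
    k = 3
    p_k = overlapping(x, k)
    avg = Counter()
    for s in range(k):                       # average the k shifted nonoverlapping measures
        for b, v in shifted_nonoverlapping(x, k, s).items():
            avg[b] += v / k
    print(max(abs(p_k.get(b, 0) - avg.get(b, 0)) for b in set(p_k) | set(avg)))  # small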

Lemma III.1.4 (Overlapping to nonoverlapping lemma.) Given ε > 0 there is a γ > 0 such that if k/n ≤ γ and |p_k(· | x_1^n) - μ_k| ≥ ε then there is an s ∈ [0, k-1] such that |p_k^s(· | x_1^n) - μ_k| ≥ ε/2.

Proof. Left to the reader.

If k is fixed and n is large enough the lemma yields the containment relation,
(5)   {x_1^n: |p_k(· | x_1^n) - μ_k| ≥ ε} ⊆ ∪_{s=0}^{k-1} {x_1^n: |p_k^s(· | x_1^n) - μ_k| ≥ ε/2}.

Fix such a k and n, and fix s ∈ [0, k-1]. The fact that μ is a product measure implies

(6)   μ(S_k(x_1^n, s)) = Π_{j=1}^t μ_k(w(j)),

where S_k(x_1^n, s) is the (s, k)-equivalence class of x_1^n, so that if B denotes A^k and ν denotes the product measure on B^t defined by the formula

ν(w_1^t) = Π_{i=1}^t μ_k(w_i),   w_i ∈ B,

then

(7)   μ({x_1^n: |p_k^s(· | x_1^n) - μ_k| ≥ ε/2}) ≤ ν({w_1^t: |p_1(· | w_1^t) - ν_1| ≥ ε/2}).

The latter, however, is upper bounded by

(8)   (t+1)^{|A|^k} 2^{-tcε²/4},

by the first-order result, Lemma III.1.3, applied to the measure ν and super-alphabet B = A^k. The containment relation, (5), then provides the desired k-block bound
(9)   μ({x_1^n: |p_k(· | x_1^n) - μ_k| ≥ ε}) ≤ k(t+1)^{|A|^k} 2^{-tcε²/4}.

If k and ε > 0 are fixed, the logarithm of the right-hand side is asymptotic to -tcε²/4 ~ -ncε²/4k, hence the bound decays exponentially in n. This completes the proof of Theorem III.1.2.


III.1.b The Markov and related cases.

Exponential rates for Markov chains will be established in this section.

Theorem III.1.5 (Exponential rates for Markov sources.) If μ is an ergodic Markov source, then it has exponential rates for frequencies and for entropy.

The entropy part follows immediately from the second-order frequency result, since μ(x_1^n) is a continuous function of p_1(· | x_1^n) and p_2(· | x_1^n). The theorem for frequencies is proved in stages. First it will be shown that aperiodic Markov chains satisfy a strong mixing condition called ψ-mixing. Then it will be shown that ψ-mixing processes have exponential rates for both frequencies and entropy. Finally, it will be shown that periodic, irreducible Markov chains have exponential rates. A process is ψ-mixing if there is a nonincreasing sequence {ψ(g)} such that ψ(g) → 1 as g → ∞, and such that
μ(uvw) ≤ ψ(g) μ(u) μ(w),   u, w ∈ A*, v ∈ A^g.
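For an aperiodic Markov chain the quantity max_{a,b} M^g_{ab}/π(b), which appears in the proof of Lemma III.1.6 below, can serve as ψ(g). A minimal Python sketch with an assumed two-state chain shows it approaching 1 as the gap g grows:

    import numpy as np

    M = np.array([[0.9, 0.1],       # an assumed aperiodic transition matrix
                  [0.4, 0.6]])
    vals, vecs = np.linalg.eig(M.T)            # stationary vector: left eigenvector for 1
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
    pi = pi / pi.sum()

    for g in (1, 2, 5, 10, 20):
        Mg = np.linalg.matrix_power(M, g)
        psi = (Mg / pi).max()        # max over a, b of M^g_{ab} / pi(b)
        print(g, round(psi, 6))      # approaches 1, since M^g_{ab} -> pi(b)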

The ψ-mixing concept is stronger than ordinary mixing, which allows the length of the gap to depend on the length of the past and future. An i.i.d. process is ψ-mixing, with ψ(g) ≡ 1. It requires only a bit more effort to obtain the mixing Markov result.

Lemma III.1.6 An aperiodic Markov chain is ψ-mixing.

Proof. Let M be the transition matrix and π(·) the stationary vector for μ. Fix a gap g, let a be the final symbol of u, and let b be the first symbol of w. The formula (10), Section I.2, for computing Markov probabilities yields

Σ_{ℓ(v)=g} μ(uvw) = μ(u) M^g_{ab} μ(w)/π(b).

The right-hand side is upper bounded by ψ(g) μ(u) μ(w), where

ψ(g) = max_a max_b M^g_{ab}/π(b).

The function ψ(g) → 1, as g → ∞, since the aperiodicity assumption implies that M^g_{ab} → π(b). This proves the lemma. □

The next task is to prove

Theorem III.1.7 If μ is ψ-mixing then it has exponential rates for frequencies of all orders.

Proof. A simple modification of the i.i.d. proof, so as to allow gaps between blocks, is all that is needed to establish this lemma. Fix positive integers k and g. For n > 2k + g, define t = t(n, k, g) such that

(10)   n = t(k+g) + (k+g) + r,   0 ≤ r < k + g.


For each s ∈ [0, k+g-1], the sequence x_1^n can be expressed as the concatenation

(11)   x_1^n = w(0)g(1)w(1)g(2)w(2)⋯g(t)w(t)w(t+1),

where w(0) has length s < k+g, the g(j) have length g and alternate with the k-blocks w(j), for 1 ≤ j ≤ t, and the final block w(t+1) has length n - t(k+g) - s < 2(k+g). This is called the s-shifted, k-block parsing of x_1^n, with gap g. The sequences x_1^n and y_1^n are said to be (s, k, g)-equivalent if their s-shifted, g-gapped, k-block parsings have the same (ordered) set {w(1), …, w(t)} of k-blocks. The (s, k, g)-equivalence class of x_1^n is denoted by S_k(x_1^n, s, g). The s-shifted (nonoverlapping) empirical k-block distribution, with gap g, is the distribution on A^k defined by

p_k^{s,g}(a_1^k | x_1^n) = |{j ∈ [1, t]: w(j) = a_1^k}| / t.

It is, of course, constant on the (s, k, g)-equivalence class S_k(x_1^n, s, g). The overlapping-block measure p_k is (almost) an average of the measures p_k^{s,g}, where "almost" is now needed to account both for end effects and for the gaps. If k is large relative to gap length and n is large relative to k, this is no problem, however, and Lemma III.1.4 easily extends to the following.

Lemma III.1.8 (Overlapping to nonoverlapping with gaps.) Given ε > 0 and g, there is a γ > 0 and a K > 0, such that if k/n ≤ γ, if k ≥ K, and if |p_k(· | x_1^n) - μ_k| ≥ ε, then there is an s ∈ [0, k+g-1] such that

|p_k^{s,g}(· | x_1^n) - μ_k| ≥ ε/2.

Completion of proof of Theorem III.1.7. The i.i.d. proof adapts to the ψ-mixing case, with the product measure bound (6) replaced by

(12)   μ(S_k(x_1^n, s, g)) ≤ [ψ(g)]^t Π_{j=1}^t μ_k(w(j)),

where the w(j) are given by (11), the s-shifted, k-block parsing of x_1^n, with gap g. The upper bound (8) on μ({x_1^n: |p_k^{s,g}(· | x_1^n) - μ_k| ≥ ε/2}) is replaced by the upper bound

(13)   [ψ(g)]^t (t+1)^{|A|^k} 2^{-tcε²/4},

and the final upper bound (9) on the probability μ({x_1^n: |p_k(· | x_1^n) - μ_k| ≥ ε}) is replaced by the bound

(14)   (k+g)[ψ(g)]^t (t+1)^{|A|^k} 2^{-tcε²/4},

assuming, of course, that k is enough larger than g and n enough larger than k to guarantee that Lemma III.1.8 holds. The proof of Theorem III.1.7 is completed by using the ψ-mixing property to choose g so large that ψ(g) ≤ 2^{cε²/8}. The logarithm of the bound in (14) is then asymptotically at most -tcε²/8 ~ -ncε²/8(k+g), hence the bound decays exponentially, provided only that k is large enough. Since it is enough to prove exponential rates for k-th order frequencies for large k, this completes the proof of Theorem III.1.7, and establishes the exponential rates theorem for aperiodic Markov chains. The following lemma removes the aperiodicity requirement in the Markov case.

Lemma III.1.9 An ergodic Markov chain has exponential rates for frequencies of all orders.

Proof. The only new case is the periodic case. Let μ be an ergodic Markov chain with period d > 1, and partition A into the (periodic) classes, C_1, C_2, …, C_d, such that

Prob(X_{n+1} ∈ C_{s⊕1} | X_n ∈ C_s) = 1,   1 ≤ s ≤ d,

where ⊕ denotes addition mod d. Define the function c: A → [1, d] by putting c(a) = s, if a ∈ C_s. Also let μ^{(s)} denote the measure μ conditioned on X_1 ∈ C_s. Let g be a gap length, which can be assumed to be divisible by d and small relative to k. It can also be assumed that k is divisible by d, for k can always be increased or decreased by no more than d to achieve this, which has no effect on the asymptotics. For s ∈ [0, k+g-1], the nonoverlapping block measure p_k^{s,g} = p_k^{s,g}(· | x_1^n) satisfies the condition p_k^{s,g}(a_1^k) = 0, unless c(a_1) = c(x_s). Since μ_k is an average of the μ_k^{(s)}, it follows that if k/g is large enough then

{x_1^n: |p_k(· | x_1^n) - μ_k| ≥ ε} ⊆ ∪_{s=0}^{k+g-1} {x_1^n: |p_k^{s,g}(· | x_1^n) - μ_k^{(c(x_s))}| ≥ ε/2}.

The measure μ^{(s)} is, however, an aperiodic Markov measure with state space C(s) × C(s⊕1) × ⋯ × C(s⊕d-1), so the previous theory applies to each set {x_1^n: |p_k^{s,g}(· | x_1^n) - μ_k^{(c(x_s))}| ≥ ε/2} separately. Thus, as before, μ({x_1^n: |p_k(· | x_1^n) - μ_k| ≥ ε}) decays exponentially as n → ∞, establishing the lemma, and thereby completing the proof of the exponential-rates theorem for Markov chains, Theorem III.1.5. □

III.1.c Counterexamples.

Examples of renewal processes without exponential rates for frequencies and entropy are not hard to construct; see Exercise 4. A simple cutting and stacking procedure which produces counterexamples for general rate functions will be presented, in part, because it is conceptually quite simple, and, in part, because it illustrates several useful cutting and stacking ideas. The following theorem summarizes the goal.

Theorem III.1.10 Let n ↦ r(n) be a positive decreasing function with limit 0. There is a binary ergodic process μ and an integer N such that

μ_n({x_1^n: |p_1(· | x_1^n) - μ_1| ≥ 1/2}) ≥ r(n),   n ≥ N.


Counterexamples for a given rate would be easy to construct if ergodicity were not required. For example, if μ were the average of two processes, μ = (ν + ν̄)/2, where ν is concentrated on the sequence of all 1's and ν̄ is concentrated on the sequence of all 0's, then for any n, p_1(1 | x_1^n) is either 1 or 0, according to whether x_1^n consists of 1's or 0's, while μ_1(1) = 1/2, so that

μ_n({x_1^n: |p_1(· | x_1^n) - μ_1| ≥ 1/2}) = 1,   n ≥ 1.

Ergodic counterexamples are constructed by making μ look like one process on part of the space and something else on the other part of the space, mixing the first part into the second part so slowly that the rate of convergence is as large as desired. The cutting and stacking method was designed, in part, to implement both of these goals, that is, to make a process look like another process on part of the space, and to mix one part slowly into another. As an example of what it means to "look like one process on part of the space and something else on the other part of the space," suppose μ is defined by a (complete) sequence of column structures. Suppose some column C at some stage has all of its levels labeled '1'. If x is a point in some level of C below the top m levels, then it is certain that x_j = 1, for 1 ≤ j ≤ m, so that p_1(1 | x_1^m) = 1. In particular, if C has height L and measure β, then the chance that a randomly selected x lies in the first L - m levels of C is (1 - m/L)β, and hence

(15)   μ_m({x_1^m: p_1(1 | x_1^m) = 1}) ≥ (1 - m/L)β,

no matter what the complement of C looks like. Furthermore, if exactly half the measure of S(1) is concentrated on intervals labeled '1', then

μ_m({x_1^m: |p_1(· | x_1^m) - μ_1| ≥ 1/2}) ≥ (1 - m/L)β,

so that if 1/2 > β ≥ r(m) and L is large enough then

(16)   μ_m({x_1^m: |p_1(· | x_1^m) - μ_1| ≥ 1/2}) ≥ r(m).

The mixing of "one part slowly into another" is accomplished by cutting C into two columns, say C_1 and C_2. Mixing is done by applying repeated independent cutting and stacking to the union of the second column C_2 with the complement of C. The bound (16) can be achieved with m replaced by m + 1 by making sure that the first column C_1 has measure slightly more than r(m+1), for it can then be cut and stacked into a much longer column, long enough to guarantee that (16) will indeed hold with m replaced by m + 1. As long as enough independent cutting and stacking is done at each stage and all the mass is moved in the limit to the second part, the final process will be ergodic and will satisfy (16) for all m for which r(m) < 1/2.

The details of the above outline will now be given. Without loss of generality, it can be supposed that r(m) < 1/2, for all m. Let m ↦ β(m) be a nonincreasing function with limit 0, such that β(1) = 1/2 and r(m) ≤ β(m) ≤ 1/2, for all m ≥ 1. The desired (complete) sequence {S(m)} will now be defined. The construction will be described inductively in terms of two auxiliary unbounded sequences, {ℓ_m} and {r_m}, of positive integers which will be specified later. Each S(m) will contain a column


C(m), of height ℓ_m and measure β(m), all of whose entries are labeled '1', together with a column structure R(m), disjoint from C(m). To get started, let C(1) consist of one interval of length 1/2, labeled '1', and let R(1) consist of one interval of length 1/2, labeled '0'. This guarantees that at all subsequent stages the total measure of the intervals of S(m) that are labeled with a '1' is 1/2, and the total measure of the intervals of S(m) that are labeled with a '0' is 1/2, and hence the final process must satisfy μ_1(1) = μ_1(0) = 1/2. For m > 1, S(m) is constructed as follows. First C(m-1) is cut into two columns, C(m-1, 1) and C(m-1, 2), where C(m-1, 1) has measure β(m) and C(m-1, 2) has measure β(m-1) - β(m). The column C(m-1, 1) is cut into ℓ_m/ℓ_{m-1} columns of equal width, which are then stacked to obtain C(m). The new remainder R(m) is obtained by applying r_m-fold independent cutting and stacking to C(m-1, 2) ∪ R(m-1). Since r_m → ∞, and since β(m) → 0, the total width of S(m) goes to 0, so the sequence {S(m)} is complete and hence defines a process μ. All that remains to be shown is that for suitable choice of ℓ_m and r_m the final process μ is ergodic and satisfies the desired condition

(17)   μ_m({x_1^m: |p_1(· | x_1^m) - μ_1| ≥ 1/2}) ≥ r(m),   m ≥ 1.

The condition (17) is guaranteed by choosing ℓ_m so large that (1 - m/ℓ_m)β(m) ≥ r(m), for this is all that is needed to make sure that

μ_m({x_1^m: |p_1(· | x_1^m) - μ_1| ≥ 1/2}) ≥ (1 - m/ℓ_m)β(m)

holds. Once this holds, then it can be combined with the fact that 1's and 0's are equally likely and the assumption that 1/2 > β(m) ≥ r(m), to guarantee that (17) holds. The sequence {r_m} is chosen, sequentially, so that C(m-1, 2) ∪ R(m-1) and R(m) are (1 - 2^{-m})-independent. This is possible by Theorem I.10.11. Since the measure of R(m) goes to 1, this guarantees that the final process is ergodic, by Theorem I.10.7. This completes the proof of Theorem III.1.10. □

III.1.d Exercises.

1. Suppose μ has exponential rates for frequencies.

(a) Choose k such that T_k = {x_1^k: μ(x_1^k) ≥ 2^{-k(h+ε)}} has measure close to 1. Show that the measure of the set of n-sequences that are not almost strongly-packed by T_k goes to 0 exponentially fast.

(b) Show that the measure of {x_1^n: μ(x_1^n) ≤ 2^{-n(h+ε)}} goes to 0 exponentially fast. (Hint: as in the proof of the entropy theorem it is highly unlikely that a sequence can be mostly packed by T_k and have too-small measure.)

(c) Show that a ψ-mixing process has exponential rates for entropy.

2. Let ν be a finite stationary coding of μ and assume μ has exponential rates for frequencies and entropy.

(a) Show that ν has exponential rates for frequencies.


(b) Show that ν has exponential rates for entropy. (Hint: bound the number of n-sequences that can have good k-block frequencies, conditional k-type too much larger than 2^{k(h(μ)-h(ν))}, and measure too much larger than 2^{-nh(ν)}.)

3. Use cutting and stacking to construct a process with exponential rates for entropy but not for frequencies.

4. Show that there exists a renewal process that does not have exponential rates for frequencies of order 1. (Hint: make sure that the expected recurrence-time series converges slowly.)

5. Exercise 4 shows that renewal processes are not ψ-mixing. Establish this by directly constructing a renewal process that is not ψ-mixing.

Section III.2 Entropy and joint distributions.

The k-th order joint distribution for an ergodic finite-alphabet process can be estimated from a sample path of length n by sliding a window of length k along the sample path and counting frequencies of k-blocks. If k is fixed the procedure is almost surely consistent, that is, the resulting empirical k-block distribution almost surely converges to the true distribution of k-blocks as n → ∞, a fact guaranteed by the ergodic theorem. The consistency of such estimates is important when using training sequences, that is, finite sample paths, to design engineering systems. The empirical k-block distribution for a training sequence is used as the basis for design, after which the system is run on other, independently drawn sample paths. There are some situations, such as data compression, where it is good to make the block length as long as possible. Thus it would be desirable to have consistency results for the case when the block length function k = k(n) grows as rapidly as possible, as a function of sample path length n. This is the problem addressed in this section. As before, let p_k(· | x_1^n) denote the empirical distribution of overlapping k-blocks in the sequence x_1^n. A nondecreasing sequence {k(n)} will be said to be admissible for the ergodic measure μ if

lim_{n→∞} |p_{k(n)}(· | x_1^n) - μ_{k(n)}| = 0, a.s.,

where | · | denotes variational, that is, distributional, distance. The definition of admissible in probability is obtained by replacing almost-sure convergence by convergence in probability. Every ergodic process has an admissible sequence such that lim_n k(n) = ∞, by the ergodic theorem. It can also be shown that for any sequence k(n) → ∞ there is an ergodic measure for which {k(n)} is not admissible, see Exercise 2. The problem addressed here is whether it is possible to make a universal choice of {k(n)} for "nice" classes of processes, such as the i.i.d. processes or Markov chains. Here is where entropy enters the picture, for if k is large then, with high probability, the probability of a k-block will be roughly 2^{-kh}. In particular, if k(n) ≥ (1 + ε)(log n)/h, or, equivalently, k(n) ≥ (log n)/(h - ε), there is no hope that the empirical k-block distribution will be close to the true distribution, see Theorem III.2.1, below. Consistent estimation also may not be possible for the choice k(n) ~ (log n)/h. For example, in the unbiased coin-tossing case when h = 1, the choice k(n) ~ log n is not admissible, for it is easy to see that, with high probability, an approximate (1 - e^{-1}) fraction of the k-blocks will fail to appear in a given sample path of length n, see Exercise 3. Thus the case of most interest is when k(n) is of order (log n)/(h + ε). The principal results are the following.

Theorem III.2.1 (The nonadmissibility theorem.) If μ is ergodic with positive entropy h, if 0 < ε < h, and if k(n) ≥ (log n)/(h - ε), then {k(n)} is not admissible in probability for μ.

Theorem III.2.2 (The positive admissibility theorem.) If μ is i.i.d., Markov, or ψ-mixing, and k(n) ≤ (log n)/(h + ε), then {k(n)} is admissible for μ.

Theorem III.2.3 (Weak Bernoulli admissibility.) If μ is weak Bernoulli and k(n) ≤ (log n)/(h + ε), then {k(n)} is admissible for μ.
The ψ-mixing concept was introduced in the preceding section. The weak Bernoulli property, which will be defined carefully later, requires that past and future blocks be almost independent if separated by a long enough gap, independent of the length of past and future blocks. The nonadmissibility theorem is a simple consequence of the fact that no more than 2^{k(n)(h-ε)} sequences of length k(n) can occur in a sequence of length n, and hence there is no hope of seeing even a fraction of the full distribution. The positive admissibility result for the i.i.d. and ψ-mixing cases is a consequence of a slight strengthening of the exponential rate bounds of the preceding section. The weak Bernoulli result follows from an even more careful look at what is really used for the ψ-mixing proof.
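The contrast between the two regimes is easy to see numerically for unbiased coin-tossing (h = 1). The following Python sketch is illustrative only: with a fixed small k the variational distance is small, while with k on the order of 2 log2 n most k-blocks never appear and the distance is close to its maximum value 2.

    import math, random
    from collections import Counter

    def variational_gap(x, k):
        # |p_k(.|x) - mu_k| for the fair coin, where mu_k is uniform on the 2^k blocks.
        total = len(x) - k + 1
        counts = Counter(tuple(x[i:i + k]) for i in range(total))
        seen = sum(abs(c / total - 2.0 ** (-k)) for c in counts.values())
        unseen = (2 ** k - len(counts)) * 2.0 ** (-k)   # blocks that never occur
        return seen + unseen

    random.seed(5)
    n = 4096
    x = [random.randint(0, 1) for _ in range(n)]
    print(round(variational_gap(x, 4), 3))                       # small: k fixed, n large
    print(round(variational_gap(x, 2 * int(math.log2(n))), 3))   # near 2: k too long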

Remark III.2.4 A first motivation for the problem discussed in this section was the training sequence problem described in the opening paragraph. A second motivation was the desire to obtain a more classical version of the positive results of Ornstein and Weiss, who used the d̄-distance rather than the variational distance, [52]. They showed, in particular, that if the process is a stationary coding of an i.i.d. process then the d̄-distance between the empirical k(n)-block distribution and the true k(n)-block distribution goes to 0, almost surely, provided k(n) ≤ (log n)/h. The d̄-distance is upper bounded by half the variational distance, and hence the results described here are a sharpening of the Ornstein-Weiss theorem for the case when k ≤ (log n)/(h + ε) and the process satisfies strong enough forms of asymptotic independence. The Ornstein-Weiss results will be discussed in more detail in Section III.3.

Remark III.2.5 A third motivation for the problem discussed here was a waiting-time result obtained by Wyner and Ziv, [86]. They showed that if W_n(x, y) is the waiting time until the first n terms of x appear in the sequence y then, for ergodic Markov chains, (1/n) log W_n(x, y) converges in probability to h, provided x and y are independently chosen sample paths. The admissibility results obtained here can be used to prove stronger versions of their theorem. These applications, along with various related results and counterexamples, are presented in Section III.5.


III.2.a Proofs of admissibility and nonadmissibility.

First the nonadmissibility theorem will be proved. Let μ be an ergodic process with entropy h and suppose k = k(n) ≥ (log n)/(h - ε), where 0 < ε < h. Define the empirical universe of k-blocks to be the set

U_k(x_1^n) = {a_1^k: x_i^{i+k-1} = a_1^k for some i ∈ [1, n-k+1]}.

Since p_k(a_1^k | x_1^n) = 0 whenever a_1^k ∉ U_k(x_1^n), it follows that

(1)   |p_k(· | x_1^n) - μ_k| ≥ μ_k((U_k(x_1^n))^c).

Let

T_k(ε/2) = {x_1^k: μ(x_1^k) ≤ 2^{-k(h-ε/2)}}.

The assumption n ≤ 2^{k(h-ε)} implies that |U_k(x_1^n)| ≤ 2^{k(h-ε)}, and hence

μ_k(U_k(x_1^n) ∩ T_k(ε/2)) ≤ 2^{k(h-ε)} 2^{-k(h-ε/2)} = 2^{-kε/2},

since each member of T_k(ε/2) has measure at most 2^{-k(h-ε/2)}. The entropy theorem guarantees that μ_k(T_k(ε/2)) → 1, so that μ_k((U_k(x_1^n))^c) also goes to 1, and hence |p_k(· | x_1^n) - μ_k| → 1 for every x ∈ A^∞, by the distance lower bound, (1). This implies the nonadmissibility theorem.

Next the positive admissibility theorem will be established. First note that, without loss of generality, it can be supposed that {k(n)} is unbounded, since any bounded sequence is admissible for any ergodic process, by the ergodic theorem. The rate bound,

(2)   μ({x_1^n: |p_k(· | x_1^n) - μ_k| ≥ ε}) ≤ k(t+1)^{|A|^k} 2^{-tcε²/4},

where t ~ n/k, see the bound (9) of Section III.1, is enough to prove the positive admissibility theorem for unbiased coin-tossing, for, as the reader can show, the error is summable in n, if n ≥ |A|^k. The key to such results for other processes is a similar bound which holds when the full alphabet is replaced by a subset of large measure.

Lemma III.2.6 (Extended first-order bound.) There is a positive constant C such that for any ε > 0, for any finite set A, for any n > 0, for any i.i.d. process μ with finite alphabet A, and for any B ⊆ A such that μ(B) ≥ 1 - ε and |B| ≥ 2, the following holds.

μ({x_1^n: |p_1(· | x_1^n) - μ_1| ≥ 5ε}) ≤ 2(n+1)^{|B|} 2^{-nCε²}.

Proof. Define

y_n = y_n(x) = 1 if x_n ∉ B, and 0 otherwise,

so that {y_n} is a binary i.i.d. process with Prob(y_n = 1) ≤ ε. Let

C_n = {x_1^n: Σ_{i=1}^n y_i ≥ 2εn},



and apply the bound (2) to obtain (3) p, (Ca ) < (n
1)22 - cn4E2


The idea now is to partition An according to the location of the indices i for which , i,n ) den, let B(ii, x i g B. For each m < n and 1 < j 1 < i2 < < j The sets , }. note the set of all 4 for which Xi E B if and only if i z' {i1, i 2 , are disjoint for different { -1, -2, im) and have union A. Furthermore, the sets {B(ii, , in1 ): m > 2en} have union Cn so that (3) yields the bound

(4)
(ii,

no)
im }:nz>2En

+ 1) 22-cn4E2 .

, in,) , in,} with m < 2cn, put s = n m, and for xri' E B(ii, Fix a set {i1, define = i(xil) to be the sequence of length s obtained by deleting xi xi2 , , x im from x. Also let ALB be the conditional distribution on B defined by ,u i and let AB be the corresponding product measure on BS defined by ALB. The assumption that m < 26n the implies that the probability ({Xj1 E B(ii, is upper bounded by tc(B(i i , in,)) /1 B E B s : i Pi e liD I > 3}), im ): 'pi (

14) - tti I >

56})

which is, in turn, upper bounded by (5) since E Bs: 11 1,BI > 1)

Ii ui

- AL B I

--=

2_,

tt (b)

/Ld o
B)

+ E A(b) = 2(1 ,u(B))


bvB

26.

bEB [ A\

The exponential bound (2) can now be applied with k = 1 to upper bound (5) by
tt (B(i i,

im ))(s

1)1B12

2<

ini mn

1)IBI2n(1 20c 2

The sum of the p,(B(i i , , im )) for which in < 26n cannot exceed 1, and hence, combined with (4) and the assumption that I BI > 2, the lemma follows. The positive admissibility result will be proved first for the i.i.d. case, then extended to the *-mixing and periodic Markov cases. Assume ,i is i.i.d. with entropy h, and assume k(n) < (log n)/ (h 6). Let 8 be a given positive number. It will be shown that

(6)

Ep, (14: ip(. 14)

?_ 31) < oc,

which immediately implies the desired result, by the Borel-Cantelli lemma. The proof of (6) starts with the overlapping to nonoverlapping relation derived in the preceding section, see (5), page 168,
k-1

(7)

14: 1Pk(

14) tuki

31 g s=o

IPsk( .

p.kI ?3/21,



which is valid for k/n smaller than some fixed y > 0, where n = tk + k +r, 0 < r < k, and plc ( ',el') denotes the s-shifted nonoverlapping empirical k-block distribution, for s E [O, k 1]. As in the preceding section, see (7) on page 168,

(8)

ti({4: Ins*

Ak1

3/2})

v({ 01: IP1( . 101 ) vil

where each wi E A = A', and y denotes the product measure on induced by A. The idea now is to upper bound the right-hand side in (8) by applying the extended first-order bound, Lemma 111.2.6, to the super-alphabet A", with B replaced by a suitable set of entropy-typical sequences. Since the number of typical sequences can be controlled, as k grows, the polynomial factor in that lemma takes the form, (1 + nl k) 2"1(') , which, though no longer a polynomial in n is still dominated by the exponential factor. The rigorous proof is given in the following paragraphs. The set
rk -=
{X I( :

itt(x lic ) > 2 k(h+12)}

has cardinality at most 2k(h e/ 2) , and, furthermore, by the entropy theorem, there is a K such that yi (T) = kil (Tk ) > 1 8/10, for k > K. The extended first-order bound, Lemma 111.2.6, with A replaced by A = A k and B replaced by T, yields the bound

101) vil > 3/20 < 2(t + 1

)2/2)2_tc82000,

which, combined with (8) and (7), produces the bound (9) tx: IPke

1kI >

< 2k(t

1)21,(hE/2)2-tc82/100 ,

valid for all k > K and n > kly. Note that k can be replaced by k(n) in this bound, since it was assumed that k(n) oc, as n oc. The bound (9) is actually summable in n. To show this put

a =C32 1100, fi =

h + 612 h E

The right-hand side of (9) is then upper bounded by

(10)

log n

h +E

(t

since k(n) < (log n)/ (h E), by assumption. This is, indeed, summable in n, since nIk(n) and p < 1. This establishes that the sum in (6) is finite, thereby completing the proof of the admissibility theorem for i.i.d. processes.

The preceding argument applied to the Vi-mixing case yields the bound

(1 1)

kt({4: 1Pke

il ?_. 3 } ) < 2(k + g)[lk(g)]` (t

)22127 tceil00 ,

valid for all g, all k > K = K (g), and all n > kly, where y is a positive constant which depends only on g. To obtain the desired summability result it is only necessary to choose



g so large that (g) < 2c6212. With a = C8 2 /200 and /3 = (h + E /2)/ (h E), and oc, subject only to the requirement that k(n) < (log n)/(h E), assuming that k(n) the right-hand side in (11) is, in turn, upper bounded by log n 2(t h

1)n i3

which is again summable in n, since since t n/ k(n) and 8 < 1. This establishes the positive admissibility theorem for the Vi-mixing case, hence for the aperiodic Markov case. The extension to the periodic Markov case is obtained by a similar modification EJ of the exponential-rates result for periodic chains, see Lemma 111.1.9.

I11.2.b

The weak Bernoulli case.

A process has the weak Bernoulli property if past and future become almost independent when separated by a long enough gap, where the variational distance is used as the measure of approximation. To make this precise, let μ_n(· | x_{-g-m+1}^{-g}) denote the conditional measure on n steps into the future, conditioned on the past {x_j: -g-m < j ≤ -g}, where g and m are nonnegative integers, that is, the measure defined for a_1^n ∈ A^n by
iun

(41x

g_m = Prob(r; = aril and Xi g


Prob ( X Tg g , = xi,)

Also let Ex -g (f) denote conditional expectation of a function f with respect to the -g-m random vector
X gm g

A stationary process {X_i} is weak Bernoulli (WB), or absolutely regular, if given ε > 0 there is a gap g ≥ 0 such that

(12)   E_{X_{-g-m}^{-g}} ( |μ_n(· | X_{-g-m}^{-g}) - μ_n| ) ≤ ε,

for all n ≥ 1 and all m ≥ 0. An equivalent form is obtained by letting m → ∞ and using the martingale theorem to obtain the condition

(13)   E ( |μ_n(· | X_{-∞}^{-g}) - μ_n| ) ≤ ε,   n ≥ 1.

The weak Bernoulli property is much stronger than mixing, which only requires that for each m ≥ 0 and n ≥ 1, there is a gap g for which (12) holds. It is much weaker than ψ-mixing, however, which requires that μ_n(a_1^n | x_{-g-m}^{-g}) ≤ (1 + ε) μ_n(a_1^n), uniformly in m ≥ 0, n ≥ 1, a_1^n, and x_{-g-m}^{-g}, provided only that g be large enough. The key to the ψ-mixing result was the fact that the measure on shifted, nonoverlapping blocks could be upper bounded by the product measure on the blocks, with only a small exponential error. The weak Bernoulli property leads to a similar bound, at least for a large fraction of shifts, provided a small fraction of blocks are omitted and conditioning on the past is allowed, and this is enough to obtain the admissibility theorem. As before, the s-shifted, k-block parsing of x_1^n, with gap g, is the expression

(14)

4 = w(0)g(1)w(1)g(2)w(2)...g(t)w(t)w(t 1),

180

CHAPTER III. ENTROPY FOR RESTRICTED CLASSES.

where w(0) has length s, the g-blocks g(j) alternate with the k-blocks w(j), for 1 < j < t, and the final block w(t 1) has length n t(k+g) s < 2(k g). Note that the k-block w(j) starts at index s (j 1)(k + g) + g 1 and ends at index s j (k + g). An index j E [1, t] is called a (y, s, k, g)-splitting index for x E A Z if

(w(j) ix _ s ti -1)(k+g) ) < (1 + y)p,(w(j)).


Here, and later, lie lx i 00 ) denotes the conditional measure on the infinite past, defined as the limit of A( Ix i rn ) as m oc, which exists almost surely by the martingale theorem. The set of all x E Az for which j is a (y, s, k, g)-splitting index will be denoted by Bi (y, s, k, g), or by B i if s, y, k, and g are understood. Note that Bi is measurable with respect to the coordinates i < s + j (k + g).
Lemma 111.2.7

Fix (y, s, k, g) and fix a finite set J of positive integers. Then for any assignment {w(j): E J } of k-blocks tt Proof obtain (15)
JEJ

(naw(i)] n

B;))

(1+ y)IJI
J EJ

Put jp, = max{ j : j E J } and condition on B* =

R E .,_{. } ( [ w (;) , fl Bi )
B*)

to

n Bi ) =

[w(j,n )] n Bi,

tc(B*).
each

The first factor is an average of the measures II (w(j,n )] of which satisfies

Bi

I xs_+,2- 1)(k +g) ),

([w(i,n )] n B .&

x_ s tim -1)(k+g) ) < (1 + y)tt(w(j ni )),

by the definition of B11. Thus (15) yields

(n([w(i)]

B;))

(1 + y) ,u(w(in,)) tt

([.( ; )]

Bi )

JEJ-th)
D

and Lemma 111.2.7 follows by induction.

The weak Bernoulli property guarantees the almost-sure existence of a large density of splitting indices for most shifts.
Lemma 111.2.8 (The weak Bernoulli splitting-set lemma.)

If l is weak Bernoulli and 0 < y < 1/2, then there is a gap g = g(y), there are integers k(y) and t(y), and there is a sequence of measurable sets {G n (y)}, such that the following hold.

(a)

X E Gn(y), eventually almost surely.

SECTION 111.2. ENTROPY AND JOINT DISTRIBUTIONS.

181

(b) If k > k(y), if t > t(y), and if (t +1)(k + g) _< n < (t + 2)(k +g), then for x E G(y) there are at least (1 y)(k + g) values of s E [0, k g 1] for each of which there are at least (1 y)t indices j in the interval [1, t] that are (y, s, k, g)-splitting indices for x. Proof By the weak Bernoulli property there is a gap g = g(y) so large that for any k

(xx

p(x i`) p(x i`lxioo g)

dp,(x) <

Y 4

Fix such a g and for each k define


fk(X)

,U(.3C ( IX:e g,c )

Let Ek be the a-algebra determined by the random variables, {Xi : i < g} U {X,: 1 < i < kl. Direct calculation shows that each fk has expected value 1 and that {fk} is a martingale with respect to the increasing sequence {Ed, see Exercise 7b. Thus fk converges almost surely to some f. Fatou's lemma implies that

I 1 f (x) d p(x) < )/4 ,


1

so there is an M such that if Cm = lx :I 1 fk (x) 15_ , Vk ?_ MI,

then /i(CM) > 1 (y 2 /2). The ergodic theorem implies that


Nul+ rri cx,

1 N-1

i=u

y2 > 1 a.s., 2

where kc,, denotes the indicator function of Cm, so that if


G(y) = EICc m (T z x) > 1
1 n-1

n.t=o

then x E Gn(Y), eventually almost surely. Put k(y)= M and let t(y) be any integer larger than 2/y 2 . Fix k > M, t > t(y), and (t + 1)(k g) < n < (t +2)(k g), and fix an x E Gn(Y). The definition of G(y) and the assumption t > 2/y 2 imply that
1 r(k+g)-1 i=0 KCht(T1 X) = g) E

k-i-g-1 1

t(k

k + g s=o t j =i

E_

K cm ( T s-hc i ( k + g) x ) > _ y 2 ,

so there is a subset S = S(x) C [0, k + g 1] of cardinality at least (1 y)(k + g) such that for x E G(y) and s E S(x)
k c, (Ts+0-1)(k+g)x ) > y.

182

CHAPTER III. ENTROPY FOR RESTRICTED CLASSES.

In particular, if s E S(X), then Ts(.1-1)(k +g) x E Cm, for at least (1y)t indices j E [1, t]. But if Tr+(i -1)(k +g) x E Cm then g(w(i)Ixst i-1)(k+g ) g) < 1
(

+ y)tt(w(D),

which implies that j is a (y, s, k, g)-splitting index for x. In summary, for x E G(y) and s E S(X) there are at least (1 y)t indices j in the interval [1, t] which are (y, s, k, g)-splitting indices for x. Since I S(x)I > (1 3)(k+ g), this completes the proof of Lemma 111.2.8. For J C [1, t] define
s.J k n Pk,g(ai

)=

W(i)

an! ,

k A,

that is, the empirical distribution of k-blocks obtained by looking only at those k-blocks w(j) for which j E J. The overlapping-block measure Pk is (almost) an average of the measures pk s'g i, provided the sets J are large fractions of [1, t], where "almost" is now needed to account for end effects, for gaps, and for the j g J. If k is large relative to gap length and n is large relative to k, this is no problem, however, and Lemma 111.1.8 easily extends to the following sharper form in which the conclusion holds for a positive fraction of shifts.

Lemma 111.2.9 Given 3 > 0 there is a positive y < 1/2 such that for any g there is a K = K(g, y) such that if k > K, if kl n < y, and if Ipk(-14) ?.: 3 then 114' g j ( . 14) ILkI 3/4 for at least 2y (k + g) indices s E [0, k +g 1], for any subset J C [1, t] of cardinality at least (1 y)t. Proof of weak Bernoulli admissibility, Theorem 111.2.3. Assume A is weak Bernoulli of entropy h, and k(n) < (log n)/(h + E), n > 1. Fix > 0, choose a positive y < 1/2, then choose integers g = g(y), k(y) and t(y), and measurable sets G, = Gn(Y), n > 1, so that conditions (a) and (b) of Lemma 111.2.8 hold. Fix t > t(y) and (t+1)(k+g) < n < (t+2)(k+g), where k(y) k < (log n)/(h+E). For s E [0, k + g 1] and J C [1, t], let Dn (s, J) be the set of those sequences x for which every j E J is a (y, s, k, g)-splitting index, and note that

jEJ

n B = Dn(s, J),
,E,

so that Lemma 111.2.7 and the fact that I Ji < t yield (1 6) A (n[w(j)]
jEJ

n D,(s, J)) < (1+ y)' rip. ( . (;))

If x E G(y) then Lemma 111.2.8 implies that there are (1 y)(k + g) indices s E [0, k + g 1] for each of which there are at least (1 y)t indices j in the interval [1, t] that are (y, s, k, g)-splitting indices for x.

SECTION 111.2. ENTROPY AND JOINT DISTRIBUTIONS.

183

On the other hand, y can be assumed to be so small and t so large that Lemma 111.2.9 _> 3/4 for at least 2y (k -I- g) assures that if I pk (14) Ik1> S then I Ps kv g j ( . 14) Thus if y is sufficiently at least (1y)t. indices s, for any subset J C [1, t] of cardinality small, and k > k(y) and t > t(y) are sufficiently large, then for any x E G 0 (y) there exists at least one s E [0, k + g 1] and at least one J C [1, t] of cardinality at least > 3/4. In particular, the set k :g j (-14) (1 y)t, for which x E Dn (s, J) and ips

(17)
is contained in the set
k+ g 1 s=0

lpk itkl

a. 31 n G(y)

UU ig[1.]
1J12:(1y)t

(ix: Ips k :jg elxi)

> 3/41

Dn (s, J))

The proof of weak Bernoulli admissibility can now be completed very much as the proof for the 'tit-mixing case. Using the argument of that proof, the measure of the set (17) is upper bounded by

(18)

2 . 2-2t y log y

yy[k(o+

2k(n)(h+E/2)2_
t(1Y)C32/4

for t sufficiently large. This bound is the counterpart of (11), with an extra factor, 2-2ty log y to bound the number of subsets J C [1, t ] of cardinality at least (1 y)t. If y is small enough, then, as in the -tfr-mixing case, the bound (18) is summable in n. Since X E G n (y), eventually almost surely, this establishes Theorem 111.2.3. Remark 111.2.10 The admissibility results are based on joint work with Marton, [38]. The weak Bernoulli concept was introduced by Friedman and Ornstein, [13], as part of their proof that aperiodic Markov chains are isomorphic to i.i.d. processes in the sense of ergodic theory. Aperiodic Markov chains are weak Bernoulli because they are ifr-mixing. Aperiodic renewal and regenerative processes, which need not be if-mixing are, however, weak Bernoulli, see Exercises 2 and 3, in Section 1V.3. It is shown in Chapter 4 that weak Bernoulli processes are stationary codings of i.i.d. processes.

III.2.c Exercises.

1. Show that there exists an ergodic measure μ with positive entropy h such that k(n) ~ (log n)/(h + ε) is admissible for μ, yet μ does not have exponential rates of convergence for frequencies.

2. Show that for any nondecreasing, unbounded sequence {k(n)} there is an ergodic process μ such that {k(n)} is not admissible for μ.

3. Show that k(n) = ⌊log n⌋ is not admissible for unbiased coin-tossing.

4. Show that if μ is i.i.d., if k(n) ≤ (log n)/(h + ε), and if B_n = {x_1^n: |p_{k(n)}(· | x_1^n) - μ_{k(n)}| ≥ δ}, then μ(B_n) goes to 0 exponentially fast.

184

CHAPTER ER III. ENTROPY FOR RESTRICIED CLASSES.

5. Assume t = Ln/k] and xin = w(1)... w(t)r, where each w(i) has length k. Let R.(x7 , k, c) be the set of all kt-sequences that can be obtained by changing an e(log n)I(h+ fraction of the w(i)'s and permuting their order. Show that if k(n) oc, almost surely. e), and p, is i.i.d., then ,u(R.(x7, k(n), c)) > 1, as n 6. Show that the preceding result holds for weak Bernoulli processes. 7. Let {Yi : j > 1} be a sequence of finite-valued random variables, let E(m) be the a-algebra determined by , Ym, and let E be the smallest complete a-algebra containing Um E(m). For each x let Pm (X) be the atom of E(m) that contains x. (a) Show that
E(glE)(x) = lim 1

ni -+cx) 11 (Pm(x)) fp,(,)

g dp,, a. s.,

for any integrable function g. (Hint: use the martingale theorem.) (b) Show that the sequence fk defined in the proof of Lemma 111.2.8 is indeed x4) to evaluate a martingale. (Hint: use the previous result with Yi = E(fk+ilEk).)

Section III.3 The d̄-admissibility problem.

The d̄-metric forms of the admissibility results of the preceding section are also of interest. A sequence {k(n)} is called d̄-admissible for the ergodic process μ if

(1)   lim_{n→∞} d̄_{k(n)}(p_{k(n)}(· | x_1^n), μ_{k(n)}) = 0, almost surely.

The definition of d̄-admissible in probability is obtained by replacing almost-sure convergence by convergence in probability. The earlier concept of admissible will now be called variationally admissible. Since the bound

(2)   d̄_n(μ, ν) ≤ (1/2) Σ_{a_1^n} |μ(a_1^n) - ν(a_1^n)|

always holds, a variationally-admissible sequence is also d̄-admissible. The principal positive result is

Theorem III.3.1 (The d̄-admissibility theorem.) If k(n) ≤ (log n)/h then {k(n)} is d̄-admissible for any process of entropy h which is a stationary coding of an i.i.d. process.

Stated in a somewhat different form, the theorem was first proved by Ornstein and Weiss, [52]. A similar result holds with the nonoverlapping-block empirical distribution in place of the overlapping-block empirical distribution, though only the latter will be discussed here. Note that in the d̄-metric case, the admissibility theorem only requires that k(n) ≤ (log n)/h, while in the variational case the condition k(n) ≤ (log n)/(h + ε) was needed. The d̄-admissibility theorem has a fairly simple proof, based on the empirical entropy theorem, Theorem II.3.1, and a deep property, called the finitely determined property,

SECTION IIL3. THE 15-ADMISSIBILITY PROBLEM.

185

which holds for stationary codings of i.i.d. processes, and which will be proved in Section IV.2. A proof of the d-admissibility theorem, assuming this finitely determined property, is given in Section III.3.a. The negative results are two-fold. One is the simple fact that if n is too short relative to k, then a small neighborhood of the empirical universe of k-blocks in an n-sequence is too small to allow admissibility. The other is a much deeper result asserting in a very strong way that no sequence can be admissible for every ergodic process. Theorem 111.3.2 (The d-nonadmissibility theorem.) If k(n) > (log n)/(h c), then {k(n)} is not admissible for any ergodic process of entropy h. Theorem 111.3.3 (The strong-nonadmissibility theorem.) For any nondecreasing unbounded sequence {k(n)} and 0 < a < 1/2, there is an ergodic process p, such that liminf dk( fl )(pk( fl )(.
n--*oo 111(n)) > a, almost surely

The d-nonadmissibility theorem is a consequence of the fact that no more than sequences of length k(n) can occur in a sequence of length n < so that at most 2k(n )(h-E/2) sequences can be within 3 of one of them, provided 3 is small enough. It will be proved in the next subsection. A weaker, subsequence version of strong nonadmissibility was first given in [52], then extended to the form given here in [50]. These results will be discussed in a later subsection.

I11.3.a Admissibility and nonadmissibility proofs.


To establish the d-nonadmissibility theorem, define the (empirical) universe of kblocks of x'11 ,
14(4) =

k i+k-1 = a, tai :x i

some i E [0, n k +1]} ,

and its 6-blowup dk(4,16(4)) < 81. If k > (log n)/(h c) then Itik (4)1 _< 2k(h-E), for any x'11 . If 3 is small enough then for all k, 1[14(4)]s I < 2 k(" 12) by the blowup-bound, Lemma 1.7.5, and hence, intersecting with the entropy-typical set, Tk = 11,(X1) < k(hE/4) 1 1 , produces
Ak [11k(Xn)t5
(

[14(4)16 =

n I, < 2k(h-02) 2-ch-E/4) <2 -k/4

The entropy theorem implies that it(Tk) > 1, so that ttaik(x)is) < 6, if k is large enough. But if this holds then
dk(/ik, pk( 14)) > 82,

186

CHAPTER III. ENTROPY FOR RESTRICTED CLASSES.

for otherwise Lemma I.9.14(a) gives /Lk aik (x')13 ) > 1 28, since pk(tik(fit)14) = 1. This completes the proof of the d-nonadmissibility theorem, Theorem 111.3.2. The finitely determined property asserts that a process has the finitely determined property if any process close enough in joint distribution and entropy must also be dclose. An equivalent finite form, see Theorem IV.2.2, which is more suitable for use here and in Section 111.4, is expressed in terms of the averaging of finite distributions. If y is a measure on An, and in < n, the v-distribution of m-blocks in n-blocks is the measure 0, = 0(m, y) on Am defined by
b m (a) =

1
n m -I- l

n-m

EE i=0 uEiv veAnn- ,

v(uaT v),

that is, the average of the y-probability of din over all except the final m 1 starting positions in sequences of length n. An alternative expression is Om (ain )

E Pm (ain14)v(x) = Ev (pm (ar x7)), .x7

which makes it clear that 4)(m, v n ) = ym for stationary v. A stationary process t is finitely determined if for any c > 0 there is a 6 > 0 and positive integers m and K such that if k > K then any measure y on Ak which satisfies the two conditions
-(rn, v)1 <8,

11-1(14) H(v)I <k6,


must also satisfy cik(iik, vk) < E. The important fact needed for the current discussion is that a stationary coding of an i.i.d. process is finitely determined, see Section IV.2. The d-admissibility theorem is proved by taking y = plc ( I4), and noting that the sequence {k(n)} can be assumed to be unbounded, since the bounded result is true for any ergodic process, by the ergodic theorem. Thus, by the definition of finitely determined, it is sufficient to prove that if {k(n)} is unbounded and k(n) < (11h) log n, then, almost surely as n oc, the following hold for any ergodic process p. of entropy h.

(a) 10 (in, Pk(n)(. 14))


(b) (1/k(n))H(p k(n) (.

1 > 0, for each m.

--- h.

Convergence of entropy, fact (b), is really just the empirical-entropy theorem, Theorem 11.3.1, for it asserts that (1 /k)H(pke h = 11(A), almost surely, as k and n go to infinity, provided only that k < (11h) log n. To establish convergence in m-block distribution, fact (a), note that the averaged distribution 0(m, Pk( WO) is almost the same as pn,( Ixrn), the only difference being that the former gives less than full weight to those m-blocks that start in the initial k 1 places or end in the final k 1 places xn , , a negligible effect if n is large enough. The desired result then follows since pni (an i 'lxri') > A(an i z), almost surely, by the ergodic theorem. This completes the proof of the d-admissibility theorem, modulo, of course, the proof that stationary codings of processes are finitely determined. 111

SECTION 111.3. THE D-ADMISSIBILITY PROBLEM.

187

III.3.b Strong-nonadmissibility examples.


Throughout this section {k(n)} denotes a fixed, nondecreasing, unbounded sequence of positive integers. Associated with the window function k(.) is the path-length function, n(k) = max{n: k(n) < k}. Let
Bk(a) = ix

) : dic(Pk Ix), Ak) <

I.

It is enough to show that for any 0 <a < 1/2 there is an ergodic measure be, for which
(3) lim p, (U Bk(a)) = 0
k>K

The construction is easy if ergodicity is not an issue, and k(n) < log n, for one can take p, to be the average of a large number, {p (J ) : j < J } , of {k(n)}-d-admissible ergodic processes which are mutually a' apart in d, where a'(1 1/J) > a. A given x is typical x) n(k) i is mostly concentrated on for at most one component AO ) , in which case pk( /2,W-typical sequences, and hence
lim infdk (pk (. 14 (k) ), Ak) > a'(1 1/J),
k- *oc

almost surely, see Exercise 1. This simple nonergodic example suggests the basic idea: build an ergodic process whose n-length sample paths, when viewed through a window of length k(n), look as though they are drawn from a large number of mutually far-apart ergodic processes. This must happen for every value of n, yet at the same time, merging of these far-apart processes must be taking place in order to obtain a final ergodic process. The trick is to do the merging itself in different ways on different parts of the space, keeping the almost the same separation at the same time as the merging is happening. The tool for managing all this is cutting and stacking, which is well-suited to the tasks of both merging and separation. Before beginning the construction, some discussion of the relation between dk-far-far-apart measures is in order. The easiest way to make measures apart sequences and dk d-far-apart is to make supports disjoint, even when blown-up by a. Two sets C, D c A k are said to be a separated, if C n [D]ce = 0, where, as before,
-

[Dia =

i, y k < dk (x k )

for some yk i E D}

is the a-blowup of D. Thus, a-separated means that at least ak changes must be made in a member of C to produce a member of D. The support a (i) of a measure pc on Ak is the set of all x i` for which ,u(x) > 0. Two measures 1..t and y on A k are said to be a separated if their supports are a-separated. A simple, but important fact, is
-

(4)

If 1..c and y are a-separated, then cik(p, y) > a.

Indeed, if X is any joining of two a-separated measures, 1.1 and y, then E(dk (x li` , yf)) ce),.(a (p) x a (v)) = ce.

The result (4) extends to averages of separated families in the following form.

188

CHAPTER III. ENTROPY FOR RESTRICTED CLASSES.

Lemma 111.3.4 If it = (1/J)Ei v i is the average of a family {v .,: j < J} of pairwise a-separated measures, then d(p., v) > a(1 1/J), for any measure v whose support is entirely contained in the support of one of the v i .
Proof The a-separation condition implies that the a-blowup of the support of yi does j, and hence not meet the support of yi , for i
it([a(vma)

= E vi ([a(v.)]a) = v jag (v i)la) = 7


j i=1

if

Thus, if the support of y is contained in the support of y, then ,u,([a(v)]a ) < 1/J, and hence EAdk(4, yl`)) v(a (0) [1 AGa(v)icen a(1 1/.1), for any joining . of p, and y, from which the lemma follows. With these simple preliminary ideas in mind, the basic constructions can begin. A weaker liminf result will be described in the next subsection, after which a sketch of the modifications needed to produce the stronger result (3) will be given.

III.3.b.1

A limit inferior result.

Here it will be shown, in detail, how to produce an ergodic p. and an increasing sequence {km } such that lim p(B km (a)) = O. (5) The construction, based on [52], illustrates most of the ideas involved and is a bit easier to understand than the one used to obtain the full limit result, (3). To achieve (5), a complete sequence {8(m)} of column structures and an increasing sequence {km } will be constructed such that a randomly chosen point x which does not lie in the top n(km )-levels of any column of S(m) must satisfy
(6) dk, (Pkm (- x ' "5, 11k,,) > am ,

where xi is the label of the interval containing T i-l x, for 1 < i < n(k,n ). The sequence am will decrease to 0 and the total measure of the top n(km )-levels of S(m) will be summable in m, guaranteeing that (5) holds. Furthermore, ergodicity will be guaranteed by making sure that S(m) and S(m 1) are sufficiently independent. Lemma 111.3.4 suggests a strategy for making (6) hold, namely, take S(m) to be the union of a disjoint collection of column structures {Si (m): j < Jm } such that any km blockwhich appears in the name of any column of Si (m) must be at least am apart from any km -block which appears in the name of any column of 5, (m) for i j. If all the columns in S(m) have the same width and height am), and if the cardinality of Si (m) is constant in j, then, conditioned on being n(km )-levels below the top, the measure is the average of the conditional measures on the separate Si (m). Thus if f(m)In(k m ) is sunu-nable in m, then Lemma 111.3.4 guarantees that the d-property (6) will indeed hold. Some terminology will be helpful. A column structure S is said to be uniform if its columns have the same height f(S) and width. The k-block universe Li(S) of a column

SECTION III.3. THE 13-ADMISSIBILITY PROBLEM.

189

structure S is the set of all a l' that appear as a block of consecutive symbols in the name of any column in S. Disjoint column structures S and S' are said to be (a, k)-separated if their k-block universes are a-separated. To summarize the discussion up to this point, the goal (6) can be achieved with an ergodic measure for a given a E (0, 1/2) by constructing a complete sequence {S(m)} of column structures and an increasing sequence {k m } with the following properties. (Al) S(m) is uniform with height am) > 2mn(k,n ).

(A2) S(m) and S(m

1) are 2m-independent.

(A3) S(m) is a union of a disjoint family {S) (m): j < J} of pairwise (am , km )separated column structures, for which the cardinality of Si (m) is constant in j, oc. and for which an2 (1 1/J,n ) decreases to a, as m
Since all the columns of 5(m) will have the same width and height, and there is no loss in assuming that distinct columns have the same name, the simpler concatenation language will be used in place of cutting and stacking language. For this reason S(m) will be taken to be a subset of A" ) . Conversion back to cutting and stacking language is achieved by replacing S(m) by its columnar representation with all columns equally likely, and concatenations of sets by independent cutting and stacking of appropriate copies of the corresponding column structures. There are many ways to force the initial stage 8 (1) to have property (A3). The real problem is how to go from stage m to stage ni+1 so that separation holds, yet asymptotic independence is guaranteed. This is, of course, the simultaneous merging and separation problem. The following four sequences suggest a way to do this.

(7)

= 01010...0101... 01 = (01) 64 b = 00001111000011110... 01111 = (0 414)16 c = 000000000000000001 11 = (016 116)4 d = 000... 11 = ( 064164) 1
a

These sequences are created by concatenating the two symbols, 0 and 1, using a rapidly increasing period from one sequence to the next. Each sequence has the same frequency of occurrence of 0 and 1, so first-order frequencies are good, yet if a block xr of 32 consecutive symbols is drawn from one of them and a block yr is drawn from another, then d32(4 2 , yr) > 1/2. The construction (7) suggests a way to merge while keeping separation, namely, concatenate blocks according to a rule that specifies to which set the m-th block in the concatenation belongs. If one starts with enough far-apart sets, then by using cyclical rules with rapidly growing periods many different sets can be produced that are almost as far apart. The idea of merging rule is formalized as follows. For M divisible by J, an (M, J)merging rule is a function

..., MI
whose level sets are constant, that is,

{1, 2, ..., J } ,

10 -1 (DI = M/J, j <J.

190

CHAPTER III. ENTROPY FOR RESTRICTED CLASSES.

An (M, J)-merging rule 4), when applied to a collection {Si: j E J } of disjoint subsets of A produces the subset S(0) = S(0, {Sj }) of A m', formed by concatenating the sets in the order specified by q, that is, the direct product

S()

in =1

fl

5rn = 1. 0 *(2)

w (m ) :

W(M) E

So(,n ), m

E [1, /MI.

In other words, 8(0) is the set of all concatenations w(1) w(M) that can be formed by selecting w(m) from the 0(m)-th member of {S1 } , for each m. The set S(0) is called the 0-merging of {Si : j E f }. In general, a merging of the collection {Sj : j E J } is just an (M, J)-merging of the collection for some M. The two important properties of this merging idea are that each factor Si appears exactly M1.1 times, and, given that a block comes from Sj , it is equally likely to be any member of Si . In cutting and stacking language, each Si is cut into exactly MIJ copies and these are independently cut and stacked in the order specified by 0. Use of this type of merging at each stage will insure asymptotic independence. Cyclical merging rules are defined as follows. Assume p divides M1.1 and J divides p. The merging rule defined by the two conditions
(i) (ii) 0(m)= j, .P n

.0(rn np) = 0(m), m E [1, p1, 0

is called the cyclic rule with period p. The desired "cyclical rules with rapidly growing periods" are obtained as follows. Let expb 0 denote the exponential function with base b. Given J and J*, let
M = exp ., (exp2 (J* 1))

and for each t E [1, .11, let (/), be the cyclic (M, J)-merging rule with period pt = exp j (exp2 (t 1)). The family {Or : t < PI is called the canonical family of cyclical merging rules defined by J and J. When applied to a collection {Si : j E J } of disjoint subsets of A', the canonical family {0i } produces the collection {S(0)} of disjoint subsets of A mt . The key to the construction is that if J is large enough, then the new collection will be almost as well separated as the old collection. In proving this it is somewhat easier to use a stronger infinite concatenation form of the separation idea. The full k-block universe of S C A t is the set
tik (S oe ) = fak i : ak i = x:+k-1 , for some x E S" and some i > 11.

Two subsets S, S' c At are (a, K)-strongly-separated if their full k-block universes are a-separated for any k > K. The following lemma, stated in a form suitable for iteration, is the key to producing an ergodic measure with the desired property (5).

Lemma 111.3.5 (The merging/separation lemma.) Given 0 < at < a < 1/2, there is a Jo = Jo(a, at) such that if J > Jo and {S J : j < .1} is a collection of pairwise (a, K)-strongly-separated subsets of A t of the saine cardinalit-v, then for any J* there is a K* and an V, and a collection {S7: t < J* } of subsets of 2V * of equal cardinalitv, such that

SECTION IIL3. THE D-ADMISSIBILITY PROBLEM. (a) {St*: t < J*} is pairwise (a*, K*)-strongly-separated.
(13) Each S7 is a merging of {Si: j
E

191

J}.

Proof Without loss of generality it can be supposed that K < f, for otherwise {Si : j < J } can be replaced by {S7: j E J } for N > Klt, without losing (a, K)-strongseparation, since (87)' = ST, for any j, and, furthermore, any merging of {S7 } is also a merging of {S}. Let {Or : t < J* } be the canonical family of cyclical merging rules defined by J and J*, and for each t, let S7 = S(0,). Part (b) is certainly true, so it only remains to show that if J is large enough, then K* can be chosen so that property (a) holds. Suppose x E (S;`)c and y E (S)oc where s > t. The condition s > t implies that there are integers n and m such that m is divisible by nJ and such that x = b(1)b(2) where each b(i) is a concatenation of the form
(8) w(l)w(2) w(J), w(j)
E

Sr

1 < j < J,

while y = c(1)c(2) where each c(i) has the form


(9) v(1)v(2) v(J),

v(j)E ST,

1 _< j _< J.

Furthermore, each block v(j) is at least as long as each concatenation (8), so that if u+(.1-1)nt-1-1 is a subblock of such a v(j), and Yu

x v(J-Unf+1 = w(t
where w(k) E j, and hence
(10)

1 )w (t + 2) w(J)w(1) w(t)

S, for each k,

then there can be at most one index k which is equal to 1

d u_one (xvv-i-(J-1)ne+1 y u ( J 1 ) n f + 1 ) > a

by the definition of (a, K)-strongly-separated and the assumption that K < E. Since all but a limiting (1/J)-fraction of y is covered by such y u u+(j-1)nt+1 , and since J* is finite, the collection {St*: t < J* } is indeed pairwise (a*, K*)-strongly-separated, for all large enough K*, provided J is large enough. This completes the proof of Lemma 111.3.5. LI
The construction of the desired sequence {S(m)} is now carried out by induction. In concatenation language, this may be described as follows. Fix a decreasing sequence {a } such that a < ant < 1/2, and, for each ni, choose J, > JOY .--m,Ottn-1-1) from 111.3.5. Suppose m have been determined so that the followLemma S(m) C and k ing hold.

Aar

(B1) S(m) is a disjoint union of a collection {S I (m): j < strongly-separated sets of the same cardinality.

4 } of pairwise (an kni )-

(B2) Each Si (m) is a merging of {S (m 1): j < 4_ 1 }. (B3) 2n(k) <t(m).

192

CHAPTER III. ENTROPY FOR RESTRICTED CLASSES.

Define k 1+ 1 to be the K* of Lemma 111.3.5 for the family {Si (m): j < J,n } with am = a and ce,n+ 1 = a*. Let {S'tk: t < Jrn+1 } be the collection given by Lemma 111.3.5 for J* = J,, 1 . By replacing S;' by (S7) N for some suitably large N, if necessary, it can be supposed that 2m+ l n(kn1+ i) < t(S,*). Put Si (m + 1) = S, j < J,n+ 1. The union S(m + 1) = Ui Si (m + 1) then has properties (Bi), (B2), and (B3), for m replaced by

m +1.
Since t(m) --> cc, by (B3), the sequence {S(m)} defines a stationary A-valued measure it by the formula

, lc \ _ lim [ga l

E IS(m)1 t(m),s(m)
1

ilxiam)),

the analogue of Theorem 1.10.4 in this case where all columns have the same height and width. Furthermore, the measure p, is ergodic since the definition of merging implies that if M is large relative to m, then each member of S(m) appears in most members of S(M) with frequency almost equal to 1/1S(m)1. Since (6) clearly holds, and En, n(lcm )I f(m) < cc, the measure of the bad set ,u(Bk m (a)) is summable in m, establishing the desired 1=1 goal (5).

III.3.b.2

The general case.

To obtain the stronger property (3) it is necessary to control what happens in the interval k m < k < km+1. This is accomplished by doing the merging in separate intermediate steps, on each of which only a small fraction of the space is merged. At each intermediate step, prior separation properties are retained on the unmerged part of the space until the somewhat smaller separation for longer blocks is obtained. Only a brief sketch of the ideas will be given here; for the complete proof see [50]. The new construction goes from S(m) to S(m + 1) through a sequence of Jm+i intermediate steps

S(m, 0)

S(m, 1)

S(m, J 1+1),

merging only a (1/4, 1 )-fraction of the space at each step. All columns at any substage have the same height, but two widths are possible, one on the part waiting to be merged at subsequent steps, and the other on the part already merged, and hence cutting and stacking language, rather than simple concatenation language, is needed. The only merging idea used is that of cyclical merging of { yi : j < J} with period J, which is just the independent cutting and stacking in the order given, that is, V1 * V2 * * V j As in the earlier construction, the structure S(m) = S(m, 0) is a union of pairwise km )-strongly separated substructures {Si (m, 0): j < 41 all of whose columns have the same width and the same height t(m) = (m, 0). In the first substage, each Si (nt, 0) is cut into two copies, S; and Sj ", where
}

(i)

)1/4.(St i) = (1

1/ Ji)x(S; (m, 0)).

(ii) x(S7) = ( i/J,n + i)x(S,(m, 0)).


Cyclical merging with period is applied to the collection {57 } to produce a new column structure R 1 (m, 1). Meanwhile, Jm+i -fold independent cutting and stacking is

SECTION III.3.

THE D-ADMISSIBILITY PROBLEM.

193

applied separately to each to obtain a new column structure L i (m, 1). Note, by the way, that the columns of R I (m, 1) have the same height as those of L i (m, 1), but are only 1/J1 ..E. 1 as wide. The collection {,C i (m, 1): j < J} is pairwise (a,,,, k m )-strongly-separated, since neither cutting into copies nor independent cutting and stacking applied to separate Ej(m, 1) affects this property. An argument similar to the one used to prove the merging/separation lemma, Lemma 111.3.5, can be used to show that if am) is long enough and J, large enough, then for a suitable choice of an, > 8(m, 1) > a, 1 , there is a k(m, 1) > k m such that each unmerged part j (m, 1) is (8(m, 1), k(m, 1))-strongly separated from the merged part R1 (m, 1). In particular, since strong-separation for any value of k implies strong-separation for all larger values, the following holds. (a) The collection S(m, 1) = {L i (m, t): j < J } U {R 1 (m, t))} is (p(m, 1), k (m , 1))strongly-separated. Each of the separate substructures can be extended by applying independent cutting and stacking to it, which makes each substructure longer without changing any of the separation properties. Thus if M(m, 1) is large enough and each L i (m, 1) is replaced by its M(m, 1)-fold independent cutting and stacking, and R(m, 1) is replaced by its M(m, 1)-fold independent cutting and stacking, then {L i (m, 1): j < J} remains pairwise (am , km )-strongly-separated, and (a) continues to hold. But, if M(m, 1) is chosen large enough, then it can be assumed that i(m, 1) > n(k(m, 1))4 + 1, which, in turn, implies the following. (b) If x and y are picked at random in S(m, 1), then for any k in the range km < k < k(m, 1), the probability that ak(pk(.14 (k) ), pk(.Iy in(k) )) > a is at least (1 2/4 + 0 2 (1 1/.0. This is because, for the stated range of k-values, the k-block universes of xrk) and y n(k) i are at least a-apart, if they lie at least n(k(m, 1))-levels below the top and in different L i (m, 1), an event of probability at least (1 2/J,n+ i) 2 (1-1/0, since R,(m, 1) has measure 1/4 +1 and the top n(k(m, 1))-levels have measure at most n(k(m, 1))/(m, 1) < 1/4 + 1, and since the probability that both lie in in the same L i (m, 1) is upper bounded by 1/./r , . The merging-of-a-fraction idea can now be applied to S(m, 1), merging a copy of {L i (m, 1): j < J} of measure 1/4 +1 , then applying enough independent cutting and stacking to it and to the separate pieces to achieve almost the same separation for the separate structures, while making each structure so long that (b) holds for k(m, 1) < k < k(m, 2). After 4 +1 iterations, the unmerged part has disappeared, and the whole construction can be applied anew to S(m 1) = S(m, J,..E.1), producing in the end an ergodic process p, for which lim ,u(Bk (a)) = O. This is, of course, weaker than the desired almost-sure result, but a more careful look shows that if k(m, t) <k < k(m, t + 1), then k-blocks drawn from the unmerged part or previously merged parts must be far-apart from a large fraction of k-blocks drawn from the part merged at stage t 1, and this is enough to obtain (12)
ni

k,<k<k,4_]

Bk ) < DC

'

194

CHAPTER III. ENTROPY FOR RESTRICTED CLASSES.

for suitable choice of the sequence {40, which is, in turn, enough to guarantee the desired result (3).

Remark 111.3.6 It is important to note here that enough separate independent cutting and stacking
can always be done at each stage to make i(m) grow arbitrarily rapidly, relative to km . This fact will be applied in I11.5.e to construct waiting-time counterexamples.

Remark 111.3.7
By starting with a well-separated collection of n-sequences of cardinality more than one can obtain a final process of positive entropy h. In particular, this shows the existence of an ergodic process for which the sequence k(n) = [(log n)/ hi is not a-admissible.

I11.3.c Exercises.
1. Let p, = (1/J)Ei yi, where each yi is ergodic, cik(Pk(.1x n i (k) ), (Yi)k) surely, and d(yi , yi ) > a, for i j. Show that

0, almost

lim infak(Pk(. ixin(k) ), k-4, 00

>

1/f),

almost surely. (Hint: use Exercise 11, Section 1.9, and Lemma 111.3.4.) 2. Show that if "if is the concatenated-block process defined by a measure y on An, then 10(k, y) 1511 5 - . 2(k 1)/n.

Section 111.4

Blowing-up properties.

An interesting property closely connected to entropy ideas, called the blowing-up property, has recently been shown to hold for i.i.d. processes, for aperiodic Markov sources, and for other processes of interest, including a large family called the finitary processes. Informally, a stationary process has the blowing-up property if sets of n-sequences that are not exponentially too small in probability have a large blowup. Processes with the blowing-up property are characterized as those processes that have exponential rates of convergence for frequencies and entropy and are stationary codings of i.i.d. processes. A slightly weaker concept, called the almost blowing-up property is, in fact, equivalent to being a stationary coding of an i.i.d. process. The blowing-up property and related ideas will be introduced in this section. A full discussion of the connections between blowing-up properties and stationary codings of i.i.d. processes is delayed to Chapter 4. If C c An then [C], denotes the 6-neighborhood (or 6-blowup) of C, that is

[C], = {b7: dn (a7, b7) < E, for some ariL

E C} .

An ergodic process i has the blowing-up property (B UP) if given E > 0 there is a 8 > 0 and an N such that if n > N then p,([C],) > 1 E, for any subset C c An for which

11 (C)

The following theorem characterizes those processes with the blowing-up property.

2_

SECTION 111.4. BLOWING-UP PROPERTIES.

195

Theorem 111.4.1 (Blowing-up property characterization.) A stationary process has the blowing-up property if and only if it is a stationary coding of an i.i.d. process and has exponential rates of convergence for frequencies and entropy.

Note, in particular, that the theorem asserts that an i.i.d. process has the blowing-up property. The proof of this fact as well as most of the proof of the theorem will be delayed to Chapter 4. The fact that processes with the blowing-up property have exponential rates will be established later in the section. A particular kind of stationary coding, called finitary coding, does preserve the blowing-up property. A stationary coding F: A z i- B z is said to be finitary, relative to an ergodic process ii, if there is a nonnegative, almost surely finite integer-valued measurable function w(x), called the window function, such that, for almost every x E A z , the time-zero encoder f satisfies the following condition. If i
E

A Z and x -w(x) w(x) w(x) then f (x ) = f Gip . .i -w(x)

A finitary coding of an i.i.d. process is called a finitary process. It is known that aperiodic Markov chains and finite-state processes, in-dependent processes, and some renewal processes are finitary, [28, 79]. Later in this section the following will be proved.
Theorem 111.4.2 (The finitary-coding theorem.) Finitary coding preserves the blowing-up property.

In particular, since i.i.d. processes have the blowing-up property it follows that finitary processes have the blowing-up property. Not every stationary coding of an i.i.d. process has the blowing-up property. The difficulty is that only sets of sequences that are mostly both frequency and entropy typical can possibly have a large blowup, and, without exponential rates there can be sets that are not exponentially too small yet fail to contain any typical sequences. Once these are suitably removed, however, then a blowing-up property will hold for an arbitrary stationary coding of an i.i.d. process. A set B c A k has the (8, 0-blowing-up property if p,([C],) > 1 E, for any subset C C B for which ,a(C) > 2-18 . An ergodic process p. has the almost blowing-up property (ABUP) if for each k there is a set Bk C A k such that the following hold. (i) xlic
E Bk,

eventually almost surely.

(ii) For any 6 > 0 there is a 8 > 0 and a K such that Bk has the (8, 0-blowing-up property for k > K.
Theorem 111.4.3 (Almost blowing-up characterization.) A stationary process has the almost blowing-up property if and only if it is a stationary coding of an i.i.d. process.

By borrowing one concept from Chapter 4, it will be shown later in this section that a stationary coding of an i.i.d. process has the almost blowing-up property, a result that will be used in the waiting-time discussion in the next section. A proof that a process with the almost blowing-up property is a stationary coding of an i.i.d. process will be given in Section IV.3.c.

196

CHAPTER III. ENTROPY FOR RESTRICTED CLASSES.

III.4.a Blowing-up implies exponential rates.


Suppose ,u has the blowing-up property. First it will be shown that p, has exponential rates for frequencies. The idea of the proof is that if the set of sequences with bad frequencies does not have exponentially small measure then it can be blown up by a small amount to get a set of large measure. If the amount of blowup is small enough, however, then frequencies won't change much and hence the blowup would produce a set of large measure all of whose members have bad frequencies, contradicting the ergodic theorem. To fill in the details let pk ( WI') denote the empirical distribution of overlapping k blocks, and define B(n, k, E) = 1Pk( 14) IkI > El. Note that if y = E/(2k + 1) then d(4, so that, in particular, (1)
Ipk(

14)

pk(

1)1)1

6/2,

[B(n, k,

B(n, k, E/2).

The blowing-up property provides 8 > 0 and N so that if n>N,Cc An, and p,(C) > 2 -611 then ii,([C]y ) > 1 E. If, however, n > N and ,u n (B(n, k, E)) > 2 -6" then ([B(n, k, E)],) would have to be at least 1 E, and hence (1) would force the measure p n (B(n, k, E/2)) to be at least 1 E. But this cannot be true for all large n since the ergodic theorem guarantees that lim n p n (B(n, k, E/2)) = 0 for each fixed k and E. Thus p n (B(n, k, c)) < 2 -6n , for all sufficiently large n. Next it will be shown that blowing-up implies exponential rates for entropy. One part of this is easy, for there cannot be too many sequences whose measure is too large, hence a small blowup of such a set cannot possibly produce enough sequences to cover a large fraction of the measure. To fill in the details of this part of the argument, let
B . 0,
-

= trn i: iun (x n i)>

so that B* (n, 6)1 < 2n(h ), and hence there is an a > 0 such that 1[B* (n, o ] l < for all n. Intersecting with the set Tn =
11.naB * 2n(hE/2) ,

,u(4)

2 ' _E/4)} gives

(n,

< 2n(hE/2)2 n(hE14) < 2nEl 4 , n In) _

SO p([B* (n, > 0, since ,u(T,) > 1, by the entropy theorem. The blowing-up property provides 6 and N so that if n>N,Cc A n , and ti,(C) > 2 -6 " then ,u([C],) > 1 a. Since ,u n ([B*(n, E)1 0 ) 0 it follows that ,u(B*(n, E)) must be less than 2 -8n, for all n sufficiently large. An exponential bound for the measure of the set
B.(n, 6 . )

fx n i: tin(x n i) <

of sequences of too-small probability is a bit trickier to obtain, but depends only on the existence of exponential rates for frequencies, since this implies that if n is sufficiently

SECTION 111.4. BLOWING-UP PROPERTIES.

197

large then, except for a set of exponentially small probability, most of xPli will be covered by k-blocks whose measure is about 2 -kh. As in the proof of the entropy theorem, this gives an exponential bound on the number of such x, which in turn means that it is exponentially very unlikely that such an can have probability much smaller than 2 -nh. To fill in the details of the preceding argument apply the entropy theorem to choose k so large that Ak(B(k, 6/4)) < a, where a will be specified in a moment. For n > k let = pk(B.(k, E /4)
< 2al,

that is, the set of n-sequences that are (1 20-covered by the complement of B(k, 6/4). Since the complement of 13(k, 6/4) has cardinality at most 2 k(h+e/4) , the built-up set bound, Lemma 1.7.6, implies that there is an a > 0 and an N such that if n > N, then iGn < 2n(h+E/2), and hence
tin(Gn

n 13.(n, 6)) < IGn12n(h+E) < 2ne/2

> N

Since exponential rates have already been established for frequencies, there is a 6 > 0 and an N 1 > N such that p, n (G n ) > 1 2 -6n, if n > N I . Thus + 2ne/2, n > N1, ,un(B.(n, E)) < 2_8n which gives the desired exponential bound. I=1

III.4.b Finitary coding and blowing-up.


The finitary-coding theorem will now be established. Fix E > 0 and a process p with AZ, the blowing-up property. Let be a finitary coding of ,u, with full encoder F: B z time-zero encoder f: B z A, and window function w(x), x E B Z . Thus f (x) depends only on the values x ww (x()x) , where w(x) is almost surely finite, and hence there is a k such that if Gk is the set of all x k k such that w(x) < k, then p. (Gk) > 1 6 2 . In particular, the set T = {x n I k 1 : P2k+1(GkiX n jc- k i) > 1 61, satisfies ,u(T) > 1 c by the Markov inequality. Suppose C c An satisfies v(C) > 2 -h 6 , where 6 > O. Let D be the projection of F-1 C onto B" HI;_E k 1 , and note that ,u(D) > 2 -h 8 . Thus, since k is fixed and ,u has the blowing-up property, 6 can be assumed to be so small and n so large that if y = E / (2k + 1) then ,u([D]y ) > 1 E. Now consider a sequence xhti. k 1 E [D]y n T. For such a sequence (1 c)n of its (2k + 1)-blocks belong to Gk, and, moreover, there is a sequence i n tE k E D such that
dn+2k(X
,

1) < Y

2k -I- 1

Thus fewer than (2k + 1)yn < cn of the (2k + 1)-blocks in x h k k I can differ from the corresponding block in i1Hk4 k . In particular, there is at least a (1 20-fraction of (2k -I- 1)-blocks in xn -tf k that belong to Gk and, at the same time agree entirely with the corresponding block in imHk-k

198

CHAPTER III. ENTROPY FOR RESTRICTED CLASSES.

c+ k i . int+ ki, and put z = . x ntf k 19 and r-iChoose y and 5, E D such that Yi F(y), Z = F(Y- ). The sequence Z7 belongs to C and cln (z7 , Z?) < 26, so that Z7 E [C]2 E . This proves that
vn ([ che
)

n T) > 1 2,
LII

which completes the proof of the finitary-coding theorem.

I11.4.c Almost blowing-up and stationary coding.


In this section it is shown that stationary codings of i.i.d. processes have the almost blowing-up property (ABUP). Towards this end a simple connection between the blowup of a set and the 4-distance concept will be needed. Let ,u,1 (' IC) denote the conditional measure defined by the set C c An. Also let c/,(4, C) denote the distance from x to C, that is, the minimum of dn (4 , ylz), y E C. Lemma 111.4.4 If d(1-t, Ane IC <2 then 1,1,([C]e ) > 1 E. Proof The relation

(2)

vn EC
holds for any joining X of A n and A n ( IC), and hence EA(Cin(X til C)) < (LW, An( IC)). Thus if (LW, ktn( . IC)) < 62 , then, by the Markov inequality, the set of 4 such that dn (xit , C) > E has p, measure less than E, that is, 4([C]) > 1 E. This proves the lemma.

Yil )dn(.4 )1) > P(4)dn(4 , C).

The proof that stationary codings of i.i.d. process have the almost blowing-up property makes use of the fact that such coded processes have the finitely determined property, a property introduced to prove the d-admissibility theorem in Section 111.3, and which will be discussed in detail in Section IV.2. An ergodic process p. is finitely determined (FD) if given E > 0, there is a 3 > 0 and positive integers k and N such that if n > N then any measure y on An which satisfies the two conditions (a) Luk

0*(k, v)I < 8,

(b) 11-1(u n ) H(v)I < u8,


<e, where 4)(k, v) is the measure on Ak obtained by averaging must also satisfy the y-probability of k-blocks over all starting positions in sequences of length n, that is,
, )

(3)

Ca li) =

E P (a14)v(x) = Ev(Pk(a l'IX7)).


k

This simply a translation of the definition of Section III.3.a into the notation used here.

SECTION 111.4. BLOWING-UP PROPERTIES. Theorem 111.4.5 (FD implies ABUP.) A finitely determined process has the almost blowing-up property

199

Proof Fix a finitely determined process A. The basic idea is that if C c A n consists of sequences that have good k-block frequencies and probabilities roughly equal to and i(C) is not exponentially too small, then lin e IC) will have (averaged) k-block probabilities close to ,uk and entropy close to H(i). The finitely determined property then implies that gn e IC) and ,un are d-close, which, by Lemma 111.4.4, implies that C has a large blowup. The first step in the rigorous proof is to remove the sequences that are not suitably frequency and entropy typical. A sequence x i" will be called (k, a)-frequency-typical if I Pk (. I xi ) ,uk I < a, where, as usual, Pk( Ifiz) is the empirical distribution of overlapping k-blocks in xri'. A sequence 4 will be called entropy-typical, relative to a, if
tt(x til ) < 2- 11 (4)+na .

Note that in this setting n-th order entropy, Kan), is used, rather than the usual nh in the definition of entropy-typical, which is all right since H(t)/n > h. Let Bn (k, a) be the set of n-sequences that are both entropy-typical, relative to a, and (j, a)-frequency-typical for all j < k. If k and a are fixed, then xi E Bn(k, a), eventually almost surely, so there is a nondecreasing, unbounded sequence {k(n)}, and a nonincreasing sequence {a(n)} with limit 0, both of which depend on ,u, such that 4 E Bn(k(n), a(n)), eventually almost surely. Put Bn = Bn(k(n), a (n)). It will be shown that for any c > 0, there is a 8 > 0 such that Bn eventually has the (8, e)blowing-up property. Given E > 0, the finitely determined property provides 8 > 0 and k such that if n is large enough then any measure y on A n which satisfies WI (k, ")I < 8 and IHCan ) HMI < n6, must also satisfy dn (ii, y) < e 2 . Suppose C c Bn and ,u(C) > 2 n8 . The definition of Bn and formula (3) together imply that if k(n) > k, then the conditional measure A n (. IC) satisfies the following.

(i) 10(k, ttn(. IC)) 141 < a(n).

(ii) H(14) na(n)

178

Milne IC)) MILO na(n).

Thus if n is large enough then an(it, time IC)) < e 2 , for any C c Bp, for which 2 n8 which, by Lemma 111.4.4, implies that ,u([C],) > 1 E. This completes the proof that finitely determined implies almost blowing-up. ED

Remark 111.4.6 The results in this section are mostly drawn from joint work with Marton, [37, 39]. For references to earlier papers on blowing-up ideas the reader is referred to [37] and to Section 1.5 in the Csiszdr-Krner book, [7], which also includes applications to some problems in multi-user information theory.

I11.4.d

Exercises.

1. Show that (2) is indeed true. 2. Show that a process with the almost blowing-up property must be ergodic.

200

CHAPTER III. ENTROPY FOR RESTRICTED CLASSES.

3. Show that a process with the almost blowing-up property must be mixing. 4. Assume that i.i.d. processes have the blowing-up property.
(a) Show that an m-dependent process has the blowing-up property. (b) Show that a i/i -mixing process has the blowing-up property.

5. Assume that aperiodic renewal processes are stationary codings of i.i.d. processes.
Show that some of them do not have the blowing-up property.

6. Show that condition (i) in the definition of ABUP can be replaced by the condition
that ,u(B n ) 1.

7. Show that a coding is finitary relative to IL if and only if for each b the set f -1 (b)
is a countable union of cylinder sets together with a null set.

Section 111.5

The waiting-time problem.

A connection between entropy and recurrence times, first noted by Wyner and Ziv, [86], was shown to hold for any ergodic process in Section Section 11.5. Wyner and Ziv also established a positive connection between a waiting-time concept and entropy, at least for certain classes of ergodic processes. They showed that if Wk (x, y) is the waiting time until the first k terms of x appear in an independently chosen y, then (1/k) log Wk(x, y) converges in probability to h, for irreducible Markov chains. This result was extended to somewhat larger classes of processes, including the weak Bernoulli class in [44, 76]. The surprise here, of course, is the positive result, for the well-known waiting-time paradox, [11, pp. 1Off. 1, suggests that waiting times are generally longer than recurrence times. An almost sure version of the Wyner-Ziv result will be established here for the class of weak Bernoulli processes by using the joint-distribution estimation theory of III.2.b. In addition, an approximate-match version will be shown to hold for the class of stationary codings of i.i.d. processes, by using the d-admissibility theorem, Theorem 111.3.1, in conjunction with the almost blowing-up property discussed in the preceding section. Counterexamples to extensions of these results to the general ergodic case will also be discussed. The counterexamples show that waiting time ideas, unlike recurrence ideas, cannot be extended to the general ergodic case, thus further corroborating the general folklore that waiting times and recurrence times are quite different concepts. The waiting - time function Wk (x, y) is defined for x, y E Ac by
Wk (x, y)

= min{m > 1:

yn n: +k-1

i = xk

The approximate-match waiting-time function W k (x, y, 6) is defined by Wk (x, v, 6) = minfm > 1: dk (x Two positive theorems will be proved.
yn mi +k-1 <

61.

SECTION 111.5. THE WAITING-TIME PROBLEM.

201

Theorem 111.5.1 (The exact-match theorem.)


If p, is weak Bernoulli with entropy h, then

k s oo

1 lirn - log Wk(x, y) = h,


k

almost surely with respect to the product measure p, x A.

Theorem 111.5.2 (The approximate-match theorem.)


If p is a stationary coding of an i.i.d. process, then

lim sup - log


k,c,0

Wk(X,

y, 8) < h, V S > 0,

and

Wk (x, y, S) > h, k almost surely with respect to the product measure p, x p.


6).0 k>oo

1 lim lim inf - log

The easy part of both results is the lower bound, for there are exponentially too few k-blocks in the first 2k(h - f) terms of y, hence it is very unlikely that a typical 4 can be among them or even close to one of them. The proofs of the two upper bound results will have a parallel structure, though the parallelism is not complete. In each proof two sequences of measurable sets, {B k C Ak} and {G,,}, are constructed for which the following hold.

(a) x'ic

E Bk,

eventually almost surely.

(b) y E G n , eventually almost surely.

For the weak Bernoulli case, it is necessary to allow the set G to depend on the infinite past, hence G is taken to be a subset of the space A z of doubly infinite sequences. For the approximate-match result, the set G, can be taken to depend only on the coordinates from 1 to n, hence is taken to be a subset of A. In the weak Bernoulli case, the sequences {B k } and {G,,} have the additional property that if k < (log n)/(h + 6) then for any fixed xiic E Bk, the probability that y E G does not contain x lic anywhere in its first n places is almost exponentially decaying in n. In the approximate-match proof, {B k } and {G n } have the property that if n > 2k(h +E) , then for any fixed y'iz E G, the probability that xf E Bk is not within 8 of some k-block in y'i' will be exponentially small in k. In both cases an application the Borel-Cantelli lemma establishes the desired result. In the weak Bernoulli case, Bk consists of the entropy-typical sequences, so that property (a) holds by the entropy theorem. The set G n consists of those y E A Z for which yll can be split into k-length blocks separated by a fixed length gap, such that a large fraction of these k-blocks are almost independent of each other. In other words, the G are the sets given by the weak Bernoulli splitting-set lemma, Lemma 111.2.8. In the approximate-match case, G consists of those yli whose set of k-blocks has a large 3-neighborhood, for k - (log n)/(h + 6). The d-admissibility theorem, Theorem 111.3.1, guarantees that property (b) holds. The set Bk C A k is chosen to have the property that any subset of it that is not exponentially too small in measure must have a large 3-blowup. In other words, the Bk are the sets given by the almost blowing-up property of stationary codings of i.i.d. processes, see Theorem 111.4.3.

202

CHAPTER III. ENTROPY FOR RESTRICTED CLASSES.

The lower bound and upper bound results are established in the next three subsections. In the final two subsections a stationary coding of an i.i.d. process will be constructed for which the exact-match version of the approximate-match theorem is false, and an ergodic process will be constructed for which the approximate-match theorem is false.

I11.5.a

Lower bound proofs.

The following notation will be useful here and later subsections. The (empirical) universe of k-blocks of yri' is the set iik(Y') = Its S-blowup is dk(x 1`,14(y1)) 6). In the following discussion, the set of entropy-typical k-sequences (with respect to a > 0) is taken to be 7k(01) 2 -k(h+a) < 4 (4) < 2-k(h-a ) } Theorem 111.5.3 Let p, be an ergodic process with entropy h > 0 and let e > 0 be given. If k(n) > (logn)I(h - 6) then for every y E A" and almost every x, there is an N = N(x, y) such that x ik(n) lik(n)(Yi), n > N. 2k(h-E), yl E An , Proof If k > (log n)/(h - 6) then n <2k(h-e), and hence so that intersecting with the entropy-typical set 7k (e/2) gives < 2 k (6/2)) _ k(h-0 2 -ko-E/2) < (tik (yi ) n r Since .x /,` E 7 k(E/2), eventually almost surely, an application of the Borel-Cantelli lemma yields the theorem. El The theorem immediately implies that 1 lim inf - log Wk(X, y) h, k almost surely with respect to the product measure p, x /,/,, for any ergodic measure with entropy h. Theorem 111.5.4 Let p. be an ergodic process with entropy h > 0 and let e > 0 be given. There is a 3 > 0 such that if k(n) > (log n)/(h - 6) then for every y E A and almost every x, there is an N = N(x, y) such that x ik(n) (/[14k(n)(11`)]5, n > N. < 2k(h -E ) , and hence 1/1k(yril)1 < Proof If k > (log n)/(h - 6) then n _ _ 2 k(h - E) . I f 6 < 2k(h-3(/4), by the blowup-bound lemma, is small enough then for all k, j[lik (y)]51 _ Lemma 1.7.5. Intersecting with the entropy-typical set 'Tk (E/2) gives < 2 k(h-3E/4) 2 -kch-E/2) < p. (y)16 n Tk(e/2)) [lik(yli )]3 = for some i
E

[0, n - k

l]}.

_L

which establishes the theorem. The theorem immediately implies that 1 lim lim inf - log Wk(x, y, 3) > h, 8.0 kc k almost surely with respect to the product measure p. x A, for any ergodic measure with entropy h.

111

IL

SECTION 111.5. THE WAITING-TIME PROBLEM.

203

III.5.b

Proof of the exact-match waiting-time theorem.


Let

Fix a weak Bernoulli measure a on A z with entropy h, and fix c > 0. n(k) = n(k, E) denote the least integer n such that k < (log n) I (h ). The sets {B k } are just the entropy-typical sets, that is,

Bk = Tk(a) = {xj: 2 -k(h+a) < wx lic) <


where a is a positive number to be specified later. The entropy theorem implies that 4 E B, eventually almost surely. The sets G n C A Z are defined in terms of the splitting-index concept introduced in III.2.b. The basic terminology and results needed here will be restated in the form they will be used. The s-shifted, k-block parsing of y, with gap g is yti` = w(0)g(1)w(1)g(2)w(2)...g(t)w(t)w(t + 1), where w(0) has length s < k + g, the g(j) have length g and alternate with the k-blocks w(j), for 1 < j < t, and the final block w(t +1) has length n - t(k g) - s < 2(k + g). An index j E [1, t] is a (y, s, k, g)-splitting index for y E A z if ,u(w(DlY s 40,( 0)-1)(k+g) ) < (1+ Y)Ww(i)).

If 8 1 is the set of all y then

A z that have j as a (y, s, k, g)-splitting index, and J C [1, t

],

(1)

sa Claw (Di n Bs.,)

(1 + y) Ifl libt(w(i)).

The sets {G,} are given by Lemma 111.2.8 and depend on a positive number y, to be specified later, and a gap g, which depends on y. Property (a) and a slightly weaker version of property (b) from that lemma are needed here, stated as the following.
(a) x E Gn , eventually almost surely.

(b) For all large enough k and t, for (t + 1)(k + g) < n < (t 2)(k g), and for y E G(y) there is at least one s E [0, k + g - 1] for which there are at least (1 - y)t indices j in the interval [1, t] that are (y, s, k, g)-splitting indices for y.
To complete the proof of the weak Bernoulli waiting-time theorem, it is enough to establish the upper bound,

(2)

1 lim sup - log Wk(x, y) < h, 1.1

x ,u - a.s.
E

For almost every x E A" there exists K(x) such that x


x. For each k put Ck(x) = Y

Bk, k > K(x). Fix such an

(0

(Yrk) )1

and note that


ye Ck (x),

1 - log Wk (x, y) > h -E

so that it is enough to show that y g' Ck (x), eventually almost surely.

204

ED CLASSES. CHAPTER III. ENTROPY FOR RESTRICTED

Choose k > K (x) and t so large that property (b) holds, and such that (t+ 1)(k+ g) < n(k) < (t 2)(k g). The key bound is (3)

ii(Ck(x)

n Gn.(0) < (k + g)2- 2ty log y 0

y y (1

2k(h+a))t( 1 Y)

Indeed, if this holds, then a and y can be chosen so small that the bound is summable in k, since t n/ k and n . Since y E Gn , eventually almost surely, it then follows that y 0 Ck(x), eventually almost surely, which proves the theorem. To establish the key bound (3), fix a set J c [1, t] of cardinality at least t(1 y) and an integer s E [0, k + g 1] and let Gn(k) (J, s) denote the set of all y E G n (k) for which every j E J is a (y, s, k, g)-splitting index for y. Since

Gnuo (J, s) c
the splitting-index product bound, (1), yields it(n[w(i)]
JE.1

n Bsd)

+ y)'

jJ

fl bi (w(i)),

for any y E G n (k)(J, s). But if the blocks w(j), j E J are treated as independent then, since ii(x) > k(h+a) and since the cardinality of J is at least t(1 y), the probability x lic, for all j E J, is upper bounded by that w(j) (1 2 -k(h+a ) ) t(1-Y) and hence
)t it(ck(x)n G n(k)(J s)) < (1 + y (1

This implies the bound (3), since there are k g ways to choose the shift s and at most 2-2ty lo gy ways to choose sets J c [1, t] of cardinality at least t(1 y). This completes Li the proof of the exact-match waiting-time theorem, Theorem 111.5.1.

I11.5.c

Proof of the approximate-match theorem.

Let be a stationary coding of an i.i.d. process such that h(A) = h, and fix E > 0. The least n such that k < (log n)/ (h c) is denoted by n(k), while k(n) denotes the largest value of k for which k < (log n)/(h c). Fix a positive number 3 and define

G, =

,ualik(n)(31)]812) > 1/2) .

By the d-admissibility theorem, Theorem 111.3.1, yri: E G n , eventually almost surely. For almost every y E Aoe there is an N (y) such that yti' E G n , n > N(,y). Fix such a y. For each k, let Ck (y) = fx 1`: x g [Uk(Yr k) )]81 so that, (1/k) log Wk(x, y, 3) > h c if and only if x i( E C k(y), and hence it is enough to show that xi,' Ck (y), eventually almost surely. Let {B k } be the sequence of sets given by the almost blowing-up characterization theorem, Theorem 111.4.3, and choose ,0 > 0 and K such that if k > K and C c Bk is a set of k-sequences of measure at least 2 -0 then 11([C]8/2) > 1/2.

SECTION 111.5. THE WAITING-TIME PROBLEM.

205

Suppose k = k(n) > K and n(k) > N(y), so that yrk) E G(k). If it were true that [t(Ck(Y) n Bk) > 2-0 then both [Ck(Y) n Bk]512 and [16 (Yr k) )]6/2 each have measure larger than 1/2, hence intersect. But this implies that the sets Ck(y) and [ilk (Yin (k) )]8 intersect, which contradicts the definition of Ck(y). Thus, ,u(Ck(Y) n Bk) < 2-10 , and hence

E p(Ck(y)n Bk ) < oc.


k=1

Since 4 E Bk, eventually almost surely, it follows that 4 g Ck(y), eventually almost Li surely, completing the proof of the approximate-match theorem.

I11.5.d

An exact-match counterexample.

Let ,u, be the Kolmogorov measure of the i.i.d. process with binary alphabet B = {0, 1} such that g i (0) = ,u i (1) = 1/2.. A stationary coding y = ,u o F -1 will be constructed along with an increasing sequence {k(n)} such that

(4)

1 lim P, ( log Wk(n)(y, ) 7 ) N) = 0, n-400 k(n)

for every N, where P, denotes probability with respect to the product measure y x v. The classic example where waiting times are much longer than recurrence times is a "bursty" process, that is, one in which long blocks of l's and O's alternate. If x i = 1, say, then, 1471 (x, y) will be large for a set of y's of probability close to 1/2. If the process has a large enough alphabet, and sample paths cycle through long enough blocks of each symbol then W1 (x, y) will be large for a set of probability close to 1. The key to the construction is to force Wk(y, Si) to be large, with high probability, with a code that makes only a small density of changes in sample paths, for then the construction can be iterated to produce infinitely many bad k. The initial step at each stage is to create markers by changing only a few places in a k-block, where k is large enough to guarantee that many different markers are possible. The same marker is used for a long succession of nonoverlapping k-blocks, then switches to another marker for a long time, cycling in this way through the possible markers to make sure that the time between blocks with the same marker is very large. By this mimicking of the bursty property of the classical example, (1/k) log Wk(y, Y - ) is forced to be large with high probability. The marker construction idea is easier to carry out if the binary process p, is thought of as a measure on A Z whose support is contained in B z , where A = {0, 1, 2, 3 } , for then blocks of 2's and 3's can be used to form markers. This artificial expansion of the alphabet is not really essential, however, for with additional (but messy) effort one can create binary markers. The first problem is the creation of a large number of block codes which don't change sequences by much, yet are pairwise separated at the k-block level. To make this precise, fix GL c A L . A function C: GL A L is said to be 6-close to the identity if points are moved no more than 8, that is, aL (C(4), 4. ) < 6, x E GL. A family {Ci} of functions from GL to A L is 6-close to the identity if each member of the family is 6-close to the identity. A family {Ci } of functions from G L to A L is said to be pairwise k-separated, if the k-block universes of the ranges of different functions are disjoint, that is

Uk(yP) fl /ta ) = 0, y1 E Ci(Xt), -51- E Cj(Xt), j

j.

206

CHAPTER III. ENTROPY FOR RESTRICTED CLASSES.

The following lemma is stated in a form that allows the necessary iteration; its proof shows how a large number of pairwise separated codes close to the identity can be constructed using markers.

Lemma 111.5.5 Suppose k > (2g+3)/8 and L = mk. Let GL be the subset of {0, 1, 2, 3} L consisting of those sequences in which no block of 2's and 3's of length g occur Then there is a pairwise k-separated family {C i : 1 < i < 2g} of functions from GL to A L , which is 8-close to the identity, with the additional property that no (g +1)-block of 2's and 3's occurs in any concatenation of members of U,C,(G
Proof Order the set {2, 31g in some way and let wf denote the i-th member. Define the i-th k-block encoder a (4) = yilc by setting

= Yg+2 = Y2g+3 = 0 , Y2 = Yg+3 = W1 ,

g+1

2g+2

yj = x

i > 2g + 3,

that is, 2. i (x lic) replaces the first 2g 3 terms of 4 by the concatenation 0wf0wf0, and leaves the remaining terms unchanged. The encoder Ci is defined by blocking xt into k-blocks and applying Cs.; separately to each k-block, that is, Ci (xt) = yt, where
jk+k jk+1 = Ci jk+k x )jk+1 o < <M.

Any block of length k in C(x) = yt must contain the (g -I- 2)-block Owf 0, and since different wf are used with different Ci , pairwise k-separation must hold. Since each changes at most 2g+3 < 3k coordinates in a k-block, the family is 3-close to the identity. Note also that the blocks of 2's and 3's introduced by the construction are separated from all else by O's, hence no (g -I- 1)-block of 2's and 3's can occur in any concatenation of members of Ui C,(GL). Thus the lemma is established. El The code family constructed by the preceding lemma is concatenated to produce a single code C of length 2g L. The block-to-stationary construction, Lemma 1.8.5, is then used to construct a stationary coding for which the waiting time is long for one value of k. An iteration of the method, as done in the string matching construction of Section I.8.c, is used to produce the final process. The following lemma combines the codes of preceding lemma with the block-to-stationary construction in a form suitable for later iteration.
Lemma 111.5.6 Let it be an ergodic measure with alphabet A = {0, 1, 2, 3) such that i({2, 3}) = 0, where g > 1. Given c > 0, 3 > 0, and N, there is a k > 2g +3 and astationary encoder

F: A z 1--+ A z such that for y = p. o F-1 the following hold.


(i) pvxv (wk', 5") < 2kN ) 5_

2 -g E.

(ii) d(x, F (x)) < 6, for all x. (iii) v({2, 3 } g+ 1 ) = 0. Proof Choose k > (2g +3)18, then choose L > 2kN+2 lc so that L is divisible by k. Let L. GL, and (C,: 1 < i < 2g1 be as in the preceding lemma for the given 3. Put

SECTION 111.5. THE WAITING-TIME PROBLEM.

207

M = 2g L and define a code C: Am 1--* A m by setting C(xr) = xr, for 41 (G L) 2g , and iL iL g, Y (i--1)L+1 = Ci(XoL +1), 1 < <2
for xr E (GL) 29 Next apply the block-to-stationary construction of Lemma 1.8.5 to obtain a stationary code F = F c with the property that for almost every x E A z there is a partition, Z = U/1 , of the integers into subintervals of length no more than M, such that if J consists of those j for which I is shorter than M, then UJEJ /J covers no more than a limiting (6/4)-fraction of Z, and such that y = F(x) satisfies the following two conditions.

n+ +1M = C(Xn ni 1M ). (a) If = [n + 1, n +Al], then yn


(b) If i E Uj E jii , then y, = x i . The intervals I., that are shorter than M will be called gaps, while those of length M will be called coding blocks. c has the desired properties, first note that the only changes To show that F = F made in x to produce y = F(x) are made by the block codes Ci, each of which changes only the first 2g + 3 places in a k-block. Thus d(x, F(x)) < 8, for every x, since k > (2g +3)/8, and hence property (ii) holds. Furthermore, the changes produce blocks of 2's and 3's of length g which are enclosed between two O's; hence, since p, had no blocks of 2's and 3's of length g, the encoded process can have no such blocks of length g + 1, which establishes property (iii). To show that property (i) holds, note that if y and 53 are chosen independently, and if y = F(x) and 5"; = F(), then only two cases can occur. Case 1. Either x1 or i is in a gap of the code F. Case 2. Both xi and i1 belong to coding blocks of F. Case 1 occurs with probability at most 72 since the limiting density of time in the gaps is upper bounded by 6/4. In Case 2, since the M-block code C is a concatenation of the L-block codes, {C i }, there will be integers n, in, i, and j such that

n < 1 < n + L 1, m < 1 < m + L 1,


and such that
ynn+L-1

= Ci ( X;11 +L-1\

yrn

= C1 (4+L-1) .

Three cases are now possible.

Case 2a. i = j. Case 2b. Either k > n + L 1 or 2kN > Case 2c. i0 j,k <n+L 1, and 2k N < m
L 1. L 1.

Case 2a occurs with probability at most 2 -g, since y and j3 were independently chosen and there are 2g different kinds of L-blocks each of which is equally likely to be chosen. 2kN+2 /E. If Case 2c occurs then Case 2b occurs with probability at most 6/2 since L

208

CHAPTER III. ENTROPY FOR RESTRICI ED CLASSES.

Wk (y,,

> 2kN , by the pairwise k-separation property. Cases 1, 2a, and 2b therefore
Puxv (Wk(Y < 2kN) _< 6/2 2 - g + 6/2,

imply that

which completes the proof of Lemma 111.5.6.

The finish the proof it will be shown that there is a sequence of stationary encoders, Fn.: Az 1-4. Az, n = 1, 2, ... , and an increasing sequence {k(n)} such that for the -1 , the following composition G, = F o Fn _i o o F1 and measure 01) = o G n x. properties hold for each n > 1, where Go(x)

(a) pooxon , ( wk(i)(y,


(b) d (G n (x), G

< 2ik(i)) < 27i +1

2-i +

2n+1 2n 9 1 < < n.

(x))

almost surely.

(c) On ) ({2, 3} n ) = 0. Condition (b) guarantees that eventually almost surely each coordinate is changed only o F -1 . finitely many times, so there will be a limit encoder F and measure y = desired property, that is, Condition (a) guarantees that the limit measure y will have the

1 log Wk(n)(Y, 33 ) k(n)

co, in y x y probability.

Condition (c) is a technical condition which makes it possible to proceed from step n to 1. step n In summary, it is enough to show the existence of such a sequence of encoders {Fn } and integers {k(n)}. This will be done by induction. The first step is to apply Lemma 111.5.6 with g = 0 to select k(1) > 3 and F1 such that (a), (b), and (c) hold for n = 1, with GI = F1. Assume F1, F2 , . , F have been constructed so that (a), (b), and (c) hold for n. Let 6 < 2 (n+1) be a positive number to be specified later. Apply Lemma 111.5.6 with 1) larger than both 2n + 2 and k(n) and g = n and p, replaced by On ) to select k(n to select an encoder Fn+i : A z A z such that for v (n+ 1) = On ) 0 , the properties (i)-(iii) of Lemma 111.5.6 hold. These properties imply that conditions (b) and (c) hold - 1 1 , while condition (a) holds for the case i = n + 1, for n + 1, since y (n+ 1) = o G n since it was assumed that 6 < 2 -(n+ 1) . If 6 is small enough, however, then y (n+ 1) will be so close to On ) that condition (a), with n replaced by n 1, will hold for i < n 1. This completes the construction of the counterexample.

Remark 111.5.7
The number E can be made arbitrarily small in the preceding argument, and hence one can make P(xo F(x)o) as small as desired. In particular, the entropy of the encoded process y can be forced to be arbitrarily close to log 2, the entropy of the coin-tossing process ,u.

SECTION 111.5. THE WAITING-TIME PROBLEM.

209

III.5.e

An approximate-match counterexample.

Let a be a positive number smaller than 1/2. The goal is to show that there is an ergodic process it for which (5 ) 1 lim Prob ( log k co k
W k (X ,

y, a)

N) = 0, VN.

Such a t can be produced merely by suitably choosing the parameters in the strongnonadmissibility example constructed in 111.3.b. A weaker subsequence result is discussed first as it illustrates the ideas in simpler form. First some definitions and results from 111.3.b will be recalled. Two sets C, D c A k are a-separated if C n [D],, = 0, where [D] denotes the a-blowup of D. The full kblock universe of S c A f is the set k-blocks that appear anywhere in any concatentation of members of S. Two subsets S, S' C A f are (a, K) - strongly - separated if their full k-block universes are a-separated for any k > K. A merging of a collection {Si : j E J} is a product of the form

=
ni =1

S Cm)

[1, J] is such that 10 -1 ( = M/J, for each where J divides M and 0: [1, M] j E [1, J]. Given an increasing unbounded sequence {Nm }, and a decreasing sequence {a m } c (a, 1/2), inductive application of the merging/separation lemma, Lemma 111.3.5, produces a sequence {S(m) C A ("z ) : in > 1} and an increasing sequence {k m }, such that the following properties hold for each tn. (a) S(m) is a disjoint union of a collection {Si (m): j < ./m } of pairwise (am , km ) strongly-separated sets of the same cardinality.
(b) Each Si (m) is a merging of {Si (m 1): j
(c) 2kni Nin <

t(m)/m.

The only new fact here is property (c). Once km is determined, however, S(m) can be replaced by (S(m))n for any positive integer n without disturbing properties (a) and (b), and hence property (c) can also be assumed to hold, see Remark 111.3.6. Property (b) guarantees that the measure p, defined by {S(m)} is ergodic, while property (c) implies that (6) lim Prob (Wk, (x , y, a) < 2km N ) = 0,

for any integer N. This is just the subsequence form of the desired result (5). Indeed, if sample paths sample paths x and y are picked independently at random, then with probability at least (1 1/40 2 (1 1/m) 2 , their starting positions x1 and yi will belong to (m)-blocks that belong to different j (m) and lie at least 2 km N.-indices below the end of each (m)-block. Thus, Prob (W k,, (x, y, a) > 2 which proves (6), since Nm is unbounded. ) > (1 1/40 2 (1 1/rn)2 ,

210

CHAPTER III. ENTROPY FOR RESTRICTED CLASSES.

The stronger form, discussed in III.3.b.2, controls what happens in the interval km < k < kn, 1. Further independent cutting and stacking can always be applied at any stage without losing separation properties already gained, and hence, as in the preceding discussion, the column structures at any stage can be made so long that bad waiting-time behavior is guaranteed for the entire range km < k < km i . This leads to an example satisfying the stronger result, (5). The reader is referred to [76] for a complete discussion of this final argument.

Chapter IV
B-processes.
Section IV.1 Almost block-independence.

The focus in this chapter is on B-processes, that is, finite alphabet processes that are stationary codings of i.i.d. processes. Various other names have been used, each arising from a different characterization, including almost block-independent processes, finitely determined processes, and very weak Bernoulli processes. This terminology and many of the ideas to be discussed here are rooted in O rn stein's fundamental work on the much harder problem of characterizing the invertible stationary codings of i.i.d. processes, the so-called isomorphism problem in ergodic theory, [46]. Invertibility is a basic concept in the abstract study of transformations, but is of little interest in stationary process theory where the focus is on the joint distributions rather than on the particular space on which the random variables are defined. This is fortunate, for while the theory of stationary codings of i.i.d. processes is still a complex theory it becomes considerably simpler when the invertibility requirement is dropped, as will be done here. A natural and useful characterization of stationary codings of i.i.d. processes, the almost block-independence property, will be discussed in this first section. The almost block-independence property, like other characterizations of B-processes, is expressed in terms of the a-metric (or some equivalent metric.) Either measure or random variable notation will be used for the d-distance, for example, dn (X7, Yr) will often be used in place of dn (p., v), if t is the distribution of XT and y the distribution of Yr. A block-independent process is formed by extending a measure A n on An to a product measure on (An)", then transporting this to a Ta-invariant measure Ft on A'. In other words, ji is the measure on A' defined by the formula
nz Ti(X inn )=

i
j=1

(x(il)n+ni) , Xin n E A mn , in > 1, (j_on+

and the requirement that


x,, J+1 1 x-

for all i < j < mn and all x/. Note that 71 is Ta -invariant, though it is not, in general, stationary. As in earlier chapters, randomizing the start produces a stationary process,

211

212

CHAPTER IV. B-PROCESSES.

The theory to be developed in this called the concatenated-block process defined by chapter could be stated in terms of approximation by concatenated-block processes, but it is generally easier to use the simpler block-independence ideas. The independent n-blocking of an ergodic process ,u is the block-independent process if n is defined by the restriction ,u n of p to An. It will be denoted by Ft(n), or by understood. In random variable language, the independent n-blocking of {Xi } is the Tn-invariant process {Y./ } defined by the following two conditions.

(a) Y U 1)n+n and X7 have the same distribution, for each j > 1.
(k ) v (j-1)n-l-n

(j-1)n+1

is independent of

r,

i < (j 1)n}, for each j > 1.

An ergodic process p, is almost block-independent (ABI) if given E > 0, there is an N such that if n > N and is the independent n-blocking of ,u then d(A, ;4) < E. An i.i.d. process is clearly almost block-independent, since the ABI condition holds for every n > 1 and every E > O. The almost block-independence property is preserved under stationary coding and, in fact, characterizes the class of stationary codings of i.i.d. processes. This result and the fact that mixing Markov chains are almost blockindependent are the principal results of this section.

Theorem IV.1.1 (The almost block-independence theorem.) An ergodic process is almost block-independent if and only if it is a stationary coding of an i.i.d. process. Theorem IV.1.2 A mixing Markov chain is almost block-independent.
Note, in particular, that the two theorems together imply that a mixing Markov chain is a stationary coding of an i.i.d. process, which is not at all obvious. The fact that stationary codings of i.i.d. processes are almost block-independent is established by first proving it for finite codings, then by showing that the ABI property is preserved under the passage to d-limits. Both of these are quite easy to prove. It is not as easy to show that an almost block-independent process p, is a stationary coding of an i.i.d. process, for this requires the construction of a stationary coding from some i.i.d. process y onto the given almost block-independent process p. This will be carried out by showing how to code a suitable i.i.d. process onto a process d-close to then how to how to make a small density of changes in the code to produce a process even closer in d. The fact that only a small density of changes are needed insures that an iteration of the method produces a limit coding equal to 12. Of course, all this requires that h(v) > h(p.), since stationary coding cannot increase entropy. In fact it is possible to show that an almost block-independent process p is a stationary coding of any i.i.d. process y for which h(v) > h(A). The construction is much simpler, however, and is sufficient for the purposes of this book, if it is assumed that the i.i.d. process has infinite alphabet with continuous distribution for then any n-block code can be represented as a function of the first n-coordinates of the process and d-joinings can be used to modify such codes.

SECTION IV 1. ALMOST BLOCK-INDEPENDENCE.

213

IV.1.a B-processes are ABI processes.


First it will be shown that finite codings of i.i.d. processes are almost blockindependent. This is because blocks separated by twice the window half-width are independent. Let {Y,} be a finite coding the i.i.d. process {X i } with window half-width w, and let {Z,} be the independent n-blocking of { Yi } . If the first and last w terms of each n-block are omitted, then mnci,,,(Zr, Yr) is upper bounded by
m(n

_ 2040 _ 2

w)

w+1

vnw-1 Tnnw-1 '(n1-1)n+w+1 , w+1

ymnw--1 (m-1)nw1 )

4.414/

by property (c) of the d-property list, Lemma 1.9.11. The successive n 2w blocks
7nw-1

'w+1

7 mn w-1 " '(ni-1)n+w+1

n+ 1 -1 , by the definition of independent nare independent with the distribution of Yw blocking, while the successive n 2w blocks
vnw-1
4

w+1

y mn w-1 (m-1)nw1

+r -I by the definition of window halfare independent with the distribution of rr width and the assumption that {X i } is i.i.d., so that property (f) of the d-property list, Lemma 1.9.11, yields ynw1 v mn w-1 n n +T -1 7 mn w-1 am(n_210 (z w
w-1-1 (m-1)n+w-1-1 )

Thus

dnm (Z,

Since this is less than c, for any n> wl, it follows that finite codings of i.i.d. processes are almost block-independent. Next assume p, is the d-limit of almost block-independent processes. Given E choose an almost block-independent process y such that d(p, y) < c/3, and hence,
>

(1)

< E/ 3,

for all n. Since y is almost block-independent there is an N such that if n > N and 7/ is the independent n-blocking of y then d(y, < c/3. Fix such an n and let FL be the independent n-blocking of A. The fact that both and are i.i.d. when thought of as An-valued processes implies that

d( 17,

anal FL) = cin(y, 1,L) < 6 / ,


< c.

by (1). The triangle inequality then gives

This proves that the class of ABI processes is d-closed. In summary, finite codings of i.i.d. processes and d-limits of ABI processes are almost block-independent, hence stationary codings of i.i.d. processes are almost blockindependent, since a stationary coding is a d-limit of finite codings. Furthermore, the preceding argument applies to stationary codings of infinite alphabet i.i.d. processes onto finite alphabet processes, hence such coded processes are also almost block-independent. This completes the proof that B-processes are almost block-independent. Lii

214

CHAPTER IV. B-PROCESSES.

IV.1.b ABI processes are B-processes.


Let {Xi } be an almost block-independent process with finite alphabet A and Kolmogorov measure A. The goal is to show that p. is a stationary coding of an infinite alphabet i.i.d. process with continuous distribution. An i.i.d. process {Z, } is continuously distributed if Prob(Zo = b) = 0, for all alphabet symbols b. It will be shown that there is a continuously distributed i.i.d. process Vil with Kolmogorov measure A for which there is a sequence of stationary codes {Ft } such that the following hold.

(a)a(p., o Fr-1 ) > 0, as t * oc. (b)Er Prob((Ft+ i (z))o (Fr(z))o)

< oc.

The second condition is important, for it means that, almost surely, each coordinate Fe(z)i changes only finitely often as t oc. Therefore, there will almost surely exist a limit code F(z)1 = lime Ft (z) i which is stationary and for which cl(A, A o F -1 ) = 0, that is, A = X o F -1 . The exact representation of Vi l is unimportant, since any random variable with continuous distribution is a function of any other random variable with continuous distribution, and hence any i.i.d. process with continuous distribution is a stationary coding, with window width 1, of any other i.i.d. process with continuous distribution. This fact, which allows one to choose whatever continuously distributed i.i.d. process is convenient at a given stage of the construction, will be used in the sequel. The key to the construction is the following lemma, which allows the creation of arbitrarily better codes by making small changes in a good code.

Lemma W.1.3 (Stationary code improvement.) If a(p,, A.o G -1 ) <E, where p. is almost block-independent and G is a finite stationary coding of a continuously distributed i.i.d. process {Z,} with Kolmogorov measure A, and if is a binary equiprobable i.i.d. process independent of {Z,}, then given .5 > 0 there is a finite stationary coding F of {(Z i , Vi )} such that the following two properties hold.
(i) d(p., (A x
o F -1 ) < 8.

(ii) (A x v) ({(z, V): G(z)o

F (z, v)o}) < 4E.

The lemma immediately yields the desired result for one can start with the vector valued, countable component, i.i.d. process
(2) {(V(0), V(1) i , V(2)1 ,

V(t) i , ...)}

with independent components, where V(0)0 is uniform on [0, 1) and, for t > 1, V(t)0 is binary and equiprobable, together with an arbitrary initial coding F0 of { V(0)i }. The lemma can then be applied sequentially with
Z (t) i = (V(0) i , V (1) i , , V(t 1) i ),

and Er = 2-t+1 and 3, = 2 -t, to obtain a sequence of stationary codes {Ft } that satisfy the desired properties (a) and (b), producing thereby a limit coding F from the vector-valued i.i.d. process (2) onto the almost block-independent process A. The idea of the proof of Lemma IV. 1.3 is to first truncate G to a block code. A block code version of the lemma is then used to change to an arbitrarily better block code.

SECTION IV.1. ALMOST BLOCK-INDEPENDENCE.

215

Finally, the better block code is converted to a stationary code by a modification of the block-to-stationary method of I.8.b, using the auxiliary process {Vi } to tell when to apply the block code. The i.i.d. nature of the auxiliary process (Vi) and its independence of the {Z1 } process guarantee that the final coded process satisfies an extension of almost block-independence in which occasional spacing between blocks is allowed, an extension which will be shown to hold for ABI processes. A finite stationary code is truncated to a block code by using it except when within a window half-width of the end of a block. Lemma IV.1.4 (Code truncation.) Let F be a finite stationary coding which carries the stationary process v onto the stationary process 71. Given S > 0 there is an N such that if n > N then there is an n-block code G, called the (S, n)-truncation of F, such that d,(F(u)7,G(u7)) < S, for v-almost every sample path u. Proof Let w be the window half-width of F and let N = 2w/8. For n > N define G(u7) = bF(ii)7 1c, where u is any sample path for which 7 = /47, and b and c are fixed words of length w. This is well-defined since F(ii)7+ 1 depends only on the values of . Clearly d,(F (u)7, G(u7)) < 8, for v-almost every sample path u, since the only disagreements can occur in the first or last w places. The block code version of the stationary code improvement lemma is stated as the following lemma. Lemma IV.1.5 (Block code improvement.) Let (Y, y) be a nonatomic probability space and suppose *: Y 1- A n is a measurable mapping such that V 0 1 ) < 8. Then there is a measurable mapping 4): Y 1-+ A n such that ji = y o 0-1 and such that Ev(dn((k (Y), (y))) < 8. Proof This is really just the mapping form of the idea that d-joinings are created by cutting up nonatomic spaces, see Exercise 12 in Section 1.9. Indeed, a joining has the form X(a'i' , b7) = v(0-1 (4) n (b7)) for 4): Y H A n such that A n = vo V I , so that Vd, (4 , b7)) = E v(dn (0 (Y), (Y))). The almost block-independence property can be strengthened to allow gaps between the blocks, while retaining the property that the blocks occur independently. The following terminology will be used. A binary, ergodic process {Ri } will be called a (S, n)blocking process, if the waiting time between l's is never more than n and is exactly equal to n with probability at least 1 3. A block R:+n-1 of length n will be called a coding block if Ri = 0, for i< j<i+n-1and Ri+n _1 = 1, that is, 1;? +n--.1 = On -1 1. An ergodic pair process {(Wi , Ri)} will be called a (S, n)-independent process if {Ri } is a (8, n)-blocking process and if 147' is independent of {(Wi , Ri ): i < 0 } , given that is a coding block. An ergodic process {Wi } is called a (8, n)-independent extension of a measure pt, on A', with associated (8, n)-blocking process {Ri } , if {(147 Ri )} is a (S, n)-independent process and

Prob(14f;1 = diliq is a coding block) = ,u,(a7), 4

A'.

The following lemma supplies the desired stronger form of the ABI property.

216

CHAPTER IV. B-PROCESSES.

Lemma IV.1.6 (Independent-extension.) Let {X,} be an almost block-independent process with Kolmogorov measure t. Given E > 0 there is an N and a 6 > 0 such that if n > N then a(p.,i1.) < c, for any (8, n)independent extension 11, of A n . Proof In outline, here is the idea of the proof. It is enough to show how an independent NI -blocking that is close to p, can be well-approximated by a (6, n)-independent extension, provided only that 6 is small enough and n is large enough. The obstacle to a good d-fit that must be overcome is the possible nonsynchronization between n-blocks and multiples of NI -blocks. This is solved by noting that conditioned on starting at a coding n-block, the starting position, say i, of an n-block may not be an exact multiple of NI , but i j will be for some j < NI. Thus after omitting at most N1 places at the beginning and end of the n-block it becomes synchronized with the independent Ni -blocking, hence all that is required is that 2N1 be negligible relative to n, and that 8 be so small that very little is outside coding blocks. To make this outline precise, first choose N1 so large that a(p,, y) < 6/3, where y is the Kolmogorov measure of the independent N I -blocking IN of pt. Since y is not stationary it need not be true that clni (it um ) < E/3, for all ni, but at least there will be an M1 such that dni (/u ni , vm ) < 13, in > Mi. (3) Put N2 = M1+ 2N1 and fix n > N2. Let 6 be a positive number to be specified later and let be the Kolmogorov measure of a (8, n)-independent extension {W} of An , with associated (6, n)-blocking process {R, } . The goal is to show that if M1 is large enough and 3 small enough then ii(y, < 2E/3, and hence cl(p., Ft) < c. Towards this end, let {W} denote the W-process conditioned on a realization { ri } of the associated (6, n)-blocking process (R 1 ). Let {n k } be the successive indices of the locations of the l's in { ri}. For each k for which nk +i n k = n choose the least integer tk and greatest integer sk such that tk Ni > nk , skNi < Note that m = skNI tkNi is at least as large as MI, and that i7 V -7,AN Ni1+1 has the same distribution as X' N1 +i , so that (3) yields t NI
(4) ytS:N IV I I

Wits:N A: 1+

) <

/3.

The processes Y,} and {W,} are now joined by using the joining that yields (4) on the blocks ftk + 1, sk NI L and an arbitrary joining on the other parts. If M i is large enough so that tk NI nk nk+1 tIcNI is negligible compared to M1 and if 8 is small enough, then, for almost every realization {r k }, the ci-distance between the processes {K} and {W,} will be less than 2E13. An application of property (g) in the list of dproperties, Lemma 1.9.11, yields the desired result, ii(v, < 2E/3, completing the proof of Lemma IV.1.6.
{

Proof of Lemma IV.1.3. Let {Z,} be a continuously distributed i.i.d. process and {V,} a binary, equiprobable i.i.d. process, independent of {Z,}, with respective Kolmogorov measures and y, and let x y denote the Kolmogorov measure of the direct product {(Z V,)}. Let p. be

SECTION IV 1. ALMOST BLOCK-INDEPENDENCE.

217

an almost block-independent process with alphabet A, and let G be a finite stationary < E. coding such that d(p,, X o Fix 3 > O. The goal is to construct a finite stationary coding F of {(Z,, V,)} such that the following hold. (i) d(p, (X. x y) o <8.

(ii) (X x y) (((z, v): G(z)0

F(z, y) 0 }) < 4e.

Towards this end, first choose n and S < e so that if is a (S, n)-independent extension of i then d(p, p e-,) < 8, and so that there is an (c, n)-truncation G of G. In particular, this means that < 2E, Xn and hence the block code improvement lemma provides an n-block code 0 such that = o 0 -1 and E(dn ( - (Z), 0(Z))) 26. (5) Next, let g be a finite stationary coding of {V,} onto a (S, n)-blocking process {Ri } and lift this to the stationary coding g* of {(Z,, V,)} onto {(Z Ri )} defined by g*(z, y) = (z, g(v)). Let F(z, r) be the (stationary) code defined by applying 0 whenever the waiting time between 1 's is exactly n, and coding to some fixed letter whenever this does not occur. In other words, fix a E A, let {yk } denote the (random) sequence of indices that mark the starting positions of the successive coding blocks in a (random) realization r of {R, } , define

+n-1 ), F(Z, r Ykk+" = dp(zYA Yk


)

for all k, and define F(z, r)i = a, i fzi Uk[Yk, Yk n 1]. Clearly F maps onto a n)-independent extension of ,u, hence property (i) holds. The see why property (ii) holds, first note that {Zi } and {Ri } are independent, and hence the process defined by Wk = Z Yk+n-1' i i .i.d. with Wo distributed as Z. Furthermore,

E(d,z (d(Z11 ), (Z7))) = E(dn (d (Z 11 ), 0(Z7)) I R is a coding block)


so the ergodic theorem applied to {Wk }, along with the fitting bound, (5), implies that the concatenations satisfy (6)
k>oo

lirn dnk(O(Wi) 0(Wk), a(w.)d(wk))< 2c,

almost surely. By definition, however, F(u, vgk +n-1 = 0 (Wk), if g(v) = r, and , a(4 +n-i )) < E. Since, by the ergodic theorem applied to {R,}, the dn(G(Z)kk+n-1 limiting density of indices i for which i V Uk[Yk, yk + n 1] is almost surely at most & it follows from (6) that (X x y) ({(z, v): G(z)0 F (z, Y)o}) <2e
E<

4e,

thereby establishing property (ii). This completes the proof of the lemma, and hence the proof that an ABI process is a stationary coding of an i.i.d. process.

218

CHAPTER IV B-PROCESSES.

Remark W.1.7 The existence of block codes with a given distribution as well as the independence of the blocking processes used to convert block codes to stationary codes onto independent extensions are easy to obtain for i.i.d. processes with continuous distribution. With considerably more effort, suitable approximate forms of these ideas can be obtained for any finite alphabet i.i.d. process whose entropy is at least that of it. These results, which are the basis for Ornstein's isomorphism theory, are discussed in detail in his book, [46], see also [63, 42].

IV.1.c

Mixing Markov and almost block-independence.

The key to the fact that mixing Markov implies ABI, as well as several later results about mixing Markov chains is a simple form of coupling, a special type of joining frequently used in probability theory, see [35]. Let {X, } be a (stationary) mixing Markov process and let {1;z } be the nonstationary Markov chain with the same transition matrix as { X, } , but which is conditioned to start with Yo = a. A joining {(Xn , Yn )} is obtained by running the two processes independently until they agree, then running them together. The Markov property guarantees that this is indeed a joining of the two processes, with the very strong additional property that a sample path is always paired with a sample path which agrees with it ever after as soon as they agree once. In particular, this guarantees that ncln (X,,Y n ) cannot exceed the expected time until the two processes agree. The precise formulation of the coupling idea needed here follows. Let p. and y be the Kolmogorov measures of {X,} and { Y n } , respectively. For each n > 1, the coupling function is defined by the formula wn(4 = f minfi E [ 1, n 1]: ai = bi}, n, if fi E [1 , n 1 ] : ai = bi } 0 0 otherwise.

The coupling measure is the measure A n on An x An defined by v (an ,u(b7), v(a)p(b'i') 0 if w( a, = w < n, and an w+1 = 1)/ 1_ 1 if wn (a , b7) = n otherwise.

, b?) = (7)

A direct calculation, making use of the Markov properties for both measures, shows that An is a probability measure with vn and p. as marginals.

Lemma IV.1.8 (The Markov coupling lemma.) The coupling function satisfies the bound
(8) dn(tt, v) < E)" n (wn) ,

In particular dn(it, v) > 0, n -- oc. Proof The inequality (8) follows from the fact that the A n defined by (7) is a joining of ,u n and v n , and is concentrated on sequences that agree ever after, once they agree. Furthermore, the expected value of the coupling function is bounded, independent of n and a, see Exercise 2, below, so that iin (p., v) indeed goes to O. This establishes the
lemma.

SECTION IV 1. ALMOST BLOCK-INDEPENDENCE.

219

To prove that a mixing Markov chain is d-close to its independent n-blocking, for all large enough n, the idea to construct a joining by matching successive n-blocks, conditioned on the past. This will produce the desired dcloseness, because the distribution of Xk k ,T; conditioned on the past depends only on Xk n , by the Markov property, and the distribution of X k n +7 conditioned on Xk n is d-close to the unconditioned distribution of Xn i , by the coupling lemma, provided only that n is large enough. Fix a mixing Markov chain {X, } with Kolmogorov measure and fix E > 0. The notation cin (X k n +7, X k n +7/xk n ) will be used to denote the du -distance between the unconI, given Xk n = Xkn kn " ditioned distribution of XI and the conditional distribution of Xk The coupling lemma implies that
(9) d,(x k n n , , x k ,n n 1 /xkn) <

for all sufficiently large n. Fix n for which (9) holds and let {Zi } be the independent n-blocking of {Xi } , with Kolmogorov measure i. For each positive integer m, a joining Ain of A m , with ii mn will be constructed such that

Ex,, (dnin (finn , z7)) <


which, of course, proves that d(p., < c, and hence that {X i } is almost blockindependent. To construct Xmn, first choose, for each xbi , a joining A./x,,, of the unconditioned kn "Vil and the conditional distribution of Xk kn + 7, given Xk n = xk n , that distribution of Xk realizes the du -distance between them. For each positive integer m, let nin be the measure on A"' x A mn defined by
m-1 (10) A.pnn(XT n einn ) = ii(4)11,(Z?) k=1

Jr

A1

zkknn++7)

A direct calculation, using the Markov property of {X, } plus the fact that Zk k n +7 is independent of Z j , j < kn, shows that A., u is a joining of A, with iimu . Likewise, summing over the successive n-blocks and using the bound (9) yields the bound Ex,n (dmn(xr, Zr)) < E. Since the proof extends to Markov chains of arbitrary order, this completes the proof that mixing Markov chains are almost block-independent. 1:1 The almost block-independent processes are, in fact, exactly the almost mixing Markov processes, in the sense that

Theorem IV.1.9 An almost block-independent process is the d-limit of mixing Markov processes. Proof By the independent-extension lemma, Lemma IV. 1.6, it is enough to show that for any 8 > 0, a measure it on A" has a (8, n)-independent extension ri which is mixing Markov of some order. The concatenated-block process defined by pt is a function of an irreducible Markov chain, see Example 1.1.11. A suitable random insertion of spacers between the blocks produces a (8, n)-independent extension which is a function of a mixing Markov chain. (See Exercise 13, Section 1.2.) The only problem, therefore, is to produce a (8, n)-independent extension which is actually a mixing Markov chain, not just a function of a mixing Markov chain. This is easy to accomplish by marking long

220

CHAPTER IV. B-PROCESSES.

concatenations of blocks with a symbol not in the alphabet, then randomly inserting an occasional repeat of this symbol. il To describe the preceding in a rigorous manner, let C c A' consist of those ar for which it(4) > 0 and let s be a symbol not in the alphabet A. For each positive integer m, let sCm denote the set of all concatenations of the form s w(1) w(m), where w(i) E C, and let ft be the measure on sCrn defined by the formula

ft(sw( 1 ) .. w(m)) =
i=1

Fix 0 < p < 1, to be specified later, and let { ri } be the stationary, irreducible, aperiodic Markov chain with state space se" x {0, 1, , nm 1} and transition matrix defined by the rules
(a) If i < nm

1, (sxr , i) can only go to (sxr", i

1).

(b) (sxr , nm +1) goes to (sy'llin, 0) with probability pii(syr), and to (syr", 1) with probability (1 p),a(syr"), for each syr E SC m . Next, define f(sxrn, i) to be x, if 1 < i < mn + 1, and s otherwise. The process {Zi } defined by Zi = f(Y 1 ) is easily seen to be (mn + 2)-order Markov, since once the marker s is located in the past all future probabilities are known. As a function of the mixing Markov chain {ri }, it is mixing. Furthermore, (Z1 ) is clearly a (6, n)independent extension of A, provided m is large enough and p is small enough (so that only a 6-fraction of indices are spacers s.) This completes the proof of Theorem IV. 1.9. Remark W.1.10 The use of a marker symbol not in the original alphabet in the above construction was only to simplify the proof that the final process {Z1 } is Markov. A proof that does not require alphabet expansion is outlined in Exercise 6c, below. Remark IV.1.11 The proof that the B-processes are exactly the almost block-independent processes is based on the original proof, [66]. The coupling proof that mixing Markov chains are almost block-independent is modeled after the proof of a conditional form given in [41]. The proof that ABI processes are d-limits of mixing Markov processes is new.

IV.1.d

Exercises.

1. Show that (7) defines a joining of A n and yn . 2. Show the expected value of the coupling function is bounded, independent of n. 3. Show that the defined by (10) is indeed a joining of ,u,, n with Ti Ex. n (dnzn Winn zr)) E.

such that

4. Show that if { 1/17; } and {W} are (6, n)-independent extensions of An and fi n , respectively, with the same associated (6, n)-blocking process {R, } , then d(v, = {1T 7i } , rean(An, An), where y and D are the Kolmogorov measures of {W} spectively.

SECTION IV.2. THE FINITELY DETERMINED PROPERTY.

221

5. Show that even if the marker s belongs to the original alphabet A, the process {Z./ } defined in the proof of Theorem IV.1.9 is a (8, n)-independent extension of IL. (Hint: use Exercise 4.) 6. Let p, be the Kolmogorov measure of an ABI process for which 0 < kc i (a) < 1, for some a E A. (a) Show that p. must be mixing. oc, where a" denotes the concatenation of a (b) Show that Wan) 3. 0 as n with itself n times. (See Exercise 4c, Section 1.5.)

(c) Prove Theorem IV. 1.9 without extending the alphabet. (Hint: if n is large enough, then by altering probabilities slightly, and using Exercise 6b and Exercise 4, it can be assumed that a(a) = 0 for some a. The sequence ba2nb can be used in place of s to mark the beginning of a long concatenation of n-blocks.)
7. Some special codings of i.i.d. processes are Markov. Let {X,} be binary i.i.d. and define Y to be 1 if X X_1, and 0, otherwise. Show that {17 } is Markov, and not i.i.d. unless X n = 0 with probability 1/2.

Section IV.2

The finitely determined property.

The most important property of a stationary coding of an i.i.d. process is the finitely determined property, for it allows d-approximation of the process merely by approximating its entropy and a suitable finite joint distribution. A stationary process p. is finitely determined (FD) if any sequence of stationary processes which converges to it weakly and in entropy also converges to it in d. In other words, a stationary p is finitely determined if, given E > 0, there is a 8 > 0 and a positive integer k, such that any stationary process y which satisfies the two conditions,

(i) 114 vkl < 6


(ii) IH(,u) H(v)I < must also satisfy (iii) c/(kt, y) <
E.

Some standard constructions produce sequences of processes converging weakly and in entropy to a given process. Properties of such approximating processes that are preserved under d-limits are automatically inherited by any finitely determined limit process. This principle is used in the following theorem to establish ergodicity and almost block-independence for finitely determined processes. Theorem IV.2.1 A finitely determined process is ergodic, mixing, and almost block-independent. In particular, afinitely determined process is a stationary coding of an i.i.d. process. Proof Let p (n ) be the concatenated-block process defined by the restriction pn of a stationary measure p. to An. If k fixed and n is large relative to k, then the distribution of k-blocks in n-blocks is almost the same as the p.-distribution, the only error in this

222

CHAPTER IV. B-PROCESSES.

approximation being the end effect caused by the fact that the final k symbols of the oo. The entropy of i.t (n ) is equal n-block have to be ignored. Thus 147 ) gk , as n weakly and in entropy so that {g (n ) } converges to 1-1(u n )/n which converges to H(/)), to A. Since each 11, (n) is ergodic and since ergodicity is preserved by passage to d-limits, it follows that a finitely determined process is ergodic. Concatenated-block processes can be modified by randomly inserting spacers between the blocks. For each n, let ii (n ) be a (8n , n)-independent extension of A n (such processes were defined as part of the independent-extension lemma, Lemma IV.1.6.) If 8, is small enough then /2 00 and ii (n) have almost the same n-block distribution, hence almost the same k-block distribution for any k < n, and furthermore, ,a (n ) and p, (n ) have almost the same entropy. Thus if 3n goes to 0 suitably fast, {P ) } will converge weakly and in entropy to /I , hence in if is assumed to be finitely determined. Since ii(n ) can be chosen to be a function of a mixing Markov chain, see Exercise 5, Section IV.1, and hence almost block-independent, and since a-limits of ABI processes are ABI, a finitely determined process must be almost block-independent. Also, since functions of mixing processes are mixing and d-limits of mixing processes are mixing, a finitely determined process must be mixing. This completes the proof of Theorem IV.2.1.

The converse result, that a stationary coding of an i.i.d. process is finitely determined is much deeper and will be established later. Also useful in applications is a form of the finitely determined property which refers only to finite distributions. If y is a measure on An, and k < n, the v-distribution of k-blocks in n-blocks is the measure = 0(k, y) on Ak defined by

)=

1 k+it k i=ouE,tvE,n,

EE

V (14 a k i V),

that is, the average of the y-probability of a i` over the first n k in sequences of length n. A direct calculation shows that 0(aiic ) = E Pk(a14)Y(4) = Ep(Pk(4IX),
Xn i

1 starting positions

the expected value of the empirical k-block distribution with respect to v. Note also that if i3 is the concatenated-block process defined by y, then

(1)

41 _< 2(k 1)/n, 10(k, y) 1-

since the only difference between the two measures is that 14 includes in its average the ways k-blocks can overlap two n-blocks. Theorem IV.2.2 (The FD property: finite form.) A stationary process p. is finitely determined if and only if given c > 0, there is a > 0 and positive integers k and N such that if n > N then any measure v on An for which Ip k 0(k, v)I < 6 and IH(pn ) H(v)I < n6 must also satisfy cin (p,v) < E. Proof One direction is easy for if I) is the projection onto An of a stationary measure I), then 0(k, v) = vk , for any k < n, and (11 n)H (v n ) H (v). Thus, if p. satisfies the finite form of the FD property stated in the theorem, then p. must be finitely determined.

SECTION IV.2. THE FINITELY DETERMINED PROPERTY.

223

To prove the converse assume ti, is finitely determined and let c be a given positive number. The definition of finitely determined yields a 6 > 0 and a k, such that any stationary process y that satisfies Ivk 12k 1 < 6 and 1H (v) H COI < 6 must also satisfy d(v, p.) < E. < 6/2. By making N Choose N so large that if n > N then 1(1/01-/(an ) enough larger relative to k to take care of end effects, that is, using (1), it can also be supposed that if n > N and if 17 is the concatenated-block process defined by a measure y on An, then 1,0(k, y) 1 <8/2. Fix n > N and let y be a measure on An for which 10(k, v) ii,k1 < 3/2 and 1H (v) H(1.4)1 <n6/2 and let Tf be the concatenated-block process defined by v. The definition of N guarantees that 117k till 5_ 17k 0(k, v) I +10(k, y) oak' < 6. Furthermore, 1H (i.)) H COI < 6, since H(t)) = (1/ n)H (v), which is within 8/2 of (1/n)H(An), which is, in turn, within 6/2 of H(p.). Thus, the definitions of k and 6 imply that a (/urY-) < c. A finitely determined process is mixing, by Theorem IV.2.1, hence is totally ergodic, so Exercise 2 implies that 4., (A n , y) < ii(p.,17) < E. This completes the proof of Theorem IV.2.2, CI

IV.2.a ABI implies FD.


The proof that almost block-independence implies finitely determined will be given in three stages. First it will be shown that an i.i.d. process is finitely determined. This proof will then be extended to establish that mixing Markov processes are finitely determined. Finally, it will be shown that the class of finitely determined processes is closed in the d-metric. These results immediately show that ABI implies FD, for an ABI process is the d-limit of mixing Markov processes by Theorem IV.1.9 of the preceding section.

IV.2.a.1 I.i.d. processes are finitely determined.


The i.i.d. proof will be carried out by induction. It is easy to get started for closeness in first-order distribution means that first-order distributions can be dl-well-fitted. The dn -fi of the distributions of X'il and yr by fitting the condiidea is to extend a good tting tional distribution of X n+ 1, given Xi', as well as possible to the conditional distribution of yr, given Y. The conditional distribution of Xn+ i, given X, does not depend on ril, if independence is assumed, and hence the method works, provided closeness in distribution and entropy implies that the conditional distribution of 17, given Yin, is almost independent of Yi. To make this sketch into a real proof, an approximate independence concept will be used. Let p(x, y) = px.y(x, y) = Prob(X = x, Y = y) denote the joint distribution of the pair (X, Y) of finite-valued random variables, and let p(x) = px (x) = E, p(x, y) and p(y) = py (y) = Ex p(x, y) denote the marginal distributions. The random variables are said to be c-independent if

E Ip(x, y) p(x)p(y)I < e.


X,

If H(X) = H(X1Y), then X and Y are independent. An approximate form of this is stated as the following lemma.

224

CHAPTER IV. B-PROCESSES.

Lemma IV.2.3 (The E-independence lemma.) There is a positive constant c such that if 11(X) H(X1Y) < c6 2 , then X and Y are
6-independent.

Proof The entropy difference may be expressed as the divergence of the joint distribution pxy from the product distribution px p y , that is, H(X) H(X1Y) =

p(x) log p(x)


log

E p(x, y) log p(x, y)


x.y P(Y)

E p(x, y)

x,y D (p x,y1PxP Y )

P(x, Y) p(x)p(y)

Pinsker's inequality, Exercise 6 of Section 1.6, yields,


2

D(pxylpxp

y) C

(E , y IP(X
x

P(X)P(Y)i)

where c is a positive constant, independent of X and Y. This establishes the lemma. 0 A stationary process {Yi } is said to be an 6-independent process if (Y1, Yi-1) and yi are c-independent for each j > 1. An i.i.d. process is, of course, e-independent for every c > O. The next lemma asserts that a process is e-independent if it is close enough in entropy and first-order distribution to an i.i.d. process.

Lemma IV.2.4 Let {X i } be i.i.d. and {Y i } be stationary with respective Kolmogorov measures /2 and v. Given e > 0, there is a 8 > 0 such that if 1,u1 vil < 8 and 11I (A) H(v)1 < then {Yi} is an 6-independent process. Proof If first-order distributions are close enough then first-order entropies will be close. Thus if Ai is close enough to v1, then 11 (X 1 ) will be close to 11 (Y1 ). The i.i.d. property, however, implies that H(A) = H(X i ), so that if 11(v) = limm H(Y i lY i ) is close enough to H(A), then H(Yi) 1-1 (Y11 172 1 ) will be small, for all m > 0, since H(YilYm) is decreasing in m. The lemma now follows from the preceding lemma. 0
A conditioning formula useful in the i.i.d. proof as well as in later results is stated here, without proof, as the following lemma. In its statement, p(x, y) denotes the joint distribution of a pair (X, Y) of finite-valued random variables, with first marginal p(x) = yy p(x, y) and conditional distribution p(y1x) = p(x, y)I p(x).

Lemma IV.2.5 If f (x, y) = g(x) h(y), then

E f (x, y)p(x, y) =
X
.)'

g(x)p(x)

E p(x)

h(y)p(ydx).

Use will also be made of the fact that first-order d-distance is just one-half of variational distance, that is, di (A, y) = vii/ 2 , which is just the equality part of property (a) in the d-properties lemma, Lemma 1.9.11.

SECTION IV.2. THE FINITELY DE IERMINED PROPERTY. Theorem IV.2.6 An i.i.d. process is finitely determined.

225

Proof The following notation and terminology will be used. The random variable X n+1 , conditioned on X = x is denoted by X n+ i /4, and di (Xn +i Yn-i-i/Y) denotes the a-distance between the corresponding conditional distributions. Fix an i.i.d. process {Xi} and c > O. Lemma IV.2.4 provides a positive number 3 so that if j vil < 8 and I H(/),) 1/(y)1 < 3, then the process {Yi } defined by y is an c-independent process. Without loss of generality it can be supposed that 3 < E. Fix It will be shown by induction that cin (X, < E. such a Getting started is easy, since d i (X I , Y1 ) = (1/2)I,u i vil < 3/2 < c/2. Assume it has been shown that an (r i', Yin) < c, and let An realize an (Xtil, Yin). The strategy is to extend by fitting the conditional distributions Xn+ 1 /4 and Yn+i /yri' as well as possible. This strategy is successful because the distribution of Xn+ 1 /x'il is equal to the distribution of Xn+ i, by independence, and the distribution of 1" 1 + 1 /Y1 is, by c-independence, close on the average to that of Yn+i , which is in turn close to the distribution of X n+1 . For each x, y, let ?.f Y,+1/4). The measure X, 1 defined 1 , - realize di (Xn+i by

{r}.

Xn+1 (4+1,

4+1) =

\ x..,,

1/I'

( An+1, Yri+1),

is certainly a joining of /2, 44 and yn+1 . Furthermore, since (n 1)dn+1(4 +1 , y +1 ) = ndn (x`i' , y`i') + (xn+i, the conditional formula, Lemma IV.2.5, yields (n (2) 1 )Ex n+ I(dn+1) = nEA,(dn)
XI

E
VI

Xri (X n i Yn i )d1(Xn+1, Yn+1).x7.)T (Xn+1 Yni-1)

n+1

The first term is upper bounded by nc, since it was assumed that A n realizes while the second sum is equal to

cin (X, YD,

(3)
di (X

E
1 /x, Y < 1 /y)

y)di(Xn+04, yn+./4),

since it was assumed thatX x i .n realizes d i (X n +i The triangle inequality yields

(Xn+1/X r i l Xn+1) dl(Xn+19 Yn+1) +

(Yn+1 Y1 /y)

The first term is 0, by independence, while, by stationarity, the second term is equal to di (X I , Y1 ), which is at most c/2. The third term is just (1/2) the variational distance between the distributions of the unconditioned random variable Yn+1 and the conditioned random variable Y +1 /y, whose expected value (with respect to yri') is at most /2, since Yn+1 and Y;' are c-independent. Thus the second sum in (2) is at most 6/2 6/2, which along with the bound nExn (dn ) < n6 and the fact that A n +1 is a joining of ,u, n+i andy n+ i, produces the inequality (n 1)dn+1(, y)

(n +

A., 1 (dn+i) < ndn(A,

y)

+E < (n + 1)c,

thereby establishing the induction step. This completes the proof of Theorem IV.2.6.

226 IV.2.a.2

CHAPTER IV. B-PROCESSES.

Mixing Markov processes are finitely determined.

The Markov result is based on a generalization of the i.i.d. proof. In that proof, the fitting was done one step at a time. A good match up to stage n was continued to stage n + 1 by using the fact that X, 1 is independent of X7 for the i.i.d. process, and 17,1 +1 is almost independent of 1?, provided the Y-process is close enough to the X-process in distribution and entropy. To extend the i.i.d. proof to the mixing Markov case, two properties will be used, the Markov property and the Markov coupling lemma, Lemma IV.1.8. The Markov property n +Pin depends on the immediate past X,, guarantees that for any n > 1, a future block Xn but on no previous values. The Markov coupling lemma guarantees that for mixing Markov chains even this dependence on X, dies off in the d-sense, as m grows. The key is to show that approximate versions of both properties hold for any process close enough to the Markov process in entropy and in joint distribution for a long enough time, for then a good match after n steps can be carried forward by fitting future m-blocks, for suitable choice of m. A conditional form of c-independence will be needed. The random variables X and Y are conditionally E-independent, given Z, if

E E ip(x, yiz) p(xlz)p(ydz)ip(z) < z x,y

E,

where p(x, ylz) denotes the conditional joint distribution and p(x lz) and p(y lz) denote the respective marginals. The conditional form of the E-independence lemma, Lemma IV.2.3, extends to the following result, whose proof is left to the reader.

Lemma IV.2.7 (The conditional 6-independence lemma.) Given E > 0, there is a y = y(E) > 0 such that if H(XIZ) H(XIY, Z) < y, then
X and Y are conditionally E-independent, given Z.
The next lemma provides the desired approximate form of the Markov property.

Lemma IV.2.8
Let {X,} be an ergodic Markov process with Kolmogorov measure and let {K} be a stationary process with Kolmogorov measure v. Given E > 0 and a positive integer m, there is a 3 > 0 such that if lid,m+1 v m+ii < 8 and IH(a) H(v)i < 8, then, for every n > 1, 17In and Yi-,; are conditionally E-independent, given Yo.

Proof Choose y = y(E) from preceding lemma, then choose 3 > 0 so small that if 1 12 m+1 v m i I < 8 then

IH(Yriy0)-HaTixol < y12,


which is possible since conditional entropy is continuous in the variational distance. Fix m. By decreasing 3 if necessary it can be assumed that if I H(g) H(v)I <8 then

Inanu) na-/(01 < y/2.


Fix a stationary process y for which ym+ il < 3 and 11/(u) H (v)I <8. The choice of 3 and the fact that H(YZIY n ) decreases to m H (v) as n oc, then guarantee that HOTIN 1 -1(YPIY2 n ) < y, n > 1,

SECTION IV.2. THE FINITELY DETERMINED PROPERTY

227

which, in turn, implies that 171 and Yfni are conditionally E-independent, given Yo, for all n > 1, by the choice of 3 1 . This proves Lemma IV.2.8.
Theorem IV.2.9 A mixing finite-order Markov process is finitely determined. Proof

n + 7i , conditioned on +71 /4 will denote the random vector Xn In the proof, X4 +im ) will denote the d-distance between the distributions n + T/xii, Ynn+ = x'17 , and ii(Xn of X'nl Vin / and 11,7:14:r/y;'. Only the first-order proof will be given; the extension to arbitrary finite order is left to the reader. Fix a mixing Markov process {X n } with Kolmogorov measure p, and fix E > O. The Markov coupling lemma, Lemma IV.1.8, provides an m such that (4) dm (XT, XT/x0) < 6/4, xo E A.

Since this inequality depends only on ,u,n+i there is a 3 > 0 so that if y is a stationary vm+1 1 < 6 then measure for which

(5)

am(Yln, Yln /Y0) <

yo E A.
I/2, it can also be assumed

Furthermore, since dm (u, y) is upper bounded by Ittm that 6 is small enough to guarantee that

(6)
(7)

dm (xT,

yr ) < e/4.

By making 3 smaller, if necessary, it can also be assumed from Lemma IV.2.8 that

Yini and Y i n are conditionally (E/2)-independent, given Yo ,

for any stationary process {Yi } with Kolmogorov measure y for which vm+1I < and I H(u) H(p)i <8. Fix a stationary y for which (5), (6), and (7) all hold. To complete the proof that mixing Markov implies finitely determined it is enough to show that d(p,, y) < E. The d-fitting of and y is carried out m steps at a time, using (6) to get started. As in the proof for the i.i.d. case, it is enough to show that

(8)
dni(x n+1 - /x 1 ,

E
X n Vn

4(4, y)ani

(X ,:vin/4, YTP:r/y)

where Xn realizes an (A, y). The distribution of Xn n+ 47/4 is the same as the distribution of Xn n + r, conditioned on X, = x, and hence the triangle inequality yields,

+ cim(r,r+Fr, + drn ( 17:+7 , 17,4T /Yri)


dm (Y:

YrinZ)
Y:2V/yri?).

The first three terms contribute less than 3E/4, by (4), (6) and (5). The fourth term is upper bounded by half of the variational distance between the distribution of Y,'1 /y,, and the distribution of Ynn+ +;n /y. But the expected value of this variational distance is less than 6/2, since Ynn+ +Ini and YinI are conditionally (E/2)-independent, given Yn , by (7). Thus the expected value of the fourth term is at most (1/2)6/2 = 6/4, and hence the sum in (8) is less than E. This completes the proof of Theorem IV.2.9.

228

CHAPTER IV. B-PROCESSES. The finitely determined processes are d-closed.

IV.2.a.3

The final step in the proof that almost block-independence implies finitely determined is stated as the following theorem, a result also useful in other contexts.

Theorem IV.2.10 (The a-closure theorem.) The d-limit of finitely determined processes is finitely determined.
The theorem is an immediate consequence of the following lemma, which guarantees that any process a-close enough to a finitely determined process must have an approximate form of the finitely determined property.

Lemma IV.2.11 If d(p,,v) <E and v is finitely determined, then there is a (5 > 0 and a k such that

d(u, for any stationary process Trt for which 1,ak

< 3, < (5 and IH(A) < S.

The basic idea for the proof is as follows. Let A be a stationary joining of it and y such that EAd(x i , y l )) = d(,u., v). It will be shown that if 72. is close enough to ,u in distribution and entropy there will be a stationary A with first marginal ,a such that

(a) 3 is close to A in distribution and entropy.


(b) The second marginal of 3 is close to u in distribution and entropy. If the second marginal, '17 is close enough to y in distribution and entropy then the finitely determined property of y guarantees that a(v, l) will be small. The fact that 3: is a stationary joining of and guarantees that

E3.:(d(xi, yi)),
which is however, close to E),(d(x l , vi)) = d(p., v), by property (a). The triangle inequality d(v + d a], then implies that a(u, is also small.

A simple language for making the preceding sketch into a rigorous proof is the language of channels, borrowed from information theory. (None of the deeper ideas from channel theory will be used, only its suggestive language.) Fix n, and for each a_1^n, let λ_n(·|a_1^n) be the conditional measure on A^n defined by the formula

(9)   λ_n(b_1^n | a_1^n) = λ(a_1^n, b_1^n) / μ(a_1^n).

The family {λ_n(·|a_1^n)} of conditional measures can be thought of as the (noisy) channel (or black box) which, given the input a_1^n, outputs b_1^n with probability λ_n(b_1^n|a_1^n). Such a finite-sequence channel extends to an infinite-sequence channel

(10)   x ↦ y,

which outputs an infinite sequence y, given an infinite input sequence x, by breaking x into blocks of length n and applying the n-block channel to each block separately; that is, y_{jn+1}^{jn+n} has the value b_1^n with probability λ_n(b_1^n | x_{jn+1}^{jn+n}), independent of {x_i: i ≤ jn} and {y_i: i ≤ jn}. Given an input measure α on A^∞, the infinite-sequence channel (10) defined by the conditional probabilities (9) determines a joining λ* = λ*(λ_n, α) of α with an output measure β* = β*(λ_n, α) on A^∞. The joining λ*, called the joint input-output measure defined by the channel and the input measure α, is the measure on A^∞ × A^∞ defined for m ≥ 1 by the formula

λ*(a_1^{mn}, b_1^{mn}) = α(a_1^{mn}) ∏_{j=0}^{m-1} λ_n(b_{jn+1}^{jn+n} | a_{jn+1}^{jn+n}).

The projection of λ* onto its first factor is clearly the input measure α. The output measure β* is the projection of λ* onto its second factor, that is, the measure on A^∞ defined for m ≥ 1 by the formula

β*(b_1^{mn}) = Σ_{a_1^{mn}} λ*(a_1^{mn}, b_1^{mn}).

If α is stationary then neither λ*(λ_n, α) nor β*(λ_n, α) need be stationary, though both are certainly n-stationary, and hence stationary measures λ = λ(λ_n, α) and β = β(λ_n, α) are obtained by randomizing the start. A direct calculation shows that λ(λ_n, α) is a joining of α and β(λ_n, α). The measures λ(λ_n, α) and β(λ_n, α) are called, respectively, the stationary joint input-output measure and the stationary output measure defined by λ_n and the input measure α. The next two lemmas contain the facts that will be needed about the continuity of λ(λ_n, α) and β(λ_n, α), in n and in α, with respect to weak convergence and convergence in entropy.
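To make the channel language concrete, here is a small Python sketch for tiny finite examples (the data layout and function names are ours, not the book's): it builds the conditional measures of formula (9) from a joining of two n-block distributions, and then applies the n-block channel independently to each block of an input measure, as in the displayed formula for λ*. Summing the result over the output coordinate recovers the input measure, and summing over the input coordinate gives the output measure β*.

```python
from collections import defaultdict
from itertools import product

def make_channel(joint, mu):
    # Formula (9): channel[a][b] = lambda_n(b | a) = lambda_n(a, b) / mu(a).
    # joint: dict mapping pairs (a, b) of n-blocks to probabilities (a joining);
    # mu: its first marginal.
    channel = defaultdict(dict)
    for (a, b), p in joint.items():
        if mu[a] > 0:
            channel[a][b] = p / mu[a]
    return channel

def joint_input_output(channel, alpha):
    # Apply the n-block channel independently to each block of the input:
    # the probability of the pair (a_1...a_m, b_1...b_m) is alpha(a_1...a_m)
    # times the product over j of channel[a_j][b_j].
    # alpha: dict mapping m-tuples of n-blocks to probabilities.
    out = defaultdict(float)
    for a_blocks, pa in alpha.items():
        choices = [list(channel[a].items()) for a in a_blocks]
        for combo in product(*choices):
            b_blocks = tuple(b for b, _ in combo)
            p = pa
            for _, q in combo:
                p *= q
            out[(a_blocks, b_blocks)] += p
    return dict(out)

# Tiny example with 1-blocks: a joining of mu = (.5, .5) with nu = (.4, .6).
joint = {('0', '0'): 0.4, ('0', '1'): 0.1, ('1', '1'): 0.5}
mu = {'0': 0.5, '1': 0.5}
channel = make_channel(joint, mu)
alpha = {('0', '1'): 0.25, ('1', '0'): 0.75}   # an input measure on pairs of blocks
lam_star = joint_input_output(channel, alpha)
print(sum(lam_star.values()))                  # 1.0, up to rounding
```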

Lemma IV.2.12 (Continuity in n.) If λ is a stationary joining of the two stationary processes μ and ν, then λ(λ_n, μ) converges weakly and in entropy to λ, and β(λ_n, μ) converges weakly and in entropy to ν, as n → ∞.

Proof. Fix (a_1^k, b_1^k), let n > k, and put λ* = λ*(λ_n, μ). The measure λ(λ_n, μ) assigns to (a_1^k, b_1^k) the average of the n measures λ*(a(i)_{i+1}^{i+k}, b(i)_{i+1}^{i+k}), 0 ≤ i < n, where a(i)_{i+s} = a_s and b(i)_{i+s} = b_s for 1 ≤ s ≤ k. But, for i ≤ n - k,

λ*(a(i)_{i+1}^{i+k}, b(i)_{i+1}^{i+k}) = λ(a_1^k, b_1^k),

so that λ(λ_n, μ)(a_1^k, b_1^k) converges to λ(a_1^k, b_1^k) as n → ∞. Likewise, the averaged output β(λ_n, μ)(b_1^k) = Σ_{a_1^k} λ(λ_n, μ)(a_1^k, b_1^k) converges to λ(b_1^k) = ν(b_1^k), as n → ∞.

To establish convergence in entropy, first note that if (X_1^n, Y_1^n) is a random vector with distribution λ_n, then H(X_1^n, Y_1^n) = H(X_1^n) + H(Y_1^n | X_1^n), and hence

lim_n (1/n) H(Y_1^n | X_1^n) = H(λ) - H(μ).

On the other hand, if (X_1^{mn}, Y_1^{mn}) is a random vector with distribution λ*(λ_n, μ)_{mn}, then

H(X_1^{mn}, Y_1^{mn}) = H(X_1^{mn}) + H(Y_1^{mn} | X_1^{mn}) = H(X_1^{mn}) + Σ_{j=1}^{m} H(Y_{(j-1)n+1}^{jn} | X_{(j-1)n+1}^{jn}) = H(X_1^{mn}) + m H(Y_1^n | X_1^n),

since the channel treats input n-blocks independently. Dividing by mn and letting m → ∞ gives

H(λ*(λ_n, μ)) = H(μ) + (1/n) H(Y_1^n | X_1^n),

so that H(λ*(λ_n, μ)) → H(λ), as n → ∞. Randomizing the start then produces the desired result, H(λ(λ_n, μ)) → H(λ), since the entropy of an n-stationary process is the same as the entropy of the average of its first n shifts, see Exercise 5, Section I.6. Furthermore, randomizing the start also yields H(β(λ_n, μ)) → H(ν), since β*_n = ν_n, where β* = β*(λ_n, μ). This completes the proof of Lemma IV.2.12.

Lemma IV.2.13 (Continuity in input distribution.) Fix n ≥ 1, and a stationary measure λ on A^∞ × A^∞. If a sequence of stationary processes {α(m)} converges weakly and in entropy to a stationary process α as m → ∞, then

(i) {λ(λ_n, α(m))} converges weakly and in entropy to λ(λ_n, α), as m → ∞.

(ii) {β(λ_n, α(m))} converges weakly and in entropy to β(λ_n, α), as m → ∞.

Proof. For simplicity assume n = 1, so that both λ* and β* are stationary for any stationary input. Fix a stationary input measure α and put λ = λ(λ_1, α). By definition,

λ_m(a_1^m, b_1^m) = α(a_1^m) ∏_{i=1}^{m} λ_1(b_i | a_i),

which is weakly continuous in α. Furthermore, if (X_1^m, Y_1^m) is a random vector with distribution λ_m, then H(X_1^m, Y_1^m) = H(X_1^m) + m H(Y_1 | X_1), since the channel treats input symbols independently. Thus, dividing by m and letting m go to ∞ yields H(λ) = H(α) + H(Y_1 | X_1), which depends continuously on H(α) and the distribution of (X_1, Y_1). Thus (i) holds. Likewise, β = β(λ_1, α) is weakly continuous in α, and H(β) = H(λ) - H(X_1 | Y_1) is continuous in H(λ) and the distribution of (X_1, Y_1), so (ii) holds. The extension to n > 1 is straightforward, and hence Lemma IV.2.13 is established.

Proof of Lemma IV.2.11. Assume d̄(μ, ν) < ε, assume ν is finitely determined, and let λ be a stationary measure realizing d̄(μ, ν), so that, in particular, E_λ(d(x_1, y_1)) < ε. The finitely determined property of ν provides γ and ℓ such that any stationary process ν̃ for which |ν_ℓ - ν̃_ℓ| < γ and |H(ν) - H(ν̃)| < γ must also satisfy d̄(ν, ν̃) < ε. The continuity-in-n lemma, Lemma IV.2.12, provides an n such that the stationary input-output measure Λ = λ(λ_n, μ) satisfies

(11)   E_Λ(d(x_1, y_1)) ≤ E_λ(d(x_1, y_1)) + ε,

and the stationary output measure β = β(λ_n, μ) satisfies

|β_ℓ - ν_ℓ| < γ/2  and  |H(β) - H(ν)| < γ/2.


The continuity-in-input lemma, Lemma IV.2.13, provides δ and k such that if μ̄ is any stationary process for which |μ_k - μ̄_k| < δ and |H(μ̄) - H(μ)| < δ, then the joint input-output distribution λ̄ = λ(λ_n, μ̄) satisfies

(12)   E_λ̄(d(x_1, y_1)) ≤ E_Λ(d(x_1, y_1)) + ε,

and the output distribution ν̄ = β(λ_n, μ̄) satisfies

|ν̄_ℓ - β_ℓ| < γ/2  and  |H(ν̄) - H(β)| < γ/2.


Given such an input μ̄, it follows that the output ν̄ satisfies

|ν̄_ℓ - ν_ℓ| ≤ |ν̄_ℓ - β_ℓ| + |β_ℓ - ν_ℓ| < γ,

and

|H(ν̄) - H(ν)| ≤ |H(ν̄) - H(β)| + |H(β) - H(ν)| < γ,

so the definitions of γ and ℓ imply that d̄(ν, ν̄) < ε. Furthermore,

d̄(μ̄, ν̄) ≤ E_λ̄(d(x_1, y_1)),

since λ̄ = λ(λ_n, μ̄) is a joining of the input μ̄ with the output ν̄, and

E_λ̄(d(x_1, y_1)) ≤ E_Λ(d(x_1, y_1)) + ε ≤ E_λ(d(x_1, y_1)) + 2ε < 3ε,

by the inequalities (11) and (12), and the fact that λ realizes d̄(μ, ν), which was assumed to be less than ε. Since also d̄(μ, ν) < ε and d̄(ν, ν̄) < ε, the triangle inequality yields d̄(μ, μ̄) ≤ d̄(μ, ν) + d̄(ν, ν̄) + d̄(ν̄, μ̄) < 5ε. This completes the proof of Lemma IV.2.11, thereby completing the proof that d̄-limits of finitely determined processes are finitely determined, and hence the proof that almost block-independent processes are finitely determined. □

Remark IV.2.14 The proof that i.i.d. processes are finitely determined is based on Ornstein's original proof, [46], while the proof that mixing Markov chains are finitely determined is based on the Friedman-Ornstein proof, [13]. The fact that d-limits of finitely determined processes are finitely determined is due to Ornstein, [46]. The observation that the ideas of that proof could be expressed in the simple channel language used here is from [83]. The fact that almost block-independence is equivalent to finitely determined first appeared in [67]; the proof given here that ABI processes are finitely determined is new.

IV.2.b

Exercises.

1. The n-th order Markovization of a stationary process μ is the n-th order Markov process ν with transition probabilities ν(x_{n+1} | x_1^n) = μ(x_1^{n+1})/μ(x_1^n). Show that a finitely determined process is the d̄-limit of its n-th order Markovizations. (Hint: ν is close in distribution and entropy to μ. A small computational sketch of the construction follows the exercise list.)


2. Show that if μ is totally ergodic and ν is any probability measure on A^n, then d̄_n(μ, ν) ≤ d̄(μ, ν̄), where ν̄ denotes the concatenated-block process defined by (A^n, ν). (Hint: d̄(μ, ν̄) = d̄(x, y), for some y for which the limiting nonoverlapping n-block distribution of T^i y is ν, for some i < n, and for some x all of whose shifts have limiting nonoverlapping n-block distribution equal to μ_n, see Exercise 1a in Section I.4. Hence one of the limiting nonoverlapping n-block distributions determined by (T^i x, T^i y) must be a joining of μ_n and ν.)

3. Show that the class of B-processes is d̄-separable.

4. Prove the conditional ε-independence lemma, Lemma IV.2.7.

5. Show that a process built by repeated independent cutting and stacking is a B-process if the initial structure has two columns with heights differing by 1.
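The Markovization in Exercise 1 is a purely finite construction, so it can be illustrated directly. The Python sketch below (the function name and data layout are ours, not the book's) computes the transition probabilities ν(x_{n+1} | x_1^n) = μ(x_1^{n+1})/μ(x_1^n) from a table of block probabilities.

```python
from collections import defaultdict

def markovization(block_probs, n):
    # n-th order Markovization: nu(x_{n+1} | x_1^n) = mu(x_1^{n+1}) / mu(x_1^n).
    # block_probs: dict mapping tuples of length n and n+1 to their mu-probabilities.
    transitions = defaultdict(dict)
    for block, p in block_probs.items():
        if len(block) == n + 1 and p > 0:
            context, symbol = block[:n], block[n]
            transitions[context][symbol] = p / block_probs[context]
    return dict(transitions)

# A two-symbol example with n = 1, consistent with a stationary process.
mu = {(0,): 0.6, (1,): 0.4, (0, 0): 0.5, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.3}
print(markovization(mu, 1))
```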

Section IV.3

Other B-process characterizations.

A more careful look at the proof that mixing Markov chains are finitely determined shows that all that is really needed for a process to be finitely determined is a weaker form of the Markov coupling property. This form, called the very weak Bernoulli property, is actually equivalent to the finitely determined property, and hence serves as another characterization of B-processes. Equivalence will be established by showing that very weak Bernoulli implies finitely determined and then that almost blowing-up implies very weak Bernoulli. Since it has already been shown that finitely determined implies almost blowing-up, see Section III.4, it follows that almost block-independence, almost blowing-up, very weak Bernoulli, and finitely determined are, indeed, equivalent to being a stationary coding of an i.i.d. process.

IV.3.a

The very weak Bernoulli and weak Bernoulli properties.

As in earlier discussions, either measure or random variable notation will be used for the d̄-distance, and X_1^n/x_{-k}^0 will denote the random vector X_1^n conditioned on the past values x_{-k}^0. A stationary process {X_n} is very weak Bernoulli (VWB) if given ε > 0 there is an n such that for any k > 0,

(1)   E_{X_{-k}^0}( d̄_n(X_1^n/X_{-k}^0, X_1^n) ) < ε,

where E_{X_{-k}^0} denotes expectation with respect to the random past X_{-k}^0. Informally stated, a process is VWB if the past has little effect in the d̄-sense on the future. The significance of very weak Bernoulli is that it is equivalent to finitely determined and that many physically interesting processes, such as geodesic flow on a manifold of constant negative curvature and various processes associated with billiards, can be shown to be very weak Bernoulli by exploiting their natural expanding and contracting foliation structures. The reader is referred to [54] for a discussion of such applications.

Theorem IV.3.1 (The very weak Bernoulli characterization.) A process is very weak Bernoulli if and only if it is finitely determined.
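The simplest example is an i.i.d. process: conditioning on any finite past leaves the distribution of X_1^n unchanged, so the expectation in (1) vanishes for every n and k, and i.i.d. processes are trivially very weak Bernoulli.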


The proof that very weak Bernoulli implies finitely determined will be given in the next subsection, while the proof of the converse will be given later, after some discussion of the almost blowing-up property. A somewhat stronger property, called weak Bernoulli, is obtained by using variational distance with a gap in place of the d̄-distance. A stationary process {X_n} is weak Bernoulli (WB), or absolutely regular, if past and future become ε-independent when separated by a gap g, that is, given ε > 0 there is a gap g such that for any k > 0 and m > 0, the random vectors X_{g+1}^{g+m} and X_{-k}^0 are ε-independent. The class of weak Bernoulli processes includes the mixing Markov processes and the large class of mixing regenerative processes, see Exercise 2 and Exercise 3. Furthermore, as noted in earlier theorems, Theorems III.2.3 and III.5.1, weak Bernoulli processes have nice empirical distribution and waiting-time properties. Their importance here is that weak Bernoulli processes are very weak Bernoulli. To see why, first note that if X_{g+1}^{g+m} and X_{-k}^0 are ε-independent then

E_{X_{-k}^0}( d̄_m(X_{g+1}^{g+m}/X_{-k}^0, X_{g+1}^{g+m}) ) < ε/2,

since d̄-distance is upper bounded by one-half the variational distance. If this is true for all m and k, then one can take n = g + m with m so large that g/(g + m) < ε/2, and use the fact that

d̄_n(U_1^n, V_1^n) ≤ d̄_{n-g}(U_{g+1}^n, V_{g+1}^n) + g/n,

for any n > g and any random vectors (U_1^n, V_1^n), to obtain E_{X_{-k}^0}( d̄_n(X_1^n/X_{-k}^0, X_1^n) ) < ε. Thus weak Bernoulli indeed implies very weak Bernoulli. In a sense made precise in Exercises 4 and 5, weak Bernoulli requires that, with high probability, the conditional measures on different infinite pasts can be joined in the future so that, with high probability, names agree after some point, while very weak Bernoulli only requires that the density of disagreements be small. This distinction was the key to the example constructed in [65] of a very weak Bernoulli process that is not weak Bernoulli, a result established by another method in [78].
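The bound "d̄-distance is at most one-half the variational distance" can be checked numerically in the simplest case. The following Python snippet (an illustration only; the function names are ours) treats the one-letter case, where the optimal coupling puts mass min(p, q) on the diagonal and the two quantities are actually equal; for longer blocks the d̄-distance is only bounded above by half the variational distance.

```python
import numpy as np

def half_variational(p, q):
    # half of the l1 (variational) distance between two distributions
    return 0.5 * np.abs(p - q).sum()

def dbar_one_letter(p, q):
    # minimal probability of disagreement over couplings of p and q;
    # the coupling that puts min(p, q) on the diagonal achieves it
    return 1.0 - np.minimum(p, q).sum()

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(5))
q = rng.dirichlet(np.ones(5))
print(dbar_one_letter(p, q), half_variational(p, q))  # the two numbers agree
```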

IV.3.b

Very weak Bernoulli implies finitely determined.

The proof models the earlier argument that mixing Markov implies finitely determined. Entropy is used to guarantee approximate independence from the distant past, conditioned on intermediate values, while very weak Bernoulli is used to guarantee that even such conditional dependence dies off in the d̄-sense as block length grows. Since approximate versions of both properties hold for any process close enough to {X_n} in entropy and in joint distribution for a long enough time, a good fitting can be carried forward by fitting future m-blocks, for suitable choice of m. As in the earlier proofs, the key is to show that

Σ_{x_1^n, y_1^n} λ_n(x_1^n, y_1^n) d̄_m(X_{n+1}^{n+m}/x_1^n, Y_{n+1}^{n+m}/y_1^n)

is small for some fixed m and every n, given only that the processes are close enough in entropy and k-th order distribution, for some fixed k > m. The triangle inequality gives

d̄_m(X_{n+1}^{n+m}/x_1^n, Y_{n+1}^{n+m}/y_1^n) ≤ d̄_m(X_{n+1}^{n+m}, Y_{n+1}^{n+m}) + d̄_m(X_{n+1}^{n+m}/x_1^n, X_{n+1}^{n+m}) + d̄_m(Y_{n+1}^{n+m}, Y_{n+1}^{n+m}/y_1^n).


By stationarity, the first term is equal to d̄_m(X_1^m, Y_1^m), which is small since m < k, while the expected value of the second term is small for large enough m, uniformly in n, by the very weak Bernoulli property. Thus, once such an m is determined, all that remains to be shown is that the expected value of d̄_m(Y_{n+1}^{n+m}, Y_{n+1}^{n+m}/y_1^n) is small, for all n, provided only that {Y_n} is close enough to {X_n} in k-th order distribution, for some large enough k > m, as well as close enough in entropy.

The details of the preceding sketch are carried out as follows. Fix a very weak Bernoulli process {X_n} with Kolmogorov measure μ and fix ε > 0. The very weak Bernoulli property provides an m so that

(2)   E_{X_{-t}^0}( d̄_m(X_1^m/X_{-t}^0, X_1^m) ) < ε/8,   t > 0.

The goal is to show that a process sufficiently close to {X_n} in distribution and entropy satisfies almost the same bound. Towards this end, let γ be a positive number to be specified later. For m fixed, H(X_1^m | X_{-t}^0) converges to mH(μ), as t → ∞, and hence there is a K such that H(X_1^m | X_{-K}^0) < mH(μ) + γ. Fix k = m + K + 1. The quantity E_{X_{-K}^0}( d̄_m(X_1^m/X_{-K}^0, X_1^m) ) depends continuously on μ_k, so if δ is small enough and {Y_n} is a stationary process with Kolmogorov measure ν such that |μ_k - ν_k| < δ, then

(3)   E_{Y_{-K}^0}( d̄_m(Y_1^m/Y_{-K}^0, Y_1^m) ) < ε/4.

Furthermore, H(X_1^m | X_{-K}^0) also depends continuously on μ_k, so that if it is also assumed that |H(ν) - H(μ)| < δ, and δ is small enough, then H(Y_1^m | Y_{-K}^0) < mH(ν) + 2γ holds. But, since H(Y_1^m | Y_{-t}^0) decreases to mH(ν), this means that

H(Y_1^m | Y_{-K}^0) - H(Y_1^m | Y_{-K-j}^0) < 2γ,   j ≥ 0.

If γ is small enough, the conditional ε-independence lemma, Lemma IV.2.7, implies that Y_1^m and Y_{-K-j}^{-K-1} are conditionally (ε/2)-independent, given Y_{-K}^0. Since d̄-distance is upper bounded by one-half the variational distance, this means that

E_{Y_{-K-j}^0}( d̄_m(Y_1^m/Y_{-K}^0, Y_1^m/Y_{-K-j}^0) ) < ε/4,   j ≥ 1.


This result, along with the triangle inequality and the earlier bound (3), yields the result that will be needed, namely,

(4)   E_{Y_{-t}^0}( d̄_m(Y_1^m, Y_1^m/Y_{-t}^0) ) < ε/2,   t ≥ 1.

In summary, since it can also be supposed that δ < ε/2, and hence that d̄_m(X_1^m, Y_1^m) < ε/4, it follows from the triangle inequality that there is a δ > 0 such that

Σ_{x_1^n, y_1^n} λ_n(x_1^n, y_1^n) d̄_m(X_{n+1}^{n+m}/x_1^n, Y_{n+1}^{n+m}/y_1^n) < ε,

for all n, for any stationary process {Y_n} with Kolmogorov measure ν for which |μ_k - ν_k| < δ and |H(μ) - H(ν)| < δ, which completes the proof that very weak Bernoulli implies finitely determined. □


IV.3.c

The almost blowing-up characterization.

The almost blowing-up property (ABUP) was introduced in Section III.4. The ε-blowup [C]_ε of a set C ⊂ A^n is its ε-neighborhood relative to the d_n-metric, that is,

[C]_ε = { b_1^n : d_n(a_1^n, b_1^n) < ε, for some a_1^n ∈ C }.

A set B ⊂ A^n has the (δ, ε)-blowing-up property if μ([C]_ε) > 1 - ε, for any subset C ⊂ B for which μ(C) ≥ 2^{-δn}. A stationary process has the almost blowing-up property if for each n there is a B_n ⊂ A^n such that x_1^n ∈ B_n, eventually almost surely, and for any ε > 0, there is a δ > 0 and an N such that B_n has the (δ, ε)-blowing-up property for n ≥ N. It was shown in Section III.4.c that a finitely determined process has the almost blowing-up property. In this section it will be shown that almost blowing-up implies very weak Bernoulli. Since very weak Bernoulli implies finitely determined, this completes the proof that almost block-independence, finitely determined, very weak Bernoulli, and almost blowing-up are all equivalent ways of saying that a process is a stationary coding of an i.i.d. process. The proof is based on the following lemma, which, in a more general setting and stronger form, is due to Strassen, [81].
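For very small alphabets and block lengths the ε-blowup can be computed by brute force, which may help in visualizing the definition. The sketch below (illustrative only; the names are ours) enumerates all of A^n and keeps the blocks within per-letter Hamming distance ε of the given set.

```python
from itertools import product

def hamming(x, y):
    # per-letter Hamming distance d_n between two n-blocks
    return sum(a != b for a, b in zip(x, y)) / len(x)

def blowup(C, alphabet, n, eps):
    # the eps-blowup [C]_eps: all n-blocks within distance eps of some block in C
    return {b for b in product(alphabet, repeat=n)
            if any(hamming(a, b) < eps for a in C)}

C = {(0, 0, 0, 0, 0)}
print(len(blowup(C, [0, 1], 5, 0.3)))  # 6: the block itself plus one-letter changes
```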

Lemma IV.3.2 Let μ and ν be probability measures on A^n. If ν([C]_ε) ≥ 1 - ε whenever μ(C) ≥ ε, then d̄_n(μ, ν) < 2ε.

Proof. The ideas of the proof will be described first, after which the details will be given. The goal is to construct a joining λ of μ and ν for which d_n(x_1^n, y_1^n) < ε, except on a set of (x_1^n, y_1^n) of λ-measure less than ε. A joining can be thought of as a partitioning of the ν-mass of each y_1^n into parts and an assignment of these parts to various x_1^n, subject to the joining requirement, namely, that the total mass received by each x_1^n is equal to μ(x_1^n). A simple strategy for beginning the construction of a good joining is to cut off an α-fraction of the ν-mass of y_1^n and assign it to some x_1^n for which d_n(x_1^n, y_1^n) < ε. A trivial argument shows that this is indeed possible, for some positive α, for those y_1^n that are within ε of some x_1^n of positive μ-mass. The set of such y_1^n is just the ε-blowup of the support of μ, which, by hypothesis, has ν-measure at least 1 - ε.

The key to continuing is the observation that if the set of x_1^n whose μ-mass is not completely filled in the first stage has μ-measure at least ε, then its blowup has large ν-measure and the simple strategy can be used again, namely, for most y_1^n a small fraction of the remaining ν-mass can be cut off and assigned to the unfilled mass of a nearby x_1^n. This simple strategy can be repeated as long as the set of x_1^n whose μ-mass is not yet completely filled has μ-measure at least ε. If the fraction is chosen to be largest possible at each stage, then only a finite number of stages are needed to reach the point where the set of x_1^n whose μ-mass is not yet completely filled has μ-measure less than ε.
Some notation and terminology will assist in making the preceding sketch into a proof. A nonnegative function μ̃ is called a mass function if it has nonempty support and Σ_{x_1^n} μ̃(x_1^n) ≤ 1. A function φ: [S]_ε → S is called an ε-matching over S if d_n(y_1^n, φ(y_1^n)) < ε, for all y_1^n ∈ [S]_ε. If the domain S of such a φ is contained in the support of a mass function μ̃, then the number α = α(ν, μ̃, φ) defined by

(5)   α = min{ 1, inf_{μ̃(x_1^n) > 0} μ̃(x_1^n) / ν(φ^{-1}(x_1^n)) }

is positive. It is called the maximal (ν, μ̃, φ)-stuffing fraction, for it is the largest number α ≤ 1 for which αν(φ^{-1}(x_1^n)) ≤ μ̃(x_1^n), x_1^n ∈ S, that is, for which an α-fraction of the ν-mass of each y_1^n ∈ [S]_ε can be assigned to the μ̃-mass of x_1^n = φ(y_1^n).

The desired joining λ is constructed by induction. To get started, let μ^(1) = μ, let S_1 be the support of μ^(1), let φ_1 be any ε-matching over S_1 (such a φ_1 exists by the definition of ε-blowup, as long as S_1 ≠ ∅), and let α_1 be the maximal (ν, μ^(1), φ_1)-stuffing fraction. Having defined S_i, μ^(i), φ_i, and α_i, let μ^(i+1) be the mass function defined by

μ^(i+1)(x_1^n) = μ^(i)(x_1^n) - α_i ν(φ_i^{-1}(x_1^n)).

The set S_{i+1} is then taken to be the support of μ^(i+1), the function φ_{i+1} is taken to be any ε-matching over S_{i+1}, and α_{i+1} is taken to be the maximal (ν, μ^(i+1), φ_{i+1})-stuffing fraction. If α_i < 1 then, by the definition of maximal stuffing fraction, there is an x_1^n ∈ S_i for which μ^(i)(x_1^n) = α_i ν(φ_i^{-1}(x_1^n)), and therefore μ(S_{i+1}) < μ(S_i). Hence there is a first i, say i*, for which α_i = 1 or for which μ(S_{i+1}) < ε. The construction can be stopped after stage i*, cutting up and assigning the remaining unassigned ν-mass in any way consistent with the joining requirement that the total mass received by each x_1^n is equal to μ(x_1^n). If α_{i*} = 1, then ν([S_{i*}]_ε) ≥ 1 - ε; otherwise μ(S_{i*+1}) < ε. In either case, d_n(x_1^n, y_1^n) < ε, except on a set of (x_1^n, y_1^n) of λ-measure less than ε, and the proof of Lemma IV.3.2 is finished. □
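The stage-by-stage construction in the proof is effectively an algorithm, and for small examples it can be run directly. The Python sketch below (an illustration of the idea in a finite setting, not the measure-theoretic argument itself; all names are ours) repeatedly chooses an ε-matching over the unfilled support, transfers the maximal stuffing fraction as in (5), and finally distributes whatever mass is left so that the marginals come out right.

```python
def hamming(x, y):
    # per-letter Hamming distance d_n between two n-blocks (tuples)
    return sum(a != b for a, b in zip(x, y)) / len(x)

def greedy_joining(mu, nu, eps, tol=1e-12):
    # mu, nu: dicts from n-blocks to probabilities.  Returns a joining as a dict.
    joining = {}
    rem_mu = dict(mu)                      # unfilled mu-mass of each x
    rem_nu = dict(nu)                      # unassigned nu-mass of each y
    while True:
        support = [x for x, m in rem_mu.items() if m > tol]
        if sum(rem_mu[x] for x in support) < eps:
            break                          # unfilled mass is already small
        # an eps-matching over the support: send y to a closest eps-close x
        phi = {y: min(support, key=lambda x: hamming(x, y))
               for y, m in rem_nu.items()
               if m > tol and any(hamming(x, y) < eps for x in support)}
        pulled = {x: 0.0 for x in support}
        for y, x in phi.items():
            pulled[x] += rem_nu[y]
        if all(p <= tol for p in pulled.values()):
            break                          # nothing left to match
        # maximal stuffing fraction, formula (5)
        alpha = min([1.0] + [rem_mu[x] / pulled[x]
                             for x in support if pulled[x] > tol])
        for y, x in phi.items():
            mass = alpha * rem_nu[y]
            joining[(x, y)] = joining.get((x, y), 0.0) + mass
            rem_nu[y] -= mass
            rem_mu[x] -= mass
        if alpha >= 1.0:
            break                          # every matched y was fully used up
    # distribute the leftover mass proportionally, preserving both marginals
    total = sum(m for m in rem_nu.values() if m > 0)
    if total > tol:
        for x, mx in rem_mu.items():
            for y, my in rem_nu.items():
                if mx > 0 and my > 0:
                    joining[(x, y)] = joining.get((x, y), 0.0) + mx * my / total
    return joining

mu = {(0, 0): 0.5, (1, 1): 0.5}
nu = {(0, 0): 0.4, (0, 1): 0.2, (1, 1): 0.4}
lam = greedy_joining(mu, nu, eps=0.6)
print(sum(m * hamming(x, y) for (x, y), m in lam.items()))  # expected d_n cost
```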

Lemma IV.3.3 (The conditional entropy lemma.) Let be an ergodic process of entropy H and let a be a given positive number There is an N = N(a) > 0 such that if n > N then there is a set Fn E E(X" co ) such that ,u(F) > 1 a and so that if xn E Fn then
o so , ,; Proof By the ergodic thZoc rigiiii(-x-Rlit)(14ift4 AID almost surely, while, by the entropy theorem, (1/n) log ,u(4) H, almost surely. These two facts together imply the lemma. I=1 Only the following consequence of the upper bound will be needed.


Lemma IV.3.4 Let N = N(α) be as in the preceding lemma. If n ≥ N, there is a set D_n ∈ Σ(X_{-∞}^0) of measure at least 1 - √α such that if x_{-∞}^0 ∈ D_n and C ⊂ A^n, then

μ(C | x_{-∞}^0) ≤ 2^{αn} μ(C) + √α.

Proof. For n ≥ N, the Markov inequality yields a set D_n ∈ Σ(X_{-∞}^0) of measure at least 1 - √α and, for each x_{-∞}^0 ∈ D_n, a set F_n ⊂ A^n such that

(a) μ(F_n | x_{-∞}^0) ≥ 1 - √α,

(b) μ(x_1^n | x_{-∞}^0) ≤ 2^{αn} μ(x_1^n), for x_1^n ∈ F_n.

If x_{-∞}^0 ∈ D_n and C ⊂ A^n, then (a) and (b) give

μ(C | x_{-∞}^0) ≤ μ(C ∩ F_n | x_{-∞}^0) + √α ≤ 2^{αn} μ(C ∩ F_n) + √α ≤ 2^{αn} μ(C) + √α,

which establishes the lemma. □

Now the main theorem of this section will be proved.

Theorem IV.3.5 Almost blowing-up implies very weak Bernoulli.

Proof. Let μ be a stationary process with the almost blowing-up property. The goal is to show that μ is very weak Bernoulli, that is, to show that for each ε > 0, there is an n and a set V ∈ Σ(X_{-∞}^0) such that μ(V) ≥ 1 - ε and such that

d̄_n(μ, μ(· | x_{-∞}^0)) < ε,   x_{-∞}^0 ∈ V.

By Lemma IV.3.2, it is enough to show how to find n and V ∈ Σ(X_{-∞}^0) such that μ(V) ≥ 1 - ε and such that the following holds for x_{-∞}^0 ∈ V.

(6)   If C ⊂ A^n and μ(C | x_{-∞}^0) ≥ ε, then μ([C]_ε) ≥ 1 - ε.

Towards this end, let B_n ⊂ A^n, n ≥ 1, be such that x_1^n ∈ B_n eventually almost surely, and such that for any ε > 0, there is a δ > 0 and an N such that B_n has the (δ, ε)-blowing-up property for n ≥ N. The quantity μ(B_n) is an average of the quantities μ(B_n | x_{-∞}^0), so that if n is so large that μ(B_n) ≥ 1 - ε²/4, then, by the Markov inequality, there is a set D_n* ∈ Σ(X_{-∞}^0) such that μ(D_n*) ≥ 1 - ε/2 and such that μ(B_n | x_{-∞}^0) ≥ 1 - ε/2, for x_{-∞}^0 ∈ D_n*. Combining this with Lemma IV.3.4, it follows that if n is large enough and α < ε²/4, then the set V = D_n ∩ D_n* ∈ Σ(X_{-∞}^0) has measure at least 1 - ε, and for x_{-∞}^0 ∈ V and C ⊂ A^n,

μ(C | x_{-∞}^0) ≤ μ(C ∩ B_n | x_{-∞}^0) + ε/2 ≤ 2^{αn} μ(C ∩ B_n) + √α + ε/2,

which, if μ(C | x_{-∞}^0) ≥ ε, can be rewritten as

μ(C ∩ B_n) ≥ 2^{-αn}(ε - √α - ε/2).


If α is chosen to be less than both δ and ε²/8, then 2^{-αn}(ε - √α - ε/2) ≥ 2^{-δn}, for all sufficiently large n, so that the blowing-up property of B_n yields

μ([C]_ε) ≥ μ([C ∩ B_n]_ε) ≥ 1 - ε,

provided only that n is large enough. This establishes the desired result (6) and completes the proof that almost blowing-up implies very weak Bernoulli. □

Remark IV.3.6 The proof of Theorem IV.3.5 appeared in [39]. The proof of Strassen's lemma given here uses a construction suggested by Ornstein and Weiss which appeared in [37]. A more sophisticated result, using a marriage lemma to pick the ε-matchings φ_i, yields the stronger result that the d̄_n-metric is equivalent to the metric d_n*(μ, ν), defined as the minimum ε > 0 for which μ(C) ≤ ν([C]_ε) + ε, for all C ⊂ A^n. This and related results are discussed in Pollard's book, [59, Example 26, pp. 79-80].

IV.3.d

Exercises.

1. Show that a stationary process is very weak Bernoulli if and only if for each ε > 0 there is an m such that for each k ≥ 1,

d̄_m(X_1^m/X_{-k}^0, X_1^m) < ε,

except for a set of x_{-k}^0 of measure at most ε.

2. Show, using coupling, that a mixing Markov chain is weak Bernoulli. (A simulation of the coupling idea is sketched after this list.)

3. Show that a mixing regenerative process is weak Bernoulli. (Hint: coupling.)

4. Show, by using the martingale theorem, that a stationary process μ is very weak Bernoulli if given ε > 0 there is a positive integer n and a measurable set G ∈ Σ(X_{-∞}^0) of measure at least 1 - ε, such that if x̃_{-∞}^0 ∈ G then there is a measurable mapping φ: [x̃_{-∞}^0] → [x_{-∞}^0] which maps the conditional measure μ(· | x̃_{-∞}^0) onto the conditional measure μ(· | x_{-∞}^0), and a measurable set B ⊂ [x̃_{-∞}^0] such that μ(B | x̃_{-∞}^0) < ε, with the property that for all x ∉ B, φ(x)_i = x_i for all except at most εn indices i ∈ [1, n].

5. Show, by using the martingale theorem, that a stationary process μ is weak Bernoulli if given ε > 0 there is a positive integer K such that for every n ≥ K there is a measurable set G ∈ Σ(X_{-∞}^0) of measure at least 1 - ε, such that if x̃_{-∞}^0 ∈ G then there is a measurable mapping φ: [x̃_{-∞}^0] → [x_{-∞}^0] which maps the conditional measure μ(· | x̃_{-∞}^0) onto the conditional measure μ(· | x_{-∞}^0), and a measurable set B ⊂ [x̃_{-∞}^0] such that μ(B | x̃_{-∞}^0) < ε, with the property that for all x ∉ B, φ(x)_i = x_i for i ∈ [K, n].
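As a companion to the hint in Exercise 2, here is a minimal simulation of the classical Markov coupling (the names and setup are ours, not the book's): two copies of the chain, started from different states, move independently until they first meet and together thereafter, so that each coordinate is still a copy of the chain while the two futures agree from the meeting time on; for a mixing chain that meeting time is almost surely finite.

```python
import random

def coupled_chains(P, x0, y0, T, seed=0):
    # Classical coupling: run two copies of the chain with transition matrix P
    # independently until they meet, then move them together.
    rng = random.Random(seed)
    states = range(len(P))
    step = lambda s: rng.choices(states, weights=P[s])[0]
    xs, ys = [x0], [y0]
    for _ in range(T):
        if xs[-1] == ys[-1]:
            z = step(xs[-1])            # after the chains meet, they stay together
            xs.append(z); ys.append(z)
        else:
            xs.append(step(xs[-1]))     # before meeting, they move independently
            ys.append(step(ys[-1]))
    return xs, ys

P = [[0.9, 0.1], [0.2, 0.8]]            # a mixing two-state chain
xs, ys = coupled_chains(P, 0, 1, 50)
meet = next((t for t, (a, b) in enumerate(zip(xs, ys)) if a == b), None)
print(meet)                             # first meeting time along this sample path
```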

Bibliography
[1] V. I. Arnold and A. Avez, Ergodic problems of classical mechanics. W. A. Benjamin, Inc., New York, 1968.
[2] R. Arratia and M. Waterman, "The Erdős-Rényi strong law for pattern matching with a given proportion of mismatches," Ann. Probab., 17(1989), 1152-1169.
[3] A. Barron, "Logically smooth density estimation," Ph. D. Thesis, Dept. of Elec. Eng., Stanford Univ., 1985.
[4] P. Billingsley, Ergodic theory and information. John Wiley and Sons, New York, 1965.
[5] D. L. Cohn, Measure theory. Birkhäuser, New York, 1980.
[6] T. M. Cover and J. A. Thomas, Elements of information theory. John Wiley and Sons, New York, 1991.
[7] I. Csiszár and J. Körner, Information theory: Coding theorems for discrete memoryless systems. Akadémiai Kiadó, Budapest, 1981.
[8] P. Elias, "Universal codeword sets and representations of the integers," IEEE Trans. Info. Th., IT-21(1975), 194-203.
[9] P. Erdős and A. Rényi, "On a new law of large numbers," J. d'Analyse Math., 23(1970), 103-111.
[10] J. Feldman, "r-Entropy, equipartition, and Ornstein's isomorphism theorem," Israel J. of Math., 36(1980), 321-343.
[11] W. Feller, An introduction to probability theory and its applications. Volume II (Second Edition), Wiley, New York, 1971.
[12] N. Friedman, Introduction to ergodic theory. Van Nostrand Reinhold, New York, NY, 1970.
[13] N. Friedman and D. Ornstein, "On isomorphism of weak Bernoulli transformations," Advances in Math., 5(1970), 365-394.
[14] H. Furstenberg, Recurrence in ergodic theory and combinatorial number theory. Princeton Univ. Press, Princeton, NJ, 1981.
[15] R. Gallager, Information theory and reliable communication. Wiley, New York, NY, 1968.

240

BIBLIOGRAPHY

[16] P. Grassberger, "Estimating the information content of symbol sequences and efficient codes," IEEE Trans. Inform. Theory, IT-35(1989), 669-675.
[17] R. Gray, Probability, random processes, and ergodic properties. Springer-Verlag, New York, NY, 1988.

[18] R. Gray, Entropy and information theory. Springer-Verlag, New York, NY, 1990. [19] R. Gray and L. D. Davisson, "Source coding without the ergodic assumption," IEEE Trans. Info. Th., IT-20(1975), 502-516. [20] P. Halmos, Measure theory. D. van Nostrand Co., Princeton, NJ, 1950. [21] P. Halmos, Lectures on ergodic theory. Chelsea Publishing Co., New York, 1956. [22] W. Hoeffding, "Asymptotically optimal tests for multinomial distributions," Ann. of Math. Statist., 36(1965), 369-400. [23] M. Kac, "On the notion of recurrence in discrete stochastic processes," Ann. of Math. Statist., 53(1947), 1002-1010. [24] S. Kakutani, "Induced measure-preserving transformations," Proc. Japan Acad., 19(1943), 635-641. [25] T. Kamae, "A simple proof of the ergodic theorem using nonstandard analysis," Israel J. Math., 42(1982), 284-290. [26] S. Karlin and G. Ghandour, "Comparative statistics for DNA and protein sequences - single sequence analysis," Proc. Nat. Acad. Sci., U.S.A., 82(1985), 5800-5804. [27] I. Katznelson and B. Weiss, "A simple proof of some ergodic theorems," Israel J. Math., 42(1982), 291-296. [28] M. Keane and M. Smorodinsky, " Finitary isomorphism of irreducible Markov shifts," Israel J. Math., 34(1979) 281-286. [29] J. G. Kemeny and J. L. Snell, Finite Markov chains. Van Nostrand Reinhold, Princeton, New Jersey, 1960. [30] J. Kieffer, "Sample converses in source coding theory," IEEE Trans. Info. Th., IT-37(1991), 263-268. [31] I. Kontoyiannis and Y. M. Suhov, "Prefixes and the entropy rate for long-range sources," Probability, statistics, and optimization. (F. P. Kelly, ed.) Wiley, New York, 1993. [32] U. Krengel, Ergodic theorems. W. de Gruyter, Berlin, 1985. [33] R. Jones, "New proofs of the maximal ergodic theorem and the Hardy-Littlewood maximal inequality," Proc. AMS, 87(1983), 681-4. [34] D. Lind and B. Marcus, An Introduction to Symbolic Dynamics and Coding. Cambridge Univ. Press, Cambridge, 1995. [35] T. Lindvall, Lectures on the coupling method. John Wiley and Sons, New York, 1992.

BIBLIOGRAPHY

241

[36] K. Marton, "A simple proof of the blowing-up lemma," IEEE Trans. Info. Th., IT-42(1986), 445-447. [37] K. Marton and P. Shields, "The positive-divergence and blowing-up properties," Israel J. Math., 86(1994), 331-348. [38] K. Marton and P. Shields, "Entropy and the consistent estimation of joint distributions," Ann. Probab., 22(1994), 960-977. Correction: Ann. Probab., to appear. [39] K. Marton and P. Shields, "Almost sure waiting time results for weak and very weak Bernoulli processes," Ergodic Th. and Dynam. Sys., 15(1995), 951-960. [40] D. Neuhoff and P. Shields, "Block and sliding-block source coding," IEEE Trans. Info. Th., IT-23(1977), 211-215. [41] D. Neuhoff and P. Shields, "Indecomposable finite state channels and primitive approximation," IEEE Trans. Inform. Th., IT-28(1982), 11-18. [42] D. Neuhoff and P. Shields, "Channel entropy and primitive approximation," Ann.

Probab., 10(1982), 188-198.


[43] D. Neuhoff and P. Shields, "A very simplistic, universal, lossless code," IEEE Workshop on Information Theory, Rydzyna, Poland, June, 1995. [44] A. Nobel and A. Wyner, "A recurrence theorem for dependent processes with applications to data compression," IEEE Trans. Inform. Th., IT-38(1992), 1561-1563. [45] D. Ornstein, "An application of ergodic theory to probability theory", Ann. Probab.,

1(1973), 43-58.
[46] D. Ornstein, Ergodic theory, randomness, and dynamical systems. Yale Mathematical Monographs 5, Yale Univ. Press, New Haven, CT, 1974. [47] D. Ornstein, D. Rudolph, and B. Weiss, "Equivalence of measure preserving transformations," Memoirs of the AMS, 262(1982). [48] D. Ornstein and P. Shields, "An uncountable family of K-Automorphisms", Advances in Math., 10(1973), 63-88. [49] D. Ornstein and P. Shields, "Universal almost sure data compression," Ann. Probab.,

18(1990), 441-452.
[50] D. Ornstein and P. Shields, "The d-recognition of processes." Advances in Math.,

104(1994), 182-224.
[51] D. Ornstein and B. Weiss, "The Shannon-McMillan-Breiman theorem for amenable groups," Israel J. Math., 44(1983), 53-60. [52] D. Ornstein and B. Weiss, "How sampling reveals a process," Ann. Probab.,

18(1990), 905-930.
[53] D. Ornstein and B. Weiss, "Entropy and data compression," IEEE Trans. Inform. Th., IT-39(1993), 78-83.

242

BIBLIOGRAPHY

[54] D. Ornstein and B. Weiss, "On the Bernoulli nature of systems with some hyperbolic structure," Ergodic Th. and Dynam. Sys., to appear. [55] W. Parry, Topics in ergodic theory Cambridge Univ. Press, Cambridge, 1981. [56] K. Petersen, Ergodic theory. Cambridge Univ. Press, Cambridge, 1983. [57] Phelps, R., Lectures on Choquet's theorem. Van Nostrand, Princeton, N.J., 1966 [58] M. Pinsker, Information and information stability of random variables and processes. (In Russian) Vol. 7 of the series Problemy Peredai Informacii, AN SSSR. Moscow, 1960. English translation: Holden-Day, San Francisco, 1964. [59] D. Pollard, Convergence of stochastic processes. Springer-Verlag, New York, 1984. [60] V. Rohlin, "A 'general' measure-preserving transformation is not mixing," Dokl.

Akad. Nauk SSSR, 60(1948), 349-351.


[61] D. Rudolph, "If a two-point extension of a Bernoulli shift has an ergodic square, then it is Bernoulli," Israel J. of Math., 30(1978), 159-180. [62] I. Sanov, "On the probability of large deviations of random variables," (in Russian), Mat. Sbornik, 42(1957), 11-44. English translation: Select. Transi. Math. Statist. and Probability, 1(1961), 213-244. [63] P. Shields, The theory of Bernoulli shifts. Univ. of Chicago Press, Chicago, 1973. [64] P. Shields, "Cutting and independent stacking of intervals," Mathematical Systems Theory, 7 (1973), 1-4. [65] P. Shields, "Weak and very weak Bernoulli partitions," Monatshefte fiir Mathematik,

84 (1977), 133-142.
[66] P. Shields, "Stationary coding of processes," IEEE Trans. Inform. Th., IT-25(1979),

283-291.
[67] P. Shields, "Almost block independence," Z. fur Wahr., 49(1979), 119-123. [68] P. Shields, "The ergodic and entropy theorems revisited," IEEE Trans. Inform. Th., IT-33(1987), 263-6. [69] P. Shields, "Universal almost sure data compression using Markov types," Problems of Control and Information Theory, 19(1990), 269-277. [70] P. Shields, "The entropy theorem via coding bounds," IEEE Trans. Inform. Th., IT-37(1991), 1645-1647. [71] P. Shields, "String matching - the general ergodic case," Ann. Probab., 20(1992),

1199-1203.
[72] P. Shields, "Entropy and prefixes," Ann. Probab., 20(1992), 403-409. [73] P. Shields, "Cutting and stacking. A method for constructing stationary processes," IEEE Trans. Inform. Th., IT-37(1991), 1605-1617.

BIBLIOGRAPHY

243

[74] P. Shields, "Universal redundancy rates don't exist," IEEE Trans. Inform. Th., IT-

39(1993), 520-524.
[75] P. Shields, "Two divergence-rate counterexamples," J. of Theor. Prob., 6(1993),

521-545.
[76] P. Shields, "Waiting times: positive and negative results on the Wyner-Ziv problem," J. of Theor. Prob., 6(1993), 499-519.
[77] P. Shields and J.-P. Thouvenot, "Entropy zero × Bernoulli processes are closed in the d̄-metric," Ann. Probab., 3(1975), 732-736.
[78] M. Smorodinsky, "A partition on a Bernoulli shift which is not 'weak Bernoulli'," Math. Syst. Th., 5(1971), 201-203.
[79] M. Smorodinsky, "Finitary isomorphism of m-dependent processes," Contemporary Math., vol. 135(1992), 373-376.
[80] M. Steele, "Kingman's subadditive ergodic theorem," Ann. Inst. Henri Poincaré, 25(1989), 93-98.
[81] V. Strassen, "The existence of probability measures with given marginals," Ann. of Math. Statist., 36(1965), 423-439.
[82] W. Szpankowski, "Asymptotic properties of data compression and suffix trees," IEEE Trans. Inform. Th., IT-39(1993), 1647-1659.
[83] J. Moser, E. Phillips, and S. Varadhan, Ergodic theory: a seminar. Courant Inst. of Math. Sciences, NYU, New York, 1975.
[84] P. Walters, An introduction to ergodic theory. Springer-Verlag, New York, 1982.
[85] F. M. J. Willems, "Universal data compression and repetition times," IEEE Trans. Inform. Th., IT-35(1989), 54-58.
[86] A. Wyner and J. Ziv, "Some asymptotic properties of the entropy of a stationary ergodic data source with applications to data compression," IEEE Trans. Inform. Th., IT-35(1989), 1250-1258.
[87] A. Wyner and J. Ziv, "Fixed data base version of the Lempel-Ziv data compression algorithm," IEEE Trans. Inform. Th., IT-37(1991), 878-880.
[88] A. Wyner and J. Ziv, "The sliding-window Lempel-Ziv algorithm is asymptotically optimal," Proc. of the IEEE, 82(1994), 872-877.
[89] S. Xu, "An ergodic process of zero divergence-distance from the class of all stationary processes," J. of Theor. Prob., submitted.
[90] J. Ziv, "Coding theorems for individual sequences," IEEE Trans. Inform. Th., IT-24(1978), 405-412.
[91] J. Ziv and A. Lempel, "A universal algorithm for sequential data compression," IEEE Trans. Info. Th., IT-23(1977), 337-343.
[92] J. Ziv, "Coding of sources with unknown statistics - Part I: Probability of encoding error; Part II: Distortion relative to a fidelity criterion," IEEE Trans. Info. Th., IT-18(1972), 384-394.

Index
a-separated, 103, 187 absolutely regular, 179, 233 addition law, 58 admissible, 174 in probability, 174 in d, 184 almost block-independence (ABI), 212 almost blowing-up (ABUP), 195, 235 alphabet, 1 asymptotic equipartition (AEP), 55 B-process, 8, 211 Barron's code-length bound, 125 base (bottom) of a column, 107 of a column structure, 108 binary entropy function, 52 block code, 8 block coding of a process, 8 block-independent process, 211 block-structure measure, 104 block-to-stationary construction, 83 blowing-up property (BUP), 194 almost blowing-up, 195, 235 (8, E)-blowing-up, 195, 235 blowup, 68 blowup bound, 68 8-blowup, 185, 194 Borel-Cantelli principle, 11 building blocks, 69 built by cutting and stacking, 109 built-up, 69 built-up set bound, 70 (1 E)-built-up, 138 circular k-type, 123 code, 71 code sequence, 74, 121 rate of, 74 codebook, 72 codeword, 71 coding block, 84, 215 245 faithful (noiseless), 71 length function, 72, 121 n-code, 121 per-symbol, 7 truncation, 215 column, 24, 107 base (bottom), 24, 107 cutting a column, 109 disjoint columns, 107 height (length), 24, 107 labeling, 107 level, 24, 107 name, 107 subcolumn, 108 support, 107 top, 24, 107 upward map, 108 width (thickness), 24, 104, 107 column partitioning, 109 column structure, 107 complete sequences, 110 transformation defined by, 110 copy, 115 cutting into copies, 115 disjoint column structures, 108 estimation of distributions, 111 column structure: top, 108 uniform, 188 upward map, 108 width, 108 width distribution, 108 (a, k)-separated structures, 189 (a, K)-strongly-separated, 190 columnar representation, 105 complete sequences of structures, 110 concatenated-block process, 9, 10 concatenation representation, 26, 27 conditional E -independence, 226 entropy, 58

246 conditional invertibility, 164 consistency conditions, 1 continuous distribution, 214 coupling, 218 coupling measure, 218 covering, 46 almost-covering principle, 46 number, 67 cutting and stacking, 103, 109 standard representation, 112 cyclic merging rule, 190 cylinder set, 2 d-admissible, 184 d-distance , 89, 91 definition (joining), 91 definition for processes, 92 for ergodic processes, 89 for i.i.d. processes, 89 for stationary processes, 92 for sequences, 89 realizing tin 91 d-topology, 87 d-distance properties, 97 completeness, 102 relationship to entropy, 100 ergodicity, 99 mixing, 100 typical sequences, 96 d-far-apart Markov chains, 90 rotation processes, 90 d'-distance, 98 direct product, 31 distinct words, 147 distribution of a process, 1 of blocks in blocks, 186, 222 of a partition, 17 start, 6 shifted empirical block, 168 divergence, 57 inequality, 57 doubly-infinite sequence model, 4
,

INDEX
empirical Markov entropy, 65 empirical joinings, 94 empirical universe, 176, 185 encoding partition, 80 entropy, 51 conditional, 58 entropy rate, 51 empirical, 138 estimation, 143, 144 n-th order, 59 of a distribution, 51 a process, 59 a partition, 58 a random variable, 56 topological, 132 Ziv entropy, 133 entropy interpretations: covering exponent, 68 prefix codes, 73 entropy properties: d-limits, 100 subset bound, 60 upper semicontinuity, 88 entropy of i.i.d. processes, 62 Markov processes, 62 concatenated-block processes, 78 entropy theorem, 51, 129 entropy-typical sequences, 67 cardinality bound, 67 E-independence, 223 for processes, 224 ergodic, 15 components, 49 decomposition, 48 Markov, 20 process, 16 totally, 21 ergodic theorem (of Birkhoff), 33 of von Neumann, 42 maximal, 42 of Kingman, 43 ergodicity and d-limits, 99 finite form, 45 "good set" form, 45 eventually almost surely, 11 expected return time, 24 exponential rates for entropy, 166 for frequencies, 166

Elias code, 75, 76 empirical distribution, 63 of overlapping k-blocks, 43, 64 empirical measure (limiting), 48 empirical entropy, 138 first-order entropy, 63

INDEX extremal measure, 102


faithful code, 71 faithful-code sequence, 76, 121 filler or spacer blocks, 84 fillers, 151 finitary coding, 195 finitary process, 195 finite coder, 7 finite coding, 8 approximation theorem, 80 finite energy, 159 finite-state process, 6 concatenated-block process, 10 finitely determined, 186, 198, 221 finite form, 222 i.i.d. processes, 223 first-order blocks, 104 first-order rate bound, 166 frequency, 43 typical, 44 (k, E)-typical sequences, 45 function of a Markov chain, 6 generalized renewal process, 26 generalized return-time picture, 28 independent partitions, 17 cutting and stacking, 114 M-fold, 116 repeated, 117 extension, 215 n-blocking, 212 stacking of a structure: onto a column, 115 onto a structure, 115 induced process, 27 transformation, 25 infinitely often, almost surely, 11 instantaneous coder, 7 function, 7 invariant measure, 4 set, 15 irreducible Markov, 18 join of measures, 90 mapping interpretation, 91 definition of d, 91 join of partitions, 17 joint input-output measure, 229

247

Kac's return-time formula, 25 Kolmogorov measure, 3, 4 complete measure model, 3 two-sided model, 4 and complete sequences, 110 Kolmogorov partition, 13 Kolmogorov representation, 2, 3 Kraft inequality, 73 Kronecker's theorem, 22 Lempel-Ziv (LZ) algorithm, 131 convergence theorem, 132 upper bound, 133 simple LZ parsing, 131 LZW algorithm, 136 linear mass distribution, 104 log-sum inequality, 65 Markov inequality, 11 Markov chain, 5 source, 6 k-th order, 6 Markov order theorem, 62 matching, 235 measure preserving, 4, 14 merging and separation, 190 mixing, 18 and d-limits, 100 and Markov, 20
name of a column, 107 of a column level, 107 (T, 2)-name, 14 nonadmissibility, 175, 185 nonexistence of too-good codes, 122 nonoverlapping-block process, 10 overlapping to nonoverlapping, 168 overlapping-block process, 10 packing, 34 (1 0-packing, 34, 138 partial, 41 packing lemma, 34 stopping-time, 40 strong-packing, 139 separated, 41 two-packings, 41

248 pairwise k-separated, 205 partition, 14 per-letter Hamming distance, 68, 89 Pinsker's inequality, 66 Poincare recurrence theorem, 23 prefix, 72 code, 73 code sequence, 74, 121 trees, 158 process, 1, 13 and complete sequences, 110 B-process, 8, 211 entropy, 59 finite-energy, 159 (S, n)-blocking, 215 (S, n)-independent, 215 i.i.d., 5 Markov, 6 in-dependent, 12 N-th term process, 10 regenerative, 30, 120 stationary, 3 (T ,P) process, 14 product measure, 5 Vf-mixing, 169, 175
-

INDEX and upward maps, 109 start position, 27, 28 stationary coding, 7 and entropy, 81 and mixing, 32, 82 time-zero coder, 7 with induced spacers, 84 stationary process, 3, 13 distribution, 6 input-output measure, 229 Markov process, 6 N-stationary, 8 output measure, 229 stopping time, 40 interval, 40 string matching, 85 strong cover, 33 strongly-covered, 70 (1 0-strongly-covered, 46 (L, 0-strongly-covered, 34 strongly-packed, 150 (K, 0-strongly-packed, 138 subadditivity, 59 subinvariant function, 30 transformation, 4 shift, 3 transformation/partition model, 13 too-long word, 150 representation, 151 too-small set principle, 67 too-soon-recurrent representation, 156 totally ergodic, 21 tower construction, 29 type, 63 class, 63, 167 equivalence, 63 k-type, 43, 64 class, 64 class bound, 63, 65 typical sequence, 44, 45, 67 and d-distance, 96

randomizing the start, 8 rate function for entropy, 166 for frequencies, 165 recurrence-time function, 154 repeated words, 148 return-time distribution, 24 picture, 24 process, 26 rotation (translation), 21 process, 21 d-far-apart rotations, 101 Shannon-McMillan-Breiman theorem, 55 shift transformation, 3 shifted, block parsings, 167 equivalence, 167, 170 with gap, 170, 179 skew product, 31 sliding-block (-window) coding, 7 speed of convergence, 166 splitting index, 180 splitting-set lemma, 180 stacking columns, 109

universal code, 145 existence theorem, 122 sequence, 121 universe (k-block), 132 of a column structure, 188 full k-block universe, 190

INDEX very weak Bernoulli (VWB), 232 waiting-time function, 200 approximate-match, 200 weak Bernoulli (WB), 179, 233 weak topology, 87, 88 weak convergence, 87 well-distributed, 119 window function, 195 window half-width, 7 words, 104 distinct, 148 repeated, 148 Ziv entropy, 133

