
Lecture 1

© Himanshu Tyagi. Feel free to use with acknowledgement.

Agenda for the lecture

Introduction to data compression

Review of basic probability

1.1 Source model and compression

A source is a generator of data. It produces an n-length sequence of symbols x1, ..., xn from
an alphabet X. A symbol can be a word in Kannada and the sequence a sentence, or X can
be just the binary alphabet {0, 1} and the sequence a digitized Hindi song.

Question 1. What is the least number of bits of memory that we need to store the data
produced by the source?

This question is very vague as it is stated and hides many sub-questions. A general goal
of Information theory is to provide a precise mathematical formulation of such questions
and obtain a rigorous answer, thereby establishing fundamental limits of performance.
How can we possibly compress the data? For instance, how can we possibly compress
a book? Well, we know something. If it is an English book, then it is very likely that the
word "is" will be followed by either an article (a/an/the) or a verb. In fact, by going over
the book once we can construct a histogram of words following "is", which is equivalent to
constructing a probability distribution over the words following "is". Thus, we can form a
complicated probabilistic model to capture how words are arranged in the book. But how
do we use this model to compress the data? Well, this is simple. We proceed along the
book and assign symbols of different lengths to each word, based on the specific word and
its location. For instance, suppose our model tells us that the probability of getting an "a"
after "is" is 0.3 and that of getting a "the" after "is" is 0.2; then assigning 0 for storing an
"a" after "is" and 11 for "the" is better than assigning 11 for "a" and 0 for "the".
one needs to store the model itself and account for that in length as well. A complicated
model will require more space to store, but will help you compress better. On the other
hand, a simple model can be stored economically but may not help you to compress much.
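As a minimal sketch of the arithmetic behind this claim (using only the two assumed probabilities 0.3 and 0.2 mentioned above and ignoring all other words), the following Python snippet compares the expected number of bits per word under the two assignments:

    # Expected codeword length per word following "is", for the two assignments
    # discussed above. Only the words "a" (prob. 0.3) and "the" (prob. 0.2) are
    # considered; the remaining probability mass is ignored for this comparison.
    p = {"a": 0.3, "the": 0.2}

    # Assignment 1: short codeword 0 (1 bit) for "a", long codeword 11 (2 bits) for "the".
    len_short_for_likely = p["a"] * 1 + p["the"] * 2   # = 0.7 bits

    # Assignment 2: the codeword lengths swapped.
    len_long_for_likely = p["a"] * 2 + p["the"] * 1    # = 0.8 bits

    print(len_short_for_likely, len_long_for_likely)
    # Giving the shorter codeword to the more likely word uses fewer bits on average.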
These are some of the issues we shall see in data compression. For now, we assume
that a probabilistic model for the data is known. Formally,

Definition 1.1. (Source model) A source X with alphabet X is a random variable taking
values in the set X and with distribution PX .

For now we assume that X is finite and each symbol x is produced with probability
PX(x). (When there is no confusion, we will drop the subscript X in the probability mass
function (pmf) PX.)
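As a purely illustrative aside (not part of the notes), a source in the sense of Definition 1.1 can be represented in code by a finite alphabet together with a pmf; the alphabet, pmf, and helper draw below are hypothetical choices:

    import random

    # One possible representation of a source (Definition 1.1): a finite alphabet
    # together with a pmf P_X from which symbols can be sampled. The alphabet and
    # pmf below are toy choices for illustration only.
    alphabet = ["a", "b", "c"]
    pmf = {"a": 0.5, "b": 0.3, "c": 0.2}

    def draw(pmf, n):
        """Draw n i.i.d. symbols from the distribution given by pmf."""
        symbols = list(pmf)
        weights = [pmf[x] for x in symbols]
        return random.choices(symbols, weights=weights, k=n)

    print(draw(pmf, 10))   # e.g. ['a', 'c', 'a', 'b', ...]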
The following examples illustrate various possible answers for Question 1.

Example 1.2. (Uniform source) Consider a source X with alphabet X of cardinality
|X| = 2^{10}. Suppose the source produces each symbol x ∈ X with the same probability
2^{-10}, i.e., the pmf of the source PX corresponds to a uniform distribution.
Q. What is the minimum number of bits that we need to store so that each symbol
produced by the source can be recovered exactly?
This question is easy. Since we must recover every symbol exactly, our compression
scheme must be a one-to-one mapping. Thus, the least number of bits needed is log |X| = 10.
(Throughout this course, our logarithms will be to the base 2.)
This compression scheme constitutes a so-called fixed-length source code since each
source symbol is stored as a sequence of the same length, 10. We can also try a variable-length
source code, where different symbols may be stored using sequences of different lengths. But
even with this flexibility, the average length in this case cannot be lower than 10 (argue
formally that this is the case!).
But maybe we were being overambitious. What if we allow some probability of error,
i.e., we do not wish to recover all the symbols but only a fraction of them?
Q. What is the minimum number of bits that we need to store so that we can recover
99% of the symbols produced by the source exactly, i.e., the probability of error for our
compression scheme is less than 0.01?
Okay. So now we can ignore 1% of the symbols produced by the source. The best thing we
can do is store a fraction 99/100 of the k = 2^{10} source symbols exactly and reserve a fixed
codeword, say the all 0 sequence, for every other symbol. Thus, we need ⌈log(0.99k + 1)⌉ = 10
bits, and there is nothing to be gained by allowing some error.
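A quick numerical check of this count, assuming as in the example that |X| = 2^{10} and that the least-needed 1% of symbols are lumped into one reserved codeword:

    from math import ceil, log2

    k = 2 ** 10                                  # alphabet size |X| in Example 1.2

    bits_exact = ceil(log2(k))                   # zero-error: ceil(log 1024) = 10 bits
    bits_with_error = ceil(log2(0.99 * k + 1))   # keep 99% of the symbols: still 10 bits

    print(bits_exact, bits_with_error)           # 10 10 -> no gain from allowing 1% error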

Example 1.3. (Nonuniform source) We now consider a different 10-bit source (i.e., a source
with |X| = 2^{10}) for which the pmf is given as follows:
\[
P_X(x) =
\begin{cases}
0.495, & x = (00\ldots 0),\\
0.495, & x = (11\ldots 1),\\
0.01/1022, & \text{otherwise}.
\end{cases}
\tag{1}
\]
Thus, the source produces the all 0 and the all 1 sequence (of length 10) with probability
0.495 each and every other sequence with the same probability 0.01/1022.
Q. What is the minimum number of bits that we need to store so that each symbol
produced by the source can be recovered exactly?
Well, for this question, nothing has changed from the previous example. We still need a
one-to-one mapping, and the least number of bits needed for storing the source is still 10.
Q. What is the minimum number of bits that we need to store so that the probability
of error for our compression scheme is less than 0.01?
Any compression scheme that we make has to recover the all 0 and the all 1 sequence
exactly, because otherwise our probability of error will be at least 0.495. Thus, we need to
store at least 1 bit. On the other hand, if we store the all 0 sequence as simply 00, the all
1 sequence as 11, and everything else as 01 (decoding 01, say, to some fixed sequence among
the rest), we can recover the source X with probability of error less than 0.01 using only
2 bits of storage!
Even if we don't allow an error, we can still save storage on average by assigning 0 for
x = (0...0), 10 for x = (1...1), and storing other sequences as is with an additional prefix 11.
In this case the average length is 0.495 × 1 + 0.495 × 2 + 0.01 × 12 = 1.605, still much
smaller than 10 bits!
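The following sketch merely recomputes the numbers claimed in Example 1.3 from the pmf in (1): the probability of error of the 2-bit fixed-length scheme and the average length of the zero-error variable-length scheme.

    from fractions import Fraction

    # pmf of Example 1.3: two heavy sequences and 1022 light ones.
    p_heavy = Fraction(495, 1000)        # probability of the all-0 and of the all-1 sequence
    p_rest = 1 - 2 * p_heavy             # total mass of the other 1022 sequences: 0.01

    # 2-bit fixed-length scheme: 00 -> all-0, 11 -> all-1, 01 -> decoded to one
    # fixed remaining sequence. Only the other 1021 light sequences are lost.
    prob_error = p_rest * Fraction(1021, 1022)
    print(float(prob_error))             # about 0.00999 < 0.01

    # Zero-error variable-length scheme: 0 (1 bit), 10 (2 bits), 11 + raw 10 bits (12 bits).
    avg_length = p_heavy * 1 + p_heavy * 2 + p_rest * 12
    print(float(avg_length))             # 1.605, much smaller than 10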

We can draw the following heuristic conclusions from the examples above:

1. A nonuniform source can be compressed significantly, either if error is
allowed and fixed-length codes are used, or on average using a variable-length
source code.

2. A uniform source cannot be compressed much.

Thus, the compressibility of a source is connected to the randomness in the underlying
distribution. Information theory provides such measures of randomness, which capture
the compressibility of a source (random variable). Furthermore, it provides provable
operational significance for these measures in the form of formal coding theorems.

1.2 A (minimal) review of probability concepts needed for
this course

Consider a random variable X taking values in a set X (discrete or continuous) with
probability distribution PX. For the discrete case, the distribution PX defines the probability
mass function (pmf) and specifies the probability PX(x) for each symbol x ∈ X. For the
continuous case, we restrict to simple random variables for which the probability distribution
PX can be described by a density function fX(x) such that the probability of each
subset A ⊆ X is given by
\[
P(X \in A) = \int_A f_X(x) \, dx.
\]
For notational simplicity, we shall use the notation ∫_A dP(x) to represent the probability
of a set A under the distribution P.
For us, the distribution of a random variable represents a model which we fit to the
observed data, i.e., upon observing the data we try to identify a distribution which could
have possibly generated this data. This important question, which is indeed the first step
of information-theoretic modelling, will not be discussed in this course. It is usually
discussed in great detail in a statistics or a machine learning course.
An important concept of information theory (and that of engineering in general) is the
concept of probability of error. We shall often tolerate a small probability of error in
our solutions. This probability of error represents the fraction of data points for which
our proposed data compression algorithm will not work, or the fraction of messages that
are transmitted incorrectly. Once we ignore this small fraction, the remainder of data will
satisfy the nice properties required for our proposed algorithm to work. Thus, the utility
of probabilistic modelling lies in understanding how a random variable behaves with large
probability (this captures how our observed data behaves in most cases).
(This vanilla introduction to probability will suffice for this course. A more interested reader
should register for a formal course in Probability theory.)
In this course, we shall rely on several (simple) standard results that are used to capture
the typical (large probability) behaviour of a random variable. In this context, the
expected value of a random variable represents its nominal value. It is given by

\[
\mathbb{E}[X] = \int_{\mathcal{X}} x \, dP(x) =
\begin{cases}
\displaystyle\sum_{x \in \mathcal{X}} x \, P_X(x), & \text{for a discrete rv},\\[6pt]
\displaystyle\int_{\mathcal{X}} x \, f_X(x) \, dx, & \text{for a continuous rv}.
\end{cases}
\]

A random variable roughly stays around its mean up to an error margin of σ(X), where
σ(X) = √(Var[X]) is the standard deviation and Var[X] is the variance given by
\[
\mathrm{Var}[X] = \mathbb{E}\!\left[(X - \mathbb{E}[X])^2\right].
\]

The following results quantify how a rv concentrates around its mean.
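As a concrete toy illustration of these quantities, here is a small computation of the mean, variance, and standard deviation for an assumed random variable (a fair six-sided die, not an example from the notes):

    from math import sqrt

    # Mean, variance, and standard deviation of a fair six-sided die.
    pmf = {x: 1 / 6 for x in range(1, 7)}

    mean = sum(x * p for x, p in pmf.items())                 # E[X] = 3.5
    var = sum((x - mean) ** 2 * p for x, p in pmf.items())    # Var[X] = 35/12, about 2.92
    std = sqrt(var)                                           # sigma(X), about 1.71

    print(mean, var, std)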

1.2.1 Markov and Chebyshev Inequalities

Lemma 1.4. (Markov Inequality) Consider a random variable X taking values in X ⊆ [0, ∞).
Then, for every ε > 0,
\[
P\!\left( X > \tfrac{1}{\varepsilon}\, \mathbb{E}[X] \right) \leq \varepsilon.
\]

Proof. Let a = (1/ε) E[X] and let A be the event {X > a}. Then,
\[
\begin{aligned}
\mathbb{E}[X] &= \int_{\mathcal{X}} x \, dP(x) \\
&= \int_{A} x \, dP(x) + \int_{A^c} x \, dP(x) \\
&\geq \int_{A} x \, dP(x) \\
&\geq \int_{A} a \, dP(x) \\
&= a \, P(X \in A).
\end{aligned}
\]
Thus, P(X > (1/ε) E[X]) = P(X ∈ A) ≤ E[X]/a = ε, as claimed.

In this course, the above result will be stated as follows: E[X]/ε is a large probability
upper bound for a nonnegative random variable X, i.e., P(X ≤ E[X]/ε) ≥ 1 − ε.
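As a quick empirical sanity check (for an assumed nonnegative random variable X ~ Exponential(1), which is not from the notes), the fraction of samples exceeding E[X]/ε indeed stays below ε:

    import random

    random.seed(0)
    eps = 0.1
    # X ~ Exponential(1) is nonnegative with E[X] = 1.
    samples = [random.expovariate(1.0) for _ in range(100_000)]
    emp_mean = sum(samples) / len(samples)

    frac_exceeding = sum(x > emp_mean / eps for x in samples) / len(samples)
    print(emp_mean, frac_exceeding)   # frac_exceeding is far below eps = 0.1 (about e^{-10})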

Lemma 1.5. (Chebyshev Inequality) Consider a random variable X taking values in X ⊆ ℝ
and with finite expectation E[X]. Then, for every ε > 0,
\[
P\!\left( |X - \mathbb{E}[X]|^2 > \tfrac{1}{\varepsilon}\, \mathrm{Var}[X] \right) \leq \varepsilon,
\]
namely E[X] + √(Var[X]/ε) and E[X] − √(Var[X]/ε) are large probability upper and lower
bounds, respectively, for a random variable X.

Proof. Follows by applying the Markov inequality to the nonnegative random variable
|X − E[X]|².
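Similarly, a quick empirical check of the Chebyshev bound for an assumed random variable X ~ Uniform(0, 1) (so E[X] = 1/2 and Var[X] = 1/12), again not an example from the notes:

    import random
    from math import sqrt

    random.seed(0)
    eps = 0.5
    mean, var = 0.5, 1 / 12            # X ~ Uniform(0, 1)
    margin = sqrt(var / eps)           # deviation allowed by the bound, about 0.408

    samples = [random.random() for _ in range(100_000)]
    frac_outside = sum(abs(x - mean) > margin for x in samples) / len(samples)
    print(frac_outside)                # about 0.18, indeed at most eps = 0.5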
