
EE420/500 Random Signals and Noise

Summer 2012
Instructor
John Stensby, EB 217I, Office Hours: Mon, Wed, 2:30 - 4 PM, or by appointment.
stensby@eng.uah.edu

Course Material
1. G. Cooper, C. McGillem, Probabilistic Methods of Signal and System Analysis
2. The most recent version of the class notes can be printed from the EE420/500 home page

http://www.ece.uah.edu/courses/ee420-500/

Course Outline/Goal/Recommendation
The course consists of material taken from the first eight chapters of the class notes (Note: not all of the
first eight chapters will be covered). Most of this material is contained in the first eight chapters of the
required text, Cooper and McGillem. This material satisfies the course goal of providing the student with
a background in applied probability, statistics and random processes that is necessary to undertake
courses in communication theory, radar, signal processing and similar areas. The Schaums outline by
Hsu (reference #1 below) is highly recommended. This reference covers most of the course, and it
provides many worked example problems. This semester, your smartest move would be to get and study
Hsu's outline.

Grading
Midterm 30%
Weekly Short Tests 30%
Homework 10%
Final Exam 30%

Notes
1. Two types of homework assignments will be made. Type-I homework will be collected and selectively
graded. Type-II homework will not be collected. Solutions to all homework assignments will be posted
on the bulletin board outside of Room 217 of the Engineering Building.

2. The short tests will come from the homework and/or example problems worked in class (I will supply
one of the homework problems and allow 10-15 minutes for its completion). Expect one every week. I
will drop the lowest short-test grade (to compensate for absences).

3. The midterm and final will be closed book since they will be based on the homework and on
problems that I work on the board. That is, the majority of midterm/final problems will be
modified homework problems and problems worked in class.

References
1. H. Hsu, Probability, Random Variables, and Random Processes, McGraw Hill, 1997 (Schaum's
Outline Series)

2. A. Papoulis, S. Pillai, Probability, Random Variables and Stochastic Processes, Fourth Edition,
McGraw-Hill, New York, 2002.

3. H. Stark, J. Woods, Probability, Statistics and Random Processes for Engineers, Fourth Edition,
Prentice Hall, 2012.

4. P. Peebles, Probability, Random Variables and Random Signal Principles, McGraw Hill, 1980.

5. G.R. Grimmett, D.R. Stirzaker, Probability and Random Processes, Oxford Science Publications,
1992.
Chapter 1: Introduction to Probability
1) The Classical Approach to Probability
2) The Relative Frequency Approach to Probability
3) The Axiomatic Approach to Probability
4) Elementary Set Theory
5) Probability Space: Sample Space, σ-Algebra and Probability Measure
6) Conditional Probability
7) Theorem of Total Probability - Discrete Form
8) Bayes' Theorem
9) Independence of Events
10) Cartesian Product of Sets
11) Independent Bernoulli Trials
12) Gaussian Function
13) DeMoivre-Laplace Theorem
14) Law of Large Numbers
15) Poisson Theorem and Random Points

Chapter 2: Random Variables
1) Random Variables
2) Distribution and Density Function
3) Continuous/Discrete/Mixed Random Variables
4) Normal/Gaussian Random Variable
5) Uniform Random Variable
6) Binomial Random Variable
7) Poisson Random Variable
8) Rayleigh Random Variable
9) Exponential Random Variable
10) Conditional Distribution/Density
11) Theorem of Total Probability - Continuous Form
12) Bayes' Theorem - Continuous Form
13) Expectation
14) Variance and Standard Deviation
15) Moments
16) Conditional Expectation
17) Tchebycheff Inequality
18) Poisson Points Applied to System Reliability

Chapter 3: Multiple Random Variables
1) Joint Distribution/Density
2) Jointly Gaussian Random Variables
3) Independence of Random Variables
4) Expectation of a Product of Random Variables
5) Variance of a Sum of Independent Random Variables
6) Random Vectors and Covariance Matrices

Chapter 4: Function of Random Variables
1) Transformation of One Random Variable Into Another
2) Determination of Distribution/Density of Transformed Random Variables
3) Expected Value of a Transformed Random Variable
4) Characteristic Functions and Applications
5) Characteristic Function for Gaussian Random Vectors
6) Moment Generating Function
7) One Function of Two Random Variables
8) Leibnitz's Rule
9) Two Functions of Two Random Variables
10) Joint Density Functions
11) Linear Transformation of Gaussian Random Variables

Chapter 5: Moments and Conditional Statistics
1) Expected Value of a Function of Two Random Variables
2) Covariance
3) Correlation Coefficient
4) Uncorrelated and Orthogonal Random Variables
5) Joint Moments
6) Conditional Distribution/Density: One Random Variable Conditioned on Another
7) Conditional Expectation
8) Application of Conditional Expectation: Bayesian Estimation
9) Conditional Multi-dimensional Gaussian Density

Chapter 6: Random Processes
1) Definitions and Examples of Random Processes
2) Continuous and Discrete Random Processes
3) Distribution and Density Functions
4) Stationary Random Processes
5) First- and Second-Order Probabilistic Averages
6) Wide-Sense Stationary Processes
7) Ergodic Processes
8) Classical Random Walk
9) Wiener Process As a Limit of the Random Walk
10) Independent Increments
11) Diffusion Equation for Transition Density
12) Probability Current
13) Solution of Diffusion Equation by Transform Techniques

Chapter 7: Correlation Functions
1) Autocorrelation Function
2) Autocovariance Function
3) Correlation Function
4) Properties of Autocorrelation Function for Real-Valued WSS Random Processes
5) Random Binary Waveform
6) Poisson Random Points (Revisited)
7) Poisson Random Processes
8) Autocorrelation of Poisson Processes
9) Semi-Random Telegraph Signal
10) Random Telegraph Signal
11) Autocorrelation of Wiener Process
12) Correlation Time
13) Crosscorrelation Function
14) Input/Output Cross Correlation for Linear Systems
15) Autocorrelation of System Output in Terms of Autocorrelation of Input

Chapter 8: Power Density Spectrum
1) Definition of the Power Spectrum of a Stationary Process
2) Calculation of the Power Spectrum of a Process
3) Rational Power Spectra
4) Wiener-Khinchine Theorem
5) Application to Random Telegraph Signal
6) Power Spectrum of System Output in Terms of Power Spectrum of System Input
7) Noise Equivalent Bandwidth of a Lowpass System or Filter



EE603 RANDOM SIGNALS IN COMMUNICATION
Spring 2010
Instructor
John Stensby (stensby@eng.uah.edu), EB 217I, Office Hours: Tue, Thurs. 3-5PM and Fri 2-4PM
or by appointment.
Course Material
1. H. Stark, J. Woods, Probability and Random Processes with Applications to Signal
Processing, Third Edition, Prentice Hall, 2002.
2. Course notes (Chapters 9 through 14) can be downloaded and printed from
http://www.ece.uah.edu/courses/ee420-500/
Prerequisite
EE420/500 or equivalent. Please review the first eight chapters of the online notes. Pay close
attention to Gaussian random variables, Gaussian random processes, Poisson random points and
Poisson processes.
Course Outline
For the most part, I will follow my class notes, starting at Chapter 9. Topics to be covered
include those listed below.
1. Some review of probability and random processes (Ch. 1 - 8 of EE500 notes)
2. Narrow band Gaussian Noise (Ch. 9 of class notes and Section 8.6 of Text)
3. Shot Noise (Ch. 9 of class notes).
4. Thermal Noise and System Noise Figure (Ch. 10 of class notes)
5. Sequences of Random Variables (Ch. 11 of class notes, Ch 6 of Text)
6. Mean-Square Calculus (Ch. 12 of class notes, Section 8.1 of Text)
Most semesters, this is as far as I get. However, if time allows, I will continue on with:
7. Orthogonal expansions of random processes and applications to detection/matched filtering
theory (Ch. 13 class notes, Section 8.5 of Text)
8. Markov and Diffusion Processes (Ch. 14 class notes, Ch. 7 of Text)
Grading
Major Exams (2) 50%
Homework 20%
Final Exam 30%
Notes
1. The main goal of this course is to provide the student with fundamentals that are necessary for
advanced study in communication systems, signal processing, radar/sonar, control systems
and other areas where random data/fluctuations must be considered/analyzed.
2. Homework will be assigned almost every week. Homework solutions will be posted on the
bulletin board outside of Room 217 of the Engineering Building.
3. Please observe my posted office hours. If they are not convenient, then please make an
appointment to see me.
4. All in-class exams will be open notes/book.
References
1. H. Hsu, Probability, Random Variables, and Random Processes (Schaum's Outline Series),
McGraw Hill, 1997.
Hsu's outline is highly recommended. He covers many course topics, and he provides many worked
example problems.
2. G. Grimmett, D. Stirzaker, Probability and Random Processes, Second Edition, Clarendon
Press, 1992.
A great text on the subject matter. Must be seen/read to be fully appreciated!!!
3. A. Papoulis, S. Pillai, Probability, Random Variables and Stochastic Processes, Fourth
Edition, McGraw-Hill, 2002.
This is a good general reference book.
4. A. Jazwinski, Stochastic Processes and Filtering Theory, Academic Press, 1970.
This text has a lot of good stuff on convergence of random sequences and mean square calculus
(plus, it is a good text for Kalman and nonlinear filtering theory).
5. T. Soong, Random Differential Equations in Science and Engineering, Academic Press, 1973.
Soong has great coverage of convergence concepts and mean square calculus.
6. W. Gardner, Introduction to Random Processes With Applications to Signals and Systems,
Second Edition, McGraw Hill, 1990.
7. J. Gubner, Probability and Random Processes for Electrical and Computer Engineers,
Cambridge University Press, 2006.
An excellent (and easy to read) text for engineers and scientists who need some advanced theory.
8. K.L. Chung, A Course in Probability Theory, Third Edition, Academic Press, 2001.
A very readable introduction to probability theory from a measure-theoretic standpoint.
9. M. Loève, Probability Theory I and II, Fourth Edition, Springer-Verlag, 1977.
These two books provide a terse but comprehensive coverage of probability and random processes
for the advanced reader (with a working knowledge of measure theory).
The relative frequency concept is essential in applying probability theory to the physical world.
Axiomatic Approach
The Axiomatic Approach is followed in most modern textbooks on probability. It is
based on a branch of mathematics known as measure theory. The axiomatic approach has the
notion of a probability space as its main component. Basically, a probability space consists of 1)
a sample space, denoted as S, 2) a collection of events, denoted as F, and 3) a probability
measure, denoted by P. Without discussing the measure-theoretic aspects, this is the approach
employed in this course. Before discussing (S,F,P), we must review some elementary set theory.
Elementary Set Theory
A set is a collection of objects. These objects are called elements of the set. Usually,
upper-case, bold-face italic letters are used to denote sets (i.e., A, B, C, ...). Lower-case italic
letters are used to denote set elements (i.e., a, b, c, ...). The notation a ∈ A (a ∉ A) denotes that a
is (is not) an element of A. All sets are considered to be subsets of some universal set, denoted
here as S.

Set A is a subset of set B, denoted as A ⊂ B, if all elements in A are elements of B. The
empty or null set is the set that contains no elements. It is denoted as {}.
Transitivity Property
If U ⊂ B and B ⊂ A, then U ⊂ A, a result known as the Transitivity Property.
Set Equality
B = A is equivalent to the requirements that B ⊂ A and A ⊂ B. Often, two sets are
shown to be equal by showing that this requirement holds.
Unions
The union of sets A and B is a set containing all of the elements of A plus all of the
elements of B (and no other elements). The union is denoted as A ∪ B.
Union is commutative: A ∪ B = B ∪ A.
Union is associative: (A ∪ B) ∪ C = A ∪ (B ∪ C).
Intersection
The intersection of sets A and B is a set consisting of all elements common to both A and
B. It is denoted as A ∩ B. Figure 1-1 illustrates the concept of intersection.
Intersection is commutative: A ∩ B = B ∩ A.
Intersection is associative: (A ∩ B) ∩ C = A ∩ (B ∩ C).
Intersection is distributive over unions: A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C).
Sets A and B are said to be mutually exclusive or disjoint if they have no common
elements, so that A ∩ B = {}.
Set Complementation
The complement of A is denoted as Ā, and it is the set consisting of all elements of the
universal set that are not in A. Note that A ∪ Ā = S and A ∩ Ā = {}.
Set Difference
The difference A − B denotes a set consisting of all elements in A that are not in B.
Often, A − B is called the complement of B relative to A.
Figure 1-1: The intersection of sets A and B.
Figure 1-2: The difference between sets A and B (the shaded area is A − B).
De Morgan's Laws

$$1)\ \ \overline{A \cup B} = \bar{A} \cap \bar{B}, \qquad 2)\ \ \overline{A \cap B} = \bar{A} \cup \bar{B}. \tag{1-2}$$

More generally, if in a set identity we replace all sets by their complements, all unions
by intersections, and all intersections by unions, the identity is preserved. For example, apply
this rule to the set identity A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C) to obtain the result
Ā ∪ (B̄ ∩ C̄) = (Ā ∪ B̄) ∩ (Ā ∪ C̄).
Infinite Unions/Intersections of Sets
An infinite union of sets can be used to formulate questions concerning whether or not
some specified item belongs to one or more sets that are part of an infinite collection of sets.
Let A_i, 1 ≤ i < ∞, be a collection of sets. The union of the A_i is written/defined as

$$\bigcup_{i=1}^{\infty} A_i \equiv \{\omega : \omega \in A_n \text{ for some } n,\ 1 \le n < \infty\}. \tag{1-3}$$

Equivalently, ω ∈ ⋃_{i=1}^∞ A_i if, and only if, there is at least one integer n for which ω ∈ A_n. Of
course, ω may be in more than one set belonging to the infinite collection.
An infinite intersection of sets can be used to formulate questions concerning whether or
not some specified item belongs to all sets belonging to an infinite collection of sets. Let A_i,
1 ≤ i < ∞, be a collection of sets. The intersection of the A_i is written/defined as

$$\bigcap_{i=1}^{\infty} A_i \equiv \{\omega : \omega \in A_n \text{ for all } n,\ 1 \le n < \infty\} = \{\omega : \omega \in A_n,\ 1 \le n < \infty\}. \tag{1-4}$$

Equivalently, ω ∈ ⋂_{i=1}^∞ A_i if, and only if, ω ∈ A_n for all n.
σ-Algebra of Sets
Consider an arbitrary set S of objects. In general, there are many ways to form subsets of
S. Set F is said to be a σ-algebra of subsets of S (often, we just say σ-algebra; the phrase "of
subsets of S" is understood) if
1) F is a set of subsets of S,
2) If A ∈ F then Ā ∈ F (i.e., F is closed under complementation),
3) {} ∈ F and S ∈ F, and
4) If A_i ∈ F, 1 ≤ i < ∞, then ⋃_{i=1}^∞ A_i ∈ F (i.e., F is closed under countable unions).

These four properties can be used to show that if A_i ∈ F, 1 ≤ i < ∞, then ⋂_{i=1}^∞ A_i ∈ F;
that is, F is closed under countable intersections. Some examples of σ-algebras follow.
Example 1-1: For S = { H, T }, the collection F = { {}, {H, T}, {H}, {T} } is a σ-algebra.
Example 1-2: All possible subsets of S constitute a σ-algebra. This is the largest σ-algebra that
can be formed from subsets of S. We define F_L ≡ {all possible subsets of S} to be the σ-algebra
comprised of all possible subsets of S. Often, F_L is called the Power Set for S.
Example 1-3: { {}, S } is the smallest σ-algebra that can be formed from subsets of S.
Intersections and Unions of σ-Algebras
The intersection of σ-algebras is a σ-algebra, a conclusion that follows directly from the
basic definition.
Example 1-4: Let F_1 and F_2 be σ-algebras of subsets of S. Then the intersection F_1 ∩ F_2 is a
σ-algebra. More generally, let I denote an index set and F_k, k ∈ I, be a collection of σ-algebras.
Then the intersection ⋂_{k∈I} F_k is a σ-algebra.
Example 1-5: Let A be a non-empty proper subset of S. Then σ(A) ≡ { {}, S, A, Ā } is the
smallest σ-algebra that contains A. Here, Ā denotes the complement of A.
In general, a union of σ-algebras is not a σ-algebra. A simple counterexample that
establishes this can be constructed by using Example 1-5.
Example 1-5 can be generalized to produce the smallest σ-algebra that contains n sets
A_1, …, A_n. (In Examples 1-5 and 1-6, can you spot a general construction method for generating
the smallest σ-algebra that contains a given finite collection C of sets?)
Example 1-6: Let A_1 and A_2 be non-empty proper subsets of S. We construct σ({A_1, A_2}), the
smallest σ-algebra that contains the sets A_1 and A_2. Consider the disjoint pieces A_1 ∩ A_2,
A_1 ∩ Ā_2, Ā_1 ∩ A_2, and Ā_1 ∩ Ā_2 of S that are illustrated by Figure 1-3. Let G denote all possible
unions (i.e., all unions taken k at a time, 0 ≤ k ≤ 4) of these disjoint pieces. Note that G is a
σ-algebra that contains A_1 and A_2.

Figure 1-3: Disjoint pieces A_1 ∩ A_2, A_1 ∩ Ā_2, Ā_1 ∩ A_2, Ā_1 ∩ Ā_2 of S.
Furthermore, if F is any σ-algebra that contains A_1 and A_2, then
G ⊂ F. Therefore, G is in the intersection of all σ-algebras that contain A_1 and A_2, so that
σ({A_1, A_2}) = G. There are four disjoint pieces, and each disjoint piece may, or may not, be in a
given set of G. Hence, σ({A_1, A_2}) = G will contain 2^4 = 16 sets, assuming that all of the
disjoint pieces are non-empty.
This construction technique can be generalized easily to n events A_1, …, A_n. The
minimal σ-algebra σ({A_1, …, A_n}) consists of the collection G of all possible unions of sets of
the form C_1 ∩ C_2 ∩ ... ∩ C_n, where each C_k is either A_k or Ā_k (these correspond to the disjoint
pieces used above). Example 1-5 corresponds to the case n = 1, and Example 1-6 to the case
n = 2. Note that G is a σ-algebra that contains A_1, …, A_n. Furthermore, G must be in every
σ-algebra that contains A_1, …, A_n. Hence, σ({A_1, …, A_n}) = G. Now, there are 2^n disjoint
pieces of the form C_1 ∩ C_2 ∩ ... ∩ C_n, and each disjoint piece may, or may not, be in any given
set in G. Hence, σ({A_1, …, A_n}) = G contains 2^(2^n) sets, assuming that each disjoint piece
C_1 ∩ C_2 ∩ ... ∩ C_n is non-empty (otherwise, there are fewer than 2^(2^n) sets in σ({A_1, …, A_n})).
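For small n, the construction just described can be carried out by machine. The following Python sketch is an illustrative aside (not part of the original notes; the universal set and the sets A1, A2 are arbitrary choices): it forms the 2^n disjoint pieces and then all of their unions, reproducing the 16-set count of Example 1-6.

```python
from itertools import combinations, chain

def sigma_algebra(S, sets):
    """Smallest sigma-algebra containing 'sets' (each a subset of S).

    Builds the disjoint pieces C1, C2, ... (each Ck is Ak or its complement,
    all intersected together), then forms every possible union of pieces.
    """
    S = frozenset(S)
    pieces = set()
    n = len(sets)
    for mask in range(2 ** n):
        piece = S
        for k, A in enumerate(sets):
            A = frozenset(A)
            piece = piece & (A if (mask >> k) & 1 else S - A)
        pieces.add(piece)
    pieces = [p for p in pieces if p]          # drop an empty piece, if any
    members = set()
    for r in range(len(pieces) + 1):
        for combo in combinations(pieces, r):
            members.add(frozenset(chain.from_iterable(combo)))
    return members

S = {1, 2, 3, 4, 5, 6}
A1 = {1, 2, 3}
A2 = {2, 3, 4}
print(len(sigma_algebra(S, [A1, A2])))   # 16, since all four disjoint pieces are non-empty
```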
Example 1-7: Let A_1, A_2, …, A_n, … be a disjoint partition of S. By this we mean that

$$\bigcup_{k} A_k = S, \qquad A_k \cap A_j = \{\} \ \text{ if } j \ne k.$$

By Example 1-6, note that it is possible to represent each element of σ({A_1, A_2, …, A_n, …}) as a
union of sets taken from the collection {A_1, A_2, …, A_n, …}.
σ-Algebra Generated by a Collection of Subsets
Let C be any non-empty collection of subsets of S. In general, C is not a σ-algebra.
Every set in C is in the σ-algebra F_L ≡ {all possible subsets of S}; we say C is in σ-algebra
F_L and write C ⊂ F_L. In general, there may be many σ-algebras that contain C. The intersection
of all σ-algebras that contain C is the smallest σ-algebra that contains C, and this smallest
σ-algebra is denoted as σ(C). We say that σ(C) is the σ-algebra generated by C. Examples 1-5
and 1-6 illustrate the construction of σ({A_1, A_2, …, A_n}).
Example 1-8: Let S = R, the real line. Let C be the set consisting of all open intervals of R (C
contains all intervals (a, b), a ∈ R and b ∈ R). C is not a σ-algebra of S = R. To see this,
consider the identity

$$[-1, 1] = \bigcap_{n=1}^{\infty} \left(-1 - \tfrac{1}{n},\ 1 + \tfrac{1}{n}\right).$$

Hence, the collection of open intervals of R is not closed under countable intersections; the
collection of all open intervals of R is not a σ-algebra of S = R.
Example 1-9: As in the previous example, let S = R, the real line, and C be the set consisting of
all open intervals of R. While C is not a σ-algebra itself (see previous example), the smallest
σ-algebra that contains C (i.e., the σ-algebra generated by C) is called the Borel σ-algebra, and it
is denoted as B in what follows. σ-algebra B plays an important role in the theory of probability.
Obviously, B contains the open intervals, but it contains much more. For example, all half-open
intervals (a, b] are in B since

$$(a, b] = \bigcap_{n=1}^{\infty} (a,\ b + 1/n) \in \mathbf{B}. \tag{1-5}$$

Using similar reasoning, it is easy to show that all closed intervals [a, b] are in B.
Example 1-10: Let F be a σ-algebra of subsets of S. Suppose that B ∈ F. It is easy to show that

$$\mathbf{G} \equiv [\, A \cap B : A \in \mathbf{F} \,] \tag{1-6}$$

is a σ-algebra of subsets of B. To accomplish this, first show that G contains {} and B. Next,
show that if A ∈ G then B − A ∈ G (i.e., the complement of A relative to B must be in G).
Finally, show that G is closed under countable intersections; that is, if A_k ∈ G, 1 ≤ k < ∞, then

$$\bigcap_{k=1}^{\infty} A_k \in \mathbf{G}. \tag{1-7}$$

Often, G is called the trace of F on B. Note that G will not be a σ-algebra of subsets of S.
However, G ⊂ F. G is an example of a sub σ-algebra of F.
Probability Space (S, F, P)
A probability space (S, F, P) consists of a sample space S, a set F of permissible events
and a probability measure P. These three quantities are described in what follows.
Sample Space S
The sample space S is the set of elementary outcomes of an experiment. For example,
the experiment of tossing a single coin has the sample space S = { heads, tails }, a simple,
countable set. The experiment of measuring a random voltage might use S = {v : -∞ < v < ∞},
an infinite, non-countable set.
Set F of Permissible Events
In what follows, the collection (i.e., set) of permissible events is denoted as F. The
collection of events F must be a σ-algebra of subsets of S, but not every σ-algebra can be the
collection of events. For a σ-algebra to qualify as a set of permissible events, it must be possible
to assign probabilities (this is the job of P discussed next) to the sets/events in the σ-algebra
without violating the axioms of probability discussed below.
Probability Measure
A probability must be assigned to every event (element of F). To accomplish this, we
use a set function P that maps events in F into [0,1] (P : F → [0,1]). Probability measure P
must satisfy
1) 0 ≤ P(A) ≤ 1 for every A ∈ F,
2) If A_i ∈ F, 1 ≤ i < ∞, is any countable, mutually exclusive sequence (i.e., A_i ∩ A_j = {} for
i ≠ j) of events, then

$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i) \quad \text{(we say that } P \text{ must be countably additive)}, \tag{1-8}$$

3) P(S) = 1. (1-9)
Conditions 1) through 3) are called the Axioms of Probability. Now, we can say that the
permissible set of events F can be any σ-algebra for which there exists a P that satisfies the
Axioms of Probability. As it turns out, for some sample spaces, there are some σ-algebras that
cannot serve as a set of permissible events because there is no corresponding P function that
satisfies the Axioms of Probability.
In many problems, it might be desirable to let all subsets of S be events. That is, it might
be desirable to let "everything" be an event (i.e., let the set of events be the σ-algebra F_L that is
discussed in Example 1-2 above). However, in general, it is not possible to do this because a P
function may not exist that satisfies the Axioms of Probability.
A special case deserves to be mentioned. If S is countable (i.e., there exists a 1-1
correspondence between the elements of S and the integers), then F can be taken as the
σ-algebra consisting of all possible subsets of S (i.e., let F = F_L, the largest σ-algebra). That is, if
S is countable, it is possible to assign probabilities (without violating the Axioms) to the elements
of F_L in this case. But, in the more general case where S is not countable, to avoid violating the
Axioms of Probability, there may be subsets of S that cannot be events; these subsets must be
excluded from F.
As it turns out, if S is the real line (a non-countable sample space), then the σ-algebra F_L
of all possible sets (of real numbers) contains too many sets. In this case, it is not possible to
obtain a P that satisfies the Axioms of Probability, and F_L cannot serve as the set of events.
Instead, for S equal to the real line, the Borel σ-algebra B, discussed in Example 1-9, is usually
chosen (it is very common to do this in applications). It is possible to assign probabilities to
Borel sets without violating the Axioms of Probability. The Borel sets are assigned probabilities
as shown by the following example.
Example 1-11: Many important applications employ a probability space (S, F, P) where S is the
set R of real numbers, and F = B is the Borel σ-algebra (see Example 1-9). The probability
measure P is defined in terms of a density function f(x). Density f(x) can be any integrable
function that satisfies
1) f(x) ≥ 0 for all x ∈ S = R,
2) ∫_{-∞}^{∞} f(x) dx = 1.
Then, probability measure P is defined by

$$P(B) = \int_{B} f(x)\, dx, \qquad B \in \mathbf{F} = \mathbf{B}. \tag{1-10}$$
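As a small numerical aside (not part of the notes; the exponential density and the intervals below are arbitrary example choices), the next sketch implements (1-10) for interval events B = (a, b] by numerical integration.

```python
import numpy as np
from scipy.integrate import quad

def f(x):
    """An example density: unit-rate exponential, non-negative and integrating to 1 over R."""
    return np.exp(-x) if x >= 0 else 0.0

def P(a, b):
    """Probability of the Borel set B = (a, b], computed as the integral of f over B, Eq. (1-10)."""
    value, _ = quad(f, a, b)
    return value

print(P(0.0, 1.0))        # about 0.632 = 1 - exp(-1)
print(P(0.0, np.inf))     # about 1.0: the whole half-line carries probability one
```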
As defined above, the notion of F, the set of possible events, is abstract. However, in
most applications, one encounters only a few general types of F. In most applications, S is either
countable, the real line R = (-∞, ∞), or an interval (i.e., S = (a,b)). These cases are discussed
briefly in what follows.
Many applications involve countable sample spaces. For most of these cases, F is taken
as F_L, the set of all possible subsets of S. To events in F_L, probabilities are assigned in an
application-specific, intuitive manner.
On the other hand, many applications use the real line S = R = (-∞, ∞), a non-countable
set. For these cases, it is very common to use an F and P as discussed in Example 1-11. An
identical approach is used when S is an interval of the real line.
Implications of the Axioms of Probability
A number of conclusions can be reached from considering the above-listed Axioms of
Probability.
1) The probability of the impossible event {} is zero.
Proof: Note that A ∩ {} = {} and A ∪ {} = A. Hence, P(A) = P(A ∪ {}) = P(A) + P({}).
Conclude from this that P({}) = 0.
2) For any event A we have P(A) = 1 − P(Ā).
Proof: A ∪ Ā = S and A ∩ Ā = {}. Hence,
1 = P(S) = P(A ∪ Ā) = P(A) + P(Ā),
a result that leads to the conclusion that
P(A) = 1 − P(Ā). (1-11)
3) For any events A and B we have
P(A ∪ B) = P(A) + P(B) − P(A ∩ B). (1-12)
Proof: The two identities
A ∪ B = A ∪ (B ∩ (A ∪ Ā)) = A ∪ (B ∩ Ā)
B = B ∩ (A ∪ Ā) = (A ∩ B) ∪ (B ∩ Ā)
lead to
P(A ∪ B) = P(A) + P(B ∩ Ā)
P(B) = P(A ∩ B) + P(B ∩ Ā).
Subtract these last two expressions to obtain the desired result
P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
Note that this result is generalized easily to the case of three or more events.
Conditional Probability
The conditional probability of an event A, assuming that event M has occurred, is
$$P(A \mid M) = \frac{P(A \cap M)}{P(M)}, \tag{1-13}$$

where it is assumed that P(M) ≠ 0. Note that P(A ∩ M) = P(A | M) P(M), a useful identity.

Consider two special cases. The first case is M ⊂ A, so that A ∩ M = M. For M ⊂ A, we
have

$$P(A \mid M) = \frac{P(A \cap M)}{P(M)} = \frac{P(M)}{P(M)} = 1. \tag{1-14}$$
Next, consider the special case A ⊂ M, so that P(M | A) = 1. For this case, we have

$$P(A \mid M) = \frac{P(A \cap M)}{P(M)} = \frac{P(M \mid A)\, P(A)}{P(M)} = \frac{P(A)}{P(M)} \ge P(A), \tag{1-15}$$

an intuitive result.
Example 1-12: In the fair die experiment, the outcomes are f_1, f_2, ..., f_6, the six faces of the die.
Let A = { f_2 }, the event "a two occurs", and M = { f_2, f_4, f_6 }, the event "an even outcome
occurs". Then we have P(A) = 1/6, P(M) = 1/2 and P(A ∩ M) = P(A), so that

P({f_2} | "even") = (1/6)/(1/2) = 1/3 > P({f_2}) = 1/6.
Example 1-13: A box contains three white balls w_1, w_2, w_3 and two red balls r_1 and r_2. We
remove at random and without replacement two balls in succession. What is the probability that
the first removed ball is white and the second is red?

P[{first ball is white}] = 3/5
P[{second is red} | {first ball is white}] = 1/2
P[{first ball is white} ∩ {second ball is red}] = P[{second is red} | {first ball is white}] P[{first ball is white}] = (1/2)(3/5) = 3/10
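A quick Monte Carlo check of Example 1-13 (an illustrative aside, not part of the notes): the relative frequency of {first white, second red} settles near 3/10, in the spirit of the relative frequency interpretation of probability.

```python
import random

def trial():
    """Draw two balls without replacement; True if the order is (white, red)."""
    box = ['w', 'w', 'w', 'r', 'r']        # three white, two red
    first, second = random.sample(box, 2)  # sampling without replacement
    return first == 'w' and second == 'r'

N = 200_000
count = sum(trial() for _ in range(N))
print(count / N)    # relative frequency, close to 3/10 = 0.3
```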
Theorem 1-1 (Total Probability Theorem, Discrete Version): Let [A_1, A_2, ..., A_n] be a
partition of S. That is,

$$\bigcup_{i=1}^{n} A_i = S \quad \text{and} \quad A_i \cap A_j = \{\} \ \text{ for } i \ne j. \tag{1-16}$$

Let B be an arbitrary event. Then

$$P[B] = P[B \mid A_1]\, P[A_1] + P[B \mid A_2]\, P[A_2] + \cdots + P[B \mid A_n]\, P[A_n]. \tag{1-17}$$
Proof: First, note the set identity

B = B ∩ S = B ∩ (A_1 ∪ A_2 ∪ ... ∪ A_n) = (B ∩ A_1) ∪ (B ∩ A_2) ∪ ... ∪ (B ∩ A_n).

For i ≠ j, B ∩ A_i and B ∩ A_j are mutually exclusive. Hence, we have

$$P[B] = P[B \cap A_1] + P[B \cap A_2] + \cdots + P[B \cap A_n] = P[B \mid A_1]\, P[A_1] + P[B \mid A_2]\, P[A_2] + \cdots + P[B \mid A_n]\, P[A_n]. \tag{1-18}$$

This result is known as the Total Probability Theorem, Discrete Version.
Example 1-14: Let [A_1, A_2, A_3] be a partition of S. Consider the identity

P[B] = P[B | A_1] P[A_1] + P[B | A_2] P[A_2] + P[B | A_3] P[A_3].

This equation is illustrated by Figure 1-4.

Figure 1-4: Example that illustrates the Total Probability Theorem.
Bayes' Theorem
Let [A_1, A_2, ..., A_n] be a partition of S. Since

$$P(A_i \mid B) = \frac{P(B \mid A_i)\, P(A_i)}{P(B)}$$

and

$$P[B] = P[B \mid A_1]\, P[A_1] + P[B \mid A_2]\, P[A_2] + \cdots + P[B \mid A_n]\, P[A_n],$$
we have
Theorem 1-2 (Bayes' Theorem):

$$P(A_i \mid B) = \frac{P[B \mid A_i]\, P[A_i]}{P[B \mid A_1]\, P[A_1] + P[B \mid A_2]\, P[A_2] + \cdots + P[B \mid A_n]\, P[A_n]}. \tag{1-19}$$
The P[A_i] are called a priori probabilities, and the P[A_i | B] are called a posteriori probabilities.
Bayes' theorem provides a method for incorporating experimental observations into the
characterization of an event. Both P[A_i] and P[A_i | B] characterize the events A_i, 1 ≤ i ≤ n;
however, P[A_i | B] may be a better (more definitive) characterization, especially if B is an event
related to the A_i. For example, consider events A_1 = [snow today], A_2 = [no snow today] and
T = [today's temperature is above 70°F]. Given the occurrence of T, one would expect P[A_1 | T]
and P[A_2 | T] to characterize snow today more definitively than do P[A_1] and P[A_2].
Example 1-15: We have four boxes. Box #1 contains 2000 components, 5% defective. Box #2
contains 500 components, 40% defective. Boxes #3 and #4 contain 1000 components each, 10%
defective in both boxes. At random, we select one box and remove one component.
a) What is the probability that this component is defective? From the theorem of total
probability we have
P[component is defective] = Σ_{i=1}^{4} P[defective | Box i] P[Box i]
= (.05)(.25) + (.4)(.25) + (.1)(.25) + (.1)(.25) = .1625

b) We examine a component and find that it is defective. What is the probability that it came
from Box #2? By Bayes' Law, we have

P(Box #2 | Defective) = P(Defective | Box #2) P[Box #2] / (P[Defective | Box #1] P[Box #1] + ... + P[Defective | Box #4] P[Box #4])
= (.4)(.25)/.1625 ≈ .615
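The arithmetic of Example 1-15 organizes naturally as a small script (an illustrative aside, not part of the notes); it applies the Total Probability Theorem (1-17) and then Bayes' Theorem (1-19).

```python
# Prior probability of selecting each box, and P[defective | box]
prior     = [0.25, 0.25, 0.25, 0.25]
defective = [0.05, 0.40, 0.10, 0.10]

# Total Probability Theorem (1-17)
p_def = sum(d * p for d, p in zip(defective, prior))
print(round(p_def, 4))             # 0.1625

# Bayes' Theorem (1-19): a posteriori probability of Box #2 given a defective part
p_box2_given_def = defective[1] * prior[1] / p_def
print(round(p_box2_given_def, 3))  # 0.615
```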
Independence
Events A and B are said to be independent if

P[A ∩ B] = P[A] P[B]. (1-20)

If A and B are independent, then

$$P[A \mid B] = \frac{P[A \cap B]}{P[B]} = \frac{P[A]\, P[B]}{P[B]} = P[A]. \tag{1-21}$$
Three events A_1, A_2, and A_3 are independent if
1) P[A_i ∩ A_j] = P[A_i] P[A_j] for i ≠ j, and
2) P[A_1 ∩ A_2 ∩ A_3] = P[A_1] P[A_2] P[A_3].
Be careful! Condition 1) may hold, and condition 2) may not. Likewise, Condition 2) may hold,
and condition 1) may not. Both are required for the three events to be independent.
Example 1-16: Suppose

P[A_1] = P[A_2] = P[A_3] = 1/5
P[A_1 ∩ A_2] = P[A_1 ∩ A_3] = P[A_2 ∩ A_3] = P[A_1 ∩ A_2 ∩ A_3] = p.

If p = 1/25, then P[A_i ∩ A_j] = P[A_i] P[A_j] for i ≠ j holds, so that requirement 1) holds.
However, P[A_1 ∩ A_2 ∩ A_3] ≠ P[A_1] P[A_2] P[A_3], so that requirement 2) fails. On the other
hand, if p = 1/125, then P[A_1 ∩ A_2 ∩ A_3] = P[A_1] P[A_2] P[A_3], and requirement 2) holds. But
P[A_i ∩ A_j] ≠ P[A_i] P[A_j] for i ≠ j, so that requirement 1) fails.
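The two cases of Example 1-16 reduce to simple arithmetic; the following check (an illustrative aside, not part of the notes) confirms which requirement holds for each value of p.

```python
from fractions import Fraction as F

P_single = F(1, 5)                    # P[A1] = P[A2] = P[A3] = 1/5

for p in (F(1, 25), F(1, 125)):       # common value of all pair and triple intersections
    pairwise = (p == P_single * P_single)                 # requirement 1)
    triple   = (p == P_single * P_single * P_single)      # requirement 2)
    print(p, "pairwise:", pairwise, "triple:", triple)
# 1/25  -> pairwise True,  triple False
# 1/125 -> pairwise False, triple True
```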
More generally, the independence of n events can be defined inductively. We say that n
events A_1, A_2, ..., A_n are independent if
1) All combinations of k, k < n, of the events are independent, and
2) P[A_1 ∩ A_2 ∩ ... ∩ A_n] = P[A_1] P[A_2] ... P[A_n].
Starting from n = 2, we can use this requirement to generalize independence to an arbitrary, but
finite, number of events.
Cartesian Product of Sets
The Cartesian product of sets A_1 and A_2 is denoted as A_1 × A_2, and it is a set whose
elements are all ordered pairs (a_1, a_2), where a_1 ∈ A_1 and a_2 ∈ A_2. That is,

$$A_1 \times A_2 \equiv [\, (a_1, a_2) : a_1 \in A_1,\ a_2 \in A_2 \,]. \tag{1-22}$$
Example 1-17: Let A = { h, t } and B = { u, v } so that A × B = { hu, hv, tu, tv }.
Clearly, the notion of Cartesian product can be extended to the product of n, n > 2, sets.
Generalized Rectangles
Suppose A ⊂ S_1 and B ⊂ S_2. Then, the Cartesian product A × B can be represented as
A × B = (A × S_2) ∩ (S_1 × B), a result illustrated by Figure 1-5 (however, A × B need not be one
contiguous piece as depicted by the figure). Because of this geometric interpretation, the sets
A × B, A × S_2 and S_1 × B are referred to as generalized rectangles.
Combined Experiments - Product Spaces
Consider the experiments 1) rolling a fair die, with probability space (S_1, F_1, P_1), and 2)
tossing a fair coin, with probability space (S_2, F_2, P_2). Suppose both experiments are performed.
What is the probability that we get "two" on the die and "heads" on the coin? To solve this
problem, we combine the two experiments into a single experiment described by (S_C, F_C, P_C),
known as a product experiment or product space. The product sample space S_C, product
σ-algebra F_C, and product probability measure P_C are discussed in what follows.

Product Sample Space S_C
To combine the two sample spaces, we take S_C = S_1 × S_2. Sample space S_C is defined as

$$S_C = S_1 \times S_2 \equiv [\, (\omega_1, \omega_2) : \omega_1 \in S_1,\ \omega_2 \in S_2 \,]. \tag{1-23}$$

S_C consists of all possible pairs (ω_1, ω_2) of elementary outcomes, ω_1 ∈ S_1 and ω_2 ∈ S_2.
Product σ-Algebra F_C
Set F_C of combined events must contain all possible products A × B, where A ∈ F_1 and
B ∈ F_2.

Figure 1-5: Cartesian product as intersection of generalized rectangles.
We have shown how to naturally combine two experiments into one experiment. Clearly,
this process can be extended to combine n separate experiments into a single experiment.
Counting Subsets of Size k
If a set has n distinct elements, then the total number of its subsets consisting of k
elements each is

$$\binom{n}{k} = \frac{n!}{k!\, (n-k)!}. \tag{1-25}$$

Order within the subsets of k elements is not important. For example, the subset {a, b} is the
same as {b, a}.
Example 1-18: Form the

$$\binom{3}{2} = \frac{3!}{2!\, 1!} = 3$$

subsets of size two from the set {a_1, a_2, a_3}. These three subsets are {a_1, a_2}, {a_1, a_3} and {a_2, a_3}.
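Formula (1-25) is available directly in Python's standard library; the following snippet (an illustrative aside, not part of the notes) reproduces Example 1-18.

```python
from math import comb
from itertools import combinations

print(comb(3, 2))                                   # 3, from (1-25): 3!/(2! 1!)
print(list(combinations(['a1', 'a2', 'a3'], 2)))    # the three size-two subsets
```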
Bernoulli Trials - Repeated Trials
Consider the experiment (S, F, P) and the event A ∈ F. Suppose

P(A) = p,
P(Ā) = 1 − p ≡ q.

We conduct n independent trials of this experiment to obtain the combined experiment with
sample space S_C = S × S × ... × S, a Cartesian product of n copies of S. For the combined
experiment, the set of events F_C and the probability measure P_C are obtained as described
previously (P_C is easy to obtain because the trials are independent). The probability that A
occurs k times (in any order) is

$$P[A \text{ occurs } k \text{ times in } n \text{ independent trials}] = \binom{n}{k} p^k (1-p)^{n-k}. \tag{1-26}$$
In (1-26), it is important to remember that the order in which A and Ā occur is not important.
Proof
The n independent repetitions are known as Bernoulli Trials. The event {A occurs k
times in a specific order} is the Cartesian product A_1 × A_2 × ... × A_n, where k of the A_i are A
and n − k are Ā, and a specified ordering is given. The probability of this specific event is

P_1[A_1] P_1[A_2] ... P_1[A_n] = p^k (1 − p)^{n−k}.

Equivalently,

P[A occurs k times in a specific order] = p^k (1 − p)^{n−k}.
Now, the event {A occurs k times in any order} is the union of the $\binom{n}{k}$ mutually exclusive,
equally likely, events of the type {A occurs k times in a specific order}. Hence, we have

$$P[A \text{ occurs } k \text{ times in } n \text{ independent trials}] = \binom{n}{k} p^k (1-p)^{n-k},$$

as claimed.
Often, we are interested in the probability of A occurring at least k_1 times, but no more
than k_2 times, in n independent trials. The probability that event A occurs at least k_1 times, but
no more than k_2 times, is given as

$$P[A \text{ occurs between } k_1 \text{ and } k_2 \text{ times}] = \sum_{k=k_1}^{k_2} P[A \text{ occurs } k \text{ times}] = \sum_{k=k_1}^{k_2} \binom{n}{k} p^k q^{n-k}. \tag{1-27}$$
Example 1-19: A factory produces items, 1% of which are bad. Suppose that a random sample
of 100 of these items is drawn from a large consignment. Calculate the probability that the
sample contains no defective items. Let X denote the number of bad items in the sample of 100
items. Then, X is distributed binomially with parameters n = 100 and p = .01. Hence, we can
compute

$$P[X = 0] = P[\text{no bad items in sample of 100 items}] = \binom{100}{0} (.01)^0 (1 - .01)^{100} \approx .366.$$
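Equations (1-26) and (1-27) are easy to evaluate numerically; the sketch below (an illustrative aside, not part of the notes; the helper names are arbitrary) reproduces Example 1-19.

```python
from math import comb

def binom_pmf(k, n, p):
    """P[A occurs k times in n independent trials], Eq. (1-26)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_between(k1, k2, n, p):
    """P[A occurs at least k1 and at most k2 times], Eq. (1-27)."""
    return sum(binom_pmf(k, n, p) for k in range(k1, k2 + 1))

print(round(binom_pmf(0, 100, 0.01), 3))        # 0.366, as in Example 1-19
print(round(binom_between(0, 2, 100, 0.01), 3)) # P[at most two bad items], about 0.92
```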
Gaussian Function
The function

$$g(x) \equiv \frac{1}{\sqrt{2\pi}} \exp[-\tfrac{1}{2} x^2] \tag{1-28}$$

is known as the Gaussian function; see Figure 1-6. The Gaussian function can be used to define

$$G(x) \equiv \int_{-\infty}^{x} g(y)\, dy, \tag{1-29}$$

a tabulated function (also, G is an intrinsic function in MATLAB, Mathcad and other mathematical
software). It is obvious that G(-∞) = 0; it is known that G(∞) = 1. We will use tables of G to
evaluate integrals of the form
$$\int_{x_1}^{x_2} \frac{1}{\sigma}\, g\left(\frac{x-\eta}{\sigma}\right) dx = \int_{x_1}^{x_2} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(x-\eta)^2}{2\sigma^2}\right] dx = \int_{(x_1-\eta)/\sigma}^{(x_2-\eta)/\sigma} \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{y^2}{2}\right] dy = G\left(\frac{x_2-\eta}{\sigma}\right) - G\left(\frac{x_1-\eta}{\sigma}\right), \tag{1-30}$$

where -∞ < η < ∞ and 0 < σ < ∞ are known numbers. Due to symmetry in the g function, it is
easy to see that

G(-x) = 1 - G(x). (1-31)

Many tables contain values of G(x) for x ≥ 0 only. Using these tables and (1-31), we can
determine G(x) for negative x.
The Gaussian function G is related to the error function. For x ≥ 0,

$$G(x) = \tfrac{1}{2} + \mathrm{erf}(x), \tag{1-32}$$

where

$$\mathrm{erf}(x) \equiv \frac{1}{\sqrt{2\pi}} \int_{0}^{x} e^{-u^2/2}\, du, \quad x \ge 0, \tag{1-33}$$

is the well-known (and tabulated in mathematical handbooks) error function.

Figure 1-6: The Gaussian function g(x) = exp[-x²/2]/√(2π) and its integral G(x).
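For numerical work, note that the erf defined by (1-33) uses the Gaussian-density normalization; most software libraries (including scipy) instead define erf(x) = (2/√π)∫₀ˣ e^(-u²) du. The short sketch below (an illustrative aside, not part of the notes) evaluates G(x) of (1-29) through the library function and checks the symmetry relation (1-31).

```python
import numpy as np
from scipy.special import erf    # library erf: (2/sqrt(pi)) * integral_0^x exp(-u^2) du

def G(x):
    """Gaussian distribution function of (1-29), expressed through the library erf."""
    return 0.5 * (1.0 + erf(x / np.sqrt(2.0)))

print(G(0.0))                    # 0.5
print(G(1.0) - G(-1.0))          # about 0.6827 (used in Example 1-22)
print(G(-1.0), 1.0 - G(1.0))     # equal values, illustrating G(-x) = 1 - G(x), Eq. (1-31)
```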
Example 1-20: Consider again Example 1-11 with a Gaussian density function. Here, the
sample space S consists of the whole real line R. For F, we use the Borel σ-algebra B discussed
in Examples 1-9 and 1-11. Finally, for B ∈ B, we use the probability measure

$$P(B) = \frac{1}{\sqrt{2\pi}} \int_{B} e^{-u^2/2}\, du, \qquad B \in \mathbf{B}. \tag{1-34}$$

Formally denoted as (R, B, P), this probability space is used in many applications.
DeMoivre-Laplace Theorem
The Bernoulli trials formula is not practical for large n since n! cannot be calculated
easily. We seek ways to approximate the result. One such approximation is given by
Theorem 1-3 (DeMoivre-Laplace Theorem): Let q = 1 − p. If npq >> 1, then

$$\binom{n}{k} p^k q^{n-k} \approx \frac{1}{\sqrt{2\pi npq}} \exp\left[-\frac{(k-np)^2}{2npq}\right] = \frac{1}{\sqrt{npq}}\, g\left(\frac{k-np}{\sqrt{npq}}\right) \tag{1-35}$$

for k in a √(npq) neighborhood of np (i.e., |k − np| < √(npq)).
This theorem is used as follows. Suppose our experiment has only two outcomes,
success (i.e., event A) and failure (i.e., event Ā). On any trial, the probability of a success
is p, and the probability of failure is 1 − p. Suppose we conduct n independent trials of the
experiment. Let S_n denote the number of successes that occur in n independent trials. Clearly,
0 ≤ S_n ≤ n. The DeMoivre-Laplace theorem states that for npq >> 1 we have
$$P[S_n = k] = \binom{n}{k} p^k q^{n-k} \approx \frac{1}{\sqrt{2\pi npq}} \exp\left[-\frac{(k-np)^2}{2npq}\right] = \frac{1}{\sqrt{npq}}\, g\left(\frac{k-np}{\sqrt{npq}}\right) \tag{1-36}$$

for |k − np| < √(npq).

Example 1-21: A fair coin is tossed 1000 times. Find the probability that heads will show
exactly 500 times. For this problem, p = q = .5 and k − np = 0. Hence,

$$P[\text{exactly 500 heads}] \approx \frac{1}{10\sqrt{5\pi}} \approx .0252.$$
Now, we want to approximate the probability of obtaining between k_1 and k_2 occurrences
of event A. By the DeMoivre-Laplace theorem,

$$P[k_1 \le k \le k_2] = \sum_{k=k_1}^{k_2} \binom{n}{k} p^k q^{n-k} \approx \sum_{k=k_1}^{k_2} \frac{1}{\sqrt{npq}}\, g\left(\frac{k-np}{\sqrt{npq}}\right), \tag{1-37}$$

assuming that |k_1 − np| < √(npq) and |k_2 − np| < √(npq). If npq is large, then g([k − np]/√(npq))
changes slowly for k_1 ≤ k ≤ k_2, and

$$P[k_1 \le k \le k_2] \approx \sum_{k=k_1}^{k_2} \frac{1}{\sqrt{npq}}\, g\left(\frac{k-np}{\sqrt{npq}}\right) \approx \int_{k_1}^{k_2} \frac{1}{\sqrt{npq}}\, g\left(\frac{x-np}{\sqrt{npq}}\right) dx = G\left(\frac{k_2-np}{\sqrt{npq}}\right) - G\left(\frac{k_1-np}{\sqrt{npq}}\right), \tag{1-38}$$

as illustrated by Figure 1-7.
Example 1-22: A fair coin is tossed 10,000 times. Approximate a numerical value for
P[4950 ≤ #Heads ≤ 5050]. Since k_1 = 4950 and k_2 = 5050, we have
$$\frac{k_2 - np}{\sqrt{npq}} = 1 \quad \text{and} \quad \frac{k_1 - np}{\sqrt{npq}} = -1.$$

Application of (1-38) leads to the answer

P[4950 ≤ #Heads ≤ 5050] ≈ G(1) − G(−1) = G(1) − [1 − G(1)] ≈ .6826.
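The accuracy of approximation (1-38) in Example 1-22 can be checked against the exact Bernoulli-trials sum (1-27). The sketch below (an illustrative aside, not part of the notes) uses scipy for both the exact binomial sum and the Gaussian distribution function G.

```python
from scipy.stats import binom, norm

n, p = 10_000, 0.5
k1, k2 = 4950, 5050

# Exact Bernoulli-trials sum, Eq. (1-27)
exact = binom.cdf(k2, n, p) - binom.cdf(k1 - 1, n, p)

# DeMoivre-Laplace approximation, Eq. (1-38)
s = (n * p * (1 - p)) ** 0.5
approx = norm.cdf((k2 - n * p) / s) - norm.cdf((k1 - n * p) / s)

print(round(exact, 4), round(approx, 4))   # both close to 0.68
```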
From what is given above, we might conclude that approximations (1-37) and (1-38)
require the restrictions |k_1 − np| < √(npq) and |k_2 − np| < √(npq). However, these restrictions
may not be necessary in a given application. In (1-37), terms near the beginning (i.e., k = k_1)
and end (i.e., k = k_2) of the sum generally contain the most error (assuming k_1 < np < k_2).
However, for large enough n, the total sum of these errors (i.e., the total error) is small compared
to the entire sum of all terms (i.e., the answer). That is, in (1-37) with large n, the highly
accurate terms (those close to k = np) have a sum that dominates the sum of all terms (i.e., the
entire sum (1-37)), so the error in the tails (the terms near k = k_1 and k = k_2 contain the most
error) becomes less significant as n becomes large. In fact, in the limit as n → ∞, Approximation
(1-38) becomes exact; no restrictions are required on (k_1 − np)/√(npq) and (k_2 − np)/√(npq).
Figure 1-7: Gaussian approximation to the Binomial density function.

Theorem 1-4 (How DeMoivre-Laplace is stated most often): As above, let S_n denote the
number of successes in n independent trials. Define a centered and normalized version of
S_n as

$$\tilde{S}_n \equiv \frac{S_n - np}{\sqrt{npq}}.$$

Now, denote x_1 and x_2 as arbitrary real numbers. Then, we can say

$$\lim_{n \to \infty} P[x_1 \le \tilde{S}_n \le x_2] = \frac{1}{\sqrt{2\pi}} \int_{x_1}^{x_2} \exp(-x^2/2)\, dx = G(x_2) - G(x_1). \tag{1-39}$$
Proof: The proof is based on Stirling's approximation for n!, and it can be found in many
books. For example, see one of
[1] E. Parzen, Modern Probability Theory and Its Applications, John Wiley, 1960.
[2] Y.A. Rozanov, Probability Theory: A Concise Course, Dover, 1969.
[3] A. Papoulis, S. Pillai, Probability, Random Variables and Stochastic Processes, Fourth
Edition, McGraw Hill, 2002 (proof not in editions I through III).
Example 1-23: An order of 10^4 parts is received. The probability that a part is defective equals
1/10. What is the probability that the total number of defective parts does not exceed 1100?

$$P[\#\text{defective parts} \le 1100] = \sum_{k=0}^{1100} P[k \text{ defective parts}] = \sum_{k=0}^{1100} \binom{10^4}{k} (.1)^k (.9)^{10^4 - k}.$$

Since np is large,
$$P[\#\text{defective parts} \le 1100] \approx G\left(\frac{1100 - 1000}{\sqrt{900}}\right) - G\left(\frac{0 - 1000}{\sqrt{900}}\right) \approx G(10/3) \approx .99936,$$

since G(10/3) >> G(−100/3) ≈ 0.
In this example, we used the approximation

$$F(k) \equiv \sum_{i=0}^{k} \binom{n}{i} p^i q^{n-i} \approx G\left(\frac{k-np}{\sqrt{npq}}\right) - G\left(\frac{-np}{\sqrt{npq}}\right) \approx G\left(\frac{k-np}{\sqrt{npq}}\right), \tag{1-40}$$

which can be used when np >> 1, so that G(−np/√(npq)) ≈ 0. The sum of the terms from k =
900 to k = 1100 equals .99872. Note that the terms from k = 0 to 900 do not amount to much!
Example 1-24: Figures 1-8 and 1-9 illustrate the DeMoivre-Laplace theorem. The first of
these figures depicts, as a solid-line plot, G((x − np)/√(npq)) for n = 25, p = q = 1/2. As a
sequence of dots, values of the Binomial distribution function F(k) (given by the sum in (1-40))
for the case n = 25, p = q = 1/2 are displayed on Fig. 1-8. In a similar manner, Figure 1-9
illustrates the case n = 50, p = q = 1/2.

Figure 1-8: Gaussian and Binomial distribution functions for n = 25, p = q = 1/2.
Figure 1-9: Gaussian and Binomial distribution functions for n = 50, p = q = 1/2.
Law of Large Numbers (Weak Law)
Suppose we perform n independent trials of an experiment. The probability of event A
occurring on any trial is p. We should expect that the number k of occurrences of A is about np,
so that k/n is near p. In fact, the Law of Large Numbers (weak version) says that k/n is close to p
in the sense that, for any ε > 0, the probability that |k/n − p| ≤ ε tends to 1 as n → ∞. This
result is given by the following theorem.
Theorem 1-5 (Law of Large Numbers, Weak Version): For all ε > 0, we have

$$\lim_{n \to \infty} P\left[\, \left|\frac{k}{n} - p\right| \le \varepsilon \,\right] = 1. \tag{1-41}$$
That is, as n becomes larger, it becomes more probable to find k/n near p.
Proof: Note that

$$\left|\frac{k}{n} - p\right| \le \varepsilon \iff n(p - \varepsilon) \le k \le n(p + \varepsilon),$$

so that

$$P\left[\, \left|\frac{k}{n} - p\right| \le \varepsilon \,\right] = P[\, n(p-\varepsilon) \le k \le n(p+\varepsilon) \,] \approx G\left(\frac{n\varepsilon}{\sqrt{npq}}\right) - G\left(\frac{-n\varepsilon}{\sqrt{npq}}\right) = 2\, G\left(\frac{n\varepsilon}{\sqrt{npq}}\right) - 1.$$

But nε/√(npq) → ∞ and G(nε/√(npq)) → 1 as n → ∞. Therefore,

$$P\left[\, \left|\frac{k}{n} - p\right| \le \varepsilon \,\right] \approx 2\, G\left(\frac{n\varepsilon}{\sqrt{npq}}\right) - 1 \to 1$$

as n → ∞, and the Law of Large Numbers is proved.
Example 1-25: Let p = .6, and find large n such that the probability that k/n is between .59 and
.61 is at least 98/100. That is, choose n so that P[.59 ≤ k/n ≤ .61] ≥ .98. This requires

$$P[.59n \le k \le .61n] \approx G\left(\frac{.61n - .60n}{\sqrt{npq}}\right) - G\left(\frac{.59n - .60n}{\sqrt{npq}}\right) = 2\, G\left(\frac{.01n}{\sqrt{npq}}\right) - 1 \ge .98,$$

or

$$G\left(\frac{.01n}{\sqrt{npq}}\right) \ge 1.98/2 = .9900.$$

From a table of the Gaussian distribution function, we see that G(2.33) = .9901, a value that is
close enough for our problem. Hence, we equate

$$\frac{.01n}{\sqrt{npq}} = 2.33$$

and solve this for

$$n \approx (2.33)^2\, \frac{(.6)(.4)}{10^{-4}} \approx 13{,}029,$$

so we must choose n > 13,029.
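The last step of Example 1-25 can be reproduced numerically (an illustrative aside, not part of the notes); the inverse Gaussian distribution function supplies the 0.99 point that the notes read from a table as 2.33.

```python
from scipy.stats import norm

p, eps, target = 0.6, 0.01, 0.98
# Require 2*G(eps*sqrt(n/(p*q))) - 1 >= target, i.e., G(eps*sqrt(n/(p*q))) >= 0.99
z = norm.ppf((1 + target) / 2)                  # exact 0.99 point, about 2.326
n_exact   = (z    / eps) ** 2 * p * (1 - p)
n_rounded = (2.33 / eps) ** 2 * p * (1 - p)     # with the table value used in the notes
print(round(n_exact), round(n_rounded))         # roughly 12990 and 13029
```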
Generalization of Bernoulli Trials
Suppose [A_1, A_2, …, A_r] is a partition of the sample space. That is,

$$\bigcup_{i=1}^{r} A_i = S, \qquad A_i \cap A_j = \{\},\ i \ne j.$$

Furthermore, for 1 ≤ i ≤ r, let P(A_i) = p_i, where p_1 + p_2 + ... + p_r = 1. Now, perform n independent
trials of the experiment and denote by p_n(k_1, k_2, ..., k_r) the probability of the event

{A_1 occurs k_1 times, A_2 occurs k_2 times, …, A_r occurs k_r times},

where k_1 + k_2 + ... + k_r = n. Order is not important here. This is a generalization of previous
work, which had the two events A and Ā. Here, the claim is

$$p_n(k_1, k_2, \ldots, k_r) = \frac{n!}{k_1!\, k_2! \cdots k_r!}\, p_1^{k_1} p_2^{k_2} \cdots p_r^{k_r}. \tag{1-42}$$
Proof: First, consider a "counting problem". From n distinct objects, how many ways can you
form a first subset of size k_1, a second subset of size k_2, ..., an r-th subset of size k_r?

The number of ways of forming a first subset of size k_1 and a second of size n − k_1 is

$$\frac{n!}{k_1!\, (n-k_1)!}.$$

The number of ways of forming a subset of size k_1, a second of size k_2 and a third of size n − k_1 − k_2 is

$$\frac{n!}{k_1!\, (n-k_1)!} \cdot \frac{(n-k_1)!}{k_2!\, (n-k_1-k_2)!}.$$

The number of ways of forming a subset of size k_1, a second of size k_2, a third of size k_3 and a fourth
of size n − k_1 − k_2 − k_3 is

$$\frac{n!}{k_1!\, (n-k_1)!} \cdot \frac{(n-k_1)!}{k_2!\, (n-k_1-k_2)!} \cdot \frac{(n-k_1-k_2)!}{k_3!\, (n-k_1-k_2-k_3)!}.$$
The number of ways of forming a first subset of size k_1, a second subset of size k_2, ..., an r-th subset
of size k_r (where k_1 + k_2 + ... + k_r = n) is

$$\frac{n!}{k_1!\, (n-k_1)!} \cdot \frac{(n-k_1)!}{k_2!\, (n-k_1-k_2)!} \cdots \frac{(n-k_1-\cdots-k_{r-1})!}{k_r!\, (n-k_1-\cdots-k_r)!} = \frac{n!}{k_1!\, k_2! \cdots k_r!}.$$

Hence, the probability of A_1 occurring k_1 times, A_2 occurring k_2 times, ..., A_r occurring k_r times
(these occur in any order) is

$$p_n(k_1, k_2, \ldots, k_r) = \frac{n!}{k_1!\, k_2! \cdots k_r!}\, p_1^{k_1} p_2^{k_2} \cdots p_r^{k_r},$$

as claimed.
Example 1-26: A fair die is rolled 10 times. Determine the probability that f_1 shows 3 times
and "even" shows 6 times.

A_1 = {f_1 shows}
A_2 = {f_2 or f_4 or f_6 shows}
A_3 = {f_3 or f_5 shows}
A_1 ∪ A_2 ∪ A_3 = S = {f_1, f_2, f_3, f_4, f_5, f_6}
A_i ∩ A_j = {} for i ≠ j
P(A_1) = 1/6, P(A_2) = 1/2, P(A_3) = 1/3
n = 10, k_1 = # times A_1 occurs = 3, k_2 = # times A_2 occurs = 6 and k_3 = # times A_3 occurs = 1

$$P(f_1 \text{ shows 3 times, "even" shows 6 times, not } (f_1 \text{ or even}) \text{ shows 1 time}) = p_{10}(3, 6, 1) = \frac{10!}{3!\, 6!\, 1!} \left(\frac{1}{6}\right)^3 \left(\frac{1}{2}\right)^6 \left(\frac{1}{3}\right)^1 \approx .0203$$
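Equation (1-42) and Example 1-26 can be evaluated directly (an illustrative aside, not part of the notes; the helper function name is arbitrary).

```python
from math import factorial, prod

def multinomial_prob(counts, probs):
    """Eq. (1-42): probability of the specified occurrence counts in n = sum(counts) trials."""
    n = sum(counts)
    coeff = factorial(n)
    for k in counts:
        coeff //= factorial(k)
    return coeff * prod(p**k for p, k in zip(probs, counts))

# Example 1-26: f1 three times, "even" six times, the remaining event once, in 10 rolls
print(round(multinomial_prob([3, 6, 1], [1/6, 1/2, 1/3]), 4))   # 0.0203
```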
Poisson Theorem and Random Points
The probability that A occurs k times in n independent trials is

$$P[A \text{ occurs } k \text{ times in } n \text{ independent trials}] = \binom{n}{k} p^k q^{n-k}. \tag{1-43}$$

If n is large and npq >> 1, we can use the DeMoivre-Laplace theorem to approximate the
probability (1-43). However, the DeMoivre-Laplace theorem is no good if n is large and p is
small so that np is on the order of 1. However, for this case, we can use the Poisson
Approximation.
Theorem 1-6 (Poisson Theorem): As n → ∞ and p → 0, such that np → λ (a constant), we have

$$\binom{n}{k} p^k q^{n-k} \to e^{-\lambda}\, \frac{\lambda^k}{k!}. \tag{1-44}$$
Proof
$$\binom{n}{k} p^k q^{n-k} = \frac{n!}{k!\, (n-k)!}\, p^k (1-p)^{n-k} = \frac{n(n-1)(n-2)\cdots(n-k+1)}{k!}\, p^k \left(1 - \frac{np}{n}\right)^{n-k}$$

$$= \frac{(np)^k}{k!} \cdot \frac{n(n-1)(n-2)\cdots(n-k+1)}{n^k} \cdot \left(1 - \frac{np}{n}\right)^{n} \left(1 - \frac{np}{n}\right)^{-k} \qquad (k \text{ terms in the numerator product}).$$

Now, as n → ∞ and p → 0 with np → λ, we have

$$\left(1 - \frac{np}{n}\right)^{n} \to e^{-\lambda}, \qquad \left(1 - \frac{np}{n}\right)^{-k} \to 1, \qquad \frac{n(n-1)(n-2)\cdots(n-k+1)}{n^k} \to 1.$$

Putting it all together, we have

$$\binom{n}{k} p^k q^{n-k} \to e^{-\lambda}\, \frac{\lambda^k}{k!},$$

as claimed.
Example 1-27: Suppose that 3000 parts are received. The probability p that a part is defective is
10^{-3}. Consider part defects to be independent events. Find the probability that there will be
more than five defective parts. Let k denote the number of defective parts, and note that np = 3.
Then

$$P(k > 5) = 1 - P(k \le 5) = 1 - \sum_{k=0}^{5} \binom{3000}{k} (10^{-3})^k (1 - 10^{-3})^{3000-k}.$$

But

$$P(k \le 5) \approx \sum_{k=0}^{5} e^{-3}\, \frac{3^k}{k!} \approx .916,$$

so that P(k > 5) ≈ .084.


The function

$$P(k) = e^{-\lambda}\, \frac{\lambda^k}{k!}$$

is known as the Poisson Function with parameter λ.
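A numerical check of Example 1-27 (an illustrative aside, not part of the notes): the exact Bernoulli-trials answer and the Poisson approximation of Theorem 1-6 agree to about three decimal places.

```python
from math import comb, exp, factorial

n, p = 3000, 1e-3
lam = n * p      # 3

exact   = 1 - sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(6))
poisson = 1 - sum(exp(-lam) * lam**k / factorial(k) for k in range(6))
print(round(exact, 4), round(poisson, 4))    # both about 0.084
```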
Random Poisson Points
In a random manner, place n points in the interval (-T/2, T/2). Denote by P(k in t_a) the
probability that k of these points will lie in an interval (t_1, t_2] ⊂ (-T/2, T/2), where t_a = t_2 − t_1 (see
Fig. 1-10). Find P(k in t_a). First, note that the probability of placing a single point in (t_1, t_2] is

$$p = \frac{t_2 - t_1}{T} = \frac{t_a}{T}. \tag{1-45}$$

Now, place n points in (-T/2, T/2), and do it independently. The probability of finding k points
in a sub-interval of length t_a is

$$P(k \text{ in } t_a) = \binom{n}{k} p^k q^{n-k},$$

where p = t_a/T.

Figure 1-10: n random Poisson Points in (-T/2, T/2).
Now, assume that n → ∞ and T → ∞ such that n/T → λ_d, a constant. Then, np = n(t_a/T) →
λ_d t_a and

$$P(k \text{ in } t_a) = \binom{n}{k} p^k q^{n-k} \to e^{-\lambda_d t_a}\, \frac{(\lambda_d t_a)^k}{k!}. \tag{1-46}$$

The constant λ_d is the average point density (the average number of points in a unit-length
interval).

In the limiting case, as n → ∞ and T → ∞ such that n/T → λ_d, a constant, the points are
known as Random Poisson Points. They are used to describe many arrival-time problems,
including those that deal with electron emission in vacuum tubes and semiconductors (i.e., shot
noise), the frequency of telephone calls, and the arrival of vehicle traffic.
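The limiting construction can be mimicked numerically (an illustrative aside, not part of the notes; the density λ_d, interval length T and subinterval below are arbitrary choices): place n = λ_d T points uniformly in (-T/2, T/2), count how many fall in a fixed subinterval of length t_a, and compare the empirical distribution of that count with the Poisson probabilities of (1-46).

```python
import numpy as np
from math import exp, factorial

rng = np.random.default_rng(0)

lam_d, T = 2.0, 500.0          # average point density and total interval length
n = int(lam_d * T)             # so that n/T = lam_d
t1, t2 = 0.0, 3.0              # subinterval (t1, t2], so t_a = 3
t_a = t2 - t1

trials = 10_000
counts = np.empty(trials, dtype=int)
for i in range(trials):
    pts = rng.uniform(-T / 2, T / 2, size=n)          # n independent, uniformly placed points
    counts[i] = np.count_nonzero((pts > t1) & (pts <= t2))

for k in range(10):
    empirical = np.mean(counts == k)
    poisson = exp(-lam_d * t_a) * (lam_d * t_a) ** k / factorial(k)   # Eq. (1-46)
    print(k, round(empirical, 4), round(poisson, 4))
```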
Alternative Development of P(k in t_a) as Expressed by (1-46)
We arrive at (1-46) in another way that gives further insight into Poisson points. As
above, we consider the infinite time line -∞ < t < ∞, and we place an infinite number of points on
this line, where λ_d is the average point density (λ_d points per unit length, on the average).
To first order in Δt, the probability of finding exactly one point in (t, t + Δt] is λ_d Δt. That
is, this probability can be formulated as

P[exactly one point in (t, t + Δt]] = λ_d Δt + Higher-Order Terms, (1-47)

where Higher-Order Terms are terms involving (Δt)², (Δt)³, …. Also, we can express the
probability of finding no points in (t, t + Δt] as

P[no points in (t, t + Δt]] = (1 − λ_d Δt) + Higher-Order Terms. (1-48)
Consider the arbitrary interval (0, t], t > 0 (nothing is gained here by assuming the more
general case (t_0, t], t > t_0). Denote as p_k(t) the probability of finding exactly k points in (0, t];
we write

p_k(t) ≡ P[exactly k points in (0, t]]. (1-49)

Now, k points in (0, t + Δt] can happen in two mutually exclusive ways. You could have
k points in (0, t] and no point in (t, t + Δt], or you could have k−1 points in (0, t] and exactly one
point in (t, t + Δt]. Formulating this notion in terms of p_k, (1-47) and (1-48), we write

$$p_k(t + \Delta t) = p_k(t)\, (1 - \lambda_d\, \Delta t) + p_{k-1}(t)\, \lambda_d\, \Delta t + \text{Higher-Order Terms}, \tag{1-50}$$

where Higher-Order Terms are those involving second-and-higher-order powers of Δt.
Equation (1-50) can be used to write the first-order-in-Δt relationship

$$\frac{p_k(t + \Delta t) - p_k(t)}{\Delta t} = \lambda_d \left[\, p_{k-1}(t) - p_k(t) \,\right], \tag{1-51}$$

where terms of order Δt and higher are omitted from the right-hand side. In (1-51), take the limit
as Δt approaches zero to obtain

$$\frac{d}{dt}\, p_k(t) = \lambda_d \left[\, p_{k-1}(t) - p_k(t) \,\right], \tag{1-52}$$

an equation that can be solved for the desired p_k(t).
Starting with p_0, we can solve (1-52) recursively. With k = 0, Equation (1-52) becomes
$$\frac{d}{dt}\, p_0(t) = -\lambda_d\, p_0(t), \qquad \lim_{t \to 0^+} p_0(t) = 1 \tag{1-53}$$

(the probability is one of finding zero points in a zero-length interval), so that p_0(t) = exp[−λ_d t].
Setting k = 1 in (1-52) leads to

$$\frac{d}{dt}\, p_1(t) + \lambda_d\, p_1(t) = \lambda_d\, e^{-\lambda_d t}, \qquad \lim_{t \to 0^+} p_1(t) = 0 \tag{1-54}$$

(the probability is zero of finding one point in a zero-length interval), so that
p_1(t) = (λ_d t) exp[−λ_d t]. This process can be continued to obtain

$$p_k(t) = \frac{(\lambda_d t)^k}{k!}\, e^{-\lambda_d t}, \qquad k = 0, 1, 2, \ldots, \tag{1-55}$$

a formula that satisfies (1-52), as can be seen from direct substitution. Note the equivalence of
(1-55) and (1-46) (with an interval length of t instead of t_a). Poisson points arise naturally in
applications where a large number of points are distributed at random and independently of one
another (think of the large number of applications where Poisson point models can be applied!).
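The claim that (1-55) satisfies (1-52) "by direct substitution" can be verified symbolically for any particular k (an illustrative aside, not part of the notes).

```python
import sympy as sp

t, lam = sp.symbols('t lambda_d', positive=True)

def p(k):
    """Eq. (1-55) for a concrete integer k."""
    return (lam * t)**k / sp.factorial(k) * sp.exp(-lam * t)

# Check d/dt p_k = lambda_d * (p_{k-1} - p_k), Eq. (1-52), for several values of k
for k in range(1, 6):
    residual = sp.simplify(sp.diff(p(k), t) - lam * (p(k - 1) - p(k)))
    print(k, residual)    # prints 0 for each k
```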
Poisson Points in Non-Overlapping Intervals
Consider again a (-T/2, T/2) interval that contains n points. Consider two non-overlapping
subintervals of lengths t_a and t_b; see Figure 1-11, where points are denoted as black dots. We
want to find the probability P(k_a in t_a, k_b in t_b) that k_a points are in interval t_a and k_b points are
in interval t_b. Using the generalized Bernoulli trials formula developed previously, we claim that

$$P(k_a \text{ in } t_a,\ k_b \text{ in } t_b) = \frac{n!}{k_a!\, k_b!\, (n-k_a-k_b)!} \left(\frac{t_a}{T}\right)^{k_a} \left(\frac{t_b}{T}\right)^{k_b} \left(1 - \frac{t_a}{T} - \frac{t_b}{T}\right)^{n-k_a-k_b}. \tag{1-56}$$
Proof: This can be established by using the idea of a generalized Bernoulli trial. The events

A_1 = {point in t_a} with P(A_1) = t_a/T,
A_2 = {point in t_b} with P(A_2) = t_b/T,
A_3 = {point outside t_a and t_b} with P(A_3) = 1 − t_a/T − t_b/T

form a disjoint partition of (-T/2, T/2). The event {k_a in t_a and k_b in t_b} is equivalent to the event {A_1 occurs k_a times, A_2 occurs k_b times, A_3 occurs n − k_a − k_b times}. Hence, from the generalized Bernoulli theory,

P(k_a in t_a, k_b in t_b) = \frac{n!}{k_a!\,k_b!\,(n−k_a−k_b)!} \left(\frac{t_a}{T}\right)^{k_a} \left(\frac{t_b}{T}\right)^{k_b} \left(1 − \frac{t_a}{T} − \frac{t_b}{T}\right)^{n−k_a−k_b},   (1-57)

as claimed.
Note that the events {k_a in t_a} and {k_b in t_b} are not independent. This intuitive result follows from the fact that

Figure 1-11: Poisson points in non-overlapping intervals.

P(k_a in t_a, k_b in t_b) = \frac{n!}{k_a!\,k_b!\,(n−k_a−k_b)!} \left(\frac{t_a}{T}\right)^{k_a} \left(\frac{t_b}{T}\right)^{k_b} \left(1 − \frac{t_a}{T} − \frac{t_b}{T}\right)^{n−k_a−k_b}
  ≠ \frac{n!}{k_a!\,(n−k_a)!} \left(\frac{t_a}{T}\right)^{k_a} \left(1 − \frac{t_a}{T}\right)^{n−k_a} \cdot \frac{n!}{k_b!\,(n−k_b)!} \left(\frac{t_b}{T}\right)^{k_b} \left(1 − \frac{t_b}{T}\right)^{n−k_b}
  = P(k_a in t_a) P(k_b in t_b).   (1-58)

That is, the joint probability P(k_a in t_a, k_b in t_b) does not factor into P(k_a in t_a) P(k_b in t_b).
The fact that the events {k_a in t_a} and {k_b in t_b} are dependent for the finite case outlined above is intuitive. Since the number n of points is finite, the more points you put into the t_a interval, the fewer you have to put into the t_b interval.
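The dependence in (1-58) is easy to see numerically. The sketch below evaluates the joint probability (1-56) and the product of the two single-interval (binomial) probabilities for one assumed set of values of n, T, t_a, t_b, k_a, k_b; the two numbers differ, confirming that the events do not factor for finite n.

from math import comb

n, T = 20, 10.0
ta, tb = 2.0, 3.0
ka, kb = 3, 5
pa, pb = ta / T, tb / T

def single(k, p):
    # P(k points in one subinterval) for n points placed uniformly on (-T/2, T/2)
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Generalized Bernoulli (trinomial) joint probability, Eq. (1-56)
joint = comb(n, ka) * comb(n - ka, kb) * pa**ka * pb**kb * (1 - pa - pb)**(n - ka - kb)
product = single(ka, pa) * single(kb, pb)
print(joint, product)     # the values differ, so the counts are dependent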
Limiting Case
Now, suppose that n/T = λ_d and n → ∞, T → ∞. Note that n t_a/T = λ_d t_a and n t_b/T = λ_d t_b, so that

\frac{n!}{k_a!\,k_b!\,(n−k_a−k_b)!} \left(\frac{t_a}{T}\right)^{k_a} \left(\frac{t_b}{T}\right)^{k_b} \left(1 − \frac{t_a}{T} − \frac{t_b}{T}\right)^{n−k_a−k_b}
  = \frac{n(n−1)\cdots(n−k_a−k_b+1)}{n^{k_a+k_b}} \cdot \frac{(λ_d t_a)^{k_a} (λ_d t_b)^{k_b}}{k_a!\,k_b!} \left(1 − \frac{λ_d(t_a+t_b)}{n}\right)^{n−k_a−k_b}.

As n → ∞ and T → ∞ with n/T = λ_d, the ratio n(n−1)⋯(n−k_a−k_b+1)/n^{k_a+k_b} → 1 and the last factor approaches e^{−λ_d(t_a+t_b)}. Hence

P(k_a in t_a, k_b in t_b) → \frac{(λ_d t_a)^{k_a}}{k_a!} e^{−λ_d t_a} \cdot \frac{(λ_d t_b)^{k_b}}{k_b!} e^{−λ_d t_b} = P(k_a in t_a) P(k_b in t_b),

so, in the limit, the numbers of Poisson points in non-overlapping intervals are independent Poisson random variables. Poisson points model, for example, the random arrival of electrons at a vacuum tube anode or semiconductor junction (see Chapters 7 and 9 of these class notes for a discussion of shot noise).
Chapter 2 - Random Variables
In this and the chapters that follow, we denote the real line as R = (-∞ < x < ∞), and the extended real line is denoted as R⁺ ≡ R ∪ {±∞}. The extended real line is the real line with ±∞ thrown in.
Put simply (and incompletely), a random variable is a function that maps the sample space S into the extended real line.
Example 2-1: In the die experiment, we assign to the six outcomes f_i the numbers X(f_i) = 10i. Thus, we have X(f_1) = 10, X(f_2) = 20, X(f_3) = 30, X(f_4) = 40, X(f_5) = 50, X(f_6) = 60.
For an arbitrary value x_0, we must be able to answer questions like "what is the probability that random variable X is less than x_0?" Hence, the set {ζ ∈ S : X(ζ) ≤ x_0} must be an event (i.e., the set must belong to σ-algebra F) for every x_0 (sometimes, the algebraic, non-random variable x_0 is said to be a realization of the random variable X). This leads to the more formal definition.
Definition: Given a probability space (S, F, P), a random variable X(ζ) is a function

X : S → R⁺.   (2-1)

That is, random variable X is a function that maps sample space S into the extended real line R⁺. In addition, random variable X must satisfy the two criteria discussed below.
1) Recall the Borel σ-algebra B of subsets of R that was discussed in Chapter 1 (see Example 1-9). For each B ∈ B, we must have

X⁻¹(B) ≡ {ζ ∈ S : X(ζ) ∈ B} ∈ F,   B ∈ B.   (2-2)

A function that satisfies this criterion is said to be measurable. A random variable X must be a measurable function.
2) P[ζ ∈ S : X(ζ) = ±∞] = 0. Random variable X is allowed to take on the values of ±∞; however, it must take on the values of ±∞ with a probability of zero.
These two conditions hold for most elementary applications. Usually, they are treated as mere technicalities that impose no real limitations on real applications of random variables (usually, they are not given much thought).
However, good reasons exist for requiring that random variable X satisfy conditions 1) and 2) listed above. In our experiment, recall that sample space S describes the set of elementary outcomes. Now, it may happen that we cannot directly observe elementary outcomes ζ ∈ S. Instead, we may be forced to use a measuring instrument (i.e., random variable) that would provide us with measurements X(ζ), ζ ∈ S. Now, for each x ∈ R, we need to be able to compute the probability P[-∞ < X(ζ) ≤ x], because [-∞ < X(ζ) ≤ x] is a meaningful, observable event in the context of our experiment/measurements. For probability P[-∞ < X(ζ) ≤ x] to exist, we must have [-∞ < X(ζ) ≤ x] as an event; that is, we must have

{ζ ∈ S : X(ζ) ≤ x} ∈ F   (2-3)

for each x ∈ R. It is possible to show that Conditions (2-2) and (2-3) are equivalent. So, while (2-2) (or the equivalent (2-3)) may be a mere technicality, it is an important technicality.
Sigma-Algebra Generated by Random Variable X
Suppose that we are given a probability space (S, F, P) and a random variable X as described above. Random variable X induces a σ-algebra σ(X) on S. σ(X) consists of all sets of the form {ζ ∈ S : X(ζ) ∈ B, B ∈ B}, where B denotes the σ-algebra of Borel subsets of R that was discussed in Chapter 1 (see Example 1-9). Note that σ(X) ⊂ F; we say that σ(X) is the sub σ-algebra of F that is generated by random variable X.
Probability Space Induced by Random Variable X
Suppose that we are given a probability space (S, F, P) and a random variable X as described above. Let B be the Borel σ-algebra introduced by Example 1-9. By (2-2), for each B ∈ B, we have {ζ ∈ S : X(ζ) ∈ B} ∈ F, so P[{ζ ∈ S : X(ζ) ∈ B}] is well defined.
This allows us to use random variable X to define (R⁺, B, P_X), a probability space induced by X. Probability measure P_X is defined as follows: for each B ∈ B, we define P_X(B) ≡ P[{ζ ∈ S : X(ζ) ∈ B}]. We say that P induces probability measure P_X.
Distribution and Density Functions
The distribution function of the random variable X(ζ) is the function

F(x) ≡ P[X(ζ) ≤ x] = P[{ζ ∈ S : X(ζ) ≤ x}],   (2-4)

where -∞ < x < ∞.
Example 2-2: Consider the coin tossing experiment with P[heads] = p and P[tails] = q ≡ 1 − p. Define the random variable

X(head) = 1,   X(tail) = 0.

If x ≥ 1, then both X(head) = 1 ≤ x and X(tail) = 0 ≤ x, so that

F(x) = 1 for x ≥ 1.

If 0 ≤ x < 1, then X(head) = 1 > x and X(tail) = 0 ≤ x, so that

F(x) = P[X ≤ x] = q for 0 ≤ x < 1.

Finally, if x < 0, then both X(head) = 1 > x and X(tail) = 0 > x, so that

F(x) = P[X ≤ x] = 0 for x < 0.

See Figure 2-1 for a graph of F(x).
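The staircase distribution of Example 2-2 is easy to express directly in code. The short sketch below evaluates F(x) for the coin-toss random variable; the value of p is assumed only for illustration.

def F(x, p=0.5):
    """Distribution function of Example 2-2: X(head) = 1, X(tail) = 0, P[heads] = p."""
    q = 1.0 - p
    if x < 0:
        return 0.0      # neither outcome satisfies X <= x
    elif x < 1:
        return q        # only X(tail) = 0 satisfies X <= x
    else:
        return 1.0      # both outcomes satisfy X <= x

print(F(-0.5), F(0.3), F(2.0))   # 0.0, q, 1.0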
Properties of Distribution Functions
First, some standard notation:

F(x⁺) ≡ lim_{r→x⁺} F(r)   and   F(x⁻) ≡ lim_{r→x⁻} F(r).

Some properties of distribution functions are listed below.

Claim #1: F(+∞) = 1 and F(-∞) = 0.   (2-5)

Proof: F(+∞) = lim_{x→∞} P[X ≤ x] = P[S] = 1, and F(-∞) = lim_{x→-∞} P[X ≤ x] = P[{∅}] = 0.

Claim #2: The distribution function is a non-decreasing function of x. If x_1 < x_2, then F(x_1) ≤ F(x_2).

Proof: x_1 < x_2 implies that {X(ζ) ≤ x_1} ⊂ {X(ζ) ≤ x_2}. But this means that P[{X(ζ) ≤ x_1}] ≤ P[{X(ζ) ≤ x_2}] and F(x_1) ≤ F(x_2).

Claim #3: P[X > x] = 1 − F(x).   (2-6)

Proof: {X(ζ) ≤ x_1} and {X(ζ) > x_1} are mutually exclusive. Also, {X(ζ) ≤ x_1} ∪ {X(ζ) > x_1} = S. Hence, P[{X ≤ x_1}] + P[{X > x_1}] = P[S] = 1.

Figure 2-1: Distribution function for the X(head) = 1, X(tail) = 0 random variable.

Claim #4: Function F(x) may have jump discontinuities. It can be shown that a jump is the only type of discontinuity that is possible for distribution F(x) (and F(x) may have, at most, a countable number of jumps). F(x) must be right continuous; that is, we must have F(x⁺) = F(x). At a jump, take F to be the larger value; see Figure 2-2.

Claim #5: P[x_1 < X ≤ x_2] = F(x_2) − F(x_1).   (2-7)

Proof: {X(ζ) ≤ x_1} and {x_1 < X(ζ) ≤ x_2} are mutually exclusive. Also, {X(ζ) ≤ x_2} = {X(ζ) ≤ x_1} ∪ {x_1 < X(ζ) ≤ x_2}. Hence, P[X(ζ) ≤ x_2] = P[X(ζ) ≤ x_1] + P[x_1 < X(ζ) ≤ x_2], and P[x_1 < X(ζ) ≤ x_2] = F(x_2) − F(x_1).

Claim #6: P[X = x] = F(x) − F(x⁻).   (2-8)

Proof: P[x − ε < X ≤ x] = F(x) − F(x − ε). Now, take the limit as ε → 0⁺ to obtain the desired result.

Claim #7: P[x_1 ≤ X ≤ x_2] = F(x_2) − F(x_1⁻).   (2-9)

Proof: {x_1 ≤ X ≤ x_2} = {x_1 < X ≤ x_2} ∪ {X = x_1}, so that

P[x_1 ≤ X ≤ x_2] = (F(x_2) − F(x_1)) + (F(x_1) − F(x_1⁻)) = F(x_2) − F(x_1⁻).

Figure 2-2: Distributions are right continuous; at a jump point x_0, F takes the larger value F(x_0).
Continuous Random Variables
Random variable X is of continuous type if F_X(x) is continuous. In this case, P[X = x] = 0; the probability is zero that X takes on a given value x.
Discrete Random Variables
Random variable X is of discrete type if F_X(x) is piece-wise constant. The distribution should look like a staircase. Denote by x_i the points where F_X(x) is discontinuous. Then F_X(x_i) − F_X(x_i⁻) = P[X = x_i] = p_i. See Figure 2-3.

Figure 2-3: Distribution function for a discrete random variable.

Mixed Random Variables
Random variable X is said to be of mixed type if F_X(x) is discontinuous but not a staircase.
Density Function
The derivative

f_X(x) ≡ dF_X(x)/dx   (2-10)

is called the density function of the random variable X. Suppose F_X has a jump discontinuity at a point x_0. Then f_X(x) contains the term

[F_X(x_0) − F_X(x_0⁻)] δ(x − x_0).   (2-11)

See Figure 2-4.
Suppose that X is of a discrete type taking values x_i, i ∈ I. The density can be written as

f_X(x) = Σ_{i∈I} P[X = x_i] δ(x − x_i),

where I is an index set. Figures 2-5 and 2-6 illustrate the distribution and density, respectively, of a discrete random variable.
of a discrete random variable.
Properties of f_X
The monotonicity of F_X implies that f_X(x) ≥ 0 for all x. Furthermore, we have

f_X(x) = dF_X(x)/dx   ⇔   F_X(x) = ∫_{-∞}^{x} f_X(ξ) dξ.   (2-12)

Figure 2-4: A jump in the distribution function causes a delta function in the density function. The distribution jumps by the value k at x = x_0, and the density contains the term k δ(x − x_0).

P[x_1 < X ≤ x_2] = F_X(x_2) − F_X(x_1) = ∫_{x_1}^{x_2} f_X(ξ) dξ.   (2-13)

If X is of continuous type, then F_X(x) = F_X(x⁻), and

P[x_1 ≤ X ≤ x_2] = F_X(x_2) − F_X(x_1) = ∫_{x_1}^{x_2} f_X(ξ) dξ.

For continuous random variables, P[x < X ≤ x + Δx] ≈ f_X(x) Δx for small Δx.
Figure 2-5: Distribution function for a discrete random variable.   Figure 2-6: Density function for a discrete random variable.

Normal/Gaussian Random Variables
Let η, -∞ < η < ∞, and σ, σ > 0, be constants. Then

f(x) = (1/σ) g((x − η)/σ) = \frac{1}{σ\sqrt{2π}} \exp\left[-\frac{1}{2}\left(\frac{x − η}{σ}\right)^2\right]   (2-14)

is a Gaussian density function with parameters η and σ. These parameters have special meanings that will be discussed below. The notation N(η;σ) is used to indicate a Gaussian random variable with parameters η and σ. Figure 2-7 illustrates Gaussian density and distribution functions.

Figure 2-7: Density f(x) = (1/√(2π)) exp[-x²/2] and distribution F(x) = (1/√(2π)) ∫_{-∞}^{x} exp[-ξ²/2] dξ for a Gaussian random variable with η = 0 and σ = 1.

Random variable X is said to be Gaussian if its distribution function is given by

F(x) = P[X ≤ x] = \frac{1}{σ\sqrt{2π}} \int_{-∞}^{x} \exp\left[-\frac{(ξ − η)^2}{2σ^2}\right] dξ   (2-15)

for given η, -∞ < η < ∞, and σ, σ > 0. Numerical values for F(x) can be determined with the aid of a table. To accomplish this, make the change of variable y = (ξ − η)/σ in (2-15) and obtain

F(x) = \frac{1}{\sqrt{2π}} \int_{-∞}^{(x−η)/σ} \exp\left[-\frac{y^2}{2}\right] dy = G\left(\frac{x − η}{σ}\right).   (2-16)

Function G(x) is tabulated in many reference books, and it is built in to many popular computer math packages (i.e., Matlab, Mathcad, etc.).
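In Python, G(x) of (2-16) can be computed from the error function; the sketch below is one way to do it (the numerical values printed are only illustrative).

from math import erf, sqrt

def G(x):
    """Zero-mean, unit-variance Gaussian distribution function, as in (2-16)."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def F(x, eta=0.0, sigma=1.0):
    """Gaussian distribution function (2-15): F(x) = G((x - eta)/sigma)."""
    return G((x - eta) / sigma)

print(G(0.0), G(3.0))                  # 0.5 and about 0.99865
print(F(70.0, eta=60.0, sigma=10.0))   # one standard deviation above the mean, ~0.8413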
Uniform
Random variable X is uniform between x_1 and x_2 if its density is constant on the interval [x_1, x_2] and zero elsewhere; that is, f_X(x) = 1/(x_2 − x_1) for x_1 ≤ x ≤ x_2, and f_X(x) = 0 otherwise. Figure 2-8 illustrates the distribution and density of a uniform random variable.

Figure 2-8: Density and distribution functions for a uniform random variable.
Binomial
Random variable X has a binomial distribution of order n with parameter p if it takes the integer values 0, 1, ..., n with probabilities

P[X = k] = \binom{n}{k} p^k q^{n−k},   0 ≤ k ≤ n.   (2-17)

Both n and p are known parameters, where p + q = 1, and

\binom{n}{k} ≡ \frac{n!}{k!(n−k)!}.   (2-18)

We say that binomial random variable X is B(n,p).
The binomial density function is

f_X(x) = Σ_{k=0}^{n} \binom{n}{k} p^k q^{n−k} δ(x − k),   (2-19)

and the binomial distribution is

F_X(x) = Σ_{k=0}^{m_x} \binom{n}{k} p^k q^{n−k},   m_x ≤ x < m_x + 1 ≤ n
       = 1,   x ≥ n.   (2-20)

Note that m_x depends on x.
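A direct coding of (2-17) and (2-20) is shown below as a sketch; m_x is just the integer part of x, and the example values of n, p, and x are assumed for illustration.

from math import comb, floor

def binom_pmf(k, n, p):
    """P[X = k] for B(n,p), Eq. (2-17)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_cdf(x, n, p):
    """F_X(x) of (2-20): sum the point masses at 0, 1, ..., m_x with m_x = floor(x)."""
    if x < 0:
        return 0.0
    m_x = min(int(floor(x)), n)
    return sum(binom_pmf(k, n, p) for k in range(m_x + 1))

print(binom_pmf(3, 10, 0.5), binom_cdf(3.7, 10, 0.5))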

Poisson
Random variable X is Poisson with parameter a > 0 if it takes on integer values 0, 1, ... with

P[X = k] = \frac{a^k}{k!} e^{-a},   k = 0, 1, 2, 3, ...   (2-21)

The density and distribution of a Poisson random variable are given by

f_X(x) = Σ_{k=0}^{∞} \frac{a^k}{k!} e^{-a} δ(x − k)   (2-22)

F_X(x) = e^{-a} Σ_{k=0}^{m_x} \frac{a^k}{k!},   m_x ≤ x < m_x + 1,   m_x = 0, 1, ... .   (2-23)

Rayleigh
The random variable X is Rayleigh distributed with real-valued parameter σ, σ > 0, if it is described by the density

f_X(x) = \frac{x}{σ^2} \exp\left(-\frac{x^2}{2σ^2}\right),   x ≥ 0
       = 0,   x < 0.   (2-24)

See Figure 2-9 for a depiction of a Rayleigh density function.
The distribution function for a Rayleigh random variable is

F_X(x) = \int_0^{x} \frac{u}{σ^2} e^{-u^2/2σ^2} du,   x ≥ 0
       = 0,   x < 0.   (2-25)

To evaluate (2-25), use the change of variable y = u²/2σ², dy = (u/σ²) du, to obtain

F_X(x) = \int_0^{x^2/2σ^2} e^{-y} dy = 1 − e^{-x^2/2σ^2},   x ≥ 0
       = 0,   x < 0,   (2-26)

as the distribution function for a Rayleigh random variable.
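The change-of-variable result (2-26) is easy to sanity-check numerically: integrate the Rayleigh density (2-24) with a simple trapezoidal rule and compare with the closed form. The values of σ and x below are assumed for illustration.

import numpy as np

sigma, x = 2.0, 3.0
u = np.linspace(0.0, x, 200001)
integrand = (u / sigma**2) * np.exp(-u**2 / (2.0 * sigma**2))   # density (2-24)
numeric = np.trapz(integrand, u)                                # left side of (2-25)
closed = 1.0 - np.exp(-x**2 / (2.0 * sigma**2))                 # right side of (2-26)
print(numeric, closed)                                          # the two values agree closely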
Exponential
The random variable X is exponentially distributed with real-valued parameter λ, λ > 0, if it is described by the density

f_X(x) = λ exp[−λx],   x ≥ 0
       = 0,   x < 0.   (2-27)

See Figure 2-10 for a depiction of an exponential density function.

Figure 2-9: Rayleigh density function f_X(x) = (x/σ²) exp(−x²/2σ²) U(x).   Figure 2-10: Exponential density function f_X(x) = λ exp(−λx) U(x).

The distribution function for an exponential random variable is

F_X(x) = \int_0^{x} λ e^{-λy} dy = 1 − e^{-λx},   x ≥ 0
       = 0,   x < 0.   (2-28)

Conditional Distribution
Let M denote an event for which P[M] ≠ 0. Assuming the occurrence of event M, the conditional distribution F(x|M) of random variable X is defined as

F(x|M) ≡ P[X ≤ x | M] = \frac{P[X ≤ x, M]}{P[M]}.   (2-29)

Note that F(∞|M) = 1 and F(-∞|M) = 0. Furthermore, the conditional distribution has all of the properties of an "ordinary" distribution function. For example,

P[x_1 < X ≤ x_2 | M] = \frac{P[x_1 < X ≤ x_2, M]}{P[M]} = F(x_2|M) − F(x_1|M).   (2-30)

Conditional Density
The conditional density f(x|M) is defined as

f(x|M) ≡ \frac{d}{dx} F(x|M) = \lim_{Δx→0} \frac{P[x < X ≤ x + Δx | M]}{Δx}.   (2-31)

The conditional density has all of the properties of an "ordinary" density function.
Example 2-3: Determine the conditional distribution F(x|M) of random variable X(f_i) = 10i of the fair-die experiment, where M = {f_2, f_4, f_6} is the event "even" has occurred. First, note that X must take on values in the set {10, 20, 30, 40, 50, 60}. Hence, if x ≥ 60, then [X ≤ x] is the certain event and [X ≤ x, M] = M. Because of this,

F(x|M) = \frac{P[X ≤ x, M]}{P[M]} = \frac{P[M]}{P[M]} = 1,   x ≥ 60.

If 40 ≤ x < 60, then [X ≤ x, M] = [f_2, f_4], and

F(x|M) = \frac{P[X ≤ x, M]}{P[M]} = \frac{P[f_2, f_4]}{P[f_2, f_4, f_6]} = \frac{2/6}{3/6},   40 ≤ x < 60.

If 20 ≤ x < 40, then [X ≤ x, M] = [f_2], and

F(x|M) = \frac{P[X ≤ x, M]}{P[M]} = \frac{P[f_2]}{P[f_2, f_4, f_6]} = \frac{1/6}{3/6},   20 ≤ x < 40.

Finally, if x < 20, then [X ≤ x, M] = [∅], and

F(x|M) = \frac{P[X ≤ x, M]}{P[M]} = \frac{P[{∅}]}{P[f_2, f_4, f_6]} = \frac{0}{3/6},   x < 20.

Conditional Distribution When Event M is Defined in Terms of X
If M is an event that can be expressed in terms of the random variable X, then F(x|M) can be determined from the "ordinary" distribution F(x). Below, we give several examples.
As a first example, consider M = [X ≤ a], and find both F(x|M) = F(x|X ≤ a) = P[X ≤ x | X ≤ a] and f(x|M). Note that

F(x|X ≤ a) = P[X ≤ x | X ≤ a] = \frac{P[X ≤ x, X ≤ a]}{P[X ≤ a]}.

Hence, if x ≥ a, we have [X ≤ x, X ≤ a] = [X ≤ a] and

F(x|X ≤ a) = \frac{P[X ≤ a]}{P[X ≤ a]} = 1,   x ≥ a.

If x < a, then [X ≤ x, X ≤ a] = [X ≤ x] and

F(x|X ≤ a) = \frac{P[X ≤ x]}{P[X ≤ a]} = \frac{F_X(x)}{F_X(a)},   x < a.

At x = a, F(x|X ≤ a) would jump 1 − F_X(a⁻)/F_X(a) if F_X(a⁻) ≠ F_X(a). The conditional density is

f(x|X ≤ a) = \frac{d}{dx} F(x|X ≤ a) = \frac{f_X(x)}{F_X(a)},   x < a
           = \left[1 − \frac{F_X(a⁻)}{F_X(a)}\right] δ(x − a),   x = a
           = 0,   x > a.

As a second example, consider M = [b < X ≤ a], so that

F(x|b < X ≤ a) = \frac{P[X ≤ x, b < X ≤ a]}{P[b < X ≤ a]}.

Since

[X ≤ x, b < X ≤ a] = [b < X ≤ a],   a ≤ x
                   = [b < X ≤ x],   b < x < a
                   = {∅},   x ≤ b,

we have

F(x|b < X ≤ a) = 1,   a ≤ x
               = \frac{F_X(x) − F_X(b)}{F_X(a) − F_X(b)},   b < x < a
               = 0,   x ≤ b.

F(x|b < X ≤ a) is continuous at x = b. At x = a, F(x|b < X ≤ a) jumps 1 − {F_X(a⁻) − F_X(b)}/{F_X(a) − F_X(b)}, a value that is zero if F_X(a⁻) = F_X(a). The corresponding conditional density is

f(x|b < X ≤ a) = \frac{d}{dx} F(x|b < X ≤ a)
               = \frac{f_X(x)}{F_X(a) − F_X(b)},   b < x < a
               = \left[1 − \frac{F_X(a⁻) − F_X(b)}{F_X(a) − F_X(b)}\right] δ(x − a),   x = a
               = 0,   otherwise.
Example 2-4: Find f(x | |X − η| ≤ kσ), where X is N(η,σ). First, note that

|X − η| ≤ kσ   ⇔   η − kσ ≤ X ≤ η + kσ,

so that

P[|X − η| ≤ kσ] = P[η − kσ ≤ X ≤ η + kσ] = \frac{1}{σ\sqrt{2π}} \int_{η−kσ}^{η+kσ} \exp\left[-\frac{1}{2}\left(\frac{x − η}{σ}\right)^2\right] dx.

By a change of variable, this last result becomes

P[|X − η| ≤ kσ] = \frac{1}{\sqrt{2π}} \int_{-k}^{k} \exp\left[-\frac{u^2}{2}\right] du = G(k) − G(−k) = 2G(k) − 1.

Hence, by the previous example, we have

f(x | |X − η| ≤ kσ) = \frac{\frac{1}{σ\sqrt{2π}} \exp\left[-\frac{1}{2}\left(\frac{x − η}{σ}\right)^2\right]}{2G(k) − 1},   η − kσ ≤ x ≤ η + kσ
                    = 0,   otherwise.
Total Probability - Continuous Form
Let X be any random variable and define B = [X ≤ x]. Let A_1, A_2, ..., A_n be a partition of sample space S. That is,

⋃_{i=1}^{n} A_i = S   and   A_i ∩ A_j = {∅} for i ≠ j.   (2-32)

From the discrete form of the Theorem of Total Probability discussed in Chapter 1, we have P[B] = P[B|A_1]P[A_1] + P[B|A_2]P[A_2] + ... + P[B|A_n]P[A_n]. Now, with B = [X ≤ x], this becomes

P[X ≤ x] = P[X ≤ x|A_1]P[A_1] + P[X ≤ x|A_2]P[A_2] + ... + P[X ≤ x|A_n]P[A_n].   (2-33)

Hence,

F_X(x) = F(x|A_1)P[A_1] + F(x|A_2)P[A_2] + ... + F(x|A_n)P[A_n]
f_X(x) = f(x|A_1)P[A_1] + f(x|A_2)P[A_2] + ... + f(x|A_n)P[A_n].   (2-34)

Several useful formulas can be derived from this result. For example, let A be any event, and let X be any random variable. Then we can write a version of Bayes rule as

P[A | X ≤ x] = \frac{P[X ≤ x | A] P[A]}{P[X ≤ x]} = \frac{F(x|A) P[A]}{F(x)}
             = \frac{F(x|A) P[A]}{F(x|A_1)P[A_1] + F(x|A_2)P[A_2] + ... + F(x|A_n)P[A_n]}.   (2-35)

As a second example, we derive a formula for P[A | x_1 < X ≤ x_2]. Now, the conditional distribution F(x|A) has the same properties as an "ordinary" distribution. That is, we can write P[x_1 < X ≤ x_2 | A] = F(x_2|A) − F(x_1|A), so that

P[A | x_1 < X ≤ x_2] = \frac{P[x_1 < X ≤ x_2 | A] P[A]}{P[x_1 < X ≤ x_2]} = \frac{F(x_2|A) − F(x_1|A)}{F_X(x_2) − F_X(x_1)} P[A].   (2-36)

In general, we cannot write

P[A | X = x] = \frac{P[A, X = x]}{P[X = x]},   (2-37)

since this may result in an indeterminate 0/0 form. Instead, we must write

P[A | X = x] = \lim_{Δx→0⁺} P[A | x < X ≤ x + Δx] = \lim_{Δx→0⁺} \frac{F(x + Δx|A) − F(x|A)}{F_X(x + Δx) − F_X(x)} P[A]
             = \lim_{Δx→0⁺} \frac{[F(x + Δx|A) − F(x|A)]/Δx}{[F_X(x + Δx) − F_X(x)]/Δx} P[A],   (2-38)

which yields

P[A | X = x] = \frac{f(x|A)}{f_X(x)} P[A].   (2-39)
Now, multiply both sides of this last result by f_X and integrate to obtain

\int_{-∞}^{∞} P[A | X = x] f_X(x) dx = P[A] \int_{-∞}^{∞} f(x|A) dx.   (2-40)

But the area under f(x|A) is unity. Hence, we obtain

P[A] = \int_{-∞}^{∞} P[A | X = x] f_X(x) dx,   (2-41)

the continuous version of the Total Probability Theorem. In Chapter 1, we gave a finite-dimensional version of this theorem. Compare (2-41) with the result of Theorem 1-1; conceptually, they are similar. P[A | X = x] is the probability of A given that X = x. Equation (2-41) tells us to average this conditional probability over all possible values of X to find P[A].
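A small numerical illustration of (2-41), with all values assumed purely for the example: let X be uniform on [0, 1] and suppose the conditional probability of some event A is P[A | X = x] = x². Averaging over the density of X gives P[A] ≈ 1/3.

import numpy as np

x = np.linspace(0.0, 1.0, 100001)
f_x = np.ones_like(x)                  # uniform density on [0, 1]
p_A_given_x = x**2                     # assumed conditional probability P[A | X = x]
P_A = np.trapz(p_A_given_x * f_x, x)   # Eq. (2-41), evaluated by the trapezoidal rule
print(P_A)                             # approximately 1/3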
Example 2-5
Consider tossing a coin. The sample space is S = {h, t}, a "heads" and a "tails". However, assume that we do not know P[h] for the coin. So, we model P[h] as a random variable p̃. Our goal is to estimate P[h] by repeatedly tossing the coin.
As a random variable, p̃ must map some sample space, call it S_c, into [0, 1], the range of possible values for the (probability of heads) of any coin. To be more definitive, let's take S_c as the set of all coins in a given large pot, and note that p̃ : S_c → [0, 1]. As assigned by p̃, each coin in S_c has a (probability of heads) in [0, 1].
Assume that we can guess (or we know) a density function f_p̃(p) that describes random variable p̃. Given numbers p_1 and p_2, the probability of selecting a coin with p̃ (i.e., the probability of heads) between p_1 and p_2 is

P[p_1 < p̃ ≤ p_2] = \int_{p_1}^{p_2} f_p̃(p) dp.   (2-42)

On the right-hand side of (2-46), the integral is called the ensemble average, or expected value, of the random variable p̃.
Bayes Theorem - Continuous Form
From (2-39) we get

f(x|A) = \frac{P[A | X = x]}{P[A]} f_X(x).   (2-47)

Now, use (2-41) and (2-47) to write

f(x|A) = \frac{P[A | X = x] f_X(x)}{\int_{-∞}^{∞} P[A | X = ξ] f_X(ξ) dξ},   (2-48)

a result known as the continuous form of Bayes Theorem.
Often, f_X(x) is called the a-priori density for random variable X. And f(x|A) is called the a-posteriori density conditioned on the observed event A. In an application, we might "cook up" a density f_X(x) that (crudely) describes a random quantity (i.e., variable) X of interest. To improve our characterization of X, we note the occurrence of a related event A and compute f(x|A) to better characterize X.
The value of x that maximizes f(x|A) is called the maximum a-posteriori (MAP) estimate of X. MAP estimation is used in statistical signal processing and many other problems where one must estimate a quantity from observations of related random quantities.
Example 2-6: (MAP estimate of probability of heads in previously discussed coin experiment)
Find the MAP estimate of the probability of heads in the previously discussed coin selection and tossing experiment. First, recall that the probability of heads is modeled as a random variable p̃ with density f_p̃(p). We call f_p̃(p) the a-priori ("before the coin toss experiment") density of p̃ (which we may have to guess). Suppose we toss the coin n times and get k heads. We want to use these experimental results with our a-priori density f_p̃(p) to compute the conditional density

f(p | k heads, in a specific order, in n tosses of a selected coin).   (2-49)

This density is called the a-posteriori density ("after the coin toss experiment") of random variable p̃ given the experimentally observed event

A = [k heads, in a specific order, in n tosses of a selected coin].   (2-50)

The a-posteriori density f(p|A) may give us a good idea (better than the a-priori density f_p̃(p)) of the probability of heads for the randomly selected coin that was tossed. Conceptually, think of f(p|A) as a density that results from using experimental data/observation A to update f_p̃(p). In fact, given that A occurred, the a-posteriori probability that p̃ is between p_1 and p_2 is

\int_{p_1}^{p_2} f(p|A) dp.   (2-51)

Finally, the value of p that maximizes f(p|A) is the maximum a-posteriori estimate (MAP estimate) of p for the selected coin.
To find this a-posteriori density, recall that p̃ is defined on the sample space S_c. The experiment of tossing the randomly selected coin n times is defined on the sample space (i.e., product space) S_c × S^n, where S = [h, t]. The elements of S_c × S^n have the form

(ζ, t h t ... h),   where ζ ∈ S_c and t h t ... h is a string of n toss outcomes.

Now, given that p̃ = p, the conditional probability of the observed event A is

P[k heads, in a specific order, in n tosses of a specific coin | p̃ = p] = p^k (1 − p)^{n−k}.   (2-52)

Substitute this into the continuous form of Bayes rule (2-48) to obtain

f(p|A) = f(p | k heads, in a specific order, in n tosses of a selected coin)
       = \frac{p^k (1 − p)^{n−k} f_p̃(p)}{\int_0^1 β^k (1 − β)^{n−k} f_p̃(β) dβ},   (2-53)

a result known as the a-posteriori density of p̃ given A. In (2-53), the quantity β is a dummy variable of integration.
Suppose that the a-priori density f_p̃(p) is smooth and slowly changing around p = k/n (indicating a lot of uncertainty in the value of p). Then, for large n, the numerator p^k(1 − p)^{n−k} f_p̃(p), and hence the a-posteriori density f(p|A), has a sharp peak at p = k/n, indicating little uncertainty in the value of p. When f(p|A) is peaked at p = k/n, the MAP estimate of the probability of heads (for the selected coin) is the value p = k/n.
Example 2-7: For σ > 0, we use the a-priori density

f_p̃(p) = \frac{\frac{1}{σ\sqrt{2π}} \exp\left[-\frac{(p − 1/2)^2}{2σ^2}\right]}{1 − 2G\left(-\frac{1}{2σ}\right)},   0 ≤ p ≤ 1,   (2-54)

where G(x) is a zero-mean, unit-variance Gaussian distribution function (verify that there is unit area under f_p̃(p)). Also, use numerical integration to evaluate the denominator of (2-53). For σ = .3, n = 10 and k = 3, the a-posteriori density f(p|A) was computed, and the results are plotted on Figure 2-11. For σ = .3, n = 50 and k = 15, the calculation was performed a second time, and the results also appear on Figure 2-11. For both, the MAP estimate of p is near .3, since this is where the plots of f(p|A) peak.
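The kind of calculation described in Example 2-7 can be sketched in a few lines of Python. The prior below follows the truncated-Gaussian form reconstructed in (2-54) (so its normalization is an assumption), and the denominator of (2-53) is handled by trapezoidal integration; with n = 10 and k = 3, the MAP estimate lands near k/n = 0.3.

import numpy as np
from math import erf, sqrt

def G(z):                                   # zero-mean, unit-variance Gaussian CDF
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

sigma, n, k = 0.3, 10, 3                    # values used in Example 2-7
p = np.linspace(0.0, 1.0, 2001)

# A-priori density of (2-54): Gaussian centered at 1/2, truncated to [0, 1]
prior = np.exp(-(p - 0.5)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
prior /= (1.0 - 2.0 * G(-1.0 / (2.0 * sigma)))

# A-posteriori density of (2-53); trapezoidal rule stands in for the beta integral
numer = p**k * (1.0 - p)**(n - k) * prior
posterior = numer / np.trapz(numer, p)

p_map = p[np.argmax(posterior)]
print(p_map)                                # MAP estimate, near 0.3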
Expectation
Let X denote a random variable with density f_X(x). The expected value of X is defined as

E[X] = \int_{-∞}^{∞} x f_X(x) dx.   (2-55)

Also, E[X] is known as the mean, or average value, of X.
If f_X is symmetrical about some value η, then η is the expected value of X. For example, let X be N(η,σ) so that

f(x) = \frac{1}{σ\sqrt{2π}} \exp\left[-\frac{(x − η)^2}{2σ^2}\right].   (2-56)

Now, (2-56) is symmetrical about η, so E[X] = η.

Figure 2-11: Plot of the a-priori density f_p̃(p) and the a-posteriori densities f(p|A) for n = 10, k = 3 and for n = 50, k = 15.

Suppose that random variable X is discrete and takes on the value x_i with P[X = x_i] = p_i. Then the expected value of X, as defined by (2-55), reduces to

E[X] = Σ_i x_i p_i.   (2-57)

Example 2-8: Consider a B(n,p) random variable X (that is, X is binomial with parameters n and p). Using (2-57), we can write

E[X] = Σ_{k=0}^{n} k P[X = k] = Σ_{k=0}^{n} k \binom{n}{k} p^k q^{n−k}.   (2-58)

To evaluate this result, first consider the well-known, and extremely useful, binomial expansion

Σ_{k=0}^{n} \binom{n}{k} x^k = (1 + x)^n.   (2-59)

With respect to x, differentiate (2-59); then, multiply the derivative by x to obtain

Σ_{k=0}^{n} k \binom{n}{k} x^k = x n (1 + x)^{n−1}.   (2-60)

Into (2-60), substitute x = p/q, and then multiply both sides by q^n. This results in

Σ_{k=0}^{n} k \binom{n}{k} p^k q^{n−k} = n p q^{n−1} \left(1 + \frac{p}{q}\right)^{n−1} = n p (p + q)^{n−1} = np.   (2-61)

From this and (2-58), we conclude that a B(n,p) random variable has

E[X] = np.   (2-62)

We now provide a second, completely different, evaluation of E[X] for a B(n,p) random variable. Define n new random variables

X_i = 1, if the i-th trial is a "success", 1 ≤ i ≤ n
    = 0, otherwise.   (2-63)

Note that

E[X_i] = 1·p + 0·(1 − p) = p,   1 ≤ i ≤ n.   (2-64)

B(n,p) random variable X is the number of "successes" out of n independent trials; it can be written as

X = X_1 + X_2 + ... + X_n.   (2-65)

With the use of (2-64), the expected value of X can be evaluated as

E[X] = E[X_1 + X_2 + ... + X_n] = E[X_1] + E[X_2] + ... + E[X_n] = np,   (2-66)

a result equivalent to (2-62).
Consider random variable X, with mean η_x = E[X], and constant k. Define the new random variable Y ≡ X + k. The mean value of Y is

η_y = E[Y] = E[X + k] = E[X] + k = η_x + k.   (2-67)

A random variable, when translated by a constant, has a mean that is translated by the same constant. If instead Y ≡ kX, the mean of Y is E[Y] = E[kX] = k η_x.
In what follows, we extend (2-55) to arbitrary functions of random variable X. Let g(x) be any function of x, and let X be any random variable with density f_X(x). In Chapter 4, we will discuss transformations of the form Y = g(X); this defines the new random variable Y in terms of the old random variable X. We will argue that the expected value of Y can be computed as

E[Y] = E[g(X)] = \int_{-∞}^{∞} g(x) f_X(x) dx.   (2-68)

This brief heads-up note (on what is to come in Chapter 4) is used next to define certain statistical averages of functions of X.
Variance and Standard Deviation
The variance of random variable X is

Var[X] = E[(X − η)^2] = \int_{-∞}^{∞} (x − η)^2 f_X(x) dx.   (2-69)

Almost always, variance is denoted by the symbol σ². The square root of the variance is called the standard deviation of the random variable, and it is denoted as σ. Finally, variance is a measure of uncertainty (or dispersion about the mean). The smaller (alternatively, larger) σ² is, the more (alternatively, less) likely it is for the random variable to take on values near its mean. Finally, note that (2-69) is an application of the basic result (2-68) with g = (x − η)².
Example 2-9: Let X be N(η,σ); then

Var[X] = E[(X − η)^2] = \frac{1}{σ\sqrt{2π}} \int_{-∞}^{∞} (x − η)^2 \exp\left[-\frac{(x − η)^2}{2σ^2}\right] dx.   (2-70)

Let y = (x − η)/σ, dx = σ dy, so that

Var[X] = \frac{σ^2}{\sqrt{2π}} \int_{-∞}^{∞} y^2 e^{-y^2/2} dy = σ^2,   (2-71)

a result obtained by looking up the integral in a table of integrals.
Consider random variable X with mean η_x and variance σ_x² = E[{X − η_x}²]. Let k denote an arbitrary constant. Define the new random variable Y ≡ kX. The variance of Y is

σ_y² = E[{Y − η_y}^2] = E[k^2{X − η_x}^2] = k^2 σ_x².   (2-72)

The variance of kX is simply k² times the variance of X. On the other hand, adding constant k to a random variable does not change the variance; that is, Var[X + k] = Var[X].
Consider random variable X with mean η_x and variance σ_x². Often, we can simplify a problem by centering and normalizing X to define the new random variable

Y ≡ \frac{X − η_x}{σ_x}.   (2-73)

Note that E[Y] = 0 and Var[Y] = 1.
Moments
The nth moment of random variable X is defined as

m_n ≡ E[X^n] = \int_{-∞}^{∞} x^n f_X(x) dx.   (2-74)

The nth moment about the mean is defined as

μ_n ≡ E[(X − η)^n] = \int_{-∞}^{∞} (x − η)^n f_X(x) dx.   (2-75)

Note that (2-74) and (2-75) are basic applications of (2-68).
Variance in Terms of Second Moment and Mean
Note that the variance can be expressed as

σ^2 = Var[X] = E[(X − η)^2] = E[X^2 − 2ηX + η^2] = E[X^2] − E[2ηX] + E[η^2],   (2-76)

where the linearity of the operator E[·] has been used. Now, constants come out front of expectations, and the expectation of a constant is the constant. Hence, Equation (2-76) leads to

σ^2 = E[X^2] − E[2ηX] + E[η^2] = m_2 − 2ηE[X] + η^2 = m_2 − η^2,   (2-77)

the second moment minus the square of the mean. In what follows, this formula will be used extensively.
Example 2-10: Let X be N(0,σ) and find E[X^n]. First, consider the case of n an odd integer:

E[X^n] = \frac{1}{σ\sqrt{2π}} \int_{-∞}^{∞} x^n \exp\left[-\frac{x^2}{2σ^2}\right] dx = 0,   (2-78)

since an integral of an odd function over symmetrical limits is zero. Now, consider the case of n an even integer. Start with the known tabulated integral

\int_{-∞}^{∞} e^{-αx^2} dx = \sqrt{π}\, α^{-1/2},   α > 0.   (2-79)

Repeated differentiation with respect to α yields

first d/dα:   \int_{-∞}^{∞} x^2 e^{-αx^2} dx = \frac{1}{2} \sqrt{π}\, α^{-3/2}
second d/dα:  \int_{-∞}^{∞} x^4 e^{-αx^2} dx = \frac{3}{2}\cdot\frac{1}{2} \sqrt{π}\, α^{-5/2}
⋮
k-th d/dα:    \int_{-∞}^{∞} x^{2k} e^{-αx^2} dx = \frac{(2k−1)}{2}\cdots\frac{5}{2}\cdot\frac{3}{2}\cdot\frac{1}{2} \sqrt{π}\, α^{-(2k+1)/2}.

Let n = 2k (remember, this is the case of n even) and α = 1/2σ² to obtain

\int_{-∞}^{∞} x^n e^{-x^2/2σ^2} dx = 1·3·5 \cdots (n−1)\, σ^{n+1} \sqrt{2π}.   (2-80)

From this, we conclude that the n-th moment of a zero-mean Gaussian random variable is

m_n = E[X^n] = \frac{1}{σ\sqrt{2π}} \int_{-∞}^{∞} x^n e^{-x^2/2σ^2} dx = 1·3·5 \cdots (n−1)\, σ^n,   n = 2k (n even)
    = 0,   n = 2k−1 (n odd).   (2-81)
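The moment formula (2-81) can be checked numerically; the sketch below compares trapezoidal integration of x^n f(x) against 1·3·5⋯(n−1)σ^n for assumed illustrative values of σ and (even) n.

import numpy as np

sigma, n = 1.5, 4
x = np.linspace(-12 * sigma, 12 * sigma, 400001)
f = np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))   # N(0, sigma) density
numeric = np.trapz(x**n * f, x)                                     # E[X^n] by integration

double_factorial = 1
for m in range(1, n, 2):          # product 1*3*5*...*(n-1) for even n
    double_factorial *= m
formula = double_factorial * sigma**n                               # Eq. (2-81)
print(numeric, formula)                                             # both near 3*sigma**4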

Example 2-11 (Rayleigh Distribution): A random variable X is Rayleigh distributed with parameter σ if its density function has the form

f_X(x) = \frac{x}{σ^2} \exp\left[-\frac{x^2}{2σ^2}\right],   x ≥ 0
       = 0,   x < 0.   (2-82)

This random variable has many applications in communication theory; for example, it describes the envelope of a narrowband Gaussian noise process (as described in Chapter 9 of these notes).
The n-th moment of a Rayleigh random variable can be computed by writing

E[X^n] = \frac{1}{σ^2} \int_0^{∞} x^{n+1} e^{-x^2/2σ^2} dx.   (2-83)

We consider two cases, n even and n odd. For the case of n odd, we have n+1 even, and (2-83) becomes

E[X^n] = \frac{1}{2σ^2} \int_{-∞}^{∞} x^{n+1} e^{-x^2/2σ^2} dx = \frac{σ\sqrt{2π}}{2σ^2} \left[\frac{1}{σ\sqrt{2π}} \int_{-∞}^{∞} x^{n+1} e^{-x^2/2σ^2} dx\right].   (2-84)

On the right-hand side of (2-84), the bracket contains the (n+1)-th moment of a N(0,σ) random variable (compare (2-81) and the right-hand side of (2-84)). Hence, we can write

E[X^n] = \frac{σ\sqrt{2π}}{2σ^2}\, 1·3·5 \cdots n\, σ^{n+1} = 1·3·5 \cdots n\, σ^n \sqrt{π/2},   n = 2k+1 (n odd).   (2-85)

Now, consider the case of n even, n+1 odd. For this case, (2-83) becomes

E[X^n] = \frac{1}{σ^2} \int_0^{∞} x^{n+1} e^{-x^2/2σ^2} dx = \int_0^{∞} x^n \frac{x}{σ^2} e^{-x^2/2σ^2} dx.   (2-86)

Substitute y = x²/2σ², so that dy = (x/σ²)dx, in (2-86) and obtain

E[X^n] = \int_0^{∞} (2σ^2 y)^{n/2} e^{-y} dy = (2σ^2)^{n/2} \int_0^{∞} y^{n/2} e^{-y} dy = 2^{n/2} σ^n \left(\frac{n}{2}\right)!,   (2-87)

(note that n/2 is an integer here) where we have used the Gamma function

Γ(k + 1) = \int_0^{∞} y^k e^{-y} dy = k!,   (2-88)

k ≥ 0 an integer. Hence, for a Rayleigh random variable X, we have determined that

E[X^n] = 1·3·5 \cdots n\, σ^n \sqrt{π/2},   n odd
       = 2^{n/2} σ^n (n/2)!,   n even.   (2-89)

In particular, Equation (2-89) can be used to obtain the mean and variance

E[X] = σ\sqrt{π/2}
Var[X] = E[X^2] − (E[X])^2 = 2σ^2 − σ^2\frac{π}{2} = \left(2 − \frac{π}{2}\right)σ^2,   (2-90)

given that X is a Rayleigh random variable.
Example 2-12: Let X be Poisson with parameter λ, so that P[X = k] = \frac{λ^k}{k!} e^{-λ}, k ≥ 0, and

f(x) = Σ_{k=0}^{∞} \frac{λ^k}{k!} e^{-λ} δ(x − k).   (2-91)

Show that E[X] = λ and Var[X] = λ. Recall that

Σ_{k=0}^{∞} \frac{λ^k}{k!} = e^{λ}.   (2-92)

With respect to λ, differentiate (2-92) to obtain

e^{λ} = Σ_{k=0}^{∞} k \frac{λ^{k−1}}{k!} = \frac{1}{λ} Σ_{k=1}^{∞} k \frac{λ^k}{k!}.   (2-93)

Multiply both sides of this result by λe^{-λ} to obtain

λ = Σ_{k=1}^{∞} k \frac{λ^k}{k!} e^{-λ} = E[X],   (2-94)

as claimed. With respect to λ, differentiate (2-93) (obtain the second derivative of (2-92)) and obtain

e^{λ} = Σ_{k=1}^{∞} k(k−1) \frac{λ^{k−2}}{k!} = \frac{1}{λ^2} Σ_{k=1}^{∞} k^2 \frac{λ^k}{k!} − \frac{1}{λ^2} Σ_{k=1}^{∞} k \frac{λ^k}{k!}.

Multiply both sides of this result by λ²e^{-λ} to obtain

λ^2 = Σ_{k=1}^{∞} k^2 \frac{λ^k}{k!} e^{-λ} − Σ_{k=1}^{∞} k \frac{λ^k}{k!} e^{-λ} = Σ_{k=1}^{∞} k^2 P[X = k] − Σ_{k=1}^{∞} k P[X = k].   (2-95)

Note that (2-95) is simply λ² = E[X²] − E[X]. Finally, a Poisson random variable has a variance given by

Var[X] = E[X^2] − (E[X])^2 = (λ^2 + λ) − λ^2 = λ,   (2-96)
as claimed.
Conditional Mean
Let M denote an event. The conditional density f(x|M) can be used to define the conditional mean

E[X|M] = \int_{-∞}^{∞} x f(x|M) dx.   (2-97)

The conditional mean has many applications, including estimation theory, detection theory, etc.
Example 2-13: Let X be Gaussian with zero mean and variance σ² (i.e., X is N(0,σ)). Let M = [X > 0]; find E[X|M] = E[X|X > 0]. First, we must find the conditional density f(x|X > 0); from previous work in this chapter, we can write

F(x|X > 0) = \frac{P[X ≤ x, X > 0]}{P[X > 0]} = \frac{P[X ≤ x, X > 0]}{1 − P[X ≤ 0]} = \frac{F_X(x) − F_X(0)}{1 − F_X(0)},   x ≥ 0
           = 0,   x < 0,

so that

f(x|X > 0) = 2 f_X(x),   x ≥ 0
           = 0,   x < 0.

From (2-97) we can write

E[X | X > 0] = \frac{2}{σ\sqrt{2π}} \int_0^{∞} x \exp\left[-\frac{x^2}{2σ^2}\right] dx.

Now, set y = x²/2σ², dy = x dx/σ², to obtain

E[X | X > 0] = \frac{2σ}{\sqrt{2π}} \int_0^{∞} e^{-y} dy = σ\sqrt{\frac{2}{π}}.
Tchebycheff Inequality
A measure of the concentration of a random variable near its mean is its variance. Consider a random variable X with mean η, variance σ², and density f_X(x). The larger σ², the more "spread out" the density function, and the more probable it is to find values of X "far" from the mean. Let ε denote an arbitrary small positive number. The Tchebycheff inequality says that the probability that X is outside (η − ε, η + ε) is negligible if σ/ε is sufficiently small.
Theorem (Tchebycheff's Inequality)
Consider random variable X with mean η and variance σ². For any ε > 0, we have

P[|X − η| ≥ ε] ≤ \frac{σ^2}{ε^2}.   (2-98)

Proof: Note that

σ^2 = \int_{-∞}^{∞} (x − η)^2 f_X(x) dx ≥ \int_{\{x : |x−η| ≥ ε\}} (x − η)^2 f_X(x) dx
    ≥ ε^2 \int_{\{x : |x−η| ≥ ε\}} f_X(x) dx = ε^2 P[|X − η| ≥ ε].

This leads to the Tchebycheff inequality

P[|X − η| ≥ ε] ≤ \frac{σ^2}{ε^2}.   (2-99)

The significance of Tchebycheff's inequality is that it holds for any random variable, and it can be used without explicit knowledge of f(x). However, the bound is very "conservative" (or "loose"), so it may not offer much information in some applications. For example, consider Gaussian X. Note that

P[|X − η| ≥ 3σ] = 1 − P[|X − η| ≤ 3σ] = 1 − P[−3 ≤ (X − η)/σ ≤ 3]
               = 1 − [G(3) − G(−3)] = 2 − 2G(3) = .0027,   (2-100)

where G(3) is obtained from a table containing values of the Gaussian integral. However, the Tchebycheff inequality gives the rather "loose" upper bound of

P[|X − η| ≥ 3σ] ≤ 1/9.   (2-101)

Certainly, inequality (2-101) is correct; however, it is a very crude upper bound, as can be seen from inspection of (2-100).
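The looseness of the bound is easy to tabulate. The sketch below compares the exact Gaussian tail probability of (2-100) with the Tchebycheff bound of (2-99) for several thresholds ε = mσ.

from math import erf, sqrt

def G(z):                               # zero-mean, unit-variance Gaussian distribution function
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

for m in (1, 2, 3, 4):
    exact = 2.0 - 2.0 * G(m)            # P[|X - eta| >= m*sigma] for Gaussian X, as in (2-100)
    bound = 1.0 / m**2                  # Tchebycheff bound sigma^2/epsilon^2, as in (2-99)
    print(m, round(exact, 5), round(bound, 5))
# For m = 3 the exact value is about .0027, while the bound is only 1/9.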
Generalizations of Tchebycheff's Inequality
For a given random variable X, suppose that f_X(x) = 0 for x < 0. Then, for any α > 0, we have

P[X ≥ α] ≤ η/α.   (2-102)

To show (2-102), note that

η = E[X] = \int_0^{∞} x f_X(x) dx ≥ \int_α^{∞} x f_X(x) dx ≥ α \int_α^{∞} f_X(x) dx = α P[X ≥ α],   (2-103)

so that P[X ≥ α] ≤ η/α, as claimed.
Corollary: Let X be an arbitrary random variable, and let α and n be an arbitrary real number and positive integer, respectively. The random variable |X − α|^n takes on only nonnegative values. Hence

P[|X − α|^n ≥ ε^n] ≤ \frac{E[|X − α|^n]}{ε^n},   (2-104)

which implies

P[|X − α| ≥ ε] ≤ \frac{E[|X − α|^n]}{ε^n}.   (2-105)

The Tchebycheff inequality is a special case with α = η and n = 2.
Application: System Reliability
Often, systems fail in a random manner. For a particular system, we denote t_f as the time interval from the moment a system is put into operation until it fails; t_f is the time-to-failure random variable. The distribution F_a(t) = P[t_f ≤ t] is the probability the system fails at, or prior to, time t. Implicit here is the assumption that the system is placed into service at t = 0. Also, we require that F_a(t) = 0 for t ≤ 0.
The quantity

R(t) ≡ 1 − F_a(t) = P[t_f > t]   (2-106)

is the system reliability. R(t) is the probability the system is functioning at time t > 0.
We are interested in simple methods to quantify system reliability. One such measure of system reliability is the mean time before failure

MTBF = E[t_f] = \int_0^{∞} t f_a(t) dt,   (2-107)

where f_a = dF_a/dt is the density function that describes random variable t_f.
Given that a system is functioning at time t_1, t_1 ≥ 0, we are interested in the probability that the system fails at, or prior to, time t, where t > t_1 ≥ 0. We express this conditional distribution function as

F(t | t_f > t_1) = \frac{P[t_f ≤ t, t_f > t_1]}{P[t_f > t_1]} = \frac{P[t_1 < t_f ≤ t]}{P[t_f > t_1]} = \frac{F_a(t) − F_a(t_1)}{1 − F_a(t_1)},   t > t_1.   (2-108)

The conditional density can be obtained by differentiating (2-108) to obtain

f(t | t_f > t_1) = \frac{d}{dt} F(t | t_f > t_1) = \frac{f_a(t)}{1 − F_a(t_1)},   t > t_1.   (2-109)

F(t|t_f > t_1) and f(t|t_f > t_1) describe t_f conditioned on the event t_f > t_1. The quantity f(t|t_f > t_1)dt is, to first order in dt, the probability that the system fails between t and t + dt given that it was working at t_1.
Example 2-14
Suppose that the time-to-failure random variable t_f is exponentially distributed. That is, suppose that

F_a(t) = 1 − e^{-λt},   t ≥ 0
       = 0,   t < 0
f_a(t) = λe^{-λt},   t ≥ 0
       = 0,   t < 0,   (2-110)

for some constant λ > 0. From (2-109), we see that

f(t | t_f > t_1) = \frac{λe^{-λt}}{e^{-λt_1}} = λe^{-λ(t − t_1)} = f_a(t − t_1),   t > t_1.   (2-111)

That is, if the system is working at time t_1, then the probability that it fails between t_1 and t depends only on the positive difference t − t_1, not on absolute time. The system does not wear out (become more likely to fail) as time progresses!
With (2-109), we define f(t|t_f > t_1) for t > t_1. However, the function

β(t) ≡ \frac{f_a(t)}{1 − F_a(t)},   (2-112)

known as the conditional failure rate (also known as the hazard rate), is very useful. To first order, β(t)dt (when this quantity exists) is the probability that a functioning-at-t system will fail between t and t + dt.
Example 2-15 (Continuation of Example 2-14)
Assume the system has f_a and F_a as defined in Example 2-14. Substitute (2-110) into (2-112) to obtain

β(t) = \frac{λe^{-λt}}{1 − \{1 − e^{-λt}\}} = λ.   (2-113)

That is, the conditional failure rate is the constant λ. As stated in Example 2-14, the system does not wear out as time progresses!
If the conditional failure rate β(t) is a constant λ, we say that the system is a "good as new" system. That is, it does not wear out (become more likely to fail) over time.
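A one-look numerical confirmation of (2-113): evaluate the hazard rate (2-112) on a grid for an exponential time-to-failure with an assumed rate λ; every entry comes out equal to λ.

import numpy as np

lam = 0.5
t = np.linspace(0.0, 10.0, 11)
f_a = lam * np.exp(-lam * t)       # density of (2-110)
F_a = 1.0 - np.exp(-lam * t)       # distribution of (2-110)
beta = f_a / (1.0 - F_a)           # conditional failure rate (2-112)
print(beta)                        # every entry equals lam = 0.5, as in (2-113)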
Examples 2-14 and 2-15 show that if a system's time-to-failure random variable t_f is exponentially distributed, then

f(t | t_f > t_1) = f_a(t − t_1),   t > t_1,   (2-114)

and

β(t) = constant,   (2-115)

so the system is a "good as new" system.
The converse is true as well. That is, if β(t) is a constant λ for a system, then the system's time-to-failure random variable t_f is exponentially distributed. To argue this, use (2-112) to write

f_a(t) = λ[1 − F_a(t)].   (2-116)

But (2-116) leads to

\frac{dF_a}{dt} + λF_a(t) = λ.   (2-117)

Since F_a(t) = 0 for t ≤ 0, we must have

F_a(t) = 1 − e^{-λt},   t ≥ 0
       = 0,   t < 0,

so t_f is exponentially distributed. For β(t) to be equal to a constant λ, failures must be truly random in nature. A constant β requires that there be no time epochs where failure is more likely (no "Year 2000"-type problems!).
Random variable t_f is said to exhibit the Markov property, or it is said to be memoryless, if its conditional density obeys (2-114). We have established the following theorem.
Theorem
The following are equivalent:
1) A system is a "good-as-new" system
2) β = constant
3) f(t | t_f > t_1) = f_a(t − t_1), t > t_1
4) t_f is exponentially distributed.
The previous theorem, and the Markov property, is stated in the context of system reliability. However, both have far-reaching consequences in other areas that have nothing to do with system reliability. Basically, in problems dealing with random arrival times, the time between two successive arrivals is exponentially distributed if
1) the arrivals are independent of each other, and
2) the arrival time t after any specified fixed time t_1 is described by a density function that depends only on the difference t − t_1 (the arrival time random variable obeys the Markov property).
In Chapter 9, we will study shot noise (caused by the random arrival of electrons at a semiconductor junction, vacuum tube anode, etc.), an application of the above theorem and the Markov property.

Chapter 3: Multiple Random Variables
Let X and Y denote two random variables. The joint distribution of these random variables is defined as

F_XY(x, y) ≡ P[X ≤ x, Y ≤ y].   (3-1)

This is the probability that (X,Y) lies in the shaded region (below y and to the left of x) depicted on Figure 3-1.

Figure 3-1: Region included in the definition of F(x,y).

Elementary Properties of the Joint Distribution
As x and/or y approach minus infinity, the distribution approaches zero; that is,

F_XY(-∞, y) = 0   and   F_XY(x, -∞) = 0.   (3-2)

To show this, note that {X = -∞, Y ≤ y} ⊂ {X = -∞}, but P[X = -∞] = 0 ⇒ F_XY(-∞, y) = 0. Similar reasoning can be given to show that F_XY(x, -∞) = 0.
As x and y both approach infinity (simultaneously, and in any order), the distribution approaches unity; that is,

F_XY(∞, ∞) = 1.   (3-3)
This follows easily by noting that {X ≤ ∞, Y ≤ ∞} = S and P(S) = 1.
In many applications, the identities

P[x_1 < X ≤ x_2, Y ≤ y] = F_XY(x_2, y) − F_XY(x_1, y)   (3-4)

P[X ≤ x, y_1 < Y ≤ y_2] = F_XY(x, y_2) − F_XY(x, y_1)   (3-5)

are useful. To show (3-4), note that P[x_1 < X ≤ x_2, Y ≤ y] is the probability that the pair (X, Y) is in the shaded region D_2 depicted by Figure 3-2. Now, it is easily seen that

P[X ≤ x_2, Y ≤ y] = P[X ≤ x_1, Y ≤ y] + P[x_1 < X ≤ x_2, Y ≤ y],

which is equivalent to F_XY(x_2, y) = F_XY(x_1, y) + P[x_1 < X ≤ x_2, Y ≤ y]. This leads to (3-4). A similar development leads to (3-5).

Figure 3-2: Region x_1 < X ≤ x_2, Y ≤ y on the plane.

Joint Density
The joint density of X and Y is defined as the function

f_XY(x, y) ≡ \frac{∂^2 F_XY(x, y)}{∂x\,∂y},

and, since F_XY(∞, ∞) = 1, the joint density integrates to unity:

\int_{-∞}^{∞}\int_{-∞}^{∞} f_XY(x, y)\, dx\, dy = 1.

That is, there is one unit of area under the joint density function.
Marginal descriptions can be obtained from joint descriptions. We claim that

F_X(x) = F_XY(x, ∞)   and   F_Y(y) = F_XY(∞, y).   (3-10)

To see this, note that {X ≤ x} = {X ≤ x, Y < ∞} and {Y ≤ y} = {X < ∞, Y ≤ y}. Take the probability of these events to obtain the desired results.
Other relationships are important as well. For example, marginal density f_X(x) can be obtained from the joint density f_XY(x,y) by using

f_X(x) = \int_{-∞}^{∞} f_XY(x, y) dy.   (3-11)

To see this, use Leibnitz's rule to take the partial derivative, with respect to x, of the distribution F_XY(x, y) = \int_{-∞}^{x}\int_{-∞}^{y} f_XY(u, v) dv du to obtain

\frac{∂}{∂x} F_XY(x, y) = \int_{-∞}^{y} f_XY(x, v) dv.   (3-12)

Now, let y go to infinity, and use F_X(x) = F_XY(x, ∞) to get the desired result

f_X(x) = \frac{∂}{∂x} F_XY(x, ∞) = \int_{-∞}^{∞} f_XY(x, v) dv.   (3-13)

A similar development leads to the conclusion that

f_Y(y) = \int_{-∞}^{∞} f_XY(x, y) dx.   (3-14)

Special Case: Jointly Gaussian Random Variables
Random variables X and Y are jointly Gaussian (a.k.a. jointly normal) if their joint density has the form

f_XY(x, y) = \frac{1}{2πσ_x σ_y \sqrt{1 − r^2}} \exp\left\{-\frac{1}{2(1 − r^2)}\left[\frac{(x − η_x)^2}{σ_x^2} − 2r\frac{(x − η_x)(y − η_y)}{σ_x σ_y} + \frac{(y − η_y)^2}{σ_y^2}\right]\right\},   (3-15)

where η_x = E[X], η_y = E[Y], σ_y² = Var[Y], σ_x² = Var[X], and r is a parameter known as the correlation coefficient (r lies in the range −1 ≤ r ≤ 1). The marginal densities for X and Y are given by

f_X(x) = \frac{1}{σ_x\sqrt{2π}} \exp[−(x − η_x)^2/2σ_x^2]   and   f_Y(y) = \frac{1}{σ_y\sqrt{2π}} \exp[−(y − η_y)^2/2σ_y^2].
Independence
Random variables X and Y are said to be independent if all events of the form {X ∈ A} and {Y ∈ B}, where A and B are sets of real numbers, are independent. Apply this to the events {X ≤ x} and {Y ≤ y} to see that if X and Y are independent, then

F_XY(x, y) = P[X ≤ x, Y ≤ y] = P[X ≤ x] P[Y ≤ y] = F_X(x) F_Y(y)
f_XY(x, y) = \frac{∂^2}{∂x∂y} F_XY(x, y) = \frac{∂}{∂x}F_X(x) \cdot \frac{∂}{∂y}F_Y(y) = f_X(x) f_Y(y).   (3-16)

The converse of this can be shown as well. Hence, X and Y are independent if, and only if, their joint density factors into a product of marginal densities; a similar statement can be made for distribution functions. This result generalizes to more than two random variables; n random variables are independent if, and only if, their joint density (alternatively, joint distribution) factors into a product of marginal densities (alternatively, marginal distributions).
Example 3-1: Consider X and Y as jointly Gaussian. The only way you can get the equality

\frac{1}{2πσ_xσ_y\sqrt{1−r^2}} \exp\left\{-\frac{1}{2(1−r^2)}\left[\frac{(x−η_x)^2}{σ_x^2} − 2r\frac{(x−η_x)(y−η_y)}{σ_xσ_y} + \frac{(y−η_y)^2}{σ_y^2}\right]\right\}
  = \frac{1}{σ_x\sqrt{2π}} \exp[−(x−η_x)^2/2σ_x^2] \cdot \frac{1}{σ_y\sqrt{2π}} \exp[−(y−η_y)^2/2σ_y^2]   (3-17)

is to have the correlation coefficient r = 0. Hence, Gaussian X and Y are independent if and only if r = 0.
Many problems become simpler if their random variables are (or can be assumed to be) independent. For example, when dealing with independent random variables, the expected value of a product (of independent random variables) can be expressed as a product of expected values. Also, the variance of a sum of independent random variables is the sum of the variances. These two simplifications are discussed next.
Expectation of a Product of Independent Random Variables
Independence of random variables can simplify many calculations. As an example, let X and Y be random variables. Clearly, the product Z = XY is a random variable (review the definition of a random variable given in Chapter 2). As we will discuss in Chapter 5, the expected value of Z = XY can be computed as

E[XY] = \int_{-∞}^{∞}\int_{-∞}^{∞} xy\, f_XY(x, y)\, dx\, dy,   (3-18)

where f_XY(x,y) is the joint density of X and Y. Suppose X and Y are independent random variables. Then (3-16) and (3-18) yield

E[XY] = \int_{-∞}^{∞}\int_{-∞}^{∞} xy\, f_XY(x, y)\, dx\, dy = \int_{-∞}^{∞} x f_X(x) dx \int_{-∞}^{∞} y f_Y(y) dy = E[X]E[Y].   (3-19)

This result generalizes to more than two random variables; for n independent random variables, the expected value of a product is the product of the expected values.
The converse of (3-19) is not true, in general. That is, if E[XY] = E[X]E[Y], it does not necessarily follow that X and Y are independent.
Variance of a Sum of Independent Random Variables
For a second example where independence simplifies a calculation, let X and Y be independent random variables, and compute the variance of their sum. The variance of X + Y is given by

Var[X + Y] = E[{(X + Y) − (E[X] + E[Y])}^2] = E[{(X − E[X]) + (Y − E[Y])}^2]
           = E[{X − E[X]}^2] + 2E[{X − E[X]}{Y − E[Y]}] + E[{Y − E[Y]}^2].   (3-20)

Since X and Y are independent, we have E[{X − E[X]}{Y − E[Y]}] = E[X − E[X]] E[Y − E[Y]] = 0, and (3-20) becomes

Var[X + Y] = E[{X − E[X]}^2] + E[{Y − E[Y]}^2] = Var[X] + Var[Y].   (3-21)

That is, for independent random variables, the variance of the sum is the sum of the variances (this applies to two or more random variables). In general, if random variables X and Y are dependent, then (3-21) is not true.
Example 3-2: The result just given can be used to simplify the calculation of variance in some cases. Consider the binomial random variable X, the number of "successes" out of n independent trials. As used in an example that was discussed in Chapter 2 (where we showed that E[X] = np), we can express binomial X as

X = X_1 + X_2 + ... + X_n,   (3-22)

where the X_i, 1 ≤ i ≤ n, are random variables defined by

X_i = 1, if the i-th trial is a "success", 1 ≤ i ≤ n
    = 0, otherwise.   (3-23)

Note that all n of the X_i are independent, they have identical mean p, and they have identical variance

Var[X_i] = E[X_i^2] − (E[X_i])^2 = p − p^2 = pq,   (3-24)

where p is the probability of success on any trial, and q = 1 − p. Hence, we can express the variance of binomial X as

Var[X] = Var[X_1 + X_2 + ... + X_n] = Var[X_1] + Var[X_2] + ... + Var[X_n] = npq.   (3-25)

Hence, for the binomial random variable X, we know that E[X] = np and Var[X] = npq.
Random Vectors: Vector-Valued Mean and Covariance Matrix
Let X_1, X_2, ..., X_n denote a set of n random variables. In this subsection, we use vector and matrix techniques to simplify working with multiple random variables.
Denote the vector-valued random variable

X⃗ = [X_1 X_2 X_3 ... X_n]^T.   (3-26)

Clearly, using vector notation is helpful; writing X⃗ is much easier than writing out the n random variables X_1, X_2, ..., X_n.
The mean of X⃗ is a constant vector η⃗ = E[X⃗] with components equal to the means of the X_i. We write

η⃗ = E[X⃗] = E[[X_1 X_2 X_3 ... X_n]^T] = [E[X_1] E[X_2] E[X_3] ... E[X_n]]^T = [η_1 η_2 η_3 ... η_n]^T,   (3-27)

where η_i = E[X_i], 1 ≤ i ≤ n.
The covariance of X_i and X_j is defined as

λ_ij = E[(X_i − η_i)(X_j − η_j)],   1 ≤ i, j ≤ n.   (3-28)

Note that λ_ij = λ_ji. Use these n² covariance values to form the covariance matrix

Λ = [ λ_11  λ_12  ⋯  λ_1n
      λ_21  λ_22  ⋯  λ_2n
       ⋮     ⋮         ⋮
      λ_n1  λ_n2  ⋯  λ_nn ].   (3-29)

Note that this matrix is symmetric; that is, note that Λ = Λ^T. Finally, we can write

Λ = E[(X⃗ − η⃗)(X⃗ − η⃗)^T].   (3-30)

Equation (3-30) provides a compact, simple definition for Λ.
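The outer-product form (3-30) translates directly into a sample estimate. The sketch below generates correlated samples (the 3-variable mixing matrix and means are assumed purely for illustration), forms the sample mean as in (3-27), and averages the outer products of the centered samples as in (3-30); the result is compared with the theoretical covariance of the construction.

import numpy as np

rng = np.random.default_rng(1)
N = 200000
Z = rng.standard_normal((N, 3))                # independent unit-variance components
A = np.array([[1.0, 0.0, 0.0],
              [0.5, 1.0, 0.0],
              [0.2, 0.3, 1.0]])
X = Z @ A.T + np.array([1.0, -2.0, 0.5])       # correlated samples with nonzero mean

eta_hat = X.mean(axis=0)                       # sample version of (3-27)
D = X - eta_hat
Lambda_hat = D.T @ D / N                       # sample version of (3-30)
print(Lambda_hat)
print(A @ A.T)                                 # theoretical covariance for comparison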
Symmetric Positive Semi-Definite and Positive Definite Matrices
A real-valued, symmetric matrix Q is positive semi-definite (sometimes called nonnegative definite) if U⃗^T Q U⃗ ≥ 0 for all real-valued vectors U⃗. A real-valued, symmetric matrix Q is positive definite if U⃗^T Q U⃗ > 0 for all real-valued vectors U⃗ ≠ 0⃗. A real-valued, symmetric, positive semi-definite matrix may (or may not) be singular. However, a positive definite symmetric matrix is always nonsingular.
Theorem 3-1: The covariance matrix Λ is positive semi-definite.
Proof: Let U⃗ = [u_1 u_2 ... u_n]^T be an arbitrary, real-valued vector. Now, define the scalar

Y = Σ_{j=1}^{n} u_j (X_j − η_j).   (3-31)

Clearly, E[Y²] ≥ 0. However,

E[Y^2] = E\left[Σ_{j=1}^{n} u_j (X_j − η_j)\, Σ_{k=1}^{n} u_k (X_k − η_k)\right] = Σ_{j=1}^{n} Σ_{k=1}^{n} u_j E[(X_j − η_j)(X_k − η_k)] u_k
       = Σ_{j=1}^{n} Σ_{k=1}^{n} u_j λ_jk u_k = U⃗^T Λ U⃗ ≥ 0.   (3-32)

Hence U⃗^T Λ U⃗ ≥ 0 for all U⃗, and Λ is positive semi-definite.
Matrix Λ is positive definite in almost all practical applications. That is, U⃗^T Λ U⃗ > 0 for all U⃗ ≠ 0⃗. If Λ is not positive definite, then at least one of the X_i can be expressed as a linear combination of the remaining n−1 random variables, and the problem can be simplified by reducing the number of random variables. If Λ is positive definite, then |Λ| ≠ 0, Λ is nonsingular and Λ⁻¹ exists (|Λ| denotes the determinant of the covariance matrix).
Uncorrelated Random Variables

Suppose we are given n random variables X_i, 1 ≤ i ≤ n. We say that the X_i are uncorrelated if, for 1 ≤ i, j ≤ n,

    E[X_i X_j] = E[X_i] E[X_j],  i ≠ j.    (3-33)

For uncorrelated X_i, 1 ≤ i ≤ n, we have

    λ_ij = E[(X_i − η_i)(X_j − η_j)] = E[X_i X_j] − η_i E[X_j] − η_j E[X_i] + η_i η_j,  i ≠ j
         = E[X_i] E[X_j] − η_i η_j − η_j η_i + η_i η_j,  i ≠ j,    (3-34)

and this leads to

    λ_ij = σ_i²,  i = j
         = 0,     i ≠ j.    (3-35)

Hence, for uncorrelated random variables, matrix Λ is diagonal and of the form (the variances are on the diagonal)

    Λ = [ σ_1²   0     0   ...   0
           0    σ_2²   0   ...   0
          ...                   ...
           0     0     0   ...  σ_n² ].    (3-36)

If X_i and X_j are independent, they are also uncorrelated, a conclusion that follows from (3-19). However, the converse is not true, in general. Uncorrelated random variables may be dependent.
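A standard illustration of that last remark, sketched here assuming NumPy (the choice X ~ N(0,1), Y = X² is a textbook example, not taken from these notes): Y is completely determined by X, so the pair is dependent, yet E[XY] − E[X]E[Y] = E[X³] = 0, so X and Y are uncorrelated.

import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal(1_000_000)
Y = X**2                                  # completely determined by X, hence dependent

cov_XY = np.mean(X * Y) - X.mean() * Y.mean()
print(cov_XY)                             # close to 0: uncorrelated despite dependence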
Multivariable Gaussian Density

Let X_1, X_2, ... , X_n be jointly Gaussian random variables. Let Λ denote the covariance matrix for the random vector X = [X_1  X_2  ...  X_n]^T, and denote η = E[X]. The joint density of X_1, X_2, ... , X_n can be expressed as a density for the vector X. This density is denoted as f(X), and it is given by

    f(X) = 1/( (2π)^(n/2) |Λ|^(1/2) ) exp[ −(1/2) (X − η)^T Λ⁻¹ (X − η) ].    (3-37)

When n = 1 (or 2), this result yields the expressions given in class for the first (second) order case.

With (3-37), we have perpetuated a common abuse of notation. We have used X = [X_1  X_2  ...  X_n]^T to denote a vector of random variables. However, in (3-37), X is a vector of algebraic variables. This unfortunate dual use of a symbol is common in the literature. This ambiguity should present no real problem; from context, the exact interpretation of X should be clear.
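Here is a minimal sketch of evaluating (3-37) directly, assuming NumPy and an arbitrary two-dimensional η and Λ chosen only for illustration; if SciPy is available, scipy.stats.multivariate_normal gives the same numbers and can serve as a cross-check.

import numpy as np

def gaussian_density(x, eta, Lam):
    # Evaluate the multivariable Gaussian density (3-37) at the point x.
    n = len(eta)
    d = x - eta
    quad = d @ np.linalg.solve(Lam, d)            # (x - eta)^T Lam^{-1} (x - eta)
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Lam))
    return np.exp(-0.5 * quad) / norm

eta = np.array([0.0, 1.0])                        # illustration values
Lam = np.array([[1.0, 0.5], [0.5, 2.0]])
print(gaussian_density(np.array([0.2, 0.8]), eta, Lam))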
Example 3-3: Let X = [X_1  X_2]^T be a 2×1 Gaussian random vector. Then X_1 and X_2 are Gaussian random variables with joint density of the form (3-15). Let η_1 = E[X_1] and η_2 = E[X_2] denote the means of X_1 and X_2, respectively; likewise, let σ_1² = VAR[X_1] and σ_2² = VAR[X_2]. Finally, let r denote the correlation coefficient in the joint density. Find an expression for covariance matrix Λ in terms of these quantities. This can be accomplished by comparing the exponents of (3-37) and (3-15). For the exponent of (3-37), let Q = Λ⁻¹ and write

    −(1/2) (X − η)^T Q (X − η) = −(1/2) [X_1 − η_1   X_2 − η_2] [ q_11  q_12 ] [X_1 − η_1]
                                                                [ q_12  q_22 ] [X_2 − η_2]
                               = −(1/2) { q_11 (X_1 − η_1)² + 2 q_12 (X_1 − η_1)(X_2 − η_2) + q_22 (X_2 − η_2)² },    (3-38)

where we have used the fact that Q is symmetric (Q^T = Q). Now, compare (3-38) with the exponent of (3-15) (where X_1 and X_2 are used instead of x and y) and write

    −(1/2) { q_11 (X_1 − η_1)² + 2 q_12 (X_1 − η_1)(X_2 − η_2) + q_22 (X_2 − η_2)² }
        = −1/(2(1 − r²)) { (X_1 − η_1)²/σ_1² − 2r (X_1 − η_1)(X_2 − η_2)/(σ_1 σ_2) + (X_2 − η_2)²/σ_2² }.    (3-39)

Equate like terms on both sides of (3-39) and obtain

    q_11 = 1 / (σ_1² (1 − r²))
    q_12 = −r / (σ_1 σ_2 (1 − r²))
    q_22 = 1 / (σ_2² (1 − r²)).    (3-40)

Finally, take the inverse of matrix Q and obtain

    Λ = Q⁻¹ = [ σ_1²       r σ_1 σ_2
                r σ_1 σ_2  σ_2²      ].    (3-41)

The matrix on the right-hand side of (3-41) shows the general form of the covariance matrix for a two-dimensional Gaussian random vector.

From (3-41) and the discussion before (3-29), we note that E[(X_1 − η_1)(X_2 − η_2)] = r σ_1 σ_2. Hence, the correlation coefficient r can be written as

    r = E[(X_1 − η_1)(X_2 − η_2)] / (σ_1 σ_2) = λ_12 / (σ_1 σ_2),    (3-42)

the covariance normalized by the product σ_1 σ_2. When working problems, Equation (3-42) is an important, very useful formula for the correlation coefficient r.
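A numerical sketch of (3-41) and (3-42), assuming NumPy (the values of σ_1, σ_2 and r are arbitrary illustration choices): build Λ from σ_1, σ_2, r, draw samples, and recover r as the normalized covariance.

import numpy as np

rng = np.random.default_rng(3)
s1, s2, r = 2.0, 0.5, -0.7                             # illustration values
Lam = np.array([[s1**2, r*s1*s2], [r*s1*s2, s2**2]])   # form of (3-41)

X = rng.multivariate_normal([0.0, 0.0], Lam, size=500_000)
cov12 = np.mean(X[:, 0] * X[:, 1])                     # lambda_12 (the means are zero here)
r_hat = cov12 / (X[:, 0].std() * X[:, 1].std())        # Equation (3-42)
print(r_hat)                                           # close to r = -0.7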
Chapter 4 - Function of Random Variables

Let X denote a random variable with known density f_X(x) and distribution F_X(x). Let y = g(x) denote a real-valued function of the real variable x. Consider the transformation

    Y = g(X).    (4-1)

This is a transformation of the random variable X into the random variable Y. Random variable X(s) is a mapping from the sample space into the real line. But so is g(X(s)). We are interested in methods for finding the density f_Y(y) and the distribution F_Y(y).

When dealing with Y = g(X(s)), there are a few technicalities that should be considered.

1. The domain of g should include the range of X.
2. For every y, the set {Y = g(X) ≤ y} must be an event. That is, the set {s ∈ S : Y(s) = g(X(s)) ≤ y} must be in F (i.e., it must be an event).
3. The events {Y = g(X) = ±∞} must be assigned a probability of zero.

In practice, these technicalities are assumed to hold, and they do not cause any problems.

Define the indexed set

    I_y = {x : g(x) ≤ y},    (4-2)

the composition of which changes with y. The distribution of Y can be expressed as

    F_Y(y) = P[Y ≤ y] = P[g(X) ≤ y] = P[X ∈ I_y].    (4-3)

This provides a practical method for computing the distribution function.

Example 4-1: Consider the function y = g(x) = ax + b, where a > 0 and b are constants. Then

    I_y = {x : g(x) = ax + b ≤ y} = {x : x ≤ (y − b)/a},

so that

    F_Y(y) = P[X ∈ I_y] = P[X ≤ (y − b)/a] = F_X((y − b)/a).

Example 4-2: Given random variable X and function y = g(x) = x², as shown by Figure 4-1. Define Y = g(X) = X², and find F_Y(y). If y < 0, then there are no values of x such that x² ≤ y. Hence,

    F_Y(y) = 0,  y < 0.

If y > 0, then x² ≤ y for −√y ≤ x ≤ √y. Hence, I_y = {x : g(x) ≤ y} = {x : −√y ≤ x ≤ √y}, and

    F_Y(y) = P[−√y ≤ X ≤ √y] = F_X(√y) − F_X(−√y),  y > 0.
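A quick Monte Carlo check of Example 4-2, sketched here assuming NumPy and SciPy are available and taking X to be N(0,1) purely for illustration: the empirical distribution of Y = X² should match F_X(√y) − F_X(−√y).

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
X = rng.standard_normal(1_000_000)     # X ~ N(0,1), an illustration choice
Y = X**2

y = 1.5                                # test point
F_Y_empirical = np.mean(Y <= y)
F_Y_formula = norm.cdf(np.sqrt(y)) - norm.cdf(-np.sqrt(y))   # Example 4-2 result
print(F_Y_empirical, F_Y_formula)      # the two numbers should be close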
[Figure 4-1: Quadratic transformation y = g(x) = x² used in Example 4-2.]

Special Considerations

Special consideration is due for functions g(x) that have flat spots and/or jump discontinuities. These cases are considered next.

Watch for places where g(x) is constant (flat spots). Suppose g(x) is constant on the interval (x_0, x_1]. That is, g(x) = y_1 for x_0 < x ≤ x_1, where y_1 is a constant, and g(x) ≠ y_1 off x_0 < x ≤ x_1. Hence, all of the probability that X has in the interval x_0 < x ≤ x_1 is assigned to the single value Y = y_1, so that

    P[Y = y_1] = P[x_0 < X ≤ x_1] = F_X(x_1) − F_X(x_0).    (4-4)

That is, F_Y(y) has a jump discontinuity at y = y_1. The amount of jump is F_X(x_1) − F_X(x_0). As an example, consider the case of a saturating amplifier/limiter transformation.

Example 4-3 (Saturating Amplifier/Limiter): In terms of F_X(x), find the distribution F_Y(y) for Y = g(X), where

    g(x) = b,    x > b
         = x,    −b < x ≤ b
         = −b,   x ≤ −b.

Both F_X and y = g(x) are illustrated by Figure 4-2.

Case y ≥ b: For this case we have g(x) ≤ y for all x. Therefore, F_Y(y) = 1 for y ≥ b.

Case −b ≤ y < b: For −b ≤ y < b, we have g(x) ≤ y for x ≤ y. Hence, F_Y(y) = P[Y = g(X) ≤ y] = F_X(y), −b ≤ y < b.

Case y < −b: For y < −b, we have g(x) ≤ y for NO x. Hence, F_Y(y) = 0, y < −b.

[Figure 4-2: Transformation y = g(x) and distribution F_X(x) used in Example 4-3.]

The result of these cases is shown by Figure 4-3.
[Figure 4-3: Resulting distribution F_Y(y) for Example 4-3, with jumps of size F_X(−b) at y = −b and 1 − F_X(b⁻) at y = b.]

Watch for points where g(x) has a jump discontinuity. As shown by the following two examples, special care may be required when dealing with functions that have jump discontinuities.

Example 4-4: In terms of F_X(x), find the distribution F_Y(y) for Y = g(X), where

    g(x) = x + c,  x ≥ 0
         = x − c,  x < 0,

as depicted by Figure 4-4.

[Figure 4-4: Transformation for Example 4-4.]

Case y ≥ c: If y ≥ c, then g(x) ≤ y for x ≤ y − c. Hence, F_Y(y) = F_X(y − c) for y ≥ c.

Case −c ≤ y < c: If −c ≤ y < c, then g(x) ≤ y for x < 0. Hence, F_Y(y) = P[X < 0] = F_X(0⁻) for −c ≤ y < c.

Case y < −c: If y < −c, then g(x) ≤ y for x ≤ y + c. Hence, F_Y(y) = F_X(y + c) for y < −c.

Example 4-5: In terms of F_X(x), find the distribution F_Y(y) for Y = g(X), where

    g(x) = x + c,  x > 0
         = x − c,  x ≤ 0,

as depicted by Figure 4-5.

Case y ≥ c: If y ≥ c, then g(x) ≤ y for x ≤ y − c. Hence, F_Y(y) = F_X(y − c) for y ≥ c.

Case −c ≤ y < c: If −c ≤ y < c, then g(x) ≤ y for x ≤ 0. Hence, F_Y(y) = P[X ≤ 0] = F_X(0) for −c ≤ y < c.

Case y < −c: If y < −c, then g(x) ≤ y for x ≤ y + c. Hence, F_Y(y) = F_X(y + c) for y < −c.

Notice that there is only a subtle difference between the previous two examples. In fact, if F_X(x) is continuous at x = 0, then F_Y(y) is the same for the previous two examples.

[Figure 4-5: Transformation for Example 4-5.]
Determination of f_Y in terms of f_X

Determine the density f_Y(y) of Y = g(X) in terms of the density f_X(x) of X. To accomplish this, we solve the equation y = g(x) for x in terms of y. If g has an inverse, then we can solve for a unique x in terms of y (x = g⁻¹(y)). Otherwise, we will have to do it in segments. That is, x_1(y), x_2(y), ... , x_n(y) can be found (as solutions, or roots, of y = g(x)) such that

    y = g(x_1(y)) = g(x_2(y)) = g(x_3(y)) = ... = g(x_n(y)).    (4-5)

Note that x_1 through x_n are functions of y. The range of each x_i(y) covers part of the domain of g(x). The union of the ranges of x_i(y), 1 ≤ i ≤ n, covers all, or part of, the domain of g(x). The desired f_Y(y) is

    f_Y(y) = f_X(x_1)/|g′(x_1)| + f_X(x_2)/|g′(x_2)| + ... + f_X(x_n)/|g′(x_n)|,    (4-6)

where g′(x) denotes the derivative of g(x).

We establish this result for the function y = g(x) that is depicted by Figure 4-6, a simple example where n = 2. The extension to the general case is obvious.

[Figure 4-6: Transformation y = g(x) = x², with roots x_1, x_2 and increments Δx_1, Δx_2 corresponding to the increment Δy.]

    P(y < Y ≤ y + Δy) = ∫_y^{y+Δy} f_Y(ξ) dξ ≈ f_Y(y) Δy

for small Δy (increments Δx_1, Δx_2 and Δy are defined to be positive). Similarly,

    P(x_1 − Δx_1 < X ≤ x_1) ≈ f_X(x_1) Δx_1
    P(x_2 < X ≤ x_2 + Δx_2) ≈ f_X(x_2) Δx_2.    (4-7)

    P(y < Y ≤ y + Δy) = P(x_1 − Δx_1 < X ≤ x_1) + P(x_2 < X ≤ x_2 + Δx_2)
    f_Y(y) Δy ≈ f_X(x_1) Δx_1 + f_X(x_2) Δx_2
    f_Y(y) ≈ f_X(x_1)/(Δy/Δx_1) + f_X(x_2)/(Δy/Δx_2).    (4-8)

Now, let the increments approach zero. The positive quantities Δy/Δx_1 and Δy/Δx_2 approach

    Δy/Δx_1 → |dg(x_1)/dx|  and  Δy/Δx_2 → |dg(x_2)/dx|  as Δx_1, Δx_2, Δy → 0.    (4-9)

This leads to the desired result

    f_Y(y) = f_X(x_1)/|dg(x_1)/dx| + f_X(x_2)/|dg(x_2)/dx|.    (4-10)
Example 4-6: Consider Y = aX² where a > 0. If y < 0, then y = ax² has no real solutions and

    f_Y(y) = 0,  y < 0.    (4-11)

If y > 0, then y = ax² has solutions x_1 = √(y/a) and x_2 = −√(y/a). Also, note that g′(x) = 2ax. Hence,

    f_Y(y) = f_X(x_1)/|dg(x_1)/dx| + f_X(x_2)/|dg(x_2)/dx|
           = f_X(√(y/a)) / (2a√(y/a)) + f_X(−√(y/a)) / (2a√(y/a)),  y > 0
           = 0,  y < 0.    (4-12)

To see a specific example, assume that X is Rayleigh distributed with parameter σ. The density for X is given by (2-18); substitute this density into (4-12) (only the positive root contributes, since the Rayleigh density is zero for negative arguments) to obtain

    f_Y(y) = (1/(2σ²a)) exp[−y/(2σ²a)] U(y),    (4-13)

which is the density for an exponential random variable with parameter λ = 1/(2σ²a), as can be seen from inspection of (2-19). Hence, the square of a Rayleigh random variable produces an exponential random variable.
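A simulation sketch of this last conclusion, assuming NumPy (σ = 1 and a = 2 are arbitrary illustration values): squaring Rayleigh samples and scaling by a should produce samples whose mean and variance match those of an exponential random variable with λ = 1/(2σ²a).

import numpy as np

rng = np.random.default_rng(5)
sigma, a = 1.0, 2.0                                          # illustration values
X = sigma * np.sqrt(-2.0 * np.log(rng.random(1_000_000)))    # Rayleigh(sigma) via inverse transform
Y = a * X**2

lam = 1.0 / (2.0 * sigma**2 * a)        # exponential parameter predicted by (4-13)
print(Y.mean(), 1.0 / lam)              # exponential mean is 1/lambda
print(Y.var(), 1.0 / lam**2)            # exponential variance is 1/lambda^2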
Expected Value of Transformed Random Variable

Given random variable X, with density f_X(x), and a function g(x), we form the random variable Y = g(X). We know that

    η_Y = E[Y] = ∫_{−∞}^{∞} y f_Y(y) dy.    (4-14)

This requires knowledge of f_Y(y). We can express η_Y directly in terms of g(x) and f_X(x).

Theorem 4-1: Let X be a random variable and y = g(x) a function. The expected value of Y = g(X) can be expressed as

    η_Y = E[Y] = E[g(X)] = ∫_{−∞}^{∞} g(x) f_X(x) dx.    (4-15)

[Figure 4-7: Transformation y = g(x) = x² used in the discussion of Theorem 4-1.]

To see this, consider the following example that is illustrated by Figure 4-7. Recall that

    f_Y(y) Δy ≈ f_X(x_1) Δx_1 + f_X(x_2) Δx_2.

Multiply this expression by y = g(x_1) = g(x_2) to obtain

    y f_Y(y) Δy ≈ g(x_1) f_X(x_1) Δx_1 + g(x_2) f_X(x_2) Δx_2.    (4-16)

Now, partition the y-axis as 0 = y_0 < y_1 < y_2 < ..., where Δy = y_{k+1} − y_k, k = 0, 1, 2, ... . By the mappings x_1 = −√y and x_2 = √y, this leads to a partition x_{1k}, k = 0, 1, 2, ..., of the negative x-axis and a partition x_{2k}, k = 0, 1, 2, ..., of the positive x-axis. Sum both sides over these partitions and obtain

    Σ_{k≥0} y_k f_Y(y_k) Δy ≈ Σ_{k≥0} g(x_{1k}) f_X(x_{1k}) Δx_{1k} + Σ_{k≥0} g(x_{2k}) f_X(x_{2k}) Δx_{2k}.    (4-17)

Let Δy → 0, Δx_{1k} → 0 and Δx_{2k} → 0 to obtain

    ∫_0^{∞} y f_Y(y) dy = ∫_{−∞}^{0} g(x) f_X(x) dx + ∫_0^{∞} g(x) f_X(x) dx = ∫_{−∞}^{∞} g(x) f_X(x) dx,    (4-18)

the desired result. Observe that this argument can be applied to practically any function y = g(x).
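A small sketch of Theorem 4-1, assuming NumPy (X ~ N(0,1) and g(x) = x² are illustration choices): compare a Monte Carlo estimate of E[g(X)] with a numerical evaluation of the integral in (4-15); no knowledge of f_Y is needed.

import numpy as np

rng = np.random.default_rng(6)
g = lambda x: x**2                                        # illustration choice of g
f_X = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)    # N(0,1) density

# Monte Carlo estimate of E[g(X)]
X = rng.standard_normal(1_000_000)
mc = g(X).mean()

# Numerical evaluation of (4-15) on a fine grid
x = np.linspace(-10, 10, 200_001)
dx = x[1] - x[0]
quad = np.sum(g(x) * f_X(x)) * dx

print(mc, quad)                          # both close to 1 for this g and f_X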
Example 4-7: Let X be N(0, σ) and let Y = Xⁿ. Find E[Y]. For n even (i.e., n = 2k) we know that E[Xⁿ] = E[|X|ⁿ] = 1·3·5···(n − 1)·σⁿ. For odd n (i.e., n = 2k + 1) write

    E[|X|^(2k+1)] = ∫_{−∞}^{∞} |x|^(2k+1) f(x) dx = (2/√(2πσ²)) ∫_0^{∞} x^(2k+1) exp[−x²/2σ²] dx.    (4-19)

Change variables: let y = x²/2σ², dy = (x/σ²) dx, and obtain

    E[|X|^(2k+1)] = (2/√(2πσ²)) (2σ²)^k σ² ∫_0^{∞} y^k e^{−y} dy = √(2/π) 2^k σ^(2k+1) ∫_0^{∞} y^k e^{−y} dy.    (4-20)

However, from known results on the Gamma function, we have

    Γ(k + 1) = ∫_0^{∞} y^k e^{−y} dy = k!.    (4-21)

Now, use (4-21) in (4-20) to obtain

    E[|X|ⁿ] = (1/√(2πσ²)) ∫_{−∞}^{∞} |x|ⁿ exp[−x²/2σ²] dx
            = 1·3·5···(n − 1) σⁿ,        n = 2k (n even)
            = √(2/π) 2^k k! σⁿ,          n = 2k + 1 (n odd),    (4-22)

for a zero-mean Gaussian random variable X.
Approximate Mean of g(X)

Let X be a random variable and y = g(x) a function. The expected value of g(X) can be expressed as

    E[g(X)] = ∫_{−∞}^{∞} g(x) f_X(x) dx.    (4-23)

To approximate this, expand g(x) in a Taylor's series around the mean η to obtain

    g(x) = g(η) + g′(η)(x − η) + ... + g^(n)(η)(x − η)ⁿ/n! + ... .    (4-24)

Use this expansion in the expected-value calculation to obtain

    E[g(X)] = ∫_{−∞}^{∞} g(x) f_X(x) dx
            = ∫_{−∞}^{∞} { g(η) + g′(η)(x − η) + ... + g^(n)(η)(x − η)ⁿ/n! + ... } f_X(x) dx
            = g(η) + g″(η) μ_2/2! + g^(3)(η) μ_3/3! + ... + g^(n)(η) μ_n/n! + ... ,    (4-25)

where μ_k = E[(X − η)^k] denotes the k-th central moment of X. An approximation to E[g(X)] can be based on this formula; just compute a finite number of terms in the expansion.
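A sketch of the two-term approximation E[g(X)] ≈ g(η) + g″(η)σ²/2 suggested by (4-25), assuming NumPy and taking g(x) = e^x with X ~ N(1, 0.2²) purely for illustration (the exact value E[e^X] = exp(η + σ²/2) is available here as a check).

import numpy as np

eta, sigma = 1.0, 0.2                   # illustration values
g = np.exp                              # g(x) = e^x; all derivatives equal g

approx = g(eta) + g(eta) * sigma**2 / 2        # first two nonzero terms of (4-25)
exact = np.exp(eta + sigma**2 / 2)             # known mean of a lognormal
rng = np.random.default_rng(7)
mc = g(eta + sigma * rng.standard_normal(1_000_000)).mean()

print(approx, exact, mc)                # all three close for small sigma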
Characteristic Functions

The characteristic function of a random variable X is

    Φ(ω) = ∫_{−∞}^{∞} f_X(x) e^{jωx} dx = E[e^{jωX}].    (4-26)

The characteristic function is complex valued with

    |Φ(ω)| = | ∫_{−∞}^{∞} f_X(x) e^{jωx} dx | ≤ ∫_{−∞}^{∞} f_X(x) dx = 1.    (4-27)

Note that Φ(ω) is the Fourier transform of f_X(x) (with the sign of the exponent reversed), so we can write

    Φ(ω) = F[f_X(x)] with ω replaced by −ω,    (4-28)

and

    f_X(x) = (1/2π) ∫_{−∞}^{∞} Φ(ω) e^{−jωx} dω.    (4-29)

Definition (4-26) takes the form of a sum when X is a discrete random variable. Suppose that X takes on the values x_i with probabilities p_i = P[X = x_i] for index i in some index set I (i ∈ I). Then the characteristic function of X is

    Φ(ω) = ∫_{−∞}^{∞} f_X(x) e^{jωx} dx = Σ_{i∈I} p_i exp[jωx_i].    (4-30)

Due to the delta functions in density f_X(x), the integral in (4-30) becomes a sum.
Example 4-8: Consider the Gaussian density function

    f_X(x) = (1/√(2πσ²)) e^{−x²/2σ²}.    (4-31)

The Fourier transform of f_X is F[f_X(x)] = exp[−σ²ω²/2], as given in common tables. Hence,

    Φ(ω) = F[f_X(x)] with ω replaced by −ω = e^{−σ²ω²/2}.    (4-32)

If f_X(x) = (1/√(2πσ²)) e^{−(x−η)²/2σ²}, then

    Φ(ω) = F[(1/√(2πσ²)) e^{−(x−η)²/2σ²}] with ω replaced by −ω = e^{jωη} e^{−σ²ω²/2}.    (4-33)
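A quick numerical sketch of (4-32), assuming NumPy (σ = 1.5 and ω = 0.8 are chosen only for illustration): estimate Φ(ω) = E[e^{jωX}] as a sample average and compare with exp(−σ²ω²/2).

import numpy as np

rng = np.random.default_rng(8)
sigma, omega = 1.5, 0.8                        # illustration values
X = sigma * rng.standard_normal(1_000_000)

phi_hat = np.mean(np.exp(1j * omega * X))      # sample estimate of E[exp(j*omega*X)]
phi_formula = np.exp(-sigma**2 * omega**2 / 2) # Equation (4-32)
print(phi_hat, phi_formula)                    # real parts agree; imaginary part near 0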
Example 4-9: Let random variable N be Poisson with parameter λ. That is,

    P[N = n] = e^{−λ} λⁿ/n!,  n = 0, 1, 2, ... .    (4-34)

From (4-30), we can write

    Φ(ω) = Σ_{n=0}^{∞} e^{−λ} (λⁿ/n!) exp[jωn] = e^{−λ} Σ_{n=0}^{∞} (λ e^{jω})ⁿ/n! = e^{−λ} exp[λ e^{jω}]
         = exp[λ(e^{jω} − 1)]    (4-35)

as the characteristic function of a Poisson random variable.
Multiple Dimension Case

The joint characteristic function Φ_XY(ω_1, ω_2) of random variables X and Y is defined as

    Φ_XY(ω_1, ω_2) = E[exp{j(ω_1 X + ω_2 Y)}] = Σ_i Σ_k e^{j(ω_1 x_i + ω_2 y_k)} P[X = x_i, Y = y_k]    (4-36)

for the discrete case and

    Φ_XY(ω_1, ω_2) = E[exp{j(ω_1 X + ω_2 Y)}] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} e^{j(ω_1 x + ω_2 y)} f_XY(x, y) dx dy    (4-37)

for the continuous case. Equation (4-37) is recognized as the two-dimensional Fourier transform (with the sign of j reversed) of f_XY(x, y). Generalizing these definitions, we can define the joint characteristic function of n random variables X_1, X_2, ... , X_n as

    Φ_{X_1 ... X_n}(ω_1, ... , ω_n) = E[exp{jω_1 X_1 + ... + jω_n X_n}].    (4-38)

Equation (4-38) can be simplified using vector notation. Define the two vectors

    ω = [ω_1  ω_2  ...  ω_n]^T,    X = [X_1  X_2  ...  X_n]^T.    (4-39)

Then, we can write the n-dimensional characteristic function in the compact form

    Φ_X(ω) = E[ e^{j ω^T X} ].    (4-40)

Equations (4-38) and (4-40) convey the same information; however, (4-40) is much easier to write and work with.
Characteristic Function for the Multi-dimensional Gaussian Case

Let X = [X_1  X_2  ...  X_n]^T be a Gaussian random vector with mean η = E[X] and covariance matrix Λ. Let ω = [ω_1  ω_2  ...  ω_n]^T be a vector of n algebraic variables. Note that

    ω^T X = [ω_1  ω_2  ...  ω_n] [X_1  X_2  ...  X_n]^T = Σ_{k=1}^{n} ω_k X_k    (4-41)

is a scalar. The characteristic function of X is given as

    Φ_X(ω) = E[ exp(j ω^T X) ] = exp[ j ω^T η − (1/2) ω^T Λ ω ].    (4-42)
Application: Transformation of Random Variables

Sometimes, the characteristic function can be used to determine the density of random variable Y = g(X) in terms of the density of X. To see this, consider

    Φ_Y(ω) = E[e^{jωY}] = E[e^{jωg(X)}] = ∫_{−∞}^{∞} e^{jωg(x)} f_X(x) dx.    (4-43)

If a change of variable y = g(x) can be made (usually, this requires g to have an inverse), this last integral will have the form

    Φ_Y(ω) = ∫_{−∞}^{∞} e^{jωy} h(y) dy.    (4-44)

The desired result f_Y(y) = h(y) follows (by uniqueness of the Fourier transform).

Example 4-10: Suppose X is N(0, σ) and Y = aX². Then

    Φ_Y(ω) = E[e^{jωY}] = E[e^{jωaX²}] = ∫_{−∞}^{∞} e^{jωax²} f_X(x) dx = (2/√(2πσ²)) ∫_0^{∞} e^{jωax²} e^{−x²/2σ²} dx.

For 0 ≤ x < ∞, note that the transformation y = ax² is one-to-one. Hence, make the change of variable y = ax², dy = (2ax) dx = 2√(ay) dx to obtain

    Φ_Y(ω) = (2/√(2πσ²)) ∫_0^{∞} e^{jωy} e^{−y/2σ²a} dy/(2√(ay)) = ∫_0^{∞} e^{jωy} [ e^{−y/2σ²a} / (σ√(2πay)) ] dy.

Hence, we have

    f_Y(y) = ( e^{−y/2σ²a} / (σ√(2πay)) ) U(y).
Moment Generating Function

The moment generating function is

    Φ(s) = ∫_{−∞}^{∞} f_X(x) e^{sx} dx = E[e^{sX}].    (4-45)

The n-th derivative of Φ is

    dⁿΦ(s)/dsⁿ = ∫_{−∞}^{∞} xⁿ f_X(x) e^{sx} dx = E[Xⁿ e^{sX}],    (4-46)

so that

    dⁿΦ(s)/dsⁿ evaluated at s = 0 is E[Xⁿ] = m_n.    (4-47)

Example 4-11: Suppose X has an exponential density f_X(x) = λ e^{−λx} U(x). Then the moment generating function is

    Φ(s) = λ ∫_0^{∞} e^{−λx} e^{sx} dx = λ/(λ − s).

This can be differentiated to obtain

    dΦ(s)/ds |_{s=0} = 1/λ = E[X]
    d²Φ(s)/ds² |_{s=0} = 2/λ² = E[X²].

From this, we can compute the variance as

    σ² = E[X²] − (E[X])² = 2/λ² − 1/λ² = 1/λ².
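A symbolic sketch of Example 4-11, assuming SymPy is available: differentiate Φ(s) = λ/(λ − s) and evaluate at s = 0, reproducing E[X] = 1/λ and E[X²] = 2/λ² as in (4-47).

import sympy as sp

s, lam = sp.symbols('s lam', positive=True)
Phi = lam / (lam - s)                      # MGF of the exponential density

m1 = sp.diff(Phi, s, 1).subs(s, 0)         # E[X]
m2 = sp.diff(Phi, s, 2).subs(s, 0)         # E[X^2]
var = sp.simplify(m2 - m1**2)

print(m1, m2, var)                         # 1/lam, 2/lam**2, 1/lam**2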
Theorem 4-2

Let X and Y be independent random variables. Let g(x) and h(y) be arbitrary functions. Define the transformed random variables

    Z = g(X)
    W = h(Y).    (4-48)

Random variables Z and W are independent.

Proof: Define

    A_z = {x : g(x) ≤ z}
    B_w = {y : h(y) ≤ w}.

Then the joint distribution of Z and W is

    F_ZW(z, w) = P[Z ≤ z, W ≤ w] = P[g(X) ≤ z, h(Y) ≤ w] = P[X ∈ A_z, Y ∈ B_w].

However, due to independence of X and Y,

    F_ZW(z, w) = P[X ∈ A_z, Y ∈ B_w] = P[X ∈ A_z] P[Y ∈ B_w]
               = P[g(X) ≤ z] P[h(Y) ≤ w] = P[Z ≤ z] P[W ≤ w]
               = F_Z(z) F_W(w),

so that Z and W are independent.
One Function of Two Random Variables

Given random variables X and Y and a function z = g(x, y), we form the new random variable

    Z = g(X, Y).    (4-49)

We want to find the density and distribution of Z in terms of like quantities for X and Y. For real z, denote D_z as

    D_z = {(x, y) : g(x, y) ≤ z}.    (4-50)

Now, note that D_z satisfies

    {Z ≤ z} = {g(X, Y) ≤ z} = {(X, Y) ∈ D_z},    (4-51)

so that

    F_Z(z) = P[Z ≤ z] = P[(X, Y) ∈ D_z] = ∫∫_{D_z} f_XY(x, y) dx dy.    (4-52)

Thus, to find F_Z it suffices to find region D_z for every z and then evaluate the above integral.
Example 4-12: Consider the function Z = X + Y. The distribution F_Z can be represented as

    F_Z(z) = ∫∫_{x+y≤z} f_XY(x, y) dx dy.

In this integral, the region of integration is depicted by the shaded area shown on Figure 4-8.

[Figure 4-8: Integrate over the half-plane x + y ≤ z to obtain F_Z.]

Now, we can write

    F_Z(z) = ∫_{−∞}^{∞} ∫_{−∞}^{z−y} f_XY(x, y) dx dy.

By using Leibnitz's rule (see below) for differentiating an integral, we get the density

    f_Z(z) = (d/dz) F_Z(z) = (d/dz) ∫_{−∞}^{∞} ∫_{−∞}^{z−y} f_XY(x, y) dx dy = ∫_{−∞}^{∞} f_XY(z − y, y) dy.

Leibnitz's Rule: Consider the function of t defined by

    F(t) = ∫_{a(t)}^{b(t)} ρ(x, t) dx.

Note that the t variable appears in the integrand and in the limits. Leibnitz's rule states that

    dF(t)/dt = ∫_{a(t)}^{b(t)} ∂ρ(x, t)/∂t dx + ρ(b(t), t) db(t)/dt − ρ(a(t), t) da(t)/dt.

Special Case: X and Y Independent.

Assume that X and Y are independent. Then f_XY(z − y, y) = f_X(z − y) f_Y(y), and the previous result becomes

    f_Z(z) = ∫_{−∞}^{∞} f_X(z − y) f_Y(y) dy,    (4-53)

the convolution of f_X and f_Y.
Example 4-13: Consider independent random variables X and Y with densities shown by Figure 4-9: f_X(x) = e^{−x} U(x), and f_Y(y) = 1 for −1/2 ≤ y ≤ 1/2 (zero elsewhere). Find the density f_Z that describes the random variable Z = X + Y.

[Figure 4-9: Density functions f_X(x) = e^{−x}U(x) and the uniform f_Y(y) on (−1/2, 1/2) used in Example 4-13.]

CASE I: z < −1/2 (see Figure 4-10). There is no overlap between f_X(z − y) and f_Y(y), so f_Z(z) = 0 for z < −1/2.

CASE II: −1/2 < z < 1/2 (see Figure 4-11).

    f_Z(z) = ∫_{−1/2}^{z} e^{−(z−y)} dy = 1 − e^{−(z+1/2)},  −1/2 < z < 1/2.

CASE III: 1/2 < z (see Figure 4-12).

    f_Z(z) = ∫_{−1/2}^{1/2} e^{−(z−y)} dy = [e^{1/2} − e^{−1/2}] e^{−z},  1/2 < z.

[Figures 4-10, 4-11, 4-12: Relative positions of f_Y(y) and the shifted f_X(z − y) for the three cases.]

As shown by Figure 4-13, the final result is

    f_Z(z) = 0,                            z < −1/2
           = 1 − e^{−(z+1/2)},             −1/2 ≤ z < 1/2
           = [e^{1/2} − e^{−1/2}] e^{−z},  1/2 ≤ z.

[Figure 4-13: Plot of the density f_Z(z) obtained in Example 4-13.]
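A numerical check of the Case II/III formulas above, sketched assuming NumPy: approximate the convolution (4-53) on a grid and compare it with the closed-form f_Z at a couple of points (the grid spacing and test points are arbitrary choices).

import numpy as np

dx = 0.001
x = np.arange(-2, 10, dx)
f_X = np.where(x >= 0, np.exp(-x), 0.0)               # e^{-x} U(x)
f_Y = np.where(np.abs(x) <= 0.5, 1.0, 0.0)            # uniform on (-1/2, 1/2)

f_Z = np.convolve(f_X, f_Y) * dx                      # discrete approximation of (4-53)
z = np.arange(len(f_Z)) * dx + 2 * x[0]               # abscissa for the convolution result

def f_Z_exact(z):
    if z < -0.5:
        return 0.0
    if z < 0.5:
        return 1.0 - np.exp(-(z + 0.5))
    return (np.exp(0.5) - np.exp(-0.5)) * np.exp(-z)

for zt in (0.0, 1.0):
    print(np.interp(zt, z, f_Z), f_Z_exact(zt))       # numeric vs. closed form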
Example 4-14: Let X and Y be random variables. Consider the transformation

    Z = X/Y.

For this transformation, we have

    D_z = {(x, y) : x/y ≤ z},

the shaded region on the plot depicted by Figure 4-14. Now, compute the distribution

    F_Z(z) = ∫_0^{∞} ∫_{−∞}^{yz} f_XY(x, y) dx dy + ∫_{−∞}^{0} ∫_{yz}^{∞} f_XY(x, y) dx dy.

The density f_Z(z) is found by differentiating F_Z to obtain

    f_Z(z) = (d/dz) F_Z(z) = ∫_0^{∞} y f_XY(yz, y) dy − ∫_{−∞}^{0} y f_XY(yz, y) dy = ∫_{−∞}^{∞} |y| f_XY(yz, y) dy.

[Figure 4-14: The region D_z = {(x, y) : x/y ≤ z}, bounded by the line x = yz, over which f_XY is integrated in Example 4-14 (drawn for z > 0).]
Example 4-15: Consider the transformation Z = √(X² + Y²). For this transformation, the region D_z is given by

    D_z = {(x, y) : √(x² + y²) ≤ z} = {(x, y) : x² + y² ≤ z²},    (4-54)

the interior of a circle of radius z > 0. Hence, we can write

    F_Z(z) = P[Z ≤ z] = P[(X, Y) ∈ D_z] = ∫∫_{D_z} f_XY(x, y) dx dy.    (4-55)

Now, suppose X and Y are independent, jointly Gaussian random variables with

    f_XY(x, y) = (1/2πσ²) exp[ −(x² + y²)/2σ² ].    (4-56)

Substitute (4-56) into (4-55) to obtain

    F_Z(z) = (1/2πσ²) ∫∫_{D_z} exp[ −(x² + y²)/2σ² ] dx dy.

To integrate this, use Figure 4-15, and transform from rectangular to polar coordinates:

    r = √(x² + y²),  r ≥ 0
    θ = tan⁻¹(y/x),  −π < θ ≤ π
    dA = r dr dθ.

[Figure 4-15: Rectangular-to-polar transformation x = r cos θ, y = r sin θ, with a cut-away view detailing the differential area dA = r dr dθ.]

The change to polar coordinates yields

    F_Z(z) = (1/2πσ²) ∫_0^{2π} ∫_0^{z} exp[ −r²/2σ² ] r dr dθ.

The integrand does not depend on θ, so the integral over θ is elementary. For the integral over r, let u = r²/2σ² and du = (r/σ²) dr to obtain

    F_Z(z) = (1/σ²) ∫_0^{z} exp[ −r²/2σ² ] r dr = ∫_0^{z²/2σ²} e^{−u} du = 1 − e^{−z²/2σ²},  z ≥ 0,

so that

    f_Z(z) = (d/dz) F_Z(z) = (z/σ²) e^{−z²/2σ²},  z ≥ 0,

a Rayleigh density with parameter σ. Hence, if X and Y are identically distributed, independent Gaussian random variables, then Z = √(X² + Y²) is Rayleigh distributed.
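A simulation sketch of Example 4-15, assuming NumPy (σ = 2 is an arbitrary illustration value): compare the empirical distribution of Z = √(X² + Y²) with F_Z(z) = 1 − exp(−z²/2σ²).

import numpy as np

rng = np.random.default_rng(9)
sigma = 2.0                                   # illustration value
X = sigma * rng.standard_normal(1_000_000)
Y = sigma * rng.standard_normal(1_000_000)
Z = np.sqrt(X**2 + Y**2)

for z in (1.0, 2.0, 4.0):                     # a few test points
    F_empirical = np.mean(Z <= z)
    F_rayleigh = 1.0 - np.exp(-z**2 / (2 * sigma**2))
    print(z, F_empirical, F_rayleigh)         # the two columns should agree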
Two Functions of Two Random Variables

Given random variables X and Y and functions z = g(x, y), w = h(x, y), we form the new random variables

    Z = g(X, Y)
    W = h(X, Y).    (4-57)

Express the joint statistics of Z, W in terms of functions g, h and f_XY. To accomplish this, define

    D_zw = {(x, y) : g(x, y) ≤ z, h(x, y) ≤ w}.    (4-58)

Then, the joint distribution of Z and W can be expressed as

    F_ZW(z, w) = P[(X, Y) ∈ D_zw] = ∫∫_{D_zw} f_XY(x, y) dx dy.    (4-59)
Example 4-16: Consider independent Gaussian X and Y with the joint density function

    f_XY(x, y) = (1/2πσ²) exp[ −(x² + y²)/2σ² ].

Define random variables Z and W in terms of X and Y by the transformations

    Z = √(X² + Y²)
    W = Y/X.

Find F_ZW, F_Z and F_W. First, define region D_zw as

    D_zw = {(x, y) : x² + y² ≤ z², y/x ≤ w}.

D_zw is the shaded region on Figure 4-16. The figure is drawn for the case w > 0 (the case w < 0 gives results that are identical to those given below).

[Figure 4-16: Region D_zw, bounded by the circle of radius z and the line y = xw, used to obtain F_ZW in Example 4-16 (drawn for w > 0).]

Now, integrate over D_zw to obtain

    F_ZW(z, w) = ∫∫_{D_zw} f_XY(x, y) dx dy = 2 (1/2πσ²) ∫_{−π/2}^{tan⁻¹(w)} ∫_0^{z} e^{−r²/2σ²} r dr dθ
               = ( (tan⁻¹(w) + π/2)/π ) (1/σ²) ∫_0^{z} e^{−r²/2σ²} r dr,

which leads to

    F_ZW(z, w) = ( (tan⁻¹(w) + π/2)/π ) {1 − e^{−z²/2σ²}},  z ≥ 0, −∞ < w < ∞
               = 0,                                         z < 0, −∞ < w < ∞.    (4-60)

Note that F_ZW factors into the product F_Z F_W, where

    F_Z(z) = {1 − e^{−z²/2σ²}} U(z)
    F_W(w) = (1/π) tan⁻¹(w) + 1/2,  −∞ < w < ∞.    (4-61)

Note that Z and W are independent, Z is Rayleigh distributed and W is Cauchy distributed.
Joint Density Transformations: Determine f_ZW Directly in Terms of f_XY

Let X and Y be random variables with joint density f_XY(x, y). Let

    z = g(x, y)
    w = h(x, y)    (4-62)

be (generally nonlinear) functions that relate algebraic variables x, y to the algebraic variables z, w. Also, we assume that g and h have continuous first-partial derivatives at the point (x, y) used below. Now, define the new random variables

    Z = g(X, Y)
    W = h(X, Y).    (4-63)

In this section, we provide a method for determining the joint density f_ZW(z, w) directly in terms of the known joint density f_XY(x, y).

First, consider the relatively simple case where (4-62) can be inverted. That is, it is possible to solve (4-62) for unique functions

    x = φ(z, w)
    y = ψ(z, w)    (4-64)

that give x, y in terms of z, w. Note that z = g(φ(z,w), ψ(z,w)) and w = h(φ(z,w), ψ(z,w)), since (4-64) is the inverse of (4-62). Later, we will consider the general case where the transformation cannot be inverted.

The quantity P[z < Z ≤ z + dz, w < W ≤ w + dw] is the probability that random variables Z and W lie in the infinitesimal rectangle R_1 illustrated on Figure 4-17. The area of this infinitesimal rectangle is AREA(R_1) = dzdw. The vertices of the z-w plane rectangle R_1 are the points

    P_1 = (z, w)
    P_2 = (z, w + dw)
    P_3 = (z + dz, w + dw)
    P_4 = (z + dz, w).    (4-65)

The z-w plane infinitesimal rectangle R_1 gets mapped into the x-y plane, where it shows up as parallelogram R_2. As shown on the x-y plane of Figure 4-17, to first order in dw and dz, parallelogram R_2 has the vertices given by (4-66) below.

[Figure 4-17: The (z, w) and (x, y) planes used in the transformation of two random variables. Functions φ, ψ map the rectangle R_1 in the z-w plane to the parallelogram R_2 in the x-y plane; functions g, h map from the x-y plane back to the z-w plane.]
    P_1′ = (x, y)
    P_2′ = (x + (∂φ/∂w) dw, y + (∂ψ/∂w) dw)
    P_3′ = (x + (∂φ/∂z) dz + (∂φ/∂w) dw, y + (∂ψ/∂z) dz + (∂ψ/∂w) dw)
    P_4′ = (x + (∂φ/∂z) dz, y + (∂ψ/∂z) dz).    (4-66)

The requirement that (4-64) have continuous first-partial derivatives was used to write (4-66). Note that P_1 maps to P_1′, P_2 maps to P_2′, etc. (it is easy to show that P_2′ − P_1′ = P_3′ − P_4′ and P_4′ − P_1′ = P_3′ − P_2′, so that we have a parallelogram in the x-y plane). Denote the area of the x-y plane parallelogram R_2 as AREA(R_2).

If random variables Z, W fall in the z-w plane infinitesimal rectangle R_1, then the random variables X, Y must fall in the x-y plane parallelogram R_2, and vice-versa. In fact, we can claim

    P[z < Z ≤ z + dz, w < W ≤ w + dw] = P[x < X ≤ x + dx, y < Y ≤ y + dy]
    f_ZW(z, w) dz dw ≈ f_XY(x, y) dx dy
    f_ZW(z, w) AREA(R_1) ≈ f_XY(x, y) AREA(R_2),    (4-67)

where the approximation becomes exact as dz and dw approach zero. Since AREA(R_1) = dzdw, Equation (4-67) yields the desired f_ZW once an expression for AREA(R_2) is obtained.

Figure 4-18 depicts the x-y plane parallelogram R_2 for which area AREA(R_2) must be obtained. This parallelogram has sides P_1′P_2′ and P_1′P_4′ (shown as vectors with arrow heads on Figure 4-18) that can be represented as

    P_1′P_4′ = (∂φ/∂z) dz î + (∂ψ/∂z) dz ĵ
    P_1′P_2′ = (∂φ/∂w) dw î + (∂ψ/∂w) dw ĵ,    (4-68)

where î and ĵ are unit vectors in the x and y directions, respectively.

[Figure 4-18: Parallelogram R_2 in the x-y plane, with sides P_1′P_2′ and P_1′P_4′.]

Now, the vector cross product of sides P_1′P_4′ and P_1′P_2′ is denoted as P_1′P_4′ × P_1′P_2′. And, the area of parallelogram R_2 is the magnitude |P_1′P_4′| |P_1′P_2′| sin(θ) = |P_1′P_4′ × P_1′P_2′|, where θ is the positive angle between the vectors. Since î × ĵ = k̂, ĵ × î = −k̂, and ĵ × ĵ = î × î = k̂ × k̂ = 0, we write

    AREA(R_2) = | P_1′P_4′ × P_1′P_2′ | = | det[ î           ĵ           k̂
                                                 (∂φ/∂z)dz   (∂ψ/∂z)dz   0
                                                 (∂φ/∂w)dw   (∂ψ/∂w)dw   0 ] |
              = | det[ ∂φ/∂z  ∂ψ/∂z
                       ∂φ/∂w  ∂ψ/∂w ] | dz dw.    (4-69)

In the literature, the last determinant on the right-hand side of (4-69) is called the Jacobian of transformation (4-64); symbolically, it is denoted as J(x, y); instead, the notation ∂(x, y)/∂(z, w) may be used. We write

    J(x, y) = ∂(x, y)/∂(z, w) = det[ ∂x/∂z  ∂x/∂w   = det[ ∂φ/∂z  ∂φ/∂w
                                     ∂y/∂z  ∂y/∂w ]        ∂ψ/∂z  ∂ψ/∂w ].    (4-70)
Finally, substitute (4-69) into (4-67), cancel out the dzdw term that is common to both sides, and obtain the desired result

    f_ZW(z, w) = f_XY(x, y) |∂(x, y)/∂(z, w)|, evaluated at x = φ(z, w), y = ψ(z, w),    (4-71)

a formula for the density f_ZW in terms of the density f_XY. It is possible to obtain (4-71) directly from the change of variable formula in multi-dimensional integrals; this fact is discussed briefly in Appendix 4A.

It is useful to think of (4-69) as

    AREA(R_2) = |∂(x, y)/∂(z, w)| AREA(R_1),    (4-72)

a relationship between AREA(R_2) and AREA(R_1). So, the Jacobian can be thought of as the area gain imposed by the transformation (the Jacobian shows how area is scaled by the transformation).

By considering the mapping of a rectangle on the x-y plane to a parallelogram on the z-w plane (i.e., in the argument just given, switch planes so that the rectangle is in the x-y plane and the parallelogram is in the z-w plane), it is not difficult to show

    f_XY(x, y) = f_ZW(z, w) |∂(z, w)/∂(x, y)|,    (4-73)

where (x, y) and (z, w) are related by (4-62) and (4-64). Now, substitute (4-73) into (4-71) to obtain

    f_ZW(z, w) = f_ZW(z, w) |∂(z, w)/∂(x, y)| |∂(x, y)/∂(z, w)|,    (4-74)

where (x, y) and (z, w) are related by (4-62) and (4-64).

Equation (4-74) leads to the conclusion

    ∂(x, y)/∂(z, w) = [ ∂(z, w)/∂(x, y) ]⁻¹,    (4-75)

where (x, y) and (z, w) are related by (4-62) and (4-64).

Sometimes, the Jacobian ∂(z, w)/∂(x, y) is easier to compute than the Jacobian ∂(x, y)/∂(z, w); Equation (4-75) tells us that the former is the numerical inverse of the latter. In terms of the Jacobian ∂(z, w)/∂(x, y), Equation (4-71) becomes

    f_ZW(z, w) = f_XY(x, y) / |∂(z, w)/∂(x, y)|, evaluated at x = φ(z, w), y = ψ(z, w),    (4-76)

which may be easier to evaluate than (4-71).
Often, the original transformation (4-62) does not have an inverse. That is, it may not be possible to find unique functions φ and ψ as described by (4-64). In this case, we must solve (4-62) for its real-valued roots x_k(z, w), y_k(z, w), 1 ≤ k ≤ n, where n > 1. These n roots depend on z and w; each of the (x_k, y_k) covers a different part of the x-y plane. Note that

    z = g(x_k, y_k),  w = h(x_k, y_k)    (4-77)

for each root, 1 ≤ k ≤ n. For this case, a simple extension of (4-71) leads to

    f_ZW(z, w) = Σ_{k=1}^{n} f_XY(x, y) |∂(x, y)/∂(z, w)|, evaluated at (x, y) = (x_k, y_k),    (4-78)

and the generalization of (4-76) is

    f_ZW(z, w) = Σ_{k=1}^{n} f_XY(x, y) / |∂(z, w)/∂(x, y)|, evaluated at (x, y) = (x_k, y_k).    (4-79)

That is, to obtain f_ZW(z, w), we should evaluate the right-hand side of (4-71) (or (4-76)) at each of the n roots x_k(z, w), y_k(z, w), 1 ≤ k ≤ n, and sum up the results.
Example 4-17: Consider the linear transformation

    z = ax + by          [z]   [a  b] [x]
    w = cx + dy,  i.e.,  [w] = [c  d] [y],

where ad − bc ≠ 0. This transformation has an inverse. It is possible to express

    [x]   [a  b]⁻¹ [z]          x = Az + Bw
    [y] = [c  d]   [w],  i.e.,  y = Cz + Dw,

where A, B, C and D are appropriate constants (can you find A, B, C and D?). Now, compute

    ∂(z, w)/∂(x, y) = det[ a  b
                           c  d ] = ad − bc.

If X and Y are random variables described by f_XY(x, y), the density function for random variables Z = aX + bY, W = cX + dY is

    f_ZW(z, w) = f_XY(Az + Bw, Cz + Dw) / |ad − bc|.
Example 4-18: Consider X, an n×1, zero-mean Gaussian random vector with positive definite covariance matrix Λ_x. Define Y = AX, where A is an n×n nonsingular matrix. As discussed previously, the density for X is

    f_x(X) = 1/( (2π)^(n/2) |Λ_x|^(1/2) ) exp[ −(1/2) X^T Λ_x⁻¹ X ].

Hence, f_Y(Y) can be expressed as

    f_Y(Y) = f_X(X) / |∂(Y)/∂(X)|, evaluated at X = A⁻¹Y,

where

    |∂(Y)/∂(X)| = |det[A]|,

the absolute value of the determinant of the matrix A. This leads to the result

    f_Y(Y) = 1/( (2π)^(n/2) |Λ_x|^(1/2) |det A| ) exp[ −(1/2) (A⁻¹Y)^T Λ_x⁻¹ (A⁻¹Y) ]
           = 1/( (2π)^(n/2) |Λ_x|^(1/2) |det A| ) exp[ −(1/2) Y^T (A⁻¹)^T Λ_x⁻¹ A⁻¹ Y ],

which can be expressed as

    f_Y(Y) = 1/( (2π)^(n/2) |Λ_Y|^(1/2) ) exp[ −(1/2) Y^T Λ_Y⁻¹ Y ],

where Λ_Y = A Λ_x A^T is the covariance of Gaussian random vector Y. This example leads to the general, very important result that a linear transformation of Gaussian random variables produces Gaussian random variables (remember this!!).
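A simulation sketch of Example 4-18, assuming NumPy (the particular Λ_x and A below are illustration values): transform zero-mean Gaussian samples by A and confirm that the sample covariance of Y = AX matches Λ_Y = A Λ_x A^T.

import numpy as np

rng = np.random.default_rng(10)
Lam_x = np.array([[2.0, 0.5], [0.5, 1.0]])     # illustration covariance
A = np.array([[1.0, 1.0], [0.0, 2.0]])         # nonsingular transformation (illustration)

X = rng.multivariate_normal([0.0, 0.0], Lam_x, size=500_000)
Y = X @ A.T                                    # each row is y = A x

Lam_y_hat = np.cov(Y, rowvar=False, bias=True)
print(Lam_y_hat)
print(A @ Lam_x @ A.T)                         # should match the sample covariance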
Example 4-19 (Polar Coordinates): Consider the transformation

    r = √(x² + y²),  0 ≤ r < ∞
    θ = tan⁻¹(y/x),  −π < θ ≤ π,

as illustrated by Figure 4-19. With the limitation of θ to the (−π, π] range, the transformation has the inverse

    x = r cos(θ)
    y = r sin(θ).

[Figure 4-19: Polar coordinate transformation used in Example 4-19.]

Compute the Jacobian

    ∂(x, y)/∂(r, θ) = det[ cos θ   −r sin θ
                           sin θ    r cos θ ] = r,

so that

    f_rθ(r, θ) = |∂(x, y)/∂(r, θ)| f_XY(x, y), evaluated at x = r cos θ, y = r sin θ
               = r f_XY(r cos θ, r sin θ)

for r > 0 and −π < θ ≤ π.
Example 4-20: Consider the random variables Z = g(X, Y) and W = h(X, Y), where

    z = g(x, y) = √(x² + y²)
    w = h(x, y) = y/x.    (4-80)

Transformation (4-80) has roots (x_1, y_1) and (x_2, y_2) given by

    x_1 = z/√(1 + w²),   y_1 = w x_1 = wz/√(1 + w²)
    x_2 = −z/√(1 + w²),  y_2 = w x_2 = −wz/√(1 + w²)    (4-81)

for −∞ < w < ∞ and z > 0; the transformation has no real roots for z < 0. A direct evaluation of the Jacobian leads to

    ∂(z, w)/∂(x, y) = det[ ∂z/∂x  ∂z/∂y   = det[ x/√(x² + y²)   y/√(x² + y²)
                           ∂w/∂x  ∂w/∂y ]        −y/x²          1/x          ],

which can be expressed as

    ∂(z, w)/∂(x, y) = (1 + y²/x²) / √(x² + y²).    (4-82)

When evaluated at both (x_1, y_1) and (x_2, y_2), the Jacobian yields

    ∂(z, w)/∂(x, y) at (x_1, y_1) = ∂(z, w)/∂(x, y) at (x_2, y_2) = (1 + w²)/z.    (4-83)

Finally, application of (4-78) leads to the desired result

    f_ZW(z, w) = ( z/(1 + w²) ) [ f_XY(x_1, y_1) + f_XY(x_2, y_2) ],  z ≥ 0, −∞ < w < ∞,    (4-84)

where (x_1, y_1) and (x_2, y_2) are given by (4-81). If, for example, X and Y are independent, zero-mean Gaussian random variables with the joint density

    f_XY(x, y) = (1/2πσ²) exp[ −(x² + y²)/2σ² ],    (4-85)

then we obtain the transformed density

    f_ZW(z, w) = [ (z/σ²) exp(−z²/2σ²) U(z) ] [ (1/π)/(1 + w²) ] = f_Z(z) f_W(w),    (4-86)

where

    f_Z(z) = (z/σ²) exp(−z²/2σ²) U(z)
    f_W(w) = (1/π)/(1 + w²).    (4-87)

Thus, random variables Z and W are independent, Z is Rayleigh, and W is Cauchy.
Linear Transformations of Gaussian Random Variables

Let y_i, 1 ≤ i ≤ n, be zero-mean, unit-variance, independent (which is equivalent to being uncorrelated in the Gaussian case) Gaussian random variables. Define the Gaussian random vector Y = [y_1  y_2  ...  y_n]^T. Note that E[Y] = 0 and the covariance matrix is Λ_y = E[YY^T] = I, an n×n identity matrix. Hence, we have

    f(Y) = 1/(2π)^(n/2) exp[ −(1/2) Y^T Y ].    (4-88)

Now, let A be an n×n nonsingular, real-valued matrix, and consider the linear transformation

    X = AY.    (4-89)

The transformation is one-to-one. For every Y there is but one X, and for every X there is but one Y = A⁻¹X. We can express the density of X in terms of the density of Y as

    f_x(X) = f_y(Y)/abs[J], evaluated at Y = A⁻¹X,    (4-90)

where

    J = det[ ∂x_1/∂y_1  ∂x_1/∂y_2  ...  ∂x_1/∂y_n
             ∂x_2/∂y_1  ∂x_2/∂y_2  ...  ∂x_2/∂y_n
              ...                        ...
             ∂x_n/∂y_1  ∂x_n/∂y_2  ...  ∂x_n/∂y_n ] = det[A] = |A| ≠ 0.    (4-91)

Hence, we have

    f_x(X) = (1/|A|) f_Y(A⁻¹X)
           = 1/( (2π)^(n/2) |A| ) exp[ −(1/2) (A⁻¹X)^T A⁻¹X ]
           = 1/( (2π)^(n/2) |A| ) exp[ −(1/2) X^T (A⁻¹)^T A⁻¹ X ],    (4-92)

which can be written as

    f_x(X) = 1/( (2π)^(n/2) |Λ_x|^(1/2) ) exp[ −(1/2) X^T Λ_x⁻¹ X ],    (4-93)

where Λ_x⁻¹ = (A⁻¹)^T A⁻¹, which leads to the requirement that

    Λ_x = AA^T.    (4-94)

Since A is nonsingular (a requirement on the selection of A), Λ_x is positive definite. In this development, we used |Λ_x| = |AA^T| = |A||A^T| = |A|², so that |A| = |Λ_x|^(1/2).

It is important to note that X = AY is zero-mean Gaussian with a covariance matrix given by Λ_x = AA^T. Note that a linear transformation of Gaussian random variables produces Gaussian random variables.
Consider the converse problem. Given zero-mean Gaussian vector X with positive definite covariance matrix Λ_x, find a nonsingular transformation matrix A so that X = AY, where Y is zero-mean Gaussian with covariance matrix Λ_y = I (identity matrix). The implication is profound: Y = A⁻¹X says that it is possible to transform a Gaussian vector with correlated entries into a Gaussian vector made with uncorrelated (and independent) random variables. We can remove correlation by properly transforming the original vector. Clearly, we must find a matrix A that satisfies

    AA^T = Λ_x.    (4-95)

The solution to this problem comes from linear algebra. Given any positive definite symmetric matrix Λ_x, there exists a nonsingular matrix P such that

    P^T Λ_x P = I,    (4-96)

which means that Λ_x = (P^T)⁻¹ P⁻¹ = (P⁻¹)^T P⁻¹ (we say that Λ_x is congruent to I). Compare this to the result given above to see that matrix A can be found by using

    A = (P⁻¹)^T = (P^T)⁻¹.    (4-97)

The procedure for finding P is simple (a numerical alternative is sketched after this list):

1) Use the given Λ_x to write the augmented matrix [ Λ_x ⋮ I ].
2) Do elementary row and column operations until the augmented matrix becomes [ I ⋮ P^T ]. The elementary operations are
   i) interchange two rows (columns)
   ii) multiply a row (column) by a scalar
   iii) add a multiple of one row (column) to another row (column).
3) Write the desired A as A = (P^T)⁻¹.
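As a numerical alternative to the row/column-reduction procedure above, here is a sketch assuming NumPy (this is not the method used in the notes, just one standard way to satisfy (4-95)): a Cholesky factor A of Λ_x obeys AA^T = Λ_x, and Y = A⁻¹X then has identity covariance. For the 2×2 covariance matrix of Example 4-21 below, the Cholesky factor coincides with the A found by row/column reduction.

import numpy as np

rng = np.random.default_rng(11)
Lam_x = np.array([[1.0, 2.0], [2.0, 5.0]])     # the covariance matrix used in Example 4-21
A = np.linalg.cholesky(Lam_x)                  # lower-triangular A with A A^T = Lam_x
print(A @ A.T)                                 # reproduces Lam_x

X = rng.multivariate_normal([0.0, 0.0], Lam_x, size=500_000)
Y = X @ np.linalg.inv(A).T                     # rows are y = A^{-1} x
print(np.cov(Y, rowvar=False, bias=True))      # close to the identity matrix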
Example 4-21: Suppose we are given the covariance matrix

    Λ_x = [ 1  2
            2  5 ].

First, write the augmented matrix

    [ Λ_x ⋮ I ] = [ 1  2  ⋮  1  0
                    2  5  ⋮  0  1 ].

1) Add to the 2nd row (−2) times the first row to obtain

    [ 1  2  ⋮   1  0
      0  1  ⋮  −2  1 ].

2) Add to the 2nd column (−2) times the first column to obtain

    [ 1  0  ⋮   1  0
      0  1  ⋮  −2  1 ] = [ I ⋮ P^T ].

3) P^T = [ 1  0
          −2  1 ].

4) A = (P^T)⁻¹ = [ 1  0
                   2  1 ].

Check results: is P^T Λ_x P = I? (Yes!) Check results: is AA^T = Λ_x? (Yes!)
Example 4-22: Consider the covariance matrix

    Λ_x = [ 2  0   3
            0  1   0
            3  0  10 ].

Now, write the augmented matrix

    [ Λ_x ⋮ I ] = [ 2  0   3  ⋮  1  0  0
                    0  1   0  ⋮  0  1  0
                    3  0  10  ⋮  0  0  1 ].

Add to the 3rd row (−3/2) times the 1st row. Add to the 3rd column (−3/2) times the 1st column:

    [ 2  0  0     ⋮   1    0  0
      0  1  0     ⋮   0    1  0
      0  0  11/2  ⋮  −3/2  0  1 ].

Multiply the 1st row by 1/√2. Multiply the 1st column by 1/√2:

    [ 1  0  0     ⋮  1/√2  0  0
      0  1  0     ⋮   0    1  0
      0  0  11/2  ⋮  −3/2  0  1 ].

Multiply the 3rd row by √(2/11). Multiply the 3rd column by √(2/11):

    [ 1  0  0  ⋮  1/√2            0  0
      0  1  0  ⋮   0              1  0
      0  0  1  ⋮  −(3/2)√(2/11)   0  √(2/11) ] = [ I ⋮ P^T ].

Finally, compute

    A = (P^T)⁻¹ = [ √2    0  0
                    0     1  0
                    3/√2  0  √(11/2) ].

Check results: Λ_x = AA^T? (YES!)
Appendix 4A: Change of Variable Formula in a Double Integral

Consider the transformation

    z = g(x, y)
    w = h(x, y)    (4A.1)

between the (x, y) plane and the (z, w) plane. Assume that (4A.1) has the inverse

    x = φ(z, w)
    y = ψ(z, w).    (4A.2)

Assume that (4A.1) and (4A.2) have continuous first-partial derivatives. Define the Jacobian

    ∂(z, w)/∂(x, y) = ∂(g, h)/∂(x, y) = det[ ∂z/∂x  ∂z/∂y   = det[ ∂g/∂x  ∂g/∂y
                                             ∂w/∂x  ∂w/∂y ]        ∂h/∂x  ∂h/∂y ].    (4A.3)

As discussed in the class notes, this Jacobian relates incremental regions in the (z, w) and (x, y) planes.

Consider regions S_1 and S_2 in the (z, w) and (x, y) planes, respectively, as abstractly depicted by Figure 4A-1. Suppose we are interested in integrating some function f(z, w) over region S_1. By a change of variable, this integral can be performed over region S_2. In fact, the celebrated change of variable formula of multi-dimensional calculus states that

    ∫∫_{S_1} f(z, w) dz dw = ∫∫_{S_2} f(g(x, y), h(x, y)) |∂(g, h)/∂(x, y)| dx dy.    (4A.4)

As the examples that follow show, integration over S_2 can be easier than integration over S_1.

[Figure 4A-1: Mapping from region S_1 in the (z, w) plane to region S_2 in the (x, y) plane; φ, ψ map from the (z, w) plane to the (x, y) plane, and g, h map back.]

Example 4A-1: In rectangular x-y coordinates, a circle of radius r is described by the equation x² + y² = r². Find the area of this circle. In rectangular coordinates, the area is computed as

    Area = ∫_{−r}^{r} ∫_{−√(r²−x²)}^{√(r²−x²)} dy dx,    (4A.5)

a result that can be integrated in closed form. A simplification can be achieved by changing to polar coordinates and exploiting the obvious circular symmetry. We use the transformation

    ρ = √(x² + y²),  0 ≤ ρ
    θ = tan⁻¹(y/x),  −π < θ ≤ π,    (4A.6)

which has the inverse

    x = ρ cos θ
    y = ρ sin θ.    (4A.7)

The Jacobian is

    ∂(x, y)/∂(ρ, θ) = det[ cos θ  −ρ sin θ
                           sin θ   ρ cos θ ] = ρ.    (4A.8)

From (4A.4), we can evaluate

    Area = ∫_{−r}^{r} ∫_{−√(r²−x²)}^{√(r²−x²)} dy dx = ∫_0^{2π} ∫_0^{r} |∂(x, y)/∂(ρ, θ)| dρ dθ = 2π ∫_0^{r} ρ dρ = πr².    (4A.9)
Example 4A-2: Several important problems in probability theory involve an integral of the form

    ∫_{−T}^{T} ∫_{−T}^{T} g(x − y) dx dy,    (4A.10)

where the integrand depends only on the difference x − y (not absolute x and/or y). Since the integrand g is a function of one quantity (i.e., x − y), one should suspect that the double integral could be reduced to a single integral over this one variable. These suspicions are correct; Integral (4A.10) can be simplified by the transformation

    u = x − y
    v = x + y    (4A.11)

with inverse

    x = (u + v)/2
    y = (v − u)/2.    (4A.12)

As shown by Figure 4A-2, x-y plane points P_1, P_2, P_3 and P_4 map to u-v plane points P_1′, P_2′, P_3′ and P_4′, respectively.

[Figure 4A-2: Change from the (x, y) plane to the (u, v) plane. The square |x| ≤ T, |y| ≤ T maps to the diamond with vertices at (±2T, 0) and (0, ±2T); a dotted vertical line at fixed u runs between v = −2T − u and v = 2T + u.]

The Jacobian is

    ∂(x, y)/∂(u, v) = det[ 1/2   1/2
                          −1/2   1/2 ] = 1/2.    (4A.13)

In the (u, v)-plane, as u goes from −2T to 0, v traverses from −(2T + u) to 2T + u, as can be seen from the dotted vertical line on Figure 4A-2. In a similar manner, as u goes from 0 to 2T, v traverses from −(2T − u) to 2T − u. Hence, the integral (4A.10) can be expressed as

    ∫_{−T}^{T} ∫_{−T}^{T} g(x − y) dx dy = ∫∫_{R} g(u) |∂(x, y)/∂(u, v)| du dv
        = (1/2) ∫_{−2T}^{0} ∫_{−(2T+u)}^{2T+u} g(u) dv du + (1/2) ∫_{0}^{2T} ∫_{−(2T−u)}^{2T−u} g(u) dv du
        = ∫_{−2T}^{0} (2T + u) g(u) du + ∫_{0}^{2T} (2T − u) g(u) du
        = ∫_{−2T}^{2T} (2T − |u|) g(u) du.    (4A.14)

Often, the integral over u on the right-hand side of (4A.14) is easier to evaluate than the original integral in the x-y coordinate system.
Chapter 5 Moments and Conditional Statistics

Let X denote a random variable, and z = h(x) a function of x. Consider the transformation Z = h(X). We saw that we could express

    E[Z] = E[h(X)] = ∫_{−∞}^{∞} h(x) f_X(x) dx,    (5-1)

a method of calculating E[Z] that does not require knowledge of f_Z(z). It is possible to extend this method to transformations of two random variables.

Given random variables X, Y and function z = g(x, y), form the new random variable

    Z = g(X, Y).    (5-2)

f_Z(z) denotes the density of Z. The expected value of Z is E[Z] = ∫_{−∞}^{∞} z f_Z(z) dz; however, this formula requires knowledge of f_Z, a density which may not be available. Instead, we can use

    E[Z] = E[g(X, Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) f_XY(x, y) dx dy    (5-3)

to calculate E[Z] without having to obtain f_Z. This is a very useful result.

Covariance

The covariance C_XY of random variables X and Y is defined as

    C_XY = E[(X − η_x)(Y − η_y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x − η_x)(y − η_y) f_XY(x, y) dx dy,    (5-4)

where η_x = E[X] and η_y = E[Y]. Note that C_XY can be expressed as

    C_XY = E[(X − η_x)(Y − η_y)] = E[XY − η_x Y − η_y X + η_x η_y] = E[XY] − η_x η_y.    (5-5)
Correlation Coefficient

The correlation coefficient for random variables X and Y is defined as

    r_xy = C_XY / (σ_x σ_y).    (5-6)

r_xy is a measure of the statistical similarity between X and Y.

Theorem 5-1: The correlation coefficient must lie in the range −1 ≤ r_xy ≤ +1.

Proof: Let α denote any real number. Consider the parabolic equation

    g(α) = E[ { α(X − η_x) + (Y − η_y) }² ] = α²σ_x² + 2α C_xy + σ_y² ≥ 0.    (5-7)

Note that g(α) ≥ 0 for all α; g is a parabola that opens upward.

As a first case, suppose that there exists a value α_0 for which g(α_0) = 0 (see Figure 5-1). Then α_0 is a repeated root of g(α) = 0. In the quadratic formula used to determine the roots of (5-7), the discriminant must be zero. That is, (2C_xy)² − 4σ_x²σ_y² = 0, so that

    |r_xy| = |C_xy| / (σ_x σ_y) = 1.

Now, consider the case g(α) > 0 for all α; g has no real roots (see Figure 5-2). This means that the discriminant must be negative (so the roots are complex valued). Hence, (2C_xy)² − 4σ_x²σ_y² < 0, so that

    |r_xy| = |C_xy| / (σ_x σ_y) < 1.    (5-8)

Hence, in either case, −1 ≤ r_xy ≤ +1 as claimed.

[Figures 5-1 and 5-2: The parabola g(α) = α²σ_x² + 2αC_xy + σ_y² for the case of a zero discriminant (g touches the α-axis) and for a negative discriminant (g stays strictly positive).]

Suppose an experiment yields values for X and Y. Consider that we perform the experiment many times, and plot the outcomes X and Y on a two-dimensional plane. Some hypothetical results follow.

[Figure 5-3: Scatter plots of samples of X and Y with correlation coefficient r_xy near −1, near 0, and near +1.]

Notes:
1. If |r_xy| = 1, then there exist constants a and b such that Y = aX + b in the mean-square sense (i.e., E[{Y − (aX + b)}²] = 0).
2. The addition of a constant to a random variable does not change the variance of the random variable. That is, σ² = VAR[X] = VAR[X + α] for any α.
3. Multiplication by a constant α scales the variance of a random variable by α². If VAR[X] = σ², then VAR[αX] = α²σ².
4. Adding constants to random variables X and Y does not change the covariance or correlation coefficient of these random variables. That is, X + α and Y + β have the same covariance and correlation coefficient as X and Y.
Correlation Coefficient for Gaussian Random Variables

Let zero-mean X and Y be jointly Gaussian with joint density

    f_XY(x, y) = 1/(2πσ_xσ_y√(1 − r²)) exp{ −1/(2(1 − r²)) [ x²/σ_x² − 2r xy/(σ_xσ_y) + y²/σ_y² ] }.    (5-9)

We are interested in the correlation coefficient r_XY; we claim that r_XY = r, where r is just a parameter in the joint density (from statements given above, r is the correlation coefficient for the nonzero-mean case as well). First, note that C_XY = E[XY], since the means are zero. Now, show r_XY = r by establishing E[XY] = rσ_Xσ_Y, so that r_XY = C_XY/(σ_Xσ_Y) = E[XY]/(σ_Xσ_Y) = r. In the square brackets of f_XY is an expression that is quadratic in x/σ_X. Complete the square for this quadratic form to obtain

    x²/σ_x² − 2r xy/(σ_xσ_y) + y²/σ_y² = ( x/σ_x − r y/σ_y )² + (1 − r²) y²/σ_y²
                                       = (1/σ_x²)( x − r(σ_x/σ_y) y )² + (1 − r²) y²/σ_y².    (5-10)

Use this new quadratic form to obtain

    E[XY] = ∫∫ xy f_XY(x, y) dx dy
          = ∫_{−∞}^{∞} (y/(√(2π)σ_y)) e^{−y²/2σ_y²} [ ∫_{−∞}^{∞} x · {normal density with mean r(σ_x/σ_y)y and variance σ_x²(1 − r²)} dx ] dy.    (5-11)

Note that the inner integral is an expected-value calculation; the inner integral evaluates to r(σ_x/σ_y)y. Hence,

    E[XY] = ∫_{−∞}^{∞} (y/(√(2π)σ_y)) e^{−y²/2σ_y²} ( r(σ_x/σ_y) y ) dy
          = r(σ_x/σ_y) ∫_{−∞}^{∞} (y²/(√(2π)σ_y)) e^{−y²/2σ_y²} dy = r(σ_x/σ_y) σ_y²
          = r σ_x σ_y,    (5-12)

as desired. From this, we conclude that r_XY = r.
Uncorrelatedness and Orthogonality

Two random variables are uncorrelated if their covariance is zero. That is, they are uncorrelated if

    C_XY = r_XY = 0.    (5-13)

Since C_XY = E[XY] − E[X]E[Y], Equation (5-13) is equivalent to the requirement that E[XY] = E[X]E[Y]. Two random variables are called orthogonal if

    E[XY] = 0.    (5-14)

Theorem 5-2: If random variables X and Y are independent, then they are uncorrelated (independence implies uncorrelated).

Proof: Let X and Y be independent. Then

    E[XY] = ∫∫ xy f_XY(x, y) dx dy = ∫∫ xy f_X(x) f_Y(y) dx dy = E[X] E[Y].    (5-15)

Therefore, X and Y are uncorrelated. Note: The converse is not true in general. If X and Y are uncorrelated, then they are not necessarily independent. This general rule has an exception for Gaussian random variables, a special case.

Theorem 5-3: For Gaussian random variables, uncorrelatedness is equivalent to independence (for Gaussian random variables, Independence is equivalent to Uncorrelatedness).

Proof: We have only to show that uncorrelatedness implies independence. But this is easy. Let the correlation coefficient r = 0 (so that the two random variables are uncorrelated) in the joint Gaussian density. Note that the joint density factors into a product of marginal densities.
Joint Moments

Joint moments of X and Y can be computed. These are defined as

    m_kr = E[X^k Y^r] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x^k y^r f_XY(x, y) dx dy.    (5-16)

Joint central moments are defined as

    μ_kr = E[(X − η_x)^k (Y − η_y)^r] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x − η_x)^k (y − η_y)^r f_XY(x, y) dx dy.    (5-17)
Conditional Distributions/Densities

Let M denote an event with P(M) ≠ 0, and let X and Y be random variables. Recall that

    F(y | M) = P[Y ≤ y | M] = P[Y ≤ y, M] / P[M].    (5-18)

Now, event M can be defined in terms of the random variable X.

Example (5-1): Define M = [X ≤ x] and write

    F(y | X ≤ x) = P[X ≤ x, Y ≤ y] / P[X ≤ x] = F_XY(x, y) / F_X(x),    (5-19)

    f(y | X ≤ x) = [∂F_XY(x, y)/∂y] / F_X(x).    (5-20)

Example (5-2): Define M = [x_1 < X ≤ x_2] and write

    F(y | x_1 < X ≤ x_2) = P[x_1 < X ≤ x_2, Y ≤ y] / P[x_1 < X ≤ x_2]
                         = [F_XY(x_2, y) − F_XY(x_1, y)] / [F_X(x_2) − F_X(x_1)].    (5-21)

Example (5-3): Define M = [X = x], where f_X(x) ≠ 0. The quantity P[Y ≤ y, M]/P[M] can be indeterminate (i.e., 0/0) in this case (certainly, this is true for continuous X), so that we must use

    F(y | X = x) = limit_{Δx→0⁺} F(y | x − Δx < X ≤ x).    (5-22)

From the previous example, this result can be written as

    F(y | X = x) = limit_{Δx→0⁺} [F_XY(x, y) − F_XY(x − Δx, y)] / [F_X(x) − F_X(x − Δx)]
                 = limit_{Δx→0⁺} { [F_XY(x, y) − F_XY(x − Δx, y)]/Δx } / { [F_X(x) − F_X(x − Δx)]/Δx }
                 = [∂F_XY(x, y)/∂x] / [dF_X(x)/dx].    (5-23)

From this last result, we conclude that the conditional density can be expressed as

    f(y | X = x) = (∂/∂y) F(y | X = x) = [∂²F_XY(x, y)/∂x∂y] / [dF_X(x)/dx],    (5-24)

which yields

    f(y | X = x) = f_XY(x, y) / f_X(x).    (5-25)

Use the abbreviated notation f(y|x) = f(y|X = x), Equation (5-25) and symmetry to write

    f_XY(x, y) = f(y|x) f_X(x) = f(x|y) f_Y(y).    (5-26)

Use this form of the joint density with the formula before last to write

    f(y|x) = f(x|y) f_Y(y) / f_X(x),    (5-27)

a result that is called Bayes' Theorem for densities.
Conditional Expectations

Let M denote an event, g(x) a function of x, and X a random variable. Then, the conditional expectation of g(X) given M is defined as

    E[g(X) | M] = ∫_{−∞}^{∞} g(x) f(x | M) dx.    (5-28)

For example, let X and Y denote random variables, and write the conditional mean of X given Y = y as

    η_{x|y} = E[X | Y = y] = E[X | y] = ∫_{−∞}^{∞} x f(x | y) dx.    (5-29)

Higher-order conditional moments can be defined in a similar manner. For example, the conditional variance is written as

    σ²_{x|y} = E[(X − η_{x|y})² | Y = y] = E[(X − η_{x|y})² | y] = ∫_{−∞}^{∞} (x − η_{x|y})² f(x | y) dx.    (5-30)

Remember that η_{x|y} and σ²_{x|y} are functions of the algebraic variable y, in general.
Example (5-4): Let X and Y be zero-mean, jointly Gaussian random variables with

    f_XY(x, y) = 1/(2πσ_xσ_y√(1 − r²)) exp{ −1/(2(1 − r²)) [ x²/σ_x² − 2r xy/(σ_xσ_y) + y²/σ_y² ] }.    (5-31)

Find f(x|y), η_{x|y} and σ²_{x|y}. We will accomplish this by factoring f_XY into the product f(x|y) f_Y(y). By completing the square on the quadratic, we can write

    −1/(2(1 − r²)) [ x²/σ_x² − 2r xy/(σ_xσ_y) + y²/σ_y² ] = − (x − r(σ_x/σ_y)y)² / (2σ_x²(1 − r²)) − y²/(2σ_y²),    (5-32)

so that

    f_XY(x, y) = [ 1/√(2πσ_x²(1 − r²)) exp( −(x − r(σ_x/σ_y)y)² / (2σ_x²(1 − r²)) ) ] × [ 1/√(2πσ_y²) exp( −y²/(2σ_y²) ) ]
               = f(x|y) f_Y(y).    (5-33)

From this factorization, we observe that

    f(x|y) = 1/√(2πσ_x²(1 − r²)) exp( −(x − r(σ_x/σ_y)y)² / (2σ_x²(1 − r²)) ).    (5-34)

Note that this conditional density is Gaussian! This unexpected conclusion leads to

    η_{x|y} = r(σ_x/σ_y) y
    σ²_{x|y} = σ_x²(1 − r²)    (5-35)

as the conditional mean and variance, respectively.

The variance σ_x² of a random variable X is a measure of uncertainty in the value of X. If σ_x² is small, it is highly likely to find X near its mean. The conditional variance σ²_{x|y} is a measure of uncertainty in the value of X given that Y = y. From (5-35), note that σ²_{x|y} → 0 as |r| → 1. As perfect correlation is approached, it becomes more likely to find X near its conditional mean η_{x|y}.
Example (5-5): Generalize the previous example to the non-zero mean case. Consider X and Y
same as above except for E[X] =
X
and E[Y] =
Y
. Now, define zero mean Gaussian variables
X
d
and Y
d
so that X = X
d
+
X
, Y = Y
d
+
Y
and
EE420/500 Class Notes 03/30/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 5-11
X Y
d d
XY X Y
d d
x
y
x y
x y
d d
2
2
x y
y
2 2 2
2
y y
x
x
f (x , y )
f (x, y) f (x , y )
(x, y)
(x , y )
(x r (y ))
(y )
1 1
exp exp
2 2
2 (1 r )
2 (1 r )


= =


. (5-36)
By Bayes rule for density functions, it is easily seen that
x
y
2
x y
2 2
2
x
x
(x r (y ))
1
f (x y) exp
2 (1 r )
2 (1 r )





. (5-37)
Hence, the conditional mean and variance are
x
x y x y
y
2 2 2
x y x
r (y )
(1 r )

= +
=
(5-38)
respectively, for the case where X and Y are themselves nonzero mean.
Conditional Expected Value as a Transformation for a Random Variable
Let X and Y denote random variables. The conditional mean of random variable Y given
that X = x is an "ordinary" function (x) of x. That is,
(x) E[Y X x] E[Y x] y f (y x) dy

= = = =

. (5-39)
In general, function (x) can be plotted, integrated, differentiated, etc.; it is an "ordinary"
function of x. For example, as we have just seen, if X and Y are jointly Gaussian, we know that
EE420/500 Class Notes 03/30/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 5-12
y
y x
x
(x) E[Y X x] r (x )

= = = +

, (5-40)
a simple linear function of x.
Use (x) to transform random variable X. Now, (X) = E[YX] is a random variable.
Be very careful with the notation: random variable E[YX] is different from function
E[YX = x] E[Yx] (note that E[YX = x] and E[Yx] are used interchangeably). Find the
expected value E[(X)] = E[E[YX]] of random variable (X). In the usual way, we start this
task by writing
E E Y X E Y X f x dx y f y dy f x dx
X X
[ [ ] ] [ ] ( ) ( ) ( ) Y Y Y = =
L
N
M
O
Q
P

z z z
x x = . (5-41)
Now, since f
XY
(x,y) = f (yx) f
X
(x) we have
E E Y X y f y f x dxdy y f x y dxdy y f y dy
X XY Y
[ [ ] ] ( ) ( ) ( , ) ( ) Y Y =

z z z z z
= = x . (5-42)
From this, we conclude that
E Y E E Y X [ ] [ [ ] ] = Y . (5-43)
The inner conditional expectation is conditioned on X; the outer expectation is over X. To
emphasis this fact, the notation E
X
[E[YX]] E[E[YX]] is used sometimes in the literature.
Generalizations
This basic concept can be generalized. Again, X and Y denote random variables. And,
g(x,y) denotes a function of algebraic variables x and y. The conditional mean
EE420/500 Class Notes 03/30/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 5-13
(x) = E[g(X, Y) X = x] = E[g(x, Y) X = x] = g(x, y) f(y x ) dy
-
Y Y Y

z
(5-44)
is an "ordinary" function of real value x.
Now, (X) = E[g(X,Y)X] is a transformation of random variable X (again, be careful:
E[g(X,Y)X] is a random variable and E[g(X,Y)X = x] = E[g(x,Y)x] = (x) is a function of
x). We are interested in the expected value E[(X)] = E[E[g(X,Y)X]] so we write
X
X
- -
xy
- - - -
E[ (X)] = E[ E[g(X,Y) X] ] = f (x) g(x,y)f (y x)dy dx
= g(x,y)f (y x)f (x) dy dx g(x,y)f (x, y) dy dx E[g(X, Y)] ,








= =


(5-45)
where we have used f
XY
(x,y) = f(yx)f
X
(x), Bayes law of densities. Hence, we conclude that
E[g(X,Y)] = E[E[g(X,Y)X]] = E
X
[E[g(X,Y)X]]. (5-46)
In this last equality, the inner conditional expectation is used to transform X; the outer
expectation is over X.
Example (5-6): Let X and Y be jointly Gaussian with E[X] = E[Y] = 0, Var[X] =
X
2
, Var[Y] =

Y
2
and correlation coefficient r. Find the conditional second moment E[X
2
Y = y] = E[X
2
y].
First, note that
Var[XY Y Y y E X y E X y
2
] [ ] [ ] =
e j
2
. (5-47)
Using the conditional mean and variance given by (5-35), we write
EE420/500 Class Notes 03/30/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 5-14
E X y y E X y
2
[ ] ] [ ] ( ) Y Y Y = + = +
F
H
G
I
K
J
Var[X r r y
x
x
y
e j
2
2 2
2
1

. (5-48)
Example (5-7): Let X and Y be jointly Gaussian with E[X] = E[Y] = 0, Var[X] =
X
2
, Var[Y] =

Y
2
and correlation coefficient r. Find
Y
E[XY] E [ (Y)] = , (5-49)
where
x
y
r y
(y) E[XY Y = y] = y E[X Y = y] y

= =


. (5-50)
To accomplish this, substitute (5-50) into (5-49) to obtain
Y Y
2 2 x x
y x y
y y
E[XY] E [ (Y)] r E [Y ] r r

= = = =

. (5-51)
Application of Conditional Expectation: Bayesian Estimation
Let denote an unknown DC voltage (for example, the output a thermocouple, strain
gauage, etc.). We are trying to measure . Unfortunately, the measurement is obscured by
additive noise n(t). At time t = T, we take a single sample of and noise; this sample is called z
= + n(T). We model the noise sample n(T) as a random variable with known density f
n
(n) (we
have abused the symbol n by using it simultaneously to denote a random quantity and an
algebraic variable. Such abuses are common in the literature). We model unknown as a
random variable with density f

(). Density f

() is called the a-priori density of , and it is


known. In most cases, random variables and n(T) are independent, but this is not an absolute
requirement (the independence assumption simplifies the analysis). Figure 5-4 depicts a block
EE420/500 Class Notes 03/30/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 5-15
diagram that illustrates the generation of voltage-sample z.
From context in the discussion given below (and in the literature), the reader should be
able to discern the current usage of the symbol z. He/she should be able to tell whether z denotes
a random variable or a realization of a random variable (a particular sample outcome). Here, (as
is often the case in the literature) there is no need to use Z to denote the random variable and z to
denote a particular value (sample outcome or realization) of the random variable.
We desire to use the measurement z to estimate voltage . We need to develop an
estimator that will take our measurement sample value z and give us an estimate

(z) of the
actual value of . Of course, there is some difference between the estimate

and the true value


of ; that is, there is an error voltage

(z)

(z) - . Finally, making errors cost us. C(

(z))
denotes the cost incurred by using measurement z to estimate voltage ; C is a known cost
function.
The values of z and C(

(z)) change from one sample to the next; they can be interpreted
as random variables as described above. Hence, it makes no sense to develop estimator

that
minimizes C(

(z)). But, it does make sense to choose/design/develop

with the goal of


minimizing E[C(

(z))] = E[C(

(z) - )], the expected or average cost associated with the


estimation process. It is important to note that we are performing an ensemble average over all
possible z and (random variables that we average over when computing E[C(

(z) - )]).
The estimator, denoted here as
b

, that minimizes this average cost is called the


Bayesian estimator. That is, Bayesian estimator
b

satisfies

n(t)

+
+
at t = T
z = + n(T)
+
Figure 5-4: Noisy measurement of a DC voltage.
EE420/500 Class Notes 03/30/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 5-16
b

E[ ( (z) - )] E[ ( (z) - )]

.
b

C C (5-52)
(
b

is the "best" estimator. On the average, you "pay more" if you use any other estimator).
Important Special Case : Mean Square Cost Function C(

) =

Let's use the squared error cost function C(

) =
2

. Then, when estimator

is used,
the average cost per decision is
( ) ( ) Z
2 2
2
z

E[ ] (z) f ( , z) d dz (z) f ( z) d f (z)dz




= =



(5-53)
For the outer integral of the last double integral, the integrand is a non-negative function of z.
Hence, average cost
2
E[ ]

will be minimized if, for every value of z, we pick

(z) to minimize
the non-negative inner integral
( )
2

(z) f ( z) d

. (5-54)
With respect to

, differentiate this last integral, set your result to zero and get
( )

2 (z) f ( z) d 0

. (5-55)
Finally, solve this last result for the Bayesian estimator
b

(z) f ( z) d E[ z]

= =

. (5-56)
That is, for the mean square cost function, the Bayesian estimator is the mean of conditioned
on the data z. Sometimes, we call (5-56) the conditional mean estimator.
EE420/500 Class Notes 03/30/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 5-17
As outlined above, we make a measurement and get a specific numerical value for z (i.e.,
we may interpret numerical z as a specific realization of a random variable). This measured
value can be used in (5-56) to obtain a numerical estimate of . On the other hand, suppose that
we are interested in the average performance of our estimator (averaged over all possible
measurements and all possible values of ). Then, as discussed below, we treat z as a random
variable and average
2 2
b

(z) { (z) } =

over all possible measurements (values of z) and all


possible values of ; that is, we compute the variance of the estimation error. In doing this, we
treat z as a random variable. However, we use the same symbol z regardless of the interpretation
and use of (5-56). From context, we must determine if z is being used to denote a random
variable or a specific measurement (that is, a realization of a random variable).
Alternative Expression for

The conditional mean estimator can be expressed in a more convenient fashion. First,
use Bayes rule for densities (here, we interpret z as a random variable)
z
f (z )f ( )
f ( z)
f (z)


= (5-57)
in the estimator formula (5-56) to obtain
b
z z
f (z )f ( ) d f (z )f ( ) d
f (z )f ( )

(z) d ,
f (z) f (z)
f (z )f ( ) d




= = =

(5-58)
a formulation that is used in application.
Mean and Variance of the Estimation Error
For the conditional mean estimator, the estimation error is
b

E[ z] = =

. (5-59)
EE420/500 Class Notes 03/30/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 5-18
The mean value of

is (averaged over all and all possible measurements z)


b

E[ ] E[ ] E E[ z]
E[ ] E E[ z] E[ ] E[ ]
= 0

= =


= =

. (5-60)
Equivalently,
b

E[ ] E[ ] = ; because of this, we say that


b

is an unbiased estimator.
Since E[

] = 0, the variance of the estimation error is


2
2
VAR[ ] E[ ] E[ z] f ( , z)d dz



= =



, (5-61)
where f(,z) is the joint density that describes and z. We want VAR[

] < VAR[]; otherwise,


our estimator is of little value since we could use E[] to estimate . In general, VAR[

] is a
measure of estimator performance.
Example (5-8): Bayesian Estimator for Single-Sample Gaussian Case
Suppose that is N(
0
,
0
) and n(T) is N(0,). Also, assume that and n are
independent. Find the conditional mean (Bayesian) estimator

b
. First, when interpreted as a
random variable, z = + n(T) is Gaussian with mean
0
and variance
0
2
+
2
. Hence, from the
conditional mean formula (5-38) for the Gaussian case, we have
z
0
b 0 0
2 2
0

(z) E[ z] = r (z )


= +
+
, (5-62)
where r
Z
is the correlation coefficient between and z. Now, we must find r
Z
. Observe that
EE420/500 Class Notes 03/30/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 5-19
2
0 0 0 0 0 0
z
2 2 2 2 2 2
0 0 0 0 0 0
2
0 0
2 2 2 2
0 0 0
E[( )(z )] E[( )([ ] (T))] E[( ) ( ) (T)]
r
E[( ) ]
,

+ +
= = =
+ + +

= =
+ +
n n
(5-63)
since and (T) are independent. Hence, the Bayesian estimator is
2
0
b 0 0
2 2
0

(z) (z )

= +
+
. (5-64)
The error is

= -
b

, and E[

] = 0 as shown by (5-60). That is,


b

is an unbiased
estimator since its expected value is the mean of the quantity being estimated. The variance of

is
2
2
2 0
b 0 0
2 2
0
2
2 2
2 2 0 0
0 0 0 0
2 2 2 2
0 0

VAR[ ] E[( ) ] E ( ) (z )
E[( ) ] 2 E[( )(z )] E[(z ) ]


= =

+





= +

+ +

. (5-65)
Due to independence, we have
2
0 0 0 0 0 0 0
E[( )(z )] E[( )( (T))] E[( )( )] = + = = n (5-66)
2 2 2 2
0 0 0
E[(z ) ] E[( (T)) ] = + = + n (5-67)
Now, use (5-66) and (5-67) in (5-65) to obtain
EE420/500 Class Notes 03/30/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 5-20
2
2 2
2 2 2 2 0 0
0 0 0
2 2 2 2
0 0
2
2
0
2 2
0
VAR[ ] 2 [ ]


= + +

+ +

=
+

. (5-68)
As expected, the variance of error

approaches zero as the noise average power (i.e., the


variance)
2
0. On the other hand, as
2
, we have VAR[

]
0
2
(this is the noise
dominated case). As can be seen from (5-68), for all values of
2
, we have VAR[

] < VAR[]
=
0
2
, which means that
b

will always out perform the simple approach of selecting mean E[]
=
0
as the estimate of .
Example (5-9): Bayesian Estimator for Multiple Sample Gaussian Case
As given by (5-68), the variance (i.e., the uncertainty) of
b

may be too large for some


applications. We can use a sample mean (involving multiple samples) in the Bayesian estimator
to lower its variance.
Take multiple samples of z(t
k
) = + n(t
k
), 1 k N (t
k
, 1 k N, denote the times at
which samples are taken). Assume that the t
k
are far enough apart in time that n(t
k
) and n(t
j
) are
independent for t
k
t
j
(for example, this would be the case if the time intervals between samples
are large compared to the reciprocal of the bandwidth of noise n(t)). Define the sample mean of
the collected data as
N
k
k 1
1
z z(t )
N
=
= +

n (5-69)
where
N
k
k 1
1
(t )
N
=


n n (5-70)
EE420/500 Class Notes 03/30/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 5-21
is the sample mean of the noise. The quantity n is Gaussian with mean E[ n ] = 0; due to
independence, the variance is
N 2
k
2
k 1
1
VAR[ ] VAR[ (t )]
N
N
=

n n . (5-71)
Note that z + n has the same form regardless of the number of samples N. Hence,
based on the data z , the Bayesian estimator for has the same form regardless of the number of
samples. We can adopt (5-64) and write
2
0
b 0 0
2 2
0

(z) (z )
/ N

= +
+
. (5-72)
That is, in the Bayesian estimator formula, use sample mean z instead of the single sample z.
Adapt (5-68) to the multiple sample case and write the variance of error

= -
b

as
2
2
0
2 2
0
/ N
VAR[ ]
/ N

=
+

. (5-73)
By making the number N of averaged samples large enough, we can average out the noise and
make (5-73) arbitrarily small.
Conditional Multidimensional Gaussian Density
Let
G
X be an n 1 Gaussian vector with E[
G
X] = 0 and a positive definite n n
covariance matrix
X
. Likewise, define
G
Y as a zero-mean, m 1 Gaussian random vector with
m m positive definite covariance matrix
Y
. Also, define n m matrix
XY
= E[
G G
XY
T
]; note
that
XY
T
=
YX
= E[
G G
YX
T
], an m n matrix. Find the conditional density f(
G
X
G
Y).
EE420/500 Class Notes 03/30/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 5-22
First, define the (n+m) 1 super vector
G
G
G Z
X
Y
=
L
N
M
O
Q
P
, (5-74)
which is obtained by stacking
G
X on top of
G
Y. The (n+m) (n+m) covariance matrix for
G
Z
is
X XY
T
T T
Z
YX Y
X
E[ZZ ] E
X Y
Y



= = =







G
G G
G G
G
. (5-75)
The inverse of this matrix can be expressed as (observe that
Z

Z
-1
= I)

Z
A B
B C
T

=
L
N
M
O
Q
P
1
, (5-76)
where A is nn, B is nm and C is mm. These intermediate block matrices are given by
A I C
B A C
C I A
X XY Y YX X XY YX X
XY Y X XY
Y YX X XY Y YX XY Y
= = +
= =
= = +



( ) [ ]
( ) [ ]



1 1 1 1
1 1
1 1 1 1
(5-77)
Now, the joint density is
f X Y X Y
A B
B C
X
Y
XY
Z
T T
T
n m
( , )
( )
exp
G G G G
G
G =
L
N
M
O
Q
P
L
N
M
O
Q
P
L
N
M
M
O
Q
P
P
+
1
2
1
2


Y
(5-78)
EE420/500 Class Notes 03/30/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 5-23
The marginal density is
f Y Y Y
Y
Y
T
Y
m
( )
( )
exp
G G G
=

1
2
1
2
1

(5-79)
From Bayes Theorem for densities
f X Y
f X Y
f Y
X Y
A B
B C
X
Y
XY
Y
T T
T
Y n Z
Y
( )
( )
( )
( )
exp
G G
G G
G
G G
G
G Y
,
Y
= =

L
N
M
O
Q
P
L
N
M
O
Q
P
L
N
M
M
O
Q
P
P

1
2
1
2
1


(5-80)
However, straightforward but tedious matrix algebra yields
G G
G
G
G G
G G
G G
G G G G G G
G G G G G G
X Y
A B
B C
X
Y
X Y
AX BY
B X C Y
X AX BY Y B X C Y
X AX X BY Y C Y
T T
T
Y
T T
T
Y
T T T
Y
T T T
Y

Y Y

L
N
M
O
Q
P
L
N
M
O
Q
P
=
+
+
L
N
M
O
Q
P
= + + +
= + +

1
1
1
1
2
(
[ ] [ ( ]
[ ]
)
) (5-81)
(Note that the scalar identity
G G
X BY
T
=
G G
Y B X
T T
was used in obtaining this result). From the
previous page, use the results B A
XY Y
=


1
and C A
Y Y YX XY Y
=


1 1 1
to write
G G
G
G
G G G G G G
G G G G
X Y
A B
B C
X
Y
X AX X A Y Y A Y
X Y A X Y
T T
T
Y
T T
XY Y
T
Y YX XY Y
XY Y
T
XY Y

Y

L
N
M
O
Q
P
L
N
M
O
Q
P
= +
=



1
1 1 1
1 1
2
(5-82)
To simplify the notation, define
EE420/500 Class Notes 03/30/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 5-24
G G
M Y
XY Y
X XY Y YX

=



1
1
(an m 1 vector)
Q A (an n n matrix)
-1
(5-83)
so that the quadratic form becomes
G G
G
G
G G G G
X Y
A B
B C
X
Y
X M Q X M
T T
T
Y
T

Y

L
N
M
O
Q
P
L
N
M
O
Q
P
=

1
1
( ) ( ) (5-84)
Now, we must find the quotient

Z
Y
. Write
1
X XY
X XY Y YX XY
Z
1
YX Y Y YX
Y
n
m
I 0
I
0






= =






(5-85)
I
m
is the m m identity matrix and I
n
is the n n identity matrix. Hence,

Z X XY Y YX Y
=
1
(5-86)


Z
Y
X XY Y YX
Q = =
1
(5-87)
Use Equation (5-84) and (5-87) in f
X
(xy) to obtain
T 1
1
2
n
1
f (X Y) exp (X M) Q (X M)
(2 ) Q


=

G G G G G G
, (5-88)
where
EE420/500 Class Notes 03/30/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 5-25
1
XY Y
1
X XY Y YX
-1
M Y (an m 1 vector)
Q A (an n n matrix)


=
G G
(5-89)
Vector
G
M = E[
G G
X Y Y ] is the conditional expectation vector. Matrix Q E X M X M Y
T
= [( )( ) ]
G G G G G
Y
is the conditional covariance matrix.
Generalizations to Nonzero Mean Case
Suppose E[
G
X] =
G
M
X
and E[
G
Y] =
G
M
Y
, then
f X Y
Q
X M Q X M
n
T
( )
( )
exp ( ) ( )
G G G G G G
Y =

1
2
1
2
1

, (5-90)
where
1
X XY Y Y
T 1
X XY Y YX
( M E[X Y] M Y M ) (an n 1 vector)
Q E[(X M)(X M) Y] (an n n matrix).

= +
=
G G G G G G
G G G G G
(5-91)
EE420/500 Class Notes 04/01/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 6-1
Chapter 6 - Random Processes
Recall that a random variable X is a mapping between the sample space S and the
extended real line R
+
. That is, X : S R
+
.
A random process (a.k.a stochastic process) is a mapping from the sample space into an
ensemble of time functions (known as sample functions). To every S, there corresponds a
function of time (a sample function) X(t;). This is illustrated by Figure 6-1. Often, from the
notation, we drop the variable, and write just X(t). However, the sample space variable is
always there, even if it is not shown explicitly.
For a fixed t = t
0
, the quantity X(t
0
;) is a random variable mapping sample space S into
the real line. For fixed
0
S, the quantity X(t;
0
) is a well-defined, non-random, function of
time. Finally, for fixed t
0
and
0
, the quantity X(t
0
;
0
) is a real number.
Example 6-1: X maps Heads and Tails
Consider the coin tossing experiment where S = {H, T}. Define the random function
X(t;Heads) = sin(t)
X(t;Tails) = cos(t)
time
X(t;
1
)
X(t;
2
)
X(t;
3
)
X(t;
4
)
Figure 6-1: Sample functions of a random process.
EE420/500 Class Notes 04/01/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 6-2
Continuous and Discrete Random Processes
For a continuous random process, probabilistic variable takes on a continuum of
values. For every fixed value t = t
0
of time, X(t
0
;) is a continuous random variable.
Example 6-2: Let random variable A be uniform in [0, 1]. Define the continuous random
process X(t;) = A()s(t), where s(t) is a unit-amplitude, T-periodic square wave. Notice that
sample functions contain periodically-spaced (in time) jump discontinuities. However, the
process is continuous.
For a discrete random process, probabilistic variable takes on only discrete values. For
every fixed value t = t
0
of time, X(t
0
;) is a discrete random variable.
Example 6-3: Consider the coin tossing experiment with S = {H, T}. Then X(t;H) = sin(t),
X(t;T) = cos(t) defines a discrete random process. Notice that the sample functions are
continuous functions of time. However, the process is discrete.
Distribution and Density Functions
The first-order distribution function is defined as
F(x,t) = P[X(t) x]. (6-1)
The first-order density function is defined as
f x t
dF(x,
( ; )
t)
dx
. (6-2)
These definitions generalize to the n
th
-order case. For any given positive integer n, let x
1
,
x
2
, ... , x
n
denote n realization variables, and let t
1
, t
2
, ... , t
n
denote n time variables. Then,
define the n
th
-order distribution function as
F(x
1
, x
2
, ... , x
n
; t
1
, t
2
, ... , t
n
) = P[X(t
1
) x
1
, X(t
2
) x
2
, ... , X(t
n
) x
n
]. (6-3)
EE420/500 Class Notes 04/01/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 6-3
Similarly, define the n
th
-order density function as
f
n
(x , x , ... , x ; t , t , ... , t ) =
F(x , x , ... , x ; t , t , ... , t )
x x ... x
1 2 n 1 2 n
1 2 n 1 2 n
1 2 n


(6-4)
In general, a complete statistical description of a random process requires knowledge of all
order distribution functions.
Stationary Random Process
A process X(t) is said to be stationary if its statistical properties do not change with
time. More precisely, process X(t) is stationary if
F(x
1
, x
2
, ... , x
n
; t
1
, t
2
, ... , t
n
) = F(x
1
, x
2
, ... , x
n
; t
1
+c, t
2
+c, ... , t
n
+c) (6-5)
for all orders n and all time shifts c.
Stationarity influences the form of the first- and second-order distribution/density
functions. Let X(t) be stationary, so that
F(x; t) = F(x; t+c) (6-6)
for all c. This implies that the first-order distribution function is independent of time. A similar
statement can be made concerning the first-order density function. Now, consider the second-
order distribution of stationary X(t); for all t
1
, t
2
and c, this function has the property
1 2 1 2 1 2 1 2
1 2 1 1 2 1
F(x , x ; t , t ) = F(x , x ; t +c, t +c)
=F(x , x ; t +c, {t +c}+ ) , t t .
(6-7)
This must be true for all t
1
, t
2
and c. Hence, F(x
1
,x
2
;t
1
,t
2
) depends on the time difference t
2

EE420/500 Class Notes 04/01/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 6-4
t
1
; the second-order distribution does not depend on absolute t
1
and t
2
. In F(x
1
,x
2
;t
1
,t
2
), you will
only see t
1
and t
2
appear together as t
2
t
1
, which we define as . Often, for stationary processes,
we change the notation and define
" new" notation "old" notation
1 2 1 2 1 1
F(x , x ; ) F(x , x ; t , t + )

. (6-8)
Similar statements can be made concerning the second-order density function.
Be careful! These conditions on first-order F(x) and second-order F(x
1
, x
2
; ) are
necessary conditions; they are not sufficient to imply stationarity. For a given random process,
suppose that the first order distribution/density is independent of time and the second-order
distribution/density depends only on the time difference. Based on this knowledge alone, we
cannot conclude that X(t) is stationary.
First- and Second-Order Probabilistic Averages
First- and second-order statistical averages are useful. The expected value of general
random process X(t) is defined as
(t) = E[X(t)] = x f(x; t) dx
-

z
. (6-9)
In general, this is a time-varying quantity. The expected value is often called a first-order
statistic since it depends on a first-order density function. The autocorrelation function of X(t)
is defined as
R t t E X t X t x x f x x t t dx dx ( , ) [ ( ) ( )] ( , ; , )
1 2 1 2 1 2 1 2 1 2 1 2
= =

z z
. (6-10)
In general, R depends on two time variables, t
1
and t
2
. Also, R is an example of a second-order
statistic since it depends on a second-order density function.
EE420/500 Class Notes 04/01/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 6-5
Suppose X(t) is stationary. Then the mean
= E[X(t)] = x f(x) dx
-

z
(6-11)
is constant, and the autocorrelation function
R E X t X t x x f x x dx dx ( ) [ ( ) ( )] ( , ; ) = + =

z z
1 2 1 2 1 2
(6-12)
depends only on the time difference = t
2
t
1
(it does not depend on absolute time). However,
the converse is not true: the conditions a constant and R() independent of absolute time do not
imply that X(t) is stationary.
Wide Sense Stationarity (WSS)
Process X(t) is said to be wide-sense stationary (WSS) if
1) Mean = E[X(t)] is constant, and
2) Autocorrelation R() = E[X(t)X(t+)] depends only on the time difference.
Note that stationarity implies wide-sense stationarity. However, the converse is not true: WSS
does not imply stationarity.
Ergodic Processes
A process is said to be Ergodic if all orders of statistical and time averages are
interchangeable. The mean, autocorrelation and other statistics can be computed by using any
sample function of the process. That is


= = =
= + = = +


z z
z z z
E X t xf x
T
X t dt
R E X t X t x x f x x dx
T
X t X t dt
T
T
T
T
T
T
[ ( )] ( )dx ( )
( ) [ ( ) ( )] ( , ; )dx ( ) ( ) .
limit
limit
1
2
1
2
1 2 1 2 1 2
(6-13)
EE420/500 Class Notes 04/01/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 6-6
This idea extends to higher-order averages as well. Since we are averaging over absolute time,
the ensemble averages (all orders) cannot depend on absolute time. This requires that the
original process must be stationary. That is, ergodicity implies stationarity. However, the
converse is not true: there are stationary processes that are not ergodic. The hierarchy of random
processes is abstractly illustrated by Figure 6-2.
Example 6-4: Let X(t) = A, where A is uniformly distributed in the interval [0, 1]. Sample
functions of X are straight lines, as shown by Figure 6-3. Clearly, X(t) is not ergodic since the
time average of each sample function is different.
Example 6-5: Random Walk
The random walk is the quintessential example of a Markov process, a type of process
that has many applications in engineering and the physical sciences. Many versions of the
1
time
X(t;
1
)
X(t;
2
)
X(t;
3
)
Three sample functions of X(t)
Figure 6-3: Sample functions of a non-ergodic random process.
All
Processes Wide Sense Stationary
Stationary
Ergodic
Figure 6-2: Hierarchy of random processes.
EE420/500 Class Notes 04/01/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 6-7
random walk have been studied over the years (i.e., the gambler's ruin, drunken sailor, etc.).
At first, a discrete random walk is introduced. Then, it is shown that a limiting form of the
random walk is the well-known continuous Wiener process. Finally, simple equations are
developed that provide a complete statistical description of the discrete and limiting form of the
random walk.
Suppose a man takes a random walk by starting at a designated origin on a straight line
path. With probability p (alternatively, q 1 - p), he takes a step to the right (alternatively, left).
Suppose that each step is of length A meters, and each step is completed in
s
seconds. After N
steps (completed in N
s
seconds), the man is located X
d
(N) steps from the origin; note that N
X
d
(N) N since the man starts at the origin. If X
d
(N) is positive (alternatively, negative), the
man is located to the right (alternatively, left) of the origin. The quantity P[X
d
(N) = n], N n
N, denotes the probability that the man's location is n steps from the origin after he has taken N
steps. Figure 6-4 depicts two sample functions of X
d
(N), a discrete random process.
The calculation of P[X
d
(N) = n] is simplified greatly by the assumption, implied in the
previous paragraph, that the man takes independent steps. That is, the direction taken at the N
th
N
X
d
(N)
Figure 6-4: Two sample functions of a random walk process.
EE420/500 Class Notes 04/01/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 6-8
step is independent of X
d
(k), 0 k N 1, and the directions taken at all previous steps. Also
simplifying the development is the assumption that p does not depend on step index N. Under
these conditions, it is possible to write
P
P
[X (N ) X (N) ] p
[X (N ) X (N) ] p q .
d d
d d
+ = + =
+ = = =
1 1
1 1 1
(6-14)
Let R
n0
and L
n0
denote the number of steps to the right and left, respectively, that will
place the man n, N n N, steps from the origin after he has completed a total of N steps.
Integers R
n0
and L
n0
depend on integers N and n; the relationship is given by
R L n
R L N
n n
n n
0 0
0 0
=
+ =
(6-15)
since N n N. Integer values for R
n0
and L
n0
exist only if N n N and the pair of
integers (N + n), (N - n) are even. When an integer solution of (6-15) exists it is given by
R
N n
L
N n
n
n
0
0
2
2
=
+
=



, (6-16)
for -N n N, and (N+n), (N-n) even integers. After taking a total of N steps, it is not possible
to reach n steps (to the right of the origin) if integer values for R
n0
and L
n0
do not exist. Also, if
it is possible to reach n steps (to the right of the origin after taking a total of N steps), then it is
not possible to reach n 1 steps (to the right of the origin).
Of course, there are multiple sequences of N steps, R
n0
to the right and L
n0
to the left, that
the man can take to insure that he is n steps to the right of the origin. In fact, the number of such
EE420/500 Class Notes 04/01/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 6-9
sequences is given by
N
R
N!
R ! L !
n
n n
0
0 0
F
H
G
G
I
K
J
J
= . (6-17)
This quantity represents the number of subsets of size R
n0
that can be formed from N distinct
objects. These sequences are mutually exclusive events. Furthermore, they are equally
probable, and the probability of each of them is
P[R (L ) Steps To Right (Left) in a Specific Sequence] = p q
n0 n0
R L
n0 n0
. (6-18)
The desired probability P[X
d
(N) = n] can be computed easily with the use of (6-16),
(6-17) and (6-18). From the theory of independent Bernoulli trials, the result
n0 n0
R L
n0 n0
n 0 n 0
d
n0 n0
N!
p q , if integers R and L exit
R ! L !
[X (N) n]
0, if integers R and L do not exit

= =

P (6-19)
follows easily. If there are no integer solutions to (6-15) for given values of n and N (i.e.,
integers R
n0
and L
n0
do not exist), then it is not possible to arrive at n steps from the origin after
taking N steps and P[X
d
(N) = n] = 0. Note that (6-19) is just the probability that the man takes
R
n0
steps to the right given that he takes N independent steps.
The analysis leading to (6-19) can be generalized to include a non-zero starting location.
Instead of starting at the origin, assume that the man starts his random walk at m steps to the
right of the origin. Then, after the man has completed N independent steps, P[X
d
(N) = nX
d
(0)
= m] denotes the probability that he is n steps to the right of the origin given that he started m
steps to the right of the origin. A formula for P[X
d
(N) = nX
d
(0) = m] is developed in what
EE420/500 Class Notes 04/01/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 6-10
follows.
Let n m. The quantity denotes the man's net increase in the number of steps to
the right after he has completed N steps. Also, R
nm
(alternatively, L
nm
) denotes the number of
steps to the right (alternatively, left) that are required if the man starts and finishes m and n,
respectively, steps to the right of the origin. Note that R
nm
+ L
nm
= N and R
nm
- L
nm
= so that
nm
nm
N
R
2
N
L .
2
+
=

=
(6-20)
Solution (6-20) is valid only if N and integers (N + ), (N ) are even. Otherwise,
integers R
nm
and L
nm
do not exist, and it is not possible to start at m (steps to the right of the
origin), take N independent steps, and find yourself at n (steps to the right of the origin).
Finally, suppose that integers R
nm
and L
nm
exist for some n and m; that is, it is possible to go
from m to n (steps to the right of the origin) in a total of N steps. Then, it is not possible to go
from m to n 1 steps in a total of N steps.
The desired result follows by substituting R
nm
and L
nm
for R
n0
and L
n0
in (6-19); this
procedure leads to
nm nm
nm nm
nm d d
R L
nm nm
R L
nm
[ X (N) n X (0) m] [ R steps to the right out of Nsteps ]
N!
p q
R !(N R )!
N
p q
R
= = =
=



=


P P
(6-21)
if integers R
nm
and L
nm
exist, and
EE420/500 Class Notes 04/01/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 6-11
P[X (N) n X ( ) m]
d d
= = = Y 0 0 (6-22)
if R
nm
and L
nm
do not exist.
To simplify the developments in the remainder of this chapter, it is assumed that p = q =
1/2. Also, R
nm
is assumed to exist in what follows. Otherwise, results (that use R
nm
) given
below can be modified easily if n and m are such that R
nm
does not exist.
Process X
d
(N) has independent increments. That is, consider integers N
1
, N
2
, N
3
and N
4
,
where N
1
< N
2
N
3
< N
4
. Then X
d
(N
2
) - X
d
(N
1
) is statistically independent of X
d
(N
4
) - X
d
(N
3
).
The Wiener Process As a Limit of the Random Walk
Recall that each step corresponds to a distance of A meters, and each step is completed in

s
seconds. At time t = N
s
, let X(N
s
) denote the man's physical displacement (in meters) from
the origin. Then X(N
s
) is a random process given by X(N
s
) AX
d
(N), since X
d
(N) denotes the
number of steps the man is from the origin after he takes N steps. Note that X(N
s
) is a discrete-
time random process that takes on only discrete values.
For large N and small A and
s
, the probabilistic nature of X(N
s
) is of interest. First,
note that P[X(N
s
) = AnX(0) = Am] = P[X
d
(N) = nX
d
(0) = m]; this observation and the
Binomial distribution function leads to the result (use p = q = )
nm
s d d
nm
R
k N k
1 1
2 2
k=0
[ (N ) n (0) m] = [X (N) n X (0) = m]
= [from N steps, the number k taken to right is R ]
N
= ( ) ( ) .
k

=

A A P P
P
X X
(6-23)
For large N, the DeMoivre-Laplace theorem (see Chapter 1 of these class notes) leads to the
approximation
EE420/500 Class Notes 04/01/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 6-12
P[ (N ) ( ) ] = X X


s
=
F
H
G
I
K
J
F
H
G
I
K
J
=

z
A A n m
R N/
N/ N
,
exp
nm
/
Y 0
2
4
1
2
1
2
2
G G
u du
N
(6-24)
where G is the distribution function for a zero-mean, unit-variance Gaussian random variable.
The discrete random walk process outlined above has a continuous process as a formal
limit. To see this, let A 0,
s
0 and N in such a manner that
A
A
A
2
s
s
s
2
(t) ( )

=
=
=
=
D
t N
x n
x m
N ,
0
X X
(6-25)
where D is known as the diffusion constant. In terms of D, x, x
0
and t, the results of (6-25) can
be used to write

N
(x x ) /
t /
(x x )
Dt
. =

=

0 0
2
A
s
(6-26)
Process X(t) is a continuous random process.
The probabilistic nature of the limiting form of X(t) is seen from (6-24) and (6-26). In
the limit, the process X(t) is described by the first-order conditional distribution function
F x t u du
Dt
( ; ) exp
)/
Yx
0
1
2
1
2
2
2
=

z
-
(x-x
0
, (6-27)
EE420/500 Class Notes 04/01/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 6-13
and the first-order conditional density function
f(x, t x ) ( ; )
Dt
exp
(x x )
4Dt
Y Y
0 0
2
1
4
= =

L
N
M
M
O
Q
P
P
d
dx
F x t x
0

, (6-28)
a result depicted by Figure 6-5. Often, f(x,tx
0
) is known as the transition density function for
the process since f(x,tx
0
)x is the probability of making the transition from x
0
to the interval
(x, x+x) by time t. Equation (6-28) describes the conditional probability density function of a
continuous-time Wiener process. Clearly, process X(t) is Gaussian, it has a mean of x
0
, and it
has a variance that grows with time (hence, it is nonstationary). Finally, as t 0
+
, note that
f(x,tx
0
) (x x
0
), as expected.
x
0
f(x,tx
0
)
t = .1 second
t = 1 second
t = 3 second
f( , ) exp
( )
x t x
Dt
x x
Dt
0
0
Y =

L
N
M
M
O
Q
P
P
1
4 4
2

x
0
+1 x
0
+2 x
0
+3 x
0
-1 x
0
-2 x
0
-3
Figure 6-5: Density function for a diffusion process with D = 1.
EE420/500 Class Notes 04/01/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 6-14
Figure 6-6 attempts to depict a sample function of a Wiener process. While such
drawings are nice to look at (and they are fun to draw!), they cannot depict accurately all
attributes of a Wiener process sample function. As it turns out, Wiener processes have
continuous (in time) sample functions; there wont be a step or jump in the sample function of a
Wiener process. However, in the traditional Calculus sense, the sample functions are
differentiable nowhere. That is, in the classical Calculus sense, the derivative dX/dt does not
exist at any value of time (actually, its just a little more complicated than this). A generalized
derivative of the Wiener process does exit, however. In engineering and the physical sciences, it
is known as white Gaussian noise.
Process X(t) has independent increments. Let (t
1
, t
2
), (t
3
, t
4
) be non-overlapping
intervals (t
1
< t
2
t
3
< t
4
). Then increment X(t
2
) - X(t
1
) is independent of increment X(t
4
) -
X(t
3
). Finally, increment X(t
2
) - X(t
1
) has a mean of zero and a variance of 2Dt
2
- t
1
.
The Diffusion Equation For the Transition Density Function
In terms of physical displacement (from the origin) X, the conditional probability
P[X(N
s
) = AnX(0) = Am] describes the probabilistic nature of the discrete time random walk
problem outlined above. In what follows, this conditional probability is denoted by the short
hand notation P[An, N
s
Am]. For the case p = q = 1/2, it is easy to see that it satisfies the
difference equation
P P P [ N+1) = [ N [ N
1
2
1
2
A A A A A A n, ( m] (n 1), m] (n 1), m]
s s s
Y Y Y + + . (6-29)
t
x
0
X(t)
Figure 6-6: A hypothetical sample function of a Wiener process X(t).
EE420/500 Class Notes 04/01/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 6-15
That is, to get to An at time (N+1)
s
, you can be at A(n-1) at time N
s
and take a step to the right
(this occurs with probability equal to 1/2), or you can be at A(n+1) at time N
s
and take a step to
the left (this occurs with probability equal to 1/2). Equation (6-29) can be applied twice to
obtain
1 1
2 2
1 1 1
2 2 2
1 1 1
2 2 2
[ n, (N+2) m] = [ (n 1), (N+1) m] [ (n 1), (N+1) m]
[ (n 2), N m] [ n, N m]
[ n, N m] [ (n 2), N m]
+ +

= +


+ + +

A A A A A A
A A A A
A A A A
s s s
s s
s s
P P P
P P
P P
(6-30)
This last result can be simplified to obtain
1 1 1
4 2 4
[ n, (N+2) m]
= [ (n 2), N m] [ n, N m] [ (n 2), N m]

+ + +
A A
A A A A A A
s
s s s
P
P P P
. (6-31)
The continuous conditional density f(x,tx
0
) given by (6-28) satisfies a partial
differential equation. To obtain this equation, first note that the difference equation (6-31) can
be used to write
2
[ n, (N+2) m] [ n, N m]
2
[ (n 2), N m] 2 [ n, N m] [ (n 2), N m]
2
(2 )


+ +
=



s s
s
2
s s s
s
P P
P P P
A A A A
A A A A A A A
A
(6-32)
(substitute (6-31) for the first term on the left-hand side of (6-32) to verify the expression).
Now, in the sense described by (6-25), the formal limit of (6-32) can be found. To find this
EE420/500 Class Notes 04/01/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 6-16
limiting form, we must consider two results. First, note that (6-24) and (6-25) imply
s s
2
2
2
s
s
2
)
0
0
[ n, N m] [ (N ) n (0) m] [ (N ) (n 2) (0) m]
2
1 2
1
exp
N N
2 2 N N
2
1
exp
2 N
4 N
(x x
2 1
exp 2 f (x; t x )
2 2Dt
4 Dt
2





= = =










=





= =


A A A A A A
A
A
A
A
G G
s
P P P X X X X
. (6-33)
This last equation shows that, in the limit describe by (6-25), P[ N A A n, m]
s
Y approaches
2Af(x;tx0). That is, as A 0,
s
0, A
2
/2
s
D as described by (6-25), we know that
P[ N A A n, m]
s
Y approaches zero according to
0
[ n, N m] 2 f (x; t x ) 0
s
P A A A . (6-34)
Second, we must review expressions for derivatives; let g(x) be an ordinary function of x, and
recall that the first partial derivative of g can be expressed as
g(x x) g(x x)
g(x)
limit
x 2 x
x 0
+
=


. (6-35)
Use this formula to express the second derivative of g as
EE420/500 Class Notes 04/01/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 6-17
2
2 2
2
[g(x 2 x) g(x)] [g(x) g(x 2 x)]
g(x)
limit
x (2 x)
x 0
g(x 2 x) 2g(x) g(x 2 x)
limit
(2 x)
x 0
+
=


+ +
=


(6-36)
From (6-34), (6-35) and (6-36), the formal limit of (6-32) is

t
f(x, t x ) D f(x, t x ) Y Y
0 0
x
=
2
2
, (6-37)
where f(x,tx
0
) denotes the conditional probability density function given by (6-28). Note that
(6-37) is identical in form to the source-free, one-dimensional heat equation. Probability
diffuses just like heat and electronic charge (and many other physical phenomenon)!
Equation (6-37) is a one-dimensional diffusion equation. It describes how probability
diffuses (or flows) with time. It implies that probability is conserved in much the same way that
the well-know continuity equation implies the conservation of electric charge. To draw this
analogy, note that f describes the density of probability (or density of probability particles) on
the one-dimensional real line. That is, f can be assigned units of particles/meter. Since D has
units of meters
2
/second, a unit check on both sides of (6-37) produces
1
second
particles
meter
meter
second
1
meter
2
particles
meter
2
e je j e je j e j
= . (6-38)
Now, write (6-37) as

t
f = , (6-39)
where
EE420/500 Class Notes 04/01/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 6-18
D f

x
, (6-40)
and is the divergence operator. The quantity is a one-dimensional probability current, and
it has units of particles/second. Note the similarity between (6-39) and the well-known
continuity equation for electrical charge.
Probability current ( x, t x
0
) indicates the rate of particle flow past point x at time t.
Let (x
1
, x
2
) denote an interval; integrate (6-39) over this interval to obtain
2
1
x
1 2 0 0 2 0 1 0
x
[x (t) x x ] f(x, t x ) dx (x , t x ) (x , t x )]
t t

< = = +


P X . (6-41)
As illustrated by Figure 6-7, the left-hand side of this equation represents the time rate of
probability build-up on (x
1
, x
2
). That is, between the limits of x
1
and x
2
, the area under f is
changing at a rate equal to the left-hand side of (6-41). As depicted, the right-hand side of (6-41)
represents the probability currents entering the ends of the interval (x
1
, x
2
).
The Wiener process is a simple example of a diffusion process. Diffusion processes are
important in the study of communication and control systems. As it turns out, the state vector
that describes a system (such as a circuit, PLL, spring-mass, etc.) driven by white Gaussian noise
is a diffusion process. Also, this state vector is described statistically by a density function that

f (x, tx
0
)
x
1
x
2


t
1 2
t
t
t ,t ,t P[x ( ) x x ] f (x, x )dx (x x ) (x x )
x
x
1 2 0 0 0 0
1
2
< =
z
= X Y Y Y Y
(x x )
1
,tY
0
(x x )
2
,tY
0
Figure 6-7: Probability build-up on the interval x
1
, x
2
due to
probability current entering the ends of the interval.
EE420/500 Class Notes 04/01/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 6-19
satisfies a partial differential equation known as the Fokker-Planck equation (of which (6-37) is
a simple example). Finally, it should be pointed out that a diffusion process is a special case of a
Markov process, a more general process (see Papoulis for the definition of a Markov process).
Solution of Diffusion Equation by Transform Techniques
The one-dimensional diffusion equation (6-37) can be solved by using transform
techniques. First, initial and boundary conditions must be specified. The desired initial
condition is
f x t x x x ( , ) ( ) Y Y
Y
0
0
t 0 =
= , (6-42)
which means that random process x starts at x
0
. What are known as natural boundary conditions
are to be used; that is, we require
f x t x
x
( , ) Y Y
Y
0
0
=
= . (6-43)
Consider the transform of f(x,tx
0
) defined by
( , ) ( , )e s t f x t x dx
jxs
=

z
Y
0
. (6-44)
With respect to t, differentiate (6-44) to obtain

R
S
T
U
V
W
=

R
S
T
U
V
W

z z
( , )
( , )
( , )
s t
t t
f x t x e dx D
x
f x t x
e dx
jxs jxs
Y
Y
0
2
2
0
. (6-45)
Now, use
EE420/500 Class Notes 04/01/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 6-20
limit f
limit
x
x

=
( , )
( , )
x t x
f x t x
x
Y
Y
0
0
0
0
, (6-46)
and integrate by parts twice to obtain

=
t
Ds
2
. (6-47)
This equation can be solved easily to obtain
( , ) exp ( , ) s t Ds t s =
2
0 . (6-48)
However, from the initial condition (6-42), we have
( , ) exp( ) s jx s 0
0
= . (6-49)
Finally, combine (6-48) and (6-49) to obtain
( , ) exp s t jx s Ds t =
0
2
. (6-50)
But, this is the well-known characteristic function of the Gaussian density function
f(x, t x )
Dt
exp
(x x )
4Dt
Y
0
2
1
4
=

L
N
M
M
O
Q
P
P

0
. (6-51)
This same technique can be used to solve higher-order, more complicated diffusion equations.
EE420/500 Class Notes 7/22/2009 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 7-1
Chapter 7 - Correlation Functions
Let X(t) denote a random process. The autocorrelation of X is defined as
R t t E X t X t x x f x x t t dx dx
x
( , ) [ ( ) ( )] ( , ; , )
1 2 1 2 1 2 1 2 1 2 1 2

z z
- -
. (7-1)
The autocovariance function is defined as
C t t E X t t X t t R t t t t
x X X X X X
( , ) [{ ( ) ( )}{ ( ) ( )}] ( , ) ( ) ( )
1 2 1 1 2 2 1 2 1 2
- - - q q q q , (7-2)
and the correlation function is defined as
X X X x 1 2 1 2 1 2
x 1 2
x 1 x 2 x 1 x 2
C (t , t ) R (t , t ) (t ) (t )
r (t , t )
(t ) (t ) (t ) (t )
-q q

o o o o
. (7-3)
If X(t) is at least wide sense stationary, then R
X
depends only on the time difference t = t
1
- t
2
, and we write
R E X t X t x x f x x dx dx
x
( ) [ ( ) ( )] ( , ; ) t t t -

z z
1 2 1 2 1 2
- -
. (7-4)
Finally, if X(t) is ergodic we can write
R X t X t
x
T
T
T
T
( ) ( ) ( )dt t t -

z
limit
1
2
-
. (7-5)
Function r
X
(t) can be thought of as a measure of statistical similarity of X(t) and X(t+t). If
r
x
(t
0
) = 0, the samples X(t) and X(t+t
O
) are said to be uncorrelated.
EE420/500 Class Notes 7/22/2009 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 7-2
Properties of Autocorrelation Functions for Real-Valued, WSS Random Processes
1. R
X
(0) = E[X(t)X(t)] = Average Power.
2. R
X
(t) = R
X
(-t). The autocorrelation function of a real-valued, WSS process is even.
Proof:
R ( ) = E[X(t)X(t + )] = E[X(t - )X(t - )] (Due to WSS)
= R (- )
X
X
t t t t t
t
-
(7-6)
3. R
X
(t)s R
X
(0). The autocorrelation is maximum at the origin.
Proof:
E X t X t E X t X t X t X t
R R R
X X X
( ) ( ) ( ) ( ) ( ) ( )
( ) ( ) ( )
- - - -
-
t t t
t
b g
2 2 2
2 0
0 0 2 0
(7-7)
Hence, R
X
(t)s R
X
(0) as claimed.
4. Assume that WSS process X can be represented as X(t) = q + X
ac
(t), where q is a constant and
E[X
ac
(t)] = 0. Then,
( )( )
[ ]
X
X
ac ac
2
ac ac ac
2
ac
R ( ) E X (t) X (t )
E[ ] 2 E[X (t)] E X (t)X (t )
R ( ).
t q- q- - t

q - q - - t
q - t
(7-8)
5. If each sample function of X(t) has a periodic component of frequency c then R
X
(t) will have
a periodic component of frequency c.
EE420/500 Class Notes 7/22/2009 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 7-3
Example 7-1: Consider X(t) = Acos(ct+0) + N(t), where A and c are constants, random
variable 0 is uniformly distributed over (0, 2r), and wide sense stationary N(t) is independent of
0 for every time t. Find R
X
(t), the autocorrelation of X(t).
R E
A
E t E A N t
E N t A E N t N t
A
R
X
N
( ) [
cos( ) ( )
( ) cos( [ ) ( ) ( )
cos( ) ( )
t c 0 c t 0 t
c ct 0 ct c 0 t
c t 0 t
ct t

- - -
- - -
-
Acos( t + ) + N(t) Acos( t + ]+ ) + N(t + )
cos(2 + 2 ) + cos( ) t +
t + ] +
l q l q
2
2
2
2
(7-9)
So, R
X
(t) contains a component at c, the same frequency as the periodic component in X.
6. Suppose that X(t) is ergodic, has zero mean, and it has no periodic components; then
limit
t
t

R
X
( ) 0. (7-10)
That is, X(t) and X(t+t) become uncorrelated for large t.
7. Autocorrelation functions cannot have an arbitrary shape. As will be discussed in Chapter 8,
for a WSS random process X(t) with autocorrelation R
X
(t), the Fourier transform of R
X
(t) is the
power density spectrum (or simply power spectrum) of the random process X. And, the power
spectrum must be non-negative. Hence, we have the additional requirement that
[ ]
j
x x x
R ( ) R ( )e d R ( ) cos( )d 0

- ct
- -
t t t t ct t

F (7-11)
for all c (the even nature of R
X
was used to obtain the right-hand side of (7-11)). Because of
this, in applications, you will not find autocorrelation functions with flat tops, vertical sides, or
EE420/500 Class Notes 7/22/2009 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 7-4
any jump discontinuities in amplitude (these features cause oscillatory behavior, and negative
values, in the Fourier transform). Autocorrelation R(t) must vary smoothly with t.
Example 7-2: Random Binary Waveform
Process X(t) takes on only two values: A. Every t
a
seconds a sample function of X
either toggles value or it remains the same (positive constant t
a
is known). Both possibilities
are equally likely (i.e., P[toggle] = P[no toggle] = 1/2). The possible transitions occur at
times t
0
+ kt
a
, where k is an integer, - < k < . Time t
0
is a random variable that is uniformly
distributed over [0, t
a
]. Hence, given an arbitrary sample function from the ensemble, a toggle
can occur anytime. Starting from t = t
0
, sample functions are constant over intervals of length t
a
,
and the constant can change sign from one t
a
interval to the next. The value of X(t) over one "t
a
-
interval" is independent of its value over any other "t
a
-interval". Figure 7-1 depicts a typical
sample function of the random binary waveform. Figure 7-2 is a timing diagram that illustrates
the "t
a
intervals". The algorithm used to generate the process is not changing with time. As a
result, it is possible to argue that the process is stationary. Also, since +A and A are equally
likely values for X at any time t, it is obvious that X(t) has zero mean.
t
1
t
A
-A
Time Line
0 t
0
t
0
+2t
a
t
0
+3t
a
t
0
+4t
a
t
0
+5t
a
t
0
-t
a
t
0
+t
a
X(t
1
)=X
1
X(t
1
+t)=X
2
t
1
+t
t
X(t)
Fig. 7- 1: Sample function of a simple binary random process.
EE420/500 Class Notes 7/22/2009 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 7-5
To determine the autocorrelation function R
X
(t), we must consider two basic cases.
1) Case t > t
a
. Then, the times t
1
and t
1
+ t cannot be in the same "t
a
interval". Hence, X(t
1
)
and X(t
1
+ t) must be independent so that
R E X t X t E X t E X t t
a
( ) [ ( ) ( )] [ ( )] [ ( )] , t t t t - - >
1 1 1 1
0 . (7-12)
2) Case t < t
a
. To calculate R(t) for this case, we must first determine an expression for the
probability P[t
1
and t
1
+ t in the same t
a
interval]. We do this in two parts: the first part is i) 0
< t < t
a
, and the second part is ii) -t
a
< t s 0.
i) 0 < t < t
a
. Times t
1
and t
1
+t may, or may not, be in the same "t
a
-interval". However, we write
1 1 a 0 1 1 0 a
1 a 0 1
1 1 a
a
a
a
a
[t and t in same "t interval"] [t t t <t t ]
[t t t t ]
1
[t (t t )]
t
t
0 < < t
t
,
- t s s - t -
- t - < s
- - t -
-t
t
P P
P
(7-13)
ii) -t
a
< t s 0. Times t
1
and t
1
+t may, or may not, be in the same "t
a
-interval". However, we
write
0 t
0
t
0
+2t
a
t
0
+3t
a
t
0
+4t
a
t
0
+5t t
0
-t
a
t
0
+t
a
" t
a
Intervals "
Fig. 7-2: Time line illustrating the independent "t
a
intervals".
EE420/500 Class Notes 7/22/2009 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 7-6
1 1 a 0 1 1 0 a
1 a 0 1
1 1 a
a
a
a
a
[t and t in same "t interval"] [t t t <t t ]
[t t t t ]
1
[t (t t )]
t
t
-t < 0
t
,
- t s - t s -
- < s - t
- t - -
-t
t s
P P
P
(7-14)
Combine (7-13) and (7-14), we can write
a
1 1 a a
a
t
[t and t in same "t interval"] < t
t
,
t -
- t t P . (7-15)
Now, the product X(t
1
)X(t
1
+t) takes on only two values, plus or minus A
2
. If t
1
and t
1
+ t are in
the same "t
a
-interval" then X(t
1
)X(t
1
+t) = A
2
. If t
1
and t
1
+ t are in different "t
a
-intervals" then
X(t
1
) and X(t
1
+t) are independent, and X(t
1
)X(t
1
+t) = A
2
equally likely. For t < t
a
we can
write
[ ]
[ ]
1 1
2
1 1 a
2 2
1 1 a 1 1
2 2
1 1 a 1 1
R( ) E X(t )X(t )
A t and t in same "t interval"
A {t and t in different "t intervals"}, X(t )X(t ) A
A {t and t in different "t intervals"}, X(t )X(t ) A
t - t
- t

- - t - t


- - t - t -

P
P
P
. (7-16)
However, the last two terms on the right-hand side of (7-16) cancel out (read again the two
sentences after (7-15)). Hence, we can write
[ ]
2
1 1 a
R( ) A t and t in same "t interval" t - t P . (7-17)
EE420/500 Class Notes 7/22/2009 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 7-7
Finally, substitute (7-15) into (7-17) to obtain
2
a
1 1
a
t
a
A , < t
t
a
R( ) E[X(t )X(t )]
0, > t
- t

t



t - t

. (7-18)
Equation (7-18) provides a formula for R(t) for the random binary signal described by Figure 7-
1. A plot of this formula for R(t) is given by Figure 7-3.
Poisson Random Points Review
The topic of random Poisson points is discussed in Chapters 1, 2 and Appendix 9B. Let
n(t
1
, t
2
) denote the number of Poisson points in the time interval (t
1
, t
2
). Then, these points are
distributed in a Poisson manner with
k
1 2
( )
[n(t , t ) k] e
k!
-/t
/t
P , (7-19)
where t = t
1
- t
2
, and / > 0 is a known parameter. That is, n(t
1
,t
2
) is Poisson distributed with
parameter /t. Note that n(t
1
, t
2
) is an integer valued random variable with
t
a
-t
a
A
2
R(t)
t
Fig. 7-3: Autocorrelation of Random Binary waveform.
EE420/500 Class Notes 7/22/2009 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 7-8
E[n(t
1
, t
2
)] = /t
1
- t
2

VAR[n(t
1
, t
2
)] = /t
1
- t
2
(7-2O)
E[n
2
(t
1
, t
2
)] = VAR[n(t
1
, t
2
)] + (E[n(t
1
, t
2
)])
2
= /t
1
- t
2
- /
2
t
1
- t
2

2
.
Note that E[n(t
1
,t
2
)] and VAR[n(t
1
,t
2
)] are the same, an unusual result for random quantities. If
(t
1
, t
2
) and (t
3
, t
4
) are non-overlapping, then the random variables n(t
1
, t
2
) and n(t
3
, t
4
) are
independent. Finally, constant / is the average point density. That is, / represents the average
number of points in a unit length interval.
Poisson Random Process
Define the Poisson random process
X(t) = 0, t = 0
= n(0,t), t 0 >
. (7-21)
A typical sample function is illustrated by Figure 7-4.
Mean of Poisson Process
For any fixed t 0, X(t) is a Poisson random variable with parameter /t. Hence,
E[X(t)] = /t, t 0. (7-22)
The time varying nature of the mean implies that process X(t) is nonstationary.

Location of a Poison Point
X(t)

1
2
3
4
time
Fig. 7-4: Typical sample function of Poisson random process.
EE420/500 Class Notes 7/22/2009 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 7-9
Autocorrelation of Poisson Process
The autocorrelation is defined as R(t
1
, t
2
) = E[X(t
1
)X(t
2
)] for t
1
0 and t
2
0. First, note
that
2 2
R(t, t) t t , t 0 / - / > , (7-23)
a result obtained from the known 2
nd
moment of a Poisson random variable. Next, we show that
2
1 2 2 1 2 2 1
2
1 1 2 1 2
R(t , t ) t t t for 0 < t t
t t t for 0 < t t
/ - / <
/ - / <
. (7-24)
Proof: case 0 < t
1
< t
2
We consider the case 0 < t
1
< t
2
. The random variables X(t
1
) and {X(t
2
) - X(t
1
)} are
independent since they are for non-overlapping time intervals. Also, X(t
1
) has mean /t
1
, and
{X(t
2
) - X(t
1
)} has mean /(t
2
- t
1
). As a result,
1 2 1 1 2 1 1 2 1
E[X(t ){X(t ) - X(t )}] =E[X(t )]E[X(t ) - X(t )]= t (t t ) / / - . (7-25)
Use this result to obtain
1 2 1 2 1 1 2 1
2 2 2
1 1 2 1 1 1 1 2 1
2
1 1 2 1 2
R(t ,t )=E[X(t )X(t )] =E[X(t ){X(t )+X(t ) - X(t )}]
=E[X (t )]+E[X(t ){X(t ) - X(t )}]= t t t (t t )
= t t t for 0 < t t
/ - / - / / -
/ - / <
. (7-26)
Case 0 < t
2
< t
1
is similar to the case shown above. Hence, for the Poisson process, the
autocorrelation function is
EE420/500 Class Notes 7/22/2009 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 7-10
2
1 2 2 1 2 2 1
2
1 1 2 1 2
R(t , t ) t t t for 0 < t t
t t t for 0 < t t
/ - / <
/ - / <
. (7-27)
Semi-Random Telegraph Signal
The Semi-Random Telegraph Signal is defined as
X(0) 1
X(t) 1 if number of Poisson Points in (0,t) is
= -1 if number of Poisson Points in (0,t) is

even
odd
(7-28)
for - < t < . Figure 7-5 depicts a typical sample function of this process. In what follows, we
find the mean and autocorrelation of the semi-random telegraph signal.
First, note that
2 4
2 4
t t
[X(t) 1] [even number of pts in (0, t)]
= [0 pts in (0,t)] + [2 pts in (0,t)] + [4 pts in (0,t)] +
t t
e 1 + e cosh( t ) .
2! 4!
-/ -/


/ /

- - /


P P
P P P "
"
(7-29)
Note that (7-29) is valid for t < 0 since it uses t. In a similar manner, we can write

Location of a Poisson Point
X(t)
Fig. 7-5: A typical sample function of the semi-random telegraph signal.
EE420/500 Class Notes 7/22/2009 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 7-11
3 5
3 5
t t
[X(t) 1] [odd number of pts in (0, t)]
= [1 pts in (0,t)] + [3 pts in (0,t)] + [5 pts in (0,t)] +
t t
e t + e sinh( t ).
3! 5!
-/ -/
-

/ /

/ - - /


P P
P P P "
"
(7-30)
As a result of (7-29) and (7-30), the mean is
( )
t
2 t
E[X(t)] 1 [X(t) 1] 1 [X(t) 1]
e cosh( t ) sinh( t )
e .
-/
- /
- - - -
/ - /

P P
(7-31)
The constraint X(0) = 1 causes a nonzero mean that dies out with time. Note that X(t) is not
WSS since its mean is time varying.
Now, find the autocorrelation R(t
1
, t
2
). First, suppose that t
1
- t
2
= t > 0, and - < t
2
< .
If there is an even number of points in (t
2
, t
1
), then X(t
1
) and X(t
2
) have the same sign and
1 2 1 2 2
2 2
[X(t ) 1, X(t ) 1] [ X(t ) 1 X(t ) 1] [ X(t ) 1]
{exp[ ]cosh( )}{exp[ t ]cosh( t )}

-/t /t -/ /
P P P
(7-32)
1 2 1 2 2
2 2
[X(t ) 1, X(t ) 1] [ X(t ) 1 X(t ) 1] [ X(t ) 1]
{exp[ ]cosh( )}{exp[ t ]sinh( t )}
- - - - -
-/t /t -/ /
P P P
(7-33)
for t
1
- t
2
= t > 0, and - < t
2
< . If there are an odd number of points in (t
2
, t
1
), then X(t
1
) and
X(t
2
) have different signs, and we have
EE420/500 Class Notes 7/22/2009 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 7-12
1 2 1 2 2
2 2
[X(t ) 1, X(t ) 1] [ X(t ) 1 X(t ) 1] [ X(t ) 1]
{exp[ ]sinh( )}{exp[ t ]sinh( t )}
- - -
-/t /t -/ /
P P P
(7-34)
1 2 1 2 2
2 2
[X(t ) 1, X(t ) 1] [ X(t ) 1 X(t ) 1] [ X(t ) 1]
{exp[ ]sinh( )}{exp[ t ]cosh( t )}
- -
-/t /t -/ /
P P P
(7-35)
for t
1
- t
2
= t > 0, and - < t
2
< . The product X(t
1
)X(t
2
) is +1 with probability given by the
sum of (7-32) and (7-33); it is -1 with probability given by the sum of (7-34) and (7-35). Hence,
its expected value can be expressed as
2
2
1 2 1 2
t
2 2
t
2 2
R(t , t ) E[X(t )X(t )]
e cosh( ) e {cosh( t ) sinh( t )}
e sinh( ) e {cosh( t ) sinh( t )} .
-/
-/t
-/
-/t


/t / - /



- /t / - /


(7-36)
Using standard identities, this result can be simplified to produce
[ ]
2
2 2
1 2 1 2
t
2 2
t t
2
1 2
R(t , t ) E[X(t )X(t )]
e cosh( ) sinh( ) e cosh( t ) sinh( t )
e e e e
e , for t t 0.
-/
-/t
-/ /
-/t -/t
- /t

/t - /t / - /




t - >
(7-37)
Due to symmetry (the autocorrelation function must be even), we can conclude that
EE420/500 Class Notes 7/22/2009 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 7-13
2
1 2 1 2
R(t , t ) R( ) e , = t t
t - /
t t - , (7-38)
is the autocorrelation function of the semi-random telegraph signal, a result illustrated by Fig. 7-
6. Again, note that the semi-random telegraph signal is not WSS since it has a time-varying
mean.
Random Telegraph Signal
Let X(t) denote the semi-random telegraph signal discussed above. Consider the process
Y(t) = oX(t), where o is a random variable that is independent of X(t) for all time. Furthermore,
assume that o takes on only two values: o = +1 and o = -1 equally likely. Then the mean of Y is
E[Y] = E[oX] = E[o]E[X] = 0 for all time. Also, R
Y
(t) = E[o
2
]R
X
(t) = R
X
(t), a result depicted
by Figure 7-6. Y is called the Random Telegraph Signal since it is entirely random for all time
t. Note that the Random Telegraph Signal is WSS.
t
R(t) = exp(-2/t)
Fig. 7-6: Autocorrelation function for both the semi-random and random telegraph signals.
EE420/500 Class Notes 7/22/2009 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 7-14
Autocorrelation of Wiener Process
Consider the Wiener process that was introduced in Chapter 6. If we assume that X(0) =
0 (in many textbooks, this is part of the definition of a Wiener process), then the autocorrelation
of the Wiener process is R
X
(t
1
, t
2
) = 2D{min(t
1
,t
2
)}. To see this, first recall that a Wiener process
has independent increments. That is, if (t
1
, t
2
) and (t
3
, t
4
) are non-overlapping intervals (i.e., 0 s
t
1
< t
2
s t
3
< t
4
), then increment X(t
2
) - X(t
1
) is statistically independent of increment X(t
4
) - X(t
3
).
Now, consider the case t
1
> t
2
0 and write
R t t E X t X t E X t X t X t X t
E X t X t X t E X t X t
D t
x
( , ) ( ) ( ) { ( ) ( ) ( )} ( )
{ ( ) ( )} ( ) ( )} ( )
.
1 2 1 2 1 2 2 2
1 2 2 2 2
2
0 2
- -
- -
-
(7-39)
By symmetry, we can conclude that
R
X
(t
1
, t
2
) = 2D{min(t
1
,t
2
)}, t
1
& t
2
0, (7-40)
for the Wiener process X(t), t 0, with X(0) = 0.
Correlation Time
Let X(t) be a zero mean (i.e., E[X(t)] = 0) W.S.S. random process. The correlation time
of X(t) is defined as
t t t
x
x
x
R
R d =

z
1
0
0
( )
( ) . (7-41)
Intuitively, time t
x
gives some measure of the time interval over which significant correlation
exists between two samples of process X(t).
EE420/500 Class Notes 7/22/2009 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 7-15
For example, consider the random telegraph signal described above. For this process the
correlation time is
t t
/
/t
x
e d =
-

z
1
1
1
2
2
0
. (7-42)
In Chapter 8, we will relate correlation time to the spectral bandwidth (to be defined in Chapter
8) of a W.S.S. process.
Crosscorrelation Functions
Let X(t) and Y(t) denote real-valued random processes. The crosscorrelation of X and Y
is defined as
R t t E X t Y t xyf x y t t dx dy
xY
( , ) [ ( ) ( )] ( , ; , )
1 2 1 2 1 2

z z
- -
. (7-43)
The crosscovariance function is defined as
C t t E X t t Y t t R t t t t
XY X Y XY X Y
( , ) [{ ( ) ( )}{ ( ) ( )}] ( , ) ( ) ( )
1 2 1 1 2 2 1 2 1 2
- - - q q q q (7-44)
Let X(t) and Y(t) be WSS random processes. Then X(t) and Y(t) are said to be jointly
stationary in the wide sense if R
XY
(t
1
,t
2
) = R
XY
(t), t = t
1
- t
2
. For jointly stationary in the wide
sense processes the crosscorrelation is
R E X t Y t xy f x y dxdy
XY XY
( ) [ ( ) ( )] ( , ; ) t t t -

z z
- -
(7-45)
Warning: Some authors define R
XY
(t) = E[X(t)Y(t+t)]; in the literature, there is controversy in
the definition of R
XY
over which function is shifted. For R
XY
, the order of the subscript is
significant! In general, R
XY
R
YX
.
EE420/500 Class Notes 7/22/2009 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 7-16
For jointly stationary random processes X and Y, we show some elementary properties of
the cross correlation function.
1. R
YX
(t) = R
XY
(-t). To see this, note that
R E Y t X t E Y t X t E Y t X t R
YX XY
( ) [ ( ) ( )] [ ( ) ( )] [ ( ) ( )] ( ) . t t t t t t t - - - - - - (7-46)
2. R
YX
(t) does not necessarily have its maximum at t = 0; the maximum can occur anywhere.
However, we can say that
2 0 0 R R R
XY X Y
( ) ( ) ( ) t s - . (7-47)
To see this, note that
E X t Y t E X t E X t Y t E Y t
R R R
X XY Y
[{ ( ) ( )} ] [ ( )] [ ( ) ( )] [ ( )]
( ) ( ) ( )
- - -
-
t t t
t
2 2 2
0
0 2 0 0
2 +
. (7-48)
Hence, we have 2 0 0 R R R
XY X Y
( ) ( ) ( ) t s - as claimed.
Linear, Time-Invariant Systems: Expected Value of the Output
Consider a linear time invariant system with impulse response h(t). Given input X(t), the
output Y(t) can be computed as
Y t L t ( ) =

z
X( ) X( )h(t - ) d
-
t t t (7-49)
The notation L[ - ] denotes a linear operator (the convolution operator in this case). As given by
(7-49), output Y(t) depends only on input X(t), initial conditions play no role here (assume that
all initial conditions are zero).
EE420/500 Class Notes 7/22/2009 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 7-17
Convolution and expectation are integral operators. In applications that employ these
operations, it is assumed that we can interchange the order of convolution and expectation.
Hence, we can write
[ ]
-
-
x
-
E[Y(t)] E L[X(t)] E X( )h(t - ) d
E[X( )]h(t - ) d
( )h(t - ) d .


= t t t


t t t
q t t t

(7-50)
More generally, in applications, it is assumed that we can interchange the operations of
expectation and integration so that
[ ]
1 n 1 n
1 n 1 n
1 n 1 n 1 n 1 n
E f (t , , t )dt dt E f (t , , t ) dt dt

o o o o





" " " " " " , (7-51)
for example (f is a random function involving variables t
1
, , t
n
).
As a special case, assume that input X(t) is wide-sense stationary with mean q
x
.
Equation (7-50) leads to
Y x x x
- -
E[Y(t)] h(t - ) d h( ) d H(0)



q q t t q t t q



, (7-52)
where H(0) is the DC response (i.e., DC gain) of the system.
Linear, Time-Invariant Systems: Input/Output Cross Correlation
Let R
X
(t
1
,t
2
) denote the autocorrelation of input random process X(t). We desire to find
R
XY
(t
1
,t
2
) = E[x(t
1
)y(t
2
)], the crosscorrelation between input X(t) and output Y(t) of a linear,
time-invariant system.
EE420/500 Class Notes 7/22/2009 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 7-18
Theorem 7-1
The cross correlation between input X(t) and output Y(t) can be calculated as (both X and
Y are assumed to be real-valued)
[ ] [ ]
XY X X 1 2 1 2 2 1 2 1 2
-
R (t , t ) E X(t )Y(t ) L R (t , t ) R (t , t ) h( ) d

- o o o

(7-53)
Notation: L
2
[] means operate on the t
2
variable (the second variable) and treat t
1
(the first
variable) as a fixed parameter.
Proof: In (7-53), the convolution involves folding and shifting the t
2
slot so we write
Y t X t h d X t Y t X t X t h d ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( )
2 2 1 2 1 2
- -

z z
o o o o o o
- -
, (7-54)
a result that can be used to derive
[ ]
XY
X
X
1 2 1 2 1 2
-
1 2
-
2 1 2
R (t , t ) E[X(t )Y(t )] E[X(t )X(t )] h( ) d
R (t , t ) h( ) d
L R (t , t ) .

- o o o
- o o o

(7-55)
Special Case: X is WSS
Suppose that input process X(t) is WSS. Let t = t
1
- t
2
and write (7-55) as
R R h d R h d
R h
XY X X
X
( ) ( ) ( ) ( ) ( )
( ) ( )
t t o o o t o o o
t t
- - -
- -

z z
- -
(7-56)
Note that X(t) and Y(t) are jointly wide sense stationary.
EE420/500 Class Notes 7/22/2009 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 7-19
Theorem 7-2
The autocorrelation R
Y
(t) can be obtained from the crosscorrelation R
XY
(t) by the formula
R t t L R t t R t t h d
Y XY XY
( , ) ( , ) ( , ) ( )
1 2 1 1 2 1 2
-

z
o o o
-
(7-57)
Notation: L
1
[] means operate on the t
1
variable (the first variable) and treat t
2
(the second
variable) as a fixed parameter.
Proof: In (7-57), the convolution involves folding and shifting the t
1
slot so we write
1 1 1 2 2 1
- -
Y(t ) X(t )h( ) d Y(t )Y(t ) Y(t )X(t )h( ) d


- o o o - o o o

(7-58)
Now, take the expected value of (7-58) to obtain
R t t E Y t Y t
E Y t X t h d R t t h d
L R t t
Y
XY
XY
( , ) [ ( ) ( )]
[ ( ) ( )] ( ) ( , ) ( )
( , ) ,
1 2 1 2
2 1 1 2
1 1 2

- -

z z
o o o o o o
- -
(7-59)
a result that completes the proof of our theorem.
The last two theorems can be combined into a single formula for finding R
Y
(t
1
,t
2
).
Consider the formula
Y XY
X
1 2 1 2
-
1 2
- -
R (t , t ) R (t , t ) h( ) d
R (t , t ) h( ) d h( ) d .



-o o o

-o - o o


(7-60)
EE420/500 Class Notes 7/22/2009 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 7-20
This result leads to
R t t R t t h h d d
Y X
( , ) ( , ) ( ) ( )
1 2 1 2
- -

z z
o o o
- -
, (7-61)
an important "double convolution" formula for R
Y
in terms of R
X
.
Special Case: X(t) is W.S.S.
Suppose that input process X(t) is WSS. Let t = t
1
- t
2
and write (7-61) as
R t t R t t h h d d
Y X
( , ) ( [ ]) ( ) ( )
1 2 1 2
- - -

z z
o o o
- -
(7-62)
Define t = t
1
- t
2
; in (7-62) change the variables of integration to o and = o - and obtain
Y X
X
- -
- -
R ( ) R ( ) h( )h( ) d d
R ( ) h( )h( [ ]) d d




t t - o o- o

t - o - -o o




. (7-63)
This last formula can be expressed in a more convenient form. First, define
+( ) ( ) ( [ ]) ( ) ( ) t o t o o t t = - - - -

z
-
h h d h h . (7-64)
Then, (7-63) can be expressed as
h(t)-h(-t)
R
Y
(t) = [h(t)-h(-t)]-R
X
(t) R
X
(t)
Fig. 7-7: Output autocorrelation in terms of input autocorrelation.
EE420/500 Class Notes 7/22/2009 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 7-21
R R d R
Y X X
( ) ( ) ( ) ( ) ( ) t t t t - -

z
+ +
-
, (7-65)
a result that is illustrated by Figure 7-7. Equation (7-65) is a convenient formula for computing
R
Y
(t) when input X(t) is W.S.S. Note from (7-65) that WSS input X(t) produce WSS output
Y(t). A similar statement can be made for stationary in the strict sense: strict-sense stationary
input X(t) produces strict-sense stationary output Y(t).
Example 7-3
A zero-mean, stationary process X(t) with autocorrelation R
x
(t) = qo(t) (white noise) is
applied to a linear system with impulse response h(t) = e
-ct
U(t), c > 0. Find R
xy
(t) and R
y
(t) for
this system. Since X is WSS, we know that
XY X
c
c
R ( ) E[X(t )Y(t)] R ( ) h( ) ( ) e U( )
e U( ).
t
t
t - t t - -t o t - -t
-t
q
q
(7-66)
Note that X and Y are jointly WSS. That R
XY
(t) = 0 for t > 0 should be intuitive since X(t) is a
white noise process. Now, the autocorrelation of Y can be computed as
Y X X
XY
c c
c
R ( ) E[Y(t )Y(t)] R ( ) [h( ) h( )] [R ( ) h( )] h( )
R ( ) h( ) { e U( )} {e U( )}
e , < < .
2c
t - t
- t
t - t t - t - -t t - -t - t
t - t -t - t
- t
q
q
(7-67)
Note that output Y(t) is not white noise; samples of Y(t) are correlated with each other.
Basically, system h(t) filtered white-noise input X(t) to produce an output Y(t) that is correlated.
As we will see in Chapter 8, input X(t) is modeled as having an infinite bandwidth; the system
bandlimited its input to form an output Y(t) that has a finite bandwidth.
EE420/500 Class Notes 7/22/2009 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 7-22
Example 7-4 (from Papoulis, 3rd Ed., pp. 311-312)
A zero-mean, stationary process X(t) with autocorrelation R
x
(t) = qo(t) (white noise) is
applied at t = 0 to a linear system with impulse response h(t) = e
- ct
U(t), c > 0. See Figure 7-8.
Assume that the system is at rest initially (the initial conditions are zero so that Y(t) = 0, t < 0).
Find R
xy
and R
y
for this system.
In a subtle way, this problem differs from the previous example. Since the input is
applied at t = 0, the system sees a non-stationary input. Hence, we must analyze the general,
nonstationary case. As shown below, processes X(t) and Y(t) are not jointly wide sense
stationary, and Y(t) is not wide sense stationary. Also, it should be obvious that R
XY
(t
1
,t
2
) =
E[X(t
1
)Y(t
2
)] = 0 for t
1
< 0 or t
2
< 0 (note how this differs from Example 7-3).
R
xy
(t
1
,t
2
) equals the response of the system to R
x
(t
1
-t
2
) = qo(t
1
-t
2
), when t
1
is held fixed
and t
2
is the independent variable (the input o function occurs at t
2
= t
1
). For t
1
> 0 and t
2
> 0, we
can write
[ ]
X X xy 1 2 2 1 2 1 2
-
1 2 1 2
-
2 1 2 1 1 2
1 2
R (t , t ) L R (t , t ) R (t , t ) h( ) d
(t t ) h( ) d h( [t t ])
exp[ c(t t )] U(t t ), t 0, t 0,
0, t 0 or t 0,

- o o o
- - o o o - -
- - - > >
< <

q q
= q
o
(7-68)
a result that is illustrated by Figure 7-9. For t
2
< t
1
, output Y(t
2
) is uncorrelated with input X(t
1
),
as expected (this should be intuitive). Also, for (t
2
- t
1
) > 5/c we can assume that X(t
1
) and Y(t
2
)
h(t) = e
-ct
U(t)
close at
t = 0
X(t) Y(t)
Figure 7-8: System with random input applied at t = 0.
EE420/500 Class Notes 7/22/2009 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 7-23
are uncorrelated. Finally, note that X(t) and Y(t) are not jointly wide sense stationary since R
XY
depends on absolute t
1
and t
2
(and not only the difference t = t
1
- t
2
).
Now, find the autocorrelation of the output; there are two cases. The first case is t
2
> t
1
>
0 for which we can write
Y
XY
1
1 2 1
2
1 2 1 2
1 2
-
t
c(t [t ]) c 1
2 1
0
2ct c(t t )
2 1
R (t , t ) E[Y(t )Y(t )]
R (t , t ) h( ) d
e e U(t [t ])d
(1 e )e , t > t > 0,
2c

- - -o - o
- - -

- o o o
- - o o
-

q
q
(7-69)
Note that the requirements 1) h(o) = 0, o < 0, and 2) t
1
- o > 0 were used to write (7-69). The
second case is t
1
> t
2
> 0 for which we can write
t
2
- Axis
R
xy
(t
1
,t
2
)
0 t
1
q
Figure 7-9: Plot of (7-68); the crosscorrelation
between input X and output Y.
EE420/500 Class Notes 7/22/2009 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 7-24
Y XY
1
1
1
1
1 2
2 1 2
2
2
1 2 1 2 1 2
-
t
c(t [t ]) c
2 1
0
t
c(t [t ]) c
t t
2ct c(t t )
1 2
R (t , t ) E[Y(t )Y(t )] R (t , t ) h( ) d
e e U(t [t ])d
e e d
(1 e )e , t > t > 0.
2c
q
q
q

- - -o - o
- - -o - o
-
- - -
-o o o
- -o o
o
-

(7-70)
Note that output Y(t) is not stationary. The reason for this is simple (and intuitive). Input X(t) is
applied at t = 0, and the system is at rest before this time (Y(t) = 0, t < 0). For a few time
constants, this fact is remembered by the system (the system has memory). For t
1
and t
2
larger than 5 time constants (t
1
, t
2
> 5/(2c)), steady state can be assumed, and the output
autocorrelation can be approximated as
R t t
c
e
y
c t t
( , )
1 2
2
2 1
=
- -
q
. (7-71)
Output y(t) is approximately stationary for t
1
, t
2
> 5/(2c
1
).
Example 7-5: Let X(t) be a real-valued, WSS process with autocorrelation R(t). For any fixed
T > 0, define the random variable
T
T
T
S X(t)dt
-
=

. (7-72)
Express the second moment E[S
T
2
] as a single integral involving R(t). First, note that
T T T T
T
T T T T
2
1 1 2 2 1 2 1 2
S X(t )dt X(t )dt X(t )X(t )dt dt
- - - -
=

, (7-73)
EE420/500 Class Notes 7/22/2009 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 7-25
a result that leads to
[ ]
T T T T
T
T T T T
2
1 2 1 2 1 2 1 2
E S E X(t )X(t ) dt dt R(t t ) dt dt
- - - -

-


. (7-74)
The integrand in (7-74) depends only on one quantity, namely the difference t
1
t
2
. Therefore,
Equation (7-74) should be expressible in terms of a single integral in the variable t = t
1
t
2
. To
see this, use t = t
1
t
2
and t = t
1
+ t
2
and map the (t
1
, t
2
) plane to the (t, t) plane (this
relationship has an inverse), as shown by Fig. 7-10. As discussed in Appendix 4A, the integral
(7-74) can be expressed as
2
T T
1 2
1 2 1 2
T T
(t , t )
R(t t ) dt dt R( ) d d
( , )
- -
o
- t t t
o t t

R
, (7-75)
where
2
R is the rotated square region in the (t, t) plane shown on the right-hand side of Fig.
7-10. For use in (7-75), the Jacobian of the transformation is
Fig. 7-10: Geometry used to change variables from (t
1
, t
2
) to (t, t) in the double
integral that appears in Example 7-5.
(t
1
, t
2
) plane
t
1
-axis
t
2
-axis
-T T
-T
T
t-axis
t-axis
-2T 2T
-2T
2T
(t, t) plane
(t, 2T+t)
(t, -{2T+t)
(t, 2T-t)
(t, -{2T-t)
t t
1 2
t t
1 2
t -

t -
t ( )
1
t ( )
2
t - t

t - t
EE420/500 Class Notes 7/22/2009 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 7-26
1 2

(t , t )

( , )

o

o t t
-
(7-76)
In the (t,t)-plane, as t goes from 2T to 0, the quantity t traverses from 2T- t to 2T+ t, as can
be seen from examination of Fig. 7-10. Also, as t goes from 0 to 2T, the quantity t traverses
from 2T+ t to 2T- t. Hence, we have
T
T T 0 2T 2T 2T
2
1 2 1 2
T T 2T (2T ) 0 (2T )
0 2T
2T 0
2T
2T
E S R(t t ) dt dt R( ) d d R( ) d d
(2T )R( ) d (2T )R( ) d
(2T )R( ) d
-t -t
- - - - -t - -t
-
-

- t t t - t t t

- t t t - - t t t
- t t t

(7-77)
a single integral that can be evaluate given autocorrelation function R(t).
EE420/500 Class Notes 7/27/2011 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 8-1
Chapter 8 - Power Density Spectrum
Let X(t) be a WSS random process. X(t) has an average power, given in watts, of
E[X(t)
2
], a constant.
This total average power is distributed over some range of frequencies. This distribution
over frequency is described by S
X
(), the power density spectrum. S
X
() is non-negative
(S
X
() 0) and, for real-valued X(t), even (S
X
() = S
X
(-)). Furthermore, the area under S
X
is
proportional to the average power in X(t); that is

Average Power in X(t d ) =

z
1
2
S
x
( )
-
. (8-1)

Finally, note that S
X
() has units of watts/Hz.
Let X(t) be a WSS random processes. We seek to define the power density spectrum of
X(t). First, note that

F X t ( ) =

z
X(t)e
-j t
-

dt (8-2)

does not exits, in general. Random process X(t) is not absolutely integrable, and F[X(t)] does
not converge for most practical applications. Hence, we cannot use F[X(t)]
2
as our definition
of power spectrum (however,
[ ]
X(t) F exists as a generalized random function that can be used
to develop a theory of power spectral densities).
We seek an alternate route to the power spectrum. Let T > 0 denote the length of a time
interval, and define the truncated process

X t X t t T
t T
T
( ) ( ),
,
=
= > 0
. (8-3)

EE420/500 Class Notes 7/27/2011 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 8-2
Truncated process X
T
can be represented as

T
X (t) X(t)rect(t / 2T) = , (8-4)

where rect(t/2T) is the 2T-long window depicted by Figure 8-1.
Signal X
T
is absolutely integrable, that is, X
T
-
( ) t dt

z
< . Hence, for finite T, the
Fourier transform

F t dt
X
j t
T
( ) ( )e

=

z
X
T
-
(8-5)

exists. For every value of , F
X
T
() is a random variable. Now, Parseval's theorem states that

X X
T T
- - -
( ) ( ) ( ) t dt t dt F d
T
T
X
T
z z z
= =

2
2 2 1
2
. (8-6)

Now, divide both sides of this last equation by 2T to obtain

1
2
1
4
2
2
T
t dt
T
F d
T
T
X
T
X
T
- -
( ) ( )
z z
=

. (8-7)

The left-hand-side of this is the average power in the particular sample function X
T
being
integrated (It is a random variable). Average over all such sample functions to obtain

T -T
rect(t/2T)
1
rect t T) t
t
( / 2 1
0
= <
= >
, T
, T

Figure 8-1: Window used in approximating the
power spectum of X
T
(t).
EE420/500 Class Notes 7/27/2011 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 8-3
E
T
t dt E
T
F d
T
T
X
T
1
2
1
4
2
2
X
T
- -
( ) ( )
z z
L
N
M
O
Q
P
=
L
N
M
O
Q
P

, (8-8)

which leads to

1
2
1
4
2
2
T
E t dt
T
E F d
T
T
X
T
[ ( ) ] [ ( ) ] X
T
- -
z z
=

. (8-9)

As T , the left-hand-side of (8-9) is the formula for the average power of X(t). Hence, we
can write

T X
T
X
T
T 2
2
-T -
T T
2
-
T
1 1
Avg Pwr = E[ X (t) ] dt E[ F ( ) ]d
limit limit
2T 4 T
E[ F ( ) ]
1
= d limit
2 2T

. (8-10)

The quantity

S
x
( )
[ ( ) ]


=
L
N
M
M
O
Q
P
P T
E F
T
X
T
limit
2
2
(8-11)

is the power density spectrum of WSS process X(t). Power density spectrum S
X
() is a real-
valued, nonnegative function. If X(t) is real-valued the power spectrum is an even function of
. It has units of watts/Hz, and it tells where in the frequency range the power lies. The
quantity

1
2 1
2

S
x
( ) d
z

EE420/500 Class Notes 7/27/2011 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 8-4
is the power in the frequency band (
1
,
2
). Finally, to obtain the power spectrum of
deterministic signals, Equation (8-11), without the expectation (remove the E operator), can be
applied.
Example (8-1): Consider the deterministic signal X(t) = Aexp[j
0
t]. This signal is not real-
valued so we should not automatically expect an even power spectrum. Apply window
rect(t/2T) to x(t) and obtain

X (t) = Aexp[j t]rect
T
t
2T

0
(8-12)

The Fourier transform of X
T
is given by

F A j t rect t T ATSa
X
T
( ) [ exp( ) ( / )] [( )T] = = F
0 0
2 2 , (8-13)

where Sa(x) {sin(x)}/ x = . Hence, for large T we have (note that nothing is random so no
expectation is required here!)

X
T
2
2
2
0
F ( )
T Sa [( )T]
( ) 2 A
2T

=


x
S , (8-14)

a result depicted by Figure 8-2. The area under this graph is independent of T since

2
-
T Sa ( T)
d 1

(8-15)

independent of T. For (8-14), on either side of
0
, the width between the first zero crossings
(where all of the area is concentrated as T approaches infinity) is on the order of 2/T. The
height is on the order of 2A
2
T. As a result, (8-14) approaches a delta function and
EE420/500 Class Notes 7/27/2011 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 8-5
X
T
2
2
2 2 0
0
T
T
F ( )
T Sa [( )T]
limit ( ) 2 A 2 A ( ),
limit
2T


= = =

x
S (8-16)

a result depicted by Figure 8-3. If
0
= 0, then X(t) = A, a constant DC signal. For this DC-
signal case, the power spectrum is S
x
() = 2A
2
(), as expected.
Rational Power Density Spectrums
In many applications S
X
() takes the form

2m 2m 2 2
2m 2 2
2n 2n 2 2
2n 2 2
0
0
a +a a
( ) , m n
b +b b

+ + +
= <
+ + +
x
"
"
S , (8-17)

2A
2

0

S
X
() =2A
2
(
0
)

Figure 8-3: Power spectrum of X(t) for Example 8-1.
Width 2/T

0
2A
2
T
S
x
()

Figure 8-2: Approximation to the power spectrum of X(t).
EE420/500 Class Notes 7/27/2011 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 8-6
a rational function of . In (8-17), the coefficients a
0
, a
2
, , a
2m-2
, b
0
, b
2
, , b
2n-2
are real-
valued. Also, only even powers of appear in the numerator and denominator since S
X
() is an
even function of . Also, since

Avg Pwr in X =
1
2
S
x
-

z
< ( )d , (8-18)

we must have m < n (the degree of the numerator must be at least two less then the degree of the
denominator).
Rational spectrums are continuous in nature. They contain no delta function(s), an
observation that implies that X has no DC or sinusoidal component(s). However, in many
applications, one encounters processes that have a DC component as well as an AC component
with a rational spectrum. These processes can be modeled as

X(t) = A + X
AC
(t), (8-19)

where A is a DC constant, and zero-mean X
AC
has a rational spectrum, denoted here as S
AC
().
We compute the power spectrum of X(t) of the form (8-19). First, window X to obtain

T AC
X (t) rect(t / 2T) X (t)rect(t / 2T) A = + (8-20)

so that

[ ]
[ ]
[ ]
T
AC
1 2
1
2
X F ( ) F ( )
F ( ) rect(t / 2T)
F ( ) X (t)rect(t / 2T) .
A
= +


F
F
F
. (8-21)

EE420/500 Class Notes 7/27/2011 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 8-7
Note that F
1
() is a deterministic function of , and F
2
() is a random function of . A simple
expansion yields

[ ]
T
2
2 2 2
1 2 1 1 2 2
X F F F 2Re F F F


= + = + +

F . (8-22)

Use the fact that F
1
is deterministic, and take the ensemble average of (8-22) to obtain

[ ]
T
2
2 2
1 1 2 2
E X F 2Re F E F E F [ ]



= + +



F . (8-23)

However, note that

[ ] [ ]
[ ]
AC AC
AC
j t
2
-
j t
-
E F E X (t)rect(t / 2T) E X (t)rect(t / 2T)e dt
E X (t) rect(t / 2T)e dt
0


= =



=
=

F
(8-24)

since E[X
AC
] = 0. Because of (8-24), the middle term on the right-hand side of (8-23) is zero,
and we have

[ ]
T
2
2
2
2
1
E X
E F
F
2T 2T 2T




= +
F
. (8-25)

Finally, as T , we have

[ ]
T
AC
2
2
T
E X
limit
( ) 2 ( ) ( )
2T
A




= = +
x
F
S S . (8-26)
EE420/500 Class Notes 7/27/2011 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 8-8
That is, the power spectrum of X is the power spectrum for the DC component A (see the
sentence at the end of Example 8-1) added to the rational power spectrum S
AC
() for the AC
component.
Wiener-Khinchine Theorem
Assume that X(t) is a wide-sense-stationary process with autocorrelation R
X
(). The
power spectrum S
X
() is the Fourier transform of autocorrelation R
X
(). This is the famous
Wiener-Khinchine Theorem.
Proof:
Recall that the power spectrum of real-valued (the proof can be generalized to include
complex-valued random processes) random process X(t) is

2
T
T
E [X ]
( ) limit
2T



= S
F
, (8-27)

where

T
X(t), T
t
X (t)
0, T
t
<

>

. (8-28)

Take the inverse Fourier transform of S to obtain

[ ]
1 2
1 2
2
-1 j
T
T
T T
j t j t j
1 1 2 2
T T
T
T T
j (t t )
1 2 2 1
T T
T
1 E [X ]
limit e d ( )
2
2T
1 1
limit E X(t )e dt X(t )e dt e d
2 2T
1 1
limit E[X(t )X(t )] e d dt dt
2T 2



=


=



S
F
F
(8-29)
EE420/500 Class Notes 7/27/2011 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 8-9
(the fact that X is real-valued was used to obtain (8-29)). However, from Fourier transform
theory we know that

1 2
j (t t )
1 2
1
e d (t t )
2

+

= +

. (8-30)

Now, use (8-30) in (8-29), and the fact that E[X(t
1
)X(t
2
)] = R(t
2
-t
1
), to obtain

[ ]
T T
-1
2 1 1 2 2 1
T T
T
T T
1 1
T T
T T
1
limit R(t t ) (t t )dt dt ( )
2T
1 1
limit R( )dt R( ) limit dt
2T 2T
R( ).



= +

= =


=


F S
(8-31)

This is the well-known, and very useful, Wiener-Khinchine Theorem: the Fourier transform of
the autocorrelation is the power spectrum density. Symbolically, we write

x
R ( ) ( ) S . (8-32)

A second proof of the Wiener-Khinchine Theorem follows. First, note

1 2 1 2
T T T T j (t t ) 2 j t j t
T 1 1 2 2 1 2 1 2
-T -T -T -T
[X ] X(t )e dt X(t )e dt X(t )X(t ) e dt dt


= =

F (8-33)

since X is assumed to be real valued. Take the expectation of this result to obtain

1 2
T T j (t t )
2
T 1 2 1 2
-T -T
E[ [X ] ] R(t , t ) e dt dt

=

F . (8-34)

EE420/500 Class Notes 7/27/2011 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 8-10
Define = t
1
- t
2
and = t
1
+ t
2
. From Example 4A-2 of Appendix 4A, we have

( )
1 2
T T 2T j (t t )
2
j
T 1 2 1 2
-T -T -2T
E[ [X ] ] R(t t ) e dt dt 2T R( )e d


= =

F , (8-35)

a result that leads to

2
2T
T j
-2T
E[ [X ] ]
1 R( )e d
2T 2T


=

F
. (8-36)

so that

[ ]
X
2
2T
T j j
-2T -
T T
E[ [X ] ]
limit limit ( ) 1 R( )e d R( )e d
2T 2T
R( ) ,



= = =


=

S
F
F
, (8-37)

(as T approaches infinity, triangle (1 - /2T) approaches unity over all for which the integral
of R() is significant).
Example (8-2): Power spectrum of the random telegraph signal
The random telegraph signal was discussed in Chapter 7; a typical sample function is
depicted by Figure 8-4. It is defined as

Location of a Poisson Point
X(t)

Figure 8-4: A typical sample function of the Random Telegraph Signal.
EE420/500 Class Notes 7/27/2011 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 8-11
X
X t
( )
( )
0 =
=

if number of Poisson Points in (0, t) is


= - if number of Poisson Points in (0, t) is
even
odd ,


where is a random variable that takes on the two values = 1 equally likely. From Chapter
7, recall that the autocorrelation of X is R
X
() = e
2
, where is the average point density
(also, in X(t), is the average number of zero crossings per unit length). By the Wiener-
Khinchine theorem, the power spectrum is

S
x
e ( )



=
L
N
M
O
Q
P
=
+

F
2
2 2
4
4
. (8-38)

Large values for "average-toggle-density" make waveform X(t) toggle faster; they also make
the bandwidth larger, as shown by (8-38).
Example (8-3): Zero-Mean, White Noise
A zero-mean, white noise process X(t) is one for which

0
N
R( ) ( )
2
= , (8-39)

where N
0
/2 is a constant. The power spectrum is N
0
/2 Watts/Hz. This implies that X possesses
an infinite amount of power, a physical absurdity. In the mathematical literature, white noise
processes are called generalized random processes (the rational being somewhat similar to that
used when delta functions are called generalized functions). Intentionally, we have not stated
how X(t) is distributed (do not assume that X(t) is Gaussian unless this is explicitly stated). In
the name assigned to X(t), the adjective white is include to draw a parallel to white light, light
containing all frequencies.
EE420/500 Class Notes 7/27/2011 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 8-12
White noise X(t) exists only as a mathematical abstraction. However, it is a very useful
abstraction. For example, suppose we have a finite bandwidth system driven by a wide-band
noise process with spectrum that is flat over the system bandwidth (noise bandwidth >> system
bandwidth). Under these conditions, the analysis could be simplified by assuming that input X(t)
is white noise.
Addition of Power Spectrums for Uncorrelated Processes
Suppose WSS, zero-mean processes X(t) and Y(t) are uncorrelated so that E[X(t+)Y(t)]
= E[X(t+)]E[Y(t)] = 0 for all t and . Then, we can write

[ ]
[ ] [ ]
X Y
X Y
R ( ) E [X(t ) Y(t )][X(t) Y(t)]
E X(t )X(t) E Y(t )Y(t)
R ( ) R ( ) .
+
= + + + +
= + + +
= +
(8-40)

Now, take the Fourier transform of (8-40) to see that

x y x y
( ) ( ) ( )
+
= + S S S , (8-41)

the result that power spectrums add for uncorrelated processes. This conclusion has many
applications (see (8-26) for the case of a DC component added to a zero-mean process).
Input-Output of Power Spectrums
Let X(t) be a W.S.S process. Y(t) = L[ ] denotes a linear, time-invariant system. From
Chapter 7, recall the formula

R h h R
Y x
( ) ( ) ( ) ( ) = b g . (8-42)

Take the Fourier transform of (8-42) to obtain
EE420/500 Class Notes 7/27/2011 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 8-13
S S
Y X Y
h h ( ) [ ] [ ( ) ( )] ( ) = = F F R . (8-43)

However,

F[h(t)*h(-t)] = H(j)H*(j) = H(j)
2
. (8-44)

Combine this with (8-43) to obtain

S S
Y X
H j ( ) ( ) ( ) =
2
, (8-45)

an important result for computing the output spectral density.
Example (8-4): Let X(t) be modeled as zero-mean, white Gaussian noise. We assume R
X
() =
(N
0
/2)() so that S
X
() = N
0
/2. Let X(t) be applied to the first-order RC low-pass filter shown
by Figure 8-5. Find the output power density spectrum S
Y
() and the first-order density function
of Y(t). First, from Equation (8-45), we obtain

Y
0 0
2
N N 1 1 1
( )
1 j RC 1 j RC 2 2
1 (RC )

= =

+
+
S , (8-46)

a result that is depicted by Figure 8-6. Output Y(t) has a mean of zero (why?) and a variance
N
0
/2
S
X
()

C
R
+
-
X(t) Y(t)
+
-
H j
j RC
( )

=
+
1
1

Figure 8-5: Power spectrum of white noise and a simple RC low-pass filter.
EE420/500 Class Notes 7/27/2011 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 8-14
equal to the AC power. Hence, the variance of Y(t) is

[ ]
Y
2 0 0 0
2 2
N N N 1 1 1
AC Power in Y(t) d d
2 2 4 RC 4RC
1 (RC ) 1


= = = =

+ +

. (8-47)

Finally, output Y is Gaussian since linear filtering a Gaussian input produces a Gaussian output.
As a result, we can write

Y
Y
Y
2
2
1 y
f (y) exp
2
2

=



, (8-48)

where
Y
2
= N
0
/(4RC).
Note that S
Y
(), given by (8-46), is an even-symmetry, rational function of with a
denominator degree is two more than the numerator degree (which is a requirement for a finite
power output process). Generally speaking, we should expect an even-symmetry rational output
spectrum from a lumped-parameter, time-invariant system (an RLC circuit/filter, for example)
that is driven by noise that has a flat spectrum over the system bandwidth.
Example (8-5): Let X(t) be a white Gaussian noise ideal current source with a double sided
spectral density of 1 watts/Hz. Find the average power absorbed by the resistor in the circuit
depicted by Figure 8-7. The spectral density of the power absorbed by the resistor is given by
-2 -1 0 1 2
RC
N
0
/2
S
Y
()

Figure 8-6: Power Spectrum of RC low-pass filter output.
EE420/500 Class Notes 7/27/2011 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 8-15
R
2
2
1
( ) H( j ) 1
1
= =
+
S , (8-49)

an even-symmetry, rational function with denominator degree two more than numerator degree.
Hence, the total power absorbed by the 1 Ohm resistor is given by

R avg
2
- -
1 1 1
P ( ) d d 1/ 2 watt
2 2
1


= = =

+

S (8-50)

Noise Equivalent Bandwidth of a Low-pass System/Filter
We seek to quantify the idea of system/filter bandwidth. Let H(j) be the transfer
function of a low-pass system/filter. Let X(t) be white noise with a power spectrum of N
0
/2
watts/Hz. The average power output is

P H j d
avg
N
=

z
1
2
0
2
2


-
( ) . (8-51)

Now, consider an ideal low-pass filter that has a gain equal to H(0) and a one-sided bandwidth of
B
N
Hz (see Figure 8-8). Apply the white noise X(t) to this ideal filter. The output power is

P H d N H B
avg
N
B
B
N
N
N
= =
z
1
2
0 0
0
2
2
0
2
2
2

-
( ) ( ) . (8-52)
1 1F
X(t)
H j
V j
X j
j
j
j
( )
( )
( )
/
/

= =

=
+
1
1
1
V(j)
+
-

Figure 8-7: RC low-pass filter and transfer function.
EE420/500 Class Notes 7/27/2011 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 8-16
Again, consider H(j). The noise equivalent bandwidth of H(j) is defined to be the one-sided
bandwidth (in Hz) of an ideal filter (with gain H(0)) that passes as much power as H(j) does
when both filters are supplied with the same input white noise. Hence, equate (8-52) and (8-51)
to obtain

N H B H j d
N
N
0
2
2
2
0
1
2
0
( ) ( ) =


-
. (8-53)

This yields

B
H
H j d
N
=

z
1
4 0
2
2


( )
( )
-
(8-54)

as the noise equivalent bandwidth of filter H(j).
Example (8-6): Find the noise equivalent bandwidth of the single pole RC low-pass filter
depicted by Figure 8-9. Direct application of Formula (8-54) yields

N
2 2 2 2
- -
1 1 1 1 1
B d d Hz
4 4 RC 4RC
1 R C 1


= = =

+ +

(8-55)
2B
N
-2B
N
H(0)
-axis

Figure 8-8: Amplitude response of an ideal low-pass filter
C
R
+
-
X(t) Y(t)
+
-
H j
j RC
( )

=
+
1
1

Figure 8-9: RC low-pass filter and transfer function.
EE420/500 Class Notes 7/27/2011 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 8-17
Example (8-7): Find the noise equivalent bandwidth of an n
th
-order Butterworth low-pass filter.
By definition

H j
c
n
( )
( / )


2
2
1
1
=
+
,

for positive integer n. The quantity
c
is the 3dB cut off frequency (see Figure 8-10 for
magnitude response). The noise equivalent bandwidth is

B d d
N
c
n
c
n
=
+
=
+

z z
1
4
1
1
4
1
1
2 2



( / )
- -
.

This last integral appears in most integral tables. Using the tabulated result, the noise equivalent
bandwidth B
N
is

0.0 0.5 1.0 1.5 2.0 2.5 3.0
/
c
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
M
a
g
n
i
t
u
d
e
n = 2
n = 4
n = 6
H()

Figure 8-10: Magnitude response of an n
th
-order Butterworth filter with 3db cutoff
frequency
c
. The horizontal axis is /
c
. The filter approaches an ideal low-pass filter
as order n becomes large.
EE420/500 Class Notes 7/27/2011 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 8-18
B
n n
N
c
=
F
H
G
I
K
J
1
4 2

sin( / )
, n = 1, 2, 3, ", (8-56)

Hz. As n , the Butterworth filter approaches the ideal LPF. The limit of (8-56) is

n n
c
n n n
c
B
n
N

=
+
F
H
G
G
I
K
J
J
= limit limit
1
4 2
2
1
3 2
3
1
5 2
5


! !
( ) ( ) "
(8-57)

as expected.
EE603 Class Notes 02/11/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9-1
Chapter 9: Commonly Used Models: Narrow-Band Gaussian Noise and Shot Noise
Narrow-band, wide-sense-stationary (WSS) Gaussian noise (t) is used often as a noise
model in communication systems. For example, (t) might be the noise component in the output
of a radio receiver intermediate frequency (IF) filter/amplifier. In these applications, sample
functions of (t) are expressed as
(t) (t)cos t (t)sin t =
c c s c
, (9-1)
where
c
is termed the center frequency (for example,
c
could be the actual center frequency of
the above-mentioned IF filter). The quantities
c
(t) and
s
(t) are termed the quadrature
components (sometimes,
c
(t) is known as the in-phase component and
s
(t) is termed the
quadrature component), and they are assumed to be real-valued.
Narrow-band noise (t) can be represented in terms of its envelope R(t) and phase (t).
This representation is given as
c
(t) R(t) cos( t (t)) = + , (9-2)
where
2 2
c s
1
s c
R(t) (t) (t)
(t) tan ( (t) / (t)).

+

(9-3)
Normally, it is assumed that R(t) 0 and < (t) for all time.
Note the initial assumptions placed on (t). The assumptions of Gaussian and WSS
behavior are easily understood. The narrow-band attribute of (t) means that
c
(t),
s
(t), R(t)
and (t) are low-pass processes; these low-pass processes vary slowly compared to cos
c
t ; they
EE603 Class Notes 02/11/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9-2
are on a vastly different time scale from cos
c
t. Many periods of cos
c
t occur before there is
notable change in
c
(t),
s
(t), R(t) or (t).
A second interpretation can be given for the term narrow-band. This is accomplished in
terms of the power spectrum of (t), denoted as S

(). By the Wiener-Khinchine theorem,


S

() is the Fourier transform of R

(), the autocorrelation function for WSS (t). Since (t) is


real valued, the spectral density S

() satisfies
( ) 0
( ) ( ).



=
S
S S
(9-4)
Figure 9-1 depicts an example spectrum of a narrow-band process. The narrow-band attribute
means that S

() is zero except for a narrow band of frequencies around


c
; process (t) has a
bandwidth (however it might be defined) that is small compared to the center frequency
c
.
Power spectrum S

() may, or may not, have


c
as axes of local symmetry. If
c
is
an axis of local symmetry, then
S S

( ) ( ) + = +
c c
(9-5)
for 0 < <
c
, and the process is said to be a symmetrical band-pass process (Fig. 9-1 depicts a
symmetrical band-pass process). It must be emphasized that the symmetry stated by the second
of (9-4) is always true (i.e., the power spectrum is even); however, the symmetry stated by (9-5)
Rad/Sec

()
-
c

c
w
a
t
t
s
/
H
z
Fig. 9-1: Example spectrum of narrow-band noise.
EE603 Class Notes 02/11/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9-3
may, or may not, be true. As will be shown in what follows, the analysis of narrow-band noise is
simplified if (9-5) is true.
To avoid confusion when reviewing the engineering literature on narrow-band noise, the
reader should remember that different authors use slightly different definitions for the cross-
correlation of jointly-stationary, real-valued random processes x(t) and y(t). As used here, the
cross-correlation of x and y is defined as R
xy
( ) E[x(t+)y(t)]. However, when defining R
xy
,
some authors shift (by ) the time variable of the function y instead of the function x.
Fortunately, this possible discrepancy is accounted for easily when comparing the work of
different authors.
(t) has Zero Mean
The mean of (t) must be zero. This conclusion follows directly from
c c s c
E[ (t)] E[ (t)]cos t E[ (t)]sin t = . (9-6)
The WSS assumption means that E[(t)] must be time invariant (constant). Inspection of (9-6)
leads to the conclusion that E[
c
] = E[
s
] = 0 so that E[] = 0.
Quadrature Components In Terms of and
Let the Hilbert transform of WSS noise (t) be denoted in the usual way by the use of a
circumflex; that is,

(t) denotes the Hilbert transform of (t) (see Appendix 9A for a discussion
of the Hilbert transform). The Hilbert transform is a linear, time-invariant filtering operation
applied to (t); hence, from the results developed in Chapter 7,

(t) is WSS.
In what follows, some simple properties are needed of the cross correlation of (t) and

(t) . Recall that

(t) is the output of a linear, time-invariant system that is driven by (t). Also
recall that techniques are given in Chapter 7 for expressing the cross correlation between a
system input and output. Using this approach, it can be shown easily that
EE603 Class Notes 02/11/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9-4

R ( ) E[ (t ) (t)] R ( )

R ( ) E[ (t ) (t)] R ( )
R (0) R (0) 0
R ( ) R ( ) .




+ =
+ =
= =
=
(9-7)
Equation (9-1) can be used to express

(t) . The Hilbert transform of the noise signal can


be expressed as

(t) (t) cos t (t) sin t (t) cos t (t) sin t


(t) sin t (t) cos t .


= =
= +
c c s c c c s c
c c s c
(9-8)
This result follows from the fact that
c
is much higher than any frequency component in
c
or

s
so that the Hilbert transform is only applied to the high-frequency sinusoidal functions (see
Appendix 9A).
The quadrature components can be expressed in terms of and

. This can be done by


solving (9-1) and (9-8) for


c c c
s c c
(t) (t)cos t

(t)sin t
(t)

(t) cos t (t)sin t .
= +
=
(9-9)
These equations express the quadrature components as a linear combination of Gaussian .
Hence, the components
c
and
s
are Gaussian. In what follows, Equation (9-9) will be used to
calculate the autocorrelation and crosscorrelation functions of the quadrature components. It
will be shown that the quadrature components are WSS and that
c
and
s
are jointly WSS.
Furthermore, WSS process (t) is a symmetrical band-pass process if, and only if,
c
and
s
are
EE603 Class Notes 02/11/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9-5
uncorrelated for all time shifts.
Relationships Between Autocorrelation Functions R

, R

and R

It is easy to compute, in terms of R

, the autocorrelation of the quadrature components.


Use (9-9) and compute the autocorrelation
R ( ) E[ (t) (t )]
E[ (t) (t )]cos t cos (t ) E[

(t) (t )]sin t cos (t )


E[ (t)

(t )]cos t sin (t ) E[

(t)

(t )]sin t sin (t ) .




c
c c
c c c c
c c c c
= +
= + + + + +
+ + + + + +
(9-10)
This last result can be simplified by using (9-7) to obtain
R ( ) R ( )[cos t cos (t ) sin t sin (t )]

R ( )[cos t sin (t ) sin t cos (t )] ,




c
c c c c
c c c c
= + + +
+ + +
a result that can be expressed as
R ( ) R ( ) cos

R ( )sin .


c
c c
= + (9-11)
The same procedure can be used to compute an identical result for R

s
; this leads to the
conclusion that
c s
R ( ) R ( )

(9-12)
for all .
A somewhat non-intuitive result can be obtained from (9-11) and (9-12). Set = 0 in the
last two equations to conclude that
EE603 Class Notes 02/11/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9-6
c s
R (0) R (0) R (0)

= = , (9-13)
an observation that leads to
2 2
c s
c s
E[ (t)] E[ (t)] E[ (t)]
Avg Pwr in (t) = Avg Pwr in (t) = Avg Pwr in (t).
2
= =

(9-14)
The frequency domain counterpart of (9-11) relates the spectrums S

, S

c
and S

s
.
Take the Fourier transform of (9-11) to obtain
( )
( )
c c
c s
c c c c
1
( ) ( ) ( ) ( )
2
1
sgn( ) ( ) sgn( ) ( ) .
2


= = + +
+ +
S S S S
S S
(9-15)
Since
c
and
s
are low-pass processes, Equation (9-15) can be simplified to produce
c c c c
c s
( ) ( ) ( ) ( ),
0, otherwise,

= = + +
=
S S S S
(9-16)
a relationship that is easier to grasp and remember than is (9-11).
Equation (9-16) provides an easy method for obtaining S

c
and/or
s

S given only S

.
First, make two copies of S

(). Shift the first copy to the left by


c
, and shift the second copy
to the right by
c
. Add together both shifted copies, and truncate the sum to the interval
c


c
to get S

c
. This shift and add procedure for creating S

c
is illustrated by Fig. 9-2.
Given only S

(), it is always possible to determine S

c
(which is equal to
s
S ) in this manner.
The converse is not true; given only S

c
, it is not always possible to create S

() (Why? Think
EE603 Class Notes 02/11/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9-7
Fig. 9-2: Creation of
c

S from shifting and adding copies of



S .
about the fact that S


c
( ) must be even, but S

() may not satisfy (9-5)).


The Crosscorrelation R

c s
It is easy to compute the cross-correlation of the quadrature components. From (9-9) it
follows that

c
S

()
S

(+
c
) , <
c

c
S

(
c
) , <
c

c
S

c
() = S

(+
c
) + S

(
c
) , <
c

c

c
EE603 Class Notes 02/11/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9-8
R ( ) E[ (t ) ( )]
E[ (t )

(t)]cos (t )cos t E[ (t ) (t)]cos (t )sin t


E[

(t )

(t)]sin (t )cos t E[

(t ) (t)]sin (t )sin t .




c s
c s
c c c c
c c c c
= +
= + + + +
+ + + + +
t

(9-17)
By using (9-7), Equation (9-17) can be simplified to obtain
R ( ) R ( )[ sin t cos (t ) cos t sin (t )]

R ( )[cos t cos (t ) sin t sin (t )] ,




c s
c c c c
c c c c
= + + +
+ + +
a result that can be written as
R ( ) R ( )sin

R ( )cos


c s
c c
= . (9-18)
The cross-correlation of the quadrature components is an odd function of . This follows
directly from inspection of (9-18) and the fact that an even function has an odd Hilbert
transform. Finally, the fact that this cross-correlation is odd implies that R ( )

c s
0 = 0; taken at
the same time, the samples of
c
and
s
are uncorrelated and independent. However, as
discussed below, the quadrature components
c
(t
1
) and
s
(t
2
) may be correlated for t
1
t
2
.
The autocorrelation R

of the narrow-band noise can be expressed in terms of the


autocorrelation and cross-correlation of the quadrature components
c
and
s
. This important
result follows from using (9-11) and (9-18) in
R ( )cos R ( )sin R ( )cos

R ( )sin cos
R ( )sin

R ( ) cos sin .




c
c
c s
c c c c
c c c
+ = +
+
(9-19)
EE603 Class Notes 02/11/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9-9
However, R

results from simplification of the right hand side of (9-19), and the desired
relationship
R ( ) R ( ) cos R ( )sin

= +
c
c
c s
c
(9-20)
follows.
Comparison of (9-16) with the Fourier transform of (9-20) reveals an unsymmetrical
aspect in the relationship between S

, S

c
and S

s
. In all cases, both S

c
and S

s
can be
obtained by simple translations of S

as is shown by (9-16). However, in general, S

cannot be
expressed in terms of a similar, simple translation of S

c
(or S

s
), a conclusion reached by
inspection of the Fourier transform of (9-20). But, as shown next, there is an important special
case where R ( )


c s
is identically zero for all , and S

can be expresses as simple translations


of S

c
.
Symmetrical Bandpass Processes
Narrow-band process (t) is said to be a symmetrical band-pass process if
S S

( ) ( ) + = +
c c
(9-21)
for 0 < <
c
. Such a bandpass process has its center frequency
c
as an axis of local
symmetry. In nature, symmetry usually leads to simplifications, and this is true of Gaussian
narrow-band noise. In what follows, we show that the local symmetry stated by (9-21) is
equivalent to the condition R ( )


c s
= 0 for all (not just at = 0).
The desired result follows from inspecting the Fourier transform of (9-18); this transform
is the cross spectrum of the quadrature components, and it vanishes when the narrow-band
process has spectral symmetry as defined by (9-21). To compute this cross spectrum, first note
the Fourier transform pairs
EE603 Class Notes 02/11/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9-10
R ( ) ( )

R ( ) Sgn( ) ( ) ,




S
S j
(9-22)
where
Sgn( )
for
for

+
<
R
S
|
T
|
1
1


> 0
0
(9-23)
is the commonly used sign function. Now, use Equation (9-22) and the Frequency Shifting
Theorem to obtain the Fourier transform pairs
R ( ) sin ( ) ( )

R ( ) cos Sgn( ) ( ) Sgn( ) ( ) .






c c c
c c c c c
j
+
+ + +
1
2
1
2
S S
S S
j
(9-24)
Finally, use this last equation and (9-18) to compute the cross spectrum
S
S S




c s c s
j
c c c c
( ) [ R ( )]
( )[ Sgn( )] ( )[ Sgn( )] .
=
= + + +
F
1
2
1 1
(9-25)
Figure 9-3 depicts example plots useful for visualizing important properties of (9-25).
From parts b) and c) of this plot, note that the products on the right-hand side of (9-25) are low
pass processes. Then it is easily seen that
EE603 Class Notes 02/11/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9-11
c
c c c c
c s
c
0 ,
( ) j[ ( ) ( )],
0 , .

>

= + < <

<

S S S (9-26)
Finally, note that S


c s
( ) = 0 is equivalent to the narrow-band process satisfying the
symmetry condition (9-21). Since the cross spectrum is the Fourier transform of the cross-
correlation, this last statement implies that, for all t
1
and t
2
(not just t
1
= t
2
),
c
(t
1
) and
s
(t
2
) are
uncorrelated if and only if (9-21) holds. On Fig. 9-3, symmetry implies that the spectral
components labeled with U can be obtained from those labeled with L by a simple folding
operation.
System analysis is simplified greatly if the noise encountered has a symmetrical
S

()
-
c

c
-2
c
2
c
a)
L L U U
S

(-
c
)
-
c

c
-2
c
2
c
1-Sgn(-
c
)
b)
L L U U
S

(+
c
)
-
c

c
-2
c
2
c
1+Sgn(+
c
)
c)
L
L U U
Figure 9-3: Symmetrical bandpass processes have
c
(t
1
) and
s
(t
2
)
uncorrelated for all t
1
and t
2
.
EE603 Class Notes 02/11/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9-12
spectrum. Under these conditions, the quadrature components are uncorrelated, and (9-20)
simplifies to
R ( ) R ( ) cos

=
c
c
. (9-27)
Also, the spectrum S

of the noise is obtained easily by scaling and translating S

c
F[R ]

c
as shown by
S S S

( ) [ ( ) ( )] = + +
1
2
c
c
c
c
. (9-28)
This result follows directly by taking the Fourier transform of (9-27). Hence, when the process
is symmetrical, it is possible to express S

in terms of a simple translations of S

c
(see the
comment after (9-20)). Finally, for a symmetrical bandpass process, Equation (9-16) simplifies
to
S S S


c s
2 ( ) ( ) ( ),
,
= = +
=
c c c
otherwise 0
. (9-29)
Example 9-1: Figure 9-4 depicts a simple RLC bandpass filter that is driven by white Gaussian
noise with a double sided spectral density of N
0
/2 watts/Hz. The spectral density of the output is
L C
R
+
-

+
S() = N
0
/2
watts/Hz
(WGN)
Figure 9-4: A simple band-pass filter driven by white Gaussian
noise (WGN).
EE603 Class Notes 02/11/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9-13
given by
S




( ) ( )
( )
( )
= =
+ +
N
H j
N j
j
bp
c
0
2
0 0
0
2 2
2
2 2
2
, (9-30)
where
0
= R/2L,
c
= (
n
2
-
0
2
)
1/2
and
n
= 1/(LC)
1/2
. In this result, frequency can be
normalized, and (9-30) can be written as
S




( )
( )
( )
=

+ +
N j
j
0 0
0
2
2
2
2
1
, (9-31)
where
0
=
0
/
c
and = /
c
. Figure 9-5 illustrates a plot of the output spectrum for
o
= .5;
note that the output process is not symmetrical. Figure 9-6 depicts the spectrum for
o
= .1 (a
much sharper filter than the
o
= .5 case). As the circuit Q becomes large (i.e.,
o
becomes
small), the filter approximates a symmetrical filter, and the output process approximates a
symmetrical bandpass process.
Envelope and Phase of Narrow-Band Noise
Zero-mean quadrature components
c
(t) and
s
(t) are jointly Gaussian, and they have the
-2 -1 0 1 2
(radians/second)
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
S

()
Figure 9-5: Output Spectrum for
o
= .5
-2 -1 0 1 2
(radians/second)
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
S

()
Figure 9-6: Output Spectrum for
o
= .1
EE603 Class Notes 02/11/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9-14
same variance
2
= R R R
c s
( ) ( ) ( ) 0 0 0 = = . Also, taken at the same time t, they are
independent. Hence, taken at the same time, processes
c
(t) and
s
(t) are described by the joint
density
f
c s
c s
( , ) exp

=
+
L
N
M
M
O
Q
P
P
1
2 2
2
2 2
2
. (9-32)
We are guilty of a common abuse of notation. Here, symbols
c
and
s
are used to denote
random processes, and sometimes they are used as algebraic variables, as in (9-32). However,
always, it should be clear from context the intended use of
c
and
s
.
The narrow-band noise signal can be represented as
c c s c
1 c 1
(t) (t) cos t (t) sin t
(t) cos( t (t))
=
= +
(9-33)
where
2 2
1 c s
1 s
1 1
c
(t) (t) (t)
(t)
(t) Tan , - < ,
(t)

= +

=



(9-34)
are the envelope and phase, respectively. Note that (9-34) describes a transformation of
c
(t)
and
s
(t). The inverse is given by

c
s
=
=

1
1
cos( )
sin( )

1
1
(9-35)
EE603 Class Notes 02/11/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9-15
The joint density of
1
and
1
can be found by using standard techniques. Since (9-35) is
the inverse of (9-33) and (9-34), we can write
c 1 1
s 1 1
c s
1 1 c s
cos
1 1
sin
( , )
f ( , ) f ( , ) det
( , ) =
=

=

(9-36)
c s
1 1 1
1 1 1
1 1
( , )
cos sin
sin cos
( , )


=



(again, the notation is abusive). Finally, substitute (9-32) into (9-36) to obtain
2 2 2 1
1 1 1 1 1
2 2
2 1
1
2 2
1
f ( , ) exp (sin cos )
2 2
1
exp ,
2 2

= +



=



. (9-37)
for
1
0 and - <
1
. Finally, note that (9-37) can be represented as
f f ( , ) ( )f( )
1 1

1 1
= , (9-38)
where
f U ( ) exp ( )


1
1
2 2
1
2
1
1
2
=
L
N
M
O
Q
P

(9-39)
describes a Rayleigh distributed envelope, and
f( ) ,
1 1
=
1
2
- < (9-40)
EE603 Class Notes 02/11/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9-16
describes a uniformly distributed phase. Finally, note that the envelope and phase are
independent. Figure 9-7 depicts a hypothetical sample function of narrow-band Gaussian noise.
Envelope and Phase of a Sinusoidal Signal Plus Noise - the Rice Density Function
Many communication problems involve deterministic signals embedded in random noise.
The simplest such combination of signal and noise is that of a constant frequency sinusoid added
to narrow-band Gaussian noise. In the 1940s, Steven Rice analyzed this combination and
published his results in the paper Statistical Properties of a Sine-wave Plus Random Noise, Bell
System Technical Journal, 27, pp. 109-157, January 1948. His work is outlined in this section.
Consider the sinusoid
0 c 0 0 0 c 0 0 c
s(t) A cos( t ) A cos cos t A sin sin t = + = , (9-41)
where A
0
,
c
, and
0
are known constants. To signal s(t) we add noise (t) given by (9-1), a
zero-mean WSS band-pass process with power
2
= E[
2
] = E[
c
2
] = E[
s
2
]. This sum of signal
and noise can be written as
0 0 c c 0 0 s c
2 c 2
s(t) + (t) [A cos (t)]cos t [A sin (t)]sin t
(t) cos[ t ] ,
= + +
= +
(9-42)
Fig. 9-7: A hypothetical sample function of narrow-band Gaussian noise. The envelope is
Rayleigh and the phase is uniform.
EE603 Class Notes 02/11/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9-17
where
2 2
2 0 0 c 0 0 s
1 0 0 s
2 2
0 0 c
(t) [A cos (t)] [A sin (t)]
A sin (t)
(t) tan , ,
A cos (t)

= + + +
+
= <

+


(9-43)
are the envelope and phase, respectively, of the signal+noise process. Note that the quantity
2 2
0
(A / 2 ) / is the signal-to-noise ratio, a ratio of powers.
Equation (9-43) represents a transformation from the components
c
and
s
into the
envelope
2
and phase
2
. The inverse of this transformation is given by
c 2 2 0 0
s 2 2 0 0
(t) (t) cos (t) A cos
(t) (t)sin (t) A sin .
=
=

(9-44)
Note that constants A
0
cos
0
and A
0
sin
0
only influence the mean of
c
and
s
. In the remainder
of this section, we describe the statistical properties of envelope
2
and phase
2
.
At the same time t, processes
c
(t) and
s
(t) are statistically independent (however, for
0,
c
(t) and
s
(t+) may be dependent). Hence, for
c
(t) and
s
(t) we can write the joint
density
f
c s
c s
( , )
exp[ ( ) / ]

=
+
2 2 2
2
2
2
(9-45)
(we choose to abuse notation for our convenience:
c
and
s
are used to denote both random
processes and, as in (9-45), algebraic variables).
The joint density f(
2
,
2
) can be found by transforming (9-45). To accomplish this, the
Jacobian
EE603 Class Notes 02/11/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9-18
c s
2 2 2
2 2 2
2 2
( , )
cos sin
sin cos
( , )



=



(9-46)
can be used to write the joint density
c 2 2 0 0
s 2 2 0 0
c s
2 2 c s
cos A cos
2 2
sin A sin
( , )
f ( , ) f ( , ) det
( , ) =
=

=

(9-47)
{ } 2
2 2 2 1
2 2 2 0 2 0 2
2
2
f ( , ) exp [ 2A cos( ) A ] U( )
2

= +

2 0
.
Now, the marginal density f(
2
) can be found by integrating out the
2
variable to obtain
{ } 2
2
2 2 2
0
2
2 2
2 0 2 1 1
2 0 2 2 0 2
2 2 2
0
2
f ( ) f ( , ) d
A
exp [ A ] U( ) exp{ cos( )}d .

=

= +



2
(9-48)
This result can be written by using the tabulated function
I d
0
1
2
0
2
( ) exp{ cos( )}

z
, (9-49)
the modified Bessel function of order zero. Now, use definition (9-49) in (9-48) to write
f I A U
A
( ) exp [ ] ( )

2
2
2
0
1
2
2
2
0
2
2
2 0
2 2
=
F
H
I
K
+
R
S
T
U
V
W


, (9-50)
a result known as the Rice probability density. As expected,
0
does not enter into f(
2
).
Equation (9-50) is an important result. It is the density function that statistically
EE603 Class Notes 02/11/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9-19
describes the envelope
2
at time t; for various values of A
0
/, the function f(
2
) is plotted on
Figure 9-8 (the quantity
2 2
0
(A / 2) / is the signal-to-noise ratio). For A
0
/ = 0, the case of no
sinusoid, only noise, the density is Rayleigh. For large A
0
/ the density becomes Gaussian. To
observe this asymptotic behavior, note that for large the approximation
I
e
0
2
( ) ,

>>1, (9-51)
becomes valid. Hence, for large
2
A
0
/
2
Equation (9-50) can be approximated by
f
A
A U ( ) exp [ ] ( )


2
2
0
2
1
2
2 0
2
2
2
2

R
S
T
U
V
W


. (9-52)
0 1 2 3 4 5 6

2
/
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8


f

(

2
)
A
0
/ = 0
A
0
/ = 1
A
0
/ = 2
A
0
/ = 3
A
0
/ = 4
Figure 9-8: Rice density function for sinusoid plus noise. Plots
are given for several values of A
0
/. Note that f is approximately
Rayleigh for small, positive A
0
/; density f is approximately
Gaussian for large A
0
/.
EE603 Class Notes 02/11/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9-20
For A
0
>> , this function has a very sharp peak at
2
= A
0
, and it falls off rapidly from its peak
value. Under these conditions, the approximation
f A ( ) exp [ ]
2
2
1
2
2 0
2
1
2
2

R
S
T
U
V
W


(9-53)
holds for values of
2
near A
0
(i.e.,
2
A
0
) where f(
2
) is significant. Hence, for large A
0
/,
envelope
2
is approximately Gaussian distributed.
The marginal density f(
2
) can be found by integrating
2
out of (9-47). Before
integrating, complete the square in
2
, and express (9-47) as
f A U
A
( , ) exp [ cos( )] exp
sin ( )
( )


2
2
2
1
2
2 0
2
2
2
2
2
2
0
2
2

2 2 0
2 0
=

R
S
T
U
V
W


{ }
. (9-54)
Now, integrate
2
out of (9-54) to obtain
{ }
2 2 2 2
0
2
A 2 2 1 2
0
2 0 2 0 2
2 0 2 2 2 0
2
2
f ( ) f ( , )d
exp exp [ A cos( )] d .
sin ( )
2

=

=




(9-55)
On the right-hand-side of (9-55), the integral can be expressed as the two integrals



2
2
0
1
2
2 0
2
2
2 0
2
0
1
2
2 0
2
2
0
2
1
2
2 0
2
2
0
2
2
4
2
2
2
2

z
z
z

=


+
exp [ cos( )]
{ cos( )}
exp [ cos( )]
cos( )
exp [ cos( )]
A d
A
A d
A
A d





2 0
2 0
2 0
2 0
2 0
{ }
{ }
{ }
(9-56)
EE603 Class Notes 02/11/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9-21
After a change of variable = [
2
- A
0
cos(
2
-
0
)]
2
, the first integral on the right-hand-side of
(9-56) can be expressed as
{ }
2 2
A cos ( )
0
0
2 2 0 2 0 1
2 0 2 0 2
2 2
0
2
2 2
2
2
2 2
A cos ( )
2 0
0
2
2
2{ A cos( )}
exp [ A cos( )] d
4
1
exp[ ]d
4
1
exp .
2





(9-57)
After a change of variable = [
2
- A
0
cos(
2
-
0
)]/, the second integral on the right-hand-side
of (9-56) can be expressed as
{ }
{ }
{ }
0
0
2
1
2 0 2 0 2
2
0
2
2
1
2
(A / ) cos[ ]
2 0
(A / ) cos[ ]
2 2 0
1
2
A
0
2 0
1
exp [ A cos( )] d
2
1
exp d
2
1
1 exp d
2
F( cos[ ]),





(9-58)
where
F x d
x
( ) exp

z
1
2
2
2



is the distribution function for a zero-mean, unit variance Gaussian random variable (the identity
F(-x) = 1 - F(x) was used to obtain (9-58)).
EE603 Class Notes 02/11/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9-22
Finally, we are in a position to write f(
2
), the density function for the instantaneous
phase. This density can be written by using (9-57) and (9-58) in (9-55) to write
2
A
0
2
2
2
2
A
A
0 2 0 0 2 0
2 0
2 0
2
2
1
f ( ) exp
2
A cos( )
exp F( cos[ ])
sin ( )
2




, (9-59)
the density function for the phase of a sinusoid embedded in narrow-band noise. For various
values of SNR and for
0
= 0, density f(
2
) is plotted on Fig. 9-9. For a SNR of zero (i.e., A
0
=
0), the phase is uniform. As SNR A
0
2
/
2
increases, the density becomes more sharply peaked (in
general, the density will peak at
0
, the phase of the sinusoid). As SNR A
0
2
/
2
approaches
infinity, the density of the phase approaches a delta function at
0
.
Phase Angle
2
0.0
0.2
0.4
0.6
0.8
1.0
1.2
1.4
1.6
1.8
2.0
f

(

2
)
A
0
/ = 0
A
0
/ = 1
A
0
/ = 2
A
0
/ = 4
0 /2 -/2 -
Figure 9-9: Density function for phase of signal plus noise A
0
cos(
0
t+
0
) +
{
c
(t)cos(
0
t) -
s
(t)sin(
0
t)} for the case
0
= 0.
EE603 Class Notes 02/11/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9-23
Shot Noise
Shot noise results from filtering a large number of independent and randomly-occuring-
in-time impulses. For example, in a temperature-limited vacuum diode, independent electrons
reach the anode at independent times to produce a shot noise process in the diode output circuit.
A similar phenomenon occurs in diffusion-limited pn junctions. To understand shot noise, you
must first understand Poisson point processes and Poisson impulses.
Recall the definition and properties of the Poissson point process that was discussed in
Chapters 2 and 7 (also, see Appendix 9-B). The Poisson points occur at times t
i
with an average
density of
d
points per unit length. In an interval of length , the number of points is distributed
with a Poisson density with parameter
d
.
Use this Poisson process to form a sequence of Poisson Impulses, a sequence of impulses
located at the Poisson points and expressed as
i
i
z(t) (t t ) =

, (9-60)
where the t
i
are the Poisson points. Note that z(t) is a generalized random process; like the delta
function, it can only be characterized by its behavior under an integral sign. When z(t) is
integrated, the result is the Poisson random process
t
0
(0, t), t 0
x(t) z( )d 0, t 0
- (0, t) t 0,
>

= = =

<

n
n
(9-61)
where n(t
1
,t
2
) is the number of Poisson points in the interval (t
1
,t
2
]. Likewise, by passing the
Poisson process x(t) through a generalized differentiator (as illustrated by Fig. 9-10), it is
possible to obtain z(t).
EE603 Class Notes 02/11/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9-24
The mean of z(t) is simply the derivative of the mean value of x(t). Since E[x(t)]=
d
t, we
can write
z d
d
E[z(t)] E[x(t)]
dt
= = = . (9-62)
This formal result needs a physical interpretation. One possible interpretation is to view
z
as
( )
t / 2
1 1
z d d
t t
t / 2
t t
limit z( )d limit t random fluctuation with increasing t


= = + =

. (9-63)
For large t, the integral in (9-63) fluctuates around mean
d
t with a variance of
d
t (both the
mean and variance of the number of Poisson points in (-t/2, t/2] is
d
t). But, the integral is
multiplied by 1/t; the product has a mean of
d
and a variance like
d
/t. Hence, as t becomes
large, the random temporal fluctuations become insignificant compared to
d
, the infinite-time-
interval average
z
.
Important correlations involving z(t) can be calculated easily. Because R
x
(t
1
,t
2
) =
2
d
t
1
t
2
+
d
min(t
1
,t
2
) (see Chapter 7), we obtain
2
xz 1 2 x 1 2 d 1 d 1 2
2
2
z 1 2 xz 1 2 d d 1 2
1
R (t , t ) R (t , t ) t U(t t )
t
R (t , t ) R (t , t ) (t t ) .
t

= = +

= = +

(9-64)
... ... ... ...
d/dt
x(t) z(t)

Poisson Impulses Poisson Process
Figure 9-10: Differentiate the Poisson Process to get Poisson impulses.
EE603 Class Notes 02/11/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9-25
The Fourier transform of R
z
() yields
2
z d d
( ) 2 ( ) = + S , (9-65)
the power spectrum of the Poisson impulse process.
Let h(t) be a real-valued function of time and define
i
i
s(t) h(t t ) =

, (9-66)
a sum known as shot noise. The basic idea here is illustrated by Fig. 9-11. A sequence of
functions described by (9-60) (i.e., process z(t)) is input to system h(t) to form output shot noise
process s(t). The idea is simple: process s(t) is the output of a system activated by a sequence of
impulses (that model electrons arriving at an anode, for example) that occur at the random
Poisson points t
i
.
Determined easily are the elementary properties of shot noise s(t). Using the method
discussed in Chapter 7, we obtain the mean
[ ] [ ] [ ]
s d d
0
E s(t) E z(t) h(t) h(t) E z(t) h(t)dt H(0)

= = = = =

. (9-67)
Shot noise s(t) has the power spectrum
2 2 2
2 2 2
s z d d s d
( ) H( ) ( ) 2 H (0) ( ) H( ) 2 ( ) H( ) = = + = + S S . (9-68)
Finally, the autocorrelation is
[ ]
2
-1 2 2 j 2 2 d
s s d d d
R ( ) = ( ) H (0) H( ) e d H (0) ( )
2

= + = +


S F , (9-69)
EE603 Class Notes 02/11/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9-26
where
2
j
1
( ) H( ) e d h(t)h(t )dt
2


= = +


. (9-70)
From (9-67) and (9-69), shot noise has a mean and variance of
s d
2
2 2 2 2 d
s d d d d
= H(0)
= [ H (0) + (0)] - [ H(0)] (0) H( ) d ,
2

= =


(9-71)
respectively (Equation (9-71) is known as Campbells Theorem).
Example: Let h(t) = e
-t
U(t) so that H() = 1/( + j),
t
(t) e / 2

= and
2
d d d
s s
2
2 d d d
s s
2 2
E[s(t)] R ( ) e
2
( ) 2 ( )
2


= = = +




= = +


+
S
(9-72)
First-Order Density Function for Shot Noise
In general, the first-order density function f
s
(x;t) that describes shot noise s(t) cannot be
calculated easily. Before tackling the difficult general case, we first consider a simpler special
t
i-1
t
i+1
t
i+2
t
i
t
i-1
t
i+1
t
i+2
t
i
... ... ... ...
h(t)
z(t) s(t) = h(t)*z(t)
h(t)
Poisson Impulses Shot Noise
Figure 9-11: Converting Poisson impulses z(t) into shot noise s(t)
EE603 Class Notes 02/11/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9-27
case where it is assumed that h(t) is of finite duration T. That is, we assume initially that
h(t) 0, t 0 and t T = < . (9-73)
Because of (9-73), shot noise s at time t depends only on the Poisson impulses in the
interval (t - T, t]. To see this, note that
i
i i
i t T t t
s(t) h(t ) ( t )d h(t t )

<
= =

, (9-74)
so that only the impulses in (t - T, t] influence the output at time t. Let random variable n
T
denote the number of Poisson impulses during (t - T, t]. From Chapter 1, we know that
d
k
T
d
( T)
[ k] e
k!


= =
T
P n . (9-75)
Now, the Law of Total Probability (see Ch. 1 and Ch. 2 of these notes) can be applied to write
the first-order density function of the shot noise process s(t) as
d
k
T
d
s s s
k 0 k 0
( T)
f (x) f (x k) [ k] f (x k)e
k!

= =

= = = = =
T T T
n P n n (9-76)
(note that f
s
(x) is independent of absolute time t). We must find f
s
(xn
T
= k), the density of shot
noise s(t) conditioned on there being exactly k Poisson impulses in the interval (t - T, t].
For each fixed value of k that is used on the right-hand-side of (9-76), conditional density
f
s
(xn
T
= k) describes the filter output due to an input of exactly k impulses on (t - T, t]. That is,
we have conditioned on there being exactly k impulses in (t - T, t]. As a result of the
conditioning, the k impulse locations can be modeled as k independent, identically distributed
EE603 Class Notes 02/11/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9-28
(iid) random variables (all locations t
i
, 1 i k, are uniform on the interval).
For the case k = 1, at any fixed time t, f
s
(xn
T
= 1) is actually equal to the density g
1
(x)
of the random variable
1 1
x (t) h(t t ) , (9-77)
where random variable t
1
is uniformly distributed on (t - T, t ]. That is, g
1
(x) f
s
(xn
T
= 1)
describes the result that is obtained by transforming a uniform density (used to describe t
1
) by the
transformation h(t - t
1
).
Convince yourself that density g
1
(x) = f
s
(xn
T
= 1) does not depend on time. Note that
for any given time t, random variable t
1
is uniform on (t-T, t ], and x
1
(t) h(t-t
1
) is assigned
values in the set {h() : 0 < T}, the assignment not depending on t. Hence, density g
1
(x)
f
s
(xn
T
= 1) does not depend on t.
The density f
s
(xn
T
= 2) can be found in a similar manner. Let t
1
and t
2
denote
independent random variables, each of which is uniformly distributed on (t - T, t ], and define
2 1 2
x (t) h(t t ) h(t t ) + . (9-78)
At fixed time t, the random variable x
2
(t) is described by the density f
s
(xn
T
= 2) = g
1
g
1
(i.e.,
the convolution of g
1
with itself) since h(t - t
1
) and h(t - t
2
) are independent and identically
distributed with density g
1
.
The general case f
s
(xn
T
= k) is similar. At fixed time t, the density that describes
k 1 2 k
x (t) h(t t ) h(t t ) h(t t ) + + + (9-79)
is
EE603 Class Notes 02/11/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9-29
k s 1 1 1
k convolutions
g (x) f (x = k) = g (x) g (x) g (x)

T
n , (9-80)
the density g
1
convolved with itself k times.
The desired density can be expressed in terms of results given above. Simply substitute
(9-80) into (9-76) and obtain
d
k
T
d
s k
k 0
( T)
f (x) e g (x)
k!

=

. (9-81)
When n
T
= 0, there are no Poisson points in (t - T, t], and we have
0 s
g (x) f (x = 0) = (x)
T
n (9-82)
since the output is zero. Convergence is fast, and (9-81) is useful for computing the density f
s
when
d
T is small (the case for low density shot noise), say on the order of 1, so that, on the
average, there are only a few Poisson impulses in the interval (t - T, t]. For the case of low
density shot noise, (9-81) cannot be approximated by a Gaussian density.
f
s
(x) For An Infinite Duration h(t)
The first-order density function f
s
(x) is much more difficult to calculate for the general
case where h(t) is of infinite duration (not subject to the restriction (9-73)). We show that shot
noise is approximately Gaussian distributed when
d
is large compared to the time interval over
which h(t) is significant (so that, on the average, many Poisson impulses are filtered to form s(t)).
To establish this fact, consider first a finite duration interval (-T/2, T/2), and let random
variable n
T
, described by (9-75), denote the number of Poisson impulses that are contained in the
interval. Also, define the time-limited shot noise
EE603 Class Notes 02/11/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9-30
T k
k 1
s (t) h(t t ), T/ 2 t T/ 2
=
< <

T
n
, (9-83)
where the random variables t
i
denote the times at which the Poisson impulses occur in the
interval. Shot noise s(t) is the limit of s
T
(t) as T approaches infinity.
In our analysis of s(t), we first consider the characteristic function
T
j s j s
s
T
( ) E e limit E e


= =

. (9-84)
Now, write the characteristic function of s
T
as
[ ]
T T
j s j s
k 0
E e E e k k


=

= = =

T T
n P n , (9-85)
where P[n
T
= k] is given by (9-75). In the conditional expectation used in (9-85), output s
T
results from filtering exactly k impulses (this is different from the s
T
that appears on the left-
hand-side of the equation). Due to the conditioning, we can model the impulse locations as k
independent, identically distributed (iid they are uniform on (-T/2, T/2)) random variables. As
a result, the terms h(t - t
i
) in s
T
(t) are independent so that
( )
T T
k
j s j s
E e k E e 1


= = =

T T
n n , (9-86)
where
T
T/ 2
j s j h(t x)
T/ 2
1
E e 1 e dx, T/ 2 t T/ 2
T


= = < <


T
n , (9-87)
EE603 Class Notes 02/11/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9-31
since each t
i
is uniformly distributed on (-T/2, T/2). Finally, by using (9-84) through (9-87), we
can write
[ ]
T T
d
d
j s j s
s
T T
k 0
k k
T/ 2
T j h(t x) d
T/ 2
T
k 0
k
T/ 2
j h(t x)
d
T
T/ 2
T
k 0
( ) limit E e limit E e k k
1 ( T)
limit e dx e
T k!
e dx
limit e .
k!

=

= = = =

T T
n P n
(9-88)
Recalling the Taylor series of the exponential function, we can write (9-88) as
( )
T/ 2
j h(t x) j h(t x)
s d d d
T/ 2
T
( ) limit exp{ T}exp e dx exp e 1 dx


= =




, (9-89)
a general formula for the characteristic function of the shot noise process.
In general, Equation (9-89) is impossible to evaluate in closed form. However, this
formula can be used to show that shot noise is approximately Gaussian distributed when
d
is
large compared to the time constants in h(t) (i.e., compared to the time duration where h(t) is
significant).
This task will be made simpler if we center and normalize s(t). Define
d
d
s(t)- H(0)
(t)

s , (9-90)
so that
EE603 Class Notes 02/11/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9-32
[ ]
E 0
R ( ) ( ) h(t)h(t )dt

=
= = +

s
s
. (9-91)
(see (9-67) and (9-69)). The characteristic functions of s and s are related by
j d
d s d
d
s - H(0)
( ) E e E exp j exp j H(0) ( )



= = =







s
s
. (9-92)
Use (9-89) in (9-92) to write
d d
j j
d
( ) exp exp h(t x) 1 h(t x) dx

s
. (9-93)
Now, in the integrand of (9-93), expand the exponential in a power series, and cancel out the
zero and first-order terms to obtain
( )
k k
k
k k
( j ) ( j )
d d
k! k!
d d
k 2 k 2
k
( j )
2 2
d
k! k
k 3
d
h(t x) h(x)
( ) exp dx exp dx
h (x)dx
1
exp h (x)dx exp .
2




= =

=




= =








=








s
(9-94)
As
d
approaches infinity, (i.e., becomes large compared to the time constants in low-pass filter
h(t)), Equation (9-94) approaches the characteristic function of a Gaussian process.
To see this, let
max
be the dominant (i.e., largest) time constant in low-pass filter h(t).
There exists constant M such that (note that h is causal)
EE603 Class Notes 02/11/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9-33
x /
h(x) Me U(x)


max
(9-95)
(
max
is the filter effective memory duration). Consequently, we can bound
k kx / k k k
0
h (x)dx h(x) dx M e dx M
k



=

max
max


. (9-96)
Consider the second exponent on the right-hand-side of (9-94). Using (9-96), it is easily
seen that
( )
( )
k k
k
k
( j ) ( j )
d
k! k k!
k 2
k 3 k 3
d
d
h (x)dx
M
k

= =






max
. (9-97)
Now, as d

(so that
max
/ d

0) in (9-97), each right-hand-side term approaches zero


so that
( )
k
d
0
k
( j )
d
k! k
k 3
d
h (x)dx
0


max
. (9-98)
Use this last result with Equation (9-94). As
max
/ d

0 with increasing
d
, we have
2
( j )
2 2 2 1
2 2
( ) exp h (x)dx exp

s s
, (9-99)
where
2
R (0) =
s s
(9-100)
EE603 Class Notes 02/11/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9-34
is the variance of standardized shot noise s(t) (see (9-91)). Note that Equation (9-99) is the
characteristic function of a zero-mean, Gaussian random variable with variance (9-100). Hence,
shot noise is approximately Gaussian distributed when
d
is large compared to the dominant
time constant in low-pass filter h(t) (so that, on the average, a large number of Poisson impulses
are filtered to form s(t)).
Example: Temperature-Limited Vacuum Diode
In classical communications system theory, a temperature-limited vacuum diode is the
quintessential example of a shot noise generator. The phenomenon was first predicted and
analyzed theoretically by Schottky in his 1918 paper: Theory of Shot Effect, Ann. Phys., Vol 57,
Dec. 1918, pp. 541-568. In fact, over the years, noise generators (used for testing/aligning
communication receivers, low noise preamplifiers, etc.) based on vacuum diodes (i.e., Sylvania
5722 special purpose noise generator diode) have been offered on a commercial basis.
Vacuum tube noise generating diodes are operated in a temperature-limited, or saturated,
mode. Essentially, all of the available electrons are collected by the plate (few return to the
cathode) so that increasing plate voltage does not increase plate current (i.e., the tube is
saturated). The only way to increase (significantly) plate current is to increase filament/cathode
temperature. Under this condition, between electrons, space charge effects are minimal, and
individual electrons are, more or less, independent of each other.
The basic circuit is illustrated by Figure 9-12. In a random manner, electrons are emitted
by the cathode, and they flow a distance d to the plate to form current i(t). If emitted at t = 0, an
independent electron contributes a current h(t), and the aggregate plate current is given by
k
k
i(t) h(t t ) =

, (9-101)
where t
k
are the Poisson-distributed independent times at which electrons are emitted by the
cathode (see Equation (9-66)). In what follows, we approximate h(t).
EE603 Class Notes 02/11/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9-35
As discussed above, space charge effects are negligible and the electrons are
independent. Since there is no space charge between the cathode and plate, the potential
distribution V in this region satisfies Laplaces equation
2
2
0
x

V
. (9-102)
The potential must satisfy the boundary conditions V(0) = 0 and V(d) = V
p
. Hence, simple
integration yields
p
= x , 0 x
V
V d
d
. (9-103)
As an electron flows from the cathode to the plate, its velocity and energy increase. At
point x between the cathode and plate, the energy increase is given by
p
n
E (x) (x) = x =
V
eV e
d
, (9-104)
where e is the basic electronic charge.
Power is the rate at which energy changes. Hence, the instantaneous power flowing from
the battery into the tube is
R
L
+ - +
-
i(t)
V
p
Plate
V
f
Filament
d
Figure 9-12: Temperature-limited vacuum
diode used as a shot noise generator.
EE603 Class Notes 02/11/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9-36
p
n n
p
dE dE dx dx
h
dt dx dt dt
= = =
V
e V
d
, (9-105)
where h(t) is current due to the flow of a single electron (note that d
-1
dx/dt has units of sec
-1
so
that (e/d ) dx/dt has units of charge/sec, or current). Equation (9-105) can be solved for current to
obtain
x
d
h v
dt
= =
e x e
d d
, (9-106)
where v
x
is the instantaneous velocity of the electron.
Electron velocity can be found by applying Newtons laws. The force on an electron is
just e(V
p
/d), the product of electronic charge and electric field strength. Since force is equal to
the product of electron mass m and acceleration a
x
, we have
x
a =
p
V
e
m d
. (9-107)
As it is emitted by the cathode, an electron has an initial velocity that is Maxwellian distributed.
However, to simplify this example we will assume that the initial velocity is zero. With this
assumption, electron velocity can be obtained by integrating (9-107) to obtain
x
v = t
p
V
e
m d
. (9-108)
Over transition time t
T
the average velocity is
x x
0
1
v v dt =
2
= =

T
t
p
T
T T
V
e d
t
t m d t
. (9-109)
EE603 Class Notes 02/11/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9-37
Finally, combine these last two equations to obtain
x
2
2
v = t, 0 t



T
T
d
t
t
. (9-110)
With the aid of this last relationship, we can determine current as a function of time.
Simply combine (9-106) and (9-110) to obtain
2
2
h(t) t, 0 t

=


T
T
e
t
t
, (9-111)
the current pulse generated by a single electron as it travels from the cathode to the plate. This
current pulse is depicted by Figure 9-13.
The bandwidth of shot noise s(t) is of interest. For example, we may use the noise
generator to make relative measurements on a communication receiver, and we may require the
noise spectrum to be flat (or white) over the receiver bandwidth (the noise spectrum
amplitude is not important since we are making relative measurements). To a certain flatness,
we can compute and examine the power spectrum of standardized s(t) described by (9-90). As
given by (9-91), the autocorrelation of s(t) is
t
h(t)
t
T
2e/t
T
Figure 9-13: Current due to a single electron emitted by
the cathode at t = 0.
EE603 Class Notes 02/11/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9-38
( ) ( ) ( )
2 2 2
0
4
R ( ) 2 t(t + )dt = 1 1 , 0
3 2
R ( ), 0
0, otherwise


= +
=
=

T
t
s T
T T T T
s T
e e
t
t t t t
t . (9-112)
The power spectrum of s(t) is the Fourier transform of (9-112), a result given by
( )
2
s
4
0
4
( ) 2 R ( ) cos( )d ( ) 2(1 cos sin )
( )

= = +

T T T T
T
S t t t t
t
. (9-113)
Plots of the autocorrelation and relative power spectrum (plotted in dB relative to peak power at
= 0) are given by Figures 9-14 and 9-15, respectively.
To within 3dB, the power spectrum is flat from DC to a little over = /t
T
. For the
Sylvania 5722 noise generator diode, the cathode-to-plate spacing is .0375 inches and the transit
time is about 310
-10
seconds. For this diode, the 3dB cutoff would be about 1/2t
T
= 1600Mhz.
In practical application, where electrode/circuit stray capacitance/inductance limits frequency
range, the Sylvania 5722 has been used in commercial noise generators operating at over
400Mhz.
R
s
()

2
4
3
T
e
t
t
T
-t
T
Figure 9-14: Autocorrelation function of
normalized shot noise process.
-10
-8
-6
-4
-2
2
R
e
l
a
t
i
v
e

P
o
w
e
r

(
d
B
)
(Rad/Sec)
/t
T
10Log{S()/S(0)}
0
Figure 9-15: Relative power spectrum of nor-
malized shot noise process.
EE420/500 02/18/09 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9A-1
Appendix 9A: Hilbert Transforms
Consider the filter H(), described by Figure 9A-1, that has a unity magnitude response
for all frequencies. Also, the phase response is -/2 for positive frequencies and /2 for negative
frequencies. The transfer function of this filter is
H j j ( ) sgn( ) = . (9A1)
An engineer might think of filter H() as a wide-band phase shift network.
The impulse response of the filter is
h t H j j
j
t t
( ) [ ] [sgn( )] = = =
F
H
I
K
= F F
-1 -1


1
. (9A2)
When driven by an arbitrary signal x(t), the filter produces the output
1
()

()

/2
/2
Figure 9A-1: Magnitude and phase of Hilbert transform operator.
EE420/500 02/18/09 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9A-2
( )
( )
( )
x t x h
x u
t u
du = =

z

. (9A3)
The function ( ) x t is the Hilbert Transform of x(t). Note that
F ( ) ( ) sgn( ) ( ) x = = H X j X , (9A4)
so that
( ) sgn( ) ( ) x
-1
t j X = F . (9A5)
In some cases, this formula allows use of a Fourier transform table to compute the Hilbert
transform.
EXAMPLES
1. Consider x(t) = cos(
0
t) with transform X() = [( -
0
) + ( +
0
)]. We have
h(t)
t
Figure 9A-2: Impulse response h(t) of
Hilbert transform operator
EE420/500 02/18/09 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9A-3
= + >
= + <
j X j
j
sgn( ) ( ) ( ) ( ) ,
( ) ( ) ,


0 0
0 0
0
0


0
0
so that -jsgn()X() = j[( +
0
) - ( -
0
)]sgn(
0
). Hence, we can write
( ) cos( ) sgn( ) ( ) sgn( )sin( ) x t t j X t = = =
0 0 0
F
-1
. (9A6)
2. In a similar manner, we can write
sin( ) sgn( )cos( )
0 0 0
t t =
(9A7)
3. Combine (9A6) and (9A7) to obtain
exp{ } cos sin sgn( )[sin cos ] sgn( ) exp{ } j t t j t t j t j j t
0 0 0 0 0 0 0 0
= + = =
Properties of Hilbert Transforms
1. The energy (or power) in x(t) and ( ) x t are equal. This claim follows from
F F [ ] sgn( ) ( ) ( ) [ ] x j X X x
2 2 2 2
= = = . (9A9)
Since the energy (or power) density spectrum at the input and output of the filter are the same,
the two energies (or powers) are equal.
2.

( ) ( ) x t x t = . This claim follows from


[ ] [ ]
[ ]
-1 -1
-1

x(t) jsgn( ) x(t)] jsgn( ) jsgn( )X( j )


X( j ) x(t).
= =

= =
[ F F F
F
(9A10)
(9A8)
EE420/500 02/18/09 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9A-4
3. x(t) and ( ) x t are orthogonal. For energy signals, we have
limit
T

z
= x t x t
T
T
( )( )dt 0 . (9A11)
For power signals, we have
x t x t
T
x t x t
T
T
( )( ) ( )( )dt =


z
limit
T
1
2
0 . (9A12)
This claim follows from (proof given for energy signals; case of power signals is similar)
j t
j t
2
1
x(t)x(t)dt x(t) jsgn( )X( j )e d dt
2
1
jsgn( )X( j ) x(t)e dt d
2
j
sgn( ) X( j ) d
2
0 ,






=



=

(9A13)
since integrand sgn()X(j)
2
is an odd function which is integrated over symmetric limits.
4. If c(t) and m(t) are signals with non-overlapping spectra, where m(t) is low pass and c(t) is
high pass, then
m t t t c t ( )c( ) ( )( ) m =
To develop this important result, denote M() = F [m(t)] and C() = F [c(t)] as the Fourier
transform of m and c, respectively. The fact that the signals have no over-lapping spectrum
(9A14)
EE420/500 02/18/09 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9A-5
implies that there exists a W for which
M
C
( ) ( ) ,
( ) ( ) ,


= = >
= = <
F
F
m t W
c t W
0
0


(9A15)
since m(t) is low pass and c(t) is high pass. Use (9A8) and note that
1 2 1 2 1 2
2
1 2 1 2 1 2 1 2
2
1
m(t)c(t) ( ) ( ) exp[ j( )t] d d
(2 )
1
( ) ( )[ jsgn( )]exp[ j( )t] d d
(2 )




= +

= + +



M C
M C
.
Once the quantity sgn(
1
+
2
) in the integrand of (9A16) is simplified, we will obtain the
desired result. To simplify (9A16), note that non-overlapping spectra and (9A15) imply
M
C
( ) ,
( ) ,


1 1
2 2
0
0
= >
= <


W
W
. (9A17)
Hence, the integrand of (9A16) is zero for all (
1
,
2
) in the cross-hatched region on Fig 9A-3.
More importantly, on the shaded region of the (
1
,
2
) plane, the integrand is non-zero, and we
can write
sgn( ) sgn( )
1 2 2
+ = (9A18)
for all (
1
,
2
) in the shaded (but not the cross-hatched!) region illustrated on Figure 9A-3.
Finally, use of simplification (9A18) in Equation (9A16) yields the desired result
(9A16)
EE420/500 02/18/09 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9A-6
1 1 1 2 2 2 2
1 1
m(t)c(t) ( ) exp[ j t]d ( )[ jsgn( )]exp[ j t] d
2 2
m(t)c(t)



=


=

M C
which is (9A14).
5) Since the impulse response h(t) does not vanish for t < 0, the Hilbert transform is a non-
causal linear operator.
6. If x(t) is an even (alternatively, odd) function then ( ) x t is an odd (alternatively, even)
function. This claim follows easily from Fourier transform theory. If x(t) is even, then X(j) =
F [x(t)] is real-valued. As a result, -jsgn()X(j) is purely imaginary. But this means that
( ) ( ) ( ) x t = F jsgn X j will be odd.
-W
W
-W
W

1
-axis

2
-axis

1
+
2
> 0
1
+
2
> 0

1
+
2
< 0
1
+
2
< 0
Figure 9A-3: Integrand of (9A16) is zero in the cross-hatched region. In the upper-half plane shaded
region, we have U(
1
+
2
) = 1. In the lower-half plane shaded region, we have U(
1
+
2
) = -1.
EE603 Class Notes 2/1/2010 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9B-1
Appendix 9-B: Random Poisson Points
As discussed in Chapter 1, let n(t
1
,t
2
) denote the number of Poisson random points in the
interval (t
1
, t
2
]. The quantity n(t
1
, t
2
) is a non-negative-integer-valued random variable with
d
d
k
1 2 k
( )
[ (t , t ) k] p ( ) e
k!


= = P n , k 0, (9B1)
where t
1
- t
2
is the interval length, and constant
d
> 0 is the average point density
(average number of points per unit length). Note that Equation (9B1) does not depend on the
absolute value of t
1
or t
2
. We say that n(t
1
, t
2
) is Poisson distributed with parameter
d
.
Random Poisson points have independent increments; the number of points in (t
1
, t
2
] is
independent of the number of points in (t
3
, t
4
] if these two intervals do not overlap (i.e., the
intersection of these intervals is the null set). This completes our first characterization of random
Poisson points. In what follows, two other (equivalent) characterizations are discussed.
From the results of Chapter 2 of the class notes, in an interval of length , the expected
number of points is
d
. Somewhat surprisingly, this is also equal to the variance of n(t
1
, t
2
); we
write
1 2 1 2 d
E[ (t , t )] VAR[ (t , t )] = = n n . (9B2)
A second characterization of random Poisson points has to do with the duration between
points. As shown below, this duration is a random variable that is exponentially distributed.
Random Poisson points can be characterized by their independent-increment nature as well as a
requirement that the duration between points is described by an exponential random variable
(this is our second characterization).
Finally, a third characterization can be given for random Poisson points. Random
Poisson points can be characterized as an independent-increment point process that obeys the
EE603 Class Notes 2/1/2010 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9B-2
Markov property (or memory-less property) that was discussed towards the end of Chapter 2 of
these class notes (see Examples (2-14) and (2-15)). These alternative characterizations are
discussed in this appendix.
Given the three characterizations mentioned above, it should be apparent that random
Poisson points can be used to model many physical phenomena. They serve as a good model in
many situations where something occurs repeatedly, the repetitions are independent of each
other, and the average number of repetitions per unit time (or unit length) is constant over a large
number of successive unit-time (or unit-length) intervals. Applications that involve N points
placed randomly in an interval of length T can, when N and T are large, be modeled using
Poisson points. As discussed in Chapter 1 of the class notes, a Poisson point model becomes
exact as N and T approach infinity in such a manner that the ratio N/T approaches a constant
d
(the average number of points per unit length). Poisson point models have been applied to
problems dealing with electron emission in semiconductors and vacuum tubes (i.e., shot noise in
electronic devices), telephone calls arriving at a switchboard and the arrival of cars at an
intersection, to name a few applications.
Exponentially Distributed Duration
The duration between adjacent Poisson points is independent and exponentially
distributed. First, we show a simpler result: we analyze the statistics of the duration from an
arbitrary, but fixed, value of time to the nearest Poisson point. Let t
0
be any fixed
marker/reference point in time; we show that the duration from t
0
to the nearest Poisson point (on
either side of t
0
) is exponentially distributed.
Define random variable
1
to be the duration from a fixed, but arbitrary, marker/reference
point t
0
to the first Poisson point to the right of t
0
(see Fig. 9B-1). For algebraic variable , the
event [
1
] is equivalent to the event there are one or more Poisson points in the interval (t
0
,
t
0
+]. That is, [
1
] = [n(t
0
, t
0
+) 1] so that
EE603 Class Notes 2/1/2010 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9B-3
d d
0
d
1
( )
[ ] [ ( , ) 1] 1 [ ( , ) 0] 1 e 1 e .
0!


= + = + = = = P P P n n
0 0 0 0
t t t t (9B3)
We have
d
1
-
F ( ) 1- e , 0,


= (9B4)
and,
d
1
-
d
f ( ) e , 0,


= (9B5)
for the distribution and density functions, respectively, that describe random variable
1
.
Equation (9B5) establishes the desired result that random variable
1
is exponentially distributed.
Now, let random variable
-1
denote the duration from t
0
back to the first Poisson point before t
0
(see Fig 9B-1). In a manner similar to that used above, one can show that
1
is exponentially
distributed.
The statistical properties of the duration from an arbitrary fixed point t
0
to any Poisson
point (to the right or left of t
0
) can be analyzed. Define random variable
n
as the duration from
t
0
to the n
th
Poisson point to the right of t
0
. The event [
n
] is equivalent to the event there
are n or more Poisson points in the interval (t
0
, t
0
+ ] so that

-1

1
T
n

n
t
0
Denotes Location of a Poisson Point
t
0
is an arbitrary fixed point

n-1

Fig. 9B-1: Poisson points (a Poisson point process).


EE603 Class Notes 2/1/2010 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9B-4
d
d
n-1 k
-
n 0 0 0 0
k=0
( )
[ ]= [ ( ) n] =1- [ ( ) n] =1- e
k!


<

P P P n n t , t + t , t + . (9B6)
We have
d
d
n
n-1 k
-
k=0
( )
F ( ) 1- e
k!


=

, (9B7)
the distribution function for
n
. Finally, differentiate (9B7) to obtain the gamma density function
d d d d
d d
n
d
n 1 n 1 n 1 n 2 k k-1 k k
d d
k 0 k 1 k 0 k 0
n
- n-1 d
( ) ( ) ( ) ( )
f ( ) e e
k! (k -1)! k! k!
e ,
(n -1)!



= = = =



= =


=

(9B8)
0 (some authors refer to (9B8) as an Erlang density see p. 321 of Stark and Woods).
This last result allows us to analyze the statistical properties of the duration between
Poisson points. As shown by Figure 9B-1, define random variable T
n

n
-
n-1
, the duration
between the (n-1)
th
and the n
th
Poisson points to the right of arbitrary fixed point t
0
. Since
Poisson points in non-overlapping intervals are independent, the random variables T
n
and
n-1
are
independent, and the density function that describes
n
is the convolution of the densities that
describe T
n
and
n-1
(can you explain why this is true?). Equivalently, the moment generating
function
n
(s)

for
n
is equal to the product
n 1 n
T
(s) (s)

of moment generating functions


for T
n
and
n-1
, respectively (again, can you explain why this is true?). So, using (9B8) and the
definition of the moment generating function, we calculate
d
d d

n n
d
n n
- x -( s)x sx n-1 sx n-1
0 0
(s) f (x)e dx x e e dx x e dx
(n -1)! (n -1)!



=

= = . (9B9)
EE603 Class Notes 2/1/2010 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9B-5
The last integral on the right of (9B9) is tabulated as
d
- ( s)x n-1
n n
0
d d
(n) (n 1)!
x e dx =
( s) ( s)

, (9B10)
where (n) = (n-1)! is the well known Gamma function with integer n argument. Finally,
substitute (9B10) into (9B9) to obtain
n
n n
d d
n n
d d
(n)
(s)
(n -1)!
( s) ( s)


=

= . (9B11)
Finally, since
n
(s)

=
n 1 n
T
(s) (s)

, we have
n
T
n
n 1
1
n n 1
d d d
n n 1
d
d d
(s)
(s)
(s) ( s)
( s) ( s)



= = =





. (9B12)
Note that
n
T
(s) is the moment generating function of an exponential random variable. Hence,
random variable T
n
, the duration between the (n-1)
th
and n
th
Poisson points to the right of t
0
(or
any two Poisson points), is described by an exponential random variable with parameter
d
.
Poisson Points Obey the Markov Property
Poisson points obey the Markov property (see the System Reliability section in Chapter
2). As above, denote t
0
as a fixed (but arbitrary) marker/reference point. Denote
1
as a random
variable that describes the time duration between t
0
and the next Poisson point to the right,
random variable
1
being described by density
1
f ( )

. Let conditional density f(


1
> )
describes random variable
1
conditioned on the event [
1
> ], an algebraic variable. Since
1
f ( )

is exponential, we have the Markov property


EE603 Class Notes 2/1/2010 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 9B-6
1 1
f( > ) = f ( ), ,

> (9B13)
as shown in Chapter 2 (see Examples (2-14) and (2-15)). A similar formula can be written for
the (conditional) density that describes random variable T
n
conditioned on the event T
n
> . The
fact that it has been very long (or short) seconds since t
0
and not observing a point does not
make the next Poisson point more (or less) likely. Stated differently, the probability of finding a
point in the interval (, ] is not influenced by the length of (t
0
, ], an interval containing no
point(s).
EE603 Class Notes 02/01/10 John Stensby
9C-1
Appendix 9C: Low-Pass Equivalent and Analytic Signal
We start with a wide-sense stationary (WSS), narrow-band Gaussian process
c c s c
(t) (t) cos t (t) sin t = . (9C-1)
Note that both
c
and
s
are zero-mean, WSS low-pass Gaussian processes, as shown in Chapter
9 of the class notes. In what follows, we define the low-pass equivalent and analytic signal
corresponding to (t). Finally, we use this information to select the optimum value of
c
for a
Gaussian narrow-band noise process.
Low-Pass Equivalent/Complex Envelope
The low-pass equivalent of (9C-1) is defined as
LP c s
(t) (t) j (t) + . (9C-2)
Often, this is referred to as the complex-envelope representation of . Note that
LP
is a WSS
low-pass Gaussian process. The original band-pass process (t) is related to
LP
by
c
LP
j t
(t) Re (t)e


=

. (9C-3)
In the analysis of band-pass signals and systems, very often
LP
is easier to work with than
since manipulation of messy trigonometric functions/identities is not required (especially true
when computing the band-pass output of a band-pass system).
Analytic Signal
The analytic signal for (t) is defined as
P
(t) (t) j (t) + , (9C-4)
EE603 Class Notes 02/01/10 John Stensby
9C-2
where (t) denotes the Hilbert transform of (t). Note that (9C-4) can be written as
P
1
2
1
(t) (t) 2 (t) j
2 t

+


, (9C-5)
where
1
2
1
(t) j U( )
2 t
+

(9C-6)
(U() is a unit step in the frequency domain). Therefore, the Fourier transform of (9C-5) can be
written as
p
( ) 2 ( )U( ) = , (9C-7)
where [ ] [ ]
P p
( ) (t) and ( ) (t) F F . To construct
p
, Equation (9C-7) tells us that we
should start with , truncate its negative frequency components, and double the amplitude of its
positive frequency components.
We desire to obtain a relationship between
p
and
LP
. Note that
c c
P LP LP
c c
LP LP
c
LP
j t j t
j t j t
j t
(t) (t) j (t) Re (t)e j Re (t){ je }
Re (t)e jIm (t)e
(t)e .


+ = +


= +

=
(9C-8)
By examining the Fourier transform of (9C-8), one can see that the low-pass equivalent is the
analytic signal translated to the left by
c
in frequency (i.e., the analytic signal translated down to
base band).
EE603 Class Notes 02/01/10 John Stensby
9C-3
Autocorrelation and Crosscorrelation of Complex-Valued Signals
Chapter 7 of the class notes gave a definition for the autocorrelation function of a real-
valued, WSS random process x(t). This definition must be modified slightly to cover the more
general case when x(t) is complex valued. For a complex-valued, WSS process x(t), we define
the autocorrelation as
x
R ( ) E X(t )X (t)


= +

, (9C-9)
where the star denotes complex conjugate. Note that R
x
is conjugate symmetric in that
x x
R ( ) R ( )

= . Of course, if x(t) is real-valued, then so is R


x
, and we have
x x
R ( ) R ( )

=
= R
x
(). Finally, power spectrum S() = F[R
x
] must be real-valued and nonnegative; it is even
if x(t) is real valued.
In a similar manner, let x(t) and y(t) be complex-valued, jointly wide sense stationary
random processes. The crosscorrelation function is defined here as
xy
R ( ) E X(t )Y (t)


= +

. (9C-10)
In general, this function does not exhibit conjugate symmetry. Cross spectrum S() = F[R
xy
]
can be complex valued with negative real/imaginary components.
Autocorrelation function of
LP
R

and
p
R

The autocorrelation function of complex-valued, low-pass equivalent


LP
is
[ ]
[ ] [ ]
{ }
LP
c s c s s c
LP LP c s c s
c c s s s c c s
R ( ) E (t ) (t) E { (t ) j (t )}{ (t) j (t)}
E (t ) (t) (t ) (t) jE (t ) (t) (t ) (t)
R ( ) R ( ) j R ( ) R ( ) .



= + = + + +

= + + + + + +
= +
(9C-11)
EE603 Class Notes 02/01/10 John Stensby
9C-4
However, from Chapter 9, we know that
c s
R ( ) R ( )

= and
s c c s
R ( ) R ( )

=
c s
R ( )

= . Hence, we can write (9C-7) as
( )
LP c c s
R ( ) 2 R ( ) jR ( )

= . (9C-12)
In a similar manner, we can write
[ ]
( )
p
p p

R ( ) E (t ) (t) E { (t ) j (t )}{ (t) j (t)}
R ( ) j R ( ) R ( ) R ( )

2 R ( ) jR ( ) .




= + = + + +

= + +

= +
(9C-13)
Finally, we can use (9C-8) and write a relationship between
p
R ( )

and
LP
R ( )

as
LP
c c
LP LP
p
c
LP LP
c
j (t ) j t
p p
j
j
R ( ) E (t )n (t) E (t )e n (t)e
E (t )n (t) e
R ( )e .
+


= + = +


= +

=
(9C-14)
Power Spectral Densities
Equations (9C-12) and (9C-13) have Fourier transforms given by
LP c c s
( ) 2 ( ) 2j ( )

= S S S (9C-15)
p
( ) 4 ( )U( )

= S S , (9C-16)
EE603 Class Notes 02/01/10 John Stensby
9C-5
respectively.
Note that
c c s s
( ) [R ( )]

= S F is a cross-spectral density; it is purely imaginary and
odd in (since
c s
R ( )

is an odd function of ). Therefore, j
c s
( )

S is real valued and odd
in (after all, we know that
LP
( )

S must be real valued!). Finally, note that (9C-16) implies


p p
4 ( ) ( ) ( )

= + S S S . (9C-17)
Equation (9C-14) has a Fourier transform given by
p LP
c
( ) ( )

= S S , (9C-18)
where
LP LP
[R ]

S F and
p p
[R ]
S F are real-valued, non-negative power spectrums of the
low-pass equivalent and analytic signal, respectively. Equation (9C-18) shows that the power
spectrum of the analytic signal can be obtained by translating up to
c
the power spectrum of the
low-pass equivalent.
Optimum Value of
c
for Use in Band-Pass Model
Given a band-pass process (t), representation (9C-1) is not unique. That is, there is a
range of
c
values that could be used, each value accompanied by a different set of low-pass
functions
c
(t) and
s
(t) (i.e.,
c
and
s
depends on the value of
c
that is used in the band-pass
model). However, for a given band-pass process (t), it is possible to define and compute an
optimum value of
c
. This is accomplished in what follows.
Clearly, the magnitude of the low-pass equivalent,
LP
, is the actual envelope of noise
(9C-1). Note that
LP
is dependent on the value of
c
that is used in (9C-1). In what follows, the
optimum
c
is defined as that value which produces the least temporal variation in the low-pass
equivalent. That is, the optimum value of
c
minimizes E[d
LP
/dt
2
].
.
Now, the power spectrum of d
LP
/dt is
LP p
2 2
c
( ) ( )

= + S S , a result that follows
from (9C-18). Hence, the optimum value of
c
minimizes
EE603 Class Notes 02/01/10 John Stensby
9C-6
LP
p p
2
2 2
c c
d 1 1
E ( )d ( ) ( )d
dt 2 2



= + =




S S . (9C-19)
With respect to
c
, differentiate (9C-19), and set the derivative equal to zero. This produces the
constraint
p
c
1
2( ) ( )d 0
2

S . (9C-20)
Finally, the optimum value of
c
is
p
p
c
( ) d
( ) d

S
S
. (9C-21)
Example 9C-1: Consider the noise with spectrum depicted by Fig. 9C-1a). From (9C-18), we
know that
p
S has its spectrum concentrated in a narrow band centered at +
c
, a positive
number. From (9C-17), we can immediately plot
p
S as Fig. 9C-1b). From (9C-21), we
calculate the optimum
S

()
0
1

2
1

2
0
1

2
4
p
( )

a)
b)
Fig. 9C-1: a) Power spectrum of narrow band noise. b) Power spectrum
of the corresponding analytic signal.
EE603 Class Notes 02/01/10 John Stensby
9C-7
2
2 1
2 1 1
2
2 1
1
2 2
1
2
c
4 d

2
4 d



+

= = =

, (9C-22)
as expected.
EE603 Class Notes 05/07/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 10-1
Chapter 10: Thermal Noise and System Noise Figure
In a warm resistor (i.e., one above absolute zero degrees Kelvin) free electrons move
about in thermally excited motion. This gives rise to a noise voltage that appears across the
resistors terminals. This noise was first analyzed in 1927 by J.B. Johnson of the Bell Telephone
Laboratories, and it goes by the names thermal noise, white Gaussian noise, Johnson noise and
kTB noise. This chapter describes a simple model for thermal noise and describes how it is
treated in system analysis.
Each thermally excited electron contributes to the noise in the resistor. There are a large
number of such electrons, each moving very rapidly and each contributing a small amount of
noise power. As a result, thermal noise is the result of a large number of noise voltages, and the
central limit theorem implies that the total noise is Gaussian. Assuming a constant resistance
and temperature, the noise process is stationary.
The noise generated by the randomly moving electrons can be analyzed and modeled by
using thermodynamics. It can be shown (see H. Nyquist, Thermal Agitation of Electric Charge
in Conductors, Physical Review, Vol. 32, pp 110-113, 1928a) that the noise generated by a
resistance of R ohms at a temperature of T degrees Kelvin (to convert degrees Celsius to degrees
Kelvin simply add 273; degrees Kelvin = degrees Celsius + 273) has a power spectrum that is
accurately represented by
S
n
0
N
2
e
( )
/
exp( / )

L
N
M
O
Q
P
0
0
1
, (10-1)
where bandwidth parameter
0
is given by

0
2 = ( ) kT / = . (10-2)
The quantities k, = and N
0
/2 are the Boltzmanns, Plancks and the spectral intensity parameter,
EE603 Class Notes 05/07/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 10-2
respectively, and their numerical values are
-23
-34
23 0
1.38 10 joules/degree Kelvin
6.63 10 joule seconds
N
2 2.76 10 watts-Ohm/Hz .
2



=
=
k
kTR TR
(10-3)
At room temperature, T = 290 Kelvin, and
12
0
21 0
12 10 radians / second
N
8 10 watt-ohm/Hz .
2


R
(10-4)
The quantity
0
/2 = kT/= can be approximated by the reciprocal of the mean relaxation time of
free electrons in the resistor.
As can be seen from (10-4), the bandwidth
0
of the process is very large. Over the
frequency range of interest to the communication engineer (from audio to the microwave bands),
Equation (10-1) can be approximated as
e
0
n
N
( ) = 2 watts - Ohm/Hz,
2
S kTR, (10-5)
a flat spectral density. Figure 10-1 depicts a commonly used model for a noisy resistor.
S
n
e
( ) = 2kTR
n
e
(t)
+
R
Figure 10-1: Model for a warm, noisy, resistor.
EE603 Class Notes 05/07/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 10-3
When connected in a circuit, a warm resistor delivers noise power to the circuit (see
Figure 10-2). Within a one-sided bandwidth of B Hz, we want to determine the maximum
amount of noise power that the resistor can deliver to the circuit. That is, assume that the
external circuit acts like an ideal bandpass filter with a magnitude response illustrated by Figure
10-3 (the circuit only absorbs power in the frequency range illustrated by the figure). We want
to find the amount of noise power that the external circuit absorbs. From circuit theory, we
know that maximum power transfer occurs when the source is impedance matched to the load;
maximum power is transferred when Z
in
= R, where Z
in
is the impedance seen looking into the
external circuit. When impedance matched, the voltage across the terminals of the external
circuit will be n
e
(t)/2. Since (10-5) is a double-sided spectrum (power over both negative and
positive frequencies must be added, or the power over the positive frequencies must be doubled),
the amount of noise power, in a bandwidth of B Hz, absorbed by the external circuit is
P
watts
=
F
H
I
K
=
z
2
1
2
4


(2
d
bandwidth
of 2
rad/ sec
kTR
R
/ )
B
kTB (10-6)
watts, where the integration is over the positive frequency side of Figure 10-3. So, from the
warm resistor, within a bandwidth of B Hz, we see that the maximum available noise power
+
R
External
Circuit
S
n
e
( ) = 2kTR
Z
in
is driving point impedance of circuit.
Max power transfer occurs when Z
in
= R.
n
e
(t)
Figure 10-2: Warm resistor supplying
noise power to an external circuit.
EE603 Class Notes 05/07/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 10-4
(sometimes called the available power) is kTB watts, a value that does not depend on the
resistance R but is proportional to bandwidth B. Because of (10-6), thermal noise is sometimes
called kTB noise.
Often, small amounts of power are specified in dB relative to one milliwatt; the units
used are dBm. So 1/10 milliwatt would be -10dBm, 1 milliwatt would be 0 dBm and 10
milliwatts would be +10dBm, etc. In units of dBm, the noise power delivered by a resistor of R
ohms at T degrees Kelvin in a bandwidth of B Hz is
P
dBm
= 10 Log =10 Log
P
.001 .001
watts
kTB
. (10-7)
Example 10-1: Convert the power level of 13 dBm to watts. From the definition given above,
we can write 13 = 10log(P
watts
/.001), a result that leads to P
watts
= 20 mW (milliwatts). This
result can be obtained mentally without the use of logarithms. Since 0dBm is 1 mW, we know
that 10dBm is 10 mW. Also, we know that doubling the power is equivalent to a 3dB increase in
power; so 13dBm is twice the power of 10dBm. This leads to the conclusion that 13dBm is 20
mW.
Example 10-2: Determine the spectral density of the noise v(t) that exists across the capacitor in
Figure 10-4. The circuit model is given to the right; basically, n
e
is driving an RC series circuit.
The voltage across the capacitor (i.e., the output voltage) has the power spectrum
2
v n
e 2 2 2 2 2
2 2(1/ )
( ) H( j ) ( )
1 (1/ )

= = =

+ +

S S
kTR kT RC
C
R C RC
, (10-8)
B Hz B Hz
-f
c
f
c
f (Hz)
1
Figure 10-3: Frequency response of the external circuit
illustrated in Figure 10-2.
EE603 Class Notes 05/07/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 10-5
and the autocorrelation of the output voltage is given as
R e
v
( )
/


=

kT
C
RC
. (10-9)
For the circuit depicted by Fig. 10-4, the impedance looking into the output terminals is
(1/ s )
Z(s)
1/ s 1 s
= =
+ +
C R R
R C RC
. (10-10)
Combine this with (10-8) to see that (Re[ ] denotes the real part)
S
v
() = 2kT Re[Z(j)]. (10-11)
Suppose an impulse current generator i(t) = (t) is connected to the output terminals of
the circuit depicted by Fig. 10-4. The driving impulse i(t) would generate a voltage of
[ ]
-1 t /
1
z(t) Z(s) e U(t)

= =
RC
C
L (10-12)
across the output terminals. Voltage z(t) is an impulse response if one considers current i(t) =
(t) as the input and voltage z(t) as the output. For 0, z() can be used to express the 0
S
n
e
( ) = 2kTR
n
e
(t)
+
R C
v(t)
+
-
Figure 10-4: Noisy resistor in parallel with
a capacitor.
EE603 Class Notes 05/07/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 10-6
half of R
v
as
v
R ( ) z( ), 0 = kT , (10-13)
a result that is verified by (10-9).
The fact that output autocorrelation R
v
and spectrum S
v
can be written in terms of
complex impedance Z(j) and impulse response z() is no fluke. Instead, these observations are
a consequence of the Nyquist Noise Theorem.
Nyquist Noise Theorem
Consider a passive, reciprocal circuit composed of linear R, L and C components.
Denote by v(t) the thermal noise voltage that appears across any two terminals a and b; let Z(s)
be the complex impedance looking into these terminals. See Figure 10-5. The power spectrum
of thermal noise v(t) is
[ ]
v
( ) 2 Z( j ) = S kT Re . (10-14)
For > 0, the autocorrelation of v(t) is
[ ]
-1
v
R ( ) Z(s) , 0, = > L kT (10-15)
C L
a
b
+
-
v(t)
R C L
a
b
Complex
Impedance
Z(s)
R
n
e
+
Fig. 10-5: a) Passive RLC circuit with one, or more, noisy resistors. Source n
e
models thermal
noise voltage in R. Voltage v(t) is the output thermal noise after filtering by the circuit. b) Z(s)
is the complex impedance looking into the circuit terminals.
EE603 Class Notes 05/07/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 10-7
where L
-1
[Z(s)] is the one-sided, causal Laplace transform inverse of the complex driving point
impedance Z(s).
Equation (10-15) follows from (10-14) by an inverse transform relationship. The inverse
Fourier transform of (10-14) can be expressed as
[ ]
[ ]
1 j
v v
j
1 Z( j ) Z ( j )
R ( ) ( ) 2 { } e d
2 2
1
Z( j ) Z( j ) e d
2


+
= =

-
-
F S
=
KT
KT
(10-16)
Substitute s = j into (10-16) and obtain
j j
s s
v
j j
1 1
R ( ) Z(s)e d Z( s)e d
2 j 2 j




= +




- -
KT . (10-17)
Note that Z(s) (alternatively, Z(-s)) has all of its poles in the left-half plane (alternatively, right-
half plane). Consider the case 0 for which (10-17) can be written as
s s
v
R R
1 1
R ( ) lim Z(s)e ds lim Z( s)e ds
2 j 2 j



= +



> >
C C
KT , (10-18)
where contour C is depicted by Figure 10-6. By Laplace transform theory, the first integral on
the right-hand side of (10-18) is simply the single-sided inverse Laplace transform of Z(s)
(remember that all poles of Z(s) are in C for sufficiently large radius R). Since Z(-s) has no poles
in C, the second right-hand-side integral equates to zero. As a result, R
v
(), for 0, can be
expressed as the one-sided inverse (10-15).
EE603 Class Notes 05/07/10 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 10-8
Note that R
v
() generally contains an impulse function when complex impedance Z(s) contains
an additive pure resistance component or when Z(s) is purely resistive. Also, use the fact that
R
v
() is an even function to obtain the autocorrelation for < 0.
Proof: See A. Papoulis, S. Pillai, Probability, Random Variables and Stochastic Processes,
Fourth Edition, McGraw Hill, 2002, p. 452.
RMS Value of Noise Voltage in a B Hz Bandwidth
Consider the warm resistor model depicted by Fig. 10-1. We want to band limit noise n_e(t) to a one-sided bandwidth of B Hz. We connect the warm resistor to the ideal band-pass filter depicted by Fig. 10-3 and denote the filter output as v_n(t). We want V_RMS ≜ sqrt(E[v_n^2]), the Root-Mean-Square (RMS) value of band-limited noise voltage v_n(t).
Suppose we terminate the filter with a load resistor. Note that the maximum available power is delivered to a filter load resistance of R ohms (equal to the source resistance). Now, v_n/2 is the output voltage across such an R-Ohm filter load. In addition, V_RMS/2 is the RMS value of this voltage. Finally, (V_RMS/2)^2/R is the power absorbed by the filter load resistor. The desired result follows from the equation

(V_RMS/2)^2 / R = kTB.  (10-19)

This can be solved for

V_RMS = sqrt(4RkTB),  (10-20)

the RMS value of the voltage v_n (which is n_e filtered to a one-sided bandwidth of B Hz).

Fig. 10-6: Contour C of radius R used in the development of the Nyquist noise theorem; the poles of Z(s), all in the LHP, are enclosed by C.
Example 10-3: Consider Figure 10-2 where the resistor operates at a temperature of 17° C, and the external circuit has a frequency response depicted by Figure 10-3 with a bandwidth of B = 10 kHz.

a) In watts and dBm, calculate the thermal noise power that the warm resistor delivers to the external circuit under impedance matched conditions. First, we must compute the absolute temperature T = 17 + 273 = 290 (degrees Kelvin). Then, we can use (10-6) and (10-7) to compute

P_watts = kTB = (1.38×10^-23)(290)(1×10^4) = 4×10^-17 watts

P_dBm = 10 Log_10[ (4×10^-17)/.001 ] = -133.98 dBm.

b) Calculate V_RMS if the impedance-matched, band-limited external circuit has a driving point impedance of 100 Ohms. We use the value kTB = 4×10^-17 in Equation (10-20) to obtain the value

V_RMS = sqrt(4RkTB) = sqrt[ 4(100)(4×10^-17) ] = .1265 micro-volts

for the RMS value of n_e filtered to a one-sided bandwidth of B Hz.
Antenna Noise Temperature - Noise in Receiving Antennas
At resonance, the impedance measured at the terminals of a radio frequency communication antenna is real valued (pure resistance). The resistance that appears at the terminals is the sum of a radiation resistance and an ohmic resistance. For commonly used antennas, the radiation resistance is approximately 50 Ω for a quarter-wave vertical over a ground plane, 72 Ω for a half-wave dipole in free space or 300 Ω for a folded half-wave dipole in free space. From a transmitter, power supplied to an antenna is absorbed by these resistances; actually, the power absorbed by the radiation resistance is radiated into space, and power absorbed by the ohmic resistance is turned into heat (it is wasted). Usually, an antenna's ohmic losses are small compared to the power that is radiated (the radiation resistance is much larger than the ohmic resistance). An exception to this occurs in antennas with physical dimensions that are small compared to a wavelength (as is sometimes the case for antennas designed for low frequencies - an automobile antenna for the AM band, for example); for these cases, the ohmic losses may dominate.

Random noise appears across the terminals of a receiving antenna. This noise comes from two sources: (1) thermal noise generated in the antenna's ohmic resistance, and (2) noise received (picked up) from other sources (both natural and man-made). Often, the antenna noise is represented as though it were thermal noise generated in a fictitious resistance, equal to the radiation resistance, at a temperature T_A that would account for the actual delivered noise power. That is, we model the antenna as a warm resistor, with value equal to the radiation resistance, operating at some temperature of T_A degrees Kelvin. T_A is the value at which the fictitious resistor would deliver an amount of noise power equal to what the antenna delivers. T_A is called the noise temperature of the antenna.
Example 10-4: Suppose that a 200 Ω antenna exhibits an RMS noise voltage of V_RMS = .1 μV at its terminals, when measured in a bandwidth of B = 10^4 Hz (i.e., an RMS volt meter, with a bandwidth of 10^4 Hz, indicates .1 μV when connected to the antenna). What is the antenna noise temperature T_A? We assume that the 200 Ω impedance does not change significantly over the 10 kHz bandwidth. From (10-20), we compute

V_RMS^2 = 4kT_A RB,

a result that leads to

T_A = V_RMS^2/(4kRB) = (10^-7)^2 / [4(1.38×10^-23)(200)(10^4)] ≈ 90.6 Kelvin.

Hence, from a noise standpoint, the antenna looks like a 200 Ω resistor at 90.6 K. Temperature T_A does not depend on bandwidth (why?).
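A quick way to reproduce the T_A computation of Example 10-4 is to invert (10-20) numerically; the short Python sketch below (variable names are illustrative) does exactly that.

# Python sketch: antenna noise temperature of Example 10-4 (illustrative)
k = 1.38e-23      # Boltzmann's constant, J/K
R = 200.0         # antenna (radiation) resistance, Ohms
B = 1e4           # measurement bandwidth, Hz
V_rms = 0.1e-6    # measured RMS noise voltage, volts

T_A = V_rms**2 / (4 * k * R * B)   # solve V_rms^2 = 4 k T_A R B for T_A
print(T_A)                         # approx 90.6 Kelvin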
Effective Noise Temperature of a Broadband Noise Source
A broadband noise source (any object that generates electrical noise) can be characterized by specifying its effective noise temperature. Suppose a source, within a bandwidth of B Hz, delivers P_a watts to an impedance-matched load. As is suggested by (10-6), we define the effective noise temperature of the source as

T_s ≜ P_a / kB degrees Kelvin.  (10-21)

This usage assumes that the noise spectrum is flat over the bandwidth of interest. For this case, temperature T_s will be independent of B; the noise power P_a will depend on B, and bandwidth should cancel out of (10-21).

A source may have an output noise spectrum with a shape that depends on frequency. However, the source may still be almost flat over a small bandwidth B = Δω/2π of interest. For example, this could be the case for warm resistors that are connected in a circuit with capacitors and/or inductors. For this case, the effective noise temperature T_s is a function of frequency. Next, we consider such a case.
Let S_v(ω) denote the output noise power spectrum of a source. In a bandwidth of Δω radians/second, we have

Power in Bandwidth Δω ≈ S_v(ω) (Δω/2π) watts,  (10-22)

a frequency-dependent quantity (this assumes the source is flat over bandwidth Δω). Let R(ω) denote the resistive component of the source output impedance. Now, T_s(ω), the effective noise temperature of the source, must satisfy

2k T_s(ω) R(ω) (Δω/2π) = S_v(ω) (Δω/2π),  (10-23)

so that

T_s(ω) = S_v(ω) / (2k R(ω)).  (10-24)
Equation (10-24) can be obtained from (10-21). Within a bandwidth B = Δω/2π (over which S_v is flat), the power delivered to an impedance matched load of R ohms is

P_a = [ S_v(ω)/(4R(ω)) ] · 2(Δω/2π).  (10-25)

Substitute (10-25) into (10-21) to obtain

T_s(ω) = P_a/kB = { [ S_v(ω)/(4R(ω)) ] · 2(Δω/2π) } / { k(Δω/2π) } = S_v(ω) / (2k R(ω)),  (10-26)

a result that is equivalent to (10-24).
Example 10-5: Consider the RLC network depicted by Fig. 10-7. R_1 is at T_1 degrees Kelvin, and R_2 is at T_2 degrees Kelvin. If the network is used as a broadband noise source with output across the terminals, what is the effective noise temperature of the source?

Solution: The noise spectrum across resistor R_1 is

S_v1(ω) = 2kT_1 Re[ (R_1/jωC_1) / (R_1 + 1/jωC_1) ] = 2kT_1 R_1 / (1 + ω^2 R_1^2 C_1^2).

Now, add this to the spectrum S_v2(ω) = 2kT_2 R_2 generated by R_2. The sum spectrum (the noise spectrum appearing across the terminals) is

S_v(ω) = S_v1(ω) + S_v2(ω) = 2kT_1 R_1/(1 + ω^2 R_1^2 C_1^2) + 2kT_2 R_2
       = 2k [ T_1 R_1 + T_2 R_2 (1 + ω^2 R_1^2 C_1^2) ] / (1 + ω^2 R_1^2 C_1^2).  (10-27)

Looking back into the terminal pair, the resistive component of the source output impedance is

R(ω) = R_2 + Re[ (R_1/jωC_1) / (R_1 + 1/jωC_1) ] = [ R_2 (1 + ω^2 R_1^2 C_1^2) + R_1 ] / (1 + ω^2 R_1^2 C_1^2).

Figure 10-7: Noise source for Example 10-5 (R_1 at temperature T_1 in parallel with C_1, in series with R_2 at temperature T_2; the output is taken across the series combination).

Using (10-26), the source effective noise temperature is

T_s(ω) = S_v(ω)/(2k R(ω)) = { 2k [ T_1 R_1 + T_2 R_2 (1 + ω^2 R_1^2 C_1^2) ] / (1 + ω^2 R_1^2 C_1^2) } / { 2k [ R_2 (1 + ω^2 R_1^2 C_1^2) + R_1 ] / (1 + ω^2 R_1^2 C_1^2) },

or

T_s = [ T_1 R_1 + T_2 R_2 (1 + ω^2 R_1^2 C_1^2) ] / [ R_1 + R_2 (1 + ω^2 R_1^2 C_1^2) ].  (10-28)

Note that T_s is a function of frequency.
Effective Input Noise Temperature of an Amplifier/Network
Within a bandwidth of B Hz, a noise source will supply P_ns = kT_s B watts, where T_s is the effective noise temperature of the source. Suppose that this source is connected to a noiseless amplifier/network (a fictitious amplifier that generates no internal noise) with power gain G_a. Then, the noise power output of the amplifier is

P_no = G_a k T_s B  (10-29)

watts. Now, suppose that the amplifier is not noiseless; within a one-sided bandwidth of B Hz, suppose that it generates P_ne watts of noise so that the total output noise power is

P_no = G_a k T_s B + P_ne  (10-30)

watts (as measured in a B Hz bandwidth), a result depicted by Figure 10-8. Write (10-30) as

P_no = G_a k (T_s + T_e) B,   T_e ≜ P_ne / (G_a k B).  (10-31)

Figure 10-8: Amplifier (gain G_a) with P_ns = kT_s B watts of input noise power from the source. The amplifier generates internally P_ne watts of additional noise power, so the output noise power delivered to the load is P_no = G_a kT_s B + P_ne.
The quantity T_e is called the effective input noise temperature of the amplifier. By using T_e, we have referenced the internally generated noise to an input source. That is, the noise from an input source at temperature T_e would be amplified and show up as P_ne watts of output noise, a result depicted by Figure 10-9. In the development of (10-31), we have assumed that the internally generated noise has a spectrum that is approximately flat over the bandwidth B of interest; hence, P_ne is proportional to bandwidth, and T_e would be largely independent of bandwidth. However, P_ne is allowed to vary slowly with frequency, so temperature T_e may be a function of frequency. As will be seen in what follows, the ability to reference internally-generated noise to an amplifier input is very useful when it comes to analyzing the noise properties of a chain of noisy amplifiers/networks. Also, effective input noise temperature is a commonly used method for specifying the noise performance of low noise amplifiers (LNA) in the TVRO market.
Figure 10-9: P_ne watts of internally-generated amplifier noise can be referenced to the input as a source at temperature T_e. The noiseless amplifier (gain G_a) sees the source noise P_ns = kT_s B plus the referenced noise P_ne/G_a = kT_e B, and it delivers output noise power P_no = G_a k(T_s + T_e)B to the load.
Signal-to-Noise Ratio (SNR)
Signal-to-noise ratio (SNR) is a ratio of signal power to noise power at a port (pair of
terminals). More specifically, it is the ratio of signal power in a specified bandwidth to noise
power in the same bandwidth. Note that bandwidth must be specified, or implied, when
discussing/computing an SNR. Often, SNR is given in dB; let P_s and P_n denote signal and noise powers, respectively. In terms of dB, we write

SNR(dB) = 10 Log_10( P_s / P_n ).  (10-32)
As a signal passes through a cascade of amplifier stages or devices, the SNR decreases after each
stage because noise is added in each stage. However, in many applications involving a cascade
of low-noise, high-gain stages, the overall output SNR is determined by the input SNR and the
noise properties (i.e., the noise figure) of the first stage alone. That is, stages after the first
influence very little the overall system SNR.
System Noise Factor/Figure
Noise is added to a signal as it passes through a system, such as an amplifier. Thermal noise, which is present in all practical electronic systems, is the predominant noise in many (but not all!) systems. Since the system amplifies/attenuates its input signal and noise equally, and more noise is added by the system itself, the signal-to-noise ratio (SNR) at the output is lower than the signal-to-noise ratio at the input. System noise factor and noise figure characterize the extent of this degradation as the signal passes through the system.
Let (SNR)_IN and (SNR)_OUT denote the input and output signal-to-noise ratios, respectively, of the system (these are specified in the same bandwidth). Let P_si and P_ni denote the input signal and noise powers, respectively. Likewise, let P_so and P_no denote the output signal and noise powers, respectively. Then, system noise factor is defined as

F ≜ (SNR)_IN / (SNR)_OUT = (P_si/P_ni) / (P_so/P_no) = P_no / (G_a P_ni),  (10-33)
where G_a = P_so/P_si is the power gain of the system. Note that F > 1 in real-world, noisy systems. In general, the smaller F is, the better.

As discussed previously, SNR depends on bandwidth. In (10-33), (SNR)_IN and (SNR)_OUT are specified in the same bandwidth B, and each scales with B in the same way (the noise powers are proportional to B). Hence, bandwidth cancels out of (10-33), and device/system noise factor and noise figure are independent of bandwidth.
Noise factor can be specified in terms of amplifier/device noise temperature. Note that P_no = G_a P_ni + P_ne, where P_ne = G_a kT_e B is noise added by the amplifier/device. Combine this with (10-33) to obtain

F = P_no / (G_a P_ni) = (G_a P_ni + G_a kT_e B) / (G_a P_ni) = 1 + kT_e B / P_ni.  (10-34)

Equation (10-34) shows that F depends on P_ni, the input noise power. In applications, the standard procedure is to use P_ni = kT_0 B, T_0 = 290 K, the noise generated by a room-temperature resistor. For this value of P_ni, noise factor (10-34) becomes

F = 1 + T_e/T_0,   T_e = (F - 1) T_0,  (10-35)

a much easier to remember formula pair.
Often, system noise factor F is converted to Decibels (dB). In this case, we use the name system noise figure and write

NF ≜ 10 Log_10 F = 10 Log_10[ (SNR)_IN / (SNR)_OUT ],  (10-36)

where the units are Decibels (NF > 0 in real-world, noisy systems). Again, the smaller the better when it comes to NF. The advantage of using NF and a logarithmic scale is evident in

10 Log_10[(SNR)_OUT] = 10 Log_10[(SNR)_IN] - NF,  (10-37)

which shows that, when noise figure is expressed in dB, one can subtract NF from the input SNR to obtain the output SNR.
In product literature and on technical data sheets, it is common for manufacturers to specify noise factor F and/or noise figure NF. For example, in their Diode and Transistor Designer's Catalog 1982-1983, Hewlett Packard (now Agilent Technologies) advertises a typical NF of 1.6 dB for their 2N6680 microwave GaAs FET operating at 4 GHz. As can be seen from (10-34), device/system F (and NF) depend on source noise power P_ni. So, one may ask, what source noise power (or source temperature) did Agilent use when they specified the 2N6680's noise figure?

If Agilent followed the accepted norm, they specified the above-mentioned device noise figure relative to a standard noise source, a resistor at room temperature T_0 = 290 Kelvin. Generally, one can assume that a device/system is connected to a standard noise source at T_0 = 290 Kelvin when noise factor/figure is measured and specified. When the noise factor of a system/device is specified, it is common to use (10-34) with P_ni = kT_0 B, where T_0 = 290 Kelvin (this produces (10-35)).
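The conversions between noise factor F, noise figure NF and effective input noise temperature T_e in (10-34) through (10-36) are simple enough to collect into a few helper functions. The Python sketch below (function names are mine, not from the notes) assumes the standard T_0 = 290 K reference source.

# Python sketch: F / NF / T_e conversions, assuming the standard T0 = 290 K source
from math import log10

T0 = 290.0

def nf_to_factor(nf_db):            # NF in dB -> noise factor F
    return 10 ** (nf_db / 10.0)

def factor_to_nf(F):                # noise factor F -> NF in dB
    return 10 * log10(F)

def factor_to_temp(F, T0=T0):       # F -> effective input noise temperature, (10-35)
    return (F - 1.0) * T0

def temp_to_factor(Te, T0=T0):      # T_e -> F, (10-35)
    return 1.0 + Te / T0

print(factor_to_temp(nf_to_factor(1.6)))   # approx 129 K, as in the 2N6680 discussion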
Example 10-6: The noise figure of a perfect (hypothetical) system would be zero dB; such a
system would not add any noise to a signal passing through it. On the other hand, suppose an
amplifier has a NF of 2 dB, and it is amplifying input signal and noise with an input SNR of 10
dB. For this case, use (10-37) to compute the output SNR as 8 dB.
Example 10-7: Suppose an amplifier is operating with an input signal power of 2×10^-10 watts, an input noise power of 2×10^-18 watts and a power gain of 1×10^6. Furthermore, suppose the amplifier itself generates an output noise power of 6×10^-12 watts. Calculate a) input SNR in dB, b) output SNR in dB, and c) noise factor and noise figure.

a) The input SNR is

(SNR)_IN = (2×10^-10) / (2×10^-18) = 1×10^8.

Equivalently, the input SNR is 80 dB.

b) The output noise power is the sum of the internally-generated noise and the input noise, after amplification. Hence, the total output noise power is

Noise Output Power = 10^6 (2×10^-18) + 6×10^-12 = 8×10^-12 watts.

The output signal power is

Signal Output Power = 10^6 (2×10^-10) = 2×10^-4 watts.

Hence, the output signal-to-noise ratio is

(SNR)_OUT = (2×10^-4) / (8×10^-12) = 2.5×10^7.

Equivalently, the output SNR is about 74 dB.

c) The ratio of the results of parts a) and b) produces a system noise factor of F = 1×10^8 / 2.5×10^7 = 4, or a system noise figure of NF = 10 Log_10(4) ≈ 6 dB.
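Example 10-7 can be verified directly from the given powers; a minimal Python sketch follows (the numbers are those stated in the example).

# Python sketch: verification of Example 10-7 (illustrative)
from math import log10

P_si, P_ni = 2e-10, 2e-18   # input signal and noise powers, watts
G_a        = 1e6            # power gain
P_ne       = 6e-12          # internally generated output noise, watts

P_so = G_a * P_si           # output signal power
P_no = G_a * P_ni + P_ne    # total output noise power

snr_in  = P_si / P_ni
snr_out = P_so / P_no
F = snr_in / snr_out        # noise factor, (10-33)
print(10*log10(snr_in), 10*log10(snr_out), F, 10*log10(F))
# approx 80 dB, 74 dB, F = 4, NF = 6 dB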
Example 10-8: To continue the 2N6680 example given earlier, assume that Agilent used a standard noise source (i.e., one at T_0 = 290 Kelvin) when they measured the 1.6 dB noise figure of their device. The noise temperature of an optimally designed (for minimum noise figure) 2N6680-based amplifier would be on the order of T_e = 290(10^0.16 - 1) = 129.2 Kelvin.
Noise Factor of a Purely Resistive Attenuator
Suppose we have a two port network comprised only of resistors. For example, both fixed and variable resistive attenuators are available commercially from many vendors. In some applications, a long run of coax cable can contribute significant losses; in many of these cases, the coax can be modeled as a purely resistive attenuator (coax loss is approximately constant over small fractional bandwidths). Let G_a, 0 < G_a < 1, denote the gain of the resistive attenuator. Also, for purposes of determining attenuator noise factor, we make the assumption that the resistive attenuator is at the same absolute temperature T_0 = 290 Kelvin as the resistive noise source connected to its input. Unless specified otherwise, noise factor/figure is always specified relative to a standard noise source, a resistor at T_0 = 290 Kelvin. Finally, we assume that the attenuator is impedance matched on its input and output ports.
Under the conditions outlined in the previous paragraph, a resistive attenuator has a noise factor F different from unity. The reason for this is that the attenuator's resistive components contribute noise, and (SNR)_OUT < (SNR)_IN. As we now show, the noise factor of the attenuator can be taken as F = G_a^-1. To see this, we model the noise source-attenuator combination in two equivalent ways. First, we model the combination as a single resistive noise source at T_0 degrees Kelvin. For this interpretation, the attenuator output noise power is

P_na = kT_0 B  (10-38)

watts. On the other hand, the attenuator can be viewed as a noisy device, characterized by a less-than-unity gain G_a and an equivalent input temperature of T_e degrees Kelvin. Since the attenuator is connected to a noise source at T_0 degrees Kelvin, the first of (10-31) leads to

P_na = G_a k (T_0 + T_e) B.  (10-39)
Equations (10-38) and (10-39) should produce the same results; when they are equated, we get the formula

T_e = (G_a^-1 - 1) T_0.  (10-40)

Now, substitute (10-40) into (10-35) to obtain

F = 1 + (G_a^-1 - 1) T_0 / T_0 = G_a^-1  (10-41)

as the attenuator noise factor F. Finally, use (10-41) to determine

NF = 10 Log(F) = 10 Log(1/G_a) = -10 Log(G_a) dB,  (10-42)

a positive result (since G_a < 1 for an attenuator). So, the noise figure (NF) of the attenuator/lossy coax is just its loss in dB.
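Equations (10-40) through (10-42) are captured by the short sketch below; the 1.5 dB loss value is only an illustration (it happens to match the coax used later in Example 10-9).

# Python sketch: noise temperature and noise figure of a resistive attenuator (illustrative)
from math import log10

loss_dB = 1.5                    # attenuator/coax loss in dB (illustrative value)
G_a = 10 ** (-loss_dB / 10.0)    # gain, 0 < G_a < 1
T0  = 290.0

T_e = (1.0 / G_a - 1.0) * T0     # equation (10-40)
F   = 1.0 / G_a                  # equation (10-41)
NF  = 10 * log10(F)              # equation (10-42); equals the loss in dB
print(G_a, T_e, F, NF)           # approx 0.708, 119.6 K, 1.413, 1.5 dB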
Overall Noise Figure of Cascaded Networks
Most systems involve several stages connected in a cascade fashion. For example, a
typical radio receiving system consists of a combination of antenna, feed line (i.e., coax), RF
amplifier, down converter, IF amplifier followed by other stages. In order to perform a noise
analysis of such systems, we need to know how to compute the noise factor of a cascade of
stages given the noise factor and gain of the individual stages.
Figure 10-10 depicts two cascade-connected stages with associated gains, effective noise temperatures and noise figures. A standard noise source (a resistor at temperature T_0 = 290 K) is connected to the input side. The noise power output delivered to the load is

P_no = G_a1 G_a2 kT_0 B + G_a1 G_a2 kT_e1 B + G_a2 kT_e2 B   (due to source, first two-port and second two-port, respectively)

     = G_a1 G_a2 kT_0 B + G_a1 G_a2 k(T_e1 + T_e2/G_a1) B   (noise from input; internally generated noise referenced to input of cascade)

     = G_a1 G_a2 k[ T_0 + T_e1 + T_e2/G_a1 ] B.  (10-43)

Let T_e1,2 denote the noise temperature of the cascaded network. Then, comparison of (10-43) to the first equation of (10-31) leads to the conclusion

T_e1,2 = T_e1 + T_e2/G_a1.  (10-44)

T_e1,2 accounts for the noise introduced by both two-ports acting together. It is the overall noise temperature of the cascaded two-port.

Figure 10-10: A cascade of two-port devices. Stage 1 has gain G_a1, noise temperature T_e1 and noise factor F_1; stage 2 has G_a2, T_e2 and F_2. The input source supplies P_ni = kT_0 B in bandwidth B, and the load absorbs the output noise power P_no.

This process can be extended to the case of n such cascaded two-port networks. Following the approach just outlined, one can easily find that n cascaded networks have an overall noise temperature of

T_e1,n = T_e1 + T_e2/G_a1 + T_e3/(G_a1 G_a2) + ··· + T_en/(G_a1 G_a2 ··· G_a(n-1)),  (10-45)
where T_ek is the noise temperature of the k-th network in the chain.
The overall noise factor of n cascaded networks can be expressed in terms of the individual noise factors. First, use (10-35) to write

T_ek = (F_k - 1) T_0,  (10-46)

where F_k is the noise factor of the k-th network. Now, substitute (10-46) into (10-45) to obtain

T_e1,n = (F_1 - 1)T_0 + (F_2 - 1)T_0/G_a1 + (F_3 - 1)T_0/(G_a1 G_a2) + ··· + (F_n - 1)T_0/(G_a1 G_a2 ··· G_a(n-1)),  (10-47)
for the overall noise temperature of the cascaded network. Finally, use this last expression for the two-port equivalent noise temperature in (10-35) to obtain

F_1,n = F_1 + (F_2 - 1)/G_a1 + (F_3 - 1)/(G_a1 G_a2) + ··· + (F_n - 1)/(G_a1 G_a2 ··· G_a(n-1)),  (10-48)

a result known as the Friis formula.
Equation (10-48) tells a very important story. Basically, inspection of (10-48) reveals that the first two-port in a chain has the predominant effect on the overall noise factor, unless G_a1 is very small or F_2 is very large. In practical system design, one should pay particular attention to the noise performance of the first stage in a chain since it will, most likely, establish the overall noise properties of the chain. For example, in the design of VHF, UHF and microwave receiving systems, one usually optimizes the noise performance of the receiver's RF amplifier (which is the first stage); later stages have little influence on the receiver's noise performance. Often, an external, low noise, RF amplifier will be placed at the antenna to keep feed line loss (which can be several dB) from decreasing the effective value of G_a1.
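The Friis formula (10-48) is easily implemented as a short function. The sketch below (function names are mine, not from the notes) takes per-stage gains and noise factors in cascade order and, as a check, reproduces both configurations of Example 10-9 that follows.

# Python sketch: Friis formula (10-48) for a cascade of stages (illustrative)
from math import log10

def cascade_noise_factor(gains, factors):
    """gains[i], factors[i]: power gain and noise factor of stage i, in cascade order."""
    F_total = factors[0]
    g_prod = 1.0
    for G, F in zip(gains[:-1], factors[1:]):
        g_prod *= G                      # product of gains of all preceding stages
        F_total += (F - 1.0) / g_prod
    return F_total

def db_to_lin(x_db):
    return 10 ** (x_db / 10.0)

# Stages of Example 10-9: coax (1.5 dB loss), RF amp, mixer, IF amp
coax   = (db_to_lin(-1.5), 1.0 / db_to_lin(-1.5))   # F = 1/G_a for a resistive attenuator
rf     = (db_to_lin(20.0), db_to_lin(7.0))
mixer  = (db_to_lin(8.0),  db_to_lin(8.0))
if_amp = (db_to_lin(60.0), db_to_lin(6.0))

for order in [(coax, rf, mixer, if_amp), (rf, coax, mixer, if_amp)]:
    gains   = [s[0] for s in order]
    factors = [s[1] for s in order]
    F = cascade_noise_factor(gains, factors)
    print(F, 10 * log10(F))   # approx 7.16 (8.55 dB) and 5.10 (7.07 dB)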
Example 10-9: Consider a VHF receiver and antenna that are interconnected with coax cable having a loss of 1.5 dB (for example, 100 ft of RG213 coax at 50 MHz). Suppose the receiver's RF amplifier (the receiver's first stage) has a noise figure of NF_2 = 7 dB and a gain of 20 dB. The active FET mixer (the receiver's second stage) has a noise figure and a conversion gain of 8 dB. The mixer is followed by an IF amplifier that is based on a Motorola MC1590G integrated circuit; the IF amplifier has a gain of 60 dB and a noise figure of NF_4 = 6 dB.
a) Find the noise figure and noise temperature of the coax and receiver combination. The stages are arranged in the order given below.

1) Coax: Pwr Gain = -1.5 dB so that G_a1 = 10^(-1.5/10) = .7080
   F_1 = 1/G_a1 = 1.413
2) RF Amp: Pwr Gain = 20 dB so that G_a2 = 10^(20/10) = 100
   NF_2 = 7 dB so that F_2 = 10^(7/10) = 5.012
3) Mixer: Pwr Gain = 8 dB so that G_a3 = 10^(8/10) = 6.310
   NF_3 = 8 dB so that F_3 = 10^(8/10) = 6.310
4) IF Amp: Pwr Gain = 60 dB so that G_a4 = 10^(60/10) = 10^6
   NF_4 = 6 dB so that F_4 = 10^(6/10) = 3.981

Substitute the above data into the Friis formula (10-48) to obtain

F = 1.413 + (5.012 - 1)/.7080 + (6.310 - 1)/(.7080×100) + (3.981 - 1)/(.7080×100×6.310) = 7.16,  (10-49)

or NF = 8.55 dB. Use Equation (10-35) to calculate the overall noise temperature as

T_e = (F - 1) T_0 = 290(7.16 - 1) = 1786 Kelvin.  (10-50)
At a value of 7.080, the first two terms of (10-49) dominate the final answer; that is, the coax loss and RF amplifier come close to setting the overall system noise factor. Also, note that the 1.5 dB coax loss (and the resulting coax noise factor) has a significant adverse effect on overall system noise factor. The adverse effect of coax loss can be minimized/eliminated by placing the RF amplifier at the antenna feed point, ahead of the coax. This situation is analyzed in part b).
b) Consider changing the receiving system so that the RF amplifier is at the antenna end of the coax, mounted in a small aluminum box to protect it from the weather. This RF amplifier placement removes the feed line loss between the antenna and the RF amplifier. At the receiver end, the coax is impedance matched directly to the mixer input. That is, interchange 1) and 2) in the above lineup. Use the same values for component noise figures and gains, and repeat part a). With this configuration, the overall system noise figure is

F = 5.012 + (1.413 - 1)/100 + (6.310 - 1)/(100×.7080) + (3.981 - 1)/(100×.7080×6.310) = 5.10,  (10-51)

or NF = 7.07 dB. Note that this overall NF is very close to the NF (= 7 dB) of the RF amplifier alone!! The first stage (i.e., the RF amplifier) sets the overall system NF!! The new system noise temperature is

T_e = 290(4.10) = 1189 Kelvin.  (10-52)

This simple swap virtually eliminated the adverse effects of coax loss on system noise figure.
Example 10-10: Consider the receiver front end depicted by Figure 10-11. The radiation resistance of the antenna is 70 ohms. Ambient (atmospheric) noise is responsible for the antenna noise; the antenna noise temperature is T_A = 300 Kelvin. The noise figures of the RF and mixer stages are given in dB. The local oscillator (LO) is assumed to generate no noise (this is not a good assumption in all practical cases; LO noise is significant in most applications). Find the noise temperature and noise figure of the receiver front end. For the RF stage, the noise factor F_1 = 2, and gain G_a1 = 10. For the mixer, F_2 = 4.47, and conversion gain G_a2 = 7.94. The noise temperatures of the RF and mixer stages can be computed by using T_e = (F - 1)T_0 obtained from (10-35); the results are T_e1 = 290 Kelvin and T_e2 = 1006 Kelvin. Excluding the antenna, the effective input temperature of the receiver is found to be

T_R = T_e1 + T_e2/G_a1 = 391 Kelvin  (10-53)

by using (10-44). Finally, the front-end noise factor is F = 2.35 (which equates to a noise figure of NF = 3.7 dB), a result computed by using (10-48).
Figure 10-11: Receiver front-end. The antenna (T_A = 300 K, R_a = 70 Ohms) feeds an RF amplifier (Gain = 10 dB, NF = 3 dB), which drives a FET down converter (Gain = 9 dB, NF = 6.5 dB) whose local oscillator is assumed to be noiseless; the converter output goes to the IF amplifier.
The theory of sequences of finite-second-moment random variables is the topic of this chapter. We study their application to system theory, where they serve as the system's input and output. It is natural to ask if a given random sequence has a limit and in what sense the limit is approached. Convergence of random variable sequences is discussed in this chapter. This chapter deals with discrete phenomena and mathematics.
Random sequences occur in applications where analog signals are sampled. They have
applications in the fields of signal and image processing, digital control and digital
communications. They have many applications outside of electrical engineering (for example, in
the world of games, stocks, money and finance).
Let (Ω, F, P) be a probability space (see Chapter 1 of these notes). A random variable X(ζ) maps Ω into the real line R. (See Chapter 2 for the definition of a random variable.) A random, or stochastic, sequence X(n;ζ) is a sequence of random variables that is indexed on n. Often, we suppress the ζ argument and write X(n). For a fixed ζ_0 ∈ Ω, X(n;ζ_0) is an ordinary deterministic sequence of numbers known as a sample function. Hence, a random sequence can be thought of as a mapping from Ω into a set of deterministic sample function sequences (i.e., for each ζ ∈ Ω, there is a different sample function X(n;ζ)). Finally, for a fixed index n_0, X(n_0;ζ) is a random variable.

X(n;ζ) ≜ ξ(ζ)f(n), where ξ(ζ) is a random variable, and f(n) is a deterministic sequence of real numbers, is a simple random sequence.

X(n;ζ) ≜ A(ζ)sin(πn/10 + θ(ζ)), where A(ζ) and θ(ζ) are random variables, is a random sequence.
These two elementary examples have the feature that their future values are predictable from
their present and past values.
Consider the tossing of a fair coin. Here, we have the sample space Ω = [H, T], the set of events (i.e., σ-algebra) F = {[H], [T], [H, T], ∅}, and the probability measure P that is usually associated with the tossing of a fair coin (i.e., P[H] = P[T] = 1/2, etc.). (Ω, F, P) is the probability space for the coin tossing experiment. We define a random variable X: Ω → R as

X(H) = 1
X(T) = 0.  (11-1)

We know that {X < 1/2} = [T], {X > 1/2} = [H], etc. In what follows, probability space (Ω, F, P) and random variable X will be used to build the Bernoulli trials random sequence.
This random sequence X(n) is defined easily. On the n-th toss, assign X(n) = 1 (alternatively, X(n) = 0) if a heads (alternatively, tails) is obtained. We call X(n) the Bernoulli trials random sequence. This simple sequence must be described by using the methodology outlined above, a task that introduces some complexity. We need a probability space (Ω_∞, F_∞, P_∞) so that X(n;ζ) can be defined as a mapping from Ω_∞ into a set of binary functions. (Ω_∞, F_∞, P_∞), the development of which is outlined below, is a product space.

Our product space (Ω_∞, F_∞, P_∞) is developed by using ideas from Chapter 1 of these notes (also see Chapter 6 of Stark and Woods, Probability and Random Processes, 3rd ed.). Instead of considering individual heads and tails as elementary outcomes of separate experiments, our product space has elementary outcomes that are infinite head/tail sequences. We define the Bernoulli trials random sequence X(n;ζ) as a mapping from sample space Ω_∞ into a set of binary functions.
Our product space will be built as an infinite Cartesian product of (Ω, F, P) with itself (recall that (Ω, F, P) describes the coin tossing experiment). But first, by (Ω_n, F_n, P_n), we denote the n-th repetition of (Ω, F, P); that is, Ω_n = {[H_n], [T_n]} and F_n = {[H_n], [T_n], [H_n, T_n], ∅}, where H_n and T_n denote heads on the n-th toss and tails on the n-th toss, respectively. P_n is the usual probability measure that is used for the tossing of a fair coin (i.e., P_n[H_n] = P_n[T_n] = 1/2, etc.).
Now, denoted as (Ω_∞, F_∞, P_∞), our infinite-dimensional product space is determined from the (Ω_k, F_k, P_k), k ≥ 1, as outlined in what follows.

The sample space Ω_∞ is the infinite Cartesian product

Ω_∞ = ×_{k=1}^{∞} Ω_k = Ω_1 × Ω_2 × Ω_3 × ··· × Ω_n × ···.  (11-2)
Elements of Ω_∞ consist of infinite sequences of heads and tails. Element ζ ∈ Ω_∞ has the form

ζ = (ζ_1, ζ_2, ζ_3, ...),  (11-3)

where ζ_k ∈ Ω_k (ζ is a sequence of outcomes, not the outcome of a specific trial).
F_∞ denotes the set of events (i.e., the σ-algebra) for the product space. F_∞ includes all sets of the form

A = A_1 × A_2 × ··· × A_n × ··· = ×_{k=1}^{∞} A_k,  (11-4)

where A_k ∈ F_k, 1 ≤ k < ∞ (set (11-4) is called a generalized rectangle). Also, all countable intersections and unions of such sets are included in F_∞. For example, consider the event (in F_∞) that the first two tosses produce different outcomes. This event is represented as

[ {H_1} × {T_2} × Ω_3 × Ω_4 × ··· ] ∪ [ {T_1} × {H_2} × Ω_3 × Ω_4 × ··· ],  (11-5)

the union of two generalized rectangles. Also, the intersection of events [ {H_1} × Ω_2 × Ω_3 × ··· ] ∩ [ Ω_1 × {T_2} × Ω_3 × ··· ] must mean the event [ {H_1} × {T_2} × Ω_3 × ··· ]. As it turns out, F_∞ is the σ-algebra generated by the collection (i.e., set) of all generalized rectangles of the form (11-4) (see Chapter 1 for details on how a σ-algebra can be generated by a collection of sets).
To finish our product space, we must define P_∞, a probability measure on the product space. To accomplish this, we use the fact that the successive trials are independent, and probabilities can be multiplied (without this assumption, it would not be possible to define P_∞ without knowing the interdependence of each trial on the other trials). We start with events of the form given by (11-4), and we define

P_∞[ ×_{n=1}^{∞} A_n ] = ∏_{n=1}^{∞} P_n(A_n).  (11-6)

We realize that every event in F_∞ can be represented as countable unions and/or intersections of events of the form (11-4). And, we use the Countable Additivity property (possessed by all valid probability measures) to extend definition (11-6) to all of F_∞. This finishes the definitions of P_∞ and our product space (Ω_∞, F_∞, P_∞). Note that we have developed the same product space that is discussed in Chapter 6 of Stark and Woods, 3rd edition (see the section on the Bernoulli random sequence, p. 310).
Finally, using our infinite-dimensional product space, we are in a position to define the Bernoulli trials random sequence. Denote an elementary outcome in Ω_∞ as ζ. That is, ζ = (ζ_1, ζ_2, ...) ∈ Ω_∞, where each ζ_k ∈ Ω_k, k ≥ 1, is either a head or tail (so that ζ is an infinite indexed sequence of heads and tails). Finally, we define the Bernoulli trials random sequence

X(n;ζ) = 1, ζ_n = H_n
        = 0, ζ_n = T_n,  (11-7)

a mapping from Ω_∞ into the set of binary functions (remember that ζ_k is the k-th component of ζ).
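Although the measure-theoretic construction above is abstract, sample functions of the Bernoulli trials sequence are trivial to generate on a computer. The Python sketch below (a simulation, not part of the formal construction; names are mine) prints a few truncated sample functions X(1;ζ), ..., X(N;ζ) for different outcomes ζ.

# Python sketch: sample functions of the Bernoulli trials random sequence (illustrative)
import random

def bernoulli_sample_function(N, p_heads=0.5, seed=None):
    """Return X(1), ..., X(N) for one elementary outcome zeta (one coin-toss
    sequence, truncated to N tosses)."""
    rng = random.Random(seed)
    return [1 if rng.random() < p_heads else 0 for _ in range(N)]

for trial in range(3):                         # three different outcomes zeta
    print(bernoulli_sample_function(20, seed=trial))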
When dealing with an infinite sequence of random variables, we need to be able to define
the notion of a limit of an event sequence. In general, the limit of an event sequence is
somewhat complicated and abstract. Before considering the general case, we first consider the
important simple special case of nested sequences.
A nested decreasing sequence of events is a simple concept. The event sequence A_k, k ≥ 1, is nested and decreasing if for each integer n ≥ 1 we have

A_1 ⊃ A_2 ⊃ ··· ⊃ A_n ⊃ ···.  (11-8)

A convenient feature of such a sequence is that

A_N = ∩_{k=1}^{N} A_k.  (11-9)
In some applications, an event can be expressed as the limit of a nested decreasing sequence of events, a sometimes-valuable representation. For example, let X(n), n ≥ 1, be a sequence of random variables and consider

A ≜ {X(n) < 5, n ≥ 1} = {X(1) < 5} ∩ {X(2) < 5} ∩ {X(3) < 5} ∩ ··· = limit_{N→∞} A_N,  (11-10)

where

A_N ≜ ∩_{n=1}^{N} {X(n) < 5}.  (11-11)

Note that A_1 ⊃ A_2 ⊃ ··· ⊃ A_N ⊃ ··· so that A_N is a nested decreasing sequence that has the limit A ≜ {X(n) < 5, n ≥ 1}.
Like a bounded and monotone sequence of real numbers, all of which have a real-number limit, a nested decreasing sequence of events has a well-defined limit event. As shown by (11-9), the limit of A_N can be expressed as a countable intersection of events. And, a countable intersection of events is an event (recall that the set of events, a σ-algebra, is closed under countable unions and intersections). Often, we write A_N ↓ A_∞, where A_∞ is the limit.

Similar results and statements can be made for a nested increasing sequence of events. The sequence B_N is a nested increasing event sequence if B_1 ⊂ B_2 ⊂ ··· ⊂ B_N ⊂ ··· for all N. Furthermore, we can write

B_N = ∪_{n=1}^{N} B_n.  (11-12)

A nested increasing event sequence always has a limit

B_∞ ≜ limit_{n→∞} B_n = ∪_{n=1}^{∞} B_n  (11-13)

since B_∞ can be written as a countable union of events. Often, we write B_N ↑ B_∞.
Nested sequences of events are special cases of general event sequences. In Appendix 11B, we define the limit, when it exists, of an arbitrary sequence of events (unlike the case of nested sequences, the limit of an arbitrary event sequence may not exist!).

Concerning infinite intersections and unions, some standard notation needs to be reviewed. For B_n, n ≥ 1, a sequence of events, we utilize the standard notation

∩_{n=1}^{∞} B_n ≜ limit_{N→∞} ∩_{n=1}^{N} B_n  (11-14)

∪_{n=1}^{∞} B_n ≜ limit_{N→∞} ∪_{n=1}^{N} B_n.  (11-15)

Of course, the limits (11-14) and (11-15) may, or may not, exist when the B_n are non-nested.
We need to be able to compute probabilities like P[X(n) < 5, n ≥ 1]. This probability can, we will argue, be computed as the limit of P[A_N], where A_N is represented by (11-11). That is, we need to show the second equality in (the first equality is a definition)

P[ ∩_{n=1}^{∞} {X(n) < 5} ] ≜ P[ limit_{N→∞} ∩_{n=1}^{N} {X(n) < 5} ] = limit_{N→∞} P[ ∩_{n=1}^{N} {X(n) < 5} ].  (11-16)

To any specified accuracy, this limit can be approximated by using sufficiently large N.

The second equality in Equation (11-16) follows from the continuity of the probability measure P, a fact that we will argue in what follows. The events

A_N ≜ ∩_{n=1}^{N} {X(n) < 5}  (11-17)

form an indexed set of nested, decreasing events. The limit of the nested sequence is

A_∞ ≜ limit_{N→∞} A_N = limit_{N→∞} ∩_{n=1}^{N} {X(n) < 5} = ∩_{n=1}^{∞} {X(n) < 5}.  (11-18)

As will be shown in a section that follows, for the nested sequence of decreasing events, we have

limit_{N→∞} P[A_N] = P[ limit_{N→∞} A_N ] = P[A_∞].  (11-19)
That is, we can interchange P and the limit operations. A similar statement will be made for a nested sequence of increasing events, B_N ↑ B_∞.

Nested sequences are just special cases. In Appendix 11B, we define what is meant by the limit of an event sequence where the events are not generally nested. Also, we argue that (11-19) is true for arbitrary convergent sequences of events.
On a general probability space (Ω, F, P), the probability measure P has a continuity property. This is satisfying from an intuitive sense; it allows us to use P as a metric, or gauge, to measure the size of an event. Also, the continuity of P is used when we approximate the probability of an event that is represented as the limit of an infinite sequence of nested events.

There is an analog here to the theory of continuous functions. Let f(x) be any function with domain that includes x_0. Then f(x) is continuous at x_0 if and only if

limit_{n→∞} f(x_n) = f( limit_{n→∞} x_n ) = f(x_0)  (11-20)

for all sequences {x_n} that converge to x_0. In words, Equation (11-20) states that one can interchange limit and function computation. In the sense described by Theorem 11-1 (and the more inclusive results given in Appendix 11B), this basic idea carries over to probability measures.
Theorem 11-1: Consider an increasing sequence of events as shown by Figure 11-1. That is, the events are such that B_n ⊂ B_{n+1} for all n ≥ 1. Define the infinite union of these events as

B_∞ ≜ limit_{N→∞} B_N = limit_{N→∞} ∪_{n=1}^{N} B_n = ∪_{n=1}^{∞} B_n,  (11-21)

a well-defined event (since a σ-algebra is closed under countable unions). Then, to any degree of accuracy that is required, P[B_∞] can be approximated by P[B_n] for sufficiently large n. That is, we have

P[B_∞] = P[ limit_{n→∞} B_n ] = limit_{n→∞} P[B_n].  (11-22)

In words, (11-22) says that we can move the limit operation from outside to inside the probability measure (interchange the limit and probability operations).

Proof: We define the sequence of events

A_1 = B_1
A_2 = B_2 ∩ B̄_1
 ⋮
A_n = B_n ∩ B̄_{n-1}
 ⋮  (11-23)

where the over-bar denotes set complement. The A_n are disjoint and

∪_{n=1}^{N} A_n = ∪_{n=1}^{N} B_n,  1 ≤ N ≤ ∞ (i.e., including N = ∞).  (11-24)

Figure 11-1: An increasing sequence of events, B_1 ⊂ B_2 ⊂ B_3 ⊂ B_4 ⊂ ··· ⊂ Ω.
In addition, since the B_k are increasing in size and they are nested, we have

B_N = ∪_{n=1}^{N} B_n.  (11-25)

As a result of this, we can write

P[B_N] = P[ ∪_{n=1}^{N} B_n ] = P[ ∪_{n=1}^{N} A_n ] = Σ_{n=1}^{N} P[A_n]  (11-26)

for all finite N. Now take the limit of (11-26) to obtain

limit_{N→∞} P[B_N] = limit_{N→∞} Σ_{n=1}^{N} P[A_n] = Σ_{n=1}^{∞} P[A_n].  (11-27)
Now, the next step in the proof answers the question: does the sum on the right-hand side of (11-27) converge? If yes, what does it converge to? Since the A_n are disjoint, we can use the Countable Additivity Property of P (see Chapter 1) to write

limit_{N→∞} Σ_{n=1}^{N} P[A_n] = Σ_{n=1}^{∞} P[A_n] = P[ ∪_{n=1}^{∞} A_n ] ≤ 1.  (11-28)

In (11-27), the middle N-th partial sum is an increasing sequence of real numbers that is bounded above by unity, as can be seen by (11-28). Hence, the limits in (11-27) and (11-28) converge. To find out what they converge to, simply use

∪_{n=1}^{∞} A_n = ∪_{n=1}^{∞} B_n ≜ B_∞  (11-29)
in (11-28) to obtain the desired result

P[B_∞] = P[ limit_{n→∞} B_n ] = limit_{n→∞} P[B_n].  (11-30)
Corollary: A version of Theorem 11-1 holds for a decreasing nested sequence of events. That is, suppose B_n ⊃ B_{n+1} for n ≥ 1. Then we can write

P[B_∞] = P[ limit_{n→∞} B_n ] = limit_{n→∞} P[B_n],  (11-31)

where

B_N = ∩_{n=1}^{N} B_n,   B_∞ ≜ ∩_{n=1}^{∞} B_n.  (11-32)

Proof: Similar to the proof given for Theorem 11-1.
Appendix 11B extends Theorem 11-1 to more general, non-nested sequences of events. In the appendix, we define the limit, if it exists, of an event sequence, not necessarily nested. If event A_∞ is the limit of an infinite event sequence A_n, we show that P[A_∞] is the limit of P[A_n] as index n approaches infinity. So, the probability measure P is continuous!! The analogy, drawn in the paragraph preceding Theorem 11-1, to continuous functions is valid!
Theorem 11-1 and its corollary are used to approximate the probability of an event that is represented as the limit of an infinite sequence of events. For example, for each n ≥ 0, let B_n = {X[k] < 2 for 0 ≤ k ≤ n}. This is a decreasing and nested sequence of events: B_{n+1} ⊂ B_n, n ≥ 0. Suppose we wanted to calculate P[B_∞], where B_∞ = {X[k] < 2 for 0 ≤ k}. We know that

B_∞ = ∩_{n=0}^{∞} B_n.  (11-33)

We use the corollary to approximate (as closely as desired) P[B_∞] as the probability of a finite intersection. That is, based on our accuracy requirements, we select N and approximate

P[B_∞] = P[ ∩_{n=0}^{∞} B_n ] ≈ P[ ∩_{n=0}^{N} B_n ] = P[ X(0) < 2, X(1) < 2, ..., X(N) < 2 ].  (11-34)
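The approximation (11-34) can be explored numerically. The Monte Carlo sketch below assumes, purely for illustration, that the X[k] are independent standard Gaussian random variables; the estimated P[B_N] then decreases monotonically with N, displaying the convergence toward P[B_∞] guaranteed by the corollary (in this particular independent case the limit happens to be zero).

# Python sketch: P[B_N] for the nested decreasing events B_N = {X[k] < 2, 0 <= k <= N}
# (illustrative; assumes the X[k] are independent standard Gaussian variables)
import random

def estimate_P_BN(N, trials=200_000, seed=1):
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if all(rng.gauss(0.0, 1.0) < 2.0 for _ in range(N + 1)):
            hits += 1
    return hits / trials

for N in (0, 1, 2, 5, 10, 20):
    print(N, estimate_P_BN(N))   # monotonically decreasing estimates of P[B_N]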
Back in Chapter 2 of these class notes, we were told that probability distribution functions are right continuous. We were told that

F(x) = limit_{n→∞} F(x + 1/n)  (11-35)

for any distribution function F(x) and all x. However, Equation (11-35) follows directly from Theorem 11-1 since

limit_{n→∞} F(x + 1/n) = limit_{n→∞} P[X ≤ x + 1/n] = P[ limit_{n→∞} {X ≤ x + 1/n} ] = P[{X ≤ x}] = F(x).  (11-36)
In this chapter, we assume that all random variables are real-valued. This assumption
greatly simplifies the notation, definitions and theory. From a conceptual standpoint, little is lost
by assuming that everything is real valued (however, complex-valued random sequences are
important - and often used - in many applications where band-pass signals are represented by
their complex-valued, low-pass equivalents).
A random sequence is statistically specified by its distribution functions; all orders are required in general. That is, for each positive integer n, and for all positive integer sequences k_1 ≤ k_2 ≤ ··· ≤ k_n, we need knowledge of the n-th-order distribution function

F(x_{k1}, x_{k2}, ..., x_{kn}; k_1, k_2, ..., k_n) ≜ P[ X[k_1] ≤ x_{k1}, X[k_2] ≤ x_{k2}, ..., X[k_n] ≤ x_{kn} ].  (11-37)

Note that a complete statistical specification requires an infinite set of distribution functions. In (11-37), the algebraic variables x_{k1}, x_{k2}, ..., x_{kn} are called realization variables. The subscripts on these variables serve only to distinguish one variable from another; F(α, β, γ; k_1, k_2, k_3) is just as meaningful as F(x_{k1}, x_{k2}, x_{k3}; k_1, k_2, k_3).

The probability density functions are obtained by differentiating distribution functions. That is, the n-th-order probability density function is defined as

f(x_{k1}, x_{k2}, ..., x_{kn}; k_1, k_2, ..., k_n) ≜ ∂^n F(x_{k1}, x_{k2}, ..., x_{kn}; k_1, k_2, ..., k_n) / (∂x_{k1} ∂x_{k2} ··· ∂x_{kn}).  (11-38)
The moments of a random sequence are important in applications. The mean (sometimes called the first-order average) is defined as

η(n) ≜ E[X(n)] = ∫_{-∞}^{∞} x f(x; n) dx  (11-39)

for a sequence of continuous random variables.
Second-order statistical averages appear often in practice. For example, the autocorrelation function is defined as

R_X(k, n) ≜ E[X(k)X(n)] = ∫∫ x_k x_n f(x_k, x_n; k, n) dx_k dx_n.  (11-40)

In a similar manner, the autocovariance function is defined as

C_X(k, n) ≜ E[ {X(k) - η(k)}{X(n) - η(n)} ] = ∫∫ {x_k - η(k)}{x_n - η(n)} f(x_k, x_n; k, n) dx_k dx_n.  (11-41)

Note that both R_X and C_X are symmetric:

R_X(k, n) = E[X(k)X(n)] = E[X(n)X(k)] = R_X(n, k)  (11-42)

C_X(k, n) = E[ {X(k) - η(k)}{X(n) - η(n)} ] = E[ {X(n) - η(n)}{X(k) - η(k)} ] = C_X(n, k).  (11-43)

Also, we can write

C_X(k, n) = R_X(k, n) - η(k)η(n).  (11-44)
The sequence X(k) is said to have uncorrelated elements (or to be uncorrelated) if

R_X(k, n) = E[X(k)X(n)] = E[X(n)] E[X(k)] = η(n)η(k), n ≠ k.

For such a sequence, (11-44) leads to the conclusion that

C_X(k, n) = R_X(k, n) - η(k)η(n) = σ^2(k), k = n
          = 0, k ≠ n,  (11-45)

where σ^2(k) denotes the sequence variance.
Example 11-5: Many applications involve the arrival of objects. For example, we may be interested in the arrival of cars at an intersection, the arrival of electrons at the plate of a vacuum tube, etc. A commonly-used simplifying assumption is that the objects arrive independently of one another. Let τ(k) denote the interval of time (in seconds) between the arrival of the (k-1)-th and k-th objects (relative to a given initial time t = 0, τ(1) is the arrival time for the first object). The time line is depicted by Fig. 11-2 below. For n ≥ 1, we assume that τ(n) is a sequence of identical, independent random variables, each with the exponential density

f_τ(t; n) = λ exp[-λt] U(t).  (11-46)

The mean of τ(n) is

η_τ(n) = E[τ(n)] = ∫_0^∞ x λ e^{-λx} dx = 1/λ,  (11-47)

and its variance is

σ_τ^2(n) = E[τ(n)^2] - (1/λ)^2 = ∫_0^∞ x^2 λ e^{-λx} dx - (1/λ)^2 = 2/λ^2 - 1/λ^2 = 1/λ^2.  (11-48)
Relative to a given initial time t = 0, the running sum of these intervals is the arrival times of the objects. That is, the arrival time of the n-th object is

T(n) ≜ Σ_{k=1}^{n} τ(k),  (11-49)

itself a random variable (the T(n) are Poisson random points). Since the time intervals are independent, the density function f_T(t; n) for T(n) is an (n-1)-fold convolution of (11-46) with itself. We claim that this result is

f_T(t; n) = [ λ(λt)^{n-1} / (n-1)! ] exp(-λt) U(t).  (11-50)

Fig. 11-2: Random arrival times. τ(k) is the time between arrivals, and T(k) is the actual arrival time (relative to the origin at t = 0).

This result can be established by induction (by using a different approach, this same result was derived in Appendix 9B). Clearly, the result is correct for n = 1; assume it is true for n-1. Now, we convolve again to obtain
f_T(t; n) = f_τ(t; n) * f_T(t; n-1)

 = [ ∫_0^t λ exp(-λξ) · λ(λ{t-ξ})^{n-2}/(n-2)! · exp(-λ{t-ξ}) dξ ] U(t)

 = [ λ^n exp(-λt)/(n-2)! ] [ ∫_0^t {t-ξ}^{n-2} dξ ] U(t)

 = [ λ^n t^{n-1}/(n-1)! ] exp(-λt) U(t),  (11-51)

as claimed. Equation (11-51) is the Erlang density, and T(n) is an Erlang distributed random variable (this same result was obtained in Appendix 9B). The expected value of random variable T(n) is

η_T(n) = n η_τ(n) = n/λ.  (11-52)

Since the interval random variables are independent, the variance of T(n) is

σ_T^2(n) = n σ_τ^2(n) = n/λ^2.  (11-53)
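The moments (11-52) and (11-53) are easy to confirm by simulation: generate independent exponential interarrival times and sum them. The Python sketch below is a numerical check only; the rate λ and sample sizes are arbitrary choices.

# Python sketch: arrival times T(n) as sums of exponential interarrival intervals (illustrative)
import random

lam, n, trials = 2.0, 10, 100_000      # arrival rate, arrival index, Monte Carlo runs
rng = random.Random(0)

samples = [sum(rng.expovariate(lam) for _ in range(n)) for _ in range(trials)]

mean = sum(samples) / trials
var  = sum((t - mean) ** 2 for t in samples) / trials
print(mean, n / lam)        # sample mean     vs  n/lambda      from (11-52)
print(var,  n / lam ** 2)   # sample variance vs  n/lambda^2    from (11-53)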
A random sequence X(n) is called a Gaussian random sequence if all its n-th-order probability density functions are Gaussian. Such sequences are very popular. Because of the Central Limit Theorem, Gaussian sequences occur in many applications. Also, they are completely described by only first- and second-order statistical averages (i.e., means and covariances). Finally, use of Gaussian statistics simplifies many technical developments and makes mathematically tractable many problems in the areas of filtering, estimation, detection and control.

Let X(n) be a zero-mean Gaussian sequence; that is, E[X(n)] = 0 for all n. Also, let X be delta correlated; that is, R_X(k, n) = E[X(k)X(n)] = σ^2 δ(k - n), where σ^2 is the variance and

δ(k) = 1, k = 0
     = 0, k ≠ 0.  (11-54)

Often, delta-correlated sequences are said to be white; in many applications, delta-correlated Gaussian sequences are called white Gaussian noise. For k ≠ n, X(k) and X(n) are uncorrelated and, since they are Gaussian, independent. As a result, an n-th-order density function factors into a product of n first-order density functions.

Most computer-based math packages (such as MATLAB, Mathcad, etc.) generate periodic sequences that, for many purposes, can be used to approximate white Gaussian noise. For these sequences, the correlation between elements can be very low, and the sequence period is very long relative to the number of sequence values that are needed.
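The following sketch illustrates the last remark: a pseudo-random Gaussian generator produces a sequence whose sample autocorrelation is close to the ideal σ^2 δ(k) of (11-54). (NumPy is assumed to be available; this is an illustration, not part of the notes.)

# Python sketch: sample autocorrelation of computer-generated "white" Gaussian noise
import numpy as np

rng = np.random.default_rng(0)
sigma, N = 1.0, 50_000
x = rng.normal(0.0, sigma, N)            # approximately white Gaussian noise

lags = range(0, 6)
Rhat = [np.dot(x[:N - k], x[k:]) / (N - k) for k in lags]   # estimate of R_X(k)
print([round(r, 3) for r in Rhat])       # approx [1, 0, 0, 0, 0, 0] = sigma^2 * delta(k)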
Random sequence X(n) is said to have independent increments if for all N > 1 and n_1 < n_2 < ... < n_N the process increments X(n_1), X(n_2) - X(n_1), X(n_3) - X(n_2), ... , X(n_N) - X(n_{N-1}) are jointly independent. Such processes have the nice feature that n-th-order density and distribution functions can be built up as products of the densities of the individual increments. For example, the second-order distribution, for the case n_2 > n_1, can be written as

F(x_1, x_2; n_1, n_2) = P[ X(n_1) ≤ x_1, X(n_2) ≤ x_2 ]
 = P[ X(n_1) ≤ x_1, X(n_2) - X(n_1) ≤ x_2 - x_1 ]
 = P[ X(n_1) ≤ x_1 ] P[ X(n_2) - X(n_1) ≤ x_2 - x_1 ].  (11-55)

We have seen independent increment processes in previous chapters. For example, the Random Walk process, introduced in Chapter 6, has independent increments.
Often, random sequences are generated by a mechanism that is not changing with time. In these cases, the sequence moments are constant. More precisely, a random sequence is said to be stationary if, for all positive integers k, the k-th-order density function of the sequence is invariant to any shift of the index. That is, stationarity requires

f(x_{n1}, x_{n2}, ..., x_{nk}; n_1, n_2, ..., n_k) = f(x_{n1}, x_{n2}, ..., x_{nk}; n_1 + n_0, n_2 + n_0, ..., n_k + n_0)  (11-56)

for all orders k and index shift values n_0.
Example 11-5 introduces a random sequence τ(n) of interval times. Since the interval times are independent, k-th-order densities can be built up as products of first-order densities, each of a form given by (11-46). Clearly, the random sequence τ(n) of interval times is described by a k-th-order density that satisfies (11-56); the sequence τ(n) is stationary. On the other hand, the total waiting time to the n-th arrival, the sum T(n) given by (11-49), is not stationary, as is obvious by inspection of (11-52) and (11-53).
A weaker form of stationarity is adequate in some applications. Sometimes, all that is required is stationarity in all second-order statistics. We say that a random sequence is wide sense stationary (WSS) if its mean function is constant and its covariance depends only on the time difference. That is, the sequence is WSS if

η(k) = η, a constant for all k,  (11-57)

R_X(k, m) = R_X(k - m) = R_X(n),  (11-58)
where n = k - m is the time difference between the two sequence values. Clearly, all stationary sequences are WSS.

Given input X(n) and impulse response
h(n,k), we can express the output as
Y(n) = L[X(n)] = Σ_{ℓ=-∞}^{∞} h(n, ℓ) X(ℓ).  (11-62)

A linear system is said to be shift invariant (or time invariant) if a simple delay in the input sequence produces a corresponding delay in the output sequence. More formally, we say that linear system L[·] is shift invariant if

Y(n) = L[X(n)]  implies  Y(n - n_0) = L[X(n - n_0)]  (11-63)

for all input/output pairs (X, Y) and all index shifts n_0. Shift invariant systems depend only on the difference of n and ℓ, not their absolute values. In this case, we can write h(n, ℓ) = h(n - ℓ). Also, for shift invariant systems, Equation (11-62) becomes

Y(n) = L[X(n)] = Σ_{ℓ=-∞}^{∞} h(n - ℓ) X(ℓ) = h*X,  (11-64)

the convolution of input X with impulse response h.
A system is said to be bounded input - bounded output (BIBO) stable if bounded input sequences produce bounded output sequences. A linear, shift-invariant system is BIBO stable if, and only if, its impulse response is absolutely summable; that is, BIBO stability is equivalent to

Σ_{n=-∞}^{∞} |h(n)| < ∞.  (11-65)
A linear, shift-invariant system is said to be causal if it does not respond before it is excited. More explicitly, for a causal system, if two inputs X_1(n) and X_2(n) are equal up to some index n_0, then the corresponding outputs Y_1(n) = L[X_1(n)] and Y_2(n) = L[X_2(n)] must be equal up to index n_0; what happens to the inputs after index n_0 in no way influences the outputs before index n_0. One can show that a linear, shift-invariant system is causal if, and only if, h(n) = 0 for n < 0. For a linear, shift-invariant and causal system, the input-output relationship becomes

Y(n) = L[X(n)] = Σ_{ℓ=-∞}^{n} h(n - ℓ) X(ℓ).  (11-66)

One should consider the differences between (11-62), the most general I/O formula, (11-64) for the shift-invariant case and (11-66) which describes the most restrictive case.
Linear, shift-invariant systems can be analyzed in the frequency domain. For this purpose, we describe the Fourier Transform of signal X(k) as

X_F(e^{jω}) ≜ Σ_{k=-∞}^{∞} X(k) e^{-jωk}  (11-67)

(we will use a subscript of F to denote a Fourier transform). If (11-67) converges, X_F is a continuous, 2π-periodic function of frequency variable ω. The inverse Fourier transform is

X(n) = (1/2π) ∫_{-π}^{π} X_F(e^{jω}) e^{jωn} dω.  (11-68)

The Fourier transform of a linear, shift-invariant system's output can be found easily. Simply use the convolution theorem with (11-64) to obtain

Y_F(e^{jω}) = F[ h(n) * X(n) ] = H_F(e^{jω}) X_F(e^{jω}).  (11-69)
Given a system with a random input, we determine below the mean and autocorrelation of the output. A more general, difficult problem is to find the n-th-order density function that describes the system output. A linear system with a Gaussian input will have a Gaussian output. Unfortunately, a general statement of this scope cannot be made for nonlinear systems or systems driven by non-Gaussian inputs.
Consider the linear system with input X(n) and output Y(n) = L[X(n)] (we do not require shift-invariance or causality). Suppose that both η_X(n) = E[X(n)] and η_Y(n) = E[Y(n)] exist. For this case, we can write

η_Y(n) = E[Y(n)] = E[ L[X(n)] ] = L[ E[X(n)] ] = L[ η_X(n) ].  (11-70)

That is, it is possible to interchange the operations of L[·] and E[·]. We write

E[Y(n)] = E[ Σ_k h(n, k) X(k) ].  (11-71)

Then, we formally interchange the summation and expectation to obtain

η_Y(n) = E[Y(n)] = E[ Σ_{k=-∞}^{∞} h(n, k) X(k) ] = Σ_{k=-∞}^{∞} h(n, k) E[X(k)] = Σ_{k=-∞}^{∞} h(n, k) η_X(k),  (11-72)

and (11-70) is established.

Note that our derivation of (11-70) is not rigorous. A potential problem with (11-72) is the formal interchange of expectation and summation. In cases where the mean of Y(n) does not exist, this interchange is not valid (can you construct a simple example where the mean of Y(n) does not exist, i.e., where the interchange in (11-72) is not valid?). We will consider this interchange
problem again once we have studied some convergence concepts.
Let's consider a special case of (11-72); suppose that input X(n) is wide-sense stationary
and the system is shift invariant. Then, h(n,k) = h(n - k) and η_X(n) = η_X is a constant, so that

η_Y(n) = Σ_k h(n - k)η_X = η_X [ Σ_{k=-∞}^{∞} h(k) ] = η_X H_F(e^{j0}),    (11-73)

so η_Y(n) = η_Y is a constant as well. The bracketed quantity on the right-hand side of (11-73) is
the DC gain of the system (which we assume to be finite in the development of (11-73)).
Consider a low-pass filter with impulse response h(n) = α^n U(n), where 0 < α < 1
to ensure stability. The Fourier transform of h is

H_F(e^{jω}) = 1/(1 - α e^{-jω}).    (11-74)

According to (11-73), the mean of the filter output is η_Y = η_X H_F(e^{j0}) = η_X/(1 - α).
Next, we determine the cross-correlation between a system input X(n) and its output
Y(n), both input and output assumed to be real valued. This quantity is defined as

R_XY(m,n) ≜ E[X(m)Y(n)].    (11-75)

Then, we use this result to find the autocorrelation R_Y of the system output.

Theorem 11-3: Let X(n) and Y(n) denote the input and output, respectively, of a linear operator
L[·]; that is, Y(n) = L[X(n)]. The cross-correlation between the input X and the output Y can be
calculated by the formula

R_XY(m,n) = L_2[ R_X(m,n) ],    (11-76)

where L_2 operates on the second variable (i.e., the n variable), treating the first variable (i.e.,
the m variable) as a constant parameter. In a similar manner, the autocorrelation of the output
can be calculated by the formula

R_Y(m,n) = L_1[ R_XY(m,n) ],    (11-77)

where L_1 operates on the first variable only.

Proof (see Theorems 7-1 and 7-2 for the continuous-time version of this result): First, we write

X(m)Y(n) = X(m)L[X(n)] = L_2[ X(m)X(n) ].    (11-78)

Now, take the expected value of this result to obtain

E[X(m)Y(n)] = E[ L_2[X(m)X(n)] ] = L_2[ E[X(m)X(n)] ] = L_2[ R_X(m,n) ],    (11-79)

and this establishes (11-76). The formula for the autocorrelation of the output can be developed
by taking the expectation of the product Y(m)Y(n) to obtain

R_Y(m,n) = E[Y(m)Y(n)] = E[ L[X(m)]Y(n) ] = E[ L_1[X(m)Y(n)] ] = L_1[ E[X(m)Y(n)] ] = L_1[ R_XY(m,n) ],    (11-80)

and this establishes (11-77), so the theorem is established.
Let us consider Theorem 11-3 specialized to the case of a WSS input sequence X(k) and
a shift-invariant, linear system described by unit sample response h(k). For this case, formula
(11-76) yields

R_XY(m,n) = Σ_{ℓ=-∞}^{∞} R_X(m, n-ℓ)h(ℓ) = Σ_{ℓ=-∞}^{∞} R_X([m-n]+ℓ)h(ℓ) = Σ_{ℓ=-∞}^{∞} R_X([m-n]-ℓ)h(-ℓ).    (11-81)

Observe that the right-hand side of (11-81) depends on m, n only through the difference k = m-n.
Hence, X and Y are jointly wide sense stationary, and we can write

R_XY(k) = R_X(k) * h(-k).    (11-82)

For the WSS case, the output correlation formula (11-80) becomes

R_Y(m,n) = Σ_{ℓ=-∞}^{∞} R_XY({m-ℓ}-n)h(ℓ) = Σ_{ℓ=-∞}^{∞} R_XY({m-n}-ℓ)h(ℓ),    (11-83)

a formula depending only on k = m-n. Hence, we write

R_Y(k) = R_XY(k) * h(k).    (11-84)

Finally, combining (11-82) and (11-84) yields

R_Y(k) = R_X(k) * h(-k) * h(k) = R_X(k) * {h(-k) * h(k)},    (11-85)

and we see that a WSS input produces a WSS output.
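Formula (11-85) can be spot-checked numerically (an assumed sketch, not from the notes; the parameter values a, α and the AR(1) input model are assumptions chosen so that R_X(k) = a^|k|): form h(-k)*h(k) by discrete convolution and compare R_X * {h(-k)*h(k)} with correlation estimates from simulated data.

    # Sketch: check R_Y(k) = R_X(k) * h(-k) * h(k) from (11-85).
    import numpy as np

    rng = np.random.default_rng(1)
    a, alpha, N = 0.7, 0.5, 200_000

    # AR(1) input with unit variance and R_X(k) = a^|k|
    w = rng.standard_normal(N) * np.sqrt(1 - a**2)
    x = np.zeros(N)
    for n in range(1, N):
        x[n] = a * x[n - 1] + w[n]

    h = alpha ** np.arange(40)                 # h(n) = alpha^n U(n), truncated
    y = np.convolve(h, x)[:N]

    # Theoretical R_Y at lags 0..3 via discrete convolution
    L = 60
    lags = np.arange(-L, L + 1)
    RX = a ** np.abs(lags)
    g = np.convolve(h[::-1], h)                # h(-k) * h(k)
    RY = np.convolve(RX, g)
    center = L + len(h) - 1                    # index of lag 0 in RY
    theory = RY[center:center + 4]

    # Empirical R_Y at lags 0..3
    est = [np.mean(y[: N - m] * y[m:]) for m in range(4)]
    print(np.round(theory, 3), np.round(est, 3))   # the two should agree closely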
Suppose output Y is related to input X by the simple relationship

Y(n) = L[X(n)] = X(n) - X(n - 1),    (11-86)

the first-order, backwards difference. For example, sequence Y(n) might be subjected to a
threshold to implement a pulse detector function. The mean of the output is

E[Y(n)] = E[X(n)] - E[X(n - 1)] = η_X(n) - η_X(n - 1).    (11-87)

The cross-correlation between input and output is

R_XY(m,n) = L_2[ R_X(m,n) ] = R_X(m,n) - R_X(m,n-1).    (11-88)

Finally, the autocorrelation of the output is

R_Y(m,n) = L_1[ R_XY(m,n) ] = R_XY(m,n) - R_XY(m-1,n)
         = R_X(m,n) - R_X(m-1,n) - R_X(m,n-1) + R_X(m-1,n-1).    (11-89)

Suppose the input is WSS with autocorrelation

R_X(m,n) = a^|m-n|, 0 < a < 1.    (11-90)

Then Equations (11-87) and (11-88) yield

η_Y = 0
R_XY(m,n) = a^|m-n| - a^|m-n+1|,    (11-91)

and (11-89) yields

R_Y(m,n) = 2a^|m-n| - a^|m-1-n| - a^|m-n+1|.    (11-92)

The output sequence Y(k) is WSS; if k = m - n, then (11-90) and (11-92) become

R_X(k) = a^|k|    (11-93)

R_Y(k) = 2a^|k| - a^|k-1| - a^|k+1|,    (11-94)

respectively. A comparison of Fig. 11-3 and Fig. 11-4 (both correlations were computed and
plotted for a = 0.6) reveals that the pulse detector (11-86) decorrelates the input data X(k), at
least to some extent.
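The decorrelation claim can be checked directly from (11-93) and (11-94); the short sketch below (an assumed illustration, not part of the notes) tabulates both correlation sequences for a = 0.6, the value used in Figures 11-3 and 11-4.

    import numpy as np

    a = 0.6
    k = np.arange(-10, 11)
    RX = a ** np.abs(k)
    RY = 2 * a ** np.abs(k) - a ** np.abs(k - 1) - a ** np.abs(k + 1)
    for kk, rx, ry in zip(k, RX, RY):
        print(f"k={kk:3d}  R_X={rx:6.3f}  R_Y={ry:6.3f}")
    # Relative to its lag-0 value, R_Y falls off much faster than R_X (and dips slightly
    # negative at |k| = 1), illustrating the partial decorrelation produced by (11-86).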
[Figure 11-3: R_x(k) of Eqn. (11-93) with a = 0.6.  Figure 11-4: R_y(k) of Eqn. (11-94) with a = 0.6.]

All real-valued random variables with finite second moments comprise a vector space
over the field of real numbers. We define vector space L² as

L² ≜ {X : E[X²] < ∞},    (11-95)

all real-valued, finite-second-moment random variables. We take the real number field, denoted
here by R, as our scalar field. To show that L² is a valid vector space, we must show, among
other things, that L² is closed under vector addition (i.e., X ∈ L² and Y ∈ L² implies that X + Y
∈ L²) and scalar multiplication (i.e., X ∈ L² and c ∈ R implies that cX ∈ L²).

The fact that L² is closed under scalar multiplication follows easily. Clearly, if X ∈ L²
and c ∈ R, then E[(cX)²] = c²E[X²] < ∞, so cX ∈ L².
The fact that L² is closed under vector addition follows from use of the Schwarz
inequality (sometimes called the Cauchy-Schwarz inequality).

Let X ∈ L² and Y ∈ L². Then

E²[XY] ≤ E[X²] E[Y²].    (11-96)

To see this, let α be any real-valued number and consider

E[(X + αY)²] = α²E[Y²] + 2αE[XY] + E[X²] ≥ 0.    (11-97)

Now, Equation (11-97) is a function of α, and the inequality must hold for all α. The Schwarz
inequality follows immediately by substituting

α = -E[XY]/E[Y²]    (11-98)

into (11-97). In fact, (11-98) is the value of α that minimizes E[(X + αY)²] (can you show
this?). In (11-96), equality results when Y is a scalar multiple of X.
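The parenthetical question after (11-98) can be explored numerically (an assumed sketch; the joint distribution of X and Y below is an arbitrary choice): E[(X + αY)²] is a parabola in α whose minimum sits at α = -E[XY]/E[Y²], and the Schwarz inequality (11-96) holds for the estimated moments.

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.standard_normal(100_000)
    Y = 0.5 * X + rng.standard_normal(100_000)          # correlated with X

    EXY, EY2, EX2 = np.mean(X * Y), np.mean(Y**2), np.mean(X**2)
    alphas = np.linspace(-2, 2, 401)
    J = np.array([np.mean((X + al * Y) ** 2) for al in alphas])
    print(alphas[np.argmin(J)], -EXY / EY2)             # both near the minimizing alpha (about -0.4)
    print(EXY**2 <= EX2 * EY2)                          # Schwarz inequality (11-96): True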
Now, we show that L² is closed under vector addition. Let X ∈ L² and Y ∈ L², and
consider the sum X + Y. The second moment of the sum satisfies

E[(X + Y)²] = E[X²] + 2E[XY] + E[Y²] ≤ E[X²] + 2√(E[X²]E[Y²]) + E[Y²].    (11-99)

However, all quantities on the right-hand side of (11-99) are finite since X ∈ L² and Y ∈ L².
Hence, the sum X + Y ∈ L², and L² is closed under vector addition. Closure under vector addition
and scalar multiplication is necessary for L² to be a valid vector space. The remaining
requirements (found in any elementary text on linear algebra) that L² must satisfy are shown
easily. Hence, we can consider the set of all real-valued random variables with finite second
moments to be a valid vector space.
In words, the condition P[X = 0] = P[{ω : X(ω) = 0}] = 1 is stated by using one of
1) X = 0 almost everywhere (a.e.)
2) X = 0 almost surely (a.s.)
3) X = 0 with probability one,
all equivalent phrases (used by different authors). It should be noted that X = 0 (a.s.) is not
equivalent to X ≡ 0 (i.e., X = 0 for all ω, or everywhere). If X = 0 (a.s.), the event B = {ω :
X(ω) ≠ 0} has probability zero; however, it can be nonempty.

E[X²] = 0 is equivalent to P[X = 0] = 1, or X = 0 (a.s.). To prove this requires some
knowledge of measure theory. However, the result is intuitive. Consider

0 = E[X²] = ∫_{-∞}^{∞} x² f_X(x) dx,    (11-100)

where f_X(x) is the density used to describe X. If Equation (11-100) is true, then all of the
probability (all of the area under density f_X) is concentrated at x = 0, or P[X = 0] = 1. Hence,
we have

E[X²] = 0 if, and only if, P[X = 0] = 1, so that X = 0 (a.s.).    (11-101)
It is worth repeating that P[X = 0] = 1 is not equivalent to X(ω) = 0 for all ω ∈ Ω.
A subset M ⊂ L² is said to be a subspace of L² if it is a valid vector space (i.e., closed under scalar
multiplication and vector addition, in addition to the other requirements given in any linear
algebra text) and M ⊂ L². Subspaces play a crucial role in many applications that involve
optimization problems.
It is natural to define an inner product on L² as the expected value of a product. That is,
for any X ∈ L² and Y ∈ L², we denote the inner product (dot product or scalar product) as
⟨X,Y⟩, and we define

⟨X,Y⟩ ≜ E[XY].    (11-102)

The Cauchy-Schwarz inequality (11-96) implies that

|⟨X,Y⟩| ≤ √(⟨X,X⟩⟨Y,Y⟩) < ∞.    (11-103)

That is, the inner product exists as a real number. It can be shown that ⟨X,Y⟩ = E[XY] satisfies
the properties

1. ⟨X,X⟩ ≥ 0, and ⟨X,X⟩ = 0 if, and only if, X = 0 almost surely (i.e., P[X = 0] = 1),
2. ⟨X,Y⟩ = ⟨Y,X⟩, and
3. ⟨cX,Y⟩ = c⟨X,Y⟩, where c ∈ R.    (11-104)
If E[X] = 0, then the second moment ⟨X,X⟩ = E[X²] is the variance of random variable X. Random
variables X and Y are said to be orthogonal if ⟨X,Y⟩ = E[XY] = 0.

In 1) of (11-104), P[X = 0] = 1 is not equivalent to X ≡ 0 (i.e., X identically zero); so,
⟨X,X⟩ = 0 is not equivalent to X ≡ 0. However, the equivalence of ⟨X,X⟩ = 0 and X ≡ 0 is a
general requirement of an inner product (see any text on linear analysis). As a result, our claim
that (11-102) is an inner product is technically incorrect, at least in the strict sense defined in the
mathematical literature. However, in the applications literature, this subtle flaw is overlooked,
and (11-102) is declared a valid inner product. After discussing a norm for L², we use the notion
of equivalence classes to fix (11-102) and make it a valid inner product.
On a vector space, a vector norm maps vectors into real numbers in a manner that adapts
the concept of length to vectors. Almost universally, the norm of vector X is denoted as ||X||.
On vector space L², we define the norm of X as

||X|| ≜ √⟨X,X⟩ = √(E[X²])    (11-105)

(we say that the inner product induces the norm). From (11-104) it follows directly that (11-105)
satisfies

1. ||X|| ≥ 0, and ||X|| = 0 if, and only if, X = 0 almost surely (i.e., P[X = 0] = 1),
2. ||cX|| = |c| ||X||, for any c ∈ R, and
3. ||X + Y|| ≤ ||X|| + ||Y|| (the triangle inequality).    (11-106)
If E[X] = 0, then ||X|| is the standard deviation of X.

In 1) of (11-106), P[X = 0] = 1 is not equivalent to X ≡ 0 (i.e., X identically zero); so,
||X|| = 0 is not equivalent to X ≡ 0. However, the equivalence of ||X|| = 0 and X ≡ 0 is a general
requirement of a vector norm (see any text on linear analysis). As a result, our claim that
(11-105) is a vector norm is technically incorrect, at least in the strict sense defined in the
mathematical literature. However, in the applications literature, this subtle flaw is overlooked,
and (11-105) is declared to be a valid vector norm. As discussed below, the idea of equivalence
classes can be used to fix (11-105) and make it a valid vector norm.
Often, norm (11-105) is called the mean-square norm since it involves the mean of the
square of a random variable. In terms of (11-105), we can restate the Schwarz inequality as

|⟨X,Y⟩| ≤ ||X|| ||Y||.    (11-107)

Equation (11-107) is how the Schwarz inequality is usually stated in the analysis literature, where
the notions of inner product and norm play central roles.

The triangle inequality (number 3 of (11-106)) has a form similar to the well-known
triangle inequality for real numbers (which states that |r_1 + r_2| ≤ |r_1| + |r_2| for any real
numbers r_1 and r_2). This inequality follows from the observation

||X + Y||² = ⟨X + Y, X + Y⟩ = ⟨X,X⟩ + 2⟨X,Y⟩ + ⟨Y,Y⟩ ≤ ||X||² + 2||X|| ||Y|| + ||Y||²
          = (||X|| + ||Y||)².    (11-108)
The norm (11-105) gives us a way to define the equality of two vectors (random
variables). If X and Y are random variables for which

||X - Y|| = 0,    (11-109)

we say that X = Y in the mean-square sense, or we say X = Y (m.s.). From (11-101), we see
that (11-109) is equivalent to P[X = Y] = 1 and P[X ≠ Y] = 0.
As given by (11-102) and (11-105), the definitions of the inner product ⟨·,·⟩ and the norm ||·||
on L² share a common flaw. The problem is subtle, and it has little or no bearing on applications of the theory. Often,
it is not discussed in elementary theory and applications-oriented books. In this subsection, the
flaw in definitions (11-102) and (11-105) is discussed, and the standard method is given for
fixing the flaw (if you dislike abstractions, then skip this subsection).

We say that random variables X and Y are equivalent if P[X = Y] = 1 (i.e., ||X - Y|| = 0).
We denote that X and Y are equivalent by writing X ~ Y. Given any X ∈ L², there are many (an
uncountable number, in general) random variables (in L²) that are equivalent to X.

Specifically, for X ≡ 0, we have ||X|| = 0. However, equivalent to X ≡ 0 are nonzero Y
for which ||Y|| = 0, and this is the problem that we must fix. Stated succinctly, ||Y|| = 0 does
not mean Y ≡ 0 (i.e., Y = 0 everywhere, for all ω), and this lies at the root of the problem.
This is why ||·||, as defined by (11-105), is not a true norm (sometimes, it is called a seminorm).
Fortunately, the problem can be fixed by passing to equivalence classes.
We partition L² into distinct equivalence classes (the general topics of equivalence
relations and equivalence classes are discussed in most books on real analysis). Into a given
equivalence class is lumped all random variables that are equivalent: X ~ Y if, and only if, X and
Y are in the same equivalence class. In a given equivalence class, all random variables have the
same seminorm, as defined by (11-105). Now, we define the norm of an equivalence class as
seminorm (11-105) applied to any member of the class (read again the previous two sentences).
As so defined, distinct equivalence classes have numerically different norms.

We have fixed the problem if we think only of equivalence classes and their norms (and
stop thinking of individual random variables and their seminorms). We substitute the notion of
vector equivalence for vector equality (we use X ~ Y instead of X = Y). If (equivalence class) X
∈ L² has ||X|| = 0, then X ~ 0 (that is, X and 0 are the same equivalence class). We have a true
norm on L², a vector space of distinct equivalence classes.
Often, one has to deal with sequences of random variables that converge to a random
variable. We say that the random sequence X(n;ω) converges to random variable X_0(ω) if, for
every fixed ω_0 ∈ Ω, the sequence of numbers X(n;ω_0) converges to the number X_0(ω_0). This is
"ordinary", sometimes called pointwise, sequence convergence (a topic that is usually covered in
a Calculus course); it has nothing to do with the fact that we are dealing with random variables.
Also, it is very restrictive. In applications, we can get by with much "weaker" modes of
convergence; we discuss three alternative convergence modes. In what follows, we discuss
almost sure (a.s.) convergence, convergence in probability (i.p.) and mean square (m.s.)
convergence. Mean square convergence is convergence in the mean-square norm (11-105). We
discuss m.s. convergence first.
As n goes to infinity, a sequence of random variables X(n) ∈ L² converges in m.s. to a
random variable X_0 ∈ L² if

limit_{n→∞} ||X_0 - X(n)|| = 0  (the same as limit_{n→∞} E[{X_0 - X(n)}²] = 0).    (11-110)

The norm used in (11-110) is the mean-square norm given by (11-105). Often, this type of
convergence is denoted as

X(n) →(m.s.) X_0,    (11-111)

or

l.i.m._{n→∞} X(n) = X_0,    (11-112)

where l.i.m denotes limit in the mean.
Let Z be a random variable with E[Z²] < ∞ (i.e., Z ∈ L²). Let c_n, n ≥ 0, be a
sequence of deterministic real numbers converging to the real number c. Then, c_n Z, n ≥ 0, is a
sequence of random variables. We show that

l.i.m._{n→∞} c_n Z = cZ.    (11-113)

To see this, consider

E[ |c_n Z - cZ|² ] = |c_n - c|² E[Z²].

Now, c_n → c and E[Z²] < ∞ implies E[|c_n Z - cZ|²] → 0, and this proves (11-113).
Mean square convergence is linear. That is, if

X_0 = l.i.m._{n→∞} X(n)
Y_0 = l.i.m._{n→∞} Y(n),    (11-114)

then for any real-valued constants a and b we have

aX_0 + bY_0 = l.i.m._{n→∞} ( aX(n) + bY(n) ).    (11-115)

Note that

||{aX(n) + bY(n)} - {aX_0 + bY_0}|| = ||a{X(n) - X_0} + b{Y(n) - Y_0}||
    ≤ |a| ||X(n) - X_0|| + |b| ||Y(n) - Y_0||.    (11-116)

However, Equation (11-114) ensures that the right-hand side of (11-116) approaches zero as n
approaches infinity, and this proves (11-115).
Not every sequence of random variables has a mean square limit. We need "tools" and
techniques for determining if a sequence has a mean-square limit. Fortunately, our intuition is
helpful in this regard. As discussed next, a random sequence has a mean-square limit if, and
only if, sequence terms come arbitrarily "close" to one another as you go far enough out in the
sequence.

Let X(n), n ≥ 0, be a sequence of vectors in L². The sequence is said to be a mean square
Cauchy sequence if

limit_{n,m→∞} ||X(n) - X(m)|| = 0.    (11-117)

More tersely, we say that the sequence is m.s. Cauchy if (11-117) is true. For a m.s. Cauchy
sequence, the quantity ||X(n) - X(m)|| approaches zero as n and m approach infinity, in any
manner whatever. Basically, the farther you go out in a mean square Cauchy sequence, the
"closer" (in the mean-square sense) the elements become.

It is easy to show that mean square convergence implies the mean-square Cauchy
property (i.e., (11-111) implies (11-117)). Actually, this is true for arbitrary normed vector
spaces (i.e., all convergent sequences are Cauchy sequences, regardless of the normed vector
space under consideration). However, for the general normed vector space, Cauchy sequences
are not necessarily convergent. But, for L² space equipped with the mean-square norm, the
Theorem 11-6 (Special Case of the Riesz-Fischer Theorem):
Vector space L² is complete in the sense that a mean square Cauchy sequence has a
unique limit in L². That is, for a sequence X(n) in L², there exists a unique element X_0 ∈ L² such
that

limit_{n→∞} ||X_0 - X(n)|| = 0  (denoted symbolically as l.i.m._{n→∞} X(n) = X_0)    (11-118)

if

limit_{n,m→∞} ||X(n) - X(m)|| = 0  (denoted symbolically as l.i.m._{n,m→∞} [X(n) - X(m)] = 0).    (11-119)

Since the converse is true (see the paragraph before the theorem statement), (11-118) and (11-119)
are equivalent for vector space L². In (11-119), one must remember that the double limit is zero
regardless of how n and m approach infinity.
The value of Theorem 11-6 is this: we do not have to know/find the m.s. limit of a
sequence to know that the sequence is m.s. convergent. To show that L² sequence X(n)
converges to some m.s. limit X_0, we need not know/find X_0. Instead, to show convergence, it is
sufficient to show that X(n) has elements that come arbitrarily close to one another as you go
out in the sequence. In some cases, establishing (11-119) is much easier than finding the X_0
described by (11-118).

With the introduction of Theorem 11-6, we have established L² as a complete vector
space with norm (11-105) that is induced by inner product (11-102). In the literature, such
vector spaces are referred to as Hilbert Spaces. They are the natural setting for many significant
problems in Fourier series, communication theory, optimal filtering, etc.
Mean-square convergence has a number of useful properties. We discuss the ability to
interchange l.i.m and expectation. Also, we show that a mean square limit is unique (with
equality in the mean square sense). To develop these results, we must mention some (almost)
obvious facts. Note that

l.i.m._{n→∞} X(n)    (11-120)

is a random variable, but

limit_{n→∞} E[X(n)]    (11-121)

is an "ordinary" limit of an "ordinary" sequence. Also, for any random variable X in L², we have

|E[X]| ≤ E[|X|] = E[|X|·1] ≤ √(E[|X|²]) √(E[1²]) = ||X||.    (11-122)
The first inequality results from the fact that the absolute value of an integral is less than, or
equal to, the integral of the absolute value. The second inequality comes from the Cauchy-
Schwarz inequality (11-96) with Y = 1. Now, we show that we can interchange expectation and
l.i.m.
Theorem 11-7: Let X(n) be a sequence in L². Suppose the sequence has a m.s. limit X_0 ∈ L²; that is,

X(n) →(m.s.) X_0  (i.e., limit_{n→∞} ||X(n) - X_0|| = 0).    (11-123)

Then it follows that

E[X_0] = E[ l.i.m._{n→∞} X(n) ] = limit_{n→∞} E[X(n)].    (11-124)

That is, expectation and l.i.m are interchangeable.

Proof: Since L² is complete, mean square limit X_0 is in L² (X_0 has a finite second moment), so
E[X_0] exists (i.e., the mean is finite). Now, from (11-122), we have

|E[X(n)] - E[X_0]| = |E[X(n) - X_0]| ≤ E[|X(n) - X_0|] ≤ ||X(n) - X_0||.    (11-125)

However, from (11-123) we know that the norm on the right-hand side of (11-125) goes to zero
as n approaches infinity. Hence, we have the desired result (11-124).
An important use of Theorem 11-7 deals with interchanging expectations and
summations. For k = 1, 2, ..., let X_k ∈ L² be a sequence of random variables with finite second
moments. Define the nth partial sum

Y_n ≜ Σ_{k=1}^{n} X_k.    (11-126)

Suppose that

Y ≜ l.i.m._{n→∞} Y_n = l.i.m._{n→∞} Σ_{k=1}^{n} X_k.    (11-127)

We say that partial sum (11-126) converges in mean square to Y. By Theorem 11-7, we can
write

E[Y] = E[ l.i.m._{n→∞} Y_n ] = E[ l.i.m._{n→∞} Σ_{k=1}^{n} X_k ] = limit_{n→∞} Σ_{k=1}^{n} E[X_k] = Σ_{k=1}^{∞} E[X_k].    (11-128)
The mean square limit of a sequence is unique. That is, if

X_0 = l.i.m._{n→∞} X(n)  (i.e., limit_{n→∞} ||X_0 - X(n)|| = 0)
Y_0 = l.i.m._{m→∞} X(m)  (i.e., limit_{m→∞} ||Y_0 - X(m)|| = 0),    (11-129)

then ||X_0 - Y_0|| = 0 and P[X_0 = Y_0] = 1.

Proof: Observe that

||X_0 - Y_0|| = ||{X_0 - X(n)} + {X(n) - Y_0}|| ≤ ||X_0 - X(n)|| + ||X(n) - Y_0||    (11-130)

from the triangle inequality. Now, on the right-hand side of (11-130), both norms go to zero as a
consequence of (11-129). Hence, we have ||X_0 - Y_0|| = 0 as claimed. The fact that P[X_0 = Y_0] =
1 follows immediately from (11-101) and the paragraph that follows that equation.
Example 11-10: We are trying to sample a DC voltage (for example, the output of a strain
gauge, water tank level detector, etc.). However, our samples contain additive noise; the kth
sample is Y(k) = m_dc + ν(k), where m_dc is the DC voltage we are trying to measure, and ν(k) is a
real-valued sample of stationary, zero mean noise with variance σ². We assume that ν(k) is
uncorrelated from sample to sample (any two different-indexed samples are uncorrelated). We
try the time-honored technique of averaging out the noise. That is, we form the average

X(n) = (1/n) Σ_{k=1}^{n} Y(k).    (11-131)

Note that X(n) has m_dc as its mean and σ²/n as its variance (indeed, with increasing n, we are
averaging out the noise). However, the question remains: As n → ∞, does the random
sequence X(n) ∈ L² converge in mean square to a random variable? Let's see if the sequence is
mean square Cauchy; consider
||X(m) - X(n)||² = E[ [{X(m) - m_dc} - {X(n) - m_dc}]² ]
    = E[{X(m) - m_dc}²] - 2E[{X(m) - m_dc}{X(n) - m_dc}] + E[{X(n) - m_dc}²]
    = σ²/m + σ²/n - 2E[{X(m) - m_dc}{X(n) - m_dc}].    (11-132)

Use the fact that the noise is uncorrelated from sample to sample to evaluate the middle term:

E[{X(m) - m_dc}{X(n) - m_dc}] = E[ ((1/m)Σ_{k=1}^{m} ν(k)) ((1/n)Σ_{j=1}^{n} ν(j)) ] = min{m,n}σ²/(mn) = σ²/max{m,n}.    (11-133)

Therefore, we can write (11-132) as

||X(m) - X(n)||² = σ²[ 1/m + 1/n - 2/max{n,m} ] = σ² |1/m - 1/n|.    (11-134)

As m and n approach infinity (in any order), (11-134) approaches zero, so the sequence is mean
square Cauchy. By Theorem 11-6, the sequence is mean-square convergent. But what is its
limit? The obvious candidate is m_dc. To see that this is the limit, consider

limit_{n→∞} ||X(n) - m_dc|| = limit_{n→∞} || (1/n) Σ_{k=1}^{n} ν(k) || = limit_{n→∞} σ/√n = 0.    (11-135)

So, we see that X(n) converges in mean square to m_dc (and we can expect to get better results
the more samples are included in the average).
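A small Monte Carlo check of Example 11-10 (an assumed sketch; the values of m_dc, σ and the Gaussian noise model are arbitrary choices) shows the mean-square error E[|X(n) - m_dc|²] of the running average decaying like σ²/n.

    import numpy as np

    rng = np.random.default_rng(3)
    m_dc, sigma, trials, N = 1.5, 2.0, 2000, 1000
    Y = m_dc + sigma * rng.standard_normal((trials, N))        # noisy samples Y(k)
    Xn = np.cumsum(Y, axis=1) / np.arange(1, N + 1)            # running averages X(n)
    mse = np.mean((Xn - m_dc) ** 2, axis=0)                    # estimate of E[|X(n) - m_dc|^2]
    for n in (1, 10, 100, 1000):
        print(n, round(mse[n - 1], 4), round(sigma**2 / n, 4))  # empirical vs sigma^2 / n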
With Example 11-10, we have established a Mean Square Law of Large Numbers for
sequences of uncorrelated random variables. More generally, let Y_k, k = 1, 2, ..., be a sequence of
uncorrelated random variables with common mean E[Y_k] = m and common variance VAR[Y_k] =
σ². Then the sample mean

X(n) = (1/n) Σ_{k=1}^{n} Y(k)    (11-136)

converges in mean square to m.

In a subsequent section, we will show that (11-136) converges to m in probability, a yet-
to-be-defined mode of convergence that is weaker than mean-square convergence. That sample
mean (11-136) converges in probability to m is just the well-known and popular Law of Large
Numbers (weak version) that is cited often in the popular press.
Example 11-11: Let X(k), k ≥ 1, be a sequence of independent random variables each of which
is either 1 or 0. Furthermore, suppose that

P[X(k) = 1] = 1/k
P[X(k) = 0] = 1 - 1/k.    (11-137)

As k → ∞, does X(k) converge in mean square? Let's check the obvious candidate X ≡ 0;
consider

limit_{n→∞} ||X(n) - 0||² = limit_{n→∞} E[X(n)²] = limit_{n→∞} 1/n = 0.    (11-138)

So, we see that X(n) converges in mean square to the random variable X ≡ 0. However, in
Example 11-15, we will see that X(n) does not converge (to zero) in a pointwise manner.
Example 11-12: Let X(k), k ≥ 1, be a sequence of independent random variables similar to the
previous example. However, suppose that X(k) is either k or 0 with

P[X(k) = k] = 1/k²
P[X(k) = 0] = 1 - 1/k².    (11-139)

So, as k becomes large, we see that X is getting larger with a smaller probability. Is X(k) mean
square convergent? To find out, consider

||X(m) - X(n)||² = E[ X(m)² - 2X(m)X(n) + X(n)² ] = 2[1 - 1/(nm)],    (11-140)

a result that converges to 2 as m, n approach infinity. Hence, X(n) is not mean-square Cauchy;
hence, it is not mean square convergent.
Theorem 11-7 tells us that expectation and l.i.m. are interchangeable for m.s. convergent
sequences. A similar result holds for the inner product operation defined by (11-102).

Theorem 11-9: Let X(n) and Y(m) be m.s. convergent sequences with m.s. limits X_0 and Y_0,
respectively, so that

X_0 = l.i.m._{n→∞} X(n)  (i.e., limit_{n→∞} ||X_0 - X(n)|| = 0)
Y_0 = l.i.m._{m→∞} Y(m)  (i.e., limit_{m→∞} ||Y_0 - Y(m)|| = 0).    (11-141)

Under these conditions, we claim that

⟨X_0, Y_0⟩ = ⟨ l.i.m._{n→∞} X(n), l.i.m._{m→∞} Y(m) ⟩ = limit_{n,m→∞} ⟨X(n), Y(m)⟩.    (11-142)

Proof: First, consider the simple algebra

|⟨X(n), Y(m)⟩ - ⟨X_0, Y_0⟩| = |⟨X(n), Y(m)⟩ - ⟨X(n), Y_0⟩ + ⟨X(n), Y_0⟩ - ⟨X_0, Y_0⟩|
    = |⟨X(n), Y(m) - Y_0⟩ + ⟨X(n) - X_0, Y_0⟩|
    ≤ |⟨X(n), Y(m) - Y_0⟩| + |⟨X(n) - X_0, Y_0⟩|
    ≤ ||X(n)|| ||Y(m) - Y_0|| + ||X(n) - X_0|| ||Y_0||.    (11-143)

Now, since X(n) →(m.s.) X_0 as n → ∞, the sequence ||X(n)|| is bounded (can you show
this?), say ||X(n)|| < M. Use this fact, (11-141) and (11-143) to conclude

limit_{n,m→∞} |⟨X(n), Y(m)⟩ - ⟨X_0, Y_0⟩| ≤ limit_{n,m→∞} [ M ||Y(m) - Y_0|| + ||X(n) - X_0|| ||Y_0|| ] = 0,    (11-144)

a result that proves (11-142) and the continuity of the inner product.
Theorem 11-9 establishes continuity of the inner product ⟨X,Y⟩ ≜ E[XY]. What we mean
by this is simple. Suppose we are given sequences X(n) and Y(m) with m.s. limits X_0 and Y_0,
respectively, as described by (11-141). For large n and m, X(n) and Y(m) get close to X_0 and Y_0,
respectively, and ⟨X(n),Y(m)⟩ ≜ E[X(n)Y(m)] gets close to ⟨X_0,Y_0⟩ = E[X_0 Y_0]. This
intuitive idea is known as continuity of the inner product.
Some results that involve mean square convergence of random sequences can be
generalized to a "weaker" convergence mode. This new mode is called convergence in
probability. It is "weaker" (i.e., more general) than m.s. convergence; m.s. convergent sequences
also converge in probability, but the converse is not true.

As n → ∞, a random sequence X(n) converges in probability (i.p.) to a random variable
X_0 if, for every ε > 0, we have

limit_{n→∞} P{ |X(n) - X_0| > ε } = 0.    (11-145)

Often, this type of convergence is denoted by either of

X(n) →(i.p.) X_0    (11-146)

l.i.p._{n→∞} X(n) = X_0.    (11-147)
For convergence in probability, many of the results parallel those given above for m.s.
convergence. First, as we go out in a sequence (i.e., as the index becomes large), it may become more
likely that the terms are close together (this does not mean that the terms must be closer together
in the m.s. sense). We say that a random sequence X(n) is Cauchy in probability if, for every ε >
0, we have

limit_{n,m→∞} P{ |X(m) - X(n)| > ε } = 0.    (11-148)

Cauchy in probability is a weaker condition than Cauchy in the mean square sense. A
sequence that is mean square Cauchy is also Cauchy in probability, but the converse is not true.
Condition (11-117) implies condition (11-148); however, the converse is not true. Next, we
provide a theorem that does for convergence in probability what Theorem 11-6 did for
convergence in mean square.
Theorem: As n → ∞, a sequence X(n) converges in probability to a random variable X_0
if, and only if, the sequence is Cauchy in probability.

Proof: First, we show that if X(n) converges in probability to X_0, then it is Cauchy in
probability. Suppose that the sequence converges in probability. Then note the event (i.e., set)
relationship

{ |X(m) - X(n)| > ε } ⊂ { |X(m) - X_0| > ε/2 } ∪ { |X(n) - X_0| > ε/2 },    (11-149)

as depicted by Figure 11-5. From (11-149), we see that

P[ |X(m) - X(n)| > ε ] ≤ P[ |X(m) - X_0| > ε/2 ] + P[ |X(n) - X_0| > ε/2 ].    (11-150)

Now, since X(n) converges to X_0 in probability, both terms on the right-hand side of (11-150)
approach zero as n and m approach infinity. Hence, the sequence is Cauchy in probability as
claimed. The converse (if X(n) is Cauchy in probability, then it converges in probability) is
harder to prove and is not given here (see M. Loève, Probability Theory I, 4th Edition, pp. 117-
118).
If a sequence converges in probability, then the limit is unique. That is,
suppose X(n) converges in probability to both X_0 and Y_0. Then it necessarily follows that P[X_0
≠ Y_0] = 0.

[Figure 11-5: If |X(n) - X(m)| > ε, then either |X(n) - X_0| > ε/2 or |X(m) - X_0| > ε/2.]
Using the same reasoning that led to (11-150), we can write

{ |X_0 - Y_0| > ε } ⊂ { |X_0 - X(n)| > ε/2 } ∪ { |Y_0 - X(n)| > ε/2 }    (11-151)

P[ |X_0 - Y_0| > ε ] ≤ P[ |X_0 - X(n)| > ε/2 ] + P[ |Y_0 - X(n)| > ε/2 ].    (11-152)

However, both terms on the right-hand side of (11-152) approach zero as n approaches infinity.
Hence, for every ε > 0 we have

P[ |X_0 - Y_0| > ε ] = 0,    (11-153)

so that

limit_{ε→0+} P[ |X_0 - Y_0| > ε ] = 0.    (11-154)

Continuity of the probability measure (see Appendix 11B) and (11-154) lead to the conclusion

P[ |X_0 - Y_0| > 0 ] = 0,    (11-155)

and this establishes the claim that P[X_0 ≠ Y_0] = 0.
As claimed previously, convergence in mean square implies convergence in probability.
This claim is substantiated by the following theorem (which is a nice application of the
Tchebycheff inequality).

Theorem 11-12: Convergence in mean square implies convergence in probability.

Proof: Let X(n) be a sequence that converges in mean square to the random variable X_0. For
each n, apply the generalized Tchebycheff inequality (see Chapter 2 of these notes) to X(n) - X_0
and obtain

P[ |X(n) - X_0| ≥ ε ] ≤ E[ |X(n) - X_0|² ]/ε² = ||X(n) - X_0||²/ε²    (11-156)

for every ε > 0. However, we know that X(n) →(m.s.) X_0, so that ||X(n) - X_0|| → 0 as n → ∞.
Hence, with (11-156), we have

limit_{n→∞} P[ |X(n) - X_0| > ε ] = 0,    (11-157)

so that X(n) →(i.p.) X_0 as claimed.
Let's reconsider Examples 11-10 and 11-11, both of which provided sequences that
converged in the mean square sense. Now, we know that these sequences converge in
probability, as implied by Theorem 11-12. Actually, that the sequence in Example 11-10
converges in probability is just a statement of the Law of Large Numbers (weak version).
Theorem (Weak Law of Large Numbers): Let X(n) be a sequence of independent,
identically distributed (i.i.d.) random variables with mean η_X and variance σ_X². Then, the sample
mean

(1/n) Σ_{k=1}^{n} X(k)    (11-158)

converges in probability to the real mean η_X as n approaches infinity.
The proof of this theorem follows from Example 11-10 and Theorem 11-12.

The law of large numbers is the basis for estimating η_X from measurements. In
applications, it is common to take the sample mean (11-158) as an estimate of the real mean
η_X. The basis for doing this is the Law of Large Numbers.
Example 11-13: In Example 11-12, we considered a sequence of independent random variables
X(k), k ≥ 1, with

P[X(k) = k] = 1/k²
P[X(k) = 0] = 1 - 1/k².

We found out that this sequence does not converge in the mean square sense (a sufficient
number of the sample function sequences contain a sufficient number of instances where X(k)
= k, so that m.s. convergence is not achieved). Now, we show that it converges in
probability to X_0 = 0. For every ε > 0, we have

limit_{k→∞} P[ |X(k) - X_0| > ε ] = limit_{k→∞} P[ X(k) > ε ] = limit_{k→∞} P[ X(k) = k ] = limit_{k→∞} 1/k² = 0,    (11-159)

and we see that the sequence converges in probability to zero (in (11-159), only the probabilities
that X(k) = k are involved; for k ≥ 1, the actual values of X(k) do not enter into the
computation).

The converse of Theorem 11-12 is not true (convergence in probability does not imply
convergence in mean square), and Example 11-13 is a counter example that establishes this fact.
Basically, convergence in mean square depends upon both the numerical values of the
sequence elements and the probabilities associated with those values. On the other hand,
convergence in probability is concerned only with the probabilities.
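The contrast in Example 11-13 is easy to see numerically (an assumed sketch; the threshold 0.5 and the trial count are arbitrary choices): P[|X(k)| > ε] = 1/k² goes to zero, while E[X(k)²] = k²·(1/k²) = 1 for every k, so the mean-square distance to zero never shrinks.

    import numpy as np

    rng = np.random.default_rng(4)
    trials = 1_000_000
    for k in (2, 5, 10, 20):
        Xk = np.where(rng.random(trials) < 1.0 / k**2, float(k), 0.0)   # X(k) = k w.p. 1/k^2, else 0
        print(k, np.mean(Xk != 0), np.mean(Xk**2))
        # The exceedance probability shrinks like 1/k^2, but E[X(k)^2] stays near 1.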
For convergence in probability, this example shows that one cannot interchange
the limit and expectation operations. For n ≥ 1, consider the sequence X(n), where X(n) is either
-1 or n. Also, suppose that
produce the same value (This differs from mean square convergence; recall that Theorem 11-7
proved that expectation and l.i.m are interchangeable). So, while convergence in probability is
very general (and weak), there are limitations on what you can do with it.
The last form of convergence we will study is called almost sure (a.s.) convergence.
The random sequence X(n) converges almost surely to the random variable X_0 if the sequence of
functions X(n;ω) converges to X_0(ω) for all ω ∈ Ω except possibly on a set of probability zero
(recall that Ω denotes the sample space). Almost sure convergence requires that

P[ limit_{n→∞} X(n) = X_0 ] = P[ {ω : limit_{n→∞} X(n;ω) = X_0(ω)} ] = 1.    (11-164)

In other words, X(n) converges almost surely to random variable X_0 if there exists an event A,
with P(A) = 1 (and P(A̅) = 0), for which X(n;ω) → X_0(ω) for all ω ∈ A. Often we write

X(n) →(a.s.) X_0.    (11-165)

Obviously, this type of convergence is weaker than pointwise (p.w.) convergence (p.w.
convergence requires that X(n;ω) → X_0(ω) for all ω ∈ Ω). However, as shown below, almost
sure (a.s.) convergence implies convergence in probability (i.p.). And, it doesn't imply, nor is
it implied by, convergence in mean square (m.s.). In the literature, a.s. convergence goes by the
names convergence almost everywhere and convergence with probability one (other names are
used as well).

Like convergence in mean square and probability, in the context of almost sure
convergence, it is possible to examine the separation, or distance, between sequence elements as
we go farther out in a sequence. We say that X(n) is an almost surely Cauchy sequence if
limit_{n,m→∞} |X(n) - X(m)| = 0 almost surely; that is, P[ {ω : limit_{n,m→∞} |X(n;ω) - X(m;ω)| = 0} ] = 1.    (11-166)

In other words, there exists an event A, P(A) = 1, for which

limit_{n,m→∞} |X(n;ω) - X(m;ω)| = 0    (11-167)

for all ω ∈ A. To establish that X(n) is an almost surely Cauchy sequence, we do not require
knowledge of a sequence limit.
With regard to necessary and sufficient conditions based on the Cauchy criteria, almost sure
convergence parallels m.s. and i.p. convergence. To show almost sure convergence of a
sequence, it is not necessary to come up with a limit (in the almost sure sense) for
the sequence. Instead, as shown by the following theorem, we can use the Cauchy criteria.

Theorem: A sequence X(n) is almost surely convergent if, and only if, it is an almost
surely Cauchy sequence.

This theorem follows from the fact that, in the real number system, sequences of real
numbers converge if, and only if, they are Cauchy sequences.
A practical and useful test for almost sure convergence is given by the following
theorem.

Theorem 11-15: Let X(n) denote a sequence of random variables. Suppose that X(n) converges
to random variable X_0 almost surely; that is, we suppose that

X(n) →(a.s.) X_0.    (11-168)

Then, for every ε > 0 we have

limit_{m→∞} P[ |X(n) - X_0| ≤ ε for all n ≥ m ] = limit_{m→∞} P[ ∩_{n=m}^{∞} {|X(n) - X_0| ≤ ε} ] = 1,    (11-169)

which we write as

limit_{m→∞} P[A_m] = 1,    (11-170)

where A_m is defined as

A_m ≜ { |X(n;ω) - X_0(ω)| ≤ ε for all n ≥ m } = ∩_{n=m}^{∞} { |X(n;ω) - X_0(ω)| ≤ ε },    (11-171)

an event that depends on m and ε. The converse is true as well; hence, (11-169) and (11-168)
are equivalent (i.e., one implies the other).

Note: The sequence A_m, m ≥ 0, is nested increasing with m; that is, A_m ⊂ A_{m+1} for all m and all
ε > 0. Also, the complement of (11-171) is (DeMorgan's Laws come in handy here)

A̅_m ≜ { |X(n;ω) - X_0(ω)| > ε for some n ≥ m } = ∪_{n=m}^{∞} { |X(n;ω) - X_0(ω)| > ε }.    (11-172)

So, Theorem 11-15 is sometimes stated as: X(n) →(a.s.) X_0 iff for all ε > 0 we have

limit_{m→∞} P[ |X(n) - X_0| > ε for some n ≥ m ] = limit_{m→∞} P[A̅_m] = 0.    (11-173)
Proof: First, suppose that X(n) →(a.s.) X_0. Then, there exists an event Ω_1 for which

P[Ω_1] = 1, P[Ω̅_1] = 0, and limit_{n→∞} X(n;ω) = X_0(ω) for each ω ∈ Ω_1.    (11-174)

Now, we show that Ω_1 ⊂ ∪_{k=1}^{∞} A_k. Take any ω_0 ∈ Ω_1. As shown by (11-174), X(n;ω_0) converges
in an ordinary sense to X_0(ω_0); this means that, given any ε > 0, there exists an integer m(ω_0,ε)
(integer m depends on ω_0 and ε) with the property

|X(n;ω_0) - X_0(ω_0)| ≤ ε    (11-175)

for n ≥ m(ω_0,ε). Hence, we see that ω_0 ∈ A_k for all k ≥ m(ω_0,ε); that is, we can write

ω_0 ∈ Ω_1  ⇒  ω_0 ∈ A_k ≜ ∩_{n=k}^{∞} { |X(n;ω) - X_0(ω)| ≤ ε },  k ≥ m(ω_0,ε).    (11-176)

Since the A_k are nested increasing, we have

Ω_1 ⊂ ∪_{k=1}^{∞} A_k.    (11-177)

Since P(Ω_1) = 1, Equation (11-177) yields

P[ ∪_{k=1}^{∞} A_k ] = 1.    (11-178)

This leads to the conclusion

1 = P[ ∪_{k=1}^{∞} A_k ] = P[ limit_{n→∞} ∪_{k=1}^{n} A_k ] = limit_{n→∞} P[ ∪_{k=1}^{n} A_k ] = limit_{n→∞} P[A_n],    (11-179)

and we have proven that (11-168), which states X(n) →(a.s.) X_0, implies (11-169), which states
P[A_n] → 1 as n → ∞. Now, we show the converse; we show that (11-169) implies (11-168). We

do this by showing that a false (11-168) implies a false (11-169) (this is the contrapositive of the
statement (11-169) implies (11-168)). Hence, assume that (11-168) is false and show that
m
m
limit [A ] 1

(i.e., (11-169) is false). If (11-168) is false there exists an event , () > 0,


such that X(n,) / X
0
() for (i.e., convergence does not occur for ). Now,
consider the random variable
0
n
Z( ) X(n, ) X ( ) , lim sup = . (11-180)
The event { Z() > 0} can be expressed as
n 1
{ Z( ) 0} { Z( ) 1/ n}

=
> = >

. (11-181)
For each
0
, we have Z(
0
) > 0, so
0
{ Z() > 0}; this fact implies that
{ Z() > 0}. (11-182)
Now, () > 0 implies ( { Z() > 0} ) > 0 and the existence of some integer n
0
for
Updates at http://www.ece.uah.edu/courses/ee420-500/ 11-57
which the event { Z() > 1/n
0
} has a strictly positive probability (to see this, equate the
probability of both sides of (11-181) and use the continuity of ). That is, we have
[ { Z() > 1/n
0
} ] > 0. (11-183)
But, this positive probability event is contained in the complement of A
m
, m 1, defined using
= 1/n
0
. This observation is written as
0 m 0 0
n=m
{ Z( ) > 1/ n } A {X(n, ) X ( ) 1/ n }


>


, (11-184)
for every integer m (apply DeMorgans Law to (11-171) to get this complement). Hence, for
every integer m, we have
m
(A ) ( { Z( ) ) 0 > } > , (11-185)
so that
m
(A ) is bounded away from zero, and (A
m
) is bounded away from unity, as m .
Hence, Equation (11-170) (equivalently, Equation (11-169)) cannot be true; we have shown that
a false (11-168) implies a false (11-169) (equivalently, we have shown that (11-169) implies
(11-168)).
Theorem 11-16: Almost sure (a.s.) convergence implies convergence in probability (i.p.).

Proof: This is easy to show. Suppose that X(n) → X_0 almost surely (a.s.), so that P(A̅_m) → 0
as m → ∞ for any fixed (but arbitrary) ε > 0 used in the definition of A_m. Note that

{ |X(m) - X_0| > ε } ⊂ A̅_m ≜ ∪_{n=m}^{∞} { |X(n;ω) - X_0(ω)| > ε }.    (11-186)

Hence, P(A̅_m) → 0 as m → ∞ implies that P( {|X(m) - X_0| > ε} ) → 0 as m → ∞, and we have
X(m) → X_0 in probability (i.p.).
Theorem 11-16 shows that a.s. convergence implies convergence in probability; however, the
converse is not true, as shown by the next example.
Example 11-15: This example shows that convergence in mean square (m.s.) does not imply
convergence almost surely (a.s.). Recall that Example 11-11 discussed a binary random
sequence X(k) of independent random variables with

P[X(k) = 1] = 1/k
P[X(k) = 0] = 1 - 1/k.    (11-187)

In Example 11-11, we saw that X(k) converges in mean square (m.s.) to X_0 ≡ 0 (hence, it also
converges in probability (i.p.) to X_0 ≡ 0). Now, we show that this sequence does not converge
almost surely (a.s.). In terms of A_m given by (11-171), observe that

limit_{n→∞} P[A_n] = limit_{n→∞} P[ ∩_{m=n}^{∞} {|X(m) - X_0| ≤ ε} ]
    = limit_{n→∞} P[ ∩_{m=n}^{∞} {X(m) = 0} ]
    = limit_{n→∞} (1 - 1/n)(1 - 1/(n+1)) ⋯
    = limit_{n→∞} ∏_{m=n}^{∞} (1 - 1/m) ≤ limit_{n→∞} exp( -Σ_{m=n}^{∞} 1/m )
    = 0.    (11-188)

Since this limit is not unity, X(m) cannot converge almost surely to X_0 = 0 (study again Equation
(11-169)). What we have provided here is a counter example that shows that mean square (m.s.)
convergence does not imply almost sure (a.s.) convergence. Also, the example shows that
convergence in probability (i.p.) does not imply convergence almost surely (a.s.). Also, see
Stark and Woods (3rd Edition), Example 6.7-3, p. 381 for a similar example.
Example 11-16: This example shows that convergence almost surely (a.s.) does not imply
convergence in mean square (m.s.). Recall that Example 11-12 presented a binary random
sequence X(k) of independent random variables with

P[X(k) = k] = 1/k²
P[X(k) = 0] = 1 - 1/k².    (11-189)

As shown by Example 11-12, this sequence is not mean square (m.s.) convergent. We show that
X(k) converges almost surely (a.s.) to X_0 = 0. In terms of A̅_m defined by (11-186), observe that

limit_{n→∞} P[A̅_n] = limit_{n→∞} P[ ∪_{m=n}^{∞} {|X(m) - X_0| > ε} ] = limit_{n→∞} P[ ∪_{m=n}^{∞} {X(m) = m} ]
    ≤ limit_{n→∞} Σ_{m=n}^{∞} 1/m² = 0.    (11-190)

Equivalently, in terms of A_n given by (11-171), this last result implies that

limit_{n→∞} P[A_n] = limit_{n→∞} P[ ∩_{m=n}^{∞} {|X(m) - X_0| ≤ ε} ] = 1.    (11-191)

From Theorem 11-15 (see Equation (11-169)), we can conclude that X(n) converges almost
surely (a.s.) to X_0 = 0. Together with Example 11-12, this example shows that convergence
almost surely (a.s.) does not imply convergence in mean square (m.s.). Also, this example shows
that convergence in probability (i.p.) does not imply convergence in mean square (m.s.).
The next example is somewhat counter intuitive. It demonstrates that pointwise
convergence does not imply convergence in mean square, in general. Even though X(n;ω) → X_0(ω)
for all ω ∈ Ω (i.e., the random variable converges pointwise), the integral in the computation of
E[ |X(n;ω) - X_0|² ] may diverge, so that X(n) does not converge to X_0 in mean square.
Example 11-17: Consider the probability space (Ω, B, P), where Ω = [0, 1], B is the Borel sets (B is
the σ-algebra generated by the open intervals on Ω; see Chapter 1 of the class notes), and

P[B] = ∫_B dω,  B ∈ B    (11-192)

(if B is an interval, then P[B] is the interval length; P can be thought of as a generalized
length of event B). For ω ∈ Ω, define the random variable sequence

X(n;ω) = n I_{[1/n, 2/n]}(ω) = n for ω ∈ [1/n, 2/n], and 0 otherwise    (11-193)

(note that I_B(ω) is called the Indicator Function). On Ω, X(n) converges to zero in a pointwise
manner. We say that X(n) →(p.w.) 0. Sometimes, we say that X(n) converges everywhere or
surely. However, sequence X(n) does not converge to zero in the mean square sense since

||X(n) - 0||² = E[ |X(n)|² ] = n²(2/n - 1/n) = n → ∞.    (11-194)
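For the sequence (11-193), the second moment can be checked by direct numerical averaging over Ω = [0, 1] (an assumed sketch; the grid resolution is arbitrary): E[X(n)²] = n²·(2/n - 1/n) = n grows without bound even though X(n;ω) → 0 for every fixed ω.

    import numpy as np

    omega = np.linspace(0, 1, 1_000_001)              # fine grid on the sample space [0, 1]
    for n in (10, 100, 1000):
        Xn = np.where((omega >= 1/n) & (omega <= 2/n), n, 0)
        print(n, np.mean(Xn**2.0))                    # approximately n = E[X(n)^2]
    # Pointwise limit: for any fixed omega > 0, X(n; omega) = 0 once 2/n < omega.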
Figure 11-6 shows a Venn diagram that depicts the interrelationships between i.p., m.s.,
a.s., and p.w. convergence. The diagram follows directly from the definitions, theorems and
counter examples given in this chapter. Mean square convergence neither implies, nor is it
implied by, a.s. convergence; see Examples 11-15 and 11-16 for relevant counterexamples. The
fact that p.w. convergence does not imply m.s. convergence is established by Example 11-17.
Theorem 11-12 (alternatively, Theorem 11-16) establishes that m.s. (alternatively, a.s.)
convergence implies i.p. convergence.
[Figure 11-6: Venn diagram showing the relationship between modes of convergence. p.w. - Pointwise (Everywhere); a.s. - Almost Surely (Almost Everywhere); m.s. - Mean Square; i.p. - In Probability.]
Appendix 11A: Limit, Limit Superior, and Limit Inferior of a Real Number Sequence
A sequence of real numbers is a mapping from the integers into the real numbers. R
denotes the set of real numbers; x ∈ R ⇔ -∞ < x < ∞. The set of extended real numbers is
denoted as R⁺, and it is R ∪ {±∞}. We denote a sequence by the notation X(n). Often, the
notation X_n is used; in the literature, other notational conventions have been used to represent
sequences.

Examples:
1. X(n) = 1/n, n ≥ 1.
2. X(n) = sin(n/10).
3. X(n) = n².
Basic Concepts

A sequence may have a limit. X_0 ∈ R is a limit of sequence X(n) if, for each ε > 0, there
exists an integer N_ε such that |X(n) - X_0| < ε for all n > N_ε. Note that N_ε depends on ε. If one
exists, a limit is unique. A sequence may diverge to either +∞ or -∞. Example 1 above has the
limit X_0 = 0. Example 2 does not have a limit; it oscillates forever. Finally, Example 3
diverges to +∞.

A monotone increasing sequence is a sequence for which X(n+1) ≥ X(n). A bounded
sequence has the property that |X(n)| < M for all n, where M is some positive, finite number.
It is obvious that a bounded, monotone increasing sequence is convergent. A similar definition
and result can be given for monotone decreasing sequences.
Cauchy Sequences ⇔ Convergent Sequences

A sequence of real numbers is said to be Cauchy if

limit_{n,m→∞} |X(n) - X(m)| = 0.    (11A1)

In (11A1), integers n and m approach infinity; the manner in which they do this is not important.
For a Cauchy sequence, the terms get "closer together" the "farther out" you go in the sequence.

A major theorem in the theory of real analysis is that a real sequence is convergent if, and
only if, it is Cauchy. Because of this, we say that the real numbers are complete (or the real
numbers form a complete vector space). To test a sequence for convergence, it is not necessary
to "cook up" a limit. All we have to do is show that the sequence is Cauchy, and this establishes
that the sequence is convergent.
Limit Superior of a Sequence

Given any sequence X(n) of real numbers, we define a new sequence

h(m) ≜ Least Upper Bound of X(n), n ≥ m = sup_{n≥m} X(n).    (11A2)

The term sup is an abbreviation for supremum, or least upper bound. Given integer m, h(m) is the
least upper bound of the set {X(m), X(m+1), ...}; h(m) is a sequence of least upper bounds.
Note that h(m) is a monotone decreasing sequence (h(m) ≥ h(m+1)).

We discuss three possibilities for the behavior of h(m). First, sequence values h(m) may
be +∞ for all m. This is true if, and only if, for all c ∈ R and integer n, there exists some k, k ≥ n,
such that X(k) > c (no matter how far out you go, you can go out further and find arbitrarily
large positive sequence elements). The second possibility is that, as m approaches infinity,
h(m) may converge to a real number A ∈ R. In this case, for any ε > 0, there are infinitely many
terms of the sequence that are greater than A - ε while only a finite number of terms are greater
than A + ε. Finally, the third case is that h(m) may approach -∞ as index m approaches ∞. This
is true if, and only if, the sequence X(n) itself approaches -∞ as n approaches ∞.

The limit of h(m) is called the limit superior (the names lim sup and upper limit have
been used in the literature) of the sequence X(n); the notation is

A = lim sup_{n→∞} X(n) ≜ limit_{m→∞} h(m) = limit_{m→∞} [ sup_{n≥m} X(n) ],    (11A3)

where the limit over m is interpreted as an extended real number. (11A3) is referred to as the lim
sup of the sequence. While the limit of X(n) may (or may not) exist, (11A3) always exists as an
extended real number (the A = lim sup may be a finite real number, or it may be either +∞ or -∞;
see the paragraph after (11A2)). Finally, the limit superior of a sequence and the limit superior
of a sequence of sets (discussed in Appendix 11B) are similar notions.

Alternate terminology exists in the literature. As defined by (11A2), h(m) is a monotone
decreasing sequence. Hence,

A = limit_{m→∞} h(m) = Greatest Lower Bound of {h(m), 0 ≤ m < ∞} = inf_{m≥0} h(m).    (11A4)

Because of this, some authors write

A = lim sup_{n→∞} X(n) = inf_{m≥0} [ sup_{n≥m} X(n) ].    (11A5)
Limit Inferior of a Sequence

Given any sequence X(n), we define a new sequence

g(m) ≜ Greatest Lower Bound of X(n), n ≥ m = inf_{n≥m} X(n).    (11A6)

The term inf is an abbreviation for infimum, or greatest lower bound. Given integer m, g(m) is
the greatest lower bound of the set {X(m), X(m+1), ...}; g(m) is a sequence of greatest lower
bounds. Note that g(m) is a monotone increasing sequence (g(m) ≤ g(m+1)).

We discuss three possibilities for the behavior of g(m). First, sequence values g(m) may
be -∞ for all m. This is true if, and only if, for all c ∈ R and integer n, there exists some k, k ≥ n,
such that X(k) < c (no matter how far out you go, you can go out further and find arbitrarily
large negative sequence elements). The second possibility is that, as m approaches infinity,
g(m) may converge to a real number A ∈ R. In this case, there are infinitely many terms of the
sequence that are less than A + ε while only a finite number of terms are less than A - ε, for any
ε > 0. Finally, the third case is that g(m) may approach +∞ as m approaches ∞. This is true if,
and only if, the sequence X(n) itself approaches +∞ as n approaches ∞.

The limit of g(m) is called the limit inferior (lim inf and other names have been used in
the literature) of the sequence X(n); the notation is

A = lim inf_{n→∞} X(n) ≜ limit_{m→∞} g(m) = limit_{m→∞} [ inf_{n≥m} X(n) ],    (11A7)

where the limit over m is interpreted as an extended real number. (11A7) is referred to as the lim
inf of the sequence. While the limit of X(n) may (or may not) exist, (11A7) always exists as an
extended real number (the A = lim inf may be a finite real number, or it may be either +∞ or -∞;
see the paragraph after (11A6)). Finally, the limit inferior of a sequence and the limit inferior of
a sequence of sets (discussed in Appendix 11B) are similar notions.

Alternate terminology exists in the literature. As defined by (11A6), g(m) is a monotone
increasing sequence. Hence,

A = limit_{m→∞} g(m) = Least Upper Bound of {g(m), 0 ≤ m < ∞} = sup_{m≥0} g(m).    (11A8)

Because of this, some authors write

A = lim inf_{n→∞} X(n) = sup_{m≥0} [ inf_{n≥m} X(n) ].    (11A9)
The interpretations, in terms of ε, of lim inf and lim sup that are given above can be
combined. Let X(n), n ≥ 0, be an arbitrary sequence of real numbers. For each ε > 0, there
exists a finite integer n_1 such that

lim inf X(n) - ε < X(n) < lim sup X(n) + ε    (11A10)

for all n > n_1. In words, for arbitrary ε > 0, all but a finite number of the X(n) fall between the
numbers lim inf X(n) - ε and lim sup X(n) + ε.

Relationships Involving lim inf and lim sup

Some simple relationships can be given between the limit superior and limit inferior of a
sequence. Given any sequence X(n), we have the relationships

lim inf_{n→∞} X(n) ≤ lim sup_{n→∞} X(n)
lim sup_{n→∞} (-X(n)) = -lim inf_{n→∞} X(n)    (11A11)

(note: the sequence -X(n) is the negative of the sequence X(n)). Also, the sequence X(n)
converges to the extended real number A if, and only if,

A = limit_{n→∞} X(n) = lim inf_{n→∞} X(n) = lim sup_{n→∞} X(n).    (11A12)

The definition of the limit of a sequence of sets (considered in Appendix 11B) is the analog of
(11A12).
Example: Consider the sequence

X(n) = 1 + e^{-n/10},  n = 0, 2, 4, 6, ...
X(n) = -1 + e^{-n/10},  n = 1, 3, 5, 7, ...    (11A13)

Clearly, this sequence does not converge. However,

A = lim sup_{n→∞} X(n) = 1    (11A14)

since, for each ε > 0, there are only a finite number of terms larger than 1 + ε, but there are an
infinite number of terms larger than 1 - ε. In a similar manner, we see that

A = lim inf_{n→∞} X(n) = -1.    (11A15)
Appendix 11B: Limit Superior, Limit Inferior and Limit of a Sequence of Sets (Events)
If A_1 ⊂ A_2 ⊂ A_3 ⊂ ... is a nested increasing sequence of events with

A = ∪_n A_n,    (11B1)

Theorem 11-1 states and proves the result

limit_{N→∞} P[A_N] = limit_{N→∞} P[ ∪_{n=1}^{N} A_n ] = P[A].    (11B2)

That is, it is permissible to move the limit operation from outside to inside of the probability
measure P. A similar result was shown for a nested decreasing sequence of events. In this
appendix, these simple continuity results are extended to sequences of arbitrary events.
Let A_1, A_2, ... be a sequence of arbitrary events (not necessarily nested in any manner).
Define the events

B_n ≜ ∪_{m≥n} A_m,  where B_n is a nested decreasing sequence of events, and
C_n ≜ ∩_{m≥n} A_m,  where C_n is a nested increasing sequence of events.    (11B3)

Note that

C_n = ∩_{m≥n} A_m ⊂ A_n ⊂ ∪_{m≥n} A_m = B_n    (11B4)

for all n.
As defined, B_n and C_n are legitimate events since countable intersections and unions of
events are always events (recall the definition of a σ-algebra of events). Note that B_n is a nested
decreasing sequence of events, and C_n is a nested increasing sequence of events. Because of
this monotone nature, we can write

B_n = ∩_{m=1}^{n} B_m = ∩_{m=1}^{n} ∪_{k≥m} A_k
C_n = ∪_{m=1}^{n} C_m = ∪_{m=1}^{n} ∩_{k≥m} A_k.    (11B5)

Now, define events B and C as the limits of B_n and C_n, respectively; that is, define

B ≜ limit_{n→∞} B_n = ∩_{m=1}^{∞} B_m = ∩_{m=1}^{∞} ∪_{k≥m} A_k    (11B6)

C ≜ limit_{n→∞} C_n = ∪_{m=1}^{∞} C_m = ∪_{m=1}^{∞} ∩_{k≥m} A_k.    (11B7)

Note that

ω ∈ B  ⇔  ω is in infinitely many of the A_n
ω ∈ C  ⇔  ω is in all but a finite number of the A_n    (11B8)

(note: the phrase infinitely many is not the same as the phrase all but a finite number).

That B and C exist as events follows from the fact that countable unions and intersections
of events are events. That is, in the terminology of Chapter 1, σ-algebra F is closed under
countable intersections and unions (once you are inside F, it's hard to get outside).
In the literature, B is called the limit supremum (also called lim sup, limit superior or
upper limit) of the sequence A_n, and it is denoted symbolically as

B = lim sup_{n→∞} A_n.    (11B9)

A little thought will lead to the conclusion that each element of B is in infinitely many of the A_n.
That is, if event B occurs then infinitely many of the A_n occur (we say that the A_n occur
infinitely often, or A_n i.o.).

In the literature, C is called the limit infimum (also called lim inf, limit inferior or lower
limit) of the sequence A_n, and it is denoted symbolically as

C = lim inf_{n→∞} A_n.    (11B10)

A little thought will lead to the conclusion that each element of C is in all but a finite number of
the A_n. That is, if event C occurs then all but a finite number of the A_n occur.
Given a sequence of events A_n, both B = lim sup A_n and C = lim inf A_n always exist;
however, they may not be equal. However, it is easily seen that C ⊂ B always.

The infinite event sequence A_1, A_2, ... is said to have a limit event A, and we write

A = limit_{n→∞} A_n,    (11B11)

if lim inf A_n = lim sup A_n. That is, the infinite event sequence A_1, A_2, ... has limit event A if
events B and C, given by (11B6) and (11B7), respectively, are equal (i.e., B ⊂ C and C ⊂ B). In
this case, we write

A = limit_{n→∞} A_n = lim sup_{n→∞} A_n = lim inf_{n→∞} A_n.    (11B12)
EE603 Class Notes 03/03/09 John Stensby
Updates at http://www.ece.uah.edu/courses/ee420-500/ 11b-4
Given convergence of A
n
to A as defined by (11B12), it is not difficult to show that
limit limit
n n
=
F
H
I
K
P P = P ( (A) A A
n n
) (11B13)
(this is a good homework problem!). So, in the sense described by (11B13), we can say that
probability measures are continuous (often, (11B13) is taken as the definition of sequential
continuity, and P is said to be sequentially continuous).
Continuity property (11B13) held by P is analogous to a well-known continuity property enjoyed by functions. Function f(x) is continuous at x = x₀ if, and only if, f(x_n) → f(x₀) for any sequence (and all sequences) x_n → x₀.
Borel-Cantelli Lemma
Let A_1, A_2, … be a sequence of events. Let

B = lim sup_{n→∞} A_n,

the event that infinitely many of the A_n occur. Then it follows that

Σ_{n=1}^{∞} P(A_n) < ∞  ⟹  P(B) = 0.   (11B14)
Furthermore, if all of the A_i are independent (i.e., we have an independent sequence), then

Σ_{n=1}^{∞} P(A_n) = ∞  ⟹  P(B) = 1.   (11B15)
Proof: We show (11B14) first. Since B = ∩_{n=1}^{∞} ∪_{m=n}^{∞} A_m = lim sup_{n→∞} A_n, we see that
B ⊆ ∪_{m=n}^{∞} A_m,   (11B16)

for all n. This last equation implies that

P(B) ≤ Σ_{m=n}^{∞} P(A_m)   (11B17)
for all n. Now, the hypothesis Σ_{n=1}^{∞} P(A_n) < ∞ implies that the right-hand side of (11B17) approaches zero as n approaches infinity. Hence, take the limit in (11B17) to see that P(B) = 0, as claimed by (11B14).
Now we show (11B15). From DeMorgan's Laws, we have

B̄ = ∪_{n=1}^{∞} ∩_{m=n}^{∞} Ā_m,   (11B18)
where the over-bar denotes complementation. However, as indexed on r, r ≥ n, ∩_{m=n}^{r} Ā_m is a nested, decreasing sequence of sets. By Theorem 11-1 we have

P( ∩_{m=n}^{∞} Ā_m ) = limit_{r→∞} P( ∩_{m=n}^{r} Ā_m ).   (11B19)
Due to the independence of the sequence, we have

limit_{r→∞} P( ∩_{m=n}^{r} Ā_m ) = limit_{r→∞} ∏_{m=n}^{r} P(Ā_m) = limit_{r→∞} ∏_{m=n}^{r} [1 − P(A_m)].   (11B20)
For all x ≥ 0, note that 1 − x ≤ e^{−x}. Apply this inequality to (11B20) and obtain

P( ∩_{m=n}^{∞} Ā_m ) = limit_{r→∞} ∏_{m=n}^{r} [1 − P(A_m)] ≤ limit_{r→∞} exp( −Σ_{m=n}^{r} P(A_m) ) = exp( −Σ_{m=n}^{∞} P(A_m) ).   (11B21)
Use the hypothesis Σ_{n=1}^{∞} P(A_n) = ∞ stated in (11B15) to see that (11B21) implies

P( ∩_{m=n}^{∞} Ā_m ) = 0   (11B22)
for all n. Finally, from (11B18) we have

P(B̄) = limit_{n→∞} P( ∩_{m=n}^{∞} Ā_m ) = 0,   (11B23)
so that P(B) = 1 as claimed by (11B15).
In general, (11B15) is false if the sequence of A_i is not independent. To show this, consider any event E, 0 < P(E) < 1, and define A_n = E for all n. Then B = E and P(B) = P(E) < 1, even though Σ_{n=1}^{∞} P(A_n) = ∞; hence (11B15) fails without independence.
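As a numerical aside (a sketch; the event probabilities below are chosen only for illustration and are not part of the notes), both halves of the Borel-Cantelli lemma can be seen in simulation. With independent events A_n = {U_n < p_n}, where the U_n are independent uniform(0,1) variates, taking p_n = 1/n² gives Σ P(A_n) < ∞ and only finitely many A_n occur, while p_n = 1/n gives Σ P(A_n) = ∞ and the number of occurrences keeps growing with the horizon N.

```python
import numpy as np

rng = np.random.default_rng(6)

def occurrences(p_of_n, N, trials=500):
    """Average number of events A_n = {U_n < p_n}, n = 1..N, that occur per experiment."""
    n = np.arange(1, N + 1)
    U = rng.random((trials, N))
    return np.mean(np.sum(U < p_of_n(n), axis=1))

for N in (10**2, 10**3, 10**4):
    conv = occurrences(lambda n: 1.0 / n**2, N)   # sum p_n < infinity: finitely many occur
    div  = occurrences(lambda n: 1.0 / n, N)      # sum p_n = infinity: count keeps growing
    print(f"N = {N:6d}   sum-finite case: {conv:5.2f}   sum-infinite case: {div:6.2f}")
```

The average count stabilizes near Σ 1/n² ≈ 1.64 in the first case and grows roughly like ln N in the second, consistent with (11B14) and (11B15).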
Appendix 11C: Summary of Results for a Nested Increasing Sequence of Sets
Suppose B_n, n ≥ 1, is a nested increasing sequence of sets. Such a sequence is discussed in Chapter 11, where it was shown that

limit_{n→∞} P[B_n] = P[ limit_{n→∞} B_n ].
The steps used to establish this result are briefly summarized below. Next to each step is a brief
justification for the step.
limit_{N→∞} P[B_N]
  = limit_{N→∞} P[ ∪_{n=1}^{N} B_n ]         (B_n is a nested increasing sequence)
  = limit_{N→∞} P[ ∪_{n=1}^{N} A_n ]         (∪_{n=1}^{N} A_n = ∪_{n=1}^{N} B_n for all N, including N = ∞; here A_1 ≡ B_1 and A_n ≡ B_n − B_{n−1}, n ≥ 2, are the disjoint events constructed in Chapter 11)
  = limit_{N→∞} Σ_{n=1}^{N} P[A_n]           (the A_n are disjoint)
  = Σ_{n=1}^{∞} P[A_n]                       (definition of an infinite sum)
  = P[ ∪_{n=1}^{∞} A_n ]                     (countable additivity of probability measure P; see the Axioms of Probability in Chapter 1)
  = P[ ∪_{n=1}^{∞} B_n ]                     (∪_{n=1}^{∞} A_n = ∪_{n=1}^{∞} B_n)
  = P[ limit_{N→∞} ∪_{n=1}^{N} B_n ]         (definition)
  = P[ limit_{N→∞} B_N ]                     (B_n is a nested increasing sequence)
Chapter 12: Mean Square Calculus
Many applications involve passing a random process through a system, either dynamic
(i.e., one with memory that is described by a differential equation) or one without memory (for
example, Y = X²). In the case of dynamic systems, we must deal with derivatives and integrals
of stochastic processes. Hence, we need a stochastic calculus, a calculus specialized to deal with
random processes.
It should be clear that such a calculus must be an extension of the concepts covered in
Chapter 11. After all, any definition of a derivative must contain the notion of limit (the
definition can be stated as a limit of a function or a sequence). And, an integral is just a limit of
a sequence (for example, recall the definition of the Riemann integral as a limit of a sum).
One might ask if ordinary calculus can be applied to the sample functions of a random
process. The answer is yes. However, such an approach is overly restrictive and complicated,
and it is not necessary in many applications. It is restrictive and complicated because it must
deal with every possible sample function of a random process. In most applications, the
complications of an ordinary calculus approach are not required since only statistical averages
(such as means and variances) are of interest, not individual sample functions.
Many applications are served well by a calculus based on mean-square convergence
concepts similar to those introduced in Chapter 11. Such a mean-square calculus discards the
difficulties of having to deal with all sample functions of a random process; instead it uses
only the important sample functions, those that influence a statistical average of interest (the
average power, for example). Also, the mean-square calculus can be developed adequately using
ordinary calculus concepts; measure-theoretic techniques are not required. The development
of mean-square calculus parallels the development of ordinary calculus (concepts in m.s
calculus have counterparts in ordinary calculus and vice versa). For these reasons, the mean
square calculus is included in most advanced books on applied random processes, and it is the
topic of this chapter.
The major pillars of the mean-square calculus are the notions of mean-square limit,
mean-square continuity, mean square differentiation and mean-square integration. From a
limited perspective, one could say that these notions are only simple applications of functional
analysis (more specifically, Hilbert space theory), a field of study involving vector spaces that
serves as the basis, and unifying force, of many electrical engineering disciplines. This
chapter introduces the above-mentioned pillars, and we give hints at the broader vector
space interpretation of these concepts and results.
Finite Average Power Random Processes
In this chapter, we consider only real-valued random processes with

E[X²(t)] < ∞   (12-1)
for all t. Such processes are said to have finite average power, or they are said to be second-
order random processes. We deal with real-valued processes only in order to simplify the
notation and equations. Excluding complex-valued processes, and eliminating complex notation,
does not restrict coverage/discussion/understanding of the basic concepts of mean-square
calculus. Note that every result in this chapter can be generalized easily to the complex-valued
random process case.
For every fixed t, finite-second-moment random variable X(t) is in the vector space L² discussed in Chapter 11. As a result, we can apply to random processes the inner product and norm notation that was introduced in the previous chapter. Let X(t₁) and Y(t₂) be two finite-power random processes. The inner product of X(t₁) and Y(t₂) is denoted as ⟨X(t₁), Y(t₂)⟩, and it is defined by

⟨X(t₁), Y(t₂)⟩ ≡ E[X(t₁)Y(t₂)].   (12-2)
For each t, the norm of X(t), denoted in the usual manner by ‖X(t)‖, is defined by

‖X(t)‖² = ⟨X(t), X(t)⟩ = E[X²(t)].   (12-3)

Finally, note that the correlation function can be expressed as

ρ(t₁,t₂) ≡ E[X(t₁)X(t₂)] = ⟨X(t₁), X(t₂)⟩.   (12-4)
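As a concrete (and purely illustrative) numerical sketch of (12-2) through (12-4), assume the random-phase sinusoid X(t) = cos(ωt + Θ), with Θ uniform on [0, 2π); this process is not one of the notes' examples, but its correlation ρ(t₁,t₂) = ½cos(ω(t₁ − t₂)) is known in closed form, so the inner product, norm and correlation can be checked by sample averages over many realizations.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 200_000                           # number of sample functions
omega = 2.0 * np.pi                   # assumed radian frequency
theta = rng.uniform(0.0, 2.0 * np.pi, size=M)

def X(t):
    """One value of the process per realization: X(t) = cos(omega*t + Theta)."""
    return np.cos(omega * t + theta)

def inner(u, v):
    """Inner product <U, V> = E[U V], estimated by a sample mean (12-2)."""
    return np.mean(u * v)

t1, t2 = 0.3, 0.55
print("||X(t1)||^2       =", inner(X(t1), X(t1)))                 # near 1/2
print("rho(t1,t2)        =", inner(X(t1), X(t2)))                 # near 0.5*cos(omega*(t1-t2))
print("0.5*cos(w*(t1-t2))=", 0.5 * np.cos(omega * (t1 - t2)))
```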
We assume that all finite-power processes have zero mean. This assumption imposes no
real limitation. Since (12-1) implies that E[|X|] < ∞ (use the Cauchy-Schwarz inequality to show this), we can form the new random process

Y(t) = X(t) − E[X(t)]   (12-5)
that has zero mean. Hence, without loss of generality, we limit ourselves to zero-mean, finite-
average power random processes.
The theory of m.s. limits, m.s. continuity, m.s. differentiation and m.s. integration of a
stochastic process can be given using the inner product and norm notation introduced above.
Alternatively, one can use the equivalent expectation notation. Both notational methodologies
have advantages and disadvantages; in this chapter, we will use both.
Limits
The limit of a random process can be defined in several different ways. Briefly, we
mention some possibilities before focusing on the mean-square limit.
Surely (Everywhere): As t′ → t, X(t′,ω) approaches Y(t,ω) for every ω ∈ S. This is the ordinary Calculus limit; it is very restrictive and rarely used in the theory of random processes.
Almost Surely (Almost Everywhere): There exists A ⊆ S, P(A) = 1, such that X(t′,ω) → Y(t,ω) as t′ → t for every ω ∈ A. This is only slightly less restrictive (weaker) than requiring that the limit exist everywhere (the former case), and it is rarely used in applications.
In Probability: As t′ → t, X(t′,ω) approaches Y(t,ω) in probability (i.p.) if, for all ε > 0, we have

limit_{t′→t} P[ |X(t′) − Y(t)| > ε ] = 0.   (12-6)

Often, this is denoted as

Y(t) = l.i.p_{t′→t} X(t′).   (12-7)
This form of limit is weaker than the previously-defined surely and almost surely limits. Also,
it is weaker than the mean-square limit, defined below.
Limit in the Mean
For finite-power random processes, we adopt the limit-in-the-mean notation that was introduced for sequences in Chapter 11. Specifically, as t′ approaches t (i.e., t′ → t), we state that process X(t) has the mean-square limit Y(t) if

limit_{t′→t} ‖X(t′) − Y(t)‖² = limit_{t′→t} E[{X(t′) − Y(t)}²] = limit_{ε→0} E[{X(t+ε) − Y(t)}²] = 0.   (12-8)
To express this, we use the l.i.m notation introduced in Chapter 11. The symbolic statement
Y(t) = l.i.m_{t′→t} X(t′) = l.i.m_{ε→0} X(t + ε)   (12-9)

should be interpreted as meaning (12-8). In (12-9), we have employed a variable t′ that approaches t; equivalently, we have used a variable ε that approaches 0, so that t + ε approaches t. While mathematically equivalent, each of the two notation styles has its advantages and disadvantages, and we will use both styles in what follows.
Completeness and the Cauchy Convergence Criteria (Revisited)
Please review Theorem 11-6, the completeness theorem for the vector space of finite-
second-moment random variables. This theorem states that the vector space of finite-second-
moment random variables is complete in the sense that a sequence of finite-second-moment
random variables converges to a unique limit if, and only if, sequence terms come arbitrarily
close together (in the m.s. sense) as you go out in the sequence (this is the Cauchy criteria).
Theorem 11-6, stated for sequences of finite-second-moment random variables, has a counterpart for finite-power random processes. Let t_n, n ≥ 0, be any sequence that approaches zero as n approaches infinity (otherwise, the sequence is arbitrary). Directly from Theorem 11-6, we can state that

l.i.m_{n→∞} X(t + t_n)   (12-10)

exists as a unique (in the mean-square sense) random process if, and only if, the double limit

l.i.m_{n,m→∞} [ X(t + t_n) − X(t + t_m) ] = 0   (12-11)

exists. Hence, we need not know the limit (12-10) to prove convergence of the sequence; instead, we can show (12-11), i.e., that the terms come arbitrarily close together as we go out in the sequence. For random processes, the Cauchy Convergence Theorem is stated most often in the following manner.
Theorem 12-1 (Cauchy Convergence Theorem): Let X(t) be a real-valued, finite-power
random process. The mean-square limit
Y(t) = l.i.m_{t′→t} X(t′) = l.i.m_{ε→0} X(t + ε)   (12-12)
exists as a unique (in the mean-square sense) random process if, and only if,
limit_{t′₁,t′₂→t} E[{X(t′₁) − X(t′₂)}²] = limit_{ε₁,ε₂→0} E[{X(t+ε₁) − X(t+ε₂)}²] = 0.   (12-13)

In terms of the l.i.m notation, Equation (12-13) can be stated as

l.i.m_{t′₁,t′₂→t} [ X(t′₁) − X(t′₂) ] = l.i.m_{ε₁,ε₂→0} [ X(t+ε₁) − X(t+ε₂) ] = 0.   (12-14)
The limit Y(t) defined by (12-12) need not be known in order to establish its existence; the m.s. limit (12-12) exists if, and only if, (12-14) holds. When using this result, one should remember that (12-13) and (12-14) must hold regardless of how t′₁ and t′₂ approach t (alternatively, how ε₁ and ε₂ approach zero); this requirement is implied by the Calculus definition of a limit. The Cauchy Convergence Theorem plays a central role in the mean-square calculus of finite-power random processes.
Continuity of the Expectation Inner Product (Revisited)
Theorem 11-9 was stated for sequences, but it has a counterpart when dealing with finite-
power random processes. Suppose X(t) and Y(t) are finite-power random processes with
X₀(t₁) = l.i.m_{t′₁→t₁} X(t′₁),
Y₀(t₂) = l.i.m_{t′₂→t₂} Y(t′₂).   (12-15)
Then we have
E[X₀(t₁)Y₀(t₂)] = E[ ( l.i.m_{t′₁→t₁} X(t′₁) )( l.i.m_{t′₂→t₂} Y(t′₂) ) ] = limit_{t′₁→t₁, t′₂→t₂} E[X(t′₁)Y(t′₂)].   (12-16)
Written using inner product notation, Equation (12-16) is equivalent to

⟨X₀(t₁), Y₀(t₂)⟩ = ⟨ l.i.m_{t′₁→t₁} X(t′₁), l.i.m_{t′₂→t₂} Y(t′₂) ⟩ = limit_{t′₁→t₁, t′₂→t₂} ⟨X(t′₁), Y(t′₂)⟩.   (12-17)
As was pointed out in the coverage given Theorem 11-9, the inner product is continuous. In the context of random processes, continuity of the expectation is expressed by Equations (12-15) and (12-16) (the expectation of a product is referred to as an inner product, so we can say that the inner product is continuous). As t′₁ and t′₂ get near t₁ and t₂, respectively (so that X(t′₁) and Y(t′₂) get near X₀(t₁) and Y₀(t₂), respectively), we have E[X(t′₁)Y(t′₂)] coming near E[X₀(t₁)Y₀(t₂)].
Existence of the Correlation Function
A random process that satisfies (12-1) has a correlation function ρ(t₁,t₂) = E[X(t₁)X(t₂)] that exists and is finite. We state this fact as the following theorem.
Theorem 12-2: For all t, zero-mean process X(t) has finite average power (i.e., X(t) ∈ L²) if, and only if, its correlation function ρ(t₁,t₂) = E[X(t₁)X(t₂)] exists as a finite quantity.
Proof: Suppose zero mean X(t) has finite average power (i.e., satisfies (12-1)). Use the
Cauchy-Schwarz inequality to see
|ρ(t₁,t₂)| = |E[X(t₁)X(t₂)]| ≤ √( E[X²(t₁)] E[X²(t₂)] ) < ∞,   (12-18)
so ρ(t₁,t₂) exists as a finite quantity. Conversely, suppose that ρ(t₁,t₂) exists as a finite quantity. As a result of (12-18), we have

E[X²(t)] = E[X(t)X(t)] = ρ(t,t) < ∞,   (12-19)
so that X(t) has finite average power.
This theorem is important since it assures us that finite-power random processes are
synonymous with those that possess correlation functions.
Continuity of Random Processes
For random processes, the concept of continuity is based on the existence of a limit, just
like the concept of continuity for ordinary, non-random functions. However, as discussed
previously, for random processes, the required limit can be defined in several ways (i.e., everywhere, almost everywhere, in probability, in the mean-square sense, etc.). In what follows, we
give simple definitions for several types of continuity before concentrating on the type of
continuity that is most useful in applications, mean-square continuity.
Sample-function continuity (a.k.a. continuity or continuity everywhere) at time t requires
that each and every sample function be continuous at time t. We say that X(t) is sample function
continuous at time t if
limit_{t′→t} X(t′,ω) = limit_{ε→0} X(t+ε,ω) = X(t,ω)   (12-20)

for all ω ∈ S. This is the strongest type of continuity possible. It is too restrictive for many applications.
A weaker, less restrictive, form of continuity can be obtained by throwing out a set
of sample functions that are associated with an event whose probability is zero. We say that the
random process X(t,ω) is almost surely sample function continuous (a.k.a. continuous almost everywhere) at time t if (12-20) holds everywhere except on an event whose probability is zero. That is, X(t,ω) is almost surely sample function continuous if there exists an event A, P(A) = 1, for which (12-20) holds for all ω ∈ A. This form of continuity requires the use of measure
theory, an area of mathematics that most engineers are not conversant with. In addition, it is too
restrictive for most applications, and it is not needed where only statistical averages are of
interest, not individual sample functions.
Continuity in probability, or p-continuity, is based on the limit-in-probability concept that was introduced in Chapter 11, and it is even weaker than a.s. continuity. We say that X(t,ω) is p-continuous at time t if

limit_{t′→t} P[ |X(t′) − X(t)| > ε ] = limit_{δ→0} P[ |X(t+δ) − X(t)| > ε ] = 0   (12-21)

for all ε > 0.
Mean Square Continuity
A stochastic process X(t) is mean-square (m.s.) continuous at time t if

X(t) = l.i.m_{t′→t} X(t′) = l.i.m_{ε→0} X(t+ε),   (12-22)

which is equivalent to

l.i.m_{t′→t} [ X(t′) − X(t) ] = l.i.m_{ε→0} [ X(t+ε) − X(t) ] = 0   (12-23)

or

limit_{t′→t} ‖X(t′) − X(t)‖² = limit_{t′→t} E[{X(t′) − X(t)}²] = limit_{ε→0} E[{X(t+ε) − X(t)}²] = 0.   (12-24)
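As a quick numerical illustration (a sketch, not part of the notes' examples), take the random telegraph signal of Chapter 7, whose correlation is ρ(τ) = e^{−2λ|τ|}; expanding the square in (12-24) gives E[{X(t+ε) − X(t)}²] = 2[ρ(0) − ρ(ε)], which the snippet below evaluates for shrinking ε (with λ = 1 assumed).

```python
import numpy as np

lam = 1.0                                              # assumed transition rate
rho = lambda tau: np.exp(-2.0 * lam * np.abs(tau))     # telegraph-signal correlation

# E[{X(t+eps) - X(t)}^2] = rho(0) - 2*rho(eps) + rho(0) = 2*(rho(0) - rho(eps))
for eps in (1.0, 0.1, 0.01, 0.001):
    ms_increment = 2.0 * (rho(0.0) - rho(eps))
    print(f"eps = {eps:7.3f}   E[(X(t+eps)-X(t))^2] = {ms_increment:.6f}")
```

The mean-square increment tends to zero, so the process is m.s. continuous even though every sample function jumps between its two levels.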
Mean square continuity does not imply continuity at the sample function level. A simple test for
mean-square continuity involves the correlation function of the process.
Theorem 12-3: At time t, random process X(t) is mean-square continuous if, and only if, correlation ρ(t₁,t₂) is continuous at t₁ = t₂ = t.
A simple proof of this theorem can be based on Theorem 12-1, the Cauchy Convergence Theorem. Basically, the requirement

X(t) = l.i.m_{t′→t} X(t′)   (12-25)

for m.s. continuity is equivalent to the Cauchy convergence requirement (12-13). Hence, the proof of Theorem 12-3 boils down to establishing that (12-13) is equivalent to ρ(t₁,t₂) being continuous at t₁ = t₂ = t. While this is easy to do, we take a different approach while proving the theorem.
Proof of Theorem 12-3: First, we show that continuity of ρ(t₁,t₂) at t₁ = t₂ = t is sufficient for m.s. continuity of X(t) at time t (i.e., the if part). Consider the algebra

E[{X(t′) − X(t)}²] = E[X(t′)X(t′)] − E[X(t′)X(t)] − E[X(t)X(t′)] + E[X(t)X(t)]
                   = ρ(t′,t′) − ρ(t′,t) − ρ(t,t′) + ρ(t,t).   (12-26)

If ρ(t₁,t₂) is continuous at t₁ = t₂ = t, the right-hand side of (12-26) has zero as a limit (as t′ → t) so that

limit_{t′→t} E[{X(t′) − X(t)}²] = limit_{t′→t} [ ρ(t′,t′) − ρ(t′,t) − ρ(t,t′) + ρ(t,t) ] = 0,   (12-27)
and the process is m.s. continuous (this establishes the if part). Next, we establish necessity
(the only if part). Assume that X(t) is m.s. continuous at t so that (12-23) is true. Consider the
algebra
ρ(t₁,t₂) − ρ(t,t) = E[X(t₁)X(t₂)] − E[X(t)X(t)]
                  = E[{X(t₁) − X(t)}{X(t₂) − X(t)}] + E[X(t){X(t₂) − X(t)}] + E[{X(t₁) − X(t)}X(t)],   (12-28)

which implies

|ρ(t₁,t₂) − ρ(t,t)| ≤ |E[{X(t₁) − X(t)}{X(t₂) − X(t)}]| + |E[{X(t₁) − X(t)}X(t)]| + |E[X(t){X(t₂) − X(t)}]|.   (12-29)
Apply the Cauchy-Schwarz inequality to each term on the right-hand-side of (12-29) to obtain
|ρ(t₁,t₂) − ρ(t,t)| ≤ √( E[{X(t₁) − X(t)}²] E[{X(t₂) − X(t)}²] )
                    + √( E[{X(t₁) − X(t)}²] E[X²(t)] )
                    + √( E[{X(t₂) − X(t)}²] E[X²(t)] ).   (12-30)
Since X is m.s. continuous at t, the right-hand side of (12-30) approaches zero as t₁, t₂ approach t. Hence, if X is mean-square continuous at t, then

limit_{t₁,t₂→t} ρ(t₁,t₂) = ρ(t,t),   (12-31)

so that ρ(t₁,t₂) is continuous at t₁ = t₂ = t.
ρ(τ) = A²(1 − |τ|/t_a),  |τ| ≤ t_a
     = 0,                |τ| > t_a,   (12-33)

(a result depicted by Fig. 7-3), and it is W.S.S. As shown by Fig. 7-1, the sample functions have jump discontinuities. However, ρ(τ) is continuous at τ = 0, so the process is mean-square continuous. This example illustrates the fact that m.s. continuity is weaker than sample-function continuity.
Example 12-3: The sample and held random process has m.s. discontinuities at every switching
epoch. Consider passing a zero-mean, wide-sense-stationary random process V(t) through a
sample-and-hold device that utilizes a T-second cycle time. Figure 12-1 illustrates an example
of this; the dotted wave form is the original process V(t), and the piece-wise constant wave form
is the output X(t). To generate output X(t), a sample is taken every T seconds, and it is held
for T seconds until the next sample is taken. Such a process can be expressed as
Figure 12-1: The dotted-line process is the zero-mean, constant-variance input V(t). The solid-line (piecewise constant) process is the sample-and-held random process X(t).
X(t) = Σ_{n=−∞}^{∞} V(nT) q(t − nT),   (12-34)

where

q(t) = 1,  0 ≤ t < T
     = 0,  otherwise.   (12-35)
Assume that T is large compared to the correlation time of process V(t). So, samples of the process are uncorrelated if they are spaced T seconds (or more) apart in time. If σ² = E[V²(t)] is the constant variance of input waveform V(t), the correlation function for output X(t) is

Figure 12-2: Correlation of the sample-and-held random process. The correlation is σ² on half-open, T×T squares placed along the diagonal, and it is zero off of these squares.
ρ(t₁,t₂) = σ² Σ_{n=−∞}^{∞} q(t₁ − nT) q(t₂ − nT),   (12-36)

a result depicted by Fig. 12-2. The correlation is equal to σ² on the half-open, T×T squares that lie along the diagonal, and it is zero off of these squares (in particular, the correlation is σ² along the diagonal t₁ = t₂). It is obvious from an inspection of Fig. 12-2 that ρ(t₁,t₂) is continuous at every diagonal point except t₁ = t₂ = nT, n an integer. Hence, by Theorem 12-3, X(t) is m.s. continuous for t not a switching epoch.
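A small simulation sketch (assuming white, unit-variance Gaussian samples V(nT), which satisfies the uncorrelated-samples assumption above) makes the m.s. discontinuity visible: it estimates the mean-square difference between the output just before and just after a time t, first at an interior point of a hold interval and then at a switching epoch.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 1.0                                 # hold time
M = 100_000                             # number of independent sample functions
V = rng.standard_normal((M, 4))         # V(nT), n = 0..3: unit variance, uncorrelated

def X(t):
    """Sample-and-hold output: X(t) = V(nT) for nT <= t < (n+1)T."""
    n = int(np.floor(t / T))
    return V[:, n]

eps = 1e-3
for t in (0.5, 1.0):                    # interior point vs. switching epoch
    d = X(t + eps) - X(t - eps)
    print(f"t = {t}:  E[(X(t+eps) - X(t-eps))^2] ~= {np.mean(d**2):.4f}")
# interior point: ~0 (m.s. continuous); epoch: ~2*sigma^2 = 2 (m.s. discontinuity)
```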
Corollary 12-3B: If the correlation function ρ(t₁,t₂) is continuous for all t₁ = t₂ = t (i.e., at all points on the line t₁ = t₂), then it is continuous at every point (t₁, t₂) ∈ R².
Proof: Suppose ρ(t₁, t₂) is continuous for all t₁ = t₂ = t. Then Theorem 12-3 tells us that X(t) is m.s. continuous for all t. Hence, for any t₁ and any t₂ we have

l.i.m_{ε₁→0} X(t₁ + ε₁) = X(t₁)
l.i.m_{ε₂→0} X(t₂ + ε₂) = X(t₂).   (12-37)

Now, use Equation (12-16), the continuity of the inner product, to write

ρ(t₁,t₂) = E[X(t₁)X(t₂)] = E[ ( l.i.m_{ε₁→0} X(t₁+ε₁) )( l.i.m_{ε₂→0} X(t₂+ε₂) ) ]
         = limit_{ε₁,ε₂→0} E[ X(t₁+ε₁) X(t₂+ε₂) ]
         = limit_{ε₁,ε₂→0} ρ(t₁+ε₁, t₂+ε₂),   (12-38)

so ρ is continuous at (t₁, t₂) ∈ R².
The mean η(t) = E[X(t)] of a process is a deterministic, non-random function of time. It can be time varying. If the process is m.s. continuous, then its mean is a continuous function of time.
Theorem 12-4: Let X(t) be a mean-square continuous random process. Under this condition, the mean η(t) = E[X(t)] is a continuous function of time.
Proof: Let X(t) be mean-square continuous, and examine the non-negative variance of the process increment X(t′) − X(t) given by

Var[X(t′) − X(t)] = E[{X(t′) − X(t)}²] − ( E[X(t′) − X(t)] )² ≥ 0.   (12-39)

From inspection of this result, we can write

E[{X(t′) − X(t)}²] ≥ ( E[X(t′) − X(t)] )² = ( η(t′) − η(t) )².   (12-40)
Let t′ approach t in this last equation; due to m.s. continuity, the left-hand side of (12-40) must approach zero. This implies that

limit_{t′→t} η(t′) = η(t),   (12-41)

which is equivalent to saying that the mean is continuous at time t.
Mean-square continuity is stronger than p-continuity. That is, mean-square continuous random processes are also p-continuous (the converse is not true). We state this claim with the following theorem.
Theorem 12-5: If a random process is m.s. continuous at t then it is p-continuous at t.
Proof: A simple application of the Chebyshev inequality yields
P[ |X(t′) − X(t)| > a ] ≤ E[{X(t′) − X(t)}²] / a²   (12-42)
for every a > 0. Now, let t′ approach t, and note that the right-hand side of (12-42) approaches zero. Hence, we can conclude that

limit_{t′→t} P[ |X(t′) − X(t)| > a ] = 0,   (12-43)

and X is p-continuous at t (see definition (12-21)).
The Δ_ε Operator
To simplify our work, we introduce some shorthand notation. Let f(t) be any function,
and define the difference operator

Δ_ε f(t) ≡ f(t+ε) − f(t).   (12-44)

On the Δ_ε operator, the subscript ε is the size of the time increment.
We extend this notation to functions of two variables. Let f(t,s) be a function of t (the
first variable) and s (the second variable). We define

Δ_ε^(1) f(t,s) ≡ f(t+ε, s) − f(t,s)
Δ_ε^(2) f(t,s) ≡ f(t, s+ε) − f(t,s).   (12-45)
On the difference operator, a superscript of (1) (alternatively, a superscript of (2)) denotes that
we difference the first variable (alternatively, the second variable).
Mean Square Differentiation
A stochastic process X(t) has a mean square (m.s.) derivative, denoted here as X′(t), if there exists a finite-power random process

X′(t) = l.i.m_{ε→0} [ (X(t+ε) − X(t)) / ε ] = l.i.m_{ε→0} [ Δ_ε X(t) / ε ]   (12-46)
(i.e., if the l.i.m exists). Equation (12-46) is equivalent to

limit_{ε→0} E[ ( {X(t+ε) − X(t)}/ε − X′(t) )² ] = limit_{ε→0} E[ ( Δ_ε X(t)/ε − X′(t) )² ] = 0.   (12-47)
A necessary and sufficient condition is available for X(t) to be m.s. differentiable. Like the case of m.s. continuity considered previously, the requirement for m.s. differentiability involves a condition on ρ(t₁,t₂). The condition for differentiability is based on Theorem 12-1, the Cauchy Convergence Theorem. As ε → 0, we require existence of the l.i.m of {Δ_ε X(t)}/ε for the m.s. derivative to exist. For each arbitrary but fixed t, the quotient Δ_ε X(t)/ε is a random process that is a function of ε. So, according to the Cauchy Convergence Theorem, the quantity {Δ_ε X(t)}/ε has a m.s. limit (as ε goes to zero) if, and only if,

l.i.m_{ε₁,ε₂→0} [ Δ_{ε₁} X(t)/ε₁ − Δ_{ε₂} X(t)/ε₂ ] = 0.   (12-48)
Note that (12-48) is equivalent to

limit_{ε₁,ε₂→0} E[ ( Δ_{ε₁}X(t)/ε₁ − Δ_{ε₂}X(t)/ε₂ )² ]
  = limit_{ε₁,ε₂→0} E[ (Δ_{ε₁}X(t)/ε₁)² − 2 (Δ_{ε₁}X(t)/ε₁)(Δ_{ε₂}X(t)/ε₂) + (Δ_{ε₂}X(t)/ε₂)² ]
  = 0.   (12-49)
In (12-49), there are two terms that can be evaluated as

limit_{ε→0} E[ (Δ_ε X(t)/ε)² ] = limit_{ε→0} E[ {X(t+ε) − X(t)}² ] / ε²
  = limit_{ε→0} [ ρ(t+ε, t+ε) − ρ(t+ε, t) − ρ(t, t+ε) + ρ(t,t) ] / ε²,   (12-50)
and a cross term that evaluates to

limit_{ε₁,ε₂→0} E[ (Δ_{ε₁}X(t)/ε₁)(Δ_{ε₂}X(t)/ε₂) ] = limit_{ε₁,ε₂→0} E[ {X(t+ε₁) − X(t)}{X(t+ε₂) − X(t)} ] / (ε₁ε₂)
  = limit_{ε₁,ε₂→0} [ ρ(t+ε₁, t+ε₂) − ρ(t+ε₁, t) − ρ(t, t+ε₂) + ρ(t,t) ] / (ε₁ε₂).   (12-51)
Now, substitute (12-50) and (12-51) into (12-49), and observe that (12-48) is equivalent to

limit_{ε₁,ε₂→0} E[ ( Δ_{ε₁}X(t)/ε₁ − Δ_{ε₂}X(t)/ε₂ )² ]
  = 2 limit_{ε→0} [ ρ(t+ε, t+ε) − ρ(t+ε, t) − ρ(t, t+ε) + ρ(t,t) ] / ε²
    − 2 limit_{ε₁,ε₂→0} [ ρ(t+ε₁, t+ε₂) − ρ(t+ε₁, t) − ρ(t, t+ε₂) + ρ(t,t) ] / (ε₁ε₂)
  = 0.   (12-52)
As is shown by the next theorem, (12-48), and its equivalent (12-52), can be stated as a differentiability condition on ρ(t₁,t₂).
Theorem 12-6: A finite-power stochastic process X(t) is mean-square differentiable at t if, and only if, the double limit

ρ̈(t,t) ≡ limit_{ε₁,ε₂→0} [ ρ(t+ε₁, t+ε₂) − ρ(t+ε₁, t) − ρ(t, t+ε₂) + ρ(t,t) ] / (ε₁ε₂)   (12-53)

exists and is finite (i.e., exists as a real number). Note that (12-53) is the second limit that appears on the right-hand side of (12-52). By some authors, (12-53) is called the 2nd generalized derivative.
Proof: Assume that the process is m.s. differentiable. Then (12-48) and (12-52) hold; independent of the manner in which ε₁ and ε₂ approach zero, the double limit in (12-52) is zero. But this means that the limit (12-53) exists, so m.s. differentiability of X(t) implies the existence of (12-53). Conversely, assume that limit (12-53) exists and has the finite value ρ̈(t,t) (this must be true independent of how ε₁ and ε₂ approach zero). Then the first limit on the right-hand side of (12-52) is also ρ̈(t,t), and this implies that (12-52) (and hence (12-48)) evaluates to zero, since 2ρ̈(t,t) − 2ρ̈(t,t) = 0.
Note on Theorem 12-6: Many books on random processes include a common error in their statement of this theorem (and a few authors point this out). Instead of requiring the existence of (12-53), many authors state that X(t) is m.s. differentiable at t if (or iff, according to some authors) ∂²ρ(t₁,t₂)/∂t₁∂t₂ exists at t₁ = t₂ = t. Strictly speaking, this is incorrect. The existence of (12-53) implies the existence of ∂²ρ(t₁,t₂)/∂t₁∂t₂ at t₁ = t₂ = t; the converse is not true. Implicit in (12-53) is the requirement of path independence; how ε₁ and ε₂ approach zero should not influence the result produced by (12-53). However, the second-order mixed partial is not defined in such general terms. Using the notation introduced by (12-45), the second-order mixed partial is defined as
∂²ρ(t₁,t₂)/∂t₁∂t₂ = limit_{ε₁→0} limit_{ε₂→0} [ Δ_{ε₁}^(1) Δ_{ε₂}^(2) ρ(t₁,t₂) ] / (ε₁ε₂),   evaluated at t₁ = t₂ = t.   (12-54)
Equation (12-54) requires that ε₂ → 0 first, to obtain an intermediate, dependent-on-ε₁ result; then the second limit ε₁ → 0 is taken. The existence of (12-53) at a point implies the existence of (12-54) at the point. The converse is not true; the existence of (12-54) does not imply the existence of (12-53). The following example shows that existence of the second partial derivative (of the form (12-54)) is not sufficient for the existence of limit (12-53) and the m.s. differentiability of X(t).
Example 12-4 (from T. Soong, Random Differential Equations in Science and Engineering, p. 93): Consider the finite-power stochastic process X(t), −1 ≤ t ≤ 1, defined by

X(0) = 0
X(t) = a_k,   1/2^k ≤ t < 1/2^(k−1),   k = 1, 2, 3, ...
X(t) = X(−t),   −1 ≤ t < 0,   (12-55)
where the a_k are independent, identically distributed random variables, each with a mean of zero and a variance of unity. For t ≥ 0, Figure 12-3 depicts a typical sample function of such a process (fold the graph to get X for negative time).
Figure 12-3: For 0 ≤ t ≤ 1, a typical sample function of X(t).

Figure 12-4: Correlation function ρ(t,s) is unity on the shaded, half-closed rectangles and zero otherwise.

Process X(t) has, for t ≥ 0, s ≥ 0, a correlation function ρ(t,s) that is depicted by Figure 12-4 (this plot can be used to obtain the value of ρ for (t,s) in the second, third and fourth quadrants of the (t,s) plane). As depicted on Figure 12-4, in the first quadrant, the correlation
function is unity on the shaded, half-closed squares, and it is zero elsewhere in the first quadrant. Specifically, note that ρ(t,t) = 1, 0 < t ≤ 1. Take the limit along the line t = ε, s = ε to see that

limit_{ε→0} [ ρ(ε,ε) − ρ(ε,0) − ρ(0,ε) + ρ(0,0) ] / ε² = limit_{ε→0} [ 1 − 0 − 0 + 0 ] / ε² = limit_{ε→0} 1/ε² = ∞.   (12-56)
By Theorem 12-6, X(t) does not have a mean-square derivative at t = 0, since (12-53) does not exist at t = s = 0, a conclusion drawn from inspection of (12-56). But for −1 ≤ t ≤ 1, it is easily seen that

∂ρ(t,s)/∂s |_{s=0} = 0,   −1 ≤ t ≤ 1,

so the second-order partial derivative exists at t = s = 0, and its value is

∂²ρ(t,s)/∂t∂s |_{t=s=0} = ∂/∂t [ ∂ρ(t,s)/∂s |_{s=0} ] |_{t=0} = 0.
Example 12-4 shows that, at a point, the second partial derivative (i.e., (12-54)) can exist and be finite, but limit (12-53) may not exist. Hence, it serves as a counterexample to those authors who claim (incorrectly) that X(t) is m.s. differentiable at t if (or iff, according to some authors) ∂²ρ(t₁,t₂)/∂t₁∂t₂ exists and is finite at t₁ = t₂ = t. However, as discussed next, this
second-order partial can be used to state a sufficient condition for the existence of the m.s.
derivative of X.
Theorem 12-7 (Sufficient condition for the existence of the m.s. derivative): If ∂ρ/∂t₁, ∂ρ/∂t₂ and ∂²ρ/∂t₁∂t₂ exist in a neighborhood of (t₁,t₂) = (t,t), and ∂²ρ/∂t₁∂t₂ is continuous at (t₁,t₂) = (t,t), then limit (12-53) exists, and process X is m.s. differentiable at t.
Proof: Review your multivariable calculus. For example, consult Theorem 17.9.1 of L. Leithold, The Calculus with Analytic Geometry, Second Edition. Also, consult page 79 of E. Wong, B. Hajek, Stochastic Processes in Engineering Systems.
One should recall that the mere existence of ∂²ρ/∂t₁∂t₂ at point (t₁,t₂) does not imply that this second-order partial is continuous at point (t₁,t₂). Note that this behavior in the multidimensional case is in stark contrast to the function-of-one-variable case.
Example 12-5: Consider the Wiener process X(t), t ≥ 0, that was introduced in Chapter 6. For the case X(0) = 0, we saw in Chapter 7 that

ρ(t₁,t₂) = 2D min{t₁, t₂}

for t₁ ≥ 0, t₂ ≥ 0. This correlation function does not have a second partial derivative at t₁ = t₂ = t > 0. To see this, consider Figure 12-5, a plot of ρ(t₁,t₂) as a function of t₁ for fixed t₂ = t₂₀ > 0.

Figure 12-5: ρ(t₁,t₂) as a function of t₁ for fixed t₂ = t₂₀; the plot rises with slope 2D until t₁ = t₂₀ and is constant at 2Dt₂₀ thereafter.

Hence, the Wiener process is not m.s. differentiable at any t > 0. This is not unexpected; in the limit, as the step size and the time to take a step shrink to zero, the random walk becomes an increasingly dense sequence of smaller and smaller jumps. Heuristically, the Wiener process can be thought of as an infinitely dense sequence of infinitesimal jumps. Now, jumps are not
differentiable, so it is not surprising that the Wiener process is not differentiable.
M.S. Differentiability for the Wide-Sense-Stationary Case
Recall that a W.S.S. random process has a correlation function ρ(t₁,t₂) that depends only on the time difference τ = t₁ − t₂. Hence, it is easily seen that W.S.S. process X(t) is m.s. differentiable for all t if, and only if, it is m.s. differentiable for any t. In the definition of the 2nd generalized derivative of autocorrelation ρ(t₁,t₂) given by (12-53), the path-dependence issue does not arise in the WSS case, since ρ depends only on the single variable t₁ − t₂. Regardless of how ε₁ and ε₂ approach zero, the time difference in the argument of ρ approaches zero in only two ways, from the positive or the negative real numbers. In the following development, we use the fact that a function f(x) must exist in a neighborhood of a point x₀ (including x₀) for the derivative df/dx to exist at x₀.
Corollary 12-6: Wide-sense stationary, finite-power X(t), with autocorrelation ρ(τ), is mean-square differentiable at any time t if, and only if, the first and second derivatives of ρ(τ) exist and are finite at τ = 0. (Since ρ(τ) is even, ρ′(τ) is odd, so that ρ′(0) = 0.)
Proof: For the WSS case, the second generalized derivative can be written as

ρ̈(t,t) = limit_{ε₁,ε₂→0} [ ρ(ε₁ − ε₂) − ρ(ε₁) − ρ(ε₂) + ρ(0) ] / (ε₁ε₂)
        = limit_{ε₁→0} limit_{ε₂→0} [ ρ(ε₁ − ε₂) − ρ(ε₁) − ρ(ε₂) + ρ(0) ] / (ε₁ε₂)
        = limit_{ε₂→0} limit_{ε₁→0} [ ρ(ε₁ − ε₂) − ρ(ε₁) − ρ(ε₂) + ρ(0) ] / (ε₁ε₂).

Since ρ(τ) is even, the right-hand-side limits are equivalent, and the order in which the limits are taken is immaterial. Since the second derivative ρ″(τ) exists at τ = 0, the first derivative ρ′(τ) must exist in a neighborhood of τ = 0. As a result, the above generalized derivative becomes
limit_{ε₁→0} (1/ε₁) limit_{ε₂→0} [ ρ(ε₁ − ε₂) − ρ(ε₁) − ρ(ε₂) + ρ(0) ] / ε₂ = limit_{ε₁→0} −[ ρ′(ε₁) − ρ′(0) ] / ε₁ = −ρ″(0).

Hence, for the WSS case: ρ′ and ρ″ exist and are finite at τ = 0 ⟺ the second generalized derivative of ρ exists at all t ⟺ X(t) is m.s. differentiable at all t.
Example 12-6: Consider the random telegraph signal discussed in Chapter 7. Recall that this W.S.S. process has the correlation function

ρ(t₁,t₂) = ρ(τ) = e^{−2λ|τ|},

which is depicted by Figure 12-6. Clearly, ρ(τ) is not differentiable at τ = 0, so the random telegraph signal is not m.s. differentiable anywhere.

Figure 12-6: Correlation function ρ(τ) = exp(−2λ|τ|) of the W.S.S. random telegraph signal.
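A quick numerical check of Corollary 12-6 (a sketch; the Gaussian-shaped correlation is an assumed comparison case, not one of the notes' examples): the symmetric second difference [ρ(ε) − 2ρ(0) + ρ(−ε)]/ε² approaches ρ″(0) when that derivative exists, and it diverges for the telegraph-signal correlation.

```python
import numpy as np

rho_telegraph = lambda tau: np.exp(-2.0 * np.abs(tau))   # random telegraph, lambda = 1 assumed
rho_gauss     = lambda tau: np.exp(-tau**2)               # assumed smooth comparison case

def second_difference(rho, eps):
    """Symmetric second difference; it approaches rho''(0) when that derivative exists."""
    return (rho(eps) - 2.0 * rho(0.0) + rho(-eps)) / eps**2

for eps in (1e-1, 1e-2, 1e-3):
    print(f"eps={eps:6.0e}  telegraph: {second_difference(rho_telegraph, eps):10.1f}   "
          f"gauss: {second_difference(rho_gauss, eps):8.4f}")
# telegraph column diverges (no rho''(0), not m.s. differentiable);
# gauss column settles near -2, so -rho''(0) = 2 = E[(X'(t))^2] is finite.
```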
Some Properties of the Mean Square Derivative
Many of the properties of ordinary derivatives of deterministic functions have counterparts when it comes to m.s. derivatives of finite-power random processes. We give just a few of
these.
Theorem 12-8: For a finite-power random process X(t), mean square differentiability at time t
implies mean square continuity at time t.
Proof: Suppose that X(t) is m.s. differentiable at t. Consider
limit_{ε→0} E[{X(t+ε) − X(t)}²] = limit_{ε→0} ε² E[ {X(t+ε) − X(t)}² / ε² ]
  = ( limit_{ε→0} ε² )( limit_{ε→0} E[{X(t+ε) − X(t)}²]/ε² )
  = 0 · limit_{ε→0} [ ρ(t+ε, t+ε) − ρ(t+ε, t) − ρ(t, t+ε) + ρ(t,t) ] / ε²
  = 0,   (12-57)

since the limit involving ρ is finite (as given by Theorem 12-6). Hence, a m.s. differentiable process is also m.s. continuous.
Theorem 12-9: If X₁(t) and X₂(t) are finite-power random processes that are m.s. differentiable, then αX₁(t) + βX₂(t) is a finite-power process that is m.s. differentiable for any real constants α and β. Furthermore, m.s. differentiation is a linear operation, so that

d/dt [ αX₁(t) + βX₂(t) ] = α dX₁(t)/dt + β dX₂(t)/dt.   (12-58)
Mean and Correlation Function of dX/dt
In Theorem 11-7, we established that the operations of l.i.m and expectation were
interchangeable for sequences of random variables. An identical result is available for finite-
power random processes. Recall that if we have
X(t) = l.i.m_{t′→t} X(t′),

then

E[X(t)] = E[ l.i.m_{t′→t} X(t′) ] = limit_{t′→t} E[X(t′)]   (12-59)

for finite-power random process X(t). This result can be used to obtain an expression for E[X′(t)] in terms of E[X(t)].
Theorem 12-10: Let X(t) be a finite-power, m.s. differentiable random process. The mean of
the derivative is given by
E[X′(t)] = (d/dt) E[X(t)].   (12-60)
In words, you can interchange the operations of expectation and differentiation for finite power,
m.s. differentiable random processes.
Proof: Observe the simple steps
E[X′(t)] = E[ l.i.m_{ε→0} {X(t+ε) − X(t)}/ε ] = limit_{ε→0} { E[X(t+ε)] − E[X(t)] }/ε = (d/dt) E[X(t)].   (12-61)
In this result, the interchange of l.i.m and expectation is justified by Theorem 11-7, as outlined
above.
Theorem 12-11: Let finite-power process X(t) be m.s. differentiable for all t. Then for all t and s, the quantities E[X′(t)X(s)], E[X(t)X′(s)] and E[X′(t)X′(s)] are finite and expressible in terms of ρ(t,s) = E[X(t)X(s)]. The relevant formulas are

ρ_{X′X}(t,s) ≡ E[X′(t)X(s)] = ∂ρ(t,s)/∂t
ρ_{XX′}(t,s) ≡ E[X(t)X′(s)] = ∂ρ(t,s)/∂s
ρ_{X′X′}(t,s) ≡ E[X′(t)X′(s)] = ∂²ρ(t,s)/∂t∂s.   (12-62)
Proof: Use the Cauchy-Schwarz inequality to see that the quantities exist and are finite. Now,
the first of (12-62) follows from

ρ_{X′X}(t,s) = E[X′(t)X(s)] = E[ ( l.i.m_{ε→0} {X(t+ε) − X(t)}/ε ) X(s) ]
             = limit_{ε→0} E[ {X(t+ε) − X(t)} X(s) ] / ε
             = limit_{ε→0} [ ρ(t+ε, s) − ρ(t, s) ] / ε = ∂ρ(t,s)/∂t.   (12-63)

The remaining correlation functions are obtained in a similar manner.
For the wide-sense stationary case, Theorem 12-11 can be simplified. If X is W.S.S., then (12-62) simplifies to

ρ_{X′X}(τ) ≡ E[X′(t)X(t+τ)] = −dρ(τ)/dτ
ρ_{XX′}(τ) ≡ E[X(t)X′(t+τ)] = dρ(τ)/dτ
ρ_{X′X′}(τ) ≡ E[X′(t)X′(t+τ)] = −d²ρ(τ)/dτ².   (12-64)
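As a sanity check on (12-64) (a sketch using an assumed random-phase sinusoid X(t) = cos(ω₀t + Θ), Θ uniform, whose correlation ρ(τ) = ½cos(ω₀τ) and derivative X′(t) = −ω₀ sin(ω₀t + Θ) are both known in closed form), the snippet below estimates E[X′(t)X′(t+τ)] by averaging over realizations and compares it with −ρ″(τ) = (ω₀²/2) cos(ω₀τ).

```python
import numpy as np

rng = np.random.default_rng(2)
M = 200_000
w0 = 3.0                                      # assumed radian frequency
theta = rng.uniform(0.0, 2.0 * np.pi, M)

Xp = lambda t: -w0 * np.sin(w0 * t + theta)   # exact derivative of each sample function

t, tau = 0.7, 0.4
estimate = np.mean(Xp(t) * Xp(t + tau))       # E[X'(t) X'(t+tau)] by sample average
theory = 0.5 * w0**2 * np.cos(w0 * tau)       # -rho''(tau) for rho(tau) = 0.5*cos(w0*tau)
print(f"Monte Carlo: {estimate:.4f}   -rho''(tau): {theory:.4f}")
```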
Y(t; ε) ≡ [ X(t + ε) − X(t) ] / ε   (12-65)

does not have a m.s. limit as ε approaches zero. Also, almost surely, Wiener process sample functions are not differentiable (in the ordinary Calculus sense). That is, there exists A, P(A) = 1, such that for each ω ∈ A, X(t,ω) is not differentiable at any time t. However, when interpreted as a generalized random process (see discussion above), the Wiener process has a generalized derivative (as defined in the literature) that is white Gaussian noise (a generalized random process).
For fixed ε > 0, let us determine the correlation function ρ_Y(τ; ε) of Y(t, ε). Use the fact that the Wiener process has independent increments to compute

ρ_Y(τ; ε) = 2D(ε − |τ|)/ε²,   |τ| ≤ ε
          = 0,                |τ| > ε,   (12-66)

a result depicted by Figure 12-7 (can you show (12-66)?). Note that the area under ρ_Y is 2D, independent of ε. As ε approaches zero, the base width shrinks to zero, the height 2D/ε goes to infinity, and

limit_{ε→0⁺} ρ_Y(τ; ε) = 2D δ(τ).   (12-67)

Figure 12-7: Correlation of Y(t,ε); a triangle of height 2D/ε and base 2ε that approaches 2Dδ(τ) as ε approaches zero.
That is, as ε becomes small, Y(t;ε) approaches a delta-correlated, white Gaussian noise process!
To summarize a complicated discussion: as ε → 0, Equation (12-65) has no m.s. limit, so the Wiener process is not differentiable in the m.s. sense (or, as it is possible to show, in the sample-function sense). However, when considered as a generalized random process, the Wiener process has a generalized derivative that turns out to be Gaussian white noise, another generalized random process. See Chapter 3 of Stochastic Differential Equations: Theory and Applications, by Ludwig Arnold, for a very readable discussion of the Wiener process and its generalized derivative, Gaussian white noise. The theory of generalized random processes tends to parallel the theory of generalized functions and their derivatives.
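A small simulation sketch (assuming a discretized Wiener process with diffusion constant D = 0.5, generated from Gaussian increments) shows the failure of the m.s. limit directly: the variance of the difference quotient Y(t;ε) behaves like 2D/ε and blows up as ε shrinks.

```python
import numpy as np

rng = np.random.default_rng(3)
D = 0.5                          # assumed diffusion constant, Var[X(t)] = 2*D*t
dt = 1e-3                        # simulation step
N = 2_000                        # steps, horizon T = 2.0
M = 5_000                        # sample functions
# Wiener paths: cumulative sums of independent N(0, 2*D*dt) increments
X = np.cumsum(np.sqrt(2.0 * D * dt) * rng.standard_normal((M, N)), axis=1)

t_index = 1_000                  # examine t = 1.0
for k in (100, 10, 1):           # eps = k*dt
    eps = k * dt
    Y = (X[:, t_index + k] - X[:, t_index]) / eps
    print(f"eps = {eps:7.4f}   Var[Y] ~= {np.var(Y):9.1f}   2D/eps = {2*D/eps:9.1f}")
```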
Example 12-7: We know that the Wiener process is not m.s. differentiable; in the sense of
classical Calculus, the third equation in (12-62) cannot be applied to the autocorrelation function
ρ(t₁,t₂) = 2D min{t₁,t₂} of the Wiener process. However, ρ(t₁,t₂) is twice differentiable if we
formally interpret the derivative of a jump function with a delta function. First, note that

∂ρ(t₁,t₂)/∂t₂ = ∂[2D min(t₁,t₂)]/∂t₂ = 2D,  t₁ > t₂
                                     = 0,   t₁ < t₂,   (12-68)

a step function in t₁. Now, differentiate (12-68) with respect to t₁ and obtain

∂²ρ(t₁,t₂)/∂t₁∂t₂ = 2D δ(t₁ − t₂).   (12-69)
The Wiener process can be thought of as an infinitely dense sequence of infinitesimal
jumps (i.e., the limit of the random walk as both the step size and time to take a step approach
zero). As mentioned previously, Gaussian white noise is the generalized derivative of the
Wiener process. Hence, it seems plausible that wide-band Gaussian noise might be constructed
by using a very dense sequence of very narrow pulses, the areas of which are zero mean
Gaussian with a very small variance.
Example 12-8 (Construct a Wide-Band Gaussian Process): How can we construct a Gaussian
process with a large-as-desired bandwidth? In light of the discussion in this section, we should
try to construct a sequence of delta-like pulse functions that are assigned Gaussian amplitudes.
As our need for bandwidth grows (i.e., as the process bandwidth increases), the delta-like
pulse functions should (1) become increasingly dense in time and (2) have areas with
increasingly smaller variance. These observations result from the fact that the random walk
becomes an increasingly dense sequence of increasingly smaller jumps as it approaches the
Wiener process, the generalized derivative of which is Gaussian white noise.

Figure 12-8: Wide-band Gaussian random process x_Δt(t) composed of delta-like pulses with height x_k/√Δt and weight (area) x_k√Δt; the area of each pulse has a variance proportional to Δt.

We start with a discrete sequence of independent Gaussian random variables x_k, k ≥ 0,
E[x_k] = 0,   k ≥ 0,
E[x_k x_j] = 2D,   k = j
           = 0,    k ≠ j.   (12-70)
For Δt > 0 and k ≥ 0, we define the random process

x_Δt(t) ≡ x_k / √Δt,   kΔt ≤ t < (k+1)Δt,   (12-71)

a sample function of which is illustrated by Figure 12-8. As Δt approaches zero, our process becomes an increasingly dense sequence of delta-like rectangles (rectangle amplitudes grow like 1/√Δt) that have increasingly smaller weights (rectangle areas diminish like √Δt).
Clearly, from (12-70), we have E[x_Δt(t)] = 0. Also, the autocorrelation of our process approaches a delta function of weight 2D, since

R_{x_Δt}(t₁, t₂) = E[ x_Δt(t₁) x_Δt(t₂) ] = 2D/Δt,   kΔt ≤ t₁, t₂ < (k+1)Δt for some integer k
                                          = 0,       otherwise,   (12-72)

and

limit_{Δt→0} R_{x_Δt}(t₁, t₂) = 2D δ(t₁ − t₂).   (12-73)
Hence, by taking Δt sufficiently small, we can, at least in theory, create a Gaussian process of any desired bandwidth.
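A direct simulation sketch of this construction (assuming D = 0.5, so that E[x_k²] = 2D = 1): it builds x_Δt(t) from independent Gaussian x_k and checks that the empirical correlation is approximately 2D/Δt for two points in the same Δt slot and approximately zero otherwise, as in (12-72).

```python
import numpy as np

rng = np.random.default_rng(4)
D = 0.5
dt_slot = 0.01                                          # pulse width (the Delta-t of the construction)
M, K = 100_000, 50                                      # realizations, slots per realization
xk = np.sqrt(2.0 * D) * rng.standard_normal((M, K))     # E[x_k^2] = 2D, independent

def x_process(t):
    """x_{Delta t}(t) = x_k / sqrt(Delta t) on the slot k*dt_slot <= t < (k+1)*dt_slot."""
    k = int(np.floor(t / dt_slot))
    return xk[:, k] / np.sqrt(dt_slot)

same_slot = np.mean(x_process(0.102) * x_process(0.108))   # same slot, k = 10
diff_slot = np.mean(x_process(0.102) * x_process(0.212))   # different slots
print(f"same slot:  {same_slot:8.2f}   (theory 2D/dt = {2*D/dt_slot:.2f})")
print(f"diff slots: {diff_slot:8.2f}   (theory 0)")
```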
Mean Square Riemann Integral
Integrals of random processes crop up in many applications. For example, a slowly
varying signal may be corrupted by additive high-frequency noise. Sometimes, the rapid
fluctuation can be averaged or filtered out by an operation involving an integration. As a
second example, the integration of random processes is important in applications that utilize
integral operators such as convolution.
A partition of the finite interval [a, b] is a set of subdivision points t_k, k = 0, 1, ..., n, such that

a = t₀ < t₁ < ⋯ < t_n = b.   (12-74)
Also, we define the time increments

Δt_i = t_i − t_{i−1},   1 ≤ i ≤ n.

We denote such a partition as P_n, where n+1 is the number of points in the partition. Let Δ_n denote the upper bound on the mesh size; that is, define

Δ_n = max_k Δt_k,   (12-75)
a value that decreases as the partition becomes finer and n becomes larger.
For 1 ≤ k ≤ n, let t_k′ be an arbitrary point in the interval [t_{k−1}, t_k). For a finite-power random process X(t), we define the Riemann sum

Σ_{k=1}^{n} X(t_k′) Δt_k.   (12-76)
Now, the mean-square Riemann integral over the interval [a, b] is defined as

∫_a^b X(t) dt ≡ l.i.m_{Δ_n→0} Σ_{k=1}^{n} X(t_k′) Δt_k.   (12-77)
As Δ_n → 0 (the upper bound on the mesh size approaches zero), integer n approaches infinity, and the Riemann sum converges (in mean square) to the mean-square Riemann integral, if all goes well. As is the case for m.s. continuity and m.s. differentiability, a necessary and sufficient condition, involving ρ, is available for the existence of the m.s. Riemann integral.
Theorem 12-12: The mean-square Riemann integral (12-77) exists if, and only if, the ordinary double integral

∫_a^b ∫_a^b ρ(α, β) dα dβ   (12-78)

exists as a finite quantity.
Proof: Again, the Cauchy Convergence Criteria serves as the basis of this result. Let P_n and P_m denote two distinct partitions of the [a, b] interval. We define these partitions as

P_n:  a = t₀ < t₁ < ⋯ < t_n = b,   Δt_i = t_i − t_{i−1},   Δ_n = max_i Δt_i
P_m:  a = s₀ < s₁ < ⋯ < s_m = b,   Δs_j = s_j − s_{j−1},   Δ_m = max_j Δs_j.   (12-79)
Partition P_n has time increments denoted by Δt_i = t_i − t_{i−1}, 1 ≤ i ≤ n, and it has an upper bound on mesh size of Δ_n. Likewise, partition P_m has time increments of Δs_j = s_j − s_{j−1}, 1 ≤ j ≤ m, and it
has an upper bound on mesh size of Δ_m. According to the Cauchy Convergence Criteria, the
mean-square Riemann integral (12-77) exists if, and only if,
limit_{Δ_n→0, Δ_m→0} E[ ( Σ_{k=1}^{n} X(t_k′) Δt_k − Σ_{j=1}^{m} X(s_j′) Δs_j )² ] = 0,   (12-80)

where t_k′ and s_j′ are arbitrary evaluation points with t_{k−1} ≤ t_k′ < t_k for 1 ≤ k ≤ n, and s_{j−1} ≤ s_j′ < s_j for 1 ≤ j ≤ m.
Now, expand out the square and take expectations to see that

E[ ( Σ_{k=1}^{n} X(t_k′)Δt_k − Σ_{j=1}^{m} X(s_j′)Δs_j )² ]
  = Σ_{k=1}^{n} Σ_{i=1}^{n} ρ(t_k′, t_i′) Δt_k Δt_i − 2 Σ_{k=1}^{n} Σ_{j=1}^{m} ρ(t_k′, s_j′) Δt_k Δs_j + Σ_{j=1}^{m} Σ_{i=1}^{m} ρ(s_j′, s_i′) Δs_j Δs_i.   (12-81)
As Δ_n and Δ_m approach zero (the upper bounds on the mesh sizes approach zero), Equation (12-80) is true if, and only if,

limit_{Δ_n→0, Δ_m→0} Σ_{k=1}^{n} Σ_{j=1}^{m} ρ(t_k′, s_j′) Δt_k Δs_j = ∫_a^b ∫_a^b ρ(α, β) dα dβ.   (12-82)

Note that the cross term (12-82) must converge independently of the paths that Δ_n and Δ_m take as n → ∞ and m → ∞. If this happens, the first and third sums on the right-hand side of (12-81) also converge to the same double integral, and (12-80) converges to zero.
Example 12-7: Let X(t), t 0, be the Wiener process, and consider the m.s. integral
Y(t) = ∫₀ᵗ X(α) dα.   (12-83)
Recall that ρ(t₁,t₂) = 2D min(t₁,t₂), for some constant D. This can be integrated to produce

∫₀ᵗ ∫₀ᵗ ρ(α₁,α₂) dα₂ dα₁ = ∫₀ᵗ ∫₀ᵗ 2D min(α₁,α₂) dα₂ dα₁
  = 2D ∫₀ᵗ [ ∫₀^{α₁} α₂ dα₂ + ∫_{α₁}^{t} α₁ dα₂ ] dα₁
  = 2D t³/3.   (12-84)
The Wiener process X(t) is m.s. Riemann integrable (i.e., (12-83) exists for all finite t) since the
double integral (12-84) exists for all finite t.
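A quick Monte Carlo sketch (assuming D = 0.5 and a simple left-endpoint discretization of the time integral) checks this value: by the correlation formula for m.s. integrals given later in (12-91), E[Y(t)²] equals the double integral (12-84), i.e., 2Dt³/3.

```python
import numpy as np

rng = np.random.default_rng(5)
D, t_end = 0.5, 2.0
N, M = 1_000, 10_000                       # time steps, sample functions
dt = t_end / N
# Wiener paths with Var[X(t)] = 2*D*t
X = np.cumsum(np.sqrt(2.0 * D * dt) * rng.standard_normal((M, N)), axis=1)
Y = X.sum(axis=1) * dt                     # Riemann-sum approximation of Y(t_end)

print("simulated E[Y^2]:", np.mean(Y**2))
print("theory 2*D*t^3/3:", 2.0 * D * t_end**3 / 3.0)
```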
Example 12-8: Let Z be a random variable with E[Z²] < ∞. Let c_n, n ≥ 0, be a sequence of real numbers converging to the real number c. Then c_n Z, n ≥ 0, is a sequence of random variables. Example 11-9 shows that

l.i.m_{n→∞} c_n Z = cZ.
We can use this result to evaluate the m.s. Riemann integral

∫₀ᵗ 2Zτ dτ = l.i.m_{Δ_n→0} Σ_{k=0}^{n−1} 2Z τ_k′ (τ_{k+1} − τ_k)
           = 2Z limit_{Δ_n→0} Σ_{k=0}^{n−1} τ_k′ (τ_{k+1} − τ_k)
           = 2Z ∫₀ᵗ τ dτ = Z t².
Properties of the Mean-Square Riemann Integral
As stated in the introduction of this chapter, many concepts from the mean-square
calculus have analogs in the ordinary calculus, and vice versa. We point out a few of these
parallels in this section.
Theorem 12-13: If finite-power random process X(t) is mean-square continuous on [a, b] then it
is mean-square Riemann integrable on [a, b].
Proof: Suppose that X(t) is m.s. continuous for all t in [a, b]. From Theorem 12-3, ρ(t₁,t₂) is continuous for all a ≤ t₁ = t₂ ≤ b. From Corollary 12-3B, ρ(t₁,t₂) is continuous at all t₁ and t₂, where a ≤ t₁, t₂ ≤ b (not restricted to t₁ = t₂). But this is sufficient for the existence of the integral (12-78), so X(t) is m.s. Riemann integrable on [a, b].
Theorem 12-14: Suppose that X(t) is m.s. continuous on [a, b]. Then the function
Y(t) ≡ ∫_a^t X(α) dα,   a ≤ t ≤ b,   (12-85)

is m.s. continuous and differentiable on [a, b]. Furthermore, we have

Y′(t) = X(t).   (12-86)
Mean and Correlation of Mean Square Riemann Integrals
Suppose f(t,u) is a deterministic function, and X(t) is a random process. If the integral
Y(u) = ∫_a^b f(t,u) X(t) dt   (12-87)

exists, then
E[Y(u)] = E[ ∫_a^b f(t,u) X(t) dt ]
        = E[ l.i.m_{Δ_n→0} Σ_{k=1}^{n} f(t_k′,u) X(t_k′) Δt_k ]
        = limit_{Δ_n→0} E[ Σ_{k=1}^{n} f(t_k′,u) X(t_k′) Δt_k ],   (12-88)
where Δt_k ≡ t_k − t_{k−1}. For every finite n, the expectation and sum can be interchanged, so that (12-88) becomes
E[Y(u)] = limit_{Δ_n→0} Σ_{k=1}^{n} f(t_k′,u) E[X(t_k′)] Δt_k = ∫_a^b f(t,u) E[X(t)] dt.   (12-89)
In a similar manner, the autocorrelation of Y can be computed as

ρ_Y(u,v) = E[ ( ∫_a^b f(t,u) X(t) dt ) ( ∫_a^b f(s,v) X(s) ds ) ]
         = E[ ( l.i.m_{Δ_n→0} Σ_{k=1}^{n} f(t_k′,u) X(t_k′) Δt_k ) ( l.i.m_{Δ_m→0} Σ_{j=1}^{m} f(s_j′,v) X(s_j′) Δs_j ) ]
         = limit_{Δ_n→0, Δ_m→0} E[ Σ_{k=1}^{n} Σ_{j=1}^{m} f(t_k′,u) f(s_j′,v) X(t_k′) X(s_j′) Δt_k Δs_j ].   (12-90)
But, for all finite n and m, the expectation and double sum can be interchanged to obtain
ρ_Y(u,v) = limit_{Δ_n→0, Δ_m→0} Σ_{k=1}^{n} Σ_{j=1}^{m} f(t_k′,u) ρ_X(t_k′, s_j′) f(s_j′,v) Δt_k Δs_j
         = ∫_a^b ∫_a^b f(t,u) f(s,v) ρ_X(t,s) dt ds,   (12-91)

where ρ_X is the correlation function for process X.
By now, the reader should have realized what mean square calculus has to offer. Mean
square calculus offers, in a word, simplicity. To an uncanny extent, the theory of mean square
calculus parallels that of ordinary calculus, and it is easy to apply. Based on only the
correlation function, simple criteria are available for determining whether a process is m.s.
continuous, m.s. differentiable, and m.s. integrable.
Chapter 13: Series Representation of Random Processes
Let X(t) be a deterministic, generally complex-valued, signal defined on [0, T] with
∫₀ᵀ |X(t)|² dt < ∞.   (13-1)
Let φ_k(t), k ≥ 0, be a complete orthonormal basis for the vector space of complex-valued, square-integrable functions on [0, T]. The functions φ_k satisfy

∫₀ᵀ φ_k(t) φ_j*(t) dt = 1,  k = j
                      = 0,  k ≠ j.   (13-2)
Then, we can expand X(t) in the generalized Fourier series

X(t) = Σ_{m=1}^{∞} x_m φ_m(t)
x_m = ∫₀ᵀ X(t) φ_m*(t) dt   (13-3)

for t in the interval [0,T]. In (13-3), convergence is not pointwise. Instead, Equation (13-3) converges in the mean-square sense. That is, we have

limit_{N→∞} ∫₀ᵀ | X(t) − Σ_{k=1}^{N} x_k φ_k(t) |² dt = 0.   (13-4)
It is natural to ask if similar results can be obtained for finite power, m.s. Riemann
integrable random processes. The answer is yes. Obviously, for random process X(t), the
expansion coefficients x_k will be random variables. In general, the coefficients x_k will be pair-wise
correlated. However, by selecting the basis functions as the eigenfunctions of a certain integral operator, it is possible to ensure that the coefficients are pair-wise uncorrelated, a highly desirable condition that simplifies many applications. When the basis functions are chosen to make the coefficients uncorrelated, the series representation of X(t) is known as a Karhunen-Loève expansion. These types of expansions have many applications in the areas of communication and control.
Some Important Properties of the Autocorrelation Function
Random process X(t) has an autocorrelation function ρ(t₁,t₂), which we assume is continuous on [0, T]×[0, T]. Note that ρ is Hermitian; that is, the function satisfies ρ(t₁,t₂) = ρ*(t₂,t₁). Also, it is nonnegative definite, a result that is shown easily. Let f(t) be any function defined on the interval [0, T]. Then, we can define the random variable

x_f = ∫₀ᵀ X(t) f(t) dt.   (13-5)
The mean of x_f is

E[x_f] = ∫₀ᵀ m(t) f(t) dt,

a result that is zero under the working assumption that m(t) = E[X(t)] = 0. The variance of x_f is
Var[x_f] = E[ ( ∫₀ᵀ f(t) X(t) dt ) ( ∫₀ᵀ X*(σ) f*(σ) dσ ) ] = ∫₀ᵀ ∫₀ᵀ f(t) E[X(t)X*(σ)] f*(σ) dt dσ
         = ∫₀ᵀ ∫₀ᵀ f(t) ρ(t,σ) f*(σ) dt dσ.   (13-6)
Now, the variance of a random variable cannot be negative, so we conclude

∫₀ᵀ ∫₀ᵀ f(t) ρ(t,σ) f*(σ) dt dσ ≥ 0   (13-7)

for arbitrary function f(t), 0 ≤ t ≤ T. Condition (13-7) implies that autocorrelation function
ρ(t₁,t₂) is nonnegative definite. In most applications, the autocorrelation function is positive definite, in that

∫₀ᵀ ∫₀ᵀ f(t) ρ(t,σ) f*(σ) dt dσ > 0   (13-8)

for arbitrary functions f(t) that are not identically zero.
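A numerical aside (a sketch; the Wiener-process kernel is used only because its formula appears earlier in these notes): discretizing ρ(t,σ) = 2D min(t,σ) on a grid gives a matrix whose eigenvalues should all be nonnegative, the finite-dimensional analogue of (13-7).

```python
import numpy as np

D, T, n = 0.5, 1.0, 200
t = np.linspace(T / n, T, n)                     # grid on (0, T]
R = 2.0 * D * np.minimum.outer(t, t)             # discretized kernel rho(t, sigma)
eigvals = np.linalg.eigvalsh(R)                  # symmetric matrix -> real eigenvalues
print("smallest eigenvalue:", eigvals.min())     # nonnegative, as (13-7) requires
```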
We can define the linear operator A : L²[0,T] → L²[0,T] by the formula

A[x(t)] ≡ ∫₀ᵀ ρ(t,σ) x(σ) dσ   (13-9)

(recall that L²[0,T] is the vector space of square-integrable functions on [0,T]). The continuous, Hermitian, nonnegative definite autocorrelation function ρ forms the kernel of linear operator A. In the world of mathematics, A[·] is a commonly used Hilbert-Schmidt operator, and it is an example of a compact, self-adjoint linear operator (for definitions of these terms, see the appendix of R. Ash, Information Theory, the book An Introduction to Hilbert Space, by N. Young, or almost any book on Hilbert spaces and/or functional analysis).
Eigenfunctions and Eigenvalues of Linear Operator A
The eigenfunctions
k
and eigenvalues
k
of linear operator A satisfy A[
k
(t)] =
k

k
(t)
which is the same as

k k k
T
t t d ( ) ( , ) ( ) =
z

0
. (13-10)
In what follows, we assume that kernel (t,) is
EE603 Class Notes Version 1 John Stensby
603CH13.DOC 13-4
a) Hermitian (i.e. (t,) =

(,t)),
b) at least nonnegative definite (i.e. (13-7) holds),
c) continuous on [0, T][0, T],
d) satisfies ( , ) t dtd
T T

0 0
z z
< (this is a consequence of the continuity condition c).
Much is known about the eigenfunctions and eigenvalues of linear operator A. We state a number
of properties of the eignevectors/eigenvalues. Proofs that are not given here can be found in the
references cited above.
1. For a Hermitian, nonnegative definite, continuous kernel (t,), there exist at least one square-
integrable eigenfunction and one nonzero eigenvalue.
2. It is obvious that eigenfunctions are defined up to a multiplicative constant. So, we normalize
them according to (13-2).
3. If
1
(t) and
2
(t) are eigenfunctions corresponding to the same eigenvalue , then
1
(t) +

2
(t) is an eigenfunction corresponding to .
4. Distinct eigenvalues correspond to eigenfunctions that are orthogonal.
5. The eigenvalues are countable (i.e., a 1-1 correspondence can be established between the
eigenvalues and the integers). Furthermore, the eigenvalues are bounded. In fact, each
eigenvalue
k
must satisfy the inequality
inf ( ) ( , ) ( ) sup ( ) ( , ) ( )
f
T T
k
f
T T
f t t f dtd f t t f dtd
=

z z z z
<
1
0 0
1
0 0
(13-11)
6. Every nonzero eigenvalue has a finite-dimensional eigenspace. That is, there are a finite
number of linearly independent eigenfunctions that correspond to a given eigenvalue (
k
, 1 k
n, are linearly independent if
1

1
+
2

2
+ +
n

n
= 0 implies that
1
=
2
= =
n
= 0).
7. The eigenfunctions form a complete orthonormal basis of the vector space L
2
[0,T], the set of
all square integrable functions on [0, T]. If is not positive definite, there is a zero eigenvalue,
EE603 Class Notes Version 1 John Stensby
603CH13.DOC 13-5
and you must include its orthonormalized eigenfunction(s) to get a complete orthonormal basis of
L
2
[0,T] (use the Gram-Schmidt procedure here).
8. The eigenvalues are nonnegative. For a positive definite kernel (t,), the eigenvalues are
positive. To establish this claim, use (13-8) and (13-2) and write


i i i i
T
i i
T T
i i
T T
t t dt t t d dt
t t d dt
= =
L
N
M
O
Q
P
=

z z z
z z
( )[ ( )] ( ) ( , ) ( )
( ) ( , ) ( )
0 0 0
0 0
0

. (13-12)
This result is strictly positive if kernel (t,) is positive definite.
9. The sum of the eigenvalues is the expected value of the process energy in the interval [0, T].
That is
E X t dt t t dt
T
k
k
T
( ) ( , )
2
0
1
0
z

z
L
N
M
O
Q
P
= =
=

. (13-13)
With items 10 through 15, we want to establish Mercers theorem. This theorem states
that you can represent the autocorrelation function (t,) by the expansion
( , ) ( ) ( ) t t
k k k
k
=

=

1
. (13-14)
We will not give a rigorous proof of this result, but we will come close.
10. Let
1
(t) and
1
be an eigenfunction and eigenvalue pair for kernel (t,), the nonnegative
definite autocorrelation of random process X. Then
EE603 Class Notes Version 1 John Stensby
603CH13.DOC 13-6

1 1 1 1
( , ) ( , ) ( ) ( ) t t t

(13-15)
is the nonnegative-definite autocorrelation of the random process
X t X t t X d
T
1 1
0
1
( ) ( ) ( ) ( ) ( )
z

. (13-16)
To show this, first compute the intermediate result
E X t X E X t t X s s ds X X s s ds
E X t X E X s X t s ds t E X s X s ds
t E X s X s
T T
T T
1 1 1 1 1 1 1
0
1 2 1 2 2
0
1 2
0
1 2 2 1 1
0
1 1 1
1 1 1 2 1
( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( )
( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( )
( ) ( ) ( ) ( ) (



=
F
H
I
K

F
H
I
K
L
N
M
O
Q
P
=
+
z z
z z


s s ds ds
t t s s ds t s s ds
t s s s s ds ds
T T
T T
T T
1 1 2
0 0
1 2
1 2
0
1 2 2 1 1
0
1 1 1
1 1 1 2 1 1 1 2
0 0
1 2
) ( )
( , ) ( ) ( , ) ( ) ( ) ( , ) ( )
( ) ( ) ( , ) ( ) ( ) .



z z
z z
z z
=
+


(13-17)
Use (t,) =

(,t), and take the complex conjugate of the eigenfunction relationship to obtain

1 1 1
0

=
z
( ) ( , ) ( ) s s ds
T
(13-18)
With (13-18), the two cross terms on the right-hand-side of (13-17) become
= =

z z

1
0
1 1
0
1 1 1 1
( ) ( , ) ( ) ( ) ( , ) ( ) ( ) ( ) t s s ds t s s ds t
T T
(13-19)
EE603 Class Notes Version 1 John Stensby
603CH13.DOC 13-7
On the right-hand-side of (13-17), the double integral can be evaluated as
( , ) ( ) ( ) ( ) ( , ) ( )
( ) ( )
s s s s ds ds s s s s ds ds
s s ds
T T T T
T
1 2 1 1 1 2
0 0
1 2 1 1 1 2 1 2 2
0 0
1
1 1 1 1 1
0
1
1

z z z z
z
=
L
N
M
O
Q
P
=
=
. (13-20)
Finally, use (13-19) and (13-20) in (13-17) to obtain
E X t X t t
1 1 1 1 1
( ) ( ) ( , ) ( ) ( )

= , (13-21)
and this establishes the validity of (13-15).
11. As defined by (13-15),
1
(t,) may be zero for all t, . If not,
1
(t,) can be used as the
kernel of integral equation (13-10). This reformulated operator equation has a new
eigenfunction
2
(t) and new nonzero eigenvalue
2
(this follows from Property #1 above). They
can be used to define the new nonnegative definite autocorrelation function

2 1 2 2 2
( , ) ( , ) ( ) ( ) t t t

. (13-22)
Furthermore, the new eigenfunction
2
(t) is orthogonal to the old eigenfunction
1
(t). That

2
is nonnegative definite follows immediate from application of Property 10 with replaced by

1
. That
1
(t)
2
(t) follows from an argument that starts by noting

2 2 1 2
0
( ) ( , ) ( ) t t d
T
=
z
. (13-23)
Plug (13-15) into (13-23) and obtain
EE603 Class Notes Version 1 John Stensby
603CH13.DOC 13-8

2 2 2
0
1 1 1 2
0
( ) ( , ) ( ) ( ) ( ) ( ) t t d t d
T T
=
z z

. (13-24)
Multiply both sides of this equation by
1
*
(t) and integrate to obtain

2 1 2
0
1 2
0 0
1 1
2
0
1 2
0

z z z z z
= ( ) ( ) ( , ) ( ) ( ) ( ) ( ) ( )
*
t t dt t t d dt t dt d
T T T T T

(13-25)
=
L
N
M
O
Q
P


z z z

2 1
0 0
1 1 2
0
( ) ( , ) ( ) ( ) ( ) t t dt d d
T T T
.
Use (13-18) (which results from the Hermitian symmetry of ) to evaluate the term in the bracket
on the right-hand-side of Equation (13-25). This evaluation results in



2 1 2
0
2 1
0 0
1 1 2
0
2 1 1
0
1 1 2
0
1 1 1 2
0



z z z z
z z
z
=
L
N
M
O
Q
P

=
=
( ) ( ) ( ) ( , ) ( ) ( ) ( )
( ) ( ) ( ) ( )
( ) ( )
t t dt t t dt d d
d d
d
T T T T
T T
T

e j
. (13-26)
Since
1
*
-
1
= 0 (by Property 8, the eigenvalues are real valued), we have

2 1 2
0
0

z
= ( ) ( ) t t dt
T
. (13-27)
Since
2
0, we conclude that
1
(t)
2
(t), as claimed. In addition to being an eigenfunction-
eigenvalue pair for kernel
1
,
2
(t) and
2
are an eigenfunction and eigenvalue, respectively, for
kernel (as can be seen from (13-24) and the fact that
1

2
).
12. Clearly, as long as the resulting autocorrelation function is nonzero, the process outlined in
Property 11 can be repeated. After N such repetitions, we have orthonormal eigenfunctions
1
,
EE603 Class Notes Version 1 John Stensby
603CH13.DOC 13-9

N
and nonzero eigenvalues
1
, ,
N
. Furthermore, the N
th
-stage autocorrelation function is

N k k k
k
N
t t t ( , ) ( , ) ( ) ( )

=

1
. (13-28)
13.
N
(t,) may vanish, and the algorithm for computing eigenvalues may terminate, for some
finite N. In this case, there exist a finite number of nonzero eigenvalues, and autocorrelation
(t,) has a finite dimensional expansion of the form
( , ) ( ) ( ) t t
k k k
k
N
=

=

1
. (13-29)
for some N. In this case, the kernel (t,) is said to be degenerate; also, it is easy to show that
(t,) is not positive definite.
14. If the case outlined by 13) does not hold, there exists a countable infinite number of nonzero
eigenvalues. However,
N
(t,) converges as N . First, we show convergence for the special
case t = ; next, we use this special case to establish convergence for the general case, t . To
reduce notational complexity in what follows, define the partial sum
S t t t t
n m k k k
k n
m
n n m m
n n
m m
,
( , ) ( ) ( ) ( ) ( )
( )
( )



=
L
N
M
M
M
M
M
O
Q
P
P
P
P
P

L M . (13-30)
Consider the special case t = . Since
N
(t,t) 0, Equation (13-28) implies
0
1
< < S t t t t
N ,
( , ) ( , ) . (13-31)
EE603 Class Notes Version 1 John Stensby
603CH13.DOC 13-10
As a function of index N, the sequence S
1,N
(t,t) is increasing but always bounded above by (t,t),
as shown by (13-31). Hence, as N , both S
1,N
(t,t) and
N
(t,t) must converge to some limit.
For the general case t , convergence of S
1,N
(t,) can be shown by establishing the fact
that partial sum S
n,m
(t,) 0 as n, m (in any order). To establish this fact, consider partial
sum S
n,m
(t,) to be the inner product of two vectors as shown by (13-30); one vector contains the
elements
k k
t ( ) , n k m, and the second vector contains the elements
k k
( ) , n k
m. Now, apply the Cauchy-Schwartz inequality (see Theorem 11-4) to inner product S
n,m
(t,)
and obtain
S t t S t t S
n m k k k
k n
m
n m n m , , ,
( , ) ( ) ( ) ( , ) ( , ) =

. (13-32)
As N , the convergence of S
1,N
(t,t) implies that partial sum S
n,m
(t,t) 0 as n, m (in any
order). Hence, the right-hand-side of (13-32) approaches zero as n, m (in any order), and
this establishes the convergence of S
1,N
(t,) and (13-28) for the general case t .
15. As it turns out,
N
(t,) converges to zero as N , a claim that is supported by the
following argument. For each m N and fixed t, multiply
N
(t,) by
m
() and integrate to obtain

N m
T
m
T
k k k
k
N
m
T
m m m m
t d t d t d
t t
( , ) ( ) ( , ) ( ) ( ) ( ) ( )
( ) ( )


0 0
1
0
0
z z

z
=
L
N
M
O
Q
P
=
=

=
. (13-33)
For each fixed t and all m N,
N
(t,) has zero component in the
m
() direction. Equation
(13-33) leads to the conclusion
EE603 Class Notes Version 1 John Stensby
603CH13.DOC 13-11
limit
N
z
=
N m
T
t d ( , ) ( )
0
0 (13-34)
for each m 1. By the continuity of the inner product, we can interchange limit and integration in
(13-34) to see that

(t,) has no component in the


m
() direction, m 1. Since the
eigenfunctions
m
() span the vector space L
2
[0,T] of square-integrable functions, we see that

(t,) = 0. The argument we have presented supports the claim


( , ) ( ) ( ) t t
k k k
k
=

=

1
, (13-35)
a result known as Mercers theorem. In fact, the sum in (13-35) can be shown to converge
uniformly on the rectangle 0 t, T (see R. Ash, Information Theory, Interscience Publishers,
1965).
Karhunen-Love Expansion
In an expansion of the form (13-3), we show that the coefficients x
k
will be pair-wise
uncorrelated if, and only if, the basis functions
k
are eigenfunctions of (13-10) . Then, we show
that the series converges in a mean-square sense.
Theorem 13-1: Suppose that finite-power random process X(t) has an expansion of the form
X t t
X t d
m
m
m
m m
T
( ) ( )
( ) ( )
=
=
=

z
x
x
1
0


(13-36)
for some complete orthonormal set
k
(t), k 1, of basis functions. If the coefficients x
n
satisfy
EE603 Class Notes Version 1 John Stensby
603CH13.DOC 13-12
E n m
n m
n m n
x x

= =
=
,
, 0
(13-37)
(i.e., the coefficients are pair-wise uncorrelated and x
n
has a variance equal to eigenvalue
n
), then
the basis functions
n
(t) must be eigenfunctions of (13-9); that is, they must satisfy
( , ) ( ) ( ) , t d t
n
T
n n

0
z
= 0 t T. (13-38)
Proof: Multiply the expansion in (13-36) (the first equation in (13-36)) by x
n

, take the
expectation, and use (13-37) to obtain
E X t E t E t t
n m n
m
m n n n
( ) ( ) ( ) ( ) x x x x

=

= = =

1
2

n
. (13-39)
Now, multiply the complex conjugate of the second equation in (13-36) by X(t), and take the
expectation, to obtain
E X t E X t X d t d
n n
T
n
T
( ) ( ) ( ) ( ) ( , ) ( ) x

= =
z z

0 0
. (13-40)
Finally, equate (13-39) and (13-40) to obtain
( , ) ( ) ( ), t d
n
T
n n

0
z
= t 0 t T,
where
n
is given by (13-37). In addition to this result, the K-L coefficients will be orthogonal
if the orthonormal basis function satisfy (13-38).
Theorem 13-2: If the orthogonal basis functions
n
(t) are eigenfunctions of (13-38) the
EE603 Class Notes Version 1 John Stensby
603CH13.DOC 13-13
coefficients x
k
will be orthogonal.
Proof: Suppose the orthogonal basis functions
n
(t) satisfy integral equation (13-38). Compute
the expected value
E E X t t dt E X t t dt
m n
T
m m n
T
x x x x
n

=
R
S
T
U
V
W
L
N
M
O
Q
P
=
z z
( ) ( ) ( ) ( )
0 0
. (13-41)
Now, use (13-39) to replace the expectation in (13-41) and obtain
E t t dt
m m m n
T
m mn
x x
n

= =
z
( ) ( )
0
which shows that the coefficients are pair-wise uncorrelated. Theorems 13-1 and 13-2 establish
the claim that the x
k
will be uncorrelated if, and only if, the basis functions satisfy integral equation
(13-38). Next, we show mean square convergence of the K-L series.
Theorem 13-3: Let X(t) be a finite-power random process on [0, T]. The Karhunen-Love
expansion
X t t
X d
m
m
m
m m
T
( ) ( )
( ) ( )
=
=
=

z
x
x
1
0


, (13-42)
where the coefficients are pair-wise uncorrelated and the basis functions satisfy the integral
equation (13-38), converges in the mean square sense.
Proof: Evaluate the mean-square error between the series and the process to obtain
E X t t
m
m
m
( ) ( )
L
N
M
O
Q
P
=

x
1
2

EE603 Class Notes Version 1 John Stensby


603CH13.DOC 13-14
=
F
H
I
K
L
N
M
O
Q
P

F
H
I
K
L
N
M
O
Q
P
=



E X t X t t E t X t t
m
m
m n
n
n m
m
m
( ) ( ) ( ) ( ) ( ) ( ) x x x
1 1 1
. (13-43)
On the right-hand side of (13-43), the first term is
E X t X t t t t t t
m
m
m m
m
m m
( ) ( ) ( ) ( , ) ( ) ( )
F
H
I
K
L
N
M
O
Q
P
= =
=


x
1 1
0 (13-44)
(E[X(t) x
m

]= =
m

m
(t), first established by (13-39), was used here). The fact that the right hand
side of (13-44) is zero follows from Mercers Theorem (discussed in Property 15 above). On the
right-hand side of (13-43), the second term can be expressed as
E t X t t
E X t t E t t
t t t t
n
n
n m
m
m
n
n
n n m n m
n m
n n
n
n n n n
n
x x
x x x
=

F
H
I
K
L
N
M
O
Q
P
=
=
=
0 0
0 0 0
1 1
0



( ) ( ) ( )
( ) ( ) ( ) ( )
( ) ( ) ( ) ( )
. (13-45)
On the right-hand-side of (13-45), E[x
n
X*] was evaluated with the aid of (13-39); also the fact
that the coefficients are uncorrelated was used in (13-45). Equations (13-43) through (13-45)
imply
E X t t
m
m
m
( ) ( )
L
N
M
O
Q
P
=
=

x
1
2
0 , (13-46)
EE603 Class Notes Version 1 John Stensby
603CH13.DOC 13-15
so the K-L expansion converges in the mean square sense.
As it turns out, the K-L expansion need contain only eigenfunctions that correspond to
nonzero eigenvalues. Suppose (t) is an eigenfunction that corresponds to eigenvalue = 0.
Then the corresponding coefficient x has a second moment given by
E E X t t dt E X t X t dt d
t t dt d t t d dt
T T T
T T T T
xx



=
L
N
M
M
O
Q
P
P
=
L
N
M
O
Q
P
= =
L
N
M
O
Q
P
=
z z z
z z z z
( ) ( ) ( ) ( ) ( ) ( )
( , ) ( ) ( ) ( ) ( , ) ( )
.


0
2
0 0
0 0 0 0
0
(13-47)
That is, in the K-L expansion, the coefficient x of (t) has zero variance, and it need not be
included in the expansion.
Example 13-1 (K-L Expansion of the Wiener Process): From Chapter 6, recall that the
Wiener process X(t), t 0, has the autocorrelation function
( , ) min{ , } t t D t t
1 2 1 2
2 = , (13-48)
where D is the diffusion constant. Substitute (13-48) into (13-38) and obtain
2
0
D t d t
n n n
T
min{ , } ( ) ( ) =
z
(13-49)
2 2
0
D d Dt d t
n
t
n
t
T
n n
( ) ( ) ( )
z z
+ = , (13-50)
for 0 t T. With respect to t, we must differentiate (13-50) twice; the first derivative produces
EE603 Class Notes Version 1 John Stensby
603CH13.DOC 13-16
2D d t
n
t
T
n n
( ) ( )
z
= , (13-51)
where
n
denotes the time derivative of
n
. Differentiate (13-51) to obtain
+ =


n
n
n
t
D
t ( ) ( )
2
0 , (13-52)
a second-order differential equation in the eigenfunction
n
. A general solution of (13-52) is

n n n n n n
t t t D ( ) sin cos , / = + =
n
2 , (13-53)
where
n
,
n
and
n
are constants that must be chosen to so that
n
satisfies appropriate boundary
conditions. Evaluate (13-50) at t = 0 to see

n
( ) 0 0 = (13-54)
for all n. Because of (13-54), Equation (13-53) implies that all
n
0. In a similar manner,
Equation (13-51) implies that =
n
T) ( 0 , a result that leads to the conclusion


n
n
D n
T
n
T
= =

=
2 2 1
2
( ) ( )
, n =1, 2, 3, ... (13-55)
Equation (13-55) implies that the eigenvalues are given by

n
D T
n
=

2
2
2 2
( )
, n =1, 2, 3, ... (13-56)
And, the normalization condition (13-2) can be invoked to obtain
EE603 Class Notes Version 1 John Stensby
603CH13.DOC 13-17
( sin )

n n
T
n
t dt
T
0
2
2
2
1
z
= = , (13-57)
so that

n
T
=
2
. (13-58)
After using
n
0, (13-58) and (13-55) in Equation (13-53), the eigenfunctions can be expressed
as


n
T
t
T
n t ( ) sin , =
2
b g
e j
0 t T. (13-59)
Finally, the K-L expansion of the Wiener process is
X t
T
n t
T
n
( ) sin ( ) , =
=

2
1
x
n

e j
0 t T, (13-60)
where the uncorrelated coefficients are given by
x
n
=
z
2
0
T
X t n t dt
T
T
( ) sin ( )

e j
. (13-61)
Furthermore, the coefficient x
n
has variance
n
given by (13-56).
Example 13-2: Consider the random process
X t A t ( ) cos = +
0
, (13-62)
EE603 Class Notes Version 1 John Stensby
603CH13.DOC 13-18
where A and
0
are constants, and is a random variable that is uniformly distributed on (-, ].
As shown in Chapter 7, the autocorrelation of X(t) is
( ) cos =
A
2
0
2
, (13-63)
a function with period T
0
= 2/
0
. Substitute (13-63) into (13-10) to obtain
A
t d
n
T
n n
2
0
0
2
0
cos ( ) ( ) ( ), =
z
t 0 t T
0
. (13-64)
The eigenvalues and eigenfunctions are found easily. First, use Mercers theorem to write
( ) ( ) ( )
cos ( ) cos cos sin sin
t t
A
t
A
t
A
t
k k k
k
=
= = +



1
2
0
2
0 0
2
0 0
2 2 2
. (13-65)
Note that this kernel is degenerate. After normalization, the eigenfunctions, that correspond to
nonzero eigenvalues, can be written as


1 0
2 0
2
2
( ) / cos
( ) / sin
t T t
t T t
=
=
. (13-66)
Both of these eigenfunctions correspond to the eigenvalue = TA
2
/4; note that = TA
2
/4 has an
eigenspace of dimension two. Also, note that there are a countably infinite number of
eigenfunctions in the null space of the operator. That is, for k 1, the eigenfunctions
EE603 Class Notes Version 1 John Stensby
603CH13.DOC 13-19


1 0
2 0
2
2
k
k
t T k t
t T k t
( ) / cos
( ) / sin
=
=
(13-67)
correspond to the eigenvalue = 0. The K-L expansion of random process X(t) is
X t T t T t ( ) / cos / sin = + x x
1 0 0 2 0 0
2 2 , (13-68)
where
x
x
1 0
2 0
2
2
= +
=
A T
A T
/ cos
/ sin

. (13-69)
As expected, we have
E E
A T
E
A T A T
d
E x E
A T A T
E x E
A T A T
[ ] sin cos sin sin( )
[ ] cos
[ ] sin
x x
1 2
2
0
2
0
2
0
0
2
1
2
2
0 2
2
0
2
2
2
0 2
2
0
2 4
2
4
1
2
2 0
2 4
2 4
=
L
N
M
M
O
Q
P
P
=
L
N
M
M
O
Q
P
P
= =
=
L
N
M
M
O
Q
P
P
=
=
L
N
M
M
O
Q
P
P
=
z

. (13-70)
K-L Expansion for Processes with Rational Spectrums
Suppose X(t) is a wide sense stationary process with a rational power spectrum. That is,
the power spectrum of X can be represented as
EE603 Class Notes Version 1 John Stensby
603CH13.DOC 13-20
S( )
( )
( )

=
N
D
2
2
, (13-71)
were N and D are polynomials. Such a process occurs if white noise is passed through a linear,
time-invariant filter. Hence, many applications are served well by modeling their processes as
having a rational power spectrum.
As it turns out, a process with a rational power spectrum can be expanded in a K-L
expansion where the eigenfunctions are non-harmonically related sine and cosine functions. For
such a case, the eigenvalues and eigenfunctions can be found. The example that follows illustrates
a general method for solving for the eigenfunctions.
Example 13-4: Let X(t) be a process with power spectrum
S( ) ,


=
+

2 P
2 2
- < < , P > 0, > 0. (13-72)
Process X(t) has the autocorrelation function
( ) ( ) exp( ) = =

F S
1
P . (13-73)
For the related eigenfunction/eigenvalue problem, the integral equation is
Pe u du t
t u
T
T

z
=

( ) ( ), - T t T . (13-74)
An analysis leading to the eigenvalues and eigenfunctions is less complicate if a symmetric interval
[-T,T] is used (of course, our expansion will be valid on [0, T]). We can write (13-74) as


( ) ( ) ( ) ,
( ) ( )
t Pe u du Pe u du
t u
T
t
u t
t
T
= + =


z z
- T t T. (13-75)
EE603 Class Notes Version 1 John Stensby
603CH13.DOC 13-21
With respect to t, differentiate (13-75) to obtain

d
dt
t P e e u du P e e u du
t u t u
t
T
T
t
( ) ( ) ( ) = +

z z
. (13-76)
Once again, differentiate (13-76) to obtain




d
dt
t P e e u du P e e t
P e e u du P e e t
t u t t
T
t
t u t t
t
T
2
2
2
2
( ) ( ) ( )
( ) ( )
=
+


z
z
, (13-77)
which can be written as

d
dt
t P e u du P t
t u
T
T 2
2
2
2 ( ) ( ) ( ) =

z
. (13-78)
Now, multiply (13-74) by
2
and use the product to eliminate the integral in (13-78); this
procedure results in

d
dt
t P t
2
2
2
2 ( ) ( / ) ( ) = . (13-79)
There are no zero eigenvalues since is positive definite. Inspection of (13-79) reveals that the
three cases
i) 0 < < 2P/,
ii) = 2P/,
iii) < 2P/
must be considered.
Case i) 0 < < 2P/
EE603 Class Notes Version 1 John Stensby
603CH13.DOC 13-22
We start by defining
b
P
2
2
2


<

( / )
, 0 < b
2
, (13-80)
which can be solved for



=
+
2P
jb jb ( )( )
. (13-81)
In terms of b, the general, complex-valued solution of (13-79) is
( ) t c e c e
jbt jbt
= +

1 2
, (13-82)
where c
1
and c
2
are complex constants. Plug (13-82) into integral equation (13-75) to obtain


c e c e
Pe e c e c e du Pe e c e c e du
Pe c
e
jb
c
e
jb
Pe c
e
jb
c
e
jb
P
c e
jb
c e
jb
c e
jb
jbt jbt
t u jbu jbu
T
t
t u jbu jbu
t
T
t
jb u jb u
u T
u t
t
jb u jb u
u t
u T
jbt jbt jbt
1 2
1 2 1 2
1 2 1 2
1 2 1
+
= + + +
=
+
+

L
N
M
M
O
Q
P
P
+
+
+

L
N
M
M
O
Q
P
P
=
+
+

+
=
=
+
=
=

z z
( ) ( ) ( ) ( )
c e
jb
Pe c
e
jb
c
e
jb
Pe c
e
jb
c
e
jb
jbt
t
jb T jb T
t
jb T jb T
2
1 2 1 2

+ + +
+
L
N
M
M
O
Q
P
P

+
+

L
N
M
M
O
Q
P
P
+
+
+

L
N
M
M
O
Q
P
P

( ) ( ) ( ) ( )
.
(13-83)
Now, substitute (13-81) for on the left-hand-side of (13-83); then, cancel out like terms to
EE603 Class Notes Version 1 John Stensby
603CH13.DOC 13-23
obtain the requirement
0
1 2 1 2
=
+
+

L
N
M
M
O
Q
P
P

+
+

L
N
M
M
O
Q
P
P

+ +
e c
e
jb
c
e
jb
e c
e
jb
c
e
jb
t
jb T jb T
t
jb T jb T



( ) ( ) ( ) ( )
. (13-84)
We must find the values of b (i.e., the frequencies of the eigenfunctions) for which equality is
achieved in (13-84) . Note that both bracket terms must vanish identically to achieve equality for
all time t. However, for c
1
c
2
, neither bracket will vanish for any real b. Hence, we require c
1
= c
2
in order to obtain equality in (13-84). First, consider c
1
= -c
2
; to zero the first bracket term
we must have
e
jb
e
jb
e jb e jb
jb jb
e e jb e e
jb jb
j bT) jb bT)
jb jb
jbT jbT jbT jbT
jbT jbT jbT jbT


+

=
+
+
=
+
+
=

+
=



( ) ( )
( )( ) ( )( )
sin( cos(
( )( )
2 2
0
. (13-85)
To obtain zero in this last expression, we must have
sin( cos( bT) b bT) + = 0 . (13-86)
Finally, this leads to the requirement
tan( / bT) b = . (13-87)
With c
1
= -c
2
, the second bracket in (13-84) is zero if (13-87) holds. Hence, the values of b that
solve (13-87) are roots of (13-84), and they are frequencies of the eigenfunctions.
EE603 Class Notes Version 1 John Stensby
603CH13.DOC 13-24
Next, we must analyze the case c
1
= c
2
(which is similar to the case c
1
= -c
2
just finished).
For c
1
= c
2
, we get
tan( / bT) b = . (13-88)
Hence, the permissible frequencies of the eigenfunction are given by the union
b bT) b b bT) b = = 0 0 : tan( / : tan( /
U
. (13-89)
These frequencies can be found numerically. Figure 13-1 depicts graphical solutions of
(13-89) for the first nine frequencies. A value of T = 2 was used to construct the figure. Note
T = 2
-6
-5
-4
-3
-2
-1
0
1
2
4 3 2
bT
2
3
2
5
2
7
2

b
1
b
8
b
6
b
4
b
2
b
9
b
7
b
5
b
3
Y
Y

=

-
b
T
/

T
Y = T/bT
Figure 13-1: Graphical display of the b
k
, the frequencies of the
eigenfunctions.
EE603 Class Notes Version 1 John Stensby
603CH13.DOC 13-25
that b
k
, k odd, form a decreasing sequence of positive numbers, while b
k
, k even, form a
decreasing sequence of negative numbers.
Once the frequencies b
k
are found, they can be used to determine the eigenvalues

k
k
P
b
=

2
2 2
( )
, k =1, 2, 3, L (13-90)
The frequencies b
k
, k odd, were obtained by setting c
1
= c
2
. For this case, (13-82) yields

k k k
t b t ( ) cos , = l k odd , (13-91)
where constant l
k
is chosen to normalize the eigenfunction. That is, l
k
must satisfy
l
k
2 2
1 cos ( ) b t dt
k
T
T

z
= , (13-92)
which leads to
l
k
=
+

1
1 2 T Sa b T)
k
[ (
, - T t T, k odd , (13-93)
where Sa(x) sin(x)/x. Hence, for k odd, we have the eigenfunctions

k
k
k
t
T Sa b T)
b t ( )
[ (
cos , =
+

1
1 2
- T t T, k odd . (13-94)
The frequencies b
k
, k even, were obtained by setting c
1
= -c
2
. An analysis similar to the
case just presented yields the eigenfunctions
EE603 Class Notes Version 1 John Stensby
603CH13.DOC 13-26

k
k
k
t
T Sa b T)
b t ( )
[ (
sin , =


1
1 2
- T t T, k even . (13-95)
Observations:
1. Eigenfunctions are cosines and sines at frequencies that are not harmonically related.
2. For each n, the value of b
n
T is independent of T. Hence, as T increases, the value of b
n
decreases, so the frequencies are inversely related to T.
3. As bT increases, the upper intersections (the odd integers k) occur at approximately (k-1)/2,
and the lower intersections occur at approximately (k-1)/2, k even. Hence, the higher index
eigenfunctions are approximately a set of harmonically related sines and cosines. For large k
we have

k
k
k
t
T Sa b T)
k
T
t
T Sa b T)
k
T
t
( )
[ (
cos
( )
,
[ (
sin
( )
,


1
1 2
1
2
1
1 2
1
2
- T t T, k odd
- T t T, k even
(13-96)
This concludes the case 0 < < 2P/.
Case ii) = 2P/
For this case, Equation (13-79) becomes
2
0
2
2
P
d
dt
t

( ) = . (13-97)
Two independent solutions to this equation are (t) = t and (t) = 1. By direct substitution, it is
seen that neither of these satisfy integral equation (13-74). Hence, this case yields no
eigenfunctions and eigenvalues.
Case iii) > 2P/
EE603 Class Notes Version 1 John Stensby
603CH13.DOC 13-27
For this case, Equation (13-79) becomes
d
dt
t
P
t
2
2
2
2

( )
( / )
( ) =

(13-98)
This equation two independent solutions given by

1
2
( )
( )
t e
t e
t
t
=
=

, (13-99)
where


>
2
2
0
( / ) P
. (13-100)
By direct substitution, it is seen that neither of these satisfy integral equation (13-74). Hence, this
case yields no eigenfunctions and eigenvalues.
Example 13-5: In radar detection theory, we must detect the presence of a signal given a T-
second record of receiver output data. There are two possibilities (termed hypotheses). First, the
record may consist only of receiver noise; no target is present for this case. The second possibility
is that the data record contains a target reflection embedded in the receiver noise; in this case, a
target is present. You must filter the record of data and make a decision regarding the
presence/absence of a target.
Let (t), 0 t T, denote the record of receiver output data. After receiving the
complete time record, we must decide between the hypotheses
EE603 Class Notes Version 1 John Stensby
603CH13.DOC 13-28
H
H s
0
1
:
:
(t) = (t), only noise - target not present
(t) = (t) + (t), signal + noise - target is present


. (13-101)
Here, (t) is zero-mean Gaussian noise that is described by positive definite correlation function
(t,). Note that we allow non-white and non-stationary noise in this example. s(t) is the
reflected signal, which we assume to be known (usually, s(t) is a scaled and time-shifted version
of the transmitted signal). At time T, we must decide between H
0
and H
1
.
We expand the received signal (t) in a K-L expansion of the form


( ) ( )
( ) ( )
t t
t t dt
k
k
k
k k
T
=

z
1
0
, (13-102)
where
k
(t) are the eigenfunction of (13-10), an integral equation that utilizes kernel (t,)
describing the receiver noise. The
k
are uncorrelated Gaussian random variables with variance
equal to the positive eigenvalues of the integral equation; that is, VAR[
k
] =
k
.
The received signal (t) may be only noise, or it may be signal + noise. Hence, the
conditional mean of
k
is
E E t t dt
E E t t t dt s
k
T
k
T
k


k
k
Y
Y
0
1
H
H s
=
L
N
M
O
Q
P
=
= +
L
N
M
O
Q
P
=
z
z
( ) ( )
( ) ( ) ( )
0
0
0
b g
, (13-103)
where
s t t dt
k k
T
=
z
s( ) ( )
0
EE603 Class Notes Version 1 John Stensby
603CH13.DOC 13-29
are the coefficients in the expansion
s( ) ( ) t
k
=
=

s t
k k

0
(13-104)
of signal s(t). Under both hypotheses,
k
has a variance given by
Var Var
k k k
Y Y H H
0
= =
1
(13-105)
To start with, our statistical test will use only the first n K-L coefficients
k
, 1 k n.
We form the vector
r
L V
n
T
=
1 2
(13-106)
and the two densities
P V P V
P V P V s
k
k
n
k k
k
n
k
k
n
k k k
k
n
0
1
2
1
1 1
1
2
1
2
2
( ) ( ) ( ) exp /
( ) ( ) ( ) exp ( ) /

r r
r r
=
L
N
M
M
O
Q
P

F
H
G
I
K
J
=
L
N
M
M
O
Q
P

F
H
G
I
K
J

= =

= =


Y
Y
H
H
0


. (13-107)
P
0
(alternatively, P
1
) is the density for the n coefficients when H
0
(alternatively, H
1
) is true.
We will use a classical likelihood ratio test (see C.W. Helstrom, Statistical Theory of
Signal Detection, 2
nd
edition) to make a decision between H
0
and H
1
. First, given V
r
, we compute
the likelihood ratio
EE603 Class Notes Version 1 John Stensby
603CH13.DOC 13-30
( )
( )
( )
exp ( ) /
r
r
r V
P V
P V
s s
k k k k
k
n
=
L
N
M
M
O
Q
P
P
=

1
0
2
1
2 2 . (13-108)
in terms of the known s
k
and
k
. Then, we compare the computed to a user-defined threshold

0
to make our decision (there are several well-know methods for setting the threshold
0
). We
decide hypothesis H
1
if exceeds the threshold, and H
0
if is less than the threshold. Stated
tersely, our test can be expressed as
( )
r
V
H
H
1
0
0
>
<
. (13-109)
The inequality (13-109) will be unchanged, and the decision process will not be affected, if
we take any monotone function of (13-109). Due to the exponential functions in (13-108), we
take the logarithm of this equation and obtain
G
s s
s G
n
k
k
k
k
n
k
k
k
k
n
n

O
Q
P
>
<
+
O
Q
P

= =

1
1
0
1
0
0
H
H
ln . (13-110)
To simplify (13-110), we define q
k
s
k
/
k
. The q
k
are coefficients in the generalized
Fourier series expansion of a function q(t); that is, the coefficients q
k
determine the function
q t q t
k k
k
( ) ( )
=


1
. (13-111)
As will be discussed shortly, function q(t) is the solution of an integral equation based on kernel
EE603 Class Notes Version 1 John Stensby
603CH13.DOC 13-31
. In terms of the coefficients q
k
, (13-110) can be written as
G q q s G
n k k
k
n
k k
k
n
n

>
<
+
= =

1
1
0
1
0
0
H
H
ln (13-112)
The two sums in (13-112) converge as n . By a general form of Parsevals theorem,
we have
limit
limit
n
n

z
=
=
q q t t dt
q s q t t dt
k k
k
n
T
k k
k
n
T

1
0
1
0
( ) ( )
( ) ( ) s
. (13-113)
Use (13-113), and take the limit of (13-112) to obtain the decision criteria
G q t t dt q t s t dt
T T

>
<
+
z z
( ) ( ) ln ( ) ( )
0
1
0
0
0
H
H
. (13-114)
As shown on the left-hand-side of Equation (13-114), statistic G can be computed once data
record (t), 0 t T is known. Then, to make a decision between hypothesis H
0
and H
1
, G is
compared to the threshold obtained by computing the right-hand-side of (13-114).
Statistic G can be obtained by a filtering operation, as illustrated by Figure 13-3. Simply
pass received signal (t) through a filter with impulse response
h t q T t ( ) ( ), 0 t T, (13-115)
EE603 Class Notes Version 1 John Stensby
603CH13.DOC 13-32
and sample the filter output at t = T (the end of the integration period) to obtain the statistic G.
This is the well known matched filter for signal s(t) embedded in Gaussian, nonstationary,
correlated noise described by correlation function (t,).
As described above, function q(t) has expansion (13-111) with coefficients q
k
s
k
/
k
.
However, we show that q(t) is the solution of a well-known integral equation. First, write
(13-111) with as the time variable. Then, multiply the result by (t,), and integrate from = 0
to = T to obtain
( , ) ( ) ( , ) ( ) ( ) t q d q t d
s T
k k
T
k
k
k
k k
k


0 0
1 1
z z

= =
L
N
M
O
Q
P
=

, (13-116)
where s
k
/
k
has been substituted for q
k
. On the right-hand-side, cancel out the eigenvalue
k
, and
use (13-104) to obtain the integral equation
( , ) ( ) ( ), t q d t
T

0
z
= s 0 t T, (13-117)
for the match filter impulse response q(t). Equation (13-117) is the well-know Fredholm integral
equation of the first kind.
h(t) = q(T-t), 0 t T
(t)
Sample
@ t = T
G q t t dt q t s t dt
T T

>
<
+
z z
( ) ( ) ln ( ) ( )
0
1
0
0
0
H
H

G q t t dt
T
=
z
( ) ( )
0
a)
b)
Figure 13-3: a) Statistic G generated by a filtering operation that is matched
to the signal and noise environment. b) Statistical test description.
EE603 Class Notes Version 1 John Stensby
603CH13.DOC 13-33
Special Case: Matched Filter for Signal in White Gaussian Noise
We consider the special case where the noise is white with correlation
( ) ( ) =
2
. (13-118)
The Fredholm integral equation is solved easily for this case; simply substitute (13-118) into
(13-117) and obtain
q t t ( ) ( ) / , = s
2 2
0 t T. (13-119)
So, according to (13-115), the matched filter for the white Gaussian noise case is
T
1
t t T
1
s(t) h(t)
a) b)
t
F
i
l
t
e
r

O
u
t
p
u
t
T 2T
T/3
Sampling Time to Produce Statistic G
c)
Figure 13-4: a) Signal s(t), b) matched filter impulse response h(t) and c) filter output for the
case
2
= 1. Note that the filter output is sampled at t = T to produce the decision statistic G.
EE603 Class Notes Version 1 John Stensby
603CH13.DOC 13-34
h t T t ( ) ( ) / = s
2
, (13-120)
a folded, shifted and scaled version of the original signal. Figure 13-4 illustrates a) signal s(t), b)
matched filter h(t) and c) filter output, including the sample point t = T, for the case
2
= 1.
EE603 Class Notes Version 1.0 John Stensby
603CH14.DOC 14-1
Chapter 14: Markov Process Dynamic System State Model
There are many applications of dynamic systems driven by stationary Gaussian noise. In
these applications, the system is modeled by a differential equation with a Gaussian noise forcing
function. If the system is linear, the system state vector is Gaussian, and its mean and covariance
matrix can be found easily (numerical methods may be necessary in the case of linear time varying
systems). If the system is nonlinear, the system state vector is not generally Gaussian, and
modeling/analyzing the system becomes more difficult. This chapter introduces theory and
techniques that are useful in modeling/analyzing linear/nonlinear systems driven by Gaussian
noise.
Often, the bandwidth of a noise forcing function is large compared to the bandwidth of the
system driven by the noise. To the system, the noise forcing function looks white; its spectral
density looks flat over the system bandwidth, even thought it does not have an infinite
bandwidth/power. Under these circumstances, it is common to model the noise forcing function
as white Gaussian noise. In general, this modeling assumption (known as the diffusion
approximation in the literature) simplifies system analysis, and it allows the problem at hand to be
treated by a vast body of existing knowledge.
All lumped-parameter dynamic systems (i.e., systems that can be modeled by a differential
equation), be they linear or nonlinear, time-varying or time-invariant, have an important feature in
common. Assuming white Gaussian noise excitation, all lumped-parameter dynamic systems have
a state vector that can be modeled as a Markov process. Roughly speaking, what this means is
simple: Given the system state at time t
0
and the input noise for t t
0
, one can determine the
system state for t t
0
. Future values of the system state can be determined using the present
value of the system state and the input noise; past values of the state are not necessary.
For a lumped-parameter dynamic system driven by white Gaussian noise, we are interested
in determining the probability density function that describes the system state vector. In general,
this density function evolves with time (the system state is a nonstationary process), starting from
some known density at t = 0. As discussed in this chapter, the desired density function satisfies a
EE603 Class Notes Version 1.0 John Stensby
603CH14.DOC 14-2
partial differential equation known as the Fokker-Planck equation.
This chapter is devoted to laying the foundation for the analysis of this Markov state
model. First, from Chapter 6 of these class notes, the classical random walk is reviewed; it is a
simple example of a discrete Markov process. As step size and the time between successive steps
approach zero, the random walk approaches the Wiener process, a simple continuous-time
Markov process. The Wiener process is described by a probability density function that satisfies
the diffusion equation, a simple example of a Fokker-Planck equation. After discussing this
simple example, a more general first-order system model is introduced, and the Fokker-Planck
equation is developed that describes the model.
The Random Walk - A Simple Markov Process
Suppose a man takes a random walk on a straight-line path; he starts his walk m steps to
the right of the origin. With probability p (alternatively, q 1 - p), he takes a step to the right
(alternatively, left). Suppose that each step is of length l meters, and each step is completed in
s
seconds. After N steps (completed in N
s
seconds), the man is located X
d
(N) steps from the
origin; note that N + m X
d
(N) N + m since the man starts at m steps to the right of the
origin. If X
d
(N) is positive (negative), the man is located to the right (left) of the origin.
The quantity P[X
d
(N) = n(X
d
(0) = m] denotes the probability that the man's location is n
steps to the right of the origin, after N steps, given that he starts at m steps to the right of the
origin. The calculation of this probability is simplified greatly by the assumption, implied in the
previous paragraph, that the man takes independent steps. That is, the direction taken at the N
th
step is independent of X
d
(k), 0 k N 1, and the directions taken at all previous steps. Also
simplifying the development is the assumption that probability p does not depend on step index N.
A formula for P[X
d
(N) = n(X
d
(0) = m] is developed in Chapter 6 of these class notes; this
development is summarized here. Let n m, so that denotes the man's net increase in the
number of steps to the right after he has completed N steps. Also, R
nm
(alternatively, L
nm
) denotes
the number of steps to the right (alternatively, left) that are required if the man starts and finishes
m and n, respectively, steps from the origin. Then, it is easily seen that
EE603 Class Notes Version 1.0 John Stensby
603CH14.DOC 14-3
R
N
L
N
nm
nm

2
2
(14-1)
if (( N and N + , N are even; otherwise, integers R
nm
and L
nm
do not exist. In terms of
these integers, the desired result is
P P [ X (N) n X ( ) m] [ R steps to the right out of N steps]
N!
R ! L !
d d
nm nm
nm

Y 0
p q
R L
nm nm
(14-2)
if integers R
nm
and L
nm
exist, and
P[X (N) n X ( ) m]
d d
Y 0 0 (14-3)
if R
nm
and L
nm
do not exist.
For Npq >> 1, an asymptotic approximation is available for (14-2). In the development
that follows, it is assumed that p = q = 1/2. According to the DeMoivre-Laplace, for N/4 >> 1
and (R
nm
- N/2( < N/ 4 , the approximation
P[X (N) n X ( ) m]
N!
R ! L !
( ) ( )
(N/ 4)
exp
(R N/ 2)
2(N/ 4)
d d
nm nm
R L nm
nm nm


L
N
M
M
O
Q
P
P
Y 0
1
2
1
2
1
2
2

(14-4)
can be made.
Limit of the Random Walk - the Wiener Process
Recall that each step corresponds to a distance of l meters, and each step is completed in
s
seconds. At time t = N
s
, let X(N
s
) denote the man's physical displacement from the origin.
EE603 Class Notes Version 1.0 John Stensby
603CH14.DOC 14-4
Then X(N
s
) is a random process given by X(N
s
) lX
d
(N), since X
d
(N) denotes the number of
steps the man is from the origin after he takes N steps. Note that X(N
s
) is a discrete-time
random process that takes on only discrete values.
For large N and small l and
s
, the probabilistic nature of X(N
s
) is of interest. First, note
that P[X(N
s
) = ln(X(0) = lm] = P[X
d
(N) = n(X
d
(0) = m]; this observation and the Binomial
distribution function leads to the result
P P [ (N ) ( ) ] = [Number of Steps to Right R ]
=
nm
R
nm
X X
s
k 0
k k
k

F
H
G
I
K
J

l l n m
N
( ) ( ) .
N
Y 0
1
2
1
2
(14-5)
For large N, the DeMoivre-Laplace leads to the approximation
P[ (N ) ( ) ] = X X

s

F
H
G
I
K
J
F
H
G
I
K
J

z
l l n m
R N/
N/ N
exp
nm
/
Y 0
2
4
1
2
1
2
2
G G u du
N
(14-6)
where G is the distribution function for a zero-mean, unit-variance Gaussian random variable.
The discrete random walk process outlined above has the continuous Wiener process as a
formal limit. To see this, let l 0,
s
0 and N in such a manner that
l
l
l
2
s
s
s
2
(t) ( )

D
t N
x n
x m
N ,
0
X X
(14-7)
EE603 Class Notes Version 1.0 John Stensby
603CH14.DOC 14-5
where D is known as the diffusion constant. In terms of D, x, x
0
and t, the results of (14-7) can
be used to write

N
(x x ) /
t /
(x x )
Dt
.


0 0
2
l
s
(14-8)
The probabilistic nature of the limiting form of X(t) is seen from (14-6) and (14-8). In the limit,
the process X(t) is described by the first-order conditional distribution function
F x t x u du
x x Dt
( ; ) exp
( )/
Y
0
1
2
2
2 1
2
0

z
(14-9)
and the first-order conditional density function
f(x, t x )
Dt
exp
(x x )
4 Dt
Y
0
2
1
4


L
N
M
M
O
Q
P
P

0
. (14-10)
When X(0) = x
0
= 0, this result describes the conditional probability density function of a
continuous-time Wiener process. Clearly, process X(t) is Gaussian, and it is nonstationary since it
has a variance that grows with time. X(t) is continuous, with probability one, but its sample
functions are nowhere differentiable. It is a simple example of a diffusion process.
The Diffusion Equation For the Transition Density Function
As discussed in Chapter 6 of these notes, the conditional density (14-10) satisfies the one-
dimensional diffusion equation

t
f(x, t x ) D f(x, t x ) Y Y
0 0
x

2
2
(14-11)
EE603 Class Notes Version 1.0 John Stensby
603CH14.DOC 14-6
with initial condition
f x t x x x ( , ) ( ) Y Y
Y
0
0
t 0
(14-12)
and boundary condition
f x t x
x
( , ) Y Y
Y
0
0
t
. (14-13)
Initial condition (14-12) means that process X starts at x
0
. Boundary condition (14-13) implies
that probability cannot accumulate at infinity; often, (14-13) is referred to as natural boundary
conditions.
Diffusion equation (14-11) describes how probability diffuses (or flows) with time. To
draw this analogy, note that f describes the density of probability (or density of probability
particles) on the one-dimensional real line. That is, f can be assigned units of particles/meter.
Since D has units of meters
2
/second, a unit check on both sides of (14-11) produces
1
second
particles
meter
meter
second
1
meter
2
particles
meter
2
e j e je j
F
H
I
K

F
H
I
K
. (14-14)
Diffusion phenomenon is a transport mechanism that describes flow in many important
applications (heat, electric current, molecular, etc.).
Equation (14-11) implies that probability is conserved in much the same way that the well-
know continuity equation implies the conservation of electric charge. Write (14-11) as

t
f , (14-15)
where
EE603 Class Notes Version 1.0 John Stensby
603CH14.DOC 14-7
D f

x
, (14-16)
and is the divergence operator. The quantity is a one-dimensional probability current, and it
has units of particles/second. Note the similarity between (14-15) and the well-known continuity
equation for electrical charge.
Probability current ( x, t (x
0
) indicates the rate of particle flow past point x at time t. Let
(x
1
, x
2
) denote an interval; integrate (14-15) over this interval to obtain

t
[x ( ) x x ]
t
f(x, t x ) dx [ (x , x ) (x , x )]
x
x
P
1 0
t t t <
z
X
2 0 2 0 1 0
1
2
Y Y Y Y . (14-17)
As illustrated by Fig. 14-1, the left-hand side of this equation represents the time rate of
probability build-up on (x
1
, x
2
). That is, between the limits of x
1
and x
2
, the area under f is
changing at a rate equal to the left-hand side of (14-17). As depicted by Fig. 14-1, the right-hand
side of (14-17) represents the probability currents entering the ends of the interval (x
1
, x
2
).
An Absorbing Boundary On the Random Walk
The quantity X
d
(N) is unconstrained in the discrete random walk discussed so far. Now,
consider placing an absorbing boundary at n
1
. No further displacements are possible after the
man reaches the boundary at n
1
; the man stops his random walk the instant he arrives at the

f (x, t(x
0
)
x
1
x
2


t
1 2
t
t
t , t , t P[ x ( ) x x ] f (x, x )dx (x x ) (x x )
x
x
1 2 0 0 0 0
1
2
<
z
X Y Y Y Y
(x x )
1
, tY
0
(x x )
2
, tY
0
Figure 14-1: Probability build-up on (x
1
, x
2
) expressed in terms of net current
entering the interval.
EE603 Class Notes Version 1.0 John Stensby
603CH14.DOC 14-8
boundary (he is absorbed). Let X
A
(N) take the place of X
d
(N) to distinguish the fact that an
absorbing boundary exists at x
1
. That is, after taking N steps, the man is X
A
(N) steps to the right
of the origin, given an absorbing boundary at n
1
. Clearly, we have
X N X N if X n n
n if X n n
A d d
d
( ) ( ) ( )
( )
<

1
1 1
for all n N
for some n N
. (14-18)
An absorbing boundary has applications in many problems of practical importance.
As before, assume that the man starts his random walk at m steps from the origin where m
< n
1
. This initial condition implies that X
A
(0) = m since random process X
A
denotes the man's
displacement (in steps) from the origin. He takes random steps; either he completes N of them, or
he is absorbed at the boundary before completing N steps. For the random walk with an
absorbing boundary, the quantity P[n, N(m; n
1
] denotes the probability that X
A
(N) = n given that
X
A
(0) = m and an absorbing boundary exists at n
1
. In what follows, an expression is developed
for this probability.
It is helpful to trace the man's movement by using a plane as shown by Figures 14-2a and
14-2b. On these diagrams, the horizontal axis denotes displacement, in steps, from the origin; the
vertical axis denotes the total number of steps taken by the man. Every time the man takes a step,
he moves upward on the diagram; also, he moves laterally to the right or left. The absorbing
boundary is depicted on these figures by a solid vertical line at n
1
. In the remainder of this
section, these diagrams are used to illustrate the reflection principle for dealing with random
processes that hit absorbing boundaries.
Figure 14-2a depicts two N-step sequences (the solid line paths) that start at m and arrive
at n. One of these is "forbidden" since it intersects the boundary. A "forbidden" N-step sequence
is one that intersects the boundary one or more times. For the present argument, assume that a
"forbidden" sequence is not stopped (or altered in any way) by the boundary. For all steps above
the last point of contact with the boundary, the "forbidden" sequence on Fig. 14-2a has been
EE603 Class Notes Version 1.0 John Stensby
603CH14.DOC 14-9
reflected across the boundary to produce a dashed-line path that leads to the point 2n
1
- n, the
reflected (across the boundary) image of the point n. In this same manner, every "forbidden" path
that starts at m and reaches n can be partially reflected to produce a unique path that leads to the
image point 2n
1
- n.
The solid line path on Fig. 14-2b is an N-step sequence that arrives at the point 2n
1
- n.
As was the case on Fig. 14-2a, point 2n
1
- n is the mirror image across the boundary of point n.
For all steps above the last point of contact with the boundary, the solid-line sequence on Fig. 14-
2b has been reflected across the boundary to produce a dashed-line path that leads to the point n
(we have mapped the solid-line path that reaches the image point into a forbidden sequence that
reaches point n). In this same manner, every path that reaches image point 2n
1
- n can be partially
reflected to produce a unique forbidden path that leads to n.
From the observations outlined in the last two paragraphs, it can be concluded that a one-
to-one correspondence exists between N-step "forbidden" sequences that reach point n and N-
step sequences that reach the image point 2n
1
- n. That is, for every "forbidden" sequence that
reaches n, there is a sequence that reaches 2n
1
- n. And, for every sequence that reaches 2n
1
- n
there is a forbidden sequence that reaches n. Out of all N-step sequences that start at m, the
proportion that are forbidden and reach n is exactly equal to the proportion that reach the image
n n n n n 2
N N
1 1 1
m n m
Displacement in Steps
T
o
t
a
l

N
u
m
b
e
r

o
f

S
t
e
p
s
Displacement in Steps
0 0
n 2
1
n
a) b)
F
o
r
b
i
d
d
e
n
F
o
r
b
i
d
d
e
n
T
o
t
a
l

N
u
m
b
e
r

o
f

S
t
e
p
s
Figure 14-2: Random walk with an absorbing boundary at n
1
.
EE603 Class Notes Version 1.0 John Stensby
603CH14.DOC 14-10
point. This observation is crucial in the development of P[n, N(m; n
1
].
Without the boundary in place, the computation of P[n, N(m] involves computing the
relative frequency of the man arriving at X
d
= n after N steps. That is, to compute the probability
P[n, N(m], the number of N step sequences that leave m and lead to n must be normalized by
the total number of distinct N step sequences that leave m. P[n, N(m] can be represented as
P[n, N m] =
#N step sequences that leave m and reach n
Total number of N step sequences that leave m
Y . (14-19)
With the boundary at n
1
in place, the computation of P[n, N(m; n
1
] involves computing
the relative frequency of the man arriving at X
A
= n after N steps. To compute P[n, N(m; n
1
]
when the boundary is in place, a formula similar to (14-19) can be used, but the number of
"forbidden" sequences (i.e., those that would otherwise be absorbed at the boundary) that reach n
must be subtracted from the total number (i.e., the number without a boundary) of N-step
sequences that lead to n. That is, when the boundary is in place, (14-19) must be modified to
produce
P[n, N m; n ] =
#N step sequences that
leave m and reach n with
no boundary in place
-
#N step " forbidden" sequences that
leave m and reach n

Total number of N step sequences that leave m
1
Y
R
S
|
T
|
U
V
|
W
|
R
S
T
U
V
W
(14-20)
But the number of N step forbidden sequences that reach n is exactly equal to the number of
sequences that reach the image point 2n
1
- n. Hence, (14-20) can be modified to produce
P[n, N m; n ] =
#N step sequences that
leave m and reach n with
no boundary in place
-
#N step sequences that
leave m and reach 2n
boundary in place

Total number of N step sequences that leave m
1
1
Y
R
S
|
T
|
U
V
|
W
|

R
S
|
T
|
U
V
|
W
|
n
without a
(14-21)
EE603 Class Notes Version 1.0 John Stensby
603CH14.DOC 14-11
This equation is rewritten as
P P P [n, N m n ] [n, N m] [2n n, N m]
1
Y ; Y Y
1
, (14-22)
where P[n, N(m] is given by (14-2). For the absorbing boundary case, the probability of reaching
n can be expressed in terms of probabilities that are calculated for the boundary-free case.
An Absorbing Boundary On the Wiener Process
As before, suppose that each step corresponds to a distance of l meters, and it takes
s
seconds to take a step. Furthermore, X
A
(N
s
) = lX
A
(N) denotes the man's physical distance (in
meters) from the origin. Also, for the case where an absorbing boundary exists at ln
1
> lm,
P[ln, N
s
(lm; ln
1
] denotes the probability that the man is ln meters from the origin at t = N
s
,
given that he starts at lm when t = 0. Using the argument which led to (14-22), it is possible to
write
P P P [ n, N m n ] [ n, N m] [ n n , N m] l l l l l l l l
s s s
Y ; Y Y
1 1
2 . (14-23)
The argument that lead to (14-10) can be applied to (14-23), and a density f
A
(x,t(x
0
, x
1
)
that describes the limit process X
A
(t) can be obtained. As l 0,
s
0 and N in the
manner described by (14-7), the limiting argument that produced (14-10) can be applied to
(14-23); the result of taking the limit is
f (x, t x ; x )
Dt
exp
(x x )
4Dt
exp
( x x x )
4 Dt
,
0 1
1
A
Y
L
N
M
O
Q
P

L
N
M
O
Q
P
L
N
M
M
O
Q
P
P
1
4
2
2 2

0 0
x < x
1
, (14-24)
where ln
1
x
1
is the location of the absorbing boundary. For x < x
1
, density f
A
(x,t(x
0
;x
1
) is
described by (14-24). As discussed next, this density contains a delta function at x = x
1
, to
account for the portion of sample functions that have been absorbed by time t.
EE603 Class Notes Version 1.0 John Stensby
603CH14.DOC 14-12
For x < x
1
, density f
A
(x, t(x
0
; x
1
) is equal to the right-hand-side of (14-24). At x = x
1
,
density f
A
(x, t(x
0
; x
1
) must contain a delta function of time-varying weight
w t x f x t x dx
a
x
( , ) ( , ; ) Y Y x x
0 0 1 1
1
1

z
, (14-25)
the probability that the process has been absorbed by the boundary (at x
1
) sometime during the
interval [0, t]. Of course, f
A
(x, t(x
0
; x
1
) = 0 for x > x
1
. Figure 14-3 depicts density f
A
(x, t(x
0
; x
1
)
for 2Dt = 1, x
0
= 0 and x
1
= 1.
From (14-24) , note that
limit
x

x
0 1
f (x, t x ; x )
1
0
A
Y (14-26)
for all t 0. That is, the density function vanishes as the boundary is approached from the left. If
X
A
(t) represents the position of a random particle, then (14-26) implies that, only rarely, can the
particle be found in the vicinity of the boundary.
A Reflecting Boundary On the Random Walk
On the discrete random walk, the above-discussed boundary at n
1
can be made reflecting.
-3 -2 -1 0 1 x
0.0
0.1
0.2
0.3
0.4
f
A
(
x
,
t
(
x
0
,

x
1
)

w t x f x t x x
A
x
( ) ( , , ) Y , Y
0
x dx
0 1 1
1
1

z
1
2
2
2
2
2
2

exp exp
( )

F
H
I
K

F
H
G
I
K
J
L
N
M
O
Q
P

x
x
x
0
= 0
x
1
= 1
2Dt = 1
Figure 14-3: f
A
(x,t(x
0
, x
1
) for x
0
= 0, x
1
= 1 and 2Dt = 1.
EE603 Class Notes Version 1.0 John Stensby
603CH14.DOC 14-13
To distinguish the fact that the boundary at n
1
is reflecting, let X
R
(N) denote the mans
displacement (number of steps to the right of the origin) after he has taken N steps. By definition,
the process with a reflecting boundary is represented as
X N X N if X N n
n X N if X N n
R d d
d d
( ) ( ) ( )
( ) ( )
<

1
1 1
2
. (14-27)
When X
d
exceeds the boundary it is reflected through the boundary to form X
R
.
A version of (14-20) can be obtained for the case of a reflecting boundary at x
1
. Again,
the N step sequences that leave m and arrive at n must be tallied. Every sequence that leaves m
and arrives at n without the boundary maps to a sequence that leaves m and arrives at n with the
boundary in place. Also, every sequence that leaves m and arrives at 2n
1
- n without the boundary
maps to a sequence that leaves m and arrives at n with the boundary in place. Hence, for the case
of a reflecting boundary at n
1
, we can write
P
P P
[n, N m; n ] =
#N step sequences that
leave m and reach n with
no boundary in place
+
#N step " forbidden" sequences that
leave m and reach n

Total number of N step sequences that leave m
1]
Y
R
S
|
T
|
U
V
|
W
|
R
S
T
U
V
W
+ [n, N m] [2n n, N m]
1
Y Y
. (14-28)
Note the similarity between (14-28) and (14-20).
A Reflecting Boundary On the Wiener Process
As before, suppose that each step corresponds to a distance of l meters, and it takes
s
seconds to take a step. Furthermore, X
R
(N
s
) = lX
R
(N) denotes the man's physical distance (in
meters) from the origin. Also, for the case of a reflecting boundary at ln
1
> lm, P[ln, N
s
(lm;
ln
1
] denotes the probability that the man is ln meters from the origin at t = N
s
, given that he
EE603 Class Notes Version 1.0 John Stensby
603CH14.DOC 14-14
starts at lm when t = 0. Using the argument which led to (14-22), it is possible to write
P P P [ n, N m n ] [ n, N m] [ n n , N m] l l l l l l l l
s s s
Y ; Y Y
1 1
2 + . (14-29)
The argument that lead to (14-10) can be applied to (14-23), and a density f
R
(x,t(x
0
, x
1
)
that describes the limit process X
R
(t) can be obtained. As l 0,
s
0 and N in the
manner described by (14-7), the limiting argument that produced (14-10) can be applied to
(14-29); the result of taking the limit is
f (x, t x ; x )
Dt
exp
(x x )
4 Dt
exp
( x x x )
4 Dt
,
0 1
1
R
Y
L
N
M
O
Q
P
+
L
N
M
O
Q
P
L
N
M
M
O
Q
P
P

1
4
2
0
2 2

0 0
, x x
x > x

1
1
, (14-30)
where ln
1
x
1
is the location of the reflecting boundary. Figure 14-4 depicts a plot of f
R
(x,t(x
0
,
x
1
) for the case x
0
= 0, x
1
= 1 and 2Dt = 1.
-3 -2 -1 0 1
x
0.0
0.1
0.2
0.3
0.4
0.5
f
R
(
x
,
t
(
x
0
,

x
1
)
x
0
= 0
x
1
= 1
2Dt = 1
Figure 14-4: f
R
(x,t(x
0
, x
1
) for x
0
= 0, x
1
= 1 and 2Dt = 1.
EE603 Class Notes Version 1.0 John Stensby
603CH14.DOC 14-15
The First-Order Markov Process
Usually, simplifying assumptions are made when performing an engineering analysis of a
nonlinear system driven by noise. Often, assumptions are made that place limits on the amount of
information that is required to analyze the system. For example, it is commonly assumed that a
finite dimensional model can be used to describe the system. The model is described by a finite
number of state variables which are modeled as random processes. An analysis of the system
might involve the determination of the probability density function that describes these state
variables. A second common assumption has to do with how this probability density evolves with
time, and what kind of initial data it depends on. This assumption states that the future evolution
of the density function can be expressed in terms of the present values of the state variables;
knowledge of past values of the state variables is not necessary. As discussed in this chapter, this
second assumption means that the state vector can be modeled as a continuous Markov process.
Furthermore, the process has a density function that satisfies a parabolic partial differential
equation known as the Fokker-Planck equation (also known as Kolmogorov's Forward Equation).
The one-dimensional Markov process and Fokker-Planck equation are discussed in this
section. Unlike the situation in multi-dimensional problems, a number of exact closed-form
results can be obtained for the one-dimensional case, and this justifies treating the case separately.
Also, the one-dimensional case is simpler to deal with from a notational standpoint.
A random process has the Markov property if its distribution function at any future instant, conditioned on present and past values of the process, does not depend on the past values. Consider increasing, but arbitrary, values of time t_1 < t_2 < ... < t_n, where n is an arbitrary positive integer. A random process X(t) has the Markov property if its conditional probability distribution function satisfies
F(x_n, t_n | x_{n-1}, t_{n-1}; ... ; x_1, t_1) = P[X(t_n) \le x_n | X(t_{n-1}) = x_{n-1}, X(t_{n-2}) = x_{n-2}, ... , X(t_1) = x_1]
                                               = P[X(t_n) \le x_n | X(t_{n-1}) = x_{n-1}]
                                               = F(x_n, t_n | x_{n-1}, t_{n-1})        (14-31)

for all values x_1, x_2, ... , x_n and all sequences t_1 < t_2 < ... < t_n.
The Wiener and random walk processes discussed in Section 6.1 are examples of Markov processes. In the development that produced (14-10), the initial condition x_0 was specified at t = 0. Now, suppose the initial condition is changed so that x_0 is specified at t = t_0. In this case, transitions in the displacement random process X can be described by the conditional probability density function
f(x, t | x_0, t_0) = \frac{1}{\sqrt{4\pi D(t - t_0)}} \exp\left[ -\frac{(x - x_0)^2}{4D(t - t_0)} \right]        (14-32)
for t > t_0. Note that the displacement random process X is Markov since, for t greater than t_0, density (14-32) can be expressed in terms of the displacement value x_0 at time t_0; prior to time t_0, the history of the displacement is not relevant.
If random process X(t) is a Markov process, the joint density of X(t_1), X(t_2), ... , X(t_n), where t_1 < t_2 < ... < t_n, has a simple representation. First, recall the general formula
f(x_n, t_n; x_{n-1}, t_{n-1}; ... ; x_1, t_1)
   = f(x_n, t_n | x_{n-1}, t_{n-1}; ... ; x_1, t_1) f(x_{n-1}, t_{n-1} | x_{n-2}, t_{n-2}; ... ; x_1, t_1) ... f(x_2, t_2 | x_1, t_1) f(x_1, t_1).        (14-33)
Now, utilize the Markov property to write (14-33) as
f(x_n, t_n; x_{n-1}, t_{n-1}; ... ; x_1, t_1) = f(x_n, t_n | x_{n-1}, t_{n-1}) f(x_{n-1}, t_{n-1} | x_{n-2}, t_{n-2}) ... f(x_2, t_2 | x_1, t_1) f(x_1, t_1).        (14-34)
Equation (14-34) states that a Markov process X(t), t ≥ t_0, is completely specified by an initial marginal density f(x_0, t_0) (think of this marginal density as specifying an initial condition on the process) and a first-order conditional density

f(x, t | x_1, t_1),   t \ge t_1,  t_1 \ge t_0.        (14-35)
For this reason, conditional densities of the form (14-35) are known as transition densities.
Based on physical reasoning, it is easy to see that (14-35) should satisfy
f(x, t_1 | x_1, t_1) = \delta(x - x_1).        (14-36)
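The factorization (14-34) says a Markov process can be simulated by drawing one sample from the initial density and then repeatedly from the transition density. The sketch below (an added illustration, not from the notes) does this with the Wiener transition density (14-32); the diffusion constant D, the time instants, and the number of paths are assumed values. The sample covariance matrix should approach 2D·min(t_i, t_j).

```python
import numpy as np

# Assumed values for illustration only.
rng = np.random.default_rng(2)
D = 0.5
times = np.array([0.5, 1.0, 2.0, 3.5])
paths = 100_000

x = np.zeros(paths)              # initial density: all probability massed at x0 = 0 at t0 = 0
t_prev, samples = 0.0, []
for t in times:
    # draw X(t) from f(x, t | x_prev, t_prev): Gaussian with variance 2D(t - t_prev), per (14-32)
    x = rng.normal(x, np.sqrt(2 * D * (t - t_prev)))
    samples.append(x.copy())
    t_prev = t

X = np.vstack(samples)
print("sample covariance of (X(t_1), ..., X(t_n)):")
print(np.round(np.cov(X), 3))
print("theory, 2D*min(t_i, t_j):")
print(np.round(2 * D * np.minimum.outer(times, times), 3))
```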
Some important special cases arise regarding the time dependence of the marginal and transitional densities of a Markov process. A Markov process is said to be homogeneous if f(x_2, t_2 | x_1, t_1) is invariant to a shift in the time origin. In this case, the transition density depends only on the time difference t_2 - t_1. Now, recall that stationarity implies that both f(x, t) and f(x_2, t_2 | x_1, t_1) are invariant to a shift in the time origin. Hence, stationary Markov processes are homogeneous. However, the converse of this last statement is not generally true.
An Important Application of Markov Processes
Consider a physical problem that can be modeled as a first-order system driven by white
Gaussian noise. Let X(t), t ≥ t_0, denote the state of this system; the statistical properties of state X are of interest here. Suppose that the initial condition X(t_0) is a random variable that is independent of the white noise driving the system. Then state X(t) belongs to a special class of Markov processes known as diffusion processes. As is characteristic of diffusion processes,
almost all sample functions of X are continuous, but they are differentiable nowhere. Finally,
these statements are generalized easily to nth-order systems driven by white Gaussian noise.
As an example of a first-order system driven by white Gaussian noise, consider the RL circuit illustrated by Fig. 14-5. In this circuit, current i(t), t ≥ t_0, is the state, and white Gaussian noise v_in(t) is assumed to have a mean of zero. Inductance L(i) is a coil wound on a ferrite core with a nonlinear B-H relationship; the inductance is a known positive function of the current i with derivative dL/di. The initial condition i(t_0) is assumed to be a zero-mean, Gaussian random variable, and it is independent of the input noise v_in(t) for all t.
The formal differential equation that describes state i(t) is
\frac{d}{dt}\left[ L i \right] = \frac{dL}{di}\frac{di}{dt} i + L \frac{di}{dt} = -Ri + v_{in}.        (14-37)
Equation (14-37) is equivalent to
\frac{di}{dt} = - \frac{Ri}{(dL/di)\,i + L} + \frac{v_{in}}{(dL/di)\,i + L}.        (14-38)
However, the problem with the previous two equations is that sample functions of current i are not differentiable, so Equations (14-37) and (14-38) serve only as symbolic representations of the circuit dynamic model.

Figure 14-5: A simple RL circuit.

Now, recall from Section 6.1.5 that white noise v_in can be represented as a generalized derivative of a Wiener process. If W_t denotes this Wiener process, and i_t ≡ i(t) denotes the circuit current (in these representations, the independent time variable is depicted as a subscript), then it is possible to say that
di_t = - \frac{R i_t}{(dL/di)\,i_t + L} dt + \frac{dW_t}{(dL/di)\,i_t + L}        (14-39)
is formally equivalent to (14-38).
Equations (14-38) and (14-39) are stochastic differential equations, and they should be
considered as nothing but formal symbolic representations for the integral equation
i_t - i_{t_k} = - \int_{t_k}^{t} \frac{R i_\tau}{(dL/di)\,i_\tau + L} d\tau + \int_{t_k}^{t} \frac{dW_\tau}{(dL/di)\,i_\tau + L},        (14-40)
where t > t_k ≥ t_0. On the right-hand side of (14-40), the first integral can be interpreted in the classical Riemann sense. However, sample functions of W_t are not of bounded variation, so the second integral cannot be a Riemann-Stieltjes integral. Instead, it can be interpreted as a stochastic Itô integral, and a major field of mathematical analysis exists to support this effort.
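A common way to work with an integral equation like (14-40) numerically is the Euler-Maruyama recursion, which replaces the Itô integral over a short step by a single Gaussian increment. The sketch below is only an assumed illustration: the inductance law L(i), the circuit values, the noise level N_0, the initial-condition spread, and the step size are hypothetical choices, not values taken from the notes.

```python
import numpy as np

# Hypothetical circuit values and inductance law (assumptions for illustration only).
rng = np.random.default_rng(3)
R, L0, a, N0 = 10.0, 1e-3, 5e-4, 1e-4
L    = lambda i: L0 + a * i**2               # assumed L(i); then dL/di = 2*a*i
dLdi = lambda i: 2 * a * i

dt, steps, paths = 1e-7, 20_000, 2_000
i = rng.normal(0.0, 1e-2, size=paths)        # assumed zero-mean Gaussian initial current

for _ in range(steps):
    # dW has zero mean and variance (N0/2)*dt, consistent with a two-sided PSD of N0/2
    dW = rng.normal(0.0, np.sqrt((N0 / 2) * dt), size=paths)
    denom = dLdi(i) * i + L(i)               # (dL/di)*i + L, the denominator in (14-40)
    i = i + (-R * i / denom) * dt + dW / denom

print("sample mean and standard deviation of i(t) after the transient:", i.mean(), i.std())
```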
On the right-hand side of (14-40), the stochastic differential dW_t does not depend on i_τ, τ < t_k. Based on this observation, it is possible to conjecture that any probabilistic description of future values of i_t, t > t_k, when conditioned on the present value i_{t_k} and past values i_{t_{k-1}}, i_{t_{k-2}}, ... , does not depend on the past current values. That is, the structure of (14-40) suggests that i_t is Markov. The proof of this conjecture is a major result in the theory of stochastic differential equations (see Chapter 9 of Arnold, Stochastic Differential Equations: Theory and Applications).
Of great practical interest are methods for characterizing the statistical properties of
diffusion processes that represent the state of a dynamic system driven by white Gaussian noise.
At least two schools of thought exist for characterizing these processes. The first espouses direct
numerical simulation of the system dynamic model. The second school is adhered to here, and it
utilizes an indirect analysis based on the Fokker-Planck equation.
The Chapman-Kolmogorov Equation
Suppose that X(t) is a random process described by the conditional density function f(x_3, t_3 | x_1, t_1). Clearly, this density function must satisfy

f(x_3, t_3 | x_1, t_1) = \int_{-\infty}^{\infty} f(x_3, t_3; x_2, t_2 | x_1, t_1) dx_2,        (14-41)
where t_1 < t_2 < t_3. Now, a standard result from probability theory can be used here; substitute

f(x_3, t_3; x_2, t_2 | x_1, t_1) = f(x_3, t_3 | x_2, t_2; x_1, t_1) f(x_2, t_2 | x_1, t_1)        (14-42)
into (14-41) and obtain
f(x_3, t_3 | x_1, t_1) = \int_{-\infty}^{\infty} f(x_3, t_3 | x_2, t_2; x_1, t_1) f(x_2, t_2 | x_1, t_1) dx_2.        (14-43)
Equation (14-43) can be simplified if X is a Markov process. Using the Markov property, this last equation reduces to

f(x_3, t_3 | x_1, t_1) = \int_{-\infty}^{\infty} f(x_3, t_3 | x_2, t_2) f(x_2, t_2 | x_1, t_1) dx_2.        (14-44)
This is the well-known Chapman-Kolmogorov equation for Markov processes (it is also known as the Smoluchowski equation). It provides a useful formula for the transition probability from x_1 at time t_1 to x_3 at time t_3 in terms of an intermediate step x_2 at time t_2, where t_2 lies between t_1 and t_3. In Section 6.3, a version of (14-44) is used in the development of the N-dimensional Fokker-Planck equation.
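Equation (14-44) is easy to verify numerically for the Wiener transition density (14-32): integrating the product of the two shorter-time densities over the intermediate point x_2 must reproduce the longer-time density. The values of D, the three time instants, and the grid in the sketch below are assumptions used only for the check.

```python
import numpy as np

# Assumed values for the check.
D, t1, t2, t3, x1 = 1.0, 0.0, 0.7, 2.0, 0.3

def f(x, t, x0, t0):
    """Wiener transition density (14-32)."""
    return np.exp(-(x - x0)**2 / (4 * D * (t - t0))) / np.sqrt(4 * np.pi * D * (t - t0))

x2 = np.linspace(-20.0, 20.0, 20001)
for x3 in (-1.0, 0.0, 1.5):
    lhs = f(x3, t3, x1, t1)
    integrand = f(x3, t3, x2, t2) * f(x2, t2, x1, t1)
    rhs = np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(x2))   # trapezoid rule over x2
    print(f"x3 = {x3:5.2f}   direct density {lhs:.6f}   Chapman-Kolmogorov integral {rhs:.6f}")
```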
The One-Dimensional Kramers-Moyal Expansion
As discussed in Section 6.1.1, a limiting form of the random walk is a Markov process
described by the transition density (14-10). This density function satisfies the diffusion equation
(14-11). These results are generalized in this section where it is shown that a first-order Markov
process is described by a transition density that satisfies an equation known as the Kramers-Moyal
expansion. When the Markov process is the state of a dynamic system driven by white Gaussian
noise, this equation simplifies to what is known as the Fokker-Planck equation. Equation (14-11)
is a simple example of a Fokker-Planck equation.
Consider the random increment

\Delta X_{t_1} \equiv X(t_1 + \Delta t) - X(t_1),        (14-45)

where Δt is a small, positive time increment. Given that X(t_1) = x_1, the conditional characteristic function of ΔX_{t_1} is given by
\Phi(\omega; x_1, t_1, \Delta t) \equiv E[\exp(j\omega \Delta X_{t_1}) \mid X(t_1) = x_1] = \int_{-\infty}^{\infty} \exp[j\omega(x - x_1)] f(x, t_1 + \Delta t \mid x_1, t_1) dx.        (14-46)

If the Markov process is homogeneous, then Φ depends on the time difference Δt but not on the absolute value of t_1. The inverse of (14-46) is
f(x, t_1 + \Delta t \mid x_1, t_1) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \exp[-j\omega(x - x_1)] \Phi(\omega; x_1, t_1, \Delta t) d\omega,        (14-47)
which is an expression for the transition density in terms of the characteristic function of the
random increment. Now, use (14-47) in
f(x, t_1 + \Delta t) = \int_{-\infty}^{\infty} f(x, t_1 + \Delta t; x_1, t_1) dx_1 = \int_{-\infty}^{\infty} f(x, t_1 + \Delta t \mid x_1, t_1) f(x_1, t_1) dx_1        (14-48)
to obtain
f(x, t_1 + \Delta t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \exp[-j\omega(x - x_1)] \Phi(\omega; x_1, t_1, \Delta t) d\omega \, f(x_1, t_1) dx_1.        (14-49)
The characteristic function Φ can be expressed in terms of the moments of the increment ΔX_{t_1}. To obtain such a relationship for use in (14-49), expand the exponential in (14-46) and obtain

\Phi(\omega; x_1, t_1, \Delta t) = E[\exp(j\omega \Delta X_{t_1}) \mid X(t_1) = x_1]
                                = E\left[ \sum_{q=0}^{\infty} \frac{(j\omega)^q}{q!} (\Delta X_{t_1})^q \,\Big|\, X(t_1) = x_1 \right]
                                = \sum_{q=0}^{\infty} \frac{(j\omega)^q}{q!} m^{(q)}(x_1, t_1, \Delta t),        (14-50)
where
m^{(q)}(x_1, t_1, \Delta t) \equiv E[ (\Delta X_{t_1})^q \mid X(t_1) = x_1 ] = E[ \{X(t_1 + \Delta t) - X(t_1)\}^q \mid X(t_1) = x_1 ]        (14-51)

is the q-th conditional moment of the random increment ΔX_{t_1}.
This expansion of the characteristic function can be used in (14-49). Substitute (14-50)
into (14-49) and obtain
f(x, t_1 + \Delta t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \exp[-j\omega(x - x_1)] \sum_{q=0}^{\infty} \frac{(j\omega)^q}{q!} m^{(q)}(x_1, t_1, \Delta t) d\omega \, f(x_1, t_1) dx_1

                     = \sum_{q=0}^{\infty} \frac{1}{q!} \int_{-\infty}^{\infty} \left\{ \frac{1}{2\pi} \int_{-\infty}^{\infty} (j\omega)^q \exp[-j\omega(x - x_1)] d\omega \right\} m^{(q)}(x_1, t_1, \Delta t) f(x_1, t_1) dx_1.        (14-52)
This result can be simplified by using the identity
\frac{1}{2\pi} \int_{-\infty}^{\infty} (j\omega)^q \exp[-j\omega(x - x_1)] d\omega = \left( -\frac{\partial}{\partial x} \right)^q \frac{1}{2\pi} \int_{-\infty}^{\infty} \exp[-j\omega(x - x_1)] d\omega = \left( -\frac{\partial}{\partial x} \right)^q \delta(x - x_1).        (14-53)
The use of identity (14-53) in (14-52) results in
f(x, t_1 + \Delta t) = \sum_{q=0}^{\infty} \frac{1}{q!} \int_{-\infty}^{\infty} \left( -\frac{\partial}{\partial x} \right)^q \delta(x - x_1) \, m^{(q)}(x_1, t_1, \Delta t) f(x_1, t_1) dx_1

                     = \sum_{q=0}^{\infty} \frac{1}{q!} \left( -\frac{\partial}{\partial x} \right)^q \int_{-\infty}^{\infty} \delta(x - x_1) m^{(q)}(x_1, t_1, \Delta t) f(x_1, t_1) dx_1

                     = f(x, t_1) + \sum_{q=1}^{\infty} \frac{1}{q!} \left( -\frac{\partial}{\partial x} \right)^q \left[ m^{(q)}(x, t_1, \Delta t) f(x, t_1) \right]        (14-54)
since m^(0) = 1. Now, no special significance is attached to time variable t_1 in (14-54); hence, substitute t for t_1 and write
f(x, t + \Delta t) = f(x, t) + \sum_{q=1}^{\infty} \frac{1}{q!} \left( -\frac{\partial}{\partial x} \right)^q \left[ m^{(q)}(x, t, \Delta t) f(x, t) \right].        (14-55)
Finally, divide both sides of this last result by Δt, and let Δt → 0 to obtain the formal limit

\frac{\partial}{\partial t} f(x, t) = \sum_{q=1}^{\infty} \frac{1}{q!} \left( -\frac{\partial}{\partial x} \right)^q \left[ K^{(q)}(x, t) f(x, t) \right],        (14-56)
where
K^{(q)}(x, t) \equiv \lim_{\Delta t \to 0} \frac{E[ \{X(t + \Delta t) - X(t)\}^q \mid X(t) = x ]}{\Delta t},        (14-57)
q ≥ 1, are called the intensity coefficients. Integer q denotes the order of the coefficient.
Equation (14-56) is called the Kramers-Moyal expansion. In general, the coefficients given by
(14-57) depend on time. However, the intensity coefficients are time-invariant in cases where the
underlying process is homogeneous. In what follows, we assume homogeneous processes and
time-independent intensity coefficients.
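Definition (14-57) also suggests a practical estimator: simulate many short increments started from X(t) = x and divide the sample moments by Δt. The sketch below (an added illustration) does this for the assumed process dX = -bX dt + √c dW, for which K^(1)(x) = -bx and K^(2)(x) = c; all numerical values are assumptions.

```python
import numpy as np

# Assumed test process: dX = -b*X dt + sqrt(c) dW, so K1(x) = -b*x and K2(x) = c.
rng = np.random.default_rng(4)
b, c, dt = 2.0, 0.5, 1e-4

for x in (-1.0, 0.0, 1.0):
    # many independent one-step increments started from X(t) = x
    dW = rng.normal(0.0, np.sqrt(dt), size=1_000_000)
    dX = -b * x * dt + np.sqrt(c) * dW
    K1_hat = dX.mean() / dt                     # estimate of K^(1)(x), Eq. (14-57) with q = 1
    K2_hat = (dX**2).mean() / dt                # estimate of K^(2)(x), Eq. (14-57) with q = 2
    print(f"x = {x:5.2f}   K1 {K1_hat:8.3f} (exact {-b*x:5.2f})   K2 {K2_hat:6.3f} (exact {c})")
```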
The One-Dimensional Fokker-Planck Equation
The intensity coefficients K^(q) vanish for q ≥ 3 in applications where X is the state of a first-order system driven by white Gaussian noise (see Risken, The Fokker-Planck Equation, Second Edition). This means that incremental changes ΔX_t ≡ [X(t + Δt) - X(t)] in the process occur slowly enough so that their third and higher-order moments vanish more rapidly than Δt. Under these conditions, Equation (14-56) reduces to the one-dimensional Fokker-Planck equation

\frac{\partial}{\partial t} f(x, t) = -\frac{\partial}{\partial x} \left[ K^{(1)}(x) f(x, t) \right] + \frac{1}{2} \frac{\partial^2}{\partial x^2} \left[ K^{(2)}(x) f(x, t) \right].        (14-58)
When K^(q) = 0 for q ≥ 3, random process X is known as a diffusion process, and its sample
functions are continuous (see Risken, The Fokker-Planck Equation, Second Edition). Apart from
initial and boundary conditions, Equation (14-58) shows that the probability density function for
a diffusion process is determined completely by only first and second-order moments of the
process increments.
As a simple example, consider the RL circuit depicted by Fig. 14-5, where the inductance is a constant L independent of current i_t. This circuit is driven by v_in, a zero-mean, white Gaussian noise process with a double-sided spectral density of N_0/2 watts/Hz. This white noise process is
the formal derivative of a Wiener process W_t; the variance of an increment of this Wiener process is (N_0/2)Δt. The RL circuit is described by (14-39) and (14-40), which can be used to write
\Delta i_t \cong -\frac{R}{L} i_t \Delta t + \frac{1}{L} \int_{t}^{t + \Delta t} dW_\tau.        (14-59)
The commonly used notation i_t ≡ i(t) and Δi_t ≡ i(t + Δt) - i(t) is used in (14-59), and the differential dW_t is formally equivalent to v_in dt. This current increment can be used in (14-57) to obtain
K^{(1)}(i) = \lim_{\Delta t \to 0} \frac{E[ \Delta i_t \mid i_t = i ]}{\Delta t} = -\frac{R}{L} i.        (14-60)
In a similar manner, the second intensity coefficient is
K^{(2)} = \lim_{\Delta t \to 0} \frac{E[ (\Delta i_t)^2 \mid i_t = i ]}{\Delta t} = \lim_{\Delta t \to 0} \frac{1}{\Delta t} \frac{1}{L^2} \int_{t}^{t+\Delta t} \int_{t}^{t+\Delta t} E[ v_{in}(t_1) v_{in}(t_2) ] dt_1 dt_2 = \frac{N_0}{2L^2}.        (14-61)
Hence, the Fokker-Planck equation that describes the RL circuit is

\frac{\partial f}{\partial t} = \frac{R}{L} \frac{\partial}{\partial i} [\, i f \,] + \frac{N_0}{4L^2} \frac{\partial^2 f}{\partial i^2}.        (14-62)
Finally, as can be shown by direct substitution, a steady-state solution of this equation is
f(i) = \frac{1}{\sqrt{2\pi (N_0 / 4RL)}} \exp\left[ -\frac{1}{2} \frac{i^2}{N_0 / 4RL} \right].        (14-63)
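The Gaussian steady state (14-63) can be checked by simulating the constant-L circuit, di = -(R/L) i dt + (1/L) dW, and comparing the long-run sample variance with N_0/(4RL). The circuit values, step size, and simulation length below are assumptions used only for this sketch.

```python
import numpy as np

# Assumed values for the check.
rng = np.random.default_rng(5)
R, L, N0 = 10.0, 1e-3, 1e-4
sigma2_theory = N0 / (4 * R * L)                 # variance appearing in (14-63)

dt, steps, paths = 1e-6, 3_000, 10_000           # about 30 circuit time constants L/R
i = np.zeros(paths)
for _ in range(steps):
    dW = rng.normal(0.0, np.sqrt((N0 / 2) * dt), size=paths)
    i += -(R / L) * i * dt + dW / L

print("simulated steady-state variance:", i.var(), "   N0/(4RL):", sigma2_theory)
```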
The intensity coefficients used in (14-58) can be given physical interpretations when process X(t) denotes the time-dependent displacement of a particle. Consider a particle on one of an ensemble of paths (sample functions) that pass through point x at time t. As the particle passes through point x at time t, its velocity is dependent on which path it is on. Inspection of (14-57) shows that K^(1)(x) is the average of these velocities. In a similar manner, coefficient K^(2) can be given a physical interpretation. In Δt seconds after time t, the particle has undergone a displacement of ΔX_t ≡ X(t + Δt) - X(t) from point x. Of course, ΔX_t depends on which path the particle is on, so there is uncertainty in the magnitude of ΔX_t. That is, after leaving point x, there is uncertainty in how far the process has gone during the time increment Δt. For small Δt, a measure of this uncertainty is given by Δt K^(2)(x), a first-order-in-Δt approximation for the variance of the displacement increment ΔX_t.
In many applications, K^(2) is constant (independent of x). If K^(2)(x) is not constant, then (14-58) can be transformed into a new Fokker-Planck equation where the new coefficient K̃^(2) is a positive constant (for details of the transformation see pg. 97 of Risken, The Fokker-Planck Equation, Second Edition). For this reason, coefficient K^(2) in (14-58) is assumed to be a positive constant in what follows.
Note that (14-58) can be written as

\frac{\partial}{\partial t} f(x, t) = -\frac{\partial}{\partial x} \left[ K^{(1)}(x) f(x, t) \right] + \frac{1}{2} \frac{\partial^2}{\partial x^2} \left[ K^{(2)} f(x, t) \right]
                                    = -\frac{\partial}{\partial x} \left\{ K^{(1)}(x) f(x, t) - \frac{1}{2} K^{(2)} \frac{\partial}{\partial x} f(x, t) \right\}
                                    = -\nabla \cdot S(x, t),        (14-64)
where
S(x, t) \equiv K^{(1)}(x) f(x, t) - \frac{1}{2} K^{(2)} \frac{\partial}{\partial x} f(x, t),        (14-65)
and ∇·S denotes the divergence of S. Notice the similarity of (14-64) with the well-known continuity equation

\frac{\partial \rho}{\partial t} = -\nabla \cdot J        (14-66)

from electromagnetic theory. In this analogy, f (particles/meter) and S (particles/second) in (14-64) are analogous to one-dimensional charge density ρ (electrons/meter) and one-dimensional current J (electrons/second), respectively, in (14-66). S is the probability current; in an abstract sense, it can be thought of as describing the "amount" of probability crossing a point x in the positive direction per unit time. In the literature, it is common to see S cited as a flow rate of "probability particles". That is, S(x, t) can be thought of as the rate of particle flow at point x and time t (see both Section 4.3 of Stratonovich, Topics in the Theory of Random Noise, Vol. 1 and Section 5.2 of Gardiner, Handbook of Stochastic Methods).
The K^(1) f term in S is analogous to a drift current component. To see the current aspect, recall that K^(1) has units of velocity if X is analogous to particle displacement, and f has units of particles per meter. Hence, the product K^(1) f is analogous to a one-dimensional current since
(meters / second)(particles / meter) = particles / second . (14-67)
It is a drift current (i.e., a current that results from an external force or potential acting on particles) since, in applications, K^(1) is due to the presence of an external force. This external force acts on the particles, and

S_s \equiv K^{(1)} f        (14-68)
can be thought of as a drift current that results from movement of the forced particles.
In fact, drift coefficient K^(1) is used in

U_p(x) \equiv -\frac{2}{K^{(2)}} \int^{x} K^{(1)}(\xi) d\xi        (14-69)

to define the one-dimensional potential function for (14-64). An important conceptual role can be developed for U_p; it is more likely for probability (probability particles) to flow in the direction of a lower potential.
Figure 14-6 depicts a potential function U_p(x) which is encountered in the first-order PLL
(see Stensby, Phase-Locked Loops: Theory and Applications) and other applications. The
significant feature of this potential function is the sequence of periodically spaced potential wells.
A particle can move from one well to the next, and it is more likely to move to a well of lower
potential than to a well of higher potential. In the phase-locked loop, noise-induced cycle slips
are associated with the movement of particles between the wells.
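A short numerical sketch of (14-69) for a PLL-like drift is given below; it is an added illustration, not part of the notes. The drift K^(1)(φ) = β - sin φ and the values β = 0.2, K^(2) = 2 are assumptions chosen so that U_p(φ) = -0.2φ - cos φ, the shape plotted in Fig. 14-6; the local minima then repeat every 2π, which is the well spacing associated with cycle slips.

```python
import numpy as np

# Assumed PLL-like drift and diffusion: K1(phi) = beta - sin(phi), K2 constant.
beta, K2 = 0.2, 2.0
K1 = lambda phi: beta - np.sin(phi)

phi = np.linspace(0.0, 25.0, 5001)
dphi = phi[1] - phi[0]
Up = -(2.0 / K2) * np.cumsum(K1(phi)) * dphi       # U_p(phi) from (14-69), up to a constant

# locate the periodically spaced local minima of the potential
interior = Up[1:-1]
minima = phi[1:-1][(interior < Up[:-2]) & (interior < Up[2:])]
print("local minima near phi =", np.round(minima, 2))
print("expected spacing 2*pi =", round(2 * np.pi, 2))
```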
The component

S_d \equiv -\frac{1}{2} K^{(2)} \frac{\partial}{\partial x} [\, f(x, t) \,]        (14-70)
Figure 14-6: Potential function U_p(φ) = -0.2φ - cos(φ) with periodically spaced local minima.
in (14-65) is analogous to a diffusion current (diffusion currents result in nature from a non-zero gradient of charged particles which undergo random motion - see pg. 339 of Smith and Dorf, Circuits, Devices and Systems, 5th edition). As in (14-67), S_d is analogous to a current since it has units of particles/second. That S_d is analogous to a diffusion current is easy to see when K^(2) is a constant. In this case, S_d is proportional to the gradient of the analogous charge density f, and K^(2) is the constant diffusion coefficient. The negative sign associated with (14-70) is due to the fact that particles tend to diffuse in the direction of lower probability concentrations.
Initial and boundary conditions on f(x, t) must be supplied when a solution is sought for the Fokker-Planck equation. An initial condition is a constraint that f(x, t) must satisfy at some instant of time; initial condition f(x, t_1), where t_1 is the initial time, is specified as a function of the variable x. A boundary condition is a constraint that f(x, t) must satisfy at some displacement x_1 (x_1 may be infinite); boundary condition f(x_1, t) is specified as a function of the variable t.
Usually, initial and boundary conditions are determined by the physical properties of the
application under consideration.
Transition Density Function
Often, the transition density f(x, t | x_1, t_1) is of interest. This transition density satisfies the Fokker-Planck equation. To see this, substitute f(x, t) = f(x, t | x_1, t_1) f(x_1, t_1) into (14-58), cancel out the common f(x_1, t_1) term, and write

\frac{\partial}{\partial t} f(x, t \mid x_1, t_1) = -\frac{\partial}{\partial x} \left[ K^{(1)}(x) f(x, t \mid x_1, t_1) \right] + \frac{1}{2} \frac{\partial^2}{\partial x^2} \left[ K^{(2)}(x) f(x, t \mid x_1, t_1) \right].        (14-71)
So, the transition density f(x, t | x_1, t_1) for the Markov process X(t) can be found by solving (14-71) subject to
f(x, t_1 \mid x_1, t_1) = \delta(x - x_1).        (14-72)
In computing the transition density, the boundary conditions that must be imposed on (14-71) are
problem dependent.
Kolmogorov's Equations
Fokker-Planck equation (14-71) is also known as Kolmogorov's Forward Equation. The adjective Forward refers to the fact that the equation is in terms of the temporal and spatial variables t and x, the forward variables (in f(x, t | x_1, t_1), x and t are referred to as forward variables while x_1 and t_1 are called the backward variables).
Yes (you may be wondering), there is a Kolmogorov Backward Equation. In the transition density f(x, t | x_1, t_1), one can think of holding x, t fixed and using x_1, t_1 as the independent variables. In the backward variables x_1 and t_1, f(x, t | x_1, t_1) must satisfy
-\frac{\partial}{\partial t_1} f(x, t \mid x_1, t_1) = K^{(1)}(x_1) \frac{\partial}{\partial x_1} f(x, t \mid x_1, t_1) + \frac{1}{2} K^{(2)}(x_1) \frac{\partial^2}{\partial x_1^2} f(x, t \mid x_1, t_1),        (14-73)
a result known as Kolmogorov's Backward Equation. It is possible to show that the backward equation is the formal adjoint of the forward equation. Finally, the backward equation is used when studying exit and first-passage-time problems.
Natural, Periodic and Absorbing Boundary Conditions
Boundary conditions must be specified when looking for a solution of (14-58). In general,
the specification of boundary conditions can be a complicated task that requires the consideration
of subtle issues. Fortunately, only three types of simple boundary conditions are required for
most applications in the area of communication and control system analysis.
The first type of boundary condition to be considered arises naturally in many applications where sample functions of X(t) are unconstrained in value. In these applications, the Fokker-Planck equation applies over the whole real line, and the boundary conditions specify what happens as x approaches ±∞. With respect to x, integrate Equation (14-64) over the real line to obtain

\frac{\partial}{\partial t} \int_{-\infty}^{\infty} f(x, t) dx = \lim_{x \to -\infty} S(x, t) - \lim_{x \to \infty} S(x, t).        (14-74)
Now, combine this result with the normalization condition
\int_{-\infty}^{\infty} f(x, t) dx = 1,        (14-75)
which must hold for all t, to obtain
\lim_{x \to -\infty} S(x, t) = \lim_{x \to \infty} S(x, t).        (14-76)
While this must be true in general, the stronger requirement
\lim_{x \to -\infty} S(x, t) = \lim_{x \to \infty} S(x, t) = 0        (14-77)
holds in all applications of the theory to physical problems (where probability build-up at infinity
cannot occur). Furthermore, the requirement
\lim_{x \to -\infty} f(x, t) = \lim_{x \to \infty} f(x, t) = 0        (14-78)
is necessary for the normalization condition (14-75) to hold. As x → ±∞, the density function must satisfy the requirement that f(x, t) → 0 on the order of 1/|x|^(1+ε), for some ε > 0; this requirement is necessary for convergence of the integral in (14-75). Equations (14-77) and (14-78) constitute what is called a set of natural boundary conditions.
Boundary conditions of a periodic nature are used in the analysis of phase-locked loops
and other applications. They require a constraint of the form
f(x, t) = f(x + L_0, t)
S(x, t) = S(x + L_0, t),        (14-79)
where L_0 is the period. These are called periodic boundary conditions, and they are used for certain types of analyses when the intensity coefficients K^(1) and K^(2) are periodic functions of x.
Periodic intensity coefficients occur in a class of problems generally referred to as Brownian
motion in a periodic potential (see Chapter 11 of Risken, The Fokker-Planck Equation, Second
Edition).
The last type of boundary condition discussed here is of the absorbing type. Suppose that X(t) denotes the displacement of a particle that starts at X(0) = x_0. As is illustrated by Fig. 14-7, the particle undergoes a random displacement X(t) until, at t = t_a, it makes first contact with the boundary at x_b > x_0. The particle is absorbed (it vanishes) at this point of first contact. Clearly, the time interval [0, t_a] from start to absorption depends on the path (sample function) taken by the particle, and the length of this time interval is a random variable.
For the case illustrated by Fig. 14-7, an absorbing boundary at x_b requires that

\lim_{x \to x_b^-} f(x, t) = 0        (14-80)
Figure 14-7: At time t = t_a, process X(t) hits an absorbing boundary placed at x = x_b.
for all t. Intuitively, this condition says that the particle can be found only rarely in a small neighborhood (x_b - Δx, x_b), Δx > 0, of the boundary. Previously, the boundary condition (14-80) was shown to hold for a Wiener process subjected to an absorbing boundary (see (14-26)). In the remainder of this section, this boundary condition is supported by an intuitive argument based on arriving at a contradiction; that is, (14-80) is assumed to be false, and it is shown that this leads to a contradiction (see also pg. 219 of Cox and Miller, The Theory of Stochastic Processes). The argument given requires that X(t) be homogeneous; however, the boundary condition holds in the more general nonhomogeneous case.
Suppose that (14-80) is not true; suppose some time interval (t_1, t_2) and some displacement interval (x_b - Δx, x_b) exist such that

f(x, t) \ge \varepsilon > 0,   t_1 < t < t_2,   x_b - \Delta x < x < x_b,        (14-81)
for some small ε > 0. That is, suppose the density f(x, t) is bounded away from zero for x in some small neighborhood of the boundary and for t in some time interval. Then, on any infinitesimal time interval (t, t + Δt) ⊂ (t_1, t_2), the probability g(t)Δt that absorption occurs during (t, t + Δt) is greater than or equal to the joint probability that the particle is near x_b at time t and the process increment ΔX_t ≡ X(t + Δt) - X(t) carries the particle into the boundary. That is, to first order in Δt, the probability g(t)Δt must satisfy
g(t)\Delta t \ge P[ X(t + \Delta t) - X(t) \equiv \Delta X_t > \Delta x,\; x_b - \Delta x < X(t) < x_b ]
            = P[ \Delta X_t > \Delta x \mid x_b - \Delta x < X(t) < x_b ] \, P[ x_b - \Delta x < X(t) < x_b ],        (14-82)
where Δx is a small and arbitrary positive increment. Note that g(t) is the probability density function of the absorption time.
As Δt → 0, the right-hand side of (14-82) approaches zero on the order of √Δt or slower if Δx is allowed to approach zero on the order of √Δt. To see this, first note that (14-81)
implies

P[ x_b - \Delta x < X(t) < x_b ] \ge \varepsilon \Delta x        (14-83)
so that

g(t)\Delta t \ge \varepsilon \Delta x \, P[ \Delta X_t > \Delta x \mid x_b - \Delta x < X(t) < x_b ].        (14-84)
However, a first-order-in-Δt approximation for the conditional variance of ΔX_t is

Var[ \Delta X_t \mid X(t) = x ] \approx K^{(2)} \Delta t,        (14-85)
where K^(2) is a positive constant (see the paragraph preceding (14-64)). With a non-zero probability, the magnitude of a random variable exceeds its standard deviation. Hence, (14-85) implies the existence of a p_0 > 0 such that

P\left[ \Delta X_t > \sqrt{K^{(2)} \Delta t} \;\Big|\; x_b - \Delta x < X(t) < x_b \right] \ge p_0 > 0        (14-86)
for sufficiently small Δt. Set Δx = \sqrt{K^{(2)} \Delta t} in (14-84), and use (14-86) to obtain

g(t)\Delta t \ge \varepsilon p_0 \sqrt{K^{(2)} \Delta t},        (14-87)
which leads to

g(t) \ge \varepsilon p_0 \sqrt{K^{(2)} / \Delta t}.        (14-88)
Now, allow Δt → 0 in (14-88) to obtain the contradiction that the density function g(t) is infinite on t_1 < t < t_2. This contradiction implies that assumption (14-81) cannot be true; hence, boundary condition (14-80) must hold for all t ≥ 0.
Steady-State Solution to the Fokker-Planck Equation
Many applications are of a time-invariant nature, and they are characterized by Fokker-Planck equations that have time-invariant intensity coefficients. Usually, such an application is associated with a Markov process X(t) that becomes stationary as t → ∞. As t becomes large, the underlying density f(x, t | x_0, t_0) that describes the process approaches a steady-state density function f(x) that does not depend on t, t_0 or x_0. Often, the goal of system analysis in these cases is to find the steady-state density f(x). Alternatively, in the steady state, the first and second moments of X may be all that are necessary in some applications.
The steady-state density f(x) satisfies
0 = \frac{d}{dx} \left\{ K^{(1)}(x) f(x) - \frac{1}{2} K^{(2)} \frac{d}{dx} [ f(x) ] \right\},        (14-89)
the steady-state version of (14-64). In this equation, the diffusion coefficient K^(2) is assumed to be constant (see the paragraph before (14-64)). Integrate both sides of (14-89) to obtain

S_{ss} = K^{(1)}(x) f(x) - \frac{1}{2} K^{(2)} \frac{d}{dx} [ f(x) ],        (14-90)
where constant S_ss represents a steady-state value for the probability current.
The general solution of (14-90) can be found by using standard techniques. First, simplify the equation by substituting y(x) ≡ K^(2) f(x) to obtain
\frac{dy}{dx} - \left[ \frac{2 K^{(1)}(x)}{K^{(2)}} \right] y = -2 S_{ss}.        (14-91)
The integrating factor for (14-91) is
\mu(x) = \exp\left[ -\frac{2}{K^{(2)}} \int^{x} K^{(1)}(\xi) d\xi \right].        (14-92)
This result and (14-91) can be used to write

\mu(x) \left\{ \frac{dy}{dx} - \left[ \frac{2 K^{(1)}(x)}{K^{(2)}} \right] y \right\} = \frac{d}{dx} \left[ \mu(x) y \right] = -2 S_{ss} \mu(x)        (14-93)
so that

\frac{d}{dx} \left[ \mu(x) f(x) \right] = -\frac{2 S_{ss}}{K^{(2)}} \mu(x).        (14-94)
Finally, the general solution to (14-90) can be written as

f(x) = \mu(x)^{-1} \left\{ -\frac{2}{K^{(2)}} S_{ss} \int^{x} \mu(\xi) d\xi + C \right\}.        (14-95)
Note that (14-95) depends on the constants S_ss and C. The steady-state value of the probability current, S_ss, and the constant C are chosen so that f(x) satisfies specified boundary conditions (which are application-dependent) and the normalization condition

\int_{-\infty}^{\infty} f(x) dx = 1.        (14-96)
These results are used in what follows to compute the steady-state probability density function
that describes the state variable in a first-order system driven by white Gaussian noise.
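For natural boundary conditions, the current must vanish, so S_ss = 0 in (14-95) and f(x) reduces to C/μ(x) = C exp[-U_p(x)], normalized by (14-96). The sketch below (an added illustration) applies this to the RL-circuit coefficients (14-60) and (14-61); the circuit values and grid are assumptions. The numerically computed variance should agree with N_0/(4RL) from (14-63).

```python
import numpy as np

# Assumed circuit values; K1 and K2 are the RL-circuit coefficients (14-60), (14-61).
R, L, N0 = 10.0, 1e-3, 1e-4
K2 = N0 / (2 * L**2)
K1 = lambda x: -(R / L) * x

x = np.linspace(-0.3, 0.3, 20001)
dx = x[1] - x[0]
trap = lambda y: float(np.sum(0.5 * (y[1:] + y[:-1]) * dx))

Up = -(2.0 / K2) * np.cumsum(K1(x)) * dx           # potential (14-69), up to an additive constant
f = np.exp(-(Up - Up.min()))                        # S_ss = 0 case of (14-95), un-normalized
f /= trap(f)                                        # normalization condition (14-96)

print("numerical variance:", trap(x**2 * f), "   closed form N0/(4RL):", N0 / (4 * R * L))
```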
The One-Dimensional First-Passage Time Problem
Suppose Markov process X(t) denotes the instantaneous position of a particle that starts at x_0 when t = 0. Assume that absorbing boundaries exist at b_1 and b_2 with b_1 < x_0 < b_2. Let t_a denote the amount of time that is required for the particle to reach an absorbing boundary for the first time. Time t_a is called the first-passage time, and it varies from one path (sample function of X(t)) to the next. Hence, it is a random variable which depends on the initial value x_0. Figure 14-8 depicts the absorbing boundaries and two typical paths of the particle.
As it evolves with time, process X(t) is described by the density f(x, t | x_0, t_0), t_0 = 0. This evolution is described here in a qualitative manner; Fig. 14-9 is used in this effort, and it depicts the typical behavior of f(x, t | x_0, 0). At t = 0, the process starts at x_0; this implies that all of the probability is massed at x_0 as is shown by Fig. 14-9a. As t increases, the location of the particle becomes more uncertain, and f(x, t | x_0, 0) tends to "spread out" in the x variable as shown by Fig. 14-9b (this figure depicts the density for some t = t_1 > 0).
For t > 0, f(x, t | x_0, 0) is represented as the sum of a continuous function q(x, t | x_0, 0) and a pair of delta functions placed at b_1 and b_2. The delta functions represent the fact that probability will accumulate at the boundaries; this accumulation is due to the fact that sample functions will, sooner or later, terminate on a boundary and become absorbed. As shown by Fig. 14-9c, which
Figure 14-8: Two sample functions of X(t) and absorbing boundaries at b_1 and b_2.
depicts the case t_2 > t_1, q continues to "spread out" as time increases, and it is more likely that the particle impacts a boundary and is absorbed. The area under q(x, t | x_0, 0) decreases with time, and the delta function weights increase with time; however, the sum of the area under q and the delta function weights is unity. As t → ∞, the probability that the particle is absorbed approaches unity; this requirement implies that
\lim_{t \to \infty} q(x, t \mid x_0, 0) = 0.        (14-97)
This time-limiting case is depicted by Fig. 14-9d; on this figure, the quantity q is zero, and the delta function weights add to unity.

Figure 14-9: Density f(x, t | x_0, 0) at a) t = 0, b) t = t_1, c) t = t_2 > t_1, and d) t = ∞.
Function q(x, t | x_0, 0) is a solution of the one-dimensional Fokker-Planck equation given by (14-58). The initial condition

q(x, 0 \mid x_0, 0) = \delta(x - x_0)        (14-98)
and the absorbing boundary conditions

q(b_1, t \mid x_0, 0) = 0
q(b_2, t \mid x_0, 0) = 0,        (14-99)

t ≥ 0, should be used in this effort.
The Distribution and Density of the First-Passage Time Random Variable
Function q(x, t | x_0, 0) can be used to compute the distribution and density functions of the first-passage time random variable t_a. First, the quantity

\Psi(t \mid x_0, 0) \equiv \int_{b_1}^{b_2} q(x, t \mid x_0, 0) dx,        (14-100)
t ≥ 0, represents the probability that the particle has not been absorbed by time t > 0 given that the position of the particle is x_0, b_1 < x_0 < b_2, at t = 0. In a similar manner,

F_{t_a}(t \mid x_0, 0) \equiv P[ t_a \le t \mid x_0, 0 ]
                       = 1 - \Psi(t \mid x_0, 0)
                       = 1 - \int_{b_1}^{b_2} q(x, t \mid x_0, 0) dx        (14-101)
represents the probability that the first-passage time random variable t_a is not greater than t; that is, F_{t_a}(t | x_0, 0) is the distribution function for the first-passage time random variable t_a. Finally, the desired density function can be obtained by differentiating (14-101) with respect to t; this procedure yields the formula

f_{t_a}(t \mid x_0, 0) = \frac{\partial}{\partial t} F_{t_a}(t \mid x_0, 0) = -\frac{\partial}{\partial t} \Psi(t \mid x_0, 0)        (14-102)

for the density function of t_a.
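A compact way to see these definitions in action is to integrate the Fokker-Planck equation for q(x, t | x_0, 0) on a grid with the absorbing conditions (14-99) and then form Ψ(t | x_0, 0) and F_{t_a}. The explicit finite-difference sketch below is an added illustration with assumed values (zero drift, K^(2) = 1); for that special case the mean first-passage time has the known closed form x_0(b_2 - x_0)/K^(2), which the run should approximately reproduce.

```python
import numpy as np

# Assumed problem: zero drift, K2 = 1, absorbing boundaries b1 < x0 < b2.
K2, b1, b2, x0 = 1.0, 0.0, 1.0, 0.3
nx = 101
x = np.linspace(b1, b2, nx)
dx = x[1] - x[0]
dt = 0.25 * dx**2 / K2                       # explicit scheme; r = 0.5*K2*dt/dx**2 = 0.125 (stable)

q = np.zeros(nx)
q[np.argmin(np.abs(x - x0))] = 1.0 / dx      # discrete delta at x0, Eq. (14-98)

times, survival, t = [], [], 0.0
for _ in range(80_000):                      # integrate out to t = 2, long after absorption
    q[1:-1] += dt * 0.5 * K2 * (q[2:] - 2 * q[1:-1] + q[:-2]) / dx**2
    q[0] = q[-1] = 0.0                       # absorbing boundaries, Eq. (14-99)
    t += dt
    times.append(t)
    survival.append(np.sum(0.5 * (q[1:] + q[:-1]) * dx))   # Psi(t | x0, 0), Eq. (14-100)

times, Psi = np.array(times), np.array(survival)
F_ta = 1.0 - Psi                             # first-passage-time CDF, Eq. (14-101)
E_ta = np.sum(0.5 * (Psi[1:] + Psi[:-1]) * np.diff(times))
print("P[t_a <= 0.1] ~", np.interp(0.1, times, F_ta))
print("E[t_a] ~", E_ta, "   closed form x0*(b2 - x0)/K2:", x0 * (b2 - x0) / K2)
```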
The Expected Value of the First-Passage Time Random Variable
The expected value of the first-passage time random variable is important in many applications. A simple expression for E[t_a] is determined in this section for the case where diffusion coefficient K^(2) is constant.
The average value of the first-passage time can be expressed in terms of q(x, t | x_0, 0). To accomplish this, note that (14-102) can be used to write

E[t_a] = \int_{0}^{\infty} t \, f_{t_a}(t \mid x_0, 0) dt = -\int_{0}^{\infty} t \frac{\partial}{\partial t} \Psi(t \mid x_0, 0) dt.        (14-103)
However, the integral in (14-103) can be evaluated by parts to yield

E[t_a] = -t \Psi(t \mid x_0, 0) \Big|_{0}^{\infty} + \int_{0}^{\infty} \Psi(t \mid x_0, 0) dt.        (14-104)
Now, the integral in (14-104) is assumed to converge. Hence, it is necessary that Ψ(t | x_0, 0) approach zero faster than 1/t as t → ∞; this implies that the first term on the right of (14-104) is zero and that
E[t_a] = \int_{0}^{\infty} \Psi(t \mid x_0, 0) dt.        (14-105)
Finally, substitute (14-100) into this and obtain

E[t_a] = \int_{b_1}^{b_2} Q(x \mid x_0, 0) dx,        (14-106)
where

Q(x \mid x_0, 0) \equiv \int_{0}^{\infty} q(x, t \mid x_0, 0) dt.        (14-107)
Note that (14-99) implies that Q satisfies the condition

Q(b_1 \mid x_0, 0) = 0
Q(b_2 \mid x_0, 0) = 0.        (14-108)
Equations (14-106) and (14-107) show that the expected value of the first-passage time can be expressed in terms of q(x, t | x_0, 0).
A simple first-order linear differential equation can be obtained that describes Q. To obtain this equation, first note that q(x, t | x_0, 0) satisfies the one-dimensional Fokker-Planck equation

\frac{\partial}{\partial t} q(x, t \mid x_0, 0) = -\frac{\partial}{\partial x} \left[ K^{(1)}(x) q(x, t \mid x_0, 0) \right] + \frac{1}{2} K^{(2)} \frac{\partial^2}{\partial x^2} q(x, t \mid x_0, 0),        (14-109)
where it has been assumed that K^(2) > 0 is constant. Now, integrate both sides of this last equation with respect to time and obtain
q(x, \infty \mid x_0, 0) - q(x, 0 \mid x_0, 0) = -\frac{d}{dx} \left[ K^{(1)}(x) Q(x \mid x_0, 0) \right] + \frac{1}{2} K^{(2)} \frac{d^2}{dx^2} Q(x \mid x_0, 0),        (14-110)
where Q is given by (14-107). Equations (14-97) and (14-98) can be used with (14-110) to
obtain
-\delta(x - x_0) = -\frac{d}{dx} \left[ K^{(1)}(x) Q(x \mid x_0, 0) \right] + \frac{1}{2} K^{(2)} \frac{d^2}{dx^2} Q(x \mid x_0, 0).        (14-111)
Now, integrate both sides of this result to obtain

\frac{d}{dx} Q(x \mid x_0, 0) - \left[ \frac{2 K^{(1)}(x)}{K^{(2)}} \right] Q(x \mid x_0, 0) = \frac{2}{K^{(2)}} \Big( C_0 - U(x - x_0) \Big),        (14-112)
where C_0 is a constant of integration, and U(x) denotes the unit step function. Equation (14-112) is a first-order, linear differential equation that describes Q. Due to (14-108), it must be solved subject to the boundary conditions
Q(b_1 \mid x_0, 0) = \int_{0}^{\infty} q(b_1, t \mid x_0, 0) dt = 0
Q(b_2 \mid x_0, 0) = \int_{0}^{\infty} q(b_2, t \mid x_0, 0) dt = 0.        (14-113)
The integrating factor for (14-112) is related to the potential function U_p(x) (see (14-69)) by

\mu(x) = \exp\left[ -\frac{2}{K^{(2)}} \int^{x} K^{(1)}(\xi) d\xi \right] = \exp[ U_p(x) ]        (14-114)
since

\frac{d}{dx} \left[ \mu(x) Q(x \mid x_0, 0) \right] = \frac{2}{K^{(2)}} \mu(x) \Big( C_0 - U(x - x_0) \Big).        (14-115)
Now, integrate both sides of (14-115) to obtain

Q(x \mid x_0, 0) = \mu(x)^{-1} \left\{ \frac{2}{K^{(2)}} \int_{b_1}^{x} \mu(\xi) \Big( C_0 - U(\xi - x_0) \Big) d\xi + C_1 \right\},        (14-116)
where C_1 is a constant of integration. Application of boundary conditions (14-113) leads to the determination that C_1 = 0 and
C_0 = \frac{ \int_{b_1}^{b_2} \mu(\xi) U(\xi - x_0) d\xi }{ \int_{b_1}^{b_2} \mu(\xi) d\xi }.        (14-117)
A formula for the average first-passage time can be obtained by substituting (14-116), with C_1 = 0, into (14-106). This effort yields

E[t_a] = \frac{2}{K^{(2)}} \int_{b_1}^{b_2} \mu(x)^{-1} \left\{ \int_{b_1}^{x} \mu(\xi) \Big( C_0 - U(\xi - x_0) \Big) d\xi \right\} dx,        (14-118)

where constant C_0 is given by (14-117).
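Formula (14-118) can be evaluated numerically for any drift once μ(x) and C_0 are tabulated. The sketch below is an added check with an assumed constant drift K^(1)(x) = a; the quadrature result is compared against a brute-force Euler simulation of dX = K^(1) dt + √(K^(2)) dW with absorbing boundaries (the Monte Carlo figure carries a small time-discretization bias).

```python
import numpy as np

# Assumed example: constant drift a, constant K2, absorbing boundaries b1 < x0 < b2.
rng = np.random.default_rng(7)
a, K2, b1, b2, x0 = 0.8, 1.0, 0.0, 1.0, 0.3

x = np.linspace(b1, b2, 2001)
dx = x[1] - x[0]
trap = lambda y: float(np.sum(0.5 * (y[1:] + y[:-1]) * dx))

mu = np.exp(-(2.0 / K2) * a * (x - b1))                 # integrating factor (14-114) for K1 = a
C0 = trap(mu * (x >= x0)) / trap(mu)                    # Eq. (14-117)
inner = np.cumsum(mu * (C0 - (x >= x0))) * dx           # inner integral of (14-118)
E_ta = (2.0 / K2) * trap(inner / mu)                    # Eq. (14-118)

# Monte Carlo first-passage times (Euler steps; stop once every path is absorbed).
dt, paths = 1e-4, 20_000
X = np.full(paths, x0)
t_abs = np.zeros(paths)
alive = np.ones(paths, dtype=bool)
t = 0.0
while alive.any() and t < 20.0:                         # time cap only as a safety net
    X[alive] += a * dt + rng.normal(0.0, np.sqrt(K2 * dt), size=int(alive.sum()))
    t += dt
    hit = alive & ((X <= b1) | (X >= b2))
    t_abs[hit] = t
    alive &= ~hit

print("E[t_a] from (14-118):", E_ta, "   Monte Carlo:", t_abs.mean())
```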
Boundary Absorption Rates
Probability current flows into both boundaries b_1 and b_2. On Fig. 14-9, this is illustrated by the weights p_-(t) and p_+(t) increasing with time. However, in general, the flow is unequal, and one boundary may receive more probability current than the other. Hence, over a time interval [0, T], the probability that flows into the boundaries may be unequal, a phenomenon that is analyzed in this section.
Figure 14-1 illustrates the fact that the amount of probability in an interval changes at a rate that is equal to the current flowing into the interval. This implies that

\int_{0}^{T} S(x_1, t \mid x_0, 0) dt

represents the total amount of probability that flows (in the direction of increasing x) past point x_1 during [0, T]. This result can be used to quantify the amount of probability that enters the boundaries depicted on Fig. 14-9.
As is illustrated on Fig. 14-9, probability flows into the boundaries at b_1 and b_2 as time increases. During the time interval [0, T], the total amount of probability that flows into the boundaries b_1 and b_2 is
p_-(T) = -\int_{0}^{T} S(b_1, t \mid x_0, 0) dt
p_+(T) = \int_{0}^{T} S(b_2, t \mid x_0, 0) dt,        (14-119)
respectively. The minus sign on the first of these integrals results from the fact that probability must flow in a direction of decreasing x (to the left) if it enters boundary b_1. As T approaches infinity, the total probability that enters the boundaries b_1 and b_2 is p_-(∞) and p_+(∞) = 1 - p_-(∞), respectively.
In terms of q(x, t | x_0, 0), the probability current on the interval b_1 ≤ x ≤ b_2 can be expressed as (see (14-65))

S(x, t \mid x_0, 0) = K^{(1)}(x) q(x, t \mid x_0, 0) - \frac{1}{2} K^{(2)} \frac{\partial}{\partial x} [ q(x, t \mid x_0, 0) ].        (14-120)
Integrate this equation over 0 ≤ t < ∞, and use (14-107) to write

\int_{0}^{\infty} S(x, t \mid x_0, 0) dt = K^{(1)}(x) Q(x \mid x_0, 0) - \frac{1}{2} K^{(2)} \frac{d}{dx} [ Q(x \mid x_0, 0) ].        (14-121)
This result describes the total probability that flows to the right (in the direction of increasing x) past point x during the time interval [0, ∞). Now, in this last result, use boundary conditions (14-108) and p_+ given by (14-119) to write

p_+(\infty) = -\frac{1}{2} K^{(2)} \frac{d}{dx} [ Q(x \mid x_0, 0) ] \Big|_{x = b_2}        (14-122)
for the total probability absorbed at boundary b_2. In a similar manner, the total probability absorbed at boundary b_1 is given as

p_-(\infty) = \frac{1}{2} K^{(2)} \frac{d}{dx} [ Q(x \mid x_0, 0) ] \Big|_{x = b_1}.        (14-123)
The sign difference in the last two equations results from the fact that current entering b_1 must flow in a direction that is opposite to the current flowing into boundary b_2.
An expression for the ratio of absorption probabilities p_+(∞) and p_-(∞) is needed in order to compute probabilities. By using (14-122) and (14-123), this ratio can be expressed as

\frac{p_+(\infty)}{p_-(\infty)} = -\frac{ \frac{d}{dx} [ Q(x \mid x_0, 0) ] \big|_{x = b_2} }{ \frac{d}{dx} [ Q(x \mid x_0, 0) ] \big|_{x = b_1} }.        (14-124)
The derivatives in this result can be supplied by (14-112) when, as assumed here, K^(2) is independent of x. Use boundary conditions (14-113) in (14-112) to obtain
\frac{d}{dx} Q(x \mid x_0, 0) \Big|_{x = b_1, b_2} = \frac{2}{K^{(2)}} \Big( C_0 - U(x - x_0) \Big) \Big|_{x = b_1, b_2},        (14-125)
where C_0 is given by (14-117). Now, combine the last two equations to obtain
\frac{p_+(\infty)}{p_-(\infty)} = -\frac{ C_0 - U(b_2 - x_0) }{ C_0 - U(b_1 - x_0) },        (14-126)
which becomes

\frac{p_+(\infty)}{p_-(\infty)} = \frac{1 - C_0}{C_0}        (14-127)
for the usual b_1 < x_0 < b_2 case. This last result provides a ratio of the absorption probabilities in terms of the constant C_0. Finally, combine the equality p_+(∞) + p_-(∞) = 1 with (14-127) to obtain
p_-(\infty) = C_0
p_+(\infty) = 1 - C_0        (14-128)

for the total probabilities that enter the boundaries at b_1 and b_2, respectively.
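For a constant drift, C_0 in (14-117) can be written in closed form, which provides a quick self-check of (14-128). The snippet below (an added illustration with assumed values) evaluates C_0 both by quadrature and from the closed-form ratio of exponentials that follows from μ(x); the two numbers, and hence p_-(∞) and p_+(∞), should agree.

```python
import numpy as np

# Assumed example: constant drift K1 = a, constant K2, boundaries b1 < x0 < b2.
a, K2, b1, b2, x0 = 0.8, 1.0, 0.0, 1.0, 0.3

x = np.linspace(b1, b2, 20001)
dx = x[1] - x[0]
trap = lambda y: float(np.sum(0.5 * (y[1:] + y[:-1]) * dx))

mu = np.exp(-(2.0 * a / K2) * (x - b1))                 # integrating factor (14-114) for K1 = a
C0 = trap(mu * (x >= x0)) / trap(mu)                    # Eq. (14-117), by quadrature

e = lambda u: np.exp(-2.0 * a * u / K2)
C0_closed = (e(x0) - e(b2)) / (e(b1) - e(b2))           # same ratio evaluated in closed form

print("p_-(inf) = C0     :", C0, "   closed form:", C0_closed)
print("p_+(inf) = 1 - C0 :", 1.0 - C0)
```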
