Prapun Suksompong
School of Electrical and Computer Engineering
Cornell University, Ithaca, NY 14853
ps92@cornell.edu
August 24, 2009
Contents

1 Mathematical Background
  1.1 Set Theory
  1.2 Enumeration / Combinatorics / Counting
  1.3 Dirac Delta Function
2 Classical Probability
  2.1 Examples
3 Probability Foundations
  3.1 Algebra and σ-algebra
  3.2 Kolmogorov's Axioms for Probability
  3.3 Properties of Probability Measure
  3.4 Countable Ω
  3.5 Independence
4 Random Element
  4.1 Random Variable
  4.2 Distribution Function
  4.3 Discrete random variable
  4.4 Continuous random variable
  4.5 Mixed/hybrid Distribution
  4.6 Independence
  4.7 Misc
5 PMF Examples
  5.1 Random/Uniform
  5.2 Bernoulli and Binary distributions
  5.3 Binomial: B(n, p)
  5.4 Geometric: G(β)
  5.5 Poisson Distribution: P(λ)
  5.6 Compound Poisson
  5.7 Hypergeometric
  5.8 Negative Binomial Distribution (Pascal / Pólya distribution)
  5.9 Beta-binomial distribution
  5.10 Zipf or zeta random variable
6 PDF Examples
  6.1 Uniform Distribution
  6.2 Gaussian Distribution
  6.3 Exponential Distribution
  6.4 Pareto: Par(α), a heavy-tailed model/density
  6.5 Laplacian: L(α)
  6.6 Rayleigh
  6.7 Cauchy
  6.8 More PDFs
7 Expectation
8 Inequalities
9 Random Vectors
  9.1 Random Sequence
10 Transform Methods
  10.1 Probability Generating Function
  10.2 Moment Generating Function
  10.3 One-Sided Laplace Transform
  10.4 Characteristic Function
12 Convergences
  12.1 Summation of random variables
  12.2 Summation of independent random variables
  12.3 Summation of i.i.d. random variables
  12.4 Central Limit Theorem (CLT)
14 Real-valued Jointly Gaussian
1 Mathematical Background
1.1 Set Theory
1.1. Basic Set Identities:
• Double complement (involution): (A^c)^c = A
• Commutativity (symmetry): A ∪ B = B ∪ A, A ∩ B = B ∩ A
• Associativity:
  ◦ A ∩ (B ∩ C) = (A ∩ B) ∩ C
  ◦ A ∪ (B ∪ C) = (A ∪ B) ∪ C
• Distributivity:
  ◦ A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)
  ◦ A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
• de Morgan laws:
  ◦ (A ∪ B)^c = A^c ∩ B^c
  ◦ (A ∩ B)^c = A^c ∪ B^c
• The cardinality (or size) of a collection or set A, denoted |A|, is the number of elements of the collection. This number may be finite or infinite.
• N = {1, 2, 3, . . .}
• R = (−∞, ∞).
• For a set of sets, to avoid the repeated use of the word “set”, we will call it a
collection/class/family of sets.
Definition 1.3. Monotone sequence of sets
• The sequence of events (A1, A2, A3, . . .) is a monotone-increasing sequence of events if and only if A1 ⊂ A2 ⊂ A3 ⊂ . . .. In which case,
  ◦ ⋃_{i=1}^{n} Ai = An
  ◦ lim_{n→∞} An = ⋃_{i=1}^{∞} Ai.
• The indicator function of a set A is IA(ω) = 1 if ω ∈ A and IA(ω) = 0 otherwise.
• Alternative notation: 1A.
• A = {ω : IA(ω) = 1}
• A = B if and only if IA = IB
• IA∪B(ω) = max(IA(ω), IB(ω)) = IA(ω) + IB(ω) − IA(ω) · IB(ω)
1.5. Suppose ⋃_{i=1}^{N} Ai = ⋃_{i=1}^{N} Bi for all finite N ∈ N. Then ⋃_{i=1}^{∞} Ai = ⋃_{i=1}^{∞} Bi.

Proof. To show "⊂", suppose x ∈ ⋃_{i=1}^{∞} Ai. Then ∃N0 such that x ∈ A_{N0}. Now, A_{N0} ⊂ ⋃_{i=1}^{N0} Ai = ⋃_{i=1}^{N0} Bi ⊂ ⋃_{i=1}^{∞} Bi. Therefore, x ∈ ⋃_{i=1}^{∞} Bi. To show "⊃", use symmetry.
1.6. Let A1, A2, . . . be disjoint sets and define Bn = ⋃_{i≥n+1} Ai. Then (1) Bn+1 ⊂ Bn and (2) ⋂_{n=1}^{∞} Bn = ∅.

Proof. Bn = ⋃_{i≥n+1} Ai = (⋃_{i≥n+2} Ai) ∪ An+1 = Bn+1 ∪ An+1, so (1) is true. For (2), consider two cases. (2.1) For an element x ∉ ⋃_i Ai, we know that x ∉ Bn for every n and hence x ∉ ⋂_{n=1}^{∞} Bn. (2.2) For x ∈ ⋃_i Ai, we know that ∃i0 such that x ∈ A_{i0}. Note that x cannot be in any other Ai because the Ai's are disjoint. So x ∉ B_{i0} and therefore x ∉ ⋂_{n=1}^{∞} Bn.
1.7. Any countable union can be written as a union of pairwise disjoint sets: Given any sequence of sets Fn, define a new sequence by A1 = F1 and, for n ≥ 2,
  An = Fn ∩ F_{n−1}^c ∩ · · · ∩ F_1^c = Fn ∩ ⋂_{i∈[n−1]} F_i^c = Fn \ ⋃_{i∈[n−1]} Fi.
Then ⋃_{i=1}^{∞} Fi = ⋃_{i=1}^{∞} Ai, where the union on the RHS is a disjoint union.
Proof. Note that the An are pairwise disjoint. To see this, consider A_{n1}, A_{n2} where n1 ≠ n2. WLOG, assume n1 < n2. This implies that F_{n1}^c gets intersected in the definition of A_{n2}. So
  A_{n1} ∩ A_{n2} = (F_{n1} ∩ ⋂_{i=1}^{n1−1} F_i^c) ∩ (F_{n2} ∩ ⋂_{i=1}^{n2−1} F_i^c) = ∅,
because the second factor contains F_{n1}^c (since n1 ≤ n2 − 1) and F_{n1} ∩ F_{n1}^c = ∅.
Also, for finite N ≥ 1, we have ⋃_{n∈[N]} Fn = ⋃_{n∈[N]} An (*). To see this, note that (*) is true for N = 1. Suppose (*) is true for N = m and let B = ⋃_{n∈[m]} An = ⋃_{n∈[m]} Fn. Now, for N = m + 1, by definition, we have
  ⋃_{n∈[m+1]} An = (⋃_{n∈[m]} An) ∪ A_{m+1} = B ∪ (F_{m+1} \ B) = B ∪ F_{m+1} = ⋃_{i∈[m+1]} Fi.
So (*) is true for N = m + 1. By induction, (*) is true for all finite N. The extension to ∞ is done via (1.5).
For finite union, we can modify the above statement by setting Fn = ∅ for n ≥ N .
Then, An = ∅ for n ≥ N .
By construction, {An : n ∈ N} ⊂ σ({Fn : n ∈ N}). However, in general, it is not true that {Fn : n ∈ N} ⊂ σ({An : n ∈ N}). For example, for a finite union with N = 2, we cannot get F2 back from set operations on A1, A2 because we have lost the information about F1 ∩ F2. To create a disjoint union which preserves the information about the overlapping parts of the Fn's, we can define the A's by ⋂_{n∈N} Bn where each Bn is Fn or Fn^c. This is done in (1.8). However, this leads to uncountably many Aα, which is why the index α is used there instead of n. The uncountability problem does not occur if we start with a finite union. This is shown in the next result.
1.8. Decomposition:
• Fix sets A1, A2, . . . , An, not necessarily disjoint. Let Π be the collection of all sets of the form
  B = B1 ∩ B2 ∩ · · · ∩ Bn,
where each Bi is either Ai or its complement Ai^c. There are 2^n of these, say B^(1), B^(2), . . . , B^(2^n). Then,
  (a) Π is a partition of Ω, and
  (b) Π \ {⋂_{j∈[n]} Aj^c} is a partition of ⋃_{j∈[n]} Aj.
  Moreover, any Aj can be expressed as Aj = ⋃_{i∈Sj} B^(i) for some Sj ⊂ [2^n]. More specifically, Aj is the union of all the B^(i) whose construction uses Bj = Aj.
• Fix sets A1, A2, . . ., not necessarily disjoint. Let Π be the collection of all sets of the form B = ⋂_{n∈N} Bn, where each Bn is either An or its complement. There are uncountably many of these; hence we index the B's by α, that is, we write B^(α). Let I be the set of all α after we eliminate all repeated B^(α); note that I can still be uncountable. Then,
  (a) Π = {B^(α) : α ∈ I} is a partition of Ω, and
  (b) Π \ {⋂_{n∈N} An^c} is a partition of ⋃_{n∈N} An.
  Moreover, any Aj can be expressed as Aj = ⋃_{α∈Sj} B^(α) for some Sj ⊂ I. Because I is uncountable, in general Sj can be uncountable. More specifically, Aj is the (possibly uncountable) union of all the B^(α) whose construction uses Bj = Aj. The uncountability of Sj can be problematic because it implies that we need an uncountable union to get Aj back.
1.9. Let {Aα : α ∈ I} be a collection of disjoint sets where I is a nonempty index set. For
any set S ⊂ I, define a mapping g on 2I by g(S) = ∪α∈S Aα . Then, g is a 1:1 function if
and only if none of the Aα ’s is empty.
1.10. Let
  A = {(x, y) ∈ R² : (x + a1, x + b1) ∩ (y + a2, y + b2) ≠ ∅},
where ai < bi. Then,
  A = {(x, y) ∈ R² : x + (a1 − b2) < y < x + (b1 − a2)}
    = {(x, y) ∈ R² : a1 − b2 < y − x < b1 − a2}.
Figure 2: The region A = {(x, y) ∈ R² : (x + a1, x + b1) ∩ (y + a2, y + b2) ≠ ∅}.
1.12. Given a set of n distinct items, select a distinct ordered sequence (word) of length
r drawn from this set.
• Sampling without replacement:
  (n)_r = ∏_{i=0}^{r−1} (n − i) = n!/(n − r)! = n · (n − 1) · · · (n − (r − 1))  [r terms], r ≤ n.
  ◦ Ordered sampling of r ≤ n out of n items without replacement.
  ◦ For integers r, n such that r > n, we have (n)_r = 0.
  ◦ The definition in product form
    (n)_r = ∏_{i=0}^{r−1} (n − i) = n · (n − 1) · · · (n − (r − 1))  [r terms]
  can be extended to any real number n and a non-negative integer r. We define (n)_0 = 1. (This makes sense because we usually take the empty product to be 1.)
  ◦ (n)_1 = n
  ◦ (n)_r = (n − (r − 1)) (n)_{r−1}. For example, (7)_5 = (7 − 4)(7)_4.
  ◦ (1)_r = 1 if r = 1, and 0 if r > 1.
• Ratio:
  (n)_r / n^r = ∏_{i=0}^{r−1} (n − i) / ∏_{i=0}^{r−1} n = ∏_{i=0}^{r−1} (1 − i/n)
             ≈ ∏_{i=0}^{r−1} e^{−i/n} = e^{−(1/n) ∑_{i=0}^{r−1} i} = e^{−r(r−1)/(2n)}
             ≈ e^{−r²/(2n)}.
• Stirling's Formula:
  n! ≈ √(2πn) n^n e^{−n} = √(2πe) e^{(n + 1/2) ln(n/e)}.
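As a quick numerical sanity check of Stirling's formula (a minimal MATLAB sketch; the particular values of n are ours), compare n! with the approximation:

% Compare n! with Stirling's approximation for a few values of n.
n = [5 10 20 50];
exact  = factorial(n);
approx = sqrt(2*pi*n) .* n.^n .* exp(-n);
relErr = (exact - approx) ./ exact;     % relative error shrinks roughly like 1/(12n)
disp([n' exact' approx' relErr'])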
1.14. Binomial coefficient:
  C(n, r) = (n)_r / r! = n! / ((n − r)! r!).
This gives the number of unordered sets of size r drawn from an alphabet of size n without replacement; this is unordered sampling of r ≤ n out of n items without replacement. It is also the number of subsets of size r that can be formed from a set of n elements. Some properties are listed below:
(b) Use combin(n,r) in Mathcad. However, to do symbolic manipulation, use the factorial definition directly.
(c) Reflection property: C(n, r) = C(n, n − r).
(d) C(n, n) = C(n, 0) = 1.
(e) C(n, 1) = C(n, n − 1) = n.
(f) C(n, r) = 0 if n < r or r is a negative integer.
(g) max_r C(n, r) = C(n, ⌊(n + 1)/2⌋).
(h) Pascal's "triangle" rule: C(n, k) = C(n − 1, k) + C(n − 1, k − 1). This property divides the process of choosing k items into two steps, where the first step is to decide whether to choose the first item or not.
(i) ∑_{0≤k≤n, k even} C(n, k) = ∑_{0≤k≤n, k odd} C(n, k) = 2^{n−1}.
There are many ways to show this identity.
(i) Consider the number of subsets of a set S = {a1, a2, . . . , an} of n distinct elements. First choose a subset A of the first n − 1 elements of S; there are 2^{n−1} distinct choices. Then, for each A, to get a subset with an even number of elements, add the element an to A if and only if |A| is odd.
(ii) Look at the binomial expansion of (x + y)^n with x = 1 and y = −1.
(iii) For odd n, use the fact that C(n, r) = C(n, n − r).
(j) ∑_{k=0}^{min(n1,n2)} C(n1, k) C(n2, k) = C(n1 + n2, n1) = C(n1 + n2, n2).
  • This property divides the process of choosing items from the n1 + n2 items into two steps, where the first step is to choose from the first n1 items.
  • We can replace the min(n1, n2) in the sum by n1 or n2 if we define C(n, k) = 0 for k > n.
  • In particular, ∑_{r=0}^{n} C(n, r)² = C(2n, n).
(k) Parallel summation:
  ∑_{m=k}^{n} C(m, k) = C(k, k) + C(k + 1, k) + · · · + C(n, k) = C(n + 1, k + 1).
1.15. Binomial theorem: (x + y)^n = ∑_{r=0}^{n} C(n, r) x^r y^{n−r}.
(a) Let x = y = 1; then ∑_{r=0}^{n} C(n, r) = 2^n.
(b) Sums involving only the even terms (or only the odd terms):
  ∑_{r even} C(n, r) x^r y^{n−r} = (1/2)((x + y)^n + (y − x)^n), and
  ∑_{r odd} C(n, r) x^r y^{n−r} = (1/2)((x + y)^n − (y − x)^n).
In particular, if x + y = 1, then
  ∑_{r even} C(n, r) x^r y^{n−r} = (1/2)(1 + (1 − 2x)^n), and   (2a)
  ∑_{r odd} C(n, r) x^r y^{n−r} = (1/2)(1 − (1 − 2x)^n).   (2b)
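The identities (2a) and (2b) are easy to verify numerically; the following MATLAB sketch (the particular n and x are ours) compares both sides:

% Numerical check of the even/odd binomial sums (2a) and (2b).
n = 7;  x = 0.3;  y = 1 - x;                       % so x + y = 1
r = 0:n;
terms = arrayfun(@(k) nchoosek(n,k), r) .* x.^r .* y.^(n-r);
evenSum = sum(terms(mod(r,2)==0));                 % left side of (2a)
oddSum  = sum(terms(mod(r,2)==1));                 % left side of (2b)
disp([evenSum, (1+(1-2*x)^n)/2; oddSum, (1-(1-2*x)^n)/2])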
1.16. Multinomial Counting: The multinomial coefficient (n choose n1, n2, . . . , nr) is defined as
  ∏_{i=1}^{r} C(n − ∑_{k<i} nk, ni) = C(n, n1) · C(n − n1, n2) · C(n − n1 − n2, n3) · · · C(nr, nr) = n! / ∏_{i=1}^{r} ni!.
It is the number of ways that we can arrange n = ∑_{i=1}^{r} ni tokens when having r types of symbols and ni indistinguishable copies/tokens of a type-i symbol.
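In MATLAB the factorial form can be evaluated directly; the anonymous function below is our own shorthand, not a toolbox routine:

% Multinomial coefficient n!/(n1! n2! ... nr!) from a vector of counts.
multinom = @(counts) factorial(sum(counts)) / prod(factorial(counts));
multinom([2 3 5])    % arrangements of a word with letter counts 2, 3, 5: 2520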
Table 5: Binomial coefficient identities.
  Factorial expansion: C(n, k) = n! / (k!(n − k)!), k = 0, 1, 2, . . . , n
  Symmetry: C(n, k) = C(n, n − k), k = 0, 1, 2, . . . , n
  Monotonicity: C(n, 0) < C(n, 1) < · · · < C(n, ⌊n/2⌋), n ≥ 0
  Pascal's identity: C(n, k) = C(n − 1, k − 1) + C(n − 1, k), k = 0, 1, 2, . . . , n
  Binomial theorem: (x + y)^n = ∑_{k=0}^{n} C(n, k) x^k y^{n−k}, n ≥ 0
  Counting all subsets: ∑_{k=0}^{n} C(n, k) = 2^n, n ≥ 0
  Even and odd subsets: ∑_{k=0}^{n} (−1)^k C(n, k) = 0, n ≥ 0
  Sum of squares: ∑_{k=0}^{n} C(n, k)² = C(2n, n), n ≥ 0
  Square of row sums: (∑_{k=0}^{n} C(n, k))² = ∑_{k=0}^{2n} C(2n, k), n ≥ 0
  Absorption/extraction: C(n, k) = (n/k) C(n − 1, k − 1), k ≠ 0
  Trinomial revision: C(n, m) C(m, k) = C(n, k) C(n − k, m − k), 0 ≤ k ≤ m ≤ n
  Parallel summation: ∑_{k=0}^{m} C(n + k, k) = C(n + m + 1, m), m, n ≥ 0
  Diagonal summation: ∑_{k=0}^{n−m} C(m + k, m) = C(n + 1, m + 1), n ≥ m ≥ 0
  Vandermonde convolution: ∑_{k=0}^{r} C(m, k) C(n, r − k) = C(m + n, r), m, n, r ≥ 0
  Diagonal sums in Pascal's triangle: ∑_{k=0}^{⌊n/2⌋} C(n − k, k) = F_{n+1} (Fibonacci numbers), n ≥ 0
  Other common identities:
    ∑_{k=0}^{n} k C(n, k) = n 2^{n−1}, n ≥ 0
    ∑_{k=0}^{n} k² C(n, k) = n(n + 1) 2^{n−2}, n ≥ 0
    ∑_{k=0}^{n} (−1)^k k C(n, k) = 0, n ≥ 0
    ∑_{k=0}^{n} C(n, k)/(k + 1) = (2^{n+1} − 1)/(n + 1), n ≥ 0
    ∑_{k=0}^{n−1} C(n, k) C(n, k + 1) = C(2n, n − 1), n > 0
    ∑_{k=0}^{m} C(m, k) C(n, p + k) = C(m + n, m + p), m, n, p ≥ 0, n ≥ p + m
1.17. Multinomial Theorem:
  (x1 + · · · + xr)^n = ∑_{i1=0}^{n} ∑_{i2=0}^{n−i1} · · · ∑_{i_{r−1}=0}^{n − ∑_{j<r−1} ij} [ n! / ((n − ∑_{k<r} ik)! ∏_{k<r} ik!) ] x_r^{n − ∑_{j<r} ij} ∏_{k=1}^{r−1} x_k^{ik}.
• r-ary entropy function: Consider any vector p = (p1, p2, . . . , pr) such that pi ≥ 0 and ∑_{i=1}^{r} pi = 1. We define
  H(p) = −∑_{i=1}^{r} pi log_b pi.
As a special case, let pi = ni/n; then
  (n choose n1, n2, . . . , nr) = n! / ∏_{i=1}^{r} ni! ≈ 2^{n H2(p)},
where H2 denotes the entropy computed with base-2 logarithms.
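The approximation can be checked on a log2 scale; the MATLAB sketch below (our own, using gammaln to avoid overflow; the counts are illustrative) compares log2 of the multinomial coefficient with nH2(p):

% Compare log2 of n!/(n1!...nr!) with n*H2(p), where p = counts/n.
counts = [20 30 50];  n = sum(counts);  p = counts/n;
H2 = -sum(p .* log2(p));                                % entropy in bits
log2Coef = (gammaln(n+1) - sum(gammaln(counts+1))) / log(2);
disp([log2Coef, n*H2])   % close for large n, up to lower-order (logarithmic) terms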
(d) Extra Range Requirement: Suppose we further require that ai ≤ xi < bi where 0 ≤ ai < bi; then we work instead with yi = xi − ai. The number of solutions is
  C(k − ∑_{i=1}^{n} ai + n − 1, n − 1) + ∑_{I⊂[n], I≠∅} (−1)^{|I|} C(k − ∑_{i=1}^{n} ai − ∑_{i∈I} (bi − ai) + n − 1, n − 1).
1.19. The stars and bars argument:
• Consider the distribution of r = 10 indistinguishable balls into n = 5 distinguishable cells. Then we are only concerned with the number of balls in each cell. Using n − 1 = 4 bars, we can divide the r = 10 stars into n = 5 groups. For example, ****|***||**|* would mean (4, 3, 0, 2, 1). In general, there are C(n + r − 1, r) ways of arranging the bars and stars.
• There are C(n + r − 1, r) distinct vectors x = x_1^n of nonnegative integers such that x1 + x2 + · · · + xn = r. We use n − 1 bars to separate the r 1's.
• Suppose r letters are drawn with replacement from a set {a1, a2, . . . , an}. Given a drawn sequence, let xi be the number of ai in the drawn sequence. Then there are C(n + r − 1, r) possible x = x_1^n. (A numerical check of this count is sketched below.)
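A minimal MATLAB sketch (the particular n and r are ours) confirming the stars-and-bars count by brute-force enumeration:

% Count nonnegative integer solutions of x1 + ... + xn = r in two ways.
n = 4;  r = 6;
byFormula = nchoosek(n + r - 1, r);
grids = cell(1, n);
[grids{:}] = ndgrid(0:r);                          % every vector in {0,...,r}^n
byEnumeration = nnz(sum(cat(n+1, grids{:}), n+1) == r);
disp([byFormula, byEnumeration])                   % both equal 84 here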
Figure 5: Counting problems and corresponding sections in [18].
Figure 6: Counting problems (continued) and corresponding sections in [18].
Functions from a k-element set to an n-element set [18]:
  all functions: n^k
  one-to-one functions (n ≥ k): n!/(n − k)! = P(n, k)
  onto functions (n ≤ k): via the inclusion/exclusion principle
  partial functions: (n + 1)^k
1.3 Dirac Delta Function
The (Dirac) delta function or (unit) impulse function is denoted by δ(t). It is usually depicted as a vertical arrow at the origin. Note that δ(t) is not a true function; it is undefined at t = 0. We define δ(t) as a generalized function which satisfies the sampling property (or sifting property)
  ∫ φ(t) δ(t) dt = φ(0)
for any function φ(t) which is continuous at t = 0. From this definition, it follows that
  (δ ∗ φ)(t) = (φ ∗ δ)(t) = ∫ φ(τ) δ(t − τ) dτ = φ(t).
• δ(t) = 0 when t ≠ 0. Similarly, δ(t − T) = 0 for t ≠ T.
• ∫_A δ(t) dt = 1_A(0).
  (a) ∫ δ(t) dt = 1.
  (b) ∫_{{0}} δ(t) dt = 1.
  (c) ∫_{−∞}^{x} δ(t) dt = 1_{[0,∞)}(x). Hence, we may think of δ(t) as the "derivative" of the unit step function U(x) = 1_{[0,∞)}(x).
• ∫ φ(t) δ(t) dt = φ(0) for φ continuous at 0.
• ∫ φ(t) δ(t − T) dt = φ(T) for φ continuous at T. In fact, for any ε > 0,
  ∫_{T−ε}^{T+ε} φ(t) δ(t − T) dt = φ(T).
• δ(at) = (1/|a|) δ(t).
• δ(t − t1) ∗ δ(t − t2) = δ(t − (t1 + t2)).
• g(t) ∗ δ(t − t0) = g(t − t0).
• Fourier properties:
  ◦ Fourier series: δ(x − a) = 1/(2π) + (1/π) ∑_{n=1}^{∞} cos(n(x − a)) on [−π, π].
  ◦ Fourier transform: δ(t) = ∫ 1 · e^{j2πft} df.
• For a function g whose real-valued roots are ti,
  δ(g(t)) = ∑_i δ(t − ti) / |g′(ti)|.   (3)
Note that the (Dirac) delta function is to be distinguished from the discrete-time Kronecker delta function.
As a finite measure, δ is a unit mass at 0; that is, for any set A, we have δ(A) = 1[0 ∈ A]. In which case, we have again ∫ g dδ = ∫ g(x) δ(dx) = g(0) for any measurable g.
For a function g : D → R^n where D ⊂ R^n,
  δ(g(x)) = ∑_{z : g(z) = 0} δ(x − z) / |det dg(z)|   (5)
[3, p 387].
2 Classical Probability
Classical probability, which is based upon the ratio of the number of outcomes favorable to
the occurrence of the event of interest to the total number of possible outcomes, provided
most of the probability models used prior to the 20th century. Classical probability remains
of importance today and provides the most accessible introduction to the more general
theory of probability.
Given a finite sample space Ω, the classical probability of an event A is
  P(A) = |A| / |Ω| = (the number of cases favorable to the occurrence of the event) / (the total number of possible cases).
• In this section, we are more apt to refer to equipossible cases as ones selected at
random. Probabilities can be evaluated for events whose elements are chosen at
random by enumerating the number of elements in the event.
• Equipossibility is meaningful only for finite sample space, and, in this case, the eval-
uation of probability is accomplished through the definition of classical probability.
2.1. Basic properties of classical probability:
• P (A) ≥ 0
• P (Ω) = 1
• P (∅) = 0
• P (Ac ) = 1 − P (A)
• A ⊥ B is equivalent to P (A ∩ B) = 0.
• A ⊥ B ⇒ P (A ∪ B) = P (A) + P (B)
• Suppose Ω = {ω1, . . . , ωn} and P({ωi}) = 1/n. Then P(A) = ∑_{ω∈A} P({ω}).
  ◦ The probability of an event is equal to the sum of the probabilities of its component outcomes because outcomes are mutually exclusive.
The conditional probability of an event A given an event B is
  P(A|B) = |A ∩ B| / |B| = P(A ∩ B) / P(B).   (6)
• It is the updated probability of the event A given that we now know that B occurred.
• P (A|B) = P (A ∩ B|B) ≥ 0
P (Ω|B) = P (B|B) = 1.
• If A ⊥ C, P (A ∪ C |B ) = P (A |B ) + P (C |B )
• P (A ∩ B) = P (B)P (A|B)
• P (A ∩ B) ≤ P (A|B)
• P (A ∩ B) = P (A) × P (B|A)
• P (A ∩ B ∩ C) = P (A ∩ B) × P (C|A ∩ B)
2.3. Total Probability and Bayes Theorem. If {B1, . . . , Bn} is a partition of Ω, then for any set A,
• Total Probability Theorem: P(A) = ∑_{i=1}^{n} P(A|Bi) P(Bi).
• Bayes Theorem: Suppose P(A) > 0. Then P(Bk|A) = P(A|Bk) P(Bk) / ∑_{i=1}^{n} P(A|Bi) P(Bi).
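A small numerical illustration of the two theorems (the probabilities below are illustrative values of ours, for a partition of Ω into three events):

% Total probability and Bayes theorem for a three-event partition.
PB   = [0.5 0.3 0.2];          % P(B_i)
PAgB = [0.10 0.40 0.70];       % P(A|B_i)
PA   = sum(PAgB .* PB);        % total probability theorem: P(A) = 0.31
PBgA = (PAgB .* PB) / PA;      % Bayes theorem: P(B_k|A), sums to 1
disp(PA); disp(PBgA)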
Two events A and B are independent (denoted A ⫫ B) if and only if
  P(A ∩ B) = P(A) P(B),   (7)
which, in the classical setting, is equivalent to |A ∩ B| |Ω| = |A| |B|.
• Sometimes the definition of independence above does not agree with the everyday-language use of the word "independence". Hence, many authors use the term "statistical independence" for the definition above to distinguish it from other notions.
2.5. Having three pairwise independent events does not imply that the three events are jointly independent. In other words, A ⫫ B, B ⫫ C, A ⫫ C does not imply that A, B, C are jointly independent.
Example: Experiment of flipping a fair coin twice. Ω = {HH, HT, TH, TT}. Define event A to be the event that the first flip gives an H; that is, A = {HH, HT}. Event B is the event that the second flip gives an H; that is, B = {HH, TH}. Let C = {HH, TT}. Note also that even though the events A and B are not disjoint, they are independent.
2.6. Consider Ω of size 2n. We are given a set A ⊂ Ω of size n. Then P(A) = 1/2. We want to find all sets B ⊂ Ω such that A ⫫ B, that is, such that
  P(A ∩ B) = P(A) P(B).
Let r = |A ∩ B|. Then r can be any integer from 0 to n. Also, let k = |B \ A|. The condition for independence becomes
  r / (2n) = (1/2) · (r + k) / (2n),
which is equivalent to r = k. So the construction of the set B is given by choosing r elements from the set A, then choosing k = r elements from the set Ω \ A. There are C(n, r) choices for the first part and C(n, k) = C(n, r) choices for the second part. Therefore, the total number of possible B such that A ⫫ B is
  ∑_{r=0}^{n} C(n, r)² = C(2n, n).
2.1 Examples
2.7. Background
(a) Historically, dice is the plural of die, but in modern standard English dice is used
as both the singular and the plural. [Excerpted from Compact Oxford English Dic-
tionary.]
2.8. Chevalier de Mere’s Scandal of Arithmetic:
Which is more likely: obtaining at least one six in 4 tosses of a fair die (event A), or obtaining at least one double six in 24 tosses of a pair of dice (event B)?
We have
  P(A) = 1 − (5/6)^4 ≈ .518
and
  P(B) = 1 − (35/36)^24 ≈ .491.
Therefore, the first case is more probable.
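The two probabilities are easy to reproduce in MATLAB:

% Chevalier de Mere's two bets, computed directly.
pA = 1 - (5/6)^4;       % at least one six in 4 rolls of one die, about 0.5177
pB = 1 - (35/36)^24;    % at least one double six in 24 rolls of a pair, about 0.4914
disp([pA pB])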
Remark: Probability theory was originally inspired by gambling problems. In 1654, Chevalier de Mere invented a gambling system which bet even money on the second case above. When he began losing money, he asked his mathematician friend Blaise Pascal to analyze his gambling system. Pascal discovered that the Chevalier's system would lose about 51 percent of the time. Pascal became so interested in probability that, together with another famous mathematician, Pierre de Fermat, he laid the foundation of probability theory.
2.9. A random sample of size r with replacement is taken from a population of n elements. The probability of the event that in the sample no element appears twice (that is, no repetition in our sample) is
  (n)_r / n^r.
The probability that at least one element appears twice is
  p_u(n, r) = 1 − ∏_{i=1}^{r−1} (1 − i/n) ≈ 1 − e^{−r(r−1)/(2n)}.
• Probability of a birthday coincidence: the probability that at least two people in a group of r share a birthday is
  1, if r ≥ 365;
  1 − (365/365) · (364/365) · · · ((365 − (r − 1))/365), if 0 ≤ r ≤ 365.
  ◦ Birthday Paradox: In a group of 23 randomly selected people, the probability that at least two will share a birthday (assuming birthdays are equally likely to occur on any given day of the year) is about 0.5.
  ◦ Equivalently, let i.i.d. X1, . . . , Xr be uniformly distributed on a finite set {a1, . . . , an}. Then the probability that X1, . . . , Xr are not all distinct is p_u(n, r) = 1 − ∏_{i=1}^{r−1} (1 − i/n) ≈ 1 − e^{−r(r−1)/(2n)}.
(The accompanying plot shows p_u(n, r) together with the approximation 1 − e^{−r(r−1)/(2n)} versus r for n = 365; the probability crosses 0.5 near r = 23.)
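A short MATLAB check of the birthday computation for the standard case n = 365 and r = 23 (exact product versus the exponential approximation used above):

% Probability that at least two of r people share a birthday (n equally likely days).
n = 365;  r = 23;
pExact  = 1 - prod(1 - (1:r-1)/n);      % about 0.5073
pApprox = 1 - exp(-r*(r-1)/(2*n));      % about 0.5000
disp([pExact pApprox])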
2.10. Monty Hall's Game: The game starts by showing a contestant 3 closed doors, behind one of which is a prize. The contestant selects a door, but before that door is opened, Monty Hall, who knows which door hides the prize, opens one of the remaining doors (revealing no prize). The contestant is then allowed either to stay with his original guess or to switch to the other closed door.
Question: is it better to stay or to switch? Answer: Switch. Given that the contestant switches, the probability that he wins the prize is 2/3.
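A quick Monte Carlo sketch in MATLAB supports this answer (the number of trials is ours); it relies on the fact that switching wins exactly when the original pick is wrong:

% Monte Carlo estimate of the winning probability for the switching strategy.
trials = 1e5;  winsIfSwitch = 0;
for t = 1:trials
    prize = randi(3);  pick = randi(3);
    % The host opens a door that is neither the pick nor the prize, so
    % switching wins exactly when the original pick missed the prize.
    winsIfSwitch = winsIfSwitch + (pick ~= prize);
end
disp(winsIfSwitch/trials)   % close to 2/3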
2.11. False Positives on Diagnostic Tests: Let D be the event that the testee has the disease. Let + be the event that the test returns a positive result. Denote the probability of having the disease by pD.
Now, assume that the test always returns a positive result when the testee has the disease; this is equivalent to P(+|D) = 1 and P(+^c|D) = 0. Also, suppose that even when the testee does not have the disease, the test will still return a positive result with probability p+; that is, P(+|D^c) = p+.
If the test returns a positive result, then the probability that the testee has the disease is
  P(D|+) = P(D ∩ +) / P(+) = pD / (pD + p+(1 − pD)) = 1 / (1 + (p+/pD)(1 − pD))
         ≈ pD / p+   for a rare disease (pD ≪ 1).
            D                                      D^c                                        (row sum)
+           P(+ ∩ D) = P(+|D)P(D) = 1 · pD = pD    P(+ ∩ D^c) = P(+|D^c)P(D^c) = p+(1 − pD)   P(+) = pD + p+(1 − pD)
+^c         P(+^c ∩ D) = P(+^c|D)P(D) = 0 · pD = 0 P(+^c ∩ D^c) = (1 − p+)(1 − pD)            P(+^c) = (1 − p+)(1 − pD)
(col sum)   P(D) = pD                              P(D^c) = 1 − pD                            P(Ω) = 1
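Plugging in illustrative numbers of our own (a rare disease, pD = 10^{-3}, and a false-positive probability p+ = 0.05) shows how small P(D|+) can be:

% Posterior probability of disease given a positive test.
pD = 1e-3;  pPlus = 0.05;
posterior = pD / (pD + pPlus*(1 - pD));   % exact P(D|+), about 0.0196
approx    = pD / pPlus;                   % rare-disease approximation, 0.02
disp([posterior approx])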
3 Probability Foundations
To study formal definition of probability, we start with the probability space (Ω, A, P ). Let
Ω be an arbitrary space or set of points ω. Viewed probabilistically, a subset of Ω is an
event and an element ω of Ω is a sample point. Each event is a collection of outcomes
which are elements of the sample space Ω.
The theory of probability focuses on collections of events, called event algebras and
typically denoted A (or F) that contain all the events of interest (regarding the random
experiment E) to us, and are such that we have knowledge of their likelihood of occurrence.
The probability P itself is defined as a number in the range [0, 1] associated with each event
in A.
Definition 3.1. [7, Def 1.6.1 p38] An event algebra A is a collection of subsets of the sample space Ω that contains Ω and is closed under complementation and finite unions.
In other words, "a class is called an algebra on Ω if it contains Ω itself and is closed under the formation of complements and finite unions."
• Ω = (0, 1]. B0 = the collection of all finite unions of intervals of the form (a, b] ⊂ (0, 1].
  ◦ B0 is not a σ-field. Consider the set ⋃_{i=1}^{∞} (1/(2i + 1), 1/(2i)].
(Footnote 1: There is no problem when Ω is countable.)
3.3. Properties of an event algebra A:
(a) Nonempty: ∅ ∈ A, Ω ∈ A.
(b) A ⊂ 2^Ω.
• A ∈ A ⇒ A^c ∈ A.
• A, B ∈ A ⇒ A ∪ B ∈ A, A ∩ B ∈ A, A \ B ∈ A, A∆B = (A \ B) ∪ (B \ A) ∈ A.
• A1, A2, . . . , An ∈ A ⇒ ⋃_{i=1}^{n} Ai ∈ A and ⋂_{i=1}^{n} Ai ∈ A.
(f) The largest A is the set of all subsets of Ω known as the power set and denoted by
2Ω .
(g) Cardinality of Algebras: An algebra of subsets of a finite set of n elements always has cardinality of the form 2^k, k ≤ n.
3.4. There is a smallest (in the sense of inclusion) algebra containing any given family C of subsets of Ω. For C ⊂ 2^Ω, the algebra generated by C is
  ⋂ {G : G is an algebra and C ⊂ G},
i.e., the intersection of all algebras containing C. It is the smallest algebra containing C.
Definition 3.5. A σ-algebra A is an event algebra that is also closed under countable unions:
  (∀i ∈ N) Ai ∈ A ⇒ ⋃_{i∈N} Ai ∈ A.
Remarks:
3.6. Because every σ-algebra is also an algebra, it has all the properties listed in (3.3). Extra properties of a σ-algebra A are as follows:
(a) A1, A2, . . . ∈ A ⇒ ⋃_{j=1}^{∞} Aj ∈ A and ⋂_{j=1}^{∞} Aj ∈ A.
(c) The collection of σ-algebras on Ω is closed under arbitrary intersection: let Aα be a σ-algebra for each α in some index set (potentially uncountable). Then ⋂_α Aα is still a σ-algebra.
(a) σ(C) = ⋂ {H : H is a σ-field and C ⊂ H}, i.e., the intersection of all σ-fields containing C. It is the smallest σ-field containing C.
(b) σ(C) is the smallest σ-field containing C in the sense that if H is a σ-field and C ⊂ H, then σ(C) ⊂ H.
(c) C ⊂ σ(C).
(h) σ ({A, B}) has at most 16 elements. They are ∅, A, B, A∩B, A∪B, A\B, B \A, A∆B,
and their corresponding complements. See also (3.11). Some of these 16 sets can be
the same and hence the use of “at-most” in the statement.
(j) A ⊂ B ⇒ σ (A) ⊂ σ (B)
3.9. For the decomposition described in (1.8), let the starting collection be C1 and the decomposed collection be C2. Then σ(C1) = σ(C2).
Given a partition Π = {Aα : α ∈ I} of Ω, define
  A = {⋃_{α∈S} Aα : S ⊂ I}   (8)
[7, Ex 1.5 p. 39], where we define ⋃_{i∈∅} Ai = ∅. It turns out that A = σ(Π). Of course, a σ-algebra is also an algebra. Hence, (8) is also a way to construct an algebra. Note that, from (1.9), the necessary and sufficient condition for distinct S to produce distinct elements in (8) is that none of the Aα's are empty.
In particular, consider countable Ω = {xi : i ∈ N}, where the xi's are distinct. If we want a σ-algebra which contains {xi} for every i, then the smallest one is 2^Ω, which happens to be the biggest one. So 2^Ω is the only σ-algebra which is "reasonable" to use.
By (3.9), we know that σ(Π) = σ(C). Note that there are seemingly 2^n sets in Π; however, some of them can be ∅. We can eliminate the empty set(s) from Π and it is still a partition. So the cardinality of Π in (9) (after empty-set elimination) is k, where k is at most 2^n. The partition Π is then a collection of k sets which can be renamed A1, . . . , Ak, all of which are non-empty. Applying the construction in (8) (with I = [k]), we then have σ(C), whose cardinality is 2^k, which is ≤ 2^{2^n}. See also properties (3.8.7) and (3.8.8).
(Footnote 2: In this case, Π is countable if and only if I is countable.)
28
Definition 3.12. In general, the Borel σ-algebra or Borel algebra B is the σ-algebra generated by the open subsets of Ω.
(a) On Ω = R, the σ-algebra generated by any of the following classes is the Borel σ-algebra:
(c) The Borel σ-algebra on the extended real line is the extended σ-algebra.
(iv) the class of "southwest regions" Sx of points "southwest" of x ∈ R^k, i.e., Sx = {y ∈ R^k : yi ≤ xi, i = 1, . . . , k}.
3.13. The Borel σ-algebra B of subsets of the reals is the usual algebra when we deal
with real- or vector-valued quantities.
Our needs will not require us to delve into these issues beyond being assured that
events we discuss are constructed out of intervals and repeated set operations on these
intervals and these constructions will not lead us out of B. In particular, countable unions
of intervals, their complements, and much, much more are in B.
K1 Nonnegativity: ∀A ∈ A, P(A) ≥ 0.
K4 Monotone continuity: If Ai+1 ⊂ Ai for all i and ⋂_{i∈N} Ai = ∅ (a nested sequence of sets shrinking to the empty set), then
  lim_{i→∞} P(Ai) = 0.
• Note that there is never a problem with the convergence of the infinite sum; all partial
sums of these non-negative summands are bounded above by 1.
• If P satisfies K0–K3, then it satisfies K4 if and only if it satisfies K4′ [7, Theorem 3.5.1 p. 111].
Proof. To show "⇒", consider disjoint A1, A2, . . .. Define Bn = ⋃_{i>n} Ai. Then, by (1.6), Bn+1 ⊂ Bn and ⋂_{n=1}^{∞} Bn = ∅. So, by K4, lim_{n→∞} P(Bn) = 0. Furthermore,
  ⋃_{i=1}^{∞} Ai = Bn ∪ ⋃_{i=1}^{n} Ai,
where all the sets on the RHS are disjoint. Hence, by finite additivity,
  P(⋃_{i=1}^{∞} Ai) = P(Bn) + ∑_{i=1}^{n} P(Ai).
Equivalently, instead of K0–K4, we can define a probability measure using P0–P2 below.
(P0) P : A → [0, 1]
that satisfies:
(P1, K2) P(Ω) = 1.
• P(∅) = 0.
(a) P (∅) = 0
(e) Monotonicity: A, B ∈ A, A ⊂ B ⇒ P(A) ≤ P(B) and P(B − A) = P(B) − P(A).
For example,
  P(A1 ∪ A2 ∪ A3) = P(A1) + P(A2) + P(A3) − P(A1 ∩ A2) − P(A1 ∩ A3) − P(A2 ∩ A3) + P(A1 ∩ A2 ∩ A3).
(j) Finite additivity: If A = ⋃_{j=1}^{n} Aj with the Aj ∈ A disjoint, then P(A) = ∑_{j=1}^{n} P(Aj).
3.17. Continuity from above. If B1 ⊃ B2 ⊃ B3 ⊃ · · · is a decreasing sequence of measurable sets, then P(⋂_{i=1}^{∞} Bi) = lim_{j→∞} P(Bj). In a more compact notation, if Bi ↘ B, then P(B) = lim_{j→∞} P(Bj).
Proof. Let B = ⋂_{i=1}^{∞} Bi. Let Ak = Bk \ Bk+1, i.e., the part removed at step k. We consider two partitions of B1: (1) B1 = B ∪ ⋃_{j=1}^{∞} Aj and (2) B1 = Bn ∪ ⋃_{j=1}^{n−1} Aj. (1) implies P(B1) − P(B) = ∑_{j=1}^{∞} P(Aj). (2) implies P(B1) − P(Bn) = ∑_{j=1}^{n−1} P(Aj). We then have
  lim_{n→∞} (P(B1) − P(Bn)) = ∑_{j=1}^{∞} P(Aj) = P(B1) − P(B).
3.18. Let A be a σ-algebra. Suppose that P : A → [0, 1] satisfies (P1) and is finitely additive. Then the following are equivalent:
• (P2): If An ∈ A are disjoint, then P(⋃_{n=1}^{∞} An) = ∑_{n=1}^{∞} P(An).
• P(⋂_{n=1}^{∞} An) = lim_{N→∞} P(⋂_{n=1}^{N} An), and P(⋂_{n=1}^{∞} An) = lim_{N→∞} P(AN) if An+1 ⊂ An.
3.19. Given a common event algebra A, probability measures P1, . . . , Pm, and numbers λ1, . . . , λm with λi ≥ 0 and ∑_{i=1}^{m} λi = 1, a convex combination P = ∑_{i=1}^{m} λi Pi of probability measures is a probability measure.
3.20. A can not contain an uncountable, disjoint collection of sets of positive probability.
Definition 3.21. P is a discrete probability measure if there exist finitely or countably many points ωk and nonnegative masses mk such that, for all A ∈ A,
  P(A) = ∑_{k : ωk ∈ A} mk = ∑_k mk 1_A(ωk).
If there is just one of these points, say ω0, with mass m0 = 1, then P is a unit mass at ω0. In this case, ∀A ∈ A, P(A) = 1_A(ω0). Notation: P = δ_{ω0}.
• Here, Ω can be uncountable.
3.4 Countable Ω
A sample space Ω is countable if it is either finite or countably infinite. It is countably
infinite if it has as many elements as there are integers. In either case, the element of Ω
can be enumerated as, say, ω1 , ω2 , . . . . If the event algebra A contains each singleton set
{ωk } (from which it follows that A is the power set of Ω), then we specify probabilities
satisfying the Kolmogorov axioms through a restriction to the set S = {{ωk }} of singleton
events.
Definition 3.22. When Ω is countable, a probability mass function (pmf) is any function p : Ω → [0, 1] such that
  ∑_{ω∈Ω} p(ω) = 1.
3.5 Independence
Definition 3.24. Independence between events and collections of events.
• Events A1, A2, . . . , An are independent if and only if, for every subcollection A_{i1}, . . . , A_{ij} (2 ≤ j ≤ n),
  P(A_{i1} ∩ · · · ∩ A_{ij}) = P(A_{i1}) · · · P(A_{ij}).
  ◦ Note that the case j = 1 automatically holds. The case j = 0 can be regarded as the ∅ event case, which is also trivially true.
  ◦ There are ∑_{j=2}^{n} C(n, j) = 2^n − 1 − n constraints.
  ◦ Example: A1, A2, A3 are independent if and only if
    P(A1 ∩ A2 ∩ A3) = P(A1) P(A2) P(A3),
    P(A1 ∩ A2) = P(A1) P(A2),
    P(A1 ∩ A3) = P(A1) P(A3), and
    P(A2 ∩ A3) = P(A2) P(A3).
    Remark: The first equality alone is not enough for independence. See a counterexample below. In fact, it is possible for the first equation to hold while the last three fail, as shown in (3.26.b). It is also possible to construct events such that the last three equations hold (pairwise independence) but the first one does not, as demonstrated in (3.26.a).
  ≡ P(B1 ∩ B2 ∩ · · · ∩ Bn) = P(B1) P(B2) · · · P(Bn), where each Bi = Ai or Bi = Ω.
• An arbitrary collection of events is independent if and only if each of its finite subcollections is independent.
Definition 3.25. A collection of events {Aα } is called pairwise independent if for every
distinct events Aα1 , Aα2 , we have P (Aα1 ∩ Aα2 ) = P (Aα1 ) P (Aα2 )
Example 3.26.
(a) Let Ω = {1, 2, 3, 4}, A = 2^Ω, P({i}) = 1/4, A1 = {1, 2}, A2 = {1, 3}, A3 = {2, 3}. Then P(Ai ∩ Aj) = P(Ai) P(Aj) for all i ≠ j, but P(A1 ∩ A2 ∩ A3) ≠ P(A1) P(A2) P(A3).
(c) The paradox of ”almost sure” events: Consider two random events with probabilities
of 99% and 99.99%, respectively. One could say that the two probabilities are nearly
the same, both events are almost sure to occur. Nevertheless the difference may
become significant in certain cases. Consider, for instance, independent events which
may occur on any day of the year with probability p = 99%; then the probability
P that it will occur every day of the year is less than 3%, while if p = 99.99% then
P = 97%.
4 Random Element
4.1. A function X : Ω → E is said to be a random element of E if and only if X is measurable from (Ω, A) to (E, BE), which is equivalent to each of the following statements.
≡ X^{−1}(BE) ⊂ A
≡ σ(X) ⊂ A
The law or distribution of X is
  µX(A) = P^X(A) = P(X^{−1}(A)) = (P ∘ X^{−1})(A) = P({ω : X(ω) ∈ A}) = P({X ∈ A}).
• Random variables are important because they provide a compact way of referring to
events via their numerical attributes.
• The abbreviation r.v. will be used for “real-valued random variables” [11, p. 1].
4.7. At a certain point in most probability courses, the sample space is rarely mentioned
anymore and we work directly with random variables. The sample space often “disappears”
but it is really there in the background.
• [X ∈ B] = {ω ∈ Ω : X(ω) ∈ B} and
• In particular, P [X < x] is a shorthand for P ({ω ∈ Ω : X(ω) < x}).
4.10. Every random variable can be written as a sum of a discrete random variable and
a continuous random variable.
4.11. A random variable can have at most countably many points x such that P[X = x] > 0.
4.12. Point mass probability measures / Dirac measures, usually written εα or δα, are used to denote a point mass of size one at the point α. In this case,
• P^X({α}) = 1
• P^X({α}^c) = 0
4.13. There exist distributions that are neither discrete nor continuous.
Let µ(A) = (1/2) µ1(A) + (1/2) µ2(A) for µ1 discrete and µ2 coming from a density.
4.14. When X and Y take finitely many values, say x1, . . . , xm and y1, . . . , yn, respectively, we can arrange the probabilities pX,Y(xi, yj) in the m × n matrix
  [ pX,Y(x1, y1)  pX,Y(x1, y2)  . . .  pX,Y(x1, yn)
    pX,Y(x2, y1)  pX,Y(x2, y2)  . . .  pX,Y(x2, yn)
        ...           ...       . . .      ...
    pX,Y(xm, y1)  pX,Y(xm, y2)  . . .  pX,Y(xm, yn) ]
• The sum of the entries in the ith row is pX(xi), and the sum of the entries in the jth column is pY(yj).
• The sum of all the entries in the matrix is one.
• The distribution P X can be obtained from the distribution function by setting P X (−∞, x] =
FX (x); that is FX uniquely determines P X
• 0 ≤ FX ≤ 1
The (cumulative) distribution function (cdf) of the random variable X is the function FX(x) = P^X((−∞, x]) = P[X ≤ x].
◦ The function g(x) = P^X((−∞, x)) = P[X < x] is left-continuous in x.
◦ P[X = x] = F(x) − F(x^−) = the jump or saltus of F at x.
• For all x < y:
  P((x, y]) = F(y) − F(x)
  P([x, y]) = F(y) − F(x^−)
  P([x, y)) = F(y^−) − F(x^−)
  P((x, y)) = F(y^−) − F(x)
  P({x}) = F(x) − F(x^−)
• If F satisfies (C1), (C2), and (C3), then there exists a unique probability measure P on (R, B) that has P((a, b]) = F(b) − F(a) for all a, b ∈ R.
Definition 4.16. It is traditional to write X ∼ F to indicate that “X has distribution F ”
[23, p. 25].
The left-continuous inverse of a nondecreasing function g is
  g^{−1}(y) = inf{x : g(x) ≥ y}.

Figure 11: Left-continuous inverse.

• Trick: Just flip the graph of g along the line x = y, then make the resulting graph left-continuous. If g is a cdf, then only consider y ∈ (0, 1).
• Examples (u ∈ (0, 1); recall that U and 1 − U have the same U(0, 1) distribution, so u may stand for either):

  Distribution     F(x)                         F^{−1}(u)
  Exponential      1 − e^{−λx}                  −(1/λ) ln u
  Pareto           1 − x^{−a}                   u^{−1/a}
  Logistic         1/(1 + e^{−(x−µ)/b})         µ − b ln(1/u − 1)
  Extreme value    1 − e^{−e^{(x−a)/b}}         a + b ln ln(1/u)
  Geometric        1 − (1 − p)^x                ⌈ln u / ln(1 − p)⌉
  Weibull          1 − e^{−(x/a)^b}             a (−ln u)^{1/b}

Table 1: Left-continuous inverse.

To generate X ∼ E(λ), set X = −(1/λ) ln U.
Definition 4.19. Let X be a random variable with distribution function F. Suppose that p ∈ (0, 1). A value x such that F(x^−) = P[X < x] ≤ p and F(x) = P[X ≤ x] ≥ p is called a quantile of order p for the distribution. Roughly speaking, a quantile of order p is a value where the cumulative distribution crosses p. Note that it is not unique: if F(x) = p on an interval [a, b], then every x ∈ [a, b] is a quantile of order p.
A quantile of order 1/2 is called a median of the distribution. When there is only one median, it is frequently used as a measure of the center of the distribution. A quantile of order 1/4 is called a first quartile and a quantile of order 3/4 is called a third quartile. A median is a second quartile.
Assuming uniqueness, let q1, q2, and q3 denote the first, second, and third quartiles of X. The interquartile range is defined to be q3 − q1, and is sometimes used as a measure of the spread of the distribution with respect to the median. The five parameters max X, q1, q2, q3, min X are often referred to as the five-number summary. Graphically, the five numbers are often displayed as a boxplot.
4.20. If F is non-decreasing, right-continuous, with lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1, then F is the cdf of some probability measure on (R, B).
In particular, let U ∼ U(0, 1) and X = F^{−1}(U); then FX = F. Here, F^{−1} is the left-continuous inverse of F. Note that we have just explicitly defined a random variable X(ω) with distribution function F on Ω = (0, 1).
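This is the basis of inverse-cdf (quantile-transform) sampling. A minimal MATLAB sketch, using the exponential cdf as the example (λ and the evaluation points are ours):

% Generate Exp(lambda) samples via X = F^{-1}(U) and compare with the cdf.
lambda = 2;  U = rand(1e5,1);
X = -(1/lambda)*log(U);                   % samples with cdf 1 - exp(-lambda*x)
x = [0.1 0.5 1 2];
empirical = mean(X <= x);                 % empirical cdf (implicit expansion, R2016b+)
disp([empirical; 1 - exp(-lambda*x)])     % rows nearly coincide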
⇒ X is completely determined by the values µX ({x1 }) , µX ({x2 }) , . . .
• pi = pX (xi ) = P [X = xi ]
pX (xk ) = P [X = xk ].
Definition 4.27. Sometimes, it is convenient to work with the "pdf" of a discrete r.v. Let X be a discrete random variable as defined in Definition 4.23. Then the "pdf" of X is
  fX(x) = ∑_{xk} pX(xk) δ(x − xk), x ∈ R.   (13)
Although the delta function is not a well-defined function (rigorously, it is a unit measure at 0), this technique does allow easy manipulation of mixed distributions. The definitions of quantities involving discrete random variables and the corresponding properties can then be derived from the pdf, and hence there is no need to talk about the pmf at all!
4.4 Continuous random variable
Definition 4.28. A random variable X is said to be a continuous random variable if
and only if any one of the following equivalent conditions holds.
≡ ∀x, P [X = x] = 0
≡ FX is continuous
≡ X is absolutely continuous
≡ ∀a, b: FX(b) − FX(a) = ∫_a^b f(x) dx
• X is a continuous random variable
• P [X ∈ [a, b]] = P [X ∈ [a, b)] = P [X ∈ (a, b]] = P [X ∈ (a, b)] because the corre-
sponding integrals over an interval are not affected by whether or not the endpoints
are included or excluded. In other words, P [X = a] = P [X = b] = 0.
• P[fX(X) = 0] = 0
Definition 4.33 (Absolutely Continuous CDF). An absolutely continuous cdf Fac can be
written in the form Z x
Fac (x) = f (z)dz,
−∞
4.34. Any nonnegative function that integrates to one is a probability density function
(pdf) [9, p. 139].
• R(0) = P[T > 0] = P[T ≥ 0] = 1. (Here R(t) = P[T > t] is the reliability or survival function of the lifetime T.)
(b) The mean time to failure (MTTF) = E[T] = ∫_0^∞ R(t) dt.
(c) The (age-specific) failure rate or hazard function of a device or system with lifetime T is
  r(t) = lim_{δ→0} P[T ≤ t + δ | T > t] / δ = −R′(t)/R(t) = f(t)/R(t) = −(d/dt) ln R(t).
  (i) r(t) δ ≈ P[T ∈ (t, t + δ] | T > t]
  (ii) R(t) = e^{−∫_0^t r(τ) dτ}.
  (iii) f(t) = r(t) e^{−∫_0^t r(τ) dτ}.
Definition 4.37. A random variable whose cdf is continuous but whose derivative is the
zero function is said to be singular.
4.39. By allowing density functions to contain impulses, the cdfs of mixed random variables can be expressed in the form F(x) = ∫_{(−∞,x]} f(t) dt.
Specifically, FX(x) = ∫_{−∞}^{x} f̃X(t) dt + ∑_k P[X = xk] U(x − xk), where
• the xk are the distinct points at which FX has jump discontinuities, and
• f̃X(x) = F′X(x) wherever FX is differentiable at x, and f̃X(x) = 0 otherwise.
In which case,
  E[g(X)] = ∫ g(x) f̃X(x) dx + ∑_k g(xk) P[X = xk].
4.41. Suppose the cdf F can be expressed in the form F (x) = G(x)U (x − x0 ) for some
function G. Then, the density is f (x) = G0 (x)U (x − x0 ) + G(x0 )δ(x − x0 ). Note that
G(x0 ) = F (x0 ) = P [X = x0 ] is the jump of the cdf at x0 . When the random variable is
continuous, G(x0 ) = 0 and thus f (x) = G0 (x)U (x − x0 ).
4.6 Independence
Definition 4.42. A family of random variables {Xi : i ∈ I} is independent if ∀ finite
J ⊂ I, the family of random variables {Xi : i ∈ J} is independent. In words, “an infinite
collection of random elements is by definition independent if each finite subcollection is.”
Hence, we only need to know how to test independence for finite collection.
(a) (Ei , Ei )’s are not required to be the same.
(b) The collection of random variables {1Ai : i ∈ I} is independent iff the collection of
events (sets) {Ai : i ∈ I} is independent.
Definition 4.43. Independence among a finite collection of random variables: For finite I, the following statements are equivalent:
≡ (Xi)_{i∈I} are independent (or mutually independent [2, p 182]).
≡ P[Xi ∈ Hi ∀i ∈ I] = ∏_{i∈I} P[Xi ∈ Hi], where Hi ∈ Ei.
≡ P[(Xi : i ∈ I) ∈ ×_{i∈I} Hi] = ∏_{i∈I} P[Xi ∈ Hi], where Hi ∈ Ei.
≡ P^{(Xi : i∈I)}(×_{i∈I} Hi) = ∏_{i∈I} P^{Xi}(Hi), where Hi ∈ Ei.
≡ P[Xi ≤ xi ∀i ∈ I] = ∏_{i∈I} P[Xi ≤ xi].
≡ [Factorization Criterion] F_{(Xi : i∈I)}((xi : i ∈ I)) = ∏_{i∈I} F_{Xi}(xi).
≡ For absolutely continuous Xi with densities fXi: f_{(Xi : i∈I)}((xi : i ∈ I)) = ∏_{i∈I} fXi(xi).
Definition 4.44. If the Xα , α ∈ I are independent and each has the same marginal
distribution with distribution Q, we say that the Xα ’s are iid (independent and identically
iid
distributed) and we write Xα ∼ Q
• The abbreviation can be IID [23, p 39].
Definition 4.45. A pairwise independent collection of random variables is a set of
random variables any two of which are independent.
(a) Any collection of (mutually) independent random variables is pairwise independent
(b) Some pairwise independent collections are not independent. See example (4.46).
Example 4.46. Suppose X, Y, and Z have the following joint probability distribution: pX,Y,Z(x, y, z) = 1/4 for (x, y, z) ∈ {(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)}. This, for example, can be constructed by starting with independent X and Y that are Bernoulli(1/2) and setting Z = X ⊕ Y = X + Y mod 2.
(a) X, Y, Z are pairwise independent.
Definition 4.47. The convolution of probability measures µ1 and µ2 on (R, B) is the measure µ1 ∗ µ2 defined by
  (µ1 ∗ µ2)(H) = ∫ µ2(H − x) µ1(dx), H ∈ B_R.
(i) If Y (or FY) is absolutely continuous with density fY, then for any X (or FX), X + Y (or FX ∗ FY) is absolutely continuous with density (FX ∗ fY)(z) = ∫ fY(z − x) dFX(x).
If, in addition, FX has density fX, then
  ∫ fY(z − x) dFX(x) = ∫ fY(z − x) fX(x) dx.
This is denoted by fX ∗ fY.
In other words, if densities fX, fY exist, then FX ∗ FY has density fX ∗ fY, where
  (fX ∗ fY)(z) = ∫ fY(z − x) fX(x) dx.
4.48. If random variables X and Y are independent and have distribution µX and µY ,
then X + Y has distribution µX ∗ µY
(a) Let X and Y be nonnegative independent random variables on (Ω, A, P). Then E[XY] = EX EY.
(c) If X1, . . . , Xn are independent and ∑_{k=1}^{n} Xk has a finite second moment, then all the Xk have finite second moments as well. Moreover, Var[∑_{k=1}^{n} Xk] = ∑_{k=1}^{n} Var Xk.
(d) If X1, . . . , Xk ∈ L² are pairwise independent, then Var[∑_{i=1}^{k} Xi] = ∑_{i=1}^{k} Var[Xi].
4.7 Misc
4.50. The mode of a discrete probability distribution is the value at which its probability
mass function takes its maximum value. The mode of a continuous probability distribution
is the value at which its probability density function attains its maximum value.
• the mode is not necessarily unique, since the probability mass function or probability
density function may achieve its maximum value at several points.
5 PMF Examples
Each of the following pmfs will be defined on its support S. For Ω larger than S, we simply set the pmf to be 0.
X ∼                    Support set            pX(k)                              ϕX(u)
Uniform Un             {1, 2, . . . , n}      1/n
U{0,1,...,n−1}         {0, 1, . . . , n − 1}  1/n                                (1 − e^{iun}) / (n(1 − e^{iu}))
Bernoulli B(1, p)      {0, 1}                 1 − p for k = 0;  p for k = 1
Binomial B(n, p)       {0, 1, . . . , n}      C(n, k) p^k (1 − p)^{n−k}          (1 − p + p e^{iu})^n
Geometric G(β)         N ∪ {0}                (1 − β) β^k                        (1 − β) / (1 − β e^{iu})
Geometric G0(p)        N                      (1 − p)^{k−1} p
Poisson P(λ)           N ∪ {0}                e^{−λ} λ^k / k!                    e^{λ(e^{iu} − 1)}
5.1 Random/Uniform
5.1. Rn , Un
When an experiment results in a finite number of “equally likely” or “totally random”
outcomes, we model it with a uniform random variable. We say that X is uniformly
distributed on [n] if
  P[X = k] = 1/n, k ∈ [n].
We write X ∼ Un.
• pi = 1/n for i ∈ S = {1, 2, . . . , n}.
• Examples
Example 5.3. For X uniform on [−M:1:M], we have EX = 0 and Var X = M(M + 1)/3.
For X uniform on [N:1:M], we have EX = (N + M)/2 and Var X = (1/12)(M − N)(M − N + 2).
Example 5.4. Set S = {0, 1, 2, . . . , M}. Then the sum of two independent U(S) random variables has pmf
  p(k) = ((M + 1) − |k − M|) / (M + 1)²
for k = 0, 1, . . . , 2M. Note its triangular shape with maximum value p(M) = 1/(M + 1). To visualize the pmf in MATLAB, try
M = 5;                               % any nonnegative integer
k = 0:2*M;
P = (1/(M+1))*ones(1,M+1);           % pmf of U({0,1,...,M})
P = conv(P,P);                       % pmf of the sum of two independent copies
stem(k,P)
5.2 Bernoulli and Binary distributions
5.5. Bernoulli: B(1, p) or Bernoulli(p)
• S = {0, 1}
• p0 = q = 1 − p, p1 = p
• EX = 1 · p + 0 · (1 − p) = p, and E[X²] = 1² · p + 0² · (1 − p) = p.
• Var X = E[X²] − (EX)² = p − p² = p(1 − p). Note that the variance is maximized at p = 0.5, where p(1 − p) = 0.25.
5.6. Binary: Suppose X takes only two values a and b with b > a, and P[X = b] = p.
(b) Var X = (b − a)² Var I = (b − a)² p(1 − p). Note that it is still maximized at p = 1/2.
(d) Suppose the Xk are independent random variables taking on the two values b and a with P[Xk = b] = p = 1 − P[Xk = a], where b > a. Then the sum S = ∑_{k=1}^{n} Xk is "binomial" on {k(b − a) + an : k = 0, 1, . . . , n}, where the probability at the point k(b − a) + an is C(n, k) p^k (1 − p)^{n−k}.
5.3 Binomial: B(n, p)
5.7. Binomial distribution with size n and parameter p. p ∈ [0, 1].
(a) pi = C(n, i) p^i (1 − p)^{n−i} for i ∈ S = {0, 1, 2, . . . , n}
• Use binopdf(i,n,p) in MATLAB.
(b) X is the number of successes in n independent Bernoulli trials and hence the sum of n independent, identically distributed Bernoulli r.v.s.
(c) ϕX(u) = (1 − p + p e^{iu})^n
(d) EX = np
(e) EX² = (np)² + np(1 − p)
(f) Var[X] = np(1 − p)
(g) Tail probability: ∑_{r=k}^{n} C(n, r) p^r (1 − p)^{n−r} = Ip(k, n − k + 1)
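The tail identity in (g) can be verified numerically; the sketch below uses binopdf from the Statistics Toolbox and MATLAB's betainc for the regularized incomplete beta function (the particular n, p, k are ours):

% Check the binomial tail identity for one (n, p, k).
n = 10;  p = 0.3;  k = 4;
tailSum = sum(binopdf(k:n, n, p));        % P[X >= k]
regBeta = betainc(p, k, n - k + 1);       % I_p(k, n-k+1)
disp([tailSum regBeta])                   % the two values agree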
(Scanned figure: comparison of the B(n, p) pmf with its Poisson P(λ = np), normal N(λ, λ), and gamma approximations for p = 0.05 with n = 100 (λ = 5) and n = 800 (λ = 40). For large λ, X is approximately normal N(λ, λ); some say that the normal approximation is good when λ > 5.)
5.4 Geometric: G(β)

• P[X = k] = P[k failures followed by a success] = (P[failure])^k · P[success].
• P[X ≥ k] = β^k = the probability of having at least k initial failures = the probability of having to perform at least k + 1 trials.
• P[X > k] = β^{k+1} = the probability of having at least k + 1 initial failures.
• Memoryless property:
  ◦ P[X ≥ k + c | X ≥ k] = P[X ≥ c], k, c > 0.
  ◦ P[X > k + c | X ≥ k] = P[X > c], k, c > 0.
  ◦ If a success has not occurred in the first k trials (it has already failed k times), then the probability of having to perform at least j more trials is the same as the probability of initially having to perform at least j trials.
  ◦ Each time a failure occurs, the system "forgets" and begins anew as if it were performing the first trial.
  ◦ The geometric r.v. is the only discrete r.v. that satisfies the memoryless property.
• Examples:
  ◦ lifetimes of components, measured in discrete time units, when they fail catastrophically (without degradation due to aging)
  ◦ waiting times
    ∗ for the next customer in a queue
    ∗ between radioactive disintegrations
    ∗ between photon emissions
  ◦ the number of repeated, unlinked random experiments that must be performed prior to the first occurrence of a given event A
    ∗ the number of coin tosses prior to the first appearance of a 'head'
    ∗ the number of trials required to observe the first success
• P[X > k] = β^k.
• Suppose X1, . . . , Xn are independent with Xi ∼ G1(βi). Then min(X1, X2, . . . , Xn) ∼ G1(∏_{i=1}^{n} βi).
5.5 Poisson Distribution: P(λ)

5.13. EX = Var X = λ.
• Stein's characterization: if X ∼ P(λ), then E[λ g(X + 1) − X g(X)] = 0; conversely, if this holds for all bounded g, then X has the Poisson distribution P(λ).
5.14. Successive probabilities are connected via the relation k pX(k) = λ pX(k − 1).
                 Most probable value (imax)    Associated max probability
0 < λ < 1        0                             e^{−λ}
λ ∈ N            λ − 1 and λ                   (λ^λ / λ!) e^{−λ}
λ ≥ 1, λ ∉ N     ⌊λ⌋                           (λ^{⌊λ⌋} / ⌊λ⌋!) e^{−λ}
where the Xi's are i.i.d. E(λ). The equalities given by (*) are easily obtained by counting the number of events of a rate-λ Poisson process on the interval [0, 1].
5.17. Fano factor (index of dispersion): Var X / EX = 1.
An important property of the Poisson and compound Poisson laws is that their classes are closed under convolution (independent summation). In particular, we have the divisibility properties (5.22) and (5.32), which are straightforward to prove from their characteristic functions.
5.18 (Recursion equations). Suppose X ∼ P(λ). Let mk(λ) = E[X^k] and µk(λ) = E[(X − EX)^k]. Then
  m_{k+1}(λ) = λ (mk(λ) + m′k(λ))   (15)
  µ_{k+1}(λ) = λ (k µ_{k−1}(λ) + µ′k(λ))   (16)
[15, p 112]. Starting with m1 = λ = µ2 and µ1 = 0, the above equations lead to a recursive determination of the moments mk and µk.
5.19. E[1/(X + 1)] = (1/λ)(1 − e^{−λ}). Because, for d ∈ N, Y = (1/(X + 1)) ∑_{n=0}^{d} an X^n can be expressed as ∑_{n=0}^{d−1} bn X^n + c/(X + 1), the value of EY is easy to find if we know the EX^n.
5.20. Mixed Poisson distribution: Let X be Poisson with mean Λ, where the mean Λ is chosen in accord with a probability distribution whose characteristic function is ϕΛ. Then
  ϕX(u) = E[E[e^{iuX} | Λ]] = E[e^{Λ(e^{iu} − 1)}] = E[e^{i(−i(e^{iu} − 1))Λ}] = ϕΛ(−i(e^{iu} − 1)).
• EX = EΛ.
• Var X = Var Λ + EΛ.
• E[X²] = E[Λ²] + EΛ.
• Var[X|Λ] = E[X|Λ] = Λ.
• When Λ is a nonnegative integer-valued random variable, we have GX(z) = GΛ(e^{z−1}) and P[X = 0] = GΛ(1/e).
• E[XΛ] = E[Λ²].
• Cov[X, Λ] = Var Λ.
5.23. Raikov’s theorem: independent random variables can have their sum Poisson-
distributed only if every component of the sum is Poisson-distributed.
Let X1, X2, . . . be independent with Xj ∼ P(λj). If
  ∑_{j=1}^{∞} λj   (17)
converges to µ, then S = ∑_{j=1}^{∞} Xj converges with probability 1, and S has distribution P(µ). If, on the other hand, (17) diverges, then S diverges with probability 1.
• If X and Y are independent Poisson random variables with respective parameters λ and µ, then (1) Z = X + Y is P(λ + µ), and (2) conditioned on Z = z, X is B(z, λ/(λ + µ)).
  So E[X|Z] = (λ/(λ + µ)) Z, Var[X|Z] = Z λµ/(λ + µ)², and E[Var[X|Z]] = λµ/(λ + µ).
5.26. One of the reasons why the Poisson distribution is important is that many natural phenomena can be modeled by Poisson processes. For example, if we consider the number of occurrences Λ during a time interval of length τ in a rate-λ homogeneous Poisson process, then Λ ∼ P(λτ).
Example 5.27.
• The first use of the Poisson model is said to have been by a Prussian physician, von Bortkiewicz, who found that the annual number of late-19th-century Prussian soldiers kicked to death by horses followed a Poisson distribution [7, p 150].
• The number of clicks of a Geiger counter in τ seconds, when the average number of clicks in 1 second is λ.
5.29. Poisson distribution can be obtained as a limit from negative binomial distributions.
Thus, the negative binomial distribution with parameters r and p can be approximated
by the Poisson distribution with parameter λ = rqp (mean-matching), provided that p is
“sufficiently” close to 1 and r is “sufficiently” large.
56
5.30. Convergence of sum of bernoulli random variables to the Poisson Law
Suppose that for each n ∈ N
Xn,1 , Xn,2 , . . . , Xn,rn
are independent; the probability space for the sequence may change with n. Such a collec-
tion is called a triangular array [1] or double sequence [8] which captures the nature
of the collection when it is arranged as
X1,1 , X1,2 , . . . , X1,r1 ,
X2,1 , X2,2 , . . . , X2,r2 ,
.. .. ..
. . ··· .
Xn,1 , Xn,2 , . . . , Xn,rn ,
.. .. ..
. . ··· .
where the random variables in each row are independent. Let Sn = Xn,1 + Xn,2 + · · · + Xn,rn
be the sum of the random variables in the nth row.
Consider a triangular array of bernoulli random variables Xn,k with P [Xn,k = 1] = pn,k .
Prn
If max pn,k → 0 and pn,k → λ as n → ∞, then the sums Sn converges in distribution
1≤k≤rn k=1
to the Poisson law. In other words, Poisson distribution rare events limit of the binomial
(large n, small p).
As a simple special case, consider a triangular array of bernoulli random variables Xn,k
with P [Xn,k = 1] = pn . If npn → λ as n → ∞, then the sums Sn converges in distribution
to the Poisson law. i
To show this special case directly, we bound the first i terms of n! to get (n−i)
i!
≤ ni ≤
ni
i!
. Using the upper bound,
n i 1 npn n
pn (1 − pn )n−i ≤ (npn )i (1 − pn )−i (1 − ) .
i i! | {z } | {z } | {zn }
→λ i →1
→e−λ
n−i i
The lower bound gives the same limit because (n − i)i = n
ni where the first term
→ 1.
5.31. The mean and variance of CP (λ, L (V )) are λEV and λEV 2 respectively.
57
An important property of the Poisson and Compound Poisson laws is that their classes
are close under convolution (independent summation). In particular, we have divisibility
properties (5.22) and (5.32) which are straightforward to prove from their characteristic
functions.
5.33. Divisibility property
ofPthe compound Poisson
P law:
Suppose we
P have inde-
pendent Λi ∼ CP λi , µ(i) , then ni=1 Λi ∼ CP λ, λ1 ni=1 λi µ(i) where λ = ni=1 λi .
Proof.
!
Y
n X
n
ϕP
n (t) = eλi (ϕqi (t)−1) = exp λi (ϕqi (t) − 1)
Zi
i=1
i=1
!i=1 !!
X
n
1X
n
= exp λi ϕqi (t) − λ = exp λ λi ϕqi (t) − 1
i=1
λ i=1
We usually focus on the case when µ is a discrete probability measure on N = {1, 2, . . .}.
In which case, we usually refer to µ by the pmf P
q on N; q is called the base pmf. Equivalently,
CP (λ, q) is also the distribution
P of the sum i∈N iΛi where (Λi : i ∈ N) are independent
with Λi ∼ P (λqi ). Note that i∈N λqi = λ. The Poisson distribution is a special case of
the compound Poisson distribution where we set q to be the point mass at 1.
5.34. The compound negative binomial [Bower, Gerber, Hickman, Jones, and Nesbitt,
1982, Ch 11] can be approximated by the compound Poisson distribution.
5.7 Hypergeometric
An urn contains N white balls and M black balls. One draws n balls without replacement,
so n ≤ N + M . One gets X white balls and n − X black balls.
( N M
( x )(n−x)
, 0 ≤ x ≤ N and 0 ≤ n − x ≤ M
P [X = x] = (N +M
n )
0, otherwise
5.35. The hypergeometric distributions “converge” to the binomial distribution: Assume
N
that n is fixed, while N and M increase to +∞ with lim N +M = p, then
N →∞
M →∞
n x
p (x) → p (1 − p)n−x (binomial).
x
Note that binomial is just drawing balls with replacement:
n
x n−x x n−x
x
N M n N M
p (x) = = .
(N + M )n x N +M N +M
Intuitively, when N and M large, there is not much difference in drawing n balls with or
without replacement.
58
5.36. Extension: If we have m colors and Ni balls of color i. The urn contains N =
N1 + · · · + Nm balls. One draws n balls without replacement. Call Xi the number of balls
of color i drawn among n balls. (Of course, X1 + · · · + Xm = n.)
N N
( x11 )( x22 )···(Nxmm )
, x1 + · · · + xm = n and xi ≥ 0
P [X1 = x1 , . . . , Xm = xm ] = (Nn )
0, otherwise
rq rq
• EX = p
and Var [X] = p2
, where q = 1 − p.
• Note that if we define nx ≡ n(n−1)·····(n−(x−1))
x(x−1)·····1
. Then,
−r x r (r + 1) · · · · · (r + (x − 1)) x r+x−1
≡ (−1) = (−1) .
x x (x − 1) · · · · · 1 x
P P
• If independent Xi ∼ NegBin (ri , p), then Xi ∼ NegBin ri , p . This is easy to
i i
see from the characteristic function.
• When r = 1, we have geometric distribution. Hence, when r is a positive integer,
negative binomial is an independent sum of i.i.d. geometric.
• p (x) = Γ(r+x) r
Γ(r)x!
p (1 − p)x
5.38. A negative binomial distribution can arise
as a mixture of Poisson distributions with
p
mean distributed as a gamma distribution Γ q = r, λ = 1−p .
Let X be Poisson with mean λ. Suppose, that the mean λ is chosen in accord with a
probability distribution FΛ (λ). Then, ϕX (u) = ϕΛ (−i (eiu − 1)) [see compound Poisson
distribution]. Here, Λ ∼ Γ (q, λ0 ); hence, ϕΛ (u) = 1 u q .l
1−i λ
λ
q
0
−q 0
eiu −1
So, ϕX (u) = 1 − λ0 = 1− 1 eiu , which is negative binomial with p = λ0λ+1
λ0 +1 0
.
λ0 +1
59
5.9 Beta-binomial distribution
A variable with a beta binomial distribution is distributed as a binomial distribution with
parameter p, where p is distribution with a beta distribution with parameters αand β.
• P (k |p ) = nk pk (1 − p)n−k .
nq1
• EX = q1 +q2
6 PDF Examples
6.1 Uniform Distribution
6.1. Characterization for uniform[a, b]:
1 0 x < a, x > b
(a) f (x) = b−a U (x − a) U (b − x) = 1
b−a
a≤x≤b
0 x < a, x > b
(b) F (x) = x−a
b−a
a≤x≤b
b+a sin(u b−a
2 )
(c) ϕX (u) = eiu 2
u b−a
2
esb −esa
(d) MX (s) = s(b−a)
.
6.2. For most purpose, it does not matter whether the value of the density f at the
1
endpoints are 0 or b−a .
Example 6.3.
• Phase of oscillators ⇒ [-π, π] or [0,2π]
60
X∼ fX (x) ϕX (u)
1 b+a sin(u b−a
2 )
Uniform U(a, b) 1 (x)
b−a [a,b]
eiu 2
u b−a
2
Exponential E(λ) λe−λx 1[0,∞] (x) λ
λ−iu
x−s
1 − µ−s0
Shifted Exponential (µ, s0 ) µ−s0
e 01
[s0 ,∞) (x)
α −αx
Truncated Exp. e−αa −e−αb
e 1[a,b] (x)
α −α|x| α2
Laplacian L(α) 2
e α2 +u2
1 x−m 2
Normal N (m, σ 2 ) √1 e− 2 ( σ ) eium− 12 σ 2 u2
σ 2π
1 T −1 T m− 1 uT Λu
n√ e− 2 (x−m) Λ (x−m)
1
Normal N (m, Λ) eju 2
(2π) 2 det(Λ)
λq xq−1 e−λx 1
Gamma Γ (q, λ) 1(0,∞) (x) q
Γ(q) (1−i uλ )
Pareto Par(α) αx−(α+1) 1[1,∞] (x)
α c α+1
Par(α, c) = c Par(α) c x
1(c,∞) (x)
Beta β (q1 , q2 ) Γ(q1 +q2 ) q1 −1
Γ(q1 )Γ(q2 )
x (1 − x)q2 −1 1(0,1) (x)
Γ(q1 +q2 ) xq1 −1
Beta prime 1
Γ(q1 )Γ(q2 ) (x+1)(q1 +q2 ) (0,∞)
(x)
2
Rayleigh 2αxe−αx 1[0,∞] (x)
1 1
Standard Cauchy π 1+x2
1 α
Cau(α) π α2 +x2
Γ(d) 1
Cau(α, d) √
παΓ(d− 12 ) 1+ x 2 d
( )
α
ln x−µ 2
− 12
Log Normal eN ( µ,σ 2 ) 1
√
σx 2π
e ( σ )1
(0,∞) (x)
61
• Use with caution to represent ignorance about a parameter taking value in [a, b].
a+b (b−a)2
6.4. EX = 2
, Var X = 12
, E [X 2 ] = 13 (b2 + ab + a2 ).
6.5. The product X of two independent U[0, 1] has
and
FX (x) = x − x ln x
R1 x
R1
on [0, 1]. This comes from P [X > x] = 1 − FU t
dt = 1 − xt dt.
0 x
• The standard normal cdf is sometimes denoted by Φ(x). It inherits all properties
of cdf. Moreover, note that Φ(−x) = 1 − Φ(x).
jvX ∞
1 2 2 1
− jω m − ω σ 2 2
(d) ϕX23)
(v) = E transform:
Fourier e =F ( f X )2=v ∫σ f.]X ( x ) e− jω x dt = e
ejmv− 2
.
−∞
sm+∞12 s2 σ 2
(e) MX (s) = e π
∫e
−α x 2
24) Note that dx = .
−∞ α R∞ 1 2 σ2
−jωm− 2 ω
(f) Fourier transform: Fm{f
⎛ x− ⎞ X} = fX (x)⎛ ex−jωx
− m ⎞dt =⎛ e x − m ⎞ .
25) P [ X > x ] = Q ⎜ ⎟ ; P [ X <−∞
x] = 1 − Q ⎜ ⎟ = Q⎜− ⎟.
⎝ σ ⎠ ⎝ σ ⎠ ⎝ σ ⎠
(g) P [X •> x]
P ⎡⎣=X P ≥⎤ x]
− μ[X< σ = = Q Px−m
0.6827, ⎡ X −=μ 1> σ−⎤Φ
=
x−m
0.3173 = Φ − x−m
⎦ ⎣
σ ⎦ σ σ
P [X < x] = P [X ≤ x] = 1 − Q x−m σ
= Q − x−m
σ
= Φ x−m
σ
.
P ⎡⎣ X − μ > 2σ ⎤⎦ = 0.0455, P ⎡⎣ X − μ < 2σ ⎤⎦ = 0.9545
fX ( x)
fX ( x)
95%
68%
μ −σ μ μ +σ μ − 2σ μ μ + 2σ
∞ 2
1 − x2
Figure
26) Q-function ( z ) =Probability
: Q 14: ∫z 2π e dx corresponds to P [ X >ofz ]Xwhere
density function ~ Nσ(20,1
∼ NX(m, ) .) ;
that is Q ( z ) is the probability of the “tail” of N ( 0,1) .
6.7. Properties
N ( 0,1)
62
Q( z)
(a) P [|X − µ| < σ] = 0.6827; P [|X − µ| > σ] = 0.3173
P [|X − µ| > 2σ] = 0.0455; P [|X − µ| < 2σ] = 0.9545
(b) Moments and central moments:
h i h i 0, k odd
k k−2
(i) E (X − µ) = (k − 1) E (X − µ) = k
1 · 3 · 5 · · · · · (k − 1) σ , k even
( q
h i
2 · 4 · 6 · · · · · (k − 1) σ k π2 , k odd
(ii) E |X − µ|k =
1 · 3 · 5 · · · · · (k − 1) σ k , k even
(iii) Var [X 2 ] = 4µ2 σ 2 + 2σ 4 .
n 0 1 2 3 4
EX n 1 µ µ + σ2
2
µ (µ + 3σ 2 )
2
µ + 6µ σ + 3σ 4
4 2 2
E [(X − µ)n ] 1 0 σ2 0 3σ 4
(d) Lévy–Cramér theorem: If the sum of two independent non-constant random variables
is normally distributed, then each of the summands is normally distributed.
R∞ 2 pπ
• Note that e−αx dx = α
.
−∞
where |B| is the length (Lebesgue measure) of the set B. This is because the probability
is concentrated around 0. More generally, for X ∼ N (m, σ 2 )
|B|
P [X ∈ B] ≤ 1 − 2Q .
2σ
6.9 (Stein’s Lemma). Let X ∼ N (µ, σ 2 ), and let g be a differentiable function satisfying
E |g 0 (X)| < ∞. Then
E [g(X)(X − µ)] = σ 2 E [g 0 (X)] .
[2, Lemma 3.6.5 p 124]. Note that this is simply integration by parts with u = g(x) and
dv = (x − µ)fX (x)dx.
63
⎜ ⎟
⎝ 2⎠
erf ( z )
Q ( 2z )
• E (X − µ)k = E (X − µ)k−1 (X − µ) = σ 2 (k − 1)E (X − µ)k−2 .
R∞ x 2 0 z
6.10. Q-function: Q (z) = √1 e− 2 dx corresponds to P [X > z] where X ∼ N (0, 1);
2π
z
that is Q (z) is the probability of the “tail” of N (0, 1). The Q function is then a comple-
mentary cdf (ccdf).
N ( 0,1)
1
0.9
0.8
0.7
Q(z) 0.6
0.5
0.4
0.3
0.2
0.1
z
0
-3 -2 -1 0 1 2 3
z
0
where we evaluate the double integral using polar coordinates [9, Q7.22 p 322].
π
R4 x2
2
(e) Q (x) = 1
π
e− 2 sin2 θ dθ
0
x2
(f) d
dx
Q (x) = − √12π e− 2
(f (x))2
(g) d
dx
Q (f (x)) = − √12π e− 2
d
dx
f (x)
R R R (f (x)) 2 Rx
(h) Q (f (x)) g (x)dx = Q (f (x)) g (x)dx + √1 e− 2 d
f (x) g (t) dt dx
2π dx
a
64
( f ( x ) )2
⎛d ⎞⎛ ⎞
x
∫ Q ( f ( x ) ) g ( x )dx = Q ( f ( x ) ) ∫ g ( x )dx + ∫
1 −
f) e 2
⎜ f ( x ) ⎟ ∫ g ( t ) dt ⎟ dx .
⎜
2π ⎝ dx ⎠⎝ a ⎠
g) Approximation:
(i) P [X > x] = Q x−m
P [X < x] = 1 − ⎡Q x−m ⎤ 1. − z 2
σ
x−m
1
= Q − 1
i) Q ( z ) ≈ ⎢ σ σ⎥ e 2 ; a = , b = 2π
π
(j) Approximation: ⎣ (1 − a ) z + a z + b ⎦
⎢ 2
⎥ 2π
h −
x2 i 2 2
(i) Q (z)⎛ ≈ 1(1−a)z+a
⎞ e 12√ √1 e1 − z2− x
; a = π1 , b = 2π
ii) ⎜1 − 2 ⎟ z≤ Q ( x )2π≤ e 2 .
2 +b
⎝ x− x⎠2x 2π 22
1 e√ 2 1 − x2
27)(ii) 1 − xand
Moment 2 ≤ moment
xcentral
2π
Q (x) ≤ 2 e
2
(iii) Q (z) ≈nz√12π 1 −0 0.7 2
1 e− z2 ;2z > 2 3 4
z
1 μ μ 2 + σ 2 μ ( μ +Rz3σ ) μ 4 + 6μ 2σ 2 +√3σ 4
2 2
EX n 2
6.11. Error function (MATLAB): erf (z) = √2π e−x dx = 1 − 2Q 2z
E ⎡( X − μ ) ⎤ 1 0
n
⎣ ⎦ σ2 00 3σ 4
(a) It is an odd function of z.
⎧0, k odd
• E ⎡( X − μ ) ⎤ = ( k − 1) E ⎡( X − μ ) ⎤ = ⎨
k −2
k
⎩1 ⋅ 3 ⋅ 5 ⋅ ⋅ ( k − 12)σ , k even
(b) For z ≥ ⎣0, it corresponds
⎦ to P⎣ [|X| < z]⎦where X ∼ N 0, . 1 k
√
( )
z
−1 2
Q−1Error function (Matlab): erf ( z ) = ∫
(g)29) (q) = 2 erfc (2q) e− x dx = 1 − 2Q 2 z corresponds to
2
π 0 √ R∞ 2
(h) The complementary error function: erfc (z) = 1−erf (z) = 2Q 2z = √2π z e−x dx
⎛ 1⎞
P ⎡⎣ X < z ⎤⎦ where X ~ N ⎜ 0, ⎟ .
⎝ 2⎠
⎛ 1⎞
N ⎜ 0, ⎟
⎝ 2⎠
erf ( z )
Q ( 2z )
0 z
b) erf ( − z ) = −erf ( z )
65
6.3 Exponential Distribution
6.12. Denoted by E (λ). It is in fact Γ (1, λ). λ > 0 is a parameter of the distribution,
often called the rate parameter.
6.13. Characterized by
• fX (x) = λe−λx U (x) ;
• FX (x) = 1 − e−λx U (x) ;
• Survival-, survivor-, or reliability-function: P [X > x] = e−λx 1[0,∞) (x) + 1(−∞,0) (x);
λ
• ϕX (u) = λ−iu
.
λ
• MX (s) = λ−s
for Re {s} < λ.
for all x > 0 and all s > 0 [14, p. 157–159]. In words, the future is independent of the
past. The fact that it hasn’t happened yet, tells us nothing about how much longer it will
take before it does happen.
66
• In particular, suppose we define the set B + x to be {x + b : b ∈ B}. For any x > 0
and set B ⊂ [0, ∞), we have
P [X ∈ B + x|X > x] = P [X ∈ B]
because R R
P [X ∈ B + x] B+x
λe−λt dt τ =t−x B
λe−λ(τ +x) dτ
= = .
P [X > x] e−λx e−λx
P
6.27. Consider independent Xi ∼ E (λ). Let Sn = ni=1 Xi .
i.i.d.
6.29. If Si , Ti ∼ E (α), then
!
X
m X
n X
m−1
n+m−1
n+m−1
1
P Si > Tj =
i 2
i=1 j=1 i=0
X n + i − 1 1 n+i
m−1
= .
i 2
i=0
Note that we can set up two Poisson processes. Consider the superposed process. We want
the nth arrival from the T processes to come before the mth one of the S process.
67
• distribution of wealth
λ2
(e) MX (s) = λ2 −s2
, −λ < Re {s} < λ.
2
6.33. EX = 0, Var X = α2
, E |X| = α1 .
Example 6.34.
• amplitudes of speech signals
• If X and Y are independent E(λ), then X − Y is L(λ). (Easy proof via ch.f.)
6.6 Rayleigh
6.35. Characterizations:
−αx2
(a) F (x) = 1 − e U (x)
2
(b) f (x) = 2αxe−αx u (x)
2
e−αt , t ≥ 0
(c) P [X > t] = 1 − F (t) =
1, t<0
√ 1
(d) Use −2σ 2 ln U to generate Rayleign σ2
from U ∼ U(0, 1).
68
6.36. Read “ray0 -lee”
Example 6.37.
• Noise X at the output of AM envelope detector when no signal is present
6.38. Relationship with other distributions
(a) Let X be a Rayleigh(α) r.v., then Y = X 2 is E(α). Hence,
√
·
−−
E(α) )*
−− Rayleigh(α). (18)
2
(·)
i.i.d. √
(b) Suppose X, Y ∼ N (0, σ 2 ). R = X 2 + Y 2 has a Rayleigh distribution with density
1 − 12 r 2
fR (r) = 2r e 2σ . (19)
2σ 2
i.i.d.
• Note that X 2 , Y 2 ∼ Γ 12 , 2σ1 2 . Hence, X 2 + Y 2 ∼ Γ 1, α = 2σ1 2 , exponential.
√
By (18), X 2 + Y 2 is a Rayleigh r.v. α = 2σ1 2 .
• Alternatively, the transformation from Cartesian coordinates xy to polar coor-
dinates θr gives
1 1 r cos θ 2 1 1 r sin θ 2
fR,Θ (r, θ) = rfX,Y (r cos θ, r sin θ) = r √ e− 2 ( σ ) √ e− 2 ( σ )
σ 2π σ 2π
1 1 1 2
= 2r 2 e− 2σ2 r .
2π 2σ
Hence, the radius R and the angle Θ are independent, with the radius R having
a Rayleigh distribution while the angle Θ is uniformly distributed in the interval
(0, 2π).
6.7 Cauchy
6.39. Characterizations: Fix α > 0.
α 1
(a) fX (x) = π α2 +x2
,
α>0
(b) FX (x) = π1 tan− 1 λx + 12 .
69
• Mean and variance do not exist.
• Note that even though the pdf of Cauchy distribution is an even function, the mean
is not 0.
P
6.41. Suppose Xi are independent Cau(αi ). Then, i ai Xi ∼ Cau(|ai | αi ).
6.42. Suppose X ∼ Cau(α). X1 ∼ Cau α1 .
(c) For symmetric distributions q1 and q2 are the same. As their values increase the
distributions become more peaked.
q1 q1 q2
(f) E [X] = q1 +q2
and Var X = (q1 +q2 )2 (q1 +q2 +1)
.
6.44. Rice/Rician
where I0 is the modified Bessel function4 of the first kind with order zero.
70
p
2 2 2
p we have independent Ni ∼ N (mi , σ ), then R = N1 + N2 is Rician with
(c) Suppose
v = m21 + m22 and the same σ. To see this, use (30) and the fact that we can express
m1 = v cos φ and m2 = v sin φ for some φ.
6.45. Weibull : For λ > 0 and p > 0, the Weibull(p, λ) distribution [9] is characterized
by
1
(a) X = Yλ p where Y ∼ E(1).
p
(b) fX (x) = λpxp−1 e−λx , x > 0
p
(c) FX (x) = 1 − e−λx , x > 0
Γ(1+ n
p)
(d) E [X n ] = n .
λp
7 Expectation
Consider probability space (Ω, A, P )
7.1. Let X + = max (X, 0), and X − = − min (X, 0) = max (−X, 0). Then, X = X + − X − ,
and X + , X − are nonnegative r.v.’s. Also, |X| = X + + X −
≡ E |X| is finite.
≡ EX is finite ≡ EX is defined.
≡ X∈L
≡ |X| ∈ L
In which case,
EX = E X + − E X −
Z Z
= X (ω) P (dω) = XdP
Z Z
= xdP (x) = xP X (dx)
X
and Z
XdP = E [1A X] .
A
71
Definition 7.3. A r.v. X admits (has) an expectation if E [X + ] and E [X − ] are not
both equal to +∞. Then, the expectation of X is still given by EX = E [X + ] − E [X − ]
with the conventions +∞ + a = +∞ and −∞ + a = −∞ when a ∈ R
(b) E [X p ] = 0
(c) X = 0 a.s.
a.s.
7.6. X = Y ⇒ EX = EY
7.8. Expectation rule: Let X be a r.v. on (Ω, A, P ), with values in (E, E), and distri-
bution P X . Let h : (E, E) → (R, B) be measurable. If
• X ≥ 0 or
• h (X) ∈ L 1 (Ω, A, P ) which is equivalent to h ∈ L 1 E, E, P X
then
R R
• E [h (X)] = h (X (ω)) P (dω) = h (x) P X (dx)
R R
• h (X (ω)) P (dω) = h (x) P X (dx)
[X∈G] G
and Z Z
X
h (x) P (dx) = h (x) fX (x) dx
G G
72
Expectation of a discrete random variable: Suppose x is a discrete random variable.
X
EX = xP [X = x]
x
and X
E [g(X)] = g(x)P [X = x].
x
Similarly, XX
E [g(X, Y )] = = g(x, y)P [X = x, Y = y].
x y
These are called the law/rule of the lazy statistician (LOTUS) [23, Thm 3.6 p 48],[9,
p. 149].
R R R R
7.10. P [X ≥ t] dt = P [X > t] dt and P [X ≤ t] dt = P [X < t] dt
E E E E
For p > 0,
Z∞
p
E [X ] = pxp−1 P [X > x]dx.
0
= E [X(X − EX)]
73
• Notation: DX , or σ 2 (X), or σX
2
, or VX [23, p. 51]
k k
P
k P P P
k
• If Xi ∈ L2 , then Xi ∈ L2 , Var Xi exists, and E Xi = EXi
i=1 i=1 i=1 i=1
2
• Suppose E [X ] < ∞. If Var X = 0, then X = EX a.s.
p
(d) Standard Deviation: σX = Var[X].
σX
(e) Coefficient of Variation: CVX = EX
.
X
• It is the standard deviation of the “normalized” random variable EX
.
• 1 for exponential.
Var X
(f) Fano Factor (index of dispersion): EX
.
• 1 for Poisson.
(g) Central Moments: the nth central moment is µn = E [(X − EX)n ].
(i) µ1 = E [X − EX] = 0.
2
(ii) µ2 = σX = Var X
P n
n
(iii) µn = k
mn−k (−m1 )k
k=1
Pn
n
(iv) mn = k
µn−k mk1
k=1
µ3
(h) Skewness coefficient: γX = 3
σX
(i) Describe the deviation of the distribution from a symmetric shape (around the
mean)
(ii) 0 for any symmetric distribution
µ4
(i) Kurtosis: κX = 4 .
σX
(i) γ1 = EX = m1 ,
γ2 = E [X − EX]2 = m2 − m21 = µ2
γ3 = E [X − EX]3 = m3 − 3m1 m2 + 2m31 = µ3
γ4 = m4 − 3m22 − 4m1 m3 + 12m21 m2 − 6m41 = µ4 − 3µ22
(ii) m1 = γ1
m2 = γ2 + γ12
m3 = γ2 + 3γ1 γ2 + γ13
m4 = γ4 + 3γ22 + 4γ1 γ3 + 6γ12 γ2 + γ14
74
5
14:27
TABLE 2.1 Product-Moments of Random Variables
2 12
B(n, p) np np(1-p)
β β
G(β) 1−β (1−β)2
P(λ) λ λ
a+b (b−a)2
U(a, b) 2 12
1 1
E(λ) λ λ2
α undefined, 0 < α < 1
, α>1
Par(α) α−1 ∞, 1<α<2
∞, 0<α≤1 α
, α>2
(α−2)(α−1)2
2
L(α) 0 α2
N (m, σ 2 ) m σ 2
7.13.
• For c ∈ R, E [c] = c
• E [·] is a linear operator: E [aX + bY ] = aEX + bEY .
• In general, Var[·] is not a linear operator.
7.14. All pairs of mean and variance are possible. A random variable X with EX = m
and Var X = σ 2 can be constructed by setting P [X = m − a] = P [X = m + a] = 12 .
Definition 7.15.
75
• Covariance between X and Y :
Cov [X, Y ] = E [(X − EX)(Y − EY )] = E [XY ] − EXEY
= E [X(Y − EY )] = E [Y (X − EX)] .
≡ E [XY ] = EXEY
7.16. Properties
(a) Var X = Cov [X, X], ρX,X = 1
When σY , σX > 0, equality occurs if and only if the following conditions holds
(f) Linearity:
(i) Let Yi = ai Xi + bi .
i. Cov [Y1 , Y2 ] = Cov [a1 X1 + b1 , a2 X2 + b2 ] = a1 a2 Cov [X1 , X2 ].
ii. The ρ is preserved under linear transformation:
76
(ii) Cov [a1 X + b1 , a2 X + b2 ] = a1 a2 Var X.
(iii) ρa1 X+b1 ,a2 X+b2 = 1. In particular, if Y = aX + b, then ρX,Y = 1.
In particular
Var (X + Y ) = Var X + Var Y + 2Cov [X, Y ]
and
Var (X − Y ) = Var X + Var Y − 2Cov [X, Y ] .
(k) Covariance Inequality: Let X be any random variable and g and h any function such
that E [g(X)], E [h(X)], and E [g(X)h(X)] exist.
77
7.17. Suppose X and Y are i.i.d. random variables. Then,
p √
E [(X − Y )2 ] = 2σX .
P
k−1
and G(N ) = arN − L(N ). To have G(N ) > 0, need ri < rk ∀k ∈ N which turns out to
i=0
require r ≥ 2. In fact, for r ≥ 2, we have G(N ) ≥ a ∀N ∈ N ∪ {0}. Hence, E [G(N )] ≥ a.
It is exactly a when r = P2.
Now, E [L (N )] = a ∞ rn −1 n
n=1 r−1 p (1 − p) = ∞ if and only if r(1 − p) ≥ 1. When
1 − p ≤ 12 , because we already have r ≥ 2, it is true that r(1 − p) ≥ 1.
8 Inequalities
8.1. Let (Ai : i ∈ I) be a finite family of events. Then
2
P !
P (Ai ) [ X
P Pi ≤P Ai ≤ P (Ai )
P (Ai ∩ Aj ) i i
i j
(a) Useless when a ≤ E |X|. Hence, good for bounding the “tails” of a distribution.
78
a) X n
at a.s.
least X
two X n abirthday
will share 1
, X (assuming X n are
1 , andbirthdays X likely
equally . to occur
on any given day of the year) is about 0.5. See also (3).
n
P
b) XEvents X X n X .
⎛ n ⎞ andn summation. − n
n
⎛ 1integration
64) Interchanging ⎞
2) − ⎜ 1 − ⎟ ≤ P ⎜ ∩ Ai ⎟ − ∏ P ( Ai ) ≤ ( n − 1) n n −1 [Szekely’86, p 14].
⎝ n ⎠ ⎝ i =1 ⎠ i =1 n ↓
X f
↓ 1
a) If (1) 1
− ≈−0.37 n
e
converges a.s., (2) k g a.e., and (3) g L , then
n 1 k 1
1
f n , f n L and
n
f1n d f n d .
n n
x
⎛ 1⎞
b) If X
n 1
n ,⎝ then
⎟
x⎠
− ⎜ 1−
X
0.5
n 1
n converges absolutely a.s., and integrable.
−x
( x−1)⋅ x x− 1
Moreover, X n 0 X n .
n 1 − 1 n 1
e
2 4 6 8 10
Inequality 1 x 10
65) Let Ai :a)i PI (Abe T Q
1
) − P ( Afamily
) P ( Bound
A2 ) of
n n
1 ∩a
A2finite
Figure 118: ≤ events.
for P Then
Ai − P (Ai ).
4 i=1 i=1
2
Random Variable
P Ai
set {a ,…, a } . Then, the
Let i.i.d. X 1 ,…, X r bei uniformlydistributedon a finite
P Ai r −P1 1 Ai n.
3)
P Ai A
probability that X ,…, X are all distinct is p ( n, r ) = 1 −
j i i ⎛
1−
i⎞ −
r ( r −1)
≈ 1 − e 2n . u ∏ ⎜⎝ ⎟
i1 j r
i =1 n⎠
a) Special case: the birthday paradox
1 in (1).
66) Markov’s Inequality: X a X , a > 0.
1
pu(n,r) for n = 365
a
pu ( n, r )
0.9
0.6
0.8
p = 0.9
0.7
0.6 1− e
−
r ( r −1)
2n
0.5
x p = 0.7
r
0.4
p = 0.5
0.5
p = 0.3 r = 23
r = 23 n 0.3
0.4
p = 0.1 n = 365
0.3
n = 365
0.2
0.2 a a1 x a
0.1
0.1
0 0
0 5 10 15 20 25 30 35 40 45 50 55 0 50 100 150 200 250 300 350
r n
distribution.
b) Remark: P X a P X a .
1
c) X a X , a > 0. 79
a
d) Suppose g is a nonnegative function. Then, 0 and p 0 , we have
(d) Suppose g is a nonnegative function. Then, ∀α > 0 and p > 0, we have
(i) P [g (X) ≥ α] ≤ 1
αp
(E [(g (X))p ])
(ii) P [g (X − EX) ≥ α] ≤ 1
αp
(E [(g (X − EX))p ])
1
(e) Chebyshev’s Inequality : P [|X| > a] ≤ P [|X| ≥ a] ≤ a2
EX 2 , a > 0.
(i) P [|X| ≥ α] ≤ 1
αp
(E [|X|p ])
2
σX 1
(ii) P [|X − EX| ≥ α] ≤ α2
; that is P [|X − EX| ≥ nσX ] ≤ n2
• Useful only when α > σX
4 2 a+b 2
(iii) For a < b, P [a ≤ X ≤ b] ≥ 1 − (b−a)2
σX + EX − 2
Definition 8.6. If p and q are positive real numbers such that p + q = pq, or equivalently,
1
p
+ 1q = 1, then we call p and qa pair of conjugate exponents.
• 1 < p, q < ∞
• As p → 1, q → ∞. Consequently, 1 and ∞ are also regarded as a pair of conjugate
exponents.
1 1
8.7. Hölder’s Inequality : X ∈ Lp , Y ∈ Lq , p > 1, p
+ q
= 1. Then,
(a) XY ∈ L1
80
1 1
(b) E [|XY |] ≤ (E [|X|p ]) p (E [|Y |q ]) q with equality if and only if
E [|Y |q ] |X (ω)|p = E [|X|p ] |Y (ω)|q a.s.
81
9 Random Vectors
In this article, a vector is a column matrix with dimension n × 1 for some n ∈ N. We use
1 to denote a vector with all element being 1. Note that 1(1T ) is a square matrix with all
element being 1. Finally, for any matrix A and constant a, we define the matrix A + a to
be the matrix A with each of the components are added by a. If A is a square matrix, then
A + a = A + a1(1T ).
Definition 9.1. Suppose I is an index set. When Xi ’s are random variables, we de-
fine a random vector XI by XI = (Xi : i ∈ I). For example, if I = [n], we have XI =
(X1 , X2 , . . . , Xn ). Note also that X[n] is usually denoted by X1n . Sometimes, we simply
write X to denote X1n .
Definition 9.2. Half-open cell or bounded rectangle in Rk is set of the form Ia,b =
k
{x : ai < xi ≤ bi , ∀i ∈ [k]} = × (ai , bi ]. For a real function F on Rk , the difference of F
i=1
around the vertices of Ia,b is
X X
∆Ia,b F = sgnIa,b (v) F (v) = (−1)|{i:vi =ai }| F (v) (23)
v v
where the sum extending over the 2k vertices v of Ia,b . (The ith coordinate of the vertex v
could be either ai or bi .) In particular, for k = 2, we have
• ∆Ia,b FX ≥ 0
82
C3 xi → −∞ for some i (the other coordinates held fixed), then FX (x) → 0
If ∀i xi → ∞, then FX (x) → 1.
• For any function F on Rk with satisfies (C1), (C2), and (C3), there is a unique
probability measure µ on BRk such that ∀a, ∀b ∈ Rk with a ≤ b, we have µ (Ia,b ) =
∆Ia,b F (and ∀x ∈ Rk µ (Sx ) = F (x)).
• TFAE:
(a) FX is continuous at x
(b) FX is continuous from below
(c) FX (x) = P X (Sx◦ )
(d) P X (Sx ) = P X (Sx◦ )
(e) P X (∂Sx ) = 0 where ∂Sx = Sx − Sx◦ = {y : yi ≤ xi ∀i, ∃j yj = xj }
9.4 (Joint pdf). A function f is a multivariate or joint pdf (commonly called a density)
if and only if it satisfies the following two conditions:
(a) f ≥ 0;
R
(b) f (x)d(x) = 1.
lim f (xI ) = 0.
xi →±∞
R
• P [X ∈ A] = A
fX (x)dx.
• Remarks: Roughly, we may say the following:
<Xi ≤xi +∆xi ]
P [∀i, xiQ P [x<X≤x+δx]
(a) fX (x) = lim = lim Q
∀i, ∆xi →0 i ∆xi ∆x→0 i ∆xi
83
(b) For I = [n],
R x1 R xn
◦ FX (x) = −∞ . . . −∞ fX (x)dxn . . . dx1
∂n
◦ fX (x) = ∂x1 ···∂xn FX (x) .
∂
P R (k)
(k) u, j = k
◦ ∂u FX (u, . . . , u) = fX v dx[n]\{k} where vj = .
k∈N (−∞,u]n 6 k
xj , j =
∂
Ru Ru
For example, ∂u FX,Y (u, u) = fX,Y (x, u)dx + fX,Y (u, y) dy.
−∞ −∞
Q
n
(c) fX1n (xn1 ) = E δ (Xi − xi ) .
i=1
9.5. Consider two random vectors X : Ω → Rd1 and Y : Ω → Rd2 . Define Z = (X, Y ) :
Ω → Rd1 +d2 . Suppose that Z has density fX,Y (x, y).
R R
(a) Marginal Density : fY (y) = fX,Y (x, y)dx and fX (x) = fX,Y (x, y)dy.
Rd1 Rd2
• In other words, to obtain the marginal densities, integrate out the unwanted
variables.
R
• fXI\{i} (xI\{i} ) = fXI (xI )dxi .
fX,Y (x,y)
(b) fY |X (y|x) = fY (y)
.
R y1 R yd
(c) FY |X (y|x) = −∞
··· −∞
2
fY |X (t|x)dtd2 · · · dt1 .
R
9.6. P [(X + a1 , X + b1 ) ∩ (Y + a2 , Y + b2 ) 6= ∅] = A
fX,Y (x, y)dxdy where A is defined
in (1.10).
9.7. Expectation and covariance:
(a) The expectation of a random vector X is defined to be the vector of expectations of
its entries. EX is usually denoted by µX or mX .
(b) For non-random matrix A, B, C and a random vector X, E [AXB + C] = AEXB +C.
84
(ii) ΛX is symmetric.
i. Properties of symmetric matrix
A. All eigenvalues are real.
B. Eigenvectors corresponding to different eigenvalues are not just linearly
independent, but mutually orthogonal.
C. Diagonalizable.
ii. Spectral theorem: The following equivalent statements hold for symmet-
ric matrix.
A. There exists a complete set of eigenvectors; that is there exists an or-
thonormal basis u(1) , . . . , u(n) of R with CX u(k) = λk u(k) .
B. CX is diagonalizable by an orthogonal matrix U (U U T = U T U = I).
C. CX can be represented as CX = U ΛU T where U is an orthogonal matrix
whose columns are eigenvectors of CX and λ = diag(λ1 , . . . , λn ) is a
diagonal matrix with the eigenvalues of CX .
(iii) Always nonnegative definite (positive
h semidefinite).
i That is ∀a ∈ Rn where n is
2
the dimension of X, aT CX a = E aT (X − µX ) ≥ 0.
• det (CX ) ≥ 0.
1 √ √ √ √ √ √
(iv) We can define CX2 = CX to be CX = U ΛU T where Λ = diag( λ1 , . . . , λn ).
√ √
i. det CX = det CX .
√
ii. CX is nonnegative definite.
√ 2 √ √
iii. CX = CX CX = CX .
(v) Suppose, furthermore, that CX is positive definite.
−1
i. CX = U Λ−1 U T where Λ−1 = diag( λ11 , . . . , λ1n ).
q √
− 12 −1 −1 T 1 1
ii. CX = CX = ( CX ) = U DU where D = √
λ1
, . . . , λn
√
√ −1
√
iii. CX CX CX = I.
1
−1 −1
iv. CX , CX2 , CX 2 are all positive definite (and hence are all symmetric).
1 2
− −1
v. CX 2 = CX .
−1
vi. Let Y = CX 2 (X − EX). Then, EY = 0 and CY = I.
(vi) For i.i.d. Xi with each with variance σ 2 , ΛX = σ 2 I.
85
2
• For Yi = X + Zi where X and Z are independent, ΛY = σX + ΛZ .
(g) ΛX+Y + ΛX−Y = 2ΛX + 2ΛY
(h) det (ΛX+Y ) ≤ 2n det (ΛX + ΛY ) where n is the dimension of X and Y .
2
(i) Y = (X, X, . . .
, X) where X is a random variable with variance σX , then ΛY =
1 ··· 1
2 .. .
σX . . . . .. . Note that Y = 1X where 1 has the same dimension as Y .
1 ··· 1
(j) Let X be a zero-mean random vector whose covariance matrix is singular. Then, one
of the Xi is a deterministic linear combination of the remaining components. In other
words, there is a nonzero vector a such that aT X = 0. In general, if ΛX is singular,
then there is a nonzero vector a such that aT X = aT EX.
(k) If X and Y are both random vectors (not necessarily of the same dimension), then
their cross-covariance matrix is
ΛXY = CXY = Cov [X, Y ] = E (X − EX)(Y − EY )T .
Note that the ij-entry of CXY is Cov [Xi , Yj ].
• CY X = (CXY )T .
(l) RXY = E XY T .
(m) If we stack X and Y in to a composite vector Z = XY
, then
CX CXY
CZ = .
CY X CY
(n) X and Y are said to be uncorrelated if CXY = 0, the zero matrix. In which case,
CX 0
C(X ) = ,
Y 0 CY
a block diagonal matrix.
9.8. The joint characteristic function of an n-dimensional random vector X is defined
by h T i P
ϕX (v) = E ejv X = E ej i vi Xi .
When X has a joint density fX , ϕX is just the n-dimensional Fourier transform:
Z
T
ϕX (v) = ejv x fX (x)dx,
and the joint density can be recovered using the multivariate inverse Fourier transform:
Z
1 −jv T x
fX (x) = e ϕX (v)dv.
(2π)n
86
TX
(a) ϕX (u) = Eeiu .
R T
(b) fX (x) = 1
(2π)n
e−jv x ϕX (v)dv.
T
(c) For Y = AX + b, ϕY (u) = eib u ϕX AT u .
∂
(i) ϕ
∂vi X
(0) = jEXi .
∂2
(ii) ϕ
∂vi ∂vj X
(0) = j 2 E [Xi Xj ] .
• Y = PTX
• P T = P −1 is called a decorrelating transformation.
• Diagonalize C = P DP T where D = diag(λ1 ). Then, Cov [Y ] = D.
• In MATLAB, use [P,D] = eig(C). To extract the diagonal elements of D as a vector,
use the command d = diag(D).
• If C is singular (equivalently, if some of the λi are zero), we only need to keep around
the Yi for which λi > 0 and can throw away the other components of Y without any
loss of information. This is because λi = EYi2 and EYi2 = 0 if and only if Yi ≡ 0 a.s.
[9, p 338–339].
87
9.1 Random Sequence
9.10. [11, p 9–10] Given a countable family X1 , X2 , . . . of random r.v.’s, their statistical
properties are regarded as defined by prescribing, for each integer n ≥ 1 and every finite
set I ⊂ N, the joint distribution function FXI of the random vector XI = (Xi : i ∈ I).
Of course, some consistency requirements must be imposed upon the infinite family FXI ,
namely, that for j ∈ I
(a) FXI\{j} xI\{j} = lim FXI (xI ) and that
xj →∞
(b) the distribution function obtained from FXI (xI ) by interchanging two of the indices
i1 , i2 ∈ I and the corresponding variable xi1 and xi2 should be invariant. This simply
means that the manner of labeling the random variables X1 , X2 , . . . is not relevant.
The joint distributions {FXI } are called the finite-dimensional distributions associated
with XN = (Xn )∞n=1 .
10 Transform Methods
10.1 Probability Generating Function
Definition 10.1. [9][11, p. 11] Let X be a discrete random variable taking only nonnegative
integer values. The probability generating function (pgf) of X is
∞
X X
GX (z) = E z = z k P [X = k].
k=0
10.3. Properties
(a) GX is infinitely differentiable at least for |z| < 1.
88
(c) Moment generating property:
"k−1 #
d (k) Y
GX (z) =E (X − i) .
dz (k) z=1 i=0
89
10.3 One-Sided Laplace Transform
10.6. The one-sided Laplace transform
R of−sx
a nonnegative random variable X is defined
for s ≥ 0 by L (s) = M (s) = E e−sX = e P X (dx)
[0,∞)
[17, p 183].
90
10.4 Characteristic Function
10.7. The characteristic function (abbreviated
R c.f. or ch.f.) of a probability measure
µ on the line is defined for real t by ϕ (t) = eitx µ (dx) R
A random variable X has characteristic function ϕX (t) = E eitX = eitx P X (dx)
R R
(a) Always exists because |ϕ(t)| ≤ |eitx |µ (dx) = 1µ (dx) = 1 < ∞
R
(b) If X has a density, then ϕX (t) = eitx fX (x) dx
(c) ϕ (0) = 1
(d) ∀t ∈ R |ϕ (t)| ≤ 1
(f) Suppose that all moments of X exists and ∀t ∈ R, EetX < ∞, then ϕX (t) =
P∞
(it)k
k!
EX k
k=0
h i
(g) If E |X|k < ∞, then ϕ(k) (t) = ik E X k eitX and ϕ(k) (0) = ik E X k .
(m) Inversion
(i) The inversion formula: If the probability measure µ has characteristic func-
1
RT e−ita −e−itb
tion ϕ and if µ {a} = µ {b} = 0, then µ (a, b] = lim 2π it
ϕ (t)dt
T →∞ −T
1
RT e−ita −e−itb
i. In fact, if a < b, then lim ϕ (t)dt = µ (a, b) + 21 µ {a, b}
T →∞ 2π −T it
91
R
i. bounded continuous density f (x) = 1
2π
e−itx ϕ (t)dt
1
R e−ita −e−itb
ii. µ (a, b] = 2π it
ϕ (t)dt
D
(n) Continuity Theorem: Xn −
→ X if and only if ∀t ϕXn (t) → ϕX (t) (pointwise).
P
n
(it)k
and ϕX (t) = k!
EX k + tn β (t) where lim β (t) = 0 or equivalently,
k=0 |t|→0
X
n
(it)k
ϕX (t) = EX k + o (tn ) (25)
k=0
k!
92
(b) If fX is even, then ϕX is also even.
• If fX is even, ϕX = ϕ−X .
10.10. Linear combination
P of independent random
Q variables: Suppose X1 , . . . , Xn are
independent. Let Y = ni=1 ai Xi . Q Then, ϕY (t) = ni=1 ϕX (ai t). Furthermore, if |ai | = 1
and all fXi are even, then ϕY (t) = ni=1 ϕX (t).
(a) ϕX+Y (t) = ϕX (t)ϕY (t).
(a) P
Discrete r.v.:
P Suppose pi is pmf with corresponding ch.f. ϕi . Then, the ch.f. of
i ai pi is i ai ϕ i .
(b) Absolutely
P continuous
P r.v.: Suppose fi is pdf with corresponding ch.f. ϕi . Then, the
ch.f. of i ai fi is i ai ϕi .
(b) Formula (27) below provides a convenient way of arriving at fY from fX without
going through FY .
93
11.3 (Linear transformation). Y = aX + b where a 6= 0.
(
FX y−b
a
, a>0
FY (y) =
y−b −
1 − FX a
, a<0
(b) n even:
−
1 1
FX y n − FX −y n , y≥0
FY (y) =
0, y < 0.
and ( 1 1
1
n
y −1
1 n
fX y n + fX −y n , y≥0
fY (y) =
0, y < 0.
Again, the density fY in the above formula holds when X is absolutely continuous. Note
that when n < 1, fY is not defined at 0. If we allow delta functions,
then the density
1 1
1 n
formula above are also valid for mixed r.v. because n y −1
δ ±y n − xk = δ (y − (±xk )n ).
11.5. In general, for Y = g(X), we solve the equation y = g(x). Denoting its real roots
by xk . Then,
X fX (xk )
fY (y) = 0 (x )|
. (27)
k
|g k
If g(x) = c = constant for every x in the interval (a, b), then FY (y) is discontinuous for
y = c. Hence, fY (y) contains an impulse (FX (b) − FX (a)) δ(y − c) [15, p. 93–94].
94
• To see this, consider when there is unique x such that g(x) = y. Then, For small ∆x
and ∆y, P [y, y < Y ≤ y+∆y] = P [x < X ≤ x+∆x] where (y+∆y] = g ((x, x + ∆x])
is the image of the interval (x, x + ∆x]. (Equivalently, (x, x + ∆x] is the inverse image
of y + ∆y].) This gives fY (y)∆y = fX (x)∆x.
• The joint density fX,Y is
Let the xk be the solutions for x of g(x) = y. Then, by integrating (28) w.r.t. x, we
have (27) via the use of (4).
• When g bijective,
d −1
fY (y) = g (y) fX g −1 (y) .
dy
a
• For Y = X
, fY (y) = ay fX ay .
√
• Suppose X is nonnegative. For Y = X,
fY (y) = 2yfX (y 2 ).
11.6. Given Y = g(X) where X ∼ U(a, b). Then, to get fY (y0 ), plot g on (a, b). Let
A = g −1 ({y0 }) be the set of all points x such that g(x) = y0 . Suppose A can be written
as a countable disjoint union A = B ∪ ∪i Ii where B is countable and the Ii ’s are intervals.
We have !
1 1 X
fY (y) = + `(Ii ) δ(y − y0 )
b − a |g 0 (x)| i
95
50 Chapter Two
p = n1 Hypergeometric
distributions
Discrete
n3 x = 0,1⋅⋅⋅ , min(n1, n2)
X1 + ⋅⋅⋅ + XK n3 ← ∞
n1, n2, n3
Poisson v = np
x = 0,1⋅⋅⋅ n ∞ ← Binomial
v x = 0,1 ⋅⋅⋅ n X1 + ⋅⋅⋅ + XK
n, p
Bernoulli
n=1 x = 0,1
m = np
s2 = v p
s 2 = np(1 – p)
m=v
n ← ∞
Normal a = b→ ∞
Y = eX Beta
Lognormal -∞ < x < ∞ 0<x<1
y>0 log Y m, s a, b
m = ab
X- m s 2 = ab 2
X1 + ⋅⋅⋅ + XK a→ ∞ X1
X1 + ⋅⋅⋅ + XK s
1/X
m + sX X1 + X2
distributions
Continuous
normal x>0
a, a -∞ < x < ∞ a, b
n = 2a
X1/X2 2 a = 1/2
X1 + ⋅⋅⋅ + XK2
a=0 a=n
a + aX X1 + ⋅⋅⋅ + XK
a=1
a =1 Erlang
Standard Chi-square x>0
cauchy n→∞ x>0 b, n
-∞ < x < ∞ n
n2 → ∞
1
n=2 n=1
X n1 X X1/n1 b =2 X1 + ⋅⋅⋅ + XK
n=1 X2 /n2
F Exponential Standard
x>0 x>0 -blog X
uniform
n1, n2
b 0 < x <1
min(X1,◊◊◊,XK)
2
X
1
a + (b − a)X a=0
X X a a=1 X1 - X2
b=1
t X2
-∞ < x < ∞ Rayleigh Weibull Uniform
Triangular
n x>0 x>0 a<x<b
–1 < x < 1 a, b
b a, b
96
Uniform: U ( a, b )
⎛ ⎞
X i ~ Γ ( qi , λ ) ⇒ ∑X
~ Γ ⎜ ∑ qi , λ ⎟ .
⎝ i
i
⎠ fX ( x) =
1
1 ( x) ⎛λ⎞
X ~ E (λ ) ⇒ α X ~ E ⎜ ⎟ .
i
b − a [ a , b] ⎝α ⎠
⎛ λ⎞
X ~ Γ ( q, λ ) ⇒ Y = α X ~ Γ ⎜ q, ⎟ . ⎛ ⎞
⎝ α⎠ 1
X ~ U ( 0,1) ⇒ − ln ( X ) ~ E ( λ ) Xi ~ E(λi) ⇒ min ( X i ) ~ E ⎜ ∑ λi ⎟ .
λ ⎝ i ⎠
Gamma : Γ ( q, λ ) Exponential: E(λ)
λ q x q −1e− λ x f X ( x ) = λ e− λ x1[0, ∞ ) ( x )
fX ( x) = 1( 0, ∞ ) ( x ) q =1
Γ(q)
n 1
q= ,λ = ⎛1 1 ⎞ ⎛ 1 ⎞
X ~ N ( 0,σ 2 ) ⇒ X 2 ~ Γ ⎜ , 2 ⎟
i .i .d .
2 2σ 2 X 2 + Y 2 with X , Y ~ N ⎜ 0, ⎟
⎝ 2 2σ ⎠ ⎝ 2λ ⎠
Normal/Gaussian: N ( m,σ 2 )
⎛n 1 ⎞
Chi-squared χ 2 : Γ ⎜ , 2 ⎟
⎝ 2 2σ ⎠
1 ⎛ x−μ ⎞
2
1 − ⎜
σ ⎟⎠
fX ( x) = e 2⎝
with X i ~ N ( 0,σ )
n i .i .d .
Usually, σ = 1 . ∑X
i =1
i
2 2
σ 2π
X1
with X i ~ Γ ( qi , λ ) X2
X1 + X 2 λ =α i .i .d .
⎛ 1 ⎞
X 2 + Y 2 with X , Y ~ N ⎜ 0, ⎟
⎝ 2α ⎠
Beta:
Rayleigh: f ( x ) = 2α xe −α x 1[0, ∞ ) ( x )
2
Γ ( q1 + q2 )
f β q ,q ( x ) = x q1 −1 (1 − x ) 1( 0,1) ( x )
q2 −1
1 2
Γ ( q1 ) Γ ( q2 )
Din
November 8, 2004
g ( y ) f ( g ( y ))
d
Note: Y = g ( X ) ; monotone g ⇒ f ( y ) = Y
−1
X
−1
97
11.2 MISO case
11.8. If X and Y are jointly continuous random variables with joint density fX,Y . The
following two methods give the density of Z = g(X, Y ).
• Condition on one of the variable, say Y = y. Then, begin conditioned, Z is simply a
function of one variable g(X,R y); hence, we can use the one-variable technique to find
fZ|Y (z|y). Finally, fZ (z) = fZ|Y (z|y)fY (y)dy.
• Directly find the joint density of the random vector YZ = g(X,Y )
. Observe that the
∂g ∂g Y
∂g
Jacobian is ∂x ∂y . Hence, the magnitude of the determinant is ∂x .
0 1.
Of course, theR standard way of finding the pdf of Z is by finding the derivative of the
cdf FZ (z) = (x,y):x2 +y2 ≤z fX,Y (x, y)d(x, y). This is still good for solving specific examples.
It is also a good starting point for those who haven’t learned conditional probability nor
Jacobian.
Let the x(k) be the solutions of g(x, y) = z for fixed z and y. The first method gives
X fX|Y (x|y)
∂
fZ|Y (z|y) = g(x, y) .
k ∂x (k) x=x
Hence,
X fX,Y (x, y)
∂
fZ,Y (z, y) = g(x, y) ,
k ∂x x=x(k)
which comes out of the second method directly. Both methods then gives
Z X
fX,Y (x, y)
∂
fZ (z) = g(x, y) dy.
k ∂x (k) x=x
The integration for a given z is only on the value of y such that there is at least a solution
for x in z = g(x, y). If there is no such solution, fZ (z) = 0. The same technique works for
a function of more than one random variables Z = g(X1 , . . . , Xn ). For any j ∈ [n], let the
(k)
xj be the solutions for xj in z = g(x1 , . . . , xn ). Then,
Z X
fX1 ,...,Xn (x1 , . . . , xn )
fZ (z) = dx[n]\{j} .
∂
k ∂x g(x1 , . . . , xn )
j (k)
xj =xj
For the second method, we consider the random vector (hr (X1 , . . . , Xn ), r ∈ [n]) where
hr (X1 , . . . , Xn ) = Xr for r 6= j and hj = g. The Jacobian is of the form
1 0 0 0
0 1 0 0
∂g ∂g
∂x ∂x ∂x ∂x .
∂g ∂g
1 2 j n
0 0 0 1
98
By swapping the row with all the partial derivatives to the first row, the magnitude of
the determinant is unchanged and we also end up with upper triangular matrix whose
determinant is simply the product of the diagonal elements.
(a) For Z = aX + bY ,
Z Z
1 z − by 1 z − ax
fZ (z) = fX,Y , y dy = fX,Y x, dx.
|a| a |b| b
ax+by a b
• Note that Jacobian y
y
, xy = .
0 1
(ii) Note that when X and Y are independent and a = b = 1, we have the convolu-
tion formula
Z Z
fZ (z) = fX (z − y) fY (y)dy = fX (x)fY (z − x) dx.
(b) For Z = XY ,
Z Z
z 1 z 1
fZ (z) = fX,Y x, dx = fX,Y ,y dy
x |x| y |y|
y x
[9, Ex 7.2, 7.11, 7.15]. Note that Jacobian xyy
, xy = .
0 1
(c) For Z = X 2 + Y 2 ,
√ p p
Z z fX|Y
z− + fX|Y − z − y y
y2 y 2
fZ (z) = p fY (y)dy
√ 2 z − y2
− z
99
Y
(e) For Z = X
, Z Z
y
y
fZ (z) = fX,Y , y dy = |x| fX,Y (x, xz)dx.
z z
X
Similarly, when Z = Y
, Z
fZ (z) = |y| fX,Y (yz, y)dy.
min(X,Y )
(f) For Z = max(X,Y )
where X and Y are strictly positive,
Z ∞ Z ∞
FZ (z) = FY |X (zx|x)fX (x)dx + FX|Y (zy|y)fY (y)dy,
0 0
Z ∞ Z ∞
fZ (z) = xfY |X (zx|x)fX (x)dx + yfX|Y (zy|y)fY (y)dy, 0 < z < 1.
0 0
[9, Ex 7.17].
PN
11.9 (Random sum). Let S = i=1 Vi where Vi ’s are i.i.d. ∼V independent of N .
(b) ES = EN EV .
Remark : If N ∼ P (λ), then ϕS (u) = exp (λ (ϕV (u) − 1)), the compound Poisson
distribution CP (λ, L (V )). Hence, the mean and variance of CP (λ, L (V )) are λEV and
λEV 2 respectively.
∂(g1 ,...,gn )
Alternative notations for the Jacobian matrix are J, ∂(x 1 ,...,xn )
[7, p 242], Jg (x) where the it
is assumed that the Jacobian matrix is evaluated at z = x = (x1 , . . . , xn ).
100
• Let A be an n-dimensional
Q “box” defined by the corners x and x+∆x. The “volume”
of the image g(A) is ( i ∆xi ) |det dg(x)|. Hence, the magnitude of the Jacobian
determinant gives the ratios (scaling factor) of n-dimensional volumes (contents). In
other words,
∂(y1 , . . . , yn )
dy1 · · · dyn = dx1 · · · dxn .
∂(x1 , . . . , xn )
• Note that for any matrix A, det(A) = det(AT ). Hence, the formula above could
tolerate the incorrect “Jacobian”.
[2, p. 185].
101
11.12. Suppose Y = g(X) where both X and Y have the same dimension, then the joint
density of X and Y is
fX,Y (x, y) = fX (x)δ(y − g(x)).
• In most cases, we can show that X and Y are not independent, pick a point
x(0)
such that fX x(0) > 0. Pick a point y(1) such that
y (1) 6= g x(0) and fY y (1) > 0.
(0) (1) (0) (1)
Then, fX,Y x , y = 0 but fX x fY y > 0. Note that this technique does
not always work. For example, if g is a constant function which maps all values of
x to a constant c. Then, we will not be able to find y (1) . Of course, this is to be
expected because we know that a constant is always independent of other random
variables.
Example 11.13.
(a) For Y = AX + b, where A is a square, invertible matrix,
1
fY (y) = fX A−1 (y − b) . (31)
|det A|
(b) Transformation between Cartesian coordinates (x, y) and polar coordinates (r, θ)
p
• x = r cos θ, y = r sin θ, r = x2 + y 2 , θ = tan−1 xy .
∂x ∂x
cos θ −r sin θ
• ∂y
∂r ∂θ
∂y =
= r. (Recall that dxdy = rdrdθ).
∂r ∂θ
sin θ r cos θ
We have
and p
1 −1 y
fX,Y (x, y) = p fR,Θ 2 2
x + y , tan .
x2 + y 2 x
If, furthermore, Θ is uniform on (0, 2π) and independent of R. Then,
1 1 p
fX,Y (x, y) = p fR 2
x +y . 2
2π x2 + y 2
(c) A related transformation is given by Z = X 2 + Y 2 and Θ = tan−1 Y
. In this case,
√ √ X
X = Z cos Θ, Y = Z sin Θ, and
1 √ √
fZ,Θ (z, θ) = fX,Y z cos θ, z sin θ (33)
2
which gives (29).
√ Y
11.14. Suppose X, Y are i.i.d. N (0, σ 2 ). Then, R = X 2 + Y 2 and Θ = arctan X are
1 r 2
independent with R being Rayleigh fR (r) = σr2 e− 2 ( σ ) U (r) and Θ being uniform on
[0, 2π].
102
11.15 (Generation of a random sample of a normally distributed random variable). Let
U1 , U2 be i.i.d. U(0, 1). Then, the random variables
p
X1 = −2 ln U1 cos(2πU2 )
p
X2 = −2 ln U1 sin(2πU2 )
11.16. In (11.11), suppose dim(Y ) = dim(g) ≤ dim(X). To find the joint pdf of Y , we
introduce “arbitrary” Z = h(X) so that dim YZ = dim(X).
• Y1 = X1:n = Xmin is the first order statistic denoting the smallest of the Xi ’s,
• Y2 = X2:n is the second order statistic denoting the second smallest of the Xi ’s . . .,
and
• Yn = Xn:n = Xmax is the nth order statistic denoting the largest of the Xi ’s.
In words, the order statistics of a random sample are the sample values placed in ascending
order [2, p 226]. Many results in this section can be found in [4].
Let Ay = [Xmax ≤ y], By = [Xmin > y]. Then, Ay = [∀i Xi ≤ y] and By = [∀i Xi > y].
103
11.18 (Densities). Suppose the Xi are absolutely continuous with joint density fX . Let
Sy be the set of all n! vector which comes from permuting the coordinates of y.
X
fY (y) = fX (x), y1 ≤ y2 ≤ · · · ≤ yn . (34)
x∈Sy
Q
To see this, note that fY (y) j ∆yj is the probability that Yj is in the small interval of
length ∆yj around yj . This probability can be calculated from finding the probability that
all Xk fall into the above small regions.
From the joint density, we can find the joint pdf/cdf of YI for any I ⊂ [n]. However, in
many cases, we can directly reapply the above technique to find the joint pdf of YI . This
is especially useful when the Xi are independent or i.i.d.
(a) The marginal density fYk can be found by approximating fYk (y) ∆y with
X
n X
P [Xj ∈ [y, y + ∆y) and ∀i ∈ I, Xi ≤ y and ∀r ∈ (I ∪ {k})c , Xr > y],
j=1 I∈([n]\{j})
k−1
where for any set A and integer ` ∈ |A|, we define A` to be the set of all k-element
subsets of A. Note also that we assume (I ∪ {k})c = [n] \ (I ∪ {k}).
To see this, we first choose the Xj that will be Yk with value around y. Then, we
must have k − 1 of the Xi below y and have the rest of the Xi > y.
(b) For integers r < s, the joint density fYr ,Ys (yr , ys )∆yr ∆ys can be approximated by the
probability that two of the Xi are inside small regions around yr and ys . To make
them Yr and Ys , for the other Xi , r − 1 of them before yr , s − r − 1 of them between
yr and ys , and n − s of them beyond ys .
• fXmax ,Xmin (u, v)∆u∆v can be approximated by by
X
P [Xj ∈ [u, u+∆u), Xj ∈ [v, v+∆v), and ∀i ∈ [n]\{j, k} , v < Xi ≤ u], v ≤ u,
(j,k)∈S
where S is the set of all n(n − 1) pairs (j, k) from [n] × [n] with j 6= k. This is
simply choosing the j, k so that Xj will be the maximum with value around u,
and Xk will be the minimum with value around v. Of course, the rest of the Xi
have to be between the min and max.
◦ When n = 2, we can use (34) to get
fXmax ,Xmin (u, v) = fX1 ,X2 (u, v) + fX1 ,X2 (v, u), v ≤ u.
Note that the joint density at point yI is0 if the the elements in yI are not arranged in the
“right” order.
11.19 (Distribution functions). We note again the the cdf may be obtained by integration
of the densities in (11.18) as well as by direct arguments valid also in the discrete case.
104
(a) The marginal cdf is
X
n X
FYk (y) = P [∀i ∈ I, Xi ≤ y and ∀r ∈ [n] \I, Xr > y].
j=k I∈([n])
j
where the union is a disjoint union. Hence, we sum the probability that exactly j
of the Xi are ≤ y for j ≥ k. Alternatively, note that the event [Yk ≤ y] can also be
expressed as a disjoint union
[
[Xi ≤ k and exactly k − 1 of the X1 , . . . , Xj−1 are ≤ y] .
j≥k
This gives
X
n X
FYk (y) = P [Xj ≤ y, ∀i ∈ I, Xi ≤ y, and ∀r ∈ [j − 1] \ I, Xr > y] .
j=k I∈( [j−1]
)
k−1
where the upper limit of the second union is changed from n to m because we must
have N (yr ) ≤ N (ys ). Now, to have N (yr ) = j and N (ys ) = m for m > j is to put j
of the Xi in (−∞, yr ], m − j of the Xi in (yr , ys ], and n − m of the Xi in (ys , ∞).
105
when u < v. When v ≥ u, the second term can be found by (23) which gives
X
FXmax ,Xmin (u, v) = FX1 ,...,Xn (u, . . . , u) − (−1)|i:wi =v| FX1 ,...,Xn (w)
w∈S
X
|i:wi =v|+1
= (−1) FX1 ,...,Xn (w).
w∈S\{(u,...,u)}
where S = {u, v}n is the set of all 2n vertices w of the “box” × (ai , bi ]. The joint
i∈[n]
density is 0 for u < v.
( )
Q 0,Q u≤v
(c) FXmax ,Xmin (u, v) = FXk (u) − (FXk (u) − FXk (v)), v < u
k∈[n] k∈[n]
Q
(e) FXmin (v) = 1 − (1 − FXi (v)).
i
Q
(f) FXmax (u) = FXi (u).
i
i.i.d.
11.21. Suppose Xi ∼ X with common density f and distribution function F .
106
If we define y0 = −∞, yk+1=∞ , n0 = 0, nk+1 = n + 1, then for k ∈ [n] and 1 ≤ n1 <
· · · < nk ≤ n, the joint density fYn1 ,Yn2 ,...,Ynk (yn1 , yn2 , . . . , ynk ) is given by
! k
Y k Y (F (ynj+1 ) − F (ynj ))nj+1 −nj −1
n! f (ynj ) .
j=1 j=1
(n j+1 − n j − 1)!
In particular, for r < s, the joint density fYr ,Ys (yr , ys ) is given by
n!
f (yr )f (ys )F r−1 (yr )(F (ys ) − F (yr ))s−r−1 (1 − F (ys ))n−s
(r − 1)!(s − r − 1)!(n − s)!
[2, Theorem 5.4.6 p 230].
(b) The joint cdf FYr ,Ys (yr , ys ) is given by
FYs (ys ), ys ≤ yr ,
P
n P m
n!
j!(m−j)!(n−m)!
(F (yr ))j (F (ys ) − F (yr ))m−j (1 − F (ys ))n−m , yr < ys .
m=s j=r
n 0, u≤v
(c) FXmax ,Xmin (u, v) = (F (u)) − n .
(F (u) − F (v)) , v < u
0, u≤v
(d) fXmax ,Xmin (u, v) = n−2
n (n − 1) fX (u) fX (v) (F (u) − F (v)) , v<u
(e) Marginal cdf:
n
X n
FYk (y) = (F (y))j (1 − F (y))n−j
j
j=k
Xn X k + m − 1
n−k
j−1
= (F (y))k (1 − F (y))j−k = (F (y))k (1 − F (y))m
k−1 k−1
j=k m=0
Z F (y)
n!
= tk−1 (1 − t)n−k dt.
(k − 1)! (n − k)! 0
Note that N (y) ∼ B (n, F (y)). The last equality comes from integrating the marginal
density fYk in (36) with change of variable t = F (y).
(i) FXmax (y) = (F (y))n and fXmax (y) = n (F (y))n−1 fX (y).
(ii) FXmin (y) = 1 − (1 − F (y))n and fXmin (y) = n (1 − F (y))n−1 fX (y).
(f) Marginal density:
n!
fYk (y) = (F (y))k−1 (1 − F (y))n−k fX (y) (36)
(k − 1)! (n − k)!
[2, Theorem 5.4.4 p 229]
Consider small neighborhood ∆y around y. To have Yk ∈ ∆y , we must have exactly
one of the Xi ’s in ∆y , exactly k− 1 of them less than y, and exactly n − k of them
greater than y. There are n n−1k−1
n!
= (k−1)!(n−k)! 1
= B(k,n−k+1) possible setups.
107
(g) The range R is defined as R = Xmax − Xmin .
R
(i) For x > 0, fR (x) = n(n − 1) (F (u) − F (u − x))n−2 f (u − x)f (u)du.
R
(ii) For x ≥ 0, FR (x) = n (F (u) − F (u − x))n−1 f (u)du.
Both pdf and cdf above are derived by first finding the distribution of the range
conditioned on the value of the Xmax = u.
and n
X n
P [Yj = xi ] = Pik (1 − Pi )n−k − Pi−1
k
(1 − Pi−1 )n−k .
k=j
k
108
The marginal densities is given by
∂ ∂
fU (u) = FX,Y (x, y) + FX,Y (x, y)
∂x x=u, ∂y x=u,
y=u y=u
Zu Zu
= fX,Y (x, u)dx + fX,Y (u, y) dy,
−∞ −∞
109
12 Convergences
Definition 12.1. A sequence of random variables (Xn ) converges pointwise to X if
∀ω ∈ Ω lim Xn (ω) → X (ω)
n→∞
Definition 12.2 (Strong Convergence). The following statements are all equivalent con-
ditions/notations for a sequence of random variables (Xn ) to converge almost surely to a
random variable X
a.s.
(a) Xn −−→ X
(i) Xn → X a.s.
(ii) Xn → X with probability 1.
(iii) Xn → X w.p. 1
(iv) lim Xn = Xa.s.
n→∞
a.s.
(b) (Xn − X) −−→ 0
(c) P [Xn → X] = 1
hn oi
(i) P ω : lim Xn (ω) = X (ω) = 1
n→∞
hn oc i
(ii) P ω : lim Xn (ω) = X (ω) =0
n→∞
hn oi
(iii) P ω : lim Xn (ω) 6= X(ω) = 0
n→∞
(iv) P [Xn 9 X] = 0
hn oi
(v) P ω : lim |Xn (ω) − X (ω)| = 0 = 1
n→∞
hn oi
(d) ∀ε > 0 P ω : lim |Xn (ω) − X (ω)| < ε = 1
n→∞
a.s. P
(e) Let A1 , A2 , . . . be independent. Then, 1An −−→ 0 if and only if P (An ) < ∞
n
110
Definition 12.4 (Convergence in probability). The following statements are all equivalent
conditions/notations for a sequence of random variables (Xn ) to converges in probability
to a random variable X
P
(a) Xn −
→X
(i) Xn →P X
(ii) p lim Xn = X
n→∞
P
(b) (Xn − X) −
→0
(iv) The strict inequality between |Xn − X| and ε can be replaced by the correspond-
ing “non-strict” version.
(c) Suppose (Xn ) i.i.d. with distribution U [0, θ]. Let Zn = max {Xi : 1 ≤ i ≤ n}. Then,
P
Zn −
→θ
P P
(d) g continuous, Xn −
→ X ⇒ g (Xn ) −
→ g (X)
P
(i) Suppose that g : Rd → R is continuous. Then, ∀i Xi,n −
→ Xi implies
n→∞
P
g (X1,n , . . . , Xd,n ) −
→ g (X1 , . . . , Xd )
n→∞
P P
(e) Let g be a continuous function at c. Then, Xn −
→ c ⇒ g (Xn ) −
→ g (c)
P
(f) Fatou’s lemma: 0 ≤ Xn −
→ X ⇒ lim inf EXn ≥ EX
n→∞
P
(g) Suppose Xn −
→ X and |Xn | ≤ Y with EY < ∞, then EXn → EX
111
P
(h) Let A1 , A2 , . . . be independent. Then, 1An −
→ 0 iff P (An ) → 0
P
(i) Xn −
→ 0 iff ∃δ > 0 such that ∀t ∈ [−δ, δ] we have ϕXn (t) → 1
Definition 12.6 (Weak convergence for probability measures). Let Pn and P be probability
d
R R (d ≥R1). The sequence Pn converges weakly to P if the sequence of real
measure on
numbers gdPn → gdP for any g which is real-valued, continuous, and bounded on Rd
12.7. Let (Xn ), X be Rd -valued random variables with distribution functions (Fn ), F ,
distributions (µn ) , µ, and ch.f. (ϕn ) , ϕ respectively.
The following are equivalent conditions for a sequence of random variables (Xn ) to
converge in distribution to a random variable X
(i) Xn ⇒ X
L
i. Xn −
→X
D
ii. Xn −
→X
(ii) Fn ⇒ F
i. FXn ⇒ FX
ii. Fn converges weakly to F
iii. lim P Xn (A) = P X (A) for every A of the form A = (−∞, x] for which
n→∞
P X {x} = 0
(iii) µn ⇒ µ
i. P Xn converges weakly to P X
(e) Continuous Mapping theorem: lim Eg (Xn ) = Eg (X)for all g which is real-
n→∞
valued, continuous, and bounded on Rd
(i) lim Eg (Xn ) = Eg (X) for all bounded real-valued function g such that P [X ∈ Dg ] =
n→∞
0 where Dg is the set of points of discontinuity of g
(ii) lim Eg (Xn ) = Eg (X) for all bounded Lipschitz continuous functions g
n→∞
(iii) lim Eg (Xn ) = Eg (X) for all bounded uniformly continuous functions g
n→∞
112
(iv) lim Eg (Xn ) = Eg (X) for all complex-valued functions g whose real and imag-
n→∞
inary parts are bounded and continuous
(i) For nonnegative random variables: ∀s ≥ 0 LXn (s) → LX (s) where LX (s) =
Ee−sX
Note that there is no requirement that (Xn ) and X be defined on the same probability
space (Ω, A, P )
12.8. Continuity Theorem: Suppose lim ϕn (t) exists ∀t; call this limit ϕ∞ Further-
n→∞
more, suppose ϕ∞ is continuous at 0. Then there exists ∃ a probability distribution µ∞
such that µn ⇒ µ∞ and ϕ∞ is the characteristic function of µ∞
(b) Suppose Xn ⇒ X
(d) Suppose (µn ) is a sequence of probability measures on R that are all point masses
with µn ({αn }) = 1. Then, µn converges weakly to a limit µ iff αn → α; and in this
case µ is a point mass at α
(i) Suppose P Xn and P X have densities δn and δ w.r.t. the same measure µ. Then,
δn → δ µ-a.e. implies
i. ∀B ∈ BR P Xn (E) → P X (E)
• FXn → FX
• FXn ⇒ FX
R R
ii. Suppose g is bounded. Then, g (x) P Xn (dx) → g (x) P X (dx). In Equiv-
alently, E [g (Xn )] → E [g (X)] where the E is defined with respect to appro-
priate P .
(ii) Remarks:
113
i. For absolutely continuous random variables, µ is the Lebesgue measure, δ
is the probability density function.
ii. For discrete random variables, µ is the counting measure, δ is the probability
mass function.
(g) Xn ⇒ 0 if and only if ∃δ > 0 such that ∀t ∈ [−δ, δ] we have ϕXn (t) → 1
Then as n → ∞,
Y X
(1 − Xn,k ) ⇒ e−X if and only if Xn,k ⇒ X
k k
Example 12.12.
0, ω ∈ 0, 1 − n12
(a) Ω = [0, 1] , P is uniform on [0,1]. Xn (ω) =
en , ω ∈ 1 − n12 , 1
114
(X )
nk
X n ⎯⎯ →X X n ⎯⎯
a.s.
→X
p
L
Xn ≤ Y ∈ L p
f ( ⋅ ) continuous
E Xn →E X
p p
X n ≤ Y ∈ Lp (X )
nk
X n ⎯⎯
P
→X
f ( ⋅ ) continuous ∃c P ( X = c ) = 1 f X n ⎯⎯
→ fX
X n ⎯⎯
D
→X
f ( ⋅ ) continuous
{
∀x ∈ x : F ( x − ) = F ( x ) } ∀x ∈ D dense in R
dense in R
lim Fn ( x ) = F ( x )
n →∞
→ X , ∀n X n ≤ Y , Y ∈ Lp . Then, X ∈ Lp and X n ⎯⎯ →X .
p
X n ⎯⎯
L p a.s. L
c) X
(i) Suppose
n 9 0
−→→0X ⇒ X n ⎯⎯
a.s.
⎯⎯
d) XXnn −
(ii)
P D
→X
(iii) Xn −(2)
→X X n ⎯⎯→ 0
P a .s.
(3)
n ⎯⎯ 1→−0 n1
P
1
n
, Xw.p.
(c) Xn = .
[0,1]n1with uniform probability distribution. Define
n, Ω =w.p.
ii) Let
X n (0ω ) = ω + 1⎡ j −1
(i) Xn −
→
P
⎢
j⎤
, ⎥
(ω ) , where ⎡1
k=⎢
⎢2
( ⎤
1 + 8n − 1 ⎥ and
⎥
)
Lp ⎣ k k⎦
(ii) Xn 9 0 1
j = n − k ( k − 1) . The sequence of intervals ⎡ j −1 j ⎤
⎢
⎣ k k⎦
, ⎥ under the indicator
2
function is shown below:
115
[0,1]
⎡ 1⎤ ⎡1 ⎤
⎢ 0, 2 ⎥ → ⎢ 2 ,1⎥
⎣ ⎦ ⎣ ⎦
⎡ 1 ⎤ ⎡1 2 ⎤ ⎡ 2 ⎤
⎢ 0, 3 ⎥ → ⎢ 3 , 3 ⎥ → ⎢ 3 ,1⎥
⎣ ⎦ ⎣ ⎦ ⎣ ⎦
⎡ 1⎤ ⎡1 1⎤ ⎡1 3⎤ ⎡3 ⎤
⎢ 0, 4 ⎥ → ⎢ 4 , 2 ⎥ → ⎢ 2 , 4 ⎥ → ⎢ 4 ,1⎥
⎣ ⎦ ⎣ ⎦ ⎣ ⎦ ⎣ ⎦
Let X (ω ) = ω .
Figureof23:
(1) The sequence Diagram
real numbers ( X nfor
(ω ) )example (12.12)
does not converge for any ω .
(2) X n ⎯⎯ →X
p
L
12.1 Summation of
(3) X ⎯⎯
→X
random variables
P
n
P
n
Set Sn = Xi . ⎧1 1
i=1 ⎪⎪ n , w.p. 1 − n
iii) X n = ⎨ .
⎪n, w.p. 1 1
12.13 (Markov’s theorem;⎪⎩ Chebyshev’s
n inequality). For finite variance (Xi ), if n2
Var Sn →
1 1 P
0, then S
n n
− ES − → 0P → 0
n (1)n X n ⎯⎯
(2) X n ⎯⎯ →0
p
L
1 1
P
n
• If (X i ) are pairwise independent, then
15) n2
Var Sn = n2
Var Xi See also (12.17).
i=1
Summation of random variables
12.2 Summation (of n independent random variables
n
Set S = ∑ X and S ) = ∑ X ( ) where X ( ) = X 1 ⎡⎣ X ≤ c ⎦⎤ is X truncated at c.
n i n
c
i
c
n
c
n n n
i =1 i =1 n P
12.14. For sequence ( X nX
independent
16) Two ) nand (Yn )probability
, the that
are tail equivalent [ Xin converges
if ∑ P X ≠ Yn ] < ∞ . is either 0 or 1.
i=1
n
⎛ ⎞
12.15 Suppose ( X n )SLLN).
P Var(Kolmogorov’s and (Yn ) areConsider ⎜ ∑ P [ X n (X
a sequence
tail equivalent ≠ Ynn] )< of
∞ ⎟ independent
, then random variables.
If Xn
< ∞, then 1
S − 1
ES −−
a.s.
→ 0 ⎝ n ⎠
n2 n n n
a) P ( lim inf [ X n = Yn ]) = 1
n
n
n →∞
Pn
b) ∀ω ∈ liminf
• In particular, for [ X n = Yn ] ∃N ∈ such that
n →∞ independent Xn , if n
1 ∀n ≥ N X ( ω ) = Y (ω ) .
EXn → n
a.s.
a orn EXn → a, then n1 Sn −−→ a
i=1
c) With probability 1, the sequence ( X n ) converges iff the sequence (Yn )
12.16. Suppose that X1 , X2 , . . . are independent and that the series X = X1 + X2 + · · ·
converges.
converges a.s. Then, ϕX = ϕX1 ϕX2 · · ·
1
P
n
12.17. For pairwise independent (Xi ) with finite variances, if n2
Var Xi → 0, then
i=1
1 P
S
n n
− n1 ESn −
→0
(a) Chebyshev’s Theorem: For pairwise independent (Xi ) with uniformly bounded
P
variances, then n1 Sn − n1 ESn −
→0
116
12.3 Summation of i.i.d. random variable
Let (Xi ) be i.i.d. random variables.
n
1 P
12.18 (Chebyshev’s inequality). P n Xn − EX1 ≥ ε ≤ 1 Var[X1 ]
ε2 n
i=1
(a) L2 weak law: (finite variance) Let (Xi ) be uncorrelated (or pairwise independent)
random variables
P
n
(i) V arSn = Var Xi
i=1
1
P
n
P
(ii) If n2
Var Xi → 0, then n1 Sn − n1 ESn −
→0
i=1
1
P
n
L2
(iii) If EXi = µ and V arXi ≤ C < ∞. Then, n
Xi −→ µ
i=1
(b) Let (Xi ) be i.i.d. random variables with EXi = µ and V arXi = σ 2 < ∞. Then,
n
1 P
(i) P n
Xn − EX1 ≥ ε ≤ ε12 Var[X1]
n
i=1
1
P
n
L2
(ii) n
Xi −→ µ
i=1
1
P
n
L2 1
P
n
P 1
P
n
D
(c) n
Xi −→ µ implies n
Xi −
→ µ which in turn imply n
Xi −
→µ
i=1 i=1 i=1
(n) P
(d) If Xn are i.i.d. random variables such that lim tP [|X1 | > t] → 0, then n1 Sn −EX1 −
→
t→∞
0
(e) Khintchine’s theorem: If Xn are i.i.d. random variables and E |X1 | < ∞, then
(n) P
EX1 → EX1 and n1 Sn −
→ EX1 .
12.20 (SLLN).

(a) Kolmogorov's SLLN: Consider a sequence (X_n) of independent random variables. If Σ_n (Var X_n)/n² < ∞, then (1/n)S_n − (1/n)ES_n →^{a.s.} 0.

• In particular, for independent X_n, if (1/n) Σ_{i=1}^n EX_i → a or EX_n → a, then (1/n)S_n →^{a.s.} a.

(b) Khintchine's SLLN: If the X_i are i.i.d. with finite mean µ, then (1/n) Σ_{i=1}^n X_i →^{a.s.} µ.

(c) Consider a sequence (X_n) of i.i.d. random variables. Suppose EX_1^− < ∞ and EX_1^+ = ∞; then (1/n)S_n →^{a.s.} ∞.

• Suppose that X_n ≥ 0 are i.i.d. random variables and EX_n = ∞. Then, (1/n)S_n →^{a.s.} ∞.
12.21 (Relationship between the LLN and the convergence of relative frequency to probability). Consider i.i.d. Z_i ∼ Z and let X_i = 1_A(Z_i). Then, (1/n)S_n = (1/n) Σ_{i=1}^n 1_A(Z_i) = r_n(A), the relative frequency of the event A. Via the LLN (under the appropriate conditions), r_n(A) converges to E[(1/n) Σ_{i=1}^n 1_A(Z_i)] = P[Z ∈ A].
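A short illustration of 12.21 (assuming numpy; the event A = {Z > 2} with Z ∼ N(0, 1) is my choice): the relative frequency r_n(A) approaches P[Z ∈ A], whose value for the standard normal is about 0.02275.

```python
import numpy as np

# 12.21: relative frequency of A = {Z > 2} for i.i.d. Z_i ~ N(0,1) converges to P[Z in A].
rng = np.random.default_rng(2)
true_p = 0.02275                      # P[Z > 2] for a standard normal (approximately)
for n in [10**2, 10**4, 10**6]:
    Z = rng.standard_normal(n)
    r_n = (Z > 2).mean()              # (1/n) * sum of indicators 1_A(Z_i)
    print(n, r_n, abs(r_n - true_p))
```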
(b) (S_n − nm)/√n = (1/√n) Σ_{k=1}^n (X_k − m) ⇒ N(0, σ).

To see this, let Z_k = (X_k − m)/σ ∼ Z (i.i.d.) and Y_n = (1/√n) Σ_{k=1}^n Z_k. Then EZ = 0, Var Z = 1, and ϕ_{Y_n}(t) = (ϕ_Z(t/√n))^n. Approximating e^x ≈ 1 + x + (1/2)x², we have ϕ_X(t) ≈ 1 + jtEX − (1/2)t²E[X²].
The approximation is best for s, k near nm [9, p. 211].
• The approximation n! ≈ √(2π) n^{n+1/2} e^{−n} can be derived from approximating the density of S_n when X_i ∼ E(1). We know that f_{S_n}(s) = s^{n−1} e^{−s}/(n−1)!. Approximating this density at s = n gives (n−1)! ≈ √(2π) n^{n−1/2} e^{−n}. Multiplying through by n gives the result. [9, Ex. 5.18, p. 212]
• See also normal approximation for the binomial in (14).
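A quick numerical check of the Stirling approximation in the bullet above (plain Python; the test values of n are arbitrary):

```python
import math

# Stirling: n! ~ sqrt(2*pi) * n^(n + 1/2) * e^(-n); the ratio tends to 1.
for n in [5, 10, 50]:
    stirling = math.sqrt(2 * math.pi) * n ** (n + 0.5) * math.exp(-n)
    print(n, math.factorial(n), round(stirling), math.factorial(n) / stirling)
```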
• p_{Y|X}(y|x) = p_Z(y − x)

• p_Y(y) = Σ_x p_Z(y − x) p_X(x)

• p_{X|Y}(x|y) = p_Z(y − x) p_X(x) / Σ_{x′} p_Z(y − x′) p_X(x′).
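A minimal numerical sketch of these formulas for Y = X + Z with X independent of Z (plain Python; the two pmfs below are made-up examples, not from the text):

```python
# Y = X + Z with X, Z independent and integer valued (toy pmfs).
pX = {0: 0.5, 1: 0.3, 2: 0.2}
pZ = {0: 0.6, 1: 0.3, 2: 0.1}

def p_Y(y):
    # p_Y(y) = sum_x p_Z(y - x) p_X(x)
    return sum(pZ.get(y - x, 0.0) * px for x, px in pX.items())

def p_X_given_Y(x, y):
    # p_{X|Y}(x|y) = p_Z(y - x) p_X(x) / sum_{x'} p_Z(y - x') p_X(x')
    return pZ.get(y - x, 0.0) * pX.get(x, 0.0) / p_Y(y)

print([round(p_Y(y), 4) for y in range(5)])        # pmf of Y; sums to 1
print([round(p_X_given_Y(x, 2), 4) for x in pX])   # conditional pmf of X given Y = 2
```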
13.5. (∂/∂z) F_{Y|X}(g(z, x)|x) = (∂/∂z) ∫_{−∞}^{g(z,x)} f_{Y|X}(y|x) dy = f_{Y|X}(g(z, x)|x) (∂/∂z) g(z, x).
13.6. F_{X|Y}(x|y) = P[X ≤ x | Y = y] = lim_{∆→0} P[X ≤ x, y − ∆ < Y ≤ y] / P[y − ∆ < Y ≤ y] = E[1_{(−∞,x]}(X) | Y = y].
13.7. Define f_{X|A}(x) to be f_{X,A}(x)/P(A). Then,

P[A | X = x] = lim_{∆→0} ( ∫_{x−∆}^x f_{X,A}(x′) dx′ ) / ( ∫_{x−∆}^x f_X(x′) dx′ ) = f_{X,A}(x)/f_X(x) = P(A) f_{X|A}(x)/f_X(x).
13.8. For independent X_i, P[∀i X_i ∈ B_i | ∀i X_i ∈ C_i] = Π_i P[X_i ∈ B_i | X_i ∈ C_i].
(a) Substitution law for conditional expectation: E [g(X, Y )|X = x] = E [g(x, Y )|X = x]
(g) E [X + Y |Z = z ] = E [X |Z = z ] + E [Y |Z = z ]
(j) min_g E[(Y − g(X))²] = E[(Y − E[Y|X])²], where g ranges over all functions. Hence, E[Y|X] is sometimes called the regression of Y on X: the "best" predictor of Y conditional on X.
Definition 13.11. The conditional variance is defined as

Var[Y|X = x] = ∫ (y − m(x))² f(y|x) dy,

where m(x) = E[Y|X = x].
(c) ∀i ∈ [n] Xi and the vector (Xj )[n]\{i} are independent conditioning on Y .
Example 13.17. Suppose X and Y are independent. Conditioned on another random
variable Z, it is not true in general that X and Y are still independent. See example
(4.46). Recall that Z = X ⊕ Y which can be rewritten as Y = X ⊕ Z. Hence, when Z = 0,
we must have Y = X.
13.18. Suppose we know that fX|Y,Z (x|y, z) = g(x, y); that is fX|Y,Z does not depend on
z. Then, conditioned on Y , X and Z are independent. In which case,
13.19. Suppose we know that fZ|V,U1 ,U2 (z |v, u1 , u2 ) = fZ|V (z |v ) for all z, v, u1 , u2 , then
conditioned on V , we can conclude that Z and (U1 , U2 ) are independent. This further
implies Z and Ui are independent. Moreover,
• In order for this definition to make sense when v = 0 or when X has a singular
covariance matrix, we agree that any constant random variable is considered to be
Gaussian.
• Of course, the mean and variance are v^T EX and v^T Λ_X v, respectively.
If X is a Gaussian random vector with mean vector m and covariance matrix Λ, we write
X ∼ N (m, Λ).
• To remember the form of the above formulas, note that both exponents have to be scalar. So we should have (x − m)^T Λ^{−1} (x − m) rather than putting the transpose on the last term. To make this clearer, set Λ = I; then we must have a dot product. Note also that v^T A v = Σ_k Σ_ℓ v_k A_{kℓ} v_ℓ.
• The above formula can be derived by starting from a random vector Z whose components are i.i.d. N(0, 1). Let X = C_X^{1/2} Z + m. Use (31) and (37).
• For X_i i.i.d. ∼ N(m_i, σ_i²),

f_X(x) = ( 1 / ((2π)^{n/2} Π_i σ_i) ) exp( −(1/2) Σ_i ((x_i − m_i)/σ_i)² ).

In particular, if the X_i are i.i.d. N(0, 1),

f_X(x) = ( 1 / (2π)^{n/2} ) e^{−(1/2) x^T x}.    (37)
• (2π)^{n/2} √(det Λ) = √(det(2πΛ)).

• The Gaussian density is constant on the "ellipsoids" centered at m:

{x ∈ R^n : (x − m)^T C_X^{−1} (x − m) = constant}.
(c) ϕ_X(v) = e^{j Σ_i v_i EX_i − (1/2) Σ_k Σ_ℓ v_k v_ℓ Cov[X_k, X_ℓ]} = e^{j v^T m − (1/2) v^T Λ v}.

• This can be derived from definition (14.1) by noting that ϕ_X(v) = E[e^{j v^T X}] = E[e^{j · 1 · Y}], where Y = v^T X is a scalar Gaussian random variable.
• In particular, E (X − EX)4 = 3σ 4 .
(k) To generate N(m, Λ): first, by the spectral theorem, Λ = V D V^T, where V is an orthogonal matrix whose columns are eigenvectors of Λ and D is a diagonal matrix containing the eigenvalues of Λ. The random vector we want is V X + m, where X ∼ N(0, D).
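A sketch of item (k) in Python/numpy (the mean vector and covariance matrix below are made-up numbers; numpy.linalg.eigh supplies the spectral decomposition of a symmetric matrix):

```python
import numpy as np

# Generate N(m, Lam) via the spectral decomposition Lam = V D V^T.
rng = np.random.default_rng(3)
m = np.array([1.0, -2.0])
Lam = np.array([[2.0, 0.8],
                [0.8, 1.0]])

eigvals, V = np.linalg.eigh(Lam)                            # Lam = V @ diag(eigvals) @ V.T
X = rng.standard_normal((100_000, 2)) * np.sqrt(eigvals)    # rows ~ N(0, D)
samples = X @ V.T + m                                       # rows ~ V X + m ~ N(m, Lam)

print(samples.mean(axis=0))                                 # ~ m
print(np.cov(samples, rowvar=False))                        # ~ Lam
```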
14.3. For the bivariate normal, f_{X,Y}(x, y) is

( 1 / (2π σ_X σ_Y √(1 − ρ²)) ) exp{ −[ ((x − EX)/σ_X)² − 2ρ ((x − EX)/σ_X)((y − EY)/σ_Y) + ((y − EY)/σ_Y)² ] / (2(1 − ρ²)) },

where ρ = Cov(X, Y)/(σ_X σ_Y) ∈ [−1, 1]. Here, x, y ∈ R.
• f_{X,Y}(x, y) = (1/(σ_X σ_Y)) ψ_ρ( (x − m_X)/σ_X , (y − m_Y)/σ_Y ).

• When ρ = 0, f_{X,Y} is constant on ellipses of the form ((x − m_X)/σ_X)² + ((y − m_Y)/σ_Y)² = r².

• Λ = [ σ_X²  Cov[X, Y] ; Cov[X, Y]  σ_Y² ] = [ σ_X²  ρσ_Xσ_Y ; ρσ_Xσ_Y  σ_Y² ].
• The following are equivalent:
(a) ρ = 0
(b) Cov [X, Y ] = 0
(c) X and Y are independent.
• |ρ| = 1 if and only if (X − EX) = k(Y − EY) for some constant k. In this case,

◦ ρ = k/|k| = sign(k)

◦ |k| = σ_X/σ_Y

• Suppose f_{X,Y}(x, y) depends only on √(x² + y²) and X and Y are independent. Then X and Y are normal with zero mean and equal variance.

• X|Y ∼ N( ρ(σ_X/σ_Y)(Y − m_Y) + m_X , σ_X²(1 − ρ²) ) and Y|X ∼ N( ρ(σ_Y/σ_X)(X − m_X) + m_Y , σ_Y²(1 − ρ²) ).
[9, eq. (7.22) p. 309, eq. (7.23) p. 311, eq. (7.26) p. 313]. This is the joint density of U, V where U, V ∼ N(0, 1) with Cov[U, V] = E[UV] = ρ.

◦ The general bivariate Gaussian pair is obtained from the transformation

[X; Y] = [σ_X U + m_X; σ_Y V + m_Y] = [σ_X 0; 0 σ_Y][U; V] + [m_X; m_Y].

◦ f_{U|V}(u|v) is N(ρv, 1 − ρ²). In other words, U|V ∼ N(ρV, 1 − ρ²).
14.4 (Conditional Gaussian).

(a) Suppose (X, Y) are jointly Gaussian; that is, [X; Y] ∼ N( [µ_X; µ_Y], [Λ_X Λ_XY; Λ_YX Λ_Y] ). Then f_{X|Y}(x|y) is N(E[X|y], Λ_{X|y}), where E[X|y] = µ_X + Λ_XY Λ_Y^{−1}(y − µ_Y) and Λ_{X|y} = Λ_X − Λ_XY Λ_Y^{−1} Λ_YX.

• Note the direction of the formula for Λ_{X|y}:

Λ_X → Λ_XY
          ↓
Λ_YX ← Λ_Y
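A small numerical sketch of 14.4(a) (assuming numpy; the block matrices and the observed y below are made-up numbers):

```python
import numpy as np

# Conditional mean and covariance of X given Y = y for jointly Gaussian (X, Y).
Lam_X  = np.array([[2.0, 0.3], [0.3, 1.0]])
Lam_XY = np.array([[0.5], [0.2]])
Lam_Y  = np.array([[1.5]])
mu_X, mu_Y = np.array([1.0, 0.0]), np.array([2.0])

y = np.array([2.5])
E_X_given_y   = mu_X + Lam_XY @ np.linalg.solve(Lam_Y, y - mu_Y)
Lam_X_given_y = Lam_X - Lam_XY @ np.linalg.solve(Lam_Y, Lam_XY.T)

print(E_X_given_y)       # E[X | Y = y]
print(Lam_X_given_y)     # Lam_{X|y}; note it does not depend on y
```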
• If D is a subspace and Z ∈ D, then (38) is automatically satisfied.
• (39) says that the vector Θ − Z is orthogonal to all vectors in D.
Example 15.2. Suppose Θ and N are independent Poisson random variables with respective parameters λ and µ. Let Y = Θ + N.

• Y is P(λ + µ).

• Conditioned on Y = y, Θ is B(y, λ/(λ + µ)).

• Θ̂_MMSE(Y) = E[Θ|Y] = (λ/(λ + µ)) Y.

• Var[Θ|Y] = Y λµ/(λ + µ)², and MSE = E[Var[Θ|Y]] = λµ/(λ + µ) < Var Y = λ + µ.
(c) The vector linear (affine) MMSE estimator is given by

Θ̂_LMMSE(Y) = EΘ + Σ_ΘY Σ_Y^{−1} (Y − EY)

and

MMSE = Cov[Θ − Θ̂_LMMSE(Y)] = Σ_Θ − Σ_ΘY Σ_Y^{−1} Σ_YΘ.

In this case,

E‖Θ̃ − BỸ‖² = E‖Θ̃ − AỸ‖² + E‖(A − B)Ỹ‖².
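A sketch of the affine MMSE formulas above using sample moments (assuming numpy; the linear observation model is a made-up example, not one from the text): the empirical error of the estimator matches the trace of Σ_Θ − Σ_ΘY Σ_Y^{−1} Σ_YΘ.

```python
import numpy as np

# Toy model: 2-D Theta, 3-D observation Y = H Theta + noise.
rng = np.random.default_rng(5)
n = 200_000
Theta = rng.standard_normal((n, 2)) @ np.array([[1.0, 0.4], [0.0, 0.8]])
H = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
Y = Theta @ H.T + 0.5 * rng.standard_normal((n, 3))

Sig_Y  = np.cov(Y, rowvar=False)
Sig_TY = np.cov(np.hstack([Theta, Y]), rowvar=False)[:2, 2:]    # cross-covariance
A = Sig_TY @ np.linalg.inv(Sig_Y)                               # Sig_ThetaY Sig_Y^-1
Theta_hat = Theta.mean(axis=0) + (Y - Y.mean(axis=0)) @ A.T     # affine MMSE estimate

print(np.mean(np.sum((Theta - Theta_hat) ** 2, axis=1)))        # empirical MSE
print(np.trace(np.cov(Theta, rowvar=False) - A @ Sig_TY.T))     # trace of the MMSE matrix
```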
A Math Review
A.1 Inequalities
A.1. By definition,

• Σ_{n=1}^∞ a_n = Σ_{n∈N} a_n = lim_{N→∞} Σ_{n=1}^N a_n, and

• Π_{n=1}^∞ a_n = Π_{n∈N} a_n = lim_{N→∞} Π_{n=1}^N a_n.
[Figure: ln(1 + x) compared with x, x − x², and x/(x + 1) for −0.8 ≤ x ≤ 0.8.]
0 ≤ Σ_{k=1}^{r_n} x_{n,k}² ≤ a_n Σ_{k=1}^{r_n} |x_{n,k}| → 0 · x₀ = 0.

On the other hand, suppose we have (ii). Given any ε > 0, by (ii), ∃n₀ such that ∀n ≥ n₀, Σ_{k=1}^{r_n} x_{n,k}² ≤ ε². Hence, for any k, x_{n,k}² ≤ Σ_{k=1}^{r_n} x_{n,k}² ≤ ε², and hence |x_{n,k}| ≤ ε, which implies a_n ≤ ε.

Note that when the x_{n,k} are non-negative, condition (i) already implies that the sum Σ_{k=1}^{r_n} |x_{n,k}| converges as n → ∞. Alternative versions of A.4 are as follows.
(a) Suppose (ii) Σ_{k=1}^{r_n} x_{n,k}² → 0 as n → ∞. Then, as n → ∞ we have

Π_{k=1}^{r_n} (1 + x_{n,k}) → e^x   if and only if   Σ_{k=1}^{r_n} x_{n,k} → x.    (43)

Proof. We already know from A.4 that the RHS of (43) implies the LHS. Also, condition (ii) allows the use of A.3, which implies

Π_{k=1}^{r_n} (1 + x_{n,k}) ≤ e^{Σ_{k=1}^{r_n} x_{n,k}} ≤ e^{Σ_{k=1}^{r_n} x_{n,k}²} Π_{k=1}^{r_n} (1 + x_{n,k}).    (42b)
(b) Suppose the x_{n,k} are nonnegative and (iii) a_n → 0 as n → ∞. Then, as n → ∞ we have

Π_{k=1}^{r_n} (1 − x_{n,k}) → e^{−x}   if and only if   Σ_{k=1}^{r_n} x_{n,k} → x.    (44)

Proof. We already know from A.4 that the RHS of (44) implies the LHS. Also, condition (iii) allows the use of A.3, which implies

Π_{k=1}^{r_n} (1 − x_{n,k}) ≤ e^{−Σ_{k=1}^{r_n} x_{n,k}} ≤ e^{Σ_{k=1}^{r_n} x_{n,k}²} Π_{k=1}^{r_n} (1 − x_{n,k}).    (42c)
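A quick numerical illustration of (44) (plain Python; the triangular array x_{n,k} = λ/n is my choice): the row sums are identically λ and a_n = λ/n → 0, so the products tend to e^{−λ}.

```python
import math

# (44): with r_n = n and x_{n,k} = lam/n, prod_k (1 - x_{n,k}) -> exp(-lam).
lam = 2.0
for n in [10, 100, 10_000]:
    prod = (1 - lam / n) ** n
    print(n, prod, math.exp(-lam))
```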
A.5. Let α_i and β_i be complex numbers with |α_i| ≤ 1 and |β_i| ≤ 1. Then,

| Π_{i=1}^m α_i − Π_{i=1}^m β_i | ≤ Σ_{i=1}^m |α_i − β_i|.
A.2 Summations
A.7. Basic formulas:

(a) Σ_{k=0}^n k = n(n + 1)/2

(b) Σ_{k=0}^n k² = n(n + 1)(2n + 1)/6 = (1/6)(2n³ + 3n² + n)

(c) Σ_{k=0}^n k³ = ( Σ_{k=0}^n k )² = (1/4) n²(n + 1)² = (1/4)(n⁴ + 2n³ + n²)
A.8. Let g(n) = Σ_{k=0}^n h(k), where h is a polynomial of degree d. Then g is a polynomial of degree d + 1; that is, g(n) = Σ_{m=1}^{d+1} a_m n^m.

• To find the coefficients a_m, evaluate g(n) for n = 1, 2, . . . , d + 1, as in the sketch below. Note that the case n = 0 gives a₀ = 0, and hence the sum starts at m = 1.

• k³ = k(k + 1)(k + 2) − 3k(k + 1) + k
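A small sketch of the coefficient-finding bullet above (assuming numpy; the test case h(k) = k², i.e. d = 2, is arbitrary):

```python
import numpy as np

# A.8: h(k) = k^2 (degree d = 2), so g(n) = sum_{k=0}^{n} h(k) has degree 3 and g(0) = 0.
# Solve for (a_1, a_2, a_3) from the values g(1), g(2), g(3).
d = 2
h = lambda k: k ** 2
g = lambda n: sum(h(k) for k in range(n + 1))

ns = np.arange(1, d + 2)                             # n = 1, ..., d+1
A = np.vander(ns, d + 2, increasing=True)[:, 1:]     # columns n, n^2, n^3 (no constant term)
a = np.linalg.solve(A, [g(n) for n in ns])
print(a)    # [1/6, 1/2, 1/3]: g(n) = n/6 + n^2/2 + n^3/3 = n(n+1)(2n+1)/6
```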
(b) Σ_{i=k}^∞ ρ^i = ρ^k/(1 − ρ)

(c) Σ_{i=a}^b ρ^i = (ρ^a − ρ^{b+1})/(1 − ρ)

(d) Σ_{i=0}^∞ i ρ^i = ρ/(1 − ρ)²

(e) Σ_{i=a}^b i ρ^i = ( ρ^{b+1}(bρ − b − 1) − ρ^a(aρ − a − ρ) ) / (1 − ρ)²

(f) Σ_{i=k}^∞ i ρ^i = kρ^k/(1 − ρ) + ρ^{k+1}/(1 − ρ)²

(g) Σ_{i=0}^∞ i² ρ^i = (ρ + ρ²)/(1 − ρ)³

(b) Σ_{j=1}^∞ Σ_{i=j}^∞ f(i, j) = Σ_{i=1}^∞ Σ_{j=1}^i f(i, j) = Σ_{(i,j)} 1[i ≥ j] f(i, j)
• λe^λ + e^λ = 1 + 2λ + 3λ²/2! + 4λ³/3! + · · · = Σ_{k=1}^∞ k λ^{k−1}/(k − 1)!

• Σ_{k=0}^∞ k³ λ^k/k! = λ (d/dλ) ( λ (d/dλ) ( λ (d/dλ) e^λ ) ).

Therefore, the coefficients of the terms in (48) directly become the coefficients in (47).
A.13. The zeta function ζ(s) is defined for any complex number s with Re{s} > 1 by the Dirichlet series ζ(s) = Σ_{n=1}^∞ 1/n^s.
A.14. Abel’s theorem: Let a = (ai : i ∈ N) be any sequence of real or complex numbers
and let ∞
X
Ga (z) = ai z i ,
i=0
P∞
be the power series with coefficients a. Suppose that the series i=0 ai converges. Then,
∞
X
lim Ga (z) = ai . (49)
z→1−
i=0
In the special case where all the coefficientsPai are nonnegative real numbers, then the
above formula (49) holds also when the series ∞ i=0 ai does not converge. I.e. in that case
both sides of the formula equal +∞.
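A numerical illustration of (49) (plain Python; the sequence a_i = (−1)^i/(i + 1), whose sum is ln 2, is my choice): the truncated power series approaches ln 2 as z → 1⁻.

```python
import math

# Abel's theorem with a_i = (-1)^i / (i+1): sum_i a_i = ln 2 and G_a(z) = ln(1+z)/z.
for z in [0.9, 0.99, 0.999999]:
    G = sum((-1) ** i / (i + 1) * z ** i for i in range(5000))   # truncated power series
    print(z, G, math.log(1 + z) / z, math.log(2))
```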
A.3 Calculus
A.3.1 Derivatives
A.15. Basic Formulas

(a) (d/dx) a^u = a^u ln a (du/dx)

(b) (d/dx) log_a u = (log_a e / u) (du/dx), a ≠ 0, 1

For f = g · h,

f^{(n)}(x) = Σ_{k=0}^n (n choose k) g^{(n−k)}(x) h^{(k)}(x).

In fact,

(d^n/dt^n) Π_{i=1}^r f_i(t) = Σ_{n_1+···+n_r=n} ( n!/(n_1! n_2! · · · n_r!) ) Π_{i=1}^r (d^{n_i}/dt^{n_i}) f_i(t).
Definition A.16 (Jacobian). In vector calculus, the Jacobian is shorthand for either the Jacobian matrix or its determinant, the Jacobian determinant. Let g be a function from a subset D of R^n to R^m. If g is differentiable at z ∈ D, then all partial derivatives exist at z, and the Jacobian matrix of g at the point z ∈ D is

dg(z) = [ ∂g_1/∂x_1(z) · · · ∂g_1/∂x_n(z) ; . . . ; ∂g_m/∂x_1(z) · · · ∂g_m/∂x_n(z) ] = ( ∂g/∂x_1(z), . . . , ∂g/∂x_n(z) ).

Alternative notations for the Jacobian matrix are J, ∂(g_1, . . . , g_n)/∂(x_1, . . . , x_n) [7, p 242], and J_g(x), where it is assumed that the Jacobian matrix is evaluated at z = x = (x_1, . . . , x_n).
A.17. When m = 1, we have a scalar function, and we can talk about the gradient vector. The gradient (or gradient vector field) of a scalar function f(x) with respect to a vector variable x = (x_1, . . . , x_n) is

∇_x f(z) = ( ∂f/∂x_1(z), . . . , ∂f/∂x_n(z) )^T = (df(z))^T.
(a) The RHS of the linear approximation (50) characterizes the tangent hyperplane
L(x) = dg(z)(x − z) + g(z)
at the point z. The level surface that passes through the point z is given by
{x : g(x) = g(z)} .
Using the linear approximation (50), we see that around the point z, the point on
the level surface must satisfy
dg(z)(x − z) ≈ 0.
Hence, the tangent plane at z for the level surface is given by
{x : dg(z)(x − z) = 0} .
Note also that the gradient vector ∇g(z) = (dg(z))T is perpendicular to the tangent
plane through z: for any two points x^{(1)}, x^{(2)} on the tangent plane through z,
dg(z)(x(2) − x(1) ) = 0.
Hence, the gradient vector at z is perpendicular to the level surface at z.
(b) When n = 2, given a point P = (x₀, y₀), we have a tangent plane

L(x, y) = (∂g/∂x)(P)(x − x₀) + (∂g/∂y)(P)(y − y₀) + g(P)

through P. The level curve that passes through the point P is given by

{(x, y) : g(x, y) = g(P)}.

Approximating g by L, we find that the tangent line at P for the level curve is given by

{(x, y) : (∂g/∂x)(P)(x − x₀) + (∂g/∂y)(P)(y − y₀) = 0}.

This line is perpendicular to the gradient vector.
(c) When n = 3, given a point P = (x₀, y₀, z₀), we think about the level surface S = {(x, y, z) : g(x, y, z) = c}, where c = g(P). The tangent plane at the point P on the level surface S is given by

{(x, y, z) : (∂g/∂x)(P)(x − x₀) + (∂g/∂y)(P)(y − y₀) + (∂g/∂z)(P)(z − z₀) = 0}.
This plane is perpendicular to the gradient vector.
A.18. For the gradient, if the argument is a row vector, then

∇_θ f^T(θ) = [ ∂f_1/∂θ_1  ∂f_2/∂θ_1  · · ·  ∂f_m/∂θ_1 ; . . . ; ∂f_1/∂θ_n  ∂f_2/∂θ_n  · · ·  ∂f_m/∂θ_n ].
Definition A.19. Given a scalar-valued function f, the Hessian matrix is the square matrix of second partial derivatives

∇²_θ f(θ) = ∇_θ(∇_θ f(θ))^T = [ ∂²f/∂θ_1²  · · ·  ∂²f/(∂θ_1∂θ_n) ; . . . ; ∂²f/(∂θ_n∂θ_1)  · · ·  ∂²f/∂θ_n² ].
(d/dt) f(x(t)) = df(x(t)) (d/dt)x(t) (matrix multiplication).
• In particular,

(d/dt) g(x(t), y(t), z(t)) = (∂g/∂x)(x, y, z) (d/dt)x(t) + (∂g/∂y)(x, y, z) (d/dt)y(t) + (∂g/∂z)(x, y, z) (d/dt)z(t),

evaluated at (x, y, z) = (x(t), y(t), z(t)).
(a) V = f (U ) is an open neighborhood of f (c).
(b) f |U : U → V is bijective.
(c) g = (f |U )−1 : V → U is C 1 .
A.3.2 Integration
A.26. Basic Formulas
(a) ∫ a^u du = a^u / ln a, a > 0, a ≠ 1.

(b) ∫_0^1 t^α dt = { 1/(α + 1), α > −1;  ∞, α ≤ −1 }  and  ∫_1^∞ t^α dt = { −1/(α + 1), α < −1;  ∞, α ≥ −1 }.

So, the integration of the function 1/t is the test case. In fact, ∫_0^1 (1/t) dt = ∫_1^∞ (1/t) dt = ∞.

(c) ∫ x^m ln x dx = { (x^{m+1}/(m + 1)) ( ln x − 1/(m + 1) ), m ≠ −1;  (1/2) ln² x, m = −1 }.
A.27. Integration by Parts:

∫ u dv = uv − ∫ v du
[Figure: plots of t^α over 0 ≤ t ≤ 5 for α = 4, 2, 1.5, 0.5, −0.5, −1 (cf. A.26(b)).]

With G_n denoting the n-th repeated antiderivative of g (so G₁′ = g and G_{n+1}′ = G_n),

∫ f(x) g(x) dx = f(x)G₁(x) − f′(x)G₂(x) + f″(x)G₃(x) − · · ·

To see this, note that ∫ f(x) g(x) dx = f(x)G₁(x) − ∫ f′(x)G₁(x) dx, and

∫ f^{(n)}(x) G_n(x) dx = f^{(n)}(x) G_{n+1}(x) − ∫ f^{(n+1)}(x) G_{n+1}(x) dx.

Applying this with f(x) = x^n and g(x) = e^{ax}, so that G_k(x) = e^{ax}/a^k, gives the formula in A.28 below.

Figure 27: Examples of integration by parts using Figure 26: ∫ x² e^{3x} dx = e^{3x}( x²/3 − 2x/9 + 2/27 ) and ∫ e^x sin x dx = (1/2) e^x (sin x − cos x).

(c) ∫ x^n e^{ax} dx = (x^n e^{ax})/a − (n/a) ∫ x^{n−1} e^{ax} dx (integration by parts).

(d) ∫ f(x) g′(x) dx = f(x) g(x) − ∫ f′(x) g(x) dx.

To see this, start with the product rule (f(x)g(x))′ = f(x)g′(x) + f′(x)g(x), and then integrate both sides.

(e) ∫_a^b f(x) g′(x) dx = f(x)g(x)|_a^b − ∫_a^b f′(x) g(x) dx.
A.28. If n is a positive integer,

∫ x^n e^{ax} dx = (e^{ax}/a) Σ_{k=0}^n ( (−1)^k n! / (a^k (n − k)!) ) x^{n−k}.

(a) n = 1: (e^{ax}/a)( x − 1/a )

(b) n = 2: (e^{ax}/a)( x² − (2/a)x + 2/a² )

(c) ∫_0^t x^n e^{ax} dx = (e^{at}/a) Σ_{k=0}^n ( (−1)^k n! / (a^k (n − k)!) ) t^{n−k} − (−1)^n n!/a^{n+1} = (e^{at}/a) Σ_{k=0}^n ( (−1)^k n! / (a^k (n − k)!) ) t^{n−k} + n!/(−a)^{n+1}

(d) ∫_t^∞ x^n e^{−ax} dx = (e^{−at}/a) Σ_{k=0}^n ( n! / (a^k (n − k)!) ) t^{n−k} = (e^{−at}/a) Σ_{j=0}^n ( n! / (a^{n−j} j!) ) t^j

(e) ∫_0^∞ x^n e^{ax} dx = n!/(−a)^{n+1}, a < 0. (See also the Gamma function.)

• n! = ∫_0^∞ e^{−t} t^n dt.

• In MATLAB, consider using gamma(n+1) instead of factorial(n). Note also that gamma() allows vector input.

(f) ∫_0^1 x^β e^{−x} dx is finite if and only if β > −1.

Note that (1/e) ∫_0^1 x^β dx ≤ ∫_0^1 x^β e^{−x} dx ≤ ∫_0^1 x^β dx.

(g) ∀β ∈ R, ∫_1^∞ x^β e^{−x} dx < ∞.

For β ≤ 0, ∫_1^∞ x^β e^{−x} dx ≤ ∫_1^∞ e^{−x} dx < ∫_0^∞ e^{−x} dx = 1.

For β > 0, ∫_1^∞ x^β e^{−x} dx ≤ ∫_1^∞ x^{⌈β⌉} e^{−x} dx ≤ ∫_0^∞ x^{⌈β⌉} e^{−x} dx = ⌈β⌉!

(h) ∫_0^∞ x^β e^{−x} dx is finite if and only if β > −1.
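A quick numerical check of A.28(e) (a sketch assuming scipy is available; the test values of n and a are arbitrary):

```python
import math
from scipy.integrate import quad

# A.28(e): for a < 0, the integral of x^n e^(a x) over [0, inf) equals n! / (-a)^(n+1).
for n, a in [(3, -2.0), (5, -0.7)]:
    numeric, _ = quad(lambda x: x ** n * math.exp(a * x), 0, math.inf)
    closed = math.factorial(n) / (-a) ** (n + 1)
    print(n, a, numeric, closed)
```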
For f(x) = ∫_{a(x)}^{b(x)} g(x, y) dy,

f′(x) = b′(x) g(x, b(x)) − a′(x) g(x, a(x)) + ∫_{a(x)}^{b(x)} (∂g/∂x)(x, y) dy.    (52)

In particular, we have

(d/dx) ∫_a^x f(t) dt = f(x),    (53)

(d/dx) ∫_a^{v(x)} f(t) dt = (dv/dx) (d/dv) ∫_a^{v(x)} f(t) dt = f(v(x)) v′(x),    (54)

(d/dx) ∫_{u(x)}^{v(x)} f(t) dt = (d/dx) ( ∫_a^{v(x)} f(t) dt − ∫_a^{u(x)} f(t) dt ) = f(v(x)) v′(x) − f(u(x)) u′(x).    (55)

Note that (52) can be derived from (A.22) by considering f(x) = h(a(x), b(x), x), where h(a, b, c) = ∫_a^b g(c, y) dy. [9, p 318–319]
(c) 0! = 1.

(d) Γ(1/2) = √π.

(e) Γ(x + 1) = xΓ(x) (integration by parts).

• This relationship is used to define the gamma function for negative numbers.

(f) Γ(q)/α^q = ∫_0^∞ x^{q−1} e^{−αx} dx, α > 0.
[Figure: plot for −6 ≤ x ≤ 6 (vertical range −20 to 20); presumably the gamma function Γ(x), cf. the preceding item.]
For x = 1, the incomplete beta function coincides with the (complete) beta function. The regularized incomplete beta function (or regularized beta function for short) is defined in terms of the incomplete beta function and the (complete) beta function:

I_x(a, b) = B(x; a, b) / B(a, b).

• For integers m, k,

I_x(m, k) = Σ_{j=m}^{m+k−1} ( (m + k − 1)! / (j! (m + k − 1 − j)!) ) x^j (1 − x)^{m+k−1−j}.

• I_0(a, b) = 0, I_1(a, b) = 1

• I_x(a, b) = 1 − I_{1−x}(b, a)
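A numerical check of the integer-order identity above (a sketch assuming scipy; the values m, k, x are arbitrary):

```python
from math import comb
from scipy.special import betainc

# I_x(m, k) as a binomial tail sum; scipy.special.betainc(a, b, x) is the regularized
# incomplete beta function I_x(a, b).
m, k, x = 3, 4, 0.35
n = m + k - 1
tail = sum(comb(n, j) * x ** j * (1 - x) ** (n - j) for j in range(m, n + 1))
print(tail, betainc(m, k, x))
```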
References
[1] Patrick Billingsley. Probability and Measure. John Wiley & Sons, New York, 1995.
5.30, 2b
[2] George Casella and Roger L. Berger. Statistical Inference. Duxbury Press, 2001. 4.43,
6.9, 11, 8.13, 11.11, 11.4, 1, 6, 11.21
[3] Donald G. Childers. Probability And Random Processes Using MATLAB. McGraw-
Hill, 1997. 1.3, 1.3
[5] W. Feller. An Introduction to Probability Theory and Its Applications, volume 2. John
Wiley & Sons, 1971. 4.37
[6] William Feller. An Introduction to Probability Theory and Its Applications, Volume 1.
Wiley, 3 edition, 1968.
[7] Terrence L. Fine. Probability and Probabilistic Reasoning for Electrical Engineering.
Prentice Hall, 2005. 3.1, 3.10, 3.14, 4.18, 4.24, 5.27, 9.1, 9.3, 11.10, 15.2, A.16
[8] B. V. Gnedenko. Theory of probability. Chelsea Pub. Co., New York, 4 edition, 1967.
Translated from the Russian by B.D. Seckler. 5.30
[9] John A. Gubner. Probability and Random Processes for Electrical and Computer En-
gineers. Cambridge University Press, 2006. 1.2, 4.31, 4.34, 4.36, 4, 6.45, 7.9, 9.9, 10.1,
2, 3, 6, 11.24, 12.22, 12.23, 14.3, 3, A.6, A.13, A.29
[10] M.O. Jones and R.F. Serfozo. Poisson limits of sums of point processes and a particle-
survivor model. Ann. Appl. Probab, 17(1):265–283, 2007. 12.10
[11] Samuel Karlin and Howard E. Taylor. A First Course in Stochastic Processes. Aca-
demic Press, 1975. 4.6, 9.10, 10.1
[14] Nabendu Pal, Chun Jin, and Wooi K. Lim. Handbook of Exponential and Related
Distributions for Engineers and Scientists. Chapman & Hall/CRC, 2005. 6.25
[18] Kenneth H. Rosen, editor. Handbook of Discrete and Combinatorial Mathematics.
CRC, 1999. 1, 4, 5, 6, 7, 8
[20] Gábor J. Székely. Paradoxes in Probability Theory and Mathematical Statistics. 1986.
8.2
[21] George B. Thomas, Maurice D. Weir, Joel Hass, and Frank R. Giordano. Thomas’
Calculus. Addison Wesley, 2004. A.16
[22] Tung. Fundamental of Probability and Statistics for Reliability Analysis. 2005. 17, 20
Index
Bayes Theorem, 22
Binomial theorem, 12
Birthday Paradox, 24
Event Algebra, 25
gradient, 134
gradient vector field, 134
Hölder's Inequality, 80
Hessian matrix, 136
Markov's Inequality, 78
Minkowski's Inequality, 81
Monte Hall's Game, 24
multinomial coefficient, 13
Multinomial Counting, 13
Multinomial Theorem, 15
Standard Deviation, 74
Total Probability Theorem, 22