
LECTURE NOTES

STATISTICS 420: PROBABILITY
Fall 2002
Robert J. Boik
Department of Mathematical Sciences
Montana State University Bozeman
Revised August 30, 2004
Contents
0 Course Information & Syllabus 5
1 Probability 7
1.1 Sample Spaces and Events . . . . . . . . . . . . . . . . . . . . . . 7
1.2 Algebra of Events . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Experiments with Symmetries . . . . . . . . . . . . . . . . . . . 9
1.4 Composition of Experiments: Counting Rules . . . . . . . . . 11
1.5 Sampling at Random . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.6 Binomial & Multinomial Coefficients . . . . . . . . . . . 15
1.7 Discrete Probability Distributions . . . . . . . . . . . . . . . . . 17
1.8 Subjective Probability . . . . . . . . . . . . . . . . . . . . . . . . 19
2 Discrete Random Variables 23
2.1 Probability Functions . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2 Joint Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3 Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4 Bayes' Theorem (Law of Inverse Probability) . . . . . . . . . . 30
2.5 Statistical Independence of Random Variables . . . . . . . . . 30
2.6 Exchangeability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.7 Application: Probability of Winning in Craps . . . . . . . . . . 34
3 Expectations of Discrete Random Variables 37
3.1 The Mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2 Expectation of a Function . . . . . . . . . . . . . . . . . . . . . . 40
3.3 Variability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4 Covariance and Correlation . . . . . . . . . . . . . . . . . . . . . 45
3.5 Sums of Random Variables . . . . . . . . . . . . . . . . . . . . . 47
3.6 Probability Generating Functions . . . . . . . . . . . . . . . . . 50
4 Bernoulli and Related Random Variables 57
4.1 Sampling Bernoulli Populations . . . . . . . . . . . . . . . . . . 57
4.2 Binomial Distribution . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.3 Hypergeometric Distribution . . . . . . . . . . . . . . . . . . . . 60
4.4 Inverse Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.5 Approximating Binomial Probabilities . . . . . . . . . . . . . . 64
4.6 Poisson Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.7 Law of Large Numbers . . . . . . . . . . . . . . . . . . . . . . . . 70
4.8 Multinomial Distributions . . . . . . . . . . . . . . . . . . . . . . 70
4.9 Using Probability Generating Functions . . . . . . . . . . . . . 72
5 Continuous Random Variables 75
5.1 Cumulative Distribution Function (CDF) . . . . . . . . . . . . 75
5.2 Density and the Probability Element . . . . . . . . . . . . . . . 79
5.3 The Median and Other Percentiles . . . . . . . . . . . . . . . . 84
5.4 Expected Value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.5 Expected Value of a Function . . . . . . . . . . . . . . . . . . . . 85
5.6 Average Deviations . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.7 Bivariate Distributions . . . . . . . . . . . . . . . . . . . . . . . . 90
5.8 Several Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.9 Covariance and Correlation . . . . . . . . . . . . . . . . . . . . . 93
5.10 Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.11 Conditional Distributions . . . . . . . . . . . . . . . . . . . . . . 97
5.12 Moment Generating Functions . . . . . . . . . . . . . . . . . . . 100
6 Families of Continuous Distributions 103
6.1 Normal Distributions . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.2 Exponential Distributions . . . . . . . . . . . . . . . . . . . . . . 109
6.3 Gamma Distributions . . . . . . . . . . . . . . . . . . . . . . . . . 112
6.4 Chi Squared Distributions . . . . . . . . . . . . . . . . . . . . . . 114
6.5 Distributions for Reliability . . . . . . . . . . . . . . . . . . . . . 115
6.6 t, F, and Beta Distributions . . . . . . . . . . . . . . . . . . . . . 118
7 Appendix 1: Practice Exams 121
7.1 Exam 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
7.2 Exam 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
7.3 Exam 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
8 Appendix 2: Selected Equations 127
Chapter 0
Course Information & Syllabus
Course Information
Prerequisite: Math 224
Required Texts
Berry, D. A. & Lindgren, B. W. (1996). Statistics: Theory and Methods,
2nd edition. Belmont, CA: Wadsworth Publishing Co.
Lecture Notes
Instructor
Robert J. Boik, 2260 Wilson, 994-5339, rjboik@math.montana.edu.
Office Hours: Monday 1:10-2:00; Wednesday 9:00-9:50 & 1:10-2:00;
Friday 9:00-9:50.
Holidays: Friday October 4 (Exam Exchange); Friday November 8 (Exam
Exchange); Monday Nov 11 (Veterans Day); Friday Nov 29 (Thanksgiving)
Drop dates: Mon September 23 is the last day to drop without a grade; Tues
November 19 is the last day to drop.
Grading: 600 Points Total; Grade cutoffs (percentages) 90, 80, 70, 60; All
exams are closed book. Equation sheets will be provided.
Homework: 200 points
Exam-1, Thursday Oct 3, 6:00-8:00 PM: 100 points
Exam-2, Thursday November 7, 6:00-8:00 PM: 100 points
Comprehensive Final, Friday Dec 20, 8:00-9:50 AM: 200 points
Homepage: http://www.math.montana.edu/rjboik/classes/420/stat.420.html
Syllabus
1. Probability: Chapter 1
2. Discrete Random Variables: Chapter 2
3. Expectations: Chapter 3
4. Bernoulli and Related Random Variables: Chapter 4
5. Continuous Random Variables: Chapter 5
6. Families of Continuous Distributions: Chapter 6
Chapter 1
Probability
Probability Model: three components are (a) the sample space, (b) events, and (c)
probabilities of events.
1.1 Sample Spaces and Events
1. Experiment (experiment of chance, not necessarily a designed experiment): A
process whose outcomes are uncertain.
2. Sample space: Ω is a set that contains all possible outcomes from the
experiment. In some cases, Ω may contain outcomes that are not possible.
(a) The number of outcomes in the sample space can be finite or infinite.
(b) Infinite sample spaces can be countable or uncountable. A sample space
is countable if the outcomes can be associated with the integers 1, 2, . . ..
3. Event: A subset of the sample space.
(a) A simple event contains only one outcome. A simple event is denoted by ω.
(b) A compound event contains two or more outcomes. Compound events
are denoted by capital letters.
1.2 Algebra of Events
1. Relationships and Definitions
(a) Inclusion: E ⊂ F means ω ∈ E ⟹ ω ∈ F, where ω is an outcome in Ω.
(b) Equality: E = F means that E ⊂ F and F ⊂ E.
(c) Complement: E^c is the set {ω; ω ∈ Ω, ω ∉ E}.
(d) Note: We will not use the notation E ⊆ F. Accordingly, if E ⊂ F, then
either E = F or E ≠ F could be true.
(e) Venn diagrams can be used to display sets.
(f) Empty Set: ∅ is the empty set (i.e., the set that has no members). Note
that ∅^c = Ω and that Ω^c = ∅.
2. Operations on Sets
(a) Union: A ∪ B is the set of all outcomes that are in A and/or in B.
(b) Intersection: AB is the set of all outcomes that are in A and in B. I
prefer the notation A ∩ B to denote set intersections.
(c) Disjoint: A and B are disjoint (mutually exclusive) if A ∩ B is empty;
i.e., A ∩ B = ∅.
(d) Commutative Operations: A ∪ B = B ∪ A and A ∩ B = B ∩ A.
(e) Associative Operations: (A ∪ B) ∪ C = A ∪ (B ∪ C) and
(A ∩ B) ∩ C = A ∩ (B ∩ C).
(f) Distributive Operations: A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C) and
A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C).
(g) DeMorgan's laws:
i. (∪_{i=1}^k A_i)^c = ∩_{i=1}^k A_i^c.
Proof: show that (∪_{i=1}^k A_i)^c ⊂ ∩_{i=1}^k A_i^c and that
∩_{i=1}^k A_i^c ⊂ (∪_{i=1}^k A_i)^c.
Part 1: Let ω be an outcome in Ω. Then,
ω ∈ (∪_{i=1}^k A_i)^c ⟹ ω ∉ ∪_{i=1}^k A_i ⟹ ω ∉ A_i for each i = 1, . . . , k
⟹ ω ∈ A_i^c for each i = 1, . . . , k ⟹ ω ∈ ∩_{i=1}^k A_i^c.
Therefore (∪_{i=1}^k A_i)^c ⊂ ∩_{i=1}^k A_i^c.
Part 2: Let ω be an outcome in Ω. Then,
ω ∈ ∩_{i=1}^k A_i^c ⟹ ω ∈ A_i^c for each i = 1, . . . , k ⟹ ω ∉ A_i for each i = 1, . . . , k
⟹ ω ∉ ∪_{i=1}^k A_i ⟹ ω ∈ (∪_{i=1}^k A_i)^c.
Therefore ∩_{i=1}^k A_i^c ⊂ (∪_{i=1}^k A_i)^c.
ii. (∩_{i=1}^k A_i)^c = ∪_{i=1}^k A_i^c.
(h) Partitions of Events: The collection E_1, E_2, . . . , E_k is said to be a
partition of F if the events E_1, E_2, . . . , E_k satisfy
E_i ∩ E_j = ∅ for i ≠ j (i.e., mutually disjoint events), and ∪_{i=1}^k E_i = F.
The number of events in the partition, k, need not be finite.
1.3 Experiments with Symmetries
1. Equally Likely Outcomes
(a) If the outcomes in a finite sample space, Ω = {ω_1, . . . , ω_N}, are equally
likely, then each has probability 1/N. That is, P(ω_i) = 1/N for
i = 1, . . . , N.
(b) Definition: If an object is drawn at random from a finite population of
N objects, then the objects are equally likely to be selected.
(c) If an event E consists of a subset of k outcomes in a sample space of N
equally likely outcomes, then P(E) = k/N.
(d) Definition: Odds
i. If event E has probability P(E) and event F has probability P(F), then
Odds of E to F = P(E)/P(F),
provided that P(F) > 0. Note that odds is not a probability. Rather,
odds is a ratio of probabilities.
ii. The odds of a single event, E, is defined as the odds of E to E^c.
That is,
Odds of E = P(E)/P(E^c) = P(E)/[1 - P(E)],
provided that P(E) < 1.
2. Axioms of Probability
(a) 0 ≤ Pr(A) ≤ 1 for any event A.
(b) Pr(Ω) = 1.
(c) Additivity: If A_1 and A_2 are disjoint events, then
P(A_1 ∪ A_2) = P(A_1) + P(A_2).
3. Relative frequency interpretation of probability: The relative frequency of an
event is the proportion of times that the event has occurred in n independent
and identical replications of the experiment. The probability of an event is the
limiting relative frequency of that event as n → ∞. Specifically, let E be an
event and let f_n(E) be the number of times that E occurred in n independent
and identical replications of the experiment. Then
Pr(E) = lim_{n→∞} f_n(E)/n.
Example: The Game of Craps. Roll a pair of dice. If the sum of the dice is 7
or 11, then the player wins and the game is over. If the sum of the dice is 2, 3,
or 12, then the player loses and the game is over. If the sum of the dice is
anything else, then the sum is called the "point" and the game continues. In
this case, the player repeatedly rolls the pair of dice until the sum is either 7
or equal to the point. If a 7 occurs first, then the player loses. If the point
occurs first, then the player wins.
Below is a plot of the relative frequency of rolling a 7 or 11 on
the first roll. To generate the plot, a pair of dice was rolled 1,000,000 times.
Out of all these rolls, a 7 or 11 was rolled 222,531 times. If the dice are fair
(i.e., one pip, two pips, . . ., six pips are equally likely), then it can be shown
that
Pr(7 or 11 on first roll) = 8/36 = 0.22222 . . . .
[Figure: Relative Frequency of 7 or 11 in Craps. Relative frequency (0 to 0.5) is
plotted against the number of experiments on a log scale from 10^0 to 10^6.]
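A plot like the one above can be reproduced with a few lines of code. The sketch
below is not the original program (which is not included in these notes); it is a
minimal Python version that rolls a fair pair of dice repeatedly and prints the
relative frequency of a 7 or 11 at several sample sizes.

  import random

  random.seed(1)                      # seed chosen only for reproducibility
  n_rolls = 1_000_000
  count = 0                           # rolls whose sum is 7 or 11
  for i in range(1, n_rolls + 1):
      total = random.randint(1, 6) + random.randint(1, 6)
      if total in (7, 11):
          count += 1
      if i in (10, 100, 1_000, 10_000, 100_000, 1_000_000):
          print(i, count / i)         # relative frequency so far
  print("theoretical value:", 8 / 36)

As n grows, the printed relative frequencies settle near 8/36, illustrating the
limiting relative frequency interpretation of probability.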
1.4 Composition of Experiments: Counting Rules
1. Computing probabilities when the sample space is countable: Let
Ω = {ω_1, ω_2, . . .}, where ω_i is a simple event (i.e., an outcome). Suppose that
A is a compound event.
(a) General formula: From additivity (axiom iii), it follows that
Pr(A) = ∑_{ω_i ∈ A} Pr(ω_i).
(b) Special case: Symmetry. Suppose that Ω contains N outcomes and the
outcomes are equally likely. This situation occurs if the experiment
consists of selecting one outcome from Ω at random. If event A consists
of k out of the N outcomes in Ω, then Pr(A) = k/N.
2. Counting Rules: These rules are useful for computing probabilities when
outcomes are equally likely.
(a) Multiplication Rule. A composite experiment, ℰ, is a sequence of
sub-experiments, ℰ_1, . . . , ℰ_k, performed successively or simultaneously.
Denote the sample space for sub-experiment ℰ_i by Ω_i and suppose that
the number of outcomes in Ω_i is n_i. Then, the number of outcomes in Ω is
N = ∏_{i=1}^k n_i.
If N is not too large, then the N outcomes can be displayed with a tree
diagram.
Result: If the n_1 outcomes in Ω_1 are equally likely, the n_2 outcomes in
Ω_2 are equally likely, . . ., and the n_k outcomes in Ω_k are equally likely,
then the N possible outcomes are equally likely. We will prove this
important result later.
(b) Permutation Rule. A permutation is an ordered sequence of n items
taken from N ≥ n distinct items. The number of permutations is
(N)_n = N(N - 1)(N - 2) · · · (N - n + 1) = N!/(N - n)!.
Result: If we select n items from a population of N distinct items one at
a time at random and without replacement, then the (N)_n possible
sample sequences are equally likely. We will prove this important result
later.
(c) Distinct Sequences of Non-Distinct Items. Suppose that in a set of N
items, m_1 are of type 1, m_2 are of type 2, . . . , m_k are of type k, where
∑_{j=1}^k m_j = N. Then the number of distinct sequences of the N items is
\binom{N}{m_1, m_2, . . . , m_k} = N! / ∏_{j=1}^k m_j!.
Special case: k = 2. For this special case, N = m_1 + m_2 and the
notation is usually simplified from \binom{N}{m_1, m_2} to \binom{N}{m_1} or \binom{N}{m_2}.
Regardless of the notation, the equation remains the same:
\binom{N}{m_1} = \binom{N}{m_2} = N!/[m_1!(N - m_1)!] = N!/[m_2!(N - m_2)!] = N!/(m_1! m_2!).
(d) Combination Rule. An unordered subset of n items taken without
replacement from N ≥ n distinct items is called a combination. The
number of combinations of size n from N distinct items is
\binom{N}{n} = N!/[n!(N - n)!].
Result: If we select n items from a population of N distinct items at
random, one at a time and without replacement, then the \binom{N}{n} possible
combinations are equally likely. We will prove this important result later.
3. Examples
Straight in 5-card poker. Five cards are dealt at random from a standard
deck of 52 cards. A straight is 5 cards that are in consecutive order. The
suits of the cards are not relevant. An ace can be high or low. Thus,
there are 10 types of straight (5 high to ace high). Each straight can
occur in 4^5 ways. Accordingly,
Pr(straight) = 10 × 4^5 / \binom{52}{5} = 10,240/2,598,960 = 0.00394 ≈ 1/254.
Full house in poker. Five cards are dealt at random from a standard deck
of 52 cards. A full house is three of a kind plus two of a kind. There are
\binom{13}{1} = 13 ways to choose a denomination (i.e., ace, 2, . . ., king). Given a
denomination, there are \binom{4}{3} = 4 ways to choose three cards from four.
There are \binom{12}{1} = 12 ways to choose a second denomination and \binom{4}{2} = 6
ways to choose two cards from four. Accordingly,
Pr(full house) = \binom{13}{1}\binom{4}{3}\binom{12}{1}\binom{4}{2} / \binom{52}{5} = 3,744/2,598,960 ≈ 0.00144.
Birthday Problem. Assume that birthdays are uniformly distributed over
365 days (this is not quite true). That is, assume that if a person is
chosen at random, then the probability that the person's birthday is any
specific date (say October 28) is 1/365. Given a sample of k individuals,
find the probability that two or more share the same birthday. Solution:
If k ≤ 365, then
Pr(one or more birthday matches) = 1 - Pr(no matches)
= 1 - (365)_k / 365^k = 1 - 365!/[(365 - k)! 365^k]
≈ 1 - e^{-k} [365/(365 - k)]^{365 - k + .5}.
Otherwise, if k > 365, then the probability of a match is 1. The
approximation is based on Stirling's formula,
n! ≈ √(2π) n^{n + .5} e^{-n}.
Exact probabilities, based on the above assumptions, are displayed below.
k Prob k Prob k Prob
1 0.0000 21 0.4437 41 0.9032
2 0.0027 22 0.4757 42 0.9140
3 0.0082 23 0.5073 43 0.9239
4 0.0164 24 0.5383 44 0.9329
5 0.0271 25 0.5687 45 0.9410
6 0.0405 26 0.5982 46 0.9483
7 0.0562 27 0.6269 47 0.9548
8 0.0743 28 0.6545 48 0.9606
9 0.0946 29 0.6810 49 0.9658
10 0.1169 30 0.7063 50 0.9704
11 0.1411 31 0.7305 51 0.9744
12 0.1670 32 0.7533 52 0.9780
13 0.1944 33 0.7750 53 0.9811
14 0.2231 34 0.7953 54 0.9839
15 0.2529 35 0.8144 55 0.9863
16 0.2836 36 0.8322 56 0.9883
17 0.3150 37 0.8487 57 0.9901
18 0.3469 38 0.8641 58 0.9917
19 0.3791 39 0.8782 59 0.9930
20 0.4114 40 0.8912 60 0.9941
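The exact probabilities in the table, and the quality of the Stirling-based
approximation, can be checked directly. The following Python sketch is an
illustration only; it is not part of the original notes.

  import math

  def p_match_exact(k):
      """Exact P(at least one shared birthday) among k people, 365 equally likely days."""
      if k > 365:
          return 1.0
      p_no_match = 1.0
      for j in range(k):                  # (365)_k / 365^k computed factor by factor
          p_no_match *= (365 - j) / 365
      return 1 - p_no_match

  def p_match_stirling(k):
      """Approximation 1 - e^(-k) * (365/(365-k))^(365-k+0.5) from Stirling's formula."""
      return 1 - math.exp(-k) * (365 / (365 - k)) ** (365 - k + 0.5)

  for k in (23, 41, 60):
      print(k, round(p_match_exact(k), 4), round(p_match_stirling(k), 4))
  # k = 23 gives 0.5073, matching the table entry above.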
1.5 Sampling at Random
1. Definition: Random sampling with replacement: Objects are drawn one
at a time at random from a population of N distinct objects. After each draw,
the object is returned to the population. Sampling stops after n draws.
(a) The number of distinct sequences is N^n.
(b) Each of the N^n sequences is equally likely.
(c) The probability that a specific object, say object i, is contained in the
sample at least once is
P(object i is in the sample at least once) = 1 - [(N - 1)/N]^n ≈ 1 - exp(-n/N).
2. Definition: Random sampling without replacement: Objects are drawn
one at a time at random from a population of N distinct objects. A sampled
object is not returned to the population after it is drawn. Sampling stops after
n draws.
(a) The number of distinct sequences is (N)_n = N!/(N - n)!.
(b) Each of the (N)_n sequences is equally likely.
(c) The number of distinct combinations is \binom{N}{n}.
(d) Each of the \binom{N}{n} combinations is equally likely.
(e) The probability that a specific object, say object i, is contained in the
sample is
P(object i is in the sample) = 1 - \binom{N-1}{n} / \binom{N}{n} = n/N.
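The two inclusion probabilities above are easy to verify numerically. The Python
sketch below is illustrative only; the population size N, sample size n, and object
label i are hypothetical values chosen for the example.

  import math

  N, n = 1000, 50       # hypothetical population size and sample size

  # With replacement: P(a given object appears at least once) and its approximation
  p_with = 1 - ((N - 1) / N) ** n
  print(p_with, 1 - math.exp(-n / N))          # 0.0488... vs 0.0487...

  # Without replacement: 1 - C(N-1, n)/C(N, n) simplifies to n/N
  p_without = 1 - math.comb(N - 1, n) / math.comb(N, n)
  print(p_without, n / N)                      # both equal 0.05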
1.6 Binomial & Multinomial Coefficients
1. Binomial Coefficients
(a) Binomial Theorem: Let x and y be any two finite numbers. Then,
(x + y)^n = ∑_{k=0}^{n} \binom{n}{k} x^k y^{n-k}.
The quantities \binom{n}{k} for k = 0, 1, . . . , n are called the binomial coefficients.
To prove the theorem, we will first establish the following lemma.
Lemma: Let a and b be nonnegative integers that satisfy a > b. Then
\binom{a-1}{b-1} + \binom{a-1}{b} = \binom{a}{b}.
Proof: The left-hand-side of the claim can be written as
\binom{a-1}{b-1} + \binom{a-1}{b}
= (a-1)!/[(b-1)!(a-b)!] + (a-1)!/[b!(a-b-1)!]
= (a-1)!/[(b-1)!(a-b)!] × (ab)/(ab) + (a-1)!/[b!(a-b-1)!] × [a(a-b)]/[a(a-b)]
= a! b/[b!(a-b)! a] + a! (a-b)/[b!(a-b)! a]
= a!/[b!(a-b)!] × [b/a + (a-b)/a]
= \binom{a}{b}.
Proof of Binomial Theorem: We will use proof by induction. The claim
can be stated as
(x + y)^m = ∑_{k=0}^{m} \binom{m}{k} x^k y^{m-k} for m = 0, 1, 2, . . . .
First, show that the claim is true for m = 0:
(x + y)^0 = 1 and ∑_{k=0}^{0} \binom{0}{k} x^k y^{0-k} = \binom{0}{0} x^0 y^0 = 1.
Now, suppose that the claim is true for m = 0, 1, 2, . . . , n - 1 and show
that the claim also is true for m = n.
(x + y)^n = (x + y)(x + y)^{n-1}
= (x + y) ∑_{k=0}^{n-1} \binom{n-1}{k} x^k y^{n-1-k}
= ∑_{k=0}^{n-1} \binom{n-1}{k} x^{k+1} y^{n-1-k} + ∑_{k=0}^{n-1} \binom{n-1}{k} x^k y^{n-k}
= ∑_{j=1}^{n} \binom{n-1}{j-1} x^j y^{n-j} + ∑_{j=0}^{n-1} \binom{n-1}{j} x^j y^{n-j}
= \binom{n-1}{0} x^0 y^n + ∑_{j=1}^{n-1} [\binom{n-1}{j-1} + \binom{n-1}{j}] x^j y^{n-j} + \binom{n-1}{n-1} x^n y^0
= ∑_{j=0}^{n} \binom{n}{j} x^j y^{n-j}   by using the Lemma.
(b) Pascal's triangle. Binomial coefficients can be generated using Pascal's
triangle. Each element in the following table is obtained by adding the
two entries in the preceding row that are in the above-left and
above-right positions.
n
0 1
1 1 1
2 1 2 1
3 1 3 3 1
4 1 4 6 4 1
5 1 5 10 10 5 1
6 1 6 15 20 15 6 1
2. Multinomial Coefficients
(a) Multinomial Theorem: Let J_0^n = {0, 1, 2, . . . , n} and let x_1, x_2, . . . , x_k be
finite numbers. Then,
(x_1 + x_2 + · · · + x_k)^n = ∑_{R} \binom{n}{m_1, . . . , m_k} x_1^{m_1} x_2^{m_2} · · · x_k^{m_k}, where
R = {(m_1, m_2, . . . , m_k); m_j ∈ J_0^n for each j and ∑_{j=1}^{k} m_j = n}.
The quantity \binom{n}{m_1, . . . , m_k}, where ∑_{j=1}^{k} m_j = n and m_j ∈ J_0^n,
is called a multinomial coefficient. The coefficients are evaluated as follows:
\binom{n}{m_1, m_2, . . . , m_k} = n! / ∏_{j=1}^{k} m_j!.
(b) Example: Suppose that k = 3 and n = 2. Then
(x_1 + x_2 + x_3)^2
= \binom{2}{2,0,0} x_1^2 x_2^0 x_3^0 + \binom{2}{0,2,0} x_1^0 x_2^2 x_3^0 + \binom{2}{0,0,2} x_1^0 x_2^0 x_3^2
  + \binom{2}{1,1,0} x_1^1 x_2^1 x_3^0 + \binom{2}{1,0,1} x_1^1 x_2^0 x_3^1 + \binom{2}{0,1,1} x_1^0 x_2^1 x_3^1
= x_1^2 + x_2^2 + x_3^2 + 2x_1x_2 + 2x_1x_3 + 2x_2x_3.
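A quick computational check of the binomial and multinomial coefficient formulas
follows. The Python sketch is illustrative only; the point (x_1, x_2, x_3) = (1, 2, 3)
is a hypothetical test value.

  import math
  from itertools import product

  # Rows of Pascal's triangle via the built-in binomial coefficient
  for n in range(7):
      print([math.comb(n, k) for k in range(n + 1)])

  def multinomial(n, ms):
      """n! / (m_1! m_2! ... m_k!) for a tuple ms with sum(ms) == n."""
      out = math.factorial(n)
      for m in ms:
          out //= math.factorial(m)
      return out

  # Verify (x1 + x2 + x3)^2 term by term at a test point
  x = (1.0, 2.0, 3.0)
  n = 2
  total = sum(multinomial(n, ms) * x[0]**ms[0] * x[1]**ms[1] * x[2]**ms[2]
              for ms in product(range(n + 1), repeat=3) if sum(ms) == n)
  print(total, sum(x) ** n)     # both 36.0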
1.7 Discrete Probability Distributions
1. Components of a Discrete Probability Model
(a) A countable sample space, Ω = {ω_1, ω_2, . . .}.
(b) A nonnegative number P(ω) assigned to each outcome ω such that
∑_{ω_i ∈ Ω} P(ω_i) = 1.
2. If E is an event in a discrete probability model, then
P(E) = ∑_{ω ∈ E} P(ω).
3. Axioms of Probability
(a) 0 ≤ Pr(A) ≤ 1 for any event A.
(b) Pr(Ω) = 1.
(c) Additivity
Finite: If A_i for i = 1, . . . , k are mutually disjoint, then
Pr(∪_{i=1}^{k} A_i) = ∑_{i=1}^{k} Pr(A_i).
Infinite: If A_i for i = 1, . . . , ∞ are mutually disjoint, then
Pr(∪_{i=1}^{∞} A_i) = ∑_{i=1}^{∞} Pr(A_i).
4. Implications of the Axioms
• Pr(E^c) = 1 - Pr(E). Proof:
Pr(Ω) = 1 = Pr(E ∪ E^c) = Pr(E) + Pr(E^c) ⟹ Pr(E^c) = 1 - Pr(E).
Corollary: Pr(∅) = 0. Corollary: If A and B are disjoint, then
Pr(A ∩ B) = 0.
• If E ⊂ F, then P(E) ≤ P(F). Proof: E ⊂ F ⟹
F = (E ∩ F) ∪ (E^c ∩ F) = E ∪ (E^c ∩ F) ⟹ Pr(F) = Pr(E) + Pr(E^c ∩ F) ≥ Pr(E)
because E is disjoint from E^c ∩ F and Pr(E^c ∩ F) ≥ 0.
• Pr(A ∪ B) = Pr(A) + Pr(B) - Pr(A ∩ B) for any events A and B. Proof:
note that A = (A ∩ B) ∪ (A ∩ B^c) and B = (A ∩ B) ∪ (A^c ∩ B).
Accordingly, Pr(A) = Pr(A ∩ B) + Pr(A ∩ B^c) and
Pr(A ∩ B^c) = Pr(A) - Pr(A ∩ B) because A ∩ B is disjoint from A ∩ B^c.
Similarly, Pr(B) = Pr(A ∩ B) + Pr(A^c ∩ B) and
Pr(A^c ∩ B) = Pr(B) - Pr(A ∩ B). Lastly, note that
A ∪ B = (A ∩ B) ∪ (A ∩ B^c) ∪ (A^c ∩ B). These three events are mutually
disjoint, so
Pr(A ∪ B) = Pr(A ∩ B) + Pr(A ∩ B^c) + Pr(A^c ∩ B)
= Pr(A ∩ B) + Pr(A) - Pr(A ∩ B) + Pr(B) - Pr(A ∩ B)
= Pr(A) + Pr(B) - Pr(A ∩ B).
5. Law of Total Probability: Let F_1, . . . , F_k be a partition of Ω. Then
Pr(A) = ∑_{i=1}^{k} Pr(A ∩ F_i).
1.8 Subjective Probability
1. The relative frequency approach to probability requires that experiments be
repeatable. Subjective probabilities do not have this requirement.
2. Subjective probabilities obey the same axioms as other probabilities
(otherwise, they wouldn't be probabilities).
3. Eliciting a subjective probability: To find Pr(E), consider the following
situation. Let E be an event. Suppose that you and a friend place a bet. Your
friend contributes $100 and you contribute $B. If E occurs, then you win
$(100 + B). If E^c occurs, then your friend wins $(100 + B). What is the
maximum amount that you are willing to bet in this game? That is, what is
the maximum $B that you will contribute? After you have chosen $B, then
your subjective probability is computed as Pr(E) = B/(100 + B).
4. Probability of a hypothesis. Let H be a hypothesis in an empirical
investigation. For example, H = "A specific teaching method that uses an
online component is superior on average to another teaching method that does
not use the online component." Bayesians can speak about Pr(H) because
Pr(H) reflects prior beliefs (i.e., prior to collecting data) about the hypothesis.
Frequentists cannot speak about Pr(H) except to say that either Pr(H) = 1 or
Pr(H) = 0 because the event H is either true or false. It is not true sometimes
and false sometimes.
5. Odds at the racetrack. The heading "Odds Against" in the table on page 35 of
the text is incorrect. It should simply be "Odds." The odds against the horse
winning are 22 to 10, 22 to 10, 27 to 5, etc.
6. Examples of Racetrack odds. To construct the following tables, it was
assumed that the track handle is 17%. That is, the track takes 17% of the
money bet, regardless of which horse wins.
For five horses, the general table is set up as follows. Let K denote the amount bet
on a given horse (K_A for horse A, K_B for horse B, and so on) and let T denote the
total amount bet on all five horses. Because the track keeps 17% of the handle, only
0.83T is returned to the betters. Each horse's row of the table then has the form

  Better's Perspective:  Prob = K/T,   Odds = K/(T - K),   Odds Against = (T - K)/K.
  Track Perspective:     Prob = K/(0.83T),   Odds = K/(0.83T - K),
                         Odds Against = (0.83T - K)/K,
                         Payoff on a $2 bet = 2 + 2(0.83T - K)/K.

Summing the Prob column over the five horses gives 1 from the better's perspective,
but 1/0.83 from the track's perspective.
Note that the track violates the probability axioms.
Example 1
                        Better's Perspective            Track Perspective
Horse  Amount Bet   Prob    Odds    Odds Against    Prob     Odds      Odds Against   Payoff $2
A      $500         1/2     1/1     1/1             0.6024   1.5152     0.66             3.32
B      $250         1/4     1/3     3/1             0.3012   0.4310     2.32             6.64
C      $100         1/10    1/9     9/1             0.1205   0.1370     7.30            16.60
D      $100         1/10    1/9     9/1             0.1205   0.1370     7.30            16.60
E      $50          1/20    1/19    19/1            0.0602   0.0641    15.60            33.20
Tot    $1000        1                               1/0.83

Example 2
                        Better's Perspective            Track Perspective
Horse  Amount Bet   Prob    Odds    Odds Against    Prob     Odds      Odds Against   Payoff $2
A      $20          5/16    5/11    11/5            0.3765   0.6039     1.6560           5.312
B      $20          5/16    5/11    11/5            0.3765   0.6039     1.6560           5.312
C      $10          5/32    5/27    27/5            0.1883   0.2319     4.312           10.624
D      $10          5/32    5/27    27/5            0.1883   0.2319     4.312           10.624
E      $4           1/16    1/15    15/1            0.0753   0.0814    12.28            26.56
Tot    $64          1                               1/0.83

Example 3
                        Better's Perspective            Track Perspective
Horse  Amount Bet   Prob     Odds     Odds Against   Prob     Odds       Odds Against   Payoff $2
A      $1920        96/100   24/1     1/24           1.1566   -7.3846    -0.1354            1.7292
B      $60          3/100    3/97     97/3           0.0361    0.0375    26.6667           55.3333
C      $10          1/200    1/199    199/1          0.0060    0.0061    165.00           332.00
D      $8           1/250    1/249    249/1          0.0048    0.0048    206.50           415.00
E      $2           1/1000   1/999    999/1          0.0012    0.0012    829.00          1660.00
Tot    $2000        1                                1/0.83
Chapter 2
Discrete Random Variables
1. Definition: A Random Variable is a characteristic of the outcome of an
experiment.
2. Notation: Use capital letters to denote random variables (rvs). Example:
X(ω) is a rv. Use small letters to denote a realization of the random variable.
3. Example: Consider the experiment of choosing a student at random from a
classroom. Then Ω = {Jack, Dolores, . . .}. Let X(ω) be a characteristic of
student ω. Then X(ω) is a rv.
4. Types of random variables
(a) Categorical versus Numerical
• X(ω) = gender of selected student is a categorical random variable
and X(ω_2) = x_2 = female is a realization of the random variable.
• Y(ω) = age of selected student is a numerical random variable and
Y(ω_1) = y_1 = 19.62 is a realization of the random variable.
(b) Continuous versus Discrete
• If the possible values of a rv are countable, then the rv is discrete.
• If the possible values of a rv are contained in open subsets (or half
open subsets) of the real line, then the rv is continuous.
2.1 Probability Functions
1. Definition: The probability function (p.f.) of a discrete rv assigns a
probability to the event X(ω) = x. The p.f. is denoted by f_X(x) and is
defined by
f_X(x) = Pr[X(ω) = x] = ∑_{ω: X(ω) = x} Pr(ω).
This function also is called a probability mass function (pmf). The
terminology pmf appears to be more often used than p.f., so I will use pmf
rather than p.f.
2. The pmf can be an equation, a table, or a graph that shows how probability is
assigned to possible values of the random variable.
3. The distribution of probabilities across all possible values is called the
probability distribution. A probability distribution may be displayed as
(a) a table, (b) a graph, or (c) an equation.
4. Examples
(a) Example: Roll a fair four-sided die twice. The face values on the die are
1, 2, 3, and 4. The sample space is Ω = {(1, 1), (1, 2), . . . , (4, 4)}. Note
that #(Ω) = 16 and each outcome ω = (ω_1, ω_2) is equally likely. Let
X(ω) = max(ω_1, ω_2). Find the pmf of X. Solution:
  x      f_X(x)
  1      1/16
  2      3/16
  3      5/16
  4      7/16
  Total  1
(b) Example: Choose a baby name at random from
Ω = {John, Josh, Thomas, William}. Let X(ω) = first letter of name.
Then P(X = J) = 0.5.
5. Support of the distribution: The set of possible values of X that have non-zero
probability is called the support of the distribution. We will denote this set by
S_X. That is,
S_X = {x; f(x) > 0}.
The support of a random variable is analogous to the sample space of an
experiment. Note, f_X(x) is abbreviated as f(x). This convention will be
followed if it is clear that the pmf f(x) refers to the random variable X.
6. Properties of a pmf
• f(x) ≥ 0 for all x. This property also can be written as f(x) ≥ 0 ∀ x.
• ∑_{x ∈ S_X} Pr(X = x) = 1.
7. Indicator Function:
I_A(a) = 1 if a ∈ A, and I_A(a) = 0 otherwise.
8. Application of indicator function. Consider, again, the random variable
X(ω) = max(ω_1, ω_2), where (ω_1, ω_2) is an outcome when rolling a fair
four-sided die twice. The pmf of X is
f_X(x) = (2x - 1)/16 if x = 1, 2, 3, 4, and 0 otherwise; equivalently,
f_X(x) = [(2x - 1)/16] I_{{1,2,3,4}}(x).
2.2 Joint Distributions
1. Joint Probability Functions: Let X and Y be discrete random variables
defined on S_X and S_Y, respectively. Then the joint pmf (or p.f.) of (X, Y) is
defined as
f_{X,Y}(x, y) = Pr[X(ω) = x, Y(ω) = y] = Pr(X = x, Y = y).
Note, joint distributions can be extended from the bivariate case (above) to
the general multivariate case. A joint pmf satisfies
• f(x, y) ≥ 0 for all pairs (x, y) and
• ∑_{(x,y) ∈ S} f(x, y) = 1, where S = S_X × S_Y = {(u, v); u ∈ S_X, v ∈ S_Y}.
Note: The set S = S_X × S_Y could include (x, y) pairs that have probability
zero. If so, then the true support is a subset of S.
Example: Two-way table for powerball. See problem 1-R11 on page 39. Let
X(ω) = number of matches out of 5 on the first drawing and Y(ω) = number of
matches out of 1 on the second drawing. Then
f_{X,Y}(x, y) = P(X = x, Y = y)
= [\binom{5}{x} \binom{40}{5-x} \binom{1}{y} \binom{44}{1-y} / (\binom{45}{5} \binom{45}{1})] I_{{0,1,...,5}}(x) I_{{0,1}}(y),
where x = 0, . . . , 5 and y = 0, 1. For other values of x and y, the probability is
zero. These probabilities, multiplied by 54,979,155, are given below:
y
x 0 1 Sum
0 28,952,352 658,008 29,610,360
1 20,105,800 456,950 20,562,750
2 4,347,200 98,800 4,446,000
3 343,200 7,800 351,000
4 8,800 200 9,000
5 44 1 45
Sum 53,757,396 1,221,759 54,979,155
In decimal form, the probabilities are
y
x 0 1 Sum
0 0.526605984 0.011968318 0.538574301
1 0.365698600 0.008311332 0.374009932
2 0.079069968 0.001797045 0.080867012
3 0.006242366 0.000141872 0.006384238
4 0.000160061 0.000003638 0.000163698
5 0.000000800 0.000000018 0.000000818
Sum 0.977777778 0.022222222 1.000000000
2. Marginal pmf: Sum the joint pmf over all other variables to obtain the
marginal pmf of one random variable.
(a) f_X(x) = ∑_{y ∈ S_Y} f(x, y).
(b) f_Y(y) = ∑_{x ∈ S_X} f(x, y).
(c) f_Z(z) = ∑_{x ∈ S_X} ∑_{y ∈ S_Y} f_{X,Y,Z}(x, y, z).
3. Example 1: The marginal pmfs for the powerball problem are
f_X(x) = ∑_{y=0}^{1} f_{X,Y}(x, y) = P(X = x) = [\binom{5}{x} \binom{40}{5-x} / \binom{45}{5}] I_{{0,1,...,5}}(x) and
f_Y(y) = ∑_{x=0}^{5} f_{X,Y}(x, y) = P(Y = y) = [\binom{1}{y} \binom{44}{1-y} / \binom{45}{1}] I_{{0,1}}(y).
The numerical values of these pmfs are displayed in the margins of the tables
above. (A computational check appears at the end of this section.)
4. Example 2: Suppose that
f_{X,Y}(i, j) = 2(i + 2j)/[3n(n + 1)^2] for i = 0, 1, . . . , n and j = 0, 1, . . . , n,
and f_{X,Y}(i, j) = 0 otherwise. Use the result
∑_{i=0}^{n} i = ∑_{i=1}^{n} i = n(n + 1)/2
to obtain
f_X(i) = 2(n + i)/[3n(n + 1)] for i = 0, 1, . . . , n, and 0 otherwise, and
f_Y(j) = (n + 4j)/[3n(n + 1)] for j = 0, 1, . . . , n, and 0 otherwise.
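The powerball joint and marginal pmfs can be regenerated directly from the formula
above. The Python sketch below is illustrative only.

  from math import comb

  denom = comb(45, 5) * comb(45, 1)          # 54,979,155
  for x in range(6):
      for y in range(2):
          count = comb(5, x) * comb(40, 5 - x) * comb(1, y) * comb(44, 1 - y)
          print(x, y, count, count / denom)  # matches the two tables above

  # Marginal pmf of X: sum the joint pmf over y
  f_X = [sum(comb(5, x) * comb(40, 5 - x) * comb(1, y) * comb(44, 1 - y)
             for y in range(2)) / denom for x in range(6)]
  print(f_X, sum(f_X))                       # marginal probabilities; they sum to 1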
2.3 Conditional Probability
1. Definitions:
P(ω|B) = P(ω)/P(B) if ω ∈ B, and 0 otherwise.
P(A|B) = P(A ∩ B)/P(B), provided that P(B) > 0.
These quantities are read as "probability of ω given B" and "probability of A
given B." Think of B as the new sample space and then re-scale P(ω) and
P(A ∩ B) relative to P(B).
2. Examples:
(a) I will do #2.15 on page 58 in class.
(b) I will do #2.17 on page 58 in class.
(c) Let's Make a Deal. Note: this is not the same game that is described on
page 79 of the text. An SUV is randomly placed behind one of three
identical doors. Goats are placed behind the other two doors. You choose
a door (say door 1) and will win the item behind the door after it is
opened. Before your door is opened, however, Monty Hall reveals a goat
behind one of the two remaining doors (either door 2 or door 3). If he
reveals a goat behind door 2, then he gives you the option of switching
from door 1 to door 3. If he reveals a goat behind door 3, then he gives
you the option of switching from door 1 to door 2. To maximize the
probability of winning the SUV, should you stick with your original
choice or should you switch? Assume that Monty knows where the SUV
is; he always reveals a goat; and he never reveals the content behind the
door that you choose.
Solution: Let C = i (C for choose) be the event that your initial choice is
door i. Let S = i (S for SUV) be the event that the SUV is behind door
i. Let R = i (R for reveal) be the event that Monty reveals a goat behind
door i. Conditional on C = 1, the table of joint probabilities for (R, S) is
as follows:
                          S
               1            2      3      Sum
       1       0            0      0      0
  R    2       p_1          0      1/3    1/3 + p_1
       3       1/3 - p_1    1/3    0      2/3 - p_1
       Sum     1/3          1/3    1/3    1
In the above table, the value of p_1 must satisfy p_1 ∈ (0, 1/3). If Monty
chooses a door at random when S = 1, then p_1 = 1/6. Accordingly,
P(S = 1|C = 1) = 1/3 and
P(S ≠ 1|C = 1) = 1 - P(S = 1|C = 1) = 2/3.
If your strategy is to stay with door 1, then you win the SUV with
probability 1/3. If your strategy is to switch, then you will win the SUV if
S ≠ 1 because you always switch to the correct door. This event has
probability 2/3. Therefore, the best strategy is to switch. (A simulation
sketch appears at the end of this section.) For more
information, go to <http://math.rice.edu/ddonovan/montyurl.html>.
3. Multiplication Rule
(a) Two events: P(E ∩ F) = P(F|E)P(E) = P(E|F)P(F).
(b) More than two events: P(∩_{i=1}^{k} E_i) = P(E_1) ∏_{j=2}^{k} P(E_j | ∩_{i=1}^{j-1} E_i).
For example, with 4 events,
P(E_1, E_2, E_3, E_4) = P(E_1) × P(E_2|E_1) × P(E_3|E_1, E_2) × P(E_4|E_1, E_2, E_3).
4. Applications of Multiplication rule
(a) If samples are selected at random one at a time without replacement,
then all sequences are equally likely.
Proof: The number of distinct sequences of n objects selected from N
objects is (N)_n. Label the N objects as o_1, o_2, . . . , o_N. Label the first
selection as S_1, the second selection as S_2, etc. Then
(S_1 = o_{i_1}, S_2 = o_{i_2}, S_3 = o_{i_3}, . . . , S_n = o_{i_n})
is a sequence provided that the subscripts i_1, i_2, . . . , i_n are all distinct.
For example, if N = 100 and n = 3, then (S_1 = o_23, S_2 = o_14, S_3 = o_89) is
a sequence. Using the multiplication rule, the probability of a sequence
can be written as follows:
P(S_1 = o_{i_1}, S_2 = o_{i_2}, S_3 = o_{i_3}, . . . , S_n = o_{i_n})
= P(S_1 = o_{i_1}) P(S_2 = o_{i_2}|S_1 = o_{i_1}) P(S_3 = o_{i_3}|S_1 = o_{i_1}, S_2 = o_{i_2})
  × · · · × P(S_n = o_{i_n}|S_1 = o_{i_1}, . . . , S_{n-1} = o_{i_{n-1}})
= (1/N) × (1/(N - 1)) × (1/(N - 2)) × · · · × (1/(N - n + 1)) = (N - n)!/N! = 1/(N)_n.
Accordingly, all sequences are equally likely.
(b) If samples are selected at random without replacement, then all
combinations are equally likely.
Proof: The unordered set
(o_{i_1}, o_{i_2}, . . . , o_{i_n})
is a combination provided that the subscripts i_1, i_2, . . . , i_n are all distinct.
The number of distinct combinations is \binom{N}{n} and the objects in each
combination can be ordered in n! ways. Therefore, each combination
corresponds to n! sequences and
P(o_{i_1}, o_{i_2}, . . . , o_{i_n}) = n! P(S_1 = o_{i_1}, S_2 = o_{i_2}, . . . , S_n = o_{i_n}) = n!/(N)_n = 1/\binom{N}{n}.
Accordingly, each combination is equally likely.
5. Conditional pmf: Let X and Y be discrete random variables. Then,
f_{X|Y}(x|y) = P(X = x|Y = y) = f_{X,Y}(x, y)/f_Y(y) = P(X = x, Y = y)/P(Y = y).
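As promised in the Let's Make a Deal example, here is a minimal Monty Hall
simulation. It is an illustrative Python sketch, not part of the original notes; it
assumes the host knows where the SUV is, always reveals a goat, and never opens the
chosen door.

  import random

  random.seed(2)
  n_games = 100_000
  wins_stay = wins_switch = 0
  for _ in range(n_games):
      suv = random.randint(1, 3)          # door hiding the SUV
      choice = 1                          # you always choose door 1 initially
      goats = [d for d in (1, 2, 3) if d != choice and d != suv]
      revealed = random.choice(goats)     # host reveals a goat behind another door
      switch_to = next(d for d in (1, 2, 3) if d not in (choice, revealed))
      wins_stay += (choice == suv)
      wins_switch += (switch_to == suv)

  print("stay:  ", wins_stay / n_games)    # close to 1/3
  print("switch:", wins_switch / n_games)  # close to 2/3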
2.4 Bayes' Theorem (Law of Inverse Probability)
Bayes' Theorem answers the question "How do you express P(E|F) in terms of
P(F|E)?"
1. Bayes' Theorem states that
P(E|F) = P(F|E)P(E) / [P(F|E)P(E) + P(F|E^c)P(E^c)].
Proof:
P(E|F) = P(E ∩ F)/P(F)   by the definition of conditional probability
= P(F|E)P(E)/P(F)   by the multiplication rule
= P(F|E)P(E) / [P(F ∩ E) + P(F ∩ E^c)]   by the law of total probability
= P(F|E)P(E) / [P(F|E)P(E) + P(F|E^c)P(E^c)]   by the multiplication rule.
2. More generally, Bayes' Theorem states that if E_1, E_2, . . . , E_n is a partition of
Ω, then
Pr(E_k|F) = Pr(F|E_k) Pr(E_k) / ∑_{i=1}^{n} Pr(F|E_i) Pr(E_i).
Furthermore, the conditional odds of E_i to E_j is
Odds of E_i to E_j conditional on F = P(E_i|F)/P(E_j|F) = [P(F|E_i)/P(F|E_j)] × [P(E_i)/P(E_j)].
2.5 Statistical Independence of Random
Variables
1. Definition: Two random variables, X and Y, are independent if and only if
f_{X,Y}(x, y) = f_X(x) f_Y(y) for all (x, y) ∈ S_{X,Y}. To denote independence,
we write X ⊥ Y.
2. Definition: k random variables, X_1, X_2, . . . , X_k, are mutually independent iff
f(x_1, . . . , x_k) = ∏_{i=1}^{k} f_i(x_i) for all (x_1, . . . , x_k) ∈ S_{X_1,...,X_k}.
3. Example: Consider a random variable X with pmf f_X(x). Let X_1, X_2, . . . , X_n
be a sequence of random variables obtained by sampling at random from f_X.
Then, X_1, . . . , X_n are independent random variables and their joint
distribution is f_{X_1,...,X_n}(x_1, . . . , x_n) = ∏_{i=1}^{n} f_X(x_i).
4. Independent Events. Let E_1, E_2, . . . , E_k be events. A set of indicator random
variables, X_1, . . . , X_k, can be defined as
X_i = 1 if E_i occurs, and X_i = 0 otherwise.
Then the events E_1, E_2, . . . , E_k are mutually independent if and only if the
indicator variables X_1, . . . , X_k are mutually independent.
5. Result: Let A and B be events. Then A ⊥ B if and only if
P(A ∩ B) = P(A)P(B).
Proof: Define the random variables X and Y as
X = 1 if A occurs and 0 otherwise, and Y = 1 if B occurs and 0 otherwise.
First, assume that A ⊥ B. Then
A ⊥ B ⟹ X ⊥ Y
⟹ P(A ∩ B) = f_{X,Y}(1, 1) = f_X(1) f_Y(1) = P(A)P(B).
Second, assume that P(A ∩ B) = P(A)P(B). Then,
P(A ∩ B) = P(A)P(B) ⟹ f_{X,Y}(1, 1) = f_X(1) f_Y(1).
Use this result to fill in the two-by-two table of joint and marginal
probabilities. Starting from the table in which only the (X = 1, Y = 1) cell and
the margins f_X(0), f_X(1), f_Y(0), f_Y(1) are known, the remaining cells are
obtained one at a time:
f_{X,Y}(1, 0) = f_X(1) - f_X(1) f_Y(1)   by subtraction
             = f_X(1) f_Y(0)             because 1 - f_Y(1) = f_Y(0);
f_{X,Y}(0, 0) = f_Y(0) - f_X(1) f_Y(0)   by subtraction
             = f_Y(0) f_X(0)             because 1 - f_X(1) = f_X(0);
f_{X,Y}(0, 1) = f_X(0) - f_Y(0) f_X(0)   by subtraction
             = f_X(0) f_Y(1)             because 1 - f_Y(0) = f_Y(1).
⟹ X ⊥ Y because f_{X,Y}(x, y) = f_X(x) f_Y(y) for x = 0, 1 and y = 0, 1
⟹ A ⊥ B.
6. Example: Roll two distinct fair 6-sided dice. Let E_1 be the event that the first
die is odd, let E_2 be the event that the second die is even, and let E_3 be the
event that exactly one odd and one even die occur. Are these events
mutually independent? Are there any pairs of events that are independent?
2.6 Exchangeability
1. Definition: Two random variables, X and Y, are said to be exchangeable iff
f_{X,Y}(x, y) = f_{X,Y}(y, x) for all (x, y) ∈ S_{X,Y}. Note, if X and Y are
exchangeable, then S_{X,Y} = S_{Y,X}.
2. Definition: n random variables, X_1, . . . , X_n, are said to be exchangeable iff
f_{X_1,...,X_n}(x_1, . . . , x_n) = f_{X_1,...,X_n}(x*_1, . . . , x*_n) for all (x_1, . . . , x_n) ∈ S_{X_1,...,X_n}
and for all (x*_1, . . . , x*_n), where (x*_1, . . . , x*_n) is a permutation of (x_1, . . . , x_n).
Note, the equality must be satisfied for all n! permutations.
3. Result: If X_1, . . . , X_n are exchangeable, then the marginal distributions of
each X_i are identical. Also, the joint distribution of any subset of k Xs is the
same as the distribution of any other set of k Xs, where k can be 1, 2, . . . , n.
Proof that bivariate marginals are identical when 3 random variables are
exchangeable: Recall that the joint pmf of X_1 and X_2 is obtained from the
joint pmf of X_1, X_2, and X_3 as follows:
f_{X_1,X_2}(x_1, x_2) = ∑_{x_3 ∈ S_{X_3}} f_{X_1,X_2,X_3}(x_1, x_2, x_3).
If X_1, X_2, and X_3 are exchangeable, then
f_{X_1,X_2,X_3}(x_1, x_2, x_3) = f_{X_1,X_2,X_3}(x_1, x_3, x_2) and
f_{X_1,X_2}(x_1, x_2) = ∑_{x_3 ∈ S_{X_3}} f_{X_1,X_2,X_3}(x_1, x_3, x_2) = f_{X_1,X_3}(x_1, x_2).
Also,
f_{X_1,X_2,X_3}(x_1, x_2, x_3) = f_{X_1,X_2,X_3}(x_3, x_1, x_2) and
f_{X_1,X_2}(x_1, x_2) = ∑_{x_3 ∈ S_{X_3}} f_{X_1,X_2,X_3}(x_3, x_1, x_2) = f_{X_2,X_3}(x_1, x_2).
Accordingly, exchangeability implies that
f_{X_1,X_2}(x_1, x_2) = f_{X_1,X_3}(x_1, x_2) = f_{X_2,X_3}(x_1, x_2).
In the same manner, it can be shown that exchangeability implies that
f_{X_1}(x_1) = f_{X_2}(x_1) = f_{X_3}(x_1).
4. Example 1: If X_1, . . . , X_n are independently and identically distributed (iid),
then X_1, . . . , X_n are exchangeable.
5. Example 2: Consider the procedure of sampling at random without
replacement from a finite population of size N. Let X_1, X_2, . . . , X_n be the
first, second, etc., selection and let x_1, . . . , x_N be the population units. The
random variables are not independent but the random variables are
exchangeable. For example,
f_{X_1,X_2}(x_i, x_j) = f_{X_1}(x_i) f_{X_2|X_1}(x_j|X_1 = x_i) = (1/N)(1/(N - 1)) = 1/[N(N - 1)] and
f_{X_1,X_2}(x_j, x_i) = f_{X_1}(x_j) f_{X_2|X_1}(x_i|X_1 = x_j) = (1/N)(1/(N - 1)) = 1/[N(N - 1)].
2.7 Application: Probability of Winning in Craps
1. Recall, the rules of the game are as follows. Roll a pair of dice. If the sum of
the dice is 7 or 11, then the player wins and the game is over. If the sum of
the dice is 2, 3, or 12, then the player loses and the game is over. If the sum of
the dice is anything else, then the sum is called the "point" and the game
continues. In this case, the player repeatedly rolls the pair of dice until the
sum is either 7 or equal to the point. If a 7 occurs first, then the player loses.
If the point occurs first, then the player wins.
2. The sample space when rolling two dice is
Ω = {(1, 1), (1, 2), (2, 1), . . . , (6, 6)}.
If the dice are fair, then the 36 outcomes are equally likely. Let Y(ω) be the
sum of the two dice on the first roll. It is easy to show that the pmf for Y is
  y   f_Y(y)      y    f_Y(y)
  2   1/36        8    5/36
  3   2/36        9    4/36
  4   3/36        10   3/36
  5   4/36        11   2/36
  6   5/36        12   1/36
  7   6/36
Alternatively,
f_Y(y) = [(6 - |y - 7|)/36] I_{{2,3,...,12}}(y).
3. Let X(ω) be the sum of the dice on the last roll of the game. Then the joint
support for (Y, X) is
S_{Y,X} = {(2, 2), (3, 3), (4, 4), (4, 7), (5, 5), (5, 7), (6, 6), (6, 7),
(7, 7), (8, 8), (8, 7), (9, 9), (9, 7), (10, 10), (10, 7), (11, 11), (12, 12)}.
The winning (Y, X) values are
(4, 4), (5, 5), (6, 6), (7, 7), (8, 8), (9, 9), (10, 10), and (11, 11).
The losing (Y, X) values are
(2, 2), (3, 3), (4, 7), (5, 7), (6, 7), (8, 7), (9, 7), (10, 7), and (12, 12).
4. Suppose that the first roll yields a 4. Then the game continues until a 7 or
another 4 is rolled. Denote a non-4, non-7 roll by N. Then the game is won if a
sequence such as {4, 4}, {4, N, 4}, {4, N, N, 4}, {4, N, N, N, 4}, etc., is observed.
Note that P(N) = 1 - P(4) - P(7) = 27/36. In any case, the first roll is a 4
and the last roll is a 4. That is, Y = 4 and X = 4. The probability that
Y = 4 and X = 4 can be computed as follows:
f_{Y,X}(4, 4) = P{4, 4} + P{4, N, 4} + P{4, N, N, 4} + P{4, N, N, N, 4} + · · ·
= (3/36)^2 + (3/36)^2 (27/36) + (3/36)^2 (27/36)^2 + (3/36)^2 (27/36)^3 + · · ·
= (3/36)^2 ∑_{i=0}^{∞} (27/36)^i
= (3/36)^2 × 1/[1 - (27/36)]   by the geometric series result
= 1/36.
Furthermore, the conditional probability of winning, given that the point is 4,
is
f_{X|Y}(4|4) = f_{Y,X}(4, 4)/f_Y(4) = (1/36)/(3/36) = 1/3.
Accordingly,
f_{X|Y}(x|4) = 1/3 if x = 4, 2/3 if x = 7, and 0 otherwise.
5. The probabilities f_{Y,X}(y, y) are summarized in the following table. The
probabilities that correspond to a win are summed. The result is
P(Win) = 244/495 = 0.5 - 7/990 ≈ 0.4929.
  Point: y   f_{Y,X}(y, y)   Winning Outcomes
  2          1/36
  3          2/36
  4          1/36            1/36
  5          2/45            2/45
  6          25/396          25/396
  7          6/36            1/6
  8          25/396          25/396
  9          2/45            2/45
  10         1/36            1/36
  11         2/36            2/36
  12         1/36
  Total                      244/495
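The table entries and the total P(Win) = 244/495 can be reproduced exactly with
rational arithmetic. The sketch below (Python, illustrative only) uses, for each
point y, the conditional-win formula f_Y(y) × f_Y(y)/[f_Y(y) + f_Y(7)].

  from fractions import Fraction

  def f_Y(y):
      """pmf of the sum of two fair dice."""
      return Fraction(6 - abs(y - 7), 36)

  p_win = f_Y(7) + f_Y(11)                   # win on the first roll
  for point in (4, 5, 6, 8, 9, 10):
      # P(first roll = point) * P(point is rolled before a 7)
      p_win += f_Y(point) * f_Y(point) / (f_Y(point) + f_Y(7))

  print(p_win, float(p_win))                 # 244/495, approximately 0.4929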
Chapter 3
Expectations of Discrete Random
Variables
3.1 The Mean
1. Definition: The expected value of X is defined as E(X) = ∑_{x ∈ S_X} x f_X(x) if the
expectation exists.
2. Alternative definition: E(X) = ∑_{ω ∈ Ω} X(ω)P(ω) if the expectation exists.
3. Properties of expectations: Let a, b, and c be constants. Then
• E(c) = c.
• E(aX + c) = aE(X) + c if the expectation exists.
• E(X + Y) = E(X) + E(Y) if the expectations exist.
• E(aX + bY + c) = aE(X) + bE(Y) + c if the expectations exist.
Proof: Assume that the expectations exist. Then,
E(aX + bY + c) = ∑_{ω ∈ Ω} [aX(ω) + bY(ω) + c] P(ω)
= a ∑_{ω ∈ Ω} X(ω)P(ω) + b ∑_{ω ∈ Ω} Y(ω)P(ω) + c ∑_{ω ∈ Ω} P(ω)
= aE(X) + bE(Y) + c.
4. Examples
• If X_i for i = 1, . . . , n are random variables, then E(∑_{i=1}^{n} X_i) = ∑_{i=1}^{n} E(X_i).
• Example: Play 100 games of craps at $1 per game. Let X_i be the amount
won on game i. That is,
X_i = 1 if game i is won, and X_i = -1 if game i is lost.
Find the total expected return. (A short computational sketch appears at the
end of this section.)
5. More Examples: f_{X,Y}(i, j) = 2(i + 2j)/[3n(n + 1)^2] for i = 0, . . . , n and
j = 0, . . . , n. It can be shown that f_X(i) = 2(i + n)/[3n(n + 1)] for i = 0, . . . , n
and that f_Y(j) = (n + 4j)/[3n(n + 1)] for j = 0, . . . , n. Accordingly,
E(X) = ∑_{i=0}^{n} 2(i^2 + ni)/[3n(n + 1)] = (5n + 1)/9 and
E(Y) = ∑_{j=0}^{n} (nj + 4j^2)/[3n(n + 1)] = (11n + 4)/18.
To verify the above results, use ∑_{i=0}^{n} i^2 = n(n + 1)(2n + 1)/6.
6. Center of gravity interpretation: Denote E(X) by μ. Then, E(X - μ) = 0.
The sum of the positive deviations and the sum of the negative deviations are
equal in absolute value. To balance the distribution, place the fulcrum at μ.
7. Symmetric distributions.
• The distribution of X is said to be symmetric around a if
f_X(a - y) = f_X(a + y) for all y.
• Suppose that f_X(a - y) = f_X(a + y) for all y. Then, E(X) = a.
Proof: First, note that symmetry implies that
a - y ∈ S_X ⟺ a + y ∈ S_X.
The expected value of X is
E(X) = ∑_{x ∈ S_X} x f_X(x) = ∑_{x ∈ S_X, x < a} x f_X(x) + a f_X(a) + ∑_{x ∈ S_X, x > a} x f_X(x).
Write x as x = (a - y) in the first sum. Then a - y < a ⟹ y > 0. Write
x as a + y in the second sum. Then, a + y > a ⟹ y > 0. Accordingly,
the expectation is
E(X) = ∑_{a-y ∈ S_X, y > 0} (a - y) f_X(a - y) + a f_X(a) + ∑_{a+y ∈ S_X, y > 0} (a + y) f_X(a + y)
= ∑_{a-y ∈ S_X, y > 0} (a - y + a + y) f_X(a - y) + a f_X(a)
  because a - y ∈ S_X ⟺ a + y ∈ S_X and f_X(a - y) = f_X(a + y)
= 2a ∑_{a-y ∈ S_X, y > 0} f_X(a - y) + a f_X(a)
= a ∑_{a-y ∈ S_X, y > 0} f_X(a - y) + a f_X(a) + a ∑_{a+y ∈ S_X, y > 0} f_X(a + y)
  because a - y ∈ S_X ⟺ a + y ∈ S_X and f_X(a - y) = f_X(a + y)
= a ∑_{x ∈ S_X, x < a} f_X(x) + a f_X(a) + a ∑_{x ∈ S_X, x > a} f_X(x)
= a [ ∑_{x ∈ S_X, x < a} f_X(x) + f_X(a) + ∑_{x ∈ S_X, x > a} f_X(x) ]
= a ∑_{x ∈ S_X} f_X(x) = a × 1 = a.
8. E(X) need not exist. Example: double or nothing gamble. Play a game in
which the probability of winning is π. Bet $1 on the game. If you win, then
collect $2. If you lose, double the bet and play again. Continue to play the
game until you win. Let X = total amount bet before you finally win. Find
E(X).
Solution: The probability of winning on the i-th game is the probability of
losing on each of the first i - 1 games and winning on the i-th game. The games
are independent, so this probability is (1 - π)^{i-1} π. The support set for X is
S_X = {1, 3, 7, 15, 31, . . .} = {2^1 - 1, 2^2 - 1, 2^3 - 1, 2^4 - 1, . . .}.
For example, if you win on game 3, then you bet $1 on game 1, $2 on game 2,
and $4 on game 3. The total amount bet is 1 + 2 + 4 = $7. The table below
summarizes the pmf of X.
  Game   Amount Bet   x: Total Bet   f_X(x)
  1      1            1              π
  2      2            3              (1 - π)π
  3      4            7              (1 - π)^2 π
  4      8            15             (1 - π)^3 π
  5      16           31             (1 - π)^4 π
  ...    ...          ...            ...
  i      2^{i-1}      2^i - 1        (1 - π)^{i-1} π
  ...    ...          ...            ...
The expected value of X is
E(X) = ∑_{i=1}^{∞} x_i f_X(x_i) = ∑_{i=1}^{∞} (2^i - 1)(1 - π)^{i-1} π
= π ∑_{i=1}^{∞} 2^i (1 - π)^{i-1} - π ∑_{i=1}^{∞} (1 - π)^{i-1}
= 2π ∑_{i=0}^{∞} 2^i (1 - π)^i - π ∑_{i=0}^{∞} (1 - π)^i
= 2π × 1/[1 - 2(1 - π)] - 1 if 2(1 - π) < 1, and ∞ if 2(1 - π) ≥ 1
= 1/(2π - 1) if 2(1 - π) < 1, and ∞ if 2(1 - π) ≥ 1.
9. Conditional Expectation: E(X|Y = y) = ∑_{x ∈ S_X} x f_{X|Y}(x|y). Note that
E(X|Y = y) is a function of y.
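For the 100-game craps example mentioned earlier in this section, linearity of
expectation gives E(∑ X_i) = 100 E(X_1), and E(X_1) uses P(Win) = 244/495 from
Section 2.7. A small numerical check (Python, illustrative only):

  from fractions import Fraction

  p_win = Fraction(244, 495)
  e_single = 1 * p_win + (-1) * (1 - p_win)   # expected winnings per $1 game
  e_total = 100 * e_single                    # linearity of expectation
  print(e_single, float(e_single))            # -7/495, about -0.0141
  print(e_total, float(e_total))              # -140/99, about -1.41 dollars

So the expected total return on 100 games is a loss of roughly $1.41.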
3.2 Expectation of a Function
1. If X is a discrete random variable, then g(X) is just another discrete random
variable. That is, g(X) = g[X(ω)] = Y(ω) for some function Y.
2. E[g(X)] = E(Y) = ∑_{ω ∈ Ω} Y(ω)P(ω) = ∑_{y ∈ S_Y} y ∑_{ω: Y(ω)=y} P(ω) = ∑_{y ∈ S_Y} y f_Y(y).
3. Result:
E[g(X)] = ∑_{x ∈ S_X} g(x) f_X(x).
Proof:
E[g(X)] = ∑_{ω ∈ Ω} g[X(ω)] P(ω)   by definition
= ∑_{x ∈ S_X} g(x) ∑_{ω: X(ω)=x} P(ω)   by reordering the terms in the summation
= ∑_{x ∈ S_X} g(x) f_X(x).
4. Caution: In general E[g(X)] ≠ g[E(X)]. For example, if g(X) = 1/X, then
E[g(X)] = ∑_{x ∈ S_X} (1/x) f_X(x) ≠ g[E(X)] = 1/E(X) = 1 / ∑_{x ∈ S_X} x f_X(x).
If g(X) is a linear function, however, then E[g(X)] = g[E(X)]. That is, if
g(X) = a + bX, where a and b are constants, then
E[g(X)] = E(a + bX) = a + bE(X) = g[E(X)].
5. Utilities: The subjective value of the random variable X is called the utility of
X and is denoted by u(X). The expected utility is E(U) = ∑_{x ∈ S_X} u(x) f_X(x).
6. Expectations of Conditional pmf: E[f_{X|Y}(x|Y)] = f_X(x).
Proof:
E[f_{X|Y}(x|Y)] = ∑_{y ∈ S_Y} f_{X|Y}(x|y) f_Y(y)
= ∑_{y ∈ S_Y} f_{X,Y}(x, y)   by the definition of a conditional pmf
= f_X(x).
This expectation is sometimes written as E_Y[f_{X|Y}(x|Y)] = f_X(x) to remind us
that the expectation is taken with respect to the distribution of Y.
7. Iterated Expectation: E_Y[E(X|Y)] = E(X).
Proof:
E_Y[E(X|Y)] = ∑_{y ∈ S_Y} [E(X|y)] f_Y(y)   by results on expectation of a function
= ∑_{y ∈ S_Y} [∑_{x ∈ S_X} x f_{X|Y}(x|y)] f_Y(y)   by definition of conditional expectation
= ∑_{y ∈ S_Y} ∑_{x ∈ S_X} x f_{X|Y}(x|y) f_Y(y)
= ∑_{y ∈ S_Y} ∑_{x ∈ S_X} x f_{X,Y}(x, y)   by definition of conditional pmf
= ∑_{x ∈ S_X} x ∑_{y ∈ S_Y} f_{X,Y}(x, y)
= ∑_{x ∈ S_X} x f_X(x) = E(X).
8. Example of iterated expectation. Suppose that a coin has probability π of
landing heads. Define the random variable Y to be 1 if a head is tossed and 0
if a tail is tossed. Note that E(Y) = π. Toss the coin and then roll a fair
six-sided die 2Y + 1 times. Let X be the total number of pips on the 2Y + 1
rolls. Find E(X).
Solution: The expected number of pips on a single roll is 3.5. Therefore,
E(X|Y) = (2Y + 1)(3.5) and
E(X) = E_Y[E(X|Y)] = E_Y[(2Y + 1)(3.5)] = [2E(Y) + 1](3.5) = (2π + 1)(3.5).
If the coin is fair, then E(Y) = 1/2 and E(X) = 7.
9. Expected value of a function of several random variables:
E[g(X_1, X_2, . . . , X_k)] = ∑_{(x_1, x_2, ..., x_k) ∈ S} g(x_1, x_2, . . . , x_k) f_X(x_1, x_2, . . . , x_k).
Example: suppose the joint support of (X, Y) is
S = {(0, 0), (0, 1), (1, 0), (1, 1)}. Find the expectation of 1/f_{X,Y}(X, Y).
Solution:
E[1/f_{X,Y}(X, Y)] = ∑_{x=0}^{1} ∑_{y=0}^{1} [1/f_{X,Y}(x, y)] f_{X,Y}(x, y) = ∑_{x=0}^{1} ∑_{y=0}^{1} 1 = 4.
10. Expectation under independence: If X and Y are independent, then
E[g(X)h(Y)] = E[g(X)]E[h(Y)], provided that the expectations exist.
Proof: Suppose that X ⊥ Y. Then f_{X,Y}(x, y) = f_X(x) f_Y(y) and
E[g(X)h(Y)] = ∑_{x ∈ S_X} ∑_{y ∈ S_Y} g(x)h(y) f_{X,Y}(x, y)
= ∑_{x ∈ S_X} ∑_{y ∈ S_Y} g(x)h(y) f_X(x) f_Y(y)
= ∑_{x ∈ S_X} g(x) f_X(x) ∑_{y ∈ S_Y} h(y) f_Y(y) = E[g(X)]E[h(Y)].
Note: a much stronger result can be established. If X ⊥ Y, then g(X) ⊥ h(Y)
for any functions g and h.
3.3 Variability
1. Mean Absolute Deviation: MAD = E(|X - μ_X|) = ∑_{x ∈ S_X} |x - μ_X| f_X(x).
2. Variance: The variance of X is defined as
Var(X) = E(X - μ_X)^2 = ∑_{x ∈ S_X} (x - μ_X)^2 f_X(x).
It is conventional to denote the variance of the random variable X by σ_X^2.
3. Result: Var(X) = E(X^2) - [E(X)]^2.
Proof:
Var(X) = E{[X - E(X)]^2}
= E{X^2 - 2XE(X) + [E(X)]^2}
= E(X^2) - 2E(X)E(X) + [E(X)]^2 = E(X^2) - [E(X)]^2.
4. Standard Deviation: σ = +√(σ^2).
5. Variance of a uniform distribution: Suppose that the support of X is
S_X = {1, 2, . . . , N} and that each value in the support set has equal probability.
This situation can be denoted as X ∼ Discrete Uniform(1, 2, . . . , N). For this
distribution,
μ_X = (N + 1)/2 and σ_X^2 = (N + 1)(N - 1)/12.
Proof:
E(X) = ∑_{x ∈ S_X} x f_X(x) = ∑_{x=1}^{N} x (1/N) = (1/N) ∑_{i=1}^{N} i
= (1/N)[N(N + 1)/2] = (N + 1)/2.
Also,
E(X^2) = ∑_{x ∈ S_X} x^2 f_X(x) = ∑_{x=1}^{N} x^2 (1/N) = (1/N) ∑_{i=1}^{N} i^2
= (1/N)[N(N + 1)(2N + 1)/6] = (N + 1)(2N + 1)/6.
Accordingly,
Var(X) = (N + 1)(2N + 1)/6 - [(N + 1)/2]^2 = (N + 1)(N - 1)/12.
6. Example: Toss a fair die once. Let X be the number of pips on the top face.
Then, X ∼ Discrete Uniform(1, 2, . . . , 6). Therefore,
E(X) = 7/2 and Var(X) = (7)(5)/12 = 2.9167.
(A short numerical check appears at the end of this section.)
7. Parallel axis theorem: Let c be a constant and let X be a random variable
with mean μ_X and variance σ_X^2 < ∞. Then, E(X - c)^2 = σ_X^2 + (c - μ_X)^2. Note
that E(X - c)^2 is minimized with respect to c when c = μ_X.
Proof: Use the add zero trick. Write c as c = μ_X + (c - μ_X). Therefore,
E(X - c)^2 = E[(X - μ_X) - (c - μ_X)]^2
= E[(X - μ_X)^2 - 2(X - μ_X)(c - μ_X) + (c - μ_X)^2]
= E(X - μ_X)^2 - 2(c - μ_X)E(X - μ_X) + E(c - μ_X)^2
= σ_X^2 + 0 + (c - μ_X)^2 = σ_X^2 + (c - μ_X)^2.
Why is this called the parallel axis theorem? My guess is that the parallel
axes refer to two vertical lines drawn on the graph of the pmf of X. One is
drawn at x = c and one is drawn at x = μ_X.
8. Alternative proof that μ_X is the minimizer of g(c) = E(X - c)^2: Use calculus.
Take the derivative of g(c) with respect to c and set it to zero to find critical
points:
(d/dc) g(c) = (d/dc) E(X^2 - 2cX + c^2)
= (d/dc) [E(X^2) - 2c μ_X + c^2]
= -2μ_X + 2c, and
d g(c)/dc = 0 ⟹ c = μ_X.
Use the second derivative test to show that a minimizer has been found:
(d^2/dc^2) g(c) evaluated at c = μ_X is 2 > 0 ⟹ μ_X
is a minimizer.
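As promised in the fair-die example, the discrete uniform formulas can be checked
against a direct computation (a Python sketch, not part of the original notes):

  N = 6
  support = range(1, N + 1)
  mean = sum(x / N for x in support)
  var = sum((x - mean) ** 2 / N for x in support)
  print(mean, (N + 1) / 2)                  # 3.5 both ways
  print(var, (N + 1) * (N - 1) / 12)        # 2.9166... both ways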
3.4 Covariance and Correlation
1. Covariance:
Cov(X, Y) = E[(X - μ_X)(Y - μ_Y)].
It is conventional to denote the covariance between X and Y by σ_{X,Y}.
2. Result: E[(X - μ_X)(Y - μ_Y)] = E(XY) - E(X)E(Y).
Proof:
E[(X - μ_X)(Y - μ_Y)] = E[XY - XE(Y) - E(X)Y + E(X)E(Y)]
= E(XY) - E(X)E(Y) - E(X)E(Y) + E(X)E(Y) = E(XY) - E(X)E(Y).
3. Example: Suppose that f_{X,Y}(i, j) = 2(i + 2j)/[3n(n + 1)^2] for i = 0, . . . , n and
j = 0, . . . , n. Then E(XY) = n(2n + 1)/6 and Cov(X, Y) = -(n + 2)^2/162.
4. Result: Cov(a + bX, c + dY) = bd Cov(X, Y). Proof in class.
5. Special case of above result: Var(aX + b) = a^2 Var(X).
6. Correlation: Cor(X, Y) = σ_{X,Y}/(σ_X σ_Y). It is conventional to denote the
correlation between X and Y by ρ_{X,Y}.
7. Cauchy-Schwartz Inequality: Let X and Y be two random variables. Then
E(X^2)E(Y^2) ≥ [E(XY)]^2, provided that the expectations exist. Proof:
E(Y - λX)^2 ≥ 0 for all λ. Now minimize with respect to λ:
∂E(Y - λX)^2/∂λ = 0 ⟹ λ = E(XY)/E(X^2)
⟹ E[Y - (E(XY)/E(X^2)) X]^2 ≥ 0.
Simplify to obtain the Cauchy-Schwartz inequality.
8. Application of Cauchy-Schwartz: The Cauchy-Schwartz inequality says that if
X* and Y* are random variables and the required expectations exist, then
[E(X*Y*)]^2 ≤ E(X*^2)E(Y*^2). Let X* = (X - μ_X)/σ_X and let
Y* = (Y - μ_Y)/σ_Y. Then,
[E(X*Y*)]^2 = {E[(X - μ_X)(Y - μ_Y)/(σ_X σ_Y)]}^2 = ρ_{X,Y}^2
≤ E(X*^2)E(Y*^2) = E[(X - μ_X)^2/σ_X^2] E[(Y - μ_Y)^2/σ_Y^2] = (σ_X^2/σ_X^2)(σ_Y^2/σ_Y^2) = 1
⟹ ρ_{X,Y} ∈ [-1, 1].
9. Example: Consider a very small insurance company. Let X be the number of
policies sold and let Y be the number of claims made. Suppose that the joint
pmf for X and Y is the following:
               x
   y       0      1      2
   0      1/4    1/4    1/16     9/16
   1       0     1/4    1/8      3/8
   2       0      0     1/16     1/16
          1/4    1/2    1/4      1
(a) If the premium on each policy is $1,000 and each claim amount is $2,000,
then the net revenue is 1000X - 2000Y. Find the expected net revenue.
Solution:
E(1000X - 2000Y) = 1000E(X) - 2000E(Y) = 1000(1) - 2000(.5) = 0.
(b) Find the correlation between X and Y.
Solution: E(X) = 1, E(Y) = 0.5, E(X^2) = 1.5, E(Y^2) = 5/8, and
E(XY) = 3/4. Therefore,
Var(X) = 1.5 - 1^2 = 0.5,
Var(Y) = 5/8 - 0.5^2 = 0.375,
Cov(X, Y) = 0.75 - (1)(0.5) = 0.25, and
ρ_{X,Y} = 0.25/√[(0.5)(0.375)] = 1/√3 ≈ 0.5774.
(A computational check appears at the end of this section.)
10. Result: Cor(a + bX, c + dY) = sign(bd) ρ_{X,Y}.
Proof:
Cor(a + bX, c + dY) = Cov(a + bX, c + dY) / √[Var(a + bX) Var(c + dY)]
= bd σ_{X,Y} / √(b^2 σ_X^2 d^2 σ_Y^2)
= [bd/|bd|] σ_{X,Y}/(σ_X σ_Y) = sign(bd) ρ_{X,Y}.
11. Example: Suppose that f_{X,Y}(i, j) = 2(i + 2j)/[3n(n + 1)^2] for i = 0, . . . , n and
j = 0, . . . , n. Then E(XY) = n(2n + 1)/6, E(X) = (5n + 1)/9,
E(X^2) = n(7n + 5)/18, E(Y) = (11n + 4)/18, and E(Y^2) = n(8n + 7)/18. To
obtain E(X^2) and E(Y^2), use ∑_{i=0}^{n} i^3 = n^2(n + 1)^2/4. It follows that
σ_X^2 = (n + 2)(13n - 1)/162, σ_Y^2 = (n + 2)(23n - 8)/324, σ_{X,Y} = -(n + 2)^2/162,
and ρ_{X,Y} = -√2 (n + 2)/√[(13n - 1)(23n - 8)]. Note that
lim_{n→∞} ρ_{X,Y} = -√2/√299 = -0.08179.
12. Result: if X ⊥ Y, then ρ_{X,Y} = 0.
Proof: From result 10 in Section 3.2, we know that
X ⊥ Y ⟹ E[g(X)h(Y)] = E[g(X)]E[h(Y)]. Accordingly,
Cov(X, Y) = E[(X - μ_X)(Y - μ_Y)] = E(X - μ_X)E(Y - μ_Y) = 0.
13. Result: ρ_{X,Y} = 0 does not imply X ⊥ Y. A counter example is sufficient to
establish this result. Consider the joint probability function below:
              x
   y      1       2       3       4
   1      0      0.125   0.125    0       0.250
   2      0      0.125   0.125    0       0.250
   4     0.125    0       0      0.125    0.250
   5     0.125    0       0      0.125    0.250
         0.250   0.250   0.250   0.250    1.000
Computation shows that E(X) = 2.5, E(Y) = 3, and E(XY) = 7.5. Therefore,
Cov(X, Y) = 0 and ρ_{X,Y} = 0, but X and Y are not independent. Correlation
is a measure of the linear dependence between X and Y. In this case, X and
Y are not linearly related, but they are quadratically related.
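As promised in the insurance example, the moments follow mechanically from the
joint pmf. The sketch below (Python, illustrative only) computes E(X), E(Y), the
covariance, and the correlation directly from the table.

  from math import sqrt

  # joint pmf f[(x, y)] from the insurance table; omitted cells have probability 0
  f = {(0, 0): 1/4, (1, 0): 1/4, (2, 0): 1/16,
       (1, 1): 1/4, (2, 1): 1/8, (2, 2): 1/16}

  EX  = sum(x * p for (x, y), p in f.items())
  EY  = sum(y * p for (x, y), p in f.items())
  EXY = sum(x * y * p for (x, y), p in f.items())
  EX2 = sum(x * x * p for (x, y), p in f.items())
  EY2 = sum(y * y * p for (x, y), p in f.items())

  cov = EXY - EX * EY
  rho = cov / sqrt((EX2 - EX**2) * (EY2 - EY**2))
  print(EX, EY, cov, rho)          # 1.0, 0.5, 0.25, 0.577...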
3.5 Sums of Random Variables
1. Suppose that X_1, X_2, . . . , X_k are random variables, where E(X_i) = μ_i,
Var(X_i) = σ_i^2, and Cov(X_i, X_j) = σ_{ij}. Furthermore, suppose that c_1, c_2, . . . , c_k
are known constants. Let T = ∑_{i=1}^{k} c_i X_i. Then,
E(T) = ∑_{i=1}^{k} c_i μ_i and Var(T) = ∑_{i=1}^{k} c_i^2 σ_i^2 + 2 ∑∑_{i<j} c_i c_j σ_{ij}.
Proof: The result concerning E(T) is trivial. To prove the variance result,
begin with the definition of variance:
Var(T) = E[T - E(T)]^2 = E[∑_{i=1}^{k} c_i X_i - ∑_{i=1}^{k} c_i μ_i]^2
= E[∑_{i=1}^{k} c_i (X_i - μ_i)]^2
= E{[∑_{i=1}^{k} c_i (X_i - μ_i)][∑_{j=1}^{k} c_j (X_j - μ_j)]}
= E[∑_{i=1}^{k} ∑_{j=1}^{k} c_i c_j (X_i - μ_i)(X_j - μ_j)]
= ∑_{i=1}^{k} c_i^2 E(X_i - μ_i)^2 + ∑∑_{i≠j} c_i c_j E[(X_i - μ_i)(X_j - μ_j)]
= ∑_{i=1}^{k} c_i^2 σ_i^2 + ∑∑_{i≠j} c_i c_j σ_{ij}
= ∑_{i=1}^{k} c_i^2 σ_i^2 + 2 ∑∑_{i<j} c_i c_j σ_{ij}   because σ_{ij} = σ_{ji}.
(a) Special case: c
i
= 1 for all i. Then,
Var
_
k

i=1
X
i
_
=
k

i=1

2
i
+ 2

i<j

ij
.
(b) Special case: k = 2, c
1
= 1, c
2
= 1. Then
Var(X
1
+X
2
) =
2
1
+
2
2
+ 2
12
.
(c) Special case: k = 2, c
1
= 1, c
2
= 1. Then
Var(X
1
+X
2
) =
2
1
+
2
2
2
12
.
(d) Special case: c
i
= 1 for all i and variables are pairwise uncorrelated.
Then,
Var
_
k

i=1
X
i
_
=
k

i=1

2
i
.
3.5. SUMS OF RANDOM VARIABLES 49
2. Application: Take a simple random sample of size n with replacement from a
population having mean
X
and variance
2
X
. Let T =

n
i=1
X
i
and let
X = T/n. Then
E(T) = n
X
, Var(T) = n
2
X
, E(X) =
X
, and Var(X) =
2
X
/n.
Proof: The random variables are iid. Therefore, E(X
i
) =
X
for all i,
Var(X
i
) =
2
X
for all i, and Cov(X
i
, X
j
) = 0 for all i = j. Therefore,
E(T) = E
_
n

i=1
X
i
_
=
n

i=1
E(X
i
) = n
X
and
Var(T) = Var
_
n

i=1
X
i
_
=
n

i=1

2
X
= n
2
X
.
Furthermore, X =
n

i=1
1
n
X
i
, so. c
i
= 1/n for all i in the formula for X.
Accordingly,
E(X) =
1
n
E(T) =
X
and
Var(X) = Var
_
1
n
T
_
=
1
n
2
Var(T) =

2
X
n
.
3. Application: Take a simple random sample of size n from the pmf f
X
(x)
having expectation
X
and variance
2
X
. Let T =

n
i=1
X
i
and let X = T/n.
Then
E(T) = n
X
, Var(T) = n
2
X
, E(X) =
X
, and Var(X) =
2
X
/n.
Proof: This situation is essentially the same as the situation in item 2 above.
The random variables are iid because taking an observation from f
X
(x) doe
not change the pmf. Use the proof above.
4. Application: Take a simple random sample of size n without replacement from
a nite population of size N having mean
X
and variance
2
X
. Let
T =

n
i=1
X
i
and let X = T/n. Then,
E(T) = n
X
, Var(T) = n
2
X
_
1
(n 1)
(N 1)
_
,
E(X) =
X
, and Var(X) =

2
X
n
_
1
(n 1)
(N 1)
_
.
50 CHAPTER 3. EXPECTATIONS OF DISCRETE RANDOM VARIABLES
Proof: Item 5 on page 33 shows that X
1
, . . . , X
n
are exchangeable random
variables, even though they are not independent. Therefore,
E(X
i
) = E(X
1
) =
X
for i = 1, . . . , n;
Var(X
i
) = Var(X
1
) =
2
X
for i = 1, . . . , n; and
Cov(X
i
, X
j
) = Cov(X
1
, X
2
) =
12
for all i = j.
To nd the value for
12
, use
Var
_
N

i=1
X
i
_
= Var(N
X
) = 0
together with exchangeability. That is,
Var
_
N

i=1
X
i
_
=
N

i=1
Var(X
i
) + 2

i<j
Cov(X
i
, X
j
)
= N
2
X
+N(N 1)
12
= 0
=
12
=

2
X
N 1
.
Accordingly,
E(T) =
n

i=1
E(X
i
) = n
X
,
E(X) = E
_
1
n
T
_
=
1
n
E(T) =
X
,
Var(T) = Var
_
n

i=1
X
i
_
=
n

i=1
Var(X
i
) +

i<j
Cov(X
i
, X
j
)
= n
2
X
+n(n 1)
_


2
X
N 1
_
= n
2
X
_
1
(n 1)
(N 1)
_
, and
Var(X) = Var
_
1
n
T
_
=
_
1
n
_
2
Var(T) =

2
X
n
_
1
(n 1)
(N 1)
_
.
3.6 Probability Generating Functions
1. Denition: Suppose that X is a discrete random variable with support a
subset of the natural numbers. That is,
SX
{0, 1, 2, . . . , }. Let t be a real
number. Then the probability generating function (pgf) of X is

X
(t)
def
= E(t
X
),
where t is chosen to be small enough in absolute value so that the expectation
exists. If |t| < 1, then the expectation always exists.
3.6. PROBABILITY GENERATING FUNCTIONS 51
2. Result (without proof): Probability Generating Functions are Unique. This
result reveals that there is a one-to-one relationship between the pmf and the
pgf. That is, each pmf is associated with exactly one pgf and each pgf is
associated with exactly one pmf. The importance of this result is that we can
use the pgf to nd the pmf.
More specically, the uniqueness results says that if Y is a random variable
with support
SY
{0, 1, 2, . . . , } and
E(t
Y
) = p
0
t
0
+p
1
t
1
+ +p
k
t
k
+ ,
then the pmf of Y is f
Y
(i) = p
i
for i = 0, 1, . . . , . Of course, it also is true
that if Y is a random variable with support
SY
{0, 1, 2, . . . , } and the pmf
of Y is f
Y
(i) = p
i
for i = 0, 1, . . . , ; then
E(t
Y
) = p
0
t
0
+ p
1
t
1
+ +p
k
t
k
+ .
3. Example: Find the pmf of Y if E(t
Y
) = .2t
4
+.3t
8
+.5t
19
.
Solution:
y f
Y
(y)
4 0.2
8 0.3
19 0.5
1.0
4. Result: Suppose that X
1
, . . . , X
n
are independent. Denote the pgf of X
i
by

X
i
(t) for i = 1, . . . , n. If
U =
n

i=1
X
i
, then
U
(t) =
n

i=1

X
i
(t).
Proof: Use results on expectations of functions of independent random
variables. That is,

U
(t) = E(t
U
) = E
_
t
P
n
i=1
X
i
_
= E
_
n

i=1
t
X
i
_
=
n

i=1
E
_
t
X
i
_
=
n

i=1

X
i
(t),
using the result in item 10 on page 42.
52 CHAPTER 3. EXPECTATIONS OF DISCRETE RANDOM VARIABLES
5. Result: Suppose that X
1
, . . . , X
n
are iid. Denote the pgf of X
i
by
X
(t) for
i = 1, . . . , n. If
U =
n

i=1
X
i
, then
U
(t) = [
X
(t)]
n
.
Proof: This result follows directly from the result in item 4, above.
6. Application 1: Bernoulli Binomial. Suppose that X
i
for i = 1, . . . , n are iid
Bernoulli random variables each with probability of success . Find the
distribution of Y =

n
i=1
X
i
. A Bernoulli random variable is a random
variable that has support
S
= {0, 1}. The pmf of a Bernoulli random variable,
B, is
f
B
(b) =
_

_
if b = 1,
1 if b = 0,
0 otherwise.
The pmf of a Bernoulli random variable also can be written as
f
B
(b) =
b
(1 )
1b
I
{0,1}
(b).
Solution: If X
i
are iid Bernoulli random variables each with probability of
success , then
f
X
i
(x) =
x
(1 )
1x
I
{0,1}
(x) and

X
i
(t) = (1 )t
0
+t
1
= (1 ) +t for i = 1, . . . , n.
Using the result from item 5 above, the pgf of Y =

n
i=1
X
i
is

Y
(t) = [t + (1 )]
n
.
Using the binomial theorem from item 1a on page 15, the pgf of Y can be
written as

Y
(t) =
n

i=0
_
n
i
_
(t)
i
(1 )
ni
.
Accordingly,
f
Y
(y) =
_
n
y
_

y
(1 )
ny
for y = 0, 1, . . . , n.
This pmf is called the binomial pmf.
7. Result: A Useful Expansion: Suppose that a is a constant whose value is in
[1, 1], n is an integer constant that satises n 1, and t is a variable that
satises t (1, 1). Dene h(t) as
h(t)
def
= (1 at)
n
.
3.6. PROBABILITY GENERATING FUNCTIONS 53
Then,
h(t) =

r=0
_
n +r 1
r
_
a
r
t
r
.
Proof: Expand h(t) in a Taylor series around t = 0. The rst few derivatives of
h(t), evaluated at t = 0 are
d
dt
h(t) = n(1 at)
(n+1)
(1)a,
d
dt

t=0
= na,
d
2
(dt)
2
h(t) = n(n + 1)(1 at)
(n+2)
(1)a
2
,
d
2
(dt)
2

t=0
= n(n + 1)a
2
,
d
3
(dt)
3
h(t) = n(n + 1)(n + 2)(1 at)
(n+3)
(1)a
3
,
d
3
(dt)
3

t=0
= n(n + 1)(n + 2)a
3
,
.
.
.
d
r
(dt)
r
h(t) = n(n + 1)(n + 2) (n +r 1)(1 at)
(n+r)
(1)a
r
,
d
r
(dt)
r

t=0
= n(n + 1)(n + 2) (n +r 1)a
r
.
Accordingly, the Taylor series is
h(t) =

r=0
d
r
(dt)
r
h(t)

t=0
1
r!
(t 0)
r
=

r=0
_
n +r 1
r
_
a
r
t
r
.
The ratio test veries that the series converges because
lim
r0
_
n +r + 1 1
r + 1
_
(at)
r+1
_
n +r 1
r
_
(at)
r
= lim
r0
n +r
r + 1
at = at and |at| < 1.
8. Application 2: Geometric Negative Binomial. Suppose that U
1
, U
2
, . . . is a
sequence of iid Bernoulli random variables, each with probability of success
54 CHAPTER 3. EXPECTATIONS OF DISCRETE RANDOM VARIABLES
(0, 1). Let X be the number of Bernoulli trials to the rst success. For
example, if U
1
= 1, then X = 1. If U
1
= 0, U
2
= 1, then X = 2. If U
1
= 0,
U
2
= 0, . . ., U
x1
= 0, U
x
= 1, then X = x. The random variable X is called a
geometric random variable and its pmf is
f
X
(x) = (1 )
x1
I
{1,2,...,}
(x).
The pmf follows from independence among the Bernoulli random variables.
Suppose that X
1
, X
2
, . . . , X
n
are iid geometric random variables, each with
parameter . Let Y =

n
i=1
X
i
. The random variable Y is called a negative
binomial random variable with parameters n and . The random variable Y is
the number of Bernoulli trials to the n
th
success. The pmf of Y is
f
Y
(y) =
_
y 1
y n
_

n
(1 )
yn
I
{n,n+1,...,}
(y).
Proof: The pgf of X
i
for t (1, 1) is

X
i
(t) = E
_
t
X
i
_
=

x=1
(1 )
x1
t
x
= t

x=1
[(1 )t]
x1
= t

x=0
[(1 )t]
x
=
t
1 (1 )t
because |(1 )t| < 1.
Accordingly, the pgf of Y is

Y
(t) =
_
t
1 (1 )t
_
n
= (t)
n
[1 (1 )t]
n
.
Using the expansion from item 7 above, where a = (1 ), the pgf of Y

Y
(t) = (t)
n

r=0
_
n +r 1
r
_
(1 )
r
t
r
.
Accordingly,
Pr(Y = n +r) =
_
n +r 1
r
_

n
(1 )
r
.
Let y = n +r and, therefore, r = y n to obtain
f
Y
(y) =
_
y 1
y n
_
(1 )
yn

n
I
{n,n+1,...,}
(y).
3.6. PROBABILITY GENERATING FUNCTIONS 55
9. Application 3: Game of Razzle Dazzle.
(a) Toss 8 fair six-sided dice. Let Y be the sum of the pips.
(b) Let X
i
be the number of pips shown on die i for i = 1, . . . , 8. The total
score is Y =

8
i=1
X
i
. The goal is to nd the pmf of Y
(c) The random variable X
i
has a discrete uniform distribution on
SX
= {1, 2, . . . , 6}. The pgf of X
i
is

X
i
(t) = E
_
t
X
i
_
=
6

i=1
t
i
_
1
6
_
=
_
1
6
_
t t
7
1 t
=
_
t
6
_
1 t
6
1 t
by using the result
N

i=1
a
i
=
a a
N+1
1 a
.
(d) The pgf of Y is

Y
(t) = E
_
t
Y
_
= E
_
t
X
1
+X
2
+X
8
_
= E
_
t
X
1
t
X
2
t
X
8
_
=
8

i=1
E
_
t
X
i
_
=
_
t
6
_
8
_
1 t
6
1 t
_
8
,
using the mutual independence of X
1
, . . . , X
n
.
(e) Use the binomial theorem to write (1 t
6
)
8
as
(1 t
6
)
8
= (t
6
+ 1)
8
=
8

i=0
_
8
i
_
(t
6
)
i
1
8i
=
8

i=0
_
8
i
_
(t
6
)
i
.
(f) If |t| < 1, then the Taylor series expansion of (1 t)
8
around t = 0 is
1
(1 t)
8
=

r=0
_
7 +r
r
_
t
r
using the result in item 7, where a = 1.
(g) Accordingly, the pgf of Y is

Y
=
_
1
6
_
8 8

i=0

j=0
_
8
i
_
(1)
i
_
7 +j
j
_
t
8+6i+j
.
(h) To nd Pr(Y = y), one can sum all coecients in the pgf for which t is
raised to the y
th
power. That is, sum all coecients for which
8 + 6i +j = y.
56 CHAPTER 3. EXPECTATIONS OF DISCRETE RANDOM VARIABLES
(i) Example: Pr(Y = 13) is equal to the coecient that corresponds to
(i, j) = (0, 5) because 6 0 + 5 + 8 = 13. That is,
Pr(Y = 5) =
1
6
8
_
8
0
_
(1)
0
_
7 + 5
5
_
=
22
6
8
0.00047154.
(j) Example: Pr(Y = 20) is found by summing coecients corresponding to
(i, j) = (0, 12), (1, 6), and (2, 0). That is,
Pr(Y = 20) =
1
6
8
_
_
8
0
_
(1)
0
_
7 + 12
12
_
+
_
8
1
_
(1)
1
_
7 + 6
6
_
+
_
8
2
_
(1)
2
_
7 + 0
0
_
_
=
1
6
8
(50388 13728 + 28) =
36,688
6
8
0.0218431.
Chapter 4
Bernoulli and Related Random
Variables
4.1 Sampling Bernoulli Populations
1. A Bernoulli random variable is a random variable with support
S
= {0, 1}.
That is, if X is a Bernoulli rv, then X = 1 (success) or X = 0 (failure).
2. Probability Function (PMF): Denote the probability of success by p. Then,
f
X
(x) =
_

_
p if x = 1, p [0, 1];
1 p if x = 0, p [0, 1]; and
0 otherwise.
This is a family of pmfs indexed by the parameter p.
3. Indicator function: An indicator function is a function that has range {0, 1}.
We will denote indicator functions by the letter I. Specically, I
A
(a) is dened
as
I
A
(a) =
_
1 if a A,
0 otherwise.
Examples:
(a) I
(0,100)
(2.6) = 0.
(b) I
(0,100)
(2.6) = 1.
(c) I
{0,100}
(2.6) = 0.
4. Alternative expressions for pmf:
f
X
(x) = p
x
(1 p)
1x
, for x = 0, 1 and p [0, 1]; and
f
X
(x) = p
x
(1 p)
1x
I
{0,1}
(x)I
[0,1]
(p).
57
58 CHAPTER 4. BERNOULLI AND RELATED RANDOM VARIABLES
5. Moments: E(X
k
) = p for k = 1, 2, . . ..
6. Result:
X
= p and
2
X
= p(1 p).
7. Abbreviation: The notation iid means independently and identically
distributed.
8. Sequences of Bernoulli Random Variables
(a) Sampling With Replacement: The Bernoulli Process
Characteristics (these follow from X
i
iid
Bern(p) for i = 1, 2, . . .)
Non-overlapping sequences of trials are independent
The distribution of a set of consecutive trials is identical to the
distribution of any other set of trials of the same length (this is
called the stationary property)
The distribution of future trials is independent of the results of
past trials.
Joint pmf of X
1
, . . . , X
n
:
f
X
1
,X
2
,...,Xn
(x
1
, x
2
, . . . , x
n
|p) = p
y
(1 p)
ny
I
{0,1,...,n}
(y)I
[0,1]
(p),
where y =

n
i=1
x
i
.
(b) Sampling Without Replacement
Population contains N items, M of which are 1s (successes) and
N M of which are 0s (failures).
Consecutively sample n N items at random without replacement.
Let X
i
be the value of the sampled item on trial i. Note that X
i
is a
Bernoulli rv.
Joint pmf:
f
X
1
,X
2
,...,Xn
(x
1
, x
2
, . . . , x
n
|M, N, n) =
(M)
y
(N M)
ny
(N)
n
I
SY
(y)
=
_
M
y
__
NM
ny
_
_
n
y
__
N
n
_ I
SY
(y), where y =
n

i=1
x
i
and
SY
= {max(0, n +M N), . . . , min(n, M)}.
Note, the support of Y follows from y 0; y n; y M; and
n y N M.
4.2. BINOMIAL DISTRIBUTION 59
4.2 Binomial Distribution
1. Suppose that X
1
, X
2
, . . . , X
n
are iid Bern(p). Then Y =

n
i=1
X
i
Bin(n, p).
2. Probability mass function: f
Y
(y) =
_
n
y
_
p
y
(1 p)
ny
, where y = 0, 1, . . . , n and
p [0, 1]. To justify this result, note that Pr(Y = y) is the probability of a
specic sequence of X
i
s such that

x
i
= y multiplied by the number of
possible sequences that satisfy

x
i
= y. The pmf of Y also can be obtained
from the pmf of X
i
by using probability generating functions. See page 52 of
these notes for details.
3. Moments
(a) E(Y ) = E(

n
i=1
X
i
) = nE(X) = np.
(b) Var(Y ) = Var(

n
i=1
X
i
) = nVar(X) = np(1 p).
4. Table 1a (pp. 648) gives f
Y
(y) = Pr(Y = y)
5. Table 1b (pp. 650) gives

n
y=k
f
Y
(y) = Pr(Y k). Caution, most tables of
the cumulative binomial distribution give Pr(Y k).
6. Reproductive property: If Y
1
, Y
2
, . . . , Y
k
are independently distributed as
Y
i
Bin(n
i
, p), then

k
i=1
Y
i
Bin(n, p), where n =

k
i=1
n
i
. To justify this
result, write each Y
i
as the sum of iid Bernoulli random variables. An
alternative justication is to use probability generating functions. If
Y Bin(n, p), then the pgf of Y is

Y
(t) = E(t
Y
) =
n

y=0
_
n
y
_
p
y
(1 p)
ny
t
y
=
n

y=0
_
n
y
_
(pt)
y
(1 p)
ny
= [pt + (1 p)]
n
by the Binomial Theorem (see page 15). Also, see page 52 for an alternative
derivation of the pgf of Y . Suppose that Y
i
ind
Bin(n
i
, p) for i = 1, . . . , k. Let
W =

k
i=1
Y
i
. The pgf of W is

W
(t) = E(t
W
) =
k

i=1

Y
i
(t) = [pt + (1 p)]
P
k
i=1
n
i
.
The pgf for W has the form of the pgf of a binomial random variable with
parameters

k
i=1
n
i
and p. Therefore,
{Y
i
}
k
i=1
ind
Bin(n
i
, p) =
k

i=1
Y
i
Bin
_
k

i=1
n
i
, p
_
.
60 CHAPTER 4. BERNOULLI AND RELATED RANDOM VARIABLES
4.3 Hypergeometric Distribution
1. Population contains N items, M of which are 1s (successes) and N M of
which are 0s (failures).
2. Sample n N items at random without replacement.
3. Let X
i
be the value of the sampled item on trial i and let Y =

n
i=1
X
i
=
total number of successes. Then, Y HyperG(N, M, n).
4. Probability mass function: f
y
(y|N, M, n) =
_
M
y
__
NM
ny
_
_
N
n
_ , where y is an integer
that satises max(0, n +M N) y min(n, M). Note: this probability is
_
n
y
_
times the probability of a specic sequence of Xs that result in Y = y
successes. the number of possible sequences is
_
n
y
_
.
5. Result: E(Y ) =

n
i=1
E(X
i
) = np, where p = M/N.
6. Result: Suppose that n = N. Then Var(

N
i=1
X
i
) = Var(M) = 0. Use
exchangeability to obtain Cov(X
i
, X
j
) = p(1 p)/(N 1). See item 4 on
page 49.
7. Result:
Var(Y ) = np(1 p)
_
1
n 1
N 1
_
.
Proof: By exchangeability,
Var(Y ) = Var(

n
i=1
X
i
) = nVar(X
1
) +n(n 1) Cov(X
1
, X
2
) =
np(1 p) n(n 1)p(1 p)/(N 1) = np(1 p)
_
1
n1
N1
_
.
8. Application: Mark-recapture studies to estimate population size (see problem
4-18).
9. Application: Fishers exact test for H
0
: p
1
= p
2
in 2 2 tables. Let
Y
1
Bin(n
1
, p
1
) and Y
2
Bin(n
2
, p
2
) be independent random variables.
Assume that p
1
= p
2
and nd the conditional distribution of Y
1
given that
Y
1
+Y
2
= m
1
. Solution: write p
1
and p
2
as p (the common value). Then
Pr(Y
1
= y
1
|Y
1
+Y
2
= m
1
) =
Pr(Y
1
= y
1
and Y
1
+Y
2
= m
1
)
pr(Y
1
+Y
2
= m
1
)
=
Pr(Y
1
= y
1
, Y
2
= m
1
y
1
)
Pr(Y
1
+Y
2
= m
1
)
=
_
n
1
y
1
_
p
y
1
(1 p)
n
1
y
1
_
n
2
m
1
y
1
_
p
m
1
y
1
(1 p)
n
2
m
1
+y
1
_
n
1
+n
2
m
1
_
p
m
1
(1 p)
n
1
+n
2
m
1
4.3. HYPERGEOMETRIC DISTRIBUTION 61
=
_
n
1
y
1
__
n
2
m
1
y
1
_
_
n
1
+n
2
m
1
_ .
Accordingly, conditional on Y
1
+Y
2
= m
1
, the rv Y
1
is distributed as
HyperG(n
1
+n
2
, n
1
, m
1
).
10. Binomial approximation to hypergeometric. If N is large and p = M/N is not
near 0 or 1, then
Y HyperG(N, M, n) =Y Bin(n, p).
Proof: Suppose that both p = M/N and n remain constant while N goes to
innity. Then
lim
N
f
y
(y|N, M, n) = lim
N
_
M
y
__
NM
ny
_
_
N
n
_ = lim
N
_
n
y
_
(M)
y
(N M)
ny
(N)
n
= lim
N
_
n
y
__
M
N
__
M 1
N 1
_

_
M y + 1
N y + 1
_

_
N M
N y
__
N M 1
N y 1
_

_
N M n +y + 1
N n + 1
_
= lim
N
_
n
y
__
M
N
_
_
M
N

1
N
1
1
N
_

_
M
N

y1
N
1
y1
N
_

_
1
M
N
1
y
N
__
1
M
N

1
N
1
y1
N
_

_
1
M
N

ny1
N
1
n1
N
_
=
_
n
y
_
p
y
(1 p)
ny
.
11. Illustration: Suppose n = 20, y = 6, and p = 0.1 or p = 0.5. The binomial
probabilities are
Pr(Y = 6|n = 20, p = 0.1) = 0.0089 and
Pr(Y = 6|n = 20, p = 0.5) = 0.0370.
The hypergeometric probabilities for various population sizes are displayed
below. It can be seen that as N , the hypergeometric probabilities
converge to the binomial probabilities.
62 CHAPTER 4. BERNOULLI AND RELATED RANDOM VARIABLES
10
1
10
2
10
3
10
4
0
0.005
0.01
0.015
0.02
0.025
0.03
0.035
0.04
N
P
(
Y
=
6
)
Hypergeometric Pr(Y=6), n=20
p = 0.1
p = 0.5
4.4 Inverse Sampling
1. Geometric Distribution
(a) Let X
1
, X
2
, . . . be an innite sequence of iid Bern(p) random variables.
(b) Let Z be the trial number in which the rst success occurs. Then,
Z Geo(p).
(c) f
Z
(z) = (1 p)
z1
p, where z = 1, 2, . . . and p (0, 1].
(d) Result: If Z Geo(p), then E(Z) = 1/p. Proof: let q = 1 p. Then
E(Z) =

i=1
ipq
i1
= p

i=1
iq
i1
= p
d
dq

i=1
q
i
= p
d
dq
_
1
1 q
1
_
= p
1
(1 q)
2
=
1
p
.
(e) Result: If Z Geo(p), then Var(Z) = q/p
2
. Hint on proof: let q = 1 p.
Then,
E[Z(Z 1)] =

i=1
i(i 1)pq
i1
= pq

i=1
i(i 1)q
i2
= pq
d
2
dq
2

i=1
q
i
.
2. Negative Binomial Distribution
4.4. INVERSE SAMPLING 63
(a) Let X
1
, X
2
, . . . be an innite sequence of iid Bern(p) random variables.
(b) Let W be the trial number in which the r
th
success occurs.
(c) W NegBin(r, p).
(d) f
W
(w) =
_
w1
r1
_
p
r
(1 p)
wr
, where w = r, r + 1, r + 2, . . . and p (0, 1].
Justication: if the r
th
success occurs on trial w, then there must be
r 1 successes on the rst w 1 trials and one success on trial w. These
events are independent and have probabilities
_
w1
r1
_
p
r1
(1 p)
wr
and p,
respectively.
(e) Result: If W NegBin(r, p), then W

r
i=1
Z
i
, where Z
i
iid
Geo(p), for
i = 1, 2, . . .. Accordingly, E(W) = r/p and Var(W) = rq/p
2
.
(f) Alternative denition of Negative Binomial
Let X
1
, X
2
, . . . be an innite sequence of iid Bern(p) random
variables.
Let Y be the number of failures before the r
th
success occurs. Then,
Y NegBin

(r, p).
f
Y
(y) =
_
r+y1
y
_
p
r
(1 p)
y
, where y = 0, 1, . . . and p (0, 1].
Justication: if the r
th
success occurs on trial y +r, then there must
be y failures on the rst r +y 1 trials and one success on trial
r +y. These events are independent and have probabilities
_
r+y1
y
_
p
r1
(1 p)
y
and p, respectively.
Result: If Y NegBin

(r, p), then Y W r. Accordingly,


E(Y ) = E(W) r = r/p r = r(1 p)/p and
Var(Y ) = Var(W r) = Var(W) = rq/p
2
.
3. Negative Hypergeometric Distribution
(a) Population contains N items, M of which are 1s (successes) and N M
of which are 0s (failures).
(b) Sample items at random without replacement. Let X
i
be the value of the
sampled item on trial i and let W be the trial number in which the r
th
success occurs. Note: r must satisfy r M.
(c) W NegHyperG(N, M, r).
(d) f
W
(w) =
_
M
r1
__
NM
wr
_
_
N
w1
_
_
M r + 1
N w + 1
_
, where w = r, r + 1, . . . , M.
Justication: if the r
th
success occurs on trial w, then there must be
r 1 successes on the rst w 1 trials and one success on trial w. These
events have probabilities
Pr(r 1 successes on rst w 1 trials) =
_
M
r1
__
NM
wr
_
/
_
N
w1
_
and
Pr(success on trial r|r 1 successes on rst w 1 trials) =
(M r + 1)/(N w + 1).
64 CHAPTER 4. BERNOULLI AND RELATED RANDOM VARIABLES
(e) Alternative expression:
f
W
(w) =
_
w1
r1
__
Nw
Mr
_
_
N
M
_ .
Justication: There are
_
N
M
_
ways of arranging the M successes. If
W = w, then the rst w 1 trials must contain r 1 successes, the r
th
trial contains 1 success, and the last N w trials contain Mr successes.
4.5 Approximating Binomial Probabilities
1. Normal Approximation to the Binomial
(a) If Y Bin(n, p) and np(1 p) > 5, then Y N[np, np(1 p)]. This is an
application of the central limit theorem. We will discuss this important
theorem later. Below are displays of four binomial distributions. It can
be seen that if n is small and p is near zero (or one), then the normal
approximation is not very accurate. The textbook recommends that the
normal approximation be used only if np(1 p) 5. before
0 1 2 3 4 5 6 7 8 9 10
0
0.05
0.1
0.15
0.2
0.25
x
P
(
x
)
o
r
f(
x
)
Normal Approx to binomial: n=10, p=0.5
0 1 2 3 4 5 6 7 8 9 10
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
x
P
(
x
)
o
r
f(
x
)
Normal Approx to binomial: n=10, p=0.1
0 5 10 15 20 25 30 35 40 45 50
0
0.02
0.04
0.06
0.08
0.1
0.12
0.14
0.16
0.18
x
P
(
x
)
o
r
f(
x
)
Normal Approx to binomial: n=50, p=0.1
0 10 20 30 40 50 60 70 80 90 100
0
0.02
0.04
0.06
0.08
0.1
0.12
x
P
(
x
)
o
r
f(
x
)
Normal Approx to binomial: n=100, p=0.1
(b) Using a correction for continuity, will improve the accuracy of the normal
approximation. The continuity correction consists of adding
4.5. APPROXIMATING BINOMIAL PROBABILITIES 65
approximating P(Y y) as
P(Y y)
_
y + 0.5 np
_
np(1 p)
_
,
where is the standard normal cumulative distribution function (Table
IIa, pp. 652), rather than as
P(Y y)
_
y np
_
np(1 p)
_
.
Below is a comparison of the approximation with continuity correction
and without continuity correction for the case Y Bin(10, 0.5). Note
that np(1 p) 5 is not satised.
P(Y y)
Normal Approximation
y Exact With Correction Without Correction
0 0.0010 0.0022 0.0008
1 0.0107 0.0134 0.0057
2 0.0547 0.0569 0.0289
3 0.1719 0.1714 0.1030
4 0.3770 0.3759 0.2635
5 0.6230 0.6241 0.5000
6 0.8281 0.8286 0.7365
7 0.9453 0.9431 0.8970
8 0.9893 0.9866 0.9711
9 0.9990 0.9978 0.9943
10 1.0000 0.9997 0.9992
2. Poisson Approximation to the Binomial
(a) If Y Bin(n, p), where n is large and p is small, then
P(Y = y)
e
np
(np)
y
y!
for y = 0, 1, 2, . . . .
(b) Proof: Let = np and write p as p = /n. Examine the binomial pmf as
n goes to innity, but remains constant:
P(Y = y) =
_
n
y
_
p
y
(1 p)
ny
=
n!
y!(n y)!
_

n
_
y
_
1

n
_
ny
66 CHAPTER 4. BERNOULLI AND RELATED RANDOM VARIABLES
=

y
y!
(n)
y
n
y
_
1

n
_
n
_
1

n
_
y
.
Examine the components individually:
lim
n
(n)
y
n
y
= lim
n
_
n
n
_
_
n 1
n
__
n 2
n
_

_
n y + 1
n
_
= 1;
lim
n
_
1

n
_
y
= (1 0)
y
= 1;
lim
n
_
1

n
_
n
= e

.
The latter result can be found in any calculus text book. Accordingly, as
n goes to , P(Y = y) goes to
lim
n
_
n
y
_
p
y
(1 p)
ny
=

y
y!
1 e

1 =
e

y
y!
..
4.6 Poisson Distribution
1. If Y Poi(), then
P(Y = y) =
e

y
y!
for y = 0, 1, 2, . . . .
2. The pmf sums to one because

i=0

i
/i! = e

.
3. The pgf of Y is

Y
(t) = E(t
Y
) =

i=0
e

(t)
i
i!
= e

e
t
= e
(t1)
.
4. Table IV (pp. 658) gives cumulative Poisson probabilities; Pr(Y c).
5. Poisson Process
(a) Stationary Assumption: The probability of y events in a region or time
period does not depend on the location of the region or time period.
(b) Independence Assumption: Events in non-overlapping regions (time
periods) are independent.
(c) In a small region (time period), the probability of one event is
approximately proportional to the size of the region (time period).
4.6. POISSON DISTRIBUTION 67
(d) In a small region (time period), the probability of two or more events in a
small region (time period) is negligible compared to the probability of
one event in the region.
6. Technical Details about Assumptions 3 and 4 above.
(a) Consider an interval (t
0
, t
0
+h]. The interval length is h.
(b) Let X = number of events that occur in the interval.
(c) Assumption #3 says that Pr(X = 1) = h +o(h), where o(h) is a small
remainder term. That is, the probability is approximately proportional to
the size of the area.
(d) o(h) is read as little o of h and is dened as a term that satises
lim
h0
o(h)
h
= 0.
(e) Assumption #3 says that
lim
h0
Pr(X = 1)
h
= lim
h0
h +o(h)
h
= .
(f) Assumption #4 says that Pr(X 2) = o(h).
(g) Assumptions #3 and #4 together imply that Pr(X = 0) = 1 h +o(h).
7. Let Y be the total number of events in a region of size t. If the events follow a
Poisson process, then Y Poi(t).
(a) Divide the total region into n equal parts, each of size h = t/n.
(b) Let X
i
= number of events in the i
th
part.
(c) By the Poisson process assumptions, the X
i
s are independent and
identically distributed. Furthermore, X
i
iid Bern(p) for i = 1, . . . , n,
where p = t/n. The Bernoulli approximation becomes more accurate as
n increases.
(d) Let Y =

n
i=1
X
i
. By the Poisson process assumptions, Y Bin(n, p),
where p = t/n. The binomial approximation becomes more accurate as
n goes to innity.
(e) Let n go to innity. Note that p goes to zero as n goes to innity and
np = t. Accordingly, by the Poisson approximation to the binomial, the
limiting distribution of Y is Poi(t).
8. Moments: if Y Poi(), then
(a) E(Y ) = and
68 CHAPTER 4. BERNOULLI AND RELATED RANDOM VARIABLES
(b) Var(Y ) = .
(c) Justication: Show that E[(Y )
r
] =
r
:
E[(Y )
r
] =

i=0
(i)
r
e

i
/i! = e

i=r
(i)
r

i
/i! = e

j=0

j
/j! =
r
.
9. Reproductive property of Poisson random variables. If Y
1
, Y
2
, . . . , Y
k
are
independently distributed as Y
i
Poi(
i
), then

k
i=1
Y
i
Poi(), where
=

k
i=1

i
. Justication: Recall that if Y
i
Poi(
i
), then the pgf of Y
i
is

Y
i
(t) = exp{(t 1)
i
}. Let W =

k
i=1
Y
i
. Then, by independence, the pgf of
W is

W
(t) =
k

i=1
e
(t1)
i
= e
(t1)
P
k
i=1
i
= e
(t1)
.
Therefore, W Poi(), where =

k
i=1

i
.
10. Proof that Y Poi(t) if events follow a Poisson process. (Optional Material)
(a) Let Y be the total number of events that occur in a region of size t.
(b) Let P
n
(t) = Pr(Y = n) in a region of size t. Note that
P
0
(0) = lim
h0
1 h +o(h) = 1. Accordingly, P
n
(0) = 0 if n 1.
(c) P
0
(t +h) = P
0
(t)P
0
(h) by the stationary and independence assumptions.
To verify this result, divide the region of size t +h into a region of size t
and a region of size h. Then Y = 0 if and only if there are zero events in
each sub-region. These events are independent by the Poisson process
assumptions.
(d) The derivative of P
0
(t) with respect to t can be obtained as follows. By
denition, the derivative is
dP
0
(t)
dt
= lim
h0
P
0
(t +h) P
0
(t)
h
.
Write P
0
(t +h) P
0
(t) as
P
0
(t +h) P
0
(t) = P
0
(t)P
0
(h) P
0
(t) = P
0
(t) [1 P
0
(h)]
= P
0
(t) {1 [1 h +o(h)]} = P
0
(t) [h +o(h)] .
Accordingly,
dP
0
(t)
dt
= lim
h0
P
0
(t +h) P
0
(t)
h
= lim
h0
hP
0
(t) +o(h)
h
= P
0
(t).
Note that,
d ln[P
0
(t)]
dt
=
1
P
0
(t)
dP
0
(t)
dt
= .
4.6. POISSON DISTRIBUTION 69
(e) Integrate to obtain
ln[P
0
(t)] =
_
d ln[P
0
(t)]
dt
dt =
_
dt = t +c.
It follows that P
0
(t) = Ke
t
, where K = e
c
. Furthermore, K = 1
because P
0
(0) = 1 = Ke
0
. Thus, P
0
(t) = e
t
.
(f) P
n
(t +h) =

n
i=0
P
i
(h)P
ni
(t) by the independence and stationary
assumptions. Furthermore, P
n
(t +h) = P
0
(h)P
n
(t) +P
1
(h)P
n1
(t) +o(h)
because

n
i=2
P
i
(h) = o(h).
(g) The derivative of P
n
(t) with respect to t can be obtained as follows. By
denition, the derivative is
dP
n
(t)
dt
= lim
h0
P
n
(t +h) P
n
(t)
h
.
Write P
n
(t +h) P
n
(t) as
P
n
(t +h) P
n
(t) = P
n
(t)P
0
(h) +P
n1
(t)P
1
(h) +o(h) P
0
(t)
= P
n
(t) [1 h +o(h)] +P
n1
(t) [h +o(h)] P
n
(t)
= hP
n
(t) +P
n1
(t)h +o(h).
Accordingly,
dP
n
(t)
dt
= lim
h0
P
0
(t +h) P
0
(t)
h
= lim
h0
hP
n
(t) +P
n1
(t)h +o(h)
h
= P
n
(t) +P
n1
(t).
(h) Note that e
t
_
dP
n
(t)
dt
+P
n
(t)
_
= e
t
P
n1
(t). It follows that
d
dt
e
t
P
n
(t) = e
t
P
n1
(t).
(i) For n = 1,
d
dt
e
t
P
1
(t) = e
t
P
0
(t) = .
(j) Integrate to obtain e
t
P
1
(t) =
_
d
dt
e
t
P
1
(t)dt = t +c. Use P
1
(0) = 0 to
obtain P
1
(t) = e
t
t.
(k) Use induction to prove that P
n
(t) = e
t
(t)
n
/n!. Suppose that
P
i
(t) = e
t
(t)
i
/i! for i = 0, 1, . . . , n 1. Then
d
dt
e
t
P
n
(t) = e
t
P
n1
(t) =
e
t
e
t
(t)
n1
(n 1)!
=

n
t
n1
(n 1)!
.
(l) Integrate to obtain e
t
P
n
(t) = (t)
n
/n! +c. Use P
n
(0) = 0 to obtain
P
n
(t) = e
t
(t)
n
/n!.
70 CHAPTER 4. BERNOULLI AND RELATED RANDOM VARIABLES
4.7 Law of Large Numbers
1. Let X
1
, X
2
, . . . , X
n
be a sequence of iid random variables having mean
E(X) = and Var(X) =
2
. Let X
n
= n
1

n
i=1
X
i
be the sample mean of
the sequence. Then
X
n
prob
as n .
Read this result as X bar converges in probability to as n goes to .
Convergence to in probability means that
lim
n
Pr(|X
n
| > ) = 0 for any > 0.
2. Justication: lim
n

2
x
= lim
n

2
/n = 0. A formal proof using Chebyshevs
inequality will be given in Stat 424.
3. Special case: If X
1
, X
2
, . . . , X
n
is a sequence of iid Bern(p) random variables
and p
n
= n
1

n
i=1
X
i
, then p
n
prob
p.
4. Beware of the gamblers fallacy. Consider a sequence of iid Bern(0.5) random
variables. Some gamblers believe that the law of large numbers means that a
string of 0s is likely to be followed by 1s because the long run proportion of 1s
is 0.5. This interpretation is not correct. The random variables are
independent, so the probability of a 1 given a string of 0s is still 0.5.
4.8 Multinomial Distributions
1. Let X be a categorical random variable with support set
S
= {A
1
, A
2
, . . . , A
k
}
and probability function Pr(X = A
j
) = p
j
for j = 1, . . . , k. That is,
f
X
(a) =
_
p
j
if a = A
j
for j = 1, . . . , k;
0 otherwise.
2. Sequence of iid random variables: Let X
1
, X
2
, . . . , X
n
be a sequence of iid
random variables, each with p.f. f
X
(a). Let y
j
be the number of occurrences
of A
j
in the sequence for j = 1, . . . , k. By exchangeability, the probability of
the sequence is
Pr(X
1
= a
1
, X
2
= a
2
, . . . , X
n
= a
n
) =
k

j=1
p
y
j
j
.
4.8. MULTINOMIAL DISTRIBUTIONS 71
3. Let Y
j
for j = 1, . . . , k be a set of k random variables dened by
Y
j
=
n

i=1

A
j
(X
i
) = number of occurances of A
j
in the sequense,
where

A
j
(X
i
) =
_
1 if X
i
= A
j
;
0 otherwise.
Then Y
1
, Y
2
, . . . , Y
k
have a multinomial distribution.
4. If Y
1
, Y
2
, . . . , Y
k
Multinom(n, p
1
, . . . , p
k
), then
f
Y
1
,Y
2
,...,Y
k
(y
1
, y
2
, . . . , y
k
) =
_
n
y
1
, . . . , y
k
_
k

j=1
p
y
j
j
,
where y
1
, . . . , y
k
are non-negative integers that satisfy

k
j=1
y
j
= n.
Justication: each sequence with y
1
occurrences of A
1
, y
2
occurrences of A
2
,
etc has the same probability. There are
_
n
y
1
,...,y
k
_
such sequences.
5. Special case: k = 2. Suppose that Y
1
, Y
2
Multinom(n, p
1
, p
2
). Then
Y
1
Bin(n, p
1
).
6. Marginal Distributions
(a) Result: Any set of two or more Y s also has a multinomial distribution.
For example, Consider the set Y
1
, . . . , Y
k
. Let Z = Y
4
+Y
5
+ +Y
k
.
Then the triplet Y
1
, Y
2
, Y
3
has distribution
Y
1
, Y
2
, Y
3
, Z Multinom(n, p
1
, p
2
, p
3
,

k
i=4
p
i
).
(b) Justication: collapse categories 4 to k.
(c) Example: Y
j
Bin(n, p
j
). Accordingly, Y
1
, . . . , Y
k
is a set of
non-independent binomial random variables.
(d) E(Y
j
) = np
j
and Var(Y
j
) = np
j
(1 p
j
).
(e) Y
i
+Y
j
Bin(n, p
i
+p
j
) if i = j.
(f) Cov(Y
i
, Y
j
) = np
i
p
j
if i = j. Justication:
Var(Y
i
+Y
j
) = n(p
i
+p
j
)(1 p
i
p
j
) = Var(Y
i
) + Var(Y
j
) + 2 Cov(Y
i
.Y
j
).
Solve for Cov(Y
i
, Y
j
).
7. Conditional Multinomial Distributions
(a) Result: Suppose that
(X
1
, . . . , X
k
1
, Y
1
, . . . , Y
k
2
, Z
1
, . . . , Z
k
3
)
72 CHAPTER 4. BERNOULLI AND RELATED RANDOM VARIABLES
Multinom(n, p
11
, . . . , p
1k
1
, p
21
, . . . , p
2k
2
, p
31
, . . . , p
3,k
3
).
Then the conditional distribution of X
1
, . . . , X
k
1
given
Z
1
= z
1
, Z
2
= z
2
, . . . , Z
k
3
= z
k
3
is
(X
1
, . . . , X
k
1
, Y |Z
1
= z
1
, Z
2
= z
2
, . . . , Z
k
3
= z
k
3
)
Multinom
_
n
k
3

i=1
z
i
,
p
11
p
1
+p
2
, . . . ,
p
1k
1
p
1
+p
2
,
p
2
p
1
+p
2
_
,
where Y = n

z
i

X
i
, p
1
=

p
1i
, and p
2
=

p
2i
.
(b) Proof: In class
(c) Example: Suppose X
1
, . . . X
4
Multinom
_
30,
1
2
,
1
5
,
1
4
,
1
20
_
. Then, the
distribution of X
1
, X
2
given X
4
= 10 is
(X
1
, X
2
, Y |X
4
= 10) Multinom
_
20,
0.5
0.95
,
0.2
0.95
,
0.25
0.95
_
where Y = X
3
= 20 X
1
X
2
.
(d) Example: Suppose X
1
, . . . X
4
Multinom
_
30,
1
2
,
1
5
,
1
4
,
1
20
_
. Then, the
distribution of X
1
given X
3
= 2, X
4
= 12 is
(X
1
, Y |X
3
= 2, X
4
= 12) Multinom
_
16,
0.5
0.7
,
0.2
0.7
_
=X
1
Bin
_
16,
5
7
_
.
4.9 Using Probability Generating Functions
1. Use p.g.f. to derive pmf of sum of iid Bernoulli rvs. Hint: Y =

n
i=1
X
i
, where
X
i
iid
Bern(p).
2. Use p.g.f. to derive pmf of sum of independent Poisson random variables (see
notes on Poisson).
3. Factorial Moment Generating Function
(a) The f.m.g.f. is the same as the p.g.f., except that the expectation E(t
X
)
is required to exist for T in an open neighborhood containing 1.
(b) Dierentiate the f.m.g.f. with respect to t and evaluate at t = 1 to obtain
d
dt

X
(t)

t=1
=
d
dt

xS
t
x
f
X
(x)

t=1
=

xS
d
dt
t
x
f
X
(x)

t=1
=

xS
xt
x1
f
X
(x)

t=1
=

xS
xf
X
(x) = E(X).
4.9. USING PROBABILITY GENERATING FUNCTIONS 73
(c) Take the second derivative of the f.m.g.f. with respect to t and evaluate
at t = 1 to obtain
d
2
dt
2

X
(t)

t=1
=
d
2
dt
2

xS
t
x
f
X
(x)

t=1
=

xS
d
2
dt
2
t
x
f
X
(x)

t=1
=

xS
x(x 1)t
x2
f
X
(x)

t=1
=

xS
x(x 1)f
X
(x)
= E[X(X 1)].
(d) In general,
E[(X)
r
] =
d
r
dt
r

X
(t)

t=1
for r = 1, 2, . . .
if the expectation exists.
(e) Var(X) = E[(X)
2
] + E(X) [E(X)]
2
.
(f) Table of factorial moment generating functions (probability generating
functions): See page 154 of the text. You should be able to derive each of
these pgfs.
74 CHAPTER 4. BERNOULLI AND RELATED RANDOM VARIABLES
Chapter 5
Continuous Random Variables
5.1 Cumulative Distribution Function (CDF)
1. Denition: Let X be a random variable. Then the cdf of X is denoted by
F
X
(x) and dened by
F
X
(x) = P(X x).
If X is the only random variable under consideration, then F
X
(x) can be
written as F(x).
2. Example: Discrete Distribution. Suppose that X Bin(3, 0.5). Then F(x) is
a step function and can be written as
F(x) =
_

_
0 x (, 0);
1
8
x [0, 1);
1
2
x [1, 2);
7
8
x [2, 3);
1 x [3, ).
3. Example: Continuous Distribution. Consider modeling the probability of
vehicle accidents on I-94 in the Gallatin Valley by a Poisson process with rate
per year. Let T be the time until the rst accident. Then
P(T t) = P(at least one accident in time t)
= 1 P(no accidents in time t) = 1
e

0
0!
= 1 e
t
.
Therefore,
F(t) =
_
0 t < 0;
1 e
t
t 0.
75
76 CHAPTER 5. CONTINUOUS RANDOM VARIABLES
4. Example: Uniform Distribution. Suppose that X is a rv with support
S
= [a, b], where b > a. Further, suppose that the probability that X falls in
an interval in
S
is proportional to the length of the interval. That is,
P(x
1
X x
2
) = (x
2
x
1
) for a x
1
x
2
b. To solve for , let x
1
= a
and x
2
= b. then
P(a X b) = 1 = (b a) = =
1
b a
.
Accordingly, the cdf is
F(x) = P(X x) = P(a X x) =
_

_
0 x < a;
x a
b a
x [a, b];
1 x > b.
In this case, X is said to have a uniform distribution: X Unif(a, b).
5. Properties of a cdf
(a) F() = 0 and F() = 1. Your text tries (without success) to motivate
this result by using equation 1 on page 157. Ignore the discussion on the
bottom of page 160 and the top of page 161.
(b) F is non-decreasing; i.e., F(a) F(b) whenever a b.
(c) F(x) is right continuous. That is, lim
0
+ F(x +) = F(x).
6. Let X be a rv with cdf F(x).
(a) If b a, then P(a < X b) = F(b) F(a).
(b) For any x, P(X = x) = lim
0
+
P(x < X x) = F(x) F(x), where
F(x) is F evaluated as x and is an innitesimally small positive
number. If the cdf of X is continuous from the left, then F(x) = F(x)
and P(X = x) = 0. If the cdf of X has a jump at x, then F(x) F(x)
is the size of the jump.
(c) Example: Problem 5-8.
7. Denition of Continuous Distribution: The distribution of the rv X is said to
be continuous if the cdf is continuous at each x and the cdf is dierentiable
(except, possibly, at a countable number of points).
8. Monotonic transformations of a continuous rv: Let X be a continuous rv with
cdf F
X
(x).
5.1. CUMULATIVE DISTRIBUTION FUNCTION (CDF) 77
(a) Suppose that g(X) is a continuous one-to-one increasing function. Then
for y in the counter-domain (range) of g, the inverse function x = g
1
(y)
exists. Let Y = g(X). Find the cdf of Y . Solution:
P(Y y) = P[g(X) y] = P(X g
1
(y)] = F
X
[g
1
(y)].
(b) Suppose that g(X) is a continuous one-to-one decreasing function. Then
for y in the counter-domain (range) of g, the inverse function x = g
1
(y)
exists. Let Y = g(X). Find the cdf of Y . Solution:
P(Y y) = P[g(X) y] = P(X > g
1
(y)] = 1 F
X
[g
1
(y)].
(c) Example: Suppose that X Unif(0, 1), and Y = g(X) = hX +k where
h < 0. Then, X = g
1
(Y ) = (Y k)/h;
F
X
(x) =
_

_
0 x < 0;
x x [0, 1];
1 x > 1,
and
F
Y
(y) = 1 F
x
[(y k)/h] =
_

_
0 y < h +k;
y (h +k)
k
y [h +k, k];
1 y > k,
That is, Y Unif(h +k, k).
(d) Inverse CDF Transformation.
i. Suppose that X is a continuous rv having a strictly increasing cdf
F
X
(x). Recall that a strictly monotone function has an inverse.
Denote the inverse of the cdf by F
1
X
. That is, if F
X
(x) = y, then
F
1
X
(y) = x. Let Y = F
X
(X). Then the distribution of Y is
Unif(0, 1).
Proof: If W Unif(0, 1), then the cdf of W is F
W
(w) = w. The cdf
of Y is
F
Y
(y) = P(Y y) = P(F
X
(X) y) = P[X F
1
X
(y)]
= F
X
_
F
1
X
(y)

= y.
If Y has support [0, 1] and F
Y
(y) = y, then it must be true that
Y Unif(0, 1).
78 CHAPTER 5. CONTINUOUS RANDOM VARIABLES
ii. Let U be a rv with distribution U Unif(0, 1). Suppose that F
X
(x)
is a strictly increasing cdf for a continuous random variable X. Then
the cdf of the rv F
1
X
(U) is F
X
(x).
Proof:
P
_
F
1
X
(U) x

= P [U F
X
(x)] = F
U
[F
X
(x)] = F
X
(x)
because F
U
(u) = u.
(e) Application of inverse cdf transformation: Given U
1
, U
2
, . . . , U
n
, a
random sample from Unif(0, 1), generate a random sample from F
X
(x).
Solution: Let X
i
= F
1
X
(U
i
) for i = 1, . . . , n.
i. Example 1: Suppose that F
X
(x) = 1 e
x
for x > 0, where > 0.
Then X
i
= ln(1 U
i
)/ for i = 1, . . . , n is a random sample from
F
X
.
ii. Example 2: Suppose that
F
X
(x) =
_
1
_
a
x
_
b
_
I
(a,)
(x),
where a > 0 and b > 0 are constants. Then X
i
= a(1 U
i
)
b
for
i = 1, . . . , n is a random sample from F
X
.
9. Non-monotonic transformations of a continuous rv. Let X be a continuous rv
with cdf F
X
(x). Suppose that Y = g(X) is a continuous but non-monotonic
function. As in the case of monotonic functions,
F
Y
(y) = P(Y y) = P[g(X) y], but in this case each inverse solution
x = g
1
(y) must be used to nd an expression for F
Y
(y) in terms of
F
X
[g
1
(y)]. For example, suppose that X Unif(1, 2) and g(X) = Y = X
2
.
Note that x =

y for y [0, 1] and x = +

y for y (1, 4]. The cdf of Y is


F
Y
(y) = P(X
2
y) =
_
P(

y X

y) y [0, 1];
P(X

y) y (1, 4]
=
_
F
X
(

y) F
X
(

y) y [0, 1];
F
X
(

y) y (1, 4]
=
_

_
0 y < 0;
2

y/3 y [0, 1];


(

y + 1)/3 y (1, 4];


1 y > 4;
Plot the function g(x) over x
SX
as an aid to nding the inverse solutions
x = g
1
(y).
5.2. DENSITY AND THE PROBABILITY ELEMENT 79
5.2 Density and the Probability Element
1. Mathematical Result: Assume that F(x) is a continuous cdf. Let g(x +m) be
a dierentiable function and let y = x +m. Then
d
dm
g(x +m)

m=0
=
d
d y
g(y)
d
d m
y

m=0
=
d
d y
g(y)

m=0
=
d
dx
g(x)
by the chain rule.
2. Probability Element: Suppose that X is a continuous rv. Let x be a small
positive number. Dene h(a, b) as
h(a, b)
def
= P(a X a +b) = F
X
(a +b) F
X
(a).
Expand h(x, x) = P(x X x + x) in a Taylor series around x = 0:
h(x, x) = F(x + x) F(x)
= h(x, 0) +
d
dx
h(x, x)

x=0
x +o(x)
= 0 +
d
dx
F(x + x)

x=0
x +o(x)
=
_
d
dx
F(x)
_
x +o(x), where
lim
x0
o(x)
x
= 0.
The function
d F(x) =
_
d
dx
F(x)
_
x
is called the dierential. In the eld of statistics, the dierential of a cdf is
called the probability element. The probability element is an approximation to
h(x, x). Note that the probability element is a linear function of the
derivative
d
dx
F(x).
3. Example; Suppose that
F(x) =
_
0 x < 0;
1 e
3x
otherwise.
Note that F(x) is a cdf. Find the probability element at x = 2 and
approximate the probability P(2 X 2.01). Solution:
d
dx
F(x) = 3e
3x
so
the probability element is 3e
6
x and
P(2 X 2.01) 3e
6
0.01 = 0.00007436. The exact probability is
F(2.01) F(2) = 0.00007326.
80 CHAPTER 5. CONTINUOUS RANDOM VARIABLES
4. The average density in the interval (x, x + x) is dened as
Average density
def
=
P(x < X < x + x)
x
.
5. Density: The probability density function (pdf) at x is the limit of the average
density as x 0:
pdf = f(x)
def
= lim
x0
P(x X x + x)
x
= lim
x0
F
X
(x + x) F
X
(x)
x
=
_
d
dx
F(x)
_
x +o(x)
x
=
d
dx
F(x).
Note that the probability element can be written as d F(x) = f(x)x.
6. Example: Suppose that is a positive real number. If
F(x) =
_
1 e
x
x 0;
0 otherwise.
then f(x) =
d
dx
F(x) =
_
e
x
x 0;
0 otherwise.
7. Example: If X Unif(a, b), then
F(x) =
_

_
0 x < a;
xa
ba
x [a, b];
1 x > b
and f(x) =
d
dx
F(x) =
_

_
0 x < a;
1
ba
x [a, b];
0 x > b.
8. Properties of a pdf
i f(x) 0 for all x.
ii
_

f(x) = 1
9. Relationship between pdf and cdf: If X is a continuous rv with pdf f(x) and
cdf F(x), then
f(x) =
d
dx
F(x)
F(x) =
_
x

f(u)du and
5.2. DENSITY AND THE PROBABILITY ELEMENT 81
P(a < X < b) = P(a X b) = P(a < X b)
= P(a X < b) = F(b) F(a) =
_
b
a
f(x)dx.
10. PDF example - Cauchy distribution. Let f(x) = c/(1 +x
2
) for < x <
and where c is a constant. Note that f(x) is nonnegative and
_

1
1 +x
2
dx = arctan(x)

=

2


2
= .
Accordingly, if we let c = 1/, then
f(x) =
1
(1 +x
2
)
is a pdf. It is called the Cauchy pdf. The corresponding CDF is
F(x) =
_
x

1
(1 +u
2
)
du =
arctan(u)

=
1

_
arctan(x) +

2
_
=
arctan(x)

+
1
2
.
11. PDF example - Gamma distribution: A more general waiting time
distribution: Let T be the time of arrival of the r
th
event in a Poisson process
with rate parameter . Find the pdf of T. Solution: T (t, t +t) if and only
if (a) r 1 events occur before time t and (b) one event occurs in the interval
(t, t +t). The probability that two or more events occur in (t, t +t) is o(t)
and can be ignored. By the Poisson assumptions, outcomes (a) and (b) are
independent and the probability of outcome (b) is t +o(t). Accordingly,
P(t < T < t + t) f(t)t =
e
t
(t)
r1
(r 1)!
t
=
_
e
t

r
t
r1
(r 1)!
_
t
and the pdf is
f(t) =
_

_
0 t < 0;
e
t

r
t
r1
(r 1)!
t 0
=
e
t

r
t
r1
(r)
I
[0,)
(t).
12. Transformations with Single-Valued Inverses: If X is a continuous random
variable with pdf f
X
(x) and Y = g(X) is a single-valued dierentiable
function of X, then the pdf of Y is
f
Y
(y) = f
X
_
g
1
(y)

d
d y
g
1
(y)

82 CHAPTER 5. CONTINUOUS RANDOM VARIABLES


for y
Sg(x)
(i.e., support of Y = g(X)). The term
J(y) =
d
d y
g
1
(y)
is called the Jacobian of the transformation.
(a) Justication 1: Suppose that Y = g(X) is strictly increasing. Then
F
Y
(y) = F
X
[g
1
(y)] and
f
Y
(y) =
d
dy
F
Y
(y) = f
X
_
g
1
(y)

d
dy
g
1
(y)
= f
X
_
g
1
(y)

d
dy
g
1
(y)

because the Jacobian is positive. Suppose that Y = g(X) is strictly


decreasing. Then F
Y
(y) = 1 F
X
[g
1
(y)] and
f
Y
(y) =
d
dy
[1 F
Y
(y)] = f
X
_
g
1
(y)

d
dy
g
1
(y)
= f
X
_
g
1
(y)

d
dy
g
1
(y)

because the Jacobian is negative.


(b) Justication 2: Suppose that g(x) is strictly increasing. Recall that
P(x X x + x) = f
X
(x)x +o(x).
Note that
x X x + x g(x) g(X) g(x + x).
Accordingly,
P(x X x + x) = P(y Y y + y) = f
Y
(y)y +o(y)
= f
X
(x)x +o(x)
where y +y = g(x +x). Expanding g(x + x) around x = 0 reveals
that
y + y = g(x + x) = g(x) +
d g(x)
d x
x +o(x).
Also,
y = g(x) =g
1
(y) = x
=
d g
1
(y)
d y
=
d x
d y
5.2. DENSITY AND THE PROBABILITY ELEMENT 83
=
d y
d x
=
d g(x)
d x
=
_
d g
1
(y)
d y
_
1
=y + y = g(x) +
_
d g
1
(y)
d y
_
1
x
=y =
_
d g
1
(y)
d y
_
1
x
=x =
d g
1
(y)
d y
y.
Lastly, equating f
X
(x)x to f
Y
(y)y reveals that
f
Y
(y)y = f
X
(x)x = f
X
_
g
1
(y)

x
= f
X
_
g
1
(y)

d g
1
(y)
d y
y
=f
Y
(y) = f
X
_
g
1
(y)

d g
1
(y)
d y
.
The Jacobian
dg
1
(y)
dy
is positive for an increasing function, so the absolute
value operation is not necessary. A similar argument can be made for the
case when g(x) is strictly decreasing.
13. Transformations with Multiple-Valued Inverses: If g(x) has more than one
inverse function, then a separate probability element must be calculated for
each of the inverses. For example, suppose that X Unif(1, 2) and
Y = g(X) = X
2
. There are two inverse functions for y [0, 1], namely
x =

y and x = +

y. There is a single inverse function for y (1, 4]. The


pdf of Y is found as
f(y) =
_

_
0 y < 0;
f(

y)

y
dy

+f(

y)

y
dy

y [0, 1];
f(

y)

y
dy

y (1, 4];
0 y > 4.
=
_

_
0 y < 0;
1
3

y
y [0, 1];
1
6

y
y (1, 4];
0 y > 4.
84 CHAPTER 5. CONTINUOUS RANDOM VARIABLES
5.3 The Median and Other Percentiles
1. Denition: The number x
p
is said to be the 100p
th
percentile of the
distribution of X if x
p
satises
F
X
(x
p
) = P(X x
p
) = p.
2. If the cdf F
X
(x) is strictly increasing, then x
p
= F
1
X
(p) and x
p
is unique.
3. If F
X
(x) is not strictly increasing, then x
p
may not be unique.
4. Median: The median is the 50
th
percentile (i.e., p = 0.5).
5. Quartiles: The rst and third quartiles are x
0.25
and x
0.75
respectively.
6. Example: If X Unif(a, b), then (x
p
a)/(b a) = p; x
p
= a +p(b a); and
x
0.5
= (a +b)/2.
7. Example: If F
X
(x) = 1 e
x
(i.e., waiting time distribution), then
1 e
xp
= p; x
p
= ln(1 p)/; and x
0.5
= ln(2)/.
8. ExampleCauchy: Suppose that X is a random variable with pdf
f(x) =
1

_
1 +
(x )
2

2
_,
where < x < ; > 0; and is a nite number. Then
F(x) =
_
x

f(u)du =
_ x

1
(1 +z
2
)
dz
_
make the change of variable from x to z =
x

_
=
1

arctan
_
x

_
+
1
2
.
Accordingly,
F(x
p
) = p =
x
p
= + tan [(p 0.5)] ;
x
0.25
= + tan [(0.25 0.5)] = ;
x
0.5
= + tan(0) = ; and
x
0.75
= + tan [(0.75 0.5)] = +.
5.4. EXPECTED VALUE 85
9. Denition Symmetric Distribution: A distribution is said to be symmetric
around c if F
X
(c ) = 1 F
X
(c +) for all .
10. Denition Symmetric Distribution: A distribution is said to be symmetric
around c if f
X
(c ) = f
X
(c +) for all .
11. Median of a symmetric distribution. Suppose that the distribution of X is
symmetric around c. Then, set to c x
0.5
to obtain
F
X
(x
0.5
) =
1
2
= 1 F
X
(2c x
0.5
) =F
X
(2c x
0.5
) =
1
2
=c = x
0.5
.
That is, if the distribution of X is symmetric around c, then the median of the
distribution is c.
5.4 Expected Value
1. Denition: Let X be a rv with pdf f(x). Then the expected value (or mean)
of X, if it exists, is
E(X) =
X
=
_

xf(x)dx.
2. The expectation is said to exist if the integral of the positive part of the
function is nite and the integral of the negative part of the function is nite.
5.5 Expected Value of a Function
1. Let X be a rv with pdf f(x). Then the expected value of g(X), if it exists, is
E[g(X)] =
_

g(x)f(x)dx.
2. Linear Functions. The integral operator is linear. If g
1
(X) and g
2
(X) are
functions whose expectation exists and a, b, c are constants, then
E[ag
1
(X) +bg
2
(X) +c] = aE[g
1
(X)] +bE[g
2
(X)] +c.
3. Symmetric Distributions: If the distribution of X is symmetric around c and
the expectation exists, then E(X) = c. Proof. Assume that the mean exists.
First, show that E(X c) = 0:
E(X c) =
_

(x c)f(x)dx
86 CHAPTER 5. CONTINUOUS RANDOM VARIABLES
=
_
c

(x c)f(x)dx +
_

c
(x c)f(x)dx
( let x = c u in integral 1 and let x = c + u in integral 2)
=
_

0
uf(c u)du +
_

0
uf(c +u)du
=
_

0
u [f(c +u) f(c u)] du = 0
by symmetry of the pdf around c. Now use E(X c) = 0 E(X) = c.
4. Example: Suppose that X Unif(a, b). That is,
f(x) =
_
_
_
1
b a
x [a, b];
0 otherwise.
A sketch of the pdf shows that the distribution is symmetric around (a +b)/2.
More formally,
f
_
a +b
2

_
= f
_
a +b
2
+
_
=
_
_
_
1
b a

_

ba
2
,
ba
2

;
0 otherwise.
Accordingly, E(X) = (a +b)/2. Alternatively, the expectation can be found by
integrating xf(x):
E(X) =
_

xf(x) dx =
_
b
a
x
b a
dx
=
x
2
2(b a)

b
a
=
b
2
a
2
2(b a)
=
(b a)(b +a)
2(b a)
=
a +b
2
.
5. Example: Suppose that X has a Cauchy distribution. The pdf is
f(x) =
1

_
1 +
(x )
2

2
_,
where and are constants that satisfy || < and (0, ). By
inspection, it is apparent that the pdf is symmetric around . Nonetheless,
the expectation is not , because the expectation does not exist. That is,
_

xf(x)dx =
_

_
1 +
(x )
2

2
_dx
5.6. AVERAGE DEVIATIONS 87
= +
_

z
(1 +z
2
)
dz where z =
x

= +
_
0

z
(1 +z
2
)
dz +
_

0
z
(1 +z
2
)
dz
= +
ln(1 +z
2
)
2

+
ln(1 +z
2
)
2

0
and neither the positive nor the negative part is nite.
6. Example: Waiting time distribution. Suppose that X is a rv with pdf e
x
for x > 0 and where > 0. Then, using integration by parts,
E(X) =
_

0
xe
x
dx = xe
x

0
+
_

0
e
x
dx
= 0
1

e
x

0
=
1

.
5.6 Average Deviations
1. Variance
(a) Denition:
Var(X)
def
= E(X
X
)
2
=
_

(x
X
)
2
f(x)dx
if the expectation exists. It is conventional to denote the variance of X
by
2
X
.
(b) Computational formula: Be able to verify that
Var(X) = E(X
2
) [E(X)]
2
.
(c) Example: Suppose that X Unif(a, b). Then
E(X
r
) =
_
b
a
x
r
b a
dx =
x
r+1
(r + 1)(b a)

b
a
=
b
r+1
a
r+1
(r + 1)(b a)
.
Accordingly,
X
= (a +b)/2,
E(X
2
) =
b
3
a
3
3(b a)
=
(b a)(b
2
+ab +a
2
)
3(b a)
=
b
2
+ab +a
2
3
and
Var(X) =
b
2
+ab +a
2
3

(b +a)
2
4
=
b
2
2ab +a
2
12
=
(b a)
2
12
.
88 CHAPTER 5. CONTINUOUS RANDOM VARIABLES
(d) Example: Suppose that f(x) = e
x
for x > 0 and where > 0. Then
E(X) = 1/,
E(X
2
) =
_

0
x
2
e
x
dx
= x
2
e
x

0
+
_

0
2xe
x
dx = 0 +
2

2
and
Var(X) =
2

2

1

2
=
1

2
.
2. MAD
(a) Denition:
Mad(X)
def
= E(|X
X
|) =
_

|x
X
|f(x)dx.
(b) Alternative expression: First, note that
E(|X c|) =
_
c

(c x)f(x)dx +
_

c
(x c)f(x)dx
= c [2F
X
(c) 1]
_
c

xf(x)dx +
_

c
xf(x)dx.
Accordingly,
Mad(X) =
X
[2F
X
(
X
) 1]
_

X

xf(x)dx +
_

X
xf(x)dx.
(c) Leibnitzs Rule: Suppose that a(), b(), and g(x, ) are dierentiable
functions of . Then
d
d
_
b()
a()
g(x, )dx = g [b(), ]
d
d
b() g [a(), ]
d
d
a()
+
_
b()
a()
d
d
g(x, )dx.
(d) Result: If the expectation E(|X c|) exists, then the minimizer of
E(|X c|) with respect to c is c = F
1
X
(0.5) = median of X.
Proof:: Set the derivative of E(|X c|) to zero and solve for c:
d
dc
E(|X c|)
=
d
dc
_
c [2F
X
(c) 1]
_
c

xf
X
(x)dx +
_

c
xf
X
(x)dx
_
5.6. AVERAGE DEVIATIONS 89
= 2F
X
(c) 1 + 2cf
X
(c) cf
X
(c) cf
X
(c)
= 2F
X
(c) 1.
Equating the derivative to zero and solving for c reveals that c is a
solution to F
X
(c) = 0.5. That is, c is the median of X. Use the second
derivative test to verify that the solution is a minimizer:
d
2
dc
2
E(|X c|) =
d
dc
[2F
X
(c) 1] = 2f
X
(c) > 0
=c is a minimizer.
(e) Example: Suppose that X Unif(a, b). Then F
X
(
a+b
2
) = 0.5 and
Mad(X) =
_ a+b
2
a
x
b a
dx +
_
b
a+b
2
x
b a
dx =
b a
4
.
(f) Example: Suppose that f
X
(x) = e
x
for x > 0 and where > 0. Then
E(X) = 1/, Median(X) = ln(2)/, F
X
(x) = 1 e
x
, and
Mad(X) =
1

_
2 2e
1
1

_

1
0
xe
x
dx +
_

1
xe
x
dx =
2
e
,
where
_
xe
x
dx = xe
x

1
e
x
has been used. The mean
absolute deviation from the median is
E

X
ln(2)

=
_
ln(2)
1
0
xe
x
dx +
_

ln(2)
1
xe
x
dx
=
ln(2)

.
3. Standard Scores
(a) Let Z =
X
X

X
.
(b) Moments: E(Z) = 0 and Var(Z) = 1.
(c) Interpretation: Z scores are scaled in standard deviation units.
(d) Inverse Transformation: X =
X
+
X
Z.
90 CHAPTER 5. CONTINUOUS RANDOM VARIABLES
5.7 Bivariate Distributions
1. Denition: A function f
X,Y
(x, y) is a bivariate pdf if
(i) f
X,Y
(x, y) 0 for all x, y and
(ii)
_

f
X,Y
(x, y)dxdy = 1.
2. Bivariate CDF: If f
X,Y
(x, y) is a bivariate pdf, then
F
X,Y
(x, y) = P(X x, Y y) =
_
x

_
y

f
X,Y
(u, v)dvdu.
3. Properties of a bivariate cdf:
(i) F
X,Y
(x, ) = F
X
(x)
(ii) F
X,Y
(, y) = F
Y
(y)
(iii) F
X,Y
(, ) = 1
(iv) F
X,Y
(, y) = F
X,Y
(x, ) = F
X,Y
(, ) = 0
(v) f
X,Y
(x, y) =

2
xy
F
X,Y
(x, y).
4. Joint pdfs and joint cdfs for three or more random variables are obtained as
straightforward generalizations of the above denitions and conditions.
5. Probability Element: f
X,Y
(x, y)xy is the joint probability element. That is,
P(x X x + x, y Y y + y) = f
X,Y
(x, y)xy +o(xy).
6. Example: Bivariate Uniform. If (X, Y ) Unif(a, b, c, d), then
f
X,Y
(x, y) =
_
_
_
1
(b a)(d c)
x (a, b), y (c, d);
0 otherwise.
For this density, the probability P(x
1
X x
2
, y
1
Y y
2
) is the volume of
the rectangle. For example, if (X, Y ) Unif(0, 4, 0, 6), then
P(2.5 X 3.5, 1 Y 4) = (3.5 2.5)(4 1)/(4 6) = 3/24. Another
example is P(X
2
+Y
2
> 16) = 1 P(X
2
+Y
2
16) = 1 4/24 = 1 /6
because the area of a circle is r
2
and therefore, the area of a circle with
radius 4 is 16 and the area of the quarter circle in the support set is 4.
5.7. BIVARIATE DISTRIBUTIONS 91
7. Example: f
X,Y
(x, y) =
6
5
(x +y
2
) for x (0, 1) and y (0, 1). Find
P(X +Y < 1). Solution: First sketch the region of integration, then use
calculus:
P(X +Y < 1) = P(X < 1 Y ) =
_
1
0
_
1y
0
6
5
(x +y
2
)dxdy
=
6
5
_
1
0
_
x
2
2
+xy
2
_

1y
0
dy
=
6
5
_
1
0
(1 y)
2
2
+ (1 y)y
2
dy
=
6
5
_
y
2

y
2
2
+
y
3
6
+
y
3
3

y
4
4
_

1
0
=
3
10
.
8. Example: Bivariate standard normal
f
X,Y
(x, y) =
e

1
2
(x
2
+y
2
)
2
=
e

1
2
x
2

2
e

1
2
y
2

2
= f
X
(x)f
Y
(y).
Using numerical integration, P(X +Y < 1) = 0.7602. The matlab code is
g = inline(normpdf(y).*normcdf(1-y),y);
Prob=quadl(g,-5,5)
where has been approximated by 5.
9. Marginal Densities:
(a) Integrate out unwanted variables to obtain marginal densities. For
example,
f
X
(x) =
_

f
X,Y
(x, y)dy; f
Y
(y) =
_

f
X,Y
(x, y)dx;
and f
X,Y
(x, y) =
_

f
W,X,Y,Z
(w, x, y, z)dwdz.
(b) Example: If f
X,Y
(x, y) =
6
5
(x +y
2
) for x (0, 1) and y (0, 1), then
f
X
(x) =
6
5
_
1
0
(x +y
2
)dy =
6x + 2
5
for x (0, 1) and
f
Y
(y) =
6
5
_
1
0
(x +y
2
)dx =
6y
2
+ 3
5
for y (0, 1).
92 CHAPTER 5. CONTINUOUS RANDOM VARIABLES
10. Expected Values
(a) The expected value of a function g(X, Y ) is
E[g(X, Y )] =
_

g(x, y)f
X,Y
(x, y)dxdy.
(b) Example: If f
X,Y
(x, y) =
6
5
(x +y
2
) for x (0, 1) and y (0, 1), then
E(X) =
_
1
0
_
1
0
x
6
5
(x +y
2
)dxdy =
_
1
0
3y
2
+ 2
5
dy =
3
5
.
5.8 Several Variables
1. The joint pdf of n continuous random variables, X
1
, . . . , X
n
is a function that
satises
(i) f(x
1
, . . . , x
n
) 0, and
(ii)
_

f(x
1
, . . . , x
n
) dx
1
dx
n
= 1.
2. Expectations are linear regardless of the number of variables:
E
_
k

i=1
a
i
g
i
(X
1
, X
2
, . . . , X
n
)
_
=
k

i=1
a
i
E[g
i
(X
1
, X
2
, . . . , X
n
)]
if the expectations exist.
3. Exchangeable Random variables
(a) Let x

1
, . . . , x

n
be a permutation of x
1
, . . . , x
n
. Then, the joint density of
X
1
, . . . , X
n
is said to be exchangeable if
f
X
1
,...,Xn
(x
1
, . . . , x
n
) = f
X
1
,...,Xn
(x

1
, . . . , x

n
)
for all x
1
, . . . , x
n
and for all permutations x

1
, . . . , x

n
.
(b) Result: If the joint density is exchangeable, then all marginal densities
are identical. For example,
f
X
1
,X
2
(x
1
, x
2
) =
_

f
X
1
,X
2
,X
3
(x
1
, x
2
, x
3
) dx
3
=
_

f
X
1
,X
2
,X
3
(x
3
, x
2
, x
1
) dx
3
by exchangeability
=
_

f
X
1
,X
2
,X
3
(x
1
, x
2
, x
3
) dx
1
by relabeling variables
= f
X
2
,X
3
(x
2
, x
3
).
5.9. COVARIANCE AND CORRELATION 93
(c) Result: If the joint density is exchangeable, then all bivariate marginal
densities are identical, and so forth.
(d) Result: If the joint density is exchangeable, then the moments of X
i
(if
they exist) are identical for all i.
(e) Example Suppose that f
X,Y
(x, y) = 2 for x 0, y 0, and x +y 1.
Then
f
X
(x) =
_
1x
0
2dy = 2(1 x) for x (0, 1)
f
Y
(y) =
_
1y
0
2dx = 2(1 y) for y (0, 1) and
E(X) = E(Y ) =
1
3
.
5.9 Covariance and Correlation
1. Review covariance and correlation results for discrete random variables
(Section 3.4) because they also hold for continuous random variables. Below
are lists of the most important denitions and results.
(a) Denitions
Cov(X, Y )
def
= E[(X
X
)(Y
Y
)].
Cov(X, Y ) is denoted by
X,Y
.
Var(X) = Cov(X, X).
Corr(X, Y )
def
= Cov(X, Y )/
_
Var(X) Var(Y ).
Corr(X, Y ) is denoted by
X,Y
.
(b) Covariance and Correlation Results (be able to prove any of these).
Cov(X, Y ) = E(XY ) E(X)E(Y ).
Cauchy-Schwartz Inequality: [E(XY )]
2
E(X
2
)E(Y
2
).

X,Y
[1, 1] To proof, use the Cauchy-Schwartz inequality.
Cov(a +bX, c +dY ) = bd Cov(X, Y ).
Cov
_

i
a
i
X
i
,

i
b
i
Y
i
_
=

j
a
i
b
j
Cov(X
i
, Y
j
). For example,
Cov(aW +bX, cY +dZ) =
ac Cov(W, Y ) +ad Cov(W, Z) +bc Cov(X, Y ) +bd Cov(X, Z).
Corr(a +bX, c +dY ) = sign(bd) Corr(X, Y ).
94 CHAPTER 5. CONTINUOUS RANDOM VARIABLES
Var
_

i
X
i
_
=

j
Cov(X
i
, X
j
) =

i
Var(X
i
) +

i=j
Cov(X
i
, X
j
).
Parallel axis theorem: E(X c)
2
= Var(X) + (
X
c)
2
. Hint on
proof: rst add zero X c = (X
X
) + (
X
c), then take
expectation.
2. Example (simple linear regression with correlated observations): Suppose that Y_i = α + βx_i + ε_i for i = 1, ..., n, where ε_1, ..., ε_n have an exchangeable distribution with E(ε_1) = 0, Var(ε_1) = σ^2, and Cov(ε_1, ε_2) = ρσ^2. The ordinary least squares estimator of β is

        β̂ = Σ_{i=1}^n (x_i − x̄) Y_i / Σ_{i=1}^n (x_i − x̄)^2.

   Find the expectation and variance of β̂. Solution: Write β̂ as

        β̂ = Σ_{i=1}^n w_i Y_i,   where   w_i = (x_i − x̄) / Σ_{j=1}^n (x_j − x̄)^2.

   Then the expectation of β̂ is

        E(β̂) = Σ_{i=1}^n w_i E(Y_i) = Σ_{i=1}^n w_i (α + βx_i)
              = β Σ_{i=1}^n (x_i − x̄) x_i / Σ_{j=1}^n (x_j − x̄)^2    because Σ_{i=1}^n (x_i − x̄) = 0
              = β    because Σ (x_i − x̄) x_i = Σ (x_i − x̄)^2.

   The variance of β̂ is

        Var(β̂) = Σ_{i=1}^n w_i^2 Var(Y_i) + Σ_{i≠j} w_i w_j Cov(Y_i, Y_j)
               = σ^2 Σ_{i=1}^n w_i^2 + ρσ^2 Σ_{i≠j} w_i w_j
               = σ^2 / Σ_{i=1}^n (x_i − x̄)^2 + ρσ^2 [ (Σ_{i=1}^n w_i)(Σ_{j=1}^n w_j) − Σ_{i=1}^n w_i^2 ]
               = σ^2 (1 − ρ) / Σ_{i=1}^n (x_i − x̄)^2,

   because Σ_{i=1}^n w_i = 0 and Σ_{i=1}^n w_i^2 = 1 / Σ_{i=1}^n (x_i − x̄)^2.
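The variance formula above can be checked by simulation. The sketch below is not part of the notes; it assumes NumPy, takes ρ ≥ 0, and builds exchangeable errors from a shared component plus iid components (which gives Var(ε_i) = σ^2 and Cov(ε_i, ε_j) = ρσ^2). The design points and parameter values are arbitrary.

    # Monte Carlo check of Var(beta_hat) = sigma^2 (1 - rho) / sum((x_i - xbar)^2)
    import numpy as np

    rng = np.random.default_rng(1)
    n, alpha, beta, sigma, rho = 10, 1.0, 2.0, 3.0, 0.4
    x = np.linspace(0.0, 1.0, n)
    w = (x - x.mean()) / np.sum((x - x.mean()) ** 2)

    reps = 200_000
    common = rng.normal(size=(reps, 1)) * sigma * np.sqrt(rho)        # shared error component
    indiv = rng.normal(size=(reps, n)) * sigma * np.sqrt(1.0 - rho)   # individual components
    y = alpha + beta * x + common + indiv
    beta_hat = y @ w                                                  # OLS slope, one per replicate

    print(beta_hat.var())                                             # simulated variance
    print(sigma ** 2 * (1 - rho) / np.sum((x - x.mean()) ** 2))       # theoretical variance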
5.10 Independence

1. Definition: Continuous random variables X and Y are said to be independent if their joint pdf factors into a product of the marginal pdfs. That is,

        f_{X,Y}(x, y) = f_X(x) f_Y(y)   ⟺   X ⊥ Y.

2. Example: If f_{X,Y}(x, y) = 2 for x ∈ (0, 0.5) and y ∈ (0, 1), then X ⊥ Y. Note that the joint pdf can be written as

        f_{X,Y}(x, y) = 2 I_{(0,0.5)}(x) I_{(0,1)}(y) = [2 I_{(0,0.5)}(x)] × [I_{(0,1)}(y)] = f_X(x) × f_Y(y),

   where

        I_A(x) = 1 if x ∈ A;  0 otherwise.

3. Example: If f_{X,Y}(x, y) = 8xy for 0 ≤ x ≤ y ≤ 1, then X and Y are not independent. Note that

        f_{X,Y}(x, y) = 8xy I_{(0,1)}(y) I_{(0,y)}(x),

   but

        f_X(x) = ∫_x^1 f_{X,Y}(x, y) dy = 4x(1 − x^2) I_{(0,1)}(x)   and
        f_Y(y) = ∫_0^y f_{X,Y}(x, y) dx = 4y^3 I_{(0,1)}(y).
4. Note: Cov(X, Y) = 0 does not imply X ⊥ Y. For example, if

        f_{X,Y}(x, y) = (1/3) I_{(1,2)}(x) I_{(−x,x)}(y),

   then

        E(X) = ∫_1^2 ∫_{−x}^x (x/3) dy dx = ∫_1^2 (2x^2/3) dx = 14/9,
        E(Y) = ∫_1^2 ∫_{−x}^x (y/3) dy dx = ∫_1^2 [x^2 − (−x)^2]/6 dx = 0,   and
        E(XY) = ∫_1^2 ∫_{−x}^x (xy/3) dy dx = ∫_1^2 x[x^2 − (−x)^2]/6 dx = 0.

   Accordingly, X and Y have correlation 0, but they are not independent.
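A simulation makes the distinction concrete: the correlation is essentially zero, yet any function that detects the changing spread of Y with x (such as |Y|) is clearly correlated with X. The sketch below is not from the notes and assumes NumPy.

    # Zero correlation but dependence for f(x,y) = (1/3) I_(1,2)(x) I_(-x,x)(y)
    import numpy as np

    rng = np.random.default_rng(2)
    n = 500_000
    x = rng.uniform(1.0, 2.0, size=n)       # X ~ Unif(1, 2)
    y = rng.uniform(-x, x)                  # Y | X = x ~ Unif(-x, x)

    print(np.corrcoef(x, y)[0, 1])          # approximately 0
    print(np.corrcoef(x, np.abs(y))[0, 1])  # clearly positive, so X and Y are dependent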
5. Result: Let A and B be subsets of the real line. Then random variables X and Y are independent if and only if

        P(X ∈ A, Y ∈ B) = P(X ∈ A) P(Y ∈ B)

   for all choices of sets A and B. Proof in class.

6. Result: If X and Y are independent, then so are g(X) and h(Y) for any g and h. Proof: Let A be any set of intervals in the range of g(x) and let B be any set of intervals in the range of h(y). Denote by g^{−1}(A) the set of all intervals in the support of X that satisfy x ∈ g^{−1}(A) ⟺ g(x) ∈ A. Similarly, denote by h^{−1}(B) the set of all intervals in the support of Y that satisfy y ∈ h^{−1}(B) ⟺ h(y) ∈ B. If X ⊥ Y, then

        P[g(X) ∈ A, h(Y) ∈ B] = P[X ∈ g^{−1}(A), Y ∈ h^{−1}(B)]
                              = P[X ∈ g^{−1}(A)] P[Y ∈ h^{−1}(B)]
                              = P[g(X) ∈ A] P[h(Y) ∈ B].

   The above equality implies that g(X) ⊥ h(Y) because the factorization is satisfied for all A and B in the range spaces of g(X) and h(Y). Note that we already proved this result for discrete random variables.

7. The previous two results readily extend to any number of random variables (not just two).

8. Suppose that X_i for i = 1, ..., n are independent. Then
   (a) g_1(X_1), ..., g_n(X_n) are independent,
   (b) the Xs in any subset are independent,
   (c) Var(Σ a_i X_i) = Σ a_i^2 Var(X_i), and
   (d) if the Xs are iid with variance σ^2, then Var(Σ a_i X_i) = σ^2 Σ a_i^2.
5.11 Conditional Distributions

1. Definition: If f_{X,Y}(x, y) is a joint pdf, then the pdf of Y, conditional on X = x, is

        f_{Y|X}(y|x) = f_{X,Y}(x, y)/f_X(x),   provided that f_X(x) > 0.

2. Example: Suppose that X and Y have joint distribution

        f_{X,Y}(x, y) = 8xy for 0 < x < y < 1.

   Then,

        f_X(x) = ∫_x^1 f_{X,Y}(x, y) dy = 4x(1 − x^2),   0 < x < 1;
        E(X^r) = ∫_0^1 4x(1 − x^2) x^r dx = 8/[(r + 2)(r + 4)];
        f_Y(y) = 4y^3,   0 < y < 1;
        E(Y^r) = ∫_0^1 4y^3 y^r dy = 4/(r + 4);
        f_{X|Y}(x|y) = 8xy/(4y^3) = 2x/y^2,   0 < x < y;   and
        f_{Y|X}(y|x) = 8xy/[4x(1 − x^2)] = 2y/(1 − x^2),   x < y < 1.

   Furthermore,

        E(X^r | Y = y) = ∫_0^y x^r (2x/y^2) dx = 2y^r/(r + 2)   and
        E(Y^r | X = x) = ∫_x^1 y^r [2y/(1 − x^2)] dy = 2(1 − x^{r+2})/[(r + 2)(1 − x^2)].
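The conditional-moment formulas can be verified numerically; the sketch below is not from the notes, assumes SciPy, and uses arbitrary values r = 1 and y = 0.8.

    # Check of E(X^r | Y = y) = 2 y^r / (r + 2) for the joint pdf 8xy on 0 < x < y < 1
    from scipy.integrate import quad

    def cond_moment(r, y):
        # integrate x^r against f_{X|Y}(x|y) = 2x / y^2 on (0, y)
        return quad(lambda x: x ** r * 2.0 * x / y ** 2, 0.0, y)[0]

    r, y = 1, 0.8
    print(cond_moment(r, y), 2.0 * y ** r / (r + 2))   # both approximately 0.5333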
3. Regression Function: Let (X, Y) be a pair of random variables with joint pdf f_{X,Y}(x, y). Consider the problem of predicting Y after observing X = x. Denote the predictor by ŷ(x). The best predictor is defined as the function Ŷ(X) that minimizes

        SSE = E[Y − Ŷ(X)]^2 = ∫∫ [y − ŷ(x)]^2 f_{X,Y}(x, y) dy dx.

   (a) Result: The best predictor is ŷ(x) = E(Y | X = x).
   (b) Proof: Write f_{X,Y}(x, y) as f_{Y|X}(y|x) f_X(x). Accordingly,

           SSE = ∫ { ∫ [y − ŷ(x)]^2 f_{Y|X}(y|x) dy } f_X(x) dx.

       To minimize SSE, minimize the quantity in { } for each value of x. Note that ŷ(x) is a constant with respect to the conditional distribution of Y given X = x. By the parallel axis theorem, the quantity in { } is minimized by ŷ(x) = E(Y | X = x).

   (c) Example: Suppose that X and Y have joint distribution

           f_{X,Y}(x, y) = 8xy for 0 < x < y < 1.

       Then,

           f_{Y|X}(y|x) = 8xy/[4x(1 − x^2)] = 2y/(1 − x^2),   x < y < 1,   and
           ŷ(x) = E(Y | X = x) = ∫_x^1 y [2y/(1 − x^2)] dy = 2(1 − x^3)/[3(1 − x^2)].
   (d) Example: Suppose that (Y, X) has a bivariate normal distribution with moments E(Y) = μ_Y, E(X) = μ_X, Var(X) = σ_X^2, Var(Y) = σ_Y^2, and Cov(X, Y) = ρ_{X,Y} σ_X σ_Y. Then the conditional distribution of Y given X is

           (Y | X = x) ~ N(α + βx, σ^2),   where
           β = Cov(X, Y)/Var(X) = ρ_{X,Y} σ_Y/σ_X;   α = μ_Y − βμ_X;   and   σ^2 = σ_Y^2 (1 − ρ_{X,Y}^2).
4. Averaging Conditional pdfs and Moments (be able to prove any of these results)

   (a) E_X[ f_{Y|X}(y|X) ] = f_Y(y). Hint: f_{X,Y}(x, y) = f_{Y|X}(y|x) f_X(x).

   (b) E_X{ E[h(Y)|X] } = E[h(Y)]. This is the rule of iterated expectation. A special case is E_X[E(Y|X)] = E(Y).

   (c) Var(Y) = E_X[Var(Y|X)] + Var[E(Y|X)]. That is, the variance of Y is equal to the expectation of the conditional variance plus the variance of the conditional expectation. Proof:

           Var(Y) = E(Y^2) − [E(Y)]^2
                  = E_X[E(Y^2|X)] − {E_X[E(Y|X)]}^2       by the rule of iterated expectation
                  = E_X{ Var(Y|X) + [E(Y|X)]^2 } − {E_X[E(Y|X)]}^2    because Var(Y|X) = E(Y^2|X) − [E(Y|X)]^2
                  = E_X[Var(Y|X)] + E_X{[E(Y|X)]^2} − {E_X[E(Y|X)]}^2
                  = E_X[Var(Y|X)] + Var[E(Y|X)]

       because Var[E(Y|X)] = E_X{[E(Y|X)]^2} − {E_X[E(Y|X)]}^2.
5. Example: Suppose that X and Y have joint distribution

        f_{X,Y}(x, y) = 3y^2/x^3 for 0 < y < x < 1.

   Then,

        f_Y(y) = ∫_y^1 (3y^2/x^3) dx = (3/2)(1 − y^2),   for 0 < y < 1;
        E(Y^r) = ∫_0^1 (3/2)(1 − y^2) y^r dy = 3/[(r + 1)(r + 3)];
        ⟹ E(Y) = 3/8 and Var(Y) = 19/320;
        f_X(x) = ∫_0^x (3y^2/x^3) dy = 1,   for 0 < x < 1;
        f_{Y|X}(y|x) = 3y^2/x^3,   for 0 < y < x < 1;
        E(Y^r | X = x) = ∫_0^x (3y^2/x^3) y^r dy = 3x^r/(3 + r)
        ⟹ E(Y | X = x) = 3x/4 and Var(Y | X = x) = 3x^2/80;
        Var[E(Y|X)] = Var(3X/4) = (9/16)(1/12) = 3/64;
        E[Var(Y|X)] = E(3X^2/80) = 1/80;
        19/320 = 3/64 + 1/80.
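The decomposition in this example is easy to confirm by simulation: X is Unif(0, 1) and, given X = x, Y can be drawn by inversion because F_{Y|X}(y|x) = (y/x)^3. The sketch below is not from the notes and assumes NumPy.

    # Monte Carlo check that Var(Y) = E[Var(Y|X)] + Var[E(Y|X)] = 19/320
    import numpy as np

    rng = np.random.default_rng(3)
    n = 1_000_000
    x = rng.uniform(0.0, 1.0, size=n)       # f_X(x) = 1 on (0, 1)
    u = rng.uniform(0.0, 1.0, size=n)
    y = x * u ** (1.0 / 3.0)                # inverse-cdf draw from f_{Y|X}(y|x) = 3y^2/x^3

    print(y.mean(), 3.0 / 8.0)              # both approximately 0.375
    print(y.var(), 19.0 / 320.0)            # both approximately 0.0594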
5.12 Moment Generating Functions

1. Definition: If X is a continuous random variable, then the moment generating function (mgf) of X is

        ψ_X(t) = E(e^{tX}) = ∫ e^{tx} f_X(x) dx,

   provided that the expectation exists for t in a neighborhood of 0. If X is discrete, then replace integration by summation. If not all of the moments of X exist, then the mgf does not exist. Note that the mgf is related to the pgf by

        ψ_X(t) = η_X(e^t).

   Also note that if ψ_X(t) is a mgf, then ψ_X(0) = 1.
2. Example: Exponential distribution. If f_X(x) = λe^{−λx} I_{(0,∞)}(x), then

        ψ_X(t) = ∫_0^∞ e^{tx} λe^{−λx} dx = [λ/(λ − t)] ∫_0^∞ (λ − t) e^{−(λ−t)x} dx = λ/(λ − t).

3. Example: Geometric distribution. If X ~ Geom(p), then

        ψ_X(t) = Σ_{x=1}^∞ e^{tx} (1 − p)^{x−1} p = pe^t Σ_{x=1}^∞ (1 − p)^{x−1} e^{t(x−1)}
               = pe^t Σ_{x=0}^∞ [(1 − p)e^t]^x = pe^t / [1 − (1 − p)e^t].
4. Mgf of a linear function:

        ψ_{a+bX}(t) = E[e^{t(a+bX)}] = e^{at} ψ_X(bt).

   For example, if Z = (X − μ_X)/σ_X, then

        ψ_Z(t) = e^{−tμ_X/σ_X} ψ_X(t/σ_X).

5. Independent random variables: If X_i for i = 1, ..., n are independent, and S = Σ X_i, then

        ψ_S(t) = E(e^{t Σ X_i}) = E(∏_{i=1}^n e^{tX_i}) = ∏_{i=1}^n ψ_{X_i}(t).

   If the Xs are iid random variables and S = Σ_{i=1}^n X_i, then

        ψ_S(t) = [ψ_X(t)]^n.
6. Result: Moment generating functions, if they exist, uniquely determine the distribution. For example, if the mgf of Y is

        ψ_Y(t) = e^t/(2 − e^t) = (1/2)e^t / [1 − (1/2)e^t],

   then Y ~ Geom(0.5).
7. Computing moments. Consider the derivative of ψ_X(t) with respect to t, evaluated at t = 0:

        d/dt ψ_X(t)|_{t=0} = ∫ [d/dt e^{tx}]_{t=0} f_X(x) dx = ∫ x f_X(x) dx = E(X).

   Similarly, higher-order moments can be found by taking higher-order derivatives:

        E(X^r) = d^r/(dt)^r ψ_X(t)|_{t=0}.

   Alternatively, expand e^{tx} around t = 0 to obtain

        e^{tx} = Σ_{r=0}^∞ (tx)^r/r!.

   Therefore

        ψ_X(t) = E(e^{tX}) = E[Σ_{r=0}^∞ (tX)^r/r!] = Σ_{r=0}^∞ E(X^r) t^r/r!.

   Accordingly, E(X^r) is the coefficient of t^r/r! in the expansion of the mgf.
8. Example: Suppose that X ~ Geom(p). Then the moments of X are

        E(X^r) = d^r/(dt)^r ψ_X(t)|_{t=0},   where ψ_X(t) = pe^t/[1 − (1 − p)e^t].

   Specifically,

        d/dt ψ_X(t) = d/dt { pe^t/[1 − (1 − p)e^t] } = ψ_X(t) + [(1 − p)/p] ψ_X(t)^2   and
        d^2/(dt)^2 ψ_X(t) = d/dt { ψ_X(t) + [(1 − p)/p] ψ_X(t)^2 }
                          = ψ_X(t) + [(1 − p)/p] ψ_X(t)^2 + 2[(1 − p)/p] ψ_X(t) { ψ_X(t) + [(1 − p)/p] ψ_X(t)^2 }.

   Therefore,

        E(X) = 1 + (1 − p)/p = 1/p;
        E(X^2) = 1 + (1 − p)/p + 2[(1 − p)/p][1 + (1 − p)/p] = (2 − p)/p^2;   and
        Var(X) = (2 − p)/p^2 − 1/p^2 = (1 − p)/p^2.
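The two derivatives above can also be checked symbolically by differentiating the mgf directly. The sketch below is not from the notes and assumes SymPy is available; the printed expressions are equivalent to 1/p and (1 − p)/p^2, although SymPy may display them in a different but equal form.

    # Symbolic check of E(X) and Var(X) for X ~ Geom(p) from its mgf
    import sympy as sp

    t, p = sp.symbols('t p', positive=True)
    psi = p * sp.exp(t) / (1 - (1 - p) * sp.exp(t))      # mgf of Geom(p)

    m1 = sp.simplify(sp.diff(psi, t).subs(t, 0))         # E(X), equals 1/p
    m2 = sp.simplify(sp.diff(psi, t, 2).subs(t, 0))      # E(X^2), equals (2 - p)/p^2
    print(m1, sp.simplify(m2 - m1 ** 2))                 # variance equals (1 - p)/p^2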
9. Example: Suppose that Y ~ Unif(a, b). Use the mgf to find the central moments E[(Y − μ_Y)^r] = E{[Y − (a + b)/2]^r}. Solution:

        ψ_Y(t) = [1/(b − a)] ∫_a^b e^{ty} dy = (e^{tb} − e^{ta})/[t(b − a)];

        ψ_{Y−μ_Y}(t) = e^{−t(a+b)/2} ψ_Y(t)
                     = e^{−t(a+b)/2} (e^{tb} − e^{ta})/[t(b − a)]
                     = [2/(t(b − a))] [e^{t(b−a)/2} − e^{−t(b−a)/2}]/2
                     = [2/(t(b − a))] sinh[t(b − a)/2]
                     = [2/(t(b − a))] Σ_{i=0}^∞ [t(b − a)/2]^{2i+1} / (2i + 1)!
                     = Σ_{i=0}^∞ [t(b − a)/2]^{2i} / (2i + 1)!
                     = Σ_{i=0}^∞ [t^{2i}/(2i)!] (b − a)^{2i}/[2^{2i}(2i + 1)].

   Therefore, the odd central moments are zero, and

        E(Y − μ_Y)^{2i} = (b − a)^{2i}/[2^{2i}(2i + 1)].

   For example, E(Y − μ_Y)^2 = (b − a)^2/12 and E(Y − μ_Y)^4 = (b − a)^4/80.
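A quick Monte Carlo check of the first two even central moments is sketched below; it is not from the notes, assumes NumPy, and the endpoints a = 2 and b = 7 are arbitrary.

    # Check of E(Y - mu)^2 = (b - a)^2/12 and E(Y - mu)^4 = (b - a)^4/80 for Y ~ Unif(a, b)
    import numpy as np

    rng = np.random.default_rng(4)
    a, b = 2.0, 7.0
    d = rng.uniform(a, b, size=2_000_000) - (a + b) / 2.0

    print(np.mean(d ** 2), (b - a) ** 2 / 12.0)   # both approximately 2.083
    print(np.mean(d ** 4), (b - a) ** 4 / 80.0)   # both approximately 7.81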
Chapter 6

Families of Continuous Distributions

6.1 Normal Distributions

1. PDF and CDF of the standard normal distribution:

        f_Z(z) = e^{−z^2/2}/√(2π);
        F_Z(z) = P(Z ≤ z) = Φ(z) = ∫_{−∞}^z f_Z(u) du.

   To verify that f_Z(z) integrates to one, examine K^2, where K = ∫_{−∞}^∞ e^{−u^2/2} du:

        K^2 = [∫_{−∞}^∞ e^{−u^2/2} du]^2 = [∫_{−∞}^∞ e^{−u_1^2/2} du_1][∫_{−∞}^∞ e^{−u_2^2/2} du_2]
            = ∫∫ e^{−(u_1^2 + u_2^2)/2} du_1 du_2.

   Now transform to polar coordinates: u_1 = r sin θ, u_2 = r cos θ, and

        K^2 = ∫_0^{2π} ∫_0^∞ e^{−r^2/2} r dr dθ = ∫_0^{2π} [−e^{−r^2/2}]_0^∞ dθ = ∫_0^{2π} 1 dθ = 2π.

   Therefore K = √(2π) and f_Z(z) integrates to one.
2. Other normal distributions: Transform from Z to X = μ + σZ, where μ and σ are constants that satisfy |μ| < ∞ and 0 < σ < ∞. The inverse transformation is z = (x − μ)/σ and the Jacobian of the transformation is

        |J| = |dz/dx| = 1/σ.

   Accordingly, the pdf of X is

        f_X(x) = f_Z[(x − μ)/σ] (1/σ) = e^{−(x−μ)^2/(2σ^2)}/√(2πσ^2).

   We will use the notation X ~ N(μ, σ^2) to mean that X has a normal distribution with parameters μ and σ.
3. Moment generating function: Suppose that X ~ N(μ, σ^2). Then

        ψ_X(t) = e^{μt + t^2σ^2/2}.

   Proof:

        ψ_X(t) = E(e^{tX}) = ∫ e^{tx} e^{−(x−μ)^2/(2σ^2)}/√(2πσ^2) dx
               = ∫ e^{−[−2tσ^2 x + (x−μ)^2]/(2σ^2)}/√(2πσ^2) dx.

   The trick is to complete the square in the exponent:

        −2tσ^2 x + (x − μ)^2 = −2tσ^2 x + x^2 − 2μx + μ^2
                             = x^2 − 2x(μ + tσ^2) + μ^2
                             = [x − (μ + tσ^2)]^2 − (μ + tσ^2)^2 + μ^2
                             = [x − (μ + tσ^2)]^2 − 2μtσ^2 − t^2σ^4.

   Therefore,

        ψ_X(t) = e^{(2μtσ^2 + t^2σ^4)/(2σ^2)} ∫ e^{−[x − (μ + tσ^2)]^2/(2σ^2)}/√(2πσ^2) dx
               = e^{μt + t^2σ^2/2} ∫ e^{−(x − μ*)^2/(2σ^2)}/√(2πσ^2) dx,   where μ* = μ + tσ^2,
               = e^{μt + t^2σ^2/2}

   because the second term is the integral of the pdf of a random variable with distribution N(μ*, σ^2) and this integral is one.
4. Moments of normal distributions

   (a) Moments of the standard normal distribution: Let Z be a normal random variable with μ = 0 and σ = 1. That is, Z ~ N(0, 1). The moment generating function of Z is ψ_Z(t) = e^{t^2/2}. The Taylor series expansion of ψ_Z(t) around t = 0 is

           ψ_Z(t) = e^{t^2/2} = Σ_{i=0}^∞ (1/i!)(t^2/2)^i = Σ_{i=0}^∞ [(2i)!/(2^i i!)] [t^{2i}/(2i)!].

       Note that all odd powers in the expansion are zero. Accordingly,

           E(Z^r) = 0                       if r is odd;
                  = r!/[2^{r/2} (r/2)!]     if r is even.

       It can be shown by induction that if r is even, then

           r!/[2^{r/2} (r/2)!] = (r − 1)(r − 3)(r − 5) ⋯ 1.

       In particular, E(Z) = 0 and Var(Z) = E(Z^2) = 1.

   (b) Moments of other normal distributions: Suppose that X ~ N(μ, σ^2). Then X can be written as X = μ + σZ, where Z ~ N(0, 1). To obtain the moments of X, one may use the moments of Z or one may differentiate the moment generating function of X. For example, using the moments of Z, the first two moments of X are

           E(X) = E(μ + σZ) = μ + σE(Z) = μ   and
           E(X^2) = E[(μ + σZ)^2] = E(μ^2 + 2μσZ + σ^2 Z^2) = μ^2 + σ^2.

       Note that Var(X) = E(X^2) − [E(X)]^2 = σ^2. The alternative approach is to use the moment generating function:

           E(X) = d/dt ψ_X(t)|_{t=0} = d/dt e^{μt + t^2σ^2/2}|_{t=0} = (μ + tσ^2) e^{μt + t^2σ^2/2}|_{t=0} = μ   and
           E(X^2) = d^2/dt^2 ψ_X(t)|_{t=0} = d/dt (μ + tσ^2) e^{μt + t^2σ^2/2}|_{t=0}
                  = σ^2 e^{μt + t^2σ^2/2} + (μ + tσ^2)^2 e^{μt + t^2σ^2/2}|_{t=0} = σ^2 + μ^2.
5. Box–Muller method for generating standard normal variables. Let Z_1 and Z_2 be iid random variables with distributions Z_i ~ N(0, 1). The joint pdf of Z_1 and Z_2 is

        f_{Z_1,Z_2}(z_1, z_2) = e^{−(z_1^2 + z_2^2)/2}/(2π).

   Transform to polar coordinates: Z_1 = R sin(T) and Z_2 = R cos(T). The joint distribution of R and T is

        f_{R,T}(r, t) = [r e^{−r^2/2}/(2π)] I_{(0,∞)}(r) I_{(0,2π)}(t) = f_R(r) f_T(t),   where
        f_R(r) = r e^{−r^2/2} I_{(0,∞)}(r)   and   f_T(t) = [1/(2π)] I_{(0,2π)}(t).

   Factorization of the joint pdf reveals that R ⊥ T. Their respective cdfs are

        F_R(r) = 1 − e^{−r^2/2}   and   F_T(t) = t/(2π).

   Let U_1 = F_R(R) and U_2 = F_T(T). Recall that U_i ~ Unif(0, 1). Solving the cdf equations for R and T yields

        R = √[−2 ln(1 − U_1)]   and   T = 2πU_2.

   Lastly, express Z_1 and Z_2 as functions of R and T:

        Z_1 = R sin(T) = √[−2 ln(1 − U_1)] sin(2πU_2)   and
        Z_2 = R cos(T) = √[−2 ln(1 − U_1)] cos(2πU_2).

   Note that U_1 and 1 − U_1 have the same distribution. Therefore Z_1 and Z_2 can be generated from U_1 and U_2 by

        Z_1 = √[−2 ln(U_1)] sin(2πU_2)   and   Z_2 = √[−2 ln(U_1)] cos(2πU_2).
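The final pair of equations is the usual way the Box–Muller method is coded. The sketch below is not from the notes; it assumes NumPy and simply transcribes the transformation derived above.

    # Box-Muller: 2n standard normal draws built from 2n uniform draws
    import numpy as np

    def box_muller(n, rng):
        u1 = rng.uniform(size=n)
        u2 = rng.uniform(size=n)
        r = np.sqrt(-2.0 * np.log(u1))        # R = sqrt(-2 ln U1)
        t = 2.0 * np.pi * u2                  # T = 2 pi U2
        return np.concatenate([r * np.sin(t), r * np.cos(t)])

    rng = np.random.default_rng(5)
    z = box_muller(500_000, rng)
    print(z.mean(), z.var())                  # approximately 0 and 1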
6. Linear functions of normal random variables: Suppose that X and Y are independently distributed random variables with distributions X ~ N(μ_X, σ_X^2) and Y ~ N(μ_Y, σ_Y^2).

   (a) The distribution of aX + b is N(aμ_X + b, a^2 σ_X^2). Proof: The moment generating function of aX + b is

           ψ_{aX+b}(t) = E[e^{t(aX+b)}] = e^{tb} E(e^{taX}) = e^{tb} e^{taμ_X + t^2 a^2 σ_X^2/2} = e^{t(aμ_X + b) + t^2 (aσ_X)^2/2},

       and this is the moment generating function of a random variable with distribution N(aμ_X + b, a^2 σ_X^2).
   (b) The distribution of aX + bY is N(aμ_X + bμ_Y, a^2 σ_X^2 + b^2 σ_Y^2). Proof: The moment generating function of aX + bY is

           ψ_{aX+bY}(t) = E[e^{t(aX+bY)}] = E(e^{taX}) E(e^{tbY})   by independence
                        = e^{taμ_X + t^2 a^2 σ_X^2/2} e^{tbμ_Y + t^2 b^2 σ_Y^2/2}
                        = e^{t(aμ_X + bμ_Y) + t^2(a^2 σ_X^2 + b^2 σ_Y^2)/2},

       and this is the moment generating function of a random variable with distribution N(aμ_X + bμ_Y, a^2 σ_X^2 + b^2 σ_Y^2).

   (c) The above result is readily generalized. Suppose that X_i for i = 1, ..., n are independently distributed as X_i ~ N(μ_i, σ_i^2). If T = Σ_{i=1}^n a_i X_i, then T ~ N(μ_T, σ_T^2), where μ_T = Σ_{i=1}^n a_i μ_i and σ_T^2 = Σ_{i=1}^n a_i^2 σ_i^2.
7. Probabilities and Percentiles

   (a) If X ~ N(μ_X, σ_X^2), then the probability of an interval is

           P(a ≤ X ≤ b) = P[(a − μ_X)/σ_X ≤ Z ≤ (b − μ_X)/σ_X] = Φ[(b − μ_X)/σ_X] − Φ[(a − μ_X)/σ_X].

   (b) If X ~ N(μ_X, σ_X^2), then the 100p-th percentile of X is

           x_p = μ_X + σ_X z_p,

       where z_p is the 100p-th percentile of the standard normal distribution. Proof:

           P(X ≤ μ_X + σ_X z_p) = P[(X − μ_X)/σ_X ≤ z_p] = P(Z ≤ z_p) = p

       because Z ~ N(0, 1).
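Both results translate directly into two lines of code. The sketch below is not from the notes; it assumes SciPy, and the values of μ, σ, a, b, and p are arbitrary.

    # Interval probability and percentile for X ~ N(mu, sigma^2) via standardization
    from scipy.stats import norm

    mu, sigma = 19.0, 9.5
    a, b, p = 10.0, 30.0, 0.95

    prob = norm.cdf((b - mu) / sigma) - norm.cdf((a - mu) / sigma)   # P(a <= X <= b)
    x_p = mu + sigma * norm.ppf(p)                                   # 100p-th percentile
    print(prob, x_p)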
8. Log Normal Distribution

   (a) Definition: If ln(X) ~ N(μ, σ^2), then X is said to have a log normal distribution. That is,

           ln(X) ~ N(μ, σ^2)   ⟺   X ~ LogN(μ, σ^2).

       Note: μ and σ^2 are the mean and variance of ln(X), not of X.

   (b) PDF: Let Y = ln(X), and assume that Y ~ N(μ, σ^2). Note that x = g(y) and y = g^{−1}(x), where g(y) = e^y and g^{−1}(x) = ln(x). The Jacobian of the transformation is

           |J| = |dy/dx| = 1/x.

       Accordingly, the pdf of X is

           f_X(x) = f_Y[g^{−1}(x)] (1/x) = e^{−[ln(x) − μ]^2/(2σ^2)} / [x √(2πσ^2)] I_{(0,∞)}(x).

   (c) CDF: If Y ~ LogN(μ, σ^2), then

           P(Y ≤ y) = P[ln(Y) ≤ ln(y)] = Φ{[ln(y) − μ]/σ}.

   (d) Moments of a log normal random variable. Suppose that X ~ LogN(μ, σ^2). Then E(X^r) = e^{rμ + r^2σ^2/2}. Proof: Let Y = ln(X). Then Y ~ N(μ, σ^2) and

           E(X^r) = E(e^{rY}) = e^{rμ + r^2σ^2/2},

       where the final result is obtained by using the mgf of a normal random variable. To obtain the mean and variance, set r to 1 and 2:

           E(X) = e^{μ + σ^2/2}   and   Var(X) = e^{2μ + 2σ^2} − e^{2μ + σ^2} = e^{2μ + σ^2}(e^{σ^2} − 1).

       (A simulation check of these moment formulas appears at the end of this item.)
   (e) Displays of various log normal distributions. The figure below displays four log normal distributions. The parameters of the distributions are summarized in the following table:

           Plot   μ = E[ln(X)]   σ^2 = Var[ln(X)]   E(X)   √Var(X)   τ = E(X)/√Var(X)
           1      2.2976         4.6151             100    1000      0.1
           2      3.8005         1.6094             100    200       0.5
           3      4.2586         0.6931             100    100       1
           4      4.5856         0.0392             100    20        5

       Note that each distribution has mean equal to 100. The distributions differ in terms of τ, the ratio of the mean to the standard deviation (the reciprocal of the coefficient of variation).

       [Figure: four panels plotting f_X(x) for x in (0, 200), one panel for each of τ = 0.1, 0.5, 1, and 5.]

       If τ is small, then the log normal distribution resembles an exponential distribution. As τ increases, the log normal distribution converges to a normal distribution.
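The moment formulas in (d) can be checked by simulating ln(X) ~ N(μ, σ^2) and exponentiating. The sketch below is not from the notes; it assumes NumPy and uses the parameters of plot 3 in the table above, for which both the mean and the standard deviation should be approximately 100.

    # Check of E(X) = exp(mu + sigma^2/2) and Var(X) = exp(2mu + sigma^2)(exp(sigma^2) - 1)
    import numpy as np

    rng = np.random.default_rng(6)
    mu, sigma2 = 4.2586, 0.6931
    x = np.exp(rng.normal(mu, np.sqrt(sigma2), size=2_000_000))

    print(x.mean(), np.exp(mu + sigma2 / 2.0))
    print(x.var(), np.exp(2.0 * mu + sigma2) * (np.exp(sigma2) - 1.0))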
6.2 Exponential Distributions

1. PDF and CDF:

        f_X(x) = λe^{−λx} I_{[0,∞)}(x),   where λ is a positive parameter, and
        F_X(x) = 1 − e^{−λx},   provided that x ≥ 0.

   We will use the notation X ~ Expon(λ) to mean that X has an exponential distribution with parameter λ. Note that the 100p-th percentile is x_p = −ln(1 − p)/λ. The median, for example, is x_{0.5} = ln(2)/λ.

2. Moment generating function. If X ~ Expon(λ), then ψ_X(t) = λ/(λ − t) for t < λ. Proof:

        ψ_X(t) = E(e^{tX}) = ∫_0^∞ e^{tx} λe^{−λx} dx
               = [λ/(λ − t)] ∫_0^∞ (λ − t) e^{−(λ−t)x} dx = λ/(λ − t)

   because the last integral is the integral of the pdf of a random variable with distribution Expon(λ − t), provided that λ − t > 0.
3. Moments: If X ~ Expon(λ), then E(X^r) = r!/λ^r. Proof:

        ψ_X(t) = λ/(λ − t) = 1/(1 − t/λ) = Σ_{r=0}^∞ (t/λ)^r = Σ_{r=0}^∞ (t^r/r!)(r!/λ^r),

   provided that −λ < t < λ. Note that E(X) = 1/λ, E(X^2) = 2/λ^2, and Var(X) = 1/λ^2.
4. Displays of exponential distributions. Below are plots of four exponential distributions (λ = 0.1, 1, 2, and 5). Note that the shapes of the distributions are identical.

   [Figure: four panels plotting f_X(x), one panel for each of λ = 0.1, 1, 2, and 5.]
5. Memoryless property: Suppose that X ~ Expon(λ). The random variable can be thought of as the waiting time for an event to occur. Given that an event has not occurred in the interval [0, w), find the probability that the additional waiting time is at least t. That is, find P(X > t + w | X > w). Note: P(X > t) is sometimes called the reliability function. It is denoted by R(t) and is related to F_X(t) by

        R(t) = P(X > t) = 1 − P(X ≤ t) = 1 − F_X(t).

   The reliability function represents the probability that the lifetime of a product (i.e., the waiting time for failure) is at least t units. For the exponential distribution, the reliability function is R(t) = e^{−λt}. We are interested in the conditional reliability function R(t + w | X > w). Solution:

        R(t + w | X > w) = P(X > t + w | X > w) = P(X > t + w)/P(X > w) = e^{−λ(t+w)}/e^{−λw} = e^{−λt}.

   Also,

        R(t + w | X > w) = 1 − F_X(t + w | X > w)   ⟹   F_X(t + w | X > w) = 1 − e^{−λt}.

   That is, no matter how long one has been waiting, the conditional distribution of the remaining lifetime is still Expon(λ). It is as though the distribution does not remember that we have already been waiting w time units.
6. Poisson inter-arrival times: Suppose that events occur according to a Poisson process with rate parameter λ. Assume that the process begins at time 0. Let T_1 be the arrival time of the first event and let T_r be the time interval from the (r − 1)-st arrival to the r-th arrival. That is, T_1, ..., T_n are inter-arrival times. We will find the joint distribution of T_1, T_2, ..., T_n. Consider the joint pdf:

        f_{T_1,T_2,...,T_n}(t_1, t_2, ..., t_n)
            = f_{T_1}(t_1) f_{T_2|T_1}(t_2|t_1) f_{T_3|T_1,T_2}(t_3|t_1, t_2) ⋯ f_{T_n|T_1,...,T_{n−1}}(t_n|t_1, ..., t_{n−1})

   by the multiplication rule. To obtain the first term, first find the cdf of T_1:

        F_{T_1}(t_1) = P(T_1 ≤ t_1) = P[one or more events in (0, t_1)]
                     = 1 − P[no events in (0, t_1)] = 1 − e^{−λt_1}(λt_1)^0/0! = 1 − e^{−λt_1}.

   Differentiating the cdf yields

        f_{T_1}(t_1) = d/dt_1 (1 − e^{−λt_1}) = λe^{−λt_1} I_{(0,∞)}(t_1).

   The second term is the conditional pdf of T_2 given T_1 = t_1. Recall that in a Poisson process, events in non-overlapping intervals are independent. Therefore,

        f_{T_2|T_1}(t_2|t_1) = f_{T_2}(t_2) = λe^{−λt_2}.

   Each of the remaining conditional pdfs also is just an exponential pdf. Therefore,

        f_{T_1,T_2,...,T_n}(t_1, t_2, ..., t_n) = ∏_{i=1}^n λe^{−λt_i} I_{[0,∞)}(t_i).

   This joint pdf is the product of n marginal exponential pdfs. Therefore, the inter-arrival times are iid exponential random variables. That is, T_i ~ iid Expon(λ).
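One standard construction of a Poisson process on a finite horizon is to draw the number of events from a Poisson distribution and scatter them uniformly; the resulting gaps should then look like iid Expon(λ) draws. The sketch below is not from the notes, assumes NumPy and SciPy, and uses an arbitrary rate and horizon; the Kolmogorov-Smirnov p-value is only an informal check.

    # Empirical check that Poisson-process inter-arrival times behave like Expon(lambda)
    import numpy as np
    from scipy.stats import expon, kstest

    rng = np.random.default_rng(7)
    lam, horizon = 2.0, 100_000.0
    n = rng.poisson(lam * horizon)                          # number of events on [0, horizon]
    arrivals = np.sort(rng.uniform(0.0, horizon, size=n))   # event times
    inter = np.diff(arrivals)                               # inter-arrival times

    print(inter.mean(), 1.0 / lam)                          # both approximately 0.5
    print(kstest(inter, expon(scale=1.0 / lam).cdf).pvalue) # typically not small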
6.3 Gamma Distributions

1. Erlang Distributions:

   (a) Consider a Poisson process with rate parameter λ. Assume that the process begins at time 0. Let Y be the time of the r-th arrival. Using the differential method, the pdf of Y can be obtained as follows:

           P(t < Y < t + dt) ≈ P(r − 1 arrivals before time t) × P[one arrival in (t, t + dt)]
                             = [e^{−λt} (λt)^{r−1}/(r − 1)!] λ dt.

       Accordingly,

           f_Y(y) = e^{−λy} λ^r y^{r−1}/(r − 1)! I_{[0,∞)}(y).

       The above pdf is called the Erlang pdf.

   (b) Note that Y is the sum of r iid Expon(λ) random variables (see the notes on inter-arrival times). Accordingly, E(Y) = r/λ and Var(Y) = r/λ^2.

   (c) CDF of an Erlang random variable: F_Y(y) = 1 − P(Y > y), and P(Y > y) is the probability that fewer than r events occur in [0, y). Accordingly,

           F_Y(y) = 1 − P(Y > y) = 1 − Σ_{i=0}^{r−1} e^{−λy} (λy)^i/i!.
2. Gamma Function

   (a) Definition: Γ(α) = ∫_0^∞ u^{α−1} e^{−u} du, where α > 0.

   (b) Alternative expression: Let z = √(2u) so that u = z^2/2, du = z dz, and

           Γ(α) = ∫_0^∞ z^{2α−1} e^{−z^2/2} 2^{1−α} dz.

   (c) Properties of Γ(α)

       i. Γ(1) = 1. Proof:

              Γ(1) = ∫_0^∞ e^{−w} dw = −e^{−w}|_0^∞ = 0 + 1 = 1.

       ii. Γ(α + 1) = αΓ(α). Proof:

              Γ(α + 1) = ∫_0^∞ w^α e^{−w} dw.

           Let u = w^α, let dv = e^{−w} dw, and use integration by parts to obtain du = αw^{α−1} dw, v = −e^{−w}, and

              Γ(α + 1) = −w^α e^{−w}|_0^∞ + α ∫_0^∞ w^{α−1} e^{−w} dw = 0 + αΓ(α).

       iii. If n is a positive integer, then Γ(n) = (n − 1)!. Proof:

              Γ(n) = (n − 1)Γ(n − 1) = (n − 1)(n − 2)Γ(n − 2), etc.

       iv. Γ(1/2) = √π. Proof:

              Γ(1/2) = ∫_0^∞ e^{−z^2/2} 2^{1/2} dz = ∫_{−∞}^∞ e^{−z^2/2}/√2 dz = √π

           because the integral of the standard normal pdf is one.
3. Gamma Distribution

   (a) PDF and CDF: If Y ~ Gam(α, λ), then

           f_Y(y) = y^{α−1} λ^α e^{−λy}/Γ(α) I_{(0,∞)}(y)   and   F_Y(y) = ∫_0^y u^{α−1} λ^α e^{−λu}/Γ(α) du.

   (b) Note: α is called the shape parameter and λ is called the scale parameter.

   (c) Moment Generating Function: If Y ~ Gam(α, λ), then

           ψ_Y(t) = ∫_0^∞ e^{ty} y^{α−1} λ^α e^{−λy}/Γ(α) dy = ∫_0^∞ y^{α−1} λ^α e^{−(λ−t)y}/Γ(α) dy
                  = [λ^α/(λ − t)^α] ∫_0^∞ y^{α−1} (λ − t)^α e^{−(λ−t)y}/Γ(α) dy = λ^α/(λ − t)^α

       because the last integral is the integral of the pdf of a random variable with distribution Gam(α, λ − t), provided that λ − t > 0.

   (d) Moments: If Y ~ Gam(α, λ), then

           E(Y) = d/dt ψ_Y(t)|_{t=0} = αλ^α/(λ − t)^{α+1}|_{t=0} = α/λ;
           E(Y^2) = d^2/(dt)^2 ψ_Y(t)|_{t=0} = α(α + 1)λ^α/(λ − t)^{α+2}|_{t=0} = α(α + 1)/λ^2;   and
           Var(Y) = E(Y^2) − [E(Y)]^2 = α(α + 1)/λ^2 − α^2/λ^2 = α/λ^2.
   (e) General expression for moments (including fractional moments). If Y ~ Gam(α, λ), then

           E(Y^r) = Γ(α + r)/[λ^r Γ(α)],   provided that α + r > 0.

       Proof:

           E(Y^r) = ∫_0^∞ y^r y^{α−1} λ^α e^{−λy}/Γ(α) dy = ∫_0^∞ y^{α+r−1} λ^α e^{−λy}/Γ(α) dy
                  = {Γ(α + r)/[λ^r Γ(α)]} ∫_0^∞ y^{α+r−1} λ^{α+r} e^{−λy}/Γ(α + r) dy = Γ(α + r)/[λ^r Γ(α)]

       because the last integral is the integral of the pdf of a random variable with distribution Gam(α + r, λ), provided that α + r > 0.

   (f) Distribution of the sum of iid exponential random variables. Suppose that Y_1, Y_2, ..., Y_k are iid Expon(λ) random variables. Then T = Σ_{i=1}^k Y_i ~ Gam(k, λ). Proof:

           ψ_{Y_i}(t) = λ/(λ − t)   ⟹   ψ_T(t) = λ^k/(λ − t)^k.

   (g) Note that the Erlang distribution is a gamma distribution with shape parameter equal to an integer.
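Result (f) is easy to see empirically: sums of k iid exponential draws match the Gam(k, λ) distribution. The sketch below is not from the notes, assumes NumPy and SciPy, and uses arbitrary values k = 4 and λ = 1.5 (SciPy's gamma takes shape a = k and scale = 1/λ).

    # Check that a sum of k iid Expon(lambda) draws behaves like Gam(k, lambda)
    import numpy as np
    from scipy.stats import gamma, kstest

    rng = np.random.default_rng(8)
    k, lam = 4, 1.5
    t = rng.exponential(scale=1.0 / lam, size=(200_000, k)).sum(axis=1)

    print(t.mean(), k / lam)            # both approximately 2.667
    print(t.var(), k / lam ** 2)        # both approximately 1.778
    print(kstest(t, gamma(a=k, scale=1.0 / lam).cdf).pvalue)   # typically not small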
6.4 Chi Squared Distributions

1. Definition: Let Z_i for i = 1, ..., k be iid N(0, 1) random variables. Then Y = Σ_{i=1}^k Z_i^2 is said to have a χ^2 distribution with k degrees of freedom. That is, Y ~ χ^2_k.

2. MGF: ψ_Y(t) = (1 − 2t)^{−k/2} for t < 0.5. Proof: First find the MGF of Z_i^2:

        ψ_{Z_i^2}(t) = E(e^{tZ^2}) = ∫ e^{tz^2} e^{−z^2/2}/√(2π) dz
                     = ∫ e^{−z^2/[2(1−2t)^{−1}]}/√(2π) dz
                     = (1 − 2t)^{−1/2} ∫ e^{−z^2/[2(1−2t)^{−1}]}/[(1 − 2t)^{−1/2}√(2π)] dz
                     = (1 − 2t)^{−1/2}

   because the last integral is the integral of the pdf of a N[0, (1 − 2t)^{−1}] random variable. It follows that the MGF of Y is (1 − 2t)^{−k/2}. Note that this is the MGF of a gamma random variable with parameters λ = 0.5 and α = k/2. Accordingly,

        f_Y(y) = y^{k/2−1} e^{−y/2}/[Γ(k/2) 2^{k/2}] I_{(0,∞)}(y).
3. Properties of χ^2 Random Variables

   (a) If Y ~ χ^2_k, then E(Y^r) = 2^r Γ(k/2 + r)/Γ(k/2), provided that k/2 + r > 0. Proof: Use the moment result for gamma random variables.

   (b) Using Γ(α + 1) = αΓ(α), it is easy to show that E(Y) = k, E(Y^2) = k(k + 2), and Var(Y) = 2k.

   (c) Y ≈ N(k, 2k) for large k. This is an application of the central limit theorem. A better approximation (again for large k) is

           √(2Y) − √(2k − 1) ≈ N(0, 1).

   (d) If Y_1, Y_2, ..., Y_n are independently distributed as Y_i ~ χ^2_{k_i}, then Σ_{i=1}^n Y_i ~ χ^2_k, where k = Σ_{i=1}^n k_i. Proof: use the MGF.

   (e) If X ~ χ^2_k, X + Y ~ χ^2_n, and X ⊥ Y, then Y ~ χ^2_{n−k}. Proof: See page 248 in the text. Note that by independence ψ_{X+Y}(t) = ψ_X(t) ψ_Y(t).
6.5 Distributions for Reliability

1. Definition: Suppose that L is a nonnegative continuous rv. In particular, suppose that L is the lifetime (time to failure) of a component. The reliability function is the probability that the lifetime exceeds x. That is,

        Reliability function of L = R_L(x) = P(L > x) = 1 − F_L(x).

2. Result: If L is a nonnegative continuous rv whose expectation exists, then

        E(L) = ∫_0^∞ R_L(x) dx = ∫_0^∞ [1 − F_L(x)] dx.

   Proof: Use integration by parts with u = R_L(x) ⟹ du = −f_L(x) dx and dv = dx ⟹ v = x. Making these substitutions,

        ∫_0^∞ R_L(x) dx = ∫_0^∞ u dv = uv|_0^∞ − ∫_0^∞ v du
                        = x[1 − F_L(x)]|_0^∞ + ∫_0^∞ x f_L(x) dx = ∫_0^∞ x f_L(x) dx = E(L),

   provided that lim_{x→∞} x[1 − F_L(x)] = 0.
3. Definition: The hazard function is the instantaneous rate of failure at time x, given that the component lifetime is at least x. That is,

        Hazard function of L = h_L(x) = lim_{dx→0} P(x < L < x + dx | L > x)/dx
                             = lim_{dx→0} {[F_L(x + dx) − F_L(x)]/dx} [1/R_L(x)] = f_L(x)/R_L(x).
4. Result:

        h_L(x) = −d/dx ln[R_L(x)] = −[1/R_L(x)] d/dx R_L(x) = −[1/R_L(x)][−f_L(x)] = f_L(x)/R_L(x).
5. Result: If R_L(0) = 1, then

        R_L(x) = exp[−∫_0^x h_L(u) du].

   Proof:

        h_L(x) = −d/dx { ln[R_L(x)] − ln[R_L(0)] }
        ⟹ h_L(x) = −d/dx { ln[R_L(u)]|_0^x }
        ⟹ ∫_0^x h_L(u) du = −ln[R_L(x)]
        ⟹ exp[−∫_0^x h_L(u) du] = R_L(x).
6. Result: The hazard function is constant if and only if the time to failure has an exponential distribution. Proof: First, suppose that the time to failure has an exponential distribution. Then,

        f_L(x) = λe^{−λx} I_{(0,∞)}(x)   ⟹   R_L(x) = e^{−λx}   ⟹   h_L(x) = λe^{−λx}/e^{−λx} = λ.

   Second, suppose that the hazard function is a constant, λ. Then,

        h_L(x) = λ   ⟹   R_L(x) = exp[−∫_0^x λ du] = e^{−λx}   ⟹   f_L(x) = d/dx (1 − e^{−λx}) = λe^{−λx}.
7. Weibull Distribution: power-law hazard function (increasing when α > 1). The hazard function for the Weibull distribution is

        h_L(x) = αx^{α−1}/β^α,

   where α and β are positive constants. The corresponding reliability function is

        R_L(x) = exp[−∫_0^x h_L(u) du] = exp[−(x/β)^α],

   and the pdf is

        f_L(x) = d/dx F_L(x) = (αx^{α−1}/β^α) exp[−(x/β)^α] I_{(0,∞)}(x).
8. Gompertz Distribution: exponential hazard function. The hazard function for the Gompertz distribution is

        h_L(x) = αe^{βx},

   where α and β are positive constants. The corresponding reliability function is

        R_L(x) = exp[−α(e^{βx} − 1)/β],

   and the pdf is

        f_L(x) = d/dx F_L(x) = αe^{βx} exp[−α(e^{βx} − 1)/β] I_{(0,∞)}(x).
9. Series Combinations: If a system fails whenever any single component fails, then the components are said to be in series. The time to failure of the system is the minimum time to failure of the components. If the failure times of the components are statistically independent, then the reliability function of the system is

        R(x) = P(system life > x) = P(all components survive to x) = ∏ R_i(x),

   where R_i(x) is the reliability function of the i-th component.

10. Parallel Combinations: If a system fails only if all components fail, then the components are said to be in parallel. The time to failure of the system is the maximum time to failure of the components. If the failure times of the components are statistically independent, then the reliability function of the system is

        R(x) = P(system life > x) = 1 − P(all components fail by time x) = 1 − ∏ F_i(x) = 1 − ∏ [1 − R_i(x)],

    where F_i(x) is the cdf of the i-th component.
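The two formulas are simple products, as the sketch below illustrates for independent exponential components; it is not from the notes, assumes NumPy, and the component rates and the time x = 1 are arbitrary.

    # Reliability of series and parallel systems with independent Expon(rate_i) components
    import numpy as np

    def series_reliability(x, rates):
        # R(x) = prod_i R_i(x): every component must survive
        return np.prod([np.exp(-lam * x) for lam in rates])

    def parallel_reliability(x, rates):
        # R(x) = 1 - prod_i F_i(x): the system fails only if every component fails
        return 1.0 - np.prod([1.0 - np.exp(-lam * x) for lam in rates])

    rates = [0.5, 1.0, 2.0]
    print(series_reliability(1.0, rates))    # exp(-3.5), about 0.030
    print(parallel_reliability(1.0, rates))  # about 0.785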
6.6 t, F, and Beta Distributions

1. t distributions: Let Z and X be independently distributed as Z ~ N(0, 1) and X ~ χ^2_k. Then

        T = Z/√(X/k)

   has a central t distribution with k degrees of freedom. The pdf is

        f_T(t) = Γ[(k + 1)/2] / [Γ(k/2) √(πk)] × [1 + t^2/k]^{−(k+1)/2}.

   If k = 1, then the pdf of T is

        f_T(t) = 1/[π(1 + t^2)],

   which is the pdf of a standard Cauchy random variable.

   Moments of a t random variable. Suppose that T ~ t_k. Then

        E(T^r) = E[k^{r/2} Z^r X^{−r/2}] = k^{r/2} E(Z^r) E(X^{−r/2}),

   where Z ~ N(0, 1), X ~ χ^2_k, and Z ⊥ X. Recall that the odd moments of Z are zero, the 2i-th moment of Z is (2i)!/(i! 2^i), and E(X^a) = 2^a Γ(k/2 + a)/Γ(k/2), provided that k/2 + a > 0. Accordingly, if r is a non-negative integer, then

        E(T^r) = does not exist,                                  if r ≥ k;
               = 0,                                               if r is odd and r < k;
               = k^{r/2} r! Γ[(k − r)/2] / [(r/2)! 2^r Γ(k/2)],   if r is even and r < k.

   Using the above expression, it is easy to show that E(T) = 0 if k > 1 and that Var(T) = k/(k − 2) if k > 2.
2. F Distributions: Let U_1 and U_2 be independent χ^2 random variables with degrees of freedom k_1 and k_2, respectively. Then

        Y = (U_1/k_1)/(U_2/k_2)

   has a central F distribution with k_1 and k_2 degrees of freedom. That is, Y ~ F_{k_1,k_2}. The pdf is

        f_Y(y) = (k_1/k_2)^{k_1/2} Γ[(k_1 + k_2)/2] y^{(k_1−2)/2} / { Γ(k_1/2) Γ(k_2/2) [1 + y k_1/k_2]^{(k_1+k_2)/2} } I_{(0,∞)}(y).

   If T ~ t_k, then T^2 ~ F_{1,k}.

   Moments of an F random variable. Suppose that Y ~ F_{k_1,k_2}. Then

        E(Y^r) = E[(k_2 U_1)^r (k_1 U_2)^{−r}] = (k_2/k_1)^r E(U_1^r) E(U_2^{−r}),

   where U_1 ~ χ^2_{k_1}, U_2 ~ χ^2_{k_2}, and U_1 ⊥ U_2. Using the general expression for the moments of a χ^2 random variable, it can be shown that for any real-valued r,

        E(Y^r) = does not exist,                                                if r ≥ k_2/2;
               = k_2^r Γ(k_1/2 + r) Γ(k_2/2 − r) / [k_1^r Γ(k_1/2) Γ(k_2/2)],   if r < k_2/2.

   Using the above expression, it is easy to show that

        E(Y) = k_2/(k_2 − 2) if k_2 > 2   and   Var(Y) = 2k_2^2 (k_1 + k_2 − 2) / [k_1 (k_2 − 2)^2 (k_2 − 4)] if k_2 > 4.
3. Beta Distributions: Let U_1 and U_2 be independent χ^2 random variables with degrees of freedom k_1 and k_2, respectively. Then

        Y = U_1/(U_1 + U_2) ~ Beta(k_1/2, k_2/2).

   That is, Y has a beta distribution with parameters k_1/2 and k_2/2. More generally, if U_1 ~ Gam(α_1, λ), U_2 ~ Gam(α_2, λ), and U_1 ⊥ U_2, then

        Y = U_1/(U_1 + U_2) ~ Beta(α_1, α_2).

   If Y ~ Beta(α_1, α_2), then the pdf of Y is

        f_Y(y) = y^{α_1−1} (1 − y)^{α_2−1} / B(α_1, α_2) I_{(0,1)}(y),

   where B(α_1, α_2) is the beta function and is defined as

        B(α_1, α_2) = Γ(α_1) Γ(α_2) / Γ(α_1 + α_2).

   • If B ~ Beta(α_1, α_2), then

        α_2 B / [α_1 (1 − B)] ~ F_{2α_1, 2α_2}.

   • If B ~ Beta(α_1, α_2), where α_1 = α_2 = 1, then B ~ Unif(0, 1).

   • If X ~ Beta(α_1, α_2), then

        E(X^r) = Γ(α_1 + r) Γ(α_1 + α_2) / [Γ(α_1 + α_2 + r) Γ(α_1)],   provided that α_1 + r > 0.

     Proof:

        E(X^r) = ∫_0^1 x^r x^{α_1−1} (1 − x)^{α_2−1} / B(α_1, α_2) dx = ∫_0^1 x^{α_1+r−1} (1 − x)^{α_2−1} / B(α_1, α_2) dx
               = [B(α_1 + r, α_2) / B(α_1, α_2)] ∫_0^1 x^{α_1+r−1} (1 − x)^{α_2−1} / B(α_1 + r, α_2) dx
               = B(α_1 + r, α_2) / B(α_1, α_2) = Γ(α_1 + r) Γ(α_1 + α_2) / [Γ(α_1 + α_2 + r) Γ(α_1)],

     provided that α_1 + r > 0, because the last integral is the integral of the pdf of a random variable with distribution Beta(α_1 + r, α_2).

   • If F ~ F_{k_1,k_2}, then

        k_1 F / (k_1 F + k_2) ~ Beta(k_1/2, k_2/2).
Chapter 7
Appendix 1: Practice Exams
7.1 Exam 1
1. Let A and B be any two events. Use the axioms of probability to prove that P(A^c) = 1 − P(A).

2. Let A_1, A_2, ..., A_k be a collection of events. Prove that

        (∪_{i=1}^k A_i)^c = ∩_{i=1}^k A_i^c.

3. Barbara and I have 7 coffee mugs at home; 2 of design A, 3 of design B, and 2 of design C. Barbara likes to arrange the coffee mugs on the shelf such that all mugs of the same design are adjacent to one another. I prefer to place them at random on the shelf. If I unload all seven mugs from the dishwasher and place them on the shelf at random, what is the probability that my arrangement will satisfy Barbara's sense of kitchen esthetics?

4. The order on a ballot of six candidates for two positions is determined at random. The six candidates include the two current office holders. What is the probability that the first two candidates listed on the ballot are the current office holders?

5. A committee with three members is to be selected at random without replacement from a collection of 30 people, of whom 10 are males and 20 are females. Find the probability that the committee contains two females and one male. Note: just give the equation; you do not need to evaluate the equation.

6. Suppose that a lie detector test has the following properties. If the suspect is telling the truth, then the lie detector will correctly say so with probability 0.85. If the suspect is lying, then the lie detector will say so with probability 0.99. Suppose that 95% of the time, suspects are telling the truth. Find the probability that a suspect is lying if the lie detector test says that he or she is lying.

7. Suppose that A and B are independent events. Verify that A and B^c are independent events. Hint: use the law of total probability.

8. A stand of trees contains 10 aspen, 5 lodgepole pine, and 7 Douglas fir trees. Seven trees are sampled at random without replacement. Find the probability that the seventh tree selected is a lodgepole pine, given that the third tree selected is an aspen.

9. Let X and Y be random variables with joint probability function

        f_{X,Y}(i, j) = (i + j)/[n^2 (n + 1)]   for i = 1, 2, ..., n and j = 1, 2, ..., n;
                      = 0                       otherwise.

   (a) True or False: The joint distribution of X and Y is exchangeable. (2 pts)
   (b) Find the conditional distribution of X given Y. That is, find f_{X|Y}(i|j). (9 pts)
7.2 Exam 2

1. Let Y be a random variable with p.f.

        f_Y(y) = e^{−λ} λ^y / [(1 − e^{−λ}) y!],   for y = 1, 2, ..., and where λ > 0.

   Find the probability generating function of Y.

2. Suppose that the rv X has probability generating function η_X(t) = t/(2 − t). Find E(X).

3. From historical records, the mean minimum temperature in Bozeman on November 17 is 19°F. The standard deviation is 9.5. Find the mean and standard deviation of the November 17 minimum temperature expressed in °C.

4. Consider a litter of k wolf pups. Let X_1, X_2, ..., X_k be the volume (in milliliters) of milk that the pups drink on a given day. Suppose that the random variables have an exchangeable distribution with E(X_1) = μ, Var(X_1) = σ^2, and Cov(X_1, X_2) = ρσ^2. Find expressions for the following quantities. Your answers should be functions of σ and ρ.
   (a) Corr(X_i, X_j), where i ≠ j.
   (b) Var(X_i − X_j), where i ≠ j.
   (c) Var(X̄), where X̄ = k^{−1} Σ_{i=1}^k X_i.

5. A certain biased coin has probability 0.7 of landing heads when tossed. The coin is tossed 20 times and the number of heads is recorded. Denote this random variable by Y. A fair coin is then tossed until the Y-th success occurs. Denote the number of tosses of the fair coin by X. Find the expectation of X.

6. A multiple choice test consists of 12 items, each with four possible choices. Tim has not studied for the exam, so he guesses on each question. Let X be the number that he gets correct.
   (a) Give E(X) and Var(X).
   (b) To pass, Tim must score 8 or better. What is the exact probability that Tim fails?
   (c) We discussed two ways to approximate the probability in part (b). Use the more appropriate of the two and compute an approximation to the probability that Tim fails.

7. The probability of winning the Powerball lottery is 1/54,979,154. For simplicity, round this to one over 55 million. Suppose that 11 million tickets are sold. Let X be the number of winning tickets.
   (a) Give the mean of X.
   (b) Find (approximately) the probability that there are either one or two winners.

8. An automatic cake icing machine is malfunctioning; it no longer forms the letters i and r, but instead writes a for each. So Happy Birthday Irma becomes Happy Baathday Aama. Cakes are randomly selected with probability 1/4 and inspected. The machine is stopped when the second error is found. Suppose that the letters i and r appear on half of the cakes. Let X be the number of cakes that have been iced when the machine is finally stopped. Find the probability function for X. Identify the family of distributions. Hint: first define what a success is.

9. Suppose that random samples of n_1 male and n_2 female adults have been obtained from populations that satisfy n_1/N_1 ≈ 0 and n_2/N_2 ≈ 0. Let X be the number of male students in the sample who wear glasses (or contacts) and let Y be the number of female students in the sample who wear glasses (or contacts). Suppose that the population proportions who wear glasses (or contacts) are identical in the two populations. Derive the conditional distribution of X given that X + Y = c. Identify the family of distributions.
7.3 Exam 3

1. Suppose that P(A) = 0.4, P(B) = 0.5, and P(A ∩ B) = 0.3. Find the following:
   (a) P(A ∪ B).
   (b) P(A ∩ B^c).

2. In a string of 12 Christmas tree lights, 3 are defective. The bulbs are selected one at a time at random without replacement and tested. Find the probability that the third defective bulb is found on the 6th test.

3. Consider a finite population of N distinct values x_1, x_2, ..., x_N. Denote the population mean by E(X) = μ and the population variance by Var(X) = σ^2. A sample of size n ≤ N is selected one at a time at random without replacement. Denote the joint probability of the sample by f_{X_1,X_2,...,X_n}(x_1, x_2, ..., x_n) and denote the sample mean by X̄ = n^{−1} Σ_{i=1}^n X_i.
   (a) Evaluate f_{X_1,X_2,...,X_n}(x_1, x_2, ..., x_n) and verify that the distribution is exchangeable.
   (b) Derive Var(X̄). Hint: what happens if n = N?

4. Suppose that X_i are iid Bern(p) random variables. Find the probability generating function of S = Σ_{i=1}^n X_i.

5. Suppose that X is a discrete random variable with support set S_X = {1, 2, 6, 9, 11} and probability function f_X(x). Compute E[1/f_X(X)].

6. Suppose that Interstate Highway 94 has two types of defects, namely A and B. Defect A follows a Poisson process with a mean of 2 defects per mile. Defect B follows a Poisson process with a mean of 4 defects per mile and is distributed independently of defect A. Let X be the number of defects of type A in a 10 mile segment of highway and let Y be the number of defects of type B in the same 10 mile segment. Derive the conditional distribution of X given that the 10 mile segment contains a total of 50 defects.

7. A discrete random variable has probability generating function

        η_X(t) = 1/2 + (1/3) t + (2/13) t^2 + (1/78) t^5.

   Find the probability function of X.

8. Suppose that X is a continuous random variable with cdf

        F_X(x) = 1 − e^{−λx}   for 0 < x < ∞;
               = 0              for x < 0,

   where λ is a positive constant.
   (a) Find the pdf.
   (b) Find an expression for x_p, the 100p-th percentile of the distribution.

9. Let X be the discrete uniform random variable with support S_X = {1, 2, ..., N} and cdf

        F_X(x) = 1        if x > N;
               = [x]/N    if x ∈ [1, N];
               = 0        if x < 1,

   where [x] is the largest integer equal to or less than x (i.e., the greatest integer function). Use the cdf to compute
   (a) P(X ≤ N − 1) when N > 1.
   (b) P(X < N − 1) when N > 2.
   (c) P(3 ≤ X ≤ 5) when N > 5.

10. Suppose that X is a continuous random variable with pdf f_X(x) = 6x(1 − x) I_{(0,1)}(x). Let Y = X/(1 − X).
    (a) Find the pdf of Y.
    (b) Find E(Y).

11. Suppose that I have a random number generator that will generate (pseudo) random numbers X ~ Unif(0, 1). I would like to generate random numbers from a distribution having pdf

        f_Y(y) = αθ^α / y^{α+1} I_{(θ,∞)}(y),

    where α and θ are positive constants. Given a sequence of iid X random variables, how do I obtain a sequence of iid Y random variables?

12. Suppose X and Y are continuous random variables with joint cdf

        F_{X,Y}(x, y) = x^2 y^2   for x ∈ (0, 1), y ∈ (0, 1),

    and where F_{X,Y}(x, y) is suitably defined elsewhere.
    (a) Find the joint pdf f_{X,Y}(x, y).
    (b) Does (X, Y) have an exchangeable distribution?
    (c) Are X and Y independent?

13. Suppose X and Y are continuous random variables with joint density

        f_{X,Y}(x, y) = (12/7)(x^2 + 2y)   for 0 < y < x < 1.

    (a) Find the marginal distribution of X.
    (b) Find the conditional distribution f_{Y|X}(y|x).
    (c) Given X = 0.5, use the regression function to predict the value of Y.

14. Let X and Y be exchangeable random variables (continuous or discrete) with E(X) = μ, Var(X) = σ^2, and Cov(X, Y) = ρσ^2.
    (a) Verify that Corr(X, Y) = ρ.
    (b) Find Corr(X, X − Y). Your answer should be a function of ρ only.
Chapter 8

Appendix 2: Selected Equations

    Σ_{i=0}^n r^i = (1 − r^{n+1})/(1 − r);    Σ_{i=0}^∞ r^i = 1/(1 − r) if |r| < 1;    Σ_{i=0}^∞ r^i/i! = e^r;

    Σ_{i=1}^n i = n(n + 1)/2;    Σ_{i=1}^n i^2 = n(n + 1)(2n + 1)/6;

    F° = 32 + (9/5) C°;

    g(t) = g(t_0) + Σ_{i=1}^{k−1} [∂^i g(t)/(∂t)^i |_{t=t_0}] (t − t_0)^i/i! + R_k,   where

    R_k = [∂^k g(t)/(∂t)^k |_{t=t*}] (t − t_0)^k/k!   and   t* ∈ (t, t_0) if t_0 > t; (t_0, t) if t_0 < t;

    ∫_a^b u dv = uv|_a^b − ∫_a^b v du