
Monday, 15 August 2011

Lecture 7 - Content
• Sets
• Probability and counting
• Conditional probability
Many students are not familiar with set theory, so spend some more time explaining the
various operations carefully. If time is left, spend some time talking about the cardinality
of N, Q and R.
Statistics (Advanced): Lecture 7 1
Sets
Before we look at probability it is necessary to understand sets, because probabilities
are typically described in terms of sets of outcomes on which an event occurs.
Definition 1. The set of all possible outcomes of an experiment is called a sample
space, denoted by Ω. Any subset A of the sample space Ω, denoted by A ⊆ Ω, is
called an event.
Definition 2. The counting operator N(A) is a set function that counts how many
elements belong to the set (event) A.
Example (Sample spaces).
Coin: Ω = {H, T}, N(Ω) = 2.
Dice: Ω = {1, 2, 3, 4, 5, 6}, N({1, 2, 5}) = 3.
Weight: Ω = ℝ₊, N(ℝ₊) = ∞.
(We want to count objects for any event.)
Statistics (Advanced): Lecture 7 2
Set Notation
Before we introduce probability we need to introduce some notation. Let A, B ⊆ Ω.
symbol     set theory                 probability
Ω          largest set                certain event
∅          empty set                  impossible event
A ∪ B      union of A and B           event A or event B
A ∩ B      intersection of A and B    event A and event B
Aᶜ = Ā     complement of A            not event A
Statistics (Advanced): Lecture 7 3
Intersection Operator
The set A ∩ B denotes the set such that if C ∈ A ∩ B then C ∈ A and C ∈ B
(∩ is called the intersection operator).
Statistics (Advanced): Lecture 7 4
Intersection Operator
Examples:
• {1, 2} ∩ {red, white} = ∅.
• {1, 2, green} ∩ {red, white, green} = {green}.
• {1, 2} ∩ {1, 2} = {1, 2}.
Some basic properties of intersections:
• A ∩ B = B ∩ A.
• A ∩ (B ∩ C) = (A ∩ B) ∩ C.
• A ∩ B ⊆ A.
• A ∩ A = A.
• A ∩ ∅ = ∅.
• A ⊆ B if and only if A ∩ B = A.
Statistics (Advanced): Lecture 7 5
Union Operator
The set A ∪ B denotes the set such that if C ∈ A ∪ B then C ∈ A and/or C ∈ B
(∪ is called the union operator).
Statistics (Advanced): Lecture 7 6
Union Operator
Examples:
• {1, 2} ∪ {red, white} = {1, 2, red, white}.
• {1, 2, green} ∪ {red, white, green} = {1, 2, red, white, green}.
• {1, 2} ∪ {1, 2} = {1, 2}.
Some basic properties of unions:
• A ∪ B = B ∪ A.
• A ∪ (B ∪ C) = (A ∪ B) ∪ C.
• A ⊆ (A ∪ B).
• A ∪ A = A.
• A ∪ ∅ = A.
• A ⊆ B if and only if A ∪ B = B.
Statistics (Advanced): Lecture 7 7
Set Minus
The set A \ B denotes the set such that if C ∈ A \ B then C ∈ A and C ∉ B.
Statistics (Advanced): Lecture 7 8
Set Complement
The set Aᶜ = Ā denotes the set such that if C ∈ Aᶜ then C ∉ A.
Statistics (Advanced): Lecture 7 9
Set Minus
Examples:
• {1, 2} \ {red, white} = {1, 2}.
• {1, 2, green} \ {red, white, green} = {1, 2}.
• {1, 2} \ {1, 2} = ∅.
• {1, 2, 3, 4} \ {1, 3} = {2, 4}.
Some basic properties of complements:
• A \ B ≠ B \ A.
• A ∩ Aᶜ = ∅.
• A ∪ Aᶜ = Ω.
• (Aᶜ)ᶜ = A.
• A \ A = ∅.
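These operations can be explored numerically. A minimal illustration using base R's built-in set functions, with the small example sets from the slides above:
A <- c(1, 2, 3, 4); B <- c(1, 3)
union(A, B)       # 1 2 3 4
intersect(A, B)   # 1 3
setdiff(A, B)     # 2 4   (matches {1,2,3,4} \ {1,3} = {2,4} above)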
Statistics (Advanced): Lecture 7 10
Theorem 1. The complement of the union of A and B equals the intersection of
the complements
(A ∪ B)ᶜ = Aᶜ ∩ Bᶜ.
Proof. Use Venn diagrams for LHS and RHS and colour areas.
Theorem 2 (de Morgan's Laws).
(⋃_{i=1}^{n} Aᵢ)ᶜ = ⋂_{i=1}^{n} Aᵢᶜ   and   (⋂_{i=1}^{n} Aᵢ)ᶜ = ⋃_{i=1}^{n} Aᵢᶜ.
Statistics (Advanced): Lecture 7 11
Counting - Ordered Sampling without replacement
Example (Ordered samples without replacement). The number of ordered samples
of size r we can draw without replacement from n objects is
n × (n − 1) × ⋯ × (n − r + 1) = n! / (n − r)!
Recall: 0! = 1.
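A quick check in R (the values n = 5, r = 3 are chosen just for illustration):
n <- 5; r <- 3
prod((n - r + 1):n)                # 5 * 4 * 3 = 60 ordered samples
factorial(n) / factorial(n - r)    # same value via n!/(n-r)!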
Statistics (Advanced): Lecture 7 12
Counting - Unordered Sampling without replacement
Example (Unordered samples without replacement).
ⁿCᵣ = C(n, r) = n! / (r!(n − r)!) = n Choose r.
Recall,
ⁿCᵣ = ⁿCₙ₋ᵣ since C(n, n − r) = n! / ((n − r)!(n − (n − r))!) = n! / ((n − r)! r!) = n Choose r,
and so
C(n, 0) = C(n, n) = 1.
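In R the binomial coefficient is the built-in choose() function; a small check of the symmetry above (n = 10, r = 3 are arbitrary illustration values):
n <- 10; r <- 3
choose(n, r)                   # 120
choose(n, n - r)               # 120, the same by symmetry
choose(n, 0); choose(n, n)     # both equal 1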
Statistics (Advanced): Lecture 7 13
Sampling without replacement (cont)
Example. Consider {A, B, C} and select r = 2 items:
Order important:     (A,B), (B,A), (A,C), (C,A), (B,C), (C,B)   →  3 × 2 = 6 samples.
Order not important: {A,B}, {A,C}, {B,C}                        →  C(3, 2) = 3 samples.
Statistics (Advanced): Lecture 7 14
Sampling in R
# Creating ordered lists
n = 158;
x = 1:n;
set.seed(6) # set random seed to 6 to reproduce results
sample(x) # random permutation of nos 1,2,...,158: n! possibilities
sample(x,10) # choose 10 numbers without replacement
sample(x,10,TRUE) # choose 10 numbers with replacement = bootstrap sampling
Statistics (Advanced): Lecture 7 15
What is Probability?
1. Subjective probability expresses the strength of one's belief (the basis of Bayesian
Statistics; a bit more on that later).
2. Classical probability concept, a mathematical answer for equally likely outcomes.
Theorem 3. If there are n equally likely possibilities, of which one must occur
and s are regarded as favourable (= successes), then the probability P of a
success is given by s/n.
Statistics (Advanced): Lecture 7 16
What is Probability?
3. The frequency interpretation of probability:
Theorem 4. The probability of an event (or outcome) is the proportion of times
the event occurs in a long run of repeated experiments.
Or in words:
If an experiment is repeated n times under identical conditions, and if
the event A occurs m times, then as n becomes large (i.e. in the long run)
the probability of A occurring is the ratio m/n.
Statistics (Advanced): Lecture 7 17
What is Probability?
• The constancy of the gender ratio at birth. In Australia, the proportion of male
births is fairly stable at 0.51. This long run relative frequency is used to estimate
the probability that a randomly chosen birth is male.
• Cancer council records show the age-standardised mortality rate from breast
cancer in NSW was close to 20 per 100,000 over the years 1972-2000. For a
randomly chosen woman, we use 0.0002 as the probability of (dying from) breast cancer.
Example (Coin tossing).
Buffon (1707-1788): n = 4,040, relative frequency of heads 50.7%.
Pearson (1857-1936): n = 24,000, relative frequency of heads 50.05%.
Coin Tossing in R
table(sample(c("H","T"),4040,T))/4040
table(sample(c("H","T"),24000,T))/24000
Statistics (Advanced): Lecture 7 18
Coin Tossing - 2010s
In the 2010s Stanford Professor Persi Diaconis developed the Coin Tosser 3000.
However, the machine is designed to flip a coin with the same result every time!
Statistics (Advanced): Lecture 7 19
What is Probability?
4. Mathematical formulation of probability
Definition 3 (due to Andrey Kolmogorov, 1933). Given a sample space Ω and
A ⊆ Ω, we define P(A), the probability of A, to be a value of a non-negative
additive set function that satisfies the following three axioms:
A1: For any event A, 0 ≤ P(A) ≤ 1,
A2: P(Ω) = 1,
A3: If A and B are mutually exclusive events (A ∩ B = ∅), then
P(A ∪ B) = P(A) + P(B).
A3′ (general form): If A₁, A₂, A₃, ... is a finite or infinite sequence of mutually exclusive events
in Ω, then
P(A₁ ∪ A₂ ∪ A₃ ∪ ...) = P(A₁) + P(A₂) + P(A₃) + ....
Statistics (Advanced): Lecture 7 20
Example (Lotto). A lotto type barrel contains 10 balls numbered 1, 2, . . . , 10. Three
balls are drawn.
i. How many distinct samples can be drawn?
n = ¹⁰C₃ = C(10, 3) = (10 × 9 × 8) / (1 × 2 × 3) = 120.
ii. Event A = (all three numbers drawn are at most seven, i.e. from {1, 2, ..., 7}).
C(7, 3) = (7 × 6 × 5) / (1 × 2 × 3) = 35 successes ⇒ P(A) = 35/120 = 7/24.
iii. Event B = (all drawn numbers are even): P(B) = C(5, 3)/120 = 10/120 = 1/12.
A ∩ B = {2, 4, 6} ⇒ P(A ∩ B) = 1/120.
iv. P(A ∪ B)? To answer this we need our next theorem.
Statistics (Advanced): Lecture 7 21
Addition Theorem
Theorem 5 (Addition Theorem). If A and B are any events in Ω, then
P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
Proof. Use Venn diagrams, i.e. draw two overlapping circles and colour regions.
Alternatively use the axioms only. First note that
P(A) = P((A \ (A ∩ B)) ∪ (A ∩ B)) = P(A \ (A ∩ B)) + P(A ∩ B)   by A3.   (1)
Similarly, P(B) = P(B \ (A ∩ B)) + P(A ∩ B). Next,
P(A ∪ B) = P((A \ (A ∩ B)) ∪ (B \ (A ∩ B)) ∪ (A ∩ B))
         = P(A \ (A ∩ B)) + P(B \ (A ∩ B)) + P(A ∩ B)   by A3
         = [P(A \ (A ∩ B)) + P(A ∩ B)] + [P(B \ (A ∩ B)) + P(A ∩ B)] − P(A ∩ B)
         = P(A) + P(B) − P(A ∩ B),
which follows from result (1).
Statistics (Advanced): Lecture 7 22
Example (Lotto). A lotto type barrel contains 10 balls numbered 1, 2, . . . , 10. Three
balls are drawn.
i. How many distinct samples can be drawn? 120.
ii. Event A = (all three numbers drawn are at most seven): P(A) = 7/24.
iii. Event B = (all drawn numbers are even): P(B) = 1/12.
Also P(A ∩ B) = 1/120.
iv. P(A ∪ B)?
P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = 7/24 + 1/12 − 1/120 = 44/120 = 11/30.
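A quick numerical check of this example in R, using choose() and the event definitions above:
n.samples <- choose(10, 3)           # 120 distinct unordered samples
p.A  <- choose(7, 3) / n.samples     # all three numbers at most 7
p.B  <- choose(5, 3) / n.samples     # all three numbers even
p.AB <- 1 / n.samples                # only {2, 4, 6} lies in both events
p.A + p.B - p.AB                     # addition theorem: 44/120 = 11/30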
Statistics (Advanced): Lecture 7 23
Poincaré's Theorem
Theorem 6 (Poincaré's formula, not part of M1905). Let A₁, A₂, ..., Aₙ be any
events in Ω. Then,
P(⋃_{i=1}^{n} Aᵢ) = Σ_{i=1}^{n} P(Aᵢ) − Σ_{i<j} P(Aᵢ ∩ Aⱼ) + Σ_{i<j<k} P(Aᵢ ∩ Aⱼ ∩ Aₖ) − ...
                   + (−1)^(n−1) P(⋂_{i=1}^{n} Aᵢ).
Statistics (Advanced): Lecture 7 24
(Unconditional) probability
• Recall the 3 Axioms of probability.
• P(Aᶜ) = 1 − P(A), since A ∪ Aᶜ = Ω; hence 1 = P(Ω) = P(A ∪ Aᶜ) = P(A) + P(Aᶜ).
• P(∅) = 0, because ∅ = Ωᶜ; hence P(∅) = 1 − P(Ω) = 0.
• etc.
Statistics (Advanced): Lecture 7 25
Conditional Probability - Motivating Example
Consider the following (fictional) table of Sports Mortality Rates compiled over the
last decade:
SPORT         Description                                            DEATHS
Chess         Board game considered the national sport of Russia          0
Boxing        Barbaric sport where two people hit each other              5
Chess Boxing  5 minutes of Chess followed by 2 minutes of Boxing          0
Sky Diving    Jumping out of a plane with a parachute                    10
Lawn Bowls    Rolling a ball across grass to hit other balls           1000
Statistics (Advanced): Lecture 7 30
Hence, Lawn Bowls is the most dangerous sport by far!
Statistics (Advanced): Lecture 7 31
Conditional Probability - Motivating Example
However, the number of deaths given that the sportsperson is young is zero, so that
P(Dying from Lawn Bowls | sportsperson is young) ≈ 0
even though
P(Dying from Lawn Bowls) is large.
Statistics (Advanced): Lecture 7 32
Conditional Probability - Another Motivating Example
What is the probability of the important event
A = (starting salary after uni ≥ 60k)?
What is the sample space Ω?
Possibilities are:
Ω₁ = all students,
Ω₂ = all male students,
Ω₃ = all students with a maths degree.
Statistics (Advanced): Lecture 7 33
Conclusion
• Probability depends on the underlying sample space Ω!
• Hence, if it is unclear to which sample space A refers, then make it clear by writing
P(A | Ω) instead of P(A),
which we read as the conditional probability of A relative to Ω or given Ω, respectively.
Definition 4. If A and B are any events in Ω and P(B) ≠ 0, then the conditional
probability of A given B is
P(A | B) = P(A ∩ B) / P(B).
Statistics (Advanced): Lecture 7 34
Additional material for Lecture 7
A combinatorial proof of the binomial theorem
The binomial theorem says
(x + y)ⁿ = Σ_{k=0}^{n} C(n, k) x^k y^(n−k).
Consider the more complicated product
(x₁ + y₁)(x₂ + y₂) ⋯ (xₙ + yₙ).
Its expansion consists of the sum of 2ⁿ terms, each term being the product of n
factors. Each term contains either xₖ or yₖ, for each k = 1, ..., n. For example,
(x₁ + y₁)(x₂ + y₂) = x₁x₂ + x₁y₂ + y₁x₂ + y₁y₂.
Now, there is 1 = C(n, 0) term with y terms only, n = C(n, 1) terms with one x term and (n − 1)
y terms, etc. In general, there are C(n, k) terms with exactly k x's and (n − k) y's. The
theorem follows by letting xₖ = x and yₖ = y.
Statistics (Advanced): Lecture 7 35
More on set theory
The operations of forming unions, intersections and complements of events obey rules
similar to the rules of algebra. Following are some rules for events A, B and C:
Commutative law: A ∪ B = B ∪ A and A ∩ B = B ∩ A.
Associative law: (A ∪ B) ∪ C = A ∪ (B ∪ C) and (A ∩ B) ∩ C = A ∩ (B ∩ C).
Distributive law: (A ∪ B) ∩ C = (A ∩ C) ∪ (B ∩ C) and (A ∩ B) ∪ C = (A ∪ C) ∩ (B ∪ C).
Statistics (Advanced): Lecture 7 36
extra page
Statistics (Advanced): Lecture 7 37
Tuesday, 16 August 2011
Lecture 8 - Content
• Conditional probability
• Bayes' rule
• Integer valued random variables
Conditional probability equation
P(A | Ω) = P(A ∩ Ω) / P(Ω) = P(A);    for P(B) > 0:  P(A | B) = P(A ∩ B) / P(B).
Statistics (Advanced): Lecture 8 38
Conditional probability (cont)
Example (Defective machine parts). Suppose that 500 machine parts are inspected
before they are shipped.
• I = (a machine part is improperly assembled)
• D = (a machine part contains one or more defective components)
N(Ω) = 500, N(I) = 30, N(D) = 15, N(I ∩ D) = 10
Statistics (Advanced): Lecture 8 39
Example (cont)
Assumption: equal probabilities in the selection of one of the machine parts.
Using the classical concept of probability we get:
P(D) = P(D | Ω) = N(D)/N(Ω) = 15/500 = 3/100,
P(D | I) = N(D ∩ I)/N(I) = 10/30 = 1/3 > 3/100;
note that if N(Ω) > 0, then
N(D ∩ I)/N(I) = [N(D ∩ I) · (1/N(Ω))] / [N(I) · (1/N(Ω))] = P(D ∩ I)/P(I).
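The same calculation in R, directly from the counts in the example:
N.Omega <- 500; N.I <- 30; N.D <- 15; N.DI <- 10
N.D / N.Omega                          # unconditional P(D) = 0.03
N.DI / N.I                             # conditional P(D | I) = 1/3
(N.DI / N.Omega) / (N.I / N.Omega)     # same value via P(D and I) / P(I)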
Statistics (Advanced): Lecture 8 40
General multiplication rule of probability
Theorem 7 (General multiplication rule of probability). If A and B are any events
in Ω, then
P(A ∩ B) = P(B) · P(A | B), if P(B) ≠ 0;   interchanging A and B yields
P(A ∩ B) = P(A) · P(B | A), if P(A) ≠ 0.
Proof. This holds because
P(A | B) := P(A ∩ B) / P(B),   etc.
What happens if P(A | B) = P(A)?
⇒ the additional information of B is of no use ⇒ special multiplication rule!
P(A ∩ B) = P(A) · P(B).
Statistics (Advanced): Lecture 8 41
Definition of independence of events
Definition 5. If A and B are any two events in a sample space Ω, we say that A
is independent of B if and only if P(A | B) = P(A).
From the general multiplication rule it follows that if P(A | B) = P(A) then P(B | A) =
P(B), and we say simply that A and B are independent.
Statistics (Advanced): Lecture 8 42
Alternative View of Independence
Alternatively, if A and B are independent then P(A ∩ B) = P(A) · P(B) and hence,
P(B | A) = P(A ∩ B) / P(A)      (using Bayes' rule)
         = P(A) · P(B) / P(A)   (using independence)
         = P(B),
which can also be interpreted as saying that knowing A does not affect the
probability of B.
Statistics (Advanced): Lecture 8 43
Independence
In other words, the events A and B are independent if the chance that one happens
remains the same regardless of how the other turns out.
Example. Suppose that we toss a fair coin twice. Let
A = heads on the first toss
and
B = heads on the second toss.
Now suppose A occurred. Then
P(B knowing A has happened) = 1/2.
Statistics (Advanced): Lecture 8 44
Independence - Example 2
Example. Consider the following 6 boxes (three of the boxes, one with each number, are coloured green):
1  2  3  1  2  3
Suppose we select a box at random; as it is drawn you see that it is green. Then
P(A = getting a 2) = 2/6 = 1/3,
P(B = getting a 2 if I know it is green) = 1/3.
Knowing the selected box is green has not changed our knowledge about which
numbers might be drawn.
Hence, the events A and B are independent.
Statistics (Advanced): Lecture 8 45
Independence - Example 3
Example. Consider the following 6 boxes (three of the boxes, containing 1, 1 and 2, are coloured green):
1  1  2  1  2  2
Suppose we select a box at random; as it is drawn you see that it is green. Then
P(A = getting a 2) = 3/6 = 1/2,
P(B = getting a 2 if I know it is green) = 1/3.
Knowing the selected box is green HAS CHANGED our knowledge about which
numbers might be drawn.
Hence, the events A and B are NOT independent.
Statistics (Advanced): Lecture 8 46
Independence - Example 4
Example. Two cards are drawn at random from an ordinary deck of 52 playing
cards. What is the probability of getting two aces if
(a) the first card is replaced before the second is drawn?
(Solution: 4/52 × 4/52 = 1/169, since here P(A₁ ∩ A₂) = P(A₁) · P(A₂).)
(b) the first card is not replaced before the second card is drawn?
(Solution: 4/52 × 3/51 = 1/221, but unlike above P(A₂ | A₁) ≠ P(A₂).)
Independence is violated when the sampling is without replacement.
Statistics (Advanced): Lecture 8 47
Independence - Example 5
Medical records indicate that the proportion of children who have had measles by the
age of 8 is 0.4. The corresponding proportion for chicken pox is 0.5. The proportion
who have had both diseases by the age of 8 is 0.3. An infant is randomly selected.
Let A represent the event that he contracts measles, and B that he contracts chicken
pox, by the age of 8 years.
• Estimate P(A), P(B) and P(A ∩ B).
P(A) = 0.4, P(B) = 0.5 and P(A ∩ B) = 0.3.
• Are A and B independent?
P(A) · P(B) = 0.2 ≠ P(A ∩ B) = 0.3, so NO, A and B are not independent.
Statistics (Advanced): Lecture 8 48
Bayes' rule
Example (The burgers are better...). Assume you get your burgers
• 60% from supplier B₁ [HJ]
• 30% from supplier B₂ [McD]
• 10% from supplier B₃ [RR]
⇒ P(B₁) = 0.6, P(B₂) = 0.3, and P(B₃) = 0.1.
Interested in the event A = (good burger).
[Draw a picture that shows that B₁, B₂, and B₃ are mutually exclusive events with
B₁ ∪ B₂ ∪ B₃ = Ω.]
Statistics (Advanced): Lecture 8 49
Example (cont)
It follows that,
A = A ∩ (B₁ ∪ B₂ ∪ B₃) = (A ∩ B₁) ∪ (A ∩ B₂) ∪ (A ∩ B₃).
Note that (A ∩ B₁), (A ∩ B₂) and (A ∩ B₃) are mutually exclusive.
By Axiom 3 we get
P(A) = P((A ∩ B₁) ∪ (A ∩ B₂) ∪ (A ∩ B₃))
     = P(A ∩ B₁) + P(A ∩ B₂) + P(A ∩ B₃).
Remember the general multiplication rule:
We already know that
P(A ∩ B) = P(B) · P(A | B), if P(B) ≠ 0,
         = P(A) · P(B | A), if P(A) ≠ 0.
Statistics (Advanced): Lecture 8 50
Example (cont)
So we can write
P(A) = P(B₁) · P(A | B₁) + P(B₂) · P(A | B₂) + P(B₃) · P(A | B₃)
     = 0.6 × 0.95 + 0.3 × 0.80 + 0.1 × 0.65
     = 0.875,
where P(A | B₁) = 0.95 (very good), P(A | B₂) = 0.80 (sufficient) and P(A | B₃) = 0.65 (insufficient).
[The above probabilities 0.95, 0.80, 0.65 are from personal experience, i.e. subjective probability.]
What did the example teach us?
Strategy: decompose complicated events into mutually exclusive simple(r) events!
Statistics (Advanced): Lecture 8 51
Total probability rule
Theorem 8 (Total probability rule). If B₁, B₂, ..., Bₙ are mutually exclusive events
such that B₁ ∪ B₂ ∪ ... ∪ Bₙ = Ω, then for any event A ⊆ Ω,
P(A) = Σ_{i=1}^{n} P(Bᵢ) · P(A | Bᵢ).
Example (Burger, cont). We know already that supplier B₃ is bad. So what is
P(B₃ | A) (if a burger is good, is it from B₃)? By definition of the conditional probability,
since P(A) > 0,
P(B₃ | A) = P(A ∩ B₃) / P(A) = P(B₃ ∩ A) / P(A)
          = P(B₃) · P(A | B₃) / Σ_{i=1}^{3} P(Bᵢ) · P(A | Bᵢ)
          = (0.1 × 0.65) / 0.875 = 0.074.
After we know that a burger is good, the probability that it comes from B₃ decreases
from 0.1 to 0.074.
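The burger calculation is easy to reproduce in R; the vectors below collect the priors P(Bᵢ) and the conditional probabilities P(A | Bᵢ) from the example:
prior <- c(0.60, 0.30, 0.10)    # P(B1), P(B2), P(B3)
lik   <- c(0.95, 0.80, 0.65)    # P(A | B1), P(A | B2), P(A | B3)
p.A   <- sum(prior * lik)       # total probability rule: P(A) = 0.875
prior * lik / p.A               # Bayes' rule: P(B_i | A); third entry ~ 0.074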
Statistics (Advanced): Lecture 8 52
Bayes' rule or Theorem
What we just derived is the famous formula called Bayes' rule or theorem.
Theorem 9 (Bayes' Rule). If B₁, B₂, ..., Bₙ are mutually exclusive events such
that B₁ ∪ B₂ ∪ ... ∪ Bₙ = Ω, then for any event A ⊆ Ω,
P(Bⱼ | A) = P(A | Bⱼ) · P(Bⱼ) / [Σ_{i=1}^{n} P(A | Bᵢ) · P(Bᵢ)].
The probabilities P(Bᵢ) are called the prior (a priori) probabilities and the probabilities P(Bᵢ | A)
the posterior (a posteriori) probabilities, i = 1, ..., n.
Statistics (Advanced): Lecture 8 53
Reverend Thomas Bayes (1701 - 1761)
• Born in Hertfordshire (London, England),
• was a Presbyterian minister,
• studied theology and mathematics,
• best known for "Essay Towards Solving a Problem in the Doctrine of Chances",
• where Bayes' Theorem was first proposed.
• Words: Bayes' rule, Bayes' Theorem, Bayesian Statistics.
Statistics (Advanced): Lecture 8 54
Example of Bayes' Rule - Screening test for Tuberculosis
                        TB (D+)   No TB (D−)   Total
X-ray Positive (S+)          22           51      73
X-ray Negative (S−)           8         1739    1747
Total                        30         1790    1820
What is the probability that a randomly selected individual has tuberculosis given
that his or her X-ray is positive, given that P(D+) = 0.000093?
• P(D+) = 0.000093, which implies that P(D−) = 0.999907.
• P(S+ | D+) = 22/30 = 0.7333
• P(S+ | D−) = 51/1790 = 0.0285
P(D+ | S+) = P(S+ | D+) P(D+) / [P(S+ | D+) P(D+) + P(S+ | D−) P(D−)]
           = (0.7333 × 0.000093) / (0.7333 × 0.000093 + 0.0285 × 0.999907)
           = 0.00239
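The same screening calculation in R, with the sensitivity and false positive rate read off the table and the prevalence P(D+) as given:
prev <- 0.000093       # P(D+), assumed population prevalence
sens <- 22 / 1879 * 0  # placeholder removed; see next line
sens <- 22 / 30        # P(S+ | D+) from the table
fpr  <- 51 / 1790      # P(S+ | D-) from the table
sens * prev / (sens * prev + fpr * (1 - prev))   # P(D+ | S+) ~ 0.00239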
Statistics (Advanced): Lecture 8 55
Integer valued random variables
Many observed numbers are the random result of many possible numbers.
Definition 6. A random variable X is a real-valued function of the elements of a
sample space Ω.
Note that such functions are denoted with capital letters and their images (outcomes)
with lower case letters, e.g. x.
Examples.
• How many times (X) will you be caught speeding?
• What will your final mark (Y) for MATH1905 be?
• How old (Z, in years) do you think your stats lecturer is?
Statistics (Advanced): Lecture 8 56
Random Variable Example - 3 Coins
Consider tossing three coins. The number of heads showing when the coins land is
a random variable: it assigns the number 0 to the outcome {T, T, T}, the number
1 to the outcome {T, T, H}, the number 2 to the outcome {T, H, H}, and the
number 3 to the outcome {H, H, H}.
Statistics (Advanced): Lecture 8 57
Random Variable Example - 3 Coins
Events             X = Number of Heads   Probability
TTT                0                     P(X = 0) = 1/8
TTH, THT, HTT      1                     P(X = 1) = 3/8
THH, HTH, HHT      2                     P(X = 2) = 3/8
HHH                3                     P(X = 3) = 1/8
Statistics (Advanced): Lecture 8 58
Random Variable Notation - 3 Coins
We use upper case letters to denote unobserved random variables, say X, and
lower case letters for their observed values, in this case x.
For example, in the example above, before the three coins land we denote the number
of heads by X; after the coins have landed we denote the observed number of heads by x, so that
we can write P(X = x).
Statistics (Advanced): Lecture 8 59
The mother of all examples: Bernoulli trials!
Definition 7. Bernoulli trials satisfy the following assumptions:
(i) there are only two possible outcomes for each trial,
(ii) the probability of success is the same for each trial,
(iii) the outcomes from different trials are independent,
(iv) there is a fixed number n of Bernoulli trials conducted.
Example (n = 1, coin). Ω: Head or Tail. We can describe the trial (before flipping
the coin) in full detail. Consider a function
X : {H, T} → {0, 1}  s.t.  X(H) = x_H = 1 and X(T) = x_T = 0.
What is the probability that X = x_H = 1?
P(X = 1) = P(X = x_H) = P(H) = p = 1/2  ⇒  P(X = 0) = 1/2.
Statistics (Advanced): Lecture 8 60
Jacob Bernoulli (1654-1705)
• Born in Basel (Switzerland),
• 1 of 8 mathematicians in his family,
• studied: theology, maths & astro,
• best known for "Ars Conjectandi" (The Art of Conjecture),
• application of probability theory to games of chance, introduction of the law of large
numbers.
• Words: Bernoulli trial, Bernoulli numbers.
Statistics (Advanced): Lecture 8 61
extra page
Statistics (Advanced): Lecture 8 62
Monday, 22 August 2011
Lecture 9 - Content
• Distribution of a random variable
• Binomial distribution
• Mean of a distribution
Statistics (Advanced): Lecture 9 63
Revised Axioms of Probability
In Lecture 7 we used the following definition of probability.
Definition 8 (due to Andrey Kolmogorov, 1933). Given a sample space Ω and A ⊆ Ω,
we define P(A), the probability of A, to be a value of a non-negative additive set
function that satisfies the following three axioms:
A1: For any event A, 0 ≤ P(A) ≤ 1,
A2: P(Ω) = 1,
A3: If A₁, A₂, A₃, ... is a finite or infinite sequence of mutually exclusive events in
Ω, then
P(A₁ ∪ A₂ ∪ A₃ ∪ ...) = P(A₁) + P(A₂) + P(A₃) + ....
However, nothing is lost if we replace A1: 0 ≤ P(A) ≤ 1 with A1′: 0 ≤ P(A).
Statistics (Advanced): Lecture 9 64
Proof by Contradiction
Assume the following 3 axioms:
A1: For any event A ⊆ Ω, 0 ≤ P(A),
A2: P(Ω) = 1,
A3: If A₁, A₂, A₃, ... is a finite or infinite sequence of mutually exclusive events in
Ω, then
P(A₁ ∪ A₂ ∪ A₃ ∪ ...) = P(A₁) + P(A₂) + P(A₃) + ....
Now let us assume that A4: for some event A ⊆ Ω, P(A) > 1. Then
• 1 = P(Ω) = P(A ∪ Aᶜ) = P(A) + P(Aᶜ).
• Rearranging, we have P(Aᶜ) = 1 − P(A).
• By A4 we have P(Aᶜ) < 0. This contradicts A1, hence A4 cannot be assumed.
Statistics (Advanced): Lecture 9 65
Revised Axioms of Probability
Definition 9 (due to Andrey Kolmogorov, 1933). Given a sample space Ω and A ⊆ Ω,
we define P(A), the probability of A, to be a value of a non-negative additive set
function that satisfies the following three axioms:
A1: For any event A, 0 ≤ P(A),
A2: P(Ω) = 1,
A3: If A₁, A₂, A₃, ... is a finite or infinite sequence of mutually exclusive events in
Ω, then
P(A₁ ∪ A₂ ∪ A₃ ∪ ...) = P(A₁) + P(A₂) + P(A₃) + ....
This is the minimal set of axioms needed to define probability.
Random Variables Reminder
n = 1 (Coin): Ω = {H, T}; X maps Ω to {0, 1} ⊂ ℝ. Thus X(H) = x_H = 1 and X(T) = x_T = 0, so P(X = 1) = P(H) = p.
Statistics (Advanced): Lecture 9 66
Distribution of a random variable
Definition 10. The probability distribution of an integer-valued random variable X
is a list of the possible values of X together with their probabilities
pᵢ = P(X = i) ≥ 0  with  Σᵢ pᵢ = 1.
There is nothing special about the subscript i; we could and will equally well use j,
k, x etc.
Definition 11. The probability that the value of a random variable X is less than
or equal to x, that is
F(x) = P(X ≤ x),
is called the cumulative distribution function or just the distribution function.
Also, note that for integer valued random variables
P(X = x) = F(x) − F(x − 1).
Statistics (Advanced): Lecture 9 67
Example (n = 3, IT problems). A network is fragile. By experience: P(F) = 0.1 = 1 − p
is the probability that in any given week there is at least 1 major problem; P(S) = 0.9 = p is the probability that there is none,
respectively. Out of 3 weeks, how many weeks X are problem-free (successes), and with what
probability?
(a) All possible outcomes:
FFF SFF FSF FFS
SSF FSS SFS SSS
(b) What is the probability of each outcome? Use the special multiplication rule of
probability because the weeks are independent!
(c) What is the probability distribution of the number of successes, X, among the
3 weeks?
Statistics (Advanced): Lecture 9 68
Example (cont)
P(X = 0) = P(FFF) = P(F) · P(F) · P(F) = (1 − p)³
P(X = 1) = P(SFF ∪ FSF ∪ FFS)     [mutually exclusive events]
         = P(SFF) + P(FSF) + P(FFS)
         = 3(1 − p)²p = C(3, 1)(1 − p)²p,   select one S out of 3 trials.
Similarly we get for X = 2 and X = 3:
P(X = 2) = C(3, 2)(1 − p)p²,   select two S out of 3 trials,
P(X = 3) = C(3, 3)p³,          select three S out of 3 trials.
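These four probabilities are exactly the B(3, p) probabilities, so they can be checked in R with dbinom (here p = 0.9 as in the example):
p <- 0.9
c((1 - p)^3, 3 * (1 - p)^2 * p, 3 * (1 - p) * p^2, p^3)   # hand-derived values
dbinom(0:3, size = 3, prob = p)                           # 0.001 0.027 0.243 0.729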
Statistics (Advanced): Lecture 9 69
Binomial distribution
We can generalise this result to any n ≥ 1 and success probability p ∈ [0, 1].
Definition 12. The probability distribution of the number of successes X = i in
n ∈ ℕ independent Bernoulli trials is called the binomial distribution,
pᵢ = P(X = i) = C(n, i) pⁱ (1 − p)^(n−i).
The success probability of a single Bernoulli trial is p and i = 0, 1, ..., n.
To say that the random variable X has the binomial distribution with parameters n
and p we write X ~ B(n, p).
This defines a family of probability distributions, with each member characterised
by a given value of the parameter p and the number of trials n.
Statistics (Advanced): Lecture 9 70
Binomial distribution
Since pᵢ, 0 ≤ i ≤ n, is a probability distribution we have the identity (which we will
use later on)
Σ_{i=0}^{n} C(n, i) pⁱ (1 − p)^(n−i) = 1
for any 0 ≤ p ≤ 1.
A special case of the Binomial distribution is the Bernoulli distribution, where n = 1
and
P(X = i) = pⁱ (1 − p)^(1−i),   i = 0, 1.
There is another special relationship between the Bernoulli distribution and the
Binomial distribution.
If Xᵢ ~ Bernoulli(p) for 1 ≤ i ≤ n and Y = Σ_{i=1}^{n} Xᵢ, then
Y ~ B(n, p).
Statistics (Advanced): Lecture 9 71
Example (Dice). Roll a fair die 9 times. Let X be the number of sixes obtained.
Then X ~ B(9, 1/6); that is
pᵢ = P(X = i) = C(n, i) pⁱ (1 − p)^(n−i) = C(9, i) (1/6)ⁱ (5/6)^(9−i) = C(9, i) 5^(9−i) / 6⁹,   i = 0, 1, ..., 9.
Statistics (Advanced): Lecture 9 72
With your calculator or with R:
> n = 9;
> p = 1/6;
> round(dbinom(0:n,n,p),4) # dbinom for B(n,p) probs
[1] 0.1938 0.3489 0.2791 0.1302 0.0391
[5] 0.0078 0.0010 0.0001 0.0000 0.0000
> pbinom(1,n,p) # for B(n,p) cumulative probabilities
[1] 0.5426588
Hence, P(X = 4) = 0.0391 and P(X < 2) = F(1) = 0.5426588.
Statistics (Advanced): Lecture 9 73
Shape of the binomial distribution
• We get a binomial distribution if
1. we are counting something over a fixed number of trials or repetitions,
2. the trials are independent and
3. the probability of the outcome of interest is constant across trials.
• The binomial distribution is centred at n · p,
• the closer p is to 1/2 the more symmetric the distribution/histogram,
• the larger n the closer the shape is to a bell (normal).
Statistics (Advanced): Lecture 9 74
par(mfrow=c(2,2)); n =10 # and for n=50, 100, etc
barplot(dbinom(0:n,n,1/2))
title(main="Probabilities for X~B(n=10,p=0.5)")
barplot(dbinom(0:n,n,0.1))
title(main="Probabilities for X~B(10,0.1)")
barplot(dbinom(0:n,n,0.8))
title(main="Probabilities for X~B(10,0.8)")
barplot(dbinom(0:n,n,0.4))
title(main="Probabilities for X~B(10,0.4)")
Statistics (Advanced): Lecture 9 75
[Figure: barplots of the probabilities for X ~ B(n=10, p=0.5), X ~ B(10, 0.1), X ~ B(10, 0.8) and X ~ B(10, 0.4), produced by the code above.]
Statistics (Advanced): Lecture 9 76
Example. In a small pond there are 50 fish, 20 of which have been tagged. Seven
fish are caught and X represents the number of tagged fish in the catch. Assume
each fish in the pond has the same chance of being caught. Is X binomial
(a) if each fish is returned before the next catch?
Yes, provided the fish do not learn from their experience, i.e. the probability of
catching each fish stays the same for each of the 7 trials.
P(X = 1) = C(7, 1) (20/50)¹ (30/50)⁶ ≈ 0.131 = dbinom(1,7,0.4)
Statistics (Advanced): Lecture 9 77
(b) if the fish are not returned once they are caught?
This situation cannot be modelled by a binomial as the proportion of tagged fish
changes at each trial.
If there were 5,000 fish, 2,000 of which had been tagged, then the change in the
proportion would be negligible and we could model with a binomial.
P(X = 1) = C(20, 1) × C(30, 6) / C(50, 7)
         = choose(20,1)*choose(30,6)/choose(50,7) (in R)
         = 0.119 (to 3 d.p.)
P(X = 1) = C(2000, 1) × C(3000, 6) / C(5000, 7)
         = choose(2000,1)*choose(3000,6)/choose(5000,7) (in R)
         = 0.131 (to 3 d.p.)
Statistics (Advanced): Lecture 9 78
Mean of a distribution
Definition 13. For a random variable X taking values 0, 1, 2, ... with
P(X = i) = pᵢ,   i = 0, 1, 2, ...,
the mean or expected value of X is defined to be
μ = E(X) = Σᵢ i · pᵢ.
Interpretation of E(X)
• Long run average of observations of X, because pᵢ ≈ fᵢ/n.
• Centre of balance of the probability density (histogram). (draw picture)
• Measure of location of the distribution.
Definition 14. For any function g(X) we define the expected value E(g(X)) by
E(g(X)) = Σᵢ g(i) · pᵢ.
Statistics (Advanced): Lecture 9 79
Expectation of a Dice Roll
Let X = face showing from a dice roll, where pᵢ = P(X = i) = 1/6 for i = 1, 2, ..., 6. Then
μ = E(X) = Σ_{i=1}^{6} i · pᵢ = Σᵢ i · (1/6) = 3.5.
Note: the expected value in this case is not one of the observed values.
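A small R check of the exact value and of the long run average interpretation (the simulation size 1e5 is arbitrary):
x <- 1:6
sum(x * (1/6))                          # exact expectation: 3.5
mean(sample(x, 1e5, replace = TRUE))    # long run average of simulated rolls, ~3.5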
Statistics (Advanced): Lecture 9 80
Mean of a distribution (cont)
Theorem 10. For constants a and b,
E(aX + b) = a · E(X) + b.
Proof.
E(aX + b) = Σ_{all i} g(i) pᵢ,   where g(i) = a · i + b,
          = Σ_{all i} [(a · i) pᵢ + b pᵢ]
          = a Σ_{all i} i pᵢ + b Σ_{all i} pᵢ
          = a E(X) + b.
Statistics (Advanced): Lecture 9 81
Expectation of X ~ B(n, p)
Theorem 11. The expectation of X ~ B(n, p) is E(X) = np.
Proof.
E(X) = Σ_{i=0}^{n} i · pᵢ = Σ_{i=0}^{n} i · n!/(i!(n − i)!) · pⁱ (1 − p)^(n−i)        [the i = 0 term vanishes, so change to i = 1, ..., n]
     = Σ_{i=1}^{n} i · n!/(i!(n − i)!) · pⁱ (1 − p)^(n−i)                              [simplify]
     = Σ_{i=1}^{n} i · n(n − 1)!/(i(i − 1)!(n − i)!) · pⁱ (1 − p)^(n−i)
     = np Σ_{i=1}^{n} (n − 1)!/((i − 1)!(n − i)!) · p^(i−1) (1 − p)^(n−i)              [substitute j = i − 1, m = n − 1]
Hence, E(X) = np Σ_{j=0}^{m} C(m, j) pʲ (1 − p)^(m−j) = np,
since the last sum equals 1 because its terms are the probabilities from Y ~ B(m, p).
Statistics (Advanced): Lecture 9 82
Example (Multiple choice section in M1905 exam is worth 35%).
20 questions and each question has 5 possible answers. A student decides to answer
the questions by selecting an answer at random.
(a) What is the expected number of correct responses? Let X denote the number
of correct answers. X ~ B(20, 0.2). The expected number of correct answers is
np = 4.
(b) Probability that the student has more than 10 correct answers?
P(X > 10) = 1 − P(X ≤ 10)
          = 1 − 0.9994,   with 1-pbinom(10,20,0.2)
          = 0.0006
(c) If the student scores 4 for a correct answer but −1 for a wrong response, what is
his expected score?
E[4X + (−1)(20 − X)] = E(5X − 20) = 0.
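The three answers can be reproduced in R:
n <- 20; p <- 1/5
n * p                     # (a) expected number of correct answers: 4
1 - pbinom(10, n, p)      # (b) P(X > 10) ~ 0.0006
5 * n * p - 20            # (c) expected score E(5X - 20) = 0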
Statistics (Advanced): Lecture 9 83
Tuesday, 23 August 2011
Lecture 10 - Content
• Variance of a distribution
• More integer-valued distributions
• Probability generating functions
Statistics (Advanced): Lecture 10 84
Expectation of a distribution - Reminders
The expectation of a distribution (or expectation of a random variable) is the mean
of the probability distribution (a measure of distribution location).
Note that
• E(X) = Σᵢ i · pᵢ = Σᵢ i · P(X = i) and
• E(g(X)) = Σᵢ g(i) · pᵢ = Σᵢ g(i) · P(X = i).
Statistics (Advanced): Lecture 10 85
Variance of a distribution
Example. Suppose X (e.g. number of shoes in a suitcase) takes the values 2, 4 and
6 with probabilities
i     2    4    6
pᵢ   0.1  0.3  0.6
Hence,
μ = E(X) = Σᵢ i · pᵢ = 2 × 0.1 + 4 × 0.3 + 6 × 0.6 = 5.
Statistics (Advanced): Lecture 10 86
What is E(X²)?
Suppose X (e.g. number of shoes in a suitcase) takes the values 2, 4 and 6 with
probabilities
i     2    4    6
pᵢ   0.1  0.3  0.6
What is E(X²)?
Solution 1: E(X²) = Σᵢ g(i) pᵢ = Σᵢ i² pᵢ = 26.8 ≠ 5²   (by the definition with g(i) = i²).
Solution 2: map i → i² = j and X → X² = Y, and use E(Y) = Σⱼ j pⱼ:
j     4   16   36
pⱼ   0.1  0.3  0.6
The distribution of Y can be hard to get (e.g. for continuous rvs).
Statistics (Advanced): Lecture 10 87
Definition 15. The variance of the random variable X is defined by
Var(X) = σ² = E(X − μ)² = E(X²) − μ²,
where μ = E(X); σ² is also a measure of spread.
This is like the large sample limit of a sample variance.
The standard deviation of X is σ = √(σ²).
Statistics (Advanced): Lecture 10 88
Variance of a Linear Transformation
Theorem 12. For any constants a and b,
Var(aX + b) = a² Var(X).
Proof.
Var(aX + b) = E[(aX + b)²] − (E[aX + b])²
            = E[a²X² + 2abX + b²] − (a E[X] + b)²
            = a² E[X²] + 2ab E[X] + b² − (a² E[X]² + 2ab E[X] + b²)
            = a² (E[X²] − E[X]²)
            = a² Var(X).
Statistics (Advanced): Lecture 10 89
Example. If X ~ B(n, p) then we will show later that Var(X) = n · p · (1 − p).
• Hence, if p = 0 or 1 then the variance is 0.
• The variance is largest when p = 0.5, and in this case it is σ² = n/4.
Statistics (Advanced): Lecture 10 90
More integer-valued distributions
Geometric distribution
The binomial random variable is just one possible integer-valued random variable.
Suppose we have an infinite sequence of independent trials, each of which gives a
success with probability p and failure with probability q = 1 − p.
Definition 16. The geometric distribution with parameter p (= success prob.) has
probabilities for the number of failures X before the first success,
pᵢ = P(X = i) = qⁱ p,   i = 0, 1, 2, ....
Note the probabilities add to 1:
P(X = 0) + p₁ + ... = p + qp + q²p + ... = p(1 + q + q² + ...) = p · 1/(1 − q) = 1.
[By induction we can prove that 1 + q + ... + qⁿ = (1 − q^(n+1))/(1 − q).]
Statistics (Advanced): Lecture 10 91
Example. A fair die is thrown repeatedly until it shows a six. Let X be the number of failures before the first six.
(a) What is the probability that more than 7 throws are required?
P(X > 7) = 1 − P(X ≤ 7) = 1 − Σ_{i=0}^{7} (5/6)ⁱ (1/6) = 0.233 (3dp)
with 1-pgeom(7,1/6) or with 1-sum(dgeom(0:7,1/6)).
(b) Is it more likely that an odd number of throws is required or an even number?
Because 0 ≤ P(X = i) ≤ 1 and F(∞) = 1 we find,
P(X even) − P(X odd) = Σ_{j=1}^{∞} P(X = 2(j − 1)) − Σ_{k=1}^{∞} P(X = 2k − 1)
                     = Σ_{j=1}^{∞} [P(X = 2(j − 1)) − P(X = 2j − 1)]
                     = Σ_{j=1}^{∞} [q^(2(j−1)) p − q^(2j−1) p]
                     = p Σ_{j=1}^{∞} (q^(2(j−1)) − q^(2j−1)) ≥ 0,
so even values of X, i.e. an odd number of throws, are more likely.
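A numerical check of part (b) in R; the upper limit 400 is an arbitrary truncation of the infinite sums (the geometric tail beyond it is negligible):
p <- 1/6
sum(dgeom(seq(0, 400, by = 2), p))   # P(X even) = 6/11 ~ 0.545  (odd number of throws)
sum(dgeom(seq(1, 401, by = 2), p))   # P(X odd)  = 5/11 ~ 0.455  (even number of throws)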
Statistics (Advanced): Lecture 10 92
The Poisson approximation to the Binomial
The Poisson distribution often serves as a first theoretical model for counts which
do not have a natural upper bound.
Possible examples:
• modelling the number of accidents, crashes, breakdowns
• modelling radioactivity measured by a Geiger counter
• modelling so-called rare events (meteorite impacts, heart attacks)
Whether or not the Poisson distribution sufficiently describes count data is not
answered at this early stage but postponed to later lectures in statistics.
Statistics (Advanced): Lecture 10 93
The Poisson distribution can be seen as the limiting distribution of B(n, p):
let n → ∞, while p → 0 and np → λ ∈ (0, ∞).
For X ~ B(n, p) we know that
P(X = k) = C(n, k) p^k · (1 − p)^(n−k),
where we call the first factor (*) and the second factor (**).
Then,
(*) = C(n, k) p^k ≈ C(n, k) (λ/n)^k = [n(n − 1) ⋯ (n − k + 1) / (n · n ⋯ n)] · λ^k/k! → λ^k/k!,
and
(**) = (1 − p)^(n−k) = (1 − λ/n)ⁿ (1 − λ/n)^(−k) → e^(−λ) · 1 = e^(−λ).
Hence,
P(X = k) → e^(−λ) λ^k / k!,   for k = 0, 1, 2, ....
Statistics (Advanced): Lecture 10 94
The approximation is good if n · p² is small!
X ~ B(158, 1/365) and n · p² = 0.001186:
> # What is the probability that of 158 people, exactly k have a birthday today?
> n = 158; p=1/365;
> round(dbinom(0:7,n,p),5);
[1] 0.64826 0.28139 0.06068 0.00867 0.00092 0.00008 0.00001 0.00000
> round(dpois(0:7,p*n),5);
[1] 0.64864 0.28078 0.06077 0.00877 0.00095 0.00008 0.00001 0.00000
But for n = 10
> n = 10; p=1/3;
> round(dbinom(0:4,n,p),5);
[1] 0.01734 0.08671 0.19509 0.26012 0.22761
> round(dpois(0:4,p*n),5);
[1] 0.03567 0.11891 0.19819 0.22021 0.18351
Statistics (Advanced): Lecture 10 95
Probability distributions for X ~ B(n, p) and X ~ Poi(λ)
[Figure: barplots of the probabilities for X ~ B(158, 1/365) vs X ~ Poi(158/365), and for X ~ B(10, 1/3) vs X ~ Poi(10/3).]
Statistics (Advanced): Lecture 10 96
Probability generating functions
Let X ∈ ℕ and pᵢ = P(X = i), i = 0, 1, 2, ....
Definition 17. The probability generating function is defined as
Π(s) = p₀ + p₁ s + p₂ s² + p₃ s³ + ....
Example. If X only takes a finite number of values (e.g. X ~ B(n, p)) then Π(s)
is a polynomial.
Alternatively (e.g. X ~ Poi(λ)) Π(s) is a power series.
Statistics (Advanced): Lecture 10 97
Properties of Π(s)
Let s ∈ [0, 1]; then
• 0 ≤ Π(s) ≤ 1,
• Π(1) = p₀ + p₁ + ... = 1,
• Π′(s) = p₁ + 2p₂ s + 3p₃ s² + ... ≥ 0 for s ≥ 0,
• Π′(1) = p₁ + 2p₂ + 3p₃ + ... = E(X) (if E(X) is finite),
• Π″(s) = 2p₂ + 6p₃ s + 4 · 3 p₄ s² + ...; at s = 1, Π″(1) = E(X(X − 1)) and
Var(X) = E(X²) − (EX)² = Π″(1) + Π′(1) − (Π′(1))².
Statistics (Advanced): Lecture 10 98
Example (Poisson distribution). For X ~ Poi(λ),
Π(s) = Σ_{i=0}^{∞} e^(−λ) (λⁱ/i!) sⁱ = e^(−λ) Σ_{i=0}^{∞} (λs)ⁱ/i! = e^(−λ) e^(λs) = e^(λ(s−1)).
Hence,
• Π′(s) = λ e^(λ(s−1)), so E(X) = λ (= Π′(1)),
• Π″(s) = λ² e^(λ(s−1)), so E[X(X − 1)] = λ² and
Var(X) = E(X²) − (EX)² = Π″(1) + Π′(1) − (Π′(1))² = λ² + λ − λ² = λ.
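A numerical sanity check in R that the Poisson mean and variance both equal λ (λ = 3 is an arbitrary choice; the support is truncated at 100, where the remaining mass is negligible):
lambda <- 3
x <- 0:100; p <- dpois(x, lambda)
sum(x * p)                    # E(X) ~ 3
sum(x^2 * p) - sum(x * p)^2   # Var(X) ~ 3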
Statistics (Advanced): Lecture 10 99
Example (Binomial distribution). Let X ~ B(n, p).
First, note that
(x + y)ⁿ = Σ_{i=0}^{n} C(n, i) xⁱ y^(n−i).   (2)
Then
Π(s) = Σ_{i=0}^{n} sⁱ C(n, i) pⁱ (1 − p)^(n−i) = Σ_{i=0}^{n} C(n, i) (sp)ⁱ (1 − p)^(n−i) = (1 − p + ps)ⁿ,
which follows from (2). Then
• Π′(s) = np(1 − p + ps)^(n−1), so that Π′(1) = E(X) = np,
• Π″(s) = n(n − 1)p²(1 − p + ps)^(n−2), so that Π″(1) = n(n − 1)p²,
and finally,
Var(X) = Π″(1) − (Π′(1))² + Π′(1) = n(n − 1)p² − n²p² + np = np(1 − p).
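A quick numerical check in R that the closed form (1 − p + ps)ⁿ really is the binomial pgf (n = 10, p = 0.3, s = 0.7 are arbitrary test values):
n <- 10; p <- 0.3; s <- 0.7
sum(s^(0:n) * dbinom(0:n, n, p))   # pgf from the probabilities: sum of s^i * P(X = i)
(1 - p + p * s)^n                  # closed form, same value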
Statistics (Advanced): Lecture 10 100
Statistics (Advanced): Lecture 10 101
