
Lecture Notes: Discrete Probability

(ST111 Probability A)

Warwick University, 2010


Julia Brettschneider

Version β.1 (16.4.2010)


What others have said about probability

Many returning Warwick mathematics graduates:


Why didn’t someone tell us that probability and statistics are the key
mathematical subjects in applied quantitative work?

A randomly selected first year student:


Given I passed the exam, what is the probability I studied for it?

Another first year student:


Is there any way I can buy that hat off you?

Another student:
You keep telling me to read books.

A rumour:
The exam will only be taken by a sample of 40 students selected at random
from this class.

A coin:
This module tossed me around in my sleep!

Pierre Simon Laplace:


It is remarkable that a science which began with the consideration of games of
chance should have become the most important object of human knowledge.

Niels Bohr:
Einstein, stop telling God what to do with his dice.

Stephen Jay Gould:


Misunderstanding of probability may be the greatest of all impediments to
scientific literacy.

ebay:
Shadowfist CCG Probability Manipulator SE Rare
0 Bids £0.99 + Postage £0.75 12h 21m

Oscar Wilde:
Always be a little improbable.
1 Preface

Lecture notes

It is very difficult to simultaneously listen to a lecturer, read what is being written on
the blackboard and copy the essentials of both sources. First year modules are a good
opportunity to experiment with note-taking strategies, because you can get hold of most
of the material (and more) from textbooks and/or lecture notes.
Ω: What are you talking about? Thatt fonny ackzenntt iss a gott reesonn to giff us de
stuff in writingg! Ah, sorry, I forgot to introduce myself. I’m a rather polite fellow under
regular circumstances, actually. My name is Omega. I have this little sister omi. She
is small, and very funny. Well, omi is actually her nickname. Her real name is (small)
omega, like this ω. We have Greek names, because a lot of mathematical objects have
Greek names.
May I continue my lecture please? I do remember getting very frustrated in my first year
as an undergraduate student about having to take notes. There were no lecture notes,
and there were these indistinguishable indices i, j and ι in the form of 1cm high wet chalk
sculptures on a dusty blackboard. Although I have to acknowledge that my own blackboards
have, in the past, missed the qualification for the RBBCCM 1 , I will do my best to write
sufficiently big and readable and to speak loud enough. If I forget, try waving or
shouting.
This is the first time I’m giving probability lectures to Warwick students, and I’m trying to
optimally adapt existing material and methods to both the Mathematics and the MORSE
students. These typed lecture notes are work in progress. I will be grateful to you for
pointing out any kind of error as well as for making any other relevant suggestions (can
this be explained better in a different way? is anything missing? etc.)
These lectures are based on a number of other sources. Firstly, there are previous versions
of this module taught by my colleagues. I thank Saul Jacka for sharing his lecture notes
with me – some parts are taken straight from his notes – and I thank Roger Tribe for
providing a sketch about content and motivation for his lectures. Secondly, some of the
fascinating probability textbooks; see below. Thirdly, my notes from teaching at University
of California at Berkeley, Technical University Berlin and Humboldt University Berlin.
Finally, I have always been inspired by Hans Föllmer’s lectures who taught me probability
in the first place. I thank my daughter Chaya as well as Charlie and Lola 2 for fascinating
discussions on the topic.
My grandmother passed away unexpectedly in the first week of this term. That reminded
me that the only practical experience I have with chance games is playing dice and cards
with her in my childhood. It also reminds me of a children’s set theory box my grand-
parents gave to me. In the wake of the 1960s New Math 3 movement, ∩ and ∪ between
sets of colourful triangles, squares and circles had become a regular part of the primary
school mathematics curriculum and eventually found their way into toy stores. The same
for Greek letters. My grandmother would have enjoyed Ω and ω. I chose ω’s nickname
omi in memory of my grandmother.

1 Royal Blackboard Beauty Contest for the Counties of the Midlands
2 http://www.bbc.co.uk/cbeebies/charlieandlola/stories/
3 get the idea at http://www.youtube.com/watch?v=I8aW4YuFSiY&feature=related

Websites

Confusingly enough, you may come across a few different websites when searching for this
module:
http://www2.warwick.ac.uk/fac/sci/statistics/courses/modules/year1/st111/
http://www2.warwick.ac.uk/fac/sci/statistics/courses/modules/year1/st111/resources
The first one is very general and is intended to help students with their module choice or
with anticipating the role of this module within the course program. The second website,
the resources website, is the one that is relevant for you now. There, I post lecture notes,
exercise sheets, information about the module organisation, old exams, computer code
and interesting links about probability. In previous years, some of the lecturers have used
the MathStuff website by the Warwick Mathematics Department to post exercises, lecture
notes, old exams and other information. This could still be useful for you and you can
find it there under ST111 > Module Pages > Archived Material; or at this address:
http://mathstuff.warwick.ac.uk/ST111/archive
Finally, there is my website:
http://www2.warwick.ac.uk/fac/sci/statistics/staff/academic/brettschneider/

Textbooks

Please check out the list on the resources website. Which one is best? Some people say
the one by Ross tends to more or less suit the majority of students. The other obvious
one is the textbook by Pitman.
By the way, have you experienced the law of the second source? The second source where
you read about a piece of mathematics new to you tends to be the one that explains it
better – regardless of the order of the sources. Interestingly, there is no such law for the
third source. In fact, there seems to be a k ≥ 3 such that while reading the kth source
you end up being more confused than while reading the first one, again regardless of the
order of the sources. Without getting to that point you will never know how well you had
already understood the material before getting hold of source k. For this and many other
reasons,

∑_{i=1}^{∞} “read another book”

Picture sources

The top cover picture showing the foyer of the Warwick Mathematics and Statistics De-
partment, known as The Street, is from
http://www.maths.warwick.ac.uk/general/news/images/wall-400.jpg
Other picture sources are mentioned in the figure captions. If no source is mentioned, such
as for the bottom picture on the cover, I made the picture. For simulations I am typically
using the statistical programming language R which is publicly available at
http://www.r-project.org
The one most frequently asked question

How do I study for the exams? There is a lot of advice on the resource website.
In a nutshell:

∑_{i=1}^{∞} “do another exercise”

Synopsis
Week I
Lecture 1: Randomness, probability as mathematical discipline
Lecture 2: A few questions (birthday problem, door problem, random sequences)

Week II
Lecture 3: Terminology for random events, classical examples for random experiments
Lecture 4: Classical probability
Lecture 5: Combinatorics

Week III
Lecture 6: Axioms of probability
Lecture 7: Conditional probability
Lecture 8: Total probability theorem, Bayes theorem

Week IV
Lecture 9: General multiplication rule, independence, binomial distribution
Lecture 10: Geometric distribution, negative binomial distribution
Lecture 11: Poisson approximation to the binomial, Poisson distribution

Week V
Lecture 12: In-class test
Lecture 13: Random variables
Lecture 14: Expectation, variance

Have a great journey into probability spaces!


2 Motivation

2.1 Observing random phenomena

We start by looking at a few pictures from the real world that exhibit random phenomena.
The slide show used during this lecture can be downloaded from the ST111 resource
website. It includes ice crystals, animal patterns, genetics, traffic, networks, statistical
mechanics, stock markets, gambling, knitting and more.
The common definition of mathematics as the science of patterns goes back to G. H. Hardy,
if not further. Number theorists look for patterns in the integers, topologists in shapes,
analysts in motion and change, and so on.
Look at the bottom picture on the cover sheet of these lecture notes. Are the black and
white dot sequences regular or not? In which sense (not)?
Row 1: Yes, it is constant.
Row 2: Yes, it is periodic.
Row 3: No, but almost. It is the sequence from Row 2 with a few errors.
Row 4: No, there seems to be no rule that generates this sequence.
Row 5: No, there seems to be no rule that generates this sequence.

Question 2.1. Statistical regularity in dot sequences. The sequences in Row 4 and Row 5
clearly are not regular in the classical sense: There is no deterministic rule that generates
them. But are they statistically regular in some sense? And if so, do they have the same
kind of statistical regularity in the sense that they show similar statistical characteristics?
One of them has actually been generated by flipping a fair coin 100 times and printing a
white dot for head and a black dot for tail. Can you say which one? And why do you think
so?
See end of this section for an answer.

ω: But excuse me, I do not like black and white. My very most favourite colour is totally
tomato red and my other favourite colour is bitter chocolate brown with lime green sparkles
in it and YOU should really try this link: http://biscuitsandjam.com/stripe_maker.php
And this one about statistical mechanics makes some pretty shape, too:
http://physics.ucsc.edu/~peter/ising/ising.html

Ω: Speaking of black and white, have you heard that half of the students taking this
module love us and the other half hate us? Don’t cry, omi, you know as well as I do that we
can be quite annoying and silly...
I have decided that the most sensible step for us is to turn grey and to shrink a bit, so we do not overstay
our welcome.

2.2 Studying random phenomena

It seems desirable to get more insight into the phenomenon of randomness, but what kind
of discoveries can we expect from studying something as inconceivable as randomness?

The actual science of logic is conversant at present only with things either
certain, impossible, or entirely doubtful, none of which (fortunately) we have to
reason on. Therefore the true logic for this world is the calculus of probabilities,
which takes account of the magnitude of the probability which is, or ought to
be, in a reasonable man’s mind.
— James Clerk Maxwell
How dare we speak of the laws of chance? Is not chance the antithesis of all
law?
— Joseph Bertrand, Calcul des probabilités

Why would anyone want to study randomness? There are a number of reasons including
• describing and understanding patterns in random phenomena,
• unveiling scientific principles,
• predicting and forecasting events,
• learning and developing degrees of belief about events,
• quantifying and comparing risks,
• taking decisions under uncertain circumstances.
Probability theory provides a formal theory to describe and analyse random phenomena.

2.3 Probability is pure mathematics (with some probability p > 0)

Probability can be practiced as pure mathematics. The birth of probability theory as a
mathematical discipline is often associated with the axiomatisation the field underwent
during the twenties and early thirties of the 20th century.
It borders other fields in mathematics. Most obvious are the overlaps with measure theory,
analysis, functional analysis, PDEs, geometry, combinatorics, computing, ergodic theory,
number theory and mathematical physics. The last International Congress of Mathemati-
cians (2006 in Madrid) recognised and honoured such connections. Andrei Okounkov from
Russia received a Fields Medal for his contributions bridging probability, representation
theory and algebraic geometry. Wendelin Werner from France received a Fields Medal
for his contributions to the development of stochastic Loewner evolution, the geometry of
two-dimensional Brownian motion, and conformal field theory.
The probability group at Warwick is among the leading ones in the UK. Their major home
is P@W; see http://www2.warwick.ac.uk/fac/sci/statistics/paw/

2.4 Probability is applied mathematics, with some probability q > 0

Probability can be practiced as applied mathematics. (To keep things open we will not
impose p + q = 1.)
The instrument that mediates between theory and practice, between thought
and observation, is mathematics; it builds the bridge and makes it stronger
and stronger. Thus it happens that our entire present day culture, to the
degree that it reflects intellectual achievement and the harnessing of nature, is
founded on mathematics.
— David Hilbert, radio speech (1930)
Probability theory has been serving the human mind’s inquiries and activities in many
areas. Besides statistics (see below), these include the following:
Physical world: e.g. astronomy (in particular, theories about measurement error), me-
chanics, statistical physics, quantum mechanics.
Living world: e.g. integrative biology, evolution, genetics, genomics, medicine, neuro-
science.
Social world: e.g. economics (in particular, financial markets), sociology, demography.
Engineering: e.g. computer science (in particular, machine learning), electrical engineer-
ing, risk assessment, reliability, operations research, information theory, communication
theory, control theory, traffic engineering.
Humanities: e.g. epistemology, determinism, philosophy of language, philosophy of
religion, philosophy of logic, political philosophy, belief systems.
Games of chance: e.g. roulette, dice, card games.
Arts: e.g. stochastic music, visual art (inspired by chaos theory and fractals, patterns
simulated by stochastic algorithms).
Probability theory is based on a set of axioms that can be summarised in just a few
lines. What makes it applicable to so many different kinds of problems is a huge
repertoire of probabilistic models. For example: classical discrete probability models
in games of chance, genetic inheritance, quality control, measurement error, electrical
circuits and lattice models for particle interaction; discrete time stochastic processes (often
involving a certain depth of dependency on the past) in queuing, coding and stochastic
composition algorithms; continuous time stochastic processes for particle motion and stock
prices; probabilistic networks in sociology, mobile phone networks and genomics.
Before trying our models in the real world, let us state some thoughts on the application
of probability theory from Saul Jacka’s lecture notes: ”It is a very powerful language, for
if we accept a few simple propositions, such as involve mapping a practical situation on
to some axioms, the whole battery of a theory’s deductions follow as true, conditional
upon the validity of that mapping. Of course, if that first mapping is nonsense then, no
matter how good our mathematics is, the result is almost certainly rubbish and can have
disastrous practical consequences.”

2.5 Probability as the mathematical foundation of statistics

A quick note: You will need every bit of Probability A and Probability B (and more) to
understand Mathematical Statistics A and B in your second year. The latter modules are
the basis for a lot of exciting modules, theoretical and applied, ahead of you in the final
years. Please make sure you are on top of the probability modules.
Probability and statistics are linked by studying similar objects from different perspectives:
Probability starts out with a model for a random experiment that describes potential
outcomes and assumes certain basic parameters. Probabilities for events of interest are
deduced from these model assumptions. For statistics, the starting point is the observed
outcomes (aka observations or data) of an experiment that has already been performed.
The task is to infer characteristics of an optimal model for the unknown (or only partially
known) mechanism of the experiment. That means we are trying to find the model that
is, among a certain class of models, most likely to have created the observations.
“Probability and statistics used to be married, then they separated, then they got divorced
and now they hardly see each other,” says David Williams. At Warwick they have moved
on to be good friends offering a wide range of research activities. They cover any possible
blend of the two disciplines from pure probability to applied statistics and extending to
interdisciplinary collaborations with other Warwick research groups such as Complexity,
Systems Biology and Finance.

2.6 How are we going to study probability in this module?

In his classical book An Introduction to Probability Theory and its Applications, William
Feller stresses that in probability, as in other mathematical disciplines, we must distinguish
three aspects of the theory:
(a) the formal logical content,
(b) the intuitive background,
(c) the applications.
He continues: ”The character, and the charm, of the whole structure cannot be appreciated
without considering all three aspects in their proper relation.”
Considering both the history of probability and the fact that this is a first year module, we
start with models for equally likely outcomes. This approach is minimalistic in the formal
sense but already provides enough mathematical foundation to properly dive into some
interesting examples that motivate probability as a field. While some students are extremely
good at solving probability problems intuitively, others prefer to see a more formal
approach first, which is why we will then introduce the modern axiomatic approach
to probability, including the notions of σ-algebras, probability measures and random
variables. On these objects rests the mathematical theory of probability, which includes crucial
concepts such as conditioning and independence, expectations and variances. In the easier
examples, the introduction of such formalism may feel like breaking a butterfly on a wheel,
but on other occasions it serves as an introduction to the formalism needed for more complex
questions, some of which even have counterintuitive answers. Along with the theory we
will get to know classical probability distributions such as Bernoulli, uniform, binomial,
geometric and Poisson. All of this will be motivated by real world questions and illustrated
by further applications and by examples constructed from probabilistic experiments (such
as coin tossing or gambling examples).

2.7 Some examples to surprise and enthuse you

ω: I am sure that this is a typo. It should say confuse you.

Ω: Close your eyes and ears if you’re worried.

ω: But confusion completely does not scare me. When I am big I will be working as a researching researcher.
And also it is your birthday and Mum said I should help you make sure you have an extremely lovely,
happy birthday, so that’s why I will stick around.

Ω: OK. Thank you omi. Let us sort out a problem together then. . .

2.7.1 Birthday problem

Question 2.2. Suppose there are r students in this class. What is the probability that at
least two students in the class have the same birthday?

ω: But I need to know everybody’s birthday first to find out whether any two are on the same day!

Ω: Listen, in the question they don’t care about exactly which students these are. They just want to know
the chances of two identical birthdays in whatever class of r people.
ω: But when it’s your birthday that means you are really special. How can you feel really special if all you
know is your odds of having a birthday today are say one in two, or one in a million or whatever?

Ω: Don’t be silly. I’m just curious to find out how likely it is that a random class of r students has at least
two students in it with identical birthdays.

ω: You are the one being silly. A random class is what our school teacher calls us when we are naughty.

Ω: Not that kind of random. I mean a class picked at random from all the classes of r students in the
universe.

ω: Universe, outer space, probability space – you are just making up funny words. How is the universe
going to help you out? You do not even know who is out there. Maybe on Mars everybody is born in May,
and on Saturn everybody is born on a Saturday, and in the Milky Way everybody is born on a Lucky Day!

Ω: OK. That is a very good point. We need a realistic probability space to answer this question on Earth.
I will make the following assumptions:
• Every year has 365 days.
• Every student is equally likely to be born on any of these days.
• Birthdays are independent of each other.

ω: But excuse me, there are leap years and everybody knows there are more birthdays during some parts
of the year and twins have the same birthday.

Ω: You’re absolutely right, but I just want to start somewhere and my assumptions still capture the essence
of the problem. Once we have solved it this way we can see how important the assumptions are in the
calculation and then tackle it with a different set of assumptions.

ω: There are 23 kids in my class and they are all very special. I’m not worried that any two of us have
the same birthday. The risk Alphie is born on my birthday is only 1 in 365, for Betty it is also 1 in 365
and so on, which makes 22/365. Now, Alphie makes the same calculation for herself, and so does Betty
and so on, which makes 23 times 22/365.

Ω: That is about 1.39. In other words 139%. Did you know that probabilities cannot be bigger than
100%?

ω: OK, I see, it’s just that I counted some things twice, like Alphie on the same day as me is the same as
me on the same day as Alphie. I will try to answer that question again. . . .

Ω: I know you like questions. But this is actually my question. I was just going to look at a simpler version
of this problem. There are three students in a class and we want to know what are the chances that at least
two of them have their birthday in the same month. I will call the months a, b, c, . . . , l. Now I will figure
out all possibilities for the birthday months of the three students and count how many of them have at
least two identical months in it: aaa, aab, aac, . . . , aal obviously all belong in this group, makes 12. Now,
aba, abb, abc, . . . , abl. Of those, only two have at least two identical months. Same for all the ones starting
with ac or ad and so on up to al. So far, we have 12 plus 2 times 11. Now, there are baa, bab, . . . , bal and
bba, bbb, . . . , bbl. . .
Problem solving technique: Consider all options.

ω: And bla, bla, bla and. . . Omega, I am hungry, and there’s got to be a better way for this!
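There is indeed a better way, but Omega’s “consider all options” count can at least be checked by brute force. A small sketch in Python (the notes otherwise use R for simulations), with the month labels a, . . . , l following Omega’s notation:

```python
# Brute-force check of the simplified problem: 3 students,
# 12 equally likely birth months.
from itertools import product

months = "abcdefghijkl"  # the 12 months, labelled as Omega does
triples = list(product(months, repeat=3))          # all 12^3 = 1728 options
repeats = [t for t in triples if len(set(t)) < 3]  # at least two equal months

print(len(triples))   # 1728
print(len(repeats))   # 408 = 12 * (12 + 2*11), confirming Omega's partial count
print(len(repeats) / len(triples))  # about 0.236
```

The 408 agrees with Omega’s tally: 12 + 2 × 11 = 34 triples for each of the 12 possible first months.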

Answer to Question 2.2: We make three assumptions.


(i) The year has n = 365 days.
This is correct for 3 in 4 years, otherwise it is almost correct.
(ii) Every student in this class is equally likely to be born on any day of the year.
In fact, there is a surge of birthdays in autumn in the UK, but it’s not that big and we
will neglect it for now. To be sure the model suits, we would also need to check out
the data from overseas to account for the students not born here. Finally, we need
to think about how the class was selected from the general population and whether
this might in any way be affected by birthdays. It is reasonable to assume this is not
the case for this class, but it could be different. For example, if the university had
identification numbers involving birthdays and then split the students taking the
probability module into classes based on identification numbers, rather than based
on degree as is currently done.
(iii) The birthdays are independent of each other.
Again, a reasonable assumption for this class, but it could be different in a situation
where the class was composed by a different process.

We order the students in some way. With the above assumptions our model is that
the birthdays are equally likely to form any ordered r-tuple with numbers chosen from
{1, 2, . . . , n}.
We want to compute the probability pn,r that there are at least two students with the
same birthday among the r students in the class. Obviously, if r > n this has to happen,
so pn,r = 1. Otherwise, we can compute it by dividing the number of r-tuples containing
repeats by the total number n^r of r-tuples from a set of n elements.
Omega attempted to list all possible r-tuples, identify which of them have repeats in them
and count those. Whereas this is a correct way of solving the problem, it is more efficient
to list and count those r-tuples that do not have any repeats in them. This will allow
us to compute the probability qn,r of the opposite event, namely that no two students have
the same birthday.
Problem solving technique: Check whether the opposite event is easier to handle.
In how many ways can birthdays be assigned without ever repeating one? Let the first
student have whatever birthday. Avoiding that day for the second student leaves n − 1
options. The third student has to avoid both previously taken birthdays which leaves n−2
choices. The fourth has n − 3 choices and so on up to the n − r + 1 choices for the rth
student. So we have

qn,r = n · (n − 1) · (n − 2) · . . . · (n − r + 1)/n^r ,

which yields
pn,r = 1 − n · (n − 1) · (n − 2) · . . . · (n − r + 1)/n^r .
Let us look at some numerical results:

p365,10 ≈ 11.7% p365,40 ≈ 89.1%


p365,20 ≈ 41.1% p365,50 ≈ 97.0%
p365,30 ≈ 70.6% p365,60 ≈ 99.4%

In particular, we can see that the smallest number of students r such that the probability
of having at least two of them with the same birthday exceeds 50% is 23:

p365,22 ≈ 47.6% p365,23 ≈ 50.7%
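These numbers come straight from the formula for pn,r. A short Python sketch (the notes otherwise use R for simulations) that evaluates it:

```python
# p_shared(n, r) = 1 - n(n-1)...(n-r+1)/n^r: the probability that at
# least two of r students share a birthday in a year of n days.

def p_shared(n, r):
    if r > n:
        return 1.0  # more students than days forces a repeat
    q = 1.0  # probability that no two students share a birthday
    for k in range(r):
        q *= (n - k) / n
    return 1 - q

for r in (10, 20, 30, 40, 50, 60):
    print(r, round(100 * p_shared(365, r), 1))

# The threshold: 23 is the smallest class size with probability above 50%.
print(p_shared(365, 22), p_shared(365, 23))
```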

2.7.2 Door problem

An old kind of problem that became famous worldwide in 1990 via a newspaper column
about the Monty Hall game show. You are being shown three closed doors. Behind one of the doors is a car; behind
each of the other doors is a goat. You choose one of the doors. The show’s host, who
knows which door conceals the car, opens one of the two remaining doors which he knows
will definitively reveal a goat. Then he asks you whether or not you want to switch your
choice to the remaining closed door. What are you going to do? Here are some attempts
to answer this question.
(i) Either the prize is behind the door you bet on originally or behind the other still
closed door, which makes it a fifty-fifty chance for each, so it doesn’t matter whether you
change or not.

(ii) All doors were equally likely originally. That is not going to change because of whatever
the show’s host did. So it doesn’t matter whether you change or not.

(iii) Imagine the same problem but with 100 doors instead of 3. Behind one of the doors
is a car, behind all the other 99 doors is a goat. The show’s host knows where the
prize is and opens all doors but the one you picked and one other one. Now it seems
intuitive that you would want to switch.
Problem solving technique: Exaggerating the original question helps to reveal
the essential characteristics of a problem. Often, the qualitative aspects of the answer
to the exaggerated question can be carried over to answer the original question.

(iv) We compare the two different strategies stick (with your original choice) and switch.
There are two possibilities:
Case 1: Original choice is the door with the car. Switching now gets you a goat.
Case 2: Original choice is a door with a goat. Switching now gets you the car.
The probability for Case 1 is 1/3, the probability for Case 2 is 2/3. If your strategy
is stick then your chance of winning the car is 1/3. If your strategy is switch then
your chance of winning the car is 2/3.
Problem solving technique: Distinguishing cases allows us to find solutions of the
problem under more constrained conditions. In probability, it is usually done to get
rid of (some of ) the randomness.

(v) A qualitative approach. The show’s host knows where the car is and opens, on
purpose, a door that reveals a goat. If you make use of that information it should
increase your chances of getting the car. At least it should not decrease them.

(vi) A variation of the problem. The show’s host does not know where the car is; he just
happens to reveal a goat when he opens the door. Does this change your answer to
the problem?
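The case analysis in (iv) can also be checked empirically. Here is a minimal Monte Carlo sketch in Python (the notes otherwise use R for simulations) that plays the game many times under each strategy:

```python
# Monte Carlo check of stick vs switch: stick wins exactly when the
# original pick was the car, which happens with probability 1/3.
import random

def play(switch, rng):
    doors = [0, 1, 2]
    car = rng.choice(doors)   # door hiding the car
    pick = rng.choice(doors)  # contestant's original choice
    # Host opens a door that is neither the pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

rng = random.Random(0)
trials = 100_000
p_stick = sum(play(False, rng) for _ in range(trials)) / trials
p_switch = sum(play(True, rng) for _ in range(trials)) / trials
print(p_stick, p_switch)  # close to 1/3 and 2/3
```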

2.7.3 Two short questions

Question 2.3. Which one of the following birth orders in a family with six children is
more likely, or are they the same? (G means girl, B means boy.)

GBGBBG BGBBBB

Question 2.4. A six-sided die with 2 red and 4 green faces is thrown 20 times. You have
to bet on the occurrence of one of the following patterns. Which one would you choose?

(i) RGRRR (ii) GRGRRR (iii) GRRRRR

Answers to these two questions are given below. Before you read them, try to come up
with your own answers.
About Question 2.3: Both are equally likely. Experiments with people who have no
training in probability have shown that they tend to believe that the first order is more
likely than the second order. The second sequence is perceived to be too regular. The
answers suggest that the respondents compare the frequency of families with 3 girls and 3
boys with the frequency of families with 5 boys and 1 girl. Yet both exact orders of birth,
GBGBBG and BGBBBB, are equally likely, because they both represent one of 64 equally
likely possibilities. (See Kahneman and Tversky, 1972b, p.432; see Chapter 5, Neglecting
exact birth order.)
About Question 2.4: Your best bet is (i). Most people bet on (ii), because G is more
likely than R and (ii) has two Gs in it. Yet (i) is more likely, simply because RGRRR is
nested in GRGRRR. This is an example of committing the conjunction fallacy without
realising that there is a conjunction. (See Kahneman and Tversky, 1983, p.303, Failing to
detect a hidden binary sequence.)
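The nesting argument for Question 2.4 can be illustrated by simulation. A rough Python sketch (the notes otherwise use R for simulations), encoding the die as the string RRGGGG so that each throw is red with probability 1/3:

```python
# Estimate the chance that each pattern occurs somewhere in 20 throws.
# Since RGRRR sits inside GRGRRR, pattern (i) must occur at least as
# often as pattern (ii).
import random

def occurs(pattern, trials, rng, throws=20):
    # Fraction of simulated throw sequences containing the pattern.
    hits = 0
    for _ in range(trials):
        seq = "".join(rng.choice("RRGGGG") for _ in range(throws))
        if pattern in seq:
            hits += 1
    return hits / trials

rng = random.Random(1)
probs = {p: occurs(p, 20_000, rng) for p in ("RGRRR", "GRGRRR", "GRRRRR")}
print(probs)  # RGRRR comes out on top
```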
The conjunction fallacy has been studied by many researchers using the example about
the (hypothetical) person Linda. The description reads:

Linda is 31 years old, single, outspoken, and very bright. She majored in phi-
losophy. As a student, she was deeply concerned with issues of discrimination
and social justice, and also participated in anti-nuclear demonstrations.

Then people are asked to rank statements about Linda by their probability, in particular
the following two:

Linda is a bank teller.


Linda is a bank teller and is active in the feminist movement.

People assigned higher probabilities to the second statement. This is not in tune with
the mathematical rules about probability and has stimulated a discussion about how real
people deal with probabilities. (See Kahneman and Tversky, 1982b, p.92.)
ω: People’s minds don’t follow the usual probability rules, small particles do not follow them either; are
they of any use?

Ω: Oh yes, they work in many situations. Besides, I believe that human minds do follow them for the most
part, it’s just that we enjoy the examples where we’re getting mixed up. And these fractions of elementary
particles, well, that’s why quantum probability was invented.

About Question 2.1: Most likely, the second one (Row 5) is the coin toss. Does the
first one look more random to you? Well, it does switch more often between black and
white, and we seem to take that as a sign of proper randomness. But could this be too
much of a good thing? Considering that staying with the same colour is as likely as switching
to the other colour, we would expect a sequence of 100 coin tosses to have about 50 colour
changes. However, there are 66 colour changes in the first one (Row 4) and 46 in the
second (Row 5). While we cannot say so with certainty, we have argued that the difference
in the number of observed colour changes makes Row 5 much more likely than Row 4 to
have been generated by the coin tossing.
This is an example for the thinking used in hypothesis testing, a branch of statistics. This
particular argument is used in the runs test used to check the independence assumption.

3 Models for equally probable events

3.1 Terminology for random experiments

The set of all outcomes is called outcome space or sample space and is usually denoted
by Ω. Before the experiment ω is not known; after the experiment it is known. However,
we will often just calculate probabilities for ω to belong to a certain subset of Ω. These
subsets are called events.

3.1.1 Some classical random experiments

Example 3.1. Toss a coin and see what it is facing. Ω = {h, t} with h for the coin facing
head and t for the coin facing tail.
Example 3.2. Roll a die and observe the number of dots it faces. Ω = {1, 2, 3, 4, 5, 6}
and events can be any subsets of Ω. For example, ”rolling an odd number” is {1, 3, 5},
”rolling a number bigger than 4” is {5, 6} and ”rolling a six” is {6}.
Example 3.3. The sides of an icosahedron die are 20 equilateral triangles, all of the
same size, with the numbers 1 to 20 written on them. Roll an icosahedron die and observe
the number it faces. Ω = {1, 2, . . . , 20} and
events can be any subsets of Ω. For more information on different kinds of dice and a nice
picture of all the platonic solid shaped dice you may look under ”dice” in Wikipedia.
Example 3.4. There are several ways of modelling the experiment of tossing two coins.
For example:
(i) Observe which sides are up. This corresponds to Ω = {hh, ht, th, tt}.
(ii) Count the total number of heads. This corresponds to Ωheads = {0, 1, 2}.
Note that (i) contains more information than (ii).
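The remark that (i) contains more information than (ii) can be made concrete: every outcome of (i) determines the outcome of (ii) via a simple map, but not the other way round. A minimal sketch:

```python
# Outcome space (i): which sides are up after tossing two coins
omega_sides = ["hh", "ht", "th", "tt"]

def num_heads(outcome):
    """Map an outcome of model (i) to the coarser model (ii)."""
    return outcome.count("h")

# Model (ii) is the image of model (i) under this map
omega_heads = sorted(set(num_heads(o) for o in omega_sides))
```

The map is not invertible: both ht and th are sent to 1, which is exactly the information lost in passing from (i) to (ii).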
Example 3.5. The deck most often seen in English-speaking cultures, and common in
other countries where the deck has been introduced, is the Anglo-American poker deck.
This deck contains 52 unique cards in the four French suits, Spade (♠), Heart (♥), Dia-
mond (♦) and Club (♣) and thirteen ranks running from two (deuce) to ten, Jack, Queen,
King, and Ace. Draw a card from the deck and observe the suit and the rank.
For simplicity, introduce the coding Ace=1, Jack=11, Queen=12, King=13 and choose
Ω = {Sk | S = ♠, ♥, ♦, ♣; k = 1, 2, . . . , 13}. Events can be any subsets of Ω, for example,
”an Ace”={♠1, ♥1, ♦1, ♣1}, or ”a Spade”={♠k | k = 1, 2, . . . , 13}.
Alternatively, we can define outcome spaces that reflect only a partial observation of the
characteristics of the cards. For example Ωsuit = {♠, ♥, ♦, ♣} or Ωnumber = {1, 2, . . . , 13}.
Example 3.6. A box contains a finite number n of tickets enumerated 1, 2, . . . , n. Draw
one at random. The standard choice for an outcome space is Ω = {1, 2, . . . , n}. Events can
be subsets of Ω, for example, ”a 3”= {3}, ”a number between 5 and 10”= {5, 6, 7, 8, 9, 10},
or ”an even number”= {2k | k ∈ N, k ≤ n/2}.

Note that the previous example is a general form of a random experiment with finitely
many outcomes. Examples 3.1 to 3.5 can be represented in this form. The next two
examples are both infinite models. However, the first one is countable whereas the second
one is continuous.
Example 3.7. Coin tossing until heads come up.
A coin is tossed until heads come up. The outcomes are all sequences consisting of some
number of tails t followed by a head h, so the outcome space can be represented as
Ω = {h, th, tth, ttth, tttth, ttttth, . . .}. Note that while there are infinitely many different
outcomes, each individual outcome is finite. Some examples for events are: ”it takes 3
trials to get a head”, ”it takes at least 3 trials to get a head”, ”it takes at most 3 trials to
get a head”.
Example 3.8. Infinitely many coin tosses.
A coin is tossed infinitely many times. In each toss, it is recorded which face shows up.
The outcomes are all sequences consisting of h and t, so Ω = {h, t}^N . Some examples for
events are: ”first toss is a tail”, ”at least 20 heads”, ”the pattern hhhthhh occurs at least
once”, ”infinitely many heads”.
Since there is a one-to-one map of the set of binary sequences onto the interval [0, 1], this
corresponds to picking a number at random between 0 and 1, as random number generators
do. It is now obvious that this is an example for a continuous outcome space.

In Examples 3.1 to 3.6 we can define probabilities representing a random experiment in a


finite outcome space Ω = {ω1 , . . . , ωn } with equally likely outcomes as follows:
P (ω1 ) = 1/n, P (ω2 ) = 1/n, . . . , P (ωn ) = 1/n. (1)

ω: What about the infinite examples?

Ω: Mmh, I would say that in Example 3.8 I would use the approach for finitely many outcomes to define
probabilities for events that just specify the first n tosses. Then let n go to infinity. Honestly,
though, I have no idea what it means for probabilities to converge. These concepts will come up in the
modules Probability B and Random events...

ω: About Example 3.7 I’m sure about one thing; it is absolutely completely not possible to give every outcome
the same chance. Why not? You just have to imagine they did. Then there was this number p > 0 and
the probability that one of the first n outcomes occurred was n · p and the probability that one of the first
n + 1 of them occurred was (n + 1) · p. And so on. That goes to infinity. But the probability that one of
all of them occurred should be at most 1. So it just can’t be! And this is why I’m never ever allowed to
invite infinitely many friends to my birthday party. You cannot cut a cake into infinitely many equally
big pieces.

To prepare for the next sections, review set theory notation.


In particular, recall de Morgan’s rules.

3.2 Classical probability

The axioms for classical probability we will define below are only a slight extension of
a model with equally likely, or equiprobable, outcomes as in (1). Classical probability
models are based on symmetric characteristics of the random mechanism generating the
outcomes. Often, these are physical characteristics such as a die with faces of the same
shape or a box with balls of the same size. Calculations can usually be performed directly
or indirectly using the equiprobable model.
Definition 3.9. Algebra of sets.
Let Ω be a set of points. A system A of subsets of Ω is called algebra if
(a) Ω ∈ A,
(b) A, B ∈ A =⇒ A ∪ B ∈ A, A ∩ B ∈ A,
(c) A ∈ A =⇒ Ac ∈ A.
Example 3.10. Algebra generated by a partition.
The algebra A generated by a partition B1 , . . . , Bn of Ω consists of the empty set and all
possible unions of any number of elements of the partition. A consists of 2^n subsets of Ω.
Definition 3.11. Classical measurable space.
Let Ω be a non-empty set of points and F the algebra generated by a partition B1 , . . . , Bn
of Ω. Then (Ω, F) is called a classical measurable space. Ω is called sample space and
B1 , . . . , Bn are called basic events.
Definition 3.12. Axioms for classical probability.
(Ω, F, P ) is called classical probability space if the following two axioms are fulfilled:
C1: (Ω, F) is a classical measurable space.
C2: P : F −→ [0, 1] is a set function such that, for any event A ∈ F which comprises
exactly k distinct basic events Bi , it is P (A) = k/n.

Obvious examples for classical probability spaces are examples 3.1 to 3.6 with P defined
as in (1).
Ω: Stop this equally likely approach to probability! David Williams (you can find him in Wikipedia) calls
it an invitation to disaster.

ω: But I like it. I think it’s more fair to give everybody the same chance.

Ω: Ah, are you bringing socialism back on the stage? Anyway, from the mathematical point of view, this
equally likely approach to probability requires only a minimal amount of techniques and terminology.
All you need is counts, counts, counts, counts is all you need.
All you need is counts (all together now),
All you need is counts (everybody).

3.3 Some results from combinatorics

3.3.1 Counting

“I think you’re begging the question,” said Haydock, “and I can see looming
ahead one of those terrible exercises where six men have white hats and six
men have black hats and you have to work it out by mathematics how likely it
is that the hats will get mixed up and in what proportion. If you start thinking
about things like that, you would go round the bend. Let me assure you of
that!” — Agatha Christie, The Mirror Crack’d

Question 3.13. Campus residences.


A campus residence unit consists of three two-storey buildings with four rooms in each
storey. What is the total number of rooms in this unit?
Answer: 3 buildings times 2 storeys times 4 rooms makes 24 rooms all together.

A way to visualise this is drawing a tree shaped graph. First, it shows three branches, one
for each building. Then, each of these branches grows two branches, one for each storey.
Finally, each of these ones, further branches out into four limbs, one for each room in that
storey.
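The tree-based count can also be obtained as a Cartesian product; the sketch below uses made-up labels for buildings, storeys and rooms.

```python
from itertools import product

buildings = ["A", "B", "C"]    # 3 buildings (labels are made up)
storeys = ["ground", "first"]  # 2 storeys per building
rooms = [1, 2, 3, 4]           # 4 rooms per storey

# Each room corresponds to exactly one (building, storey, room) triple,
# so the count is 3 * 2 * 4 = 24.
all_rooms = list(product(buildings, storeys, rooms))
```

Each element of `all_rooms` corresponds to one leaf of the tree described above.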
Let us now formally state the counting methods we have just applied intuitively.

Theorem 3.14. Fundamental Rule.


(i) Given a set of m distinct elements a1 , . . . , am and a set of n distinct elements b1 , . . . , bn ,
there exist m · n distinct ordered pairs (ai , bj ) comprising one element from each set.
(ii) Given a set of n1 distinct elements a1 , . . . , an1 ; a set of n2 distinct elements b1 , . . . , bn2 ;
up to a set of nv distinct elements x1 , . . . , xnv ; there exist n1 · n2 · . . . · nv distinct ordered
v-tuples (ai1 , . . . , xiv ).

ω: That’s a funny theorem. It only applies to tuples of lengths 2 or 24.

Ω: I see you are taking the notation very literally. Look, just forget about using the latin alphabet
completely and exclusively. You can squeeze any number of characters and more between a and x. For
example, a, b, c, x makes quadruples, and a, α, β, γ, δ, b, c, . . . , x makes 29-tuples. a and x are just used in
this way to avoid more indices; otherwise we would have something like a1,i1 , . . . , anv ,iv , you see? It’s all
fine, there’s no accounting for taste.

Proof.
(i) is obvious: Arrange the pairs in a rectangular array with pair (ai , bj ) at the intersection
of the ith row and the jth column.
(ii) By induction. True for v = 2 from (i). Now suppose it is true for v − 1.
Expressing the v-tuple as a pair ((ai1 , . . . , wiv−1 ), xiv ) and applying (i) with m = n1 · n2 ·
. . . · nv−1 and n = nv shows the claim for v.

Question 3.15. Group picture.


You want to take a picture of seven people arranged on chairs in a row. How many choices
do you have?
Answer: Put the seven chairs in a row. The first person has the choice between seven
chairs. For the second person, there are six chairs left. (Note: For the number of choices it
does not matter which chair the first person chose.) For the third person, there are five chairs
left. And so on. The sixth person only has a choice between two chairs, and the seventh
person has no choice at all. So, the total number of arrangements is 7·6·5·. . .·2·1 = 5040.

ω: But I want to be next to you!

Ω: Well, eh, that will reduce the number of options. I guess that’s a great exercise for first year students
learning about combinatorics. . .

Definition 3.16. A permutation of a finite set is any ordering of its elements in a list.

Ω: Actually, this is not what I learned in my algebra class. They said, a permutation of a finite set is a
bijective map of that set onto itself.

ω: It doesn’t matter to me. Bijective maps uniquely describe how to reorder the set, and list orderings are
the results of these reordering processes. Practically speaking, it’s all the same.

Theorem 3.17. Total number of permutations.


A set with n elements has n! different permutations.

Proof. Let a1 , . . . , an be the n distinct elements of the set. We compute the number of
permutations following the idea of the calculation in Question 3.15. There are n choices
for an to go, n − 1 choices for an−1 , n − 2 choices for an−2 , and so on, up to 2 choices
for a2 and just one option for a1 . Using the fundamental rule, that makes n! different
orderings.
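For small n, Theorem 3.17 can be checked by brute force; the sketch below (a sanity check, not part of the proof) lists the permutations and compares the count with n!.

```python
from itertools import permutations
from math import factorial

def count_permutations(n):
    """Count the orderings of {1, ..., n} by listing them all."""
    return sum(1 for _ in permutations(range(1, n + 1)))
```

In particular, count_permutations(7) returns 5040 = 7!, the number of group pictures in Question 3.15.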

3.3.2 Sampling

Ω: I am collecting a collection.
Use Saul Jacka’s notes p.9-11, 13.

Here is some motivation for the multinomial theorem.

Question 3.18. Head like a sieve.


You secured your bicycle with a combination lock involving seven rotating discs arranged
in a row each showing the digits 0 to 9. After a day of revising material for an economics
class, your memory of the lock combination is rather vague. All you remember is that the
(unique) combination for opening the lock has three times the digit 3, twice the digit 2 and
again twice the digit 7. Making use of whatever little you remember, you decide to have
a go. You estimate that it takes you about three seconds to try a combination. How long
will it take you to open the lock in the worst case scenario that you won’t find it until your
last trial?
Answer: There are 7! ways to order the seven digits you remember. However, that’s not
yet the answer to the question. You need to divide by the number of orderings of the three
3s and the two 2s and the two 7s. So the total number of combinations you’ve got to try
is 7!/(3! · 2! · 2!) = 7 · 6 · 5 = 210. Trying them all would take 10½ minutes.

ω: That’s silly. First she cares about order, then she doesn’t.

Ω: Look, over here are seven M&Ms. Three of them are green, two are orange and two are violet, then
you would care about the order of the colours as colours go, but you would not care about the order of
the individual pieces within one colour. For example, you can reorder the green ones among themselves
without even noticing. . .

ω: Did you notice I just ate all seven while you weren’t looking? I was much faster than three seconds
and I cannot even taste the order.
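Returning to Question 3.18: the count there is a multinomial coefficient, and the calculation generalises directly (a sketch; the function name is made up).

```python
from math import factorial

def distinct_orderings(multiplicities):
    """Number of distinct orderings of a multiset, e.g. [3, 2, 2] for the
    digits 3,3,3,2,2,7,7: n! divided by the product of the factorials of
    the multiplicities."""
    n = sum(multiplicities)
    result = factorial(n)
    for m in multiplicities:
        result //= factorial(m)
    return result
```

Here distinct_orderings([3, 2, 2]) gives 210; at three seconds per trial that is 630 seconds, the 10½ minutes of the worst case.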

4 Axioms of probability
Use Saul Jacka’s notes p.6-7.

Remark 4.1. When can we use the power set?


If Ω is countable the default is to use the power set P(Ω) for F. However, the fact that you
can define measures on all subsets of the space does not mean you always should. In some
situations it is actually desirable to work with a submodel (Ω, F ∗ ) of the measurable space
(Ω, F), that is, F ∗ ⊂ F, as will be discussed below. The example about chess and bridge
in Saul Jacka’s notes is such a situation. In order to define a probability measure based
on the partial information available we need to use a submodel.
For continuous Ω the situation is very different, because it is typically not possible to
define a set function simultaneously on all subsets of Ω that fulfils all the characteristics of
a measure (in the sense of the axioms). (If you would like to find out more about this,
wait for advanced modules in probability theory and measure theory or read about the
Banach–Tarski paradox now.)
Example 4.2. Bernoulli experiment.
This is the simplest possible random experiment that is not trivial. There are two out-
comes, often called 0 and 1 with the latter referred to as “success”. There is a fixed
p ∈ [0, 1], the probability for success. Use Ω = {0, 1} and the σ-algebra generated by the
partition ({0}, {1}), that is, F = { ∅, {0}, {1}, {0, 1} }. Then P ({1}) = p uniquely defines
a probability measure by P ({0}) = 1 − P ({1}) = 1 − p.

ω: With p = 1 I’ve won! I always win... always, always, always!

Ω: Obviously, this case is very boring, and so is the case p = 0. They are included in the model, but they
are the extreme cases. Such experiments are not actually random, they are deterministic.

Example 4.3. Finitely many outcomes.


Given weights pi ≥ 0 for i = 1, . . . , n with p1 + . . . + pn = 1,
P ({ωi }) = pi (i = 1, . . . , n) (2)
defines a probability measure on Ω = {ω1 , . . . , ωn } with σ-algebra P(Ω).
If pi = 1/n for all i = 1, . . . , n then this is called the uniform distribution.
Here are some explicit situations:
(i) An n-faced die with pi specifying the probability for the ith face to show up. If
pi = 1/n for all i = 1, . . . , n then the die is fair. Otherwise it is loaded. (This refers
to methods of inserting small quantities of metal into some of the sides of the die to
increase the likelihood it lands the other side up.)
(ii) Drawing a ball at random from a box with a finite number of coloured balls. With
m different colours and Ni (i = 1, . . . , m) balls of colour i, N = N1 + . . . + Nm , the
probabilities are pi = Ni /N.
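Situation (ii) can be written out directly; a sketch with made-up colour counts, using exact fractions so the weights add up to one on the nose.

```python
from fractions import Fraction

def colour_probabilities(counts):
    """Given N_i balls of colour i, return the weights p_i = N_i / N."""
    total = sum(counts)
    return [Fraction(n_i, total) for n_i in counts]
```

For example, colour_probabilities([3, 2, 5]) returns the weights 3/10, 1/5 and 1/2, which sum to 1 and thus define a probability measure as in (2).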

Example 4.4. Countable number of outcomes.


Given weights pi ≥ 0 for i = 1, 2, . . . with p1 + p2 + . . . = 1,

P ({ωi }) = pi (i = 1, 2, . . .) (3)

defines a probability measure on Ω = {ω1 , ω2 , . . .} with σ-algebra P(Ω).

ω: I need to invite infinitely many friends to my birthday party. Because, whenever I think I will invite
these n friends I realise I forgot one. But how can I divide my birthday cake into infinitely many equally
large pieces?

Ω: That would not be LARGE pieces, omega, but small pieces! You are even in much bigger trouble here.
Just read on. . .

Remark 4.5. No uniform distribution on a countable set.


There is no uniform distribution on a countably infinite set. Why? If there was a uniform
distribution then there would be a c ∈ [0, 1] such that P (ωi ) = c for all i = 1, 2, . . . By
the properties of a measure, P (Ω) = c + c + c + . . . (that means adding up c infinitely
often). If c > 0 the series diverges and if c = 0 the series equals 0. Both cases contradict
the axiom that P (Ω) = 1.
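Although a uniform distribution is impossible here, plenty of non-uniform weights work on a countable set: with p_i = 2^(−i) the partial sums 1 − 2^(−n) approach 1. A numerical sketch:

```python
def partial_sum(n):
    """Partial sum of the geometric weights p_i = 2**(-i), i = 1..n."""
    return sum(2.0 ** (-i) for i in range(1, n + 1))
```

partial_sum(n) equals 1 − 2^(−n), so the total probability converges to 1 even though no single constant c can serve as the weight of every outcome.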

Ω: In part B of this module you can make a model for equally sharing a cake with a continuum of friends.
Is that what you would like to do for your next birthday, omega?

ω: But I do not have a continuum of friends. Even if I added all my imaginary ones.

Ω: Try nextgenerationfacebook.

5 Conditioning

Find a way to get 6 male and 4 female students to come to the front and form a circle.
Offer 3 of the women and 2 of the men to put a hat on their head. Get another student
and call him Mr. Random. Bring him into the centre of the circle, blindfold him and
let him spin around until the audience calls him to stop. He faces one of the other 10
students, say X. What is the probability X is wearing a hat? Five of the ten have a hat,
so it’s a half.
Now Mr. Random is asked to exchange ”hi” with the student he is facing. What is the
probability that X is wearing a hat? Is there anything different? Well, Mr. Random did
get some information. From the voice saying ”hi” he can tell whether X is male or female.
As Mr. Random had seen the students with their hats before getting blindfolded, he knows
that 3 of the 4 women wear hats, whereas only 2 of the 6 men wear hats. In this run of
the experiment it turns out that X is a woman. What he is going to calculate now is the
probability that the person has a hat on given he already knows the person is a woman.
Among the women, what is the chance she has got a hat on? He takes the number of
women with hats and divides by the number of women.
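Mr. Random's calculation is plain counting within a subgroup; the sketch below encodes the classroom setup above (4 women, 3 with hats; 6 men, 2 with hats).

```python
from fractions import Fraction

# (sex, has_hat) for the ten students in the circle
students = ([("f", True)] * 3 + [("f", False)] * 1 +
            [("m", True)] * 2 + [("m", False)] * 4)

def p_hat_given(sex):
    """P(hat | sex): hats in the subgroup divided by the subgroup size."""
    hats = [has_hat for s, has_hat in students if s == sex]
    return Fraction(sum(hats), len(hats))
```

Here p_hat_given("f") is 3/4, compared with the unconditional 5/10 = 1/2: hearing the voice genuinely changes the probability.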

5.1 Conditional probability.

Definition 5.1. Conditional probability.


Let (Ω, F, P ) be a probability space and A, B ∈ F events. Assume P (B) > 0. The
conditional probability of A with respect to B is given by
P (A | B) := P (A ∩ B) / P (B).
Remark 5.2. Conditional probability is a probability measure.
We call P (A | B) conditional probability, but is it actually a probability as defined in
Section 4? The answer is yes, because for any fixed B, the set function A 7→ P (A | B)
defines a probability measure.
 The proof amounts to showing the two properties in (P 2).
(i) is obvious: P (Ω | B) = P (Ω ∩ B)/P (B) = P (B)/P (B) = 1.
(ii) follows from basic set operations: For A1 , A2 , . . . mutually exclusive,

P (∪i∈N Ai | B) = P ((∪i∈N Ai ) ∩ B)/P (B) = P (∪i∈N (Ai ∩ B))/P (B)
= Σi∈N P (Ai ∩ B)/P (B) = Σi∈N P (Ai | B).

ω: Why does the left-hand side in Definition 5.1 look symmetric, but then the right-hand side is not?!

Ω: P (A | B) is just a short form for the right-hand side and it is not symmetric. But I agree, that the
symbolic expression (A | B) is misleading in terms of its symmetric appearance. But what actually is the
relationship between P (A | B) and P (B | A)?

Some implications of the definition are both easy to proof and very frequently used:
Multiplication rule

P (A ∩ B) = P (A | B) · P (B) ∀A, B ∈ F with P (B) > 0 (4)


= P (B | A) · P (A) ∀A, B ∈ F with P (A) > 0 (5)

“Flip around” formula

P (A | B) = (P (A)/P (B)) · P (B | A) ∀A, B ∈ F with P (A) > 0, P (B) > 0 (6)
Averaging conditional probabilities

P (A) = P (A | B) · P (B) + P (A | B c ) · P (B c ) ∀A, B ∈ F with 0 < P (B) < 1 (7)

Bayes’ rule
P (B | A) = P (A | B) · P (B) / [ P (A | B) · P (B) + P (A | B c ) · (1 − P (B)) ]
∀A, B ∈ F with P (A) > 0, 0 < P (B) < 1 (8)
The first two statements are immediate consequences of Definition 5.1. The last two
statements are special cases of the more general Theorems 5.6 and 5.7. Before we state
and prove these theorems, we will show a few examples for the use of the rules (4) to (8).

Question 5.3. Pick a box and pick a ticket.


There are two boxes. One has three tickets with the numbers 1, 3 and 5, the other one has
tickets with the numbers 2 and 4. First toss a fair coin. If heads come up choose the box
with the odd numbers, if tails come up choose the box with the even numbers. Then you
pick a ticket at random from the chosen box. What is the probability you get a 3?
We will present two solutions, a naive approach and one based on the multiplication rule.
Answer 1: There are five different outcomes defined by the number of the ticket drawn,
so Ω = {1, 2, 3, 4, 5}. Since the two boxes are equally likely, P ({1, 3, 5}) = 1/2 = P ({2, 4}).
Since the tickets within each box are equally likely, using the addition axiom, P ({1, 3, 5}) =
P ({1}) + P ({3}) + P ({5}) = 3 · P ({3}). Putting this all together yields P ({3}) = 1/6.
Answer 2: Denote the outcomes by two-character combinations: bo and be for boxes with
odd and even numbered tickets in it, followed by the number on the ticket. So the outcome
space is Ω = {bo 1, bo 3, bo 5, be 2, be 4}. We want to find the probability for bo 3. Define events:
Bo :=“box with odd numbered tickets was chosen”= {bo 1, bo 3, bo 5},
Be :=“box with even numbered tickets was chosen”= {be 2, be 4},
Ti :=“ticket with number i was drawn” (i = 1, 2, 3, 4, 5).
By assumption, P (Bo ) = P (Be ), so both must be 1/2. The only way to draw a ticket with
the number 3 is by choosing the box with odd numbers first. If drawing from that box,
chances for each of the numbers are equal; in other words, P (Ti | Bo ) = 1/3 for i = 1, 3, 5.
Using (4), P ({bo 3}) = P (T3 | Bo ) · P (Bo ) = 1/3 · 1/2 = 1/6.
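Answer 2 organises the calculation as a sum over the branches of the experiment's tree; the sketch below does the same with exact fractions.

```python
from fractions import Fraction

# Each branch: (probability of choosing the box, tickets inside it)
boxes = [(Fraction(1, 2), [1, 3, 5]),   # heads: odd-numbered tickets
         (Fraction(1, 2), [2, 4])]      # tails: even-numbered tickets

def p_ticket(k):
    """P(ticket k): sum over boxes of P(box) * P(k | box)."""
    total = Fraction(0)
    for p_box, tickets in boxes:
        if k in tickets:
            total += p_box * Fraction(1, len(tickets))
    return total
```

Here p_ticket(3) gives 1/6, and the five ticket probabilities sum to 1 as they should.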

ω: The first answer looks a million times smarter.

Ω: Well, you’re right... but you’re not. To come up with the first answer you need an idea about how
to exploit the specific conditions of this example. The second answer is a standard application of the
multiplication rule. It should be in your repertoire. In particular, when things get a bit more complex.
What if the coin is biased? What if there are more than one ticket numbered 3, and in both boxes? What
if the tickets in a box are not equally likely to be drawn? You can still use the first approach, but it won’t
look so elegant anymore.

Question 5.4. Golden tickets.


There are two shops with Wonka chocolate bars and it is impossible to recognise whether
or not they contain a golden ticket to visit Willy Wonka’s factory. In one of them, the
chance to find a golden ticket is a half, in the other one it is a third. Omega bought one
chocolate bar and found a golden ticket inside. Guess which shop Omega got it from and
explain why you make that guess.
Answer: Denote the outcomes by two-character combinations: se for the shop with equal
probabilities, su for the other shop, g for a golden ticket and b for a blank. So the outcome
space is Ω = {se g, se b, su g, su b}. Define events:
Se :=“Shop with equal chances was chosen”= {se g, se b},
Su :=“Shop with unequal chances was chosen”= {su g, su b},
G :=“Golden ticket was found”= {se g, su g}.
By assumption, P (Se ) = P (Su ) = 1/2, and P (G | Se ) = 1/2, and P (G | Su ) = 1/3. To
answer the question we need to compare the probability of Se given G with the
probability of Su given G. Averaging conditional probabilities according to (7) yields

P (G) = P (G | Se ) · P (Se ) + P (G | Su ) · P (Su ) = 1/2 · 1/2 + 1/3 · 1/2 = 5/12.


Using (6) and putting this all together yields

P (Se | G) = (P (Se )/P (G)) · P (G | Se ) = (1/2)/(5/12) · 1/2 = 3/5.

Furthermore, P (Su | G) = 1−P (Se | G) = 2/5. Therefore it is 50% more likely that Omega’s
golden ticket came from the shop with equal probabilities than from the other shop.
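The same numbers drop out of a direct implementation of Bayes' rule (8); a sketch:

```python
from fractions import Fraction

def posterior(prior_b, p_a_given_b, p_a_given_bc):
    """Bayes' rule (8): P(B | A) from P(B), P(A | B) and P(A | B^c)."""
    numerator = p_a_given_b * prior_b
    return numerator / (numerator + p_a_given_bc * (1 - prior_b))
```

With prior 1/2 and likelihoods 1/2 and 1/3, posterior(Fraction(1, 2), Fraction(1, 2), Fraction(1, 3)) returns 3/5, the probability that the golden ticket came from the shop with equal chances.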

Remark 5.5. About the proof.


The answer above uses (7) and (6) explicitly to show what is going on. This actually
amounts to proving (8) for that example. Alternatively, one could just plug numbers into
(8) and obtain the same result.

ω: I know how I can answer Question 5.4 much faster! Say there’s a place with only one Wonka bar with
a golden ticket in it, and there is a place with a million Wonka bars one of which has a golden ticket in it
whereas the other 999,999 have none. The chance of getting a golden ticket when you search in place one
is, well, there isn’t even any chance, you surely get one. But when you are searching for a golden ticket
in the second place, it’s like 1 in a million, so your chances are incy wincy small. If you actually got a
ticket from one of those places, well, chances are it’s the first place where you got it from. The shops in
the example are not quite as extreme, but the answer still goes in the same direction.
Problem solving technique: Exaggerating the original question (see also dialogue following Question
2.7.2).

Ω: Yes. And did you notice you argued the other way round from usual? Here is another example. Who
knitted the Klein bottle hat, the lecturer or her husband? Only a small fraction of the men know how to
knit, whereas, at least in their generation, a decent fraction of women do. So it’s more likely she knitted
that hat than he did it. This kind of thinking is a probabilistic version of reverse engineering also known
as statistical inference. Regular people actually use it in everyday life all the time.

ω: Like when you blamed me for destroying your favourite sunglasses. Just because I broke your fabulous
prize-winning rocket, I dropped your report card in the loo, I reset your password and I played frisbee with
your data CD, you think that I also stepped on your glasses. But it wasn’t me!

Theorem 5.6. Total probability


Let (Ω, F, P ) be a probability space and B1 , . . . , Bn ∈ F a partition of Ω. Assume that
P (Bi ) > 0 for i = 1, . . . , n. Then
P (A) = P (A | B1 ) P (B1 ) + . . . + P (A | Bn ) P (Bn ) (9)

Proof. Since A ∩ B1 , . . . , A ∩ Bn are mutually exclusive, P (A) = P (A ∩ B1 ) + . . . + P (A ∩ Bn ).
Applying Definition 5.1 transforms the addends to P (A | Bi )P (Bi ).

5.2 Bayes theorem and applications

Theorem 5.7. Bayes


Let (Ω, F, P ) be a probability space and B1 , . . . , Bn ∈ F a partition of Ω. Assume that
P (Bk ) > 0 for k = 1, . . . , n. Then

P (Bk | A) = P (A | Bk )P (Bk ) / [ P (A | B1 )P (B1 ) + . . . + P (A | Bn )P (Bn ) ] (10)

Proof. Fix k ∈ {1, . . . , n}. Formula (6) yields P (Bk | A) = P (A | Bk )P (Bk )/P (A), which
results in (10) after replacing P (A) by the right-hand side in Theorem 5.6.
Terminology
P (Bi ) are called prior probabilities – that’s because it’s before knowing about A
P (A | Bi ) are called likelihoods – probabilities of A given the different options Bi
P (Bi | A) are called posterior probabilities – that’s because it’s after knowing about A
ω: How come a theorem that is so easy to prove became so famous?

Ω: Sometimes, it’s not the technique that makes a result famous, but the idea to even think of looking
at the mathematical objects in a new way. The key idea of first the ”flip around” formula and then also
Bayes’ theorem is to use the information about the probability for A given a few different Bi s to sort of
reverse engineer which of the Bi s happened in the first place.

ω: You mean which one is most likely to have occurred given A did. Actually, the theorem calculates the
probability for each of the Bi s given we know A happened. My teacher is using this all the time. Whenever
it’s noisy in the classroom while she’s writing at the board, she blames me! But just because I’m talking
a lot doesn’t mean it’s always me!

Example 5.8. Medical diagnostics.


A lab test for the disease VerySick is conducted on blood samples and has the poten-
tial outcomes ”positive” and ”negative”. We consider a certain population that has an
incidence rate of 1% for the disease. We have experience with applying the test in this
population: In the past, 95% of the people with the disease produce a positive test result,
and 2% of the people without the disease produce a positive test result.
What is the probability that an individual chosen at random from that population will
have the disease given his or her test was positive?
Let D be the event the individual has the disease and let + and − be events that the
individual has a positive or negative test, respectively.
Using Bayes theorem,

P (D | +) = P (+ | D)P (D) / [ P (+ | D)P (D) + P (+ | D c )(1 − P (D)) ]
= 0.95 · 0.01 / (0.95 · 0.01 + 0.02 · 0.99) ≈ 32%

ω: So what?

Ω: You are being told you’ve got this disease that’s going to kill you within 5 years. You spend a few weeks
being horrified and you end up dying of a heart attack. So, indeed you died. However, the probability you
do not have the disease given your test was positive is actually about two in three!

ω: Strange, how can this come about?

Ω: Not actually having the disease despite being tested positive is an event called false positive. Related
to that is the false positive rate and the specificity and the sensitivity of the test and...

ω: I don’t care about all this medical terminology. I just want to know this: If they gave me a positive
test result, what is the probability that I actually have the disease.

Ω: That’s what the physicians call predictive value of a positive test and it’s just what we computed above.
It does depend on the quality of the test but also on the disease prevalence (i.e., the incidence of the
disease) in the population you are drawn from. In the example above, the incidence was only 1%. This
made the numerator small, but inflated the denominator because of the term 1 − P (D) = 99%.

ω: What’s the point of such a test then?

Ω: Actually, I was being a bit cynical about the heart attack... There are ways to apply the test in a more
meaningful way.

Example 5.9. Medical diagnostics (ct.)


Say a patient who already feels sick walks into a surgery. The physician conducts a
careful examination excluding a number of diseases or conditions. Based on experience,
the physician estimates there is a probability of about 30% that the patient has the
disease VerySick. The physician thinks of the patient as member of a certain population
of similarly sick people. In that population, P (D) = 30% and P (Dc ) = 70%. Let us make
the same assumptions for the test accuracies. Applying Bayes theorem in the same way
as before yields a predictive value of a positive test of P (D | +) ≈ 95.3%.
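The contrast between Examples 5.8 and 5.9 comes entirely from the prevalence P (D); the sketch below recomputes P (D | +) for both populations with the same test accuracies.

```python
def predictive_value(prevalence, sensitivity=0.95, false_positive_rate=0.02):
    """P(D | +) via Bayes' rule, using the test accuracies of Example 5.8."""
    numerator = sensitivity * prevalence
    return numerator / (numerator + false_positive_rate * (1 - prevalence))
```

With the numbers above, predictive_value(0.01) is about 0.32 (the population-wide screening setting) while predictive_value(0.30) is about 0.95 (the pre-selected patients of Example 5.9).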

The medical diagnostics example is a very common situation. The false positive issue is a
concern when the prevalence is low even with good quality tests. This is often the case for a
screening test for a disease in an unspecific population, unless the disease is very common.
Depending on the nature of the diagnosis, such news may create stress and anxiety in
the individual who received them, even if it is understood that the diagnosis has a lot of
uncertainty attached to it. Often, the individual will have to wait long periods of time
until further tests can be done and deliver refined results. The aim of avoiding stress for
the patient has to be weighed against the usefulness of potentially true news, for example
if early treatment is crucial for survival. Examples for population-wide screening with low
incidence rates of the condition tested for may be found, for example, in antenatal care or
detection of drug users.
ω: I want to tell you something. I really very much hate word problems.

Ω: Why?

ω: Either I do math or I do writing.

Ω: You mean one part of your brain is in charge of math and the other part of your brain is in charge of
the rest of the world, and the two don't mix? You are basically saying that applied mathematics cannot
be done. Either you come up with the most beautiful piece of math to describe a certain real world
application but you neglect the point of the application, or you pour all your thoughts into an excellent
description of a piece of the real world but then you have no brains left to work out mathematically rigorous
answers to questions? I suggest you activate a third part of your brain to get the first two parts to act in
concert.

5.3 General multiplication rule and independence

If the occurrence of the event B has no impact on the occurrence of the event A then
P (A | B) = P (A). Using the multiplication rule (4), this can be expressed as P (A ∩ B) =
P (A) · P (B). Or it could be expressed as P (B | A) = P (B). Note that in formulations
using conditional probabilities we need to assume that the probability of the event that
comprises the condition is not zero.
Definition 5.10. Independence of two events.
Let (Ω, F, P ) be a probability space and A ∈ F and B ∈ F events. A and B are called
independent, if

P (A ∩ B) = P (A) · P (B) (11)

and dependent otherwise.

ω: It just means they have nothing to do with each other. For example, if I toss a coin and I want it to
come up heads and you want it to come up tails, then what you and me want is independent...

Ω: Eh, I’m not so sure about that. Let’s see... P ({h}) · P ({t}) = 1/2 · 1/2 = 1/4 and P ({h} ∩ {t}) =
P (∅) = 0, so they are disjoint, but not independent.

ω: Ah sure, they have a lot to do with each other. Because if it’s heads it can’t be tails!

Example 5.11. Independent and dependent events.


Roll a die. In Example 3.2 define the events Odd = {1, 3, 5} and Gk = {1, . . . , k}. Then
Odd and Gk are independent if k is even, otherwise they are dependent. Just check (11),
for example: P (Odd ∩ G2 ) = P ({1}) = 1/6 and P (Odd) · P (G2 ) = 1/2 · 1/3 = 1/6, so they
are independent, but P (Odd ∩ G1 ) = P ({1}) = 1/6 and P (Odd) · P (G1 ) = 1/2 · 1/6 = 1/12, so
Odd and G1 are dependent.
Toss two coins. See Example 3.4. The assumption of equiprobable outcomes implies
independence of the events Hk = “kth toss is heads” (k = 1, 2), because
P (H1 ∩ H2 ) = P ({hh}) = 1/4 = 1/2 · 1/2 = P ({hh, ht}) · P ({hh, th}) = P (H1 ) · P (H2 ).
This technical property reflects our idea that, if the coin tossing is conducted properly,
the coin has no memory.
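The independence criterion (11) lends itself to a mechanical check. Below is a small Python sketch (not part of the original notes) that verifies the die calculations of Example 5.11 with exact arithmetic; the only assumptions are the fair-die measure and the event definitions above.

```python
from fractions import Fraction

# Check criterion (11) for the fair-die events of Example 5.11,
# using exact rational arithmetic to avoid rounding issues.
P = {w: Fraction(1, 6) for w in range(1, 7)}  # fair die

def prob(event):
    """Probability of an event (a set of outcomes)."""
    return sum(P[w] for w in event)

def independent(A, B):
    """Criterion (11): P(A ∩ B) = P(A) · P(B)."""
    return prob(A & B) == prob(A) * prob(B)

odd = {1, 3, 5}
for k in range(1, 7):
    G_k = set(range(1, k + 1))
    print(k, independent(odd, G_k))
# Prints True exactly for even k, matching the claim in Example 5.11.
```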
Remark 5.12. Real world applications.
Independence is an assumption very often made for one or more of the following reasons:
(i) It is justified by theoretical considerations.
(ii) It is empirically justified (i.e., checked on data).
(iii) It simplifies calculations.
Reason (iii) is fine as long as the real world application only serves as an inspiration for
a mathematical mind’s desire to create beautiful problems and subsequently solve them.
Before applying results of such creative processes to make statements about the real world,
additional study is required. In particular, the magnitude and the direction of the error
introduced by the simplification need to be addressed. A recent example of substantial
errors caused by inappropriate independence assumptions occurred in the modelling of
the default risks of mortgages (even when they were secured by houses located in socio-
economically highly homogeneous neighbourhoods). Such oversimplified risk models led
to the subprime mortgage crisis in the US real-estate market that is believed to have
triggered the global financial crisis.
Assumptions based on reason (i) are often successful, especially in scientific or technological
applications. However, this approach bears the risk that partial knowledge or an inappropriate
weighting of existing knowledge leads to incorrect models.
Reason (ii) is convincing and sometimes the only choice. However, the collection of data
and the testing of independence in data bear their own challenges. Among other issues,
model development based on empirical knowledge only can be misled by chance variation
or bias in data. A simple but striking example is the empirical “proof” that the area of
a rectangle is a linear function of the sum of the lengths of its edges (see the statistics
textbook by Freedman, Pisani and Purves).
Ideally, independence assumptions are based on both theoretical and empirical justifications.
In some applications, independence assumptions are wrong but can still play a
role, for example in obtaining upper or lower bounds on an unknown true probability.

The multiplication rule (4) can be generalized to more than two events. This will be used
to define a notion of independence for more than two events.
Theorem 5.13. General multiplication rule.
Let (Ω, F, P ) be a probability space and A1 , . . . , An ∈ F with P (A1 ∩. . .∩An−1 ) > 0. Then
P (A1 ∩ . . . ∩ An ) = P (A1 ) · P (A2 | A1 ) · . . . · P (An | A1 ∩ . . . ∩ An−1 ). (12)

Proof. Exercise.
(Hint: The result can be shown by induction based on multiple application of (4).)
Definition 5.14. (Mutual) independence.
Let (Ω, F, P ) be a probability space and A1 , . . . , An ∈ F events. The events are called
(mutually) independent, if for every choice of indices 1 ≤ i1 < . . . < ik ≤ n,

P (Ai1 ∩ . . . ∩ Aik ) = P (Ai1 ) · . . . · P (Aik ). (13)
Definition 5.15. Pairwise independence.
Let (Ω, F, P ) be a probability space and A1 , . . . , An ∈ F events. The events are called
pairwise independent, if

P (Aj ∩ Ak ) = P (Aj ) · P (Ak ) for all j, k ∈ {1, . . . , n} with j ≠ k. (14)

Remark 5.16. Pairwise and mutual independence are different.


Mutual independence implies pairwise independence. The reverse is not true. For example,
take two independent coin tosses and consider the events Hk from Example 5.11 and the
event “both tosses are heads”; see Exercise sheet 3, Part II, Problem 1.

6 Repeated experiments

This section is about doing the same experiment several times. We consider a simple, but
not trivial, situation: a series of independent Bernoulli trials (see Example 4.2).

Model 6.1. Bernoulli trials process.


A Bernoulli trials process is a finite or infinite sequence of random experiments with the
following characteristics:
(i) Each experiment has two possible outcomes, referred to as “success” and “failure”.
(ii) The probability p ∈ [0, 1] for “success” is the same for each experiment; in particular,
it is not affected by the outcome of any of the other experiments.

The probability 1 − p for failure is often denoted by q.


The standard outcome space for a Bernoulli trials process of length n is Ω = {0, 1}^n
and F = P(Ω). (In other words, Ω is the set of all n-tuples of zeros and ones.) The
standard outcome space for an infinite Bernoulli trials process is Ω = {0, 1}^N . (In
other words, Ω is the set of all sequences of zeros and ones.)

This is an abstract version of a model for independent coin tosses.


In the following sections we will study probability distributions relating to a number of
different kinds of events. In the finite case, the standard probability space (Ω, P(Ω)) will
be used. In the case of infinitely many trials submodels are used (see Remark 4.1), for
example in Sections 6.2 and 6.3.
In this module, we only consider independent repetitions of experiments. In advanced
probability modules about stochastic processes you will encounter more general models.
In particular, there will be models of processes that allow future repetitions to depend on
the present (Markov processes) or on the present and a finite window of the past (higher
order Markov processes).

6.1 Number of successes

Consider a Bernoulli trials process of length n with success probability p ∈ [0, 1] and failure
probability q = 1 − p as described in Model 6.1. We want to compute probabilities for the
outcomes. Let us illustrate this with a special case.
Problem solving technique: Consider a special case or an example.
Example 6.2. Three independent Bernoulli trials.
Let n = 3. Using independence we observe

ω       111    110    101    100    011    010    001    000
P (ω)   p^3    p^2 q  p^2 q  p q^2  p^2 q  p q^2  p q^2  q^3

We have P ({111}) = p^3 and P ({000}) = q^3 .

The other probabilities fall into two groups:
P ({110}) = P ({101}) = P ({011}) = p^2 q and P ({001}) = P ({010}) = P ({100}) = p q^2 .

Go back to the general case of a Bernoulli trials process of length n. For ω ∈ Ω let ω(i) be the
ith place in the tuple. A closer look, using independence, shows that the probability of
an outcome depends only on the numbers of successes and failures. Given that the total number
of trials is fixed, it is enough to record only the number of successes. The events

Sk = “k successes in n trials” (k = 0, 1, . . . , n) (15)

can be represented using the 0-1-coding we chose to represent failure and success (check
yourself):

Sk = { ω ∈ Ω | ∑_{i=1}^{n} ω(i) = k }   (k = 0, 1, . . . , n). (16)

This implies

P ({ω}) = p^k q^{n−k}   for all ω ∈ Sk .

To calculate the probability of the event Sk it remains to count the number of outcomes
in Sk . In how many ways can we choose (exactly) k successes in n trials? According to
Section 3.3, using the formula for “without replacement, not ordered”, this is \binom{n}{k}, so

P (Sk ) = \binom{n}{k} p^k q^{n−k}   (k = 0, 1, . . . , n). (17)

This defines a probability measure (see Example 4.3) on the set of possible outcomes for
the number of successes.

Definition 6.3. Binomial distribution.


The measure

P ({k}) = \binom{n}{k} p^k q^{n−k}   (k = 0, 1, . . . , n) (18)

on {0, 1, . . . , n} is called binomial distribution.

Remark 6.4. Special case 1/2.


For p = 1/2 formula (17) simplifies to the following form:

P ({k}) = (1/2^n) \binom{n}{k}   (k = 0, 1, . . . , n). (19)
Example 6.5. Library books.
You go to the library with a list of 7 books you need for your class. You would like to
know the probability that you can get at least 5 of them. Answers to questions like this one
can be computed, but some assumptions have to be made. Here, we assume that the
probability that a book is checked out is 10%, for any one of the books and independently
of any of the other books. (This is a simplification, and whether or not such assumptions
are at least approximately true would need to be verified for a particular library.)
Now the model for a Bernoulli trials process of length 7 with success probability 1/10 (see
Model 6.1) applies and we can compute the probability of the events Sk for k successes
using (17). (Here “success” is being used for the book being checked out, because the
computation becomes shorter this way.)

P (“at least 5 are not checked out”) = P (“at most 2 are checked out”)

= P (S0 ∪ S1 ∪ S2 ) = ∑_{k=0}^{2} P (Sk ) = ∑_{k=0}^{2} \binom{7}{k} (1/10)^k (9/10)^{7−k}

= (9^5 / 10^7 ) · (9^2 + 7 · 9 + 21) ≈ 0.9743
So the probability that you can get at least 5 of the books from your list is about 97.43%.
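As a quick cross-check (not part of the notes), the binomial computation of Example 6.5 can be reproduced with formula (17) in a few lines:

```python
from math import comb

# Example 6.5: n = 7 books, each checked out independently with
# probability 0.1; "at least 5 available" = "at most 2 checked out".
def binom_pmf(k, n, p):
    """Formula (17): P(S_k) = C(n, k) p^k (1-p)^(n-k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

p_at_most_2 = sum(binom_pmf(k, 7, 0.1) for k in range(3))
print(round(p_at_most_2, 4))  # 0.9743
```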

Remark 6.6. Normal approximation.


What if in the last example we had asked about the availability of 20 out of 70 books?
Or 200 out of 700? The same methods would apply, but the computations become cum-
bersome. Even though computers nowadays facilitate such cumbersome calculations, in
practice binomial probabilities are often approximated using the normal distribution. See
the slide show for this topic (it can be downloaded from the module resources website).
The approximation of discrete distributions is a fundamental theme in probability theory.
The most well known result of this kind is the Central Limit Theorem. Such results are
crucial for applications in statistics, where empirical distributions (based on finite data
sets) are compared with continuous distributions that arise in models or as limits.

6.2 Waiting for success

In the last section we calculated the probability for a certain number of successes in
a fixed number of trials. In this section, we work with the model for infinitely many
Bernoulli trials (see Model 6.1). We keep trying until the first success. That means
Ω = {1, 01, 001, 0001, 00001, . . .}. Ω is countable. This can be shown, for example, by
using the one-to-one map between Ω and N0 that is defined by counting the number of
zeros in any outcome ω ∈ Ω.
Define events based on the waiting time:

Wk = “it takes k trials until first success” for k = 1, 2, . . . (20)

To calculate the probability for Wk just look at the meaning: The first k − 1 trials are
failures, the kth trials is a success. Since the trials are independent, the probability
factorizes:

P (Wk ) = q k−1 p (k = 1, 2, . . .). (21)


According to Example 4.4, formula (21) does define a probability measure on (Ω, P(Ω)).
The weights pk = P (Wk ) are non-negative, and their sum equals 1, because (Wk )k∈N
form a partition of Ω. The probability measure on (Ω, P(Ω)) can also be represented as a
probability measure on (N, P(N)). Probability measures on spaces like N or R are often
called distributions.
Definition 6.7. Geometric distribution.
The measure P ({k}) = q k−1 p (k = 1, 2, . . .) on {1, 2, . . .} is called geometric distribution.

To get a feeling for this distribution it is helpful to look at the consecutive odds ratios
(and they are useful to quickly sketch a histogram of the distribution).
P ({k + 1}) / P ({k}) = q   (k = 1, 2, . . .). (22)

Define the events Vk = “it takes at most k trials until first success” (k = 1, 2, . . .). Let k ∈ N.
Then Vk = W1 ∪ W2 ∪ . . . ∪ Wk with W1 , W2 , . . . , Wk mutually exclusive. Using results
about geometric sums (see Analysis textbooks) we get

P (Vk ) = ∑_{i=1}^{k} P (Wi ) = ∑_{i=1}^{k} q^{i−1} · p = ((1 − q^k )/(1 − q)) · p = 1 − q^k . (23)

Finally, let us put this into practice in an everyday life example.


Example 6.8. Can you spare some change?
Penny needs some cash. She positions herself on the pavement and keeps asking everybody
who passes by for some change until she succeeds. Assume that people react to her request
independently of each other and that each of her requests is successful with probability
1/10. Calculate the probability for the following events:
A =“She has to ask at least 5 people until the first success”
B =“She has to ask exactly 5 people until the first success”
C =“She has to ask at most 4 people until the first success”
Because of the assumptions, the situation can be described by a model for infinitely many
Bernoulli trials with success probability 1/10 (see Model 6.1).
Another way of describing A is to say that the first 4 trials are failures (and the fifth and
later trials may or may not be successes). This implies P (A) = (9/10)^4 .
Since B = W5 , Definition 6.7 yields P (B) = (9/10)^4 · 1/10.
C is the complement of A, so P (C) = 1 − (9/10)^4 . Or use C = V4 and (23) to get the
same result.
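The three probabilities of Example 6.8 can also be checked numerically; this sketch (not part of the notes) uses only the geometric PMF of Definition 6.7 with p = 1/10:

```python
# Example 6.8: geometric distribution with p = 1/10, q = 9/10.
p, q = 0.1, 0.9

def geom_pmf(k):
    """P(W_k) = q^(k-1) p: first success on trial k."""
    return q**(k - 1) * p

P_B = geom_pmf(5)                             # exactly 5 people asked
P_C = sum(geom_pmf(k) for k in range(1, 5))   # at most 4; (23) gives 1 - q^4
P_A = 1 - P_C                                 # at least 5, i.e. q^4
print(round(P_A, 5), round(P_B, 5), round(P_C, 5))
```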

ω: ‘Excuse me UK, can you spare some cash? Excuse me EU, can you spare some cash?’ Is the Penny
example about illiquidity?

Ω: I wasn’t going to talk about finance here.

ω: Why not? The totally most important application of probability theory...

Ω: It certainly is an application of no less importance than Archeology, Biology, Coding, Demography, Elec-
trical engineering, Genetics, Hurricanes, Insurances, Juggling (most times), Knitting (sometimes), Logic,
Magnetism, Networks, Operations research, Psychology, Queuing, Risk, Statistics, Trees, Uncertainty,
Volcano science, Walks, some random applications X and Y, and the Zeta function.

Question 6.9. Waiting for some more change.


Penny in Example 6.8 needs some more cash. She keeps asking people in the street until
she succeeds in receiving cash from at least 5 of them.
6.3 Waiting for multiple successes

As in the last section, consider the model for infinitely many Bernoulli trials (see Model
6.1). Now keep trying until you have obtained r successes. The outcome space Ωr now
consists of all 0-1-patterns that have exactly r 1’s in them, one of them at the end.
For example, for r = 2 : Ωr = {11, 011, 101, 0011, 0101, 1001, 00011, 00101, 01001, 10001, . . .}.

Lemma 6.10. Ωr is countable (for any r ∈ N).

Proof. There is no canonical candidate for a one-to-one map between Ωr and the integers
as in the last section, but it is still easy to do. One could arrange the outcomes in a
triangle and then define a map from the integers to Ωr by enumerating the outcomes row
by row. For example, for r = 2 use
11
011 101
0011 0101 1001
00011 00101 01001 10001
...
and define f (1) = 11, f (2) = 011, f (3) = 101, f (4) = 0011, . . .
It works similarly for larger r, but the explicit construction of such a map becomes more
cumbersome as r gets bigger. Here is another way to show that Ωr is countable: Every
ω ∈ Ωr can be described by its length and the locations of the first r − 1 successes.
This suggests a surjection from the countable set N^{r−1} × N onto Ωr , which implies that
Ωr is countable.
We are interested in the probabilities for waiting a certain number of trials until the rth
success. Proceed similarly to the last section.
Define events based on the waiting time until the rth success:

Wk = “it takes k trials until rth success”   for k = r, r + 1, . . . (24)

To calculate the probability for Wk just look at the meaning: The kth trial is a success,
and there are r − 1 successes and k − 1 − (r − 1) failures among the first k − 1 trials. For any
outcome of this kind the probability is p^{r−1} q^{k−1−(r−1)} · p = p^r q^{k−r} . So, similarly
to the derivation of (17) we obtain

P (Wk ) = \binom{k − 1}{r − 1} p^r q^{k−r}   (k = r, r + 1, . . .). (25)

This probability measure on (Ωr , P(Ωr )) can also be represented as a probability measure on
(N, P(N)).

Definition 6.11. Negative binomial distribution.

The measure P ({k}) = \binom{k−1}{r−1} p^r q^{k−r} (k = r, r + 1, . . .) on {r, r + 1, r + 2, . . .} is called
negative binomial distribution.

To get a feeling for the probabilities of these events it is helpful to look at the consecutive
odds ratios (and they are handy to quickly sketch a histogram). A simple calculation yields

P (Wk+1 ) / P (Wk ) = k q / (k − r + 1)   (k = r, r + 1, . . .). (26)
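A quick numerical sanity check (not part of the notes) of the negative binomial probabilities (25): the weights should sum to 1 over k ≥ r, and the consecutive ratio can be computed directly from the PMF. The parameter values r = 2, p = 0.3 are illustrative choices.

```python
from math import comb

# Negative binomial PMF (25) for r = 2, p = 0.3 (illustrative values).
r, p = 2, 0.3
q = 1 - p

def negbin_pmf(k):
    """P(W_k) = C(k-1, r-1) p^r q^(k-r): r-th success on trial k."""
    return comb(k - 1, r - 1) * p**r * q**(k - r)

# The weights over k = r, r+1, ... should sum to (essentially) 1.
total = sum(negbin_pmf(k) for k in range(r, 500))
print(round(total, 10))

# Consecutive ratio computed from the PMF equals q * k / (k - r + 1).
for k in range(r, 8):
    ratio = negbin_pmf(k + 1) / negbin_pmf(k)
    assert abs(ratio - q * k / (k - r + 1)) < 1e-12
```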
Here are some applications of the negative binomial distribution. In other words, the
Bernoulli trials process may be an appropriate model and we are interested in looking at
the waiting time for the rth success:
• A rope consists of several cables. It has to be replaced after 3 of them have broken.
• How likely is it that milk bottles get stolen more than twice in a year?
• Parents of pupils who arrive late more than 3 times a term need to see the
head teacher.

6.4 Occurrences of rare independent events.

6.4.1 Poisson approximation to the binomial (for rare events)

Use Saul Jacka’s lecture notes p.24/25.

6.4.2 Poisson distribution


It turns out that the numbers e^{−µ} µ^k /k! add up to 1, because ∑_{k=0}^{∞} µ^k /k! = e^µ . Furthermore,
they are non-negative, so they actually define a probability mass function.

Definition 6.12. Poisson distribution.

Let µ > 0. The measure Pµ ({k}) = e^{−µ} · µ^k /k!   (k = 0, 1, 2, . . .) on {0, 1, 2, . . .} is called
Poisson distribution with intensity parameter µ.

ω: We use this for counting fish. And the Sauterelle distribution for counting grasshoppers and the de
Chevalier’s distribution for counting horsemen.

Ω: Oh you’re being silly. Read this old story about Chevalier de Méré... e.g. at
http://www.ualberta.ca/MATH/gauss/fcm/BscIdeas/probability/DeMere.htm.

Here are some of the traditional applications of the Poisson distribution.

Example 6.13. Misprints.


From experience it is known that misprints in books by the publisher PROP (ProofRead-
OncePress) follow a Poisson distribution with a rate of 0.5 misprints per page. What is
the probability that in a book from PROP there is at least one misprint on page 20? Let
N be the number of misprints on page 20. Then

P (N ≥ 1) = 1 − P (N = 0) = 1 − e^{−0.5} · 0.5^0 /0! = 1 − e^{−0.5} ≈ 0.393.

In other words, there is a likelihood of almost 40% for a misprint on page 20 (or on any
other page for that matter).

Example 6.14. Radioactive decay.


Let N be the number of α-particles given off by a 1-gram block of radioactive material
during a 1-second interval. From previous experiments it is known that, on average, 3.2
α-particles are emitted per second. 1 gram of material consists of n atoms for some large n.
Each atom has a probability of 3.2/n of disintegrating and sending off an α-particle during
the next second. We assume independence, which justifies the use of a Binomial model.
Using the Poisson approximation to the Binomial helps for explicit computations. For
example, we will approximate the probability that no more than 2 α-particles will be
emitted during the next second.

P (N ≤ 2) = P (N = 0) + P (N = 1) + P (N = 2)
= e^{−3.2} + 3.2 · e^{−3.2} + (3.2^2 /2!) · e^{−3.2}
= (1 + 3.2 + 3.2^2 /2!) · e^{−3.2} ≈ 0.37989.
Example 6.15. Earthquakes.
Earthquakes are very unpredictable. A common simple model is to assume they follow a
Poisson distribution. For a region in the West of the US it is known that the rate is about
2 earthquakes per week. We will now calculate a few different kinds of probabilities using
this model.
(i) What is the probability that there are at least 3 earthquakes next week? Let N be
the number of earthquakes next week.
P (N ≥ 3) = 1 − ∑_{k=0}^{2} P (N = k) = 1 − (2^2 /2! + 2^1 /1! + 2^0 /0!) · e^{−2} = 1 − 5 e^{−2} .

(ii) What is the probability that there are exactly k earthquakes during the next 3 weeks?
This corresponds to changing the unit to 3 weeks with a new intensity parameter 6.
The probability is therefore e^{−6} · 6^k /k! .
More generally, what is the probability that there are exactly k earthquakes during
the next m weeks?
Reasoning as before we obtain e^{−2m} · (2m)^k /k! .
(iii) Let us now change the perspective and define T to be the
time (in weeks) until the next earthquake. Let Nm be the number of earthquakes
during the next m weeks. The crucial point here is to understand that the sets
{T > m} and {Nm = 0} are identical. Therefore P (T > m) = P (Nm = 0) = e^{−2m} .
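The three earthquake calculations of Example 6.15 can be reproduced directly from the Poisson PMF; this is a sketch, not part of the notes, and the values k = 4, m = 3 in part (ii) are illustrative choices.

```python
from math import exp, factorial

# Poisson PMF at rate 2 earthquakes per week (Example 6.15).
def poisson_pmf(k, mu):
    """P({k}) = e^(-mu) * mu^k / k!"""
    return exp(-mu) * mu**k / factorial(k)

# (i) at least 3 earthquakes next week (mu = 2):
p_ge_3 = 1 - sum(poisson_pmf(k, 2) for k in range(3))
# (ii) exactly k quakes in m weeks uses intensity 2m; e.g. k = 4, m = 3:
p_4_in_3_weeks = poisson_pmf(4, 6)
# (iii) waiting time: P(T > m) = P(N_m = 0); e.g. m = 1:
p_no_quake_next_week = poisson_pmf(0, 2)
print(round(p_ge_3, 4), round(p_no_quake_next_week, 4))  # 0.3233 0.1353
```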

Some other situations where a Poisson distribution is a good candidate for a suitable
model:
• number of people who survive age 100
• number of floods in the UK per year
• number of occurrences of a certain pattern in a stretch of DNA

6.4.3 Random scatter

Some comments about the T-shirt with the random scatter — not relevant for the exam.

7 Random variables and distributions

In many examples the aim is to compute the probability of an expression of the form

N = “number of <something that occurs randomly>”.

For example:
N = “number of heads in n coin tosses”
N = “number of coin tosses until third head”
N = “number of α-particles emitted by 1 gram of radioactive material in 1 second”
N = “number of people taking the 18:22 bus to Leamington Spa”
N = “number of goals scored by Coventry City FC in the 2009-2010 season”
N = “number of broken eggs in an egg carton selected at random from your grocery store”
More generally, a lot of examples call for the probability of expression of the form

X = “<some mathematical function> of <an outcome of a random experiment>”.

For example:
X = “sum of two dice”
X = “test score of a randomly selected student from this class”
X = “phone number selected at random from the Coventry phone book”
X = “weight (rounded off to the nearest stone) of a randomly selected British citizen”
X = “minimum due date of ten books you borrowed from the library”

In other words, we are interested in events that are characterised by the level sets of some
function of the outcomes of the random experiment. Formally, this is captured by the
following concept.

Definition 7.1. Discrete random variable.


Let (Ω, F) be a measurable space and let S be a countable set. A function X from Ω to S
is called a discrete random variable (RV) if

{X = x} ∈ F for all x ∈ S. (27)

ω: I will not ever never use a random variable. They are totally unnecessary, because I completely already
know everything from the sample space. If I want to know the number of heads in three coin tosses then
I choose Ω = {0, 1, 2, 3} and my probabilities are defined right there.

Ω: Okay, whether or not you use random variables is your choice, but consider that I’m witnessing the
exact same coin tossing experiment, and it would be nice if we could simply share the outcome space. I’m
interested in whether or not the last toss is a head, and your Ω = {0, 1, 2, 3} does NOT tell me that.

Remark 7.2. About notation and about the rationale for this definition.
(i) The first term in (27) is not a typographic error, but a very common short form
used in probability theory and measure theory. Here is what it means: {X = s} =
{ω ∈ Ω | X(ω) = s} = X^{−1}({s}). Expressions like {X > s} and {X ≥ s} are defined
accordingly.

(ii) Typical choices for S are finite sets, N or Z. S does not need to be numerical; for
example, a random variable describing the colour of a stripe picked at random from
the mural in the Street takes values in a finite set of colours. Another example of a
colour-valued random variable is shown in Example 7.8 below.

(iii) Condition (27) is called measurability of the function X, a basic concept in measure
theory that will be used massively in modules on advanced probability theory and
measure theory. The immediate use here is that because of the measurability of X
any probability measure P on the space (Ω, F) canonically defines probabilities for
the level sets of X via the equation pX := P ◦ X −1 as detailed in Definition 7.6.

(iv) Condition (27) may seem artificial in the discrete setting, because one could always
force it to be true simply by choosing F = P(Ω). By contrast, this is not always possible
in the continuous case and not always appropriate in the discrete case (see Remark
4.1).
Example 7.3. Indicator functions.
Let (Ω, F) be a measurable space and A ∈ F. The function 1A on Ω defined by
1A (ω) = 1 for ω ∈ A,  and  1A (ω) = 0 for ω ∈ A^c . (28)

is a discrete random variable with range S = {0, 1}.


Note the trivial cases 1Ω ≡ 1 and 1∅ ≡ 0, which do not actually depend on ω. In other words,
these random variables have degenerated into something deterministic (that is, non-random).
Lemma 7.4. Properties of indicator functions.

(i) 1A + 1Ac = 1 for any A ∈ F.

(ii) 1A∪B = 1A + 1B − 1A∩B for any A, B ∈ F.

Can you prove this? Try before you read on.


Or try this hint: Distinguish the cases (ω ∈ A, ω ∈ Ac etc.).
Proof of the lemma.
(i) Distinguish two cases.
1. If ω ∈ A then 1A (ω) + 1Ac (ω) = 1 + 0 = 1.
2. If ω ∈ Ac then 1A (ω) + 1Ac (ω) = 0 + 1 = 1.
(ii) If ω ∈ (A ∪ B)^c then all indicator functions are 0 and the equality is obvious.
Otherwise, 1A∪B (ω) = 1 and we have to show that 1A (ω) + 1B (ω) − 1A∩B (ω) = 1.
We distinguish three cases.
1. ω ∈ A ∩ B^c : 1A (ω) + 1B (ω) − 1A∩B (ω) = 1 + 0 − 0 = 1.
2. ω ∈ B ∩ A^c : 1A (ω) + 1B (ω) − 1A∩B (ω) = 0 + 1 − 0 = 1.
3. ω ∈ A ∩ B : 1A (ω) + 1B (ω) − 1A∩B (ω) = 1 + 1 − 1 = 1.
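The identities of Lemma 7.4 can also be verified by brute force on a small sample space; the code below (a sketch, not part of the notes) checks both identities for every pair of events on a 4-point space.

```python
from itertools import product

# Brute-force check of Lemma 7.4: for every pair of subsets A, B of a
# small sample space, verify 1_A + 1_Ac = 1 and
# 1_(A∪B) = 1_A + 1_B - 1_(A∩B) at every outcome.
omega = {0, 1, 2, 3}
subsets = [frozenset(s for s in omega if bits >> s & 1) for bits in range(16)]

def ind(A):
    """Indicator function 1_A as a dict omega -> {0, 1}."""
    return {w: 1 if w in A else 0 for w in omega}

checked = 0
for A, B in product(subsets, repeat=2):
    iA, iB = ind(A), ind(B)
    iAc, iU, iI = ind(omega - A), ind(A | B), ind(A & B)
    for w in omega:
        assert iA[w] + iAc[w] == 1              # Lemma 7.4 (i)
        assert iU[w] == iA[w] + iB[w] - iI[w]   # Lemma 7.4 (ii)
        checked += 1
print(checked)  # 16 * 16 * 4 = 1024 identities checked
```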
ω: These indicator functions are a very usefulish invention.

Ω: After your dislike of random variables earlier, that comment does surprise me.

Example 7.5. Some random variables from three coin tosses.


All of the following functions are discrete random variables on the measurable space
given by Ω = {hhh, hht, hth, htt, thh, tht, tth, ttt} and F = P(Ω) (= power set of Ω).
X(ω) = “number of heads in ω”
Y is the pay-off of the following game: If the coin lands heads you get £1, if the coin lands
tails you lose £1. In other words,
Y (ω) = “number of heads in ω” − “number of tails in ω”
Z(ω) = 1 if the third toss lands heads, and Z(ω) = 0 otherwise.
T is the time until the first head comes up. If no heads come up in the three tosses, set
T (ω) = 4.
The table below shows the values the random variables take for the different outcomes.

ω hhh hht hth htt thh tht tth ttt


X(ω) 3 2 2 1 2 1 1 0
Y (ω) 3 1 1 -1 1 -1 -1 -3
Z(ω) 1 0 1 0 1 0 1 0
T (ω) 1 1 1 1 2 2 3 4
Assume that the coin is fair and the tosses are independent, so P (ω) = 1/8 for all ω ∈ Ω.
By Remark 7.2(iii), this defines probabilities for the level sets of the random variables.
For example:
pX (3) = P (X −1 (3)) = P ({hhh}) = 1/8, pX (2) = P (X −1 (2)) = 3/8 etc.
pT (1) = P (T −1 (1)) = P ({hhh, hht, hth, htt}) = 4/8, pT (2) = P (T −1 (2)) = 2/8 etc.
The table below shows the probabilities of all non-empty level sets for the four random
variables.

x       0    1    2    3          y       -3   -1   1    3
pX (x)  1/8  3/8  3/8  1/8        pY (y)  1/8  3/8  3/8  1/8
z       0    1                    t       1    2    3    4
pZ (z)  1/2  1/2                  pT (t)  1/2  1/4  1/8  1/8
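The PMF tables above can be recomputed mechanically from the sample space; this Python sketch (not part of the notes) assumes, as in the example, a fair coin and independent tosses, so each outcome has probability 1/8.

```python
from fractions import Fraction
from itertools import product

# Example 7.5: three independent tosses of a fair coin.
outcomes = ["".join(t) for t in product("ht", repeat=3)]

def X(w): return w.count("h")                         # number of heads
def Y(w): return w.count("h") - w.count("t")          # pay-off of the game
def Z(w): return 1 if w[2] == "h" else 0              # third toss heads?
def T(w): return w.index("h") + 1 if "h" in w else 4  # wait for first head

def pmf(rv):
    """Level-set probabilities p(x) = P(rv = x), as exact fractions."""
    p = {}
    for w in outcomes:
        p[rv(w)] = p.get(rv(w), Fraction(0)) + Fraction(1, 8)
    return p

print(pmf(X))  # p_X: 1/8, 3/8, 3/8, 1/8 on 3, 2, 1, 0
print(pmf(T))  # p_T: 1/2, 1/4, 1/8, 1/8 on 1, 2, 3, 4
```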

We now define formally what we just computed in the last example.

Definition 7.6. Probability mass function (or probability distribution)


Let (Ω, F, P ) be a probability space and X : Ω −→ S a discrete random variable. Then

pX (x) = P (X = x) (x ∈ S) (29)

is called probability mass function (PMF) or probability distribution of X. We also say


that X is distributed according to pX or, for short, X ∼ pX .

Example 7.7. Indicator functions.


Let (Ω, F, P ) be a probability space, A ∈ F and X := 1A the indicator function of A as
defined in Example 7.3. Then, range(X)={0, 1} and pX (1) = P ({1A = 1}) = P (A) and
pX (0) = P ({1A = 0}) = P (Ac ) = 1 − P (A).

The above example is the easiest non-trivial example. We have already seen a number of
more complicated probability mass functions for random variables, such as the binomial
distribution (X = “number of successes”), the geometric distribution (X = “number of trials
until first success”) and the negative binomial distribution (X = “number of trials until rth success”).
The following example shows a probability mass function obtained from (real world) data.

Example 7.8. Colour of flowers (a PMF defined by data).


Poinsettias can be red, pink, or white. In one study of the hereditary mechanism control-
ling the colour, 182 progeny of a certain parental cross were categorised by colour resulting
in 108 red ones, 34 pink ones and 40 white ones.
Let X be the colour of a poinsettia picked at random. Then the PMF of X is given by:
x red pink white
pX (x) 0.59 0.19 0.22

Lemma 7.9. Transformation of a random variable.


Let (Ω, F, P ) be a probability space and X a discrete random variable with range S. Let ψ
be a numerical function on S. Then Y := ψ ◦ X defines a discrete random variable.

Proof. The range of Y is ψ(S). It is discrete because its cardinality is at most the
cardinality of S. It remains to show condition (27) for Y. For any y ∈ ψ(S),
{Y = y} = {ω ∈ Ω | ψ(X(ω)) = y} = {ω ∈ Ω | X(ω) ∈ ψ^{−1}(y)} = ∪_{x∈ψ^{−1}(y)} {X = x}.
By (27) for X, each of the individual sets is in F, and since F is a σ-algebra, so is their (countable)
union.
Using the addition axiom we find that the distribution of Y is given by

pY (y) = P (ψ(X) = y) = ∑_{x : ψ(x)=y} pX (x). (30)

Example 7.5. Some random variables from three coin tosses (ct.)

Y = ψ ◦ X for ψ(x) = 2x − n and pY (y) = ∑_{x:ψ(x)=y} pX (x).
In particular, pY (3) = pX (3), pY (1) = pX (2), pY (−1) = pX (1), pY (−3) = pX (0).
Choose Ỹ = ψ̃ ◦ X for ψ̃(x) = | 2x − n |. In other words, Ỹ is the gain in the game,
regardless of who receives it.
Now pỸ (y) = ∑_{x:ψ̃(x)=y} pX (x), with pỸ (3) = pX (3) + pX (0) and pỸ (1) = pX (2) + pX (1).
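The transformation rule (30) amounts to grouping probabilities by the value of ψ. Here is a sketch (not part of the notes) pushing the PMF of X from Example 7.5 forward through both maps:

```python
from fractions import Fraction

# Push the PMF of X (number of heads in n = 3 fair tosses) forward
# through psi(x) = 2x - n and psi~(x) = |2x - n|, as in rule (30).
n = 3
p_X = {0: Fraction(1, 8), 1: Fraction(3, 8),
       2: Fraction(3, 8), 3: Fraction(1, 8)}

def pushforward(p, psi):
    """p_Y(y) = sum of p_X(x) over all x with psi(x) = y."""
    out = {}
    for x, px in p.items():
        out[psi(x)] = out.get(psi(x), Fraction(0)) + px
    return out

p_Y = pushforward(p_X, lambda x: 2 * x - n)        # signed pay-off
p_Yt = pushforward(p_X, lambda x: abs(2 * x - n))  # absolute gain
print(p_Y)   # 1/8, 3/8, 3/8, 1/8 on -3, -1, 1, 3
print(p_Yt)  # 1/4 on 3 and 3/4 on 1
```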
Lemma 7.10. Combination of random variables.
Let (Ω, F) be a measurable space. Let X and Y be discrete random variables with ranges
S and R, which are discrete subsets of R. Then the following expressions are also
discrete random variables: max(X, Y ), min(X, Y ), X + Y, X − Y, X · Y and, if Y (ω) ≠ 0
for all ω ∈ Ω, X/Y.

Proof. Let Z := (X, Y ). This is a discrete random variable taking values in the product
space S × R. The operations max, min, +, −, · and /
between the two random variables can be understood as a function ψ of the pair Z. For
example, X + Y = ψ1 ◦ Z, where ψ1 is the function on S × R defined by ψ1 (x, y) := x + y,
and max(X, Y ) = ψ2 ◦ Z, for ψ2 (x, y) := max(x, y). Since S and R are discrete subsets of
R, so is the range of ψi ◦ Z. By Lemma 7.9, ψi ◦ Z is a discrete random variable.
Remark 7.11. Continuous analogues.
Both Lemma 7.9 and Lemma 7.10 have analogues for continuous random variables, but
additional assumptions are needed and more care needs to be taken in the proofs.
Example 7.12. Simple random variables built from indicator functions.
Let αi ∈ R (i = 1, . . . , n) and let Ai ∈ F (i = 1, . . . , n). By Example 7.3, the latter
define discrete random variables. Applying Lemma 7.9 to ψi (x) = αi x yields that
αi 1Ai (i = 1, . . . , n) are also discrete random variables, and multiple application of Lemma
7.10 ensures that ∑_{i=1}^{n} αi 1Ai is also a discrete random variable. Random variables with
such a representation are called simple random variables.


A common technique used in proofs is to first show a results for such simple random vari-
ables, then approximate more general random variables with simple random variables and
finally work out how to carry over the result to the limit.

Note that in the proof above we made use of the pair of two random variables defined
on the product space. Thinking about them in this way will often come in handy when
computing their probability mass function. The following definition will help formalising
this.
Definition 7.13. Joint PMF/joint distribution.
Let (Ω, F, P ) be a probability space and let X and Y be discrete random variables. Then

    p(x, y) = P (X = x, Y = y)    ((x, y) ∈ range(X) × range(Y ))            (31)

is called the joint distribution (or joint PMF) of X and Y.

Any set of non-negative numbers p(x, y) (x ∈ S, y ∈ R) with Σ_{x∈S,y∈R} p(x, y) = 1 defines a
joint distribution for X and Y. Given such a joint distribution, we can obtain the distributions
of the individual random variables as the column sums and the row sums of the joint
probabilities:

    pX (x) := Σ_{y∈R} p(x, y)   and   pY (y) := Σ_{x∈S} p(x, y).             (32)

They are called the marginal distributions of X and Y.

Remark 7.14. What does the comma mean?


It’s just a short form, {X = x, Y = y} = {X = x} ∩ {Y = y}.

Example 7.15. Golden tickets (continued).


In the situation of Question 5.4 define two random variables, one denotes the chosen shop
and the other one says whether or not a golden ticket was found in the chocolate bar. For
example, set X = 1Se and Y = 1G . Obviously, range(X) = range(Y ) = {0, 1}.
The marginal distribution of X is given by the assumption and can also be recovered as
sums of the columns: pX (0) = pX (1) = 1/2.
The joint distribution of X and Y can be computed using the multiplication rule.
p(0, 0) = P ({X = 0}, {Y = 0}) = P (Su ∩ Gc ) = P (Su ) · P (Gc | Su ) = 1/2 · 2/3 = 1/3.
p(0, 1) = P ({X = 0}, {Y = 1}) = P (Su ∩ G) = P (Su ) · P (G | Su ) = 1/2 · 1/3 = 1/6.
p(1, 0) = P ({X = 1}, {Y = 0}) = P (Se ∩ Gc ) = P (Se ) · P (Gc | Se ) = 1/2 · 1/2 = 1/4.
p(1, 1) = P ({X = 1}, {Y = 1}) = P (Se ∩ G) = P (Se ) · P (G | Se ) = 1/2 · 1/2 = 1/4.

                               range(X)
                              0        1
    range(Y)      0         4/12     3/12     7/12  (distribution of Y)
                  1         2/12     3/12     5/12                           (33)
    (distribution of X)     6/12     6/12

The distribution of Y is obtained by the sums of the rows: pY (0) = 7/12 and pY (1) = 5/12.
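The joint table and both marginals from (32) can be checked with a small Python sketch. The representation of the joint PMF as a dictionary keyed by (x, y) pairs and the function name `marginals` are illustrative choices, not standard library API:

```python
from fractions import Fraction as F

# Joint PMF p(x, y) of Example 7.15: X = shop indicator, Y = golden-ticket indicator.
joint = {(0, 0): F(1, 3), (0, 1): F(1, 6), (1, 0): F(1, 4), (1, 1): F(1, 4)}

def marginals(joint):
    """Column sums and row sums of the joint PMF, as in (32)."""
    p_X, p_Y = {}, {}
    for (x, y), p in joint.items():
        p_X[x] = p_X.get(x, F(0)) + p  # sum over y for fixed x
        p_Y[y] = p_Y.get(y, F(0)) + p  # sum over x for fixed y
    return p_X, p_Y

p_X, p_Y = marginals(joint)
```

This recovers pX(0) = pX(1) = 1/2 and pY(0) = 7/12, pY(1) = 5/12, matching table (33).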

7.1 Expectation

Ω: Look, I’ve got a box with two 50p-coins, one £1-coins and two £2-coins. You can pick one at random.
What do you expect to get?

ω: One of the five coins from that box!

Ω: You’re always taking things so literally. Think of average outcome...

ω: OK. I can do that. £(2 · 0.5 + 1 · 1 + 2 · 2)/5 = £6/5 = £1.20.

Ω: This can be rewritten as p(0.5) · 0.5 + p(1) · 1 + p(2) · 2, where p(ω) = P ({ω}) for ω ∈ Ω = {0.5, 1, 2}
with p(0.5) = P ({£0.5}) = 2/5, p(1) = P ({£1}) = 1/5 and p(2) = P ({£2}) = 2/5.

Definition 7.16. Expectation of a discrete random variable.


Let (Ω, F, P ) be a probability space and X a random variable with countable range S and
PMF pX . Then

    E[X] = Σ_{x∈S} pX (x) · x                                                (34)

is called the expected value of X.


In other words, E[X] is the average of all possible values of X weighted by their proba-
bilities.

Example 7.17. Tickets in a box.


There is a box with n tickets with numbers z1 , . . . , zn ∈ ℕ. One ticket is drawn at random.
To compute the expected value for the number shown on the ticket use the model Ω =
{z1 , . . . , zn } with equally likely outcomes and let X be the number on the ticket. X is
a random variable and E[X] = (Σ_{i=1}^n zi )/n. Some explicit examples:

(i) The numbers are 1, 1, 2, 4. Then E[X] = (1 + 1 + 2 + 4)/4 = 2.

(ii) There are 10 tickets in the box. One of them has the number 1; all others
have a 0. E[X] = (9 · 0 + 1 · 1)/10 = 0.1.
(iii) There are 100 tickets in the box. One of them has the number 1000; all others have
the number 1. E[X] = (99 · 1 + 1 · 1000)/100 = 1099/100 = 10.99.
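Formula (34) applied to the three boxes can be sketched in Python with exact fractions. The helper names `expectation` and `uniform_box` are illustrative, not from any library:

```python
from fractions import Fraction as F

def expectation(pmf):
    """E[X] = sum over x of p_X(x) * x, as in (34)."""
    return sum(p * x for x, p in pmf.items())

def uniform_box(tickets):
    """PMF of the number on a ticket drawn uniformly at random from the box."""
    n = len(tickets)
    pmf = {}
    for z in tickets:
        pmf[z] = pmf.get(z, F(0)) + F(1, n)  # repeated numbers accumulate weight
    return pmf

e1 = expectation(uniform_box([1, 1, 2, 4]))       # (i):   2
e2 = expectation(uniform_box([0] * 9 + [1]))      # (ii):  1/10
e3 = expectation(uniform_box([1] * 99 + [1000]))  # (iii): 1099/100
```

Note how `uniform_box` quietly handles repeated ticket numbers by adding up their weights, which is exactly what passing from the outcome space to the PMF does.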

Remark 7.18. Is the expectation a good summary of data?


Imagine you live in a country where 1% of the population earns 1000 Dollars a day and 99%
of the population earn 1 Dollar a day. Picking a person at random from the population
and asking about his or her salary corresponds to (iii). Now imagine you are one of the
99% and you read in the newspaper that in your country the average daily salary is about 11
Dollars. That is 11 times more than yours, even though you feel like you earn just the same
as pretty much everybody else around you.
What happened? The one outlier value pulled the expected value up. It now gives an
unrealistic picture of the salary distribution in that country.
What can be done? Remove outliers, use alternative measures of location (e.g. the median)
that are more robust to outliers, always report a measure of spread alongside, etc.
In other examples, the expectation may give you a good first idea about the magnitude
of the numbers in a data set. It depends!

Example 7.19. Dice.


Roll a die once. Let X be the number shown. The model is the same as for tickets in a
box. E[X] = (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5. This was for an ordinary cubic (hexahedral) die.
For a tetrahedral die: E[X] = (1 + 2 + 3 + 4)/4 = 2.5.
For an octahedral die: E[X] = (1 + 2 + 3 + 4 + 5 + 6 + 7 + 8)/8 = 4.5.
For a dodecahedral die: E[X] = (1+2+3+4+5+6+7+8+9+10+11+12)/12 = 78/12 = 6.5.
For an icosahedral die: DIY
See Section 5.2.1 in Wiki under “dice” for more info and a beautiful picture of a platonic
solid set of dice.
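For a fair k-sided die labelled 1, . . . , k the pattern above is E[X] = (1 + · · · + k)/k = (k + 1)/2. A one-line Python sketch (the function name is a hypothetical choice) confirms the values computed above and lets you check your icosahedral DIY answer:

```python
from fractions import Fraction as F

def fair_die_expectation(k):
    """E[X] for a fair k-sided die labelled 1..k; should equal (k + 1)/2."""
    return sum(F(i, k) for i in range(1, k + 1))

values = {k: fair_die_expectation(k) for k in (4, 6, 8, 12, 20)}
```

Each value agrees with the closed form (k + 1)/2, since 1 + · · · + k = k(k + 1)/2.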

Example 7.20. Expected value of an indicator function.


Let (Ω, F, P ) be a probability space, A ∈ F and 1A the indicator function of A defined in
Example 7.3. Then, using (34) and Example 29,

E[1A ] = p1A (0) · 0 + p1A (1) · 1 = P (A).

Lemma 7.21. Linearity of the expected value.


Let X and Y be random variables on a probability space (Ω, F, P). Then,

E[X + Y ] = E[X] + E[Y ] and E[αX] = αE[X] for all α ∈ R (35)


Remark 7.22. Properties of the expected value.
(i) Note that E[X · Y ] = E[X] · E[Y ] is not generally true. When it does hold, X and Y
are called uncorrelated.
(ii) Using induction, (35) can be generalised as follows.
For any random variables Xi (i = 1, . . . , n) on (Ω, F, P) and all αi ∈ ℝ (i = 1, . . . , n),

    E[ Σ_{i=1}^n αi Xi ] = Σ_{i=1}^n αi E[Xi ]                               (36)

Example 7.23. Many rolls of a die.


Roll a tetrahedral die n times. Let Ω = {1, 2, 3, 4}^n and Xi (i = 1, . . . , n) the numbers
shown in the ith roll. The Xi are random and so are their sum Sn = Σ_{i=1}^n Xi and their
average An := Sn /n. Let us compute their expected values using the linearity of E.

    E[Sn ] = Σ_{i=1}^n E[Xi ] = n · 2.5.

    E[An ] = E[Sn /n] = (1/n) · E[Sn ] = 2.5.
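A quick Monte Carlo check of E[An] = 2.5 can be written in Python. This is only an illustrative simulation (the seed and sample size are arbitrary choices), not part of the proof, which rests entirely on linearity:

```python
import random

random.seed(0)  # fixed seed so the run is reproducible
n = 100_000
rolls = [random.randint(1, 4) for _ in range(n)]  # n rolls of a fair tetrahedral die

S_n = sum(rolls)
A_n = S_n / n  # sample average; should be close to E[A_n] = 2.5
```

With this many rolls the sample average lands within a few thousandths of 2.5, as the law of large numbers (treated later in the module) leads one to expect.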
Question 7.24. Expected number of working components.
In a system of n components each works with a certain probability pi (i = 1, . . . , n).
Compute the expected number of working components.
Answer: Use Ω0 = {w, f} as an outcome space describing the state of an individual
component, using w for "works" and f for "fails". Let Ω = Ω0^n be the outcome space
describing the state of the whole system. Let Ai (i = 1, . . . , n) be the event that the i th
component works. (Formally, this looks for example as follows: A1 = {(w, ω2 , . . . , ωn ) | ωi ∈
Ω0 for i = 2, . . . , n}.) The number X of working components can be represented as the
sum of indicator functions: X(ω) = 1A1 (ω) + . . . + 1An (ω). Using (36) yields

    E[X] = E[ Σ_{i=1}^n 1Ai ] = Σ_{i=1}^n E[1Ai ] = Σ_{i=1}^n pi .          (37)
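The identity (37) can be verified exactly in Python by brute-force enumeration. The probabilities below are a hypothetical example; note that independence of the components is assumed here only to build one concrete joint distribution to enumerate, while the formula E[X] = Σ pi itself needs no independence at all:

```python
from fractions import Fraction as F
from itertools import product

p = [F(9, 10), F(3, 4), F(1, 2)]  # hypothetical per-component working probabilities

# Exact E[X] by enumerating every works/fails pattern of the system.
E_X = F(0)
for pattern in product([True, False], repeat=len(p)):
    prob = F(1)
    for works, p_i in zip(pattern, p):
        prob *= p_i if works else 1 - p_i  # probability of this exact pattern
    E_X += prob * sum(pattern)             # sum(pattern) = number of working components
```

The enumeration gives E_X = 9/10 + 3/4 + 1/2 = 43/20, in agreement with (37).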

The method used in the answer above is very general and has got a name:

Lemma 7.25. Method of indicators.
Let (Ω, F, P) be a probability space, let Ai ∈ F (i = 1, . . . , n) and let X be the number of
these events that occur. Then E[X] = Σ_{i=1}^n P (Ai ).

Proof. Using X(ω) = Σ_{i=1}^n 1Ai (ω) (ω ∈ Ω), the statement follows as in (37).
The proof is just a formalisation of the simple idea used in the explicit example above. It
deserves to be a lemma, because the technique is often quite useful. However, it is a bit
of an art to recognise when it can be applied.
Ω: They forgot the independence assumption.

ω: That is not at all true, because it was not at all needed!

Example 7.26. Expected number of successes in Binomial trials.


Let X be the number of successes in n independent Bernoulli trials with success parameter
p. X has a Binomial distribution and it is possible to compute the expected value of X
directly using the weights of the Binomial distribution. Much easier, however, is to just
use the method of indicators. It yields E[X] = np.
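To see how much work the method of indicators saves, here is the direct computation it replaces, sketched in Python with exact fractions (the function name and the choice n = 10, p = 3/10 are illustrative):

```python
from fractions import Fraction as F
from math import comb

def binomial_mean_direct(n, p):
    """E[X] computed term by term from the Binomial(n, p) weights C(n, k) p^k (1-p)^(n-k)."""
    return sum(F(comb(n, k)) * p**k * (1 - p)**(n - k) * k for k in range(n + 1))

n, p = 10, F(3, 10)
direct = binomial_mean_direct(n, p)  # the method of indicators predicts n * p
```

The sum collapses to n · p = 3 exactly, which the method of indicators delivers in one line without touching the binomial coefficients.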
7.2 Variance

Definition 7.27. Variance of a discrete random variable.


Let (Ω, F, P ) be a probability space and X a random variable with countable range S, PMF
pX and expected value µ. Then

    V ar(X) = E[(X − µ)²]                                                    (38)

is called variance of X.

In other words, we measure how much the outcomes deviate from the expected value by
the square of their difference. Then we take the expected values of that expression. Using
the square makes the expression always non-negative. Alternatively, absolute values could
be used (and they are used in the so-called L1 -norm approach to statistics).

Remark 7.28. Alternative expression for the variance.

    V ar(X) = E[X²] − E[X]²

This follows from the following computation, using the linearity of the expected value and
the fact that µ is just a constant (not actually random): E[(X − µ)²] = E[X² − 2 · X · µ + µ²] =
E[X²] − 2 · µ · E[X] + µ² = E[X²] − 2 · µ · µ + µ² = E[X²] − 2 · µ² + µ² = E[X²] − µ².

Example 7.29. Tickets in a box.


Use the same notation and numbers as in Example 7.17.
(i) The numbers are 1, 1, 2, 4. E[X²] = (1² + 1² + 2² + 4²)/4 = 22/4 = 5.5 and E[X]² = 2² = 4, so
V ar(X) = 5.5 − 4 = 1.5.
(ii) There are 10 tickets in the box. One of them has the number 1; all others have a 0.
E[X²] = (9 · 0² + 1 · 1²)/10 = 0.1 and E[X]² = 0.1² = 0.01, so V ar(X) = 0.09.
(iii) There are 100 tickets in the box. One of them has the number 1000; all others have
the number 1. E[X²] = (99 · 1² + 1 · 1000²)/100 = 1000099/100 = 10000.99 and
E[X]² = 10.99², so V ar(X) = 10000.99 − 10.99² ≈ 9880.21.
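The formula of Remark 7.28 applied to the three boxes can be sketched in Python with exact arithmetic. The helpers `variance` and `uniform_box` are illustrative names, not library functions:

```python
from fractions import Fraction as F

def uniform_box(tickets):
    """PMF of the number on a ticket drawn uniformly at random from the box."""
    n = len(tickets)
    pmf = {}
    for z in tickets:
        pmf[z] = pmf.get(z, F(0)) + F(1, n)
    return pmf

def variance(pmf):
    """Var(X) = E[X^2] - E[X]^2, the alternative expression from Remark 7.28."""
    mean = sum(p * x for x, p in pmf.items())
    mean_sq = sum(p * x * x for x, p in pmf.items())
    return mean_sq - mean**2

v1 = variance(uniform_box([1, 1, 2, 4]))        # (i):   3/2
v2 = variance(uniform_box([0] * 9 + [1]))       # (ii):  9/100
v3 = variance(uniform_box([1] * 99 + [1000]))   # (iii): huge, driven by the outlier
```

As the text notes, v3 dwarfs the other two: the single outlier ticket dominates the spread just as it dominated the expectation in Remark 7.18.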

The variance in (iii) is high. This is related to the issue discussed in Remark 7.18.
The expectation is a measure of location and the variance is a measure of spread. They are
examples of one-parameter summaries of distributions or data sets. Their mathematical
expressions can be computed by the recipes provided. However, whether they do provide
good summaries or whether they give an oversimplified or even misleading picture of the
situation has to be checked in the particular cases.
