
MAST30001 Stochastic Modelling

Lecturer: Nathan Ross

Administration

LMS - announcements, grades, course documents

Lectures/Practicals

Student-staff liaison committee (SSLC) representative

Slide 1

Modelling
We develop an imitation of the system. It could be, for example,
- a small replica of a marina development,
- a set of equations describing the relations between stock prices,
- a computer simulation that reproduces a complex system (think: the paths of planets in the solar system).

We use a model
- to understand the evolution of a system,
- to understand how outputs relate to inputs, and
- to decide how to influence a system.

Slide 2

Why do we model?

We want to understand how a complex system works. Real-world experimentation can be
- too slow,
- too expensive,
- possibly too dangerous, and
- it may not deliver insight.

The alternative is to build a physical, mathematical or computational model that captures the essence of the system that we are interested in (think: NASA).

Slide 3

Why a stochastic model?

We want to model such things as
- traffic in the Internet,
- stock prices and their derivatives,
- waiting times in healthcare queues,
- reliability of multicomponent systems,
- interacting populations, and
- epidemics,
where the effects of randomness cannot be ignored.

Slide 4

Good mathematical models
- capture the non-trivial behaviour of a system,
- are as simple as possible,
- replicate empirical observations,
- are tractable - they can be analysed to derive the quantities of interest, and
- can be used to help make decisions.

Slide 5

Stochastic modelling
Stochastic modelling is about the study of random experiments. For example,
- toss a coin once, toss a coin twice, toss a coin infinitely-many times,
- the lifetime of a randomly selected battery (quality control),
- the operation of a queue over the time interval [0, ∞) (service),
- the changes in the US dollar - Australian dollar exchange rate from 2006 onwards (finance),
- the positions of all iPhones that make connections to a particular telecommunications company over the course of one hour (wireless tower placement),
- the network friend structure of Facebook (ad revenue).

Slide 6

Stochastic modelling

We study a random experiment in the context of a Probability Space (Ω, F, P). Here,
- the sample space Ω is the set of all possible outcomes of our random experiment,
- the class of events F is a set of subsets of Ω. We view these as events we can see or measure, and
- P is a probability measure defined on the elements of F.
Slide 7

The sample space

We need to think about the sets of possible outcomes for the random experiments. For those discussed above, these could be
- {H, T}, {(H, H), (H, T), (T, H), (T, T)}, the set of all infinite sequences of Hs and Ts.
- [0, ∞).
- the set of piecewise-constant functions from [0, ∞) to Z+.
- the set of continuous functions from [0, ∞) to IR+.
- ⋃_{n=0}^∞ {(x1, y1), . . . , (xn, yn)}, giving locations of the phones when they connected.
- the set of simple networks with number of vertices equal to the number of users: edges connect friends.

Slide 8

Review of basic notions of set theory
- A ⊆ B: A is a subset of B, or: if A occurs, then B occurs.
- Union of sets (events): at least one occurs.
  A ∪ B = {ω : ω ∈ A or ω ∈ B} = B ∪ A.
  A1 ∪ A2 ∪ · · · ∪ An = ⋃_{i=1}^n Ai.
- Intersection of sets (events): both occur.
  A ∩ B = {ω : ω ∈ A and ω ∈ B} = B ∩ A = AB.
  A1 ∩ A2 ∩ · · · ∩ An = ⋂_{i=1}^n Ai.
- Complement of a set/event: the event doesn't occur.
  Ac = {ω : ω ∉ A}.
- ∅: the empty set or impossible event.

Slide 9

The class of events F

For discrete sample spaces, F is typically the set of all subsets.
Example: Toss a coin once, F = {∅, {H}, {T}, {H, T}}.

For continuous state spaces, the situation is more complicated:

Slide 10

The class of events F

- S is the circle of radius 1.
- We say two points on S are in the same family if you can get from one to the other by taking steps of arclength 1 around the circle.
- Each family chooses a single member to be head.
- If X is a point chosen uniformly at random from the circle, what is the chance X is the head of its family?

Slide 11

The class of events F

- A = {X is head of its family}.
- Ai = {X is i steps clockwise from the head of its family}.
- Bi = {X is i steps counterclockwise from the head of its family}.
- By uniformity, P(A) = P(Ai) = P(Bi), BUT, by the law of total probability,

  1 = P(A) + Σ_{i=1}^∞ (P(Ai) + P(Bi))!

The issue is that the event A is not one we can see or measure, so it should not be included in F.

Slide 12

The class of events F

These kinds of issues are technical to resolve and are dealt with in
later probability or analysis subjects which use measure theory.

Slide 13

The probability measure P

The probability measure P on (Ω, F) is a set function on F satisfying
P1. P(A) ≥ 0 for all A ∈ F [probabilities measure long run %s or certainty],
P2. P(Ω) = 1 [there is a 100% chance something happens], and
P3. Countable additivity: if A1, A2, . . . are disjoint events in F, then P(⋃_{i=1}^∞ Ai) = Σ_{i=1}^∞ P(Ai) [think about it in terms of frequencies].

Slide 14

How do we specify P?
The modelling process consists of
- defining the values of P(A) for some basic events A ∈ F,
- deriving P(B) for the other, more complicated, events B ∈ F from the axioms above.

Example: Toss a fair coin 1000 times. Any length-1000 sequence of Hs and Ts has chance 2^{-1000}.
- What is the chance there are more than 600 Hs in the sequence?
- What is the chance the first time the proportion of heads exceeds the proportion of tails occurs after toss 20?
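The first question can be answered exactly, since every length-1000 sequence has probability 2^{-1000}: just count sequences with the required number of heads. A minimal sketch:

```python
from math import comb

# Exact tail probability for the first question on this slide:
# P(more than k heads in n fair tosses), built directly from the fact
# that each length-n sequence of Hs and Ts has probability 2^(-n).
def prob_more_than(k, n=1000):
    return sum(comb(n, j) for j in range(k + 1, n + 1)) / 2**n

p = prob_more_than(600)
print(p)  # a tiny probability, on the order of 1e-10
```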

Slide 15

Properties of P
- P(∅) = 0.
- P(Ac) = 1 − P(A).
- P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Slide 16

Conditional probability
Let A, B ∈ F be events with P(B) > 0. Supposing we know that B occurred, how likely is A given that information? That is, what is the conditional probability P(A|B)?
For a frequency interpretation, consider the situation where we have n trials and B has occurred nB times. What is the relative frequency of A in these nB trials? The answer is

  nAB/nB = (nAB/n)/(nB/n) ≈ P(A ∩ B)/P(B).

Hence, we define

  P(A|B) = P(A ∩ B)/P(B).

We need a more sophisticated definition if we want to define the conditional probability P(A|B) when P(B) = 0.
Slide 17

Example:

Tickets are drawn consecutively and without replacement from a box of tickets numbered 1 to 10. What is the chance the second ticket is even-numbered given the first is
- even?
- labelled 3?

Slide 18

Bayes' formula

Law of Total Probability: Let B1, B2, . . . , Bn be mutually exclusive events with A ⊆ ⋃_{j=1}^n Bj; then

  P(A) = Σ_{j=1}^n P(A|Bj)P(Bj).

With the same assumptions as for the Law of Total Probability,

  P(Bj|A) = P(Bj ∩ A)/P(A) = P(A|Bj)P(Bj) / Σ_{k=1}^n P(A|Bk)P(Bk).

Slide 19

Example:

A disease affects 1/1000 newborns, and shortly after birth a baby is screened for this disease using a cheap test that has a 2% false positive rate (the test has no false negatives). If the baby tests positive, what is the chance it has the disease?
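As a sanity check, the posterior can be computed directly from Bayes' formula, using only the numbers given on this slide:

```python
# Bayes' formula applied to the screening example on this slide.
# P(D) = 1/1000, P(+|D) = 1 (no false negatives), P(+|not D) = 0.02.
p_d = 1 / 1000
p_pos_given_d = 1.0
p_pos_given_not_d = 0.02

# Law of total probability for the denominator, then Bayes.
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)
p_d_given_pos = p_pos_given_d * p_d / p_pos
print(p_d_given_pos)  # about 0.048: under 5% of positives are true cases
```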

Slide 20

Independent events

Events A and B are said to be independent if

  P(A ∩ B) = P(A)P(B).

If P(B) ≠ 0 or P(A) ≠ 0, then this is the same as P(A|B) = P(A) and P(B|A) = P(B).
Events A1, . . . , An are independent if, for any subset {i1, . . . , ik} of {1, . . . , n},

  P(Ai1 ∩ · · · ∩ Aik) = P(Ai1) · · · P(Aik).

Slide 21

Random variables
A random variable (rv) on a probability space (Ω, F, P) is a function X : Ω → IR.
Usually, we want to talk about the probabilities that the values of random variables lie in sets of the form (a, b) = {x : a < x < b}.
The smallest σ-algebra of subsets of IR that contains these sets is called the set B(IR) of Borel sets, after Émile Borel (1871-1956).
The probability that X ∈ (a, b) is the probability of the subset {ω : X(ω) ∈ (a, b)}. In order for this to make sense, we need this set to be in F. We require this condition for all a < b, and then we say the function X is measurable with respect to F.
So X is measurable with respect to F if {ω : X(ω) ∈ B} ∈ F for all Borel sets B ⊆ IR.

Slide 22

Distribution Functions

The function FX(t) = P(X ≤ t) = P({ω : X(ω) ∈ (−∞, t]}) that maps R to [0, 1] is called the distribution function of the random variable X.
Any distribution function F
F1. is non-decreasing,
F2. is such that F(x) → 0 as x → −∞ and F(x) → 1 as x → ∞, and
F3. is right-continuous, that is, lim_{h→0+} F(t + h) = F(t) for all t.

Slide 23

Distribution Functions

We say that
- the random variable X is discrete if it can take only countably-many values, with P(X = xi) = pi > 0 and Σ_i pi = 1. Its distribution function FX(t) is commonly a step function.
- the random variable X is continuous if FX(t) is absolutely continuous, that is, if there exists a function fX(t) that maps R to R+ such that FX(t) = ∫_{−∞}^t fX(u) du.
- a mixed random variable has some points that have positive probability and also some continuous parts.

Slide 24

Examples of distributions

- Examples of discrete random variables: binomial, Poisson, geometric, negative binomial, discrete uniform
  http://en.wikipedia.org/wiki/Category:Discrete_distributions
- Examples of continuous random variables: normal, exponential, gamma, beta, uniform on an interval (a, b)
  http://en.wikipedia.org/wiki/Category:Continuous_distributions

Slide 25

Random Vectors
A random vector X = (X1, . . . , Xd) is a measurable mapping of (Ω, F) to IR^d, that is, for each Borel set B ⊆ IR^d,
{ω : X(ω) ∈ B} ∈ F.
The distribution function of a random vector is

  FX(t) = P(X1 ≤ t1, . . . , Xd ≤ td),  t = (t1, . . . , td) ∈ R^d.

It follows that

  P(s1 < X1 ≤ t1, s2 < X2 ≤ t2) = F(t1, t2) − F(s1, t2) − F(t1, s2) + F(s1, s2).

Slide 26

Independent Random Variables

The random variables X1, . . . , Xd are called independent if

  FX(t) = FX1(t1) · · · FXd(td) for all t = (t1, . . . , td).

Equivalently,
- the events {X1 ∈ B1}, . . . , {Xd ∈ Bd} are independent for all Borel sets B1, . . . , Bd ⊆ R,
- or, in the absolutely-continuous case, fX(t) = fX1(t1) · · · fXd(td) for all t = (t1, . . . , td).

Slide 27

Revision Exercise

For bivariate random variables (X, Y) with density functions
- f(x, y) = 2x + 2y − 4xy for 0 < x < 1, 0 < y < 1, and
- f(x, y) = 4 − 8x − 8y + 16xy for 0 < x < 1, 0 < y < 1, 0 < x + y < 1,

do the following:
- check f is a true density,
- find the marginal probability density functions fX(x) and fY(y), and
- find the probability density function of Y conditional on the value of X.

Slide 28

Expectation of X
For a discrete, continuous or mixed random variable X that takes on values in the set SX, the expectation of X is

  E(X) = ∫_{SX} x dFX(x).

The integral on the right hand side is a Lebesgue-Stieltjes integral. It can be evaluated as

  = Σ_{i=1}^∞ xi P(X = xi), if X is discrete,
  = ∫ x fX(x) dx, if X is absolutely continuous.

In second year, we required that the integral be absolutely convergent. In this course we will allow the expectation to be infinite, provided that we never get in a situation where we have ∞ − ∞.
Slide 29

Expectation of g (X )

For a measurable function g that maps SX to some other set SY, Y = g(X) is a random variable taking values in SY and

  E(Y) = E(g(X)) = ∫_{SX} g(x) dFX(x).

We can also evaluate E(Y) by calculating its distribution function FY(y) and then using the expression

  E(Y) = ∫_{SY} y dFY(y).

Slide 30

Properties of Expectation

- E(aX + bY) = aE(X) + bE(Y).
- If X ≤ Y, then E(X) ≤ E(Y).
- If X ≡ c, then E(X) = c.

Slide 31

Moments

- The kth moment of X is E(X^k).
- The kth central moment of X is E[(X − E(X))^k].
- The variance V(X) of X is the second central moment, which equals E(X²) − (E(X))².
- V(cX) = c²V(X).
- If X and Y have finite means and are independent, then E(XY) = E(X)E(Y).
- If X and Y are independent (or uncorrelated), then V(X + Y) = V(X) + V(Y).

Slide 32

Conditional Probability

The conditional probability of event A given X is a random variable (since it is a function of X). We write it as P(A|X).
- For a real number x, if P(X = x) > 0, then
  P(A|x) = P(A ∩ {X = x})/P({X = x}).
- If P(X = x) = 0, then
  P(A|x) = lim_{ε→0+} P(A ∩ {X ∈ (x − ε, x + ε)})/P({X ∈ (x − ε, x + ε)}).

Slide 33

Conditional Distribution

- The conditional distribution function FY|X(y|X) of Y evaluated at the real number y is given by P({Y ≤ y}|X), where P({Y ≤ y}|x) is defined on the previous slide.
- If (X, Y) is absolutely continuous, then the conditional density of Y given that X = x is fY|X(y|x) = f(X,Y)(x, y)/fX(x).

Slide 34

Conditional Expectation

The conditional expectation is E(Y|X) = ψ(X), where

  ψ(x) = E(Y|X = x)
       = Σ_j yj P(Y = yj|X = x), if Y is discrete,
       = ∫_{SY} y fY|X(y|x) dy, if Y is absolutely continuous.

Slide 35

Properties of Conditional Expectation

- Linearity: E(aY1 + bY2|X) = aE(Y1|X) + bE(Y2|X).
- Monotonicity: if Y1 ≤ Y2, then E(Y1|X) ≤ E(Y2|X).
- E(c|X) = c.
- E(E(Y|X)) = E(Y).
- For any measurable function g, E(g(X)Y|X) = g(X)E(Y|X).
- E(Y|X) is the best predictor of Y from X in the mean square sense. This means that, for all random variables Z = g(X), the expected quadratic error E((g(X) − Y)²) is minimised when g(X) = E(Y|X) (see Borovkov, page 57).

Slide 36

Exercise

Let Ω = {a, b, c, d}, with P({a}) = 1/2, P({b}) = P({c}) = 1/8, and P({d}) = 1/4.
Define random variables

  X(ω) = 1 if ω = a or b,  X(ω) = 0 if ω = c or d,

and

  Y(ω) = 2 if ω = a or c,  Y(ω) = 5 if ω = b or d.

Compute E(X), E(X|Y) and E(E(X|Y)).
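On a four-point sample space the computation can be checked by brute force. A minimal sketch (assuming the reading of X and Y above, where X indicates {a, b} and Y distinguishes {a, c} from {b, d}), which also verifies the tower property E(E(X|Y)) = E(X):

```python
from fractions import Fraction as F

# Brute-force check of the exercise on the four-point sample space.
P = {'a': F(1, 2), 'b': F(1, 8), 'c': F(1, 8), 'd': F(1, 4)}
X = {'a': 1, 'b': 1, 'c': 0, 'd': 0}
Y = {'a': 2, 'b': 5, 'c': 2, 'd': 5}

EX = sum(P[w] * X[w] for w in P)

# E(X|Y = y) = sum over {w : Y(w) = y} of X(w)P(w), divided by P(Y = y)
def cond_exp(y):
    py = sum(P[w] for w in P if Y[w] == y)
    return sum(P[w] * X[w] for w in P if Y[w] == y) / py

# E(E(X|Y)) = sum_y E(X|Y = y) P(Y = y); should equal E(X)
E_E = sum(cond_exp(y) * sum(P[w] for w in P if Y[w] == y) for y in (2, 5))
print(EX, cond_exp(2), cond_exp(5), E_E)
```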

Slide 37

Example

The number of storms, N, in the upcoming rainy season is distributed according to a Poisson distribution with a parameter value Λ that is itself random. Specifically, Λ is uniformly distributed over (0, 5). The distribution of N is called a mixed Poisson distribution.
1. Find the probability there are at least two storms this season.
2. Calculate E(N|Λ) and E(N²|Λ).
3. Derive the mean and variance of N.
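A numerical sketch of parts 1 and 3, conditioning on Λ and averaging: P(N ≥ 2 | Λ = λ) = 1 − e^{−λ}(1 + λ), so P(N ≥ 2) = 1 − (1/5)∫₀⁵ e^{−λ}(1 + λ) dλ, and V(N) = E[V(N|Λ)] + V(E[N|Λ]) = E(Λ) + V(Λ).

```python
from math import exp

# Mixed Poisson: N | Lambda = lam is Poisson(lam), Lambda ~ Uniform(0, 5).
# P(N >= 2) = 1 - (1/5) * integral_0^5 e^(-lam)(1 + lam) d lam,
# evaluated here by a midpoint Riemann sum (closed form: 2 - 7e^(-5)).
m = 200000
integral = sum(exp(-lam) * (1 + lam)
               for lam in ((i + 0.5) * 5 / m for i in range(m))) * 5 / m
p_at_least_2 = 1 - integral / 5

# E(N | Lambda) = Lambda and E(N^2 | Lambda) = Lambda + Lambda^2, so
# E(N) = E(Lambda) = 5/2 and V(N) = E(Lambda) + V(Lambda) = 5/2 + 25/12.
mean_N = 5 / 2
var_N = 5 / 2 + 25 / 12
print(p_at_least_2, mean_N, var_N)
```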

Slide 38

Exercise

The joint density of X and Y is given by

  fX,Y(x, y) = e^{−x/y} e^{−y} / y,  x > 0, y > 0.

Calculate E[X|Y] and then calculate E[X].
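One way to check an answer numerically: the density factors as (e^{−x/y}/y) · e^{−y}, so Y ∼ Exp(1) and, given Y = y, X is exponential with mean y, suggesting E[X|Y] = Y and E[X] = E[Y] = 1. A Monte Carlo sketch under that factorisation:

```python
import random

# Simulate (X, Y): Y ~ Exp(1); given Y = y, X is exponential with mean y.
random.seed(1)
n = 200000
xs = []
for _ in range(n):
    y = random.expovariate(1.0)             # Y ~ Exp(1), mean 1
    xs.append(random.expovariate(1.0 / y))  # X | Y = y has mean y
print(sum(xs) / n)  # should be close to E[X] = 1
```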

Slide 39

Limit Theorems (Borovkov 2.9)

The Law of Large Numbers (LLN) states that if X1, X2, . . . are independent and identically-distributed with mean μ, then

  X̄n ≡ (1/n) Σ_{j=1}^n Xj → μ

as n → ∞.
In the strong form, this is true almost surely, which means that it is true on a set A of sequences x1, x2, . . . that has probability one.
In the weak form, this is true in probability, which means that, for all ε > 0,

  P(|X̄n − μ| > ε) → 0

as n → ∞.
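The LLN is easy to see in simulation. A minimal sketch for fair coin tosses (μ = 0.5):

```python
import random

# The LLN for fair coin tosses, mu = 0.5: the sample mean of n indicator
# variables concentrates around mu as n grows.
random.seed(0)
for n in (100, 10000, 1000000):
    xbar = sum(random.random() < 0.5 for _ in range(n)) / n
    print(n, xbar)  # |xbar - 0.5| shrinks roughly like 1/sqrt(n)
```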

Slide 40

Limit Theorems (Borovkov 2.9)

The Central Limit Theorem (CLT) states that if X1, X2, . . . are independent and identically-distributed with mean μ and variance σ², then for any x,

  P((X̄n − μ)/(σ/√n) < x) → Φ(x) ≡ (1/√(2π)) ∫_{−∞}^x e^{−t²/2} dt

as n → ∞.
That is, a suitably-scaled variation from the mean approaches a standard normal distribution as n → ∞.
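A simulation sketch of the CLT: standardize the mean of n Uniform(0, 1) variables (μ = 1/2, σ² = 1/12) and compare the empirical frequency of the event {(X̄n − μ)/(σ/√n) < 1} with Φ(1):

```python
import random
from math import erf, sqrt

random.seed(0)
n, reps = 200, 5000
mu, sigma = 0.5, sqrt(1 / 12)
count = 0
for _ in range(reps):
    xbar = sum(random.random() for _ in range(n)) / n
    if (xbar - mu) / (sigma / sqrt(n)) < 1:
        count += 1

phi1 = 0.5 * (1 + erf(1 / sqrt(2)))  # Phi(1), about 0.8413
print(count / reps, phi1)  # the two numbers should be close
```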

Slide 41

Limit Theorems (Borovkov 2.9)

The Poisson Limit Theorem states that if X1, X2, . . . are independent Bernoulli random variables with P(Xi = 1) = 1 − P(Xi = 0) = pi, then X1 + X2 + · · · + Xn is well-approximated by a Poisson random variable with parameter λ = p1 + · · · + pn.
Specifically, with W = X1 + X2 + · · · + Xn, then, for any Borel set B ⊆ R,

  P(W ∈ B) ≈ P(Y ∈ B)

where Y ∼ Po(λ).
There is, in fact, a bound on the accuracy of this approximation:

  |P(W ∈ B) − P(Y ∈ B)| ≤ Σ_{i=1}^n pi² / max(1, λ).
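The bound can be checked numerically for a set where both probabilities are available in closed form, namely B = {0}: P(W = 0) = Π(1 − pi) and P(Y = 0) = e^{−λ}. A sketch with illustrative values of the pi (not from the slides):

```python
from math import exp, prod

# Check the Poisson approximation bound for B = {0}.
ps = [0.01 * i for i in range(1, 11)]  # p_1, ..., p_10 (illustrative)
lam = sum(ps)

diff = abs(prod(1 - p for p in ps) - exp(-lam))  # |P(W=0) - P(Y=0)|
bound = sum(p * p for p in ps) / max(1, lam)     # the stated bound
print(diff, bound)  # the difference respects the bound
```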

Slide 42

Example

Suppose there are three ethnic groups, A (20%), B (30%) and C (50%), living in a city with a large population. Suppose 0.5%, 1% and 1.5% of people in A, B and C respectively are over 200cm tall. If we know that, of 300 people selected, 50, 50 and 200 are from A, B and C, what is the probability that at least four will be over 200cm?
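The Poisson Limit Theorem from the previous slide applies here: the number of tall people among those selected is a sum of independent Bernoullis, so it is approximately Po(λ) with λ = 50·0.005 + 50·0.01 + 200·0.015 = 3.75. A sketch of the resulting approximation:

```python
from math import exp

lam = 50 * 0.005 + 50 * 0.01 + 200 * 0.015  # lambda = 3.75

# P(at least 4) ~ 1 - P(Y <= 3) for Y ~ Po(lambda)
p_le_3 = exp(-lam) * sum(lam**k / f for k, f in [(0, 1), (1, 1), (2, 2), (3, 6)])
p_at_least_4 = 1 - p_le_3
print(lam, p_at_least_4)  # about 0.52
```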

Slide 43

Stochastic Processes (Borovkov 2.10)

A collection of random variables {Xt, t ∈ T} (or {X(t), t ∈ T}) on a common probability space (Ω, F, P) is called a stochastic process. The index variable t is often called time.
- If T = {1, 2, . . .} or {. . . , −2, −1, 0, 1, 2, . . .}, the process is a discrete time process.
- If T = IR or [0, ∞), the process is a continuous time process.
- If T = IR^d, then the process is a spatial process, for example temperature at t ∈ T ⊆ IR², which could be, say, the University campus.
Slide 44

Examples of Stochastic Processes


If Xt ∼ N(0, 1) for all t (and, more generally, the joint distributions of (Xt1, . . . , Xtk) are multivariate normal), then Xt is called a Gaussian process. Different processes can be modelled by making different assumptions about the dependence between the Xt for different t.
Standard Brownian Motion is a Gaussian process where
- for any 0 ≤ s1 < t1 ≤ s2 < t2 ≤ · · · ≤ sk < tk, the increments X(t1) − X(s1), . . . , X(tk) − X(sk) are independent, and
- V(X(t1) − X(s1)) = t1 − s1 for all s1 < t1.

http://www.ms.uky.edu/~mai/java/stat/brmo.html (Brownian motion demo)

Lect 04

620-301

Slide 45

Examples of Stochastic Processes


Xt is the number of sales of an item up to time t.
{Xt, t ≥ 0} is called a counting process.

Slide 46

Examples of Stochastic Processes

Xt is the number of people in a queue at time t.


Slide 47

Interpretations
We can think of Ω as consisting of the set of sample paths ω = {Xt : t ∈ T}, that is, a set of sequences if T is discrete or a set of functions if T is continuous. Each ω has a value at each time point t ∈ T. With this interpretation,
- for a fixed ω, we can think of t as a variable and X(t) as a deterministic function (realization, trajectory, sample path) of the process.
- if we allow ω to vary, we get a collection of trajectories.
- for fixed t, with ω varying, we see that Xt(ω) is a random variable.
- if both ω and t are fixed, then Xt(ω) is a real number.

Slide 48

Examples of Stochastic Processes

If Xt is a counting process:
- For fixed ω, Xt(ω) is a non-decreasing step function of t.
- For fixed t, Xt(ω) is a non-negative integer-valued random variable.
- For s < t, Xt − Xs is the number of events that have occurred in the interval (s, t].

If Xt is the number of people in a queue at time t, then {Xt : t ≥ 0} is a stochastic process where, for each t, Xt(ω) is a non-negative integer-valued random variable, but it is NOT a counting process because, for fixed ω, Xt(ω) can decrease.

Slide 49

Finite-Dimensional Distributions

Knowing just the one-dimensional (individual) distributions of Xt for all t is not enough to describe a stochastic process.
To specify the complete distribution of a stochastic process {Xt, t ∈ T}, we need to know the finite-dimensional distributions, that is, the family of joint distribution functions F_{t1,t2,...,tk}(x1, . . . , xk) of Xt1, . . . , Xtk for all k ≥ 1 and t1, . . . , tk ∈ T.

Slide 50

Discrete-Time Markov Chains

We are frequently interested in applications where we have a sequence X1, X2, . . . of outputs (which we model as random variables) in discrete time. For example,
- DNA: A (adenine), C (cytosine), G (guanine), T (thymine).
- Texts: Xj takes values in some alphabet, for example {A, B, . . . , Z, a, . . .}.
- Developing and testing compression software.
- Cryptology: codes, encoding and decoding.
- Attributing manuscripts.

Slide 51

Independence?
Is it reasonable to assume that neighbouring letters are independent?
- Text T = a1 · · · an of length n.
- Let nℓ = #{i ≤ n : ai = ℓ} and nℓj = #{i ≤ n − 1 : ai ai+1 = ℓj}.
- Assuming that T is random, we expect nℓ/n ≈ P(letter = ℓ) and nℓj/n ≈ P(two letters = ℓj).
- If letters were independent, we would have P(two letters = ℓj) = P(letter = ℓ)P(letter = j), so we would expect that nℓj/n ≈ (nℓ/n)(nj/n).
- However, take ℓ = j = a: P(letter = a) ≈ 0.08, but "aa" is very rare.

We conclude that assuming independence does not lead to a good model for text.
Slide 52

The Markov Property

The Markov property embodies a natural first generalisation of the independence assumption. It assumes a kind of one-step dependence or memory. Specifically, for all Borel sets B,

  P(Xn+1 ∈ B | Xn = xn, Xn−1 = xn−1, . . .) = P(Xn+1 ∈ B | Xn = xn).

Slide 53

Discrete-Time Markov Chains


A random sequence {Xn, n ≥ 0} with a countable state space (without loss of generality {1, 2, . . .}) forms a DTMC if

  P(Xn+1 = k | Xn = j, Xn−1 = xn−1, . . . , X0 = x0) = P(Xn+1 = k | Xn = j).

This enables us to write P(Xn+1 = k | Xn = j) = pjk(n).
Furthermore, we commonly assume that the transition probabilities pjk(n) do not depend on n, in which case the DTMC is called homogeneous (more precisely, time homogeneous) and we write pjk(n) = pjk.

Slide 54

Discrete-Time Markov Chains

For a homogeneous DTMC, we define the transition matrix to be a matrix with rows and columns corresponding to the states of the process and whose jkth entry is pjk. So

      [ p11 p12 · · · p1m ]
  P = [ p21 p22 · · · p2m ]
      [  ·   ·         ·  ]
      [ pm1 pm2 · · · pmm ]

Slide 55

Discrete-Time Markov Chains


We can associate a directed graph with a DTMC by letting the
nodes correspond to states and putting in an arc jk if pjk > 0.

Slide 56

Discrete-Time Markov Chains

For a transition matrix of a DTMC:
- Each entry is ≥ 0.
- Each row sums to 1.

Any square matrix having these two properties is called a stochastic matrix.

Slide 57

Discrete-Time Markov Chains


Examples:
- If the {Xn} are independent and identically-distributed random variables with P(Xi = k) = pk, what is the transition matrix of the DTMC?
- A communication system transmits the digits 0 and 1. At each time point, there is a probability p that the digit will not change and probability 1 − p that it will change.

Slide 58

Discrete-Time Markov Chains


- Suppose that whether or not it rains tomorrow depends on previous weather conditions only through whether or not it is raining today. Suppose also that if it rains today, then it will rain tomorrow with probability p, and if it does not rain today, then it will rain tomorrow with probability q. If we say that the process is in state 0 when it rains and state 1 when it does not rain, then the above is a two-state Markov chain.
- A simple random walk. Let a sequence of random variables {Xn} ⊆ Z be defined by Xn+1 = Xn + Yn+1, where {Yn} are independent and identically-distributed random variables with P(Yn = 1) = p, P(Yn = −1) = 1 − p.

Slide 59

Discrete-Time Markov Chains

The n-step transition probabilities P(Xm+n = j | Xm = i) of a homogeneous DTMC do not depend on m. For n = 1, 2, . . . , we denote them by

  pij^(n) = P(Xm+n = j | Xm = i).

It is also convenient to use the notation

  pij^(0) := 1 if j = i,  pij^(0) := 0 if j ≠ i.

Slide 60

Discrete-Time Markov Chains


The Chapman-Kolmogorov equations show how we can calculate the pij^(n) from the pij. For n = 1, 2, . . . and any r = 1, 2, . . . , n,

  pij^(n) = Σ_k pik^(r) pkj^(n−r).
Slide 61

Discrete-Time Markov Chains


If we define the n-step transition matrix as

           [ p11^(n) p12^(n) p13^(n) · · · ]
  P^(n) =  [ p21^(n) p22^(n) p23^(n) · · · ]
           [    ·       ·       ·          ]

then the Chapman-Kolmogorov equations can be written in the matrix form

  P^(n) = P^(r) P^(n−r)

with P^(1) = P. By mathematical induction, it follows that P^(n) = P^n, the nth power of P.
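Computing n-step probabilities is therefore just repeated matrix multiplication. A minimal sketch, applied to the two-state 0/1 communication channel from an earlier slide (digit kept with probability p):

```python
# n-step transition probabilities via matrix powers.
def mat_mul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def mat_pow(P, n):
    # start from the identity, multiply by P n times
    result = [[float(i == j) for j in range(len(P))] for i in range(len(P))]
    for _ in range(n):
        result = mat_mul(result, P)
    return result

p = 0.9
P = [[p, 1 - p], [1 - p, p]]
P5 = mat_pow(P, 5)
print(P5[0][0])  # P(X_{m+5} = 0 | X_m = 0); closed form (1 + (2p-1)^5)/2
```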
Slide 62

Discrete-Time Markov Chains


How do we determine the distribution of a DTMC?
We have
- the initial distribution π^0 = (π1^0, π2^0, . . .), where πj^0 = P(X0 = j) for all j, and
- the transition matrix P.

In principle, we can use these and the Markov property to derive the finite dimensional distributions, although the calculations are frequently intractable.
For k ≥ 1 and t1 < · · · < tk ∈ Z+,

  P(Xt1 = x1, Xt2 = x2, . . . , Xtk = xk) = [π^0 P^{t1}]_{x1} [P^{t2−t1}]_{x1 x2} · · · [P^{tk−tk−1}]_{xk−1 xk}.

Slide 63

Discrete-Time Markov Chains


Example
Suppose P(X0 = 1) = 1/3, P(X0 = 2) = 0, P(X0 = 3) = 1/2, P(X0 = 4) = 1/6 and

      [ 1/4  0   1/4  1/2 ]
  P = [ 1/4 1/4  1/4  1/4 ]
      [  0   0   2/3  1/3 ]
      [ 1/2  0   1/2   0  ]

- Find the distribution of X1,
- Calculate P(Xn+2 = 2 | Xn = 4), and
- Calculate P(X3 = 2, X2 = 3, X1 = 1).
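The three quantities can be worked out with exact arithmetic, using the initial distribution and the 4×4 matrix on this slide:

```python
from fractions import Fraction as F

P = [[F(1, 4), 0, F(1, 4), F(1, 2)],
     [F(1, 4), F(1, 4), F(1, 4), F(1, 4)],
     [0, 0, F(2, 3), F(1, 3)],
     [F(1, 2), 0, F(1, 2), 0]]
pi0 = [F(1, 3), 0, F(1, 2), F(1, 6)]

# distribution of X_1 is pi0 * P
pi1 = [sum(pi0[i] * P[i][j] for i in range(4)) for j in range(4)]

# P(X_{n+2} = 2 | X_n = 4) = (P^2)_{42}
p42_2 = sum(P[3][k] * P[k][1] for k in range(4))

# P(X_3 = 2, X_2 = 3, X_1 = 1) = P(X_1 = 1) p_{13} p_{32}
path = pi1[0] * P[0][2] * P[2][1]
print(pi1, p42_2, path)
```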

Slide 64

Discrete-Time Markov Chains

Fundamental questions that we quite often want to ask are
- What proportion of time does the chain spend in each state in the long run?
- Or does this even make sense?

The answer depends on the classification of states.

Slide 65

Discrete-Time Markov Chains


Here are some definitions.
- State k is accessible from state j, denoted by j → k, if there exists an n ≥ 1 such that pjk^(n) > 0. That is, there exists a path j = i0, i1, i2, . . . , in = k such that p_{i0 i1} p_{i1 i2} · · · p_{i_{n−1} i_n} > 0.
- If j → k and k → j, then states j and k communicate, denoted by j ↔ k.
- State j is called non-essential if there exists a state k such that j → k but k ↛ j.
- State j is called essential if j → k implies that k → j.
- A state j is an absorbing state if pjj = 1. An absorbing state is essential, but essential states do not have to be absorbing.

Slide 66

Discrete-Time Markov Chains

Example
Draw a transition diagram, and then classify the states of a DTMC with transition matrix

      [  0  0.5 0.5  0  ]
  P = [ 0.5  0   0  0.5 ]
      [  0   0  0.5 0.5 ]
      [  0   0  0.5 0.5 ]

Slide 67

Discrete-Time Markov Chains

A state j such that j ↛ j is called ephemeral. Ephemeral states usually don't add anything to a DTMC model, and we are going to assume that there are no such states.
With this assumption, the communication relation ↔ has the properties
- j ↔ j (reflexivity),
- j ↔ k if and only if k ↔ j (symmetry), and
- if j ↔ k and k ↔ i, then j ↔ i (transitivity).

A relation that satisfies these properties is known as an equivalence relation.

Slide 68

Discrete-Time Markov Chains

Consider a set S whose elements can be related to each other via an equivalence relation ∼. Then S can be partitioned into a collection of disjoint subsets S1, S2, S3, . . . , SM (where M might be infinite) such that j, k ∈ Sm implies that j ∼ k.
So the state space of a DTMC is partitioned into communicating classes by the communication relation ↔.

Slide 69

Discrete-Time Markov Chains


An essential state cannot be in the same communicating class as a non-essential state. This means that we can divide the sets in the partition S1, S2, S3, . . . , SM into a collection S1^n, S2^n, S3^n, . . . , S_{Mn}^n of non-essential communicating classes and a collection S1^e, S2^e, S3^e, . . . , S_{Me}^e of essential communicating classes.
If the DTMC starts in a state from a non-essential communicating class Sm^n then, once it leaves, it can never return. On the other hand, if the DTMC starts in a state from an essential communicating class Sm^e, then it can never leave it.

Definition:

If a DTMC has only one communicating class, that is, all states communicate with each other, then it is called an irreducible DTMC.

Slide 70

Discrete-Time Markov Chains

Example
Classify the states of the DTMC with

      [ 0.5  0.5   0    0   ]
  P = [ 0.5  0.5   0    0   ]
      [ 0.25 0.15 0.45 0.15 ]
      [  0    0    0    1   ]

Slide 71

Discrete-Time Markov Chains

Exercise
Classify the states of the DTMC with

      [ 0 0 + 0 0 0 + ]
      [ 0 + 0 + 0 0 + ]
      [ + 0 0 0 0 0 0 ]
  P = [ 0 0 0 + 0 0 0 ]
      [ 0 + 0 0 0 0 0 ]
      [ 0 + 0 0 + + 0 ]
      [ 0 0 + 0 0 0 + ]

(where + marks a positive entry)

Slide 72

Discrete-Time Markov Chains


Now let's revisit the random walk example where Xn+1 = Xn + Yn+1, with {Yn} independent and identically-distributed, X0 = 0, P(Yn = 1) = p and P(Yn = −1) = 1 − p = q. This DTMC is irreducible and so all states are essential. However,
- if p > q, then E(Xn − X0) = n(p − q) > 0, so Xn will drift to infinity, at least in expectation.
- for each fixed state j, with probability one, the DTMC will visit j only finitely many times.

A state's long run "essentialness" is not captured by being essential in this case: we need a further classification of the states.

Slide 73

Recurrence and Transience of States


Let X0 = j and let Ti(j) be the time between the (i − 1)st and ith return to state j. Then T1(j), T2(j), . . . are independent and identically distributed random variables.

Slide 74

Recurrence and Transience of States

Our further classification relies on calculating the probability that the DTMC returns to a state once it has left.
Let

  fj = P(Xn = j for some n > 0 | X0 = j) = P(T1(j) < ∞ | X0 = j).

The state j is said to be recurrent if fj = 1 and transient if fj < 1.

Slide 75

Characterizing Recurrence
If the DTMC starts in a recurrent state j then, with probability one, it will eventually re-enter j. At this point, the process will start anew (by the Markov property) and it will re-enter again with probability one. So the DTMC will (with probability one) visit j infinitely-many times.
If the DTMC starts in a transient state j, then there is a probability 1 − fj > 0 that it will never return. So, letting Nj be the number of visits to state j after starting there, we see that Nj has a geometric distribution.
Specifically, for n ≥ 0,

  P(Nj = n | X0 = j) = P(T1(j) < ∞, . . . , Tn(j) < ∞, Tn+1(j) = ∞).

This is equal to (1 − fj) fj^n, which implies that E(Nj | X0 = j) = fj/(1 − fj).

Slide 76

Characterizing Recurrence

If

  qj = Σ_{n=1}^∞ E[I(Xn = j) | X0 = j] = Σ_{n=1}^∞ pjj^(n),

then qj = E[Σ_{n=1}^∞ I(Xn = j) | X0 = j] = E(Nj | X0 = j) = fj/(1 − fj), so

  fj = qj/(1 + qj).

It follows that state j is recurrent if and only if qj = ∞.
[Read: j is recurrent if and only if the expected number of returns to state j is infinite.]

Slide 77

Communication classes are either recurrent or transient


Now assume that state j is recurrent and j ↔ k. There must exist s and t such that pjk^(s) > 0 and pkj^(t) > 0. Then

  pjj^(s+n+t) = P(Xs+n+t = j | X0 = j)
             ≥ P(Xs+n+t = j, Xs+n = k, Xs = k | X0 = j)
             = pjk^(s) pkk^(n) pkj^(t).

Similarly pkk^(s+n+t) ≥ pkj^(t) pjj^(n) pjk^(s), and so, for n > s + t,

  pkk^(n) ≥ α pjj^(n−s−t),

where α = pjk^(s) pkj^(t) > 0. So the series Σ_{n=1}^∞ pkk^(n) must diverge because Σ_{n=1}^∞ pjj^(n) diverges, and we conclude that state k is also recurrent.
Slide 78

If the Markov chain is irreducible, then all states are either recurrent or transient, and so it's appropriate to refer to the chain itself as either recurrent or transient.

Slide 79

The Random Walk

Let Xn+1 = Xn + Yn+1, where {Yn : n ≥ 1} are independent and identically-distributed random variables with P(Yn = 1) = p and P(Yn = −1) = 1 − p = q.
We can compute the m-step transition probabilities from state j to itself by observing that these probabilities are zero if m is odd and equal to

  (2n choose n) p^n q^n

if m = 2n.

Slide 80

The Random Walk

Stirling's formula n! ∼ √(2πn) n^n e^{−n} gives us the fact that

  pjj^(2n) ∼ (4pq)^n / √(πn),

and the series Σ_{n=1}^∞ pjj^(2n)
- diverges if p = q = 1/2, so the DTMC is recurrent,
- converges if p ≠ q (compare to a geometric series), so the DTMC is transient.
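The dichotomy is visible numerically: partial sums of (2n choose n) p^n q^n keep growing when p = 1/2 but settle quickly when p ≠ q (in fact, from the generating function Σ_{n≥0} (2n choose n) x^n = 1/√(1 − 4x), the full sum is 1/√(1 − 4pq) − 1 when 4pq < 1). A sketch:

```python
from math import comb

# Partial sums of p_jj^(2n) = C(2n, n) p^n q^n for the simple random walk.
def partial_sum(p, N):
    q = 1 - p
    return sum(comb(2 * n, n) * (p * q) ** n for n in range(1, N + 1))

print(partial_sum(0.5, 500))  # keeps growing as N increases (series diverges)
print(partial_sum(0.6, 500))  # has converged, to 1/sqrt(1 - 4pq) - 1 = 4
```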

Slide 81

Periodicity

The Pólya random walk illustrates another phenomenon that can occur in DTMCs - periodicity.

Definition: State j is periodic with period d > 1 if {n : pjj^(n) > 0} is non-empty and has greatest common divisor d.

If state j has period 1, then we say that it is aperiodic.
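The period of a state can be computed mechanically: track which n-step transition probabilities are positive via boolean matrix powers and take the gcd of the return times. A sketch (truncated at a cutoff max_n, so it is a heuristic rather than a proof for chains with very long cycles); the two matrices are the examples from the next slide:

```python
from math import gcd
from functools import reduce

# Period of state j: gcd of {n <= max_n : p_jj^(n) > 0}.
def period(P, j, max_n=50):
    m = len(P)
    reach = [[p > 0 for p in row] for row in P]  # support of P^1
    cur = reach
    returns = []
    for n in range(1, max_n + 1):
        if cur[j][j]:                 # cur holds the support of P^n
            returns.append(n)
        cur = [[any(cur[i][k] and reach[k][l] for k in range(m))
                for l in range(m)] for i in range(m)]
    return reduce(gcd, returns) if returns else 0

P3 = [[0, 0.5, 0.5], [1, 0, 0], [1, 0, 0]]
P4 = [[0, 0, 0.5, 0.5], [1, 0, 0, 0], [0, 1, 0, 0], [0, 1, 0, 0]]
print(period(P3, 0), period(P4, 0))  # 2 and 3
```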

Slide 82

Discrete-Time Markov Chains


Examples
- The Pólya random walk has period d = 2 for all states j.
- What is the period of the DTMC with

      [ 0 0.5 0.5 ]
  P = [ 1  0   0  ] ?
      [ 1  0   0  ]

- Find the period for the DTMC with

      [ 0 0 0.5 0.5 ]
  P = [ 1 0  0   0  ]
      [ 0 1  0   0  ]
      [ 0 1  0   0  ]

Slide 83

States in a communicating class have same period


Assume that state j has period dj and j ↔ k. Then, as before, there must exist s and t such that pjk^(s) > 0 and pkj^(t) > 0. We know straightaway that dj divides s + t, since it is possible to go from j to itself in s + t steps.
Now take a path from k to itself in r steps. If we concatenate our path from j to k in s steps, this r-step path, and our path from k to j in t steps, we have an (s + r + t)-step path from j to itself. So dj divides s + r + t, which means that dj divides r. So dj divides the period dk of k.
Now we can switch j and k in the argument to conclude that dk divides dj, which means that dj = dk, and all states in the same communicating class have a common period.

Slide 84

Discrete-Time Markov Chains


The arguments on the preceding slides bring us to the following theorem, which discusses some solidarity properties of states in the same communicating class.

Theorem: In any communicating class Sr of a DTMC with state space S, the states are
- either all recurrent or all transient, and
- either all aperiodic or all periodic with a common period d > 1.

If states from Sr are periodic with period d > 1, then Sr = Sr^(1) ∪ Sr^(2) ∪ · · · ∪ Sr^(d), where at each transition the DTMC passes from the subclass Sr^(i) to Sr^(i+1) (indices taken mod d) with probability one.
Slide 85

Discrete-Time Markov Chains


Examples:
d = 4:

Slide 86

Discrete-Time Markov Chains

How do we analyse a DTMC?
- Draw a transition diagram.
- Consider the accessibility of states, then divide the state space into essential and non-essential states.
- Define the communicating classes, and divide them into recurrent and transient communicating classes.
- Decide whether the classes are periodic.

Slide 87

Discrete-Time Markov Chains

Exercises

I Analyse the DTMC with

            ( 0  0.5  0.5 )
        P = ( 1   0    0  ) .
            ( 1   0    0  )

I Consider a DTMC with

            ( 0  0  0.5  0.5 )
        P = ( 1  0   0    0  )
            ( 0  1   0    0  ) .
            ( 0  1   0    0  )

Slide 88

Discrete-Time Markov Chains

Example
Analyse the Markov chain with states numbered 1 to 5 and with
one-step transition probability matrix

        (  1    0    0   0  0 )
        ( 1/2   0   1/2  0  0 )
    P = (  0   1/2  1/2  0  0 )
        (  0    0    0   0  1 )
        (  0    0    0   1  0 )
Slide 89

Finite State DTMCs have at least one recurrent state

Recall that a state j is transient if and only if

    Σ_{n=1}^∞ p_jj^(n) = Σ_{n=1}^∞ E[I(Xn = j)|X0 = j] < ∞.

This means that the DTMC visits j only finitely-many times (with
probability one), given that it starts there.
Let S be the set of states, and fj,k be the probability that the
DTMC ever visits state k, given that it starts in state j.

Slide 90

Finite State DTMCs have at least one recurrent state


If all states k ∈ S are transient, then it must be the case that

    Σ_{n=1}^∞ p_jj^(n) + Σ_{k≠j} fj,k Σ_{n=1}^∞ p_kk^(n) < ∞,

because the expected number of visits to k starting from j is at most
fj,k (1 + Σ_{n=1}^∞ p_kk^(n)), and Σ_{k≠j} fj,k ≤ |S| < ∞ since S is finite.
But then the expected total number of visits,

    Σ_{k∈S} Σ_{n=1}^∞ E[I(Xn = k)|X0 = j] = Σ_{n=1}^∞ Σ_{k∈S} E[I(Xn = k)|X0 = j] = Σ_{n=1}^∞ 1 = ∞

(the chain occupies exactly one state at each time n), would also be
finite, which is a contradiction. So at least one state must be
recurrent.
It follows that if a finite-state DTMC is irreducible, then all states
are recurrent.
Slide 91

Recurrence in Infinite State DTMC

When a communicating class has infinitely many states, the above
line of argument does not work:

    Σ_{k≠j} fj,k may be infinite.

And it shouldn't: the random walk with p > 1/2 has all states
transient.

Slide 92

Recurrence in Infinite State DTMC

In order to be able to tell whether a class is recurrent, we need to
be able to calculate the probability of return for at least one state.
Let's label this state 0 and denote by fj,0 the probability that the
DTMC ever reaches state 0, given that it starts in state j. Then
we see that the sequence {fj,0} satisfies the equation

    fj,0 = pj0 + Σ_{k≠0} pjk fk,0.    (*)

Slide 93
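Equation (*) can also be solved numerically by successive substitution: start from f ≡ 0 and repeatedly apply the right hand side, which converges upward to the minimal nonnegative solution (minimality is established later in the section). A sketch for the random-walk example treated on the following slides, with assumed parameters p = 0.7 and truncation level K = 400 (both illustrative choices, not from the slides):

```python
import numpy as np

# Random walk on {0, 1, 2, ...}: p_{j,j+1} = p, p_{j,j-1} = 1 - p for j >= 1.
# Iterate (*) for the hitting probabilities f_{j,0}, starting from f = 0.
# The state space is truncated at K, with f fixed at 0 beyond the truncation.
p, K = 0.7, 400
f = np.zeros(K + 1)                    # f[j] approximates f_{j,0}
for _ in range(5000):
    new = f.copy()
    new[1] = (1 - p) + p * f[2]        # j = 1: a down-step lands in state 0
    new[2:K] = (1 - p) * f[1:K-1] + p * f[3:K+1]   # interior states
    f = new
print(f[1])                            # ~ 3/7 = (1 - p)/p, the minimal solution
```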

Solving the equation (*)


We illustrate how to solve this equation by an example.

Example: Consider a random walk on the nonnegative integers:

    pj,j+1 = p = 1 − pj,j−1, for j > 0,
and
    p0,1 = p = 1 − p0,0.

(And pij = 0 otherwise.) Equation (*) says that for j > 1,
    fj,0 = p fj+1,0 + (1 − p) fj−1,0
and, for j = 0, 1,
    fj,0 = p fj+1,0 + (1 − p).

Slide 94

Solving the equation (*)

The first equation is a second-order linear difference equation with
constant coefficients.
These can be solved in a similar way to second-order linear
differential equations with constant coefficients, which you learned
about in Calculus II or Accelerated Mathematics II.
Recall that, to solve

    a d²y/dt² + b dy/dt + c y = 0,

we try a solution of the form y = y(t) = e^(λt) to obtain the

Characteristic Equation
    a λ² + b λ + c = 0.

Slide 95

Solving the equation (*)

If the characteristic equation has distinct roots, λ1 and λ2, the
general solution has the form
    y = A e^(λ1 t) + B e^(λ2 t).
If the roots are coincident, the general solution has the form
    y = A e^(λ1 t) + B t e^(λ1 t).
In both cases, the values of the constants A and B are determined
by the initial conditions.

Slide 96

Solving the equation (*)

The method for solving second-order linear difference equations
with constant coefficients is similar. To solve
    a uj+1 + b uj + c uj−1 = 0,
we try a solution of the form uj = m^j to obtain the Characteristic
Equation
    a m² + b m + c = 0.

Slide 97
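The characteristic-root recipe can be sanity-checked numerically: with m1, m2 the roots of am² + bm + c = 0, any combination A m1^j + B m2^j should satisfy the recurrence exactly. A small sketch (the coefficients a = 2, b = −3, c = 1 and constants A, B are arbitrary illustrative choices, not from the slides):

```python
import numpy as np

a, b, c = 2.0, -3.0, 1.0
m1, m2 = np.roots([a, b, c])              # roots of 2m^2 - 3m + 1 = 0: 1 and 1/2
A, B = 5.0, -2.0
j = np.arange(10)
u = A * m1**j + B * m2**j                 # candidate general solution u_j
residual = a * u[2:] + b * u[1:-1] + c * u[:-2]   # a*u_{j+1} + b*u_j + c*u_{j-1}
print(np.max(np.abs(residual)))           # ~ 0
```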

Solving the equation (*)

If this equation has distinct roots, m1 and m2, the general solution
has the form
    uj = A m1^j + B m2^j.
If the roots are coincident, the general solution has the form
    uj = A m1^j + B j m1^j.
The values of the constants A and B need to be determined by
boundary equations, or other information that we have.

Slide 98

Solving the equation (*)

Back to the Example: The characteristic equation of
    fj,0 = p fj+1,0 + (1 − p) fj−1,0
is
    p u² − u + (1 − p) = 0,
which has roots u = 1 and u = (1 − p)/p.
If (1 − p)/p ≠ 1, the general solution for j ≥ 1 is of the form

    fj,0 = A + B ((1 − p)/p)^j.

Slide 99

Solving the equation (*)

If (1 − p)/p > 1, then the general solution is

    fj,0 = A + B ((1 − p)/p)^j.

Similarly, if (1 − p)/p = 1, the general solution is of the form

    fj,0 = A + Bj.

In either case, these can only be probabilities if B = 0, and then
notice
    A = f1,0 = p f2,0 + (1 − p) = pA + (1 − p),
so A = 1. This makes sense because p ≤ 1/2, and so we have a
neutral or downward drift.

Slide 100

Solving the equation (*)

However, if (1 − p)/p < 1, we need to work harder to obtain the
solution to our problem. Let fj,0(m) be the probability that the
DTMC moves from state j to state 0 in less than or equal to m
steps. Because

    ∪_{m=1}^∞ {j ↦ 0 in ≤ m steps} = {j ↦ 0 ever},

we find fj,0(m) ↑ fj,0.

Slide 101

Solving the equation (*)

Now let {gj,0} be any nonnegative solution to

    gj,0 = pj0 + Σ_{k≠0} pjk gk,0.

We show by induction that fj,0(m) ≤ gj,0 for all m. Clearly this is
true for m = 1. Assume that it is true for m = ℓ. Then

    fj,0(ℓ + 1) = pj0 + Σ_{k≠0} pjk fk,0(ℓ)
                ≤ pj0 + Σ_{k≠0} pjk gk,0
                = gj,0.

It follows that fj,0 = lim_{m→∞} fj,0(m) ≤ gj,0, and so {fj,0} is the
minimal nonnegative solution to (*).
Slide 102

Solving the equation (*)

For the random walk with (1 − p)/p < 1, the general solution for
j ≥ 1 was of the form

    fj,0 = A + B ((1 − p)/p)^j.

The minimal nonnegative solution for j > 0 is

    fj,0 = ((1 − p)/p)^j,

and f0,0 = 2(1 − p).

Slide 103

The Gambler's Ruin Problem

I Denote the initial capital of a gambler by $N.
I The gambler will stop playing if he/she wins $M or loses
  his/her initial stake of $N.
I There is a probability p that the gambler wins $1 and a
  probability 1 − p that he/she loses $1 on each game.
I We assume that the outcomes of successive plays are
  independent.
I This is a simple DTMC (tracking the gambler's net winnings)
  with a finite state space {−N, . . . , M} and transition
  probabilities pj,j+1 = p and pj,j−1 = 1 − p for
  j ∈ {−N + 1, . . . , M − 1}, and p−N,−N = pM,M = 1.

The gambler would like to know the probability that he/she will
win $M before becoming bankrupt.

Slide 104

The Gambler's Ruin Problem

We use (*) to calculate the probability that the gambler's ruin
DTMC hits −N. For −N + 1 ≤ j ≤ M − 1, we have
    fj,−N = p fj+1,−N + (1 − p) fj−1,−N
with f−N,−N = 1 and fM,−N = 0.
When p ≠ 1/2, the general solution to the first equation is again

    fj,−N = A + B ((1 − p)/p)^j.

(For the same range of j.)

Slide 105

The Gambler's Ruin Problem

The upper boundary condition gives us

    A = −B ((1 − p)/p)^M,

and the lower boundary condition gives us

    B = ( ((1 − p)/p)^(−N) − ((1 − p)/p)^M )^(−1).

So the general solution is

    fj,−N = ( ((1 − p)/p)^j − ((1 − p)/p)^M ) / ( ((1 − p)/p)^(−N) − ((1 − p)/p)^M ).

Slide 106

The Gambler's Ruin Problem

When p = 1/2, the general solution to the first equation is
    fj,−N = A + Bj.
The upper boundary condition gives us
    A = −BM,
and the lower boundary condition gives us
    B = −1/(M + N).
So the general solution is

    fj,−N = (M − j)/(M + N).
Slide 107

The Gambler's Ruin Problem

The expected gain is E(G) = M − (N + M) f0,−N.

Here are some numbers:

I If N = 90, M = 10 and p = 0.5 then f0,−N = 0.1.
I If N = 90, M = 10 and p = 0.45 then f0,−N = 0.866.
I If N = 90, M = 10 and p = 0.4 then f0,−N = 0.983.
I If N = 99, M = 1 and p = 0.4 then f0,−N = 0.333.
Slide 108
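The table can be reproduced directly from the formulas on the preceding slides; a short Python sketch (not part of the slides):

```python
def ruin_prob(N, M, p):
    """P(net winnings hit -N before M), starting from 0, win prob p per game."""
    if p == 0.5:
        return M / (M + N)
    r = (1 - p) / p
    return (1 - r**M) / (r**(-N) - r**M)

for N, M, p in [(90, 10, 0.5), (90, 10, 0.45), (90, 10, 0.4), (99, 1, 0.4)]:
    print(N, M, p, round(ruin_prob(N, M, p), 3))   # 0.1, 0.866, 0.983, 0.333
```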

Long run behaviour of DTMCs

We want to know the proportion of time a DTMC spends in each
state over the long run (if this concept makes sense), which should
be the same as the limiting probabilities lim_{n→∞} p_kj^(n).

I These will be zero for transient states and non-essential states.
I For an irreducible and recurrent DTMC, we will see that these
  limiting probabilities exist and are even independent of k.

Slide 109

Long run behaviour of DTMCs

Recall that we used Ti(j) to denote the time between the ith and
(i − 1)st return to state j. We then defined state j (and hence its
communicating class) to be
I transient if Ti(j) = ∞ with positive probability, and
I recurrent if Ti(j) < ∞ with probability one.

There is a further classification of recurrent states. Specifically, j is
I null-recurrent if E[Ti(j)] = ∞, and
I positive-recurrent if E[Ti(j)] < ∞.

This classification is important for the calculation of the limiting
probabilities.

Slide 110

Long run behaviour of DTMCs

Examples
I The symmetric random walk with p = q = 1/2. For all j,
  Ti(j) < ∞ with probability one, but E[Ti(j)] = ∞. That is,
  all states are null-recurrent.
I A finite irreducible DTMC: E[Ti(j)] < ∞ for all j.

Slide 111

Long run behaviour of DTMCs


In the long run, how often does a DTMC visit a state j?
Let μj ≡ E[T1(j)|X0 = j] < ∞. By the Law of Large Numbers,
T1(j) + T2(j) + · · · + Tk(j) ≈ μj k. So there are approximately k
visits in k μj time-steps, and the relative frequency of visits to j is
1/μj. This leads us to

Theorem: If j is an aperiodic state in a positive recurrent
communicating class: μj = E[T1(j)|X0 = j] < ∞, then

    lim_{n→∞} p_jj^(n) = 1/μj.

In the null recurrent or transient case where μj = ∞:

    lim_{n→∞} p_jj^(n) = 0.

Slide 112
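The convergence p_jj^(n) → 1/μj is easy to watch numerically for a small aperiodic chain; a sketch with an illustrative two-state matrix (not from the slides) whose stationary distribution is (1/3, 2/3), so μ0 = 3:

```python
import numpy as np

P = np.array([[0.50, 0.50],
              [0.25, 0.75]])
Pn = np.linalg.matrix_power(P, 50)   # n = 50 is ample: the second eigenvalue is 0.25
print(Pn[0, 0])                      # ~ 1/3 = 1/mu_0
```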

Long run behaviour of DTMCs

Why do we need aperiodicity?

        ( 0  1 )
    P = ( 1  0 ) .

Slide 113

Long run behaviour of DTMCs

Further:

Theorem: If i ↔ j are aperiodic and the communicating class
is positive recurrent: μj = E[T1(j)|X0 = j] < ∞, then

    lim_{n→∞} p_ij^(n) = 1/μj.

Note that the right hand side doesn't have an i in it!

Slide 114

Ergodicity and Stationarity


Definition: We call the DTMC ergodic if for all j the limit
πj = lim_{n→∞} p_ij^(n) exists and does not depend on i.
For an ergodic DTMC, with limiting distribution π = (π1, π2, . . .),
I Σ_j πj = 1.
I For any initial probability distribution π0,
  π0 P^n → π as n → ∞.
I πP = π and hence πP^n = π.

Any distribution satisfying the last item is called a stationary
distribution for the DTMC.

Slide 115

Ergodicity and Stationarity

Theorem: A DTMC {Xn} is ergodic if and only if it is
irreducible, aperiodic and positive recurrent.

In this case there is a unique solution to the system of linear
equations
    πP = π
with Σ_j πj = 1 (that is, a unique stationary distribution), and
moreover πj = 1/μj.
We often test whether an irreducible, aperiodic DTMC is ergodic
by attempting to solve the equations in the third dot point.

Slide 116

Examples
An m × m stochastic matrix P is called doubly-stochastic if all the
column sums are equal to one.
If an aperiodic DTMC has a doubly-stochastic transition matrix,
then we can easily verify that
    (1/m, 1/m, . . . , 1/m) P = (1/m, 1/m, . . . , 1/m).
It follows that
    π = (1/m, 1/m, . . . , 1/m),
and the stationary distribution is uniform on S.
Find a stationary distribution for

        ( 0  1 )
    P = ( 1  0 ) .

Slide 117
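For the 2 × 2 exercise above, a numerical check shows why a stationary distribution can exist without being a limiting distribution: the chain has period 2. A quick sketch (not part of the slides):

```python
import numpy as np

P = np.array([[0.0, 1.0],
              [1.0, 0.0]])
pi = np.array([0.5, 0.5])                    # uniform, since P is doubly-stochastic
print(pi @ P)                                # [0.5, 0.5]: pi is stationary
print(np.linalg.matrix_power(P, 10)[0, 0],   # 1.0 at even n,
      np.linalg.matrix_power(P, 11)[0, 0])   # 0.0 at odd n: no limit (period 2)
```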

Examples
Find a stationary distribution for

        ( 1/2  1/2  0 )
    P = ( 1/2  1/2  0 ) .
        (  0    0   1 )

Find the stationary distribution for

        ( 0.80  0.20   0     0   )
    P = (  0     0    0.50  0.50 ) .
        ( 0.75  0.25   0     0   )
        (  0     0    0.40  0.60 )

Slide 118
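Such exercises can be checked by solving πP = π together with Σ_j πj = 1 as an overdetermined linear system; a sketch for the second matrix (a standard numerical recipe, not part of the slides):

```python
import numpy as np

P = np.array([[0.80, 0.20, 0.00, 0.00],
              [0.00, 0.00, 0.50, 0.50],
              [0.75, 0.25, 0.00, 0.00],
              [0.00, 0.00, 0.40, 0.60]])
n = len(P)
# Stack the balance equations (P^T - I) pi^T = 0 with the normalisation sum(pi) = 1.
A = np.vstack([P.T - np.eye(n), np.ones(n)])
b = np.append(np.zeros(n), 1.0)
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(pi)              # ~ [0.536, 0.143, 0.143, 0.179] = (15/28, 4/28, 4/28, 5/28)
print(pi @ P - pi)     # ~ 0: pi is stationary
```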

Random walk with one barrier

Give a criterion for ergodicity of the DTMC with state space
{0, 1, 2, . . .} and transition matrix

        ( q  p  0  0  0  · · · )
    P = ( q  0  p  0  0  · · · ) .
        ( 0  q  0  p  0  · · · )
        ( 0  0  q  0  p  · · · )
        ( :  :  :  :  :   ·.  )

When the DTMC is ergodic, derive its stationary distribution.

Slide 119

Random walk with one barrier


We saw that this DTMC is irreducible, aperiodic and recurrent
when p ≤ q. Solve the linear equations
    (π0, π1, . . .) = (π0, π1, . . .) P
to get πk = (p/q)^k π0.
We also need Σ_{k≥0} πk = 1. The sum on the left hand side is finite
only if p < q, in which case π0 = 1 − (p/q) and so
    πk = [1 − (p/q)] (p/q)^k.
So there is a solution to π = πP with Σ_{k≥0} πk = 1, and hence
the DTMC is ergodic, only if p < q, in which case

    μk = 1 / [ (p/q)^k (1 − (p/q)) ].
Slide 120
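The derived distribution can be checked against the balance equations π0 = qπ0 + qπ1 and πk = pπk−1 + qπk+1; a sketch with the illustrative choice p = 0.3, q = 0.7 (so p < q and the chain is ergodic):

```python
p, q = 0.3, 0.7
r = p / q
pi = [(1 - r) * r**k for k in range(50)]          # pi_k = (1 - p/q)(p/q)^k
err0 = abs(pi[0] - (q * pi[0] + q * pi[1]))       # boundary balance at state 0
err = max(abs(pi[k] - (p * pi[k-1] + q * pi[k+1])) for k in range(1, 48))
print(err0, err)   # both ~ 0: pi solves pi = pi P
```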

The distribution π

For an irreducible, aperiodic and positive-recurrent DTMC, the
distribution defined by π has a number of interpretations.
It can be seen as
I limiting,
I stationary,
I ergodic.

Slide 121

The distribution π

The limiting interpretation

By definition
    πj = lim_{n→∞} p_ij^(n),
and so πj is the limiting probability that the DTMC is in state j.

This means that so long as the DTMC has been going for quite a
long time, the probability that it is in state j will be approximately
πj.

Slide 122

The distribution π

The stationary interpretation

We showed that
    πP = π,
and so π has a stationary interpretation. If the DTMC starts with
probability distribution π, it will persist with this distribution
forever.
Furthermore, the DTMC is strictly stationary in the sense that its
finite-dimensional distributions are invariant. For each k,
(Xm, Xm+1, . . . , Xm+k) has the same distribution as
(X0, X1, . . . , Xk), independently of m.

Slide 123

The distribution π

The ergodic interpretation

This means that for sample paths of the DTMC that constitute a
set of probability one, the proportion of time that the process
spends in state j is πj.
This can be formally stated as a Law of Large Numbers. For any
initial distribution, the proportion of time Σ_{i=1}^n I(Xi = j)/n that
the DTMC spends in state j converges to πj with probability one
as n → ∞.

Slide 124
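The ergodic interpretation is easy to see in simulation; a sketch using an illustrative two-state chain (not from the slides) with stationary distribution (1/3, 2/3):

```python
import random

random.seed(0)
P = [[0.50, 0.50],
     [0.25, 0.75]]          # stationary distribution (1/3, 2/3)
x, visits0, n = 0, 0, 200_000
for _ in range(n):
    x = 0 if random.random() < P[x][0] else 1
    visits0 += (x == 0)
print(visits0 / n)           # ~ 1/3, the long-run fraction of time in state 0
```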

Reducible DTMC

I From non-essential states, the DTMC will eventually leave
  forever.
I As soon as it enters a class Sr of essential states, it will stay
  in Sr forever.
I If the stochastic sub-matrix Pr, derived from P by restricting
  it to the states of Sr, is aperiodic and positive-recurrent, then
  this subchain is ergodic and

    P(Xn = j|X0 ∈ Sr) → πj^(r)  if j ∈ Sr,
                       → 0       if j ∉ Sr.

Slide 125

Periodic DTMC

I If Pr is a periodic subchain with a period d > 1, which is
  recurrent with a finite expected recurrence time, then for
  0 ≤ k ≤ d − 1, {Xnd+k | X0 ∈ Sr^(k)} is an ergodic DTMC with
  state space Sr^(k).
I For any ℓ and k = 0, 1, . . . , d − 1,

    P(Xnd+k = j|X0 ∈ Sr^(ℓ)) → d πj^(r)  as n → ∞, for j ∈ Sr^(ℓ+k (mod d)),
                             → 0         for j ∉ Sr^(ℓ+k (mod d)),

  so Σ_{j ∈ Sr^(ℓ)} d πj^(r) = 1 for any ℓ.

Slide 126

Discrete-Time Markov Chains


Example
Classify the DTMC with one-step transition probability matrix

    P = [matrix not recoverable from the extracted slide]

and discuss its properties.

Slide 127

Good Trick
Sometimes we want to model a physical system where the future
depends on part of the past. Consider the following example. A
sequence of random variables {Xn} describes the weather at a
particular location, with Xn = 1 if it is sunny and Xn = 2 if it is
rainy on day n.
Suppose that the weather on day n + 1 depends on the weather
conditions on days n − 1 and n as is shown below:
    P(Xn+1 = 2|Xn = Xn−1 = 2) = 0.6
    P(Xn+1 = 1|Xn = Xn−1 = 1) = 0.8
    P(Xn+1 = 2|Xn = 2, Xn−1 = 1) = 0.5
    P(Xn+1 = 1|Xn = 1, Xn−1 = 2) = 0.75

Slide 128

Good Trick

If we put Yn = (Xn−1, Xn), then Yn is a DTMC. The possible
states are 1′ = (1, 1), 2′ = (1, 2), 3′ = (2, 1) and 4′ = (2, 2).
We see that {Yn : n ≥ 1} is a DTMC with transition matrix

        ( 0.80  0.20   0     0   )
    P = (  0     0    0.50  0.50 ) .
        ( 0.75  0.25   0     0   )
        (  0     0    0.40  0.60 )

Slide 129
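With the lifted chain in hand, the long-run weather follows from its stationary distribution; a sketch (not part of the slides) that raises P to a high power, so that every row converges to π:

```python
import numpy as np

P = np.array([[0.80, 0.20, 0.00, 0.00],
              [0.00, 0.00, 0.50, 0.50],
              [0.75, 0.25, 0.00, 0.00],
              [0.00, 0.00, 0.40, 0.60]])
pi = np.linalg.matrix_power(P, 500)[0]   # any row of P^n converges to pi
print(pi)                                # ~ [0.536, 0.143, 0.143, 0.179]
# Day n is sunny in states 1' = (1,1) and 3' = (2,1):
print(pi[0] + pi[2])                     # long-run proportion of sunny days ~ 0.679
```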
