
WMS

ST220
Introduction to
Mathematical Statistics

Revision Guide

Written by Joseph Jordan


Contents

I Probability

1 Review of Random Variables

2 Standard Families of Probability Distributions

3 The Normal Distribution

4 Sequences of Random Variables

II Statistics

Authors
Written by Joseph Jordan.
Any corrections or improvements should be entered into our feedback form at http://tinyurl.com/WMSGuides
(alternatively email revision.guides@warwickmaths.org).

History
Current Edition: 2012/2013.

Part I
Probability
1 Review of Random Variables
1.1 Definitions
Definition 1.1. The sample space is the set of all possible outcomes of an experiment, or observation.
It is most commonly denoted Ω.

Definition 1.2 (Random Variable). A random variable is a function X : Ω → R.

For the remainder of this section X will be a random variable.

Definition 1.3. The cumulative distribution function (cdf) of X, usually denoted F_X, is the function
such that F_X(x) = P(X ≤ x) ∀x ∈ R.

Definition 1.4. X is called continuous if F_X is a continuous function of x. If instead F_X is a step-function, then X is called discrete.

Definition 1.5. Two random variables X and Y are said to be identically distributed if F_X(x) = F_Y(x) ∀x ∈ R.

Warning Do not confuse this with X and Y being the same function.

Definition 1.6. The probability mass function (pmf) of a discrete random variable X is f_X(x) = P(X = x) ∀x ∈ R.

Immediate Result A consequence of this is that F_X(x) = Σ_{t≤x} f_X(t).

Note We cannot define a function with a similar property for a continuous random variable because
P(X = x) = 0 ∀x ∈ R. Instead we define the probability density function (pdf) as the function f_X
such that F_X(x) = ∫_{−∞}^{x} f_X(t) dt ∀x ∈ R.

Immediate Result f_X(x) = (d/dx) F_X(x), by the Fundamental Theorem of Calculus.

Definition 1.7. A function f (x) is a pmf (or pdf) iff:

(i) f (x) ≥ 0 ∀x ∈ R

(ii) lim_{x→∞} F_X(x) = 1

In later sections, it will be typographically easier to use the symbol ∼ to denote that a random variable
X has a certain distribution. Hence, if X and Y are identically distributed, we may now write X ∼ Y .

1.2 Independence
Definition 1.8. A set C of measurable sets is said to be independent iff
for every finite subset {A_1, ..., A_n} of C,

P(∩_{i=1}^{n} A_i) = ∏_{i=1}^{n} P(A_i).

Definition 1.9. A set {X_1, ..., X_n} of random variables is said to be independent iff

f_{X_1,...,X_n}(x_1, ..., x_n) = ∏_{i=1}^{n} f_{X_i}(x_i).

1.3 Transformations of Random Variables


Functions of Random Variables Suppose we define a new random variable Y as a function
of our original random variable X, i.e. Y = g(X). Then F_Y(y) = P(Y ≤ y) = P(g(X) ≤ y). If X is
discrete then we can calculate the resulting probability mass function directly. If X is continuous then
f_Y(y) = (d/dy) F_Y(y).

Theorem 1.10. If g is both invertible and of class C¹ then f_Y(y) = f_X(g^{−1}(y)) |(d/dy) g^{−1}(y)|.

Theorem 1.11 (Probability Integral Transform). Suppose X is continuous, and its cdf F_X is an
increasing function of x. Let Y = F_X(X). Then Y ∼ Uniform(0, 1).

Proof.

F_Y(y) = P(Y ≤ y) = P(F_X(X) ≤ y) = P(X ≤ F_X^{−1}(y)) = F_X(F_X^{−1}(y)) = y  for 0 < y < 1,

which you should (or at least will, after Section 2) recognise as the cdf of the Uniform(0, 1) distribution.
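The transform is easy to check by simulation. The sketch below (an illustration, not part of the original notes; it assumes X ∼ Exponential(1), whose cdf is F(x) = 1 − e^{−x}) applies F to its own samples and checks that Y = F(X) has the Uniform(0, 1) mean 1/2 and variance 1/12:

```python
import math
import random

random.seed(0)

def F(x):
    # cdf of Exponential(1).
    return 1.0 - math.exp(-x)

# Apply the cdf to its own samples: Y = F(X) should be Uniform(0, 1).
ys = [F(random.expovariate(1.0)) for _ in range(100_000)]

mean = sum(ys) / len(ys)
var = sum((y - mean) ** 2 for y in ys) / len(ys)
print(round(mean, 3), round(var, 3))   # Uniform(0, 1) has mean 0.5 and variance 1/12
```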

Theorem 1.12 (Transformations in the Multi-Variate Case). Let Y and X be vectors of random
variables such that Y = τ(X), where τ is a continuously differentiable bijective map. Let φ = τ^{−1},
so that φ(y) = (φ_1(y), ..., φ_n(y)) = (x_1, ..., x_n). If X is absolutely continuous with density f_X(x; θ),
then Y has density

f_Y(y; θ) = |det J_φ(y)| f_X(φ(y); θ),

where J_φ(y) is the Jacobian matrix with (i, j) entry ∂φ_i(y)/∂y_j, i.e. with rows ∇φ_1(y), ..., ∇φ_n(y).

1.4 Expectation
Definition 1.13 (Expectation). The expected value of any function g(X) is

E[g(X)] = Σ_x g(x) f_X(x)

when X is discrete, and

E[g(X)] = ∫_{−∞}^{∞} g(x) f_X(x) dx

when X is continuous. We can interpret this value as the average of g(X) in repeated sampling.

Definition 1.14 (Variance). The variance, usually denoted σ², is given by

Var(X) = E[(X − E[X])²]

The standard deviation of X is √Var(X).

Definition 1.15 (Moments). The nth moment of X is µ_n = E[X^n].

Definition 1.16 (Central Moments). The nth central moment of X is µ^c_n = E[(X − µ_1)^n].

Immediate Results The expectation is the first moment and the variance is the second central moment.

Lemma 1.17. (i) E[a + bg(X)] = a + b E[g(X)] ∀a, b ∈ R, and any function g;

(ii) If g(X) ≥ 0, then E[g(X)] ≥ 0;

(iii) Var(a + bX) = b² Var(X) ∀a, b ∈ R;

(iv) Var(X) = E[X²] − (E[X])².

Definition 1.18. The moment generating function of X is defined as

M_X(t) = E[e^{tX}]

(provided the expectation exists for all t in some neighbourhood of zero).

Immediate Result An extremely useful property of the moment generating function is revealed in its
name; it generates moments. This is because

E[X^n] = (d^n/dt^n) M_X(t) |_{t=0}

Proof. For the continuous case, we prove by induction on n (assuming we can differentiate under the
integral sign).

n = 1: (d/dt) M_X(t) = (d/dt) ∫_{−∞}^{∞} e^{tx} f_X(x) dx = ∫_{−∞}^{∞} x e^{tx} f_X(x) dx = E[X e^{tX}] = E[X] when t = 0.

Assume true for n. Then

(d^{n+1}/dt^{n+1}) M_X(t) = (d/dt)(d^n/dt^n) M_X(t) = (d/dt) ∫_{−∞}^{∞} x^n e^{tx} f_X(x) dx
= ∫_{−∞}^{∞} x^{n+1} e^{tx} f_X(x) dx = E[X^{n+1} e^{tX}] = E[X^{n+1}] when t = 0.

Properties of the Moment Generating Function

(i) M_X(t) = M_Y(t) ∀t in some neighbourhood of zero ⇒ F_X(x) = F_Y(x) ∀x;

(ii) for a, b ∈ R, let Y = a + bX. Then M_Y(t) = e^{at} M_X(bt);

(iii) for independent X and Y, let Z = X + Y. Then M_Z(t) = M_X(t) M_Y(t).

Proof. (ii) M_{a+bX}(t) = E[e^{(a+bX)t}] = e^{at} E[e^{btX}] = e^{at} M_X(bt).

(iii) E[e^{t(X+Y)}] = E[e^{tX} e^{tY}] = ∫∫ e^{tx} e^{ty} f_X(x) f_Y(y) dx dy
= ∫_{−∞}^{∞} e^{tx} f_X(x) dx ∫_{−∞}^{∞} e^{ty} f_Y(y) dy = M_X(t) M_Y(t).

Remark There is a one-to-one correspondence between MGFs and the distributions of random variables
(the Uniqueness Theorem).
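The moment-generating property can be sanity-checked numerically. The sketch below (an illustration, not from the notes; β = 2 is an arbitrary choice) differentiates the Exponential(β) MGF M_X(t) = 1/(1 − βt) of Section 2.2.2 by central differences at t = 0 and recovers E[X] = β and E[X²] = 2β²:

```python
beta = 2.0

def M(t):
    # Exponential(beta) MGF, valid for t < 1/beta.
    return 1.0 / (1.0 - beta * t)

h = 1e-4
# Central differences approximate d/dt and d^2/dt^2 at t = 0.
m1 = (M(h) - M(-h)) / (2 * h)           # ~ E[X]   = beta
m2 = (M(h) - 2 * M(0) + M(-h)) / h**2   # ~ E[X^2] = 2 * beta^2
print(round(m1, 3), round(m2, 2))       # -> 2.0 8.0
```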

Definition 1.19. In higher dimensions the moment generating function of X is defined as

M_X(t) = E[e^{t^T X}]

(provided the expectation exists on a neighbourhood of zero).

2 Standard Families of Probability Distributions


In the previous section we recalled the pmf/pdf, expectation, variance and mgf. You should be able to
derive these from first principles for the following distributions:

2.1 Discrete Distributions


2.1.1 Discrete Uniform

This simple distribution has probability mass function

f_X(x) = 1/k,  x ∈ {1, ..., k}

It has mean (k + 1)/2 and variance (k² − 1)/12. Its moment generating function is

M_X(t) = (e^t − e^{(k+1)t}) / (k(1 − e^t))
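These closed forms are easy to verify directly from the pmf. The sketch below (an illustration, not from the notes; k = 6 is an arbitrary choice) uses exact rational arithmetic:

```python
from fractions import Fraction

k = 6  # e.g. a fair die
xs = range(1, k + 1)

# Mean and variance computed exactly from the pmf f(x) = 1/k.
mean = sum(Fraction(x) for x in xs) / k
var = sum((x - mean) ** 2 for x in xs) / k

# Compare with the closed forms (k+1)/2 and (k^2 - 1)/12.
print(mean == Fraction(k + 1, 2), var == Fraction(k * k - 1, 12))   # -> True True
```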

2.1.2 Binomial(n, θ)

The binomial distribution is one of the more familiar distributions in this course. It has probability mass
function

f_X(x) = (n choose x) θ^x (1 − θ)^{n−x},  x ∈ {0, ..., n}

From this we can determine its mean nθ and variance nθ(1 − θ), though these can also be derived directly
from its moment generating function

M_X(t) = (θe^t + (1 − θ))^n

2.1.3 Negative Binomial, NB(r, p)

The Negative Binomial distribution, X ∼ NB(r, p), counts the number of successes before the rth
failure in a sequence of independent Bernoulli(p) trials. It has probability mass function

f_X(k) = (k + r − 1 choose k) (1 − p)^r p^k  for k = 0, 1, 2, ...


2.1.4 Multinomial(n, p = (p_1, ..., p_k))

The multinomial distribution is used when we are interested in the outcome of n trials of a random
variable X, where each trial has exactly k outcomes, occurring with probabilities p_1, ..., p_k respectively.

f_X(x_1, ..., x_k) = P(X_1 = x_1 and ... and X_k = x_k)
= P(outcome 1 occurs x_1 times and ... and outcome k occurs x_k times)
= n!/(x_1! ··· x_k!) p_1^{x_1} ··· p_k^{x_k}  when Σ_{i=1}^{k} x_i = n, and 0 otherwise.

This is a generalisation of the Binomial distribution: to see this take k = 2, p_1 = p and p_2 = 1 − p.

2.1.5 Poisson(θ)
Another frequently used distribution is the Poisson distribution. It has probability mass function

f_X(x) = e^{−θ} θ^x / x!,  x ∈ N ∪ {0}

The Poisson distribution has mean and variance θ. Its moment generating function is

M_X(t) = e^{θ(e^t − 1)}

Immediate Result If X and Y are independent Poisson random variables with means λ and µ, then
X + Y ∼ Poisson(λ + µ).

Proof. M_{X+Y}(t) = M_X(t) M_Y(t) = e^{λ(e^t − 1)} e^{µ(e^t − 1)} = e^{(λ+µ)(e^t − 1)}, i.e. the MGF of a
Poisson(λ + µ) random variable.

Binomial-Poisson Relationship For large n and small θ, a Binomial(n, θ) random variable is well
approximated by a Poisson(nθ) random variable.
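The quality of the approximation can be checked numerically. The sketch below (an illustration, not from the notes; the values of n and θ are arbitrary choices) evaluates both pmfs in log space to avoid overflow, and measures the total variation distance between them:

```python
import math

n, theta = 1000, 0.003   # large n, small theta
lam = n * theta          # matching Poisson mean

def binom_pmf(x):
    # Binomial(n, theta) pmf, computed via log-gamma to avoid overflow.
    return math.exp(math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
                    + x * math.log(theta) + (n - x) * math.log(1 - theta))

def pois_pmf(x):
    # Poisson(lam) pmf, also computed in log space.
    return math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))

# Total variation distance: half the sum of absolute pmf differences.
tv = 0.5 * sum(abs(binom_pmf(x) - pois_pmf(x)) for x in range(n + 1))
print(round(tv, 4))
```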

2.2 Continuous Distributions


2.2.1 Uniform(a, b)
The continuous uniform distribution is also very simple. We have

f_X(x) = 1/(b − a) if a < x < b, and 0 otherwise.

From this, we can derive the mean (a + b)/2, variance (b − a)²/12 and moment generating function

M_X(t) = (e^{tb} − e^{ta}) / (t(b − a))

2.2.2 Exponential(β)
The probability density function for this distribution is

f_X(x) = (1/β) e^{−x/β} if x > 0, and 0 otherwise.

The mean and variance are β and β² respectively. It is the only continuous probability distribution
satisfying the memoryless property, P(X > s + t | X > s) = P(X > t) for all s, t > 0. Its moment
generating function is

M_X(t) = 1/(1 − tβ)  for t < 1/β

2.2.3 Gamma(α, β)
This distribution has probability density function

f_X(x) = 1/(Γ(α) β^α) x^{α−1} e^{−x/β} if x > 0, and 0 otherwise,

where Γ(α) = (α − 1)! for α ∈ N.

The mean and variance are αβ and αβ² respectively. It has moment generating function

M_X(t) = 1/(1 − βt)^α  for t < 1/β

Immediate Results

(i) Exponential(β) = Gamma(1, β);

(ii) If X ∼ Gamma(α, β), then cX ∼ Gamma(α, cβ);

(iii) For α ∈ N, a Gamma(α, β) random variable is the sum of α independent Exponential(β) random variables.
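Result (iii) can be illustrated by simulation. The sketch below (an illustration, not from the notes; α = 3, β = 2 are arbitrary choices) sums independent Exponential(β) draws and checks the Gamma mean αβ and variance αβ²:

```python
import random

random.seed(5)

alpha, beta = 3, 2.0
reps = 50_000

# Each sample is a sum of alpha independent Exponential(beta) draws
# (random.expovariate takes the rate 1/beta, so each draw has mean beta).
samples = [sum(random.expovariate(1.0 / beta) for _ in range(alpha))
           for _ in range(reps)]

mean = sum(samples) / reps
var = sum((s - mean) ** 2 for s in samples) / reps
# Gamma(alpha, beta) has mean alpha*beta = 6 and variance alpha*beta^2 = 12.
print(round(mean, 2), round(var, 2))
```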

2.2.4 Normal(µ, σ²)

We now meet one of the most important distributions in statistics. The Normal (sometimes called
Gaussian) distribution, N(µ, σ²), has probability density function

f_X(x) = 1/√(2πσ²) e^{−(x−µ)²/(2σ²)},  x ∈ R

From this we can derive the mean µ and variance σ². The moment generating function of the Normal
distribution is

M_X(t) = e^{µt + σ²t²/2}

Immediate Results

(i) if Z ∼ N(µ, σ²), then (Z − µ)/σ ∼ N(0, 1). This is called the standardised Normal distribution;

(ii) if Z ∼ N(µ, σ²), then cZ + d ∼ N(cµ + d, c²σ²) ∀c, d ∈ R.

The cdf of the N(0, 1) distribution at appropriate values is typically given in exam questions; it is
up to you to express the general N(µ, σ²) distribution in these terms when calculating interval
probabilities.
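Standardisation is easy to carry out in code. The sketch below (an illustration, not from the notes; the identity Φ(z) = ½(1 + erf(z/√2)) and the values µ = 100, σ = 15 are assumptions for the example) computes an interval probability for a general Normal via the N(0, 1) cdf:

```python
import math

def Phi(z):
    # N(0, 1) cdf via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma = 100.0, 15.0   # X ~ N(100, 15^2), hypothetical values
a, b = 85.0, 130.0

# Standardise: P(a < X < b) = Phi((b - mu)/sigma) - Phi((a - mu)/sigma).
p = Phi((b - mu) / sigma) - Phi((a - mu) / sigma)
print(round(p, 4))   # -> 0.8186
```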

2.3 Log-Normal, lnN(µ, σ²)

If Y ∼ N(µ, σ²), then X := e^Y ∼ lnN(µ, σ²); equivalently, ln X ∼ N(µ, σ²).
By our knowledge of transformations of random variables,

f_X(x; µ, σ) = 1/(xσ√(2π)) e^{−(ln x − µ)²/(2σ²)},  x > 0

One use of Log-Normal random variables is in simple interest rate models.

3 The Normal Distribution


Definition 3.1 (Bi-Variate Normal Distribution). The non-singular bivariate Normal distribution N(µ, Σ)
has joint probability density function given by

f_{XY}(x, y; µ, Σ) = 1/(2π√(det Σ)) exp(−½ (x − µ_X, y − µ_Y) Σ^{−1} (x − µ_X, y − µ_Y)^T)

where

Σ = [ σ_X²            ρ_{XY} σ_X σ_Y ]
    [ ρ_{XY} σ_X σ_Y  σ_Y²           ]

called the covariance matrix of X and Y, is non-singular and positive-definite, and

µ = (µ_X, µ_Y)^T

Definition 3.2 (Multi-Variate Normal Distribution). We say X ∼ N(µ, Σ) if

f_X(x) = 1/((2π)^{k/2} |Σ|^{1/2}) exp(−½ (x − µ)^T Σ^{−1} (x − µ)),

where Σ is symmetric and positive definite and |Σ| is the determinant of the covariance matrix Σ.
Note that this definition is consistent with both the Uni-Variate and Bi-Variate definitions. Recall that
the ith component of X satisfies X_i ∼ N((µ)_i, (Σ)_{(i,i)}).
Remark: The (X_i) are pairwise independent iff Cov(X_i, X_j) = Σ_{i,j} = 0 for i ≠ j. The forward
implication is easy to verify for any random variables; WARNING: the backward implication is false
outside the multivariate normal case.

4 Sequences of Random Variables


4.1 Convergence
There are two (examinable) ways in which a sequence of random variables can converge.
Definition 4.1 (Convergence in Probability). A sequence of random variables X_1, X_2, ... converges in
probability to the random variable X if

∀ε > 0, lim_{n→∞} P(|X_n − X| < ε) = 1

Immediate Result If X1 , X2 , . . . converges in probability to X, then for any continuous function h,


h(Xn ) converges in probability to h(X).
Definition 4.2 (Convergence in Distribution). A sequence of random variables X1 , X2 , . . . converges in
distribution to the random variable X if
lim FXn (x) = FX (x)
n→∞

(at all points x where FX (x) is continuous.)


Although we have not proved it, this is equivalent to checking that E[f(X_n)] → E[f(X)] for every
bounded continuous function f.

Remarks
(i) Although convergence in probability implies convergence in distribution, the converse is not true.
(ii) If the sequence {X_n}_{n∈N} converges in probability to the true value of an unknown distribution
parameter θ, then {X_n}_{n∈N} is said to be consistent for θ. Alternatively, {X_n}_{n∈N} is called a
consistent estimator of θ.
(iii) Convergence in distribution involves the convergence of the cumulative distribution function.

4.2 Limit Theorems


Theorem 4.3 (Weak Law of Large Numbers). Let {X_n}_{n=1}^{∞} be independent and identically distributed
random variables with E[X_i] = µ and Var(X_i) = σ² < ∞. Then X̄_n, the sample mean, converges in
probability to µ.

Proof. ∀ε > 0,

P(|X̄_n − µ| ≥ ε) = P(|X̄_n − µ|² ≥ ε²) ≤ E[(X̄_n − µ)²]/ε²  (Markov Inequality)
= Var(X̄_n)/ε² = σ²/(nε²)

which converges to 0 as n → ∞. Hence P(|X̄_n − µ| < ε) → 1.
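The theorem can be illustrated by simulation. The sketch below (an illustration, not from the notes; Uniform(0, 1) draws and ε = 0.05 are arbitrary choices) estimates P(|X̄_n − µ| ≥ ε) for increasing n:

```python
import random

random.seed(1)

mu, eps = 0.5, 0.05   # Uniform(0, 1) draws: mu = 1/2, sigma^2 = 1/12

def freq_outside(n, reps=2000):
    # Estimate P(|X_bar_n - mu| >= eps) by repeated sampling.
    count = 0
    for _ in range(reps):
        xbar = sum(random.random() for _ in range(n)) / n
        if abs(xbar - mu) >= eps:
            count += 1
    return count / reps

# The bound sigma^2/(n eps^2) shrinks like 1/n, and so does the frequency.
f10, f100, f1000 = freq_outside(10), freq_outside(100), freq_outside(1000)
print(f10, f100, f1000)
```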



Remark We also proved the following corollary to the Markov Inequality, Chebychev's Inequality:

P(|X − µ| ≥ ε) ≤ Var(X)/ε²
Theorem 4.4 (Central Limit Theorem). Let X_1, X_2, ... be a sequence of independent and identically
distributed random variables, with common mean µ and variance σ². Let G_n(x) denote the cumulative
distribution function of Z_n = (X̄_n − µ)/(σ/√n). Then, for every x, lim_{n→∞} G_n(x) = Φ(x). Equivalently, the
standardised sample mean converges in distribution to the standard normal distribution, N(0, 1).
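A quick simulation illustrates the theorem. The sketch below (an illustration, not from the notes; Exponential(1), which has µ = σ = 1, is an arbitrary choice) checks that the standardised sample mean lands in [−1.96, 1.96] about 95% of the time:

```python
import math
import random

random.seed(2)

# Exponential(1) has mu = sigma = 1, and is itself far from normal.
n, reps = 50, 20_000
inside = 0
for _ in range(reps):
    xbar = sum(random.expovariate(1.0) for _ in range(n)) / n
    z = (xbar - 1.0) / (1.0 / math.sqrt(n))   # standardised sample mean
    if abs(z) <= 1.96:
        inside += 1

# If Z_n is approximately N(0, 1), about 95% of draws land in [-1.96, 1.96].
frac = inside / reps
print(round(frac, 3))
```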

The combination of the Central Limit Theorem and the following theorem provide the basis for many
useful approximations.
Theorem 4.5 (Slutsky’s Lemma). If Xn → X in distribution, and f : Rk → Rl is continuous then:

f (Xn ) → f (X) in distribution.

Proof. For this, we use the alternative definition of convergence in distribution. Suppose g : R^l → R
is bounded and continuous. Then g ∘ f is bounded and continuous, so E[g ∘ f(X_n)] → E[g ∘ f(X)]. Since g is
arbitrary, f(X_n) → f(X) in distribution.
Corollary 4.6 (not in this course, but on past papers).
If Xn → X in distribution and Yn → c ∈ R in distribution (or probability, which implies distribution),
then
Yn Xn → cX in distribution and Yn + Xn → c + X in distribution
Proof: Choose suitable f .
Theorem 4.7 (Pearson's Theorem). 1. Suppose X_n ∼ Multinomial(n, p) (in R^k). Then

(X_n − np)/√n → X ∼ N(0, V)

where the convergence is in distribution and, writing q_i = 1 − p_i,

V = [ p_1q_1    −p_1p_2   −p_1p_3   ... ]
    [ −p_2p_1  p_2q_2    −p_2p_3   ... ]
    [ −p_3p_1  −p_3p_2   p_3q_3    ... ]
    [ ...      ...       ...       ... ]

or V = (v_ij), where

v_ij = p_i δ_ij − p_i p_j

or alternatively V = diag(p) − pp^T.

2. Write X_i for the ith component of X (the limit). Then we have

Σ_{i=1}^{k} X_i²/p_i ∼ χ²_{k−1}

so that

3. Σ_{i=1}^{k} ((X_n)_i − np_i)²/(np_i) → χ² in distribution, where χ² ∼ χ²_{k−1}.
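Part 3 is the basis of the familiar chi-squared goodness-of-fit statistic. The sketch below (an illustration, not from the notes; p and n are arbitrary choices) draws one Multinomial(n, p) sample and computes the statistic:

```python
import random

random.seed(3)

p = [0.2, 0.3, 0.5]
n = 10_000

# Draw one Multinomial(n, p) observation via n categorical draws.
counts = [0] * len(p)
for _ in range(n):
    u, acc = random.random(), 0.0
    for i, pi in enumerate(p):
        acc += pi
        if u < acc:
            counts[i] += 1
            break

# Pearson's statistic: sum_i (count_i - n p_i)^2 / (n p_i), approx chi^2_{k-1}.
chi2 = sum((counts[i] - n * p[i]) ** 2 / (n * p[i]) for i in range(len(p)))
print(counts, round(chi2, 2))
```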

Part II
Statistics
Definition 4.8. Suppose data (x1 , . . . , xn ) is modelled as observed values of r.v.s whose joint probability
mass function or joint probability density function is fθ (·) depending on a parameter θ. We define the
observed likelihood function to be θ 7→ L(θ) given by fθ (x1 , . . . , xn ), where we substitute the observed
data values into the pmf/pdf. (Note that the xi s are not dummy variables!) In the light of the data, we
have a preference for values of θ that give greater values to L(θ).
The maximum likelihood estimate, θ0 of θ is the value of θ that maximises L(θ) (assuming it
exists and is unique). In some sense, this is the most plausible value of θ, having observed the results of
the experiment.
Remark: It's often easier to work with the log-likelihood (base e), given by l(θ) = log L(θ). Because log
is an increasing function, we have θ_0 = argmax_θ l(θ) as well.
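For a concrete example, the sketch below (an illustration, not from the notes; the Bernoulli data are hypothetical) maximises the log-likelihood of a Bernoulli(θ) sample over a grid and recovers the closed-form MLE, the sample mean:

```python
import math

# Hypothetical observed Bernoulli(theta) data.
x = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]

def loglik(theta):
    # l(theta) = sum_i [x_i log(theta) + (1 - x_i) log(1 - theta)]
    return sum(xi * math.log(theta) + (1 - xi) * math.log(1 - theta) for xi in x)

# Maximise l over a fine grid of theta values in (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
theta_hat = max(grid, key=loglik)

# The closed-form maximum likelihood estimate is the sample mean.
print(theta_hat, sum(x) / len(x))
```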

4.3 Likelihood Ratios


Definition 4.9. Suppose a statistical model is parametrised by θ (perhaps a vector) belonging to some
space Θ of possible values. Given θ_0, θ_1 ∈ Θ we define the corresponding likelihood ratio to be

W(θ_1, θ_0) = L(θ_1)/L(θ_0)

where L(θ) is the observed likelihood function.


More generally, if Θ_0, Θ_1 ⊂ Θ are subsets of Θ, we define the corresponding (generalised) likelihood
ratio to be

W(Θ_1, Θ_0) = sup_{θ∈Θ_1} L(θ) / sup_{θ∈Θ_0} L(θ)
Definition 4.10. Suppose data (x1 , . . . , xn ) is modelled as observed values of random variables (X1 , . . . , Xn )
whose distribution depends on a parameter θ. If a statistic t(x1 , . . . , xn ) is used as an estimate of θ, then
the corresponding random variable T = t(X1 , . . . , Xn ) is called an estimator of θ.
Its bias is defined to be E_θ[T − θ].
Its mean squared error (M.S.E.) is defined to be E_θ[(T − θ)²].
And its standard error is defined to be √(E_θ[(T − θ)²]).

4.4 Confidence Interval


Notation 4.11. For α ∈ (0, 1), let z_α denote the 100(1 − α) percentile of the standard Gaussian
distribution on R, i.e. with Z ∼ N(0, 1),

P(Z ≤ z_α) = 1 − α

We have for example, to two decimal places (the most frequently used):

z_{0.05} = 1.64,  z_{0.025} = 1.96,  z_{0.01} = 2.33,  z_{0.005} = 2.58

Then, returning to the estimator X̄, we have (X̄ − µ)/√(σ²/n) ∼ N(0, 1) for all values of µ ∈ R, and thus

P_µ(−z_{.025} ≤ (X̄ − µ)/√(σ²/n) ≤ z_{.025}) = 0.95  ∀µ ∈ R

Or, by rearranging the inequality defining the event,

P_µ(X̄ − z_{.025}√(σ²/n) ≤ µ ≤ X̄ + z_{.025}√(σ²/n)) = 0.95  (1)

Then define l(x) = x̄ − z_{.025}√(σ²/n) and r(x) = x̄ + z_{.025}√(σ²/n). These are two functions of the data, i.e.
statistics.

Then because (1) holds, in repetitions of the experiment the interval [l(x), r(x)] will contain µ with
a frequency of 95%. Moreover, this statement is true no matter what value µ is. Because of this, we call
the interval [l(x), r(x)] a 95% confidence interval for µ. More formally:
Definition 4.12. Suppose data x1 , . . . , xn are modelled as observed values of random variables X1 , . . . , Xn
whose distribution depends on a parameter(s) θ ∈ Θ. Let φ be some real valued function of θ (often
θ itself). A pair of statistics l(x), r(x) specify an interval [l(x), r(x)] which is called a 100(1 − α)%
confidence interval for φ if

P_θ(l(X) ≤ φ(θ) ≤ r(X)) = 1 − α  ∀θ ∈ Θ

Remark: Suppose θ̂ is an estimate of θ and thus φ̂ = φ(θ̂) is an estimate of the unknown value of
φ. The basic question is, how accurate should we consider this estimate to be? Confidence intervals are
one approach to this. The idea is to give a range of plausible values for φ. Usually φ̂ will belong to the
C.I. and then the C.I. is an indication of how large the error in the estimation procedure might be (e.g.
it could be much bigger).
Warning! One must be careful not to over-interpret C.I.s. The defining property says that in
hypothetical repetitions of the experiment, the interval constructed by the procedure will cover the true
value 100(1 − α)% of the time. It is not true however that the actual interval constructed from the data
contains the true value with probability 1 − α.
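The construction of the interval in (1) is mechanical once z_{0.025}, σ² and n are fixed. The sketch below (an illustration, not from the notes; all numerical values are hypothetical) computes a 95% interval for µ with known variance:

```python
import math

z = 1.96                  # z_{0.025} to two decimal places
sigma2, n = 4.0, 25       # hypothetical known variance and sample size
xbar = 10.3               # hypothetical observed sample mean

# 95% interval: x_bar -/+ z_{0.025} * sqrt(sigma^2 / n).
half = z * math.sqrt(sigma2 / n)
l, r = xbar - half, xbar + half
print(round(l, 3), round(r, 3))   # -> 9.516 11.084
```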

4.5 Calibrating Likelihood Ratios: p-Values


Suppose x1 . . . xn are modelled as observed values of X1 . . . Xn whose joint distribution depends on a
parameter θ ∈ Θ. Let Θ0 , Θ1 ⊆ Θ. To compare the null hypothesis that θ ∈ Θ0 with the alternative
hypothesis θ ∈ Θ_1, we calculate a likelihood ratio

W(x) = sup_{θ∈Θ_1} L(θ) / sup_{θ∈Θ_0} L(θ)

Large values of W mean some θ ∈ Θ_1 is a much better fit to the data than any θ ∈ Θ_0. But how do we
calibrate W?
Definition 4.13. Suppose W (x) is a statistic used for comparing hypotheses θ ∈ Θ0 (the null hypothesis)
with θ ∈ Θ1 (the alternative hypothesis) such that large values of W are evidence against the null
hypothesis. Then, we define the p-value corresponding to W(x) to be:

p(x) := sup_{θ∈Θ_0} P_θ(W(X) ≥ W(x))

where W (X) is the random variable corresponding to the statistic W (x).


p(x) is therefore the (maximal) frequency with which in repetitions of the experiment with the null
hypothesis true, we would observe evidence against the null hypothesis at least as strong as we have
actually observed.
Small values of p(x) correspond to strong evidence against the null hypothesis.
Connection with decisions:
Suppose we pick a particular value α ∈ (0, 1). Typically, α = 0.01, 0.05 say. (This is referred to as a
significance level.) We adopt the following decision procedure:
• If the p-value p(x) ≤ α, we decide against the null hypothesis in favour of the alternative.
• If p(x) > α, we accept the null hypothesis as consistent with the data. (We haven’t got strong
enough evidence against it.)
We now make a couple of simplifying assumptions:
1. Θ0 = {θ0 } : the null hypothesis is ‘simple’
2. W (X) has a continuous distribution under Pθ0 .
Then Pθ0 (p(X) ≤ α) = α (exercise!). Consequently, in repetitions of the experiment with the null
hypothesis true, we would wrongly reject the null hypothesis with a frequency α.
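The claimed α-level error rate can be checked by simulation. The sketch below (an illustration, not from the notes; the simple null X_i ∼ N(0, 1) and the continuous statistic W = √n |X̄| are arbitrary choices) estimates the frequency of wrongly rejecting the null:

```python
import math
import random

random.seed(4)

def Phi(z):
    # N(0, 1) cdf via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

n, reps, alpha = 20, 10_000, 0.05
rejections = 0
for _ in range(reps):
    # Data generated under the simple null hypothesis X_i ~ N(0, 1).
    xbar = sum(random.gauss(0.0, 1.0) for _ in range(n)) / n
    w = math.sqrt(n) * abs(xbar)        # a continuous test statistic
    p = 2.0 * (1.0 - Phi(w))            # its p-value under the null
    if p <= alpha:
        rejections += 1

# With a simple null and continuous W, the type-I error frequency is ~alpha.
print(rejections / reps)
```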
