ST220
Introduction to
Mathematical Statistics
Revision Guide
WMS
Contents

I Probability
  1 Review of Random Variables
II Statistics
Authors
Written by Joseph Jordan.
Any corrections or improvements should be entered into our feedback form at http://tinyurl.com/WMSGuides
(alternatively email revision.guides@warwickmaths.org).
History
Current Edition: 2012/2013.
Part I
Probability
1 Review of Random Variables
1.1 Definitions
Definition 1.1. The sample space is the set of all possible outcomes of an experiment, or observation.
It is most commonly denoted Ω.
Definition 1.3. The cumulative distribution function (cdf) of X, usually denoted F_X, is the function such that F_X(x) = P(X ≤ x) ∀x ∈ R.
Definition 1.5. Two random variables X and Y are said to be identically distributed if F_X(x) = F_Y(x) ∀x ∈ R.
Warning Do not confuse this with X and Y being the same function.
Definition 1.6. The probability mass function (pmf) of a discrete random variable X is f_X(x) = P(X = x) ∀x ∈ R.
Immediate Result A consequence of this is that F_X(x) = ∑_{t ≤ x} f_X(t).
Note We cannot define a function with a similar property for a continuous random variable because P(X = x) = 0 ∀x ∈ R. Instead we define the probability density function (pdf) as the function f_X(x) such that F_X(x) = ∫_{−∞}^{x} f_X(t) dt ∀x ∈ R.
Immediate Result f_X(x) = (d/dx) F_X(x), by the Fundamental Theorem of Calculus.
A pdf satisfies: (i) f(x) ≥ 0 ∀x ∈ R; (ii) ∫_{−∞}^{∞} f(x) dx = 1.
In later sections, it will be typographically easier to use the symbol ∼ to denote that a random variable
X has a certain distribution. Hence, if X and Y are identically distributed, we may now write X ∼ Y .
1.2 Independence
Definition 1.8. A set C of measurable sets is said to be independent iff for every finite subset {A_1, . . . , A_n} of C,

P(∩_{i=1}^{n} A_i) = ∏_{i=1}^{n} P(A_i).
Definition 1.9. A set {X_1, . . . , X_n} of random variables is said to be independent iff

f_{X_1,...,X_n}(x_1, . . . , x_n) = ∏_{i=1}^{n} f_{X_i}(x_i).
Theorem 1.11 (Probability Integral Transform). Suppose X is continuous, and let Y = F_X(X), where F_X(x) is an increasing function of x. Then Y ∼ Uniform(0, 1).
Proof.

F_Y(y) = P(Y ≤ y) = P(F_X(X) ≤ y) = P(X ≤ F_X^{−1}(y)) = F_X(F_X^{−1}(y)) = y,  0 < y < 1

which you should (or at least will in Section 2) recognise as the cdf of the Uniform(0, 1) distribution.
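A quick numerical sketch of Theorem 1.11 (our own illustration; the choice of Exponential(1) for X is an assumption, not from the guide): applying F_X to samples of X should yield approximately Uniform(0, 1) draws.

```python
import math
import random

random.seed(0)

# X ~ Exponential(1) (our own choice), with cdf F_X(x) = 1 - exp(-x).
samples = [random.expovariate(1.0) for _ in range(100_000)]
y = [1.0 - math.exp(-x) for x in samples]  # Y = F_X(X)

# If Y ~ Uniform(0, 1), then P(Y <= t) should be close to t.
max_err = max(abs(sum(v <= t for v in y) / len(y) - t) for t in (0.25, 0.5, 0.75))
mean_y = sum(y) / len(y)  # should be close to 1/2
```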
Theorem 1.12 (Transformations in the Multi-Variate Case). Let Y and X be vectors of random variables such that Y = τ(X), where τ is a continuously differentiable bijective map. Let φ = τ^{−1}, so that φ(y) = (φ_1(y), . . . , φ_n(y)) = (x_1, . . . , x_n). If X is absolutely continuous with density f_X(x; θ), then Y has density

f_Y(y; θ) = |det( ∂φ_i(y)/∂y_j )_{i,j=1,...,n}| f_X(φ(y); θ)

where the matrix has (i, j) entry ∂φ_i(y)/∂y_j (the Jacobian of φ); equivalently, its rows are ∇φ_1(y), . . . , ∇φ_n(y).
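As a one-dimensional sanity check of the transformation formula (our own example, not from the guide): for Y = aX + b with X ∼ N(0, 1), φ(y) = (y − b)/a, the Jacobian determinant reduces to 1/|a|, and the formula yields exactly the N(b, a²) density.

```python
from statistics import NormalDist

a, b = 2.0, 3.0  # arbitrary transformation Y = a*X + b
std_normal = NormalDist()  # X ~ N(0, 1)

def f_Y(y):
    # Transformation formula: |det dphi/dy| * f_X(phi(y)), with phi(y) = (y - b)/a,
    # so the "Jacobian determinant" in one dimension is just 1/|a|.
    return (1.0 / abs(a)) * std_normal.pdf((y - b) / a)

target = NormalDist(b, abs(a))  # N(3, 4): mean b, standard deviation |a|

max_err = max(abs(f_Y(y) - target.pdf(y)) for y in (-2.0, 0.0, 1.0, 3.0, 5.0, 8.0))
```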
Immediate Result
1.4 Expectation
Definition 1.13 (Expectation). The expected value of any function g(X) is

E[g(X)] = ∑_x g(x) f_X(x)

when X is discrete, and

E[g(X)] = ∫_{−∞}^{∞} g(x) f_X(x) dx

when X is continuous. We can interpret this value as the average of g(X) in repeated sampling.
The variance of X is Var(X) = E[(X − E[X])²]. The standard deviation of X is √Var(X).
Definition 1.16 (Central Moments). The nth central moment of X is µ_n^c = E[(X − µ_1)^n], where µ_1 = E[X].
Immediate Results The expectation is the first moment and the variance is the second central mo-
ment.
Lemma 1.17. (i) E[a + bg(X)] = a + bE[g(X)] ∀a, b ∈ R, and any function g;
The moment generating function (mgf) of X is M_X(t) = E[e^{tX}].
Immediate Result An extremely useful property of the moment generating function is revealed in its
name; it generates moments. This is because
E[X^n] = (d^n/dt^n) M_X(t) |_{t=0}
Proof. For the continuous case, we prove by induction on n (assuming we can differentiate under the
integral sign).
n = 1:  (d/dt) M_X(t) = (d/dt) ∫_{−∞}^{∞} e^{tx} f_X(x) dx = ∫_{−∞}^{∞} x e^{tx} f_X(x) dx = E[X e^{tX}] = E[X] when t = 0.
Remark There is a one-to-one correspondence between mgfs and the pdfs of random variables (Uniqueness Theorem).
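As a quick numerical illustration of moment generation (our own sketch, not from the guide): take the Exponential(β) mgf M_X(t) = 1/(1 − tβ) from Section 2.2.2 and approximate its derivatives at t = 0 by finite differences; they should recover E[X] = β and E[X²] = 2β².

```python
# Exponential(beta) mgf, M_X(t) = 1/(1 - t*beta) for t < 1/beta (see Section 2.2.2).
beta = 2.0

def M(t):
    return 1.0 / (1.0 - t * beta)

h = 1e-4  # step for central finite differences at t = 0
first = (M(h) - M(-h)) / (2 * h)              # approximates E[X] = beta = 2
second = (M(h) - 2 * M(0.0) + M(-h)) / h**2   # approximates E[X^2] = 2*beta^2 = 8
```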
2.1.1 Discrete Uniform(k)
The discrete uniform distribution has probability mass function

f_X(x) = 1/k,  x ∈ {1, . . . , k}

It has mean (k + 1)/2 and variance (k² − 1)/12. Its moment generating function is

M_X(t) = (e^t − e^{(k+1)t}) / (k(1 − e^t))
2.1.2 Binomial(n, θ)
The binomial distribution is one of the more familiar distributions in this course. It has probability mass
function
f_X(x) = (n choose x) θ^x (1 − θ)^{n−x},  x ∈ {0, . . . , n}
From this we can determine its mean nθ and variance nθ(1 − θ). These can also be derived directly from its moment generating function

M_X(t) = (θe^t + (1 − θ))^n
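These facts can be checked directly from the pmf (a minimal sketch; the values of n and θ are our own arbitrary choices):

```python
from math import comb

n, theta = 10, 0.3  # arbitrary parameter values for the check
pmf = [comb(n, x) * theta**x * (1 - theta)**(n - x) for x in range(n + 1)]

total = sum(pmf)  # a pmf must sum to 1
mean = sum(x * p for x, p in enumerate(pmf))               # should equal n*theta = 3.0
var = sum((x - mean) ** 2 * p for x, p in enumerate(pmf))  # n*theta*(1-theta) = 2.1
```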
2.1.3 Negative Binomial(r, p)
Closely related is the Negative Binomial distribution, X ∼ NB(r, p), which counts the number of successes before the rth failure in a sequence of Bernoulli(p) random variables. Its probability mass function is

f_X(k) = (k + r − 1 choose k) (1 − p)^r p^k  for k = 0, 1, 2, . . .
2.1.4 Multinomial(n, p = (p_1, . . . , p_k))
The multinomial distribution is used when we are interested in the outcome of n trials of a random variable X, where each trial has exactly k outcomes, occurring with probabilities p_1, . . . , p_k respectively. Its probability mass function is

f_X(x_1, . . . , x_k) = (n! / (x_1! · · · x_k!)) p_1^{x_1} · · · p_k^{x_k}  when ∑_{i=1}^{k} x_i = n, and 0 otherwise.
This is a generalisation of the Binomial distribution; to see this take k = 2, p_1 = p and p_2 = 1 − p.
2.1.5 Poisson(θ)
Another frequently used distribution is the Poisson distribution. It has probability mass function
f_X(x) = e^{−θ} θ^x / x!,  x ∈ N ∪ {0}
The Poisson distribution has mean and variance θ. Its moment generating function is

M_X(t) = e^{θ(e^t − 1)}
Immediate Result If X and Y are independent Poisson random variables with means λ and µ, then X + Y ∼ Poisson(λ + µ).
Proof. M_{X+Y}(t) = M_X(t) M_Y(t) = e^{λ(e^t − 1)} e^{µ(e^t − 1)} = e^{(λ+µ)(e^t − 1)}, i.e. the mgf of a Poisson(λ + µ) random variable.
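The additivity result can also be checked numerically by convolving the two pmfs (a minimal sketch; λ, µ and the truncation point are our own choices):

```python
import math

def poisson_pmf(x, theta):
    return math.exp(-theta) * theta**x / math.factorial(x)

lam, mu = 1.5, 2.5  # arbitrary means for the illustration
N = 60  # evaluate the pmfs on {0, ..., N-1}

# pmf of X + Y by discrete convolution of the two individual pmfs.
conv = [sum(poisson_pmf(i, lam) * poisson_pmf(k - i, mu) for i in range(k + 1))
        for k in range(N)]
direct = [poisson_pmf(k, lam + mu) for k in range(N)]

max_err = max(abs(a - b) for a, b in zip(conv, direct))
```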
2.2.1 Uniform(a, b)
The continuous uniform distribution on [a, b] has probability density function f_X(x) = 1/(b − a) for a ≤ x ≤ b. From this, we can derive the mean (a + b)/2, variance (b − a)²/12 and moment generating function

M_X(t) = (e^{tb} − e^{ta}) / (t(b − a))
2.2.2 Exponential(β)
The probability density function for this distribution is
f_X(x) = (1/β) e^{−x/β}  if x > 0, and 0 otherwise.
The mean and variance are β and β² respectively. It is the only continuous distribution satisfying the memoryless property: P(X > s + t | X > s) = P(X > t) for all s, t > 0. Its moment generating function is

M_X(t) = 1/(1 − tβ)  for t < 1/β
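The memoryless property can be verified directly from the survival function P(X > x) = e^{−x/β} (a minimal sketch; the value of β and the grid of s, t values are our own choices):

```python
import math

beta = 2.0  # our own choice of scale

def survival(x):
    return math.exp(-x / beta)  # P(X > x) for Exponential(beta)

# Memorylessness: P(X > s + t | X > s) = P(X > s + t)/P(X > s) = P(X > t).
max_err = max(abs(survival(s + t) / survival(s) - survival(t))
              for s in (0.5, 1.0, 3.0) for t in (0.2, 1.0, 2.5))
```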
2.2.3 Gamma(α, β)
This distribution has probability density function

f_X(x) = (1/(Γ(α) β^α)) x^{α−1} e^{−x/β}  if x > 0, and 0 otherwise.
Immediate Results
2.2.4 Normal(µ, σ 2 )
We now meet one of the most important distributions in statistics. The Normal (sometimes called Gaussian) distribution, N(µ, σ²), has probability density function

f_X(x) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)},  x ∈ R
From this we can derive the mean µ and variance σ². The moment generating function of the Normal distribution is

M_X(t) = e^{µt + σ²t²/2}
Immediate Results
(i) if Z ∼ N(µ, σ²), then (Z − µ)/σ ∼ N(0, 1). This is called the standardised Normal distribution;
The cdf of the N(0, 1) distribution at appropriate values is typically given in exam questions; it is up to you to express the general Normal(µ, σ²) distribution in these terms when calculating interval probabilities.
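This standardisation step can be sketched with the standard library's NormalDist (the values of µ, σ and the interval below are our own illustration, not from the guide):

```python
from statistics import NormalDist

mu, sigma = 100.0, 15.0   # hypothetical model values for the illustration
a, b = 85.0, 130.0        # interval of interest

# Standardise: P(a <= X <= b) = Phi((b - mu)/sigma) - Phi((a - mu)/sigma)
std = NormalDist()  # N(0, 1)
p = std.cdf((b - mu) / sigma) - std.cdf((a - mu) / sigma)

# Cross-check against the N(mu, sigma^2) cdf computed directly.
p_direct = NormalDist(mu, sigma).cdf(b) - NormalDist(mu, sigma).cdf(a)
```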
The log-Normal distribution has probability density function

f_X(x; µ, σ) = (1/(xσ√(2π))) e^{−(ln x − µ)²/(2σ²)},  x > 0
The bivariate Normal distribution has joint probability density function

f_{XY}(x, y; µ, Σ) = (1 / (2π √(det Σ))) e^{−(1/2) (x − µ_X, y − µ_Y) Σ^{−1} (x − µ_X, y − µ_Y)^T}

where

Σ = ( σ_X²              ρ_{XY} σ_X σ_Y )
    ( ρ_{XY} σ_X σ_Y    σ_Y²           )

called the covariance matrix of X and Y, is non-singular and positive-definite, and

µ = (µ_X, µ_Y)^T
Remarks
(i) Although convergence in probability implies convergence in distribution, the converse is not true.
(ii) If the sequence {X_n}_{n∈N} converges in probability to the true value of an unknown distribution parameter θ, then {X_n}_{n∈N} is said to be consistent for θ. Alternatively, {X_n}_{n∈N} is called a consistent estimator of θ.
(iii) Convergence in distribution involves the convergence of the cumulative distribution function.
Remark We also proved the following corollary to the Markov Inequality, Chebyshev's Inequality:

P(|X − µ| ≥ ε) ≤ Var(X)/ε²  for all ε > 0
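A quick check of the inequality (our own example using Uniform(0, 1), which has µ = 1/2 and Var(X) = 1/12): the empirical frequency of large deviations respects, and here sits well below, the Chebyshev bound.

```python
import random

random.seed(1)

# X ~ Uniform(0, 1) (our own choice): mu = 1/2, Var(X) = 1/12.
mu, var = 0.5, 1.0 / 12.0
samples = [random.random() for _ in range(200_000)]

eps = 0.4
freq = sum(abs(x - mu) >= eps for x in samples) / len(samples)  # true value 0.2
bound = var / eps**2  # Chebyshev bound, ~0.52, comfortably above freq
```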
Theorem 4.4 (Central Limit Theorem). Let X_1, X_2, . . . be a sequence of independent and identically distributed random variables, with common mean µ and variance σ². Let G_n denote the cumulative distribution function of Z_n = (X̄_n − µ)/(σ/√n). Then, for every x, lim_{n→∞} G_n(x) = Φ(x). Equivalently, the standardised sample mean converges in distribution to the standard normal distribution, N(0, 1).
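The theorem can be illustrated by simulation (a minimal sketch; the Exponential(1) population, the sample size and the replication count are our own choices):

```python
import math
import random
from statistics import NormalDist

random.seed(2)
Phi = NormalDist().cdf  # standard normal cdf

# X_i ~ Exponential(1) (our own choice), so mu = 1 and sigma = 1.
n, reps = 50, 20_000
z = []
for _ in range(reps):
    xbar = sum(random.expovariate(1.0) for _ in range(n)) / n
    z.append((xbar - 1.0) / (1.0 / math.sqrt(n)))  # standardised sample mean

# The empirical cdf of Z_n should be close to Phi at a few reference points.
max_err = max(abs(sum(v <= x for v in z) / reps - Phi(x)) for x in (-1.0, 0.0, 1.0))
```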
The combination of the Central Limit Theorem and the following theorem provide the basis for many
useful approximations.
Theorem 4.5 (Slutsky's Lemma). If X_n → X in distribution, and f : R^k → R^l is continuous, then f(X_n) → f(X) in distribution.
Proof. For this, we use the alternative definition of convergence in distribution. Suppose g : R^l → R is bounded and continuous. Then g ∘ f is bounded and continuous, so E[g ∘ f(X_n)] → E[g ∘ f(X)]. Since g is arbitrary, f(X_n) → f(X) in distribution.
Corollary 4.6 (not in this course, but on past papers).
If Xn → X in distribution and Yn → c ∈ R in distribution (or probability, which implies distribution),
then
Yn Xn → cX in distribution and Yn + Xn → c + X in distribution
Proof: Choose suitable f .
Theorem 4.7 (Pearson's Theorem). 1. Suppose X_n ∼ Multinomial(n, p) (in R^k). Then

(X_n − np) / √n → X ∼ N(0, V)

so that

3. ∑_{i=1}^{k} (X(n)_i − np_i)² / (np_i) → χ² in distribution, where χ² is a chi-squared random variable with k − 1 degrees of freedom.
Part II
Statistics
Definition 4.8. Suppose data (x1 , . . . , xn ) is modelled as observed values of r.v.s whose joint probability
mass function or joint probability density function is fθ (·) depending on a parameter θ. We define the
observed likelihood function to be θ 7→ L(θ) given by fθ (x1 , . . . , xn ), where we substitute the observed
data values into the pmf/pdf. (Note that the xi s are not dummy variables!) In the light of the data, we
have a preference for values of θ that give greater values to L(θ).
The maximum likelihood estimate, θ0, of θ is the value of θ that maximises L(θ) (assuming it exists and is unique). In some sense, this is the most plausible value of θ, having observed the results of the experiment.
Remark: It is often easier to work with the log-likelihood (base e), given by l(θ) = log L(θ). Because log is an increasing function, we have θ0 = argmax_θ l(θ) as well.
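A minimal sketch of maximum likelihood in practice (the Bernoulli model and the toy data are our own choices, not from the guide): for x_1, . . . , x_n ∼ Bernoulli(θ), the log-likelihood is l(θ) = ∑ x_i log θ + (n − ∑ x_i) log(1 − θ), maximised at the sample mean.

```python
import math

# Toy data (our own): 7 successes in 10 Bernoulli(theta) trials.
data = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]

def log_lik(theta):
    # l(theta) = sum_i [x_i log(theta) + (1 - x_i) log(1 - theta)]
    return sum(math.log(theta) if x == 1 else math.log(1 - theta) for x in data)

# Maximise over a fine grid in (0, 1); analytically the MLE is the sample mean 0.7.
grid = [i / 1000 for i in range(1, 1000)]
theta_hat = max(grid, key=log_lik)
```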
For Z ∼ N(0, 1), define z_α by P(Z ≤ z_α) = 1 − α. We have, for example, to two decimal places (the most frequently used values): z_{0.05} = 1.64 and z_{0.025} = 1.96.
Then because (1) holds, in repetitions of the experiment the interval [l(x), r(x)] will contain µ with
a frequency of 95%. Moreover, this statement is true no matter what value µ is. Because of this, we call
the interval [l(x), r(x)] a 95% confidence interval for µ. More formally:
Definition 4.12. Suppose data x1 , . . . , xn are modelled as observed values of random variables X1 , . . . , Xn
whose distribution depends on a parameter(s) θ ∈ Θ. Let φ be some real valued function of θ (often
θ itself). A pair of statistics l(x), r(x) specify an interval [l(x), r(x)] which is called a 100(1 − α)%
confidence interval for φ if
P_θ(l(X) ≤ φ(θ) ≤ r(X)) = 1 − α  ∀θ ∈ Θ
Remark: Suppose θ̂ is an estimate of θ and thus φ̂ = φ(θ̂) is an estimate of the unknown value of φ. The basic question is: how accurate should we consider this estimate to be? Confidence intervals are one approach to this. The idea is to give a range of plausible values for φ. Usually φ̂ will belong to the C.I., and the width of the C.I. then indicates how large the error in the estimation procedure might be.
Warning! One must be careful not to over-interpret C.I.s. The defining property says that in
hypothetical repetitions of the experiment, the interval constructed by the procedure will cover the true
value 100(1 − α)% of the time. It is not true, however, that the actual interval constructed from the data contains the true value with probability 1 − α.
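The defining coverage property can be demonstrated by simulation (our own sketch, with a normal model and known σ, so the interval x̄ ± z_{0.025} σ/√n applies): in repeated experiments the random interval covers the true mean close to 95% of the time, even though any single realised interval either contains it or does not.

```python
import math
import random
from statistics import NormalDist

random.seed(3)
z = NormalDist().inv_cdf(0.975)  # ~1.96, the 97.5% point of N(0, 1)

mu_true, sigma, n = 5.0, 2.0, 25  # our own hypothetical model values
half_width = z * sigma / math.sqrt(n)

reps, covered = 5_000, 0
for _ in range(reps):
    xbar = sum(random.gauss(mu_true, sigma) for _ in range(n)) / n
    # 95% CI for mu with known sigma: xbar +/- z * sigma / sqrt(n)
    if xbar - half_width <= mu_true <= xbar + half_width:
        covered += 1

coverage = covered / reps  # close to 0.95 over hypothetical repetitions
```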