
Maximum Likelihood Estimation of ARMA Models

For i.i.d. data with marginal pdf $f(y_t; \theta)$, the joint pdf for a sample $y = (y_1, \ldots, y_T)$ is

$$f(y; \theta) = f(y_1, \ldots, y_T; \theta) = \prod_{t=1}^{T} f(y_t; \theta)$$

The likelihood function is this joint density treated as a function of the parameters $\theta$ given the data $y$:

$$L(\theta \mid y) = L(\theta \mid y_1, \ldots, y_T) = \prod_{t=1}^{T} f(y_t; \theta)$$

The log-likelihood is

$$\ln L(\theta \mid y) = \sum_{t=1}^{T} \ln f(y_t; \theta)$$
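As a minimal sketch of this construction (not part of the original notes), the i.i.d. Gaussian case can be coded directly; numpy and the small sample `y` below are assumptions made purely for illustration.

```python
import numpy as np

def iid_normal_loglik(y, mu, sigma2):
    """Log-likelihood of an i.i.d. N(mu, sigma2) sample:
    sum_t ln f(y_t; theta) with theta = (mu, sigma2)."""
    y = np.asarray(y, dtype=float)
    T = y.size
    return (-0.5 * T * np.log(2 * np.pi * sigma2)
            - 0.5 * np.sum((y - mu) ** 2) / sigma2)

# The sample mean and (ddof=0) variance maximize this function.
y = np.array([0.3, -1.1, 0.8, 0.2])
print(iid_normal_loglik(y, y.mean(), y.var()))
```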
Conditional MLE of ARMA Models
Problem: For a sample from a covariance stationary time series $\{y_t\}$, the construction of the log-likelihood given above does not work because the random variables in the sample $y = (y_1, \ldots, y_T)$ are not i.i.d.

One solution: conditional factorization of the log-likelihood.

Intuition: Consider the joint density of two adjacent observations, $f(y_2, y_1; \theta)$. The joint density can always be factored as the product of the conditional density of $y_2$ given $y_1$ and the marginal density of $y_1$:

$$f(y_2, y_1; \theta) = f(y_2 \mid y_1; \theta) f(y_1; \theta)$$

For three observations, the factorization becomes

$$f(y_3, y_2, y_1; \theta) = f(y_3 \mid y_2, y_1; \theta) f(y_2 \mid y_1; \theta) f(y_1; \theta)$$
In general, the conditional-marginal factorization has the form

$$f(y_T, \ldots, y_1; \theta) = \left[ \prod_{t=p+1}^{T} f(y_t \mid I_{t-1}; \theta) \right] \cdot f(y_p, \ldots, y_1; \theta)$$

$$I_t = \{y_t, \ldots, y_1\} = \text{information available at time } t$$

$$y_p, \ldots, y_1 = \text{initial values}$$

The exact log-likelihood function may then be expressed as

$$\ln L(\theta \mid y) = \sum_{t=p+1}^{T} \ln f(y_t \mid I_{t-1}; \theta) + \ln f(y_p, \ldots, y_1; \theta)$$

The conditional log-likelihood drops the marginal density of the initial values:

$$\sum_{t=p+1}^{T} \ln f(y_t \mid I_{t-1}; \theta)$$
Two types of maximum likelihood estimates (mles) may be computed. The first type is based on maximizing the conditional log-likelihood function. These estimates are called conditional mles and are defined by

$$\hat{\theta}_{cmle} = \arg\max_{\theta} \sum_{t=p+1}^{T} \ln f(y_t \mid I_{t-1}; \theta)$$

The second type is based on maximizing the exact log-likelihood function. These estimates are called exact mles, and are defined by

$$\hat{\theta}_{mle} = \arg\max_{\theta} \left[ \sum_{t=p+1}^{T} \ln f(y_t \mid I_{t-1}; \theta) + \ln f(y_p, \ldots, y_1; \theta) \right]$$
Result:

For stationary models, $\hat{\theta}_{cmle}$ and $\hat{\theta}_{mle}$ are consistent and have the same limiting normal distribution. In finite samples, however, $\hat{\theta}_{cmle}$ and $\hat{\theta}_{mle}$ are generally not equal and may differ by a substantial amount if the data are close to being non-stationary or non-invertible.
AR(p), OLS equivalent to Conditional MLE

Model:

$$y_t = c + \phi_1 y_{t-1} + \cdots + \phi_p y_{t-p} + \varepsilon_t, \qquad \varepsilon_t \sim WN(0, \sigma^2)$$

$$= x_t' \beta + \varepsilon_t, \qquad \beta = (c, \phi_1, \phi_2, \ldots, \phi_p)', \qquad x_t = (1, y_{t-1}, y_{t-2}, \ldots, y_{t-p})'$$

OLS:

$$\hat{\beta} = \left( \sum_{t=1}^{T} x_t x_t' \right)^{-1} \sum_{t=1}^{T} x_t y_t, \qquad \hat{\sigma}^2 = \frac{1}{T - (p+1)} \sum_{t=1}^{T} (y_t - x_t' \hat{\beta})^2$$
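A hedged sketch of this OLS computation, assuming numpy is available; the function name `ar_ols`, the choice to build the regressor matrix only for $t = p+1, \ldots, T$ (i.e. conditioning on the first $p$ observations, as the conditional MLE does), and the intercept column are illustrative choices, not part of the notes.

```python
import numpy as np

def ar_ols(y, p):
    """Conditional MLE of an AR(p) with intercept via OLS:
    regress y_t on (1, y_{t-1}, ..., y_{t-p}) for t = p+1, ..., T."""
    y = np.asarray(y, dtype=float)
    T = y.size
    X = np.column_stack([np.ones(T - p)] +
                        [y[p - j:T - j] for j in range(1, p + 1)])
    yy = y[p:]
    beta = np.linalg.lstsq(X, yy, rcond=None)[0]   # (c, phi_1, ..., phi_p)
    resid = yy - X @ beta
    sigma2 = resid @ resid / (len(yy) - (p + 1))   # dof-corrected residual variance
    return beta, sigma2
```

For an AR(1), the slope coefficient returned here coincides with the conditional MLE of $\phi$ discussed below.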
Properties of the estimator

$\hat{\phi}$ is downward biased in a finite sample, i.e. $E[\hat{\phi}] < \phi$. The estimator may be biased but it is consistent: it converges in probability to $\phi$.
Example: MLE for stationary AR(1)

$$Y_t = c + \phi Y_{t-1} + \varepsilon_t, \quad t = 1, \ldots, T$$

$$\varepsilon_t \sim WN(0, \sigma^2), \qquad |\phi| < 1, \qquad \theta = (c, \phi, \sigma^2)'$$

Conditional on $I_{t-1}$,

$$y_t \mid I_{t-1} \sim N(c + \phi y_{t-1}, \sigma^2), \quad t = 2, \ldots, T$$

which only depends on $y_{t-1}$. The conditional density $f(y_t \mid I_{t-1}; \theta)$ is then

$$f(y_t \mid y_{t-1}; \theta) = (2\pi\sigma^2)^{-1/2} \exp\left\{ -\frac{1}{2\sigma^2} (y_t - c - \phi y_{t-1})^2 \right\}, \quad t = 2, \ldots, T$$
To determine the marginal density for the initial value $y_1$, recall that for a stationary AR(1) process

$$E[y_1] = \mu = \frac{c}{1 - \phi}, \qquad \mathrm{var}[y_1] = \frac{\sigma^2}{1 - \phi^2}$$

It follows that

$$y_1 \sim N\left( \frac{c}{1 - \phi}, \; \frac{\sigma^2}{1 - \phi^2} \right)$$

$$f(y_1; \theta) = \left( \frac{2\pi\sigma^2}{1 - \phi^2} \right)^{-1/2} \exp\left\{ -\frac{1 - \phi^2}{2\sigma^2} \left( y_1 - \frac{c}{1 - \phi} \right)^2 \right\}$$
The conditional log-likelihood function is

$$\sum_{t=2}^{T} \ln f(y_t \mid y_{t-1}; \theta) = -\frac{T-1}{2} \ln(2\pi) - \frac{T-1}{2} \ln(\sigma^2) - \frac{1}{2\sigma^2} \sum_{t=2}^{T} (y_t - c - \phi y_{t-1})^2$$

Notice that the conditional log-likelihood function has the form of the log-likelihood function for a linear regression model with normal errors

$$y_t = c + \phi y_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim N(0, \sigma^2), \quad t = 2, \ldots, T$$

It follows that

$$\hat{c}_{cmle} = \hat{c}_{ols}, \qquad \hat{\phi}_{cmle} = \hat{\phi}_{ols}$$

$$\hat{\sigma}^2_{cmle} = \frac{1}{T-1} \sum_{t=2}^{T} (y_t - \hat{c}_{cmle} - \hat{\phi}_{cmle} y_{t-1})^2$$
The marginal log-likelihood for the initial value $y_1$ is

$$\ln f(y_1; \theta) = -\frac{1}{2} \ln(2\pi) - \frac{1}{2} \ln\left( \frac{\sigma^2}{1 - \phi^2} \right) - \frac{1 - \phi^2}{2\sigma^2} \left( y_1 - \frac{c}{1 - \phi} \right)^2$$

The exact log-likelihood function is then

$$\ln L(\theta; y) = -\frac{T}{2} \ln(2\pi) - \frac{1}{2} \ln\left( \frac{\sigma^2}{1 - \phi^2} \right) - \frac{1 - \phi^2}{2\sigma^2} \left( y_1 - \frac{c}{1 - \phi} \right)^2 - \frac{T-1}{2} \ln(\sigma^2) - \frac{1}{2\sigma^2} \sum_{t=2}^{T} (y_t - c - \phi y_{t-1})^2$$
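A possible implementation of this exact log-likelihood and its numerical maximization, assuming numpy and scipy; the reparameterizations (tanh for $\phi$, log for $\sigma^2$) are my own device to keep the optimizer inside the stationary region and are not part of the notes.

```python
import numpy as np
from scipy.optimize import minimize

def ar1_exact_loglik(params, y):
    """Exact log-likelihood of a Gaussian AR(1): marginal of y_1 plus
    the conditional terms for t = 2, ..., T (the formula above)."""
    c, phi, sigma2 = params
    y = np.asarray(y, dtype=float)
    T = y.size
    mu1, var1 = c / (1 - phi), sigma2 / (1 - phi ** 2)
    ll = -0.5 * np.log(2 * np.pi * var1) - 0.5 * (y[0] - mu1) ** 2 / var1
    e = y[1:] - c - phi * y[:-1]
    ll += -0.5 * (T - 1) * np.log(2 * np.pi * sigma2) - 0.5 * e @ e / sigma2
    return ll

def ar1_exact_mle(y):
    """Maximize the exact log-likelihood over (c, phi, sigma2)."""
    def neg(p):
        c, a, logs2 = p
        phi = np.tanh(a)                  # keeps |phi| < 1
        return -ar1_exact_loglik((c, phi, np.exp(logs2)), y)
    res = minimize(neg, x0=np.array([0.0, 0.0, 0.0]), method="Nelder-Mead")
    c, a, logs2 = res.x
    return c, np.tanh(a), np.exp(logs2)
```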
Prediction Error Decomposition

To illustrate this algorithm, consider the simple AR(1) model. Recall,

$$y_t \mid I_{t-1} \sim N(c + \phi y_{t-1}, \sigma^2), \quad t = 2, \ldots, T$$

from which it follows that

$$E[y_t \mid I_{t-1}] = c + \phi y_{t-1}, \qquad \mathrm{var}[y_t \mid I_{t-1}] = \sigma^2$$

The 1-step ahead prediction errors may then be defined as

$$v_t = y_t - E[y_t \mid I_{t-1}] = y_t - c - \phi y_{t-1}, \quad t = 2, \ldots, T$$
The variance of the prediction error at time $t$ is

$$f_t = \mathrm{var}(v_t) = \mathrm{var}(\varepsilon_t) = \sigma^2, \quad t = 2, \ldots, T$$

For the initial value, the first prediction error and its variance are

$$v_1 = y_1 - E[y_1] = y_1 - \frac{c}{1 - \phi}, \qquad f_1 = \mathrm{var}(v_1) = \frac{\sigma^2}{1 - \phi^2}$$

Using the prediction errors and the prediction error variances, the exact log-likelihood function may be re-expressed as

$$\ln L(\theta \mid y) = -\frac{T}{2} \ln(2\pi) - \frac{1}{2} \sum_{t=1}^{T} \ln f_t - \frac{1}{2} \sum_{t=1}^{T} \frac{v_t^2}{f_t}$$

which is the prediction error decomposition. A further simplification may be achieved by writing

$$\mathrm{var}(v_t) = \sigma^2 f_t^*, \qquad f_t^* = \frac{1}{1 - \phi^2} \text{ for } t = 1, \qquad f_t^* = 1 \text{ for } t > 1$$

Then the log-likelihood becomes

$$\ln L(\theta \mid y) = -\frac{T}{2} \ln(2\pi) - \frac{T}{2} \ln \sigma^2 - \frac{1}{2} \sum_{t=1}^{T} \ln f_t^* - \frac{1}{2\sigma^2} \sum_{t=1}^{T} \frac{v_t^2}{f_t^*}$$
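The same exact log-likelihood can be evaluated through the prediction error decomposition above. A short sketch, assuming numpy; it should agree numerically with a direct evaluation of the formula on the previous slide.

```python
import numpy as np

def ar1_loglik_ped(y, c, phi, sigma2):
    """AR(1) exact log-likelihood via the prediction error decomposition:
    -T/2 ln(2*pi) - 1/2 sum ln f_t - 1/2 sum v_t^2 / f_t."""
    y = np.asarray(y, dtype=float)
    T = y.size
    v = np.empty(T)
    f = np.empty(T)
    v[0] = y[0] - c / (1 - phi)        # v_1 = y_1 - E[y_1]
    f[0] = sigma2 / (1 - phi ** 2)     # f_1 = var(v_1)
    v[1:] = y[1:] - c - phi * y[:-1]   # v_t = y_t - c - phi*y_{t-1}
    f[1:] = sigma2                     # f_t = sigma^2 for t > 1
    return (-0.5 * T * np.log(2 * np.pi)
            - 0.5 * np.sum(np.log(f))
            - 0.5 * np.sum(v ** 2 / f))
```

The larger $f_1$ down-weights the first squared error, which is exactly the adjustment the conditional likelihood ignores.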
MLE Estimation of MA(1)

Recall

$$Y_t = \mu + \theta \varepsilon_{t-1} + \varepsilon_t, \qquad |\theta| < 1, \qquad \varepsilon_t \sim WN(0, \sigma^2)$$

$|\theta| < 1$ is assumed for an invertible representation only; it says nothing about stationarity.
Estimation MA(1)

$$Y_t \mid \varepsilon_{t-1} \sim N(\mu + \theta \varepsilon_{t-1}, \sigma^2)$$

$$f(y_t \mid \varepsilon_{t-1}; \Theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2\sigma^2} (y_t - \mu - \theta \varepsilon_{t-1})^2 \right\}, \qquad \Theta = (\mu, \theta, \sigma^2)'$$

Problem: without knowing $\varepsilon_{t-2}$ we do not observe $\varepsilon_{t-1}$. We need to know $\varepsilon_{t-2}$ to compute $\varepsilon_{t-1} = y_{t-1} - \mu - \theta \varepsilon_{t-2}$. But $\varepsilon_{t-1}$ is unobservable. Assume $\varepsilon_0 = 0$: make it non-random, just fix it at the number 0. The recursion works with any fixed number.
Estimation MA(1)

$$Y_1 \mid \varepsilon_0 = 0 \sim N(\mu, \sigma^2)$$

$$y_1 = \mu + \varepsilon_1 \;\Rightarrow\; \varepsilon_1 = y_1 - \mu$$

$$y_2 = \mu + \varepsilon_2 + \theta \varepsilon_1 \;\Rightarrow\; \varepsilon_2 = y_2 - \mu - \theta (y_1 - \mu)$$

$$\varepsilon_t = (y_t - \mu) - \theta (y_{t-1} - \mu) + \cdots + (-1)^{t-1} \theta^{t-1} (y_1 - \mu)$$

Conditional likelihood:

$$L(\Theta \mid y_1, \ldots, y_T; \varepsilon_0 = 0) = \prod_{t=1}^{T} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{\varepsilon_t^2}{2\sigma^2} \right\}$$

If $|\theta| < 1$ (well below one), the choice of $\varepsilon_0$ does not matter asymptotically and the CMLE is consistent. Exact MLE requires the Kalman filter.
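A sketch of the conditional likelihood just described, assuming numpy: the residuals are built recursively from the fixed value $\varepsilon_0 = 0$ and plugged into the Gaussian log-density. The function name is illustrative.

```python
import numpy as np

def ma1_conditional_loglik(y, mu, theta, sigma2):
    """Conditional log-likelihood of an MA(1), y_t = mu + eps_t + theta*eps_{t-1},
    with eps_t recursed as eps_t = y_t - mu - theta*eps_{t-1} from eps_0 = 0."""
    y = np.asarray(y, dtype=float)
    eps = np.empty(y.size)
    prev = 0.0                        # the eps_0 = 0 assumption
    for t, yt in enumerate(y):
        eps[t] = yt - mu - theta * prev
        prev = eps[t]
    T = y.size
    return -0.5 * T * np.log(2 * np.pi * sigma2) - 0.5 * eps @ eps / sigma2
```

Exact MLE would instead treat $\varepsilon_0$ as random (as discussed in the next slides), which changes the first prediction error variance from $\sigma^2$ to $\sigma^2(1 + \theta^2)$.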
Why do we need Exact MLEs

In the estimation of MA(1) models, we assumed that $\varepsilon_0 = 0$ while calculating the sequence of $\varepsilon_t$'s. A more traditional approach is to estimate the unconditional or exact log-likelihood function by assuming that $\varepsilon_0$ is random, hence allowing it to follow some distribution, and using it to calculate the $\varepsilon_t$'s from the data. Such an allowance will affect not only the prediction error and the variance of $y_1$, but also the successive prediction errors and their variances. The practice of obtaining estimates of the parameters not by conditioning on the pre-sample values but by using such exact prediction errors and variances is called the exact or unconditional ML estimation method. Note that the problem of obtaining the sequence of $\varepsilon_t$'s arises only in pure MA or mixed models.
Exact MLEs

What are the advantages of using such a procedure? To understand that, an examination of the first prediction error and its variance is instructive. For example, in an MA(1) process, the first prediction error is the same, $y_1 - \mu$, in both the conditional and exact ML estimation procedures. But the assumption that $\varepsilon_0 = 0$ means that $\mathrm{var}(v_1) = \mathrm{var}(\varepsilon_1) = \sigma^2$, whereas, if we allow $\varepsilon_0$ to be random, $\mathrm{var}(v_1) = \sigma^2 (1 + \theta^2)$. This will be reflected in $f_1$ in the exact log-likelihood function, and in typically small sample sizes such an assumption may matter a lot for the estimates. Besides, if $\theta$ happens to be close to unity, the differences will be even more significant.
Exact MLEs

But estimating the exact log-likelihood function is difficult and, until some time back, was a costly exercise in terms of computing; in these days of advanced and cheap computing facilities, however, cost should not be a deterrent to using the exact ML estimation method. Our job becomes easier by noting that, for any ARMA model, it can be shown that the two key equations, namely those for the prediction error and the prediction error variance, are recursive in nature, and the literature shows that these two can be calculated by using two popular methods, viz. (1) the triangular factorization (TF) method and (2) the Kalman filter recursions.
Estimation

For the Gaussian AR(1) process,

$$Y_t = c + \phi Y_{t-1} + \varepsilon_t, \qquad |\phi| < 1, \qquad \varepsilon_t \sim WN(0, \sigma^2)$$

the joint distribution of $Y_T = (Y_1, Y_2, \ldots, Y_T)'$ is

$$Y_T \sim N(\mu, \Omega)$$

The observations $y = (y_1, y_2, \ldots, y_T)'$ are a single realization of $Y_T$.
MLE AR(1)

$$\begin{pmatrix} Y_1 \\ \vdots \\ Y_T \end{pmatrix} \sim N(\mu, \Omega), \qquad
\mu = \begin{pmatrix} \mu \\ \vdots \\ \mu \end{pmatrix}, \qquad
\Omega = \begin{pmatrix} \gamma_0 & \gamma_1 & \cdots & \gamma_{T-1} \\ \gamma_1 & \gamma_0 & \cdots & \gamma_{T-2} \\ \vdots & \vdots & \ddots & \vdots \\ \gamma_{T-1} & \gamma_{T-2} & \cdots & \gamma_0 \end{pmatrix}$$
MLE AR(1)

The p.d.f. of the sample $y = (y_1, y_2, \ldots, y_T)'$ is given by the multivariate normal density

$$f_Y(y; \mu, \Omega) = (2\pi)^{-T/2} |\Omega|^{-1/2} \exp\left\{ -\frac{1}{2} (y - \mu)' \Omega^{-1} (y - \mu) \right\}$$

Denoting $\Omega = \sigma_y^2 V$ with $V_{ij} = \rho(|i-j|)$,

$$\Omega = \begin{pmatrix} \gamma_0 & \cdots & \gamma_{T-1} \\ \vdots & \ddots & \vdots \\ \gamma_{T-1} & \cdots & \gamma_0 \end{pmatrix}
= \gamma_0 \begin{pmatrix} 1 & \cdots & \gamma_{T-1}/\gamma_0 \\ \vdots & \ddots & \vdots \\ \gamma_{T-1}/\gamma_0 & \cdots & 1 \end{pmatrix}$$

$$\Omega = \sigma_y^2 V = \sigma_y^2 \begin{pmatrix} 1 & \cdots & \rho(T-1) \\ \vdots & \ddots & \vdots \\ \rho(T-1) & \cdots & 1 \end{pmatrix}, \qquad \rho(j) = \phi^j$$

Collecting the parameters of the model in $\theta = (c, \phi, \sigma^2)'$, the joint p.d.f. becomes

$$f_Y(y; \theta) = (2\pi\sigma_y^2)^{-T/2} |V|^{-1/2} \exp\left\{ -\frac{1}{2\sigma_y^2} (y - \mu)' V^{-1} (y - \mu) \right\}$$

and the sample log-likelihood function is given by

$$L(\theta) = -\frac{T}{2} \log(2\pi) - \frac{T}{2} \log(\sigma_y^2) - \frac{1}{2} \log|V| - \frac{1}{2\sigma_y^2} (y - \mu)' V^{-1} (y - \mu)$$
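A sketch of evaluating this log-likelihood directly from the multivariate normal density, assuming numpy and scipy; the Toeplitz correlation matrix $V$ with $V_{ij} = \phi^{|i-j|}$ is built explicitly, which is only practical for moderate $T$ and is precisely the motivation for the factorization methods discussed below.

```python
import numpy as np
from scipy.linalg import toeplitz
from scipy.stats import multivariate_normal

def ar1_loglik_mvn(y, c, phi, sigma2):
    """Exact AR(1) log-likelihood from y ~ N(mu*1, sigma_y^2 * V), V_ij = phi^|i-j|."""
    y = np.asarray(y, dtype=float)
    T = y.size
    mu = c / (1 - phi)
    sigma2_y = sigma2 / (1 - phi ** 2)     # gamma_0
    V = toeplitz(phi ** np.arange(T))      # correlation matrix, V_ij = phi^|i-j|
    return multivariate_normal.logpdf(y, mean=np.full(T, mu), cov=sigma2_y * V)
```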
MLE

The exact log-likelihood function is a non-linear function of the parameters $\theta$. There is no closed form solution for the exact mles.

The exact mles must be determined by numerically maximizing the exact log-likelihood function. Usually, a Newton-Raphson type algorithm is used for the maximization, which leads to the iterative scheme

$$\hat{\theta}_{mle, n+1} = \hat{\theta}_{mle, n} - \hat{H}(\hat{\theta}_{mle, n})^{-1} s(\hat{\theta}_{mle, n})$$

where $\hat{H}(\hat{\theta})$ is an estimate of the Hessian matrix (2nd derivative of the log-likelihood function), and $s(\hat{\theta}_{mle, n})$ is an estimate of the score vector (1st derivative of the log-likelihood function).

The estimates of the Hessian and score may be computed numerically (using numerical derivative routines) or they may be computed analytically (if analytic derivatives are known).
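A generic sketch of this Newton-Raphson scheme with numerically approximated score and Hessian (central finite differences), assuming numpy; in practice a library optimizer with safeguards (step control, positive-definiteness checks) would normally be preferred, and none of the names below come from the notes.

```python
import numpy as np

def numerical_score(loglik, theta, h=1e-5):
    """Central-difference approximation to the score (gradient) vector."""
    g = np.zeros_like(theta, dtype=float)
    for i in range(theta.size):
        e = np.zeros_like(theta, dtype=float); e[i] = h
        g[i] = (loglik(theta + e) - loglik(theta - e)) / (2 * h)
    return g

def numerical_hessian(loglik, theta, h=1e-4):
    """Finite-difference Hessian built from the numerical score."""
    k = theta.size
    H = np.zeros((k, k))
    for i in range(k):
        e = np.zeros(k); e[i] = h
        H[:, i] = (numerical_score(loglik, theta + e) -
                   numerical_score(loglik, theta - e)) / (2 * h)
    return 0.5 * (H + H.T)                 # symmetrize

def newton_raphson(loglik, theta0, tol=1e-8, max_iter=100):
    """theta_{n+1} = theta_n - H(theta_n)^{-1} s(theta_n)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        s = numerical_score(loglik, theta)
        H = numerical_hessian(loglik, theta)
        step = np.linalg.solve(H, s)
        theta = theta - step
        if np.max(np.abs(step)) < tol:
            break
    return theta
```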
Factorization

Note that for large $T$, $\Omega$ might be large and difficult to invert. Since $\Omega$ is a positive definite symmetric matrix, there exists a unique triangular factorization of $\Omega$:

$$\Omega = A f A'$$

where

$$f_{T \times T} = \begin{pmatrix} f_1 & 0 & \cdots & 0 \\ 0 & f_2 & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & f_T \end{pmatrix}, \qquad f_t > 0 \text{ for all } t \text{ (diagonal matrix)}$$

$$A_{T \times T} = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ a_{21} & 1 & & \vdots \\ \vdots & & \ddots & 0 \\ a_{T1} & a_{T2} & \cdots & 1 \end{pmatrix}$$
Likelihood

The likelihood function can be rewritten as:

$$L(\theta \mid y_T) = (2\pi)^{-T/2} \det(A f A')^{-1/2} \exp\left\{ -\frac{1}{2} (y_T - \mu)' (A f A')^{-1} (y_T - \mu) \right\}$$

This is done by converting the correlated variables $y_1, y_2, \ldots, y_T$ into a collection, say $\nu_1, \nu_2, \ldots, \nu_T$, of uncorrelated variables. In the following, let $P_j$ denote the projection onto the random variables in $X_j = (y_1, \ldots, y_j)$. Define

$$\nu = A^{-1} (y_T - \mu) \quad \text{(prediction error)},$$

so that

$$A \nu = y_T - \mu.$$
Since $A$ is a lower-triangular matrix with 1's along the principal diagonal,

$$\nu_1 = y_1 - \mu$$

$$\nu_2 = y_2 - P_1 y_2 = (y_2 - \mu) - a_{21} \nu_1$$

$$\nu_3 = y_3 - P_2 y_3 = (y_3 - \mu) - a_{31} \nu_1 - a_{32} \nu_2$$

$$\vdots$$

$$\nu_T = (y_T - \mu) - \sum_{i=1}^{T-1} a_{Ti} \nu_i$$

Also, since $A$ is lower triangular with 1's along the principal diagonal, $\det(A) = 1$ and

$$\det(A f A') = \det(A) \cdot \det(f) \cdot \det(A') = \det(f).$$

Then

$$L(\theta \mid y_T) = (2\pi)^{-T/2} \det(f)^{-1/2} \exp\left\{ -\frac{1}{2} \nu' f^{-1} \nu \right\} = \prod_{t=1}^{T} \left[ \frac{1}{\sqrt{2\pi f_t}} \exp\left\{ -\frac{\nu_t^2}{2 f_t} \right\} \right],$$

where $\nu_t$ is the $t$-th element of $\nu_{T \times 1}$, the prediction error $y_t - \hat{y}_{t \mid t-1}$, and $\hat{y}_{t \mid t-1}$ is the linear prediction of $y_t$ from $y_1, \ldots, y_{t-1}$, $t = 2, 3, \ldots, T$, whose coefficients are given by the corresponding elements $a^*_{t,i}$ of $A^{-1}$.
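A sketch, assuming numpy and scipy, of how the factorization $\Omega = A f A'$ turns the joint Gaussian density into a product over prediction errors: here $A$ and $f$ are obtained from the Cholesky factor of $\Omega$ (an equivalent route to the triangular factorization), and solving $A \nu = y - \mu$ yields the prediction errors without ever inverting $\Omega$. The AR(1) numbers at the bottom are made up for illustration.

```python
import numpy as np
from scipy.linalg import toeplitz, solve_triangular

def gaussian_loglik_via_factorization(y, mu, omega):
    """Evaluate the N(mu, omega) log-likelihood through omega = A f A',
    with A unit lower triangular and f diagonal (built from the Cholesky factor)."""
    y = np.asarray(y, dtype=float)
    L = np.linalg.cholesky(omega)      # omega = L L'
    d = np.diag(L)
    A = L / d                          # unit lower triangular (columns rescaled)
    f = d ** 2                         # prediction error variances f_t
    nu = solve_triangular(A, y - mu, lower=True, unit_diagonal=True)  # A nu = y - mu
    return -0.5 * (y.size * np.log(2 * np.pi) + np.sum(np.log(f)) + np.sum(nu ** 2 / f))

# Illustration with the AR(1) covariance matrix from the previous slides
# (phi, sigma2 and the series y are made-up numbers).
phi, sigma2, T = 0.5, 1.0, 5
omega = sigma2 / (1 - phi ** 2) * toeplitz(phi ** np.arange(T))
y = np.array([0.1, 0.4, -0.2, 0.3, 0.0])
print(gaussian_loglik_via_factorization(y, np.zeros(T), omega))
```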
Kalman Filters

The Kalman filter comprises a set of mathematical equations which result from an optimal recursive solution given by the least squares method. The purpose of this solution is to compute a linear, unbiased and optimal estimator of a system's state at time $t$, based on the information available at $t-1$, and to update these estimates with the additional information available at $t$. The filter's performance assumes that the system can be described through a stochastic linear model with an associated error following a normal distribution with mean zero and known variance.
Kalman Filters

The Kalman filter is the main algorithm for estimating dynamic systems in state-space form. This representation of the system is described by a set of state variables. The state contains all information relative to the system at a given point in time. This information should allow us to infer the system's past behaviour, with the aim of predicting its future behaviour.
DEVELOPING THE KALMAN FILTER ALGORITHM

The basic building blocks of a Kalman Filter are two equations: the measurement equation and the transition equation. The measurement equation relates the unobserved data ($x_t$, where $t$ indicates a point in time) to the observable data ($y_t$, where $t$ indicates a point in time):

$$y_t = m \cdot x_t + v_t \qquad (1)$$

where $E(v_t) = 0$ and $\mathrm{var}(v_t) = r_t$. The transition equation is based on a model that allows the unobserved data to change through time:

$$x_{t+1} = a \cdot x_t + w_t \qquad (2)$$

where $E(w_t) = 0$ and $\mathrm{var}(w_t) = q_t$. The process starts with an initial estimate for $x_t$, call it $x_0$, which has a mean of $\mu_0$ and a standard deviation of $s_0$. Using the expectation of equation (2), a prediction for $x_1$ emerges, call it $x^*_1$:

$$x^*_1 = E(a \cdot x_0 + w_0) = a \cdot \mu_0 \qquad (3)$$

The predicted value from equation (3) is then inserted into equation (1) and again taken as an expectation to produce a prediction for $y_1$, call it $y^*_1$:

$$y^*_1 = E(m \cdot x^*_1 + v_0) = m \cdot E(x^*_1) = m \cdot a \cdot \mu_0 \qquad (4)$$

Thus far, predictions of the future are based on expectations and not on the variance or standard deviation associated with the predicted variables. The variance will eventually be incorporated to produce better estimates. However, the next step is to compare the predicted $y_1$ (i.e. $y^*_1$) with the actual $y_1$ when it occurs. In equation (5), the expectation of the predicted value and the actual value for $y_1$ are compared to produce the predicted (or expected) error, $y^e_1$:

$$y^e_1 = E(y_1 - y^*_1) = y_1 - m \cdot E(x^*_1) = y_1 - m \cdot a \cdot \mu_0 \qquad (5)$$
Given the error in predicting $y_1$, which is based on the expectation of $x_1$ from equation (3), a new estimate for $x_1$ is considered, $x_{1E}$. Notice this is different from $x^*_1$ because $x_{1E}$ incorporates the prediction error of $y_1$. Equation (6) identifies $x_{1E}$ as an expectation of an adjusted $x^*_1$:

$$x_{1E} = E[x^*_1 + k_1 \cdot y^e_1] = E[x^*_1] + k_1 \cdot \big(y_1 - E(m \cdot x^*_1)\big) = a \cdot \mu_0 + k_1 \cdot (y_1 - m \cdot a \cdot \mu_0) \qquad (6)$$

$k_1$ (or more generically, $k_t$) in equation (6) is referred to as the Kalman gain and incorporates the variance of $x^*_1$ (denoted $p^*_1$, or generically $p^*_t$) and the variance of $y^*_1$ (see the denominator in equation (8) below):

$$\mathrm{var}(x^*_1) = \mathrm{var}(a \cdot x_0) + \mathrm{var}(w_0) = s_0^2 \cdot a^2 + q_0 = p^*_1 \qquad (7)$$

$$k_1 = \frac{m \cdot p^*_1}{p^*_1 \cdot m^2 + r_0} = \frac{m \cdot (s_0^2 \cdot a^2 + q_0)}{(s_0^2 \cdot a^2 + q_0) \cdot m^2 + r_0} \qquad (8)$$
The cycle starts over with $x_{1E}$ taking the place of $x_0$ in equation (3) and used to forecast $y_2$. The mean of $x_{1E}$ is the value of the expectation calculated in equation (6). Notice, the mean of $x_{1E}$ incorporates the mean of $x_0$, the variance of $x_0$ (via the Kalman gain), the variance of the error in the measurement equation (via the Kalman gain), the variance of the error in the transition equation (via the Kalman gain), and the observed $y_1$. The variance of $x_{1E}$ is

$$\mathrm{var}(x_{1E}) = p^*_1 \cdot [1 - k_1 \cdot m] = p^*_1 \cdot \left[ 1 - \frac{1}{1 + \dfrac{r_0}{(s_0^2 \cdot a^2 + q_0) \cdot m^2}} \right] \qquad (9)$$

Notice, the variance of $x_{1E}$ is reduced relative to the variance of $x^*_1$. Further, because the distributional aspects of each estimated value of $x_t$ are known (assuming a model), the model parameters within the measurement and transition equations can be optimized using maximum likelihood estimation. An iterative sequence of filtering followed by parameter estimation eventually optimizes the entire system.
Predict the future unobserved variable ($x$) based on the current estimate of the unobserved variable:

$$x^*_{t+1} = E(a \cdot x_{tE} + w_t) \qquad (10)$$

Use the predicted unobserved variable to predict the future observable variable ($y$):

$$y^*_{t+1} = E(m \cdot x^*_{t+1} + v_t) \qquad (11)$$

When the future observable variable actually occurs, calculate the error in the prediction:

$$y^e_{t+1} = E(y_{t+1} - y^*_{t+1}) \qquad (12)$$

Generate a better estimate of the unobserved variable at time $(t+1)$ and start the process over for time $(t+2)$:

$$x_{(t+1)E} = E[x^*_{t+1} + k_{t+1} \cdot y^e_{t+1}] \qquad (13)$$

Note: $k_{t+1}$ is the Kalman gain and is based on the variance of the predicted variables in the first and second steps of the process:

$$k_{t+1} = \frac{m \cdot \mathrm{var}(x^*_{t+1})}{\mathrm{var}(y^*_{t+1})} \qquad (14)$$
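A sketch of the univariate recursions in equations (3)-(14), assuming numpy; the function returns the one-step predictions of $y_t$ and their variances, which are exactly the ingredients the likelihood in the next section needs. The scalars m, a, q, r and the initial moments mu0, s20 follow the notation above but are otherwise illustrative.

```python
import numpy as np

def kalman_filter_1d(y, m, a, q, r, mu0, s20):
    """Univariate Kalman filter for y_t = m*x_t + v_t, x_{t+1} = a*x_t + w_t.
    Returns the one-step predictions of y_t and their variances."""
    y = np.asarray(y, dtype=float)
    T = y.size
    y_pred = np.empty(T)        # y*_t        (eq. 4 / 11)
    y_var = np.empty(T)         # var(y*_t) = p*_t m^2 + r
    x_est, p_est = mu0, s20     # x_{(t-1)E} and its variance
    for t in range(T):
        x_pred = a * x_est                       # eq. (3)/(10): predict the state
        p_pred = a ** 2 * p_est + q              # eq. (7): variance of the prediction
        y_pred[t] = m * x_pred                   # eq. (4)/(11): predict the observable
        y_var[t] = m ** 2 * p_pred + r           # denominator of the gain, eq. (8)
        k = m * p_pred / y_var[t]                # eq. (8)/(14): Kalman gain
        x_est = x_pred + k * (y[t] - y_pred[t])  # eq. (6)/(13): update the state
        p_est = p_pred * (1 - k * m)             # eq. (9): updated variance
    return y_pred, y_var
```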
APPLYING MAXIMUM LIKELIHOOD ESTIMATION TO THE KALMAN FILTER

The Kalman Filter provides output throughout the time series in the form of estimated values (e.g. $x_{1E}$ from equation (6)), which are the means/expectations of the unobserved variables $x_1, \ldots, x_T$ for every time period $t$, with associated variances provided by equation (9). (Note: $x_0$ is also from a distribution, with a mean of $\mu_0$ and a variance of $s_0^2$, and is part of the time series.) Keeping with the univariate structure from the previous section and assuming $T$ time periods of data, a maximum likelihood estimation (MLE) is imposed by further assuming $x_0$, $w_1, \ldots, w_T$, and $v_1, \ldots, v_T$ are jointly normal and uncorrelated. Consequently, a joint likelihood function exists:

$$\left[ \frac{1}{\sqrt{2\pi s_0^2}} e^{-\frac{(x_0 - \mu_0)^2}{2 s_0^2}} \right] \cdot \left[ \left( \frac{1}{\sqrt{2\pi q_t}} \right)^{T} e^{-\sum_{t=1}^{T} \frac{(x_t - E(x_t))^2}{2 q_t}} \right] \cdot \left[ \left( \frac{1}{\sqrt{2\pi r_t}} \right)^{T} e^{-\sum_{t=1}^{T} \frac{(y_t - E(y_t))^2}{2 r_t}} \right] \qquad (15)$$

Recall: $x_0$ is distributed $N(\mu_0, s_0^2)$, $w_t$ is distributed $N(0, q_t)$, and $v_t$ is distributed $N(0, r_t)$.
The problem with the likelihood function in equation (15) is that $x_t$ is not observable. However, as noted above, the Kalman Filter provides means and variances for $x_t$ throughout the time series. Consequently, remove the first two terms, define the mean of $y_t$ as $y^*_t = m \cdot x^*_t = a \cdot m \cdot x_{(t-1)E}$, define the variance of $y_t$ as $p^*_t \cdot m^2 + r_{t-1}$, and keep the assumption of $y_t$ being normally distributed. Notice, the variance definition (via $p^*_t$; see the denominator in equation (8)) and the mean definition incorporate the distributional properties of $x_t$ while dealing with the observable $y_t$. Further, the initial conditions are introduced when $t$ equals one, as $x_{0E} = \mu_0$ and through $p^*_1$ (see equation (7)). The new likelihood function becomes

$$\prod_{t=1}^{T} \left[ \frac{1}{\sqrt{2\pi [p^*_t m^2 + r_{t-1}]}} \, e^{-\frac{(y_t - a \cdot m \cdot x_{(t-1)E})^2}{2 [p^*_t m^2 + r_{t-1}]}} \right] \qquad (16)$$
To simplify the math, take the natural logarithm of the likelihood function, creating the log-likelihood function:

$$-\frac{T \ln(2\pi)}{2} - \frac{1}{2} \sum_{t=1}^{T} \ln[p^*_t m^2 + r_{t-1}] - \frac{1}{2} \sum_{t=1}^{T} \frac{(y_t - a \cdot m \cdot x_{(t-1)E})^2}{p^*_t m^2 + r_{t-1}} \qquad (17)$$
Further assume that $r_t$ and $q_t$ (contained in $p^*_t$) are constant throughout time. Consequently, equation (17) has a value, which is called a score. The next step is to maximize the log-likelihood function using the partial derivatives for $r$ ($= r_t$ for all $t$), $q$ ($= q_t$ for all $t$), and $a$. After solving for these three parameters by setting the partial derivatives to zero, the Kalman Filter is re-estimated and then the maximum likelihood procedure is applied again, until the score improves by less than a particular value (say 0.0001), indicating convergence. The iterative use of the MLE procedure is called the Expectation Maximization (EM) algorithm.
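A sketch of the log-likelihood in equation (17) evaluated from the filter recursions and maximized numerically, assuming numpy and scipy; a general-purpose optimizer over (a, q, r) is used here for illustration in place of the EM iteration described above, and m, mu0, s20 are held fixed for simplicity.

```python
import numpy as np
from scipy.optimize import minimize

def kalman_loglik(params, y, m=1.0, mu0=0.0, s20=1.0):
    """Equation (17): Gaussian log-likelihood of y_1, ..., y_T built from the
    one-step prediction errors and variances of the Kalman filter."""
    a, log_q, log_r = params
    q, r = np.exp(log_q), np.exp(log_r)      # keep the variances positive
    x_est, p_est = mu0, s20
    ll = 0.0
    for yt in np.asarray(y, dtype=float):
        x_pred, p_pred = a * x_est, a ** 2 * p_est + q
        y_pred, y_var = m * x_pred, m ** 2 * p_pred + r
        ll += -0.5 * (np.log(2 * np.pi * y_var) + (yt - y_pred) ** 2 / y_var)
        k = m * p_pred / y_var
        x_est = x_pred + k * (yt - y_pred)
        p_est = p_pred * (1 - k * m)
    return ll

def fit_state_space(y):
    """Maximize (17) over (a, q, r) with a generic optimizer (illustrative only)."""
    res = minimize(lambda p: -kalman_loglik(p, y),
                   x0=np.array([0.5, 0.0, 0.0]), method="Nelder-Mead")
    a, log_q, log_r = res.x
    return a, np.exp(log_q), np.exp(log_r)
```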