
1 An overview of almost sure convergence

An attractive feature of almost sure convergence (which we define below) is that it often
reduces the problem to the convergence of deterministic sequences.
First we go through some definitions (these are not very formal). Let $\Omega_t$ be the set of
all possible outcomes (or realisations) at the point $t$, and define the random variable $Y_t$ as
the function $Y_t : \Omega_t \rightarrow \mathbb{R}$. Define the set of possible outcomes over all time as $\Omega = \prod_{t=1}^{\infty} \Omega_t$,
and the random variables $X_t : \Omega \rightarrow \mathbb{R}$, where for every $\omega \in \Omega$, with $\omega = (\omega_1, \omega_2, \ldots)$, we
have $X_t(\omega) = Y_t(\omega_t)$. Hence we have a sequence of random variables $\{X_t\}_t$ (which we call
a random process). When we observe $\{x_t\}_t$, this means there exists an $\omega \in \Omega$ such that
$X_t(\omega) = x_t$. To complete things we have a sigma-algebra $\mathcal{F}$ whose elements are subsets of
$\Omega$, and a probability measure $P : \mathcal{F} \rightarrow [0, 1]$. But we do not have to worry too much about
this.
Definition 1 We say that the sequence $\{X_t\}$ converges almost surely to $\mu$ if there exists a
set $M \subset \Omega$, such that $P(M) = 1$, and for every $\omega \in M$ we have
$$X_t(\omega) \rightarrow \mu.$$
In other words, for every $\varepsilon > 0$ there exists an $N(\varepsilon, \omega)$ such that
$$|X_t(\omega) - \mu| < \varepsilon, \qquad (1)$$
for all $t > N(\varepsilon, \omega)$. We denote "$X_t \rightarrow \mu$ almost surely" as $X_t \stackrel{a.s.}{\rightarrow} \mu$.
We see in (1) how the definition is reduced to a nonrandom definition. There is an equivalent
definition, stated explicitly in terms of probabilities, which we do not give here.
The object of this handout is to give conditions under which an estimator $\hat{a}_n$ converges
almost surely to the parameter we are interested in estimating, and to derive its limiting
distribution.
2 Strong consistency of an estimator
2.1 The autoregressive process
A simple example of a sequence of random variables $\{X_t\}$ which are dependent is the
autoregressive process. The AR(1) process satisfies the representation
$$X_t = aX_{t-1} + \varepsilon_t, \qquad (2)$$
where $|a| < 1$ and $\{\varepsilon_t\}_t$ are iid random variables with $E(\varepsilon_t) = 0$ and $E(\varepsilon_t^2) = 1$. It has the
unique causal solution
$$X_t = \sum_{j=0}^{\infty} a^j \varepsilon_{t-j}.$$
From the above we see that $E(X_t^2) = (1 - a^2)^{-1}$. Let $\sigma^2 := E(X_0^2) = (1 - a^2)^{-1}$.
Question What can we say about the limit of
$$\hat{\sigma}_n^2 = \frac{1}{n}\sum_{t=1}^{n} X_t^2\,?$$
If $E(\varepsilon_t^4) < \infty$, then we can show that $E(\hat{\sigma}_n^2 - \sigma^2)^2 \rightarrow 0$ (Exercise).
What can we say about almost sure convergence?
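Before turning to the theory, this convergence is easy to see in simulation. The following is a minimal sketch assuming NumPy is available; the helper name `simulate_ar1` and the burn-in length are choices made here, not part of the handout:

```python
import numpy as np

def simulate_ar1(n, a, rng, burn_in=500):
    """Simulate X_t = a X_{t-1} + eps_t with iid N(0,1) innovations; a burn-in
    is discarded so the returned path approximates the stationary solution."""
    eps = rng.standard_normal(n + burn_in)
    x = np.zeros(n + burn_in)
    for t in range(1, n + burn_in):
        x[t] = a * x[t - 1] + eps[t]
    return x[burn_in:]

rng = np.random.default_rng(0)
a = 0.5
x = simulate_ar1(100_000, a, rng)
sigma2_hat = np.mean(x ** 2)       # (1/n) sum_t X_t^2
sigma2 = 1.0 / (1.0 - a ** 2)      # theoretical E(X_0^2) = (1 - a^2)^{-1}
print(sigma2_hat, sigma2)          # close for large n
```

Increasing $n$ shrinks the gap between the sample second moment and $\sigma^2$, which is exactly the convergence studied below.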
2.2 The strong law of large numbers
Recall the strong law of large numbers: suppose $\{X_t\}_t$ is an iid sequence and $E(|X_0|) < \infty$;
then by the SLLN we have that
$$\frac{1}{n}\sum_{t=1}^{n} X_t \stackrel{a.s.}{\rightarrow} E(X_0);$$
for the proof see Grimmett and Stirzaker (1994).
What happens when the random variables $\{X_t\}$ are dependent?
In this case we use the notion of ergodicity to obtain a similar result.
2.3 Some ergodic tools
Suppose $\{Y_t\}$ is an ergodic sequence of random variables (which implies that $\{Y_t\}_t$ are
identically distributed random variables). For the definition of ergodicity and a full treatment
see, for example, Billingsley (1965).
A simple example of an ergodic sequence is $\{Z_t\}$, where $\{Z_t\}_t$ are iid random variables.
Theorem 1 (One variation of the ergodic theorem) If $\{Y_t\}$ is an ergodic sequence and $g$ is a
function with $E(|g(Y_0)|) < \infty$, then we have
$$\frac{1}{n}\sum_{t=1}^{n} g(Y_t) \stackrel{a.s.}{\rightarrow} E(g(Y_0)).$$
We give some sufficient conditions for a process to be ergodic. The theorem below is a
simplified version of Stout (1974), Theorem 3.5.8.
Theorem 2 Suppose $\{Z_t\}$ is an ergodic sequence (for example, iid random variables) and
$g : \mathbb{R}^{\infty} \rightarrow \mathbb{R}$ is a continuous function. Then the sequence $\{Y_t\}_t$, where
$$Y_t = g(Z_t, Z_{t-1}, \ldots),$$
is an ergodic process.
(i) Example I Let us consider the sequence $\{X_t\}$ which satisfies the AR(1) representation.
We will show, by using Theorem 2, that $\{X_t\}$ is an ergodic process. We know
that $X_t$ has the unique (causal) solution
$$X_t = \sum_{j=0}^{\infty} a^j\varepsilon_{t-j}.$$
The solution motivates us to define the function
$$g(x_0, x_1, \ldots) = \sum_{j=0}^{\infty} a^j x_j.$$
We have that
$$|g(x_0, x_1, \ldots) - g(y_0, y_1, \ldots)| = \Big|\sum_{j=0}^{\infty} a^j x_j - \sum_{j=0}^{\infty} a^j y_j\Big| \leq \sum_{j=0}^{\infty} |a|^j |x_j - y_j|.$$
Therefore, if $\sup_j |x_j - y_j| \leq \varepsilon(1 - |a|)$, then $|g(x_0, x_1, \ldots) - g(y_0, y_1, \ldots)| \leq \varepsilon$. Hence
the function $g$ is continuous (under the sup-norm), which implies, by using Theorem 2,
that $\{X_t\}$ is an ergodic process.
Application Using the ergodic theorem we have that
$$\frac{1}{n}\sum_{t=1}^{n} X_t^2 \stackrel{a.s.}{\rightarrow} \sigma^2.$$
(ii) Example II A stochastic process often used in finance is the ARCH process. $\{X_t\}$ is
said to be an ARCH(1) process if it satisfies the representation
$$X_t = \sigma_t Z_t, \qquad \sigma_t^2 = a_0 + a_1 X_{t-1}^2,$$
where $\{Z_t\}_t$ are iid random variables with $E(Z_t) = 0$ and $E(Z_t^2) = 1$. For $0 \leq a_1 < 1$ it
has almost surely the solution
$$X_t^2 = a_0\sum_{j=0}^{\infty} a_1^j \prod_{i=0}^{j} Z_{t-i}^2.$$
Using the arguments above we can show that $\{X_t^2\}_t$ is an ergodic process.
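As with the AR(1) example, the behaviour of the ARCH(1) process is easy to explore by simulation. A minimal sketch assuming NumPy and standard Gaussian multipliers $Z_t$ (the helper name `simulate_arch1` and the burn-in are choices made here):

```python
import numpy as np

def simulate_arch1(n, a0, a1, rng, burn_in=500):
    """Simulate an ARCH(1) process X_t = sigma_t Z_t, sigma_t^2 = a0 + a1 X_{t-1}^2,
    with iid N(0,1) multipliers Z_t."""
    z = rng.standard_normal(n + burn_in)
    x = np.zeros(n + burn_in)
    for t in range(1, n + burn_in):
        x[t] = np.sqrt(a0 + a1 * x[t - 1] ** 2) * z[t]
    return x[burn_in:]

rng = np.random.default_rng(1)
a0, a1 = 0.5, 0.3
x = simulate_arch1(200_000, a0, a1, rng)
# Taking expectations in sigma_t^2 = a0 + a1 X_{t-1}^2 gives, for a1 < 1, the
# stationary variance E(X_t^2) = a0 / (1 - a1); by ergodicity the sample
# second moment converges to it.
var_theory = a0 / (1.0 - a1)
print(np.mean(x ** 2), var_theory)
```

The agreement of the sample second moment with $a_0/(1 - a_1)$ illustrates the ergodic theorem applied to $\{X_t^2\}_t$.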
3 Likelihood estimation
Our object here is to evaluate the maximum likelihood estimator of the AR(1) parameter
and to study its asymptotic properties. We recall that the maximum likelihood estimator is
the parameter which maximises the joint density of the observations. Since the log-likelihood
often has a simpler form, we usually maximise the log density rather than the density itself
(both yield the same estimator, since the logarithm is monotonic).
Suppose we observe $\{X_t; t = 1, \ldots, n\}$, where $X_t$ are observations from an AR(1) process.
Let $F_\varepsilon$ and $f_\varepsilon$ be the distribution function and the density function of $\varepsilon_t$, respectively. We first
note that the AR(1) process is Markovian, that is,
$$P(X_t \leq x | X_{t-1}, X_{t-2}, \ldots) = P(X_t \leq x | X_{t-1}), \qquad (3)$$
and hence
$$f_a(X_t | X_{t-1}, X_{t-2}, \ldots) = f_a(X_t | X_{t-1}).$$
The Markov property says that the distribution of $X_t$ given the entire past is the same as the
distribution of $X_t$ given $X_{t-1}$ alone.
To prove (3) we see that
$$P(X_t \leq x_t | X_{t-1} = x_{t-1}, X_{t-2} = x_{t-2}, \ldots) = P(aX_{t-1} + \varepsilon_t \leq x_t | X_{t-1} = x_{t-1}, X_{t-2} = x_{t-2}, \ldots)$$
$$= P(\varepsilon_t \leq x_t - ax_{t-1} | X_{t-1} = x_{t-1}, X_{t-2} = x_{t-2}, \ldots)$$
$$= P_\varepsilon(\varepsilon_t \leq x_t - ax_{t-1}) = P(X_t \leq x_t | X_{t-1} = x_{t-1}),$$
where the third equality uses that $\varepsilon_t$ is independent of the past. Hence the process satisfies
the Markov property.
By using the above we have
$$P(X_t \leq x | X_{t-1}) = P_\varepsilon(\varepsilon_t \leq x - aX_{t-1}) = F_\varepsilon(x - aX_{t-1}),$$
hence
$$f_a(X_t | X_{t-1}) = f_\varepsilon(X_t - aX_{t-1}). \qquad (4)$$
Evaluating the joint density and using (4) we see that it satisfies
$$f_a(X_1, X_2, \ldots, X_n) = f_a(X_1)\prod_{t=2}^{n} f_a(X_t | X_{t-1}, X_{t-2}, \ldots, X_1) \quad \text{(by Bayes theorem)}$$
$$= f_a(X_1)\prod_{t=2}^{n} f_a(X_t | X_{t-1}) \quad \text{(by the Markov property)}$$
$$= f_a(X_1)\prod_{t=2}^{n} f_\varepsilon(X_t - aX_{t-1}) \quad \text{(by (4))}.$$
Therefore the log likelihood is
$$\log f_a(X_1, X_2, \ldots, X_n) = \underbrace{\log f_a(X_1)}_{\text{often ignored}} + \underbrace{\sum_{t=2}^{n} \log f_\varepsilon(X_t - aX_{t-1})}_{\text{conditional likelihood}}.$$
Usually we ignore the initial distribution $f_a(X_1)$ and maximise the conditional likelihood to
obtain the estimator.
Hence we use $\hat{a}_n$ as an estimator of $a$, where
$$\hat{a}_n = \arg\max_{a\in\Theta} \sum_{t=2}^{n} \log f_\varepsilon(X_t - aX_{t-1}),$$
and $\Theta$ is the parameter space over which we do the maximisation. We say that $\hat{a}_n$ is the
conditional likelihood estimator.
3.1 Likelihood function when the innovations are Gaussian
We now consider the special case where the innovations $\{\varepsilon_t\}_t$ are standard Gaussian. In this
case we have that
$$\log f_\varepsilon(X_t - aX_{t-1}) = -\frac{1}{2}\log 2\pi - \frac{1}{2}(X_t - aX_{t-1})^2.$$
Therefore, if we let
$$L_n(a) = -\frac{1}{n-1}\sum_{t=2}^{n}(X_t - aX_{t-1})^2,$$
then, since $-\frac{1}{2}\log 2\pi$ is a constant, maximising $L_n(a)$ is equivalent to maximising
$\sum_{t=2}^{n}\log f_\varepsilon(X_t - aX_{t-1})$.
Therefore the maximum likelihood estimator is
$$\hat{a}_n = \arg\max_{a\in\Theta} L_n(a).$$
In the derivations here we shall assume $\Theta = [-1, 1]$.
Definition 2 An estimator $\hat{\theta}_n$ is said to be a strongly consistent estimator of $\theta$ if there exists a
set $M \subset \Omega$, where $P(M) = 1$, such that for all $\omega \in M$ we have
$$\hat{\theta}_n(\omega) \rightarrow \theta.$$
Question: Is the likelihood estimator $\hat{a}_n$ strongly consistent?
In the case of least squares for AR processes, $\hat{a}_n$ has the explicit form
$$\hat{a}_n = \frac{\frac{1}{n-1}\sum_{t=2}^{n} X_t X_{t-1}}{\frac{1}{n-1}\sum_{t=1}^{n-1} X_t^2}.$$
Now, by just applying the ergodic theorem to the numerator and denominator, we get
almost sure convergence of $\hat{a}_n$ (exercise).
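Both routes to $\hat{a}_n$, the explicit ratio above and numerical maximisation of $L_n(a)$, can be compared on simulated data. A minimal sketch assuming NumPy; the grid search stands in for a proper numerical routine:

```python
import numpy as np

def ar1_least_squares(x):
    """Closed-form least squares estimator: sum X_t X_{t-1} / sum X_{t-1}^2."""
    return np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)

def L_n(x, a):
    """Gaussian conditional likelihood criterion (negative mean squared error)."""
    return -np.mean((x[1:] - a * x[:-1]) ** 2)

rng = np.random.default_rng(2)
a, n = 0.7, 20_000
eps = rng.standard_normal(n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = a * x[t - 1] + eps[t]

a_explicit = ar1_least_squares(x)
# crude numerical maximisation of L_n over a grid on [-1, 1]
grid = np.linspace(-1.0, 1.0, 2001)
a_grid = grid[np.argmax([L_n(x, b) for b in grid])]
print(a_explicit, a_grid)   # both settle near the true a = 0.7
```

Since $L_n$ is a concave quadratic in $a$, the grid maximiser lands on the grid point nearest the explicit solution.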
However, we will tackle the problem in a rather artificial way and assume that it does
not have an explicit form, and instead assume that $\hat{a}_n$ is obtained by maximising $L_n(a)$
using a numerical routine. In general this is the most common way of optimising a
likelihood function (usually explicit solutions do not exist).
In order to derive the sampling properties of $\hat{a}_n$ we need to study the likelihood
function $L_n(a)$ directly. We will do this now in the least squares case.
Most of the analysis of $L_n(a)$ involves repeated application of the ergodic theorem.
The first clue to solving the problem is to let
$$\ell_t(a) = -(X_t - aX_{t-1})^2.$$
By using Theorem 2 we have that $\{\ell_t(a)\}_t$ is an ergodic sequence. Therefore, by using
the ergodic theorem, we have
$$L_n(a) = \frac{1}{n-1}\sum_{t=2}^{n}\ell_t(a) \stackrel{a.s.}{\rightarrow} E(\ell_0(a)).$$
In other words, for every $a \in [-1, 1]$ we have that $L_n(a) \stackrel{a.s.}{\rightarrow} E(\ell_0(a))$. This is what we
call almost sure pointwise convergence.
Remark 1 It is interesting to note that the least squares/likelihood estimator $\hat{a}_n$ can also be
used even if the innovations are not Gaussian.
3.2 Strong consistency of a general estimator
We now consider the general case, where $B_n(a)$ is a criterion which we maximise (or minimise).
We note that $B_n(a)$ includes the likelihood function, the least squares criterion, etc.
We suppose it has the form
$$B_n(a) = \frac{1}{n-1}\sum_{t=2}^{n} g_t(a), \qquad (5)$$
where for each $a \in [c, d]$, $\{g_t(a)\}_t$ is an ergodic sequence. Let
$$\bar{B}(a) = E(g_0(a)); \qquad (6)$$
we assume that $\bar{B}(a)$ is continuous and has a unique maximum in $[c, d]$.
We define the estimator $\hat{a}_n$, where
$$\hat{a}_n = \arg\max_{a\in[c,d]} B_n(a).$$
Object To show under what conditions $\hat{a}_n \stackrel{a.s.}{\rightarrow} \alpha$, where $\alpha = \arg\max_{a\in[c,d]} \bar{B}(a)$.
Now suppose that for each $a \in [c, d]$ we have $B_n(a) \stackrel{a.s.}{\rightarrow} \bar{B}(a)$; this is called almost sure
pointwise convergence. That is, for each $a \in [c, d]$ there exists a set $M_a \subset \Omega$ with $P(M_a) = 1$
such that for each $\omega \in M_a$ and every $\varepsilon > 0$,
$$|B_n(\omega, a) - \bar{B}(a)| \leq \varepsilon,$$
for all $n > N_a(\varepsilon, \omega)$. But $N_a(\varepsilon, \omega)$ depends on $a$, hence the rate of convergence is not
uniform (the rate depends on $a$).
Remark 2 We will assume that the set $M_a$ is common to all $a$, that is, $M_a = M_{a'}$ for all
$a, a' \in [c, d]$. We will prove this result in Lemma 1 under the assumption of equicontinuity
(defined later); however, I think the same result can be shown under the weaker assumption
that $\bar{B}(a)$ is continuous.
Returning to the estimator: since $\hat{a}_n$ maximises $B_n(\cdot)$ and $\alpha$ maximises $\bar{B}(\cdot)$, we observe that
$$B_n(\alpha) \leq B_n(\hat{a}_n) \quad \text{and} \quad \bar{B}(\hat{a}_n) \leq \bar{B}(\alpha). \qquad (7)$$
We now consider the difference $|B_n(\hat{a}_n) - \bar{B}(\alpha)|$; if we can show that
$|B_n(\hat{a}_n) - \bar{B}(\alpha)| \stackrel{a.s.}{\rightarrow} 0$, then $\hat{a}_n \rightarrow \alpha$ almost surely. Note that we cannot simply appeal to
pointwise convergence here, because $\hat{a}_n$ is random and varies with $n$.
Studying $(B_n(\hat{a}_n) - \bar{B}(\alpha))$ and using (7) we have
$$B_n(\alpha) - \bar{B}(\alpha) \;\leq\; B_n(\hat{a}_n) - \bar{B}(\alpha) \;\leq\; B_n(\hat{a}_n) - \bar{B}(\hat{a}_n).$$
Therefore we have
$$|B_n(\hat{a}_n) - \bar{B}(\alpha)| \;\leq\; \max\big\{|B_n(\alpha) - \bar{B}(\alpha)|,\; |B_n(\hat{a}_n) - \bar{B}(\hat{a}_n)|\big\}.$$
To show that the right-hand side converges to zero we require not only pointwise convergence
but uniform convergence of $B_n(a)$, which we define below.
Definition 3 $B_n(a)$ is said to converge almost surely uniformly to $\bar{B}(a)$ if
$$\sup_{a\in[c,d]} |B_n(a) - \bar{B}(a)| \stackrel{a.s.}{\rightarrow} 0.$$
In other words, there exists a set $M \subset \Omega$ with $P(M) = 1$ such that for every $\omega \in M$,
$$\sup_{a\in[c,d]} |B_n(\omega, a) - \bar{B}(a)| \rightarrow 0.$$
Therefore, returning to our problem: if $B_n(a) \stackrel{a.s.}{\rightarrow} \bar{B}(a)$ uniformly, then we have the bound
$$|B_n(\hat{a}_n) - \bar{B}(\alpha)| \leq \sup_{a\in[c,d]} |B_n(a) - \bar{B}(a)| \stackrel{a.s.}{\rightarrow} 0.$$
This implies that $B_n(\hat{a}_n) \stackrel{a.s.}{\rightarrow} \bar{B}(\alpha)$, and since $\bar{B}(\cdot)$ is continuous with a unique maximum, we
have that $\hat{a}_n \stackrel{a.s.}{\rightarrow} \alpha$. Therefore, if we can show almost sure uniform convergence of $B_n(a)$, then
we have strong consistency of the estimator $\hat{a}_n$.
Comment: Pointwise convergence is relatively easy to show, but how do we show uniform convergence?
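The gap between pointwise and uniform convergence can be made concrete by tracking $\sup_{a\in[-1,1]}|L_n(a) - \bar{L}(a)|$ for the least squares criterion of Section 3.1, for which the limit $\bar{L}(a) = -\sigma^2(1 - 2aa_0 + a^2)$ (with true coefficient $a_0$) is available in closed form. A sketch assuming NumPy; the grid approximation of the supremum is a choice made here:

```python
import numpy as np

rng = np.random.default_rng(3)
a0 = 0.5                              # true AR(1) coefficient (an assumption here)
sigma2 = 1.0 / (1.0 - a0 ** 2)        # E(X_0^2)

def L_n(x, a):
    """L_n(a) = -(1/(n-1)) sum_{t=2}^n (X_t - a X_{t-1})^2."""
    return -np.mean((x[1:] - a * x[:-1]) ** 2)

def L_bar(a):
    """Pointwise a.s. limit: -E(X_1 - a X_0)^2 = -sigma2 (1 - 2 a a0 + a^2)."""
    return -sigma2 * (1.0 - 2.0 * a * a0 + a ** 2)

# simulate one long AR(1) path
n_max = 100_000
eps = rng.standard_normal(n_max)
x = np.zeros(n_max)
for t in range(1, n_max):
    x[t] = a0 * x[t - 1] + eps[t]

# sup_{a in [-1,1]} |L_n(a) - L_bar(a)|, approximated on a grid, shrinks with n
grid = np.linspace(-1.0, 1.0, 201)
gaps = []
for n in (1_000, 10_000, 100_000):
    gaps.append(max(abs(L_n(x[:n], a) - L_bar(a)) for a in grid))
print(gaps)
```

The shrinking supremum is exactly the uniform convergence of Definition 3 observed along one realisation.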
Figure 1: The middle curve is $\bar{B}(a)$. If the sequence $\{B_n(\omega, a)\}$ converges uniformly to
$\bar{B}(a)$, then $B_n(\omega, a)$ will lie inside these boundaries for all $n > N(\varepsilon)$.
3.3 Uniform convergence and stochastic equicontinuity
We now define the concept of stochastic equicontinuity. We will prove that stochastic
equicontinuity, together with almost sure pointwise convergence (and a compact parameter
space), implies uniform convergence.
Definition 4 The sequence of stochastic functions $\{B_n(a)\}_n$ is said to be stochastically
equicontinuous if there exists a set $M \subset \Omega$ with $P(M) = 1$ such that for every $\omega \in M$ and
every $\varepsilon > 0$ there exist a $\delta > 0$ and an $N(\varepsilon)$ with
$$\sup_{|a_1 - a_2|\leq\delta} |B_n(\omega, a_1) - B_n(\omega, a_2)| \leq \varepsilon,$$
for all $n > N(\varepsilon)$.
Remark 3 A sufficient condition for stochastic equicontinuity is that there exists an $N(\varepsilon)$
such that for all $n > N(\varepsilon)$, $B_n(\omega, \cdot)$ belongs to the Lipschitz class $C(L)$ (where $L$ is the same
for all $\omega$). In general we verify this condition to show stochastic equicontinuity.
In the following lemma we show that if $\{B_n(a)\}_n$ is stochastically equicontinuous and
also pointwise convergent, then there exists a single set $M \subset \Omega$ with $P(M) = 1$ such that for
every $\omega \in M$ we have pointwise convergence of $B_n(\omega, a) \rightarrow \bar{B}(a)$. This lemma can be omitted
on a first reading (it is mainly technical).
Lemma 1 Suppose the sequence $\{B_n(a)\}_n$ is stochastically equicontinuous and also pointwise
convergent (that is, $B_n(a)$ converges almost surely to $\bar{B}(a)$). Then there exists a set $M \subset \Omega$
with $P(M) = 1$ such that for every $\omega \in M$ and every $a \in [c, d]$ we have
$$|B_n(\omega, a) - \bar{B}(a)| \rightarrow 0.$$
PROOF. Enumerate all the rationals in the interval $[c, d]$ and call this sequence $\{a_i\}_i$. Then
for every $a_i$ there exists a set $M_{a_i}$ with $P(M_{a_i}) = 1$, such that for every $\omega \in M_{a_i}$ we have
$|B_n(\omega, a_i) - \bar{B}(a_i)| \rightarrow 0$. Define $M = \cap_i M_{a_i}$; since the number of sets is countable, $P(M) = 1$,
and for every $\omega \in M$ and every $a_i$ we have $B_n(\omega, a_i) - \bar{B}(a_i) \rightarrow 0$. Suppose we have
equicontinuity for every realisation in $\tilde{M}$, where $\tilde{M}$ is such that $P(\tilde{M}) = 1$. Let
$M^* = \tilde{M} \cap M$; clearly $P(M^*) = 1$. Let $\omega \in M^*$; then for every $\varepsilon/3 > 0$ there exists a $\delta > 0$
such that
$$\sup_{|a_1 - a_2|\leq\delta} |B_n(\omega, a_1) - B_n(\omega, a_2)| \leq \varepsilon/3,$$
for all $n > N(\varepsilon)$. Now, for any given $a$, choose a rational $a_j$ such that $|a - a_j| \leq \delta$; by
continuity of $\bar{B}$ we may also choose $\delta$ so small that $|\bar{B}(a) - \bar{B}(a_j)| \leq \varepsilon/3$. By
pointwise convergence we have
$$|B_n(\omega, a_j) - \bar{B}(a_j)| \leq \varepsilon/3,$$
for all $n > N'(\varepsilon)$. Then we have
$$|B_n(\omega, a) - \bar{B}(a)| \leq |B_n(\omega, a) - B_n(\omega, a_j)| + |B_n(\omega, a_j) - \bar{B}(a_j)| + |\bar{B}(a_j) - \bar{B}(a)| \leq \varepsilon,$$
for $n > \max(N(\varepsilon), N'(\varepsilon))$. To summarise: for every $\omega \in M^*$ and $a \in [c, d]$ we have
$|B_n(\omega, a) - \bar{B}(a)| \rightarrow 0$. Hence we have pointwise convergence for every realisation in $M^*$.
We now show that equicontinuity implies uniform convergence.
Theorem 3 Suppose $[c, d]$ is a compact interval and for every $a \in [c, d]$, $B_n(a)$ converges
almost surely to $\bar{B}(a)$. Furthermore, assume that $\{B_n(a)\}$ is almost surely equicontinuous.
Then we have
$$\sup_{a\in[c,d]} |B_n(a) - \bar{B}(a)| \stackrel{a.s.}{\rightarrow} 0.$$
PROOF. Let $M^* = M \cap \tilde{M}$, where $\tilde{M}$ is the set on which we have equicontinuity and $M$
the set on which $\{B_n(\omega, a)\}_n$ converges pointwise (as constructed in Lemma 1); it is clear
that $P(M^*) = 1$. Choose $\varepsilon/3 > 0$ and let $\delta$ be such that for every $\omega \in M^*$ we have
$$\sup_{|a_1 - a_2|\leq\delta} |B_n(\omega, a_1) - B_n(\omega, a_2)| \leq \varepsilon/3,$$
for all $n > N(\varepsilon)$. Since $[c, d]$ is compact, it can be divided into a finite number of intervals:
let $c = a_0 \leq a_1 \leq \cdots \leq a_p = d$ with $\sup_i |a_{i+1} - a_i| \leq \delta$. Since $p$ is finite, there
exists an $\tilde{N}(\varepsilon)$ such that
$$\max_{1\leq i\leq p} |B_n(\omega, a_i) - \bar{B}(a_i)| \leq \varepsilon/3,$$
for all $n > \tilde{N}(\varepsilon)$ (where $N_i(\varepsilon)$ is such that $|B_n(\omega, a_i) - \bar{B}(a_i)| \leq \varepsilon/3$ for all
$n \geq N_i(\varepsilon)$, and $\tilde{N}(\varepsilon) = \max_i N_i(\varepsilon)$). For any $a \in [c, d]$, choose the $i$ such that
$a \in [a_i, a_{i+1}]$; then by stochastic equicontinuity we have
$$|B_n(\omega, a) - B_n(\omega, a_i)| \leq \varepsilon/3,$$
for all $n > N(\varepsilon)$. Therefore we have
$$|B_n(\omega, a) - \bar{B}(a)| \leq |B_n(\omega, a) - B_n(\omega, a_i)| + |B_n(\omega, a_i) - \bar{B}(a_i)| + |\bar{B}(a) - \bar{B}(a_i)| \leq \varepsilon,$$
for all $n \geq \max(N(\varepsilon), \tilde{N}(\varepsilon))$ (for the last term we use the continuity of $\bar{B}$, taking $\delta$
small enough). Since $\tilde{N}(\varepsilon)$ does not depend on $a$, it follows that
$\sup_a |B_n(\omega, a) - \bar{B}(a)| \rightarrow 0$. Furthermore, this is true for all $\omega \in M^*$ and $P(M^*) = 1$,
hence we have almost sure uniform convergence.
The following theorem summarises the sufficient conditions for almost sure consistency.
Theorem 4 Suppose $B_n(a)$ and $\bar{B}(a)$ are defined as in (5) and (6) respectively. Let
$$\hat{a}_n = \arg\max_{a\in[c,d]} B_n(a) \qquad (8)$$
$$\text{and} \qquad \alpha = \arg\max_{a\in[c,d]} \bar{B}(a). \qquad (9)$$
Assume that $[c, d]$ is compact and that $\bar{B}(a)$ has a unique maximum in $[c, d]$. Furthermore,
assume that for every $a \in [c, d]$ we have $B_n(a) \stackrel{a.s.}{\rightarrow} \bar{B}(a)$, and that the sequence $\{B_n(a)\}_n$
is stochastically equicontinuous. Then we have
$$\hat{a}_n \stackrel{a.s.}{\rightarrow} \alpha.$$
3.4 Strong consistency of the least squares estimator
We now verify the conditions in Theorem 4 to show that the least squares estimator is
strongly consistent.
We recall that
$$\hat{a}_n = \arg\max_{a\in[-1,1]} L_n(a), \qquad (10)$$
where
$$L_n(a) = -\frac{1}{n-1}\sum_{t=2}^{n}(X_t - aX_{t-1})^2.$$
We have already shown that for every $a \in [-1, 1]$ we have $L_n(a) \stackrel{a.s.}{\rightarrow} \bar{L}(a)$, where
$\bar{L}(a) := E(\ell_0(a))$. Recalling Theorem 3, since the parameter space $[-1, 1]$ is compact, to show
strong consistency we only need to show that $L_n(a)$ is stochastically equicontinuous.
By expanding $L_n(a)$ and using the mean value theorem we have
$$L_n(a_1) - L_n(a_2) = \nabla L_n(\bar{a})(a_1 - a_2), \qquad (11)$$
where $\bar{a} \in [\min(a_1, a_2), \max(a_1, a_2)]$ and
$$\nabla L_n(b) = \frac{2}{n-1}\sum_{t=2}^{n} X_{t-1}(X_t - bX_{t-1}).$$
Because $b \in [-1, 1]$ we have
$$|\nabla L_n(b)| \leq D_n, \quad \text{where} \quad D_n = \frac{2}{n-1}\sum_{t=2}^{n}\big(|X_{t-1}X_t| + X_{t-1}^2\big).$$
Since $\{X_t\}_t$ is an ergodic process, $\{|X_{t-1}X_t| + X_{t-1}^2\}_t$ is also an ergodic process. Therefore,
if $\mathrm{var}(\varepsilon_0) < \infty$, by using the ergodic theorem we have
$$D_n \stackrel{a.s.}{\rightarrow} 2E(|X_0X_1| + X_0^2) =: D.$$
Therefore there exists a set $M \subset \Omega$ with $P(M) = 1$ such that for every $\omega \in M$ and
$\tilde{\varepsilon} > 0$ we have
$$|D_n(\omega) - D| \leq \tilde{\varepsilon},$$
for all $n > N(\tilde{\varepsilon})$. Substituting the above into (11) we have
$$|L_n(\omega, a_1) - L_n(\omega, a_2)| \leq D_n(\omega)|a_1 - a_2| \leq (D + \tilde{\varepsilon})|a_1 - a_2|,$$
for all $n \geq N(\tilde{\varepsilon})$. Therefore, for every $\varepsilon > 0$ there exists a $\delta := \varepsilon/(D + \tilde{\varepsilon})$ such that
$$|L_n(\omega, a_1) - L_n(\omega, a_2)| \leq (D + \tilde{\varepsilon})|a_1 - a_2| \leq \varepsilon \quad \text{whenever } |a_1 - a_2| \leq \delta,$$
for all $n \geq N(\tilde{\varepsilon})$. Since this is true for all $\omega \in M$, we see that $\{L_n(a)\}$ is stochastically
equicontinuous.
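The Lipschitz bound $|L_n(a_1) - L_n(a_2)| \leq D_n|a_1 - a_2|$ used above is a pathwise inequality, so it can be checked exactly on a simulated realisation. A sketch assuming NumPy (the particular seed and parameter pairs are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
a0, n = 0.5, 10_000
eps = rng.standard_normal(n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = a0 * x[t - 1] + eps[t]

def L_n(a):
    return -np.mean((x[1:] - a * x[:-1]) ** 2)

# Mean value theorem bound: for a1, a2 in [-1, 1],
#   |L_n(a1) - L_n(a2)| <= D_n |a1 - a2|,
# with random Lipschitz constant D_n = (2/(n-1)) sum (|X_{t-1} X_t| + X_{t-1}^2).
D_n = 2.0 * np.mean(np.abs(x[:-1] * x[1:]) + x[:-1] ** 2)

pairs = [(-0.9, -0.2), (-0.1, 0.3), (0.4, 0.95)]
ok = all(abs(L_n(a1) - L_n(a2)) <= D_n * abs(a1 - a2) for a1, a2 in pairs)
print(D_n, ok)   # the bound holds pathwise, for every realisation
```

Unlike the convergence statements, this inequality holds deterministically for every sample path, which is what makes the equicontinuity argument work.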
Theorem 5 Let $\hat{a}_n$ be defined as in (10). Then we have
$$\hat{a}_n \stackrel{a.s.}{\rightarrow} a.$$
PROOF. Since $\{L_n(a)\}$ is almost surely equicontinuous, the parameter space $[-1, 1]$ is compact
and we have pointwise convergence $L_n(a) \stackrel{a.s.}{\rightarrow} \bar{L}(a)$, by using Theorem 4 we have that
$\hat{a}_n \stackrel{a.s.}{\rightarrow} \alpha$, where $\alpha = \arg\max_b \bar{L}(b)$. Finally we need to show that $\alpha = a$. Since
$$\bar{L}(b) = E(\ell_0(b)) = -E(X_1 - bX_0)^2,$$
we see by differentiating $\bar{L}(b)$ that it is maximised at $b = E(X_0X_1)/E(X_0^2)$. Inspecting
the AR(1) process we see that
$$X_t = aX_{t-1} + \varepsilon_t \;\Rightarrow\; X_tX_{t-1} = aX_{t-1}^2 + \varepsilon_tX_{t-1} \;\Rightarrow\; E(X_tX_{t-1}) = aE(X_{t-1}^2) + \underbrace{E(\varepsilon_tX_{t-1})}_{=0}.$$
Therefore $a = E(X_0X_1)/E(X_0^2)$, hence $\alpha = a$, and thus we have shown strong consistency of
the least squares estimator.
(i) This is the general method for showing almost sure convergence of a whole class of
estimators. Using a similar technique one can show strong consistency of the maximum
likelihood estimator of the ARCH(1) process; see Berkes, Horváth, and Kokoszka (2003)
for details.
(ii) Described here was strong consistency, where we showed $\hat{a}_n \stackrel{a.s.}{\rightarrow} a$. Using similar tools
one can show weak consistency, where $\hat{a}_n \stackrel{P}{\rightarrow} a$ (convergence in probability), which
requires a much weaker set of conditions. For example, rather than show almost sure
pointwise convergence we would show pointwise convergence in probability; rather
than stochastic equicontinuity we would show equicontinuity in probability, that is, for
every $\varepsilon > 0$ there exists a $\delta$ such that
$$\lim_{n\rightarrow\infty} P\Big(\sup_{|a_1 - a_2|\leq\delta} |B_n(a_1) - B_n(a_2)| > \varepsilon\Big) = 0.$$
Some of these conditions are easier to verify than their almost sure counterparts.
4 Central limit theorems
Recall once again the AR(1) process $\{X_t\}$, where $X_t$ satisfies
$$X_t = aX_{t-1} + \varepsilon_t.$$
It is of interest to check whether there actually is dependence in $\{X_t\}$; if there is no dependence,
then $a = 0$ (in which case $\{X_t\}$ would be white noise). Of course, in most situations we only
have an estimator of $a$ (for example, $\hat{a}_n$ defined in (10)).
(i) Given the estimator $\hat{a}_n$, how do we see whether $a = 0$?
(ii) Usually we would do a hypothesis test of $H_0 : a = 0$ against the alternative of, say,
$a \neq 0$. However, in order to do this we require the distribution of $\hat{a}_n$.
(iii) Usually evaluating the exact sampling distribution is extremely hard or close to impossible.
Instead we evaluate the limiting distribution, which in general is a lot easier.
(iv) In this section we shall show asymptotic normality of $\sqrt{n}(\hat{a}_n - a)$. The reason for
normalising by $\sqrt{n}$ is that $(\hat{a}_n - a) \stackrel{a.s.}{\rightarrow} 0$ as $n \rightarrow \infty$, hence in terms of distributions it
converges towards the point mass at zero. Therefore we need to increase the magnitude
of the difference $\hat{a}_n - a$. We can show that $(\hat{a}_n - a) = O_p(n^{-1/2})$, therefore
$\sqrt{n}(\hat{a}_n - a) = O_p(1)$. Multiplying $(\hat{a}_n - a)$ by anything growing faster would mean its limit
goes to infinity, multiplying $(\hat{a}_n - a)$ by something smaller in magnitude would mean its limit
goes to zero, so $n^{1/2}$ is the happy medium.
We use $\nabla L_n(a)$ to denote the derivative of $L_n(a)$ with respect to $a$ (that is, $\frac{\partial L_n(a)}{\partial a}$). Since
$\hat{a}_n = \arg\max_a L_n(a)$, and assuming the maximum lies in the interior of the parameter space, we
observe that $\nabla L_n(\hat{a}_n) = 0$. Now, expanding $\nabla L_n(\hat{a}_n)$ about $a$ (the true parameter), we have
$$\nabla L_n(\hat{a}_n) - \nabla L_n(a) = \nabla^2 L_n\,(\hat{a}_n - a), \quad \text{hence} \quad -\nabla L_n(a) = \nabla^2 L_n\,(\hat{a}_n - a), \qquad (12)$$
where $\nabla^2 L_n = \frac{\partial^2 L_n}{\partial a^2}$ (which here does not depend on $a$, since $L_n$ is quadratic),
$$\nabla L_n(a) = \frac{2}{n-1}\sum_{t=2}^{n} X_{t-1}(X_t - aX_{t-1}) = \frac{2}{n-1}\sum_{t=2}^{n} X_{t-1}\varepsilon_t,$$
and
$$\nabla^2 L_n = -\frac{2}{n-1}\sum_{t=2}^{n} X_{t-1}^2.$$
Therefore, by using (12), we have
$$(\hat{a}_n - a) = -\big(\nabla^2 L_n\big)^{-1}\nabla L_n(a). \qquad (13)$$
Since $\{X_t^2\}$ are ergodic random variables, by using the ergodic theorem we have
$\nabla^2 L_n \stackrel{a.s.}{\rightarrow} -2E(X_0^2)$. Together with (13) this implies
$$\sqrt{n}(\hat{a}_n - a) = \underbrace{\big(-\nabla^2 L_n\big)^{-1}}_{\stackrel{a.s.}{\rightarrow}\,(2E(X_0^2))^{-1}} \sqrt{n}\,\nabla L_n(a). \qquad (14)$$
To show asymptotic normality of $\sqrt{n}(\hat{a}_n - a)$, we will show asymptotic normality of
$\sqrt{n}\,\nabla L_n(a)$.
We observe that
$$\nabla L_n(a) = \frac{2}{n-1}\sum_{t=2}^{n} X_{t-1}\varepsilon_t$$
is a sum of martingale differences, since $E(X_{t-1}\varepsilon_t | X_{t-1}) = X_{t-1}E(\varepsilon_t | X_{t-1}) = X_{t-1}E(\varepsilon_t) = 0$.
In order to show asymptotic normality of $\nabla L_n(a)$ we will use the martingale central limit
theorem.
4.0.1 Martingales and the conditional likelihood
First a quick overview. $\{Z_t; t = 1, 2, \ldots\}$ are called martingale differences if
$$E(Z_t | Z_{t-1}, Z_{t-2}, \ldots) = 0.$$
An example is the sequence $\{X_{t-1}\varepsilon_t\}_t$ considered above: because $\varepsilon_t$ and $X_{t-1}, X_{t-2}, \ldots$ are
independent, we have $E(X_{t-1}\varepsilon_t | X_{t-1}, X_{t-2}, \ldots) = 0$.
The stochastic sum $\{S_n\}_n$, where
$$S_n = \sum_{t=1}^{n} Z_t,$$
is called a martingale if $\{Z_t\}$ are martingale differences.
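The martingale-difference property of $\{X_{t-1}\varepsilon_t\}_t$ can be seen in simulation: the terms average to zero and are uncorrelated at lag one. A minimal sketch assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(5)
a, n = 0.6, 100_000
eps = rng.standard_normal(n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = a * x[t - 1] + eps[t]

# Martingale differences Y_t = X_{t-1} eps_t: E(Y_t | past) = 0, so the running
# average vanishes and consecutive terms are uncorrelated.
y = x[:-1] * eps[1:]
avg = np.mean(y)
lag1_corr = np.corrcoef(y[:-1], y[1:])[0, 1]
print(avg, lag1_corr)   # both close to 0
```

Note that the $Y_t$ are uncorrelated but not independent: $Y_t$ depends on $X_{t-1}$, which in turn depends on $\varepsilon_{t-1}$, and this is exactly why the martingale CLT below, rather than the iid CLT, is needed.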
First we show that the gradient of the conditional log likelihood, defined as
$$\nabla C_n(\theta) = \sum_{t=2}^{n} \frac{\partial \log f_{\theta}(X_t | X_{t-1}, \ldots, X_1)}{\partial\theta},$$
evaluated at the true parameter $\theta_0$, is a sum of martingale differences. By definition,
$\nabla C_n(\theta_0)$ is a sum of martingale differences if
$$E\left(\frac{\partial \log f_{\theta}(X_t | X_{t-1}, \ldots, X_1)}{\partial\theta}\Big|_{\theta=\theta_0} \,\Big|\, X_{t-1}, X_{t-2}, \ldots, X_1\right) = 0.$$
Rewriting the above in terms of integrals and exchanging derivative with integral, we have
$$E\left(\frac{\partial \log f_{\theta}(X_t | X_{t-1}, \ldots, X_1)}{\partial\theta}\Big|_{\theta=\theta_0} \,\Big|\, X_{t-1}, X_{t-2}, \ldots, X_1\right)$$
$$= \int \frac{\partial \log f_{\theta}(x_t | X_{t-1}, \ldots, X_1)}{\partial\theta}\Big|_{\theta=\theta_0} f_{\theta_0}(x_t | X_{t-1}, \ldots, X_1)\,dx_t$$
$$= \int \frac{1}{f_{\theta_0}(x_t | X_{t-1}, \ldots, X_1)} \frac{\partial f_{\theta}(x_t | X_{t-1}, \ldots, X_1)}{\partial\theta}\Big|_{\theta=\theta_0} f_{\theta_0}(x_t | X_{t-1}, \ldots, X_1)\,dx_t$$
$$= \frac{\partial}{\partial\theta}\left(\int f_{\theta}(x_t | X_{t-1}, \ldots, X_1)\,dx_t\right)\Big|_{\theta=\theta_0} = \frac{\partial}{\partial\theta}(1) = 0.$$
Therefore $\big\{\frac{\partial \log f_{\theta}(X_t|X_{t-1},\ldots,X_1)}{\partial\theta}\big|_{\theta=\theta_0}\big\}_t$ is a sequence of martingale
differences, and $\nabla C_n(\theta_0)$ is a sum of martingale differences (hence it is a martingale).
4.1 The martingale central limit theorem
Let us define $S_n$ as
$$S_n = \frac{1}{\sqrt{n}}\sum_{t=1}^{n} Y_t, \qquad (15)$$
where $\mathcal{F}_t = \sigma(Y_t, Y_{t-1}, \ldots)$, $E(Y_t | \mathcal{F}_{t-1}) = 0$ and $E(Y_t^2) < \infty$. In the following theorem,
adapted from Hall and Heyde (1980), Theorem 3.2 and Corollary 3.1, we show that $S_n$ is
asymptotically normal.
Theorem 6 Let $\{S_n\}_n$ be defined as in (15). Further suppose
$$\frac{1}{n}\sum_{t=1}^{n} Y_t^2 \stackrel{P}{\rightarrow} \sigma^2, \qquad (16)$$
where $\sigma^2$ is a finite constant; for all $\varepsilon > 0$,
$$\frac{1}{n}\sum_{t=1}^{n} E\big(Y_t^2 I(|Y_t| > \varepsilon\sqrt{n}) \,\big|\, \mathcal{F}_{t-1}\big) \stackrel{P}{\rightarrow} 0 \qquad (17)$$
(this is known as the conditional Lindeberg condition); and
$$\frac{1}{n}\sum_{t=1}^{n} E(Y_t^2 | \mathcal{F}_{t-1}) \stackrel{P}{\rightarrow} \sigma^2. \qquad (18)$$
Then we have
$$S_n \stackrel{D}{\rightarrow} \mathcal{N}(0, \sigma^2). \qquad (19)$$
4.2 Asymptotic normality of the least squares estimator
We now use Theorem 6 to show that $\sqrt{n}\,\nabla L_n(a)$ is asymptotically normal, which means
we have to verify conditions (16)-(18). We note in our example that $Y_t := X_{t-1}\varepsilon_t$, and
that the series $\{X_{t-1}\varepsilon_t\}_t$ is an ergodic process. Furthermore, since for any function $g$,
$E(g(X_{t-1}\varepsilon_t) | \mathcal{F}_{t-1}) = E(g(X_{t-1}\varepsilon_t) | X_{t-1})$, where $\mathcal{F}_t = \sigma(X_t, X_{t-1}, \ldots)$, we need only
condition on $X_{t-1}$ rather than the entire sigma-algebra $\mathcal{F}_{t-1}$.
C1: By using the ergodicity of $\{X_{t-1}\varepsilon_t\}_t$ and the independence of $X_{t-1}$ and $\varepsilon_t$, we have
$$\frac{1}{n}\sum_{t=1}^{n} Y_t^2 = \frac{1}{n}\sum_{t=1}^{n} X_{t-1}^2\varepsilon_t^2 \stackrel{P}{\rightarrow} E(X_{t-1}^2)\underbrace{E(\varepsilon_t^2)}_{=1} = \sigma^2.$$
C2: We now verify the conditional Lindeberg condition:
$$\frac{1}{n}\sum_{t=1}^{n} E\big(Y_t^2 I(|Y_t| > \varepsilon\sqrt{n}) | \mathcal{F}_{t-1}\big) = \frac{1}{n}\sum_{t=1}^{n} E\big(X_{t-1}^2\varepsilon_t^2 I(|X_{t-1}\varepsilon_t| > \varepsilon\sqrt{n}) | X_{t-1}\big).$$
We now use the Cauchy-Schwarz inequality for conditional expectations to split $X_{t-1}^2\varepsilon_t^2$
and $I(|X_{t-1}\varepsilon_t| > \varepsilon\sqrt{n})$. We recall that the Cauchy-Schwarz inequality for conditional
expectations is $E(X_tY_t | \mathcal{G}) \leq [E(X_t^2 | \mathcal{G})E(Y_t^2 | \mathcal{G})]^{1/2}$ almost surely. Therefore
$$\frac{1}{n}\sum_{t=1}^{n} E\big(Y_t^2 I(|Y_t| > \varepsilon\sqrt{n}) | \mathcal{F}_{t-1}\big)
\leq \frac{1}{n}\sum_{t=1}^{n}\big[E(X_{t-1}^4\varepsilon_t^4 | X_{t-1})\,E\big(I(|X_{t-1}\varepsilon_t| > \varepsilon\sqrt{n})^2 | X_{t-1}\big)\big]^{1/2}$$
$$\leq \frac{1}{n}\sum_{t=1}^{n} X_{t-1}^2 E(\varepsilon_t^4)^{1/2}\big[E\big(I(|X_{t-1}\varepsilon_t| > \varepsilon\sqrt{n})^2 | X_{t-1}\big)\big]^{1/2}. \qquad (20)$$
We note that rather than the Cauchy-Schwarz inequality we could use its generalisation,
the Hölder inequality, which states that if $p^{-1} + q^{-1} = 1$, then
$E(XY) \leq \{E(X^p)\}^{1/p}\{E(Y^q)\}^{1/q}$ (a conditional version also exists). The advantage of using
this inequality is that one can reduce the moment assumptions on $X_t$.
Returning to (20) and studying $E(I(|X_{t-1}\varepsilon_t| > \varepsilon\sqrt{n})^2 | X_{t-1})$, we use that
$E(I(A)) = P(A)$ and the Chebyshev inequality to show
$$E\big(I(|X_{t-1}\varepsilon_t| > \varepsilon\sqrt{n})^2 | X_{t-1}\big) = E\big(I(|X_{t-1}\varepsilon_t| > \varepsilon\sqrt{n}) | X_{t-1}\big)
= P_\varepsilon\Big(|\varepsilon_t| > \frac{\varepsilon\sqrt{n}}{|X_{t-1}|}\Big) \leq \frac{X_{t-1}^2\,\mathrm{var}(\varepsilon_t)}{\varepsilon^2 n}. \qquad (21)$$
Substituting (21) into (20), and using $\mathrm{var}(\varepsilon_t) = 1$, we have
$$\frac{1}{n}\sum_{t=1}^{n} E\big(Y_t^2 I(|Y_t| > \varepsilon\sqrt{n}) | \mathcal{F}_{t-1}\big)
\leq \frac{1}{n}\sum_{t=1}^{n} X_{t-1}^2 E(\varepsilon_t^4)^{1/2}\Big[\frac{X_{t-1}^2\,\mathrm{var}(\varepsilon_t)}{\varepsilon^2 n}\Big]^{1/2}
= \frac{E(\varepsilon_t^4)^{1/2}}{\varepsilon\,n^{3/2}}\sum_{t=1}^{n} |X_{t-1}|^3
= \frac{E(\varepsilon_t^4)^{1/2}}{\varepsilon\,n^{1/2}}\cdot\frac{1}{n}\sum_{t=1}^{n} |X_{t-1}|^3.$$
If $E(\varepsilon_t^4) < \infty$, then $E(X_t^4) < \infty$; therefore, by using the ergodic theorem, we have
$\frac{1}{n}\sum_{t=1}^{n} |X_{t-1}|^3 \stackrel{a.s.}{\rightarrow} E(|X_0|^3)$. Since almost sure convergence implies convergence in
probability, we have
$$\frac{1}{n}\sum_{t=1}^{n} E\big(Y_t^2 I(|Y_t| > \varepsilon\sqrt{n}) | \mathcal{F}_{t-1}\big)
\leq \underbrace{\frac{E(\varepsilon_t^4)^{1/2}}{\varepsilon\,n^{1/2}}}_{\rightarrow 0}\;\underbrace{\frac{1}{n}\sum_{t=1}^{n} |X_{t-1}|^3}_{\stackrel{P}{\rightarrow} E(|X_0|^3)} \stackrel{P}{\rightarrow} 0.$$
Hence condition (17) is satisfied.
C3: We need to verify that
$$\frac{1}{n}\sum_{t=1}^{n} E(Y_t^2 | \mathcal{F}_{t-1}) \stackrel{P}{\rightarrow} \sigma^2.$$
Since $\{X_t\}_t$ is an ergodic sequence, we have
$$\frac{1}{n}\sum_{t=1}^{n} E(Y_t^2 | \mathcal{F}_{t-1}) = \frac{1}{n}\sum_{t=1}^{n} E(X_{t-1}^2\varepsilon_t^2 | X_{t-1})
= \frac{1}{n}\sum_{t=1}^{n} X_{t-1}^2 E(\varepsilon_t^2) = E(\varepsilon_t^2)\underbrace{\frac{1}{n}\sum_{t=1}^{n} X_{t-1}^2}_{\stackrel{a.s.}{\rightarrow} E(X_0^2)}
\stackrel{P}{\rightarrow} E(\varepsilon_t^2)E(X_0^2) = \sigma^2,$$
hence we have verified condition (18).
Altogether, conditions C1-C3 imply that
$$\frac{1}{\sqrt{n}}\sum_{t=1}^{n} X_{t-1}\varepsilon_t \stackrel{D}{\rightarrow} \mathcal{N}(0, \sigma^2),
\quad \text{and hence} \quad \sqrt{n}\,\nabla L_n(a) = \frac{2\sqrt{n}}{n-1}\sum_{t=2}^{n} X_{t-1}\varepsilon_t \stackrel{D}{\rightarrow} \mathcal{N}(0, 4\sigma^2). \qquad (22)$$
Recalling (14) and that $\sqrt{n}\,\nabla L_n(a) \stackrel{D}{\rightarrow} \mathcal{N}(0, 4\sigma^2)$, we have
$$\sqrt{n}(\hat{a}_n - a) = \underbrace{\big(-\nabla^2 L_n\big)^{-1}}_{\stackrel{a.s.}{\rightarrow}\,(2E(X_0^2))^{-1}}\;\underbrace{\sqrt{n}\,\nabla L_n(a)}_{\stackrel{D}{\rightarrow}\,\mathcal{N}(0, 4\sigma^2)}. \qquad (23)$$
Using that $E(X_0^2) = \sigma^2$, this implies that
$$\sqrt{n}(\hat{a}_n - a) \stackrel{D}{\rightarrow} \mathcal{N}\big(0, (2\sigma^2)^{-2}\cdot 4\sigma^2\big) = \mathcal{N}(0, \sigma^{-2}) = \mathcal{N}(0, 1 - a^2). \qquad (24)$$
Thus we have derived the limiting distribution of $\hat{a}_n$.
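The limiting distribution can be checked by Monte Carlo: over many independent AR(1) paths, the standardised estimation errors $\sqrt{n}(\hat{a}_n - a)$ should have a standard deviation close to $\sqrt{1 - a^2}$. A sketch assuming NumPy; the replication counts are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(6)
a, n, reps = 0.5, 1_000, 1_000
stats = np.empty(reps)
for r in range(reps):
    eps = rng.standard_normal(n)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = a * x[t - 1] + eps[t]
    a_hat = np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)
    stats[r] = np.sqrt(n) * (a_hat - a)

# The limit N(0, 1 - a^2) predicts mean 0 and standard deviation sqrt(0.75)
print(np.mean(stats), np.std(stats))
```

A histogram of `stats` (not drawn here) would look approximately Gaussian, which is what a test of $H_0 : a = 0$ would rely on in practice.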
References
Berkes, I., Horváth, L., & Kokoszka, P. (2003). GARCH processes: Structure and estimation.
Bernoulli, 9, 201-227.
Billingsley, P. (1965). Ergodic Theory and Information. New York: Wiley.
Grimmett, G., & Stirzaker, D. (1994). Probability and Random Processes. Oxford: Oxford
University Press.
Hall, P., & Heyde, C. (1980). Martingale Limit Theory and its Application. New York:
Academic Press.
Stout, W. (1974). Almost Sure Convergence. New York: Academic Press.