
Time Series Analysis

and
System Identification
Course number 157109
Huibert Kwakernaak
Gjerrit Meinsma
Department of Applied Mathematics
University of Twente
Enschede, The Netherlands
Preface
These lecture notes were translated from the 1997 Dutch edition of the lecture notes for the course Tijdreeksenanalyse en Identificatietheorie. The Dutch lecture notes are a completely revised and expanded version of an earlier set of notes (Bagchi and Strijbos, 1988).
For the preparation of part of the Dutch notes the book by Kendall and Orr (1990) was used. This book served as the text for the course for several years. Also, while preparing the Dutch notes ample use was made of the seminal book by Ljung (1987).
The Dutch lecture notes first came out in 1994. Revisions appeared in 1995 and 1997.
In the preparation of the English edition the opportunity was taken to introduce a number of minor revisions and improvements in the presentation. For the benefit of speakers of Dutch an English-Dutch glossary has been included.
Prerequisites
The prerequisite knowledge for the course covers important parts of
1. probability theory,
2. mathematical statistics,
3. mathematical system theory, and
4. the theory of stochastic processes,
all at the level of the undergraduate courses about these subjects taught at the Department of Applied Mathematics of
the University of Twente.
Matlab
Time series analysis and identification theory are of great applied mathematical interest but also serve a very important practical purpose. There is a generous choice of application software for the numerous algorithms and procedures that have been developed for time series analysis and system identification. In the international systems and control community MATLAB is the standard environment for numerical work. For this reason MATLAB has been selected as the platform for the numerical illustrations and the computer laboratory exercises for this course. The numerical illustrations were developed under MATLAB version 4 but they work well under versions 5, 6 and 7.
Within MATLAB the System Identification Toolbox of Ljung (1991) supplies a number of important and highly useful numerical routines. At suitable places in the notes various routines from the System Identification Toolbox are introduced. A concise description of these routines may be found in one of the appendices.
Exercises and examination problems
The notes include a number of exercises. In addition, most of the problems that were assigned at the written examinations during the period 1993-1995 were added to the notes as exercises. Finally, the 1996 and 1997 examinations were translated and included as appendices together with full solutions.
Huibert Kwakernaak
February 1, 1998
Later revisions
In August 1999 errors were corrected by Rens Strijbos. For the print of 2001 more errors were corrected and several
changes were made by Gjerrit Meinsma. The main changes concern the operator q which now denotes the forward
shift operator; the treatment of the FFT and recursive least squares; the use of matrix notation and the addition of
several proofs and exercises. Following the course in 2003 some more errors were corrected. These pertain mainly to
estimation theory of spectral densities. For the 2006 print a beginning was made to rearrange and update the second
chapter. In 2008 some small typos were removed.
Contents

1 Introduction  1
  1.1 Introduction  1
  1.2 Examples  1

2 Stochastic processes  5
  2.1 Basic notions  5
  2.2 Moving average processes  9
  2.3 Convolutions and the shift operator  10
  2.4 Auto-regressive processes  11
  2.5 ARMA processes  16
  2.6 Spectral analysis  17
  2.7 Trends and seasonal processes  22
  2.8 Prediction of time series  24
  2.9 Problems  25

3 Estimators  31
  3.1 Introduction  31
  3.2 Normally distributed processes  31
  3.3 Foundations of time series analysis  33
  3.4 Linear estimators  36
  3.5 Problems  38

4 Non-parametric time series analysis  41
  4.1 Introduction  41
  4.2 Tests for stochasticity and trend  41
  4.3 Classical time series analysis  41
  4.4 Estimation of the mean  43
  4.5 Estimation of the covariance function  45
  4.6 Estimation of the spectral density  46
  4.7 Continuous time processes  53
  4.8 Problems  55

5 Estimation of ARMA models  59
  5.1 Introduction  59
  5.2 Parameter estimation of AR processes  59
  5.3 Parameter estimation of MA processes  63
  5.4 Parameter estimation of ARMA processes  64
  5.5 Non-linear optimization  68
  5.6 Order determination  69
  5.7 Problems  72

6 System identification  75
  6.1 Introduction  75
  6.2 Non-parametric system identification  75
  6.3 ARX models  79
  6.4 ARMAX models  81
  6.5 Identification of state models  85
  6.6 Further problems in identification theory  87
  6.7 Problems  88

A Proofs  91
B The Systems Identification Toolbox  97
C Glossary English-Dutch  99
D Bibliography  101

Index  103
1 Introduction
1.1 Introduction
1.1.1 Time series analysis
Time series analysis deals with the analysis of measured
or observed data that evolve with time. Historically
the subject has developed along different lines. Besides
mathematical statistics also the engineering and physical
sciences have contributed. In the last decades a rewarding interrelationship has developed between the disciplines of time series analysis and system theory. In these lecture notes the system approach to time series analysis is emphasized.
Time series analysis is used as a tool in many scientific fields. A great deal of software is available. In these notes serious attention is devoted to the practical aspects of time series analysis.
Important subjects from time series analysis are:
1. Formulating mathematical models for time series.
2. Describing and analyzing time series.
3. Forecasting time series.
4. Estimating the properties of observed time series.
5. Estimating the parameters of models for observed
time series.
The time series that are studied may be of very different natures. They have in common that their time behavior is irregular and only partly predictable. In Section 1.2 (p. 1) several examples of time series are described.
1.1.2 System identification
[Figure 1.1: System with input signal u and noise v.]
Figure 1.1 shows a common paradigm (that is, a model situation) for system identification. A system is subject to an input signal u, which may be accurately observed and recorded. In some situations the input u may be freely chosen within certain rules. The output signal y may also be observed and recorded. The output y is not only a direct result of the input u but also has a component v that consists of disturbances and measurement errors. For simplicity we refer to v as noise. System identification aims at inferring the dynamical properties of the system and the statistical properties of the noise v from the observed signals u and y.
System identification applies to many situations. As soon as control, prediction or forecasting problems need to be solved for systems whose dynamics are completely or partially unknown, system identification enters the stage. In Section 1.2 (p. 1) an example of a system identification problem is described. The system identification problem has much in common with time series analysis but naturally the presence of the input signal u introduces new aspects.
1.1.3 Organization of the notes
The notes are organized like this:
Chapter 2 (p. 5) reviews some notions from the the-
ory of stochastic processes.
Chapter 3 (p. 31) presents a survey of statistical no-
tions concerning estimators.
Chapter 4 (p. 41) deals with non-parametric time se-
ries analysis. Non-parametric time series analysis
mainly concerns the estimation of covariance func-
tions and spectral density functions.
Chapter 5 (p. 59) is devoted to parametric time series analysis, in particular the estimation of the parameters of ARMA models.
Chapter 6 (p. 75) offers an introduction to non-parametric and parametric system identification.
1.2 Examples
1.2.1 Time series
We present several concrete examples of time series.
Example 1.2.1 (Monthly flow of the river Tiber in Rome during 1937-1962). The flow and other statistics of the Italian river Tiber with its different branches are recorded at five different locations. The measurements are monitored for unusual water heights, water speeds and the like. The data of this example originate from the Rome observation station. They represent monthly averages of the daily measurements. As a result there are twelve data points per year. The data concern the period from 1937 until 1962, with the exception of the years 1943 until 1946. All together 265 data points are available. Figure 1.2 shows the plot of the data.  □
[Figure 1.2: Monthly average of the flow of the river Tiber in Rome during 1937-1962. Horizontal axis: month; vertical axis: average flow [m³/s].]
Example 1.2.2 (Immigration into the USA). To a greater or
lesser extent each country is faced with immigration. Im-
migration may lead to great problems. All countries care-
fully record how many immigrants enter.
In this example we consider the immigration into the
USA during the period from 1820 until 1962. Figure 1.3
shows the plot of the available data.
□

[Figure 1.3: Annual immigration into the USA during 1820-1962. Horizontal axis: year; vertical axis: annual immigration (units unknown).]
Example 1.2.3 (Bituminous coal production of the USA). This example deals with the production of bituminous coal. Bitumen is a collective noun for brown or black mineral substances of a resinous nature and highly inflammable, which are known under different names. Naphtha is the most fluid, petroleum and mineral tar less so, and asphalt is solid.
Monthly records of the production of bituminous coal in the USA are available from January 1952 until December 1959. This amounts to 96 data points. These data might be used to forecast the production in the coming months to meet contractual obligations. The data are plotted in Figure 1.4.  □
[Figure 1.4: Monthly production of bituminous coal in the USA during 1952-1959. Horizontal axis: month; vertical axis: production [10^4 tons].]

Example 1.2.4 (Freight shipping of the Dutch Railways). The Netherlands, like many other countries, has an extensive railway network. Many places are connected by rail. Many goods are shipped by rail. For the Dutch Railways it is important to know whether the demand for railway shipping keeps growing. For this and other reasons the amount of freight shipped is recorded quarterly. Data are available for the period 1965-1978, and are plotted in Fig. 1.5.  □
[Figure 1.5: Quarterly freight shipping of the Dutch Railways during 1965-1978. Horizontal axis: quarter; vertical axis: freight [tons].]
Example 1.2.5 (Mortgages in the USA during 1973-1978). Buying real estate normally requires more cash than is available. A bank needs to be found to negotiate a mortgage. Every bank naturally keeps accurate records of the amount of mortgages that are outstanding and of the new mortgages. The data we work with originate from the reports of large commercial US banks to the Federal Reserve System. The unit is billions of dollars! We have 70 monthly records from January, 1973, until October, 1978. Figure 1.6 shows the plot.  □
[Figure 1.6: Monthly new mortgages in the USA during 1973-1978. Horizontal axis: month; vertical axis: amount [billion $].]
Example 1.2.6 (Home construction in the USA during 1947-1967). Home construction activities are characteristic for the rest of the economy. The number of building permits that are issued generally decreases before an economic recession sets in. For this reason the level of activity in home construction is not only important for the building industry but also as an indicator for general economic development.
This example concerns the prediction of the index of new homes for which a building permit is issued. The data are quarterly during the period from 1947 until 1967. This involves 84 data points, which are plotted in Figure 1.7.  □
[Figure 1.7: Index of quarterly building permits in the USA during 1947-1967. Horizontal axis: quarter; vertical axis: index.]
1.2.2 System identification

Finally we discuss an example of a system identification problem.

Example 1.2.7 (Identification of a laboratory process). Figure 1.8 shows an example of an input signal u and the corresponding output signal y. The recordings originate from an experimental laboratory process and have been taken from the MATLAB Identification Toolbox (Ljung, 1991). The process consists of a hair dryer, which blows hot air through a tube. The input is the electrical voltage across the heating coil. The output is the output voltage of a thermocouple that has been placed in the outgoing air flow and is a measure for the air temperature.
The input signal is a binary stochastic sequence. It is switched between two fixed values. The sampling interval is 0.8 [s]. The signal switches from one value to the other with a probability of 0.2. Such test signals are often used for system identification. They are easy to generate and have a rich frequency content.
Inspection of the output signal shows that it is not free of noise. The noise is caused by turbulence of the air flow.  □
[Figure 1.8: Measured input and output signals in a laboratory process. Top: input u; bottom: output y, both against the sampling instant t.]
2 Stochastic processes
The purpose of time series analysis is to obtain insight
into the structure of the phenomenon that generates the
time series. Generally the phenomenon is modeled as a
stochastic process. The time series then is a realization of
the stochastic process.
Depending on whether the observations of the phe-
nomenon are recorded at discrete instants only (usually
equidistantly spaced) or continuously the time series is
referred to as a discrete or a continuous time series. In
these notes the discussion is mainly limited to discrete
time series.
In Section 2.1 (p. 5) various notions from the theory of stochastic processes are reviewed. The sections that follow deal with several stochastic processes with a specific structure. Section 2.2 (p. 9) discusses moving average processes, Section 2.3 (p. 10) treats convolutions, Section 2.4 (p. 11) is about auto-regressive processes, and Section 2.5 (p. 16) is on mixed auto-regressive/moving average processes. After this in Section 2.6 (p. 17) the spectral analysis of stochastic processes is summarized. In Section 2.7 (p. 22) it is explained how trends and seasonal processes may be modeled. The chapter concludes in Section 2.8 (p. 24) with a treatment of the prediction of time series.
2.1 Basic notions
In this section we summarize several notions from the
theory of stochastic processes.
2.1.1 Time series as realizations of stochastic processes
The examples of time series that we presented in the previous chapter are discrete time series, i.e. they are defined at discrete time instances only. We will denote such time series by x_t, with t the time index. Two dominant features present in all these time series are:

They are irregular: the time series x_t are noisy in that they appear not to be samples of some underlying smooth function.

Despite the irregularities, the time series possess a definite temporal relatedness or memory: subsequent values x_t and x_{t+1} in a time series are not unrelated.

Irregularity is unavoidable and temporal relatedness is a must without which for instance prediction would not be feasible. Therefore any sensible class of mathematical models for time series should be able to capture these two features. The predominant class of mathematical models with these features are the stochastic processes. To introduce stochastic processes we consider 9 fictitious daily temperature profiles x_t, t = 1, 2, ..., shown in Fig. 2.1(a-i). The two features of irregularity and temporal relatedness are clear in these 9 series.

[Figure 2.1: Nine time series (a-i) and combined (j).]

If we want to consider the 9 temperature profiles as originating from a single process
then it makes sense to combine the plots into a single plot
as shown in Fig. 2.1(j). Here at any time t we have a cloud
of 9 points x_t. We now take the point of view that each of the 9 time series x_t, t = 1, 2, ..., is a realization of a family of stochastic variables X_t, t = 1, 2, .... A realization is a particular outcome, a particular observation of a family of stochastic variables. The stochastic variables X_t have a certain probability distribution associated to them,

F_{X_t}(x, t) := Pr(X_t ≤ x).

Its derivative

f_{X_t}(x, t) := ∂/∂x F_{X_t}(x, t)

is known as the (probability) density function or amplitude distribution, and intuitively we expect, at any given t, many samples x_t where the mass of the density function is high. Figure 2.2(a) tries to convey this point. It depicts the density functions f_{X_t}(x, t) together with the 9 realizations. Most of the samples x_t indeed are located near the peak of the density functions. The density function f_{X_t}(x, t) may depend on time, as is clearly the case here.
2.1.2 The mean value function
Knowledge of the probability distribution F_{X_t}(x, t) is sufficient to determine the mean value function m(t), or simply the mean. It is defined as

m(t) = E[X_t] = ∫_ℝ x dF_{X_t}(x, t).    (2.1)

E denotes expectation. The mean value function m(t) indicates the average level of the process at time t. For reasons of exposition the mean is depicted by a solid graph in Fig. 2.2(b) even though it is only defined at the integers.
[Figure 2.2: The combined time series with the density functions superimposed (a) and the mean (b).]

[Figure 2.3: Confidence bounds m(t) ± σ(t) and m(t) ± 2σ(t).]
2.1.3 Variance and standard deviation
The variance is defined as

var(X_t) := E[(X_t − m(t))²] = ∫_ℝ (x − m(t))² dF_{X_t}(x, t).

It characterizes the spread of the collection of realizations around its mean. The smaller the variance, the more condensed we expect the cloud of points to be. Clearly, the variance in Fig. 2.2 increases with time. The variance however has a dimension different from that of the samples: if x_t is in meters, for instance, then var(X_t) is in meters squared. Taking a square root resolves this issue. The standard deviation is the square root of the variance,

σ(t) = √var(X_t).

It has the same dimension as x_t and therefore can be plotted together with x_t: Fig. 2.3 shows the realizations x_t together with two confidence bounds around the mean: m(t) ± σ(t) and m(t) ± 2σ(t).
2.1.4 The covariance and correlation function
Figure 2.4 once again combines the 9 time series in a single plot, but now the samples in each of the 9 time series are connected by solid lines. The purpose is to demonstrate that there is a temporal relatedness or memory in these series. For instance the encircled value of x_2 in the plot appears to influence the subsequent values x_3, x_4, ... and only slowly does its effect on the time series diminish and does the series return in the direction of its expected value m(t). This is very common in practice. For example if the mean temperature today is supposed to be 20 degrees but today it happens to be 10 degrees, then tomorrow the temperature most likely is going to be around 10 degrees as well; however, in a week or so the temperature may easily rise and exceed the claimed average of 20. That is, time series typically display memory but the memory is limited.

[Figure 2.4: Temporal relatedness.]
Temporal relatedness is captured by the covariance function R(t, s) defined as

R(t, s) := cov(X_t, X_s) := E[(X_t − m(t))(X_s − m(s))],    (2.2)

with t and s two time indices. Knowledge of the probability distribution F_X(x, t) is insufficient to determine the covariance function. Indeed this distribution determines for each t the stochastic properties of X_t but it says nothing about how one X_t is related to the next X_{t+1}. Sufficient is knowledge of the joint probability distribution

F_{X_t, X_s}(x_t, x_s; t, s) := Pr(X_t ≤ x_t, X_s ≤ x_s)

and we have

R(t, s) = ∫_{ℝ²} (x_1 − m(t))(x_2 − m(s)) dF_{X_t, X_s}(x_1, x_2; t, s).

The covariance function may be normalized to the correlation function defined as

ρ(t, s) = R(t, s) / √(R(t, t) R(s, s)).    (2.3)

It has the convenient property that

−1 ≤ ρ(t, s) ≤ 1.

Both covariance and correlation functions indicate to what extent the values of the process at the time instants t and s are statistically related. If R(t, s) = ρ(t, s) = 0 we say that X_t and X_s are uncorrelated. If ρ(t, s) = ±1 then X_t and X_s are maximally correlated, and it may be shown that that is the case iff X_t is a linear function of X_s (or the other way around), that is, iff X_t = a + b X_s or X_s = a + b X_t.¹ We will interpret the covariance and correlation function in detail for stationary processes introduced shortly.
Lemma 2.1.1 (Properties of R and ρ).

1. Symmetry: R(t, s) = R(s, t) and ρ(t, s) = ρ(s, t) for all t, s.

2. Positivity: R(t, t) ≥ 0 and ρ(t, t) = 1 for all t.

3. Cauchy-Schwarz inequality: |R(t, s)| ≤ √(R(t, t) R(s, s)) and |ρ(t, s)| ≤ 1 for all t, s.

4. Nonnegative-definiteness of covariance matrix: The covariance matrix, also called variance matrix, of a finite number n of stochastic variables X_{t_1}, X_{t_2}, ..., X_{t_n} is the n × n matrix R defined as

   R = E[ (X_{t_1} − m(t_1), ..., X_{t_n} − m(t_n))ᵀ (X_{t_1} − m(t_1), ..., X_{t_n} − m(t_n)) ].    (2.4)

   The ij-th entry of R is R(t_i, t_j). The covariance matrix is a symmetric nonnegative definite n × n matrix, that is, for every vector v ∈ ℝⁿ there holds that vᵀRv ≥ 0. Likewise also the correlation matrix P, which is the matrix with entries ρ(t_i, t_j), is symmetric and nonnegative definite.

Proof. Problem 2.3 (p. 26).
2.1.5 Definition of a stochastic process

More generally, a stochastic process is a family of stochastic variables X_t, t ∈ 𝕋. The set 𝕋 is the time axis. In the discrete-time case usually 𝕋 is the set of the natural numbers ℕ or that of the integers ℤ, or subsets thereof. In the continuous-time case often 𝕋 is the real line ℝ or the set of nonnegative real numbers ℝ₊.

¹ This follows from the proof of the Cauchy-Schwarz inequality, see page 91.

The values of a stochastic process at n different time instants t_1, t_2, ..., t_n, all in 𝕋, have a joint probability distribution

F_{X_{t_1}, X_{t_2}, ..., X_{t_n}}(x_1, x_2, ..., x_n; t_1, t_2, ..., t_n) := Pr(X_{t_1} ≤ x_1, X_{t_2} ≤ x_2, ..., X_{t_n} ≤ x_n),    (2.5)

with x_1, x_2, ..., x_n real numbers. The set of all these probability distributions, for all n ∈ ℕ and, for fixed n, for all n-tuples t_1, t_2, ..., t_n ∈ 𝕋, is called the probability law of the process. If this probability law is known then the stochastic structure of the process is almost completely determined.² Usually it is too much, and also not necessary, to require that the complete probability law be known, and the partial characterizations of the process such as mean and covariance are sufficient.
2.1.6 Stationarity and wide-sense stationarity
A process is called (strictly) stationary if the probability distributions (2.5) are invariant under time shifts, that is, if for every n ∈ ℕ

Pr(X_{t_1+τ} ≤ x_1, X_{t_2+τ} ≤ x_2, ..., X_{t_n+τ} ≤ x_n) = Pr(X_{t_1} ≤ x_1, X_{t_2} ≤ x_2, ..., X_{t_n} ≤ x_n)

for all τ. The statistical properties of a stationary process do not change with time.

² In the discrete-time case the probability law determines the structure completely. In the continuous-time case more is needed.

A direct consequence of stationarity is that for stationary stochastic processes the distribution F_{X_t}(x) does not depend on time. As a result also the mean value function

m(t) = ∫_ℝ x dF_{X_t}(x)    (2.6)

does not depend on time, and, hence, is a constant m. In addition we have for the covariance function

R(t+τ, s+τ) = ∫_{ℝ²} (x_1 − m)(x_2 − m) dF_{X_{t+τ}, X_{s+τ}}(x_1, x_2; t+τ, s+τ)
            = ∫_{ℝ²} (x_1 − m)(x_2 − m) dF_{X_t, X_s}(x_1, x_2; t, s)
            = R(t, s).    (2.7)

This holds in particular for τ = −s, in which case we get R(t−s, 0) = R(t, s). The covariance function R(t, s) of a stationary process therefore only depends on the difference of its arguments t and s. For a stationary process we define

r(τ) = R(t+τ, t) = cov(X_{t+τ}, X_t),    (2.8)
and call r, like R, the covariance function of the process. The correlation function is redefined as

ρ(τ) = r(τ) / r(0).    (2.9)
Stochastic processes that are not necessarily stationary,
but have the properties that the mean value function is
constant and the covariance function only depends on
the difference of its arguments are called wide-sense sta-
tionary. The amplitude distribution of a wide-sense sta-
tionary process is not necessarily constant!
Lemma 2.1.2 (Covariance function of a wide-sense stationary process). Suppose that r is the covariance function and ρ the correlation function of a wide-sense stationary process. The following holds.

1. Positivity: r(0) ≥ 0, ρ(0) = 1.

2. Symmetry: r(−τ) = r(τ) and ρ(−τ) = ρ(τ) for all τ ∈ ℤ.

3. Cauchy-Schwarz inequality: |r(τ)| ≤ r(0) and |ρ(τ)| ≤ 1 for all τ ∈ ℤ.

4. Nonnegative-definiteness: The covariance matrix and correlation matrix

   [ r(0)       r(τ_1)     ⋯  r(τ_{n−1}) ]
   [ r(τ_1)     r(0)       ⋯  r(τ_{n−2}) ]
   [ ⋮          ⋮          ⋱  ⋮          ]    (2.10)
   [ r(τ_{n−1}) r(τ_{n−2}) ⋯  r(0)       ]

   [ ρ(0)       ρ(τ_1)     ⋯  ρ(τ_{n−1}) ]
   [ ρ(τ_1)     ρ(0)       ⋯  ρ(τ_{n−2}) ]
   [ ⋮          ⋮          ⋱  ⋮          ]    (2.11)
   [ ρ(τ_{n−1}) ρ(τ_{n−2}) ⋯  ρ(0)       ]

   are nonnegative definite n×n matrices for any τ_i, i = 1, 2, ..., n−1, and all n ∈ ℕ.
Proof. Problem 2.4 (p. 26).
2.1.7 Temporal relatedness
As claimed, the covariance function r(τ) of a wide-sense stationary process describes the temporal relatedness of the process. An example of a covariance function is

r(τ) = σ²_X a^{|τ|},  τ ∈ ℤ.    (2.12)

The number σ_X = √r(0) = √var(X_t) is the standard deviation of the process. The number a, with |a| < 1, determines how fast the function decreases to zero, and, hence, how fast the temporal relatedness is lost. Figure 2.5(a) shows the plot of r for σ_X = 1 and a = 0.9. Figure 2.6(a) shows an example of a realization of the corresponding (stationary) process with mean value m = 0. In Example 2.4.3 (p. 12) we see how such realizations may be generated.
[Figure 2.5: (a) Exponential covariance function. (b) Damped harmonic covariance function.]

[Figure 2.6: (a) Realization of a process with exponential covariance function. (b) Realization of a process with damped harmonic covariance function.]
A different example of a covariance function is

r(τ) = σ²_X a^{|τ|} [A cos(2πτ/T) + B sin(2π|τ|/T)],  τ ∈ ℤ,    (2.13)

with σ_X, a, T, A, and B constants. Again σ_X ≥ 0 is the standard deviation of the process. The number a, with |a| < 1, again determines how fast the function decreases to zero. The number T > 0 represents a periodic characteristic of the time series. Figure 2.5(b) displays the behavior of the plot of r for σ_X = 1, a = 0.9, T = 12, A = 1 and B = 0.18182. Figure 2.6(b) shows a possible realization of the process with mean value m = 0. The irregularly periodic behavior is unmistakable.
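The covariance functions (2.12) and (2.13) are easily evaluated and plotted. The following MATLAB fragment is a small sketch (added here for illustration, not part of the original notes) that reproduces the shapes of Fig. 2.5 for the parameter values quoted above.

tau = -50:50;                    % lags
sigmaX = 1; a = 0.9;             % parameters of (2.12)
r1 = sigmaX^2 * a.^abs(tau);     % exponential covariance function (2.12)
T = 12; A = 1; B = 0.18182;      % additional parameters of (2.13)
r2 = sigmaX^2 * a.^abs(tau) .* ...
     (A*cos(2*pi*tau/T) + B*sin(2*pi*abs(tau)/T));   % (2.13)
subplot(2,1,1); plot(tau,r1); title('exponential covariance');
subplot(2,1,2); plot(tau,r2); title('damped harmonic covariance');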
2.1.8 White noise
The temporal relatedness is minimal in a wide-sense stationary discrete-time process for which

r(τ) = σ²  for τ = 0,   r(τ) = 0  for τ ≠ 0.    (2.14)

This process consists of a sequence of uncorrelated stochastic variables X_t, t ∈ ℤ. Such a process is sometimes said to be purely random. In a physical or engineering context it is often called white noise. This name is explained in Section 2.6.5 (p. 20). Usually the mean value of white noise is assumed to be 0 and the standard deviation σ taken to be 1. Sometimes it is assumed that the stochastic variables are not only uncorrelated but even mutually independent. In this case the process is completely characterized by its amplitude distribution F_{X_t}(x) = Pr(X_t ≤ x). Often it is assumed that this amplitude distribution is normal.
The white noise process plays an important role in the theory of stochastic processes and time series analysis. In a sense it is the most elementary stochastic process. We shall soon see that other stochastic processes often may be thought to originate from white noise, that is, they are generated by a driving white noise process.
Figure 2.7 shows an example of a realization of white noise with mean zero and standard deviation σ = 1. In a suitable computer environment realizations of white noise may easily be generated with a random number generator.
Figure 2.7: Realization of white noise with mean 0
and standard deviation 1
x= rand(1,99); % 99 samples white, uniformly
% distributed over [0,1]
x=randn(1,99); % 99 samples white, normally
% distributed, mean 0, var 1
plot(x); % plot it
[Figure 2.8: Realization of a moving average X_t = ε_t + ε_{t−1} + ⋯ + ε_{t−10} (top) of white noise ε_t (bottom).]
2.2 Moving average processes
A moving average process of order k is a process X_t that may be described by the equation

X_t = b_0 ε_t + b_1 ε_{t−1} + ⋯ + b_k ε_{t−k},  t ≥ k.    (2.15)

The coefficients b_0, b_1, ..., b_k are real and ε_t, t ∈ ℤ, is white noise with mean μ and standard deviation σ. The value X_t of the process at time t is the weighted sum of the k+1 immediately preceding values of the white noise process ε_t. This explains the name. The notation for this process is MA(k), where the acronym MA stands for "moving average". Figure 2.8 illustrates moving averaging.
Without loss of generality it may be assumed that b_0 is scaled so that b_0 = 1.
2.2.1 Mean value function and covariance function
By taking the expectation of both sides of (2.15) it follows for the mean value function m(t) = E[X_t] that

m(t) = μ (b_0 + b_1 + ⋯ + b_k),  t = k, k+1, ....    (2.16)

Obviously m(t) = m is constant.
Define the centered processes X̃_t = X_t − m and ε̃_t = ε_t − μ. By subtracting (2.16) from (2.15) it follows that

X̃_t = b_0 ε̃_t + b_1 ε̃_{t−1} + ⋯ + b_k ε̃_{t−k},  t = k, k+1, ....    (2.17)

Squaring both sides of this equality yields

X̃_t² = Σ_{i=0}^{k} Σ_{j=0}^{k} b_i b_j ε̃_{t−i} ε̃_{t−j}.    (2.18)

Taking the expectation of both sides and using the fact that ε̃_t is white noise we find that

var(X̃_t) = var(X_t) = R(t, t) = σ² Σ_{i=0}^{k} b_i².    (2.19)

Clearly also the variance var(X_t) does not depend on t. By taking the expectation of both sides of

X̃_t X̃_s = Σ_{i=0}^{k} Σ_{j=0}^{k} b_i b_j ε̃_{t−i} ε̃_{s−j}    (2.20)

it follows for t ≥ s that

R(t, s) = E[X̃_t X̃_s] = σ² Σ_{i=t−s}^{k} b_i b_{i−t+s}  if 0 ≤ t − s ≤ k,   R(t, s) = 0  if t − s > k.

Inspection shows that R(t, s) only depends on the difference t − s of its arguments. The process X_t hence is wide-sense stationary with covariance function

r(τ) = σ² Σ_{i=|τ|}^{k} b_i b_{i−|τ|}  for |τ| ≤ k,   r(τ) = 0  for |τ| > k.    (2.21)

We see that the covariance is exactly zero for time shifts |τ| greater than k. This is because the process has finite memory. This should be clear from Fig. 2.8: in the figure the two time windows for t and s do not overlap, that is, no sample of the white noise is in both of the windows, hence X_t and X_s are uncorrelated in Fig. 2.8.
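As an illustration (a sketch added here, not part of the original notes), the covariance function (2.21) can be evaluated directly from the coefficients b_i and compared with the sample covariance of a simulated realization. The scheme and its coefficients below are arbitrary example values.

b = [1 0.5 -0.4];             % b_0, b_1, b_2 of an example MA(2) scheme
sigma = 1; k = length(b)-1;
r = zeros(1,k+1);             % theoretical r(0), r(1), ..., r(k) from (2.21)
for tau = 0:k
    r(tau+1) = sigma^2 * sum(b(tau+1:k+1).*b(1:k+1-tau));
end
ep = sigma*randn(1,100000);   % white noise with mean 0
x  = filter(b,1,ep);          % X_t = b_0 e_t + b_1 e_{t-1} + b_2 e_{t-2}
rhat = zeros(1,k+1);          % sample covariances at the same lags
for tau = 0:k
    rhat(tau+1) = mean(x(1+tau:end).*x(1:end-tau));
end
disp([r; rhat])               % the two rows should nearly agree

For lags larger than k the sample covariance is not exactly zero, but it fluctuates around zero, in agreement with (2.21).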
2.2.2 The running average process
An example of an MA(k) process is the running average process of order k+1, defined by

X_t = 1/(k+1) Σ_{j=0}^{k} ε_{t−j},  t ≥ k.    (2.22)

The weights of this MA(k) process are b_i = 1/(k+1), i = 0, 1, ..., k. The variance of the process is

σ²_X = r(0) = var(X_t) = σ² Σ_{i=0}^{k} 1/(k+1)² = σ²/(k+1).

The covariance function of the running average process is

r(τ) = σ² Σ_{i=|τ|}^{k} 1/(k+1)² = σ²_X (1 − |τ|/(k+1))  for |τ| ≤ k,   r(τ) = 0  for |τ| > k.

Figure 2.9 shows the triangular shape of r. Figure 2.10 shows an example of a realization of the process for k = 9 and σ = 1. For the computation of the first 9 values of the process the white noise ε_t for t < 0 has somewhat arbitrarily been taken equal to 0.
[Figure 2.9: Covariance function of the running average process of order 9.]

[Figure 2.10: Realization of the running average process of order 9.]
ep = randn(1,200);    % some white noise
B = [1 -2 3.3];       % coefficients of the MA scheme
x = filter(B,1,ep);   % x_t = ep_t - 2 ep_{t-1} + 3.3 ep_{t-2}
plot(x(3:end));       % only plot x_3, x_4, ..., x_200
2.3 Convolutions and the shift operator
That MA-processes are wide-sense stationary is not unexpected once we realize that MA-processes may be formulated without explicit reference to time: whatever t is, the value of x_t is a sum of preceding white noise samples.
It is useful at this point to introduce an abbreviated notation for MA- and other processes. Define the forward shift operator, q, by

(qX)_t = X_{t+1},  t ∈ ℤ.    (2.23)

Then (q^{−1}X)_t = X_{t−1} and we may write (2.15) as

X_t = (b_0 + b_1 q^{−1} + ⋯ + b_k q^{−k}) ε_t

or simply

X_t = N(q) ε_t,

for N(q) := b_0 + b_1 q^{−1} + ⋯ + b_k q^{−k}. The operator N(q) that defines the MA-process is a polynomial in q^{−1}, and it does not depend on time. Moving averages are convolutions. Indeed, the indices of every term in b_0 ε_t + b_1 ε_{t−1} + ⋯ + b_k ε_{t−k} add up to t, so an MA-process is an example of a process of the form

X_t = Σ_{n=−∞}^{∞} h_n ε_{t−n}.    (2.24)
The right-hand side is known as the convolution of h_t and ε_t. Similar to what we showed for MA processes, we have for convolution systems that

E[X_t] = μ Σ_{n=−∞}^{∞} h_n    (2.25)

and

r(τ) = σ² Σ_{n=−∞}^{∞} h_{n+τ} h_n.    (2.26)

Stochastic processes with infinite memory are not MA(k)-processes, as the latter have a finite memory of length k. Convolutions (2.24) however are MA-processes of infinite order and they can in principle exhibit infinite memory. The class of convolutions is in fact large enough to model effectively every wide-sense stationary process. This is an important result:

Lemma 2.3.1 (Innovations representation). Every zero mean wide-sense stationary process X_t with absolutely summable covariance function, Σ_τ |r(τ)| < ∞, is of the form (2.24) with ε_t some zero mean white noise process and h_t some square summable sequence, Σ_t |h_t|² < ∞.

Proof. Follows later from spectral properties, see Appendix A.

A system theoretic interpretation is that essentially every wide-sense stationary process can be seen as the output of a system driven by white noise.
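A convolution (2.24) with a causal, finitely supported h_n is exactly what MATLAB's filter routine implements. The fragment below is an illustrative sketch (not from the notes): it takes h_n = a^n for n = 0, ..., 40, a truncated infinite-order MA, and checks the covariance formula (2.26) at a few lags against a simulation.

a = 0.8; n = 0:40;
h = a.^n;                       % square summable weights (truncated)
sigma = 1;
ep = sigma*randn(1,200000);     % white noise with mean 0
x  = filter(h,1,ep);            % convolution of h and the white noise
for tau = 0:3
    rtheory = sigma^2 * sum(h(1+tau:end).*h(1:end-tau));   % (2.26)
    rsample = mean(x(1+tau:end).*x(1:end-tau));
    fprintf('tau=%d: theory %.3f, sample %.3f\n', tau, rtheory, rsample);
end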
2.4 Auto-regressive processes
The previous lemma appears to suggest that when considering wide-sense stationary processes we should try to model them as an MA-process, possibly of infinite order (i.e. convolutions). Not necessarily so, as the following example shows.

Example 2.4.1 (Random shock and its simulation). Consider the infinite order MA-process

X_t = ε_t + a ε_{t−1} + a² ε_{t−2} + a³ ε_{t−3} + ⋯.

For practical implementation we would have to cut this infinite order MA-process to a finite one. If for instance a = 0.9 then an MA(50) scheme could be a good approximation (note that 0.9^50 ≈ 0.0052). However, we can also express X_t recursively as

X_t = a X_{t−1} + ε_t.

This is a description of the process that requires only a single coefficient and the values of X_t are easily generated recursively. This is an example of a first order auto-regressive process.  □
In this section we analyse auto-regressive schemes. An auto-regressive process of order n is a process X_t, t ∈ ℤ₊, that satisfies a difference equation of the form

X_t = a_1 X_{t−1} + a_2 X_{t−2} + ⋯ + a_n X_{t−n} + ε_t,  t ≥ n.    (2.27)

The constant coefficients a_1, a_2, ..., a_n are real numbers and ε_t, t ∈ ℤ, is stationary white noise with mean μ and variance σ². We assume that σ > 0 and a_n ≠ 0. The process is called auto-regressive because the value of the process at time t depends, besides on a purely random component, on the n immediate past values of the process itself. We denote such a process as an AR(n) process. Because of its memory, an AR process generally is less irregular than a white noise process.
2.4.1 The Markov scheme
We first analyze the simplest case

X_t = a X_{t−1} + ε_t,  t = 1, 2, ...,    (2.28)

with a real. In the statistical literature this AR(1) scheme is often called a Markov scheme. For a = 1 the process known as random walk results.
By repeated substitution it follows that

X_t = ε_t + a ε_{t−1} + a² ε_{t−2} + ⋯ + a^{t−1} ε_1 + a^t X_0,    (2.29)

for t = 1, 2, .... By taking the expectation of both sides we find for m(t) = E[X_t]

m(t) = μ + aμ + a²μ + ⋯ + a^{t−1}μ + a^t m(0)
     = μ (1 − a^t)/(1 − a) + a^t m(0)  if a ≠ 1,
     = μ t + m(0)                      if a = 1.    (2.30)

The mean value function is time-dependent, except if a = 0 or (a = 1, μ = 0) or, if a ≠ 1, if

m(0) = μ/(1 − a).    (2.31)

Then we have m(t) = m(0), t ≥ 0.
We could also have obtained these results by taking the expectation of both sides of (2.28). Then we obtain

m(t) = a m(t−1) + μ,  t = 1, 2, ....    (2.32)

Repeated substitution results in (2.30).
We consider the covariance function of the process produced by the Markov scheme. Define X̃_t = X_t − m(t) and ε̃_t = ε_t − μ. The process X̃_t has mean 0, and ε̃_t is white noise with mean 0 and variance σ². X̃_t and ε̃_t are the centered processes corresponding to X_t and ε_t.
We have R̃(t, s) = cov(X̃_t, X̃_s) = cov(X_t, X_s) = R(t, s), so that we may compute the covariance function of X_t as that of X̃_t. Furthermore it follows by subtracting (2.32) from (2.28) that

X̃_t = a X̃_{t−1} + ε̃_t,  t = 1, 2, ....    (2.33)
From

X̃_t = ε̃_t + a ε̃_{t−1} + a² ε̃_{t−2} + ⋯ + a^{t−1} ε̃_1 + a^t X̃_0,    (2.34)

for t = 1, 2, ..., we obtain

R(t, t) = var(X_t) = (1 + a² + ⋯ + a^{2t−2}) σ² + a^{2t} var(X_0)
        = (1 − a^{2t})/(1 − a²) σ² + a^{2t} var(X_0)  for a² ≠ 1,
        = t σ² + var(X_0)                             for a² = 1.    (2.35)
Furthermore we have for τ ≥ 0

R(t+τ, t) = cov(X̃_{t+τ}, X̃_t)    (2.36)
          = a^τ [ (1 − a^{2t})/(1 − a²) σ² + a^{2t} var(X_0) ]  for a² ≠ 1,
          = a^τ [ t σ² + var(X_0) ]                             for a² = 1.

Note that we assume X_0 to be independent of ε_t, t > 0.
Suppose that a² ≠ 1. Inspection of (2.35) reveals that R(t, t) depends on the time t, unless

var(X_0) = σ² / (1 − a²).    (2.37)

Because var(X_0) cannot be negative, (2.37) can only hold if |a| < 1. If (2.37) holds then

R(t, t) = σ² / (1 − a²),  t ≥ 0.    (2.38)

It follows from (2.36) that if (2.37) holds then

R(t+τ, t) = σ² / (1 − a²) · a^τ,  τ ≥ 0.    (2.39)

Apparently if |a| < 1 and the initial conditions (2.31) and (2.37) apply then the AR(1) process is wide-sense stationary. With the symmetry property of Lemma 2.1.2 (p. 8) it follows from (2.39) that

r(τ) = cov(X_{t+τ}, X_t) = σ² / (1 − a²) · a^{|τ|},  τ ∈ ℤ.    (2.40)
Further contemplation of (2.30) and (2.36) reveals that if (2.37) does not hold but |a| < 1 then

m(t) → μ / (1 − a)                     as t → ∞,
R(t+τ, t) → σ² / (1 − a²) · a^{|τ|}    as t → ∞.

We say that then the process is asymptotically wide-sense stationary. More generally:

Definition 2.4.2 (Asymptotically wide-sense stationary AR(n) process). An AR(n) process is said to be asymptotically wide-sense stationary if the limits

m := lim_{t→∞} m(t),   r(τ) := lim_{t→∞} R(t+τ, t)

exist and are unique (i.e., do not depend on the initial values X_0, X_1, ..., X_{n−1}).  □
Example 2.4.3 (Simulation of the AR(1) process). We now recognize how a realization of a process with covariance function

r(τ) = σ²_X a^{|τ|},  τ ∈ ℤ,    (2.41)

as shown in Fig. 2.5 on page 8, may be generated. We consider the Markov scheme

X_t = a X_{t−1} + ε_t,  t ≥ 1.    (2.42)

To obtain a required value of σ_X for given a we choose in accordance with (2.37) the standard deviation σ of the white noise ε_t as σ² = (1 − a²) σ²_X. If we let μ = 0 then X_t is centered. Next we need to implement the initial conditions

m(0) = E[X_0] = μ/(1 − a) = 0,
var(X_0) = σ²/(1 − a²) = σ²_X.

These conditions imply that X_0 should randomly be drawn according to a probability distribution with mean zero and variance σ²_X. If X_0 is obtained this way then the rest of the realization is generated with the help of the Markov scheme (2.42). The successive values of ε_t are determined with a random number generator.  □
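A minimal MATLAB sketch of this recipe (added here for illustration, not part of the original notes) might look as follows; the normal distribution used for X_0 is one possible choice satisfying the mean and variance requirements.

a = 0.9; sigmaX = 1;            % desired a and standard deviation of X
sigma = sqrt(1-a^2)*sigmaX;     % white noise std so that var(X_t)=sigmaX^2, see (2.37)
Nsamp = 200;
e = sigma*randn(1,Nsamp);       % centered white noise (mu = 0)
x = zeros(1,Nsamp);
x(1) = sigmaX*randn;            % X_0 drawn with mean 0 and variance sigmaX^2
for t = 2:Nsamp
    x(t) = a*x(t-1) + e(t);     % Markov scheme X_t = a X_{t-1} + e_t
end
plot(x)                         % compare with Fig. 2.6(a)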
2.4.2 Asymptotically wide-sense stationary AR processes

We have seen that under the condition |a| < 1 the Markov scheme produces an asymptotically wide-sense stationary process. A similar result holds for the general AR(n) process

X_t = a_1 X_{t−1} + a_2 X_{t−2} + ⋯ + a_n X_{t−n} + ε_t,  t ≥ n.    (2.43)

For given X_0, X_1, ..., X_{n−1} the solution for t ≥ n may be determined by successive substitution. We may also view the process X_t as the solution of the linear difference equation

X_t − a_1 X_{t−1} − a_2 X_{t−2} − ⋯ − a_n X_{t−n} = ε_t,  t ≥ n.    (2.44)

Using the forward shift operator q this becomes

(1 − a_1 q^{−1} − a_2 q^{−2} − ⋯ − a_n q^{−n}) X_t = ε_t,  t ≥ n,    (2.45)

or, with D(q) denoting the polynomial in parentheses,

D(q) X_t = ε_t,  t ≥ n.    (2.46)

The solution X_t of this linear difference equation is the sum of a particular solution (for instance corresponding to the initial conditions X_0 = X_1 = ⋯ = X_{n−1} = 0), and a suitable solution of the homogeneous equation

D(q) X_t = 0.    (2.47)
We briefly discuss the solution of the homogeneous equation. Let λ_1 be a root of the equation D(λ) = 0. Then

X_t = A λ_1^t,  t ∈ ℤ,  A ∈ ℂ    (2.48)

is a solution of (2.47). Indeed we have that

D(q) A λ_1^t = A λ_1^t − a_1 A λ_1^{t−1} − ⋯ − a_n A λ_1^{t−n} = A λ_1^t D(λ_1) = 0.

The function D(λ) is a polynomial in λ^{−1} of degree n and, hence, there are n zeros λ_i. We conclude that if λ_i, i = 1, 2, ..., n, are the n zeros of D, then every linear combination of the functions λ_i^t, t ∈ ℤ, is a solution of the homogeneous equation (2.47). More generally, every solution of the homogeneous equation (2.47) is a linear combination of the functions

t^k λ_i^t,  t ∈ ℤ,    (2.49)

with k = 0, 1, ..., m_i − 1, and i = 1, 2, ..., k. The integer m_i is the multiplicity of the root λ_i, and λ_i, i = 1, 2, ..., k, are the mutually different zeros of D.
If |λ_i| < 1 then the corresponding term (2.49) in the solution of the homogeneous equation (2.47) approaches zero as t → ∞. Conversely, if for one or several of the zeros we have |λ_i| ≥ 1 then the corresponding terms in the solution of the homogeneous equation do not vanish for t → ∞. The result is that in the latter case the solution of the AR equation (2.46) contains non-stochastic terms that do not vanish. The resulting process cannot be asymptotically wide-sense stationary. It may be proved that this condition is also sufficient.

Lemma 2.4.4 (Asymptotically wide-sense stationary AR(n) process). An AR(n) process D(q)X_t = ε_t is asymptotically wide-sense stationary iff all solutions λ of D(λ) = 0 have magnitude strictly smaller than 1.  □

From a system theoretical point of view this neces-
sary and sufcient condition is equivalent to the require-
ment that the system described by the difference equa-
tion D(q)X
t
=
t
(with input signal
t
and output signal
X
t
) be asymptotically stable. For brevity we simply say
that the AR(n) scheme is stable.
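The stability condition of Lemma 2.4.4 is easy to check numerically: the zeros λ of D(λ) = 1 − a_1 λ^{−1} − ⋯ − a_n λ^{−n} are the roots of the polynomial λ^n − a_1 λ^{n−1} − ⋯ − a_n. The following small sketch (not from the notes; the coefficient values are an arbitrary example) does exactly this.

a = [1.2 -0.5];                 % a_1, a_2 of an example AR(2) scheme
lambda = roots([1 -a]);         % zeros of lambda^n - a_1*lambda^(n-1) - ... - a_n
if all(abs(lambda) < 1)
    disp('AR scheme is stable (asymptotically wide-sense stationary)')
else
    disp('AR scheme is not stable')
end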
We have seen that if the initial conditions of the AR(1) process are suitably chosen then the process is immediately wide-sense stationary. This also holds for the AR(n) scheme. In Section 2.4.3 (p. 13) we see what these initial conditions are.
For the AR(1) process X_t = a X_{t−1} + ε_t we have D(q) = 1 − a q^{−1}. The polynomial D has a as its only zero. Necessary and sufficient for asymptotic wide-sense stationarity hence is that |a| < 1. This agrees with what we found in Section 2.4.1 (p. 11).
2.4.3 The Yule-Walker equations
We consider the problem how to compute the (asymptotic) covariance function of the AR(n) process

X_t = a_1 X_{t−1} + a_2 X_{t−2} + ⋯ + a_n X_{t−n} + ε_t,  t ≥ n.    (2.50)

For simplicity we assume that the process has been centered already, so that μ = E[ε_t] = 0 and m(t) = m(0) = E[X_0] = 0.
To derive the covariance function r(τ) we assume for now that τ > 0. Then it follows by multiplying both sides of (2.50) by X_{t−τ},

X_t X_{t−τ} = a_1 X_{t−1} X_{t−τ} + a_2 X_{t−2} X_{t−τ} + ⋯ + a_n X_{t−n} X_{t−τ} + ε_t X_{t−τ},

and taking expectations that

r(τ) = a_1 r(τ−1) + a_2 r(τ−2) + ⋯ + a_n r(τ−n),  τ > 0.    (2.51)

Here we used the assumption that the X_t are computed forward in time (i.e. that X_t is seen as a function of past X_{t−1}, X_{t−2}, ... and past white noise) so that ε_t and X_{t−τ} are uncorrelated for τ > 0. Division of (2.51) by r(0) yields

ρ(τ) = a_1 ρ(τ−1) + a_2 ρ(τ−2) + ⋯ + a_n ρ(τ−n),  τ > 0,    (2.52)

with ρ the correlation function of the process. Interestingly, this equation is nothing but D(q)ρ(τ) = 0 (τ ≥ 1), that is, ρ(τ) is a solution of the homogeneous equation. As a result the ρ(τ) and r(τ) converge to zero for any stable AR process.
For τ = 1, 2, ..., n the equations (2.52) form, taking into account that ρ(0) = 1 and using the symmetry property ρ(−τ) = ρ(τ), a set of n linear equations for the first n correlation coefficients ρ(1), ρ(2), ..., ρ(n). These equations are known as the Yule-Walker equations and they are easy to solve. Once the first n correlation coefficients are known then the remaining coefficients ρ(τ) may be obtained for τ > n by recursive application of (2.52).
To determine the covariance function from the correlation function we need to compute the variance var(X_t) = R(t, t) = r(0) of the process. By squaring both sides of (2.50) and taking expectations we obtain the present variance R(t, t) as a sum of past covariances,

R(t, t) = Σ_{i=1}^{n} Σ_{j=1}^{n} a_i a_j R(t−i, t−j) + σ².    (2.53)

If we suppose again that the process is wide-sense stationary then it follows that

r(0) = Σ_{i=1}^{n} Σ_{j=1}^{n} a_i a_j r(j−i) + σ².    (2.54)

With the substitution r(j−i) = r(0) ρ(j−i) it follows that

r(0) = r(0) Σ_{i=1}^{n} Σ_{j=1}^{n} a_i a_j ρ(j−i) + σ²,    (2.55)

so that

r(0) = σ² / (1 − Σ_{i=1}^{n} Σ_{j=1}^{n} a_i a_j ρ(j−i)).    (2.56)

We now recognize how the initial conditions X_0, X_1, ..., X_{n−1} of the AR(n) scheme need to be chosen so that the process is immediately wide-sense stationary. The stochastic variables X_0, X_1, ..., X_{n−1} need to be randomly drawn according to an n-dimensional joint probability distribution with all n means 0 and covariances cov(X_i, X_j) = r(i−j), with i and j in {0, 1, ..., n−1}.
For the Markov scheme X_t = a X_{t−1} + ε_t the Yule-Walker equations (2.52) reduce to

ρ(k) = a ρ(k−1),  k ≥ 1.    (2.57)

Since ρ(0) = 1 it immediately follows that ρ(k) = a^{|k|}, k ∈ ℤ. From (2.56) it is seen that r(0) = σ²/(1 − a²). These results agree with those of Section 2.4.1 (p. 11).
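To illustrate (a sketch added here, not in the original notes), the Yule-Walker equations (2.52) and the variance formula (2.56) can be used to compute the correlation and covariance function of a given AR(2) scheme numerically; the coefficients below are the same arbitrary stable example used earlier.

a1 = 1.2; a2 = -0.5; sigma = 1;
% For tau = 1, 2 eqn (2.52) with rho(0) = 1 and rho(-tau) = rho(tau) gives
% two linear equations in rho(1) and rho(2):
%   rho(1) = a1*1      + a2*rho(1)
%   rho(2) = a1*rho(1) + a2*1
M   = [1-a2, 0; -a1, 1];
rhs = [a1; a2];
rho = zeros(1,21);              % rho(tau+1) holds the value rho(tau)
rho(1) = 1;                     % rho(0) = 1
rho(2:3) = (M\rhs)';            % rho(1) and rho(2)
for tau = 3:20                  % remaining values by recursion (2.52)
    rho(tau+1) = a1*rho(tau) + a2*rho(tau-1);
end
% variance r(0) from (2.56), written out for n = 2 (rho(2) holds rho(tau=1)):
S  = a1^2 + a2^2 + 2*a1*a2*rho(2);   % sum_i sum_j a_i a_j rho(j-i)
r0 = sigma^2/(1-S);
plot(0:20, r0*rho)              % covariance function r(tau) = r(0)*rho(tau)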
2.4.4 Partial correlations
The result that is described in this subsection plays a role in estimating AR schemes as explained in Sections 5.2 (p. 59) and 5.6 (p. 69).
In the AR(n) scheme the value X_t of the process at time t is a linear regression on the n preceding values X_{t−1}, X_{t−2}, ..., X_{t−n}, with regression coefficients a_1, a_2, ..., a_n. Conversely we could ask the question whether a given wide-sense stationary process X_t (not necessarily generated by an AR scheme) may be explained by an AR(n) scheme. To this end we could determine the coefficients a_1, a_2, ..., a_n that minimize the mean square error

E[ (X_t − Σ_{i=1}^{n} a_i X_{t−i})² ].    (2.58)

Define a_0 = −1. Then we have

E[ (X_t − Σ_{i=1}^{n} a_i X_{t−i})² ] = Σ_{i=0}^{n} Σ_{j=0}^{n} a_i a_j r(i−j),    (2.59)

with r the covariance function of the process. We minimize this quadratic function with respect to the coefficients a_1, a_2, ..., a_n. Partial differentiation with respect to a_k, k = 1, 2, ..., n, yields the necessary conditions

Σ_{i=0}^{n} a_i r(i−k) = 0,  k = 1, 2, ..., n.    (2.60)

Because a_0 = −1 this is equivalent to

r(k) = Σ_{i=1}^{n} a_i r(i−k),  k = 1, 2, ..., n.    (2.61)

Division by r(0) yields

ρ(k) = Σ_{i=1}^{n} a_i ρ(i−k),  k = 1, 2, ..., n,    (2.62)

with ρ the correlation function. These are precisely the Yule-Walker equations, except that now the correlation coefficients are given, and we need to solve for the regression coefficients a_1, a_2, ..., a_n. The Yule-Walker equations have an important symmetry property that is clear from the matrix representation of (2.62),

[ ρ(0)      ρ(1)      ⋯  ρ(n−1) ] [ a_1 ]   [ ρ(1) ]
[ ρ(1)      ρ(0)      ⋯  ρ(n−2) ] [ a_2 ]   [ ρ(2) ]
[ ⋮         ⋮         ⋱  ⋮      ] [ ⋮   ] = [ ⋮    ]    (2.63)
[ ρ(n−1)    ρ(n−2)    ⋯  ρ(0)   ] [ a_n ]   [ ρ(n) ]
The matrix on the left is symmetric and we recognize it as the correlation matrix of Lemma 2.1.2. As a result the matrix is nonnegative definite and under weak assumptions it is invertible, which guarantees that the a_j exist and are uniquely determined by the ρ(k) (see Problem 2.8, p. 26).
If a priori we do not know what the correct order n of the regression scheme is then we could successively solve the set of equations (2.62) for n = 1, 2, 3, .... If the process really satisfies an AR scheme then starting with a certain value of n, say N+1, the regression coefficients a_k, k ≥ N+1, will be identical to zero.
To make the dependence on n explicit we rewrite (2.62) as

ρ(k) = Σ_{i=1}^{n} a_{ni} ρ(i−k),  k = 1, 2, ..., n.    (2.64)

Here the a_{ni}, i = 1, 2, ..., n, denote the regression coefficients that correspond to a regression scheme of order n.
A well-known result from regression analysis is that rather than solving (2.64) for a number of successive values of n the coefficients a_{ni} may recursively be computed from the equations

a_{n+1,n+1} = [ ρ(n+1) − Σ_{j=1}^{n} a_{nj} ρ(n+1−j) ] / [ 1 − Σ_{j=1}^{n} a_{nj} ρ(j) ],    (2.65)

a_{n+1,j} = a_{nj} − a_{n+1,n+1} a_{n,n+1−j},  j = 1, 2, ..., n.    (2.66)

These equations are known as the Levinson-Durbin algorithm (Durbin, 1960) and they can be derived from the special structure of Eqn. (2.63), see Proof A.1.3 (p. 91). For n given, first (2.65) is used to compute the value of a_{n+1,n+1} from the values of the coefficients that were found earlier. Next, (2.66) yields a_{n+1,1}, a_{n+1,2}, ..., a_{n+1,n}. The recursion equations hold from n = 0, with the convention that summations whose upper limit is less than its lower limit cancel (are zero).
The coefficients a_{11}, a_{22}, a_{33}, ..., are called the partial correlation coefficients. The value N of n above which the partial correlation coefficients are zero (or, in practice, very small) determines the order of the regression scheme.
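A direct transcription of the recursion (2.65)-(2.66) into MATLAB might look as follows (an illustrative sketch, not part of the notes). The function name levinson_durbin and the input vector rho, assumed to contain ρ(1), ..., ρ(N) of the process under study, are hypothetical; the function would be saved as levinson_durbin.m.

function [pacf, a] = levinson_durbin(rho)
% pacf(n) = a_{nn}, the n-th partial correlation coefficient
% a       = regression coefficients a_{N,1}, ..., a_{N,N} of the final order
N = length(rho);
pacf = zeros(1,N);
a = [];                                  % a(j) = a_{nj} for the current n
for n = 0:N-1
    if n == 0
        k = rho(1);                      % a_{11} = rho(1), empty sums are zero
    else
        num = rho(n+1) - sum(a.*rho(n:-1:1));
        den = 1 - sum(a.*rho(1:n));
        k = num/den;                     % a_{n+1,n+1}, eqn (2.65)
    end
    a = [a - k*fliplr(a), k];            % eqn (2.66) plus the new coefficient
    pacf(n+1) = k;
end
end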
2.4.5 Simulation with Matlab
In Exercise 2.7 (p. 26) the AR(2) process

X_t = a_1 X_{t−1} + a_2 X_{t−2} + ε_t    (2.67)

will be discussed. The parameters are chosen as a_1 = 2a cos(2π/T) and a_2 = −a², with a = 0.9 and T = 12. The process has a damped harmonic covariance function. In Fig. 2.6(b) on p. 8 a realization of the process is shown. This realization has been generated with the help of the following sequence of MATLAB commands:
a = 0.9; T = 12;        % Define the parameters
a1 = 2*a*cos(2*pi/T);
a2 = -a*a;
% Choose the standard deviation sigma of
% the white noise so that the stationary
% var of the process is 1 (see problem 2.7):
sigma = sqrt((1-a1-a2)*(1+a1-a2)*(1+a2)/(1-a2));
D = [1 -a1 -a2];        % Define the system
N = 1;
randn('seed',0);        % Generate white noise
e = sigma*randn(200,1);
th = idpoly(D,[],N);    % Generate
x = idsim(e,th);        % and
plot(x);                % plot realization of X_t
The MATLAB functions idsim and poly2th belong to the Systems Identification Toolbox. The function idsim is used to simulate a class of linear systems. The structure of the system is defined by the theta parameter th. The function poly2th generates the theta parameter from the various polynomials that determine the system structure. For AR schemes only the polynomial D(q) = 1 − a_1 q^{−1} − ⋯ − a_n q^{−n} is relevant. The coefficients of D are specified as a row vector according to ascending powers of q^{−1}. The first coefficient of D needs to be 1.
The various MATLAB commands are described in Appendix B.
2.4.6 Converting AR to MA and back
By successive substitution the AR(n) scheme

X_t = a_1 X_{t−1} + a_2 X_{t−2} + ⋯ + a_n X_{t−n} + ε_t    (2.68)

may be rewritten as

X_t = ε_t + h_1 ε_{t−1} + h_2 ε_{t−2} + ⋯.    (2.69)

If the scheme is asymptotically wide-sense stationary, that is, if the system D(q)X_t = ε_t is asymptotically stable, then the coefficients h_1, h_2, h_3, ... converge exponentially to zero. We show this shortly. Hence, the AR scheme may be viewed as an MA(∞) scheme, with infinite memory. Every stable AR scheme may be approximated by an MA scheme with sufficiently long memory.
Take by way of example the stable AR(1) scheme

X_t = a X_{t−1} + ε_t,  t ≥ 1,    (2.70)
with |a| < 1. By repeated substitution it follows that

X_t = a (a X_{t−2} + ε_{t−1}) + ε_t = ε_t + a ε_{t−1} + a² ε_{t−2} + a³ ε_{t−3} + ⋯.

This is the random shock model as considered in Example 2.4.1.
Conversely, under certain conditions an MA(k) scheme may be converted into an AR(∞) scheme. Assume that b_0 = 1. We hence consider the MA(k) scheme

X_t = ε_t + b_1 ε_{t−1} + ⋯ + b_k ε_{t−k}.    (2.71)

This may be rewritten as

ε_t = X_t − b_1 ε_{t−1} − b_2 ε_{t−2} − ⋯ − b_k ε_{t−k}.    (2.72)

By repeated substitution this leads to an AR(∞) scheme of the form

ε_t = X_t + g_1 X_{t−1} + g_2 X_{t−2} + ⋯.    (2.73)

We will show that the condition for convergence of the coefficients g_1, g_2, ... to zero is that all zeros of the polynomial N(λ) = 1 + b_1 λ^{−1} + b_2 λ^{−2} + ⋯ + b_k λ^{−k} have modulus strictly smaller than 1. If this condition is satisfied then the MA scheme X_t = N(q)ε_t is said to be invertible.
Example 2.4.5 (Inversion of an MA(1) process). Consider the MA(1) process $X_t = N(q)\varepsilon_t$, with

$N(q) = 1 + b q^{-1}$.   (2.74)

To find the equivalent AR($\infty$) description we try to eliminate $\varepsilon_{t-k}$ by repeated substitution,

$X_t - b\varepsilon_{t-1} = \varepsilon_t$,
$X_t - b\,(X_{t-1} - b\varepsilon_{t-2}) = \varepsilon_t$   (the term in parentheses replaces $\varepsilon_{t-1}$),
$X_t - bX_{t-1} + b^2 (X_{t-2} - b\varepsilon_{t-3}) = \varepsilon_t$   (the term in parentheses replaces $\varepsilon_{t-2}$).

This results in the AR($\infty$) process $D(q)X_t = \varepsilon_t$ where $D(q) = 1 - bq^{-1} + b^2 q^{-2} - b^3 q^{-3} + \cdots$.   $\square$
The example may suggest that $D(\lambda)$ may be obtained via long division in the negative powers of $\lambda$ of $1/N(\lambda)$. Indeed that is the case for arbitrary invertible $N(q)$, see Problem 2.30 (p. 28) and Problem 2.12 (p. 26). If we are after the coefficients of $D(q)$ then we may do that by repeated substitution or by long division, whichever happens to be more convenient. The connection allows us to prove what we alluded to before.
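As a minimal numerical sketch (not part of the original notes), the expansion coefficients of $1/N(q)$ may also be obtained as the impulse response of the filter $1/N(q)$; the value $b = 0.5$ below is a hypothetical example with $|b| < 1$ so that the MA(1) scheme is invertible.

b = 0.5;                    % hypothetical example value
N = [1 b];                  % N(q) = 1 + b q^-1
delta = [1 zeros(1,9)];     % unit pulse
d = filter(1,N,delta)       % coefficients 1, -b, b^2, -b^3, ... of D(q) = 1/N(q)

The output reproduces the coefficients found by repeated substitution in Example 2.4.5.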
Lemma 2.4.6 (Invertible MA processes). The coefficients of $D(q) = 1/N(q)$ converge to zero if and only if all zeros $\lambda_i$ of $N(\lambda)$ have modulus smaller than 1.

Proof. Let $\lambda_i$ denote the zeros of $N(\lambda)$. Then $N(\lambda)$ may be factored as $N(\lambda) = \prod_{i=1}^{m} (1 - \lambda_i \lambda^{-1})$. For simplicity assume that all zeros $\lambda_i$ have multiplicity 1 so that $1/N(\lambda)$ has a partial fraction expansion of the form

$\dfrac{1}{N(\lambda)} = \sum_{i=1}^{m} \dfrac{c_i}{1 - \lambda_i \lambda^{-1}}$,   $c_i \in \mathbb{C}$.

For $|\lambda|$ large enough all fractions $|\lambda_i \lambda^{-1}|$ are less than 1, in which case we may replace $\frac{1}{1 - \lambda_i \lambda^{-1}}$ by its geometric series. This gives

$\dfrac{1}{N(\lambda)} = \sum_{i=1}^{m} c_i \left( 1 + \lambda_i \lambda^{-1} + (\lambda_i \lambda^{-1})^2 + \cdots \right) = \sum_{j=0}^{\infty} \left( \sum_{i=1}^{m} c_i \lambda_i^{j} \right) \lambda^{-j}$.

The coefficients $\sum_{i=1}^{m} c_i \lambda_i^{j}$ of $\lambda^{-j}$ converge to zero as $j \to \infty$ if and only if $|\lambda_i| < 1$ for all $i$.

The case that some zeros $\lambda_i$ have higher multiplicity may be proved similarly, but the proof is more technical.
The above lemma is a statement about convergence of coefficients. It also implies a type of convergence of the process itself, see Problem 2.15 (p. 27).

In conclusion we point out a further symmetry that AR and MA schemes possess. Stable AR processes have the property that the partial correlation coefficients eventually become exactly zero, while the regular correlation function decreases exponentially. Invertible MA processes have the property that the regular correlation function eventually becomes exactly zero, while the partial correlation coefficients decrease exponentially.
Example 2.4.7 (Matlab simulation). The following sequence of MATLAB commands was used to produce the time series of Fig. 2.10.

% Simulation of an MA(10) process
% Choose sigma such that the stationary
% variance of the process is 1
sigma = 1/sqrt(11);
% Define the system structure
D = 1;
N = ones(1,11);
th = poly2th(D,[],N);
% Generate the white noise
randn('seed',0);
e = sigma*randn(256,1);
% Generate and plot realization of X_t
x = idsim(e,th);
plot(x);

For an explanation see 2.4.5 (p. 15).   $\square$
2.5 ARMA processes

The AR and MA processes that are discussed in the preceding section may be combined to so-called ARMA processes in the following manner. The process $X_t$, $t \in \mathbb{Z}$, is a mixed auto-regressive/moving average process of order $(n, k)$ if it satisfies the difference equation

$X_t = a_1 X_{t-1} + a_2 X_{t-2} + \cdots + a_n X_{t-n} + b_0 \varepsilon_t + b_1 \varepsilon_{t-1} + \cdots + b_k \varepsilon_{t-k}$,   (2.75)

for $t \geq n$. The coefficients $a_1, a_2, \ldots, a_n$, and $b_0, b_1, \ldots, b_k$ are real numbers, and $\varepsilon_t$, $t \in \mathbb{Z}$, is white noise with mean 0 and standard deviation $\sigma$. We denote this process as an ARMA($n, k$) process. Without loss of generality it may be assumed that $b_0 = 1$.
We may compactly represent (2.75) as

$D(q)X_t = N(q)\varepsilon_t$,   (2.76)

with $D$ and $N$ the polynomials in $q^{-1}$,

$D(q) = 1 - a_1 q^{-1} - a_2 q^{-2} - \cdots - a_n q^{-n}$,
$N(q) = b_0 + b_1 q^{-1} + b_2 q^{-2} + \cdots + b_k q^{-k}$.
ARMA processes include AR- and MA-processes as special cases. They form a flexible class of models that describe many practical phenomena with adequate accuracy.

It is not difficult to see that if the AR process $D(q)X_t = \varepsilon_t$ is asymptotically wide-sense stationary then so is the ARMA process $D(q)X_t = N(q)\varepsilon_t$. Necessary and sufficient for wide-sense stationarity is that all the zeros of the polynomial $D$ have modulus strictly smaller than 1. System theoretically this means that the ARMA scheme is stable.

If the ARMA scheme is stable then the ARMA model is equivalent to an MA($\infty$) scheme that is obtained by successive substitution. Conversely, if all the zeros of $N(\lambda)$ have modulus strictly smaller than 1 then the ARMA model by successive substitution is equivalent to an AR($\infty$) scheme. We then say -- like for the MA scheme -- that the ARMA scheme is invertible.

For ARMA processes generally neither the correlations nor the partial correlations become exactly zero with increasing time shift.

Direct computation of the covariance function from the coefficients of the ARMA scheme is not simple. When we discuss in 2.6.5 (p. 20) and 2.6.6 (p. 20) the spectral analysis of wide-sense stationary processes we see how the covariance function may be found.
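A minimal sketch (not from the notes) of how a stable ARMA scheme may be simulated directly with MATLAB's filter command, as an alternative to the toolbox routines used earlier; the coefficient values below are hypothetical.

% Simulating an ARMA(1,1) scheme D(q)X_t = N(q)eps_t with
% D(q) = 1 - 0.8 q^-1 and N(q) = 1 + 0.5 q^-1 (hypothetical values).
D = [1 -0.8];  N = [1 0.5];  sigma = 1;
randn('seed',0);
e = sigma*randn(256,1);
x = filter(N,D,e);     % realization of the ARMA process
plot(x)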
2.6 Spectral analysis

In many applications, for example radar, sonar and seismology applications, the important properties of the time series are their frequency properties. As an example, suppose we want to know the speed of a fast moving object. To do that we may transmit a sinusoidal signal $\cos(\omega_0 t)$ towards this object and estimate the speed of the object on the basis of the reflected signal (the time series). This reflected signal is also sinusoidal but due to the Doppler effect has the different frequency $\omega_0 (1 - 2v/c)$, where $v$ is the speed of the object and $c$ the known speed of wave propagation. Estimation of the speed $v$ now boils down to estimation of the frequency $\omega_0 (1 - 2v/c)$ of the reflected signal. Of course measurements of this signal are noisy and it may hence be advantageous to model it as a realization of a stochastic process.
In this section we summarize a number of important results concerning the spectral analysis of wide-sense stationary processes. We emphasize the connection with system theory.

In 2.5 (p. 16) it is shown that by successive substitution a stable ARMA scheme may be represented by an MA($\infty$) scheme of the form

$X_t = h_0 \varepsilon_t + h_1 \varepsilon_{t-1} + h_2 \varepsilon_{t-2} + \cdots = \sum_{m=0}^{\infty} h_m \varepsilon_{t-m}$,   $t \in \mathbb{Z}$.
We recognize this as a convolution sum. We also recognize that system theoretically the process $X_t$ is the output signal of a linear time-invariant system with the white noise $\varepsilon_t$ as input signal. Figure 2.11(a) shows the corresponding block diagram. In Fig. 2.11(b) we view the ARMA scheme as a special case of a convolution system with input signal $u_t$ and output signal $y_t$. We may write

$y_t = \sum_{m=-\infty}^{\infty} h_m u_{t-m}$,   $t \in \mathbb{Z}$,   (2.77)

where we set $h_m = 0$ for $m < 0$. The function $h_m$, $m \in \mathbb{Z}$, is nothing else but the impulse response of the system. If the input signal is the impulse

$u_t = 1$ for $t = 0$ and $u_t = 0$ for $t \neq 0$,   $t \in \mathbb{Z}$,   (2.78)

then the corresponding output signal is

$y_t = h_t$,   $t \in \mathbb{Z}$.   (2.79)
2.6.1 Frequency response

We consider the convolution system (2.77). Suppose that the input signal is the complex-harmonic signal

$u_t = \mathrm{e}^{\mathrm{i}\omega t}$,   $t \in \mathbb{Z}$,   (2.80)

with $\mathrm{i} = \sqrt{-1}$ and the real number $\omega$ the angular frequency. Then the corresponding output signal is

$y_t = \sum_{m=-\infty}^{\infty} h_m \mathrm{e}^{\mathrm{i}\omega(t-m)} = \Big( \sum_{m=-\infty}^{\infty} h_m \mathrm{e}^{-\mathrm{i}\omega m} \Big) \mathrm{e}^{\mathrm{i}\omega t} = \hat{h}(\omega)\, \mathrm{e}^{\mathrm{i}\omega t}$,   $t \in \mathbb{Z}$.

Hence the output signal also is complex-harmonic(3) with angular frequency $\omega$ and with complex amplitude $\hat{h}(\omega)$.

Figure 2.11: (a) Block diagram of the ARMA system (input $\varepsilon_t$, output $X_t$). (b) Convolution system with impulse response $h$ (input $u_t$, output $y_t$).
The function

$\hat{h}(\omega) = \sum_{m=-\infty}^{\infty} h_m \mathrm{e}^{-\mathrm{i}\omega m}$   (2.81)

is called the frequency response function of the system. The expression (2.81) defines the function $\hat{h}$ as the discrete Fourier transform -- abbreviated to DFT -- of the impulse response $h$. Because there also exists another discrete Fourier transform (see 4.6.5, p. 51) the name DCFT -- discrete-to-continuous Fourier transform -- is more informative (Kwakernaak and Sivan, 1991). Sufficient for existence of the DCFT is that $\sum_{m=-\infty}^{\infty} |h_m|$ exists.
Lemma 2.6.1 (Properties of the DCFT). Let $\hat{h}$ be the DCFT of $h_m$, $m \in \mathbb{Z}$. Then,

1. Periodicity: $\hat{h}(\omega + 2k\pi) = \hat{h}(\omega)$ for every integer $k$.

2. Conjugate symmetry: If $h$ is a real sequence then $\hat{h}$ is conjugately symmetric, that is, $\hat{h}(-\omega) = \overline{\hat{h}(\omega)}$ for all $\omega$. The overbar denotes the complex conjugate.

Proof. Problem 2.17 (p. 27).
Because of the periodicity property it is sufficient to consider the DCFT on an arbitrary interval on the frequency axis of length $2\pi$. Because of the symmetry property we choose for this interval the interval $[-\pi, \pi)$.

By multiplying both sides of (2.81) by $\mathrm{e}^{\mathrm{i}\omega t}$ and next integrating both sides over $[-\pi, \pi)$ with respect to $\omega$ it easily follows that

$h_t = \dfrac{1}{2\pi} \int_{-\pi}^{\pi} \hat{h}(\omega)\, \mathrm{e}^{\mathrm{i}\omega t}\, \mathrm{d}\omega$,   $t \in \mathbb{Z}$.   (2.82)

(3) All harmonics $\mathrm{e}^{\mathrm{i}\omega t}$ are therefore eigenfunctions of the convolution system, and $\hat{h}(\omega)$ are the eigenvalues.
This formula serves to reconstruct the original time function $h$ from the DCFT $\hat{h}$. The formula (2.82) is called the inverse DCFT.

The frequency response function $\hat{h}$ characterizes the response of the convolution system (2.77) to many more input signals than complex-harmonic signals alone. Suppose that the input signal $u_t$ possesses the DCFT $\hat{u}(\omega)$. Then with the help of the inverse DCFT the input signal $u$ may be represented as

$u_t = \dfrac{1}{2\pi} \int_{-\pi}^{\pi} \hat{u}(\omega)\, \mathrm{e}^{\mathrm{i}\omega t}\, \mathrm{d}\omega$,   $t \in \mathbb{Z}$.   (2.83)

This expression shows that $u_t$ may be considered as a linear combination of uncountably many complex-harmonic functions $\mathrm{e}^{\mathrm{i}\omega t}$ with frequencies $\omega \in [-\pi, \pi)$. The infinitesimal coefficient of the harmonic component $\mathrm{e}^{\mathrm{i}\omega t}$ in the expansion (2.83) is $\hat{u}(\omega)\,\mathrm{d}\omega$.
Substitution of the expansion (2.83) into the convolution sum (2.77) yields

$y_t = \sum_{m=-\infty}^{\infty} h_m u_{t-m} = \sum_{m=-\infty}^{\infty} h_m \left( \dfrac{1}{2\pi} \int_{-\pi}^{\pi} \hat{u}(\omega)\, \mathrm{e}^{\mathrm{i}\omega(t-m)}\, \mathrm{d}\omega \right) = \dfrac{1}{2\pi} \int_{-\pi}^{\pi} \underbrace{\Big( \sum_{m=-\infty}^{\infty} h_m \mathrm{e}^{-\mathrm{i}\omega m} \Big)}_{\hat{h}(\omega)} \hat{u}(\omega)\, \mathrm{e}^{\mathrm{i}\omega t}\, \mathrm{d}\omega$,

so that

$y_t = \dfrac{1}{2\pi} \int_{-\pi}^{\pi} \underbrace{\hat{h}(\omega)\, \hat{u}(\omega)}_{\hat{y}(\omega)}\, \mathrm{e}^{\mathrm{i}\omega t}\, \mathrm{d}\omega$,   $t \in \mathbb{Z}$.   (2.84)

The expression (2.84) is precisely the inverse DCFT of the output signal $y$. Clearly we have for the DCFT $\hat{y}$ of $y$

$\hat{y}(\omega) = \hat{h}(\omega)\, \hat{u}(\omega)$,   $\omega \in [-\pi, \pi)$.   (2.85)

The commutative diagram of Fig. 2.12 illustrates the relations between the input signal $u$, the output signal $y$ and their DCFTs $\hat{u}$ and $\hat{y}$.
2.6.2 Frequency response function of the ARMA scheme

Consider as a concrete system the ARMA scheme. If we denote the input signal as $u$ and the corresponding output signal as $y$ then we have

$D(q)y_t = N(q)u_t$,   $t \in \mathbb{Z}$.   (2.86)

We assume that the scheme is stable, that is, that all zeros $\lambda_i$ of $D(\lambda)$ have modulus strictly smaller than 1.

Figure 2.12: Commutative diagram -- in the time domain convolution with $h$ maps $u$ to $y$; in the frequency domain multiplication with $\hat{h}$ maps $\hat{u}$ to $\hat{y}$; the DCFT and the inverse DCFT connect the two.

If the input signal is complex-harmonic with $u_t = \mathrm{e}^{\mathrm{i}\omega t}$ then it follows from 2.6.1 (p. 17) that the output signal is also complex-harmonic of the form $y_t = a\, \mathrm{e}^{\mathrm{i}\omega t}$. In addition, the complex amplitude $a$ is given by $a = \hat{h}(\omega)$, with $\hat{h}$ the frequency response function of the system. Given $u_t = \mathrm{e}^{\mathrm{i}\omega t}$ we therefore look for a solution of the difference equation (2.86) of the form $y_t = a\, \mathrm{e}^{\mathrm{i}\omega t}$.
It may easily be verified that shifting the complex-harmonic signal $\mathrm{e}^{\mathrm{i}\omega t}$ back once comes down to multiplication of the signal by $\mathrm{e}^{-\mathrm{i}\omega}$. Back shifting several times results in multiplication by the corresponding power of $\mathrm{e}^{-\mathrm{i}\omega}$. Application of the compound shift operation $N(q)$ to the complex-harmonic signal $\mathrm{e}^{\mathrm{i}\omega t}$ therefore yields the complex-harmonic signal $N(\mathrm{e}^{\mathrm{i}\omega})\, \mathrm{e}^{\mathrm{i}\omega t}$. Hence, substitution of $u_t = \mathrm{e}^{\mathrm{i}\omega t}$ and $y_t = a\, \mathrm{e}^{\mathrm{i}\omega t}$ into (2.86) results in

$D(\mathrm{e}^{\mathrm{i}\omega})\, a\, \mathrm{e}^{\mathrm{i}\omega t} = N(\mathrm{e}^{\mathrm{i}\omega})\, \mathrm{e}^{\mathrm{i}\omega t}$,   $t \in \mathbb{Z}$.   (2.87)

Cancellation of the factor $\mathrm{e}^{\mathrm{i}\omega t}$ and solution for $a = \hat{h}(\omega)$ shows that the frequency response function of the ARMA scheme is given by

$\hat{h}(\omega) = \dfrac{N(\mathrm{e}^{\mathrm{i}\omega})}{D(\mathrm{e}^{\mathrm{i}\omega})}$,   $\omega \in [-\pi, \pi)$.   (2.88)
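A minimal sketch (not from the notes) of how (2.88) may be evaluated numerically from the polynomial coefficients; the coefficient values are hypothetical.

% Frequency response of a stable ARMA scheme, evaluated from (2.88).
D = [1 -0.8];  N = [1 0.5];           % D(q), N(q) in ascending powers of q^-1
w = linspace(-pi,pi,512);             % frequency grid on [-pi,pi)
kD = 0:length(D)-1;  kN = 0:length(N)-1;
Dw = exp(-1i*w(:)*kD)*D(:);           % D(e^{i w}) = sum_k d_k e^{-i w k}
Nw = exp(-1i*w(:)*kN)*N(:);           % N(e^{i w})
h  = Nw./Dw;                          % frequency response h(w)
plot(w,abs(h)); xlabel('\omega'); ylabel('|h(\omega)|')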
2.6.3 Spectral density function

Realizations of wide-sense stationary processes, such as the output process $X_t$ of a stable ARMA scheme, do not have a DCFT because the infinite sum diverges. Direct frequency analysis of stochastic processes therefore is not immediately feasible. It does make sense to study the Fourier transform of the covariance function of the process, however.

Let $X_t$, $t \in \mathbb{Z}$, be a wide-sense stationary process with covariance function $r$. Then the DCFT

$\phi(\omega) = \sum_{\tau=-\infty}^{\infty} r(\tau)\, \mathrm{e}^{-\mathrm{i}\omega\tau}$,   $\omega \in [-\pi, \pi)$,   (2.89)

of $r$ (if it exists) is called the spectral density function or the power spectral density function of the process. The name will soon become clear.
Lemma 2.6.2 (Properties of the spectral density function). The spectral density function has the following properties:

1. Realness: $\phi(\omega)$ is real for all $\omega$.

2. Nonnegativity: $\phi(\omega) \geq 0$ for all $\omega$.

3. Symmetry: $\phi(-\omega) = \phi(\omega)$ for all $\omega$.

4. Power spectral density: $\operatorname{var}(X_t) = r(0) = \dfrac{1}{2\pi} \int_{-\pi}^{\pi} \phi(\omega)\, \mathrm{d}\omega$.

Proof. Problem 2.18 (p. 27).

For non-stochastic time series $X_t$ the power is commonly defined as the average of $X_t^2$. In a stochastic setting the power of a zero mean process becomes $\operatorname{var}(X_t) = \mathbb{E}X_t^2 = r(0)$. Condition 4 of Lemma 2.6.2 shows that the power can be seen as an integral over frequency of the function $\phi(\omega)$, hence the name power spectral density.
2.6.4 Filters

The importance of the spectral density function becomes clear when we consider filtered stochastic processes. Suppose that the convolution system

$Y_t = \sum_{m=-\infty}^{\infty} h_m U_{t-m}$,   $t \in \mathbb{Z}$,   (2.90)

has a wide-sense stationary process $U_t$ with covariance function $r_u$ as input signal. We show that the output process $Y_t$ is also wide-sense stationary. To this end we compute the covariance function of the output process. Without loss of generality we assume that the processes $U_t$ and $Y_t$ are both centered. Then for $s$ and $t$ in $\mathbb{Z}$ we have

$\mathbb{E}\, Y_t Y_s = \mathbb{E}\Big( \sum_m h_m U_{t-m} \sum_n h_n U_{s-n} \Big) = \mathbb{E}\Big( \sum_m \sum_n h_m h_n U_{t-m} U_{s-n} \Big) = \sum_m \sum_n h_m h_n\, \mathbb{E}\, U_{t-m} U_{s-n} = \sum_m \sum_n h_m h_n\, r_u(t - s + n - m)$.

All summations are from $-\infty$ to $\infty$. Inspection shows that the right-hand side only depends on the difference $t - s$ of the arguments $t$ and $s$. Apparently the process $Y_t$ is wide-sense stationary with covariance function $r_y$ given by

$r_y(\tau) = \sum_m \sum_n h_m h_n\, r_u(\tau + n - m)$,   $\tau \in \mathbb{Z}$.   (2.91)
Next we determine the spectral density function of $Y_t$. Let $\phi_u$ be the spectral density function of $U_t$, so that

$r_u(\tau) = \dfrac{1}{2\pi} \int_{-\pi}^{\pi} \phi_u(\omega)\, \mathrm{e}^{\mathrm{i}\omega\tau}\, \mathrm{d}\omega$,   $\tau \in \mathbb{Z}$.   (2.92)

Substitution of (2.92) into the right-hand side of (2.91) yields

$r_y(\tau) = \sum_m \sum_n h_m h_n \left( \dfrac{1}{2\pi} \int_{-\pi}^{\pi} \phi_u(\omega)\, \mathrm{e}^{\mathrm{i}\omega(\tau+n-m)}\, \mathrm{d}\omega \right)$   (2.93)
$\phantom{r_y(\tau)} = \dfrac{1}{2\pi} \int_{-\pi}^{\pi} \underbrace{\Big( \sum_m h_m \mathrm{e}^{-\mathrm{i}\omega m} \Big)}_{\hat{h}(\omega)} \underbrace{\Big( \sum_n h_n \mathrm{e}^{\mathrm{i}\omega n} \Big)}_{\overline{\hat{h}(\omega)}} \phi_u(\omega)\, \mathrm{e}^{\mathrm{i}\omega\tau}\, \mathrm{d}\omega$
$\phantom{r_y(\tau)} = \dfrac{1}{2\pi} \int_{-\pi}^{\pi} \underbrace{|\hat{h}(\omega)|^2\, \phi_u(\omega)}_{\phi_y(\omega)}\, \mathrm{e}^{\mathrm{i}\omega\tau}\, \mathrm{d}\omega$,   $\tau \in \mathbb{Z}$.   (2.94)

Here we use the conjugate symmetry of the frequency response function $\hat{h}$. Closer inspection shows that the right-hand side of (2.94) is the inverse DCFT of $r_y$. The DCFT of $r_y$ is the spectral density function $\phi_y$ of the output process $y$. Hence we have for this spectral density function

$\phi_y(\omega) = |\hat{h}(\omega)|^2\, \phi_u(\omega)$,   $\omega \in [-\pi, \pi)$.   (2.95)
This relation very clearly exhibits the effect of the system on the input process. Suppose that the system is a band filter for which

$\hat{h}(\omega) = 1$ for $\omega_0 - \tfrac{b}{2} \leq |\omega| \leq \omega_0 + \tfrac{b}{2}$, and $\hat{h}(\omega) = 0$ for all other frequencies.

Hence the system only lets through harmonic signals with frequencies in a narrow band with width $b$ centered around the frequencies $\pm\omega_0$, with $0 \leq b \leq \omega_0$. Then we have for the variance -- i.e. the power -- of the output signal

$\operatorname{var}(Y_t) = r_y(0) = \dfrac{1}{2\pi} \int_{-\pi}^{\pi} \phi_y(\omega)\, \mathrm{d}\omega \approx \dfrac{b}{2\pi}\, \phi_u(\omega_0) + \dfrac{b}{2\pi}\, \phi_u(-\omega_0) = \dfrac{b}{\pi}\, \phi_u(\omega_0)$.

We interpret $\frac{1}{2\pi}\phi_u(\omega_0)$ as the power (density) of $U_t$ at frequency $\omega_0$. Because the power $\operatorname{var}(Y_t)$ is nonnegative for all $b \geq 0$ this also proves that $\phi_u(\omega_0)$ can only be nonnegative (Property 2 of Lemma 2.6.2).

In general we say that the system (2.90) is a filter for the process $U_t$.
2.6.5 Spectral density of white noise and ARMA-processes

Because the covariance function of white noise $\varepsilon_t$ with standard deviation $\sigma$ is given by

$r_{\varepsilon}(\tau) = \operatorname{cov}(\varepsilon_{t+\tau}, \varepsilon_t) = \sigma^2$ for $\tau = 0$ and $0$ for $\tau \neq 0$,   $\tau \in \mathbb{Z}$,   (2.96)

the spectral density function of white noise equals

$\phi_{\varepsilon}(\omega) = \sigma^2$,   $\omega \in [-\pi, \pi)$.   (2.97)

Therefore, all frequencies are equally represented in white noise. The fact that white light also has this property explains the name white noise.
From 2.6.2 (p. 18) we know that the stable ARMA scheme $D(q)Y_t = N(q)U_t$ has the frequency response function

$\hat{h}(\omega) = \dfrac{N(\mathrm{e}^{\mathrm{i}\omega})}{D(\mathrm{e}^{\mathrm{i}\omega})}$,   $\omega \in [-\pi, \pi)$.   (2.98)

We conclude from this that the wide-sense stationary process $X_t$ defined by the stable ARMA scheme $D(q)X_t = N(q)\varepsilon_t$ has the spectral density function

$\phi_X(\omega) = |\hat{h}(\omega)|^2\, \phi_{\varepsilon}(\omega) = \left| \dfrac{N(\mathrm{e}^{\mathrm{i}\omega})}{D(\mathrm{e}^{\mathrm{i}\omega})} \right|^2 \sigma^2$.

The covariance function of the ARMA process may now be determined by inverse Fourier transformation of the spectral density function:

$r_X(\tau) = \dfrac{1}{2\pi} \int_{-\pi}^{\pi} \phi_X(\omega)\, \mathrm{e}^{\mathrm{i}\omega\tau}\, \mathrm{d}\omega = \dfrac{1}{2\pi} \int_{-\pi}^{\pi} \left| \dfrac{N(\mathrm{e}^{\mathrm{i}\omega})}{D(\mathrm{e}^{\mathrm{i}\omega})} \right|^2 \sigma^2\, \mathrm{e}^{\mathrm{i}\omega\tau}\, \mathrm{d}\omega$.

In the next subsection we show that there is a simpler way of obtaining the covariance function.
By way of illustration we consider the spectral density function of the AR(2) process

$X_t = a_1 X_{t-1} + a_2 X_{t-2} + \varepsilon_t$.   (2.99)

Because $N(q) = 1$ and $D(q) = 1 - a_1 q^{-1} - a_2 q^{-2}$ the spectral density function is

$\phi_X(\omega) = \dfrac{\sigma^2}{\left| 1 - a_1 \mathrm{e}^{-\mathrm{i}\omega} - a_2 \mathrm{e}^{-2\mathrm{i}\omega} \right|^2}$,   $\omega \in \mathbb{R}$.   (2.100)

This process is also discussed in 2.7 (p. 26). For $a_1 = 2a\cos(2\pi/T)$ and $a_2 = -a^2$ the covariance function is of the form

$r_X(\tau) = a^{|\tau|} \left[ A \cos\!\Big(\dfrac{2\pi\tau}{T}\Big) + B \sin\!\Big(\dfrac{2\pi|\tau|}{T}\Big) \right]$,   (2.101)

with $A$ and $B$ constants to be determined. An example of a realization of the process with $a = 0.9$ and $T = 12$ is given in Fig. 2.6(b). In Fig. 2.13 the plot of the spectral density function of the process is shown in the frequency range from 0 to $\pi$. We see that the spectral density has a peak near the angular frequency $2\pi/T = 0.5236$. The peak agrees with the weakly periodic character of the process.

Figure 2.13: Spectral density function of an AR(2) process.
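A minimal sketch (not from the notes) that reproduces a plot like Fig. 2.13 directly from formula (2.100), using the parameter values of Section 2.4.5.

a = 0.9;  T = 12;
a1 = 2*a*cos(2*pi/T);  a2 = -a*a;
sigma2 = (1-a1-a2)*(1+a1-a2)*(1+a2)/(1-a2);   % sigma^2 chosen so that r(0) = 1
w = linspace(0,pi,500);
Dw = 1 - a1*exp(-1i*w) - a2*exp(-2i*w);
phi = sigma2./abs(Dw).^2;                     % spectral density (2.100)
plot(w,phi); xlabel('\omega');                % peak near 2*pi/T = 0.5236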
Example 2.6.3 (Matlab computation). The System Identification Toolbox has a facility to compute the spectral density function of a given ARMA process. To obtain the spectral density function of Fig. 2.13 first those MATLAB commands from the script of 2.4.5 (p. 15) may be executed that define the system structure th of the AR(2) scheme. Next the commands

phi = th2ff(th);
[omega,phi] = getff(phi);

serve to compute the frequency axis omega and the corresponding values of the spectral density phi. With their help the plot of Fig. 2.13 may be prepared.   $\square$
2.6.6 Two-sided z-transformation and generating functions

By replacing in the definition

$\hat{x}(\omega) = \sum_{t=-\infty}^{\infty} x_t\, \mathrm{e}^{-\mathrm{i}\omega t}$   (2.102)

of the DCFT the quantity $\mathrm{e}^{\mathrm{i}\omega}$ by the complex variable $z$ we obtain the two-sided z-transform

$X(z) = \sum_{t=-\infty}^{\infty} x_t\, z^{-t}$   (2.103)

of $x_t$. In mathematical statistics and stochastics the z-transform is known as the generating function.

We define the z-transform for all complex values of $z$ for which the infinite sum converges. For $z = \mathrm{e}^{\mathrm{i}\omega}$, $\omega \in [-\pi, \pi)$, that is, on the unit circle in the complex plane, the z-transform reduces to the DCFT:

$X(\mathrm{e}^{\mathrm{i}\omega}) = \hat{x}(\omega)$.   (2.104)

If the z-transform $X$ of $x$ exists on the unit circle then the inversion formula for the z-transform is

$x_t = \dfrac{1}{2\pi} \int_{-\pi}^{\pi} X(\mathrm{e}^{\mathrm{i}\omega})\, \mathrm{e}^{\mathrm{i}\omega t}\, \mathrm{d}\omega$,   $t \in \mathbb{Z}$.   (2.105)

Often the computation of the complex integral may be avoided by algebraic manipulations such as known from applications of the Laplace transformation.
Example 2.6.4 (Covariance function of the ARMA(1,1) process). By way of example we consider the computation of the covariance function of the ARMA(1,1) process defined by the scheme

$X_t = a X_{t-1} + \varepsilon_t + b\varepsilon_{t-1}$.   (2.106)

Without loss of generality we choose the coefficient of $\varepsilon_t$ equal to 1. The scheme is stable if $|a| < 1$. Because $N(q) = 1 + bq^{-1}$ and $D(q) = 1 - aq^{-1}$ the spectral density function of the process is

$\phi_X(\omega) = \left| \dfrac{1 + b\,\mathrm{e}^{-\mathrm{i}\omega}}{1 - a\,\mathrm{e}^{-\mathrm{i}\omega}} \right|^2 \sigma^2 = \dfrac{1 + b\,\mathrm{e}^{-\mathrm{i}\omega}}{1 - a\,\mathrm{e}^{-\mathrm{i}\omega}} \cdot \dfrac{1 + b\,\mathrm{e}^{\mathrm{i}\omega}}{1 - a\,\mathrm{e}^{\mathrm{i}\omega}}\, \sigma^2$.

With the substitution $\mathrm{e}^{\mathrm{i}\omega} = z$ we see that inverse Fourier transformation of $\phi_X$ comes down to determining the inverse z-transform of

$\dfrac{1 + bz^{-1}}{1 - az^{-1}} \cdot \dfrac{1 + bz}{1 - az}\, \sigma^2$.   (2.107)

This is a rational function of $z$ with poles at $z = a$ and $z^{-1} = a$. Partial fraction expansion yields

$\dfrac{1 + bz^{-1}}{1 - az^{-1}} \cdot \dfrac{1 + bz}{1 - az}\, \sigma^2 = \dfrac{B\sigma^2}{z^{-1} - a} + C\sigma^2 + \dfrac{B\sigma^2}{z - a}$,   (2.108)

with

$B = \dfrac{(a + b)(1 + ab)}{1 - a^2}$,   $C = \dfrac{1 + 2ab + b^2}{1 - a^2}$.

This is not a usual partial fraction expansion but one that retains the $z \leftrightarrow z^{-1}$ symmetry. From $|z| = |\mathrm{e}^{\mathrm{i}\omega}| = 1$ and $|a| < 1$ it follows that $|az| < 1$ so that we have the following infinite expansion

$\dfrac{1}{z^{-1} - a} = \dfrac{z}{1 - az} = z\,(1 + az + a^2 z^2 + \cdots) = z + az^2 + a^2 z^3 + \cdots$.

For reasons of symmetry we have that

$\dfrac{1}{z - a} = \dfrac{z^{-1}}{1 - az^{-1}} = z^{-1} + az^{-2} + a^2 z^{-3} + \cdots$.

Combining these two expansions in (2.108) yields

$\sigma^2\, \dfrac{1 + bz^{-1}}{1 - az^{-1}} \cdot \dfrac{1 + bz}{1 - az} = \sigma^2 B\,(\cdots + az^2 + z) + \sigma^2 C + \sigma^2 B\,(z^{-1} + az^{-2} + \cdots)$.

By definition this is the z-transform of the covariance function,

$\cdots + r(-2)z^2 + r(-1)z + r(0) + r(1)z^{-1} + r(2)z^{-2} + \cdots$.   (2.109)

Matching the coefficients shows that

$r(\tau) = \sigma^2 B a^{-\tau-1}$ for $\tau < 0$,   $r(\tau) = \sigma^2 C$ for $\tau = 0$,   $r(\tau) = \sigma^2 B a^{\tau-1}$ for $\tau > 0$,

that is,

$r(\tau) = \dfrac{1 + 2ab + b^2}{1 - a^2}\, \sigma^2$ for $\tau = 0$,   $r(\tau) = \dfrac{(a + b)(1 + ab)}{a(1 - a^2)}\, \sigma^2\, a^{|\tau|}$ for $\tau \neq 0$.   $\square$
We conclude with a comment about the invertibility of stable ARMA schemes. The covariance function of the stable ARMA scheme $D(q)X_t = N(q)\varepsilon_t$ is the inverse z-transform of

$\dfrac{N(z)}{D(z)} \cdot \dfrac{N(z^{-1})}{D(z^{-1})}\, \sigma^2$.   (2.110)

The condition for the invertibility of the scheme is that all zeros of $N(z)$ have modulus strictly smaller than 1. Consider the numerator $N(z)N(z^{-1})$ of (2.110). The zeros of this numerator consist of the zeros of $N(z)$ and the reciprocals of these zeros. Suppose that not all zeros of $N$ have modulus smaller than or equal to 1. Then it is always possible to find a polynomial $\tilde{N}$ (of the same degree as $N$) whose zeros all have modulus smaller than or equal to 1 such that

$\tilde{N}(z^{-1})\tilde{N}(z) = N(z^{-1})N(z)$.   (2.111)

The ARMA process $\tilde{X}_t$ defined by $D(q)\tilde{X}_t = \tilde{N}(q)\varepsilon_t$ has the same spectral density function as the original process $X_t$. Hence, it also has the same covariance function and therefore cannot be distinguished from the original process. Without loss of generality we may therefore assume that all zeros of $N$ have modulus less than or equal to 1. If there are no zeros with modulus equal to 1 then the scheme is invertible.
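As a numerical cross-check of Example 2.6.4 (not part of the original notes), the covariance formula derived there may be compared with the sample covariances of a long simulated realization; the values of a, b and the number of samples are hypothetical.

a = 0.7;  b = 0.4;  sigma = 1;
B = (a+b)*(1+a*b)/(1-a^2);  C = (1+2*a*b+b^2)/(1-a^2);
tau = 0:5;
r_theory = sigma^2*[C, B*a.^(tau(2:end)-1)];     % r(0), r(1), ..., r(5) from Example 2.6.4
randn('seed',0);
x = filter([1 b],[1 -a],sigma*randn(1e5,1));     % simulate D(q)X_t = N(q)eps_t
Nx = length(x);
r_sim = zeros(size(tau));
for k = tau
    r_sim(k+1) = sum(x(1:Nx-k).*x(1+k:Nx))/Nx;   % sample covariances
end
disp([r_theory; r_sim])                          % the two rows should be close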
2.6.7 Spectral analysis of continuous-time processes

The exposition of the present section has been limited to discrete-time processes with time axis $\mathbb{Z}$. For many applications this is the relevant model. There also are application areas, in particular in physics and electrical engineering, where the underlying phenomena have an essentially continuous-time character. To analyze these phenomena it is necessary to use continuous-time stochastic processes as models.

We summarize some of the notions that we developed for discrete-time processes for the continuous-time process $X_t$, $t \in \mathbb{R}$. Suppose that the process is wide-sense stationary with covariance function $r(\tau) = \operatorname{cov}(X_{t+\tau}, X_t)$, for $\tau \in \mathbb{R}$. The spectral density function of the process is defined as the Fourier transform(4)

$\phi(\omega) = \int_{-\infty}^{\infty} r(\tau)\, \mathrm{e}^{-\mathrm{i}\omega\tau}\, \mathrm{d}\tau$,   $\omega \in \mathbb{R}$.   (2.112)

(4) Kwakernaak and Sivan (1991) refer to this Fourier transform as the CCFT -- continuous-to-continuous Fourier transform.

If the spectral density function exists then it is real and nonnegative for all $\omega$, and symmetric in $\omega$. If the spectral density is given then the covariance function $r$ may be retrieved by the inverse Fourier transformation

$r(\tau) = \dfrac{1}{2\pi} \int_{-\infty}^{\infty} \phi(\omega)\, \mathrm{e}^{\mathrm{i}\omega\tau}\, \mathrm{d}\omega$,   $\tau \in \mathbb{R}$.   (2.113)

This shows that

$\operatorname{var}(X_t) = r(0) = \dfrac{1}{2\pi} \int_{-\infty}^{\infty} \phi(\omega)\, \mathrm{d}\omega$.   (2.114)

Consider a continuous-time convolution system with input signal $u$ and output signal $y$ described by

$y_t = \int_{-\infty}^{\infty} h(\tau)\, u_{t-\tau}\, \mathrm{d}\tau$,   $t \in \mathbb{R}$.   (2.115)

The function $h$ is called the impulse response of the system. The impulse response is the response of the system if the input signal is the delta function $u_t = \delta(t)$. The frequency response function of the system is the Fourier transform

$\hat{h}(\omega) = \int_{-\infty}^{\infty} h(\tau)\, \mathrm{e}^{-\mathrm{i}\omega\tau}\, \mathrm{d}\tau$,   $\omega \in \mathbb{R}$,   (2.116)

of the impulse response. If the frequency response $\hat{h}$ exists then it is conjugate symmetric (provided $h$ is real). The impulse response may be recovered from the frequency response function by inverse Fourier transformation.

If the input signal $u$ of the convolution system (2.115) is a wide-sense stationary stochastic process with covariance function $r_u$ then the output process is also wide-sense stationary with covariance function

$r_y(\tau) = \iint_{-\infty}^{\infty} h(t)\, h(s)\, r_u(\tau + s - t)\, \mathrm{d}t\, \mathrm{d}s$,   $\tau \in \mathbb{R}$.   (2.117)

The spectral density function $\phi_y$ of the output process follows from the spectral density function $\phi_u$ of the input process by the relation

$\phi_y(\omega) = |\hat{h}(\omega)|^2\, \phi_u(\omega)$,   $\omega \in \mathbb{R}$.   (2.118)
2.7 Trends and seasonal processes

In classical time series analysis a time series is decomposed into three components:

1. A trend, that is, a more or less gradual development. The monthly index of the American home mortgages of Fig. 1.6 (p. 2) appears to consist mainly of this component.

2. A seasonal component with a more or less pronounced periodic character. The water flow of the river Tiber of Fig. 1.2 (p. 2) has an obvious seasonal component with a period of 12 months.

3. An incidental component, consisting of irregular fluctuations. The annual immigration into the USA of Fig. 1.3 (p. 2) appears to be primarily of this nature.

We show how ARMA schemes may be used to model these three types of phenomena.
2.7.1 Trends and ARIMA models

In the classical time series analysis discussed in 4.3 (p. 41) trends are often represented by polynomials:

$Z_t = a_0 + a_1 t + a_2 t^2 + \cdots + a_p t^p$,   $t \in \mathbb{Z}$.   (2.119)

The degree of the polynomial may be decreased by applying the difference operator. Define the (backward) difference operator $\nabla$ by

$\nabla Z_t = Z_t - Z_{t-1}$.   (2.120)

In terms of the backward shift operator $q^{-1}$ we have

$\nabla = 1 - q^{-1}$.   (2.121)

If $Z_t$ is a polynomial in $t$ of degree $p$ as in (2.119) then $\nabla Z_t$ is a polynomial of degree $p - 1$. By applying the difference operator $p + 1$ times the degree of the polynomial $Z_t$ is decreased to 0. Because the difference operator reduces constants to 0 we have

$\nabla^{p+1} Z_t = 0$,   $t \in \mathbb{Z}$.   (2.122)

Conversely we may consider this relation as a difference equation for $Z_t$. Each solution of this difference equation is a polynomial of degree $p$. We may rewrite the equation (2.122) as

$(1 - q^{-1})^{p+1} Z_t = 0$.   (2.123)

Obviously, irregularities in the trend may be modeled by adding a stochastic component to this equation. Thus we obtain the AR($p + 1$) model

$(1 - q^{-1})^{p+1} Z_t = \varepsilon_t$,   $t \in \mathbb{Z}$,   (2.124)

with $\varepsilon_t$ white noise with mean 0 and standard deviation $\sigma$. For $p = 0$ the model agrees with the random walk model

$Z_t = Z_{t-1} + \varepsilon_t$.   (2.125)

The polynomial trend model may easily be embedded in an ARMA model. We then consider schemes of the form

$D(q)(1 - q^{-1})^{d} X_t = N(q)\varepsilon_t$,   $t \in \mathbb{Z}$,   (2.126)

or

$D(q)\nabla^{d} X_t = N(q)\varepsilon_t$,   $t \in \mathbb{Z}$.   (2.127)

$D$ is a polynomial of degree $n$ and $N$ has degree $k$. This model is known as an ARIMA($n, d, k$) scheme. The I in this acronym is the initial letter of "integrated". Integration is to be understood as the inverse operation of taking differences.
Figure 2.14: Realizations of I(1), I(2) and I(3) processes

Figure 2.14 shows realizations of I(1), I(2), and I(3) processes. The three realizations were obtained from a single realization of white noise with variance 1 by applying the I(1) scheme three times in succession. The results were scaled by successively dividing the realizations that were found by 10, 1000, and 100000. The plots show that as $d$ increases the behavior of the realization of the I($d$) process becomes less irregular, relatively.
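A minimal sketch (not from the notes) of how realizations like those of Fig. 2.14 may be generated by repeated integration of one white noise realization; the length 200 and the scalings follow the description above.

randn('seed',0);
e  = randn(200,1);               % white noise with variance 1
x1 = filter(1,[1 -1],e);         % I(1): (1 - q^-1) Z_t = eps_t
x2 = filter(1,[1 -1],x1);        % I(2)
x3 = filter(1,[1 -1],x2);        % I(3)
subplot(3,1,1); plot(x1/10);     % scaled as in the notes
subplot(3,1,2); plot(x2/1000);
subplot(3,1,3); plot(x3/100000);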
2.7.2 Seasonal processes

A time series $Z_t$ is periodic with period $P$ if

$Z_t = Z_{t-P}$,   $t \in \mathbb{Z}$,   (2.128)

or

$(1 - q^{-P}) Z_t = 0$,   $t \in \mathbb{Z}$.   (2.129)

The strict periodicity is softened by modifying the model to

$(1 - a q^{-P}) Z_t = \varepsilon_t$,   $t \in \mathbb{Z}$,   (2.130)

with $\varepsilon_t$ white noise and $a$ a constant such that $|a| < 1$. This AR($P$) process has the stationary covariance function

$r(\tau) = \sigma_Z^2\, a^{|\tau|/P}$ for $\tau = 0, \pm P, \pm 2P, \ldots$, and $r(\tau) = 0$ for other values of $\tau$.   (2.131)

Depending on the value of $a$ the values of the time series $Z_t$ at time instants that are a multiple of the period $P$ away from each other are more or less strongly correlated. For time instants that are not separated by a multiple of $P$ there is no correlation at all. Figure 2.15 shows a realization for $P = 12$ and $a = 0.9$. It is clearly seen that the behavior within one period may be very irregular.

Figure 2.15: Realization of the process $Z_t = 0.9\, Z_{t-12} + \varepsilon_t$, $t \in \mathbb{Z}$

The model (2.130) may be refined by considering ARMA schemes of the form

$D(q^{P}) Z_t = N(q^{P})\varepsilon_t$,   $t \in \mathbb{Z}$.   (2.132)

The models retain the characteristic that there is no correlation for time instants that are not separated by a multiple of $P$.
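A minimal sketch (not from the notes) of how the realization of Fig. 2.15 may be reproduced with filter, avoiding the toolbox routines; the parameter values are those quoted above.

P = 12;  a = 0.9;
D = [1 zeros(1,P-1) -a];          % D(q) = 1 - a q^-P
randn('seed',0);
z = filter(1,D,randn(256,1));     % realization of (2.130)
plot(z)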
We discuss other possibilities to capture weakly periodic phenomena by an ARMA model of the form $D(q)Z_t = N(q)\varepsilon_t$. To this end we consider the homogeneous equation $D(q)Z_t = 0$. Denote the zeros of $D$ as $\lambda_i$, $i = 1, 2, \ldots$. In 2.4.2 (p. 12) it is explained that the solution of the homogeneous equation is a linear combination of terms of the form $t^{k} (\lambda_i)^{t}$, $t \in \mathbb{Z}$.

If $D$ has a zero $\mathrm{e}^{\mathrm{i}\omega_0}$ on the unit circle (with $\omega_0$ real) then the homogeneous equation has a corresponding purely periodic solution $\mathrm{e}^{\mathrm{i}\omega_0 t}$ for $t \in \mathbb{Z}$. The corresponding real solution is of the form $\cos(\omega_0 t + \phi)$, with $\phi$ a constant.

The polynomial $D(q) = 1 - q^{-P}$, for instance, which characterizes the purely periodic model (2.129), has $P$ zeros $\mathrm{e}^{\mathrm{i}k2\pi/P}$, $k = 0, 1, \ldots, P-1$, on the unit circle. The polynomial $D(q) = 1 - a q^{-P}$ of the model (2.130) has the zeros $a^{1/P} \mathrm{e}^{\mathrm{i}k2\pi/P}$, $k = 0, 1, \ldots, P-1$. The closer $a$ is to 1 the closer the zeros are to the unit circle, and the more pronounced the periodic character is.

Generally it is true that if $D$ has one or several complex conjugate zero pairs that are close to the unit circle then the realizations of the corresponding ARMA process $D(q)X_t = N(q)\varepsilon_t$ show a more or less pronounced periodic component. How pronounced depends on the distance of the zero pair to the unit circle.

An example is the AR(2) process $X_t = a_1 X_{t-1} + a_2 X_{t-2} + \varepsilon_t$ of 2.7 (p. 26). For $a_1 = 2a\cos(2\pi/T)$ and $a_2 = -a^2$, with $a$ and $T$ real, the polynomial

$D(q) = 1 - 2a\cos(2\pi/T)\, q^{-1} + a^2 q^{-2}$   (2.133)

has the complex conjugate zero pair

$a\,[\cos(2\pi/T) \pm \mathrm{i}\sin(2\pi/T)]$.   (2.134)

Figure 2.6(b) shows a realization of this process for $a = 0.9$ and $T = 12$. The zeros are close enough to the unit circle for the quasi-periodic character to be clearly recognizable.
2.8 Prediction of time series

An important and interesting application of models for time series is prediction or forecasting. The problem is to predict the future behavior of the phenomenon that is observed as accurately as possible given the observations up to the present time. We discuss this problem for processes that are described by ARMA schemes.

2.8.1 Prediction of ARMA processes

We study the prediction of processes described by the ARMA scheme

$D(q)X_t = N(q)\varepsilon_t$,   $t \in \mathbb{Z}$,   (2.135)

with $\varepsilon_t$ white noise with mean 0. We assume that the scheme is both stable and invertible. For the time being we suppose that the prediction is based on observations of a realization of the process from the infinite past until time $t_0$. Later we consider the situation where only finitely many past observations are available.

The solution of the prediction problem consists of two basic steps.

1. Given the past observations reconstruct the past realization of the white noise that generated the observations.

2. Given the past realization of the driving white noise predict the future behavior of the process.

Eventually the two solution steps are combined into a single prediction operation.
The reconstruction of the driving white noise is based on inversion of the ARMA scheme. By the invertibility assumption there exists an equivalent AR($\infty$) scheme of the form

$G(q)X_t = \varepsilon_t$,   $t \in \mathbb{Z}$.   (2.136)

$G$ follows by expansion of

$G(q) = \dfrac{D(q)}{N(q)} = g_0 + g_1 q^{-1} + g_2 q^{-2} + \cdots$.   (2.137)

With the help of (2.136) the past realization of the white noise $\varepsilon_t$, $t \leq t_0$, may be reconstructed recursively from the observed realization $X_t$, $t \leq t_0$, according to

$\varepsilon_t = g_0 X_t + g_1 X_{t-1} + g_2 X_{t-2} + \cdots$,   $t = \ldots, t_0 - 1, t_0$.   (2.138)

Similarly the $X_t$, $t \leq t_0$, may be reconstructed from the $\varepsilon_t$, $t \leq t_0$, with the help of the MA($\infty$) scheme $X_t = [N(q)/D(q)]\,\varepsilon_t$. In other words: knowing the past of $X_t$ is equivalent to knowing the past of $\varepsilon_t$. We next consider the prediction problem. For that we partially expand $N(q)/D(q)$ as

$\dfrac{N(q)}{D(q)} = h_0 + h_1 q^{-1} + h_2 q^{-2} + \cdots + h_{m-1} q^{-m+1} + q^{-m}\, \dfrac{R_m(q)}{D(q)}$,

with $R_m(q)$ causal. Such representations exist for any $m$ and they may be easily obtained using long division, see Example 2.8.1. With this partial expansion we may write

$X_{t_0+m} = \underbrace{h_0 \varepsilon_{t_0+m} + h_1 \varepsilon_{t_0+m-1} + \cdots + h_{m-1} \varepsilon_{t_0+1}}_{\text{prediction error } e_{t_0+m|t_0}} + \underbrace{\dfrac{R_m(q)}{D(q)}\, \varepsilon_{t_0}}_{\text{best prediction } \hat{X}_{t_0+m|t_0}}$.   (2.139)

The second term, indicated as best prediction $\hat{X}_{t_0+m|t_0}$, is determined by $\varepsilon_{t_0}, \varepsilon_{t_0-1}, \ldots$, and therefore is known at time $t_0$. The first term, indicated as prediction error $e_{t_0+m|t_0}$, is a stochastic variable with mean zero that is uncorrelated with $X_t$, $t \leq t_0$. It is in fact independent of $X_t$, $t \leq t_0$, if we assume that the white noise is mutually independent. In that case it follows that $\hat{X}_{t_0+m|t_0}$ is the conditional expectation of $X_{t_0+m}$ given $X_t$, $t \leq t_0$. Among all functions $\tilde{X}_{t_0+m|t_0}$ of $X_t$, $t \leq t_0$, the predictor

$\hat{X}_{t_0+m|t_0} = \dfrac{R_m(q)}{D(q)}\, \varepsilon_{t_0}$   (2.140)

minimizes the mean square prediction error

$\mathbb{E}\left[ \big(\tilde{X}_{t_0+m|t_0} - X_{t_0+m}\big)^2 \,\Big|\, X_t,\ t \leq t_0 \right]$.   (2.141)

We thus identify the optimal predictor as

$\hat{X}_{t+m|t} = \dfrac{R_m(q)}{D(q)}\, \varepsilon_t$.   (2.142)

Here we replaced $t_0$ with $t$. The $\hat{X}_{t+m|t}$ is the best predictor of $X_{t+m}$ given past observations up to and including time $t$ and it is called the m-step predictor. Since $\varepsilon_t = \dfrac{D(q)}{N(q)}\, X_t$ we have that

$\hat{X}_{t+m|t} = \dfrac{R_m(q)}{D(q)} \cdot \dfrac{D(q)}{N(q)}\, X_t = \dfrac{R_m(q)}{N(q)}\, X_t$.

To determine $\hat{X}_{t+m|t}$ we therefore do not need to generate the $\varepsilon_t$ first. Clearly we may implement the predictor as the ARMA scheme

$N(q)\,\hat{X}_{t+m|t} = R_m(q)\, X_t$,   $t \in \mathbb{Z}$.   (2.143)

The result shows that for AR schemes $D(q)X_t = \varepsilon_t$ the m-step predictor is an MA scheme $\hat{X}_{t+m|t} = R_m(q) X_t$.

Up to this point we have assumed that all past values of $X_t$ from time $-\infty$ on are available for prediction. Because of the assumed invertibility of the original ARMA scheme (2.135) the predictor (2.143) is stable. The effect of an incorrect initialization of (2.143) at a time instant in the finite rather than the infinite past hence decreases exponentially. Therefore a predictor of the form (2.143) that for instance has been initialized at time 0 with initial conditions 0 asymptotically yields correct results.

From the formula

$e_{t+m|t} = h_0 \varepsilon_{t+m} + h_1 \varepsilon_{t+m-1} + \cdots + h_{m-1} \varepsilon_{t+1}$   (2.144)

for the prediction error it follows that the mean square prediction error equals

$\mathbb{E}\, e^2_{t_0+m|t_0} = (h_0^2 + h_1^2 + \cdots + h_{m-1}^2)\, \sigma^2$.   (2.145)
Example 2.8.1. By way of example we discuss the AR(1) process described by

$X_t = a X_{t-1} + \varepsilon_t$.   (2.146)

Inspection shows that the best one-step predictor is

$\hat{X}_{t+1|t} = a X_t$.   (2.147)

We derive the m-step predictor formally. We have $D(q) = 1 - aq^{-1}$ and $N(q) = 1$; long division then gives us (see Problem 2.30) that

$\dfrac{N(q)}{D(q)} = \dfrac{1}{1 - aq^{-1}} = 1 + aq^{-1} + a^2 q^{-2} + \cdots + a^{m-1} q^{1-m} + \dfrac{q^{-m} a^{m}}{1 - aq^{-1}}$.   (2.148)

For the sake of stability we need to assume that $|a| < 1$. Apparently we have $h_i = a^{i}$, $i = 0, 1, \ldots$, and $R_m(q) = a^{m}$. Substitution into (2.143) yields the m-step predictor

$\hat{X}_{t+m|t} = a^{m} X_t$.   (2.149)

Prediction over $m$ steps for this process hence amounts to multiplication of the last known observation $X_t$ by $a^{m}$. That the last known observation is all that is needed from the past follows because the AR(1) process is a Markov process.

The mean square prediction error according to (2.145) is

$\mathbb{E}\, e^2_{t+m|t} = (1 + a^2 + a^4 + \cdots + a^{2(m-1)})\, \sigma^2 = \dfrac{1 - a^{2m}}{1 - a^2}\, \sigma^2 = (1 - a^{2m})\, \sigma_X^2$.

Here $\sigma_X^2$ is the stationary variance of the process itself. For $m \to \infty$ the mean square prediction error approaches $\sigma_X^2$, and the m-step prediction itself approaches 0. The smaller $|a|$ is the faster the mean square prediction error increases with $m$. The reason is that if $|a|$ is small the temporal dependence of the process is also small so that any prediction is poor.   $\square$
It is easy to see that the predictor (2.149) in the above example is also optimal if $a = 1$ (despite the fact that the AR(1) scheme is not stable). The AR(1) process then reduces to the random walk process

$X_t = X_{t-1} + \varepsilon_t$.   (2.150)

The optimal m-step predictor for the random walk is

$\hat{X}_{t+m|t} = X_t$.   (2.151)

The best prediction of the random walk hence is to use the last known observation as the prediction. It is well known that the weather forecast "tomorrow the weather will be like today" scores almost as good as meteorological weather forecasts ...
Example 2.8.2 (Matlab computation). The System Identification Toolbox has a routine predict for the calculation of m-step predictions. We apply it to the AR(2) time series that is produced in 2.4.5 (p. 15). After defining the theta structure th and generating the time series x according to the script of 2.4.5 (p. 15) we calculate the one-step predictions xhat with

xhat = predict(x,th,1);

In Fig. 2.16 the first 51 values of the observed and the predicted process are plotted. The sample average of the square prediction error is 0.2998. The predictor works well when the time series varies more or less smoothly but has difficulties coping with fast changes.   $\square$

Figure 2.16: Dots: One-step predictions. Line: AR(2) time series
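A minimal sketch (not from the notes) that performs the same one-step prediction without the toolbox, by computing $R_m(q)$ with the division-with-remainder of (2.143) (as in the routine of Problem 2.34) and implementing the predictor scheme with filter.

a = 0.9; T = 12;
a1 = 2*a*cos(2*pi/T);  a2 = -a*a;
sigma = sqrt((1-a1-a2)*(1+a1-a2)*(1+a2)/(1-a2));
D = [1 -a1 -a2];  N = 1;  m = 1;
R = [N zeros(1,max(0,length(D)-length(N))) zeros(1,m)];
for k = 1:m                                  % after m steps R holds R_m(q)
    R(1:length(D)) = R(1:length(D)) - R(1)/D(1)*D;
    R(1) = [];
end
randn('seed',0);
x    = filter(N,D,sigma*randn(200,1));       % the AR(2) realization
xhat = filter(R,N,x);                        % N(q) Xhat_{t+m|t} = R_m(q) X_t
mse  = mean((x(m+1:end)-xhat(1:end-m)).^2)   % compare with Example 2.8.2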
2.9 Problems

2.1 Basic stochastic processes. Let $\varepsilon_t$ be white noise with variance 1 and nonzero mean $\mathbb{E}\varepsilon_t = 1/4$. Compute variance and mean of $X_t = \varepsilon_t + \varepsilon_{t-1} + \varepsilon_{t-2} + \varepsilon_{t-3}$ and sketch a reasonable realization $x_t$ for $t = 0, 1, \ldots, 20$ and indicate mean and standard deviation of the process.

2.2 Basic stochastic processes. Let $Y_t$ be a stationary process with zero mean and variance $\sigma^2$ and suppose the $Y_t$, $t \in \mathbb{Z}$, are mutually uncorrelated. Show that all four time series $X_t$ below have the same mean and variance but that their covariance functions $R(t + \tau, t)$ generally differ. Sketch two realizations of each of the following four time series.
a) $X_t = Y_1$,
b) $X_t = (-1)^{t} Y_1$,
c) $X_t = Y_t$,
d) $X_t = (Y_t + Y_{t-1})/\sqrt{2}$.

2.3 Properties of $R$, $r$ and $\rho$. Prove Lemma 2.1.1. (Use the Cauchy-Schwarz inequality on page 91 and for Part 4 consider $\operatorname{var}(v_1 X_{t_1} + v_2 X_{t_2})$ for constant $v_1, v_2$.)

2.4 Covariance function of a wide-sense stationary process. Prove Lemma 2.1.2.

2.5 Convolutions. Prove equations (2.25) and (2.26).
2.6 Non-centered processes. Assume that the white noise process $\varepsilon_t$ has mean $\mu \neq 0$. Suppose that the AR(n) process $X_t$ defined by (2.27) is asymptotically wide-sense stationary.
a) Prove that the asymptotically wide-sense stationary process has the constant mean

$m = \dfrac{\mu}{1 - \sum_{i=1}^{n} a_i}$.   (2.152)

b) Prove that if the process is asymptotically wide-sense stationary then the denominator $1 - \sum_{i=1}^{n} a_i$ is non-zero. Hint: if $1 - \sum_{i=1}^{n} a_i = 0$ then $D(\lambda)$ has a zero 1.
c) How must the initial conditions $X_0, X_1, \ldots, X_{n-1}$ be chosen so that the process is immediately stationary?
d) Let $m(t) = \mathbb{E}X_t$ be the mean value function of the process $X_t$. Which difference equation and initial conditions does the centered process $\tilde{X}_t = X_t - m(t)$ satisfy?
2.7 A Yule scheme. The Yule scheme is the AR(2) scheme

$X_t = a_1 X_{t-1} + a_2 X_{t-2} + \varepsilon_t$,   $t = 2, 3, \ldots$.   (2.153)

a) Prove that the first two stationary correlation coefficients are given by

$\rho(1) = \dfrac{a_1}{1 - a_2}$,   $\rho(2) = a_2 + \dfrac{a_1^2}{1 - a_2}$.   (2.154)

b) Also show that the stationary variance equals

$r(0) = \dfrac{(1 - a_2)\,\sigma^2}{(1 - a_1 - a_2)(1 + a_1 - a_2)(1 + a_2)}$.   (2.155)

c) Suppose that the polynomial $D(q) = 1 - a_1 q^{-1} - a_2 q^{-2}$ has the complex conjugate zero pair

$a\, \mathrm{e}^{\pm\mathrm{i}2\pi/T}$,   (2.156)

with $a$ and $T$ real. Show that it follows that $a_1 = 2a\cos(2\pi/T)$ and $a_2 = -a^2$. Also show that for $k = 2, 3, \ldots$ the stationary correlation function has the form

$\rho(k) = a^{k}\left[ A\cos\!\Big(\dfrac{2\pi k}{T}\Big) + B\sin\!\Big(\dfrac{2\pi k}{T}\Big) \right]$,   (2.157)

with $A$ and $B$ real constants to be determined.
d) Let $a = 0.9$ and $T = 12$. Compute $\rho(1)$ and $\rho(2)$ numerically from the Yule-Walker equations. Use these numbers to determine $A$ and $B$ numerically. Plot $\rho$.
2.8 Solvability of the Yule-Walker equations*. Assume that $X_t$ is wide-sense stationary. Use Lemma 2.1.1 and 2.1.2 to show that (2.63) has a unique solution if and only if $X_t, \ldots, X_{t+n}$ are linearly independent (i.e., $\sum_{i=1}^{n} c_i X_{t+i} = 0$ implies $c_i = 0$).

2.9 Partial correlation coefficients. Consider the AR(2) scheme

$X_t = c_1 X_{t-1} + c_2 X_{t-2} + \varepsilon_t$,   $t = 2, 3, \ldots$,   (2.158)

that is also studied in Problem 2.7.
a) Compute the partial correlation coefficients $a_{11}$ and $a_{22}$.
b) Derive the limit $\lim_{c_2 \to 0} a_{11}$. Is that limit a surprise?
c) What can you say about the coefficients $a_{ni}$ for $n \geq 2$?

2.10 ARMA model. The covariance function of a wide-sense stationary process $X_t$, $t \in \mathbb{Z}$, with mean 0 is given by

$r(\tau) = 1$ for $\tau = 0$,   $r(\tau) = \tfrac{1}{2}$ for $|\tau| = 1$,   $r(\tau) = 0$ for $|\tau| \geq 2$,   $\tau \in \mathbb{Z}$.   (2.159)

By which ARMA scheme may this process be described?

2.11 Asymptotic behavior of the partial correlations of the MA process and the correlations of the AR process. Make it plausible that the partial correlation coefficients of an invertible MA process and the correlation coefficients of a stable AR process decrease exponentially.
2.12 Long division. Consider the AR(2) process $(1 - a_1 q^{-1} - a_2 q^{-2}) X_t = \varepsilon_t$ and suppose it is stable.
a) Use repeated substitution to determine the first three coefficients $b_0, b_1, b_2$ of the MA($\infty$) description $X_t = (b_0 + b_1 q^{-1} + b_2 q^{-2} + \cdots)\varepsilon_t$ of the process.
b) Use long division to determine the first three coefficients $b_0, b_1, b_2$ of the series expansion $(b_0 + b_1 \lambda^{-1} + b_2 \lambda^{-2} + \cdots)$ of $1/(1 - a_1 \lambda^{-1} - a_2 \lambda^{-2})$ in the negative powers of $\lambda$.

2.13 MA($\infty$) process. Consider the MA($\infty$) scheme given by

$X_t = \varepsilon_t + \tfrac{1}{2}\varepsilon_{t-3} + \tfrac{1}{4}\varepsilon_{t-6} + \tfrac{1}{8}\varepsilon_{t-9} + \cdots$.   (2.160)

The process $\varepsilon_t$, $t \in \mathbb{Z}$, is white noise with variance 1.
a) There exists an AR scheme of finite order that generates this process. Determine this scheme.
b) Compute the covariance function $r(k)$ of the process. Plot it.
2.14 Stationary mean. Suppose that the ARMA scheme $D(q)X_t = N(q)\varepsilon_t$ is asymptotically wide-sense stationary. Show that the mean value function is asymptotically given by

$\mathbb{E}X_t = m(t) \xrightarrow{\ t\to\infty\ } \mu\, \dfrac{N(1)}{D(1)}$.   (2.161)

2.15 Convergence of inverted processes*. Consider an MA process $X_t = N(q)\varepsilon_t$ and let $h_k$ be the coefficients of $1/N(q)$,

$\dfrac{1}{N(q)} = h_0 + h_1 q^{-1} + h_2 q^{-2} + \cdots$.

Now define for each $n$ the approximating processes $Y_t^{n}$ via

$(h_0 + h_1 q^{-1} + h_2 q^{-2} + \cdots + h_n q^{-n})\, Y_t^{n} = \varepsilon_t$.   (2.162)

a) Suppose $X_t = N(q)\varepsilon_t$ is invertible. Show that the AR process (2.162) is stable for $n$ large enough. (Hint: consider $\inf_{\lambda, |\lambda| \geq 1} |1/N(\lambda)|$.)
b) Suppose $X_t = N(q)\varepsilon_t$ is invertible. Show that $\lim_{n\to\infty} Y_t^{n} = X_t$ and that for each $k$ we have $\lim_{n\to\infty} r_{Y^{n}}(k) = r_X(k)$.
2.16 Centered ARMA process. Let $m(t) = \mathbb{E}X_t$ be the mean value function of the ARMA process (2.75). What difference equation and initial conditions are satisfied by the centered process $\tilde{X}_t = X_t - m(t)$?

2.17 Properties of the DCFT. Prove Lemma 2.6.1.

2.18 Properties of the spectral density function. Prove Lemma 2.6.2.

2.19 Show that the spectral density of $X_t = q^{K}\varepsilon_t$ is independent of $K$ and explain in words why this is not a surprise.

2.20 Reciprocal zeros. Prove that it is always possible to find a polynomial $\tilde{N}$ (of the same degree as $N$) whose zeros have modulus smaller than or equal to 1 such that $\tilde{N}(z)\tilde{N}(z^{-1}) = N(z)N(z^{-1})$. What is the relation between the zeros of $N$ and those of $\tilde{N}$?

2.21 Let $N(q) = (1 + \tfrac{1}{2}q^{-1})(1 + 3q^{-1})$ and suppose $\varepsilon_t$ is a white noise process with variance $\sigma^2$.
a) Show that $X_t = N(q)\varepsilon_t$ is not invertible.
b) Determine $\phi_X(\omega)$.
c) Which invertible MA process $Z_t = \tilde{N}(q)\varepsilon_t$ has the same covariance function as $X_t$?
2.22 The difference operator decreases the degree of a polynomial. Prove the claim in Subsection 2.7.1 that if $Z_t$ is a polynomial in $t$ of degree $p$ as in (2.119) then $\nabla Z_t$ is a polynomial of degree $p - 1$.

2.23 Variance of the I(d) process. Consider the process $Z_t$, $t \geq 0$, that is generated by the I(d) scheme

$(1 - q^{-1})^{d} Z_t = \varepsilon_t$,   $t \in \mathbb{Z}$,   (2.163)

with the initial conditions $Z_0 = Z_{-1} = \cdots = Z_{1-d} = 0$. Here $d$ is a natural number and $\varepsilon_t$ is white noise with variance 1. Compute $\operatorname{var}(Z_t)$ as a function of $t$ for $d = 1$, 2, and 3.

2.24 Covariance function of the process (2.130). Prove that the stationary covariance function of the process (2.130) is given by (2.131). What is $\sigma_Z^2$?

2.25 Location of the zero pair and periodicity. Suppose that $D$ has a complex conjugate zero pair $a\,\mathrm{e}^{\pm\mathrm{i}\omega_0}$, with $\omega_0 > 0$. For stability we need $|a| < 1$. The zero pair is close to the unit circle if $|a|$ is not much less than 1. Make it plausible that the zero pair is close enough to the unit circle for the periodic character to be pronounced if

$-\log|a| \ll \omega_0$.   (2.164)
2.26 Stochastic harmonic process. Consider the stochastic harmonic process

$X_t = A\cos\!\Big(\dfrac{2\pi t}{T} + B\Big)$,   $t \in \mathbb{Z}$,   (2.165)

with $T$ a given integer. $A$ and $B$ are independent stochastic variables with $\mathbb{E}A = 0$, $\operatorname{var} A = \sigma^2$, and $B$ uniformly distributed on $[0, 2\pi]$.
a) This process is stationary. What is the covariance function $r$ of the process?
b) Find an AR representation of this process. How should the initial conditions be chosen?

2.27 Best linear predictor*. On page 24 the best predictor (2.139) is argued to minimize the mean square error (2.141) provided that the $\varepsilon_t$ are mutually independent. Now suppose that the $\varepsilon_t$ are uncorrelated (but maybe not independent). Show that the best predictor of (2.139) is the one that minimizes $\mathbb{E}(\tilde{X}_{t_0+m|t_0} - X_{t_0+m})^2$ with respect to the linear predictors

$\tilde{X}_{t_0+m|t_0} = \sum_{i=0}^{\infty} c_i\, \varepsilon_{t_0-i}$,   $c_j \in \mathbb{R}$.

2.28 Predictor for the MA(1) process. Find the optimal m-step predictor for the MA(1) process. Which condition needs to be satisfied? Discuss and explain the solution.

2.29 Mean square prediction error. What is the theoretical mean square prediction error of the one-step predictor for the AR(2) process of Example 2.8.2 (p. 25)?
2.30 Division with remainder. Long division is a procedure to expand $N(q)/D(q)$ in negative powers of $q$. The skeleton is shown in Eqn. (2.166) (page 29). In the course of doing long division you automatically keep track of the remainder -- denoted $q^{-m}R(q)$ -- and we have by construction that

$\dfrac{N(q)}{D(q)} = h_0 + \cdots + h_{m-1} q^{1-m} + \dfrac{q^{-m} R(q)}{D(q)}$.   (2.167)

Compute the expansion (2.167) for
a) $\dfrac{N(q)}{D(q)} = \dfrac{1}{1 - aq^{-1}}$ for arbitrary $m$;
b) $\dfrac{N(q)}{D(q)} = \dfrac{1}{1 - aq^{-m}}$;
c) $\dfrac{N(q)}{D(q)} = \dfrac{1}{1 - q^{-1} + 0.5 q^{-2}}$ for $m = 1, 2, 3$.
2.31 Predictor. Consider the scheme

$(1 + q^{-1} + \tfrac{1}{4}q^{-2}) X_t = \varepsilon_t$

with $\varepsilon_t$ a zero mean white noise process with variance $\sigma^2$.
a) Is the scheme stable?
b) Is the scheme invertible?
c) Determine the spectral density function of $X_t$.
d) Determine the 2-step predictor scheme.

2.32 Predictor for a seasonal model. Consider the seasonal model

$X_t - a q^{-P} X_t = \varepsilon_t$,   $t = 1, 2, \ldots$,   (2.168)

with the natural number $P > 1$ the period, $a$ a real constant, and $\varepsilon_t$ white noise with variance $\sigma^2$. Assume that the model is stable.
a) Determine the one-step optimal predictor for this model. Interpret the result.
b) Determine the $(P + 1)$-step optimal predictor. Interpret the result.

2.33 Prediction of a polynomial time series. A time series is given by the expression

$X_t = c_0 + c_1 t + c_2 t^2$,   $t \in \mathbb{Z}$,   (2.169)

where $c_0$, $c_1$ and $c_2$ are unknown coefficients.
a) Determine an AR scheme that describes the time series.
b) Determine a recursive scheme for the one-step optimal predictor of the time series.
c) How may a k-step predictor be obtained?
Matlab problems

2.34 Predictors. The following MATLAB function determines the m-step predictor.

function Rm=predic(D,N,m);
%
d=length(D);
Rm=[N zeros(1,max(0,d-length(N))) ...
    zeros(1,m)];
for k=1:m
  Rm(1:d)=Rm(1:d)-Rm(1)/D(1)*D;
  Rm(1)=[];
end

To find the m-step predictor for $X_t = \tfrac{1}{2}X_{t-3} + \varepsilon_t$ we type at the MATLAB prompt

D=[1 0 0 -1/2];  % D(q) = 1 - (1/2) q^-3
N=1;             % N(q) = 1
m=2;             % for example
Rm=predic(D,N,m)

Determine for $X_t = \tfrac{1}{2}X_{t-3} + \varepsilon_t$ the m-step predictors for $m = 1, 2, 3, 4, 5, 6, 7$ and interpret the results.
Skeleton of the long division (2.166): the divisor $D(q) = 1 - a_1 q^{-1} - \cdots - a_n q^{-n}$ is divided into $N(q) = b_0 + b_1 q^{-1} + \cdots + b_k q^{-k}$, producing the quotient $h_0 + h_1 q^{-1} + \cdots + h_{m-1} q^{1-m}$. The first quotient term is $h_0 = b_0$. Subtracting $h_0 D(q)$ from $N(q)$ leaves a remainder whose leading term is $h_1 q^{-1}$; subtracting $h_1 q^{-1} D(q)$ leaves a remainder with leading term $h_2 q^{-2}$, and so on. After the step that removes the leading term $h_{m-1} q^{1-m}$ the remainder that is left over is $q^{-m} R(q)$, with leading term $r_m q^{-m}$.   (2.166)
3 Estimators

3.1 Introduction

In this short chapter we present a survey of several important statistical notions that are needed for time series analysis. Section 3.2 is devoted to normally distributed processes and the multi-dimensional normal distribution. Section 3.3 presents estimators and their properties, the maximum likelihood principle and estimation method, and the Cramér-Rao inequality. We mostly deal with stochastic processes of which at most the first and second order moments are known or are to be estimated, or processes that are normally distributed. For such processes linear estimators are a natural choice. These are considered in Section 3.4.
3.2 Normally distributed processes

3.2.1 Normally distributed processes

Up to this point the discussion has remained limited to the first and second order properties of stochastic processes, that is, their mean and covariance function. Other properties, in particular the probability distribution of the process, have not been considered. There has been no mention, for instance, of the probability distribution of the uncorrelated stochastic variables $\varepsilon_t$, $t \in \mathbb{Z}$, that define white noise.

For some applications it is not necessary to introduce assumptions on the probability distributions beyond the second-order properties. Sometimes it cannot be avoided, however, that more is assumed to be known. A common hypothesis, which may be justified for many applications, is that the process is normally distributed.

A process $Z_t$, $t \in \mathbb{Z}$, is normally distributed if all joint probability distributions of $Z_{t_1}, Z_{t_2}, \ldots, Z_{t_n}$ are multi-dimensional normal distributions. Multi-dimensional normal distributions are also referred to as multi-dimensional Gaussian distributions. Several results and formulas from the theory of multi-dimensional normal distributions are summarized in 3.2.2 (p. 31) and 3.2.3 (p. 33).

Multi-dimensional normal probability distributions of, say, the $n$ stochastic variables $Z_1, Z_2, \ldots, Z_n$ are completely determined if the $n$ expectations $\mathbb{E}Z_i$, $i = 1, 2, \ldots, n$, and the $n^2$ covariances

$\operatorname{cov}(Z_i, Z_j)$,   $i = 1, 2, \ldots, n$,   $j = 1, 2, \ldots, n$,

are known.
3.2.2 Multi-dimensional normal distributions

For completeness and for later use we briefly summarize the theory and formulas of multi-dimensional normally distributed stochastic variables.

A (scalar) stochastic variable $Z$ is said to be normally distributed if

1. $Z = \mu$ with probability 1, or

2. $Z$ has the probability density function

$f_Z(z) = \dfrac{1}{\sigma\sqrt{2\pi}}\, \mathrm{e}^{-\frac{1}{2\sigma^2}(z - \mu)^2}$,   $z \in \mathbb{R}$.

Here $\mu$ and $\sigma$ are real constants with $\sigma > 0$. In both cases $\mu$ is the expectation of $Z$. In the first case $Z$ has variance 0, in the second it has variance $\sigma^2 > 0$. In the first case $Z$ is said to be singularly normally distributed. If $\mu = 0$ and $\sigma = 1$ then $Z$ is said to have a standard normal distribution.

From elementary probability theory it is known that if $Z_1$ and $Z_2$ are two independent normally distributed stochastic variables then every linear combination $a_1 Z_1 + a_2 Z_2$ of $Z_1$ and $Z_2$, with $a_1$ and $a_2$ real constants, is also normally distributed. Likewise, every linear combination $a_1 Z_1 + a_2 Z_2 + \cdots + a_n Z_n$ of the $n$ independent normally distributed stochastic variables $Z_1, Z_2, \ldots, Z_n$ is normally distributed.

Let $Z_1, Z_2, \ldots, Z_n$ be $n$ mutually independent stochastic variables with standard normal distributions and consider $k$ linear combinations $X_1, X_2, \ldots, X_k$ of the form

$X_i = m_i + \sum_{j=1}^{n} a_{ij} Z_j$,   $i = 1, 2, \ldots, k$.   (3.1)

Here the $m_i$ and $a_{ij}$ are real constants. Each of the stochastic variables $X_i$ is (scalar) normally distributed. If the $k$ stochastic variables $X_1, X_2, \ldots, X_k$ may be represented in the form (3.1), with $Z_1, Z_2, \ldots, Z_n$ independent normally distributed stochastic variables with standard distributions, then $X_1, X_2, \ldots, X_k$ are said to be jointly normally distributed.

We investigate what the joint probability density function of $X_1, X_2, \ldots, X_k$ is, if it exists. Define the random vectors

$X = (X_1, X_2, \ldots, X_k)^{\mathrm{T}}$,   $Z = (Z_1, Z_2, \ldots, Z_n)^{\mathrm{T}}$,

and let $A$ be the $k \times n$ matrix with entries $a_{ij}$ and $m$ the $k$-dimensional column vector with entries $m_i$. Then we may write (3.1) as

$X = AZ + m$.   (3.2)

According to elementary probability theory the joint probability density function of the independent stochastic variables $Z_1, Z_2, \ldots, Z_n$, each with a standard normal distribution, is

$f_Z(z) = \prod_{j=1}^{n} f_{Z_j}(z_j) = \prod_{j=1}^{n} \dfrac{1}{\sqrt{2\pi}}\, \mathrm{e}^{-\frac{1}{2} z_j^2} = \dfrac{1}{(\sqrt{2\pi})^{n}}\, \mathrm{e}^{-\frac{1}{2}\sum_{j=1}^{n} z_j^2} = \dfrac{1}{(\sqrt{2\pi})^{n}}\, \mathrm{e}^{-\frac{1}{2} z^{\mathrm{T}} z}$.   (3.3)

Here $z$ is the column vector with components $z_1, z_2, \ldots, z_n$. The superscript $\mathrm{T}$ indicates the transpose. To determine the joint probability density function $f_X$ of $X_1, X_2, \ldots, X_n$ from $f_Z$ we use the following theorem.
Theorem 3.2.1 (Probability density under transformation). Let $Z$ be a vector-valued stochastic variable of dimension $N$ with joint probability density function $f_Z(z)$. Furthermore, let $g: \mathbb{R}^{N} \to \mathbb{R}^{N}$ be a differentiable bijective map with inverse $g^{-1}$. Define with the help of this map the stochastic variable

$X = g(Z)$.

Then $X$ has the joint probability density function

$f_X(x) = f_Z(g^{-1}(x))\, |\det J(x)|$.   (3.4)

$J$ is the Jacobian matrix of $h = g^{-1}$. The entry $J_{ij}$ of the $N \times N$ matrix $J$ is given by

$J_{ij}(x) = \dfrac{\partial h_i(x)}{\partial x_j}$.   (3.5)

Proof. The probability density $f_X$ is defined by

$f_X(x_1, x_2, \ldots, x_N) = \dfrac{\partial^{N} F_X(x_1, x_2, \ldots, x_N)}{\partial x_1\, \partial x_2 \cdots \partial x_N}$.

$F_X$ is the joint probability distribution $F_X(x) = \Pr(X \leq x)$, with $x = (x_1, x_2, \ldots, x_N)$. Here an inequality between vectors is taken entry by entry. Because $X = g(Z)$ we have

$F_X(x) = \Pr(X \leq x) = \Pr(g(Z) \leq x) = \int_{g(z) \leq x} f_Z(z)\, \mathrm{d}z$.

By changing the variable of integration to $\xi = g(z)$ it follows from calculus that

$F_X(x) = \int_{\xi \leq x} f_Z(g^{-1}(\xi))\, |\det(J(\xi))|\, \mathrm{d}\xi$.

We finally obtain (3.4) by partial differentiation of $F_X(x)$ with respect to the components of $x$.

To apply Theorem 3.2.1 to (3.2) we need to assume that $k = n$, so that $A$ is square, and that $A$ is non-singular. Then the map $g$ and the inverse map $g^{-1}$ are defined by $g(z) = Az + m$ and $g^{-1}(x) = A^{-1}(x - m)$. The Jacobian matrix of $g^{-1}$ is $J(x) = A^{-1}$. By application of (3.4) to (3.3) it now follows that the probability density function of the normally distributed vector-valued stochastic variable $X$ is given by

$f_X(x) = \dfrac{1}{(\sqrt{2\pi})^{n}\, |\det A|}\, \mathrm{e}^{-\frac{1}{2}(x - m)^{\mathrm{T}} (AA^{\mathrm{T}})^{-1} (x - m)} = \dfrac{1}{(\sqrt{2\pi})^{n}\, (\det AA^{\mathrm{T}})^{1/2}}\, \mathrm{e}^{-\frac{1}{2}(x - m)^{\mathrm{T}} (AA^{\mathrm{T}})^{-1} (x - m)}$.   (3.6)

It is easy to identify the parameters $m$ and $AA^{\mathrm{T}}$ that occur in this probability density function. By taking the expectation of both sides of (3.2) it follows immediately that(1)

$m_X := \mathbb{E}X = m$.

The matrix

$\Sigma_X := \mathbb{E}\big[(X - \mathbb{E}X)(X - \mathbb{E}X)^{\mathrm{T}}\big]$

is called the variance matrix or covariance matrix of the stochastic vector $X$. It follows from (3.2) that

$\Sigma_X = \mathbb{E}\big[A Z Z^{\mathrm{T}} A^{\mathrm{T}}\big] = A\, \mathbb{E}(Z Z^{\mathrm{T}})\, A^{\mathrm{T}} = A A^{\mathrm{T}}$,

because $\mathbb{E}\, Z Z^{\mathrm{T}} = I$. With this we may rewrite (3.6) as

$f_X(x) = \dfrac{1}{(\sqrt{2\pi})^{n}\, (\det \Sigma_X)^{1/2}}\, \mathrm{e}^{-\frac{1}{2}(x - m_X)^{\mathrm{T}} \Sigma_X^{-1} (x - m_X)}$.   (3.7)

This is the general form of the joint normal probability density function. The following facts may be proved:

1. If conversely the stochastic vector $X$ has the probability density (3.7) with $\Sigma_X$ a symmetric positive definite matrix then $X$ is normally distributed with expectation $m_X$ and variance matrix $\Sigma_X$.

2. Let $X = AZ + m$, with the entries of $Z$ independent with standard normal distributions. Then $X$ is normally distributed with expectation $m_X = m$ and variance matrix $\Sigma_X = AA^{\mathrm{T}}$.
a) If $A$ has full row rank then $\Sigma_X$ is non-singular and $X$ has the probability density function (3.7).
b) If $A$ does not have full row rank then the variance matrix $\Sigma_X = AA^{\mathrm{T}}$ is singular. Then $X$ has no probability density function and is said to be singularly multi-dimensionally normally distributed.

(1) The expectation of a matrix is the matrix of expectations.
Example 3.2.2 (Two dimensions). Suppose that

$\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 2 & \alpha \end{pmatrix} \begin{pmatrix} Z_1 \\ Z_2 \end{pmatrix}$

with $Z_1, Z_2$ independent zero mean unit variance normally distributed stochastic variables. If $\alpha = 0$ then $X_2 = 2X_1$, hence for small values of $\alpha$ we expect that $X_2$ is close to $2X_1$. Figure 3.1 depicts the joint probability density(2) $f_X(x_1, x_2)$ for $\alpha = 1/2$. Clearly the mass of $f_X(x_1, x_2)$ is centered around the line $x_2 = 2x_1$.   $\square$

Figure 3.1: Joint probability density $f_X(x_1, x_2)$, see Example 3.2.2

(2) For reasons of exposition $f_X$ is scaled with a factor 5 in Fig. 3.1.
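A minimal sketch (not from the notes) of sampling the transformation $X = AZ + m$ and checking numerically that the sample covariance matrix approaches $AA^{\mathrm{T}}$; the matrix used is the one of Example 3.2.2 as reconstructed above (with $m = 0$), which is itself an assumption.

alpha = 1/2;
A = [1 0; 2 alpha];
randn('seed',0);
Z = randn(2,10000);           % columns are independent standard normal pairs
X = A*Z;                      % realizations of (X1,X2)
SigmaX = X*X'/10000           % sample covariance, should be close to A*A'
A*A'
plot(X(1,:),X(2,:),'.')       % mass concentrated around the line x2 = 2*x1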
3.2.3 Characteristic function

If $Z$ is an $n$-dimensional vector-valued stochastic variable then

$\chi_Z(u) := \mathbb{E}\, \mathrm{e}^{\mathrm{i}u^{\mathrm{T}} Z}$,

with $u$ an $n$-dimensional vector, is called the characteristic function of $Z$. If $Z$ possesses a probability density $f_Z$ then we have

$\chi_Z(u) = \int_{\mathbb{R}^{n}} f_Z(z)\, \mathrm{e}^{\mathrm{i}u^{\mathrm{T}} z}\, \mathrm{d}z$.   (3.8)

This expression shows that $\chi_Z$ is nothing but a multi-dimensional Fourier transform of $f_Z$. For the multi-dimensional normal probability density (3.7) we have

$\chi_X(u) = \dfrac{1}{(\sqrt{2\pi})^{n}\, (\det \Sigma_X)^{1/2}} \int_{\mathbb{R}^{n}} \mathrm{e}^{\mathrm{i}u^{\mathrm{T}} x - \frac{1}{2}(x - m_X)^{\mathrm{T}} \Sigma_X^{-1} (x - m_X)}\, \mathrm{d}x$.   (3.9)

Rewriting the exponent yields

$\mathrm{i}u^{\mathrm{T}} x - \tfrac{1}{2}(x - m_X)^{\mathrm{T}} \Sigma_X^{-1} (x - m_X) = -\tfrac{1}{2}(x - m_X - \mathrm{i}\Sigma_X u)^{\mathrm{T}} \Sigma_X^{-1} (x - m_X - \mathrm{i}\Sigma_X u) - \tfrac{1}{2} u^{\mathrm{T}} \Sigma_X u + \mathrm{i}u^{\mathrm{T}} m_X$.

Substitution into (3.9) yields -- using the fact that the integral over $\mathbb{R}^{n}$ of an $n$-dimensional probability density function equals 1 -- that

$\chi_X(u) = \mathrm{e}^{-\frac{1}{2} u^{\mathrm{T}} \Sigma_X u + \mathrm{i}u^{\mathrm{T}} m_X}$.   (3.10)

This formula also holds if $Z$ is singularly normally distributed.

A useful application of the characteristic function is the computation of moments of stochastic variables (see Problem 3.2, p. 38).
3.3 Foundations of time series analysis

3.3.1 Introduction

In Chapter 2 we reviewed stochastic processes as models for time series. The remainder of these lecture notes deals with the question how on the basis of observations of the time series inferences may be made about the properties of the underlying process.

One such problem is to estimate the covariance function based on observations of part of the time series. Other problems arise if for instance the structure of the covariance function is known but the values of certain parameters that occur in the model need to be estimated.

By way of example, consider a time series that is a realization of a wide-sense stationary process $X_t$, $t \in \mathbb{Z}$, with mean $\mathbb{E}X_t = m$ and covariance function $\mathrm{cov}(X_t, X_s) = r(t-s)$. If part of a realization $x_t$, $t = 0, 1, \ldots, N-1$, of this process is given (realizations of a process $X_t$, $t \in \mathbb{T}$, are denoted as $x_t$, $t \in \mathbb{T}$), then an obvious estimate of the mean $m$ is the sample mean

$$\hat m_N = \frac{1}{N}\sum_{t=0}^{N-1} x_t. \qquad (3.11)$$

The circumflex denotes that we deal with an estimate. The index $N$ indicates that the result depends on the number of observations $N$.

Naturally we immediately wonder how accurate $\hat m_N$ is as an estimate of $m$. With this question we arrive in the realm of mathematical statistics. There is something special, however. Elementary statistics usually deals with samples of stochastic variables that are mutually independent and identically distributed. The numbers $x_t$ that occur in the sum in (3.11), however, are samples of dependent stochastic variables. This means that well-known results from elementary statistics about the variance of sample means do not apply.
3.3.2 Estimates and estimators

We work with a stochastic process $X_t$, $t \in \mathbb{T}$. In our applications we usually have $\mathbb{T} = \mathbb{Z}$. The subset $\mathbb{T}_o$ of $\mathbb{T}$ is the set of the time instants at which a realization of the process has been observed. Often $\mathbb{T}_o = \{0, 1, \ldots, N-1\}$.

Let $\theta$ be a parameter of the stochastic model of the process. In the case of a wide-sense stationary process, $\theta$ could be the mean $m$ of the process. We consider the question how to obtain an estimate of the parameter $\theta$ from the observed realization $x_t$, $t \in \mathbb{T}_o$. An estimator $s$ is an operation on $x_t$, $t \in \mathbb{T}_o$, that produces an estimate $s(x_t, t \in \mathbb{T}_o)$ for $\theta$. An estimator hence is a map. The image is the estimate.

Before considering how an estimator actually may be found we concentrate on the question how the properties of a given estimator may be characterized. Because for different realizations of the process different outcomes of the estimate result, we may evaluate the properties of the estimator by studying the stochastic variable

$$S = s(X_t, t \in \mathbb{T}_o).$$

For the estimator of the mean $m$ of a wide-sense stationary process that we proposed in (3.11) we thus study the stochastic variable

$$S_N = \frac{1}{N}\sum_{t=0}^{N-1} X_t. \qquad (3.12)$$

An estimator $S = s(X_t, t \in \mathbb{T}_o)$ is called an unbiased estimator for the parameter $\theta$ if

$$\mathbb{E}S = \theta.$$

The estimator (3.12) is an unbiased estimator for the mean of the wide-sense stationary process $X_t$, $t \in \mathbb{Z}$. A biased estimator has a systematic error, called bias. If the bias is known (which often is not the case) then it may be corrected.

It sometimes happens that an estimator is biased for a finite number of observations but becomes unbiased as the number of observations approaches infinity. Denote

$$S_N = s(X_t, t \in \{0, 1, \ldots, N\}). \qquad (3.13)$$

Then $S_N$ is called an asymptotically unbiased estimator of $\theta$ if

$$\lim_{N\to\infty}\mathbb{E}S_N = \theta.$$

An unbiased or asymptotically unbiased estimator is not necessarily a good estimator. An important quantity that determines the quality of the estimator is the mean square estimation error

$$\mathbb{E}(S - \theta)^2.$$

The larger the mean square error is, the less accurate the estimator is. If the estimator is unbiased then the mean square estimation error is precisely the variance of the estimator.

Often we may expect that we can estimate more accurately by accumulating more and more observations. An estimator $S_N$ as in (3.13) is called consistent if for every $\varepsilon > 0$

$$\lim_{N\to\infty}\Pr(|S_N - \theta| > \varepsilon) = 0.$$

In this case the estimator converges in probability to the true value $\theta$.
Theorem 3.3.1 (Sufficient condition for consistency). If $S_N$ is an unbiased or asymptotically unbiased estimator for $\theta$ and

$$\lim_{N\to\infty}\mathrm{var}(S_N) = 0$$

then $S_N$ is consistent.

Proof. We use the Chebyshev inequality, which is that for every stochastic variable $Z$ and number $\varepsilon > 0$ there holds

$$\mathbb{E}(Z^2) = \int_{\mathbb{R}} z^2\,dF_Z \;\geq\; \int_{z^2 \geq \varepsilon^2} z^2\,dF_Z \;\geq\; \varepsilon^2\int_{z^2 \geq \varepsilon^2} dF_Z = \varepsilon^2\Pr(Z^2 \geq \varepsilon^2) = \varepsilon^2\Pr(|Z| \geq \varepsilon).$$

For $Z = S_N - \theta$ this states that

$$0 \leq \Pr(|S_N - \theta| \geq \varepsilon) \leq \frac{\mathbb{E}(S_N - \theta)^2}{\varepsilon^2} = \frac{\mathrm{var}(S_N) + (\mathbb{E}S_N - \theta)^2}{\varepsilon^2}.$$

Because asymptotic unbiasedness implies that $\lim_{N\to\infty}\mathbb{E}S_N = \theta$ and by the assumption that $\lim_{N\to\infty}\mathrm{var}(S_N) = 0$ we immediately have

$$\lim_{N\to\infty}\Pr(|S_N - \theta| \geq \varepsilon) = 0.$$

This completes the proof.
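
The behavior described by Theorem 3.3.1 is easy to observe numerically. The following minimal MATLAB sketch estimates the variance of the sample mean (3.12) for normally distributed white noise over repeated realizations; the values of $m$, $\sigma$ and the sample sizes are arbitrary illustrative choices.

% Empirical illustration of Theorem 3.3.1 for the sample mean (3.12)
% of white noise with mean m and standard deviation sigma (arbitrary values).
m = 2; sigma = 1;
K = 2000;                                  % number of realizations
for N = [10 100 1000]
    X = m + sigma*randn(N,K);              % K realizations of length N
    SN = mean(X,1);                        % sample mean of each realization
    fprintf('N = %4d   var(S_N) = %.5f   sigma^2/N = %.5f\n', ...
            N, var(SN), sigma^2/N);
end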
3.3.3 Maximum likelihood estimators

In the next sections and chapters we introduce estimators for the mean value, covariance function and spectral density function of wide-sense stationary processes and discuss the statistical properties of these estimators. The formulas for these estimators are based on sample averages. There also are other methods to find estimators. A well-known and powerful method is the maximum likelihood principle. We discuss this idea.

Suppose that $X_1, X_2, \ldots, X_N$ are stochastic variables whose joint probability density function depends on an unknown parameter $\theta$. This may be a vector-valued parameter. We denote the joint probability density function as

$$f_{X_1, X_2, \ldots, X_N}(x_1, x_2, \ldots, x_N; \theta). \qquad (3.14)$$

Suppose that $x_1, x_2, \ldots, x_N$ are observed sample values of the stochastic variables. By substitution of these values into (3.14) the expression (3.14) becomes a function of $\theta$. If

$$f_{X_1, X_2, \ldots, X_N}(x_1, x_2, \ldots, x_N; \theta_1) > f_{X_1, X_2, \ldots, X_N}(x_1, x_2, \ldots, x_N; \theta_2)$$

then we say that $\theta_1$ is a more likely value of $\theta$ than $\theta_2$. We therefore select the value of $\theta$ that maximizes the likelihood function

$$f_{X_1, X_2, \ldots, X_N}(x_1, x_2, \ldots, x_N; \theta)$$

as the estimate $\hat\theta$ of the parameter. Often it turns out to be more convenient to work with the log likelihood function

$$L(x_1, x_2, \ldots, x_N; \theta) = \log f_{X_1, X_2, \ldots, X_N}(x_1, x_2, \ldots, x_N; \theta).$$

The maximum likelihood method provides a straightforward recipe to find estimators. We apply the method in Chapter 5 (p. 59) to the identification of ARMA models. Disadvantages of the method are that analytical formulas need to be available for the probability density function and that the maximization of the likelihood function may be quite cumbersome.

There also are methodological objections to the maximum likelihood principle. These objections appear to be unfounded because in many situations maximum likelihood estimators have quite favorable properties. They often are efficient. This property is defined in the next section.
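
When the maximization cannot be done analytically it may be carried out numerically. The following minimal MATLAB sketch maximizes the normal log likelihood with fminsearch; the simulated data, the starting point and the log-sigma parameterization (used to avoid the constraint $\sigma > 0$) are illustrative choices, not part of the recipe above.

% Numerical maximization of a log likelihood (a minimal sketch).
% For N independent N(m, sigma^2) observations x the log likelihood is
%   L(x; m, sigma) = -N*log(sigma*sqrt(2*pi)) - sum((x-m).^2)/(2*sigma^2).
% fminsearch minimizes -L over theta = [m; log(sigma)].
x = 2 + 0.5*randn(200,1);                  % simulated observations
negL = @(theta) numel(x)*log(exp(theta(2))*sqrt(2*pi)) + ...
                sum((x - theta(1)).^2)/(2*exp(2*theta(2)));
theta0 = [0; 0];                           % initial guess [m; log(sigma)]
thetaML = fminsearch(negL, theta0);
mML = thetaML(1), sigmaML = exp(thetaML(2))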
3.3.4 The Cramér-Rao inequality

The well known Cramér-Rao inequality provides a lower bound for the variance of an unbiased estimator. At this point it is useful to introduce two short hands for the first two partial derivatives: from now on we use $L_\theta$ and $L_{\theta\theta}$ to mean

$$L_\theta(x, \theta) = \frac{\partial L(x, \theta)}{\partial\theta}, \qquad L_{\theta\theta}(x, \theta) = \frac{\partial^2 L(x, \theta)}{\partial\theta^2}.$$

Theorem 3.3.2 (Cramér-Rao inequality). Suppose that the probability distribution of the stochastic variables $X_1, X_2, \ldots, X_N$ depends on a scalar unknown parameter $\theta$. Let $S = s(X)$, with $X = (X_1, X_2, \ldots, X_N)$, be an unbiased estimator for $\theta$. Denote the log likelihood function of the stochastic variables as $L(x, \theta)$, with $x = (x_1, x_2, \ldots, x_N)$.

1. Then

$$\mathrm{var}(S) \geq \frac{1}{M(\theta)}, \qquad (3.15)$$

where

$$M(\theta) = \mathbb{E}\big[L_\theta(X, \theta)\big]^2 = -\mathbb{E}L_{\theta\theta}(X, \theta). \qquad (3.16)$$

2. $\mathrm{var}(S) = 1/M(\theta)$ if and only if

$$L_\theta(x, \theta) = M(\theta)\big[s(x) - \theta\big]. \qquad (3.17)$$

Proof. See Appendix A (p. 91).
Example 3.3.3 (Cramér-Rao inequality). By way of example we consider the case that $X_1, X_2, \ldots, X_N$ are independent, normally distributed stochastic variables with unknown expectation $m$ and standard deviation $\sigma$. The joint probability density function and log likelihood function are

$$f(x, m) = \frac{1}{(\sigma\sqrt{2\pi})^N}\,e^{-\frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i - m)^2},$$

$$L(x, m) = -N\log(\sigma\sqrt{2\pi}) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i - m)^2.$$

Differentiating twice with respect to $m$ we find

$$L_m(x, m) = \frac{1}{\sigma^2}\sum_{i=1}^{N}(x_i - m), \qquad L_{mm}(x, m) = -\frac{N}{\sigma^2}.$$

According to Cramér-Rao we thus have for every unbiased estimator $S$ of $m$ that

$$\mathrm{var}(S) \geq \frac{\sigma^2}{N}. \qquad (3.18)$$

A well known estimator for $m$ is the sample average

$$\hat m_N = \frac{1}{N}\sum_{i=1}^{N} X_i.$$

This estimator is unbiased with variance

$$\mathrm{var}(\hat m_N) = \mathbb{E}\Big[\frac{1}{N}\sum_{i=1}^{N}(X_i - m)\Big]^2 = \frac{\sigma^2}{N}.$$

The variance of this estimator equals the lower bound. It is not difficult to check that (3.17) applies.
Unbiased estimators whose variance equals the lower bound of the Cramér-Rao inequality are said to be efficient. If equality is assumed in the limit⁴ $N \to \infty$ then the estimator is called asymptotically efficient.

Maximum likelihood estimators have the following properties.

1. If an efficient estimator $S$ of $\theta$ exists then $S$ is a maximum likelihood estimator. The reason is that if $S$ is efficient then by Theorem 3.3.2 (p. 35)

$$L_\theta(x, \theta) = M(\theta)\big[s(x) - \theta\big].$$

Now if $\hat\theta_{ML}$ is the maximum likelihood estimator of $\theta$, then $L_\theta(x, \hat\theta_{ML}) = 0$, so that $M(\hat\theta_{ML})\big[s(x) - \hat\theta_{ML}\big] = 0$. Normally $M$ is invertible so that necessarily $s(x) = \hat\theta_{ML}$.

2. In many situations maximum likelihood estimators are asymptotically efficient.

In contrast to the latter statement is the fact that for small sample sizes the properties of maximum likelihood estimators are reputed to be less favorable than those of other estimators.

The Cramér-Rao inequality may be extended to the case that the parameter is vector-valued.

⁴ In the sense that the ratio of $\mathrm{var}(S)$ and $1/M(\theta)$ approaches 1 as $N \to \infty$.
Theorem 3.3.4 (Cramér-Rao inequality for the vector case). Suppose that the probability distribution of the stochastic variables $X_1, X_2, \ldots, X_N$ depends on an unknown vector-valued parameter

$$\theta = \begin{bmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix}.$$

Let

$$S = \begin{bmatrix} S_1 \\ S_2 \\ \vdots \\ S_n \end{bmatrix} = s(X) = \begin{bmatrix} s_1(X) \\ s_2(X) \\ \vdots \\ s_n(X) \end{bmatrix},$$

with $X = (X_1, X_2, \ldots, X_N)$, be an unbiased estimator of $\theta$. Denote the log likelihood function of the stochastic variables as $L(x, \theta)$, with $x = (x_1, x_2, \ldots, x_N)$. Denote the gradient and the Hessian of $L$ with respect to $\theta$ successively as

$$L_\theta(x, \theta) = \begin{bmatrix} \frac{\partial L(x,\theta)}{\partial\theta_1} \\ \frac{\partial L(x,\theta)}{\partial\theta_2} \\ \vdots \\ \frac{\partial L(x,\theta)}{\partial\theta_n} \end{bmatrix}, \qquad
L_{\theta\theta^T}(x, \theta) = \begin{bmatrix}
\frac{\partial^2 L(x,\theta)}{\partial\theta_1^2} & \frac{\partial^2 L(x,\theta)}{\partial\theta_1\partial\theta_2} & \cdots & \frac{\partial^2 L(x,\theta)}{\partial\theta_1\partial\theta_n} \\
\frac{\partial^2 L(x,\theta)}{\partial\theta_2\partial\theta_1} & \frac{\partial^2 L(x,\theta)}{\partial\theta_2^2} & \cdots & \frac{\partial^2 L(x,\theta)}{\partial\theta_2\partial\theta_n} \\
\vdots & \vdots & & \vdots \\
\frac{\partial^2 L(x,\theta)}{\partial\theta_n\partial\theta_1} & \frac{\partial^2 L(x,\theta)}{\partial\theta_n\partial\theta_2} & \cdots & \frac{\partial^2 L(x,\theta)}{\partial\theta_n^2}
\end{bmatrix}.$$

Furthermore, denote the variance matrix of $S$ as

$$\mathrm{var}(S) = \mathbb{E}\big[(S - \mathbb{E}S)(S - \mathbb{E}S)^T\big] = \begin{bmatrix}
\mathrm{var}(S_1) & \mathrm{cov}(S_1, S_2) & \cdots & \mathrm{cov}(S_1, S_n) \\
\mathrm{cov}(S_2, S_1) & \mathrm{var}(S_2) & \cdots & \mathrm{cov}(S_2, S_n) \\
\vdots & \vdots & & \vdots \\
\mathrm{cov}(S_n, S_1) & \mathrm{cov}(S_n, S_2) & \cdots & \mathrm{var}(S_n)
\end{bmatrix}.$$

Then

$$\mathrm{var}(S) \geq M(\theta)^{-1}, \qquad (3.19)$$

where

$$M(\theta) = \mathbb{E}\big[L_\theta(X, \theta)L_\theta^T(X, \theta)\big] = -\mathbb{E}L_{\theta\theta^T}(X, \theta). \qquad (3.20)$$

A proof of Theorem 3.3.4 is listed in Appendix A. The expectation of a matrix with stochastic entries is taken entry by entry. The inequality $A \geq B$, with $A$ and $B$ symmetric matrices such as in (3.19), by definition means that the matrix $A - B$ is nonnegative definite.⁵ From (3.19) it follows that in particular

$$\mathrm{var}(S_i) \geq R_{ii}, \qquad i = 1, 2, \ldots, n.$$

The numbers $R_{ii}$ are the diagonal entries of the matrix $R = M(\theta)^{-1}$. In mathematical statistics the matrix $M(\theta)$ is known as Fisher's information matrix.

⁵ A square real matrix $M$ is positive definite if $x^T M x > 0$ for every real vector $x \neq 0$ (of correct dimension). It is non-negative definite if $x^T M x \geq 0$ for every vector $x$ (of correct dimension).
3.4 Linear estimators

[Figure 3.2: Linear approximation]

3.4.1 Linear estimators

One of the classical estimation problems is to approximate a set of observed $(t_k, x_k) \in \mathbb{R}^2$ by a function linear in $t$. See Fig. 3.2. In the absence of assumptions on the stochastic properties of $X_t$ the approximating straight line $x = a + bt$ is often taken to be the one that minimizes the sum of squares

$$\sum_k \big(x_k - (a + b t_k)\big)^2 \qquad (3.21)$$

with respect to $a$ and $b$. In this section we review this and other linear estimation problems.

It will be convenient to stack the observations $X_1, X_2, \ldots, X_N$ in a column vector denoted $X$,

$$X = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_N \end{bmatrix}.$$

This allows us to write the least squares problem (3.21) in a more compact vector notation as the problem of minimizing $\|\varepsilon\|^2 = \varepsilon^T\varepsilon$, where $\varepsilon$ is the vector of equation errors defined via

$$\underbrace{\begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_N \end{bmatrix}}_{X} = \underbrace{\begin{bmatrix} 1 & t_1 \\ 1 & t_2 \\ \vdots & \vdots \\ 1 & t_N \end{bmatrix}}_{W}\underbrace{\begin{bmatrix} a \\ b \end{bmatrix}}_{\theta} + \underbrace{\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_N \end{bmatrix}}_{\varepsilon}.$$
More generally, we consider in this section the problem of obtaining estimators of $\theta$, when all we are given are observations $X$ and we know that

$$X = W\theta + \varepsilon$$

with $W$ a given matrix. Depending on what we assume for $\varepsilon$ different estimators result. A possibility is to use least squares approximation as indicated above.

Lemma 3.4.1 (Least squares). Suppose $W$ is a full column rank matrix. Then there is a unique $\theta$ that minimizes $\|\varepsilon\|^2$ and it is given by

$$\hat\theta = (W^T W)^{-1} W^T X.$$

Proof. The quantity to be minimized is $\|\varepsilon\|^2 = \varepsilon^T\varepsilon = (X - W\theta)^T(X - W\theta)$ and this is quadratic in $\theta$. Quadratic functions that have a minimum are minimal if and only if their gradient is zero. The gradient of $(X - W\theta)^T(X - W\theta)$ with respect to $\theta$ is

$$\frac{\partial}{\partial\theta}\big[(X - W\theta)^T(X - W\theta)\big] = -2W^T(X - W\theta) = -2\big[W^T X - (W^T W)\theta\big].$$

This is zero if and only if $\theta = (W^T W)^{-1}W^T X$.
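
In MATLAB the least squares estimate of Lemma 3.4.1 is obtained with the backslash operator. The following minimal sketch fits a straight line; the data are synthetic and the true values $a = 1$, $b = 0.5$ and the noise level are arbitrary illustrative choices.

% Least squares fit of a straight line x = a + b*t (Lemma 3.4.1).
N = 50;
t = (1:N)';
X = 1 + 0.5*t + 0.3*randn(N,1);    % synthetic observations
W = [ones(N,1) t];                 % full column rank regression matrix
thetahat = (W'*W)\(W'*X);          % least squares estimate; same as W\X
plot(t, X, '.', t, W*thetahat);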
Suppose now that we have some knowledge of stochastic properties of $\varepsilon$. To begin with, assume that the entries $\varepsilon_t$ of $\varepsilon$ come from a zero mean white noise process. Then

$$\mathrm{var}(\varepsilon) = \mathbb{E}(\varepsilon\varepsilon^T) = \sigma^2 I_N.$$

Because of the relation $X = W\theta + \varepsilon$ it follows that also $X$ is random, and therefore any sensible estimator $\hat\theta = S(X)$ as well. In the least squares problem the estimator $\hat\theta$ turns out to be linear in the observations $X$. To keep the problems tractable we will now limit the class of estimators to linear estimators, that is, we want the estimator $\hat\theta$ to be of the form $\hat\theta = KX$, for some matrix $K$.

Since we know that $\mathrm{var}(\varepsilon) = \sigma^2 I_N$, the least squares solution that aims to minimize $\|\varepsilon\|^2$ is no longer well motivated. What is well motivated is to find an estimate $\hat\theta$ of $\theta$ that minimizes

$$\mathbb{E}\|\hat\theta - \theta\|^2 = \sum_j \mathbb{E}(\hat\theta_j - \theta_j)^2.$$
Lemma 3.4.2 (Linear unbiased minimum variance estimator). Consider $X = W\theta + \varepsilon$ and assume that $\mathbb{E}\varepsilon = 0$ and $\mathrm{var}(\varepsilon) = \sigma^2 I_N$. The following holds.

1. A linear estimator $\hat\theta = KX$ of $\theta$ is unbiased if and only if $KW = I$. In particular this shows that linear unbiased estimators exist if and only if $W$ has full column rank;

2. The linear unbiased estimator $\hat\theta = KX$ that minimizes $\mathbb{E}\|\hat\theta - \theta\|^2$ is

$$\hat\theta = (W^T W)^{-1}W^T X \qquad (3.22)$$

in which case

$$\mathrm{var}(\hat\theta) = \sigma^2(W^T W)^{-1} \quad\text{and}\quad \mathbb{E}\|\hat\theta - \theta\|^2 = \sigma^2\,\mathrm{tr}\,(W^T W)^{-1};$$

3. For any linear unbiased estimator $p$ of $\theta$ there holds that

$$\mathrm{var}(p) \geq \mathrm{var}(\hat\theta)$$

where $\hat\theta$ is the estimator defined in (3.22).

A proof is listed on Page 93. Note that this results in the same estimator as the least squares estimator. Condition 3 states that among the unbiased linear estimators, the minimum variance estimator (3.22) is the one that achieves the smallest variance matrix. Here the matrix inequality is in the sense that $\mathrm{var}(p) - \mathrm{var}(\hat\theta)$ is a nonnegative definite matrix. We may interpret this as a Cramér-Rao type lower bound but then with respect to the linear estimators. It is important to realize that Condition 3 is only about linear estimators $p$. It may be possible that nonlinear estimators exist that outperform the linear ones. However if all we know are the first and second order moments of the underlying stochastic process then linear estimators are a natural choice. Also, as we shall now see, the linear estimators are maximum likelihood estimators in the case that the $\varepsilon_t$ are zero mean normally distributed white noise. In that case these estimators are optimal with respect to any unbiased estimator, linear or nonlinear.
Suppose that the $\varepsilon_t$ are zero mean jointly normally distributed white noise with variance $\sigma^2$. The vector $\varepsilon$ is then a vector-valued Gaussian process, with joint probability density function

$$f_\varepsilon(\varepsilon) = \frac{1}{(\sigma\sqrt{2\pi})^N}\,e^{-\frac{1}{2\sigma^2}\|\varepsilon\|^2}.$$

From $X = W\theta + \varepsilon$ we get that $X$ is normally distributed with mean $W\theta$ and variance $\sigma^2 I$. So

$$f_X(X) = \frac{1}{(\sigma\sqrt{2\pi})^N}\,e^{-\frac{1}{2\sigma^2}\|X - W\theta\|^2}.$$

Its log likelihood function is

$$L = -N\log(\sigma\sqrt{2\pi}) - \frac{1}{2\sigma^2}\|X - W\theta\|^2.$$

The maximum likelihood estimators of $\theta$ and $\sigma$ follow similarly as in Example 3.3.3 by differentiation of $L$ with respect to $\theta$ and $\sigma$. Here, though, we need the vector valued derivative of $\theta$, i.e. the gradient,

$$L_\theta = \frac{1}{\sigma^2}W^T(X - W\theta), \qquad L_\sigma = -N\frac{1}{\sigma} + \frac{1}{\sigma^3}\|X - W\theta\|^2.$$

The gradient $L_\theta$ is zero if and only if

$$\hat\theta = (W^T W)^{-1}W^T X,$$

and then the derivative $L_\sigma$ is zero if and only if $\hat\sigma^2 = \|X - W\hat\theta\|^2/N$. Again we found the same estimator for $\theta$, and as a result the variance matrix $\mathrm{var}(\hat\theta)$ again equals $\sigma^2(W^T W)^{-1}$. The Hessian $L_{\theta\theta^T}$ is readily seen to be

$$L_{\theta\theta^T} = -\frac{1}{\sigma^2}W^T W.$$

Fisher's information matrix hence is $M(\theta) = -\mathbb{E}L_{\theta\theta^T} = \frac{1}{\sigma^2}(W^T W)$, but this is precisely the inverse of the variance, so the Cramér-Rao lower bound (3.19) is attained. Therefore:

Lemma 3.4.3 (Maximum likelihood). Suppose $\varepsilon_t$ is a zero mean normally distributed white noise process with variance $\sigma^2$. Consider $X = W\theta + \varepsilon$ as defined above and assume that $W$ has full column rank. Then the maximum likelihood estimators of $\theta$ and $\sigma^2$ are

$$\hat\theta_{ML} = (W^T W)^{-1}W^T X, \qquad \hat\sigma^2_{ML} = \frac{1}{N}\|X - W\hat\theta_{ML}\|^2$$

and the estimator $\hat\theta_{ML}$ is efficient with $\mathrm{var}(\hat\theta_{ML}) = \sigma^2(W^T W)^{-1}$.
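
The variance expression in Lemma 3.4.3 is easy to verify empirically. The following minimal MATLAB sketch repeats the experiment many times and compares the empirical variance matrix of $\hat\theta$ with $\sigma^2(W^T W)^{-1}$; the regression matrix, parameter values and number of repetitions are arbitrary illustrative choices.

% Monte Carlo check of Lemma 3.4.3.
N = 40; sigma = 0.5;
t = (1:N)'; W = [ones(N,1) t];
theta = [1; 0.5];
K = 5000;                              % number of repeated experiments
That = zeros(2,K);
for k = 1:K
    X = W*theta + sigma*randn(N,1);
    That(:,k) = (W'*W)\(W'*X);
end
cov(That')                             % empirical variance matrix
sigma^2*inv(W'*W)                      % theoretical variance matrix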
3.5 Problems

3.1 Stationary normally distributed processes. Suppose that the stochastic process $X_t$, $t \in \mathbb{Z}$, is normally distributed. Prove that the process is wide-sense stationary if and only if it is strictly stationary.

3.2 Fourth-order moment of normally distributed stochastic variables. Let $X_1$, $X_2$, $X_3$ and $X_4$ be four jointly distributed stochastic variables with characteristic function $\phi(u_1, u_2, u_3, u_4)$.

   a) Show that
      $$\mathbb{E}(X_1 X_2 X_3 X_4) = \frac{\partial^4\phi(u_1, u_2, u_3, u_4)}{\partial u_1\,\partial u_2\,\partial u_3\,\partial u_4}\bigg|_{u_1=0,\,u_2=0,\,u_3=0,\,u_4=0}.$$

   b) Prove with the help of (3.10) that if $X_1$, $X_2$, $X_3$ and $X_4$ are zero mean jointly normally distributed random variables then
      $$\mathbb{E}(X_1X_2X_3X_4) = \mathbb{E}(X_1X_2)\,\mathbb{E}(X_3X_4) + \mathbb{E}(X_1X_3)\,\mathbb{E}(X_2X_4) + \mathbb{E}(X_1X_4)\,\mathbb{E}(X_2X_3). \qquad (3.23)$$

3.3 Normally distributed AR process. Let $\{\varepsilon_t\}_{t\geq 0}$ be a jointly normally distributed white noise process. Consider the AR($n$) process $\{X_t\}_{t\geq 0}$ defined by $D(q)X_t = \varepsilon_t$ ($t \geq 0$). Under which conditions on $X_{-n}, X_{-n+1}, \ldots, X_{-1}$ is $\{X_t\}_{t\geq 0}$ jointly normally distributed?
3.4 Unbiased estimators and the mean square error. Let $S$ be an unbiased estimator of a parameter $\theta$, and consider other estimators
   $$S_\alpha = \alpha S, \qquad \alpha \in \mathbb{R}.$$
   Such an estimator $S_\alpha$ is unbiased if and only if $\alpha = 1$ (or $\theta = 0$). Now it is tempting to think that the $\alpha$ that minimizes the mean square error
   $$\min_\alpha \mathbb{E}(S_\alpha - \theta)^2 \qquad (3.24)$$
   is equal to 1 as well. That is generally not the case: Show that (3.24) is minimized for
   $$\alpha = \frac{\theta^2}{\mathrm{var}(S) + \theta^2} \leq 1. \qquad (3.25)$$

3.5 Estimation of random walk parameter.⁶ The process defined by
   $$\nabla X_t = \varepsilon_t, \qquad t = 1, 2, 3, \ldots, \qquad (3.26)$$
   with $X_0 = 0$, $\nabla$ the difference operator $\nabla X_t = X_t - X_{t-1}$, and $\varepsilon_t$, $t = 1, 2, \ldots$, normally distributed zero mean white noise with variance $\sigma^2$, is known as the random walk process.

   a) Compute the mean value function $m(t) = \mathbb{E}X_t$ and the covariance function $r(t, s) = \mathrm{cov}(X_t, X_s)$ of the process. Is the process wide-sense stationary?

   b) Determine the joint probability density function of $X_1, X_2, \ldots, X_N$, with $N > 1$ an integer.

   c) Prove that the maximum likelihood estimator $\hat\sigma^2$ of the variance $\sigma^2$ of the white noise $\varepsilon_t$, $t > 0$, based on the observations $X_1, X_2, \ldots, X_N$, is given by
      $$\hat\sigma^2_N = \frac{1}{N}\sum_{k=1}^{N}(X_k - X_{k-1})^2. \qquad (3.27)$$

   d) Prove that this estimator is unbiased.

   e) It may be shown that $\mathbb{E}\varepsilon_t^4 = 3\sigma^4$. Prove that the estimator (3.27) is consistent, by showing that $\mathrm{var}(\hat\sigma^2_N) \to 0$ as $N \to \infty$.

⁶ Examination May 25, 1993.
3.6 Suppose $X_0, \ldots, X_{N-1}$ are mutually independent stochastic variables with the same probability density function
   $$f(x) = \begin{cases} \dfrac{1}{\theta}\,e^{-x/\theta} & \text{if } x > 0, \\ 0 & \text{if } x \leq 0, \end{cases}$$
   for some parameter $\theta > 0$. It may be shown that $\mathbb{E}X_t = \theta$ and $\mathbb{E}X_t^2 = 2\theta^2$.

   a) Determine the maximum likelihood estimator $\hat\theta$ of $\theta$ given $X_0, \ldots, X_{N-1}$.
   b) Is this estimator $\hat\theta$ biased?
   c) Express the Cramér-Rao lower bound for the variance of this estimator $\hat\theta$ in terms of $\theta$ and $N$.
   d) Is this estimator $\hat\theta$ efficient?
   e) Is this estimator $\hat\theta$ consistent?

3.7 Interpretation of the vector-valued Cramér-Rao lower bound. Show that a vector of estimators $\hat\theta_1, \ldots, \hat\theta_n$ of $\theta_1, \ldots, \theta_n$ is efficient if and only if for every $\lambda_1, \ldots, \lambda_n$ the estimator $\sum_{j=1}^{n}\lambda_j\hat\theta_j$ is an efficient estimator of $\sum_{j=1}^{n}\lambda_j\theta_j$.

3.8 Colored noise. Suppose $\varepsilon \in \mathbb{R}^N$ is a vector of stochastic variables all with zero mean, and that $\mathbb{E}(\varepsilon\varepsilon^T) = PP^T$ for some nonsingular matrix $P$.

   a) Determine $\mathbb{E}(\eta\eta^T)$ for $\eta := P^{-1}\varepsilon$.
   b) Reformulate Lemma 3.4.2 but now for the case that $\mathbb{E}(\varepsilon\varepsilon^T) = PP^T$ (instead of $\mathbb{E}(\varepsilon\varepsilon^T) = \sigma^2 I_N$ as assumed in Lemma 3.4.2).

3.9 Linear unbiased minimum variance estimator. Consider the stochastic process $X_t = \theta + \varepsilon_t$ with $\varepsilon_t$ zero mean white noise with variance $\sigma^2$. Given observations $x_1, \ldots, x_N$ determine the linear unbiased minimum variance estimator of $\theta$, using Lemma 3.4.2.
Matlab problems

10. Explain what the MATLAB code listed below does and adapt it to find a polynomial $p(t)$ of sufficiently high degree such that $\max_{t\in[0,\pi/2]}|\sin(t) - p(t)|$ is less than $10^{-7}$. Supply plots of the error $\sin(t) - p(t)$ and compare the results with the Taylor series expansion of $\sin(t)$ around $t = 0$: $\sin(t) = t - \frac{1}{3!}t^3 + \frac{1}{5!}t^5 - \frac{1}{7!}t^7 + \cdots$.

n=5;
N=10;
t=linspace(0,2,N)';
X=exp(t);
W=ones(N,n+1);
for k=1:n
W(:,k+1)=W(:,k).*t;
end
theta=W\X;   % this means theta=(W'*W)\(W'*X)
plot(t,X,t,W*theta);
4 Non-parametric time series analysis

4.1 Introduction

In this chapter we discuss non-parametric time series analysis. This includes classical time series analysis.

Classical time series analysis consists of a number of relatively simple techniques to identify and isolate trends and seasonal effects. These techniques are mainly based on moving averaging methods. The methodology is complemented with a collection of non-parametric statistical tests for stochasticity and trend.

Classical methods are still used extensively in practical time series analysis. There exists a well-known software package, known as Census X-11, which implements these methods. The American Bureau of the Census developed the package. Many national and international institutes that gather and process statistical data use this package.

Under non-parametric time series analysis we also include methods to estimate covariance functions and spectral density functions. Parametric time series analysis, which is the subject of Chapter 5 (p. 59), deals with the estimation of parameters of more or less structured models for time series, in particular, ARMA models.

In §4.2 (p. 41) some well known non-parametric tests for stochasticity and trend are reviewed. A brief discussion of a number of ideas from classical time series analysis follows in §4.3 (p. 41).

The first subject from non-parametric time series analysis that we discuss is the estimation of the mean of a wide-sense stationary process in §4.4 (p. 43). We introduce an obvious estimator and investigate its consistency. Next in §4.5 (p. 45) we discuss the estimation of the covariance function of a wide-sense stationary process. Section 4.6 (p. 46) considers the estimation of the spectral density function. Again the unbiasedness and consistency of the estimates are investigated. Windowing in the time and frequency domain is extensively discussed, and a brief presentation of the fast Fourier transform is included.

In §4.7 (p. 53) some aspects of non-parametric estimation problems for continuous-time processes are reviewed.
4.2 Tests for stochasticity and trend

This and the next section follow Chapters 2, 3 and 4 of Kendall and Orr (1990).

The first question that may be asked when an observed time series $x_t$, $t = 1, 2, \ldots, n$, is studied is whether it is a purely random sequence, that is, whether it is a realization of white noise. The following well known non-parametric test may provide an indication for the answer to this question. A time instant $k$ is a turning point of the observed sequence if

$$x_{k-1} < x_k > x_{k+1} \qquad (4.1)$$

(the sequence exhibits a peak) or

$$x_{k-1} > x_k < x_{k+1} \qquad (4.2)$$

(the sequence has a local minimum). If the sequence has length $n$ then it has at most $n - 2$ turning points. Let $y$ be the total number of turning points. The null hypothesis is that the sequence is a realization of a sequence of mutually independent identically distributed stochastic variables (that is, of white noise). Under this hypothesis the expectation and variance of $y$ equal

$$\mathbb{E}y = \frac{2(n-2)}{3}, \qquad \mathrm{var}(y) = \frac{16n - 29}{90}. \qquad (4.3)$$

A proof is listed in Appendix A. With increasing $n$ the distribution of $y$ quickly approaches a normal distribution. With good approximation the statistic

$$z = \frac{y - \mathbb{E}y}{\sqrt{\mathrm{var}(y)}} \qquad (4.4)$$

has a standard normal distribution for $n$ sufficiently large. If the observed outcome of $z$ lies outside a suitably chosen confidence interval then the null hypothesis is rejected.
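
The turning point statistic is straightforward to compute. The following minimal MATLAB sketch counts the turning points of a series and forms the standardized statistic (4.4); the data vector x is a placeholder.

% Turning point test for stochasticity (a minimal sketch).
x = randn(200,1);                          % placeholder data
n = length(x);
y = sum( (x(2:n-1) > x(1:n-2) & x(2:n-1) > x(3:n)) | ...
         (x(2:n-1) < x(1:n-2) & x(2:n-1) < x(3:n)) );   % number of turning points
Ey   = 2*(n-2)/3;
vary = (16*n - 29)/90;
z = (y - Ey)/sqrt(vary)       % compare with standard normal quantiles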
The next test could be whether the sequence constitutes a trend. A suitable statistic for this is the number of points of increase. A time instant $k$ is a point of increase if $x_{k+1} > x_k$. If the sequence is a realization of a sequence of mutually independent identically distributed stochastic variables then the number of points of increase $z$ has expectation $\mathbb{E}z = \frac{1}{2}(n-1)$ and variance $\mathrm{var}(z) = \frac{1}{12}(n+1)$. The outcome of the statistic is an indication for the presence of a trend and its direction.

Null hypotheses that are not rejected are not necessarily true. The use of the tests that have been described therefore is limited. In the literature many other non-parametric tests are known that possibly may be useful.
4.3 Classical time series analysis

4.3.1 Introduction

In classical time series analysis a time series $x_t$ has three components:

1. A trend $m_t$.
2. A seasonal component $s_t$.
3. An incidental or random component $w_t$.

These components may be additive but also multiplicative. In the first case we have

$$x_t = m_t + s_t + w_t. \qquad (4.5)$$

In the second case

$$x_t = m_t\, s_t\, w_t. \qquad (4.6)$$

The multiplicative model may be converted to an additive model by taking the logarithm of the time series. This can only be done if the time series is essentially positive. If a time series exhibits fluctuations that increase with time then this is an indication that a multiplicative model might be appropriate. The cause of the increasing fluctuations then is an increasing trend factor $m_t$ in the multiplicative model (4.6). The immigration into the USA as plotted in Fig. 1.3 (p. 2) shows this pattern.
4.3.2 Trends

In what follows we use the additive model (4.5) and for the time being assume that no seasonal component $s_t$ is present. A well-known method to estimate the trend is to take a moving average of the form

$$\hat m_t = \sum_{j=-k_1}^{k_2} h_j\, x_{t-j}. \qquad (4.7)$$

The circumflex on the $m$ indicates an estimated value. The real numbers $h_j$ are the weights. If $k_1 = k_2 = k$ then the moving average is said to be centered. If in addition $h_{-j} = h_j$ for all $j$ then the moving average is symmetric.

The method that is described in what follows may be used to determine moving averaging schemes in a systematic manner. Suppose, without loss of generality, that we wish to estimate the value of the trend at time 0. Then around this time we could represent the trend as a polynomial, say of degree $p$, so that

$$m_t = a_0 + a_1 t + \cdots + a_p t^p. \qquad (4.8)$$

Next we could determine the coefficients $a_0, a_1, \ldots, a_p$ by minimizing the sum of squares

$$\sum_{t=-n}^{n}(x_t - a_0 - a_1 t - \cdots - a_p t^p)^2. \qquad (4.9)$$

This makes sense if $2n > p$. By way of example choose $p = 3$ and $n = 3$. Then

$$\sum_{t=-3}^{3}(x_t - a_0 - a_1 t - a_2 t^2 - a_3 t^3)^2 \qquad (4.10)$$

needs to be minimized with respect to the coefficients $a_0$, $a_1$, $a_2$ and $a_3$. Partial differentiation with respect to each of the coefficients and setting the derivatives equal to zero yields the four normal equations

$$\sum_{t=-3}^{3} x_t\, t^j = a_0\sum_{t=-3}^{3} t^j + a_1\sum_{t=-3}^{3} t^{j+1} + a_2\sum_{t=-3}^{3} t^{j+2} + a_3\sum_{t=-3}^{3} t^{j+3}$$

for $j = 0, 1, 2, 3$. The summations over $t$ from $-3$ to $3$ of the odd powers $j$ of $t$ cancel, so that the set of equations reduces to

$$\begin{aligned}
\sum_{t=-3}^{3} x_t &= 7 a_0 + 28 a_2, \\
\sum_{t=-3}^{3} t\, x_t &= 28 a_1 + 196 a_3, \\
\sum_{t=-3}^{3} t^2 x_t &= 28 a_0 + 196 a_2, \\
\sum_{t=-3}^{3} t^3 x_t &= 196 a_1 + 1588 a_3.
\end{aligned} \qquad (4.11)$$

Since according to (4.8) $m_0 = a_0$ we are only interested in the coefficient $a_0$. By solving the first and third equations of (4.11) for $a_0$ it follows that

$$\hat m_0 = \frac{1}{21}\Big(7\sum_{t=-3}^{3} x_t - \sum_{t=-3}^{3} t^2 x_t\Big) = \frac{1}{21}\big(-2x_{-3} + 3x_{-2} + 6x_{-1} + 7x_0 + 6x_1 + 3x_2 - 2x_3\big).$$

The estimate $\hat m_0$ of the trend at time 0 hence follows from a symmetric weighted average of the values of the time series $x_t$ about the time instant 0 with weights

$$\tfrac{1}{21}[-2,\ 3,\ 6,\ 7,\ 6,\ 3,\ -2]. \qquad (4.12)$$

These weights define a moving averaging scheme.
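
Applying such a scheme amounts to a convolution. The following minimal MATLAB sketch applies the weights (4.12) to a time series; the placeholder data are an arbitrary trend plus noise, and the first and last three smoothed values are unreliable because the scheme needs three samples on either side.

% Applying the 7-point moving averaging scheme (4.12) (a sketch).
x = (1:100)'/10 + randn(100,1);        % placeholder: linear trend plus noise
h = [-2; 3; 6; 7; 6; 3; -2]/21;        % weights (4.12)
mhat = conv(x, h, 'same');             % estimated trend
plot(1:100, x, 1:100, mhat);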
In Kendall and Orr (1990) tables may be found that list the weights of moving averaging schemes for different values of the degree of the polynomial $p$ and of the number of terms $n$ in the sum of squares. The schemes are all symmetric. The sum of the weights is always 1, which may easily be checked for the weights of (4.12). The consequence of this is that if $x_t$ is constant then this constant is estimated as the trend, a sensible result.

For time instants near the beginning and the end of the time series a scheme of the form (4.7) cannot be used because some of the values of the time series $x_t$ that are needed are missing. This may be remedied by adjusting the number of terms in the sum of squares (4.9) correspondingly. The resulting moving averaging scheme is no longer symmetric.

In the days when no electronic computers were available and the calculations needed to be done with simple tools (recall the photographs of halls filled with mechanical calculators and their operators from the second world war) much effort was spent in searching for efficient schemes. A clever device is the successive application of several moving averaging schemes of short length with simple (integer) weights. Spencer's formula (from 1904) consists of the successive application of four moving averaging schemes with weights

$$[1, 1, 1, 1, 1],\quad [1, 1, 1, 1, 1],\quad [1, 1, 1, 1, 1, 1, 1],\quad \tfrac{1}{350}[-1, 0, 1, 2, 1, 0, -1]. \qquad (4.13)$$

This composite scheme is equivalent to a 21-point scheme that differs little from a least squares scheme. Schemes of this type are still used in actuarial practice.

Which scheme is eventually used depends on the results. It often is found by trial and error.
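
The successive application of the four schemes in (4.13) is itself a convolution of the four weight vectors. A minimal MATLAB sketch:

% Composing Spencer's 21-point scheme (4.13) by repeated convolution.
h = conv(conv([1 1 1 1 1], [1 1 1 1 1]), ...
         conv([1 1 1 1 1 1 1], [-1 0 1 2 1 0 -1]/350));
length(h)          % 21
sum(h)             % 1
% The smoothed series is then conv(x, h, 'same') for a time series x.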
4.3.3 Seasonal components

If a seasonal component is present then the period $P$ is usually known. Periods with the values 12 (months) or 4 (quarters) are common.

The seasonal component $s_t$ with period $P$ of the time series

$$x_t = m_t + s_t + w_t \qquad (4.14)$$

may easily be removed on the basis of the assumption that

$$\sum_{j=1}^{P} s_{t+j} = 0 \quad\text{for all } t. \qquad (4.15)$$

Application to $x_t$ of a centered $P$-point moving averaging scheme with weights

$$\tfrac{1}{P}[1, 1, \ldots, 1] \qquad (4.16)$$

eliminates the seasonal component. The seasonal component is recovered by subtracting the processed time series from the original time series.

If $P$ is even, such as for the examples that were cited, then no centered scheme with the weights (4.16) is feasible. The scheme then is modified to a $(P+1)$-point centered scheme with weights

$$\tfrac{1}{P}\big[\tfrac{1}{2}, 1, 1, \ldots, 1, \tfrac{1}{2}\big]. \qquad (4.17)$$

The approach that has been described assumes that the pattern of the seasonal component does not change with time. The Census X-11 package has modifications and options that permit gradual changes of the seasonal pattern.
Example 4.3.1 (Measurements of water level). Suppose $x_t$ shown in Fig. 4.1(a) represents the water level of the sea measured every 12 minutes. Because of waves, the measurements $x_t$ are very irregular. To get rid of irregularities we average $x_t$ over 11 samples,

$$y_t = \sum_{j=-5}^{5}\tfrac{1}{11}\,x_{t+j}.$$

This choice of 11 is somewhat arbitrary, and as 11 samples cover about 2 hours, the periodic behavior in $x_t$ is slightly flattened. The result $y_t$ is shown in Fig. 4.1(b). This clearly shows a periodic behavior, and now we also see a definite trend, possibly due to sedimentation. To separate this trend we average $y_t$ over four periods (240 samples)

$$\hat m_t = \sum_{j=-120}^{120} h_j\, y_{t+j},$$

in which

$$[h_{-120}, \ldots, h_{120}] = \tfrac{1}{240}\big[\tfrac{1}{2}, 1, 1, \ldots, 1, 1, \tfrac{1}{2}\big].$$

See part (c) of Fig. 4.1. Combined, the $x_t$ has now been separated into a trend, a seasonal component and a random component,

$$x_t = \underbrace{\hat m_t}_{\text{trend}} + \underbrace{y_t - \hat m_t}_{\text{seasonal}} + \underbrace{x_t - y_t}_{\text{random}}.$$

[Figure 4.1: Separating noise and seasonal components from $x_t$: (a) $x_t$, (b) $y_t$, (c) $\hat m_t$]
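
A minimal MATLAB sketch of this decomposition; the placeholder signal below imitates a slow trend with a periodic component and noise, and the window lengths 11 and 241 follow the example.

% Sketch of the decomposition in Example 4.3.1.
t = (1:1200)';
x = 0.001*t + sin(2*pi*t/60) + 0.3*randn(size(t));   % placeholder water level
y    = conv(x, ones(11,1)/11, 'same');               % smooth out the waves
h    = [0.5; ones(239,1); 0.5]/240;                  % weights of the example
mhat = conv(y, h, 'same');                           % trend
seasonal = y - mhat;
random   = x - y;
plot(t, [x y mhat]);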
4.4 Estimation of the mean

4.4.1 Estimation of the mean

We consider the estimation of the mean $m = \mathbb{E}X_t$ of a wide-sense stationary process $X_t$, $t \in \mathbb{Z}$. As previously established the sample mean

$$\hat m_N = \frac{1}{N}\sum_{t=0}^{N-1} X_t \qquad (4.18)$$

is an obvious estimator. It is unbiased.

If $X_t$, $t \in \mathbb{Z}$, is normally distributed white noise then $X_t$, $t = 0, 1, \ldots, N-1$, are mutually independent normally distributed with expectation $m$ and standard deviation $\sigma$. In Example 3.3.3 (p. 35) we saw that $\hat m_N$ then is an efficient estimator of $m$, with variance $\sigma^2/N$.

Now consider the general case that $X_t$, $t \in \mathbb{Z}$, is wide-sense stationary with covariance function $\mathrm{cov}(X_t, X_s) = r(t-s)$. The variance of the estimator (4.18) then is given by

$$\mathrm{var}(\hat m_N) = \frac{1}{N}\sum_{k=-N+1}^{N-1}\Big(1 - \frac{|k|}{N}\Big) r(k). \qquad (4.19)$$

The proof of (4.19) may be found in Appendix A (p. 93). This implies with Theorem 3.3.1 (p. 34) that the estimator is consistent if

$$\lim_{N\to\infty}\sum_{k=-N+1}^{N-1}\Big(1 - \frac{|k|}{N}\Big) r(k) < \infty. \qquad (4.20)$$

In that case the variance of the estimation error depends like $1/N$ on the number of observations $N$. This means that the standard deviation of the estimation error decreases to zero as $1/\sqrt{N}$ as $N \to \infty$. The error decreases relatively slowly as the number of observations $N$ increases. If the number of observations is taken 100 times larger then the estimation error becomes only 10 times smaller. With the Cramér-Rao lower bound (3.18) in mind, we expect that significantly better estimators cannot be found.
Example 4.4.1 (Estimation of the mean of an AR(1)-process). Consider by way of example the AR(1)-process $X_t = aX_{t-1} + \varepsilon_t$. This process has a stationary covariance function of the form

$$r(k) = \sigma_X^2\, a^{|k|}, \qquad k \in \mathbb{Z}, \qquad (4.21)$$
provided $|a| < 1$. According to (4.19) the variance of the estimator (4.18) equals

$$\mathrm{var}(\hat m_N) = \frac{\sigma_X^2}{N}\Big(1 + 2\sum_{k=1}^{N-1}\big(1 - \tfrac{k}{N}\big)a^k\Big). \qquad (4.22)$$

Using

$$\sum_{k=1}^{N-1} a^k = \frac{a - a^N}{1 - a} \quad\text{and}\quad \sum_{k=1}^{N-1} k a^k = \frac{a - N a^N + (N-1)a^{N+1}}{(1-a)^2}$$

we find that for $N$ large enough

$$\mathrm{var}(\hat m_N) \approx \frac{\sigma_X^2}{N}\,\frac{1+a}{1-a}. \qquad (4.23)$$

The error in the approximation is of order $1/\sqrt{N}$.

For $a = 0$ the variance (4.23) reduces to $\sigma_X^2/N$. This is of course precisely the variance that results for white noise. If $a$ increases from 0 to 1 then the error variance also increases. This is caused by the increasing positive correlation between successive values of $X_t$. The correlation decreases the effective number of observations, see Fig. 4.2. If on the other hand $a$ decreases from 0 to $-1$ then the error variance decreases. This somewhat surprising phenomenon is explained by the alternating character of the time series if $a$ is negative. The alternating character decreases the contribution of $\varepsilon_t$ to $\hat m_N$, with $t$ fixed.
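
The approximation (4.23) is easy to check by simulation. The following minimal MATLAB sketch uses the parameter values of the example; the realizations are started from zero initial conditions, which is good enough for illustration.

% Monte Carlo illustration of (4.23) for the AR(1) process X_t = a*X_{t-1} + e_t.
a = 0.9; N = 50; K = 5000;
sigmaX2 = 1;                         % stationary variance of X_t
sigmae  = sqrt(sigmaX2*(1-a^2));     % innovation standard deviation
mhat = zeros(K,1);
for k = 1:K
    x = filter(1, [1 -a], sigmae*randn(N,1));   % realization of length N
    mhat(k) = mean(x);
end
var(mhat)                            % empirical variance of the sample mean
sigmaX2/N*(1+a)/(1-a)                % approximation (4.23)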
4.4.2 Ergodicity

If the condition (4.20) is satisfied then it is possible to determine the mean of the process $X_t$ with arbitrary precision from a single realization $X_t$, $t \in \mathbb{Z}$, of the process. A process for which all statistical properties, such as its mean, but also its covariance function, may be determined from a single realization is said to be ergodic.

A characteristic example of a non-ergodic process is the process defined by

$$X_t = Y, \qquad t \in \mathbb{Z}. \qquad (4.24)$$
[Figure 4.2: Realization $x_t$ and estimate $\hat m_t$ with confidence bounds $\hat m_t \pm \sqrt{\frac{\sigma_X^2}{N}\frac{1+a}{1-a}}$ for $a = 0.2$ (left) and $a = 0.9$ (right)]
$Y$ is a stochastic variable. All realizations of the process are constant. A single realization of the process $X_t$ yields only a single sample of the stochastic variable $Y$. Hence, determining the stochastic properties of the process $X_t$, $t \in \mathbb{Z}$, from a single realization is out of the question.

Loosely speaking, for ergodicity the process should have a limited temporal relatedness. The assumption that

$$\lim_{N\to\infty}\frac{1}{N}\sum_{k=0}^{N}|r(k)| = 0 \qquad (4.25)$$

is sometimes referred to as an ergodicity assumption (Kendall and Orr, 1990). When the assumption holds, then $m(t)$ may be determined with arbitrary precision from a single realization. A slightly stronger assumption that we sometimes require is that the $r(k)$ are absolutely summable,

$$\sum_{k=0}^{\infty}|r(k)| < \infty. \qquad (4.26)$$

All wide sense stationary ARMA processes have this property.
4.5 Estimation of the covariance function

4.5.1 An estimator of the covariance function

The covariance function of a wide-sense stationary process is more interesting than its mean. It provides important information about the temporal relatedness of the process. We discuss how the covariance function $\mathrm{cov}(X_t, X_s) = r(t-s)$ of a wide-sense stationary process $X_t$, $t \in \mathbb{Z}$, may be estimated.

Suppose that a part $x_t$, $t \in \{0, 1, \ldots, N-1\}$, of a realization is available. The covariance function $\mathrm{cov}(X_t, X_s)$ is defined as $r(t-s) = \mathbb{E}(X_t - m)(X_s - m)$, with $m = \mathbb{E}X_t$. For $m$ we may determine an estimate $\hat m_N$ as in §4.4. Then for fixed $k$ we may estimate the covariance function $r(k)$ by taking the sample mean of the product

$$(x_{t+k} - \hat m_N)(x_t - \hat m_N) \qquad (4.27)$$

over as many values as are available. For $k \geq 0$ we may let $t$ run from 0 to $N - k - 1$ without going outside the set $\{0, 1, \ldots, N-1\}$. This implies that $N - k$ values of (4.27) are available. Thus we obtain the sample average estimator

$$\hat r_N(k) = \frac{1}{N-k}\sum_{t=0}^{N-k-1}(X_{t+k} - \hat m_N)(X_t - \hat m_N), \qquad k = 0, 1, \ldots, N-1. \qquad (4.28)$$

For $k < 0$ we obtain by the same argument $\hat r_N(k) = \hat r_N(-k)$.
4.5.2 Biasedness

We investigate the biasedness of the estimator (4.28). For $k \geq 0$ we write

$$\begin{aligned}
\sum_{t=0}^{N-k-1}(X_{t+k} - \hat m_N)(X_t - \hat m_N)
&= \sum_{t=0}^{N-k-1}(X_{t+k} - m - \hat m_N + m)(X_t - m - \hat m_N + m) \\
&= \sum_{t=0}^{N-k-1}(X_{t+k} - m)(X_t - m) - \sum_{t=0}^{N-k-1}(X_{t+k} - m)(\hat m_N - m) \\
&\quad - \sum_{t=0}^{N-k-1}(X_t - m)(\hat m_N - m) + \sum_{t=0}^{N-k-1}(\hat m_N - m)^2.
\end{aligned}$$

For $N \gg k$ we have with good approximation

$$\sum_{t=0}^{N-k-1} X_{t+k} \approx \sum_{t=0}^{N-k-1} X_t \approx (N-k)\,\hat m_N. \qquad (4.29)$$

It follows that

$$\sum_{t=0}^{N-k-1}(X_{t+k} - \hat m_N)(X_t - \hat m_N) \approx \sum_{t=0}^{N-k-1}(X_{t+k} - m)(X_t - m) - (N-k)(\hat m_N - m)^2.$$

Dividing by $N - k$ and taking expectations we obtain¹

$$\mathbb{E}\hat r_N(k) \approx r(k) - \mathrm{var}(\hat m_N). \qquad (4.30)$$

The estimator (4.28) of $r(k)$ hence is biased, and the bias is always negative. As we saw in §4.4 (p. 43), under suitable conditions $\lim_{N\to\infty}\mathrm{var}(\hat m_N) = 0$. In this case the estimator (4.28) is asymptotically unbiased.

In practice the estimator (4.28) is usually replaced with an estimator of the same form but with $N$ in the denominator rather than $N - k$. The reason is that although the revised estimator has a larger bias it generally has a smaller mean square estimation error. Another important reason is discussed in Problem 4.3 (p. 56). From now on we therefore work with the estimator

$$\hat r_N(k) = \frac{1}{N}\sum_{t=0}^{N-|k|-1}(X_{t+|k|} - \hat m_N)(X_t - \hat m_N), \qquad k = -(N-1), \ldots, 0, \ldots, N-1. \qquad (4.31)$$

This estimator is known as the standard biased covariance estimator. Like (4.28) this estimator is asymptotically unbiased.

¹ It may be shown that $\mathbb{E}\hat r_N(k) = r(k) - \mathrm{var}(\hat m_N) + \frac{1}{N(N-k)}O(k)$ if (4.26) holds.
4.5.3 Accuracy

We consider the accuracy of the estimator (4.31). For simplicity we assume that the process $X_t$ has a known mean $m$, and that $m = 0$. Hence we use the estimator

$$\hat r_N(k) = \frac{1}{N}\sum_{t=0}^{N-|k|-1} X_{t+|k|} X_t. \qquad (4.32)$$

This greatly simplifies the algebra, and the conclusions are qualitatively the same as for when $m$ is unknown. Asymptotically the formulas that we find also hold for the estimator (4.31). The estimator (4.32) has expectation

$$\mathbb{E}\hat r_N(k) = \Big(1 - \frac{|k|}{N}\Big) r(k), \qquad (4.33)$$

and hence is asymptotically unbiased. The mean square error is

$$\mathbb{E}\big(\hat r_N(k) - r(k)\big)^2 = \mathrm{var}(\hat r_N(k)) + \big[r(k) - \mathbb{E}\hat r_N(k)\big]^2 = \mathrm{var}(\hat r_N(k)) + \frac{k^2}{N^2}\,r^2(k).$$
When computing the variance of $\hat r_N(k)$ expectations of fourth-order moments of the stochastic process $X_t$ occur. To evaluate these it is necessary to consider more than only the first- and second-order properties of the process. We assume that the process is normally distributed. Then it follows that for $k \geq 0$

$$\mathrm{var}(\hat r_N(k)) = \frac{1}{N}\sum_{i=-N+1+k}^{N-1-k}\Big(1 - \frac{k + |i|}{N}\Big)\big[r^2(i) + r(k+i)\,r(k-i)\big]. \qquad (4.34)$$

The proof may be found in Appendix A (p. 93). For large $N$ (that is, $N \gg 1$) we have for the mean square error

$$\mathbb{E}\big(\hat r_N(k) - r(k)\big)^2 \approx \mathrm{var}(\hat r_N(k)) \approx \frac{1}{N}\sum_{i=-\infty}^{\infty}\big[r^2(i) + r(i+k)\,r(i-k)\big].$$

This implies for $k = 0$ and $k \gg 1$

$$\mathbb{E}\big(\hat r_N(k) - r(k)\big)^2 \approx
\begin{cases}
\dfrac{2}{N}\displaystyle\sum_{i=-\infty}^{\infty} r^2(i) & \text{for } k = 0, \\[2mm]
\dfrac{1}{N}\displaystyle\sum_{i=-\infty}^{\infty} r^2(i) & \text{for } k \gg 1.
\end{cases} \qquad (4.35)$$

These quantities may be estimated as soon as an estimate $\hat r_N$ of $r$ is available.

In Appendix A (p. 93) it is furthermore proved that for $k \geq 0$ and $k' \geq 0$

$$\mathrm{cov}\big(\hat r_N(k), \hat r_N(k')\big) \approx \frac{1}{N}\sum_{i=-\infty}^{\infty}\big[r(i)\,r(i + k - k') + r(i+k)\,r(i-k')\big]. \qquad (4.36)$$

For increasing $k - k'$ the covariance $\mathrm{cov}(\hat r_N(k), \hat r_N(k'))$ approaches zero. Inspection of (4.36) shows that the interval over which the covariance decreases to zero is of the same order of magnitude as the interval over which the covariance function $r$ decreases to zero. An important implication of this is that the correlation in the estimation errors of $\hat r_N(k)$ and $\hat r_N(k')$ for close values of $k$ and $k'$ is large. Consequently $\hat r_N$ may have a deceivingly smooth behavior while the estimation error is considerable.

For reasons of exposition we had to make several restrictive assumptions. Practice seems to confirm however that the conclusions qualitatively hold for arbitrary wide sense stationary processes.
4.5.4 Example

We illustrate the results by an example. Figure 2.6(a) (p. 8) shows a realization of length $N = 200$ of a process with the exponential covariance function

$$r(k) = \sigma_X^2\, a^{|k|}, \qquad (4.37)$$

with $a = 0.9$ and $\sigma_X = 1$. Figure 4.3 shows the behavior of the estimate $\hat r_{200}$ of the covariance function $r$ based on the 200 available observations. We see that the estimated behavior deviates considerably from the real behavior.

[Figure 4.3: Solid: estimate $\hat r$ of an exponential covariance function; dashed: actual covariance function $r$, plotted against the time shift $k$]

We check what the accuracy of the estimate is. Because

$$\sum_{i=-\infty}^{\infty} r^2(i) = \sigma_X^4\sum_{i=-\infty}^{\infty} a^{2|i|} = \sigma_X^4\,\frac{1+a^2}{1-a^2} \qquad (4.38)$$

it follows from (4.35) that for large $N$

$$\mathrm{var}(\hat r_N(0)) \approx \sigma_X^4\,\frac{1+a^2}{1-a^2}\,\frac{2}{N}. \qquad (4.39)$$

For large $k$ the variance of $\hat r_N(k)$ is half this amount. Numerically we find that the standard deviation of $\hat r_{200}(0)$ is about 0.309. For large $k$ the standard deviation of $\hat r_{200}(k)$ is about 0.218. We see that the errors that are theoretically anticipated are large. The actual estimates lie within the theoretical error margins.
Example 4.5.1 (Estimation of the covariance function in Matlab). The time series of Fig. 2.6(a) (p. 8) and the estimated covariance function of Fig. 4.3 were produced with the following MATLAB script. It assumes that the MATLAB toolbox IDENT is installed.

% Estimation of the covariance function of
% an AR(1) process. Define the parameters
% and the system structure
a = 0.9;
sigma = sqrt(1-a^2);       % then var(X_t) = 1 for
D = [1 -a];                % X_t = a*X_{t-1} + e_t
th = poly2th(D,[]);
randn('seed',1);
e = sigma*randn(256,1);    % white noise
x = idsim(e,th);           % realization of X_t
rhat = covf(x,64);         % estimate covariance
plot(rhat);                % over 64 time shifts
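
If the System Identification Toolbox is not available, the standard biased estimator (4.31) can also be computed directly. A minimal sketch, assuming a data vector x (for instance the realization generated above) is in the workspace:

% Direct computation of the standard biased covariance estimator (4.31)
% for time shifts k = 0,...,K-1.
x = x - mean(x);                 % center with the sample mean (4.18)
N = length(x); K = 64;
rhat = zeros(K,1);
for k = 0:K-1
    rhat(k+1) = sum(x(1+k:N).*x(1:N-k))/N;
end
plot(0:K-1, rhat);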
4.6 Estimation of the spectral density

4.6.1 Introduction

In this section we look for estimators of the spectral density of a wide-sense stationary process $X_t$, $t \in \mathbb{Z}$. We assume that we know that the process has mean $\mathbb{E}X_t = 0$, and that the covariance function is $\mathrm{cov}(X_t, X_s) = r(t-s)$. We further assume that $r$ is absolutely summable, that is, $\sum_{k=-\infty}^{\infty}|r(k)| < \infty$. Then the spectral density

$$\phi(\omega) = \sum_{k=-\infty}^{\infty} r(k)\,e^{-i\omega k}, \qquad \omega \in [-\pi, \pi), \qquad (4.40)$$

exists and may be shown to be continuous in the frequency $\omega$.

An obvious estimator of the spectral density is the DCFT of an estimator of the covariance function $\hat r_N$ of the process. For $\hat r_N$ we choose the estimator (4.32):

$$\hat r_N(k) = \frac{1}{N}\sum_{t=0}^{N-|k|-1} X_{t+|k|} X_t, \qquad k = 0, \pm 1, \ldots, \pm(N-1). \qquad (4.41)$$

From this we obtain

$$\hat\phi_N(\omega) = \sum_{k=-N+1}^{N-1} \hat r_N(k)\,e^{-i\omega k} \qquad (4.42)$$

as an estimator for the spectral density $\phi$.

This estimator is closely connected to a function that is known as the periodogram of the segment $X_t$, $t \in \{0, 1, \ldots, N-1\}$, of the time series $X_t$. The periodogram $p_N$ is defined as the function²

$$p_N(\omega) = \frac{1}{N}\bigg|\sum_{t=0}^{N-1} X_t\,e^{-i\omega t}\bigg|^2 = \frac{|\hat X_N(\omega)|^2}{N}, \qquad (4.43)$$

where $\hat X_N$ is defined as the DCFT

$$\hat X_N(\omega) = \sum_{t=0}^{N-1} X_t\,e^{-i\omega t}. \qquad (4.44)$$

In Appendix A (p. 94) it is shown that the estimator (4.42) of $\phi$ and the periodogram $p_N$ of (4.43) are actually one and the same object:

Lemma 4.6.1 (Periodogram).

$$\hat\phi_N(\omega) = p_N(\omega). \qquad (4.45)$$

² Often it is the plot of $p_N(\omega)$ as a function of frequency $\omega$ that is called periodogram.
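
On the frequency grid $\omega_n = 2\pi n/N$ the periodogram (4.43) can be computed with a single FFT. A minimal MATLAB sketch with placeholder data:

% Periodogram (4.43) of a data segment x on the grid 2*pi*n/N, n = 0,...,N-1.
x = randn(256,1);                    % placeholder zero mean time series
N = length(x);
phihat = abs(fft(x)).^2/N;           % periodogram = |DCFT|^2 / N on the grid
omega  = 2*pi*(0:N-1)'/N;
plot(omega, phihat);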
4.6.2 Biasedness and consistency

The estimator (4.41) of the covariance function is asymptotically unbiased. It may be proved that the estimator (4.42) of the spectral density is also asymptotically unbiased.

The next question is whether or not the estimator is also consistent. For the case that the process $X_t$, $t \in \mathbb{Z}$, is normally distributed this may be verified and the answer is negative:

Lemma 4.6.2 (Inconsistency of the periodogram). Assume $X_t$ is a wide sense stationary normally distributed process with zero mean. Then for any $\omega_1, \omega_2 \in (-\pi, \pi)$ there holds that

1. $\lim_{N\to\infty}\mathrm{var}\big(\hat\phi_N(\omega_1)\big) = \phi^2(\omega_1)$ if $\omega_1 \neq 0$,
2. $\lim_{N\to\infty}\mathrm{cov}\big(\hat\phi_N(\omega_1), \hat\phi_N(\omega_2)\big) = 0$ if $\omega_1 \neq \omega_2$.

The proof of the lemma is involved. It is summarized in Appendix A. For normally distributed processes consistency is equivalent to convergence to zero of the variance. Hence the above lemma implies that the estimator $\hat\phi_N$ is not consistent. Also it reveals that for large $N$ the errors in the estimates of the spectral density for close but different frequencies are uncorrelated. A consequence of these two results is that the plot of $\hat\phi_N(\omega)$ often has an erroneous and irregularly fluctuating appearance. This is evident from Figure 4.4. The figure shows two estimates $\hat\phi_N$, one for $N = 128$ and one for $N = 1024$, together with the exact $\phi$. The plots confirm that the standard deviation of $\hat\phi_N$ at each frequency is close to $\phi$.

[Figure 4.4: Two estimates $\hat\phi_N$ ($N = 128$ and $N = 1024$) together with the exact $\phi$, for the process of Subsection 2.4.5]

One explanation of Condition 1 of the above lemma is that the estimator (4.42) follows by summing over estimates of the covariance function that each have variance of order $1/N$. Because the summation is over $2N-1$ terms the outcome may have very large variance. Based on this argument we expect that if in (4.42) we sum over $M \ll N$ terms then the variance will be smaller. The variance of a sum of $O(M)$ terms each with variance $O(1/N)$ can at most be $O(M^2/N)$. Therefore if $M$ and $N$ jointly approach infinity but such that $M^2/N \to 0$ then we expect consistent estimation. This is a rather conservative condition. One of the results of the next section is that in fact it suffices to take $M/N \to 0$.
4.6.3 Windowing

The previous paragraph suggests to consider estimators of the form

$$\hat\phi^w_N(\omega) = \sum_{k=-\infty}^{\infty} w(k)\,\hat r_N(k)\,e^{-i\omega k}. \qquad (4.46)$$

Here $w$ is a symmetric function such that $w(k) = 0$ for $|k| \geq M$. The function $w$ is called a window. The simplest choice is the rectangular window

$$w(k) = \begin{cases} 1 & \text{for } |k| \leq M, \\ 0 & \text{for } |k| > M, \end{cases} \qquad (4.47)$$

with $M < N$. As we shall see there are advantages to choosing other shapes for the window.

A useful result is that $\hat\phi^w_N$ and the estimator $\hat\phi_N$ of (4.42) are related by convolution,

$$\hat\phi^w_N(\omega) = \frac{1}{2\pi}\int_{-\pi}^{\pi} W(\alpha)\,\hat\phi_N(\omega - \alpha)\,d\alpha, \qquad \omega \in [-\pi, \pi). \qquad (4.48)$$
$W$ is the DCFT of $w$ and is called the spectral window corresponding to the time window $w$.

Proof 4.6.3 (Relation between $\hat\phi^w_N$ and $\hat\phi_N$). By inverse Fourier transformation it follows that

$$w(k) = \frac{1}{2\pi}\int_{-\pi}^{\pi} W(\alpha)\,e^{i\alpha k}\,d\alpha. \qquad (4.49)$$

Substitution of this into (4.46) yields

$$\begin{aligned}
\hat\phi^w_N(\omega) &= \sum_{k=-\infty}^{\infty}\Big(\frac{1}{2\pi}\int_{-\pi}^{\pi} W(\alpha)\,e^{i\alpha k}\,d\alpha\Big)\hat r_N(k)\,e^{-i\omega k} \\
&= \frac{1}{2\pi}\int_{-\pi}^{\pi} W(\alpha)\Big(\sum_{k=-\infty}^{\infty}\hat r_N(k)\,e^{-i(\omega - \alpha)k}\Big)d\alpha \\
&= \frac{1}{2\pi}\int_{-\pi}^{\pi} W(\alpha)\,\hat\phi_N(\omega - \alpha)\,d\alpha.
\end{aligned}$$

The effect of the window $w$ clearly is that for fixed $\omega$, $\hat\phi^w_N(\omega)$ is a weighted average of $\hat\phi_N$ about the frequency $\omega$. The distribution of the weights is determined by the shape of the spectral window $W$. It is expected that the fluctuations of $\hat\phi_N$ may be decreased by a suitable choice of the window. A less desirable side effect is that the spectral density is distorted by the windowing operation. The amount of distortion depends on the shape of the density and that of the spectral window.

Because of the symmetry of $w$ the spectral window $W$ is a real function. We require that

$$\frac{1}{2\pi}\int_{-\pi}^{\pi} W(\omega)\,d\omega = 1. \qquad (4.50)$$

This is equivalent to $w(0) = 1$. The result of the normalization (4.50) is that if $\hat\phi_N$ is a constant then $\hat\phi^w_N$ is precisely this constant. Similarly if $\hat\phi_N$ varies slowly compared to $W$ then $\hat\phi^w_N \approx \hat\phi_N$.
In the literature many window pairs $w$ and $W$ are known. We discuss some common ones. The rectangular window $w_R$ is given by (4.47). The corresponding spectral window is

$$W_R(\omega) = \frac{\sin\big((M + \tfrac{1}{2})\omega\big)}{\sin(\tfrac{1}{2}\omega)}. \qquad (4.51)$$

Figure 4.5 shows the time and spectral windows. The part of the spectral window between the frequencies $-a$ and $a$, with $a = 2\pi/(2M+1)$, is called the main lobe. On either side the side lobes are found.

A disadvantage of the rectangular window is that $W_R$ is negative for certain frequency intervals. Moreover the side lobes are relatively large. The consequence of the negative values is that the estimated spectral density $\hat\phi^w_N$ may also be negative for certain frequencies. This is a non-interpretable result because the spectral density is essentially non-negative.

Bartlett's window $w_B$ has the triangular form

$$w_B(k) = \begin{cases} 1 - \dfrac{|k|}{M} & \text{for } |k| \leq M, \\ 0 & \text{for } |k| > M. \end{cases} \qquad (4.52)$$

The corresponding spectral window is

$$W_B(\omega) = \frac{1}{M}\bigg(\frac{\sin(\tfrac{M}{2}\omega)}{\sin(\tfrac{1}{2}\omega)}\bigg)^2. \qquad (4.53)$$

The windows are plotted in Fig. 4.6. The spectral window $W_B$ is positive everywhere. Note that the half width $a$ of the main lobe equals twice that of the rectangular window. Hence, there is loss of spectral resolution compared to the rectangular window. This is the price for the reduction of the estimation error. The side lobes of Bartlett's window still are relatively large.

The Hann window, $w_H$, sometimes wrongly referred to as the Hanning window, is very popular. It is given by

$$w_H(k) = \begin{cases} \tfrac{1}{2}\big(1 + \cos(\tfrac{\pi k}{M})\big) & \text{for } |k| \leq M, \\ 0 & \text{for } |k| > M. \end{cases} \qquad (4.54)$$

The corresponding spectral window is

$$W_H(\omega) = \tfrac{1}{2}W_R(\omega) + \tfrac{1}{4}W_R\big(\omega - \tfrac{\pi}{M}\big) + \tfrac{1}{4}W_R\big(\omega + \tfrac{\pi}{M}\big), \qquad (4.55)$$
[Figure 4.5: The rectangular window $w_R$ and the corresponding spectral window $W_R$, with $a = 2\pi/(2M+1)$]

[Figure 4.6: Bartlett's window $w_B$ and the corresponding spectral window $W_B$, with $a = 2\pi/M$]

[Figure 4.7: Hann's window $w_H$ and the corresponding spectral window $W_H$, with $a = 2\pi/M$]
with $W_R$ the spectral window (4.51) that belongs to the rectangular window. The window $w_H$ and the spectral window $W_H$ are plotted in Fig. 4.7. The width of the main lobe is the same as that of Bartlett's window. The side lobes of Hann's window are smaller than those of Bartlett's window, and are only marginally negative.

A variant of Hann's window is that of Hamming. Hamming's window is given by

$$w_h(k) = \begin{cases} 0.54 + 0.46\cos(\tfrac{\pi k}{M}) & \text{for } |k| \leq M, \\ 0 & \text{for } |k| > M, \end{cases} \qquad (4.56)$$

and spectral window

$$W_h(\omega) = 0.54\,W_R(\omega) + 0.23\,W_R\big(\omega - \tfrac{\pi}{M}\big) + 0.23\,W_R\big(\omega + \tfrac{\pi}{M}\big).$$

This window has slightly better properties than Hann's window. The width of the main lobe is the same as that of Hann's window.

We conclude by listing formulas for the asymptotic variance and covariance of the estimation error for windowed estimates of normally distributed white noise:
Lemma 4.6.4 (Covariance of windowed periodogram). Assume $X_t$ is a zero mean normally distributed white noise process. Then for large $N$

$$\mathrm{cov}\big(\hat\phi^w_N(\omega_1), \hat\phi^w_N(\omega_2)\big) \approx \frac{\phi(\omega_1)\phi(\omega_2)}{2\pi N}\int_{-\pi}^{\pi} W(\omega_1 - \alpha)\big[W(\omega_2 - \alpha) + W(\omega_2 + \alpha)\big]\,d\alpha.$$

An immediate special case is the variance,

$$\mathrm{var}\big(\hat\phi^w_N(\omega)\big) \approx \frac{\phi^2(\omega)}{\pi N}\int_{-\pi}^{\pi} W^2(\alpha)\,d\alpha \qquad (4.57)$$

$$= \frac{\phi^2(\omega)}{N/2}\sum_{k=-\infty}^{\infty} w^2(k). \qquad (4.58)$$

In the first equality we used the fact that $W$ is symmetric and periodic in $\omega$ with period $2\pi$, in the second we used a result known as Parseval's equality³:

$$\frac{1}{2\pi}\int_{-\pi}^{\pi} W^2(\alpha)\,d\alpha = \sum_{k=-\infty}^{\infty} w^2(k).$$

To illustrate these results consider Daniell's window, which in the frequency domain is given by

$$W_D(\omega) = \begin{cases} M & \text{for } |\omega| \leq \dfrac{\pi}{M}, \\ 0 & \text{for } |\omega| > \dfrac{\pi}{M}. \end{cases}$$

The larger $M$ is chosen, the narrower the spectral window is. The corresponding time window is

$$w_D(k) = \begin{cases} 1 & \text{for } k = 0, \\ \dfrac{\sin(\pi k/M)}{\pi k/M} & \text{for } k \neq 0. \end{cases}$$

Because the time window $w_D$ is not exactly zero outside a finite interval it can only be implemented approximately. Evaluation of (4.57) yields

$$\mathrm{var}\big(\hat\phi^w_N(\omega)\big) \approx \frac{2M}{N}\,\phi^2(\omega). \qquad (4.59)$$

For the popular window of Hann the equality of (4.58) evaluates to

$$\mathrm{var}\big(\hat\phi^w_N(\omega)\big) \approx \frac{3M}{4N}\,\phi^2(\omega).$$

We see that in both cases the relative error is

$$\frac{\sqrt{\mathrm{var}\big(\hat\phi^w_N(\omega)\big)}}{\phi(\omega)} = O\big(\sqrt{M/N}\big). \qquad (4.60)$$

The rule of thumb is that the relative error of the spectral estimation is of order $\sqrt{M/N}$ even in the non-white case. If $M$ and $N$ simultaneously approach $\infty$ but such that $M/N \to 0$ then consistent estimation results. A typical value for $M$ is $M = 4\sqrt{N}$.

It is generally true that if the width of the spectral window decreases then the resolution increases, but at the same time also the variance of the estimation error increases. For every application the width of the spectral window needs to be suitably chosen to find a useful compromise between the two opposite effects. The spectral window becomes narrower if the time window becomes wider.

Inspection of Lemma 4.6.4 shows that the correlation between the errors in the estimates at two different frequencies decreases to zero if the distance between the frequencies is greater than the width of the spectral window.

³ It follows directly by equating $\hat\phi^w_N(0)$ in (4.46) with (4.48) for $\hat r_N(k) := w(k)$.
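
A windowed estimate of the form (4.46) is easy to compute directly. The following minimal MATLAB sketch uses a Hann time window and the rule of thumb $M = 4\sqrt{N}$; the data are a placeholder and the frequency grid is an arbitrary choice.

% Windowed spectral estimate (4.46) with a Hann time window (a sketch).
x = randn(1024,1);  x = x - mean(x);
N = length(x);  M = round(4*sqrt(N));
r = zeros(2*M-1,1);                        % biased covariance estimates (4.31)
for k = -(M-1):M-1
    r(k+M) = sum(x(1+abs(k):N).*x(1:N-abs(k)))/N;
end
w = 0.5*(1 + cos(pi*(-(M-1):M-1)'/M));     % Hann window (4.54)
omega = linspace(0, pi, 200);
phiw = real(exp(-1i*omega'*(-(M-1):M-1))*(w.*r));   % sum_k w(k) r(k) e^{-i omega k}
plot(omega, phiw);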
4.6.4 More about the DCFT

The Fourier transform, in this case the DCFT, plays a central role in the estimation of spectral densities. Before discussing the practical computation of estimates we have another look at the DCFT.

A special circumstance in the estimation of spectral densities is that the time functions whose DCFT is computed always have finite length. This is indeed always the case for practical numerical computations of Fourier transforms.

Consider the DCFT

$$\hat x(\omega) = \sum_{k=0}^{L-1} x_k\,e^{-i\omega k}, \qquad \omega \in \mathbb{R}, \qquad (4.61)$$

of a time function $x_k$, $k = 0, 1, \ldots, L-1$. For simplicity and without loss of generality we position the support⁴ of the function in the interval $[0, L)$. The DCFT is periodic in $\omega$ with period $2\pi$.

Because the function $x_k$ that is transformed is defined by $L$ numbers we expect that also the DCFT $\hat x(\omega)$ may be determined by $L$ numbers. Indeed, it turns out to be sufficient to calculate the DCFT at $L$ points, with equal distances $2\pi/L$, on an interval of length $2\pi$, for instance $[0, 2\pi)$. As shown in Appendix A (p. 95) we may recover the $x_k$ from the $L$ values $\hat x(\frac{2\pi}{L}n)$ via

$$x_k = \frac{1}{L}\sum_{n=0}^{L-1}\hat x\Big(\frac{2\pi}{L}n\Big)\,e^{i\frac{2\pi}{L}nk}, \qquad k = 0, 1, \ldots, L-1. \qquad (4.62)$$

Outside the interval $[0, 2\pi)$ the values of the DCFT $\hat x(\omega)$ on the grid $\{\frac{2\pi}{L}n,\ n \in \mathbb{Z}\}$ follow by periodic continuation with period $2\pi$. Then between the grid points the DCFT may be retrieved by the interpolation formula, see e.g. Kwakernaak and Sivan (1991),

$$\hat x(\omega) = \sum_{n=-\infty}^{\infty}\hat x\Big(\frac{2\pi}{L}n\Big)\,\mathrm{sinc}\bigg(\frac{\omega - \frac{2\pi}{L}n}{\frac{2\pi}{L}}\bigg), \qquad \omega \in \mathbb{R}. \qquad (4.63)$$

Here sinc is the function

$$\mathrm{sinc}(t) = \begin{cases} 1 & \text{for } t = 0, \\ \dfrac{\sin(\pi t)}{\pi t} & \text{for } t \neq 0. \end{cases} \qquad (4.64)$$

The interpolation formula (4.63) is the converse of the well known sampling theorem from signal processing. The sampling theorem states that a band-limited time signal may be retrieved between the sampling instants by interpolation with sinc functions for a suitably chosen sampling interval. Formula (4.63) states that the Fourier transform of a time-limited signal may be retrieved between suitably spaced grid points by interpolation with sinc functions.

Thus, for the computation of the DCFT $\hat x(\omega)$ it is sufficient to compute

$$\hat x\Big(\frac{2\pi}{L}n\Big) = \sum_{k=0}^{L-1} x_k\,e^{-i\frac{2\pi}{L}nk}, \qquad n = 0, 1, \ldots, L-1. \qquad (4.65)$$

The transformation of the time function $x_k$, $k = 0, 1, \ldots, L-1$, to the frequency function $\hat x(\frac{2\pi}{L}n)$, $n = 0, 1, \ldots, L-1$, according to (4.65) is known as the discrete Fourier transformation. To distinguish it from the DCFT Kwakernaak and Sivan (1991) call the transformation (4.65) the DDFT (discrete-to-discrete Fourier transform). Note the symmetry between the DDFT and its inverse (4.62). The DDFT and its inverse may efficiently be computed with an algorithm that is known as the fast Fourier transform.

⁴ The support is the smallest interval $[a, b]$ so that $x_k = 0$ for all $k < a$ and $k > b$.
4.6.5 The fast Fourier transform
The fast Fourier transform (FFT) is an efcient algorithm
to compute the DDFT. The DDFT of the time signal x
k
,
k = 0, 1, . . . , L 1, is dened by Eqn. (4.65), which we
rewrite in the form
x
n
=
L1

k=0
x
k
e
i
2
L
nk
, n =0, 1, . . . , L 1. (4.66)
The inverse DDFT (4.62) now becomes
x
k
=
1
L
L1

n=0
x
n
e
i
2
L
nk
, k =0, 1, . . . , L 1. (4.67)
Because of the symmetry of (4.66) and (4.67) the FFT al-
gorithm can both be used for the DDFT and the inverse
DDFT.
For any xed n the computation of (4.66) requires O(L)
operations. As there are L values of n to be considered, a
full DDFT (4.66) seems to require O(L
2
) operations. The
FFT however shows that this may be achieved with much
less effort.
There are several variants and renements of the FFT.
We consider the FFT algorithm with base 2 of Cooley-
Tukey. This is the most efcient and best known algo-
rithm. Like all FFTs it uses factorization of L as a product
of integers. For simplicity we assume that L is an integral
power of 2, say L =2
M
.
The crux of the FFT is that for all even $n = 2m$ we may rewrite (4.66) as
$$ \hat x_{2m} = \sum_{k=0}^{L/2-1} (x_k + x_{k+L/2})\, e^{-i\frac{2\pi}{L/2}mk}, \qquad m = 0, 1, \ldots, \tfrac{L}{2}-1, $$
and for all odd indices $n = 2m+1$,
$$ \hat x_{2m+1} = \sum_{k=0}^{L/2-1} e^{-i\frac{2\pi}{L}k}\,(x_k - x_{k+L/2})\, e^{-i\frac{2\pi}{L/2}mk}, \qquad m = 0, \ldots, \tfrac{L}{2}-1. $$
Thus the $L$-point DDFT (4.66) may be reduced to the computation of two $L/2$-point DDFTs, one of
$$ y_{\text{even},k} := x_k + x_{k+L/2}, \qquad k = 0, 1, \ldots, L/2-1, $$
and one of
$$ y_{\text{odd},k} := e^{-i\frac{2\pi}{L}k}\,(x_k - x_{k+L/2}), \qquad k = 0, 1, \ldots, \tfrac{L}{2}-1. $$
Forming the $y_{\text{even},k}$ from $x_k$ takes $L/2$ (complex) additions. Forming $y_{\text{odd},k}$ takes, besides additions, also multiplications. Nowadays multiplication requires about as much computing time as addition; in any event, forming $y_{\text{even},k}$ and $y_{\text{odd},k}$ together takes $dL$ units of operation for some constant $d > 0$. Now let $C(L)$ denote the number of operations needed to compute the $L$-point DDFT. The above shows that
$$ C(L) = dL + 2C(L/2). $$
As $C(1) = 0$ we get that
$$ C(L) = dL\log_2(L). $$
The DDFT so computed is known as the fast Fourier transform (FFT). Compare the number of operations $dL\log_2(L)$ to the number of operations $O(L^2)$ that we would need for direct computation of the DDFT. For large $L$ the computing time that the FFT requires is a small fraction of that for the direct DDFT. In practice $L$ is very large and the savings that the FFT achieves are spectacular.
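The recursion above translates directly into a short recursive program. The following MATLAB sketch (a minimal illustration of the splitting, not the optimized built-in fft; the function name myfft is arbitrary) computes the DDFT of a row vector x whose length is an integral power of 2:

function X = myfft(x)
% Minimal radix-2 FFT of a row vector x of length L = 2^M,
% based on the even/odd splitting derived above.
L = length(x);
if L == 1
    X = x;                                   % a 1-point DDFT is the sample itself
else
    w  = exp(-2i*pi*(0:L/2-1)/L);            % factors e^{-i 2*pi*k/L}
    Xe = myfft(x(1:L/2) + x(L/2+1:L));       % L/2-point DDFT of y_even
    Xo = myfft(w.*(x(1:L/2) - x(L/2+1:L)));  % L/2-point DDFT of y_odd
    X  = zeros(1,L);
    X(1:2:L) = Xe;                           % the values xhat_{2m}
    X(2:2:L) = Xo;                           % the values xhat_{2m+1}
end

For such a row vector the result agrees, up to rounding errors, with the built-in command fft(x).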
For application of the FFT it is necessary that the length $L$ of the sequence that needs to be transformed be an integral power of 2. If this is not the case then the sequence may be supplemented, for instance with a number of zeros, so that the length satisfies the requirement. This of course reduces the efficiency. In addition this method does not precisely yield the desired result. The DDFT of a sequence of length $L$ is defined on a natural grid of $L$ points. Increasing $L$ leads to a different grid. There are variants of the FFT that permit $L$ to have other values than integral powers of 2. These are less efficient than the Cooley-Tukey algorithm. Burrus and Parks (1985) describe such algorithms.

The FFT algorithm was published in 1965 by the American mathematician J. W. Cooley and the American statistician J. W. Tukey. It was preceded by the work of the German applied mathematician C. Runge in the period 1903–1905, which remained unnoticed for a long time. Only the advent of the electronic computer made the method interesting. The algorithm plays an important role in signal processing and time series analysis; for advanced applications computer chips are used whose sole task is to compute FFTs.
4.6.6 Practical computation

Up to this point we have assumed that to find the spectral density of a time series, the time series first is centered and then its covariance function is estimated according to
$$ \hat r_N(k) = \frac{1}{N}\sum_{t=0}^{N-|k|-1} X_{t+k}X_t, \qquad k = -N+1, \ldots, N-1. $$ (4.68)
Then the estimate $\hat\phi_N$ of the spectral density is found by Fourier transformation of $\hat r_N$.

Computation of (4.68) for all $k$ takes $O(N^2)$ operations. For small values of $N$ direct computation of (4.68) is feasible. For large values of $N$ it is worthwhile to consider computing $\hat r_N$ by inverse Fourier transformation because of the efficiency of the FFT algorithm. The idea is as follows.

Suppose we compute the estimated spectral density function on $L$ equidistantly spaced frequency grid points on $[0, 2\pi)$,
$$ \hat\phi_N\Big(\frac{2\pi}{L}n\Big) = \frac{1}{N}\left|\sum_{t=0}^{N-1} X_t\, e^{-i\frac{2\pi}{L}nt}\right|^2, \qquad n = 0, 1, \ldots, L-1. $$ (4.69)
Here we used (4.45) and (4.42). The computation of the sum in (4.69) is essentially a DDFT, so it is quickly found. It is not precisely in the DDFT form (4.66), but we may make it so if $L \ge N$, by defining $X_t = 0$ for $t = N, N+1, \ldots, L-1$. Then the sum is a true DDFT,
$$ \hat\phi_N\Big(\frac{2\pi}{L}n\Big) = \frac{1}{N}\left|\sum_{t=0}^{L-1} X_t\, e^{-i\frac{2\pi}{L}nt}\right|^2, \qquad n = 0, 1, \ldots, L-1. $$ (4.70)
By definition the $\hat\phi_N(\frac{2\pi}{L}n)$ satisfy
$$ \hat\phi_N\Big(\frac{2\pi}{L}n\Big) = \sum_{k=-N+1}^{N-1} \hat r_N(k)\, e^{-i\frac{2\pi}{L}nk}, \qquad n = 0, 1, \ldots, L-1. $$ (4.71)
This we recognize as a DDFT, but it is slightly different in that here the sum is not from 0 to $L-1$ as the definition (4.66) of the DDFT assumes. The sum (4.71) depends on $2N-1$ values of $\hat r_N(k)$, so as before we expect that we may recover these $2N-1$ values if we have at least $2N-1$ values $\hat\phi_N(\frac{2\pi}{L}n)$ available. We have $L$ such values available, so we want to choose $L$ such that
$$ L \ge 2N-1. $$
Then a variation of the inverse DDFT (4.67) applies which states that (4.71) is invertible with
$$ \hat r_N(k) = \frac{1}{L}\sum_{n=0}^{L-1} \hat\phi_N\Big(\frac{2\pi}{L}n\Big)\, e^{\,i\frac{2\pi}{L}nk}, \qquad k = -N+1, \ldots, N-1. $$ (4.72)
Also this inverse DDFT (4.72) may be efficiently computed using the FFT. The FFT algorithm generates the values of (4.72) for indices $k = 0, 1, \ldots, L-1$. Only for $k \in [0, N-1]$ do they equal $\hat r_N(k)$.
In summary, to estimate the covariance function we apply the following procedure:

1. Choose an $L \ge 2N-1$. Supplement the observed time series $X_t$, $t = 0, 1, \ldots, N-1$, with $L-N$ zeros: $X_t = 0$ for $t = N, N+1, \ldots, L-1$.

2. Use the FFT to compute the spectral density function by means of the periodogram defined on $L$ frequency grid points
$$ \hat\phi_N\Big(\frac{2\pi}{L}n\Big) = \frac{1}{N}\left|\sum_{t=0}^{L-1} X_t\, e^{-i\frac{2\pi}{L}nt}\right|^2, \qquad n = 0, 1, \ldots, L-1. $$

3. Use the inverse FFT to compute
$$ \hat r^{\text{tmp}}_N(k) = \frac{1}{L}\sum_{n=0}^{L-1} \hat\phi_N\Big(\frac{2\pi}{L}n\Big)\, e^{\,i\frac{2\pi}{L}nk}, \qquad k = 0, 1, \ldots, L-1. $$
Extract from this result the covariance function,
$$ \hat r_N(k) = \hat r^{\text{tmp}}_N(|k|), \qquad k = -N+1, \ldots, N-1. $$
The procedure is illustrated in § 4.6.7 (p. 53) with an example. In all, the above procedure finds the covariance function in $O(L\log_2(L))$ operations, which is a considerable improvement compared to the $O(L^2)$ operations that direct computation of the covariance function requires.

Once the estimate of the covariance function is known the next step is to apply a suitable time window. Then the estimate of the windowed spectral density follows by application of the FFT to the windowed $\hat r_N(k)$. As the time window has width $2M+1$ we find with the FFT an estimated spectral density function on a frequency grid with at least $2M+1$ grid points. Correspondingly the frequency grid spacing is $2\pi/(2M+1)$ or less. This is smaller than the width $a = 2\pi/M$ of the main lobe of the common windows. Increasing the number of grid points much beyond its natural $2M+1$ hence is not very useful.

An alternative procedure to estimate the spectral density is to compute the periodogram first, but to omit the remaining steps. The estimate of the spectral density then follows by direct application of a suitable spectral window to the raw periodogram, see Problem 4.7 (p. 56).

Marple (1987) extensively discusses many methods to estimate spectral densities.
4.6.7 Example

By way of illustration we discuss the estimation of the spectral density of the realization of the AR(2) process that is plotted in Fig. 2.6(b) (p. 8). The realization consists of 200 samples. Fig. 4.8(a) shows the periodogram, obtained by application of the FFT to the realization after continuation with 201 zeros, which is sufficient because then $L = 401 > 2N-1$. In Fig. 4.9(a) the relevant portion of the periodogram $\hat\phi_{200}(\omega)$ is plotted on the gridded frequency axis $[0, \pi)$. The part for negative frequencies follows by periodic continuation. The estimated covariance function $\hat r^{\text{tmp}}_{200}$ was obtained by inverse Fourier transformation of the periodogram, again with the FFT. Figure 4.8(b) shows the result. Figure 4.9(b) shows the estimated covariance function $\hat r_{200}(k)$ for $k = 0, 1, \ldots, 200$. Again the part for negative shifts follows by periodic continuation.

Inspection of Fig. 4.9(a) confirms the expected irregular behavior of the periodogram. In Fig. 4.9(b) we see that the estimated covariance function initially has the expected damped harmonic appearance but that for time shifts between about 60 and 110 large errors occur.

By windowing the estimated covariance function and Fourier transformation the spectral density may be estimated. Figure 4.10 shows the results of windowing with Hamming windows of widths 100, 50, and 25. For width 100 the behavior of the estimated spectral density still is rather irregular. For width 25 there is a clear loss of resolution. The best result is found for the width 50.

Example 4.6.5 (Matlab computation). The following script was used to obtain the results of figures 4.8–4.10. First the script of Example 2.4.3 is executed. Then:
N = length(x);
L = 4*2^nextpow2(N);   % big enough
xh = fft(x,L);         % add L-N zeros to x, then fft
ph = abs(xh).^2/N;
rt = ifft(ph);
rt = real(rt);         % needed due to rounding errors
rN = rt(1:N);

Direct computation of the covariance function by the command

rN1 = covf(x,200);

from the System Identification Toolbox yields exactly the same result.
Windowed estimates of the spectral density as in Fig. 4.10 may be obtained with commands of the form

phid = spa(x,50);
[omega,phi50] = getff(phid);
plot(omega,phi50);

4.7 Continuous time processes

4.7.1 Introduction

In this section we discuss some aspects of the non-parametric statistical analysis of continuous time series. Especially in physics and engineering the phenomena that are studied often depend continuously on time. From a practical point of view signal processing is only feasible by sampling and digital processing. The question is to what extent the properties of the underlying continuous time process may be retrieved this way.

We assume that the underlying phenomenon may be described as a wide-sense stationary continuous time stochastic process
$$ X_t, \qquad t \in \mathbb{R}. $$ (4.73)
By sampling with sampling interval $T$ a discrete time process
$$ X^*_n = X_{nT}, \qquad n \in \mathbb{Z}, $$ (4.74)
is created. It is desired to estimate the statistical properties of the continuous time process $X_t$, in particular its mean, covariance function and spectral density, from $N$ observations
$$ X^*_n, \qquad n = 0, 1, \ldots, N-1, $$ (4.75)
of the sampled process.

If the underlying continuous time process has mean $\mathbb{E}X_t = m$ and covariance function $\operatorname{cov}(X_t, X_s) = r(t-s)$ then the sampled process has mean
$$ \mathbb{E}X^*_n = \mathbb{E}X_{nT} = m $$ (4.76)
Figure 4.8: (a) Periodogram $\hat\phi(\frac{2\pi}{L}n)$ for $n = 0, 1, 2, \ldots, 400$. (b) Estimated covariance function $\hat r^{\text{tmp}}_{200}(k)$ for $k = 1, 2, \ldots, 400$

Figure 4.9: (a) Periodogram $\hat\phi_{200}(\frac{2\pi}{L}n)$ on $[0, \pi]$, plotted against $\omega = \frac{2\pi}{L}n$. (b) Estimated covariance function $\hat r_{200}(k)$ on $[0, 1, \ldots, 199]$

Figure 4.10: Dashed: exact spectral density. Solid: estimated spectral density with a Hamming window. (a) Width of time window 100. (b) Width 50. (c) Width 25

and covariance function
$$ \operatorname{cov}(X^*_k, X^*_n) = \operatorname{cov}(X_{kT}, X_{nT}) = r((k-n)T). $$ (4.77)
Estimation of the mean $m$ of the time series $X^*_n$ hence yields an estimate of the mean $m$ of the continuous time process $X_t$. Likewise, estimation of the covariance function of the time series $X^*_n$ yields an estimate of
$$ r(kT), \qquad k = 0, \pm 1, \ldots, \pm(N-1). $$ (4.78)
Hence, we may only estimate the covariance function $r$ on the discrete time axis.
4.7.2 Frequency content of the sampled signal

By processing the sampled signal we estimate the sampled covariance function of the continuous time process. The question is what information the Fourier transform of the sampled covariance function contains about the spectral density of the underlying continuous time process.

We pursue this question for a general continuous time function
$$ x(t), \qquad t \in \mathbb{R}, $$ (4.79)
with CCFT
$$ \hat x(\omega) = \int_{-\infty}^{\infty} x(t)\, e^{-i\omega t}\, dt, \qquad \omega \in \mathbb{R}. $$ (4.80)
Sampling yields the sequence
$$ x^*_n = x(nT), \qquad n \in \mathbb{Z}. $$ (4.81)
We define the DCFT of this sequence as
$$ \hat x^*(\omega) = T\sum_{n=-\infty}^{\infty} x^*_n\, e^{-i\omega nT}, \qquad \omega \in \mathbb{R}. $$ (4.82)
This generalization of the DCFT simplifies to the earlier definition if $T = 1$. The DCFT as given by (4.82) is periodic in $\omega$ with period $\frac{2\pi}{T}$. The inverse of this Fourier transformation is
$$ x^*_n = \frac{1}{2\pi}\int_{-\pi/T}^{\pi/T} \hat x^*(\omega)\, e^{\,i\omega nT}\, d\omega, \qquad n \in \mathbb{Z}. $$ (4.83)
The relation between the DCFT $\hat x^*$ of the sampled signal $x^*$ and the CCFT $\hat x$ of the underlying continuous time signal $x$ is⁵
$$ \hat x^*(\omega) = \sum_{k=-\infty}^{\infty} \hat x\Big(\omega - k\frac{2\pi}{T}\Big), \qquad \omega \in \mathbb{R}. $$ (4.84)
Figure 4.11 illustrates the effect of sampling. The frequency content $\hat x$ of the continuous time process $x$ outside the interval $[-\frac{\pi}{T}, \frac{\pi}{T}]$ is folded back into this interval. Because of this the frequency content $\hat x^*$ of the sampled signal differs from that of the continuous time signal. This phenomenon is called aliasing, because frequencies outside the interval are mistaken for frequencies inside the interval.

Figure 4.11: Sampling a continuous time signal causes aliasing

⁵Required are some technical assumptions discussed in Appendix A (p. 95).
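The folding described by (4.84) is easy to demonstrate numerically. In the following MATLAB sketch (a minimal illustration; the frequencies and the sampling interval are chosen arbitrarily here) a sinusoid with angular frequency that exceeds $\pi/T$ by $2\pi/T$ is sampled and turns out to be indistinguishable from a sinusoid at the lower frequency inside $[-\pi/T, \pi/T]$:

T  = 1;                      % sampling interval
n  = 0:20;                   % sampling instants nT
w1 = 0.4*pi;                 % frequency inside [-pi/T, pi/T]
w2 = w1 + 2*pi/T;            % frequency outside the interval
x1 = cos(w1*n*T);            % samples of the low-frequency sinusoid
x2 = cos(w2*n*T);            % samples of the high-frequency sinusoid
max(abs(x1 - x2))            % zero (up to rounding): the samples coincide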
If the CCFT $\hat x$ of the signal $x$ is zero outside the interval $[-\frac{\pi}{T}, \frac{\pi}{T}]$ then no aliasing takes place. The signal $x$ then is said to be band limited. The angular frequency $\frac{\pi}{T}$ is known as the Nyquist frequency. If the bandwidth of the signal $x$ is less than the Nyquist frequency then the continuous time signal $x$ may be fully recovered from the sampled signal by the interpolation formula
$$ x(t) = \sum_{n=-\infty}^{\infty} x(nT)\, \operatorname{sinc}\Big(\frac{t - nT}{T}\Big), \qquad t \in \mathbb{R}. $$ (4.85)
This result is known as Shannon's sampling theorem.
4.7.3 Consequences for statistical signal processing

From § 4.7.1 (p. 53) we know that by statistical analysis of the sampled process an estimate may be obtained of the sampled covariance function of the continuous time process. Application of the DCFT to the sampled covariance function according to (4.82) only yields a correct result for the spectral density $\phi$ of the continuous time process if $\phi$ has a bandwidth that is less than the Nyquist frequency $\frac{\pi}{T}$. If the process $X_t$ is not limited to this bandwidth then the spectral density is distorted. The more the bandwidth exceeds the Nyquist frequency the more severe is the distortion. In practice aliasing is usually avoided by passing the continuous time process through a filter that cuts off or strongly attenuates the signal at all frequencies above the Nyquist frequency before sampling. This pre-sampling filter or anti-aliasing filter of course affects the high-frequency content of the continuous time signal.
4.8 Problems
4.1 Estimation of the mean of the MA(1) process. Suppose that $X_t$ is a realization of the MA(1) process $X_t = \epsilon_t + b\epsilon_{t-1}$. Write the covariance function of the process as $r(k) = \sigma^2_X\rho(k)$, with $\rho$ the correlation function.

a) Determine $\sigma^2_X$ and $\rho(k)$.

b) Verify that the variance of the estimator (4.18) equals
$$ \operatorname{var}(\hat m_N) = \sigma^2\Big(\frac{1 + 2b + b^2}{N} - \frac{2b}{N^2}\Big). $$
How does the variance behave asymptotically? Explain the dependence of the variance on $b$.

c) The Cramér-Rao inequality suggests that variances generally are of order $1/N$ at best, but for $b = -1$ the above equation reads $\operatorname{var}(\hat m_N) = -\frac{2b}{N^2}\sigma^2$. Is something wrong here?
4.2 Harmonic process. Consider the process defined by
$$ X_t = A\cos\Big(\frac{2\pi t}{T} + B\Big), \qquad t \in \mathbb{Z}. $$ (4.86)
$T$ is a natural number. $A$ and $B$ are independent stochastic variables. $A$ has expectation zero and $B$ is uniformly distributed on $[0, 2\pi]$.

a) Prove that the process is stationary.

b) Is the process ergodic?

4.3 Not all estimated $r(k)$ are covariance functions. A property of covariance functions $r(k)$ is that the covariance matrices are nonnegative definite, see Eqn. (2.10). Estimated covariance functions $\hat r(k)$ however need not have this property. This is a source of problems and may for example lead to estimates $\hat\phi(\omega)$ of the spectral density that fail to satisfy $\hat\phi(\omega) \ge 0$.

a) Show that the revised estimates (4.31) do satisfy (2.10).

b) Show by counterexample that the unbiased estimates (4.28) need not satisfy (2.10).
4.4 Estimation of the covariance function and correlation function of white noise. Suppose that $X_t$ is normally distributed white noise with zero mean and standard deviation $\sigma$. We consider the estimator (4.32) for the covariance function $r(k)$ of the white noise.

a) What are the variances of the estimator for $k = 0$ and for $k \neq 0$?
Hint: If $X$ is normally distributed with zero mean and standard deviation $\sigma$, then $\mathbb{E}(X^4) = 3\sigma^4$.

b) Define
$$ \hat\rho_N(k) = \frac{\hat r_N(k)}{\hat r_N(0)} $$ (4.87)
as an estimator for the correlation function $\rho(k) = r(k)/r(0)$ of the process. Argue that for large $N$ and $k \neq 0$ the variance of $\hat\rho_N(k)$ approximately equals $1/N$.

4.5 Test for white noise.⁶ The correlation function of a centered time series consisting of $N$ data points is estimated as
$$ \hat\rho_N(k) = \frac{\hat r_N(k)}{\hat r_N(0)}. $$ (4.88)
Here $\hat r_N$ is an estimator for the covariance function. It turns out that:

- $\hat\rho_N(1)$ equals 0.5.
- For $|k| \ge 2$ the values of $|\hat\rho_N(k)|$ are less than 0.25.

How large should $N$ be so that it may be concluded with good confidence that the time series is not a realization of white noise?
Figure 4.12: Experimental correlation function $\hat r(\tau)$
4.6 Interpretation of experimental results.⁷ A test vehicle with a wheel base of 10 m is pulled along a road with a poor surface with constant speed. On the vehicle a sensor is mounted that records the vertical acceleration of the body of the vehicle. Figure 4.12 shows a plot of the correlation function of the measured signal. The sampling interval is 0.025 s. What is the speed of the vehicle?

4.7 Hann's window in the frequency domain. Let $\hat\phi_N(2\pi n/M)$, $n \in \mathbb{Z}$, be the estimate of the spectral density that is obtained by application of a rectangular window of width $M$. Show that application of Hann's window yields the estimate
$$ \hat\phi^{\text{Hann}}_N\Big(\frac{2\pi n}{M}\Big) = \frac{1}{4}\hat\phi_N\Big(\frac{2\pi(n-1)}{M}\Big) + \frac{1}{2}\hat\phi_N\Big(\frac{2\pi n}{M}\Big) + \frac{1}{4}\hat\phi_N\Big(\frac{2\pi(n+1)}{M}\Big). $$
What is the corresponding formula for Hamming's window?
4.8 Running average process.⁸ Consider the running average process $X$ defined by
$$ X_t = \frac{1}{k+1}\sum_{j=0}^{k}\epsilon_{t-j}, \qquad t \in \mathbb{Z}, $$ (4.89)
⁶Examination May 30, 1995.
⁷Examination May 30, 1995.
⁸Examination May 31, 1994.
with $k \ge 0$ a given integer. The process $\epsilon_t$, $t \in \mathbb{Z}$, is white noise with mean $\mathbb{E}\epsilon_t = 0$ and variance $\operatorname{var}\epsilon_t = \sigma^2$.

a) Compute the covariance function $r$ of the process. Sketch the plot of $r$.

b) Compute the spectral density function $\phi$ of the process.

c) Is the MA process invertible?

d) Suppose that we wish to estimate the covariance function $r$ and the spectral density function $\phi$ of the process based on a single realization of the process on $\{0, 1, \ldots, N-1\}$, with $N \gg k$.

   i. Discuss how large $N$ should be chosen to achieve a reasonable estimation accuracy. Specifically, suppose that $k = 4$ and that $r(0)$ should be estimated with an accuracy of about 1%.

   ii. To estimate the spectral density function a time or frequency window with a corresponding window width should be selected. What may be said about the best choice of the width of the window?
4.9 Spectral analysis of a seasonal model.⁹ Consider the seasonal model
$$ y_t - c\,q^{-P}y_t = \epsilon_t, \qquad t = 1, 2, \ldots, $$ (4.90)
with the natural number $P \ge 1$ the period, $c$ a real constant, and $\epsilon_t$ white noise with variance $\sigma^2$. Assume that the model is stable.

a) Derive the spectral density function $\phi(\omega)$. Sketch the plot of $\phi$.

b) For $c = 0.8$ and a certain $P$ the spectral density function $\phi(\omega)$ on $[0, \pi)$ is as shown in a plot (not reproduced here; $\phi$ exhibits a number of equally spaced peaks of height about 25). What is the value of $P$?

c) Assume that the spectral density function is estimated by Fourier transformation of an estimate of the covariance function of the process. How large should the width $M$ of the time window be in relation to the period $P$ to prevent that the estimated spectral density function is strongly distorted, in the sense that the peaks visible in the above plot remain visible in the windowed estimate?

d) Given the $M$ suggested in 4.9c, how large should the number $N$ of observed data points of the time series be so that a reasonable statistical accuracy is obtained? Here the answer may be more qualitative than for 4.9c.

⁹Examination May 25, 1993.
Matlab problems

10. Consider the AR process
$$ (1 - aq^{-2})X_t = \epsilon_t, \qquad a \in \mathbb{R}, $$
where $\epsilon_t$ is white noise with mean 0 and standard deviation $\sigma$. Let $a = \frac{1}{2}$.

a) Generate samples $x_t$, $t = 0, 1, \ldots, N-1$ for $N = 400$ and plot the result.

b) Plot $\hat\phi_N(\omega)$ and $\hat r_N(k)$ using the FFT for suitably chosen $L$.

c) Compare $\hat\phi_N(\omega)$ and $\hat r_N(k)$ with the exact $\phi(\omega)$ and $r(k)$.

d) Compute $\hat\phi^w_N$ for the Hamming window $w(k)$ with support $[-M, M]$. Do this for various $M$ and compare the results (given the $N$ of Part 10a) to the exact $\phi$. Which $M$ do you prefer? Discuss the results.

e) Change $a$ to $a = 3/4$. Redo 10d. How does this affect the choice of $M$? Is it advisable to increase $N$?
5 Estimation of ARMA models

5.1 Introduction

In Chapter 4 (p. 41) a number of non-parametric estimators are discussed. These estimators may be applied to arbitrary wide-sense stationary time series without further assumptions on their structure. In some applications it is advantageous or even necessary to assume that the time series has a specific structure, which may be characterized by a finite number of parameters.

Because of its flexibility the ARMA scheme is often used as a model for time series. Estimating the properties of the time series then amounts to estimating the coefficients of the scheme. The use of ARMA schemes and the theory of estimating such schemes has been strongly stimulated by the work of Box and Jenkins (1970).

In § 5.2 (p. 59) we discuss the estimation of AR models. Section 5.3 (p. 63) treats the estimation of MA schemes. The combined problem of estimating ARMA models is the subject of § 5.4 (p. 64). In § 5.5 (p. 68) a brief exposition about numerical optimization methods for the solution of maximum likelihood problems and non-linear least squares problems follows. In § 5.6 (p. 69) the problem of the choice of the model order is examined.

The various methods are numerically illustrated by MATLAB examples.
5.2 Parameter estimation of AR processes

5.2.1 Introduction

We assume that the observed time series is a realization of an AR($n$) process described by
$$ X_t = \mu + a_1 X_{t-1} + a_2 X_{t-2} + \cdots + a_n X_{t-n} + \epsilon_t, $$ (5.1)
$t = n, n+1, \ldots$, where $\epsilon_t$ is white noise with mean 0 and variance $\sigma^2$.

We consider the problem how to estimate the parameters $\mu$, $a_1, a_2, \ldots, a_n$, and $\sigma^2$, given the $N$ observations $X_0, X_1, \ldots, X_{N-1}$. We successively discuss methods based on least squares in § 5.2.2 (p. 59) and on likelihood maximization in § 5.2.4 (p. 60).
5.2.2 Least squares estimator

In view of the available observations we attempt to find a best least squares fit of (5.1) for $t = n, n+1, \ldots, N-1$. Because the model represents $X_t$ as a linear regression on the $n$ preceding values $X_{t-1}, X_{t-2}, \ldots, X_{t-n}$ and the constant $\mu$, we look for those values of the parameters $\mu$, $a_1, a_2, \ldots, a_n$ for which the sum of squares
$$ \sum_{t=n}^{N-1}\big(X_t - \mu - a_1 X_{t-1} - a_2 X_{t-2} - \cdots - a_n X_{t-n}\big)^2 $$ (5.2)
is minimal. That is, we seek a model that explains the data with minimal contribution of the noise $\epsilon_t$. By successive partial differentiation with respect to $\mu$ and $a_1, a_2, \ldots, a_n$ and setting the derivatives equal to zero we may find the estimates $\hat\mu$ and $\hat a_j$. This can indeed be done, but we take a different route, a route that turns out to be useful for on-line estimation when successively more data $X_N$, $X_{N+1}$ etcetera become available.
We write the AR scheme out in full:
$$ \underbrace{\begin{bmatrix} X_n \\ X_{n+1} \\ \vdots \\ X_{N-1} \end{bmatrix}}_{X}
= \underbrace{\begin{bmatrix}
1 & X_{n-1} & X_{n-2} & \cdots & X_0 \\
1 & X_n & X_{n-1} & \cdots & X_1 \\
\vdots & \vdots & \vdots & & \vdots \\
1 & X_{N-2} & X_{N-3} & \cdots & X_{N-1-n}
\end{bmatrix}}_{W}
\underbrace{\begin{bmatrix} \mu \\ a_1 \\ \vdots \\ a_n \end{bmatrix}}_{\theta}
+ \underbrace{\begin{bmatrix} \epsilon_n \\ \epsilon_{n+1} \\ \vdots \\ \epsilon_{N-1} \end{bmatrix}}_{\epsilon}. $$ (5.3)
In this matrix notation the sum of squares (5.2) equals $\epsilon^{\mathrm T}\epsilon$. Minimizing this with respect to the vector $\theta$ of coefficients is a standard projection result treated earlier. The solution is
$$ \hat\theta = (W^{\mathrm T}W)^{-1}W^{\mathrm T}X. $$ (5.4)
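In MATLAB the estimate (5.4) is obtained with the backslash operator, which solves the least squares problem without explicitly forming $(W^{\mathrm T}W)^{-1}$. A minimal sketch, assuming the observations are available in a column vector x of length N and that the order n has been chosen:

n = 2;                           % assumed AR order
N = length(x);
X = x(n+1:N);                    % left-hand side of (5.3)
W = ones(N-n, n+1);              % first column corresponds to mu
for j = 1:n
    W(:,j+1) = x(n+1-j:N-j);     % column of lagged observations X_{t-j}
end
theta = W\X;                     % least squares estimate (5.4): [mu; a1; ...; an]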
5.2.3 Recursive least squares

In adaptive control and real-time system identification problems and many other problems it is often the case that as time progresses more and more data $X_t$ become available. Each time new data comes in, we might want to re-compute the least squares estimate via the rule $\hat\theta = (W^{\mathrm T}W)^{-1}W^{\mathrm T}X$. This, however, may become very time consuming since as the amount of data grows the matrix dimensions of $W$ and $X$ grow as well. Like in Kalman filtering it turns out to be possible to obtain the new estimate $\hat\theta$ simply from the previous estimate with a modification that is proportional to an appropriate prediction error.

Suppose we have available $N$ observations $X_0, X_1, \ldots, X_{N-1}$. We found for the estimate of $\theta$ the expression
$$ \hat\theta_{N-1} = (W_{N-1}^{\mathrm T}W_{N-1})^{-1}W_{N-1}^{\mathrm T}X. $$ (5.5)
Here the subscript $N-1$ has been added to $\hat\theta$ and $W$ to express that $X_{N-1}$ is the last observation used for the estimate. Suppose that a next observation $X_N$ becomes available. We may now form the next $W$-matrix,
$$ W_N = \begin{bmatrix} W_{N-1} \\ Z_{N-1} \end{bmatrix}, $$
where
$$ Z_{N-1} = \begin{bmatrix} 1 & X_{N-1} & X_{N-2} & \cdots & X_{N-n} \end{bmatrix}. $$
It is composed of the previous $W_{N-1}$ with a single row vector $Z_{N-1}$ stacked at the bottom. Consequently
$$ W_N^{\mathrm T}W_N = \begin{bmatrix} W_{N-1}^{\mathrm T} & Z_{N-1}^{\mathrm T} \end{bmatrix}\begin{bmatrix} W_{N-1} \\ Z_{N-1} \end{bmatrix} = W_{N-1}^{\mathrm T}W_{N-1} + Z_{N-1}^{\mathrm T}Z_{N-1}. $$
The matrix $W^{\mathrm T}W$ has been updated with a rank-one matrix $Z_{N-1}^{\mathrm T}Z_{N-1}$. It makes sense to figure out how much $\hat\theta$ changes now that the next data point $X_N$ is available. Therefore consider $\hat\theta_N - \hat\theta_{N-1}$,
$$ \begin{aligned}
\hat\theta_N - \hat\theta_{N-1}
&= (W_N^{\mathrm T}W_N)^{-1}W_N^{\mathrm T}\begin{bmatrix} X \\ X_N \end{bmatrix} - (W_{N-1}^{\mathrm T}W_{N-1})^{-1}W_{N-1}^{\mathrm T}X \\
&= (W_N^{\mathrm T}W_N)^{-1}\Big(W_N^{\mathrm T}\begin{bmatrix} X \\ X_N \end{bmatrix} - (W_{N-1}^{\mathrm T}W_{N-1} + Z_{N-1}^{\mathrm T}Z_{N-1})(W_{N-1}^{\mathrm T}W_{N-1})^{-1}W_{N-1}^{\mathrm T}X\Big) \\
&= (W_N^{\mathrm T}W_N)^{-1}\Big(\begin{bmatrix} W_{N-1}^{\mathrm T} & Z_{N-1}^{\mathrm T} \end{bmatrix}\begin{bmatrix} X \\ X_N \end{bmatrix} - W_{N-1}^{\mathrm T}X - Z_{N-1}^{\mathrm T}Z_{N-1}\underbrace{(W_{N-1}^{\mathrm T}W_{N-1})^{-1}W_{N-1}^{\mathrm T}X}_{\hat\theta_{N-1}}\Big) \\
&= \underbrace{(W_N^{\mathrm T}W_N)^{-1}Z_{N-1}^{\mathrm T}}_{K_N}\;\underbrace{(X_N - Z_{N-1}\hat\theta_{N-1})}_{e_N}.
\end{aligned} $$ (5.6)
This is in an interesting form. The term labeled $e_N$ equals
$$ X_N - Z_{N-1}\hat\theta_{N-1} = X_N - \big(\hat\mu + \hat a_1 X_{N-1} + \cdots + \hat a_n X_{N-n}\big), $$ (5.7)
which is the prediction error $X_N - \hat X_{N|N-1}$ of $X_N$ given the current AR model. The result (5.6) thus states that if the new measurement $X_N$ agrees with its prediction then the estimate of $\theta$ remains the same, $\hat\theta_N = \hat\theta_{N-1}$. If the prediction error is nonzero then the estimate is updated proportionally to the prediction error, with scaling factor $K_N$. The $K_N$ is known as the vector gain and is closely related to the Kalman gain.
For efficient implementation of the least squares algorithm we rewrite Eqn. (5.6). Define
$$ P_N = (W_N^{\mathrm T}W_N)^{-1}. $$ (5.8)
As it stands, this inverse would have to be calculated in (5.6) every time that new data become available. It may be verified, however, that updating $W_{N-1}^{\mathrm T}W_{N-1}$ with a rank-one matrix $Z_{N-1}^{\mathrm T}Z_{N-1}$ is equivalent to (another) rank-one update of its inverse $P_{N-1}$, without having to invert a matrix,
$$ P_N = P_{N-1} - P_{N-1}Z_{N-1}^{\mathrm T}\,\frac{1}{1 + Z_{N-1}P_{N-1}Z_{N-1}^{\mathrm T}}\,Z_{N-1}P_{N-1}. $$ (5.9)
(See Problem 5.3, p. 72.) With the update of $P_N$ in this form the vector gain $K_N$ defined in (5.6) may be expressed in terms of the past data,
$$ K_N = P_N Z_{N-1}^{\mathrm T}
= \Big(P_{N-1} - P_{N-1}Z_{N-1}^{\mathrm T}\,\frac{1}{1 + Z_{N-1}P_{N-1}Z_{N-1}^{\mathrm T}}\,Z_{N-1}P_{N-1}\Big)Z_{N-1}^{\mathrm T}
= P_{N-1}Z_{N-1}^{\mathrm T}\,\frac{1}{1 + Z_{N-1}P_{N-1}Z_{N-1}^{\mathrm T}}. $$
Note that (5.9) can be written as $P_N = P_{N-1} - K_N Z_{N-1}P_{N-1}$.
In summary, for efficient, recursive estimation of $\theta$ we apply the following procedure (a MATLAB sketch follows the list):

1. Initialize $N$, $P_N$ and $\hat\theta_N$;

2. Increase $N$;

3. Form the $(n+1)$-vector of the last $n$ observations, $Z_{N-1} = \begin{bmatrix} 1 & X_{N-1} & \cdots & X_{N-n} \end{bmatrix}$;

4. Form the vector gain $K_N = P_{N-1}Z_{N-1}^{\mathrm T}\,\dfrac{1}{1 + Z_{N-1}P_{N-1}Z_{N-1}^{\mathrm T}}$;

5. Update $P_N = P_{N-1} - K_N Z_{N-1}P_{N-1}$ and $\hat\theta_N = \hat\theta_{N-1} + K_N(X_N - Z_{N-1}\hat\theta_{N-1})$;

6. Return to Step 2.
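The following MATLAB code is a minimal sketch of this recursion. It assumes that the observations are stored in a column vector x, that the order n has been chosen, and that the recursion is initialized with an ordinary least squares estimate based on the first Nini observations; the variable names and the value of Nini are arbitrary choices for illustration.

n    = 2;                                  % assumed AR order
Nini = 20;                                 % observations used for initialization
% Initialization: batch least squares on the first Nini observations (here n = 2)
X = x(n+1:Nini);
W = [ones(Nini-n,1) x(n:Nini-1) x(n-1:Nini-2)];
P     = inv(W'*W);                         % P as defined in (5.8)
theta = P*(W'*X);                          % initial estimate of theta
% Recursive updates for the remaining observations
for N = Nini+1:length(x)
    Z     = [1 x(N-1:-1:N-n)'];            % row vector [1, last n observations]
    K     = P*Z'/(1 + Z*P*Z');             % vector gain, Step 4
    theta = theta + K*(x(N) - Z*theta);    % update with the prediction error, Step 5
    P     = P - K*Z*P;                     % rank-one update (5.9)
end

Note that MATLAB indexing starts at 1, so x(N) here plays the role of the newest observation.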
Example 5.2.1 (Recursive least squares). Figure 5.1(a) shows a realization of the AR process $X_t = \mu + \epsilon_t$ with $\mu = 1$, and $\epsilon_t$ normally distributed zero-mean white noise with variance 1. Part (b) of the figure shows as a function of $N$ the estimate $\hat\mu_N$ of $\mu$ on the basis of the first $N$ observations. It seems to converge to $\mu = 1$. The vector gain $K_N$ shown in Part (c) clearly converges to zero. This is because after many observations, the new information that $X_N$ provides pales in comparison with what the large number of observations $X_0, \ldots, X_{N-1}$ has already provided. This also indicates a possible problem: if the mean $\mu$ varies (slowly) with time then the long memory of the estimate makes adaptation of $\hat\mu_N$ to $\mu$ very slow. In such situations new observations $X_N$ should be weighted more than observations in the long past $X_M$, $M \ll N$. There are ways of coping with this problem, but we shall not go into that here.

5.2.4 Maximum likelihood estimators

We now approach the problem of estimating the parameters of the AR($n$) scheme according to the maximum likelihood method. To this end we need to determine the joint probability density of the observations $X_0, X_1, \ldots, X_{N-1}$. This is possible if

1. the joint probability density function of the initial conditions $X_0, X_1, \ldots, X_{n-1}$ is known, and

2. the probability density of the white noise $\epsilon_t$ is known.

It is plausible and often justifiable to assume that the white noise $\epsilon_t$ is normally distributed. To avoid any assumptions on the distribution of the initial conditions we choose a likelihood function that is not the joint probability density of the observations $X_0, X_1, \ldots, X_{N-1}$, but the conditional probability density
$$ f_{X_n, X_{n+1}, \ldots, X_{N-1} \mid X_0, X_1, \ldots, X_{n-1}}(x_n, x_{n+1}, \ldots, x_{N-1} \mid x_0, x_1, \ldots, x_{n-1}) $$ (5.10)
of $X_n, X_{n+1}, \ldots, X_{N-1}$, given $X_0, X_1, \ldots, X_{n-1}$.
Figure 5.1: (a) Realization $x_N$. (b) RLS estimate $\hat\mu_N$. (c) Vector gain $K_N$
To obtain a manageable expression for (5.10) we use the well known formula
$$ f_{X,Y\mid Z} = f_{X\mid Y,Z}\, f_{Y\mid Z} $$ (5.11)
from probability theory, with $X$, $Y$ and $Z$ (possibly vector-valued) stochastic variables. With this we obtain for (5.10), temporarily omitting the arguments of the probability densities:
$$ \begin{aligned}
f_{X_n, X_{n+1}, \ldots, X_{N-1} \mid X_0, X_1, \ldots, X_{n-1}}
&= f_{X_{n+1}, X_{n+2}, \ldots, X_{N-1} \mid X_0, X_1, \ldots, X_n}\, f_{X_n \mid X_0, X_1, \ldots, X_{n-1}} \\
&= f_{X_{n+2}, X_{n+3}, \ldots, X_{N-1} \mid X_0, X_1, \ldots, X_{n+1}}\, f_{X_{n+1} \mid X_0, X_1, \ldots, X_n}\, f_{X_n \mid X_0, X_1, \ldots, X_{n-1}} \\
&= \cdots \\
&= \prod_{t=n}^{N-1} f_{X_t \mid X_0, X_1, \ldots, X_{t-1}}.
\end{aligned} $$ (5.12)
We next consider the conditional probability density $f_{X_t \mid X_0, X_1, \ldots, X_{t-1}}$ of $X_t$ given $X_0, X_1, \ldots, X_{t-1}$. From
$$ X_t = \mu + a_1 X_{t-1} + a_2 X_{t-2} + \cdots + a_n X_{t-n} + \epsilon_t $$ (5.13)
we see that for $t \ge n$ and given $X_0, X_1, \ldots, X_{t-1}$ the stochastic variable $X_t$ is normally distributed with mean
$$ \mathbb{E}(X_t \mid X_0, X_1, \ldots, X_{t-1}) = \mathbb{E}(X_t \mid X_{t-n}, X_{t-n+1}, \ldots, X_{t-1}) = \mu + a_1 X_{t-1} + a_2 X_{t-2} + \cdots + a_n X_{t-n} $$
and variance $\sigma^2$. Hence, we have
$$ f_{X_t \mid X_0, X_1, \ldots, X_{t-1}}(x_t \mid x_0, x_1, \ldots, x_{t-1})
= \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2\sigma^2}\big(x_t - \mu - \sum_{j=1}^{n} a_j x_{t-j}\big)^2}. $$
With this it follows from (5.12) that
$$ f_{X_n, X_{n+1}, \ldots, X_{N-1} \mid X_0, X_1, \ldots, X_{n-1}}(x_n, x_{n+1}, \ldots, x_{N-1} \mid x_0, x_1, \ldots, x_{n-1})
= \frac{1}{\big(\sigma\sqrt{2\pi}\big)^{N-n}}\, e^{-\frac{1}{2\sigma^2}\sum_{t=n}^{N-1}\big(x_t - \mu - \sum_{j=1}^{n} a_j x_{t-j}\big)^2}. $$
By taking the logarithm we obtain the log likelihood function
$$ L = -(N-n)\log(\sigma\sqrt{2\pi}) - \frac{1}{2\sigma^2}\sum_{t=n}^{N-1}\Big(x_t - \mu - \sum_{j=1}^{n} a_j x_{t-j}\Big)^2. $$ (5.14)
With the matrix notation of Eqn. (5.3) this simplifies to
$$ L = -(N-n)\log(\sigma\sqrt{2\pi}) - \frac{1}{2\sigma^2}\,\|X - W\theta\|^2. $$ (5.15)
Inspection shows that maximization of $L$ with respect to the parameters $\theta$ amounts to minimization of the sum of squares
$$ \|X - W\theta\|^2. $$ (5.16)
This results in the same estimator $\hat\theta$ as the least squares estimator.

There remains the estimation of $\sigma$ in the maximum likelihood framework. For this we need to maximize $L$ with respect to $\sigma$. Partial differentiation of $L$ with respect to $\sigma$ and setting the derivative equal to zero yields the necessary condition
$$ -\frac{N-n}{\sigma} + \frac{1}{\sigma^3}\,\big\|X - W\hat\theta\big\|^2 = 0. $$ (5.17)
Solution for $\sigma^2$ yields
$$ \hat\sigma^2 = \frac{1}{N-n}\,\big\|X - W\hat\theta\big\|^2. $$ (5.18)
5.2.5 Accuracy

It may be proved that the least squares estimators for $a_1, a_2, \ldots, a_n$, $\mu$, and $\sigma$, which are also maximum likelihood estimators, are asymptotically unbiased, consistent and asymptotically efficient (assuming that the process $\epsilon_t$ is normally distributed). By the asymptotic efficiency, for large $N$ the variance of the estimators may be approximated with the help of the Cramér-Rao lower bound. For this lower bound we need the vector-valued version of the Cramér-Rao inequality of Theorem 3.3.4 (p. 36).

By partial differentiation of the log likelihood function $L$ of (5.15) it follows that
$$ \frac{\partial L}{\partial\theta} = \frac{1}{\sigma^2}\,W^{\mathrm T}(X - W\theta), \qquad
\frac{\partial L}{\partial\sigma} = -\frac{N-n}{\sigma} + \frac{1}{\sigma^3}\,\|X - W\theta\|^2. $$
Further partial differentiation yields the entries of the Hessian
$$ \frac{\partial^2 L}{\partial\theta\,\partial\theta^{\mathrm T}} = -\frac{1}{\sigma^2}\,W^{\mathrm T}W, \qquad
\frac{\partial^2 L}{\partial\sigma\,\partial\theta} = -\frac{2}{\sigma^3}\,W^{\mathrm T}(X - W\theta), \qquad
\frac{\partial^2 L}{\partial\theta\,\partial\sigma} = \Big(\frac{\partial^2 L}{\partial\sigma\,\partial\theta}\Big)^{\mathrm T}, \qquad
\frac{\partial^2 L}{\partial\sigma^2} = \frac{N-n}{\sigma^2} - \frac{3}{\sigma^4}\,\|X - W\theta\|^2. $$
After replacing the samples $x_t$ by the stochastic variables $X_t$ and taking expectations we find
$$ \mathbb{E}\frac{\partial^2 L}{\partial\theta\,\partial\theta^{\mathrm T}}
= -\frac{N-n}{\sigma^2}\begin{bmatrix}
1 & & & & \\
& r(0) & r(1) & \cdots & r(n-1) \\
& r(1) & r(0) & \cdots & r(n-2) \\
& \vdots & & \ddots & \vdots \\
& r(n-1) & r(n-2) & \cdots & r(0)
\end{bmatrix}, $$
$$ \mathbb{E}\frac{\partial^2 L}{\partial\sigma\,\partial\theta}
= -\frac{2}{\sigma^3}\,\mathbb{E}\,W^{\mathrm T}\epsilon
= -\frac{2}{\sigma^3}\,\mathbb{E}\begin{bmatrix}
1 & 1 & \cdots & 1 \\
X_{n-1} & X_n & \cdots & X_{N-2} \\
\vdots & \vdots & & \vdots \\
X_0 & X_1 & \cdots & X_{N-1-n}
\end{bmatrix}\begin{bmatrix}\epsilon_n \\ \epsilon_{n+1} \\ \vdots \\ \epsilon_{N-1}\end{bmatrix}
= 0, $$ (5.19)
$$ \mathbb{E}\frac{\partial^2 L}{\partial\sigma^2} = -\frac{2(N-n)}{\sigma^2}. $$
In (5.19) we used that the $\epsilon_t$ have zero mean and that all $X_{m-k}$, $k > 0$, are uncorrelated with $\epsilon_m$. For large $N$ the variance matrix $S$ of the estimators $[\hat\sigma, \hat\mu, \hat a_1, \hat a_2, \ldots, \hat a_n]$ approximately equals the Cramér-Rao lower bound $M^{-1}$, with
$$ M = -\mathbb{E}\begin{bmatrix}
\dfrac{\partial^2 L}{\partial\sigma^2} & \dfrac{\partial^2 L}{\partial\sigma\,\partial\theta^{\mathrm T}} \\[2mm]
\dfrac{\partial^2 L}{\partial\theta\,\partial\sigma} & \dfrac{\partial^2 L}{\partial\theta\,\partial\theta^{\mathrm T}}
\end{bmatrix}
= \frac{N-n}{\sigma^2}\begin{bmatrix}
2 & 0 & 0 & 0 & \cdots & 0 \\
0 & 1 & & & & \\
0 & & r(0) & r(1) & \cdots & r(n-1) \\
0 & & r(1) & r(0) & \cdots & r(n-2) \\
\vdots & & \vdots & & \ddots & \vdots \\
0 & & r(n-1) & r(n-2) & \cdots & r(0)
\end{bmatrix}. $$
In practice $M$ is approximated by replacing the covariance function $r$, the constant $\mu$ and the standard deviation $\sigma$ by their estimates.
5.2.6 Example

The time series plotted in Fig. 2.6(b) (p. 8) is a realization of an AR(2) process. The parameters $a_1$, $a_2$ and $\sigma$ may be estimated with the help of the MATLAB routine ar of the System Identification Toolbox. The routine assumes that $\mu = 0$. Besides estimates of the parameters the routine also provides the standard deviations of the estimates. The following results are obtained:
$$ \begin{aligned}
a_1 &= 1.5588, & \text{estimate } \hat a_1 &= 1.5375, & \text{st.dev. } 0.0399, \\
a_2 &= -0.81, & \text{estimate } \hat a_2 &= -0.8293, & \text{st.dev. } 0.0399, \\
\sigma^2 &= 0.0888, & \text{estimate } \hat\sigma^2 &= 0.0885, & \text{no st.dev. given.}
\end{aligned} $$ (5.20)
The estimates of the parameters are rather accurate, and fall well within the error bounds.
Example 5.2.2 (Matlab session). After executing the script of § 2.4.5 (p. 15) to generate the time series, the following MATLAB session provides the results as shown:

>> thd = ar(x,2);
>> present(thd)
This matrix was created by the command AR on
11/18 1993 at 20:40. Loss fcn: 0.0885
Akaike's FPE: 0.0903  Sampling interval 1
The polynomial coefficients and their
standard deviations are
A =
   1.0000000  -1.53752969131   0.82926219021
           0   0.03986290251   0.03989843610

The example invokes optimized routines from the SYSTEM IDENTIFICATION TOOLBOX which hide the math that is involved. A plain MATLAB script that does essentially the same is:

a1 = 1.5;
a2 = -0.75;
D = [1 -a1 -a2];          % second order AR scheme
nul = roots(D);           % compute zeros of D
abs(nul)                  % scheme is stable if < 1
sig = 1;                  % set st.dev. of epsilon_t
ep = sig*randn(1,200);    % normally distributed epsilon_t
x = filter(1,D,ep);       % generate x_t (row)
% Now use the 200 samples of x_t to
% estimate an AR(2) scheme:
X = x(3:200)';            % set up X and W in X = W*theta + epsilon
W = [x(2:199)' x(1:198)'];
theta = W\X;              % the estimated a1 and a2
epes = X - W*theta;       % residuals (estimate of epsilon)
sig2 = mean(epes.^2);     % estimate of variance of epsilon
M = W'*W/sig2;            % estimate of Fisher's information matrix M
Mi = inv(M);              % close to var(theta)
disp('true a1, estimate a1 and its st.dev:');
disp([a1 theta(1) sqrt(Mi(1,1))]);
disp('true a2, estimate a2 and its st.dev:');
disp([a2 theta(2) sqrt(Mi(2,2))]);
5.3 Parameter estimation of MA processes

5.3.1 Introduction

We consider the process $X_t$ that is generated by the MA($k$) scheme
$$ X_t = \mu + \epsilon_t + b_1\epsilon_{t-1} + \cdots + b_k\epsilon_{t-k}, \qquad t \ge 0, $$ (5.21)
with $\epsilon_t$ white noise with mean 0 and variance $\sigma^2$. Without loss of generality the coefficient of $\epsilon_t$ on the right-hand side has been chosen equal to 1. We look for estimators of the parameters $b_1, b_2, \ldots, b_k$, $\mu$, and $\sigma^2$ based on observations of $X_0, X_1, \ldots, X_{N-1}$.

From § 2.2.1 (p. 9) we know that the covariance function of the process is given by
$$ r(\tau) = \begin{cases}
\sigma^2\sum_{i=|\tau|}^{k} b_i b_{i-|\tau|} & \text{for } |\tau| \le k, \\
0 & \text{for } |\tau| > k,
\end{cases} $$ (5.22)
with $b_0 = 1$. One possibility to estimate the parameters is to replace $r(\tau)$ for $\tau = 0, 1, \ldots, k$ in (5.22) with estimates, for instance those from § 4.5 (p. 45). The $k+1$ (non-linear) equations that result could then be solved for the $k+1$ parameters $b_1, b_2, \ldots, b_k$ and $\sigma^2$. It turns out that this procedure is very inefficient.

We therefore contemplate other estimation methods, in particular (non-linear) least squares and maximum likelihood estimation. The least squares solution at this stage is difficult to motivate, but later, when we consider prediction error methods, the motivation is clear.
5.3.2 Non-linear least squares estimators

The stochastic variables $X_0, X_1, \ldots, X_{N-1}$ are generated by the independent, normally distributed stochastic variables $\epsilon_{-k}, \epsilon_{-k+1}, \ldots, \epsilon_{N-1}$. It follows from (5.21) that
$$ X = \underbrace{\begin{bmatrix}1\\1\\\vdots\\1\end{bmatrix}}_{e}\mu
+ \underbrace{\begin{bmatrix}
b_k & b_{k-1} & \cdots & b_1 & 1 & 0 & \cdots & 0 \\
0 & b_k & \cdots & b_2 & b_1 & 1 & \cdots & 0 \\
\vdots & & \ddots & & & & \ddots & \vdots \\
0 & \cdots & 0 & b_k & \cdots & & b_1 & 1
\end{bmatrix}}_{M}\epsilon, $$ (5.23)
with
$$ X = \begin{bmatrix}X_0\\X_1\\\vdots\\X_{N-1}\end{bmatrix}, \qquad
\epsilon = \begin{bmatrix}\epsilon_{-k}\\\epsilon_{-k+1}\\\vdots\\\epsilon_{N-1}\end{bmatrix}. $$ (5.24)
$M$ is an $N \times (N+k)$ matrix that depends on the unknown coefficients $b_0, b_1, \ldots, b_k$.
For given coefficients and given observations $X$ the equation (5.23) cannot be solved uniquely for $\epsilon$. To resolve this difficulty we replace $\epsilon_{-k}, \epsilon_{-k+1}, \ldots, \epsilon_{-1}$ by their means 0, and consider
$$ X = e\mu + \underbrace{\begin{bmatrix}
1 & 0 & \cdots & & 0 \\
b_1 & 1 & 0 & \cdots & 0 \\
\vdots & & \ddots & & \vdots \\
0 & \cdots & b_k & \cdots\; b_1 & 1
\end{bmatrix}}_{M_+}
\underbrace{\begin{bmatrix}\epsilon_0\\\epsilon_1\\\vdots\\\epsilon_{N-1}\end{bmatrix}}_{\epsilon_+}. $$
Because $\det M_+ = 1$ the square matrix $M_+$ is invertible and we have
$$ \epsilon_+ = M_+^{-1}(X - e\mu). $$ (5.25)
With the help of $\epsilon_+$ we form the sum of squares
$$ \epsilon_+^{\mathrm T}\epsilon_+ = (X - e\mu)^{\mathrm T}(M_+M_+^{\mathrm T})^{-1}(X - e\mu). $$ (5.26)
This sum of squares is a non-linear function of the parameters $\mu$ and $b_1, b_2, \ldots, b_k$. Minimization generally can only be done numerically. In § 5.5 (p. 68) we briefly review the algorithms that are available for this purpose.
Setting several variables equal to zero to force the equation (5.23),
$$ X - e\mu = M\epsilon, $$ (5.27)
to have a unique solution is unsatisfactory. The equation (5.27) has more unknowns in the vector $\epsilon$ than there are equations. One of the many solutions of (5.27) is
$$ \epsilon = M^{\mathrm T}(MM^{\mathrm T})^{-1}(X - e\mu). $$ (5.28)
This is the least squares solution, that is, that solution for which $\epsilon^{\mathrm T}\epsilon$ is minimal. A proof is listed in Appendix A (p. 96). For this solution the sum of squares is
$$ \epsilon^{\mathrm T}\epsilon = (X - e\mu)^{\mathrm T}(MM^{\mathrm T})^{-1}(X - e\mu). $$ (5.29)
This solution shows a resemblance to (5.26); $M$ replaces $M_+$. In place of (5.26) we now minimize (5.29) with respect to the parameters $\mu$ and $b_1, b_2, \ldots, b_k$ to obtain estimates. Also here we need to resort to numerical minimization.

Once the parameters $\mu$ and $b_1, b_2, \ldots, b_k$ have been estimated by minimization of (5.26) or (5.29), the variance $\sigma^2$ may be estimated as the sample average of the squares of the residuals. Minimization of (5.29) further needs a bound on the $b_1, b_2, \ldots, b_k$; see the corresponding discussion for ARMA schemes in § 5.4.2.
5.3.3 Maximum likelihood estimator

Starting point for maximum likelihood estimation is the joint probability density function $f_{X_0, X_1, \ldots, X_{N-1}}$. Again we need to assume that the white noise $\epsilon_t$ is normally distributed so that also the process $X_t$ is normally distributed. From
$$ X = e\mu + M\epsilon $$ (5.30)
it follows for the mean and the variance matrix of $X$ that
$$ \mathbb{E}X = e\mu, \qquad \operatorname{var}(X) = \sigma^2 MM^{\mathrm T}. $$ (5.31)
With the help of (3.7) of § 3.2.2 (p. 31) it follows that the multi-dimensional normal probability density function of $X$ is given by
$$ f_X(x) = \frac{1}{(\sigma\sqrt{2\pi})^{N}\sqrt{\det MM^{\mathrm T}}}\, e^{-\frac{1}{2\sigma^2}(x - e\mu)^{\mathrm T}(MM^{\mathrm T})^{-1}(x - e\mu)}, $$ (5.32)
with $x = \operatorname{col}(x_0, x_1, \ldots, x_{N-1})$. The log likelihood function thus is
$$ L = -N\log(\sigma\sqrt{2\pi}) - \tfrac{1}{2}\log(\det MM^{\mathrm T}) - \frac{1}{2\sigma^2}(X - e\mu)^{\mathrm T}(MM^{\mathrm T})^{-1}(X - e\mu). $$ (5.33)
Here we replaced $x$ with $X$. The log likelihood function needs to be maximized with respect to the parameters. Again this requires numerical optimization. Inspection of (5.33) shows that if the second term, that with $\log(\det MM^{\mathrm T})$, were missing then maximization of $L$ with respect to $\mu$ and $b_1, b_2, \ldots, b_k$ would yield the same result as minimization of the sum of squares (5.29).

The least squares and maximum likelihood estimation methods of this section are special cases of corresponding methods that are used for the estimation of ARMA schemes. The properties and practical computation of these estimators are reviewed in § 5.4 (p. 64).
5.4 Parameter estimation of ARMA processes

5.4.1 Introduction

As seen in § 5.2 (p. 59) it is rather easy to estimate the parameters of AR processes. Parameter estimation for MA processes is more involved. The same holds for the ARMA processes that are discussed in the present section. We consider ARMA($n,k$) schemes of the form
$$ X_t = \mu + a_1 X_{t-1} + a_2 X_{t-2} + \cdots + a_n X_{t-n} + b_0\epsilon_t + b_1\epsilon_{t-1} + \cdots + b_k\epsilon_{t-k}, $$ (5.34)
for $t = n, n+1, \ldots$. Here $\epsilon_t$ is white noise with mean zero and variance $\sigma^2$. Without loss of generality we may take $b_0 = 1$ if needed. Often we work with centered processes so that $\mu = 0$.

The parameters $\mu$, $a_1, a_2, \ldots, a_n$, $b_0, b_1, \ldots, b_k$, and $\sigma^2$ are to be estimated on the basis of observations of $X_0, X_1, \ldots, X_{N-1}$. The orders $n$ and $k$ are assumed to be known.

ARMA processes are not necessarily identifiable. This means that given the covariance function of an ARMA process, or, equivalently, its spectral density function, there may be several ARMA models with these characteristics. To see this we write the ARMA scheme (5.34) in the compact form $D(q)X_t = N(q)\epsilon_t$. Then we know from § 2.6 (p. 17) that the spectral density of the corresponding stationary process, under the condition that the scheme is stable, is given by
$$ \phi(\omega) = \left|\frac{N(e^{i\omega})}{D(e^{i\omega})}\right|^2\sigma^2. $$ (5.35)
The polynomials $N$ and $D$ may be modified in the following ways without affecting $\phi$:

1. $N$ and $D$ may be simultaneously multiplied by the same polynomial factor, say $P$, without effect on $\phi$. If all roots of $P$ lie inside the unit circle then stability is not affected. Conversely, any common polynomial factors in $N$ and $D$ may be canceled without changing $\phi$.

   To avoid indeterminacies in the model, which complicate the identification, it is necessary to choose the orders $n$ and $k$ as small as possible. If for instance $n$ and $k$ both are 1 larger than the actual orders then inevitably an undetermined common factor is introduced.

2. For stability all roots of $D$ need to have magnitude smaller than 1. For the behavior of $\phi$, however, it makes no difference if a stable factor $1 - aq^{-1}$ of $D$, with $|a| < 1$, is replaced with an unstable factor $q^{-1} - a$. Such a replacement in $N$ similarly has no influence on $\phi$.

   Uniqueness may be obtained by requiring that all roots of $D$ and $N$ have magnitude smaller than 1.

Again we derive least squares and maximum likelihood estimators. Furthermore we introduce in § 5.4.4 (p. 64) the prediction error method as a new but related way to find estimators.
5.4.2 Least squares estimators

The values of $X_n, X_{n+1}, \ldots, X_{N-1}$ are generated by the initial conditions $X_0, X_1, \ldots, X_{n-1}$ and by $\epsilon_{n-k}, \epsilon_{n-k+1}, \ldots, \epsilon_{N-1}$. Define the vectors
$$ X = \begin{bmatrix}X_n\\X_{n+1}\\\vdots\\X_{N-1}\end{bmatrix}, \qquad
X_0 = \begin{bmatrix}X_0\\X_1\\\vdots\\X_{n-1}\end{bmatrix}, \qquad
\epsilon = \begin{bmatrix}\epsilon_{n-k}\\\epsilon_{n-k+1}\\\vdots\\\epsilon_{N-1}\end{bmatrix}. $$
Then we have from (5.34)
$$ \underbrace{\begin{bmatrix}
1 & 0 & \cdots & & 0 \\
-a_1 & 1 & 0 & \cdots & 0 \\
-a_2 & -a_1 & 1 & & 0 \\
\vdots & & \ddots & \ddots & \\
0 & \cdots & -a_n & \cdots\; -a_1 & 1
\end{bmatrix}}_{R} X
= \underbrace{\begin{bmatrix}
b_k & b_{k-1} & \cdots & b_0 & 0 & \cdots & 0 \\
0 & b_k & \cdots & b_1 & b_0 & \cdots & 0 \\
\vdots & & \ddots & & & \ddots & \vdots \\
0 & \cdots & 0 & b_k & \cdots & b_1 & b_0
\end{bmatrix}}_{M}\epsilon
+ \underbrace{\begin{bmatrix}
a_n & a_{n-1} & \cdots & a_1 \\
0 & a_n & \cdots & a_2 \\
\vdots & & & \vdots \\
0 & \cdots & & 0
\end{bmatrix}}_{P} X_0
+ \underbrace{\begin{bmatrix}1\\1\\\vdots\\1\end{bmatrix}}_{e}\mu. $$
The expression
$$ RX = M\epsilon + PX_0 + e\mu $$ (5.36)
is more compact. $R$ is square and non-singular with dimensions $(N-n)\times(N-n)$. $M$ has dimensions $(N-n)\times(N-n+k)$, and $P$ has dimensions $(N-n)\times n$.

$M$ generally is not invertible, so that (5.36) is not uniquely solvable for $\epsilon$. The least squares solution for $\epsilon$ is
$$ \epsilon = M^{\mathrm T}(MM^{\mathrm T})^{-1}(RX - PX_0 - e\mu). $$ (5.37)
This yields the sum of squares
$$ \epsilon^{\mathrm T}\epsilon = (RX - PX_0 - e\mu)^{\mathrm T}(MM^{\mathrm T})^{-1}(RX - PX_0 - e\mu). $$
Minimization with respect to the parameters $\mu$, $a_1, a_2, \ldots, a_n$, $b_0, b_1, \ldots, b_k$ that occur in the coefficient matrices $R$, $M$ and $P$ yields the desired estimators. After computing the residuals with the help of (5.37) the variance $\sigma^2$ may be estimated as the sample average of the squares of the residuals. As with the MA scheme, we need to bound the coefficients $b_0, \ldots, b_k$, say $\sum_{j=0}^{k} b_j^2 \le 1$. Without a bound on the $b_j$ the matrix $M$ may grow without bound, rendering $\epsilon^{\mathrm T}\epsilon$ as close to zero as we want.
5.4.3 Maximum likelihood estimator

After the preparations of § 5.4.2 the determination of the maximum likelihood estimator is not difficult. Again we work with normal probability distributions. For the same reason as for the AR scheme we take as likelihood function the conditional probability density of $X$ given $X_0$. Inspection of
$$ X = R^{-1}(M\epsilon + PX_0 + e\mu) $$ (5.38)
shows that the conditional expectation of $X$ given $X_0 = x_0$ equals $R^{-1}(Px_0 + e\mu)$. The conditional variance matrix is $\sigma^2 R^{-1}MM^{\mathrm T}(R^{-1})^{\mathrm T}$. We use formula (3.7) from § 3.2.2 (p. 31) for the multi-dimensional normal probability density. After some algebra, using the fact that $\det R = 1$, it follows that
$$ f_{X\mid X_0}(x\mid x_0) = \frac{1}{(\sigma\sqrt{2\pi})^{N-n}\sqrt{\det MM^{\mathrm T}}}\, e^{-\frac{1}{2\sigma^2}(Rx - Px_0 - e\mu)^{\mathrm T}(MM^{\mathrm T})^{-1}(Rx - Px_0 - e\mu)}. $$
The log likelihood function thus is
$$ L = -(N-n)\log(\sigma\sqrt{2\pi}) - \tfrac{1}{2}\log\det MM^{\mathrm T} - \frac{1}{2\sigma^2}(RX - PX_0 - e\mu)^{\mathrm T}(MM^{\mathrm T})^{-1}(RX - PX_0 - e\mu). $$ (5.39)
Here we replaced $x$ with $X$ and $x_0$ with $X_0$. Again we recognize the sum of squares $(RX - PX_0 - e\mu)^{\mathrm T}(MM^{\mathrm T})^{-1}(RX - PX_0 - e\mu)$. With the definitions
$$ Q := \begin{bmatrix}-P & R\end{bmatrix}, \qquad \bar X := \begin{bmatrix}X_0\\X\end{bmatrix} $$ (5.40)
we may represent the sum of squares in the more compact form
$$ (Q\bar X - e\mu)^{\mathrm T}(MM^{\mathrm T})^{-1}(Q\bar X - e\mu). $$ (5.41)
Least squares estimation involves minimization of (5.41), maximum likelihood estimation maximization of (5.39). In this case we may fix $\sigma$, which implies a bound on the coefficients $b_j$.
5.4.4 Prediction error method

Maximum likelihood estimation requires maximization of the log likelihood function $L$ as given by (5.39). As we know from § 3.3 (p. 33) maximum likelihood estimators often have the favorable properties of consistency and asymptotic efficiency. The first two terms of the log likelihood function (5.39) turn out to play a secondary role for these two properties. The least squares estimator also has these properties. The least squares estimator, in turn, is a special case of a more general procedure, which has become known as the prediction error method (Ljung, 1987).

To explain this method we consider the ARMA scheme
$$ D(q)X_t = N(q)\epsilon_t. $$ (5.42)
For simplicity we take $\mu = 0$. Without loss of generality we furthermore assume that the coefficient $b_0$ of the polynomial $N$ equals 1. Then the coefficient $h_0$ in the expansion
$$ H(q) = \frac{N(q)}{D(q)} = h_0 + h_1 q^{-1} + h_2 q^{-2} + \cdots $$ (5.43)
also equals 1. From § 2.8 (p. 24) it follows that the optimal one-step predictor for the scheme is given by
$$ \hat X_{t+1\mid t} = \frac{q\,[N(q) - D(q)]}{N(q)}\,X_t. $$ (5.44)
The one-step prediction error is
$$ e_{t+1} := X_{t+1} - \hat X_{t+1\mid t} = \epsilon_{t+1}. $$
Hence we may use the one-step predictor to reconstruct the residuals $\epsilon_t$ according to
$$ \hat\epsilon_t = e_t = X_t - \hat X_{t\mid t-1}, \qquad t = 0, 1, \ldots, N-1. $$ (5.45)
In this case it follows with (5.44) that
$$ e_t = X_t - \hat X_{t\mid t-1} = X_t - \frac{N(q) - D(q)}{N(q)}\,X_t = \frac{D(q)}{N(q)}\,X_t. $$ (5.46)
Hence, the residuals may simply be found by inversion of the ARMA scheme. In other applications of the prediction error method the situation is less elementary.
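In MATLAB the inversion (5.46) amounts to a single call of filter, with the coefficient vectors written in the same form as used for filter elsewhere in these notes. The following sketch is a minimal illustration for an ARMA(2,1) scheme; a1, a2 and b1 are candidate parameter values and x holds the observed time series (both assumptions of this sketch), and N must have its root inside the unit circle for the inverse filter to be stable.

% Scheme D(q)X_t = N(q)eps_t with X_t = a1 X_{t-1} + a2 X_{t-2} + eps_t + b1 eps_{t-1}
Dcoef = [1 -a1 -a2];              % coefficients of D
Ncoef = [1 b1];                   % coefficients of N (invertible: |b1| < 1)
e = filter(Dcoef, Ncoef, x);      % one-step prediction errors (5.46), zero initial conditions

The zero initial conditions of filter correspond to the simple choice of starting values discussed below.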
The simplest form of the application of the prediction error method for estimating model parameters is to minimize the sum of the squares of the prediction errors
$$ \frac{1}{N}\sum_{t=0}^{N-1} e_t^2 $$ (5.47)
with respect to the parameters $a_1, a_2, \ldots, a_n$ and $b_1, b_2, \ldots, b_k$. The prediction errors $e_t$ are generated from the observations $X_t$, $t = 0, 1, \ldots, N-1$, with the help of the one-step predictor. If the one-step predictor is based on parameter values that do not agree with the actual values then the residuals $e_t$ that are determined differ from the realization of the white noise that generated the process $X_t$. The idea of the prediction error method is to choose the parameters in the one-step predictor such that the sum of squares (5.47) obtained from the computed residuals is as small as possible. The idea hence is to consider a model for a time series good if it predicts the time series well. This makes sense. Think of the models used for weather forecasts. If the forecasts are often correct, then we may consider the model satisfactory.

Once the parameters $a_i$ and $b_i$ have been estimated, the variance $\sigma^2$ is estimated as the sample average of the squares of the prediction errors.

For the initialization of the predictor at time 0, values of $X_t$ and $\hat X_{t\mid t-1}$ for $t < 0$ are required. The simplest choice is to take them equal to 0.

The practical computation of the minimum prediction error estimates is done numerically by minimization of the sum of squares (5.47). In § 5.5.2 (p. 69) algorithms for this purpose are reviewed.

The following points are important.

1. The assumed model structure needs to be correct. We return to this problem in § 5.6 (p. 69).

2. The initial values of the parameters need to be suitably chosen to avoid lengthy iteration or convergence to erroneous values. These starting values sometimes may be found from a priori information. Exploratory pre-processing to obtain tentative estimates may be very effective.

To validate the final estimation result it is recommended to compute the residuals once a model has been obtained. After plotting the residuals they may be checked for bias, trend and outliers. The residuals should be a realization of white noise. To test this the covariance function of the residuals may be estimated. If the estimated covariances are within the appropriate confidence intervals then it may safely be assumed that the estimated model is correct. The confidence intervals follow from the variances of the estimated covariances (see Problem 4.4 (p. 56)).
Example 5.4.1 (Prediction error method for MA(1) schemes). Suppose we choose to model an observed time series $X_0, X_1, \ldots, X_{N-1}$ by a zero-mean invertible MA(1) process $X_t = (1 + bq^{-1})\epsilon_t$. The value of $b$ we want to determine using the prediction error method as the one that minimizes the mean squared prediction error (5.47) of $e_t$, where
$$ e_t = \hat\epsilon_t = \frac{1}{1 + bq^{-1}}\,X_t. $$ (5.48)
Now suppose the observations were actually generated by an (unknown) zero-mean MA(1) scheme $X_t = (1 + cq^{-1})\epsilon_t$. The prediction error $e_t$ may be related to the white noise $\epsilon_t$ as
$$ (1 + bq^{-1})e_t = X_t = (1 + cq^{-1})\epsilon_t. $$ (5.49)
Therefore
$$ e_t = \frac{1 + cq^{-1}}{1 + bq^{-1}}\,\epsilon_t
= \epsilon_t + \frac{(c-b)q^{-1}}{1 + bq^{-1}}\,\epsilon_t
= \epsilon_t + (c-b)\,\nu_{t-1},
\qquad \text{where } \nu_{t-1} = \frac{1}{1 + bq^{-1}}\,\epsilon_{t-1}. $$
Note that $\epsilon_t$ and $\nu_{t-1}$ are uncorrelated, so the expectation of $e_t^2$ equals
$$ \mathbb{E}e_t^2 = \mathbb{E}\epsilon_t^2 + (c-b)^2\,\mathbb{E}\nu_{t-1}^2. $$ (5.50)
For the sum of the squares of the prediction errors we thus have
$$ \mathbb{E}\,\frac{1}{N}\sum_{t=0}^{N-1} e_t^2 = \frac{1}{N}\sum_{t=0}^{N-1}\Big(\mathbb{E}\epsilon_t^2 + (c-b)^2\,\mathbb{E}\nu_{t-1}^2\Big). $$ (5.51)
The minimum of this expression is obtained for $b = c$.

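Numerically the same conclusion is easily verified. The MATLAB sketch below is an illustration only; the values of c and N and the use of fminbnd are choices made here and are not part of the example. It generates data from an MA(1) scheme with parameter c and minimizes the sample mean square of the prediction errors (5.48) over b:

c  = 0.5;                                 % "unknown" MA(1) parameter of the data
N  = 10000;
ep = randn(N,1);
x  = filter([1 c], 1, ep);                % X_t = (1 + c q^{-1}) eps_t

V    = @(b) mean(filter(1, [1 b], x).^2); % mean squared prediction error (5.47)
bhat = fminbnd(V, -0.99, 0.99)            % minimizer; close to c for large N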
5.4.5 Accuracy of the minimum prediction error estimator

Ljung (1987) proves that under reasonable assumptions the estimators that are obtained with the prediction error method are consistent. Following his example (§§ 9.2–9.3) we present a heuristic derivation of the asymptotic properties of the estimator. Important for this derivation is that if the parameters of the predictor are equal to the actual parameters then the prediction errors constitute a sequence of independent stochastic variables. The derivation that follows is also valid for other applications of the prediction error method.

We collect the unknown parameters in a column vector
$$ \theta = \begin{bmatrix} a_1 & a_2 & \cdots & a_n & b_1 & b_2 & \cdots & b_k \end{bmatrix}^{\mathrm T}. $$ (5.52)
To make the dependence on the parameters explicit we write the sum of the squares of the prediction errors as
$$ \frac{1}{N}\sum_{t=0}^{N-1} e_t^2(\theta). $$ (5.53)
The minimum prediction error estimator $\hat\theta_N$ minimizes this sum. Therefore, the gradient of (5.53) with respect to $\theta$ at the point $\hat\theta_N$ equals 0. Evaluation of this gradient and substitution of $\theta = \hat\theta_N$ yields
$$ \sum_{t=0}^{N-1} e_t(\hat\theta_N)\, e'_t(\hat\theta_N) = 0_{(n+k)\times 1}. $$ (5.54)
Here $e'_t$ denotes the gradient of $e_t$ with respect to $\theta$.
Suppose that $\theta_0$ is the correct value of the parameter vector. Write $\hat\theta_N = \theta_0 + \tilde\theta_0$ and suppose that $\tilde\theta_0$ is small. Then it follows by Taylor expansion of (5.54) about the point $\theta_0$ that
$$ \begin{aligned}
0 &= \sum_{t=0}^{N-1} e_t(\hat\theta_N)\, e'_t(\hat\theta_N)
   = \sum_{t=0}^{N-1} e_t(\theta_0 + \hat\theta_N - \theta_0)\, e'_t(\theta_0 + \hat\theta_N - \theta_0) \\
&\approx \sum_{t=0}^{N-1} e_t(\theta_0)\, e'_t(\theta_0)
   + \sum_{t=0}^{N-1}\Big[e'_t(\theta_0)\, e'_t(\theta_0)^{\mathrm T} + e_t(\theta_0)\, e''_t(\theta_0)\Big](\hat\theta_N - \theta_0).
\end{aligned} $$
Here $e''_t$ is the Hessian of $e_t$ with respect to $\theta$. We rewrite this expression as
$$ \sum_{t=0}^{N-1} e_t(\theta_0)\, e'_t(\theta_0)
\approx -\Big[\sum_{t=0}^{N-1}\Big(e'_t(\theta_0)\, e'_t(\theta_0)^{\mathrm T} + e_t(\theta_0)\, e''_t(\theta_0)\Big)\Big](\hat\theta_N - \theta_0). $$ (5.55)
We introduce the further approximation
$$ \sum_{t=0}^{N-1}\Big(e'_t(\theta_0)\, e'_t(\theta_0)^{\mathrm T} + e_t(\theta_0)\, e''_t(\theta_0)\Big)
\approx \mathbb{E}\sum_{t=0}^{N-1} e'_t(\theta_0)\, e'_t(\theta_0)^{\mathrm T} \approx N M. $$ (5.56)
On the left both terms have been approximated by their expectations. The expectation of the second term on the left is zero (see Problem 5.7, p. 73). On the right we have
$$ M := \lim_{N\to\infty}\frac{1}{N}\,\mathbb{E}\sum_{t=0}^{N-1} e'_t(\theta_0)\, e'_t(\theta_0)^{\mathrm T}. $$ (5.57)
With this we obtain from (5.55)
$$ \hat\theta_N - \theta_0 \approx -M^{-1}\Big[\frac{1}{N}\sum_{t=0}^{N-1} e_t(\theta_0)\, e'_t(\theta_0)\Big]. $$ (5.58)
Consider the sum
$$ \frac{1}{N}\sum_{t=0}^{N-1} e_t(\theta_0)\, e'_t(\theta_0). $$ (5.59)
The terms $e_t(\theta_0)e'_t(\theta_0)$ asymptotically have expectation zero and asymptotically are uncorrelated (see Problem 5.7, p. 73). Therefore, the variance of the sum (5.59) approximately equals
$$ \operatorname{var}\Big(\frac{1}{N}\sum_{t=0}^{N-1} e_t(\theta_0)\, e'_t(\theta_0)\Big)
\approx \frac{1}{N^2}\sum_{t=0}^{N-1}\mathbb{E}\, e_t^2(\theta_0)\, e'_t(\theta_0)\, e'_t(\theta_0)^{\mathrm T}
\approx \frac{\sigma^2}{N^2}\sum_{t=0}^{N-1}\mathbb{E}\, e'_t(\theta_0)\, e'_t(\theta_0)^{\mathrm T}
\approx \frac{\sigma^2}{N}\,M. $$
With the help of this it follows from (5.58) that asymptotically
$$ \operatorname{var}(\hat\theta_N - \theta_0)
\approx M^{-1}\operatorname{var}\Big(\frac{1}{N}\sum_{t=0}^{N-1} e_t(\theta_0)\, e'_t(\theta_0)\Big)M^{-1}
\approx \frac{\sigma^2}{N}\,M^{-1}. $$
From (5.58) it is seen that the estimation error $\hat\theta_N - \theta_0$ is the sum of $N$ terms. By the central limit theorem we therefore expect that $\sqrt{N}(\hat\theta_N - \theta_0)$ asymptotically is normally distributed with mean 0 and variance matrix $\sigma^2 M^{-1}$.

The derivation that is presented is heuristic but the results may be proved rigorously.
To compute the asymptotic variance of the estimation error the matrix $M$ of (5.57) is needed. In practice this matrix is estimated as the sample average
$$ \hat M = \frac{1}{N}\sum_{t=0}^{N-1} e'_t(\hat\theta_N)\, e'_t(\hat\theta_N)^{\mathrm T}. $$ (5.60)
The gradients $e'_t$ of the residuals that occur are needed for numerical minimization of the sum of squares (5.53) and, hence, are available. This is explained in § 5.5 (p. 68). The matrix $\hat M$ itself is often also needed in this minimization, depending on the algorithm that is used.

The gradient $e'_t$ of the residuals with respect to the parameter $\theta$ is needed for the numerical minimization of the sum of squared prediction errors, and for assessing the accuracy of the final estimates, as explained in § 5.4.5. See Problem 5.7.
5.4.6 Example

We apply the minimum prediction error method to the AR(2) time series of Fig. 2.6(b) (p. 8). The parameters $a_1$ and $a_2$ may be estimated with the MATLAB routine armax, which implements the minimum prediction error algorithm. The routine assumes that $\mu = 0$. The following result is obtained:
$$ \begin{aligned}
a_1 &= 1.5588, & \text{estimate } \hat a_1 &= 1.5424, & \text{st.dev. } 0.0399, \\
a_2 &= -0.81, & \text{estimate } \hat a_2 &= -0.8362, & \text{st.dev. } 0.0399, \\
\sigma^2 &= 0.0888, & \text{estimate } \hat\sigma^2 &= 0.0885, & \text{no st.dev. given.}
\end{aligned} $$
The results are not the same as those of the least squares method of § 5.2.6 (p. 62), but are not significantly different.

Example 5.4.2 (Matlab session). After executing the script of § 2.4.5 (p. 15) to generate the time series, the following MATLAB session yields the results that are shown.

>> thd = armax(x,[2 0]);
...........
>> present(thd)
This matrix was created by the command ARMAX
on 11/22 1993 at 19:45  Loss fcn: 0.0885
Akaike's FPE: 0.0903  Sampling interval 1
The polynomial coefficients and their
standard deviations are
A =
   1.0000  -1.5424   0.8362
        0   0.0399   0.0399

At the position of the dots armax displays several intermediate results of the iterative optimization process.

5.5 Non-linear optimization

5.5.1 Non-linear optimization

This section follows pp. 282–284 of Ljung (1987).

The implementation of the various estimation methods that are presented in § 5.3 (p. 63) and § 5.4 (p. 64) relies on the numerical minimization of a non-linear scalar function $f(x)$ with respect to the vector-valued variable $x \in \mathbb{R}^n$.

All numerical optimization procedures are iterative. Often, given an intermediate point $x_i$ the next point $x_{i+1}$ is determined according to
$$ x_{i+1} = x_i + \mu p_i. $$ (5.61)
Here $p_i \in \mathbb{R}^n$ is the search direction and the positive constant $\mu$ the step size. The search direction is determined on the basis of information about $f$ that is accumulated in the course of the iterative process. Depending on the information that is made available, the methods for determining the search direction may be classified into three groups.

1. Methods that are only based on values of the function $f$.

2. Methods that use both function values and values of the gradient $f_x$ of $f$.

3. Methods that use function values, the values of the gradient and of the Hessian $f_{xx^{\mathrm T}}$ of $f$, that is, the matrix of second order partial derivatives of $f(x)$ with respect to the elements of $x$.

A well known method from group 2 is the steepest descent method, where the search direction is chosen as
$$ p_i = -f_x(x_i). $$ (5.62)
This method follows from the Taylor expansion
$$ f(x_{i+1}) = f(x_i + \mu p_i) = f(x_i) + \mu f_x^{\mathrm T}(x_i)p_i + O(\mu^2), $$
which holds if $f$ is sufficiently smooth. If the last term on the right-hand side is neglected then the greatest change is obtained by choosing the search direction according to (5.62). By varying $\mu$ a search is performed in the search direction until a sufficiently large decrease of $f$ is obtained. This is called a line search. The steepest descent method typically provides good progress initially. Near the minimum progress is very slow because the gradient $f_x$ becomes smaller and smaller.
A typical representative of group 3 is the Newton-
Raphson algorithm. Here the search direction is the
Newton direction
p
i
=f
1
xx
T
(x
i
) f
x
(x
i
). (5.63)
This direction results by including a quadratic termin the
Taylor expansion:
f (x
i
+p
i
) = f (x
i
)+f
T
x
p
i
+
1
2

2
(p
i
)
T
f
xx
T (x
i
)p
i
+O(
3
).
For = 1 the right-hand side of this expression ne-
glecting the last term is minimal if the search direc-
tion p
i
is chosen according to (5.63). Also with the New-
ton algorithm normally line searching is used with = 1
as starting value. The Newton algorithm may be unpre-
dictable inthe beginning of the iterative process but often
68
converges very fast once the neighborhood of the mini-
mum has been reached.
A disadvantage of the Newton method is that besides
the gradient in each iteration point also the Hessian
needs to be computed. This implies that formulas for the
gradient and the Hessian need to be available and must
be coded. This is often problematic, certainly for the
Hessian. There exist several algorithms, known as quasi-
Newton algorithms, where the Hessian is approximated
or is estimated during the iteration on the basis of the val-
ues of the gradients that are successively computed, and,
hence, no explicit formulas are needed.
Group 1 contains methods where the gradient is esti-
mated by taking nite differences. Other methods in this
group use specic search patterns.
For all methods the following holds:
1. The function f needs to be sufciently smooth for
the search to converge.
2. The algorithm may converge to a local minimum
rather than to the desired global minimum. The only
remedy is to choose the starting point suitably.
There is an ample choice of standard software for these
algorithms. Within MATLAB the Optimization Toolbox
provides for this.
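To make the line-search idea concrete, the following is a minimal MATLAB sketch of steepest descent with a backtracking line search. It is not taken from the notes; the function handles, tolerances and backtracking constants are illustrative choices.

% Minimal steepest descent with backtracking line search (illustrative sketch).
% f and gradf are function handles supplied by the user; x0 is the start point.
function x = steepest_descent(f, gradf, x0)
  x = x0;
  for i = 1:200                                  % fixed maximum number of iterations
    p = -gradf(x);                               % search direction (5.62)
    if norm(p) < 1e-8, break; end                % stop when the gradient is small
    alpha = 1;                                   % initial step size
    while f(x + alpha*p) > f(x) - 1e-4*alpha*(p'*p) && alpha > 1e-12
      alpha = alpha/2;                           % backtrack until sufficient decrease
    end
    x = x + alpha*p;                             % update (5.61)
  end
end

A call such as steepest_descent(@(x) sum(x.^2), @(x) 2*x, [3; -1]) illustrates the use of the sketch on a trivial quadratic function.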
5.5.2 Algorithms for non-linear least squares
In the case of non-linear least squares the function f that needs to be minimized has the form

    f(x) = (1/2) Σ_{j=1}^{N} e_j²(x),    (5.64)

where the e_j(x) are more or less complicated functions of x. The structure of f allows to structure the minimization algorithm correspondingly. The gradient of f is the column vector

    f_x(x) = Σ_{j=1}^{N} e_j(x) g_j(x),    (5.65)

where g_j is the gradient of e_j(x) with respect to x. By differentiating once again the Hessian of f follows as

    f_{xx^T}(x) = Σ_{j=1}^{N} g_j(x) g_j^T(x) + Σ_{j=1}^{N} e_j(x) h_j(x),    (5.66)

with h_j the Hessian matrix of e_j(x) with respect to x.
We see that to apply the Newton algorithm to the least squares problem in principle the Hessian of the residuals e_j is needed. For the algorithm to work well it is important that the Hessian f_{xx^T} be correct in the neighborhood of the minimum. For a well posed least squares problem the residuals e_j near the minimum approximately are realizations of uncorrelated stochastic variables with zero mean. Then the mean of the second term on the right-hand side of (5.66) is (approximately) zero. Therefore near the minimum we may assume f_{xx^T}(x) ≈ Σ_{j=1}^{N} g_j(x) g_j^T(x). This idea leads to a quasi-Newton algorithm where the search direction is given by

    p_i = −H^{−1}(x_i) f_x(x_i),    (5.67)

where the gradient f_x is given by (5.65) and the matrix

    H(x) = Σ_{j=1}^{N} g_j(x) g_j^T(x)    (5.68)

is an approximation of the Hessian. If the step size α is always chosen as 1 then this is called the Gauss-Newton algorithm. If α is adapted to smaller values then the method is sometimes known as the damped Gauss-Newton algorithm.
The matrix H as given by (5.68) is always non-negative definite. In some applications, such as when the model is over-parameterized or the data contain insufficient information, it may happen that H is singular or almost singular. This causes numerical problems. A well known way to remedy this is to modify H to

    H(x) = Σ_{j=1}^{N} g_j(x) g_j^T(x) + λ I,    (5.69)

with λ a small positive number that is to be chosen suitably. This is known as the Levenberg-Marquardt algorithm. As argued, H(x) is a good approximation of the Hessian near the minimum. Away from the minimum the approximation error may be considerable, but the fact that H(x) is nonsingular and nonnegative definite guarantees that the p_i of (5.67) is a direction of descent, that is, f(x + α p_i) < f(x) for small enough α > 0.
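The following MATLAB fragment sketches one possible implementation of the search direction (5.67) with the regularized Hessian approximation (5.69). The residual and Jacobian function handles, the iteration limit and the value of lambda are assumptions made for the illustration, not part of the notes.

% Sketch of a Levenberg-Marquardt iteration as in (5.65), (5.67) and (5.69).
% resfun returns the residual vector e(x); jacfun returns the N-by-n matrix
% whose j-th row is g_j(x)'. Both handles and lambda are illustrative choices.
function x = lm_iterate(resfun, jacfun, x0, lambda)
  x = x0;
  for i = 1:100
    e = resfun(x);                           % residuals e_j(x)
    G = jacfun(x);                           % rows are the gradients g_j(x)'
    grad = G' * e;                           % gradient f_x(x), see (5.65)
    H = G' * G + lambda * eye(numel(x));     % regularized Hessian approximation (5.69)
    p = -H \ grad;                           % search direction (5.67)
    if norm(p) < 1e-8, break; end
    x = x + p;                               % full step; a line search or adaptation of lambda may be added
  end
end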
For the application of these least squares algorithms to the estimation of ARMA schemes formulas for the gradient of the residuals e_j with respect to the parameters a_1, a_2, ..., a_n and b_0, b_1, ..., b_k need to be available. Such formulas are derived by Ljung (1987), see Problem 5.7 (p. 73).
5.6 Order determination
5.6.1 Introduction
In this section we review the problem of determining the
order of the most suitable model. We successively discuss
several possibilities.
5.6.2 Order determination with the covariance function
As we saw when discussing AR, MA and ARMA processes
in 2.4 (p. 11), 2.2 (p. 9), and 2.5 (p. 16), these pro-
cesses are characterized by a specic behavior of the co-
variance function. MA(k) processes have a covariance
function that is exactly zero for time shifts greater than k.
The covariance function of AR and ARMA processes de-
creases exponentially. To obtain a rst impression of the
nature and possible order of the process it is in any case
very useful to compute and plot the covariance function.
Because of the estimation inaccuracies it often is not
possible to distinguish whether the covariance function
decreases exponentially to zero or becomes exactly zero
after a nite number of time shifts. If the covariance
drops to zero fast then it is obvious to consider an MA
scheme. The order then may be estimated as that time
shift from which the covariance remains within a con-
dence interval around zero whose size is to be deter-
mined.
If the covariance function decreases slowly then it is
better to switch to an AR or ARMAscheme. Such schemes
probably describe the process with fewer parameters
than an MA scheme.
5.6.3 Order determination with partial correlations
In 2.4.4 (p. 14) it is shown that the partial correlations
of an AR(n) scheme become exactly zero for time shifts
greater than n. If the MA scheme has been eliminated as
a suitable model then it is useful to estimate and plot the
partial correlations.
We briefly review the notion of partial correlations. The Yule-Walker equations of an AR(n) process are

    ρ(k) = Σ_{i=1}^{n} a_{ni} ρ(k − i),   k = 1, 2, ....    (5.70)

The a_{ni} are the coefficients of the scheme and ρ is the correlation function. For given correlation function the coefficients of the scheme may be computed from the set of n linear equations that results by considering the Yule-Walker equations for k = 1, 2, ..., n. Even if the process is not necessarily an AR process then for given n the coefficients a_{n1}, a_{n2}, ..., a_{nn} still may be computed. The last coefficient a_{nn} is the nth partial correlation coefficient of the process. The partial correlation coefficients may be computed recursively with the Levinson-Durbin algorithm (2.65–2.66) (p. 14). For a given realization of length N of a process with unknown correlation function the correlations may be estimated as

    ρ̂_N(k) = r_N(k) / r_N(0),    (5.71)

with r_N an estimate of the covariance function such as proposed in 4.5 (p. 45). By substituting these estimates for ρ in (5.70) estimates of the partial correlation coefficients are obtained. Like for the covariance function it often is difficult to distinguish whether the partial correlations become exactly zero or decrease exponentially to zero. It may be proved that for time shifts greater than n the estimated partial correlations of an AR(n) process are approximately independent with zero mean and variance equal to 1/N. For large N they are moreover with good accuracy normally distributed. With these facts a confidence interval may be determined, for instance the interval (−2√(1/N), 2√(1/N)). With this confidence interval the order of a potential AR scheme may be established. If this order is large then it is better to resort to an ARMA scheme.
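As an illustration, the sketch below estimates the partial correlation coefficients of a realization by solving the Yule-Walker equations (5.70) for successive orders, using the correlation estimates (5.71). The function name and the use of biased covariance estimates are choices made here, not prescribed by the notes.

% Sketch: estimate partial correlations a_nn from a realization x, as in (5.70)-(5.71).
function phi = partial_correlations(x, nmax)
  N = length(x);
  x = x(:) - mean(x);
  r = zeros(nmax+1, 1);
  for k = 0:nmax                          % biased covariance estimates r_N(k)
    r(k+1) = sum(x(1+k:N) .* x(1:N-k)) / N;
  end
  rho = r / r(1);                         % correlation estimates (5.71)
  phi = zeros(nmax, 1);
  for n = 1:nmax                          % solve the n Yule-Walker equations (5.70)
    R = toeplitz(rho(1:n));               % matrix with entries rho(|i-j|)
    a = R \ rho(2:n+1);                   % coefficients a_n1, ..., a_nn
    phi(n) = a(n);                        % n-th partial correlation coefficient
  end
end

The estimated order may then be read off by comparing abs(phi) with the confidence band 2/sqrt(N) mentioned above.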
5.6.4 Analysis of the residuals
It is to be expected that as the model is made more complex (that is, as the order n of the AR part and the order k of the MA part are taken larger) the estimated variance σ̂² of the residuals ε_t becomes smaller. Once the correct values of the orders have been reached the mean square error will no longer decrease fast as complexity is increased. This provides an indication for the correct values of n and k.
All estimation methods that are discussed yield estimates for the residuals ε_t. If the process has been correctly estimated with the right model then the residuals are a realization of white noise. Estimating the correlation function of the residuals and checking whether for nonzero time shifts the correlations are sufficiently small constitutes a very useful test for the reliability of the model.
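A minimal sketch of such a whiteness check, assuming the residual vector e is available: compute the sample correlations of the residuals and compare them with the band ±2/√N. The number of lags shown is an illustrative choice.

% Sketch: sample correlation function of the residuals with a 2/sqrt(N) band.
N   = length(e);
e   = e(:) - mean(e);
K   = 25;                                  % number of lags inspected (illustrative)
rho = zeros(K, 1);
for k = 1:K
  rho(k) = sum(e(1+k:N) .* e(1:N-k)) / sum(e.^2);
end
outside = find(abs(rho) > 2/sqrt(N));      % lags at which whiteness is doubtful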
5.6.5 Information criteria
In the previous subsection it is noted that if the complex-
ity of the model increases the estimated variance of the
residuals decreases. If the correct complexity is reached
a knee occurs in the plot. Sometimes it is not easy
to establish where the knee is located. If complexity is
further increased the variance of the residuals keeps de-
creasing, with an accompanying over-parameterization
of the model.
The converse holds for the value L_max that the log likelihood function assumes for the estimated model. L_max increases monotonically with the complexity of the model. The quantity −L_max decreases monotonically.
Based on information theoretic arguments Akaike (see for instance Ljung (1987)) proposed Akaike's information criterion

    IC_Akaike = −L_max + 2m/N,    (5.72)

with m the total number of parameters in the model. For the ARMA scheme we have m = n + k + 1. The second term in the criterion is a measure for the complexity of the model. With increasing m the first term of the criterion decreases and the second increases. The best model follows by minimization of the information criterion with respect to m.
On the basis of other information theoretic considerations Rissanen (also see Ljung (1987)) derived the minimum length information criterion, also known as Rissanen's information criterion,

    IC_Rissanen = −L_max + (m log N)/N.    (5.73)

Rissanen's criterion yields consistent estimators for both the order and the model parameters. Because the second term of IC_Rissanen has more weight than that of IC_Akaike Rissanen's criterion generally yields models of lower order than that of Akaike.
Akaike's and Rissanen's criteria may also be applied when the prediction error method is used. Let V̂ be the minimal mean square prediction error

    V̂ = (1/N) Σ_{t=0}^{N−1} e_t²(θ̂_N).    (5.74)

Then Akaike's information criterion takes the form

    (1 + 2m/N) V̂,    (5.75)

and Rissanen's criterion becomes

    (1 + (m log N)/N) V̂.    (5.76)

A last criterion is the final prediction error (FPE) criterion of Akaike. It is given by

    ((1 + m/N)/(1 − m/N)) V̂,    (5.77)

and equals the mean variance of the one-step prediction error if the model is applied to a different realization of the time series than the one that is used to estimate the model (Ljung, 1987, 16.4). For m ≪ N the criterion reduces to Akaike's information criterion.
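For comparing candidate models the criteria (5.75)–(5.77) are easily evaluated. A small sketch, assuming the vectors Vhat (minimal mean square prediction errors) and m (corresponding numbers of parameters) have already been collected for the candidates:

% Sketch: compare the criteria (5.75)-(5.77) for a set of candidate models.
% Vhat and m are vectors of equal length; N is the number of observations.
function [aic, ric, fpe] = order_criteria(Vhat, m, N)
  aic = (1 + 2*m/N) .* Vhat;                % Akaike, (5.75)
  ric = (1 + m*log(N)/N) .* Vhat;           % Rissanen, (5.76)
  fpe = ((1 + m/N) ./ (1 - m/N)) .* Vhat;   % final prediction error, (5.77)
end

The preferred order is the one that minimizes the chosen criterion.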
5.6.6 Example
By way of example we consider the time series of Figure 1.4 (p. 2), which represents the monthly production of bituminous coal in the USA during 1952–1959. There are 96 data points.
We begin by centering the time series (with the MAT-
LAB function detrend) and dividing by 10000 to scale
the data. The resulting time series is plotted in Fig. 5.2.
As a first exploratory step we compute and plot the correlation function. For this purpose the MATLAB routine cor is available, which was especially developed for the Time Series Analysis and Identification Theory course. Figure 5.3(a) shows the result. Inspection reveals that the correlation function decays more or less monotonically to zero. An AR scheme of order 1 or 2 may explain this correlation function well.
Next we calculate the partial autocorrelation function. For this the routine pacf has been developed. Figure 5.3(b) shows the result, again with confidence limits. The partial autocorrelations decay rather abruptly. Also
Figure 5.2: Centered and scaled bituminous coal production (x_t versus t [month])
Figure 5.3: (a) Estimated correlation function. Dashed: confidence limits. (b) Estimated partial correlation function. Dashed: confidence limits.
ARMA scheme   σ̂²      FPE      a_1       a_2       a_3       b_1
(1, 0)        0.1007   0.1028   0.7288
                                (0.0729)
(2, 0)        0.0907   0.0945   0.5008    0.2689
                                (0.0980)  (0.0954)
(3, 0)        0.0898   0.0956   0.4808    0.2575    0.0664
                                (0.1037)  (0.1073)  (0.1013)
(2, 1)        0.0905   0.0963   0.5395    0.2435              0.0474
                                (0.3392)  (0.2503)            (0.3573)

Table 5.1: Estimation results ARMA schemes. In parentheses: standard deviations of the estimates
this indicates that an AR scheme of low order explains the process.
We now use the routine armax from the System Identification Toolbox to estimate several ARMA schemes by the prediction error method. Table 5.1 summarizes the results. Comparison of the outcomes for the AR(1) and AR(2) schemes shows that the AR(2) scheme is better than the AR(1) scheme, and that the extra estimated coefficient a_2 differs significantly from 0. If we go from the AR(2) to the AR(3) scheme then we note that the extra coefficient a_3 no longer differs significantly from 0 (compared with the standard deviation). Of the three AR schemes the AR(2) scheme has the smallest final prediction error (FPE).
The last row of the table shows that refinement of the AR(2) scheme to an ARMA(2, 1) scheme is not useful. The extra estimated coefficient b̂_1 is not significantly different from 0. Also all standard deviations are considerably larger than for the AR(2) scheme. This is an indication for over-parameterization. In the case of over-parameterization the matrix M in 5.4.5 (p. 67) becomes singular or near-singular, resulting in large but meaningless numbers in the estimated variance matrix. Finally, in Table 5.1 the FPE of the ARMA(2,1) scheme is larger than for the AR(2) scheme. The FPE is minimal for the AR(2) scheme. To validate the estimated AR(2) model we compute the residuals (with the routine resid). Figure 5.4(a) displays the result. Near the time 10 two outliers occur. These may also be found in the original time series. The routine resid also produces the correlation function of the residuals plotted in Fig. 5.4(b). This correlation function gives no cause at all to doubt the whiteness of the residuals.
5.7 Problems
5.1 Recursive least squares for mean estimation. Consider the scheme X_t = μ + ε_t, with ε_t zero mean white noise. We want to estimate θ = μ. Determine for this scheme the recursive least squares expressions for θ̂_N, P_N and the vector gain K_N. What can you say
Figure 5.4: (a) Residuals for the estimated AR(2) model. (b) Correlation function of the residuals for the estimated AR(2) model. Dashed: confidence limits.
about lim_{N→∞} P_N.
5.2 Recursive least squares for zero mean processes. Consider the stable AR(n) scheme D(q)X_t = ε_t without the term μ. It is hence assumed that we know that E X_t = E ε_t = 0. This simplifies the recursive least squares solution a little. Show that

    W^T W = (N − n) [ r(0)     r(1)     ⋯  r(n−1)
                      r(1)     r(0)     ⋯  r(n−2)
                        ⋮        ⋮              ⋮
                      r(n−1)   r(n−2)   ⋯  r(0)   ]

where W is the W-matrix of (5.3) without the first column (which corresponds to μ).
5.3 Matrix inverse. Verify (5.9).
5.4 Estimation of seasonal model.¹ Consider the seasonal model

    Y_t − c q^{−P} Y_t = ε_t,   t = 1, 2, ...,    (5.78)

with the natural number P the period, c a real constant, and ε_t zero mean white noise with variance σ².

¹ Examination May 25, 1993.
a) Prove that the model is stable if and only if |c| < 1. Assume that this condition is satisfied in the remainder of the problem.

b) Prove that the mean value function of the stationary process defined by (5.78) is identical to 0. Also prove that the covariance function r(τ) = E Y_t Y_{t+τ} is given by

       r(τ) = (σ²/(1 − c²)) c^{|τ|/P}   for |τ| = 0, P, 2P, ...,
       r(τ) = 0                          for other values of τ.    (5.79)

c) Determine the joint probability density function of Y_P, Y_{P+1}, ..., Y_{N−1}, given Y_0, Y_1, ..., Y_{P−1}, with N > P.

d) Determine maximum likelihood estimators ĉ and σ̂² for the model parameters c and σ² based on the probability density function obtained in Part (c).

e) Instead of the conditional probability density function

       f_{Y_P, Y_{P+1}, ..., Y_{N−1} | Y_0, Y_1, ..., Y_{P−1}}    (5.80)

   that is chosen in (c) consider the unconditional probability density function

       f_{Y_0, Y_1, ..., Y_{N−1}}    (5.81)

   based on the assumption that the joint probability density of Y_0, Y_1, ..., Y_{P−1} equals the stationary density. What form does the likelihood function now take? What is the effect on the resulting maximum likelihood estimates for large values of N?
5.5 Variance of estimates. We consider the problem of estimating the coefficients θ = [a_1  a_2  ⋯  a_n]^T in the AR(n) model (5.1) under the assumption that μ = 0, and that we know that μ = 0. As usual we base our estimation on N observations X_0, X_1, ..., X_{N−1}.

a) Under some assumptions Subsection 5.2.5 explains how to obtain estimates of var(θ̂) where θ̂ is the Maximum Likelihood estimator of Subsection 5.2.4. Show that for our case this specializes to

       var(θ̂) ≈ (σ̂²/(N − n)) [ r_N(0)     r_N(1)     ⋯  r_N(n−1)
                                 r_N(1)     r_N(0)     ⋯  r_N(n−2)
                                    ⋮          ⋮                ⋮
                                 r_N(n−1)   r_N(n−2)   ⋯  r_N(0)   ]^{−1}.

b) Now consider the AR(1) and AR(2)-models

       X_t = a_{11} X_{t−1} + ε_t,    X_t = a_{12} X_{t−1} + a_{22} X_{t−2} + ε_t.

   Show with the above approximation of the variances that

       var(â_{11}) ≤ var(â_{12}).

c) Explain in words why the above is intuitive. That is, explain that it is intuitive that the variances increase if there are more degrees of freedom in the model. [It may be useful to consider the special case that the realization on which the estimation is based is constant: X_0 = X_1 = ⋯ = X_{N−1}.]
5.6 Scaling maximum likelihood estimator. Consider the likelihood function L defined in (5.39). It is a function of (μ, b_0, ..., b_k, a_1, ..., a_n, σ). Show that for any λ ≠ 0,

    L(μ, b_0, ..., b_k, a_1, ..., a_n, σ) = L(μ, (1/λ) b_0, ..., (1/λ) b_k, a_1, ..., a_n, λσ).

Explain this result in terms of the ε_t.
5.7 Gradients of the residuals. In this problem we demonstrate how the gradient e′_t practically may be computed. We also consider some theoretical properties of this gradient. From (5.46) we know that the residuals e_t may be determined from the observed time series X_t according to the inverse scheme

    N(q)e_t = D(q)X_t,   t = 0, 1, ..., N − 1.    (5.82)

The recursive computation of the residuals is usually initialized by choosing the missing values of X_t and e_t for t < 0 equal to 0. By assumption the inverse scheme is stable. The parameters that are estimated are the coefficients of the polynomials

    N(q) = 1 + b_1 q^{−1} + b_2 q^{−2} + ⋯ + b_k q^{−k},
    D(q) = 1 − a_1 q^{−1} − a_2 q^{−2} − ⋯ − a_n q^{−n}.
We define e^{a_i}_t as the gradient of e_t with respect to a_i and e^{b_i}_t as the gradient of e_t with respect to b_i.

a) Verify that the gradients e^{a_i}_t may be computed by application of the scheme

       N(q) e^{a_i}_t = −q^{−i} X_t,   t = 0, 1, ..., N − 1,

   to the observed time series. How is the computation initialized?

b) Also verify that the gradients e^{b_i}_t may be computed by application of the scheme

       N(q) e^{b_i}_t = −q^{−i} e_t,   t = 0, 1, ..., N − 1,

   to the residuals. Again, how is the computation initialized?
c) Suppose that the polynomials D and N that are used to generate the residuals e_t and the gradients are precisely equal to the polynomials of the ARMA scheme that generated X_t. Prove that for t → ∞ the gradients e^{a_i}_t and e^{b_j}_t are statistically independent of e_t.

d) Prove that for t → ∞ the stochastic variable e_t(θ_0) e′_t(θ_0)^T has expectation zero.

e) Prove that asymptotically for t → ∞ and s → ∞

       E e_t(θ_0) e_s(θ_0) e′_t(θ_0) e′_t(θ_0)^T = 0   for s ≠ t,
       E e_t(θ_0) e_s(θ_0) e′_t(θ_0) e′_t(θ_0)^T = σ² E e′_t(θ_0) e′_t(θ_0)^T   for s = t.
5.8 Minimum prediction error for an MA(1) process.² In this problem we consider the minimum prediction error estimation of the parameter b of the MA(1) process

    X_t = ε_t + b ε_{t−1},   t ∈ ℤ,    (5.83)

with |b| < 1. We assume the process to be invertible. The process ε_t, t ∈ ℤ, is white noise with mean 0 and variance σ².

a) What is the one-step prediction error e_t = X_t − X̂_{t|t−1} of this process? What is the most convenient choice of X̂_{0|−1}?

b) Given a realization X_t, t = 0, 1, ..., N − 1, of the process, how may the one-step prediction error e_t, t = 0, 1, ..., N − 1, be determined?

c) For the implementation of a suitable minimization algorithm (for instance a quasi-Newton algorithm) for the minimum prediction error estimation of b it is necessary to compute besides the mean square prediction error

       V = (1/N) Σ_{t=0}^{N−1} e_t²    (5.84)

   also the gradient

       V_b    (5.85)

   of V with respect to the parameter b. Derive an algorithm for computing this gradient. (Use Problem 5.7.)
5.9 MA(10) process. Application of armax to the MA(10)
process of 2.2.2 (p. 10) leads to incorrect results.
Why?
² Examination May, 1994.
6 System identification

Figure 6.1: System with input signal u and noise v (z denotes the noise-free output and y the measured output)
6.1 Introduction
This chapter is devoted to system identication. Fig-
ure 6.1 repeats the paradigm for system identication
that is described in 1.1.2 (p. 1). We study the prob-
lem how to reconstruct the dynamical properties of the
system and perhaps also the statistical properties of the
noise v from the recorded signals u and y .
The paradigm may be detailed in different ways. We
give two examples.
1. The system is assumed to be linear and time-
invariant, with unknown impulse response about
which nothing is known. The signal v is assumed to
be a realization of a wide-sense stationary stochastic
process, with unknown covariance function.
The problemis to estimate the unknown impulse re-
sponse and covariance function from the observa-
tions. This is an example of non-parametric system
identication.
2. It is assumed that based on known rst principles a
structured dynamical model may be constructed for
the system and the signal v, but that this model con-
tains a number of parameters with unknown values.
The problem is to estimate the values of these pa-
rameters from the observations. This is an example
of parametric system identication.
In 6.2 we discuss Case 1. In the remaining sections of
this chapter Case 2 is studied.
6.2 Non-parametric system identication
6.2.1 Introduction
In this section we study Case 1 of 6.1. We assume that the system of Fig. 6.1 is linear, time-invariant and causal. This means that the system is described by an equation z_t = Σ_{m=0}^{∞} h_m u_{t−m} and that the system is fully determined by the function h_m known as the impulse response. Now

    y_t = Σ_{m=0}^{∞} h_m u_{t−m} + v_t,   t ∈ ℤ,    (6.1)

and we consider the question how the impulse response h_m and the properties of the signal v_t may be estimated from the N observation pairs

    (u_t, y_t),   t = 0, 1, ..., N − 1,    (6.2)

of the input and output signal.
6.2.2 Impulse and step response analysis
If the noise v_t is absent or very small then the function h may be measured directly by choosing the input signal as the impulse

    u_t = u_0   for t = 0,    u_t = 0   for t > 0,    (6.3)

with u_0 a constant that is chosen as large as possible without violating the model assumptions. Then we have

    y_t = u_0 h_t + v_t ≈ u_0 h_t,   t ≥ 0.

We estimate the impulse response according to ĥ_t = y_t / u_0, t = 0, 1, .... Obviously this experiment is only possible if the input signal u may be chosen freely. Also, the system needs to be at rest at the initial time 0.
Another possibility is to choose a step function

    u_t = u_0   for t ≥ 0    (6.4)

for the input signal. If the system is initially at rest then the response is

    y_t = u_0 Σ_{m=0}^{t} h_m + v_t ≈ u_0 Σ_{m=0}^{t} h_m,   t ≥ 0.

The estimates for the impulse response follow recursively as

    ĥ_0 = y_0 / u_0,    ĥ_t = (y_t − y_{t−1}) / u_0,   t = 1, 2, ....

Both methods to estimate the impulse response usually only are suitable to establish certain rough properties of the system such as the static gain, the dominant time constants, and the intrinsic delay, if any is present.
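A minimal sketch of the step response computation, assuming the measured output vector y (starting at t = 0) and the step amplitude u0 are given:

% Sketch: recover impulse response estimates from a measured step response,
% following h_0 = y_0/u0 and h_t = (y_t - y_{t-1})/u0. y and u0 are assumed given.
h = [y(1); diff(y(:))] / u0;   % column vector of estimates h_0, h_1, ...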
6.2.3 Frequency response analysis
Suppose that the input is chosen as the complex-harmonic signal

    u_t = u_0 e^{iω_0 t},   t ≥ 0,    (6.5)

with the real constants u_0 and ω_0 the amplitude and the angular frequency, respectively. We assume that the system is BIBO stable, which is that the output z_t is bounded for any bounded input u_t. A system (6.1) is BIBO stable if and only if Σ_{m=0}^{∞} |h_m| < ∞. By substitution into (6.1) it follows that

    y_t = Σ_{m=0}^{t} h_m u_0 e^{iω_0(t−m)} + v_t = ( Σ_{m=0}^{t} h_m e^{−iω_0 m} ) u_0 e^{iω_0 t} + v_t,   t ≥ 0.

For t → ∞ we find

    y_t ≈ ĥ(ω_0) u_0 e^{iω_0 t} + v_t.    (6.6)

Here

    ĥ(ω_0) = Σ_{m=−∞}^{∞} h_m e^{−iω_0 m},   −π ≤ ω_0 < π,    (6.7)

is the DCFT of the impulse response h and, hence, the frequency response function of the system. By writing

    ĥ(ω_0) = |ĥ(ω_0)| e^{iφ(ω_0)},    (6.8)

with φ(ω_0) the argument of the complex number ĥ(ω_0), it follows that

    y_t ≈ |ĥ(ω_0)| u_0 e^{i(ω_0 t + φ(ω_0))} + v_t
        = |ĥ(ω_0)| u_0 [cos(ω_0 t + φ(ω_0)) + i sin(ω_0 t + φ(ω_0))] + v_t.    (6.9)
This complex output signal is the (asymptotic) response to the complex input signal

    u_t = u_0 e^{iω_0 t} = u_0 [cos(ω_0 t) + i sin(ω_0 t)].    (6.10)

By the linearity of the system it follows that the response to the real part

    u_t = u_0 cos(ω_0 t)    (6.11)

of the input signal (6.10) asymptotically equals the real part

    y_t ≈ |ĥ(ω_0)| u_0 cos(ω_0 t + φ(ω_0)) + v_t    (6.12)

of the asymptotic response (6.9).
If the noise v is negligibly small then by applying the real-harmonic input signal (6.11) (with u_0 as large as possible), waiting until the stationary response has been reached, and measuring the amplitude and phase of the output signal the magnitude |ĥ(ω_0)| and phase φ(ω_0) of the frequency response at the frequency ω_0 may be determined with the help of (6.12). If the noise is not negligibly small then often by averaging the output over a sufficiently large number of periods an accurate estimate may be found. By repeating this measurement for a large number of frequencies an estimate of the behavior of the frequency response ĥ is obtained. By inverse Fourier transformation an estimate of the impulse response h follows.
For slow systems this method is excessively time consuming. For electronic systems that operate at audio, video or other communication frequencies the method is a proven technique for which specialized measurement equipment exists.
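One common way to extract the amplitude and phase at ω_0 from a measured (averaged or nearly noise-free) steady-state response is to correlate the output with a complex harmonic over an integer number of periods. The sketch below is an illustration, not the measurement procedure prescribed by the notes; the vector y, the amplitude u0 and the frequency w0 are assumed to be given.

% Sketch: estimate |h(w0)| and the phase at w0 from the measured steady-state
% response y to the input u0*cos(w0*t). t runs over an integer number of periods.
t  = (0:length(y)-1).';
c  = (2/length(y)) * sum(y(:) .* exp(-1i*w0*t));   % complex amplitude of the output
H0 = c / u0;                                       % frequency response at w0
mag   = abs(H0);
phase = angle(H0);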
6.2.4 Spectral analysis
We consider the equation

    y_t = Σ_{m=0}^{∞} h_m u_{t−m} + v_t,   t ∈ ℤ,    (6.13)

which describes the relation between the input and output signals of the configuration of Fig. 6.1. Suppose that the input signal u is a realization of a wide-sense stationary stochastic process U_t with mean 0. Also assume that the noise v_t is a realization of a wide-sense stationary stochastic process V_t with mean 0. Then the output signal y_t is a realization of the stochastic process

    Y_t = Σ_{m=0}^{∞} h_m U_{t−m} + V_t,   t ∈ ℤ,    (6.14)

which also has mean 0. Given the processes U_t and Y_t we define the cross covariance function of the two processes as

    R_{yu}(t_1, t_2) = E Y_{t_1} U_{t_2},   t_1, t_2 ∈ ℤ.    (6.15)
It follows with the help of (6.14) that

    R_{yu}(t_1, t_2) = E [ Σ_{m=0}^{∞} h_m U_{t_1−m} + V_{t_1} ] U_{t_2}
                     = Σ_{m=0}^{∞} h_m E U_{t_1−m} U_{t_2} + E V_{t_1} U_{t_2}.

If we assume that the input signal U_t and the noise V_t are uncorrelated processes then it follows that

    R_{yu}(t_1, t_2) = Σ_{m=0}^{∞} h_m R_u(t_1 − m, t_2) = Σ_{m=0}^{∞} h_m r_u(t_1 − t_2 − m),

where R_u(t_1, t_2) = r_u(t_1 − t_2) is the covariance function of the wide-sense stationary input process U_t. Inspection of the right-hand side of this expression shows that it only depends on the difference t_1 − t_2 of the two time instants t_1 and t_2. Apparently the cross covariance function is a function

    E Y_{t_1} U_{t_2} = R_{yu}(t_1, t_2) = r_{yu}(t_1 − t_2)    (6.16)

of the difference t_1 − t_2 of the arguments. We have

    r_{yu}(k) = Σ_{m=0}^{∞} h_m r_u(k − m),   k ∈ ℤ.    (6.17)

The cross covariance function r_{yu} clearly is the convolution of the impulse response h and the covariance function¹ r_u of the input signal. By application of the DCFT it follows that

    φ_{yu}(ω) = ĥ(ω) φ_u(ω),   −π ≤ ω < π.    (6.18)

Here ĥ is the frequency response function of the system and φ_u the spectral density function of the input process U_t. The DCFT φ_{yu} of the cross covariance function r_{yu} is called the cross spectral density of the stochastic processes Y_t and U_t.
Note that the relation (6.18) is independent of the statistical properties of the noise V_t (as long as it has mean 0 and is uncorrelated with the input process U_t). By division we find

    ĥ(ω) = φ_{yu}(ω) / φ_u(ω),   −π ≤ ω < π.    (6.19)

With the help of this relation the frequency response function ĥ may be estimated in the following manner from an observation series (U_t, Y_t), t = 0, 1, ..., N − 1.
1. Estimate the auto-covariance function r_u and the cross covariance function r_{yu} according to

       r_u(k) = (1/N) Σ_{t=|k|}^{N−1} U_t U_{t−|k|},    (6.20)

       r_{yu}(k) = (1/N) Σ_{t=k}^{N−1} Y_t U_{t−k}      for k ≥ 0,
       r_{yu}(k) = (1/N) Σ_{t=0}^{N+k−1} Y_t U_{t−k}    for k < 0,    (6.21)

   both for k = 0, ±1, ±2, ..., ±(M − 1), with M ≪ N. Apply suitable time windows to r_u and r_{yu}.

2. Estimate the spectral density function φ_u and the cross spectral density φ_{yu} by Fourier transformation of the windowed estimates of the covariance functions.

3. Estimate the frequency response function from the estimated spectral densities with the help of (6.19).
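Steps 1–3 can be sketched in a few lines of MATLAB. The window half-length M, the Hamming lag window and the frequency grid below are illustrative choices; u and y are assumed to be the recorded input and output vectors.

% Sketch of the spectral estimate (6.19)-(6.21): windowed covariance estimates,
% their DCFTs, and the ratio. All numerical choices here are illustrative.
N   = length(u);  M = 50;
ru  = zeros(2*M-1, 1);  ryu = zeros(2*M-1, 1);
for k = -(M-1):(M-1)
  iu = max(1, 1-k):min(N, N-k);                    % valid summation range (1-based)
  ru(k+M)  = sum(u(iu+k) .* u(iu)) / N;            % r_u(k), cf. (6.20)
  ryu(k+M) = sum(y(iu+k) .* u(iu)) / N;            % r_yu(k), cf. (6.21)
end
w     = 0.54 + 0.46*cos(pi*((-(M-1):(M-1)).')/M);  % Hamming lag window
om    = linspace(0, pi, 200);                      % frequency grid
E     = exp(-1i * ((-(M-1):(M-1)).') * om);        % DCFT kernel, one row per lag
phiu  = real((w .* ru).'  * E);                    % estimate of phi_u(omega)
phiyu = (w .* ryu).' * E;                          % estimate of phi_yu(omega)
hhat  = phiyu ./ phiu;                             % frequency response estimate (6.19)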
Like for the estimation of the spectral density function as discussed in 4.6 (p. 46) the windows need to be chosen so that an acceptable compromise is achieved between the statistical accuracy and resolution. Ljung (1987, 6.4) discusses this in more detail.
Note that this method provides no or no useful estimate of the frequency response function ĥ for those frequencies for which the spectral density function φ_u of the input process is zero or very small. If φ_u(ω) > 0 for all ω then the input signal is called persistently exciting. This property is a necessary condition for being able to identify the system.

¹ In this context the covariance function is sometimes called auto-covariance function.
The spectral estimation method as described is suitable as an exploratory tool to get a first impression of the dynamical properties of the system. The parametric estimation methods that are the subject of the next sections usually are more attractive if the dynamic model is needed for applications such as forecasting and control system design.
Spectral estimation may also be used to determine the spectral density of the noise V_t. Write

    Y_t = Σ_{m=0}^{∞} h_m U_{t−m} + V_t = Z_t + V_t,    (6.22)

where Z_t denotes the sum term. If the noise V_t is independent of the input process U_t then also the noise V_t and the process Z_t are independent. We then have that

    r_y(k) = r_z(k) + r_v(k),    (6.23)

with r_y, r_z and r_v the covariance functions of Y_t, Z_t and V_t, respectively. By Fourier transformation it follows that

    φ_y(ω) = φ_z(ω) + φ_v(ω),    (6.24)

with φ_y the spectral density function of Y_t, φ_z that of Z_t and φ_v that of V_t. With the help of

    φ_z(ω) = |ĥ(ω)|² φ_u(ω)    (6.25)

(see 2.6, p. 17) it follows that the spectral density function of the output process Y_t is given by

    φ_y(ω) = |ĥ(ω)|² φ_u(ω) + φ_v(ω).    (6.26)

With the help of (6.19) it follows that

    φ_v(ω) = φ_y(ω) − |ĥ(ω)|² φ_u(ω) = φ_y(ω) − |φ_{yu}(ω)|² / φ_u(ω).    (6.27)

For the estimation of the frequency response ĥ it hence is necessary to estimate the spectral density φ_u and the cross spectral density φ_{yu}. If in addition the spectral density function φ_y of the output process is estimated (by first estimating the covariance function of the output process) then (6.27) yields an estimate for φ_v.
The function

    K_{yu}(ω) = √( |φ_{yu}(ω)|² / (φ_y(ω) φ_u(ω)) ),   −π ≤ ω < π,    (6.28)

is known as the coherence spectrum of the processes Y_t and U_t. This function may be viewed as a frequency dependent correlation coefficient between the two processes. If the model (6.14) applies then we have

    φ_v(ω) = [1 − K²_{yu}(ω)] φ_y(ω).    (6.29)

If K_{yu} equals 1 for certain frequencies then the spectral density of the noise V_t equals 0 for those frequencies.
The computations of this section may advantageously be done in the frequency domain with the use of the FFT. As we know from 4.7 (p. 42) the spectral density functions φ_u and φ_y of the processes U_t and Y_t may be estimated by spectral windowing of their periodograms

    (1/N) |Û_N(ω)|²   and   (1/N) |Ŷ_N(ω)|²,    (6.30)

where the DCFTs

    Û_N(ω) = Σ_{t=0}^{N−1} U_t e^{−iωt}   and   Ŷ_N(ω) = Σ_{t=0}^{N−1} Y_t e^{−iωt}    (6.31)

may be computed by the FFT. Similarly the cross spectral density φ_{yu} of the processes Y_t and U_t may be estimated by spectral windowing of the cross periodogram

    (1/N) Ŷ_N(ω) Û_N*(ω).    (6.32)

In this formula * denotes the complex conjugate. More details may be found in Chapter 6 of Ljung (1987).
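A minimal sketch of the raw (unsmoothed) periodograms and cross periodogram computed with the FFT, assuming u and y are column vectors of equal length; spectral windowing still has to be applied to these raw estimates.

% Sketch: raw periodograms and cross periodogram via the FFT, as in (6.30)-(6.32).
N   = length(u);
U   = fft(u);                   % DCFT of u on the grid omega = 2*pi*k/N
Y   = fft(y);
Pu  = abs(U).^2 / N;            % periodogram of u, (6.30)
Py  = abs(Y).^2 / N;            % periodogram of y
Pyu = Y .* conj(U) / N;         % cross periodogram, (6.32)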
6.2.5 Matlab example
By way of example we consider the laboratory process
of 1.2.7 (p. 3). This example is treated in the demo
iddemo1 of the Systems Identication Toolbox. The total
length of the observed series is 1000 points. In Fig. 6.2 the
rst 300 points are plotted. We use these 300 measure-
ment pairs for identication.
With the following MATLAB commands an estimate of
the frequency response function of the process may be
obtained:
z = [y(1:300) u(1:300)];
z = detrend(z,'constant');
hh = spa(z);
bode(hh);
The rst command combines the rst 300 points of the
measurement series to an input/output pair. In the sec-
ond command the means are subtracted. The third com-
mand yields an estimate of the frequency response func-
tion in a special format that is explained in the manual of
the Systems Identication Toolbox.
To obtain an estimate of the frequency response func-
tion the routine spa rst estimates the auto-covariance
function of the input signal and the cross covariance
function of the input and output signals. Next these
functions are windowed with Hammings window. After
Fourier transformation of the windowedcovariance func-
tions the estimation of the frequency response function
follows. See the manual or the MATLAB help function for a
further description and a specication of the optional ar-
guments of the function spa. The command bode pro-
duces the plot of Fig. 6.3.
An estimate of the impulse response of the system may
be obtained with the command
ir = cra(z);
The function cra estimates the impulse response ir of
the system by successively executing the following steps:
1. Estimate anAR(n) scheme for the input process. The
default value of n is 10.
2. Filter the input and output process u and y with the
inverse of the ARscheme. This amounts to the appli-
cation of an MA scheme, and serves to prewhiten
the input signal.
3. Compute the cross covariance function of the l-
tered input and output signals. Because the input
signal now is white the cross covariance function
precisely equals the impulse response (within a pro-
portionality constant that equals the standard devi-
ation of the input signal).
This computation entirely takes place in the time do-
main.
Figure 6.4 shows the graphical output of cra, consisting of a plot of the estimated impulse response along with a 99% confidence region. Inspection shows that the (estimated) impulse response h_m only differs (significantly) from 0 for m ≥ 3. Obviously the system has a time delay
Figure 6.2: Measured input and output signals of a laboratory process (u_t and y_t versus sampling instant t)
Figure 6.3: Estimated frequency response function of the laboratory process (amplitude and phase versus ω [rad/sec])
Figure 6.4: Output of cra (impulse response estimate versus lags)
Figure 6.5: Bode plots of two estimates (dashed and solid) of one frequency response function
(or deadtime) of 3 sampling intervals. This time delay is
caused by the transportation time of the air ow through
the tube.
We may compare the results of the two estimation
methods by computing the frequency response from the
estimated impulse response ir. This is done like this:
thir = poly2th(1,ir);
hhir = th2ff(thir);
The first command defines the scheme y_t = P(q)u_t where
the coefcients of the polynomial P are formed from
the estimated impulse response. The second command
serves to compute the frequency response function of the
corresponding system. The command
bode([hh hhir]);
displays the magnitude and phase plots of the frequency
response functions that are found with the two methods
in one frame. Figure 6.5 shows the result. The estimated
frequency responses differ little.
6.3 ARX models
6.3.1 Introduction
The rst parametric systemidenticationmethod that we
consider applies to the case that the system of Fig. 6.1
(p. 75) may be represented as a linear time-invariant sys-
tem described by the difference equation

    y_t − a_1 y_{t−1} − a_2 y_{t−2} − ⋯ − a_n y_{t−n} = c_0 u_t + c_1 u_{t−1} + ⋯ + c_m u_{t−m},   t ∈ ℤ,

where the left-hand side equals D(q)y_t and the right-hand side equals P(q)u_t. Many practical systems may adequately be characterized by difference equations of this form. To account
for measurement errors and disturbances we modify the
equation to

    D(q)y_t − P(q)u_t = w_t,   t ∈ ℤ,    (6.33)

or

    D(q)y_t = P(q)u_t + w_t,   t ∈ ℤ,    (6.34)

with w_t a noise term. Because of (6.33) this model is sometimes known as the error-in-the-equation model. If the noise w_t is a realization of white noise then (6.34) is called an ARX scheme. As far as the effect of the noise w_t on the output process concerns (6.34) is an AR scheme. The character X refers to the presence of the exogenous (external) signal u_t.
6.3.2 Least squares estimation
A least squares estimate of the coefficients a_1, a_2, ..., a_n, c_0, c_1, ..., c_m may be obtained by minimization of the sum of squares

    (1/2) Σ_t [D(q)y_t − P(q)u_t]².    (6.35)

To set up the least squares problem we again represent the equations in matrix form. For ease of exposition we assume that n ≥ m. Then the relevant matrix equation is

    Y = F θ + W,    (6.36)

with

    Y = [ y_n  y_{n+1}  ⋯  y_{N−1} ]^T,   W = [ w_n  w_{n+1}  ⋯  w_{N−1} ]^T,   θ = [ a_1  ⋯  a_n  c_0  ⋯  c_m ]^T,

    F = [ y_{n−1}   ⋯  y_0      u_n      ⋯  u_{n−m}
          y_n       ⋯  y_1      u_{n+1}  ⋯  u_{n−m+1}
            ⋮             ⋮        ⋮             ⋮
          y_{N−2}   ⋯  y_{N−n}  u_{N−1}  ⋯  u_{N−m−1} ].

The solution θ̂_N = (F^T F)^{−1} F^T Y of the least squares problem exists iff F has full column rank. For that u_t obviously has to be at least nonzero. Similarly as in Subsection 5.2.5, Ljung (1987) showed that if the noise w_t is a realization of white noise and the input signal u_t is sufficiently rich (see p. 83) then the least squares estimator exists and is consistent and asymptotically efficient.
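For a given data set the least squares estimate follows directly from the matrix form (6.36). The sketch below builds F and solves the least squares problem with MATLAB's backslash operator; y and u are assumed to be column vectors and the orders n and m (n ≥ m) are choices made for the illustration.

% Sketch: least squares ARX estimate based on (6.36).
N  = length(y);
F  = zeros(N-n, n+m+1);
for t = n:N-1                               % rows correspond to y_t, t = n, ..., N-1
  F(t-n+1, 1:n)     = y(t:-1:t-n+1).';      % y_{t-1}, ..., y_{t-n}  (MATLAB indices are 1-based)
  F(t-n+1, n+1:end) = u(t+1:-1:t-m+1).';    % u_t, ..., u_{t-m}
end
Yv    = y(n+1:N);                           % left-hand side vector Y
theta = F \ Yv;                             % least squares estimate (F'F)^{-1}F'Y
a     = theta(1:n);                         % estimates of a_1, ..., a_n
c     = theta(n+1:end);                     % estimates of c_0, ..., c_m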
This does not apply if the noise is not white. We consider this situation. Suppose that the input signal u_t and the noise w_t are realizations of wide-sense zero mean stationary processes U_t and W_t. Then also the output signal y_t is a realization of such a process Y_t. Now multiply (6.36) from the left with (1/N) F^T:

    (1/N) F^T Y = (1/N) F^T F θ + (1/N) F^T W    (6.37)

and realize that the least squares estimate θ̂_N satisfies the similar equation

    (1/N) F^T Y = (1/N) F^T F θ̂_N.    (6.38)

Under weak conditions the sample covariance matrices (1/N) F^T Y and (1/N) F^T F converge to well defined limits². Therefore if θ̂_N is to be an asymptotically unbiased estimator of θ, then necessarily we need to have that

    lim_{N→∞} E (1/N) F^T W = 0.
Now we have that

    lim_{N→∞} E (1/N) F^T W    (6.39)

      = lim_{N→∞} (1/N) E [ Y_{n−1}   Y_n        ⋯  Y_{N−2}
                               ⋮          ⋮               ⋮
                            Y_0       Y_1        ⋯  Y_{N−n}
                            U_n       U_{n+1}    ⋯  U_{N−1}
                               ⋮          ⋮               ⋮
                            U_{n−m}   U_{n−m+1}  ⋯  U_{N−m−1} ] [ W_n  W_{n+1}  ⋯  W_{N−1} ]^T

      = [ r_{yw}(1)  ⋯  r_{yw}(n)  r_{uw}(0)  ⋯  r_{uw}(m) ]^T.
If W_t is white noise then W_t and Y_{t−i} are uncorrelated for i > 0, so that r_{yw}(i) = 0 for i > 0. These terms appear in E (1/N) F^T W and hence are all zero; the remaining terms in E (1/N) F^T W are r_{uw}(i), i ≥ 0, and they are zero by assumption. Hence lim_{N→∞} E (1/N) F^T W = 0, which is what we need.
If W_t is not white noise then W_t and Y_{t−i} are correlated for at least one i > 0 so that r_{yw}(i) ≠ 0 for at least one i > 0. As a result lim_{N→∞} E (1/N) F^T W is not identically zero, and as a result the estimate θ̂_N is biased, no matter how large the number of observations N is.
6.3.3 Instrumental variable method
In practice the biasedness of the least squares estimator
for non-white w
t
is a sufciently serious handicap to jus-
tify looking for other estimators

. Ljung (1987) advo-
cates the following method. He proposed to avoid the
mismatch between (6.37) and (6.38) by premultiplying the equation Y = Fθ + W not with (1/N) F^T but with another matrix (1/N) X^T:

    (1/N) X^T Y = (1/N) X^T F θ + (1/N) X^T W,    (6.40)
where

    X = [ x_n      x_{n−1}  ⋯  x_{−m}
            ⋮        ⋮            ⋮
          x_{N−1}  x_{N−2}  ⋯  x_{N−n−m−1} ]

which is a matrix whose entries x_t are a realization of a wide sense stationary process that is uncorrelated with the noise W_t. Such time series x_t are called instrumental variables. Then by assumption we have that E X^T W = 0, and now as estimator is proposed the solution of (6.40) in which X^T W is replaced with zero,

    θ̂ = (X^T F)^{−1} X^T Y.    (6.41)

A possible choice for the instrumental variable is x_t = u_t. If the noise w_t is uncorrelated with the input signal then this yields a correct instrument.

² They converge in fact to matrices of cross-covariances and covariances, comparable to that of (6.39).
It may be proved (see Ljung (1987)) that the instrumental variable method (abbreviated to IV method) yields the most accurate estimate of θ (in some sense) if the equation error W is not correlated with y_t and u_t as is done in (6.39), but with z_t and u_t, where z_t is the output signal of the system without noise. This means that z_t is the solution of the difference equation

    D(q)z_t = P(q)u_t.    (6.42)

Because the coefficients of the polynomials D and P are not known this result is not immediately useful, since the instrument z_t cannot be computed. However, the instrument may be approximated by first making a preliminary estimate of the unknown coefficients with the least squares method. These (inaccurate) estimates are then used in (6.42) to compute the instrument z_t approximately. Next the coefficients are estimated according to the instrumental variable method from (6.41) with
    X = [ z_{n−1}  ⋯  z_0      u_n      ⋯  u_{n−m}
          z_n      ⋯  z_1      u_{n+1}  ⋯  u_{n−m+1}
            ⋮            ⋮        ⋮             ⋮
          z_{N−2}  ⋯  z_{N−n}  u_{N−1}  ⋯  u_{N−m−1} ].    (6.43)
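A sketch of the resulting procedure, assuming the regression-matrix construction of (6.36) is available as a hypothetical helper buildF(y,u,n,m) (the loop shown after (6.36) can serve as its body). All variable names are illustrative.

% Sketch of the basic IV estimate (6.41)-(6.43): preliminary least squares fit,
% simulation of the noise-free output z from (6.42), and the IV solve.
thetaLS = buildF(y, u, n, m) \ y(n+1:end);   % preliminary least squares estimate
a = thetaLS(1:n);  c = thetaLS(n+1:end);
z = filter(c, [1; -a(:)], u);                % noise-free output, D(q)z_t = P(q)u_t
X = buildF(z, u, n, m);                      % instruments: z-lags and u-lags, (6.43)
F = buildF(y, u, n, m);
thetaIV = (X' * F) \ (X' * y(n+1:end));      % IV estimate (6.41)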
6.3.4 Matlab example
Again we use the laboratory process as an illustration. In
6.2.5 (p. 78) the impulse response was estimated. For
estimating the ARX model the Systems IdenticationTool-
box provides the function arx. This function is an imple-
mentation of the least squares method.
This is the record of the MATLAB session (in continua-
tion of that of 6.2.5) that produces an estimate:
>> th = arx(z,[2 2 3]);
>> present(th)
This matrix was created by the command ARX
on 1/18 1998 at 12:52 Loss fcn: 0.0016853
Akaikes FPE: 0.0017309 Sampling interval 1
The polynomial coefficients and their
standard deviations are
B =
0 0 0 0.0666 0.0445
0 0 0 0.0021 0.0033
A =
1.0000 1.2737 0.3935
0 0.0208 0.0190
>> e=pe(th,z); % (resid(th,z) is buggy)
In the second argument [2 2 3] of arx the first parameter is the degree n = 2 of the denominator polynomial D. The second parameter m = 2 and the third parameter d = 3 determine the structure of the numerator polynomial P in the form

    P(q) = q^{−d}(c_0 + c_1 q^{−1} + ⋯ + c_m q^{−m}).    (6.44)
The integer d is the dead time of the system. In 6.2.5
we saw that the dead time is 3 for the laboratory process.
The degrees n and m were chosen after some experimen-
tation.
The command present displays the estimated coef-
cients of the polynomials D and P (A and B in the no-
tation of the Toolbox.) The coefcients all differ signi-
cantly from 0, so that there is no reason to decrease the
degrees of D and P.
To validate the result the command resid is used to
compute the residuals and their sample correlation func-
tion. Figure 6.6 shows the graphical output of resid.
The upper plot is the sample correlation function of the
residuals, together with its condence level. The plot jus-
ties the conclusion that the residuals are a realization of
white noise. The lower plot shows the sample cross cor-
relation function of the residuals and the input signal u,
also with its condence level. The result shows that the
residuals are uncorrelated with the input signal. Hence,
the model hypotheses of the ARX model are satised.
There is no need to resort to an IV estimation method.
It is interesting to compare the estimation result with
that of 6.2.5. From the impulse response that is esti-
mated in 6.2.5 the step response may easily be com-
puted. With the help of the ARX model that is estimated
in the present subsection the step response may also be
determined. These commands do the job:
stepr = cumsum(ir);
step = ones(20,1);
mstepr = idsim(step,th);
The variable stepr is the step response that follows from
the estimated impulse response and mstepr the step re-
sponse of the estimated ARX model. In Fig. 6.7 both step
responses are plotted. They show good agreement.
6.4 ARMAX models
6.4.1 Introduction
ARX schemes of the form

    D(q)y_t = P(q)u_t + w_t,   t ∈ ℤ,    (6.45)
Figure 6.6: Sample auto correlation function of the residuals and sample cross correlation function with the input signal (lag on the horizontal axis)
Figure 6.7: Solid: step response of the estimated ARX model. Dashed: step response computed from the estimated impulse response
have limited possibilities for modeling the noise. If the
instrumental variable method is used then the noise w
t
may be non-white, but the method does not immediately
provide information about the statistical properties of the
noise w
t
.
We therefore now consider the ARMAX model, which is given by

    D(q)Y_t = P(q)u_t + N(q)ε_t,   t ∈ ℤ.    (6.46)

D and P are the same polynomials as in 6.3, while N is the polynomial

    N(q) = 1 + b_1 q^{−1} + b_2 q^{−2} + ⋯ + b_k q^{−k}.    (6.47)

The system noise ε_t is white noise with mean 0 and variance σ². The input signal u_t may be a realization of a stochastic process but this is not necessary. The ARMAX model is a logical extension of on the one hand the ARX model and on the other the ARMA model.
We study the problem how to estimate the coefficients of the polynomials D, N and P from a series of observations of (Y_t, u_t) for t = 0, 1, ..., N − 1.
6.4.2 Prediction error method
The prediction error method of 5.4.4 (p. 65) for estimating ARMA schemes may easily be extended to the estimation of ARMAX schemes. The output signal of the ARMAX scheme (6.46) is given by

    Y_t = (P(q)/D(q)) u_t + (N(q)/D(q)) ε_t,   t ∈ ℤ,    (6.48)

where the second term on the right-hand side is denoted X_t. According to 5.4.4 the one-step predictor for the process X_t is equal to

    X̂_{t|t−1} = ((N(q) − D(q))/N(q)) X_t.    (6.49)

It follows that the one-step predictor for the process Y_t is given by

    Ŷ_{t|t−1} = (P(q)/D(q)) u_t + X̂_{t|t−1}    (6.50)
              = (P(q)/D(q)) u_t + ((N(q) − D(q))/N(q)) X_t.    (6.51)

Note that for the prediction of the output signal the future behavior of the input signal u_t is assumed to be known. By substituting

    X_t = Y_t − (P(q)/D(q)) u_t    (6.52)

into the right-hand side of (6.50) it follows that

    Ŷ_{t|t−1} = ((N(q) − D(q))/N(q)) Y_t + (P(q)/N(q)) u_t.    (6.53)

The prediction error hence is

    e_t = Y_t − Ŷ_{t|t−1} = (D(q)/N(q)) Y_t − (P(q)/N(q)) u_t.    (6.54)

By substitution of Y_t from (6.48) it follows that e_t = ε_t. Inspection of (6.54) shows that the prediction error may be generated by the difference scheme

    N(q)e_t = D(q)Y_t − P(q)u_t.    (6.55)

Application of the minimum prediction error method implies minimization of the sum of the squares of the prediction errors

    Σ_{t=0}^{N−1} e_t²    (6.56)

with respect to the unknown parameters. The numerical implementation follows that for the estimation of ARMA schemes.
The estimator that is obtained this way is consistent under plausible conditions. The most relevant condition is that the experiment is sufficiently informative. This in particular means that the input signal be sufficiently rich. An input signal that is identical to 0, for instance, provides no information about the coefficients of the polynomial P and, hence, is not sufficiently rich. A sufficient condition for the experiment to be sufficiently informative is that the input signal u_t be a realization of a wide-sense stationary process U_t whose spectral density function is strictly positive for all frequencies. Such an input signal is said to be persistently exciting (see also 6.2, p. 75).
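Because (6.55) is just a pair of linear filters, the prediction errors of a candidate ARMAX model are easily generated. A minimal sketch, assuming coefficient vectors D, P, N (in powers of q^{-1}) and data vectors y and u are given:

% Sketch: prediction errors of an ARMAX model via (6.55), N(q)e_t = D(q)Y_t - P(q)u_t,
% using MATLAB's filter with zero initial conditions.
e = filter(D, N, y) - filter(P, N, u);   % prediction errors e_t
V = mean(e.^2);                          % mean square prediction error, cf. (6.56)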
6.4.3 Matlab example
As an example to illustrate the methods of this section we
choose the Åström system. This is the system described by the ARMAX scheme D(q)Y_t = P(q)u_t + N(q)ε_t, with

    D(q) = 1 − 1.5q^{−1} + 0.7q^{−2},
    P(q) = q^{−1}(1 + 0.5q^{−1}),
    N(q) = 1 − q^{−1} + 0.2q^{−2}.
Realizations of the input and output signals may be gen-
erated with the following series of commands:
D = [1 -1.5 0.7];
P = [0 1 0.5];
N = [1 -1 0.2];
th0 = poly2th(D,P,N);
randn('seed',0)
u = sign(randn(300,1));
e = randn(300,1);
y = idsim([u e],th0);
z = [y u];
Figure 6.8: Input and output signals for the Åström system (u_t and y_t versus time t)

Figure 6.9: Output of cra (impulse response estimate versus lags)
Figure 6.8 shows the plots of the input and output signals.
To obtain a rst impression of the systemwe estimate the
impulse response:
ir = cra(z);
Figure 6.9 shows the result. The estimated impulse re-
sponse suggests a second-order slightly oscillatory sys-
tem with a dead time of 1 sampling instant. We initially
do not include a dead time in the model, however.
To obtain a better impression of the structure of the system we try to identify the system as an ARX scheme. To avoid problems caused by non-white noise, such as indeed present in the Åström system, we use an IV method for identification. With the following series of commands we test each time an ARX model D(q)Y_t = P(q)u_t + w_t, where D has degree n and P has degree n − 1, for n = 1, 2, 3, 4 and 5.
th1 = iv4(z,[1 1 0]); present(th1); pause
th2 = iv4(z,[2 2 0]); present(th2); pause
th3 = iv4(z,[3 3 0]); present(th3); pause
th4 = iv4(z,[4 4 0]); present(th4); pause
th5 = iv4(z,[5 5 0]); present(th5); pause
Table 6.1 summarizes the results. Inspection shows that
if n increases from 1 to 3 both the sample variance of the
residuals and the FPE (see 5.6, p. 69) decrease.
If n changes from 3 to 4 then both quantities increase
again. It is not clear why the sample variance of the resid-
uals increases.
If subsequently n is increased from 4 to 5 the variance
and the FPEdecrease again to a value that is less than that
for n =3.
We consider the roots of the polynomials P and D as
estimated for n =5:
Roots of P: 88.6472, 0.3589 ± i 0.8137, 0.7826;
Roots of D: 0.4253 ± i 0.6716, 0.7380 ± i 0.3670, 0.2522.
We note this:
1. One of the roots of P is very large. This is caused by the small first coefficient c_0 of P. Because this coefficient does not differ significantly from 0 it may be set equal to 0.

2. The root pair 0.3589 ± i 0.8137 of P does not lie far in the complex plane from the root pair 0.4253 ± i 0.6716 of D. Canceling the polynomial factors that correspond to these root pairs in the numerator and denominator has little effect on the response of the system.

If we cancel the two root pairs then D has degree 3. The remaining roots of D are 0.7380 ± i 0.3670 and 0.2522, which exhibit a reasonable resemblance to the roots 0.7429 ± i 0.3659 and 0.1094 of the polynomial D that is estimated for n = 3.
Structure    σ̂²     FPE
2 2 1 1      0.994   1.027
2 2 2 1      0.968   1.008
2 2 3 1      0.971   1.0017

Table 6.2: Results of the estimation with armax
After cancellation of the root pair from P the roots
88.6472 and 0.7826 are left. The roots of P for n = 3 are
23.6658 and 0.5196.
These considerations show that the model for n =5 es-
sentially is the same as that for n =3.
On the basis of these findings we conjecture that D has degree 3. Inspection shows, however, that for n = 3 the estimated last coefficient a_3 of D does not deviate significantly from 0. We therefore set this coefficient equal to 0, which reduces the degree of D to 2. Furthermore we conjecture on the basis of 1 that P is of the form

    P(q) = q^{−1}(c_0 + c_1 q^{−1}).
Application of the IV method to this structure yields:
>> tht=iv4(z,[2 2 1]);
>> present(tht)
This matrix was created by the command IV4
on 12/17 1993 at 20:5 Loss fcn: 1.027
Akaikes FPE: 1.055 Sampling interval 1
The polynomial coefficients and their
standard deviations are
B =
0 1.0687 0.4660
0 0.0584 0.0784
A =
1.0000 -1.4825 0.6854
0 0.0186 0.0163
The sample variance of the residuals and the FPE are less
than for n = 3. All coefcients are signicantly different
from 0.
Testing with resid shows that the residuals for the es-
timated ARX scheme are not white. We therefore try to es-
timate an ARMAX scheme. The function armax from the
Toolbox is based on the prediction error method. We esti-
mate a model of the form D(q)X
t
=P(q)u
t
+N(q)
t
with
the structure of D and P as just found. For the degree of
N we try a number of possibilities:
th2211 = armax(z,[2 2 1 1]);
th2221 = armax(z,[2 2 2 1]);
th2231 = armax(z,[2 2 3 1]);
The parameters of the second argument such as [2 2 1
1] of armax successively are the degrees of D, P and N,
and the dead time. The sample variances and FPE values
that are found are given in Table 6.2. The smallest value
of the FPE is reached for the (2, 2, 2, 1) structure:
Scheme  σ̂²     FPE     a_1       a_2       a_3       a_4       a_5       c_0       c_1       c_2       c_3       c_4
1       9.515   9.643   0.2114                                            0.4114
                        (14.49)                                           (0.1753)
2       1.207   1.24    1.5450    0.7331                                  0.0739    1.3577
                        (0.0143)  (0.0141)                                (0.0633)  (0.0657)
3       1.069   1.113   1.3765    0.5234    0.0750                        0.0482    1.1154    0.5926
                        (0.2160)  (0.3428)  (0.1666)                      (0.0591)  (0.0669)  (0.2657)
4       1.15    1.213   0.7785    0.5329    0.7660    0.1393              0.0112    1.0389    1.2351    0.1405
                        (0.1149)  (0.1560)  (0.1002)  (0.0442)            (0.0624)  (0.0747)  (0.1467)  (0.1949)
5       0.9795  1.047   0.3730    0.1019    0.3409    0.3398    0.1083    0.0123    1.0712    1.6182    1.4661    0.6744
                        (0.1985)  (0.2134)  (0.2156)  (0.1813)  (0.1233)  (0.0578)  (0.0620)  (0.2372)  (0.3573)  (0.2428)

Table 6.1: Results IV estimation ARX scheme. (In parentheses: standard deviations of the estimates)
>> present(th2221)
This matrix was created by the command ARMAX
on 12/17 1993 at 20:19 Loss fcn: 0.968
Akaikes FPE: 1.008 Sampling interval 1
The polynomial coefficients and their
standard deviations are
B =
0 1.0774 0.4471
0 0.0571 0.0775
A =
1.0000 -1.4807 0.6812
0 0.0184 0.0152
C =
1.0000 -1.0225 0.1712
0 0.0610 0.0602
A, B and C successively are the polynomials D, P and N.
Comparison with the actual values of the coefficients of the Åström system reveals that the estimates are reasonably accurate.
A residual test with resid shows that the residuals may
be considered to be white.
The observed cancellation of corresponding factors in
D and P does not occur in the model that is estimated
with the IV method for n = 4. Subsequent application of
armax with an assumed degree 4 of D, however, quickly
leads to the conclusion that the degree of D is 2.
6.5 Identication of state models
6.5.1 Introduction
In this section we consider the very general problem of
estimating parameters inlinear state models. Many of the
models we studied so far are special cases of this problem.
The extension to state models also opens up the possi-
bility to study multivariable estimation problems, that is,
estimation problems for systems with several inputs and
outputs.
We consider state models of the form

    X_{t+1} = A X_t + B u_t + V_t,    (6.57)
    Y_t     = C X_t + D u_t + W_t,   t ∈ ℤ.    (6.58)

A ∈ ℝ^{n×n}, B ∈ ℝ^{n×k}, C ∈ ℝ^{m×n} and D ∈ ℝ^{m×k} are matrices with suitable dimensions. The n-dimensional process X_t is the state vector. The k-dimensional signal u_t is the input signal and the m-dimensional process Y_t the output signal. The input signal u_t may be a realization of a stochastic process but this is not necessary.
V_t and W_t are vector-valued white noise processes of dimensions n and m, respectively, with zero means. This means that

    V_t = [V_{1,t}  V_{2,t}  ⋯  V_{n,t}]^T,    W_t = [W_{1,t}  W_{2,t}  ⋯  W_{m,t}]^T,   t ∈ ℤ,    (6.59)

where the components V_{i,t} and W_{j,t} all are scalar zero mean white noise processes. The component processes may but need not be independent. It is assumed that

    E [ V_{t_1} ; W_{t_1} ] [ V_{t_2}^T  W_{t_2}^T ] = R   for t_1 = t_2,   and   = 0   for t_1 ≠ t_2,    (6.60)

with R an (n + m) × (n + m) symmetric matrix that may be written in the form

    R = [ R_1      R_12
          R_12^T   R_2  ].    (6.61)

R is called the variance matrix of the vector-valued white noise process V_t and W_t.
The equations (6.57–6.58) typically originate from known first principles for the system. It is required to estimate one or several unknown parameters in the system equations based on a number of observation pairs (u_t, Y_t), t = 0, 1, ..., N − 1, of the input and output signals. This problem has the previous estimation problems for AR, ARX and ARMAX models as special cases.
Example 6.5.1 (ARX model as state model). The ARX model

    Y_t − a_1 Y_{t−1} − a_2 Y_{t−2} − ⋯ − a_n Y_{t−n} = c_0 u_t + c_1 u_{t−1} + ⋯ + c_m u_{t−m} + ε_t

with ε_t the noise term, can be cast as a state model using what is called the observer canonical form,

    X_{t+1} = A X_t + B u_t,    Y_t = C X_t + D u_t + W_t,    (6.62)

with W_t = ε_t and

    A = [ 0  0  ⋯  0  a_n
          1  0  ⋯  0  a_{n−1}
          0  1  ⋯  0  a_{n−2}
          ⋮            ⋮
          0  0  ⋯  1  a_1     ],    B = [ c_n + a_n c_0
                                          c_{n−1} + a_{n−1} c_0
                                             ⋮
                                          c_1 + a_1 c_0        ],

    C = [ 0  0  ⋯  0  1 ],    D = c_0

(with c_j = 0 for j > m). Estimation of the coefficients a_i and c_j can be seen as estimation of the elements of A, B and D.
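A small MATLAB sketch of this construction, assuming coefficient vectors a = [a_1; ...; a_n] and c = [c_0; ...; c_m] with m ≤ n; the function name is illustrative.

% Sketch: build the observer canonical form (6.62) from ARX coefficients.
function [A, B, C, D] = arx2ss(a, c)
  n  = length(a);
  cf = [c(:); zeros(n + 1 - length(c), 1)];   % c_0, ..., c_n with c_j = 0 for j > m
  A  = zeros(n);
  A(2:n, 1:n-1) = eye(n-1);                   % subdiagonal of ones
  A(:, n) = a(n:-1:1);                        % last column: a_n, ..., a_1
  B  = cf(n+1:-1:2) + a(n:-1:1) * cf(1);      % c_n + a_n c_0, ..., c_1 + a_1 c_0
  C  = [zeros(1, n-1) 1];
  D  = cf(1);                                 % c_0
end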

6.5.2 Identification with the prediction error method

We apply the prediction error method for the estimation of the parameters θ that occur in the model (6.57–6.58). To make the dependency on θ explicit we write the system matrices as A_θ, B_θ, C_θ and D_θ, and similarly for other data that depends on θ. To solve the prediction error problem we use several well-known results from Kalman filtering theory (see for instance Kwakernaak and Sivan (1972) or Bagchi (1993)). Given the system

X_{t+1} = A_θ X_t + B_θ u_t + V_t,   (6.63)
Y_t = C_θ X_t + D_θ u_t + W_t,   t ∈ ℤ,   (6.64)
define X̂^θ_{t|t−1} as the best estimate of X_t based on the observations Y_0, Y_1, ..., Y_{t−1}. For Y we use a corresponding notation. Note that we consider u_t as fixed and known for all t. Then we have

X̂^θ_{t+1|t} = A_θ X̂^θ_{t|t−1} + B_θ u_t + K_θ(t)[Y_t − C_θ X̂^θ_{t|t−1} − D_θ u_t],   (6.65)

for t = 0, 1, .... The best one-step predictor X̂^θ_{t+1|t} of X_{t+1}, given Y_0, Y_1, ..., Y_t, hence may be computed recursively. The initial condition for the recursion is

X̂_{0|−1} = E X_0.   (6.66)
The sequence of gain matrices K_θ(0), K_θ(1), ... is also found recursively from the matrix equations

K_θ(t) = [A_θ Q_θ(t) C_θ^T + R_12][R_2 + C_θ Q_θ(t) C_θ^T]^{−1},   (6.67)
Q_θ(t+1) = [A_θ − K_θ(t) C_θ] Q_θ(t) A_θ^T + R_1 − K_θ(t) R_12^T,   (6.68)

for t = 0, 1, .... The symmetric matrix Q_θ(t) equals

Q_θ(t) = var(X_t − X̂^θ_{t|t−1}) = E (X_t − X̂^θ_{t|t−1})(X_t − X̂^θ_{t|t−1})^T,

and, hence, is the variance matrix of the one-step prediction of the state. The initial condition for (6.67–6.68) is

Q(0) = var X_0 = E (X_0 − E X_0)(X_0 − E X_0)^T.   (6.69)
The best one-step prediction Ŷ^θ_{t+1|t} of Y_{t+1} is

Ŷ^θ_{t+1|t} = C_θ X̂^θ_{t+1|t} + D_θ u_{t+1},   (6.70)

so that the one-step prediction error of Y_t equals

e^θ_t = Y_t − Ŷ^θ_{t|t−1} = Y_t − C_θ X̂^θ_{t|t−1} − D_θ u_t.
We summarize how the one-step prediction errors are determined:

1. Solve the recursive matrix equations (6.67–6.68) with the initial condition (6.69) for K_θ(0), K_θ(1), ..., K_θ(N−1).

2. Solve the Kalman filter equation (6.65) with the initial condition (6.66) recursively to determine the one-step predictions of the state X̂^θ_{t|t−1} for t = 0, 1, ..., N−1.

3. Compute the one-step prediction errors e^θ_t = Y_t − C_θ X̂^θ_{t|t−1} − D_θ u_t, t = 0, 1, ..., N−1.

From Kalman filtering theory it is known that the one-step prediction errors e^θ_t form a vector-valued white noise process.
The minimum prediction error method for the identification of the system (6.57–6.58) involves the minimization of the sum of the squares of the prediction errors

Σ_{t=0}^{N−1} e^{θ,T}_t e^θ_t   (6.71)

with respect to the unknown parameter vector θ that occurs in the model. This is done numerically using the optimization algorithms reviewed in § 5.5 (p. 68). The application of these methods requires that formulas are developed to compute the gradient of the prediction error e^θ_t with respect to θ.

Under reasonable conditions (such as persistent excitation) consistent estimators are obtained, like for other applications of the prediction error method. The accuracy of the estimates may be analyzed in a way that is similar to that in § 5.4.5 (p. 67).
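By way of illustration, the MATLAB function below sketches how the criterion (6.71) may be evaluated for a given parameter value by running the recursions (6.65)–(6.68). It is a sketch only: the function makesys, which maps θ into the matrices A, B, C, D, R_1, R_12, R_2, is hypothetical and must be supplied by the user for the problem at hand, as must the initial data x0 and Q0.

% Sketch: sum of squared one-step prediction errors (6.71) via (6.65)-(6.68).
function V = peCriterion(theta, u, Y, makesys, x0, Q0)
% theta   : parameter vector
% u, Y    : N x k input and N x m output observations
% makesys : hypothetical user function [A,B,C,D,R1,R12,R2] = makesys(theta)
% x0, Q0  : mean and variance matrix of the initial state X_0
[A,B,C,D,R1,R12,R2] = makesys(theta);
N = size(Y,1);  x = x0;  Q = Q0;  V = 0;
for t = 1:N
    e = Y(t,:)' - C*x - D*u(t,:)';          % one-step prediction error
    V = V + e'*e;                           % accumulate criterion (6.71)
    K = (A*Q*C' + R12) / (R2 + C*Q*C');     % gain (6.67)
    x = A*x + B*u(t,:)' + K*e;              % predictor update (6.65)
    Q = (A - K*C)*Q*A' + R1 - K*R12';       % variance update (6.68)
end
end

The value V(θ) returned by such a function can then be minimized numerically over θ, for instance with a Gauss-Newton type algorithm as in § 5.5.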
6.6 Further problems in identification theory

6.6.1 Introduction

In this section we list a number of further current problems and themes in identification theory.

6.6.2 Structure determination

In § 5.6 (p. 69) we discussed the problem of determining the correct order of ARMA schemes for time series. In system identification the same problem arises. The methods that are mentioned in § 5.6 may also be applied to system identification. We already did this in the example of § 6.4.3 (p. 83).

A problem that is closely related to structure determination is the question whether the system that is to be identified does or does not belong to the model set that is considered. Suppose for instance that we attempt to estimate a system as an ARMAX scheme with certain degrees of the various polynomials. It is very well possible that the actual system does not belong to this class of models. The question is how we may ascertain this. For the parametric identification methods we considered, the ultimate test for accepting the model that is found is whether the residuals are white.

A further problem in this context is the identifiability of the system. A system is not identifiable if within the assumed model set there are several systems that explain the observed system behavior. This occurs for instance if we attempt to explain an ARMA(3,2)-process D(q)X_t = N(q)ε_t with an ARMA(4,3)-scheme. The polynomials D and N then are allowed to have an arbitrary common factor of degree 1.

Lack of identifiability often manifests itself by large values of the estimated estimation errors of the parameters. The reason is that in case of lack of identifiability the matrix of second derivatives of the maximum likelihood function or the mean square prediction error becomes singular. Sometimes this phenomenon only becomes noticeable for long measurement series.
6.6.3 Recursive system identification

The system identification methods that were discussed so far are based on batch processing of a complete observation series. In some applications the dynamic properties of the system or process change with time. Then it may be necessary to keep observing the system, and to use each new observation to update the system model. For this purpose recursive identification methods have been developed. Often these are recursive versions of the batch-oriented algorithms that we discussed.

The best known recursive identification algorithm applies to the ARX model

y_t = a_1 y_{t−1} + a_2 y_{t−2} + ⋯ + a_n y_{t−n} + c_0 u_t + c_1 u_{t−1} + ⋯ + c_m u_{t−m} + ε_t.

We rewrite this equation in the form

y_t = φ(t) θ_t + ε_t,   (6.72)
with

φ(t) = [y_{t−1}  y_{t−2}  ⋯  y_{t−n}  u_t  u_{t−1}  ⋯  u_{t−m}],
θ_t = [a_1  a_2  ⋯  a_n  c_0  c_1  ⋯  c_m]^T.

If the parameters do not change with time then we have

θ_{t+1} = θ_t.   (6.73)
Equations (6.73) and (6.72) together define a system in state form as in (6.63–6.64):

X_{t+1} = A X_t + B u_t + V_t,
Y_t = C X_t + D u_t + W_t,   t ∈ ℤ.

We have X_t = θ_t, A = I, B = 0, V_t = 0, C = φ(t), D = 0 and W_t = ε_t. Note that the matrix C now is time-varying, and also note that this state representation of the ARX model has nothing to do with the observer canonical form of the ARX model as explained in Example 6.5.1.

The Kalman filter equations (6.65–6.68) now apply, with a small modification for the time dependence of C. For the best estimate of the parameter vector we obtain the recursive equation

θ̂_{t+1|t} = θ̂_{t|t−1} + K(t)[y_t − φ(t) θ̂_{t|t−1}],   t ≥ 0.   (6.74)
The sequence of gain matrices K(0), K(1), ... is recursively determined from the matrix equations

K(t) = Q(t) φ^T(t) [σ² + φ(t) Q(t) φ^T(t)]^{−1},   (6.75)
Q(t+1) = [I − K(t) φ(t)] Q(t) + R_1,   t ≥ 0.   (6.76)

Here we have taken R_12 = 0 and R_2 = σ², but R_1 has been left as it is.

The equations (6.74–6.76) form a recursive algorithm for the identification of the ARX scheme. For R_1 = 0 the algorithm actually is a recursive implementation of the least squares algorithm of § 6.3.2 (p. 80), see also Section 5.2.2. If the input signal is persistently exciting then for R_1 = 0 the estimates of the parameters converge to the correct values.

If the parameters are not constant but vary (slowly) with time then this may be accounted for by choosing R_1 different from 0 (but positive). This corresponds to the model

θ_{t+1} = θ_t + V_t,   (6.77)

with V_t vector-valued white noise. Each parameter is modeled as a random walk. In the estimation algorithm (6.74–6.76) the gain matrix K(t) now does not approach 0, as is the case for R_1 = 0, but the algorithm continues to update the estimates.
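A minimal MATLAB sketch of the recursion (6.74)–(6.76) for an ARX model with n = m = 1 could look as follows. The data vectors y and u, the noise variance σ², the drift variance matrix R_1 and the initializations are example choices and assumptions, not prescribed by the algorithm.

% Sketch: recursive identification (6.74)-(6.76) of an ARX scheme (n = 1, m = 1).
N = length(y);                  % y, u: observed output and input (assumed given)
sig2 = 1;                       % assumed noise variance sigma^2 (R_2)
R1   = 1e-4*eye(3);             % drift variance; R1 = 0 gives recursive least squares
th   = zeros(3,1);              % estimate of [a_1; c_0; c_1]
Q    = 100*eye(3);              % initial Q(0), large: little prior knowledge
for t = 2:N
    phi = [y(t-1) u(t) u(t-1)];           % regression row vector phi(t)
    K   = Q*phi' / (sig2 + phi*Q*phi');   % gain (6.75)
    th  = th + K*(y(t) - phi*th);         % update (6.74)
    Q   = (eye(3) - K*phi)*Q + R1;        % update (6.76)
end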
Figure 6.10: System in feedback loop (a controller in series with the system; input u, output y, disturbance v).
6.6.4 Design of identification experiments

In system identification the accuracy of the estimates that are found depends on the properties of the input signal. The input signal needs to be sufficiently rich to acquire the desired information about the system. An important limitation often is that the input signal amplitude cannot be larger than the system can absorb. The problem of how to select the input signal, if this freedom is available, is known as the problem of the design of identification experiments.

6.6.5 Identification in the closed loop

It sometimes happens that the system that needs to be identified is part of a feedback loop as in Fig. 6.10. By the feedback connection the input signal u is no longer uncorrelated with the noise v. This causes special difficulties.

6.6.6 Robustification

Commonly, observed time series and signal records contain flaws, such as missing data points or observations that are completely wrong (outliers). To spot and possibly correct such exceptions it is imperative to plot the data and inspect them for missing data points, outliers and other anomalies before further processing.

If minimum prediction error methods are used then the estimates may be robustified by weighting outliers less heavily. Similar techniques may be conceived for other identification techniques. Robustification is a subject that is extensively studied in statistics.
6.7 Problems

6.1 Cross periodogram. Prove that the cross periodogram (6.32) is the DCFT of the estimate (6.21) of the cross covariance function of Y_t and U_t with M = N. Hint: Compare Proof A.1.10 (p. 94).

6.2 Consider the diagram of Fig. 6.1 (p. 75) and assume that the system is a convolution z_t = Σ_m h_m u_{t−m}. Why is it not a good idea to try to identify the system with the wide sense stationary input u_t generated via the MA(2) scheme

u_t = (1 + q^{−2}) ε_t.
6.3 Mean-square error linear filtering. Consider

Y_t = Σ_{m=−∞}^{∞} h_m U_{t−m} + V_t

and suppose that Y_t, U_t and V_t are zero mean wide sense stationary processes. In Subsection 6.2.4 it is shown that H(e^{iω}) = φ_{yu}(ω)/φ_u(ω) provided that U_t and V_t are uncorrelated processes (i.e. E U_t V_n = 0 for all t, n).

In the absence of knowledge of properties of V_t it makes sense to try to explain Y_t as much as possible by the input U_t. Therefore consider

min_{h_m, m∈ℤ} E ( Y_t − Σ_{m=−∞}^{∞} h_m U_{t−m} )².   (6.78)

This is an example of linear filtering. Show that the DCFT ĥ(ω) of the h_m that minimize this expression is determined by the linear equation

ĥ(ω) = φ_{yu}(ω)/φ_u(ω).   (6.79)

Note: Recall Eqn. (6.19). The above equation, when seen as an equation in ĥ(ω), is the frequency domain version of the Wiener-Hopf equation. The so obtained h_m need not correspond to a causal system, that is, the inverse DCFT h_m of φ_{yu}(ω)/φ_u(ω) may be nonzero for certain m < 0. Minimizing (6.78) with respect to causal systems h_m is more involved; it is the famous Wiener filtering problem, related to Kalman filtering. A fairly straightforward way to circumvent the problem of causality is to consider FIR systems, considered in the following problem.
6.4 Causal FIR systems. It is well known that discrete time systems of the form Z_t = Σ_{m=−∞}^{∞} h_m U_{t−m} are causal if and only if h_m = 0 for all m < 0, that is, if

Z_t = Σ_{m=0}^{∞} h_m U_{t−m}.

We say the system is FIR (finite impulse response) if only a finite number of the values h_m, m ∈ ℤ, is nonzero. Causal FIR systems thus can be expressed as a finite sum

Z_t = Σ_{m=0}^{M−1} h_m U_{t−m},   M < ∞.

FIR systems are popular for their simplicity and inherent stability.

a) Show that FIR systems are BIBO-stable.

b) Given zero mean wide sense stationary Y_t and U_t, show that h_m, m = 0, 1, ..., M−1, minimizes E(Y_t − Σ_{m=0}^{M−1} h_m U_{t−m})² if and only if

[ r_u(0)     r_u(1)     ⋯  r_u(M−1) ;
  r_u(1)     r_u(0)     ⋯  r_u(M−2) ;
  ⋮                     ⋱  ⋮        ;
  r_u(M−1)   r_u(M−2)   ⋯  r_u(0)   ]  [ h_0 ; h_1 ; ⋮ ; h_{M−1} ]  =  [ r_{yu}(0) ; r_{yu}(1) ; ⋮ ; r_{yu}(M−1) ],   (6.80)

where the M × M matrix on the left-hand side is denoted R_u.

c) Suppose the h_m minimize the above expected squared error. Let E_t = Y_t − Σ_{m=0}^{M−1} h_m U_{t−m}. Show that

σ²_E = σ²_Y − h^T R_u h,   (6.81)

where h is the vector h = [h_0 ⋯ h_{M−1}]^T and R_u is the covariance matrix defined in Eqn. (6.80).

Note the resemblance of (6.80) with the Yule-Walker equations (2.63). The expression (6.81) shows that σ_E ≤ σ_Y, as is to be expected. (A MATLAB sketch for setting up (6.80) is given at the end of this section.)
6.5 Estimation of the ARX model.³ Argue that the maximum likelihood method and the prediction error method lead to (approximately) the same estimators for the parameters of the ARX model as the least squares method.
6.6 Schemes of Box-Jenkins and Ljung. The ARMAX scheme

Y_t = (P(q)/D(q)) u_t + (N(q)/D(q)) ε_t   (6.82)

is a very general model for linear time-invariant systems. Several variants exist.

a) The Box-Jenkins scheme is given by

Y_t = (P(q)/D(q)) u_t + (N(q)/R(q)) ε_t.   (6.83)

   i. Show that by reducing the fractions on the right-hand side to the common denominator D(q)R(q) the scheme may be converted to an ARMAX scheme. Some structural information is lost this way, however.
   ii. Determine the one-step predictor and the resulting prediction error for this scheme.

b) Ljung (1987) works with the general model

D(q) Y_t = (P(q)/Q(q)) u_t + (N(q)/R(q)) ε_t.   (6.84)

   i. Show that also this scheme may be converted to an ARMAX scheme.
   ii. Determine the one-step predictor and the resulting prediction error for this scheme.

³ Examination May 30, 1995.
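For the computer laboratory exercises, the normal equations (6.80) of Problem 6.4 may be set up and solved in MATLAB roughly as follows. This is a sketch only; it assumes that estimates ru of r_u(0), ..., r_u(M−1), ryu of r_{yu}(0), ..., r_{yu}(M−1) and sY2 of var(Y_t) are already available, for instance from covf.

% Sketch: solve the FIR normal equations (6.80) for h_0,...,h_{M-1}.
% ru  : vector with r_u(0),...,r_u(M-1)    (assumed available)
% ryu : vector with r_yu(0),...,r_yu(M-1)  (assumed available)
Ru = toeplitz(ru(:));        % symmetric Toeplitz covariance matrix of (6.80)
h  = Ru \ ryu(:);            % FIR coefficients h_0,...,h_{M-1}
sE2 = sY2 - h'*Ru*h;         % residual variance (6.81); sY2 estimates var(Y_t)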
A Proofs

This appendix contains a number of proofs and additional technical results.

Chapter 2
Lemma A.1.1 (Cauchy-Schwarz). For stochastic variables X, Y with finite second order moments E(X²) < ∞, E(Y²) < ∞, there holds that

1. |E(XY)|² ≤ E(X²) E(Y²);
2. |E(XY)|² = E(X²) E(Y²) if and only if Y = 0 or X = kY for some k ∈ ℝ.

Proof. Let σ²_X := E(X²) and σ²_Y := E(Y²). The result is trivial if X = 0 or Y = 0. Consider X ≠ 0 and Y ≠ 0. Then for any α ∈ ℝ,

0 ≤ E ( X/σ_X − α Y/σ_Y )²   (A.1)
  = E ( X²/σ²_X − 2α XY/(σ_X σ_Y) + α² Y²/σ²_Y )
  = 1 − 2α E(XY)/(σ_X σ_Y) + α².

As the above is nonnegative it follows that

2α E(XY)/(σ_X σ_Y) ≤ 1 + α².   (A.2)

For α = ±1 this reads |E(XY)|/(σ_X σ_Y) ≤ 1. This proves Condition 1. If equality |E(XY)|² = E(X²) E(Y²) holds then for α = sgn(E(XY)) the expression (A.1) is zero, so necessarily X/σ_X − α Y/σ_Y = 0. Take k = α σ_X/σ_Y.
Proof A.1.2 (Lemma 2.3.1). If r(τ) is absolutely summable then its spectral density φ(ω) exists and r(0) = (1/2π) ∫_{−π}^{π} φ(ω) dω < ∞. Therefore ψ(ω) := √(φ(ω)) is square integrable and hence its inverse Fourier transform, call it h_t, is square summable. The spectral density of Y := h ∗ ε equals σ²|ψ|² = σ²φ, hence, modulo scaling, Y_t has covariance function r(τ).
Proof A.1.3 (Levinson-Durbin algorithm (p. 14)). For any n let θ_n denote the column vector of coefficients θ_n = [a_{n1}  a_{n2}  ⋯  a_{nn}]^T determined by (2.64). Then θ_{n+1} by definition satisfies

P_{n+1} θ_{n+1} = v_{n+1},   (A.3)

where

P_{n+1} = [ ρ(0)    ρ(1)    ⋯  ρ(n−1)  ρ(n)   ;
            ρ(1)    ρ(0)    ⋯  ρ(n−2)  ρ(n−1) ;
            ⋮                ⋱          ⋮      ;
            ρ(n−1)  ρ(n−2)  ⋯  ρ(0)    ρ(1)   ;
            ρ(n)    ρ(n−1)  ⋯  ρ(1)    ρ(0)   ],
v_{n+1} = [ ρ(1) ; ρ(2) ; ⋮ ; ρ(n) ; ρ(n+1) ].

For notational convenience we introduced here besides θ_{n+1} also the short-hands P_{n+1} and v_{n+1} (for any n). We further need the matrix J_n defined as the n × n anti-diagonal matrix

J_n = [ 0  ⋯  0  1 ;
        0  ⋯  1  0 ;
           ⋰       ;
        1  0  ⋯  0 ].   (A.4)

Premultiplication by J reverses the order of the rows, so J M is M but with its rows reversed. Postmultiplication as in M J reverses the columns of M. Now, an interesting property of P_n is that it is symmetric and constant along diagonals. This fact implies that

J_n P_n = P_n J_n,   (A.5)

that is, reversing the order of the rows is the same as reversing the columns of P_n. You may want to verify this.

Assume we have determined θ_n, that is, we solved θ_n from

P_n θ_n = v_n.   (A.6)

Equation (A.3) may be expressed in block-partitioned matrices as

[ P_n      J v_n ;
  v_n^T J  ρ(0)  ] θ_{n+1} = [ v_n ; ρ(n+1) ].   (A.7)

Since v_n = P_n θ_n and (J P_n) θ_n = (P_n J) θ_n we get that

[ P_n      P_n J θ_n ;
  v_n^T J  ρ(0)      ] θ_{n+1} = [ P_n θ_n ; ρ(n+1) ].   (A.8)

Split θ_{n+1} as θ_{n+1} = [ s ; a_{n+1,n+1} ] with s ∈ ℝ^n, then

[ P_n      P_n J θ_n ;
  v_n^T J  ρ(0)      ] [ s ; a_{n+1,n+1} ] = [ P_n θ_n ; ρ(n+1) ].   (A.9)

From the top row-block we can solve for s,

s = θ_n − a_{n+1,n+1} J θ_n,   (A.10)

then a_{n+1,n+1} follows from the bottom row (inserting ρ(0) = 1),

v_n^T J (θ_n − a_{n+1,n+1} J θ_n) + a_{n+1,n+1} = ρ(n+1).   (A.11)

This gives the partial correlation coefficient

a_{n+1,n+1} = ( ρ(n+1) − v_n^T J θ_n ) / ( 1 − v_n^T θ_n )   (A.12)

and then the other coefficients s of θ_{n+1} = [ s ; a_{n+1,n+1} ] follow from (A.10).
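The recursion of this proof translates almost literally into MATLAB. The following sketch computes θ_1, θ_2, ..., θ_{nmax} from given correlations ρ(0) = 1, ρ(1), ..., ρ(nmax); the vector rho (with rho(1) = ρ(0)) is assumed to be supplied by the user.

% Sketch: Levinson-Durbin recursion (A.10)-(A.12).
% rho : vector with rho(1) = rho_0 = 1, rho(2) = rho_1, ...  (assumed given)
nmax  = length(rho) - 1;
theta = rho(2);                            % theta_1 = a_{11} = rho(1)
for n = 1:nmax-1
    v = rho(2:n+1); v = v(:);              % v_n = [rho(1); ...; rho(n)]
    J = flipud(eye(n));                    % anti-diagonal matrix J_n of (A.4)
    k = (rho(n+2) - v'*J*theta) / (1 - v'*theta);   % a_{n+1,n+1}, Eqn. (A.12)
    s = theta - k*J*theta;                 % Eqn. (A.10)
    theta = [s; k];                        % theta_{n+1}
end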
Chapter 3

Proof A.1.4 (Cramér-Rao inequality, (p. 35)). Denote the joint probability density function of the stochastic variables X_1, X_2, ..., X_N as f(x, θ). By the definition of the probability density function and the assumed unbiasedness of the estimator we have

1 = ∫_{ℝ^N} f(x, θ) dx,   θ = ∫_{ℝ^N} s(x) f(x, θ) dx.

The integrals are multiple integrals with respect to x_1, x_2, ..., x_N. Partial differentiation of the two expressions with respect to θ yields

0 = ∫_{ℝ^N} f_θ(x, θ) dx,   1 = ∫_{ℝ^N} s(x) f_θ(x, θ) dx,   (A.13)

where the subscript θ denotes partial differentiation with respect to θ. By the substitution

f_θ(x, θ) = ( ∂/∂θ log f(x, θ) ) f(x, θ) = L_θ(x, θ) f(x, θ)   (A.14)

it follows that

0 = ∫_{ℝ^N} L_θ(x, θ) f(x, θ) dx,   1 = ∫_{ℝ^N} s(x) L_θ(x, θ) f(x, θ) dx.   (A.15)

These equalities may be rewritten as

0 = E L_θ(X, θ),   1 = E[S L_θ(X, θ)],   (A.16)

with S = s(X). By subtracting θ times the first equality from the second we find

E[(S − θ) L_θ(X, θ)] = 1.   (A.17)

Then by the Cauchy-Schwarz inequality,

1 ≤ E(S − θ)² · E[L_θ(X, θ)]².   (A.18)

This proves (3.15) with M(θ) = E[L_θ(X, θ)]². It remains to prove the other equality of (3.16). Partial differentiation of the first equality of (A.15) yields

0 = ∫_{ℝ^N} L_{θθ}(x, θ) f(x, θ) dx + ∫_{ℝ^N} L_θ(x, θ) f_θ(x, θ) dx.   (A.19)

With (A.14) we find from this that

0 = ∫_{ℝ^N} L_{θθ}(x, θ) f(x, θ) dx + ∫_{ℝ^N} [L_θ(x, θ)]² f(x, θ) dx,

or

M(θ) = E[L_θ(X, θ)]² = −E L_{θθ}(X, θ).

This completes the proof of (3.15–3.16). The Cauchy-Schwarz inequality (A.18) is an equality if and only if

L_θ(x, θ) = k(θ)(s(x) − θ)   for all x ∈ ℝ^N,   (A.20)

for some k (which may depend on θ).
Proof A.1.5 (Cramér-Rao inequality for the vector case, (p. 36)). Denote the joint probability density function of the stochastic variables X_1, X_2, ..., X_N as f(x, θ). By the definition of the probability density function and the assumed unbiasedness of the estimator we have

∫_{ℝ^N} f(x, θ) dx = 1,   ∫_{ℝ^N} s(x) f(x, θ) dx = θ.

The integrals are multiple integrals with respect to x_1, x_2, ..., x_N. Partial differentiation of the two expressions with respect to the (vector) θ yields

∫_{ℝ^N} f_{θ^T}(x, θ) dx = 0,   ∫_{ℝ^N} s(x) f_{θ^T}(x, θ) dx = I,   (A.21)

where the subscript θ^T denotes partial differentiation with respect to the row vector θ^T. By the substitution

f_{θ^T}(x, θ) = f(x, θ) ( ∂/∂θ^T log f(x, θ) ) = f(x, θ) L_{θ^T}(x, θ)   (A.22)

it follows that

E L_{θ^T} = ∫_{ℝ^N} f(x, θ) L_{θ^T}(x, θ) dx = 0,   (A.23)
E(S L_{θ^T}) = ∫_{ℝ^N} s(x) f(x, θ) L_{θ^T}(x, θ) dx = I.

By subtracting θ times the first equality from the second we find

E[(S − θ) L_{θ^T}] = I.   (A.24)

Note that

E [ S − θ ; L_θ ] [ (S − θ)^T  L_{θ^T} ] = [ var(S)  I ;
                                             I        M ].   (A.25)

Here M = E L_θ L_{θ^T} is Fisher's information matrix. By construction matrix (A.25) is nonnegative definite,

[ var(S)  I ;
  I        M ] ≥ 0.   (A.26)

Then so is

[ I  −M^{−1} ] [ var(S)  I ;  I  M ] [ I ; −M^{−1} ] = var(S) − M^{−1} ≥ 0.

This is what we set out to prove. It also shows that

var(S) − M^{−1} = [ I  −M^{−1} ] ( E [ S − θ ; L_θ ][ (S − θ)^T  L_{θ^T} ] ) [ I ; −M^{−1} ]
  = E ( S − θ − M^{−1} L_θ )( S − θ − M^{−1} L_θ )^T.

Therefore var(S) = M^{−1} if and only if S − θ − M^{−1} L_θ = 0, that is, if and only if L_θ = M(S − θ).
It remains to prove the other equality of (3.20). Partial differentiation of the first equality of (A.23) with respect to the column vector θ yields

0 = ∫_{ℝ^N} L_{θθ^T}(x, θ) f(x, θ) dx + ∫_{ℝ^N} f_θ(x, θ) L_{θ^T}(x, θ) dx.   (A.27)

With (A.22) we find from this that

0 = ∫_{ℝ^N} L_{θθ^T}(x, θ) f(x, θ) dx + ∫_{ℝ^N} L_θ(x, θ) f(x, θ) L_{θ^T}(x, θ) dx,

which means that

E[L_θ(X, θ) L_{θ^T}(X, θ)] = −E L_{θθ^T}(X, θ).   (A.28)
Proof A.1.6 (Lemma 3.4.2). The expectation of KX is

E(KX) = E(K(Wθ + ε)) = KWθ + K E(ε) = KWθ,

so the estimator KX is unbiased (for arbitrary θ) if and only if KW = I.

It is interesting that a sum of squares such as E‖θ̂ − θ‖² = E Σ_j (θ̂_j − θ_j)² can also be expressed as the sum of diagonal elements of the corresponding variance matrix E(θ̂ − θ)(θ̂ − θ)^T. Verify this. The sum of diagonal elements is called the trace and is denoted with tr.

For any unbiased estimator θ̂ = KX of θ the sum of variances can now be expressed as

E‖θ̂ − θ‖² = E‖K(Wθ + ε) − θ‖²   (A.29)
  = E‖Kε‖² = tr(K E(εε^T) K^T) = σ² tr(K K^T).

Write K as

K = (W^T W)^{−1} W^T + L

with L not yet determined. Now for unbiasedness we need that KW = I,

I = KW = ( (W^T W)^{−1} W^T + L ) W = I + LW.

Hence LW = 0. As a result we have that

K K^T = ( (W^T W)^{−1} W^T + L )( (W^T W)^{−1} W^T + L )^T   (A.30)
  = (W^T W)^{−1} + L L^T.

It is now direct that tr(K K^T) is minimal iff L = 0, i.e., (A.29) is minimized for K_opt = (W^T W)^{−1} W^T. With this notation we get that

K K^T = K_opt K_opt^T + L L^T ≥ K_opt K_opt^T.

This finally shows that cov(KX) ≥ cov(K_opt X) for any unbiased estimator KX of θ.
Chapter 4

Proof A.1.7 (Eqn. (4.3)). In a sequence of N elements there are N − 2 triples (x_{k−1}, x_k, x_{k+1}). Per triple (a, b, c) of distinct values there are six equally likely orderings (each of a, b and c may be the largest, the smallest or the medium value). Of these 6 orderings there are 4 in which the middle value b is either the largest or the smallest, that is, 4 with a turning point. So per triple the expected number of turning points is 4/6 = 2/3. There are N − 2 triples, so that the expected number of turning points equals (2/3)(N − 2).
Proof A.1.8 (Variance of the estimator (4.18) for the mean). For the variance of m̂_N we have

var(m̂_N) = E(m̂_N − m)²
  = E ( (1/N) Σ_{t=0}^{N−1} (X_t − m) )²
  = E ( (1/N) Σ_{t=0}^{N−1} (X_t − m) ) ( (1/N) Σ_{s=0}^{N−1} (X_s − m) )
  = (1/N²) Σ_{t=0}^{N−1} Σ_{s=0}^{N−1} E (X_t − m)(X_s − m)
  = (1/N²) Σ_{t=0}^{N−1} Σ_{s=0}^{N−1} r(t − s).

Let k = t − s. For a fixed 0 ≤ k ≤ N−1 there are N − k pairs (t, s) for which t − s = k,

(t, s) ∈ {(k, 0), (k+1, 1), ..., (N−1, N−k−1)}.   (A.31)

For reasons of symmetry there are N − |k| such pairs if −(N−1) ≤ k ≤ 0. Therefore

var(m̂_N) = (1/N²) Σ_{k=−N+1}^{N−1} (N − |k|) r(k) = (1/N) Σ_{k=−N+1}^{N−1} (1 − |k|/N) r(k).
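As a quick illustration, the final expression can be evaluated numerically. The sketch below does so in MATLAB for the (arbitrarily chosen) covariance function r(k) = a^{|k|} of an AR(1)-type process; N and a are example values.

% Sketch: variance of the sample mean from the formula above.
N = 100;  a = 0.8;                  % example values (assumed)
k = -(N-1):(N-1);
r = a.^abs(k);                      % example covariance function r(k) = a^|k|
varm = sum((1 - abs(k)/N).*r)/N     % var(m_N) according to the formula above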
Proof A.1.9 (Covariance of the estimator (4.32) for the covariance function). For k ≥ 0 and k′ ≥ 0 we have

cov(r̂_N(k), r̂_N(k′))
  = E (r̂_N(k) − E r̂_N(k)) (r̂_N(k′) − E r̂_N(k′))
  = E r̂_N(k) r̂_N(k′) − E r̂_N(k) E r̂_N(k′)
  = (1/N²) Σ_{t=0}^{N−1−k} Σ_{s=0}^{N−1−k′} E X_{t+k} X_t X_{s+k′} X_s − (1 − k/N)(1 − k′/N) r(k) r(k′).   (A.32)
With § 3.2 (p. 38) it follows that

E X_{t+k} X_t X_{s+k′} X_s = E X_{t+k} X_t · E X_{s+k′} X_s + E X_{t+k} X_{s+k′} · E X_t X_s + E X_{t+k} X_s · E X_t X_{s+k′}
  = r(k) r(k′) + r(t − s + k − k′) r(t − s) + r(t − s + k) r(t − s − k′).

Substitution of this into (A.32) and evaluation yields

cov(r̂_N(k), r̂_N(k′)) = (1/N²) Σ_{t=0}^{N−1−k} Σ_{s=0}^{N−1−k′} [ r(t − s + k − k′) r(t − s) + r(t − s + k) r(t − s − k′) ].

Substituting s = t − i into the second summation and interchanging the order of summation yields

cov(r̂_N(k), r̂_N(k′)) = (1/N) Σ_{i=−N+1+k′}^{N−1−k} w_{N,k,k′}(i) [ r(i + k − k′) r(i) + r(i + k) r(i − k′) ].   (A.33)

Here the function w_{N,k,k′} is defined by

w_{N,k,k′}(i) = (N − k′ + i)/N   for i ≤ 0,
                (N − k − i)/N    for i ≥ k′ − k,
                (N − k′)/N       for 0 < i < k′ − k.   (A.34)

For simplicity it is assumed here that k′ ≥ k (otherwise interchange the roles of k′ and k). If k = k′ then (A.33) reduces to (4.34). For N ≫ max(k, k′) we have w_{N,k,k′}(i) ≈ 1 and (A.33) simplifies to (4.36).
Proof A.1.10 (Relation between the estimator (4.42) and periodogram (4.43)). We write the function p_N of (4.43) as

p_N(ω) = (1/N) | Σ_{t=0}^{N−1} X_t e^{−iωt} |²
  = (1/N) ( Σ_{t=0}^{N−1} X_t e^{−iωt} ) ( Σ_{s=0}^{N−1} X_s e^{iωs} )
  = (1/N) Σ_{t=0}^{N−1} Σ_{s=0}^{N−1} X_t X_s e^{−iω(t−s)}.

The above double sum equals the sum of all entries in the N × N matrix

(1/N) [ X_0 X_0 e^{0}      X_0 X_1 e^{iω}    X_0 X_2 e^{i2ω}   ⋯ ;
        X_1 X_0 e^{−iω}    X_1 X_1 e^{0}     X_1 X_2 e^{iω}    ⋯ ;
        X_2 X_0 e^{−i2ω}   X_2 X_1 e^{−iω}   X_2 X_2 e^{0}     ⋯ ;
        ⋮                  ⋮                 ⋮                 ⋱ ].   (A.35)

On each diagonal the exponential term is constant. If we sum the entries of the matrix per diagonal we get

Σ_{k=−N+1}^{N−1} ( (1/N) Σ_{t=0}^{N−|k|−1} X_{t+|k|} X_t ) e^{−iωk} = Σ_{k=−N+1}^{N−1} r̂_N(k) e^{−iωk} = φ̂_N(ω).   (A.36)
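The identity (A.36) is easily checked numerically. The MATLAB sketch below compares the periodogram p_N(ω) with the sum Σ_k r̂_N(k) e^{−iωk} at one frequency, for an arbitrary data vector; the data and the frequency are example choices.

% Sketch: numerical check of (A.36) at one frequency omega.
N = 64;  X = randn(N,1);  w = 0.9;        % arbitrary data and frequency (assumed)
t = (0:N-1)';
pN = abs(sum(X.*exp(-1i*w*t)))^2 / N;     % periodogram (4.43)
r  = zeros(2*N-1,1);                      % biased covariance estimates r_N(k)
for k = -(N-1):(N-1)
    r(k+N) = sum(X(1+abs(k):N).*X(1:N-abs(k)))/N;
end
phiN = real(sum(r.*exp(-1i*w*(-(N-1):(N-1))')));   % DCFT of r_N at omega
% pN and phiN agree up to rounding errors.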
Proof A.1.11 (Lemma 4.6.2). For the moment assume that X_t is also white, X_t = ε_t, i.e. that r(k) = 0 for all k ≠ 0. Using the fact that φ̂(ω) equals the periodogram we get that

E φ̂_ε(ω_1) φ̂_ε(ω_2) = (1/N²) Σ_{t=0}^{N−1} Σ_{s=0}^{N−1} Σ_{u=0}^{N−1} Σ_{v=0}^{N−1} E(ε_t ε_s ε_u ε_v) e^{−iω_1(t−s)} e^{−iω_2(u−v)}.   (A.37)

(We skipped the subscript N and added the subscript ε to express that this is for white noise only.) In order to compute its expectation we first use (3.23) to find the expectation of ε_t ε_s ε_u ε_v,

E ε_t ε_s ε_u ε_v = E ε_t ε_s · E ε_u ε_v + E ε_t ε_u · E ε_s ε_v + E ε_t ε_v · E ε_s ε_u
  = σ⁴ ( δ_{t−s} δ_{u−v} + δ_{t−u} δ_{s−v} + δ_{t−v} δ_{s−u} ),

where δ_z denotes the Kronecker delta. Insert this in (A.37) to get

E φ̂_ε(ω_1) φ̂_ε(ω_2)
  = (1/N²) [ Σ_{t=s=0}^{N−1} Σ_{u=v=0}^{N−1} σ⁴ + Σ_{t=u=0}^{N−1} Σ_{s=v=0}^{N−1} σ⁴ e^{−i(ω_1+ω_2)(t−s)} + Σ_{t=v=0}^{N−1} Σ_{s=u=0}^{N−1} σ⁴ e^{−i(ω_1−ω_2)(t−s)} ]
  = σ⁴ + (σ⁴/N²) | Σ_{t=0}^{N−1} e^{−i(ω_1+ω_2)t} |² + (σ⁴/N²) | Σ_{t=0}^{N−1} e^{−i(ω_1−ω_2)t} |²
  = σ⁴ + (σ⁴/N²) ( sin((ω_1+ω_2)N/2) / sin((ω_1+ω_2)/2) )² + (σ⁴/N²) ( sin((ω_1−ω_2)N/2) / sin((ω_1−ω_2)/2) )²   (A.38)
  = σ⁴ ( 1 + (1/N) W_B(ω_1+ω_2) + (1/N) W_B(ω_1−ω_2) ).   (A.39)

Indeed, the quotient of the two sinusoids here is Bartlett's spectral window W_B for M = N as given by (4.53). Its plot is shown in Fig. 4.6. From this plot it should be intuitively clear (it can also easily be proved) that, still with M = N,

lim_{N→∞} (1/N) W_B(ω) = δ_ω,   ω ∈ (−2π, 2π).

This together with the fact that φ̂_ε is asymptotically unbiased gives us

lim_{N→∞} cov(φ̂_ε(ω_1), φ̂_ε(ω_2)) = lim_{N→∞} E φ̂_ε(ω_1) φ̂_ε(ω_2) − E φ̂_ε(ω_1) E φ̂_ε(ω_2)
  = σ⁴ ( δ_{ω_1+ω_2} + δ_{ω_1−ω_2} ),   ω_1, ω_2 ∈ (−π, π).

In the limit the covariance hence is zero almost everywhere, but it equals σ⁴ if ω_1 = ±ω_2 ≠ 0. For white noise we have σ⁴ = φ_ε²(ω), so both conditions of Lemma 4.6.2 are verified for the white noise case X_t = ε_t.
The proof of the non-white case proceeds as follows. Since we assume X_t to be stationary normally distributed it must be that X_t = Σ_{k=−∞}^{∞} h_k ε_{t−k} for some square summable sequence h_j. (This is a generalization of (3.1).) We know that then φ_X(ω) = |H(ω)|² φ_ε(ω). Now for the estimates it may be shown that

φ̂_X(ω) = |H(ω)|² φ̂_ε(ω) + O(1/√N).

This is enough to prove Lemma 4.6.2. Consider

lim_{N→∞} cov(φ̂_X(ω_1), φ̂_X(ω_2))
  = lim_{N→∞} E φ̂_X(ω_1) φ̂_X(ω_2) − φ_X(ω_1) φ_X(ω_2)
  = lim_{N→∞} E |H(ω_1)|² φ̂_ε(ω_1) |H(ω_2)|² φ̂_ε(ω_2) − φ_X(ω_1) φ_X(ω_2)
  = |H(ω_1)|² |H(ω_2)|² σ⁴ ( 1 + δ_{ω_1+ω_2} + δ_{ω_1−ω_2} ) − φ_X(ω_1) φ_X(ω_2)
  = φ_X(ω_1) φ_X(ω_2) ( δ_{ω_1+ω_2} + δ_{ω_1−ω_2} ),

where we used |H(ω_1)|² |H(ω_2)|² σ⁴ = φ_X(ω_1) φ_X(ω_2). Inserting ω_1 = ω_2 and ω_1 = −ω_2 yields the two conditions of Lemma 4.6.2.
Proof A.1.12 (Lemma 4.6.4).

E φ̂^w_N(ω_1) φ̂^w_N(ω_2)
  = E [ (1/2π) ∫_{−π}^{π} W(ω_1 − ν) φ̂_N(ν) dν · (1/2π) ∫_{−π}^{π} W(ω_2 − μ) φ̂_N(μ) dμ ]
  = (1/(2π)²) ∫∫_{−π}^{π} W(ω_1 − ν) W(ω_2 − μ) E[ φ̂_N(ν) φ̂_N(μ) ] dν dμ.

According to (A.39) we have that E φ̂_N(ν) φ̂_N(μ) = σ⁴ (1 + (1/N) W_B(ν+μ) + (1/N) W_B(ν−μ)) (for N = M). Now, as N → ∞ the function W_B(ω) approaches 2π Σ_k δ(ω − 2πk), with δ(ω) denoting the Dirac delta function. This is because of (4.50) and the fact that W_B has period 2π and for any ε > 0 the integral ∫_{ε≤|ω|≤π} W_B(ω) dω → 0 as N → ∞. Then (think about it),

lim_{N→∞} E φ̂^w_N(ω_1) φ̂^w_N(ω_2)
  = (1/(2π)²) ∫∫_{−π}^{π} W(ω_1 − ν) W(ω_2 − μ) [ σ⁴ ( 1 + (2π/N) δ(ν + μ) + (2π/N) δ(ν − μ) ) ] dν dμ.

Without the two delta terms the right-hand side actually is φ(ω_1) φ(ω_2) = σ⁴! So

lim_{N→∞} cov(φ̂_X(ω_1), φ̂_X(ω_2)) = lim_{N→∞} E φ̂^w_N(ω_1) φ̂^w_N(ω_2) − σ⁴
  = (σ⁴/(2πN)) ∫∫_{−π}^{π} W(ω_1 − ν) W(ω_2 − μ) [ δ(ν + μ) + δ(ν − μ) ] dν dμ
  = (σ⁴/(2πN)) ∫_{−π}^{π} W(ω_1 − ν) [ W(ω_2 − ν) + W(ω_2 + ν) ] dν.

In the second equality we used the sifting property of delta functions which says that ∫_{−∞}^{∞} f(t) δ(t − a) dt = f(a) for any function f(t) that is continuous at t = a.
Proof A.1.13 (Inverse DDFT (4.67)). For given x_k define x̂_n := Σ_{k=0}^{L−1} x_k e^{−i(2π/L)nk}. Then

(1/L) Σ_{n=0}^{L−1} x̂_n e^{i(2π/L)nk}
  = (1/L) Σ_{n=0}^{L−1} ( Σ_{m=0}^{L−1} x_m e^{−i(2π/L)nm} ) e^{i(2π/L)nk}
  = (1/L) Σ_{m=0}^{L−1} x_m Σ_{n=0}^{L−1} e^{i(2π/L)n(k−m)}
  = (1/L) Σ_{m=0}^{L−1} x_m Σ_{n=0}^{L−1} ( e^{i(2π/L)(k−m)} )^n
  = (1/L) Σ_{m=0}^{L−1} x_m · [ (1 − (e^{i(2π/L)(k−m)})^L) / (1 − e^{i(2π/L)(k−m)})  if e^{i(2π/L)(k−m)} ≠ 1,  and  L  if e^{i(2π/L)(k−m)} = 1 ]
  = (1/L) Σ_{m=0}^{L−1} x_m · [ 0 if m ≠ k,  and  L if m = k ]
  = x_k.
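A direct numerical verification of this inverse formula, and of its agreement with MATLAB's built-in fft/ifft pair (which uses the same sign convention), may be sketched as follows; the sequence is an arbitrary example.

% Sketch: check the inverse DDFT formula against MATLAB's fft/ifft.
L = 8;  x = randn(L,1);                    % arbitrary sequence (assumed)
n = (0:L-1)';  k = (0:L-1);
F = exp(-1i*2*pi/L*(n*k));                 % DDFT matrix, xhat = F*x
xhat = F*x;
xrec = (1/L)*exp(1i*2*pi/L*(n*k))*xhat;    % inverse DDFT as in the proof
% norm(xrec - x) and norm(ifft(fft(x)) - x) are both zero up to rounding.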
Frequency content of sampled signals. We review a proper mathematical foundation for Equality (4.84), which we copy here:

x̂^∗(ω) = Σ_{k=−∞}^{∞} x̂(ω − k 2π/T),   ω ∈ ℝ.   (A.40)

The equality is about the connection of the CCFT x̂ of a continuous time signal x(t) and the adjusted DCFT x̂^∗ (4.83) of the sampled signal x^∗_n := x(nT), n ∈ ℤ, with sampling time T.

There are counterexamples of (A.40), in that functions x(t) exist for which both x̂ and x̂^∗ are well defined, while the right-hand side of (A.40) is not defined. An example is the continuous function x(t) defined per interval as

x(t) = e^{−α|t|} sin((2^{2^{2k+1}} + 1) t),   for k ≤ |t| ≤ k + 1, k ∈ ℕ.

The value of α must be positive but is otherwise arbitrary. Since |sin| is bounded by one, we see that x(t) decays to zero exponentially fast. This guarantees that x̂ and x̂^∗ exist. The x(t) is unusual in that it oscillates extremely fast as |t| increases. This is the source of the violation of (A.40). We need to exclude arbitrarily fast oscillations.

Let V_{a,b}(x) denote the total variation of a function x(t) on the interval [a, b],

V_{a,b}(x) = sup_{a = t_0 < t_1 < ⋯ < t_n = b} Σ_{i=1}^{n} |x(t_i) − x(t_{i−1})|.   (A.41)

We say that a function x(t) is of uniform bounded variation if sup_{t∈ℝ} V_{t,t+1}(x) < ∞. Uniform bounded variation implies, roughly speaking, that the oscillations of the function do not grow without bound as time increases. It may now be shown that if x(t) is exponentially bounded by some C e^{−α|t|}, α > 0, C > 0, and if x(t) is of uniform bounded variation, then (A.40) is correct provided we take the sampled signal x^∗_n to be defined as

x^∗_n := ( x(nT^−) + x(nT^+) ) / 2.   (A.42)

This is well defined because for functions of bounded variation the limits x(t^−) and x(t^+) exist for every t. If x(t) is continuous then this is the ordinary sampled signal x^∗_n = x(nT) and we recover the original (A.40).
Chapter 5

Proof A.1.14 (Eqn. (5.29) is the minimal ε^T ε). Let y = ε − ε̄ where ε̄ is defined as

ε̄ = M^T (M M^T)^{−1} (X − e).   (A.43)

The ε satisfies X − e = Mε if and only if My = 0, because

X − e = Mε = M(ε̄ + y) = M M^T (M M^T)^{−1} (X − e) + My = X − e + My.

Knowing that My = 0 gives ε̄^T y = 0. Therefore

ε^T ε = [ε̄ + y]^T [ε̄ + y] = ε̄^T ε̄ + ε̄^T y + y^T ε̄ + y^T y = ε̄^T ε̄ + y^T y.

It is now immediate that ε^T ε is minimal if and only if y = 0. In that case we have

ε^T ε = ε̄^T ε̄ = (X − e)^T (M M^T)^{−1} M M^T (M M^T)^{−1} (X − e) = (X − e)^T (M M^T)^{−1} (X − e).
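Numerically, the minimum-norm solution (A.43) coincides with what MATLAB's pinv returns for the same underdetermined system; a small check with arbitrary example data is sketched below.

% Sketch: check that (A.43) is the minimum-norm solution of X - e = M*eps.
M = randn(3,6);  X = randn(3,1);  e = randn(3,1);   % arbitrary example data
epsbar = M'*((M*M')\(X - e));                       % Eqn. (A.43)
% pinv(M)*(X - e) gives the same vector, and any other solution has larger norm.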
B The System Identification Toolbox

In this appendix we list those routines of the System Identification Toolbox of MATLAB that appear in the lecture notes and are useful for the computer laboratory exercises. For more extensive information we refer to the help function of MATLAB and the Toolbox manual. In the list only the simplest way of calling each routine is described.

All estimation routines assume that the time series or output/input pairs are centered. This can be done with the function dtrend.
th = ar(z,n)
Estimate the parameters th of an AR scheme from
the observed time series z. The integer n is the
degree of the polynomial that characterizes the
scheme. The time series z is arranged as a column
vector. See page 62.
th = armax(z,nn)
Estimate the parameters th of an ARMA or ARMAX
scheme from the time series z or the output/input
pair z = [y u]. The row matrix nn contains the degrees of the polynomials that occur in the scheme. The parameters are specified in the special theta
format of the Toolbox. Enter the command help
theta for a description of this format. See also the
function present. See pages 68 and 84.
th = arx(z,nn)
Estimate the parameters th (in theta format) of an
AR or ARX scheme from the time series z or the out-
put/input pair z = [y u]. The row matrix nn con-
tains the degrees of the polynomials that characterize the scheme. See page 81.
bode(g)
Show a Bode plot of the frequency function g. Use
help idmode/help for more information. See
pages 78 and 79.
r = covf(z,M)
Determine an estimate r of the covariance function
of the time series z over M points. See page 46.
ir = cra(z)
Produce an estimate ir of the impulse response of
a system from the output/input pair z = [y u] us-
ing correlation techniques. See pages 78 and 84.
z = detrend(z,constant)
Center the time series z or the output/input pair z
= [y u]. dtrend(z,linear) removes a linear
trend. See page 78.
ffplot(g)
Plot the frequency function g with linear scales.
[omega,ampl,phase] = getff(g)
Determine the frequency axis omega, the amplitude
ampl and the phase phase of the frequency function
g. See pages 20 and 53.
y = idsim(u,th)
Compute the output signal y of a system with pa-
rameters th (in theta format) for the input signal u.
See for instance page 15 and 81.
th = iv4(z,nn)
Estimate the parameters th (in theta format) of an
ARX model with output/input pair z = [y u] ac-
cording to a four stage IV method. The row vector nn contains the degrees of the polynomials that determine the model. See page 84.
th = poly2th(A,B,C,D,F,lam)
Convert the polynomial A, B, C, D, F of a Ljung
scheme to the parameter matrix th in theta format.
The parameter lam is the variance of the white noise.
See for instance page 15.
p = predict(z,th,k)
Determine a k step prediction p of the output signal
of the system with parameters th (in theta format)
with output/input pair z = [y u]. This function
may also be used for the prediction of time series.
See page 25.
present(th)
Show the information in the theta matrix th. See
page 84.
e = pe(th,x)
Compute the residuals of a system model with parameter matrix th (in theta format) that follow from the time series z or the output/input pair z = [y u]. The command resid(th,x) is supposed to plot and also test the residuals, but it contains a mistake. See page 81.
g = spa(z)
Find an estimate g (in frequency format) of the
frequency response function of a system with out-
put/input pair z = [y u] using spectral methods.
If z is a time series then g is an estimate of the
spectral density function. The routine also supplies
the standard deviation of the estimation error. See
page 53.
g = th2ff(th)
Compute the frequency response function g (in fre-
quency function format) of the system with theta
structure th. See page 20.
[A,B,C,D,F] = th2poly(th)
Convert the theta matrix th to the polynomials A,
B, C, D, F of the corresponding Ljung scheme.
Especially for this course the following routines were
developed.
r = cor(z,n)
This routine estimates the correlation function of z.
The integer n is the number of points over which the
function is computed. The output argument r con-
tains the correlation function in an n-dimensional
row vector. In addition the function and its confidence intervals are plotted.
p = pacf(z,n)
This routine estimates the partial auto-correlation
function of z. The integer n is the number of points
over which the function is computed. The output ar-
gument p contains the partial correlation function in
an n-dimensional row vector. In addition the function and its confidence intervals are plotted.
C Glossary English-Dutch
aliasing vouweffect
amplitude distribution amplitudeverdeling
angular frequency hoekfrequentie
ARMA process ARMA-proces
ARMA scheme ARMA-schema
auto-covariance function autocovariantiefunctie
auto-regressive process autoregressief proces
backward shift operator achterwaartse verschuivingsoperator
band filter bandfilter
band limited in bandbreedte begrensd
bias onzuiverheid
centered process gecentreerd proces
characteristic function karakteristieke functie
coherence spectrum coherentiespectrum
complex conjugate toegevoegd complexe waarde
consistent consistent
consistency consistentie
continuous time continue-tijd
convolution convolutie
correlation function correlatiefunctie
covariance covariantie
covariance function covariantiefunctie
cross covariance function kruiscovariantiefunctie
cross spectral density kruis-spectraledichtheidsfunctie
denumerable aftelbaar
difference operator differentieoperator
disturbance signal stoorsignaal
entry element
efficient efficiënt
ergodic ergodisch
estimation error schattingsfout
estimation schatting
estimator schatter
exogenous exogeen
expectation verwachting, verwachtingswaarde
fast Fourier transform snelle Fouriertransformatie
filter filter
final prediction error uiteindelijke voorspelfout
Fisher's information matrix informatiematrix van Fisher
forecasting voorspellen
Fourier transform Fouriergetransformeerde
Fourier transformation Fouriertransformatie
frequency frequentie
frequency response frequentieresponsie
frequency response function frequentieresponsiefunctie
gain versterking
gain matrix versterkingsmatrix
generating function genererende functie
gradient gradiënt
Hessian matrix van Hess
identifiability identificeerbaarheid
impulse response impulsresponsie
information criterion informatiecriterium
initial condition beginvoorwaarde
input ingang
input signal ingangssignaal
instrumental variable method instrumentele-variabelemethode
invertible inverteerbaar
joint gezamenlijk
Kalman lter Kalmanlter
least squares method kleinste-kwadratenmethode
likelihood function aannemelijkheidsfunctie
line search lijnzoeken
long division staartdelen
main lobe hoofdlob
maximum likelihood estimator maximum-aannemelijkheidsschatter
mean value function gemiddelde-waardefunctie
mean gemiddelde
model set modelverzameling
moving average process gewogen-gemiddeldeproces
moving average gewogen gemiddelde
multi-dimensional meerdimensionaal
noise ruis
non-parametric niet-parametrisch
nonnegative definite positief-semidefiniet
normally distributed normaal verdeeld
Nyquist frequency Nyquistfrequentie
order determination ordebepaling
outlier uitschieter
output uitgang
output signal uitgangssignaal
parametric parametrisch
partial correlation coefficients partiële correlatiecoëfficiënten
partial fraction expansion breuksplitsen
periodogram periodogram
persistently exciting persistent exciterend
plot graek
prediction error voorspelfout
prediction voorspelling
predictor voorspeller
presampling filter conditioneringsfilter
probability density kansdichtheid
probability law kanswet
random process toevalsproces, stochastisch proces
random variable toevalsvariabele, stochastische variabele
random walk stochastische wandeling
realization realisatie
rectangular rechthoekig
residual residu
resolution oplossend vermogen
running average process lopend-gemiddeldeproces
sample (n.) steekproef
sample (v.) bemonsteren
sample average steekproefgemiddelde
sample mean steekproefgemiddelde
sampling theorem bemonsteringsstelling
script script
seasonal component seizoenscomponent
side lobe zijlob
spectral density spectrale dichtheid
spectral density function spectrale dichtheidsfunctie
stable stabiel
standard deviation standaarddeviatie, spreiding
state toestand
stationary stationair
steepest descent method steilste-hellingmethode
step response stapresponsie
stochastic stochastisch
system identification systeemidentificatie
test toets, test
time dependent tijdsafhankelijk
time invariant tijdsonafhankelijk
time series tijdreeks
time shift tijdverschuiving
transform getransformeerde
transformation transformatie
trend trend
unbiased zuiver
update (v.) bijwerken
variance matrix variantiematrix
white noise witte ruis
wide-sense stationary zwak-stationair
window venster
windowing vensteren
z -transform z -getransformeerde
D Bibliography
A. Bagchi. Optimal Control of Stochastic Systems. Prentice
Hall International, Hempstead, U.K., 1993.
A. Bagchi and R. C. W. Strijbos. Tijdreeksenanalyse en identificatietheorie. Lecture notes (in Dutch), Faculty of Mathematical Sciences, University of Twente, 1988.
G. E. P. Box and G. M. Jenkins. Time Series Analysis, Fore-
casting and Control. Holden-Day, San Francisco, 1970.
C. S. Burrus and T. W. Parks. DFT/FFT and Convolution
Algorithms. John Wiley, New York, 1985.
J. Durbin. Estimation of parameters in time-series regression models. J. R. Statist. Soc. B, 22:139-153, 1960.
M. Kendall and J. K. Orr. Time Series. Edward Arnold, U.K., third edition, 1990.
H. Kwakernaak and R. Sivan. Linear Optimal Control Sys-
tems. Wiley-Interscience, New York, 1972.
H. Kwakernaak and R. Sivan. Modern Signals and Systems. Prentice Hall, Englewood Cliffs, N. J., 1991.
L. Ljung. System Identication: Theory for the User. Pren-
tice Hall, Englewood Cliffs, N. J., 1987.
L. Ljung. User's Guide of the System Identification Toolbox for Use with MATLAB. The MathWorks, Natick, Mass., U.S.A., 1991.
S. L. Marple. Digital Spectral Analysis. Prentice-Hall, En-
glewood Cliffs, N. J., 1987.
Index
anti-aliasing filter, 55
ARIMA scheme, 22
ARMAX scheme, 81
ARMA scheme, 16
estimation of-, 64
frequency response, 18
ARX scheme, 80
AR scheme, 11
asymptotically stationary, 12
estimation of-, 59
stable-, 13
asymptotically
efcient estimator, 35
stable, 13
stable ARMA process, 16
stationary AR processes, 12
unbiased estimator, 34
wide-sense stationary process, 12, 13
auto-covariance function, 77
auto-regressive process, 11
band lter, 19
band limited signal, 55
Bartlett's window, 48
bias, 34
BIBO stability, 76
Box-Jenkins, 59, 89
scheme, 89
CCFT, 21
Census X-11, 41, 43
centered process, 11
centering, 11
characteristic function, 33
Chebyshev inequality, 34
classical time series analysis, 41
coherence spectrum, 78
consistent estimator, 34
continuous time processes, 53
convolution, 11
sum, 17
system, 17
Cooley-Tukey, 51
correlation
function, 7
matrix, 7, 8
covariance
function, 6, 8
matrix, 7, 8, 32
Cramér-Rao inequality, 35, 61, 92
vector case, 36, 92
cross
covariance function, 76
periodogram, 78
spectral density, 77
Daniell's window, 50
DCFT, 17, 50
inverse, 18
DDFT, 51
density function, 5
design of experiments, 88
DFT, 17
difference operator, 22
distribution
Gaussian-, 31
joint probability-, 7
normal-, 31
probability-, 5
standard normal-, 31
efficient estimator, 35
ergodic process, 44
error-in-the-equation model, 80
estimate, 34
estimation error, 34
estimator, 34, 35
asymptotically unbiased, 34
consistent, 34
unbiased, 34
exogenous signal, 80
fast Fourier transform, 51
FFT, 51
filter, 19
final prediction error criterion, 71
Fisher's information matrix, 36
forecasting, 24
forward shift operator, 10
Fourier transform, 21
discrete, 17, 51
fourth-order moment, 38
FPE criterion, 71
frequency
analysis, 76
response function, 17, 22
Gauss-Newton algorithm, 69
Gaussian distribution, 31
generating function, 20
Hamming's window, 50
Hann's window, 48
Hanning, 48
i, 17
identifiability, 87
identication in closed loop, 88
impulse response, 17, 22
analysis, 75
incidental component, 22, 41
information criterion
of Akaike, 70
of Rissanen, 71
instrumental variable, 81
instrumental variable method, 80
inverse DCFT, 18
invertible scheme, 15
IV method, 81
Kalman filter, 86
least squares
algorithm, 69
estimator
AR scheme, 59
ARMA scheme, 64
ARX scheme, 80
MA scheme, 63
non-linear, 63, 69
recursive, 59
Levenberg-Marquardt algorithm, 69
Levinson-Durbin algorithm, 14, 70
likelihood function, 35
line search, 68
Ljung's scheme, 89
main lobe, 48
Markov scheme, 11
matrix
correlation-, 7, 8
covariance-, 7, 8, 32
nonnegative definite-, 7, 36
positive definite-, 36
variance-, 7, 32
maximum likelihood estimator, 34, 60, 65
MA scheme, 9
estimation of-, 63
mean, 5
estimation of-, 43
value function, 5
model set, 87
moving average, 42
process, 9
Newton algorithm, 68
non-linear optimization, 68
nonnegative definite matrix, 7, 36
normally distributed processes, 31
normal distribution, 31
standard-, 31
Nyquist frequency, 55
observer canonical form, 86
order determination, 69, 70
Parseval, 50
partial correlation coefficients, 14, 70
periodogram, 47
persistently exciting, 77, 83, 86
positive definite matrix, 36
power, 19
power spectral density, 18
prediction, 24
of ARMA process, 24
prediction error, 24
prediction error method, 65, 82, 86
accuracy of-, 67
presampling filter, 55
probability density under transformation, 32
probability distribution, 5
probability law, 7
process
asymp. wide sense stationary, 12, 13
centered, 11
moving average, 9
running average, 10
quasi-Newton algorithm, 69
recursive identication, 87
residuals
analysis of-, 70
resolution, 50, 77
robustication, 88
running average process, 10
sampling, 53
sampling theorem, 51, 55
seasonal
component, 22, 41
process, 23
Shannon, 55
shift operator, 10
side lobes, 48
sinc, 51
spectral
analysis, 17, 21, 76
density, 18, 21
estimation of-, 46
resolution, 48
window, 48
Spencer's formula, 42
stability
AR scheme, 13
asymptotic, 13
standard biased covariance estimator, 45
standard deviation, 6, 8
standard normal distribution, 31
state model, 85
identication of-, 85
stationary process, 7
wide sense, 8
steepest descent method, 68
step response analysis, 75
stochastic process, 5, 7
strictly stationary process, 7
structure determination, 87
system identication, 75
non-parametric, 75
parametric, 75
test
for stochasticity, 41
for trend, 41
for whiteness, 56, 70
time axis, 7
tr, 93
trace, 93
trend, 22, 41, 42
unbiased estimator, 34
uncorrelated, 7
variance
matrix, 7, 32
vector gain, 60
white noise, 9, 20
wide-sense stationary process, 8
window, 47
Bartletts, 48
Daniells, 50
Hammings, 50
Hann, 48
Hanning, 48
rectangular, 48
windowing, 47
Yule-Walker equations, 13, 70
Yule scheme, 26
z -transform, 20