
Assumed Knowledge in FINS2624

This document outlines and explains some key concepts you're assumed to be
comfortable with when enrolling in FINS2624. It provides short proofs for many
of the results and I think it's useful to have a look at them. However, you're
not required to be able to replicate any of them, and if you prefer you may
completely skip all proofs and just focus on the results. Simply being able to
practically apply the methods described in this document will be enough to
get you through the course. In the interest of brevity we will sometimes take
small shortcuts in the proofs, e.g. make some unstated assumptions or only
prove some special cases. You may be familiar with some of the results from
high school, while some of the results would have been covered in your first-year
courses. Please skim through this document and read more carefully whatever
sections you feel you need to brush up on. Depending on your background,
you may want to skip some (or all) sections entirely. The aim of this document
is to allow you to successfully absorb the material that is taught in FINS2624,
but by necessity the material covered here will be limited and the explanations
rather brief. This document is obviously not meant as a substitute for going
to high school and/or taking your required first-year courses. If you don't have
the time or patience to read this document, at least familiarize yourself with
the summary, which contains most results but without proofs, explanations or
motivations.

Summary

A weighted average, x̄, of observations {x_n} with associated weights {w_n} is
given by

x̄ = Σ_{n=1}^{N} w_n x_n

If α is a constant and f(x), u(x) and v(x) are functions of x, the first derivatives
of f are given in the table below

f(x)             f'(x)
α                0
u(x) + v(x)      u'(x) + v'(x)
αu(x)            αu'(x)
x^n              nx^(n−1)
u(x)v(x)         u(x)v'(x) + u'(x)v(x)
u(v(x))          u'(v(x))v'(x)
u'(x)            u''(x)

f(x + h) can be approximated by

f(x + h) = f(x) + (df(x)/dx)h + R(x, h)h

where R(x, h) is an approximation error that approaches zero as h approaches
zero.

To find the values of x that give the extreme values of some function f(x)
we solve f'(x) = 0 for x and check the sign of f''(x). If the sign is negative
(positive) we have found a local maximum (minimum).

If s is one of S possible states of the world, x(s) is the value of some stochastic
variable X in that state and p(s) is the probability of the state occurring, then

E(X) = Σ_{s=1}^{S} p(s)x(s)

If α and β are constants and {X_n} and {Y_m} are series of stochastic variables
(with the indices suppressed as convenient) then

E(αX) = αE(X)

E(Σ_{n=1}^{N} X_n) = Σ_{n=1}^{N} E(X_n)

Cov(X, Y) = Cov(Y, X)

Cov(αX, Y) = αCov(X, Y)

Cov(Σ_{n=1}^{N} X_n, Σ_{m=1}^{M} Y_m) = Σ_{n=1}^{N} Σ_{m=1}^{M} Cov(X_n, Y_m)

Var(X) = Cov(X, X)

ρ_XY = Cov(X, Y)/(σ_X σ_Y)

Contents

1. Weighted averages
2. Differentiation
   2.1. The constant rule
   2.2. The sum rule
   2.3. The constant factor rule
   2.4. The power rule
   2.5. Taylor approximations
   2.6. The product rule
   2.7. The chain rule
   2.8. Partial derivatives
   2.9. Higher derivatives
3. Optimization
   3.1. Local and global maxima and minima
   3.2. A strategy to solve optimization problems
4. Statistics
   4.1. Stochastic variables
   4.2. Mathematical expectations
   4.3. Covariances, variances and correlations
   4.4. Regression analysis

1. Weighted averages

Weighted averages have several applications in the course, such as defining portfolio
returns, bond durations and mathematical expectations. They are similar
to your standard average (the arithmetic mean), but may assign different weights
(importance) to different observations. For instance, we might observe that the
average monthly wage in Luxembourg is $6103 whereas it is $1177 in Hungary.
It would, however, be wrong to conclude that the average monthly wage, Ī, in the
area covering both Luxembourg and Hungary is (6103 + 1177)/2 = 3640,
because Hungary has 9.9 million inhabitants whereas Luxembourg only has 0.5
million. Instead we'd want to assign weights to the two observations in proportion
to their populations, i.e.

w_Lux = 0.5/(0.5 + 9.9) ≈ 0.05

w_Hun = 9.9/(0.5 + 9.9) ≈ 0.95

Note that the two weights sum to one, which will always be the case. We'll get
a (population) weighted average of the per capita income in the two countries
by multiplying each country's income with its respective weight:

Ī = w_Lux · I_Lux + w_Hun · I_Hun ≈ 0.05 · 6103 + 0.95 · 1177 ≈ 1423

In general, the weighted average of some observations x_1, x_2, ..., x_N with
associated weights w_1, w_2, ..., w_N is

x̄ = Σ_{n=1}^{N} w_n x_n   (1)

Note that the standard arithmetic mean is a special case of this with all weights
equal to 1/N.
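As a concrete illustration, equation (1) and the Luxembourg/Hungary example can be sketched in a few lines of Python (a hypothetical helper, not part of the course materials; note that using the exact population weights rather than the rounded 0.05/0.95 gives roughly 1414 instead of 1423):

```python
def weighted_average(values, weights):
    """Weighted average per equation (1): the sum of w_n * x_n."""
    return sum(w * x for w, x in zip(weights, values))

# Population weights for the Luxembourg/Hungary example
populations = [0.5, 9.9]                   # millions of inhabitants
weights = [p / sum(populations) for p in populations]
incomes = [6103, 1177]                     # average monthly wages

print(round(weighted_average(incomes, weights)))  # ~1414 with exact weights
```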

2. Differentiation

Any (interesting) function f(x) will change as we change the value of x. The
derivative of f with respect to x is a measure of how large that change will be.
Specifically, it measures the change in f(x) as a fraction of the change in x for
infinitesimal (very small) changes in x. The concept is most easily understood
by first considering some specific function f(x) and some discrete change in x.
Let's pick f(x) = x². Imagine that we start off with a value of x = −1 and
change this by adding a value h = 3 for a new value of x + h = 2. The
corresponding function values would be f(−1) = (−1)² = 1 and f(2) = 2² = 4. Let's
draw a straight line through those two points as in figure 1 below.

Figure 1: f(x) = x² with line through (−1, 1) and (2, 4)

It is straightforward to calculate the slope of the resulting line. We'll simply
divide the change in the function value by the change in x:

Δf(x)/Δx = [f(x + h) − f(x)]/h = (4 − 1)/3 = 1

This slope is the average change in the function value as we move from x = −1
to x = 2. If we consider some smaller change, e.g. h = 1, we'll get a new slope
as pictured in figure 2 below.

Figure 2: f(x) = x² with line through (−1, 1) and (0, 0)

The slope of this line is again given by the equation

Δf(x)/Δx = [f(x + h) − f(x)]/h = (0 − 1)/1 = −1

If we keep choosing smaller and smaller values of h the green line will move
again. In the limit, as h approaches zero, the green line will only touch the blue
curve in one point. We call such a line a tangent, and its slope is the derivative
of f(x) when x = −1.¹ We can interpret this as the average change in the
function value over an infinitesimal distance close to x = −1. The situation is
illustrated in figure 3 below.

Figure 3: Tangent to f(x) = x² at (−1, 1)

The slope of the tangent is calculated in the same way as in the discrete cases
above. To emphasize that we are concerned with the slope in a single point, we
denote the derivative of f(x) with respect to x by df(x)/dx and define it as

df(x)/dx = lim_{h→0} [f(x + h) − f(x)]/h   (2)

Sometimes it will be convenient to denote the derivative of f(x) by f'(x) (pronounced
"f prime of x"). I will use both notations interchangeably in the remainder
of this document. The derivative does not exist for every function,
which makes for a lot of ifs and buts in the rules we'll go through below. In
the interest of brevity, we'll assume throughout that we're dealing with nice
functions where the derivatives do exist.

¹ The process of finding derivatives is called differentiation, which explains the title of this
section.
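Definition (2) suggests a simple numerical check: compute the difference quotient for ever smaller h and watch it settle at the slope of the tangent. A minimal sketch (the function and step sizes mirror the x² example above):

```python
def difference_quotient(f, x, h):
    """The slope [f(x + h) - f(x)] / h from definition (2)."""
    return (f(x + h) - f(x)) / h

f = lambda x: x ** 2

# At x = -1 the true derivative is 2x = -2; the discrete slopes for
# h = 3 and h = 1 reproduce the two lines in figures 1 and 2.
for h in (3, 1, 0.1, 1e-6):
    print(h, difference_quotient(f, -1, h))
```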

2.1. The constant rule

It would be time consuming to rely on definition (2) every time we wanted
to find the derivative of a function. Fortunately, there are some very useful
rules that will help us out. The first of these rules is known as the constant
rule, which states that the derivative of a constant, i.e. anything that does not
change when we change the variable with respect to which we take the derivative,
is zero. Formally, for some constant α

dα/dx = 0   (3)

If we think of constants as functions that always have the same value, e.g.
f(x) = α, this follows immediately from definition (2)

df(x)/dx = lim_{h→0} [f(x + h) − f(x)]/h

df(x)/dx = lim_{h→0} (α − α)/h = 0

2.2. The sum rule

The sum rule states that the derivative of a sum is the sum of the derivatives
of each term. Formally, if f_1, f_2, ..., f_N are functions of x then

d/dx Σ_{n=1}^{N} f_n(x) = Σ_{n=1}^{N} df_n(x)/dx   (4)

We'll prove this for N = 2, but the proof can be generalized to cover a general
N. It is also true for subtractions. Let's denote the two functions we'll consider
u(x) and v(x). Let's also define f(x) = u(x) + v(x). Consider f(x + h) − f(x)
and divide by h.

f(x + h) − f(x) = u(x + h) + v(x + h) − u(x) − v(x)

[f(x + h) − f(x)]/h = [u(x + h) − u(x)]/h + [v(x + h) − v(x)]/h

Now let h approach zero

lim_{h→0} [f(x + h) − f(x)]/h = lim_{h→0} [u(x + h) − u(x)]/h + lim_{h→0} [v(x + h) − v(x)]/h

By (2) this equals

df(x)/dx = du(x)/dx + dv(x)/dx

Since f(x) = u(x) + v(x), we have equation (4) for N = 2

d/dx [u(x) + v(x)] = du(x)/dx + dv(x)/dx

2.3. The constant factor rule

According to the constant factor rule, the derivative of the product of a constant
and some function is the product of the constant and the derivative of the
function. That is, we may leave constants alone when we take derivatives.
The result follows readily from the definition (2). Let g(x) = αf(x), where α is
some constant. Then

dg(x)/dx = lim_{h→0} [g(x + h) − g(x)]/h

dg(x)/dx = lim_{h→0} [αf(x + h) − αf(x)]/h

dg(x)/dx = α lim_{h→0} [f(x + h) − f(x)]/h

dg(x)/dx = α df(x)/dx
dx
Which gives the rule

d/dx [αf(x)] = α df(x)/dx   (5)

2.4. The power rule

Next up is the power rule

d/dx x^n = nx^(n−1)   (6)

That is, the derivative of a function of the type f(x) = x^n with respect to x is
nx^(n−1). Below we'll prove this when n is a positive integer. It actually holds for
any value of n ≠ 0, but the proof gets more cumbersome.

In order to simplify the proof, we'll rely on the following factorization

a^n − x^n = (a − x) Σ_{k=1}^{n} a^(n−k) x^(k−1)   (7)

This may not appear obvious at first, but if we write out the sum as

a^(n−1)x^0 + a^(n−2)x^1 + a^(n−3)x^2 + ... + a^2 x^(n−3) + a^1 x^(n−2) + a^0 x^(n−1)

and then multiply it by a − x, we see that we get

a(a^(n−1)x^0 + a^(n−2)x^1 + a^(n−3)x^2 + ... + a^2 x^(n−3) + a^1 x^(n−2) + a^0 x^(n−1))
− x(a^(n−1)x^0 + a^(n−2)x^1 + a^(n−3)x^2 + ... + a^2 x^(n−3) + a^1 x^(n−2) + a^0 x^(n−1))

Moving a and x inside the parentheses gives

(a^n + a^(n−1)x + a^(n−2)x^2 + ... + a^3 x^(n−3) + a^2 x^(n−2) + ax^(n−1))
− (a^(n−1)x + a^(n−2)x^2 + a^(n−3)x^3 + ... + a^2 x^(n−2) + ax^(n−1) + x^n)

which in turn equals

a^n − x^n + (a^(n−1)x + a^(n−2)x^2 + ... + a^2 x^(n−2) + ax^(n−1))
− (a^(n−1)x + a^(n−2)x^2 + ... + a^2 x^(n−2) + ax^(n−1))

and we see that the expressions in the parentheses cancel out. Armed with this
insight, let's define a = x + h, recall that the function we're interested in is
f(x) = x^n and apply definition (2)

df(x)/dx = lim_{h→0} [f(x + h) − f(x)]/h

df(x)/dx = lim_{a→x} [f(a) − f(x)]/(a − x)

df(x)/dx = lim_{a→x} (a^n − x^n)/(a − x)

Applying (7), we can rewrite this as

df(x)/dx = lim_{a→x} [1/(a − x)] (a − x) Σ_{k=1}^{n} a^(n−k) x^(k−1)

The (a − x) cancels out, and we are left with the summation. Since we are
taking the limit as a approaches x, we can substitute x for a in this expression
and get result (6)

df(x)/dx = Σ_{k=1}^{n} x^(n−k) x^(k−1) = Σ_{k=1}^{n} x^(n−k+k−1) = nx^(n−1)

2.5. Taylor approximations

Before we move on to more rules for the derivative, we'll introduce a useful
application of it known as first-order Taylor approximations, according to which
we can approximate the change in the value of a function by using its derivative.²
Specifically, we have

f(x + h) = f(x) + (df(x)/dx)h + R(x, h)h   (8)

where R(x, h) is some average approximation error that will approach zero as h
approaches zero. We won't prove this theorem, but we can understand it using figure
3. Recall that when taking a derivative, we are calculating the slope of a
function in one point, represented by the slope of the tangent line in the figure.
The Taylor approximation works by assuming that the slope of the function
is constant and approximating f(x + h) by the value of the tangent at x + h.
Since the derivative of f(x) is not generally constant, this gives rise to the
(total) approximation error R(x, h)h. However, as we let h approach zero, we
are approaching the point where the slope of the function is indeed exactly
df(x)/dx, and R(x, h) will also approach zero.

² We could potentially make better approximations using higher-order derivatives as defined
below, but we won't bother.
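The shrinking error R(x, h) in equation (8) is easy to see numerically. A small sketch for f(x) = x², where f(x + h) = x² + 2xh + h², so the leftover error R(x, h)h is exactly h² and R(x, h) = h:

```python
f = lambda x: x ** 2
df = lambda x: 2 * x   # known derivative of x**2

x = 1.0
for h in (1.0, 0.1, 0.01):
    taylor = f(x) + df(x) * h        # first-order approximation of f(x + h)
    R = (f(x + h) - taylor) / h      # average approximation error R(x, h)
    print(h, taylor, R)              # R shrinks in step with h
```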

2.6. The product rule

The product rule allows us to take derivatives of the product of two functions
in a convenient way. Specifically, for a function f(x) = u(x)v(x), we have that

df(x)/dx = u(x) dv(x)/dx + du(x)/dx v(x)   (9)

To prove this, note that

f(x + h) − f(x) = u(x + h)v(x + h) − u(x)v(x)

To save some space, let's for now denote the derivative of f(x) with respect to
x by f'(x). Now substitute Taylor approximations for u(x + h) and v(x + h).

f(x + h) − f(x) = [u(x) + hu'(x) + R_u(x, h)h][v(x) + hv'(x) + R_v(x, h)h] − u(x)v(x)

Expand the parentheses and divide both sides by h

[f(x + h) − f(x)]/h = u(x)v(x)/h + u(x)v'(x) + u(x)R_v(x, h) + u'(x)v(x) + hu'(x)v'(x) +
u'(x)R_v(x, h)h + R_u(x, h)v(x) + R_u(x, h)hv'(x) + R_u(x, h)R_v(x, h)h − u(x)v(x)/h

Now take the limit as h approaches zero

lim_{h→0} [f(x + h) − f(x)]/h = u(x)v'(x) + u'(x)v(x)

df(x)/dx = u(x)v'(x) + u'(x)v(x)

2.7. The chain rule

Composite functions are functions of functions. For instance, consider the function
u(x) = (1 + x)². By defining f(x) = x² and g(x) = 1 + x, we can view
u(x) = f(g(x)) as a composite function. The chain rule lets us easily find derivatives
in such cases. Formally, if u(x) = f(g(x)), we have that

du(x)/dx = [df(g(x))/dg(x)] · [dg(x)/dx]   (10)

As when proving the product rule, we'll make use of the shorthand notation
f'(x) = df(x)/dx and some first-order Taylor approximations. (8) gives

f(g[x] + k) = f(g[x]) + f'(g[x])k + R_f(x, k)k

when applied to f(g[x] + k) and

f(g[x + h]) = f(g[x] + g'[x]h + R_g[x, h]h)

when applied to g(x + h). By choosing k = g'(x)h + R_g(x, h)h we set the RHS
of the second equation equal to the LHS of the first equation and get

f(g[x + h]) = f(g[x]) + f'(g[x])k + R_f(x, k)k

Let's move f(g[x]) and expand k to get

f(g[x + h]) − f(g[x]) = f'(g[x])g'(x)h + f'(g[x])R_g(x, h)h + R_f(x, k)g'(x)h +
R_f(x, k)R_g(x, h)h

Now divide by h and take the limit as h approaches zero

lim_{h→0} [f(g[x + h]) − f(g[x])]/h = lim_{h→0} (f'(g[x])g'(x) + f'(g[x])R_g(x, h) + R_f(x, k)g'(x) + R_f(x, k)R_g(x, h))

du(x)/dx = df(g[x])/dx = f'(g[x])g'(x)
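The composite u(x) = (1 + x)² from the text makes a handy numerical check of equation (10): the chain rule predicts u'(x) = f'(g(x)) · g'(x) = 2(1 + x) · 1, which we can compare against a difference quotient (a sketch, with the step size chosen ad hoc):

```python
def u(x):
    return (1 + x) ** 2              # u(x) = f(g(x)) with f = x**2, g = 1 + x

def u_prime_chain(x):
    return 2 * (1 + x) * 1           # f'(g(x)) * g'(x) per equation (10)

def u_prime_numeric(x, h=1e-7):
    return (u(x + h) - u(x)) / h     # difference quotient from definition (2)

for x in (-2.0, 0.0, 3.0):
    print(x, u_prime_chain(x), u_prime_numeric(x))
```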

2.8. Partial derivatives

We'll sometimes be concerned with functions of several variables, e.g. f(x, y).
In such cases we'll often be interested in the derivative of f(x, y) with respect
to only one variable, e.g. x. We call such a derivative a partial derivative and
denote it by

∂f(x, y)/∂x

The trick in these cases is to treat the other variables, in this case y, as constants
for the purpose of the calculation. All the rules we have discussed above still
apply.

2.9. Higher derivatives

The derivatives we have discussed so far are known as first (or first-order)
derivatives. For instance, if we consider the function f(x) = x³ we have seen
that by the power rule f'(x) = 3x². f'(x) is itself a function of x and as with
any such function we may reasonably ask how it changes when x changes, i.e.
take its derivative with respect to x. We call the resulting derivative of the first
derivative the second (or second-order) derivative of f(x) with respect to x and
denote it by f''(x) (pronounced "f bis of x") or by

d²f(x)/dx²

In our example, f''(x) = 6x, which we get by applying the power rule to
f'(x) = 3x². The second derivative is a measure of how the first derivative
changes as we change the variable with respect to which we take the derivatives.

3. Optimization

Many, if not most, interesting economic problems consist in finding out what
the best decision is in some situation. We typically analyze the problem using
some simplified economic model, in which the decision can be represented as
finding the value of some variable that maximizes or minimizes some function.
Classic examples are questions such as "What production volume maximizes
profits?" and "What portfolio weights maximize utility?". The trick to finding
the value of a variable that maximizes the value of a function is to realize that if
we are at a point where increasing the value of the variable increases the value
of the function we cannot be at the maximum (because we could reach a higher
function value by increasing the value of the variable). Mathematically, that
means the maximum of a function cannot be in a point where its derivative is
positive. Similarly, the maximum cannot be in a point where the derivative is
negative, because we could then reach a higher function value by decreasing the
value of the variable. Therefore, the maximum of a function can only be in a
point where its derivative equals zero. However, that is not a sufficient condition,
because we could make the exact same argument about the minimum of a
function, i.e. it cannot be where the derivative is negative because we could decrease
the function value by increasing the value of the variable and so on. Still,
the first step to finding either the minimum or the maximum of a function is to
find the points where its first derivative, i.e. the slope of its tangent, equals zero.

Let's consider the example where f(x) = x³ − x². We know that this function
can only reach a maximum where f'(x) = 0, so we use the power rule to find
that

df(x)/dx = 3x² − 2x

We set this equation equal to zero and solve for x to find the candidates for
values of x that maximize the function.

3x² − 2x = 0

x(3x − 2) = 0

By inspection, we see that the equation above holds for x = 0 or x = 2/3. That is,
the function must take its maximum value at either of these two points, which
is illustrated in figure 4 below.³

³ The astute reader may object that this is only locally true. Yes, well done. We'll get to
that in a bit.

Figure 4: Tangents to f(x) = x³ − x² for x = 0 and x = 2/3

It is easy to verify graphically that x = 0 is a maximum and x = 2/3 is a minimum,
but how do we go about it analytically? The trick is to notice that when
a function reaches a maximum, the derivative passes zero by going from positive
to negative, i.e. the function is increasing to the left of a maximum and
decreasing to the right. The opposite is true for a minimum. Mathematically,
this means that the second derivative is negative at a maximum and positive
at a minimum. We can easily verify this in our example by finding the second
derivative of our function

d²f(x)/dx² = 6x − 2

and plugging in our two x-values

f''(0) = 6 · 0 − 2 = −2 < 0

f''(2/3) = 6 · (2/3) − 2 = 2 > 0

3.1. Local and global maxima and minima

Is the analysis above sufficient to determine that our function is maximized at
x = 0 and minimized at x = 2/3? Unfortunately not, as is obvious if we expand
the axes of figure 4 as we've done in figure 5 below.

Figure 5: Tangents to f(x) = x³ − x² for x = 0 and x = 2/3

Even though the function takes its highest value at x = 0 if we are comparing
it to other points close to x = 0, we see that for points where x > 1 the function
value is even higher. We refer to a point such as the one at x = 0 as a local
maximum. If there had been no value of x where f(x) > f(0) we would have
said that the function also had a global maximum at x = 0. In this case, the
function has no global maximum, since the function will always increase in value
the more we increase x past one. Even when that is not the case, we could have
several local maxima, in which case we'd have to compare the function value in
each to determine which local maximum is the global maximum. We define a
local and a global minimum in a way analogous to the local and global maximum.
Fortunately, in the optimization problems we'll encounter in the course, all local
maxima and minima will always turn out to be global.

3.2. A strategy to solve optimization problems

The checklist below sums up what we've learned about optimization so far.
When trying to find the value of some variable x that maximizes or minimizes
some function f(x) we'll go through the following steps.

1. Find the first derivative of f(x) and set it equal to zero. The solutions to the
resulting equation will be interesting since they're potentially local maxima
or minima.

2. Find the second derivative of f(x) and evaluate it for the values of x that
we found in step 1. If the second derivative in a given point is negative
(positive), the function has a local maximum (minimum) in that point.

3. Figure out which, if any, of the local maxima or minima found in step 2 are
global.⁴ We won't bother with this step in the course.
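The checklist can be walked through in code for the running example f(x) = x³ − x², with the derivatives taken from section 3 (a sketch; the derivative formulas are hard-coded rather than computed symbolically):

```python
def f_prime(x):
    return 3 * x ** 2 - 2 * x        # step 1: first derivative

def f_double_prime(x):
    return 6 * x - 2                 # step 2: second derivative

candidates = [0, 2 / 3]              # roots of x(3x - 2) = 0

for x in candidates:
    assert abs(f_prime(x)) < 1e-12   # each candidate is a stationary point
    kind = "maximum" if f_double_prime(x) < 0 else "minimum"
    print(f"x = {x}: local {kind}")
```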

4. Statistics

We will briefly review the results in this section in lecture 4, so if you don't know
them by then you'll have a chance to catch up. However, we will only have time
to list the results we need, not to discuss their background in any detail or to
prove them, which is what is done in this section. Most proofs follow more or less
immediately from the definitions of expectations or covariances. Throughout
we will use Greek letters to denote constants and capital Latin letters to denote
stochastic variables.

4.1. Stochastic variables

In this section we'll be concerned with random events. Specifically, we'll be
interested in variables that will take some specific value in the future that is
unknown today. Such variables are called stochastic (or random) variables.
Typically, we'll have some asset return in mind. The realizations of stochastic
variables follow some probability distribution, i.e. each possible future value that
a variable can take is associated with some probability. We'll be interested in figuring
out some properties of that distribution, specifically its mean and variance,
which are described below.

4.2. Mathematical expectations

Let's start with the expected values or expectations. The expected value of a
discrete stochastic variable is defined as the probability weighted sum of the
possible outcomes. This is most easily understood through an example. Let X
be the number of eyes that show up when we cast a die. The possible outcomes
(sometimes referred to as realizations or states) are {1, 2, 3, 4, 5, 6}. Each
outcome occurs with some probability, in this case 1/6. To get the expected value
of X, denoted E(X), we multiply each outcome, i.e. the number of eyes shown
by the die, with the probability of that outcome occurring. Denoting outcome
s by x(s) and the probability of that outcome by p(s) we'd have

E(X) = Σ_{s=1}^{S} p(s)x(s)   (11)

In this course, we'll mainly be interested in asset returns, so there will be
many more possible outcomes than when casting a die. We could imagine that
the return of a stock over a period of one year could be −0.0001%, 0%, 0.0001%,
0.0002% etc. In fact, the number of possible outcomes is so great that it will be
easiest to think of it as a continuum, i.e. that the returns could take any value.
In this case the sum above turns into the integral below, which can be thought
of as a sum with the number of states (each corresponding to one term in the
summation) approaching infinity. The same logic still applies, though.

E(X) = ∫ p(s)x(s) ds   (12)

A straightforward interpretation of the expected value of a stochastic variable
is as the average of a large number of outcomes. For instance, if we cast a die
a very large number of times the average of the number of eyes it shows would
be close to 3.5, which is the expected value of X as defined above (and that
average would approach E(X) as the number of outcomes considered approaches
infinity). In terms of the probability distribution of some stochastic variable, X,
E(X) is the mean (something like the middle) of that distribution. For the
classic symmetric, bell-shaped distributions (such as the normal distribution) it
is the value at which the bell (strictly the p.d.f.) peaks.

⁴ This entails comparing the function values in each point to each other, evaluating the
function values in any boundary or non-differentiable points and studying the limits of f(x)
as x approaches plus or minus infinity.
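The die example can be reproduced exactly with Python's `fractions` module; a quick sketch of definition (11):

```python
from fractions import Fraction

# E(X) for one die cast: each outcome x(s) has probability p(s) = 1/6.
outcomes = [1, 2, 3, 4, 5, 6]
prob = Fraction(1, 6)

E_X = sum(prob * x for x in outcomes)
print(E_X, float(E_X))  # 7/2 3.5
```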
We'll rely on two properties of expectations in this course, namely

E(αX) = αE(X)   (13)

E(Σ_{n=1}^{N} X_n) = Σ_{n=1}^{N} E(X_n)   (14)

For simplicity, we'll prove these for discrete stochastic variables. The proofs for
continuous stochastic variables would be similar. To prove equation (13), that
for some constant, α, E(αX) = αE(X), we simply apply definition (11) to αX.

E(αX) = Σ_{s=1}^{S} p(s)αx(s)

E(αX) = α Σ_{s=1}^{S} p(s)x(s)

E(αX) = αE(X)

Proving equation (14), i.e. that the expectation of the sum of N stochastic variables
(each denoted by X_n) equals the sum of their respective expectations, is equally
straightforward, by applying definition (11) to the sum

E(Σ_{n=1}^{N} X_n) = Σ_{s=1}^{S} p(s) Σ_{n=1}^{N} X_n(s)

E(Σ_{n=1}^{N} X_n) = Σ_{n=1}^{N} Σ_{s=1}^{S} p(s)X_n(s)

E(Σ_{n=1}^{N} X_n) = Σ_{n=1}^{N} E(X_n)

4.3. Covariances, variances and correlations

The covariance between two stochastic variables X and Y is defined by

Cov(X, Y) = E[(X − E[X])(Y − E[Y])]   (15)

That is, the covariance is the expectation of the product of the deviations of the
two variables from their respective means. The covariance captures two characteristics
of the involved variables. Firstly, it tells us whether the two variables
tend to realize on the same or opposite sides of their respective means. If, as
will mostly be the case in this course, the variables we have in mind are daily
returns, we can think of X − E(X) as the "surprise" return of asset X on a certain
day and Y − E(Y) would be the corresponding surprise return of asset
Y. If the surprise return of asset X tends to be positive (negative) on the same
days that it's positive (negative) in asset Y, the product (X − E[X])(Y − E[Y])
would tend to be positive and Cov(X, Y) would have a positive sign. We might
say that assets X and Y move (or vary) together. If on the other hand the surprise
return of asset X tends to be positive (negative) on the same days that it's
negative (positive) in asset Y, the product (X − E[X])(Y − E[Y]) would tend
to be negative and Cov(X, Y) would have a negative sign. We might say that
assets X and Y move (or vary) against each other. The magnitude of Cov(X, Y)
depends on how large the surprises are in the two assets. Specifically, if X and
Y tend to realize far away from their respective means we would tend to multiply
two large numbers with each other when calculating the expectation of the
product and the resulting sum, Cov(X, Y), would be large. If the surprises tend
to be small, the resulting sum, Cov(X, Y), would also be small.

It is convenient to separate the two effects that are captured by the covariance
into two separate measures. The first of these would be the variance. This is the
special case in which we're considering the covariance of a variable with itself, i.e.

Var(X) = Cov(X, X) = E[(X − E[X])(X − E[X])] = E[(X − E[X])²]   (16)

Since (X − E[X])² is a square, it will always have a positive sign. That is, the
surprise in X will trivially be of the same sign as itself every time. The variance
can therefore not be said to measure something like a co-movement in any
interesting sense. Only the second effect measured by the covariance remains,
i.e. how large the average surprise in X tends to be.⁵ We often denote Var(X)
by σ²_X. It is useful to remember that the variance is just a special case of the
covariance, because we won't need any specific results for variances. As long
as we know how to deal with covariances, we'll be in good shape to deal with
variances, too. It will often be useful to speak of the square root of the variance,
i.e. σ_X, which we call the standard deviation of X. One advantage of σ_X over
σ²_X is that σ_X will be expressed in the same unit as X. So if X is an asset
return, we may interpret σ_X in terms of percentage units. For instance, if X
is normally distributed, we know that 68% of the (absolute) surprises will be
smaller than σ_X percentage units. The variance (and covariance) on the other
hand is expressed in squared percentage units, which does not have an obvious
interpretation.

⁵ The expectation of the squared deviation, to be more precise, but these are obviously
related in a straightforward manner.

What about the other effect captured by the covariance, i.e. the tendency of
variables to move together? This is captured by the correlation coefficient, which
is typically denoted by ρ_XY and defined by

ρ_XY = Cov(X, Y)/(σ_X σ_Y)   (17)

That is, it is a standardized version of Cov(X, Y). Recall that the σs in some
sense capture the expected size of the surprises in the two variables. By dividing
the covariance by the σs we take away that effect and are only left
with the tendency of the variables to move together. The correlation coefficient
will always take a value in the interval [−1, 1], with the endpoints representing
perfect co-movements, i.e. if ρ_XY = 1 then the surprises in X and Y will be of
the same type, i.e. have the same sign, 100% of the time. If ρ_XY = −1 then
the surprises in X and Y will be of opposite type 100% of the time. If ρ_XY = 0
then there's no particular relationship between the variables. You may interpret
intermediate values of ρ with these three reference points in mind.
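Definitions (15)-(17) translate directly into code. The sketch below uses a made-up three-state example (the return figures are illustrative, not from the course), computing covariance, variance and correlation straight from the state probabilities:

```python
p = [1 / 3, 1 / 3, 1 / 3]        # state probabilities, equally likely
x = [0.10, 0.02, -0.03]          # returns of asset X in each state
y = [0.08, 0.01, -0.02]          # returns of asset Y in each state

def expectation(v):
    """E(V) per definition (11)."""
    return sum(pi * vi for pi, vi in zip(p, v))

def cov(a, b):
    """Cov(A, B) per definition (15): E[(A - E[A])(B - E[B])]."""
    ea, eb = expectation(a), expectation(b)
    return expectation([(ai - ea) * (bi - eb) for ai, bi in zip(a, b)])

var_x, var_y = cov(x, x), cov(y, y)        # variance is Cov(X, X)
corr = cov(x, y) / (var_x ** 0.5 * var_y ** 0.5)
print(round(corr, 3))  # close to 1: the two assets move together
```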

We'll make use of the following properties of covariances in this course

Cov(X, Y) = Cov(Y, X)   (18)

Cov(αX, Y) = αCov(X, Y)   (19)

Cov(Σ_{n=1}^{N} X_n, Σ_{m=1}^{M} Y_m) = Σ_{n=1}^{N} Σ_{m=1}^{M} Cov(X_n, Y_m)   (20)

Equation (18) follows immediately from definition (15)

Cov(X, Y) = E([X − E(X)][Y − E(Y)])

Cov(X, Y) = E([Y − E(Y)][X − E(X)])

Cov(X, Y) = Cov(Y, X)

Similarly, to prove equation (19), let's apply the same definition again

Cov(αX, Y) = E([αX − E(αX)][Y − E(Y)])

Cov(αX, Y) = E([α(X − E[X])][Y − E(Y)])

Cov(αX, Y) = E(α[X − E(X)][Y − E(Y)])

Cov(αX, Y) = αE([X − E(X)][Y − E(Y)]) = αCov(X, Y)

Finally, to prove equation (20) apply the definition of the covariance again

Cov(Σ_{n=1}^{N} X_n, Σ_{m=1}^{M} Y_m) = E([Σ_{n=1}^{N} X_n − E(Σ_{n=1}^{N} X_n)][Σ_{m=1}^{M} Y_m − E(Σ_{m=1}^{M} Y_m)])

Cov(Σ_{n=1}^{N} X_n, Σ_{m=1}^{M} Y_m) = E([Σ_{n=1}^{N} X_n − Σ_{n=1}^{N} E(X_n)][Σ_{m=1}^{M} Y_m − Σ_{m=1}^{M} E(Y_m)])

Cov(Σ_{n=1}^{N} X_n, Σ_{m=1}^{M} Y_m) = E([Σ_{n=1}^{N} (X_n − E(X_n))][Σ_{m=1}^{M} (Y_m − E(Y_m))])

Let's simplify the notation at this point by defining Z̃ = Z − E(Z), and rewrite
the equation above as

Cov(Σ_{n=1}^{N} X_n, Σ_{m=1}^{M} Y_m) = E([Σ_{n=1}^{N} X̃_n][Σ_{m=1}^{M} Ỹ_m])

Cov(Σ_{n=1}^{N} X_n, Σ_{m=1}^{M} Y_m) = E(Σ_{n=1}^{N} Σ_{m=1}^{M} X̃_n Ỹ_m)

Cov(Σ_{n=1}^{N} X_n, Σ_{m=1}^{M} Y_m) = Σ_{n=1}^{N} Σ_{m=1}^{M} E(X̃_n Ỹ_m)

Cov(Σ_{n=1}^{N} X_n, Σ_{m=1}^{M} Y_m) = Σ_{n=1}^{N} Σ_{m=1}^{M} Cov(X_n, Y_m)

4.4. Regression analysis

Linear regression analysis is probably the tool most used by econometricians. It
is a way to statistically analyze the relationship between two (or more) variables.
It is not covered in this document but it will be used in the course, so please
review your old statistics notes to refresh your knowledge.
