
Basics of Multivariate Normal (MVN) Tutorial

Unit 2 Definition and properties of MVN-I


Structure
2.1 Introduction
2.2 Scope of multivariate analysis
2.3 Variance-covariance matrices
2.4 Multivariate distributions
2.5 Nonsingular Multivariate Normal Distribution
2.6 Characterization of multivariate normal via linear functions
2.7 Summary
2.8 References
2.9 Solutions to Exercises

2.1 INTRODUCTION
We begin the unit with an overview of the scope of multivariate analysis. We identify the
extensions of problems of univariate analysis to higher dimensions and also outline the
problems special to multivariate analysis which have no univariate equivalents. We
do this in section 2.2. In section 2.3 we study the properties of variance-covariance
matrices in detail. We also identify the class of variance-covariance matrices with the
class of nnd matrices. In section 2.4 we give several examples of discrete and continuous
multivariate distributions. We also define the concepts of marginal and conditional
distributions and also the important concept of independence. We illustrate these
concepts with the help of examples. We also give a method of obtaining the density of a
transformed random vector. In section 2.5 we introduce multivariate normal distribution
using its density function. We obtain the marginal and conditional distributions. In
section 2.6 we study the multivariate normal distribution defined via linear functions
and obtain several properties in an elegant manner. We also show that the two definitions
of multivariate normal coincide if the variance-covariance matrix is positive definite.
Objectives:
After completing this unit, you should be able to
- understand the similarities in and differences between univariate and multivariate analysis;
- play around with the mean vectors, variance-covariance matrices and covariance matrices of transformed variables;
- understand and apply the concepts of marginal and conditional distributions and independence;
- understand and apply the properties of the multivariate normal distribution;
- appreciate the beauty of the density-free approach to the multivariate normal distribution.

2.2 SCOPE OF MULTIVARIATE ANALYSIS


Let us start with an example. A software company wants to recruit 3 fresh engineering
graduates. There are 20 applicants and their scores (out of 100) in the aptitude test (x1),
the test on software technology (x2) and the interview (x3) are recorded below in a matrix
form
X = ((xij)), the 20 × 3 matrix of recorded scores, whose iᵗʰ row (xi1, xi2, xi3) gives the scores of the iᵗʰ candidate in the aptitude test, the test on software technology and the interview respectively.

The vector x = (x1, x2, x3)ᵗ is called a random vector and the matrix X is called a data matrix
related to x. The (i, j)th element xij of X denotes the score of the ith candidate in the
aptitude test or the test on software technology or the interview according as j = 1, 2, 3
respectively. Thus x51 = 49 is the score of the fifth candidate in the aptitude test. Each
row of X corresponds to the scores of a candidate and each column of X corresponds to
the scores in a particular test / interview. Univariate analysis deals with the data on a
single variable, say, those in the interview. Multivariate analysis deals with data on more

than one variable (possibly correlated) collected on the same subjects. One of the major
aims of statistics relates to dealing with variability in the data. By dealing with
variability we mean (i) determining the extent of variability, (ii) identifying the sources of
variability and (iii) either controlling the variability by taking suitable measures, or taking
advantage of the variability to select certain subjects in an optimal manner, or classifying
the subjects or variables into different groups depending upon the variability. When we
deal with a single variable, the variability is often quantified by the variance. When we
deal with more than one variable, then the variability is often quantified by the matrix of
variances and covariances. For example, in the case of the data mentioned above, the
variability is quantified by the matrix
Σ = ((σij)) of order 3 × 3,

where σij = Cov(xi, xj), i, j = 1, 2, 3.


Such a matrix is called the variance covariance matrix of x and is denoted by D(x).
For the above data in the data matrix X, the variance covariance matrix of the random
vector x is
     [  90.0947   68.0579   66.0474 ]
     [  68.0579   80.6816   65.0816 ]
     [  66.0474   65.0816  169.7342 ]

Let us turn our attention to the recruitment problem. If the recruitment is based on just the
interview scores, then candidates 6, 10 and 2 get selected. Again, if the recruitment is
based on the aptitude test, candidates 19, 6 and 8 get selected, whereas under the
criterion of software technology scores, candidates 19, 2 and 8 get selected. Notice
that it is not realistic or optimal to base the judgment on the scores of just one of the three
as we are ignoring useful information on the others. One possible way of using the
scores on all the three is taking the average of the scores on the three (two tests and the
interview) for each candidate and select the three candidates with the top three average
scores. What should be our objective in choosing the criterion? We should choose a
criterion which can distinguish among the candidates in the best possible manner. When
we took the average, we took the linear combination lᵗx where lᵗ = (1/3, 1/3, 1/3).
Why not look for a linear combination pᵗx which, over all linear combinations,
distinguishes among the candidates in the best manner, in other words one
which has the largest variability (variance), and use that as an index for the selection
criterion? This is precisely what is done in obtaining the first principal component. The
first principal component ptx is a normalized linear combination of x which has the
largest variance among all normalized linear combinations of x.
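The first principal component can be sketched numerically. In the snippet below the 20 × 3 score matrix is randomly generated purely for illustration (the actual scores of the example are not assumed); `numpy.linalg.eigh` is used since a variance-covariance matrix is symmetric, and the eigenvector of the largest eigenvalue is the maximizing normalized linear combination p.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 20 x 3 score matrix; the numbers are made up for illustration.
X = rng.integers(30, 90, size=(20, 3)).astype(float)

S = np.cov(X, rowvar=False)            # 3 x 3 sample variance-covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)   # eigenvalues in ascending order
p = eigvecs[:, -1]                     # direction of largest variance, ||p|| = 1

# p^t x has the largest variance among all normalized linear combinations:
pc1_scores = X @ p                     # first principal component scores
```

Candidates would then be ranked by `pc1_scores`; the variance of these scores equals the largest eigenvalue of S.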
Getting information on any variable is expensive in terms of time and/or money. We may
like to ask whether it is worthwhile conducting the test in software technology given that

the aptitude test and interview are being conducted. Put in other words, does the test in
software technology provide significant additional information in the presence of the
aptitude test and the interview? This is called the assessment of additional information.
Based on the data X on x can we group the candidates into some well-defined classes?
This may be useful information if the company has jobs of different types: (a)
requiring high skills and (b) requiring medium skills but intensive hard work. There may
be a third group which is not of any use to the company. This is called the problem of
discrimination.
How well can we predict the interview score based on the two test scores? This problem
is called the problem of multiple regression and correlation. Suppose the interview
mentioned above is a technical interview. Assume that there is another HR interview and
the score on HR interview be denoted by x4. We may be interested in the association
between the tests and the interviews, i.e., between (x1, x2) and (x3, x4). Such a problem is
called the problem of canonical correlations. The problems mentioned above are some
problems specific to multivariate analysis which do not occur in univariate analysis. In
univariate analysis, we talk about inferences (estimation / testing) on the mean /
proportion / variance of a variable. These problems can be extended to the inferences on
mean vector / variance covariance matrix of a random vector. Univariate analysis of
variance has an analogue in Multivariate analysis of variance.
In the blocks 5, 6, 7 and 8 we study these topics in detail.
2.3 VARIANCE COVARIANCE MATRICES
As we saw in the previous section, the variance covariance matrix of a random vector is a
quantification of the joint variability of the components of the random vector. The
variance covariance matrices play a very important role in multivariate analysis. In this
section we formally define a random vector, its mean vector and variance-covariance
matrix. We shall obtain formulae for the mean vector and variance-covariance matrix of
linear compounds of a given random vector. We shall give a method of transforming
correlated random variables to uncorrelated random variables. We shall show that every
variance covariance matrix is nnd and that every nnd matrix is the variance-covariance
matrix of a random vector.
Definition: A random vector x = (x1, x2, …, xp)ᵗ of order p × 1 is a finite sequence of random
variables x1, x2, …, xp. E(x) = (E(x1), …, E(xp))ᵗ is called the mean vector of x, where E(xi)
denotes the mean of xi. Let σij denote the covariance between xi and xj. Then the matrix
Σ = ((σij)) of order p × p is called the variance-covariance matrix of x, denoted by D(x).
Let y = (y1, …, yq)ᵗ be another random vector. Let γij denote the covariance between xi and yj,
i = 1, …, p, j = 1, …, q. Then Γ = ((γij)) of order p × q is called the covariance matrix between
x and y and is denoted by Cov(x, y).
Clearly D(x) = Cov(x,x)
Example 1. Let x1, x2 and x3 be random variables with means 2.3, -4.1 and 1.5
respectively and variances 4, 9 and 16 respectively. Let ρij denote the correlation
coefficient between xi and xj, i, j = 1, 2, 3, and let ρ12 = 0.5, ρ13 = 0.3 and ρ23 = -0.4.
Write down the mean vector and the variance-covariance matrix of x = (x1, x2, x3)ᵗ.

Solution. The mean vector of x is E(x) = (2.3, -4.1, 1.5)ᵗ.
Let Σ = ((σij)) denote the variance-covariance matrix of x. Let V(·) and Cov(·, ·) denote
the variance and covariance respectively. Then
σ11 = V(x1) = 4, σ22 = V(x2) = 9 and σ33 = V(x3) = 16,
σ12 = Cov(x1, x2) = ρ12 √(V(x1)V(x2)) = 0.5 × 2 × 3 = 3.0,
σ13 = Cov(x1, x3) = ρ13 √(V(x1)V(x3)) = 0.3 × 2 × 4 = 2.4,
σ23 = Cov(x2, x3) = ρ23 √(V(x2)V(x3)) = -0.4 × 3 × 4 = -4.8.
Thus
Σ = [  4.0   3.0   2.4
       3.0   9.0  -4.8
       2.4  -4.8  16.0 ]

Notice that Σ is symmetric in the above example. In fact, this is true for every variance-covariance matrix, because σij = Cov(xi, xj) = Cov(xj, xi) = σji for all i and j. Since the
leading principal minors of Σ are 4.0, 27.0 and 218.88 respectively, it follows from
theorem 5 of unit 1 that Σ is positive definite. We shall later show that every variance-covariance matrix is non-negative definite.
Let x be a random vector. Consider lᵗx where l is a fixed vector (i.e., the components of l
are not random variables). We shall now find the mean and variance of lᵗx.
Theorem 1. Let x (p × 1) be a random vector with E(x) = μ and variance-covariance matrix of
x equal to Σ. Let l be a fixed vector and let lᵗx = l1x1 + l2x2 + … + lpxp be a linear
combination of the components of x. Then E(lᵗx) = lᵗE(x) = lᵗμ and V(lᵗx) = lᵗΣl. Further,
Cov(lᵗx, mᵗx) = lᵗΣm, where m is a fixed vector.
Proof: E(lᵗx) = E(l1x1 + l2x2 + … + lpxp)
= l1E(x1) + l2E(x2) + … + lpE(xp)
= lᵗE(x) = lᵗμ.
V(lᵗx) = V(l1x1 + … + lpxp)
= Cov(l1x1 + … + lpxp, l1x1 + … + lpxp)
= ΣΣ li lj σij (summing over i, j = 1, …, p) = lᵗΣl.
Similarly,
Cov(lᵗx, mᵗx) = ΣΣ li mj σij (summing over i, j = 1, …, p) = lᵗΣm.

Example 2. Find the mean and variance of lᵗx in example 1 when lᵗ = (1/3, 1/3, 1/3). Also
find the covariance between lᵗx and mᵗx, where m is a fixed vector.
Solution. Mean of lᵗx = lᵗE(x) = l1E(x1) + l2E(x2) + l3E(x3)
= (1/3)(E(x1) + E(x2) + E(x3)) = (1/3)(2.3 - 4.1 + 1.5)
= (1/3)(-0.3) = -0.1.
Next, since (1, 1, 1)Σ = (9.4, 7.2, 13.6) for the Σ of example 1,
V(lᵗx) = lᵗΣl = (1/9)(1, 1, 1)Σ(1, 1, 1)ᵗ = (1/9)(9.4 + 7.2 + 13.6) = (1/9)(30.2) = 3.36 (approximately).
Similarly, Cov(lᵗx, mᵗx) = lᵗΣm = (1/3)(9.4, 7.2, 13.6)m for any fixed m; for instance,
with mᵗ = (1, 1, -1) this gives (1/3)(9.4 + 7.2 - 13.6) = 1.0.
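The numbers in example 2 can be verified directly; a minimal NumPy sketch using the mean vector and Σ obtained in example 1:

```python
import numpy as np

mu = np.array([2.3, -4.1, 1.5])          # E(x) from example 1
Sigma = np.array([[4.0, 3.0,  2.4],
                  [3.0, 9.0, -4.8],
                  [2.4, -4.8, 16.0]])    # Sigma from example 1
l = np.array([1, 1, 1]) / 3

mean_lx = l @ mu          # E(l^t x) = l^t mu      -> -0.1
var_lx = l @ Sigma @ l    # V(l^t x) = l^t Sigma l -> 30.2/9 = 3.355...
```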

E1. Let x = (x1, x2, x3, x4)ᵗ be a random vector with mean vector μ = (1, 1, 1, 1)ᵗ and
variance-covariance matrix
Σ = [ 1    0.2  0.2  0.2
      0.2  1    0.2  0.2
      0.2  0.2  1    0.2
      0.2  0.2  0.2  1   ].
Find the mean and variance of lᵗx and Cov(lᵗx, mᵗx), where lᵗ = (1, 1, 1, 1) and
mᵗ = (1, 1, -1, -1).
We know that V(x) = E(x - E(x))² and Cov(x, y) = E((x - E(x))(y - E(y))). Is there a
multivariate analogue of the above? Notice that
D(x) = Σ = ((σij)),
where σij = Cov(xi, xj) = E((xi - E(xi))(xj - E(xj))). Thus

D(x) = E[(x - E(x))(x - E(x))ᵗ]        (2.3.1)

It can be shown similarly that

Cov(x, y) = E[(x - E(x))(y - E(y))ᵗ]   (2.3.2)

We shall now extend the results of theorem 1 to several linear functions.


Theorem 2. Let x and y be random vectors of orders p × 1 and q × 1 respectively. Let E(x)
= μ, E(y) = ν, D(x) = Σ, D(y) = Ψ and Cov(x, y) = Γ.
Let B (r × p) and C (s × q) be fixed (nonrandom) matrices. Then the following hold:

(a) E(Bx) = Bμ
(b) D(Bx) = BΣBᵗ
(c) Cov(Bx, Cy) = BΓCᵗ

Proof: (a) Write B in terms of its rows, so that biᵗ is the iᵗʰ row of B, i = 1, …, r. Then
Bx = (b1ᵗx, …, brᵗx)ᵗ and
E(Bx) = (E(b1ᵗx), …, E(brᵗx))ᵗ = (b1ᵗE(x), …, brᵗE(x))ᵗ = BE(x) = Bμ.
(b) D(Bx) = Cov(Bx, Bx). The (i, j)th element of D(Bx) is Cov(biᵗx, bjᵗx) = biᵗΣbj
for i, j = 1, …, r, from theorem 1. Thus D(Bx) = BΣBᵗ.
(c) Let cjᵗ denote the jᵗʰ row of C, j = 1, …, s. The (i, j)th element of Cov(Bx, Cy) is
Cov(biᵗx, cjᵗy) = biᵗΓcj. Hence Cov(Bx, Cy) = BΓCᵗ.
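Theorem 2(b) can also be seen empirically: if we draw many samples of x with variance-covariance matrix Σ and transform them by a fixed B, the sample variance-covariance matrix of Bx approaches BΣBᵗ. A sketch (the particular Σ and B below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3))
Sigma = A @ A.T                  # an arbitrary pd variance-covariance matrix
B = rng.normal(size=(2, 3))      # a fixed 2 x 3 matrix

# x = Az has D(x) = A I A^t = Sigma (theorem 2(b) applied to z with D(z) = I)
x = rng.normal(size=(200_000, 3)) @ A.T

empirical = np.cov((x @ B.T).T)  # sample D(Bx), a 2 x 2 matrix
exact = B @ Sigma @ B.T          # the value theorem 2(b) predicts
```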

Example 3. Let Σ be the variance-covariance matrix of a random vector x of order p × 1.
Let rij denote the correlation coefficient between xi and xj, i, j = 1, …, p.
Write R = ((rij)). R is called the correlation matrix of x.
Write T = diag(√σ11, √σ22, …, √σpp).
Thus T is a diagonal matrix with iᵗʰ diagonal element equal to the standard deviation of xi,
i = 1, …, p. Assume that σ11, …, σpp are strictly positive. Show that R = T⁻¹ΣT⁻¹.

Solution. Notice that
rij = Cov(xi, xj)/√(V(xi)V(xj)) = σij/√(σii σjj).
Since T⁻¹ = diag(σ11^(-1/2), …, σpp^(-1/2)), the (i, j)th element of T⁻¹ΣT⁻¹ is
σii^(-1/2) σij σjj^(-1/2) = σij/√(σii σjj) = rij.
Thus R = T⁻¹ΣT⁻¹.
This establishes a relationship between the variance-covariance matrix and the correlation
matrix of a random vector.
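Example 3 in code, using the Σ obtained in example 1; the computation recovers the correlation coefficients 0.5, 0.3 and -0.4 that example 1 started from:

```python
import numpy as np

Sigma = np.array([[4.0, 3.0,  2.4],
                  [3.0, 9.0, -4.8],
                  [2.4, -4.8, 16.0]])      # Sigma of example 1

T_inv = np.diag(np.diag(Sigma) ** -0.5)   # T^{-1} = diag(1/sqrt(sigma_ii))
R = T_inv @ Sigma @ T_inv                 # correlation matrix R = T^{-1} Sigma T^{-1}
```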
E2.

(a) Give an alternative matrix proof of theorem 2(b) and (c).


(b) Obtain the correlation matrix of a random vector x with the following
variance-covariance matrix.
Σ = [  4.0   3.0   2.4
       3.0   9.0  -4.8
       2.4  -4.8  16.0 ]
(Compare your results with the matrices in example 1.)

We are ready now to show the equality of the class of all nnd matrices with the class of
all variance-covariance matrices as promised in the beginning of this section.
Theorem 3.

(a) Every variance-covariance matrix is nonnegative definite (nnd).


(b) Every nnd matrix is the variance covariance matrix of a random vector.

Proof: (a) Let Σ be the variance-covariance matrix of a random vector x. Then for each
fixed l, lᵗΣl = V(lᵗx) ≥ 0 (since the variance of a random variable is nonnegative). Hence
Σ is nnd.
(b) Let Σ (p × p) be an nnd matrix. Then by theorem 4(b) of unit 1, there exists a matrix C of
order p × r for some positive integer r such that Σ = CCᵗ. Let x1, x2, …, xr be independent
random variables, each with variance 1. Write xᵗ = (x1, …, xr). Then D(x) = I, the
identity matrix of order r × r. Write y = Cx. Then by theorem 2, D(y) = CICᵗ = CCᵗ = Σ.
Corollary: The variance-covariance matrix Σ of a random vector x is positive semi-definite if and only if there exists a fixed non-null vector l such that lᵗx is a constant with
probability 1.
Proof:
Σ is positive semi-definite
⇔ there exists a fixed l ≠ 0 such that lᵗΣl = 0
⇔ there exists a fixed l ≠ 0 such that V(lᵗx) = 0
⇔ there exists a fixed l ≠ 0 such that lᵗx is a constant with probability 1.

In general, independent / uncorrelated random variables are easier to handle statistically
than correlated random variables. We shall now give methods of transforming a
random vector (whose components are correlated) with a positive definite variance-covariance matrix into a random vector whose components are uncorrelated, by a
suitable linear transformation.
Let x be a random vector whose variance-covariance matrix D(x) = Σ is a non-diagonal pd matrix.
Method 1. Let Σ = PΛPᵗ be a spectral decomposition of Σ, where P is orthogonal and Λ
diagonal. Since Σ is pd, so is Λ. Write η = Pᵗx. Then D(η) = PᵗΣP = PᵗPΛPᵗP = Λ, since P is
orthogonal. Since Λ is a diagonal matrix, the components of η are uncorrelated. Also
V(ηi) = λi, the iᵗʰ diagonal element of Λ.
If we write y = Λ^(-1/2)Pᵗx, then D(y) = Λ^(-1/2)PᵗPΛPᵗPΛ^(-1/2) = I.
Thus the components of y are uncorrelated, each with variance 1.

Before proceeding further, let us recall that if Σ is pd, then there exists a nonsingular
matrix B such that Σ = BBᵗ. Write y = B⁻¹x. Then D(y) = B⁻¹Σ(B⁻¹)ᵗ = B⁻¹BBᵗ(Bᵗ)⁻¹ = I.
Hence the components of y are uncorrelated, each with variance 1. Notice that in method
1, PΛ^(1/2) is a choice for the matrix B.
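Method 1 can be sketched as follows, using the 2 × 2 matrix that appears in example 4 below; `numpy.linalg.eigh` returns the spectral decomposition Σ = PΛPᵗ with eigenvalues in ascending order:

```python
import numpy as np

Sigma = np.array([[2.0, 1.0],
                  [1.0, 2.0]])

lam, P = np.linalg.eigh(Sigma)   # Sigma = P diag(lam) P^t, lam ascending
eta_cov = P.T @ Sigma @ P        # D(eta) for eta = P^t x: diagonal, entries lam

W = np.diag(lam ** -0.5) @ P.T   # y = Lambda^{-1/2} P^t x
y_cov = W @ Sigma @ W.T          # D(y) = I: uncorrelated, unit variances
```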
In unit 1, we gave a method of computing a lower triangular square root of a pd matrix.
Before giving method 2, we shall give another algorithm for obtaining a lower
triangular square root of a pd matrix. This algorithm also gives the inverse
of the triangular square root as a bonus, whereby we can write down y immediately.
Let Σ be a pd matrix of order p × p.
Algorithm:

Step 1. Form the matrix T = (Σ : I).
Step 2. Set i = 1 (i is the sweep-out number).
Step 3. Replace the iᵗʰ row of T by (iᵗʰ row of T)/√tii.
Step 4. Is i = p? If yes, go to step 9. If no, go to step 5.
Step 5. Set j = i + 1.
Step 6. Replace the jᵗʰ row of T by (jᵗʰ row of T) - (tji/tii)(iᵗʰ row of T).
Step 7. Is j = p? If yes, go to step 9. If no, go to step 8.
Step 8. Replace j by j + 1 and go to step 6.
Step 9. Is i = p? If yes, go to step 11. If no, go to step 10.
Step 10. Replace i by i + 1 and go to step 3.
Step 11. The first p columns of T form Bᵗ and the last p columns of T form
B⁻¹, where Σ = BBᵗ with B lower triangular.
The proof is beyond the scope of these notes; one may refer to Rao and
Bhimasankaram (2000), page 358.
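The sweep-out algorithm above translates directly into code; a sketch, tested here on the 2 × 2 matrix used in example 4 below:

```python
import numpy as np

def lower_triangular_root(Sigma):
    """Sweep-out algorithm: row operations on T = (Sigma : I) return
    (B, B_inv) with Sigma = B B^t and B lower triangular."""
    p = Sigma.shape[0]
    T = np.hstack([Sigma.astype(float), np.eye(p)])
    for i in range(p):
        T[i] = T[i] / np.sqrt(T[i, i])            # step 3: normalise sweep row
        for j in range(i + 1, p):                 # steps 5-8: clear rows below
            T[j] = T[j] - (T[j, i] / T[i, i]) * T[i]
    Bt, B_inv = T[:, :p], T[:, p:]                # step 11
    return Bt.T, B_inv

Sigma = np.array([[2.0, 1.0],
                  [1.0, 2.0]])
B, B_inv = lower_triangular_root(Sigma)
```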
Method 2. Obtain a lower triangular square root B of Σ and B⁻¹ by the above algorithm.
Then write down y = B⁻¹x. Since B is lower triangular, so is B⁻¹. Writing B⁻¹ = ((bij)),
we get
y1 = b11x1
y2 = b21x1 + b22x2
⋮
yi = bi1x1 + … + biixi
⋮
yp = bp1x1 + … + bppxp
Observe that the first component of y is a scalar multiple of the first component of x, the second
component of y is a linear combination of the first two components of x, and so on.
We now illustrate the above methods with an example.
Example 4. Let x be a random vector with
D(x) = Σ = [ 2  1
             1  2 ].
(a) Make an orthogonal transformation η = Pᵗx (where P is an orthogonal matrix) such
that the components of η are uncorrelated. Obtain the variances of η1 and η2.
(b) Make a nonsingular linear transformation so that the components of y are
uncorrelated, each with variance 1, by both the methods described above.
Solution: (a) Using the solution of E4 of Unit 1, we have the spectral decomposition
Σ = 3p1p1ᵗ + 1·p2p2ᵗ = (p1 : p2) diag(3, 1) (p1 : p2)ᵗ = PΛPᵗ,
where p1 = (1/√2)(1, 1)ᵗ, p2 = (1/√2)(1, -1)ᵗ and P = (p1 : p2).
Hence the required orthogonal matrix is
Pᵗ = (1/√2) [ 1   1
              1  -1 ],
and the transformation is η = Pᵗx. Thus η1 = (x1 + x2)/√2 and η2 = (x1 - x2)/√2.
Now D(η) = PᵗΣP = Λ = diag(3, 1), so V(η1) = 3 and V(η2) = 1.
(b) Method 1. Write y1 = η1/√3 and y2 = η2, i.e., y = Λ^(-1/2)Pᵗx. Then
D(y) = Λ^(-1/2)PᵗΣPΛ^(-1/2) = Λ^(-1/2)ΛΛ^(-1/2) = I,
so y1 and y2 are uncorrelated and V(y1) = V(y2) = 1.
Method 2. Form the matrix T = (Σ : I) and proceed as below.
(1)   2     1      1      0
(2)   1     2      0      1
(3)   √2    1/√2   1/√2   0       = (1)/√2
(4)   0     3/2    -1/2   1       = (2) - (1/√2)(3)
(5)   √2    1/√2   1/√2   0       = (3)
(6)   0     √6/2   -1/√6  2/√6    = (4)/√(3/2)
Thus
Bᵗ = [ √2  1/√2      so  B = [ √2    0        and  B⁻¹ = [ 1/√2   0
       0   √6/2 ],            1/√2  √6/2 ],               -1/√6  2/√6 ].
The required transformation is y = B⁻¹x. Hence
y1 = x1/√2 and y2 = -x1/√6 + 2x2/√6.

E3.

Consider a random vector x = (x1, x2, x3, x4, x5)ᵗ with mean vector μ and
variance-covariance matrix Σ (5 × 5) as given. Write y = (x1, x2, x3)ᵗ and
z = (x4, x5)ᵗ, so that x = (yᵗ, zᵗ)ᵗ, and partition μ and Σ accordingly. Let
B (2 × 3) and C (2 × 2) be the given fixed matrices. Write u = By and v = Cz.
Compute the following:
(a) E(u), D(u)
(b) E(v), D(v) and
(c) Cov(u, v).

E4. Let x be a random vector with mean vector E(x) as given and
D(x) = [  2  -1  -1
         -1   2  -1
         -1  -1   2 ].
Notice that each row sum of D(x) is 0. Obtain a linear combination lᵗx which is
constant with probability 1. What is the value of this constant?

E5. Let x be a random vector with
D(x) = Σ = [ 4   2   6
             2  17  27
             6  27  70 ].
Use method 2 to obtain a lower triangular matrix B and its inverse such that the
components of y = B⁻¹x are uncorrelated, each with variance 1.

Example 5. Let Σ be the variance-covariance matrix of a random vector x and let σii = 0.
Show that the covariance of xi with all the other components is zero.
Solution 1. V(xi) = σii = 0, so xi is a constant with probability 1. Hence Cov(xi, xj) = 0
for all j.
Solution 2. Σ, being a variance-covariance matrix, is nnd by theorem 3(a). Now by
example 11 of Unit 1, σii = 0 implies σij = 0 for all j. Hence Cov(xi, xj) = σij = 0 for all j.
2.4 MULTIVARIATE DISTRIBUTIONS
In this section we give an introduction to multivariate distribution. By a multivariate
distribution, we mean the joint distribution of some random variables. We give examples
of discrete and continuous multivariate distributions. From the joint distribution of
x1, …, xp we obtain the individual distribution (which we call the marginal distribution) of
each of x1, x2, …, xp. We define the concept of conditional distribution. We briefly study the
concept of independence of random variables and its relation to uncorrelatedness. First,
let us consider a few examples.
Example 6. A college has 2 specialists in long distance running, 4 specialists in Tennis
and 6 top level cricketers among its students. The college plans to send 3 sportsmen from
the above for participating in the University sports and games. The three sportsmen are
selected randomly from among the above 12. Let x1 and x2 denote respectively the
number of long distance specialists and the number of tennis specialists chosen. The joint
probability mass function of x1 and x2 is defined as P{x1 = i, x2 = j} for i =0,1,2 and j =
0,1,2,3. Obtain the joint probability mass function of x1, x2.
Solution: There are C(12, 3) = 220 equally likely ways of choosing 3 sportsmen out of
the 12, where C(n, k) denotes the number of ways of choosing k objects out of n. If
x1 = i and x2 = j, then 3 - i - j cricketers are chosen. Hence
p(0, 0) = P{x1 = 0, x2 = 0} = C(6, 3)/C(12, 3) = 20/220
p(0, 1) = P{x1 = 0, x2 = 1} = C(4, 1)C(6, 2)/C(12, 3) = 60/220
Similarly,
p(0, 2) = C(4, 2)C(6, 1)/C(12, 3) = 36/220
p(0, 3) = C(4, 3)/C(12, 3) = 4/220
p(1, 0) = C(2, 1)C(6, 2)/C(12, 3) = 30/220
p(1, 1) = C(2, 1)C(4, 1)C(6, 1)/C(12, 3) = 48/220
p(1, 2) = C(2, 1)C(4, 2)/C(12, 3) = 12/220
p(1, 3) = 0, since the number of persons chosen is 3
p(2, 0) = C(2, 2)C(6, 1)/C(12, 3) = 6/220
p(2, 1) = C(2, 2)C(4, 1)/C(12, 3) = 4/220
p(2, 2) = 0
p(2, 3) = 0

The values taken by x1 and x2 and the corresponding probabilities constitute the joint
distribution of x1 and x2 and can be expressed in tabular form as follows:

Table 1
Joint distribution of x1 and x2

Value taken by x1   x2 = 0    x2 = 1    x2 = 2   x2 = 3   Row sum
0                   20/220    60/220    36/220   4/220    120/220
1                   30/220    48/220    12/220   0         90/220
2                    6/220     4/220    0        0         10/220
Column sum          56/220   112/220    48/220   4/220     1
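The joint probability mass function of example 6 is easy to tabulate with `math.comb`; the sketch below also produces the row sums (the marginal of x1) and the conditional distribution of x2 given x1 = 0:

```python
from math import comb

N = comb(12, 3)   # 220 equally likely ways to choose 3 of the 12 sportsmen

def p(i, j):
    """P{x1 = i, x2 = j}: i of the 2 runners, j of the 4 tennis players,
    and the remaining 3 - i - j of the 6 cricketers."""
    k = 3 - i - j
    if k < 0:
        return 0.0
    return comb(2, i) * comb(4, j) * comb(6, k) / N

joint = {(i, j): p(i, j) for i in range(3) for j in range(4)}
row_sums = [sum(p(i, j) for j in range(4)) for i in range(3)]   # marginal of x1
cond_x2_given_x1_0 = [p(0, j) / row_sums[0] for j in range(4)]  # conditional
```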

From the joint distribution table, it is easy to write down the distributions of x1 and x2, which
we call the marginal distributions of x1 and x2 respectively.
P{x1 = 0} = P{x1 = 0, x2 = 0} + P{x1 = 0, x2 = 1} + P{x1 = 0, x2 = 2} + P{x1 = 0, x2 = 3}
= 20/220 + 60/220 + 36/220 + 4/220 = 120/220.

Notice that P{x1 = 0} is the row-sum corresponding to x1 = 0 in the above table.


Accordingly this is recorded as row-sum corresponding to x1 = 0. Similarly the second
and third row-sums are the probabilities of x1 = 1 and x1 = 2 respectively. Thus the
marginal distribution of x1 is
Table 2
Marginal distribution of x1

Value          0         1        2
Probability  120/220   90/220   10/220

Similarly, the marginal distribution of x2 is obtained using the column sums in table 1.
Thus the marginal distribution of x2 is
Table 3
Marginal distribution of x2

Value          0         1         2        3
Probability   56/220   112/220   48/220   4/220

Suppose we are given additional information that no long distance running specialist is
selected, or in other words, we know that x1 = 0. Then what are the probabilities for x2 =
0, 1, 2, 3 given this additional information? Notice that we are looking for the

conditional probabilities P{x2 = j | x1 = 0} for j = 0, 1, 2, 3. We can compute them using
the row corresponding to x1 = 0.
P{x2 = 0 | x1 = 0} = P(x1 = 0, x2 = 0)/P(x1 = 0) = (20/220)/(120/220) = 20/120 = 1/6
P{x2 = 1 | x1 = 0} = (60/220)/(120/220) = 60/120 = 1/2
P{x2 = 2 | x1 = 0} = (36/220)/(120/220) = 36/120 = 3/10
P{x2 = 3 | x1 = 0} = (4/220)/(120/220) = 4/120 = 1/30
The distribution of x2 given x1 = 0 is called the conditional distribution of x2 given x1 = 0


and can be expressed neatly in the following table.
Table 4
Conditional distribution of x2 given x1 = 0

Value          0      1      2       3
Probability   1/6    1/2    3/10   1/30

E6. In example 6, let x3 = the number of cricketers chosen. Write down the joint
distribution of x2 and x3. Obtain the marginal distributions of x2 and x3. Obtain
the conditional distribution of x3 given x2 = 1.

E7. In example 6, let pijk denote P{x1 = i, x2 = j, x3 = k}. Obtain pijk for i = 0, 1, 2;
j = 0, 1, 2, 3 and k = 0, 1, 2, 3. The values of x1, x2 and x3 and the corresponding pijk
constitute the joint distribution of x1, x2 and x3.

E8. In example 6, obtain the variance-covariance matrix of x = (x1, x2, x3)ᵗ.

The random variables x1, x2 and x3 in example 6 and E6 are discrete; x = (x1, x2, x3)ᵗ is called
a discrete random vector, and the distribution of x (the joint distribution of x1, x2 and x3) in
such a case is called a discrete multivariate distribution.
On the other hand, we say that x1, …, xp are jointly continuous (x = (x1, …, xp)ᵗ is
continuous) if there exists a function f(u1, …, up), defined for all u1, …, up, having the
property that for every set A of p-tuples,
P((x1, …, xp) ∈ A) = ∫…∫ f(u1, …, up) du1 … dup, where the integration is over all
(u1, …, up) ∈ A.
The function f(u1, …, up) is called the probability density function of x (or the joint probability
density function of x1, …, xp). If A1, …, Ap are sets of real numbers such that
A = {(u1, …, up): ui ∈ Ai, i = 1, …, p}, we can write
P{(x1, …, xp) ∈ A} = P{xi ∈ Ai, i = 1, …, p} = ∫_A1 … ∫_Ap f(u1, …, up) du1 … dup.
The distribution function of x = (x1, …, xp)ᵗ is defined as
F(a1, a2, …, ap) = P{x1 ≤ a1, …, xp ≤ ap} = ∫_(-∞)^a1 … ∫_(-∞)^ap f(u1, …, up) du1 … dup.
It follows, upon differentiation, that f(a1, …, ap) = ∂^p F(a1, …, ap)/∂a1 … ∂ap, provided the
partial derivatives are defined.
Another interpretation of the density function of x can be given using the following:
P{ai < xi < ai + dai, i = 1, …, p} = ∫_a1^(a1+da1) … ∫_ap^(ap+dap) f(u1, …, up) du1 … dup
≈ f(a1, …, ap) da1 … dap
when dai, i = 1, …, p, are small and f is continuous. Thus f(a1, …, ap) is a measure of the
chance that the random vector x is in a small neighborhood of (a1, …, ap).
Let fx(u1, u2, …, up) be the density function of x.
Then the marginal density of xi, denoted by f_xi(ui), is defined as
f_xi(ui) = ∫…∫ f(u1, …, up) du1 … du(i-1) du(i+1) … dup.
Let x1 = (x1, …, xr)ᵗ and x2 = (x(r+1), …, xp)ᵗ. Then the marginal density of x1 is similarly
defined as
f_x1(u1) = ∫…∫ f(u1, …, up) du(r+1) … dup, where u1 = (u1, …, ur)ᵗ.
The conditional density of x1 given x2 = u2 = (u(r+1), …, up)ᵗ is defined as
f_(x1|x2=u2)(u1 | u2) = fx(u1, …, up) / f_x2(u2).
Why is the conditional density defined thus? To see this, multiply the left hand side
of the above equality by du1 … dur and the right hand side by
du1 … dur = du1 … dup / du(r+1) … dup. Then
f_(x1|x2=u2)(u1 | u2) du1 … dur
= [fx(u1, …, up) du1 … dup] / [f_x2(u(r+1), …, up) du(r+1) … dup]
≈ P{u1 < x1 < u1 + du1, …, up < xp < up + dup}
  / P{u(r+1) < x(r+1) < u(r+1) + du(r+1), …, up < xp < up + dup}
= P{u1 < x1 < u1 + du1, …, ur < xr < ur + dur | u(r+1) < x(r+1) < u(r+1) + du(r+1), …,
  up < xp < up + dup}.
Thus for small values of du1, …, dup, f_(x1|x2=u2)(u1 | u2) represents the conditional
probability that xi lies between ui and ui + dui, i = 1, …, r, given that xj lies between uj and
uj + duj, j = r + 1, …, p.
Let us now consider a few examples.
Example 7. Consider x = (x1, x2)ᵗ with density
fx(u1, u2) = c(2 - u1 - u2) for 0 < u1 < 1, 0 < u2 < 1, and 0 otherwise,
where c is a constant.
(a) Obtain the value of c.
(b) Find the marginal densities of x1 and x2.
(c) Find the conditional density of x1 given x2 = u2, where 0 < u2 < 1.
(d) Find the probability that x1 > 1/4 given x2 = u2.

Solution: (a) Since fx(u1, u2) is a density function, its integral over the whole plane is 1. Now
1 = ∫_0^1 ∫_0^1 c(2 - u1 - u2) du1 du2 = c(2 - 1/2 - 1/2) = c, so c = 1.
(b) The marginal density of x1 in the range (0, 1) is
f_x1(u1) = ∫ fx(u1, u2) du2 = ∫_0^1 (2 - u1 - u2) du2 = 2 - u1 - 1/2 = 3/2 - u1.
Thus f_x1(u1) = 3/2 - u1 for 0 < u1 < 1, and 0 elsewhere.
By symmetry, the marginal density of x2 is f_x2(u2) = 3/2 - u2 for 0 < u2 < 1, and 0 elsewhere.
(c) The conditional density of x1 given x2 = u2 is
f_(x1|x2=u2)(u1 | u2) = fx(u1, u2)/f_x2(u2) = (2 - u1 - u2)/(3/2 - u2)
for 0 < u1 < 1, 0 < u2 < 1, and 0 otherwise, since fx(u1, u2) = 0 whenever u1 ∉ (0, 1)
or u2 ∉ (0, 1).
(d) P{x1 > 1/4 | x2 = u2} = ∫_(1/4)^1 f_(x1|x2=u2)(u1 | u2) du1
= (1/(3/2 - u2)) ∫_(1/4)^1 (2 - u1 - u2) du1
= (1/(3/2 - u2)) [(2 - u2)u1 - u1²/2]_(1/4)^1
= (1/(3/2 - u2)) [(3/4)(2 - u2) - (1/2 - 1/32)]
= (33/32 - (3/4)u2)/(3/2 - u2)
= (33 - 24u2)/(16(3 - 2u2)).
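The integrations of example 7 can be spot-checked numerically. The midpoint-rule sketch below verifies that the density with c = 1 integrates to 1 over the unit square, and checks the closed form (33 - 24u2)/(16(3 - 2u2)) of part (d) at u2 = 1/2:

```python
# Midpoint-rule checks for example 7; the midpoint rule is exact for
# affine integrands such as f(u1, u2) = 2 - u1 - u2.
n = 400
h = 1.0 / n

def f(u1, u2):          # the density with c = 1, inside the unit square
    return 2.0 - u1 - u2

# (a) the density integrates to 1 over the unit square
total = sum(f((i + 0.5) * h, (j + 0.5) * h)
            for i in range(n) for j in range(n)) * h * h

# (d) P(x1 > 1/4 | x2 = u2) versus the closed form
u2 = 0.5
num = sum(f((i + 0.5) * h, u2) for i in range(n) if (i + 0.5) * h > 0.25) * h
prob = num / (1.5 - u2)                         # divide by marginal f_x2(u2)
closed = (33 - 24 * u2) / (16 * (3 - 2 * u2))
```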

Example 8. Consider a random vector x = (x1, x2)ᵗ with density
fx(u1, u2) = 2e^(-u1) e^(-2u2) for 0 < u1 < ∞, 0 < u2 < ∞, and 0 elsewhere.
(a) Obtain the marginal density of x2.
(b) Obtain the conditional density of x1 given x2 = u2.
Solution: (a) Clearly f_x2(u2) = 0 whenever u2 ≤ 0. Let 0 < u2 < ∞. Then
f_x2(u2) = ∫_0^∞ 2e^(-u1) e^(-2u2) du1 = 2e^(-2u2) ∫_0^∞ e^(-u1) du1
= 2e^(-2u2) [-e^(-u1)]_0^∞ = 2e^(-2u2) whenever 0 < u2 < ∞,
which is an exponential density with parameter 2.
(b) f_(x1|x2=u2)(u1 | u2) = fx(u1, u2)/f_x2(u2) = 2e^(-u1) e^(-2u2)/(2e^(-2u2)) = e^(-u1)
whenever 0 < u1 < ∞, 0 < u2 < ∞. Again f_(x1|x2=u2)(u1 | u2) = 0 if u1 ∉ (0, ∞) or
u2 ∉ (0, ∞). Thus
f_(x1|x2=u2)(u1 | u2) = e^(-u1) if u1 ∈ (0, ∞) and u2 ∈ (0, ∞), and 0 otherwise.
It can be easily checked that this is the same as the marginal density of x1. Thus in
this example the joint density of x1 and x2 is the product of the marginal densities of x1
and x2.
Let x = (x1ᵗ, x2ᵗ)ᵗ, where x1 is of order r × 1 and x2 is of order (p - r) × 1, have density fx(u),
where u = (u1ᵗ, u2ᵗ)ᵗ is partitioned according to the partition of x. We say that x1 and x2 are
independent if P{x1 ∈ A, x2 ∈ B} = P{x1 ∈ A} · P{x2 ∈ B} for all subsets A and B of R^r
and R^(p-r) respectively.
It can be shown (the proof is beyond the scope of the present notes) that x1 and x2 are
independent if and only if the joint density of x1 and x2 (i.e., the density of x) is equal to
the product of the marginal densities of x1 and x2, or in other words
fx(u) = f_x1(u1) · f_x2(u2).
In fact, if we can factorize fx(u) = g1(u1) · g2(u2), where gi(ui) involves only ui, i = 1, 2, then
x1 and x2 are independent. Further, c1·g1(u1) and c2·g2(u2) are the marginal densities of x1
and x2 respectively, where c1 and c2 are suitable constants.
Thus x1 and x2 of example 8 are independent.
However, x1 and x2 of example 7 are not independent.
We give below a relationship between uncorrelatedness and independence.
Theorem 4. Let x1 and x2 be independent. Then Cov(x1, x2) = 0.
Proof: Let f_x1(u1) and f_x2(u2) be the densities of x1 and x2. Then the joint density of
x1 and x2 (i.e., the density of x = (x1ᵗ, x2ᵗ)ᵗ) is fx(u) = f_x1(u1) · f_x2(u2), where
u = (u1ᵗ, u2ᵗ)ᵗ, u1 = (u1, …, ur)ᵗ and u2 = (u(r+1), …, up)ᵗ.
Cov(x1, x2) = E(x1x2ᵗ) - E(x1) E(x2ᵗ)
= ∫ u1u2ᵗ f_x1(u1) f_x2(u2) du1 … dup
  - [∫ u1 f_x1(u1) du1 … dur] [∫ u2ᵗ f_x2(u2) du(r+1) … dup] = 0,
since the first integral in the previous expression splits into the product of the two later
integrals.
However, the converse is not true, as shown through the following exercise.
E9. Let x have the following distribution (the four values are equally likely):

Value         -3     -1     1     3
Probability   1/4    1/4   1/4   1/4

(a) Show that the distribution of x² is

Value         1      9
Probability   1/2    1/2

(b) Write down the joint distribution of x and x².
(c) Show that x and x² are uncorrelated.
(d) Show that x and x² are not independent.
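E9's point, that uncorrelated does not imply independent, can be checked directly; the sketch below assumes the four-point symmetric distribution on {-3, -1, 1, 3} with equal probabilities (one distribution for which the exercise works):

```python
# x takes the values -3, -1, 1, 3 with probability 1/4 each (the symmetric
# distribution assumed for E9), so y = x^2 takes 1 and 9 with probability
# 1/2 each.
values = [-3, -1, 1, 3]
prob = 0.25

E_x = sum(prob * v for v in values)        # 0 by symmetry
E_x2 = sum(prob * v ** 2 for v in values)  # E(x^2) = 5
E_x3 = sum(prob * v ** 3 for v in values)  # 0 by symmetry

cov = E_x3 - E_x * E_x2   # Cov(x, x^2) = E(x^3) - E(x)E(x^2) = 0: uncorrelated

# ... yet not independent: P(x = 3, x^2 = 1) = 0 while P(x = 3)P(x^2 = 1) = 1/8
p_joint = 0.0
p_product = 0.25 * 0.5
```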
E10. Consider a random vector x = (x1, x2)ᵗ with the density
fx(u1, u2) = c·u1(2 - u1 - u2) for 0 < u1 < 1, 0 < u2 < 1, and 0 otherwise,
where c is a constant.
(a) Obtain the value of c.
(b) Find the marginal densities of x1 and x2.
(c) Find the conditional density of x2 given x1 = u1.
(d) Are x1 and x2 independent?

Let x = (x1,…,xp)ᵗ be a random vector with density fx(u) where u = (u1,…,up)ᵗ. Earlier we saw that even if x1,…,xp are correlated with a positive definite variance-covariance matrix, we can find a linear transformation y = Bx with B nonsingular such that the components of y are uncorrelated. It is often of interest to find the distribution of y = g(x) [i.e., yi = gi(x1,…,xp), a function of x1,…,xp for i = 1,…,p] given the distribution of x. How can we find this? We need the following assumptions:
(i) The equations vi = gi(u1,…,up), i = 1,…,p have a unique solution for u1,…,up in terms of v1,…,vp, with the solution given by ui = hi(v1,…,vp), i = 1,…,p.
(ii) The functions gi, i = 1,…,p have continuous partial derivatives at all points (u1,…,up) and are such that the determinant

J(u1,…,up) = | ∂g1/∂u1  …  ∂g1/∂up |
             |    ⋮             ⋮   |
             | ∂gp/∂u1  …  ∂gp/∂up |  ≠ 0 at all points (u1,…,up).

Under these two conditions it can be shown (the proof is beyond the scope of these notes) that the density of y is given by

fy(v) = fx(u) · |J(u1,…,up)|⁻¹,

where |J(u1,…,up)| is the absolute value of J(u1,…,up) and ui = hi(v), i = 1,…,p.
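A quick numerical sanity check of the change-of-variables formula, on a toy example of our own (not from the notes): x1, x2 independent Uniform(0, 1) and y1 = x1 + x2, y2 = x1 − x2, so J = det[[1, 1], [1, −1]] = −2 and fy(v) = 1/2 on the image of the unit square.

```python
import random

random.seed(0)

# Monte Carlo check: estimate the density of y = (x1 + x2, x1 - x2) near the
# interior point (1.0, 0.0), where the formula predicts f_y = f_x / |J| = 1/2.
n = 200_000
half = 0.05            # small box of side 2*half around (1.0, 0.0)
hits = 0
for _ in range(n):
    x1, x2 = random.random(), random.random()
    y1, y2 = x1 + x2, x1 - x2
    if abs(y1 - 1.0) <= half and abs(y2) <= half:
        hits += 1

area = (2 * half) ** 2
density_estimate = hits / (n * area)
print(density_estimate)   # should be close to 0.5
```

The box maps back entirely inside the unit square, so the hit frequency divided by the box area estimates the constant density 1/2.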

Let us consider the bivariate normal distribution. This is a special case of the multivariate normal distribution which we shall study in detail in the next few subsections.

Example 9. (Bivariate normal distribution). Let x = (x1, x2)ᵗ have density

fx(u1, u2) = 1/(2π σx1 σx2 √(1 − ρ²)) · exp{ −1/(2(1 − ρ²)) [ ((u1 − μx1)/σx1)² − 2ρ((u1 − μx1)/σx1)((u2 − μx2)/σx2) + ((u2 − μx2)/σx2)² ] },

where ρ = ρx1x2, for −∞ < u1 < ∞, −∞ < u2 < ∞.

This distribution is denoted by N2(μx1, μx2, σ²x1, σ²x2, ρx1x2).

(a) Rewrite the above density in terms of the variance-covariance matrix Σ of x.
(b) Obtain a lower triangular square root B of Σ.
(c) Write down the density of y = B⁻¹x.
(d) Hence write down the marginal density of x1.
(e) Show that y1 and y2 are independent.
(f) Obtain the conditional distribution of x2 given x1 = u1.
Solution: (a) Let

Σ = [ σ11  σ12 ]
    [ σ12  σ22 ]

be the variance-covariance matrix of x. Thus σ11 = σ²x1, σ22 = σ²x2 and σ12 = ρx1x2 σx1 σx2.

Also |Σ| = σ11σ22 − σ12² = σ²x1 σ²x2 (1 − ρ²x1x2) and

Σ⁻¹ = 1/(σ²x1 σ²x2 (1 − ρ²x1x2)) · [ σ²x2            −ρx1x2 σx1 σx2 ]
                                    [ −ρx1x2 σx1 σx2  σ²x1           ].

Writing u = (u1, u2)ᵗ and μx = (μx1, μx2)ᵗ, direct multiplication gives

(u − μx)ᵗ Σ⁻¹ (u − μx) = 1/(1 − ρ²x1x2) [ ((u1 − μx1)/σx1)² − 2ρx1x2((u1 − μx1)/σx1)((u2 − μx2)/σx2) + ((u2 − μx2)/σx2)² ].

Thus the density of x can be rewritten in terms of Σ as

fx(u) = (2π)⁻¹ |Σ|^(−1/2) exp{ −½ (u − μx)ᵗ Σ⁻¹ (u − μx) }.

Hence we can write this distribution as N2(μx, Σ).
(b) Let Σ = BBᵗ where B is lower triangular. Writing

[ σ11  σ12 ]  =  [ b11  0   ] [ b11  b21 ]
[ σ12  σ22 ]     [ b21  b22 ] [ 0    b22 ],

we have b11 = √σ11, b21 = σ12/√σ11 and b22 = √(σ22 − σ12²/σ11).
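The square-root formulas for the 2 × 2 case can be wrapped in a small function; the parameter values used below are illustrative assumptions, not from the example.

```python
import math

def lower_triangular_sqrt_2x2(s11, s12, s22):
    """Lower triangular B with B B^t = [[s11, s12], [s12, s22]].

    Uses b11 = sqrt(s11), b21 = s12/sqrt(s11), b22 = sqrt(s22 - s12**2/s11);
    requires the matrix to be positive definite.
    """
    b11 = math.sqrt(s11)
    b21 = s12 / b11
    b22 = math.sqrt(s22 - s12 ** 2 / s11)
    return [[b11, 0.0], [b21, b22]]

# Illustrative values: sigma_x1 = 2, sigma_x2 = 3, rho = 0.5, so s12 = 3
B = lower_triangular_sqrt_2x2(4.0, 3.0, 9.0)

# Check that B B^t reproduces the matrix
bbt = [[sum(B[i][k] * B[j][k] for k in range(2)) for j in range(2)]
       for i in range(2)]
print(bbt)   # approximately [[4, 3], [3, 9]]
```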
(c) Consider y = B⁻¹x. Write v = B⁻¹u and μy = B⁻¹μx.

Now

B⁻¹ = [ 1/b11            0     ]
      [ −b21/(b11 b22)   1/b22 ].

So y1 = x1/b11 and v1 = u1/b11, while y2 = (x2 − b21 b11⁻¹ x1)/b22 and v2 = (u2 − b21 b11⁻¹ u1)/b22.

Then

J(u1, u2) = | ∂v1/∂u1  ∂v1/∂u2 |  =  det(B⁻¹) = 1/(b11 b22),
            | ∂v2/∂u1  ∂v2/∂u2 |

and b11² b22² = σ11(σ22 − σ12²/σ11) = σ11σ22 − σ12² = |Σ|.

Also Σ⁻¹ = (BBᵗ)⁻¹ = (Bᵗ)⁻¹ B⁻¹ = (B⁻¹)ᵗ B⁻¹. Hence

(u − μx)ᵗ Σ⁻¹ (u − μx) = (u − μx)ᵗ (B⁻¹)ᵗ B⁻¹ (u − μx) = (v − μy)ᵗ (v − μy).

Thus the density of y is

fy(v) = fx(u) · |J(u1, u2)|⁻¹ = (2π)⁻¹ exp{ −½ (v − μy)ᵗ (v − μy) } = (1/√(2π)) e^(−½(v1 − μy1)²) · (1/√(2π)) e^(−½(v2 − μy2)²).

The range of values for y1 and y2 is the same as the range of values of x1 and x2, namely −∞ < v1 < ∞, −∞ < v2 < ∞. Hence y ~ N2(μy, I).
(d) The joint density of y1 and y2 is (2π)⁻¹ e^(−½(v1 − μy1)² − ½(v2 − μy2)²). Hence the marginal density of y1 is

∫ (1/√(2π)) e^(−½(v1 − μy1)²) · (1/√(2π)) e^(−½(v2 − μy2)²) dv2 = (1/√(2π)) e^(−½(v1 − μy1)²),

since the integrand in v2 is the density of a normal distribution and integrates to 1. Thus the marginal density of y1 = x1/b11 is

(1/√(2π)) e^(−½(v1 − μy1)²),

which is the density of N(μy1, 1). So x1 = b11 y1 has a normal distribution with mean b11 μy1 = μx1 and variance b11² V(y1) = b11² = σ11. Thus the marginal distribution of x1 is N(μx1, σ11).

(e) The joint density of y1 and y2 (i.e., the density of y) is

(2π)⁻¹ e^(−½(v1 − μy1)²) · e^(−½(v2 − μy2)²) = h1(v1) · h2(v2),

where h1(v1) = (1/√(2π)) e^(−½(v1 − μy1)²) and h2(v2) = (1/√(2π)) e^(−½(v2 − μy2)²). Notice that h1 depends only on v1 and h2 depends only on v2. Hence y1 and y2 are independent.
(e) The joint density of y1 and y2 is (i.e. the density of y) is

(f) We can write

Σ = [ σ11  σ12 ]  =  [ 1        0 ] [ σ11  0               ] [ 1  σ12/σ11 ]
    [ σ12  σ22 ]     [ σ12/σ11  1 ] [ 0    σ22 − σ12²/σ11  ] [ 0  1       ].

Hence

Σ⁻¹ = [ 1  −σ12/σ11 ] [ 1/σ11  0                     ] [ 1         0 ]
      [ 0  1        ] [ 0      (σ22 − σ12²/σ11)⁻¹    ] [ −σ12/σ11  1 ],

and so

(u − μx)ᵗ Σ⁻¹ (u − μx) = (u1 − μx1)²/σ11 + (u2 − μx2 − (σ12/σ11)(u1 − μx1))² / (σ22 − σ12²/σ11).

Also notice that |Σ| = σ11σ22 − σ12² = σ11 (σ22 − σ12²/σ11). We thus have the density of x as

fx(u) = [ 1/√(2π σ11) e^(−(u1 − μx1)²/(2σ11)) ] · [ 1/√(2π(σ22 − σ12²/σ11)) e^(−(u2 − μx2 − (σ12/σ11)(u1 − μx1))²/(2(σ22 − σ12²/σ11))) ].

The first factor is the marginal density of x1. Hence the conditional density of x2 given x1 = u1 is

1/√(2π(σ22 − σ12²/σ11)) · e^(−(u2 − μx2 − (σ12/σ11)(u1 − μx1))²/(2(σ22 − σ12²/σ11))),

which is the density function of

N( μx2 + (σ12/σ11)(u1 − μx1), σ22 − σ12²/σ11 ).

Thus the conditional distribution of x2 given x1 = u1 is

N( μx2 + ρx1x2 (σx2/σx1)(u1 − μx1), σ²x2 (1 − ρ²x1x2) ).
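The conditional-distribution formula in (f) can be checked by simulation. The parameter values below are our own illustrative choices; conditioning is done approximately by keeping samples whose first coordinate falls in a narrow window around u1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative bivariate normal parameters (not from the example)
mu1, mu2 = 1.0, -2.0
s1, s2, rho = 2.0, 3.0, 0.6

# Sample x via a lower triangular square root of Sigma
cov = np.array([[s1**2, rho*s1*s2], [rho*s1*s2, s2**2]])
B = np.linalg.cholesky(cov)
x = np.array([mu1, mu2]) + rng.standard_normal((200_000, 2)) @ B.T

# Condition (approximately) on x1 close to u1 = 2.0
u1 = 2.0
sel = x[np.abs(x[:, 0] - u1) < 0.05, 1]

cond_mean = mu2 + rho * (s2 / s1) * (u1 - mu1)   # formula from (f)
cond_var = s2**2 * (1 - rho**2)

print(sel.mean(), cond_mean)   # both close to -1.1
print(sel.var(), cond_var)     # both close to 5.76
```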

E11. Let x have a bivariate normal distribution N2(μx1, μx2, σ²x1, σ²x2, ρx1x2). Show that x1 and x2 are independent if and only if ρx1x2 = 0. (Recall that, in general, uncorrelatedness does not imply independence. However, if x1 and x2 have a bivariate normal distribution, then x1 and x2 are independent if and only if they are uncorrelated.)

2.5 Nonsingular Multivariate Normal Distribution

The multivariate normal distribution is a generalization of the univariate normal


distribution to several dimensions. This distribution plays a fundamental role in multivariate analysis. While real data virtually never follow a multivariate normal distribution exactly, the multivariate normal is often a good approximation to the population distribution. Also, the sampling distributions of many multivariate statistics are approximately normal, regardless of the form of the parent distribution (discrete or continuous), in view of the multivariate central limit theorem, which can be stated as follows:
Theorem 5. (Multivariate Central Limit Theorem) Let x1, x2, …, xn be independent observations from a population with mean vector μ and finite variance-covariance matrix Σ. Let x̄ = (1/n) Σⁿᵢ₌₁ xᵢ. Then √n (x̄ − μ) has an approximate multivariate normal distribution with mean vector 0 and variance-covariance matrix Σ. (n should be large relative to the number of components in x̄.)
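A small simulation illustrating theorem 5, with a deliberately non-normal parent distribution of our own choosing (pairs of correlated exponentials):

```python
import numpy as np

rng = np.random.default_rng(1)

# Each observation is (e1, e1 + e2) with e1, e2 iid Exponential(1):
# mean vector (1, 2), covariance matrix [[1, 1], [1, 2]].
n, reps = 500, 4000
mu = np.array([1.0, 2.0])

scaled_means = np.empty((reps, 2))
for r in range(reps):
    e1 = rng.exponential(1.0, n)
    e2 = rng.exponential(1.0, n)
    obs = np.column_stack([e1, e1 + e2])
    scaled_means[r] = np.sqrt(n) * (obs.mean(axis=0) - mu)

print(np.cov(scaled_means.T))     # close to [[1, 1], [1, 2]]
print(scaled_means.mean(axis=0))  # close to (0, 0)
```

The covariance of √n(x̄ − μ) matches the parent Σ, and its distribution is approximately bivariate normal despite the skewed parent.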
Moreover, multivariate normal theory is mathematically tractable and yields elegant results. Thus the study of the multivariate normal distribution serves the dual purposes of usefulness in practice and mathematical elegance.
Let us start with the simplest extension of a univariate standard normal distribution.
Let x1,…,xp be independent standard normal variables. Then their joint density, which we have identified as the density of x = (x1,…,xp)ᵗ, is given by

fx(u1,…,up) = (1/√(2π)) e^(−u1²/2) · (1/√(2π)) e^(−u2²/2) ⋯ (1/√(2π)) e^(−up²/2) = (2π)^(−p/2) e^(−uᵗu/2),

where u = (u1,…,up)ᵗ.

Let us make a nonsingular linear transformation y = Bx + μ where B is a fixed nonsingular matrix and μ is a fixed vector. Accordingly let us set v = Bu + μ. Then the density of y is obtained as follows.
First we obtain the Jacobian of the transformation:

J = absolute value of | ∂v1/∂u1  …  ∂v1/∂up |  =  absolute value of | b11  b12  …  b1p |
                      |    ⋮             ⋮   |                      |  ⋮            ⋮  |
                      | ∂vp/∂u1  …  ∂vp/∂up |                      | bp1  bp2  …  bpp |

= absolute value of |B|.

Also u = B⁻¹(v − μ). So the density of y is

(2π)^(−p/2) · (1/(absolute value of |B|)) · exp{ −½ (v − μ)ᵗ (B⁻¹)ᵗ B⁻¹ (v − μ) }.

Let us write Σ = BBᵗ. Then |Σ| = |B|·|Bᵗ| = |B|². So the absolute value of |B| is |Σ|^(1/2), the positive square root of |Σ|. Also (B⁻¹)ᵗ B⁻¹ = (BBᵗ)⁻¹ = Σ⁻¹.

Thus the density of y can be rewritten as

fy(v) = (2π)^(−p/2) |Σ|^(−1/2) exp{ −½ (v − μ)ᵗ Σ⁻¹ (v − μ) }.

Notice that the density of y depends on the parameters μ and Σ. This density is called the density of the p-variate normal distribution with parameters μ and Σ, and we denote the distribution as

y ~ Np(μ, Σ).

In the same notation, the x that we considered above has the distribution x ~ Np(0, I).
Let us now identify the parameters μ and Σ. We know that E(x) = 0 and the variance-covariance matrix D(x) = I since x1,…,xp are independent standard normal variates. Since y = Bx + μ, we have

E(y) = E(Bx + μ) = BE(x) + μ = μ
D(y) = D(Bx + μ) = D(Bx) = BIBᵗ = BBᵗ = Σ.

Thus, μ and Σ are the mean vector and the variance-covariance matrix of y ~ Np(μ, Σ). Recall that we started with B nonsingular and hence Σ is nonsingular (in fact, positive definite). The fact that B is nonsingular was crucial in obtaining the density of y as above. (Notice that the Jacobian not being 0 was an assumption while obtaining the density of the transformed random vector in the form mentioned above.) Thus, the distribution Np(μ, Σ), i.e., the p-variate normal distribution with mean vector μ and variance-covariance matrix Σ, is called a nonsingular p-variate normal distribution if Σ is positive definite. Later on, we shall also study the case where Σ need not necessarily be positive definite. Let us now summarize this discussion.
Definition: A random vector y of order p × 1 is said to have a nonsingular p-variate normal distribution with parameters μ and Σ if it has the density

fy(v) = (2π)^(−p/2) |Σ|^(−1/2) exp{ −½ (v − μ)ᵗ Σ⁻¹ (v − μ) },

where μ is a fixed vector in Rᵖ and Σ is a p × p positive definite matrix. Also then, we denote y ~ Np(μ, Σ).

Remark: If y ~ Np(μ, Σ) where Σ is positive definite, then E(y) = μ and D(y) = Σ.
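The density in the definition can be coded directly. Below is a minimal sketch for p = 2 with assumed illustrative parameter values, sanity-checked by a Riemann sum that should come out close to 1.

```python
import numpy as np

# Illustrative parameters (our own choice), Sigma positive definite
mu = np.array([0.5, -1.0])
sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
sigma_inv = np.linalg.inv(sigma)
norm_const = (2 * np.pi) ** (-1.0) * np.linalg.det(sigma) ** (-0.5)

def mvn_density(v):
    """Nonsingular N_2(mu, sigma) density evaluated at v."""
    diff = v - mu
    return norm_const * np.exp(-0.5 * diff @ sigma_inv @ diff)

# Riemann sum over a wide grid centred at mu: the density integrates to 1
grid = np.linspace(-8.0, 8.0, 161)
h = grid[1] - grid[0]
total = sum(mvn_density(mu + np.array([a, b]))
            for a in grid for b in grid) * h * h
print(total)   # close to 1
```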
We shall now turn our attention to the marginal distributions. Let y ~ Np(μ, Σ). Partition

y = [ y1 ],  v = [ v1 ],  μ = [ μ1 ]  and  Σ = [ Σ11   Σ12 ]        (2.5.1)
    [ y2 ]      [ v2 ]       [ μ2 ]           [ Σ12ᵗ  Σ22 ]

where y1, v1 and μ1 are of order r × 1 (1 ≤ r ≤ p) and Σ11 and Σ22 are of order r × r and (p−r) × (p−r) respectively. Such partitions of v, μ and Σ are called conformable to that of y.

We first show that if Σ12 = 0, then y1 and y2 are independent. Notice that if Σ12 = 0, then the covariance between yi and yj, i = 1,…,r and j = r+1,…,p, is 0, and hence each of y1,…,yr is uncorrelated with each of yr+1,…,yp.
Theorem 6. Let y ~ Np(μ, Σ) where Σ is positive definite. Let y, v, μ and Σ be partitioned as in (2.5.1). Then y1 and y2 are independent if and only if Σ12 = 0.

Proof: We need to prove only the "if" part, as the "only if" part has already been proved in the previous section (see theorem 4).

The density of y is fy(v) = (2π)^(−p/2) |Σ|^(−1/2) exp{ −½ (v − μ)ᵗ Σ⁻¹ (v − μ) }.

Since Σ12 = O,

Σ = [ Σ11  O   ],  |Σ| = |Σ11|·|Σ22|  and  Σ⁻¹ = [ Σ11⁻¹  O     ]
    [ O    Σ22 ]                                 [ O      Σ22⁻¹ ].

Hence

(v − μ)ᵗ Σ⁻¹ (v − μ) = ((v1 − μ1)ᵗ : (v2 − μ2)ᵗ) [ Σ11⁻¹  O     ] [ v1 − μ1 ]
                                                  [ O      Σ22⁻¹ ] [ v2 − μ2 ]
= (v1 − μ1)ᵗ Σ11⁻¹ (v1 − μ1) + (v2 − μ2)ᵗ Σ22⁻¹ (v2 − μ2).

Thus fy(v) can be rewritten as

fy(v) = [ (2π)^(−r/2) |Σ11|^(−1/2) e^(−½(v1 − μ1)ᵗ Σ11⁻¹ (v1 − μ1)) ] · [ (2π)^(−(p−r)/2) |Σ22|^(−1/2) e^(−½(v2 − μ2)ᵗ Σ22⁻¹ (v2 − μ2)) ]
= fy1(v1) · fy2(v2).

Hence y1 and y2 are independent.

E12. Show that the marginal distributions of y1 and y2, where Σ12 = 0, are Nr(μ1, Σ11) and Np−r(μ2, Σ22) respectively.

We now obtain the marginal distribution of y1 in the general case.

Theorem 7. Let y ~ Np(μ, Σ) where Σ is positive definite. Let y, v, μ and Σ be partitioned as in (2.5.1). Then the marginal distribution of y1 is Nr(μ1, Σ11).

Proof: The technique of the proof lies in transforming y to η by a nonsingular linear transformation such that y1 = η1 and η1 and η2 are independent. Then the marginal distribution of y1 = η1 can easily be obtained using theorem 6. Towards this end, write

Σ = [ Σ11   Σ12 ]  =  [ I          O ] [ Σ11  O                  ] [ I  Σ11⁻¹Σ12 ]
    [ Σ12ᵗ  Σ22 ]     [ Σ21Σ11⁻¹   I ] [ O    Σ22 − Σ21Σ11⁻¹Σ12  ] [ O  I        ].

Hence |Σ| = |Σ11| · |Σ22 − Σ21Σ11⁻¹Σ12|, since

det [ A  O ]  =  |A|·|C|  where A and C are square, and |I| = 1.
    [ B  C ]

Also

Σ⁻¹ = [ I  −Σ11⁻¹Σ12 ] [ Σ11⁻¹  O                       ] [ I          O ]
      [ O  I         ] [ O      (Σ22 − Σ21Σ11⁻¹Σ12)⁻¹   ] [ −Σ21Σ11⁻¹  I ].

Now write

η = [ η1 ]  =  [ I          O ] y,  i.e., η1 = y1 and η2 = y2 − Σ21Σ11⁻¹ y1,
    [ η2 ]     [ −Σ21Σ11⁻¹  I ]

and correspondingly w1 = v1 and w2 = v2 − Σ21Σ11⁻¹ v1. The Jacobian of the transformation is

J = det [ I          O ]  =  1.
        [ −Σ21Σ11⁻¹  I ]

Substituting into fy(v), the quadratic form splits as

(v − μ)ᵗ Σ⁻¹ (v − μ) = (v1 − μ1)ᵗ Σ11⁻¹ (v1 − μ1) + (w2 − (μ2 − Σ21Σ11⁻¹μ1))ᵗ (Σ22 − Σ21Σ11⁻¹Σ12)⁻¹ (w2 − (μ2 − Σ21Σ11⁻¹μ1)).

Hence the density of η = (η1ᵗ, η2ᵗ)ᵗ factorizes as

[ (2π)^(−r/2) |Σ11|^(−1/2) e^(−½(v1 − μ1)ᵗ Σ11⁻¹ (v1 − μ1)) ] × [ (2π)^(−(p−r)/2) |Σ22 − Σ21Σ11⁻¹Σ12|^(−1/2) e^(−½(w2 − μ2 + Σ21Σ11⁻¹μ1)ᵗ (Σ22 − Σ21Σ11⁻¹Σ12)⁻¹ (w2 − μ2 + Σ21Σ11⁻¹μ1)) ].

From the above it is clear that y1 = η1 and η2 = y2 − Σ21Σ11⁻¹ y1 are independent.

Hence the marginal distribution of y1 from the above factorization is Nr(μ1, Σ11). Also the marginal distribution of η2 = y2 − Σ21Σ11⁻¹ y1 is Np−r(μ2 − Σ21Σ11⁻¹μ1, Σ22 − Σ21Σ11⁻¹Σ12). (Notice that Σ21 = Σ12ᵗ.)
Corollary: If y ~ Np(μ, Σ) where Σ is positive definite, then yi ~ N(μi, σii) for i = 1,…,p.
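Theorem 7 can be illustrated by simulation: the first r coordinates of an Np sample behave as Nr(μ1, Σ11). The μ and Σ below are assumed illustrative values.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative positive definite Sigma and mean vector (our own choice)
mu = np.array([1.0, -1.0, 0.0])
sigma = np.array([[4.0, 1.0, 2.0],
                  [1.0, 4.0, 2.0],
                  [2.0, 2.0, 4.0]])

# Sample y ~ N_3(mu, sigma) via a lower triangular square root
B = np.linalg.cholesky(sigma)
y = mu + rng.standard_normal((100_000, 3)) @ B.T

r = 2
y1 = y[:, :r]
print(y1.mean(axis=0))   # close to mu[:2] = (1, -1)
print(np.cov(y1.T))      # close to sigma[:2, :2] = [[4, 1], [1, 4]]
```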
We shall now obtain the conditional distribution of y2 given y1 = v1, where y (~ Np(μ, Σ)) is partitioned as in (2.5.1). Once again we assume that Σ is p.d.

Theorem 8. Let y ~ Np(μ, Σ) where Σ is positive definite. Let y, v, μ and Σ be partitioned as in (2.5.1). The conditional distribution of y2 given y1 is

Np−r( μ2 + Σ21Σ11⁻¹(v1 − μ1), Σ22 − Σ21Σ11⁻¹Σ12 ).

Proof: In the proof of theorem 7 we showed that y2 − Σ21Σ11⁻¹y1 and y1 are independently distributed. Hence the conditional distribution of y2 − Σ21Σ11⁻¹y1 given y1 is the same as the unconditional distribution of y2 − Σ21Σ11⁻¹y1, which is Np−r(μ2 − Σ21Σ11⁻¹μ1, Σ22 − Σ21Σ11⁻¹Σ12). Thus the conditional distribution of y2 given y1 = v1 is a (p−r)-variate normal with mean equal to

μ2 − Σ21Σ11⁻¹μ1 + Σ21Σ11⁻¹v1 = μ2 + Σ21Σ11⁻¹(v1 − μ1),

since when y1 = v1, Σ21Σ11⁻¹y1 = Σ21Σ11⁻¹v1 is a fixed vector. Also the conditional variance-covariance matrix is the same as that of y2 − Σ21Σ11⁻¹y1, which is Σ22 − Σ21Σ11⁻¹Σ12. Thus the conditional distribution of y2 given y1 = v1 is

Np−r( μ2 + Σ21Σ11⁻¹(v1 − μ1), Σ22 − Σ21Σ11⁻¹Σ12 ).

Let r = p − 1. Then y2 is the univariate random variable yp. Also Σ21 is a row vector of order 1 × (p−1) and Σ11 is of order (p−1) × (p−1). Consider the conditional expectation of yp given y1,…,yp−1. From the above theorem,

E(yp | y1 = v1,…, yp−1 = vp−1) = μp + Σ21Σ11⁻¹(v1 − μ1) = β0 + β1v1 + … + βp−1vp−1,

where β0 = μp − Σ21Σ11⁻¹μ1 and β = (β1,…,βp−1)ᵗ = Σ11⁻¹Σ12.

So it is clear that if y1,…,yp have a joint p-variate normal distribution, then the regression of yp on y1,…,yp−1 is linear in y1,…,yp−1. β1,…,βp−1 are called the regression coefficients.
Example 10. Let y = (y1, y2, y3)ᵗ have the N3(μ, Σ) distribution where

μ = [  1 ]      and  Σ = [ 4  1  2 ]
    [ −1 ]               [ 1  4  2 ]
    [  0 ]               [ 2  2  4 ].

(a) Write down the distributions of y1, y2 and y3.
(b) Write down the marginal distribution of (y1, y3)ᵗ.
(c) Write down the conditional distribution of y1 given y2 = −0.5 and y3 = 0.2.
(d) Make a nonsingular linear transformation ε = Ty + c so that ε1, ε2 and ε3 are independent standard normal variables.

Solution. (a) In view of the corollary to theorem 7,

y1 ~ N(1, 4), y2 ~ N(−1, 4) and y3 ~ N(0, 4).

[By ξ ~ N(μ, σ²), we mean ξ has a normal distribution with mean μ and variance σ².]

(b) It is easy to see that the marginal distribution of (y1, y3)ᵗ is bivariate normal. From the given information on μ and Σ, we have

E(y1) = 1, E(y3) = 0, V(y1) = V(y3) = 4, Cov(y1, y3) = 2.

Hence

[ y1 ] ~ N2( [ 1 ], [ 4  2 ] )
[ y3 ]       [ 0 ]  [ 2  4 ].
(c) The conditional distribution of y1 given y2 = −0.5 and y3 = 0.2 is normal by theorem 8. Partition

Σ = [ σ11   Σ12 ],  where σ11 = 4, Σ12 = (1  2), Σ22 = [ 4  2 ]
    [ Σ12ᵗ  Σ22 ]                                      [ 2  4 ]

(Σ12 is of order 1 × 2 and Σ22 is of order 2 × 2). Note that

Σ22⁻¹ = (1/12) [ 4  −2 ]
               [ −2  4 ].

Then the conditional mean is

μ1 + Σ12 Σ22⁻¹ [ −0.5 − μ2 ]  =  1 + (1  2) · (1/12) [ 4  −2 ] [ 0.5 ]  =  1 + 0.1  =  1.1.
               [  0.2 − μ3 ]                         [ −2  4 ] [ 0.2 ]

The conditional variance is

σ11 − Σ12 Σ22⁻¹ Σ12ᵗ = 4 − (1  2) · (1/12) [ 4  −2 ] [ 1 ]  =  4 − 1  =  3.
                                           [ −2  4 ] [ 2 ]

So the conditional distribution of y1 given y2 = −0.5 and y3 = 0.2 is N(1.1, 3).
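The computation in (c) can be reproduced mechanically from theorem 8's formulas:

```python
import numpy as np

# Parameters of example 10
mu = np.array([1.0, -1.0, 0.0])
sigma = np.array([[4.0, 1.0, 2.0],
                  [1.0, 4.0, 2.0],
                  [2.0, 2.0, 4.0]])

# Condition y1 on (y2, y3) = (-0.5, 0.2)
s11 = sigma[0, 0]
s12 = sigma[0, 1:]
s22 = sigma[1:, 1:]
given = np.array([-0.5, 0.2])

cond_mean = mu[0] + s12 @ np.linalg.solve(s22, given - mu[1:])
cond_var = s11 - s12 @ np.linalg.solve(s22, s12)

print(cond_mean)   # 1.1
print(cond_var)    # 3.0
```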


(d) y − μ ~ N3(0, Σ). We shall obtain a lower triangular matrix B such that BBᵗ = Σ. Then ε = B⁻¹(y − μ) has a trivariate normal distribution. Further,

E(ε) = E(B⁻¹(y − μ)) = B⁻¹ E(y − μ) = 0,
D(ε) = B⁻¹ D(y) (B⁻¹)ᵗ = B⁻¹ Σ (B⁻¹)ᵗ = B⁻¹ BBᵗ (B⁻¹)ᵗ = I.

We use the algorithm of the previous section, sweeping out the rows of [Σ : I], to get B and B⁻¹ simultaneously. This yields

B = [ 2     0        0       ]
    [ 1/2   √15/2    0       ]
    [ 1     3/√15    √(12/5) ]

and

B⁻¹ = [ 1/2        0       0       ]
      [ −1/(2√15)  2/√15   0       ]
      [ −1/√15     −1/√15  √(5/12) ].

Hence if ε = Ty + c where T = B⁻¹ and c = −Tμ, then ε is a vector of independent N(0, 1) variables.
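Part (d) can be verified numerically; the Cholesky factor computed below is exactly the lower triangular square root used above, and T = B⁻¹ standardizes y.

```python
import numpy as np

# Parameters of example 10
sigma = np.array([[4.0, 1.0, 2.0],
                  [1.0, 4.0, 2.0],
                  [2.0, 2.0, 4.0]])
mu = np.array([1.0, -1.0, 0.0])

B = np.linalg.cholesky(sigma)        # lower triangular, B B^t = sigma
print(B[0, 0], B[1, 0], B[2, 0])     # 2.0, 0.5, 1.0 as in the text

T = np.linalg.inv(B)
# D(eps) = T Sigma T^t should be the identity
print(T @ sigma @ T.T)               # approximately I
```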

E13. Let y ~ N4(μ, Σ) where

Σ = [ 9  0  2  0 ]
    [ 0  4  0  1 ]
    [ 2  0  6  0 ]
    [ 0  1  0  2 ].

(a) Obtain the marginal distribution of (y1, y3)ᵗ.
(b) Show that (y1, y3)ᵗ and (y2, y4)ᵗ are independent.
(c) Obtain the conditional distribution of (y1, y2)ᵗ given (y3, y4)ᵗ.
(d) Write down the correlation coefficient between y1 and y3.

2.6 CHARACTERIZATION OF MULTIVARIATE NORMAL VIA LINEAR FUNCTIONS
Let y have a nonsingular p-variate normal distribution. In the previous section, we saw that each component of y has a univariate normal distribution. Notice that yi = eiᵗy, where ei is the ith column of the identity matrix. Thus yi is a linear combination of the components of y. In this section we show that every fixed linear combination of the components of y (where y is as specified above) has a univariate normal distribution. We then go on to show that the distribution of a p-variate random vector is completely determined by the class of distributions of all fixed linear combinations of its components. Hence we show that a p-variate random vector y has a p-variate normal distribution if and only if every fixed linear combination of the components of y has a univariate normal distribution. Using this characterization, we derive several properties of the multivariate normal distribution, some of which we studied in the previous section. Here we do not need the density function, and the results are obtained easily and "without tears", as S. K. Mitra, one of the experts in the field, says. This approach is due to D. Basu (1956).
Theorem 9. Let y ~ Np(μ, Σ) where Σ is positive definite. Then the following hold.
(a) Let z = By where B is a fixed nonsingular matrix. Then z ~ Np(Bμ, BΣBᵗ).
(b) Let x = Cy where C is a fixed r × p matrix of rank r (1 ≤ r ≤ p). Then x ~ Nr(Cμ, CΣCᵗ).
(c) Let w = lᵗy where l is a fixed nonnull vector. Then w ~ N(lᵗμ, lᵗΣl).
Further, the distributions of z and x are nonsingular multivariate normal.

Proof: (a) The density of y is fy(v) = (2π)^(−p/2) |Σ|^(−1/2) exp{ −½ (v − μ)ᵗ Σ⁻¹ (v − μ) }.

Let z = By where B is a fixed nonsingular matrix. Then the Jacobian of the transformation is the absolute value of |B|, as we saw in the beginning of section 2.5. Let u = Bv and ν = Bμ, so that v = B⁻¹u and μ = B⁻¹ν. We can now write down the density of z as

fz(u) = (2π)^(−p/2) |Σ|^(−1/2) exp{ −½ (B⁻¹u − B⁻¹ν)ᵗ Σ⁻¹ (B⁻¹u − B⁻¹ν) } · (1/(absolute value of |B|)).

Notice that |BΣBᵗ| = |B|²·|Σ|. Hence |BΣBᵗ|^(1/2) = |Σ|^(1/2) · (absolute value of |B|).

Further,

(B⁻¹u − B⁻¹ν)ᵗ Σ⁻¹ (B⁻¹u − B⁻¹ν) = (u − ν)ᵗ (B⁻¹)ᵗ Σ⁻¹ B⁻¹ (u − ν) = (u − ν)ᵗ (BΣBᵗ)⁻¹ (u − ν).

Hence

fz(u) = (2π)^(−p/2) |BΣBᵗ|^(−1/2) exp{ −½ (u − ν)ᵗ (BΣBᵗ)⁻¹ (u − ν) },

which is the density of Np(ν, BΣBᵗ), i.e., of Np(Bμ, BΣBᵗ).

(b) Since C is an r × p matrix of rank r, the rows of C are linearly independent. Hence there exists a matrix T of order (p−r) × p of rank (p−r) such that

B = [ C ]
    [ T ]

is nonsingular. [We can extend the rows of C to a basis of Rᵖ. The rows of T are the additional vectors in the extended basis of Rᵖ.]

From (a) we have η = By = (Cyᵗ : Tyᵗ)ᵗ ~ Np(Bμ, BΣBᵗ).

Observe that the components of Cy are the first r components of η. By theorem 7, the marginal distribution of the first r components of η, i.e., the distribution of Cy, is r-variate normal. Since E(Cy) = Cμ and D(Cy) = CΣCᵗ, it follows that

Cy ~ Nr(Cμ, CΣCᵗ).

(c) This is a special case of (b) with r = 1.

Thus we have proved that if y ~ Np(μ, Σ) where Σ is positive definite, then every nonnull fixed linear combination lᵗy of y has a univariate normal distribution. If l = 0, then lᵗy = 0 with probability 1. It has mean 0 and variance 0, and can be thought of as N(0, 0).

Example 11: Let y ~ N3(μ, Σ) where μ = (4, 2, 6)ᵗ and Σ is as given.
(a) Obtain the distribution of x = Cy, where C is the given 2 × 3 matrix.
(b) Obtain a linear combination ξ = lᵗy of y such that ξ has a standard normal distribution.

Solution: (a) C is a 2 × 3 matrix of rank 2. (Both the rows of C are nonnull and neither is a scalar multiple of the other.) Hence by theorem 9(b), we have x = Cy ~ N2(Cμ, CΣCᵗ), where Cμ and CΣCᵗ are obtained by direct multiplication.

(b) Let m = (1, 1, −1)ᵗ. By (c) of theorem 9, mᵗy ~ N(mᵗμ, mᵗΣm), where mᵗμ = 4 + 2 − 6 = 0 and mᵗΣm = 25. So mᵗy ~ N(0, 25). Writing l = (1/5)·m, we have ξ = lᵗy ~ N(0, 1).

E14. Let y ~ N4(μ, Σ) where μ = (2, 1, 3, 4)ᵗ and

Σ = [ 1  1  1  1  ]
    [ 1  2  2  1  ]
    [ 1  2  9  1  ]
    [ 1  1  1  16 ].

(a) Let y ~ Np(μ, Σ). Let ȳ = (1/p)(y1 + ⋯ + yp). Obtain the distribution of ȳ.
(b) Obtain x = Cy and w = Ty where x and w are of order 2 × 1 such that x and w have independent nonsingular bivariate normal distributions. Compute the parameters of these distributions.

We now embark on the issue of characterizing the multivariate normal distribution via linear functions.

Definition: (Characteristic function). For a univariate random variable x, φx(t) = E(e^(itx)) is called the characteristic function of x. For a multivariate random variable y of order p × 1, φy(t) = E(e^(itᵗy)) is called the characteristic function of y.

The characteristic function of a univariate random variable is a function of a single variable t, whereas that of a p-variate random variable is a function of p variables t1,…,tp, or of the vector t = (t1,…,tp)ᵗ.

We state the following results on characteristic functions, the proofs of which are beyond the scope of these notes.

Theorem 10. Every random vector has a characteristic function.

Theorem 11. (a) The characteristic function φ(t) of a random vector y uniquely determines its distribution.
(b) If φ(t) = φ1(t1) ⋯ φp(tp), then the components of y are independent.

Let y = (y1ᵗ, …, ykᵗ)ᵗ be a partition of y. Partition t accordingly. If

φ(t) = φ1(t1) ⋯ φk(tk),

then y1,…,yk are mutually independent.
(For proofs and other properties of characteristic functions, refer to chapter 15 of Feller (Volume 2, 1966).)
We now prove a theorem, due to Cramér and Wold, that connects the distribution of a p-variate random vector with the distributions of its linear combinations.

Theorem 12: Let y be a p-variate random vector. Then the distribution of y is completely determined by the class of univariate distributions of all linear functions lᵗy, l ∈ Rᵖ (l fixed).

Proof: Let the characteristic function of lᵗy be φ(t, l) = E(e^(itlᵗy)). Now φ(1, l) = E(e^(ilᵗy)) = φ(l) is the characteristic function of y as a function of l = (l1,…,lp)ᵗ. By theorem 11(a) above, the distribution of y is completely specified by the characteristic function of y.
Motivated by theorem 12, together with the fact that every linear function of a random vector with a (nonsingular) multivariate normal distribution has a univariate normal distribution, we define the multivariate normal distribution as follows.

Definition. A p-dimensional random vector y is said to have a p-variate normal distribution if every linear function lᵗy (with l fixed) of y has a univariate normal distribution.

From now on, in this section, we use the above definition and obtain several important properties of the multivariate normal distribution. We shall also show that this coincides with the earlier definition through the density whenever the density exists. This approach is called a density-free approach.

Theorem 13. Let y be a p-dimensional random vector having a p-variate normal distribution. Then E(y) and D(y) exist.

Proof: Since yi is a linear function of y, by definition, E(yi) = μi and V(yi) = σii exist and are finite for i = 1,…,p. Since yi + yj is a linear function of y, V(yi + yj) = V(yi) + 2Cov(yi, yj) + V(yj) exists and is finite. Hence Cov(yi, yj) = σij exists and is finite. Hence E(y) = (μ1,…,μp)ᵗ and D(y) = Σ = ((σij)) exist and are finite.

Also then, lᵗy ~ N(lᵗμ, lᵗΣl) for all fixed l.


We now obtain the characteristic function of y. We prove:

Theorem 14. Let y have a p-variate normal distribution with E(y) = μ and D(y) = Σ. Then the characteristic function of y is

φ(t) = E(e^(itᵗy)) = e^(itᵗμ − ½ tᵗΣt).

Proof: Recall that the characteristic function of a univariate random variable ξ having a normal distribution with mean μ and variance σ² is φξ(s) = E(e^(isξ)) = e^(isμ − s²σ²/2).

For each fixed t, tᵗy ~ N(tᵗμ, tᵗΣt). Hence the characteristic function of tᵗy, denoted by φ(s, t) = E(e^(istᵗy)), is e^(istᵗμ − ½ s² tᵗΣt) for each t. Now

φ(t) = E(e^(itᵗy)) = φ(1, t) = e^(itᵗμ − ½ tᵗΣt).

Notice that the characteristic function of y depends on μ and Σ, its mean vector and variance-covariance matrix respectively. Also, by theorem 11, the distribution of y is completely specified by its characteristic function. Hence the distribution of y is completely specified by its mean vector and variance-covariance matrix. Let E(y) = μ and D(y) = Σ. Henceforth, we shall say y ~ Np(μ, Σ) to denote that y has a p-variate normal distribution with parameters μ and Σ. Clearly μ = E(y) and Σ = D(y). We do not insist that Σ be positive definite. However, in view of theorem 3 of section 2.3, Σ is nnd.
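Theorem 14 can be checked by comparing the empirical characteristic function with the formula, for assumed illustrative μ and Σ:

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative parameters (our own choice)
mu = np.array([1.0, -1.0])
sigma = np.array([[2.0, 0.5], [0.5, 1.0]])

B = np.linalg.cholesky(sigma)
y = mu + rng.standard_normal((200_000, 2)) @ B.T

t = np.array([0.3, -0.7])
empirical = np.mean(np.exp(1j * (y @ t)))               # E[exp(i t'y)]
theoretical = np.exp(1j * (t @ mu) - 0.5 * t @ sigma @ t)

print(abs(empirical - theoretical))   # small
```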
Theorem 15. Let y ~ Np(μ, Σ). Then every linear function lᵗy ~ N(lᵗμ, lᵗΣl).
Proof: By definition, lᵗy has a univariate normal distribution. Further, E(lᵗy) = lᵗμ and V(lᵗy) = lᵗΣl.

E15. Let y ~ Np(μ, Σ). Let B be a fixed r × p matrix. Show that By ~ Nr(Bμ, BΣBᵗ).

Theorem 16. Let y ~ Np(μ, Σ). If Σ is a diagonal matrix, then the components of y are independent.

Proof: The characteristic function of y is

φ(t) = E(e^(itᵗy)) = e^(itᵗμ − ½ tᵗΣt) = e^(it1μ1 − ½ t1²σ11) ⋯ e^(itpμp − ½ tp²σpp)

(since tᵗΣt = t1²σ11 + ⋯ + tp²σpp when Σ is diagonal). So, by theorem 11(b), y1,…,yp are independent.

If Σ is a diagonal matrix, then y1,…,yp are uncorrelated. Thus we have shown in theorem 16 that uncorrelatedness implies independence if y1,…,yp have a joint normal distribution.
Theorem 17. Let y ~ Np(μ, Σ). Write y = (y1ᵗ, y2ᵗ)ᵗ where y1 has r components. Partition μ and Σ conformably as

μ = [ μ1 ]  and  Σ = [ Σ11  Σ12 ]
    [ μ2 ]          [ Σ21  Σ22 ].

Then y1 and y2 are independently distributed if and only if Σ12 = 0.

The proof is similar to that of theorem 16 and is left as an exercise.

Example 12. Let y ~ Np(μ, Σ). Partition y as y = (y1ᵗ, …, ykᵗ)ᵗ. Show that if y1,…,yk are pairwise independent, then they are mutually independent. (In general, pairwise independence does not imply mutual independence, but it does if y1,…,yk have a joint normal distribution.)

Solution: Partition μ and Σ conformably to that of y as

μ = [ μ1 ]       and  Σ = [ Σ11  Σ12  …  Σ1k ]
    [ ⋮  ]                [  ⋮    ⋮        ⋮  ]
    [ μk ]                [ Σk1  Σk2  …  Σkk ].

Let i ≠ j. By E15, (yiᵗ, yjᵗ)ᵗ has a multivariate normal distribution with mean vector (μiᵗ, μjᵗ)ᵗ and variance-covariance matrix

[ Σii  Σij ]
[ Σji  Σjj ]

(why?). Since yi and yj are independent, Σij = 0. Thus

Σ = [ Σ11  O    …  O   ]
    [ O    Σ22  …  O   ]
    [ ⋮         ⋱   ⋮   ]
    [ O    O    …  Σkk ].

Now the characteristic function of y is

φ(t) = E(e^(itᵗy)) = e^(itᵗμ − ½ tᵗΣt) = e^(it1ᵗμ1 − ½ t1ᵗΣ11t1) ⋯ e^(itkᵗμk − ½ tkᵗΣkktk).

Thus, by theorem 11, y1,…,yk are mutually independent.


So far we have not shown the existence of a random vector as in the definition of multivariate normal via linear functions. We shall now do this. First we prove a lemma.

Lemma 1. Let y have a multivariate distribution with mean vector μ and variance-covariance matrix Σ. Then y − μ belongs to the column space of Σ with probability 1.

Proof: We shall show that if l is orthogonal to the columns of Σ, then lᵗ(y − μ) = 0 with probability 1. Indeed,

lᵗΣ = 0 ⟹ lᵗΣl = 0 ⟹ V(lᵗ(y − μ)) = 0 ⟹ lᵗ(y − μ) is a constant with probability 1 ⟹ lᵗ(y − μ) = E(lᵗ(y − μ)) = 0 with probability 1.

Theorem 18. y ~ Np(μ, Σ) if and only if there exists a random vector x of independent standard normal variables such that y = μ + Bx with probability 1 for some matrix B of full column rank such that BBᵗ = Σ.
Proof: "If" part: Let l be a fixed vector. Then lᵗy = lᵗμ + lᵗBx. lᵗBx is a linear combination of independent standard normal variables and hence has a univariate normal distribution. Hence lᵗy = lᵗμ + lᵗBx has a univariate normal distribution. The choice of l being arbitrary, it follows that y has a multivariate normal distribution. Now

E(y) = E(μ + Bx) = μ + BE(x) = μ, since E(x) = 0, and
D(y) = D(μ + Bx) = D(Bx) = BIBᵗ = Σ.

Hence y ~ Np(μ, Σ).

"Only if" part: Let the rank of Σ be equal to r. Let a spectral decomposition of Σ be

Σ = (P1 : P2) [ Δ  0 ] (P1 : P2)ᵗ = P1 Δ P1ᵗ,
              [ 0  0 ]

where P1 is a matrix of order p × r such that P1ᵗP1 = Ir and Δ is a pd diagonal matrix of order r × r. Thus Σ = BBᵗ where B = P1Δ^(1/2), Δ^(1/2) being the pd square root of Δ. Notice that the rank of B is the same as the rank of P1, which is equal to r since P1ᵗP1 = Ir. Also (Δ^(1/2))⁻¹ P1ᵗ P1 Δ^(1/2) = I.

Now write x = (Δ^(1/2))⁻¹ P1ᵗ (y − μ). Clearly x is a random vector of order r × 1. By E15, x has an r-variate normal distribution. Now E(x) = (Δ^(1/2))⁻¹ P1ᵗ E(y − μ) = 0. Also

D(x) = (Δ^(1/2))⁻¹ P1ᵗ Σ P1 (Δ^(1/2))⁻¹
     = (Δ^(1/2))⁻¹ P1ᵗ P1 Δ P1ᵗ P1 (Δ^(1/2))⁻¹
     = (Δ^(1/2))⁻¹ Δ (Δ^(1/2))⁻¹ = I.

Thus we have manufactured a random vector x ~ Nr(0, I) (i.e., x is a vector of r independent standard normal variables) such that

x = (Δ^(1/2))⁻¹ P1ᵗ (y − μ),  or  P1 Δ^(1/2) x = P1 P1ᵗ (y − μ).

By Lemma 1, y − μ belongs to the column space of Σ (with probability 1), which is the same as the column space of P1. Thus y − μ = P1v for some v with probability 1. Thus

P1 Δ^(1/2) x = P1 P1ᵗ P1 v = P1 v = y − μ with probability 1,

or y = μ + Bx where B = P1Δ^(1/2). Also BBᵗ = P1 Δ^(1/2) Δ^(1/2) P1ᵗ = P1 Δ P1ᵗ = Σ.

Thus we have established the existence of a random vector y having the p-variate normal distribution as defined via linear functions.
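The construction in the "only if" part can be carried out numerically for a singular Σ; the rank-2 matrix below is an illustrative choice of our own.

```python
import numpy as np

# Singular Sigma of rank 2 (row 3 = row 1 + row 2), illustrative
sigma = np.array([[1.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0],
                  [1.0, 1.0, 2.0]])

# Spectral decomposition: keep the r positive eigenvalues
eigvals, eigvecs = np.linalg.eigh(sigma)
keep = eigvals > 1e-10
P1 = eigvecs[:, keep]                     # p x r, orthonormal columns
delta_half = np.diag(np.sqrt(eigvals[keep]))
B = P1 @ delta_half                       # B = P1 Delta^{1/2}

print(B.shape)                            # (3, 2): full column rank r = 2
print(np.allclose(B @ B.T, sigma))        # True: B B^t = Sigma
```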
Example 13. Let y ~ Np(μ, Σ) as defined via linear functions. Let Σ be pd. Then show that the density of y is

fy(v) = (2π)^(−p/2) |Σ|^(−1/2) exp{ −½ (v − μ)ᵗ Σ⁻¹ (v − μ) }.

Solution: Since Σ is pd, by theorem 18 there exists a random vector x of order p × 1, a vector of independent N(0, 1) variables, such that y = μ + Bx with probability 1, where B is a p × p matrix such that Σ = BBᵗ. Since Σ is nonsingular, it follows that B is nonsingular. So, as seen in the beginning of section 2.5, the density of y is

fy(v) = (2π)^(−p/2) |Σ|^(−1/2) exp{ −½ (v − μ)ᵗ Σ⁻¹ (v − μ) }.

Thus the two definitions of multivariate normal coincide when Σ is pd.


2.7 SUMMARY
In this unit we have covered the following points.

Nature of multivariate problems


Computation of the mean vector and variance-covariance matrix of a linear
transformation of a random vector
An algorithm to compute a lower triangular square root of a positive definite
matrix and its inverse simultaneously
Discrete and continuous multivariate distributions
Uncorrelatedness and independence
Marginal and conditional distributions
Density of a nonsingular multivariate normal distribution
Marginal and conditional distributions in multivariate normal
Definition of multivariate normal via linear functions
Characteristic function of multivariate normal.

2.8 REFERENCES
1. Basu, D. (1956). A note on the multivariate extension of some theorems related to the univariate normal distribution. Sankhyā, Vol. 17, pages 221-224.
2. Feller, W. (1966). An Introduction to Probability Theory and its Applications, Vol. 2. Wiley, New York.
3. Rao, A. R. and Bhimasankaram, P. (2000). Linear Algebra. Hindustan Book Agency, Delhi.
4. Ross, S. (1976). A First Course in Probability, Second Edition. Macmillan Publishing Company, New York.
