
A Stochastic Simulation for Approximating the log-Determinant of a Symmetric Positive Definite Matrix

Michael McCourt December 15, 2008


Abstract. The efficacy of approximation by radial basis functions is often determined by a scale parameter ε. Unfortunately, some choices of ε which will produce a good interpolant yield an ill-conditioned matrix. One method of choosing an appropriate scale parameter is to assume the data is a realization of a Gaussian process and then find the maximum likelihood estimator. This will require the evaluation of the determinant of the covariance matrix; doing so will likely be an ill-conditioned problem because of the spectrum of the covariance matrix. Here we will discuss a Monte Carlo technique developed in [7] and analyze its usefulness on matrices arising in radial basis approximation. Later we will consider an adaptation of this method which will be applicable to more general matrices.

1 Introduction to Determinants

The determinant of a matrix is defined in modern algebra to be the unique function F : M_n(K) → K, where M_n(K) is the set of square matrices of size n over the field K, such that F is alternating multilinear with respect to columns and F(I_n) = 1. There are more abstract formulations which are coordinate-free, and there are simple identities for small or special matrices; however, for our concerns we will use the standard definition

    det(A) = ∏_{i=1}^{n} λ_i,    (1.1)
where λ_i is the i-th eigenvalue of A. The standard technique for computing the determinant is by a decomposition of the matrix. For instance, if you already have the Cholesky factorization A = LL^T, then the determinant can be computed as det(A) = det(L) det(L^T) = det(L)². Because L is a triangular matrix, its eigenvalues all lie on its diagonal, so once you have the decomposition of A, computing its determinant is trivial. If, however, there is no stable way to factor the matrix, finding the determinant may be more troublesome. Few modern numerical algorithms require the computation of the determinant because doing so is often more computationally expensive than other methods. For instance, Ax = b can be solved with Cramer's rule, but doing so requires the computation of n + 1 determinants, each of which has the complexity of a single decomposition. Obviously this is hopelessly inefficient because the system could be solved with only one decomposition. Other applications of the determinant can be found in [3]. The most common appearance of determinants today is in statistics, for instance in the density function of the multivariate normal distribution,

    L(x; μ, Σ) = exp(−(1/2)(x − μ)^T Σ^{-1} (x − μ)) / √((2π)^N det(Σ)).
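To make the factorization route concrete, here is a minimal Matlab sketch of computing a determinant from a Cholesky factor; it works in log form to avoid overflow, and the test matrix is made up purely for illustration.

% Log-determinant of an SPD matrix from its Cholesky factor
n = 200;
B = randn(n);
A = B*B' + n*eye(n);            % SPD and well conditioned by construction
b = randn(n, 1);

L = chol(A, 'lower');           % A = L*L'
x = L'\(L\b);                   % the solve A*x = b reuses the same factor
logdetA = 2*sum(log(diag(L)));  % log(det(A)) = 2*sum(log(diag(L)))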

Many statistical settings accompany the evaluation of det(A) by a linear solve A^{-1}b - thus if a factorization is used to find A^{-1}b, det(A) is found for free. Conversely, if the linear solve is conducted in an iterative fashion, there may be no determinant-revealing decomposition; in this case computing the determinant must be done separately. Ideally, finding the determinant in this case should have the same complexity as the linear solve. In the next section we will identify a problem which requires the determinant of a matrix for which no factorization is available.

2 Radial Basis Interpolation


Suppose we want to fit the model

    Y(x) = ∑_{j=1}^{N_f} β_j f_j(x) + Z(x)

to data (x_i, y_i), 1 ≤ i ≤ n. The f_j terms are deterministic functions - e.g. they may be polynomials if our model expects polynomial reproduction. Z(x) is a Gaussian process with zero mean and covariance

    Cov(Z(u), Z(v)) = σ² φ(u, v),    (2.1)

where φ(u, v) = φ(ε‖u − v‖) is the radial basis function (or kernel), with ε the shape parameter, and σ² is the process variance. For simplicity, we will ignore the regression terms by setting β_j ≡ 0; this means we presume the data (x_i, y_i) is a realization of Z. For this case, given a shape parameter, the best linear unbiased predictor at a new point x_0 is

    ŷ_0(ε) = φ(x_0)^T Φ^{-1} y,    (2.2)

where Φ is the covariance matrix, y is the vector of design values, and φ(x_0) is the vector of correlations between x_0 and the design points. The choice of ε is very important in making accurate predictions. As an example, consider the input data below, which is generated by evaluating (1 + x²)^{-1/2} at 6 evenly spaced points between 0 and 1.

    x    0        0.2      0.4      0.6      0.8      1.0
    y    1.0000   0.9806   0.9285   0.8575   0.7809   0.7071

When we compare the unique polynomial interpolant of these points to the Gaussian RBF interpolant for various values of ε, the result is Figure 1. It is apparent in that graph that for some values of ε, the RBF interpolant outperforms the simple polynomial interpolation. This is a motivation for us to find the best ε possible. Of course, finding the best ε is not a straightforward problem; even defining what "best" means is a subjective decision. For this paper we are going to take advantage of our assumption that the data is generated by a Gaussian process. From this we know the likelihood of the parameters given the data is

    L(ε, σ²) = p(ε, σ² | y) ∝ exp(−y^T Φ^{-1} y / (2σ²)) / ((σ²)^{n/2} √det(Φ)).

See [8]. Because of the special structure of this function, σ² can be analytically integrated out to find

    M(ε) = ∫₀^∞ L(ε, σ²) dσ² ∝ (y^T Φ^{-1} y)^{-n/2} det(Φ)^{-1/2},

Figure 1: For some values of ε the RBF interpolant is better than a polynomial. (The figure plots log10 of the mean-square interpolation error against log10(ε) for the RBF interpolant, with the polynomial error shown for comparison.)
which is the function that we want to maximize. Note that M can be interpreted as p(ε | y), the profile likelihood over all σ². The process of maximizing that particular function is threatened by over/underflow, so instead we will minimize its (scaled) negative logarithm, which is

    M̂(ε) = log(y^T Φ^{-1} y) + (1/n) log det(Φ).    (2.3)

Each time we evaluate M it will require a linear solve and a determinant evaluation. In the next section we will quickly discuss an iterative technique used to solve ill-conditioned linear systems arising in RBF interpolation. For more information about radial basis functions and Gaussian Processes please reference [4], [5] or [6].
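As a concrete illustration of evaluating (2.3), the sketch below computes the criterion for the Gaussian kernel on the six-point example above via a Cholesky factorization - exactly the step that becomes unreliable once Φ is badly conditioned. The grid of trial ε values is an arbitrary choice for illustration.

% Evaluate the criterion (2.3) for a few shape parameters on the example data
x = (0:0.2:1)';
y = 1./sqrt(1 + x.^2);                % the data tabulated above
n = length(x);
[X1, X2] = meshgrid(x);
for ep = [0.5 1 2 4]                  % trial shape parameters (illustrative values)
    Phi = exp(-ep^2*(X1 - X2).^2);    % Gaussian RBF covariance matrix
    L = chol(Phi, 'lower');           % fails if Phi is numerically indefinite
    w = L\y;                          % gives y'*inv(Phi)*y = w'*w
    Mhat = log(w'*w) + (2/n)*sum(log(diag(L)));
    fprintf('epsilon = %4.2f   criterion = %8.4f\n', ep, Mhat);
end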

3 Riley's Algorithm

More than 50 years ago, a regularization technique for solving ill-conditioned symmetric positive definite linear systems was developed in [2]. It resurfaced in [1], and Fasshauer recently gave a talk at a colloquium in Syracuse on its efficiency when compared to the SVD. While it does not have an official name, we will call this Riley's algorithm. Assuming the goal is to solve the system Ax = b, where A is SPD and ill-conditioned, Riley's algorithm solves a regularized system

    C = A + μI,    μ > 0,    (3.1)

which can be factored safely with the Cholesky decomposition. If all we concerned ourselves with was solving Cy = b, this would be ridge regression or Tikhonov regularization. But we can take another step by noting the identity

    A^{-1} = (1/μ) ∑_{k=1}^∞ (μ C^{-1})^k

(the series converges because the eigenvalues of μC^{-1} = μ(A + μI)^{-1} all lie in (0, 1)), which gives us a simple iteration method for approximating the solution to Ax = b,

    x_{i+1} = y + μ C^{-1} x_i,    (3.2)

where y = C^{-1} b. This technique allows us to find x by only performing a stable Cholesky decomposition on C.
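A minimal Matlab sketch of this iteration is given below; the regularization parameter mu, the iteration cap and the stopping tolerance are illustrative choices rather than prescriptions from [2].

function x = riley(A, b, mu, maxit, tol)
% Riley's algorithm: solve A*x = b for ill-conditioned SPD A via C = A + mu*I.
n = size(A, 1);
L = chol(A + mu*speye(n), 'lower');   % one stable Cholesky factorization of C
y = L'\(L\b);                         % y = C\b
x = y;
for i = 1:maxit
    xnew = y + mu*(L'\(L\x));         % x_{i+1} = y + mu*C^{-1}*x_i, as in (3.2)
    if norm(xnew - x) <= tol*norm(xnew)
        x = xnew;
        return
    end
    x = xnew;
end
end

A call such as x = riley(Phi, y, 1e-8, 500, 1e-12) could then stand in for the direct solve of Φx = y needed in (2.3); all of these numerical values are placeholders.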

Choosing the regularization parameter μ to maximize stability but minimize the summation length is not a straightforward procedure, but we won't concern ourselves with it here. What we do need to be concerned with is that evaluating (2.3) requires both Φ^{-1}y and det(Φ). Unfortunately, using Riley's algorithm for the linear solve does not provide us with a determinant-revealing factorization. In the next section we will analyze a Monte Carlo technique for approximating the log-determinant of Φ.

4 Approximating the Determinant

This algorithm is first described in [7] for a matrix A = I_n − αD, where D is a matrix whose eigenvalues lie within (−1, 1) and α ∈ (−1, 1) is a parameter in the model they were studying. For this case, the algorithm as follows will approximate the log-determinant of A.

Basic Algorithm
    Choose integers m, p > 0.
    For k from 1 to p:
        draw x_k ∼ N(0, I_n);
        set V(k) = −n ∑_{j=1}^m α^j x_k^T D^j x_k / (j x_k^T x_k).
    Compute the mean V̄ of V; E[V̄] ≈ log(det(A)).

At the end of this algorithm, we have a p-vector V, and we can generate a 95% confidence interval [V̄ − F, V̄ + F] for the true log-determinant, where V̄ is the mean and

    F = n α^{m+1} / ((m + 1)(1 − α)) + 1.96 √(s²(V)/p).    (4.1)

s²(V) is the sample variance. This algorithm is a direct extension of the power series representation

    log(1 − x) = −∑_{j=1}^∞ x^j / j,    x ∈ [−1, 1),

where we use the Rayleigh quotient to extract the eigenvalues. A proof of convergence is provided in [7], but for a basic understanding of why this works, we can think about the Schur decomposition of D,

    D = V Λ V^T = [v_1 ⋯ v_n] diag(1 − λ_1, …, 1 − λ_n) [v_1 ⋯ v_n]^T,

where we write the eigenvalues of D as 1 − λ_i, i.e., the λ_i are the eigenvalues of I_n − D. When we compute x^T D x we are indirectly computing x^T D x = y^T Λ y with y = V^T x, which has the same distribution as x because V is an orthogonal matrix. Because y ∼ N(0, I_n),

    x^T D x = y^T Λ y = ∑_{i=1}^n y_i² (1 − λ_i),    (4.2)

so each inner product yields a weighted sum of the eigenvalues. Taking the quotient x^T D x / x^T x turns the coefficients in that summation into random variables with properties that allow us to derive (4.1). V^T x can be thought of as the vector whose i-th value is the amount of x which points in the direction of v_i. If we choose enough x vectors which point randomly in all the directions, then we should hit all the eigenvalues equally.

5 Extending the Algorithm

So now that we have a technique which will work on special matrices (for ρ(D) < 1 we need the eigenvalues of A between 0 and 2), what can we do to extend its usefulness to more general matrices? How can we use this algorithm on a matrix A whose eigenvalues are not in the acceptable range? Can we test if the confidence interval is small enough before completing the algorithm?

Are there changes needed for this to be implemented efficiently in Matlab? While we are going through these adaptations, keep in mind the following thought: the closer the eigenvalues of D are to ±1, the slower the convergence. This is a direct analog of the fact that the log series converges very slowly when the argument is far from the center of the series, in this case 0.

5.1 Scaling the Matrix

Suppose we know the largest eigenvalue of A is λ_1; that means that the largest eigenvalue of A/λ_1 is 1. If we call D = I_n − A/λ_1, then

    log det(A) = log det(λ_1 (A/λ_1)) = n log λ_1 + log det(A/λ_1) = n log λ_1 + log det(I_n − D).

So now we are able to use the algorithm on D and add n log λ_1 to the result. This is the most obvious way to adapt A to an acceptable matrix. It has the disadvantage of scaling all the eigenvalues of the matrix by λ_1. If all the eigenvalues are near λ_1 then this is desirable. Unfortunately, if there is only one eigenvalue of A outside the region, then the other eigenvalues are scaled unnecessarily. This is unacceptable if the other eigenvalues of the matrix are close to zero, because scaling those brings the associated eigenvalues of D closer to the edge of convergence. If those eigenvalues were already slowing down the series convergence, they are even more troublesome now. Consider the following RBF example: we want to use Gaussians to interpolate with ε = .1 and centers x = 0 : .2 : 1. The resulting matrix looks like

    A = [ 1.0000 0.9996 0.9984 0.9964 0.9936 0.9900
          0.9996 1.0000 0.9996 0.9984 0.9964 0.9936
          0.9984 0.9996 1.0000 0.9996 0.9984 0.9964
          0.9964 0.9984 0.9996 1.0000 0.9996 0.9984
          0.9936 0.9964 0.9984 0.9996 1.0000 0.9996
          0.9900 0.9936 0.9964 0.9984 0.9996 1.0000 ]

and its eigenvalues are

    6.0×10^0,  1.4×10^-2,  1.2×10^-5,  5.5×10^-9,  1.4×10^-12,  2.6×10^-16,

which is just dismal. The confidence interval for this matrix is still [V̄ − F, V̄ + F], but F is now

    F = n(1 − γ)^{m+1} / ((m + 1)γ) + 1.96 √(s²(V)/p),    (5.1)

where γ = λ_n/λ_1. Even for this 6×6 problem, it takes a huge m to counteract the incredibly small γ. After m = 1,000,000 there still is no convergence - there would be no convergence even if the scaling were unnecessary, but the division by λ_1 ≈ 6 still hurts. While scaling allows the problem to be handled by the algorithm, it is only appropriate for a matrix whose eigenvalues are all of the same order. Of course, if a matrix has eigenvalues which are near each other but not near zero, there is no ill-conditioning, so the determinant can be found by a factorization. This seems to make scaling the matrix an ineffective technique on its own; we will revisit it later in an acceleration framework.
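To see this behaviour directly, one could run something like the sketch below, which scales the 6×6 Gaussian matrix above and calls the DetAppxVectorized routine listed in Section 5.4; the value from eig is included only for comparison, and even a generous m leaves a large error here.

% Scaled Monte Carlo estimate for the badly conditioned Gaussian example
x = (0:0.2:1)';
ep = 0.1;
[X1, X2] = meshgrid(x);
A = exp(-ep^2*(X1 - X2).^2);            % the 6x6 matrix displayed above
n = length(x);
lam1 = max(eig(A));                     % largest eigenvalue, approximately 6.0
D = eye(n) - A/lam1;                    % spectrum of D lies in [0, 1)
approx = n*log(lam1) + DetAppxVectorized(D, 50, 100000);
exact = sum(log(eig(A)));               % reference value; itself shaky at this conditioning
fprintf('approx %.2f   exact %.2f\n', approx, exact);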

5.2 Orthogonal Draws

The problem with the tactic of scaling the spectrum to (−1, 1) is that the entire system is punished for the delinquency of a small subset. If possible, it would seem more logical to handle the offending eigenvalues separately so as to not slow down the convergence of the series by scaling. One way to do this is by a technique I call orthogonal draws. Suppose we know that there are only n̄ ≪ n eigenvalues of D = I − A which are less than −1. If we had the orthogonal invariant subspace W̄ of those n̄ eigenvalues (collected in Λ̄), we could ensure that the draws x are taken from W̄^⊥. This can be seen quickly by referring to (4.2) and noting that if x ⊥ span(W̄), the y_i² associated with the problematic eigenvalues will be 0. To visualize this, let's return to the Schur decomposition of D,

    D = V Λ V^T = [W̄  W] diag(Λ̄, Λ) [W̄  W]^T.

Now we define a symmetric projection matrix P = I − W̄W̄^T and consider

    P D P = (I − W̄W̄^T) V Λ V^T (I − W̄W̄^T) = [W̄  W] diag(0, Λ) [W̄  W]^T = W Λ W^T.

From this it is obvious that the only nonzero eigenvalues of the matrix PDP are in Λ, which are within the acceptable range to be used in the algorithm. This means that we do not have to worry about drawing x from W̄^⊥, because we can form PDP and use it in the algorithm with standard x ∼ N(0, I_n). The output will be the log-determinant associated with the acceptable eigenvalues, to which we will have to add the log-determinant associated with Λ̄, determined when W̄ was found. Finding W̄ can be done with a simple subspace iteration (a minimal sketch is given below) or with the Lanczos procedure. Of course, we will not know beforehand how many eigenvalues of D are less than −1. As a result, we will need to make a guess of n̄, and if the smallest eigenvalue associated with W̄ is less than −1, a larger subspace will need to be found until W̄ satisfies the requirements. Once an appropriate W̄ is found, computing PDP can be done as a preprocessing step of O(n̄n²), after which each multiplication requires only O(n²); alternatively, each time PDPx needs to be computed it can be done in 3 matrix-vector products with total complexity O(n²) + O(n̄n). Another aspect to consider is how we can combine scaling with orthogonal draws to speed up convergence. Let's look at the matrix generated by the RBF

    φ(x_i, x_j) = (1 − ε‖x_i − x_j‖)_+    (5.2)

with ε = .01, in 1D with 30 evenly spaced points between 0 and 1. See [4] for a proof that this matrix is SPD. The spectrum that Matlab finds for this matrix is displayed as a histogram in Figure 2. Because only one eigenvalue of Φ is greater than 1 (and thus only 1 eigenvalue of D is less than −1), we only need to find the subspace of that one eigenvalue. If, however, we found the invariant subspace of the 5 largest eigenvalues, the only eigenvalues remaining would be within (.00005, .01). Using the algorithm on this matrix will converge in a reasonable time, but using the algorithm on a matrix whose eigenvalues are within (.005, 1) will converge even more quickly. Luckily for us, we have a technique for scaling the eigenvalues of a matrix, although previously we used it to reduce the size of the spectrum. Consider the following idea to find det(Φ):

1. Use subspace iteration to find the largest eigenvectors/eigenvalues W̄/Λ̄. Because Φ is SPD, W̄^T Φ W̄ = Λ̄.
2. Define (but don't form) P = I_n − W̄W̄^T, and call λ = min(Λ̄).
3. Apply the algorithm to the matrix P(I_n − Φ/λ)P to get Λ_1.
4. Compute our approximation log det(Φ) ≈ Λ_1 + (n − n̄) log λ + ∑ log diag(Λ̄).

This combination of orthogonal draws and scaling will allow us to approximate the determinant of Φ with a smaller log-series summation length m. This faster convergence comes at the cost of performing the subspace iterations and the additional multiplications by P. Because there is likely no way to know beforehand how many eigenvalues of Φ lie beyond the range of convergence, there needs to be flexibility built into the steps taken; however, with certain RBFs we can expect that the number of offending eigenvalues is small. See [5] and [4] for some discussion of the spectra of RBF interpolation matrices. Also, [9] discusses how the choice of design points x_i affects the determinant of the resulting matrix.
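The subspace iteration mentioned above might look like the following sketch; the block size nbar and the number of iterations are inputs the user must guess, as discussed above.

function [W, lam] = OuterSubspace(A, nbar, iters)
% Orthogonal (subspace) iteration with Rayleigh-Ritz extraction for the
% invariant subspace of the nbar largest eigenvalues of the SPD matrix A.
n = size(A, 1);
W = orth(randn(n, nbar));             % random orthonormal starting block
for i = 1:iters
    W = orth(A*W);                    % power step followed by re-orthonormalization
end
B = W'*A*W;                           % small projected (Rayleigh quotient) matrix
[Q, T] = eig(B);
[lam, idx] = sort(diag(T), 'descend');
W = W*Q(:, idx);                      % Ritz vectors paired with the sorted Ritz values
end

The smallest entry of lam then plays the role of λ in step 2 above; if it is still outside the acceptable range, nbar can be increased and the iteration repeated.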

Figure 2: Only 5 eigenvalues of Φ are greater than 1. (The figure shows a histogram of the eigenvalues of Φ, frequency against log10 of the eigenvalues.)

5.3 Flexible Convergence

Deciding when our algorithm has converged requires deciding if the 95% confidence interval is small enough. There are two parameters which determine how much work goes into the approximation:

    m controls the quality of the truncated series approximation  log det(I_n − D) ≈ −n E[ ∑_{k=1}^m x^T D^k x / (k x^T x) ];
    p controls the variance of the estimator  E[ (1/p) ∑_{k=1}^p V_k ] ≈ log det(I_n − D).

While it would seem that we need to increase p and m together to ensure that one source of variance does not overwhelm the other, in testing it seems that p never needs to be larger than n. In fact 50 is the value used in all our experiments and in every case the error contributed by a small m greatly outweighs that from a small p.
Figure 3: The 95% confidence interval converges in width before drift. (The plot shows mean(V) with its lower and upper 95% confidence bounds and the true solution, against the summation order.)


Because of this empirical data we are going to redesign the algorithm so that p is chosen at the beginning but m is increased incrementally until the size and drift of the confidence interval are acceptably small. See Figure 3 as an example where the p-related confidence interval (computed with (4.1) and α = .5) is small enough long before the log series converges. One serious difficulty with adjusting m is that computing the γ in (5.1) requires knowledge of the smallest eigenvalue of the system. We could use inverse iteration to estimate it, but that may or may not be practical depending on the conditioning of the problem. This is why we reference the drift in the p confidence interval, rather than the true confidence interval described by (5.1), as our decision to conduct a longer m summation. The other problem with having a flexible series convergence is that the p x vectors used need to be stored because, if after m summations there is inadequate convergence, those vectors need to be used to perform the next set of matrix-vector products. This imposes an extra computational overhead, but because p is fixed, all the memory can be allocated at the beginning.
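In code, the stopping test described above might look like the fragment below, wrapped as a small function; the variable names and the way the tolerance is applied are illustrative, not taken from the paper.

function keepgoing = NotConverged(v, oldv, p, tol)
% v holds the current p estimates, oldv those from the previous block of summation terms.
halfwidth = 1.96*sqrt(var(v)/p);            % p-based 95% confidence half-width
drift = abs(mean(v) - mean(oldv));          % change since the last block of terms
% extend the series (a larger m) until both quantities are acceptably small
keepgoing = (halfwidth > tol) || (drift > tol*abs(mean(v)));
end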

5.4 A Matlab Implementation

Before we look at all the changes discussed, let's look at the Matlab code used in [7].

function mclogdet = DetAppx(D, p, m)
% Monte Carlo approximation of log(det(I - D)), following [7].
n = length(D);
v = zeros(p, 1);
for i = 1:p
    x = randn(n, 1);
    c = x;
    for k = 1:m
        c = D*c;                      % c now holds D^k * x
        v(i) = (x'*c)/k + v(i);
    end
    v(i) = -n*v(i)/(x'*x);
end
mclogdet = mean(v);

This code can easily be accelerated (more than 100 times faster, depending on n, m and p) by taking advantage of Matlab's vectorized capabilities. Also note that this function is sparse-D friendly because the only operation D is involved in is a matrix-vector product.

function mclogdet = DetAppxVectorized(D, p, m)
% Vectorized Monte Carlo approximation of log(det(I - D)); all p draws are advanced together.
n = length(D);
v = zeros(1, p);
x = randn(n, p);
xTx = sum(x.*x);
c = x;
for j = 1:m
    c = D*c;                          % column i of c now holds D^j * x(:,i)
    v = v - n/j*sum(x.*c)./xTx;
end
mclogdet = mean(v);

And for a general SPD matrix generated by RBF interpolation with (5.2) (or another RBF), we can use all the adaptations we have talked about to come up with the final function. Please see Appendix A for a Matlab implementation.
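As a quick check of DetAppxVectorized, the sketch below builds a deliberately well-conditioned SPD test matrix, so that the series converges quickly, and compares against the value from eig; the matrix, its size and the parameters are arbitrary choices for illustration.

% Sanity check of DetAppxVectorized on a well-conditioned matrix
n = 200;
B = randn(n);
A = eye(n) + (B*B')/norm(B*B');       % SPD with eigenvalues in [1, 2]
lam1 = 2;                             % known upper bound on the spectrum of A
D = eye(n) - A/lam1;                  % eigenvalues of D lie in [0, 0.5]
approx = n*log(lam1) + DetAppxVectorized(D, 50, 40);
exact = sum(log(eig(A)));
fprintf('approx %.2f   exact %.2f\n', approx, exact);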

6 Final Thoughts

Although the original paper [7] focused on finding the determinant of special matrices derived from a unique (and confusing) statistical problem, the technique can be adapted to find the log-determinant of more general matrices. Here we have focused mainly on the symmetric positive definite matrices which arise in RBF interpolation, but it is not difficult to see how extensions of the adaptations suggested here would provide applicability to all matrices. The question then becomes: for what problems is it acceptable to approximate the determinant, and when is it cheaper to find the determinant directly?

These questions will have to be answered by each individual interested in using this technique. It seems possible, though, that for iterative problems where the determinant is required, approximation and speed may be preferable to finding the exact solution at each step. This was the case for the maximum likelihood estimation problem described here, because inverting the matrix in question required a method which did not reveal the determinant. As is often the case, the applicability of this technique varies between topics.

References
[1] A. Neumaier, Solving ill-conditioned and singular linear systems: a tutorial on regularization, SIAM Review, 40/3:636-666, 1998.
[2] J. D. Riley, Solving systems of linear equations with a positive definite, symmetric, but possibly ill-conditioned matrix, Mathematical Tables and Other Aids to Computation, 9/51:96-101, 1955.
[3] B. Kolman, D. R. Hill, Elementary Linear Algebra, Pearson Prentice Hall.
[4] G. Fasshauer, Meshfree Approximation Methods with Matlab, World Scientific Press.
[5] H. Wendland, Scattered Data Approximation, Cambridge University Press.
[6] M. L. Stein, Interpolation of Spatial Data: Some Theory for Kriging, Springer.
[7] R. P. Barry, R. K. Pace, Monte Carlo estimates of the log determinant of large sparse matrices, Linear Algebra and its Applications, 289:41-54, 1999.
[8] B. Nagy, J. L. Loeppky, W. J. Welch, Fast Bayesian Inference for Gaussian Process Models, University of British Columbia, Dept. of Statistics Tech. Report #230, 2007.
[9] L. P. Bos, U. Maier, On the Asymptotics of Fekete-Type Points for Univariate Radial Basis Interpolation, Journal of Approximation Theory, 119:252-270, 2002.

Appendix A
This Matlab function combines all the adaptations of the original function so that we can approximate the determinant of A to a certain tolerance. Note that we use eigs to find the disruptive subspace because A has the potential to be sparse for compactly supported RBFs. If A is a dense matrix, it will probably be faster to use subspace iteration, but because eigs allows us to use a function to represent A*x, it will work for both situations. Note also that we do the orthogonal projection separately rather than forming the PDP matrix.

function mclogdet = DetAppxVecRBF(A, p, tol)
% Approximate log(det(A)) for an SPD (possibly sparse) RBF matrix A, combining
% orthogonal draws, scaling and a flexible summation length.
n = length(A);
v = zeros(1, p);
x = randn(n, p);
xTx = sum(x.*x);
c = x;
[V, E] = FindOuterSpace(A);
lam = E(end);                         % smallest of the removed (outer) eigenvalues
D = speye(n) - A/lam;                 % remaining spectrum of D lies in [0, 1)
oldv = ones(1, p);
maxm = 10000;                         % cap on the summation length, chosen only to keep the loop finite
for j = 1:maxm
    c = ProjectPAP(D, V, c);          % c now holds (P*D*P)^j applied to the draws
    v = v - n/j*sum(x.*c)./xTx;
    if abs(mean(v) - mean(oldv)) < tol*abs(mean(oldv))
        break
    end
    oldv = v;
end
% n - length(E) eigenvalues remain after the outer subspace is removed
mclogdet = mean(v) + (n - length(E))*log(lam) + sum(log(E));
end

function [V, outereigs] = FindOuterSpace(A)
% Find an invariant subspace containing all eigenvalues of A greater than 1,
% working in blocks of size k.
n = length(A);
k = floor(sqrt(n));
lam = 2;
V = [];
outereigs = [];
opts.issym = 1;
opts.disp = 0;
while lam > 1
    [U, T] = eigs(@(x) ProjectPAP(A, V, x), n, k, 'la', opts);
    lam = T(k, k);                    % smallest eigenvalue found in this block
    V = [V, U];
    outereigs = [outereigs; diag(T)];
end
end

function y = ProjectPAP(A, V, x)
% This multiplies (I - V*V')*A*(I - V*V')*x without forming the projected matrix.
if isempty(V)
    y = A*x;
else
    y = A*(x - V*(V'*x));
    y = y - V*(V'*y);
end
end
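A hypothetical call, using p = 50 as in the earlier experiments and a relative tolerance chosen only for illustration, might look like:

% Compactly supported RBF matrix from (5.2): 30 evenly spaced points, epsilon = .01
x = linspace(0, 1, 30)';
[X1, X2] = meshgrid(x);
Phi = max(1 - 0.01*abs(X1 - X2), 0);
mclogdet = DetAppxVecRBF(Phi, 50, 1e-3);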

