
Matlab Routines for Subset Selection and Ridge Regression in Linear Neural Networks

Mark J. L. Orr
Center for Cognitive Science, Edinburgh University, Scotland, UK

mark@cns.ed.ac.uk
http://www.cns.ed.ac.uk/people/mark.html

August 16, 1996

Contents

Tutorial Introduction
colSum
diagProduct
dupRow
dupCol
forwardSelect
getNextArg
globalRidge
localRidge
overWrite
predictError
rowSum
rbfDesign
traceProduct

Tutorial Introduction
These manual pages are for a set of Matlab routines which implement forward selection and ridge regression (global and local) for linear neural networks (such as radial basis function networks) applied to supervised learning problems (regression, classification and time series prediction). The main routines are forwardSelect, globalRidge, localRidge, predictError and rbfDesign. The other documented routines (e.g. rowSum, traceProduct, etc.) are useful little utilities used in the main routines which are not part of standard Matlab. The rest of this tutorial introduction describes linear networks and illustrates the basic function of the routines by applying a radial basis function network to a toy problem. See the reference pages further on for all the different options offered by the routines. For a detailed but leisurely introduction to radial basis function networks see the world wide web page
http://www.cns.ed.ac.uk/people/mark/intro/intro.html

Supervised Learning
Supervised learning, or what statisticians know as nonparametric regression, is the problem of estimating a function, y(x), given only a training set of input-output pairs, {(x_i, y_i)}_{i=1}^p, sampled, usually with noise, from the function. In practice both the independent (input) variable x and the dependent (output) variable y may be multi-dimensional, which the routines documented below handle without difficulty. However, for simplicity we will assume that both variables are 1-dimensional in the discussion that follows. The toy problem we shall use as an illustrative example is 50 points sampled randomly, with noise, from a sine wave (see Figure 1). The Matlab code is as follows.
p = 50;
x = 1 - rand(1, p);
sigma = 0.2;
y = sin(10 * x)' + sigma * randn(p,1);
hold off
plot(x, y, 'c+')

The true function, y(x) = sin(10x), is shown with a dashed curve in Figure 1. This is the target which we shall try to estimate using only the information provided by the set of sampled points (plus some assumptions about smoothness). The target is plotted by using a dense set of test points, this time sampled without noise.
pt = 500;
xt = linspace(0, 1, pt);
yt = sin(10 * xt)';
hold on
plot(xt, yt, 'b--')

Figure 1: the noisy training points (crosses) and the true target function (dashed curve).

Linear Networks
Linear single-layer neural networks model 1-dimensional functions with an equation of the form

    f(x) = \sum_{j=1}^{m} w_j h_j(x) .

Here w ∈ R^m is the adaptable (trainable) weight vector and {h_j(·)}_{j=1}^m are fixed basis functions (or the transfer functions of the hidden units -- see Figure 2).

Figure 2: a linear network. The hidden unit outputs h_1(x), ..., h_j(x), ..., h_m(x) are weighted by w_1, ..., w_j, ..., w_m and summed to give the output f(x).
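In Matlab, evaluating such a network over a whole batch of inputs reduces to a single matrix-vector product between the design matrix (whose columns hold the basis function responses) and the weight vector. The following is a minimal sketch using made-up centres, radius and weights, with Gaussian basis functions of the kind described in the next section:

% minimal sketch: f(x_i) = sum_j w_j h_j(x_i) for a batch of inputs
% (the centres, radius and weights are arbitrary illustrative values)
x = linspace(0, 1, 100);                 % 100 input points
c = [0.2 0.5 0.8];                       % three centres (assumed)
r = 0.1;                                 % common radius (assumed)
w = [1; -0.5; 2];                        % weight vector (assumed)
H = zeros(length(x), length(c));         % design matrix, H(i,j) = h_j(x_i)
for j = 1:length(c)
    H(:,j) = exp(-(x - c(j)).^2 / r^2)';
end
f = H * w;                               % one network output per input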


Radial Basis Functions


Radial basis functions are one possible choice for the hidden unit activation functions in a linear network. The most distinguishing feature of these functions is that they are local, or at least their response decreases monotonically away from a central point. Strictly speaking, functions whose response increases monotonically away from a central point are also radial basis functions, but because they are not local they are less interesting. A typical radial basis function, centred at c_j and of width or radius r_j, is the Gaussian

    h_j(x) = \exp\left( -\frac{(x - c_j)^2}{r_j^2} \right) .

An example, for c_j = 0.5 and r_j = 0.1, is shown in Figure 3, for which the Matlab code (making use of xt, the input points of the test set above) is
cj = 0.5;
rj = 0.1;
ht = exp(-(xt - cj).^2 / rj^2);
plot(xt, ht, 'g-')

Figure 3: a Gaussian radial basis function centred at 0.5 with radius 0.1.

Other common choices for radial basis functions are the Cauchy function,

    h_j(x) = \frac{r_j^2}{(x - c_j)^2 + r_j^2} ,

the inverse multiquadric,

    h_j(x) = \frac{r_j}{\sqrt{(x - c_j)^2 + r_j^2}} ,

which are both local, and the non-local multiquadric,

    h_j(x) = \frac{\sqrt{(x - c_j)^2 + r_j^2}}{r_j} .
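For a quick visual comparison, the four functions can be plotted on top of each other; the sketch below uses an arbitrarily chosen centre and radius and is not part of the toy problem above.

c = 0.5;  r = 0.2;                       % assumed centre and radius
x = linspace(0, 1, 200);
z = (x - c) / r;                         % scaled distance from the centre
gauss  = exp(-z.^2);                     % Gaussian
cauchy = 1 ./ (1 + z.^2);                % Cauchy
imq    = 1 ./ sqrt(1 + z.^2);            % inverse multiquadric (local)
mq     = sqrt(1 + z.^2);                 % multiquadric (non-local)
plot(x, gauss, x, cauchy, x, imq, x, mq)
legend('Gaussian', 'Cauchy', 'inverse multiquadric', 'multiquadric')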


Least Squares Training


For a training set with p patterns, {(x_i, y_i)}_{i=1}^p, the optimal weight vector can be found by minimising the sum of squared errors

    S = \sum_{i=1}^{p} (f(x_i) - y_i)^2

and is given by

    \hat{w} = (H^\top H)^{-1} H^\top y ,

the so-called normal equation [7]. H is known as the design matrix (H_{ij} = h_j(x_i)) and y = [y_1 y_2 ... y_p]^\top is the p-dimensional vector of training set output values.

The routine rbfDesign can be used to generate the design matrix for a radial basis function network given a set of training set input points {x_i}_{i=1}^p, a set of centre positions {c_j}_{j=1}^m, a set of radii {r_j}_{j=1}^m and a choice of RBF type. For our example we shall try putting a Cauchy RBF on top of each input point, so m = p and c_i = x_i, and make each the same width, r_j = r = 0.1. In that case the Matlab code to generate the design matrix is simply
c = x;
m = p;
r = 0.1;
H = rbfDesign(x, c, r, 'c');

This matrix can then be used to find the optimal weight vector using the normal equation.
w = inv(H' * H) * H' * y;
Warning: Matrix is close to singular or badly scaled.
         Results may be inaccurate. RCOND = 2.016658e-17

Note that Matlab has complained that the matrix H^T H is ill-conditioned, but let's carry on just now and come back to this point later. To obtain the trained network's output over the test set we simply need to multiply the appropriate design matrix by the optimal weight vector.
Ht = rbfDesign(xt, c, r, 'c');
ft = Ht * w;
plot(xt, ft, 'g-')

The result is plotted in Figure 4. The fit (the green solid curve) is very rough and way off target. The reason for this is connected with the ill-conditioning of H^T H and caused by the fact that there are exactly as many unknowns (the weights) as constraints (the training data). Clearly, we have to be much more careful in choosing the hidden units or somehow manage to smooth the output. This is why forward selection and ridge regression are important.

Figure 4: the unregularised fit using all 50 Cauchy basis functions (green solid curve).

Forward Selection
One way to give linear networks the flexibility of non-linear networks (which can adapt their basis functions) is to go through a process of selecting a subset of basis functions from a larger set of candidates. That way the architecture of the network can be adaptable and yet the selectable basis functions can remain fixed and the linear characteristics of the network retained. Thus non-linear gradient descent in the continuous space of the basis function parameters is avoided and replaced by a discrete search in the space of possible basis function subsets. In linear regression theory [7] subset selection is well known and one popular variant is forward selection, in which the model starts empty (m = 0) and basis functions are selected one at a time and added to the network. The basis function to add is the one which reduces the sum of squared errors the most, and the point at which to stop adding new functions is signalled when a suitably chosen estimate of the predicted average error on new data (there are various heuristics to do this) reaches a minimum value and then starts to increase. The routine forwardSelect implements forward selection and allows a choice of error prediction heuristics, including the Bayesian information criterion (BIC) [2].
subset = forwardSelect(H, y, '-t BIC -v');
forwardSelect
pass  add        BIC
   1   39  3.972e-01
   2   10  2.442e-01
   3    2  7.793e-02
   4   31  5.004e-02
   5   47  5.096e-02
   6    9  5.222e-02
minimum passed


The routine takes the design matrix H and the output vector y as inputs and returns a list of numbers indexing the columns of H (the basis functions) selected as part of the subset. It also has an option (-v), which we have made use of, that causes it to print out information as it goes along. The first column gives the iteration number. In each iteration one new basis function is added. The second column is the index of the basis function selected in each iteration. The third column gives the value of BIC. Note how it reaches a minimum on the fourth iteration, so the selected centres will be 39, 10, 2 and 31. To get the output from this greatly reduced network (from 50 down to 4 centres) the appropriate columns of the training set design matrix have to be extracted, the normal equation used to obtain the optimal weight vector and the test set predictions computed on the basis of a cut-down test design matrix.
Hs = H(:, subset);
w = inv(Hs' * Hs) * Hs' * y;
ft = Ht(:, subset) * w;
plot(xt, ft, 'g-')

The fit is shown in Figure 5 (solid green curve) and significantly improves on the fit in Figure 4 where all the basis functions were used.
Figure 5: the fit produced by forward selection with 4 centres (solid green curve).

Since this is a simulated example -- we know the answer at the beginning -- we can measure how well this, and other fits, perform by calculating the mean square error between the network output and the test set outputs.
(ft - yt)' * (ft - yt) / pt
ans =
    0.0151

This score can be compared to ridge regression methods (see below).


The forwardSelect routine has several other options. For example the calculations can be speeded up by invoking orthogonal least squares [1]. Also, ridge regression can be combined with forward selection, with a fixed or variable regularisation parameter, and can lead to improved fits [6].

Global Ridge Regression


To counter the effects of over-fitting, illustrated in Figure 4, a roughness penalty term can be added to the sum of squared errors to produce the cost function

    C = \sum_{i=1}^{p} (f(x_i) - y_i)^2 + \lambda \sum_{j=1}^{m} w_j^2 ,

minimisation of which can be used to find a weight vector which is more robust to noise in the training set. This is called ridge regression [4] or weight-decay [3] and involves a single regularisation parameter (λ) to control the trade-off between fitting the data and penalising large weights. It can be generalised to multiple parameters, one for each basis function (see below), which leads to local adaptation of the smoothing effect when the basis functions are local (as in RBF networks). Thus there are two forms of ridge regression: global (one λ) and local (m λ's). The optimal weight vector for global ridge regression is

    \hat{w} = (H^\top H + \lambda I_m)^{-1} H^\top y ,

where I_m is the m-dimensional identity matrix. We could just guess a value to use for the global regularisation parameter, say λ = 1 × 10^{-4}. Then for our example problem

lambda = 1e-4;
w = inv(H' * H + lambda * eye(m)) * H' * y;
ft = Ht * w;

gives the predicted test set outputs, plotted in Figure 6.


Figure 6: the global ridge regression fit with the guessed value λ = 10^{-4}.

The mean square error is
(ft - yt)' * (ft - yt) / pt
ans =
    0.0200

which is over 30% worse than the score for forward selection (above). However, much better results can be achieved by actively searching for the optimal parameter using globalRidge, which employs a re-estimation formula [6] to optimise a choice of model selection criterion [2]: unbiased estimate of variance (UEV), final prediction error (FPE), generalised cross-validation (GCV) or Bayesian information criterion (BIC). The input arguments are the design matrix, the training set outputs, an initial guess for λ and the choice of model selection criterion. If we once again use BIC for the example problem we obtain, starting from a guess of λ = 0.1,
lambda = globalRidge(H, y, 0.1, 'BIC -v')
globalRidge
pass     lambda        BIC   change
   0  1.000e-01  6.644e-02
   1  3.085e-01  6.187e-02       15
   2  1.032e+00  5.874e-02       20
   3  1.058e+00  5.874e-02    42952
relative threshold in BIC crossed
lambda =
    1.0582

The verbose option (-v) has been exercised to show details of the iterations performed by the algorithm. The first column gives the iteration or pass number (with the 0-th iteration corresponding to the starting values). The second column gives the current estimate for λ. The third column is the model selection score, in this case BIC. The fourth column is the relative change in the model selection score from one iteration to the next (expressed in units of 1 part in n). The algorithm terminates when n > 1000 (there are options for adjusting this threshold or using absolute instead of relative change). Using the optimal value, λ = 1.06, leads to the test set fit
w = inv(H' * H + lambda * eye(m)) * H' * y;
ft = Ht * w;
plot(xt, ft, 'g-')

shown in Figure 7 which has a mean square error


(ft - yt)' * (ft - yt) / pt
ans =
    0.0093

nearly 40% lower than the forward selection score.

Figure 7: the global ridge regression fit with the optimised value λ = 1.06.

The routine predictError can be used for explicit calculations of error prediction estimates. We can use it to demonstrate that globalRidge really did find the minimum BIC as follows (see Figure 8).
lambdas = logspace(-4, 2, 100);
bics = zeros(1, 100);
for i = 1:100
    bics(i) = predictError(H, y, lambdas(i), 'BIC');
end
bic = predictError(H, y, lambda, 'BIC');
hold off
loglog(lambdas, bics, 'm-')
hold on
loglog(lambda, bic, 'r*')

Figure 8: BIC as a function of λ (log-log axes), with the minimum found by globalRidge marked by a star.

Local Ridge Regression

A generalisation of standard ridge regression is to attach a separate regularisation parameter to each basis function by using the cost

    C = \sum_{i=1}^{p} (f(x_i) - y_i)^2 + \sum_{j=1}^{m} \lambda_j w_j^2 ,

leading to an optimal weight of

    \hat{w} = (H^\top H + \Lambda)^{-1} H^\top y ,

where Λ_{jk} = λ_j δ_{jk}. If the basis functions are local (as in RBF networks) adjusting each λ_j has a local effect, providing a mechanism for the amount of smoothing to adapt to local conditions [5]. It is for this reason that the terms local and global are used to distinguish the two forms of ridge regression. The routine localRidge is provided to search for the optimal set of regularisation parameter values by minimising either the unbiased estimate of variance (UEV) or generalised cross-validation (GCV) [5]. The other model selection criteria (FPE, BIC and LOO) have not been implemented (yet) as there are some technical difficulties still to be resolved. The input arguments are, as usual, the design matrix, the training set output values and a guess for the initial values of the regularisation parameters. The guess can either be a list of m values (giving the starting values for each λ_j) or a single number (to which each λ_j is initially set). The list can contain entries of ∞ (in Matlab, Inf) which is equivalent to initially removing the corresponding basis functions from the network (of course, the algorithm may decide to reinstate them). The answer from localRidge is in the form of a new list, possibly with some entries of ∞, which gives a local minimum of the model selection criterion near to the initial guess in m-dimensional space. Different initial guesses may result in different local minima. When tried out on our example problem with all the parameters initially set to the global minimum (λ = 1.06, see above), only 11 of the 50 basis functions were retained in the final network.
lambdas = localRidge(H, y, lambda, '-v');
localRidge
pass   in  out        GCV   change
   0   50    0  4.721e-02
   1   18   32  4.075e-02        7
   2   15   35  4.023e-02       78
   3   14   36  4.012e-02      361
   4   11   39  4.005e-02      596
   5   11   39  4.003e-02     2574
relative threshold crossed
finite = find(lambdas ~= Inf)
finite =
     1     2     9    10    16    18    20    25    36    41    48


Again we have used the -v option to get the routine to print out what it is doing. The first column is the iteration number. In one iteration each parameter is optimised once. Multiple iterations are required to converge on a minimum because each individual optimisation affects the optimal values for all the other parameters. The second and third columns give the current number of finite (in the network) and infinite (out of the network) parameter values. The fourth column is the running value of the model selection score (in this case GCV) and the last column is the relative change in its value from one iteration to the next (in units of 1 part in n). The default termination criterion is n > 1000. Options exist to change this threshold and use absolute rather than relative change. To work out the test set predictions the basis functions corresponding to infinite values of λ_j have to be removed from both the training set and test set design matrices.
lambdas = lambdas(finite);
Hl = H(:, finite);
w = inv(Hl' * Hl + diag(lambdas)) * Hl' * y;
ft = Ht(:, finite) * w;
plot(xt, ft, 'g-')

The fit is shown in Figure 9 (the solid green curve) and its mean square error compared to the true target (blue dashed curve) is
(ft - yt)' * (ft - yt) / pt
ans =
    0.0142

This is a whole 50% worse than global ridge regression and only a modest improvement on forward selection. There is no guarantee that local will perform any better than global ridge regression on any particular data set. However, on average, and especially for target functions with inhomogeneous structure (different length scales in different regions), localRidge is expected to win out over globalRidge.
Figure 9: the local ridge regression fit (solid green curve).


References
[1] S. Chen, C.F.N. Cowan, and P.M. Grant. Orthogonal least squares learning for radial basis function networks. IEEE Transactions on Neural Networks, 2(2):302-309, 1991.
[2] B. Efron and R.J. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall, 1993.
[3] J. Hertz, A. Krogh, and R.G. Palmer. Introduction to the Theory of Neural Computation. Addison Wesley, Redwood City, CA, 1991.
[4] A.E. Hoerl and R.W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(3):55-67, 1970.
[5] M.J.L. Orr. Local smoothing of radial basis function networks. In International Symposium on Artificial Neural Networks, Hsinchu, Taiwan, 1995.
[6] M.J.L. Orr. Regularisation in the selection of radial basis function centres. Neural Computation, 7(3):606-623, 1995.
[7] J.O. Rawlings. Applied Regression Analysis. Wadsworth & Brooks/Cole, Pacific Grove, CA, 1988.


colSum

Purpose
Sums the columns of a matrix.

Synopsis
v = colSum(M)

Description
Returns a row vector v whose entries are the sums of each column of M. Gives the same result as the Matlab routine sum except when M is a row vector, in which case sum adds the elements together (returning a scalar) but colSum returns the vector unaltered.
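The behaviour just described amounts to a one-line special case on top of sum; a minimal sketch of an equivalent implementation (not necessarily the actual M-file) is:

function v = colSum(M)
% behave like sum, except that a row vector is passed through unaltered
if size(M, 1) == 1
    v = M;
else
    v = sum(M);
end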

Example
No difference between colSum and sum when the input is a matrix.
X = [1 2 3; 2 3 5];
colSum(X)
ans =
     3     5     8
sum(X)
ans =
     3     5     8

The difference comes when the input is a row vector.


X = [1 2 3];
colSum(X)
ans =
     1     2     3
sum(X)
ans =
     6

See Also
rowSum.


diagProduct

Purpose
Calculates the diagonal of the product of two matrices.

Synopsis
d = diagProduct(X, Y)

Description
Calculating
d = diag(X * Y);

can be wasteful, especially if the matrices are large. diagProduct achieves the same calculation quicker. The speed-up factor is approximately equal to the number of columns of X (or the number of rows of Y).
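The saving comes from a simple identity: the i-th diagonal element of X*Y is the dot product of row i of X with column i of Y, so the full product never has to be formed. A sketch of the idea (assuming X and Y are conformable; not necessarily the actual implementation):

% diag(X*Y) via elementwise products only
d = sum(X .* Y', 2);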

Example
Do it the hard way.
X = rand(100,100);
flops(0)
d = diag(X * X);
flops
ans =
     2000000

Save time.
flops(0)
d = diagProduct(X, X);
flops
ans =
       19900

See Also
traceProduct.


dupRow

Purpose
Creates a matrix with identical rows.

Synopsis
M = dupRow(v, m)

Description
Returns a matrix with m rows, each one a copy of the row vector v.
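A minimal sketch of an equivalent implementation (the actual M-file may differ) uses indexing to repeat the row:

function M = dupRow(v, m)
% repeat the row vector v as m identical rows
M = v(ones(m, 1), :);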

Example
dupRow([1 2], 3)
ans =
     1     2
     1     2
     1     2

See Also
dupCol.


dupCol

Purpose
Creates a matrix with identical columns.

Synopsis
M = dupCol(v, n)

Description
Returns a matrix with n columns, each one a copy of the column vector v.
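As with dupRow, a minimal sketch of an equivalent implementation (the actual M-file may differ):

function M = dupCol(v, n)
% repeat the column vector v as n identical columns
M = v(:, ones(1, n));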

Example
dupCol([1; 2], 3)
ans =
     1     1     1
     2     2     2

See Also
dupRow.


forwardSelect

Purpose
Performs forward selection, a standard type of subset selection [3], where basis functions (or hidden units) are added to the model (or network) one at a time, starting with none and ending when a suitable criterion has been satisfied. There is an option to perform ridge regression with either a fixed or variable regularisation constant and an option to speed up the calculations using orthogonal least squares [2].

Synopsis
[subset, H, l, U, A, w, P] = forwardSelect(F, Y, options)

Description
This algorithm builds a regression model from a set of candidate basis functions by starting with none and adding new basis functions one at a time until no more can be usefully added. At each step during the forward selection process a new column from F, the full design matrix, must be chosen to add to the current design matrix, H_m. If f_k is chosen then the new design will be H_{m+1} = [H_m f_k]. The new column chosen is the one with the minimum cost

    C(f_k) = (H_{m+1} \hat{w}_{m+1} - y)^\top (H_{m+1} \hat{w}_{m+1} - y) + \lambda\, \hat{w}_{m+1}^\top U_{m+1}^\top U_{m+1} \hat{w}_{m+1} .    (1)

This cost is composed of the familiar sum of squared errors term on the left and an optional regularisation term on the right (see below). y is the vector of training set outputs and \hat{w}_{m+1} is the optimal weight (the weight which, for the given model, minimises the cost). For purposes of illustration we assume that the training set outputs are one-dimensional -- otherwise y and \hat{w}_{m+1} would be matrices and each of the two terms on the right of the above equation would have to be replaced by their traces. The selectable basis functions are supplied by the input argument F, which is the full design matrix corresponding to the case when all the basis functions are included in the model. The chosen model is returned as a list indexing the columns of F corresponding to the chosen subset. The only other mandatory input argument is Y, the matrix of training set outputs (one output on each row). A minimal call to the routine is therefore
subset = forwardSelect(H, Y);

but this implies several default options and does not take advantage of some other potentially useful quantities returned by forwardSelect. The options available, the other output variables and the basic operation of the algorithm are described below.

Termination: -t VAR|UEV|FPE|GCV|LOO [value]

Various criteria can be employed to decide when the model has grown sufficiently and all but one work by detecting when a minimum has occurred in an estimate of the predicted error on future test sets. The exception is where a threshold is put on the training set variance explained by the model. The total variance is y^\top y and the explained variance is \hat{w}_m^\top H_m^\top H_m \hat{w}_m, so if the threshold is ρ (0 < ρ < 1) then termination occurs when

    \frac{\hat{w}_m^\top H_m^\top H_m \hat{w}_m}{y^\top y} > \rho

first becomes true. To specify this method, along with a threshold of, say, 0.95, include the substring -t VAR 0.95 in the input argument options. If the optional value is omitted the default is ρ = 0.9. The other methods consist of various heuristics to estimate the likely error on future test sets and terminate the network growth when a minimum occurs. The available heuristics are: unbiased estimate of variance (UEV), final prediction error (FPE), generalised cross-validation (GCV), Bayesian information criterion (BIC) and leave-one-out cross-validation (LOO) [1]. The predicted error is calculated using one of these methods after each new basis function is added. To avoid the problem of temporary local minima (the predicted error decreases to a shallow minimum, then increases and then decreases again below the level of the minimum) the minimum must persist for a number of iterations after it is first reached in order to cause the algorithm to terminate. This number is controlled by the optional value part of the options string. For example, to invoke leave-one-out cross-validation and require terminating minima to persist for at least five iterations insert the substring -t LOO 5 in options. By default, if the termination criterion is not given in the options string, the method used is generalised cross-validation and the minima must persist for at least 2 iterations.
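A sketch of the quantity tested by the VAR criterion (the variable names Hm, wm and y are assumptions standing for the current design matrix, the current optimal weights and the training outputs):

explained = (wm' * (Hm' * Hm) * wm) / (y' * y);   % fraction of variance explained
stop = explained > 0.95;                          % e.g. a threshold of rho = 0.95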

Maximum number of basis functions: -m value

The size of the network can be restricted directly by specifying the maximum number of basis functions. For example, to limit the size to 20 include the substring -m 20 in options. By default, there is no limit (other than the number of columns in the full design matrix).

Fixed regularisation: -g [value]

By default there is no regularisation. To invoke ridge regression (regularisation by penalising the sum of squared weights) with a fixed value for λ (the regularisation parameter) of, say, 0.001, use the option string -g 0.001. The default value, if none is specified, is λ = 0.01, so -g 0.01 is equivalent to just -g on its own.

Re-estimation: -r [UEV|FPE|GCV|BIC] [value]

Alternatively, the value of λ can be adjusted after each basis function is added using a re-estimation formula (given in [2]) which is designed to minimise GCV. By default re-estimation is not done; to enable it put -r in options. Other error prediction heuristics (UEV, FPE and BIC) can also be used and lead to similar formulae. To invoke re-estimation using, e.g. FPE and starting from a value of λ = 1, the options string should contain -r FPE 1. By default GCV is used and the starting value is λ = 0.1. This means -r GCV 0.1, -r GCV, -r 0.1 and -r are all equivalent option strings.

Orthogonal least squares: OLS

Orthogonal least squares [2] is an efficient method of performing forward selection based on orthogonalisation of the columns of the design matrix. To invoke this option simply include the string OLS in options. When not using ridge regression (λ = 0) it is always worth using OLS because it makes no difference to the results but speeds up the computations. However, when λ > 0, there is unfortunately a difference between the standard method and OLS. This has to do with the nature of the regularisation term in the cost (1). The relationship between the actual design H and the orthogonalised design H̃ is

    H_m = \tilde{H}_m U_m ,

where U_m is an m × m upper triangular matrix. When OLS is invoked the covariance matrix A_m^{-1} is not the standard

    A_m^{-1} = (H_m^\top H_m + \lambda I_m)^{-1}

but instead is

    A_m^{-1} = (H_m^\top H_m + \lambda U_m^\top U_m)^{-1} = U_m^{-1} (\tilde{H}_m^\top \tilde{H}_m + \lambda I_m)^{-1} (U_m^\top)^{-1} .

This is because, due to the relationship between H and H̃, only the latter covariance can be factored such that the inverted matrix is diagonal (which is the point of performing orthogonalisation since such inversions are extremely quick). The consequence is that when OLS is not used U_m = I_m in (1), but when OLS is employed U_m is no longer the identity matrix.

Verbose output: -v, -V
To monitor the performance of the algorithm there are two options which cause the routine to print out messages as it runs. The -v option causes the iteration or pass number, the number of the column of the full design matrix added in that pass and the relative cost (equation (1) divided by y^\top y) to be printed each time a new basis function is added. If any of the error prediction heuristics are being used for termination then their running value is also printed. If re-estimation is turned on the running value of λ is also printed. An indication of what caused termination is printed at the end. Extra verbosity is obtained by use of the option -V which causes cumulative computational costs (in number of floating point operations) to be printed in addition to the normal verbose output.


Returned values
The routine returns a list of integers, subset, indexing the chosen columns of the full design matrix. Several other potentially useful quantities are also returned: H, l, U, A, w and P. H is the chosen design matrix, H_m, assembled from the selected columns of the full design matrix (H = F(:,subset)). The value of the regularisation constant, λ, is returned in l. This will be zero if neither the -g option nor the -r option has been used, and will be the same as the fixed value under the -g option. In the case of re-estimation the value returned is the last re-estimated value before termination. U is the upper triangular matrix, U_m, used to orthogonalise the columns of H_m in OLS. If OLS is not used, U just contains the identity matrix. The covariance is returned in A and is the matrix

    A_m^{-1} = (H_m^\top H_m + \lambda U_m^\top U_m)^{-1} .

The optimal weight in w is given by

    \hat{w}_m = A_m^{-1} H_m^\top y ,

and P contains the projection matrix P_m, where

    P_m = I_p - H_m A_m^{-1} H_m^\top .

Example
As our example we'll try to learn the function

    f(x) = 1 + (1 - x + 2x^2)\, e^{-x^2} ,

from 100 randomly sampled points in the range -4 < x < 4 corrupted by Gaussian noise of standard deviation 0.2. First generate a training set and plot it (the green circles in the figure).

p = 100;
x = 4 - 8 * rand(1,p);
y = 1 + ((1-x+2*x.^2) .* exp(-x.^2))' + 0.2 * randn(p,1);
plot(x, y, 'go')

To plot the fits later and the true function now (the dotted curve in the figure) we shall need a test set.

pt = 500;
xt = linspace(-4,4,pt);
yt = 1 + ((1 - xt + 2 * xt.^2) .* exp(-xt.^2))';
plot(xt, yt, 'w:')

To learn the function we shall use a radial basis function network with centres coincident with the training examples. As it is not clear what size to make the basis functions, we shall place three different sized functions -- small, medium and large -- at each position and let forwardSelect choose the best ones to keep.

Figure: the training data (green circles), the target function (dotted curve), the plain forward selection fit (blue dashed) and the OLS/re-estimation fit (red solid).

c = [x x x];
r = [0.5 * ones(1,p) 1.0 * ones(1,p) 1.5 * ones(1,p)];
F = rbfDesign(x, c, r);

Now let us try a vanilla flavour version of forwardSelect, plain forward selection without any regularisation, re-estimation or orthogonalisation.
subset = forwardSelect(F, y);

Assemble the design matrix from the columns of F, compute the covariance and hence the optimal weight vector.
H = F(:,subset);
A = inv(H' * H);
w = A * H' * y;

Now form the design matrix for the test set (using only the selected centres and their radii) and compute and plot the learned function over it (the blue dashed curve in the figure).
Ht = rbfDesign(xt, c(subset), r(subset));
ft = Ht * w;
plot(xt, ft, 'b--')

Compute the mean square error over the test set.


(ft - yt)' * (ft - yt) / pt
ans =
    0.0084


For comparison let us now try a bells and whistles version. This time we shall let forwardSelect do the work of assembling the design matrix and computing the weight vector. We shall also use OLS to speed things up, re-estimate the regularisation parameter (using UEV, starting from λ = 1 × 10^{-6}) and use leave-one-out cross-validation (instead of the default GCV) for termination, with the proviso that the minimum should persist for at least 4 iterations.
[subset, H, l, U, A, w] = ...
    forwardSelect(F, y, 'OLS -r UEV 1e-6 -t LOO 4');

Compute and plot the test set fit (the red solid curve in the figure).
Ht = rbfDesign(xt, c(subset), r(subset));
ft = Ht * w;
plot(xt, ft, 'r-')

For comparison with the vanilla flavour attempt, compute the mean square error.
(ft - yt)' * (ft - yt) / pt
ans =
    0.0075

See Also
globalRidge, localRidge.

References
[1] B. Efron and R.J. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall, 1993.
[2] M.J.L. Orr. Regularisation in the selection of radial basis function centres. Neural Computation, 7(3):606-623, 1995.
[3] J.O. Rawlings. Applied Regression Analysis. Wadsworth & Brooks/Cole, Pacific Grove, CA, 1988.


getNextArg

Purpose
Repeated application of this routine splits a string into a sequence of space or comma separated sub-strings. It is used for cycling through a string of options.

Synopsis
[arg, i] = getNextArg(s, i)

Description
The input variable i tells the routine where to start looking in the string s. The next word is returned in arg and the index i is updated to point to the position after the word. One or more spaces and/or commas create the word boundaries. If i points to a position off the end of the string, arg is set to the empty string and i is not updated.

Example
s = 'Hello world';
i = 1;
[arg, i] = getNextArg(s, i)
arg =
Hello
i =
     6
[arg, i] = getNextArg(s, i)
arg =
world
i =
    12
[arg, i] = getNextArg(s, i)
arg =
''
i =
    12


globalRidge

Purpose
Estimates the optimal ridge regression parameter for linear feed-forward networks in supervised learning problems. The optimisation is performed using one (or more) initial guess(es) and a choice of method for predicting the error on future test sets.

Synopsis
[l, e, L, E] = globalRidge(H, Y, l, options, U)

Description
In ridge regularisation for linear networks the cost function is made up of the usual sum-squared-error plus a roughness penalty dependent on the squared length of the weight vector and a positive regularisation parameter, λ. Assuming 1-dimensional output (for simplicity) this cost is

    C = (H w - y)^\top (H w - y) + \lambda\, w^\top U^\top U w ,

where H is the design, w the weight vector, y the vector of training set output values and U the penalty metric. Various model selection criteria, which attempt to predict the mean squared error on a future test set, can be minimised to find the optimal value of λ, including unbiased estimate of variance (UEV), final prediction error (FPE), generalised cross-validation (GCV) and Bayesian information criterion (BIC) [1]. Such a scheme is implemented by globalRidge using an iterative re-estimation technique [2]. The design matrix, with p rows (one for each training point) and m columns (one for each basis function), is supplied in the variable H. The output training points are supplied in the matrix Y with p rows and n columns (n being the dimension of the output space). The guess or guesses for the optimal value of λ are input via the vector l. The main reason for having multiple guesses is the possibility of local minima in the criteria and therefore of different results from different starting points.

Prediction method: UEV|FPE|GCV|BIC

The choice of which method to use for predicting future error is controlled by a three letter code in the options string. The choice is: unbiased estimate of variance (UEV), final prediction error (FPE), generalised cross-validation (GCV) or Bayesian information criterion (BIC). The default, if none is specified, is GCV.

Termination: -t value

The iterations are halted by a termination criterion controlled by the value specified with the -t option. If this value is greater than unity then the iterations halt when the change in predicted error from one iteration to the next is less than one part in value. Alternatively, if the value is less than unity then the iterations terminate when the absolute change in predicted error is less than value. The default, which applies when none is specified in the options string, is 1000 (i.e. termination occurs when the relative change is less than 1 part in 1000).
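For illustration, a sketch of the two termination modes (H, y and the initial guess are assumed to exist already):

lambda = globalRidge(H, y, 0.1, 'GCV -t 5000');   % relative: stop when change < 1 part in 5000
lambda = globalRidge(H, y, 0.1, 'GCV -t 1e-4');   % absolute: stop when change < 1e-4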

Maximum number of iterations: -h value

A hard limit on the maximum number of iterations can be set via the -h option, whose default value is 100.

Non-standard penalty matrix: -U

Usually the size of the penalty term is determined by the Euclidean length of w and the matrix U, which appears in the cost function (see above), is just the identity. If a non-standard metric is desired then the string -U should be inserted into options. In that case globalRidge will expect a fifth argument, U, specifying an m-by-m metric.

Verbose output: -v, -V


To monitor the performance of the algorithm there are two options which cause the routine to print out messages as it runs. The -v option causes the iteration (or pass) number, the value of λ, the value of the criterion (UEV, FPE, GCV or BIC) and its change (relative or absolute) to be printed at the end of each iteration (and once at a nominal pass number of zero to indicate the starting values). -V has the same effect except the cumulative computational cost (number of floating point operations) is also printed.

Returned variables
The optimal value of λ and the corresponding predicted error are returned in l and e respectively. If multiple guesses are made -- the input l is a vector -- then the output l and e are also vectors of the same length giving the final values for each guess. The other outputs can be used to examine the evolution of λ and predicted error during the iterative optimisation algorithm. L lists successive values of λ from the initial guess to the final estimate and E contains the corresponding predicted errors. If multiple guesses are made L and E are matrices, each column corresponding to one optimisation (the initial guess in the top row, the final estimate in the last nonzero row and a succession of intermediate values in between). The number of rows in L and E is equal to the length of the sequence which requires the greatest number of iterations to converge. The other sequences, being shorter, are padded with zeros.


Example
100 points were sampled randomly from the function

    y = \frac{\sin(10 \|x\|)}{10 \|x\|}

in the square box -1 < x_1 < 1 and -1 < x_2 < 1, and noise of standard deviation 0.1 added to simulate a noisy training set.

X = ones(2,100) - 2 * rand(2,100);
r = 10 * sqrt(sum(X.^2));
y = (sin(r) ./ (r + eps))' + 0.1 * randn(100,1);

This training set was used to construct the design matrix of a radial basis function network with one Gaussian centre of width 0.3 placed at each training set input point.
C = X;
r = 0.3;
H = rbfDesign(X, C, r, 'g');

Figure: (a) the test set target, (b) the fit with the optimised λ, (c) the over-fitted solution with λ too small, (d) the over-smoothed solution with λ too large.


A test set was constructed on a 20 × 20 grid and drawn in plot (a) of the figure.
x = linspace(-1, 1, 20);
[X1, X2] = meshgrid(x);
Xt = [X1(:) X2(:)]';
Ht = rbfDesign(Xt, C, r, 'g');
rt = 10 * sqrt(sum(Xt.^2));
yt = (sin(rt) ./ (rt + eps))';
Yt = zeros(20,20);
Yt(:) = yt;
subplot(221)
surfl(x, x, Yt);

From an initial guess of λ = 1, globalRidge was used to find the minimum predicted error by the Bayesian information criterion and a threshold of 10000.
l = globalRidge(H, y, 1, 'BIC -t 10000')
l =
    0.2114

This value was used to train the network and predict its output over the test set.
A = inv(H' * H + l * eye(100));   % 100 basis functions, one per training point
w = A * (H' * y);
ft = Ht * w;
Ft = zeros(20,20);
Ft(:) = ft;
subplot(222)
surfl(x, x, Ft);

The resulting fit using this optimal value for λ is shown in plot (b) of the figure, and is reasonably accurate. In contrast, the fit in plot (c) is the result of deliberately using too low a value of λ (10^{-4}) and clearly shows signs of over-fitting. Likewise, too high a value of λ (100) has been used to produce the fit in plot (d), which has been practically flattened. The second figure shows three different error predictions as a function of λ (see predictError). The red solid curve shows the true error (known from the true target). The dotted magenta curve is BIC, with the local minimum found by globalRidge marked with a star. The two other curves are GCV (green dash) and LOO (leave-one-out cross-validation; blue dash-dot). Note that the true minimum is at a somewhat smaller value (about λ = 0.04) than any of the nearby predicted minima. This illustrates a general point, which is that there is no guarantee that the true minimum will be found precisely using these methods, which are essentially heuristic in nature. There is also no consensus amongst statisticians over which method (FPE, GCV, BIC, LOO or others) is likely to give the best results (UEV is, however, known to be unreliable).

Figure: the true error (red solid), BIC (magenta dotted), GCV (green dashed) and LOO (blue dash-dot) as functions of λ (log-log axes), with the BIC minimum found by globalRidge marked with a star.

See Also
forwardSelect, localRidge, predictError.

References
[1] B. Efron and R.J. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall, 1993.
[2] M.J.L. Orr. Local smoothing of radial basis function networks. In International Symposium on Artificial Neural Networks, Hsinchu, Taiwan, 1995.


localRidge

Purpose
Estimates the optimal ridge regression parameters for linear feed-forward networks in supervised learning problems. With radial basis functions, because of their local nature, ridge regression with multiple adjustable regularisation parameters performs a kind of locally adaptive smoothing. The optimisation is performed using either unbiased estimate of variance or generalised cross-validation and needs an initial guess for the values of the parameters. Parameters in both the initial guesses and the final optimal estimates whose value is infinity (Inf) correspond to pruned basis functions.

Synopsis
[L, V, H, A, W, P] = localRidge(F, Y, L, options)

Description
In standard ridge regression for smoothing linear networks the cost function is made up of the usual sum-squared-error plus a roughness penalty dependent on the squared length of the weight vector and a positive regularisation parameter, λ:

    C(w, \lambda) = \sum_{i=1}^{p} (f(x_i; w) - y_i)^2 + \lambda \sum_{j=1}^{m} w_j^2 ,

where f(x; w) is the output of the network for input x and weight vector w and {(x_i, y_i)}_{i=1}^p is the training set. A generalisation of this to the case of multiple regularisation parameters, λ = [λ_1 λ_2 ... λ_m]^\top (one for each basis function), is

    C(w, \lambda) = \sum_{i=1}^{p} (f(x_i; w) - y_i)^2 + \sum_{j=1}^{m} \lambda_j w_j^2 .

Since, at least in the case of local basis functions (e.g. radial), this has the effect of producing locally adaptive smoothing, the technique has been called local regularisation or local ridge regression [2], hence the name of this routine. Various model selection criteria which attempt to predict the mean squared error on a future test set [1] can (at least in principle) be minimised to find the optimal value of λ. An analysis of the case of generalised cross-validation (GCV) is presented in [2]. Unbiased estimate of variance (UEV) is also reasonably tractable to analyse but final prediction error (FPE), Bayesian information criterion (BIC) and leave-one-out cross-validation (LOO) are more difficult and have not yet been implemented for this routine. There are two mandatory inputs, F and Y. The full design matrix, with p rows (one for each training point) and M columns (one for each basis function), is supplied in the variable F while Y, which has p rows and n columns, contains the output training points (n is the dimension of the output space). The optional arguments are as follows.


Initial guess for λ


The initial guesses for the M components of λ are handed in via L. The latter can either be an M-dimensional vector, a scalar (if each initial λ_j is the same) or omitted altogether, in which case it has a default value of 0.1 (i.e. all the λ_j start at a value of 0.1).

Random ordering: -r

The algorithm sweeps through the regularisation parameters optimising one at a time. The order in which they are tackled is either the same as the ordering of the columns of F or random, the former by default. If random ordering is required then -r should appear somewhere in the options string.

Termination: -t UEV|GCV [value]

The iterations (one iteration consists of an entire optimisation sweep through all the regularisation parameters) are halted when the model selection criterion, either GCV or UEV, converges (its changes become small enough). To change from the default GCV to UEV include the string -t UEV in options. Convergence happens in one of two ways. Either the criterion changes value by less than some small absolute amount or it changes its relative size (compared to the previous iteration) by less than one part in a given number. The optional value accompanying the -t option controls the termination method. If the value is less than one it is interpreted as a small absolute threshold. On the other hand, if it is greater than 1 it is interpreted as meaning one-part-in-value for the change of relative size. Its default value is 1000, i.e. the iterations halt when GCV (or UEV) changes by less than one part in 1000. Unlike the routine forwardSelect, where optimisation of the single regularisation parameter and termination of the subset selection process can be handled by different criteria, the termination and optimisation in localRidge use the same one (either both UEV or both GCV). Since each optimisation step (in which one parameter is optimised) is guaranteed to either decrease the model selection criterion or leave it unchanged, minima are never passed. This is why convergence, and not the detection of minima (as in forwardSelect), is used to terminate the iterations.

Maximum number of iterations: -h value

To avoid the possibility of endless iterations if the termination criteria are weak a hard limit to the total number of iterations can be set with the -h option with its associated positive value. By default there is no limit.

Verbose output: -v, -V


To monitor the performance of the algorithm there are two options which cause the routine to print out messages as it runs. The -v option causes the iteration number, the number of active columns (finite λ_j values), the number of pruned basis functions (infinite λ_j values), the running value of the criterion (UEV or GCV) and the change (relative or absolute) in the criterion to be printed at the end of each iteration (and once at a nominal iteration number of zero to indicate the starting values).

Returned values
The routine returns a list of optimised regularisation parameters L plus some further quantities which, although they could be derived using L together with the original full design F and training outputs Y, have been worked out by the routine anyway and save having to recompute them. These are V, the final value of the criterion (UEV or GCV), H, the design matrix, A, the covariance, W, the weight matrix and P, the projection matrix. Here is how these quantities are related to F, L and Y.
subset = find(L ~= Inf);
H = F(:,subset);
A = inv(H' * H + diag(L(subset)));
W = A * H' * Y;
P = eye(p) - H * A * H';
V = trace(Y' * P^2 * Y) / trace(P);              % UEV
V = p * trace(Y' * P^2 * Y) / (trace(P))^2;      % GCV

Example
The example target function is

    f(x) = 1 + (1 - x + 2x^2)\, e^{-x^2} ,

from which 100 points, randomly sampled in the range -4 < x < 4 and corrupted by Gaussian noise of standard deviation 0.2, are drawn. This is the same problem used for forwardSelect. First generate a training set and plot it (the green circles in the figure).

p = 100;
x = 4 - 8 * rand(1,p);
y = 1 + ((1 - x + 2 * x.^2) .* exp(-x.^2))';
y = y + 0.2 * randn(p,1);
plot(x, y, 'go')

To plot the fits later and the true function now (the dotted curve in the figure) generate a test set.

pt = 500;
xt = linspace(-4,4,pt);
yt = 1 + ((1 - xt + 2 * xt.^2) .* exp(-xt.^2))';
plot(xt, yt, 'w:')

Fit the function with a radial basis function network whose centres are coincident with the training examples. As it is not clear what size to make the basis functions, place three different sized functions -- small, medium and large -- at each input position and let localRidge throw away any it doesn't want by optimising their regularisation parameters to ∞.

c = [x x x];
r = [0.5 * ones(1,p) 1.0 * ones(1,p) 1.5 * ones(1,p)];
F = rbfDesign(x, c, r);
Ft = rbfDesign(xt, c, r);

Figure: the training data (green circles), the target (dotted curve), the cold-start localRidge fit (red dashed), the forwardSelect fit (blue solid) and the combined forwardSelect + localRidge fit (red solid).

Now invoke localRidge, with termination when GCV changes by less than 1 part in 10000,
[l, V, H, A, w, P] = localRidge(F, y, 0.01, '-t 10000');

and plot the fit to the test set (the dashed red line in the figure).

subset = find(l ~= Inf);
Ht = Ft(:,subset);
ft = Ht * w;
plot(xt, ft, 'r--')

For comparison with other fits, work out the mean square error between this fit and the true function.

(ft - yt)' * (ft - yt) / pt
ans =
    0.0086

The optimisation process that localRidge performs is highly non-linear. Slightly different initial conditions or a change in the order in which individual regularisation parameters are optimised could result in a different local minimum (in GCV or UEV) being reached. It may be more efficacious to manipulate the initial conditions so that the problem is half solved before localRidge begins. One way is to perform forward selection first and then refine the network thus produced using localRidge. The illustrative example benefits from this approach, as will now be shown. First call forwardSelect with global re-estimation turned on. This will give a reasonable estimate of the initial value to use for the regularisation parameters.
[subset, H, l, U, A, w] = forwardSelect(F, y, '-r');

Calculate and plot the forwardSelect fit (see the blue solid curve in the figure).

Ht = Ft(:, subset);
ft = Ht * w;
plot(xt, ft, 'b-')

How good is the fit?

(ft - yt)' * (ft - yt) / pt
ans =
    0.0075

This has actually done better than localRidge from a cold start. The subset and regularisation parameter chosen by forwardSelect can now be used to provide a head start for localRidge as follows. First construct the initial set of parameters implied by the results of forwardSelect.
M = size(F, 2);                           % number of candidate basis functions
L = Inf * ones(1,M);
L(subset) = l * ones(1,length(subset));

Use these in a call to localRidge.


[L, V, H, A, w] = localRidge(F, y, L, '-t 10000');

Plot the fit (the solid red line in the figure).

subset = find(L ~= Inf);
Ht = Ft(:,subset);
ft = Ht * w;
plot(xt, ft, 'r-')

and work out the mean square error as before.


(ft - yt)' * (ft - yt) / pt
ans =
    0.0067

This score, the result of using forwardSelect and localRidge in combination, is better than either method used on its own.


References
[1] B. Efron and R.J. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall, 1993.
[2] M.J.L. Orr. Local smoothing of radial basis function networks. In International Symposium on Artificial Neural Networks, Hsinchu, Taiwan, 1995.


overWrite

Purpose
To repeatedly overwrite previously written strings when monitoring the progress of a large and/or slow loop.

Synopsis
b = overWrite(b, n)

Description
When large and/or slow loops have to be processed it is often convenient to have a command inside the loop which prints out the iteration number, so the user (who knows how many total iterations are needed) can monitor progress. Doing this with fprintf can be awkward if a large number of print statements are generated. Instead, overWrite can be used to over-print previous text. It does this by linking successive calls through the variable b, which tells overWrite how many spaces to back up before printing the next string. The variable n can be either an integer or a string or omitted altogether. In the latter case overWrite will rub out the last string printed and then back up, in effect resetting to the state before the first call to overWrite.
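A minimal sketch of the idea (an assumed implementation, not necessarily the actual M-file): back up over the previously printed text before printing the new string.

function b = overWrite(b, n)
% b is the length of the text printed by the previous call
if nargin < 2
    % rub out the old text and reset
    fprintf([repmat('\b', 1, b) repmat(' ', 1, b) repmat('\b', 1, b)]);
    b = 0;
else
    fprintf(repmat('\b', 1, b));    % back up over the old text
    s = num2str(n);                 % n may be an integer or a string
    fprintf('%s', s);
    b = length(s);
end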

Example
It is difficult to illustrate the action of overWrite on paper; however, the following fragment of M-file code prints the numbers 1 to 999 on the same spot, then flashes up the string 'finished' for a second before resetting the position of the cursor to where it was at the beginning.
b = 0;
for i = 1:999
    b = overWrite(b, i);
end
b = overWrite(b, 'finished');
pause(1)
overWrite(b);


predictError

Purpose
Estimates the predicted mean squared error for a linear network with local or global ridge regression. Needs the design matrix, the training set outputs, the regularisation parameter(s) and a choice of prediction method. Also returns some potentially useful quantities used in calculating the error.

Synopsis
[e, W, A, P, g] = predictError(H, Y, l, options, U)

Description
In ridge regression for a linear network with m basis functions and a training set of size p the cost function is made up of two parts. The first part is the usual sum-squared-error,

    (H w - y)^\top (H w - y) ,

where H is the design matrix (p rows and m columns), y is the p-dimensional vector of training set outputs and w is the m-dimensional weight vector. For simplicity, we assume the outputs are 1-dimensional though they do not have to be. The second part is a roughness penalty dependent on the squared length of w and the regularisation parameter(s). In the case of global smoothing [3] with a single parameter λ the penalty term is just λ w^\top w, while for local smoothing [3] with parameters [λ_1 λ_2 ... λ_m]^\top it is w^\top Λ w, where

    \Lambda = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_m \end{bmatrix} .

Various model selection criteria can be used to predict the mean squared error on a future test set, including unbiased estimate of variance (UEV), final prediction error (FPE), generalised cross-validation (GCV), Bayesian information criterion (BIC) and leave-one-out cross-validation (LOO) [1]. There is also the raw training set mean square error (MSE). The routine predictError can be used to calculate one or more of these quantities. The design matrix is supplied in the variable H (p rows and m columns) and the output training points in the variable Y (p rows and n columns, where n is the dimension of the output space). Global or local smoothing is specified by setting the variable l to the desired value of λ. For global smoothing l is a scalar (λ) while for local smoothing it is an m-dimensional vector ([λ_1 λ_2 ... λ_m]^\top). For no smoothing at all set l to zero.

Prediction method: MSE|UEV|FPE|GCV|BIC|LOO

The choice of which method to use for predicting mean squared error is controlled by a three character code in the options string. The choices are: MSE, UEV, FPE, GCV, BIC and LOO (see above). The default, when options is omitted, is GCV. If more than one of these codes is specified the returned error will be a list with one estimate for each method. For example, if options is the string 'FPE GCV BIC' then e will be a row vector with 3 entries: the first final prediction error, the second generalised cross-validation and the third Bayesian information criterion.
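For example (a sketch; H, y and lambda are assumed to exist already):

e = predictError(H, y, lambda, 'FPE GCV BIC');
% e(1) is the FPE estimate, e(2) the GCV estimate, e(3) the BIC estimate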

Other returned variables


As well as returning the predicted test error (or training set error in the case of MSE) the routine returns various other quantities which are used in the calculation. The variables A, w, P and g return, respectively, the quantities

    A^{-1} = (H^\top H + \lambda I_m)^{-1} ,
    \hat{w} = A^{-1} H^\top y ,
    P = I_p - H A^{-1} H^\top ,
    \gamma = p - \mathrm{trace}(P) ,

where γ is the effective number of parameters [2]. In some circumstances it is appropriate to have a metric different from the identity matrix with which to compute the roughness penalty. Such a non-standard metric is specified through the input argument U. If this argument is omitted or if it is any scalar (e.g. 1) then the standard I_m metric is used as above. However, if U is a square matrix U of dimension m, then the metric used is U^\top U. In this case the roughness penalty is λ w^\top U^\top U w for global smoothing and w^\top U^\top Λ U w for local smoothing. Similarly, the covariance returned in the output variable A is either A^{-1} = (H^\top H + λ U^\top U)^{-1} if the smoothing is global or A^{-1} = (H^\top H + U^\top Λ U)^{-1} if the smoothing is local. This facility, allowing non-standard metrics, is provided in predictError (and also in globalRidge) specifically to handle the type of metrics produced by the orthogonal least squares procedure in forward subset selection (see forwardSelect).
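A sketch of the same quantities computed directly for global smoothing with the standard metric (H, y and a scalar lambda are assumed to exist already):

[p, m] = size(H);
A = inv(H' * H + lambda * eye(m));   % covariance
w = A * H' * y;                      % optimal weight vector
P = eye(p) - H * A * H';             % projection matrix
gamma = p - trace(P);                % effective number of parameters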

Example
The figure shows a set of 50 training points (small cyan circles) sampled from a target function (dotted curve) with noise. This training set is fitted by a radial basis function network. The basis functions are Gaussians of width 0.1, there is one basis function per training set input point, and global ridge regression is used to avoid over-fitting. Suppose we want to decide between λ = 1 × 10^{-7} and λ = 1 for the regularisation parameter by choosing the value which gives the smallest Bayesian information criterion.

Figure: the training points (cyan circles), the target function (dotted), the over-fitted λ = 10^{-7} fit (red dashed) and the smoother λ = 1 fit (green solid).

Assuming the training data is contained in x (a row vector of 50 input training points) and y (a column vector of 50 output training points), and that c is equal to x (one centre at each input), then
H = rbfDesign(x, c, 0.1, 'g');

creates the appropriate design matrix. Then


predictError(H, y, 1e-7, 'BIC')
ans =
    0.7159
predictError(H, y, 1, 'BIC')
ans =
    0.4904

shows that the predicted error for λ = 1 is much smaller and therefore it should be preferred. The fits for the two cases are also shown in the figure. The λ = 1 × 10^{-7} fit (red dashed curve) shows signs of over-fitting while the λ = 1 fit (green solid curve) is smoother and is a better fit to the target. The fits were produced by using the weight vectors returned by the calls to predictError. First, a dense set of test points and their design matrix is constructed:
xt = linspace(0, 1, 500);
Ht = rbfDesign(xt, c, 0.1, 'g');

Then, for example for the λ = 1 case, the weight vector as well as the error are obtained from

[e, w] = predictError(H, y, 1, 'b');

and the fitted function is computed and plotted.

ft = Ht * w;
plot(xt, ft, 'g')

See Also
forwardSelect, globalRidge and localRidge.

References
[1] B. Efron and R.J. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall, 1993.
[2] J.E. Moody. The effective number of parameters: An analysis of generalisation and regularisation in nonlinear learning systems. In J.E. Moody, S.J. Hanson, and R.P. Lippmann, editors, Neural Information Processing Systems 4, pages 847-854. Morgan Kaufmann, San Mateo, CA, 1992.
[3] M.J.L. Orr. Local smoothing of radial basis function networks. In International Symposium on Artificial Neural Networks, Hsinchu, Taiwan, 1995.


rowSum

Purpose
Sums the rows of a matrix.

Synopsis
v = rowSum(M)

Description
Returns a column vector v whose entries are the sums of each row of M. Gives the same result as the Matlab expression sum(M')' except when M is a column vector, in which case sum inconveniently adds the elements together (returning a scalar) whereas rowSum returns the vector unaltered.
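A minimal sketch of an equivalent implementation (the actual M-file may differ):

function v = rowSum(M)
% sum along each row, except that a column vector is passed through unaltered
if size(M, 2) == 1
    v = M;
else
    v = sum(M, 2);
end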

Example
No difference between rowSum and sum when the input is a matrix.
X = [1 2 3; 2 3 5];
rowSum(X)
ans =
     6
    10
sum(X')'
ans =
     6
    10

The difference comes when the input is a column vector.


X = [1; 2];
rowSum(X)
ans =
     1
     2
sum(X')'
ans =
     3

See Also
colSum.


rbfDesign

Purpose
Computes the design matrix for a radial basis function network with fixed inputs, centres, radii, function type and optional bias unit. Used in both training and prediction.

Synopsis
H = rbfDesign(X, C, R, options)

Description
If there are p inputs and m centres then the design matrix, H, has p rows and m columns. The p input training vectors (which are n-dimensional) are the columns of X (n rows, p columns) and the m centre positions (which are also n-dimensional) are the columns of C (n rows, m columns). The options string controls the type of radial function used. Options is set to 'c' for the Cauchy function,

    \phi(z) = \frac{1}{1 + z^2} ,

or 'm' for the multiquadric,

    \phi(z) = \sqrt{1 + z^2} ,

or 'i' for the inverse multiquadric,

    \phi(z) = \frac{1}{\sqrt{1 + z^2}} .

The default type, obtained by setting options to 'g' or simply by omitting this input argument, is the Gaussian function,

    \phi(z) = \exp(-z^2) .

Additionally, if the character 'b' appears in the options string then a bias unit will be added to the network, i.e. H will be appended with an extra column of 1's. The type (scalar, vector or matrix) and size of the variable R determines what sort of radial scaling is used. Given that X is the matrix [x_1 x_2 ... x_p] (the input training points) and C is the matrix [c_1 c_2 ... c_m] (the hidden unit positions), the choice of value for R has the following consequences for the elements of the design matrix.

R is a scalar r ∈ R:

    H_{ij} = \phi\left( \frac{\| x_i - c_j \|}{r} \right) ,

and each centre has the same spherical size.

R is a vector [r_1 r_2 ... r_m]^\top ∈ R^m:

    H_{ij} = \phi\left( \frac{\| x_i - c_j \|}{r_j} \right) ,

and each centre has a different spherical size.

R is a vector r ∈ R^n:

    H_{ij} = \phi\left( \| (\mathrm{diag}(r))^{-1} (x_i - c_j) \| \right) ,

and each centre has the same ellipsoidal size, with axes coincident with the basis vectors.

R is a matrix R ∈ R^{n×n}:

    H_{ij} = \phi\left( \| R^{-1} (x_i - c_j) \| \right) ,

and each centre has the same ellipsoidal size, with arbitrary axes.

R is a matrix [r_1 r_2 ... r_m] ∈ R^{n×m}:

    H_{ij} = \phi\left( \| (\mathrm{diag}(r_j))^{-1} (x_i - c_j) \| \right) ,

and each centre has a different ellipsoidal size, with axes coincident with the basis vectors.

Example
Solve the XOR problem, as in [1]. First, set up the training data.
X = [[0; 0] [0; 1] [1; 0] [1; 1]];
y = [0 1 1 0]';

Next, set up a radial basis function network with four multiquadric centres of unit radius at the same positions as the input training points.
C = X;
r = 1;
H = rbfDesign(X, C, r, 'm');

Train the network by solving the normal equation.


w = inv(H' * H) * H' * y;

Check the output over the training set.


(H * w)'
ans =
   -0.0000    1.0000    1.0000   -0.0000


Now try reproducing figure 2 of [1]. First generate a list of 2-dimensional points in the form of a grid -- these will be the test set inputs.
d = 100;
x = linspace(-1, 2, d);
[X1, X2] = meshgrid(x);
Xt = [X1(:) X2(:)]';

Obtain the design matrix of the test inputs relative to the same centres used in the network above. Use this to calculate the trained network's predicted outputs for the test set.
Ht = rbfDesign(Xt, C, r, 'm');
yt = Ht * w;

Convert this vector to a matrix corresponding to the original 2-dimensional grid and use it to produce a contour plot of the network's output. Plot the unit square for reference.
Yt = zeros(d,d);
Yt(:) = yt;
hold off
contour(x, x, Yt, [-.18 -.16 0 .28 .71 1 1.57])
hold on
plot([0 0 1 1 0], [0 1 1 0 0])

Figure: contour plot of the trained network's output, with the unit square drawn for reference.

References
[1] D.S. Broomhead and D. Lowe. Multivariate functional interpolation and adaptive networks. Complex Systems, 2:321-355, 1988.


traceProduct

Purpose
Calculates the trace of the product of two matrices.

Synopsis
t = traceProduct(X, Y)

Description
Calculating
t = trace(X * Y);

can be wasteful, especially if the matrices are large. traceProduct achieves the same calculation quicker. The speed-up factor is approximately equal to the number of columns of X (or the number of rows of Y).
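As with diagProduct, the saving comes from a simple identity: trace(X*Y) is the sum over i and k of X(i,k)*Y(k,i), so only elementwise products are needed. A sketch (assuming conformable X and Y; not necessarily the actual implementation):

t = sum(sum(X .* Y'));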

Example
Do it the hard way.
X = rand(100,100);
flops(0)
t = trace(X * X);
flops
ans =
     2000099

Save time.
flops(0)
t = traceProduct(X, X);
flops
ans =
       19999

See Also
diagProduct.
