
Biometrika (2011), 98, 1, pp. 133–146
doi: 10.1093/biomet/asq063
© 2011 Biometrika Trust. Printed in Great Britain
Advance Access publication 31 January 2011
Partial envelopes for efficient estimation in multivariate linear regression


BY ZHIHUA SU AND R. DENNIS COOK
School of Statistics, University of Minnesota, 224 Church St. S.E., Minneapolis, Minnesota 55414, U.S.A.
suzhihua@stat.umn.edu dennis@stat.umn.edu

SUMMARY

We introduce the partial envelope model, which leads to a parsimonious method for multivariate linear regression when some of the predictors are of special interest. It has the potential to achieve massive efficiency gains compared with the standard model in the estimation of the coefficients for the selected predictors. The partial envelope model is a variation on the envelope model proposed by Cook et al. (2010) but, as it focuses on part of the predictors, it has looser restrictions and can further improve efficiency. We develop maximum likelihood estimation for the partial envelope model and discuss applications of the bootstrap. An example is provided to illustrate some of its operating characteristics.
Some key words: Dimension reduction; Envelope model; Grassmann manifold; Reducing subspace.

1. INTRODUCTION

Introduced recently by Cook et al. (2010), enveloping is a new approach to multivariate analysis that has the potential to produce substantial gains in efficiency. Their development was in terms of the standard multivariate linear model

$$Y = \alpha + \beta X + \varepsilon, \qquad (1)$$

where $\alpha \in \mathbb{R}^r$, the random response $Y \in \mathbb{R}^r$, the fixed predictor vector $X \in \mathbb{R}^p$ is centred to have sample mean zero, and the error vector $\varepsilon \sim N(0, \Sigma)$. They demonstrated that an envelope estimator of the unknown coefficient matrix $\beta \in \mathbb{R}^{r \times p}$ has the potential to achieve massive gains in efficiency relative to the standard estimator of $\beta$, and that these gains will be passed on to other tasks such as prediction. In this article, we propose an extension of envelopes called partial envelopes. Partial envelopes can focus on selected columns of $\beta$ and can achieve gains in efficiency beyond those possible by using an envelope. In addition, the envelope estimator reduces to the standard estimator when $r \leq p$ and $\beta$ is of rank $r$, so there is no possibility of efficiency gains in that setting. Partial envelopes remove this restriction, providing gains even when $r \leq p$. In the next section, we review envelopes and envelope estimation. Because this is a new area, our goal was to provide intuition and insight rather than technical details. Additional results for envelopes will be discussed during our extension to partial envelopes in §3.

2. ENVELOPES

It can happen in the context of model (1) that some linear combinations of $Y$ are immaterial to the regression because their distribution does not depend on $X$, while other linear combinations


of $Y$ do depend on $X$ and are thus material to the regression. In effect, envelopes separate the material and immaterial parts of $Y$, and thereby allow for gains in efficiency. Suppose that we can find a subspace $\mathcal{S} \subseteq \mathbb{R}^r$ so that

$$Q_{\mathcal{S}} Y \mid X \sim Q_{\mathcal{S}} Y, \qquad Q_{\mathcal{S}} Y \perp P_{\mathcal{S}} Y \mid X, \qquad (2)$$

where $\sim$ means identically distributed, $\perp$ denotes independence, $P_{(\cdot)}$ projects onto the subspace indicated by its argument and $Q_{(\cdot)} = I_r - P_{(\cdot)}$. For any $\mathcal{S}$ with those properties, $P_{\mathcal{S}} Y$ carries all of the material information and perhaps some immaterial information, while $Q_{\mathcal{S}} Y$ carries just immaterial information. Let $\mathcal{B} = \mathrm{span}(\beta)$. Then (2) holds if and only if $\mathcal{B} \subseteq \mathcal{S}$ and $\Sigma = \Sigma_{\mathcal{S}} + \Sigma_{\mathcal{S}^{\perp}}$, where $\Sigma_{\mathcal{S}} = \mathrm{var}(P_{\mathcal{S}} Y)$ and $\Sigma_{\mathcal{S}^{\perp}} = \mathrm{var}(Q_{\mathcal{S}} Y)$ (Cook et al., 2010). However, $\mathcal{S}$ is not necessarily unique, because there may be infinitely many subspaces that satisfy these relations in a particular problem. A reducing subspace $\mathcal{S}$ of $\Sigma$ has the property that $\Sigma \mathcal{S} \subseteq \mathcal{S}$ and $\Sigma \mathcal{S}^{\perp} \subseteq \mathcal{S}^{\perp}$ (Conway, 1990). Cook et al. (2010) showed that $\mathcal{S}$ is a reducing subspace of $\Sigma$ if and only if $\Sigma = \Sigma_{\mathcal{S}} + \Sigma_{\mathcal{S}^{\perp}}$. This enabled them to address the uniqueness issue and ensure that $P_{\mathcal{S}} Y$ contains only material information by defining the minimal subspace to be the intersection of all reducing subspaces of $\Sigma$ that contain $\mathcal{B}$, which is called the $\Sigma$-envelope of $\mathcal{B}$ and denoted as $\mathcal{E}_{\Sigma}(\mathcal{B})$. Let $u = \dim\{\mathcal{E}_{\Sigma}(\mathcal{B})\}$. Then
$$\mathcal{B} \subseteq \mathcal{E}_{\Sigma}(\mathcal{B}), \qquad \Sigma = \Sigma_{E} + \Sigma_{E^{\perp}}, \qquad (3)$$
where $\mathcal{E}_{\Sigma}(\mathcal{B})$ is shortened to $E$ for subscripts. These relationships establish a unique link between the coefficient matrix $\beta$ and the covariance matrix $\Sigma$ of (1), and it is this link that has the potential to produce gains in the efficiency of estimates of $\beta$. In particular, Cook et al. (2010) demonstrated that these gains will be massive when $\Sigma_{E^{\perp}}$ contains at least one eigenvalue that is substantially larger than the largest eigenvalue of $\Sigma_{E}$, so that in effect $Y$ contains redundant immaterial information. Model (1) will be called the standard model when (3) is not imposed and called the envelope model when (3) is imposed.

Figure 1(a) gives a schematic illustration of envelope estimation in a regression with $r = 2$ responses and a single binary predictor $X$, indicating one of the two bivariate normal populations represented by ellipses that cover, say, 99% of their distributions. Cook et al. (2010) showed that $\mathcal{E}_{\Sigma}(\mathcal{B})$ is characterized by the identity $\mathcal{E}_{\Sigma}(\mathcal{B}) = \sum_{j=1}^{k} P_j \mathcal{B}$, where $k \leq r$ and $P_j$ is the projection onto the $j$th eigenspace of $\Sigma$. When $\Sigma$ has distinct eigenvalues, $\mathcal{E}_{\Sigma}(\mathcal{B})$ could be spanned by any of the $2^r$ possible subsets of its eigenvectors. In Fig. 1(a), the $\Sigma$-envelope of $\beta = (\beta_j) = E(Y \mid X = 1) - E(Y \mid X = 0)$ is parallel to the second eigenvector of $\Sigma$, and its orthogonal complement, which ties in with the immaterial part of $Y$, is parallel to the first eigenvector. This setup meets the population requirements (3). Standard inference on the second coordinate $\beta_2$ of $\beta$ is based on the marginal data obtained by projecting each data point onto the $Y_2$ axis, represented by the line segment A from a representative data point marked with an x, giving rise to the usual two-sample inference methods. In contrast, for envelope inference on $\beta_2$, each data point is first projected onto $\mathcal{E}_{\Sigma}(\mathcal{B})$ and then projected onto the coordinate axis, represented in Fig. 1(a) by the two line segments marked with B. Imagining this process for many data points from each population, it can be seen that the two empirical distributions of the projected data from the standard method will have much larger variation than the empirical distributions for the envelope method. The envelope $\mathcal{E}_{\Sigma}(\mathcal{B})$ is estimated in practice and will hence have a degree of wobble that spreads the distribution of the data projected along routes represented by path B. The asymptotic approximations of the variance of the envelope estimator of $\beta$ take this wobble into account.
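To make this geometry concrete, the following sketch (our illustration, not code from the paper; all numbers are arbitrary) simulates the two-population setting of Fig. 1(a) with the envelope direction treated as known, and compares the spread of the two routes to inference on $\beta_2$: the marginal projection (route A) and the projection through the envelope (route B).

```python
import numpy as np

rng = np.random.default_rng(0)

# Eigenvectors of Sigma, oblique to the coordinate axes as in Fig. 1(a):
# v1 carries the large immaterial variation, v2 spans the envelope.
theta = 0.6
v1 = np.array([np.cos(theta), np.sin(theta)])
v2 = np.array([-np.sin(theta), np.cos(theta)])
Sigma = 25.0 * np.outer(v1, v1) + 0.25 * np.outer(v2, v2)
beta = 3.0 * v2                     # mean difference, parallel to the envelope

n = 200
Y0 = rng.multivariate_normal(np.zeros(2), Sigma, n)   # population with X = 0
Y1 = rng.multivariate_normal(beta, Sigma, n)          # population with X = 1

# Route A: standard two-sample inference on the marginal Y2 coordinate.
se_A = np.sqrt(Y0[:, 1].var(ddof=1)/n + Y1[:, 1].var(ddof=1)/n)

# Route B: project onto the (here known) envelope, then read off Y2.
P = np.outer(v2, v2)                # projection onto the envelope
Z0, Z1 = Y0 @ P, Y1 @ P
se_B = np.sqrt(Z0[:, 1].var(ddof=1)/n + Z1[:, 1].var(ddof=1)/n)

print(se_A, se_B)                   # route B is far less variable
```

Both routes estimate the same quantity $\beta_2$, since $P\beta = \beta$ for $\beta$ in the envelope; with an estimated envelope the projection itself wobbles, which is what the asymptotic approximations account for.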

Fig. 1. Variance reduction by envelope estimation. (a) Schematic representation of envelope estimation. (b) Wheat protein data with points marked as high protein (filled circles) and low protein (open circles).

For a numerical illustration of this phenomenon, Fig. 1(b) shows a plot of $r = 2$ responses, the logarithms of near-infrared reflectance at two wavelengths in the range of 1680–2310 nm, measured on samples from two populations of ground wheat with low and high protein content, 24 and 26 samples, respectively. The plotted points resemble the schematic representation shown in Fig. 1(a), and consequently an envelope analysis can be expected to yield more precise results than the standard analysis. The standard estimate of $\beta_2$ is 2.1 with a standard error of 9.4. In contrast, the envelope estimate of $\beta_2$ is 4.7 with a standard error of 0.46. The standard and envelope analyses for $\beta_1$ are related similarly. Although $\mathcal{E}_{\Sigma}(\mathcal{B})$ is necessarily spanned by some subset of the eigenvectors of $\Sigma$, the maximum likelihood estimator of $\mathcal{E}_{\Sigma}(\mathcal{B})$ shown in Fig. 1(b) is not spanned by a subset of the eigenvectors of the usual pooled estimator of $\Sigma$. This happens because the likelihood will balance the mean and variance conditions in (3), leading away from the sample eigenvectors.

If $r \leq p$ and $\beta$ has full row rank $r$, then $\mathcal{B} = \mathbb{R}^r$. Consequently, $\mathcal{E}_{\Sigma}(\mathcal{B}) = \mathbb{R}^r$, the envelope estimator of $\beta$ reduces to the standard estimator of $\beta$, and enveloping offers no gains. When $r > p$, it is still possible to have $\mathcal{E}_{\Sigma}(\mathcal{B}) = \mathbb{R}^r$, so again enveloping offers no gains over the standard analysis. In these and other situations, the partial envelopes developed in the next section can provide gains over both the standard and envelope estimators of $\beta$.

3. PARTIAL ENVELOPES

3.1. Definition

A subset of the predictors is often of special interest in multivariate regression, particularly when some predictors correspond to treatments while the remaining predictors are included to account for heterogeneity among the experimental units. Partial envelopes are designed to focus consideration on the coefficients corresponding to the predictors of interest. Partition $X$ into two sets of predictors, $X_1 \in \mathbb{R}^{p_1}$ and $X_2 \in \mathbb{R}^{p_2}$, $p_1 + p_2 = p$ ($p_1 < r$), and conformably partition the columns of $\beta$ into $\beta_1$ and $\beta_2$. Then model (1) can be rewritten as $Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$, where $\beta_1$ corresponds to the coefficients of interest. We can now consider the $\Sigma$-envelope for $\mathcal{B}_1 = \mathrm{span}(\beta_1)$, leaving $\beta_2$ as an unconstrained parameter. This leads to the parametric structure $\mathcal{B}_1 \subseteq \mathcal{E}_{\Sigma}(\mathcal{B}_1)$ and $\Sigma = P_{E_1} \Sigma P_{E_1} + Q_{E_1} \Sigma Q_{E_1}$, where $P_{E_1}$ denotes the projection onto $\mathcal{E}_{\Sigma}(\mathcal{B}_1)$, which is called the partial envelope for $\mathcal{B}_1$. This is the same as the envelope structure, except the enveloping is relative to $\mathcal{B}_1$ instead of the larger space $\mathcal{B}$. For emphasis, we will henceforth refer to $\mathcal{E}_{\Sigma}(\mathcal{B})$ as the full envelope. Because $\mathcal{B}_1 \subseteq \mathcal{B}$, the partial


envelope is contained in the full envelope, $\mathcal{E}_{\Sigma}(\mathcal{B}_1) \subseteq \mathcal{E}_{\Sigma}(\mathcal{B})$, which allows the partial envelope to offer gains that may not be possible with the full envelope.

To provide some insight into this setting, let $R_{1|2}$ denote the population residuals from the multivariate linear regression of $X_1$ on $X_2$. Then, recalling that we have required $X$ to be centred, the linear model can be reparameterized as $Y = \alpha + \beta_1 R_{1|2} + \beta_2^{*} X_2 + \varepsilon$, where $\beta_2^{*}$ is a linear combination of $\beta_1$ and $\beta_2$. Next, let $R_{Y|2} = Y - \alpha - \beta_2^{*} X_2$, the population residuals from the regression of $Y$ on $X_2$ alone. We can now write a linear model involving $\beta_1$ alone: $R_{Y|2} = \beta_1 R_{1|2} + \varepsilon$. The partial envelope $\mathcal{E}_{\Sigma}(\mathcal{B}_1)$ is the same as the full envelope for $\mathcal{B}_1$ in the regression of $R_{Y|2}$ on $R_{1|2}$. In other words, we can interpret partial envelopes in terms of the motivating conditions (2) applied to the regression of $R_{Y|2}$ on $R_{1|2}$. In particular, the schematic representation of Fig. 1 also serves for partial envelopes by reinterpreting $Y_1$ and $Y_2$ as the two coordinates of $R_{Y|2}$ and reinterpreting the binary predictor as $R_{1|2}$. The classical added variable plot (Cook & Weisberg, 1982) is simply a plot of the sample version $\hat{R}_{Y|2}$ of $R_{Y|2}$ versus the sample version $\hat{R}_{1|2}$ of $R_{1|2}$.

3.2. Maximum likelihood estimators

Because we have centred the predictors, the maximum likelihood estimator of $\alpha$ is simply $\bar{Y}$. The estimators of the remaining parameters require the estimator of $\mathcal{E}_{\Sigma}(\mathcal{B}_1)$. Let $S_{Y|2}$ denote the sample covariance matrix of $\hat{R}_{Y|2}$, let $S_{R|2}$ denote the sample covariance matrix of the residual vectors from the regression of $\hat{R}_{Y|2}$ on $\hat{R}_{1|2}$, and let $|A|_0$ denote the product of the nonzero eigenvalues of the square matrix $A \geq 0$. Then, as shown in Appendix A, the maximum likelihood estimator of $\mathcal{E}_{\Sigma}(\mathcal{B}_1)$ for a fixed dimension $u_1$ is

$$\hat{\mathcal{E}}_{\Sigma}(\mathcal{B}_1) = \arg\min_{\mathcal{S} \in \mathcal{G}(u_1, r)} \{\log|P_{\mathcal{S}} S_{R|2} P_{\mathcal{S}}|_0 + \log|Q_{\mathcal{S}} S_{Y|2} Q_{\mathcal{S}}|_0\}, \qquad (4)$$

where $\mathcal{G}(u_1, r)$ denotes the Grassmann manifold of dimension $u_1$ in $\mathbb{R}^r$, i.e., the set of all $u_1$-dimensional subspaces of $\mathbb{R}^r$. The maximum likelihood estimator of the full envelope $\mathcal{E}_{\Sigma}(\mathcal{B})$ is obtained using the objective function on the right-hand side of (4), except that $S_{R|2}$ is replaced with the sample covariance matrix $S_R$ of the residual vectors from the fit of the standard model, $S_{Y|2}$ is replaced with the sample covariance matrix $S_Y$ of the observed response vectors, $u_1$ is replaced with $u = \dim\{\mathcal{E}_{\Sigma}(\mathcal{B})\}$ and $\mathcal{S}$ is reinterpreted as an argument representing the full envelope (Cook et al., 2010). We see from this result that $\hat{\mathcal{E}}_{\Sigma}(\mathcal{B}_1)$ is the same as the estimator of the full envelope applied in the context of the working model $\hat{R}_{Y|2} = \beta_1 \hat{R}_{1|2} + \varepsilon$.

The maximum likelihood estimator $\hat{\beta}_1$ of $\beta_1$ is the projection onto $\hat{\mathcal{E}}_{\Sigma}(\mathcal{B}_1)$ of the estimator $\tilde{\beta}_1$ of $\beta_1$ from the standard model. The maximum likelihood estimator $\hat{\beta}_2$ of $\beta_2$ is the coefficient matrix from the ordinary least squares fit of the residuals $Y - \bar{Y} - \hat{\beta}_1 X_1$ on $X_2$. If $X_1$ and $X_2$ are orthogonal, then $\hat{\beta}_2$ reduces to the maximum likelihood estimator of $\beta_2$ from the standard model. The maximum likelihood estimator of $\Sigma$ is $\hat{\Sigma} = \hat{P}_{E_1} S_{R|2} \hat{P}_{E_1} + \hat{Q}_{E_1} S_{Y|2} \hat{Q}_{E_1}$, where $\hat{P}_{E_1}$ denotes the projection operator for $\hat{\mathcal{E}}_{\Sigma}(\mathcal{B}_1)$ and $\hat{Q}_{E_1} = I_r - \hat{P}_{E_1}$; $\hat{P}_{E_1} S_{R|2} \hat{P}_{E_1}$ is the estimated covariance matrix for the material part of $Y$ and $\hat{Q}_{E_1} S_{Y|2} \hat{Q}_{E_1}$ is the estimated covariance matrix for the immaterial part of $Y$.
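The ingredients of (4) are simple to compute from data. The sketch below is ours, with hypothetical helper names; it assumes centred predictor matrices, forms $S_{Y|2}$ and $S_{R|2}$, and evaluates the objective at a semi-orthogonal basis $G$ of a candidate subspace using the identity of §3.5, $\log|Q_{\mathcal{S}} S_{Y|2} Q_{\mathcal{S}}|_0 = \log|S_{Y|2}| + \log|G^{T} S_{Y|2}^{-1} G|$, whose additive constant does not affect the minimizer.

```python
import numpy as np

def residual_cov_matrices(Y, X1, X2):
    """Sample S_{Y|2} and S_{R|2}; a sketch assuming Y is n x r and the
    predictor matrices X1 (n x p1) and X2 (n x p2) are centred."""
    n = Y.shape[0]
    Yc = Y - Y.mean(axis=0)
    # hat R_{Y|2}: residuals from the regression of Y on X2 alone.
    RY2 = Yc - X2 @ np.linalg.lstsq(X2, Yc, rcond=None)[0]
    # hat R_{1|2}: residuals from the regression of X1 on X2.
    R12 = X1 - X2 @ np.linalg.lstsq(X2, X1, rcond=None)[0]
    # Residuals from the regression of hat R_{Y|2} on hat R_{1|2}.
    E = RY2 - R12 @ np.linalg.lstsq(R12, RY2, rcond=None)[0]
    return RY2.T @ RY2 / n, E.T @ E / n          # S_{Y|2}, S_{R|2}

def envelope_objective(G, S_R2, S_Y2):
    # log|G' S_{R|2} G| + log|G' S_{Y|2}^{-1} G| for semi-orthogonal G (r x u1);
    # equal to (4) up to the constant log|S_{Y|2}|.
    _, t1 = np.linalg.slogdet(G.T @ S_R2 @ G)
    _, t2 = np.linalg.slogdet(G.T @ np.linalg.inv(S_Y2) @ G)
    return t1 + t2
```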

3.3. Asymptotic distributions

In this section, we give the asymptotic distributions of $n^{1/2}(\hat{\beta}_j - \beta_j)$ ($j = 1, 2$). The results are conveniently expressed in terms of a coordinate version of the partial envelope model. Let $\Gamma \in \mathbb{R}^{r \times u_1}$ be a semi-orthogonal matrix, $\Gamma^{T}\Gamma = I_{u_1}$, whose columns form a basis for $\mathcal{E}_{\Sigma}(\mathcal{B}_1)$, let $(\Gamma, \Gamma_0) \in \mathbb{R}^{r \times r}$ be an orthogonal matrix and let $\eta \in \mathbb{R}^{u_1 \times p_1}$ be the coordinates of $\beta_1$ in terms of the basis matrix $\Gamma$. Then

$$Y = \alpha + \Gamma\eta X_1 + \beta_2 X_2 + \varepsilon, \qquad \Sigma = \Gamma\Omega\Gamma^{T} + \Gamma_0\Omega_0\Gamma_0^{T}, \qquad (5)$$

where $\Omega \in \mathbb{R}^{u_1 \times u_1}$ and $\Omega_0 \in \mathbb{R}^{(r-u_1) \times (r-u_1)}$ are positive definite matrices that serve as coordinates of $\Sigma_{E_1} = \Gamma\Omega\Gamma^{T}$ and $\Sigma_{E_1^{\perp}} = \Gamma_0\Omega_0\Gamma_0^{T}$ relative to the basis matrices for $\mathcal{E}_{\Sigma}(\mathcal{B}_1)$ and its orthogonal complement.

In preparation for the limiting distributions of $\hat{\beta}_1$ and $\hat{\beta}_2$, let $\Sigma_X$ denote the limit as the sample size $n \to \infty$ of the sample covariance matrix of $X$, and partition $\Sigma_X = (\Sigma_{jk})$ according to the partitioning of $X$ ($j, k = 1, 2$). Let $\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$, with $\Sigma_{2|1}$ defined similarly by interchanging the subscripts. The matrix $\Sigma_{1|2}$ is constructed in the same way as the covariance matrix for the conditional distribution of $X_1 \mid X_2$ when $X$ is normally distributed, although here $X$ is fixed. For $A \in \mathbb{R}^{p_1 \times p_1}$, define

$$M(A) = \eta A \eta^{T} \otimes \Omega_0^{-1} + \Omega \otimes \Omega_0^{-1} + \Omega^{-1} \otimes \Omega_0 - 2 I_{u_1(r-u_1)}.$$
Let $\hat{\beta}_1(\Gamma)$, $\hat{\beta}_1(\Sigma)$ and $\hat{\beta}_1(\beta_2)$ denote the maximum likelihood estimators of $\beta_1$ when $\Gamma$, $\Sigma$ and $\beta_2$, respectively, are known. If a random vector $w_n$ has the property that $n^{1/2}(w_n - \mu)$ converges in distribution to an $N(0, A)$ random vector, then we will describe the asymptotic variance of $w_n$ as $\mathrm{avar}(n^{1/2} w_n) = A$. The limiting distributions of $\hat{\beta}_1$ and $\hat{\beta}_2$ are stated in the following proposition; justification is given in Appendix B.

PROPOSITION 1. Under model (5), $n^{1/2}\{\mathrm{vec}(\hat{\beta}_j) - \mathrm{vec}(\beta_j)\}$ ($j = 1, 2$) converge in distribution to normal random vectors with mean zero and covariance matrices

$$\mathrm{avar}\{n^{1/2}\mathrm{vec}(\hat{\beta}_1)\} = \Sigma_{1|2}^{-1} \otimes \Sigma_{E_1} + (\eta^{T} \otimes \Gamma_0)\, M^{-1}(\Sigma_{1|2})\, (\eta \otimes \Gamma_0^{T}) = \mathrm{avar}[n^{1/2}\mathrm{vec}\{\hat{\beta}_1(\Gamma)\}] + \mathrm{avar}[n^{1/2}\mathrm{vec}\{Q_{E_1}\hat{\beta}_1(\Sigma)\}],$$

$$\mathrm{avar}\{n^{1/2}\mathrm{vec}(\hat{\beta}_2)\} = \{\Sigma_{2|1}^{-1} \otimes \Sigma + (\Sigma_{22}^{-1}\Sigma_{21}\eta^{T} \otimes \Gamma_0)\, M^{-1}(\Sigma_{11})\, (\eta\Sigma_{12}\Sigma_{22}^{-1} \otimes \Gamma_0^{T})\} = \{\Sigma_{2|1}^{-1} \otimes \Sigma + (\Sigma_{22}^{-1}\Sigma_{21} \otimes I_r)\, \mathrm{avar}[n^{1/2}\mathrm{vec}\{\hat{\beta}_1(\beta_2)\}]\, (\Sigma_{12}\Sigma_{22}^{-1} \otimes I_r)\},$$

where $\mathrm{avar}[n^{1/2}\mathrm{vec}\{\hat{\beta}_1(\Gamma)\}] = \Sigma_{1|2}^{-1} \otimes \Sigma_{E_1}$ and $\mathrm{avar}[n^{1/2}\mathrm{vec}\{Q_{E_1}\hat{\beta}_1(\Sigma)\}]$ is defined implicitly.
Several comments on this proposition are in order. We first consider regressions in which $\Sigma_{12} = 0$. This will arise when $X_1$ and $X_2$ are asymptotically uncorrelated or have been chosen by design to be orthogonal. Because $X$ is nonrandom, this condition can always be forced without an inferential cost by replacing $X_1$ with $\hat{R}_{1|2}$. As discussed in §3.1, this replacement alters the definition of $\beta_2$ but does not alter the parameter of interest $\beta_1$. When $\Sigma_{12} = 0$, $\mathrm{avar}\{n^{1/2}\mathrm{vec}(\hat{\beta}_2)\} = \Sigma_{22}^{-1} \otimes \Sigma$, which is the same as the asymptotic covariance matrix for the estimator of $\beta_2$ from the standard model. And $\mathrm{avar}\{n^{1/2}\mathrm{vec}(\hat{\beta}_1)\}$ reduces to the asymptotic covariance matrix for the full envelope estimator of $\beta_1$ in the model $R_{Y|2} = \beta_1 X_1 + \varepsilon$. No longer requiring that $\Sigma_{12} = 0$, we can carry out asymptotic inference for $\beta_1$ based on a partial envelope by using the full envelope for $\beta_1$ in the model $\hat{R}_{Y|2} = \beta_1 \hat{R}_{1|2} + \varepsilon$. If we choose $\beta_1 = \beta$, so $X_1 = X$, $\beta_2$ is nil and $\Sigma_{1|2} = \Sigma_X$, then $\mathrm{avar}\{n^{1/2}\mathrm{vec}(\hat{\beta}_1)\}$ reduces to the asymptotic covariance matrix for the full envelope estimator of $\beta$ derived by Cook et al. (2010).

We next turn to a comparison of the partial envelope estimator $\hat{\beta}_1$ of $\beta_1$ and the full envelope estimator $\hat{\beta}_{1e}$, where the subscript e added to a statistic means computation based on the full


envelope. Proposition 2 provides a comparison of $\mathrm{avar}\{n^{1/2}\mathrm{vec}(\hat{\beta}_1)\}$ and $\mathrm{avar}\{n^{1/2}\mathrm{vec}(\hat{\beta}_{1e})\}$ at two extremes; justification is provided in Appendix B.

PROPOSITION 2. If $\mathcal{E}_{\Sigma}(\mathcal{B}) = \mathbb{R}^r$, then $\mathrm{avar}\{n^{1/2}\mathrm{vec}(\hat{\beta}_1)\} \leq \mathrm{avar}\{n^{1/2}\mathrm{vec}(\hat{\beta}_{1e})\}$. If $\mathcal{E}_{\Sigma}(\mathcal{B}) = \mathcal{E}_{\Sigma}(\mathcal{B}_1)$, then $\mathrm{avar}\{n^{1/2}\mathrm{vec}(\hat{\beta}_1)\} \geq \mathrm{avar}\{n^{1/2}\mathrm{vec}(\hat{\beta}_{1e})\}$.

At one extreme, this proposition tells us that when the full envelope is $\mathbb{R}^r$, which is the situation that motivated this work, the covariance matrix for the partial envelope estimator will never be greater than that for the full envelope estimator, which is the same as the standard estimator. At the other extreme, if the full and partial envelopes are the same, then the partial envelope estimator of $\beta_1$ will never do better than the full envelope estimator of the same quantity. The following lemma will be helpful in developing some intuition for this conclusion.

LEMMA 1. Let $\mathcal{B}_2 = \mathrm{span}(\beta_2)$. Then $\mathcal{E}_{\Sigma}(\mathcal{B}) = \mathcal{E}_{\Sigma}(\mathcal{B}_1) + \mathcal{E}_{\Sigma}(\mathcal{B}_2)$.

This lemma says that the full envelope is the sum of the partial $\Sigma$-envelopes for $\mathcal{B}_1$ and $\mathcal{B}_2$. The dimension of $\mathcal{E}_{\Sigma}(\mathcal{B}_1) \cap \mathcal{E}_{\Sigma}(\mathcal{B}_2)$ can vary from zero to the minimum of the dimensions of the two. But $\mathcal{E}_{\Sigma}(\mathcal{B}) = \mathcal{E}_{\Sigma}(\mathcal{B}_1)$ if and only if $\mathcal{E}_{\Sigma}(\mathcal{B}_2) \subseteq \mathcal{E}_{\Sigma}(\mathcal{B}_1)$. This means that $\mathcal{E}_{\Sigma}(\mathcal{B}_2)$ can contain information on $\mathcal{E}_{\Sigma}(\mathcal{B}_1)$. This information is not used by the partial envelope, but is used by the full envelope, and consequently the full envelope estimator may have the smaller asymptotic variance. Beyond the conclusions in Proposition 2, we were unable to develop practically ponderable conditions for characterizing when the asymptotic variance of the partial envelope estimator of $\beta_1$ is less than that of the full envelope estimator of $\beta_1$. A straightforward course in practice is to simply compute and compare the asymptotic variances.

3.4. Dimension selection

The dimension $u_1$ of the partial envelope is essentially a model selection parameter, and standard methods can be used to aid in its choice. We briefly describe two methods in this section; a sketch of both appears after the parameter count below. The first is based on sequential hypothesis testing. The hypothesis $u_1 = d$ ($d < r$) can be tested by using the likelihood ratio statistic $\Lambda(d) = 2\{\hat{L}(r) - \hat{L}(d)\}$, where

$$\hat{L}(d) = -(nr/2)\{1 + \log(2\pi)\} - (n/2)\log|\hat{P}_{E_1} S_{R|2} \hat{P}_{E_1}|_0 - (n/2)\log|\hat{Q}_{E_1} S_{Y|2} \hat{Q}_{E_1}|_0$$

denotes the maximum value of the likelihood for the partial envelope model with $u_1 = d$; see Appendix A. When $d = r$, the partial envelope model reduces to the standard model and thus

$$\hat{L}(r) = -(nr/2)\{1 + \log(2\pi)\} - (n/2)\log|S_R|$$

is the value of the maximized loglikelihood for the standard model. Following standard likelihood theory, under the null hypothesis, $\Lambda(d)$ is distributed asymptotically as a chi-squared random variable with $p_1(r - d)$ degrees of freedom. The test statistic $\Lambda(d)$ can be used in a sequential scheme to choose $u_1$: starting with $d = 0$ and using a common test level, choose $u_1$ to be the first hypothesized value that is not rejected.

The dimension of the partial envelope could also be determined by using an information criterion,

$$\hat{u}_1 = \arg\min_d \{-2\hat{L}(d) + h(n)g(d)\},$$


where $h(n)$ equals $\log n$ for BIC and equals 2 for AIC, and $g(d)$ is the number of real parameters in the partial envelope model,

$$g(d) = r + d(r - d) + d p_1 + r p_2 + d(d + 1)/2 + (r - d)(r - d + 1)/2.$$

Subtracting $g(d)$ from the number of parameters $r + pr + r(r + 1)/2$ for the standard model gives the degrees of freedom $p_1(r - d)$ for $\Lambda(d)$ mentioned previously.
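Both selection rules are easy to mechanize. The sketch below is ours, not the authors' code, and assumes a hypothetical routine loglik(d) returning the maximized loglikelihood $\hat{L}(d)$ of the partial envelope model with $u_1 = d$.

```python
import numpy as np
from scipy.stats import chi2

def g(d, r, p1, p2):
    # Number of real parameters in the partial envelope model with u1 = d.
    return (r + d * (r - d) + d * p1 + r * p2
            + d * (d + 1) // 2 + (r - d) * (r - d + 1) // 2)

def choose_u1_sequential(loglik, r, p1, level=0.05):
    """Sequential likelihood ratio testing: return the first d not rejected."""
    for d in range(r):
        stat = 2.0 * (loglik(r) - loglik(d))            # Lambda(d)
        if stat <= chi2.ppf(1.0 - level, df=p1 * (r - d)):
            return d
    return r

def choose_u1_ic(loglik, r, p1, p2, n, criterion="bic"):
    """Information criterion: minimize -2*L(d) + h(n)*g(d) over d."""
    h = np.log(n) if criterion == "bic" else 2.0
    values = [-2.0 * loglik(d) + h * g(d, r, p1, p2) for d in range(r + 1)]
    return int(np.argmin(values))
```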
3.5. Computing

Computational algorithms for Grassmann optimization typically require that the objective function be written in terms of semi-orthogonal basis matrices rather than projection operators (Edelman et al., 1999; Liu et al., 2004). Since eigenvalues are invariant under cyclic permutations, (4) can be expressed equivalently as

$$\hat{\mathcal{E}}_{\Sigma}(\mathcal{B}_1) = \mathrm{span}\{\arg\min(\log|G^{T} S_{R|2} G| + \log|H^{T} S_{Y|2} H|)\} = \mathrm{span}\{\arg\min(\log|G^{T} S_{R|2} G| + \log|G^{T} S_{Y|2}^{-1} G|)\},$$

where the minimization is over semi-orthogonal matrices $G \in \mathbb{R}^{r \times u_1}$ and $(G, H)$ is an orthogonal matrix. We adapted Lippert's SG_MIN version 2.4.1 computer code, available from www-math.mit.edu/~lippert/sgmin.html, to perform this numerical optimization. That program offers several optimization methods, including Newton–Raphson iteration on Grassmann manifolds with analytic first and numerical second derivatives of the objective function, and we have found it to be stable and reliable. However, like most numerical methods for optimizing nonlinear objective functions, it requires good starting values to facilitate convergence and avoid any lurking local optima. A standard way to deal with multiple local optima is to use Newton–Raphson iteration, beginning with a $\sqrt{n}$ consistent starting value. An estimator that is one Newton–Raphson iteration step away from a $\sqrt{n}$ consistent estimator may be sufficient, because it is asymptotically equivalent to the maximum likelihood estimator, for example, Small et al. (2000), although we prefer to always iterate until convergence.

In the remainder of this section, we describe methods for determining starting values. Equally important, our descriptions should also serve to highlight additional structure of the envelope model. Because these are new methods, we describe them in the context of the full envelope, understanding that they apply equally to partial envelopes by replacing $S_R$ and $S_Y$ with $S_{R|2}$ and $S_{Y|2}$. We assume that $u = \dim\{\mathcal{E}_{\Sigma}(\mathcal{B})\}$ is known, and we use the notation of equation (5) to describe the coordinate model for the full envelope when $\beta_2$ is nil:

$$Y = \alpha + \Gamma\eta X + \varepsilon, \qquad \Sigma = \Gamma\Omega\Gamma^{T} + \Gamma_0\Omega_0\Gamma_0^{T}, \qquad (6)$$

where $\Omega \in \mathbb{R}^{u \times u}$ and $\Omega_0 \in \mathbb{R}^{(r-u) \times (r-u)}$ are positive definite matrices that serve as coordinates of $\Sigma_E$ and $\Sigma_{E^{\perp}}$ relative to the basis matrices for $\mathcal{E}_{\Sigma}(\mathcal{B})$ and its orthogonal complement. The starting values that we use are functions of $S_R$, $S_Y$ and the ordinary least squares estimator $B$ of $\beta$. The asymptotic behaviour of these statistics is summarized in the following lemma. Because they are standard moment estimators, its proof seems straightforward and is omitted.

LEMMA 2. The sample matrices $B$, $S_Y$ and $S_R$ are $\sqrt{n}$ consistent estimators of their population counterparts $\beta$, $\Sigma_Y = \Gamma(\Omega + \eta\Sigma_X\eta^{T})\Gamma^{T} + \Sigma_{E^{\perp}}$ and $\Sigma = \Sigma_E + \Sigma_{E^{\perp}}$.



We see from this lemma that the eigenvectors of both $\Sigma_Y$ and $\Sigma$ will be in either $\mathcal{E}_{\Sigma}(\mathcal{B})$ or its orthogonal complement, but from one of these matrices alone there is no way to tell in which subspace an eigenvector lies. However, we can tell by using $\Sigma$ and $\Sigma_Y$ together, which is the role of the partially maximized loglikelihood.

LEMMA 3. Under the envelope model (6),

$$\mathcal{E}_{\Sigma}(\mathcal{B}) = \mathrm{span}\Big[\arg\min_{V(\Sigma)} \{\log|G^{T}\Sigma G| + \log|G^{T}\Sigma_Y^{-1} G|\}\Big],$$

where $\min_{V(\Sigma)}$ means that the minimum is taken over all subsets of $u$ eigenvectors of $\Sigma$.

This lemma says that we can always find $\mathcal{E}_{\Sigma}(\mathcal{B})$ in the population by minimizing the loglikelihood function over all subsets of $u$ eigenvectors of $\Sigma$. Its proof is in Appendix B. Based on this, we determine our first set of starting values $G_0$ as the best set of $u$ eigenvectors of $S_R$,

$$G_0 = \arg\min_{V(S_R)} \{\log|G^{T} S_R G| + \log|G^{T} S_Y^{-1} G|\}.$$

When the number of subsets is too large for this to be practical, we use a sequential scheme that involves starting with a randomly selected subset of $u$ eigenvectors of $S_R$ and then updating each eigenvector in the subset from among those remaining, continuing for two or three iterations through the entire subset of $u$ eigenvectors.

The starting value $G_0$ works well when the signal is sufficiently large, but may perform poorly when the signal is small or $r$ is large. In those cases, we have found it useful to update $G_0$ by using the ordinary least squares estimator $B$ of $\beta$ and Krylov subspaces. First, project $B$ onto $\mathrm{span}(G_0)$ to obtain the update $B_0 = P_{G_0} B$, and then form the updated estimator of $\Sigma$,

$$S_{R,0} = (U - F B_0^{T})^{T} (U - F B_0^{T})/n,$$

where $U$ is the $n \times r$ matrix with rows $(Y - \bar{Y})^{T}$ and $F$ is the $n \times p$ matrix with rows $X^{T}$. Next, let $K_0 = (B_0, S_{R,0} B_0, S_{R,0}^2 B_0, \ldots)$, where the column dimension of $K_0$ must be at least $u$. The new starting value $G_1$ then consists of the first $u$ left singular vectors of $K_0$. The population rationale for the final step comes from Cook et al. (2007), who showed that there always exists an integer $k$, which is bounded by the number of eigenspaces of $\Sigma$, such that $\mathcal{E}_{\Sigma}(\mathcal{B}) = \mathrm{span}(\beta, \Sigma\beta, \ldots, \Sigma^{k-1}\beta)$.
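The update is a few lines of linear algebra; again a sketch under the notation above, not the authors' implementation.

```python
import numpy as np

def starting_value_G1(G0, B, U, F, u, n_blocks=None):
    """Krylov refinement of G0 using the ordinary least squares estimator B
    (r x p) and the centred data matrices U (n x r) and F (n x p)."""
    n = U.shape[0]
    B0 = (G0 @ G0.T) @ B                      # project B onto span(G0)
    R = U - F @ B0.T                          # updated residuals
    S_R0 = R.T @ R / n                        # updated estimator of Sigma
    k = n_blocks or u                         # k*p columns, at least u
    blocks = [B0]
    for _ in range(k - 1):
        blocks.append(S_R0 @ blocks[-1])
    K0 = np.hstack(blocks)                    # (B0, S_R0 B0, S_R0^2 B0, ...)
    Uleft, _, _ = np.linalg.svd(K0, full_matrices=False)
    return Uleft[:, :u]                       # first u left singular vectors
```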


To illustrate the behaviour of the starting value $G_1$, we simulated data from model (6) with $r = 100$ responses, $p = 3$ predictors and an envelope of dimension $u = 4$. Let $W_1 \in \mathbb{R}^{r \times u}$, $W_2 \in \mathbb{R}^{u \times p}$ and $W_3 \in \mathbb{R}^{u \times u}$ be matrices of independent uniform $(0, 1)$ random variables. Then the various matrices in the model were generated as follows: $\Gamma = W_1(W_1^{T} W_1)^{-1/2}$, $\eta = W_2$, $\Omega = 0.01 W_3 W_3^{T}$, $\Omega_0 = 0.64 I_{r-u}$ and $\alpha = 0$. The predictors were also generated as vectors of independent uniform random variables, centred to have mean zero in the sample. To gain a feeling for the sizes of the various matrices involved in this computation, we take the expectations $E_w$ with respect to the matrices $W_j$ used in their construction: $\mathrm{tr}\{E_w(\Sigma_E)\} \approx 0.05$, $\mathrm{tr}\{E_w(\Sigma_{E^{\perp}})\} \approx 61$ and $\mathrm{tr}\{E_w(\Gamma\eta\eta^{T}\Gamma^{T})\} \approx 0.33$. From this we see that the variation $\Sigma_E$ in the material part of $Y$ will tend to be small relative to the variation $\Sigma_{E^{\perp}}$ in the immaterial part of $Y$, while the signal is larger than $\Sigma_E$ but is still small relative to $\Sigma_{E^{\perp}}$. Recalling the discussion of §2, this is the kind of setting in which envelopes can provide substantial gains in efficiency.

To illustrate the advantages of the proposed starting values, we generated $n = 1000$ observations from the above simulation model and started iteration at $G_1$ and at a randomly selected starting value $\tilde{G} = Z(Z^{T} Z)^{-1/2}$, where $Z \in \mathbb{R}^{r \times u}$ is a matrix of independent standard normal random variables. The two starting values converged to the same solution, the maximum angle between $\hat{\mathcal{E}}_{\Sigma}(\mathcal{B})$ and $\mathcal{E}_{\Sigma}(\mathcal{B})$ being only 6 degrees. However, starting at $\tilde{G}$ required about twice the number of iterations to reach convergence as when starting at $G_1$, and the loglikelihood increased by about 8000 units when starting from $\tilde{G}$, but increased only 3 units when starting from $G_1$. We repeated this numerical experiment with $n = 300$ observations, which is a fairly small sample size in view of the number of responses. In that case, starting from $G_1$ converged to a solution $\hat{\mathcal{E}}_{\Sigma}(\mathcal{B})$ that was only about 12 degrees away from the true subspace $\mathcal{E}_{\Sigma}(\mathcal{B})$. The algorithm also converged when starting from $\tilde{G}$, but it reached a local solution that was approximately 87 degrees away from the true subspace. Generally, our experience indicates that random starts are not very helpful, as they tend to reach local maxima.

4. ASYMPTOTIC APPROXIMATIONS AND THE BOOTSTRAP

In this section, we report a few of our results from a simulation study to investigate the accuracy of the asymptotic variance $\mathrm{avar}\{n^{1/2}\mathrm{vec}(\hat{\beta}_1)\}$ presented in Proposition 1. We simulated data from model (5) with $r = 10$, $p = 10$, $p_1 = 1$, $\alpha = 0$, $\eta = 1$ and the elements of $\Gamma \in \mathbb{R}^{10 \times 1}$ and $\beta_2 \in \mathbb{R}^{10 \times 9}$ selected once at the outset as independent standard normal variables. For each sample, the elements in $X \in \mathbb{R}^{10}$ were generated as 0 or 1 each with probability 1/2. The covariance matrix $\Sigma$ had one small eigenvalue, 0.0006, with corresponding eigenvector $\Gamma$, eight intermediate eigenvalues between 0.40 and 5.1, and one large eigenvalue of about 98.6. The actual variance of $\hat{\beta}_1$ was estimated as the sample variance of the estimates $\hat{\beta}_1^{(k)}$ ($k = 1, \ldots, 200$) from 200 replications of the simulation scenario for each sample size. We also estimated the sample variance of $\hat{\beta}_1$ based on 200 residual bootstrap samples from one of the 200 replications. The results are shown in Fig. 2 for the four values of $u$ that were used to construct the estimators. For clarity, we let $u_0$ denote the true value of $u$. In the simulation model, $u_0 = 1$.

The vertical axes of Figs. 2(a)–(d) are the standard deviation for one element of $\hat{\beta}_1$ and the horizontal axes are the sample size. The results shown in Figs. 2(a)–(d) illustrate the general conclusions that we reached from our simulation study. Because $\Gamma$ corresponds to a relatively small eigenvalue, we expected that the asymptotic variability of the envelope estimator with $u = u_0 = 1$ would be much smaller than that of the standard estimator. That expectation is confirmed by the results shown in Fig. 2(a), which also show that $\mathrm{avar}\{n^{1/2}\mathrm{vec}(\hat{\beta}_1)\}$ can give a very good approximation of the actual variance when $u = u_0$. Figures 2(b)–(d) show that the envelope estimator can still give substantial gains over the standard estimator when $u > u_0$. This typically happens when the estimated envelope avoids the larger eigenvalues of $\Sigma$, as was the case in the simulation. Nevertheless, when $u > u_0$, the actual variability of the envelope estimator can be substantially larger than its asymptotic approximation. The residual bootstrap is a reliable method for estimating the actual variance of $\hat{\beta}_1$, regardless of the relation between $u$ and $u_0$.

We also studied how the error distribution might affect the performance of the envelope estimator. The simulation scenario was identical to that described for Fig. 2, except that $\varepsilon$ was generated as $\Sigma^{1/2}\varepsilon^{*}$, where the elements of $\varepsilon^{*}$ were independent and identically distributed standard normal, $t_6/(3/2)^{1/2}$, $12^{1/2}\{U(0, 1) - 0.5\}$ or $(\chi^2_4 - 4)/8^{1/2}$ random variables. The results shown in Fig. 3 indicate that the performance of the envelope estimator is quite robust.
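The residual bootstrap used here is standard. The sketch below is ours, with a hypothetical fitting routine fit(Y, X) that returns the envelope estimate of $\beta$ and the fitted values; residuals are resampled with the predictors held fixed.

```python
import numpy as np

def residual_bootstrap_se(Y, X, fit, n_boot=200, seed=0):
    """Residual bootstrap standard errors for an envelope estimator of beta.

    `fit` is a hypothetical routine returning (beta_hat, fitted) for data
    (Y, X); residual rows are resampled with replacement and the model refit."""
    rng = np.random.default_rng(seed)
    beta_hat, fitted = fit(Y, X)
    resid = Y - fitted
    reps = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(Y), size=len(Y))
        Y_star = fitted + resid[idx]          # resample residuals, keep X fixed
        beta_star, _ = fit(Y_star, X)
        reps.append(beta_star.ravel())
    return np.asarray(reps).std(axis=0, ddof=1)
```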

Fig. 2. Simulation results on the asymptotic standard deviation of an element of $\hat{\beta}_1$, with (a) $u = 1$, (b) $u = 3$, (c) $u = 5$ and (d) $u = 7$. The horizontal dashed line at about 0.04 marks the standard deviation of the standard model estimator and the dashed line just above the horizontal axis marks the asymptotic standard deviation of the envelope model estimator. The solid line corresponds to the estimated actual standard deviation of the envelope estimator and the line marked with circles corresponds to the bootstrap standard deviation.

5. EXAMPLE

The dataset is from Johnson & Wichern (2007) and is on properties of pulp fibres and the paper made from them. The reason for choosing this example is its richness, reflecting multiple results within the same context. The data have 62 measurements on four paper properties: breaking length, elastic modulus, stress at failure and burst strength. The predictors are three properties of fibre: arithmetic fibre length, long fibre fraction and fine fibre fraction.

First, we fitted an envelope model using all the predictors. Likelihood ratio testing suggested taking $u = 2$. The ratio of the asymptotic standard deviation from the standard model to that from the envelope model was computed for each element in $\beta$; the ratios range from 0.98 to 1.10, with an average of 1.03. This suggests that we do not gain much efficiency by fitting the envelope. The reason is apparent from the estimated structure of $\Sigma$: the eigenvalues of $\hat{\Sigma}_E$ are 4.9532 and 0.0143, whereas the eigenvalues for $\hat{\Sigma}_{E^{\perp}}$ are 0.1007 and 0.0060. So, the part of $Y$ that is material to $X$ is no less variable than the immaterial part, and not much efficiency is gained from enveloping $\beta$.

Next, we fitted the partial envelope models to each column of $\beta$. We started with the fine fibre fraction. Likelihood ratio testing selected $u_1 = 1$. The asymptotic standard deviation ratios between the standard model and the partial envelope model for the elements in the third column of $\beta$ are 63.59, 6.79, 10.40 and 7.49. Substantial reduction is thus achieved when attention is focused on fine fibre fraction, because the part of $Y$ that is material to this predictor is much less variable than the immaterial part. A close look at $\hat{\Sigma}$ reveals that $\hat{\Sigma}_{E_1}$ has eigenvalue 0.0149 while $\hat{\Sigma}_{E_1^{\perp}}$ has eigenvalues 11.0981, 0.1008 and 0.0070.

As we indicated in §4, the actual variance can be estimated by the bootstrap variance. A simulation with 200 bootstrap replicates was run to investigate the actual variance of $\hat{\beta}_1$.

Fig. 3. Simulation results for four error distributions: normal, $t$, uniform and chi-squared errors. The contents of the plots are as described for Fig. 2, except that the standard deviation of the standard model estimator is not shown.

Under the partial envelope model, although the bootstrap standard deviations for the elements in $\hat{\beta}_1$ are 9.70, 2.29, 2.57 and 1.49 times as large as their asymptotic counterparts, they are still smaller than the asymptotic standard deviations for the standard model, by factors of 6.56, 2.97, 4.05 and 5.03.

Next, the partial envelope was fitted to arithmetic fibre length, and we inferred that $u_1 = 0$. This means that, with the other two predictors present, the paper properties are invariant to changes in arithmetic fibre length. The test of the hypothesis $u_1 = 0$ under the partial envelope model is equivalent to the test of the hypothesis $\beta_1 = 0$ under the standard model.

Finally, we applied the partial envelope model to the long fibre fraction. The estimated envelope had dimension two and was only a small angle apart from the envelope we fitted in the first place. The standard deviation ratios between the envelope model and the partial envelope model for the second column of $\beta$ are 0.9985, 0.9994, 0.9980 and 1.0046. This illustrates our statement in Proposition 2 that when $\mathcal{E}_{\Sigma}(\mathcal{B}) = \mathcal{E}_{\Sigma}(\mathcal{B}_1)$, the partial envelope model cannot outperform the envelope model. Lemma 1 shows that $\mathcal{E}_{\Sigma}(\mathcal{B})$ can be decomposed into $\mathcal{E}_{\Sigma}(\mathcal{B}_1) + \mathcal{E}_{\Sigma}(\mathcal{B}_2) + \mathcal{E}_{\Sigma}(\mathcal{B}_3)$, where $\mathcal{B}_i$ represents the space spanned by the $i$th column of $\beta$. Recall that the dimensions of $\mathcal{E}_{\Sigma}(\mathcal{B})$, $\mathcal{E}_{\Sigma}(\mathcal{B}_1)$, $\mathcal{E}_{\Sigma}(\mathcal{B}_2)$ and $\mathcal{E}_{\Sigma}(\mathcal{B}_3)$ were inferred to be 2, 0, 2 and 1, respectively. Then $\mathcal{E}_{\Sigma}(\mathcal{B}_3)$ is forced to lie within $\mathcal{E}_{\Sigma}(\mathcal{B}_2)$, and the angle between the sample versions of the two is around 8 degrees.

This example illustrates situations in which we will or will not expect to get a significant reduction from fitting the partial envelope model. Basically, when $\Sigma_{E_1^{\perp}}$ has at least one large eigenvalue, massive reduction in variance is a typical result of applying the partial envelope model. But if a large eigenvalue is associated with $\Sigma_{E_1}$, we may achieve no noticeable reduction. In the application context, we found that partial envelopes significantly reduced the standard errors of the coefficients of fine fibre fraction.


While envelopes transform equivariantly under symmetric linear transformations of the response that commute with $\Sigma$, they are not equivariant for all linear transformations (Cook et al., 2010). Similarly, a partial envelope may not transform equivariantly under scale changes of the response and, for this reason, it may be advantageous to choose commensurate scales. Nevertheless, as illustrated in this example, useful results are often obtained using the original measurement scales.

ACKNOWLEDGEMENT

We are grateful to the editor and two referees for their insightful suggestions and comments that helped us improve the paper. This work was supported in part by a grant from the U.S. National Science Foundation.

APPENDIX A
Maximum likelihood estimators

As described in §3.2, $\mathcal{E}_{\Sigma}(\mathcal{B}_1)$ is the same as the full envelope in the model $\hat{R}_{Y|2} = \beta_1 \hat{R}_{1|2} + \varepsilon$. Following the derivation in §4.2 of Cook et al. (2010), with their $Y$ and $X$ replaced by $\hat{R}_{Y|2}$ and $\hat{R}_{1|2}$, $\hat{\mathcal{E}}_{\Sigma}(\mathcal{B}_1)$ can be obtained by minimizing the following function over $\mathcal{S} \in \mathcal{G}(u_1, r)$:

$$\log|P_{\mathcal{S}} S_{R|2} P_{\mathcal{S}}|_0 + \log|Q_{\mathcal{S}} S_{Y|2} Q_{\mathcal{S}}|_0, \qquad (A1)$$

where $S_{R|2} = U^{T} Q_{F_2} Q_{\tilde{F}} Q_{F_2} U/n$, $S_{Y|2} = U^{T} Q_{F_2} U/n$ and $\tilde{F} = Q_{F_2} F_1$; here $F_1$ is the $n \times p_1$ matrix with rows $X_1^{T}$, $F_2$ is the $n \times p_2$ matrix with rows $X_2^{T}$ and $U$ is the $n \times r$ matrix with rows $(Y - \bar{Y})^{T}$.

After obtaining $\hat{P}_{E_1}$ from optimizing (A1), $\hat{\beta}_1 = \hat{P}_{E_1} \tilde{\beta}_1$, where $\tilde{\beta}_1$ is the ordinary maximum likelihood estimator of the coefficients for $X_1$. Let $\hat{\Gamma}$ be a semi-orthogonal basis matrix for $\hat{\mathcal{E}}_{\Sigma}(\mathcal{B}_1)$. Then $\hat{\eta} = \hat{\Gamma}^{T}\tilde{\beta}_1$, $\hat{\Omega} = \hat{\Gamma}^{T} S_{R|2} \hat{\Gamma}$, $\hat{\Omega}_0 = \hat{\Gamma}_0^{T} S_{Y|2} \hat{\Gamma}_0$, $\hat{\Sigma}_{E_1} = \hat{\Gamma}\hat{\Omega}\hat{\Gamma}^{T}$ and $\hat{\Sigma}_{E_1^{\perp}} = \hat{\Gamma}_0\hat{\Omega}_0\hat{\Gamma}_0^{T}$. Having derived $\hat{\beta}_1$, the maximum likelihood estimator of $\beta_2$ is $\hat{\beta}_2 = (U - F_1\hat{\beta}_1^{T})^{T} F_2 (F_2^{T} F_2)^{-1}$. Substituting all the above estimators into the loglikelihood function, with a fixed dimension $d$ of the envelope, the maximized loglikelihood is equal to

$$\hat{L}(d) = -(nr/2)\{1 + \log(2\pi)\} - (n/2)\log|\hat{P}_{E_1} S_{R|2} \hat{P}_{E_1}|_0 - (n/2)\log|\hat{Q}_{E_1} S_{Y|2} \hat{Q}_{E_1}|_0.$$

APPENDIX B

Proofs

Proof of Proposition 1. Because of the overparameterization in (5), we use Proposition 4.1 of Shapiro (1986) to derive the asymptotic distributions. For simplicity, we denote $\mathrm{vec}(\beta_2)$, $\mathrm{vec}(\eta)$, $\mathrm{vec}(\Gamma)$, $\mathrm{vech}(\Omega)$ and $\mathrm{vech}(\Omega_0)$ as $\phi_0$, $\phi_1$, $\phi_2$, $\phi_3$ and $\phi_4$, respectively, and then we combine them into the vector $\phi = (\phi_0^{T}, \phi_1^{T}, \phi_2^{T}, \phi_3^{T}, \phi_4^{T})^{T}$. Here, vec and vech are the vector and vector-half operators defined by Henderson & Searle (1979). Let

$$h(\phi) = \begin{pmatrix} h_0(\phi) \\ h_1(\phi) \\ h_2(\phi) \end{pmatrix} = \begin{pmatrix} \mathrm{vec}(\beta_2) \\ \mathrm{vec}(\Gamma\eta) \\ \mathrm{vech}(\Gamma\Omega\Gamma^{T} + \Gamma_0\Omega_0\Gamma_0^{T}) \end{pmatrix}.$$

Then $n^{1/2}\{h(\hat{\phi}) - h(\phi)\}$ converges in distribution to $N(0, S_0)$, where $S_0 = H(H^{T} J H)^{\dagger} H^{T}$, with $(\cdot)^{\dagger}$ a generalized inverse and $H = (\partial h_i/\partial \phi_j)$ the gradient matrix ($i = 0, 1, 2$; $j = 0, 1, 2, 3, 4$),

$$H = \begin{pmatrix} I_{rp_2} & 0 & 0 & 0 & 0 \\ 0 & I_{p_1} \otimes \Gamma & \eta^{T} \otimes I_r & 0 & 0 \\ 0 & 0 & 2C_r(\Gamma\Omega \otimes I_r - \Gamma \otimes \Gamma_0\Omega_0\Gamma_0^{T}) & C_r(\Gamma \otimes \Gamma)E_{u_1} & C_r(\Gamma_0 \otimes \Gamma_0)E_{r-u_1} \end{pmatrix}.$$

The Fisher information for $\{\mathrm{vec}(\beta_2)^{T}, \mathrm{vec}(\beta_1)^{T}, \mathrm{vech}(\Sigma)^{T}\}^{T}$ in the standard model is

$$J = \begin{pmatrix} \Sigma_{22} \otimes \Sigma^{-1} & \Sigma_{21} \otimes \Sigma^{-1} & 0 \\ \Sigma_{12} \otimes \Sigma^{-1} & \Sigma_{11} \otimes \Sigma^{-1} & 0 \\ 0 & 0 & \tfrac{1}{2} E_r^{T}(\Sigma^{-1} \otimes \Sigma^{-1}) E_r \end{pmatrix},$$

where $C_r \in \mathbb{R}^{r(r+1)/2 \times r^2}$ and $E_r \in \mathbb{R}^{r^2 \times r(r+1)/2}$ provide the contraction and expansion matrices for the vec and vech operators: for any symmetric $r \times r$ matrix $A$, $\mathrm{vech}(A) = C_r \mathrm{vec}(A)$ and $\mathrm{vec}(A) = E_r \mathrm{vech}(A)$. The asymptotic variances for $\hat{\beta}_1$ and $\hat{\beta}_2$ are the first two diagonal blocks of $S_0$. After some matrix multiplication, we have
$$\mathrm{avar}\{n^{1/2}\mathrm{vec}(\hat{\beta}_1)\} = \Sigma_{1|2}^{-1} \otimes \Sigma_{E_1} + (\eta^{T} \otimes \Gamma_0)\, M^{-1}(\Sigma_{1|2})\, (\eta \otimes \Gamma_0^{T}),$$

$$\mathrm{avar}\{n^{1/2}\mathrm{vec}(\hat{\beta}_2)\} = \Sigma_{2|1}^{-1} \otimes \Sigma + (\Sigma_{22}^{-1}\Sigma_{21}\eta^{T} \otimes \Gamma_0)\, M^{-1}(\Sigma_{11})\, (\eta\Sigma_{12}\Sigma_{22}^{-1} \otimes \Gamma_0^{T}).$$
Proof of Proposition 2. When $\mathcal{E}_{\Sigma}(\mathcal{B}) = \mathbb{R}^r$, the full envelope model is the same as the standard multivariate linear model. Then $\mathrm{avar}\{n^{1/2}\mathrm{vec}(\hat{\beta}_{1e})\}$ is the corresponding diagonal block of $J^{-1}$. From the proof of Proposition 1, $\mathrm{avar}\{n^{1/2}\mathrm{vec}(\hat{\beta}_1)\}$ is the corresponding diagonal block of $H(H^{T}JH)^{\dagger}H^{T}$. As $J^{1/2}\{J^{-1} - H(H^{T}JH)^{\dagger}H^{T}\}J^{1/2} = Q_{J^{1/2}H} \geq 0$, we have $J^{-1} \geq H(H^{T}JH)^{\dagger}H^{T}$. So $\mathrm{avar}\{n^{1/2}\mathrm{vec}(\hat{\beta}_1)\} \leq \mathrm{avar}\{n^{1/2}\mathrm{vec}(\hat{\beta}_{1e})\}$.

If $\mathcal{E}_{\Sigma}(\mathcal{B}) = \mathcal{E}_{\Sigma}(\mathcal{B}_1)$, the full and partial envelope models have the same envelope, and then the parameters $\Gamma$, $\Gamma_0$, $\Omega$ and $\Omega_0$ are the same in both models. We write $\eta$ as $(\eta_1, \eta_2)$. As $\mathrm{vec}(\beta_1) = \{(I_{p_1}, 0) \otimes I_r\}\mathrm{vec}(\beta)$,

$$\mathrm{avar}\{n^{1/2}\mathrm{vec}(\hat{\beta}_{1e})\} = \{(I_{p_1}, 0) \otimes I_r\}\,\mathrm{avar}\{n^{1/2}\mathrm{vec}(\hat{\beta})\}\,\{(I_{p_1}, 0)^{T} \otimes I_r\} = \Sigma^{11} \otimes \Sigma_{E_1} + (\eta_1^{T} \otimes \Gamma_0)\, M_{(e)}^{-1}\, (\eta_1 \otimes \Gamma_0^{T}),$$

where $\Sigma^{11}$ is the $(1, 1)$ block of $\Sigma_X^{-1}$ and $M_{(e)} = \eta\Sigma_X\eta^{T} \otimes \Omega_0^{-1} + \Omega \otimes \Omega_0^{-1} + \Omega^{-1} \otimes \Omega_0 - 2 I_{u_1} \otimes I_{r-u_1}$, and

$$\mathrm{avar}\{n^{1/2}\mathrm{vec}(\hat{\beta}_1)\} = \Sigma_{1|2}^{-1} \otimes \Sigma_{E_1} + (\eta_1^{T} \otimes \Gamma_0)\, M_{11}^{-1}\, (\eta_1 \otimes \Gamma_0^{T}),$$

where $M_{11} = \eta_1\Sigma_{11}\eta_1^{T} \otimes \Omega_0^{-1} + \Omega \otimes \Omega_0^{-1} + \Omega^{-1} \otimes \Omega_0 - 2 I_{u_1} \otimes I_{r-u_1}$. As stated in §3.3, we assume $\Sigma_{12} = 0$ without loss of generality, so that $\Sigma^{11} = \Sigma_{11}^{-1} = \Sigma_{1|2}^{-1}$. Then $\eta\Sigma_X\eta^{T} = \eta_1\Sigma_{11}\eta_1^{T} + \eta_2\Sigma_{22}\eta_2^{T} \geq \eta_1\Sigma_{11}\eta_1^{T}$, so $M_{(e)} \geq M_{11}$ and we have $\mathrm{avar}\{n^{1/2}\mathrm{vec}(\hat{\beta}_1)\} \geq \mathrm{avar}\{n^{1/2}\mathrm{vec}(\hat{\beta}_{1e})\}$ because the other terms are the same.

Proof of Lemma 3. Let $\Gamma$ be a semi-orthogonal basis matrix for $\mathcal{E}_{\Sigma}(\mathcal{B})$, and let $G$ be an $r \times u$ semi-orthogonal matrix with $G_0$ a basis of the orthogonal complement of its span. Then we have

$$\log|G^{T}\Sigma G| + \log|G^{T}\Sigma_Y^{-1} G| = \log|G^{T}\Sigma G| + \log|G_0^{T}\Sigma_Y G_0| - \log|\Sigma_Y|.$$

As $\Sigma_Y = \Sigma + \Gamma\eta\Sigma_X\eta^{T}\Gamma^{T}$,

$$\log|G^{T}\Sigma G| + \log|G_0^{T}\Sigma_Y G_0| = \log|G^{T}\Sigma G| + \log|G_0^{T}\Sigma G_0| + \log|I_{r-u} + (G_0^{T}\Sigma G_0)^{-1/2} G_0^{T}\Gamma\eta\Sigma_X\eta^{T}\Gamma^{T} G_0 (G_0^{T}\Sigma G_0)^{-1/2}| = \log|\Sigma| + \log|I_{r-u} + (G_0^{T}\Sigma G_0)^{-1/2} G_0^{T}\Gamma\eta\Sigma_X\eta^{T}\Gamma^{T} G_0 (G_0^{T}\Sigma G_0)^{-1/2}|,$$

where the last equality holds because, when the columns of $G$ are a subset of $u$ eigenvectors of $\Sigma$, $|G^{T}\Sigma G|\,|G_0^{T}\Sigma G_0| = |\Sigma|$. The objective function takes its minimum at $\mathrm{span}(G) = \mathrm{span}(\Gamma)$, because this makes the second term zero; otherwise that term will be positive. As $\Gamma$ is a subset of $u$ eigenvectors of $\Sigma$, we can search all the subsets of $u$ eigenvectors of $\Sigma$ to obtain the minimum.

REFERENCES

CONWAY, J. (1990). A Course in Functional Analysis. New York: Springer.
COOK, R. & WEISBERG, S. (1982). Residuals and Influence in Regression. New York: Chapman and Hall.
COOK, R., LI, B. & CHIAROMONTE, F. (2007). Dimension reduction in regression without matrix inversion. Biometrika 94, 569–84.
COOK, R., LI, B. & CHIAROMONTE, F. (2010). Envelope models for parsimonious and efficient multivariate linear regression. Statist. Sinica 20, 927–1010.
EDELMAN, A., ARIAS, T. & SMITH, S. (1999). The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl. 20, 303–53.
HENDERSON, H. & SEARLE, S. (1979). Vec and vech operators for matrices, with some uses in Jacobians and multivariate statistics. Can. J. Statist. 7, 65–81.
JOHNSON, R. & WICHERN, D. (2007). Applied Multivariate Statistical Analysis. Upper Saddle River, NJ: Prentice Hall.
LIU, X., SRIVASTAVA, A. & GALLIVAN, K. (2004). Optimal linear representations of images for object recognition. IEEE Trans. Pat. Anal. 26, 662–6.
SHAPIRO, A. (1986). Asymptotic theory of overparameterized structural models. J. Am. Statist. Assoc. 81, 142–9.
SMALL, C., WANG, J. & YANG, Z. (2000). Eliminating multiple root problems in estimation. Statist. Sci. 15, 313–32.

[Received February 2010. Revised July 2010]
