
Technical Report 73

IMSV, University of Bern


Adaptive Confidence Sets for the Optimal
Approximating Model
arXiv:0802.3276v3 [math.ST] 7 Oct 2009

Angelika Rohde
Universität Hamburg
Department Mathematik
Bundesstraße 55
D-20146 Hamburg
Germany
e-mail: angelika.rohde@math.uni-hamburg.de

and

Lutz Dümbgen
Universität Bern
Institut für Mathematische Statistik und Versicherungslehre
Sidlerstrasse 5
CH-3012 Bern
Switzerland
e-mail: duembgen@stat.unibe.ch

Universität Hamburg and Universität Bern

Abstract: In the setting of high-dimensional linear models with Gaussian noise, we investigate the possibility of confidence statements connected to model selection. Although there exist numerous procedures for adaptive (point) estimation, the construction of adaptive confidence regions is severely limited (cf. Li, 1989). The present paper sheds new light on this gap. We develop exact and adaptive confidence sets for the best approximating model in terms of risk. One of our constructions is based on a multiscale procedure and a particular coupling argument. Utilizing exponential inequalities for noncentral $\chi^2$-distributions, we show that the risk and quadratic loss of all models within our confidence region are uniformly bounded by the minimal risk times a factor close to one.

AMS 2000 subject classifications: 62G15, 62G20.
Keywords and phrases: Adaptivity, confidence sets, coupling, exponential inequality, model selection, multiscale inference, risk optimality.


This work was supported by the Swiss National Science Foundation.
A. Rohde and L. Dümbgen / Confidence Sets for the Best Approximating Model

1. Introduction

When dealing with a high-dimensional observation vector, the natural question arises whether the data-generating process can be approximated by a model of substantially lower dimension. Rather than on the true model, the focus here is on smaller ones which still contain the essential information and allow for interpretation. Typically, the models under consideration are characterized by the non-zero components of some parameter vector. Estimating the true model requires the rather idealistic situation that each component either equals zero or has sufficiently large modulus: a tiny perturbation of the parameter vector may result in the biggest model, so the question about the true model does not seem to be adequate in general. Alternatively, the model which is optimal in terms of risk appears as a target of many model selection strategies. Within a specified class of competing models, this paper is concerned with confidence regions for those approximating models which are optimal in terms of risk.
Suppose that we observe a random vector $X_n = (X_{in})_{i=1}^n$ with distribution $N_n(\theta_n, \sigma^2 I_n)$, together with an estimator $\hat\sigma_n$ for the standard deviation $\sigma > 0$. Often the signal $\theta_n$ represents coefficients of an unknown smooth function with respect to a given orthonormal basis of functions.

There is a vast amount of literature on point estimation of $\theta_n$. For a given estimator $\hat\theta_n = \hat\theta_n(X_n, \hat\sigma_n)$ of $\theta_n$, let

$$L(\hat\theta_n, \theta_n) := \|\hat\theta_n - \theta_n\|^2 \quad\text{and}\quad R(\hat\theta_n, \theta_n) := E\,L(\hat\theta_n, \theta_n)$$

be its quadratic loss and the corresponding risk, respectively. Here $\|\cdot\|$ denotes the standard Euclidean norm of vectors. Various adaptivity results are known for this setting, often in terms of oracle inequalities. A typical result reads as follows: Let $(\hat\theta_n^{(c)})_{c \in \mathcal{C}_n}$ be a family of candidate estimators $\hat\theta_n^{(c)} = \hat\theta_n^{(c)}(X_n)$ for $\theta_n$, where $\sigma > 0$ is temporarily assumed to be known. Then there exist estimators $\hat\theta_n$ and constants $A_n, B_n = O(\log(n)^\gamma)$ with $\gamma \ge 0$ such that for arbitrary $\theta_n$ in a certain set $\Theta_n \subset \mathbb{R}^n$,

$$R(\hat\theta_n, \theta_n) \le A_n \inf_{c \in \mathcal{C}_n} R(\hat\theta_n^{(c)}, \theta_n) + B_n \sigma^2.$$

Results of this type are provided, for instance, by Polyak and Tsybakov (1991) and Donoho and Johnstone (1994, 1995, 1998), and in the framework of Gaussian model selection by Birgé and Massart (2001). The latter article copes in particular with the fact that a model is not necessarily true. Further results of this type, partly in different settings, have been provided by Stone (1984), Lepski et al. (1997), Efromovich (1998) and Cai (1999, 2002), to mention just a few.
By way of contrast, when aiming at adaptive confidence sets one faces severe limitations. Here is a result of Li (1989), slightly rephrased: Suppose that $\Theta_n$ contains a closed Euclidean ball $B(\theta_n^o, c\,n^{1/4})$ around some vector $\theta_n^o \in \mathbb{R}^n$ with radius $c\,n^{1/4} > 0$. Still assuming $\sigma$ to be known, let $D_n = D_n(X_n) \subset \Theta_n$ be a $(1-\alpha)$-confidence set for $\theta_n \in \Theta_n$. Such a confidence set may be used as a test of the (Bayesian) null hypothesis that $\theta_n$ is uniformly distributed on the sphere $B(\theta_n^o, c\,n^{1/4})$ versus the alternative that $\theta_n = \theta_n^o$: We reject this null hypothesis at level $\alpha$ if $\|\eta - \theta_n^o\| < c\,n^{1/4}$ for all $\eta \in D_n$. Since this test cannot have larger power than the corresponding Neyman-Pearson test,

$$P_{\theta_n^o}\Bigl( \sup_{\eta \in D_n} \|\eta - \theta_n^o\| < c\,n^{1/4} \Bigr) \le P\bigl( S_n^2 \le \chi^2_{n;\alpha}(c^2 n^{1/2}/\sigma^2) \bigr) \qquad (S_n^2 \sim \chi^2_n)$$
$$= \Phi\bigl( \Phi^{-1}(\alpha) + 2^{-1/2} c^2/\sigma^2 \bigr) + o(1),$$

where $\chi^2_{n;\alpha}(\delta^2)$ stands for the $\alpha$-quantile of the noncentral chi-squared distribution with $n$ degrees of freedom and noncentrality parameter $\delta^2$. Throughout this paper, asymptotic statements refer to $n \to \infty$. The previous inequality entails that no reasonable confidence set has a diameter of order $o_p(n^{1/4})$ uniformly over the parameter space $\Theta_n$, as long as the latter is sufficiently large. Despite these limitations, there is some literature on confidence sets in the present or similar settings; see for instance Beran (1996, 2000), Beran and Dümbgen (1998) and Genovese and Wasserman (2005).
Improving the rate of $O_p(n^{1/4})$ is only possible via additional constraints on $\theta_n$, i.e. considering substantially smaller sets $\Theta_n$. For instance, Baraud (2004) developed nonasymptotic confidence regions which perform well on finitely many linear subspaces. Robins and van der Vaart (2006) construct confidence balls via sample splitting which adapt to some extent to the unknown smoothness of $\theta_n$. In their context, $\Theta_n$ corresponds to a Sobolev smoothness class with given parameters $(\beta, L)$. However, adaptation in this context is possible only within a range $[\beta, 2\beta]$. Independently, Cai and Low (2006) treat the same problem in the special case of the Gaussian white noise model, obtaining the same kind of adaptivity in the broader scale of Besov bodies. Other possible constraints on $\theta_n$ are so-called shape constraints; see for instance Cai and Low (2007), Dümbgen (2003) or Hengartner and Stark (1995).

The question is whether one can bridge this gap between confidence sets and point estimators. More precisely, we would like to understand the possibility of adaptation for point estimators in terms of some confidence region for the set of all optimal candidate estimators $\hat\theta_n^{(c)}$. That means, we want to construct a confidence region $K_{n,\alpha} = K_{n,\alpha}(X_n, \hat\sigma_n) \subset \mathcal{C}_n$ for the set

$$K_n(\theta_n) := \operatorname*{argmin}_{c \in \mathcal{C}_n} R(\hat\theta_n^{(c)}, \theta_n) = \bigl\{ c \in \mathcal{C}_n : R(\hat\theta_n^{(c)}, \theta_n) \le R(\hat\theta_n^{(c')}, \theta_n) \text{ for all } c' \in \mathcal{C}_n \bigr\}$$

such that for arbitrary $\theta_n \in \mathbb{R}^n$,

$$P_{\theta_n}\bigl( K_n(\theta_n) \subset K_{n,\alpha} \bigr) \ge 1 - \alpha \qquad (1)$$

and

$$\max_{c \in K_{n,\alpha}} R(\hat\theta_n^{(c)}, \theta_n),\ \max_{c \in K_{n,\alpha}} L(\hat\theta_n^{(c)}, \theta_n) \;=\; O_p(A_n) \min_{c \in \mathcal{C}_n} R(\hat\theta_n^{(c)}, \theta_n) + O_p(B_n)\,\sigma^2. \qquad (2)$$

Solving this problem means that statistical inference about differences in the performance of estimators is possible, although inference about their risk and loss is severely limited. In some settings, selecting estimators out of a class of competing estimators entails estimating implicitly an unknown regularity or smoothness class for the underlying signal $\theta_n$. Computing a confidence region for good estimators is particularly suitable in situations in which several good candidate estimators fit the data equally well although they look different. This aspect of exploring various candidate estimators is not covered by the usual theory of point estimation.

Note that our confidence region $K_{n,\alpha}$ is required to contain the whole set $K_n(\theta_n)$, not just one element of it, with probability at least $1 - \alpha$. The same requirement is used by Futschik (1999) for inference about the argmax of a regression function.
The remainder of this paper is organized as follows. For the reader's convenience our approach is first described in a simple toy model in Section 2. In Section 3 we develop and analyze an explicit confidence region $K_{n,\alpha}$ related to $\mathcal{C}_n := \{0, 1, \ldots, n\}$ with candidate estimators

$$\hat\theta_n^{(k)} := \bigl( 1\{i \le k\}\, X_{in} \bigr)_{i=1}^n.$$

These correspond to a standard nested sequence of approximating models. Section 4 discusses richer families of candidate estimators. All proofs and auxiliary results are deferred to Sections 5 and 6.

2. A toy problem

Suppose we observe a stochastic process $Y = (Y(t))_{t \in [0,1]}$, where

$$Y(t) = F(t) + W(t), \qquad t \in [0,1],$$

with an unknown fixed continuous function $F$ on $[0,1]$ and a Brownian motion $W = (W(t))_{t \in [0,1]}$. We are interested in the set

$$S(F) := \operatorname*{argmin}_{t \in [0,1]} F(t).$$

Precisely, we want to construct a $(1-\alpha)$-confidence region $S_\alpha = S_\alpha(Y) \subset [0,1]$ for $S(F)$ in the sense that

$$P\bigl( S(F) \subset S_\alpha \bigr) \ge 1 - \alpha, \qquad (3)$$

regardless of $F$. To construct such a confidence set we regard $Y(s) - Y(t)$ for arbitrary different $s, t \in [0,1]$ as a test statistic for the null hypothesis that $s \in S(F)$, i.e. large values of $Y(s) - Y(t)$ give evidence for $s \notin S(F)$.

A first naive proposal is the set

$$S_\alpha^{\rm naive} := \Bigl\{ s \in [0,1] : Y(s) \le \min_{[0,1]} Y + \kappa_\alpha^{\rm naive} \Bigr\}$$

with $\kappa_\alpha^{\rm naive}$ denoting the $(1-\alpha)$-quantile of $\max_{[0,1]} W - \min_{[0,1]} W$.

Here is a refined version based on results of Dümbgen and Spokoiny (2001): Let $\kappa_\alpha$ be the $(1-\alpha)$-quantile of

$$\sup_{s,t \in [0,1]} \biggl( \frac{|W(s) - W(t)|}{\sqrt{|s-t|}} - \sqrt{2 \log\bigl( e/|s-t| \bigr)} \biggr). \qquad (4)$$

Then constraint (3) is satisfied by the confidence region $S_\alpha$ which consists of all $s \in [0,1]$ such that

$$Y(s) \le Y(t) + \sqrt{|s-t|}\, \Bigl( \sqrt{2 \log\bigl( e/|s-t| \bigr)} + \kappa_\alpha \Bigr) \quad\text{for all } t \in [0,1].$$
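For readers who want to experiment, the following self-contained Python sketch discretizes the toy model on a grid and compares the naive and multiscale confidence regions. All concrete choices here (grid size, number of simulations, the test function $F_o(t) = |t - 1/2|$, the scale $c_n = 30$, the subsampling of pairs) are illustrative assumptions, not taken from the paper; the quantiles $\kappa_\alpha^{\rm naive}$ and $\kappa_\alpha$ are themselves estimated by simulation.

```python
import math, random

random.seed(1)

N = 512                       # grid resolution (illustrative)
ts = [i / N for i in range(N + 1)]

def brownian(n):
    """Standard Brownian motion on {0, 1/n, ..., 1} via scaled Gaussian increments."""
    w = [0.0]
    for _ in range(n):
        w.append(w[-1] + random.gauss(0.0, 1.0) / math.sqrt(n))
    return w

def quantile(xs, p):
    xs = sorted(xs)
    return xs[min(len(xs) - 1, int(p * len(xs)))]

def multiscale_stat(w):
    # sup over (a subsample of) grid pairs of
    # |W(s)-W(t)|/sqrt|s-t| - sqrt(2 log(e/|s-t|))
    best = -float("inf")
    for i in range(0, N, 8):
        for j in range(i + 8, N + 1, 8):
            d = (j - i) / N
            val = abs(w[j] - w[i]) / math.sqrt(d) - math.sqrt(2 * math.log(math.e / d))
            best = max(best, val)
    return best

# Estimate the two (1 - alpha)-quantiles with alpha = 0.05 by simulation.
sims = [brownian(N) for _ in range(200)]
kappa_naive = quantile([max(w) - min(w) for w in sims], 0.95)
kappa = quantile([multiscale_stat(w) for w in sims], 0.95)

# Observe Y = c_n * F_o + W with F_o(t) = |t - 0.5| (unique minimizer 0.5).
c_n = 30.0
w = brownian(N)
y = [c_n * abs(t - 0.5) + w[i] for i, t in enumerate(ts)]

naive = [ts[i] for i in range(N + 1) if y[i] <= min(y) + kappa_naive]
refined = [ts[i] for i in range(N + 1)
           if all(y[i] <= y[j] + math.sqrt(abs(ts[i] - ts[j]))
                  * (math.sqrt(2 * math.log(math.e / abs(ts[i] - ts[j]))) + kappa)
                  for j in range(N + 1) if j != i)]

print(max(abs(t - 0.5) for t in naive), max(abs(t - 0.5) for t in refined))
```

In typical runs the multiscale region is markedly shorter than the naive one, in line with the comparison of (5) and (6) below.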

To illustrate the power of this method, consider for instance a sequence of functions $F = F_n = c_n F_o$ with positive constants $c_n$ and a fixed continuous function $F_o$ with unique minimizer $s_o$. Suppose that

$$\lim_{t \to s_o} \frac{F_o(t) - F_o(s_o)}{|t - s_o|^\gamma} = 1$$

for some $\gamma > 1/2$. Then the naive confidence region satisfies only

$$\max_{t \in S_\alpha^{\rm naive}} |t - s_o| = O_p\bigl( c_n^{-1/\gamma} \bigr), \qquad (5)$$

whereas

$$\max_{t \in S_\alpha} |t - s_o| = O_p\Bigl( \log(c_n)^{1/(2\gamma-1)}\, c_n^{-2/(2\gamma-1)} \Bigr). \qquad (6)$$
3. Confidence regions for nested approximating models

As in the introduction let $X_n = \theta_n + \varepsilon_n$ denote the $n$-dimensional observation vector with $\theta_n \in \mathbb{R}^n$ and $\varepsilon_n \sim N_n(0, \sigma^2 I_n)$. For any candidate estimator $\hat\theta_n^{(k)} = \bigl( 1\{i \le k\} X_{in} \bigr)_{i=1}^n$ the loss is given by

$$L_n(k) := L(\hat\theta_n^{(k)}, \theta_n) = \sum_{i=k+1}^n \theta_{in}^2 + \sum_{i=1}^k (X_{in} - \theta_{in})^2$$

with corresponding risk

$$R_n(k) := R(\hat\theta_n^{(k)}, \theta_n) = \sum_{i=k+1}^n \theta_{in}^2 + k\,\sigma^2.$$

Model selection usually aims at estimating a candidate estimator which is optimal in terms of risk. Since the risk depends on the unknown signal and therefore is not available, the selection procedure minimizes an unbiased risk estimator instead. In the sequel, the bias-corrected risk estimator for the candidate $\hat\theta_n^{(k)}$ is defined as

$$\hat R_n(k) := \sum_{i=k+1}^n \bigl( X_{in}^2 - \hat\sigma_n^2 \bigr) + k\,\hat\sigma_n^2,$$
where $\hat\sigma_n^2$ is a variance estimator satisfying the subsequent condition.

(A) $\hat\sigma_n^2$ and $X_n$ are stochastically independent with

$$\frac{m\,\hat\sigma_n^2}{\sigma^2} \sim \chi^2_m,$$

where $1 \le m = m_n \le \infty$, with $m = \infty$ meaning that $\sigma$ is known, i.e. $\hat\sigma_n^2 \equiv \sigma^2$. For asymptotic statements, it is generally assumed that

$$\gamma_n^2 := \frac{2n}{m_n} = O(1)$$

unless stated otherwise.
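The nested-model setup and the bias-corrected risk estimator can be sketched in a few lines of Python, here with $\sigma$ known (the case $m = \infty$, so $\hat\sigma_n^2 = \sigma^2$). The decaying signal $\theta_{in} = 2/i$ is an assumption for demonstration only.

```python
import random

random.seed(0)

n = 100
sigma = 1.0
theta = [2.0 / (i + 1) for i in range(n)]               # illustrative signal
x = [theta[i] + random.gauss(0.0, sigma) for i in range(n)]

def risk(k):
    # R_n(k) = sum_{i>k} theta_i^2 + k * sigma^2 (uses the unknown signal)
    return sum(t * t for t in theta[k:]) + k * sigma ** 2

def risk_hat(k):
    # bias-corrected estimator: sum_{i>k} (X_i^2 - sigma^2) + k * sigma^2
    return sum(xi * xi - sigma ** 2 for xi in x[k:]) + k * sigma ** 2

risks = [risk(k) for k in range(n + 1)]
ests = [risk_hat(k) for k in range(n + 1)]
k_opt = min(range(n + 1), key=risks.__getitem__)         # oracle choice
k_sel = min(range(n + 1), key=ests.__getitem__)          # data-driven choice
print(k_opt, k_sel, risks[k_opt])
```

The data-driven index $k_{\rm sel}$ fluctuates around the oracle index, which is exactly the phenomenon the confidence region $K_{n,\alpha}$ below is designed to quantify.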


Example. Suppose that we observe $Y = M\theta + \varepsilon$ with given design matrix $M \in \mathbb{R}^{(n+m) \times n}$ of rank $n$, unknown parameter vector $\theta \in \mathbb{R}^n$ and unobserved error vector $\varepsilon \sim N_{n+m}(0, \sigma^2 I_{n+m})$. Then the previous assumptions are satisfied by $X_n := (M^\top M)^{1/2} \hat\theta$ with $\hat\theta := (M^\top M)^{-1} M^\top Y$ and $\hat\sigma_n^2 := \|Y - M\hat\theta\|^2 / m$, where $\theta_n := (M^\top M)^{1/2} \theta$.

Important for our analysis is the behavior of the centered and rescaled difference process $\hat D_n = \bigl( \hat D_n(j,k) \bigr)_{0 \le j < k \le n}$ with

$$\hat D_n(j,k) := \frac{\hat R_n(j) - \hat R_n(k) - R_n(j) + R_n(k)}{\hat\sigma_n^2 \bigl( 4\|\theta_n/\sigma\|^2 + 2n \bigr)^{1/2}} = \frac{\sum_{i=j+1}^k \bigl( X_{in}^2 - \sigma^2 - \theta_{in}^2 \bigr) - 2(k-j)\bigl( \hat\sigma_n^2 - \sigma^2 \bigr)}{\hat\sigma_n^2 \bigl( 4\|\theta_n/\sigma\|^2 + 2n \bigr)^{1/2}}.$$

One may also write $\hat D_n(j,k) = (\sigma/\hat\sigma_n)^2 \bigl( D_n(j,k) + V_n(j,k) \bigr)$ with

$$D_n(j,k) := \frac{1}{\sqrt{4\|\theta_n/\sigma\|^2 + 2n}} \sum_{i=j+1}^k \Bigl( 2(\theta_{in}/\sigma)(\varepsilon_{in}/\sigma) + (\varepsilon_{in}/\sigma)^2 - 1 \Bigr), \qquad (7)$$

$$V_n(j,k) := \bigl( 4\|\theta_n/\sigma\|^2 + 2n \bigr)^{-1/2}\, 2(k-j)\bigl( 1 - \hat\sigma_n^2/\sigma^2 \bigr). \qquad (8)$$

This representation shows that the distribution of $\hat D_n$ depends on the degrees of freedom, $m$, and the unknown signal-to-noise vector $\theta_n/\sigma$. The process $D_n$ consists of partial sums of the independent, but in general non-identically distributed random variables $2(\theta_{in}/\sigma)(\varepsilon_{in}/\sigma) + (\varepsilon_{in}/\sigma)^2 - 1$. The standard deviation of $D_n(j,k)$ is given by

$$\rho_n(j,k) := \frac{1}{\sqrt{4\|\theta_n/\sigma\|^2 + 2n}} \Bigl( \sum_{i=j+1}^k \bigl( 4\theta_{in}^2/\sigma^2 + 2 \bigr) \Bigr)^{1/2}.$$

Note that $\rho_n(0,n) = 1$ by construction. To imitate the more powerful confidence region of Section 2 based on the multiscale approach, one needs a refined analysis of the increment process $D_n$. Since this process does not have subgaussian tails, the standardization is more involved than the correction in (4).
Theorem 1. Define $\Gamma_n(j,k) := \bigl( 2 \log\bigl( e/\rho_n(j,k)^2 \bigr) \bigr)^{1/2}$ for $0 \le j < k \le n$. Then

$$\sup_{0 \le j < k \le n} \frac{|\hat D_n(j,k)|}{\rho_n(j,k)} \le \sqrt{32 \log n} + O_p(1),$$

and for any fixed $c > 2$,

$$\hat d_n := \sup_{0 \le j < k \le n} \rho_n(j,k) \biggl( \frac{|\hat D_n(j,k)|}{\rho_n(j,k)} - \Gamma_n(j,k) - \frac{c\, \Gamma_n(j,k)^2}{\bigl( 4\|\theta_n/\sigma\|^2 + 2n \bigr)^{1/2} \rho_n(j,k)} \biggr)^+$$

is bounded in probability. In case of $\|\theta_n/\sigma\|^2 = O(n)$, $\mathcal{L}(\hat d_n)$ is weakly approximated by the law of

$$\tilde d_n := \sup_{0 \le j < k \le n} \rho_n(j,k) \biggl( \frac{|\Lambda_n(j,k)|}{\rho_n(j,k)} - \Gamma_n(j,k) \biggr)^+,$$

where

$$\Lambda_n(j,k) = W\bigl( \rho_n(0,k)^2 \bigr) - W\bigl( \rho_n(0,j)^2 \bigr) + \frac{2\gamma_n (k-j)}{\sqrt{n}\, \bigl( 4\|\theta_n/\sigma\|^2 + 2n \bigr)^{1/2}}\, Z$$

with a standard Brownian motion $W$ and a random variable $Z \sim N(0,1)$ independent of $W$.

The limiting distribution indicates that the additive correction term in the definition of $\hat d_n$ cannot be chosen essentially smaller. It will play a crucial role for the efficiency of the confidence region.
To construct a confidence set for $K_n(\theta_n)$ by means of $\hat d_n$, we are facing the problem that the auxiliary function $\rho_n(\cdot,\cdot)$ depends on the unknown signal-to-noise vector $\theta_n/\sigma$. In fact, knowing $\theta_n$ would imply knowledge of $K_n(\theta_n)$ already. A natural approach is to replace the quantities which are dependent on the unknown parameter by suitable estimates. A common estimator of the variance $\rho_n(j,k)^2$, $j < k$, is given by

$$\hat\rho_n(j,k)^2 := \Bigl( \sum_{i=1}^n \bigl( 4(X_{in}^2/\hat\sigma_n^2 - 1) + 2 \bigr) \Bigr)^{-1} \sum_{i=j+1}^k \bigl( 4(X_{in}^2/\hat\sigma_n^2 - 1) + 2 \bigr).$$

However, using such an estimator does not seem to work, since

$$\sup_{0 \le j < k \le n} \Bigl| \frac{\hat\rho_n(j,k)}{\rho_n(j,k)} - 1 \Bigr| \not\to_p 0$$

as $n$ goes to infinity. This can be verified by noting that the (rescaled) numerator of $\bigl( \hat\rho_n(j,k)^2 \bigr)_{0 \le j < k \le n}$ is, up to centering, essentially of the same structure as the rescaled difference process $\hat D_n$ itself.

The least favourable case of constant risk

The problem of estimating the set $\operatorname{argmin}_k R_n(k)$ can be cast into our toy model, where $Y(t)$, $F(t)$ and $W(t)$ correspond to $\hat R_n(k)$, $R_n(k)$ and the difference $\hat R_n(k) - R_n(k)$, respectively. One may expect that the more distinctive the global minima are, the easier it is to identify their location. Hence the case of constant risks appears to be least favourable, corresponding to the signal

$$\bar\theta_n := (\sigma)_{i=1}^n.$$

In this situation, each candidate estimator $\hat\theta_n^{(k)}$ has the same risk of $n\sigma^2$.

A related consideration leading to an explicit procedure is as follows: For fixed indices $0 \le j < k \le n$,

$$R_n(j) - R_n(k) = \sum_{i=j+1}^k \theta_{in}^2 - (k-j)\,\sigma^2,$$

and if Assumption (A) is satisfied, the statistic

$$T_{jkn} := \frac{\sum_{i=j+1}^k X_{in}^2}{(k-j)\,\hat\sigma_n^2} = 2 + \frac{\hat R_n(j) - \hat R_n(k)}{(k-j)\,\hat\sigma_n^2}$$

has a noncentral (in the numerator) $F$-distribution

$$F_{k-j,m}\Bigl( \sum_{i=j+1}^k \theta_{in}^2/\sigma^2 \Bigr) = F_{k-j,m}\Bigl( k - j + \frac{R_n(j) - R_n(k)}{\sigma^2} \Bigr)$$

with $k-j$ and $m$ degrees of freedom. Thus large or small values of $T_{jkn}$ give evidence for $R_n(j)$ being larger or smaller, respectively, than $R_n(k)$. Precisely,

$$\mathcal{L}_{\theta_n}(T_{jkn}) \begin{cases} \le_{\rm st.} \mathcal{L}_{\bar\theta_n}(T_{jkn}) & \text{whenever } j \in K_n(\theta_n), \\ \ge_{\rm st.} \mathcal{L}_{\bar\theta_n}(T_{jkn}) & \text{whenever } k \in K_n(\theta_n). \end{cases}$$
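A quick Monte Carlo check of the statistic $T_{jkn}$ under the least favourable signal: with $\theta_{in} \equiv \sigma$ the numerator has mean $(k-j)\,2\sigma^2$, so $T_{jkn}$ should average close to $2$, as the displayed $F$-distribution suggests. All numerical choices below ($n$, $j$, $k$, $m$, the number of replications) are illustrative.

```python
import random

random.seed(2)

n, j, k, sigma, m = 50, 10, 30, 1.0, 200

def t_stat(theta):
    # one realization of T_jkn = sum_{i=j+1}^k X_i^2 / ((k-j) hat_sigma^2)
    x = [theta[i] + random.gauss(0.0, sigma) for i in range(n)]
    # hat_sigma^2 with m degrees of freedom: sigma^2 * chi^2_m / m
    s2 = sigma ** 2 * sum(random.gauss(0.0, 1.0) ** 2 for _ in range(m)) / m
    return sum(xi ** 2 for xi in x[j:k]) / ((k - j) * s2)

flat = [sigma] * n                         # the constant-risk signal
mean_flat = sum(t_stat(flat) for _ in range(500)) / 500
print(round(mean_flat, 2))
```

The average is close to 2 up to Monte Carlo noise and the small bias $m/(m-2)$ coming from the random denominator.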

Note that this stochastic ordering remains valid if $\hat\sigma_n^2$ is just independent from $X_n$, i.e. also under the more general requirement of the remark at the end of this section. Via suitable coupling of Poisson mixtures of central $\chi^2$-distributed random variables, this observation is extended to a coupling for the whole process $\bigl( T_{jkn} \bigr)_{0 \le j < k \le n}$:

Proposition 2 (Coupling). For any $\theta_n \in \mathbb{R}^n$ there exists a probability space with random variables $\bigl( \tilde T_{jkn} \bigr)_{0 \le j < k \le n}$ and $\bigl( \bar T_{jkn} \bigr)_{0 \le j < k \le n}$ such that

$$\mathcal{L}\bigl( (\tilde T_{jkn})_{0 \le j < k \le n} \bigr) = \mathcal{L}_{\theta_n}\bigl( (T_{jkn})_{0 \le j < k \le n} \bigr), \qquad \mathcal{L}\bigl( (\bar T_{jkn})_{0 \le j < k \le n} \bigr) = \mathcal{L}_{\bar\theta_n}\bigl( (T_{jkn})_{0 \le j < k \le n} \bigr),$$

and for arbitrary indices $0 \le j < k \le n$,

$$\tilde T_{jkn} \begin{cases} \le \bar T_{jkn} & \text{whenever } j \in K_n(\theta_n), \\ \ge \bar T_{jkn} & \text{whenever } k \in K_n(\theta_n). \end{cases}$$

As a consequence of Proposition 2, we can define a confidence set for $K_n(\theta_n)$ based on this least favourable case. Let $\kappa_{n,\alpha}$ denote the $(1-\alpha)$-quantile of $\mathcal{L}_{\bar\theta_n}(\hat d_n)$, where for simplicity $c := 3$ in the definition of $\hat d_n$. Note also that $\rho_n(j,k)^2 = (k-j)/n$ in case of $\theta_n = \bar\theta_n$. Motivated by the procedure in Section 2 and Theorem 1, we define

$$K_{n,\alpha} := \bigl\{ j : \hat R_n(j) \le \hat R_n(k) + \hat\sigma_n^2\, |k-j|\, c_{jkn} \text{ for all } k \ne j \bigr\} \qquad (9)$$
$$\phantom{K_{n,\alpha}} = \bigl\{ j : T_{ijn} \ge 2 - c_{ijn} \text{ for all } i < j, \ \ T_{jkn} \le 2 + c_{jkn} \text{ for all } k > j \bigr\}$$

with

$$c_{jkn} = c_{jkn,\alpha} := \sqrt{\frac{6}{|k-j|}} \biggl( \Gamma\Bigl( \frac{|k-j|}{n} \Bigr) + \kappa_{n,\alpha} \biggr) + \frac{3}{|k-j|}\, \Gamma\Bigl( \frac{|k-j|}{n} \Bigr)^2,$$

where $\Gamma(u) := \bigl( 2\log(e/u) \bigr)^{1/2}$, so that $\Gamma_n(j,k) = \Gamma\bigl( \rho_n(j,k)^2 \bigr)$.
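A sketch of how the region (9) might be computed in practice. The critical value $\kappa_{n,\alpha}$ would normally be simulated from $\mathcal{L}_{\bar\theta_n}(\hat d_n)$, so the value `kappa = 1.0` below is a placeholder assumption, as are the test signal and the simplification that $\sigma$ is known; the threshold follows the displayed $c_{jkn}$ under these assumptions.

```python
import math, random

random.seed(3)

n, sigma, kappa = 60, 1.0, 1.0
theta = [sigma if i < 20 else 0.1 for i in range(n)]    # illustrative signal
x = [theta[i] + random.gauss(0.0, sigma) for i in range(n)]
s2 = sigma ** 2                                         # sigma known for simplicity

def t_stat(j, k):
    # T_jkn = sum_{i=j+1}^k X_i^2 / ((k-j) hat_sigma^2)
    return sum(xi ** 2 for xi in x[j:k]) / ((k - j) * s2)

def c_jkn(j, k):
    # c_jkn = sqrt(6/d) * (Gamma(d/n) + kappa) + 3 * Gamma(d/n)^2 / d
    d = abs(k - j)
    gamma = math.sqrt(2.0 * math.log(math.e * n / d))
    return math.sqrt(6.0 / d) * (gamma + kappa) + 3.0 * gamma ** 2 / d

# Keep index j unless some pairwise statistic deviates from 2 by more than c_jkn.
K = [j for j in range(n + 1)
     if all(t_stat(i, j) >= 2 - c_jkn(i, j) for i in range(j))
     and all(t_stat(j, k) <= 2 + c_jkn(j, k) for k in range(j + 1, n + 1))]
print(len(K), min(K, default=None), max(K, default=None))
```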

Theorem 3. Let $(\theta_n)_{n \in \mathbb{N}}$ be arbitrary. With $K_{n,\alpha}$ as defined above,

$$P_{\theta_n}\bigl( K_n(\theta_n) \not\subset K_{n,\alpha} \bigr) \le \alpha.$$

In case of $\gamma_n \to 0$ (i.e. $n/m_n \to 0$), the critical values $\kappa_{n,\alpha}$ converge to the critical value $\kappa_\alpha$ introduced in Section 2. In general, $\kappa_{n,\alpha} = O(1)$, and the confidence regions $K_{n,\alpha}$ satisfy the oracle inequalities

$$\max_{k \in K_{n,\alpha}} R_n(k) \le \min_{j \in \mathcal{C}_n} R_n(j) + \bigl( 4\sqrt{3} + o_p(1) \bigr) \sqrt{\sigma^2\, 2\log(n) \min_{j \in \mathcal{C}_n} R_n(j)} + O_p\bigl( \sigma^2 \log n \bigr) \qquad (10)$$

and

$$\max_{k \in K_{n,\alpha}} L_n(k) \le \min_{j \in \mathcal{C}_n} L_n(j) + O_p\Bigl( \sqrt{\sigma^2\, 2\log(n) \min_{j \in \mathcal{C}_n} L_n(j)} \Bigr) + O_p\bigl( \sigma^2 \log n \bigr). \qquad (11)$$


Remark (Dependence on $\alpha$). The proof reveals a refined version of the bounds in Theorem 3 in case of signals $\theta_n$ such that

$$\sigma^2 \log(n)^3 = O\Bigl( \min_{j \in \mathcal{C}_n} R_n(j) \Bigr).$$

Let $0 < \alpha(n) \to 0$ such that $\sigma^2 \kappa_{n,\alpha(n)}^6 = O\bigl( \min_{j \in \mathcal{C}_n} R_n(j) \bigr)$. Then

$$\max_{k \in K_{n,\alpha}} R_n(k) \le \min_{j \in \mathcal{C}_n} R_n(j) + \Bigl( 4\sqrt{3 \log n} + 2\sqrt{6}\, \kappa_{n,\alpha} + O_p(1) \Bigr) \sqrt{\sigma^2\, 2 \min_{j \in \mathcal{C}_n} R_n(j)}$$

uniformly in $\alpha \ge \alpha(n)$.
Remark (Variance estimation). Instead of Condition (A), one may require more generally that $\hat\sigma_n^2$ and $X_n$ are independent with

$$\sqrt{n}\, \Bigl( \frac{\hat\sigma_n^2}{\sigma^2} - 1 \Bigr) \to_{\mathcal{D}} N(0, \tau^2)$$

for a given $\tau \ge 0$. This covers, for instance, estimators used in connection with wavelets. There $\sigma$ is estimated by the median of some high frequency wavelet coefficients divided by the normal quantile $\Phi^{-1}(3/4)$. Theorem 1 continues to hold, and the coupling extends to this situation, too, with $S^2$ in the proof being distributed as $n\hat\sigma_n^2$. Under this assumption on the external variance estimator, the confidence region $K_{n,\alpha}$, defined with $m := 2n/\tau^2$, is at least asymptotically valid and satisfies the above oracle inequalities as well.

4. Confidence sets in case of larger families of candidates

The previous result relies strongly on the assumption of nested models. It is possible to obtain confidence sets for the optimal approximating models in a more general setting, albeit the resulting oracle property is not as strong as in the nested case. In particular, we can no longer rely on a coupling result but need a different construction. For the reader's convenience, we focus on the case of known $\sigma$, i.e. $m = \infty$; see also the remark at the end of this section.
Let $\mathcal{C}_n$ be a family of index sets $C \subset \{1, 2, \ldots, n\}$ with candidate estimators

$$\hat\theta^{(C)} := \bigl( 1\{i \in C\}\, X_{in} \bigr)_{i=1}^n$$

and corresponding risks

$$R_n(C) := R(\hat\theta^{(C)}, \theta_n) = \sum_{i \notin C} \theta_{in}^2 + |C|\,\sigma^2,$$

where $|S|$ denotes the cardinality of a set $S$. For two index sets $C$ and $D$,

$$\sigma^{-2} \bigl( R_n(D) - R_n(C) \bigr) = \delta_n^2(C \setminus D) - \delta_n^2(D \setminus C) + |D| - |C|$$

with the auxiliary quantities

$$\delta_n^2(J) := \sum_{i \in J} \theta_{in}^2 / \sigma^2, \qquad J \subset \{1, 2, \ldots, n\}.$$

Hence we aim at simultaneous $(1-\alpha)$-confidence intervals for these noncentrality parameters $\delta_n^2(J)$, where $J \in \mathcal{M}_n := \{ D \setminus C : C, D \in \mathcal{C}_n \}$. To this end we utilize the fact
that

$$T_n(J) := \frac{1}{\sigma^2} \sum_{i \in J} X_{in}^2$$

has a $\chi^2_{|J|}\bigl( \delta_n^2(J) \bigr)$-distribution. We denote the distribution function of $\chi^2_k(\delta^2)$ by $F_k(\cdot \,|\, \delta^2)$. Now let $M_n := |\mathcal{M}_n| - 1 \le |\mathcal{C}_n|\,(|\mathcal{C}_n| - 1)$, the number of nonvoid index sets $J \in \mathcal{M}_n$. Then with probability at least $1 - \alpha$,

$$\alpha/(2M_n) \le F_{|J|}\bigl( T_n(J) \,\big|\, \delta_n^2(J) \bigr) \le 1 - \alpha/(2M_n) \quad\text{for } \emptyset \ne J \in \mathcal{M}_n. \qquad (12)$$

Since $F_{|J|}(T_n(J) \,|\, \delta^2)$ is strictly decreasing in $\delta^2$ with limit $0$ as $\delta^2 \to \infty$, (12) entails the simultaneous $(1-\alpha)$-confidence intervals $\bigl[ \delta_{n,\alpha,l}^2(J), \delta_{n,\alpha,u}^2(J) \bigr]$ for all parameters $\delta_n^2(J)$ as follows: We set $\delta_{n,\alpha,l}^2(\emptyset) := \delta_{n,\alpha,u}^2(\emptyset) := 0$, while for nonvoid $J$,

$$\delta_{n,\alpha,l}^2(J) := \min\bigl\{ \delta^2 \ge 0 : F_{|J|}\bigl( T_n(J) \,\big|\, \delta^2 \bigr) \le 1 - \alpha/(2M_n) \bigr\}, \qquad (13)$$
$$\delta_{n,\alpha,u}^2(J) := \max\bigl\{ \delta^2 \ge 0 : F_{|J|}\bigl( T_n(J) \,\big|\, \delta^2 \bigr) \ge \alpha/(2M_n) \bigr\}. \qquad (14)$$

By means of these bounds, we may claim with confidence $1 - \alpha$ that for arbitrary $C, D \in \mathcal{C}_n$ the normalized difference $\sigma^{-2}\bigl( R_n(D) - R_n(C) \bigr)$ is at most $\delta_{n,\alpha,u}^2(C \setminus D) - \delta_{n,\alpha,l}^2(D \setminus C) + |D| - |C|$. Thus a $(1-\alpha)$-confidence set for $K_n(\theta_n) = \operatorname{argmin}_{C \in \mathcal{C}_n} R_n(C)$ is given by

$$K_{n,\alpha} := \bigl\{ C \in \mathcal{C}_n : \delta_{n,\alpha,u}^2(C \setminus D) - \delta_{n,\alpha,l}^2(D \setminus C) + |D| - |C| \ge 0 \text{ for all } D \in \mathcal{C}_n \bigr\}.$$
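The interval bounds (13) and (14) can be computed by inverting the noncentral $\chi^2$ distribution function in its noncentrality parameter. The sketch below uses only the Python standard library: the central $\chi^2$ distribution function comes from a regularized incomplete-gamma series, the noncentral one from the Poisson-mixture representation, and the inversion from bisection. It is an illustration under these assumptions, not the authors' implementation; the values of `alpha`, `M_n`, `k` and `t` are arbitrary.

```python
import math

def reg_gamma_p(s, x):
    """Regularized lower incomplete gamma function P(s, x) via its power series."""
    if x <= 0.0:
        return 0.0
    term = math.exp(s * math.log(x) - x - math.lgamma(s + 1.0))
    total, denom = term, s
    for _ in range(10000):
        denom += 1.0
        term *= x / denom
        total += term
        if term < total * 1e-15:
            break
    return min(total, 1.0)

def ncx2_cdf(x, k, lam2):
    """F_k(x | lam2): Poisson mixture of central chi^2 distribution functions."""
    total, w = 0.0, math.exp(-lam2 / 2.0)
    for j in range(2000):
        total += w * reg_gamma_p(k / 2.0 + j, x / 2.0)
        w *= (lam2 / 2.0) / (j + 1.0)
        if w < 1e-15 and j > lam2:
            break
    return total

def invert(t, k, target):
    """Solve F_k(t | d2) = target for d2; F is decreasing in the noncentrality."""
    if ncx2_cdf(t, k, 0.0) <= target:
        return 0.0
    lo, hi = 0.0, 4.0 * t + 100.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if ncx2_cdf(t, k, mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Bounds (13) and (14) for one set J with |J| = k and statistic t = T_n(J).
alpha, M_n, k, t = 0.05, 10, 4, 25.0
lower = invert(t, k, 1.0 - alpha / (2.0 * M_n))    # delta^2_{n,alpha,l}(J)
upper = invert(t, k, alpha / (2.0 * M_n))          # delta^2_{n,alpha,u}(J)
print(lower, upper)
```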

These confidence sets $K_{n,\alpha}$ satisfy the following oracle inequalities:

Theorem 4. Let $(\theta_n)_{n \in \mathbb{N}}$ be arbitrary, and suppose that $\log |\mathcal{C}_n| = o(n)$. Then

$$\max_{C \in K_{n,\alpha}} R_n(C) \le \min_{D \in \mathcal{C}_n} R_n(D) + O_p\Bigl( \sqrt{\sigma^2\, 2\log(|\mathcal{C}_n|) \min_{D \in \mathcal{C}_n} R_n(D)} \Bigr) + O_p\bigl( \sigma^2 \log |\mathcal{C}_n| \bigr)$$

and

$$\max_{C \in K_{n,\alpha}} L_n(C) \le \min_{D \in \mathcal{C}_n} L_n(D) + O_p\Bigl( \sqrt{\sigma^2\, 2\log(|\mathcal{C}_n|) \min_{D \in \mathcal{C}_n} L_n(D)} \Bigr) + O_p\bigl( \sigma^2 \log |\mathcal{C}_n| \bigr).$$

Remark. The upper bounds in Theorem 4 are of the form

$$\zeta_n \Bigl( 1 + O_p\Bigl( \sqrt{2\sigma^2 \log(|\mathcal{C}_n|)/\zeta_n} + 2\sigma^2 \log(|\mathcal{C}_n|)/\zeta_n \Bigr) \Bigr)$$

with $\zeta_n$ denoting the minimal risk or minimal loss. Thus Theorem 4 entails that the maximal risk (loss) over $K_{n,\alpha}$ exceeds the minimal risk (loss) only by a factor close to one, provided that the minimal risk (loss) is substantially larger than $\sigma^2 \log |\mathcal{C}_n|$.
Remark (Suboptimality in case of nested models). In case of nested models, the general construction is suboptimal in the factor of the leading (in most cases) term $\sqrt{\min_j R_n(j)}$. Following the proof carefully and using that $2 \log M_n = 2 \cdot 2 \log n + O(1)$ in this special setting, one may verify that

$$\max_{k \in K_{n,\alpha}} R_n(k) \le \min_{j \in \mathcal{C}_n} R_n(j) + \bigl( 4\sqrt{8} + o_p(1) \bigr) \sqrt{\sigma^2\, 2\log(n) \min_{j \in \mathcal{C}_n} R_n(j)} + O_p\bigl( \sigma^2 \log n \bigr).$$

The intrinsic reason is that the general procedure does not assume any structure of the family of candidate estimators. Hence advanced multiscale theory is not applicable.

Remark (Unknown $\sigma$). In case of unknown $\sigma$, let $\bar\alpha := 1 - (1-\alpha)^{1/2}$. Then with probability at least $1 - \bar\alpha$,

$$\bar\alpha/2 \le F_m\bigl( m\,(\hat\sigma_n/\sigma)^2 \,\big|\, 0 \bigr) \le 1 - \bar\alpha/2.$$

The latter inequalities entail that $(\sigma/\hat\sigma_n)^2$ lies between $\tau_{n,\alpha,l} := m / \chi^2_{m;\,1-\bar\alpha/2}$ and $\tau_{n,\alpha,u} := m / \chi^2_{m;\,\bar\alpha/2}$. Then we obtain simultaneous $(1-\alpha)$-confidence bounds $\delta_{n,\alpha,l}^2(J)$ and $\delta_{n,\alpha,u}^2(J)$ as in (13) and (14) by replacing $\alpha$ with $\bar\alpha$ and $T_n(J)$ with

$$\frac{\tau_{n,\alpha,l}}{\hat\sigma_n^2} \sum_{i \in J} X_{in}^2 \quad\text{and}\quad \frac{\tau_{n,\alpha,u}}{\hat\sigma_n^2} \sum_{i \in J} X_{in}^2,$$

respectively. The conclusions of Theorem 4 continue to hold, as long as $n/m_n = O(1)$.

5. Proofs

5.1. Proof of (5) and (6)

Note first that $\min_{[0,1]} Y$ lies between $F_n(s_o) + \min_{[0,1]} W$ and $F_n(s_o) + W(s_o)$. Hence for any $\beta \in (0,1)$,

$$S_\alpha^{\rm naive} \subset \bigl\{ s \in [0,1] : F_n(s) + W(s) \le F_n(s_o) + W(s_o) + \kappa_\alpha^{\rm naive} \bigr\}$$
$$\subset \bigl\{ s \in [0,1] : F_n(s) - F_n(s_o) \le \kappa_\beta^{\rm naive} + \kappa_\alpha^{\rm naive} \bigr\}$$
$$\subset \bigl\{ s \in [0,1] : F_o(s) - F_o(s_o) \le c_n^{-1} \bigl( \kappa_\beta^{\rm naive} + \kappa_\alpha^{\rm naive} \bigr) \bigr\}$$

and

$$S_\alpha^{\rm naive} \supset \Bigl\{ s \in [0,1] : F_n(s) + W(s) \le F_n(s_o) + \min_{[0,1]} W + \kappa_\alpha^{\rm naive} \Bigr\}$$
$$\supset \bigl\{ s \in [0,1] : F_n(s) - F_n(s_o) \le \kappa_\alpha^{\rm naive} - \kappa_\beta^{\rm naive} \bigr\}$$
$$\supset \bigl\{ s \in [0,1] : F_o(s) - F_o(s_o) \le c_n^{-1} \bigl( \kappa_\alpha^{\rm naive} - \kappa_\beta^{\rm naive} \bigr) \bigr\}$$

with probability $\ge 1 - \beta$. Since $\kappa_\beta^{\rm naive} < \kappa_\alpha^{\rm naive}$ if $\alpha < \beta < 1$, these considerations, combined with the expansion of $F_o$ near $s_o$, show that the maximum of $|s - s_o|$ over all $s \in S_\alpha^{\rm naive}$ is precisely of order $O_p(c_n^{-1/\gamma})$.

On the other hand, the confidence region $S_\alpha$ is contained in the set of all $s \in [0,1]$ such that

$$F_n(s) + W(s) \le F_n(s_o) + W(s_o) + \sqrt{|s - s_o|}\, \Bigl( \sqrt{2 \log\bigl( e/|s - s_o| \bigr)} + \kappa_\alpha \Bigr),$$

and this entails that

$$F_o(s) - F_o(s_o) \le c_n^{-1} \Bigl( \sqrt{|s - s_o|}\, \bigl( \sqrt{2 \log( e/|s - s_o| )} + \kappa_\alpha \bigr) + O_p(1) \Bigr)$$

with $O_p(1)$ not depending on $s$. Now the expansion of $F_o$ near $s_o$ entails claim (6).

5.2. Exponential inequalities

An essential ingredient for our main results is an exponential inequality for quadratic
functions of a Gaussian random vector. It extends inequalities of Dahlhaus and Polonik
(2006) for quadratic forms and is of independent interest.

Proposition 5. Let $Z_1, \ldots, Z_n$ be independent, standard Gaussian random variables. Furthermore, let $\gamma_1, \ldots, \gamma_n$ and $\delta_1, \ldots, \delta_n$ be real constants, and define $\tau^2 := \operatorname{Var}\bigl( \sum_{i=1}^n \gamma_i (Z_i + \delta_i)^2 \bigr) = \sum_{i=1}^n \gamma_i^2 (2 + 4\delta_i^2)$. Then for arbitrary $\eta \ge 0$ and $\gamma_{\max} := \max(\gamma_1, \ldots, \gamma_n, 0)$,

$$P\Bigl( \sum_{i=1}^n \gamma_i \bigl( (Z_i + \delta_i)^2 - (1 + \delta_i^2) \bigr) \ge \eta \tau \Bigr) \le \exp\Bigl( \frac{-\eta^2}{2 + 4\gamma_{\max} \eta/\tau} \Bigr) \le e^{1/4} \exp\bigl( -\eta/\sqrt{8} \bigr).$$

Note that replacing $\gamma_i$ in Proposition 5 with $-\gamma_i$ yields two-sided exponential inequalities. By means of Proposition 5 and elementary calculations one obtains exponential and related inequalities for noncentral $\chi^2$ distributions:

Corollary 6. For an integer $n > 0$ and a constant $\delta \ge 0$ let $F_n(\cdot \,|\, \delta^2)$ be the distribution function of $\chi^2_n(\delta^2)$. Then for arbitrary $r \ge 0$,

$$F_n\bigl( n + \delta^2 + r \,\big|\, \delta^2 \bigr) \ge 1 - \exp\Bigl( \frac{-r^2}{4n + 8\delta^2 + 4r} \Bigr), \qquad (15)$$
$$F_n\bigl( n + \delta^2 - r \,\big|\, \delta^2 \bigr) \le \exp\Bigl( \frac{-r^2}{4n + 8\delta^2} \Bigr). \qquad (16)$$

In particular, for any $u \in (0, 1/2)$,

$$F_n^{-1}\bigl( 1 - u \,\big|\, \delta^2 \bigr) \le n + \delta^2 + \sqrt{(4n + 8\delta^2) \log(u^{-1})} + 4 \log(u^{-1}), \qquad (17)$$
$$F_n^{-1}\bigl( u \,\big|\, \delta^2 \bigr) \ge n + \delta^2 - \sqrt{(4n + 8\delta^2) \log(u^{-1})}. \qquad (18)$$

Moreover, for any number $\lambda \ge 0$, the inequalities $u \le F_n(n + \lambda^2 \,|\, \delta^2) \le 1 - u$ entail that

$$\delta^2 \le \lambda^2 + \sqrt{(4n + 8\lambda^2) \log(u^{-1})} + 8 \log(u^{-1}) \quad\text{and}\quad \delta^2 \ge \lambda^2 - \sqrt{(4n + 8\lambda^2) \log(u^{-1})}. \qquad (19)$$

Conclusion (19) follows from (15) and (16), applied to $r = \delta^2 - \lambda^2$ and $r = \lambda^2 - \delta^2$, respectively.
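Inequality (16) is easy to probe numerically: simulate $\chi^2_n(\delta^2)$ as a sum of squared shifted Gaussians and compare the empirical left-tail probability with the bound. The concrete values of $n$, $\delta^2$ and $r$ below are arbitrary illustrations.

```python
import math, random

random.seed(4)

# Check of (16): for S ~ chi^2_n(delta^2),
#   P(S <= n + delta^2 - r) <= exp(-r^2 / (4n + 8 delta^2)).
n, delta2, r = 20, 9.0, 15.0
delta = math.sqrt(delta2 / n)            # spread the noncentrality evenly

def draw():
    return sum((random.gauss(0.0, 1.0) + delta) ** 2 for _ in range(n))

reps = 20000
emp = sum(draw() <= n + delta2 - r for _ in range(reps)) / reps
bound = math.exp(-r * r / (4 * n + 8 * delta2))
print(emp, bound)
```

The empirical tail probability should fall well below the exponential bound, which is not sharp for moderate deviations.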

Proof of Proposition 5. Standard calculations show that for $0 \le t < (2\gamma_{\max})^{-1}$,

$$E \exp\Bigl( t \sum_{i=1}^n \gamma_i (Z_i + \delta_i)^2 \Bigr) = \exp\Bigl( \frac{1}{2} \sum_{i=1}^n \Bigl\{ \delta_i^2\, \frac{2t\gamma_i}{1 - 2t\gamma_i} - \log(1 - 2t\gamma_i) \Bigr\} \Bigr).$$

Then for any such $t$,

$$P\Bigl( \sum_{i=1}^n \gamma_i \bigl( (Z_i + \delta_i)^2 - (1 + \delta_i^2) \bigr) \ge \eta \tau \Bigr) \le \exp\Bigl( -t\eta\tau - t \sum_{i=1}^n \gamma_i (1 + \delta_i^2) \Bigr)\, E \exp\Bigl( t \sum_{i=1}^n \gamma_i (Z_i + \delta_i)^2 \Bigr)$$
$$= \exp\Bigl( -t\eta\tau + \frac{1}{2} \sum_{i=1}^n \Bigl\{ \delta_i^2\, \frac{4t^2\gamma_i^2}{1 - 2t\gamma_i} - \log(1 - 2t\gamma_i) - 2t\gamma_i \Bigr\} \Bigr). \qquad (20)$$

Elementary considerations reveal that

$$-\log(1 - x) - x \le \begin{cases} x^2/2 & \text{if } x \le 0, \\ x^2/\bigl( 2(1 - x) \bigr) & \text{if } 0 \le x < 1. \end{cases}$$

Thus (20) is not greater than

$$\exp\Bigl( -t\eta\tau + \frac{1}{2} \sum_{i=1}^n \Bigl\{ \frac{4t^2\gamma_i^2 \delta_i^2}{1 - 2t\gamma_i} + \frac{2t^2\gamma_i^2}{1 - 2t \max(\gamma_i, 0)} \Bigr\} \Bigr) \le \exp\Bigl( -t\eta\tau + \frac{\tau^2 t^2/2}{1 - 2t\gamma_{\max}} \Bigr).$$

Setting

$$t := \frac{\eta}{\tau + 2\eta\gamma_{\max}} \in \bigl[ 0, (2\gamma_{\max})^{-1} \bigr),$$

the preceding bound becomes

$$P\Bigl( \sum_{i=1}^n \gamma_i \bigl( (Z_i + \delta_i)^2 - (1 + \delta_i^2) \bigr) \ge \eta \tau \Bigr) \le \exp\Bigl( \frac{-\eta^2}{2 + 4\gamma_{\max} \eta/\tau} \Bigr).$$

Finally, since $\gamma_{\max} \le \tau/\sqrt{2}$, the second asserted inequality follows from

$$\frac{\eta^2}{2 + 4\gamma_{\max}\eta/\tau} \ge \frac{\eta^2}{2 + \sqrt{8}\,\eta} = \frac{\eta}{\sqrt{8}} - \frac{2\eta}{\sqrt{8}\,(2 + \sqrt{8}\,\eta)} \ge \frac{\eta}{\sqrt{8}} - \frac{1}{4}.$$

5.3. Proofs of the main results

Throughout this section we assume without loss of generality that = 1. Further let

Sn := {0, 1, . . . , n} and Tn := (j, k) : 0 j < k n .

Proof of Theorem 1.

Step I. We first analyze $D_n$ in place of $\hat D_n$. To collect the necessary ingredients, let the metric $\bar\rho_n$ on $T_n$ be defined pointwise by

$$\bar\rho_n\bigl( (j,k), (j',k') \bigr) := \sqrt{ \rho_n(j, j')^2 + \rho_n(k, k')^2 }.$$

We need bounds for the capacity numbers $D(u, T, \bar\rho_n)$ (cf. Section 6) for certain $u > 0$ and $T \subset T_n$. The proof of Theorem 2.1 of Dümbgen and Spokoiny (2001) entails that

$$D\bigl( u\delta, \{ t \in T_n : \rho_n(t) \le \delta \}, \bar\rho_n \bigr) \le 12\, u^{-4} \delta^{-2} \quad\text{for all } u, \delta \in (0,1]. \qquad (21)$$

Note that for fixed $(j,k) \in T_n$, $D_n(j,k)$ may be written as

$$\sum_{i=1}^n \gamma_i \bigl( (\theta_{in} + \varepsilon_{in})^2 - (1 + \theta_{in}^2) \bigr)$$

with

$$\gamma_i = \gamma_{in}(j,k) := \bigl( 4\|\theta_n\|^2 + 2n \bigr)^{-1/2}\, 1_{(j,k]}(i),$$

so $|\gamma_i| \le \bigl( 4\|\theta_n\|^2 + 2n \bigr)^{-1/2}$. Hence it follows from Proposition 5 that

$$P\bigl( |D_n(t)| \ge \eta\, \rho_n(t) \bigr) \le 2 \exp\biggl( \frac{-\eta^2}{2 + 4 \bigl( 4\|\theta_n\|^2 + 2n \bigr)^{-1/2} \eta / \rho_n(t)} \biggr)$$

for arbitrary $t \in T_n$ and $\eta \ge 0$. One may rewrite this exponential inequality as

$$P\Bigl( |D_n(t)| \ge \rho_n(t)\, G_n\bigl( \eta, \rho_n(t) \bigr) \Bigr) \le 2 \exp(-\eta) \qquad (22)$$

for arbitrary $t \in T_n$ and $\eta \ge 0$, where

$$G_n(\eta, \rho) := \sqrt{2\eta} + \frac{4\eta}{\bigl( 4\|\theta_n\|^2 + 2n \bigr)^{1/2} \rho}.$$
The second exponential inequality in Proposition 5 entails that

$$P\bigl( |D_n(t)| \ge \eta\, \rho_n(t) \bigr) \le 2 e^{1/4} \exp\bigl( -\eta/\sqrt{8} \bigr) \qquad (23)$$

and

$$P\Bigl( |D_n(s) - D_n(t)| \ge \sqrt{8}\, \eta\, \bar\rho_n(s,t) \Bigr) \le 2 e^{1/4} \exp(-\eta) \qquad (24)$$

for arbitrary $s, t \in T_n$ and $\eta \ge 0$.

Utilizing (21) and (24), it follows from Theorem 7 and the subsequent Remark 3 in Dümbgen and Walther (2007) that

$$\lim_{\delta \downarrow 0} \sup_n P\biggl( \sup_{s,t \in T_n :\, \bar\rho_n(s,t) \le \delta} \frac{|D_n(s) - D_n(t)|}{\bar\rho_n(s,t) \sqrt{\log\bigl( e/\bar\rho_n(s,t) \bigr)}} > Q \biggr) = 0 \qquad (25)$$

for a suitable constant $Q > 0$. Since $D_n(j,k) = D_n(0,k) - D_n(0,j)$ and $\rho_n(j,k) = \bar\rho_n\bigl( (0,j), (0,k) \bigr)$, this entails the stochastic equicontinuity of $D_n$ with respect to $\bar\rho_n$.

For $0 \le \delta < \epsilon \le 1$ define

$$T_n(\delta, \epsilon) := \sup_{t \in T_n :\, \delta < \rho_n(t) \le \epsilon} \rho_n(t) \biggl( \frac{|D_n(t)|}{\rho_n(t)} - \Gamma_n(t) - \frac{c\, \Gamma_n(t)^2}{\bigl( 4\|\theta_n\|^2 + 2n \bigr)^{1/2} \rho_n(t)} \biggr)^+$$

with a constant $c > 0$ to be specified later. Recall that $\Gamma_n(t) := \bigl( 2 \log( e/\rho_n(t)^2 ) \bigr)^{1/2}$. Starting from (21), (22) and (25), Theorem 8 of Dümbgen and Walther (2007) and its subsequent remark imply that

$$T_n(0, \epsilon) \to_p 0 \quad\text{as } n \to \infty \text{ and } \epsilon \downarrow 0, \qquad (26)$$

provided that $c > 2$. On the other hand, (21), (23) and (25) entail that

$$T_n(\delta, 1) = O_p(1) \quad\text{for any fixed } \delta > 0. \qquad (27)$$

Now we are ready to prove the first assertion about $\hat D_n$. Recall that $\hat D_n = \hat\sigma_n^{-2} (D_n + V_n)$ and

$$\frac{V_n(j,k)}{\rho_n(j,k)} = \frac{2\gamma_n (k-j)}{\rho_n(j,k)\, \sqrt{n}\, \bigl( 4\|\theta_n\|^2 + 2n \bigr)^{1/2}}\, Z_n$$

with $Z_n$ being asymptotically standard normal. Since $\rho_n(j,k) \ge \sqrt{2(k-j)}\, \bigl( 4\|\theta_n\|^2 + 2n \bigr)^{-1/2}$,

$$\frac{|V_n(j,k)|}{\rho_n(j,k)} \le \frac{\gamma_n \sqrt{2(k-j)}}{\sqrt{n}}\, |Z_n| \le \sqrt{2}\, \gamma_n |Z_n|, \qquad (28)$$

so the maximum of $|V_n|/\rho_n$ over $T_n$ is bounded by $\sqrt{2}\, \gamma_n |Z_n| = O_p(1)$. Furthermore, since $|T_n| \le n^2/2$, one can easily deduce from (23) that the maximum of $|D_n|/\rho_n$ over $T_n$ exceeds $\sqrt{32 \log n} + \eta$ with probability at most $e^{1/4} \exp\bigl( -\eta/\sqrt{8} \bigr)$. Since $\hat\sigma_n^2 = 1 + O_p(n^{-1/2})$, these considerations show that

$$\max_{t \in T_n} \frac{|(D_n + V_n)(t)|}{\rho_n(t)} \le \sqrt{32 \log n} + O_p(1)$$

and

$$\max_{t \in T_n} \frac{\bigl| \hat D_n(t) - (D_n + V_n)(t) \bigr|}{\rho_n(t)} = O_p\bigl( n^{-1/2} \log n \bigr).$$

This proves our first assertion about $\hat D_n/\rho_n$.

Step II. Because $\hat\sigma_n^2 \to_p 1$, it is sufficient for the proof of the weak approximation

$$d_w\Bigl( \mathcal{L}\bigl( \hat D_n(t) \bigr)_{t \in T_n},\ \mathcal{L}\bigl( \Lambda_n(t) \bigr)_{t \in T_n} \Bigr) \to 0 \quad\text{as } n \to \infty \qquad (29)$$

to show the result for $\hat\sigma_n^2 \hat D_n = D_n + V_n$ with the processes $D_n$ and $V_n$ introduced in (7) and (8). Here, $d_w$ refers to the dual bounded Lipschitz metric which metrizes the topology of weak convergence. Further details are provided in the appendix. Note that $D_n(j,k) = D_n(k) - D_n(j)$ with $D_n(\cdot) := D_n(0, \cdot)$ and $V_n(j,k) = V_n(k) - V_n(j)$ with $V_n(\cdot) := V_n(0, \cdot)$. Thus we view these processes $D_n$ and $V_n$ temporarily as processes on $S_n$. They are stochastically independent by Assumption (A). Hence, according to Lemma 9, it suffices to show that $D_n$ and $V_n$ are approximated in distribution by

$$\Bigl( W\bigl( \rho_n(k)^2 \bigr) \Bigr)_{k \in S_n} \quad\text{and}\quad \biggl( \frac{2\gamma_n k}{\sqrt{n}\, \bigl( 4\|\theta_n\|^2 + 2n \bigr)^{1/2}}\, Z \biggr)_{k \in S_n}, \qquad (30)$$

respectively. The assertion about $V_n$ is an immediate consequence of the fact that $Z_n := \sqrt{m/2}\, (1 - \hat\sigma_n^2) = \gamma_n^{-1} \sqrt{n}\, (1 - \hat\sigma_n^2)$ converges in distribution to $Z$, while $0 \le k \bigl/ \bigl( \sqrt{n}\, ( 4\|\theta_n\|^2 + 2n )^{1/2} \bigr) \le 1/\sqrt{2}$.

It remains to verify the assertion about $D_n$. It follows from the results in step I that the sequence of processes $D_n$ on $S_n$ is stochastically equicontinuous with respect to the metric $\rho_n$ on $S_n \times S_n$. More precisely,

$$\max_{(j,k) \in T_n} \frac{|D_n(k) - D_n(j)|}{\rho_n(j,k) \log\bigl( e/\rho_n(j,k)^2 \bigr)} = O_p(1),$$

and it is well-known that $\bigl( W( \rho_n(0,k)^2 ) \bigr)_{k \in S_n}$ has the same property, even with the factor $\log\bigl( e/\rho_n(j,k)^2 \bigr)^{1/2}$ in place of $\log\bigl( e/\rho_n(j,k)^2 \bigr)$. Moreover, both processes have independent increments. Thus, in view of Theorem 8 in Section 6, it suffices to show that

$$\max_{(j,k) \in T_n} d_w\Bigl( D_n(j,k),\ W\bigl( \rho_n(0,k)^2 \bigr) - W\bigl( \rho_n(0,j)^2 \bigr) \Bigr) \to 0. \qquad (31)$$
To this end we write $D_n(j,k) = D_{n,1}(j,k) + D_{n,2}(j,k) + D_{n,3}(j,k)$ with

$$D_{n,1}(j,k) := \bigl( 4\|\theta_n\|^2 + 2n \bigr)^{-1/2} \sum_{i=j+1}^k 1\{ |\theta_{in}| \le \beta_n \} \bigl( 2\theta_{in} \varepsilon_{in} + \varepsilon_{in}^2 - 1 \bigr),$$
$$D_{n,2}(j,k) := \bigl( 4\|\theta_n\|^2 + 2n \bigr)^{-1/2} \sum_{i=j+1}^k 1\{ |\theta_{in}| > \beta_n \}\, 2\theta_{in} \varepsilon_{in},$$
$$D_{n,3}(j,k) := \bigl( 4\|\theta_n\|^2 + 2n \bigr)^{-1/2} \sum_{i=j+1}^k 1\{ |\theta_{in}| > \beta_n \} \bigl( \varepsilon_{in}^2 - 1 \bigr),$$

and arbitrary numbers $\beta_n > 0$ such that $\beta_n \to \infty$ but $\beta_n \bigl/ \bigl( 4\|\theta_n\|^2 + 2n \bigr)^{1/2} \to 0$. These three random variables $D_{n,s}(j,k)$ are uncorrelated and have mean zero. The number $a_n := \#\{ i : |\theta_{in}| > \beta_n \}$ satisfies the inequality $\|\theta_n\|^2 \ge a_n \beta_n^2$, whence

$$E\, D_{n,3}(j,k)^2 \le \frac{2 a_n}{2n + 4\|\theta_n\|^2} \le \frac{1}{2\beta_n^2} \to 0.$$

Moreover, $D_{n,1}(j,k)$ and $D_{n,2}(j,k)$ are stochastically independent, where $D_{n,1}(j,k)$ is asymptotically Gaussian by virtue of Lindeberg's CLT, while $D_{n,2}(j,k)$ is exactly Gaussian. These findings entail (29).

Step III. For $0 \le \delta < \epsilon \le 1$ define

$$S_n(\delta, \epsilon) := \sup_{t \in T_n :\, \delta < \rho_n(t) \le \epsilon} \rho_n(t) \biggl( \frac{|(D_n + V_n)(t)|}{\rho_n(t)} - \Gamma_n(t) - \frac{c\, \Gamma_n(t)^2}{\bigl( 4\|\theta_n\|^2 + 2n \bigr)^{1/2} \rho_n(t)} \biggr)^+,$$

$$H_n(\delta, \epsilon) := \sup_{(j,k) \in T_n :\, \delta < \rho_n(j,k) \le \epsilon} \rho_n(j,k) \biggl( \frac{\bigl| W( \rho_n(0,k)^2 ) - W( \rho_n(0,j)^2 ) \bigr|}{\rho_n(j,k)} - \Gamma_n(j,k) \biggr)^+.$$

Since $S_n(0,1) \le T_n(0,1) + \sqrt{2}\, \gamma_n |Z_n|$, it follows from (26) and (27) that $S_n(0,1) = O_p(1)$. As to the approximation in distribution, since $\bigl( 4\|\theta_n\|^2 + 2n \bigr)^{1/2} \ge \sqrt{2n}$,

$$\max_{t :\, \rho_n(t) > \delta} \frac{\Gamma_n(t)^2}{\bigl( 4\|\theta_n\|^2 + 2n \bigr)^{1/2} \rho_n(t)} \to 0 \quad\text{while}\quad \max_{t :\, \rho_n(t) > \delta} |\Gamma_n(t)| = O(1)$$

for any fixed $\delta \in (0,1)$. Consequently it follows from step II that

$$d_w\bigl( S_n(\delta, 1), H_n(\delta, 1) \bigr) \to 0 \qquad (32)$$

for any fixed $\delta \in (0,1)$. Thus it suffices to show that

$$\bigl( S_n(0, \epsilon), H_n(0, \epsilon) \bigr) \to_p 0 \quad\text{as } n \to \infty \text{ and } \epsilon \downarrow 0,$$

provided that $\|\theta_n\|^2 = O(n)$. For $H_n(0, \epsilon)$ this claim follows, for instance, with the same arguments as (26). Moreover, $S_n(0, \epsilon)$ is not greater than

$$T_n(0, \epsilon) + \sup_{t \in T_n :\, \rho_n(t) \le \epsilon} |V_n(t)| \le T_n(0, \epsilon) + \epsilon\, \sqrt{2}\, \gamma_n |Z_n|,$$

according to (28). Thus our claim follows from (26) and $\|\theta_n\|^2 = O(n)$.

Proof of Proposition 2. The main ingredient is a well-known representation of noncentral $\chi^2$ distributions as Poisson mixtures of central $\chi^2$ distributions. Precisely,
\[
  \chi^2_k(\delta^2) \;=\; \sum_{j=0}^{\infty} e^{-\delta^2/2}\,\frac{(\delta^2/2)^j}{j!}\;\chi^2_{k+2j},
\]
as can be proved via Laplace transforms. Now we define time points
\[
  t_{kn} \;:=\; \sum_{i=1}^{k} \theta_{in}^2
  \quad\text{and}\quad
  \tilde t_{kn} \;:=\; t_{j(n)n} + k - j(n)
\]
with $j(n)$ any fixed index in $K_n(\theta_n)$. This construction entails that $t_{kn} \le \tilde t_{kn}$, with equality if, and only if, $k \in K_n(\theta_n)$.

Figure 1 illustrates this construction. It shows the time points $t_{kn}$ (crosses) and $\tilde t_{kn}$ (dots and line) versus $k$ for a hypothetical signal $\theta_n \in \mathbf{R}^{40}$. Note that in this example, $K_n(\theta_n)$ is given by $\{10, 11, 20, 21\}$.
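Both ingredients admit a quick numerical sanity check (an illustration with a made-up signal, not part of the proof): the Poisson mixture must reproduce the known mean $k+\delta^2$ and variance $2k+4\delta^2$ of $\chi^2_k(\delta^2)$, and the time points must satisfy $t_{kn} \le \tilde t_{kn}$ with equality exactly on $K_n(\theta_n)$, the set of minimizers of $R_n(k) = \sum_{i>k}\theta_{in}^2 + k$.

```python
import math

# (a) Moments of the Poisson mixture of central chi-squares (truncated series).
def mixture_moments(k, delta2, terms=200):
    lam = delta2 / 2.0
    w = math.exp(-lam)                    # Poisson(lam) weight at j = 0
    mean = second = 0.0
    for j in range(terms):
        m = k + 2 * j                     # chi2_m: E = m, E[.^2] = m^2 + 2m
        mean += w * m
        second += w * (m * m + 2 * m)
        w *= lam / (j + 1)                # next Poisson weight
    return mean, second - mean ** 2

mean, var = mixture_moments(3, 2.5)       # equals k + delta2 = 5.5, 2k + 4*delta2 = 16

# (b) Time points for a hypothetical squared signal theta_{in}^2.
theta_sq = [4.0, 3.0, 2.0, 1.0, 1.0, 0.5, 0.25, 0.0]
n = len(theta_sq)
t = [sum(theta_sq[:k]) for k in range(n + 1)]           # t_{kn}
risk = [sum(theta_sq[k:]) + k for k in range(n + 1)]    # R_n(k)
K = [k for k in range(n + 1) if risk[k] == min(risk)]   # K_n(theta_n)
jn = K[0]                                               # any fixed index in K_n
t_tilde = [t[jn] + (k - jn) for k in range(n + 1)]      # tilde t_{kn}

ordered = all(t[k] <= t_tilde[k] for k in range(n + 1))
equality_set = [k for k in range(n + 1) if t[k] == t_tilde[k]]
```

For this particular signal the risk minimizers are $k \in \{3,4,5\}$, and `equality_set` coincides with `K`, exactly as claimed.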
Let $\Pi$, $G_1, G_2, \ldots, G_n$, $Z_1, Z_2, Z_3, \ldots$ and $S^2$ be stochastically independent random variables, where $\Pi = (\Pi(t))_{t\ge 0}$ is a standard Poisson process, the $G_i$ and $Z_s$ are standard Gaussian random variables, and $S^2 \sim \chi^2_m$. Then one can easily verify that
\[
\begin{aligned}
  T_{jkn} &:= \frac{m}{(k-j)S^2}\Biggl( \sum_{i=j+1}^{k} G_i^2 + \sum_{s=2\Pi(t_{jn}/2)+1}^{2\Pi(t_{kn}/2)} Z_s^2 \Biggr), \\
  \tilde T_{jkn} &:= \frac{m}{(k-j)S^2}\Biggl( \sum_{i=j+1}^{k} G_i^2 + \sum_{s=2\Pi(\tilde t_{jn}/2)+1}^{2\Pi(\tilde t_{kn}/2)} Z_s^2 \Biggr)
\end{aligned}
\]
define random variables $(T_{jkn})_{0\le j<k\le n}$ and $(\tilde T_{jkn})_{0\le j<k\le n}$ with the desired properties.

In the proofs of Theorems 3 and 4 we utilize repeatedly two elementary inequalities:

Lemma 7. Let $a, b, c$ be nonnegative constants.

(i) Suppose that $0 \le x \le y \le x + \sqrt{b(x+y)} + c$. Then
\[
  y \;\le\; x + \sqrt{2bx} + b + \sqrt{bc} + c \;\le\; x + \sqrt{2bx} + (3/2)(b + c).
\]
(ii) For $x \ge 0$ define $h(x) := x + \sqrt{a + bx} + c$. Then
\[
  h(h(x)) \;\le\; x + 2\sqrt{a + bx} + b/2 + \sqrt{bc} + 2c.
\]

Fig 1. Construction of the coupling

Proof of Lemma 7. The inequality $y \le x + \sqrt{b(x+y)} + c$ entails that either $y < x + c$ or
\[
  (y - x - c)^2 \;\le\; b(x + y) \;=\; 2bx + b(y - x).
\]
Since $y < x + c$ is stronger than the assertions of part (i), we only consider the displayed quadratic inequality. The latter is equivalent to
\[
  \bigl(y - x - (b/2 + c)\bigr)^2 \;\le\; 2bx + (b/2 + c)^2 - c^2 \;=\; 2bx + b^2/4 + bc.
\]
Hence the standard inequality $\sqrt{\sum_i z_i} \le \sum_i \sqrt{z_i}$ for nonnegative numbers $z_i$ leads to
\[
  y \;\le\; x + \sqrt{2bx + b^2/4 + bc} + b/2 + c \;\le\; x + \sqrt{2bx} + b + \sqrt{bc} + c.
\]
Finally, $0 \le \sqrt{bc} \le (b + c)/2$ entails that the latter bound is at most $x + \sqrt{2bx} + (3/2)(b + c)$.

As to part (ii), the definition of $h(x)$ entails that
\[
\begin{aligned}
  h(h(x)) &= x + \sqrt{a + bx} + \sqrt{a + bx + b\sqrt{a + bx} + bc} + 2c \\
   &\le x + \sqrt{a + bx} + \sqrt{a + bx + b\sqrt{a + bx}} + \sqrt{bc} + 2c \\
   &= x + \sqrt{a + bx} + \sqrt{a + bx}\,\sqrt{1 + b/\sqrt{a + bx}} + \sqrt{bc} + 2c \\
   &\le x + 2\sqrt{a + bx} + b/2 + \sqrt{bc} + 2c,
\end{aligned}
\]
because $\sqrt{1 + d} \le 1 + d/2$ for arbitrary $d \ge 0$.
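Since Lemma 7 is used repeatedly below, a brute-force grid check of both parts is cheap reassurance (an independent numerical sanity check, not part of the proof):

```python
import math
import itertools

grid = [0.0, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0]

# Part (i): 0 <= x <= y <= x + sqrt(b(x+y)) + c
#           implies y <= x + sqrt(2bx) + b + sqrt(bc) + c <= x + sqrt(2bx) + 1.5(b+c).
for b, c, x in itertools.product(grid, repeat=3):
    for y in (x + s for s in (0.0, 0.3, 1.7, 4.0, 20.0)):
        if y <= x + math.sqrt(b * (x + y)) + c:            # hypothesis of (i)
            bound1 = x + math.sqrt(2 * b * x) + b + math.sqrt(b * c) + c
            bound2 = x + math.sqrt(2 * b * x) + 1.5 * (b + c)
            assert y <= bound1 + 1e-9
            assert bound1 <= bound2 + 1e-9                 # sqrt(bc) <= (b+c)/2

# Part (ii): h(x) = x + sqrt(a + bx) + c satisfies
#            h(h(x)) <= x + 2 sqrt(a + bx) + b/2 + sqrt(bc) + 2c.
for a, b, c, x in itertools.product(grid, repeat=4):
    def h(t):
        return t + math.sqrt(a + b * t) + c
    assert h(h(x)) <= x + 2 * math.sqrt(a + b * x) + b / 2 + math.sqrt(b * c) + 2 * c + 1e-9

lemma7_checked = True
```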

Proof of Theorem 3. The definition of $\hat K_{n,\alpha}$ and Proposition 2 together entail that $\hat K_{n,\alpha}$ contains $K_n(\theta_n)$ with probability at least $1 - \alpha$. The assertions about $\kappa_{n,\alpha}$ are immediate consequences of Theorem 1, applied to $\tilde\theta_n = \theta_n$.

Now we verify the oracle inequalities (10) and (11). Let $\lambda_n := \bigl(4\|\theta_n\|^2 + 2n\bigr)^{1/2}\,\gamma_n$. With $\tilde\lambda_n$ we denote the function $\lambda_n$ on $T_n$ corresponding to $\tilde\theta_n$. Throughout this proof we use the shorthand notation $M_n(\ell,k) := M_n(\ell) - M_n(k)$ for $M_n = R_n, \tilde R_n, L_n, \tilde L_n$ and arbitrary $\ell, k \in C_n$. Furthermore, $\lambda_n(\ell,k) := \lambda_n(k,\ell)$ if $\ell > k$, and $\lambda_n(k,k) := 0$.

In the subsequent arguments, $k_n := \min(K_n(\theta_n))$, while $j$ stands for a generic index in $\hat K_{n,\alpha}$. The definition of the set $\hat K_{n,\alpha}$ entails that
\[
  \tilde R_n(j,k_n) \;\le\; \hat\sigma_n^2 \Bigl[ \tilde\lambda_n(j,k_n)\Bigl( \sqrt{2\log\bigl(en/|j-k_n|\bigr)} + \kappa_{n,\alpha} \Bigr) + O(\log n) \Bigr]. \tag{33}
\]
Here and subsequently, $O(r_n)$ and $O_p(r_n)$ denote a generic number and random variable, respectively, depending on $n$ but neither on any other indices in $C_n$ nor on $\alpha \in (0,1)$. Precisely, in view of our remark on the dependence on $\alpha$, we consider all $\alpha = \alpha(n)$ with $\alpha(n) > 0$ such that $\kappa_{n,\alpha(n)} = O(n^{1/6})$. Note that $\hat\sigma_n^2 = 1 + O_p(n^{-1/2})$. Moreover, $\tilde\lambda_n(j,k_n)^2 \cdot 2\log\bigl(en/|j-k_n|\bigr)$ equals $12nx\log(e/x) \le 12n$ with $x := |j-k_n|/n \in [0,1]$. Thus we may rewrite (33) as
\[
  \tilde R_n(j,k_n) \;\le\; \tilde\lambda_n(j,k_n)\Bigl( \sqrt{2\log\bigl(en/|j-k_n|\bigr)} + \kappa_{n,\alpha} \Bigr) + O_p(\log n). \tag{34}
\]
Combining this with the equation $\tilde R_n(j,k_n) = R_n(j,k_n) - \tilde D_n(j,k_n)$ yields
\[
  R_n(j,k_n) \;\le\; \tilde\lambda_n(j,k_n)\Bigl( \sqrt{2\log\bigl(en/|j-k_n|\bigr)} + \kappa_{n,\alpha} \Bigr) + O_p(\log n) + \bigl|\tilde D_n(j,k_n)\bigr|. \tag{35}
\]
Since $\tilde\lambda_n(j,k_n)^2 \le 6n$ and $\max_{t\in T_n} |\tilde D_n(t)|/\lambda_n(t) = O_p\bigl(\sqrt{\log n}\,\bigr)$, (35) yields
\[
  R_n(j,k_n) \;\le\; \sqrt{12n} + \sqrt{6n}\,\kappa_{n,\alpha} + O_p(\log n) + O_p\bigl(\sqrt{\log n}\,\bigr)\,\lambda_n(j,k_n).
\]
But elementary calculations yield
\[
  \lambda_n(j,k_n)^2 \;=\; \tilde\lambda_n(j,k_n)^2 + \operatorname{sign}(k_n - j)\,4R_n(j,k_n) \;\le\; 6n + 4R_n(j,k_n). \tag{36}
\]
Hence we may conclude that
\[
  R_n(j,k_n) \;\le\; O_p\bigl(\sqrt{\log n}\,\bigr)\sqrt{R_n(j,k_n)} + O_p\bigl(\sqrt{n}\,(\sqrt{\log n} + \kappa_{n,\alpha})\bigr),
\]

and Lemma 7 (i), applied to $x = 0$ and $y = R_n(j,k_n)$, yields
\[
  \max_{j\in\hat K_{n,\alpha}} R_n(j,k_n) \;\le\; O_p\bigl(\sqrt{n}\,(\sqrt{\log n} + \kappa_{n,\alpha})\bigr). \tag{37}
\]

This preliminary result allows us to restrict our attention to indices $j$ in a certain subset of $C_n$: Since $0 \le R_n(n,k_n) = n - k_n - \sum_{i=k_n+1}^{n} \theta_{in}^2$,
\[
  \sum_{i=k_n+1}^{n} \theta_{in}^2 \;\le\; n - k_n.
\]
On the other hand, in case of $j < k_n$, $R_n(j,k_n) = \sum_{i=j+1}^{k_n} \theta_{in}^2 - (k_n - j)$, so
\[
  \sum_{i=j+1}^{n} \theta_{in}^2 \;\le\; n + O_p\bigl(\sqrt{n}\,(\sqrt{\log n} + \kappa_{n,\alpha})\bigr).
\]
Thus if $j_n$ denotes the smallest index $j \in C_n$ such that $\sum_{i=j+1}^{n} \theta_{in}^2 \le 2n$, then $j_n \le k_n$, and $\hat K_{n,\alpha} \subset \{j_n, \ldots, n\}$ with asymptotic probability one, uniformly in $\alpha \ge \alpha(n)$. This allows us to restrict our attention to indices $j$ in $\{j_n, \ldots, n\} \cap \hat K_{n,\alpha}$. For any $\ell \ge j_n$, $\tilde D_n(\ell,k_n)$ involves only the restricted signal vector $(\theta_{in})_{i=j_n+1}^{n}$, and the proof of Theorem 1 entails that
\[
  \max_{j_n \le \ell \le n} \Bigl( \frac{|\tilde D_n(\ell,k_n)|}{\lambda_n(\ell,k_n)} - \sqrt{2\log n} \Bigr)^{\!+} \;=\; O_p(1).
\]
Thus we may deduce from (35) the simpler statement that with asymptotic probability one,
\[
  R_n(j,k_n) \;\le\; \bigl(\lambda_n(j,k_n) + \tilde\lambda_n(j,k_n)\bigr)\Bigl( \sqrt{2\log n} + \kappa_{n,\alpha} + O_p(1) \Bigr) + O_p(\log n). \tag{38}
\]

Now we need reasonable bounds for $\lambda_n(j,k_n)$ and $\tilde\lambda_n(j,k_n)$ in terms of $R_n(j)$ and the minimal risk $\rho_n = R_n(k_n)$, where we start from the equation in (36): If $j < k_n$, then $\lambda_n(j,k_n)^2 = \tilde\lambda_n(j,k_n)^2 + 4R_n(j,k_n)$ and $\tilde\lambda_n(j,k_n)^2 = 6(k_n - j) \le 6\rho_n$. If $j > k_n$, then $\tilde\lambda_n(j,k_n)^2 = \lambda_n(j,k_n)^2 + 4R_n(j,k_n)$ and
\[
  \lambda_n(j,k_n)^2 \;=\; \sum_{i=k_n+1}^{j} \bigl(4\theta_{in}^2 + 2\bigr) \;\le\; 4\rho_n + 2R_n(j) \;=\; 6\rho_n + 2R_n(j,k_n).
\]
Thus
\[
  \lambda_n(j,k_n) + \tilde\lambda_n(j,k_n) \;\le\; 2\sqrt{6}\,\sqrt{\rho_n} + \bigl(\sqrt{2} + \sqrt{6}\,\bigr)\sqrt{R_n(j,k_n)},
\]
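With $\lambda_n(j,k)^2 = \sum_{i\in(j\wedge k,\,j\vee k]}(4\theta_{in}^2+2)$ and $\tilde\lambda_n(j,k)^2 = 6|k-j|$ as read off above, both the equation (36) and the last display can be verified mechanically for any signal. A small check on an arbitrary made-up signal (the numbers are ours, purely illustrative):

```python
import math

theta_sq = [5.0, 2.0, 1.5, 1.0, 0.75, 0.5, 0.25, 0.0, 0.0]   # arbitrary squared signal

n = len(theta_sq)

def R(j):                                   # risk of model j
    return sum(theta_sq[j:]) + j

kn = min(range(n + 1), key=R)               # smallest risk minimizer k_n
rho = R(kn)                                 # minimal risk rho_n

def lam2(j, k):                             # lambda_n(j,k)^2 = sum (4 theta_i^2 + 2)
    lo, hi = min(j, k), max(j, k)
    return sum(4 * t + 2 for t in theta_sq[lo:hi])

def lam2_tilde(j, k):                       # tilde lambda_n(j,k)^2 = 6 |k - j|
    return 6 * abs(k - j)

for j in range(n + 1):
    if j == kn:
        continue
    dR = R(j) - R(kn)                       # R_n(j, k_n) >= 0
    sign = 1 if kn > j else -1
    # equation (36):
    assert abs(lam2(j, kn) - (lam2_tilde(j, kn) + sign * 4 * dR)) < 1e-9
    # bound on lambda + tilde lambda:
    lhs = math.sqrt(lam2(j, kn)) + math.sqrt(lam2_tilde(j, kn))
    rhs = 2 * math.sqrt(6 * rho) + (math.sqrt(2) + math.sqrt(6)) * math.sqrt(dR)
    assert lhs <= rhs + 1e-9

bounds_checked = True
```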

and inequality (38) leads to
\[
  R_n(j,k_n) \;\le\; \Bigl( 4\sqrt{3}\,\sqrt{\log n} + 2\sqrt{6}\,\kappa_{n,\alpha} + O_p(1) \Bigr)\sqrt{\rho_n}
    + O_p\bigl(\sqrt{\log n} + \kappa_{n,\alpha}\bigr)\sqrt{R_n(j,k_n)} + O_p(\log n)
\]
for all $j \in \hat K_{n,\alpha}$. Again we may employ Lemma 7 with $x = 0$ and $y = R_n(j,k_n)$ to conclude that
\[
  \max_{j\in\hat K_{n,\alpha}} R_n(j,k_n) \;\le\; \Bigl( 4\sqrt{3}\,\sqrt{\log n} + 2\sqrt{6}\,\kappa_{n,\alpha} + O_p(1) \Bigr)\sqrt{\rho_n}
    + O_p\Bigl( \bigl(\log(n)^{3/4} + \kappa_{n,\alpha(n)}^{3/2}\bigr)\rho_n^{1/4} + \log n + \kappa_{n,\alpha(n)}^2 \Bigr)
\]
uniformly in $\alpha \ge \alpha(n)$.

If $\log(n)^3 + \kappa_{n,\alpha(n)}^6 = O(\rho_n)$, then the previous bound for $R_n(j,k_n) = R_n(j) - \rho_n$ reads
\[
  \max_{j\in\hat K_{n,\alpha}} R_n(j) \;\le\; \rho_n + \Bigl( 4\sqrt{3}\,\sqrt{\log n} + 2\sqrt{6}\,\kappa_{n,\alpha} + O_p(1) \Bigr)\sqrt{\rho_n}
\]
uniformly in $\alpha \ge \alpha(n)$. On the other hand, if we consider just a fixed $\alpha > 0$, then $\kappa_{n,\alpha} = O(1)$, and the previous considerations yield
\[
\begin{aligned}
  \max_{j\in\hat K_{n,\alpha}} R_n(j)
   &\le \rho_n + \bigl( 4\sqrt{3} + o_p(1) \bigr)\sqrt{\log(n)\,\rho_n}
     + O_p\bigl( \log(n)^{3/4}\rho_n^{1/4} + \log n \bigr) \\
   &\le \rho_n + \bigl( 4\sqrt{3} + o_p(1) \bigr)\sqrt{\log(n)\,\rho_n} + O_p(\log n).
\end{aligned}
\]
To verify the latter step, note that for any fixed $\epsilon > 0$,
\[
  \log(n)^{3/4}\rho_n^{1/4} \;\le\;
  \begin{cases}
    \epsilon^{-1}\log n & \text{if } \rho_n \le \epsilon^{-4}\log n, \\
    \epsilon\,\sqrt{\log(n)\,\rho_n} & \text{if } \rho_n \ge \epsilon^{-4}\log n.
  \end{cases}
\]
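The case distinction in the last display is elementary and can be confirmed on a grid (a verification sketch only; the grid values are arbitrary):

```python
import math

# Check: log(n)^{3/4} rho^{1/4} <= eps^{-1} log n   if rho <= eps^{-4} log n,
#        log(n)^{3/4} rho^{1/4} <= eps sqrt(log(n) rho)   otherwise.
for eps in [0.1, 0.5, 1.0, 2.0]:
    for log_n in [math.log(10), math.log(1000), math.log(10 ** 6)]:
        for rho in [0.5, 1.0, 7.0, 50.0, 1e4, 1e8]:
            lhs = log_n ** 0.75 * rho ** 0.25
            if rho <= log_n / eps ** 4:
                assert lhs <= log_n / eps + 1e-12
            else:
                assert lhs <= eps * math.sqrt(log_n * rho) + 1e-12

case_bound_checked = True
```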

It remains to prove claim (11) about the losses. From now on, $j$ denotes a generic index in $C_n$. Note first that
\[
  L_n(j,k_n) - R_n(j,k_n) \;=\; \sum_{i=j+1}^{k_n} \bigl(1 - \varepsilon_{in}^2\bigr) \;=\; R_n(k_n,j) - L_n(k_n,j)
  \quad\text{if } j < k_n.
\]
Thus Theorem 1, applied to $\theta_n = 0$, shows that
\[
  L_n(j,k_n) \;\le\; R_n(j,k_n) + \lambda_n^+(j,k_n)\bigl( \sqrt{2\log n} + O_p(1) \bigr) + O_p(\log n),
\]
where
\[
  \lambda_n^+(j,k_n) \;:=\; \sqrt{2|k_n - j|} \;\le\; \sqrt{2\rho_n + 2|R_n(j,k_n)|}.
\]

It follows from $L_n(0) = R_n(0) = \|\theta_n\|^2$ that $L_n(j) - \rho_n$ equals
\[
\begin{aligned}
  & L_n(j,k_n) + (L_n - R_n)(k_n, 0) \\
  &= R_n(j,k_n) + O_p\bigl(\sqrt{\log(n)\,\rho_n}\,\bigr) + O_p\bigl(\sqrt{\log n}\,\bigr)\sqrt{R_n(j,k_n)} + O_p(\log n) \\
  &\ge -\,O_p\bigl(\sqrt{\log(n)\,\rho_n} + \log n\bigr),
\end{aligned}
\]
because $R_n(j,k_n) \ge 0$ and $R_n(j,k_n) + O_p(r_n)\sqrt{R_n(j,k_n)} \ge -\,O_p(r_n^2)$. Consequently, $\hat\rho_n := \min_{j\in C_n} L_n(j)$ satisfies the inequality
\[
  \hat\rho_n \;\ge\; \rho_n - O_p\bigl(\sqrt{\log(n)\,\rho_n} + \log n\bigr) \;=\; (1 + o_p(1))\rho_n - O_p(\log n),
\]
and this is easily shown to entail that
\[
  \rho_n \;\le\; \hat\rho_n + O_p\bigl(\sqrt{\log n}\,\bigr)\sqrt{\hat\rho_n} + O_p(\log n) \;=\; (1 + o_p(1))\hat\rho_n + O_p(\log n).
\]
Now we restrict our attention to indices $j \in \hat K_{n,\alpha}$ again. Here it follows from our result about the maximal risk over $\hat K_{n,\alpha}$ that $L_n(j) - \rho_n$ equals
\[
\begin{aligned}
  & R_n(j,k_n) + O_p\bigl(\sqrt{\log(n)\,\rho_n}\,\bigr) + O_p\bigl(\sqrt{\log n}\,\bigr)\sqrt{R_n(j,k_n)} + O_p(\log n) \\
  &\le 2R_n(j,k_n) + O_p\bigl(\sqrt{\log(n)\,\rho_n} + \log n\bigr)
    \;\le\; O_p\bigl(\sqrt{\log(n)\,\rho_n} + \log n\bigr).
\end{aligned}
\]
Hence $\max_{j\in\hat K_{n,\alpha}} L_n(j)$ is not greater than
\[
  \rho_n + O_p\bigl(\sqrt{\log(n)\,\rho_n} + \log n\bigr)
  \;\le\; \hat\rho_n + O_p\bigl(\sqrt{\log n}\,\bigr)\sqrt{\hat\rho_n} + O_p(\log n).
\]

Proof of Theorem 4. The application of inequality (19) in Corollary 6 to the triple $\bigl(|J|,\, T_n(J) - |J|,\, \alpha/(2M_n)\bigr)$ in place of $(n, \delta^2, \alpha)$ yields bounds for $\hat\theta_{n,\alpha,l}^2(J)$ and $\hat\theta_{n,\alpha,u}^2(J)$ in terms of $\hat\theta_n^2(J) := (T_n(J) - |J|)^+$. Then we apply (17)–(18) to $T_n(J)$, replacing $(n, \delta^2, u)$ with $\bigl(|J|,\, \hat\theta_n^2(J),\, \alpha/(2M_n)\bigr)$ for any fixed $\alpha \in (0,1)$. By means of Lemma 7 (ii) we obtain finally
\[
  \max\Bigl( \hat\theta_{n,\alpha,u}^2(J) - \theta_n^2(J),\; \theta_n^2(J) - \hat\theta_{n,\alpha,l}^2(J) \Bigr)
  \;\le\; (1 + o_p(1))\sqrt{\bigl(16|J| + 32\,\theta_n^2(J)\bigr)\log M_n} + (K + o_p(1))\log M_n \tag{39}
\]
for all $J \in \mathcal M_n$. Here and throughout this proof, $K$ denotes a generic constant not depending on $n$. Its value may be different in different expressions. It follows from the definition of the confidence region $\hat K_{n,\alpha}$ that for arbitrary $C \in \hat K_{n,\alpha}$ and $D \in C_n$,

\[
\begin{aligned}
  R_n(C) - R_n(D)
  &= \theta_n^2(D \setminus C) - \theta_n^2(C \setminus D) + |C| - |D| \\
  &= \bigl(\theta_n^2 - \hat\theta_{n,\alpha,l}^2\bigr)(D \setminus C) + \bigl(\hat\theta_{n,\alpha,u}^2 - \theta_n^2\bigr)(C \setminus D) \\
  &\qquad - \Bigl[ \hat\theta_{n,\alpha,u}^2(C \setminus D) - \hat\theta_{n,\alpha,l}^2(D \setminus C) + |D| - |C| \Bigr] \\
  &\le \bigl(\theta_n^2 - \hat\theta_{n,\alpha,l}^2\bigr)(D \setminus C) + \bigl(\hat\theta_{n,\alpha,u}^2 - \theta_n^2\bigr)(C \setminus D).
\end{aligned}
\]
Moreover, according to (39) the latter bound is not larger than
\[
\begin{aligned}
  & (1 + o_p(1))\Bigl\{ \sqrt{\bigl(16|D \setminus C| + 32\,\theta_n^2(D \setminus C)\bigr)\log M_n}
    + \sqrt{\bigl(16|C \setminus D| + 32\,\theta_n^2(C \setminus D)\bigr)\log M_n} \Bigr\} \\
  &\qquad + (K + o_p(1))\log M_n \\
  &\le (1 + o_p(1))\sqrt{2\bigl(16|D| + 32\,\theta_n^2(C^c) + 16|C| + 32\,\theta_n^2(D^c)\bigr)\log M_n} + (K + o_p(1))\log M_n \\
  &\le 8\sqrt{\bigl(R_n(C) + R_n(D)\bigr)\log M_n}\,(1 + o_p(1)) + (K + o_p(1))\log M_n.
\end{aligned}
\]
Thus we obtain the quadratic inequality
\[
  R_n(C) - R_n(D) \;\le\; 8\sqrt{\bigl(R_n(C) + R_n(D)\bigr)\log M_n}\,(1 + o_p(1)) + (K + o_p(1))\log M_n,
\]
and with Lemma 7 this leads to
\[
  R_n(C) \;\le\; R_n(D) + 8\sqrt{2}\,\sqrt{R_n(D)\log M_n}\,(1 + o_p(1)) + (K + o_p(1))\log M_n.
\]
This yields the assertion about the risks.


As for the losses, note that $L_n(\cdot)$ and $R_n(\cdot)$ are closely related in that
\[
  (L_n - R_n)(D) \;=\; \sum_{i\in D} \varepsilon_{in}^2 - |D|
\]
for arbitrary $D \in C_n$. Hence we may utilize (17)–(18), replacing the triple $(n, \delta^2, u)$ with $(|D|, 0, \alpha/(2n))$, to complement (39) with the following observation:
\[
  -\,A\sqrt{|D|\log M_n} \;\le\; L_n(D) - R_n(D) \;\le\; A\sqrt{|D|\log M_n} + A\log M_n \tag{40}
\]
simultaneously for all $D \in C_n$ with probability tending to one as $n \to \infty$ and $A \to \infty$. Note also that (40) implies that $R_n(D) \le A\sqrt{R_n(D)\log M_n} + L_n(D)$. Hence
\[
  R_n(D) \;\le\; (3/2)\,L_n(D) + (5/2)\,A^2\log M_n
  \quad\text{for all } D \in C_n,
\]

by Lemma 7 (i). Assuming that both (39) and (40) hold for some large but fixed $A$, we may conclude that for arbitrary $C \in \hat K_{n,\alpha}$ and $D \in C_n$,
\[
\begin{aligned}
  L_n(C) - L_n(D)
  &= (L_n - R_n)(C) - (L_n - R_n)(D) + R_n(C) - R_n(D) \\
  &\le A\sqrt{2(|C| + |D|)\log M_n} + A\sqrt{2\bigl(R_n(C) + R_n(D)\bigr)\log M_n} + 4A\log M_n \\
  &\le 2A\sqrt{2\bigl(R_n(C) + R_n(D)\bigr)\log M_n} + 4A\log M_n \\
  &\le A'\sqrt{\bigl(L_n(C) + L_n(D)\bigr)\log M_n} + A''\log M_n
\end{aligned}
\]
for constants $A'$ and $A''$ depending on $A$. Again this inequality entails that
\[
  L_n(C) \;\le\; L_n(D) + A'\sqrt{2 L_n(D)\log M_n} + A'''\log M_n
\]
for another constant $A''' = A'''(A)$.

6. Auxiliary results

This section collects some results from the vicinity of empirical process theory which are
used in the present paper.
For any pseudo-metric space $(\mathcal X, d)$ and $u > 0$, we define the capacity number
\[
  D(u, \mathcal X, d) \;:=\; \max\bigl\{\, |\mathcal X_o| : \mathcal X_o \subset \mathcal X,\; d(x,y) > u \text{ for different } x, y \in \mathcal X_o \,\bigr\}.
\]
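For a finite pseudo-metric space the capacity number can be computed by brute force, which makes the definition concrete (a toy implementation of ours, purely illustrative):

```python
from itertools import combinations

def capacity(points, dist, u):
    """D(u, X, d): largest cardinality of a subset of `points` whose
    elements have pairwise distance strictly greater than u (brute force)."""
    for r in range(len(points), 0, -1):
        for sub in combinations(points, r):
            if all(dist(x, y) > u for x, y in combinations(sub, 2)):
                return r                     # found a u-separated subset of size r
        # no admissible subset of size r; try r - 1
    return 0                                 # only possible for an empty point set

X = [0.0, 0.3, 0.6, 0.9]
d = lambda x, y: abs(x - y)
# Pairwise gaps are multiples of 0.3, so e.g. D(0.5) = 2 and D(0.25) = 4 here.
```

Note that `capacity` is monotone: enlarging `u` can only shrink the maximal separated subset.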

It is well-known that convergence in distribution of random variables with values in a separable metric space may be metrized by the dual bounded Lipschitz distance. Now we adapt the latter distance for stochastic processes. Let $\ell^\infty(T)$ be the space of bounded functions $x: T \to \mathbf{R}$, equipped with the supremum norm $\|\cdot\|_\infty$. For two stochastic processes $X$ and $Y$ on $T$ with bounded sample paths we define
\[
  d_w(X, Y) \;:=\; \sup_{f\in H(T)} \bigl|\, \mathbb{E}^* f(X) - \mathbb{E}^* f(Y) \,\bigr|,
\]
where $P^*$ and $\mathbb{E}^*$ denote outer probabilities and expectations, respectively, while $H(T)$ is the family of all functionals $f: \ell^\infty(T) \to \mathbf{R}$ such that
\[
  |f(x)| \le 1 \quad\text{and}\quad |f(x) - f(y)| \le \|x - y\|_\infty \quad\text{for all } x, y \in \ell^\infty(T).
\]
If $d$ is a pseudo-metric on $T$, then the modulus of continuity $w(x, \delta \,|\, d)$ of a function $x \in \ell^\infty(T)$ is defined as
\[
  w(x, \delta \,|\, d) \;:=\; \sup_{s,t\in T:\ d(s,t)\le\delta} |x(s) - x(t)|.
\]

Furthermore, $C_u(T, d)$ denotes the set of uniformly continuous functions on $(T, d)$, that is,
\[
  C_u(T, d) \;=\; \Bigl\{\, x \in \ell^\infty(T) : \lim_{\delta\downarrow 0} w(x, \delta \,|\, d) = 0 \,\Bigr\}.
\]

Theorem 8. For $n = 1, 2, 3, \ldots$ consider stochastic processes $X_n = \bigl(X_n(t)\bigr)_{t\in T_n}$ and $Y_n = \bigl(Y_n(t)\bigr)_{t\in T_n}$ on a metric space $(T_n, \rho_n)$ with bounded sample paths. Then
\[
  d_w(X_n, Y_n) \;\to\; 0
\]
provided that the following three conditions are satisfied:

(i) For arbitrary subsets $T_{n,o}$ of $T_n$ with $|T_{n,o}| = O(1)$,
\[
  d_w\bigl( X_n \,|\, T_{n,o},\; Y_n \,|\, T_{n,o} \bigr) \;\to\; 0;
\]
(ii) for each number $\eta > 0$,
\[
  \lim_{\delta\downarrow 0}\, \limsup_{n\to\infty}\, P^*\bigl( w(Z_n, \delta \,|\, \rho_n) > \eta \bigr) \;=\; 0
  \quad\text{for } Z_n = X_n, Y_n;
\]
(iii) for any $\delta > 0$, $D(\delta, T_n, \rho_n) = O(1)$.

Proof. For any fixed number $\delta > 0$ let $T_{n,o}$ be a maximal subset of $T_n$ such that $\rho_n(s,t) > \delta$ for different $s, t \in T_{n,o}$. Then $|T_{n,o}| = O(1)$ by Assumption (iii). Moreover, for any $t \in T_n$ there exists a $t_o \in T_{n,o}$ such that $\rho_n(t, t_o) \le \delta$. Hence there exists a partition of $T_n$ into sets $B_n(t_o)$, $t_o \in T_{n,o}$, satisfying $t_o \in B_n(t_o) \subset \{t \in T_n : \rho_n(t, t_o) \le \delta\}$. For any function $x$ in $\ell^\infty(T_n)$ or $\ell^\infty(T_{n,o})$ let $\pi_n x \in \ell^\infty(T_n)$ be given by
\[
  \pi_n x(t) \;:=\; \sum_{t_o\in T_{n,o}} 1\{t \in B_n(t_o)\}\, x(t_o).
\]
Then $\pi_n x$ is linear in $x \,|\, T_{n,o}$ with $\|\pi_n x\|_\infty = \|x \,|\, T_{n,o}\|_\infty$. Moreover, any $x \in \ell^\infty(T_n)$ satisfies the inequality $\|x - \pi_n x\|_\infty \le w(x, \delta \,|\, \rho_n)$. Hence for $Z_n = X_n, Y_n$,
\[
\begin{aligned}
  d_w(Z_n, \pi_n Z_n)
  &\le \sup_{h\in H(T_n)} \mathbb{E}^*\bigl| h(Z_n) - h(\pi_n Z_n) \bigr| \\
  &\le \mathbb{E}^* \min\bigl( \|Z_n - \pi_n Z_n\|_\infty,\, 2 \bigr)
    \;\le\; \mathbb{E}^* \min\bigl( w(Z_n, \delta \,|\, \rho_n),\, 2 \bigr),
\end{aligned}
\]
and this is arbitrarily small for sufficiently small $\delta > 0$ and sufficiently large $n$, according to Assumption (ii).

Furthermore, elementary considerations reveal that
\[
  d_w(\pi_n X_n, \pi_n Y_n) \;=\; d_w\bigl( X_n \,|\, T_{n,o},\; Y_n \,|\, T_{n,o} \bigr),
\]
and the latter distance converges to zero, because of $|T_{n,o}| = O(1)$ and Assumption (i). Since
\[
  d_w(X_n, Y_n) \;\le\; d_w(X_n, \pi_n X_n) + d_w(Y_n, \pi_n Y_n) + d_w(\pi_n X_n, \pi_n Y_n),
\]
these considerations entail the assertion that $d_w(X_n, Y_n) \to 0$.
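The discretization device of this proof — a maximal $\delta$-separated subset $T_{n,o}$ together with the projection $\pi_n$ — is easy to reproduce numerically: by maximality every point lies within $\delta$ of $T_{n,o}$, so $\|x - \pi_n x\|_\infty \le w(x, \delta \,|\, d)$. A self-contained sketch (all names and the example function are ours):

```python
def maximal_separated(T, dist, delta):
    """Greedy maximal delta-separated subset T_o of T: pairwise distances
    exceed delta, and (by maximality) every t in T is within delta of T_o."""
    To = []
    for t in T:
        if all(dist(t, s) > delta for s in To):
            To.append(t)
    return To

def modulus(x, T, dist, delta):
    """w(x, delta | d): largest oscillation over pairs within distance delta."""
    return max(abs(x(s) - x(t)) for s in T for t in T if dist(s, t) <= delta)

# Example on a grid in [0, 1] with an arbitrary bounded function x:
T = [i / 50 for i in range(51)]
dist = lambda s, t: abs(s - t)
x = lambda t: (t - 0.4) ** 2 + (0.5 if t > 0.7 else 0.0)   # bounded, with a jump
delta = 0.1

To = maximal_separated(T, dist, delta)
# pi x(t) := x(t_o) for some t_o in To with dist(t, t_o) <= delta:
pi_x = {t: x(next(s for s in To if dist(t, s) <= delta)) for t in T}
approx_error = max(abs(x(t) - pi_x[t]) for t in T)

# The projection error is controlled by the modulus of continuity:
assert approx_error <= modulus(x, T, dist, delta)
```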


Finally, the next lemma provides a useful inequality for dw (, ) in connection with sums
of independent processes.

Lemma 9. Let $X = X_1 + X_2$ and $Y = Y_1 + Y_2$ with independent random variables $X_1$, $X_2$ and independent random variables $Y_1$, $Y_2$, all taking values in $(\ell^\infty(T), \|\cdot\|_\infty)$. Then
\[
  d_w(X, Y) \;\le\; d_w(X_1, Y_1) + d_w(X_2, Y_2).
\]
For this lemma it is important that we consider random variables rather than just stochastic processes with bounded sample paths. Note that a stochastic process on $T$ is automatically a random variable with values in $(\ell^\infty(T), \|\cdot\|_\infty)$ if (a) the index set $T$ is finite, or (b) the process has uniformly continuous sample paths with respect to a pseudo-metric $d$ on $T$ such that $D(u, T, d) < \infty$ for all $u > 0$.

Proof of Lemma 9. Without loss of generality let the four random variables $X_1$, $X_2$, $Y_1$ and $Y_2$ be defined on a common probability space and stochastically independent. Let $f$ be an arbitrary functional in $H(T)$. Then it follows from Fubini's theorem that
\[
\begin{aligned}
  & \bigl| \mathbb{E} f(X_1 + X_2) - \mathbb{E} f(Y_1 + Y_2) \bigr| \\
  &\le \bigl| \mathbb{E} f(X_1 + X_2) - \mathbb{E} f(Y_1 + X_2) \bigr|
     + \bigl| \mathbb{E} f(Y_1 + X_2) - \mathbb{E} f(Y_1 + Y_2) \bigr| \\
  &\le \mathbb{E}\,\bigl| \mathbb{E}\bigl( f(X_1 + X_2) \,\big|\, X_2 \bigr) - \mathbb{E}\bigl( f(Y_1 + X_2) \,\big|\, X_2 \bigr) \bigr| \\
  &\qquad + \mathbb{E}\,\bigl| \mathbb{E}\bigl( f(Y_1 + X_2) \,\big|\, Y_1 \bigr) - \mathbb{E}\bigl( f(Y_1 + Y_2) \,\big|\, Y_1 \bigr) \bigr| \\
  &\le d_w(X_1, Y_1) + d_w(X_2, Y_2).
\end{aligned}
\]
The latter inequality follows from the fact that the functionals $x \mapsto f(x + X_2)$ and $x \mapsto f(Y_1 + x)$ belong to $H(T)$, too. Thus $d_w(X, Y) \le d_w(X_1, Y_1) + d_w(X_2, Y_2)$.

Acknowledgement. Constructive comments of a referee are gratefully acknowledged.



References

[1] Baraud, Y. (2004). Confidence balls in Gaussian regression. Ann. Statist. 32,
528-551.
[2] Beran, R. (1996). Confidence sets centered at Cp estimators. Ann. Inst. Statist.
Math. 48, 1-15.
[3] Beran, R. (2000). REACT scatterplot smoothers: superefficiency through basis
economy. J. Amer. Statist. Assoc. 95, 155-169.
[4] Beran, R. and Dümbgen, L. (1998). Modulation of estimators and confidence sets.
Ann. Statist. 26, 1826-1856.
[5] Birgé, L. and Massart, P. (2001). Gaussian model selection. J. Eur. Math. Soc.
3, 203-268.
[6] Cai, T.T. (1999). Adaptive wavelet estimation: a block thresholding and oracle
inequality approach. Ann. Statist. 27, 898-924.
[7] Cai, T.T. (2002). On block thresholding in wavelet regression: adaptivity, block size,
and threshold level. Statistica Sinica 12, 1241-1273.
[8] Cai, T.T. and Low, M.G. (2006). Adaptive confidence balls. Ann. Statist. 34,
202-228.
[9] Cai, T.T. and Low, M.G. (2007). Adaptive estimation and confidence intervals for
convex functions and monotone functions. Manuscript in preparation.
[10] Dahlhaus, R. and Polonik, W. (2006). Nonparametric quasi-maximum likelihood
estimation for Gaussian locally stationary processes. Ann. Statist. 34, 2790-2824.
[11] Donoho, D.L. and Johnstone, I.M. (1994). Ideal spatial adaptation by wavelet
shrinkage. Biometrika 81, 425-455.
[12] Donoho, D.L. and Johnstone, I.M. (1995). Adapting to unknown smoothness via
wavelet shrinkage. JASA 90, 1200-1224.
[13] Donoho, D.L. and Johnstone, I.M. (1998). Minimax estimation via wavelet
shrinkage. Ann. Statist. 26, 879-921.
[14] Dümbgen, L. (2002). Application of local rank tests to nonparametric regression. J.
Nonpar. Statist. 14, 511-537.
[15] Dümbgen, L. (2003). Optimal confidence bands for shape-restricted curves. Bernoulli
9, 423-449.
[16] Dümbgen, L. and Spokoiny, V.G. (2001). Multiscale testing of qualitative hy-
potheses. Ann. Statist. 29, 124-152.
[17] Dümbgen, L. and Walther, G. (2007). Multiscale inference about a density. Tech-
nical report 56, IMSV, University of Bern.
[18] Efromovich, S. (1998). Simultaneous sharp estimation of functions and their deriva-
tives. Ann. Statist. 26, 273-278.
[19] Futschik, A. (1999). Confidence regions for the set of global maximizers of non-
parametrically estimated curves. J. Statist. Plann. Inf. 82, 237-250.
[20] Genovese, C.R. and Wasserman, L. (2005). Confidence sets for nonparametric
wavelet regression. Ann. Statist. 33, 698-729.
[21] Hengartner, N.W. and Stark, P.B. (1995). Finite-sample confidence envelopes
for shape-restricted densities. Ann. Statist. 23, 525-550.
[22] Hoffmann, M. and Lepski, O. (2002). Random rates in anisotropic regression
(with discussion). Ann. Statist. 30, 325-396.

[23] Lepski, O.V., Mammen, E. and Spokoiny, V.G. (1997). Optimal spatial adap-
tation to inhomogeneous smoothness: an approach based on kernel estimates with
variable bandwidth selectors. Ann. Statist. 25, 929-947.
[24] Li, K.-C. (1989). Honest confidence regions for nonparametric regression. Ann.
Statist. 17, 1001-1008.
[25] Polyak, B.T. and Tsybakov, A.B. (1991). Asymptotic optimality of the Cp -test
for the orthogonal series estimation of regression. Theory Probab. Appl. 35, 293-306.
[26] Robins, J. and van der Vaart, A. (2006). Adaptive nonparametric confidence
sets. Ann. Statist. 34, 229-253.
[27] Stone, C.J. (1984). An asymptotically optimal window selection rule for kernel
density estimates. Ann. Statist. 12, 1285-1297.
