ITERATIVELY REGULARIZED GAUSS-NEWTON METHOD FOR NONLINEAR INVERSE PROBLEMS WITH RANDOM NOISE

We study the convergence of regularized Newton methods applied to nonlinear operator equations in Hilbert spaces when the data are perturbed by random noise. It is shown that the expected square error is bounded by a constant times the minimax rate of the corresponding linearized problem if the stopping index is chosen using a priori knowledge of the smoothness of the solution. For unknown smoothness the stopping index can be chosen adaptively based on Lepskiĭ's balancing principle. For this stopping rule we establish an oracle inequality, which implies order optimal rates for deterministic errors, and optimal rates up to a logarithmic factor for random noise. The performance and the statistical properties of the proposed method are illustrated by Monte Carlo simulations.

AMS subject classifications. 65J22, 62G99, 65J20, 65J15


Introduction.
In this paper we study the solution of nonlinear ill-posed operator equations (1.1) F(a) = u, assuming that the exact data u are perturbed by random noise. Here F : D(F) ⊂ X → Y is a nonlinear operator between separable Hilbert spaces X and Y which is Fréchet differentiable on its domain D(F).
Whereas the solution of nonlinear operator equations by iterative regularization methods with deterministic errors has been studied intensively over the last decade (see Bakushinskiĭ and Kokurin [2] and the references therein), we are not aware of any convergence and convergence rate results of iterative regularization methods for nonlinear inverse problems with random noise, although in many practical examples iterative methods are frequently applied where the noise is random rather than deterministic. Vice versa, nonlinear regularization methods are rarely used in a statistical context, and our aim is to explore the potential use of these methods in classical fields of applications such as econometrics [22] and financial statistics [10]. In particular, we will derive rates of convergence under rather general assumptions.
We will consider the case that the error in the data consists of both a deterministic and a stochastic part, but we are mainly interested in the situation where the stochastic noise is dominant. More precisely, we assume that the measured data u_obs are described by a Hilbert space process

(1.2a) u_obs = u + δη + σξ,

where δ ≥ 0 describes the deterministic error level, η ∈ Y, ‖η‖ ≤ 1 denotes the normalized deterministic error, σ² ≥ 0 determines the variance of the stochastic noise, and ξ is a Hilbert space process in Y. We assume that ξ_ϕ := ⟨ξ, ϕ⟩ is a random variable with E[ξ_ϕ] = 0 and Var ξ_ϕ < ∞ for any test vector ϕ ∈ Y, and that the covariance operator Cov_ξ : Y → Y, characterized by ⟨Cov_ξ ϕ, ψ⟩_Y = E[ξ_ϕ ξ_ψ], satisfies the normalization condition (1.2b) ‖Cov_ξ‖ ≤ 1. Note that in the case of white noise (Cov_ξ = I_Y) the noise ξ does not belong to the Hilbert space Y with probability 1; nevertheless, this prominent situation is covered in our setting in the weak formulation above. For a further discussion of the noise model (1.2) we refer to [5], where it is shown that it incorporates several discrete noise models in which the data consist of a vector of n measurements of a function u at different points. In this case σ is proportional to n^{−1/2}; i.e., the limit n → ∞ corresponds to the limit σ → 0. A particular discrete noise model will be discussed in section 5.
In this paper we provide a convergence analysis for the class of generalized Gauss-Newton methods given by

(1.3a) â_{k+1} = a_0 + g_{α_k}(T_k* T_k) T_k* (u_obs − F(â_k) + T_k(â_k − a_0)), T_k := F′(â_k),

where g_α are the filter functions corresponding to m-times iterated Tikhonov regularization with m ∈ N. Standard Tikhonov regularization is included as the special case m = 1. For deterministic errors (i.e., σ = 0 in (1.2a)) the convergence of the iteration (1.3a) has been studied in [1, 2, 3]. Typically the explicit formula (1.3a) is not used in implementations; instead, â_{k+1} is computed by solving m linear least squares problems (see Figure 2.1). The advantage of iterated Tikhonov regularization over ordinary Tikhonov regularization is a higher qualification number (see [12]). For simplicity, we assume that the regularization parameters are chosen of the form

(1.3b) α_k = α_0 q^k with q ∈ (0, 1) and α_0 > 0.
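Since the explicit formula (1.3a) is rarely used directly, it may help to see how a single regularized linear solve reduces to m least squares problems. The following numpy sketch is a hypothetical illustration (not the authors' code); the matrix T stands for a discretized linearized operator, and it computes z_m = g_α(TᵀT)Tᵀb for the filter of m-times iterated Tikhonov regularization by solving the regularized normal equations m times:

```python
import numpy as np

def iterated_tikhonov(T, b, alpha, m):
    """m-times iterated Tikhonov applied to the linear system T z = b.

    Each pass solves the regularized normal equations
        (T^T T + alpha I) z_j = T^T b + alpha z_{j-1},   z_0 = 0,
    which is equivalent to z_m = g_alpha(T^T T) T^T b for the filter of
    m-times iterated Tikhonov regularization; m = 1 is ordinary Tikhonov.
    """
    n = T.shape[1]
    A = T.T @ T + alpha * np.eye(n)   # regularized normal equations matrix
    z = np.zeros(n)
    for _ in range(m):
        z = np.linalg.solve(A, T.T @ b + alpha * z)
    return z
```

For a diagonal operator this reproduces the filter g_α(λ) = ((λ+α)^m − α^m)/(λ(λ+α)^m); the higher qualification of the iterated method mentioned above is visible in the faster decay of the corresponding residual filter.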
Let us comment on the differences when treating measurement errors as random instead of deterministic: from a practical point of view, the most important difference is the choice of the stopping index. Whereas the discrepancy principle, the most common deterministic stopping rule, works reasonably well for discrete random noise models with small data vectors, its performance becomes arbitrarily bad as the size of the data vector increases. This is further discussed and numerically demonstrated in section 5. The same holds true for the deterministic version of Lepskiĭ's balancing principle as studied for the iteration (1.3a) in [3]. From a theoretical point of view, the rates of convergence are different for deterministic and random noise, and in the latter case they depend not only on the source condition, but also on the operator F and the covariance operator Cov_ξ of the noise.
Actually our analysis also provides an improvement of known results for purely deterministic errors, i.e., σ = 0. This is achieved by showing an oracle inequality, which is a well-established technique in statistics (cf. [8, 9]), but is rarely used in numerical analysis thus far. To our knowledge the only deterministic oracle inequality has been shown by Mathé and Pereverzev (see [19, 21]) for Lepskiĭ's balancing principle applied to linear problems. Theorem 4.1 below is a generalization of this result to nonlinear problems. As shown in Remark 4.3, this provides error estimates which are better by an arbitrarily large factor than any known deterministic error estimates for any nonlinear inversion method in the limit δ → 0.
An important alternative to the iteratively regularized Gauss-Newton method is nonlinear Tikhonov regularization, for which convergence and convergence rate results for random noise have been obtained by O'Sullivan [23], Bissantz, Hohage, and Munk [4], Loubes and Ludeña [17], and Hohage and Pricop [14]. In this paper we show order optimal rates of convergence under less restrictive assumptions on the operator than in [14, 17, 23] and for a range of smoothness classes instead of a single one as in [4].
The paper is organized as follows: In the following section we show that the total error can be decomposed into an approximation error, a propagated data noise error, and a nonlinearity error, and that the last error component is dominated by the sum of the first two error components (Lemma 2.2). This will be fundamental for the rest of this paper. In section 3 we prove order optimal rates of convergence if the smoothness of the solution is known and the stopping index is chosen appropriately. Adaptation to unknown smoothness by Lepskiĭ's balancing principle is discussed in section 4, and an oracle inequality for nonlinear inverse problems is shown. The paper is completed by numerical simulations for a parameter identification problem in a differential equation, which illustrate how well the theoretical rates of convergence are met and compare the performances of the balancing principle and the discrepancy principle.

Error decomposition.
In this section we will analyze the error

(2.1) E_k := â_k − a†

of the iteration (1.3). We set â_0 := a_0, i.e., E_0 = a_0 − a†. Since lower bounds for the expected square error are typically not available for nonlinear inverse problems, we compare our upper bounds on the error with lower bounds for the linearized inverse problem

(2.2) u^δ_lin = T a† + δη + σξ, T := F′(a†).

It is a fundamental observation due to Bakushinskiĭ [1], which transfers directly from deterministic errors to random noise, that the total error can be split into three components,

(2.3a) E_{k+1} = E^app_{k+1} + E^noi_{k+1} + E^nl_{k+1},

an approximation error, a propagated data noise error, and a nonlinearity error. If F is linear, then T_k = T, and the Taylor remainder F(a†) − F(â_k) + T_k E_k vanishes. Hence E^nl_{k+1} = 0; i.e., the nonlinearity error vanishes for the linearized equation (2.2). This can also be seen as follows: if F is linear, then the iteration formula (1.3a) reduces to the nonrecursive formula â_k = a_0 + g_{α_k}(T*T)T*(u^δ_lin − T a_0), which is the underlying linear regularization method with initial guess a_0 and regularization parameter α_k applied to (2.2). The approximation error E^app_k agrees exactly in the linear and the nonlinear case, and the data noise error E^noi_k differs only through the operators T and T_k. The goal of the following analysis is to show that the nonlinearity error E^nl_k can be bounded in terms of sharp estimates of E^app_k and E^noi_k.

Approximation error. We will assume that there exists w ∈ Y such that

(2.4) a_0 − a† = T*w, ‖w‖ ≤ ρ

for some ρ > 0. This is equivalent to the existence of w̃ ∈ X with ‖w̃‖ ≤ ρ such that a_0 − a† = (T*T)^{1/2} w̃ (see [12, Prop.
2.18]). Later we will require ρ to be sufficiently small, which expresses the usual closeness condition on the initial guess required for the convergence of Newton's method. Note, however, that we require not only smallness of a_0 − a† in the norm ‖·‖_X but smallness in the stronger norm ‖(T*)† ·‖. It is well known (see [12] or (3.3) below) that under assumption (2.4) the approximation error of iterated Tikhonov regularization is bounded as stated in (2.5). Moreover, the approximation error satisfies

(2.6a) ‖E^app_{k+1}‖ ≤ ‖E^app_k‖ ≤ γ_app ‖E^app_{k+1}‖

with γ_app := q^{−m}. If α_0 is chosen sufficiently large, we also have (2.6b). All the inequalities in (2.6) can be reduced to inequalities for real-valued functions with the help of spectral theory [12]. The second inequality in (2.6a) follows from the corresponding inequality for the filter functions, and the first inequality holds since r_α(t) = (α/(α + t))^m is monotonically increasing in α for each spectral value t of T*T.
Remark 2.1. Note that the second inequality in (2.6a) rules out regularization methods with infinite qualification, such as Landweber iteration, as an alternative to iterated Tikhonov regularization. Although the regularized Newton method (1.3a) also converges for such linear regularization methods [2, 15], and convergence is even faster for smooth solutions, the estimate (2.15) in Lemma 2.2 will in general be violated, as it contains the norm of the approximation error E^app itself instead of an estimate. This estimate is crucial for achieving the improvement discussed in Remark 4.3. The results of this paper also hold true for other spectral regularization methods satisfying (2.6) and (2.12) below, but since iterated Tikhonov regularization is by far the most common choice, we restrict ourselves to this case for simplicity.
Propagated data noise error. The deterministic part of E^noi_k can be estimated by the well-known operator norm bound (2.7) with a constant C_g depending only on m. This bound cannot be used to estimate the stochastic part of E^noi_k or, more precisely, the variance term V(a, α) with a ∈ D(F) and α > 0, since ‖ξ‖ = ∞ almost surely for typical noise processes ξ such as white noise. We assume that there exists a known function ϕ_noi : (0, ∞) → (0, ∞) satisfying (2.8a). Such a condition is satisfied if F′[a] is Hilbert-Schmidt for all a ∈ D(F) and the singular values of these operators have the same rate of decay for all a ∈ D(F).
Estimates of the form (2.8a) have been derived for spectral regularization methods under general assumptions in [5]. We further assume that (2.8b) holds for some constants γ_noi, γ̄_noi < ∞. Moreover, we assume an exponential inequality of the form (2.8c). Such an exponential inequality is derived for Gaussian noise processes ξ in the appendix. If ξ is white noise and the singular values of F′[a] decay at the rate σ_j(F′[a]) ∼ j^{−β} with β > 1/2 uniformly for all a ∈ D(F), then we can choose ϕ_noi of the form (2.9) with c := 1/2 + 1/(4β) and some constant C_noi ∈ (0, ∞) (see [5]). For colored noise, c may also take values smaller than 1/2. By virtue of the exponential inequality (2.8c), the probability is very small that the stochastic propagated data noise error V(â_k, α_k) at any Newton step k is much larger than the expected value E[V(â_k, α_k)]. We will distinguish between a "good case," where the propagated data noise is "small" at all Newton steps, and a "bad case," where the noise is "large" in at least one Newton step. The "good case" is analyzed in Lemma 2.2 below. The proof of Theorem 3.2 will require a rather elaborate distinction between what is small and what is large in order to derive the optimal rate of convergence. This distinction will be described by functions τ(k, σ) satisfying (2.10a). For given τ = τ(k, σ) and a given maximal iteration number k̄, the "good event" A_{τ,k̄} is that all iterates â_k belong to D(F) for k = 1, ..., k̄ (and hence are well defined) and that the propagated data noise error is bounded as in (2.10b).

FRANK BAUER, THORSTEN HOHAGE, AND AXEL MUNK
Nonlinearity error. In the following we will assume that the Fréchet derivative F′ of F satisfies a Lipschitz condition with constant L > 0; i.e.,

(2.11) ‖F′(a) − F′(b)‖ ≤ L ‖a − b‖ for all a, b ∈ D(F).

If the straight line connecting a† and â_k is contained in D(F), the Taylor remainder can be bounded by ‖F(a†) − F(â_k) + T_k E_k‖ ≤ (L/2)‖E_k‖². To bound the second term in the definition of E^nl_{k+1}, we need the assumption (2.4) and the estimate (2.12), which can be shown with the help of the Riesz-Dunford formula; see Bakushinskiĭ and Kokurin [2, section 4.1]. Using (2.12), the source condition (2.4), and (2.11), the second term can be bounded accordingly. In summary we obtain the following recursive estimate of the nonlinearity error:

Lemma 2.2. Assume that the ball B_{2R}(a_0) with center a_0 and radius 2R > 0 is contained in D(F) and that (1.2), (1.3), (2.8), and (2.11) hold true. Moreover, let α_0 be sufficiently large that (2.6b) is satisfied.
Then there exists ρ > 0, specified in the proof, such that the following holds true for all a† ∈ B_R(a_0) satisfying the source condition (2.4) and all δ, σ ≥ 0: if the condition (2.10) defining the "good event" is satisfied with k̄ = K_max defined by (2.14), then

(2.15) ‖E^nl_k‖ ≤ γ_nl (‖E^app_k‖ + Φ_noi(k)), k = 1, ..., K_max.

Proof. We prove this by induction in k, starting with the induction step. If the assertion holds true for k − 1 (k = 2, ..., K_max), then we obtain from (2.3a), (2.6a), and (2.10) that (2.16) holds. Hence, it follows from (2.13) and the inequality (x + y)² ≤ 2x² + 2y² that (2.17) holds. If ρ ≤ γ_nl/(2C_nl(1 + γ_nl)γ_app), the first term on the right-hand side of (2.17) is bounded by (γ_nl/2)(‖E^app_k‖ + Φ_noi(k)). Now we estimate the second term in (2.17). Using (2.5) we obtain the corresponding bound for ρ sufficiently small. Moreover, it follows from the inequality γ_nl ≤ 1 and the definition of K_max that the second line in (2.17) is also bounded by (γ_nl/2)(‖E^app_k‖ + Φ_noi(k)), which yields (2.15) for k. This can be used to show that â_k ∈ B_{2R}(a_0), so that â_k is well defined; here we use the monotonicity of Φ_noi (cf. (2.10)) and (2.14). Together with (2.15) this completes the induction step. It remains to establish the assertion for k = 1. Due to assumption (2.6b), ‖E_0‖ is bounded by the right-hand side of (2.16) with k = 1. Now the assertion for k = 1 follows as above.
Since we do not assume the stochastic noise ξ to be bounded, there is a positive probability that â_k ∉ D(F) at each Newton step k. Therefore, we stop the iteration if ‖â_k − a_0‖ ≥ 2R for some k, and we choose the initial guess a_0 as estimator of a† in this case. The algorithm is summarized in Figure 2.1.
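The algorithm of Figure 2.1 can be sketched as follows. This is a schematic reimplementation under our own naming; `irgnm`, `Fprime`, and the default parameter values are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def irgnm(F, Fprime, u_obs, a0, alpha0=1.0, q=0.5, m=2, K_max=20, R=1.0):
    """Schematic iteratively regularized Gauss-Newton method, cf. (1.3).

    F(a) returns the (discretized) forward map as a vector, Fprime(a) its
    Jacobian matrix.  Each Newton step applies m-times iterated Tikhonov
    regularization with alpha_k = alpha0 * q**k to the linearized equation
        T_k (a - a0) = u_obs - F(a_k) + T_k (a_k - a0),  T_k = Fprime(a_k).
    If an iterate leaves the ball of radius 2R around a0, the iteration is
    aborted and a0 is returned as the estimator, as in Figure 2.1.
    """
    a = a0.copy()
    iterates = [a0.copy()]
    for k in range(K_max):
        alpha = alpha0 * q ** k
        T = Fprime(a)
        b = u_obs - F(a) + T @ (a - a0)       # linearized data around a0
        A = T.T @ T + alpha * np.eye(len(a0))
        z = np.zeros(len(a0))
        for _ in range(m):                     # m iterated Tikhonov passes
            z = np.linalg.solve(A, T.T @ b + alpha * z)
        a = a0 + z
        if np.linalg.norm(a - a0) >= 2 * R:    # safeguard: fall back to a0
            return a0, iterates
        iterates.append(a.copy())
    return a, iterates
```

A stopping rule such as the balancing principle of section 4 would then pick one of the stored iterates rather than the last one.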

Convergence results for known smoothness.
In the following we will assume more smoothness of a_0 − a† than in (2.4). This is expressed in terms of source conditions of the form

(3.2) a_0 − a† = Λ(T*T)w,

where Λ is continuous and monotonically increasing with Λ(0) = 0 (see [20] for the linear case). If Λ and the filter functions satisfy the qualification condition (3.1), the approximation error can be bounded in terms of Λ(α_k) (see (3.3)). The most important case is Λ(t) = t^μ, and the condition (3.2) is then called a Hölder source condition. The largest number μ > 0 for which (3.1) holds true with this choice of Λ is called the qualification μ_0 of the linear regularization method. The qualification of Tikhonov regularization is μ_0 = 1, and the qualification of m-times iterated Tikhonov regularization is μ_0 = m (see [12]). Let us first consider deterministic errors, i.e., σ = 0. The following result shows that the same rate of convergence can be achieved for the nonlinear inverse problem (1.2) as for the linearized problem (2.2).
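For orientation, in the Hölder case the interplay of the source condition and the qualification can be summarized as follows. This is a standard spectral-calculus sketch with a generic constant C_μ, not a verbatim restatement of (3.3):

```latex
% Hölder source condition and resulting approximation error bound:
a_0 - a^{\dagger} = (T^*T)^{\mu} w, \qquad \|w\| \le \rho,
\\[4pt]
\|E^{app}_{k+1}\| = \|r_{\alpha_k}(T^*T)\,(T^*T)^{\mu} w\|
   \;\le\; C_{\mu}\, \alpha_k^{\mu}\, \rho,
   \qquad 0 < \mu \le \mu_0 = m .
```

For μ > μ_0 the rate saturates at α_k^{μ_0}, which is why the higher qualification of m-times iterated Tikhonov regularization pays off for smooth solutions.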

The events A_j in (3.9) are obtained by replacing the bound in (2.8c) by the bound in (2.10b). Due to (2.8b) we can choose κ > 1 sufficiently small such that τ_j satisfies condition (2.10a) for all j = 1, ..., J. Definition (3.9) is motivated by the fact that an unusually large propagated data noise term E^noi_k at the first Newton steps, where the total error is dominated by E^app_k, has less effect than an unusually large propagated data noise error at the last Newton steps. From Lemmas 2.2 and 3.4 it follows that the iterates â_1, ..., â_K remain in B_R(a†) if the bounds on E^noi_k, k = 1, ..., K, in the definition (3.9) of A_j with j ≤ J(σ) are satisfied and σ is sufficiently small. Together with (2.8c) this yields a bound for the probability of the complementary event. By definition, we have K* = K for the events A_j. Applying Lemma 2.2 and the inequality (x + y + z)² ≤ 3x² + 3y² + 3z², and using γ_nl ≤ 1, we obtain the corresponding error bound. Moreover, by the definition of the algorithm we have the error bound ‖â_{K*} − a†‖ ≤ ‖â_{K*} − a_0‖ + ‖a_0 − a†‖ ≤ 3R for the event CA_{J(σ)}. Note that J(σ) is chosen such that e^{−c_2 J(σ)} ≤ σ². Hence the claimed estimate follows for some constant C > 0. In the second line we have used that P(A_1) + Σ_{j=2}^{J(σ)} P(A_j \ A_{j−1}) + P(CA_{J(σ)}) = 1 and P(A_j \ A_{j−1}) ≤ P(CA_{j−1}). Now (3.8) follows as in the proof of Theorem 3.1.
The proof of Theorem 3.2 is completed by the following two lemmas.
With γ̄_noi := max{γ_noi, q^{−1/2}}, the optimal stopping index K defined in (3.7) satisfies the bounds stated in (3.10). To show (3.10), assume on the contrary that (3.13) holds. Then, using (2.6a) and (3.12), we obtain a contradiction to the definition of K. Therefore, (3.13) is false, and (3.10) holds true.
As lim_{x→∞} x q^{cx} = 0 for all c > 0, there exists σ_0 > 0 such that the required estimate holds for all σ ∈ (0, σ_0]. This together with (3.16) gives the assertion. The right-hand side of the estimate (3.8) is known to be an order optimal error bound for the linearized problem (2.2) under mild assumptions (cf. [5]). Again, more explicit bounds can easily be derived under more specific assumptions. For example, if δ = 0 and ϕ_noi is given by (2.9), we obtain the rate stated in (3.17).

Adaptation by Lepskiȋ's balancing principle.
The stopping index in Theorem 3.2 cannot be computed since it depends on ‖E^app_k‖ and hence on the smoothness of the difference a_0 − a† between the initial guess and the unknown solution. In this section we address the problem of how to choose the stopping index of the Newton iteration adaptively using a Lepskiĭ-type balancing principle.
We will present the balancing principle in a general framework adapted from [19].From our choice of notation it will be obvious how the regularized Newton methods (1.3) fit into this framework and how the stopping index can be selected for these methods.As the same method will hopefully also apply to regularization methods other than (1.3), we have decided to use this more general setting.
General abstract setting. Let â_0, â_1, ..., â_{K_max} be estimators of a† in a metric space (X, d), and let Φ_noi, Φ_app, Φ_nl : N_0 → [0, ∞) be functions such that

(4.1) d(â_k, a†) ≤ Φ_app(k) + Φ_noi(k) + Φ_nl(k), k = 0, ..., K_max.

We assume that Φ_noi is known and nondecreasing, Φ_app is unknown and nonincreasing, and Φ_nl is unknown and satisfies

(4.2) Φ_nl(k) ≤ γ_nl (Φ_app(k) + Φ_noi(k))

for some γ_nl > 0. This is illustrated in Figure 4.1. Note that under the assumptions of Lemma 2.2 these inequalities are satisfied for the iterates of the generalized Gauss-Newton method (1.3) with Φ_app(k) := ‖E^app_k‖ and Φ_nl(k) := ‖E^nl_k‖. As in [3] we consider the Lepskiĭ-type parameter selection rule (4.3).

Deterministic errors. We first recall some results from [19, 21] for linear problems, i.e., Φ_nl ≡ 0. Assume that Φ_noi satisfies

(4.4) Φ_noi(k + 1) ≤ γ_noi Φ_noi(k)

for some constant γ_noi < ∞, and let

(4.5) k̄ := argmin_{1≤k≤K_max} (Φ_app(k) + Φ_noi(k)).

Then, as shown by Mathé and Pereverzev [21], the following deterministic oracle inequality holds true:

(4.6) d(â_{k_bal}, a†) ≤ 6γ_noi (Φ_app(k̄) + Φ_noi(k̄)).

This shows that k_bal yields an optimal error bound up to a factor 6γ_noi.
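To make the abstract rule concrete, here is a minimal numpy sketch of a Lepskiĭ-type selection in the spirit of (4.3). The threshold constant 4(1 + γ_nl) comes from the generic Lepskiĭ argument and need not coincide with the constant in the paper's rule:

```python
import numpy as np

def lepskii_stopping(iterates, Phi_noi, gamma_nl=0.1):
    """Lepskii-type balancing in the spirit of rule (4.3).

    iterates : list [a_0, ..., a_Kmax] of estimators (numpy arrays)
    Phi_noi  : known, nondecreasing noise bound, evaluated as Phi_noi(k)
    Returns the smallest k whose iterate lies within
    4*(1+gamma_nl)*Phi_noi(j) of every later iterate a_j, j = k, ..., Kmax.
    The threshold constant follows the generic Lepskii argument and may
    differ from the constant used in the paper's rule (4.3).
    """
    K = len(iterates) - 1
    c = 4.0 * (1.0 + gamma_nl)
    for k in range(K + 1):
        if all(np.linalg.norm(iterates[j] - iterates[k]) <= c * Phi_noi(j)
               for j in range(k, K + 1)):
            return k
    return K
```

With Φ_app nonincreasing and Φ_noi nondecreasing, every index beyond the balance point is admissible, so the smallest admissible k tracks the oracle index up to a constant factor, in the spirit of (4.6).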
If (4.1) and (4.2) hold true, then Assumption 2.1 in [3] is satisfied with E(k)δ = 2(1 + γ_nl)Φ_noi(k) (in the notation of [3]) and (4.7). Therefore, Theorem 2.3 in [3] implies the error estimate (4.8). We obtain the following order optimality result inspired by Mathé [19].

Theorem 4.1 (deterministic oracle inequality). Assume (4.1)-(4.4). Then we have K ≤ k̄ with k̄ defined in (4.5). Therefore, we get (4.9). Obviously, we could add the term Φ_nl(k) to Φ_app(k) + Φ_noi(k) on the right-hand side of (4.9). Hence, Theorem 4.1 implies that the Lepskiĭ rule (4.3) leads to an optimal error bound up to a factor 6(1 + γ_nl)γ_noi among all k = 1, ..., K_max. However, Theorem 4.1 even implies the stronger result that we obtain the same rates of convergence as in the linear case, which are often known to be minimax.
To show (4.13), let ε > 0. Using Lemma 3.3 and (3.11) with σ = 0, it follows from (4.12) that there exists δ_0 such that the required bound holds for all δ < δ_0. Using (3.10) in Lemma 3.3, we obtain an estimate with C > 0 independent of ε and δ. This shows (4.13), since we can make (Cε)^{1/(2μ+1)} arbitrarily small. Equation (4.13) implies that although estimates of the form (4.11), known in deterministic regularization theory for the discrepancy principle and improved parameter choice rules [12, 16], are order optimal as uniform estimates over a smoothness class, they are suboptimal by an arbitrarily large factor for each individual element of the smoothness class in the limit δ → 0. To our knowledge it is an open question whether or not deterministic parameter choice rules other than Lepskiĭ's balancing principle are optimal in the more restrictive sense of oracle inequalities.
General noise model. We return to the general noise model (1.2).

Theorem 4.4. Let the assumptions of Lemma 2.2 hold true with K_max determined by (2.14).

Proof. First consider the event A_{τ,K_max} of "bounded noise" defined by (2.10b) with τ(k, σ) = (ln σ^{−2})/c_2. Then the assumptions of Theorem 4.1 are satisfied with Φ_app(k) := ‖E^app_k‖ and Φ_nl(k) := ‖E^nl_k‖ due to (2.3a), (2.8b), and Lemma 2.2, and we get (4.14). As μ > 1/2, we can use the same arguments as in Lemma 3.4 to show that K := argmin_{k∈N}(‖E^app_k‖ + Φ_noi(k)) ≤ K_max for δ, σ small enough, and hence the minimum may be taken over all k ∈ N_0. Moreover, we obtain a tail bound with some constant C_tail > 0. For the last inequality we have used that K_max = O(ln σ^{−1}) due to the definition of K_max and the fact that ϕ_noi is decreasing. Using again that ϕ_noi is decreasing and K → ∞ as σ → 0, we get (4.15) for σ small enough with a generic constant C. Hence the assertion follows. In particular, for δ = 0 and ϕ_noi given by (2.9), we obtain the rate (4.16). Comparing (4.16) to (3.17), or (4.14) to (3.8), shows that we have to pay a logarithmic factor for adaptivity. As shown in [26], it is impossible to achieve optimal rates adaptively in the general situation considered in this paper. However, for special classes of linear inverse problems which are not too ill-posed, order optimal adaptive parameter choice rules have been devised (see, e.g., [9]). It remains an interesting open problem to construct order optimal adaptive stopping rules for mildly ill-posed nonlinear statistical inverse problems.

Numerical experiments.
To test the predicted rates of convergence with random noise and the performance of the stopping rule (4.3), we consider a simple parameter identification problem for an ordinary differential equation where the forward operator F is easy to evaluate and reliable conclusions can be obtained by Monte Carlo simulations within reasonable time. The efficiency of the iteratively regularized Gauss-Newton method for solving large-scale inverse problems has been sufficiently demonstrated in a number of previous publications (see, e.g., [13]).

We introduce the parameter-to-solution operator F : D(F) ⊂ L²([0, 1]) → L²([0, 1]), F(a) := u, where u solves the periodic boundary value problem

(5.1) −u″ + au = f on [0, 1].

If G denotes the inverse of the differential operator −∂²/∂x² + 1 with periodic boundary conditions, the differential equation (5.1) can equivalently be reformulated as an integral equation u + GM_{a−1}u = Gf with the multiplication operator M_{a−1}u := (a − 1)u. To prove the Fréchet differentiability of F, it is convenient to consider M_{a−1} as an operator from C_per([0, 1]) to L²([0, 1]) and G as an operator from L²([0, 1]) to C_per([0, 1]). Then M_{a−1} is bounded and G is compact, and it is easy to show that F is analytic on the domain D(F) := {a ∈ L²([0, 1]) : I + GM_{a−1} boundedly invertible in C_per([0, 1])}. In particular, F satisfies the Lipschitz condition (2.11). By a Neumann series argument, D(F) is open, and it contains all positive functions. Differentiation of (5.1) with respect to a shows that for a perturbation h of a the Fréchet derivative u_h = F′[a]h satisfies the differential equation −u_h″ + a u_h = −hu, where u is the solution to (5.1). This implies that F′[a]h = −(I + GM_{a−1})^{−1} G M_u h. We briefly sketch how Hölder source conditions (3.2) can be interpreted as smoothness conditions in Sobolev spaces under certain conditions. Assume that a ∈ H^s ∩ D(F) with s > 1/2 and f ∈ C^∞. Here and in the following we write H^s for H^s_per([0, 1]) for short. Since G : H^t → H^{t+2} is an isomorphism for all t ≥ 0 and uv ∈ H^t for u, v ∈ H^t with ‖uv‖_{H^t} ≤ C‖u‖_{H^t}‖v‖_{H^t} for t > 1/2, it can be shown that I + GM_{a−1} : H^t → H^t is an isomorphism for t ∈ [0, s] and that u ∈ H^s. Moreover, the operators F′[a], F′[a]* : H^t → H^{min(s,t+2)} are bounded, and if u has no zeros in [0, 1] and t ≤ s − 2, the inverses are bounded as well. Under this additional assumption, it can be shown using Heinz's inequality (see [12]) that (F′[a]*F′[a])^μ : L² → H^{4μ} is an isomorphism if 4μ ≤ s. Thus, a, a_0 ∈ H^s implies a Hölder source condition with μ = s/4. Moreover, the singular values of F′[a] behave like σ_j(F′[a]) ∼ j^{−2}, and this asymptotic behavior is uniform for all a with ‖a‖_{L∞} ≤ C.
Noise model. Let us assume that our data are n noisy measurements of u† = F(a†) at equidistant points x_j^{(n)}, j = 1, ..., n. The measurement errors are modeled by independent and identically distributed (i.i.d.) random variables ε_j satisfying

(5.2) Y_j = u†(x_j^{(n)}) + ε_j, E[ε_j] = 0, Var ε_j = (σ′)².

For n even we introduce the space Π_n := span{e^{2πijx} : j = −n/2, ..., n/2 − 1} and the linear mapping S_n : R^n → Π_n, which maps a vector u = (u_1, ..., u_n) of nodal values to the unique trigonometric interpolation polynomial S_n u ∈ Π_n satisfying (S_n u)(x_j^{(n)}) = u_j, j = 1, ..., n. We will show that u_obs := S_n Y with Y := (Y_1, ..., Y_n) satisfies assumption (1.2a); hence, we interpret u_obs as our observed data. Moreover, the covariance operator of the stochastic noise S_n ε is given by ((σ′)²/n) P_n, where ε := (ε_1, ..., ε_n) and P_n ∈ L(L²([0, 1])) is the orthogonal projection onto Π_n. Note that the stochastic noise level dominates the deterministic noise level δ = O(n^{−2}) for large n.

Numerical results. As exact solutions a† we used three functions of different smoothness shown in the right panel of Figure 5.1. These functions were defined in terms of Fourier coefficients such that they belong to ⋂_{t>s} H^t_per([0, 1]) with s ∈ {0.5, 1.5, 3.5}. The initial guess was always chosen as the constant function 1. We never had to stop the iteration early because an iterate was not in the domain of definition of F.
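The mapping S_n and the scaling of the noise level can be checked numerically. The sketch below is illustrative (it assumes the nodes x_j = j/n, j = 0, ..., n − 1); it computes the coefficients of S_n u via the FFT, and by the discrete Parseval identity ‖S_n u‖²_{L²} = ‖u‖²/n, i.e., √n S_n is unitary, so i.i.d. errors of standard deviation σ′ yield the L² noise level σ = σ′/√n:

```python
import numpy as np

def trig_interp_coeffs(u):
    """Coefficients c_j, j = -n/2, ..., n/2 - 1, of the trigonometric
    interpolation polynomial S_n u with (S_n u)(j/n) = u_j.
    Illustrative sketch; equidistant nodes x_j = j/n are assumed."""
    n = len(u)
    c = np.fft.fft(u) / n          # DFT / n gives interpolation coefficients
    return np.fft.fftshift(c)      # reorder frequencies to -n/2, ..., n/2-1

# Discrete Parseval: sum_j |c_j|^2 = ||u||^2 / n, so sqrt(n) S_n is
# unitary and i.i.d. measurement errors of standard deviation sigma'
# give the Hilbert-space noise level sigma = sigma' / sqrt(n).
```

Evaluating the resulting polynomial at a node reproduces the nodal value exactly, which is the interpolation property defining S_n.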
We first tested the performance of the balancing principle for σ′ = 0.1 and σ′ = 0.01 and three different values of n (cf. (5.3)) for the curve with s = 1.5. For each value of σ′ and n, 100 independent data vectors Y were drawn from a Gaussian distribution according to the additive noise model (5.2). We chose Φ_noi(k) as a Monte Carlo estimate of the expected propagated data noise error, computed from independent copies ε^{(l)} of the noise vector. Moreover, we set γ_nl = 0.1 in (4.3). Note that this choice of Φ_noi is not fully covered by our theory: we use only an estimator of the expected value, in contrast to (2.8a) we do not have a uniform bound over a ∈ D(F), and we dropped the logarithmic factor in Theorem 4.4. Furthermore, recall that the Lepskiĭ rule requires the computation of iterates up to a fixed K_max specified in (2.14). Usually the Lipschitz constant L involved in the definition (2.14) of K_max is not known exactly. Fortunately, we need only an upper bound L̄ on the Lipschitz constant, and the experimental results are insensitive to the choice of L̄.
As expected from the assumptions of our convergence results, the reconstructions are also insensitive to the choice of α_0 > 0 and q ∈ (0, 1) as long as they are sufficiently large. Further increasing α_0 and q results in more Newton steps and hence more computational effort to reach the optimal value of α_k = α_0 q^k, but no noticeable difference in the reconstructions. The results of our first series of simulations are summarized in Table 5.1. As "inefficiency index" of a stopping rule K* we used the ratio of the expected error for the stopping rule to the expected optimal error,

(E‖â_{K*} − a†‖²)^{1/2} / min_k (E‖â_k − a†‖²)^{1/2}.

The results displayed in Table 5.1 demonstrate that both the expected optimal error (the denominator in the previous expression) and the expected error for the balancing principle (4.3) (the numerator) depend only on σ = σ′/√n but not on n. This is in contrast to the discrepancy principle, which is defined in a discrete setting by stopping at the first index K* with

n^{−1/2} ‖Y − F(â_{K*})‖_{R^n} ≤ τσ′.

Here τ = 2.1, Y = (Y_j)_{j=1,...,n} is the vector defined in (5.2), (F(a))_j := (F(a))(x_j^{(n)}), j = 1, ..., n, and the factor n^{−1/2} before the Euclidean norm in R^n is chosen such that n^{−1/2}‖F(a)‖_{R^n} ≈ ‖F(a)‖_{L²([0,1])}. The discrepancy principle for the noise model (5.2) works fairly well for small n but badly for large n, as previously observed, e.g., in [18] for linear problems. The reason is that the standard deviation of the measurement error, n^{−1/2}(E Σ_{j=1}^n ε_j²)^{1/2} = σ′ = σ√n, tends to infinity as n → ∞ with constant σ, and hence the discrepancy principle stops too early. In fact this happens almost always at the first step for n sufficiently large, whereas the optimal stopping index, which asymptotically depends only on σ but not on n, may be arbitrarily large for small σ.
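The divergence σ′ = σ√n → ∞ that makes the discrepancy principle stop too early can be seen directly: the discrete residual norm of a pure-noise data vector concentrates around σ′. A small illustrative simulation (not from the paper):

```python
import numpy as np

def pure_noise_residual(n, sigma, seed=0):
    """Discrete residual norm n^{-1/2} ||eps|| of a pure-noise data vector
    with i.i.d. N(0, sigma'^2) entries, where sigma' = sigma * sqrt(n)
    keeps the L^2 noise level sigma fixed.  The residual concentrates
    around sigma', which diverges as n grows, so the threshold tau*sigma'
    of the discrepancy principle is met after fewer and fewer Newton
    steps and the rule stops too early.  (Illustrative simulation only.)"""
    rng = np.random.default_rng(seed)
    sigma_prime = sigma * np.sqrt(n)
    eps = sigma_prime * rng.standard_normal(n)
    return np.linalg.norm(eps) / np.sqrt(n)
```

Running this for increasing n with fixed σ shows the residual floor growing like √n while the optimal reconstruction error stays essentially constant, matching the behavior reported in Table 5.1.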
Finally we tested the rates of convergence with the balancing principle for the three curves a† shown in the right panel of Figure 5.1. We always chose n = 128 and varied σ′. For each value of σ′ and each of the three curves a† we performed 50 runs of the iteratively regularized Gauss-Newton method. The triangles in the left panel show the rates σ^{1/6}, σ^{3/8}, and σ^{7/12}, which are obtained by neglecting the logarithmic factor in (4.16) and setting c = 5/8 and μ = 1/8, 3/8, 7/8 following the discussion above. Note that the experimental rates agree quite well with these predicted rates even for the first two functions, which are not smooth enough to be covered by our theory.
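The quoted exponents can be reproduced from μ = s/4 and c = 1/2 + 1/(4β) with β = 2. The exponent formula μ/(μ + c) = 2μ/(2μ + 2c) below is inferred from the stated rates σ^{1/6}, σ^{3/8}, σ^{7/12} and should be read as an illustration, not as a restatement of (4.16):

```python
from fractions import Fraction

def predicted_rate(s, beta=2):
    """Exponent of the predicted rate sigma^(2*mu/(2*mu + 2*c)) for Sobolev
    smoothness s, with Hölder index mu = s/4 and c = 1/2 + 1/(4*beta);
    beta = 2 is the singular value decay of F'[a] in this example.  The
    exponent formula mu/(mu + c) is inferred from the rates quoted in the
    text and is an illustration, not a restatement of (4.16)."""
    mu = Fraction(s) / 4
    c = Fraction(1, 2) + Fraction(1, 4 * beta)
    return mu / (mu + c)
```

For s = 1/2, 3/2, 7/2 this yields the exponents 1/6, 3/8, and 7/12 shown by the triangles in Figure 5.1.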
In summary, we have demonstrated that the performance of the balancing principle is independent of the sample size n, whereas the discrepancy principle works well for small n, but becomes more and more inefficient as n → ∞. Moreover, for the balancing principle the empirical rates of convergence match the theoretical rates very well.
Appendix. Exponential inequality for Gaussian noise. In this appendix we will prove the exponential inequality (2.8c) for the case that the noise process ξ in (1.2a) is Gaussian.
Let us consider the random variable V = ‖Λξ‖² for an arbitrary linear operator Λ : Y → X such that E‖Λξ‖² < ∞. Since Cov_{Λξ} = ΛMΛ* with M := Cov_ξ, the operator ΛMΛ* is trace class; i.e., its eigenvalues λ_i ≥ 0 are summable. Moreover, M^{−1/2}ξ is a Gaussian white noise process; i.e., the random variables ξ_i := ⟨M^{−1/2}ξ, ϕ_i⟩ are i.i.d. standard normal for any orthonormal system {ϕ_i} in Y. In particular, if {ϕ_i} is a system of left singular vectors of ΛM^{1/2}, then V = Σ_i λ_i ξ_i². Then the tail bound required in (2.8c) holds for any η > 0 (see [11, 25]).
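The display cited from [11, 25] is presumably the standard exponential inequality for Gaussian quadratic forms (cf. Laurent and Massart); in the notation above it reads, as a hedged reconstruction:

```latex
% Tail bound for V = \sum_i \lambda_i \xi_i^2 with i.i.d. standard
% normal \xi_i and summable eigenvalues \lambda_i \ge 0:
\mathbf{P}\left( V \;\ge\; \sum_i \lambda_i
    + 2\Bigl(\eta \sum_i \lambda_i^2\Bigr)^{1/2}
    + 2\,\eta\, \sup_i \lambda_i \right) \;\le\; e^{-\eta}
\qquad \text{for all } \eta > 0 .
```

Since Σ_i λ_i = E V, this bounds the probability that the quadratic form exceeds its mean by a deviation controlled by the Hilbert-Schmidt and operator norms of ΛMΛ*, which is exactly the structure needed for (2.8c).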

Fig. 5.1. Left panel: rates of convergence of (E‖a† − â_{k_bal}‖²)^{1/2} as σ → 0 for different smoothnesses of the exact solution. The triangles indicate the rates predicted by theory, and the bars indicate the empirical standard deviations of the reconstruction error. Right panel: exact solutions.

Since √n S_n is unitary, u_obs and Y have unitarily equivalent covariance operators. We have δη = E[u_obs] − F(a†) and σξ = u_obs − E[u_obs] in (1.2a).

Table 5.1
Comparison of balancing principle and discrepancy principle as stopping rules.For each value of n and σ the algorithm was run 100 times.The values after ± denote standard deviations.