Morozov's principle for the augmented Lagrangian method applied to linear inverse problems

The Augmented Lagrangian Method as an approach for regularizing inverse problems has received much attention recently, e.g. under the name Bregman iteration in imaging. This work shows convergence (rates) for this method when Morozov's discrepancy principle is chosen as the stopping rule. Moreover, error estimates for the involved sequence of subgradients are pointed out. The paper studies implications of these results for particular examples motivated by applications in imaging, including total variation regularization as well as $\ell^q$ penalties with $q\in[1,2]$. It is shown that Morozov's principle implies convergence (rates) for the iterates with respect to the metric of strict convergence and the $\ell^q$-norm, respectively.

Of particular interest are ill-posed equations, that is, when the solution of $Ku = g$ does not depend continuously on the data $g$ (as is the case, e.g., if $K$ has non-closed range). This becomes distinctly delicate if the data $g$ is not available precisely but only through noise-affected observations $g^\delta$ for which we assume the additional information $\|g^\delta - g\| \le \delta$.
It is natural to ask: "When does a solution algorithm for the optimization problem (1.1), applied to perturbed data $g^\delta$ instead of $g$, constitute a regularization method for the ill-posed equation $Ku = g$?" In [12] an affirmative answer was given for the Augmented Lagrangian Method (ALM), which in the context of regularization is also known as the Bregman iteration (see [20]). The ALM was introduced simultaneously by Hestenes [17] and Powell [21] as an iterative solution method for (1.1) and reads as follows:

Algorithm 1 (the ALM). Let $p_0^\delta \in H_2$ and choose a sequence $\{\tau_n\}_{n\in\mathbb{N}}$ of positive parameters. For $n = 1, 2, \ldots$ compute
$$u_n^\delta \in \operatorname{argmin}_{u\in H_1}\ \frac{\tau_n}{2}\|Ku - g^\delta\|^2 + J(u) - \langle p_{n-1}^\delta,\, Ku - g^\delta\rangle \quad (1.2a)$$
and
$$p_n^\delta = p_{n-1}^\delta + \tau_n\,(g^\delta - K u_n^\delta). \quad (1.2b)$$
Here $L(u, p) = J(u) - \langle p, Ku - g^\delta\rangle$ is the Lagrangian for (1.1), and the additional term $\frac{\tau_n}{2}\|Ku - g^\delta\|^2$ is an augmentation of $L$ that fosters the fulfillment of the constraint. Hence, in the limit, the augmentation term is supposed to vanish and the variables $p_n^\delta$ shall tend to a Lagrange multiplier for the problem (1.1).
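For concreteness, the iteration (1.2) can be sketched numerically. The following minimal Python sketch runs the stationary ALM for the quadratic model functional $J(u) = \frac{1}{2}\|u\|^2$, for which the subproblem (1.2a) has a closed-form solution. The operator, data, noise level and step size below are illustrative assumptions, not part of the paper.

```python
import numpy as np

# Stationary ALM (tau_n = tau) for the quadratic model J(u) = 0.5*||u||^2.
# The subproblem (1.2a) then has the closed form (tau*K^T K + I) u = K^T (tau*g_delta + p).
# All problem data below are illustrative assumptions.
rng = np.random.default_rng(0)
K = rng.standard_normal((8, 20))            # underdetermined "forward operator"
u_true = rng.standard_normal(20)
g = K @ u_true                               # attainable exact data
delta = 1e-3
noise = rng.standard_normal(8)
g_delta = g + delta * noise / np.linalg.norm(noise)   # ||g_delta - g|| = delta

tau = 1.0
p = np.zeros(8)                              # dual variable p_0
A = tau * K.T @ K + np.eye(20)
residuals = []
for n in range(200):
    # u-update (1.2a): minimize tau/2 ||Ku - g_delta||^2 + J(u) - <p, Ku - g_delta>
    u = np.linalg.solve(A, K.T @ (tau * g_delta + p))
    # p-update (1.2b): p_n = p_{n-1} + tau*(g_delta - K u_n)
    p = p + tau * (g_delta - K @ u)
    residuals.append(np.linalg.norm(K @ u - g_delta))

print(residuals[0], residuals[-1])
```

For this quadratic choice one observes the residual $\|Ku_n^\delta - g^\delta\|$ decreasing towards zero, in line with the monotonicity of the residual used later in the paper.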
It is well known that the Karush-Kuhn-Tucker conditions are necessary and sufficient optimality conditions for the solutions of (1.1), and that they guarantee existence of a saddle point of $L$. Thus, if there exist $u^\dagger \in H_1$ and $p^\dagger \in H_2$ such that $Ku^\dagger = g$ and $K^* p^\dagger \in \partial J(u^\dagger)$, then $L(u^\dagger, p) \le L(u^\dagger, p^\dagger) \le L(u, p^\dagger)$ for all $u \in H_1$ and $p \in H_2$. It was pointed out in [4] that this coincides with the standard source condition in regularization theory.
As in [12], we will consider the ALM as a regularization method, that is, for stably computing approximations of solutions of (1.1) from perturbed data $g^\delta$. By $R_n : H_2 \to H_1$ and $R_n^* : H_2 \to H_2$ we denote the operators defined by $R_n(g^\delta) := u_n^\delta$ and $R_n^*(g^\delta) := p_n^\delta$, respectively.
The paper [12] provided a characterization of parameter choice rules $\Gamma : (0,\infty)\times H_2 \to \mathbb{N}$ such that $R_{\Gamma(\delta,g^\delta)}(g^\delta)$ converges, in an appropriate sense, to a solution $u^\dagger$ of (1.1). Under a standard source condition, it also showed convergence rates for a class of stopping rules $\Gamma(\delta, g^\delta)$ for which $\Gamma(\delta, g^\delta) \to \infty$ as $\delta \to 0$. We pursue that study further and mainly show that Morozov's discrepancy principle does belong to the above mentioned class. Moreover, we investigate the degenerate case of the discrepancy principle, that is, when $\{\Gamma(\delta, g^\delta)\}$ has finite accumulation points. Note that the challenge of choosing a suitable regularization parameter for stabilization methods for improperly posed problems is frequently approached via Morozov's rule due to its natural heuristic motivation: this rule selects a parameter by comparing the residual $\|Ku_n^\delta - g^\delta\|$ with the presumably known noise level $\delta$; see, e.g., [11, Ch. 4].
In [12], the implications of the general convergence analysis for the ALM were emphasized for the case of quadratic functionals $J$ (cf. Example 1). In particular, the authors pointed out that in this case the ALM is equivalent to the Tikhonov-Morozov method (cf. [15]). Here, we will study in more detail two choices for $J$ that are especially appealing for inverse problems occurring in imaging:

i) Total variation regularization (cf. [4,5,24]). Let $H_1 = L^2(\Omega)$ for a bounded domain $\Omega \subset \mathbb{R}^2$ and consider the functional $J(u) = |Du|(\Omega)$ (1.3), where $|Du|(\Omega)$ denotes the total variation of the (measure-valued) distributional derivative of $u$.

ii) Sparse regularization (cf. [9,13,18]). Let $H_1 = \ell^2$ and consider the $\ell^q$ penalty functional (1.4) with $q \in [1,2]$.

This work is organized as follows. Section 2 presents the main notions and notation, while Section 3 recalls several results of [12] and proposes some extensions of them. For instance, upper bounds for the Bregman distance between the subgradients of the objective functional $J$ in (1.1) corresponding to the iterates and the solution, respectively, are obtained. Section 4 shows that the ALM together with Morozov's discrepancy principle leads to stable approximations for the operator equation, both in the nondegenerate and degenerate cases. The results are applied to the total variation setting in Section 5 by establishing strict convergence (rates) for the primal variables. Section 6 summarizes the knowledge on the ALM for the sparsity regularization setting, i.e. convergence rates for the primal variables with respect to the $\ell^q$-norm and for the subgradients of these variables with respect to Bregman distances ($1 \le q \le 2$) and dual norms ($1 < q < 2$).
2. Basic Definitions and some Notation.
2.1. Basic Assumptions. Throughout this paper we will assume that $H_1$ and $H_2$ are separable Hilbert spaces with inner products $\langle\cdot,\cdot\rangle$ and norms $\|\cdot\|$ (not further specified, since the meaning is always clear from the context). We will frequently make use of Young's inequality, which states that for all $u, v \in H_1$ and $\gamma > 0$ one has
$$\langle u, v\rangle \le \frac{\gamma}{2}\|u\|^2 + \frac{1}{2\gamma}\|v\|^2.$$
We assume further that $K : H_1 \to H_2$ is a linear and bounded operator and that $J : H_1 \to \overline{\mathbb{R}} = \mathbb{R}\cup\{\infty\}$ is convex, lower semi-continuous (l.s.c.) and proper, that is, its domain $D(J)$ is non-empty. In order to guarantee that $J$-minimizing solutions of $Ku = g$ exist and that Algorithm 1 is well defined, we need to impose additional restrictions (cf. [12, Lem. 3.1]):

Assumption 1. The sub-level sets of the functional $u \mapsto J(u) + \|Ku\|$ are sequentially pre-compact with respect to the weak topology on $H_1$. That is, for every $c \in \mathbb{R}$, every sequence $\{u_n\}_{n\in\mathbb{N}}$ contained in the sub-level set has a weakly convergent subsequence in $H_1$.
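The $\gamma$-weighted form of Young's inequality displayed above (the standard form, filled in here since the original display is assumed) is easy to sanity-check numerically:

```python
import numpy as np

# Numerical check of Young's inequality: <u, v> <= gamma/2 ||u||^2 + 1/(2*gamma) ||v||^2.
# Dimension, number of trials and the range of gamma are arbitrary illustrative choices.
rng = np.random.default_rng(1)
for _ in range(1000):
    u, v = rng.standard_normal(5), rng.standard_normal(5)
    gamma = float(rng.uniform(0.1, 10.0))
    lhs = float(u @ v)
    rhs = 0.5 * gamma * float(u @ u) + 0.5 / gamma * float(v @ v)
    assert lhs <= rhs + 1e-12   # holds by Cauchy-Schwarz and AM-GM
print("Young's inequality verified on 1000 random samples")
```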
Moreover, we will assume that $\{\tau_n\}_{n\in\mathbb{N}}$ in Algorithm 1 is a fixed sequence of positive regularization parameters, which can be considered as step-sizes for the iterations. We will make use of the quantity
$$t_n := \sum_{i=1}^{n} \tau_i.$$
The case of a constant parameter $\tau_n = \tau$ is known as the stationary augmented Lagrangian method and leads to $t_n = n\tau$. We will only require that $\lim_{n\to\infty} t_n = +\infty$ and $\sup_{n\in\mathbb{N}} \tau_n < \infty$, i.e., the $\tau_n$'s do not decay too quickly and stay bounded. Finally, we will assume that $g \in H_2$ is an attainable element, that is, there exists a $u \in D(J)$ such that $Ku = g$. By $g^\delta \in H_2$ we always denote a perturbed version of $g$ satisfying $\|g^\delta - g\| \le \delta$. For $k \in \mathbb{N}$, we will abbreviate $g_k := g^{\delta_k}$ with $\delta_k \to 0$ as $k \to \infty$.

2.2. Convex Analysis.
In the course of this paper we will frequently use some tools from convex analysis. A standard reference in this respect is [10].
The subdifferential (or generalized derivative) $\partial J(u)$ of $J$ at $u$ is the set of all elements $\xi \in H_1$ satisfying
$$J(v) \ge J(u) + \langle \xi, v - u\rangle \quad \text{for all } v \in H_1.$$
The domain $D(\partial J)$ of the subdifferential consists of all $u \in H_1$ for which $\partial J(u) \neq \emptyset$. Finally, we define the graph of $\partial J$ as $\operatorname{Gr}(\partial J) := \{(u, \xi) \in H_1 \times H_1 : \xi \in \partial J(u)\}$. According to [10, Chap. I Cor. 5.1], the set $\operatorname{Gr}(\partial J)$ is sequentially closed with respect to the weak-strong topology on $H_1 \times H_1$. That is, if the sequence $\{(u_n, v_n)\}_{n\in\mathbb{N}}$ of elements in $\operatorname{Gr}(\partial J)$ satisfies that $u_n$ converges weakly to $u$ and $v_n$ converges strongly to $v$, then $(u, v) \in \operatorname{Gr}(\partial J)$.
The functional $J^* : H_1 \to \overline{\mathbb{R}}$ denotes the Legendre-Fenchel transform (or dual functional) of $J$, defined by $J^*(v) := \sup_{u \in H_1}\big(\langle u, v\rangle - J(u)\big)$. It satisfies the equivalence $v \in \partial J(u) \Leftrightarrow u \in \partial J^*(v)$.
Furthermore, it follows from the definition of the subgradient that $J(u) + J^*(v) = \langle u, v\rangle$ whenever $v \in \partial J(u)$.

For $u \in D(\partial J)$ and $v \in D(J)$, the Bregman distance of $J$ between $v$ and $u$ with respect to $\xi \in \partial J(u)$ is defined by
$$D_J^{\xi}(v, u) := J(v) - J(u) - \langle \xi, v - u\rangle.$$
We will skip the superscript $\xi$ if the choice of the subgradient is obvious. If additionally $v \in D(\partial J)$ and $\eta \in \partial J(v)$, we further define the symmetric Bregman distance by
$$D_J^{\mathrm{sym}}(u, v) := D_J^{\xi}(v, u) + D_J^{\eta}(u, v) = \langle \xi - \eta, u - v\rangle.$$
Note that the convexity of $J$ implies that $D_J$ and $D_J^{\mathrm{sym}}$ are always non-negative.

Example 1. For a densely defined, closed linear operator $L$ in $H_1$, the quadratic functional $J(u) = \frac{1}{2}\|Lu\|^2$ is convex, lower semi-continuous and proper. Moreover, for $u \in D(\partial J) = D(L^*L)$ the subgradient $\partial J(u)$ coincides with the set $\{L^* L u\}$ (cf. [12, Lem. 2.4]).

2.3. Source Condition. It is well known that regularization methods for the reconstruction of a solution $u^\dagger$ of (1.1) in general converge arbitrarily slowly, unless further regularity is imposed on $u^\dagger$ [11]. In the general setup presented in this paper, this is usually done in terms of the standard source condition [4], that is, there exists an element $p^\dagger \in H_2$ (the source element) such that
$$K^* p^\dagger \in \partial J(u^\dagger). \quad (2.2)$$

3. Summary and extensions of previous results. In this section we summarize the results on regularization by means of the ALM as presented in [12]. We further derive an extended error estimate that allows for convergence rates of the sequence $K^* p_n^\delta$ in the Bregman distance associated with the Fenchel conjugate $J^*$. The dual characterization of the ALM via the proximal point method plays a central role in the convergence analysis in [12]; this observation dates back to the work of Rockafellar [23]. In the current context, we define $G : H_2 \times H_2 \to \overline{\mathbb{R}}$ by
$$G(p, g) := J^*(K^* p) - \langle p, g\rangle, \quad (3.1)$$
and Proposition 3.1 characterizes the dual iterates $p_n^\delta$ as proximal point iterates for $G(\cdot, g^\delta)$. This leads to the general convergence result [12, Thm. 5.3]:

Theorem 3.2. Assume that the stopping rule $\Gamma : (0,\infty)\times H_2 \to \mathbb{N}$ satisfies (3.4), i.e. $\delta_k^2\, t_{\Gamma(\delta_k,g_k)} \to 0$ and $t_{\Gamma(\delta_k,g_k)} \to +\infty$ as $k \to \infty$. Then, the sequence $\{R_{\Gamma(\delta_k,g_k)}(g_k)\}_{k\in\mathbb{N}}$ is bounded and each weak cluster point is a $J$-minimizing solution of $Ku = g$. Additionally, $J(R_{\Gamma(\delta_k,g_k)}(g_k)) \to J(u^\dagger)$ as in (3.5), and the residuum $\|K R_{\Gamma(\delta_k,g_k)}(g_k) - g\|$ satisfies the corresponding rate
(3.6). As indicated in Section 2.3, the speed of convergence in (3.5) can be arbitrarily slow, unless one imposes regularity restrictions on the true solutions of $Ku = g$. We recall below Theorem 6.3 from [12] in this respect.

Theorem 3.3. Assume that the stopping rule $\Gamma : (0,\infty)\times H_2 \to \mathbb{N}$ satisfies $\lim_{k\to\infty} t_{\Gamma(\delta_k,g_k)} = +\infty$. Then the following two conditions are equivalent: (i) there exists a $J$-minimizing solution $u^\dagger$ of $Ku = g$ that satisfies the source condition (2.2) with source element $p^\dagger \in H_2$, and there exists $C \in \mathbb{R}$ such that the estimate (3.7) holds; (ii) the second, equivalent condition of [12, Thm. 6.3] holds. Additionally, if (i) or (ii) holds, then (3.8) follows and each cluster point of $R^*_{\Gamma(\delta_k,g_k)}(g_k)$ is a minimizer of $G(\cdot, g)$.

Theorem 3.4 and Corollary 3.6 below provide quantitative estimates for the primal and dual iterates of the ALM in case the source condition (2.2) holds. These results extend [12, Thm. 6.2].
Theorem 3.4. Assume that $u^\dagger$ is a $J$-minimizing solution of $Ku = g$ which satisfies the source condition (2.2) with source element $p^\dagger \in H_2$. Then, for any $\gamma > 0$ the estimate (3.9) holds.

Proof. Since $u^\dagger$ satisfies the source condition, we have $K^* p^\dagger \in \partial J(u^\dagger)$, which is equivalent to $u^\dagger \in \partial J^*(K^* p^\dagger)$. Therefore, the resulting inequality together with Proposition 3.1 and Young's inequality gives, for an arbitrary $\gamma > 0$, the corresponding bound. Using (1.2b) together with the inequality $\|Ku_n^\delta - g\|^2 \le 2\|Ku_n^\delta - g^\delta\|^2 + 2\delta^2$ and the previous estimate shows the assertion.
ii) It holds:

Proof. The first inequality follows from Young's inequality and Theorem 3.4 with $\gamma = 1/(1 - 2\alpha)$, due to the fact that $\alpha < 1/2$. In order to prove ii), we observe from (3.10) a corresponding estimate for all $\gamma > 1$. Hence, Lemma 3.5 with $a = \|p_0^\delta - p^\dagger\|^2$ and $b = \delta^2 t_n^2$ leads to the stated bound, from which the assertion follows.

Remark 3.7. i) Obviously, the best possible rates with respect to the estimates in Theorem 3.3 and Corollary 3.6 i) are obtained when $t_{\Gamma(\delta,g^\delta)} \sim \delta^{-1}$. However, if one only has $t_{\Gamma(\delta,g^\delta)} \ge C\delta^{-1}$ for some $C > 0$, then Corollary 3.6 ii) shows that the symmetric Bregman distance behaves at least as well as the residual. ii) Since $K^* p_n^\delta \in \partial J(u_n^\delta)$ and $K^* p^\dagger \in \partial J(u^\dagger)$ are equivalent to $u_n^\delta \in \partial J^*(K^* p_n^\delta)$ and $u^\dagger \in \partial J^*(K^* p^\dagger)$, respectively, it follows that all estimates for the primal variables $\{u_n^\delta\}_{n\in\mathbb{N}}$ automatically hold also for $\{K^* p_n^\delta\}_{n\in\mathbb{N}}$.

4. Morozov's discrepancy principle. In this section we analyze the discrepancy principle as an a posteriori stopping rule. In order to apply the convergence (rate) results in Theorems 3.2, 3.3 and 3.4, a given stopping rule $\Gamma : (0,\infty)\times H_2 \to \mathbb{N}$ has to satisfy (3.4) and (3.7), respectively. We verify these estimates for the particular situation where the stopping index is chosen according to Morozov's discrepancy principle: choose $\rho > 1$ and define
$$\Gamma(\delta, g^\delta) := \min\{n \in \mathbb{N} : \|K u_n^\delta - g^\delta\| \le \rho\delta\}. \quad (4.1)$$
That is, we take the first iterate $u_n^\delta$ for which the residual $\|Ku_n^\delta - g^\delta\|$ falls below a constant $\rho$ times the noise level $\delta$.
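As an illustration, the stopping rule (4.1) can be wired directly into the ALM loop. The following Python sketch again uses the quadratic model $J(u) = \frac{1}{2}\|u\|^2$ so that (1.2a) is solvable in closed form; the operator, data and parameters are illustrative assumptions.

```python
import numpy as np

# Sketch of Morozov's discrepancy principle (4.1) as a stopping rule for the ALM,
# here with the quadratic model J(u) = 0.5*||u||^2 so that (1.2a) is a linear solve.
# Problem data, rho, tau and n_max are illustrative assumptions.
def alm_morozov(K, g_delta, delta, rho=1.5, tau=1.0, n_max=500):
    m, d = K.shape
    p = np.zeros(m)
    A = tau * K.T @ K + np.eye(d)
    u = np.zeros(d)
    for n in range(1, n_max + 1):
        u = np.linalg.solve(A, K.T @ (tau * g_delta + p))   # u-update (1.2a)
        if np.linalg.norm(K @ u - g_delta) <= rho * delta:  # rule (4.1)
            return n, u        # Gamma(delta, g_delta) and the selected iterate
        p = p + tau * (g_delta - K @ u)                     # p-update (1.2b)
    return n_max, u

rng = np.random.default_rng(2)
K = rng.standard_normal((8, 20))
u_true = rng.standard_normal(20)
g = K @ u_true
delta = 1e-2
noise = rng.standard_normal(8)
g_delta = g + delta * noise / np.linalg.norm(noise)

n_stop, u_stop = alm_morozov(K, g_delta, delta)
print(n_stop, np.linalg.norm(K @ u_stop - g_delta))
```

The returned index plays the role of $\Gamma(\delta, g^\delta)$: the loop terminates at the first $n$ with $\|Ku_n^\delta - g^\delta\| \le \rho\delta$.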
Our analysis is structured as follows: In Section 4.1, we derive convergence rates (based on Corollary 3.6 ii)) for the symmetric Bregman distance between the primal iterates $\{u_n^\delta\}_{n\in\mathbb{N}}$ and $J$-minimizing solutions of $Ku = g$, under the hypothesis that the source condition holds. Here, we make no assumption on $\Gamma(\delta, g^\delta)$ other than (4.1). In Section 4.2 we then point out that the convergence results in Theorems 3.2 and 3.3 apply to the parameter choice rule (4.1) if one additionally requires $\lim_{\delta\to 0} \Gamma(\delta, g^\delta) = \infty$. We refer to this situation as the non-degenerate case. Finally, in Section 4.3 we treat the degenerate case, i.e., where $\{\Gamma(\delta, g^\delta)\}_\delta$ has finite accumulation points.
4.1. Convergence rates. We will state a quantitative estimate for the Bregman distance between the primal variables of the ALM and solutions of (1.1), valid if the source condition is satisfied and the Morozov stopping rule is applied. In particular, this analysis sheds some light on the role of $\rho$ in (4.1).
Lemma 4.2. Let $u^\dagger$ be a $J$-minimizing solution of $Ku = g$ that satisfies the source condition with source element $p^\dagger$, and assume that $\Gamma$ is chosen according to the stopping rule (4.1). Then, the estimate (4.2) holds. In particular, (3.7) is satisfied.
Proof. Let $g^\delta \in H_2$ and set $\delta := \|g - g^\delta\|$ as well as $n_* = \Gamma(\delta, g^\delta) - 1$. Then, it follows from (4.1) that $\|Ku_{n_*}^\delta - g^\delta\| > \rho\delta$. This together with (3.3) yields a first estimate (cf. (1.2b)). From the definition of $G$ it follows that $G(p, g^\delta) - G(p_{n_*}^\delta, g^\delta) = G(p, g) - G(p_{n_*}^\delta, g) + \langle p - p_{n_*}^\delta, g - g^\delta\rangle$. After applying Young's inequality to the inner product we get a corresponding bound for every $p \in H_2$ and $\eta > 0$. Setting $\eta = t_{n_*}$ hence gives the desired estimate. Since $u^\dagger$ satisfies the source condition with source element $p^\dagger$, the claim follows from [12].

With this preparation we are ready to state the announced estimate for the primal variables.
Theorem 4.3 below provides the rate estimate (4.4) for the symmetric Bregman distance between $u^\delta_{\Gamma(\delta,g^\delta)}$ and $u^\dagger$.

Proof. From (4.2) it follows that a bound on the dual iterates holds. This together with (4.2) and the fact that $\|Ku^\delta_{\Gamma(\delta,g^\delta)} - g^\delta\| \le \rho\delta$ by construction in (4.1), the assertion follows from Corollary 3.6 ii).

Remark 4.4. The function $f$ which appears on the right hand side of (4.4) is minimal for $\rho_* \simeq 1.6404$ with $f(\rho_*) \simeq 4.6753$. Hence, after setting $\rho = \rho_*$ in the stopping rule (4.1), Theorem 4.3 implies a corresponding rough estimate.

4.2. The nondegenerate case. In this section we will show that the assumptions of Theorems 3.2 and 3.3 are satisfied for the stopping rule (4.1), if additionally one requires
$$\lim_{k\to\infty} \Gamma(\delta_k, g_k) = \infty. \quad (4.5)$$
From Lemma 4.2 it already follows that (3.7) holds, which implies applicability of Theorem 3.3. Moreover, we find

Lemma 4.5. Assume that $\Gamma$ is chosen according to the stopping rule (4.1) and that (4.5) holds. Then, $\Gamma(\delta_k, g_k)$ satisfies (3.4), i.e. $\delta_k^2\, t_{\Gamma(\delta_k,g_k)} \to 0$ and $t_{\Gamma(\delta_k,g_k)} \to +\infty$ as $k \to \infty$.
Combining the above results with Theorem 3.2 yields results on convergence for Morozov's discrepancy principle as a stopping rule:

Corollary 4.6. Assume that $\Gamma$ is chosen as in (4.1) and that (4.5) holds. Then, the sequence $\{R_{\Gamma(\delta_k,g_k)}(g_k)\}_{k\in\mathbb{N}}$ is bounded and each weak cluster point $u^\dagger$ is a $J$-minimizing solution of $Ku = g$. Additionally, (3.5) and (3.6) hold.
If additionally the source condition is satisfied, Lemma 4.5 and Theorem 3.3 imply

Corollary 4.7. Let the assumptions of Corollary 4.6 be satisfied and assume that there exists a solution $u^\dagger$ of (1.1) which satisfies the source condition with source element $p^\dagger$. Then, (3.8) holds and each weak cluster point of $R^*_{\Gamma(\delta_k,g_k)}(g_k)$ is a minimizer of $G(\cdot, g)$.

Remark 4.8. From Schauder's Theorem and from $\overline{\operatorname{ran}(K)} = \ker(K^*)^\perp$ it follows that, for each compact $K$ with dense range, the adjoint operator $K^*$ is compact and injective; hence the dual iterates converge strongly to $\bar p$, where $\bar p$ is a minimizer of $G(\cdot, g)$. If the condition on the range of $K$ is not satisfied, then strong convergence holds on subsequences.
4.3. The degenerate case. We will finally discuss the case where the stopping index chosen by Morozov's discrepancy principle degenerates, that is, where there exists an $N \in \mathbb{N}$ such that
$$\limsup_{\delta\to 0^+} \Gamma(\delta, g^\delta) = N. \quad (4.6)$$
In this case, the assumption (4.5) is not satisfied and the results of Section 4.2 do not apply in general.
The following result shows, however, that a degenerate stopping rule as in (4.6) already implies that the true solutions of (1.1) satisfy the source condition (2.2), and hence the results in Section 4.1 hold. Moreover, the convergence (on subsequences) of the dual sequence also follows.

Theorem 4.9. Let $\Gamma : (0,\infty)\times H_2 \to \mathbb{N}$ be as in (4.1) and assume that (4.6) holds. Then, the following assertions are true:
i) The set $\{p_N^\delta\}_{\delta>0}$ is bounded and each of its weak cluster points is a minimizer of $G(\cdot, g)$.
ii) The set $\{u_N^\delta\}_{\delta>0}$ is bounded and each of its weak cluster points is a $J$-minimizing solution of $Ku = g$.
iii) All $J$-minimizing solutions of $Ku = g$ satisfy the source condition with a source element $p^\dagger$.
iv) The corresponding convergence rate estimate holds.

Proof. The definition of $\Gamma(\delta, g^\delta)$ in (4.1) and the monotonicity of the residual $\|Ku_n^\delta - g^\delta\|$ (cf. [12, Cor. 3

This proves i).
From the definition of $u_N^\delta$ in (1.2a) and the fact that $\|Ku_N^\delta - g^\delta\| \le \rho\delta$, we obtain $J(u_N^\delta) - J(u^\dagger) = O(\delta)$ as $\delta \to 0^+$. This together with (4.8) shows that $\sup_{\delta>0} J(u_N^\delta) + \|Ku_N^\delta\| < \infty$ and consequently, according to Assumption 1, that $\{u_N^\delta\}_{\delta>0}$ is weakly pre-compact and hence bounded. Thus, ii) follows from (4.8) and the lower semi-continuity of $J$.
Let $p^\dagger$ be a minimizer of $G(\cdot, g)$, which exists according to i). This and the definition of $G(p, g)$ in (3.1) imply a first inequality. Moreover, we deduce from the optimality condition of (1.2a) that $K^* p_N^\delta \in \partial J(u_N^\delta)$, which in turn implies that $Ku_N^\delta \in \partial(J^* \circ K^*)(p_N^\delta)$. Using the definition of the subgradient and some rearrangements gives a second inequality. Since $\{p_N^\delta\}_{\delta>0}$ is bounded according to i), the previous two estimates result in (4.9). Using once more the relation $K^* p_N^\delta \in \partial J(u_N^\delta)$ shows that $J^*(K^* p_N^\delta) + J(u_N^\delta) = \langle K^* p_N^\delta, u_N^\delta\rangle$ and consequently a corresponding identity. Now, let $u^\dagger$ be a $J$-minimizing solution of $Ku = g$, which exists according to ii). Taking the limit $\delta \to 0^+$ in the previous equality, using (4.8), (4.9), as well as the boundedness of $\{p_N^\delta\}_{\delta>0}$ and the fact that $J(u_N^\delta) \to J(u^\dagger)$, results in $K^* p^\dagger \in \partial J(u^\dagger)$. This proves iii).

Statement iv) follows from i), iii) and Corollary 3.6 ii) together with the first inequality in (4.7).
Remark 4.10. As $\{\Gamma(\delta, g^\delta)\}_{\delta>0}$ has finite accumulation points, we can, without loss of generality, consider a constant subsequence. This means that, for all sufficiently small $\delta$, one stops the algorithm at the same iteration index.
A degenerate case is discussed for the Landweber method for nonlinear equations in [11, p. 284]. It is shown there that $\lim_{\delta\to 0} u_N^\delta = u_N$, where $u_N$ is the $N$-th iterate in the exact data case and is a solution of the operator equation as well. This means that, in the exact data case, the Landweber algorithm reaches a solution after $N$ steps, $N$ being the stopping index in the noisy data case.
For the ALM analyzed here, we could not show that $\lim_{\delta\to 0} u_N^\delta = u_N$, where $u_N$ is the $N$-th iterate in the exact data case, because the implicit character of the method makes the analysis more difficult. However, we could establish that the accumulation points of $\{u_N^\delta\}_{\delta>0}$ are $J$-minimizing solutions with additional smoothness, i.e., satisfying the source condition.
The results for the two cases are briefly summarized in the following corollary.

Corollary 4.11. Let $\Gamma : (0,\infty)\times H_2 \to \mathbb{N}$ be chosen according to Morozov's rule (4.1). Then, as $\delta \to 0$, the stopping index $\Gamma$ either increases to infinity, which leads to weak subsequential convergence of the ALM iterates to solutions of the operator equation, or stays constant, in which case the corresponding ALM iterates converge weakly on subsequences to a solution of the equation satisfying the source condition.

5. Iterative total variation regularization. The ALM in the case of $J$ being the total variation seminorm (1.3) is also known as the Bregman iteration [20]. It was shown in [20] that Morozov's discrepancy principle yields weak* convergence in $\mathrm{BV}(\Omega)$ of the iterative method. The expected but missing convergence there was the one with respect to the total variation seminorm, in the sense
$$\lim_{k\to\infty} J(u_k) = J(u). \quad (5.1)$$
As a consequence of the analysis based on augmented Lagrangian tools, it became clear that this convergence does hold. Moreover, linear convergence rates with respect to the Bregman distance associated with the total variation seminorm were first established in [5] for the noise-free case. According to [4], and due to the symmetric Bregman distance estimates pointed out in this work, such convergence rates provide information on the fine structure of the iterates, that is, the variation of the iterates is concentrated around the discontinuity set of the true solution. In the noisy data case, an a posteriori stopping rule $n_*(\delta, g^\delta)$ was proposed in [20]. Although convergence was shown there for the net $\{u^\delta_{n_*(\delta,g^\delta)}\}$ as $\delta \to 0$, no convergence rate was obtained for it. This section aims to point out such a convergence rate. Note that the a posteriori rule (4.1) employed here relates to the above mentioned one by $\Gamma(\delta, g^\delta) = n_*(\delta, g^\delta) + 1$.
Still, the question of how to quantify the weak* convergence is not answered. A possible answer can be given by taking into account that weak* convergence in $\mathrm{BV}(\Omega)$ together with convergence in the sense of (5.1) is equivalent to so-called strict convergence. Thus, one can obtain convergence rates with respect to a related metric, as shown below. Recall [2, page 125] that $\{u_k\}_{k\in\mathbb{N}} \subset \mathrm{BV}(\Omega)$ converges strictly to $u$ if it converges with respect to the metric
$$d(u, v) := \|u - v\|_{L^1} + \big|\,|Du|(\Omega) - |Dv|(\Omega)\,\big|. \quad (5.2)$$
In this section we consider a linear and bounded operator $K : L^2(\Omega) \to L^2(\Omega)$, where $\Omega \subset \mathbb{R}^2$ is open and bounded.

Proposition 5.1. Let $\{g_k\}_{k\in\mathbb{N}} \subset L^2(\Omega)$ be such that $\|g - g_k\| \le \delta_k \to 0$ as $k \to \infty$. Let $\Gamma$ be chosen according to Morozov's rule (4.1) and assume that $\lim_{k\to\infty} \Gamma(\delta_k, g_k) = \infty$. Then, the sequence $\{R_{\Gamma(\delta_k,g_k)}(g_k)\}_{k\in\mathbb{N}}$ satisfies (3.5) and (3.6). Moreover, it has a subsequence which converges strictly to a $J$-minimizing solution of $Ku = g$.
Proof. The first assertions result from Corollary 4.6. Further, denote $u_k = R_{\Gamma(\delta_k,g_k)}(g_k)$. According to Corollary 4.6, the sequence $\{u_k\}_{k\in\mathbb{N}}$ is bounded in $L^2(\Omega)$ and $\sup_{k\in\mathbb{N}} J(u_k) < \infty$. Hence, Theorem 2.5 in [1] implies that $\{u_k\}_{k\in\mathbb{N}}$ is strongly $L^1$-compact, and thus there is a subsequence, indexed by $k'$, which converges to some $u_*$ strongly in $L^1(\Omega)$. Since each $L^2$-weak cluster point of $\{u_k\}_{k\in\mathbb{N}}$ is a $J$-minimizing solution of $Ku = g$ according to Corollary 4.6, the same holds for $u_*$. Finally, it follows from (3.5) that $d(u_{k'}, u_*) \to 0$.
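To make the notion concrete, here is a small Python sketch of a discretized version of the strict-convergence metric (5.2), with the total variation replaced by a finite-difference approximation; the grid, images and discretization are illustrative assumptions.

```python
import numpy as np

# Discrete sketch of the strict-convergence metric (5.2):
#   d(u, v) = ||u - v||_{L^1} + | TV(u) - TV(v) |,
# with TV approximated by an isotropic finite-difference total variation.
# Grid size and test images are illustrative.
def tv(u):
    dx = np.diff(u, axis=0)
    dy = np.diff(u, axis=1)
    return float(np.sqrt(dx[:, :-1] ** 2 + dy[:-1, :] ** 2).sum())

def strict_metric(u, v, h=1.0):
    return h * h * float(np.abs(u - v).sum()) + abs(tv(u) - tv(v))

u = np.zeros((32, 32))
u[8:24, 8:24] = 1.0                                   # indicator of a square
v = u + 1e-3 * np.sin(np.arange(32))[None, :]         # small smooth perturbation
print(strict_metric(u, u), strict_metric(u, v))
```

Note that $d(u, u) = 0$, while a small smooth perturbation changes both the $L^1$ term and the total variation term.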
Clearly, error estimates in terms of the $L^1$-norm are desirable, but not easy to derive. In order to show convergence rates for strict convergence of the iterates, we need to employ another metric, which appears naturally in the analysis, namely
$$\tilde d(u, v) := \|Ku - Kv\|_{L^2} + \big|\,|Du|(\Omega) - |Dv|(\Omega)\,\big|. \quad (5.3)$$
The following lemma points out the relation between the two metrics described above.
Lemma 5.2. Assume that $K : L^1(\Omega) \to L^2(\Omega)$ is continuous and can be extended by continuity to $L^2(\Omega)$. Then, convergence of a sequence with respect to the metric $d$ defined by (5.2) implies convergence of the sequence with respect to the metric $\tilde d$ defined by (5.3). If additionally the linear bounded operator $K : L^2(\Omega) \to L^2(\Omega)$ is injective, then the two metrics are equivalent.
Proof. The first part follows immediately from $\|Ku\|_{L^2} \le \|K\|\,\|u\|_{L^1}$ for any $u \in L^1(\Omega)$.
Assume now that $\tilde d(u_k, u) \to 0$ as $k \to \infty$ and that $K$ is injective. Then $K$ in particular does not annihilate constant functions, and it follows from [1, Lemma 4.1] that $u \mapsto \|Ku\|_{L^2} + J(u)$ is BV-coercive. Hence the boundedness of $\{\|Ku_k\|_{L^2}\}_{k\in\mathbb{N}}$ and $\{J(u_k)\}_{k\in\mathbb{N}}$, which follows from $\tilde d(u_k, u) \to 0$, yields boundedness of $\{\|u_k\|_{\mathrm{BV}}\}_{k\in\mathbb{N}}$. Thus, there exists a subsequence $\{u_{k'}\}_{k'\in\mathbb{N}}$ which converges to some $v \in \mathrm{BV}(\Omega)$ strongly in $L^1(\Omega)$ and weakly in $L^2(\Omega)$, due to the compact and the bounded embedding, respectively (cf. [1, Theorem 2.5]). This yields weak convergence in $L^2(\Omega)$ of $\{Ku_{k'}\}_{k'}$ to $Kv$. Since the weak limit is unique, it follows that $Ku = Kv$ and consequently, since $K$ is injective, that $u = v$.
Moreover, since the limit $u$ is independent of the chosen subsequence, the entire sequence $\{u_k\}_{k\in\mathbb{N}}$ converges strongly in $L^1(\Omega)$ to $u$, which completes the proof.
Note that the continuity of the operator $K$ from $L^1(\Omega)$ into $L^2(\Omega)$ is not necessary for proving the second part of the lemma. Now we are able to show the convergence rate in terms of the metric $\tilde d$:

Proposition 5.3. Let $\{g_k\}_{k\in\mathbb{N}} \subset L^2(\Omega)$ be such that $\|g - g_k\| \le \delta_k \to 0$ as $k \to \infty$. Let $\Gamma$ be chosen according to rule (4.1) and assume that $\lim_{k\to\infty} \Gamma(\delta_k, g_k) = \infty$. If $u^\dagger$ is a $J$-minimizing solution of $Ku = g$ that satisfies the source condition (2.2) with source element $p^\dagger \in H_2$, then the following convergence rate holds:
$$\tilde d\big(R_{\Gamma(\delta_k,g_k)}(g_k), u^\dagger\big) = O(\delta_k).$$

Proof. From the definition of rule (4.1) it follows that $\|K R_{\Gamma(\delta_k,g_k)}(g_k) - K u^\dagger\| = O(\delta_k)$.
In order to establish an error estimate for $|J(R_{\Gamma(\delta_k,g_k)}(g_k)) - J(u^\dagger)|$, we use Theorem 4.3. Indeed, since the symmetric Bregman distance dominates the Bregman distance, one has a corresponding bound. Using the Cauchy-Schwarz inequality and again Corollary 3.6 we see that this term is of order $O(\delta_k)$ as well. Similarly one can bound the remaining term, which ends the proof.
6. Sparse regularization. In the case of sparse regularization, the convex functional (1.4) is considered with $1 \le q \le 2$ (see [9]). The aim of the functional $J$ is to promote sparse solutions, i.e. solutions which have only a few (in particular, finitely many) nonzero entries. Tikhonov regularization based on this regularization functional has been studied in great detail in [13,18,19]. The case $q = 1$ for the stationary augmented Lagrangian method has been treated in [5], also under the name Bregman iteration. There, the authors obtained convergence of the method for noise-free data with respect to the Bregman distance and considered an a priori stopping rule for noisy data. In this section we also treat the case $q = 1$ and derive both an enhanced convergence rate in norm for noise-free data and optimal convergence rates for noisy data with the a posteriori rule given by Morozov's discrepancy principle.
6.1. Convergence rates for $\delta \to 0$. We start with a result on convergence in the noisy data case which holds for all $q \in [1,2]$. Fulfillment of a source condition is not needed here.

Theorem 6.1. Let $K : \ell^2 \to H_2$ be linear and bounded, $1 \le q \le 2$, and let $J$ be defined by (1.4). Moreover, let the parameter choice $\Gamma$ obey (3.4). Then the sequence $\{R_{\Gamma(\delta_k,g_k)}(g_k)\}$ has a subsequence which converges strongly to a $J$-minimizing solution of $Ku = g$.
Note that in the case $q \in (1, 2]$ the entire sequence of iterates converges strongly to the unique $J$-minimizing solution of $Ku = g$. By Lemma 4.5, we also conclude that $\ell^q$-regularization combined with Morozov's discrepancy principle gives rise to a (subsequentially) convergent regularization method and, if additionally the source condition is fulfilled, leads to convergence rates in the sense of Bregman distances.
Actually, in the latter case we can strengthen the above result. More precisely, we can derive convergence rates with respect to the $\ell^q$-norm for $q \in [1,2]$. The two cases $q \in (1, 2]$ and $q = 1$ have to be treated separately.
In the case $q \in (1, 2]$, we take advantage of the differentiability and the high degree of convexity of the functional $J$ to estimate even the distance between the subgradients appearing in the iterative process.
The following result, which will be useful in the sequel, was pointed out in [22, Proposition 3.2]. We give the proof here for the sake of completeness.
Lemma 6.2. If $q \in (1, 2]$ and $J$ is given by (1.4), then the estimate (6.1) holds for all $v \in \ell^q$ and $u \in \ell^{2(q-1)}$ with $u \neq 0$.

Proof. The inequality is obvious if $q = 2$. Let $q \in (1, 2)$ and note that $D(\partial J) = \ell^{2(q-1)}$. In order to simplify the notation in the proof, we omit the subscript of the $\ell^q$-norm. Now [6, Lemma 1.4.8] implies (6.2) for all $v \in \ell^q$, $u \in \ell^{2(q-1)}$, where $t := \|v - u\|$. Let $\varphi(t) := (t + \|u\|)^q$ for $t$ small enough. The Taylor expansion of $\varphi$ around $0$ yields the existence of an $a_t \in (0, t)$ such that the second-order expansion holds. This and (6.2) imply (6.3). Note that $a_t + \|u\| \ge \|u\|$ and $q - 3 < 0$; hence $(a_t + \|u\|)^{q-3} \le \|u\|^{q-3}$. Let $b \in (0, 1)$ and take $t < \frac{3(1-b)\|u\|}{2-q}$. Then inequality (6.3) yields the assertion with $c_q = \frac{bq(q-1)}{2}\|u\|^{q-2}$.

Proposition 6.3. Let $K : \ell^2 \to H_2$ be linear and bounded and let $J$ be defined by (1.4) with $1 < q \le 2$. Let $\Gamma$ be the parameter choice according to Morozov's discrepancy principle (4.1). If the $J$-minimizing solution $u^\dagger$ of $Ku = g$ satisfies the source condition (2.2) with source element $p^\dagger$, then the stated convergence rates hold for $k$ sufficiently large.

Proof. We apply inequality (6.1) with $v = R_{\Gamma(\delta_k,g_k)}(g_k)$ and $u = u^\dagger$ for $k$ sufficiently large. This and Theorem 4.3 imply the first assertion.
In order to show the estimate for the subgradients, note that (see, e.g., [6, Lemma 1.4.10]) for any $\xi_1, \xi_2 \in \ell^r$ a norm inequality holds with some positive constant $c_r$ depending on $r \ge 2$. Consequently, the second assertion follows from Remark 3.7, which completes the proof.

Now we turn to the case of sparse regularization for $q = 1$. Here, one can derive improved convergence rates in case the solution $u^\dagger$ does not only fulfill the source condition but is also indeed sparse. To be more precise, we define for a given set $I \subset \mathbb{N}$ the projection $P_I$ by $(P_I u)_k := u_k$ for $k \in I$ and $(P_I u)_k := 0$ otherwise, and require the following

Assumption 2. i) The solutions $u^\dagger$ of (1.1) satisfy the source condition (2.2) with source element $p^\dagger$. ii) For $\xi = K^* p^\dagger$ and $I = \{k : |\xi_k| = 1\}$, the quantity $\theta = \sup\{|\xi_k| : k \notin I\}$ is strictly smaller than one.
iii) The operator $KP_I : \ell^1 \to H$ is injective in the sense that $KP_I u = KP_I v$ implies $P_I u = P_I v$.

We start with the following lemma, which can be traced back to [13] (see also [14]).

Lemma 6.4. Assume that Assumption 2 is satisfied. Then, there exist constants $\beta_1, \beta_2 > 0$ such that
$$\beta_1 \|u - u^\dagger\|_{\ell^1} \le J(u) - J(u^\dagger) + \beta_2\|Ku - g\| \quad \text{for all } u \in \ell^1.$$

Proof. Due to Assumption 2 iii) the operator $KP_I$ is injective and hence there exists $c > 0$ such that $\|KP_I u\| \ge c\|P_I u\|_{\ell^1}$ for all $u \in \ell^1$. Now we estimate the on-support and off-support parts of the norm separately. Since $u^\dagger_k = 0$ for $k \notin I$, Assumption 2 i) and ii) yields a bound on the off-support part. Combining both estimates gives the assertion.

Remark 6.5. We remark on Assumption 2: Statement ii) is related to the notion of "strict sparsity pattern" in [3]. To get a practically relevant condition, one may replace it with the assumption that the range of $K^*$ is contained in some $\ell^p$ with $p < \infty$ (since in this case the sequence $\xi$ has to tend to zero); this also implies that $I$ is finite. Alternatively, one may also work with $K : \ell^2 \to H$. Statement iii) is a restricted injectivity condition. Since one needs to know the set $I$ in advance to verify it, one often uses instead the "finite basis injectivity property" (FBI property) from [3,18], which states that $KP_I$ is injective for all finite sets $I$. This condition can be checked in advance and hence seems more practical.

Now we treat the case of noisy data and show that the application of Morozov's discrepancy principle leads to optimal convergence rates.

Theorem 6.6. Let $u^\dagger$ be a $J$-minimizing solution of $Ku = g$ and assume that $\Gamma$ is the parameter choice according to Morozov's discrepancy principle (4.1). Then, the iterates converge in the $\ell^1$-norm with rate $O(\delta_k)$.

Proof. We estimate the symmetric Bregman distance from below using Lemma 6.4. To this end, set $u_k = R_{\Gamma(\delta_k,g_k)}(g_k)$. Rearranging and using the Cauchy-Schwarz inequality leads to the corresponding bound. From the definition of Morozov's discrepancy principle (4.1) and Theorem 4.3 we finally conclude the proof.
6.2. Convergence rate for $n \to \infty$ in the noise-free case. Another consequence of our analysis of the ALM is that we can prove convergence rates for the ALM iteration with noise-free data which are superior to previous results.

Proposition 6.7. Let $J$ be according to (1.4) with $q = 1$, let $u^\dagger$ be a $J$-minimizing solution of $Ku = g$ and let $p_0 = 0$. Then there exists a constant $C > 0$ such that the iterates $u_n$ of the ALM fulfill
$$J(u_n - u^\dagger) \le \frac{C}{t_n}.$$

Proof. Since $K^* p_n \in \partial J(u_n)$, one has
$$J(u_n) - J(u^\dagger) \le -\langle K^* p_n, u^\dagger - u_n \rangle = -\langle p_n, g - K u_n \rangle \le \|p_n\|\, \|g - K u_n\|.$$
Now we use Lemma 6.4 to obtain
$$\beta_1\, J(u_n - u^\dagger) \le J(u_n) - J(u^\dagger) + \beta_2\, \|K u_n - g\| \le (\|p_n\| + \beta_2)\, \|K u_n - g\|.$$
Theorem 3.4 (with $\delta = 0$) gives
$$J(u_n - u^\dagger) \le \frac{(\gamma + \beta_2)\, \|p^\dagger\|}{\beta_1\, t_n},$$
which proves the assertion. This proposition shows that the ALM can compute approximate solutions of the so-called Basis Pursuit problem [8] of finding minimal $\ell^1$-norm solutions of underdetermined linear systems, and it also gives an estimate on the speed of convergence of the objective value.
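For illustration, the decay predicted by Proposition 6.7 can be observed numerically. The sketch below runs the ALM on a small random underdetermined system with exact data; the inner $\ell^1$ subproblems are solved approximately by ISTA, a simple proximal-gradient solver chosen here for convenience (the paper does not prescribe a particular inner solver), and the $\ell^1$ error $J(u_n - u^\dagger)$ is recorded.

```python
import numpy as np

def soft(x, t):
    """Soft-thresholding: the proximal map of the l1-norm."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

rng = np.random.default_rng(0)
K = rng.standard_normal((10, 20))
u_true = np.zeros(20)
u_true[[3, 7]] = [1.0, -1.0]          # sparse solution (generically minimal l1)
g = K @ u_true                        # exact (noise-free) data

tau = 2.0                             # constant parameter tau_n = tau
eta = 1.0 / (tau * np.linalg.norm(K, 2) ** 2)   # ISTA step size 1/L
u, p = np.zeros(20), np.zeros(10)
errs = []
for n in range(60):                   # outer ALM (Bregman) iterations
    for _ in range(300):              # inner ISTA loop for the subproblem
        # subproblem: min_u tau/2 ||Ku - g||^2 - <p, Ku - g> + ||u||_1
        grad = tau * K.T @ (K @ u - g) - K.T @ p
        u = soft(u - eta * grad, eta)
    p = p - tau * (K @ u - g)         # multiplier update, so K* p_n in dJ(u_n)
    errs.append(np.abs(u - u_true).sum())   # l1 error J(u_n - u_true)

# the l1 error decays over the iteration, consistent with the O(1/t_n) bound
assert errs[-1] < errs[0]
```

The multiplier update $p_n = p_{n-1} - \tau_n (K u_n - g^\delta)$ reproduces the optimality condition $K^* p_n \in \partial J(u_n)$ used in the proof above.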
6.3. Implications for Compressed Sensing. Finally, we remark on the relation of our results to the theory of compressed sensing. Linear convergence rates for variational regularization with the $\ell^1$-norm have been shown in [13, 14] under a source condition and some assumptions on the operator $K$. A similar result has been proven in the finite-dimensional setting of compressed sensing (see [7]) by using the restricted isometry property. In the latter setting, [14] established the following connection between the above-mentioned conditions (see [14, part of Proposition 5.3 and Theorem 4.7]):

Proposition 6.8. Assume that $K$ satisfies the $s$-restricted isometry property and let $u^\dagger$ be an $s$-sparse solution of the equation $K u^\dagger = g$. Then $u^\dagger$ satisfies the source condition and $K P_I$ is injective, with $I$ given as in Lemma 6.4.
Based on this result and on the ones in this section, one can immediately state the following:

Proposition 6.9. Assume that $K$ satisfies the $s$-restricted isometry property and let $u^\dagger$ be an $s$-sparse solution of the equation. Then linear convergence rates hold for Bregman iterations, both in the noise-free case and in the case of noisy data when the discrepancy principle is employed.
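Proposition 6.9 can likewise be illustrated numerically: run the ALM (Bregman iteration) on noisy data $g^\delta$ and stop once the residual falls below $\kappa \delta$ with $\kappa > 1$, in the spirit of Morozov's discrepancy principle (4.1). All concrete choices in this sketch (the name `alm_l1`, the random matrix `K`, $\kappa = 1.5$, the ISTA inner solver) are illustrative assumptions, not part of the paper.

```python
import numpy as np

def soft(x, t):
    """Soft-thresholding: the proximal map of the l1-norm."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def alm_l1(K, g_delta, delta, tau=2.0, kappa=1.5, inner=300, max_outer=100):
    """ALM / Bregman iteration for min ||u||_1 s.t. Ku = g, applied to noisy
    data and stopped by the discrepancy principle ||K u_n - g_delta|| <= kappa*delta.
    Illustrative sketch; subproblems are solved approximately by ISTA."""
    m, n = K.shape
    u, p = np.zeros(n), np.zeros(m)
    eta = 1.0 / (tau * np.linalg.norm(K, 2) ** 2)   # ISTA step size 1/L
    for _ in range(max_outer):
        for _ in range(inner):        # inner ISTA loop for the subproblem
            grad = tau * K.T @ (K @ u - g_delta) - K.T @ p
            u = soft(u - eta * grad, eta)
        r = K @ u - g_delta
        if np.linalg.norm(r) <= kappa * delta:       # Morozov's principle
            break
        p = p - tau * r               # multiplier update
    return u

rng = np.random.default_rng(1)
K = rng.standard_normal((10, 20))
u_true = np.zeros(20)
u_true[[3, 7]] = [1.0, -1.0]          # s-sparse solution, s = 2
noise = rng.standard_normal(10)
delta = 0.05
g_delta = K @ u_true + delta * noise / np.linalg.norm(noise)   # ||g_d - g|| = delta

u_rec = alm_l1(K, g_delta, delta)
# reconstruction error of the order of the noise level
assert np.abs(u_rec - u_true).sum() < 0.5
```

Early stopping via the discrepancy principle is what prevents the Bregman iteration from fitting the noise, in line with the regularization results of Section 4.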

Conclusion.
In this work we showed that Morozov's discrepancy principle (4.1) applied to the Augmented Lagrangian Method (Algorithm 1) leads to a regularization method for linear inverse problems $Ku = g$. This gives a theoretical justification for the observation that the discrepancy principle provides useful results in practical situations.
We used a dual characterization of the ALM in order to derive explicit error bounds for the Bregman distance between the iterates and a true $J$-minimizing solution $u^\dagger$ of $Ku = g$, provided $u^\dagger$ satisfies the source condition with source element $p^\dagger$. In this case, error bounds for the Bregman distance (with respect to $J^*$) between the dual iterates of the ALM and $p^\dagger$ were also obtained. Moreover, we showed that a sufficient condition for the source condition to hold is the existence of finite accumulation points of the sequence of stopping indices chosen by the discrepancy principle.
We applied our general results to particular situations which have a special appeal for problems arising in imaging.
Firstly, we considered the case of total variation regularization where we were able to show that the ALM converges strictly in BV(Ω) and to establish convergence rates with respect to an equivalent metric.
Secondly, we studied sparse regularization on $\ell^2$, more precisely the case where $J$ coincides with the $\ell^q$-norm ($q \in [1,2]$). Besides $\sqrt{\delta}$-rates in the $\ell^q$-norm for $q > 1$, we were able to prove linear convergence rates for the particularly interesting case $q = 1$ (under suitable regularity conditions on $u^\dagger$). In the latter case, the sequence of dual iterates of the ALM carries important information on the support of the solution. The conjugate function $J^*$ of the $\ell^1$-norm, however, degenerates to an indicator function. As a consequence, the general estimates for the dual variables do not reveal much insight into their convergence behavior. It remains an open issue whether one can obtain more meaningful estimates for the dual variables.