Variational Multiscale Nonparametric Regression: Algorithms and Implementation

Many modern statistically efficient methods come with tremendous computational challenges, often leading to large-scale optimisation problems. In this work, we examine such computational issues for recently developed estimation methods in nonparametric regression with a specific view on image denoising. We consider in particular certain variational multiscale estimators which are statistically optimal in minimax sense, yet computationally intensive. Such an estimator is computed as the minimiser of a smoothness functional (e.g., TV norm) over the class of all estimators such that none of its coefficients with respect to a given multiscale dictionary is statistically significant. The so obtained multiscale Nemirowski-Dantzig estimator (MIND) can incorporate any convex smoothness functional and combine it with a proper dictionary including wavelets, curvelets and shearlets. The computation of MIND in general requires to solve a high-dimensional constrained convex optimisation problem with a specific structure of the constraints induced by the statistical multiscale testing criterion. To solve this explicitly, we discuss three different algorithmic approaches: the Chambolle-Pock, ADMM and semismooth Newton algorithms. Algorithmic details and an explicit implementation is presented and the solutions are then compared numerically in a simulation study and on various test images. We thereby recommend the Chambolle-Pock algorithm in most cases for its fast convergence. We stress that our analysis can also be transferred to signal recovery and other denoising problems to recover more general objects whenever it is possible to borrow statistical strength from data patches of similar object structure.


Introduction
Regression analysis has a centuries long history in science and is one of the most powerful and widely used tools in modern statistics. As it aims to discover a functional dependency f between certain variables of interest, it provides important insight into the relationship of such variables. Typically, the data are noisy and specific regression models provide a mathematical framework for the recovery of this unknown relationship f given such noisy observations. Due to the flexibility of such models, they can be accommodated to many different scenarios, e.g., for linearly dependent data as in linear regression and for data following a general nonlinear structure as in nonlinear regression. The literature on this and the range of applications is vast, we exemplarily refer to the work of Draper and Smith [27] for an overview. Often, these models depend on a few parameters only specifying f already, which then can be estimated from the data in a relatively simple way. In contrast, nonparametric regression avoids such a restrictive modelling and becomes indispensable when prior knowledge is not sufficient for parametric modelling (see, e.g., [3,31]). One of the most fundamental and most studied nonparametric models is the Gaussian nonparametric regression (i.e., denoising) model with independent errors, sometimes denoted as white noise regression model (after a Fourier transformation). In this model, we aim to estimate the unknown regression function f : [0, 1] d → R given noisy observations Y i , which are related to f as where i are independent normal random variables with zero mean and unit variance and Γ n is a discrete grid in [0, 1] d that consists of n points. Note that σ determines the noise level, which we assume for simplicity to be constant (see, however, Section 5). We stress that much of what is addressed in this paper can be generalised to other error models and other domains than the d-dimensional unit cube [0, 1] d (see also Section 5). However, to keep the presentation simple, we restrict ourselves to the Gaussian error and equidistant grids in the d-dimensional unit cube. A mathematical theory of such nonparametric regression problems has a long history in statistics (see [70] for an early reference), as they are among the simplest models where the unknown object is still in a complex function space and not just encoded in a low-dimensional parameter, yet they are general enough to cover many applications (see, e.g., [31]). Consequently, the statistical estimation theory of nonparametric regression is a well investigated area for which plenty of methods have been proposed, such as kernel smoothing [59,73]; global regularisation techniques such as penalised maximum likelihood [29]; ridge regression, which amounts to Tikhonov regularisation [62,57]; or total variation (TV) regularisation [67]. These methods have not been designed a-priori in a spatially adaptive way which could be overcome by a second generation of (sparse) localising regularisation methods originating in the development of wavelets [15]. To fine tune the estimator for first generation methods, usually a simple regularisation parameter has to be chosen (statistically); for wavelet-based estimators, this amounts to properly select (and truncate) the wavelet coefficients methods, e.g., by soft thresholding (see, e.g., [22]). We refer to Tsybakov [71] for an introduction to the modern statistical theory of nonparametric regression, mainly from a minimax perspective. One of the latest developments to the problem of recovering the function f in (1) already dates back to Nemirovski [60] and is implicitly exemplified by the Dantzig selector [11], which can be seen as a hybrid between a sparse approximation with respect to a dictionary and variational (L 1 ) regularisation. Such hybrid methods were coined by Grasmair et al. [37] as MIND estimators (MultIscale Nemirovski-Dantzig) and are the focus of this paper. We discuss these mainly in the context of statistical image analysis (d = 2), but stress that our findings also apply to signal recover d = 1 and to other situations where multiscale approaches are advantageous, e.g., for temporal-spatial imaging, where d = 3, 4.

Variational Denoising
One of the most prominent regularisation methods for image analysis is total variation (TV) denoising, which was popularised by Rudin, Osher and Fatemi [67] (for further developments towards a mathematical theory for more general variational methods, see, e.g., [68] and references therein). In the spirit of Tikhonov regularisation, the rationale behind is to enforce certain properties for the function f fitted to the model (1), which are encoded by a convex penalty term R such as the TV seminorm. Consequently, the weighted sum of a least-squares data fidelity term (corresponding to the maximum likelihood estimation in model (1)), R is minimised over all functions g (e.g., of bounded variation) and the minimiser is taken as the reconstruction or estimator for f , i.e., with a weighting factor (sometimes called the regularisation parameter) α > 0. As soon as R is convex, algorithmically, this leads to a convex optimisation problem, which can be solved efficiently in practice. This also applies to other convex data fidelity, e.g., when the likelihood comes from an exponential family model. To solve (2) numerically, most common are first-order methods (see, e.g., [6]), which boil down to computing the (sub-)gradients of the least-squares term and R.
One of the statistical disadvantages of methods of the form (2) is the usage of the global least-squares term, which to some extend enforces the same smoothness of g everywhere on the domain Γ n . To overcome this issue, a spatially varying choice of the parameter α has been discussed in case of R being the TV seminorm (see, e.g., [44,43]). Other options are localised least-squares fidelities [42,21,20] or anisotropic total variation penalties, where the weighting matrix A is also improved iteratively based on the available data (see, e.g., [51]). We discuss another powerful strategy to cope with local inhomogeneity of the signal which has its origin in statistics and has been recently shown to be statistically optimal (in a certain sense to be made precise below) for various choices of penalties R.

Statistical Multiscale Methods
In the nonparametric statistics literature, multiscale methods have their origin in the discovery of wavelets [15], which Donoho and Johnstone [26] used to construct wavelet thresholding estimators and showed their ability to adapt to certain smoothness classes. Such estimators are computationally simple, as they only involve thresholding with respect to an orthonormal basis (cf. [22]). Following wavelets, myriad multiscale dictionaries tailored to different needs have been developed, such as curvelets [9], shearlets [49], etc. In a nutshell, the superiority of multiscale dictionaries over other bases for denoising and inverse problems lies in their excellent approximation and localising properties, which yields sparse approximations of functions with respect to loss functions that results from norms which average the error, e.g., L p , p ≥ 1, Sobolev or Besov norms. However, it is well known that despite this sparseness these estimators tend to present Gibbs-like oscillatory artefacts in statistical settings (which is not well reflected by such norms). This affects the quality of the overall reconstruction critically [10,16].

Variational Multiscale Methods
Variational multiscale methods offer a solution to the problem of such Gibbs-artefacts, as they combine multiscale methods with classical (non-sparse) regularisation techniques, with the idea of making use of the best of both worlds: the stringent data-fitting properties of (overcomplete) multiscale dictionaries, and the desired (problem dependent) smoothness imposed by a regularisation method, e.g., by TV regularisation. Definition 1. Let F be a class of functions, R : F → R ∪ {∞} a convex functional and {φ λ |λ ∈ Λ n } a finite collection of functions. Given observations Y i as in (1), the MultIscale Nemirovski-Dantzig Estimator (MIND) is defined by the constrained optimisation problem for a threshold q n > 0, where we use the inner product h, g : Several particular cases of the MIND have been proposed: the first time by Nemirovski [60] (who gave credit to S. V. Shil'man) in 1985 for a system of indicator functions and a Sobolev seminorm, R(g) = |g| H s , Donoho [22] derived soft-thresholding from this principle where the dictionary {φ λ | λ ∈ Λ n } is a wavelet basis (for the latter, see also [55]). For the case of a curvelet frame, we refer to [10] and for a system of indicator functions and the Poisson likelihood to [33,34], where R(g) = |g| BV is the TV seminorm. Notice that, in practice, the choices of R and the dictionary reflect previous knowledge or expectations on the unknown function f : the amount of regularisation that R imposes should be dependent on how smooth we expect f to be; and the dictionary should measure patterns that we expect f to obey, e.g., elongated features in the case of curvelets and piecewise constant areas in the case of indicator functions. Furthermore, we have the following rule of thumb: the more redundant is the dictionary, the better is the statistical properties of the estimator but the more expensive is the computation of a solution to (3).
It is documented that the different proposed instances of the MIND show increased regularity and reduced artefacts as compared with standard multiscale methods (i.e., thresholding type estimators), but still avoid oversmoothing if the threshold q n is appropriately chosen (see [33,34,16]). For instance, we illustrate in Figure 1 that the MIND with R the TV seminorm outperforms the soft-thresholding estimator under various choices of multiscale dictionaries, including wavelets, curvelets and shearlets. In addition to their good empirical performance, estimators of the form (3) have recently been analysed theoretically and proved to be (nearly minimax) optimal for nonparametric regression [37,16] and certain inverse problems [11,17]. We give a brief review of these theoretical results in Section 2.

Computational Challenges and Scope of the Paper
However, the practical and theoretical superiority of regularised multiscale methods over purely dictionary thresholding methods comes at a cost: instead of simple thresholding, a complex optimisation problem has to be solved, with an increase in computation time and the need of improved optimisation methods. In practice, the problem (3) is non-smooth (e.g., if R is chosen as the TV seminorm) and is furthermore high-dimensional due to a huge number of constraints. For instance, in the case of images of 256 × 256 pixels, the number of constraints is 1,751,915 for the dictionary of cubes of edge length ≤ 30 and 3,211,264 for that of shearlets. Besides our theoretical review in Section 2, the goal of this paper is to discuss different particularly successful algorithmic approaches for the solution of (3) in Section 3. Among the first attempts is the usage of the alternating direction method of multipliers (ADMM) in [32,33,34], including a problem-specific convergence analysis. The computational disadvantage of this approach is the necessity to compute projections to the constraint set, which is most generally done by Dykstra's algorithm. Even though this can be efficiently implemented using GPUs (cf. [50,48]), the projection step provides a computational bottleneck of the overall algorithm. Several other convex optimisation algorithms such as Douglas-Rachford-Splitting (see [56]) or the Proximal Alternating Predictor Corrector algorithm (cf. [54]) also require the computation of such projections, and hence their computational performance is similar to the ADMM-based version. In view of this, we furthermore discuss two new approaches based on the Chambolle-Pock algorithm [13] on the one hand and a semismooth Newton method (cf. [Hintermüller,14]) on the other hand. The former allows avoiding computing projections to the constraint sets, but requires only the computation of the corresponding resolvent operator, which reduces to a high-dimensional soft-thresholding problem. The latter allows solving a regularised version of the optimality conditions by Newton's method and then apply a pathcontinuation scheme to decrease the amount of regularisation. Both algorithms hence avoid Dykstra's projection step, but still have favourable convergence properties.
In Section 4, we finally discuss the practical advantages and disadvantages of the previously described algorithms along different numerical examples. For all examples checked in this paper, we found that the Chambolle-Pock algorithm is superior compared to the others. Finally, we provide a Matlab implementation of the Chambolle-Pock algorithm for this problem and code to run all examples in https://github.com/housenli/MIND.

Theoretical Properties of Variational Multiscale Estimation Methods
In this section, we briefly review some theoretical reconstruction properties of multiscale variational estimators within model (1). Recall that, in the nonparametric regression model (1), we have access to noisy samples of a function f at locations x i ∈ Γ n . We have some flexibility in choosing the underlying grid Γ n : it could be for instance an equidistant d-dimensional grid (e.g., as pixels in an image), but other choices are possible (e.g., a polar grid).

Theoretical Guarantees
Estimators of the form (3) have been analysed from a theoretical viewpoint in a variety of settings. When the dictionary basis functions φ λ are orthogonal, this becomes particularly simple, as then the evaluation functionals (if the truth equals g) φ λ , become independent in model (3). This is valid for wavelet systems and their statistical analysis is vast (see, e.g., [52,25,24,74,39,7,75,1,8] for various forms of adaptation and thresholding techniques). If the dictionary is redundant the analysis becomes more difficult (see, however, [38] for the asymptotic validity of hard and soft thresholding in this case) and only in recent years it could be been shown that suitably constructed MINDs with redundant dictionaries perform optimal in a statistical (minimax) sense over certain function spaces. In the following, we summarise some of these results in an informal way. To this end, the notion of minimax optimality is key, which compares a given estimator with the best possible estimator in terms of their worst case error (see Equation (4) and Definition 3.1 in [71] for a formal definition).
-Sobolev spaces: The authors of [60,37] analysed the MIND with R(g) = |g| H s and {φ λ } being a set of indicator functions of rectangles at different locations and scales. They showed that, for the choice q n = C σ log n/n for an explicit constant C > 0, the MIND is minimax optimal up to logarithmic factors for estimating functions in the Sobolev space H s . This means that the MIND's expected reconstruction error is of the same order as the error of the best possible estimator, i.e., Besides minimax optimality, Grasmair et al. [37] also showed that the MIND with H s regularisation is also optimal for estimating functions in other Sobolev spaces H t for certain smoothness indices t, a phenomenon known as adaptation.
-Bounded variation: In [16], the MIND with bounded variation regularisation R(g) = |g| BV was considered. It was shown that it is optimal in a minimax sense up to logarithms for estimating functions of bounded variation if d = 2. For d ≥ 3, the discretisation matters further and this could only be shown in a Gaussian white noise model. Such results hold for a variety of dictionaries {φ λ }, such as wavelet bases, mixed wavelet and curvelet dictionaries, as well as suitable systems of indicator functions of rectangles.
In addition to theoretical guarantees, these results also provide a way of choosing the threshold parameter q n in (3). Indeed, both for Sobolev and for bounded variation regularisation, it is shown (see again [37,16]) that the choice q n = C σ (log n)/n for an explicit constant C > 0 yields asymptotically optimal performance. The constant C depends on the dimension d and smoothness s of the functions, on whether we consider Sobolev or BV regularisation, and on the dictionary {φ λ } we employ.

Practical choice of the threshold
Besides the theoretically (asymptotically) optimal choice of the parameter q n , a Monte Carlo method for a finite sample choice was proposed by Grasmair et al. [37]. It is based on the observation that the multiscale constraint in (3) can be interpreted as a test statistic for testing whether the data Y is compatible with the function g, in the sense of (1). In fact, the "multiscales" come from not only performing one test, but many tests that focus on different features of g of various sizes, locations and orientations. From this viewpoint, q n is interpreted as a critical value for a statistical test, and statistical testing theory suggests that q n should be chosen as a high quantile of the random variable This interpretation yields a practical way of choosing q n : we simply pre-estimate σ, simulate independent realisations of the noise , compute their values in (5) and finally set q n to be a quantile of that sample. This choice of q n yields good practical performance (see Section 4) and is compatible with the theoretically optimal choice for n large enough. Methods for preestimating σ from the data can be found, e.g., in [58] (see Section 4.3 for simulations of the MIND estimator with estimated noise level).

Computational Methods
In this section, we discuss different algorithmic approaches to compute the MINDf in (3). First, we stress that the optimisation problem (3) is a non-smooth, convex, high-dimensional optimisation problem, which allows in principle to exploit any optimisation method designed for such situations. In the following, we restrict to three different approaches that we found particularly suited for our scenario. To set the notation in this section, we rewrite (3) as where F and G are lower semi-continuous, proper convex functionals given by and K is the linear (bounded) operator that maps from R n to R #Λn and is defined by Here, and in what follows, 1 ≤0 denotes the indicator function of the negative half-space, which is Due to the convexity of F and G, the problem (6) can equivalently be solved by finding a root of the subdifferential ∂J(v) (see, e.g., [66]), which is a generalisation of the classical derivative. Such roots correspond to stationary points of the evolution equation Two fundamental algorithms for the solution of this equation arise from applying the explicit or implicit Euler method, which leads to the steepest decent procedure and the so-called proximal point method respectively, where I denotes the identity operator and λ k > 0 is a step size parameter. For (7a), λ k should be chosen such that (I − λ k ∂J) is a contraction to ensure convergence. For continuously differentiable J, it therefore suffices to choose λ k < L −1 with the Lipschitz constant L of ∇J; in practice, one often uses λ k = 2L −1 with an estimatorL of the Lipschitz constant (generated, e.g., by a power method). In the case of (7b), convergence can be obtained for any choice of λ k > 0, but the computation of the so-called resolvent operator (I + λ k ∂J) −1 is clearly as difficult as the original problem. Motivated by these two methods, different algorithms have been suggested. Therefore, the subdifferential ∂J(v) of J is split into the subdifferentials of F and G, which allow for a much simpler computation of the corresponding resolvent operators. This makes use of the formula which holds true whenever there exists some vector v ∈ R n such that both F and G are finite and continuous at Kv and v respectively (see e.g., Prop. 5.6 in [30]). In our situation, this is the case whenever R is continuous at a point in the interior of the feasible set. With the help of (8), we can also derive the necessary and (due to convexity) sufficient first-order optimality conditions Kv ∈ ∂F (w) with the conjugate functional In the following, we present three different algorithms to solve (6) either via a specific splitting of ∂J or via the first-order optimality conditions: the Chambolle-Pock primal dual algorithm, an ADMM method and a semismooth Newton method.

The Chambolle-Pock Method
The primal dual algorithm by Chambolle and Pock [13] is based on a reformulation of the optimality conditions (9) as fixed point equations. If we multiply the second condition by a parameter τ > 0 and add w on both sides yields w + τ Kv ∈ w + τ ∂F (w).

Similarly, the first condition yields
Hence, the solutions v and w of the optimality conditions can be found by repetitively applying the resolvent operators of G and the dual of F , which are given by This combined with an extrapolation step yields the Chambolle-Pock algorithm. It can also be interpreted as a splitting of the subdifferential ∂J into those of F and G, and then applying proximal point steps (i.e., backwards steps) to ∂G and the dual of ∂F .
The first proximal operator in (10) depends on the regularising functional R(·), which is typically convex, so the computation can be done efficiently. For instance, it can be solved by quadratic programming [61] if R(·) is a Sobolev norm and with Chambolle's algorithm [12] if R(·) is the TV penalty.
The second proximal operator in (10) involves the high-dimensional constraint. However, the solution to (10b) is simply the soft-thresholding operator applied to δ −1 w − KY with threshold q n δ. Altogether, the Chambolle-Pock algorithm applied to (3) is given in Algorithm 1.

Algorithm 1: Chambolle-Pock algorithm
Require: δ, τ > 0, θ ∈ [0, 1], k = 0, (v 0 , w 0 ) ∈ X × Y , stopping criterion 1: while stopping criterion not satisfied do 2: To run the Chambolle-Pock algorithm, we need to choose the step sizes δ and τ , and convergence of the algorithm is ensured whenever τ δ ≤ K −2 op . Due to the different difficulty of the two subproblems, it is reasonable to choose δ = τ and, in particular, to choose δ > τ . More precisely, the subproblem with respect to F is more involved because of the large number of constraints, so this requires a much smaller step size, which by the Moreau's identity is equivalent to choosing a much larger δ. We observe in practice that the choices δ = K −1 op √ n and τ = K −1 op / √ n yield good results (see Section 4).
We remark that, in the Chambolle-Pock method applied to this problem, the high-dimensionality appears only in the very simple problem of soft-thresholding. This is a very convenient way of dealing with it and, as shown below, makes the Chambolle-Pock-algorithm superior over the ADMM method.

ADMM Method
The alternating direction methods of multipliers (ADMM) can be seen as a variant of the augmented Lagrangian method (see [65,40] for classical references). To derive it, we introduce a slack variable h ∈ R #Λn and rewrite (3) it into the equivalent problem argmin v∈R n ,w∈R #Λn By the convex duality theory, it is equivalent to find the saddle point of the augmented Lagrangian L λ (v, w; h), that is, where h ∈ R #Λn is the Lagrangian multiplier and λ > 0. As its name suggests, the ADMM algorithm solves this saddle point problem alternately over v, w and h in a Gauß-Seidel fashion (i.e., successive displacement). The details are given in Algorithm 2 below. The usage of the ADMM for the problem (3) was first proposed by Frick et al. [33]. One central difference compared to the Chambolle-Pock algorithm is that the ADMM splitting avoids the usage of F . Instead, it deals with the high-dimensional constraint in an explicit way, which ultimately results in a slower performance.
From the steps performed by the ADMM in Algorithm 2, the first one (Line 2) involves the proximal operator of R and can typically be dealt with with a standard algorithm (see the discussion in the Chambolle-Pock algorithm, Section 3.1). The second step (Line 3) is more challenging, as it involves solving the optimisation problem In other words, we have to find the orthogonal projection of the point This set is the intersection of 2#Λ n half-spaces, and it is known to be non-empty (as it always contains { Y, φ λ | λ ∈ Λ n }). The projection problem (11) can be solved by Dykstra's algorithm [28,4], which converges linearly [19] (see [2] for an efficient stopping rule).
Finally, it follows from Corollary 3.1 in [18] that the ADMM has a linear convergence guarantee for the problem (6).

Semismooth Newton Method
Besides the optimality conditions (9), the problem (3) can also be solved by the so-called Karush-Kuhn-Tucker conditions, which are necessary and sufficient as well. Using the notation of this section, the original problem (3) is equivalent to This is a convex optimisation problem with linear inequality constraints, and if we introduce B : R n → R 2#Λn as v → Bv with B = (K , −K ) , then a vector v is a solution to (12) if and only if there exists a vector of Lagrange multipliers λ ∈ R 2#Λn ≥0 such that The latter two conditions can be reformulated as The immediately visible advantage of this formulation over the original problem is that the inequality constraints have been transformed into equations. Now, suppose for a moment that the functional R is twice differentiable. Then, ∂R(v) is single valued and (13a) is a differentiable equation for v and λ. It seems a natural approach to solve the corresponding system of Equations (13a) and (14) via an analog of Newton's method. On the other hand, even if R was twice differentiable, the max function in (14) is not differentiable in the classical sense. The application of Newton's method is however still possible, as the max function turns out to be semismooth (see Definition 2.5 in [Hintermüller]). For a general operator B, it is however not clear if the overall system (13a) and (14) can also be described as finding the root of a semismooth function. Therefore, one introduces a regularisation parameter β ∈ (0, 1) and replaces (14) by the regularised equation This system is now explicit in λ and yields in combination with (13a) the overall system with the new regularisation parameter δ := (1 − β)/(βc) ∈ (0, ∞). This system can-for twice differentiable R and under appropriate assumptions on B satisfied in our example-now be shown to be semismooth (cf. [41]). Furthermore, for δ 0, the solution of the regularised system (16) converges towards a solution of the original system (13). This follows from the fact that (15) is in fact the Moreau-Yosida regularisation of the second optimality condition (9b). In practice, this limiting process is realised by a path-continuation scheme, sustaining the superlinear convergence behaviour.
For differentiable R, such as the Sobolev seminorm R(v) = |v| H s , the implementation of the semismooth Newton method with path-continuation is now straightforward: the overall system to be solved can now be written as Denote by D k [T δ ] the generalised derivative of the functional at the position u k . Then, we initialise the iteration at δ 0 > 0 and u 0,δ and solve the linear equations iteratively until a stopping criterion is satisfied. Then, the parameter δ is decreased by a fixed factor (say 1/2) and the iteration is started again with u k,δ as the initial guess. This continuation process is stopped until a global error criterion is reached, which is formulated in terms of the number and magnitude of the violated constraints (see Algorithm 3). Note that, for δ > 0, the underlying minimisation problem is strictly convex (this is an immediate property of the Moreau-Yosida regularisation, cf. [14]) and the subdifferential is Lipschitz continuous with Lipschitz constant ∼1/δ, and hence the Newton iteration for (17) will converge for all initial guesses in a ball with radius ∼δ around 0. Hence, whenever the inner iteration in Algorithm 3 does not converge, it should be re-started with a larger value of δ. However, a larger value of δ clearly prolongs the run-time of Algorithm 3, and hence the initial value δ should not be too large.
In the case of a non-differentiable R such as the TV-seminorm, we introduce another regularisation for R, resulting in its Huber regularisation This functional is differentiable for any β > 0, and to compute the limit β 0 we again apply a path-continuation strategy (cf. [14]). In this case, the superlinear convergence behaviour is sustained. However, we remark that the path-following routine for two parameters (δ and β) turns out to be unstable in practice. Instead, it might be desirable to look for the Moreau-Yosida regularisation of both the constraint and R, that is, to regularise the optimality conditions (13) in one step with just one parameter. We leave this idea to be pursued in future work.
Finally, we remark that this superlinear convergence of the semismooth Newton method is a huge theoretical advantage over the Chambolle-Pock and the ADMM algorithms. In practice, however, the situation is more complex, as the convergence speed of Newton's method depends strongly on the initialisation. In Table 1, we provide a summary of the comparison between the three methods.

Numerical study
In this section, we compare the practical performance of the Chambolle-Pock, ADMM and semismooth Newton algorithms in computing MIND and demonstrate the empirical performance of MIND with respect to various choices of dictionaries. We refer to, e.g., [10,32,33,34,37,48,16] for further numerical examinations of MIND. We select in particular Sobolev H 1 and TV seminorms as the regularisation functional, with the former differentiable but the latter non-differentiable and the 50%-quantile of σ MS in (5) as q n for MIND. Concerning measure of image quality, we consider the peak signal-to-noise ratio (PSNR [46]), the structural similarity index measure (SSIM [72]) and the visual information fidelity (VIF [69]) criteria. The implementation in MATLAB for the Chambolle-Pock algorithm, together with code that reproduces all the following numerical examples, is available at https://github.com/housenli/MIND.

Comparison of Three Algorithms
Here, we compare the convergence speed of the Chambolle-Pock, ADMM and semismooth Newton algorithms in solving (3) or equivalently (6). We stress that the relative performance of the three algorithms remains similar over different settings. Thus, as a particular example, we choose the "cameraman" image (256 × 256 pixels) as the function f in model (1) and the indicator functions of dyadic (partition) system of cubes (cf. Definition 2.2 in [37]), which consists of cubes as the dictionary {φ λ |λ ∈ Λ n }. The noise level σ is assumed to be known and is chosen such that the signal-to-noise ratio (SNR) max i=1,...,n |f (x i )|/σ = 30, see Figure 2. Besides the aforementioned image quality measures, we employ objective values R(g k ), (relative) constraint gaps max λ∈Λn | φ λ , g k − Y | − q n /q n , and the distance g k − g ∞ to the limit solution g ∞ to examine the evolution of iterations g k . The limit solution g ∞ is obtained for each algorithm after a large number of iterations, that is 3 × 10 4 outer iterations for the Chambolle-Pock, 3 × 10 3 outer iterations for the ADMM, and the largest possible number of iterations until which the path-continuation scheme remains stable for the semismooth Newton method (in the considered setup, 29 outer iterations in case of H 1 regularisation). As mentioned in Section 3.3, the semismooth Newton method is unstable for the case of TV regularisation, which is thus not reported here. All limit solutions g ∞ are shown in Figures 3 and 4 and are visually quite the same in each figure (while MIND with TV regularisation leads to a slightly better result than that with H 1 regularisation). This indicates that all three algorithms are able to find the global solution of (3) to a desirable accuracy.    To reduce the burden of practitioners on parameter tuning, we set the parameters in the Chambolle-Pock (Algorithm 1) by default as θ = 1, δ = K −1 op √ n and τ = K −1 op / √ n in all settings. We note that the performance of the Chambolle-Pock, which is reported below, can be further improved by fine tuning θ and δ and possibly also by preconditioning [64], for every choice of regularisation functional and dictionary. In contrast, the parameters of the ADMM and the semismooth Newton algorithms are fine tuned towards the best performance for each case. More precisely, for ADMM, we choose λ = 50 in the case of H 1 and λ = 0.5 in the case of TV, and, for the semismooth Newton, we choose δ = 1/3 and ∆δ = 1/3 in the case of H 1 . The performance of three algorithms over time is shown in Figures 5 and 6 for H 1 and TV regularisation, respectively. In both cases, the Chambolle-Pock clearly outperformed the ADMM with respect to all considered criteria. The semismooth Newton algorithm, as is ensured by the theory, exhibited a faster convergence rate than the other two, but this happened only at a very late stage. Moreover, the path-continuation scheme is sensitive to the choice of parameters and makes it difficult to decide the right stopping stage in general. In summary, from a practical perspective, the overall performance of the Chambolle-Pock algorithm was found to be superior compared to the other algorithms. We speculate, however, that a hybrid combination of the Chambolle-Pock and the semismooth Newton algorithm (switching to the semismooth Newton algorithm after a burn in period using the Chambolle Pock algorithm) might lead to further improvement.

Comparison of Different Dictionaries
We next investigate the choice of dictionary {φ λ |λ ∈ Λ n } and its impact on the practical performance. Five different dictionaries are considered. The first consists of the indicator functions of dyadic cubes defined in (18), and the second is composed of the indicator functions The third is the system of tensor wavelets, in particular, the Daubechies' symlets with six vanishing moments. The fourth is the frame of curvelets and the last is that of shearlets, both of which are constructed using default parameters in packages CurveLab (http://www. curvelet.org) and ShearLab3D (http://shearlab.math.lmu.de), respectively. As shown in Figures 3 and 4, different algorithms lead to almost the same final result, we only report the Chambolle-Pock algorithm, due to its empirically fast convergence. As a reflection of versatility of images, we consider three test images of different types. They are a magnetic resonance tomography image of mouse "brain' (cf. Figure 8a) from Radiopaedia.org (rID: 67777), a "cell" image (cf. Figure 9a) taken from [36] and a mouse "BIRN" image (cf. Figure 10a) from cellimagelibrary.org (doi:10.7295/W9CCDB17) (see also Figure 1 in the Introduction as well as Figure 12 in Section 4.3 for the case of natural images). All images are rescaled to 256 × 256 pixels via bicubic interpolation. The SNR is set to 20 over all cases (see Figure 7 for noisy image)s. The results on all test images are shown in Figures 8-10. In general, the dictionary of shearlets performed the best, which is followed by that of curvelets, while the rest (i.e., dyadic cubes, small cubes and wavelets) had similar performance. One exception is the "cell" image, where the dictionary of indicator functions of cubes, in particular of small cubes, was better at detecting tiny white balls (see Figure 9). This is because of the similarity of such features and the cubes at small scales in the dictionary. The average performance over 10 random repetitions measured by aforementioned image quality measures as well as mean integrated squared error is reported in Table 2, which is consistent with the virtual inspections in Figures 8-10. We note that 10 repetitions are sufficient here, since the variation in each repetition is comparably small (cf. standard deviations reported in parenthesis in Table 2). As a remark, we emphasise that the difference in performance due to the choice of dictionaries is generally negligible, but the dictionary of shearlets slightly outperforms the other choices.

Unknown Noise Level
We now investigate the issue of unknown noise level σ. This is illustrated by implementing two different estimators for the noise level into the MIND estimator with total variation regularisation and shearlets dictionary. We stress, however, that the results are to be expected to be similar for other choices of the dictionary and regularisation functional. The MIND optimisation is solved by the Chambolle-Pock algorithm. However, in contrast to previous simulations the 50%-quantile of σ MS in (5) with known σ is now replaced by an estimated noise level σ which then gives a new threshold q n for MIND. We display all results for the well known test image "butterfly" (256 × 256 pixels, see, e.g., [76]). The SNR is set to 20, or equivalently       Noise level Count by [58] by [53] Figure 11: Performance of two noise level estimators by Munk et al. [58] and Liu et al. [53] over 1000 repetitions, with the true σ indicated by a vertical red line, in case of "butterfly" image with SNR = 20. σ = 12.6. The choice of test image and SNR is purely arbitrary, and the results would remain quite the same for other choices. We investigate the influence on MIND for two different noise level estimators. One is a second-order difference-based estimator by Munk et al. [58], which tends to overestimate the noise level, and the other one is a patch based estimator by Liu et al. [53], which often underestimates the noise level (see Figure 11). The comparison results are summarised in Figure 12. Visually, there is little difference between the use of MIND with known noise level and that of the estimated ones. A slight improvement in terms of PSNR is even shown for the use of the estimator that tends to underestimate the noise level. However, a caution should be taken that an underestimated noise level leads to a sacrifice in statistical confidence. For instance, it is no longer possible to guarantee that the true image f lies in the constraint of (3) with given probability 50%. In contrast, when using the true noise level or an overestimated noise level, such a statistical confidence statement is valid. In short, there is nearly no loss of performance of MIND due to the unknown noise level in practice. A heuristic explanation of this is that the mean squared error of the estimated noise level σ scaled as O(n −1 ), whereas for estimating f this rate is slower. Hence, its effect on the overall performance is of minor order.

Conclusions
In this paper, we discuss three different algorithms for the computation of variational multiscale estimators (MIND). Among them, we find the Chambolle-Pock algorithm to perform best in practice and to be equally suited for smooth and non-smooth regularisation functionals R(·), the latter case including the prominent TV regularisation functional. Further, there is no need (c) MIND with true σ, PSNR=28.9 (d) MIND withσ by [58], PSNR = 28 (e) MIND withσ by [53], PSNR = 29.1 Figure 12: Results on "butterfly" with unknown noise level: (c-e) MIND with total variation and shearlets dictionary is shown, with threshold q n the 50%-quantile of σ MS , whereσ is true σ, estimatedσ by Munk et al. [58] and Liu et al. [53], respectively.
to tune parameters that are related to the Chambolle-Pock algorithm, as the default choice already delivers desirable performance. Concerning the MIND estimator itself, we find that, for two-dimensional imaging, in nearly all cases it performs best when employing shearlets as the dictionary. However, if more specific a-priori knowledge about the shape of features is available, improvements can be achieved with a dictionary consisting of basis functions that resemble such features. The MIND estimator can be easily modified accordingly. For example, as we demonstrate, in the case of images with bubble-like structures, the choice of cubes as dictionary leads to very promising results.
In contrast to many other statistical recovery methods relying on subtle regularisation parameter choices which are often difficult to choose and interpret, the only relevant parameter for MIND (besides parameters for the numerical optimisation step) is a significance level α between 0 and 1. This has a clear statistical meaning: it guarantees that the estimator is no rougher than the truth in terms of the regularisation functional R(·) with probability at least 1 − α.
Finally, the unknown noise level can be easily estimated by a difference-based or patch-based estimator, both being easy to compute in O(n) steps.

Bump signals and inverse problems
Even though our findings give a quite satisfactory answer to the question of computability of variational multiscale estimators, some additional questions arise. The first one concerns model (1). We focus on the nonparametric regression model, and in particular on image denoising. Concerning this, we stress that variational multiscale estimators can be applied to other settings, as well. The most obvious one is signal recovery when d = 1, and specific structure of signals can be incorporated as well, such as locally constant signals [35], where the number of constant segments is incorporated into the regularisation functional. Although the resulting functional is no longer convex, the MIND estimator can still be computed efficiently exploiting a dynamic program.
A further extensions concerns general linear inverse problems in any dimension d (see, e.g., [17]). In such noisy inverse problems, the only difference concerns the dictionary {φ λ }, which has to be chosen in a way akin to the wavelet-vaguelette transform [23]. Otherwise, the structure of the estimator (3) remains unchanged, and in particular the optimisation problem to be solved remains the same. We hence expect the findings of this paper to apply to variational multiscale estimators for inverse problems as well.

Different Noise Models
The second possible extension of the present paper concerns the noise model. We consider Gaussian noise with homogeneous (i.e., not depending on the spatial location x) yet unknown variance. We remark that the extension to heterogeneous variance is of interest as in many applications the variance varies with the signal. To this end, the residuals Y − f in (3) have to be standardised by a local estimator of the variance (see, e.g., [5]).
We finally mention that variational multiscale estimators have been proposed for non-Gaussian noise models as well (see, e.g., [34] for Poisson noise). For non-Gaussian data, the constraint | φ λ , g − Y | ≤ q n in (3) is replaced by a constraint on the likelihood-ratio statistic, which offers a route to generalise this further, e.g., to exponential family models (see [47]). In some cases (e.g., Poisson), after a variance stabilising transformation, that constraint can be turned into a linear constraint as in the Gaussian case, so our findings are expected to apply there as well. For general noise, however, deeper study will be required, as nonlinearity in the constraint may lead to nonconvexity of the optimisation problem (3).