Off-The-Grid Variational Sparse Spike Recovery: Methods and Algorithms

Université Côte d’Azur, CNRS, Inria, I3S, Morpheme Project, 06900 Sophia Antipolis, France
Université Côte d’Azur, CNRS, LJAD, 06000 Nice, France
Authors to whom correspondence should be addressed.
J. Imaging 2021, 7(12), 266;
Received: 21 October 2021 / Revised: 24 November 2021 / Accepted: 28 November 2021 / Published: 6 December 2021
(This article belongs to the Special Issue Inverse Problems and Imaging)


Gridless sparse spike reconstruction is a rather new research field with significant results for the super-resolution problem, where one wants to retrieve fine-scale details from a noisy and filtered acquisition. To tackle this problem, we are interested in optimisation under a sparsity prior, i.e., the assumption that the source is composed of a few spikes. Following the seminal work on the generalised LASSO for measures, called the Beurling-LASSO (BLASSO), we review the chief theoretical and numerical breakthroughs of the off-the-grid inverse problem, and we illustrate its usefulness for the super-resolution problem in Single Molecule Localisation Microscopy (SMLM) through new reconstruction metrics and tests on synthetic and real SMLM data performed for this review.

1. Introduction

In this paper, we conduct a comprehensive review of the so-called off-the-grid variational methods for solving the sparse spike recovery problem. We exhibit the main theoretical and numerical results in the literature, underlining the interest of these methods for various domains dealing with inverse problems. As part of this review and our former work on gridless methods, we developed an implementation of the most consistent numerical methods, with a focus on efficiency and computation time. With this implementation, we were able to apply off-the-grid methods to the fluorescence microscopy super-resolution problem. The codes and the computed results are an addition to the off-the-grid literature, and constitute further evidence supporting the relevance of this domain in the inverse problem field.
Loosely speaking, inverse problems consist in the reconstruction of the causes from the consequences. The problem is generally ill-posed, meaning that existence, uniqueness, and stability of the solution(s) are not guaranteed. A case arising in numerous fields, such as image or signal processing, telecommunications, machine learning, and super-resolution, is the sparse spike problem. It consists of the reconstruction of spikes located on a domain $\mathcal{X}$ from an acquisition $y$, with a sparsity prior on the cause; in layman's terms, the source is composed of a few spikes. This includes sources such as stars in astronomy, fractures in seismology, etc. A spike is typically modelled by a Dirac measure $a \delta_x$ with amplitude $a \in \mathbb{C}$ and position $x \in \mathcal{X}$. All the difficulty lies in the estimation of the number $N$ of spikes, of their amplitudes $(a_i)_{i=1}^N$ and their positions $(x_i)_{i=1}^N$. Hence, the goal is to reconstruct the measure $m = \sum_{i=1}^N a_i \delta_{x_i}$ from only a few observations $y$ in a Hilbert space $\mathcal{H}$ (typically $L^2(\mathcal{X})$), linked to $m$ through an operator $\Phi$ accounting for the deterioration of the input (blur, downsizing by the sampling), such that $y \stackrel{\mathrm{def.}}{=} \Phi m + w$, where $w \in \mathcal{H}$ is an additive noise. The reconstruction of the spikes may be off-the-grid, i.e., the positions $(x_i)_{i=1}^N$ are not constrained to a grid and hence are not limited to a finite set of values: this allows interesting new mathematical insights and guarantees for the reconstruction, at the cost of some challenges for the numerical implementation. The general sparse spike problem is encountered in many situations, such as:
  • the compressed sensing domain [1], where one wants to recover an $s$-sparse vector $v_0 \in \mathbb{C}^N$ from $M$ measurements $A v_0$, where $A \in \mathbb{C}^{M \times N}$;
  • machine learning, sketching mixtures, etc. For example, one may want to fit a probability distribution to given data. The point is to estimate the parameters $(a_i) \in \mathbb{R}^N$ and $(x_i) \in \mathcal{X}^N$ of a mixture $\sum_{i=1}^N a_i \varphi(x_i)$ of $N$ elementary distributions described by $\varphi$. For instance, one wants to retrieve the means $\mu_i \in \mathbb{R}$ and standard deviations $\sigma_i \in \mathbb{R}_+$ of a Gaussian mixture; see [2] for more insights on this question;
  • deep learning such as training neural networks with a single hidden layer [3];
  • signal processing, for instance low rank tensor decomposition for Direction of Arrival estimation through sensor array (multiple sampling points);
  • super-resolution, a rather central problem in image processing. Roughly speaking, it consists of the reconstruction of details from an altered input of signal/image. It includes a classic physical operator of acquisition such as Fourier measurements, Laplace transform or Gaussian convolution.
The latter item will be our case of interest in the sparse spike problem for this paper. All the difficulty stems from the degradation in the acquisition process, which in general entails two things: a deterioration by the acquisition system, typically modelled by the Point Spread Function in imagery, which acts as a low-pass filter; and the sensor acquisition, which results in sampling and pollution by noise of different types, characterised by densities such as Gaussian, Poisson, etc. To sum up, we want to reconstruct the correct number of spikes, with correct amplitudes and positions in the continuous setting, from a noisy and filtered discrete acquisition. The problem can be tackled from the theoretical point of view by either the variational approach or Prony's method:
  • Prony's method and its variants, such as MUSIC (MUltiple SIgnal Classification), ESPRIT (Estimation of Signal Parameters by Rotational Invariance Techniques) or Matrix Pencil, recover the signal source from Fourier measurements in a noiseless 1D setting. The method consists of the decomposition of the signal onto a basis of exponentials with different amplitudes, damping factors, frequencies, and phase angles to match the observed data. The results are compelling in the 1D noiseless case and can be extended to a multivariate and noisy context, but these methods still lack versatility since they sometimes cannot be extended to the context of interest. Thus, we will not consider this approach in this paper;
  • the variational approach, which does not impose any particular structure on the acquisition operator, can be adapted to any type of noise and does not need any prior on the number of point sources [4]. The key idea is to solve the inverse problem by finding, among all possible signal sources, the one minimising an objective function called the energy, formulated as a trade-off between a data fidelity term and a regularisation term, here typically enforcing the sparsity prior.
Then, there are two types of variational approaches: the discrete and the off-the-grid. In the discrete setting, one seeks to recover the spikes on a prescribed fine grid, typically with more points than the acquisition image. Indeed, we call coarse grid the grid of the low-resolved acquisition, and fine grid the finer grid (by a so-called super-resolution factor $q \in \mathbb{N}$) of the reconstruction. Thus, it consists of a finite dimensional problem, where the positions of the spikes must lie on a grid $\mathcal{G}$ of $L$ points meshing the domain $\mathcal{X}$. This is a problem of sparse vector reconstruction, and it can be tackled by enforcing sparsity through minimisation of the $\ell_1$ norm of the unknown vector. This is known as the LASSO [5] or the Basis-Pursuit problem, defined as the variational problem with tuning parameter $\lambda > 0$ controlling the trade-off between fidelity to the data and enforcement of the prior:
$$\min_{a \in \mathbb{R}^L} \underbrace{\| y - \Phi_L a \|_{\mathcal{H}}^2}_{\text{data term}} + \lambda \underbrace{\| a \|_1}_{\text{sparsity prior}}$$
where $\Phi_L : \mathbb{R}^L \to \mathcal{H}$ is the acquisition operator taking a vector of size $L$ as input, and $\mathcal{H}$ is a Hilbert space. A grid is useful to epitomise the concept of sparsity in the case of spikes: indeed, sparsity is simply the fact that only a few of the $L$ grid points carry a non-zero value. Moreover, since a computer can only store array and vector quantities, it seems rather fair to work with a finite dimensional problem, even for the theoretical analysis. However, how does one choose the discretisation? A grid with too small a step-size yields numerical instabilities [6], while too large a step-size leads to round-off errors. Moreover, one would like to localise the spikes as precisely as possible without having to rely on a grid: a discretisation of positions necessarily conveys an approximation of the positions. The appropriate mathematical object to get rid of these discretisation drawbacks is a collection of spikes represented by Dirac measures, an element of the space of Radon measures $\mathcal{M}(\mathcal{X})$. The acquisition operator is now $\Phi : \mathcal{M}(\mathcal{X}) \to \mathcal{H}$, and the sparsity is enforced by a norm on $\mathcal{M}(\mathcal{X})$ called the TV-norm. This variational problem is called the BLASSO (for Beurling-LASSO):
$$\min_{m \in \mathcal{M}(\mathcal{X})} \underbrace{\| y - \Phi m \|_{\mathcal{H}}^2}_{\text{data term}} + \lambda \underbrace{\| m \|_{\mathrm{TV}}}_{\text{sparsity prior}}.$$
In this latter setting, the spikes can move continuously on the domain $\mathcal{X}$: a comparison between the discrete and the off-the-grid reconstruction is given in Figure 1. The off-the-grid setting can be seen as the limit of the discrete case with a finer and finer grid [7].
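The discrete LASSO above can be solved by proximal gradient descent (ISTA, i.e., alternating a gradient step on the data term with soft-thresholding). The following is a minimal sketch on a hypothetical toy problem of our own choosing (Gaussian convolution kernel, coarse sampling), not an implementation from the reviewed literature:

```python
import numpy as np

def ista_lasso(Phi, y, lam, n_iter=500):
    """ISTA for min_a ||y - Phi a||^2 + lam * ||a||_1."""
    # Step size from the Lipschitz constant of the gradient of the data term
    L = 2 * np.linalg.norm(Phi, 2) ** 2
    a = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = 2 * Phi.T @ (Phi @ a - y)                     # gradient of ||y - Phi a||^2
        z = a - grad / L                                     # gradient step
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0)  # soft-thresholding (prox of l1)
    return a

# Toy problem: two spikes on a fine grid, blurred and sampled on a coarse grid
rng = np.random.default_rng(0)
grid = np.linspace(0, 1, 64)        # fine reconstruction grid (L points)
samples = np.linspace(0, 1, 32)     # coarse acquisition grid
sigma = 0.05
Phi = np.exp(-(samples[:, None] - grid[None, :]) ** 2 / (2 * sigma ** 2))
a_true = np.zeros(64); a_true[15] = 1.0; a_true[45] = 0.7
y = Phi @ a_true + 0.01 * rng.standard_normal(32)
a_hat = ista_lasso(Phi, y, lam=0.05)
```

The grid sizes, kernel, and regularisation value are illustrative assumptions; the point is only the structure of the iteration.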
This shift from the discrete domain to the continuous setting called off-the-grid or gridless leads to some crucial mathematical insights, in particular a sharp signal-dependent criterion for stable spikes recovery [8], the minimum separation distance (see the next section). Obviously, some difficulties arise also due to the infinite dimension and the lack of algebraic properties of the set of optimisation. The comparison between discrete and gridless settings may be summed up by:
  • the discrete problem is tackled by the LASSO, through the minimisation of a convex function defined on a fine grid, i.e., a convenient finite dimensional Hilbert space $\mathbb{R}^L$. Due to the $\ell_1$ norm, there are some cases where the sparsity is not properly enforced: one can then replace the $\ell_1$ norm by the non-continuous pseudo-norm $\ell_0$, but this yields an NP-hard combinatorial non-convex problem. There exist continuous relaxations of $\ell_0$ such as CEL0 [9] but, due to the non-convexity, the problem remains hard from the theoretical and numerical points of view. Despite the lack of guarantees, there are numerous algorithms to compute the solution of the LASSO or its $\ell_0$ relaxed variant;
  • the off-the-grid problem is treated by the BLASSO, a convex functional defined on $\mathcal{M}(\mathcal{X})$. The convexity is handy from the theoretical point of view, as it leads to some crucial insights on existence/uniqueness/support estimation w.r.t. noise, at the cost of the set of optimisation, namely $\mathcal{M}(\mathcal{X})$: a Banach space (no Hilbertian structure, hence no straightforward proximal algorithm), infinite dimensional and non-reflexive for the strong topology (convergence results are then essentially for the weak-* topology). Despite this lack of algebraic properties, one currently has a wide range of algorithms to tackle this problem, such as root-finding or greedy algorithms.
Gridless reconstruction can then be evaluated through suitable metrics, namely the Flat Metric based on optimal transport of measures. This metric assesses the quality of the reconstruction and can be applied straightforwardly to off-the-grid and even discrete reconstruction outputs.
In the following, we give a review on the key results in the variational off-the-grid domain. The paper is organised in three sections, namely:
  • the variational analysis of the space M X , the properties and the guarantees of reconstruction concerning the sparse spike problem are now quite well-documented [8,10,11,12] and will be recalled in the theoretical Section 2;
  • multiple strategies were considered to numerically tackle BLASSO, the more compelling will be presented and put into context in the numerical Section 3;
  • interesting practical applications and new metrics have been considered for the gridless method, such as the SMLM super-resolution; these results are shown and discussed in Section 4.
At the end of each paragraph, a grey box (beginning either with ‘summary’ or ‘shorthand’) like this one will recall the main results highlighted in the section. Please refer to it for a quick summary.

2. A Theoretical Background for Gridless Spike Recovery

In the following, $\mathcal{X}$ denotes the ambient space where the positions of the spikes live. We suppose $\mathcal{X}$ is a subset of $\mathbb{R}^d$ such that its interior $\mathring{\mathcal{X}}$ is a submanifold of dimension $d \in \mathbb{N}$ [13]. This setting encompasses $\mathcal{X} = \mathbb{R}^d$, the torus $\mathcal{X} = \mathbb{T}^d \stackrel{\mathrm{def.}}{=} \mathbb{R}^d / \mathbb{Z}^d$, any compact set with non-empty interior, etc. The reader is invited to take a look at Table A1 for a reminder of the notations.

2.1. What Is a Measure?

As stated in the section above, the Dirac measure is the proper object to describe a spike not constrained to a finite set of positions. This object is not a function, since one cannot exhibit any integrable equivalence class satisfying the properties of the Dirac (see below). Thus, one should consider the notion of Radon measure, a formal extension of functions. From a distributional standpoint, it is a subset of the distribution space $\mathcal{D}'(\mathcal{X})$, namely the space of linear forms over the space of test functions $\mathcal{D}(\mathcal{X})$, i.e., smooth functions (with continuous derivatives of all orders) compactly supported. This functional approach consists in the definition of a measure as a linear form on some function space, namely:
Definition 1
(Evanescent continuous functions on $\mathcal{X}$). We call $\mathcal{C}_0(\mathcal{X}, Y)$ the set of continuous functions vanishing at infinity (or evanescent), namely all the continuous maps $\psi : \mathcal{X} \to Y$ such that:
$$\forall \varepsilon > 0, \ \exists K \subset \mathcal{X} \text{ compact}, \quad \sup_{x \in \mathcal{X} \setminus K} \| \psi(x) \|_Y \le \varepsilon.$$
When $Y = \mathbb{R}$, we will simply write $\mathcal{C}_0(\mathcal{X})$. Now that we dispose of a suitable space of test functions, we need to make precise the notion of duality at stake in this review.
Definition 2
(Topological dual space). If $E$ is a topological vector space, we denote by $E'$ its topological dual, i.e., the space of all continuous linear forms $\psi : E \to \mathbb{R}$. The pairing between an element $\phi \in E$ and a map $\psi \in E'$ is denoted by the bilinear mapping $\langle \phi, \psi \rangle_{E \times E'} \stackrel{\mathrm{def.}}{=} \psi(\phi)$, called the duality bracket.
This notion allows us to define the Radon measure through duality in the following definition.
Definition 3
(Set of Radon measures). We denote by $\mathcal{M}(\mathcal{X})$ the set of real signed Radon measures on $\mathcal{X}$ of finite mass. It is the topological dual of $\mathcal{C}_0(\mathcal{X})$ endowed with the supremum norm $\| \cdot \|_{\infty, \mathcal{X}}$, by the Riesz–Markov representation theorem. It can also be defined as the topological dual of the space of continuous functions $\mathcal{C}(\mathcal{X})$ if $\mathcal{X}$ is compact [14]. Thus, a Radon measure $m$ is a continuous linear form evaluated on functions $f \in \mathcal{C}_0(\mathcal{X})$, with, for $m \in \mathcal{M}(\mathcal{X})$, the duality bracket denoted by $\langle f, m \rangle_{\mathcal{C}_0(\mathcal{X}) \times \mathcal{M}(\mathcal{X})} = \int_{\mathcal{X}} f \, \mathrm{d}m$.
The term 'signed' refers to the generalisation of the concept of (positive) measure, obtained by allowing the quantity $\langle f, m \rangle_{\mathcal{C}_0(\mathcal{X}) \times \mathcal{M}(\mathcal{X})}$ to be negative. We can define in the same way the space of real non-negative Radon measures $\mathcal{M}_+(\mathcal{X})$, dual of $\mathcal{C}_0(\mathcal{X}, \mathbb{R}_+)$, and the space of complex Radon measures $\mathcal{M}_{\mathbb{C}}(\mathcal{X})$, dual of $\mathcal{C}_0(\mathcal{X}, \mathbb{C})$. Classic examples of Radon measures are:
  • the Lebesgue measure of dimension $d \in \mathbb{N}$;
  • the Dirac measure $\delta_z$ centred at $z \in \mathcal{X}$, also called the $\delta$-peak. For all $f \in \mathcal{C}_0(\mathcal{X})$, one has $\langle f, \delta_z \rangle_{\mathcal{C}_0(\mathcal{X}) \times \mathcal{M}(\mathcal{X})} = f(z)$;
  • discrete measures $m_{a,x} \stackrel{\mathrm{def.}}{=} \sum_{i=1}^N a_i \delta_{x_i}$, where $N \in \mathbb{N}$, $a \in \mathbb{C}^N$, $x \in \mathcal{X}^N$.
Since $\mathcal{C}_0(\mathcal{X})$ is a Banach space, $\mathcal{M}(\mathcal{X})$ is complete [11] when endowed with its dual norm, called the total variation (TV) norm, defined for $m \in \mathcal{M}(\mathcal{X})$ by:
$$|m|(\mathcal{X}) \stackrel{\mathrm{def.}}{=} \sup \left\{ \int_{\mathcal{X}} f \, \mathrm{d}m \ ; \ f \in \mathcal{C}_0(\mathcal{X}), \ \| f \|_{\infty, \mathcal{X}} \le 1 \right\}.$$
The TV norm of a measure is also called its mass. One can note that, in the case of a discrete measure defined as before, $m_{a,x} \stackrel{\mathrm{def.}}{=} \sum_{i=1}^N a_i \delta_{x_i}$, one has $|m_{a,x}|(\mathcal{X}) = \| a \|_1$.
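These objects are easy to manipulate numerically: a discrete measure is just a pair of arrays (amplitudes, positions), the duality bracket reduces to a finite sum, and the TV norm to an $\ell_1$ norm. A small illustrative sketch (the measure chosen here is arbitrary):

```python
import numpy as np

# A discrete measure m = sum_i a_i δ_{x_i} stored as two arrays
a = np.array([1.5, -0.5, 2.0])   # signed amplitudes
x = np.array([0.1, 0.4, 0.8])    # positions in X = [0, 1]

def bracket(f, a, x):
    """Duality bracket <f, m> = ∫ f dm = sum_i a_i f(x_i) for a discrete measure."""
    return np.sum(a * f(x))

def tv_norm(a):
    """|m|(X) = ||a||_1 for a discrete measure: its 'mass'."""
    return np.abs(a).sum()

print(tv_norm(a))                # 4.0 = |1.5| + |-0.5| + |2.0|
print(bracket(np.sin, a, x))     # ∫ sin dm
```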
The interested reader might take a look at the Appendix B.1 for more details on some functional analysis notions and results.
Summary: we model a spike by a Dirac measure, an element of the space of Radon measures $\mathcal{M}(\mathcal{X})$. This space is defined by duality, is endowed with the TV-norm, and is complete. It is, however, infinite dimensional and non-reflexive (see Appendix B.1); this poses additional difficulties to be taken into account in the optimisation.

2.2. Observations

Let us introduce the space where the acquired data live. We will denote by $\mathcal{H}$ this Hilbert space; in the instance of images, $\mathcal{H} = L^2(\mathcal{X})$. Let $m \in \mathcal{M}(\mathcal{X})$ be the source measure; we call acquisition $y \in \mathcal{H}$ the result of the forward/acquisition map $\Phi : \mathcal{M}(\mathcal{X}) \to \mathcal{H}$ evaluated on $m$, with measurement kernel $\varphi : \mathcal{X} \to \mathcal{H}$:
$$y \stackrel{\mathrm{def.}}{=} \Phi m = \int_{\mathcal{X}} \varphi(x) \, \mathrm{d}m(x). \tag{1}$$
The latter integral ought not to be confused with the duality bracket $\langle f, m \rangle_{\mathcal{C}_0(\mathcal{X}) \times \mathcal{M}(\mathcal{X})} = \int_{\mathcal{X}} f(x) \, \mathrm{d}m(x)$ mentioned in Definition 3 above. Indeed, while $f(x) \in \mathbb{R}$ for $x \in \mathcal{X}$, we have $\varphi(x) \in \mathcal{H}$: the integral in (1) is then a Bochner integral [15], i.e., the proper notion to deal with vector-valued maps. It is valid as long as $\varphi$ is continuous and bounded [3,13].
Remark 1.
Measures are objects that generalise functions, at the cost of losing some of their properties. Thus, one cannot define a product of measures (what would be the square of the Dirac?), and one ought to be aware of some caveats concerning functions of measures: these functionals need to be at most (sub)linear in order to be well-defined [16].
In the following, we will impose $\varphi \in \mathcal{C}^2(\mathcal{X}, \mathcal{H})$. Let us also define the adjoint operator of $\Phi : \mathcal{M}(\mathcal{X}) \to \mathcal{H}$ in the weak-* topology, namely the map $\Phi^* : \mathcal{H} \to \mathcal{C}_0(\mathcal{X})$. It is defined for all $x \in \mathcal{X}$ and $p \in \mathcal{H}$ by $\Phi^*(p)(x) = \langle p, \varphi(x) \rangle_{\mathcal{H}}$. The choice of $\varphi$ and $\mathcal{H}$ depends on the physical acquisition process; indeed, generic measurement kernels are:
  • the convolution kernel, with typically $\mathcal{H} = L^2(\mathcal{X})$ and, for $x \in \mathcal{X}$, $\varphi(x) \stackrel{\mathrm{def.}}{=} (s \mapsto \tilde{\varphi}(s - x)) \in \mathcal{H}$, for the PSF $\tilde{\varphi} \in \mathcal{C}_0^2(\mathbb{R}^d)$. One has, for instance, the Gaussian kernel, centred at $c \in \mathcal{X}$ with spread $\sigma > 0$, defined by $s \mapsto \tilde{\varphi}(s - c) \stackrel{\mathrm{def.}}{=} \frac{1}{(2\pi\sigma^2)^{d/2}} e^{-\| s - c \|_2^2 / (2\sigma^2)}$;
  • the Fourier kernel, with cut-off frequency $f_c \in \mathbb{N}$ and $\mathcal{H} = \mathbb{C}^{2 f_c + 1}$, for $x \in \mathcal{X} = \mathbb{T}$ in 1D:
    $$\varphi(x) = \left( e^{-2 i \pi k x} \right)_{|k| \le f_c};$$
  • the Laplace kernel [4], for a non-negative weighting function $\xi \in \mathcal{C}(\mathcal{X})$ specific to the physical acquisition process and $\mathcal{H} = L^2(\mathbb{R}_+)$: for $x \in \mathcal{X}$, $\varphi(x) \stackrel{\mathrm{def.}}{=} (s \mapsto \xi(x) e^{-s x}) \in \mathcal{H}$.
These three kernels correspond to various physical contexts of imagery; hence, they are encountered in multiple acquisition processes, such as Nuclear Magnetic Resonance spectroscopy (Fourier), SMLM super-resolution (convolution), MA-TIRF (Laplace), etc.
From now on, we will use the following notation for the discrete forward map: let $x = (x_1, \dots, x_N)$ and $a \in \mathbb{R}^N$, then $\Phi_x(a) \stackrel{\mathrm{def.}}{=} \sum_{i=1}^N a_i \varphi(x_i)$.
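The discrete forward map $\Phi_x(a) = \sum_i a_i \varphi(x_i)$ can be sketched in a few lines; here we assume a Gaussian convolution kernel sampled on a sensor grid (the grid, spread, and spike values are our own illustrative choices):

```python
import numpy as np

sigma = 0.1
s = np.linspace(0, 1, 100)   # sensor sampling grid: H is discretised as R^100

def phi(x):
    """Gaussian measurement kernel φ(x), an element of H (sampled on s)."""
    return np.exp(-(s - x) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

def forward(a, x):
    """Discrete forward map Φ_x(a) = sum_i a_i φ(x_i)."""
    return sum(ai * phi(xi) for ai, xi in zip(a, x))

# Acquisition of a 3-spike measure, corrupted by additive Gaussian noise w
a0 = np.array([1.0, 0.8, 0.6])
x0 = np.array([0.25, 0.5, 0.8])
y = forward(a0, x0) + 0.02 * np.random.default_rng(1).standard_normal(s.size)
```

Note that `forward` is linear in the amplitudes `a`, as the Bochner integral of Equation (1) requires.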
Shorthand: an acquisition, living in the Hilbert space $\mathcal{H}$, of a measure $m$ is the quantity $\Phi m$. $\Phi$ is the forward operator, completely defined by a kernel $\varphi$ specific to the physical context of imagery.

2.3. An Off-The-Grid Functional: The BLASSO

Let $m_{a_0, x_0} \stackrel{\mathrm{def.}}{=} \sum_{i=1}^N a_{0,i} \delta_{x_{0,i}}$ be the source measure, with amplitudes $a_0 \in \mathbb{R}^N$ and positions $x_0 \in \mathcal{X}^N$; the sparse spike problem is to recover this measure from the acquisition $y \stackrel{\mathrm{def.}}{=} \Phi m_{a_0, x_0} + w$, where $w \in \mathcal{H}$ is an additive noise, typically white Gaussian noise. To tackle this problem, we use the following convex functional [11,17], also called the BLASSO, which stands for Beurling-LASSO:
$$\operatorname*{argmin}_{m \in \mathcal{M}(\mathcal{X})} T_\lambda(m) \stackrel{\mathrm{def.}}{=} \frac{1}{2} \| y - \Phi(m) \|_{\mathcal{H}}^2 + \lambda |m|(\mathcal{X}) \qquad (\mathcal{P}_\lambda(y))$$
with regularisation parameter $\lambda > 0$, which accounts for the trade-off between fidelity and sparsity of the reconstruction. The name BLASSO was coined in the works [17,18], owing to the link between the Generalised Minimal Extrapolation (GME) problem, where one seeks to reconstruct a Radon measure from several observations of its Fourier coefficients, and the work [19] of the Swedish mathematician Beurling, which coincides with the GME in the case of a Fourier forward operator.
The BLASSO in a noiseless setting writes down:
$$\operatorname*{argmin}_{\Phi m = y_0} |m|(\mathcal{X}) \quad \text{with } y_0 = \Phi m_{a_0, x_0}. \qquad (\mathcal{P}_0(y_0))$$
The BLASSO is genuinely linked with its discrete counterpart, the LASSO [8]: one can formally see the BLASSO as the functional limit of the LASSO on a finer and finer grid. If the LASSO problem exhibits existence and uniqueness of the solution, what can one say for its off-the-grid counterpart? First of all, let us observe that:
  • $m \mapsto |m|(\mathcal{X})$ is lower semi-continuous w.r.t. the weak-* convergence (see Appendix B.1 for more insights);
  • $\Phi$ is continuous from the weak-* topology of $\mathcal{M}(\mathcal{X})$ to the weak topology of $\mathcal{H}$.
Thus, one can establish the existence of solutions to (𝒫λ(y)) thanks to convex analysis results, as proved in [11].
Summary: the sparse spike problem is tractable thanks to the convex functional on $\mathcal{M}(\mathcal{X})$ called the BLASSO and denoted by ($\mathcal{P}_\lambda(y)$). With $m \in \mathcal{M}(\mathcal{X})$ as an input, it consists of a data term comparing the observed data versus $\Phi m$, and a regularisation accounting for the sparsity prior through the TV-norm of $m$. Existence of solutions of the BLASSO is known and proved.
The difficulties now lie in the following questions:
  • what are the conditions to recover a sparse measure, within a certain noise regime? Is the minimum unique?
  • under which conditions can we retrieve exactly the number of spikes, the amplitudes, and the positions? When do we have support stability?
  • how can we numerically tackle the infinite dimensional and non-reflexive nature of the space $\mathcal{M}(\mathcal{X})$?
In order to address these points, we need to introduce some notions of convex analysis in the following subsection.

2.4. Dual Problems and Certificates

The BLASSO in Equation ($\mathcal{P}_\lambda(y)$) above is a minimisation problem with a convex functional. Then, we can apply the Ekeland–Temam results ([20] Remark 4.2) and define a dual problem, which writes down for $p \in \mathcal{H}$ (see Appendix B.2 for the proof):
$$\operatorname*{argmax}_{\| \Phi^* p \|_{\infty, \mathcal{X}} \le 1} \langle y, p \rangle_{\mathcal{H}} - \frac{\lambda}{2} \| p \|_{\mathcal{H}}^2 \qquad (\mathcal{D}_\lambda(y))$$
which can be recast as the projection onto a closed convex set [11,18]:
$$\operatorname*{argmin}_{\| \Phi^* p \|_{\infty, \mathcal{X}} \le 1} \left\| \frac{y}{\lambda} - p \right\|_{\mathcal{H}}^2.$$
Fenchel duality between ($\mathcal{P}_\lambda(y)$) and ($\mathcal{D}_\lambda(y)$) is proved in [11]. Therefore, any solution $m_\lambda$ of ($\mathcal{P}_\lambda(y)$) is linked [8] to the unique solution $p_\lambda$ of ($\mathcal{D}_\lambda(y)$) by the extremality conditions:
$$\Phi^* p_\lambda \in \partial |m_\lambda|(\mathcal{X}), \qquad p_\lambda = \frac{1}{\lambda} \left( y - \Phi m_\lambda \right) \tag{2}$$
where $\partial |\cdot|(\mathcal{X})$ is the sub-differential of the TV norm. Indeed, since the total variation is not differentiable (like the $\ell_1$ norm) but is lower semi-continuous w.r.t. the weak-* topology, we use its sub-differential, which for $m \in \mathcal{M}(\mathcal{X})$ identifies to:
$$\partial |m|(\mathcal{X}) = \left\{ \eta \in \mathcal{C}_0(\mathcal{X}) \ ; \ \| \eta \|_{\infty, \mathcal{X}} \le 1 \ \text{and} \ \int_{\mathcal{X}} \eta \, \mathrm{d}m = |m|(\mathcal{X}) \right\}. \tag{3}$$
Elements of this subgradient are called certificates. Thanks to strong duality, one can define peculiar certificates called the dual certificates [10].
Definition 4.
We call $\eta_\lambda \stackrel{\mathrm{def.}}{=} \Phi^* p_\lambda$, where $p_\lambda$ satisfies (2), a dual certificate of $m_\lambda$.
It is a certificate since $\Phi^* p_\lambda \in \partial |m_\lambda|(\mathcal{X})$, and it is called dual because it verifies the second extremality condition in (2): it is thus defined by the dual solution $p_\lambda$. Loosely speaking, a dual certificate $\eta_\lambda$ is associated with a measure $m_\lambda$, and it certifies that the measure $m_\lambda$ is a minimum of the BLASSO. For instance, if there exist solutions of ($\mathcal{P}_\lambda(y)$) of the form $m_\lambda \stackrel{\mathrm{def.}}{=} \sum_{i=1}^N a_i \delta_{x_i}$, the support satisfies [8], for all $1 \le i \le N$: $|\eta_\lambda(x_i)| = 1$.
In the same fashion, one has the link between a solution $m_0$ of the noiseless BLASSO ($\mathcal{P}_0(y_0)$) and its certificates $\eta_0$, which are not unique in general. In the rest of the document, we will then refer to $\eta_0$ as the minimal norm certificate, i.e., the dual certificate $\eta_0$ with minimal supremum norm $\| \eta_0 \|_{\infty, \mathcal{X}}$. It is shown in [8] that this minimal norm certificate $\eta_0$ has important properties, since it somehow drives the stability of the recovered spike locations when the additive noise is small, in particular how close they are to the positions of the true measure $m_{a_0, x_0}$: see Definition 6 in the section below.
Summary: having defined the primal problem in the former section, convexity lets us define the dual problem of the BLASSO. A solution $m_\lambda$ of the BLASSO and the solution $p_\lambda$ of the dual problem are linked through the extremality conditions. The dual solution $p_\lambda$ defines the dual certificate, an element of the subgradient specified by $\eta_\lambda = \Phi^* p_\lambda$: the dual certificate $\eta_\lambda$ certifies that $m_\lambda$ is a solution of the BLASSO. We can then establish more precise conditions on uniqueness/support recovery.

2.5. Support Recovery Guarantees

We will address in this section the first two questions we have laid down, namely existence, uniqueness, and support recovery conditions. A classical tool to establish some recovery properties lies in the notion of the minimum separation distance.
Definition 5
(Minimum separation distance). The minimum separation distance is a characterisation of the support of the discrete measure $m_{a_0, x_0}$ by:
$$\Delta(m_{a_0, x_0}) \stackrel{\mathrm{def.}}{=} \min_{i \ne j} | x_{0,i} - x_{0,j} |.$$
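This quantity is straightforward to compute; a small sketch, with an optional wrap-around variant for the torus $\mathbb{T} = \mathbb{R}/\mathbb{Z}$ (the torus option is our addition for the Fourier setting discussed below):

```python
import numpy as np

def min_separation(x, torus=False):
    """Δ(m) = min_{i≠j} dist(x_i, x_j); on T = R/Z, use the wrap-around distance."""
    x = np.asarray(x, dtype=float)
    d = np.abs(x[:, None] - x[None, :])   # all pairwise distances
    if torus:
        d = np.minimum(d, 1.0 - d)        # geodesic distance on the torus
    np.fill_diagonal(d, np.inf)           # ignore the i == j pairs
    return d.min()

print(min_separation([0.1, 0.4, 0.9]))                # 0.3
print(min_separation([0.05, 0.5, 0.95], torus=True))  # 0.1 (0.05 and 0.95 wrap around)
```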
The reconstruction condition is driven by this minimum separation distance, itself determined by the type of measure (complex, real, real non-negative) and the type of forward operator.
  • if the operator is an acquisition of the Fourier spectrum within $[-f_c, f_c]$ with frequency cut-off $f_c$, for $\mathcal{X} = \mathbb{T}^d$ the $d$-torus in the noiseless setting, it is necessary that $\Delta(m_{a_0, x_0}) \ge \frac{2}{f_c}$ if the source measure is complex [10]. Under a few conditions [21], one can weaken it to $\Delta(m_{a_0, x_0}) \ge \frac{1.26}{f_c}$, and to $\Delta(m_{a_0, x_0}) \ge \frac{1.87}{f_c}$ if the source measure is real [10];
  • regardless of the operator $\Phi$ [17,22], there is no condition on the separation for a real positive source measure in the noiseless setting; however, the stability constant explodes when $\Delta(m_{a_0, x_0}) \to 0$.
These results are important but do not provide a sharp characterisation of the recovery in the presence of noise; however, we expect to find noise in the images we deal with and therefore to be limited by this noise regime. To account for this effect, we need to add some conditions on the ground-truth measure; following the work of [8], we introduce:
Definition 6
(Non-Degenerate Source Condition). The source $m_{a_0, x_0}$ verifies the NDSC (Non-Degenerate Source Condition) if:
  • there exists $\eta \in \operatorname{Im} \Phi^*$ such that $\eta \in \partial |m_{a_0, x_0}|(\mathcal{X})$;
  • $\forall s \in \mathcal{X} \setminus \bigcup_{i=1}^N \{ x_{0,i} \}$, $|\eta_0(s)| < 1$;
  • $\forall i \in \{1, \dots, N\}$, the Hessian matrix $\nabla^2 \eta_0(x_{0,i}) \in \mathbb{R}^{d \times d}$ is invertible.
The first condition amounts to assuming that $m_{a_0, x_0}$ is a solution of ($\mathcal{P}_0(y_0)$) and that there exists a solution to its dual problem. If the two latter conditions are met, we say that $\eta_0$ is non-degenerate. This allows us to write the main result of [8], namely:
Theorem 1
(Noise robustness [8]). Let $\Gamma_{x_0}$ be the operator defined by $\Gamma_{x_0} \stackrel{\mathrm{def.}}{=} \left( \varphi(x_{0,i}), \varphi'(x_{0,i}) \right)_{i=1}^N$. Assume that $\Gamma_{x_0}$ has full column rank and that $m_{a_0, x_0}$ verifies the NDSC. Then, there exist $\alpha > 0$ and $\lambda_0 > 0$ such that, for all $0 \le \lambda \le \lambda_0$ and $w$ such that $\| w \|_{\mathcal{H}} \le \alpha \lambda$, there exist $N$ pairs $(a_{\lambda,i}, x_{\lambda,i})$ such that $m_\lambda \stackrel{\mathrm{def.}}{=} \sum_{i=1}^N a_{\lambda,i} \delta_{x_{\lambda,i}}$ is the unique solution of ($\mathcal{P}_\lambda(y)$), composed of exactly $N$ spikes. In particular, for $\lambda = \frac{1}{\alpha} \| w \|_{\mathcal{H}}$, we have control over the discrepancies:
$$\forall i \in \{1, \dots, N\} : \quad \| x_{\lambda,i} - x_{0,i} \| = \mathcal{O}(\| w \|_{\mathcal{H}}) \quad \text{and} \quad | a_{\lambda,i} - a_{0,i} | = \mathcal{O}(\| w \|_{\mathcal{H}}).$$
Under the Non-Degenerate Source Condition, for $\lambda$ and $\| w \|_{\mathcal{H}}^2 / \lambda$ small enough, one can reconstruct a measure with the same number of spikes as the ground-truth measure $m_{a_0, x_0}$. Furthermore, the reconstructed measure (weak-*) converges to the ground-truth measure when the noise level drops to 0. The authors of [8] also introduce the notion of vanishing derivatives precertificate: the certificate $\eta_0$ is indeed hard to compute from the dual problem of ($\mathcal{P}_0(y_0)$) because of the constraint $\| \eta_0 \|_{\infty, \mathcal{X}} \le 1$, and the precertificate leverages this computation by solving a linear system instead. The interested reader is advised to take a glance at this article, among others [8,22], for these new concepts.
Shorthand: the minimum separation distance criterion is used to assess recovery possibilities in the noiseless setting. In a low noise regime, a theorem states that the source measure $m_{a,x}$ composed of $N$ spikes can be recovered through the BLASSO, with control over the discrepancies (amplitudes/positions) between the reconstructed and the source measures.
We were therefore able to establish some guarantees on the reconstruction of the source measure in the presence of noise. In the next section, we address the third question and discuss strategies to compute the numerical solution of the inverse problem, a difficult task which requires accounting for the difficulties of the optimisation space.

3. Numerical Strategies to Tackle the BLASSO

The BLASSO problem ($\mathcal{P}_\lambda(y)$) is an optimisation over the set of Radon measures, an infinite dimensional and non-reflexive space. We recall that it writes down:
$$\operatorname*{argmin}_{m \in \mathcal{M}(\mathcal{X})} T_\lambda(m) \stackrel{\mathrm{def.}}{=} \frac{1}{2} \| y - \Phi(m) \|_{\mathcal{H}}^2 + \lambda |m|(\mathcal{X}).$$
A naive approach would be to force the measure $m$ to be supported on a fine grid $(p_i)_{i \le L}$, which is equivalent to solving the LASSO problem:
$$\min_{a \in \mathbb{R}^L} \| y - \Phi_L a \|_{\mathcal{H}}^2 + \lambda \| a \|_1$$
with the discrete operator $\Phi_L a \stackrel{\mathrm{def.}}{=} \sum_{i=1}^L a_i \varphi(p_i)$ and $\varphi$ the kernel of the forward operator. This approach has numerous drawbacks: for instance, in the small noise regime and when the step size tends to 0, the solution of the LASSO contains pairs of spikes around each true one [6,7]. Furthermore, refining the step size leads to a worse conditioning of the forward operator, accounting for numerical difficulties. The following classes of algorithms better account for the infinite dimensional nature of $\mathcal{M}(\mathcal{X})$. We present in detail the three methods with the most established results in the literature [13,18,23]. Before describing these methods, let us remark that there also exist some promising avenues, such as the projected gradient descent [24,25]. It relies on an over-parametrised initialisation, i.e., a discrete measure with numerous $\delta$-peaks compared to the ground truth; one then applies a gradient descent on the amplitudes and positions of the over-parametrised measure, combined at each step with a projection onto a set of position constraints to enforce the separation of the spikes. This projection can be replaced by a 'heuristic' which boils down to merging $\delta$-peaks that are not sufficiently separated [25].
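This over-parametrised descent-and-merge idea can be sketched as follows. This is a toy illustration of the principle only (Gaussian kernel, plain gradient steps on the data term, and a naive merging heuristic of our own choosing), not the algorithms of [24,25]:

```python
import numpy as np

sigma = 0.05
s = np.linspace(0, 1, 64)   # sensor sampling grid

def phi(x):
    """Gaussian kernel φ(x_i) sampled on s, one column per spike position."""
    return np.exp(-(s[:, None] - x[None, :]) ** 2 / (2 * sigma ** 2))

def gradient_step(a, x, y, lr=1e-3):
    """One gradient step on (a, x) for the data term 0.5 * ||y - Φ_x(a)||^2."""
    P = phi(x)
    r = P @ a - y                                   # residual Φ_x(a) - y
    ga = P.T @ r                                    # ∂/∂a_i = <φ(x_i), r>
    dP = (s[:, None] - x[None, :]) / sigma ** 2 * P # ∂φ(x_i)/∂x_i
    gx = a * (dP.T @ r)                             # chain rule in the positions
    return a - lr * ga, x - lr * gx

def merge_close(a, x, tol=0.02):
    """Heuristic: merge δ-peaks closer than tol (amplitude-weighted positions)."""
    order = np.argsort(x)
    a, x = a[order], x[order]
    na, nx = [a[0]], [x[0]]
    for ai, xi in zip(a[1:], x[1:]):
        if xi - nx[-1] < tol:
            w = na[-1] + ai
            nx[-1] = (na[-1] * nx[-1] + ai * xi) / w if abs(w) > 1e-12 else 0.5 * (nx[-1] + xi)
            na[-1] = w
        else:
            na.append(ai); nx.append(xi)
    return np.array(na), np.array(nx)

# Over-parametrised initialisation: 20 candidate peaks for a 2-spike ground truth
rng = np.random.default_rng(0)
y = phi(np.array([0.3, 0.7])) @ np.array([1.0, 0.8])
a, x = np.zeros(20), rng.uniform(0, 1, 20)
for _ in range(3000):
    a, x = gradient_step(a, x, y)
a_hat, x_hat = merge_close(a, x)
```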

3.1. Semi-Definite Recasting and Hierarchy

Semi-definite programming was one of the first schemes for solving the BLASSO, in the specific case of a Fourier acquisition on the 1D torus $\mathbb{T}^1$ [10,12,17,18]. Before explaining the SDP scheme in layman's terms, let us first introduce and detail the relevant quantities for this section. Let $d = 1$ be the dimension of the interior of $\mathcal{X}$, and let us study the case where the forward operator, denoted by $F_n$ (and not $\Phi$ for this section), is a measurement of Fourier coefficients up to some cut-off frequency $f_c \in \mathbb{N}$, with $n = 2 f_c + 1$ the number of measurements. We have $F_n : \mathcal{M}_{\mathbb{C}}(\mathcal{X}) \to \mathbb{C}^n$ and, for a discrete measure $m_{a,x} \stackrel{\mathrm{def.}}{=} \sum_{j=1}^N a_j \delta_{x_j}$, it writes down $F_n(m_{a,x}) = \left( \sum_j a_j e^{-2 i \pi k x_j} \right)_{|k| \le f_c}$, and its adjoint operator $F_n^* : \mathbb{C}^n \to \mathcal{C}_0(\mathcal{X}, \mathbb{C})$ is, for $s \in \mathcal{X}$:
$$\forall c \in \mathbb{C}^n, \quad F_n^*(c)(s) = \left\langle c, \left( e^{-2 i \pi k s} \right)_{|k| \le f_c} \right\rangle_{\mathbb{C}^n} = \sum_{|k| \le f_c} c_{f_c + k} \, e^{2 i \pi k s}. \tag{4}$$
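These two operators are easy to write down explicitly; a minimal sketch with an arbitrary cut-off frequency (the sign convention, forward $e^{-2i\pi k x}$ / adjoint $e^{2i\pi k s}$, is our assumption and should be matched to the implementation at hand):

```python
import numpy as np

fc = 5
n = 2 * fc + 1
k = np.arange(-fc, fc + 1)   # measured frequencies -fc, ..., fc

def F(a, x):
    """Forward operator F_n(m_{a,x}) = (sum_j a_j e^{-2iπ k x_j})_{|k| <= fc}."""
    return np.exp(-2j * np.pi * np.outer(k, x)) @ a

def F_adj(c, s):
    """Adjoint (F_n^* c)(s) = sum_{|k| <= fc} c_{fc+k} e^{2iπ k s}, a trig. polynomial."""
    return np.exp(2j * np.pi * np.outer(np.atleast_1d(s), k)) @ c

# Sanity check: a unit Dirac at x = 0 has all Fourier coefficients equal to 1
print(F(np.array([1.0]), np.array([0.0])))
```

One can also check the adjoint relation $\langle F_n m, c \rangle_{\mathbb{C}^n} = \langle m, F_n^* c \rangle$ numerically, which reduces to $\sum_j a_j \overline{F_n^*(c)(x_j)}$ for a discrete measure.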
This method is based on semi-definite programming (SDP) for efficiently computing the minima of the BLASSO. It stems from the Hilbert approach [26], in which one globally decomposes the objective function into simple pieces, the atoms. The solution of the dual problem of ($\mathcal{P}_\lambda(y)$), denoted here ($\mathcal{D}'_\lambda(y)$), is a vector $p$ linked to a certificate through the trigonometric polynomial $F_n^* p$: the idea is then to reconstruct the dual certificate as a linear sum of trigonometric polynomials [18], which is enough to find the measure associated with this reconstructed certificate. This associated measure is a solution of the BLASSO. The dual problem, on the other hand, is tractable thanks to a semi-definite programming approach.
Since F_n* p is a trigonometric polynomial for any p ∈ C^n by the definition above, one can recast the constraint ‖F_n* p‖_{∞,X} ≤ 1 (imposed by the definition of a certificate, see Equation (3)) and rewrite it as the intersection of the cone of positive semi-definite matrices {A : A ⪰ 0} with an affine hyperplane [10,27]. Hence, the Fenchel dual problem of (𝒫λ(y)) for the Fourier forward operator F_n:
( 𝒟 λ ( y ) )   max_{p ∈ C^n}  Re⟨y, p⟩ − (λ/2)‖p‖²   constrained by  ‖F_n* p‖_{∞,X} ≤ 1
with Hermitian product ⟨·,·⟩, has the equivalent formulation [27]
( 𝒟 λ ( y ) )   max_{p ∈ C^n, Q ∈ C^{n×n}}  Re⟨y, p⟩ − (λ/2)‖p‖²_H   constrained by  [[Q, p], [p*, 1]] ⪰ 0  and  Σ_{k=1}^{n−j} Q_{k,k+j} = δ_{0,j}  for j ∈ ⟦0, n−1⟧
with Q a Hermitian matrix and p a vector of coefficients (accounting for the dual variable p), and δ_{0,j} the Kronecker delta, equal to 1 if j = 0 and 0 otherwise. The choice of the regulariser λ is crucial: if chosen too high, it will yield a solution with fewer spikes; if chosen too low, it will recover a solution with spurious spikes. This finite dimensional formulation can now be tackled with classic semi-definite programming solvers, as done by the authors of [10], who proposed an interior point method, given in Algorithm 1. The first step reaches a solution p, allowing the definition of the polynomial p_{2n−2}(e^{2iπt}) := 1 − |F_n* p|²(t), where F_n* is defined in Equation (4).
Algorithm 1. Interior Point Method applied to the BLASSO.
Solve the SDP:  max_{p ∈ C^n, Q ∈ C^{n×n}}  Re⟨y, p⟩ − (λ/2)‖p‖²_H
 subject to  [[Q, p], [p*, 1]] ⪰ 0  and  Σ_{i=1}^{n−j} Q_{i,i+j} = δ_{0,j}  for j = 0, …, n−1.
Reconstruct the support X ^ of m by locating the roots of p 2 n 2 on the unit circle (e.g., by computing the eigenvalues of its companion matrix).
Solve Σ_{t ∈ X̂} a_t e^{−2iπkt} = y_k, |k| ≤ f_c, to recover the amplitudes a.
One can note the link between the dual and the primal problem, i.e., that p, the solution of ( 𝒟 λ ( y ) ), entails the location of the spikes: as F_n* p attains its extremal points on the support of m, since it is the certificate of a discrete measure, the polynomial p_{2n−2}(e^{2iπt}) = 1 − |F_n* p|²(t) has all its roots on the unit circle, and these roots are the support of the target measure [10]. Thus, the strategy is to solve the dual problem and then to use a root-finding algorithm on the certificate F_n* p associated with the dual solution, hence reconstructing the support of the measure and then the measure itself (after a last amplitude recovery step). We present an example of the reconstruction of three Dirac measures on the 1D torus T¹ through the observed noisy data y and the roots of the polynomial p_{2n−2}(e^{2iπt}) in Figure 2.
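The support-extraction and amplitude-recovery steps can be sketched as follows (the dual vector p would come from the SDP above; here we only exercise the root-finding machinery on a toy vector for which 1 − |P|²(t) = −cos(2πt), vanishing at t = 1/4 and 3/4; the helper names are ours).

```python
import numpy as np

def support_from_dual(p, circle_tol=1e-4):
    """Roots of p_{2n-2}(z) = z^{n-1} (1 - |P(z)|^2), P(z) = sum_k p_{fc+k} z^k,
    kept only when they lie on the unit circle; returns positions t in [0, 1)."""
    n = len(p)
    auto = np.convolve(p, np.conj(p)[::-1])   # |P|^2 as Laurent coefficients, z^{-(n-1)}..z^{n-1}
    q = -auto
    q[n - 1] += 1.0                           # constant term of 1 - |P|^2
    roots = np.roots(q[::-1])                 # np.roots expects highest power first
    on_circle = roots[np.abs(np.abs(roots) - 1) < circle_tol]
    t = np.sort(np.mod(np.angle(on_circle) / (2*np.pi), 1.0))
    kept = [t[0]] if t.size else []           # deduplicate (nearly) repeated roots
    for ti in t[1:]:
        if ti - kept[-1] > 1e-6:
            kept.append(ti)
    return np.array(kept)

def amplitudes(x_hat, y, fc):
    """Least-squares solve of sum_t a_t exp(-2i pi k t) = y_k on the support x_hat."""
    ks = np.arange(-fc, fc + 1)
    E = np.exp(-2j*np.pi*np.outer(ks, x_hat))
    return np.linalg.lstsq(E, y, rcond=None)[0]
```

With a genuine dual certificate, the roots on the unit circle are double roots located exactly at the spikes, hence the deduplication step.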
This strategy is only suitable for d = 1 . For the multi-variate case, one needs to make use of a so-called Lasserre Hierarchy [28]. Consider the semi-definite relaxation of order m with m > n = 2 f c + 1 :
max_{p ∈ C^n, Q ∈ C^{n×n}}  Re⟨y, p⟩  constrained by  [[Q, p̃], [p̃*, 1]] ⪰ 0,  where p̃_k = c_k if k ∈ ⟦−f_c, f_c⟧^d and 0 otherwise,  and  Tr(Θ_k Q) = δ_{0,k}  with k ∈ ⟦−m, m⟧
with Θ_k = θ_{k_d} ⊗ ⋯ ⊗ θ_{k_1}, where θ_{k_j} is the m × m elementary Toeplitz matrix whose entries are 1 on its k_j-th diagonal and 0 elsewhere, and ⊗ the Kronecker product. In a nutshell, Lasserre’s hierarchies give a sequence of nested outer SDP approximations of the cone of moments of non-negative measures. This method has been successfully applied to super-resolution in [12]. Some reconstructions in the 1D setting with a Fourier kernel are given in Figure 3, and the interested reader may find a more in-depth tutorial in the Numerical Tours on ‘Sparse spikes measures’, together with the code used to compute the following figure (, consulted on 30 November 2021).
These methods are proven to be asymptotically exact [12]. Nonetheless, it is not known whether the algorithm has finite convergence in general: one does not know when to stop the hierarchy to obtain a solution of the BLASSO [3]. This stems from the fact that non-negative trigonometric polynomials in dimension d > 1 are not necessarily sums of squares. Moreover, these SDP-based approaches are rather limited to a certain class of measurement maps Φ, typically the Fourier forward operator or at least filters with compact Fourier support. With the two following classes of algorithms, one can better exploit the continuous setting and get rid of the discretisation drawback.
Summary (1st algorithm): the scheme boils down to the resolution of the dual problem, the reconstruction of the measure’s support thanks to the certificate associated with the dual solution, and finally the solving of a linear problem to yield the corresponding estimated amplitudes. This strategy can be extended to a multivariate context, but, still, it is quite restrictive on the forward operator and it does not have finite convergence in general.

3.2. Greedy Algorithm: The Conditional Gradient

The conditional gradient method, also called the Frank–Wolfe (FW) algorithm [29,30], aims at solving min m C f ( m ) for C a weakly compact convex set of a topological vector space and f a convex and differentiable function (the differential is then denoted by df). It relies on the iterative minimisation of a linearised version of f. Hence, the interest of this algorithm lies in the fact that it uses only the directional derivatives of f and that it does not require any Hilbertian structure, contrary to a classic proximal algorithm formulated in terms of Euclidean distance. We recall the definition of the conditional gradient in the pseudocode Algorithm 2 for the general problem of minimising f:
Algorithm 2. Frank–Wolfe.
1: for k = 0, 1, 2, … do
2:  s_k ∈ argmin_{s ∈ C} f(m_k) + df(m_k)(s − m_k)
3:  if df(m_k)(s_k − m_k) = 0 then
4:   output m∗ := m_k, which is optimal
5:  else
6:   choose a step size γ_k ∈ [0, 1], typically γ_k = 2/(k + 2)
7:   m_{k+1} := m_k + γ_k (s_k − m_k)
8:  end if
9: end for
One can make the following remarks:
  • the compactness assumption on C ensures that the argmin in step 2 is non-empty;
  • in line 7, we can replace m_{k+1} by any element m̂ ∈ C such that f(m̂) ≤ f(m_{k+1}) without changing the convergence properties of the algorithm.
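For intuition, here is a minimal finite-dimensional sketch of the conditional gradient method (our own toy setting: the TV-ball of measures is replaced by an ℓ1-ball of R^d, whose extreme points, the signed coordinate vectors, play the role of the δ-peaks).

```python
import numpy as np

def frank_wolfe(A, y, M, iters=2000):
    """Minimise f(x) = ||A x - y||^2 over C = {||x||_1 <= M}.
    The argmin of the linearisation over C is attained at a vertex
    +/- M e_i: the analogue of adding one 'spike' per iteration."""
    x = np.zeros(A.shape[1])
    for k in range(iters):
        grad = 2 * A.T @ (A @ x - y)        # df(x_k)
        i = int(np.argmax(np.abs(grad)))    # best extreme point of C
        s = np.zeros_like(x)
        s[i] = -M * np.sign(grad[i])
        gamma = 2.0 / (k + 2)               # standard step size
        x = (1 - gamma) * x + gamma * s     # the convex update of line 7
    return x
```

Replacing this last convex update by any better candidate m̂, as allowed by the remark above, is precisely what the sliding variant of this section exploits.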
There are, however, two obstacles that prevent us from applying this algorithm straightforwardly to the BLASSO: T_λ is not differentiable, and the optimisation set M(X) is not bounded. It is thus necessary to perform an epigraphical lift [13,31] to reach a differentiable functional that shares the same minimising measures as T_λ:
( 𝒫 λ ( y ) )   min_{(r,m) ∈ C}  T̃_λ(m, r) := (1/2)‖y − Φ(m)‖²_H + λr
with the bounded set C := {(r, m) ∈ R₊ × M(X) ; |m|(X) ≤ r ≤ M} and M := ‖y‖²/(2λ). Even though C is not weakly compact, it is compact for the weak-* topology, and the hypotheses of Algorithm 2 are still satisfied. The Frank–Wolfe algorithm is then well-defined for the energy T̃_λ, differentiable in the Fréchet sense on the Banach space R × M(X). Its differential writes down:
dT̃_λ : (r, m) ↦ ∫_X Φ*(Φm − y) dm + λr.
Finally, one has that m∗ is a minimum of T_λ iff (|m∗|(X), m∗) minimises ( 𝒫 λ ( y ) ), and T_λ(m∗) = T̃_λ(|m∗|(X), m∗). In the rest of the document, we will omit the r-part, and we will refer to the quantity (|m∗|(X), m∗) simply by m∗.
We noted above that the update m_{k+1} in line 7 can be replaced by any value m̂ improving the objective function; this remark is rather interesting as it can drastically improve the convergence properties of the algorithm [11,32]. Hence, an interesting improvement to the Frank–Wolfe algorithm relies on changing the final update step into a non-convex optimisation on both the amplitudes and the positions of the reconstructed δ-peaks, in a simultaneous fashion. This modification is presented in Algorithm 3.
Algorithm 3. Sliding Frank–Wolfe.
1: for k = 0, 1, 2, … do
2:  let m^{[k]} = Σ_{i=1}^{N^{[k]}} a_i^{[k]} δ_{x_i^{[k]}} and η^{[k]} := Φ*(y − Φ m^{[k]})/λ
3:  find x∗^{[k]} ∈ argmax_{x ∈ X} |η^{[k]}(x)|
4:  if |η^{[k]}(x∗^{[k]})| ≤ 1 then output m^{[k]}, a solution of the BLASSO
5:  else solve the LASSO on the amplitudes a ∈ R^{N^{[k]}+1} for the fixed positions (x^{[k]}, x∗^{[k]})
6:  perform a non-convex local descent (the sliding step) on both amplitudes and positions, initialised at the result of line 5, yielding m^{[k+1]}
7:  remove the δ-peaks with zero amplitude
8: end for
This tweak yields a theoretical convergence to the unique solution of the BLASSO in a finite number of iterations, empirically an N-step convergence. This version is called the Sliding Frank–Wolfe algorithm [13], as the spike positions slide over the continuous domain X. The authors also proved in the same paper that the generated measure sequence m^{[k]} converges towards the minimum for the weak-* topology.
A reconstruction by Sliding Frank–Wolfe for the same Fourier operator, ground-truth spikes, and acquisition as in the previous section is plotted in Figure 4. Contrary to SDP in Figure 3, no spurious spike is reconstructed. As in the SDP method, the choice of the regulariser λ is crucial: if chosen too high, it will yield a solution with fewer spikes than needed; if set too low, it will recover a solution with spurious spikes. We set λ = 1 for the 1D Fourier example, as in the former SDP section.
Line 3 in Algorithm 3 is typically solved by a grid search, the convex step in line 5 can use a FISTA solver [33], and the non-convex step in line 6 can be tackled by a modified Broyden–Fletcher–Goldfarb–Shanno method (L-BFGS-B) implementation [34]. Reconstructions in the 2D setting with a convolution kernel, similar to the SMLM conditions, are presented in Figure 5. Since luminosity is always a non-negative quantity, one can restrict [4] the SFW to build a positive measure of the cone M₊(X), by changing:
  • the stopping condition to η^{[k]}(x∗^{[k]}) ≤ 1;
  • the LASSO step is solved for a R + N [ k ] + 1 ;
  • the non-convex step is solved on R + N [ k ] + 1 × X N [ k ] + 1 .
Hence, this modified algorithm offers a good trade-off between precision and theoretical guarantees. However, it suffers from a high computational load per iteration, making it slow in practice. The algorithm of the next section is a promising alternative with cheaper iterations, while still taking advantage of the continuous setting.
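To fix ideas, one SFW iteration (positive case) might look as follows in 1D; this is our own simplified sketch: the sampled Gaussian kernel, its width, the search grid for line 3, and the merging of the convex step (line 5) into the joint L-BFGS-B descent (line 6) are all simplifying assumptions, not the exact implementation of [13].

```python
import numpy as np
from scipy.optimize import minimize

grid = np.linspace(0, 1, 64)               # acquisition grid, H = R^64
sigma = 0.05
def psf(t):                                # sampled Gaussian kernel phi(t)
    return np.exp(-(grid - t)**2 / (2*sigma**2))

def Phi(a, x):                             # forward operator applied to m_{a,x}
    out = np.zeros_like(grid)
    for ai, xi in zip(a, x):
        out += ai * psf(xi)
    return out

def sfw_step(a, x, y, lam):
    # line 3: grid search maximising the certificate eta = Phi^*(y - Phi m)/lam
    fine = np.linspace(0, 1, 512)
    resid = y - Phi(a, x)
    eta = np.array([psf(t) @ resid for t in fine]) / lam
    x = np.append(x, fine[np.argmax(eta)])
    a = np.append(a, 0.0)                  # new spike starts with zero amplitude
    # lines 5-6 (merged here): descent on amplitudes and positions via L-BFGS-B
    N = len(x)
    obj = lambda u: 0.5*np.sum((Phi(u[:N], u[N:]) - y)**2) + lam*np.sum(u[:N])
    res = minimize(obj, np.r_[a, x], method="L-BFGS-B",
                   bounds=[(0, None)]*N + [(0, 1)]*N)
    return res.x[:N], res.x[N:]
```

The bounds implement the positivity restriction of the bullet list above (amplitudes in R₊, positions in X = [0, 1]).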
Shorthand (2nd algorithm): the conditional gradient method is a greedy algorithm consisting in the iterative minimisation of a linearised version of the convex objective function. This algorithm can be applied to any forward operator, without restriction on the space X. Up to a modification (SFW), the Frank–Wolfe algorithm reaches finite convergence, empirically an N-step convergence for a source measure with N spikes. The iterations, however, are computationally costly, yielding long computation times.

3.3. Optimal Transport Based Algorithm: The Particle Gradient Descent

All the following results are proven for a domain X without boundary, e.g., the d-dimensional torus T^d. The case described in the former sections, where X is any compact subset of R^d, is included in this new setting, since any compact X can be periodised to yield a domain without boundary. The forward operator kernel φ : X → H should also be differentiable in the Fréchet sense. The least squares term in the BLASSO is denoted by the more general data term R : H → R₊; the functional T_λ of the BLASSO will now be restricted to M₊(X) and denoted J; its Fréchet differential at a point ν ∈ M₊(X) is denoted J′_ν:
J(ν) = ‖y − Φν‖²_H + λ|ν|(X),
J′_ν(x) = ⟨φ(x), ∇R⟩_H + λ  for all x ∈ X
A comprehensive guide on its computation is given in Appendix B.4. In the following, we describe the setting for non-negative measures of M₊(X), but it can be extended in a straightforward fashion [23] to signed measures of M(X), by performing the method on the positive and then the negative part of the signed measure (see the Jordan decomposition in Appendix B.1). Figure 6 sums up the chief quantities and relations introduced in this section; the reader is advised to refer to it whenever a global view on the optimal transport setting is needed.
Sparse optimisation on measures through optimal transport [3,23] relies on the approximation of the ground-truth positive measure m a 0 , x 0 by a ‘system of N N particles’, i.e., an element of the space Ω N = def . ( R + × X ) N . The point is then to estimate the ground-truth measure by a gradient-based optimisation on the objective function:
F_N((r_1, x_1), …, (r_N, x_N)) := ‖y − (1/N) Σ_{i=1}^N r_i² φ(x_i)‖²_H + (λ/N) Σ_{i=1}^N r_i²
where (r_i, x_i) belongs to the lifted space Ω := R₊ × X endowed with a metric. Hence, the hope is that the gradient descent on F_N converges to the amplitudes and the positions of the ground-truth measure, despite the non-convexity of functional (7). The author of [23] proposes the definition of a suitable metric for the gradient of F_N, which enables the separation of the variables in the gradient descent update. Let α > 0 and β > 0 be two parameters; for any (r, θ) ∈ Ω, we define the Riemannian inner product on Ω, called the cone metric, by, for (δr_1, δr_2) ∈ R₊² and (δθ_1, δθ_2) ∈ X²:
⟨(δr_1, δθ_1), (δr_2, δθ_2)⟩_{(r,θ)} := (δr_1 δr_2)/α + (r²/β) ⟨δθ_1, δθ_2⟩_θ.
We denote by ⟨·,·⟩_θ the metric on the manifold X at the point θ. The gradient of the functional F_N w.r.t. the cone metric then writes down, for all i ∈ ⟦1, N⟧ [2,23]:
∇_{r_i} F_N = 2α r_i λ (η_λ(x_i) − 1),   ∇_{x_i} F_N = βλ ∇η_λ(x_i),   for ν := Σ_{i=1}^N r_i² δ_{x_i},  η_λ := J′_ν / λ.
See Appendix B.4 for more details on this computation. We now present the theoretical results on the particle gradient descent, which corresponds to the blue dashed lines in Figure 6. The reader is invited to refer to this figure any time he needs to get a hold on the broader picture.

3.3.1. Theoretical Results

The main idea of these papers [3,23] boils down to the following observation: the minimisation of functional (7) is a particular case of a more general problem, formulated in terms of measures on the lifted space Ω. More precisely, the space is P₂(Ω), the subset of M(Ω) of probability measures with finite second moments, endowed with the 2-Wasserstein metric, i.e., the optimal transport distance: see Appendix B.5 for more details. Hence, the lift of the unknown m ∈ M₊(X) to μ ∈ P₂(Ω) enables the removal of the asymmetry, for discrete measures, between position x ∈ X and amplitude a ∈ R₊, by lifting a δ_x to δ_{(a,x)}. The lifted functional now writes down for a parameter λ > 0:
μ P 2 ( Ω ) , F ( μ ) = def . y Φ ˜ μ H 2 + λ V ˜ ( μ )
where Φ̃μ := ∫_Ω ϕ(a, x) dμ(a, x) for ϕ(a, x) := a φ(x), and Ṽ is the TV-norm on the spatial component of the measure μ. The functional is non-convex; its Fréchet differential is denoted F′, and for u ∈ Ω:
F′(μ)(u) := ⟨∇R̃(μ), ϕ(u)⟩_H + λ
with R̃(μ) := ‖y − ∫_Ω ϕ(a, x) dμ(a, x)‖²_H. Then, a discrete measure μ_N := (1/N) Σ_{i≤N} δ_{(a_i, x_i)} of P₂(Ω) can also be seen as an element of Ω^N through its components (a_i, x_i). This allows the authors of [3,23] to perform a precise characterisation of the source recovery conditions, through measures and the tools of optimal transport such as the gradient flow (see below).
Then, one may run a gradient descent on the amplitudes and positions (a_i, x_i) ∈ (R₊ × X)^N of the measure μ_N, in order to exploit the differentiability of the kernel φ. Note that the measure μ_N is over-parametrised, i.e., its number of δ-peaks is large compared to the number of spikes of the ground-truth measure: thus, the particles, namely the δ-peaks of the space Ω, cover the domain X through their spatial components; see the example where μ_N is plotted in red dots.
Before giving the main results, we need to clarify the generalised notion of gradient descent for functionals over measures, called the gradient flow [35,36] in optimal transport theory; it is the main ingredient of the particle gradient descent. Letting F : R^d → R be an objective function with suitable regularity, a gradient flow describes the evolution of a curve x(t) such that its starting point at t = 0 is x_0 ∈ R^d, evolving at any time t in the direction that decreases the function F the most [36]:
x′(t) = −∇F(x(t))  for t > 0,   x(0) = x_0.
The interest of gradient flow is its extension to spaces X with no differentiable structure. In the differentiable case, one can consider the discretisation of the gradient flow i.e., the sequence defined for a step-size τ > 0 , k N :
x_{k+1}^τ ∈ argmin_{x ∈ X}  F(x) + |x − x_k^τ|² / (2τ).
It is the implicit Euler scheme for the equation (x^τ)′ = −∇F(x^τ), or the weaker (x^τ)′ ∈ −∂F(x^τ) if F is convex and non-smooth. The gradient flow is then the limit (under certain hypotheses) of the sequence (x_k^τ)_{k≥0} as τ → 0, for a starting point x_0 ∈ X. The gradient flow can be extended to metric spaces: indeed, for a metric space (X, d) and a lower semi-continuous map F : X → R, one can define the discretisation of the gradient flow by the sequence
x_{k+1} ∈ argmin_{x ∈ X}  F(x) + d(x, x_k)² / (2τ).
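A minimal numerical illustration of this minimising-movement scheme (a toy objective of our own, with scipy's scalar minimiser playing the argmin): for F(x) = (x − 1)², the iterates track the exact gradient flow x(t) = 1 + (x₀ − 1)e^{−2t}.

```python
import numpy as np
from scipy.optimize import minimize_scalar

F = lambda u: (u - 1.0)**2          # toy objective; its gradient flow is x' = -2(x - 1)
tau, x0 = 0.01, 5.0                 # step size and starting point
x, t = x0, 0.0
for _ in range(100):                # x_{k+1} = argmin_u F(u) + |u - x_k|^2 / (2 tau)
    xk = x
    x = minimize_scalar(lambda u: F(u) + (u - xk)**2/(2*tau)).x
    t += tau
exact = 1.0 + (x0 - 1.0)*np.exp(-2*t)   # exact gradient flow at time t
```

As τ → 0 the scheme converges to the continuous flow; here, with τ = 0.01, the discrepancy at t = 1 is already of order 10⁻².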
In the case of the metric space of probability measures i.e., the measures with unitary mass, the limit τ 0 of the scheme exists and converges to the unique gradient flow starting at x 0 element of the metric space. A typical case is the space of probabilities with finite second moments P 2 ( Ω ) , endowed with 2-Wasserstein metric, i.e., the optimal transport distance (see Appendix B.5): a gradient flow in this space P 2 ( Ω ) is a curve t μ t called a Wasserstein gradient flow starting at μ 0 P 2 ( Ω ) , for all t R + , one has μ t P 2 ( Ω ) , obeying the partial differential equation in the sense of distributions:
∂_t μ_t = div(μ_t ∇F′(μ_t)).
Recall that div(m) = Σ_{i=1}^d ∂m_i/∂x_i; the derivatives ought to be understood in the distributional sense. This equation ensures the conservation of mass, namely, at each time t > 0, one has |μ_t|(Ω) = |μ_0|(Ω). Hence, despite the lack of differentiable structure of P₂(Ω), which forbids the straightforward application of a classical gradient-based algorithm, one can perform an optimisation on this space through the gradient flow, reaching a minimum of F by discretising (11).
The interesting case of a gradient flow in P₂(Ω) is the flow starting at μ_{N,0} := (1/N) Σ_{i=1}^N δ_{(a_i⁰, x_i⁰)}, uniquely defined by Equation (11), which writes down for all t ∈ R₊: μ_{N,t} := (1/N) Σ_{i=1}^N δ_{(a_i(t), x_i(t))}, where a_i : R₊ → R₊ and x_i : R₊ → X are continuous maps. This path (μ_{N,t})_{t≥0} is a Wasserstein gradient flow, and uses N Dirac measures over Ω to optimise the objective function F in (9). When the number of particles N goes to infinity and if μ_{N,0} converges to some μ_0 ∈ P₂(Ω), the gradient flow (μ_{N,t})_{t≥0} converges to the unique Wasserstein gradient flow of F starting from μ_0, described by the time-dependent density (μ_t)_{t≥0} valued in P₂(Ω) and obeying the partial differential Equation (11) above.
For these non-convex gradient flows, the authors of [3] give a consistent result for gradient based optimisation methods: under a certain hypothesis, the gradient flow ( μ N , t ) t 0 converges to global minima in the over-parametrization limit i.e., for N + . It relies on two important assumptions that prevent the optimisation from being blocked in non-optimal points:
  • homogeneity (A function f between vector spaces is positively p-homogeneous if, for λ > 0 and any argument x, one has f(λx) = λ^p f(x).) of ϕ, in order to select the correct magnitude for each feature, or at least partial 1-homogeneity (i.e., boundedness of φ in [3]);
  • diversity in the initialisation of parameters, in order to explore all combinations of features. Too few or too close particles will not reach all source peaks and will only yield local minima.
We can then introduce the fundamental result on the many-particle limit [3], i.e., the mean-field limit of the gradient flows (μ_{N,t})_{t≥0}, despite the lack of convexity of these gradient flows:
Theorem 2
(Global convergence—informal). If the initialisation μ_{N,0} is such that the support of μ_0 := lim_{N→+∞} μ_{N,0} separates (The support of a measure m is the complement of the largest open set on which m vanishes. In an ambient space X, we say that a set C separates the sets A and B if any continuous path in X with endpoints in A and B intersects C.) {0} × X from {+∞} × X, then the gradient flow μ_t weakly-* (see Appendix B.1) converges in P₂(Ω) to a global minimum of F, and we also have:
lim_{N,t→+∞} F(μ_{N,t}) = min_{m ∈ M₊(X)} J(m).
The limits can be interchanged; the interested reader might take a look at [3] for precise statements and exact hypotheses (boundary conditions, ‘Sard-type’ regularity, e.g., φ is d-times continuously differentiable, etc.).
Since we have a convergence result, we can then investigate the numerical implementation. This optimisation problem is tractable thanks to the Conic Particle Gradient Descent algorithm [23], denoted CPGD: the proposed framework involves a slightly different gradient flow (ν_t)_{t≥0}, defined through a projection of (μ_t)_{t≥0} onto M₊(X). This new gradient flow (ν_t)_{t≥0} is defined for a specific metric on M₊(X), which is a trade-off between the Wasserstein and the Fisher–Rao (also called Hellinger) metrics [23]; it is then called a Wasserstein–Fisher–Rao gradient flow. The Wasserstein–Fisher–Rao gradient flow starting at ν_{N,0} := Σ_{i=1}^N a_i⁰ δ_{x_i⁰} in M₊(X) writes down t ↦ ν_{N,t} := (1/N) Σ_{i=1}^N r_i(t)² δ_{x_i(t)} in M₊(X), rather than the Wasserstein flow t ↦ μ_{N,t} := (1/N) Σ_{i=1}^N δ_{(r_i(t), x_i(t))} starting at μ_{N,0} in P₂(Ω). The partial differential equation of a Wasserstein–Fisher–Rao flow writes down:
∂_t ν_t = −4α ν_t J′_{ν_t} + β div(ν_t ∇J′_{ν_t})
for the two parameters α , β > 0 arising from the cone metric, α tunes the Fisher–Rao metric weight, while β tunes the Wasserstein metric one. All statements on convergence could be made alternatively on μ t or ν t , and we have indeed the same theorem:
Theorem 3
(Global convergence—informal).If ν 0 has full support (its support is the whole set X ) and ( ν t ) t 0 converges for t + , then the limit is a global minimum of J. If ν N , 0 N + ν 0 in the weak-* sense, then:
lim_{N,t→+∞} J(ν_{N,t}) = min_{m ∈ M₊(X)} J(m).
Summary (3rd algorithm, theoretical aspects): we introduced the solution proposed by [3,23], namely approximating the source measure by minimising a discrete non-convex objective function of amplitudes and positions. The analytical study of the discrete function is an uphill battle, tackled thanks to the recasting of the problem in the space of measures. Then, we exhibited the theoretical framework of gradient flows, understood as the generalisation of gradient descent to the space of measures. Eventually, we presented the convergence of the gradient flow (ν_t)_t towards the minimum of the BLASSO. Gradient descent on the discrete objective approximates well the gradient flow dynamic, and can then benefit from the convergence results exhibited before.
We now discuss the numerical results of the particle gradient descent. The reader is advised to take a look at Figure 6, more precisely at red and green ellipses, to get a grasp on the numerical part.

3.3.2. Numerical Results

We recall that a gradient flow (ν_{N,t})_{t≥0} starting at ν_{N,0} := (1/N) Σ_{i=1}^N r_i(0)² δ_{x_i(0)} can be seen as a (time continuous) generalisation of gradient descent in the space of measures, allowing precise theoretical statements on the recovery conditions. To approach this gradient flow, we use the Conic Particle Gradient Descent algorithm [23], denoted CPGD: the point is to discretise the evolution of the gradient flow t ↦ ν_{N,t} through a numerical scheme on (12). This consists of a gradient descent on the amplitudes r and positions x through the gradient of the functional F_N in Equation (8), a strategy which approximates well the dynamic of the gradient flow [23].
This choice of gradient with the cone metric enables multiplicative updates in r and additive in x, the two updates being independent of each other. Then, the algorithm consists of a gradient descent with the definition of r i ( t ) and x i ( t ) according to [2,23]:
r_i′(t) = 2α r_i λ (η_λ(x_i(t)) − 1),   x_i′(t) = βλ ∇η_λ(x_i(t))
thanks to the gradient of Equation (8), for the mirror retraction (The notion of a retraction compatible with the cone structure is central: in the Riemannian context, a retraction is a continuous mapping that maps a tangent vector to a point on the manifold. Formally, one could see it as a way to enforce the gradient update to be mapped onto the manifold. See [23] for other choices of compatible retractions and more insights on these notions.) and η_λ = J′_ν/λ. The structure of the CPGD is presented in Algorithm 4. Note that the multiplicative updates in r yield an exponential of the certificate, and that the updates of the quantities r, x are separated.
This algorithm has rather easy and cheap iterations: to reach an accuracy of ε, i.e., a distance, such as the Wasserstein distance, between the source measure m_{a_0,x_0} and the reconstructed measure below ε, the CPGD yields a typical complexity cost of log(ε⁻¹) rather than ε⁻¹/² for a convex program ([23] Theorem 4.2). A reconstruction from the latter 1D Fourier measurements is plotted in Figure 7; the reconstruction is obtained through two gradient flows, the former on the positive measures to recover the positive δ-peaks of the ground-truth, and the latter on the negative measures to recover the negative ones: the merging of the two results gives the reconstructed δ-peaks. The noiseless reconstruction (See our GitHub repository for our implementation:, accessed on 30 November 2021) for a 2D Gaussian convolution with the same setting as in the Frank–Wolfe section is plotted in Figure 8. One can see that the spikes are well recovered, as some non-zero red and blue particles cluster around the three δ-peaks.
Algorithm 4. Conic particle gradient descent algorithm.
1: initialise r^{[0]} ∈ R₊^N, x^{[0]} ∈ X^N (particles spread over X) and choose a step size τ > 0
2: for k = 0, 1, 2, … do
3:  compute η_λ^{[k]} := J′_{ν^{[k]}}/λ for ν^{[k]} := (1/N) Σ_{i=1}^N (r_i^{[k]})² δ_{x_i^{[k]}}
4:  r_i^{[k+1]} := r_i^{[k]} exp(2αλτ (η_λ^{[k]}(x_i^{[k]}) − 1))  (multiplicative update)
5:  x_i^{[k+1]} := x_i^{[k]} + βλτ ∇η_λ^{[k]}(x_i^{[k]})  (additive update)
6: end for
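A self-contained sketch of such a CPGD loop in 1D (our own toy setting and not the implementation of [23]: sampled Gaussian kernel, hand-tuned step sizes, the 1/N factors absorbed into the rates, and signs chosen so that both updates descend the objective).

```python
import numpy as np

grid = np.linspace(0, 1, 64); h = grid[1] - grid[0]    # quadrature weight for <.,.>_H
sig, lam = 0.05, 0.01
def phi(x):                                            # sampled Gaussian kernel
    return np.exp(-(grid[None, :] - np.atleast_1d(x)[:, None])**2 / (2*sig**2))

y = phi(0.3)[0] + 0.5*phi(0.7)[0]                      # two positive ground-truth spikes
N = 20
r = np.full(N, 0.3)                                    # over-parametrised initialisation,
x = np.linspace(0.05, 0.95, N)                         # particles spread over all of X
alpha, beta, tau = 1.0, 0.1, 0.05                      # cone weights and step size

for _ in range(1000):
    resid = (r**2) @ phi(x) / N - y                    # Phi(nu) - y
    Jp = 2*h*(phi(x) @ resid) + lam                    # J'_nu evaluated at each x_i
    dJp = 2*h*((phi(x)*(grid[None, :] - x[:, None])/sig**2) @ resid)
    r = r * np.exp(-2*alpha*tau*Jp)                    # multiplicative update in r
    x = np.clip(x - beta*tau*dJp, 0.0, 1.0)            # additive update in x
```

The reconstructed measure is ν = (1/N) Σ r_i² δ_{x_i}: particles near the true spikes see their mass grow exponentially while the others decay, illustrating the separated multiplicative/additive structure of the updates.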
Summary (3rd algorithm numerical aspects): the gradient flow ( ν t ) t is computable by the Conic Particle Gradient Descent algorithm, consisting in an estimation through a gradient (w.r.t. cone metric) descent on both amplitudes and positions of an over-parametrised measure, namely a measure with a fixed number of δ -peaks exceeding the source’s one. The iterations are cheaper than the SFW presented before, but the CPGD lacks guarantees in a low-noise regime.
To sum up all the pros and cons of these algorithms, we give Table 1 for a quick digest. Since the CPGD lacks guarantees on the global optimality of its output, the following section will use the conditional gradient and more precisely the Sliding Frank-Wolfe in order to tackle the SMLM super-resolution problem.

4. Applications and Results in SMLM Imaging

As an illustration of off-the-grid applications in this review, we propose to solve the super-resolution problem, aiming to retrieve biological structures at very small scales.

4.1. Metrics of Quality of Reconstruction

If one has access to the ground-truth, i.e., the real positions of the point sources, one is able to assess the quality of the reconstruction by:
  • detection metrics, such as the Jaccard index;
  • quality of reconstruction metrics, such as the L 2 norm in the discrete case.
Detection metrics can be applied to the off-the-grid output in a straightforward manner. We will rather focus in this part on the ‘quality of reconstruction’ metrics. Any of the former algorithms returns a list of Dirac measures, which can be compared with the ground-truth measure m_{a_0,x_0}. This comparison cannot be done with discrete tools such as the L² norm of the reconstructed acquisitions: we cannot compare an element of M(X) with one of L²(X). Examining the L² norm of the vector of reconstructed positions x_i against the vector x_{0,i} is not sufficient either: we would need the same number of elements in x and x_0, we would have to sort the vectors of positions, and we would have no guarantee that the matching of one position of x with another of x_0 is the right one.
Hence, a distance on the measure space is the right tool for comparison. We will use in the following the 1-Wasserstein distance W₁ [37]: see Appendix B.5 for a recall of the useful definitions and more insights on the optimal transport setting used in this section. The Wasserstein distance between measures of equal mass is defined (Actually, it is well-defined on the subset {m ∈ M(X) : |m|(X) ≤ ‖y‖²_H/(2λ)}, because only the bounded subsets of M(X) are metrisable for the weak-* topology; we have to restrict the set of measures in order to reach a Polish space, i.e., the convenient framework for this OT-based metric, see Appendix B.5. Since all solutions of the BLASSO belong to this subset [8], we will keep this slight abuse of notation in the rest of the paper.) as:
Definition 7
(Balanced optimal transport). For 0 < p < +∞ and m₁, m₂ ∈ M(X) such that |m₁|(X) = |m₂|(X), the p-Wasserstein distance is written:
W_p(m₁, m₂) := ( min_{γ ∈ Γ(m₁,m₂)} ∫_{X×X} |u − v|^p dγ(u, v) )^{1/p}.
Γ(m₁, m₂) is the set of transport plans between m₁ and m₂; one can take a look at Appendix B.5 for more insights on this notion. However, this notion is not sufficient for our application, since the metric can only take measures of equal masses (i.e., equal TV-norms) as an input. In the case of a discrete measure, we recall that the mass is simply the sum of the moduli of the individual amplitudes: hence, in general, we cannot compare a source measure and a reconstructed measure with differing amplitudes. The classic solution is then to distribute the unit mass, divided by the number of spikes, uniformly over each δ-peak of the discrete measure. Still, it would be far more convenient to incorporate the case of differing masses into the metric. The proper metric to compare two measures of different masses is called the Kantorovich–Rubinstein norm, also referred to as the Flat Metric [37,38,39].
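In 1D, the W₁ distance between two discrete spike trains is directly available in scipy (an assumption of this sketch: scipy.stats.wasserstein_distance, which renormalises the given weights to probability measures, matching the uniform-mass convention above).

```python
import numpy as np
from scipy.stats import wasserstein_distance

x_true = np.array([0.20, 0.50, 0.80])       # ground-truth spike positions
x_rec  = np.array([0.21, 0.48, 0.80])       # reconstructed positions
# uniform mass 1/N on each delta-peak: W1 between the two discrete measures
w1 = wasserstein_distance(x_true, x_rec)
# amplitudes can be passed as weights (they are renormalised to unit mass)
w1_w = wasserstein_distance(x_true, x_rec,
                            u_weights=np.array([1.0, 2.0, 1.0]),
                            v_weights=np.array([1.0, 2.0, 1.0]))
```

With uniform weights and an equal number of spikes, W₁ reduces to the mean absolute error between the sorted positions, here (0.01 + 0.02 + 0)/3.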
Definition 8
(Unbalanced optimal transport). Let m ∈ M(X) be of finite first moment and τ > 0; the following quantity is called the Kantorovich–Rubinstein norm:
F_τ(m) := sup { ∫_X f dm : f ∈ C_b(X), ‖f‖_{∞,X} ≤ τ, f Lipschitz, ‖f‖_Lip ≤ 1 }
where f Lip is the Lipschitz constant of f. We then define the Flat Metric d τ for m 1 , m 2 M X of finite first moments:
d τ ( m 1 , m 2 ) = def . F τ ( m 1 m 2 ) .
The parameter τ is homogeneous to a distance, and it is understood in the optimal transport sense as the cost of creating/destroying a Dirac measure. The Flat Metric coincides with the 1-Wasserstein distance, for m₁, m₂ of equal masses, when τ → +∞ [38]; it also coincides with the total variation norm of m₁ − m₂ when τ → 0. It may then be seen as an interpolation between the total variation norm and the 1-Wasserstein distance. Moreover, when the number of δ-peaks is correctly estimated, the Flat Metric stands for the mean localisation error and is similar to the RMSE [39]. Finally, the Flat Metric can be extended to discrete reconstructions, i.e., images on a fine grid, making it applicable to grid-based methods as well.
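When both measures are discrete, the supremum in Definition 8 becomes a small linear program; here is a sketch for 1D measures (the function and the LP formulation are ours, built on scipy.optimize.linprog): maximise Σᵢ cᵢ f(uᵢ) over the constraints |f| ≤ τ and |f(uᵢ) − f(uⱼ)| ≤ |uᵢ − uⱼ|.

```python
import numpy as np
from scipy.optimize import linprog

def flat_metric(x1, a1, x2, a2, tau):
    """d_tau(m1, m2) for discrete measures m = sum_i a_i delta_{x_i} on the line:
    LP over the values f_i = f(u_i) on the union u of both supports."""
    u = np.concatenate([x1, x2])
    c = np.concatenate([a1, -a2])            # signed masses of m1 - m2
    n = len(u)
    rows, rhs = [], []
    for i in range(n):                       # Lipschitz constraints |f_i - f_j| <= |u_i - u_j|
        for j in range(i + 1, n):
            row = np.zeros(n); row[i], row[j] = 1.0, -1.0
            rows += [row, -row]; rhs += [abs(u[i] - u[j])]*2
    res = linprog(-c, A_ub=np.array(rows) if rows else None,
                  b_ub=np.array(rhs) if rhs else None,
                  bounds=[(-tau, tau)]*n, method="highs")
    return -res.fun
```

One can observe the interpolation property directly: for two unit spikes 0.1 apart and a large τ, the value is the W₁ transport cost 0.1, whereas for distant spikes the value saturates at the creation/destruction cost 2τ.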
To sum up, there are two possibilities if one wants to compare the reconstructed measure and the ground-truth one:
  • let the source measure be composed of N spikes; we set the amplitude of each δ-peak to 1/N. We apply the same procedure to the reconstructed measure (with a possibly differing number of spikes), hence dividing the unit mass uniformly over all the δ-peaks of the considered measure. The reconstructed luminosity is therefore discarded as irrelevant, and we can compute directly the 1-Wasserstein distance, since it is equal to the Flat Metric in this case;
  • we want to account for the luminosity, and we use the Flat Metric to compare the reconstructed measure against the ground-truth one.
Summary: classic quality-of-reconstruction metrics such as the L²(X) norm cannot be straightforwardly applied to off-the-grid reconstructions. Instead, one can use optimal transport scores such as the Flat Metric: it accounts for both amplitude and position reconstruction, while it can easily be extended to discrete reconstructions (images on a fine grid).

4.2. Results for an SMLM Stack

In super-resolution for biomedical imaging, one wants to retrieve some fine scale details to better study the biological structures of interest. Indeed, the studied bodies are generally smaller than the Rayleigh limit of roughly 200 nm, a length at which the phenomenon of light diffraction comes into play. This diffraction causes a blurring of the image, which can be described as a convolution of the image by the PSF mentioned above. Hence, we want to perform a deconvolution, i.e., remove the blur of diffraction, to get a super-resolved image. It is worth noticing that other imaging systems exist, for which the inverse problems to solve differ from deconvolution: e.g., Nuclear Magnetic Resonance spectroscopy with Fourier measurements [40], or MA-TIRF with a Laplace transform [13].
In order to enhance spatial resolution over standard diffraction-limited microscopy techniques and image biological structures below the Rayleigh criterion, one can use SMLM, which stands for ‘Single Molecule Localisation Microscopy’. It is a compelling technique in fluorescence microscopy to tackle the super-resolution problem [41]. It requires photoactivatable fluorophores with, roughly speaking, two states, say ‘On’ and ‘Off’. These molecules are only visible in the acquisitions in the ‘On’ state; the idea is then to light up a few molecules in the sample, make an acquisition, and locate them precisely. The fluorescent molecules are bound to the biological structure and, since only a few molecules are emitting in one frame, the resulting image is rather sparse, which allows accurate localisation. This process is repeated until all the molecules have been lit and imaged. The positions of the molecules localised frame by frame can then be put together to form a super-resolved image that goes below the diffraction barrier, rid of the degradation introduced by the acquisition process (blur, noise, etc.). The quality of the reconstruction is naturally limited by the number of acquisitions needed, which implies a cost in time (precious insofar as the studied organism moves) and in physical memory, and by the density of fluorophores lit at each stage: if too many molecules overlap, the separation criterion is not met and their localisation is hindered.
Off-the-grid methods can be applied to any SMLM stack with only the knowledge of the forward operator, i.e., the acquisition system’s PSF in this case. In this review, a gridless method based on Sliding Frank–Wolfe is tested on a 2D SMLM acquisition stack from the 2013 EPFL Challenge (, accessed on 30 November 2021). For this purpose, we consider the first image of the stack, locate the source points, and store their coordinates; we then move on to the second image, locate its source points, and so on. Note that this variational off-the-grid approach is not the only method taking advantage of a continuous domain: PSF-fitting algorithms such as DAOSTORM [42] do so as well.
Deconvolution is the first challenge in solving this inverse problem, but we must also take the noise into account. One has to deal with three main types of noise on these acquisitions:
  • photon noise (also known as shot noise or quantum noise) is due to the quantum nature of light. It arises from the fact that fluorophores emit photons randomly, so that, between t and t + 1 (exposure time), a variable number of photons have been emitted, and therefore a variable number of photons have been collected by the sensor. Thus, the amplitude of the electrical signal generated in the sensor (at each pixel) fluctuates according to a Poisson statistic;
  • the dark current is a phenomenon due to the natural agitation of electrons. This natural agitation is sufficient to occasionally eject an electron from the valence band to the conduction band without any photoelectric effect. Additional charges are therefore created which interfere with the signal. The number of electrons generated by thermal agitation follows a Poisson distribution;
  • amplification and readout noise. This noise is produced by the electronic circuit that amplifies and converts the electron packets into voltage. It is generally modelled by a Gaussian noise.
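The three noise sources above can be sketched as follows; the camera gain is ignored, and the dark-current rate and readout standard deviation are illustrative values, not those of a specific sensor:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_camera_noise(ideal_photons, dark_rate=5.0, readout_std=2.0):
    """Apply the three noise sources described in the text to a noiseless
    expected-photon-count image (all parameter values are illustrative):
    - shot noise: Poisson fluctuation around the expected photon count,
    - dark current: Poisson with mean `dark_rate` electrons per pixel,
    - readout: additive Gaussian noise of standard deviation `readout_std`."""
    shot = rng.poisson(ideal_photons)                     # photon (shot) noise
    dark = rng.poisson(dark_rate, ideal_photons.shape)    # thermal electrons
    read = rng.normal(0.0, readout_std, ideal_photons.shape)  # amplifier/readout
    return shot + dark + read

clean = np.full((64, 64), 100.0)   # flat image, 100 expected photons per pixel
noisy = simulate_camera_noise(clean)
print(noisy.mean())  # on average ≈ 105: signal plus dark current
```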
Thus, several noise sources pollute each of the observed images. To deal with this ill-posed inverse problem, we use the results on the BLASSO, with the least-squares term as the data-fitting term and the TV norm as the regulariser. In the Bayesian approach, the least-squares term models the maximum likelihood when the acquisition is corrupted by Gaussian noise; hence, our model makes a Gaussian-noise approximation. Measurements are discrete, so each image consists of N₁ × N₂ pixels on a domain of size (b₁, b₂). Let (c_{i,1}, c_{i,2}) be the centre of the ith pixel; we denote the ith camera pixel by
$$\Omega_i \overset{\text{def.}}{=} (c_{i,1}, c_{i,2}) + \left[ -\frac{b_1}{2N_1}, \frac{b_1}{2N_1} \right] \times \left[ -\frac{b_2}{2N_2}, \frac{b_2}{2N_2} \right].$$
We can then make explicit the forward operator Φ : M(X) → ℝ^{N₁N₂}, which encapsulates the integration over the camera pixels [13]; indeed, the kernel φ is the Gaussian kernel of standard deviation σ integrated over each pixel: for i ∈ {1, …, N₁N₂},
$$[\varphi(x)]_i \overset{\text{def.}}{=} \frac{1}{2\pi\sigma^2} \int_{\Omega_i} e^{-\frac{(x_1 - s_1)^2 + (x_2 - s_2)^2}{2\sigma^2}} \,\mathrm{d}s_1 \,\mathrm{d}s_2.$$
In the SMLM data set, the PSF standard deviation is σ = 149.39 nm and the pixel size is 100 nm. The reconstruction is performed by our Python implementation of the Sliding Frank–Wolfe (see our GitHub repository for our PyTorch implementation:, accessed on 30 November 2021), insofar as it is the most robust method available: it handles the Gaussian kernel and has proven recovery guarantees in a low-noise regime. The results are presented in Figure 9. The stack of 2500 images of 64 × 64 pixels is qualified as high density with high SNR: the number of activated fluorophores per frame is quite large, and the noise is not negligible (see the EPFL Challenge page for more insights).
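The pixel-integrated kernel of the previous equation is separable and admits a closed form through the error function; a sketch in arbitrary units (the grid size, pixel size and σ below are illustrative, not the challenge's values) could read:

```python
import math
import numpy as np

def gaussian_pixel_kernel(x, grid_size=64, pixel=1.0, sigma=1.5):
    """Evaluate [phi(x)]_i: the 2D Gaussian centred at x = (x1, x2),
    integrated over each camera pixel Omega_i, in closed form via erf.
    The separable 2D integral is the outer product of two 1D integrals."""
    edges = np.arange(grid_size + 1) * pixel   # pixel edges along one axis
    s = math.sqrt(2.0) * sigma

    def cdf_diff(centre):
        # integral of the normalised 1D Gaussian over each pixel interval
        return np.array([0.5 * (math.erf((edges[i + 1] - centre) / s)
                                - math.erf((edges[i] - centre) / s))
                         for i in range(grid_size)])

    return np.outer(cdf_diff(x[0]), cdf_diff(x[1]))

img = gaussian_pixel_kernel((32.0, 32.0))
print(img.sum())  # ≈ 1: the full mass of the spike falls inside the sensor
```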
The Flat Metric between the reconstructed measure m_{a,x} and the ground-truth measure m_{a⁰,x⁰} is then computed: it reaches d_τ(m_{a,x}, m_{a⁰,x⁰}) = 1.7 × 10⁻². The reconstruction is convincing and captures the fine details of the biological structures well; one can clearly see the interweaving tubulins in the right part of the image.
Note that an interesting feature of the gridless reconstruction is that, once the Radon measure is computed, it is straightforward to plot it through any operator on a fine grid of one’s choice. Indeed, since one cannot represent a discrete measure m directly, we rather plot Φm, where here Φ denotes the PSF with a slightly smaller variance, in order to clearly see the deconvolution. In all of our reconstructions, we convolve the recovered measure with the PSF of spread σ/6 and plot it on a grid 32 times finer. By way of comparison, discrete methods operate on a fixed fine grid: if one wants a finer reconstruction, everything has to be recomputed.
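A minimal sketch of this visualisation step, assuming a separable sampled Gaussian (for brevity, not the pixel-integrated kernel) and illustrative sizes and amplitudes:

```python
import numpy as np

def render_measure(amps, pos, coarse_n=64, up=32, sigma=1.5):
    """Visualise a discrete measure m = sum_k a_k delta_{x_k} on a grid `up`
    times finer than the acquisition, convolved with a Gaussian of spread
    sigma/6, as described in the text (sizes and sigma are illustrative)."""
    n = coarse_n * up
    xs = (np.arange(n) + 0.5) / up           # fine-grid coordinates, pixel units
    s = sigma / 6.0
    img = np.zeros((n, n))
    for a, (x1, x2) in zip(amps, pos):
        g1 = np.exp(-(xs - x1) ** 2 / (2 * s ** 2))
        g2 = np.exp(-(xs - x2) ** 2 / (2 * s ** 2))
        img += a * np.outer(g1, g2)          # separable 2D Gaussian bump
    return img

img = render_measure([1.0, 0.5], [(10.2, 20.7), (10.5, 20.9)])
print(img.shape)  # → (2048, 2048)
```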
We finally test the off-the-grid reconstruction on a real data set of tubulins with high-density molecules, provided by the 2013 IEEE ISBI SMLM challenge. In this stack of 500 frames of 128 × 128 pixels, the FWHM (full width at half maximum) of the acquisition system is estimated at 351.8 nm. We recall that the FWHM is the width of the Gaussian curve measured between the two points where it reaches half its maximum amplitude; it is linked to the standard deviation σ by FWHM = 2√(2 ln 2) × σ. We compare the reconstruction of the off-the-grid method with the output of the Deep-STORM [43] algorithm, touted as the algorithm with the most visually compelling results. The reconstructions of the gridless method and of Deep-STORM are presented in Figure 10, where one can appreciate the off-the-grid reconstruction of fine details. The reconstruction seems a bit blurry compared to Deep-STORM, owing to the plotting through a Gaussian kernel of small spread. However, it is noteworthy that both methods retrieve the filaments well, in particular inside the highlighted yellow circles: the off-the-grid reconstruction seems to better preserve the structure compared to Deep-STORM’s rougher output. The quality of the reconstruction is all the more interesting since the off-the-grid method does not require any training set, contrary to Deep-STORM. The only data needed are (an estimate of) the forward operator, and the off-the-grid reconstruction can then be performed from any input without having to train a model on different types and levels of noise.
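The FWHM/σ conversion above is easy to check numerically; note that 351.8 nm corresponds to σ ≈ 149.4 nm, consistent with the value quoted for the synthetic stack:

```python
import math

def fwhm_to_sigma(fwhm):
    """Invert FWHM = 2*sqrt(2*ln 2)*sigma for a Gaussian PSF."""
    return fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))

print(round(fwhm_to_sigma(351.8), 1))  # → 149.4 (nm)
```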
Summary: we tested an off-the-grid method on both synthetic and experimental SMLM data sets. The gridless problem is tractable thanks to the Sliding Frank–Wolfe algorithm, and yields compelling results. These results are all the more interesting since there is only one parameter λ, which is easy to tune and robust w.r.t. noise. Thus, the method can easily be adapted to any other dataset with a known acquisition operator.

5. Conclusions

We described in this review the off-the-grid variational setting for the sparse spike problem, through the definition of the space of signed measures M(X) and the BLASSO functional defined over this set. Thanks to the trade-off between the convexity of the functional and the infinite-dimensional, non-reflexive optimisation space M(X), the BLASSO can be defined to solve the sparse spike recovery problem. We reviewed the theoretical guarantees to reach the correct minimum, as the literature provides multiple results, in particular a sharp criterion for stable spike recovery in a low-noise regime. Numerical methods to tackle the BLASSO were also discussed, with insights on the SDP approach, which is asymptotically exact but only suited to Fourier measurements; the Frank–Wolfe approach, with a known rate of convergence but a high computational load; and the Conic Particle Gradient Descent, with cheap iterations but a lack of guarantees. We were finally able to present the results of the off-the-grid approach with the Sliding Frank–Wolfe algorithm on SMLM synthetic and real data from the EPFL Challenge, and to illustrate the usefulness of these methods to recover fine-scale details.

Author Contributions

B.L., L.B.-F. and G.A. contributed in equal parts to the conceptualisation and methodology. B.L. was the main contributor for numerical implementation, data curation, validation, and visualisation. All authors contributed to investigation and resources. B.L. wrote the first draft, L.B.-F. and G.A. reviewed and edited it. L.B.-F. and G.A. supervised the conduct of this paper. All authors have read and agreed to the published version of the manuscript.


Funding

This work has been supported by the French government, through the UCA DS4H Investments in the Future project managed by the National Research Agency (ANR) with the reference number ANR-17-EURE-0004 and through the 3IA Côte d’Azur Investments in the Future project managed by the National Research Agency with the reference number ANR-19-P3IA-0002.

Data Availability Statement

All the codes are available at the git repository or its mirror (accessed on 30 November 2021).

Conflicts of Interest

The authors declare no conflict of interest.


Abbreviations

The following abbreviations are used in this manuscript:
PSF: Point Spread Function
FWHM: Full Width at Half Maximum
SNR: Signal-to-Noise Ratio
LASSO: Least Absolute Shrinkage and Selection Operator
SFW: Sliding Frank–Wolfe
OT: Optimal Transport
CPGD: Conic Particle Gradient Descent
SMLM: Single Molecule Localisation Microscopy

Appendix A. Notations Table

Table A1. Main notations used in the review.
d: dimension of the ambient space
X: ambient space of the spike positions, e.g., the torus, [0, 1]^d, ℝ^d, etc.
C₀(X): space of evanescent continuous functions
M(X): space of signed Radon measures
M₊(X): space of non-negative Radon measures
M_ℂ(X): space of complex Radon measures
H: Hilbert space, typically L²(X)
m: Radon measure
|m|(X): TV norm of m
δ: Dirac measure
a, x: respectively the amplitudes and positions of the spikes a δ_x
N: number of Dirac measures in a discrete measure
λ: regularisation parameter in (𝒫λ(y))
Φ: forward acquisition operator, with kernel φ and adjoint Φ∗
F_n: forward Fourier acquisition operator with n measurements
p_λ: solution of the dual problem (𝒟λ(y))
η_λ: dual certificate of (𝒫λ(y))
T_λ: BLASSO functional of (𝒫λ(y))
Ω: lifted space, Ω ≝ ℝ₊ × X
J: BLASSO functional on M₊(X)
R: data term, R : M(X) → H
F_N: ‘discrete’ functional on Ω^N
F: functional on M(Ω)
(μ_t)_t, (ν_t)_t: gradient flows, respectively in P₂(Ω) and M₊(X)
α, β: cone metric / Fisher–Rao–Wasserstein tuning parameters
W_p: Wasserstein distance of order p
P₂(Ω): space of probability measures with finite 2nd moment, endowed with W₂

Appendix B. Useful Definitions and Notions

Appendix B.1. Details on Functional Analysis

Definition A1.
Two Radon measures μ and ν of M(X) are called singular if there exist two disjoint subsets A, B of the σ-algebra of X whose union is X, such that μ is zero on all measurable subsets of B while ν is zero on all measurable subsets of A.
Proposition A1
(Jordan decomposition). For every measure μ ∈ M(X), there exist two non-negative Radon measures μ₊, μ₋ ∈ M₊(X), singular with respect to each other, such that μ = μ₊ − μ₋.
Definition A2
(weak-* topology). Loosely speaking, weak-* convergence is convergence locally on average. We say that a sequence of Radon measures (m_n)_{n≥0} weakly-* converges towards m ∈ M(X) if and only if, for all f ∈ C₀(X):
$$\int_X f \,\mathrm{d}m_n \xrightarrow[n \to +\infty]{} \int_X f \,\mathrm{d}m.$$
We write m_n ⇀ m; this convergence is also called vague convergence.
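As a toy numerical illustration, pairing a Dirac measure with a test function is just evaluation, ⟨f, δ_x⟩ = f(x), so the sequence δ_{1/n} weakly-* converges to δ₀ (the test function below is one arbitrary choice in C₀(ℝ)):

```python
import math

# Weak-* convergence of delta_{1/n} towards delta_0: for a Dirac measure,
# the pairing with a test function f in C_0(X) is simply evaluation.
f = lambda t: math.exp(-t * t)   # continuous and vanishing at infinity

pairings = [f(1.0 / n) for n in (1, 10, 100, 1000)]
print(pairings[-1], f(0.0))  # the pairing tends to <f, delta_0> = 1
```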
Definition A3.
A vector space E is said to be reflexive if its bidual E′′ is identified with E.
Remark A1.
Since C₀(X) is not a reflexive space for its supremum norm, the dual of M(X) for the topology induced by its TV norm is a complicated space, strictly larger [44] than C₀(X). However, if M(X) is endowed with the weak-* topology, then M(X) is a locally convex space whose dual is C₀(X) [8].
We also make precise the notion of metrisability, for the sake of the optimal transport part:
Definition A4.
A topological space ( E , T ) is said to be metrisable if there exists a distance d : E × E [ 0 , + [ such that the topology induced by d is T .
M(X) endowed with the weak-* topology is not a first-countable space, hence it is not metrisable. To get a hold on that:
Lemma A1.
If E is a Banach space, the weak-* topology on its dual E′ is not metrisable, unless E is of finite dimension.
Nonetheless, all bounded subsets of M(X) are metrisable for the weak-* topology. This property is of utmost importance for the definitions of classic OT metrics such as the Wasserstein distance, and for the proof of Γ-convergence of the LASSO to the BLASSO [7,37]. To sum up the properties of the different topologies, we give the following Table A2:
Table A2. Algebraic properties of M X for its two main topologies.
Properties | TV Topology | Weak-* Topology
Completeness | Yes | On its bounded subsets
Metrisable | Yes | On its bounded subsets
Polish space (see Appendix B.5) | No | On its bounded subsets

Appendix B.2. Proof of the Fenchel Dual

Proposition A2.
Consider the problem:
$$\operatorname*{arg\,max}_{\|\Phi^* p\|_{\infty,X} \le 1} \; \langle y, p \rangle_{\mathcal{H}} - \frac{\lambda}{2} \|p\|_{\mathcal{H}}^2.$$
It is the dual problem of the BLASSO (𝒫λ(y)).
We will apply results from [20] (Remark 4.2), with a small caveat: the Banach space V should be reflexive, which is clearly not the case here with V = M(X). However, the reflexivity hypothesis is only needed for the existence proof; since we already proved the existence of a solution, it is not needed in our case. Remark 4.2 states that, for Λ : V → Y linear and F : V → ℝ ∪ {+∞}, G : Y → ℝ ∪ {+∞} convex, the primal problem:
$$\inf_{u \in V} F(u) + G(\Lambda u)$$
admits the dual problem:
$$\sup_{p \in Y} -F^*(\Lambda^* p) - G^*(-p).$$
If u and p are respectively solutions of the primal and dual problems, the extremality conditions are:
$$\Lambda^* p \in \partial F(u), \qquad -p \in \partial G(\Lambda u).$$
Let us specify our case: V ≝ M(X), Y ≝ H, F(m) ≝ |m|(X) and G(p) ≝ ½‖y − p‖²_H. One can easily check that the convex conjugate of G is G∗(p) = ⟨y, p⟩_H + ½‖p‖²_H for p ∈ H; in order to determine F∗, let ψ ∈ V′ ≝ C₀(X):
$$F^*(\psi) = \sup_{m \in \mathcal{M}(X)} \langle \psi, m \rangle_{C_0(X) \times \mathcal{M}(X)} - |m|(X) \ge \langle \psi, m \rangle_{C_0(X) \times \mathcal{M}(X)} - |m|(X), \quad \forall m \in \mathcal{M}(X).$$
Let x ∈ X and m = λδ_x with λ > 0. Then, one has:
$$\sup_{m \in \mathcal{M}(X)} \langle \psi, m \rangle_{C_0(X) \times \mathcal{M}(X)} - |m|(X) \ge \lambda \left( \psi(x) - 1 \right).$$
In the limit λ → ∞, one gets F∗(ψ) = +∞ if ψ(x) > 1. A similar result for ψ(x) < −1 is obtained with the measure m = −λδ_x. One finally reaches F∗(ψ) = +∞ if ‖ψ‖_{∞,X} > 1. Let us now assume that ‖ψ‖_{∞,X} ≤ 1; first note that F∗(ψ) = sup_{m∈M(X)} ⟨ψ, m⟩_{C₀(X)×M(X)} − |m|(X) ≥ 0 (case m = 0). Moreover,
$$\langle \psi, m \rangle_{C_0(X) \times \mathcal{M}(X)} - |m|(X) \le \|\psi\|_{\infty,X}\, |m|(X) - |m|(X) = |m|(X) \left( \|\psi\|_{\infty,X} - 1 \right) \le 0 \quad \text{since } \|\psi\|_{\infty,X} \le 1.$$
By taking the supremum on both sides of the last inequality, one finally gets F∗(ψ) = 0 if ‖ψ‖_{∞,X} ≤ 1, thus obtaining the condition on the supremum norm.
Then, from (A1), we obtain the dual problem:
$$\sup_{\|\Phi^* p\|_{\infty,X} \le 1} \langle y, p \rangle_{\mathcal{H}} - \frac{1}{2} \|p\|_{\mathcal{H}}^2,$$
and, for m and p respectively solutions of the primal and dual problems, the extremality conditions are:
$$\Phi^* p \in \partial |m|(X), \qquad -p = \Phi m - y.$$
This concludes the proof. □

Appendix B.3. Fréchet Differential of J ν ′

Let σ, ν ∈ M₊(X) and ε > 0. Consider the following:
$$J(\nu + \varepsilon \sigma) = R\left( \int_X \varphi(\theta) \,\mathrm{d}(\nu + \varepsilon \sigma)(\theta) \right) + \lambda |\nu + \varepsilon \sigma|(X) = R\left( \int_X \varphi(\theta) \,\mathrm{d}\nu(\theta) + \varepsilon \int_X \varphi(\theta) \,\mathrm{d}\sigma(\theta) \right) + \lambda \left( |\nu|(X) + \varepsilon |\sigma|(X) \right).$$
The linearity of the TV norm is obtained thanks to the positivity of ν and σ. Hence, the differential J′_ν at the point ν is given by:
$$\begin{aligned} \frac{\mathrm{d} J(\nu + \varepsilon \sigma)}{\mathrm{d}\varepsilon} \bigg|_{\varepsilon = 0} &= \left\langle \int_X \varphi(\theta) \,\mathrm{d}\sigma(\theta),\, R'\!\left( \int_X \varphi(s) \,\mathrm{d}\nu(s) \right) \right\rangle_{\mathcal{H}} + \lambda |\sigma|(X) \\ &= \int_X \left\langle \varphi(\theta),\, R'\!\left( \int_X \varphi(s) \,\mathrm{d}\nu(s) \right) \right\rangle_{\mathcal{H}} \mathrm{d}\sigma(\theta) + \lambda |\sigma|(X) \\ &= \left\langle \langle \varphi, R' \rangle_{\mathcal{H}}, \sigma \right\rangle_{C_0(X) \times \mathcal{M}(X)} + \left\langle \lambda, \sigma \right\rangle_{C_0(X) \times \mathcal{M}(X)} = \left\langle J'_\nu, \sigma \right\rangle_{C_0(X) \times \mathcal{M}(X)}. \end{aligned}$$

Appendix B.4. Gradient of FN

Let R be the data-fitting term, e.g., the least squares, and h an injective map such as h : r ↦ r². For r ∈ ℝ₊^N and θ ∈ X^N, we consider:
$$F_N(r, \theta) = R\left( \frac{1}{N} \sum_{i=1}^{N} h(r_i)\, \varphi(\theta_i) \right) + \frac{\lambda}{N} \sum_{i=1}^{N} h(r_i).$$
Its differential is given by:
$$\mathrm{d}F_N(x)(\delta x) = \langle \nabla F_N(x), \delta x \rangle = \sum_{i=1}^{N} \left[ \alpha(r_i)^{-1}\, \frac{\partial F_N}{\partial r_i}\, \delta r_i + \beta(r_i)^{-1} \left\langle \frac{\partial F_N}{\partial \theta_i}, \delta \theta_i \right\rangle_\theta \right].$$
Moreover, we have:
$$\frac{\partial F_N}{\partial r_i} = \frac{1}{N} h'(r_i) \left\langle \varphi(\theta_i),\, R'\!\left( \frac{1}{N} \sum_{j=1}^{N} h(r_j)\, \varphi(\theta_j) \right) \right\rangle_{\mathcal{H}} + \frac{\lambda}{N} h'(r_i) = \frac{1}{N} h'(r_i)\, J'_\nu(\theta_i),$$
which yields the gradient in [23]. One can simplify the final expression by introducing the certificate η_λ ≝ J′_ν / λ.
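A rough numerical sketch of gradient descent on F_N with h(r) = r², in the spirit of [23] but with illustrative choices throughout (a 1D sampled Gaussian kernel, plain gradient steps in place of the mirror/retraction steps of CPGD, and arbitrary step sizes playing the role of α, β):

```python
import numpy as np

# 1D toy deconvolution: y = Phi m0 with two ground-truth spikes; all sizes,
# step sizes and parameter values are illustrative, not the paper's.
grid = np.linspace(0.0, 1.0, 100)
sig, lam, N = 0.05, 0.01, 20

phi  = lambda t: np.exp(-(grid - t) ** 2 / (2 * sig ** 2))   # kernel column phi(theta)
dphi = lambda t: phi(t) * (grid - t) / sig ** 2              # derivative w.r.t. theta

y = 1.0 * phi(0.3) + 0.7 * phi(0.65)                         # noiseless acquisition

r0, th0 = np.full(N, 0.5), np.linspace(0.05, 0.95, N)        # particle initialisation

def loss(r, th):
    u = (r ** 2) @ np.array([phi(t) for t in th]) / N        # (1/N) sum h(r_i) phi(theta_i)
    return 0.5 * np.sum((u - y) ** 2) + lam * np.sum(r ** 2) / N

r, th = r0.copy(), th0.copy()
step_r, step_t = 0.1, 0.01                                   # crude stand-ins for alpha, beta
for _ in range(1000):
    e = (r ** 2) @ np.array([phi(t) for t in th]) / N - y    # residual, here R' = e
    # dF_N/dr_i = (2 r_i / N) (<phi(theta_i), e> + lam)   with h'(r) = 2r
    gr = (2 * r / N) * (np.array([phi(t) @ e for t in th]) + lam)
    # dF_N/dtheta_i = (r_i^2 / N) <phi'(theta_i), e>
    gt = (r ** 2 / N) * np.array([dphi(t) @ e for t in th])
    r, th = r - step_r * gr, th - step_t * gt

print(loss(r, th), "<", loss(r0, th0))  # the objective decreases
```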

Appendix B.5. Details on Section 4

Definition A5.
A space (X, d_p) is a Polish space if it is separable and metrisable by a distance which makes the space complete.
X is a closed subset of the separable Hilbert space ℝ^d, so (X, d_p) is a Polish metric space for d_p a distance on ℝ^d restricted to X. We can also introduce:
Definition A6
(Transport plan). The non-negative measure γ ∈ M₊(X × X) which verifies, for all A, B ∈ B(X), where B(X) is the Borel σ-algebra:
$$\gamma(A \times X) = m_1(A), \qquad \gamma(X \times B) = m_2(B)$$
is called a transport plan between two non-negative measures m₁ and m₂ of the same mass.
We call Γ(m₁, m₂) the set of transport plans between m₁ and m₂ [37]. Optimal transport metrics such as the Wasserstein distance build on these notions at their core, and are defined only on Polish spaces: this is why we work with measures on X from [7], the restriction of M(X) endowed with the weak-* topology.
Definition A7
(Wasserstein distance). Let (X, d_p) be a Polish metric space and p ∈ [1, +∞). For any probability measures μ and ν on X, the Wasserstein distance of order p between μ and ν is defined by:
$$W_p(\mu, \nu) = \left( \inf_{\gamma \in \Gamma(\mu, \nu)} \int_{X \times X} d_p(x, y)^p \,\mathrm{d}\gamma(x, y) \right)^{1/p}.$$
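For uniform discrete measures with N points of mass 1/N each, the infimum over transport plans is attained at a permutation (by the Birkhoff theorem on doubly stochastic matrices); a brute-force sketch for tiny N on the line:

```python
import itertools

def wasserstein_uniform(xs, ys, p=2):
    """W_p between two uniform discrete measures on the line, each giving
    mass 1/N to its support points. For uniform marginals the optimal plan
    is a permutation, enumerated by brute force; this is only viable for
    tiny N and is purely illustrative (a solver is needed in practice)."""
    n = len(xs)
    assert n == len(ys)
    best = min(
        sum(abs(xs[i] - ys[perm[i]]) ** p for i in range(n))
        for perm in itertools.permutations(range(n))
    )
    return (best / n) ** (1.0 / p)

print(wasserstein_uniform([0.0, 1.0, 2.0], [0.1, 1.1, 2.1]))  # ≈ 0.1
```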
We also recall the definition of moments:
Definition A8.
If r ∈ ℕ, we call the moment of order r of a measure m ∈ M(X) the quantity:
$$\int_X x^r \,\mathrm{d}m(x).$$
We say that m has a finite moment of order r if the preceding quantity is finite.


  1. Candes, E.; Romberg, J.; Tao, T. Robust Uncertainty Principles: Exact Signal Reconstruction from Highly Incomplete Frequency Information. IEEE Trans. Inf. Theory 2006, 52, 489–509. [Google Scholar] [CrossRef][Green Version]
  2. De Castro, Y.; Gadat, S.; Marteau, C.; Maugis, C. SuperMix: Sparse Regularization for Mixtures. Ann. Stat. 2021, 49, 1779–1809. [Google Scholar] [CrossRef]
  3. Chizat, L.; Bach, F. On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport. arXiv 2018, arXiv:1805.09545. [Google Scholar]
  4. Denoyelle, Q. Theoretical and Numerical Analysis of Super-Resolution without Grid. Ph.D. Thesis, Université Paris Sciences et Lettres, Paris, France, 2018. [Google Scholar]
  5. Tibshirani, R. Regression Shrinkage and Selection Via the Lasso. J. R. Stat. Soc. Ser. B Methodol. 1996, 58, 267–288. [Google Scholar] [CrossRef]
  6. Duval, V.; Peyré, G. Sparse spikes super-resolution on thin grids II: The continuous basis pursuit. Inverse Probl. 2017, 33, 095008. [Google Scholar] [CrossRef][Green Version]
  7. Duval, V.; Peyré, G. Sparse regularization on thin grids I: The Lasso. Inverse Probl. 2017, 33, 055008. [Google Scholar] [CrossRef]
  8. Duval, V.; Peyré, G. Exact Support Recovery for Sparse Spikes Deconvolution. Found. Comput. Math. 2014, 15, 1315–1355. [Google Scholar] [CrossRef]
  9. Soubies, E.; Blanc-Féraud, L.; Aubert, G. A Continuous Exact l0 penalty (CEL0) for least squares regularized problem. SIAM J. Imaging Sci. 2015, 8, 1607–1639. [Google Scholar] [CrossRef]
  10. Candès, E.J.; Fernandez-Granda, C. Towards a Mathematical Theory of Super-resolution. Commun. Pure Appl. Math. 2013, 67, 906–956. [Google Scholar] [CrossRef][Green Version]
  11. Bredies, K.; Pikkarainen, H.K. Inverse problems in spaces of measures. ESAIM Control. Optim. Calc. Var. 2012, 19, 190–218. [Google Scholar] [CrossRef][Green Version]
  12. Castro, Y.D.; Gamboa, F.; Henrion, D.; Lasserre, J.B. Exact solutions to Super Resolution on semi-algebraic domains in higher dimensions. IEEE Trans. Inf. Theory 2017, 63, 621–630. [Google Scholar] [CrossRef][Green Version]
  13. Denoyelle, Q.; Duval, V.; Peyré, G.; Soubies, E. The sliding Frank–Wolfe algorithm and its application to super-resolution microscopy. Inverse Probl. 2019, 36, 014001. [Google Scholar] [CrossRef][Green Version]
  14. Federer, H. Geometric Measure Theory; Springer: Berlin/Heidelberg, Germany, 1996. [Google Scholar]
  15. Cohn, D.L. Measure Theory; Springer: New York, NY, USA, 2013. [Google Scholar] [CrossRef]
  16. Temam, R. Fonction Convexe d’une Mesure et Applications. In Séminaire Équations aux Dérivées Partielles (Polytechnique) dit Aussi “Séminaire Goulaouic-Schwartz”; 1982–1983; Available online: (accessed on 30 November 2021).
  17. De Castro, Y.; Gamboa, F. Exact reconstruction using Beurling minimal extrapolation. J. Math. Anal. Appl. 2012, 395, 336–354. [Google Scholar] [CrossRef][Green Version]
  18. Azais, J.M.; Castro, Y.D.; Gamboa, F. Spike detection from inaccurate samplings. Appl. Comput. Harmon. Anal. 2015, 38, 177–195. [Google Scholar] [CrossRef]
  19. Beurling, A. Sur les intégrales de Fourier absolument convergentes et leur application à une transformation fonctionnelle. In Proceedings of the Ninth Scandinavian Mathematical Congress, Helsinki, Finland; 1938; pp. 345–366. [Google Scholar]
  20. Ekeland, I.; Témam, R. Convex Analysis and Variational Problems; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 1999. [Google Scholar] [CrossRef][Green Version]
  21. Fernandez-Granda, C. Super-Resolution of Point Sources via Convex Programming. Inf. Inference J. IMA 2016, 5, 251–303. [Google Scholar] [CrossRef][Green Version]
  22. Denoyelle, Q.; Duval, V.; Peyré, G. Support Recovery for Sparse Super-Resolution of Positive Measures. J. Fourier Anal. Appl. 2017, 23, 1153–1194. [Google Scholar] [CrossRef][Green Version]
  23. Chizat, L. Sparse optimization on measures with over-parameterized gradient descent. Math. Program. 2021. [Google Scholar] [CrossRef]
  24. Traonmilin, Y.; Aujol, J.F. The basins of attraction of the global minimizers of the non-convex sparse spike estimation problem. Inverse Probl. 2020, 36, 045003. [Google Scholar] [CrossRef][Green Version]
  25. Traonmilin, Y.; Aujol, J.F.; Leclaire, A. Projected gradient descent for non-convex sparse spike estimation. IEEE Signal Process. Lett. 2020, 27, 1110–1114. [Google Scholar] [CrossRef]
  26. Hilbert, D. Ueber die Darstellung definiter Formen als Summe von Formenquadraten. Math. Ann. 1888, 32, 342–350. [Google Scholar] [CrossRef]
  27. Dumitrescu, B.A. Positive Trigonometric Polynomials and Signal Processing Applications; Springer: Berlin/Heidelberg, Germany, 2007. [Google Scholar]
  28. Lasserre, J.B. Global Optimization with Polynomials and the Problem of Moments. SIAM J. Optim. 2001, 11, 796–817. [Google Scholar] [CrossRef]
  29. Levitin, E.; Polyak, B. Constrained minimization methods. USSR Comput. Math. Math. Phys. 1966, 6, 1–50. [Google Scholar] [CrossRef]
  30. Frank, M.; Wolfe, P. An algorithm for quadratic programming. Nav. Res. Logist. Q. 1956, 3, 95–110. [Google Scholar] [CrossRef]
  31. Harchaoui, Z.; Juditsky, A.; Nemirovski, A. Conditional gradient algorithms for norm-regularized smooth convex optimization. Math. Program. 2014, 152, 75–112. [Google Scholar] [CrossRef][Green Version]
  32. Boyd, N.; Schiebinger, G.; Recht, B. The alternating descent conditional gradient method for sparse inverse problems. In Proceedings of the 2015 IEEE 6th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), Cancun, Mexico, 13–16 December 2015. [Google Scholar] [CrossRef][Green Version]
  33. Beck, A.; Teboulle, M. A fast Iterative Shrinkage-Thresholding Algorithm with application to wavelet-based image deblurring. In Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, Taipei, Taiwan, 19–24 April 2009. [Google Scholar] [CrossRef]
  34. Byrd, R.H.; Lu, P.; Nocedal, J.; Zhu, C. A Limited Memory Algorithm for Bound Constrained Optimization. SIAM J. Sci. Comput. 1995, 16, 1190–1208. [Google Scholar] [CrossRef]
  35. Ambrosio, L.; Gigli, N.; Savare, G. Gradient Flows in Metric Spaces and in the Space of Probability Measures; Springer: Berlin/Heidelberg, Germany, 2008. [Google Scholar] [CrossRef]
  36. Santambrogio, F. Euclidean, Metric, and Wasserstein Gradient Flows: An overview. arXiv 2016, arXiv:1609.03890. [Google Scholar] [CrossRef][Green Version]
  37. Peyré, G.; Cuturi, M. Computational Optimal Transport; Now Publishers Inc.: Hanover, MA, USA, 2019. [Google Scholar] [CrossRef]
  38. Lellmann, J.; Lorenz, D.A.; Schönlieb, C.; Valkonen, T. Imaging with Kantorovich-Rubinstein discrepancy. SIAM J. Imaging Sci. 2014, 7, 2833–2859. [Google Scholar] [CrossRef][Green Version]
  39. Denoyelle, Q.; Pham, T.-A.; del Aguila Pla, P.; Sage, D.; Unser, M. Optimal-transport-based metric for SMLM. In Proceedings of the 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), Nice, France, 13–16 April 2021. [Google Scholar] [CrossRef]
  40. Dossal, C.; Duval, V.; Poon, C. Sampling the Fourier transform along radial lines. SIAM J. Numer. Anal. 2017, 55, 2540–2564. [Google Scholar] [CrossRef][Green Version]
  41. Sage, D.; Kirshner, H.; Pengo, T.; Stuurman, N.; Min, J.; Manley, S.; Unser, M. Quantitative evaluation of software packages for single-molecule localization microscopy. Nat. Methods 2015, 12, 717–724. [Google Scholar] [CrossRef]
  42. Holden, S.J.; Uphoff, S.; Kapanidis, A.N. DAOSTORM: An algorithm for high-density super-resolution microscopy. Nat. Methods 2011, 8, 279–280. [Google Scholar] [CrossRef]
  43. Nehme, E.; Weiss, L.E.; Michaeli, T.; Shechtman, Y. Deep-STORM: Super-resolution single-molecule microscopy by deep learning. Optica 2018, 5, 458–464. [Google Scholar] [CrossRef]
  44. Javanshiri, H.; Nasr-Isfahani, R. The strict topology for the space of Radon measures on a locally compact Hausdorff space. Topol. Appl. 2013, 160, 887–895. [Google Scholar] [CrossRef]
Figure 1. (a) Discrete reconstruction, which can be seen as spikes with support constrained on a grid; (b) off-the-grid reconstruction, where the spikes can move continuously on the line. The red line is the acquisition y, orange spikes are the source (the cause we want to retrieve), blue spikes are the discrete reconstruction constrained on a grid, and green spikes can move freely since they are off-the-grid. Note that, when a source spike lies between two grid points, two spikes will be recovered in the discrete reconstruction.
Figure 2. (a) Certificates associated with acquisition y and noiseless y 0 , result of three δ -peaks (in black, plotted with 10 times their ground-truth amplitudes) through a Fourier measurement of cut-off frequency f c = 6 ; (b) localisation of the roots of the certificate associated with the dual maximum. All the roots (the three ground-truths and the spurious spike on the right) on the unit circle are interpreted as the support of the δ -peaks.
Figure 3. Reconstruction with the Interior Point Method for λ = 1 . The algorithm detected a spurious spike near 0.05; otherwise, amplitudes and positions of the peaks are correctly estimated.
Figure 4. Reconstruction by Sliding Frank-Wolfe for a 1D Fourier operator, with the same settings (y, noise realisations, λ = 1 ) as the former section. All ground-truth spikes are reconstructed, no spurious spike is detected.
Figure 5. (a) First iterate k = 0; (b) mid-computation k = 1; (c) end of the computation k = 2: results of the SFW reconstruction on the domain X = [0, 1]² for the Gaussian kernel with spread σ = 0.1 and additive Gaussian noise of variance 0.1. All δ-peaks are successfully recovered from the acquisition alone, λ = 3 × 10⁻².
Figure 6. Digest of the important quantities mentioned in [3,23]: red refers to M + X quantities, green to Ω N = def . ( R + × X ) and blue to the Wasserstein space P 2 ( Ω ) and theoretical results. Dashed lines correspond to the theoretical section, and continuous lines indicate the numerical part.
Figure 7. Reconstruction by Conic Particle Gradient Descent for a 1D Fourier operator in a noiseless setting, with the same ground-truth spikes as the former section. Implementation is an adaptation of [23], α = β = 1 × 10⁻³ and λ = 1 for 1000 iterations.
Figure 7. Reconstruction by Conic Particle Gradient Descent for a 1D Fourier operator in a noiseless setting, with the same ground-truth spikes as the former section. Implementation is an adaptation of [23], α = β = 1 × 10 3 and λ = 1 for 1000 iterations.
Figure 8. (a) Initialisation k = 0; (b) mid-computation k = 150; (c) end of the computation k = 1000. Conic Particle Gradient Descent applied to 2D Gaussian deconvolution: the red dots are the particle measure ν^(k) (dot size proportional to amplitude), the three white dots are the source measure, the background image is the noiseless acquisition y₀, and the black lines are the paths of the particles ν^(k); together, the paths constitute the gradient flow (ν_t)_{t ≥ 0}. The implementation is an adaptation of [23], with α = β = 1 × 10² and λ = 1.
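The particle dynamics of Figure 8 can be sketched in a few lines. The toy scheme below is not the adaptation of [23] used for the figures, but one possible 1D conic particle iteration for the objective 0.5‖Σᵢ rᵢ k(· − xᵢ) − y‖² + λ Σᵢ rᵢ, with an assumed Gaussian kernel k sampled on a grid s: the amplitudes rᵢ take multiplicative (mirror) steps, which keeps them strictly positive, while the positions xᵢ take plain gradient steps.

```python
import numpy as np

def cpgd_sketch(y, s, n_particles, sigma, lam, alpha, beta, n_iter, seed=0):
    """Toy conic particle gradient descent in 1D: multiplicative updates on
    the amplitudes r, additive gradient updates on the positions x."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, n_particles)       # particle positions
    r = np.full(n_particles, 1.0 / n_particles)  # particle amplitudes
    for _ in range(n_iter):
        # Forward operator: each column of K samples the kernel at one particle.
        K = np.exp(-(s[:, None] - x[None, :]) ** 2 / (2 * sigma ** 2))
        residual = K @ r - y
        grad_r = K.T @ residual + lam            # gradient w.r.t. amplitudes
        # d/dx_i of the Gaussian kernel brings the factor (s - x_i) / sigma^2.
        grad_x = r * ((K * (s[:, None] - x[None, :]) / sigma ** 2).T @ residual)
        r = r * np.exp(-alpha * grad_r)          # conic/mirror step: r stays > 0
        x = x - beta * grad_x
    return r, x
```

The multiplicative step on r is what makes the scheme "conic": particles whose amplitude should vanish decay exponentially instead of crossing zero, while surviving particles drift towards the support of the source measure, tracing paths like the black lines in Figure 8.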
Figure 9. (a) Ground-truth tubulins; the two insets in the square below show excerpts of the acquisition stack (convolution plus all the noise described above). (b) Measure reconstructed by Sliding Frank-Wolfe, visualised through a Gaussian kernel with a smaller σ (see text).
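Visualising a reconstructed measure as in Figure 9b amounts to rasterising a sum of Dirac masses through a Gaussian narrower than the PSF. The helper below is a hypothetical illustration of that step (names, image size, and the σ_vis parameter are ours, not from the paper's code):

```python
import numpy as np

def render_measure(positions, amplitudes, shape=(64, 64), sigma_vis=1.5):
    """Rasterise the discrete measure sum_i a_i * delta_{x_i}: each Dirac is
    replaced by a narrow Gaussian blob, for display purposes only."""
    yy, xx = np.mgrid[0:shape[0], 0:shape[1]]
    img = np.zeros(shape)
    for (px, py), a in zip(positions, amplitudes):
        img += a * np.exp(-((xx - px) ** 2 + (yy - py) ** 2)
                          / (2 * sigma_vis ** 2))
    return img
```

Choosing sigma_vis smaller than the acquisition PSF spread is precisely what conveys the super-resolution gain: spikes closer than the PSF width remain visually separated in the rendered image.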
Figure 10. (top left) Excerpt of the stack; (top right) mean of the stack; (bottom left) reconstruction by the off-the-grid method; (bottom right) reconstruction by Deep-STORM.
Table 1. Pros and cons of the different off-the-grid algorithmic strategies: semi-definite programming (SDP), the Sliding Frank-Wolfe (SFW) algorithm, and Conic Particle Gradient Descent (CPGD).
Algorithm      Operator   Space X       Convergence Rate   Computation Time   Tuning Parameters
SDP [10]       Fourier    Torus T^d     Asymptotic         Mild               λ
SFW [11,13]    All        Any compact   Sublinear          Long               λ
CPGD [23]      All        Torus T^d     log(ε)             Quick              λ, α, β
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Laville, B.; Blanc-Féraud, L.; Aubert, G. Off-The-Grid Variational Sparse Spike Recovery: Methods and Algorithms. J. Imaging 2021, 7, 266.
