Off-The-Grid Variational Sparse Spike Recovery: Methods and Algorithms

Gridless sparse spike reconstruction is a relatively new research field with significant results for the super-resolution problem, where one wants to retrieve fine-scale details from a noisy and filtered acquisition. To tackle this problem, we are interested in optimisation under a prior, typically sparsity, i.e., the source is composed of spikes. Following the seminal work on the generalised LASSO for measures, called the Beurling-LASSO (BLASSO), we review the chief theoretical and numerical breakthroughs of the off-the-grid inverse problem, and we illustrate its usefulness for the super-resolution problem in Single Molecule Localisation Microscopy (SMLM) through new reconstruction metrics and tests on synthetic and real SMLM data performed for this review.


Introduction
In this paper, we propose to conduct a comprehensive review of the so-called off-the-grid variational methods for solving the sparse spike recovery problem. We will exhibit the main theoretical and numerical results in the literature, underlining the interest of these methods for various domains dealing with inverse problems. As part of this review and our former work on gridless methods, we developed an implementation of the most compelling numerical methods with a focus on efficiency and computation time. With this implementation, we were able to apply off-the-grid methods to the fluorescence microscopy super-resolution problem. The codes and the computed results are an addition to the off-the-grid literature, and constitute further evidence supporting the relevance of this domain in the inverse problem field.
Loosely speaking, inverse problems consist in the reconstruction of causes from consequences. The problem is generally ill-posed, meaning that existence, uniqueness, and stability of a solution are not guaranteed. A case arising in numerous fields such as image or signal processing, telecommunications, machine learning, super-resolution, etc., is the sparse spike problem. It consists of the reconstruction of spikes located on a domain $\mathcal{X}$ from an acquisition $y$, with a prior of sparsity on the cause; in layman's terms, the source is composed of a few spikes. This includes sources such as stars in astronomy, fractures in seismology, etc. A spike is typically modelled by a Dirac measure $a\delta_x$ with amplitude $a \in \mathbb{C}$ and position $x \in \mathcal{X}$. All the difficulty lies in the estimation of the number $N$ of spikes, of their amplitudes $(a_i)_{i=1}^N$ and of their positions $(x_i)_{i=1}^N$. Hence, the goal is to reconstruct the measure $m = \sum_{i=1}^N a_i \delta_{x_i}$ only from a few observations $y$ in a Hilbert space $\mathcal{H}$ (typically $L^2(\mathcal{X})$), linked to $m$ through an operator $\Phi$ accounting for the deterioration of the input (blur, downsizing by the sampling):
$$y \stackrel{\text{def.}}{=} \Phi m + w,$$
where $w \in \mathcal{H}$ is an additive noise. The reconstruction of the spikes may be off-the-grid, i.e., the positions $(x_i)_{i=1}^N$ are not constrained to a grid, hence not limited to a finite set of values: this allows interesting new mathematical insights and guarantees for the reconstruction, at the cost of some challenges in the numerical implementation. The general sparse spike problem is encountered in many situations, such as:
• compressed sensing [1], where one wants to recover an $s$-sparse vector $v_0 \in \mathbb{C}^N$ from $M$ measurements $Av_0$, where $A \in \mathbb{C}^{M\times N}$;
• machine learning, sketching of mixtures, etc. For example, one wants to fit a probability distribution to given data. The point is to estimate parameters $(a_i) \in \mathbb{R}^N$ and $(x_i) \in \mathcal{X}^N$ of a mixture $\sum_{i=1}^N a_i \varphi(x_i)$ of $N$ elementary distributions described by $\varphi$. For instance, one wants to retrieve the means $\mu_i \in \mathbb{R}$ and standard deviations $\sigma_i \in \mathbb{R}_+$ of a Gaussian mixture; see [2] for more insights on this question;
• deep learning, such as training neural networks with a single hidden layer [3];
• signal processing, for instance low-rank tensor decomposition for Direction of Arrival estimation through sensor arrays (multiple sampling points);
• super-resolution, a rather central problem in image processing. Roughly speaking, it consists of the reconstruction of details from an altered input signal/image. It includes a classic physical operator of acquisition such as Fourier measurements, the Laplace transform, or a Gaussian convolution.
The latter item will be our case of interest in the sparse spike problem for this paper. All the difficulty stems from the degradation in the acquisition process, which in general entails two things:
• a deterioration by the acquisition system, typically modelled by the Point Spread Function in imagery, which acts as a low-pass filter;
• the sensor acquisition, which results in sampling and pollution by noise of different types, characterised by densities such as Gaussian, Poisson, etc.
To sum up, we want to reconstruct the correct number of spikes with correct amplitudes and positions in the continuous setting, from a noisy and filtered discrete acquisition. The problem can be tackled from the theoretical point of view by either a variational approach or Prony's method:
• Prony's method and its variants such as MUSIC (MUltiple SIgnal Classification), ESPRIT (Estimation of Signal Parameters by Rotational Invariance Techniques) or Matrix Pencil recover the signal source from Fourier measurements in a noiseless 1D setting. It consists of the decomposition of the signal onto a basis of exponentials with different amplitudes, damping factors, frequencies, and phase angles to match the observed data. The results are compelling in the 1D noiseless case and can be extended to a multivariate and noisy context, but these methods still lack versatility since they cannot always be extended to the context of interest. Thus, we will not consider this approach in this paper;
• the variational approach does not impose any particular structure on the acquisition operator, can be adapted to any type of noise and does not need any prior on the number of point sources [4]. The key idea is to solve the inverse problem by finding, among all possible signal sources, the one minimising an objective function called the energy, formulated as a trade-off between a data fidelity term and a regularisation term, here typically enforcing the sparsity prior.
Then, there are two types of variational approaches: the discrete and the off-the-grid. In the discrete setting, one seeks to recover the spikes on a prescribed fine grid, typically with more points than the acquisition image. Indeed, we speak of a coarse grid for the low-resolved acquisition, and of a fine grid, finer by a so-called super-resolution factor $q \in \mathbb{N}^*$, for the reconstruction. Thus, it consists of a finite dimensional problem, where the positions of the spikes must lie on a grid $\mathcal{G}$ of $L$ points meshing the domain $\mathcal{X}$. This is a problem of sparse vector reconstruction, and it can be tackled by enforcing sparsity through minimisation of the $\ell_1$ norm of the unknown vector. This is known as the LASSO [5] or Basis-Pursuit problem, defined as the variational problem with tuning parameter $\lambda > 0$ controlling the trade-off between fidelity to the data and enforcement of the prior:
$$\min_{a\in\mathbb{R}^L} \underbrace{\|y - \Phi_L a\|_{\mathcal{H}}^2}_{\text{data term}} + \lambda \underbrace{\|a\|_1}_{\text{sparsity prior}} \qquad \text{(LASSO)}$$
where $\Phi_L : \mathbb{R}^L \to \mathcal{H}$ is the acquisition operator taking a vector of size $L$ as input and $\mathcal{H}$ is a Hilbert space. A grid is useful to epitomise the concept of sparsity in the case of spikes: indeed, sparsity is just the fact that only a few of the $L$ grid points have a non-zero value. Moreover, since a computer can only store array and vector quantities, it seems rather fair to work with a finite dimensional problem, even for the theoretical analysis. However, how does one choose the discretisation? A grid with too small a step-size yields numerical instabilities [6], while too large a step-size leads to coarse approximation errors on the positions. Moreover, one would like to localise the spikes as precisely as possible without having to rely on a grid: a discretisation of the positions necessarily conveys an approximation of the positions. The appropriate mathematical object to get rid of these discretisation drawbacks represents a collection of spikes by Dirac measures, elements of the space of Radon measures $\mathcal{M}(\mathcal{X})$.
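Before moving to measures, the grid-based LASSO above can be illustrated numerically with a proximal gradient (ISTA) iteration. The following sketch uses a hypothetical 1D Gaussian convolution operator; the grid sizes, kernel width and value of $\lambda$ are our own illustrative choices, not values from the review:

```python
import numpy as np

def ista_lasso(y, Phi, lam, n_iter=500):
    """Minimise 0.5*||y - Phi@a||^2 + lam*||a||_1 by proximal gradient (ISTA)."""
    L = np.linalg.norm(Phi, 2) ** 2        # Lipschitz constant of the data-term gradient
    a = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        z = a - Phi.T @ (Phi @ a - y) / L  # explicit gradient step on the data term
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # prox of the l1 norm
    return a

# Hypothetical 1D setting: Gaussian convolution observed on a coarse sensor grid.
grid = np.linspace(0, 1, 100)              # fine reconstruction grid, L = 100
sensors = np.linspace(0, 1, 30)            # coarse acquisition samples
Phi = np.exp(-(sensors[:, None] - grid[None, :]) ** 2 / (2 * 0.02 ** 2))
a_true = np.zeros(100); a_true[[25, 70]] = [1.0, 0.8]  # two on-grid ground-truth spikes
y = Phi @ a_true + 0.01 * np.random.default_rng(0).normal(size=30)
a_hat = ista_lasso(y, Phi, lam=0.1)        # sparse vector peaking near indices 25 and 70
```

Note that the recovered support can only be as precise as the grid: this is exactly the discretisation drawback discussed above.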
The operator of acquisition is now $\Phi : \mathcal{M}(\mathcal{X}) \to \mathcal{H}$, and the sparsity is enforced by a norm on $\mathcal{M}(\mathcal{X})$ called the TV-norm. This variational problem is called the BLASSO (for Beurling LASSO):
$$\min_{m\in\mathcal{M}(\mathcal{X})} \frac{1}{2}\|\Phi m - y\|_{\mathcal{H}}^2 + \lambda |m|(\mathcal{X}). \qquad \text{(BLASSO)}$$
In this latter setting, the spikes can move continuously on the domain $\mathcal{X}$: a comparison between the discrete and the off-the-grid reconstruction is given in Figure 1. The off-the-grid setting can be seen as the limit of the discrete case with a finer and finer grid [7].
Figure 1. (a) Discrete reconstruction, which can be seen as spikes with support constrained on a grid; (b) off-the-grid reconstruction, where the spikes can move continuously on the line. The red line is the acquisition y, orange spikes are the source (the cause we want to retrieve), blue spikes are the discrete reconstruction constrained on a grid, and green spikes can move freely since they are off-the-grid. Note that, when a source spike falls between two grid points, two spikes are recovered in the discrete reconstruction.
This shift from the discrete domain to the continuous setting, called off-the-grid or gridless, leads to some crucial mathematical insights, in particular a sharp signal-dependent criterion for stable spike recovery [8], the minimum separation distance (see the next section). Obviously, some difficulties also arise due to the infinite dimension and the lack of algebraic properties of the optimisation set. The comparison between the discrete and gridless settings may be summed up as follows:
• the discrete problem is tackled by LASSO, through the minimisation of a convex function defined on a fine grid, i.e., a convenient finite dimensional Hilbert space $\mathbb{R}^L$. Due to the $\ell_1$ norm, there are some cases where the sparsity is not properly enforced: one can then replace the $\ell_1$ norm by the non-continuous pseudo-norm $\ell_0$, but this yields an NP-hard combinatorial non-convex problem. There exist continuous relaxations of $\ell_0$ such as CEL0 [9] but, due to the non-convexity, the problem remains hard from the theoretical and numerical points of view. Despite the lack of guarantees, there are numerous algorithms to compute the solution of LASSO or its $\ell_0$-relaxed variant;
• the off-the-grid problem is treated by BLASSO, a convex functional defined on $\mathcal{M}(\mathcal{X})$. The convexity is handy from the theoretical point of view as it leads to some crucial insights on existence/uniqueness/support estimation w.r.t. noise, at the cost of the optimisation set, namely $\mathcal{M}(\mathcal{X})$: a Banach space (no Hilbertian structure, hence no straightforward proximal algorithm), infinite dimensional and non-reflexive for the strong topology (convergence results are then essentially for the weak-* topology). Despite this lack of algebraic properties, one currently has a wide range of algorithms to tackle this problem, such as root-finding or greedy algorithms.
Gridless reconstruction can then be evaluated through suitable metrics, namely the Flat Metric based on optimal transport of measures. This metric assesses the quality of the reconstruction and can be applied straightforwardly to off-the-grid and even discrete reconstruction outputs.
In the following, we give a review of the key results in the variational off-the-grid domain. The paper is organised in three sections, namely:
• the variational analysis of the space $\mathcal{M}(\mathcal{X})$, the properties and the guarantees of reconstruction concerning the sparse spike problem are now quite well-documented [8,10-12] and will be recalled in the theoretical Section 2;
• multiple strategies have been considered to numerically tackle BLASSO; the most compelling will be presented and put into context in the numerical Section 3;
• interesting practical applications and new metrics have been considered for the gridless method, such as SMLM super-resolution; these results are shown and discussed in Section 4.
At the end of each paragraph, a grey box (beginning either with 'summary' or 'shorthand') like this one will recall the main results highlighted in the section. Please refer to it for a quick summary.

A Theoretical Background for Gridless Spike Recovery
In the following, $\mathcal{X}$ denotes the ambient space where the positions of the spikes live. We suppose $\mathcal{X}$ is a subset of $\mathbb{R}^d$ such that its interior $\mathring{\mathcal{X}}$ is a submanifold of dimension $d \in \mathbb{N}^*$ [13]. This setting encompasses $\mathcal{X} = \mathbb{R}^d$, the torus $\mathcal{X} = \mathbb{T}^d \stackrel{\text{def.}}{=} \mathbb{R}^d/\mathbb{Z}^d$, any compact set with non-empty interior, etc. The reader is invited to take a look at Table A1 for a reminder of the notations.

What Is a Measure?
As stated in the section above, the Dirac measure is the proper object to describe a spike not constrained to a finite set of positions. This object is not a function, since one cannot exhibit any integrable equivalence class satisfying the properties of the Dirac (see below). Thus, one should consider the notion of Radon measure, a formal extension of functions. From a distributional standpoint, it is a subset of the distribution space $\mathcal{D}'(\mathcal{X})$, namely the space of linear forms over the space of test functions $\mathcal{D}(\mathcal{X})$, i.e., smooth functions (continuous derivatives of all orders) compactly supported. This functional approach consists in the definition of a measure as a linear form on some function space, namely:

Definition 1 (Evanescent continuous functions on $\mathcal{X}$). We call $C_0(\mathcal{X}, Y)$ the set of continuous functions vanishing at infinity (or evanescent), namely all continuous maps $\psi : \mathcal{X} \to Y$ such that for all $\varepsilon > 0$, the set $\{x \in \mathcal{X} : \|\psi(x)\| \ge \varepsilon\}$ is compact. When $Y = \mathbb{R}$, we will simply write $C_0(\mathcal{X})$.

Since we now have a suitable space of test functions, we need to make precise the notion of duality at stake in this review.

Definition 2 (Topological dual space). If $E$ is a topological vector space, we denote by $E^*$ its topological dual, i.e., the space of all continuous linear forms $\psi : E \to \mathbb{R}$. The pairing between an element $\varphi \in E$ and a map $\psi \in E^*$ is denoted by the bilinear mapping $\langle \varphi, \psi \rangle_{E\times E^*} \stackrel{\text{def.}}{=} \psi(\varphi)$.
This notion allows us to define the Radon measure through duality in the following definition.

Definition 3 (Set of Radon Measures).
We denote by $\mathcal{M}(\mathcal{X})$ the set of real signed Radon measures on $\mathcal{X}$ of finite mass. It is the topological dual of $C_0(\mathcal{X})$ with the supremum norm $\|\cdot\|_{\infty,\mathcal{X}}$, by the Riesz-Markov representation theorem. It can also be defined as the topological dual of the space of continuous functions $C(\mathcal{X})$ if $\mathcal{X}$ is compact [14]. Thus, a Radon measure $m$ is a continuous linear form evaluated on functions $f \in C_0(\mathcal{X})$, with, for $m \in \mathcal{M}(\mathcal{X})$, the duality bracket denoted by $\langle f, m \rangle_{C_0(\mathcal{X})\times\mathcal{M}(\mathcal{X})} = \int_{\mathcal{X}} f \,\mathrm{d}m$.
The term 'signed' refers to the generalisation of the concept of (positive) measure, by allowing the quantity $\langle f, m \rangle_{C_0(\mathcal{X})\times\mathcal{M}(\mathcal{X})}$ to be negative. We can define in the same way the space of real non-negative Radon measures $\mathcal{M}_+(\mathcal{X})$, dual of $C_0(\mathcal{X}, \mathbb{R}_+)$, and the space of complex Radon measures $\mathcal{M}_{\mathbb{C}}(\mathcal{X})$, dual of $C_0(\mathcal{X}, \mathbb{C})$. Classic examples of Radon measures are:
• the Lebesgue measure of dimension $d \in \mathbb{N}$;
• the Dirac measure $\delta_z$ centred at $z \in \mathcal{X}$, also called the $\delta$-peak. For all $f \in C_0(\mathcal{X})$, one has $\langle f, \delta_z \rangle_{C_0(\mathcal{X})\times\mathcal{M}(\mathcal{X})} = f(z)$;
• discrete measures $m_{a,x} \stackrel{\text{def.}}{=} \sum_{i=1}^N a_i \delta_{x_i}$.
The space $\mathcal{M}(\mathcal{X})$ is a Banach space [11] when endowed with its dual norm, called the total variation (TV) norm, defined for $m \in \mathcal{M}(\mathcal{X})$ by:
$$|m|(\mathcal{X}) \stackrel{\text{def.}}{=} \sup\left\{ \int_{\mathcal{X}} f \,\mathrm{d}m \; ; \; f \in C_0(\mathcal{X}),\ \|f\|_{\infty,\mathcal{X}} \le 1 \right\}.$$
The TV norm of a measure is also called its mass. One can note that, in the case of a discrete measure $m_{a,x} \stackrel{\text{def.}}{=} \sum_{i=1}^N a_i \delta_{x_i}$ defined as before, one has $|m_{a,x}|(\mathcal{X}) = \|a\|_1$. The interested reader may take a look at Appendix B.1 for more details on some functional analysis notions and results.
Summary: we model a spike by a Dirac measure, an element of the Radon measure space $\mathcal{M}(\mathcal{X})$. This space is defined by duality; it is endowed with the TV-norm and is complete. It is, however, infinite dimensional and non-reflexive (see Appendix B.1); this poses additional difficulties to be taken into account in the optimisation.
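To fix ideas, a discrete measure and the identity $|m_{a,x}|(\mathcal{X}) = \|a\|_1$ are easy to encode; the minimal class below (our own sketch, unrelated to the review's implementation) also materialises the duality bracket of a Dirac:

```python
import numpy as np

class DiscreteMeasure:
    """m_{a,x} = sum_i a_i * delta_{x_i}: a discrete signed Radon measure on X ⊂ R."""
    def __init__(self, amplitudes, positions):
        self.a = np.asarray(amplitudes, dtype=float)
        self.x = np.asarray(positions, dtype=float)

    def tv_norm(self):
        # For a discrete measure, |m|(X) reduces to the l1 norm of the amplitudes.
        return float(np.sum(np.abs(self.a)))

    def pair(self, f):
        # Duality bracket <f, m> = ∫_X f dm = sum_i a_i f(x_i): a Dirac evaluates f.
        return float(np.sum(self.a * f(self.x)))

m = DiscreteMeasure([1.0, -0.5], [0.2, 0.7])
print(m.tv_norm())   # 1.5 = ||a||_1, the mass of m
```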

Observations
Let us introduce the space where the acquired data live. We will denote by $\mathcal{H}$ this Hilbert space; in the instance of images, $\mathcal{H} = L^2(\mathcal{X})$. Let $m \in \mathcal{M}(\mathcal{X})$ be the source measure; we call acquisition $y \in \mathcal{H}$ the result of the forward/acquisition map $\Phi : \mathcal{M}(\mathcal{X}) \to \mathcal{H}$ evaluated on $m$, with measurement kernel $\varphi : \mathcal{X} \to \mathcal{H}$:
$$\Phi m \stackrel{\text{def.}}{=} \int_{\mathcal{X}} \varphi(x) \,\mathrm{d}m(x). \qquad (1)$$
The latter integral ought not to be confused with the duality bracket $\langle f, m \rangle_{C_0(\mathcal{X})\times\mathcal{M}(\mathcal{X})} = \int_{\mathcal{X}} f(x) \,\mathrm{d}m(x)$ mentioned in Definition 3 above. Indeed, while $f(x) \in \mathbb{R}$ for $x \in \mathcal{X}$, we have $\varphi(x) \in \mathcal{H}$: the integral in (1) is then a Bochner integral [15], i.e., the proper notion to deal with vector-valued maps. It is valid as long as $\varphi$ is continuous and bounded [3,13].

Remark 1.
Measures are objects that generalise functions at the cost of losing some of their properties. Thus, one cannot define a product of measures (what would be the square of the Dirac?) and one ought to be aware of some caveats concerning functionals of measures: these functionals need to be at most (sub)linear in order to be well-defined [16].
In the following, we will impose $\varphi \in C^2(\mathcal{X}, \mathcal{H})$. Let us also define the adjoint operator of $\Phi : \mathcal{M}(\mathcal{X}) \to \mathcal{H}$ in the weak-* topology, namely the map $\Phi^* : \mathcal{H} \to C_0(\mathcal{X})$. It is defined for all $x \in \mathcal{X}$ and $p \in \mathcal{H}$ by $\Phi^*(p)(x) = \langle p, \varphi(x) \rangle_{\mathcal{H}}$. The choice of $\varphi$ and $\mathcal{H}$ depends on the physical process of acquisition; indeed, generic measurement kernels are:
• the convolution kernel, with typically $\mathcal{H} = L^2(\mathcal{X})$ and $\forall x \in \mathcal{X}$, $\varphi(x) \stackrel{\text{def.}}{=} \tilde\varphi(\cdot - x)$ for an impulse response $\tilde\varphi \in L^2(\mathcal{X})$, typically a Gaussian function;
• the Fourier kernel with cut-off frequency $f_c \in \mathbb{N}$ and $\mathcal{H} = \mathbb{C}^{2f_c+1}$, for $x \in \mathcal{X} = \mathbb{T}$ in 1D: $\varphi(x) \stackrel{\text{def.}}{=} \left(e^{-2i\pi k x}\right)_{-f_c \le k \le f_c}$;
• the Laplace kernel [4], for a non-negative weighting function $\xi \in C(\mathcal{X})$ specific to the physical acquisition process and $\mathcal{H} = L^2(\mathbb{R}_+)$: $\varphi(x) \stackrel{\text{def.}}{=} s \mapsto \xi(x)\, e^{-s x}$.
These three kernels correspond to various physical contexts of imagery, hence they are encountered in multiple acquisition processes, such as Nuclear Magnetic Resonance spectroscopy (Fourier), SMLM super-resolution (convolution), MA-TIRF (Laplace), etc.
From now on, we will use the following notation for the discrete forward map: for $x = (x_1, \dots, x_N)$ and $a \in \mathbb{R}^N$, $\Phi_x(a) \stackrel{\text{def.}}{=} \Phi m_{a,x} = \sum_{i=1}^N a_i \varphi(x_i)$.
Shorthand: the acquisition of a measure m, living in the Hilbert space H, is the quantity Φm. Φ is the forward operator, completely defined by a kernel φ specific to the physical context of imagery.
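The discrete forward map $\Phi_x(a) = \sum_i a_i \varphi(x_i)$ and the adjoint $\Phi^*(p)(x) = \langle p, \varphi(x)\rangle_{\mathcal{H}}$ can be sketched as follows for a Gaussian convolution kernel sampled on sensors; the PSF width, sensor grid and spike parameters are illustrative assumptions of ours:

```python
import numpy as np

SIGMA = 0.05                                   # width of a hypothetical Gaussian PSF

def phi(x, s):
    """phi(x) ∈ H sampled at sensor points s, one column per position (Gaussian kernel)."""
    return np.exp(-(s[:, None] - np.atleast_1d(x)[None, :]) ** 2 / (2 * SIGMA ** 2))

def forward(a, x, s):
    """Phi_x(a) = Phi m_{a,x} = sum_i a_i phi(x_i): acquisition of the discrete measure."""
    return phi(x, s) @ np.asarray(a, dtype=float)

def adjoint(p, x_eval, s):
    """Phi*(p)(x) = <p, phi(x)>_H, evaluated at each position in x_eval."""
    return phi(x_eval, s).T @ np.asarray(p, dtype=float)

s = np.linspace(0, 1, 64)                      # sensor sampling of H
y = forward([1.0, 0.8], [0.3, 0.75], s)        # noiseless acquisition of two spikes
eta = adjoint(y, np.linspace(0, 1, 200), s)    # correlation map Phi*y, peaking near the spikes
```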

An Off-The-Grid Functional: The BLASSO
Let $m_{a_0,x_0} \stackrel{\text{def.}}{=} \sum_{i=1}^N a_{0,i} \delta_{x_{0,i}}$ be the source measure with amplitudes $a_0 \in \mathbb{R}^N$ and positions $x_0 \in \mathcal{X}^N$; the sparse spike problem is to recover this measure from the acquisition $y \stackrel{\text{def.}}{=} \Phi m_{a_0,x_0} + w$, where $w \in \mathcal{H}$ is an additive noise, typically white Gaussian noise. To tackle this problem, we use the following convex functional [11,17], called the BLASSO, which stands for Beurling-LASSO:
$$\operatorname*{argmin}_{m\in\mathcal{M}(\mathcal{X})} T_\lambda(m) \stackrel{\text{def.}}{=} \frac{1}{2}\|\Phi m - y\|_{\mathcal{H}}^2 + \lambda |m|(\mathcal{X}), \qquad (\mathcal{P}_\lambda(y))$$
with regularisation parameter $\lambda > 0$, which accounts for the trade-off between fidelity and sparsity of the reconstruction. The name BLASSO was coined in the work of [17,18], according to the link between the Generalised Minimal Extrapolation (GME) problem, where one seeks to reconstruct a Radon measure from several observations of its Fourier coefficients, and the work [19] of the Swedish mathematician Beurling, which coincides with GME in the case of a Fourier forward operator.
The BLASSO in a noiseless setting writes down:
$$\operatorname*{argmin}_{\Phi m = y_0} |m|(\mathcal{X}) \quad \text{with } y_0 = \Phi m_{a_0,x_0}. \qquad (\mathcal{P}_0(y_0))$$
BLASSO is genuinely linked with its discrete counterpart, the (LASSO) [8]: one can formally see BLASSO as the functional limit of LASSO on a finer and finer grid. If the LASSO problem exhibits existence and uniqueness of the solution, what can one say for its off-the-grid counterpart? First of all, let us observe that:
• $m \mapsto |m|(\mathcal{X})$ is lower semi-continuous w.r.t. the weak-* convergence (see Appendix B.1 for more insights);
• $\Phi$ is continuous from the weak-* topology of $\mathcal{M}(\mathcal{X})$ to the weak topology of $\mathcal{H}$.
Thus, one can establish the existence of solutions to (P λ (y)) thanks to convex analysis results, as proved in [11].
Summary: the sparse spike problem is tractable thanks to the convex functional on M(X ) called the BLASSO and denoted by (P λ (y)). With m ∈ M(X ) as an input, it consists of a data term comparing observed data versus Φm, and a regularisation accounting for sparsity prior through the TV-norm of m. Existence of solutions of the BLASSO is known and proved.
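Concretely, once $\mathcal{H}$ is discretised on sensor samples, the BLASSO energy of a discrete candidate measure is cheap to evaluate. A minimal sketch, with an assumed Gaussian kernel and hand-picked parameters of our own:

```python
import numpy as np

def blasso_energy(a, x, y, s, lam, sigma=0.05):
    """T_lam(m_{a,x}) = 0.5*||Phi m_{a,x} - y||^2 + lam*||a||_1 (sensors sampled at s)."""
    Phi = np.exp(-(s[:, None] - np.asarray(x, dtype=float)[None, :]) ** 2 / (2 * sigma ** 2))
    residual = Phi @ np.asarray(a, dtype=float) - y
    return 0.5 * np.sum(residual ** 2) + lam * np.sum(np.abs(np.asarray(a, dtype=float)))

s = np.linspace(0, 1, 64)
y = np.exp(-(s - 0.4) ** 2 / (2 * 0.05 ** 2))          # noiseless acquisition of 1.0*δ_0.4
E_truth = blasso_energy([1.0], [0.4], y, s, lam=0.01)  # zero residual: only λ|m|(X) = 0.01
E_null  = blasso_energy([], [], y, s, lam=0.01)        # empty measure: 0.5*||y||²
```

For a small λ and noiseless data, the ground-truth measure has a much lower energy than the null measure, as expected from the trade-off between the data term and the TV norm.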
The difficulties now lie in the following questions:
1. what are the conditions to recover a sparse measure, within a certain noise regime? Is the minimum unique?
2. under which conditions can we retrieve exactly the number of spikes, the amplitudes, and the positions; when do we have support stability?
3. how can we numerically tackle the infinite dimensional and non-reflexive nature of the space $\mathcal{M}(\mathcal{X})$?
In order to address these points, we need to introduce some notions of convex analysis in the following subsection.

Dual Problems and Certificates
The BLASSO in Equation $(\mathcal{P}_\lambda(y))$ above is a minimisation problem with a convex functional. We can thus apply Fenchel duality results and define a dual problem, which writes down for $p \in \mathcal{H}$ (see Appendix B.2 for the proof):
$$\sup_{\|\Phi^* p\|_{\infty,\mathcal{X}} \le 1} \langle y, p \rangle_{\mathcal{H}} - \frac{\lambda}{2} \|p\|_{\mathcal{H}}^2, \qquad (\mathcal{D}_\lambda(y))$$
which can be recast as the projection onto a closed convex set [11,18]:
$$p_\lambda = \operatorname{proj}_{\{p \,;\, \|\Phi^* p\|_{\infty,\mathcal{X}} \le 1\}}(y/\lambda).$$
Fenchel duality between $(\mathcal{P}_\lambda(y))$ and $(\mathcal{D}_\lambda(y))$ is proved in [11]. Therefore, any solution $m_\lambda$ of $(\mathcal{P}_\lambda(y))$ is linked [8] to the unique solution $p_\lambda$ of $(\mathcal{D}_\lambda(y))$ by the extremality conditions:
$$\Phi^* p_\lambda \in \partial |m_\lambda|(\mathcal{X}), \qquad p_\lambda = \frac{1}{\lambda}(y - \Phi m_\lambda), \qquad (2)$$
where $\partial|\cdot|(\mathcal{X})$ is the sub-differential of the TV norm. Indeed, since the total variation is not differentiable (like the $\ell_1$ norm), but lower semi-continuous w.r.t. the weak-* topology, we use its sub-differential, which for $m \in \mathcal{M}(\mathcal{X})$ identifies to:
$$\partial |m|(\mathcal{X}) = \left\{ \eta \in C_0(\mathcal{X}) \; ; \; \|\eta\|_{\infty,\mathcal{X}} \le 1 \text{ and } \int_{\mathcal{X}} \eta \,\mathrm{d}m = |m|(\mathcal{X}) \right\}. \qquad (3)$$
Elements of this subgradient are called certificates. Thanks to strong duality, one can define peculiar certificates called the dual certificates [10], namely $\eta_\lambda \stackrel{\text{def.}}{=} \Phi^* p_\lambda$.
It is a certificate since $\Phi^* p_\lambda \in \partial|m_\lambda|(\mathcal{X})$, and it is called dual because it verifies the second extremality condition of (2): it is thus defined by the dual solution $p_\lambda$. Loosely speaking, a dual certificate $\eta_\lambda$ is associated with a measure $m_\lambda$, and it certifies that the measure $m_\lambda$ is a minimum of the BLASSO. For instance, if there exist solutions of $(\mathcal{P}_\lambda(y))$ of the form $m_\lambda \stackrel{\text{def.}}{=} \sum_{i=1}^N a_i \delta_{x_i}$, then the dual certificate saturates on the support: $\eta_\lambda(x_i) = \operatorname{sign}(a_i)$ for all $1 \le i \le N$.
In the same fashion, one has the link between a solution m 0 of the noiseless BLASSO (P 0 (y 0 )) and its certificates η 0 , which are not unique in general. Then, in the rest of the document, we will refer to η 0 as the minimal norm certificate i.e., the dual certificate η 0 with minimal supremum norm η 0 ∞,X . It is shown in [8] that this minimal norm certificate η 0 has important properties, since it somehow drives the stability of the recovered spike locations when the additive noise is small, in particular how close they are to the positions of the true measure m a 0 ,x 0 : see Definition 6 in the section below.
Summary: we defined the primal problem in the former section, thanks to convexity, we can define the dual problem of the BLASSO. A solution m λ of the BLASSO and a solution p λ of the dual problem are linked through extremality condition. The dual solution p λ defines the dual certificate, an element of the subgradient specified by η λ = Φ * p λ : the dual certificate η λ certifies that m λ is a solution of the BLASSO. We can then establish more precise conditions on the uniqueness/support recovery.
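The extremality conditions can be verified numerically on a toy example: fixing the support to a single known position reduces the BLASSO to a one-atom LASSO, whose amplitude is a soft-thresholded correlation, and the resulting certificate must saturate at that position. This only checks saturation on the fixed support (not global optimality over $\mathcal{X}$), and all parameters are illustrative assumptions:

```python
import numpy as np

s = np.linspace(0, 1, 64)                           # sensor samples
phi0 = np.exp(-(s - 0.4) ** 2 / (2 * 0.05 ** 2))    # atom phi(x0) for a spike at x0 = 0.4
y = 1.0 * phi0                                      # noiseless acquisition of a0*δ_{x0}
lam = 0.1

# LASSO with the support {x0} held fixed: the amplitude is a soft-thresholded correlation.
a_lam = max(phi0 @ y - lam, 0.0) / (phi0 @ phi0)

# Extremality condition: p_lam = (y - Phi m_lam)/lam defines the certificate eta = Phi* p_lam.
p_lam = (y - a_lam * phi0) / lam
eta_at_x0 = p_lam @ phi0                            # eta(x0) must equal sign(a_lam) = 1
```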

Support Recovery Guarantees
We will address in this section the first two questions we have laid down, namely existence, uniqueness, and support recovery conditions. A classical tool to establish some recovery properties lies in the notion of the minimum separation distance.
Definition 5 (Minimum separation distance). The minimum separation distance is a characterisation of the support of the discrete measure $m_{a_0,x_0}$ by:
$$\Delta(m_{a_0,x_0}) \stackrel{\text{def.}}{=} \min_{i \ne j} \operatorname{dist}(x_{0,i}, x_{0,j}).$$
The reconstruction condition is driven by this minimum separation distance, itself determined by the type of measure (complex, real, real non-negative) and the type of forward operator.
• if the operator is an acquisition of the Fourier spectrum within $[-f_c, f_c]$ with frequency cut-off $f_c$, for $\mathcal{X} = \mathbb{T}^d$ the $d$-torus in the noiseless setting, it is necessary that $\Delta(m_{a_0,x_0}) \ge \frac{2}{f_c}$ if the source measure is complex [10]. Upon a few conditions [21], one can weaken it to $\Delta(m_{a_0,x_0}) \ge \frac{1.26}{f_c}$, and to $\Delta(m_{a_0,x_0}) \ge \frac{1.87}{f_c}$ if the source measure is real [10];
• regardless of the operator $\Phi$ [17,22], there is no condition on the separation for a real non-negative source measure in the noiseless setting; however, the stability constants explode when $\Delta(m_{a_0,x_0}) \to 0$.
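The minimum separation distance itself is straightforward to compute for a discrete measure; the small helper below (our own sketch) also handles the wrap-around distance on the 1D torus, which is the relevant metric for the Fourier results above:

```python
import numpy as np

def min_separation(x, torus=False):
    """Delta(m_{a,x}) = min_{i != j} dist(x_i, x_j); on T^1 the distance wraps around."""
    x = np.asarray(x, dtype=float)
    d = np.abs(x[:, None] - x[None, :])
    if torus:
        d = np.minimum(d, 1.0 - d)                  # wrap-around distance on R/Z
    d[np.diag_indices_from(d)] = np.inf             # ignore the i = j pairs
    return d.min()

x = [0.05, 0.4, 0.95]
print(min_separation(x))                # ≈ 0.35 on the line
print(min_separation(x, torus=True))    # ≈ 0.10 on the torus: 0.05 and 0.95 are neighbours
```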
These results are important but do not provide a sharp characterisation of the recovery in the presence of noise; however, we expect to find noise in the images we deal with and therefore to be limited by this noise regime. To account for this effect, we need to add some conditions on the ground-truth measure; following the work of [8], we introduce:

Definition 6 (Non-Degenerate Source Condition). The source $m_{a_0,x_0}$ verifies the NDSC (Non-Degenerate Source Condition) if:
• the minimal norm certificate $\eta_0$ exists, i.e., there is a solution $p_0$ of the dual problem of $(\mathcal{P}_0(y_0))$ with $\eta_0 = \Phi^* p_0 \in \partial|m_{a_0,x_0}|(\mathcal{X})$;
• $|\eta_0(x)| < 1$ for all $x \in \mathcal{X} \setminus \{x_{0,1}, \dots, x_{0,N}\}$;
• the second derivative of $\eta_0$ at each position $x_{0,i}$ is non-degenerate (invertible Hessian).

The first condition amounts to assuming that $m_{a_0,x_0}$ is a solution to $(\mathcal{P}_0(y_0))$ and that there exists a solution to its dual problem. If the two latter conditions are met, we say that $\eta_0$ is non-degenerate. This allows us to state the main result of [8]: assume that $\Gamma_{x_0}$ has full column rank and that $m_{a_0,x_0}$ verifies the NDSC. Then, there exist $\alpha > 0$ and $\lambda_0 > 0$ such that, for all $0 \le \lambda \le \lambda_0$ and all noises $w$ such that $\|w\|_{\mathcal{H}} \le \alpha\lambda$, the problem $(\mathcal{P}_\lambda(y))$ admits a unique solution, composed of exactly $N$ spikes. In particular, for $\lambda = \|w\|_{\mathcal{H}}/\alpha$, the discrepancies in amplitudes and positions between the recovered and the source measures are controlled by $O(\|w\|_{\mathcal{H}})$. Under the Non-Degenerate Source Condition, for $\lambda$ and $\|w\|_{\mathcal{H}}/\lambda$ small enough, one can therefore reconstruct a measure with the same number of spikes as the ground-truth measure $m_{a_0,x_0}$. Furthermore, the reconstructed measure (weak-*) converges to the ground-truth measure when the noise level drops to 0. The authors of [8] also introduce the notion of vanishing derivatives precertificate. The certificate $\eta_0$ is indeed hard to compute from the dual problem of $(\mathcal{P}_0(y_0))$ because of the constraint $\|\eta_0\|_{\infty,\mathcal{X}} \le 1$; the precertificate allows for leveraging this computation by solving a linear system instead. The interested reader is advised to take a glance at this article among others [8,22] for these concepts.
Shorthand: the minimum separation distance criterion is used to assess recovery possibilities in the noiseless setting. In a low-noise regime, a theorem states that the source measure $m_{a_0,x_0}$ composed of N spikes can be recovered through BLASSO, with control over the discrepancies (amplitudes/positions) between the reconstructed and the source measures.
We were therefore able to establish some guarantees on the reconstruction of the source measures in the presence of noise. In the next section, we propose to address the third question and to discuss strategies to compute the numerical solution of the inverse problem; a difficult task that requires taking into account the peculiarities of the optimisation space.

Numerical Strategies to Tackle the BLASSO
The BLASSO problem $(\mathcal{P}_\lambda(y))$ is an optimisation over the set of Radon measures, an infinite dimensional and non-reflexive space. We recall that it writes down:
$$\operatorname*{argmin}_{m\in\mathcal{M}(\mathcal{X})} T_\lambda(m) = \frac{1}{2}\|\Phi m - y\|_{\mathcal{H}}^2 + \lambda |m|(\mathcal{X}). \qquad (\mathcal{P}_\lambda(y))$$
A naive approach would be to enforce the measure $m$ to be supported on a fine grid $(p_i)_{i=1}^L$, which is equivalent to solving the LASSO problem:
$$\min_{a\in\mathbb{R}^L} \frac{1}{2}\Big\|\sum_{i=1}^L a_i \varphi(p_i) - y\Big\|_{\mathcal{H}}^2 + \lambda \|a\|_1,$$
with $\varphi$ the kernel of the forward operator. This approach has numerous drawbacks: for instance, the solution of the LASSO, in a small noise regime and when the step size tends to 0, contains pairs of spikes around the true ones [6,7]. Furthermore, refining the step size leads to a worse conditioning of the forward operator, accounting for numerical difficulties. The following classes of algorithms better account for the infinite dimensional nature of $\mathcal{M}(\mathcal{X})$. We present in detail the three methods with the most established results in the literature [13,18,23]. Before describing these methods, let us remark that there also exist some promising avenues, such as projected gradient descent [24,25]. It relies on an over-parametrised initialisation, i.e., a discrete measure with numerous $\delta$-peaks compared to the ground-truth; one then applies a gradient descent on the amplitudes and positions of the over-parametrised measure, combined at each step with a projection onto a set of position constraints to enforce the separation of the spikes. This projection can be replaced by a 'heuristic' which boils down to merging the $\delta$-peaks that are not sufficiently separated [25].
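The over-parametrised gradient descent avenue mentioned above can be sketched as follows for a 1D Gaussian kernel: plain gradient steps on both the amplitudes and the positions of many δ-peaks. The projection/merging step enforcing separation is omitted for brevity, and every parameter is an illustrative assumption of ours:

```python
import numpy as np

SIGMA, LAM = 0.05, 0.01
s = np.linspace(0, 1, 64)                        # sensor samples

def atoms(x):
    """One sampled Gaussian atom phi(x_i) per column."""
    return np.exp(-(s[:, None] - x[None, :]) ** 2 / (2 * SIGMA ** 2))

def descend(a, x, y, lr=1e-3, n_iter=2000):
    """Gradient descent on BOTH amplitudes and positions of an over-parametrised measure."""
    for _ in range(n_iter):
        A = atoms(x)
        r = A @ a - y                            # residual Phi_x(a) - y
        grad_a = A.T @ r + LAM * np.sign(a)      # (sub)gradient in the amplitudes
        grad_x = a * (((s[:, None] - x[None, :]) / SIGMA ** 2 * A).T @ r)  # chain rule in x
        a, x = a - lr * grad_a, x - lr * grad_x
    return a, x

rng = np.random.default_rng(1)
y = atoms(np.array([0.3, 0.7])) @ np.array([1.0, 0.9])   # acquisition of two true spikes
a0 = np.full(20, 0.1)                                    # 20 δ-peaks: over-parametrised start
x0 = rng.uniform(0, 1, 20)
a_hat, x_hat = descend(a0, x0, y)
```

In the full method of [24,25], the loop would additionally project the positions (or merge close δ-peaks) at each step to enforce separation.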

Semi-Definite Recasting and Hierarchy
Semi-definite programming was one of the first schemes for solving the BLASSO in the specific case of a Fourier acquisition on the 1D torus $\mathbb{T}^1$ [10,12,17,18]. Before explaining in layman's terms the SDP scheme, let us first introduce and detail the relevant quantities for this section. Let $d = 1$ be the dimension of the interior of $\mathcal{X}$, and let us study the case where the forward operator, denoted by $\mathcal{F}_n$ (and not $\Phi$ for this section), is a Fourier coefficient measurement up to some cut-off frequency $f_c \in \mathbb{N}$, with $n = 2f_c + 1$ the number of measurements. We have $\mathcal{F}_n : \mathcal{M}_{\mathbb{C}}(\mathcal{X}) \to \mathbb{C}^n$ and, for a discrete measure $m_{a,x}$, $\mathcal{F}_n m_{a,x} \stackrel{\text{def.}}{=} \big( \sum_{i=1}^N a_i e^{-2i\pi k x_i} \big)_{-f_c \le k \le f_c}$.
This method is based on semi-definite programming (SDP) for efficiently computing the minima of BLASSO. It stems from the Hilbert approach [26], in which one globally decomposes the objective function into simple pieces, the atoms. The solution $p$ of the dual problem of $(\mathcal{P}_\lambda(y))$, denoted here $(\mathcal{D}^{\mathcal{F}}_\lambda(y))$, is linked to a certificate $\mathcal{F}_n^* p$, which is a trigonometric polynomial: the idea then is the reconstruction of the dual certificate as a linear sum of trigonometric polynomials [18], which is enough to find the measure associated with this reconstructed certificate. This associated measure is a solution to the BLASSO. The dual problem, on the other hand, is tractable thanks to a semi-definite programming approach.
Since $\mathcal{F}_n^* p$ is a trigonometric polynomial for any $p \in \mathbb{C}^n$ by the definition above, one can recast the constraint $\|\mathcal{F}_n^* p\|_{\infty,\mathcal{X}} \le 1$ (imposed by the definition of a certificate, see Equation (3)) and rewrite it as the intersection of the cone of positive semi-definite matrices $\{A : A \succeq 0\}$ with an affine hyperplane [10,27]. Hence, the Fenchel dual problem of $(\mathcal{P}_\lambda(y))$ for the Fourier forward operator $\mathcal{F}_n$:
$$\max_{\|\mathcal{F}_n^* p\|_{\infty,\mathcal{X}} \le 1} \langle y, p \rangle_{\mathbb{C}^n} - \frac{\lambda}{2}\|p\|_2^2, \qquad (\mathcal{D}^{\mathcal{F}}_\lambda(y))$$
with Hermitian product $\langle \cdot, \cdot \rangle$, has the equivalent formulation [27]:
$$\max_{p\in\mathbb{C}^n,\, Q\in\mathbb{C}^{n\times n}} \langle y, p \rangle_{\mathbb{C}^n} - \frac{\lambda}{2}\|p\|_2^2 \quad \text{s.t.} \quad \begin{pmatrix} Q & p \\ p^* & 1 \end{pmatrix} \succeq 0, \quad \sum_{i=1}^{n-j} Q_{i,i+j} = \delta_{0,j}, \; 0 \le j \le n-1,$$
with $Q$ a Hermitian matrix, $p$ a vector of coefficients (accounting for the dual variable), and $\delta_{0,j}$ the Kronecker delta, equal to 1 if $j = 0$ and 0 otherwise. The choice of the regulariser $\lambda$ is crucial: if chosen too high, it will yield a solution with fewer spikes; if chosen too low, it will recover a solution with spurious spikes. This finite dimensional formulation can now be tackled with classic semi-definite programming solvers, as did the authors of [10], who proposed an Interior Point Method given in Algorithm 1. The first step reaches a solution $p$, allowing the definition of the polynomial $p_{2n-2}(e^{2i\pi t}) \stackrel{\text{def.}}{=} 1 - |\mathcal{F}_n^* p|^2(t)$.
Algorithm 1: SDP approach for the BLASSO with Fourier measurements.
1. Solve the SDP $(\mathcal{D}^{\mathcal{F}}_\lambda(y))$ to obtain the dual solution $p$.
2. Reconstruct the support $\hat{\mathcal{X}}$ of $m$ by locating the roots of $p_{2n-2}$ on the unit circle (e.g., by computing the eigenvalues of its companion matrix).
3. Solve $\sum_{t\in\hat{\mathcal{X}}} a_t e^{-2i\pi k t} = y_k$ to recover the amplitudes $a$.
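The support and amplitude recovery steps can be illustrated without an SDP solver: we build a polynomial whose unit-circle roots are the (assumed known) support, recover them via the companion matrix as `numpy.roots` does, and solve the Fourier linear system for the amplitudes. In the full algorithm, the polynomial coefficients would instead come from the SDP dual solution $p$:

```python
import numpy as np

fc = 10
k = np.arange(-fc, fc + 1)                          # measured frequencies, n = 2*fc + 1
t_true = np.array([0.2, 0.5, 0.8])                  # spike positions on the torus T^1
a_true = np.array([1.0, -0.6, 0.4])
y = np.exp(-2j * np.pi * np.outer(k, t_true)) @ a_true    # Fourier data F_n m_{a,x}

# Support step: roots on the unit circle of a polynomial vanishing on the support.
# Built directly from t_true here, to isolate the extraction mechanics.
coeffs = np.poly(np.exp(2j * np.pi * t_true))       # monic polynomial with those roots
roots = np.roots(coeffs)                            # eigenvalues of the companion matrix
t_hat = np.sort(np.mod(np.angle(roots) / (2 * np.pi), 1.0))

# Amplitude step: solve the linear system  sum_t a_t e^{-2i pi k t} = y_k.
V = np.exp(-2j * np.pi * np.outer(k, t_hat))
a_hat = np.real(np.linalg.lstsq(V, y, rcond=None)[0])
```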
One can note the link between the dual and the primal problems, i.e., that $p$, the solution of $(\mathcal{D}^{\mathcal{F}}_\lambda(y))$, entails the location of the spikes: since $\mathcal{F}_n^* p$ is the certificate of a discrete measure, it attains its extremal points on the support of $m$; note that $p_{2n-2}(e^{2i\pi t}) = 1 - |\mathcal{F}_n^* p|^2(t)$ has all its roots on the unit circle, and these roots are the support of the target measure [10]. Thus, the strategy is to solve the dual problem and then to use a root-finding algorithm on the certificate $\mathcal{F}_n^* p$ associated with the dual solution, hence reconstructing the support of the measure and then the measure itself (after a last amplitude recovery step). We present an example of the reconstruction of three Dirac measures on the 1D torus $\mathbb{T}^1$ through the observed noisy data $y$ and the roots of the polynomial $p_{2n-2}(e^{2i\pi t})$ in Figure 2. This strategy is only suitable for $d = 1$. For the multivariate case, one needs to make use of a so-called Lasserre hierarchy [28]: one considers a semi-definite relaxation of order $m$ with $m > n = 2f_c + 1$, whose constraints involve elementary $m \times m$ Toeplitz matrices $\theta_{k_j}$ (with entries equal to 1 on the $k_j$-th diagonal and 0 elsewhere) and the Kronecker product $\otimes$. In a nutshell, Lasserre's hierarchies give a sequence of nested outer SDP approximations of the cone of moments of non-negative measures. This method has been successfully applied to super-resolution in [12]. Some reconstructions in the 1D setting with a Fourier kernel are given in Figure 3. These methods are proved to be asymptotically exact [12]. Nonetheless, it is not known whether the algorithm has finite convergence in general: one does not know when to stop the hierarchy to obtain a solution of the BLASSO [3]. This stems from the fact that non-negative trigonometric polynomials in dimension $d > 1$ are not necessarily sums of squares.
Moreover, these SDP-based approaches are rather limited to a certain class of measurement maps Φ, typically the Fourier forward operator or at least filters with compact Fourier support. With the following two classes of algorithms, one can better exploit the continuous setting and get rid of the discretisation drawback.
Summary (1st algorithm): the scheme boils down to the resolution of the dual problem, the reconstruction of the measure's support thanks to the certificate associated with the dual solution, and finally the solving of a linear problem to yield the corresponding estimated amplitudes. This strategy can be extended to a multivariate context, but, still, it is quite restrictive on the forward operator and it does not have finite convergence in general.

Greedy Algorithm: The Conditional Gradient
The conditional gradient method, also called the Frank-Wolfe (FW) algorithm [29,30], aims at solving min m∈C f (m) for C a weakly compact convex set of a topological vector space and f a convex and differentiable function (the differential is then denoted by d f ). It relies on the iterative minimisation of a linearised version of f . Hence, the interest of this algorithm lies in the fact that it uses only the directional derivatives of f and that it does not require any Hilbertian structure, contrary to a classic proximal algorithm formulated in terms of Euclidean distance. We recall the definition of the conditional gradient in the pseudocode Algorithm 2 for the general problem of minimising f : a linear minimisation step yielding s k , a step-size search for γ k , and the update m k+1 ← m k + γ k (s k − m k ).

One can make the following remarks: • the compactness assumption on C ensures that the argmin in step 2 is non-empty; • in line 7, we can replace m k+1 by any element m̃ ∈ C such that f (m̃) ≤ f (m k+1 ) without changing the convergence properties of the algorithm. There are, however, two problems that prevent us from applying this algorithm straightforwardly to the BLASSO: T λ is not differentiable, and the optimisation set M(X ) is not bounded. It is thus necessary to perform an epigraphical lift [13,31] to reach a differentiable functional T̃ λ that shares the same minimising measures as T λ , defined on the bounded set C = {(r, m) ∈ R + × M(X ); |m|(X ) ≤ r ≤ M} with M def. = ‖y‖ 2 H /(2λ). Even though C is not weakly compact, it is compact for the weak- * topology, and the hypotheses for Algorithm 2 are still matched. The Frank-Wolfe algorithm is then well-defined for the energy T̃ λ , which is differentiable in the Fréchet sense on the Banach space R × M(X ). Finally, one has that m * is a minimum of T λ iff (|m * |(X ), m * ) minimises ( P λ (y)), and T λ (m * ) = T̃ λ (|m * |(X ), m * ). In the rest of the document, we will omit the r-part and refer to the pair (|m * |(X ), m * ) simply as m * .
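To fix ideas, a minimal discrete analogue of Algorithm 2 can be sketched as follows: minimising a least-squares objective over an ℓ1-ball (the grid-based counterpart of the set {|m|(X ) ≤ M}), whose linear minimisation step has a closed form at a vertex. All data and parameter values are illustrative, not the BLASSO itself.

```python
import numpy as np

# Frank-Wolfe sketch: minimise f(m) = 0.5*||A m - y||^2 over {||m||_1 <= M}
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))
m_true = np.zeros(50); m_true[[7, 23, 41]] = [1.0, -0.5, 2.0]
y = A @ m_true
M = 5.0

m = np.zeros(50)
for k in range(2000):
    grad = A.T @ (A @ m - y)                        # df(m_k)
    i = int(np.argmax(np.abs(grad)))                # linear minimisation step:
    s = np.zeros(50); s[i] = -M * np.sign(grad[i])  # a vertex of the l1-ball
    gamma = 2.0 / (k + 2)                           # standard step size
    m = m + gamma * (s - m)                         # update (line 7)

print(0.5 * np.linalg.norm(A @ m - y)**2)  # far below the initial f(0)
```

Each iterate is a convex combination of ℓ1-ball vertices, so the constraint ‖m‖1 ≤ M is maintained for free; this is the discrete shadow of the fact that FW on the lifted BLASSO adds one Dirac mass per iteration.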
We noted above that the update m k+1 in line 7 can be replaced by any value m̃ improving the objective function; this remark is rather interesting as it can drastically improve the convergence properties of the algorithm [11,32]. Hence, an interesting improvement to the Frank-Wolfe algorithm lies in changing the final update step into a non-convex optimisation over both the amplitudes and the positions of the reconstructed δ-peaks, in a simultaneous fashion. This modification is presented in Algorithm 3.
This tweak yields a proven convergence to the unique solution of the BLASSO in a finite number of iterations, empirically an N-step convergence. This version is called the Sliding Frank-Wolfe algorithm [13], as the spike positions slide on the continuous domain X . The authors also proved in the same paper that the generated measure sequence m [k] converges towards the minimum for the weak- * topology.
A reconstruction by Sliding Frank-Wolfe for the same Fourier operator, ground-truth spikes, and acquisition as in the previous section is plotted in Figure 4. Contrary to the SDP reconstruction in Figure 3, no spurious spike is reconstructed. As in the SDP method, the choice of the regularisation parameter λ is crucial: if chosen too high, it yields a solution with fewer spikes than needed; if set too low, it recovers a solution with spurious spikes. We set λ = 1 for the 1D Fourier example, as in the former SDP section.
Line 3 in Algorithm 3 is typically solved by a grid search, the convex step in line 5 can use a FISTA solver [33], and the non-convex step in line 6 can be tackled by a limited-memory Broyden-Fletcher-Goldfarb-Shanno implementation (L-BFGS-B) [34]. Reconstructions in the 2D setting with a convolution kernel, similar to the SMLM conditions, are presented in Figure 5. Since luminosity is always a non-negative quantity, one can restrict [4] the SFW to build a positive measure of the cone M + (X ), by adapting the stopping condition on the certificate η [k] accordingly. Hence, this modified algorithm offers a good trade-off between precision and theoretical guarantees. However, it suffers from a high computational load per iteration, making it slow in practice. The algorithm of the next section is a promising alternative with easier and cheaper iterations, while still taking advantage of the continuous setting.
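A sketch of the non-convex sliding step using scipy's L-BFGS-B, as mentioned above, might look as follows; the 1D Gaussian kernel, the ground-truth spikes, and the coarse initial estimates are all hypothetical stand-ins for what the grid search of line 3 would provide.

```python
import numpy as np
from scipy.optimize import minimize

# Sliding step sketch: jointly refine amplitudes and positions of k = 2
# Gaussian spikes with L-BFGS-B. Kernel and data are illustrative.
sigma, lam = 0.05, 1e-3
grid = np.linspace(0, 1, 200)
phi = lambda x: np.exp(-(grid[:, None] - x[None, :])**2 / (2 * sigma**2))

a_true, x_true = np.array([1.0, 0.6]), np.array([0.3, 0.7])
y = phi(x_true) @ a_true                     # noiseless data

def objective(theta, k=2):
    a, x = theta[:k], theta[k:]
    r = phi(x) @ a - y
    return 0.5 * r @ r + lam * np.abs(a).sum()

# start from the coarse estimates a grid search (line 3) would give
theta0 = np.array([0.8, 0.5, 0.28, 0.72])
res = minimize(objective, theta0, method="L-BFGS-B",
               bounds=[(0, None)] * 2 + [(0, 1)] * 2)
a_hat, x_hat = res.x[:2], res.x[2:]
print(np.round(x_hat, 3))                    # close to [0.3, 0.7]
```

The bounds encode non-negative amplitudes (the M + (X ) restriction) and positions inside the domain; L-BFGS-B handles both through its box constraints.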

Shorthand (2nd algorithm):
The conditional gradient method is a greedy algorithm consisting in the iterative minimisation of a linearised version of the convex objective function. This algorithm can be applied to any forward operator without restriction on the space X . Up to a modification (SFW), the Frank-Wolfe algorithm reaches finite convergence, empirically an N-step convergence for a source measure with N spikes. The iterations, however, are computationally costly, yielding long computation times.

Optimal Transport Based Algorithm: The Particle Gradient Descent
All the following results are proven for a domain X with no boundaries, e.g., the d-dimensional torus T d . The case described in the former sections, where X is any compact subset of R d , is included in this new setting, since any compact X can be periodised to yield a domain with no boundaries. The forward operator kernel ϕ : X → H should also be differentiable in the Fréchet sense. The least-squares term in the BLASSO is denoted by the more general data term R : H → R + ; the functional T λ of the BLASSO will now be restricted to M + (X ) and denoted J; its Fréchet differential at a point ν ∈ M + (X ) is denoted J ν . A comprehensive guide on its computation is given in Appendix B.4. In the following, we describe the setting for non-negative measures of M + (X ), but it can be extended in a straightforward fashion [23] to signed measures of M(X ) by performing the method on the positive and then the negative part of the signed measure (see the Jordan decomposition in Appendix B.1). Figure 6 sums up the chief quantities and relations introduced in this section; the reader is advised to refer to it whenever a global view on the optimal transport problem is needed.

Figure 6. Digest of the important quantities mentioned in [3,23]: red refers to M + (X ) quantities, green to Ω N def. = (R + × X ) N , and blue to the Wasserstein space P 2 (Ω) and theoretical results. Dashed lines correspond to the theoretical section, and continuous lines indicate the numerical part.

The functional we want to optimise
Sparse optimisation on measures through optimal transport [3,23] relies on the approximation of the ground-truth positive measure m a 0 ,x 0 by a 'system of N ∈ N * particles', i.e., an element of the space Ω N def. = (R + × X ) N . The point is then to estimate the ground-truth measure by a gradient-based optimisation on the objective function (7), where each (r i , x i ) belongs to the lifted space Ω def. = R + × X endowed with a metric. Hence, the hope is that the gradient descent on F N converges to the amplitudes and the positions of the ground-truth measure, despite the non-convexity of functional (7). The author of [23] proposes a suitable metric for the gradient of F N , which enables separation of the variables in the gradient descent update. Let α > 0 and β > 0 be two parameters; for any (r, θ) ∈ Ω, the Riemannian inner product on Ω called the cone metric is defined for all (δr 1 , δr 2 ) ∈ R 2 + and (δθ 1 , δθ 2 ) ∈ X 2 as an inner product ⟨(δr 1 , δθ 1 ), (δr 2 , δθ 2 )⟩ (r,θ) mixing a radial part weighted by α and a spatial part weighted by β.
We denote by ⟨·, ·⟩ θ the metric on the manifold X at the point θ. The gradient of the functional F N w.r.t. the cone metric, for all i ∈ {1, . . . , N}, then writes down as in [2,23]; see Appendix B.4 for more details on this computation. We now present the theoretical results on the particle gradient descent, which correspond to the blue dashed lines in Figure 6. The reader is invited to refer to this figure whenever a broader picture is needed.

Theoretical Results
The main idea of these papers [3,23] boils down to the following observation: the minimisation of function (7) is a particular case of a more general problem, formulated in terms of measures on the lifted space Ω. The space is more precisely P 2 (Ω), a subset of M(Ω), namely the space of probability measures with finite second moments endowed with the 2-Wasserstein metric, i.e., the optimal transport distance: see Appendix B.5 for more details. Hence, the lift of the unknown m ∈ M + (X ) to µ ∈ P 2 (Ω) enables the removal of the asymmetry for discrete measures between position x ∈ X and amplitude a ∈ R + , by lifting aδ x to δ (a,x) . The lifted functional F now writes down, for a parameter λ > 0, in terms of Φ̃µ def. = ∫ Ω aϕ(x) dµ(a, x) and of Ṽ, the TV-norm on the spatial component of the measure µ. The functional is non-convex; its Fréchet differential is denoted F ′ and can be evaluated at any u ∈ Ω. A discrete measure µ N ∈ P 2 (Ω) can also be seen as an element of Ω N from the standpoint of its components (a i , x i ). It allows the authors of [3,23] to perform a precise characterisation of the source recovery conditions, through measures and the tools of optimal transport such as gradient flows (see below).
Then, one may run a gradient descent on the amplitudes and positions (a i , x i ) ∈ (R + × X ) N of the measure µ N , in order to exploit the differentiability of the kernel ϕ. Note that the measure µ N is over-parametrised, i.e., its number of δ-peaks is larger than the number of spikes of the ground-truth measure: thus, the particles, namely the δ-peaks of the space Ω, cover the domain X through their spatial part (see, e.g., Figure 8, where the particles are plotted as red dots).
Before giving the main results, we need to clarify the generalisation of gradient descent to functionals over measures, called the gradient flow [35,36] in optimal transport theory; it is the main ingredient of the particle gradient descent. Letting F : R d → R be an objective function with a certain regularity, a gradient flow describes the evolution of a curve x(t) such that its starting point at t = 0 is x 0 ∈ R d , evolving at any time t in the direction that decreases the function F the most [36]. The interest of gradient flows is their extension to spaces X with no differentiable structure. In the differentiable case, one can consider the discretisation of the gradient flow, i.e., the sequence defined for a step-size τ > 0 and k ∈ N * by x τ k+1 ∈ argmin x F(x) + ‖x − x τ k ‖ 2 /(2τ). It is the implicit Euler scheme for the equation (x τ )′ = −∇F(x τ ), or the weaker −(x τ )′ ∈ ∂F(x τ ) if F is convex and non-smooth. The gradient flow is then the limit (under certain hypotheses) of the sequence (x τ k ) k≥0 for τ → 0 for a starting point x 0 ∈ X. Gradient flows can be extended to metric spaces: indeed, for a metric space (X, d) and a lower semi-continuous map F : X → R, one can define the discretisation of the gradient flow by the sequence x τ k+1 ∈ argmin x F(x) + d(x, x τ k ) 2 /(2τ). In the case of the metric space of probability measures, i.e., measures with unitary mass, the limit τ → 0 of this scheme exists and converges to the unique gradient flow starting at x 0 , an element of the metric space. A typical case is the space of probabilities with finite second moments P 2 (Ω), endowed with the 2-Wasserstein metric, i.e., the optimal transport distance (see Appendix B.5): a gradient flow in this space P 2 (Ω) is a curve t → µ t called a Wasserstein gradient flow starting at µ 0 ∈ P 2 (Ω); for all t ∈ R + , one has µ t ∈ P 2 (Ω), obeying the partial differential Equation (11) in the sense of distributions. Recall that in the divergence term of (11), derivatives ought to be understood in the distributional sense.
This equation ensures the conservation of mass: at each time t > 0, one has |µ t |(Ω) = |µ 0 |(Ω). Hence, despite the lack of differentiable structure of P 2 (Ω), which forbids the straightforward application of a classical gradient-based algorithm, one can perform an optimisation on this space through the gradient flow to reach a minimum of F by discretising (11).
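The implicit Euler (minimising movement) scheme above can be illustrated on the simplest possible example, F(x) = x²/2 on R, where each step has a closed form and the discretisation can be compared against the exact flow x(t) = x 0 e −t :

```python
import numpy as np

# Implicit Euler / minimising-movement discretisation of a gradient flow,
# on the toy case F(x) = x^2/2 over R, whose exact flow is x(t) = x0*exp(-t).
# Here x_{k+1} = argmin_x F(x) + |x - x_k|^2/(2*tau) = x_k / (1 + tau).
x0, T = 1.0, 1.0
errs = []
for tau in [0.1, 0.01, 0.001]:
    x = x0
    for _ in range(int(round(T / tau))):
        x = x / (1 + tau)                   # one implicit Euler step
    errs.append(abs(x - x0 * np.exp(-T)))   # gap to the exact flow at t = T
print(errs)                                 # shrinks as tau -> 0
```

The same minimising-movement idea, with the Euclidean distance replaced by the Wasserstein distance, is what defines the flows on P 2 (Ω) used in this section.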
The interesting case of a gradient flow in P 2 (Ω) is the flow starting at µ N,0 def. = (1/N) ∑ N i=1 δ (a 0 i ,x 0 i ) , uniquely defined by Equation (11), which writes down for all t ∈ R + as µ N,t def. = (1/N) ∑ N i=1 δ (a i (t),x i (t)) , where a i : R + → R + and x i : R + → X are continuous maps. This path (µ N,t ) t≥0 is a Wasserstein gradient flow, and it uses N Dirac measures over Ω to optimise the objective function F in (9). When the number of particles N goes to infinity and if µ N,0 converges to some µ 0 ∈ P 2 (Ω), the gradient flow (µ N,t ) t≥0 converges to the unique Wasserstein gradient flow of F starting from µ 0 , described by the time-dependent density (µ t ) t≥0 valued in P 2 (Ω) obeying the partial differential Equation (11).
For these non-convex gradient flows, the authors of [3] give a consistent result for gradient-based optimisation methods: under certain hypotheses, the gradient flow (µ N,t ) t≥0 converges to global minima in the over-parametrisation limit, i.e., for N → +∞. It relies on two important assumptions that prevent the optimisation from getting stuck at non-optimal points: • homogeneity of ϕ (a function f between vector spaces is positively p-homogeneous if, for λ > 0 and any argument x, one has f (λx) = λ p f (x)), in order to select the correct magnitude for each feature, or at least partial 1-homogeneity (i.e., boundedness of ϕ in [3]); • diversity in the initialisation of the parameters, in order to explore all combinations of features: too few or too closely packed particles will not reach all the source peaks and will only yield local minima.
We can then introduce the fundamental result on the many-particle limit [3], i.e., the mean-field limit of the gradient flows (µ N,t ) t≥0 , despite their lack of convexity: Theorem 2 (Global convergence, informal). If the initialisation µ N,0 is such that µ 0 def. = lim N→+∞ µ N,0 has a support that separates {−∞} × X from {+∞} × X (the support of a measure m is the complement of the largest open set on which m vanishes; in an ambient space X , we say that a set C separates the sets A and B if any continuous path in X with endpoints in A and B intersects C), then the gradient flow µ t weakly- * (see Appendix B.1) converges in P 2 (Ω) to a global minimum of F. The limits in t and N can be interchanged; the interested reader might take a look at [3] for precise statements and exact hypotheses (boundary conditions, 'Sard-type' regularity, e.g., ϕ is d-times continuously differentiable, etc.).
Since we have a convergence result, we can then investigate the numerical implementation. This optimisation problem is tractable thanks to the Conic Particle Gradient Descent algorithm [23], denoted CPGD: the proposed framework involves a slightly different gradient flow (ν t ) t≥0 , defined through a projection of (µ t ) t≥0 onto M + (X ). This new gradient flow (ν t ) t≥0 is defined for a specific metric on M + (X ), which is a trade-off between the Wasserstein and the Fisher-Rao (also called Hellinger) metrics [23]; it is then called a Wasserstein-Fisher-Rao gradient flow. One then considers the Wasserstein-Fisher-Rao gradient flow t → ν N,t starting at ν N,0 in M + (X ), rather than the Wasserstein flow t → µ N,t starting at µ N,0 in P 2 (Ω). The partial differential equation of a Wasserstein-Fisher-Rao flow writes down as in (12) for the two parameters α, β > 0 arising from the cone metric: α tunes the Fisher-Rao metric weight, while β tunes the Wasserstein one. All statements on convergence could be made alternatively on µ t or ν t , and we have indeed the same theorem:

Theorem 3 (Global convergence-informal).
If ν 0 has full support (its support is the whole set X ) and (ν t ) t≥0 converges for t → +∞, then the limit is a global minimum of J. The same conclusion holds in the limit if ν N,0 → ν 0 in the weak- * sense as N → +∞.

Summary (3rd algorithm, theoretical aspects): we introduced the proposed solution of [3,23], namely approximating the source measure by minimising a discrete non-convex objective function of amplitudes and positions. The analytical study of the discrete function is a difficult problem, which could be tackled thanks to the recast of the problem in the space of measures. Then, we exhibited the theoretical framework of gradient flows, understood as a generalisation of gradient descent to the space of measures. Eventually, we presented the convergence results of the gradient flow (ν t ) t≥0 towards the minimum of the BLASSO. Gradient descent on the discrete objective approximates well the gradient flow dynamic and can thus benefit from the convergence results exhibited above.
We now discuss the numerical results of the particle gradient descent. The reader is advised to take a look at Figure 6, more precisely at the red and green ellipses, to get a grasp on the numerical part.

Numerical Results
We recall that a gradient flow (ν N,t ) t≥0 starting at ν N,0 can be seen as a (time-continuous) generalisation of gradient descent in the space of measures, allowing precise theoretical statements on the recovery conditions. To approach this gradient flow, we use the Conic Particle Gradient Descent algorithm [23], denoted CPGD: the point is to discretise the evolution of the gradient flow t → ν N,t through a numerical scheme on (12). This consists of a gradient descent on the amplitudes r and positions x through the gradient of the functional F N in Equation (8), a strategy which approximates well the dynamic of the gradient flow [23]. This choice of gradient with respect to the cone metric enables multiplicative updates in r and additive updates in x, the two updates being independent of each other. The algorithm then consists of a gradient descent defining r i (t) and x i (t) according to [2,23], thanks to the gradient in Equation (8), for the mirror retraction (the notion of retraction compatible with the cone structure is central: in the Riemannian context, a retraction is a continuous mapping that maps a tangent vector to a point on the manifold; formally, one could see it as a way to enforce the gradient evaluation to be mapped back onto the manifold; see [23] for other choices of compatible retractions and more insights on these notions) and η λ = −J ν /λ. The structure of the CPGD is presented in Algorithm 4. Note that the multiplicative update in r yields an exponential of the certificate, and that the updates of the quantities r and x are separated. This algorithm has rather easy and cheap iterations: to reach an accuracy of ε, i.e., a distance such as the ∞-Wasserstein distance between the source measure m a 0 ,x 0 and the reconstructed measure m * below ε, the CPGD yields a typical complexity cost of log ε −1 rather than ε −1/2 for a convex program ([23], Theorem 4.2).
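A loose 1D sketch of this particle dynamic (multiplicative updates on amplitudes, additive updates on positions, over-parametrised initialisation) is given below; the retraction, the step sizes, and all problem data are simplified and illustrative, not the exact scheme of [23].

```python
import numpy as np

# Conic-particle-style descent sketch: multiplicative steps on amplitudes,
# additive steps on positions. All values are illustrative.
sigma, lam = 0.05, 1e-4
grid = np.linspace(0, 1, 200)
phi  = lambda x: np.exp(-(grid[:, None] - x[None, :])**2 / (2 * sigma**2))
dphi = lambda x: phi(x) * (grid[:, None] - x[None, :]) / sigma**2

x_true, a_true = np.array([0.3, 0.7]), np.array([1.0, 0.6])
y = phi(x_true) @ a_true                   # noiseless acquisition

N = 20                                     # many more particles than spikes
x = np.linspace(0.05, 0.95, N)             # particles spread over the domain
a = np.full(N, 0.1)
alpha, beta = 5e-3, 2e-5                   # small, stable step sizes
for _ in range(3000):
    res = phi(x) @ a - y
    grad_a = phi(x).T @ res + lam          # gradient w.r.t. amplitudes
    grad_x = a * (dphi(x).T @ res)         # gradient w.r.t. positions
    a = a * np.exp(-alpha * grad_a)        # multiplicative (mirror) step
    x = x - beta * grad_x                  # additive step

# particles near the true spikes keep their mass; the amplitudes of the
# other particles decay multiplicatively towards zero
print(np.linalg.norm(phi(x) @ a - y))      # much smaller than ||y||
```

Note how the multiplicative step keeps every amplitude strictly positive without any projection, which is exactly the appeal of the mirror retraction on the cone R + × X.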
A reconstruction from the latter 1D Fourier measurements is plotted in Figure 7; the reconstruction is obtained through two gradient flows, the former on positive measures to recover the positive δ-peaks of the ground-truth and the latter on negative measures to recover the negative ones: the merging of the two results gives the reconstructed δ-peaks. The noiseless reconstruction (see our GitHub repository for our implementation: https://github.com/XeBasTeX, accessed on 30 November 2021) for a 2D Gaussian convolution with the same setting as in the Frank-Wolfe section is plotted in Figure 8. One can see that the spikes are well recovered, as some non-zero red and blue particles cluster around the three δ-peaks.

Algorithm 4 (excerpt). Input: gradient step sizes α, β > 0 and N ≥ 1 the number of particles. 1: Draw uniformly the initialisation discrete measure with amplitudes and positions. 2: Mirror descent step: for all i = 1, . . . , N, update.

Figure 8. Conic Particle Gradient Descent applied to 2D Gaussian deconvolution: the red dots are the particle measure ν (k) (dot size proportional to amplitude), the three white dots are the source measure, the image in the background is the noiseless acquisition y 0 , and the black lines are the paths of the particles ν (k) ; all the paths constitute the gradient flow (ν t ) t≥0 . The implementation is an adaptation of [23], with α = β = 1 × 10 −2 and λ = 1.
Summary (3rd algorithm, numerical aspects): the gradient flow (ν t ) t≥0 is computable by the Conic Particle Gradient Descent algorithm, consisting in an estimation through a gradient descent (w.r.t. the cone metric) on both the amplitudes and the positions of an over-parametrised measure, namely a measure with a fixed number of δ-peaks exceeding the source's. The iterations are cheaper than those of the SFW presented before, but the CPGD lacks guarantees in a low-noise regime.
To sum up all the pros and cons of these algorithms, we give Table 1 for a quick digest. Since the CPGD lacks guarantees on the global optimality of its output, the following section will use the conditional gradient and more precisely the Sliding Frank-Wolfe in order to tackle the SMLM super-resolution problem.

Applications and Results in SMLM Imaging
As an illustration of off-the-grid applications in this review, we propose to solve the super-resolution problem, which aims to retrieve biological structures at very small scales.

Metrics of Quality of Reconstruction
If one has access to the ground-truth, i.e., the real positions of the point sources, one is able to assess the quality of the reconstruction by: • detection metrics, such as the Jaccard index; • quality-of-reconstruction metrics, such as the L 2 norm in the discrete case.
Detection metrics can be applied to the off-the-grid output in a straightforward manner. We will rather focus in this part on the 'quality of reconstruction' metrics. Any of the former algorithms returns a list of Dirac measures, which can be compared with the ground-truth measure m a 0 ,x 0 . This comparison cannot be done with discrete tools, such as the L 2 norm of the reconstructed acquisitions: we cannot compare an element of M(X ) with one of L 2 (X ). Examining the L 2 norm between the vector x of reconstructed positions and the vector x 0 of ground-truth positions is not sufficient either: we would need the same number of elements in x and x 0 , we would have to sort the vectors of positions, and we would have no guarantee that the matching of one position of x with another of x 0 is the right one.
Hence, a distance on the space of measures is the right tool of comparison. We will use in the following the 1-Wasserstein distance W 1 [37]: see Appendix B.5 for a recap of the useful definitions and more insights on the optimal transport setting used in this section. The Wasserstein distance between measures of equal mass is defined (actually, it is well-defined on the subset X def. = {m ∈ M(X ), |m|(X ) ≤ ‖y‖ 2 H /2λ}, because only the bounded subsets of M(X ) are metrisable for the weak- * topology; we thus have to restrict the set of measures to X in order to reach a Polish space, i.e., the convenient framework for this OT-based metric, see Appendix B.5. Since all solutions of the BLASSO belong to X [8], we will keep this slight abuse of notation in the rest of the paper) as follows: Definition 7 (Balanced optimal transport). For 1 ≤ p < +∞ and m 1 , m 2 ∈ M(X ) such that |m 1 |(X ) = |m 2 |(X ), the p-Wasserstein distance is written with Γ(m 1 , m 2 ) the set of transport plans between m 1 and m 2 ; one can take a look at Appendix B.5 for more insights on this notion. However, this notion is not sufficient for our application, since the metric can only take measures of equal masses (i.e., equal TV-norms) as an input. In the case of a discrete measure, we recall that the mass is simply the sum of the moduli of the individual amplitudes: hence, in general, we cannot compare a source measure and a reconstructed measure with differing amplitudes. The classic solution is then to distribute the unit mass, divided by the number of spikes, uniformly over each δ-peak of the discrete measure. Still, it would be much more convenient to incorporate the case of differing masses in the metric. The proper metric to compare two measures of different masses is called the Kantorovich-Rubinstein norm, also referred to as the Flat Metric [37-39].
Definition 8 (Unbalanced optimal transport). Let m ∈ M(X ) be of finite first moment and τ > 0; the following quantity is called the Kantorovich-Rubinstein norm, where ‖ f ‖ Lip is the Lipschitz constant of f . We then define the Flat Metric d τ for m 1 , m 2 ∈ M(X ) of finite first moments. The parameter τ is homogeneous to a distance, and is understood in the optimal transport sense as the cost of creating/destroying a Dirac measure. The Flat Metric coincides with the 1-Wasserstein distance, for m 1 , m 2 of equal masses, when τ → +∞ [38]; it also coincides with the total variation norm of m 1 − m 2 when τ → 0. It may then be seen as an interpolation between the total variation norm and the 1-Wasserstein distance. Moreover, when the number of δ-peaks is correctly estimated, the Flat Metric stands for the mean error in terms of localisation and is similar to the RMSE [39]. Eventually, the Flat Metric can be extended to discrete reconstructions, i.e., images on a fine grid, making it applicable to grid-based methods as well.
To sum up, there are two possibilities if one wants to compare the reconstructed measure and the ground-truth one: • let the source measure be composed of N spikes; we set the amplitude of each δ-peak to 1/N. We apply the same procedure to the reconstructed measure (whether or not its number of spikes differs), hence dividing the unit mass uniformly over all the δ-peaks of the considered measure. The reconstructed luminosity is therefore not considered relevant and is discarded: we can compute directly the 1-Wasserstein distance, since it is equal to the Flat Metric in this case; • we want to account for the luminosity, and we use the Flat Metric to compare the reconstructed measure against the ground-truth one.
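For the first option, the 1D computation is readily available in scipy (positions only, uniform masses); the positions below are illustrative:

```python
import numpy as np
from scipy.stats import wasserstein_distance

# First option above: discard amplitudes, give each spike mass 1/N, and
# compute W1 directly (closed form in 1D). Positions are illustrative.
x_true = np.array([0.30, 0.50, 0.80])    # ground-truth spike positions
x_rec  = np.array([0.31, 0.49, 0.83])    # reconstructed positions

w1 = wasserstein_distance(x_true, x_rec) # uniform masses by default
print(round(w1, 4))                      # 0.0167 = (0.01 + 0.01 + 0.03)/3
```

The `u_weights`/`v_weights` arguments could carry normalised amplitudes in the spirit of the second option, but the two mass totals must then agree; truly unbalanced comparisons require the Flat Metric.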
Summary: classic quality-of-reconstruction metrics such as the L 2 (X ) norm cannot be straightforwardly applied to off-the-grid reconstructions. Instead, one can use optimal transport scores such as the Flat Metric: it accounts for both amplitude and position reconstruction, while it can easily be extended to discrete reconstructions (images on a fine grid).

Results for an SMLM Stack
In super-resolution for biomedical imaging, one wants to retrieve fine-scale details to better study biological structures of interest. Indeed, the studied bodies are generally smaller than the Rayleigh limit of 200 nm, a length at which the phenomenon of light diffraction comes into play. This diffraction causes a blurring of the image, which can be described as a convolution of the image by the PSF mentioned above. Hence, we want to perform a deconvolution, i.e., remove the diffraction blur, to get a super-resolved image. It is worth noticing that other imaging systems exist, for which the inverse problems to solve differ slightly from deconvolution: e.g., Nuclear Magnetic Resonance spectroscopy with Fourier measurements [40], or MA-TIRF with Laplace measurements [13].
In order to enhance spatial resolution over standard diffraction-limited microscopy techniques and allow imaging of biological structures below the Rayleigh criterion, one can use SMLM, which stands for 'Single Molecule Localisation Microscopy'. It is a compelling technique in fluorescence microscopy to tackle the super-resolution problem [41]. It requires photoactivatable fluorophores with, roughly speaking, two states, for example 'On' and 'Off'. These molecules are therefore only visible on the acquisitions in the 'On' state, and the idea is to light up some molecules in the sample, make the acquisition, and locate them precisely; the fluorescent molecules are bound to the biological structure and, since only a few molecules are emitting in one frame, the resulting image is rather sparse, which allows accurate localisation. This process is repeated until all the molecules have been lit and imaged. The positions of the molecules imaged frame-by-frame can then be put together to form a super-resolved image that goes below the diffraction barrier, rid of the degradations of the acquisition process (blur, noise, etc.). The quality of the image reconstruction is naturally limited by the number of acquisitions necessary to reconstruct the image, which implies a cost in time (precious insofar as the studied organism moves) and in physical memory, and by the density of fluorophores lit at each stage: indeed, there is a risk of overlap hindering the localisation of the molecules, since the separation criterion is then not matched.
Off-the-grid methods can be applied to any SMLM stack with only the knowledge of the forward operator, the acquisition system's PSF in this case. In this review, a gridless method based on the Sliding Frank-Wolfe is tested on a 2D SMLM acquisition stack from the 2013 EPFL Challenge (https://srm.epfl.ch/DatasetPage?name=MT0.N1.HD, accessed on 30 November 2021). For this purpose, we consider the first image of the stack, locate the source points, and store the coordinates of these points; then we move on to the second image, locate the source points, and so on. Note that the off-the-grid variational approach is not the only method taking advantage of a continuous domain: see, e.g., PSF-fitting methods such as DAOSTORM [42].
Deconvolution is a first challenge in solving this inverse problem, but we must also take the noise into account. One has to deal with three main types of noise on these acquisitions: • photon noise (also known as shot noise or quantum noise) is due to the quantum nature of light. It arises from the fact that fluorophores emit photons randomly, so that, between t and t + 1 (the exposure time), a variable number of photons has been emitted, and therefore a variable number of photons has been collected by the sensor. Thus, the amplitude of the electrical signal generated in the sensor (at each pixel) fluctuates according to a Poisson statistic; • the dark current is a phenomenon due to the natural agitation of electrons. This natural agitation is sufficient to occasionally eject an electron from the valence band to the conduction band without any photoelectric effect. Additional charges are therefore created, which interfere with the signal. The number of electrons generated by thermal agitation follows a Poisson distribution; • amplification and readout noise. This noise is produced by the electronic circuit that amplifies and converts the electron packets into voltage. It is generally modelled by a Gaussian noise.
Thus, several noises pollute each of the observed images. To deal with this ill-posed inverse problem, we use the results on the BLASSO, with the least-squares term as the data-fitting term and the TV norm as the regulariser. In the Bayesian approach, the least-squares term models the maximum likelihood when the acquisition is polluted by Gaussian noise; hence, our model makes the approximation of Gaussian noise. Measurements are discrete, so each image consists of N 1 × N 2 pixels, each of size (b 1 , b 2 ). Let (c i,1 , c i,2 ) be the centre of the ith camera pixel. We can then clarify the forward operator Φ : M(X ) → R N 1 N 2 , which encapsulates the integration over the camera pixels [13] of the Gaussian kernel ϕ with standard deviation σ, for i ∈ {1, . . . , N 1 N 2 }. In the SMLM data set, the PSF standard deviation is σ = 149.39 nm and the pixel size is b 1 = b 2 = 100 nm. The reconstruction is performed by our Python implementation of the Sliding Frank-Wolfe (see our GitHub repository for our PyTorch implementation: https://github.com/XeBasTeX, accessed on 30 November 2021), insofar as it is the most robust method available: it works with a Gaussian kernel and has proven guarantees in a noisy regime. The results are presented in Figure 9. The stack of 2500 images of 64 × 64 pixels is qualified as high density with high SNR: the number of activated fluorophores is quite important, and the noise is not negligible (see the EPFL Challenge page for more insights).
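A sketch of this pixel-integrated forward operator, assuming a separable Gaussian PSF so that each pixel integral factors into 1D erf differences (the grid size is reduced for the example; the helper names are ours, not from the paper's implementation):

```python
import numpy as np
from scipy.special import erf

# SMLM forward operator sketch: each camera pixel integrates a separable
# 2D Gaussian PSF over its area, so the pixel integral factors into
# products of 1D erf differences. sigma and the pixel size are the values
# quoted in the text; the grid is reduced for the example.
sigma, b = 149.39, 100.0                 # PSF std and pixel size (nm)
N1 = N2 = 16                             # illustrative small grid
edges = np.arange(N1 + 1) * b            # pixel boundaries (nm)

def pixel_integrals_1d(x):
    # integral of the unit-mass 1D Gaussian centred at x over each pixel
    c = erf((edges - x) / (np.sqrt(2) * sigma))
    return 0.5 * (c[1:] - c[:-1])

def forward(amplitudes, positions):
    # positions: list of (x1, x2) in nm; returns the N1 x N2 image Phi(m)
    img = np.zeros((N1, N2))
    for a, (x1, x2) in zip(amplitudes, positions):
        img += a * np.outer(pixel_integrals_1d(x1), pixel_integrals_1d(x2))
    return img

img = forward([1.0], [(800.0, 800.0)])   # one molecule at the centre
print(abs(img.sum() - 1.0) < 1e-3)       # True: mass nearly preserved
```

Integrating over the pixel area (rather than merely sampling the PSF at pixel centres) matters here, since the pixel size (100 nm) is not small relative to σ.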
Figure 9. (a) Ground-truth tubulins, with two excerpts of the stack in the square below: convolution + all noises described before; (b) measure reconstructed by Sliding Frank-Wolfe, visualised through a Gaussian kernel with a smaller σ (see text).
The flat metric between the reconstructed measure m a,x and the ground-truth measure m a 0 ,x 0 is then computed; it reaches d τ (m a,x , m a 0 ,x 0 ) = 1.7 × 10 −2 . The reconstruction is convincing and captures well the fine details of the biological structures; one can clearly see the interweaving tubulins in the right part of the image.
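For discrete measures, the flat metric can be evaluated as a small linear program: it is the supremum of ∫ f d(m₁ − m₂) over test functions f bounded by τ and 1-Lipschitz. The sketch below is our own formulation (not the paper's implementation) under the simplifying assumption that both measures are supported on a common finite point set, so that f is optimised only through its values at those points.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def flat_metric(points, weights_a, weights_b, tau=1.0):
    """Flat metric d_tau between two discrete measures on the same support:
    maximise sum_i f_i * (a_i - b_i) subject to |f_i| <= tau and
    |f_i - f_j| <= d(x_i, x_j) (discretised Lipschitz constraint)."""
    c = np.asarray(weights_a, float) - np.asarray(weights_b, float)
    n = len(points)
    A, b = [], []
    for i, j in itertools.combinations(range(n), 2):
        d = np.linalg.norm(np.asarray(points[i]) - np.asarray(points[j]))
        row = np.zeros(n)
        row[i], row[j] = 1.0, -1.0
        A.append(row.copy()); b.append(d)    # f_i - f_j <= d(x_i, x_j)
        A.append(-row); b.append(d)          # f_j - f_i <= d(x_i, x_j)
    res = linprog(-c, A_ub=np.array(A), b_ub=np.array(b),
                  bounds=[(-tau, tau)] * n, method="highs")
    return -res.fun   # linprog minimises, so negate to recover the supremum

d_same = flat_metric([(0.0,), (1.0,)], [1.0, 1.0], [1.0, 1.0])   # identical measures
d_move = flat_metric([(0.0,), (0.5,)], [1.0, 0.0], [0.0, 1.0])   # mass 1 moved by 0.5
```

Unlike the Wasserstein distance, this metric remains finite for measures with different masses, which is why it is a natural reconstruction metric here.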
Note that an interesting feature of the gridless reconstruction is that, once the Radon measure is computed, it is straightforward to plot it through any operator on a fine grid of one's choice. Indeed, as one cannot represent a discrete measure m directly, we rather plot Φm, where Φ is the PSF with a slightly smaller spread, in order to clearly see the deconvolution. In all of our reconstructions, we convolve the reconstruction with the PSF with standard deviation σ/6 and plot it on a grid 32 times finer. As a matter of comparison, discrete methods are performed on a fixed fine grid, and, if one wants a finer reconstruction, one has to recompute everything.
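This visualisation step can be sketched as follows: evaluate a Gaussian kernel with the reduced spread σ/6 on a grid 32 times finer than the camera grid, at the recovered positions. This is an illustrative re-implementation (function name and signature ours), not the repository code.

```python
import numpy as np

def render(amplitudes, positions, n_pix=64, pixel=100.0, sigma=149.39,
           upsample=32, shrink=6.0):
    """Visualise a discrete measure on a fine grid: each spike is drawn as a
    Gaussian bump of std sigma/shrink, on a grid `upsample` times finer."""
    s = sigma / shrink
    step = pixel / upsample
    t = step * (np.arange(n_pix * upsample) + 0.5)   # fine-grid coordinates (nm)
    img = np.zeros((t.size, t.size))
    for a, (x1, x2) in zip(amplitudes, positions):
        g1 = np.exp(-(t - x1) ** 2 / (2 * s ** 2))
        g2 = np.exp(-(t - x2) ** 2 / (2 * s ** 2))
        img += a * np.outer(g2, g1)                  # separable 2-D Gaussian
    return img

fine = render([1.0], [(3200.0, 3200.0)])             # one spike, fine-grid image
```

Since the measure itself is grid-free, changing `upsample` or `shrink` only re-renders the plot; no re-optimisation is needed, in contrast with discrete methods.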
We finally test the off-the-grid reconstruction on a real data set of tubulins with high-density molecules, provided by the 2013 IEEE ISBI SMLM challenge. In this stack of 500 frames of 128 × 128 pixels, the FWHM (full width at half maximum) of the acquisition system is estimated at 351.8 nm. We recall that the FWHM is the width of the Gaussian curve measured between the points whose ordinate is half the maximum amplitude; it is linked to the standard deviation σ by FWHM = 2 √ 2 ln 2 × σ. We compare the reconstruction of the off-the-grid method with the output of the Deep-STORM [43] algorithm, touted as the algorithm with the most visually compelling results. The reconstructions of the gridless method and the Deep-STORM algorithm are presented in Figure 10, where one can appreciate the off-the-grid reconstruction of fine details. The reconstruction seems slightly blurry compared to Deep-STORM, due to the plotting through a Gaussian kernel with a small spread. However, it is noteworthy that both methods retrieve the filaments well, in particular inside the highlighted yellow circles: the off-the-grid reconstruction seems to better preserve the structure compared to Deep-STORM's rougher output. The quality of the reconstruction is notably interesting for the off-the-grid approach since it does not require any training set to yield this reconstruction, contrary to Deep-STORM. The only data needed are the knowledge of (an estimation of) the forward operator, and the off-the-grid reconstruction can then be performed from any input, without having to train the model on different types and levels of noise.
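The FWHM-to-σ conversion recalled above is a one-liner; applied to the challenge value of 351.8 nm, it recovers the PSF standard deviation used throughout this section.

```python
import math

def fwhm_to_sigma(fwhm):
    """For a Gaussian, FWHM = 2 * sqrt(2 * ln 2) * sigma, as recalled above."""
    return fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))

sigma = fwhm_to_sigma(351.8)   # PSF width of the 2013 ISBI stack, in nm
```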

In short:
We tested an off-the-grid method on both synthetic and experimental SMLM data sets. The gridless problem is tractable thanks to the Sliding Frank-Wolfe algorithm and yields compelling results. The results are all the more interesting since there is only one parameter, which is easy to tune and robust w.r.t. noise. Thus, the method can easily be adapted to any other data set with a known acquisition operator.

Conclusions
We described in this review the off-the-grid variational setting for the sparse spike problem, through the definition of the space of signed measures M(X ) and the BLASSO functional defined over this set. Thanks to the trade-off between the convexity of the functional and the infinite-dimensional, non-reflexive optimisation space M(X ), the BLASSO can be defined to solve the sparse spike recovery problem. We reviewed in this paper the theoretical guarantees to reach the correct minimum, as the literature provides multiple results, in particular a sharp criterion for stable spike recovery under a low-noise regime. Numerical methods to tackle the BLASSO problem were also discussed, with insights on the SDP approach, which is asymptotically exact but only suited to Fourier measurements; the Frank-Wolfe approach, with a known rate of convergence but a high computational load; and the Conic Particle Gradient Descent, with cheap iterations but a lack of guarantees. We were finally able to present the results of the off-the-grid approach with the Sliding Frank-Wolfe algorithm on SMLM synthetic data and real data from the EPFL challenge, and to illustrate the usefulness of these methods to recover fine-scale details.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

Table A1. Main notations used in the review.

δ_x : Dirac measure at position x
a, x : respectively the amplitude and position of the spike aδ_x
N : number of Dirac measures in a discrete measure
λ : regularisation parameter in (P_λ(y))
Φ : forward acquisition operator, with kernel ϕ and adjoint Φ*
F_n : forward Fourier acquisition operator with n measurements
p_λ : solution of the dual problem (D_λ(y))
η_λ : dual certificate of (P_λ(y))
T_λ : BLASSO (P_λ(y)) functional
Ω : lifted space Ω def.= R_+ × X
J : BLASSO functional on M_+(X)
R : data term R : M(X) → H
F_N : 'discrete' functional on Ω^N
F : functional on M(Ω)
(µ_t)_t, (ν_t)_t : gradient flows, respectively in P_2(Ω) and M_+(X)
α, β : cone metric / Fisher-Rao-Wasserstein tuning parameters
W_p : Wasserstein distance of order p
P_2(Ω) : space of probability measures with finite 2nd moment, endowed with W_2

Then, from (A1), we obtain the dual problem:
$$\sup_{\substack{p \in \mathcal{H} \\ \|\Phi^* p\|_{\infty,\mathcal{X}} \leq 1}} \langle y, p \rangle_{\mathcal{H}} - \frac{\lambda}{2} \|p\|_{\mathcal{H}}^2, \qquad (D_\lambda(y))$$
and, for m* and p* respectively solutions of the primal and dual problems, the extremality conditions are:
$$\Phi^* p^* \in \partial |m^*|(\mathcal{X}), \qquad p^* = \frac{1}{\lambda}\left(y - \Phi m^*\right).$$
This concludes the proof.
Appendix B.3. Fréchet Differential of J

Let σ, ν ∈ M_+(X) and ε > 0. Consider the following:
$$J(\nu + \varepsilon \sigma) = \lambda\,(\nu + \varepsilon \sigma)(\mathcal{X}) + \frac{1}{2}\|\Phi(\nu + \varepsilon \sigma) - y\|_{\mathcal{H}}^2 = J(\nu) + \varepsilon\left(\lambda\,\sigma(\mathcal{X}) + \langle \Phi\nu - y, \Phi\sigma \rangle_{\mathcal{H}}\right) + \frac{\varepsilon^2}{2}\|\Phi\sigma\|_{\mathcal{H}}^2.$$
The TV linearity, |ν + εσ|(X) = ν(X) + ε σ(X), is obtained thanks to the positivity of ν and σ. Hence, the differential of J at point ν is given by:
$$\mathrm{d}J_\nu(\sigma) = \int_{\mathcal{X}} \left(\lambda + \Phi^*(\Phi\nu - y)\right) \mathrm{d}\sigma.$$

Appendix B.4. Gradient of F_N

Let R be the data-fitting term, e.g., the least squares R(m) = ½‖Φm − y‖², and h an injective function such as the map h : r ↦ r². For r ∈ R_+^N and θ ∈ X^N, we consider:
$$F_N(r, \theta) = \frac{\lambda}{N} \sum_{i=1}^N h(r_i) + R\!\left(\frac{1}{N} \sum_{i=1}^N h(r_i)\,\delta_{\theta_i}\right), \qquad m_{r,\theta} \stackrel{\text{def.}}{=} \frac{1}{N} \sum_{i=1}^N r_i^2\,\delta_{\theta_i}.$$
Its differential with respect to r_i is given by:
$$\frac{\partial F_N}{\partial r_i}(r, \theta) = \frac{2 r_i}{N}\left(\lambda + \left\langle \Phi m_{r,\theta} - y,\ \varphi(\theta_i) \right\rangle_{\mathcal{H}}\right).$$
Similarly,
$$\frac{\partial F_N}{\partial \theta_i}(r, \theta) = \frac{r_i^2}{N}\left\langle \Phi m_{r,\theta} - y,\ \nabla_\theta \varphi(\theta_i) \right\rangle_{\mathcal{H}},$$
which yields the gradient in [23]. One can simplify the final expression by introducing the certificate η_λ def.= −Φ*(Φ m_{r,θ} − y)/λ, so that ∂F_N/∂r_i = (2λ r_i/N)(1 − η_λ(θ_i)) and ∂F_N/∂θ_i = −(λ r_i²/N) ∇η_λ(θ_i).
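The gradient of the lifted functional F_N discussed above can be checked numerically. Below is a minimal 1-D sketch, under simplifying assumptions of ours: Φ samples a Gaussian kernel on a fixed grid, h(r) = r², and the 1/N normalisation is dropped; the analytic gradient is verified against finite differences in the test.

```python
import numpy as np

# F_N(r, theta) = lam * sum_i r_i^2 + 0.5 * || sum_i r_i^2 * phi(theta_i) - y ||^2
grid = np.linspace(0.0, 1.0, 50)
SIG, LAM = 0.05, 0.1

def phi(t):                       # Gaussian kernel sampled on the grid
    return np.exp(-(grid - t) ** 2 / (2 * SIG ** 2))

def dphi(t):                      # derivative of phi w.r.t. the position t
    return phi(t) * (grid - t) / SIG ** 2

def F(r, th, y):
    model = sum(ri ** 2 * phi(ti) for ri, ti in zip(r, th))
    return LAM * np.sum(r ** 2) + 0.5 * np.sum((model - y) ** 2)

def grad_F(r, th, y):
    resid = sum(ri ** 2 * phi(ti) for ri, ti in zip(r, th)) - y
    # d/dr_i = 2 r_i (lam + <resid, phi(theta_i)>);  d/dtheta_i = r_i^2 <resid, phi'(theta_i)>
    gr = np.array([2 * ri * (LAM + resid @ phi(ti)) for ri, ti in zip(r, th)])
    gt = np.array([ri ** 2 * (resid @ dphi(ti)) for ri, ti in zip(r, th)])
    return gr, gt

y = 0.8 ** 2 * phi(0.3)                           # one ground-truth spike
r, th = np.array([0.5, 0.7]), np.array([0.25, 0.6])
gr, gt = grad_F(r, th, y)
```

This mirrors the structure of the gradient above: the amplitude gradient is linear in r_i while the position gradient probes the derivative of the kernel.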

Appendix B.5. Details on Section 4
Definition A5. A space (X , d p ) is a Polish space if it is separable and if its topology, induced by the distance d p , makes the space complete.
X is a separable Hilbert space; then (X , d p ) is a Polish metric space for d p a distance on R d restricted to X . We can also introduce:

Definition A6 (Transport Plan). The non-negative measure γ ∈ M + (X × X ) which verifies, for all A, B ∈ B(X ), where B(X ) is the Borel σ-algebra:
$$\gamma(A \times \mathcal{X}) = m_1(A), \qquad \gamma(\mathcal{X} \times B) = m_2(B),$$
is called a transport plan between two positive measures m 1 and m 2 of the same mass.
We call Γ(m 1 , m 2 ) the set of transport plans between m 1 and m 2 [37]. Metrics of optimal transport such as the Wasserstein distance use these notions at their core and are defined only on Polish spaces: this is why we work with the measures on X from [7], the restriction of M(X ) with the weak- * topology.
Definition A7 (Wasserstein distance). Let (X , d p ) be a Polish metric space and p ∈ [1, +∞). For any probability measures µ and ν on X , the Wasserstein distance of order p between µ and ν is defined by:
$$W_p(\mu, \nu) \stackrel{\text{def.}}{=} \left( \inf_{\gamma \in \Gamma(\mu, \nu)} \int_{\mathcal{X} \times \mathcal{X}} d_p(x, y)^p \, \mathrm{d}\gamma(x, y) \right)^{1/p}.$$
We also recall the definition of moments:

Definition A8. If r ∈ N, we call moment of order r of a measure m ∈ M(X ) the quantity:
$$\int_{\mathcal{X}} d_p(x_0, x)^r \, \mathrm{d}m(x), \quad \text{for some fixed } x_0 \in \mathcal{X}.$$
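For intuition on Definition A7, on the real line the Wasserstein distance between two uniform discrete measures with the same number of atoms admits a classical closed form: the optimal transport plan simply matches sorted atoms. A minimal sketch (function name ours):

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    """Order-p Wasserstein distance between two uniform discrete probability
    measures on R with the same number of atoms: in 1-D the optimal plan
    matches the atoms in sorted order (a classical special case)."""
    xs, ys = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    return np.mean(np.abs(xs - ys) ** p) ** (1.0 / p)

w2 = wasserstein_1d([0.0, 1.0], [0.5, 1.5])       # each atom moves by 0.5
w1 = wasserstein_1d([0.0, 2.0], [1.0, 3.0], p=1)  # each atom moves by 1
```

In higher dimensions or with non-uniform weights, the infimum over Γ(µ, ν) must be solved as a linear program, which is why the general definition is stated through transport plans.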