All the following results are proven for a domain
$\mathcal{X}$ with no boundaries, e.g., the
d-dimensional torus
${\mathbb{T}}^{d}$. The case described in the former sections—
$\mathcal{X}$ is any compact subset of
${\mathbb{R}}^{d}$—is included in this new setting, since any compact
$\mathcal{X}$ can be periodised to yield a domain with no boundaries. The forward operator kernel
$\phi :\mathcal{X}\to \mathcal{H}$ should also be differentiable in the Fréchet sense. The least-squares term of the BLASSO is replaced by the more general data term
$R:\mathcal{H}\to {\mathbb{R}}^{+}$, the functional
${T}_{\lambda}$ of the BLASSO will now be restricted to
${\mathcal{M}}^{+}\left(\mathcal{X}\right)$ and denoted
J; its Fréchet differential at point
$\nu \in {\mathcal{M}}^{+}\left(\mathcal{X}\right)$ is denoted
${J}_{\nu}^{\prime}$:
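For a data term R and the forward operator $\Phi \nu \stackrel{\mathrm{def}.}{=}{\int}_{\mathcal{X}}\phi \,\mathrm{d}\nu$, a standard expression for this differential (our assumption, consistent with the definitions above; the exact display may differ) is:

```latex
\forall x \in \mathcal{X}, \qquad
J'_{\nu}(x) = \left\langle \nabla R\left(\Phi \nu\right),\, \phi(x) \right\rangle_{\mathcal{H}} + \lambda .
```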
Sparse optimisation on measures through optimal transport [
3,
23] relies on the approximation of the ground-truth positive measure
${m}_{{a}_{0},{x}_{0}}$ by a ‘system of
$N\in {\mathbb{N}}^{\ast}$ particles’, i.e., an element of the space
${\mathsf{\Omega}}^{N}\stackrel{\mathrm{def}.}{=}{({\mathbb{R}}^{+}\times \mathcal{X})}^{N}$. The point is then to estimate the ground-truth measure by a gradient-based optimisation of the objective function:
where
$({r}_{i},{x}_{i})$ belongs to the lifted space
$\mathsf{\Omega}\stackrel{\mathrm{def}.}{=}{\mathbb{R}}^{+}\times \mathcal{X}$ endowed with a metric. Hence, the hope is that the gradient descent on
${F}_{N}$ converges to the amplitudes and positions of the ground-truth measure, despite the non-convexity of functional (7). The author of [
23] proposes the definition of a suitable metric for the gradient of
${F}_{N}$, which enables separation of the variables in the gradient descent update. Let
$\alpha ,\beta $ be two parameters such that
$\alpha >0$ and
$\beta >0$; for any
$(r,\theta )\in \mathsf{\Omega}$, we define a Riemannian inner product on
$\mathsf{\Omega}$, called the
cone metric, defined by
$\forall (\delta {r}_{1},\delta {r}_{2})\in {\mathbb{R}}_{+}^{2}$,
$\forall (\delta {\theta}_{1},\delta {\theta}_{2})\in {\mathcal{X}}^{2}$:
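One typical choice of such a weighted cone metric (the exact placement of $\alpha $ and $\beta $ is our assumption, consistent with their roles as Fisher–Rao and Wasserstein weights; see [23] for the exact definition) is:

```latex
\left\langle (\delta r_1, \delta\theta_1),\, (\delta r_2, \delta\theta_2) \right\rangle_{(r,\theta)}
\stackrel{\mathrm{def}.}{=} \frac{1}{\alpha}\, \delta r_1\, \delta r_2
+ \frac{r^2}{\beta}\, \left\langle \delta\theta_1, \delta\theta_2 \right\rangle_{\mathcal{X}} .
```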
3.3.1. Theoretical Results
The main idea of these papers [
3,
23] boils down to the following observation: the minimisation of function (7) is a particular case of a more general problem, formulated in terms of measures on the lifted space
$\mathsf{\Omega}$. More precisely, the space is
${\mathcal{P}}_{2}\left(\mathsf{\Omega}\right)$, the subset of
$\mathcal{M}\left(\mathsf{\Omega}\right)$ of probability measures with finite second moments, endowed with the 2-Wasserstein metric, i.e., the optimal transport distance: see
Appendix B.5 for more details. Hence, the lift of the unknown
$m\in {\mathcal{M}}^{+}\left(\mathcal{X}\right)$ to
$\mu \in {\mathcal{P}}_{2}\left(\mathsf{\Omega}\right)$ enables the removal of the asymmetry for discrete measures between position
$x\in \mathcal{X}$ and amplitude
$a\in {\mathbb{R}}^{+}$ by lifting
$a{\delta}_{x}$ to
${\delta}_{(a,x)}$. The lifted functional now reads, for a parameter
$\lambda >0$:
where
$\tilde{\Phi}\mu \stackrel{\mathrm{def}.}{=}{\int}_{\mathsf{\Omega}}\varphi (a,x)\,\mathrm{d}\mu (a,x)$ for
$\varphi (a,x)\stackrel{\mathrm{def}.}{=}a\phi \left(x\right)$ and
$\tilde{V}$ is the TV-norm on the spatial component of the measure
$\mu $. The functional is non-convex; its Fréchet differential is denoted
${F}^{\prime}$, and for
$u\in \mathsf{\Omega}$:
with
${\tilde{R}}^{\prime}\stackrel{\mathrm{def}.}{=}{\Vert y-{\int}_{\mathsf{\Omega}}\nabla \varphi (a,x)\,\mathrm{d}\mu (a,x)\Vert}_{\mathcal{H}}^{2}$. Then, a discrete measure
${\mu}_{N}\stackrel{\mathrm{def}.}{=}\frac{1}{N}{\sum}_{i=1}^{N}{\delta}_{({a}_{i},{x}_{i})}$ of
${\mathcal{P}}_{2}\left(\mathsf{\Omega}\right)$ can also be seen as an element of
${\mathsf{\Omega}}^{N}$ from the standpoint of its components
$({a}_{i},{x}_{i})$. It allows the authors of [
3,
23] to perform a precise characterisation of the source recovery conditions, through measures and the tools of optimal transport, such as the gradient flow (see below).
Then, one may run a gradient descent on the amplitudes and positions $({a}_{i},{x}_{i})\in {({\mathbb{R}}^{+}\times \mathcal{X})}^{N}$ of the measure ${\mu}_{N}$, in order to exploit the differentiability of the kernel $\phi $. Note that the measure ${\mu}_{N}$ is over-parametrised, i.e., its number of $\delta $-peaks is larger than the number of spikes of the ground-truth measure: thus, the particles, namely the $\delta $-peaks of the space $\mathsf{\Omega}$, cover the domain $\mathcal{X}$ in their spatial part (see, as an example, the figure where ${\mu}_{N}$ is plotted in red dots).
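As an illustration, such an over-parametrised particle system and its discrete objective can be sketched as follows (the Gaussian kernel, the domain $\mathcal{X}=[0,1]$, and the exact form of ${F}_{N}$ are illustrative assumptions of ours, not the paper's setup):

```python
import numpy as np

# Toy sketch of a 'system of N particles' in Omega^N = (R_+ x X)^N.
# Kernel, domain and objective are illustrative assumptions.

m, sigma, lam = 32, 0.05, 0.1
t = np.linspace(0.0, 1.0, m)  # sampling grid of H = R^m

def phi(x):
    """Toy forward kernel phi: X -> H (sampled Gaussian)."""
    return np.exp(-((t - x) ** 2) / (2 * sigma ** 2))

# Ground truth m_{a0,x0} = sum_k a0_k * delta_{x0_k} and its measurements y.
a0, x0 = np.array([1.0, 0.7]), np.array([0.3, 0.8])
y = sum(a * phi(x) for a, x in zip(a0, x0))

# Over-parametrised initialisation: N >> 2 particles covering X spatially.
N = 16
r = np.full(N, 0.2)            # lifted amplitude variables (amplitude a_i = r_i^2)
x = np.linspace(0.0, 1.0, N)   # spatial parts spread over the whole domain

def F_N(r, x):
    """Discrete, non-convex objective over amplitudes and positions."""
    model = sum(ri ** 2 * phi(xi) for ri, xi in zip(r, x)) / N
    return 0.5 * np.sum((model - y) ** 2) + lam * np.sum(r ** 2) / N

val = F_N(r, x)
```

The spatial parts are spread uniformly so that, before optimisation, no region of $\mathcal{X}$ is left uncovered by particles.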
Before giving the main results, we need to clarify the generalised notion of gradient descent for functions of measures, called the
gradient flow [
35,
36] in optimal transport theory, the main ingredient of the particle gradient descent. Letting
$F:{\mathbb{R}}^{d}\to \mathbb{R}$ be an objective function with suitable regularity, a gradient flow describes the evolution of a curve
$x\left(t\right)$ such that its starting point at
$t=0$ is
${x}_{0}\in {\mathbb{R}}^{d}$, evolving by choosing at any time
t the direction that decreases the function
F the most [
36]:
The interest of the gradient flow lies in its extension to spaces
X with no differentiable structure. In the differentiable case, one can consider the discretisation of the gradient flow, i.e., the sequence defined for a step-size
$\tau >0$ and
$k\in {\mathbb{N}}^{\ast}$:
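This discretisation is the standard minimising-movement step (our reconstruction, written under the usual conventions):

```latex
x_{k+1}^{\tau} \in \operatorname*{argmin}_{x \in \mathbb{R}^d} \; F(x) + \frac{1}{2\tau}\, \big\Vert x - x_k^{\tau} \big\Vert^2 .
```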
It is the implicit Euler scheme for the equation
${\left({x}^{\tau}\right)}^{\prime}=-\nabla F\left({x}^{\tau}\right)$, or the weaker
$-{\left({x}^{\tau}\right)}^{\prime}\in \partial F\left({x}^{\tau}\right)$ if
F is convex and non-smooth. The gradient flow is then the limit (under certain hypotheses) of the sequence
${\left({x}_{k}^{\tau}\right)}_{k\ge 0}$ as
$\tau \to 0$, for a starting point
${x}_{0}\in X$. The gradient flow can be extended to metric spaces: indeed, for a metric space
$(X,d)$ and a lower semicontinuous map
$F:X\to \mathbb{R}$, one can define the discretisation of the gradient flow by the sequence
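A standard form of this discretisation (our reconstruction of the minimising-movement scheme) replaces the squared Euclidean norm by the squared distance:

```latex
x_{k+1}^{\tau} \in \operatorname*{argmin}_{x \in X} \; F(x) + \frac{1}{2\tau}\, d\!\left(x, x_k^{\tau}\right)^2 .
```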
In the case of the metric space of probability measures, i.e., measures with unit mass, the scheme converges as
$\tau \to 0$ to the unique gradient flow starting at
${x}_{0}$, an element of the metric space. A typical case is the space of probability measures with finite second moments
${\mathcal{P}}_{2}\left(\mathsf{\Omega}\right)$, endowed with the 2-Wasserstein metric, i.e., the optimal transport distance (see
Appendix B.5): a gradient flow in this space
${\mathcal{P}}_{2}\left(\mathsf{\Omega}\right)$ is a curve
$t\mapsto {\mu}_{t}$ called a
Wasserstein gradient flow starting at
${\mu}_{0}\in {\mathcal{P}}_{2}\left(\mathsf{\Omega}\right)$; for all
$t\in {\mathbb{R}}^{+}$, one has
${\mu}_{t}\in {\mathcal{P}}_{2}\left(\mathsf{\Omega}\right)$, and the curve obeys the following partial differential equation in the sense of distributions:
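In its standard form (our reconstruction, up to sign conventions), this is the continuity equation driven by the differential of F:

```latex
\partial_t \mu_t = \mathrm{div}\left( \mu_t \, \nabla F'_{\mu_t} \right)
\quad \text{in } \mathcal{D}'\big( (0,+\infty) \times \mathsf{\Omega} \big).
```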
Recall that $div\left(m\right)={\sum}_{i=1}^{d}\frac{\partial m}{\partial {x}_{i}}$ for all $m\in \mathcal{M}\left(\mathcal{X}\right)$, where the derivatives ought to be understood in the distributional sense. This equation ensures the conservation of mass, namely, at each time $t>0$, one has ${\mu}_{t}\left(\mathsf{\Omega}\right)={\mu}_{0}\left(\mathsf{\Omega}\right)$. Hence, despite the lack of differentiable structure of ${\mathcal{P}}_{2}\left(\mathsf{\Omega}\right)$, which forbids the straightforward application of a classical gradient-based algorithm, one can perform an optimisation on this space through the gradient flow to reach a minimum of F by discretising (11).
The interesting case of a gradient flow in ${\mathcal{P}}_{2}\left(\mathsf{\Omega}\right)$ is the flow starting at ${\mu}_{N,0}\stackrel{\mathrm{def}.}{=}1/N{\sum}_{i=1}^{N}{\delta}_{({a}_{i}^{0},{x}_{i}^{0})}$, uniquely defined by Equation (11), which reads for all $t\in {\mathbb{R}}^{+}$: ${\mu}_{N,t}\stackrel{\mathrm{def}.}{=}1/N{\sum}_{i=1}^{N}{\delta}_{({a}_{i}\left(t\right),{x}_{i}\left(t\right))}$, where ${a}_{i}:{\mathbb{R}}^{+}\to {\mathbb{R}}^{+}$ and ${x}_{i}:{\mathbb{R}}^{+}\to \mathcal{X}$ are continuous maps. This path ${\left({\mu}_{N,t}\right)}_{t\ge 0}$ is a Wasserstein gradient flow, and it uses N Dirac measures over $\mathsf{\Omega}$ to optimise the objective function F in (9). When the number of particles N goes to infinity, and if ${\mu}_{N,0}$ converges to some ${\mu}_{0}\in {\mathcal{P}}_{2}\left(\mathsf{\Omega}\right)$, the gradient flow ${\left({\mu}_{N,t}\right)}_{t\ge 0}$ converges to the unique Wasserstein gradient flow of F starting from ${\mu}_{0}$, described by the time-dependent density ${\left({\mu}_{t}\right)}_{t\ge 0}$ valued in ${\mathcal{P}}_{2}\left(\mathsf{\Omega}\right)$ and obeying the aforementioned partial differential Equation (11).
For these non-convex gradient flows, the authors of [
3] give a consistency result for gradient-based optimisation methods: under certain hypotheses, the gradient flow
${\left({\mu}_{N,t}\right)}_{t\ge 0}$ converges to global
minima in the over-parametrisation limit, i.e., for
$N\to +\infty $. It relies on two important assumptions that prevent the optimisation from being trapped in non-optimal points:
We can then introduce the fundamental result for the many-particle limit [
3], i.e., the mean-field limit of the gradient flows
${\left({\mu}_{N,t}\right)}_{t\ge 0}$, despite the lack of convexity of these flows:
Theorem 2 (Global convergence—informal). If the initialisation ${\mu}_{N,0}$ is such that the support of ${\mu}_{0}\stackrel{\mathrm{def}.}{=}{lim}_{N\to +\infty}{\mu}_{N,0}$ separates (The support of a measure m is the complement of the largest open set on which m vanishes. In an ambient space $\mathcal{X}$, we say that a set C separates the sets A and B if any continuous path in $\mathcal{X}$ with endpoints in A and B intersects C.) $\{-\infty \}\times \mathcal{X}$ from $\{+\infty \}\times \mathcal{X}$, then the gradient flow ${\mu}_{t}$ weakly* (see Appendix B.1) converges in ${\mathcal{P}}_{2}\left(\mathsf{\Omega}\right)$ to a global minimum
of F, and we also have: Limits can be interchanged; the interested reader might take a look at [
3] for precise statements and exact hypotheses (boundary conditions, ‘Sard-type’ regularity, e.g.,
$\phi $ is
d-times continuously differentiable, etc.).
Since we have a convergence result, we can now investigate the numerical implementation. This optimisation problem is made tractable by the Conic Particle Gradient Descent algorithm [
23], denoted CPGD: the proposed framework involves a slightly different gradient flow
${\left({\nu}_{t}\right)}_{t\ge 0}$ defined through a projection of
${\left({\mu}_{t}\right)}_{t\ge 0}$ onto
${\mathcal{M}}^{+}\left(\mathcal{X}\right)$. This new gradient flow
${\left({\nu}_{t}\right)}_{t\ge 0}$ is defined for a specific metric on
${\mathcal{M}}^{+}\left(\mathcal{X}\right)$, which is a trade-off between the Wasserstein and the Fisher–Rao (also called the Hellinger metric) metrics [
23]; it is then called a
Wasserstein–Fisher–Rao gradient flow. Then, the Wasserstein–Fisher–Rao gradient flow starting at
${\nu}_{N,0}\stackrel{\mathrm{def}.}{=}{\sum}_{i=1}^{N}{a}_{i}^{0}{\delta}_{{x}_{i}^{0}}$ in
${\mathcal{M}}^{+}\left(\mathcal{X}\right)$ reads
$t\mapsto {\nu}_{N,t}\stackrel{\mathrm{def}.}{=}\frac{1}{N}{\sum}_{i=1}^{N}{{r}_{i}\left(t\right)}^{2}{\delta}_{{x}_{i}\left(t\right)}$ in
${\mathcal{M}}^{+}\left(\mathcal{X}\right)$, rather than the Wasserstein flow
$t\mapsto {\mu}_{N,t}\stackrel{\mathrm{def}.}{=}\frac{1}{N}{\sum}_{i=1}^{N}{\delta}_{{r}_{i}\left(t\right),{x}_{i}\left(t\right)}$ starting at
${\mu}_{N,0}$ in
${\mathcal{P}}_{2}\left(\mathsf{\Omega}\right)$. The partial differential equation of a Wasserstein–Fisher–Rao flow reads:
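Schematically (our reconstruction, up to normalising constants), it couples a transport term weighted by $\beta $ with a reaction term weighted by $\alpha $:

```latex
\partial_t \nu_t = \beta \, \mathrm{div}\left( \nu_t \, \nabla J'_{\nu_t} \right) - \alpha \, \nu_t \, J'_{\nu_t} ,
```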
for the two parameters
$\alpha ,\beta >0$ arising from the cone metric:
$\alpha $ tunes the Fisher–Rao metric weight, while
$\beta $ tunes the Wasserstein one. All statements on convergence could alternatively be made on
${\mu}_{t}$ or
${\nu}_{t}$; indeed, we have an analogous theorem:
Theorem 3 (Global convergence—informal). If ${\nu}_{0}$ has full support (i.e., its support is the whole set $\mathcal{X}$) and ${\left({\nu}_{t}\right)}_{t\ge 0}$ converges for $t\to +\infty $, then the limit is a global minimum
of J. If ${\nu}_{N,0}\underset{N\to +\infty}{\to}{\nu}_{0}$ in the weak* sense, then:
Summary (3rd algorithm, theoretical aspects): we introduced the solution proposed in [3,23], namely approximating the source measure through a discrete non-convex objective function of the amplitudes and positions. The analytical study of this discrete function is a difficult problem, which can be tackled by recasting it in the space of measures. We then exhibited the theoretical framework of gradient flows, understood as a generalisation of gradient descent to the space of measures. Eventually, we presented the convergence results of the gradient flow ${\left({\nu}_{t}\right)}_{t}$ towards the minimum of the BLASSO. Gradient descent on the discrete objective approximates well the gradient flow dynamic, and can thus benefit from the convergence results exhibited before. We now discuss the numerical results of the particle gradient descent. The reader is advised to take a look at
Figure 6, more precisely at the red and green ellipses, to get a grasp of the numerical part.
3.3.2. Numerical Results
We recall that a gradient flow
${\left({\nu}_{N,t}\right)}_{t\ge 0}$ starting at
${\nu}_{N,0}\stackrel{\mathrm{def}.}{=}1/N{\sum}_{i=1}^{N}{\left({r}_{i}^{\left(0\right)}\right)}^{2}{\delta}_{{x}_{i}^{\left(0\right)}}$ can be seen as a (time-continuous) generalisation of gradient descent in the space of measures, allowing precise theoretical statements on the recovery conditions. To approach this gradient flow, we use the Conic Particle Gradient Descent algorithm [
23] denoted CPGD: the point is to discretise the evolution of the gradient flow
$t\mapsto {\nu}_{N,t}$ through a numerical scheme on (12). This consists of a gradient descent on the amplitudes
r and positions
x through the gradient of the functional
${F}_{N}$ in Equation (8), a strategy which approximates well the dynamic of the gradient flow [
23].
This choice of gradient with respect to the cone metric enables multiplicative updates in
r and additive updates in
x, the two being independent of each other. The algorithm then consists of a gradient descent with the definitions of
${r}_{i}^{\prime}\left(t\right)$ and
${x}_{i}^{\prime}\left(t\right)$ according to [
2,
23]:
thanks to a gradient of Equation (8), for the mirror retraction (The notion of
retraction compatible with the cone structure is central: in the Riemannian context, a retraction is a continuous mapping that maps a tangent vector to a point on the manifold. Formally, one could see it as a way to map the gradient step back onto the manifold. See [
23] for other choices of compatible retractions and more insights on these notions.) and
${\eta}_{\lambda}={J}_{\nu}^{\prime}/\lambda $. The structure of the CPGD is presented in Algorithm 4. Note that the multiplicative updates in
r yield an exponential of the certificate, and that the updates of the quantities
$r,x$ are separated.
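To make the structure of these updates concrete, here is a schematic toy implementation on a 1D deconvolution problem. The Gaussian kernel, the parameter values, and the exact scaling of the updates are our illustrative assumptions, not the genuine scheme of [23]; only the structure is the point: a multiplicative (mirror) step on the amplitudes and an additive step on the positions.

```python
import numpy as np

# Schematic toy CPGD-style updates: multiplicative on amplitudes,
# additive on positions. Kernel, steps and scalings are assumptions.

m, sigma, lam = 64, 0.05, 0.5
t = np.linspace(0.0, 1.0, m)  # sampling grid of H = R^m

def phi(x):
    """Toy Gaussian kernel and its derivative with respect to x."""
    g = np.exp(-((t - x) ** 2) / (2 * sigma ** 2))
    return g, g * (t - x) / sigma ** 2

a0, x0 = np.array([1.0, 0.7]), np.array([0.3, 0.8])  # ground-truth spikes
y = sum(ai * phi(xi)[0] for ai, xi in zip(a0, x0))   # noiseless data

N, alpha, beta, tau = 20, 1.0, 0.05, 0.01            # particles, cone weights, step
x = np.linspace(0.05, 0.95, N)                       # positions cover X = [0, 1]
a = np.full(N, 0.1)                                  # amplitudes a_i = r_i^2

def loss(a, x):
    """BLASSO-like objective: data fit plus total mass."""
    res = sum(ai * phi(xi)[0] for ai, xi in zip(a, x)) - y
    return 0.5 / lam * (res @ res) + a.sum()

loss_init = loss(a, x)
for _ in range(300):
    res = sum(ai * phi(xi)[0] for ai, xi in zip(a, x)) - y
    grad_a = np.array([phi(xi)[0] @ res for xi in x]) / lam + 1.0     # dLoss/da_i
    grad_x = np.array([ai * (phi(xi)[1] @ res)
                       for ai, xi in zip(a, x)]) / lam                # dLoss/dx_i
    a = a * np.exp(-2.0 * alpha * tau * grad_a)   # multiplicative (mirror) update
    x = np.clip(x - beta * tau * grad_x, 0.0, 1.0)  # additive update
loss_final = loss(a, x)
```

By construction, the amplitudes stay positive thanks to the exponential update; particles far from the true spikes lose mass geometrically, while particles near them gain mass and drift towards the spike locations.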
This algorithm has rather easy and cheap iterations: to reach an accuracy of
$\epsilon $—i.e., a distance such as the
∞-Wasserstein distance between the source measure
${m}_{{a}_{0},{x}_{0}}$ and the reconstructed measure
${m}^{\ast}$ is below
$\epsilon $—the CPGD yields a typical complexity cost of
$\mathrm{log}\left({\epsilon}^{-1}\right)$ rather than
${\epsilon}^{-1/2}$ for a convex program ([
23] Theorem 4.2). A reconstruction from the latter 1D Fourier measurements is plotted in
Figure 7; the reconstruction is obtained through two gradient flows, the former on the positive measures to recover the positive
$\delta $-peaks of the ground-truth and the latter on the negative measures to recover the negative ones: the merging of the two results gives the reconstructed
$\delta $-peaks. The noiseless reconstruction (See our GitHub repository for our implementation:
https://github.com/XeBasTeX, accessed on 30 November 2021) for 2D Gaussian convolution with the same setting as the Frank–Wolfe section is plotted in
Figure 8. One can see that the spikes are wellrecovered as some nonzero red and blue particles cluster around the three
$\delta $-peaks.
Algorithm 4. Conic particle gradient descent algorithm. 

Summary (3rd algorithm, numerical aspects): the gradient flow ${\left({\nu}_{t}\right)}_{t}$ is computable by the Conic Particle Gradient Descent algorithm, consisting of an estimation through a gradient descent (w.r.t. the cone metric) on both the amplitudes and the positions of an over-parametrised measure, namely a measure with a fixed number of $\delta $-peaks exceeding the source’s. The iterations are cheaper than those of the SFW presented before, but the CPGD lacks guarantees in the low-noise regime.
To sum up all the pros and cons of these algorithms, we give
Table 1 for a quick digest. Since the CPGD lacks guarantees on the global optimality of its output, the following section will use the conditional gradient, and more precisely the
Sliding Frank–Wolfe, to tackle the SMLM super-resolution problem.