Stochastic Proximal Gradient Algorithms for Multi-Source Quantitative Photoacoustic Tomography

The development of accurate and efficient image reconstruction algorithms is a central aspect of quantitative photoacoustic tomography (QPAT). In this paper, we address this issues for multi-source QPAT using the radiative transfer equation (RTE) as accurate model for light transport. The tissue parameters are jointly reconstructed from the acoustical data measured for each of the applied sources. We develop stochastic proximal gradient methods for multi-source QPAT, which are more efficient than standard proximal gradient methods in which a single iterative update has complexity proportional to the number applies sources. Additionally, we introduce a completely new formulation of QPAT as multilinear (MULL) inverse problem which avoids explicitly solving the RTE. The MULL formulation of QPAT is again addressed with stochastic proximal gradient methods. Numerical results for both approaches are presented. Besides the introduction of stochastic proximal gradient algorithms to QPAT, we consider the new MULL formulation of QPAT as main contribution of this paper.


Introduction
Photoacoustic tomography (PAT) is an emerging imaging modality, which combines the benefits of pure ultrasound imaging (high resolution) with those of pure optical tomography (high contrast); see [1,2]. The basic principle of PAT is as follows (see Figure 1): A semitransparent sample such as a part of a human patient is illuminated with short pulses of optical radiation. A fraction of the optical energy is absorbed inside the sample, which causes thermal heating, expansion, and a subsequent acoustic pressure wave depending on the interior absorbing structure of the sample. The acoustic pressure is measured outside of the sample and used to reconstruct an image of the interior.
One important reconstruction problem in PAT is recovering the initial pressure distribution (see, for example, [3][4][5][6][7][8][9][10]). The initial pressure distribution only provides qualitative information about the tissue-relevant parameters, as it is the product of the optical absorption coefficient and the spatially varying optical intensity, which again indirectly depends on the tissue parameters. Quantitative photoacoustic tomography (QPAT) addresses this issue and aims at quantitatively estimating the tissue parameters by supplementing the inversion of the acoustic wave equation with an inverse problem for light propagation (see, for example, [11][12][13][14][15][16][17][18][19][20][21][22][23][24][25][26]). Middle: due to the thermoelastic effect, the absorbed light distribution induces an acoustic pressure wave depending on internal tissue properties; Right: the acoustic pressure wave is measured outside the object and used to reconstruct an image of the interior.

Multi-Source QPAT
In this paper, we consider image reconstruction in QPAT using multiple sources. We allow limited view measurements, where, for each illumination, partial data are collected only from a certain angular domain. For modeling the light transport, we use the radiative transfer equation (RTE), which is commonly considered as a very accurate model for light transport in tissue (see, for example, [27][28][29][30]). In particular, opposed to the diffusion approximation, the RTE allows for modeling directed optical radiation, which is required for a reasonable QPAT forward model. Additionally, it allows for including internal voids as regions of low scattering. As proposed in [18], we work with a single-stage reconstruction procedure for QPAT, where the optical parameters are reconstructed directly from the measured acoustical data. The image reconstruction problem of multi-source QPAT using N different sources can be formulated as a system of nonlinear equations (see, for example, [18,31]) F i (µ) = v i for i = 1, . . . , N .
Here, F i is the operator that maps the unknown parameter pair µ = (µ a , µ s ) consisting of the absorption coefficient µ a : Ω → R and the scattering coefficient µ s : Ω → R to the measured acoustic data v i corresponding to the i-th source distribution (see Section 2 for precise definitions). There are two main classes of methods for solving the nonlinear inverse problem (1), namely, Tikhonov type regularization on the one and iterative regularization methods on the other hand [32][33][34]. Both approaches are based on rewriting (1) as a single equation F(µ) = v with forward operator F = (F i ) N i=1 and data v = (v i ) N i=1 . In Tikhonov regularization, one defines approximate solutions as minimizers of the penalized least squares functional 1 2 F(µ) − v 2 + λR(µ). Here, R( · ) is an appropriate regularization functional included to stabilize the inversion process and λ a regularization parameter that has to be carefully chosen depending on the data and the noise. In iterative regularization methods, stabilization is achieved via early stopping of iterative schemes. In such a situation, one usually applies iterative optimization techniques designed for minimizing the un-regularized least squares functional 1 2 F(µ) − v 2 , and the iteration index plays the role of the regularization parameter.
Tikhonov type as well as iterative regularization methods can both be formulated as finding a solution of the optimization problem with µ ∈ L 2 (Ω) × L 2 (Ω) .
In iterative regularization methods, one takes G = χ D , the characteristic function of the domain of definition D of the forward operator (taking the value 0 in and the value ∞ outside of D). In Tikhonov regularization, we take G = χ D + λR. Well established algorithms for solving Equation (2) are proximal gradient algorithms [35,36], which can be written in the form Here, prox s k G is the proximity operator and s k the positive step size; F i (µ k ) denotes the derivative of the i-th forward operator evaluated at µ k with F i (µ k ) * being its Hilbert space adjoint.

Stochastic Proximal Gradient Algorithms
Each iteration in the proximal gradient algorithm (3) can be numerically quite expensive, since it requires solving the forward and adjoint problems for all N equations in (1). In many cases, stochastic (proximal) gradient methods turn out to be more efficient since these methods only consider one of the equations in (1) per iteration. The stochastic proximal gradient method (see, for example, [37][38][39][40][41][42] and the references therein) for solving (2) is defined by . . , N} corresponds to one of the equations in (1) that is selected randomly for the update in the k-th iteration. In opposition to the standard proximal gradient method, this requires solving only one forward and one adjoint problem per iteration. Therefore, one iterative step is much cheaper for the stochastic gradient method than for the full gradient method. In the case of no regularization, λ = 0, the stochastic proximal gradient method reduces to the Kaczmarz method for inverse problems studied in [43][44][45]. The computationally most expensive task in the above methods is the numerical solution of the RTE. In this paper, we therefore additionally study a reformulation of the inverse problem of QPAT avoiding the computation of a solution of the RTE. For this purpose, the inverse problem is reformulated as multilinear inverse problem (35), where the RTE is added as a constraint instead of explicitly including its solution. The new formulation will be again addressed by Tikhonov regularization in combination with proximal stochastic gradient methods as discussed in Section 4.
Note that, in QPAT, it has often been assumed that the initial pressure distribution (corresponding to each illumination) is already recovered from acoustic measurements (see, for example, [13,16,23,25,26,[46][47][48][49]). Research was focused on inverting the light propagation in tissues either modeled by the RTE or the diffusion approximation. In the case that acoustic measurements are only known on parts of the boundary, reconstruction of the initial pressure distribution is not possible in a stable manner. In order to obtain stable reconstruction results in [18], we propose a single-stage approach for QPAT, where the optical parameters are directly recovered from the acoustic boundary data. Throughout this paper, we will make use of this approach, which delivers stable results especially in the limited view situation. In opposition to [18], in this paper, we introduce (proximal) stochastic gradient methods, which effectively exploit the multi-illumination structure and turn out to be faster than the standard proximal gradient methods.

Outline
The remainder of this paper is organized as follows. In Section 2, we provide the mathematical model for QPAT (the forward problem) using the RTE. We allow multiple sources and partial acoustic measurements. We also recall known results for QPAT including differentiability of the forward problem. In Section 3, we address the inverse problem of QPAT using Tikhonov regularization and study the proximal stochastic gradient method for its solution. The new reformulation of the inverse problem of QPAT as a multilinear inverse problem is presented in Section 4. For the solution of the proposed formulation, we again develop proximal gradient methods. Numerical results are presented in Section 5. The paper is concluded with a summary and outlook presented in Section 6.

The Forward Problem in QPAT
The image reconstruction problem of QPAT can be written as the system (1) of nonlinear equations, where the forward operators F i map tissue relevant parameters to acoustic data sets recorded in specific regions outside the tissue. Precise formulations will be given in this section.

Mathematical Notation
We fix some mathematical notation that is used throughout this paper. We denote by Ω ⊆ R d a convex domain with piecewise smooth boundary modeling our domain of interest, with d ∈ {2, 3} denoting the spatial dimension. In order to be able to impose appropriate boundary conditions for the RTE, it is convenient to split the set Γ := ∂Ω × S d−1 into inflow and outflow boundaries, with ν(x) denoting the outward pointing unit normal at x ∈ ∂Ω and x · y the standard inner product in R d . We write B R = {x ∈ R n | x < R} for the ball of radius R centered at the origin and suppose B R ⊇ Ω.
By L 2 (Ω) and L 2 (Ω × S d−1 ), we denote the Hilbert spaces of square integrable functions on Ω and Ω × S d−1 , respectively. By L 2 (Γ − , |ν · θ|), we denote the space of all q o : and define The inner products in Q, X, W , Y will be denoted by · , · Q , · , · X , · , · W , · , · Y , respectively. The subspace of all Φ ∈ W with Φ| Γ − = 0 will be denoted by W 0 . Elements in X will be written in the form µ = (µ a , µ s ) and are the parameters we aim to determine. They are actually required to be contained in the convex subset where µ a , µ s > 0. Elements in Q will be written in the form q = (q o , q i ) and model the optical sources. Elements in W describe the optical radiation, and elements in Y the measured acoustic data.

The Radiative Transfer Equation
To specify the forward operators, we require mathematical models for the light propagation, the conversion of optical into acoustic energy, and the propagation of the acoustic waves. These models will be presented in the rest of this section.
We model the optical radiation by a function Φ : is the density of photons at position x ∈ Ω and propagating into direction θ ∈ S d−1 . The interaction of the photons with the background are described by absorption coefficient µ a : Ω → R, the scattering coefficient µ s : Ω → R, and the scattering operator K : Φ → KΦ, taking the form (see [27,30]) with scattering kernel k : S d−1 × S d−1 → R. The absorption coefficient describes the ability of the background to absorb photons and the scattering coefficient describes the amount of photon scattering. The scattering kernel k (θ, θ ) describes the redistribution of velocity directions due to interaction of the photons with the background. From physical considerations, it is natural to assume k to be measurable, symmetric, nonnegative, and to satisfy S d−1 k ( · , θ ) dθ = 1. In this article, we are concerned with the situation when the kernel is known a priori. The photon density Φ(x, θ) is supposed to satisfy the stationary radiative transfer equation (RTE), with boundary conditions Here, q i : Ω × S d−1 → R denotes an internal photon source and q o : Γ − → R a prescribed boundary source pattern. Note that PAT uses very short light pulses (below microseconds) and that light propagation happens on time scales much shorter than the scale of acoustic wave propagation. This justifies the use of the stationary case for the RTE; see [12] for a more complete discussion.
Theorem 1 (Well-posedness of the RTE). For every µ ∈ D(T) and q ∈ Q, the stationary RTE (6) admits a unique solution Φ ∈ W. Moreover, there exists a constant C only depending on the parameters µ a , µ s > 0 (defining the domain D(T)), such that Proof. See [50].
Definition 1 (Solution operator for the RTE). The solution operator for the RTE is defined by where Φ denotes the unique solution of (6).
Theorem 1 guarantees that the operator T, mapping (q, µ) ∈ Q × D(T) to the solution of the RTE, is well defined. Note that in the actual application q = (q i , q o ) ∈ Q are prescribed sources, and µ = (µ a , µ s ) ∈ D(T) are the unknown parameters to be recovered.

Heating Operator
Due to the spatially varying absorption of photons, the tissue is locally heated and emits an acoustic pressure wave. The acoustic source is proportional to the amount of absorbed photons, the light intensity and the so-called Grüneisen parameter γ describing the efficiency of conversion of optical to acoustical energy. We assume γ to be constant and after appropriate re-scaling we take γ = 1; for more details about the Grüneisen parameter, we refer to [17]. Therefore, the conversion of the optical energy into acoustic pressure wave is described by the heating operator defined as follows.
Definition 2 (Heating operator). The heating operator is defined by If one introduces the averaging operator A : W → L 2 (Ω) defined by AΦ = S d−1 Φ( · , θ)dθ one may write the heating operator in the form Because T(q, µ) models the photon density, A • T(q, µ) actually models the total light intensity. The heating operator is therefore given by the product of the absorption coefficient and the light intensity. The averaging operator A is well defined and bounded and therefore the heating operator is well defined as a mapping between Q × D(T) and L 2 (Ω).

The Wave Equation
The local heating causes an acoustic pressure wave, where the initial pressure distribution p 0 is proportional to a fraction of the absorbed energy. Assuming constant speed of sound and after rescaling, the induced acoustic pressure p : R d × (0, ∞) → R satisfies the free-space wave equation: Here, the function p 0 vanishes outside B R , the ball of radius R, and acoustic data are collected on a subset of ∂B R × (0, ∞) that we denote by Λ × (0, ∞). Recall that coupling of the RTE and the wave equation happens in such a way that the result of the heating operator H(q, µ) acts as initial sound source p 0 depending on tissue parameters; see Definition 4. Standard existence and uniqueness theory for hyperbolic equations guarantees that, for any p 0 ∈ H 1 , (10) has a unique solution p ∈ H 1 , which continuously depends on p 0 . Taking the trace results in loss of regularity by degree 1/2. Therefore, p 0 → p| ∂B R ×(0,∞) is continuous between H 1 and H 1/2 . The following Lemma implies the much stronger result that p 0 → p| ∂B R ×(0,∞) is actually an L 2 -isometry.
have support in B R and let p denote the solution of (10). Then, Proof. See [51] for d odd and [52] for d even.
Definition 3. We define the solution operator with full boundary data for the wave Equation (10) by where p denotes the solution of (10).
According to Lemma 1, the operator U can be uniquely extended to a bounded linear operator defined on L 2 (B), denoted again by U : L 2 (B R ) → Y. The partial acoustic measurements made on Λ ⊆ ∂B R are then modeled by χ Λ×(0,∞) Up 0 .

Analysis of the Forward Problem in Multi-Source QPAT
We assume that we perform N individual experiments, where each experiment consists of separate optical sources and separate acoustic measurements. For the i-th experiment, we denote the source term by q i ∈ Q and assume the acoustic measurements are made on

Definition 4.
For any i ∈ {1, . . . , N}, we denote Here, T i denotes the i-th solution operator for the RTE, H i the i-th heating operator, U i the i-th partial solution operator for the wave equation, and F i the i-th forward operator.
Recall that T stands for the solution operator for the RTE (6) given in Definition 1, AΦ = S d−1 Φ( · , θ)dθ is the averaging operator, and U the solution operator for the wave Equation (10); see Definition 3. The operator T i models the photon transport and its solution (via the heating operator) acts as input for the solution of the wave equation and thereby couples the optical with the acoustical part.
Next, we recall continuity and differentiability of the forward operators. For that purpose, we call h ∈ X a feasible direction at µ ∈ D(T), if there exists some > 0 with µ + h ∈ D(T). Theorem 2 (Continuity and Differentiability).

1.
The operators T i , F i and H i are sequentially continuous and Lipschitz-continuous.

2.
For every µ ∈ D(T), the one-sided directional derivatives T i (µ)(h), F i (µ)(h) of T i , F i at µ in any feasible direction h exist, and are given by Proof. See [18].
Equations (13) and (14) define a bounded linear operator F i (µ) : X → Y, which we call the derivative of F i at µ ∈ D(T). Numerical minimization schemes actually require the adjoint of F i (µ), which we compute next.
Given data v 1 , . . . , v N ∈ Y, most numerical schemes for QPAT use gradients of the partial data-fidelity terms F I : By the chain rule, the gradient of F I is given by ∇F Convergence of schemes such as the (stochastic) proximal gradient method considered in the following section require the Lipschitz continuity of ∇F I , which will be shown in the following theorem.
Proof. Without loss of generality, we assume N = 1, For the second term in (18), we obtain where L T * and L T denote the Lipschitz constants of T * and T, and c 1 is a constant. The difference A(T * (µ)T(µ)) − A(T * (μ)T(μ)) in the first term in (18) is estimated in a similar manner. Furthermore, we have Noting that A, U and U * are linear and bounded, Theorem 2 and the computations above yield the Lipschitz continuity of ∇F.

Formulation of the Inverse Problem
The inverse problem of multi-source QPAT consists in finding µ ∈ X from measured data Here, µ = (µ a , µ s ) are the unknowns to be estimated, z i are the unknown error vectors, and v 1 , . . . , v N are the given noisy data. Using the notation v : we can write (19) in the alternative form Here, z ∈ Y N denotes the error vector.
There are, at least, two different strategies to address such an inverse problem: Tikhonov type regularization on the one and iterative methods on the other hand. In this section, we give an overview of such methods. In particular, we describe proximal stochastic gradient methods (for minimizing the Tikhonov functional), which seem particularly well suited for multi-source QPAT but have not been investigated yet for that purpose.

Tikhonov Regularization in QPAT
In this section, we consider a quadratic Tikhonov regularization term for solving (19). Let be a linear, densely defined, and possibly unbounded operator between X and another Hilbert space (Z, · , · Z ) and set D := D ∩ D(L). In this context, any element µ + ∈ D with Lµ + = min{ Lµ | F(µ) = v} is called an L( · ) -minimizing solution of Fµ = v. Tikhonov regularization with regularization term 1 2 Lµ 2 Z consists in computing a minimizer of the generalized Tikhonov functional T v,λ : X → R ∪ {∞}, defined by Here, λ > 0 denotes the regularization parameter that acts as a trade-off between the data fitting term and stability.

1.
For any v ∈ Y and any λ > 0, the Tikhonov functional T λ,v has at least one minimizer.

2.
Let Every sequence (µ m ) m∈N with µ m ∈ arg min T v m ,λ m has a weakly converging subsequence.
The limit of every weakly convergent subsequence of (µ m ) m∈N is an L( · ) -minimizing solution of Fµ = v.

The Proximal Stochastic Gradient Algorithm for QPAT
Depending on the particular choice of L, the Tikhonov functional (21) may be ill-conditioned. To address this issue in [18], we proposed the proximal gradient algorithm for minimizing (21), which is a very flexible algorithm for minimizing functionals of the form F + G, where F is smooth and G is convex (see, for example, [35,36]). Here, we extend the approach to the proximal stochastic gradient algorithm. Additionally, we propose computing the proximal step using Dykstra's projection algorithm.
Proximal gradient algorithm: The proximal gradient algorithm is a splitting method that iteratively computes explicit gradient steps for F and implicit proximal steps for G. In our context, we take F as the data fidelity term and where χ D is the characteristic function taking the value zero inside D and ∞ outside. The proximal gradient algorithm for minimizing the QPAT-Tikhonov functional (21) reads Here, prox s k G λ : X → D denotes the proximal mapping corresponding to the functional s k G λ , Furthermore, ∇F i (µ k ) is the gradient of the i-th data fidelity term computed in Theorem 3.

Dykstra's projection algorithm:
The constraint quadratic optimization problem (24) can efficiently be solved by a proximal variant of Dykstra's projection algorithm [35,36,53]. For that purpose, we write s k G λ = χ D + g with g(x) := s k λ 2 Lx 2 . Setting x 0 = µ, p 0 = 0 and q 0 = 0, Dykstra's projection algorithm for (24) reads, for m ∈ N, x m+1 = P D (y m + q m ), Both proximal mapping in (25) and the projection in (26) can be computed explicitly. In fact, one readily verifies that Here, I X is the identity operator on X and P D the projection onto D.
Proximal stochastic gradient algorithm: The methods described so far require in any iterative step the computation of the full gradient The evaluation of each ∇F i (µ) requires the solution of the RTE and an adjoint problem and therefore is quite time-consuming. For multi-source QPAT, where N > 1, a significant acceleration may be obtained by a Kaczmarz strategy, where in each iterative step only one of the summands ∇F i (µ) is used. The resulting proximal stochastic gradient method for minimizing the Tikhonov functional (21) in QPAT reads where i(k) ∈ {1, . . . , N} is selected randomly for the update in the k-th iteration. Furthermore, prox s k G λ is the proximal mapping of s k G λ that can be computed by Dykstra algorithm (25)- (28) and ∇F i (µ k ) is the gradient of the i-th data fidelity term that can be computed by Theorem 3.
One can also incorporate a block-iterative (or mini-batch) strategy in the stochastic gradient method, meaning that a small subset of {1, . . . , N} of equations is used per iteration instead of a single one. Such a variant could be especially useful in the case of a large number of different illumination patterns. For more details about stochastic gradient methods, see [37][38][39][40][41][42] and the references therein. Note that, in general, convergence of stochastic gradient methods requires asymptotically vanishing step size [42].

Iterative Regularization Methods
An alternative class of algorithms to address nonlinear inverse problems are iterative techniques. The most basic iterative method for solving the nonlinear inverse problem v = F(µ) is the Landweber iteration. In the case that the domain of definition D is a proper subset, we have to combine the Landweber iteration with a projection step onto D as presented in this subsection. The projected Landweber iteration applied to multi-source QPAT reads where ∇F i is the gradient of F i (see Equation (17)), and P D (µ) = min {µ, max {0, µ}} denotes the projection onto D. In Tikhonov regularization, the regularity of solutions is enforced by an explicitly included penalty. In opposition to that, in iterative regularization methods, a stabilization effect is enforced by early stopping of the iteration. A common stopping rule is the discrepancy principle, where iteration is stopped at the smallest index k ∈ N satisfying v − F(µ k ) ≤ τδ, where δ is an estimate for the noise and τ ≥ 1. Formally, the projected Landweber iteration (32) arises as a special case of the proximal gradient iteration (23) for minimizing the Tikhonov functional, where the regularization parameter is taken as λ = 0 and where the proximal mapping (24) reduces to the orthogonal projection onto D.
In a similar manner, one can also use a stochastic version of the projected Landweber iteration. Using the loping strategy of [43][44][45] in order to stabilize the iterative process, the resulting projected loping Landweber-Kaczmarz iteration reads Here, i(k) ∈ {1, . . . , N} for any k ∈ N may be randomly selected, τ > 1 is an appropriately chosen positive constant, ∇F i ( · ) is the gradient of the i-th data fidelity term computed in Theorem 3 and P D (µ) = min {µ, max {0, µ}} denotes the projection onto D. The iteration (33), (34) It is worth mentioning that, for noise free data, we have ω k = 1 for all k and, therefore, in this special situation, the iteration becomes µ k+1 = P D (µ k − s k ∇F i(k) (µ k )), which formally arises from the proximal stochastic gradient method (31) with λ = 0. A convergence analysis of the loping Landweber-Kaczmarz method can be found in [43,44].

QPAT as Multilinear Inverse Problem
Since the RTE is time-consuming to solve, we are looking for a suitable reformulation of the inverse problem in multi-source QPAT avoiding computation of a solution of the RTE in each iterative step. In this paper, we propose to write (19) as a multilinear inverse problem, where we add the RTE as a constraint instead of explicitly including its solution. The new formulation will again be addressed by Tikhonov regularization and proximal stochastic gradient methods.

Reformulation as Multilinear Inverse Problem
Recall the forward problem of QPAT governed by the RTE (6). With the abbreviation M(µ) := θ · ∇ x + µ a + µ s (I − K), the RTE can be written in compact form M(µ)Φ = q, where µ = (µ a , µ s ) is the unknown parameter pair. In the case of exact data, the multi-source problem in QPAT (19) then can be reformulated as the problem of finding the tuple z : Here, the index i indicates the i-th illumination, and q i ∈ Q, Φ i ∈ W, H i ∈ L 2 (Ω), v i ∈ Y are the corresponding source, photon density, heating and acoustical data, respectively, and AΦ i = S d−1 Φ i ( · , θ)dθ is the averaging operator. We call (35) and resulting formulations below the multilinear (MULL) formulation of QPAT.

Application of Tikhonov Regularization
In the case that the data v i are only known approximately, we use Tikhonov regularization for the stable solution of (35). For that purpose, we approximate (35) by the constrained optimization problem Here, the operator Lµ = (L a µ a , L s µ s ) is possibly unbounded, λ 2 L(µ) 2 is the regularization term and λ > 0 the regularization parameter. Note that (36) is equivalent to (21) and therefore the well-posedness and convergence results of Theorem 5 apply to (36) as well.
The constrained optimization problem (36) proposed in this paper can be addressed by various solution methods, for example using penalty methods or augmented Lagrangian techniques [54]. In this paper, we use a penalty approach for solving (36) where the constraints are included as penalty term. To simplify notation, we introduce the unconstraint functionals for certain parameters a 1 , a 2 , a 3 > 0 and z i : The sum of the unconstraint functionals (37) over all illuminations will actually be minimized in our numerical implementations. For that purpose, we define Then, we have J (i) (z) = ∑ 4 =1 a J (i) (z) + χ D (µ). For the approximate solution of (36), we minimize the unconstrained functional J(z) = ∑ 4 i=1 J (i) (z), which can be written in the forms (Here and below, we also write a 4 = λ, if it simplifies notation.) The formulations (38) as well as (39) can be solved by various optimization techniques. In particular, as the functionals are given as the sum of simpler terms, the stochastic (proximal) gradient method is particularly appealing.

Solution of the MULL Formulation of QPAT Using Stochastic Gradient Methods
For solving QPAT in the novel MULL formulation (35), we use stochastic gradient methods similar to previous sections. For that purpose, we require the gradients (determining the steepest descent directions) of the individual functionals J (i) (z i ) with respect to z i = (µ, Φ i , H i ), which are given as (All other partial gradients are vanishing.) In the following, let N be the number of illuminations, write z = (µ a , µ s , (Φ i , H i ) N i=1 ) and let (s k ) k∈N be a sequence of step sizes. In this paper, we propose the following instances of the stochastic proximal gradient method for QPAT based on the multilinear formulation (35).

MULL-projected stochastic gradient algorithm:
Here, we consider the form (38). For any iteration index k ∈ N choose i(k) ∈ {1, . . . , N} and (k) ∈ {1, . . . , 4} and define the sequence of iterates (z k ) k∈N by Here, the mapping P D × I is the proximal mapping corresponding to z → χ D (µ), which equals the projection P D in the µ component and equals the identity I in the other components.
For better scaling, in our actual numerical implementation, we replace the scalar step sizes s k by the adaptive step size rule Note that computing such step sizes does barely increase the computational time of the stochastic gradient method, since all involved calculations are anyhow necessary for computing the gradient for the iterative update. In opposition to that, calculating a similar adaptive step size for the algorithms proposed in Section 3 would require evaluation of the forward operators F i and therefore would significantly increase the computation time. This might be seen as an additional advantage of the novel MULL formulation (35) and its regularized version (36).

Numerical Simulations
For the Tikhonov approach to multi-source QPAT, the radiative transfer equation is numerically solved by a streamline diffusion finite element method. Solving the RTE is required to evaluate the forward operator F and the gradient ∇F of the data fidelity term in every iterative step. For the alternative multilinear approach, these calculations are not necessary. However, the application of the transport operator to Φ has to be calculated for every update of J 1 . The simulations are performed on the square domain Ω = [−1 cm, 1 cm] 2 , where the absorption and the scattering coefficient are supported.

Numerical Solution of the RTE
Employing a finite element scheme, we derive the weak formulation of Equation (6) by integrating against a test function w : Ω × S 1 → R and replacing the exact solution Φ by a linear combination in the finite element space i (x, θ) as in [18]. Here, the basis function ψ is the product of a basis function in space and a basis function in velocity. The spatial domain is triangulated uniformly with mesh size h and P 1 -Lagrangian element function for the spatial and velocity domain. By choosing the test function w(x, θ) = ∑ N h j=1 w j (ψ j (x, θ) + D(x, θ)θ · ∇ x ψ j (x, θ)) with streamline diffusion coefficient D(x, θ), we obtain Equation (52) yields a system of linear equations M (h) c (h) = b (h) , where evaluating the left-hand side of (52) provides the entries of M (h) , the right-hand side gives the components of vector b (h) . Note that the sparsity of matrix M (h) is low and solving the linear system for the Tikhonov approach is very time-consuming. On the other hand, the solution via the MULL formulation requires only a matrix vector multiplication, since in this case Φ (h) is an independent variable. Thus, only the application to Φ has to be calculated and the transport equation does not need to be solved.

Test Scenario for Multiple Illumination
The sample is illuminated in orthogonal direction at the boundaries of Ω = [−1 cm, 1 cm] 2 . In our simulations, we use N = 4 homogenous illuminations and no internal sources. The illuminations are applied separately from each side (left, right, top and bottom) with acoustic data measured on a half circle on the same side as the illumination (see Figure 2). For the scattering kernel, we use the two-dimensional Henyey-Greenstein kernel, k(θ, θ ) := 1 2π where the anisotropy factor is chosen as g = 0.5 in all our experiments.  For the simulated data, we choose a spatial mesh size 2/100, in order to discretize the velocity direction the unit circle is divided in 64 subintervals. In order to avoid inverse crime, for the reconstruction, we use a different spatial mesh size h = 2/80 and use N θ = 48 velocity directions. Calculating the simulated data corresponds to evaluating the forward operators F i with perpendicular boundary illumination constant along one side of the boundary square, q o (x, θ) = δ(θ − θ i )χ i (x)1 mJ cm −1 , where δ is an approximation of the Dirac delta function and χ i the indicator function of side i of Ω. In this way, we simulate data Thereby, the heating operator is computed numerically by solving the RTE as described in Section 5.1. The wave operator U is evaluated by straightforward discretization of the well-known explicit formulas for (10) that can be found, for example, in [55,56]. In the following, we present results for exact data (where z noise i = 0) as well as for noisy data. For the noisy data case, we add 0.5% random noise to the simulated data, i.e., we take the maximum value of the simulated pressure and add white noise z noise

Numerical Results
For regularizing the absorption and scattering coefficient, we make use of Laplace regularization and choose L a = ∆ and L s = 100∆, respectively. We assume that the coefficient µ is known at the boundary of Ω and is therefore used as the starting value of our iterative schemes. Furthermore, we use the boundary value of µ for regularization; that is, we implement it in the Dykstra projection procedure (26) by iteratively projecting on the known boundary value. In the following, we discuss the methods that we have outlined in the previous section. (19): We assume that the scattering coefficient is known and we restrict ourself to reconstructing the absorption coefficient. Then, the proximal gradient and proximal stochastic gradient algorithm, respectively, read

Standard formulation of QPAT
In contrast, to the full proximal gradient algorithm, the proximal stochastic gradient algorithm avoids evaluating the full gradient ∇F, but selects randomly an illumination number i ∈ {1, . . . , 4} for each iterative step. Because of formula (16), each iteration of the above procedures requires the calculation of the solution of the radiative transfer equation Φ as well of its adjoint Φ * .
The top row in Figure 4 shows reconstruction result for the absorption coefficient using the original formulation with the proximal gradient method with λ = 2 × 10 −8 and 10 iterative steps. The left picture shows the relative error µ a − µ k a / µ a . Note that, in this case, solutions of the RTE and its adjoint have to be computed for four illuminations per iterative step. The reconstruction results in the bottom row in Figure 4 are obtained by the proximal stochastic gradient method with λ = 2 × 10 −7 . The regularization parameters λ have been selected empirically as a trade-off between stability and accuracy. The total number of iterations is taken as 30. In each iteration, a illumination pattern is chosen randomly and the computation of RTE and its adjoint is executed only for this single illumination. Therefore, the computational effort for the proximal stochastic gradient method is approximately 3/4 of the proximal gradient algorithm using full gradients. For the algorithms based on the standard formulation (19), calculating adaptive step sizes similar to (51) is time-consuming as this requires another evaluation of the forward operator F i and therefore another solution of the RTE. Therefore, we simply use a constant step size rule; in our numerical experiments, it turned out that s k = 0.5 is a suitable choice.  Novel MULL formulation of QPAT (35): The multilinear approach overcomes the problem of solving the RTE by minimizing (38) or (39). In both cases, one selects an arbitrary functional and performs a steepest descent step, resulting in an iterative scheme for the variables µ a , µ s , Φ and H. Recall that none of the partial gradients (40)-(48) requires solving the RTE (which is the most time-consuming part for the standard formulation of QPAT). In each iterative step, we take a random illumination number i ∈ {1, . . . , 4} and a random functional number ∈ {1, 2, 3}. The gradient step then consists of the update rule Dykstra's algorithm for smoothing the µ component is applied after each iterative step when ∈ {1, 2}. Iteration (55) contains a gradient step for the RTE. Since one gradient step is not enough to obtain an appropriate approximation to the solution of the transport equation, we apply iteration (55) 40 times whenever = 1 is chosen. In this situation, we apply the Dykstra iteration in the µ component after these 40 iteration steps, whereas the positivity projection is done in every step. Flowcharts of the stochastic gradient algorithms (standard and MULL formulations) are shown in Figure 5. For the projected stochastic gradient method, regularization of µ is done by incorporating the regularization functional J 4 in the random choice of functionals; see (38). The positivity restriction is realized with the cut projection P D (µ) = max{0, µ} applied after every iterative step.
by (55) If ∈ {1, 2}: appl. Dykstra k := k + 1 no yes Figure 5. Flowcharts of stochastic gradient algorithms for QPAT proposed in this paper. Left: algorithm based on the standard formulation (19). Right: algorithm based on the novel MULL formulation (35). The update (54) using the standard formulation requires solving the forward RTE and the adjoint RTE, which is not required by (55) with the MULL formulation. Simulations are performed with N = 4. Figure 6 shows reconstructions with the stochastic gradient methods for the novel MULL formulation of QPAT (35). The results in the top row are for the MULL-proximal stochastic gradient algorithm with λ = 2 × 10 −8 and in the the bottom row results for the MULL-projected stochastic gradient method with λ = 2 × 10 −8 are shown. In both cases, we used 1000 iterations.

Remark 1.
Note that in the stochastic gradient methods for the novel MULL formulation of QPAT calculating the matrix vector product M (h) · Φ is the most costly part. In contrast, the standard formulation (19) requires the solution of the system M (h) c(h) = b (h) . Since the matrix M (h) is sparse only in its spatial domain, this is very time-consuming. On the other hand, the matrix M (h) (which is a discretization of θ · ∇ x + µ a + µ s (I − K)) has a simple dependence on the variables µ a , µ s . We therefore can compute the velocity entries of M (h) at the beginning of the iterative process to save computation time.
The reconstruction times for the final reconstructions using all methods described above are shown in Table 1. For the standard formulation of QPAT, the reconstruction times seem to be in accordance with reported results using gradient or Newton-type methods for QPAT (see, for example, [48].) The methods based on the new MULL formulation (35) (after 1000 iterations) are faster than the methods based on the standard formulation (19) of QPAT (after 10, respectively, 40 iterations). From the relative reconstruction errors shown in Figures 4 and 6, one notices that, opposed to the methods based on (19), the methods using the MULL formulation could even be stopped much earlier while still obtaining a comparable reconstruction quality. We roughly estimate a speedup of a factor 10 using the novel MULL formulation instead of the standard formulation of QPAT. Table 1. Reconstruction times for all methods. Recall that one iteration of the proximal stochastic gradient method is approximately four times cheaper than one iteration of the full proximal gradient method (both based on (19)). Further recall that one step in the methods based on the MULL formulation (35) is much less time consuming than for the methods based on (19); see Remark 1.

Algorithm Model Update No. Iterations Reconstruction Time
Proximal gradient (19)  In Figure 7, we show results for noisy data using the proximal gradient method based on the standard formulation (19) (top) and the proximal stochastic gradient method using the MULL formulation for QPAT (bottom). The regularization parameter is chosen as in the exact data case. Finally, in Figure 8, we show reconstruction results using only two consecutive illuminations applied from the top and from the left with noisy data. We use 10 iterations for the proximal gradient algorithm based on (19) (shown in left image in Figure 8) and 500 iterations for the stochastic gradient algorithms based on the MULL formulation (35) (shown in the right image in Figure 8). iterations. The phantom is as described in Figure 3 and, for the reconstruction methods, we use two consecutive illuminations (from the top and from the left). The reconstruction time has been about 14 h for the method based on the standard formulation (19) and 3 h for the proposed MULL-proximal stochastic gradient method.

Conclusions
In this paper, we developed efficient proximal stochastic gradient methods for image reconstruction in multi-source QPAT. We used the RTE as an accurate model for light transport and employed the single stage approach for QPAT introduced in [18]. One class of the proximal stochastic gradient methods has been developed based on the standard formulation for QPAT given in (19). Additionally, we developed another class using proximal stochastic gradient methods for the new MULL formulations (35) and (36) for QPAT. Besides proposing proximal stochastic gradient methods for QPAT, we also consider the formulations (35) and (36) as the main contributions of the present article. These new formulations avoid the time-consuming evaluation of the RTE at each iteration and allow for treating the QPAT problem as a constrained optimization problem, which enables the use of a variety of numerical algorithms. Here, we used a penalty approach in combination with stochastic gradient methods for the solution. Future work will be done in the direction of developing new algorithms based on (35) and (36). Additionally, we will investigate the use of different regularization terms in (36). Finally, the theoretical convergence analysis of proximal gradient algorithms and other iterative algorithms for solving (35) will be the subject of future research.