Accelerating the Sinkhorn algorithm for sparse multi-marginal optimal transport by fast Fourier transforms

We consider the numerical solution of the discrete multi-marginal optimal transport (MOT) by means of the Sinkhorn algorithm. In general, the Sinkhorn algorithm suffers from the curse of dimensionality with respect to the number of marginals. If the MOT cost function decouples according to a tree or circle, its complexity is linear in the number of marginal measures. In this case, we speed up the convolution with the radial kernel required in the Sinkhorn algorithm by non-uniform fast Fourier methods. Each step of the proposed accelerated Sinkhorn algorithm with a tree-structured cost function has a complexity of $\mathcal O(K N)$ instead of the classical $\mathcal O(K N^2)$ for straightforward matrix-vector operations, where $K$ is the number of marginals and each marginal measure is supported on at most $N$ points. In case of a circle-structured cost function, the complexity improves from $\mathcal O(K N^3)$ to $\mathcal O(K N^2)$. This is confirmed by numerical experiments.


Introduction
The optimal transport (OT) problem is an optimization problem that deals with the search for an optimal map (plan) that moves mass between two or more measures at low cost [44,55]. OT appears in a wide range of applications such as image and signal processing [7,14,52,54,56], economics [18,27], finance [21,22], and physics [26,29]. The OT problem was first introduced in 1781 by Monge. His objective was to find a map between two probability measures $\mu_1, \mu_2$ on $\mathbb{R}^d$ that transports $\mu_1$ to $\mu_2$ with minimal cost, where the cost function describes the cost of transporting mass between two points in $\mathbb{R}^d$. However, such maps do not always exist, so Kantorovich [31] relaxed the problem in 1942 by looking for a transport plan with two prescribed marginals $\mu_1$ and $\mu_2$ that minimizes a certain cost functional.
Since the numerical computation of a transport plan is difficult in general, a regularization term such as the entropy [10,29], the Kullback-Leibler divergence [41], a general $f$-divergence [53] or an $L^2$-regularization [13,37] can be added to make the problem strictly convex. Different approaches such as the Sinkhorn algorithm [29,44], stochastic gradient descent [28], the Gauss-Seidel method [37], or proximal splitting [5] have been used to iteratively determine a minimizing sequence of the MOT problem.
However, the problem suffers from the curse of dimensionality, as the complexity grows exponentially with the number $K$ of marginal measures. One way to circumvent this lies in incorporating additional sparsity assumptions into the model. Polynomial-time algorithms for solving certain sparse MOT problems have been studied in [4,9,29]. We will assume that the cost function decouples according to a graph, where the nodes correspond to the marginals and the cost function is a sum of functions that depend only on two variables corresponding to two marginals connected by an edge of the graph. For example, the circle-structured squared Euclidean cost function reads as
$$c(x_1, \dots, x_K) = \|x_1 - x_2\|_2^2 + \cdots + \|x_{K-1} - x_K\|_2^2 + \|x_K - x_1\|_2^2, \quad x_1, \dots, x_K \in \mathbb{R}^d.$$
For such structured cost functions, each Sinkhorn iteration reduces to convolutions with a radial kernel, which can be significantly accelerated by Fourier-based fast summation methods [46,47]. The key idea is the approximation of the kernel function by a Fourier series, which enables the application of the non-uniform fast Fourier transform (NFFT). Although such fast summation methods are frequently used in different applications such as electrostatic particle interaction [40], tomography [30], image segmentation [3] and, very recently, also OT with two marginals [35], they have not been utilized for MOT so far. Furthermore, a method for accelerating the Sinkhorn algorithm for Wasserstein barycenters by computing the convolution with a different kernel, namely the heat kernel, was discussed in [49].
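As a toy check of how such a cost decouples into pairwise terms, the circle-structured cost above can be evaluated directly (the three example points are arbitrary):

```python
import numpy as np

def circle_cost(points):
    """Circle-structured squared Euclidean cost:
    sum_k ||x_k - x_{k+1}||^2 with x_{K+1} identified with x_1."""
    K = len(points)
    return sum(np.sum((points[k] - points[(k + 1) % K]) ** 2) for k in range(K))

x = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([1.0, 1.0])]
# ||x1-x2||^2 + ||x2-x3||^2 + ||x3-x1||^2 = 1 + 1 + 2
print(circle_cost(x))  # 4.0
```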
Our main contribution is the combination of the fast summation method with the sparsity of the tree- or circle-structured cost function in the MOT problem for accelerating the Sinkhorn algorithm. Each iteration step has a complexity of $\mathcal O(KN)$ for a tree- and $\mathcal O(KN^2)$ for a circle-structured cost function, compared to $\mathcal O(KN^2)$ and $\mathcal O(KN^3)$, respectively, for the straightforward matrix-vector operations, where $N$ is an upper bound on the number of atoms of each of the $K$ marginal measures. Our numerical tests with both tree- and circle-structured cost functions confirm a considerable acceleration while the accuracy stays almost the same. A different acceleration of the Sinkhorn algorithm via low-rank approximation for tree-structured costs yields the same asymptotic complexity [50]. We note that MOT problems with tree-structured cost functions are used for the computation of Wasserstein barycenters [14], and those with circle-structured cost functions for computing Euler flows [11].
Outline of the paper. Section 2 introduces the notation. In section 3, we focus on the discrete MOT problem with squared Euclidean norm cost functions and on the numerical solution of the corresponding entropy-regularized problem by the Sinkhorn algorithm. We investigate sparsely structured cost functions that decouple according to a tree or circle in section 4. Then the complexity of the Sinkhorn algorithm depends only linearly on $K$. Section 5 describes a fast summation method for further accelerating the Sinkhorn algorithm. Finally, in section 6, we verify the practical performance by applying the method to generalized Euler flows and to the computation of generalized Wasserstein barycenters. We compare the computational time of the proposed Sinkhorn algorithm based on the NFFT with the algorithm based on direct matrix multiplication.

Notation
Let $K \in \mathbb{N}$ and $n = (n_1, \dots, n_K) \in \mathbb{N}^K$. We set $[K] \coloneqq \{1, \dots, K\}$ and consider $K$ finite sets $\Omega_k \coloneqq \{x^k_{i_k} \in \mathbb{R}^d : i_k \in [n_k]\}$, whose points are called atoms. We set $\Omega \coloneqq \Omega_1 \times \cdots \times \Omega_K$. Additionally, we define the index set $I \coloneqq [n_1] \times \cdots \times [n_K]$ and the set of $K$-dimensional matrices (tensors) $\mathbb{R}^n \coloneqq \mathbb{R}^{n_1 \times \cdots \times n_K}$. Let $\mathcal P(\Omega_k)$ denote the set of probability measures on $\Omega_k$. In this paper, we consider $K$ discrete probability measures, also called marginal measures, $\mu^k \in \mathcal P(\Omega_k)$ given by
$$\mu^k = \sum_{i_k \in [n_k]} \mu^k_{i_k} \, \delta_{x^k_{i_k}},$$
where the probabilities satisfy $\mu^k_{i_k} \geq 0$ and $\sum_{i_k \in [n_k]} \mu^k_{i_k} = 1$ and, for all $A \subset \Omega_k$, the Dirac measure is given by $\delta_x(A) = 1$ if $x \in A$ and $\delta_x(A) = 0$ otherwise. For $G, H \in \mathbb{R}^n$, we denote their component-wise (Hadamard) product by $G \odot H$ and similarly their component-wise division by $G \oslash H$, as well as the Frobenius inner product $\langle G, H \rangle \coloneqq \sum_{i \in I} G_i H_i$. The tensor product (Kronecker product) of $u, v \in \mathbb{R}^m$ is denoted by $u \otimes v \in \mathbb{R}^{m \times m}$, $(u \otimes v)_{i,j} \coloneqq u_i v_j$. Analogues can be defined for tensors of different sizes.
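The component-wise operations above can be illustrated in NumPy (shapes and values are arbitrary toy choices):

```python
import numpy as np

G = np.arange(1.0, 7.0).reshape(2, 3)   # a tensor in R^n with n = (2, 3)
H = np.full((2, 3), 2.0)

hadamard = G * H                        # component-wise (Hadamard) product
frobenius = np.sum(G * H)               # <G, H> = sum_i G_i H_i
u, v = np.array([1.0, 2.0]), np.array([3.0, 4.0])
outer = np.tensordot(u, v, axes=0)      # (u ⊗ v)_{ij} = u_i v_j, shape (2, 2)

print(frobenius)   # 42.0
```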

Multi-marginal optimal transport
In the following, we consider the discrete multi-marginal optimal transport (MOT) between $K$ marginal measures $\mu^k \in \mathcal P(\Omega_k)$, $k \in [K]$. We define the set of admissible transport plans by
$$\Pi(\mu^1, \dots, \mu^K) \coloneqq \big\{ \Pi \in \mathbb{R}^n_{\geq 0} : P_k(\Pi) = (\mu^k_{i_k})_{i_k \in [n_k]} \text{ for all } k \in [K] \big\}, \tag{1}$$
where the $k$-th marginal projection $P_k \colon \mathbb{R}^n \to \mathbb{R}^{n_k}$ is defined as
$$P_k(\Pi)_{i_k} \coloneqq \sum_{i_1 \in [n_1]} \cdots \sum_{i_{k-1} \in [n_{k-1}]} \sum_{i_{k+1} \in [n_{k+1}]} \cdots \sum_{i_K \in [n_K]} \Pi_i. \tag{2}$$
For a cost function $c \colon \Omega \to \mathbb{R}_{\geq 0}$ and the samples $x_i = (x^1_{i_1}, \dots, x^K_{i_K})$, $i \in I$, we define the respective cost matrix
$$C \coloneqq \big( c(x_i) \big)_{i \in I} \in \mathbb{R}^n. \tag{3}$$
The discrete MOT problem reads
$$\min_{\Pi \in \Pi(\mu^1, \dots, \mu^K)} \langle C, \Pi \rangle, \tag{4}$$
whose solution $\hat\Pi \in \mathbb{R}^n_{\geq 0}$ is called an optimal plan.
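A small NumPy sketch of these definitions for $K = 3$ marginals in $\mathbb{R}^1$ (the sizes, the random atoms, and the path-structured cost $c = |x^1 - x^2|^2 + |x^2 - x^3|^2$ are illustrative choices, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
n = (4, 5, 6)
xs = [rng.uniform(size=m) for m in n]           # atoms x^k_{i_k}

# cost tensor C in R^{n1 x n2 x n3} via broadcasting
C = ((xs[0][:, None, None] - xs[1][None, :, None]) ** 2
     + (xs[1][None, :, None] - xs[2][None, None, :]) ** 2)

def marginal(P, k):
    """k-th marginal projection P_k: sum over all indices except the k-th."""
    return P.sum(axis=tuple(ax for ax in range(P.ndim) if ax != k))

# the product plan of the marginals is always admissible
mus = [np.full(m, 1.0 / m) for m in n]
Pi = np.einsum('i,j,k->ijk', *mus)
cost = np.sum(C * Pi)                           # the MOT objective <C, Pi>
assert np.allclose(marginal(Pi, 0), mus[0])
```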

Entropy regularization
As the MOT problem (4) is numerically intractable in general, we consider for $\eta > 0$ the entropy-regularized multi-marginal optimal transport (MOT$_\eta$) problem
$$\min_{\Pi \in \Pi(\mu^1, \dots, \mu^K)} \langle C, \Pi \rangle + \eta \sum_{i \in I} \Pi_i (\log \Pi_i - 1), \tag{5}$$
which is a strictly convex optimization problem. It is possible to numerically deduce the optimal transport plan $\hat\Pi$ of (5) from the solution of the corresponding Lagrangian dual problem. The following theorem is a special case of [7] for a constant entropy function.
Theorem 3.1. The Lagrangian dual formulation of the discrete MOT$_\eta$ problem (5) states
$$\max_{\varphi^1, \dots, \varphi^K} S(\varphi^1, \dots, \varphi^K), \qquad S(\varphi^1, \dots, \varphi^K) \coloneqq \sum_{k \in [K]} \langle \varphi^k, \mu^k \rangle - \eta \, \langle K, \Phi \rangle, \qquad \Phi \coloneqq \bigotimes_{k \in [K]} \exp(\varphi^k / \eta), \tag{6}$$
where the kernel matrix $K \in \mathbb{R}^n$ is defined by $K \coloneqq \exp(-C / \eta)$. The functional $S \colon \mathbb{R}^{n_1} \times \cdots \times \mathbb{R}^{n_K} \to \mathbb{R}$ is called the Sinkhorn function. The solutions of the dual and the primal problem are generally not equal. The solution of (6) is in general a lower bound for the solution of the primal problem (5). Equality holds if the cost function $c$ is lower semi-continuous, i.e., $\liminf_{x \to x_0} c(x) \geq c(x_0)$ for every $x_0$ in the domain of $c$, or Borel measurable and bounded [8]. This is obviously the case for the squared Euclidean norm cost function that we study here.
A sequence converging to the optimal dual vectors $\hat\varphi^k$, $k \in [K]$, in (6) can be iteratively determined by the Sinkhorn algorithm [29,38] presented in Algorithm 1, where we note that line 4 is obtained by differentiating the Sinkhorn function $S$ with respect to $\varphi^k$, $k \in [K]$, and setting the derivative to zero. The complexity of the algorithm mainly comes from the computation of the marginals $P_k(K \odot \Phi)$, where the projection $P_k$ is defined in (2). In general, the number of operations depends exponentially on $K$.
Algorithm 1 Sinkhorn iterations for the MOT$_\eta$ problem
1: Input: Initialization $(\varphi^k)^{(0)}$, $k \in [K]$, parameters $\eta, \delta > 0$
2: Initialize $r \leftarrow 0$
3: repeat
4:   $(\varphi^k)^{(r+1)} \leftarrow (\varphi^k)^{(r)} + \eta \big( \log \mu^k - \log P_k(K \odot \Phi^{(r)}) \big)$ for $k \in [K]$
5:   Compute $S^{(r+1)} = S\big((\varphi^1)^{(r+1)}, \dots, (\varphi^K)^{(r+1)}\big)$
6:   Increment $r \leftarrow r + 1$
7: while $|S^{(r)} - S^{(r-1)}| \geq \delta$
8: return optimal plan $\hat\Pi = K \odot \hat\Phi$, where $\hat\Phi = \bigotimes_{k \in [K]} \exp\big((\varphi^k)^{(r)} / \eta\big)$
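The multi-marginal Sinkhorn update $\varphi^k \leftarrow \varphi^k + \eta(\log \mu^k - \log P_k(K \odot \Phi))$ can be sketched densely in NumPy for $K = 3$. This is only a toy verification that the iteration enforces the marginal constraints, not the paper's implementation; the sizes, $\eta$, the path-structured cost, and the fixed iteration count are arbitrary, and the dense tensor needs $\mathcal O(N^K)$ memory:

```python
import numpy as np

rng = np.random.default_rng(1)
n, eta = (3, 4, 5), 0.5
xs = [rng.uniform(size=m) for m in n]
mus = [np.full(m, 1.0 / m) for m in n]

# dense cost tensor (path-structured squared Euclidean cost) and kernel
C = ((xs[0][:, None, None] - xs[1][None, :, None]) ** 2
     + (xs[1][None, :, None] - xs[2][None, None, :]) ** 2)
Ker = np.exp(-C / eta)

phis = [np.zeros(m) for m in n]
for _ in range(200):                       # fixed sweep count for simplicity
    for k in range(3):
        Phi = np.einsum('i,j,k->ijk', *[np.exp(p / eta) for p in phis])
        Pk = (Ker * Phi).sum(axis=tuple(ax for ax in range(3) if ax != k))
        phis[k] += eta * (np.log(mus[k]) - np.log(Pk))

Phi = np.einsum('i,j,k->ijk', *[np.exp(p / eta) for p in phis])
plan = Ker * Phi
# after convergence, the plan matches the prescribed marginals
print(np.max(np.abs(plan.sum(axis=(1, 2)) - mus[0])))
```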

Sparse cost functions
In this section, we take a look at sparsely structured cost functions, for which the Sinkhorn algorithm becomes much faster and we overcome the curse of dimensionality. Let $G = (V, E)$ be an undirected graph with vertices $V$ and edges $E$. We say that the MOT$_\eta$ problem has the graph $G$ structure if $V = [K]$ and the cost matrix (3) decouples according to
$$C_i = \sum_{\{k_1, k_2\} \in E} c^{k_1, k_2}\big( x^{k_1}_{i_{k_1}}, x^{k_2}_{i_{k_2}} \big), \quad i \in I.$$
This implies that the kernel $K \in \mathbb{R}^n_{\geq 0}$ satisfies
$$K_i = \prod_{\{k_1, k_2\} \in E} K^{(k_1, k_2)}_{i_{k_1}, i_{k_2}}, \quad i \in I, \tag{7}$$
where the kernel matrix for $\{k_1, k_2\} \in E$ is given by
$$K^{(k_1, k_2)} \coloneqq \Big( \exp\big( -c^{k_1, k_2}(x^{k_1}_{i_{k_1}}, x^{k_2}_{i_{k_2}}) / \eta \big) \Big)_{i_{k_1} \in [n_{k_1}],\, i_{k_2} \in [n_{k_2}]} \in \mathbb{R}^{n_{k_1} \times n_{k_2}}. \tag{8}$$
We use the indices $k \in V = [K]$ to identify the marginal measures $\mu^k$ in the rest of the paper. The discrete dual formulation (6) of the MOT$_\eta$ problem has the same form independently of the structure of the graph $G$; only the marginals $P_k(K \odot \Phi)$ differ. If the graph $G$ is complete, i.e., every two vertices in $V$ are connected by an edge, then the computational complexity of the Sinkhorn Algorithm 1 depends exponentially on the number $K$ of marginal measures (vertices). For larger values of $K$, it is practically impossible to numerically compute an optimal plan of the MOT$_\eta$ problem. We consider two sparsity assumptions on the MOT$_\eta$ problem, each of them yielding that the Sinkhorn algorithm has a linear complexity in the number $K$ of nodes. It was shown in [4] that MOT$_\eta$ problems with graphically structured cost functions of constant tree-width can be solved in polynomial time. This is the case for the tree and the circle, whose tree-widths are 1 and 2, respectively. In the following, we give an explicit scheme to efficiently compute the Sinkhorn iterations for the tree and circle structures.

Tree structure
We consider the MOT$_\eta$ problem with the structure of a tree $(V, E)$, which is a connected and circle-free graph with $|E| = |V| - 1$. We define the (non-empty) neighbour set $N_k$ of $k \in V$ as the set of all nodes $\ell \in V$ such that $\{k, \ell\} \in E$. Furthermore, we denote by $L \coloneqq \{k \in V : |N_k| = 1\}$ the set of all leaves of the tree.
We call $1 \in V$ the root of the tree. For every $k \in V$, there is a unique path between $k$ and the root. For $k \in V \setminus \{1\}$, we define the parent $p(k)$ as the node in $N_k$ that lies on the path between $k$ and the root $1$. The root has no parent. Without loss of generality, we assume that $p(k) < k$ holds for all $k \in V \setminus \{1\}$. We define the set of children $C_k \coloneqq \{\ell \in N_k : \ell > k\}$. Then we can derive a recursive formula for the $k$-th marginal of the tensor $K \odot \Phi \in \mathbb{R}^n$.

Theorem 4.1. Let $(V, E)$ be a tree with leaves $L$ and let $c$ be the associated MOT$_\eta$ cost function. Let furthermore $k \in V$ be an arbitrary node. Then the $k$-th marginal of the transport plan $K \odot \Phi$ is given by
$$P_k(K \odot \Phi) = \exp(\varphi^k / \eta) \odot \bigodot_{\ell \in N_k} \alpha^{(k, \ell)},$$
where the vectors $\alpha^{(k, \ell)} \in \mathbb{R}^{n_k}$ are recursively defined as
$$\alpha^{(k, \ell)} \coloneqq K^{(k, \ell)} \Big( \exp(\varphi^\ell / \eta) \odot \bigodot_{t \in N_\ell \setminus \{k\}} \alpha^{(\ell, t)} \Big), \quad \{k, \ell\} \in E,$$
with an empty product understood as the all-ones vector. A proof of Theorem 4.1 can be found in [29, Theorem 3.2]. The main idea is to split the tree rooted at the node $k$ into $|N_k|$ subtrees with node sets $\{\ell\} \cup D_\ell$, $\ell \in N_k$, where $D_\ell$ is the set of descendants of $\ell$, i.e., the nodes $t \in V$ such that $\ell$ lies on the path between $k$ and $t$. The kernel $K_i$ then factorizes along these subtrees. Inserting this factorization into $P_k(K \odot \Phi)$ and recalling the definition of $P_k$ in (2) yields the result.
In order to efficiently perform the Sinkhorn algorithm, we compute iteratively for $\ell = K, \dots, 2$ the vectors $\beta_\ell \coloneqq \alpha^{(p(\ell), \ell)} \in \mathbb{R}^{n_{p(\ell)}}$. From Theorem 4.1, we obtain
$$\beta_\ell = K^{(p(\ell), \ell)} \Big( \exp(\varphi^\ell / \eta) \odot \bigodot_{t \in C_\ell} \beta_t \Big).$$
Since we assumed that $p(k) < k$, the computation of $\beta_\ell$ requires only $\beta_t$ for $t > \ell$. Similarly, the vectors $\gamma_\ell \coloneqq \alpha^{(\ell, p(\ell))} \in \mathbb{R}^{n_\ell}$ for $\ell > 1$ and $\gamma_1 \coloneqq 1_{n_1}$ can be iteratively computed by
$$\gamma_\ell = \big( K^{(p(\ell), \ell)} \big)^\top \Big( \exp(\varphi^{p(\ell)} / \eta) \odot \gamma_{p(\ell)} \odot \bigodot_{t \in C_{p(\ell)} \setminus \{\ell\}} \beta_t \Big),$$
where the vectors $\beta_t \in \mathbb{R}^{n_{p(t)}}$ are assumed to be known from above. Then the marginals are
$$P_k(K \odot \Phi) = \exp(\varphi^k / \eta) \odot \gamma_k \odot \bigodot_{t \in C_k} \beta_t, \quad k \in [K],$$
where for any subset $U \subseteq C_k$, we set $\bigodot_{t \in U} \beta_t \coloneqq 1_{n_k}$ if $U = \emptyset$. The resulting procedure is summarized in Algorithm 2.
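The $\beta$/$\gamma$ recursions can be verified against the dense tensor on a minimal star-shaped tree with a root (node 0 in the code) and two children; the kernel matrices and scaling vectors are random placeholders rather than actual kernels $\exp(-c/\eta)$:

```python
import numpy as np

rng = np.random.default_rng(2)
n0, n1, n2 = 3, 4, 5
K01 = rng.uniform(0.5, 1.5, size=(n0, n1))   # kernel matrix on edge {0, 1}
K02 = rng.uniform(0.5, 1.5, size=(n0, n2))   # kernel matrix on edge {0, 2}
u = [rng.uniform(size=m) for m in (n0, n1, n2)]   # u_k = exp(phi^k / eta)

# dense reference: K_{i0 i1 i2} = K01[i0,i1] * K02[i0,i2], Phi = u0 x u1 x u2
Ker = K01[:, :, None] * K02[:, None, :]
dense = Ker * np.einsum('i,j,k->ijk', *u)

# beta: leaves -> root, gamma: root -> leaves
beta1, beta2 = K01 @ u[1], K02 @ u[2]
P0 = u[0] * beta1 * beta2                    # marginal at the root
gamma1 = K01.T @ (u[0] * beta2)
P1 = u[1] * gamma1                           # marginal at leaf 1
gamma2 = K02.T @ (u[0] * beta1)
P2 = u[2] * gamma2                           # marginal at leaf 2

assert np.allclose(P0, dense.sum(axis=(1, 2)))
assert np.allclose(P1, dense.sum(axis=(0, 2)))
assert np.allclose(P2, dense.sum(axis=(0, 1)))
```

Each marginal costs only a few matrix-vector products instead of a sum over the full tensor, which is the source of the $\mathcal O(KN^2)$ per-iteration complexity.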

Circle structure
We consider the MOT$_\eta$ problem where the graph $(V, E)$ is a circle, i.e., $E = \{\{k, k+1\} : k \in [K-1]\} \cup \{\{K, 1\}\}$. In the following, all node indices are understood modulo $K$, so that $K + 1$ is identified with $1$. Thus we can define the distance between two nodes $k_1, k_2 \in V$ as $d(k_1, k_2) \coloneqq \min\{|k_1 - k_2|, K - |k_1 - k_2|\}$.

Algorithm 2 Sinkhorn algorithm for tree structure
1: Input: Initialization $(\varphi^k)^{(0)}$, $k \in [K]$, parameters $\eta, \delta > 0$
2: Initialize $r \leftarrow 0$
3: repeat
4:   for $\ell = K, \dots, 2$ do
5:     $\beta_\ell \leftarrow K^{(p(\ell), \ell)} \big( \exp((\varphi^\ell)^{(r)} / \eta) \odot \bigodot_{t \in C_\ell} \beta_t \big)$
6:   end for
7:   $\gamma_1 \leftarrow 1_{n_1}$
8:   for $\ell = 2, \dots, K$ do
9:     $\gamma_\ell \leftarrow \big( K^{(p(\ell), \ell)} \big)^\top \big( \exp((\varphi^{p(\ell)})^{(r)} / \eta) \odot \gamma_{p(\ell)} \odot \bigodot_{t \in C_{p(\ell)} \setminus \{\ell\}} \beta_t \big)$
10: end for
11: for $k = 1, \dots, K$ do
12:   $(\varphi^k)^{(r+1)} \leftarrow (\varphi^k)^{(r)} + \eta \big( \log \mu^k - \log\big( \exp((\varphi^k)^{(r)} / \eta) \odot \gamma_k \odot \bigodot_{t \in C_k} \beta_t \big) \big)$
13: end for
14: Compute $S^{(r+1)}$
15: Increment $r \leftarrow r + 1$
16: while $|S^{(r)} - S^{(r-1)}| \geq \delta$
17: return optimal plan $\hat\Pi = K \odot \hat\Phi$, where $\hat\Phi = \bigotimes_{k \in [K]} \exp\big((\varphi^k)^{(r)} / \eta\big)$

Theorem 4.2. Let $(V, E)$ be a circle and let $c$ be the associated MOT cost function. Let furthermore $k \in V$ be an arbitrary node. The $k$-th marginal of the transport plan $K \odot \Phi$ is given by
$$P_k(K \odot \Phi)_{i_k} = \exp(\varphi^k_{i_k} / \eta) \, \alpha^{(k,k)}_{i_k, i_k}, \quad i_k \in [n_k],$$
where the matrices $\alpha^{(\ell, t)} \in \mathbb{R}^{n_\ell \times n_t}$, $\ell, t \in V$, recursively satisfy
$$\alpha^{(k, k+1)} \coloneqq K^{(k, k+1)}, \qquad \alpha^{(k, \ell+1)} \coloneqq \alpha^{(k, \ell)} \operatorname{diag}\big( \exp(\varphi^\ell / \eta) \big) K^{(\ell, \ell+1)}. \tag{9}$$

Proof. The kernel $K$ can be decomposed as in (7). Let $k \in [K]$ and $i_k \in [n_k]$. Summing $(K \odot \Phi)_i$ over all indices $i_\ell$ with $\ell \neq k$ and grouping the factors $K^{(\ell, \ell+1)}_{i_\ell, i_{\ell+1}} \exp(\varphi^{\ell+1}_{i_{\ell+1}} / \eta)$ along the circle from $k$ back to $k$ yields exactly the matrix product in (9), evaluated at the diagonal entry $(i_k, i_k)$.
To compute the marginals efficiently as for the tree structure, we choose $k = 1$ as the starting point and decompose the marginal into two matrices, which can be computed recursively as follows. For matrices $G, H \in \mathbb{R}^{n \times m}$, we define the inner product with respect to the second dimension by
$$\langle G, H \rangle_2 \coloneqq \Big( \sum_{j \in [m]} G_{i,j} H_{i,j} \Big)_{i \in [n]} \in \mathbb{R}^n.$$

Theorem 4.3. Under the assumptions of Theorem 4.2, the $k$-th marginal of the transport plan $K \odot \Phi$ satisfies, for $k \in \{2, \dots, K\}$,
$$P_k(K \odot \Phi) = \exp(\varphi^k / \eta) \odot \big\langle \alpha^{(k,1)} \operatorname{diag}\big( \exp(\varphi^1 / \eta) \big), \, \big( \lambda^{(k,1)} \big)^\top \big\rangle_2,$$
where $\alpha^{(k,1)}$ is given in (9) and the matrices $\lambda^{(k,1)} \in \mathbb{R}^{n_1 \times n_k}$ satisfy
$$\lambda^{(2,1)} \coloneqq K^{(1,2)}, \qquad \lambda^{(k,1)} \coloneqq \lambda^{(k-1,1)} \operatorname{diag}\big( \exp(\varphi^{k-1} / \eta) \big) K^{(k-1,k)}, \quad k = 3, \dots, K. \tag{10}$$

Proof. Let $k \in \{2, \dots, K\}$ and $i_k \in [n_k]$. The matrix $\alpha^{(k,1)}$ collects the factors of $(K \odot \Phi)_i$ along the path $k \to k+1 \to \cdots \to K \to 1$, while $\lambda^{(k,1)}$ collects the factors along the path $1 \to 2 \to \cdots \to k$. Summing over all indices except $i_k$ therefore couples the two paths only through the index $i_1$, which is exactly the inner product $\langle \cdot, \cdot \rangle_2$ above; the remaining factors $\exp(\varphi^k_{i_k} / \eta)$ and $\exp(\varphi^1_{i_1} / \eta)$ appear explicitly. For $k = 2$, the assertion follows directly from the decomposition (7).
In order to efficiently compute a maximizing sequence of the dual MOT$_\eta$ problem (6), we set for $k \in [K]$ the dual matrices $\beta_k \coloneqq \alpha^{(k,1)} \in \mathbb{R}^{n_k \times n_1}$ and $\gamma_k \coloneqq \lambda^{(k,1)} \in \mathbb{R}^{n_1 \times n_k}$, given in Theorems 4.2 and 4.3, respectively. The method is shown in Algorithm 3.
Algorithm 3 Sinkhorn algorithm for circle structure
1: Input: Initialization $(\varphi^k)^{(0)}$, $k \in [K]$, parameters $\eta, \delta > 0$
2: Initialize $r \leftarrow 0$
3: repeat
4:   for $k = K, \dots, 2$ do compute $\beta_k = \alpha^{(k,1)}$ by the recursion in Theorem 4.2
5:   for $k = 2, \dots, K$ do compute $\gamma_k = \lambda^{(k,1)}$ by the recursion in Theorem 4.3
6:   $(\varphi^1)^{(r+1)} \leftarrow (\varphi^1)^{(r)} + \eta \big( \log \mu^1 - \log P_1(K \odot \Phi^{(r)}) \big)$ with $P_1$ from Theorem 4.2
7:   for $k = 2, \dots, K$ do $(\varphi^k)^{(r+1)} \leftarrow (\varphi^k)^{(r)} + \eta \big( \log \mu^k - \log P_k(K \odot \Phi^{(r)}) \big)$ with $P_k$ from Theorem 4.3
8:   Compute $S^{(r+1)}$
9:   Increment $r \leftarrow r + 1$
10: while $|S^{(r)} - S^{(r-1)}| \geq \delta$
11: return optimal plan $\hat\Pi = K \odot \hat\Phi$, where $\hat\Phi = \bigotimes_{k \in [K]} \exp\big((\varphi^k)^{(r)} / \eta\big)$

The tree-structured and circle-structured MOT$_\eta$ problems both have a sparse cost function, which considerably improves the computational complexity of the Sinkhorn algorithm. In each iteration step, Algorithm 2 requires only $2(K-1)$ matrix-vector products, each with a complexity of $\mathcal O(N^2)$, where $N \coloneqq \|n\|_\infty$, while Algorithm 3 requires $2(K-1)$ matrix-matrix products, each with a complexity of $\mathcal O(N^3)$. This can be considerably improved by employing fast Fourier techniques, as we will see in the next section.
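The diagonal-of-matrix-products formula for the circle marginal can likewise be checked against the dense tensor on a toy $K = 3$ circle; the matrices and scaling vectors are again random placeholders rather than actual kernels:

```python
import numpy as np

rng = np.random.default_rng(3)
n1, n2, n3 = 3, 4, 5
K12 = rng.uniform(0.5, 1.5, size=(n1, n2))   # kernel matrices on the edges
K23 = rng.uniform(0.5, 1.5, size=(n2, n3))
K31 = rng.uniform(0.5, 1.5, size=(n3, n1))
u = [rng.uniform(size=m) for m in (n1, n2, n3)]   # u_k = exp(phi^k / eta)

# dense reference: K_{i1 i2 i3} = K12[i1,i2] * K23[i2,i3] * K31[i3,i1]
Ker = K12[:, :, None] * K23[None, :, :] * K31.T[:, None, :]
dense = Ker * np.einsum('i,j,k->ijk', *u)

# marginal at node 1 via matrix products around the circle:
A = (K12 * u[1]) @ (K23 * u[2]) @ K31        # shape (n1, n1)
P1 = u[0] * np.diag(A)

assert np.allclose(P1, dense.sum(axis=(1, 2)))
```

Only matrix-matrix products of the kernel matrices are needed, which is why the circle case costs $\mathcal O(N^3)$ per product instead of $\mathcal O(N^2)$.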

Non-uniform discrete Fourier transforms
The main computational cost of the Sinkhorn algorithm comes from the matrix-vector product with the kernel matrix (8). Let $k, \ell \in [K]$ and $\alpha \in \mathbb{R}^{n_\ell}$. We briefly describe a fast summation method for the computation of $\beta = K^{(k,\ell)} \alpha$, i.e.,
$$\beta_{i_k} = \sum_{i_\ell \in [n_\ell]} \kappa\big( \|x^k_{i_k} - x^\ell_{i_\ell}\|_2 \big) \, \alpha_{i_\ell}, \quad i_k \in [n_k], \tag{11}$$
with the radial kernel function
$$\kappa(t) \coloneqq \exp(-t^2 / \eta). \tag{12}$$
We refer to [47] for a detailed derivation and error estimates. The main idea is to approximate the kernel function by a Fourier series. In order to ensure fast convergence of the Fourier series, we extend $\kappa$ to a periodic function of certain smoothness $p \in \mathbb{N}$. Let $\varepsilon_B > 0$ and $\tau > \varepsilon_B + \max_{i_k, i_\ell} \|x^k_{i_k} - x^\ell_{i_\ell}\|_2$. For $x \in \mathbb{R}$, we define the regularized kernel
$$\kappa_R(x) \coloneqq \begin{cases} \kappa(|x|), & |x| \leq \tau - \varepsilon_B, \\ \kappa_B(|x|), & \tau - \varepsilon_B < |x| \leq \tau, \\ \kappa_B(\tau), & |x| > \tau, \end{cases} \tag{13}$$
where $\kappa_B$ is a polynomial that fulfills the Hermite interpolation conditions $\kappa_B^{(j)}(\tau - \varepsilon_B) = \kappa^{(j)}(\tau - \varepsilon_B)$ for $j \in \{0, \dots, p\}$ and $\kappa_B^{(j)}(\tau) = 0$ for $j \in \{1, \dots, p\}$. Then we define a $2\tau$-periodic function on $\mathbb{R}^d$ by
$$\tilde\kappa(x) \coloneqq \kappa_R(\|x\|_2), \quad x \in [-\tau, \tau)^d,$$
extended $2\tau$-periodically. By construction, $\tilde\kappa$ is $p$ times continuously differentiable and we have $\tilde\kappa(x) = \kappa(\|x\|_2)$ for all $\|x\|_2 \leq \tau - \varepsilon_B$. Let $M \in \mathbb{N}$. We approximate $\tilde\kappa$ by the $2\tau$-periodic Fourier expansion of degree $2M$,
$$\tilde\kappa(x) \approx \sum_{m \in \{-M, \dots, M-1\}^d} \hat\kappa_m \, \mathrm e^{\pi \mathrm i \, m \cdot x / \tau}, \tag{14}$$
with the discrete Fourier coefficients $\hat\kappa_m \in \mathbb{C}$, which can be efficiently approximated by the fast Fourier transform (FFT),
$$\hat\kappa_m \coloneqq \frac{1}{(2M)^d} \sum_{j \in \{-M, \dots, M-1\}^d} \tilde\kappa\Big( \frac{\tau j}{M} \Big) \, \mathrm e^{-\pi \mathrm i \, m \cdot j / M}. \tag{15}$$
This yields an approximation of (11) by
$$\beta_{i_k} \approx \sum_{m \in \{-M, \dots, M-1\}^d} \hat\kappa_m \Big( \sum_{i_\ell \in [n_\ell]} \alpha_{i_\ell} \, \mathrm e^{-\pi \mathrm i \, m \cdot x^\ell_{i_\ell} / \tau} \Big) \mathrm e^{\pi \mathrm i \, m \cdot x^k_{i_k} / \tau}. \tag{16}$$
The non-uniform discrete Fourier transform (NDFT) of coefficients $\hat f = (\hat f_m)_m$ on the set $\Omega_k$ and the adjoint NDFT of $\alpha \in \mathbb{R}^{n_\ell}$ on the set $\Omega_\ell$ are given by
$$(\mathcal F_k \hat f)_{i_k} \coloneqq \sum_{m \in \{-M, \dots, M-1\}^d} \hat f_m \, \mathrm e^{\pi \mathrm i \, m \cdot x^k_{i_k} / \tau}, \tag{17}$$
$$(\mathcal F_\ell^* \alpha)_m \coloneqq \sum_{i_\ell \in [n_\ell]} \alpha_{i_\ell} \, \mathrm e^{-\pi \mathrm i \, m \cdot x^\ell_{i_\ell} / \tau}, \tag{18}$$
cf. [45, Sect. 7]. Therefore, the approximation (16) can be written as
$$\beta \approx \mathcal F_k \big( (\hat\kappa_m)_m \odot \mathcal F_\ell^* \alpha \big). \tag{19}$$
The procedure is summarized in Algorithm 4. There are fast algorithms, known as the non-uniform fast Fourier transform (NFFT), allowing the computation of an NDFT (17) and its adjoint (18) in $\mathcal O(M^d \log M + N)$ steps up to arbitrary numerical precision, see, e.g., [12,23] and [45, Sect. 7], where $N = \|n\|_\infty$. Note that a direct implementation of (16) requires $\mathcal O(M^d N)$ operations. We call the Sinkhorn algorithm in which the matrix-vector multiplication is performed via (19) the NFFT-Sinkhorn algorithm. Provided we fix the Fourier expansion degree $M$, which is possible because of the smoothness of $\kappa$, we end up with a numerical complexity of $\mathcal O(KN)$ for each iteration step of the NFFT-Sinkhorn algorithm for trees.
In the case of a circle (Algorithm 3), we can apply the fast summation column by column to the matrix-matrix products with $K^{(k,k+1)}$, yielding a complexity of $\mathcal O(KN^2)$ for each iteration step.
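A self-contained one-dimensional illustration of this fast summation, with the NDFTs written out as dense matrices instead of calling an NFFT library (so only the approximation quality is demonstrated, not the asymptotic speed-up). The Gaussian kernel width, the degree $M$, and the point ranges are toy choices; we also skip the boundary regularization (13) by picking a kernel that almost vanishes at the largest occurring distance:

```python
import numpy as np

rng = np.random.default_rng(4)
N, M = 200, 64                         # number of points, expansion degree
x = rng.uniform(-0.25, 0.25, N)        # target points
y = rng.uniform(-0.25, 0.25, N)        # source points
alpha = rng.uniform(size=N)
kappa = lambda t: np.exp(-t ** 2 / 0.02)   # smooth, rapidly decaying kernel

# Fourier coefficients of the 1-periodic extension of kappa via FFT
grid = np.arange(M) / M - 0.5
khat = np.fft.fftshift(np.fft.fft(np.fft.ifftshift(kappa(grid)))) / M
m = np.arange(-M // 2, M // 2)

# adjoint NDFT of alpha, multiplication with khat, then NDFT at the targets
a_hat = np.exp(-2j * np.pi * np.outer(m, y)) @ alpha
beta_fast = (np.exp(2j * np.pi * np.outer(x, m)) @ (khat * a_hat)).real

# direct O(N^2) reference summation
beta_direct = kappa(x[:, None] - y[None, :]) @ alpha
print(np.max(np.abs(beta_fast - beta_direct)))   # small approximation error
```

Replacing the two dense exponential matrices by NFFT and adjoint NFFT calls reduces the cost from $\mathcal O(MN)$ to $\mathcal O(M \log M + N)$ per summation.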

Numerical examples
We illustrate the results from section 5 concerning the Sinkhorn algorithm and its accelerated version, the NFFT-Sinkhorn algorithm. First, we investigate the effect of parameter choices in some artificial examples. Then, we look at the one-dimensional Euler flow problem of incompressible fluids and the fixed-support barycenter problem of images. All computations were performed on an 8-core Intel Core i7-10700 CPU with 32 GB memory. For computing the NFFT of section 5, we rely on the implementation [32].

Uniformly distributed points
We consider the MOT$_\eta$ problem for uniform measures $\mu^k$ of uniformly distributed points on $\Omega = [-1/2, 1/2]$. We chose the entropy regularization parameter $\eta = 0.1$, so that a boundary regularization (13) for the fast summation method is necessary. For the tree-structured cost function, we set the boundary regularization $\varepsilon_B = 1/16$, the Fourier expansion degree $M = 156$, and the smoothness parameter $p = 3$, see section 5. In Figure 2 (left), we see the linear dependence of the computational time on the number $K$ of marginals. For a growing number $N$ of points, the NFFT-Sinkhorn algorithm, which requires $\mathcal O(KN)$ steps, clearly outperforms the standard method, which requires $\mathcal O(KN^2)$ steps, see Figure 2 (right).
In Figure 3, we show the computation times for the circle-structured cost function, with the parameters $\eta = 0.1$, $\varepsilon_B = 3/32$, $M = 2000$, and $p = 3$. As the Sinkhorn iteration requires matrix-matrix products, it is more costly than for the tree-structured cost function, and so we used a lower number of points $N$. The advantage of the NFFT-Sinkhorn algorithm is smaller than for the tree, but still considerable. We point out that the fast summation method is applied column by column to the matrix-matrix product, and the Fourier expansion degree $M$ is larger. Finally, we investigate how the approximation error between the Sinkhorn and NFFT-Sinkhorn algorithms, $|S^{(r)} - \tilde S^{(r)}|$, depends on the entropy regularization parameter $\eta$ and the Fourier expansion degree $M$ at a fixed iteration $r = 10$, where $S^{(r)}$ denotes the evaluation of the Sinkhorn function by the Sinkhorn algorithm and $\tilde S^{(r)}$ its evaluation by the NFFT-Sinkhorn algorithm. Since the time differences for different $M$ in the one-dimensional case are very small, we consider two-dimensional uniform marginal measures for the MOT$_\eta$ problems with tree- or circle-structured cost functions. In Figures 4 and 5, we see that for smaller $\eta$, we need a larger expansion degree $M$ to achieve a good accuracy. The error stagnates at a certain level and does not decrease anymore for increasing $M$. This could be improved by increasing the approximation parameter $p$ and the cutoff parameter of the NFFT, cf. [45, Section 7]. Parameter choice methods for the NFFT-based summation were discussed in [39]. For an appropriately chosen $M$, the NFFT-Sinkhorn algorithm is usually much faster than the Sinkhorn algorithm. However, for very small $\eta$, the kernel function (12) is concentrated on a small interval and therefore a simple truncation of the sum (11) might be preferable to the NFFT-based approximation.

Fixed-support Wasserstein barycenter for general trees
Let $T = (V, E)$ be a tree with set of leaves $L$, see section 4.1. For $k \in L$, let measures $\mu^k$ and weights $\tilde w_k \in [0, 1]$ be given that satisfy $\sum_{k \in L} \tilde w_k = 1$. For any edge $e \in E$, we set an edge weight $w_e > 0$. The generalized barycenters are the minimizers $\mu^k$, $k \in V \setminus L$, of
$$\min \sum_{e = \{e_1, e_2\} \in E} w_e \, W_2^2(\mu^{e_1}, \mu^{e_2}), \tag{20}$$
where $W_2^2(\mu^{e_1}, \mu^{e_2})$ is the squared Wasserstein distance [6,20] between the measures $\mu^{e_1}$ and $\mu^{e_2}$. The well-known Wasserstein barycenter problem [48,57] is a special case of (20), where the tree is star-shaped and the barycenter corresponds to the unique internal node. We consider the fixed-support barycenter problem [9,51], where also the support points $x^k_{i_k}$, $k \in V \setminus L$, are given, so that we need to optimize (20) only over the weights $\mu^k_{i_k}$, $k \in V \setminus L$. This yields an MOT problem with the tree-structured cost
$$c(x^1, \dots, x^K) = \sum_{\{e_1, e_2\} \in E} w_{\{e_1, e_2\}} \|x^{e_1} - x^{e_2}\|_2^2,$$
where the marginal constraints of (1) are only set for the known measures $\mu^k$, $k \in L$, see [1]. This barycenter problem can be solved approximately using a modification of the Sinkhorn Algorithm 2 in which we replace line 12 by
$$(\varphi^k)^{(r+1)} \leftarrow \begin{cases} (\varphi^k)^{(r)} + \eta \big( \log \mu^k - \log P_k(K \odot \Phi^{(r)}) \big), & k \in L, \\ 0, & \text{otherwise}. \end{cases}$$
We test our algorithm with a tree consisting of $K = 7$ nodes, see Figure 6. The four given marginals $\mu^k$, $k \in L$, are dithered images in $\mathbb{R}^2$ with uniform weights $\mu^k_{i_k} = 1/N$. As support points of the barycenters $\mu^k$, $k \in V \setminus L$, we take the union of the support points $x^k_{i_k}$ of all four input measures $\mu^k$, $k \in L$. Furthermore, we use the barycenter weights $w_k = 1/4$. The given images and the computed barycenters are shown in Figure 7, where we executed $r = 150$ iterations of the Sinkhorn algorithm and its accelerated version. We chose the regularization parameter $\eta = 5 \cdot 10^{-3}$ and the fast summation parameters $M = 156$, $p = 3$.

Generalized Euler flows
We consider the motion of $N$ particles of an incompressible fluid, e.g., water, in a bounded domain $\Omega$ at discrete time steps $t_k \coloneqq (k-1)/(K-1)$, $k \in [K]$. We assume that we know the function $\sigma \colon \Omega \to \Omega$ that connects the $N$ initial positions $x_{i_1} \in \Omega$ of the particles with their final positions $x_{i_K} \in \Omega$. At each time step $t_k$, we know an image of the particle distribution, which is described by the discrete marginal measure $\mu^k$, $k \in [K]$, with uniform weights $\mu^k_{i_k} = 1/N$. We want to find out how the single particles move, i.e., their trajectories. Due to the least-action principle, this problem can be formulated as the MOT problem (4) with the circle-structured cost
$$c(x^1, \dots, x^K) = \sum_{k=1}^{K-1} \|x^k - x^{k+1}\|_2^2 + \|x^K - \sigma(x^1)\|_2^2,$$
see [15,16,17]. Then the two-dimensional marginals of the optimal plan $\hat\Pi$ with respect to the components $1$ and $k$ provide the (discrete) probability that a particle which was initially at position $x_{i_1} \in \Omega$ is at position $x_{i_k} \in \Omega$ at time $t_k$, $k = 2, \dots, K-1$. The one-dimensional problem has been studied by several authors [4,9,11], where the particles are assumed to lie on a grid. Here, we consider the case where the positions are uniformly distributed: we draw $N = 400$ uniformly distributed points on $\Omega = [0, 1]$. We use $K = 5$ marginal constraints and the entropy regularization parameter $\eta = 0.05$.

Conclusions
We have proposed the NFFT-Sinkhorn algorithm to solve the MOT$_\eta$ problem efficiently. Assuming that the cost function of the multi-marginal optimal transport problem decouples according to a tree or a circle, we obtain a complexity that is linear in $K$. The complexity of the algorithm with respect to the numbers $n_k$, $k \in [K]$, of atoms of the discrete marginal measures is further improved by the non-uniform fast Fourier transform. This results in a considerable acceleration in our numerical experiments compared to the usual Sinkhorn algorithm. The tree-structured MOT$_\eta$ problem admits a much better numerical complexity than the circle-structured MOT$_\eta$ problem because the latter requires the matrix-matrix products of Algorithm 3 instead of just the matrix-vector products of Algorithm 2.