Overrelaxed Sinkhorn-Knopp Algorithm for Regularized Optimal Transport

This article describes a method for quickly computing the solution to the regularized optimal transport problem. It generalizes and improves upon the widely-used iterative Bregman projections algorithm (or Sinkhorn-Knopp algorithm). The idea is to overrelax the Bregman projection operators, allowing for faster convergence. In practice this corresponds to elevating the diagonal scaling factors to a given power, at each step of the algorithm. We propose a simple method for establishing global convergence by ensuring the decrease of a Lyapunov function at each step. An adaptive choice of overrelaxation parameter based on the Lyapunov function is constructed. We also suggest a heuristic to choose a suitable asymptotic overrelaxation parameter, based on a local convergence analysis. Our numerical experiments show a gain in convergence speed by an order of magnitude in certain regimes.


Introduction
Optimal Transport is an efficient and flexible tool to compare two probability distributions which has been popularized in the computer vision community in the context of discrete histograms [Rubner et al., 2000].The introduction of entropic regularization in [Cuturi, 2013] has made possible the use of the fast Sinkhorn-Knopp algorithm [Sinkhorn, 1964] scaling with high dimensional data.Regularized optimal transport have thus been intensively used in Machine Learning with applications such as Geodesic PCA [Seguy and Cuturi, 2015], domain adaptation [Courty et al., 2015], data fitting [Frogner et al., 2015], training of Boltzmann Machine [Montavon et al., 2016] or dictionary learning [Rolet et al., 2016, Schmitz et al., 2017].
The computation of optimal transport between two data relies on the estimation of an optimal transport matrix, the entries of which represent the quantity of mass transported between data locations.Regularization of optimal transport with strictly convex regularization [Cuturi, 2013, Dessein et al., 2016] nevertheless involves a spreading of the mass.Hence, for particular purposes such as color interpolation [Rabin and Papadakis, 2014] or gradient flow [Chizat et al., 2016], it is necessary to consider very small regularization of the problem.In this setting, the regularized transport problem can be ill-conditioned and the Sinkhorn-Knopp algorithm converges slowly.This is the issue we want to tackle here.Before going into further details, we now briefly introduce the main notations and concepts used all along this article.

Discrete optimal transport
We consider two discrete probability measures µ k ∈ R n k + * .Let us define the two following linear operators as well as the affine constraint sets Given a cost matrix c, where c ij represents the cost of moving mass µ 1 i to µ 2 j , the optimal transport problem corresponds to the estimation of an optimal transport matrix γ solution of: This is a linear programming problem whose resolution becomes intractable for large problems.

Regularized optimal transport
In [Cuturi, 2013], it has been proposed to regularize this problem by adding a strictly convex entropy regularization: with ε > 0, 1 is the matrix of size n 1 × n 2 full of ones and the Kullback-Leibler divergence is with the convention 0 log 0 = 0.It was shown in [Benamou et al., 2015] that the regularized optimal transport matrix γ * , which is the unique minimizer of problem (1), is the Bregman projection of γ 0 = e −c/ε (here and in the sequel, exponentiation is meant entry-wise) onto C 1 ∩ C 2 : where P C is the Bregman projection onto C defined as

Sinkhorn-Knopp algorithm
Iterative Bregman projections onto C 1 and C 2 converge to a point in the intersection man, 1967].Hence, the so-called Sinkhorn-Knopp algorithm (SK) [Sinkhorn, 1964] that performs alternate Bregman projections, can be considered to compute the regularized transport matrix: and we have lim l→+∞ γ = P C1 ∩ C2 (γ 0 ) = γ * .In the discrete setting, these projections correspond to diagonal scalings of the input: where is the pointwise division.To compute numerically the solution one simply has to store (a , b ) ∈ R n1 × R n2 and to iterate We then have γ = diag(a )γ 0 diag(b ).
Another way to interpret the SK algorithm is as an alternate maximization algorithm on the dual of the regularized optimal transport problem.The dual problem of (1) is e (αi+βj −ci,j )/ε . (5) The function E is concave, continuously differentiable and admits a maximizer, so alternate maximization converges and we recover SK algorithm by posing for a i = e αi/ε , b j = e βj /ε and γ 0 i,j = e −ci,j /ε .Efficient parallel computations can be considered [Cuturi, 2013] and one can almost reach realtime computation for large scale problem for certain class of cost matrices c allowing the use of seprable convolutions [Solomon et al., 2015].For small values of the parameter ε, numerical issues can arise and a stabilization of the algorithm is necessary [Chizat et al., 2016].The convergence of the process can nevertheless be very slow when ε is small.

Overview and contributions
In this paper, we consider an overrelaxation scheme designed to accelerate the Sinkhorn-Knopp algorithm.We first present and show the convergence of our algorithm in Section 2. In Section 3, we analyze the local convergence rate of the algorithm to justify the acceleration.We finally demonstrate numerically in Section 4 the good behavior of our method, where larger accelerations are observed for decreasing values of ε.

Related works
The introduction of relaxation variables through heavy ball approaches [Polyak, 1964] has recently gained in popularity to speed up the convergence of algorithms optimizing convex [Ghadimi et al., 2014] or non convex [Zavriev andKostyuk, 1993, Ochs, 2016] problems.Our specific approach is very much related to the SOR algorithm [Young, 2014], which is a classical way to solve linear systems.Similar schemes have been empirically considered to accelerate the SK algorithm in [Peyré et al., 2016, Schmitz et al., 2017].The convergence of these algorithms has nevertheless not been studied yet in the context of regularized optimal transport.

Overrelaxed Sinkhorn-Knopp algorithm
As illustrated in Figure 1 (a-b), SK algorithm, that performs alternate Bregman projections onto the affine sets C 1 and C 2 , can be very slow when ε → 0. The idea developed in this paper is to perform overrelaxed projections in order to accelerate the process, as displayed in Figure 1 (c).
Figure 1: The trajectory of γ given by the SK algorithm is illustrated for decreasing values of ε in (a) and (b).Overrelaxed projections (c) typically accelerate the convergence rate.

Overrelaxed projections
We define the ω-overrelaxed projection operator P ω C k as where the logarithm is taken coordinate-wise.Note that P 0 C k is the identity, P 1 C k = P C k is the standard Bregman projection and P 2 C k is an involution (in particular because C k is an affine subspace).A naive algorithm would then consist in iteratively applying P ω C2 • P ω C1 for some choice of ω.While it often behaves well in practice, this algorithm may sometimes not converge even for reasonable values of ω.Our goal in this section is to modify this algorithm to make it robust and to guarantee convergence.
Duality gives another point of view on the iterative overrelaxed Bregman projections: they indeed correspond to a successive overrelaxation (SOR) algorithm on the dual objective E. This is a procedure which, starting from (α 0 , β 0 ) = (0, 0), defines for ∈ N * , This can be seen by using the relationships given after equation (5).

Lyapunov function
Convergence of the successive overrelaxed projections is not guaranteed in general.In order to derive a robust algorithm with provable convergence, we introduce the Lyapunov function where γ * denotes the solution of the regularized OT problem.We will use this function to enforce the strict descent criterion F (γ +1 ) < F (γ ) as long as the process has not converged.The choice of (9) as a Lyapunov function is of course related to the fact that Bregman projections are used throughout the algorithm.Further, we will show (Lemma 1) that its decrease is very easy to compute and this descent criterion still allows enough freedom in the choice of the overrelaxation parameter.
Crucial properties of this Lyapunov function are gathered in the next lemma.
Lemma 1.For any M ∈ R * + , the sublevel set {γ | F (γ) ≤ M } is compact.Moreover, for any γ in R mn + * , the decrease of the Lyapunov function after an overrelaxed projection can be computed as where is a real function, applied coordinate-wise.
Proof.The fact that the Kullback-Leibler divergence is jointly lower semicontinuous implies in particular that K is closed.Moreover, K ⊂ R n1×n2 + is bounded because F is the sum of nonnegative, coercive functions of each component of its argument γ.
It follows from Lemma 1 that the decrease of F for an overrelaxed projection is very cheap to estimate, since its computational cost is linear with respect to the dimension of data µ k .In Figure 2, we display the function ϕ ω (x).Notice that for the Sinkhorn-Knopp algorithm, which corresponds to ω = 1, the function ϕ ω is always nonnegative.For other values 1 ≤ ω < 2, it is nonnegative for x close to 1.

Proposed algorithm
We first give a general convergence result that later serves as a basis to design an explicit algorithm.
Theorem 1.Let θ 1 and θ 2 be two continuous functions of γ such that where the inequality is strict whenever γ / ∈ C k .Consider the sequence defined by γ 0 = e −c/ε and Then the sequence (γ ) converges to γ * .
Proof.We refer to [Cuturi, 2013] for a proof of this lemma.
Proof of the theorem.First of all, notice that the operators P θ C k apply a scaling to lines or columns of matrices.All (γ ) are thus diagonally similar to γ 0 : ∀ ≥ 0, γ ∈ S By construction of the functions θ k , the sequence of values of the Lyapunov function (F (γ )) is non-increasing.Hence (γ ) is precompact.If ξ is a cluster point of (γ ), let us define Then by continuity of the applications, F (ξ) = F ( ξ) = F (ξ ).From the hypothesis made on θ 1 and θ 2 , it can be deduced that ξ is in C 1 and that ξ is in C 2 .Therefore ξ = ξ = ξ is in the intersection C 1 ∩ C 2 .By Lemma 2, ξ = γ * , and the whole sequence (γ ) converges to the solution.
We can construct explicitly functions θ k as needed in Theorem 1 using the following lemma.
Proof.Thanks to Lemma 1, one knows that The function that maps t ∈ [1, ∞) to ϕ t (x) is non-increasing since ∂ t ϕ t (x) = (x 1−t − 1) log x.For x = 1, it is even strictly decreasing.Thus inequality ( 13) is valid, with equality iff A k γ = µ k .
We now argue that a good choice for the functions θ k may be constructed as follows.Pick a target parameter θ 0 ∈ [1; 2) and a small security distance δ > 0. Define the functions Θ * and Θ as where min u denotes the smallest coordinate of the vector u.
Proposition 1.The function is continuous and verifies the descent condition (12).
Proof.Looking at Figure 2 can help understand this proof.Since the partial derivative of ∂ ω ϕ ω (x) is nonzero for any x < 1, the implicit function theorem proves the continuity of Θ * .The function Θ * (A k γ) µ k ) is such that every term in relation ( 10) is non-negative.Therefore, by Lemma 3, using this parameter minus δ ensures the strong decrease (12) of the Lyapunov function.Constraining the parameter to [1, θ 0 ] preserves this property.
This construction, which is often an excellent choice in practice, has several advantages: • it allows to choose arbitrarily the parameter θ 0 that will be used eventually when the algorithm is close to convergence (we motivate what are good choices for θ 0 in Section 3); • it is also an easy approach to having an adaptive method, as the approximation of Θ * has a negligible cost (it only requires to solve a one dimensional problem that depends on the smallest value of (A k γ) µ k , which can be done in a few iterations of Newton's method).
The resulting algorithm, which is proved to be convergent by Theorem 1, is written in pseudocode in Algorithm 1.It uses the function Θ defined implicitly in (15), which in practice is approximated with a few iterations of Newton's method on the function ω → ϕ ω (min u) which is decreasing is as can be seen on Figure 2.With the choice θ 0 = 1, one recovers exactly the original SK algorithm.is

Acceleration of local convergence rate
In order to justify the acceleration of convergence that is observed in practice, we now study the local convergence rate of the overrelaxed algorithm, which follows from the classical convergence analysis of the linear SOR method.Our result involves the second largest eigenvalue of the matrix where γ * is the solution to the regularized OT problem (the largest eigenvalue is 1, associated to the eigenvector 1).We denote the second largest eigenvalue by 1 − η, it satisfies η > 0 [Knight, 2008].
Proof.In this proof, we focus on the dual problem and we recall the relationship γ = e α /ε γ 0 e β /ε between the iterates of the overrelaxed projection algorithm γ and the iterates (α , β ) of the SOR algorithm on the dual problem (7), initialized with (α 0 , β 0 ) = (0, 0).The dual problem ( 5) is invariant by translations of the form (α, β) → (α − k, β + k), k ∈ R, but is otherwise strictly convex on any subspace which does not contain the line R(1, −1).In order to deal with this invariance (which cancels in the corresponding primal iterates), consider the subspace S of pairs of dual variables (α, β) that satisfy α 1 = 0, let π S be the (non orthogonal) projection on S of kernel (1, −1) and let (α * , β * ) ∈ S be a dual maximizer.Since one SOR iteration is a smooth map, the local convergence properties of the SOR algorithm are characterized by the local convergence of its linearization, which here corresponds to the SOR method applied to the maximization of the quadratic Taylor expansion of the dual objective E at (α * , β * ).This defines an affine map M θ : (α , β ) → (α +1 , β +1 ) whose spectral properties are well known [Ciarlet, 1982, Young, 2014] (see also [Chizat, 2017, chapter 4] for the specific case of convex minimization).For the case θ = 1, this corresponds to the matrix M 1 defined in (17).Specifically, in the non strictly concave case [Hadjidimos, 1985], we have that the operator π S • M 1 converges at the linear rate 1 − η towards the projector on (α * , β * ) and that the convergence of π S • M θ is guaranteed for θ ∈]0, 2[, with the rate where θ * := 2/(1 + √ η) is the optimal parameter, for which f (θ To switch from these dual convergence results to primal convergence results, remark that γ → γ * implies KL(γ , γ 0 ) → KL(γ * , γ 0 ) which in turn implies E(α , β ) → max E so invoking the partial strict concavity of E, π S (α , β ) → (α * , β * ).The converse implication is direct so it holds We conclude by noting the fact that π S (α , β ) converges at a linear rate implies the same rate on γ , thanks to the relationship between the iterates.
Corollary 1.The previous local convergence analysis applies to Algorithm 1 and the local convergence rate is governed by the choice of the target extrapolation parameter θ 0 .

Experimental results
We compare Algorithm 1 to SK algorithm on two very different optimal transport settings.In setting (a) we consider the domain [0, 1] discretized into 100 samples and the squared Euclidean transport cost on this domain.The marginals are densities made of the sum of a base plateau of height 0.1 and another plateau of height and boundaries chosen uniformly in [0, 1], subsequently normalized.
In setting (b) the cost is a 100 × 100 random matrix with entries uniform in [0, 1] and the marginals are uniform.Given an estimation of 1 − η, the local convergence rate of SK algorithm, we define θ 0 as the optimal parameter as given in Proposition 2. For estimating η, we follow two strategies.For strategy "estimated" (in blue on Figure 4), η is measured by looking at the local convergence rate of SK run on another random problem of the same setting and for the same value of ε.For strategy "measured" (in orange on Figure 4) the parameter is set using the local convergence rate of SK run on the same problem.Of course, the latter is an unrealistic strategy but it is interesting to see in our experiments that the "estimated" strategy performs almost as well as the "measured" one, as shown on 4.
Figure 4 displays the ratio of the number of iterations required to reach a precision of 10 −6 on the dual variable α for SK algorithm and Algorithm 1.It is is worth noting that the complexity per iteration of these algorithms is the same modulo negligible terms, so this ratio is also the runtime ratio (our algorithm can also be parallelized on GPUs just as SK algorithm).In both experimental settings, for low values of the regularization parameter ε, the acceleration ratio is above 20 with Algorithm 1.

Conclusion and perspectives
The SK algorithm is widely used to solve entropy regularized OT.The use of overrelaxed projections turns out to be a natural and simple idea to accelerate convergence while keeping many nice properties of this algorithm (first order, parallelizable, simple).We have proposed an algorithm that adaptively chooses the overrelaxation parameter so as to guarantee global convergence.The acceleration of the convergence speed is numerically impressive, in particular in low regularization regimes.It is theoretically supported in the local regime by the standard analysis of SOR iterations.
This idea of overrelaxation can be generalized to solve more general problems such as multimarginal OT, barycenters, gradient flows, unbalanced OT [Chizat, 2017, chap. 4] but there is no systematic way to derive globally convergent algorithms.Our work is a step in the direction of building and understanding the properties of robust first order algorithms for solving OT.More understanding is needed regarding SOR itself (global convergence speed, choice of θ 0 ), but also its relation to other acceleration methods [Scieur et al., 2016, Altschuler et al., 2017].

Figure 2 :
Figure 2: Value of ϕ ω (x).The function is positive above the red line, negative below.For any relaxation parameter ω smaller than 2, there exists a neighborhood of 1 on which ϕ ω (•) is positive.

Figure 3 :
Figure 3: Local linear rate of convergence of the overrelaxed algorithm as a function of 1 − η, the local convergence rate of SK algorithm and θ the overrelaxation parameter.(plain curve) the original rate is recovered for θ = 1.(dashed curve) optimal overrelaxation parameter θ * .