Article

Sparse Regularized Optimal Transport with Deformed q-Entropy

by 1,* and 2
1
Graduate School of Informatics and The Hakubi Center for Advanced Research, Kyoto University, Kyoto 604-8103, Japan
2
Department of Mathematical Informatics, Graduate School of Information Science and Technology, The University of Tokyo, Tokyo 153-8505, Japan
*
Author to whom correspondence should be addressed.
Entropy 2022, 24(11), 1634; https://doi.org/10.3390/e24111634
Received: 18 September 2022 / Revised: 4 November 2022 / Accepted: 7 November 2022 / Published: 10 November 2022

Abstract

Optimal transport is a mathematical tool that has been widely used to measure the distance between two probability distributions. To mitigate the cubic computational complexity of the vanilla formulation of the optimal transport problem, regularized optimal transport, a convex program that minimizes the linear transport cost with an added convex regularizer, has received attention in recent years. Sinkhorn optimal transport, the most prominent variant, is regularized with negative Shannon entropy and yields densely supported solutions, which are often undesirable in light of the interpretability of transport plans. In this paper, we report that a deformed entropy designed by q-algebra, a popular generalization of the standard algebra studied in Tsallis statistical mechanics, makes optimal transport solutions sparsely supported. This entropy with a deformation parameter $q$ interpolates the negative Shannon entropy ($q = 1$) and the squared 2-norm ($q = 0$), and the solution becomes sparser as $q$ tends to zero. Our theoretical analysis reveals that a larger $q$ leads to faster convergence when optimized with the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm. In summary, the deformation induces a trade-off between sparsity and convergence speed.

1. Introduction

Optimal transport (OT) is a classic problem in operations research, and it is used to compute a transport plan between suppliers and demanders with a minimum transportation cost. When suppliers and demanders are regarded as two probability distributions, the minimum transportation cost can be interpreted as the closeness between the distributions. The OT problem has been extensively studied (also as the Wasserstein distance) [1] and used in robust machine learning [2], domain adaptation [3], generative modeling [4], and natural language processing [5], owing to its many useful properties, such as defining a distance between two probability distributions. Recently, the OT problem has been employed for various modern applications, such as interpretable word alignment [6] and the locality-aware evaluation of object detection [7], because it can capture the geometry of data and provide a measurement method for closeness and alignment among different objects. From a computational perspective, a naïve approach is to use a network simplex algorithm or interior point method to solve the OT problem as a usual linear program; this approach requires supercubic time complexity [8] and is not scalable. A number of approaches have been suggested to accelerate the computation of the OT problem: entropic regularization [9,10], accelerated gradient descent [11], and approximation with tree [12] and graph metrics [13]. We focus on entropic-regularized OT because it admits a unique solution owing to strong convexity and transforms the original constrained optimization into an unconstrained problem with a clear primal–dual relationship. The celebrated Sinkhorn algorithm solves entropic-regularized OT with quadratic time complexity [9]. Furthermore, the Sinkhorn algorithm is amenable to differentiable programming, and it is easily incorporated into end-to-end learning pipelines [14,15].
Despite the popularity of the Sinkhorn algorithm, one of its main drawbacks is that Shannon entropy blurs the OT solution, i.e., solutions of entropic-regularized OT are always densely supported. Owing to its nature, the Shannon entropy induces a probability distribution that has strictly positive values everywhere on its support [16], whereas the vanilla (unregularized) OT produces extremely sparse transport plans located on the boundaries of a polytope [17,18]. If we are interested in alignment and matching between different objects (as in several applications of natural language processing [6,19]), dense transport plans are hard to interpret because the matching information between objects may be obfuscated by unimportant small densities contained in the transport plans. One attempt toward realizing sparse OT is to use the squared 2-norm as an alternative regularizer. Blondel et al. [20] showed that the dual of this optimization problem can be solved via the L-BFGS method [21]; the primal solution corresponds to a transport plan recovered from the dual solution in a closed form, which is sparse. Although they successfully obtained a sparse OT formulation with a numerically stable algorithm, the degree of sparsity cannot be easily modulated when we want to control it for a given final application. Furthermore, the theoretical convergence rates of solving regularized OT are not yet known.
In this study, we examine the relationship between the sparsity of transport plans and the convergence guarantee of regularized OT. Specifically, we propose yet another entropic regularizer called deformed q-entropy with a deformation parameter $q$ that allows us to control the solution sparsity. We start with a dual solution of the entropic-regularized OT given by the Gibbs kernel to introduce a new regularizer; the Gibbs kernel associated with Shannon entropy induces nonsparsity, and, therefore, we replace the Gibbs kernel with another sparse kernel based on the q-exponential distribution [22], following the idea of Tsallis statistics [23]. The deformed q-entropy is derived from the dual solution characterized by the sparse kernel. Interestingly, the deformed q-entropy recovers the Shannon entropy in the limit $q \to 1$ and matches the (negative) squared 2-norm at $q = 0$; this means that the deformed q-entropy interpolates between the two regularizers. We confirm that the solution becomes increasingly sparse as $q$ approaches zero. We call the regularized OT with the deformed q-entropy deformed q-optimal transport (q-DOT). The q-DOT reveals an interesting connection between the OT solution and the q-exponential distribution, which is of independent interest. From the optimization perspective, we can solve the unconstrained dual of q-DOT with many standard solvers, as reported in Blondel et al. [20]. Our analysis of the convergence rate of the dual optimization shows that convergence with the BFGS method [24] becomes faster as the deformation parameter $q$ approaches one. Therefore, the weaker deformation (larger $q$) leads to faster convergence while sacrificing sparsity. Finally, we demonstrate the trade-off between sparsity and convergence in the numerical experiments.
Our contributions can be summarized as follows: (i) showing a clear connection between the regularized OT problem and the q-exponential distribution; (ii) demonstrating the trade-off of the q-DOT between sparsity and convergence; (iii) providing a formal convergence guarantee of the q-DOT when solved with the BFGS method. The rest of this paper is organized as follows: Section 2 introduces the necessary background on the OT problem and entropic regularization. In Section 3, the Lagrange dual of the entropic-regularized OT problem is first shown; then, the dual optimal formula and the q-exponential distribution are connected to sparsify the transport matrix. Section 4 focuses on the optimization perspective of the regularized OT problem and provides a convergence guarantee with the BFGS method, which shows the theoretical trade-off between sparsity and convergence. Finally, the empirical behavior and the trade-off of the regularized OT are numerically confirmed in Section 5.

2. Background

2.1. Preliminaries

For $x \in \mathbb{R}$, let $[x]_+ = x$ if $x > 0$ and $0$ otherwise, and let $[x]_+^p$ represent $([x]_+)^p$ hereafter. For a convex function $f: \mathcal{X} \to \mathbb{R}$, where $\mathcal{X}$ represents a Euclidean vector space equipped with an inner product $\langle \cdot, \cdot \rangle$, the Fenchel–Legendre conjugate $f^\star: \mathcal{X} \to \mathbb{R}$ is defined as $f^\star(y) := \sup_{x \in \mathcal{X}} \langle x, y \rangle - f(x)$. The relative interior of a set $S$ is denoted by $\mathrm{ri}\, S$, and the effective domain of a function $f$ is denoted by $\mathrm{dom}(f)$. A differentiable function $f$ is said to be $M$-strongly convex over $S \subseteq \mathrm{ri}\, \mathrm{dom}(f)$ if, for all $x, y \in S$, we have $f(x) - f(y) - \langle \nabla f(y), x - y \rangle \ge \frac{M}{2} \|x - y\|_2^2$. If $f$ is twice differentiable, the strong convexity is equivalent to $\nabla^2 f(x) \succeq M I$ for all $x \in S$. Similarly, a differentiable function $f$ is said to be $M$-smooth over $S \subseteq \mathrm{ri}\, \mathrm{dom}(f)$ if, for all $x, y \in S$, we have $\|\nabla f(x) - \nabla f(y)\|_2 \le M \|x - y\|_2$, which is equivalent to $\nabla^2 f(x) \preceq M I$ for all $x \in S$ if $f$ is twice differentiable.

2.2. Optimal Transport

OT is the mathematical problem of finding a transport plan between two probability distributions with the minimum transport cost. The discussions in this paper are restricted to discrete distributions. Let $(\mathcal{X}, d)$, $\delta_x$, and $\triangle^{n-1} := \{ p \in [0, 1]^n \mid \langle p, \mathbf{1}_n \rangle = 1 \}$ represent a metric space, the Dirac measure at point $x$, and the $(n-1)$-dimensional probability simplex, respectively. Let $\mu = \sum_{i=1}^n a_i \delta_{x_i}$ and $\nu = \sum_{j=1}^m b_j \delta_{y_j}$ be histograms supported on the finite sets of points $(x_i)_{i=1}^n \subseteq \mathcal{X}$ and $(y_j)_{j=1}^m \subseteq \mathcal{X}$, respectively, where $a \in \triangle^{n-1}$ and $b \in \triangle^{m-1}$ are probability vectors. The OT between two discrete probability measures $\mu$ and $\nu$ is the optimization problem
\[ T(\mu, \nu) := \inf_{\Pi \in U(\mu, \nu)} \sum_{i=1}^n \sum_{j=1}^m d(x_i, y_j) \Pi_{ij}, \tag{1} \]
where $U$ represents the transport polytope, defined as
\[ U(\mu, \nu) := \left\{ \Pi \in \mathbb{R}_{\ge 0}^{n \times m} \;\middle|\; \Pi \mathbf{1}_m = a, \; \Pi^\top \mathbf{1}_n = b \right\}. \tag{2} \]
The transport polytope $U$ defines the constraints on the row/column marginals of a transport matrix $\Pi$. These constraints are often referred to as coupling constraints. For notational simplicity, the matrix $D_{ij} := d(x_i, y_j)$ and the expectation $\langle D, \Pi \rangle := \sum_{i=1}^n \sum_{j=1}^m D_{ij} \Pi_{ij}$ are used hereafter. $T(\mu, \nu)$ is known as the 1-Wasserstein distance, which defines a metric space over histograms [1].
Equation (1) is a linear program and can be solved by well-studied algorithms such as the interior point and network simplex methods. However, its computational complexity is $O(n^3 \log n)$ (assuming $n = m$), so it is not scalable to large datasets [8].

2.3. Entropic Regularization and Sinkhorn Algorithm

The entropic-regularized formulation is commonly used to reduce the computational burden. Here, we introduce regularized OT with negative Shannon entropy [9] as
\[ T_{\lambda H}(\mu, \nu) := \inf_{\Pi \in U(\mu, \nu)} \langle D, \Pi \rangle + \lambda \underbrace{\sum_{i=1}^n \sum_{j=1}^m (\Pi_{ij} \log \Pi_{ij} - \Pi_{ij})}_{\text{negative Shannon entropy}}, \tag{3} \]
where $\lambda > 0$ represents the regularization strength. Let us review the derivation of the updates of the Sinkhorn algorithm. The Lagrangian of the optimization problem in Equation (3) is
\[ L(\Pi, \alpha, \beta) := \sum_{i=1}^n \sum_{j=1}^m \left( D_{ij} \Pi_{ij} + \lambda (\Pi_{ij} \log \Pi_{ij} - \Pi_{ij}) \right) + \sum_{i=1}^n \alpha_i ([\Pi \mathbf{1}_m]_i - a_i) + \sum_{j=1}^m \beta_j ([\Pi^\top \mathbf{1}_n]_j - b_j), \tag{4} \]
where $\alpha \in \mathbb{R}^n$ and $\beta \in \mathbb{R}^m$ represent the Lagrangian multipliers. Equation (4) ignores the constraints $\Pi_{ij} \ge 0$ (for all $i \in [n]$ and $j \in [m]$); however, they will be automatically satisfied. By taking the derivative in $\Pi_{ij}$, we obtain
\[ \partial_{\Pi_{ij}} L = D_{ij} + \lambda \log \Pi_{ij} + \alpha_i + \beta_j, \tag{5} \]
and, hence, the stationary condition $\partial_{\Pi_{ij}} L = 0$ induces the solution
\[ \Pi_{ij}^* = \exp\left( -\frac{\alpha_i + \beta_j + D_{ij}}{\lambda} \right). \tag{6} \]
The decomposition $\Pi_{ij}^* = \exp(-D_{ij}/\lambda) / \exp((\alpha_i + \beta_j)/\lambda)$ suggests that the stationary point is the (normalized) Gibbs kernel $\exp(-D_{ij}/\lambda)$. One can easily infer that the Sinkhorn solution is dense because the Gibbs kernel is supported on the entire $\mathbb{R}_{\ge 0}$, i.e., $\exp(-z/\lambda) > 0$ for all $z \in \mathbb{R}_{\ge 0}$. We can write Equation (6) in matrix form by applying the variable transforms $u_i := \exp(-\alpha_i/\lambda)$, $v_j := \exp(-\beta_j/\lambda)$, and $K_{ij} := \exp(-D_{ij}/\lambda)$ as
\[ \Pi^* = \underbrace{\mathrm{diag}(u)}_{:= U} \, K \, \underbrace{\mathrm{diag}(v)}_{:= V}. \tag{7} \]
The following Sinkhorn updates are used to make Equation (7) meet the marginal constraints:
\[ u \leftarrow a / (K v), \qquad v \leftarrow b / (K^\top u), \tag{8} \]
where $z / \eta$ represents the element-wise division of the two vectors $z$ and $\eta$. The computational complexity is $O(Knm)$ because the Sinkhorn updates involve only matrix-vector multiplications and element-wise divisions; here $K$ represents the number of Sinkhorn updates. A finer analysis of the number of updates required to meet the error tolerance is provided in the literature [25].
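The updates in Equation (8) can be sketched in a few lines of dependency-free Python. This is a minimal illustration rather than a reference implementation: the function name `sinkhorn`, the fixed iteration count `num_iters`, and the toy cost matrix are assumptions made for the example, and no log-domain stabilization is attempted.

```python
import math

def sinkhorn(D, a, b, lam, num_iters=500):
    """Sinkhorn iterations for entropic-regularized OT (illustrative sketch)."""
    n, m = len(a), len(b)
    # Gibbs kernel K_ij = exp(-D_ij / lam); every entry is strictly positive.
    K = [[math.exp(-D[i][j] / lam) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(num_iters):
        # u <- a / (K v), element-wise
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        # v <- b / (K^T u), element-wise
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan Pi = diag(u) K diag(v)
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy problem (hypothetical data): three points on a line, squared distances as costs.
D = [[0.0, 1.0, 4.0], [1.0, 0.0, 1.0], [4.0, 1.0, 0.0]]
a = [0.5, 0.3, 0.2]
b = [0.2, 0.3, 0.5]
Pi = sinkhorn(D, a, b, lam=1.0)
row_sums = [sum(row) for row in Pi]
col_sums = [sum(Pi[i][j] for i in range(3)) for j in range(3)]
```

Every entry of `Pi` is a product of strictly positive factors, which is the density property discussed above.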

3. Deformed q-Entropy and q-Regularized Optimal Transport

3.1. Regularized Optimal Transport and Its Dual

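Let us consider the following primal problem with a general regularization function $\Omega$.
Definition 1
(Primal of regularized OT).
\[ T_\Omega(\mu, \nu) = \inf_{\Pi \in U(\mu, \nu)} \langle D, \Pi \rangle + \sum_{i,j} \Omega(\Pi_{ij}), \tag{9} \]
where $\Omega: \mathbb{R} \to \mathbb{R}$ represents a proper closed convex function.
Next, we derive its dual by Lagrange duality. The Lagrangian of Equation (9) is defined as
\[ L(\Pi, \alpha, \beta) := \langle D, \Pi \rangle + \sum_{i,j} \Omega(\Pi_{ij}) + \langle \alpha, \Pi \mathbf{1}_m - a \rangle + \langle \beta, \Pi^\top \mathbf{1}_n - b \rangle, \tag{10} \]
with dual variables $\alpha \in \mathbb{R}^n$ and $\beta \in \mathbb{R}^m$. Then, the primal can be rewritten in terms of the Lagrangian as
\[ T_\Omega(\mu, \nu) = \inf_{\Pi \in \mathbb{R}_{\ge 0}^{n \times m}} \sup_{\alpha \in \mathbb{R}^n, \beta \in \mathbb{R}^m} L(\Pi, \alpha, \beta). \tag{11} \]
In this Lagrangian formulation, we let the constraint $\Pi \in \mathbb{R}_{\ge 0}^{n \times m}$ remain for a technical reason. The constrained optimization problem in (11) can be reformulated into the following unconstrained one with an indicator function $I_{\mathbb{R}_{\ge 0}^{n \times m}}$:
\[ T_\Omega(\mu, \nu) = \inf_{\Pi \in \mathbb{R}^{n \times m}} \sup_{\alpha \in \mathbb{R}^n, \beta \in \mathbb{R}^m} L(\Pi, \alpha, \beta) + I_{\mathbb{R}_{\ge 0}^{n \times m}}(\Pi), \tag{12} \]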
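which corresponds to an optimization problem with the convex objective function $\langle D, \Pi \rangle + \sum_{i,j} \Omega(\Pi_{ij}) + I_{\mathbb{R}_{\ge 0}^{n \times m}}(\Pi)$ with only the linear constraints $\Pi \mathbf{1}_m = a$ and $\Pi^\top \mathbf{1}_n = b$. By invoking the Sinkhorn–Knopp theorem [26], the existence of a strictly feasible solution, namely, a solution satisfying $\Pi \mathbf{1}_m = a$ and $\Pi^\top \mathbf{1}_n = b$, can be confirmed. Hence, we see that the Slater condition is satisfied, and the strong duality holds as follows:
\[ \begin{aligned} T_\Omega(\mu, \nu) &= \sup_{\alpha \in \mathbb{R}^n, \beta \in \mathbb{R}^m} \inf_{\Pi \in \mathbb{R}_{\ge 0}^{n \times m}} L(\Pi, \alpha, \beta) \\ &= \sup_{\alpha \in \mathbb{R}^n, \beta \in \mathbb{R}^m} -\langle a, \alpha \rangle - \langle b, \beta \rangle + \inf_{\Pi \in \mathbb{R}_{\ge 0}^{n \times m}} \sum_{i,j} (D_{ij} + \alpha_i + \beta_j) \Pi_{ij} + \Omega(\Pi_{ij}) \\ &= \sup_{\alpha \in \mathbb{R}^n, \beta \in \mathbb{R}^m} -\langle a, \alpha \rangle - \langle b, \beta \rangle - \sup_{\Pi \in \mathbb{R}_{\ge 0}^{n \times m}} \sum_{i,j} (-D_{ij} - \alpha_i - \beta_j) \Pi_{ij} - \Omega(\Pi_{ij}) \\ &= \sup_{\alpha \in \mathbb{R}^n, \beta \in \mathbb{R}^m} -\langle a, \alpha \rangle - \langle b, \beta \rangle - \sum_{i,j} \Omega^\star(-D_{ij} - \alpha_i - \beta_j), \end{aligned} \tag{13} \]
where $\Omega^\star$ represents the Fenchel–Legendre conjugate of $\Omega: \mathbb{R} \to \mathbb{R}$:
\[ \Omega^\star(\eta) := \sup_{\pi \ge 0} \eta \pi - \Omega(\pi). \tag{14} \]
Although each element of the transport plans ranges over $[0, 1]$, it is sufficient to define the Fenchel–Legendre conjugate as the supremum over $\mathbb{R}_{\ge 0}$ because of how $\Omega^\star$ emerges in the strong duality (13). According to Danskin’s theorem [27], the supremum of the Fenchel–Legendre conjugate can be attained at
\[ \Pi_{ij}^* = \nabla \Omega^\star(-D_{ij} - \alpha_i - \beta_j). \tag{15} \]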
Therefore, the dual of regularized OT is formulated as follows:
Definition  2
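(Dual of regularized OT).
\[ T_\Omega(\mu, \nu) = \sup_{\alpha \in \mathbb{R}^n, \beta \in \mathbb{R}^m} -\langle a, \alpha \rangle - \langle b, \beta \rangle - \sum_{i,j} \Omega^\star(-D_{ij} - \alpha_i - \beta_j), \tag{16} \]
where $\Omega^\star$ represents the Fenchel–Legendre conjugate $\Omega^\star(\eta) := \sup_{\pi \ge 0} \eta \pi - \Omega(\pi)$. The optimal solution of the primal is given by the dual map $\nabla \Omega^\star$ such that $\Pi_{ij}^* = \nabla \Omega^\star(-D_{ij} - \alpha_i^* - \beta_j^*)$, where $(\alpha^*, \beta^*)$ represents the dual optimal solution.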
Next, we see several examples that are summarized in Table 1.
Example 1
(Negative Shannon entropy). Let $\Omega(\pi) = -\lambda H(\pi) = \lambda(\pi \log \pi - \pi)$; then $\Omega^\star(\eta) = \lambda e^{\eta/\lambda}$ and $\nabla \Omega^\star(\eta) = e^{\eta/\lambda}$. The optimal solution represented with the optimal dual variables $(\alpha^*, \beta^*)$ is $\Pi_{ij}^* = \exp\left( -\frac{D_{ij} + \alpha_i^* + \beta_j^*}{\lambda} \right)$. This recovers the stationary point of the Sinkhorn OT in Equation (6). The solution is dense because the regularizer $\Omega$ induces the Gibbs kernel $\nabla \Omega^\star(\eta) = e^{\eta/\lambda} > 0$ for all $\eta \in \mathbb{R}$.
Example 2
(Squared 2-norm). Let $\Omega(\pi) = \frac{\lambda}{2} \pi^2$; then $\Omega^\star(\eta) = \frac{1}{2\lambda} [\eta]_+^2$ and $\nabla \Omega^\star(\eta) = \frac{1}{\lambda} [\eta]_+$. The optimal solution represented with the optimal dual variables $(\alpha^*, \beta^*)$ is $\Pi_{ij}^* = \frac{1}{\lambda} [-D_{ij} - \alpha_i^* - \beta_j^*]_+$. As mentioned by Blondel et al. [20], the squared 2-norm can sparsify the solution because $\nabla \Omega^\star(\eta) = \frac{1}{\lambda} [\eta]_+$ may take the value 0.
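The contrast between the two dual maps in Examples 1 and 2 can be checked numerically. The sketch below is illustrative only: the helper `conjugate_on_grid` is a crude grid search introduced for this example (not a method from the paper), and the values of `eta` are hypothetical stand-ins for $-D_{ij} - \alpha_i - \beta_j$.

```python
import math

def dual_map_shannon(eta, lam):
    """Dual map for Omega(pi) = lam * (pi * log(pi) - pi): always positive."""
    return math.exp(eta / lam)

def dual_map_squared_2norm(eta, lam):
    """Dual map for Omega(pi) = (lam / 2) * pi**2: exactly zero for eta <= 0."""
    return max(eta, 0.0) / lam

def conjugate_on_grid(omega, eta, upper=10.0, step=1e-3):
    """Crude numerical sup over pi in [0, upper] of eta * pi - omega(pi)."""
    num = int(upper / step)
    return max(eta * (k * step) - omega(k * step) for k in range(num + 1))

lam = 0.5
squared_2norm = lambda p: 0.5 * lam * p * p
# Convention 0 * log(0) = 0 at the boundary.
neg_shannon = lambda p: lam * (p * math.log(p) - p) if p > 0.0 else 0.0

# eta plays the role of -D_ij - alpha_i - beta_j (hypothetical values).
etas = [-1.0, -0.1, 0.2, 0.7]
dense_entries = [dual_map_shannon(eta, lam) for eta in etas]
sparse_entries = [dual_map_squared_2norm(eta, lam) for eta in etas]

# Compare the grid-search conjugates with the closed forms stated in the examples.
conj_l2 = conjugate_on_grid(squared_2norm, 0.7)
conj_shannon = conjugate_on_grid(neg_shannon, 0.2)
```

Entries with a non-positive argument are exactly zero under the squared 2-norm map, whereas the Shannon map never returns zero.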

3.2. q Algebra and Deformed Entropy

As shown in the last few examples, the dual map $\nabla \Omega^\star$ plays an important role in the OT solution sparsity. In addition, the induced $\nabla \Omega^\star$ is the Gibbs kernel when the negative Shannon entropy is used as $\Omega$. Therefore, one may think of designing a regularizer from $\nabla \Omega^\star$ by utilizing a kernel function that induces sparsity. One candidate is a q-exponential distribution. We begin with some basics required to formulate q-exponential distributions.
First, we introduce q-algebra, which has been well studied in the field of Tsallis statistical mechanics [23,29,30]. q-algebra has been used in the machine-learning literature for regression [31], Bayesian inference [32], and robust learning [33]. For a deformation parameter $q \in [0, 1]$, the q-logarithm and q-exponential functions are defined as
\[ \log_q(x) := \begin{cases} \dfrac{x^{1-q} - 1}{1 - q} & \text{if } q \in [0, 1), \\ \log(x) & \text{if } q = 1, \end{cases} \qquad \exp_q(x) := \begin{cases} [1 + (1 - q) x]_+^{1/(1-q)} & \text{if } q \in [0, 1), \\ \exp(x) & \text{if } q = 1. \end{cases} \tag{17} \]
The q-logarithm is defined only for $x > 0$, as is the natural logarithm; the two functions are inverses of each other (in an appropriate domain), and they recover the natural logarithm and exponential as $q \to 1$. Their derivatives are $(\log_q(x))' = \frac{1}{x^q}$ and $(\exp_q(x))' = \exp_q(x)^q$, respectively. The additive factorization property $\exp(x + y) = \exp(x) \exp(y)$ satisfied by the natural exponential no longer holds for the q-exponential: $\exp_q(x + y) \ne \exp_q(x) \exp_q(y) = \exp_q(x + y + (1 - q) x y)$. Instead, we can construct another algebraic structure by introducing another operation called the q-product $\otimes_q$:
\[ x \otimes_q y = [x^{1-q} + y^{1-q} - 1]_+^{1/(1-q)}. \tag{18} \]
With this product, the pseudoadditive factorization $\exp_q(x + y) = \exp_q(x) \otimes_q \exp_q(y)$ holds. Thus, the q-algebra captures rich nonlinear structures, and it is often used to extend the Shannon entropy to the Tsallis entropy [23]
\[ T_q(\pi) = -\sum_{i=1}^n \pi_i^q \log_q(\pi_i). \tag{19} \]
One can see that the Tsallis entropy has an equivalent power formulation $T_q(\pi) = -\sum_{i=1}^n \frac{\pi_i - \pi_i^q}{1 - q}$, which means that it is often suitable for modeling heavy-tailed phenomena such as the power law. Although the introduced q-logarithm and q-exponential can look arbitrary, they can be axiomatically derived by assuming the essential properties of the algebra (see Naudts [29]). For more physical insights, we refer readers to the literature [30].
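The q-logarithm, q-exponential, and q-product in Equations (17) and (18) translate directly into code. The following sketch uses hypothetical function names (`log_q`, `exp_q`, `q_product`) and checks the inverse relationship and the pseudoadditive factorization on a few sample values, chosen so that no clipping by $[\cdot]_+$ occurs.

```python
import math

def log_q(x, q):
    """q-logarithm, defined for x > 0; reduces to log(x) at q = 1."""
    if x <= 0.0:
        raise ValueError("log_q is defined only for x > 0")
    if q == 1.0:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def exp_q(x, q):
    """q-exponential [1 + (1 - q) x]_+^{1/(1-q)}; reduces to exp(x) at q = 1."""
    if q == 1.0:
        return math.exp(x)
    return max(1.0 + (1.0 - q) * x, 0.0) ** (1.0 / (1.0 - q))

def q_product(x, y, q):
    """q-product of two non-negative numbers; ordinary product at q = 1."""
    if q == 1.0:
        return x * y
    base = max(x ** (1.0 - q) + y ** (1.0 - q) - 1.0, 0.0)
    return base ** (1.0 / (1.0 - q))

q = 0.5
round_trip = exp_q(log_q(2.0, q), q)                 # inverse pair on x > 0
lhs = exp_q(0.3 + 0.4, q)                            # exp_q(x + y)
rhs = q_product(exp_q(0.3, q), exp_q(0.4, q), q)     # exp_q(x) (x)_q exp_q(y)
clipped = exp_q(-3.0, q)                             # 1 + (1 - q) x < 0, so clipped to 0
near_natural = exp_q(1.0, 0.999)                     # approaches e as q -> 1
```

The clipping in `exp_q` is what later produces exact zeros in the transport plan.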
Next, we introduce the q-exponential distribution. We introduce a simpler form for our purpose, whereas more general formulations of the q-exponential distribution have been introduced in the literature [22]. Given the form of the Gibbs kernel $k(\xi) := \exp(-\xi/\lambda)$, we define the q-Gibbs kernel as follows:
Definition 3
(q-Gibbs kernel). For $\xi \ge 0$, we define the q-Gibbs kernel as $k_q(\xi) := \exp_q(-\xi/\lambda)$ for a deformation parameter $q \in [0, 1]$ and a temperature parameter $\lambda \in \mathbb{R}_{>0}$.
If we take $\xi$ as the (centered) squared distance, then $k_q(\xi)$ represents the q-Gaussian distribution [22]. We illustrate the q-Gibbs kernel with different deformation parameters in Figure 1.
By definition, the support of the q-Gibbs kernel is $\mathrm{supp}(k_q) = \left[ 0, \frac{\lambda}{1-q} \right]$ for $q \in [0, 1)$ and $\mathrm{supp}(k_q) = \mathbb{R}_{\ge 0}$ for $q = 1$. This indicates that the q-Gibbs kernel ignores the effect of a too-large $\xi$ (or too large a distance between two points); its threshold is smoothly controlled by the temperature parameter $\lambda$ and deformation parameter $q$.
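The support threshold $\lambda/(1-q)$ can be illustrated with a short sketch. The names `q_gibbs_kernel` and `support_threshold` are hypothetical helpers written for this example.

```python
import math

def exp_q(x, q):
    """q-exponential; reduces to exp(x) at q = 1."""
    if q == 1.0:
        return math.exp(x)
    return max(1.0 + (1.0 - q) * x, 0.0) ** (1.0 / (1.0 - q))

def q_gibbs_kernel(xi, lam, q):
    """k_q(xi) = exp_q(-xi / lam) for xi >= 0."""
    return exp_q(-xi / lam, q)

def support_threshold(lam, q):
    """Largest xi with a possibly nonzero kernel value: lam / (1 - q), or infinity at q = 1."""
    return math.inf if q == 1.0 else lam / (1.0 - q)

lam, q = 1.0, 0.5
threshold = support_threshold(lam, q)          # lam / (1 - q)
inside = q_gibbs_kernel(1.0, lam, q)           # positive: xi is below the threshold
at_edge = q_gibbs_kernel(threshold, lam, q)    # kernel vanishes at the threshold
beyond = q_gibbs_kernel(5.0, lam, q)           # kernel stays at zero beyond it
gibbs = q_gibbs_kernel(5.0, lam, 1.0)          # ordinary Gibbs kernel never vanishes
```

Lowering `q` or `lam` shrinks the threshold, so more cost entries fall outside the support.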
Finally, we derive an entropic regularizer that induces sparsity by using the q-Gibbs kernel. Given the stationary condition in Equation (15), we impose the following functional form on the dual map:
\[ \pi = \nabla \Omega^\star(\eta) = \exp_q\left( \frac{\eta}{\lambda} \right), \tag{20} \]
where $(\pi, \eta) = (\Pi_{ij}, -D_{ij} - \alpha_i - \beta_j)$. Equation (20) results in the factorization
\[ \Pi_{ij} = \exp_q\left( -\frac{\alpha_i}{\lambda} \right) \otimes_q \exp_q\left( -\frac{D_{ij}}{\lambda} \right) \otimes_q \exp_q\left( -\frac{\beta_j}{\lambda} \right), \tag{21} \]
and a sufficiently large input distance $D_{ij}$ drives $\Pi_{ij}$ to zero, although $\exp_q(-D_{ij}/\lambda) = 0$ does not immediately imply $\Pi_{ij} = 0$ because the q-product $\otimes_q$ lacks an absorbing element. By solving Equation (20), we obtain
\[ \nabla \Omega(\pi) = \lambda \log_q(\pi), \qquad \Omega(\pi) = \frac{\lambda}{2 - q} \left( \pi \log_q(\pi) - \pi \right). \tag{22} \]
For completeness, its derivation is shown in Appendix A. Hence, we define the deformed q-entropy as follows:
Definition  4
(Deformed q-entropy). For $\pi \in \triangle^{n-1}$, the deformed q-entropy is defined as
\[ H_q(\pi) = -\frac{1}{2 - q} \sum_{i=1}^n \left( \pi_i \log_q(\pi_i) - \pi_i \right). \tag{23} \]
The deformed q-entropic regularizer for an element $\pi_i$ is $\Omega(\pi_i) = \frac{\lambda}{2 - q} (\pi_i \log_q(\pi_i) - \pi_i)$.
The deformed q-entropy recovers the Shannon entropy in the limit $q \to 1$: $H_1(\pi) = -\sum_i (\pi_i \log(\pi_i) - \pi_i)$. In addition, the limit $q \to 0$ recovers the negative of the squared 2-norm: $H_0(\pi) = -\frac{1}{2} \sum_i (\pi_i^2 - 2\pi_i) = -\frac{1}{2} \|\pi\|_2^2 + 1$. Therefore, the deformed q-entropy is an interpolation between the Shannon entropy and squared 2-norm. Hereafter, we consider the regularized OT with the deformed q-entropy
\[ T_{\lambda H_q}(\mu, \nu) = \inf_{\Pi \in U(\mu, \nu)} \langle D, \Pi \rangle - \lambda H_q(\Pi), \tag{24} \]
by solving its dual counterpart. The deformed q-entropy is different from the Tsallis entropy $T_q$ (see Equation (19)) in that the Tsallis entropy and deformed q-entropy are defined by the q-expectation $\langle \pi^q, \cdot \rangle$ [34] and the usual expectation $\langle \pi, \cdot \rangle$, respectively, while both are defined by the q-logarithm.
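The two limiting cases of the deformed q-entropy can be verified numerically on a small probability vector. The sketch below assumes the convention $0 \cdot \log_q(0) = 0$ and uses a hypothetical function name `deformed_q_entropy`; the test vector `p` is arbitrary.

```python
import math

def deformed_q_entropy(p, q):
    """H_q(p) = -(1 / (2 - q)) * sum_i (p_i * log_q(p_i) - p_i), with 0 * log_q(0) := 0."""
    total = 0.0
    for pi in p:
        if pi > 0.0:
            if q == 1.0:
                log_q_pi = math.log(pi)
            else:
                log_q_pi = (pi ** (1.0 - q) - 1.0) / (1.0 - q)
            total += pi * log_q_pi - pi
    return -total / (2.0 - q)

p = [0.2, 0.3, 0.5]  # arbitrary point on the probability simplex

h0 = deformed_q_entropy(p, 0.0)
h1 = deformed_q_entropy(p, 1.0)
h_near_1 = deformed_q_entropy(p, 0.999)

# Closed forms stated in the text for q = 0 and q = 1.
squared_norm_form = -0.5 * sum(x * x for x in p) + 1.0
shannon_form = -sum(x * math.log(x) - x for x in p)
```

Values of `q` strictly between 0 and 1 give entropies between these two extremes in form, which is the interpolation the text describes.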
Remark 1.
The primary reason we picked the deformed q-entropy $H_q$ to design the regularizer is its natural connection to the q-Gibbs kernel through the dual map, $\nabla(-\lambda H_q)^\star(\eta) = \exp_q(\eta/\lambda)$. When the Tsallis entropy $T_q$ is used, the dual map is
\[ \nabla(-\lambda T_q)^\star(\eta) = \frac{q^{1/(1-q)}}{\exp_q(-\eta/\lambda)}, \tag{25} \]
which is not naturally connected to the q-Gibbs kernel. Muzellec et al. [35] proposed regularized OT with the Tsallis entropy, but they did not discuss its sparsity. As we show in Appendix D.1, the Tsallis entropy does not empirically induce sparsity.
In Figure 2, the deformed q-entropy with different deformation parameters is plotted for the one-dimensional simplex $\triangle^1$. One can easily confirm that $H_q(\pi)$ is concave for $\pi \in \mathbb{R}_{\ge 0}^n$, as illustrated in the figure.

4. Optimization and Convergence Analysis

4.1. Optimization Algorithm

We occasionally write $\Omega = -\lambda H_q$ to simplify the notation in this section. By simple algebra, we confirm
\[ \Omega^\star(\eta) = \frac{\lambda}{2 - q} \exp_q\left( \frac{\eta}{\lambda} \right)^{2 - q}, \tag{26} \]
which is convex because of the concavity of $H_q$. To solve Equation (24), we solve the dual
\[ T_{\lambda H_q}(\mu, \nu) = \sup_{\alpha \in \mathbb{R}^n, \beta \in \mathbb{R}^m} \underbrace{-\langle a, \alpha \rangle - \langle b, \beta \rangle - \frac{\lambda}{2 - q} \sum_{i,j} \exp_q\left( -\frac{D_{ij} + \alpha_i + \beta_j}{\lambda} \right)^{2 - q}}_{=: -F(z)}, \tag{27} \]
where $z := (\alpha, \beta)$ denotes the dual variables. As Equation (27) is an unconstrained optimization problem, many well-known optimization solvers can be used to solve it; here, we use the BFGS method [24]. For the sake of convergence analysis (Section 4.2), we optimize the convex $\ell_2$-regularized dual objective
\[ \text{minimize} \quad \tilde{F}(z) := \langle a, \alpha \rangle + \langle b, \beta \rangle + \sum_{i,j} \Omega^\star(-D_{ij} - \alpha_i - \beta_j) + \frac{\kappa}{2} \|z\|_2^2, \tag{28} \]
where $\kappa > 0$ represents the $\ell_2$-regularization parameter. In practice, $\ell_2$ regularization hardly affects the performance when $\kappa$ is sufficiently small. By introducing (small) $\ell_2$ regularization, which makes the objective strongly convex, we can characterize the convergence rate; a convergence guarantee without a rate is still possible without $\ell_2$ regularization [36].
We briefly summarize the algorithm in Algorithm 1, where $d^{(k)}$, $\rho^{(k)}$, and $g^{(k)} := \nabla \tilde{F}(z^{(k)})$ represent the $k$th update direction, $k$th step size, and gradient at the current variable $z^{(k)}$, respectively. The quantities
\[ s^{(k)} := z^{(k+1)} - z^{(k)} \quad \text{and} \quad \zeta^{(k)} := g^{(k+1)} - g^{(k)} \tag{29} \]
are the differences of the dual variables and gradients between the next and current steps, respectively. Furthermore, let $(\gamma, \gamma')$ be the tolerance parameters for the Wolfe conditions, i.e., update directions and step sizes satisfy the conditions
\[ \tilde{F}(z^{(k)} + \rho^{(k)} d^{(k)}) \le \tilde{F}(z^{(k)}) + \gamma \rho^{(k)} g^{(k)\top} d^{(k)}, \quad \text{(Armijo condition)} \tag{30} \]
\[ g^{(k+1)\top} d^{(k)} \ge \gamma' g^{(k)\top} d^{(k)}. \quad \text{(curvature condition)} \tag{31} \]
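The two Wolfe conditions can be checked for a candidate step as sketched below. The function `satisfies_wolfe` and the default tolerances `gamma=1e-4` and `gamma_prime=0.9` are assumptions made for illustration (they satisfy $0 < \gamma < \gamma' < 1$ and $\gamma < 1/2$); the quadratic test objective is a toy stand-in for $\tilde{F}$.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def satisfies_wolfe(f, grad, z, d, rho, gamma=1e-4, gamma_prime=0.9):
    """Return (armijo_ok, curvature_ok) for the step z + rho * d."""
    g = grad(z)
    z_next = [zi + rho * di for zi, di in zip(z, d)]
    g_next = grad(z_next)
    slope = dot(g, d)  # directional derivative at z; negative for a descent direction
    armijo_ok = f(z_next) <= f(z) + gamma * rho * slope
    curvature_ok = dot(g_next, d) >= gamma_prime * slope
    return armijo_ok, curvature_ok

# Toy objective f(z) = 0.5 * ||z||^2 with gradient z (stand-in for the dual objective).
f = lambda z: 0.5 * dot(z, z)
grad = lambda z: list(z)

z = [1.0, 1.0]
d = [-1.0, -1.0]  # steepest-descent direction at z

good_step = satisfies_wolfe(f, grad, z, d, rho=1.0)    # both conditions hold
tiny_step = satisfies_wolfe(f, grad, z, d, rho=1e-3)   # too short: curvature fails
long_step = satisfies_wolfe(f, grad, z, d, rho=2.5)    # too long: Armijo fails
```

The Armijo condition rules out steps that are too long, and the curvature condition rules out steps that are too short.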
[Algorithm 1 appears as an image in the original article and is not reproduced here.]
After obtaining the dual solution $(\hat{\alpha}, \hat{\beta})$, the primal solution can be recovered from Equation (15).
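The $\ell_2$-regularized dual objective, its gradient, and the recovery of the primal plan through the dual map can be sketched as follows. This is not Algorithm 1: to stay dependency-free, the sketch swaps in plain gradient descent with Armijo backtracking in place of BFGS, and all function names, the toy data, and the step parameters are assumptions made for this example. In practice the objective and gradient would be handed to a quasi-Newton solver.

```python
import math

def exp_q(x, q):
    """q-exponential; reduces to exp(x) at q = 1."""
    if q == 1.0:
        return math.exp(x)
    return max(1.0 + (1.0 - q) * x, 0.0) ** (1.0 / (1.0 - q))

def recover_plan(z, D, lam, q):
    """Primal plan from z = alpha + beta: Pi_ij = exp_q(-(D_ij + alpha_i + beta_j) / lam)."""
    n, m = len(D), len(D[0])
    alpha, beta = z[:n], z[n:]
    return [[exp_q(-(D[i][j] + alpha[i] + beta[j]) / lam, q) for j in range(m)]
            for i in range(n)]

def dual_objective(z, D, a, b, lam, q, kappa):
    """<a, alpha> + <b, beta> + sum_ij conj(-D_ij - alpha_i - beta_j) + (kappa / 2) ||z||^2."""
    n = len(a)
    Pi = recover_plan(z, D, lam, q)
    # Conjugate value lam / (2 - q) * exp_q(eta / lam)^(2 - q), reusing exp_q(eta / lam) = Pi_ij.
    conj = sum(lam / (2.0 - q) * p ** (2.0 - q) for row in Pi for p in row)
    linear = sum(x * y for x, y in zip(a, z[:n])) + sum(x * y for x, y in zip(b, z[n:]))
    return linear + conj + 0.5 * kappa * sum(x * x for x in z)

def dual_gradient(z, D, a, b, lam, q, kappa):
    """Gradient: marginal residuals of the recovered plan plus the l2 term."""
    n, m = len(a), len(b)
    Pi = recover_plan(z, D, lam, q)
    g_alpha = [a[i] - sum(Pi[i]) + kappa * z[i] for i in range(n)]
    g_beta = [b[j] - sum(Pi[i][j] for i in range(n)) + kappa * z[n + j] for j in range(m)]
    return g_alpha + g_beta

def descend(z, D, a, b, lam, q, kappa, steps=200, gamma=1e-4):
    """Gradient descent with Armijo backtracking (a stand-in for BFGS in this sketch)."""
    for _ in range(steps):
        g = dual_gradient(z, D, a, b, lam, q, kappa)
        f0 = dual_objective(z, D, a, b, lam, q, kappa)
        sq_norm = sum(x * x for x in g)
        rho = 1.0
        for _ in range(60):
            cand = [zi - rho * gi for zi, gi in zip(z, g)]
            if dual_objective(cand, D, a, b, lam, q, kappa) <= f0 - gamma * rho * sq_norm:
                z = cand
                break
            rho *= 0.5
        else:
            break  # no acceptable step found; stop early
    return z

# Toy problem (hypothetical data).
D = [[0.0, 1.0, 4.0], [1.0, 0.0, 1.0], [4.0, 1.0, 0.0]]
a = [0.5, 0.3, 0.2]
b = [0.2, 0.3, 0.5]
lam, q, kappa = 1.0, 0.5, 1e-6

z0 = [0.0] * 6
z_hat = descend(z0, D, a, b, lam, q, kappa)
Pi_hat = recover_plan(z_hat, D, lam, q)
```

Because the gradient is the marginal residual of the recovered plan (up to the small `kappa` term), driving it to zero enforces the coupling constraints.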

4.2. Convergence Analysis

We provide a convergence guarantee for Algorithm 1. A technical assumption is stated beforehand.
Assumption 1.
Let $z^*$ be the global optimum of $\tilde{F}$. For $\tau \in (0, 1)$, we define the set $Z_\tau \subseteq \mathrm{ri}\, \mathrm{dom}(\tilde{F})$ as
\[ Z_\tau := \left\{ z \;\middle|\; \nabla \Omega^\star(-D_{ij} - \alpha_i - \beta_j) \le \tau \text{ for all } i, j \right\}. \tag{32} \]
Assume that $z^{(K)}$ obtained by Algorithm 1 and $z^*$ are contained in $Z_\tau$.
The dual map $\nabla \Omega^\star$ translates dual variables into primal variables, as in Equation (15). It is easy to confirm that $Z_\tau$ is a closed convex set owing to the convexity of $\nabla \Omega^\star$. Assumption 1 essentially assumes that all elements of the primal matrix (of $z^{(K)}$ and $z^*$) are strictly less than 1; this always holds for $z^*$ (unless $n = m = 1$) because of the strong duality. Moreover, this assumption is natural for $z^{(K)}$ values sufficiently close to the optimum $z^*$. The bound parameter $\tau$ is a key element for characterizing the convergence speed.
Theorem 1.
Let $N := \max\{n, m\}$. Under Assumption 1, Algorithm 1 with the parameter choice $\kappa = 2 N \tau^q \lambda^{-1}$ returns a point $z^{(K)}$ satisfying
\[ \|g^{(K)}\|_2 < \sqrt{\frac{16 (\tilde{F}(z^{(0)}) - \tilde{F}^*) N \tau^q}{\lambda}} \, r^K, \tag{33} \]
where $\tilde{F}^* := \inf_z \tilde{F}(z)$ represents the optimal value of the $\ell_2$-regularized dual objective and $0 < r < 1$ is an absolute constant independent of $(\lambda, \tau, q, N)$.
The proof is shown in Section 4.3. We conclude that a larger deformation parameter $q$ yields better convergence because the coefficient in Equation (33) is $O(\tau^{q/2})$ with the base $\tau < 1$. Therefore, the deformation parameter introduces a new trade-off: $q \to 0$ yields a sparser solution but slows down the convergence, whereas $q \to 1$ improves the convergence while sacrificing sparsity. By modulating the deformation parameter $q$, one may obtain the solution faster than with the squared 2-norm regularizer used in Blondel et al. [20], which corresponds to the case $q = 0$.
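The dependence of the bound on $q$ can be made concrete by evaluating the leading coefficient of Equation (33) and the parameter choice $\kappa = 2 N \tau^q \lambda^{-1}$ for several deformation parameters. The function names and the numerical inputs below (`gap`, `N`, `tau`, `lam`) are hypothetical values chosen only for illustration.

```python
import math

def bound_coefficient(gap, N, tau, q, lam):
    """Leading coefficient sqrt(16 * gap * N * tau^q / lam) of the gradient-norm bound."""
    return math.sqrt(16.0 * gap * N * tau ** q / lam)

def kappa_choice(N, tau, q, lam):
    """Regularization parameter kappa = 2 * N * tau^q / lam used in the theorem."""
    return 2.0 * N * tau ** q / lam

# Hypothetical inputs: initial optimality gap, problem size, bound parameter, temperature.
gap, N, tau, lam = 1.0, 30, 0.1, 0.1
qs = [0.0, 0.25, 0.5, 0.75, 1.0]

coefficients = [bound_coefficient(gap, N, tau, q, lam) for q in qs]
kappas = [kappa_choice(N, tau, q, lam) for q in qs]
```

Since `tau < 1`, both lists decrease as `q` grows, which is the sense in which a larger `q` tightens the bound.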
In regularized OT, it is a common approach to use weaker regularization (i.e., a smaller $\lambda$) to obtain a solution that is sparser and closer to the unregularized solution; however, a smaller $\lambda$ results in numerical instability and slow computation [37]. This can be observed from Equation (33) because a smaller $\lambda$ makes the upper bound considerably larger.
Next, we compare the computational complexity of q-DOT with the BFGS method and the Sinkhorn algorithm. Altschuler et al. [25] showed that the Sinkhorn algorithm satisfies coupling constraints within the $\ell_1$ error $\varepsilon$ in $O(N^2 (\log N) \varepsilon^{-3})$ time, which is a sublinear convergence rate. In contrast, our convergence rate in Equation (33) is translated into the iteration complexity $K = O(\log(N \varepsilon^{-1}))$, where $\|g^{(K)}\|_2 \le \varepsilon$. The gradient of $\tilde{F}$ is
\[ \nabla \tilde{F}(z) = \begin{pmatrix} \left( a_i - \sum_{j=1}^m \nabla \Omega^\star(-D_{ij} - \alpha_i - \beta_j) + \kappa \alpha_i \right)_{i \in [n]} \\ \left( b_j - \sum_{i=1}^n \nabla \Omega^\star(-D_{ij} - \alpha_i - \beta_j) + \kappa \beta_j \right)_{j \in [m]} \end{pmatrix}, \tag{34} \]
and $\nabla \Omega^\star(\cdot)$ represents the mapping from the dual variables $(\alpha_i, \beta_j)$ to the primal transport matrix $\Pi_{ij}$ in Equation (15). Therefore, the gradient norm of $F$ and the coupling constraint error are comparable when the $\ell_2$-regularization parameter $\kappa$ is sufficiently small. The overall computational complexity is $O(N^2 \log(N \varepsilon^{-1}))$ because one step of Algorithm 1 runs in $O(N^2)$ time; this is a linear convergence rate. To confirm that one step of Algorithm 1 runs in $O(N^2)$ time, we note that the update direction can be computed in $O(N^2)$ time by using the Sherman–Morrison formula to invert $B^{(k)}$. In addition, the Hessian estimate can be updated in $O(N^2)$ time because the update of $B^{(k)}$ is a rank-1 update and the computation of its inverse only requires matrix-vector products of size $N$. Hence, Algorithm 1 exhibits better convergence in terms of the stopping criterion $\varepsilon$. The comparison is summarized in Table 2.

4.3. Proofs

To prove Theorem 1, we leverage several lemmas shown below. Lemma 2 is based on Powell [24] and Byrd et al. [36]. The missing proofs are provided in Appendix C.
Lemma 1.
For the initial point $z^{(0)}$ and sequence $z^{(1)}, z^{(2)}, \dots, z^{(K)}$ obtained by Algorithm 1, we define the following set and its bound:
\[ Z := \mathrm{conv}\left\{ z^{(0)}, z^{(1)}, z^{(2)}, \dots, z^{(K)} \right\}, \qquad R := \sup_{z \in Z} \max_{i,j} \nabla \Omega^\star(-D_{ij} - \alpha_i - \beta_j), \tag{35} \]
where $\mathrm{conv}(S)$ represents the convex hull of the set $S$. Then, $\tilde{F}: \mathbb{R}^{n+m} \to \mathbb{R}$ is $M_1$-strongly convex and $M_2$-smooth over $Z$, where $M_1 = \kappa$ and $M_2 \le \kappa + 2 N R^q \lambda^{-1}$. Moreover, $\tilde{F}$ is $M_2'$-smooth over $Z_\tau$ (defined in Equation (32)), where $M_2' \le \kappa + 2 N \tau^q \lambda^{-1}$.
Lemma 2.
Let $z^{(1)}, z^{(2)}, \dots, z^{(K)}$ be a sequence generated by Algorithm 1 given an initial point $z^{(0)}$. In addition, let $c_1$, $c_2$, $c_3$, $c_4$, and $c_5$ be the constants
\[ c_1 := \frac{1 - \gamma'}{M_2}, \quad c_2 := \frac{n + m}{K} + M_2, \quad c_3 := \left( \frac{K}{n + m} \right)^{(n+m)/K} c_2^{\frac{n + m + K}{K}}, \quad c_4 := \frac{c_3}{1 - \gamma'}, \quad c_5 := \frac{2 (1 - \gamma)}{M_1}. \tag{36} \]
Then,
\[ \tilde{F}(z^{(K)}) - \tilde{F}^* \le \left[ 1 - \frac{\gamma c_1 M_1}{2} c_4^{-2} c_5^{-2} \right]^{K/2} (\tilde{F}(z^{(0)}) - \tilde{F}^*). \tag{37} \]
Lemma 3.
Let $c_1$, $c_2$, $c_3$, $c_4$, and $c_5$ be the same constants defined in Lemma 2. Then,
\[ \gamma c_1 M_1 c_4^{-2} c_5^{-2} > \frac{(1 - \gamma')^3 \gamma e^{-2(n+m)/e}}{4 (1 - \gamma)^2} \left( \frac{M_1}{M_2} \right)^3. \tag{38} \]
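Proof of Theorem 1.
Because $\tilde{F}$ is differentiable and strongly convex, there exists an optimum $z^*$ such that $g^* := \nabla \tilde{F}(z^*) = 0$; this implies $\|g^{(K)}\|_2 = \|g^{(K)} - g^*\|_2$.
By using Assumption 1 and Lemma 1, we obtain $\|g^{(K)} - g^*\|_2 = \|\nabla \tilde{F}(z^{(K)}) - \nabla \tilde{F}(z^*)\|_2 \le M_2' \|z^{(K)} - z^*\|_2$. In addition, $\|z^{(K)} - z^*\|_2^2 \le \frac{2}{M_1} (\tilde{F}(z^{(K)}) - \tilde{F}^*)$ as $\tilde{F}$ is $M_1$-strongly convex over $Z$ and the stationary condition $\nabla \tilde{F}(z^*) = 0$ holds. We obtain the convergence bound by using Lemmas 2 and 3 as
\[ \begin{aligned} \|g^{(K)}\|_2 = \|g^{(K)} - g^*\|_2 &\le M_2' \|z^{(K)} - z^*\|_2 \le M_2' \sqrt{\frac{2 (\tilde{F}(z^{(K)}) - \tilde{F}^*)}{M_1}} \\ &\le M_2' \sqrt{\frac{2 (\tilde{F}(z^{(0)}) - \tilde{F}^*)}{M_1} \left[ 1 - \frac{\gamma c_1 M_1}{2} c_4^{-2} c_5^{-2} \right]^{K/2}} \\ &< M_2' \sqrt{\frac{2 (\tilde{F}(z^{(0)}) - \tilde{F}^*)}{M_1} \left[ 1 - \frac{(1 - \gamma')^3 \gamma e^{-2(n+m)/e}}{8 (1 - \gamma)^2} \left( \frac{M_1}{M_2} \right)^3 \right]^{K/2}} \\ &\le \left( \kappa + \frac{2 N \tau^q}{\lambda} \right) \sqrt{\frac{2 (\tilde{F}(z^{(0)}) - \tilde{F}^*)}{\kappa} \left[ 1 - \frac{C}{(1 + 2 N R^q \lambda^{-1} \kappa^{-1})^3} \right]^{K/2}}, \end{aligned} \tag{39} \]
where we define $C := \frac{(1 - \gamma')^3 \gamma e^{-2(n+m)/e}}{8 (1 - \gamma)^2}$ and Lemma 1 is used at the last inequality to replace $M_1$, $M_2$, and $M_2'$. We can immediately confirm $C \le \frac{1}{16}$ from $0 < \gamma < \gamma' < 1$, $\gamma < \frac{1}{2}$, and $e^{-2(n+m)/e} < 1$. Finally, by choosing $\kappa = 2 N \tau^q \lambda^{-1}$,
\[ \|g^{(K)}\|_2 \le \sqrt{\frac{16 (\tilde{F}(z^{(0)}) - \tilde{F}^*) N \tau^q}{\lambda} \left[ 1 - \frac{C}{(1 + (R/\tau)^q)^3} \right]^{K/2}} \le \sqrt{\frac{16 (\tilde{F}(z^{(0)}) - \tilde{F}^*) N \tau^q}{\lambda}} \, r^K, \tag{40} \]
where we use $(R/\tau)^q \le 1$ (owing to $R \le \tau$ by definition) and let $r := (1 - C/8)^{1/4}$, so that $\sqrt[4]{127/128} \le r < 1$. □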
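The arithmetic behind the last step of the proof can be spelled out as follows. This is a worked restatement under the parameter choice $\kappa = 2 N \tau^q \lambda^{-1}$, where the shorthand $\Delta := \tilde{F}(z^{(0)}) - \tilde{F}^*$ is introduced here only for brevity.

```latex
% Prefactor: with \kappa = 2 N \tau^q / \lambda we have \kappa + 2 N \tau^q / \lambda = 2\kappa, so
\left( \kappa + \frac{2 N \tau^q}{\lambda} \right) \sqrt{\frac{2 \Delta}{\kappa}}
  = 2 \kappa \sqrt{\frac{2 \Delta}{\kappa}}
  = \sqrt{8 \kappa \Delta}
  = \sqrt{\frac{16 \Delta N \tau^q}{\lambda}}.

% Contraction factor: the same choice of \kappa gives
1 + 2 N R^q \lambda^{-1} \kappa^{-1} = 1 + \frac{R^q}{\tau^q} = 1 + (R/\tau)^q \le 2,
\qquad\text{hence}\qquad
\left[ 1 - \frac{C}{(1 + (R/\tau)^q)^3} \right]^{K/4}
  \le \left( 1 - \frac{C}{8} \right)^{K/4} = r^K.

% Range of r: C \le 1/16 implies 1 - C/8 \ge 127/128, so \sqrt[4]{127/128} \le r < 1.
```

The exponent $K/4$ arises because the factor with exponent $K/2$ sits under the square root.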
Remark 2.
More precisely, Altschuler et al. [25] showed that the Sinkhorn algorithm converges in $O(N^2 L^3 (\log N) \varepsilon^{-3})$ time, where $L := \|D\|_\infty$. For q-DOT, its computational complexity is not directly comparable to that of the Sinkhorn algorithm in $L$; instead, the following analysis provides a qualitative comparison. First, the convergence rate of q-DOT in Equation (33) is translated into the iteration complexity $K = O(\log(N \varepsilon^{-1}) / \log(1/r))$. The rate $r$ is introduced in the proof of Theorem 1 (see Equation (40)): $r \le \left[ 1 - \frac{C}{(1 + (R/\tau)^q)^3} \right]^{1/4}$. Then, by the Taylor expansion, we have a rough estimate $K \approx O(N^2 R^{3q} \log(N \varepsilon^{-1}))$, where $R$ is a bound on the possible primal variables defined in Equation (35). We cannot directly compare $R^q$ and $L$; nevertheless, $R^q$ and $L$ can be considered to be of the same magnitude given a reasonably sized domain $Z$, noting that $\nabla \Omega(\pi) = O(\pi^{1-q})$. Hence, it is reasonable to suppose that both the Sinkhorn algorithm and q-DOT roughly converge in cubic time with respect to $L$.

5. Numerical Experiments

5.1. Sparsity

All the simulations described in this section were executed on a 2.7 GHz quad-core Intel® Core i7 processor. We used the following synthetic dataset: $(x_i)_{i=1}^n \sim \mathcal{N}(\mathbf{1}_2, I_2)$, $(y_j)_{j=1}^m \sim \mathcal{N}(-\mathbf{1}_2, I_2)$, and $n = m = 30$, where $\mathcal{N}(\mu, \Sigma)$ represents the Gaussian distribution with mean $\mu$ and covariance $\Sigma$. For each of the unregularized OT, q-DOT, and the Sinkhorn algorithm, we computed the transport matrices. For q-DOT and the Sinkhorn algorithm, different regularization parameters $\lambda$ were compared: $\lambda \in \{1 \times 10^{-2}, 1 \times 10^{-1}, 1\}$; and $\varepsilon = 1 \times 10^{-6}$ was used as the stopping criterion: q-DOT stopped after the gradient norm was less than $\varepsilon$, and the Sinkhorn algorithm stopped after the $\ell_1$ error of the coupling constraints was less than $\varepsilon$. We compared different deformation parameters $q \in \{0, 0.25, 0.5, 0.75\}$ and fixed the dual $\ell_2$-regularization parameter $\kappa = 1 \times 10^{-6}$ for q-DOT. The q-DOT with $q = 0$ corresponded to a regularized OT with the squared 2-norm proposed by Blondel et al. [20]. For the unregularized OT, we used the implementation of the Python optimal transport package [38]. For q-DOT, we used the L-BFGS-B method (instead of the vanilla BFGS) provided by the SciPy package [39]. To determine zero entries in the transport matrix, we did not impose any positive threshold to disregard small values (as in Swanson et al. [6]) but regarded entries smaller than machine epsilon as zero.
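The synthetic data and the zero-entry criterion described above can be sketched as follows. Several details are assumptions made for this example rather than facts from the experiments: the Euclidean ground distance in `cost_matrix`, the means $+1$ and $-1$ of the two point clouds, the random seed, and the definition of `sparsity` as the fraction of entries below machine epsilon.

```python
import math
import random
import sys

def sample_gaussian_points(n, mean, rng):
    """Draw n two-dimensional points with independent unit-variance coordinates around `mean`."""
    return [(rng.gauss(mean, 1.0), rng.gauss(mean, 1.0)) for _ in range(n)]

def cost_matrix(xs, ys):
    """Pairwise Euclidean distances D_ij = d(x_i, y_j) (assumed ground metric)."""
    return [[math.dist(x, y) for y in ys] for x in xs]

def sparsity(Pi, threshold=sys.float_info.epsilon):
    """Fraction of entries regarded as zero, i.e., smaller than machine epsilon."""
    entries = [p for row in Pi for p in row]
    return sum(1 for p in entries if p < threshold) / len(entries)

rng = random.Random(0)  # fixed seed for reproducibility of the sketch
xs = sample_gaussian_points(30, 1.0, rng)
ys = sample_gaussian_points(30, -1.0, rng)
D = cost_matrix(xs, ys)

# Hand-made matrices to illustrate the criterion.
half_zero = sparsity([[0.5, 0.0], [1e-20, 0.5]])
fully_dense = sparsity([[0.25, 0.25], [0.25, 0.25]])
```

Under this criterion a strictly positive plan, however small its entries, is counted as dense unless entries fall below machine epsilon.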
The simulation results are shown in Table 3 and Figure 3. First, we qualitatively compared the methods using Figure 3: q-DOT obtained transport matrices very similar to the unregularized OT solution, and the solution became slightly more blurred as $q$ and $\lambda$ increased. In contrast, the Sinkhorn algorithm output considerably more uncertain transport matrices. Furthermore, the Sinkhorn algorithm was numerically unstable with a very small regularization parameter such as $\lambda = 0.01$.
From Table 3, we further observed the behavior quantitatively. The transport matrices obtained by q-DOT were very sparse in most cases, and the sparsity was close to that of the unregularized OT. Furthermore, smaller $q$ and $\lambda$ tended to yield sparser solutions. Notably, the Sinkhorn algorithm obtained completely dense matrices (sparsity = 0). Although the transport matrices of q-DOT with $(q, \lambda) = (0.5, 1)$ and $(0.75, 1)$ appear somewhat similar to the Sinkhorn solutions in Figure 3, the former are much sparser. This suggests that a deformation parameter $q$ slightly smaller than 1 is sufficient for q-DOT to output a sparse transport matrix.
For the obtained cost values $\langle D, \hat{\Pi} \rangle$, we did not see a clear advantage of using a specific $q$ and $\lambda$ from the results of q-DOT. Nevertheless, it is evident that q-DOT estimated the Wasserstein cost more accurately than the Sinkhorn algorithm regardless of the $q$ and $\lambda$ used in this simulation.
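The setup of this subsection can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used for the reported results. It rests on several assumptions that the text does not confirm: the dual objective is taken as $\langle a, \alpha\rangle + \langle b, \beta\rangle + \sum_{ij} \Omega^\star(-D_{ij} - \alpha_i - \beta_j) + \frac{\kappa}{2}\|z\|_2^2$ with the closed form $\Omega^\star(\eta) = \frac{\lambda}{2-q} \exp_q(\eta/\lambda)^{2-q}$ (obtained here by integrating the dual map $\exp_q(\eta/\lambda)$), the cost $D$ is the Euclidean distance, and the marginals are uniform. The helper names `exp_q`, `qdot_dual`, and `solve_qdot` are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize


def exp_q(x, q):
    """q-exponential [1 + (1 - q) x]_+^(1 / (1 - q)), for 0 <= q < 1."""
    return np.maximum(1.0 + (1.0 - q) * x, 0.0) ** (1.0 / (1.0 - q))


def qdot_dual(z, a, b, D, q, lam, kappa):
    """Value, gradient, and induced plan of the l2-regularized dual at z = (alpha, beta)."""
    n = a.size
    alpha, beta = z[:n], z[n:]
    # Plan induced by the dual variables (assumed sign convention).
    P = exp_q(-(D + alpha[:, None] + beta[None, :]) / lam, q)
    # Assumed conjugate: Omega*(eta) = lam / (2 - q) * exp_q(eta / lam) ** (2 - q).
    value = a @ alpha + b @ beta + lam / (2.0 - q) * np.sum(P ** (2.0 - q))
    value += 0.5 * kappa * (z @ z)
    # Gradient: marginal violations plus the l2-regularization term.
    grad = np.concatenate([a - P.sum(axis=1), b - P.sum(axis=0)]) + kappa * z
    return value, grad, P


def solve_qdot(a, b, D, q=0.5, lam=1.0, kappa=1e-6, eps=1e-6):
    """Minimize the dual with SciPy's L-BFGS-B and return the transport matrix."""
    n, m = D.shape
    res = minimize(
        lambda z: qdot_dual(z, a, b, D, q, lam, kappa)[:2],
        np.zeros(n + m),
        jac=True,
        method="L-BFGS-B",
        options={"gtol": eps, "ftol": 1e-15, "maxiter": 10000},
    )
    return qdot_dual(res.x, a, b, D, q, lam, kappa)[2], res


# Small synthetic instance in the spirit of this subsection (sizes are illustrative).
rng = np.random.default_rng(0)
n = m = 10
x = rng.normal(loc=1.0, size=(n, 2))
y = rng.normal(loc=-1.0, size=(m, 2))
D = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)  # assumed cost
a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)            # assumed uniform marginals

P, res = solve_qdot(a, b, D, q=0.5, lam=1.0)
# Entries below machine epsilon are counted as zeros, as described above.
sparsity = np.mean(P < np.finfo(float).eps)
print(f"sparsity = {sparsity:.3f}, cost = {np.sum(D * P):.4f}")
```

Because $\exp_q$ is clipped at zero, entries of the plan vanish exactly whenever $-(D_{ij} + \alpha_i + \beta_j)/\lambda \le -1/(1-q)$, which is why no positive threshold is needed when counting zeros.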

5.2. Runtime Comparison

We compared the runtimes of q-DOT and the Sinkhorn algorithm using the same dataset as in Section 5.1, but with different dataset sizes: we chose $n = m \in \{100, 300, 500, 1000\}$. The parameter choices were the same as in Section 5.1, except that the regularization parameter was fixed to $\lambda = 0.1$. The result is shown in Figure 4; a larger deformation parameter $q$ makes q-DOT converge faster when $n = m = 100$. When $n = m \ge 300$, the difference between $q = 0$, $q = 0.25$, and $q = 0.5$ was not as evident. This may be partly because we fixed the parameter choice $\kappa = 1 \times 10^{-6}$ for all the experiments, unlike the oracle parameter choice $\kappa = 2 N \tau^q \lambda^{-1}$ (in Theorem 1), which depends on $q$. Nonetheless, $q = 0.75$ is clearly superior to the smaller values of $q$. These observations empirically support the trade-off between sparsity and computation speed induced by the deformation parameter $q$, which is theoretically established in Theorem 1.

5.3. Approximation of 1-Wasserstein Distance

Finally, we compared the approximation errors of the 1-Wasserstein distance, $|\langle D, \hat{\Pi} \rangle - \langle D, \Pi^\star \rangle|$, of q-DOT and the Sinkhorn algorithm with different $q$ and $\lambda$, where $\hat{\Pi}$ represents the computed transport matrix and $\Pi^\star \in \operatorname{arg\,min}_{\Pi \in U(\mu, \nu)} \langle D, \Pi \rangle$ represents the LP solution. We used the same dataset and stopping criterion $\varepsilon$ as described in Section 5.1. For the range of $q$, we used $q \in \{0.00, 0.25, 0.50, 0.75\}$. For the range of $\lambda$, we used $\lambda \in \{0.05, 0.1, 0.5\}$.
The result is shown in Figure 5. The difference was not significant when $q$ was small, such as $q \in \{0.00, 0.25\}$. Once $q$ became larger, such as $q \in \{0.50, 0.75\}$, the approximation error evidently worsened. The Sinkhorn algorithm always exhibited worse approximation errors than q-DOT with $q$ in the range used in this simulation, regardless of $\lambda$. Formal guarantees for the 1-Wasserstein approximation error (such as Altschuler et al. [25] and Weed [40]) will be considered in future work.
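The error measure of this subsection can be illustrated as follows. This is a hedged sketch and not the code behind Figure 5: the LP solution is computed with SciPy's `linprog` rather than the Python optimal transport package, the Sinkhorn iteration is a plain textbook version with the kernel $\exp(-D/\lambda)$ and a fixed iteration count, and the data sizes are illustrative. The helper names `lp_plan` and `sinkhorn_plan` are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog


def lp_plan(a, b, D):
    """Unregularized OT plan from the transportation LP (HiGHS backend)."""
    n, m = D.shape
    row_sums = np.kron(np.eye(n), np.ones((1, m)))  # sum_j Pi_ij = a_i
    col_sums = np.kron(np.ones((1, n)), np.eye(m))  # sum_i Pi_ij = b_j
    res = linprog(
        D.ravel(),
        A_eq=np.vstack([row_sums, col_sums]),
        b_eq=np.concatenate([a, b]),
        bounds=(0, None),
        method="highs",
    )
    return res.x.reshape(n, m)


def sinkhorn_plan(a, b, D, lam, n_iter=2000):
    """Plain Sinkhorn scaling with kernel exp(-D / lam) (assumed parameterization)."""
    K = np.exp(-D / lam)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]


rng = np.random.default_rng(0)
n = m = 8
x = rng.normal(loc=1.0, size=(n, 2))
y = rng.normal(loc=-1.0, size=(m, 2))
D = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)  # assumed cost
a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)

lp_cost = np.sum(D * lp_plan(a, b, D))
# Absolute error |<D, Pi_hat> - <D, Pi_star>| for a few regularization strengths.
errors = {
    lam: abs(np.sum(D * sinkhorn_plan(a, b, D, lam)) - lp_cost)
    for lam in (0.1, 0.5)
}
for lam, err in errors.items():
    print(f"lambda = {lam}: abs. error = {err:.4e}")
```

The same error can be computed for any other plan $\hat{\Pi}$, for instance one returned by a q-DOT solver, by replacing the call to `sinkhorn_plan`.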

Author Contributions

Conceptualization, H.B.; methodology, H.B.; validation, H.B. and S.S.; formal analysis, H.B. and S.S.; writing—original draft preparation, H.B.; writing—review and editing, H.B. and S.S.; funding acquisition, H.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Hakubi Project, Kyoto University, and JST ERATO Grant JPMJER1903. The APC was covered by the Hakubi Project.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
BFGS: Broyden–Fletcher–Goldfarb–Shanno
q-DOT: Deformed q-optimal transport
L-BFGS: Limited-memory BFGS
OT: Optimal transport

Appendix A. Derivation of Deformed q Entropy

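Given the functional relationship $\pi = \nabla\Omega^\star(\eta) = \exp_q(\eta / \lambda)$ in Equation (20), we derive the deformed q entropy.
First, the derivative of the regularizer, $\nabla\Omega$, is simply the inverse of the dual map $\nabla\Omega^\star$ by Danskin's theorem [27]; hence, $\nabla\Omega(\pi) = \lambda \log_q(\pi)$. The (negative of the) deformed q entropy is recovered by integrating $\nabla\Omega$:
$$\Omega(\pi) = \lambda \int_0^\pi \log_q(p)\, dp = \lambda \int_0^\pi \frac{p^{1-q} - 1}{1-q}\, dp = \frac{\lambda}{1-q} \left[ \frac{\pi^{2-q}}{2-q} - \pi \right] = \frac{\lambda}{2-q} \left[ \pi\, \frac{\pi^{1-q} - 1}{1-q} - \pi \right] = \frac{\lambda}{2-q} \left[ \pi \log_q(\pi) - \pi \right]. \tag{A1}$$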
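The derivation above can be checked numerically. The following sketch assumes the standard definitions $\log_q(x) = (x^{1-q} - 1)/(1-q)$ and $\exp_q(x) = [1 + (1-q)x]_+^{1/(1-q)}$. It compares a central finite difference of $\Omega$ with $\lambda \log_q(\pi)$ and confirms that $\exp_q$ inverts $\log_q$. The function names and the values of `q`, `lam`, and `h` are illustrative choices.

```python
import numpy as np


def log_q(x, q):
    """q-logarithm (x^(1 - q) - 1) / (1 - q), for 0 <= q < 1 and x > 0."""
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)


def exp_q(x, q):
    """q-exponential [1 + (1 - q) x]_+^(1 / (1 - q)), for 0 <= q < 1."""
    return np.maximum(1.0 + (1.0 - q) * x, 0.0) ** (1.0 / (1.0 - q))


def omega(pi, q, lam):
    """Negative deformed q entropy: lam / (2 - q) * (pi * log_q(pi) - pi)."""
    return lam / (2.0 - q) * (pi * log_q(pi, q) - pi)


q, lam, h = 0.5, 0.1, 1e-6
pi = np.linspace(0.05, 2.0, 40)

# Central finite difference of Omega versus the claimed derivative lam * log_q(pi).
numeric_grad = (omega(pi + h, q, lam) - omega(pi - h, q, lam)) / (2.0 * h)
analytic_grad = lam * log_q(pi, q)
max_gap = np.max(np.abs(numeric_grad - analytic_grad))

# The dual map exp_q(eta / lam) should invert the derivative eta = lam * log_q(pi).
recovered = exp_q(analytic_grad / lam, q)
print(f"max gradient gap = {max_gap:.2e}")
```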

Appendix B. Additional Lemmas

Note again that we let $M_1, M_2 > 0$ be the strong convexity and smoothness constants of $\tilde{F}$ over $Z$, $N := \max\{n, m\}$, and $z^\star \in \operatorname{arg\,min}_{z \in Z} \tilde{F}(z)$.
Lemma A1.
For all $k$,
$$M_1 \|s^{(k)}\|_2^2 \le \zeta^{(k)\top} s^{(k)} \le M_2 \|s^{(k)}\|_2^2. \tag{A2}$$
In addition,
$$\frac{\|\zeta^{(k)}\|_2^2}{\zeta^{(k)\top} s^{(k)}} \le M_2. \tag{A3}$$
Proof. 
Let $\bar{G}^{(k)} := \int_0^1 \nabla^2 \tilde{F}(z^{(k)} + t s^{(k)})\, dt$. Then, by the chain rule and the fundamental theorem of calculus,
$$\bar{G}^{(k)} s^{(k)} = \int_0^1 \frac{\partial \nabla \tilde{F}(z^{(k)} + t s^{(k)})}{\partial t}\, dt = \nabla \tilde{F}(z^{(k)} + s^{(k)}) - \nabla \tilde{F}(z^{(k)}) = g^{(k+1)} - g^{(k)} = \zeta^{(k)}. \tag{A4}$$
Because $\tilde{F}$ is $M_1$-strongly convex and $M_2$-smooth (over $Z$), we have
$$M_1 \|w\|_2^2 \le w^\top [\nabla^2 \tilde{F}(z)]\, w \le M_2 \|w\|_2^2 \tag{A5}$$
for all $z \in Z$ and $w \in \mathbb{R}^{n+m}$. By choosing $z = z^{(k)} + t s^{(k)}$ and $w = s^{(k)}$, we have
$$M_1 \|s^{(k)}\|_2^2 \le \int_0^1 s^{(k)\top} [\nabla^2 \tilde{F}(z^{(k)} + t s^{(k)})]\, s^{(k)}\, dt = s^{(k)\top} \bar{G}^{(k)} s^{(k)} = \zeta^{(k)\top} s^{(k)} \le M_2 \|s^{(k)}\|_2^2. \tag{A6}$$
Note that $z^{(k)} + t s^{(k)} \in Z$ follows by the definition of $Z$ in Equation (35). Thus, the first statement is proven.
The second statement is proven as follows:
$$\frac{\|\zeta^{(k)}\|_2^2}{\zeta^{(k)\top} s^{(k)}} = \frac{s^{(k)\top} (\bar{G}^{(k)})^2 s^{(k)}}{s^{(k)\top} \bar{G}^{(k)} s^{(k)}} = \frac{(s^{(k)\top} (\bar{G}^{(k)})^{1/2})\, \bar{G}^{(k)}\, ((\bar{G}^{(k)})^{1/2} s^{(k)})}{\|(\bar{G}^{(k)})^{1/2} s^{(k)}\|_2^2} = \int_0^1 \frac{(\tilde{s}^{(k)})^\top [\nabla^2 \tilde{F}(z^{(k)} + t s^{(k)})]\, (\tilde{s}^{(k)})}{\|\tilde{s}^{(k)}\|_2^2}\, dt \le M_2, \tag{A7}$$
where $\tilde{s}^{(k)} := (\bar{G}^{(k)})^{1/2} s^{(k)}$. □
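Lemma A1 can be illustrated on a quadratic objective, for which the averaged Hessian $\bar{G}^{(k)}$ is constant. The sketch below is an assumption-laden toy case and not part of the proof: it takes $\tilde{F}(z) = \frac{1}{2} z^\top H z$ with a random symmetric positive definite $H$, so that $\zeta = H s$, $M_1 = \lambda_{\min}(H)$, and $M_2 = \lambda_{\max}(H)$.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 5
A = rng.normal(size=(dim, dim))
H = A @ A.T + np.eye(dim)      # constant Hessian of F(z) = z^T H z / 2
eigs = np.linalg.eigvalsh(H)   # eigenvalues in ascending order
M1, M2 = eigs[0], eigs[-1]     # strong convexity and smoothness constants

s = rng.normal(size=dim)       # step s = z_next - z
zeta = H @ s                   # gradient difference for a quadratic

curvature = zeta @ s                 # middle term of the first statement
ratio = (zeta @ zeta) / curvature    # left-hand side of the second statement
print(M1 * (s @ s), curvature, M2 * (s @ s), ratio)
```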
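Lemma A2.
For all $k$,
$$\frac{M_1}{2} \|z^{(k)} - z^\star\|_2 \le \|g^{(k)}\|_2. \tag{A8}$$
Proof. 
Because $\tilde{F}$ is $M_1$-strongly convex over $Z$,
$$\frac{M_1}{2} \|z^{(k)} - z^\star\|_2^2 \le \tilde{F}(z^{(k)}) - \tilde{F}(z^\star) \le \langle \nabla \tilde{F}(z^{(k)}), z^{(k)} - z^\star \rangle \le \|g^{(k)}\|_2 \|z^{(k)} - z^\star\|_2, \tag{A9}$$
where the first inequality follows from strong convexity and the optimality of $z^\star$, the second from convexity, and the third from the Cauchy–Schwarz inequality. □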
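Lemma A3.
The following relations hold:
$$\det(B^{(K)}) \le \left( \frac{c_2 K}{n+m} \right)^{n+m}, \tag{A10}$$
$$\prod_{k=0}^{K-1} \frac{\|B^{(k)} s^{(k)}\|_2^2}{s^{(k)\top} B^{(k)} s^{(k)}} \le c_2^K, \tag{A11}$$
$$\frac{\det(B^{(K)})}{\det(B^{(0)})} = \prod_{k=0}^{K-1} \frac{\zeta^{(k)\top} s^{(k)}}{s^{(k)\top} B^{(k)} s^{(k)}}, \tag{A12}$$
where $c_2 := \frac{n+m}{K} + M_2$ is defined in Lemma 2.
Proof. 
To prove Equation (A10), we use the linearity of the trace and $\operatorname{tr}(b a^\top) = a^\top b$ to evaluate $\operatorname{tr}(B^{(k+1)})$ as follows:
$$\begin{aligned} \operatorname{tr}(B^{(k+1)}) &= \operatorname{tr}\left( B^{(k)} - \frac{B^{(k)} s^{(k)} s^{(k)\top} B^{(k)}}{s^{(k)\top} B^{(k)} s^{(k)}} + \frac{\zeta^{(k)} \zeta^{(k)\top}}{\zeta^{(k)\top} s^{(k)}} \right) \\ &= \operatorname{tr}(B^{(k)}) - \underbrace{\operatorname{tr}\left( \frac{B^{(k)} s^{(k)} s^{(k)\top} B^{(k)}}{s^{(k)\top} B^{(k)} s^{(k)}} \right)}_{\ge 0} + \operatorname{tr}\left( \frac{\zeta^{(k)} \zeta^{(k)\top}}{\zeta^{(k)\top} s^{(k)}} \right) \\ &\le \operatorname{tr}(B^{(k)}) + \frac{\|\zeta^{(k)}\|_2^2}{\zeta^{(k)\top} s^{(k)}} \\ &\le \operatorname{tr}(B^{(0)}) + \sum_{j=0}^{k} \frac{\|\zeta^{(j)}\|_2^2}{\zeta^{(j)\top} s^{(j)}} \\ &\le \operatorname{tr}(B^{(0)}) + (k+1) M_2, \end{aligned} \tag{A13}$$
where Lemma A1 is used at the last inequality. Note that the trace is the sum of the eigenvalues, whereas the determinant is the product of the eigenvalues. Then, we can use the AM–GM inequality to bound the determinant by the trace as follows:
$$\det(B^{(k+1)}) \le \left( \frac{1}{n+m} \operatorname{tr}(B^{(k+1)}) \right)^{n+m} \le \left( \frac{\operatorname{tr}(B^{(0)}) + M_2 (k+1)}{n+m} \right)^{n+m}. \tag{A14}$$
Hence, by substituting $k = K - 1$ and $\operatorname{tr}(B^{(0)}) = n + m$, Equation (A10) is proven.
To prove Equation (A11), we evaluate $\operatorname{tr}(B^{(k+1)})$ in a way similar to that for Equation (A13). From Lemma A1,
$$\begin{aligned} 0 < \operatorname{tr}(B^{(k+1)}) &= \operatorname{tr}(B^{(k)}) - \frac{\|B^{(k)} s^{(k)}\|_2^2}{s^{(k)\top} B^{(k)} s^{(k)}} + \frac{\|\zeta^{(k)}\|_2^2}{\zeta^{(k)\top} s^{(k)}} \\ &= \operatorname{tr}(B^{(0)}) - \sum_{j=0}^{k} \frac{\|B^{(j)} s^{(j)}\|_2^2}{s^{(j)\top} B^{(j)} s^{(j)}} + \sum_{j=0}^{k} \frac{\|\zeta^{(j)}\|_2^2}{\zeta^{(j)\top} s^{(j)}} \\ &\le \operatorname{tr}(B^{(0)}) - \sum_{j=0}^{k} \frac{\|B^{(j)} s^{(j)}\|_2^2}{s^{(j)\top} B^{(j)} s^{(j)}} + (k+1) M_2. \end{aligned} \tag{A15}$$
By the AM–GM inequality,
$$\prod_{j=0}^{k} \frac{\|B^{(j)} s^{(j)}\|_2^2}{s^{(j)\top} B^{(j)} s^{(j)}} \le \left( \frac{1}{k+1} \sum_{j=0}^{k} \frac{\|B^{(j)} s^{(j)}\|_2^2}{s^{(j)\top} B^{(j)} s^{(j)}} \right)^{k+1}. \tag{A16}$$
Hence, by substituting $k = K - 1$ and $\operatorname{tr}(B^{(0)}) = n + m$, Equation (A11) is proven.
To prove Equation (A12), we use the matrix determinant lemma to expand $\det(B^{(k+1)})$ as follows:
$$\begin{aligned} \det(B^{(k+1)}) &= \det\left( B^{(k)} - \frac{B^{(k)} s^{(k)} s^{(k)\top} B^{(k)}}{s^{(k)\top} B^{(k)} s^{(k)}} + \frac{\zeta^{(k)} \zeta^{(k)\top}}{\zeta^{(k)\top} s^{(k)}} \right) \\ &= \left( 1 - \frac{1}{s^{(k)\top} B^{(k)} s^{(k)}} \cdot s^{(k)\top} B^{(k)} \left( B^{(k)} + \frac{\zeta^{(k)} \zeta^{(k)\top}}{\zeta^{(k)\top} s^{(k)}} \right)^{-1} B^{(k)} s^{(k)} \right) \cdot \det\left( B^{(k)} + \frac{\zeta^{(k)} \zeta^{(k)\top}}{\zeta^{(k)\top} s^{(k)}} \right). \end{aligned} \tag{A17}$$
Further, by the Sherman–Morrison formula, we have
$$\left( B^{(k)} + \frac{\zeta^{(k)} \zeta^{(k)\top}}{\zeta^{(k)\top} s^{(k)}} \right)^{-1} = (B^{(k)})^{-1} - \frac{(B^{(k)})^{-1} \zeta^{(k)} \zeta^{(k)\top} (B^{(k)})^{-1}}{\zeta^{(k)\top} s^{(k)} + \zeta^{(k)\top} (B^{(k)})^{-1} \zeta^{(k)}}. \tag{A18}$$
By plugging Equation (A18) into Equation (A17), we have
$$\begin{aligned} \det(B^{(k+1)}) &= \frac{(s^{(k)\top} \zeta^{(k)})^2}{(s^{(k)\top} B^{(k)} s^{(k)}) (\zeta^{(k)\top} s^{(k)} + \zeta^{(k)\top} (B^{(k)})^{-1} \zeta^{(k)})} \det\left( B^{(k)} + \frac{\zeta^{(k)} \zeta^{(k)\top}}{\zeta^{(k)\top} s^{(k)}} \right) \\ &= \frac{(s^{(k)\top} \zeta^{(k)})^2}{(s^{(k)\top} B^{(k)} s^{(k)}) (\zeta^{(k)\top} s^{(k)} + \zeta^{(k)\top} (B^{(k)})^{-1} \zeta^{(k)})} \cdot \left( 1 + \frac{\zeta^{(k)\top} (B^{(k)})^{-1} \zeta^{(k)}}{\zeta^{(k)\top} s^{(k)}} \right) \det(B^{(k)}) \\ &= \det(B^{(k)})\, \frac{\zeta^{(k)\top} s^{(k)}}{s^{(k)\top} B^{(k)} s^{(k)}}, \end{aligned} \tag{A19}$$
where the matrix determinant lemma is invoked again at the second identity. Recursively applying Equation (A19) with $\det(B^{(0)}) = 1$, we obtain Equation (A12). □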
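The one-step determinant identity and the trace bound used in this proof can be sanity-checked numerically. The sketch below applies a single BFGS update to a random symmetric positive definite matrix. The matrices and vectors are arbitrary test inputs rather than iterates of q-DOT, and $\zeta$ is generated as $G s$ with $G$ positive definite so that $\zeta^\top s > 0$.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 6

A = rng.normal(size=(dim, dim))
B = A @ A.T + dim * np.eye(dim)   # current Hessian approximation (SPD)
s = rng.normal(size=dim)          # step s = z_next - z
G = rng.normal(size=(dim, dim))
G = G @ G.T + np.eye(dim)         # SPD matrix playing the role of the averaged Hessian
zeta = G @ s                      # gradient difference, so zeta^T s > 0

# BFGS update: B_next = B - (B s)(B s)^T / (s^T B s) + zeta zeta^T / (zeta^T s).
Bs = B @ s
B_next = B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(zeta, zeta) / (zeta @ s)

# One-step determinant identity: det(B_next) = det(B) * (zeta^T s) / (s^T B s).
lhs = np.linalg.det(B_next)
rhs = np.linalg.det(B) * (zeta @ s) / (s @ Bs)

# One-step trace bound: tr(B_next) <= tr(B) + ||zeta||^2 / (zeta^T s).
trace_bound = np.trace(B) + (zeta @ zeta) / (zeta @ s)
print(lhs, rhs, np.trace(B_next), trace_bound)
```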
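Lemma A4.
For all $k$, $\|s^{(k)}\|_2 \le c_5 \|g^{(k)}\|_2 \cos\theta_k$, where $\theta_k$ is the angle between $s^{(k)}$ and $-g^{(k)}$, and $c_5 := \frac{2(1-\gamma)}{M_1}$ is defined in Lemma 2.
Proof. 
By the Armijo condition (30), we have
$$\tilde{F}(z^{(k+1)}) - \tilde{F}(z^{(k)}) \le \gamma \rho^{(k)} g^{(k)\top} d^{(k)} = \gamma\, g^{(k)\top} s^{(k)}. \tag{A20}$$
Additionally, as $\tilde{F}$ is $M_1$-strongly convex over $Z$, it holds that $\tilde{F}(z^{(k+1)}) - \tilde{F}(z^{(k)}) \ge s^{(k)\top} g^{(k)} + \frac{1}{2} M_1 \|s^{(k)}\|_2^2$. Hence,
$$\begin{aligned} & s^{(k)\top} g^{(k)} + \frac{1}{2} M_1 \|s^{(k)}\|_2^2 \le \gamma\, g^{(k)\top} s^{(k)} \\ \Longrightarrow\ & (1 - \gamma)(-s^{(k)\top} g^{(k)}) \ge \frac{1}{2} M_1 \|s^{(k)}\|_2^2 \\ \Longrightarrow\ & \|s^{(k)}\|_2 \le \underbrace{\frac{2(1-\gamma)}{M_1}}_{= c_5} \cdot \underbrace{\frac{-s^{(k)\top} g^{(k)}}{\|s^{(k)}\|_2 \|g^{(k)}\|_2}}_{= \cos\theta_k} \cdot \|g^{(k)}\|_2, \end{aligned} \tag{A21}$$
which is the desired inequality. □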
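Lemma A5.
For all $k$, let $\theta_k$ be the angle between $s^{(k)}$ and $-g^{(k)}$. Then,
$$\prod_{k=0}^{K-1} \left( 1 - \frac{\gamma c_1 M_1 \cos^2\theta_k}{2} \right) \le \left( 1 - \frac{\gamma c_1 M_1}{2 c_4^2 c_5^2} \right)^{K/2}, \tag{A22}$$
where $c_1$, $c_4$, and $c_5$ are defined in Lemma 2.
Proof. 
By multiplying each side of Equations (A10)–(A12), we have
$$\prod_{k=0}^{K-1} \frac{\|B^{(k)} s^{(k)}\|_2^2}{s^{(k)\top} B^{(k)} s^{(k)}} \cdot \frac{\zeta^{(k)\top} s^{(k)}}{s^{(k)\top} B^{(k)} s^{(k)}} \le c_3^K, \tag{A23}$$
where $c_3 := \left( \frac{K}{n+m} \right)^{(n+m)/K} c_2^{(n+m+K)/K}$ is defined in Lemma 2. By using $B^{(k)} s^{(k)} = -\rho^{(k)} g^{(k)}$ and $\zeta^{(k)\top} s^{(k)} \ge -(1 - \gamma')\, g^{(k)\top} s^{(k)}$ (shown in Equation (A33)),
$$\prod_{k=0}^{K-1} \frac{\|B^{(k)} s^{(k)}\|_2^2}{s^{(k)\top} B^{(k)} s^{(k)}} \cdot \frac{\zeta^{(k)\top} s^{(k)}}{s^{(k)\top} B^{(k)} s^{(k)}} = \prod_{k=0}^{K-1} \frac{\|g^{(k)}\|_2^2 \cdot \zeta^{(k)\top} s^{(k)}}{(s^{(k)\top} g^{(k)})^2} \ge (1 - \gamma')^K \cdot \prod_{k=0}^{K-1} \frac{\|g^{(k)}\|_2^2}{-s^{(k)\top} g^{(k)}}. \tag{A24}$$
Hence,
$$\prod_{k=0}^{K-1} \frac{\|g^{(k)}\|_2}{\|s^{(k)}\|_2 \cos\theta_k} \le \left( \frac{c_3}{1 - \gamma'} \right)^K = c_4^K. \tag{A25}$$
By Lemma A4, we can confirm
$$\prod_{k=0}^{K-1} \cos^2\theta_k \ge \prod_{k=0}^{K-1} \frac{1}{c_4} \cdot \frac{\|g^{(k)}\|_2 \cos\theta_k}{\|s^{(k)}\|_2} \ge \left( \frac{1}{c_4 c_5} \right)^K. \tag{A26}$$
Let $\hat{K}$ be the number of $k \in \{0, 1, \dots, K-1\}$ such that $\cos\theta_k < \frac{1}{c_4 c_5}$; then
$$\left( \frac{1}{c_4 c_5} \right)^K \le \prod_{k=0}^{K-1} \cos^2\theta_k \le \left( \frac{1}{c_4 c_5} \right)^{2\hat{K}}, \tag{A27}$$
implying that $\hat{K}$ is at most $\frac{K}{2}$ (note that $\frac{1}{c_4 c_5} < 1$ from Equation (A26)). Therefore,
$$\prod_{k=0}^{K-1} \left( 1 - \frac{\gamma c_1 M_1 \cos^2\theta_k}{2} \right) \le \left( 1 - \frac{\gamma c_1 M_1}{2 c_4^2 c_5^2} \right)^{K/2}. \tag{A28}$$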

Appendix C. Deferred Proofs

Appendix C.1. Proof of Lemma 1

Proof. 
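It is easy to confirm $M_1 = \kappa$ because $\tilde{F}$ is the sum of $F$ (convex) and $\frac{\kappa}{2} \|z\|_2^2$.
Because $\tilde{F}$ is twice differentiable and $Z$ is a closed convex set, we evaluate the smoothness parameter $M_2$ (over $Z$) by the eigenvalues of $\nabla^2 \tilde{F}(z)$. We begin by evaluating the eigenvalues of $\nabla^2 F(z)$, then evaluate the eigenvalues of $\nabla^2 \tilde{F}(z)$ by $\nabla^2 \tilde{F}(z) = \nabla^2 F(z) + \kappa I$. Let $P(z) \in \mathbb{R}^{n \times m}$ be a matrix such that $P_{ij}(z) := \nabla\Omega^\star(-D_{ij} - \alpha_i - \beta_j)$. Here, $P_{ij}(z)$ is the primal variable corresponding to the dual variables $(\alpha_i, \beta_j)$ (see Equation (15)). The gradient of $F$ is
$$\nabla F(z) = \begin{bmatrix} \left( a_i - \sum_{j=1}^{m} \nabla\Omega^\star(-D_{ij} - \alpha_i - \beta_j) \right)_{i=1}^{n} \\ \left( b_j - \sum_{i=1}^{n} \nabla\Omega^\star(-D_{ij} - \alpha_i - \beta_j) \right)_{j=1}^{m} \end{bmatrix} = \begin{bmatrix} \left( a_i - \sum_{j=1}^{m} P_{ij}(z) \right)_{i=1}^{n} \\ \left( b_j - \sum_{i=1}^{n} P_{ij}(z) \right)_{j=1}^{m} \end{bmatrix}, \tag{A29}$$
and the Hessian of $F$ is
$$\nabla^2 F(z) = \frac{1}{\lambda} \cdot \underbrace{\begin{bmatrix} \operatorname{diag}\left( \sum_{j} P_{ij}(z)^q \right) & P(z)^q \\ (P(z)^q)^\top & \operatorname{diag}\left( \sum_{i} P_{ij}(z)^q \right) \end{bmatrix}}_{=: H}, \tag{A30}$$
where $P(z)^q$ is the element-wise power of $P(z)$. Then, by invoking the Gershgorin circle theorem (Theorem 7.2.1 of [41]), the eigenvalues of $H$ can be upper bounded by the following value:
$$\max_{i,j} \Big\{ \underbrace{\textstyle\sum_{j} P_{ij}(z)^q}_{\text{center of } i\text{-th disc}} + \underbrace{[P(z)^q \mathbf{1}_m]_i}_{\text{radius of } i\text{-th disc}},\ \textstyle\sum_{i} P_{ij}(z)^q + [(P(z)^q)^\top \mathbf{1}_n]_j \Big\} = \max_{i,j} \Big\{ 2 \sum_{j=1}^{m} P_{ij}(z)^q,\ 2 \sum_{i=1}^{n} P_{ij}(z)^q \Big\} \le 2 N R^q, \tag{A31}$$
where we use $0 \le P_{ij}(z) \le R$ for all $i$, $j$, and $z \in Z$ at the last inequality. Hence, $M_2 \le \kappa + \frac{2 N R^q}{\lambda}$.
$M_2 \le \kappa + \frac{2 N \tau^q}{\lambda}$ is confirmed by noting that $0 \le P_{ij}(z) \le \tau$ for all $i$, $j$, and $z \in Z_\tau$ and that $Z_\tau$ is a closed convex set. □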
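The Gershgorin bound on the block matrix $H$ can be checked numerically. In the sketch below, the plan $P$ is filled with arbitrary values in $[0, R]$ instead of values produced by the dual map, which is an assumption made only to keep the example self-contained; the sizes and the values of `q` and `R` are likewise illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, q, R = 4, 6, 0.5, 2.0

P = rng.uniform(0.0, R, size=(n, m))   # stand-in for the primal variables, 0 <= P_ij <= R
Pq = P ** q                            # element-wise power

# Block matrix H so that the Hessian of F equals H / lambda.
H = np.block([
    [np.diag(Pq.sum(axis=1)), Pq],
    [Pq.T, np.diag(Pq.sum(axis=0))],
])

eigenvalues = np.linalg.eigvalsh(H)
largest = eigenvalues.max()
# Gershgorin: each disc has center = radius = a row or column sum of P^q.
gershgorin = 2.0 * max(Pq.sum(axis=1).max(), Pq.sum(axis=0).max())
bound = 2.0 * max(n, m) * R ** q       # the bound 2 N R^q with N = max(n, m)
print(largest, gershgorin, bound)
```

Each disc also touches the origin because its center equals its radius, so the same argument shows that $H$ is positive semidefinite.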

Appendix C.2. Proof of Lemma 2

Proof. 
First, we evaluate the ratio between $\tilde{F}(z^{(k+1)}) - \tilde{F}^\star$ and $\tilde{F}(z^{(k)}) - \tilde{F}^\star$ for $k = 0, 1, 2, \dots, K-1$. Let $\theta_k$ be the angle between the vectors $s^{(k)}$ and $-g^{(k)}$. By the Armijo condition (Equation (30)), the difference $\tilde{F}(z^{(k+1)}) - \tilde{F}(z^{(k)})$ can be evaluated as follows:
$$\tilde{F}(z^{(k+1)}) - \tilde{F}(z^{(k)}) \le \gamma \rho^{(k)} g^{(k)\top} d^{(k)} = \gamma\, g^{(k)\top} (z^{(k+1)} - z^{(k)}) = \gamma\, g^{(k)\top} s^{(k)} = \gamma \left( -\|s^{(k)}\|_2 \|g^{(k)}\|_2 \cos\theta_k \right). \tag{A32}$$
In addition, by the curvature condition (Equation (31)),
$$\zeta^{(k)\top} s^{(k)} = g^{(k+1)\top} s^{(k)} - g^{(k)\top} s^{(k)} = \rho^{(k)} g^{(k+1)\top} d^{(k)} - g^{(k)\top} s^{(k)} \ge \rho^{(k)} \cdot \gamma' g^{(k)\top} d^{(k)} - g^{(k)\top} s^{(k)} = \gamma' g^{(k)\top} s^{(k)} - g^{(k)\top} s^{(k)} = -(1 - \gamma')\, g^{(k)\top} s^{(k)}, \tag{A33}$$
which implies $\|s^{(k)}\|_2^2 \ge \frac{1}{M_2} \zeta^{(k)\top} s^{(k)} \ge -\frac{1 - \gamma'}{M_2} g^{(k)\top} s^{(k)} = \frac{1 - \gamma'}{M_2} \|s^{(k)}\|_2 \|g^{(k)}\|_2 \cos\theta_k$ together with Lemma A1. Hence, we have
$$\|s^{(k)}\|_2 \ge c_1 \|g^{(k)}\|_2 \cos\theta_k, \tag{A34}$$
where $c_1 := \frac{1 - \gamma'}{M_2}$. Then,
$$\begin{aligned} \tilde{F}(z^{(k+1)}) - \tilde{F}^\star &\le (\tilde{F}(z^{(k)}) - \tilde{F}^\star) + \gamma \left( -\|s^{(k)}\|_2 \|g^{(k)}\|_2 \cos\theta_k \right) \\ &\le (\tilde{F}(z^{(k)}) - \tilde{F}^\star) - \gamma c_1 \|g^{(k)}\|_2^2 \cos^2\theta_k \\ &\le (\tilde{F}(z^{(k)}) - \tilde{F}^\star) - \gamma c_1 (M_1 / 2) \|g^{(k)}\|_2 \|z^{(k)} - z^\star\|_2 \cos^2\theta_k \\ &\le (\tilde{F}(z^{(k)}) - \tilde{F}^\star) - \gamma c_1 (M_1 / 2) \cos^2\theta_k\, (\tilde{F}(z^{(k)}) - \tilde{F}^\star) \\ &= (1 - \gamma c_1 M_1 \cos^2\theta_k / 2)(\tilde{F}(z^{(k)}) - \tilde{F}^\star), \end{aligned} \tag{A35}$$
where Equation (A32) is used at the first inequality; Equation (A34) is used at the second inequality; Lemma A2 is used at the third inequality; and a consequence of the convexity, $\tilde{F}(z^{(k)}) - \tilde{F}(z^\star) \le \langle g^{(k)}, z^{(k)} - z^\star \rangle \le \|g^{(k)}\|_2 \|z^{(k)} - z^\star\|_2$, is used at the fourth inequality.
Next, recursively invoking the inequality in Equation (A35), we obtain
$$\tilde{F}(z^{(K)}) - \tilde{F}^\star \le \prod_{k=0}^{K-1} \left( 1 - \frac{\gamma c_1 M_1 \cos^2\theta_k}{2} \right) (\tilde{F}(z^{(0)}) - \tilde{F}^\star) \le \left( 1 - \frac{\gamma c_1 M_1}{2 c_4^2 c_5^2} \right)^{K/2} (\tilde{F}(z^{(0)}) - \tilde{F}^\star), \tag{A36}$$
which is the desired bound. The last inequality is due to Lemma 3. □

Appendix C.3. Proof of Lemma 3

Proof. 
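By substituting the definitions of the constants $c_1$, $c_2$, $c_3$, $c_4$, and $c_5$,
$$\begin{aligned} \frac{\gamma c_1 M_1}{c_4^2 c_5^2} &= \gamma \cdot \frac{1 - \gamma'}{M_2} \cdot \frac{M_1}{\left( \frac{c_3}{1 - \gamma'} \right)^2 \cdot \left( \frac{2(1 - \gamma)}{M_1} \right)^2} \\ &= \frac{(1 - \gamma')^3 M_1^3 \gamma}{4 (1 - \gamma)^2 c_3^2 M_2} \\ &= \frac{M_1^3 (1 - \gamma')^3 \gamma}{4 M_2 (1 - \gamma)^2} \left( \frac{1}{\left( \frac{1}{(n+m)^{n+m}} \right)^{1/K} c_2^{\frac{n+m+K}{K}} K^{\frac{n+m}{K}}} \right)^2 \\ &= \frac{M_1^3 (1 - \gamma')^3 \gamma}{4 M_2 (1 - \gamma)^2} \cdot (n+m)^{\frac{2(n+m)}{K}} \cdot \left[ \left( \frac{n+m}{K} + M_2 \right) \cdot K^{\frac{n+m}{n+m+K}} \right]^{-\frac{2(n+m+K)}{K}} \\ &> \frac{M_1^3 (1 - \gamma')^3 \gamma}{4 M_2 (1 - \gamma)^2} \cdot 1 \cdot M_2^{-2} K^{-\frac{2(n+m)}{K}} \\ &\ge \frac{(1 - \gamma')^3 \gamma\, e^{-2(n+m)/e}}{4 (1 - \gamma)^2} \left( \frac{M_1}{M_2} \right)^3, \end{aligned} \tag{A37}$$
where, at the first inequality, we invoke $(n+m)^{\frac{2(n+m)}{K}} > 1$ and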
n + m K + M 2 · K n + m n + m + K 2 ( n + m + K ) K M 2 K n + m n + m + K 2 ( n + m + K ) K = M 2 2 ( n + m + K ) K K 2 ( n + m ) K M 2 2 K 2 ( n + m ) K ,
and we use $K^{-\frac{2(n+m)}{K}} \ge e^{-\frac{2(n+m)}{e}}$ for all $K$ at the second inequality. Hence, the desired inequality is proven. □
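The bound on $K^{-2(n+m)/K}$ used at the second inequality can be spelled out as a short worked step. It relies only on the elementary fact that $x \mapsto (\ln x)/x$ attains its maximum $1/e$ at $x = e$ on $x > 0$:

```latex
\begin{align*}
K^{-\frac{2(n+m)}{K}}
  &= \exp\!\left( -2(n+m) \cdot \frac{\ln K}{K} \right), \\
\frac{d}{dK} \frac{\ln K}{K} &= \frac{1 - \ln K}{K^2} = 0
  \iff K = e,
  \qquad\text{so}\qquad \frac{\ln K}{K} \le \frac{1}{e} \quad (K > 0), \\
K^{-\frac{2(n+m)}{K}}
  &\ge \exp\!\left( -\frac{2(n+m)}{e} \right) = e^{-\frac{2(n+m)}{e}}.
\end{align*}
```

For integer $K$ the maximum of $(\ln K)/K$ is attained at $K = 3$, so the bound holds with some slack.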

Appendix D. Additional Experiments

Appendix D.1. Comparison with Tsallis Entropy

In this study, we used the deformed q entropy instead of the Tsallis entropy [23] as the sparse regularization. Here, we briefly analyze empirically what happens if we use the Tsallis entropy instead. We compare the dual optimization objective in Definition 2 with the deformed q entropy and with the Tsallis entropy. We use the following convex regularizer formed by the Tsallis entropy:
$$\Omega(\pi) = \lambda \sum_{i=1}^{n} \pi_i^q \log_q(\pi_i).$$
The simulations in this section were executed on a 2.7 GHz quad-core Intel® Core i7 processor. We used the following synthetic dataset: $(x_i)_{i=1}^n \sim \mathcal{N}(\mathbf{1}_2, I_2)$, $(y_j)_{j=1}^m \sim \mathcal{N}(-\mathbf{1}_2, I_2)$, and $n = m = 100$. For q-DOT and Tsallis-regularized OT, different regularization parameters $\lambda \in \{0.5, 1\}$ were compared, and $\varepsilon = 1 \times 10^{-6}$ was used as the stopping criterion on the gradient norm. The range of regularization parameters differed from that in Section 5.1 because Tsallis-regularized OT does not converge with too-small regularization parameters such as $\lambda = 0.01$. We compared different deformation parameters $q \in \{0, 0.25, 0.5, 0.75\}$. For the unregularized OT, we used the implementation of the Python optimal transport package [38]. For q-DOT and Tsallis-regularized OT, we used the L-BFGS-B method provided by the SciPy package [39]. To determine zero entries in the transport matrix, we regarded entries smaller than machine epsilon as zero.
Table A1. Comparison of the sparsity and absolute error on the synthetic dataset. Sparsity indicates the ratio of zero entries in each transport matrix. We counted the number of entries smaller than machine epsilon to measure the sparsity instead of imposing a small positive threshold for determining zero entries. Abs. error indicates the absolute error of the computed cost with respect to 1-Wasserstein distance. Tsallis-regularized OT with q = 0.00 does not work due to numerical instability.

| $(q, \lambda)$ | Sparsity (q-DOT) | Abs. Error (q-DOT) | Sparsity (Tsallis) | Abs. Error (Tsallis) |
|---|---|---|---|---|
| q = 0.00, λ = 0.50 | 0.984 | 0.001 | - | - |
| q = 0.00, λ = 1.00 | 0.981 | 0.011 | - | - |
| q = 0.25, λ = 0.50 | 0.977 | 0.008 | 0.000 | 3.362 |
| q = 0.25, λ = 1.00 | 0.973 | 0.010 | 0.000 | 3.388 |
| q = 0.50, λ = 0.50 | 0.959 | 0.015 | 0.000 | 3.153 |
| q = 0.50, λ = 1.00 | 0.944 | 0.022 | 0.000 | 3.283 |
| q = 0.75, λ = 0.50 | 0.861 | 0.052 | 0.000 | 1.962 |
| q = 0.75, λ = 1.00 | 0.776 | 0.099 | 0.000 | 2.582 |
As can be seen from the results in Table A1, the Tsallis entropic regularizer neither induces sparsity nor achieves a better approximation of the 1-Wasserstein distance than the deformed q entropy. Note that the Tsallis entropy induces the dual map $\nabla\Omega^\star(\eta) = q^{1/(1-q)} / \exp_q(-\eta / \lambda)$ shown in Equation (25), which has dense support for $q > 0$ and becomes the source of dense transport matrices. This verifies that the design of the regularizer is important for regularized optimal transport.
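The difference in support between the two dual maps can be illustrated numerically. The sketch below assumes the standard q-exponential $\exp_q(x) = [1 + (1-q)x]_+^{1/(1-q)}$, the deformed dual map $\exp_q(\eta/\lambda)$, and the Tsallis dual map in the form $q^{1/(1-q)} / \exp_q(-\eta/\lambda)$, which is the inverse of the derivative of $\lambda \pi^q \log_q(\pi)$. The function names and the grid of $\eta$ values are illustrative.

```python
import numpy as np


def exp_q(x, q):
    """q-exponential [1 + (1 - q) x]_+^(1 / (1 - q)), for 0 <= q < 1."""
    return np.maximum(1.0 + (1.0 - q) * x, 0.0) ** (1.0 / (1.0 - q))


def dual_map_deformed(eta, q, lam):
    """Dual map of the deformed q entropy: exp_q(eta / lam)."""
    return exp_q(eta / lam, q)


def dual_map_tsallis(eta, q, lam):
    """Assumed dual map of the Tsallis regularizer: q^(1/(1-q)) / exp_q(-eta / lam)."""
    return q ** (1.0 / (1.0 - q)) / exp_q(-eta / lam, q)


q, lam = 0.5, 1.0
eta = np.linspace(-5.0, 0.0, 11)   # non-positive dual inputs, as for large transport costs

deformed = dual_map_deformed(eta, q, lam)
tsallis = dual_map_tsallis(eta, q, lam)

# The deformed map is exactly zero once eta <= -lam / (1 - q); the Tsallis map never is.
print("zeros (deformed):", int(np.sum(deformed == 0.0)))
print("zeros (Tsallis): ", int(np.sum(tsallis == 0.0)))

# Consistency check: the derivative lam * (1 - q * pi^(q - 1)) / (1 - q) maps back to eta.
eta_back = lam * (1.0 - q * tsallis ** (q - 1.0)) / (1.0 - q)
```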

Appendix D.2. Hyperparameter Sensitivity

In this section, we summarize more comprehensive experimental results of q-DOT and the Sinkhorn algorithm to show the performance dependence on hyperparameters q and λ . Subsequently, we describe experiments to show the sparsity of transport matrices, absolute error of computed costs with respect to 1-Wasserstein distance, and runtime with differently-sized datasets.
The simulations in this section were executed on a 2.7 GHz Intel® Xeon® Gold 6258R processor (different from the processor that we used in Section 5). We used the following synthetic dataset: $(x_i)_{i=1}^n \sim \mathcal{N}(\mathbf{1}_2, I_2)$, $(y_j)_{j=1}^m \sim \mathcal{N}(-\mathbf{1}_2, I_2)$, with different $N\ (= n = m) \in \{100, 300, 500, 1000, 2000, 3000\}$. For q-DOT and Tsallis-regularized OT, different regularization parameters $\lambda \in \{0.01, 0.1, 1\}$ were compared, and $\varepsilon = 1 \times 10^{-6}$ was used as the stopping criterion. We compared different deformation parameters $q \in \{0, 0.25, 0.5, 0.75\}$. For the unregularized OT, we used the implementation of the Python optimal transport package [38]. For q-DOT, we used the L-BFGS-B method provided by the SciPy package [39]. To determine zero entries in the transport matrix, we regarded entries smaller than machine epsilon as zero.
The results are shown in Table A2. In these tables, the results with $q = 1.00$ correspond to the Sinkhorn algorithm. The results for $(q, \lambda) = (1.00, 0.01)$ are missing because they did not work well due to numerical instability. In general, we observed behavior similar to that described in Section 5: sparsity intensified as $q$ and $\lambda$ decreased, thereby increasing runtime. As $N$ increased, nonmonotonic trends in runtime were observed with respect to $q$: for a fixed $\lambda$, larger $q$ accelerated the computation, while $q = 0.25$ seemed to be the slowest. This apparent discrepancy from Theorem 1 may be partly because Theorem 1 relies on an oracle parameter choice $\kappa = 2 N \tau^q \lambda^{-1}$, as we discussed in Section 5.2, which is hardly known in practice. Nevertheless, it is remarkable that even $q = 0.75$ gives very sparse solutions with a reasonable amount of runtime. Regarding the absolute error, smaller $q$ tends to perform better with relatively small datasets, such as $N \le 1000$, while $q = 1.00$ performs better for larger datasets, such as $N = 2000$ and $3000$. As we mentioned in Section 5.3, the theoretical analysis of the approximation error is still unclear and is left for future work.
Table A2. Hyperparameter sensitivity of q-DOT and Sinkhorn algorithm. In these tables, q = 1.00 corresponds to the Sinkhorn algorithm. (q, λ) = (1.00, 0.01) did not work well because of numerical instability. The results shown in the tables are the means of 10 random trials. Bold typeface indicates the best result for each of sparsity, absolute error, and runtime.

| $(q, \lambda)$ | Sparsity (N = 100) | Abs. Error (N = 100) | Runtime [ms] (N = 100) | Sparsity (N = 300) | Abs. Error (N = 300) | Runtime [ms] (N = 300) |
|---|---|---|---|---|---|---|
| q = 0.00, λ = 0.01 | 0.990 | $2.28 \times 10^{-2}$ | 4366.142 | 0.997 | $1.30 \times 10^{0}$ | 33,592.026 |
| q = 0.00, λ = 0.10 | 0.988 | $3.63 \times 10^{-3}$ | 1236.346 | 0.996 | $2.15 \times 10^{-2}$ | 14,641.740 |
| q = 0.00, λ = 1.00 | 0.982 | $6.20 \times 10^{-3}$ | 842.253 | 0.994 | $2.03 \times 10^{-2}$ | 7749.233 |
| q = 0.25, λ = 0.01 | 0.989 | $8.18 \times 10^{-3}$ | 3182.535 | 0.996 | $7.07 \times 10^{-2}$ | 36,167.445 |
| q = 0.25, λ = 0.10 | 0.986 | $5.54 \times 10^{-3}$ | 1131.784 | 0.994 | $1.83 \times 10^{-2}$ | 15,176.970 |
| q = 0.25, λ = 1.00 | 0.973 | $1.16 \times 10^{-2}$ | 668.734 | 0.990 | $2.69 \times 10^{-2}$ | 5848.561 |
| q = 0.50, λ = 0.01 | 0.987 | $9.91 \times 10^{-3}$ | 2388.176 | 0.994 | $1.99 \times 10^{-2}$ | 25,940.619 |
| q = 0.50, λ = 0.10 | 0.977 | $7.66 \times 10^{-3}$ | 1040.818 | 0.991 | $2.41 \times 10^{-2}$ | 8304.774 |
| q = 0.50, λ = 1.00 | 0.946 | $2.40 \times 10^{-2}$ | 339.978 | 0.976 | $3.52 \times 10^{-2}$ | 2713.598 |
| q = 0.75, λ = 0.01 | 0.979 | $1.16 \times 10^{-2}$ | 2396.353 | 0.991 | $2.97 \times 10^{-2}$ | 18,820.365 |
| q = 0.75, λ = 0.10 | 0.950 | $1.31 \times 10^{-2}$ | 731.564 | 0.973 | $3.34 \times 10^{-2}$ | 4823.098 |
| q = 0.75, λ = 1.00 | 0.786 | $1.02 \times 10^{-1}$ | 200.654 | 0.864 | $9.57 \times 10^{-2}$ | 1654.697 |
| q = 1.00, λ = 0.01 | - | - | - | - | - | - |
| q = 1.00, λ = 0.10 | 0.000 | $5.83 \times 10^{-2}$ | 1132.516 | 0.000 | $7.39 \times 10^{-2}$ | 2014.341 |
| q = 1.00, λ = 1.00 | 0.000 | $7.51 \times 10^{-1}$ | 31.284 | 0.000 | $8.15 \times 10^{-1}$ | 207.094 |

| $(q, \lambda)$ | Sparsity (N = 500) | Abs. Error (N = 500) | Runtime [ms] (N = 500) | Sparsity (N = 1000) | Abs. Error (N = 1000) | Runtime [s] (N = 1000) |
|---|---|---|---|---|---|---|
| q = 0.00, λ = 0.01 | 0.999 | $2.48 \times 10^{0}$ | 86,046.395 | 1.000 | $6.39 \times 10^{0}$ | 336.207 |
| q = 0.00, λ = 0.10 | 0.997 | $3.91 \times 10^{-2}$ | 49,523.995 | 0.999 | $8.76 \times 10^{-2}$ | 286.879 |
| q = 0.00, λ = 1.00 | 0.996 | $4.10 \times 10^{-2}$ | 27,357.659 | 0.998 | $8.22 \times 10^{-2}$ | 133.223 |
| q = 0.25, λ = 0.01 | 0.998 | $2.36 \times 10^{-1}$ | 104,346.641 | 0.999 | $4.27 \times 10^{0}$ | 413.775 |
| q = 0.25, λ = 0.10 | 0.996 | $5.12 \times 10^{-2}$ | 41,810.473 | 0.998 | $1.01 \times 10^{-1}$ | 221.787 |
| q = 0.25, λ = 1.00 | 0.994 | $4.22 \times 10^{-2}$ | 18,415.400 | 0.997 | $9.01 \times 10^{-2}$ | 87.945 |
| q = 0.50, λ = 0.01 | 0.996 | $4.52 \times 10^{-2}$ | 78,618.996 | 0.998 | $8.61 \times 10^{-2}$ | 374.123 |
| q = 0.50, λ = 0.10 | 0.994 | $4.50 \times 10^{-2}$ | 25,512.371 | 0.997 | $9.37 \times 10^{-2}$ | 120.605 |
| q = 0.50, λ = 1.00 | 0.984 | $4.92 \times 10^{-2}$ | 8266.048 | 0.990 | $9.49 \times 10^{-2}$ | 41.435 |
| q = 0.75, λ = 0.01 | 0.994 | $4.55 \times 10^{-2}$ | 57,839.639 | 0.996 | $1.05 \times 10^{-1}$ | 275.101 |
| q = 0.75, λ = 0.10 | 0.979 | $5.07 \times 10^{-2}$ | 14,257.452 | 0.985 | $1.02 \times 10^{-1}$ | 67.301 |
| q = 0.75, λ = 1.00 | 0.890 | $1.00 \times 10^{-1}$ | 4362.478 | 0.917 | $1.34 \times 10^{-1}$ | 21.536 |
| q = 1.00, λ = 0.01 | - | - | - | - | - | - |
| q = 1.00, λ = 0.10 | 0.000 | $7.92 \times 10^{-2}$ | 5731.333 | 0.000 | $8.62 \times 10^{-2}$ | 57.739 |
| q = 1.00, λ = 1.00 | 0.000 | $8.35 \times 10^{-1}$ | 562.722 | 0.000 | $8.51 \times 10^{-1}$ | 2.215 |

| $(q, \lambda)$ | Sparsity (N = 2000) | Abs. Error (N = 2000) | Runtime [s] (N = 2000) | Sparsity (N = 3000) | Abs. Error (N = 3000) | Runtime [s] (N = 3000) |
|---|---|---|---|---|---|---|
| q = 0.00, λ = 0.01 | 1.000 | $3.59 \times 10^{0}$ | 1386.554 | 1.000 | $4.09 \times 10^{0}$ | 3257.314 |
| q = 0.00, λ = 0.10 | 0.999 | $2.25 \times 10^{-1}$ | 1245.867 | 1.000 | $8.56 \times 10^{-1}$ | 3108.889 |
| q = 0.00, λ = 1.00 | 0.999 | $1.85 \times 10^{-1}$ | 823.011 | 0.999 | $2.68 \times 10^{-1}$ | 2355.733 |
| q = 0.25, λ = 0.01 | 1.000 | $5.88 \times 10^{0}$ | 1555.064 | 1.000 | $3.78 \times 10^{0}$ | 3821.319 |
| q = 0.25, λ = 0.10 | 0.999 | $1.86 \times 10^{-1}$ | 1201.656 | 0.999 | $2.94 \times 10^{-1}$ | 3532.833 |
| q = 0.25, λ = 1.00 | 0.998 | $1.86 \times 10^{-1}$ | 492.324 | 0.999 | $2.76 \times 10^{-1}$ | 1530.838 |
| q = 0.50, λ = 0.01 | 0.999 | $6.66 \times 10^{-1}$ | 1494.270 | 1.000 | $1.85 \times 10^{0}$ | 3669.894 |
| q = 0.50, λ = 0.10 | 0.998 | $1.97 \times 10^{-1}$ | 589.379 | 0.999 | $2.93 \times 10^{-1}$ | 1637.985 |
| q = 0.50, λ = 1.00 | 0.994 | $1.85 \times 10^{-1}$ | 210.008 | 0.995 | $2.71 \times 10^{-1}$ | 644.164 |
| q = 0.75, λ = 0.01 | 0.998 | $2.00 \times 10^{-1}$ | 1300.517 | 0.998 | $2.98 \times 10^{-1}$ | 3560.379 |
| q = 0.75, λ = 0.10 | 0.989 | $2.00 \times 10^{-1}$ | 321.221 | 0.991 | $2.91 \times 10^{-1}$ | 853.451 |
| q = 0.75, λ = 1.00 | 0.937 | $2.08 \times 10^{-1}$ | 106.334 | 0.946 | $2.83 \times 10^{-1}$ | 270.046 |
| q = 1.00, λ = 0.01 | - | - | - | - | - | - |
| q = 1.00, λ = 0.10 | 0.000 | $9.06 \times 10^{-2}$ | 147.372 | 0.000 | $8.94 \times 10^{-2}$ | 272.210 |
| q = 1.00, λ = 1.00 | 0.000 | $8.62 \times 10^{-1}$ | 8.575 | 0.000 | $8.62 \times 10^{-1}$ | 20.120 |

References

  1. Villani, C. Optimal Transport: Old and New; Springer: Berlin/Heidelberg, Germany, 2009; Volume 338.
  2. Shafieezadeh-Abadeh, S.; Mohajerin Esfahani, P.M.; Kuhn, D. Distributionally robust logistic regression. Adv. Neural Inf. Process. Syst. 2015, 28.
  3. Courty, N.; Flamary, R.; Habrard, A.; Rakotomamonjy, A. Joint distribution optimal transportation for domain adaptation. Adv. Neural Inf. Process. Syst. 2017, 30.
  4. Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; PMLR. pp. 214–223.
  5. Kusner, M.; Sun, Y.; Kolkin, N.; Weinberger, K. From word embeddings to document distances. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 7–9 July 2015; PMLR. pp. 957–966.
  6. Swanson, K.; Yu, L.; Lei, T. Rationalizing text matching: Learning sparse alignments via optimal transport. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 5609–5626.
  7. Otani, M.; Togashi, R.; Nakashima, Y.; Rahtu, E.; Heikkilä, J.; Satoh, S. Optimal correction cost for object detection evaluation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 21–24 June 2022; pp. 21107–21115.
  8. Pele, O.; Werman, M. Fast and robust Earth Mover’s Distances. In Proceedings of the 2009 IEEE 12th International Conference on Computer Vision, Kyoto, Japan, 29 September–2 October 2009; IEEE: New York, NY, USA, 2009; pp. 460–467.
  9. Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. Adv. Neural Inf. Process. Syst. 2013, 26, 2292–2300.
  10. Dessein, A.; Papadakis, N.; Rouas, J.L. Regularized optimal transport and the rot mover’s distance. J. Mach. Learn. Res. 2018, 19, 590–642.
  11. Dvurechensky, P.; Gasnikov, A.; Kroshnin, A. Computational optimal transport: Complexity by accelerated gradient descent is better than by Sinkhorn’s algorithm. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; PMLR. pp. 1367–1376.
  12. Le, T.; Yamada, M.; Fukumizu, K.; Cuturi, M. Tree-sliced variants of Wasserstein distances. Adv. Neural Inf. Process. Syst. 2019, 32, 12304–12315.
  13. Le, T.; Nguyen, T.; Phung, D.; Nguyen, V.A. Sobolev transport: A scalable metric for probability measures with graph metrics. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics, Online, 28–30 March 2022; PMLR. pp. 9844–9868.
  14. Frogner, C.; Zhang, C.; Mobahi, H.; Araya, M.; Poggio, T.A. Learning with a Wasserstein loss. Adv. Neural Inf. Process. Syst. 2015, 28, 2053–2061.
  15. Cuturi, M.; Teboul, O.; Vert, J.P. Differentiable ranking and sorting using optimal transport. Adv. Neural Inf. Process. Syst. 2019, 32, 6861–6871.
  16. Blondel, M.; Martins, A.F.; Niculae, V. Learning with Fenchel-Young losses. J. Mach. Learn. Res. 2020, 21, 1–69.
  17. Birkhoff, G. Tres observaciones sobre el algebra lineal. Univ. Nac. Tucumán Rev. Ser. A 1946, 5, 147–154.
  18. Brualdi, R.A. Combinatorial Matrix Classes; Cambridge University Press: Cambridge, UK, 2006; Volume 13.
  19. Alvarez-Melis, D.; Jaakkola, T. Gromov–Wasserstein alignment of word embedding spaces. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 1881–1890.
  20. Blondel, M.; Seguy, V.; Rolet, A. Smooth and sparse optimal transport. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics, Canary Islands, Spain, 9–11 April 2018; PMLR. pp. 880–889.
  21. Liu, D.C.; Nocedal, J. On the limited memory BFGS method for large scale optimization. Math. Program. 1989, 45, 503–528.
  22. Amari, S.i.; Ohara, A. Geometry of q-exponential family of probability distributions. Entropy 2011, 13, 1170–1185.
  23. Tsallis, C. Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys. 1988, 52, 479–487.
  24. Powell, M.J.D. Some global convergence properties of a variable metric algorithm for minimization without exact line searches. In Proceedings of the Nonlinear Programming, SIAM-AMS Proceedings, New York, NY, USA, 1 January 1976; Volume 9.
  25. Altschuler, J.; Niles-Weed, J.; Rigollet, P. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. Adv. Neural Inf. Process. Syst. 2017, 30, 1961–1971.
  26. Sinkhorn, R.; Knopp, P. Concerning nonnegative matrices and doubly stochastic matrices. Pac. J. Math. 1967, 21, 343–348.
  27. Danskin, J.M. The theory of max-min, with applications. SIAM J. Appl. Math. 1966, 14, 641–664.
  28. Bao, H.; Sugiyama, M. Fenchel-Young losses with skewed entropies for class-posterior probability estimation. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, San Diego, CA, USA, 13–15 April 2021; pp. 1648–1656.
  29. Naudts, J. Deformed exponentials and logarithms in generalized thermostatistics. Phys. A Stat. Mech. Its Appl. 2002, 316, 323–334.
  30. Suyari, H. The unique non self-referential q-canonical distribution and the physical temperature derived from the maximum entropy principle in Tsallis statistics. Prog. Theor. Phys. Suppl. 2006, 162, 79–86.
  31. Ding, N.; Vishwanathan, S. t-Logistic regression. Adv. Neural Inf. Process. Syst. 2010, 23, 514–522.
  32. Futami, F.; Sato, I.; Sugiyama, M. Expectation propagation for t-exponential family using q-algebra. Adv. Neural Inf. Process. Syst. 2017, 30.
  33. Amid, E.; Warmuth, M.K.; Anil, R.; Koren, T. Robust bi-tempered logistic loss based on Bregman divergences. Adv. Neural Inf. Process. Syst. 2019, 32, 15013–15022.
  34. Martins, A.F.; Figueiredo, M.A.; Aguiar, P.M.; Smith, N.A.; Xing, E.P. Nonextensive entropic kernels. In Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 5–9 July 2008; pp. 640–647.
  35. Muzellec, B.; Nock, R.; Patrini, G.; Nielsen, F. Tsallis regularized optimal transport and ecological inference. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31.
  36. Byrd, R.H.; Nocedal, J.; Yuan, Y.X. Global convergence of a class of quasi-Newton methods on convex problems. SIAM J. Numer. Anal. 1987, 24, 1171–1190.
  37. Schmitzer, B. Stabilized sparse scaling algorithms for entropy regularized transport problems. SIAM J. Sci. Comput. 2019, 41, A1443–A1481.
  38. Flamary, R.; Courty, N.; Gramfort, A.; Alaya, M.Z.; Boisbunon, A.; Chambon, S.; Chapel, L.; Corenflos, A.; Fatras, K.; Fournier, N.; et al. POT: Python optimal transport. J. Mach. Learn. Res. 2021, 22, 1–8.
  39. Virtanen, P.; Gommers, R.; Oliphant, T.E.; Haberland, M.; Reddy, T.; Cournapeau, D.; Burovski, E.; Peterson, P.; Weckesser, W.; Bright, J.; et al. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nat. Methods 2020, 17, 261–272.
  40. Weed, J. An explicit analysis of the entropic penalty in linear programming. In Proceedings of the the 31st Conference on Learning Theory, Stockholm, Sweden, 5–9 July 2018; PMLR. pp. 1841–1855. [Google Scholar]
  41. Golub, G.H.; van Loan, C.F. Matrix Computations; The Johns Hopkins University Press: Baltimore, MA, USA, 2013. [Google Scholar]
Figure 1. Plots of the q-Gibbs kernels with different q ( λ = 1 ).
Figure 2. Plots of deformed q entropy with different q values. A constant term is ignored in the plots so that the end points are calibrated to zero.
Figure 3. Comparison of transport matrices. Wasserstein represents the result of the unregularized OT. Sinkhorn ( λ = 0.01 ) does not work well because of numerical instability.
Figure 4. Runtime comparison of q-DOT and the Sinkhorn algorithm ( q = 1 ). The error bars indicate the standard errors of 20 trials.
Figure 5. Wasserstein approximation error of q-DOT and the Sinkhorn algorithm ( q = 1 ). The line shades indicate the standard errors of 20 trials.
Table 1. Summary of $\Omega(\pi)$, $\Omega^\star(\eta)$, and $\nabla\Omega^\star(\eta)$ for several regularizers. The relationships between $\Omega$, its conjugate, and their derivatives are summarized in Bao and Sugiyama [28].

| Regularizer | $\Omega(\pi)$ | $\Omega^\star(\eta)$ | $\nabla\Omega^\star(\eta)$ |
| --- | --- | --- | --- |
| Negative entropy | $\lambda(\pi \log \pi - \pi)$ | $\lambda e^{\eta/\lambda}$ | $e^{\eta/\lambda}$ |
| Squared 2-norm | $\frac{\lambda}{2}\pi^2$ | $\frac{1}{2\lambda}[\eta]_+^2$ | $\frac{1}{\lambda}[\eta]_+$ |
| Deformed q entropy | $\frac{\lambda}{2-q}(\pi \log_q(\pi) - \pi)$ | $\frac{\lambda}{2-q}\exp_q(\eta/\lambda)^{2-q}$ | $\exp_q(\eta/\lambda)$ |
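The deformed q entropy row of Table 1 can be checked numerically with a short sketch. It assumes the standard Tsallis definitions $\log_q(x) = (x^{1-q} - 1)/(1-q)$ and $\exp_q(x) = [1 + (1-q)x]_+^{1/(1-q)}$ for $q \in [0, 1)$, with the usual logarithm and exponential at $q = 1$. The function names are illustrative and are not taken from the authors' code.

```python
import numpy as np


def log_q(x, q):
    """q-logarithm (assumed Tsallis form); reduces to log at q = 1."""
    x = np.asarray(x, dtype=float)
    if q == 1.0:
        return np.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)


def exp_q(x, q):
    """q-exponential (assumed Tsallis form), truncated at zero for q < 1."""
    x = np.asarray(x, dtype=float)
    if q == 1.0:
        return np.exp(x)
    base = np.maximum(1.0 + (1.0 - q) * x, 0.0)
    return base ** (1.0 / (1.0 - q))


def omega(pi, q, lam):
    """Deformed q entropy regularizer from Table 1."""
    return lam / (2.0 - q) * (pi * log_q(pi, q) - pi)


def omega_conj(eta, q, lam):
    """Conjugate of the regularizer from Table 1."""
    return lam / (2.0 - q) * exp_q(eta / lam, q) ** (2.0 - q)


def grad_omega_conj(eta, q, lam):
    """Derivative of the conjugate from Table 1."""
    return exp_q(eta / lam, q)
```

For $q < 1$, `grad_omega_conj` returns exactly zero once $\eta/\lambda \le -1/(1-q)$, whereas at $q = 1$ it stays strictly positive. This truncation is where the sparsity of the transport plan comes from.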
Table 2. Comparison of the computational complexity of the Sinkhorn algorithm and deformed q-optimal transport (q-DOT), where $N = \max\{n, m\}$.

| Sinkhorn | q-DOT |
| --- | --- |
| $O(N^2 (\log N)\, \varepsilon^{-3})$ | $O(N^2 \log(N \varepsilon^{-1}))$ |
Table 3. Comparison of the sparsity and cost with the synthetic dataset. Sparsity indicates the ratio of zero entries in each transport matrix. We counted the number of entries smaller than machine epsilon to measure the sparsity instead of imposing a small positive threshold for determining zero entries. Sinkhorn ( λ = 0.01 ) does not work well because of numerical instability.

| Method | Sparsity | Cost $\langle D, \hat{\Pi} \rangle$ |
| --- | --- | --- |
| Wasserstein (unregularized) | 0.967 | 7.126 |
| q-DOT ( q = 0.00 , λ = 0.01 ) | 0.962 | 7.129 |
| q-DOT ( q = 0.00 , λ = 0.10 ) | 0.961 | 7.126 |
| q-DOT ( q = 0.00 , λ = 1.00 ) | 0.950 | 7.144 |
| q-DOT ( q = 0.25 , λ = 0.01 ) | 0.963 | 7.129 |
| q-DOT ( q = 0.25 , λ = 0.10 ) | 0.959 | 7.126 |
| q-DOT ( q = 0.25 , λ = 1.00 ) | 0.912 | 7.133 |
| q-DOT ( q = 0.50 , λ = 0.01 ) | 0.963 | 7.136 |
| q-DOT ( q = 0.50 , λ = 0.10 ) | 0.946 | 7.127 |
| q-DOT ( q = 0.50 , λ = 1.00 ) | 0.879 | 7.155 |
| q-DOT ( q = 0.75 , λ = 0.01 ) | 0.948 | 7.127 |
| q-DOT ( q = 0.75 , λ = 0.10 ) | 0.897 | 7.136 |
| q-DOT ( q = 0.75 , λ = 1.00 ) | 0.647 | 7.245 |
| Sinkhorn ( λ = 0.01 ) | | |
| Sinkhorn ( λ = 0.10 ) | 0.000 | 7.164 |
| Sinkhorn ( λ = 1.00 ) | 0.000 | 7.788 |
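The two columns of Table 3 can be computed as in the following minimal sketch. It assumes the transport plan and the cost matrix are dense NumPy arrays, that "smaller than machine epsilon" means a strict comparison against the double-precision epsilon, and that the cost is the Frobenius inner product of the cost matrix and the plan. The function names are hypothetical and not part of the authors' implementation.

```python
import numpy as np


def sparsity(plan):
    """Fraction of entries of the transport plan below machine epsilon."""
    plan = np.asarray(plan, dtype=float)
    eps = np.finfo(plan.dtype).eps  # double-precision epsilon (assumption)
    return float(np.mean(plan < eps))


def transport_cost(D, plan):
    """Transport cost <D, plan> as a Frobenius inner product."""
    D = np.asarray(D, dtype=float)
    plan = np.asarray(plan, dtype=float)
    return float(np.sum(D * plan))


# Toy example: a diagonal plan between two uniform two-point distributions.
plan = np.array([[0.5, 0.0], [0.0, 0.5]])
D = np.array([[0.0, 1.0], [1.0, 0.0]])
print(sparsity(plan), transport_cost(D, plan))
```

Under this measure a Sinkhorn plan gets sparsity 0.000 even when most entries are tiny, because its entries are strictly positive. This matches the Sinkhorn rows of Table 3.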
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Bao, H.; Sakaue, S. Sparse Regularized Optimal Transport with Deformed q-Entropy. Entropy 2022, 24, 1634. https://doi.org/10.3390/e24111634

