Sparse Regularized Optimal Transport with Deformed q-Entropy

Optimal transport is a mathematical tool that has been widely used to measure the distance between two probability distributions. To mitigate the cubic computational complexity of the vanilla formulation of the optimal transport problem, regularized optimal transport has received attention in recent years; it is a convex program that minimizes the linear transport cost plus a convex regularizer. Sinkhorn optimal transport, regularized with the negative Shannon entropy, is the most prominent variant, but it leads to densely supported solutions, which are often undesirable in light of the interpretability of transport plans. In this paper, we report that a deformed entropy designed with the q-algebra, a popular generalization of the standard algebra studied in Tsallis statistical mechanics, yields sparsely supported optimal transport solutions. This entropy with a deformation parameter q interpolates between the negative Shannon entropy (q = 1) and the squared 2-norm (q = 0), and the solution becomes sparser as q tends to zero. Our theoretical analysis reveals that a larger q leads to faster convergence when the problem is optimized with the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm. In summary, the deformation induces a trade-off between sparsity and convergence speed.


Introduction
Optimal transport (OT) is a classic problem in operations research, and it is used to compute a transport plan between suppliers and demanders with a minimum transportation cost. The minimum transportation cost can be interpreted as the closeness between the distributions when the suppliers and demanders are regarded as two probability distributions. The OT problem has been extensively studied (also as the Wasserstein distance) [1] and used in robust machine learning [2], domain adaptation [3], generative modeling [4], and natural language processing [5], owing to its many useful properties, such as being a metric between probability distributions. Recently, the OT problem has been employed for various modern applications, such as interpretable word alignment [6] and the locality-aware evaluation of object detection [7], because it can capture the geometry of data and provide a measurement method for closeness and alignment among different objects. From a computational perspective, a naïve approach is to use a network simplex algorithm or interior point method to solve the OT problem as a usual linear program; this approach requires supercubic time complexity [8] and is not scalable. A number of approaches have been suggested to accelerate the computation of the OT problem: entropic regularization [9,10], accelerated gradient descent [11], and approximation with tree metrics [12] and graph metrics [13]. We focus our attention on entropic-regularized OT because it admits a unique solution owing to strong convexity and transforms the original constrained optimization into an unconstrained problem with a clear primal-dual relationship.

Preliminaries
For x ∈ R, let [x]_+ = x if x > 0 and 0 otherwise, and let [x]_+^p represent ([x]_+)^p hereafter. For a convex function f : X → R, where X represents a Euclidean vector space equipped with an inner product ⟨·, ·⟩, the Fenchel–Legendre conjugate f* : X → R is defined as f*(y) := sup_{x∈X} ⟨x, y⟩ − f(x). The relative interior of a set S is denoted by ri S, and the effective domain of a function f is denoted by dom(f). A differentiable function f is said to be M-strongly convex over S ⊆ ri dom(f) if, for all x, y ∈ S, we have f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (M/2)‖y − x‖_2^2. If f is twice differentiable, strong convexity is equivalent to ∇^2 f(x) ⪰ MI for all x ∈ S. Similarly, a differentiable function f is said to be M-smooth over S ⊆ ri dom(f) if, for all x, y ∈ S, we have ‖∇f(x) − ∇f(y)‖_2 ≤ M‖x − y‖_2, which is equivalent to ∇^2 f(x) ⪯ MI for all x ∈ S if f is twice differentiable.

Optimal Transport
OT is the mathematical problem of finding a transport plan between two probability distributions with the minimum transport cost. The discussions in this paper are restricted to discrete distributions. Let (X, d), δ_x, and Δ_{n−1} := {p ∈ [0, 1]^n | ⟨p, 1_n⟩ = 1} represent a metric space, the Dirac measure at point x, and the (n − 1)-dimensional probability simplex, respectively. Let µ = ∑_{i=1}^n a_i δ_{x_i} and ν = ∑_{j=1}^m b_j δ_{y_j} be histograms supported on the finite sets of points (x_i)_{i=1}^n ⊆ X and (y_j)_{j=1}^m ⊆ X, respectively, where a ∈ Δ_{n−1} and b ∈ Δ_{m−1} are probability vectors. The OT between two discrete probability measures µ and ν is the optimization problem

T(µ, ν) := min_{Π ∈ U(µ,ν)} ⟨D, Π⟩, (1)

where U(µ, ν) represents the transport polytope, defined as

U(µ, ν) := {Π ∈ R_{≥0}^{n×m} | Π1_m = a, Π^⊤1_n = b}. (2)

The transport polytope U(µ, ν) defines the constraints on the row/column marginals of a transport matrix Π. These constraints are often referred to as coupling constraints. For notational simplicity, the matrix D_ij := d(x_i, y_j) and the expectation ⟨D, Π⟩ := ∑_{i=1}^n ∑_{j=1}^m D_ij Π_ij are used hereafter. T(µ, ν) is known as the 1-Wasserstein distance, which defines a metric over histograms [1].
Equation (1) is a linear program and can be solved by well-studied algorithms such as the interior point and network simplex methods. However, its computational complexity is O(n^3 log n) (assuming n = m), so it is not scalable to large datasets [8].
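As a concrete illustration, Equation (1) can be solved directly as a linear program. The sketch below is a minimal example (synthetic points, uniform marginals, and the HiGHS backend of SciPy's `linprog` are illustrative assumptions); it encodes the coupling constraints Π1_m = a and Π^⊤1_n = b as equality constraints on the flattened transport matrix:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m = 4, 5
x = rng.normal(size=(n, 2))          # support points of mu
y = rng.normal(size=(m, 2))          # support points of nu
a = np.full(n, 1.0 / n)              # uniform source marginal
b = np.full(m, 1.0 / m)              # uniform target marginal
D = np.linalg.norm(x[:, None] - y[None, :], axis=2)  # cost matrix D_ij = d(x_i, y_j)

# Equality constraints encode the coupling constraints Pi 1_m = a and Pi^T 1_n = b
# on the row-major flattening of Pi.
A_eq = np.zeros((n + m, n * m))
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1.0   # row marginal of row i
for j in range(m):
    A_eq[n + j, j::m] = 1.0            # column marginal of column j
b_eq = np.concatenate([a, b])

res = linprog(D.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
Pi = res.x.reshape(n, m)
print("1-Wasserstein cost:", res.fun)
```

Because a vertex of the transport polytope has at most n + m − 1 nonzero entries, the LP solution is itself sparse; this is the unregularized baseline that the regularized variants discussed below approximate.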

Entropic Regularization and Sinkhorn Algorithm
The entropic-regularized formulation is commonly used to reduce the computational burden. Here, we introduce regularized OT with the negative Shannon entropy [9] as

T_λ(µ, ν) := min_{Π ∈ U(µ,ν)} ⟨D, Π⟩ − λH(Π), where H(Π) := −∑_{i,j} Π_ij (log Π_ij − 1), (3)

where λ > 0 represents the regularization strength. Let us review the derivation of the updates of the Sinkhorn algorithm. The Lagrangian of the optimization problem in Equation (3) is

L(Π, α, β) := ⟨D, Π⟩ − λH(Π) − ⟨α, Π1_m − a⟩ − ⟨β, Π^⊤1_n − b⟩, (4)

where α ∈ R^n and β ∈ R^m represent the Lagrangian multipliers. Equation (4) ignores the constraints Π_ij ≥ 0 (for all i ∈ [n] and j ∈ [m]); however, they will be automatically satisfied. By taking the derivative in Π_ij,

∇_{Π_ij} L = D_ij + λ log Π_ij − α_i − β_j, (5)

and, hence, the stationary condition ∇_{Π_ij} L = 0 induces the solution

Π_ij = exp((α_i + β_j − D_ij)/λ). (6)

This suggests that the stationary point is the Gibbs kernel exp(−D_ij/λ) rescaled by the dual variables. One can easily infer that the Sinkhorn solution is dense because the Gibbs kernel is supported on the entire R_{≥0}, i.e., exp(−z/λ) > 0 for all z ∈ R_{≥0}. We can write Equation (6) in a matrix form by applying the variable transforms u_i := exp(α_i/λ), v_j := exp(β_j/λ), and K_ij := exp(−D_ij/λ):

Π = diag(u) K diag(v). (7)

The following Sinkhorn updates are used to make Equation (7) meet the marginal constraints:

u ← a/(Kv) and v ← b/(K^⊤u), (8)

where z/η represents the element-wise division of the two vectors z and η. The computational complexity is O(Knm) because the Sinkhorn updates involve only matrix-vector multiplications and element-wise divisions; K here represents the number of Sinkhorn updates. A finer analysis of the number of updates required to meet the error tolerance is provided in the literature [25].
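The updates above can be sketched in a few lines of NumPy. This is a minimal illustration (uniform marginals, a fixed iteration count rather than a convergence test, and the parameter values are illustrative assumptions), not a production implementation:

```python
import numpy as np

def sinkhorn(D, a, b, lam=0.5, n_iters=1000):
    """Sinkhorn updates for entropic-regularized OT.

    The plan is parameterized as Pi = diag(u) K diag(v) with the Gibbs
    kernel K = exp(-D / lam); each update rescales u (resp. v) so that
    the row (resp. column) marginals match a (resp. b).
    """
    K = np.exp(-D / lam)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)        # element-wise division
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
n = m = 10
D = rng.random((n, m))
a = np.full(n, 1 / n)
b = np.full(m, 1 / m)
Pi = sinkhorn(D, a, b)
# Every entry of the Gibbs kernel is positive, so the plan is dense.
```

Note that a smaller λ sharpens the plan toward the unregularized solution but slows the updates and eventually causes numerical underflow in K, which is exactly the instability discussed later in the experiments.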

Regularized Optimal Transport and Its Dual
Let us consider the following primal problem with a general regularization function Ω.

Definition 1 (Primal of regularized OT).
T_Ω(µ, ν) := min_{Π ∈ U(µ,ν)} ⟨D, Π⟩ + ∑_{i=1}^n ∑_{j=1}^m Ω(Π_ij), (9)

where Ω : R → R represents a proper closed convex function.
Next, we derive its dual by Lagrange duality. The Lagrangian of Equation (9) is defined as

L(Π, α, β) := ⟨D, Π⟩ + ∑_{i,j} Ω(Π_ij) − ⟨α, Π1_m − a⟩ − ⟨β, Π^⊤1_n − b⟩, (10)

with dual variables α ∈ R^n and β ∈ R^m. Then, the primal can be rewritten in terms of the Lagrangian:

T_Ω(µ, ν) = min_{Π ∈ R_{≥0}^{n×m}} max_{α, β} L(Π, α, β). (11)

In this Lagrangian formulation, we let the constraints Π ∈ R_{≥0}^{n×m} remain for a technical reason. The constrained optimization problem in (11) can be reformulated into an unconstrained one by absorbing the nonnegativity constraints into an indicator function (12), which corresponds to an optimization problem with the convex objective ⟨D, Π⟩ + ∑_{i,j} Ω(Π_ij) under only the linear constraints Π1_m = a and Π^⊤1_n = b. By invoking the Sinkhorn–Knopp theorem [26], the existence of a strictly feasible solution, namely, a solution satisfying Π1_m = a and Π^⊤1_n = b with strictly positive entries, can be confirmed. Hence, we see that the Slater condition is satisfied, and strong duality holds as follows:

T_Ω(µ, ν) = max_{α, β} ⟨α, a⟩ + ⟨β, b⟩ − ∑_{i,j} Ω*(α_i + β_j − D_ij), (13)

where Ω* represents the Fenchel–Legendre conjugate of Ω restricted to R_{≥0}:

Ω*(η) := sup_{π ≥ 0} πη − Ω(π). (14)

Although each element of a transport plan ranges over [0, 1], it is sufficient to define the Fenchel–Legendre conjugate as the supremum over R_{≥0} because of how Ω* emerges in the strong duality (13). According to Danskin's theorem [27], the supremum in the Fenchel–Legendre conjugate is attained at

Π_ij = ∇Ω*(α_i + β_j − D_ij). (15)

Therefore, the dual of regularized OT is formulated as follows:

Definition 2 (Dual of regularized OT).

max_{α ∈ R^n, β ∈ R^m} ⟨α, a⟩ + ⟨β, b⟩ − ∑_{i=1}^n ∑_{j=1}^m Ω*(α_i + β_j − D_ij). (16)
Next, we review several examples, which are summarized in Table 1.
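To make the conjugate concrete, the following sketch numerically verifies a closed-form Fenchel–Legendre conjugate for a negative-Shannon-style regularizer on a single entry, Ω(x) = λ(x log x − x). This particular Ω and its conjugate Ω*(y) = λ·exp(y/λ) are standard, but the scalar setting and grid check are only illustrative:

```python
import numpy as np

lam = 0.5

def omega(x):
    # Negative-Shannon-style regularizer on one entry: Omega(x) = lam*(x log x - x).
    return lam * (x * np.log(x) - x)

def omega_conj(y):
    # Closed-form conjugate: sup_{x >= 0} x*y - Omega(x) is attained where
    # Omega'(x) = lam*log(x) equals y, i.e., at x = exp(y/lam), giving lam*exp(y/lam).
    return lam * np.exp(y / lam)

# Brute-force check of the supremum on a fine grid.
xs = np.linspace(1e-9, 20.0, 1_000_000)
for y in [-1.0, 0.0, 0.7]:
    sup = np.max(xs * y - omega(xs))
    assert abs(sup - omega_conj(y)) < 1e-3
```

Note that the maximizer x = exp(y/λ) is exactly the dual map ∇Ω*(y) = (∇Ω)^{-1}(y), illustrating the primal recovery in Equation (15): the shape of ∇Ω* determines the shape (and support) of the transport plan.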

q Algebra and Deformed Entropy
As shown in the last few examples, the dual map ∇Ω* plays an important role in the sparsity of the OT solution. In addition, the induced ∇Ω* is the Gibbs kernel when the negative Shannon entropy is used as Ω. Therefore, one may think of designing a regularizer from ∇Ω* by utilizing a kernel function that induces sparsity. One candidate is a q-exponential distribution. We begin with some basics required to formulate q-exponential distributions.
First, we introduce the q-algebra, which has been well studied in the field of Tsallis statistical mechanics [23,29,30]. The q-algebra has been used in the machine-learning literature for regression [31], Bayesian inference [32], and robust learning [33]. For a deformation parameter q ∈ [0, 1], the q-logarithm and q-exponential functions are defined as

log_q(x) := (x^{1−q} − 1)/(1 − q) and exp_q(x) := [1 + (1 − q)x]_+^{1/(1−q)}. (17)

The q-logarithm is defined only for x > 0, as is the natural logarithm; the two functions are inverse to each other (in an appropriate domain), and they recover the natural definitions of the logarithm and exponential as q → 1. Their derivatives are (log_q(x))′ = 1/x^q and (exp_q(x))′ = exp_q(x)^q, respectively. The additive factorization property exp(x + y) = exp(x) exp(y) satisfied by the natural exponential no longer holds for the q-exponential: in general, exp_q(x + y) ≠ exp_q(x) exp_q(y) = exp_q(x + y + (1 − q)xy). Instead, we can construct another algebraic structure by introducing the other operation called the q-product ⊗_q:

x ⊗_q y := [x^{1−q} + y^{1−q} − 1]_+^{1/(1−q)}. (18)

With this product, the pseudo-additive factorization exp_q(x + y) = exp_q(x) ⊗_q exp_q(y) holds. Thus, the q-algebra captures rich nonlinear structures, and it is often used to extend the Shannon entropy to the Tsallis entropy [23]:

T_q(π) := −⟨π^q, log_q(π)⟩. (19)

One can see that the Tsallis entropy has an equivalent power formulation T_q(π) = ∑_{i=1}^n (π_i^q − π_i)/(1 − q), which means that it is often suitable for modeling heavy-tailed phenomena such as the power law. Although the introduced q-logarithm and q-exponential can look arbitrary, they can be axiomatically derived by assuming the essential properties of the algebra (see Naudts [29]). For more physical insights, we refer readers to the literature [30].
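The q-algebra operations can be checked numerically. The sketch below implements log_q, exp_q, and the q-product as defined above and verifies the inverse relation, the failure of additive factorization, and the pseudo-additive factorization (the test values are arbitrary):

```python
import numpy as np

def log_q(x, q):
    # q-logarithm: (x**(1-q) - 1) / (1-q) for x > 0; q -> 1 recovers log.
    if q == 1.0:
        return np.log(x)
    return (x ** (1 - q) - 1) / (1 - q)

def exp_q(x, q):
    # q-exponential: [1 + (1-q) x]_+ ** (1/(1-q)); q -> 1 recovers exp.
    if q == 1.0:
        return np.exp(x)
    return np.maximum(1 + (1 - q) * x, 0.0) ** (1 / (1 - q))

def q_product(x, y, q):
    # q-product: [x**(1-q) + y**(1-q) - 1]_+ ** (1/(1-q)).
    return np.maximum(x ** (1 - q) + y ** (1 - q) - 1, 0.0) ** (1 / (1 - q))

q = 0.5
x, y = 0.8, 1.3
# log_q and exp_q are inverse to each other (on an appropriate domain).
assert np.isclose(exp_q(log_q(x, q), q), x)
# Additive factorization fails for exp_q ...
assert not np.isclose(exp_q(x + y, q), exp_q(x, q) * exp_q(y, q))
# ... but the pseudo-additive factorization with the q-product holds.
assert np.isclose(exp_q(x + y, q), q_product(exp_q(x, q), exp_q(y, q), q))
```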
Next, we introduce the q-exponential distribution. We use a simpler form for our purpose; more general formulations of the q-exponential distribution have been introduced in the literature [22]. Given the form of the Gibbs kernel k(ξ) := exp(−ξ/λ), we define the q-Gibbs kernel as follows: Definition 3 (q-Gibbs kernel). For ξ ≥ 0, we define the q-Gibbs kernel as k_q(ξ) := exp_q(−ξ/λ) for a deformation parameter q ∈ [0, 1] and a temperature parameter λ ∈ R_{>0}.
If we take ξ as the (centered) squared distance, then k_q(ξ) represents the q-Gaussian distribution [22]. We illustrate the q-Gibbs kernel with different deformation parameters in Figure 1. By definition, the support of the q-Gibbs kernel is supp(k_q) = [0, λ/(1 − q)] for q ∈ [0, 1) and supp(k_q) = R_{≥0} for q = 1. This indicates that the q-Gibbs kernel ignores the effect of a too-large ξ (or too large a distance between two points); its threshold is smoothly controlled by the temperature parameter λ and the deformation parameter q.
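The compact support of the q-Gibbs kernel is easy to verify numerically; a minimal sketch, assuming q ∈ [0, 1) (the parameter values are illustrative):

```python
import numpy as np

def q_gibbs(xi, q, lam):
    # q-Gibbs kernel k_q(xi) = exp_q(-xi/lam) = [1 - (1-q) xi / lam]_+ ** (1/(1-q)),
    # valid for q in [0, 1); it vanishes exactly once xi reaches lam / (1-q).
    return np.maximum(1 - (1 - q) * xi / lam, 0.0) ** (1 / (1 - q))

q, lam = 0.5, 1.0
xi = np.linspace(0.0, 4.0, 401)
k = q_gibbs(xi, q, lam)
threshold = lam / (1 - q)              # support boundary lam/(1-q) = 2.0 here
assert np.all(k[xi >= threshold] == 0)  # exact zeros beyond the threshold
assert np.all(k[xi < threshold] > 0)
```

In contrast, the ordinary Gibbs kernel exp(−ξ/λ) is strictly positive for every ξ, which is the source of the dense Sinkhorn solutions.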
Finally, we derive an entropic regularizer that induces sparsity by using the q-Gibbs kernel. Given the stationary condition in Equation (15), we impose the following functional form on the dual map:

∇Ω*(α_i + β_j − D_ij) = exp_q((α_i + β_j − D_ij)/λ), (20)

so that a sufficiently large input distance D_ij drives Π_ij to zero; note that exp_q(−D_ij/λ) = 0 does not immediately imply Π_ij = 0 because the q-product ⊗_q lacks an absorbing element. By solving Equation (20) for Ω, we obtain the regularizer below; for completeness, its derivation is shown in Appendix A. Hence, we define the deformed q-entropy as follows:

Definition 4 (Deformed q-entropy). For π ∈ Δ_{n−1}, the deformed q-entropy is defined as

H_q(π) := (1/(2 − q)) ⟨π, 1_n − log_q(π)⟩ = ∑_{i=1}^n ((2 − q)π_i − π_i^{2−q}) / ((1 − q)(2 − q)). (21)

The deformed q-entropic regularizer for an element π_i is Ω(π_i) = −λ((2 − q)π_i − π_i^{2−q})/((1 − q)(2 − q)), whose derivative is Ω′(π_i) = λ log_q(π_i). The deformed q-entropy recovers the Shannon entropy at the limit q → 1 (up to an additive constant on the simplex):

lim_{q→1} H_q(π) = ∑_{i=1}^n (π_i − π_i log π_i) = 1 − ∑_{i=1}^n π_i log π_i. (22)

In addition, the limit q → 0 recovers the negative of the squared 2-norm (again up to an additive constant):

H_0(π) = ∑_{i=1}^n (π_i − π_i^2/2) = 1 − ‖π‖_2^2/2. (23)

Therefore, the deformed q-entropy interpolates between the Shannon entropy and the squared 2-norm. Hereafter, we consider the regularized OT with the deformed q-entropy,

T_{λ,q}(µ, ν) := min_{Π ∈ U(µ,ν)} ⟨D, Π⟩ − λH_q(Π), (24)

by solving its dual counterpart. The deformed q-entropy differs from the Tsallis entropy T_q (see Equation (19)) in that the Tsallis entropy is defined with the q-expectation ⟨π^q, ·⟩ [34], whereas the deformed q-entropy is defined with the usual expectation ⟨π, ·⟩; both are defined through the q-logarithm.
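The following sketch evaluates the deformed q-entropy using the closed form in Definition 4 and checks its two limits numerically. The additive constant 1 in both checks arises because ∑π_i = 1 on the simplex, and the q → 1 case is checked at q = 1 − 10^{−7} (these tolerances and the test vector are illustrative assumptions):

```python
import numpy as np

def deformed_q_entropy(pi, q):
    # H_q(pi) = sum_i ((2-q) pi_i - pi_i**(2-q)) / ((1-q)(2-q)), q in [0, 1),
    # obtained by integrating the q-logarithm: d/dpi (-H_q) = log_q(pi).
    return np.sum(((2 - q) * pi - pi ** (2 - q)) / ((1 - q) * (2 - q)))

pi = np.array([0.1, 0.2, 0.3, 0.4])

# q = 0 recovers (up to an additive constant on the simplex) the negative squared 2-norm:
assert np.isclose(deformed_q_entropy(pi, 0.0), 1.0 - 0.5 * np.sum(pi ** 2))

# q -> 1 recovers the Shannon entropy (again up to an additive constant):
shannon = -np.sum(pi * np.log(pi))
assert np.isclose(deformed_q_entropy(pi, 1.0 - 1e-7), 1.0 + shannon, atol=1e-5)
```

Concavity of H_q on the nonnegative orthant can be seen from the second derivative of each summand, −(1 − q)^{-1}(1 − q)π^{−q} = −π^{−q} < 0 for π > 0.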

Remark 1.
The primary reason we picked the deformed q-entropy H_q to design the regularizer is its natural connection to the q-Gibbs kernel through the dual map, ∇(−λH_q)*(η) = exp_q(η/λ). When the Tsallis entropy T_q is used, the dual map is

∇(−λT_q)*(η) = q^{1/(1−q)} / exp_q(−η/λ), (25)

which is not naturally connected to the q-Gibbs kernel. Muzellec et al. [35] proposed regularized OT with the Tsallis entropy, but they did not discuss its sparsity. As we show in Appendix D.1, the Tsallis entropy does not empirically induce sparsity.
In Figure 2, the deformed q-entropy with different deformation parameters is plotted over the one-dimensional simplex Δ_1. One can easily confirm that H_q(π) is concave for π ∈ R_{≥0}^n, as illustrated in the figure.

Optimization Algorithm
We occasionally write Ω = −λH_q to simplify the notation in this section. By simple algebra, we confirm

Ω*(η) = (λ/(2 − q)) exp_q(η/λ)^{2−q}, (26)

which is convex, in accordance with the concavity of H_q. To solve Equation (24), we solve the dual

min_{z ∈ R^{n+m}} F(z) := −⟨α, a⟩ − ⟨β, b⟩ + ∑_{i,j} Ω*(α_i + β_j − D_ij), (27)

where z := (α, β) denotes the dual variables. As Equation (27) is an unconstrained optimization problem, many well-known optimization solvers can be used to solve it; here, we use the BFGS method [24]. For the sake of the convergence analysis (Section 4.2), we optimize the convex ℓ_2-regularized objective

F̃(z) := F(z) + (κ/2)‖z‖_2^2, (28)

where κ > 0 represents the ℓ_2-regularization parameter. In practice, the ℓ_2 regularization hardly affects the performance when κ is sufficiently small. We can characterize the convergence rate by introducing (small) ℓ_2 regularization, which makes the objective strongly convex; a convergence guarantee without a rate is still possible without ℓ_2 regularization [36]. We briefly summarize the algorithm in Algorithm 1, where d^(k), ρ^(k), and g^(k) := ∇F̃(z^(k)) represent the kth update direction, the kth step size, and the gradient at the current iterate z^(k), respectively; s^(k) := z^(k+1) − z^(k) and y^(k) := g^(k+1) − g^(k) are the differences of the dual variables and gradients between the next and current steps, respectively. Furthermore, let (γ, γ′) be the tolerance parameters for the Wolfe conditions, i.e., update directions and step sizes satisfy the conditions

F̃(z^(k) + ρ^(k)d^(k)) ≤ F̃(z^(k)) + γρ^(k)⟨g^(k), d^(k)⟩, (30)
⟨∇F̃(z^(k) + ρ^(k)d^(k)), d^(k)⟩ ≥ γ′⟨g^(k), d^(k)⟩. (31)

Algorithm 1: BFGS algorithm for dual regularized OT. Input: initial point z^(0); tolerance parameter 0 < γ < 1/2 for the Armijo condition; tolerance parameter γ < γ′ < 1 for the curvature condition; and initial Hessian estimate B^(0) = I. After obtaining the dual solution (α̂, β̂), the primal solution can be recovered from Equation (15).
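A minimal sketch of this approach is given below. It minimizes the ℓ_2-regularized dual with SciPy's L-BFGS-B (the quasi-Newton variant used in the experiments of Section 5), using the conjugate Ω*(η) = λ/(2 − q)·exp_q(η/λ)^{2−q} and the primal recovery Π_ij = exp_q((α_i + β_j − D_ij)/λ); the problem size, tolerances, and random cost matrix are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

def exp_q(x, q):
    # q-exponential: [1 + (1-q) x]_+ ** (1/(1-q)), for q in [0, 1).
    return np.maximum(1 + (1 - q) * x, 0.0) ** (1 / (1 - q))

def solve_q_dot(D, a, b, q=0.5, lam=0.1, kappa=1e-6):
    """Minimize the l2-regularized dual of q-DOT with L-BFGS-B.

    F(alpha, beta) = -<alpha, a> - <beta, b>
                     + sum_ij Omega*(alpha_i + beta_j - D_ij) + (kappa/2)||z||^2,
    with Omega*(eta) = lam/(2-q) * exp_q(eta/lam)**(2-q), so that the gradient
    of the sum term is exactly the candidate primal plan P.
    """
    n, m = D.shape

    def fun_and_grad(z):
        alpha, beta = z[:n], z[n:]
        P = exp_q((alpha[:, None] + beta[None, :] - D) / lam, q)
        f = (-alpha @ a - beta @ b
             + lam / (2 - q) * np.sum(P ** (2 - q))
             + 0.5 * kappa * (z @ z))
        grad = np.concatenate([P.sum(axis=1) - a, P.sum(axis=0) - b]) + kappa * z
        return f, grad

    res = minimize(fun_and_grad, np.zeros(n + m), jac=True, method="L-BFGS-B",
                   options={"maxiter": 5000, "ftol": 1e-15, "gtol": 1e-9})
    alpha, beta = res.x[:n], res.x[n:]
    return exp_q((alpha[:, None] + beta[None, :] - D) / lam, q)

rng = np.random.default_rng(0)
n = m = 20
D = rng.random((n, m))
a = np.full(n, 1 / n)
b = np.full(m, 1 / m)
Pi = solve_q_dot(D, a, b)
# Unlike the Sinkhorn solution, many entries of Pi are exactly zero.
```

Because exp_q clips its argument at zero, entries of the recovered plan are exact zeros rather than tiny positive values, which is what produces the sparsity measured in the experiments.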

Convergence Analysis
We provide a convergence guarantee for Algorithm 1. A technical assumption is stated beforehand.
Assumption 1. For some τ ∈ (0, 1), let Z_τ denote the set of dual variables z = (α, β) whose induced primal variables (via Equation (15)) are all at most τ. Assume that z^(K) obtained by Algorithm 1 and z* are contained in Z_τ.
The dual map ∇Ω* translates dual variables into primal variables, as in Equation (15). It is easy to confirm that Z_τ is a closed convex set, owing to the monotonicity of the dual map ∇Ω* (which follows from the convexity of Ω*). Assumption 1 essentially assumes that all elements of the primal matrices (induced by z^(K) and z*) are strictly less than 1; this always holds for z* (unless n = m = 1) because of the strong duality. Moreover, this assumption is natural for z^(K) values sufficiently close to the optimum z*. The bound parameter τ is a key element for characterizing the convergence speed.
Theorem 1 shows that Algorithm 1 converges linearly: the optimality gap F̃(z^(K)) − F̃* decays geometrically in K (Equation (33)), where F̃* := inf_z F̃(z) represents the optimal value of the ℓ_2-regularized dual objective and 0 < r < 1 is an absolute constant independent of (λ, τ, q, N).
The proof is shown in Section 4.3. We conclude that a larger deformation parameter q yields faster convergence because the coefficient in Equation (33) is O(τ^{q/2}) with the base √τ < 1. Therefore, the deformation parameter introduces a new trade-off: q → 0 yields a sparser solution but slows down the convergence, whereas q → 1 improves the convergence while sacrificing sparsity. One may obtain the solution faster than with the squared 2-norm regularizer used in Blondel et al. [20], which corresponds to the case q = 0, by modulating the deformation parameter q.
In regularized OT, it is a common approach to use weaker regularization (i.e., a smaller λ) to obtain a solution that is sparser and closer to the unregularized solution; however, a smaller λ results in numerical instability and slow computation [37]. This can be observed from Equation (33) because a smaller λ makes its upper bound considerably larger.
Subsequently, we compare the computational complexity of q-DOT with the BFGS method and the Sinkhorn algorithm. Altschuler et al. [25] showed that the Sinkhorn algorithm satisfies the coupling constraints within ℓ_1 error ε in O(N^2 (log N) ε^{−3}) time, which is a sublinear convergence rate. In contrast, our convergence rate in Equation (33) controls the gradient norm of the dual objective, and ∇Ω*(·) represents the mapping from the dual variables (α_i, β_j) to the primal transport matrix Π_ij in Equation (15), so the gradient of F consists of the coupling-constraint residuals. Therefore, the gradient norm of F̃ and the coupling constraint error are comparable when the ℓ_2-regularization parameter κ is sufficiently small. The overall computational complexity is O(N^2 log(Nε^{−1})) because one step of Algorithm 1 runs in O(N^2) time; this is a linear convergence rate. To confirm that one step of Algorithm 1 runs in O(N^2) time, we note that the update direction can be computed in O(N^2) time by using the Sherman–Morrison formula to invert B^(k). In addition, the Hessian estimate can be updated in O(N^2) time because the BFGS update modifies B^(k) by rank-one terms, and the computation of its inverse only requires matrix-vector products of size N. Hence, Algorithm 1 exhibits better convergence in terms of the stopping criterion ε. The comparison is summarized in Table 2. Table 2. Comparison of the computational complexity of the Sinkhorn algorithm and deformed q-optimal transport. N = max{n, m}.

Proofs
To prove Theorem 1, we leverage several lemmas shown below. Lemma 2 is based on Powell [24] and Byrd et al. [36]. The missing proofs are provided in Appendix C.

Remark 2.
More precisely, Altschuler et al. [25] showed that the Sinkhorn algorithm converges in O(N^2 L^3 (log N) ε^{−3}) time, where L := ‖D‖_∞. For q-DOT, its computational complexity is not directly comparable to that of the Sinkhorn algorithm in terms of L; instead, the following analysis provides a qualitative comparison. First, the convergence rate of q-DOT in Equation (33) translates into the iteration complexity K = O(log(Nε^{−1})/log(1/r)). The rate r is introduced in the proof of Theorem 1 (see Equation (40)) and satisfies, roughly, r ≥ 1 − O(R^{3q} N^{−2}). Then, by the Taylor expansion, we have the rough estimate K ≈ O(N^2 R^{−3q} log(Nε^{−1})), where R is a bound on the possible primal variables defined in Equation (35). We cannot directly compare R^{−q} and L; nevertheless, R^{−q} and L can be considered to be of the same magnitude given a reasonably sized domain Z, noting that ∇Ω(π) ≈ O(π^{1−q}). Hence, it is reasonable to suppose that both the Sinkhorn algorithm and q-DOT roughly converge in cubic time with respect to L.

Sparsity
All the simulations described in this section were executed on a 2.7 GHz quad-core Intel® Core™ i7 processor. We used the following synthetic dataset: (x_i)_{i=1}^n ∼ N(1_2, I_2), (y_j)_{j=1}^m ∼ N(−1_2, I_2), and n = m = 30, where N(µ, Σ) represents the Gaussian distribution with mean µ and covariance Σ. For each of the unregularized OT, q-DOT, and the Sinkhorn algorithm, we computed the transport matrices. For q-DOT and the Sinkhorn algorithm, different regularization parameters λ ∈ {1 × 10^{−2}, 1 × 10^{−1}, 1} were compared, and ε = 1 × 10^{−6} was used as the stopping criterion: q-DOT stopped once the gradient norm fell below ε, and the Sinkhorn algorithm stopped once the ℓ_1 error of the coupling constraints fell below ε. We compared different deformation parameters q ∈ {0, 0.25, 0.5, 0.75} and fixed the dual ℓ_2-regularization parameter to κ = 1 × 10^{−6} for q-DOT. q-DOT with q = 0 corresponds to regularized OT with the squared 2-norm proposed by Blondel et al. [20]. For the unregularized OT, we used the implementation in the Python Optimal Transport package [38]. For q-DOT, we used the L-BFGS-B method (instead of vanilla BFGS) provided by the SciPy package [39]. To determine zero entries in the transport matrix, we did not impose any positive threshold to disregard small values (as in Swanson et al. [6]) but instead regarded entries smaller than the machine epsilon as zero.
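The zero-entry criterion above amounts to a one-line sparsity metric; a sketch (the example matrix is illustrative):

```python
import numpy as np

def sparsity(Pi):
    # Ratio of zero entries; entries below the machine epsilon are regarded as
    # zero, rather than applying an arbitrary positive threshold.
    return np.mean(np.abs(Pi) < np.finfo(float).eps)

Pi = np.array([[0.5, 0.0, 1e-20],
               [0.0, 0.5, 0.0]])
assert sparsity(Pi) == 4 / 6   # the 1e-20 entry counts as zero
```

This criterion is deliberately strict: it only rewards methods (such as q-DOT or the LP solver) that produce exact zeros, not methods whose entries are merely small.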
The simulation results are shown in Table 3 and Figure 3. First, we qualitatively evaluated each method using Figure 3: q-DOT obtained a transport matrix very similar to the unregularized OT solution. The solution became slightly blurred as q and λ increased. In contrast, the Sinkhorn algorithm output considerably blurred transport matrices. Furthermore, the Sinkhorn algorithm was numerically unstable under very small regularization such as λ = 0.01.
From Table 3, we further quantitatively observed this behavior. The transport matrices obtained by q-DOT were very sparse in most cases, and their sparsity was close to that of the unregularized OT. Furthermore, we observed that smaller q and λ yielded sparser solutions. Notably, the Sinkhorn algorithm produced completely dense matrices (sparsity = 0). Although the transport matrices of q-DOT with (q, λ) = (0.5, 1), (0.75, 1) appear somewhat similar to the Sinkhorn solutions in Figure 3, the former are much sparser. This suggests that a deformation parameter q slightly smaller than 1 is sufficient for q-DOT to output a sparse transport matrix. For the obtained cost values ⟨D, Π⟩, we did not see a clear advantage of any specific q and λ among the q-DOT results. Nevertheless, it is evident that q-DOT estimated the Wasserstein cost more accurately than the Sinkhorn algorithm regardless of the q and λ used in this simulation. Table 3. Comparison of the sparsity and cost with the synthetic dataset. Sparsity indicates the ratio of zero entries in each transport matrix. We counted the number of entries smaller than the machine epsilon to measure the sparsity instead of imposing a small positive threshold for determining zero entries. Sinkhorn (λ = 0.01) does not work well because of numerical instability.


Runtime Comparison
We compared the runtimes of q-DOT and the Sinkhorn algorithm using the same dataset as in Section 5.1 but with different dataset sizes: n = m ∈ {100, 300, 500, 1000}. The parameter choices were the same as in Section 5.1, except that the regularization parameter was fixed to λ = 0.1. The results are shown in Figure 4: a larger deformation parameter q makes q-DOT converge faster when n = m = 100. When n = m ≥ 300, the differences among q = 0, q = 0.25, and q = 0.5 were less evident. This may be partly because we fixed κ = 1 × 10^{−6} for all experiments, unlike the oracle parameter choice κ = 2Nτ^q λ^{−1} (in Theorem 1), which depends on q. Nonetheless, q = 0.75 was clearly superior to the smaller values of q. In summary, the trade-off between sparsity and computation speed resulting from the deformation parameter q, theoretically established in Theorem 1, was empirically observed.

Approximation of 1-Wasserstein Distance
Finally, we compared the approximation errors of the 1-Wasserstein distance |⟨D, Π̂⟩ − ⟨D, Π*⟩| of q-DOT and the Sinkhorn algorithm with different q and λ, where Π̂ represents the computed transport matrix and Π* ∈ arg min_{Π ∈ U(µ,ν)} ⟨D, Π⟩ represents the LP solution. We used the same dataset and stopping criterion ε as described in Section 5.1. For the range of q, we used q ∈ {0.00, 0.25, 0.50, 0.75}; for the range of λ, we used λ ∈ {0.05, 0.1, 0.5}.
The results are shown in Figure 5. The difference was not significant when q was small, such as q ∈ {0.00, 0.25}. Once q became larger, such as q ∈ {0.50, 0.75}, the approximation error evidently worsened. The Sinkhorn algorithm always exhibited worse approximation errors than q-DOT for the range of q used in this simulation, regardless of λ. Formal guarantees for the 1-Wasserstein approximation error (such as those of Altschuler et al. [25] and Weed [40]) will be considered in future work.

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript:
BFGS    Broyden–Fletcher–Goldfarb–Shanno
q-DOT   Deformed q-optimal transport
L-BFGS  Limited-memory BFGS
OT      Optimal transport

Appendix B. Additional Lemmas
Note again that we let M_1, M_2 > 0 be the strong convexity and smoothness constants of F̃ over Z, N := max{n, m}, and z* ∈ arg min_{z∈Z} F̃(z).
Lemma A1. For all k, y^(k) = Ḡ^(k) s^(k), where Ḡ^(k) := ∫_0^1 ∇^2 F̃(z^(k) + ts^(k)) dt; in addition, M_1‖s^(k)‖_2^2 ≤ ⟨y^(k), s^(k)⟩ ≤ M_2‖s^(k)‖_2^2.

Proof. By the chain rule and the fundamental theorem of calculus, y^(k) = g^(k+1) − g^(k) = Ḡ^(k) s^(k), which proves the first statement. Because F̃ is M_1-strongly convex and M_2-smooth (over Z), we have M_1‖w‖_2^2 ≤ ⟨w, ∇^2 F̃(z) w⟩ ≤ M_2‖w‖_2^2 for all z ∈ Z and w. By choosing z = z^(k) + ts^(k) and w = s^(k), we obtain the second statement. Note that z^(k) + ts^(k) ∈ Z follows from the definition of Z in Equation (35).
Lemma A2. For all k, ‖z^(k) − z*‖_2 ≤ ‖g^(k)‖_2 / M_1.

Proof. Because F̃ is M_1-strongly convex over Z, M_1‖z^(k) − z*‖_2^2 ≤ ⟨g^(k) − ∇F̃(z*), z^(k) − z*⟩ ≤ ‖g^(k)‖_2 ‖z^(k) − z*‖_2, where the second inequality follows from the optimality of z* (i.e., ∇F̃(z*) = 0) and the Cauchy–Schwarz inequality.
Lemma A3. The following hold:

det(B^(K)) ≤ (Kc_2/(n + m))^{n+m}, (A10)
det(B^(K)) = ∏_{k=0}^{K−1} ⟨y^(k), s^(k)⟩ / ⟨s^(k), B^(k) s^(k)⟩, (A12)

where c_2 := (n + m)/K + M_2 is defined in Lemma 2.
Proof. To prove Equation (A10), we use the linearity of the trace and tr(ba^⊤) = a^⊤b to evaluate tr(B^(k+1)) as follows:

tr(B^(k+1)) = tr(B^(k)) − ‖B^(k)s^(k)‖_2^2/⟨s^(k), B^(k)s^(k)⟩ + ‖y^(k)‖_2^2/⟨y^(k), s^(k)⟩ ≤ tr(B^(k)) + M_2,

where Lemma A1 is used in the last inequality. Note that the trace is the sum of the eigenvalues, whereas the determinant is the product of the eigenvalues. Then, we can use the AM–GM inequality to translate the determinant into the trace as follows:

det(B^(k+1)) ≤ (tr(B^(k+1))/(n + m))^{n+m}.

Hence, by substituting k = K − 1 and tr(B^(0)) = n + m, Equation (A10) is proven.
To prove Equation (A12), we use the matrix determinant lemma to expand det(B^(k+1)) (Equation (A17)). Further, by the Sherman–Morrison formula, we obtain Equation (A18). By plugging Equation (A18) into Equation (A17), we have

det(B^(k+1)) = det(B^(k)) ⟨y^(k), s^(k)⟩ / ⟨s^(k), B^(k) s^(k)⟩, (A19)

where the matrix determinant lemma is invoked again in the second identity. Recursively applying Equation (A19) with det(B^(0)) = 1, we obtain Equation (A12).
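The matrix determinant lemma used here, det(B + uv^⊤) = (1 + v^⊤B^{-1}u) det(B), can be verified numerically on a small symmetric positive definite example (the sizes and scales are illustrative; we take v = u, matching the rank-one terms of the BFGS update):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
B = np.eye(N) + 0.02 * rng.standard_normal((N, N))
B = B @ B.T                     # positive definite, mimicking a Hessian estimate
u = rng.standard_normal(N)

# Matrix determinant lemma for a symmetric rank-one update:
# det(B + u u^T) = (1 + u^T B^{-1} u) det(B).
lhs = np.linalg.det(B + np.outer(u, u))
rhs = (1.0 + u @ np.linalg.solve(B, u)) * np.linalg.det(B)
assert np.isclose(lhs, rhs, rtol=1e-8)
```

Together with the Sherman–Morrison formula, this is what lets the determinant (and inverse) of B^(k) be tracked across iterations without any O(N^3) recomputation.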
Lemma A5. For each k, let θ_k be the angle between s^(k) and −g^(k). Then, the following bound holds, where c_1, c_4, and c_5 are defined in Lemma 2.

Appendix C. Deferred Proofs
Appendix C.1. Proof of Lemma 1

Proof. It is easy to confirm that M_1 = κ because F̃ is the sum of F (convex) and (κ/2)‖z‖_2^2. Because F̃ is twice differentiable and Z is a closed convex set, we evaluate the smoothness parameter M_2 (over Z) via the eigenvalues of ∇^2 F̃(z). We begin by evaluating the eigenvalues of ∇^2 F(z) and then obtain those of ∇^2 F̃(z) from ∇^2 F̃(z) = ∇^2 F(z) + κI. Let P(z) ∈ R^{n×m} be the matrix such that P_ij(z) := ∇Ω*(−D_ij − α_i − β_j); here, P_ij(z) is the primal variable corresponding to the dual variables (α_i, β_j) (see Equation (15)). The gradient of F consists of the coupling-constraint residuals of P(z), and the entries of the Hessian of F are, up to the factor 1/λ, given by the element-wise power P(z)^q. Then, by invoking the Gershgorin circle theorem (Theorem 7.2.1 of [41]), the eigenvalues of the Hessian can be upper bounded by

(2/λ) max{ max_i ∑_j P_ij(z)^q, max_j ∑_i P_ij(z)^q } ≤ 2NR^q/λ,

where we use 0 ≤ P_ij(z) ≤ R for all i, j, and z ∈ Z in the last inequality. Hence, M_2 ≤ κ + 2NR^q/λ. The bound M_2 ≤ κ + 2Nτ^q/λ is confirmed by noting that 0 ≤ P_ij(z) ≤ τ for all i, j, and z ∈ Z_τ and that Z_τ is a closed convex set.
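The Gershgorin circle theorem invoked above states that every eigenvalue lies in a disc centered at a diagonal entry with radius equal to the corresponding off-diagonal absolute row sum; for a symmetric matrix with nonnegative entries, the largest eigenvalue is therefore at most the maximum row sum. A small numeric check (random symmetric matrix, illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
H = rng.random((N, N))
H = (H + H.T) / 2               # symmetric with nonnegative entries

# Gershgorin: each eigenvalue lies in a disc centered at H_ii with radius
# sum_{j != i} |H_ij|; with nonnegative entries, the largest eigenvalue is
# bounded by the maximum absolute row sum.
gershgorin_bound = np.max(np.sum(np.abs(H), axis=1))
assert np.max(np.linalg.eigvalsh(H)) <= gershgorin_bound + 1e-9
```

In the proof, the row sums of the Hessian blocks are exactly what produce the 2NR^q/λ bound on M_2.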

Appendix C.2. Proof of Lemma 2
Proof. First, we evaluate the ratio between F̃(z^(k+1)) − F̃* and F̃(z^(k)) − F̃* for k = 0, 1, 2, . . . , K − 1. Let θ_k be the angle between the vectors s^(k) and −g^(k). By the Armijo condition (Equation (30)), the difference F̃(z^(k+1)) − F̃(z^(k)) can be evaluated as in Equation (A32). In addition, by the curvature condition (Equation (31)) together with Lemma A1, we can lower bound the step length in terms of cos θ_k as in Equation (A34). Hence, we have

F̃(z^(k+1)) − F̃* ≤ (1 − c_1 cos^2 θ_k)(F̃(z^(k)) − F̃*), (A35)

where c_1 is defined in Lemma 2; Equation (A32) is used in the first inequality; Equation (A34) is used in the second inequality; Lemma A2 is used in the third inequality; and a consequence of convexity, F̃(z^(k)) − F̃(z*) ≤ ⟨g^(k), z^(k) − z*⟩ ≤ ‖g^(k)‖_2 ‖z^(k) − z*‖_2, is used in the fourth inequality. Next, recursively invoking Equation (A35), we obtain the desired bound. The last inequality is due to Lemma 3.

Appendix C.3. Proof of Lemma 3
Proof. The claim follows by substituting the definitions of the constants c_1, c_2, c_3, c_4, and c_5 and simplifying.

Appendix D.1. Regularization with the Tsallis Entropy

Table A1. Comparison of the sparsity and absolute error on the synthetic dataset. Sparsity indicates the ratio of zero entries in each transport matrix. We counted the number of entries smaller than the machine epsilon to measure the sparsity instead of imposing a small positive threshold for determining zero entries. Abs. error indicates the absolute error of the computed cost with respect to the 1-Wasserstein distance. Tsallis-regularized OT with q = 0.00 does not work due to numerical instability.

As can be seen from the results in Table A1, the Tsallis entropic regularizer neither induces sparsity nor achieves a better approximation of the 1-Wasserstein distance than the deformed q-entropy. Note that the Tsallis entropy induces the dual map ∇Ω*(η) = q^{1/(1−q)}/exp_q(−η/λ) shown in Equation (25), which has dense support for q > 0 and thus becomes the source of dense transport matrices. This verifies that the design of the regularizer is important for regularized optimal transport.

Appendix D.2. Hyperparameter Sensitivity
In this section, we summarize more comprehensive experimental results for q-DOT and the Sinkhorn algorithm to show the dependence of performance on the hyperparameters q and λ. We describe experiments showing the sparsity of transport matrices, the absolute error of computed costs with respect to the 1-Wasserstein distance, and the runtime on differently sized datasets.
The simulations in this section were executed on a 2.7 GHz Intel® Xeon® Gold 6258R processor (different from the processor used in Section 5). We used the following synthetic dataset: (x_i)_{i=1}^n ∼ N(1_2, I_2), (y_j)_{j=1}^m ∼ N(−1_2, I_2), with different N (= n = m) ∈ {100, 300, 500, 1000, 2000, 3000}. For q-DOT and Tsallis-regularized OT, different regularization parameters λ ∈ {0.01, 0.1, 1} were compared, and ε = 1 × 10^{−6} was used as the stopping criterion. We compared different deformation parameters q ∈ {0, 0.25, 0.5, 0.75}. For the unregularized OT, we used the implementation in the Python Optimal Transport package [38]. For q-DOT, we used the L-BFGS-B method provided by the SciPy package [39]. To determine zero entries in the transport matrix, we regarded entries smaller than the machine epsilon as zero.
The results are shown in Table A2, where the results with q = 1.00 correspond to the Sinkhorn algorithm. The results for (q, λ) = (1.00, 0.01) are missing because this setting did not work well due to numerical instability. In general, we observed behavior similar to that described in Section 5: sparsity intensified as q and λ decreased, at the cost of increased runtime. As N increased, nonmonotonic trends in runtime were observed with respect to q: for a fixed λ, larger q accelerated the computation, while q = 0.25 appeared to be the slowest. This apparent discrepancy from Theorem 1 may be partly because Theorem 1 relies on the oracle parameter choice κ = 2Nτ^q λ^{−1}, as discussed in Section 5.2, which is hardly known in practice. Nevertheless, it is remarkable that even q = 0.75 gives very sparse solutions within a reasonable runtime. Regarding the absolute error, smaller q tends to perform better on relatively small datasets (N ≤ 1000), while q = 1.00 performs better on larger datasets (N = 2000 and 3000). As mentioned in Section 5.3, a theoretical analysis of the approximation error remains open and is left for future work.