Sparse Regularized Optimal Transport with Deformed q-Entropy

Bao, Han; Sakaue, Shinsaku

doi:10.3390/e24111634


by Han Bao ^{1,*} and Shinsaku Sakaue ^{2}

^{1} Graduate School of Informatics and The Hakubi Center for Advanced Research, Kyoto University, Kyoto 604-8103, Japan

^{2} Department of Mathematical Informatics, Graduate School of Information Science and Technology, The University of Tokyo, Tokyo 153-8505, Japan

^{*} Author to whom correspondence should be addressed.

Entropy 2022, 24(11), 1634; https://doi.org/10.3390/e24111634

Submission received: 18 September 2022 / Revised: 4 November 2022 / Accepted: 7 November 2022 / Published: 10 November 2022


Abstract

Optimal transport is a mathematical tool that has been widely used to measure the distance between two probability distributions. To mitigate the cubic computational complexity of the vanilla formulation of the optimal transport problem, regularized optimal transport has received attention in recent years; it is a convex program that minimizes the linear transport cost with an added convex regularizer. Sinkhorn optimal transport, which is regularized with the negative Shannon entropy, is the most prominent instance, but it leads to densely supported solutions, which are often undesirable in light of the interpretability of transport plans. In this paper, we report that a deformed entropy designed by q-algebra, a popular generalization of the standard algebra studied in Tsallis statistical mechanics, makes optimal transport solutions sparsely supported. This entropy with a deformation parameter q interpolates between the negative Shannon entropy (q = 1) and the squared 2-norm (q = 0), and the solution becomes sparser as q tends to zero. Our theoretical analysis reveals that a larger q leads to faster convergence when the problem is optimized with the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm. In summary, the deformation induces a trade-off between sparsity and convergence speed.

Keywords: optimal transport; Sinkhorn algorithm; convex analysis; entropy; quasi-Newton method

1. Introduction

Optimal transport (OT) is a classic problem in operations research, and it is used to compute a transport plan between suppliers and demanders with a minimum transportation cost. When suppliers and demanders are regarded as two probability distributions, the minimum transportation cost can be interpreted as the closeness between the distributions. The OT problem has been extensively studied (also as the Wasserstein distance) [1] and used in robust machine learning [2], domain adaptation [3], generative modeling [4], and natural language processing [5], owing to its many useful properties, such as being a distance between two probability distributions. Recently, the OT problem has been employed for various modern applications, such as interpretable word alignment [6] and the locality-aware evaluation of object detection [7], because it can capture the geometry of data and provide a measure of closeness and alignment among different objects. From a computational perspective, a naïve approach is to use a network simplex algorithm or interior point method to solve the OT problem as a usual linear program; this approach requires supercubic time complexity [8] and is not scalable. A number of approaches have been suggested to accelerate the computation of the OT problem: entropic regularization [9,10], accelerated gradient descent [11], and approximation with tree metrics [12] and graph metrics [13]. We focus our attention on entropic-regularized OT because it admits a unique solution owing to strong convexity and transforms the original constrained optimization into an unconstrained problem with a clear primal–dual relationship. The celebrated Sinkhorn algorithm solves entropic-regularized OT with quadratic time complexity [9]. Furthermore, the Sinkhorn algorithm is amenable to differentiable programming, and it is easily incorporated into end-to-end learning pipelines [14,15].

Despite the popularity of the Sinkhorn algorithm, one of its main drawbacks is that the Shannon entropy blurs the OT solution, i.e., solutions of entropic-regularized OT are always densely supported. The Shannon entropy induces a probability distribution that has strictly positive values everywhere on its support [16], whereas the vanilla (unregularized) OT produces extremely sparse transport plans located on the boundaries of a polytope [17,18]. If we are interested in alignment and matching between different objects (as in several applications of natural language processing [6,19]), dense transport plans are hard to interpret because the matching information between objects may be obfuscated by the unimportant small densities contained in the transport plans. One attempt toward realizing sparse OT is to use the squared 2-norm as an alternative regularizer. Blondel et al. [20] showed that the dual of this optimization problem can be solved via the L-BFGS method [21]; the primal solution corresponds to a transport plan recovered from the dual solution in a closed form, which is sparse. Although they successfully obtained a sparse OT formulation with a numerically stable algorithm, the degree of sparsity cannot be easily modulated when we want to control it for a given application. Furthermore, the theoretical convergence rates of solving regularized OT are yet to be known.

In this study, we examine the relationship between the sparsity of transport plans and the convergence guarantee of regularized OT. Specifically, we propose yet another entropic regularizer called the deformed q-entropy, with a deformation parameter q that allows us to control the solution sparsity. To introduce the new regularizer, we start with the dual solution of the entropic-regularized OT given by the Gibbs kernel; the Gibbs kernel associated with the Shannon entropy induces nonsparsity, and, therefore, we replace it with another sparse kernel based on the q-exponential distribution [22], following the idea of Tsallis statistics [23]. The deformed q-entropy is derived from the dual solution characterized by the sparse kernel. Interestingly, the deformed q-entropy recovers the Shannon entropy in the limit q ↗ 1 and matches the (negative) squared 2-norm at q = 0; this means that the deformed q-entropy interpolates between the two regularizers. We confirm that the solution becomes increasingly sparse as q approaches zero. We call the regularized OT with the deformed q-entropy deformed q-optimal transport (q-DOT). The q-DOT reveals an interesting connection between the OT solution and the q-exponential distribution, which is of independent interest. From the optimization perspective, we can solve the unconstrained dual of q-DOT with many standard solvers, as reported in Blondel et al. [20]. Our analysis of the convergence rate of the dual optimization shows that convergence with the BFGS method [24] becomes faster as the deformation parameter q approaches one. Therefore, a weaker deformation (larger q) leads to faster convergence while sacrificing sparsity. Finally, we demonstrate the trade-off between sparsity and convergence in the numerical experiments.

Our contributions can be summarized as follows: (i) showing a clear connection between the regularized OT problem and the q-exponential distribution; (ii) demonstrating the trade-off of the q-DOT between sparsity and convergence; (iii) providing a formal convergence guarantee of the q-DOT when solved with the BFGS method. The rest of this paper is organized as follows: Section 2 introduces the necessary background on the OT problem and entropic regularization. In Section 3, the Lagrange dual of the entropic-regularized OT problem is first shown; then, the dual optimal formula and the q-exponential distribution are connected to sparsify the transport matrix. Section 4 focuses on the optimization perspective of the regularized OT problem and provides a convergence guarantee with the BFGS method, which shows the theoretical trade-off between sparsity and convergence. Finally, the empirical behavior and the trade-off of the regularized OT are numerically confirmed in Section 5.

2. Background

2.1. Preliminaries

For

x \in R

, let

{[x]}_{+} = x

if

x > 0

and 0 otherwise, and let

{[x]}_{+}^{p}

represent

{({[x]}_{+})}^{p}

hereafter. For a convex function

f : X \to R

, where

X

represents a Euclidean vector space equipped with an inner product

〈\cdot, \cdot〉

, the Fenchel–Legendre conjugate

f^{⋆} : X \to R

is defined as

f^{⋆} (y) : = {sup}_{x \in X} 〈x, y〉 - f (x)

. The relative interior of a set S is denoted by

ri S

, and the effective domain of a function f is denoted by

dom (f)

. A differentiable function f is said to be M-strongly convex over

S \subseteq ri dom (f)

if, for all

x, y \in S

, we have

f (x) - f (y) \leq 〈\nabla f (x), x - y〉 - \frac{M}{2} {∥ x - y ∥}_{2}^{2}

. If f is twice differentiable, the strong convexity is equivalent to

\nabla^{2} f (x) ⪰ M I

for all

x \in S

. Similarly, a differentiable function f is said to be M-smooth over

S \subseteq ri dom (f)

if for all

x, y \in S

, we have

{∥ \nabla f (x) - \nabla f (y) ∥}_{2} \leq M {∥ x - y ∥}_{2}

, which is equivalent to

\nabla^{2} f (x) ⪯ M I

for all

x \in S

if f is twice differentiable.

2.2. Optimal Transport

OT is the mathematical problem of finding a transport plan between two probability distributions with the minimum transport cost. The discussion in this paper is restricted to discrete distributions. Let

(X, d)

,

δ_{x}

, and

▵^{n - 1} : = \{p \in {[0, 1]}^{n} | 〈p, 1_{n}〉 = 1\}

represent a metric space, Dirac measure at point

x

, and

(n - 1)

-dimensional probability simplex, respectively. Let

μ = \sum_{i = 1}^{n} a_{i} δ_{x_{i}}

and

ν = \sum_{j = 1}^{m} b_{j} δ_{y_{j}}

be histograms supported on the finite sets of points

{(x_{i})}_{i = 1}^{n} \subseteq X

and

{(y_{j})}_{j = 1}^{m} \subseteq X

, respectively, where

a \in ▵^{n - 1}

and

b \in ▵^{m - 1}

are probability vectors. The OT between two discrete probability measures

μ

and

ν

is the optimization problem

T (μ, ν) : = inf_{Π \in U (μ, ν)} \sum_{i = 1}^{n} \sum_{j = 1}^{m} d (x_{i}, y_{j}) Π_{i j},

(1)

where

U

represents the transport polytope, defined as

U (μ, ν) : = \{Π \in R_{\geq 0}^{n \times m} | Π 1_{m} = a, Π^{⊤} 1_{n} = b\} .

(2)

The transport polytope

U

defines the constraints on the row/column marginals of a transport matrix

Π

. These constraints are often referred to as coupling constraints. For notational simplicity, matrix

D_{i j} : = d (x_{i}, y_{j})

and expectation

〈D, Π〉 : = \sum_{i = 1}^{n} \sum_{j = 1}^{m} D_{i j} Π_{i j}

are used hereafter.

T (μ, ν)

is known as the 1-Wasserstein distance, which defines a metric over histograms [1].
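As a small illustration of the coupling constraints in Equation (2) and the cost 〈D, Π〉, the following sketch checks membership in the transport polytope and compares two feasible plans. The toy points, the helper names `in_transport_polytope` and `transport_cost`, and the tolerance are assumptions made for this example only.

```python
import numpy as np

def in_transport_polytope(P, a, b, tol=1e-9):
    """Check P >= 0, P 1_m = a, and P^T 1_n = b up to a tolerance."""
    P = np.asarray(P, dtype=float)
    return bool(
        (P >= -tol).all()
        and np.allclose(P.sum(axis=1), a, atol=tol)
        and np.allclose(P.sum(axis=0), b, atol=tol)
    )

def transport_cost(D, P):
    """Expectation <D, P> = sum_ij D_ij P_ij."""
    return float(np.sum(np.asarray(D) * np.asarray(P)))

# Hypothetical toy data: points on a line with d(x, y) = |x - y|.
x = np.array([0.0, 1.0, 2.0])
y = np.array([0.0, 2.0])
a = np.array([0.5, 0.25, 0.25])
b = np.array([0.5, 0.5])
D = np.abs(x[:, None] - y[None, :])

P_indep = np.outer(a, b)            # independent coupling, always feasible
P_sparse = np.array([[0.5, 0.0],    # hand-built feasible plan with fewer nonzeros
                     [0.0, 0.25],
                     [0.0, 0.25]])

cost_indep = transport_cost(D, P_indep)
cost_sparse = transport_cost(D, P_sparse)
```

Both plans satisfy the marginal constraints, but the sparser plan has the lower cost on this toy instance, which matches the intuition that unregularized OT favors sparse plans.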

Equation (1) is a linear program and can be solved by well-studied algorithms such as the interior point and network simplex methods. However, its computational complexity is

O (n^{3} log n)

(assuming

n = m

), so it does not scale to large datasets [8].

2.3. Entropic Regularization and Sinkhorn Algorithm

The entropic-regularized formulation is commonly used to reduce the computational burden. Here, we introduce regularized OT with negative Shannon entropy [9] as

T_{- λ H} (μ, ν) : = inf_{Π \in U (μ, ν)} 〈D, Π〉 + λ \underset{negative Shannon entropy}{\underset{︸}{\sum_{i = 1}^{n} \sum_{j = 1}^{m} (Π_{i j} log Π_{i j} - Π_{i j})}},

(3)

where

λ > 0

represents the regularization strength. Let us review the derivation of the updates of the Sinkhorn algorithm. The Lagrangian of the optimization problem in Equation (3) is

\begin{matrix} L (Π, α, β) : = & \sum_{i = 1}^{n} \sum_{j = 1}^{m} (D_{i j} Π_{i j} + λ (Π_{i j} log Π_{i j} - Π_{i j})) \\ + \sum_{i = 1}^{n} α_{i} ({[Π 1_{m}]}_{i} - a_{i}) + \sum_{j = 1}^{m} β_{j} ({[Π^{⊤} 1_{n}]}_{j} - b_{j}), \end{matrix}

(4)

where

α \in R^{n}

and

β \in R^{m}

represent the Lagrangian multipliers. Equation (4) ignores the constraints

Π_{i j} \geq 0

(for all

i \in [n]

and

j \in [m]

); however, they will be automatically satisfied. By taking the derivative in

Π_{i j}

,

\nabla_{Π_{i j}} L = D_{i j} + λ log Π_{i j} + α_{i} + β_{j},

(5)

and, hence, the stationary condition

\nabla_{Π_{i j}} L = 0

induces the solution

Π_{i j} = exp (- \frac{α_{i} + β_{j} + D_{i j}}{λ}) .

(6)

The decomposition

Π_{i j} = exp (- \frac{D_{i j}}{λ}) / exp (\frac{α_{i} + β_{j}}{λ})

suggests that the stationary point is the (normalized) Gibbs kernel

exp (- \frac{D_{i j}}{λ})

. One can easily infer that the Sinkhorn solution is dense because the Gibbs kernel is supported on the entire

R_{\geq 0}

, i.e.,

exp (- \frac{z}{λ}) > 0

for all

z \in R_{\geq 0}

. We can write Equation (6) into a matrix form by applying the variable transforms

u_{i} : = exp (- \frac{α_{i}}{λ})

,

v_{j} : = exp (- \frac{β_{j}}{λ})

, and

K_{i j} : = exp (- \frac{D_{i j}}{λ})

as

Π = \underset{: = U}{\underset{︸}{diag (u)}} K \underset{: = V}{\underset{︸}{diag (v)}} .

(7)

The following Sinkhorn updates are used to make Equation (7) meet the marginal constraints:

\{\begin{matrix} u^{'} \leftarrow a / (K v) \\ v^{'} \leftarrow b / (K^{⊤} u) \end{matrix},

(8)

where

z / η

represents the element-wise division of the two vectors

z

and

η

. The computational complexity is

O (K n m)

because the Sinkhorn updates involve only matrix-vector multiplications and element-wise divisions; here, K denotes the number of Sinkhorn updates (not the kernel matrix in Equation (7)). A finer analysis of the number of updates required to meet a given error tolerance is provided in the literature [25].
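The updates in Equation (8) can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions: the function name `sinkhorn`, the fixed iteration count, and the toy data are hypothetical, and the freshly updated u is used immediately in the v update, which is a common alternating variant.

```python
import numpy as np

def sinkhorn(D, a, b, lam=0.5, n_iters=500):
    """Alternate the scalings of Equation (8) and return the plan of Equation (7)."""
    K = np.exp(-D / lam)                  # Gibbs kernel K_ij = exp(-D_ij / lam)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)                   # rescale rows to match the marginal a
        v = b / (K.T @ u)                 # rescale columns to match the marginal b
    return u[:, None] * K * v[None, :]    # diag(u) K diag(v)

# Hypothetical toy data: points on a line with d(x, y) = |x - y|.
x = np.array([0.0, 1.0, 2.0])
y = np.array([0.0, 2.0])
D = np.abs(x[:, None] - y[None, :])
a = np.array([0.5, 0.25, 0.25])
b = np.array([0.5, 0.5])

P = sinkhorn(D, a, b, lam=0.5)
```

Because every entry of the Gibbs kernel is strictly positive, every entry of the returned plan is strictly positive as well, which is the density issue discussed above. For very small λ, a log-domain implementation would be numerically safer than this direct form.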

3. Deformed q-Entropy and q-Regularized Optimal Transport

3.1. Regularized Optimal Transport and Its Dual

Let us consider the following primal problem with a general regularization function

Ω

.

Definition 1

(Primal of regularized OT).

T_{Ω} (μ, ν) = inf_{Π \in U (μ, ν)} 〈D, Π〉 + \sum_{i, j} Ω (Π_{i j}),

(9)

where

Ω : R \to R

represents a proper closed convex function.

Next, we derive its dual by Lagrange duality. The Lagrangian of Equation (9) is defined as

L (Π, α, β) : = 〈D, Π〉 + \sum_{i, j} Ω (Π_{i j}) + 〈α, Π 1_{m} - a〉 + 〈β, Π^{⊤} 1_{n} - b〉,

(10)

with dual variables

α \in R^{n}

and

β \in R^{m}

. Then, the primal can be rewritten in terms of the Lagrangian

T_{Ω} (μ, ν) = inf_{Π \in R_{\geq 0}^{n \times m}} sup_{α \in R^{n}, β \in R^{m}} L (Π, α, β) .

(11)

In this Lagrangian formulation, we let the constraints

Π \in R_{\geq 0}^{n \times m}

remain for a technical reason. The constrained optimization problem in (11) can be reformulated into the following unconstrained one with an indicator function

I_{R_{\geq 0}^{n \times m}}

.

T_{Ω} (μ, ν) = inf_{Π \in R^{n \times m}} sup_{α \in R^{n}, β \in R^{m}} L (Π, α, β) + I_{R_{\geq 0}^{n \times m}} (Π),

(12)

which corresponds to an optimization problem with the convex objective function

〈D, Π〉 + \sum_{i, j} Ω (Π_{i j}) + I_{R_{\geq 0}^{n \times m}} (Π)

with only the linear constraints

Π 1_{m} = a

and

Π^{⊤} 1_{n} = b

. By invoking the Sinkhorn–Knopp theorem [26], the existence of a strictly feasible solution, namely, a solution satisfying

Π 1_{m} = a

and

Π^{⊤} 1_{n} = b

, can be confirmed. Hence, we see that the Slater condition is satisfied, and the strong duality holds as follows:

\begin{matrix} T_{Ω} (μ, ν) & = sup_{α \in R^{n}, β \in R^{m}} inf_{Π \in R_{\geq 0}^{n \times m}} L (Π, α, β) \\ & = sup_{α \in R^{n}, β \in R^{m}} - 〈a, α〉 - 〈b, β〉 + inf_{Π \in R_{\geq 0}^{n \times m}} \sum_{i, j} (D_{i j} + α_{i} + β_{j}) Π_{i j} + Ω (Π_{i j}) \\ & = sup_{α \in R^{n}, β \in R^{m}} - 〈a, α〉 - 〈b, β〉 - (sup_{Π \in R_{\geq 0}^{n \times m}} \sum_{i, j} - (D_{i j} + α_{i} + β_{j}) Π_{i j} - Ω (Π_{i j})) \\ & = sup_{α \in R^{n}, β \in R^{m}} - 〈a, α〉 - 〈b, β〉 - \sum_{i, j} Ω^{⋆} (- D_{i j} - α_{i} - β_{j}), \end{matrix}

(13)

where

Ω^{⋆}

represents the Fenchel–Legendre conjugate of

Ω : R \to R

Ω^{⋆} (η) : = sup_{π \geq 0} η π - Ω (π) .

(14)

Although each element of the transport plans ranges over

[0, 1]

, it is sufficient to define the Fenchel–Legendre conjugate as the supremum over

R_{\geq 0}

because of how

Ω^{⋆}

emerges in the strong duality (13). According to Danskin’s theorem [27], the supremum of the Fenchel–Legendre conjugate can be attained at

Π_{i j}^{⋆} = \nabla Ω^{⋆} (- D_{i j} - α_{i} - β_{j}) .

(15)

Therefore, the dual of regularized OT is formulated as follows:

Definition 2

(Dual of regularized OT).

T_{Ω} (μ, ν) = sup_{α \in R^{n}, β \in R^{m}} - 〈a, α〉 - 〈b, β〉 - \sum_{i, j} Ω^{⋆} (- D_{i j} - α_{i} - β_{j}),

(16)

where

Ω^{⋆}

represents the Fenchel–Legendre conjugate

Ω^{⋆} (η) : = {sup}_{π \geq 0} η π - Ω (π)

. The optimal solution of the primal is given by the dual map

\nabla Ω^{⋆}

such that

Π_{i j}^{⋆} = \nabla Ω^{⋆} (- D_{i j} - α_{i}^{⋆} - β_{j}^{⋆})

, where

(α^{⋆}, β^{⋆})

represents the dual optimal solution.

Next, we see several examples that are summarized in Table 1.

Example 1

(Negative Shannon entropy). Let

Ω (π) = - λ H (π) = λ (π log π - π)

; then

Ω^{⋆} (η) = λ e^{η / λ}

and

\nabla Ω^{⋆} (η) = e^{η / λ}

. The optimal solution represented with the optimal dual variables

(α^{⋆}, β^{⋆})

is

Π_{i j}^{⋆} = exp (- \frac{D_{i j} + α_{i}^{⋆} + β_{j}^{⋆}}{λ})

. This recovers the stationary point of the Sinkhorn OT in Equation (6). The solution is dense because the regularizer Ω induces the Gibbs kernel

\nabla Ω^{⋆} (η) = e^{η / λ} > 0

for all

η \in R

.

Example 2

(Squared 2-norm). Let

Ω (π) = \frac{λ}{2} π^{2}

; then

Ω^{⋆} (η) = \frac{1}{2 λ} {[η]}_{+}^{2}

and

\nabla Ω^{⋆} (η) = \frac{1}{λ} {[η]}_{+}

. The optimal solution represented with the optimal dual variables

(α^{⋆}, β^{⋆})

is

Π_{i j}^{⋆} = \frac{1}{λ} {[- D_{i j} - α_{i}^{⋆} - β_{j}^{⋆}]}_{+}

. As mentioned by Blondel et al. [20], the squared 2-norm can sparsify the solution because

\nabla Ω^{⋆} (η) = \frac{1}{λ} {[η]}_{+}

may take the value 0.
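The two conjugates in Examples 1 and 2 can be sanity-checked numerically. The sketch below compares the closed forms with a brute-force supremum over a grid of π values and evaluates both dual maps at a negative argument; the grid, the value of λ, and all function names are assumptions made for illustration.

```python
import numpy as np

def dual_map_shannon(eta, lam):
    """nabla Omega*(eta) = exp(eta / lam): strictly positive for every eta."""
    return np.exp(eta / lam)

def dual_map_sq2norm(eta, lam):
    """nabla Omega*(eta) = [eta]_+ / lam: exactly zero whenever eta <= 0."""
    return np.maximum(eta, 0.0) / lam

def conjugate_on_grid(omega, eta, grid):
    """Brute-force sup_{pi >= 0} (eta * pi - omega(pi)) over a finite grid."""
    return float(np.max(eta * grid - omega(grid)))

lam = 0.5
grid = np.linspace(1e-12, 5.0, 500001)   # pi > 0 so that log(pi) stays finite

def omega_shannon(p):
    return lam * (p * np.log(p) - p)     # negative Shannon entropy regularizer

def omega_sq2norm(p):
    return 0.5 * lam * p ** 2            # squared 2-norm regularizer

for eta in (-1.0, 0.0, 0.5):
    closed_shannon = lam * np.exp(eta / lam)
    closed_sq2norm = max(eta, 0.0) ** 2 / (2.0 * lam)
    print(eta,
          conjugate_on_grid(omega_shannon, eta, grid) - closed_shannon,
          conjugate_on_grid(omega_sq2norm, eta, grid) - closed_sq2norm)
```

The printed differences should be close to zero, and `dual_map_sq2norm` returns exactly 0 for negative inputs whereas `dual_map_shannon` never does, which is the mechanism behind the sparsity contrast between the two regularizers.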

3.2. q Algebra and Deformed Entropy

As shown in Examples 1 and 2, the dual map

\nabla Ω^{⋆}

plays an important role in the OT solution sparsity. In addition, the induced

\nabla Ω^{⋆}

is the Gibbs kernel when the negative Shannon entropy is used as

Ω

. Therefore, one may think of designing a regularizer from

\nabla Ω^{⋆}

by utilizing a kernel function that induces sparsity. One candidate is a q-exponential distribution. We begin with some basics required to formulate q-exponential distributions.

First, we introduce q-algebra, which has been well studied in the field of Tsallis statistical mechanics [23,29,30]. The q-algebra has been used in the machine-learning literature for regression [31], Bayesian inference [32], and robust learning [33]. For a deformation parameter

q \in [0, 1]

, the q-logarithm and q-exponential functions are defined as

\begin{matrix} {log}_{q} (x) : = \{\begin{matrix} \frac{x^{1 - q} - 1}{1 - q} & if q \in [0, 1) \\ log (x) & if q = 1 \end{matrix}, & {exp}_{q} (x) : = \{\begin{matrix} {[1 + (1 - q) x]}_{+}^{1 / (1 - q)} & if q \in [0, 1) \\ exp (x) & if q = 1 \end{matrix} . \end{matrix}

(17)

The q-logarithm is defined only for x > 0, as is the natural logarithm; the q-logarithm and q-exponential are inverse functions of each other (on an appropriate domain), and they recover the natural logarithm and exponential as

q ↗ 1

. Their derivatives are

{({log}_{q} (x))}^{'} = \frac{1}{x^{q}}

and

{({exp}_{q} (x))}^{'} = {exp}_{q} {(x)}^{q}

, respectively. The additive factorization property

exp (x + y) = exp (x) exp (y)

satisfied by the natural exponential no longer holds for the q exponential, such that

{exp}_{q} (x + y) \neq {exp}_{q} (x) {exp}_{q} (y) = {exp}_{q} (x + y + (1 - q) x y)

. Instead, we can construct another algebraic structure by introducing the other operation called the q product

\otimes_{q}

:

x \otimes_{q} y = {[x^{1 - q} + y^{1 - q} - 1]}_{+}^{1 / (1 - q)} .

(18)

With this product, the pseudoadditive factorization

{exp}_{q} (x + y) = {exp}_{q} (x) \otimes_{q} {exp}_{q} (y)

holds. Thus, the q algebra captures rich nonlinear structures, and it is often used to extend the Shannon entropy to the Tsallis entropy [23]

T_{q} (π) = - \sum_{i = 1}^{n} π_{i}^{q} {log}_{q} (π_{i}) .

(19)

One can see that the Tsallis entropy has an equivalent power formulation

T_{q} (π) = \sum_{i = 1}^{n} \frac{π_{i}^{q} - π_{i}}{1 - q}

, which means that it is often suitable for modeling heavy-tailed phenomena such as the power law. Although the q-logarithm and q-exponential introduced above can look arbitrary, they can be derived axiomatically by assuming the essential properties of the algebra (see Naudts [29]). For more physical insights, we refer readers to the literature [30].
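The identities of the q-algebra stated above are easy to verify numerically. The following is a minimal sketch of Equations (17) and (18) in plain Python; the function names are chosen for this example, and the `q == 1` branch of `q_product` uses the ordinary product, which is the limit of Equation (18) as q ↗ 1.

```python
import math

def log_q(x, q):
    """q-logarithm of Equation (17), defined for x > 0."""
    if x <= 0.0:
        raise ValueError("log_q is defined only for x > 0")
    if q == 1.0:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def exp_q(x, q):
    """q-exponential of Equation (17); it is exactly 0 once 1 + (1 - q) x <= 0."""
    if q == 1.0:
        return math.exp(x)
    base = max(1.0 + (1.0 - q) * x, 0.0)
    return base ** (1.0 / (1.0 - q))

def q_product(x, y, q):
    """q-product of Equation (18) for x, y >= 0."""
    if q == 1.0:
        return x * y
    base = max(x ** (1.0 - q) + y ** (1.0 - q) - 1.0, 0.0)
    return base ** (1.0 / (1.0 - q))

q, x, y = 0.5, 0.3, -0.2
inverse_gap = log_q(exp_q(x, q), q) - x                                   # inverse pair
pseudo_gap = exp_q(x + y, q) - q_product(exp_q(x, q), exp_q(y, q), q)     # pseudoadditivity
plain_gap = exp_q(x + y, q) - exp_q(x, q) * exp_q(y, q)                   # nonzero in general
```

On these inputs, `inverse_gap` and `pseudo_gap` vanish up to rounding, while `plain_gap` does not, illustrating why the q-product rather than the ordinary product factorizes the q-exponential.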

Next, we introduce the q-exponential distribution. We introduce a simpler form for our purpose, whereas more general formulations of the q-exponential distribution have been introduced in the literature [22]. Given the form of the Gibbs kernel

k (ξ) : = exp (- ξ / λ)

, we define the q-Gibbs kernel as follows:

Definition 3

(q-Gibbs kernel). For

ξ \geq 0

, we define the q-Gibbs kernel as

k_{q} (ξ) : = {exp}_{q} (- ξ / λ)

for a deformation parameter

q \in [0, 1]

and a temperature parameter

λ \in R_{> 0}

.

If we take

ξ

as the (centered) squared distance, then

k_{q} (ξ)

represents the q-Gaussian distribution [22]. We illustrate the q-Gibbs kernel with different deformation parameters in Figure 1.

By definition, the support of the q-Gibbs kernel is

supp (k_{q}) = [0, \frac{λ}{1 - q}]

for

q \in [0, 1)

and

supp (k_{q}) = R_{\geq 0}

for

q = 1

. This indicates that the q-Gibbs kernel ignores the effect of a too-large

ξ

(or too large a distance between two points); its threshold is smoothly controlled by the temperature parameter

λ

and deformation parameter q.
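The thresholding behavior of the q-Gibbs kernel can be made concrete with a short sketch. The helper names and the parameter values below are hypothetical; the cut-off λ/(1 − q) is the right end of the support stated above.

```python
import math

def q_gibbs_kernel(xi, q, lam):
    """k_q(xi) = exp_q(-xi / lam) for xi >= 0 (Definition 3)."""
    if q == 1.0:
        return math.exp(-xi / lam)
    base = max(1.0 - (1.0 - q) * xi / lam, 0.0)
    return base ** (1.0 / (1.0 - q))

def support_cutoff(q, lam):
    """Right end of supp(k_q): lam / (1 - q) for q < 1, infinite for q = 1."""
    return math.inf if q == 1.0 else lam / (1.0 - q)

lam = 0.5
for q in (0.0, 0.5, 0.9, 1.0):
    # Print the cut-off and the kernel value at the fixed cost xi = 2.
    print(q, support_cutoff(q, lam), q_gibbs_kernel(2.0, q, lam))
```

For a fixed cost ξ, the kernel value is exactly zero whenever ξ exceeds the cut-off, so decreasing q or λ shrinks the range of costs that can receive mass.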

Finally, we derive an entropic regularizer that induces sparsity by using the q-Gibbs kernel. Given the stationary condition in Equation (15), we impose the following functional form on the dual map:

π = \nabla Ω^{⋆} (η) = {exp}_{q} (\frac{η}{λ}),

(20)

where

(π, η) = (Π_{i j}^{⋆}, - D_{i j} - α_{i} - β_{j})

. Equation (20) results in the factorization

Π_{i j}^{⋆} = {exp}_{q} (- \frac{α_{i}}{λ}) \otimes_{q} {exp}_{q} (- \frac{D_{i j}}{λ}) \otimes_{q} {exp}_{q} (- \frac{β_{j}}{λ}),

(21)

and a sufficiently large input distance

D_{i j}

drives

Π_{i j}

to zero; though

{exp}_{q} (- D_{i j} / λ) = 0

does not immediately imply

Π_{i j}^{⋆} = 0

because the q-product

\otimes_{q}

lacks an absorbing element. By solving Equation (20),

\nabla Ω (π) = λ {log}_{q} (π), Ω (π) = \frac{λ}{2 - q} (π {log}_{q} (π) - π) .

(22)

For completeness, its derivation is shown in Appendix A. Hence, we define the deformed q entropy as follows:

Definition 4

(Deformed q-entropy). For

π \in ▵^{n - 1}

, the deformed q entropy is defined as

H_{q} (π) = - \frac{1}{2 - q} \sum_{i = 1}^{n} (π_{i} {log}_{q} (π_{i}) - π_{i}) .

(23)

The deformed q-entropic regularizer for an element

π_{i}

is

Ω (π_{i}) = \frac{λ}{2 - q} (π_{i} {log}_{q} (π_{i}) - π_{i})

.
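The closed form of the regularizer in Definition 4 can be checked directly against Equation (20). The following is a short verification sketch for q ∈ [0, 1) and π > 0 that uses only the derivative of the q-logarithm; it is not a reproduction of the derivation in Appendix A.

```latex
% Inverting the dual map of Equation (20) on \pi > 0:
\pi = \exp_q\!\left(\frac{\eta}{\lambda}\right)
\;\Longleftrightarrow\;
\eta = \lambda \log_q(\pi) = \nabla\Omega(\pi).

% Differentiating the candidate regularizer, with (\log_q \pi)' = \pi^{-q}:
\frac{d}{d\pi}\bigl(\pi \log_q(\pi) - \pi\bigr)
  = \log_q(\pi) + \pi^{1-q} - 1
  = \log_q(\pi) + (1-q)\log_q(\pi)
  = (2-q)\log_q(\pi),

% so that
\frac{d}{d\pi}\left[\frac{\lambda}{2-q}\bigl(\pi \log_q(\pi) - \pi\bigr)\right]
  = \lambda \log_q(\pi).
```

The middle step uses π^{1−q} − 1 = (1 − q) log_q(π), which follows from the definition of the q-logarithm in Equation (17).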

The deformed q entropy recovers the Shannon entropy at the limit

q ↗ 1

:

H_{1} (π) = - \sum_{i} (π_{i} log (π_{i}) - π_{i})

. In addition, the limit

q ↘ 0

recovers the negative of the squared 2-norm:

H_{0} (π) = - \frac{1}{2} \sum_{i} (π_{i}^{2} - 2 π_{i}) = - \frac{1}{2} {∥ π ∥}_{2}^{2} + 1

. Therefore, the deformed q entropy is an interpolation between the Shannon entropy and squared 2-norm. Hereafter, we consider the regularized OT with the deformed q entropy

T_{- λ H_{q}} (μ, ν) = inf_{Π \in U (μ, ν)} 〈D, Π〉 - λ H_{q} (Π),

(24)

by solving its dual counterpart. The deformed q entropy is different from the Tsallis entropy

T_{q}

(see Equation (19)) in that the Tsallis entropy and deformed q entropy are defined by the q expectation

〈π^{q}, \cdot〉

[34] and the usual expectation

〈π, \cdot〉

, respectively, while both are defined by the q logarithm.
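The two limits of the deformed q entropy and its difference from the Tsallis entropy can be checked numerically. The sketch below is illustrative only: the function names and the test vector are assumptions, and the Tsallis helper requires strictly positive entries.

```python
import math

def log_q(x, q):
    """q-logarithm of Equation (17), defined for x > 0."""
    if q == 1.0:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def deformed_q_entropy(p, q):
    """H_q(p) = -(1 / (2 - q)) * sum_i (p_i log_q(p_i) - p_i), Equation (23)."""
    total = 0.0
    for pi in p:
        # p log_q(p) tends to 0 as p -> 0, so a zero entry contributes nothing.
        total += (pi * log_q(pi, q) if pi > 0.0 else 0.0) - pi
    return -total / (2.0 - q)

def tsallis_entropy(p, q):
    """T_q(p) = -sum_i p_i^q log_q(p_i), Equation (19); assumes all p_i > 0."""
    return -sum(pi ** q * log_q(pi, q) for pi in p)

p = [0.2, 0.3, 0.5]                                             # hypothetical probability vector
h0 = deformed_q_entropy(p, 0.0)
h0_closed = -0.5 * sum(pi ** 2 for pi in p) + 1.0               # -||p||^2 / 2 + 1
h1 = deformed_q_entropy(p, 1.0)
h1_closed = -sum(pi * math.log(pi) - pi for pi in p)            # Shannon form
h_near_1 = deformed_q_entropy(p, 1.0 - 1e-6)                    # approaches h1 as q -> 1
t_half = tsallis_entropy(p, 0.5)
```

Here `h0` agrees with the squared 2-norm expression, `h_near_1` is close to `h1`, and `t_half` differs from `deformed_q_entropy(p, 0.5)`, consistent with the two entropies weighting the q-logarithm by different expectations.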

Remark 1.

The primary reason we chose the deformed q entropy

H_{q}

to design the regularizer is its natural connection to the q-Gibbs kernel through the dual map,

\nabla {(- λ H_{q})}^{⋆} (η) = {exp}_{q} (η / λ)

. When the Tsallis entropy

T_{q}

is used, the dual map is

\nabla {(- λ T_{q})}^{⋆} (η) = \frac{q^{1 / (1 - q)}}{{exp}_{q} (- η / λ)},

(25)

which is not naturally connected to the q-Gibbs kernel. Muzellec et al. [35] proposed regularized OT with the Tsallis entropy, but they did not discuss its sparsity. As we show in Appendix D.1, the Tsallis entropy does not empirically induce sparsity.

In Figure 2, the deformed q entropy with different deformation parameters is plotted for the one-dimensional simplex

▵^{1}

. One can easily confirm that

H_{q} (π)

is concave for

π \in R_{\geq 0}^{n}

, as illustrated in the figure.

4. Optimization and Convergence Analysis

4.1. Optimization Algorithm

We occasionally write

Ω = - λ H_{q}

to simplify the notation in this section. By simple algebra, we confirm

Ω^{⋆} (η) = \frac{λ}{2 - q} {exp}_{q} {(\frac{η}{λ})}^{2 - q},

(26)

which is convex because of the concavity of

H_{q}

. To solve Equation (24), we solve the dual

T_{- λ H_{q}} (μ, ν) = sup_{α \in R^{n}, β \in R^{m}} \underset{: = - F (z)}{\underset{︸}{- 〈a, α〉 - 〈b, β〉 - \frac{λ}{2 - q} \sum_{i, j} {exp}_{q} {(- \frac{D_{i j} + α_{i} + β_{j}}{λ})}^{2 - q}}},

(27)

where

z : = (α, β)

denotes the dual variables. As Equation (27) is an unconstrained optimization problem, many standard optimization solvers can be used to solve it; here, we use the BFGS method [24]. For the sake of the convergence analysis (Section 4.2), we optimize the convex

ℓ_{2}

-regularized dual objective

\begin{matrix} minimize & \tilde{F} (z) : = 〈a, α〉 + 〈b, β〉 + \sum_{i, j} Ω^{⋆} (- D_{i j} - α_{i} - β_{j}) + \frac{κ}{2} {∥ z ∥}_{2}^{2}, \end{matrix}

(28)

where

κ > 0

represents the

ℓ_{2}

-regularization parameter. In practice,

ℓ_{2}

regularization hardly affects the performance when

κ

is sufficiently small. We can characterize the convergence rate by introducing (small)

ℓ_{2}

regularization, which makes the objective strongly convex; convergence (but not a rate) can still be guaranteed without

ℓ_{2}

regularization [36].
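The ℓ2-regularized dual objective of Equation (28) and its gradient can be written compactly. The sketch below is a minimal NumPy version under assumptions: the function names, the stacking z = (α, β), and the toy data are chosen for illustration, and the gradient follows from differentiating Equation (28) with the dual map ∇Ω⋆(η) = exp_q(η/λ).

```python
import numpy as np

def exp_q(x, q):
    """q-exponential of Equation (17), applied element-wise."""
    x = np.asarray(x, dtype=float)
    if q == 1.0:
        return np.exp(x)
    return np.maximum(1.0 + (1.0 - q) * x, 0.0) ** (1.0 / (1.0 - q))

def dual_objective(z, D, a, b, q, lam, kappa):
    """Value and gradient of the l2-regularized dual objective in Equation (28)."""
    n = a.size
    alpha, beta = z[:n], z[n:]
    eta = -(D + alpha[:, None] + beta[None, :])
    P = exp_q(eta / lam, q)                        # dual map: candidate transport plan
    conj = lam / (2.0 - q) * P ** (2.0 - q)        # Omega*(eta), Equation (26)
    value = a @ alpha + b @ beta + conj.sum() + 0.5 * kappa * (z @ z)
    grad = np.concatenate([a - P.sum(axis=1) + kappa * alpha,
                           b - P.sum(axis=0) + kappa * beta])
    return float(value), grad

# Hypothetical toy problem: points on a line with d(x, y) = |x - y|.
x = np.array([0.0, 1.0, 2.0])
y = np.array([0.0, 2.0])
D = np.abs(x[:, None] - y[None, :])
a = np.array([0.5, 0.25, 0.25])
b = np.array([0.5, 0.5])

value, grad = dual_objective(np.zeros(5), D, a, b, q=0.5, lam=0.5, kappa=1e-3)
```

Up to the κ terms, the gradient is the violation of the two marginal constraints by the plan `P`, so driving the gradient to zero enforces the coupling constraints.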

We briefly summarize the algorithm in Algorithm 1, where

d^{(k)}

,

ρ^{(k)}

, and

g^{(k)} : = \nabla \tilde{F} (z^{(k)})

represent the kth update direction, kth step size, and gradient at the current variable

z^{(k)}

, respectively.

s^{(k)} : = z^{(k + 1)} - z^{(k)} and ζ^{(k)} : = g^{(k + 1)} - g^{(k)}

(29)

are the differences of the dual variables and gradients between the next and current steps, respectively. Furthermore, let

(γ, γ^{'})

be the tolerance parameters for the Wolfe conditions, i.e., update directions and step sizes satisfy the conditions

\begin{matrix} \tilde{F} (z^{(k)} + ρ^{(k)} d^{(k)}) & \leq \tilde{F} (z^{(k)}) + γ^{'} ρ^{(k)} g^{(k) ⊤} d^{(k)}, & (Armijo condition) \end{matrix}

(30)

\begin{matrix} g^{(k + 1) ⊤} d^{(k)} & \geq γ g^{(k) ⊤} d^{(k)} . & (curvature condition) \end{matrix}

(31)

After obtaining the dual solution

(\hat{α}, \hat{β})

, the primal solution can be recovered from Equation (15).
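For readers who want to experiment, a generic BFGS loop can be sketched as follows. This is a textbook-style variant written under assumptions and is not the listing of Algorithm 1: it maintains an inverse-Hessian estimate directly, enforces only the Armijo condition (30) by backtracking, and skips the update when the curvature s⊤ζ is not positive. The function name `bfgs` and its parameters are hypothetical.

```python
import numpy as np

def bfgs(fun_and_grad, z0, gamma_armijo=1e-4, tol=1e-7, max_iters=500):
    """Minimize a smooth convex function; fun_and_grad(z) returns (value, gradient)."""
    z = np.asarray(z0, dtype=float).copy()
    f, g = fun_and_grad(z)
    H = np.eye(z.size)                               # inverse-Hessian estimate
    I = np.eye(z.size)
    for _ in range(max_iters):
        if np.linalg.norm(g) <= tol:
            break
        d = -H @ g                                   # update direction
        rho = 1.0                                    # step size, shrunk by backtracking
        f_new, g_new = fun_and_grad(z + rho * d)
        n_back = 0
        while f_new > f + gamma_armijo * rho * (g @ d) and n_back < 60:
            rho *= 0.5
            f_new, g_new = fun_and_grad(z + rho * d)
            n_back += 1
        s = rho * d                                  # difference of variables
        zeta = g_new - g                             # difference of gradients
        sz = s @ zeta
        if sz > 1e-12:                               # keep H positive definite
            V = I - np.outer(s, zeta) / sz
            # Written as matrix products for clarity; expanding them avoids the cubic cost.
            H = V @ H @ V.T + np.outer(s, s) / sz
        z, f, g = z + s, f_new, g_new
    return z
```

Passing a callable that returns the value and gradient of the ℓ2-regularized dual objective yields dual variables from which the transport plan follows via the dual map in Equation (15). For a strongly convex objective, s⊤ζ is positive for any nonzero step, so the curvature safeguard is rarely triggered.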

4.2. Convergence Analysis

We provide a convergence guarantee for Algorithm 1. A technical assumption is stated beforehand.

Assumption 1.

Let

z_{⋆}

be the global optimum of

\tilde{F}

. For

τ \in (0, 1)

, we define the set

Z_{τ} \subseteq ri dom (\tilde{F})

as

Z_{τ} : = \{z | \nabla Ω^{⋆} (- D_{i j} - α_{i} - β_{j}) \leq τ for all i, j\} .

(32)

Assume that

z^{(K)}

obtained by Algorithm 1 and

z_{⋆}

are contained in

Z_{τ}

.

The dual map

\nabla Ω^{⋆}

translates dual variables into primal variables, as in Equation (15). It is easy to confirm that

Z_{τ}

is a closed convex set owing to the convexity of

\nabla Ω^{⋆}

. Assumption 1 essentially assumes that all elements of the primal matrix (of

z^{(K)}

and

z_{⋆}

) are strictly less than 1; this always holds for

z_{⋆}

(unless

n = m = 1

) because of the strong duality. Moreover, this assumption is natural for

z^{(K)}

values sufficiently close to the optimum

z_{⋆}

. The bound parameter

τ

is a key element for characterizing the convergence speed.

Theorem 1.

Let

N : = max {n, m}

. Under Assumption 1, Algorithm 1 with the parameter choice

κ = 2 N τ^{q} λ^{- 1}

returns a point

z^{(K)}

satisfying

∥ g^{(K)} ∥_{2} < \sqrt{\frac{16 (\tilde{F} (z^{(0)}) - {\tilde{F}}_{⋆}) N τ^{q}}{λ}} r^{K}

(33)

where

{\tilde{F}}_{⋆} : = {inf}_{z} \tilde{F} (z)

represents the optimal value of the

ℓ_{2}

-regularized dual objective and

0 < r < 1

is an absolute constant independent of

(λ, τ, q, N)

.

The proof is shown in Section 4.3. We conclude that a larger deformation parameter q yields better convergence because the coefficient in Equation (33) is

O (τ^{q / 2})

with the base

\sqrt{τ} < 1

. Therefore, the deformation parameter introduces a new trade-off:

q ↘ 0

yields a more sparse solution but slows down the convergence, whereas

q ↗ 1

ameliorates the convergence while sacrificing sparsity. One may obtain the solution faster than the squared 2-norm regularizer used in Blondel et al. [20], which corresponds to the case

q = 0

, by modulating the deformation parameter q.
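To see how the coefficient in Equation (33) depends on q, one can simply evaluate it. The numbers below (initial gap, N, τ, λ) are hypothetical and serve only to illustrate the monotone dependence on q when τ < 1.

```python
import math

def bound_coefficient(gap0, N, tau, q, lam):
    """Coefficient sqrt(16 * gap0 * N * tau**q / lam) multiplying r**K in Equation (33)."""
    return math.sqrt(16.0 * gap0 * N * tau ** q / lam)

# Hypothetical values: gap0 stands for the initial gap of the regularized dual objective.
gap0, N, tau, lam = 1.0, 100, 0.01, 0.1
coeffs = {q: bound_coefficient(gap0, N, tau, q, lam) for q in (0.0, 0.25, 0.5, 0.75, 1.0)}
for q, c in coeffs.items():
    print(q, c)
```

With all other quantities fixed, moving from q = 0 to q = 1 multiplies the coefficient by √τ, which is the sense in which a larger q gives a smaller bound.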

In regularized OT, it is a common approach to use weaker regularization (i.e., a smaller

λ

) to obtain a solution sparser and closer to the unregularized solution; however, a smaller

λ

results in numerical instability and slow computation [37]. This can be observed from Equation (33) because a smaller

λ

makes its upper bound considerably larger.

Next, we compare the computational complexity of q-DOT solved by the BFGS method with that of the Sinkhorn algorithm. Altschuler et al. [25] showed that the Sinkhorn algorithm satisfies the coupling constraints within the

ℓ_{1}

error

ε

in

O (N^{2} (log N) ε^{- 3})

time, which is a sublinear convergence rate. In contrast, our convergence rate in Equation (33) translates into the iteration complexity

K = O (log (N ε^{- 1}))

, where

∥ g^{(K)} ∥_{2} \leq ε

. The gradient of

\tilde{F}

is

\nabla \tilde{F} (z) = [\begin{matrix} ⋮ \\ a_{i} - \sum_{j = 1}^{m} \nabla Ω^{⋆} (- D_{i j} - α_{i} - β_{j}) + κ α_{i} \\ ⋮ \\ b_{j} - \sum_{i = 1}^{n} \nabla Ω^{⋆} (- D_{i j} - α_{i} - β_{j}) + κ β_{j} \\ ⋮ \end{matrix}],

(34)

and

\nabla Ω^{⋆} (\cdot)

represents the mapping from the dual variables

(α_{i}, β_{j})

to the primal transport matrix

Π_{i j}

in Equation (15). Therefore, the gradient norm of

F

and coupling constraint error are comparable when the

ℓ_{2}

-regularization parameter

κ

is sufficiently small. The overall computational complexity is

O (N^{2} log (N ε^{- 1}))

because one step of Algorithm 1 runs in

O (N^{2})

time; this is a linear convergence rate. To confirm that one step of Algorithm 1 runs in

O (N^{2})

time, we note that the update direction can be computed with

O (N^{2})

time by using the Sherman–Morrison formula to invert

B^{(k)}

. In addition, the Hessian estimate can be updated with

O (N^{2})

time because

B^{(k)}

is the rank-1 update and the computation of its inverse only requires the matrix-vector products of size N. Hence, Algorithm 1 exhibits better convergence in terms of the stopping criterion

ε

. The comparison is summarized in Table 2.
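The step from Equation (33) to the stated iteration complexity can be spelled out as follows. This is a sketch under the simplifying assumption that the initial objective gap, τ, λ, and r are treated as constants with respect to N and ε; the symbol C_0 is introduced here only as shorthand.

```latex
% Write the coefficient of Equation (33) as C_0, so that \|g^{(K)}\|_2 < C_0 r^K.
C_0 := \sqrt{\frac{16\,\bigl(\tilde{F}(z^{(0)}) - \tilde{F}_\star\bigr)\, N \tau^{q}}{\lambda}},
\qquad
C_0\, r^{K} \le \varepsilon
\;\Longleftrightarrow\;
K \ge \frac{\log(C_0 / \varepsilon)}{\log(1/r)}.

% With the remaining quantities held fixed, \log C_0 = \tfrac{1}{2}\log N + O(1), hence
K = O\bigl(\log(N \varepsilon^{-1})\bigr),
\qquad
\text{total cost} = O(N^{2}) \cdot K = O\bigl(N^{2} \log(N \varepsilon^{-1})\bigr).
```

The logarithmic dependence on ε is what distinguishes this linear rate from the polynomial dependence on ε in the Sinkhorn bound quoted above.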

4.3. Proofs

To prove Theorem 1, we use the lemmas below. Lemma 2 is based on Powell [24] and Byrd et al. [36]. The missing proofs are provided in Appendix C.

Lemma 1.

For the initial point

z^{(0)}

and sequence

z^{(1)}, z^{(2)}, \dots, z^{(K)}

obtained by Algorithm 1, we define the following set and its bound:

Z : = conv (\{z^{(0)}, z^{(1)}, z^{(2)}, \dots, z^{(K)}\}), R : = sup_{z \in Z} max_{i, j} \nabla Ω^{⋆} (- D_{i j} - α_{i} - β_{j}),

(35)

where

conv (S)

represents the convex hull of the set S. Then,

\tilde{F} : R^{n + m} \to R

is

M_{1}

-strongly convex and

M_{2}

-smooth over

Z

, where

M_{1} = κ

and

M_{2} \leq κ + 2 N R^{q} λ^{- 1}

. Moreover,

\tilde{F}

is

M_{2}^{'}

-smooth over

Z_{τ}

(defined in Equation (32)), where

M_{2}^{'} \leq κ + 2 N τ^{q} λ^{- 1}

.

Lemma 2.

Let

z^{(1)}, z^{(2)}, \dots, z^{(K)}

be a sequence generated by Algorithm 1 given an initial point

z^{(0)}

. In addition, let

c_{1}

,

c_{2}

,

c_{3}

,

c_{4}

, and

c_{5}

be the constants

\begin{matrix} c_{1} & : = \frac{1 - γ}{M_{2}}, & c_{2} : = \frac{n + m}{K} + M_{2}, \\ c_{3} & : = {(\frac{K}{n + m})}^{(n + m) / K} c_{2}^{\frac{n + m + K}{K}}, & c_{4} : = \frac{c_{3}}{1 - γ}, \\ c_{5} & : = \frac{2 (1 - γ^{'})}{M_{1}} . \end{matrix}

(36)

Then,

\tilde{F} (z^{(K)}) - {\tilde{F}}_{⋆} \leq {(1 - \frac{γ^{'} c_{1} M_{1}}{2 c_{4}^{2} c_{5}^{2}})}^{K / 2} (\tilde{F} (z^{(0)}) - {\tilde{F}}_{⋆}) .

(37)

Lemma 3.

Let

c_{1}

,

c_{2}

,

c_{3}

,

c_{4}

, and

c_{5}

be the same constants defined in Lemma 2. Then,

\frac{γ^{'} c_{1} M_{1}}{c_{4}^{2} c_{5}^{2}} > \frac{{(1 - γ)}^{3} γ^{'} e^{- 2 (n + m) / e}}{4 {(1 - γ^{'})}^{2}} {(\frac{M_{1}}{M_{2}})}^{3} .

(38)

Proof of Theorem 1.

Because

\tilde{F}

is differentiable and strongly convex, there exists an optimum

z_{⋆}

such that

g_{⋆} : = \nabla \tilde{F} (z_{⋆}) = 0

; this implies

∥ g^{(K)} ∥_{2} = {∥ g^{(K)} - g_{⋆} ∥}_{2}

.

By using Assumption 1 and Lemma 1, we obtain

∥ g^{(K)} - g_{⋆} ∥_{2} = ∥ \nabla \tilde{F} (z^{(K)}) - \nabla \tilde{F} (z_{⋆}) ∥_{2} \leq M_{2}^{'} {∥ z^{(K)} - z_{⋆} ∥}_{2}

. In addition,

∥ z^{(K)} - z_{⋆} ∥_{2}^{2} \leq \frac{2}{M_{1}} (\tilde{F} (z^{(K)}) - {\tilde{F}}_{⋆})

as

\tilde{F}

is

M_{1}

-strongly convex over

Z

and the stationary condition

\nabla \tilde{F} (z_{⋆}) = 0

holds. We obtain the convergence bound by using Lemmas 2 and 3 as

\begin{matrix} ∥ g^{(K)} ∥_{2} & = ∥ g^{(K)} - g_{⋆} ∥_{2} \\ \leq M_{2}^{'} {∥ z^{(K)} - z_{⋆} ∥}_{2} \\ \leq M_{2}^{'} \sqrt{\frac{2 (\tilde{F} (z^{(K)}) - {\tilde{F}}_{⋆})}{M_{1}}} \\ \leq M_{2}^{'} \sqrt{\frac{2 (\tilde{F} (z^{(0)}) - {\tilde{F}}_{⋆})}{M_{1}} {(1 - \frac{γ^{'} c_{1} M_{1}}{2 c_{4}^{2} c_{5}^{2}})}^{K / 2}} \\ < M_{2}^{'} \sqrt{\frac{2 (\tilde{F} (z^{(0)}) - {\tilde{F}}_{⋆})}{M_{1}} {(1 - \frac{{(1 - γ)}^{3} γ^{'} e^{- 2 (n + m) / e}}{8 {(1 - γ^{'})}^{2}} {(\frac{M_{1}}{M_{2}})}^{3})}^{K / 2}} \\ \leq (κ + \frac{2 N τ^{q}}{λ}) \sqrt{\frac{2 (\tilde{F} (z^{(0)}) - {\tilde{F}}_{⋆})}{κ} {(1 - \frac{C}{{(1 + 2 N R^{q} λ^{- 1} κ^{- 1})}^{3}})}^{K / 2}}, \end{matrix}

(39)

where we define

C : = \frac{{(1 - γ)}^{3} γ^{'} e^{- 2 (n + m) / e}}{8 {(1 - γ^{'})}^{2}}

and Lemma 1 is used at the last inequality to replace

M_{1}

,

M_{2}

and

M_{2}^{'}

. We can immediately confirm

C \leq \frac{1}{16}

from

0 < γ^{'} < γ < 1

,

γ^{'} < \frac{1}{2}

, and

e^{- 2 (n + m) / e} < 1

. Finally, by choosing

κ = 2 N τ^{q} λ^{- 1}

,

\begin{matrix} ∥ g^{(K)} ∥_{2} & \leq \sqrt{\frac{16 (\tilde{F} (z^{(0)}) - {\tilde{F}}_{⋆}) N τ^{q}}{λ} {(1 - \frac{C}{{(1 + {(R / τ)}^{q})}^{3}})}^{K / 2}} \\ \leq \sqrt{\frac{16 (\tilde{F} (z^{(0)}) - {\tilde{F}}_{⋆}) N τ^{q}}{λ}} r^{K}, \end{matrix}

(40)

where we use

{(R / τ)}^{q} \geq 1

(owing to

R \geq τ

by definition) and let

r : = {(1 - C / 8)}^{1 / 4}

and

\sqrt[4]{127 / 128} \leq r < 1

. □
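For readers who want the substitution step spelled out, the following worked equation shows how the choice of κ turns the last line of Equation (39) into the first line of Equation (40). The shorthand Δ_0 for the initial optimality gap is introduced here only for brevity and does not appear in the original.

```latex
% Shorthand (assumption of this note): \Delta_0 := \tilde{F}(z^{(0)}) - \tilde{F}_{\star}.
\begin{aligned}
\kappa = \frac{2 N \tau^{q}}{\lambda}
\quad &\Longrightarrow \quad
\kappa + \frac{2 N \tau^{q}}{\lambda} = 2 \kappa,
\qquad
\frac{2 N R^{q}}{\lambda \kappa} = \Bigl(\frac{R}{\tau}\Bigr)^{q}, \\
2 \kappa \sqrt{\frac{2 \Delta_0}{\kappa}}
&= \sqrt{8 \kappa \Delta_0}
= \sqrt{\frac{16 \, \Delta_0 \, N \tau^{q}}{\lambda}} .
\end{aligned}
```

The factor raised to the power K/2 stays under the square root unchanged, apart from the replacement of 2 N R^q λ^{-1} κ^{-1} by (R/τ)^q.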

Remark 2.

More precisely, Altschuler et al. [25] showed that the Sinkhorn algorithm converges in

O (N^{2} L^{3} (log N) ε^{- 3})

time, where

L : = ∥ D ∥_{\infty}

. For q-DOT, the computational complexity is not directly comparable to that of the Sinkhorn algorithm in terms of L; instead, the following analysis provides a qualitative comparison. First, the convergence rate of q-DOT in Equation (33) is translated into the iteration complexity

K = O (log (N ε^{- 1}) / log (1 / r))

. The rate r is introduced in the proof of Theorem 1 (see Equation (40)):

r \geq {(1 - \frac{C}{{(1 + {(R / τ)}^{q})}^{3}})}^{1 / 4}

. Then, by the Taylor expansion, we have a rough estimate

K \approx O (N^{2} R^{- 3 q} log (N ε^{- 1}))

, where R is a bound on the possible primal variables defined in Equation (35). We cannot directly compare

R^{- q}

and L; nevertheless,

R^{- q}

and L can be considered to be of the same order of magnitude given a reasonably sized domain

Z

, noting that

\nabla Ω (π) \approx O (π^{1 - q})

. Hence, it is reasonable to suppose that both the Sinkhorn algorithm and q-DOT roughly converge in cubic time with respect to L.

5. Numerical Experiments

5.1. Sparsity

All the simulations described in this section were executed on a 2.7 GHz quad-core Intel® Core™ i7 processor. We used the following synthetic dataset:

{(x_{i})}_{i = 1}^{n} \sim N (1_{2}, I_{2})

,

{(y_{j})}_{j = 1}^{m} \sim N (- 1_{2}, I_{2})

, and

n = m = 30

, where

N (μ, Σ)

represents the Gaussian distribution with mean

μ

and covariance

Σ

. We computed the transport matrices for the unregularized OT, q-DOT, and the Sinkhorn algorithm. For q-DOT and the Sinkhorn algorithm, different regularization parameters

λ

were compared:

λ \in \{1 \times 10^{- 2}, 1 \times 10^{- 1}, 1\}

; and

ε = 1 \times 10^{- 6}

was used as the stopping criterion: q-DOT stopped after the gradient norm was less than

ε

, and the Sinkhorn algorithm stopped after the

ℓ_{1}

error of the coupling constraints was less than

ε

. We compared different deformation parameters

q \in \{0, 0.25, 0.5, 0.75\}

and fixed the dual

ℓ_{2}

-regularization parameter

κ = 1 \times 10^{- 6}

for q-DOT. The q-DOT with

q = 0

corresponded to a regularized OT with the squared 2-norm proposed by Blondel et al. [20]. For the unregularized OT, we used the implementation of the Python optimal transport package [38]. For q-DOT, we used the L-BFGS-B method (instead of the vanilla BFGS) provided by the SciPy package [39]. To determine zero entries in the transport matrix, we did not impose any positive threshold to disregard small values (as in Swanson et al. [6]) but regarded entries smaller than machine epsilon as zero.
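The setup above can be sketched in a few lines of SciPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the cost matrix is assumed to be the Euclidean distance, the marginals are assumed uniform, exp_q is taken as the inverse of the log_q that appears in Appendix A, and the conjugate term is obtained by integrating the dual map. The gradient follows Equation (34). The function name `qdot_dual`, the random seed, and the solver options are hypothetical; note that SciPy's `gtol` acts on the max-norm of the projected gradient, which is close to but not identical to the gradient-norm criterion described in the text.

```python
import numpy as np
from scipy.optimize import minimize

def qdot_dual(z, a, b, D, q, lam, kappa):
    """Value, gradient, and primal plan of the l2-regularized dual objective."""
    n = a.size
    alpha, beta = z[:n], z[n:]
    eta = -D - alpha[:, None] - beta[None, :]
    base = np.maximum(1.0 + (1.0 - q) * eta / lam, 0.0)
    P = base ** (1.0 / (1.0 - q))  # dual map exp_q(eta / lam), exactly zero where base == 0
    # Assumed conjugate: an antiderivative of the dual map with respect to eta.
    conj = lam / (2.0 - q) * base ** ((2.0 - q) / (1.0 - q))
    value = a @ alpha + b @ beta + conj.sum() + 0.5 * kappa * (z @ z)
    grad = np.concatenate([a - P.sum(axis=1), b - P.sum(axis=0)]) + kappa * z
    return value, grad, P

# Synthetic data as described in the text; the seed is an arbitrary choice.
rng = np.random.default_rng(0)
n = m = 30
x = rng.normal(1.0, 1.0, size=(n, 2))
y = rng.normal(-1.0, 1.0, size=(m, 2))
D = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)  # assumed Euclidean cost
a = np.full(n, 1.0 / n)  # assumed uniform marginals
b = np.full(m, 1.0 / m)
q, lam, kappa, eps = 0.5, 0.1, 1e-6, 1e-6

res = minimize(lambda z: qdot_dual(z, a, b, D, q, lam, kappa)[:2],
               np.zeros(n + m), jac=True, method="L-BFGS-B",
               options={"gtol": eps, "ftol": 1e-15, "maxiter": 10000})
P_hat = qdot_dual(res.x, a, b, D, q, lam, kappa)[2]

# Entries below machine epsilon are counted as zeros, as in the text.
sparsity = np.mean(P_hat < np.finfo(float).eps)
cost = np.sum(D * P_hat)
print(sparsity, cost)
```

Because the dual map is truncated at zero, the zeros of the recovered plan come directly from the optimizer's dual variables and no post hoc thresholding is involved.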

The simulation results are shown in Table 3 and Figure 3. First, we qualitatively compared the methods using Figure 3: q-DOT obtained a transport matrix very similar to the unregularized OT solution. The solution became slightly more blurred with increasing q and

λ

. In contrast, the Sinkhorn algorithm output considerably uncertain transport matrices. Furthermore, the Sinkhorn algorithm was numerically unstable with a very small regularization such as

λ = 0.01

.

Table 3 quantifies this behavior. The transport matrices obtained by q-DOT were very sparse in most cases, and the sparsity was close to that of the unregularized OT. Furthermore, smaller q and

λ

yielded a sparser solution. Significantly, the Sinkhorn algorithm obtained completely dense matrices (sparsity = 0). Although the transport matrices of q-DOT with

(q, λ) = (0.5, 1), (0.75, 1)

appear somewhat similar to the Sinkhorn solutions in Figure 3, the former is much sparser. This suggests that a deformation parameter q slightly smaller than 1 is sufficient for q-DOT to output a sparse transport matrix.

For the obtained cost values

〈D, \hat{Π}〉

, we did not see a clear advantage of using a specific q and

λ

from the results of q-DOT. Nevertheless, it is evident that q-DOT more accurately estimated the Wasserstein cost than the Sinkhorn algorithm regardless of the q and

λ

used in this simulation.

5.2. Runtime Comparison

We compared the runtimes of q-DOT and the Sinkhorn algorithm using the same dataset as in Section 5.1, but with different dataset sizes: we chose

n = m \in \{100, 300, 500, 1000\}

. The parameter choices were the same as in Section 5.1, except that the regularization parameter was fixed to

λ = 0.1

. The results are shown in Figure 4; a larger deformation parameter q makes q-DOT converge faster when

n = m = 100

. When

n = m \geq 300

, the difference between

q = 0

,

q = 0.25

, and

q = 0.5

was not as evident. This may be partly because we fixed the parameter choice

κ = 1 \times 10^{- 6}

for all the experiments, unlike the oracle parameter choice

κ = 2 N τ^{q} λ^{- 1}

(in Theorem 1) depending on q. Nonetheless,

q = 0.75

is clearly superior to the smaller values of q. These observations empirically support the trade-off between sparsity and computation speed induced by the deformation parameter q, which is theoretically established in Theorem 1.

5.3. Approximation of 1-Wasserstein Distance

Finally, we compared the approximation errors of the 1-Wasserstein distance

| 〈D, \hat{Π}〉 - 〈D, Π_{♯}〉 |

of q-DOT and the Sinkhorn algorithm with different q and

λ

, where

\hat{Π}

represents the computed transport matrix and

Π_{♯} \in {arg min}_{Π \in U (μ, ν)} 〈D, Π〉

represents the LP solution. We used the same dataset and stopping criterion

ε

as described in Section 5.1. For the range of q, we used

q \in \{0.00, 0.25, 0.50, 0.75\}

. For the range of

λ

, we used

λ \in \{0.05, 0.1, 0.5\}

.

The result is shown in Figure 5. The difference was not significant when q was small, such as

q \in \{0.00, 0.25\}

. Once q became larger, such as

q \in \{0.50, 0.75\}

, the approximation error evidently worsened. The Sinkhorn algorithm always exhibited worse approximation errors than q-DOT with q in the range used in this simulation regardless of

λ

. Formal guarantees for the 1-Wasserstein approximation error (such as Altschuler et al. [25] and Weed [40]) will be considered in future work.
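The error metric used in this subsection can be sketched as follows. This is only an illustration: the LP solution is obtained with SciPy's `linprog` instead of the Python optimal transport package used in the paper, the Sinkhorn plan is computed with a fixed number of plain matrix-scaling iterations instead of the stopping criterion ε, and the Euclidean cost, uniform marginals, problem size, and function names are assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def lp_ot_cost(a, b, D):
    """Unregularized OT cost <D, Pi_sharp> via a linear program (row-major flattening)."""
    n, m = D.shape
    A_rows = np.kron(np.eye(n), np.ones((1, m)))  # sum_j Pi_ij = a_i
    A_cols = np.kron(np.ones((1, n)), np.eye(m))  # sum_i Pi_ij = b_j
    res = linprog(D.ravel(), A_eq=np.vstack([A_rows, A_cols]),
                  b_eq=np.concatenate([a, b]), bounds=(0, None), method="highs")
    return res.fun

def sinkhorn_plan(a, b, D, lam, n_iter=2000):
    """Entropic OT plan by alternating scaling (fixed iteration count, an assumption)."""
    K = np.exp(-D / lam)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
n = m = 10
x = rng.normal(1.0, 1.0, size=(n, 2))
y = rng.normal(-1.0, 1.0, size=(m, 2))
D = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)
a = np.full(n, 1.0 / n)
b = np.full(m, 1.0 / m)

lp_cost = lp_ot_cost(a, b, D)
Pi_sinkhorn = sinkhorn_plan(a, b, D, lam=0.5)
sinkhorn_cost = np.sum(D * Pi_sinkhorn)
abs_error = abs(sinkhorn_cost - lp_cost)
print(lp_cost, sinkhorn_cost, abs_error)
```

The same `abs_error` expression applies to any computed plan, so a q-DOT transport matrix can be substituted for `Pi_sinkhorn` to reproduce the comparison for other values of q.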

Author Contributions

Conceptualization, H.B.; methodology, H.B.; validation, H.B. and S.S.; formal analysis, H.B. and S.S.; writing—original draft preparation, H.B.; writing—review and editing, H.B. and S.S.; funding acquisition, H.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Hakubi Project, Kyoto University, and JST ERATO Grant JPMJER1903. The APC was covered by the Hakubi Project.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:

BFGS	Broyden–Fletcher–Goldfarb–Shanno
q-DOT	Deformed q-optimal transport
L-BFGS	Limited-memory BFGS
OT	Optimal transport

Appendix A. Derivation of Deformed q Entropy

Given a functional relationship

π = \nabla Ω^{⋆} (η) = {exp}_{q} (η / λ)

in Equation (20), we derive the deformed q entropy.

First, the derivative of the regularizer

\nabla Ω

is simply the inverse of the dual map

\nabla Ω^{⋆}

by Danskin’s theorem [27]; hence,

\nabla Ω (π) = λ {log}_{q} (π)

. The (negative of) deformed q entropy is recovered by integrating

\nabla Ω

:

\begin{matrix} Ω (π) & = λ \int_{0}^{π} {log}_{q} (p) d p = λ \int_{0}^{π} \frac{p^{1 - q} - 1}{1 - q} d p = \frac{λ}{1 - q} \{\frac{π^{2 - q}}{2 - q} - π\} \\ = \frac{λ}{2 - q} \{π \frac{π^{1 - q} - 1}{1 - q} - π\} = \frac{λ}{2 - q} (π {log}_{q} (π) - π) . \end{matrix}

(A1)
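The closed form in Equation (A1) can be checked numerically by integrating λ log_q from 0 to π with quadrature. The sketch below is only a sanity check; the values of q, λ, and π and the function names are arbitrary illustrative choices.

```python
import numpy as np
from scipy.integrate import quad

def log_q(p, q):
    """q-logarithm as it appears in the integrand of Equation (A1)."""
    return (p ** (1.0 - q) - 1.0) / (1.0 - q)

def omega_closed_form(pi, q, lam):
    """Right-hand side of Equation (A1)."""
    return lam / (2.0 - q) * (pi * log_q(pi, q) - pi)

q, lam, pi = 0.5, 0.1, 0.3
numeric, _ = quad(lambda p: lam * log_q(p, q), 0.0, pi)
closed = omega_closed_form(pi, q, lam)
print(numeric, closed)
```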

Appendix B. Additional Lemmas

Note again that we let

M_{1}, M_{2} > 0

be the strong convexity and smoothness constants of

\tilde{F}

over

Z

,

N : = max \{n, m\}

, and

z_{⋆} \in {arg min}_{z \in Z} \tilde{F} (z)

.

Lemma A1.

For all k,

M_{1} ∥ s^{(k)} ∥_{2}^{2} \leq ζ^{(k) ⊤} s^{(k)} \leq M_{2} {∥ s^{(k)} ∥}_{2}^{2} .

(A2)

In addition,

\frac{∥ ζ^{(k)} ∥_{2}^{2}}{ζ^{(k) ⊤} s^{(k)}} \leq M_{2} .

(A3)

Proof.

Let

{\bar{G}}^{(k)} : = \int_{0}^{1} \nabla^{2} \tilde{F} (z^{(k)} + t s^{(k)}) d t

. Then, by the chain rule and the fundamental theorem of calculus,

\begin{matrix} {\bar{G}}^{(k)} s^{(k)} & = \int_{0}^{1} \frac{\partial \nabla \tilde{F} (z^{(k)} + t s^{(k)})}{\partial t} d t \\ = \nabla \tilde{F} (z^{(k)} + s^{(k)}) - \nabla \tilde{F} (z^{(k)}) = g^{(k + 1)} - g^{(k)} = ζ^{(k)} . \end{matrix}

(A4)

Because

\tilde{F}

is

M_{1}

-strongly convex and

M_{2}

-smooth (over

Z

), we have

M_{1} {∥ w ∥}_{2}^{2} \leq w^{⊤} [\nabla^{2} \tilde{F} (z)] w \leq M_{2} {∥ w ∥}_{2}^{2}

(A5)

for all

z \in Z

and

w

. By choosing

z = z^{(k)} + t s^{(k)}

and

w = s^{(k)}

, we have

\begin{matrix} M_{1} {∥ s^{(k)} ∥}_{2}^{2} & \leq \int_{0}^{1} s^{(k) ⊤} [\nabla^{2} \tilde{F} (z^{(k)} + t s^{(k)})] s^{(k)} d t \\ = s^{(k) ⊤} {\bar{G}}^{(k)} s^{(k)} = ζ^{(k) ⊤} s^{(k)} \leq M_{2} {∥ s^{(k)} ∥}_{2}^{2} . \end{matrix}

(A6)

Note that

z^{(k)} + t s^{(k)} \in Z

follows by the definition of

Z

in Equation (35). Thus, the first statement is proven.

The second statement is proven as follows:

\begin{matrix} \frac{∥ ζ^{(k)} ∥_{2}^{2}}{ζ^{(k) ⊤} s^{(k)}} & = \frac{s^{(k) ⊤} {\bar{G}}^{(k) 2} s^{(k)}}{s^{(k) ⊤} {\bar{G}}^{(k)} s^{(k)}} = \frac{(s^{(k) ⊤} {\bar{G}}^{(k) 1 / 2}) {\bar{G}}^{(k)} ({\bar{G}}^{(k) 1 / 2} s^{(k)})}{∥ {\bar{G}}^{(k) 1 / 2} s^{(k)} ∥_{2}^{2}} \\ = \int_{0}^{1} \frac{{(s^{{(k)}^{'}})}^{⊤} [\nabla^{2} \tilde{F} (z^{(k)} + t s^{(k)})] (s^{{(k)}^{'}})}{∥ s^{{(k)}^{'}} ∥_{2}^{2}} d t \\ \leq M_{2}, \end{matrix}

(A7)

where

s^{{(k)}^{'}} : = {\bar{G}}^{(k) 1 / 2} s^{(k)}

. □

Lemma A2.

For all k,

\frac{M_{1}}{2} ∥ z^{(k)} - z_{⋆} ∥_{2} \leq {∥ g^{(k)} ∥}_{2} .

(A8)

Proof.

Because

\tilde{F}

is

M_{1}

-strongly convex over

Z

,

\begin{matrix} \frac{M_{1}}{2} {∥ z^{(k)} - z_{⋆} ∥}_{2}^{2} & \leq \tilde{F} (z_{⋆}) - \tilde{F} (z^{(k)}) + 〈\nabla \tilde{F} (z^{(k)}), z^{(k)} - z_{⋆}〉 \\ \leq ∥ g^{(k)} ∥_{2} {∥ z^{(k)} - z_{⋆} ∥}_{2}, \end{matrix}

(A9)

where the last inequality follows from the optimality of

z_{⋆}

and the Cauchy–Schwarz inequality. □

Lemma A3.

The following equations hold:

\begin{matrix} det (B^{(K)}) \leq {(\frac{c_{2} K}{n + m})}^{n + m}, \end{matrix}

(A10)

\begin{matrix} \prod_{k = 0}^{K - 1} \frac{∥ B^{(k)} s^{(k)} ∥_{2}^{2}}{s^{(k) ⊤} B^{(k)} s^{(k)}} \leq c_{2}^{K}, \end{matrix}

(A11)

\begin{matrix} \frac{det (B^{(K)})}{det (B^{(0)})} = \prod_{k = 0}^{K - 1} \frac{ζ^{(k) ⊤} s^{(k)}}{s^{(k) ⊤} B^{(k)} s^{(k)}}, \end{matrix}

(A12)

where

c_{2} : = \frac{n + m}{K} + M_{2}

is defined in Lemma 2.

Proof.

To prove Equation (A10), we use the linearity of the trace and

tr (b a^{⊤}) = a^{⊤} b

to evaluate

tr (B^{(k + 1)})

as follows:

\begin{matrix} tr (B^{(k + 1)}) & = tr (B^{(k)} - \frac{B^{(k)} s^{(k)} s^{(k) ⊤} B^{(k)}}{s^{(k) ⊤} B^{(k)} s^{(k)}} + \frac{ζ^{(k)} ζ^{(k) ⊤}}{ζ^{(k) ⊤} s^{(k)}}) \\ = tr (B^{(k)}) - \underset{\geq 0}{\underset{︸}{tr (\frac{B^{(k)} s^{(k)} s^{(k) ⊤} B^{(k)}}{s^{(k) ⊤} B^{(k)} s^{(k)}})}} + tr (\frac{ζ^{(k)} ζ^{(k) ⊤}}{ζ^{(k) ⊤} s^{(k)}}) \\ \leq tr (B^{(k)}) + \frac{∥ ζ^{(k)} ∥_{2}^{2}}{ζ^{(k) ⊤} s^{(k)}} \\ \leq tr (B^{(0)}) + \sum_{j = 0}^{k} \frac{∥ ζ^{(j)} ∥_{2}^{2}}{ζ^{(j) ⊤} s^{(j)}} \\ \leq tr (B^{(0)}) + (k + 1) M_{2}, \end{matrix}

(A13)

where Lemma A1 is used at the last inequality. Note that the trace is the sum of the eigenvalues, whereas the determinant is the product of the eigenvalues. Then, we can use the AM–GM inequality to translate the determinant into the trace as follows:

det (B^{(k + 1)}) \leq {(\frac{1}{n + m} tr (B^{(k + 1)}))}^{n + m} \leq {(\frac{tr (B^{(0)}) + M_{2} (k + 1)}{n + m})}^{n + m} .

(A14)

Hence, by substituting

k = K - 1

and

tr (B^{(0)}) = n + m

, Equation (A10) is proven.

To prove Equation (A11), we evaluate

tr (B^{(k + 1)})

in a way similar to that for Equation (A13). From Lemma A1,

\begin{matrix} 0 & \leq tr (B^{(k + 1)}) = tr (B^{(k)}) - \frac{∥ B^{(k)} s^{(k)} ∥_{2}^{2}}{s^{(k) ⊤} B^{(k)} s^{(k)}} + \frac{∥ ζ^{(k)} ∥_{2}^{2}}{ζ^{(k) ⊤} s^{(k)}} \\ = tr (B^{(0)}) - \sum_{j = 0}^{k} \frac{∥ B^{(j)} s^{(j)} ∥_{2}^{2}}{s^{(j) ⊤} B^{(j)} s^{(j)}} + \sum_{j = 0}^{k} \frac{∥ ζ^{(j)} ∥_{2}^{2}}{ζ^{(j) ⊤} s^{(j)}} \\ \leq tr (B^{(0)}) - \sum_{j = 0}^{k} \frac{∥ B^{(j)} s^{(j)} ∥_{2}^{2}}{s^{(j) ⊤} B^{(j)} s^{(j)}} + (k + 1) M_{2} . \end{matrix}

(A15)

By the AM–GM inequality,

\prod_{j = 0}^{k} \frac{∥ B^{(j)} s^{(j)} ∥_{2}^{2}}{s^{(j) ⊤} B^{(j)} s^{(j)}} \leq {(\frac{1}{k + 1} \sum_{j = 0}^{k} \frac{∥ B^{(j)} s^{(j)} ∥_{2}^{2}}{s^{(j) ⊤} B^{(j)} s^{(j)}})}^{k + 1} .

(A16)

Hence, by substituting

k = K - 1

and

tr (B^{(0)}) = n + m

, Equation (A11) is proven.

To prove Equation (A12), we use the matrix determinant lemma to expand

det (B^{(k + 1)})

as follows:

\begin{matrix} det (B^{(k + 1)}) & = det (B^{(k)} - \frac{B^{(k)} s^{(k)} s^{(k) ⊤} B^{(k)}}{s^{(k) ⊤} B^{(k)} s^{(k)}} + \frac{ζ^{(k)} ζ^{(k) ⊤}}{ζ^{(k) ⊤} s^{(k)}}) \\ = \{1 - \frac{1}{s^{(k) ⊤} B^{(k)} s^{(k)}} \cdot s^{(k) ⊤} B^{(k)} {(B^{(k)} + \frac{ζ^{(k)} ζ^{(k) ⊤}}{ζ^{(k) ⊤} s^{(k)}})}^{- 1} B^{(k)} s^{(k)}\} \\ \cdot det (B^{(k)} + \frac{ζ^{(k)} ζ^{(k) ⊤}}{ζ^{(k) ⊤} s^{(k)}}) . \end{matrix}

(A17)

Further, by the Sherman–Morrison formula, we have

{(B^{(k)} + \frac{ζ^{(k)} ζ^{(k) ⊤}}{ζ^{(k) ⊤} s^{(k)}})}^{- 1} = B^{(k) - 1} - \frac{B^{(k) - 1} ζ^{(k)} ζ^{(k) ⊤} B^{(k) - 1}}{ζ^{(k) ⊤} s^{(k)} + ζ^{(k) ⊤} B^{(k) - 1} ζ^{(k)}} .

(A18)

By plugging Equation (A18) into Equation (A17), we have

\begin{matrix} det (B^{(k + 1)}) & = \frac{{(s^{(k) ⊤} ζ^{(k)})}^{2}}{(s^{(k) ⊤} B^{(k)} s^{(k)}) (ζ^{(k) ⊤} s^{(k)} + ζ^{(k) ⊤} B^{(k) - 1} ζ^{(k)})} det (B^{(k)} + \frac{ζ^{(k)} ζ^{(k) ⊤}}{ζ^{(k) ⊤} s^{(k)}}) \\ = \frac{{(s^{(k) ⊤} ζ^{(k)})}^{2}}{(s^{(k) ⊤} B^{(k)} s^{(k)}) (ζ^{(k) ⊤} s^{(k)} + ζ^{(k) ⊤} B^{(k) - 1} ζ^{(k)})} \\ \cdot (1 + \frac{ζ^{(k) ⊤} B^{(k) - 1} ζ^{(k)}}{ζ^{(k) ⊤} s^{(k)}}) det (B^{(k)}) \\ = det (B^{(k)}) \frac{ζ^{(k) ⊤} s^{(k)}}{s^{(k) ⊤} B^{(k)} s^{(k)}}, \end{matrix}

(A19)

where the matrix determinant lemma is invoked again at the second identity. Recursively applying Equation (A19) with

det (B^{(0)}) = 1

, we obtain Equation (A12). □
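The determinant identity in Equation (A12) can be checked numerically. The sketch below applies the BFGS update from Equation (A13) a few times, starting from the identity matrix, and compares the determinant with the accumulated product of ratios. The steps s are random and ζ = A s for a hypothetical positive definite matrix A, which only serves to guarantee ζᵀs > 0; the identity itself does not depend on how the steps are chosen.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5
M = rng.standard_normal((N, N))
A = M @ M.T + N * np.eye(N)  # hypothetical curvature matrix (positive definite)

B = np.eye(N)  # det(B^(0)) = 1
ratio_product = 1.0
for _ in range(4):
    s = rng.standard_normal(N)
    zeta = A @ s
    Bs = B @ s
    ratio_product *= (zeta @ s) / (s @ Bs)  # one factor of Equation (A12)
    B = B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(zeta, zeta) / (zeta @ s)

det_B = np.linalg.det(B)
print(det_B, ratio_product)
```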

Lemma A4.

For all k,

∥ s^{(k)} ∥_{2} \leq c_{5} {∥ g^{(k)} ∥}_{2} cos θ_{k}

, where

θ_{k}

is the angle between

s^{(k)}

and

- g^{(k)}

, and

c_{5} : = \frac{2 (1 - γ^{'})}{M_{1}}

is defined in Lemma 2.

Proof.

By the Armijo condition (Equation (30)), we have

\tilde{F} (z^{(k + 1)}) - \tilde{F} (z^{(k)}) \leq γ^{'} ρ^{(k)} g^{(k) ⊤} d^{(k)} = γ^{'} g^{(k) ⊤} s^{(k)} .

(A20)

Additionally, as

\tilde{F}

is

M_{1}

-strongly convex over

Z

, it holds that

\tilde{F} (z^{(k + 1)}) - \tilde{F} (z^{(k)}) \geq s^{(k) ⊤} g^{(k)} + \frac{1}{2} M_{1} {∥ s^{(k)} ∥}_{2}^{2}

. Hence,

\begin{matrix} s^{(k) ⊤} g^{(k)} + \frac{1}{2} M_{1} {∥ s^{(k)} ∥}_{2}^{2} \leq γ^{'} g^{(k) ⊤} s^{(k)} \\ \Rightarrow & (1 - γ^{'}) (- s^{(k) ⊤} g^{(k)}) \geq \frac{1}{2} M_{1} {∥ s^{(k)} ∥}_{2}^{2} \\ \Rightarrow & ∥ s^{(k)} ∥_{2} \leq \underset{= c_{5}}{\underset{︸}{\frac{2 (1 - γ^{'})}{M_{1}}}} \underset{= cos θ_{k}}{\underset{︸}{\frac{- s^{(k) ⊤} g^{(k)}}{∥ s^{(k)} ∥_{2} {∥ g^{(k)} ∥}_{2}}}} {∥ g^{(k)} ∥}_{2}, \end{matrix}

(A21)

which is the desired inequality. □

Lemma A5.

For all k, let

θ_{k}

be the angle between

s^{(k)}

and

- g^{(k)}

. Then,

\prod_{k = 0}^{K - 1} (1 - \frac{γ^{'} c_{1} M_{1} {cos}^{2} θ_{k}}{2}) \leq {(1 - \frac{γ^{'} c_{1} M_{1}}{2 c_{4}^{2} c_{5}^{2}})}^{K / 2},

(A22)

where

c_{1}

,

c_{4}

, and

c_{5}

are defined in Lemma 2.

Proof.

By multiplying each side of Equations (A10)–(A12), we have

\prod_{k = 0}^{K - 1} \frac{∥ B^{(k)} s^{(k)} ∥_{2}^{2}}{s^{(k) ⊤} B^{(k)} s^{(k)}} \cdot \frac{ζ^{(k) ⊤} s^{(k)}}{s^{(k) ⊤} B^{(k)} s^{(k)}} \leq c_{3}^{K},

(A23)

where

c_{3} : = {(\frac{K}{n + m})}^{(n + m) / K} c_{2}^{\frac{n + m + K}{K}}

is defined in Lemma 2. By using

B^{(k)} s^{(k)} = - ρ^{(k)} g^{(k)}

and

ζ^{(k) ⊤} s^{(k)} \geq - (1 - γ) g^{(k) ⊤} s^{(k)}

(shown in Equation (A33)),

\begin{matrix} \prod_{k = 0}^{K - 1} \frac{∥ B^{(k)} s^{(k)} ∥_{2}^{2}}{s^{(k) ⊤} B^{(k)} s^{(k)}} \cdot \frac{ζ^{(k) ⊤} s^{(k)}}{s^{(k) ⊤} B^{(k)} s^{(k)}} & = \prod_{k = 0}^{K - 1} \frac{∥ g^{(k)} ∥_{2}^{2} \cdot ζ^{(k) ⊤} s^{(k)}}{{(- s^{(k) ⊤} g^{(k)})}^{2}} \\ \geq {(1 - γ)}^{K} \cdot \prod_{k = 0}^{K - 1} \frac{∥ g^{(k)} ∥_{2}^{2}}{- s^{(k) ⊤} g^{(k)}} . \end{matrix}

(A24)

Hence,

\prod_{k = 0}^{K - 1} \frac{∥ g^{(k)} ∥_{2}}{∥ s^{(k)} ∥_{2} cos θ_{k}} \leq {(\frac{c_{3}}{1 - γ})}^{K} = c_{4}^{K} .

(A25)

By Lemma A4, we can confirm

\prod_{k = 0}^{K - 1} {cos}^{2} θ_{k} \geq \prod_{k = 0}^{K - 1} \frac{1}{c_{4}} \frac{∥ g^{(k)} ∥_{2} cos θ_{k}}{∥ s^{(k)} ∥_{2}} \geq {(\frac{1}{c_{4} c_{5}})}^{K} .

(A26)

Let

\hat{K}

be the number of

k = 0, 1, \dots, K - 1

such that

cos θ_{k} \leq \frac{1}{c_{4} c_{5}}

, then

{(\frac{1}{c_{4} c_{5}})}^{K} \leq \prod_{k = 0}^{K - 1} {cos}^{2} θ_{k} \leq {(\frac{1}{c_{4} c_{5}})}^{2 \hat{K}},

(A27)

implying that

\hat{K}

is at most

\frac{K}{2}

(note that

\frac{1}{c_{4} c_{5}} < 1

from Equation (A26)). Therefore,

\prod_{k = 0}^{K - 1} (1 - \frac{γ^{'} c_{1} M_{1} {cos}^{2} θ_{k}}{2}) \leq {(1 - \frac{γ^{'} c_{1} M_{1}}{2 c_{4}^{2} c_{5}^{2}})}^{K / 2} .

(A28)

□

Appendix C. Deferred Proofs

Appendix C.1. Proof of Lemma 1

Proof.

It is easy to confirm

M_{1} = κ

because

\tilde{F}

is the sum of

F

(convex) and

\frac{κ}{2} {∥ z ∥}_{2}^{2}

.

Because

\tilde{F}

is twice differentiable and

Z

is a closed convex set, we evaluate the smoothness parameter

M_{2}

(over

Z

) by the eigenvalues of

\nabla^{2} \tilde{F} (z)

. We begin by evaluating the eigenvalues of

\nabla^{2} F (z)

, then evaluate the eigenvalues of

\nabla^{2} \tilde{F} (z)

by

\nabla^{2} \tilde{F} (z) = \nabla^{2} F (z) + κ I

. Let

P (z) \in R^{n \times m}

be a matrix such that

P_{i j} (z) : = \nabla Ω^{⋆} (- D_{i j} - α_{i} - β_{j})

. Here,

P_{i j} (z)

is the primal variable corresponding to the dual variables

(α_{i}, β_{j})

(see Equation (15)). The gradient of

F

is

\nabla F (z) = [\begin{matrix} ⋮ \\ a_{i} - \sum_{j = 1}^{m} \nabla Ω^{⋆} (- D_{i j} - α_{i} - β_{j}) \\ ⋮ \\ b_{j} - \sum_{i = 1}^{n} \nabla Ω^{⋆} (- D_{i j} - α_{i} - β_{j}) \\ ⋮ \end{matrix}] = [\begin{matrix} ⋮ \\ a_{i} - \sum_{j = 1}^{m} P_{i j} (z) \\ ⋮ \\ b_{j} - \sum_{i = 1}^{n} P_{i j} (z) \\ ⋮ \end{matrix}],

(A29)

and the Hessian of

F

is

\nabla^{2} F (z) = \frac{1}{λ} \cdot \underset{H}{\underset{︸}{[\begin{matrix} diag (\sum_{j} P_{i j} {(z)}^{q}) & P {(z)}^{q} \\ {(P {(z)}^{q})}^{⊤} & diag (\sum_{i} P_{i j} {(z)}^{q}) \end{matrix}]}},

(A30)

where

P {(z)}^{q}

is the element-wise power of

P (z)

. Then, by invoking the Gershgorin circle theorem (Theorem 7.2.1 of [41]), the eigenvalues of

H

can be upper bounded by the following value:

\begin{matrix} max \{\underset{center of i-th disc}{\underset{︸}{\sum_{j} P_{i j} {(z)}^{q}}} + \underset{radius of i-th disc}{\underset{︸}{{[P {(z)}^{q} 1_{m}]}_{i}}}, \sum_{i} P_{i j} {(z)}^{q} + {[{(P {(z)}^{q})}^{⊤} 1_{n}]}_{j}\} \\ \leq max \{2 \sum_{j = 1}^{m} P_{i j} {(z)}^{q}, 2 \sum_{i = 1}^{n} P_{i j} {(z)}^{q}\} \\ \leq 2 N R^{q}, \end{matrix}

(A31)

where we use

0 \leq P_{i j} (z) \leq R

for all i, j, and

z \in Z

at the last inequality. Hence,

M_{2} \leq κ + \frac{2 N R^{q}}{λ}

.

M_{2}^{'} \leq κ + \frac{2 N τ^{q}}{λ}

is confirmed by noting that

0 \leq P_{i j} (z) \leq τ

for all i, j, and

z \in Z_{τ}

and that

Z_{τ}

is a closed convex set. □
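The Gershgorin bound used in this proof can be checked on a small example. The sketch below builds the block matrix of Equation (A30) for a random nonnegative matrix P and compares its largest eigenvalue with 2 N R^q / λ. The sizes, the seed, and the choice of P are arbitrary; in particular, P here is not a transport plan produced by Algorithm 1.

```python
import numpy as np

def hessian_F(P, q, lam):
    """Block Hessian of F from Equation (A30), using the element-wise power P**q."""
    Pq = P ** q
    top = np.hstack([np.diag(Pq.sum(axis=1)), Pq])
    bottom = np.hstack([Pq.T, np.diag(Pq.sum(axis=0))])
    return np.vstack([top, bottom]) / lam

rng = np.random.default_rng(0)
n, m, q, lam = 4, 6, 0.5, 0.1
P = rng.uniform(0.0, 1.0, size=(n, m))  # arbitrary nonnegative entries
R = P.max()
N = max(n, m)

eigenvalues = np.linalg.eigvalsh(hessian_F(P, q, lam))
gershgorin_bound = 2 * N * R ** q / lam
print(eigenvalues.max(), gershgorin_bound)
```

Adding κ to both quantities gives the corresponding check for the smoothness constant M_2 of the regularized objective.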

Appendix C.2. Proof of Lemma 2

Proof.

First, we evaluate the ratio between

\tilde{F} (z^{(k + 1)}) - {\tilde{F}}_{⋆}

and

\tilde{F} (z^{(k)}) - {\tilde{F}}_{⋆}

for

k = 0, 1, 2, \dots, K - 1

. Let

θ_{k}

be the angle between the vectors

s^{(k)}

and

- g^{(k)}

. By the Armijo condition (Equation (30)), the difference

\tilde{F} (z^{(k + 1)}) - \tilde{F} (z^{(k)})

can be evaluated as follows:

\begin{matrix} \tilde{F} (z^{(k + 1)}) - \tilde{F} (z^{(k)}) & \leq γ^{'} ρ^{(k)} g^{(k) ⊤} d^{(k)} \\ = γ^{'} g^{(k) ⊤} (z^{(k + 1)} - z^{(k)}) \\ = γ^{'} g^{(k) ⊤} s^{(k)} \\ = γ^{'} (- ∥ s^{(k)} ∥_{2} ∥ g^{(k)} ∥_{2} cos θ_{k}) . \end{matrix}

(A32)

In addition, by the curvature condition (Equation (31)),

\begin{matrix} ζ^{(k) ⊤} s^{(k)} & = \underset{= ρ^{(k)} g^{(k + 1) ⊤} d^{(k)} \geq ρ^{(k)} \cdot γ g^{(k) ⊤} d^{(k)}}{\underset{︸}{g^{(k + 1) ⊤} s^{(k)}}} - g^{(k) ⊤} s^{(k)} \\ \geq γ g^{(k) ⊤} s^{(k)} - g^{(k) ⊤} s^{(k)} \\ = - (1 - γ) g^{(k) ⊤} s^{(k)}, \end{matrix}

(A33)

which implies

∥ s^{(k)} ∥_{2}^{2} \geq \frac{1}{M_{2}} ζ^{(k) ⊤} s^{(k)} \geq - \frac{1 - γ}{M_{2}} g^{(k) ⊤} s^{(k)} = \frac{1 - γ}{M_{2}} ∥ s^{(k)} ∥_{2} {∥ g^{(k)} ∥}_{2} cos θ_{k}

together with Lemma A1. Hence, we have

∥ s^{(k)} ∥_{2} \geq c_{1} {∥ g^{(k)} ∥}_{2} cos θ_{k},

(A34)

where

c_{1} : = \frac{1 - γ}{M_{2}}

. Then,

\begin{matrix} \tilde{F} (z^{(k + 1)}) - {\tilde{F}}_{⋆} & \leq (\tilde{F} (z^{(k)}) - {\tilde{F}}_{⋆}) + γ^{'} (- ∥ s^{(k)} ∥_{2} ∥ g^{(k)} ∥_{2} cos θ_{k}) \\ \leq (\tilde{F} (z^{(k)}) - {\tilde{F}}_{⋆}) - γ^{'} c_{1} {∥ g^{(k)} ∥}_{2}^{2} {cos}^{2} θ_{k} \\ \leq (\tilde{F} (z^{(k)}) - {\tilde{F}}_{⋆}) - γ^{'} c_{1} (M_{1} / 2) ∥ g^{(k)} ∥_{2} {∥ z^{(k)} - z_{⋆} ∥}_{2} {cos}^{2} θ_{k} \\ \leq (\tilde{F} (z^{(k)}) - {\tilde{F}}_{⋆}) - γ^{'} c_{1} (M_{1} / 2) {cos}^{2} θ_{k} (\tilde{F} (z^{(k)}) - {\tilde{F}}_{⋆}) \\ = (1 - γ^{'} c_{1} M_{1} {cos}^{2} θ_{k} / 2) (\tilde{F} (z^{(k)}) - {\tilde{F}}_{⋆}), \end{matrix}

(A35)

where Equation (A32) is used at the first inequality; Equation (A34) is used at the second inequality; Lemma A2 is used at the third inequality; and the following consequence of convexity,

\tilde{F} (z^{(k)}) - \tilde{F} (z_{⋆}) \leq 〈g^{(k)}, z^{(k)} - z_{⋆}〉 \leq ∥ g^{(k)} ∥_{2} {∥ z^{(k)} - z_{⋆} ∥}_{2}

is used at the fourth inequality.

Next, recursively invoking the inequality Equation (A35), we obtain

\begin{matrix} \tilde{F} (z^{(K)}) - {\tilde{F}}_{⋆} & \leq \{\prod_{k = 0}^{K - 1} (1 - \frac{γ^{'} c_{1} M_{1} {cos}^{2} θ_{k}}{2})\} (\tilde{F} (z^{(0)}) - {\tilde{F}}_{⋆}) \\ \leq {(1 - \frac{γ^{'} c_{1} M_{1}}{2 c_{4}^{2} c_{5}^{2}})}^{K / 2} (\tilde{F} (z^{(0)}) - {\tilde{F}}_{⋆}), \end{matrix}

(A36)

which is the desired bound. The last inequality is due to Lemma A5. □

Appendix C.3. Proof of Lemma 3

Proof.

By substituting the definitions of the constants

c_{1}

,

c_{2}

,

c_{3}

,

c_{4}

, and

c_{5}

,

\begin{matrix} \frac{γ^{'} c_{1} M_{1}}{c_{4}^{2} c_{5}^{2}} = \frac{γ^{'} \cdot \frac{1 - γ}{M_{2}} \cdot M_{1}}{{(\frac{c_{3}}{1 - γ})}^{2} \cdot {(\frac{2 (1 - γ^{'})}{M_{1}})}^{2}} \\ = \frac{{(1 - γ)}^{3} M_{1}^{3} γ^{'}}{4 {(1 - γ^{'})}^{2} c_{3}^{2} M_{2}} \\ = \frac{M_{1}^{3} {(1 - γ)}^{3} γ^{'}}{4 M_{2} {(1 - γ^{'})}^{2}} \frac{1}{{\{{(\frac{1}{{(n + m)}^{n + m}})}^{1 / K} c_{2}^{\frac{n + m + K}{K}} K^{\frac{n + m}{K}}\}}^{2}} \\ = \frac{M_{1}^{3} {(1 - γ)}^{3} γ^{'}}{4 M_{2} {(1 - γ^{'})}^{2}} \cdot {(n + m)}^{\frac{2 (n + m)}{K}} \cdot {\{(\frac{n + m}{K} + M_{2}) \cdot K^{\frac{n + m}{n + m + K}}\}}^{- \frac{2 (n + m + K)}{K}} \\ > \frac{M_{1}^{3} {(1 - γ)}^{3} γ^{'}}{4 M_{2} {(1 - γ^{'})}^{2}} \cdot 1 \cdot M_{2}^{- 2} K^{- \frac{2 (n + m)}{K}} \\ \geq \frac{{(1 - γ)}^{3} γ^{'} e^{- 2 (n + m) / e}}{4 {(1 - γ^{'})}^{2}} {(\frac{M_{1}}{M_{2}})}^{3}, \end{matrix}

(A37)

where, at the first inequality, we invoke

{(n + m)}^{\frac{2 (n + m)}{K}} > 1

and

\begin{matrix} {\{(\frac{n + m}{K} + M_{2}) \cdot K^{\frac{n + m}{n + m + K}}\}}^{- \frac{2 (n + m + K)}{K}} & \geq {(M_{2} K^{\frac{n + m}{n + m + K}})}^{- \frac{2 (n + m + K)}{K}} \\ = M_{2}^{- \frac{2 (n + m + K)}{K}} K^{- \frac{2 (n + m)}{K}} \\ \geq M_{2}^{- 2} K^{- \frac{2 (n + m)}{K}}, \end{matrix}

(A38)

and we use

K^{- \frac{2 (n + m)}{K}} \geq e^{- \frac{2 (n + m)}{e}}

for all K at the second inequality. Hence, the desired inequality is proven. □
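The bound on the K-dependent factor used at the second inequality follows from a one-line calculus argument, spelled out here for completeness. K is treated as a positive real variable, which covers every positive integer.

```latex
\frac{d}{dK} \, \frac{\ln K}{K} = \frac{1 - \ln K}{K^{2}} = 0
\iff K = e,
\qquad \text{hence} \qquad
\max_{K > 0} \frac{\ln K}{K} = \frac{1}{e},
\\[4pt]
K^{- \frac{2 (n + m)}{K}}
= \exp \Bigl( - 2 (n + m) \, \frac{\ln K}{K} \Bigr)
\geq \exp \Bigl( - \frac{2 (n + m)}{e} \Bigr) .
```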

Appendix D. Additional Experiments

Appendix D.1. Comparison with Tsallis Entropy

In this study, we used the deformed q entropy instead of the Tsallis entropy [23] as the sparse regularization. Here, we briefly empirically analyze what happens if we use the Tsallis entropy instead. We compare the dual optimization objective in Definition 2 with the deformed q entropy and Tsallis entropy. We use the following convex regularizer formed by the Tsallis entropy:

Ω (π) = λ \sum_{i = 1}^{n} π_{i}^{q} {log}_{q} (π_{i}) .

(A39)

The simulations in this section were executed on a 2.7 GHz quad-core Intel® Core™ i7 processor. We used the following synthetic dataset:

{(x_{i})}_{i = 1}^{n} \sim N (1_{2}, I_{2})

,

{(y_{j})}_{j = 1}^{m} \sim N (- 1_{2}, I_{2})

, and

n = m = 100

. For q-DOT and Tsallis-regularized OT, different regularization parameters

λ \in \{0.5, 1\}

were compared, and

ε = 1 \times 10^{- 6}

was used as the stopping criterion on the gradient norm. The range of regularization parameters differed from that in Section 5.1 because Tsallis-regularized OT does not converge with too-small regularization parameters such as

λ = 0.01

. We compared different deformation parameters

q \in \{0, 0.25, 0.5, 0.75\}

. For the unregularized OT, we used the implementation of the Python optimal transport package [38]. For q-DOT and Tsallis-regularized OT, we used the L-BFGS-B method provided by the SciPy package [39]. To determine zero entries in the transport matrix, we regarded entries smaller than machine epsilon as zero.

Table A1. Comparison of the sparsity and absolute error on the synthetic dataset. Sparsity indicates the ratio of zero entries in each transport matrix. We counted the number of entries smaller than machine epsilon to measure the sparsity instead of imposing a small positive threshold for determining zero entries. Abs. error indicates the absolute error of the computed cost with respect to 1-Wasserstein distance. Tsallis-regularized OT with

q = 0.00

does not work due to numerical instability.

	Sparsity (q-DOT)	Abs. Error (q-DOT)	Sparsity (Tsallis)	Abs. Error (Tsallis)
$q = 0.00, λ = 0.50$	0.984	0.001	—	—
$q = 0.00, λ = 1.00$	0.981	0.011	—	—
$q = 0.25, λ = 0.50$	0.977	0.008	0.000	3.362
$q = 0.25, λ = 1.00$	0.973	0.010	0.000	3.388
$q = 0.50, λ = 0.50$	0.959	0.015	0.000	3.153
$q = 0.50, λ = 1.00$	0.944	0.022	0.000	3.283
$q = 0.75, λ = 0.50$	0.861	0.052	0.000	1.962
$q = 0.75, λ = 1.00$	0.776	0.099	0.000	2.582

As can be seen from the results in Table A1, the Tsallis entropic regularizer neither induces sparsity nor achieves a better approximation of the 1-Wasserstein distance than the deformed q entropy. Note that the Tsallis entropy induces the dual map

\nabla Ω^{⋆} (η) = q^{1 / (1 - q)} / {exp}_{q} (- η / λ)

shown in Equation (25), which has dense support for

q > 0

and becomes the source of dense transport matrices. This verifies that the design of the regularizer is important for regularized optimal transport.
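The difference in support between the two dual maps can be seen by evaluating them on a grid. The sketch below assumes the standard truncated form of the q-exponential, which is the inverse of the log_q used in Appendix A, and evaluates the Tsallis dual map exactly as quoted above. The function names, the grid, and the values of q and λ are illustrative assumptions.

```python
import numpy as np

def exp_q(x, q):
    """Truncated q-exponential (assumed form): [1 + (1 - q) x]_+^(1 / (1 - q))."""
    return np.maximum(1.0 + (1.0 - q) * x, 0.0) ** (1.0 / (1.0 - q))

def dual_map_deformed(eta, q, lam):
    """Dual map of the deformed q-entropy, exp_q(eta / lam)."""
    return exp_q(eta / lam, q)

def dual_map_tsallis(eta, q, lam):
    """Dual map of the Tsallis regularizer as quoted in the text."""
    return q ** (1.0 / (1.0 - q)) / exp_q(-eta / lam, q)

q, lam = 0.5, 0.1
eta = np.linspace(-3.0, 0.0, 13)  # nonpositive dual slacks
pi_deformed = dual_map_deformed(eta, q, lam)
pi_tsallis = dual_map_tsallis(eta, q, lam)

print(np.count_nonzero(pi_deformed == 0.0), "exact zeros (deformed q-entropy)")
print(np.count_nonzero(pi_tsallis == 0.0), "exact zeros (Tsallis)")
```

On this grid, the deformed map vanishes exactly for every η at or below -λ/(1 - q), whereas the Tsallis map only decays toward zero without reaching it.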

Appendix D.2. Hyperparameter Sensitivity

In this section, we summarize more comprehensive experimental results of q-DOT and the Sinkhorn algorithm to show the performance dependence on hyperparameters q and

λ

. Subsequently, we describe experiments to show the sparsity of transport matrices, absolute error of computed costs with respect to 1-Wasserstein distance, and runtime with differently-sized datasets.

The simulations in this section were executed on a 2.7 GHz Intel® Xeon® Gold 6258R processor (different from the processor that we used in Section 5). We used the following synthetic dataset:

{(x_{i})}_{i = 1}^{n} \sim N (1_{2}, I_{2})

,

{(y_{j})}_{j = 1}^{m} \sim N (- 1_{2}, I_{2})

, with different

N (= n = m) \in \{100, 300, 500, 1000, 2000, 3000\}

. For q-DOT and the Sinkhorn algorithm, different regularization parameters

λ \in \{0.01, 0.1, 1\}

were compared, and

ε = 1 \times 10^{- 6}

was used as the stopping criterion. We compared different deformation parameters

q \in \{0, 0.25, 0.5, 0.75\}

. For the unregularized OT, we used the implementation of the Python optimal transport package [38]. For q-DOT, we used the L-BFGS-B method provided by the SciPy package [39]. To determine zero entries in the transport matrix, we regarded entries smaller than machine epsilon as zero.
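The setup above can be sketched on a small instance as follows. This is a minimal illustration under several assumptions that this section does not state: a Euclidean ground cost, uniform marginals, and a dual objective built from the conjugate $Ω^{⋆}$ of the deformed q entropy in Table 1. The function name `qdot`, the instance size, and the solver tolerances are hypothetical choices. The exact 1-Wasserstein value is computed with `scipy.optimize.linear_sum_assignment` rather than the POT solver used in the paper; this substitution is valid only for uniform marginals with $n = m$, where an optimal plan is a scaled permutation matrix.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment, minimize


def exp_q(x, q):
    """q-exponential [1 + (1 - q) x]_+^{1 / (1 - q)}; assumed definition, 0 <= q < 1."""
    return np.maximum(1.0 + (1.0 - q) * x, 0.0) ** (1.0 / (1.0 - q))


def qdot(D, a, b, q, lam):
    """Sketch: minimize the negated dual of q-regularized OT with L-BFGS-B."""
    n, m = D.shape

    def neg_dual(theta):
        alpha, beta = theta[:n], theta[n:]
        eta = alpha[:, None] + beta[None, :] - D
        plan = exp_q(eta / lam, q)                  # gradient of the conjugate
        conj = lam / (2.0 - q) * plan ** (2.0 - q)  # conjugate value (Table 1)
        value = conj.sum() - a @ alpha - b @ beta
        # The gradient is the violation of the two marginal constraints.
        grad = np.concatenate([plan.sum(axis=1) - a, plan.sum(axis=0) - b])
        return value, grad

    res = minimize(neg_dual, np.zeros(n + m), jac=True, method="L-BFGS-B",
                   options={"maxiter": 5000, "ftol": 1e-15, "gtol": 1e-8})
    alpha, beta = res.x[:n], res.x[n:]
    return exp_q((alpha[:, None] + beta[None, :] - D) / lam, q)


# Small synthetic instance (N = 30 is an illustrative size, not from the paper).
rng = np.random.default_rng(0)
N = 30
x = rng.normal(loc=1.0, scale=1.0, size=(N, 2))    # x_i ~ N(1_2, I_2)
y = rng.normal(loc=-1.0, scale=1.0, size=(N, 2))   # y_j ~ N(-1_2, I_2)
D = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)  # assumed Euclidean cost
a = b = np.full(N, 1.0 / N)

plan = qdot(D, a, b, q=0.5, lam=1.0)

# Sparsity: fraction of entries below machine epsilon, as described in the text.
sparsity = float(np.mean(plan < np.finfo(float).eps))

# Exact 1-Wasserstein value for uniform marginals via an optimal assignment.
rows, cols = linear_sum_assignment(D)
w1 = D[rows, cols].sum() / N
abs_error = abs(float((D * plan).sum()) - w1)

print(f"sparsity = {sparsity:.3f}, abs. error = {abs_error:.3e}")
```

The numbers printed by this sketch are not expected to match Table A2, since the instance size, ground cost, and stopping rule here are assumptions rather than the paper's configuration.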

The results are shown in Table A2. In these tables, the results with $q = 1.00$ correspond to the Sinkhorn algorithm. The results for $(q, λ) = (1.00, 0.01)$ are missing because the Sinkhorn algorithm did not work well in this setting due to numerical instability. In general, we observed behavior similar to that described in Section 5: sparsity intensified as q and $λ$ decreased, at the cost of increased runtime. As N increased, nonmonotonic trends in runtime were observed with respect to q: for a fixed $λ$, larger q accelerated the computation, while $q = 0.25$ seemed to be the slowest. This apparent discrepancy from Theorem 1 may be partly because Theorem 1 relies on an oracle parameter choice $κ = 2 N τ^{q} λ^{- 1}$, as we discussed in Section 5.2, which is hardly known in practice. Nevertheless, it is remarkable that even $q = 0.75$ gives very sparse solutions with a reasonable amount of runtime. Regarding the absolute error, smaller q tends to perform better with relatively small datasets, such as $N \leq 1000$, while $q = 1.00$ performs better for larger datasets, such as $N = 2000$ and 3000. As we mentioned in Section 5.3, a theoretical analysis of the approximation error remains open and is left for future work.

Table A2. Hyperparameter sensitivity of q-DOT and Sinkhorn algorithm. In these tables, $q = 1.00$ corresponds to the Sinkhorn algorithm. $(q, λ) = (1.00, 0.01)$ did not work well because of numerical instability. The results shown in the tables are the means of 10 random trials. Bold typeface indicates the best result for each of sparsity, absolute error, and runtime.

( $N = 100$ )	Sparsity	Abs. Error	Runtime [ms]	( $N = 300$ )	Sparsity	Abs. Error	Runtime [ms]
$q = 0.00, λ = 0.01$	0.990	2.28× 10 $^{- 2}$	4366.142	$q = 0.00, λ = 0.01$	0.997	1.30× 10 $^{0}$	33,592.026
$q = 0.00, λ = 0.10$	0.988	3.63 × 10 $^{- 3}$	1236.346	$q = 0.00, λ = 0.10$	0.996	2.15× 10 $^{- 2}$	14,641.740
$q = 0.00, λ = 1.00$	0.982	6.20× 10 $^{- 3}$	842.253	$q = 0.00, λ = 1.00$	0.994	2.03× 10 $^{- 2}$	7749.233
$q = 0.25, λ = 0.01$	0.989	8.18× 10 $^{- 3}$	3182.535	$q = 0.25, λ = 0.01$	0.996	7.07× 10 $^{- 2}$	36,167.445
$q = 0.25, λ = 0.10$	0.986	5.54× 10 $^{- 3}$	1131.784	$q = 0.25, λ = 0.10$	0.994	1.83× 10 $^{- 2}$	15,176.970
$q = 0.25, λ = 1.00$	0.973	1.16× 10 $^{- 2}$	668.734	$q = 0.25, λ = 1.00$	0.990	2.69× 10 $^{- 2}$	5848.561
$q = 0.50, λ = 0.01$	0.987	9.91× 10 $^{- 3}$	2388.176	$q = 0.50, λ = 0.01$	0.994	1.99× 10 $^{- 2}$	25,940.619
$q = 0.50, λ = 0.10$	0.977	7.66× 10 $^{- 3}$	1040.818	$q = 0.50, λ = 0.10$	0.991	2.41× 10 $^{- 2}$	8304.774
$q = 0.50, λ = 1.00$	0.946	2.40× 10 $^{- 2}$	339.978	$q = 0.50, λ = 1.00$	0.976	3.52× 10 $^{- 2}$	2713.598
$q = 0.75, λ = 0.01$	0.979	1.16× 10 $^{- 2}$	2396.353	$q = 0.75, λ = 0.01$	0.991	2.97× 10 $^{- 2}$	18,820.365
$q = 0.75, λ = 0.10$	0.950	1.31× 10 $^{- 2}$	731.564	$q = 0.75, λ = 0.10$	0.973	3.34× 10 $^{- 2}$	4823.098
$q = 0.75, λ = 1.00$	0.786	1.02× 10 $^{- 1}$	200.654	$q = 0.75, λ = 1.00$	0.864	9.57× 10 $^{- 2}$	1654.697
$q = 1.00, λ = 0.01$	—	—	—	$q = 1.00, λ = 0.01$	—	—	—
$q = 1.00, λ = 0.10$	0.000	5.83× 10 $^{- 2}$	1132.516	$q = 1.00, λ = 0.10$	0.000	7.39× 10 $^{- 2}$	2014.341
$q = 1.00, λ = 1.00$	0.000	7.51× 10 $^{- 1}$	31.284	$q = 1.00, λ = 1.00$	0.000	8.15× 10 $^{- 1}$	207.094
( $N = 500$ )	Sparsity	Abs. Error	Runtime [ms]	( $N = 1000$ )	Sparsity	Abs. Error	Runtime [s]
$q = 0.00, λ = 0.01$	0.999	2.48× 10 $^{0}$	86,046.395	$q = 0.00, λ = 0.01$	1.000	6.39× 10 $^{0}$	336.207
$q = 0.00, λ = 0.10$	0.997	3.91× 10 $^{- 2}$	49,523.995	$q = 0.00, λ = 0.10$	0.999	8.76× 10 $^{- 2}$	286.879
$q = 0.00, λ = 1.00$	0.996	4.10× 10 $^{- 2}$	27,357.659	$q = 0.00, λ = 1.00$	0.998	8.22× 10 $^{- 2}$	133.223
$q = 0.25, λ = 0.01$	0.998	2.36× 10 $^{- 1}$	104,346.641	$q = 0.25, λ = 0.01$	0.999	4.27× 10 $^{0}$	413.775
$q = 0.25, λ = 0.10$	0.996	5.12× 10 $^{- 2}$	41,810.473	$q = 0.25, λ = 0.10$	0.998	1.01× 10 $^{- 1}$	221.787
$q = 0.25, λ = 1.00$	0.994	4.22× 10 $^{- 2}$	18,415.400	$q = 0.25, λ = 1.00$	0.997	9.01× 10 $^{- 2}$	87.945
$q = 0.50, λ = 0.01$	0.996	4.52× 10 $^{- 2}$	78,618.996	$q = 0.50, λ = 0.01$	0.998	8.61× 10 $^{- 2}$	374.123
$q = 0.50, λ = 0.10$	0.994	4.50× 10 $^{- 2}$	25,512.371	$q = 0.50, λ = 0.10$	0.997	9.37× 10 $^{- 2}$	120.605
$q = 0.50, λ = 1.00$	0.984	4.92× 10 $^{- 2}$	8266.048	$q = 0.50, λ = 1.00$	0.990	9.49× 10 $^{- 2}$	41.435
$q = 0.75, λ = 0.01$	0.994	4.55× 10 $^{- 2}$	57,839.639	$q = 0.75, λ = 0.01$	0.996	1.05× 10 $^{- 1}$	275.101
$q = 0.75, λ = 0.10$	0.979	5.07× 10 $^{- 2}$	14,257.452	$q = 0.75, λ = 0.10$	0.985	1.02× 10 $^{- 1}$	67.301
$q = 0.75, λ = 1.00$	0.890	1.00× 10 $^{- 1}$	4362.478	$q = 0.75, λ = 1.00$	0.917	1.34× 10 $^{- 1}$	21.536
$q = 1.00, λ = 0.01$	—	—	—	$q = 1.00, λ = 0.01$	—	—	—
$q = 1.00, λ = 0.10$	0.000	7.92× 10 $^{- 2}$	5731.333	$q = 1.00, λ = 0.10$	0.000	8.62× 10 $^{- 2}$	57.739
$q = 1.00, λ = 1.00$	0.000	8.35× 10 $^{- 1}$	562.722	$q = 1.00, λ = 1.00$	0.000	8.51× 10 $^{- 1}$	2.215
( $N = 2000$ )	Sparsity	Abs. Error	Runtime [s]	( $N = 3000$ )	Sparsity	Abs. Error	Runtime [s]
$q = 0.00, λ = 0.01$	1.000	3.59× 10 $^{0}$	1386.554	$q = 0.00, λ = 0.01$	1.000	4.09× 10 $^{0}$	3257.314
$q = 0.00, λ = 0.10$	0.999	2.25× 10 $^{- 1}$	1245.867	$q = 0.00, λ = 0.10$	1.000	8.56× 10 $^{- 1}$	3108.889
$q = 0.00, λ = 1.00$	0.999	1.85× 10 $^{- 1}$	823.011	$q = 0.00, λ = 1.00$	0.999	2.68× 10 $^{- 1}$	2355.733
$q = 0.25, λ = 0.01$	1.000	5.88× 10 $^{0}$	1555.064	$q = 0.25, λ = 0.01$	1.000	3.78× 10 $^{0}$	3821.319
$q = 0.25, λ = 0.10$	0.999	1.86× 10 $^{- 1}$	1201.656	$q = 0.25, λ = 0.10$	0.999	2.94× 10 $^{- 1}$	3532.833
$q = 0.25, λ = 1.00$	0.998	1.86× 10 $^{- 1}$	492.324	$q = 0.25, λ = 1.00$	0.999	2.76× 10 $^{- 1}$	1530.838
$q = 0.50, λ = 0.01$	0.999	6.66× 10 $^{- 1}$	1494.270	$q = 0.50, λ = 0.01$	1.000	1.85× 10 $^{0}$	3669.894
$q = 0.50, λ = 0.10$	0.998	1.97× 10 $^{- 1}$	589.379	$q = 0.50, λ = 0.10$	0.999	2.93× 10 $^{- 1}$	1637.985
$q = 0.50, λ = 1.00$	0.994	1.85× 10 $^{- 1}$	210.008	$q = 0.50, λ = 1.00$	0.995	2.71× 10 $^{- 1}$	644.164
$q = 0.75, λ = 0.01$	0.998	2.00× 10 $^{- 1}$	1300.517	$q = 0.75, λ = 0.01$	0.998	2.98× 10 $^{- 1}$	3560.379
$q = 0.75, λ = 0.10$	0.989	2.00× 10 $^{- 1}$	321.221	$q = 0.75, λ = 0.10$	0.991	2.91× 10 $^{- 1}$	853.451
$q = 0.75, λ = 1.00$	0.937	2.08× 10 $^{- 1}$	106.334	$q = 0.75, λ = 1.00$	0.946	2.83× 10 $^{- 1}$	270.046
$q = 1.00, λ = 0.01$	—	—	—	$q = 1.00, λ = 0.01$	—	—	—
$q = 1.00, λ = 0.10$	0.000	9.06× 10 $^{- 2}$	147.372	$q = 1.00, λ = 0.10$	0.000	8.94× 10 $^{- 2}$	272.210
$q = 1.00, λ = 1.00$	0.000	8.62× 10 $^{- 1}$	8.575	$q = 1.00, λ = 1.00$	0.000	8.62× 10 $^{- 1}$	20.120

References

1. Villani, C. Optimal Transport: Old and New; Springer: Berlin/Heidelberg, Germany, 2009; Volume 338.
2. Shafieezadeh-Abadeh, S.; Mohajerin Esfahani, P.M.; Kuhn, D. Distributionally robust logistic regression. Adv. Neural Inf. Process. Syst. 2015, 28.
3. Courty, N.; Flamary, R.; Habrard, A.; Rakotomamonjy, A. Joint distribution optimal transportation for domain adaptation. Adv. Neural Inf. Process. Syst. 2017, 30.
4. Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; PMLR. pp. 214–223.
5. Kusner, M.; Sun, Y.; Kolkin, N.; Weinberger, K. From word embeddings to document distances. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 7–9 July 2015; PMLR. pp. 957–966.
6. Swanson, K.; Yu, L.; Lei, T. Rationalizing text matching: Learning sparse alignments via optimal transport. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 5609–5626.
7. Otani, M.; Togashi, R.; Nakashima, Y.; Rahtu, E.; Heikkilä, J.; Satoh, S. Optimal correction cost for object detection evaluation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 21–24 June 2022; pp. 21107–21115.
8. Pele, O.; Werman, M. Fast and robust Earth Mover’s Distances. In Proceedings of the 2009 IEEE 12th International Conference on Computer Vision, Kyoto, Japan, 29 September–2 October 2009; IEEE: New York, NY, USA, 2009; pp. 460–467.
9. Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. Adv. Neural Inf. Process. Syst. 2013, 26, 2292–2300.
10. Dessein, A.; Papadakis, N.; Rouas, J.L. Regularized optimal transport and the ROT mover’s distance. J. Mach. Learn. Res. 2018, 19, 590–642.
11. Dvurechensky, P.; Gasnikov, A.; Kroshnin, A. Computational optimal transport: Complexity by accelerated gradient descent is better than by Sinkhorn’s algorithm. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; PMLR. pp. 1367–1376.
12. Le, T.; Yamada, M.; Fukumizu, K.; Cuturi, M. Tree-sliced variants of Wasserstein distances. Adv. Neural Inf. Process. Syst. 2019, 32, 12304–12315.
13. Le, T.; Nguyen, T.; Phung, D.; Nguyen, V.A. Sobolev transport: A scalable metric for probability measures with graph metrics. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics, Online, 28–30 March 2022; PMLR. pp. 9844–9868.
14. Frogner, C.; Zhang, C.; Mobahi, H.; Araya, M.; Poggio, T.A. Learning with a Wasserstein loss. Adv. Neural Inf. Process. Syst. 2015, 28, 2053–2061.
15. Cuturi, M.; Teboul, O.; Vert, J.P. Differentiable ranking and sorting using optimal transport. Adv. Neural Inf. Process. Syst. 2019, 32, 6861–6871.
16. Blondel, M.; Martins, A.F.; Niculae, V. Learning with Fenchel-Young losses. J. Mach. Learn. Res. 2020, 21, 1–69.
17. Birkhoff, G. Tres observaciones sobre el algebra lineal. Univ. Nac. Tucumán Rev. Ser. A 1946, 5, 147–154.
18. Brualdi, R.A. Combinatorial Matrix Classes; Cambridge University Press: Cambridge, UK, 2006; Volume 13.
19. Alvarez-Melis, D.; Jaakkola, T. Gromov–Wasserstein alignment of word embedding spaces. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 1881–1890.
20. Blondel, M.; Seguy, V.; Rolet, A. Smooth and sparse optimal transport. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics, Canary Islands, Spain, 9–11 April 2018; PMLR. pp. 880–889.
21. Liu, D.C.; Nocedal, J. On the limited memory BFGS method for large scale optimization. Math. Program. 1989, 45, 503–528.
22. Amari, S.i.; Ohara, A. Geometry of q-exponential family of probability distributions. Entropy 2011, 13, 1170–1185.
23. Tsallis, C. Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys. 1988, 52, 479–487.
24. Powell, M.J.D. Some global convergence properties of a variable metric algorithm for minimization without exact line searches. In Proceedings of the Nonlinear Programming, SIAM-AMS Proceedings, New York, NY, USA, 1 January 1976; Volume 9.
25. Altschuler, J.; Niles-Weed, J.; Rigollet, P. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. Adv. Neural Inf. Process. Syst. 2017, 30, 1961–1971.
26. Sinkhorn, R.; Knopp, P. Concerning nonnegative matrices and doubly stochastic matrices. Pac. J. Math. 1967, 21, 343–348.
27. Danskin, J.M. The theory of max-min, with applications. SIAM J. Appl. Math. 1966, 14, 641–664.
28. Bao, H.; Sugiyama, M. Fenchel-Young losses with skewed entropies for class-posterior probability estimation. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, San Diego, CA, USA, 13–15 April 2021; pp. 1648–1656.
29. Naudts, J. Deformed exponentials and logarithms in generalized thermostatistics. Phys. A Stat. Mech. Its Appl. 2002, 316, 323–334.
30. Suyari, H. The unique non self-referential q-canonical distribution and the physical temperature derived from the maximum entropy principle in Tsallis statistics. Prog. Theor. Phys. Suppl. 2006, 162, 79–86.
31. Ding, N.; Vishwanathan, S. t-Logistic regression. Adv. Neural Inf. Process. Syst. 2010, 23, 514–522.
32. Futami, F.; Sato, I.; Sugiyama, M. Expectation propagation for t-exponential family using q-algebra. Adv. Neural Inf. Process. Syst. 2017, 30.
33. Amid, E.; Warmuth, M.K.; Anil, R.; Koren, T. Robust bi-tempered logistic loss based on Bregman divergences. Adv. Neural Inf. Process. Syst. 2019, 32, 15013–15022.
34. Martins, A.F.; Figueiredo, M.A.; Aguiar, P.M.; Smith, N.A.; Xing, E.P. Nonextensive entropic kernels. In Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 5–9 July 2008; pp. 640–647.
35. Muzellec, B.; Nock, R.; Patrini, G.; Nielsen, F. Tsallis regularized optimal transport and ecological inference. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31.
36. Byrd, R.H.; Nocedal, J.; Yuan, Y.X. Global convergence of a class of quasi-Newton methods on convex problems. SIAM J. Numer. Anal. 1987, 24, 1171–1190.
37. Schmitzer, B. Stabilized sparse scaling algorithms for entropy regularized transport problems. SIAM J. Sci. Comput. 2019, 41, A1443–A1481.
38. Flamary, R.; Courty, N.; Gramfort, A.; Alaya, M.Z.; Boisbunon, A.; Chambon, S.; Chapel, L.; Corenflos, A.; Fatras, K.; Fournier, N.; et al. POT: Python optimal transport. J. Mach. Learn. Res. 2021, 22, 1–8.
39. Virtanen, P.; Gommers, R.; Oliphant, T.E.; Haberland, M.; Reddy, T.; Cournapeau, D.; Burovski, E.; Peterson, P.; Weckesser, W.; Bright, J.; et al. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nat. Methods 2020, 17, 261–272.
40. Weed, J. An explicit analysis of the entropic penalty in linear programming. In Proceedings of the 31st Conference on Learning Theory, Stockholm, Sweden, 5–9 July 2018; PMLR. pp. 1841–1855.
41. Golub, G.H.; van Loan, C.F. Matrix Computations; The Johns Hopkins University Press: Baltimore, MD, USA, 2013.

Figure 1. Plots of the q-Gibbs kernels with different q ($λ = 1$).

Figure 2. Plots of deformed q entropy with different q values. A constant term is ignored in the plots so that the end points are calibrated to zero.

Figure 3. Comparison of transport matrices. Wasserstein represents the result of the unregularized OT. Sinkhorn ($λ = 0.01$) does not work well because of numerical instability.

Figure 4. Runtime comparison of q-DOT and Sinkhorn algorithm ($q = 1$). The error bars indicate the standard errors of 20 trials.

Figure 5. Wasserstein approximation error of q-DOT and the Sinkhorn algorithm ($q = 1$). The line shades indicate the standard errors of 20 trials.

Table 1. Summary of $Ω (π)$, $Ω^{⋆} (η)$, and $\nabla Ω^{⋆} (η)$ for several regularizers. The relationships among $Ω$, its conjugate, and their derivatives are summarized in Bao and Sugiyama [28].

	$Ω (π)$	$Ω^{⋆} (η)$	$\nabla Ω^{⋆} (η)$
Negative entropy	$λ (π log π - π)$	$λ e^{η / λ}$	$e^{η / λ}$
Squared 2-norm	$\frac{λ}{2} π^{2}$	$\frac{1}{2 λ} {[η]}_{+}^{2}$	$\frac{1}{λ} {[η]}_{+}$
Deformed q entropy	$\frac{λ}{2 - q} (π {log}_{q} (π) - π)$	$\frac{λ}{2 - q} {exp}_{q} {(η / λ)}^{2 - q}$	${exp}_{q} (η / λ)$
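
As a consistency check on the last row of Table 1, the dual map can be recovered by differentiating the conjugate. The derivation below is a sketch that assumes the standard q-exponential ${exp}_{q} (x) = {[1 + (1 - q) x]}_{+}^{1 / (1 - q)}$ for $0 \leq q < 1$, which is not restated in the table, and works on the region where $1 + (1 - q) x > 0$.

```latex
\begin{align}
\frac{d}{dx} \exp_{q}(x)
  &= \frac{1}{1 - q} \, (1 - q) \, [1 + (1 - q) x]^{\frac{1}{1 - q} - 1}
   = [1 + (1 - q) x]^{\frac{q}{1 - q}}
   = \exp_{q}(x)^{q}, \\
\nabla \Omega^{\star}(\eta)
  &= \frac{d}{d\eta} \left[ \frac{\lambda}{2 - q} \exp_{q}(\eta / \lambda)^{2 - q} \right]
   = \frac{\lambda}{2 - q} \, (2 - q) \, \exp_{q}(\eta / \lambda)^{1 - q}
     \cdot \exp_{q}(\eta / \lambda)^{q} \cdot \frac{1}{\lambda}
   = \exp_{q}(\eta / \lambda).
\end{align}
```

Under the same assumed definition, ${exp}_{q} (η / λ) = 0$ whenever $η \leq - λ / (1 - q)$, so this dual map produces exact zeros, in contrast to the negative entropy row, where $e^{η / λ}$ is strictly positive for every $η$.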

Table 2. Comparison of the computational complexity of the Sinkhorn algorithm and deformed q-optimal transport. $N = max \{n, m\}$.

Sinkhorn	q-DOT
$O (N^{2} (log N) ε^{- 3})$	$O (N^{2} log (N ε^{- 1}))$

Table 3. Comparison of the sparsity and cost with the synthetic dataset. Sparsity indicates the ratio of zero entries in each transport matrix. We counted the number of entries smaller than machine epsilon to measure the sparsity instead of imposing a small positive threshold for determining zero entries. Sinkhorn ($λ = 0.01$) does not work well because of numerical instability.

	Sparsity	Cost $〈D, \hat{Π}〉$
Wasserstein (unregularized)	0.967	7.126
q-DOT ( $q = 0.00, λ = 0.01$ )	0.962	7.129
q-DOT ( $q = 0.00, λ = 0.10$ )	0.961	7.126
q-DOT ( $q = 0.00, λ = 1.00$ )	0.950	7.144
q-DOT ( $q = 0.25, λ = 0.01$ )	0.963	7.129
q-DOT ( $q = 0.25, λ = 0.10$ )	0.959	7.126
q-DOT ( $q = 0.25, λ = 1.00$ )	0.912	7.133
q-DOT ( $q = 0.50, λ = 0.01$ )	0.963	7.136
q-DOT ( $q = 0.50, λ = 0.10$ )	0.946	7.127
q-DOT ( $q = 0.50, λ = 1.00$ )	0.879	7.155
q-DOT ( $q = 0.75, λ = 0.01$ )	0.948	7.127
q-DOT ( $q = 0.75, λ = 0.10$ )	0.897	7.136
q-DOT ( $q = 0.75, λ = 1.00$ )	0.647	7.245
Sinkhorn ( $λ = 0.01$ )	—	—
Sinkhorn ( $λ = 0.10$ )	0.000	7.164
Sinkhorn ( $λ = 1.00$ )	0.000	7.788

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Bao, H.; Sakaue, S. Sparse Regularized Optimal Transport with Deformed q-Entropy. Entropy 2022, 24, 1634. https://doi.org/10.3390/e24111634
