Article

Efficient Discretization of Optimal Transport

1 Department of Math & CS, Rutgers University, Newark, NJ 07102, USA
2 School of Mathematics, Institute for Advanced Study, Princeton, NJ 08540, USA
* Author to whom correspondence should be addressed.
Entropy 2023, 25(6), 839; https://doi.org/10.3390/e25060839
Submission received: 15 November 2022 / Revised: 21 April 2023 / Accepted: 25 April 2023 / Published: 24 May 2023
(This article belongs to the Special Issue Probabilistic Models in Machine and Human Learning)

Abstract

Obtaining solutions to optimal transportation (OT) problems is typically intractable when marginal spaces are continuous. Recent research has focused on approximating continuous solutions with discretization methods based on i.i.d. sampling, and this has shown convergence as the sample size increases. However, obtaining OT solutions with large sample sizes requires intensive computation effort, which can be prohibitive in practice. In this paper, we propose an algorithm for calculating discretizations with a given number of weighted points for marginal distributions by minimizing the (entropy-regularized) Wasserstein distance and providing bounds on the performance. The results suggest that our plans are comparable to those obtained with much larger numbers of i.i.d. samples and are more efficient than existing alternatives. Moreover, we propose a local, parallelizable version of such discretizations for applications, which we demonstrate by approximating adorable images.

1. Introduction

Optimal transport is the problem of finding a coupling of probability distributions that minimizes a cost [1], and it is a technique applied across various fields [2,3]. Although many methods exist for obtaining optimal transference plans for distributions on discrete spaces, computing such plans is not generally possible for continuous spaces [4]. Given the prevalence of continuous spaces in machine learning, this is a significant limitation for theoretical and practical applications.
One strategy for approximating continuous OT plans is based on discrete approximation via sample points. Recent research has provided guarantees on the fidelity of discrete, sample-location-based approximations of continuous OT as the sample size $N \to \infty$ [5]. Specifically, by sampling a large set of points $S_i$ from each marginal, one may compute a discrete optimal transference plan on $S_1 \times S_2$, with the cost matrix derived from the pointwise evaluation of the cost function on $S_1 \times S_2$.
Even in the discrete case, obtaining minimal cost plans is computationally challenging. For example, Sinkhorn scaling, which computes an entropy-regularized approximation of OT plans, has a complexity that scales with $|S_1 \times S_2|$ [6]. Although many comparable methods exist [7], all of them have a complexity that scales with the product of the sample sizes, and they require the construction of a cost matrix that also scales with $|S_1 \times S_2|$.
We have developed methods for optimizing both the sampling locations and weights of small-$N$ approximations of OT plans (see Figure 1). In Section 2, we formulate the fixed-size approximation problem and reduce it to discretization problems on the marginals with theoretical guarantees. In Section 3, the gradient of the entropy-regularized Wasserstein distance between a continuous distribution and its discretization is derived. In Section 4, we present a stochastic gradient descent algorithm that optimizes the locations and weights of the points, along with empirical demonstrations. Section 5 introduces a parallelizable algorithm via decompositions of the marginal spaces, which reduces the computational complexity by exploiting intrinsic geometry. In Section 6, we analyze time and space complexity. In Section 7, we illustrate the advantage of including weights for sample points by providing a comparison with an existing location-only discretization method.

2. Efficient Discretizations

Optimal transport (OT): Let $(X, d_X)$ and $(Y, d_Y)$ be compact Polish spaces (complete separable metric spaces), let $\mu \in P(X)$ and $\nu \in P(Y)$ be probability distributions on their Borel $\sigma$-algebras, and let $c : X \times Y \to \mathbb{R}$ be a cost function. Denote the set of all joint probability measures (couplings) on $X \times Y$ with marginals $\mu$ and $\nu$ by $\Pi(\mu, \nu)$. For the cost function $c$, the optimal transference plan between $\mu$ and $\nu$ is defined as in [1]: $\gamma(\mu, \nu) \in \operatorname{argmin}_{\pi \in \Pi(\mu, \nu)} \langle c, \pi \rangle$, where $\langle c, \pi \rangle := \int_{X \times Y} c(x, y)\, d\pi(x, y)$.
When $X = Y$, the cost $c(x, y) = d_X^k(x, y)$ gives the $k$-Wasserstein distance $W_k(\mu, \nu) = \langle c, \gamma(\mu, \nu) \rangle^{1/k}$ between $\mu$ and $\nu$ for $k \ge 1$. Here, $d_X^k(x, y)$ is the $k$-th power of the metric $d_X$ on $X$.
Entropy-regularized optimal transport (EOT) [5,8] was introduced to estimate OT couplings with reduced computational complexity: $\gamma_\lambda(\mu, \nu) := \operatorname{argmin}_{\pi \in \Pi(\mu, \nu)} \langle c, \pi \rangle + \lambda\, \mathrm{KL}(\pi \,\|\, \mu \otimes \nu)$, where $\lambda > 0$ is a regularization parameter and the regularization term $\mathrm{KL}(\pi \,\|\, \mu \otimes \nu) := \int \log\!\big(\tfrac{d\pi}{d\mu\, d\nu}\big)\, d\pi$ is the Kullback-Leibler divergence. The EOT objective is smooth and convex, and its unique solution for a given discrete $(\mu, \nu, c)$ can be obtained using the Sinkhorn iteration (SK) [9].
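For concreteness, the following is a minimal sketch of the Sinkhorn iteration for a discrete EOT problem, assuming NumPy; the toy cost matrix, marginals, and regularization value are illustrative placeholders rather than the settings used in our experiments.

```python
import numpy as np

def sinkhorn(C, a, b, lam, n_iter=1000, tol=1e-9):
    """Entropy-regularized OT plan between discrete marginals a and b.

    C   : (m, n) cost matrix
    a   : (m,) source weights, summing to 1
    b   : (n,) target weights, summing to 1
    lam : regularization strength (larger = smoother plan)
    """
    K = np.exp(-C / lam)                 # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):
        u_prev = u
        v = b / (K.T @ u)                # match the column marginals
        u = a / (K @ v)                  # match the row marginals
        if np.max(np.abs(u - u_prev)) < tol:
            break
    return u[:, None] * K * v[None, :]   # transference plan pi

# toy example: two random discrete measures on [0, 1]
rng = np.random.default_rng(0)
x, y = rng.random(5), rng.random(7)
C = (x[:, None] - y[None, :]) ** 2       # squared Euclidean cost
a, b = np.full(5, 0.2), np.full(7, 1 / 7)
pi = sinkhorn(C, a, b, lam=0.01)
print(pi.sum(axis=1), pi.sum(axis=0))    # recovers the marginals a and b
```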
However, for large-scale discrete spaces, the computational cost of SK can still be infeasible [6]. Even worse, to apply the Sinkhorn iteration at all, one must know the entire cost matrix over the large-scale spaces, which can itself be a non-trivial computational burden to obtain; in some cases, for example, where the cost is derived from a probability model [10], it may require intractable computations [11,12].
The Framework: We propose optimizing the locations and weights of a fixed-size discretization to estimate the continuous OT. The discretization on $X \times Y$ is completely determined by those on $X$ and $Y$, which respects the marginal structure of the OT. Let $m, n \in \mathbb{Z}^+$ and let $\mu_m \in P(X)$, $\nu_n \in P(Y)$ be discrete approximations of $\mu$ and $\nu$, respectively, with $\mu_m = \sum_{i=1}^m w_i \delta_{x_i}$, $\nu_n = \sum_{j=1}^n u_j \delta_{y_j}$, $x_i \in X$, $y_j \in Y$, and $w_i, u_j \in \mathbb{R}_+$. Then, the EOT plan $\gamma_\lambda(\mu, \nu) \in \Pi(\mu, \nu)$ for the OT problem $(\mu, \nu, c)$ can be approximated by the EOT plan $\gamma_\lambda(\mu_m, \nu_n) \in \Pi(\mu_m, \nu_n)$ for the OT problem $(\mu_m, \nu_n, c)$. Three distributions thus have discrete counterparts; with fixed sizes $m, n \in \mathbb{Z}^+$, a natural objective to optimize is
$$\Omega_{k,\rho}(\mu_m, \nu_n) = W_k^k(\mu, \mu_m) + W_k^k(\nu, \nu_n) + \rho\, W_k^k\big(\gamma_\lambda(\mu, \nu), \gamma_\lambda(\mu_m, \nu_n)\big), \tag{1}$$
where $W_k^k(\phi, \psi)$ denotes the $k$-th power of the $k$-Wasserstein distance between measures $\phi$ and $\psi$. The hyperparameter $\rho > 0$ balances the estimation accuracy of the marginals against that of the transference plan, while the two marginals are weighted equally.
To properly compute $W_k^k\big(\gamma_\lambda(\mu, \nu), \gamma_\lambda(\mu_m, \nu_n)\big)$, a metric $d_{X \times Y}$ on $X \times Y$ is needed. We expect $d_{X \times Y}$, restricted to $X$-slices or $Y$-slices, to be compatible with $d_X$ or $d_Y$, respectively; furthermore, we may assume that there exists a constant $A > 0$ such that
$$\max\{d_X^k(x_1, x_2),\, d_Y^k(y_1, y_2)\} \le d_{X \times Y}^k\big((x_1, y_1), (x_2, y_2)\big) \le A\,\big(d_X^k(x_1, x_2) + d_Y^k(y_1, y_2)\big). \tag{2}$$
For instance, (2) holds when $d_{X \times Y}$ is the $p$-product metric for $1 \le p \le \infty$.
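As a quick numerical sanity check of inequality (2) under the $p$-product metric (a sketch with randomly generated placeholder distances; $A = 2^{k/p}$ is one admissible constant, not necessarily the sharpest):

```python
import numpy as np

rng = np.random.default_rng(1)
k, p = 2.0, 3.0
A = 2 ** (k / p)                      # one admissible constant for the p-product metric
dX = rng.random(10_000)               # placeholder pairwise distances in X
dY = rng.random(10_000)               # placeholder pairwise distances in Y
dZ = (dX ** p + dY ** p) ** (1 / p)   # p-product metric on X x Y

lower = np.maximum(dX ** k, dY ** k) <= dZ ** k
upper = dZ ** k <= A * (dX ** k + dY ** k)
print(lower.all(), upper.all())       # both True
```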
The objective $\Omega_{k,\rho}(\mu_m, \nu_n)$ is estimated by its entropy-regularized approximation $\Omega_{k,\zeta,\rho}(\mu_m, \nu_n)$ for efficient computation, where $\zeta$ is the regularization parameter, as follows:
$$\Omega_{k,\zeta,\rho}(\mu_m, \nu_n) = W_{k,\zeta}^k(\mu, \mu_m) + W_{k,\zeta}^k(\nu, \nu_n) + \rho\, W_{k,\zeta}^k\big(\gamma_\lambda(\mu, \nu), \gamma_\lambda(\mu_m, \nu_n)\big). \tag{3}$$
Here, $W_k^k(\mu, \mu_m) = \langle d_X^k, \gamma(\mu, \mu_m) \rangle$ is estimated by $W_{k,\zeta}^k(\mu, \mu_m) = \langle d_X^k, \gamma_\zeta(\mu, \mu_m) \rangle$, where $\gamma_\zeta(\mu, \mu_m)$ is computed by minimizing $\widehat{W}_{k,\zeta}^k(\mu, \mu_m) = \langle d_X^k, \pi \rangle + \zeta\, \mathrm{KL}(\pi \,\|\, \mu \otimes \mu_m)$ over $\pi \in \Pi(\mu, \mu_m)$.
One major difficulty in optimizing $\Omega_{k,\zeta,\rho}(\mu_m, \nu_n)$ is evaluating $W_{k,\zeta}^k\big(\gamma_\lambda(\mu, \nu), \gamma_\lambda(\mu_m, \nu_n)\big)$. In fact, obtaining $\gamma_\lambda(\mu, \nu)$ is intractable, which is the original motivation for the discretization. To overcome this difficulty, using the dual formulation of EOT, the following is shown (see proof in Appendix A):
Proposition 1.
When $X$ and $Y$ are two compact spaces and the cost function $c$ is $C^\infty$, there exists a constant $C_1 \in \mathbb{R}_+$ such that
$$\max\{W_k^k(\mu, \mu_m),\, W_k^k(\nu, \nu_n)\} \le W_{k,\zeta}^k\big(\gamma_\lambda(\mu, \nu), \gamma_\lambda(\mu_m, \nu_n)\big) \le C_1\big[\,W_{k,\zeta}^k(\mu, \mu_m) + W_{k,\zeta}^k(\nu, \nu_n)\,\big].$$
Notice that Proposition 1 indicates that $W_{k,\zeta}^k\big(\gamma_\lambda(\mu, \nu), \gamma_\lambda(\mu_m, \nu_n)\big)$ is bounded above by a multiple of $W_{k,\zeta}^k(\mu, \mu_m) + W_{k,\zeta}^k(\nu, \nu_n)$; i.e., when the continuous marginals $\mu$ and $\nu$ are properly approximated, so is the optimal transference plan between them. Therefore, to optimize $\Omega_{k,\zeta,\rho}(\mu_m, \nu_n)$, we focus on developing algorithms to obtain $\mu_m^*$ and $\nu_n^*$ that minimize $W_{k,\zeta}^k(\mu, \mu_m)$ and $W_{k,\zeta}^k(\nu, \nu_n)$.
Remark 1.
The regularizing parameters ($\lambda$ and $\zeta$ above) introduce smoothness, together with an error term, into the OT problem. For an accurate approximation, we need $\lambda$ and $\zeta$ to be as small as possible. However, when these parameters become too small, the matrices to be normalized in the Sinkhorn algorithm suffer from overflow or underflow in standard numerical data types (32-bit or 64-bit floating point numbers). The threshold for the regularizing constant is therefore proportional to the $k$-th power of the diameter of the supported region. In this work, we keep the value (mainly $\zeta$) as small as practical; it ranges from $10^{-4}$ to $0.01$ for a diameter of 1 in the different examples.
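As a small numerical illustration of the underflow issue behind this remark (assuming 64-bit floats and a cost value $g = 1$ standing in for $\mathrm{diam}(X)^k$ with diameter 1):

```python
import numpy as np

# With 64-bit floats, exp(-t) underflows to exactly 0.0 once t exceeds ~745,
# so kernel entries exp(-g/zeta) for costs g near diam(X)^k vanish when zeta
# is pushed too far below the k-th power of the diameter.
for zeta in [1e-1, 1e-2, 1e-3, 1e-4]:
    print(zeta, np.exp(-1.0 / zeta))   # g = 1.0 ~ diam(X)^k with diameter 1
# 0.1    ~4.5e-05
# 0.01   ~3.7e-44
# 0.001  0.0  (underflow)
# 0.0001 0.0  (underflow)
```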

3. Gradient of the Objective Function

Let $\nu = \sum_{i=1}^m w_i \delta_{y_i}$ be a discrete probability measure playing the role of "$\mu_m$" from the last section. For a fixed (continuous) $\mu$, the objective now is to obtain a discrete target $\nu^* = \operatorname{argmin}_\nu W_{k,\zeta}^k(\mu, \nu)$.
In order to apply stochastic gradient descent (SGD) to both the positions $\{y_i\}_{i=1}^m$ and their weights $\{w_i\}_{i=1}^m$ to reach $\nu^*$, we now derive the gradient of $W_{k,\zeta}^k(\mu, \nu)$ with respect to $\nu$, following the discrete discussions of [13,14]. The SGD on $X$ is either derived through an exponential map or by treating $X$ as (part of) a Euclidean space.
Let $g(x, y) := d_X^k(x, y)$, and denote the joint distribution minimizing $\widehat{W}_{k,\zeta}^k$ by $\pi$, with its differential form at $(x, y_i)$ written $d\pi_i(x)$; this is the plan used to define $W_{k,\zeta}^k$ in Section 2.
By introducing Lagrange multipliers $\alpha \in L^\infty(X)$, $\beta \in \mathbb{R}^m$, we have $\widehat{W}_{k,\zeta}^k(\mu, \nu) = \max_{\alpha, \beta} L(\mu, \nu; \alpha, \beta)$, where $L(\mu, \nu; \alpha, \beta) = \int_X \alpha(x)\, d\mu(x) + \sum_{i=1}^m \beta_i w_i - \zeta \int_X \sum_{i=1}^m w_i E_i(x)\, d\mu(x)$ with $E_i(x) = e^{(\alpha(x) + \beta_i - g(x, y_i))/\zeta}$ (see [5]). Let $\alpha^*, \beta^*$ be the argmax; then, we have
$$W_{k,\zeta}^k(\mu, \nu) = \int_X \sum_{i=1}^m g(x, y_i)\, E_i^*(x)\, w_i\, d\mu(x)$$
with $E_i^*(x) = e^{(\alpha^*(x) + \beta_i^* - g(x, y_i))/\zeta}$. Since $\alpha'(x) := \alpha(x) + t$ and $\beta_i' := \beta_i - t$ produce the same $E_i(x)$ for any $t \in \mathbb{R}$, the representative with $\beta_m = 0$ that is equivalent to $\beta$ (as well as $\beta^*$) is denoted by $\bar{\beta}$ (similarly $\bar{\beta}^*$) below in order to obtain uniqueness and make the differentiation possible.
From a direct differentiation of $W_{k,\zeta}^k$, we have
$$\frac{\partial W_{k,\zeta}^k}{\partial w_i} = \int_X g(x, y_i)\, E_i^*(x)\, d\mu(x) + \frac{1}{\zeta} \int_X \sum_{j=1}^m g(x, y_j) \left( \frac{\partial \alpha^*(x)}{\partial w_i} + \frac{\partial \beta_j^*}{\partial w_i} \right) w_j\, E_j^*(x)\, d\mu(x), \tag{4}$$
$$\nabla_{y_i} W_{k,\zeta}^k = \int_X \nabla_{y_i} g(x, y_i) \left( 1 - \frac{g(x, y_i)}{\zeta} \right) E_i^*(x)\, w_i\, d\mu(x) + \frac{1}{\zeta} \int_X \sum_{j=1}^m g(x, y_j) \left( \nabla_{y_i} \alpha^*(x) + \nabla_{y_i} \beta_j^* \right) w_j\, E_j^*(x)\, d\mu(x). \tag{5}$$
With the transference plan $d\pi_i(x) = w_i E_i^*(x)\, d\mu(x)$ and the derivatives of $\alpha^*$, $\beta^*$, and $g(x, y_i)$ calculated, the gradient of $W_{k,\zeta}^k$ can be assembled.
Assume that $g$ is Lipschitz continuous and differentiable almost everywhere (for $k \ge 1$ and $d_X$ the Euclidean distance in $\mathbb{R}^d$, differentiability fails only when $k = 1$ and $y_i = x$) and that $\nabla_y g(x, y)$ has been calculated. The derivatives of $\alpha^*$ and $\bar{\beta}^*$ can then be obtained via the Implicit Function Theorem for Banach spaces (see [15]).
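For example, under the assumption that $d_X$ is the Euclidean distance on $\mathbb{R}^d$, so that $g(x, y) = \|x - y\|^k$, the required gradient is $\nabla_y g(x, y) = k\, \|x - y\|^{k-2}\, (y - x)$; in particular, $\nabla_y g(x, y) = 2\,(y - x)$ for $k = 2$, which is the choice used in our simulations.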
The maximality of $L$ at $\alpha^*$ and $\bar{\beta}^*$ induces $\mathcal{N} := \nabla_{\alpha, \bar{\beta}} L\,\big|_{(\alpha^*, \bar{\beta}^*)} = 0$ (an element of $(L^\infty(X))^\vee \times \mathbb{R}^{m-1}$); i.e., the Fréchet derivative vanishes. By differentiating again (in the sense of Fréchet) with respect to $(\alpha, \bar{\beta})$ and with respect to $y_i, w_i$, respectively, we get
$$\nabla_{(\alpha, \bar{\beta})} \mathcal{N} = -\frac{1}{\zeta} \begin{pmatrix} d\mu(x)\, \delta(x, x') & d\pi_j(x) \\ d\pi_i(x') & w_i\, \delta_{ij} \end{pmatrix} \tag{6}$$
as a bilinear functional on $L^\infty(X) \times \mathbb{R}^{m-1}$ (note that, in Equation (6), the index $i$ of $d\pi_i$ cannot be $m$). The bilinear functional $\nabla_{(\alpha, \bar{\beta})} \mathcal{N}$ is invertible, and we denote its inverse by $\mathcal{M}$, regarded as a bilinear form on $(L^\infty(X))^\vee \times (\mathbb{R}^{m-1})^\vee$. The last ingredients for the Implicit Function Theorem are $\nabla_{w_i} \mathcal{N}$ and $\nabla_{y_i} \mathcal{N}$:
$$\nabla_{w_i} \mathcal{N} = \left( -\frac{1}{w_i} \int_X (\cdot)\, d\pi_i(x),\ 0 \right), \tag{7}$$
$$\nabla_{y_i} \mathcal{N} = \left( \frac{1}{\zeta} \int_X (\cdot)\, \nabla_{y_i} g(x, y_i)\, d\pi_i(x),\ \frac{\delta_{ij}}{\zeta} \int_X \nabla_{y_i} g(x, y_i)\, d\pi_i(x) \right). \tag{8}$$
Then, $\nabla_{w_i, y_i}(\alpha^*, \bar{\beta}^*) = -\mathcal{M}\big(\nabla_{w_i, y_i} \mathcal{N}\big)$, and the gradient $\nabla_{w_i, y_i} W_{k,\zeta}^k$ can be assembled.
Moreover, we can differentiate Equations (4)-(8) to obtain a Hessian of $W_{k,\zeta}^k$ with respect to the $w_i$'s and $y_i$'s, provided $g(x, y)$ has the corresponding differentiability (which may enable Newton's method, or a mixture of Newton's method and minibatch SGD, to accelerate convergence). More details on these claims, calculations, and proofs are provided in Appendix B.

4. The Discretization Algorithm

Here, we describe an algorithm for the efficient discretization of optimal transport (EDOT) from a distribution $\mu$ to $\mu_m$, where the integer $m$ is a given cardinality of support. In general, $\mu$ need not be explicitly accessible, and, even if it is, computing the exact transference plan is not feasible. Therefore, in this construction, we assume that $\mu$ is given in terms of a random sampler, and we apply minibatch stochastic gradient descent (SGD), drawing a set of $N$ independent samples from $\mu$ at each step to approximate $\mu$.
To calculate the gradient $\nabla_{\mu_m} W_{k,\zeta}^k(\mu, \mu_m) = \big( \nabla_{x_i} W_{k,\zeta}^k(\mu, \mu_m),\, \nabla_{w_i} W_{k,\zeta}^k(\mu, \mu_m) \big)_{i=1}^m$, we need: (1) $\pi_{X,\zeta}$, the EOT transference plan between $\mu$ and $\mu_m$; (2) the cost $g = d_X^k$ on $X$; and (3) its gradient with respect to the second variable, $\nabla_{x'} d_X^k(x, x')$. From $N$ samples $\{y_i\}_{i=1}^N$, we construct $\mu_N = \frac{1}{N} \sum_{i=1}^N \delta_{y_i}$ and calculate the gradients with $\mu$ replaced by $\mu_N$ as an estimate, whose effectiveness (convergence as $N \to \infty$) is proved in [5].
We call this discretization procedure the Simple EDOT algorithm. The pseudocode is given in Appendix C (Algorithm A1).
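The following is a compact, illustrative sketch of the Simple EDOT loop for a uniform target distribution on $[0, 1]$; it substitutes slow finite-difference gradients (and a softmax parametrization of the weights) for the analytic gradients of Section 3 and omits the momentum of Algorithm A1, so it is a conceptual stand-in rather than the implementation used in our experiments.

```python
import numpy as np

def w2_reg(x, y, w, zeta=0.01, iters=300):
    """Entropy-regularized transport cost <d^2, pi> between the empirical
    measure on x (uniform weights) and sum_j w_j delta_{y_j} (k = 2)."""
    C = (x[:, None] - y[None, :]) ** 2
    K = np.exp(-C / zeta)
    a = np.full(len(x), 1.0 / len(x))
    u = np.ones(len(x))
    for _ in range(iters):
        v = w / (K.T @ u)
        u = a / (K @ v)
    return float(((u[:, None] * K * v[None, :]) * C).sum())

def simple_edot(sampler, m=5, steps=100, batch=100, lr=0.05, eps=1e-4, seed=0):
    """Minibatch SGD on support points y and weight logits theta, using
    finite-difference gradients as a stand-in for the analytic gradients."""
    rng = np.random.default_rng(seed)
    y = rng.random(m)                         # support points in [0, 1]
    theta = np.zeros(m)                       # weights w = softmax(theta)
    for _ in range(steps):
        x = sampler(batch, rng)               # fresh minibatch from mu
        w = np.exp(theta); w /= w.sum()
        f0 = w2_reg(x, y, w)
        gy, gt = np.zeros(m), np.zeros(m)
        for j in range(m):
            yp = y.copy(); yp[j] += eps
            gy[j] = (w2_reg(x, yp, w) - f0) / eps
            tp = theta.copy(); tp[j] += eps
            wp = np.exp(tp); wp /= wp.sum()
            gt[j] = (w2_reg(x, y, wp) - f0) / eps
        y, theta = y - lr * gy, theta - lr * gt
    w = np.exp(theta); w /= w.sum()
    return y, w

# mu is the uniform distribution on [0, 1] (Example (1) below)
y, w = simple_edot(lambda n, rng: rng.random(n))
print(np.round(np.sort(y), 3), np.round(w, 3))
```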
Proposition 2.
(Convergence of the Simple EDOT). The Simple EDOT generates a sequence $(\mu_m^{(i)})$ in the compact set $X^m \times \Delta$. If the set of limit points of $(\mu_m^{(i)})$ does not intersect $X^m \times \partial\Delta$, then $(\mu_m^{(i)})$ converges to a stationary point in $X^m \times \mathrm{Int}(\Delta)$, where $\mathrm{Int}(\cdot)$ denotes the interior.
In simulations, we fix $k = 2$ to reduce the computational complexity and fix the regularizer at $\zeta = 0.01$ for $X$ of diameter 1, scaling it proportionally to $\mathrm{diam}(X)^k$ (see the next section). Such a choice of $\zeta$ is not only small enough to reduce the error between the EOT estimate $W_{k,\zeta}$ and the true $W_k$, but also ensures that $e^{-g(x,y)/\zeta}$ and its byproducts in the SK remain distinguishable from 0 in double-precision format.
Examples of discretization: We demonstrate our algorithm on the following:
Example (1): $\mu$ is the uniform distribution on $X = [0, 1]$.
Example (2): $\mu$ is a mixture of two truncated normal distributions on $X = [0, 1]$ with PDF $f(x) = 0.3\,\phi(x; 0.2, 0.1) + 0.7\,\phi(x; 0.7, 0.2)$, where $\phi(x; \xi, \sigma)$ is the density of the truncated normal distribution on $[0, 1]$ with expectation $\xi$ and standard deviation $\sigma$.
Example (3): $\mu$ is a mixture of two truncated normal distributions on $X = [0, 1]^2$, where the two components are $\phi(x; 0.2, 0.1)\,\phi(y; 0.3, 0.2)$ with weight 0.3 and $\phi(x; 0.7, 0.2)\,\phi(y; 0.6, 0.15)$ with weight 0.7.
We set $N = 100$ for all plots in this section. Figure 2a-c plots the discretizations $\mu_m$ for Examples (1)-(3) with $m = 5$, 5, and 7, respectively.
Figure 2f illustrates the convergence of $W_{k,\zeta}^k(\mu, \mu_m)$ versus the SGD step for Example (2), with $\mu_m$ obtained by a 5-point EDOT. Figure 2d,e plot the entropy-regularized Wasserstein $W_{k,\zeta}^k(\mu, \mu_m)$ versus $m$, comparing EDOT and naive sampling for Examples (1) and (2). Here, the $\mu_m$'s are: (a) from the EDOT with $3 \le m \le 7$ in Example (1) and $3 \le m \le 8$ in Example (2), shown by ×'s in the figures; and (b) from naive sampling, simulated with a Monte Carlo volume of 20,000 for each size from 3 to 200. Figure 2d,e demonstrate the effectiveness of the EDOT: as indicated by the orange horizontal dashed line, even the 5-point EDOT discretization in these two examples outperformed 95% of the naive samplings of size 40, as well as 75% of the naive samplings of size over 100 (the orange dash and dot lines).
An example of a transference plan: In Figure 3a, we illustrate the efficiency of the EDOT on an OT problem with $X = Y = [0, 1]$, where the marginals $\mu$ and $\nu$ are truncated normal (mixtures): $\mu$ has two components (the red curve on the left), while $\nu$ has only one component (the red curve on the top). The cost function is the squared Euclidean distance, and $\lambda = \zeta = 0.01$.
The left of Figure 3a shows a $5 \times 5$ EDOT approximation with $W_{k,\zeta}^k(\mu, \mu_5) = 4.792 \times 10^{-3}$, $W_{k,\zeta}^k(\nu, \nu_5) = 5.034 \times 10^{-3}$, and $W_{k,\zeta}^k(\gamma, \gamma_{5,5}) = 8.446 \times 10^{-3}$. The high-density area of the EOT plan is correctly covered by EDOT points with high weights. The right shows a $25 \times 25$ naive approximation with $W_{k,\zeta}^k(\mu, \mu_7) = 5.089 \times 10^{-3}$, $W_{k,\zeta}^k(\nu, \nu_7) = 2.222 \times 10^{-2}$, and $W_{k,\zeta}^k(\gamma, \gamma_{7,7}) = 2.563 \times 10^{-2}$. The naive points with the highest weights missed the region where the true EOT plan had the most density.

5. Methods of Improvement

I. Adaptive EDOT: The computational cost of the Simple EDOT increases with the dimensionality and diameter of the underlying space. A discretization with large $m$ is needed to capture higher-dimensional distributions, which results in more parameters for calculating the gradient of $W_{k,\zeta}^k$: $md$ for the positions $y_i$ and $m - 1$ for the weights $w_i$. This increase not only raises the complexity of each step, but also requires more steps for the SGD to converge. Furthermore, the calculation has a higher complexity ($O(mN)$ for each normalization in the Sinkhorn iteration).
We propose to reduce the computational complexity using a "divide and conquer" approach. The Wasserstein distance takes the $k$-th power of the distance function, $d_X^k$, as its cost function. The locality of the distance $d_X$ makes the solution to the OT/EOT problem local, meaning that probability mass is more likely to be transported to a close destination than to a remote one. Thus, we can "divide and conquer": cut the space $X$ into small cells and solve the discretization problem independently in each cell.
To develop a "divide and conquer" algorithm, we need: (1) an adaptive dividing procedure that partitions $X = X_1 \cup \dots \cup X_I$ and balances accuracy and computational intensity among the cells; and (2) a rule to determine the discretization size $m_i$ and choose a proper regularizer $\zeta_i$ for each cell $X_i$. The pseudocode for all variations is shown in Appendix C, Algorithms A2 and A3.
Choosing size $m$: An appropriate choice of $m_i$ balances the contributions to the Wasserstein distance among the subproblems as follows. Let $X_i$ be a manifold of dimension $d$, let $\mathrm{diam}(X_i)$ be its diameter, and let $p_i = \mu(X_i)$ be the probability of $X_i$. The entropy-regularized Wasserstein distance can be estimated as $W_{k,\zeta}^k = O\big(p_i\, m_i^{-k/d}\, \mathrm{diam}(X_i)^k\big)$ [16,17]. The contribution to $W_{k,\zeta}^k(\mu, \mu_m)$ per support point of $\mu_m$ is then $O\big(p_i\, m_i^{-(k+d)/d}\, \mathrm{diam}(X_i)^k\big)$. Therefore, to balance each point's contribution to the Wasserstein distance among the divided subproblems, we set $m_i \approx m\, \dfrac{\big(p_i\, \mathrm{diam}(X_i)^k\big)^{d/(k+d)}}{\sum_{j=1}^I \big(p_j\, \mathrm{diam}(X_j)^k\big)^{d/(k+d)}}$, as illustrated in the sketch below.
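A small sketch of this allocation rule in NumPy (the cell probabilities and diameters are illustrative placeholders, and the largest-remainder rounding is just one simple way to make the integer sizes sum to $m$; Algorithms A2 and A3 use a different rounding scheme):

```python
import numpy as np

def allocate_sizes(p, diam, m, k=2.0, d=2.0):
    """Split a budget of m support points across cells X_1..X_I in proportion
    to (p_i * diam(X_i)^k)^(d/(k+d)), then round so the sizes sum to m."""
    score = (p * diam ** k) ** (d / (k + d))
    target = m * score / score.sum()          # real-valued allocation
    sizes = np.floor(target).astype(int)
    # hand out the remaining points to the cells with largest remainders
    for idx in np.argsort(target - sizes)[::-1][: m - sizes.sum()]:
        sizes[idx] += 1
    return sizes

p = np.array([0.5, 0.3, 0.15, 0.05])          # cell probabilities mu(X_i)
diam = np.array([0.5, 0.5, 0.25, 0.25])       # cell diameters
print(allocate_sizes(p, diam, m=30))          # -> [14 10  4  2]
```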
Occupied volume (Variation 1): A cell can be too vast (e.g., large in extent with only a few points in one corner), which leads to a larger $m_i$ than needed. To fix this, we may replace $\mathrm{diam}(X_i)$ above with $\mathrm{Vol}(X_i)^{1/d}$, where $\mathrm{Vol}(X_i)$ is the occupied volume, calculated by counting the number of nonempty cells at a certain resolution (levels in the previous binary division). The algorithm (Variation 1) first builds a binary tree to resolve and obtain the occupied volume of each cell, and then traverses the tree to assign the $m_i$.
Adjusting the regularizer $\zeta$: In $W_{k,\zeta}^k$, the SK is run on $e^{-g(x,y)/\zeta}$. Therefore, $\zeta$ should scale with $d_X^k$ to ensure that the transference plan is not affected by the scaling of $d_X$. Precisely, we may choose $\zeta_i = \mathrm{diam}(X_i)^k\, \zeta_0$ for some constant $\zeta_0$.
The division: In principle, any refinement procedure that proceeds iteratively and eventually drives the diameter of each cell to 0 can be used for the division. In our simulations, we used an adaptive kd-tree-style cell refinement in a Euclidean space $\mathbb{R}^d$. Let $X$ be embedded into an axis-aligned rectangular region of $\mathbb{R}^d$. We choose an axis $x_l$ of $\mathbb{R}^d$ and evenly split the region along a hyperplane orthogonal to $x_l$ (e.g., cut the square $[0, 1]^2$ along the line $x = 0.5$), thereby constructing $X_1$ and $X_2$. Given the sample set $S$, we split it into two sample sets $S_1$ and $S_2$ according to which subregion each sample lies in. Then, the corresponding $m_i$ and $\zeta_i$ are calculated as discussed above. Thus, two cells and their corresponding subproblems are constructed. If some $m_i$ is still too large, that cell is cut along another axis into two further cells. The full list of cells and subproblems is constructed recursively. In addition, another cutting method (Variation 2), which chooses the sparsest point as the cutting point via a sliding window, is sometimes useful in practice.
After obtaining the set of subproblems, we apply the EDOT to each cell and then combine the solutions $\mu_{m_i}^{(i)} = \sum_{j=1}^{m_i} w_j^{(i)} \delta_{y_j^{(i)}}$ into the final result $\mu_m := \sum_{i=1}^I \sum_{j=1}^{m_i} p_i\, w_j^{(i)}\, \delta_{y_j^{(i)}}$, as sketched below.
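The following is a minimal sketch of the final combination step, assuming the per-cell solutions have already been computed (the two cells and their solutions below are placeholders):

```python
import numpy as np

def combine_cells(cell_solutions):
    """cell_solutions: list of (p_i, y_i, w_i) with p_i = mu(X_i),
    y_i the support points and w_i the weights returned for cell X_i.
    Returns the support and weights of mu_m = sum_i sum_j p_i w_j^(i) delta."""
    ys, ws = [], []
    for p_i, y_i, w_i in cell_solutions:
        ys.append(np.asarray(y_i))
        ws.append(p_i * np.asarray(w_i))
    return np.concatenate(ys), np.concatenate(ws)

# toy example with two cells of [0, 1]; the per-cell solutions are placeholders
cells = [
    (0.4, [0.1, 0.35], [0.5, 0.5]),            # cell [0, 0.5)
    (0.6, [0.6, 0.8, 0.95], [0.3, 0.4, 0.3]),  # cell [0.5, 1]
]
y, w = combine_cells(cells)
print(y, w, w.sum())   # combined weights sum to 1
```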
Figure 3b shows the optimal discretization for the example in Figure 2c with $m = 30$, obtained by applying the EDOT with adaptive cell refinement and $\zeta = 0.01 \times \mathrm{diam}^2$.
II. On embedded CW complexes: Although samples from the space $X$ are usually represented as vectors in $\mathbb{R}^d$, inducing an embedding $X \hookrightarrow \mathbb{R}^d$, the space $X$ often has its own structure as a CW complex (or simply a manifold) with a more intrinsic metric. Thus, if the CW complex structure is known, even piecewise, we may apply the refinement on $X$ with respect to its own metric, whereas a direct discretization of $X$ as a subset of $\mathbb{R}^d$ may represent the distribution less efficiently.
We now illustrate the adaptive EDOT with an example of a normal mixture distribution mapped onto a sphere through stereographic projection. More examples, including a truncated normal mixture over a Swiss roll and the discretization of a 2D optimal transference plan, are detailed in Appendix D.5.
On the sphere: The underlying space $X_{\mathrm{sphere}}$ is the unit sphere in $\mathbb{R}^3$. $\mu_{\mathrm{sphere}}$ is the pushforward of a normal mixture distribution on $\mathbb{R}^2$ by stereographic projection. The sample set $S_{\mathrm{sphere}} \sim \mu_{\mathrm{sphere}}$ over $X_{\mathrm{sphere}}$ is shown on the left of Figure 4. First, consider the (3D) Euclidean metric on $X_{\mathrm{sphere}}$ induced by the embedding. Figure 4a (right) plots the EDOT solution with refinement for $\mu_m$ with $m = 40$. The resulting cell structure is shown as colored boxes.
To use the intrinsic metric instead, a CW complex was constructed: a point on the equator forms a 0-cell; the rest of the equator forms a 1-cell; and the upper and lower hemispheres form two open 2-cells. We mapped the upper and lower hemispheres onto the unit disk through stereographic projection with respect to the south and north pole, respectively. We then took the metric from spherical geometry and rewrote the distance function and its gradient in the natural coordinates on the unit disk. Figure 4b shows the refinement of the EDOT on the samples (in red) and the corresponding discretizations as colored points. More figures can be found in the Appendices.

6. Analysis of the Algorithms

In this section, we derive the complexity of the simple EDOT and the adaptive EDOT. In particular, we show the following:
Proposition 3.
Let $\mu$ be a (continuous) probability measure on a space $X$. A Simple EDOT of size $m$ has time complexity $O\big((N + m)^2 m d L + N m L \log(1/\epsilon)\big)$ and space complexity $O\big((N + m)^2\big)$, where $N$ is the minibatch size (used to construct $\mu_N$ in each step to approximate $\mu$), $d$ is the dimension of $X$, $L$ is the maximal number of SGD iterations, and $\epsilon$ is the error bound in the Sinkhorn calculation of the entropy-regularized optimal transference plan between $\mu_N$ and $\mu_m$.
Proposition 3 quantitatively shows that, when the adaptive EDOT is applied, the total complexities (in time and space) are reduced, because the magnitudes of both N and m are much smaller in each cell.
The procedure of dividing the sample set $S$ into subsets in the adaptive EDOT is similar to Quicksort, and its space and time complexities are similar as well. The similarity comes from the binary divide-and-conquer structure, as well as from the fact that each split compares every sample with a pivot.
Proposition 4.
For the preprocessing (job list creation) of the adaptive EDOT, the time complexity is $O(N_0 \log N_0)$ in the best and average case and $O(N_0^2)$ in the worst case, where $N_0$ is the total number of sample points; the space complexity is $O(N_0 d + m)$, or simply $O(N_0 d)$ since $m \ll N_0$.
Remark 2.
The complexity is the same as that of Quicksort. The set of $N_0$ sample points is treated as the "true" distribution in the adaptive EDOT, since in the later EDOT steps for each cell no further samples are drawn, as it is hard for a sampler to produce a sample lying in a given cell. Postprocessing of the adaptive EDOT has $O(m)$ complexity in both time and space.
Remark 3.
For the two algorithm variations in Section 5, the occupied-volume estimation works in the same way as the original preprocessing step and has the same time complexity on its own (since dividing must happen after the occupied volume of all cells is known); with the tree built, the original preprocessing becomes a tree traversal with additional time complexity $O(N_0)$ and additional space complexity $O(N_0)$ for storing the occupied volumes.
For details on choosing cut points with a sliding window, see the discussion in Appendix C.5.
Comparison with naive sampling: Given a size-$m$ discretization on $X$ and a size-$n$ discretization on $Y$, the EOT solution (Sinkhorn algorithm) has time complexity $O(m n \log(1/\epsilon))$. In the EDOT, two discretization problems must be solved before applying the Sinkhorn iteration, while naive sampling requires nothing but sampling.
According to Proposition 3, solving a single continuous EOT problem using a size-$m$ Simple EDOT may have higher time complexity than naive sampling with an even larger sample size $N$ (larger than $m$). However, unlike the EDOT, which only requires access to distance functions $d_X$ and $d_Y$ on $X$ and $Y$, respectively, naive sampling requires a known cost function $c : X \times Y \to \mathbb{R}$. In real applications, the cost may come from real-world experiments (or from extra computations) performed for each pair $(x, y)$ in the discretization; the size of the discretized distribution is then critical for cost control, while $d_X$ and $d_Y$ usually come along with the spaces $X$ and $Y$ and are easy to compute. The EDOT is additionally advantageous when the marginal distributions $\mu_X$ and $\nu_Y$ are fixed across different cost functions; the discretizations can then be reused, so the cost of discretization is paid once and the improvement it brings accumulates with each repetition.

7. Related Work and Discussion

Our original problem was the optimal transport problem between general distributions given as samplers (instead of integration oracles). We translated it into a discretization problem plus an OT problem between the discretizations.
I. Comparison with other discretization methods: Several other methods in the literature generate discrete distributions from arbitrary distributions; they are obtained via semi-continuous optimal transport, where a weighted Voronoi diagram must be computed. Calculating weighted Voronoi diagrams usually requires 1. that the cost function be a squared Euclidean distance, and 2. the application of Delaunay triangulation, which is expensive in more than two dimensions. Furthermore, semi-continuous discretization may only optimize one of the positions and the weights of the atoms; this line of work is mainly based on [18] (optimized positions) and [19] (optimized weights).
We mainly compare against the prior work of [18], which focuses on the barycenter of a set of distributions under the Wasserstein metric. That work produces a discrete distribution called a Lagrangian discretization, which is of the form $\frac{1}{m}\sum_{i=1}^m \delta_{x_i}$ [2]. Other works, such as [20,21], find barycenters but do not create a discretization. Refs. [19,22] studied discrete estimation of the 2-Wasserstein distance, locating discrete points through the clustering algorithm k-means++ and through weighted Voronoi diagram refinement, respectively, and then assigned weights, making the results non-Lagrangian discretizations. Ref. [19] (compared in Figure 5) roughly follows a "divide-and-conquer" approach in selecting positions, but the positions are not tuned directly according to the Wasserstein distance. Ref. [22] converges as the number of discrete points increases; however, it lacks a criterion (such as the Wasserstein objective in the EDOT) showing that its choice is not just one among all possible converging algorithms but a special one.
By projecting the SGD gradient onto the tangent space of the submanifold $X^m \times \{e_m/m\} = \{\frac{1}{m}\sum \delta_{x_i}\}$, or equivalently by fixing the learning rate on the weights to zero, the EDOT can also produce a Lagrangian discretization (denoted EDOT-Equal). A comparison among the methods is carried out on a map of the Canary Islands, shown in Figure 6. This example shows that, with Lagrangian discretization, our method obtains results similar to those in the literature, while in general the weighted EDOT can work better.
Moreover, the EDOT can be used to solve barycenter problems.
Note that, to apply the adaptive EDOT to barycenter problems, compatible divisions of the target distributions are needed (i.e., a cell $A$ from one target distribution must transport entirely onto a discrete subset $D$, and $D$ must transport onto a cell $B$ from another target distribution, etc.).
We also tested these algorithms on discretizing grayscale/color pictures. A comparison of discretizations of a kitty image with between 10 and 4000 points, using EDOT, EDOT-Equal, and [18], together with estimates of their Wasserstein distances to the original image, is shown in Figure 7 and Figure 8.
Furthermore, the EDOT may be applied to the RGB channels of an image independently, and the resulting discretizations can then be combined and plotted in the corresponding colors. The results are shown in Figure 1 at the beginning of this paper.
Lagrangian discretization can be at a disadvantage when representing repetitive patterns with an incompatible number of discretization points.
In Figure 9, discretizing 16 objects with 24 points causes local weight incompatibility for the Lagrangian discretization, placing points between objects and increasing the Wasserstein distance. With the EDOT, the weights of points lying outside the blue objects are much smaller, and the patterned structure is better represented. In practice, patterns often occur as part of the data (e.g., pictures of nature), and an incompatible number of points is easy to hit in Lagrangian discretization, since the equal-weight requirement is rigid; consequently, patterns cannot be properly captured.
II. General $k$ and general distance $d_X$: Our algorithms (Simple EDOT, adaptive EDOT, and EDOT-Equal) work for a general choice of the parameter $k > 1$ and a $C^2$ distance $d_X$ on $X$. For example, in Figure 4b, the distance used on each disk is spherical (arc length along the great circle passing through two points), which cannot be isometrically reparametrized into a plane with the Euclidean metric because of the difference in curvature.
III. Other possible impacts: As OT problems arise widely in other areas, our algorithm can be applied accordingly, e.g., to choosing the locations and sizes of supermarkets or electrical substations in an area, or even of air conditioners in supercomputer rooms. Our divide-and-conquer methods are suitable for such real-world applications.
IV. OT for discrete distributions: Many algorithms have been developed to solve OT problems between two discrete distributions [3]. Linear programming algorithms were developed first, but their applications have been restricted by high computational complexity. Other methods typically solve problems under specific conditions; for example, [23], for costs of the form $c(x, y) = h(x - y)$ for some $h$, applies the "back-and-forth" method, hopping between two forms of the Kantorovich dual problem (on the two marginals, respectively) to obtain a gradient of the total cost over the dual functions. In our work, we chose to apply the EOT developed by [8] to obtain an estimated OT solution of the discrete problem.

8. Conclusions

We developed methods for efficiently approximating OT couplings with fixed-size $m \times n$ approximations. We provided bounds relating the discrete approximation to the original continuous problem, implemented two algorithms, demonstrated their efficacy compared with naive sampling, and analyzed their computational complexity. Our work provides a new approach to computing OT plans efficiently.

Author Contributions

Conceptualization, J.W. and P.W.; methodology, J.W.; software, J.W.; validation, P.W., P.S. and J.W.; formal analysis, J.W. and P.W.; investigation, P.W.; writing—original draft preparation, J.W., P.W. and P.S.; writing—review and editing, J.W., P.W. and P.S.; supervision, P.S.; project administration, P.S.; funding acquisition, P.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by DARPA grant number HR00112020039, W912CG22C0001, W911NF2020001 and NSF MRI 1828528.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
OT: Optimal Transport
EOT: Entropy-Regularized Optimal Transport
EDOT: Efficient Discretization of Optimal Transport
SGD: Stochastic Gradient Descent

Appendix A. Proof of Proposition 1

Proof. 
We adopt the notation $Z = X \times Y$ and $z_i = (x_i, y_i) \in Z$. Furthermore, recall the condition
$$\max\{d_X(x, x'),\, d_Y(y, y')\} \le d_Z(z, z') \le d_X(x, x') + d_Y(y, y'). \tag{A1}$$
For inequality (i), assume without loss of generality that
$$\max\{W_k^k(\mu, \mu_m),\, W_k^k(\nu, \nu_n)\} = W_k^k(\mu, \mu_m).$$
Denote by $\pi_Z^*$ the optimal $\pi_Z \in \Pi\big(\gamma_\lambda(\mu, \nu), \gamma_\lambda(\mu_m, \nu_n)\big)$ achieving $W_k^k\big(\gamma_\lambda(\mu, \nu), \gamma_\lambda(\mu_m, \nu_n)\big)$, and similarly for $\pi_X^*$. Then, we have
$$W_k^k\big(\gamma_\lambda(\mu, \nu), \gamma_\lambda(\mu_m, \nu_n)\big) = \int_{Z^2} d_Z^k(z_1, z_2)\, d\pi_Z^* \ge \int_{Z^2} d_X^k(x_1, x_2)\, d\pi_Z^* = \int_{X^2} d_X^k(x_1, x_2) \int_{Y^2} d\pi_Z^* \stackrel{(a)}{=} \int_{X^2} d_X^k(x_1, x_2)\, d\pi_X \stackrel{(b)}{\ge} \int_{X^2} d_X^k(x_1, x_2)\, d\pi_X^* = W_k^k(\mu, \mu_m).$$
Here, $\pi_X \in \Pi(\mu, \mu_m)$ and equality (a) hold since $\pi_Z^* \in \Pi\big(\gamma_\lambda(\mu, \nu), \gamma_\lambda(\mu_m, \nu_n)\big)$, and inequality (b) holds since $\pi_X^*$ is the optimal choice.
For inequality (ii), we use the following shorthand: $d\gamma\, d\gamma_{mn} := d\gamma_\lambda(\mu, \nu)\, d\gamma_\lambda(\mu_m, \nu_n)$ and $d_Z^k := d_Z^k(z_1, z_2)$, $d_X^k := d_X^k(x_1, x_2)$, $d_Y^k := d_Y^k(y_1, y_2)$.
$$\begin{aligned}
W_{k,\zeta}^k\big(\gamma_\lambda(\mu, \nu), \gamma_\lambda(\mu_m, \nu_n)\big) &= \int_{Z^2} d_Z^k(z_1, z_2)\, d\pi_{z,\zeta}^* \\
&\stackrel{(a)}{=} \int_{Z^2} d_Z^k \cdot \exp\!\Big(\tfrac{\alpha(z_1) + \beta(z_2) - d_Z^k(z_1, z_2)}{\zeta}\Big)\, d\gamma\, d\gamma_{mn} \\
&\stackrel{(b)}{\le} A \int_{Z^2} (d_X^k + d_Y^k) \cdot \exp\!\Big(\tfrac{\alpha(z_1) + \beta(z_2) - d_Z^k}{\zeta}\Big)\, d\gamma\, d\gamma_{mn} \\
&\stackrel{(c)}{\le} C_1 \int_{Z^2} (d_X^k + d_Y^k) \cdot \exp\!\Big(\tfrac{-d_Z^k}{\zeta}\Big)\, d\gamma\, d\gamma_{mn} \\
&\stackrel{(d)}{\le} C_1 \Big[ \int_{Z^2} d_X^k \cdot \exp\!\Big(\tfrac{-d_X^k}{\zeta}\Big) + d_Y^k \cdot \exp\!\Big(\tfrac{-d_Y^k}{\zeta}\Big)\, d\gamma\, d\gamma_{mn} \Big] \\
&\stackrel{(e)}{=} C_1 \Big[ \int_{X^2} d_X^k \cdot \exp\!\Big(\tfrac{-d_X^k}{\zeta}\Big)\, d\mu\, d\mu_m + \int_{Y^2} d_Y^k \cdot \exp\!\Big(\tfrac{-d_Y^k}{\zeta}\Big)\, d\nu\, d\nu_n \Big] \\
&\stackrel{(f)}{\le} C_1 \int_{X^2} d_X^k \cdot \exp\!\Big(\tfrac{s(x_1) + t(x_2) - d_X^k}{\zeta}\Big)\, d\mu\, d\mu_m + C_1 \int_{Y^2} d_Y^k \cdot \exp\!\Big(\tfrac{s(y_1) + t(y_2) - d_Y^k}{\zeta}\Big)\, d\nu\, d\nu_n \\
&= C_1 \big[ W_{k,\zeta}^k(\mu, \mu_m) + W_{k,\zeta}^k(\nu, \nu_n) \big].
\end{aligned}$$
Justifications for the derivations:
(a) Based on the dual formulation, it is shown in [5], Proposition 1, that for $\zeta > 0$ there exist $\alpha(z_1), \beta(z_2) \in C(Z)$ such that
$$d\pi_{z,\zeta}^* = \exp\!\Big(\frac{\alpha(z_1) + \beta(z_2) - d_Z^k(z_1, z_2)}{\zeta}\Big)\, d\gamma\, d\gamma_{mn};$$
(b) Inequality (ii) of Equation (A1);
(c) According to Ref. [24], Theorem 2, when $X$ and $Y$ are compact and $c$ is smooth, $\alpha$ and $\beta$ are uniformly bounded; moreover, both $d_X^k$ and $d_Y^k$ are uniformly bounded by the diameters of $X$ and $Y$, respectively; hence, the constant $C_1$ exists;
(d) Inequality (i) of Equation (A1);
(e) γ λ ( μ , ν ) Π ( μ , ν ) and γ λ ( μ m , ν n ) Π ( μ m , ν n ) ;
(f) As in (a), for $\zeta > 0$ there exist $s(x_1), t(x_2) \in C(X)$ and $s(y_1), t(y_2) \in C(Y)$ such that $\exp\!\big(\tfrac{s(x_1) + t(x_2) - d_X^k}{\zeta}\big)\, d\mu\, d\mu_m = d\pi_X^*$ and $\exp\!\big(\tfrac{s(y_1) + t(y_2) - d_Y^k}{\zeta}\big)\, d\nu\, d\nu_n = d\pi_Y^*$. Moreover, $\int_{X^2} d_X^k \cdot \exp\!\big(\tfrac{s(x_1) + t(x_2)}{\zeta}\big)\, d\mu\, d\mu_m \ge 0$, and $\int_{Y^2} d_Y^k \cdot \exp\!\big(\tfrac{s(y_1) + t(y_2)}{\zeta}\big)\, d\nu\, d\nu_n \ge 0$.    □

Appendix B. Gradient of W k , ζ k

Appendix B.1. The Gradient

Following the definitions and notation in Section 2 and Section 3 of the paper, we calculate in detail the gradient of $W_{k,\zeta}^k(\mu, \mu_m)$ with respect to the parameters of $\mu_m := \sum_{i=1}^m w_i \delta_{y_i}$.
$W_{k,\zeta}^k(\mu, \mu_m) = \int_{X^2} g(x, y)\, d\gamma_\zeta(x, y)$, where
$$\gamma_\zeta = \operatorname*{argmin}_{\gamma \in \Pi(\mu, \mu_m)} \int_{X^2} g(x, y)\, d\gamma(x, y) + \zeta\, \mathrm{KL}(\gamma \,\|\, \mu \otimes \mu_m). \tag{A2}$$
Let $\alpha \in L^\infty(X)$ and $\beta \in \mathbb{R}^m$. Denote $\beta = \sum_{i=1}^m \beta_i \delta_{y_i}$, and let
$$\begin{aligned}
F(\gamma; \mu, \mu_m, \alpha, \beta) :=\ & \int_{X^2} g(x, y)\, d\gamma(x, y) + \zeta\, \mathrm{KL}(\gamma \,\|\, \mu \otimes \mu_m) + \int_X \alpha(x) \Big( d\mu(x) - \int_X d\gamma(x, y) \Big) + \int_X \beta(y) \Big( d\mu_m(y) - \int_X d\gamma(x, y) \Big) \\
=\ & \int_{X^2} g(x, y)\, d\gamma(x, y) + \zeta\, \mathrm{KL}(\gamma \,\|\, \mu \otimes \mu_m) + \int_X \alpha(x) \Big( d\mu(x) - \sum_{i=1}^m d\gamma(x, y_i) \Big) + \sum_{i=1}^m \beta_i \Big( w_i - \int_X d\gamma(x, y_i) \Big).
\end{aligned}$$
Since $\gamma$ is discrete in its second component and supported on $\{y_i\}_{i=1}^m$, we may denote $d\gamma(x, y_i)$ by $d\pi_i(x)$; thus,
$$F(\gamma; \mu, \mu_m, \alpha, \beta) = \int_X \sum_{i=1}^m g(x, y_i)\, d\pi_i(x) + \zeta \sum_{i=1}^m \int_X \Big( \ln\frac{d\pi_i(x)}{d\mu(x)} - \ln\big(\mu_m(y_i)\big) \Big)\, d\pi_i(x) + \int_X \alpha(x) \Big( d\mu(x) - \sum_{i=1}^m d\pi_i(x) \Big) + \sum_{i=1}^m \beta_i \Big( w_i - \int_X d\pi_i(x) \Big).$$
Then, the Fenchel dual of Problem (A2) is
$$L(\mu, \mu_m; \alpha, \beta) = \int_X \alpha(x)\, d\mu(x) + \sum_{i=1}^m \beta_i w_i - \zeta \int_X \sum_{i=1}^m e^{(\alpha(x) + \beta_i - g(x, y_i))/\zeta}\, w_i\, d\mu(x). \tag{A5}$$
Let $\alpha^*$ and $\beta^*$ be the argmax of the Fenchel dual (A5). The primal is solved by $d\gamma^*(x, y_i) = e^{(\alpha^*(x) + \beta_i^* - g(x, y_i))/\zeta}\, w_i\, d\mu(x)$. To make the solution unique, we remove the remaining degree of freedom (noting that $L(\mu, \mu_m; \alpha, \beta) = L(\mu, \mu_m; \alpha + t, \beta - t)$ for any $t \in \mathbb{R}$). We use the condition $\beta_m = 0$ to narrow the choices down to one, and denote the dual variable with this property by $\bar{\beta}$ (and $\bar{\beta}^*$).
We first calculate $\nabla_{w_i, y_i} L(\mu, \mu_m; \alpha^*, \bar{\beta}^*)$, treating $\alpha^*$ and $\bar{\beta}^*$ as functions of $\mu_m$ (as in the main text):
$$\frac{\partial L}{\partial w_i} = \int_X g(x, y_i)\, E_i^*(x)\, d\mu(x) + \frac{1}{\zeta} \int_X \sum_{j=1}^m g(x, y_j) \left( \frac{\partial \alpha^*(x)}{\partial w_i} + \frac{\partial \beta_j^*}{\partial w_i} \right) w_j\, E_j^*(x)\, d\mu(x), \tag{A6}$$
$$\nabla_{y_i} L = \int_X \nabla_{y_i} g(x, y_i) \left( 1 - \frac{g(x, y_i)}{\zeta} \right) E_i^*(x)\, w_i\, d\mu(x) + \frac{1}{\zeta} \int_X \sum_{j=1}^m g(x, y_j) \left( \nabla_{y_i} \alpha^*(x) + \nabla_{y_i} \beta_j^* \right) w_j\, E_j^*(x)\, d\mu(x). \tag{A7}$$
Next, we calculate the derivatives of $\alpha^*$ and $\bar{\beta}^*$ by finding their defining equation and then applying the Implicit Function Theorem.
The optimal dual variables $\alpha^*$ and $\beta^*$ are obtained by solving the stationarity equation $\nabla_{\alpha, \beta} L = 0$, where the derivatives are taken in the Fréchet sense. The Fenchel dual, as a function of $\alpha$ and $\bar{\beta}$, has domain and codomain $L(\mu, \mu_m; \cdot, \cdot) : L^\infty(X) \times \mathbb{R}^{m-1} \to \mathbb{R}$. The derivatives are
$$\nabla_\alpha L(\mu, \mu_m; \alpha, \bar{\beta}) = \int_X \Big( 1 - \sum_{i=1}^m w_i E_i(x) \Big)\, (\cdot)\, d\mu(x), \tag{A8}$$
$$\nabla_{\bar{\beta}_i} L(\mu, \mu_m; \alpha, \bar{\beta}) = w_i \Big( 1 - \int_X E_i(x)\, d\mu(x) \Big),$$
where $E_i(x) = e^{(\alpha(x) + \beta_i - g(x, y_i))/\zeta}$ is defined as in the paper, $\nabla_\alpha L(\mu, \mu_m; \alpha, \bar{\beta}) \in (L^\infty(X))^\vee$ (as a linear functional), and $\nabla_{\bar{\beta}_i} L(\mu, \mu_m; \alpha, \bar{\beta}) \in \mathbb{R}$. Next, we need to show that $L$ is differentiable in the Fréchet sense, i.e.,
$$\lim_{\|h\| \to 0} \frac{1}{\|h\|} \Big| L(\mu, \mu_m; \alpha + h, \bar{\beta}) - L(\mu, \mu_m; \alpha, \bar{\beta}) - \nabla_\alpha L(\mu, \mu_m; \alpha, \bar{\beta})(h) \Big| = 0.$$
By the definition of $L$ (we write $L(\alpha)$ for $L(\mu, \mu_m; \alpha, \bar{\beta})$),
$$\begin{aligned}
L(\alpha + h) - L(\alpha) - \nabla_\alpha L(\alpha)(h) &= \int_X h(x)\, d\mu(x) - \zeta \int_X \sum_{i=1}^m \big( e^{h(x)/\zeta} - 1 \big)\, w_i E_i(x)\, d\mu(x) - \int_X \Big( 1 - \sum_{i=1}^m w_i E_i(x) \Big) h(x)\, d\mu(x) \\
&= \zeta \int_X \sum_{i=1}^m \Big( 1 + \frac{h(x)}{\zeta} - e^{h(x)/\zeta} \Big)\, E_i(x)\, w_i\, d\mu(x) \\
&= -\zeta \int_X \sum_{k=2}^\infty \frac{1}{k!} \frac{h(x)^k}{\zeta^k} \sum_{i=1}^m E_i(x)\, w_i\, d\mu(x),
\end{aligned}$$
where the last equality follows from the Taylor expansion of the exponential function. Consider $\|h\| = \operatorname{ess\,sup}_{x \in X} |h(x)|$, the essential supremum of $|h(x)|$ for $x \in X$ with respect to the measure $\mu$.
Denote $\mathcal{N} := \nabla_{\alpha, \bar{\beta}} L$. Then
$$\frac{1}{\|h\|} \Big| L(\alpha + h) - L(\alpha) - \nabla_\alpha L(\alpha)(h) \Big| \le \frac{\zeta}{\|h\|} \int_X \sum_{k=2}^\infty \frac{1}{k!} \frac{|h(x)|^k}{\zeta^k} \sum_{i=1}^m E_i(x)\, w_i\, d\mu(x) \le \frac{\zeta}{\|h\|} \int_X \sum_{k=2}^\infty \frac{1}{k!} \frac{\|h\|^k}{\zeta^k} \sum_{i=1}^m E_i(x)\, w_i\, d\mu(x) = \zeta \sum_{k=2}^\infty \frac{1}{k!} \frac{\|h\|^{k-1}}{\zeta^k} \int_X \sum_{i=1}^m E_i(x)\, w_i\, d\mu(x) = \zeta \sum_{k=2}^\infty \frac{1}{k!} \frac{\|h\|^{k-1}}{\zeta^k}. \tag{A12}$$
Therefore,
$$\lim_{\|h\| \to 0} \frac{1}{\|h\|} \Big| L(\alpha + h) - L(\alpha) - \nabla_\alpha L(\alpha)(h) \Big| = 0,$$
which shows that the expression for $\nabla_\alpha L(\alpha)$ in Equation (A8) gives the correct Fréchet derivative. Note that $\alpha \in L^\infty(X)$ is critical in Equation (A12).
Let $\mathcal{N} := \nabla_{\alpha, \bar{\beta}} L$, taking values in $(L^\infty(X))^\vee \times \mathbb{R}^{m-1}$. Then $\mathcal{N} = 0$ defines $\alpha^*$ and $\bar{\beta}^*$, which makes it possible to differentiate them with respect to $\mu_m$ using the Implicit Function Theorem for Banach spaces. From now on, $\mathcal{N}$ is evaluated at $\alpha = \alpha^*$, $\bar{\beta} = \bar{\beta}^*$; i.e., the marginal conditions on $d\pi_i(x) = w_i E_i(x)\, d\mu(x)$ hold.
Thus, we need $\nabla_{\alpha, \bar{\beta}} \mathcal{N}$ and $\nabla_{w_i, y_i} \mathcal{N}$ calculated, and we must prove that $\nabla_{\alpha, \bar{\beta}} \mathcal{N}$ is invertible (and give the inverse).
It is necessary to identify which form $\nabla_{\alpha, \bar{\beta}} \mathcal{N}$ takes as a Fréchet derivative. Start from the map $\mathcal{N}(\mu, \mu_m; \cdot, \cdot) : L^\infty(X) \times \mathbb{R}^{m-1} \to (L^\infty(X))^\vee \times (\mathbb{R}^{m-1})^\vee$, where $\mathbb{R}^{m-1}$ is isomorphic to its dual Banach space $(\mathbb{R}^{m-1})^\vee$. Then $\nabla_{\alpha, \bar{\beta}} \mathcal{N} \in \operatorname{Hom}^b_{\mathbb{R}}\big(L^\infty(X) \times \mathbb{R}^{m-1},\, (L^\infty(X))^\vee \times (\mathbb{R}^{m-1})^\vee\big)$, where $\operatorname{Hom}^b$ denotes the set of bounded linear operators. Moreover, recall that $(\cdot) \otimes A$ is the left adjoint functor of $\operatorname{Hom}^b_{\mathbb{R}}(A, \cdot)$; then, for $\mathbb{R}$-vector spaces, $\operatorname{Hom}^b_{\mathbb{R}}\big(L^\infty(X) \times \mathbb{R}^{m-1},\, (L^\infty(X))^\vee \times (\mathbb{R}^{m-1})^\vee\big) \cong \operatorname{Hom}^b_{\mathbb{R}}\big(\big(L^\infty(X) \times \mathbb{R}^{m-1}\big)^{\otimes 2},\, \mathbb{R}\big)$. Thus, we can regard $\nabla_{\alpha, \bar{\beta}} \mathcal{N}$ as a bilinear form on the vector space $L^\infty(X) \times \mathbb{R}^{m-1}$.
From the expression for $\mathcal{N}$, we may differentiate (similarly to the calculations (A11) to (A13)):
$$\nabla_\alpha \mathcal{N} = \left( -\frac{1}{\zeta} \int_X (\cdot)\,(\ast) \sum_{i=1}^m w_i E_i(x)\, d\mu(x),\ -\frac{1}{\zeta} \int_X (\cdot)\, w_i E_i(x)\, d\mu(x) \right),$$
$$\nabla_{\bar{\beta}} \mathcal{N} = \left( -\frac{1}{\zeta} \int_X (\cdot)\, w_i E_i(x)\, d\mu(x),\ -\frac{\delta_{ij}}{\zeta} \int_X w_i E_i(x)\, d\mu(x) \right).$$
Using the marginal conditions $\sum_{i=1}^m w_i E_i(x)\, d\mu(x) = \sum_{i=1}^m d\pi_i(x) = d\mu(x)$ and $\int_X w_i E_i(x)\, d\mu(x) = \int_X d\pi_i(x) = w_i$, the operator $\nabla_{\alpha, \bar{\beta}} \mathcal{N}$, as the Hessian form of $L$, can be written as
$$\nabla_{\alpha, \bar{\beta}} \mathcal{N} = -\frac{1}{\zeta} \begin{pmatrix} \langle \cdot\,, \cdot \rangle & \langle \cdot\,, \pi_j \rangle \\ \langle \cdot\,, \pi_i \rangle & w_i\, \delta_{ij} \end{pmatrix}$$
with $\langle \phi_1, \phi_2 \rangle = \int_X \phi_1(x)\, \phi_2(x)\, d\mu(x)$, or further as
$$\nabla_{\alpha, \bar{\beta}} \mathcal{N} = -\frac{1}{\zeta} \begin{pmatrix} d\mu(x)\, \delta(x, x') & d\pi_j(x) \\ d\pi_i(x') & w_i\, \delta_{ij} \end{pmatrix} \tag{A17}$$
over the basis $\{\delta(x), e_i\}_{x \in X,\, i < m}$.
By the inverse of $\nabla_{\alpha, \bar{\beta}} \mathcal{N}$, we mean the element of $\operatorname{Hom}^b_{\mathbb{R}}\big((L^\infty(X))^\vee \times (\mathbb{R}^{m-1})^\vee,\, L^\infty(X) \times \mathbb{R}^{m-1}\big)$ that composes with $\nabla_{\alpha, \bar{\beta}} \mathcal{N}$ (on the left and on the right) to give the identities. By the natural identification of a space with its double dual, $V \cong V^{\vee\vee}$, and the tensor-hom adjunction,
$$\operatorname{Hom}^b_{\mathbb{R}}\big((L^\infty(X))^\vee \times (\mathbb{R}^{m-1})^\vee,\, L^\infty(X) \times \mathbb{R}^{m-1}\big) \cong \operatorname{Hom}^b_{\mathbb{R}}\big((L^\infty(X))^\vee \times (\mathbb{R}^{m-1})^\vee,\, \big(L^\infty(X) \times \mathbb{R}^{m-1}\big)^{\vee\vee}\big) \cong \operatorname{Hom}^b_{\mathbb{R}}\Big(\big((L^\infty(X))^\vee \times (\mathbb{R}^{m-1})^\vee\big)^{\otimes 2},\, \mathbb{R}\Big),$$
so we can write the inverse of $\nabla_{\alpha, \bar{\beta}} \mathcal{N}$ as a bilinear form again.
Write $\nabla_{\alpha, \bar{\beta}} \mathcal{N}$ in the block form $\begin{pmatrix} A & B \\ B^T & D \end{pmatrix}$. According to the block-inverse formula,
$$\begin{pmatrix} A & B \\ B^T & D \end{pmatrix}^{-1} = \begin{pmatrix} A^{-1} + A^{-1} B F^{-1} B^T A^{-1} & -A^{-1} B F^{-1} \\ -F^{-1} B^T A^{-1} & F^{-1} \end{pmatrix}, \tag{A19}$$
where $F = D - B^T A^{-1} B$, whose invertibility determines the invertibility of $\begin{pmatrix} A & B \\ B^T & D \end{pmatrix}$.
Here $A^{-1} \in \operatorname{Hom}^b_{\mathbb{R}}\big(((L^\infty(X))^\vee)^{\otimes 2},\, \mathbb{R}\big)$; explicitly, $A^{-1}(x, y) = \delta(x, y)$. Therefore, from Equation (A17),
$$F_{ij} = \delta_{ij}\, w_i - \int_{X^2} \delta(x, x')\, d\pi_i(x)\, d\pi_j(x') = \delta_{ij}\, w_i - \int_X w_i w_j E_i(x) E_j(x)\, d\mu(x).$$
The matrix $F$ is symmetric, of full rank $m - 1$, and strictly diagonally dominant; therefore, it is invertible. To see the strict diagonal dominance, consider $\sum_{j=1}^m \int_X w_i w_j E_i(x) E_j(x)\, d\mu(x) = \int_X w_i E_i(x) \sum_{j=1}^m w_j E_j(x)\, d\mu(x) = \int_X w_i E_i(x)\, d\mu(x) = w_i$ by the marginal conditions, while $F$ is only of size $(m-1) \times (m-1)$ (there is no $i = m$ or $j = m$ in $F_{ij}$). Hence the matrix $F$ is strictly diagonally dominant.
With all ingredients of formula (A19) known, we can calculate the inverse of $\nabla_{\alpha, \bar{\beta}} \mathcal{N}$.
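As a standalone numerical check of the block-inverse formula (A19) (with random symmetric positive-definite blocks, not the operator $\nabla_{\alpha, \bar{\beta}} \mathcal{N}$ itself):

```python
import numpy as np

rng = np.random.default_rng(0)
nA, nD = 4, 3
# random symmetric positive definite blocks so that the inverses exist
A = rng.random((nA, nA)); A = A @ A.T + nA * np.eye(nA)
D = rng.random((nD, nD)); D = D @ D.T + nD * np.eye(nD)
B = rng.random((nA, nD))

Ainv = np.linalg.inv(A)
F = D - B.T @ Ainv @ B                      # Schur complement
Finv = np.linalg.inv(F)
block_inverse = np.block([
    [Ainv + Ainv @ B @ Finv @ B.T @ Ainv, -Ainv @ B @ Finv],
    [-Finv @ B.T @ Ainv,                   Finv],
])
M = np.block([[A, B], [B.T, D]])
print(np.allclose(block_inverse, np.linalg.inv(M)))   # True
```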
Following the Implicit Function Theorem, we also need $\nabla_{w_i, y_i} \mathcal{N}$; each partial derivative is an element of $(L^\infty(X))^\vee \times \mathbb{R}^{m-1}$.
$$\frac{\partial \mathcal{N}}{\partial w_i} = \left( -\int_X E_i(x)\, (\cdot)\, d\mu(x),\ \delta_{ij}\Big( 1 - \int_X E_i(x)\, d\mu(x) \Big) \right) = \left( -\int_X (\cdot)\, E_i(x)\, d\mu(x),\ 0 \right).$$
Note that if we impose the constraint $\sum_{i=1}^m w_i = 1$ on the $w_i$'s, we may set $w_m = 1 - \sum_{i=1}^{m-1} w_i$ and recalculate the above derivatives as $\nabla'_{w_i} \mathcal{N} = \nabla_{w_i} \mathcal{N} - \nabla_{w_m} \mathcal{N}$ when $i \ne m$, with $\nabla'_{w_m} \mathcal{N} = -\sum_{i=1}^{m-1} \nabla_{w_i} \mathcal{N}$.
$$\nabla_{y_i} \mathcal{N} = \left( \frac{1}{\zeta} \int_X (\cdot)\, \nabla_{y_i} g(x, y_i)\, w_i E_i(x)\, d\mu(x),\ \frac{\delta_{ij}}{\zeta} \int_X \nabla_{y_i} g(x, y_i)\, w_i E_i(x)\, d\mu(x) \right).$$
Finally, by the Implicit Function Theorem,
$$\nabla_{w_i, y_j}(\alpha^*, \bar{\beta}^*) = -\big(\nabla_{\alpha, \bar{\beta}} \mathcal{N}\big|_{\alpha^*, \bar{\beta}^*}\big)^{-1}\, \nabla_{w_i, y_j} \mathcal{N}\big|_{\alpha^*, \bar{\beta}^*},$$
which can be substituted into (A6) and (A7).

Appendix B.2. Second Derivatives

In this part, we calculate the second derivatives of $W_{k,\zeta}^k(\mu, \mu_m)$ with respect to the parameters of $\mu_m$, i.e., the $w_i$'s and $y_i$'s, with a view toward applying Newton's method to the EDOT (which we have not implemented yet).
Using the previous results, we can further calculate the second derivatives of $W_{k,\zeta}^k$ with respect to the $w_i$'s and $y_i$'s. Differentiating (A6) and (A7) results in
2 L w i w j = 1 ζ X g ( x , y j ) k = i , j α ( x ) w k + β ¯ k w k E k ( x ) d μ ( x ) + 1 ζ X k = 1 n 2 α ( x ) w i w j + 2 β ¯ k w i w j w k E k ( x ) d μ ( x ) + 1 ζ 2 X k = 1 m l = i , j α ( x ) w l + β ¯ k w l w k E k ( x ) d μ ( x )
y j L w i = δ i j X 1 g ( x , y i ) ζ y i E i ( x ) d μ ( x ) + 1 ζ X y j g ( x , y j ) α ( x ) w i + β ¯ j w i w j E j ( x ) d μ ( x ) + 1 ζ X g ( x , y j ) y j α ( x ) w i + β ¯ j w i w j E j ( x ) d μ ( x ) + 1 ζ 2 X k = 1 m g ( x , y k ) α ( x ) w i + β ¯ k w i y j α ( x ) + y j β ¯ k w k E k ( x ) d μ ( x )
y j y i L = δ i j X y i 2 g ( x , y i ) ( 1 g ( x , y i ) ζ ) w i E i ( x ) d μ ( x ) 1 ζ X y i g ( x , y i ) 2 ( 2 g ( x , y i ) ζ ) w i E i ( x ) d μ ( x ) + 1 ζ X 1 g ( x , y i ) ζ y i g ( x , y i ) · y j α ( x ) + y j β i ¯ w i E i ( x ) d μ ( x ) + 1 ζ X 1 g ( x , y j ) ζ y j g ( x , y j ) · y i α ( x ) + y i β j ¯ w j E j ( x ) d μ ( x ) + 1 ζ X k = 1 m g ( x , y k ) y i y j α ( x ) + y i y j β ¯ k · w k E k ( x ) d μ ( x ) + 1 ζ 2 X k = 1 m g ( x , y k ) l = i , j l α + l β ¯ k · w k E k ( x ) d μ ( x )
Once we have the second derivatives of $g(x, y)$ with respect to the $y_i$'s, we need the second derivatives of $\alpha^*$ and $\bar{\beta}^*$ to build the above second derivatives. From the formula $\nabla_{w_i, y_j}(\alpha^*, \bar{\beta}^*) = -\big(\nabla_{\alpha, \bar{\beta}} \mathcal{N}\big|_{\alpha^*, \bar{\beta}^*}\big)^{-1}\, \nabla_{w_i, y_j} \mathcal{N}\big|_{\alpha^*, \bar{\beta}^*}$, we can differentiate:
$$\nabla_{w_k, y_l} \nabla_{w_i, y_j}(\alpha^*, \bar{\beta}^*) = -\nabla_{w_k, y_l}\Big(\big(\nabla_{\alpha, \bar{\beta}} \mathcal{N}\big|_{\alpha^*, \bar{\beta}^*}\big)^{-1}\Big)\, \nabla_{w_i, y_j} \mathcal{N}\big|_{\alpha^*, \bar{\beta}^*} - \big(\nabla_{\alpha, \bar{\beta}} \mathcal{N}\big|_{\alpha^*, \bar{\beta}^*}\big)^{-1}\, \nabla_{w_k, y_l} \nabla_{w_i, y_j} \mathcal{N}\big|_{\alpha^*, \bar{\beta}^*}.$$
Here, from the formula $\partial(A^{-1}) = -A^{-1}\, (\partial A)\, A^{-1}$ (the product rule applied to $A A^{-1} = I$), we have
$$\nabla_{w_k, y_l}\Big(\big(\nabla_{\alpha, \bar{\beta}} \mathcal{N}\big|_{\alpha^*, \bar{\beta}^*}\big)^{-1}\Big) = -\big(\nabla_{\alpha, \bar{\beta}} \mathcal{N}\big|_{\alpha^*, \bar{\beta}^*}\big)^{-1}\, \Big(\nabla_{w_k, y_l} \nabla_{\alpha, \bar{\beta}} \mathcal{N}\big|_{\alpha^*, \bar{\beta}^*}\Big)\, \big(\nabla_{\alpha, \bar{\beta}} \mathcal{N}\big|_{\alpha^*, \bar{\beta}^*}\big)^{-1}$$
and
$$\nabla_{w_k} \nabla_{\alpha, \bar{\beta}} \mathcal{N}\big|_{\alpha^*, \bar{\beta}^*} = -\frac{1}{\zeta} \begin{pmatrix} 0 & \delta_{jk}\, E_k(x)\, d\mu(x) \\ \delta_{ik}\, E_k(x)\, d\mu(x) & \delta_{ij}\, \delta_{jk} \end{pmatrix},$$
$$\nabla_{y_k} \nabla_{\alpha, \bar{\beta}} \mathcal{N}\big|_{\alpha^*, \bar{\beta}^*} = \frac{1}{\zeta^2} \begin{pmatrix} 0 & \delta_{jk}\, \nabla_{y_k} g(x, y_k)\, d\pi_k(x) \\ \delta_{ik}\, \nabla_{y_k} g(x, y_k)\, d\pi_k(x) & 0 \end{pmatrix}.$$
The last piece we need is $\nabla_{w_k, y_l} \nabla_{w_i, y_j} \mathcal{N}\,\big|_{\alpha^*, \bar{\beta}^*}$:
$$\frac{\partial^2 \mathcal{N}}{\partial w_j\, \partial w_i} = (0,\ 0),$$
$$\nabla_{y_j} \frac{\partial \mathcal{N}}{\partial w_i} = \left( \frac{\delta_{ij}}{\zeta} \int_X (\cdot)\, \nabla_{y_j} g(x, y_j)\, E_i(x)\, d\mu(x),\ 0 \right),$$
y j y i N = δ i j ζ X ( · ) y i 2 g ( x , y i ) w i E i ( x ) d μ ( x ) + X ( · ) y i g ( x , y i ) 2 w i E i ( x ) d μ ( x ) , δ i k X y i 2 g ( x , y i ) w i E i ( x ) d μ ( x ) + X y i g ( x , y i ) 2 w i E i ( x ) d μ ( x ) ,
where, in the last expression, $k$ denotes the index of the component in the second part of $\mathcal{N}$ (the part corresponding to $\bar{\beta}$).

Appendix C. Algorithms

Appendix C.1. Algorithm: Simple EDOT

Algorithm A1 below states the Simple EDOT procedure.
Algorithm A1 Simple EDOT using minibatch SGD
1: input: $\mu$, $k$, $m$, $\zeta$, batch size $N$, $\epsilon$ for the stopping criterion, $\alpha = 0.2$ for momentum, $\beta = 0.2$ for the learning rate.
2: output: $x_i \in X$, $w_i > 0$ such that $\mu_m = \sum_{i=1}^m w_i \delta_{x_i}$.
3: initialize: randomly choose $\sum_{i=1}^m w_i^{(0)} \delta_{x_i^{(0)}}$; set $t = 0$; set learning rate $\eta_t = 0.5\,(1 + \beta t)^{-s}$ (for $t > 0$ and $s \in (0.5, 1]$).
4: repeat
5:     Set $t \leftarrow t + 1$;
6:     Sample $N$ points $\{y_j\}_{j=1}^N \subset X$ from $\mu$ independently;
7:     Construct $\mu_N^{(t)} = \frac{1}{N} \sum_{j=1}^N \delta_{y_j}$;
8:     Calculate gradients $\nabla_x \widehat{W}_{k,\zeta}^k(\mu_N^{(t)}, \mu_m^{(t)})$ and $\nabla_w \widehat{W}_{k,\zeta}^k(\mu_N^{(t)}, \mu_m^{(t)})$;
9:     Update $D_x^t \leftarrow \alpha D_x^{t-1} + \nabla_x \widehat{W}_{k,\zeta}^k(\mu_N^{(t)}, \mu_m^{(t)})$, $D_w^t \leftarrow \alpha D_w^{t-1} + \nabla_w \widehat{W}_{k,\zeta}^k(\mu_N^{(t)}, \mu_m^{(t)})$;
10:     Update $x^{(t)} \leftarrow x^{(t-1)} - \eta_t D_x^t$, $w^{(t)} \leftarrow w^{(t-1)} - \eta_t D_w^t$;
11: until $|\nabla_x \widehat{W}_{k,\zeta}^k| + |\nabla_w \widehat{W}_{k,\zeta}^k| < \epsilon$;
12: Set output $x_i \leftarrow x_i^{(t)}$, $w_i \leftarrow w_i^{(t)}$.

Appendix C.2. Proof of Proposition 2

Remark A1.
Convergence to a stationary point in expectation means that the liminf of the expected norm of the gradient over the sequence considered approaches 0.
Proof. 
Discrete distributions of size $m$ can be parameterized by $X^m \times \Delta$ via $\sum_{i=1}^m p_i \delta_{y_i}$ with $(p_1, p_2, \dots, p_m) \in \Delta$ and $y_i \in X$.
To make the SGD work, we assume that $X$ is a path-connected subset of $\mathbb{R}^d$ of dimension $d$.
For $\epsilon > 0$, let $\Delta_\epsilon = \Delta_\epsilon^{m-1} = \{(p_1, p_2, \dots, p_m) : \sum_{i=1}^m p_i = 1,\ p_i \ge \epsilon\}$. First, we prove the claim under the assumption $(\ast)$ that the set of limit points of $(\mu_m^{(i)})$ is contained in $X^m \times \Delta_\epsilon$ for some $\epsilon > 0$.
According to Theorem 4.10 and Corollary 4.12 of [25], to show that Algorithm A1 converges to a stationary point in expectation, i.e., $\liminf_{i \to \infty} \mathbb{E}\big[\|\nabla W_{k,\zeta}^k(\mu, \mu_m^{(i)})\|^2\big] = 0$, one needs to check: (1) $W_{k,\zeta}^k(\mu, \mu_m^{(i)})$ is twice differentiable; (2) $\nabla W_{k,\zeta}^k(\mu, \mu_m^{(i)})$ is Lipschitz continuous; and (3) $\mathbb{E}\big[\|\nabla \widehat{W}_{k,\zeta}^k(\mu_N^{(i)}, \mu_m^{(i)}) - \nabla W_{k,\zeta}^k(\mu, \mu_m^{(i)})\|^2\big]$ is bounded.
(1) The second differentiability of $W_{k,\zeta}^k(\mu, \mu_m^{(i)})$ is shown in Appendix B.2, where the second derivative is calculated.
(2) As a consequence of (1), $\nabla W_{k,\zeta}^k(\mu, \mu_m^{(i)})$ is continuous. Moreover, by checking each factor of $\nabla^2 W_{k,\zeta}^k(\mu, \mu_m^{(i)})$ given in Appendix B.2, we can see that $\nabla^2 W_{k,\zeta}^k(\mu, \mu_m^{(i)})$ is bounded. (Actually, we need $\det(\nabla_{\alpha, \bar{\beta}} \mathcal{N})$ to be controlled so that its inverse is bounded, which is true in $\Delta_\epsilon$.) Therefore, $\nabla W_{k,\zeta}^k(\mu, \mu_m^{(i)})$ is Lipschitz continuous.
(3) The noise has bounded variance: equivalently, we just need to check that $\operatorname{Var}\big(\|\nabla W_{k,\zeta}^k(\mu_N, \mu_m)\|_2\big)$ is finite, where $\mu_N$ is the (stochastic) empirical distribution of $N$ samples and $\mu_m$ is a fixed discretization in $X^m \times \Delta_\epsilon$ (not necessarily the "optimal" one). $\nabla W_{k,\zeta}^k(\mu_N^{(i)}, \mu_m^{(i)})$ is continuous with respect to both $\mu_N^{(i)}$ and $\mu_m^{(i)}$; hence, it is continuous over the compact space $X^N \times X^m \times \Delta_\epsilon$ and is therefore bounded by a constant $C$.
Thus, the proposition holds under assumption $(\ast)$.
Now suppose that assumption $(\ast)$ does not hold. Then, for any sequence $\epsilon_1, \epsilon_2, \dots \to 0$, there always exist infinitely many limit points of $(\mu_m^{(i)})$ that lie outside $\Delta_{\epsilon_k}$ for every $k > 0$. Therefore, we can construct a subsequence of $(\mu_m^{(i)})$ converging to a point $p \in X^m \times \partial\Delta$. Thus, $p$ is also a limit point, which contradicts the assumption that the set of limit points of $(\mu_m^{(i)})$ does not intersect $X^m \times \partial\Delta$. The proof is then complete.    □

Appendix C.3. Proof of Proposition 3

Proof. 
First, for each iteration of the minibatch SGD, let $N$ be the sample (minibatch) size of $\mu_N$ used to approximate $\mu$, let $m$ be the size of the target discretization $\mu_m$ (the output), let $d$ be the dimension of $X$, and let $\epsilon$ be the error bound in the Sinkhorn calculation of the entropy-regularized optimal transference plan between $\mu_N$ and $\mu_m$. The Sinkhorn algorithm for the positive matrix $e^{-g(x,y)/\zeta}$ (of size $N \times m$) converges linearly, so it takes $O(\log(1/\epsilon))$ steps to fall into a region of radius $\epsilon$, contributing $O(N m \log(1/\epsilon))$ to the time complexity. The inverse $\mathcal{M}$ of $\nabla_{(\alpha, \bar{\beta})} \mathcal{N} = -\frac{1}{\zeta} \begin{pmatrix} A & B \\ B^T & D \end{pmatrix}$ (Equation (6)) is computed block-wise:
$$\mathcal{M} = -\zeta \begin{pmatrix} A^{-1} + A^{-1} B E^{-1} B^T A^{-1} & -A^{-1} B E^{-1} \\ -E^{-1} B^T A^{-1} & E^{-1} \end{pmatrix},$$
where $E = D - B^T A^{-1} B$. Block $E$ is constructed in $O(N m^2)$ and inverted in $O(m^3)$; block $A^{-1} B E^{-1}$ takes $O(N m^2)$, as $A$ is diagonal; and the block $A^{-1} + A^{-1} B E^{-1} B^T A^{-1}$ takes $O(N^2 m)$ to construct. When $m \ll N$, the time complexity of constructing $\mathcal{M}$ is $O(N^2 m)$. From $\mathcal{M}$ to the gradients of the dual variables, the tensor contractions have complexity $O((N + m)^2 m d)$. Finally, to assemble the gradient, the complexity is dominated by the second term of $\nabla_{y_i} W_{k,\zeta}^k$ (see Equation (5)), which is a contraction of an $N \times m$ matrix (i.e., $g\, d\pi(x)$) with tensors of sizes $N \times m \times d$ and $m^2 \times d$ (the two gradients of the dual variables $\alpha$ and $\beta$) along $N$ and $m$, respectively. Thus, the final step contributes $O((N + m) m d)$.
The time complexity of increment steps in the SGD turns out to be O ( m d ) . Therefore, for L steps of the minibatch SGD, the time complexity is O ( ( N + m ) 2 m d L + N m L log ( 1 / ϵ ) ) .
For the space complexity of the Simple EDOT, the Sinkhorn algorithm (which can be performed in place in $O(Nm)$ space) is the only iterative computation within a single SGD step, and between two SGD steps only the resulting distribution is passed along. Therefore, the space complexity is $O((N + m)^2)$, coming from $\mathcal{M}$; all other objects are of size at most $O(m(N + m))$.    □

Appendix C.4. Adaptive Refinement via DFS: Midpoints

The pseudocode for the division algorithm of the adaptive EDOT, using kd-tree refinement that cuts at midpoints, is shown in Algorithm A2. Here, $\mathrm{Round}_{\uparrow 0.5}$ denotes rounding with ties (0.5) rounded up, and $\mathrm{Round}_{\downarrow 0.5}$ denotes rounding with ties rounded down; using one of each ensures that the discretization points are partitioned correctly (the two rounded sizes sum to the parent's budget).
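A small sketch of the two rounding rules (the helper names are ours; in Algorithms A2 and A3 the children's shares come from sample counts or occupied volumes rather than a fixed ratio $r$):

```python
import math

def round_half_up(x: float) -> int:
    """Round to the nearest integer with ties (.5) rounded up."""
    return math.floor(x + 0.5)

def round_half_down(x: float) -> int:
    """Round to the nearest integer with ties (.5) rounded down."""
    return math.ceil(x - 0.5)

# Splitting a budget of m points between two cells: using one rule for each
# child keeps the total exactly m even when the fractional part is exactly 0.5.
m, r = 7, 0.5                       # r = share of the budget for child 1
m1 = round_half_up(m * r)           # 4
m2 = round_half_down(m * (1 - r))   # 3
print(m1, m2, m1 + m2 == m)         # 4 3 True
```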
Algorithm A2 Adaptive Refinement via Depth First Search
1: input: $m$, $\mu$, sample size $N_0$, $m^*$ the maximum number of points in a cell, $\mathbf{a} = (a_0, a_1, \dots, a_{d-1})$ and $\mathbf{b} = (b_0, b_1, \dots, b_{d-1})$ as lower/upper bounds of the region;
2: output: $out$: stack of subproblems $(S_i, m_i, p_i)$ with $\sum_i p_i = 1$, $\sum_i m_i = m$, $\sum_i |S_i| = N_0$;
3: initialization: $p_0 \leftarrow 1$; $T$, $out$ empty stacks;
4: Sample $N_0$ points $S_0 \sim \mu$;
5: T.push(($S_0$, $m$, $p_0$, $\mathbf{a}$, $\mathbf{b}$));
6: while $T$ is not empty do
7:     ($S$, $m$, $p$, $\mathbf{a}$, $\mathbf{b}$) $\leftarrow$ T.pop();
8:     $l \leftarrow \operatorname{argmax}_i \{b_i - a_i\}$, $mid \leftarrow (a_l + b_l)/2$;
9:     $\mathbf{a}^{(1)} \leftarrow \mathbf{a}$, $\mathbf{a}^{(2)} \leftarrow (a_0, \dots, mid, \dots, a_{d-1})$;
10:     $\mathbf{b}^{(1)} \leftarrow (b_0, \dots, mid, \dots, b_{d-1})$, $\mathbf{b}^{(2)} \leftarrow \mathbf{b}$;
11:     $S_1 \leftarrow \{x \in S : x_l \le mid\}$, $S_2 \leftarrow \{x \in S : x_l > mid\}$, $N_1 \leftarrow |S_1|$, $N_2 \leftarrow |S_2|$;
12:     $m_1 \leftarrow \mathrm{Round}_{\uparrow 0.5}\Big( m\, \frac{N_1^{d/(k+d)}}{N_1^{d/(k+d)} + N_2^{d/(k+d)}} \Big)$;
13:     $m_2 \leftarrow \mathrm{Round}_{\downarrow 0.5}\Big( m\, \frac{N_2^{d/(k+d)}}{N_1^{d/(k+d)} + N_2^{d/(k+d)}} \Big)$;
14:     if $m_1 = 0$ then
15:         $p_1 \leftarrow 0$, $p_2 \leftarrow (N_1 + N_2)/N_0$;
16:     else if $m_2 = 0$ then
17:         $p_1 \leftarrow (N_1 + N_2)/N_0$, $p_2 \leftarrow 0$;
18:     else
19:         $p_1 \leftarrow N_1/N_0$, $p_2 \leftarrow N_2/N_0$;
20:     end if
21:     for $i \leftarrow 1, 2$ do
22:         if $m_i > m^*$ then
23:             T.push(($S_i$, $m_i$, $p_i$, $\mathbf{a}^{(i)}$, $\mathbf{b}^{(i)}$));
24:         else if $m_i > 0$ then
25:             $out$.push(($S_i$, $m_i$, $p_i$));
26:         end if
27:     end for
28: end while
Algorithm A3 Adaptive EDOT Variation 1
1: input: m, μ, sample size N_0, maximum number of points per cell m*, a = (a_0, a_1, …, a_{d−1}) and b = (b_0, b_1, …, b_{d−1}) as lower/upper bounds of the region, R the resolution for the occupied volume;
2: output: out: stack of subproblems (S_i, m_i, p_i) with ∑ p_i = 1, ∑ m_i = m;
3: initialization: p_0 ← 1;
4: procedure BuildTree(node)
5:     a ← node.lower, b ← node.upper, S ← node.sample;
6:     l ← argmax_i {b_i − a_i}, mid ← (a_l + b_l)/2;
7:     a^(0) ← a, a^(1) ← (a_0, …, a_{l−1}, mid, a_{l+1}, …, a_{d−1});
8:     b^(0) ← (b_0, …, b_{l−1}, mid, b_{l+1}, …, b_{d−1}), b^(1) ← b;
9:     S_0 ← {x ∈ S : x_l ≤ mid}, S_1 ← {x ∈ S : x_l > mid};
10:    for i ← 0, 1 do
11:        if ∏_{j=0}^{d−1} (b_j − a_j) > R and |S_i| > 0 then
12:            node.child[i] ← {occVol: 0, lower: a^(i), upper: b^(i), sample: S_i, child: Array(2), discSize: 0};
13:            BuildTree(node.child[i]);
14:        else if |S_i| = 0 then
15:            node.child[i] ← {occVol: 0, lower: a^(i), upper: b^(i), sample: S_i, child: Array(2), discSize: 0};
16:        else
17:            node.child[i] ← {occVol: ∏_{j=0}^{d−1} (b_j^(i) − a_j^(i)), lower: a^(i), upper: b^(i), sample: S_i, child: Array(2), discSize: 0};
18:        end if
19:    end for
20:    node.occVol ← node.child[0].occVol + node.child[1].occVol;
21: end procedure
22: Sample N_0 points S_0 ∼ μ;
23: Root ← {occVol: 0, lower: a, upper: b, sample: S_0, child: Array(2), discSize: 0};
24: BuildTree(Root);
25: T, out ← empty stacks;
26: T.push(Root);
27: while T is not empty do
28:    node ← T.pop();
29:    P_0 ← |node.child[0].sample|, P_1 ← |node.child[1].sample|;
30:    V_0 ← node.child[0].occVol, V_1 ← node.child[1].occVol;
31:    M_0 ← P_0 · V_0^{k/d}, M_1 ← P_1 · V_1^{k/d};
32:    m_0 ← Round_{0.5↑}( m · M_0^{d/(k+d)} / (M_0^{d/(k+d)} + M_1^{d/(k+d)}) );
33:    m_1 ← Round_{0.5↓}( m · M_1^{d/(k+d)} / (M_0^{d/(k+d)} + M_1^{d/(k+d)}) );
34:    if m_0 = 0 then
35:        p_0 ← 0, p_1 ← (M_0 + M_1)/N_0;
36:    else if m_1 = 0 then
37:        p_0 ← (M_0 + M_1)/N_0, p_1 ← 0;
38:    else
39:        p_0 ← M_0/N_0, p_1 ← M_1/N_0;
40:    end if
41:    for i ← 0, 1 do
42:        if m_i > m* then
43:            T.push(node.child[i]);
44:        else if m_i > 0 then
45:            out.push((node.child[i].sample, m_i, p_i));
46:        end if
47:    end for
48: end while

Appendix C.5. Adaptive Refinement: Variation 1

The original division algorithm of the adaptive EDOT (Algorithm A2) cuts a cell into two so as to balance the average contribution to W_k per discretization point between the two resulting cells. However, when the mass of μ is not distributed evenly within a cell, in particular when it concentrates in a few small regions, estimating W_k from the diameter of the cell overestimates the true value considerably and therefore assigns far more discretization points to that cell than necessary (Figure A1). We thus developed the Variation 1 division algorithm, which improves performance in this situation by estimating the Wasserstein distance using the occupied volume of a set of sample points (usually the same sample points used in Algorithm A2): given a resolution R > 0 (an upper bound on a cell's volume, which can be taken as Vol(X)/N_0, with N_0 the number of sample points), we keep cutting the region X until each cell either has volume smaller than R or contains no sample points; the total volume of the nonempty cells is then called the occupied volume V_Occ(X). The same definition applies to each cell X_i.
Figure A1. Divide-and-conquer strategies: original and Variation 1. The original tends to assign more atoms to a vast region with small weights, whereas Variation 1 does better. Example: a distribution on [0, 10]; the pdf is plotted as the blue curve, and the discretization size of each cell is shown in black.
Once the occupied volume of each cell is available, we can assign a number of discretization points to each cell. The only change in Variation 1 at this stage is the Wasserstein estimation step (lines 12 and 13 in Algorithm A2), where the estimated Wasserstein contribution W_k of cell i is changed from μ(X_i) · m_i^{−1/d} · (diam X_i)^k to μ(X_i) · m_i^{−1/d} · (V_Occ(X_i))^{k/d}.
The assignment of discretization sizes depends on the estimate of the Wasserstein distance; in Variation 1, this estimate requires the occupied volume, which is computed from the leaves up to the root. Hence, the binary tree for the occupied volume must be built before running Algorithm A2 with the modified estimate of W_k. Fortunately, the cutting points in the size-assignment step coincide with those of the occupied-volume calculation, and both steps need the sample points belonging to each cell, so we can store the partition of the sample points while building the occupied-volume tree and reuse it when assigning discretization sizes. The size-assignment step therefore works as a traversal of the binary tree built for the occupied-volume calculation (on a subtree with the same root, determined by the stopping conditions along each path).
Therefore, the time complexity of the occupied-volume calculation is again O(N_0 log N_0), as the algorithm proceeds in the same way as Quicksort, and the time complexity of the remaining step (assigning discretization sizes) is O(m), being a traversal of a tree with at most m leaves (m discretization points in total).
The space complexity is still O(N_0 + m), since after the occupied-volume calculation, the remaining work only adds constant-size annotations to the subtree with at most m leaves mentioned above.
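For illustration, the following is a minimal sketch (not the paper's implementation) of the occupied-volume computation on a d-dimensional box: cells are bisected along their longest axis until they are empty or smaller than the resolution R, and the volumes of the nonempty leaf cells are summed.

```python
import numpy as np

def occupied_volume(points, lower, upper, R):
    """Estimate V_Occ: total volume of cells containing at least one sample,
    refined by bisecting the longest axis until a cell's volume is below R."""
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    vol = np.prod(upper - lower)
    if len(points) == 0:
        return 0.0                      # empty cell contributes nothing
    if vol <= R:
        return vol                      # small enough: count the whole cell
    l = int(np.argmax(upper - lower))   # split along the longest axis
    mid = 0.5 * (lower[l] + upper[l])
    left_mask = points[:, l] <= mid
    upper_left = upper.copy();  upper_left[l] = mid
    lower_right = lower.copy(); lower_right[l] = mid
    return (occupied_volume(points[left_mask], lower, upper_left, R)
            + occupied_volume(points[~left_mask], lower_right, upper, R))

# Example: points concentrated in the left tenth of the unit square.
rng = np.random.default_rng(0)
pts = rng.random((5000, 2)) * np.array([0.1, 1.0])
print(occupied_volume(pts, [0, 0], [1, 1], R=1.0 / 5000))  # roughly 0.1
```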

Appendix C.6. Adaptive Refinement: Variation 2

The “cut at the midpoint” method is easy to implement and guarantees that the cell volume decreases as one goes deeper in the tree (so the depth at which cells fall below the resolution is bounded). However, it is too rigid to follow the natural gaps of the original distribution, which can critically affect the optimal locations of the discretization points.
Variation 2 redefines the cutting points: instead of the midpoint along the chosen axis, we cut at the sparsest point. Sparsity is computed with a moving-window method along one axis/component of the d coordinates; applying the moving-window method may require re-sorting the data points at every node (since the sorting axis/component may differ from node to node). Since we still want to control the depth of the tree, a correction must be added to prevent the cutting point from lying too close to the boundaries (typically the function C(x − a)^k(x − b)^k, with a and b the boundaries and C > 0 a constant). One consequence is that each cell's volume (not the occupied volume) must now be computed from its rectangular boundaries rather than being determined solely by its depth, as before.
The influence on the time complexity is thus the following: (1) the tree-building step becomes O(N_0²(log N_0)²) in the average case and O(N_0⁴) in the worst case (if Quicksort is applied for each node's moving-window computation), and (2) an additional O(N_0) is introduced for computing the volume at each node of the binary tree. Furthermore, it adds at most O(N_0) to the space complexity, since each cell's volume has to be stored instead of being computed directly from its depth.
Variation 2 can be applied together with Variation 1, since they target different parts of the algorithm. An example is shown in Figure A6, and a small sketch of the sparsest-cut selection is given below.
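The following is a minimal sketch of one plausible implementation of the sparsest-cut choice (how the boundary correction enters is an assumption of this sketch, not the paper's code): along the chosen axis, the moving-window sample count is divided by the boundary weight C(x − a)^k(b − x)^k, so that the score blows up near the boundaries (where the weight vanishes) and the minimizing cut stays in the interior.

```python
import numpy as np

def sparsest_cut(xs, a, b, k=2, C=1.0, window=0.05, n_grid=200):
    """Pick a cut point in (a, b) along one axis at the sparsest location.

    xs: 1D array of sample coordinates along the chosen axis.
    Score = (moving-window count + 1) / (C*(x - a)^k*(b - x)^k); the cut minimizes it.
    """
    xs = np.sort(xs)
    grid = np.linspace(a, b, n_grid + 2)[1:-1]          # interior candidates only
    counts = (np.searchsorted(xs, grid + window)
              - np.searchsorted(xs, grid - window))     # moving-window counts
    weight = C * (grid - a) ** k * (b - grid) ** k
    score = (counts + 1.0) / weight                     # +1 breaks ties toward the interior
    return grid[np.argmin(score)]

# Example: a bimodal sample on [0, 1]; the cut lands in the gap near 0.5.
rng = np.random.default_rng(0)
xs = np.concatenate([rng.normal(0.25, 0.05, 500), rng.normal(0.75, 0.05, 500)])
print(sparsest_cut(xs, 0.0, 1.0))
```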

Appendix D. Empirical Parts

Appendix D.1. Estimating W_{k,ζ}^k: Richardson Extrapolation and Others

In the analysis, we may need W_{k,ζ}^k(μ, μ_m) to compare how discretization methods behave. However, when μ is not discrete, we generally cannot obtain the Wasserstein distance analytically.
In certain cases, including all examples in this paper, the Wasserstein distance can be estimated from finite samples (of large size). According to [26], for μ ∈ P(X) in our setup (a probability measure on a compact Polish space with its Borel algebra) and with g = d_X^k ∈ C(X²) a continuous cost function, the Online Sinkhorn method can be used to estimate W_{k,ζ}^k. Online Sinkhorn needs a large number of samples of μ (in batches) to be accurate.
In our paper, since the spaces X are compact subsets of R^n and μ has a continuous probability density function, we may instead use Richardson extrapolation to estimate the Wasserstein distance between μ and μ_m, which requires fewer samples and less computation (two Sinkhorn runs with different grid sizes).
Our examples live on intervals or rectangles, on which two grids Λ_1 of N points and Λ_2 of rN points (N and rN both integers) can be constructed naturally. With μ determined by a smooth probability density function ρ, let μ^(N) be the normalization of ∑_{i=1}^{N} ρ(Λ_i) δ_{Λ_i} (this sum may not be a probability distribution, so we use its normalization). From the continuity of ρ and the boundedness of the dual variables α and β, we can conclude that
lim_{N→∞} W_{k,ζ}^k(μ^(N), μ_m) = W_{k,ζ}^k(μ, μ_m).
Viewing W_{k,ζ}^k(μ^(N), μ_m) as a function of 1/N, to apply Richardson extrapolation we need the exponent of the lowest-order term in 1/N in the expansion W_{k,ζ}^k(μ^(N), μ_m) = W* + C(1/N)^h + O((1/N)^{h+1}), where W* = W_{k,ζ}^k(μ, μ_m).
Consider that
|W_{k,ζ}^k(μ, μ_m) − W_{k,ζ}^k(μ^(N), μ_m)| ≲ W_{k,ζ}^k(μ, μ^(N)).
Since W_{k,ζ}(μ^(N), μ) ∼ N^{−1/d}, we may conclude that h = k/d, where d is the dimension of X. Figure A2 shows an empirical example in a d = 1 and k = 2 situation.
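As a small numerical sketch (with hypothetical values, not data from the paper): given the two grid estimates W(N) and W(rN) and the exponent h = k/d, the extrapolated value follows the standard Richardson formula.

```python
def richardson(W_N, W_rN, r, h):
    """Richardson extrapolation: eliminate the leading O(N^-h) error term.

    If W(N) = W* + C*N^(-h) + higher order, then
    W* ~= (r^h * W(rN) - W(N)) / (r^h - 1).
    """
    return (r ** h * W_rN - W_N) / (r ** h - 1)

# Hypothetical numbers for illustration: W* = 0.010, C = 0.5, h = 2 (k = 2, d = 1).
W = lambda N: 0.010 + 0.5 / N ** 2
print(W(100), W(200))                        # 0.01005, 0.0100125
print(richardson(W(100), W(200), r=2, h=2))  # recovers 0.010 exactly in this model
```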
Figure A2. Richardson extrapolation: the power of the expansion in N^{−1}. We take the EDOT μ_5 of Example 2 (1-dim truncated normal mixture) as the target μ_m and use evenly positioned μ^(N) for different values of N to estimate. The y-axis is ln|W_{2,0.01}^2(μ^(rN), μ_5) − W_{2,0.01}^2(μ^(N), μ_5)|, computed for r = 2 and r = 3. With ln N as the x-axis, linearity can be observed. The magnitudes of the slopes are both about 1.9998, which estimate the exponent of the leading non-constant term of W_{2,0.01}^2(μ^(N), μ_5) in N; the theoretical value is h = k/d = 2. The differences come from higher-order terms in N.

Appendix D.2. Example: The Sphere

The CW complex structure of the unit sphere S² is constructed as follows: the point (1, 0, 0) on the equator is the only 0-cell, and the equator is the 1-cell (a line segment [0, 1] attached to the 0-cell by identifying both endpoints with (1, 0, 0)). The 2-cells are two unit discs, identified with the southern/northern hemispheres of S² by the stereographic projections
π_{±N}: (X, Y) ↦ (1/(1 + X² + Y²)) (2X, 2Y, ±(X² + Y² − 1))
with respect to the north/south poles.

Spherical Geometry

The spherical geometry is the Riemannian manifold structure induced by the embedding as the unit sphere in R³.
The geodesic between two points is the shorter arc of the great circle they determine. In their R³ coordinates, d_{S²}(x, y) = arccos(⟨x, y⟩). Composing with the stereographic projections, the distance can be computed (and differentiated) in terms of the CW complex coordinates.
The gradient with respect to y (or its CW coordinate) can be computed from the formulas above. In practice, the only difficulty arises when x = ±y, where arccos is singular at ±1. By the rotational symmetry of the sphere about the axis through x, the derivatives of the distance along all directions agree, so we may choose the radial direction in the CW coordinate (the unit disc). Moreover, the derivatives are elementary to compute.
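The following is a minimal numpy sketch (not the paper's code) of the spherical distance in CW (disc) coordinates, composing the inverse stereographic projection above with the arccos of the inner product; the gradient is illustrated by central differences.

```python
import numpy as np

def to_sphere(p, north=True):
    """Map a point (X, Y) of the unit disc to S^2 via pi_{+/-N} above."""
    X, Y = p
    s = 1.0 + X ** 2 + Y ** 2
    z = X ** 2 + Y ** 2 - 1.0
    return np.array([2 * X, 2 * Y, z if north else -z]) / s

def sphere_distance(p, q, north_p=True, north_q=True):
    """Geodesic distance d_{S^2} = arccos(<x, y>) between two CW-coordinate points."""
    x, y = to_sphere(p, north_p), to_sphere(q, north_q)
    return np.arccos(np.clip(np.dot(x, y), -1.0, 1.0))  # clip guards the arccos singularity

# Two points on the same hemisphere; gradient in q by central differences.
p, q, eps = np.array([0.1, 0.2]), np.array([0.4, -0.3]), 1e-6
grad = np.array([(sphere_distance(p, q + eps * e) - sphere_distance(p, q - eps * e)) / (2 * eps)
                 for e in np.eye(2)])
print(sphere_distance(p, q), grad)
```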

Appendix D.3. A Note on the Implementation of SGD with Momentum

There is a slight difference between our implementation of the SGD and the algorithm described in the paper. In the implementation, we use two different learning rates for the positions (the y_i's) and the weights (the w_i's), since movement along the positions is usually observed to be much slower than movement along the weights. Empirically, we set the learning rate on the positions to exactly three times the learning rate on the weights at every SGD iteration. With this change, convergence is faster, but we have neither a theory nor empirical evidence that a fixed ratio of three is the best choice.
Implementing and testing Newton's method (hybridized with SGD) and other improved SGD variants could be worthwhile future work.
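As a sketch of the two-rate update (an illustration, not the paper's code; the random stand-in gradients and the projection of the weights back to the simplex by clipping and renormalizing are assumptions of this sketch):

```python
import numpy as np

def sgd_momentum_step(y, w, grad_y, grad_w, vel_y, vel_w,
                      lr_w=0.01, momentum=0.9):
    """One SGD-with-momentum step; positions get 3x the learning rate of weights."""
    lr_y = 3.0 * lr_w                       # empirical 3:1 ratio described above
    vel_y = momentum * vel_y - lr_y * grad_y
    vel_w = momentum * vel_w - lr_w * grad_w
    y = y + vel_y                           # move atom positions
    w = np.clip(w + vel_w, 1e-12, None)
    w = w / w.sum()                         # keep weights on the simplex (assumed projection)
    return y, w, vel_y, vel_w

rng = np.random.default_rng(0)
y, w = rng.random((5, 2)), np.full(5, 0.2)
vy, vw = np.zeros_like(y), np.zeros_like(w)
gy, gw = rng.normal(size=y.shape), rng.normal(size=w.shape)   # stand-in gradients
y, w, vy, vw = sgd_momentum_step(y, w, gy, gw, vy, vw)
print(w.sum())  # 1.0
```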
Figure A3. The sphere example with the 3D discretization (the same as in the paper) from two view directions. The colors of the dots represent the weights of the atoms in the distribution.

Appendix D.4. An Example of a Transference Plan with Adaptive EDOT

We now illustrate the performance of the adaptive EDOT on a 2D optimal transport task. Let X = Y = [0, 1]², c = d_X be the Euclidean distance, g = d_X², and let the marginals μ, ν be truncated normal mixtures, where μ has two components and ν has five. Figure A4 plots the McCann interpolation of the OT plan between μ and ν (shown as red dots) and its discrete approximations (weights are color-coded) with m = n = 30. With m = n = 10, the adaptive EDOT results were W_{k,ζ}^k(μ, μ_10) = 1.33 × 10⁻², W_{k,ζ}^k(ν, ν_10) = 1.30 × 10⁻², and W_{k,ζ}^k(γ, γ_{10,10}) = 2.71 × 10⁻². With m = n = 30, the adaptive EDOT results were W_{k,ζ}^k(μ, μ_30) = 9.62 × 10⁻³, W_{k,ζ}^k(ν, ν_30) = 9.18 × 10⁻³, and W_{k,ζ}^k(γ, γ_{30,30}) = 1.516 × 10⁻². The naive sampling results were W_{k,ζ}^k(μ, μ_30) = 1.75 × 10⁻², W_{k,ζ}^k(ν, ν_30) = 1.58 × 10⁻², and W_{k,ζ}^k(γ, γ_{30,30}) = 3.95 × 10⁻². The adaptive EDOT thus matched the quality of 900 naive samples with only 100 points on the four-dimensional transference plan.
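As an illustration of how such a discrete plan can be computed once the two marginals have been discretized (a sketch using the POT library, assuming it is installed; the discretizations here are plain i.i.d. samples with uniform weights rather than EDOT outputs):

```python
import numpy as np
import ot  # POT: Python Optimal Transport

rng = np.random.default_rng(0)

# Stand-in discretizations of the two marginals on [0, 1]^2 (not EDOT outputs).
m, n = 10, 10
xs, ws = rng.random((m, 2)), np.full(m, 1.0 / m)
ys, vs = rng.random((n, 2)), np.full(n, 1.0 / n)

# Cost g = d_X^2 and the entropy-regularized plan between the discretizations.
M = ot.dist(xs, ys, metric="sqeuclidean")
zeta = 0.01 * 2.0                      # e.g., 0.01 x diam^2 on the unit square
plan = ot.sinkhorn(ws, vs, M, reg=zeta)
print(plan.shape, plan.sum())          # (10, 10), total mass ~ 1
```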
Figure A4. McCann interpolation at a finer time resolution for visualizing the transference plan, frames (1)–(11). It is a refined version of the figure in the main text. We can see that the larger bubbles (representing large probability mass) moved a short distance, while smaller pieces moved farther.

Appendix D.5. Example: Swiss Roll

In this case, the underlying space X_swiss is the Swiss Roll, a 2D rectangular strip embedded in R³ by (θ, z) ↦ (θ, θ, z) in cylindrical coordinates (radius θ, angle θ, height z). μ_swiss is a truncated normal mixture on the (θ, z)-plane. Samples S_swiss ∼ μ_swiss over X_swiss are shown in Figure A5 (left), embedded in 3D, and in Figure A6a under the isometry into R².
Using the Euclidean metric in R³, Figure A5 (right) plots the EDOT solution μ_m obtained through adaptive cell refinement (Algorithm A2) with m = 50. The resulting cell structure is shown as colored boxes. The corresponding partition of S_swiss is shown in Figure A6a, with the samples contained in a cell marked by the same color. As seen in Figure A5 (right), the points of μ_m are mainly located on the strip, with only one point off it, in the sparsest cell (the yellow cell at the bottom of the figure).
Figure A5. Discretization of a distribution supported on a Swiss Roll. Left: a total of 15,000 samples from the truncated normal mixture distribution μ_swiss over X_swiss. Right: a 50-point 3D discretization using Variation 2 of Algorithm A2; the refinement cells are shown as colored boxes.
On the other hand, consider the metric on X_swiss induced by the isometry from the Swiss Roll, as a manifold, to a strip in R². A more intrinsic discretization of μ_swiss can be obtained by applying the EDOT through a refinement on the coordinate space—the (2D) strip. The partition of S_swiss is shown in Figure A6b, and the resulting discretization μ_50 is shown in Figure A6c. Notice that all 50 points are located in the (locally) high-density regions of the Swiss Roll. We observe from Figure A6a,b that the 3D partition pulled disconnected and intrinsically remote regions together, while the 2D partition maintained the intrinsic structure.
Figure A6. Swiss Roll under isometry. (a) Refinement cells under the 3D Euclidean metric (one color per refinement cell's samples). (b) Refinement cells under the 2D metric induced by the isometry. (c) EDOT of 50 points with respect to the 2D refinement. ζ = 0.01 × diam² for all.

Appendix D.6. Example: Figure Densities

We used OpenCV to process the images. The cat image is a gray-scale image of size 180 × 180. Variation 1 was used to cut the image into pieces, since it contains some sparse regions on which the original division algorithm does not work well.
For the colored images (Starry Night and Girl with a Pearl Earring) in Figure 1, we processed the three channels independently, plotted the colored dots, and finally combined them as the corresponding channels of a color image. In the reconstruction of Starry Night, all colored dots have the same size, with the color value modulated according to the weights. For Girl with a Pearl Earring, we used pure colors ((255, 0, 0) for red, etc.) and varied the size of the dots, with the area proportional to the weight.
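As a minimal sketch of the preprocessing (not the exact script used; the file name is a placeholder), a gray-scale image is turned into a discrete density on the pixel grid by normalizing intensities, which can then serve as the target μ for discretization; for a color image, split the channels (e.g., with cv2.split) and repeat per channel.

```python
import cv2
import numpy as np

def image_to_density(path):
    """Load a gray-scale image and normalize pixel intensities into a probability density."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE).astype(np.float64)  # (H, W)
    weights = img / img.sum()
    # Pixel-center coordinates in [0, 1]^2 (y flipped to Cartesian), one weight per pixel.
    H, W = img.shape
    ys, xs = np.mgrid[0:H, 0:W]
    points = np.stack([(xs.ravel() + 0.5) / W, 1.0 - (ys.ravel() + 0.5) / H], axis=1)
    return points, weights.ravel()

# points, weights = image_to_density("cat.png")  # placeholder file name
```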

Appendix D.7. Example: Simple Barycenter Problems

The EDOT in its simple form (no divide and conquer) can also solve barycenter problems. The idea is simple: the gradient of a sum of functions is the sum of the gradients of its terms. Thus, to find the discrete barycenter of size m for several distributions μ_i, we take as objective the sum of the Wasserstein distances (each raised to the power k, as elsewhere in the paper), whose gradient with respect to the discretization is the sum of the gradients between the discretization and each target distribution. This method only works for the simple EDOT, since there is no locality in barycenter problems: after a division, the mass of each target distribution in each cell of the partition may differ, so the optimal transport plan contains inter-cell transport, which the current algorithm cannot handle.
We can see in Figure A7 that the simple EDOT-Equal (no divide and conquer) achieves results similar to the non-regularized discretization of [18], whereas the EDOT produces a better approximation of the barycenter by taking advantage of freely varying weights.
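As a sketch of this summed-gradient idea (with a hypothetical helper name: edot_gradient stands for the routine returning the gradients of W_{k,ζ}^k(μ_i, μ_m) with respect to the atom positions and weights, which is not spelled out here):

```python
import numpy as np

def barycenter_gradient(targets, positions, weights, edot_gradient):
    """Gradient of sum_i W_{k,zeta}^k(mu_i, mu_m) w.r.t. the discretization.

    targets: list of target distributions mu_i (in whatever form edot_gradient expects);
    edot_gradient(mu, positions, weights) -> (grad_positions, grad_weights) is a placeholder
    for the single-distribution EDOT gradient.
    """
    grad_y = np.zeros_like(positions)
    grad_w = np.zeros_like(weights)
    for mu in targets:                       # gradient of a sum = sum of the gradients
        gy, gw = edot_gradient(mu, positions, weights)
        grad_y += gy
        grad_w += gw
    return grad_y, grad_w
```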
Figure A7. A 7-point barycenter of two Gaussian distributions: (a) EDOT, where the areas of the dots represent the weights, W_sum² = 0.7322; (b) EDOT-Equal, W_sum² = 0.7380; (c) [18], W_sum² = 0.7389. The W values are regularized with ζ = 0.04.

References

  1. Kantorovich, L.V. On the translocation of masses. J. Math. Sci. 2006, 133, 1381–1382. [Google Scholar] [CrossRef]
  2. Peyré, G.; Cuturi, M. Computational optimal transport. Found. Trends Mach. Learn. 2019, 11, 355–607. [Google Scholar] [CrossRef]
  3. Villani, C. Optimal Transport: Old and New; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2008; Volume 338. [Google Scholar]
  4. Janati, H.; Muzellec, B.; Peyré, G.; Cuturi, M. Entropic Optimal Transport between Unbalanced Gaussian Measures has a Closed Form. Adv. Neural Inf. Process. Syst. 2020, 33, 10468–10479. [Google Scholar]
  5. Aude, G.; Cuturi, M.; Peyré, G.; Bach, F. Stochastic optimization for large-scale optimal transport. arXiv 2016, arXiv:1605.08527. [Google Scholar]
  6. Allen-Zhu, Z.; Li, Y.; Oliveira, R.; Wigderson, A. Much faster algorithms for matrix scaling. In Proceedings of the 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), Berkeley, CA, USA, 15–17 October 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 890–901. [Google Scholar]
  7. Lin, T.; Ho, N.; Jordan, M.I. On the efficiency of the Sinkhorn and Greenkhorn algorithms and their acceleration for optimal transport. arXiv 2019, arXiv:1906.01437. [Google Scholar]
  8. Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In Proceedings of the Advances in Neural Information Processing Systems, Harrahs and Harveys, Lake Tahoe, NV, USA, 5–10 December 2013; pp. 2292–2300. [Google Scholar]
  9. Sinkhorn, R.; Knopp, P. Concerning nonnegative matrices and doubly stochastic matrices. Pac. J. Math. 1967, 21, 343–348. [Google Scholar] [CrossRef]
  10. Wang, J.; Wang, P.; Shafto, P. Sequential Cooperative Bayesian Inference. In Proceedings of the International Conference on Machine Learning, PMLR, Online/Vienna, Austria, 12–18 July 2020; pp. 10039–10049. [Google Scholar]
  11. Tran, M.N.; Nott, D.J.; Kohn, R. Variational Bayes with intractable likelihood. J. Comput. Graph. Stat. 2017, 26, 873–882. [Google Scholar] [CrossRef]
  12. Overstall, A.; McGree, J. Bayesian design of experiments for intractable likelihood models using coupled auxiliary models and multivariate emulation. Bayesian Anal. 2020, 15, 103–131. [Google Scholar] [CrossRef]
  13. Wang, P.; Wang, J.; Paranamana, P.; Shafto, P. A mathematical theory of cooperative communication. Adv. Neural Inf. Process. Syst. 2020, 33, 17582–17593. [Google Scholar]
  14. Luise, G.; Rudi, A.; Pontil, M.; Ciliberto, C. Differential properties of sinkhorn approximation for learning with wasserstein distance. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 3–8 December 2018; pp. 5859–5870. [Google Scholar]
  15. Accinelli, E. A Generalization of the Implicit Function Theorems. 2009. Available online: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1512763 (accessed on 3 February 2021).
  16. Weed, J.; Bach, F. Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. Bernoulli 2019, 25, 2620–2648. [Google Scholar] [CrossRef]
  17. Dudley, R.M. The speed of mean Glivenko-Cantelli convergence. Ann. Math. Stat. 1969, 40, 40–50. [Google Scholar] [CrossRef]
  18. Claici, S.; Chien, E.; Solomon, J. Stochastic Wasserstein Barycenters. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; Volume 80, pp. 999–1008. [Google Scholar]
  19. Mérigot, Q. A multiscale approach to optimal transport. In Proceedings of the Computer Graphics Forum; Blackwell Publishing Ltd.: Oxford, UK, 2011; Volume 30, pp. 1583–1592. [Google Scholar]
  20. Solomon, J.; de Goes, F.; Peyré, G.; Cuturi, M.; Butscher, A.; Nguyen, A.; Du, T.; Guibas, L. Convolutional Wasserstein Distances: Efficient Optimal Transportation on Geometric Domains. ACM Trans. Graph. 2015, 34, 1–11. [Google Scholar] [CrossRef]
  21. Staib, M.; Claici, S.; Solomon, J.M.; Jegelka, S. Parallel Streaming Wasserstein Barycenters. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Long Beach, CA, USA, 2017; Volume 30. [Google Scholar]
  22. Beugnot, G.; Genevay, A.; Greenewald, K.; Solomon, J. Improving Approximate Optimal Transport Distances using Quantization. arXiv 2021, arXiv:2102.12731. [Google Scholar]
  23. Jacobs, M.; Léger, F. A fast approach to optimal transport: The back-and-forth method. Numer. Math. 2020, 146, 513–544. [Google Scholar] [CrossRef]
  24. Genevay, A.; Chizat, L.; Bach, F.; Cuturi, M.; Peyré, G. Sample complexity of sinkhorn divergences. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, Naha, Okinawa, Japan, 16–18 April 2019; pp. 1574–1583. [Google Scholar]
  25. Bottou, L.; Curtis, F.E.; Nocedal, J. Optimization methods for large-scale machine learning. Siam Rev. 2018, 60, 223–311. [Google Scholar] [CrossRef]
  26. Mensch, A.; Peyré, G. Online sinkhorn: Optimal transport distances from sample streams. Adv. Neural Inf. Process. Syst. 2020, 33, 1657–1667. [Google Scholar]
Figure 1. Discretization of “Girl with a Pearl Earring” and “Starry Night” using EDOT with 2000 discretization points for each RGB channel. k = 2, ζ = 0.01 × diam².
Figure 2. (a–c) Plots of the EDOT discretizations of Examples (1)–(3). In (c), the x-axis and y-axis are the 2D coordinates, and the probability density of μ and the weights of μ_m are encoded by color. (d,e) Comparison between EDOT and i.i.d. sampling for Examples (1) and (2); EDOT is calculated with m = 3 to 7 (3 to 8). The four boundary curves of the shaded region are the 5%-, 25%-, 75%-, and 95%-percentile curves; the orange line represents the level of m = 5. (f) The entropy-regularized Wasserstein distance W_{k,ζ}^k(μ, μ_m) versus the SGD steps for Example (2), with μ_m optimized by 5-point EDOT. ζ = 0.01 in all cases.
Figure 3. (a) Approximation of a transference plan. Left: 5 × 5 EDOT approximation. Right: 25 × 25 naive approximation. In both figures, the magnitude of each point is color-coded, and the background grayscale density represents the true EOT plan. (b) An example of adaptive refinement on a unit square. Left: division of a 10,000-point sample S approximating a mixture of two truncated Gaussian distributions and the refinement for 30 discretization points; the number of discretization points assigned to each region is marked in black (e.g., the upper left region needs 6 points). Right: the discretization optimized locally and combined as a probability measure with k = 2.
Figure 4. (a) Left: 30,000 samples from μ_sphere and the 3D cells under the divide-and-conquer algorithm. Right: 40-point EDOTs in 3D. (b) The 40-point CW-EDOTs in 2D. Red dots: samples; other dots: discrete atoms with weights represented by colors. Left: upper hemisphere. Right: lower hemisphere; stereographic projections about the poles. ζ = 0.01 × diam².
Figure 5. EDOT of an example from [19]. Portrait of Riemann, discretization of size 625. Left: green dots show the positions and weights of the EDOT discretization (same as right); the cells in the background are the discretization of the same size from the original [19]. Right: a size-10,000 discretization from [19]; we directly applied EDOT to this picture, treating it as the continuous distribution. ζ = 0.01 × diam².
Figure 6. A comparison of EDOT (left), EDOT-Equal (mid), and [18] (right) on the Canary Islands, treated as a binary distribution with a constant density on the islands and 0 in the sea. The discretization for each method is shown by black bullets. Wasserstein distances: EDOT: W_{0.005}² = 0.02876; EDOT-Equal: W_{0.005}² = 0.05210; Claici: W_{0.005}² = 0.05288. Map size is 3.13 × 1.43.
Figure 7. Discretization of a kitty. The discretization by each method is shown as red bullets on top of the kitty image. (a) EDOT, 10 points, W_{0.001}² = 0.009176, radius represents weight; (b) EDOT-Equal, 10 points, W_{0.001}² = 0.008960; (c) [18], 10 points, W_{0.001}² = 0.009832; (d) [18], 200 points. Figure size 1 × 1, ζ = 0.01 × diam².
Figure 8. 2000-point discretizations. (a) EDOT (weights plotted in color); (b) EDOT-Equal; (c) the relation between log(W²) and log m (all with divide and conquer); the advantage of W_EDOT over W_Equal grows with the size of the discretization.
Figure 9. Discretization of 16 blue disks in a unit square with 24 points (black). (a) EDOT, W_ζ² = 0.001398; (b) EDOT-Equal, W_ζ² = 0.002008; (c) [18], W_ζ² = 0.002242. ζ = 10⁻⁴. Figure size is 1 × 1.