1. Introduction
Optimal transport is the problem of finding a coupling of probability distributions that minimizes cost [1], and it is a technique applied across a wide range of fields [2,3]. Although many methods exist for obtaining optimal transference plans for distributions on discrete spaces, computing such plans is generally not possible for continuous spaces [4]. Given the prevalence of continuous spaces in machine learning, this is a significant limitation for both theoretical and practical applications.
One strategy for approximating continuous OT plans is based on discrete approximation via sample points. Recent research has provided guarantees on the fidelity of discrete, sample-location-based approximations of continuous OT as the sample size $N\to \infty $ [5]. Specifically, by sampling large numbers of points ${S}_{i}$ from each marginal, one may compute a discrete optimal transference plan on ${S}_{1}\times {S}_{2}$, with the cost matrix derived from the pointwise evaluation of the cost function on ${S}_{1}\times {S}_{2}$.
Even in the discrete case, obtaining minimal-cost plans is computationally challenging. For example, Sinkhorn scaling, which computes an entropy-regularized approximation of OT plans, has a complexity that scales with $|{S}_{1}\times {S}_{2}|$ [6]. Although many comparable methods exist [7], all of them have a complexity that scales with the product of the sample sizes, and they require the construction of a cost matrix whose size also scales with $|{S}_{1}\times {S}_{2}|$.
We have developed methods for optimizing both the sampling locations and weights for small-$N$ approximations of OT plans (see Figure 1). In Section 2, we formulate the fixed-size approximation problem and reduce it to discretization problems on the marginals with theoretical guarantees. In Section 3, we derive the gradient of the entropy-regularized Wasserstein distance between a continuous distribution and its discretization. In Section 4, we present a stochastic gradient descent algorithm that optimizes the locations and weights of the points, together with empirical demonstrations. Section 5 introduces a parallelizable algorithm via decompositions of the marginal spaces, which reduces the computational complexity by exploiting intrinsic geometry. In Section 6, we analyze time and space complexity. In Section 7, we illustrate the advantage of including weights for sample points by comparing with an existing method that optimizes locations only.
2. Efficient Discretizations
Optimal transport (OT): Let $(X,{d}_{X})$, $(Y,{d}_{Y})$ be compact Polish spaces (complete separable metric spaces), let $\mu \in \mathcal{P}\left(X\right)$, $\nu \in \mathcal{P}\left(Y\right)$ be probability distributions on their Borel $\sigma$-algebras, and let $c:X\times Y\to \mathbb{R}$ be a cost function. Denote by $\Pi (\mu ,\nu )$ the set of all joint probability measures (couplings) on $X\times Y$ with marginals $\mu $ and $\nu $. For the cost function $c$, the optimal transference plan between $\mu $ and $\nu $ is defined as in [1]: $\gamma (\mu ,\nu ) := \operatorname{argmin}_{\pi \in \Pi (\mu ,\nu )}\langle c,\pi \rangle $, where $\langle c,\pi \rangle := {\int}_{X\times Y}c(x,y)\,\mathrm{d}\pi (x,y)$.
When $X=Y$ and the cost is $c(x,y)={d}_{X}^{k}(x,y)$, the quantity ${W}_{k}(\mu ,\nu )={\langle c,\gamma (\mu ,\nu )\rangle}^{1/k}$ defines the $k$-Wasserstein distance between $\mu $ and $\nu $ for $k\ge 1$. Here, ${d}_{X}^{k}(x,y)$ is the $k$-th power of the metric ${d}_{X}$ on $X$.
Entropy-regularized optimal transport (EOT) [5,8] was introduced to estimate OT couplings with reduced computational complexity: ${\gamma}_{\lambda}(\mu ,\nu ):=\operatorname{argmin}_{\pi \in \Pi (\mu ,\nu )}\langle c,\pi \rangle +\lambda \,\mathrm{KL}(\pi \,\|\, \mu \otimes \nu )$, where $\lambda >0$ is a regularization parameter and $\mathrm{KL}(\pi \,\|\, \mu \otimes \nu ) := \int \log\left(\frac{\mathrm{d}\pi}{\mathrm{d}(\mu \otimes \nu)}\right)\mathrm{d}\pi $ is the Kullback–Leibler divergence. The EOT objective is smooth and convex, and, for a given discrete $(\mu ,\nu ,c)$, its unique solution can be obtained using the Sinkhorn iteration (SK) [9].
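For discrete marginals, the Sinkhorn iteration alternately rescales the rows and columns of the Gibbs kernel $e^{-C/\lambda}$ until both marginal constraints hold. The following is a minimal illustrative sketch (our own code, with hypothetical names, not the implementation used in this paper):

```python
import numpy as np

def sinkhorn(w, u, C, lam, n_iter=500):
    """Entropy-regularized OT plan between discrete measures.

    w: (m,) weights of mu_m, u: (n,) weights of nu_n (each sums to 1),
    C: (m, n) cost matrix, lam: regularization parameter lambda > 0.
    Returns the (m, n) coupling pi with marginals approximately w and u.
    """
    K = np.exp(-C / lam)          # Gibbs kernel
    a = np.ones_like(w)
    for _ in range(n_iter):       # alternate marginal rescalings
        b = u / (K.T @ a)
        a = w / (K @ b)
    return a[:, None] * K * b[None, :]
```

The regularized transport cost $\langle c,\pi\rangle$ is then simply `(C * pi).sum()` for the returned plan.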
However, for large-scale discrete spaces, the computational cost of SK can still be infeasible [6]. Worse, to apply the Sinkhorn iteration at all, one must know the entire cost matrix over the large-scale spaces, which itself can be a non-trivial computational burden to obtain; in some cases, for example, where the cost is derived from a probability model [10], it may require intractable computations [11,12].
The Framework: We propose optimizing the locations and weights of a fixed-size discretization to estimate the continuous OT. The discretization on $X\times Y$ is completely determined by those on $X$ and $Y$ so as to respect the marginal structure of the OT. Let $m,n\in {\mathbb{Z}}^{*}$, and let ${\mu}_{m}\in \mathcal{P}\left(X\right)$, ${\nu}_{n}\in \mathcal{P}\left(Y\right)$ be discrete approximations of $\mu $ and $\nu $, respectively, with ${\mu}_{m}={\sum}_{i=1}^{m}{w}_{i}{\delta}_{{x}_{i}}$, ${\nu}_{n}={\sum}_{j=1}^{n}{u}_{j}{\delta}_{{y}_{j}}$, ${x}_{i}\in X$, ${y}_{j}\in Y$, and ${w}_{i},{u}_{j}\in {\mathbb{R}}^{+}$. Then, the EOT plan ${\gamma}_{\lambda}(\mu ,\nu )\in \Pi (\mu ,\nu )$ for the OT problem $(\mu ,\nu ,c)$ can be approximated by the EOT plan ${\gamma}_{\lambda}({\mu}_{m},{\nu}_{n})\in \Pi ({\mu}_{m},{\nu}_{n})$ for the OT problem $({\mu}_{m},{\nu}_{n},c)$. All three distributions thus have discrete counterparts; hence, with fixed sizes
$m,n\in {\mathbb{Z}}^{*}$, a naive objective to be optimized is
$${\Omega}_{k,\rho}({\mu}_{m},{\nu}_{n}) := {W}_{k}^{k}({\gamma}_{\lambda}(\mu ,\nu ),{\gamma}_{\lambda}({\mu}_{m},{\nu}_{n})) + \rho \left({W}_{k}^{k}(\mu ,{\mu}_{m}) + {W}_{k}^{k}(\nu ,{\nu}_{n})\right),$$
where ${W}_{k}^{k}(\varphi ,\psi )$ denotes the $k$-th power of the $k$-Wasserstein distance between measures $\varphi $ and $\psi $. The hyperparameter $\rho >0$ balances the estimation accuracy over the marginals against that of the transference plan, while the two marginals are weighted equally.
To properly compute ${W}_{k}^{k}({\gamma}_{\lambda}(\mu ,\nu ),{\gamma}_{\lambda}({\mu}_{m},{\nu}_{n}))$, a metric ${d}_{X\times Y}$ on $X\times Y$ is needed. We expect ${d}_{X\times Y}$ restricted to $X$-slices or $Y$-slices to be compatible with ${d}_{X}$ or ${d}_{Y}$, respectively; furthermore, we may assume that there exists a constant $A>0$ such that:
For instance, (2) holds when ${d}_{X\times Y}$ is the $p$-product metric for $1\le p\le \infty $.
The objective ${\Omega}_{k,\rho}({\mu}_{m},{\nu}_{n})$ is estimated by its entropy-regularized approximation ${\Omega}_{k,\zeta ,\rho}({\mu}_{m},{\nu}_{n})$ for efficient computation, where $\zeta $ is the regularization parameter, as follows:
Here, ${W}_{k}^{k}(\mu ,{\mu}_{m})=\langle {d}_{X}^{k},\gamma (\mu ,{\mu}_{m})\rangle $ is estimated by ${W}_{k,\zeta}^{k}(\mu ,{\mu}_{m})=\langle {d}_{X}^{k},{\gamma}_{\zeta}(\mu ,{\mu}_{m})\rangle $, where ${\gamma}_{\zeta}(\mu ,{\mu}_{m})$ is computed by optimizing ${\widehat{W}}_{k,\zeta}^{k}(\mu ,{\mu}_{m}) = \langle {d}_{X}^{k},{\gamma}_{\zeta}(\mu ,{\mu}_{m})\rangle +\zeta \,\mathrm{KL}({\gamma}_{\zeta}(\mu ,{\mu}_{m}) \,\|\, \mu \otimes {\mu}_{m})$. One major difficulty in optimizing
${\Omega}_{k,\zeta ,\rho}({\mu}_{m},{\nu}_{n})$ is to evaluate ${W}_{k,\zeta}^{k}({\gamma}_{\lambda}(\mu ,\nu ),{\gamma}_{\lambda}({\mu}_{m},{\nu}_{n}))$. In fact, obtaining ${\gamma}_{\lambda}(\mu ,\nu )$ is intractable, which is the original motivation for the discretization. To overcome this drawback, by utilizing the dual formulation of the EOT, the following is shown (see the proof in Appendix A):
Proposition 1. When X and Y are two compact spaces and the cost function c is ${\mathcal{C}}^{\infty}$, there exists a constant ${C}_{1}\in {\mathbb{R}}^{+}$ such that ${W}_{k,\zeta}^{k}({\gamma}_{\lambda}(\mu ,\nu ),{\gamma}_{\lambda}({\mu}_{m},{\nu}_{n})) \le {C}_{1}\left({W}_{k,\zeta}^{k}(\mu ,{\mu}_{m}) + {W}_{k,\zeta}^{k}(\nu ,{\nu}_{n})\right)$.
Notice that Proposition 1 indicates that ${W}_{k,\zeta}^{k}({\gamma}_{\lambda}(\mu ,\nu ),{\gamma}_{\lambda}({\mu}_{m},{\nu}_{n}))$ is bounded above by a multiple of ${W}_{k,\zeta}^{k}(\mu ,{\mu}_{m})+{W}_{k,\zeta}^{k}(\nu ,{\nu}_{n})$; i.e., when the continuous marginals $\mu $ and $\nu $ are properly approximated, so is the optimal transference plan between them. Therefore, to optimize ${\Omega}_{k,\zeta ,\rho}({\mu}_{m},{\nu}_{n})$, we focus on developing algorithms to obtain ${\mu}_{m}^{*},{\nu}_{n}^{*}$ that minimize ${W}_{k,\zeta}^{k}(\mu ,{\mu}_{m})$ and ${W}_{k,\zeta}^{k}(\nu ,{\nu}_{n})$.
Remark 1. The regularizing parameters (λ and ζ above) introduce smoothness, together with an error term, into the OT problem. For an accurate approximation, we need λ and ζ to be as small as possible. However, when the parameters become too small, the matrices to be normalized in the Sinkhorn algorithm cause overflow or underflow in standard numerical data types (32-bit or 64-bit floating-point numbers). The threshold below which this occurs is proportional to the k-th power of the diameter of the supported region. In this work, we control the value (mainly ζ), which ranges from $10^{-4}$ to $0.01$ when the diameter is 1 across the different examples.
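The underflow issue in Remark 1 can be seen directly: in float64, a Gibbs kernel entry $e^{-g/\zeta}$ becomes exactly zero once $g/\zeta$ exceeds roughly 745, so a $\zeta$ that is too small for the diameter of the support breaks the Sinkhorn normalizations. A quick illustrative check (our own, not from the paper):

```python
import numpy as np

# With a diameter-1 support and k = 2, the cost g = d^k is at most 1.
# The kernel entry e^{-g/zeta} underflows to exactly 0.0 in float64
# once g/zeta exceeds about 745 (log of the smallest subnormal).
print(np.exp(-1.0 / 0.01))   # zeta = 0.01: ~3.7e-44, still representable
print(np.exp(-1.0 / 1e-4))   # zeta = 1e-4: 0.0, large-cost entries vanish
```

This is why the regularizer must be scaled with $\mathrm{diam}{(X)}^{k}$, as discussed in Section 5.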
3. Gradient of the Objective Function
Let $\nu ={\sum}_{i=1}^{m}{w}_{i}{\delta}_{{y}_{i}}$ be a discrete probability measure playing the role of “${\mu}_{m}$” from the previous section. For a fixed (continuous) $\mu $, the objective now is to obtain a discrete target ${\nu}^{*}=\operatorname{argmin}\phantom{\rule{4.pt}{0ex}}{W}_{k,\zeta}^{k}(\mu ,\nu )$.
In order to apply stochastic gradient descent (SGD) to both the positions ${\left\{{y}_{i}\right\}}_{i=1}^{m}$ and their weights ${\left\{{w}_{i}\right\}}_{i=1}^{m}$ to reach ${\nu}^{*}$, we now derive the gradient of ${W}_{k,\zeta}^{k}(\mu ,\nu )$ with respect to $\nu $, following the discrete discussions of [13,14]. The SGD on $X$ is derived either through an exponential map or by treating $X$ as (part of) a Euclidean space.
Let $g(x,y):={d}_{X}^{k}(x,y)$, and denote the joint distribution minimizing ${\widehat{W}}_{k,\zeta}^{k}$ by $\pi $, with the differential form at $(x,{y}_{i})$ being $\mathrm{d}{\pi}_{i}\left(x\right)$, as used to define ${W}_{k,\zeta}^{k}$ in Section 2.
By introducing the Lagrange multipliers $\alpha \in {L}^{\infty}\left(X\right)$, $\beta \in {\mathbb{R}}^{m}$, we have ${\widehat{W}}_{k,\zeta}^{k}(\mu ,\nu )={\max}_{\alpha ,\beta}\,\mathcal{L}(\mu ,\nu ;\alpha ,\beta )$, where $\mathcal{L}(\mu ,\nu ;\alpha ,\beta )={\int}_{X}\alpha \left(x\right)\mathrm{d}\mu \left(x\right)+{\sum}_{i=1}^{m}{\beta}_{i}{w}_{i}-\zeta {\int}_{X}{\sum}_{i=1}^{m}{w}_{i}{E}_{i}\left(x\right)\mathrm{d}\mu \left(x\right)$ with ${E}_{i}\left(x\right)={e}^{(\alpha \left(x\right)+{\beta}_{i}-g(x,{y}_{i}))/\zeta}$ (see [5]). Let ${\alpha}^{*},{\beta}^{*}$ be the argmax; then, we have
with ${E}_{i}^{*}\left(x\right)={e}^{({\alpha}^{*}\left(x\right)+{\beta}_{i}^{*}-g(x,{y}_{i}))/\zeta}$. Since ${\alpha}^{\prime}\left(x\right):=\alpha \left(x\right)+t$ and ${\beta}_{i}^{\prime}:={\beta}_{i}-t$ produce the same ${E}_{i}\left(x\right)$ for any $t\in \mathbb{R}$, the representative with ${\beta}_{m}=0$ that is equivalent to $\beta $ (as well as to ${\beta}^{*}$) is denoted by $\overline{\beta}$ (similarly ${\overline{\beta}}^{*}$) below, in order to obtain uniqueness and make differentiation possible.
From a direct differentiation of
${W}_{k,\zeta}^{k}$, we have
With the transference plan $\mathrm{d}{\pi}_{i}\left(x\right)={w}_{i}{E}_{i}^{*}\left(x\right)\mathrm{d}\mu \left(x\right)$ and the derivatives of ${\alpha}^{*}$, ${\beta}^{*}$, and $g(x,{y}_{i})$ calculated, the gradient of ${W}_{k,\zeta}^{k}$ can be assembled.
Assume that $g$ is Lipschitz and differentiable almost everywhere (for $k\ge 1$ with ${d}_{X}$ the Euclidean distance in ${\mathbb{R}}^{d}$, differentiability fails only when $k=1$ and ${y}_{i}=x$) and that ${\nabla}_{y}g(x,y)$ can be calculated. The derivatives of ${\alpha}^{*}$ and ${\overline{\beta}}^{*}$ can then be obtained thanks to the Implicit Function Theorem for Banach spaces (see [15]).
The maximality of $\mathcal{L}$ at $({\alpha}^{*},{\overline{\beta}}^{*})$ implies $\mathcal{N}:={\nabla}_{\alpha ,\overline{\beta}}{\mathcal{L}|}_{({\alpha}^{*},{\overline{\beta}}^{*})}=0\in {({L}^{\infty}\left(X\right)\otimes {\mathbb{R}}^{m-1})}^{\vee}$; i.e., the Fréchet derivative vanishes. By differentiating (in the sense of Fréchet) again with respect to $(\alpha ,\overline{\beta})$ and to ${y}_{i},{w}_{i}$, respectively, we get
as a bilinear functional on ${L}^{\infty}\left(X\right)\times {\mathbb{R}}^{m-1}$ (note that, in Equation (6), the index $i$ of $\mathrm{d}{\pi}_{i}$ cannot be $m$). The bilinear functional ${\nabla}_{(\alpha ,\overline{\beta})}\mathcal{N}$ is invertible, and we denote its inverse by $\mathbf{M}$, a bilinear form on ${({L}^{\infty}\left(X\right)\otimes {\mathbb{R}}^{m-1})}^{\vee}$. The last ingredient for the Implicit Function Theorem is ${\nabla}_{{w}_{i},{y}_{i}}\mathcal{N}$:
Then, ${\nabla}_{{w}_{i},{y}_{i}}({\alpha}^{*},{\overline{\beta}}^{*})=\mathbf{M}\left({\nabla}_{{w}_{i},{y}_{i}}\mathcal{N}\right)$. Therefore, the gradient ${\nabla}_{{w}_{i},{y}_{i}}{W}_{k,\zeta}^{k}$ can be computed.
Moreover, we can differentiate Equations (4)–(8) to get the Hessian matrix of ${W}_{k,\zeta}^{k}$ with respect to the ${w}_{i}$’s and ${y}_{i}$’s, provided that $g(x,y)$ has better differentiability (which may enable Newton’s method, or a mixture of Newton’s method and minibatch SGD, to accelerate convergence). More details about the claims, calculations, and proofs are provided in Appendix B.
4. The Discretization Algorithm
Here, we describe an algorithm for the efficient discretization of optimal transport (EDOT) from a distribution $\mu $ to ${\mu}_{m}$ with integer $m$, the given cardinality of the support. In general, $\mu $ need not be explicitly accessible, and, even if it is, computing the exact transference plan is not feasible. Therefore, in this construction, we assume that $\mu $ is given in terms of a random sampler, and we apply minibatch stochastic gradient descent (SGD): on each step, a set of $N$ samples independently drawn from $\mu $ is used to approximate $\mu $.
To calculate the gradient ${\nabla}_{{\mu}_{m}}{W}_{k,\zeta}^{k}(\mu ,{\mu}_{m})={\left({\nabla}_{{x}_{i}}{W}_{k,\zeta}^{k},{\nabla}_{{w}_{i}}{W}_{k,\zeta}^{k}\right)}_{i=1}^{m}$, we need: (1) ${\pi}_{X,\zeta}$, the EOT transference plan between $\mu $ and ${\mu}_{m}$; (2) the cost $g={d}_{X}^{k}$ on $X$; and (3) its gradient in the second variable, ${\nabla}_{{x}^{\prime}}{d}_{X}^{k}(x,{x}^{\prime})$. From $N$ samples ${\left\{{y}_{i}\right\}}_{i=1}^{N}$, we can construct ${\mu}_{N}=\frac{1}{N}{\sum}_{i=1}^{N}{\delta}_{{y}_{i}}$ and calculate the gradients with $\mu $ replaced by ${\mu}_{N}$ as an estimate, whose effectiveness (convergence as $N\to \infty $) is proved in [5].
We call this discretization algorithm the Simple EDOT algorithm. The pseudocode is given in Appendix C.
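One step of this procedure can be sketched as follows. This is a much-simplified illustration with $k=2$ and names of our own choosing: the weight update uses only the dual potential $\beta^{*}$ recovered from the Sinkhorn scaling (omitting the implicit-function correction derived in Section 3), and a mirror step keeps the weights on the simplex.

```python
import numpy as np

def edot_step(sampler, x, w, zeta=0.01, lr_x=0.5, lr_w=0.1, N=100, n_sk=200):
    """One minibatch SGD step of a (simplified) Simple EDOT sketch.

    sampler(N) -> (N, d) points drawn from mu; x: (m, d) atom positions;
    w: (m,) atom weights on the simplex.  Uses k = 2, g(y, x) = ||y - x||^2.
    """
    y = sampler(N)
    C = ((y[:, None, :] - x[None, :, :]) ** 2).sum(-1)   # (N, m) cost matrix
    K = np.exp(-C / zeta)
    a = np.ones(N)
    for _ in range(n_sk):                                # Sinkhorn scaling
        b = w / (K.T @ a)
        a = (1.0 / N) / (K @ b)
    pi = a[:, None] * K * b[None, :]                     # EOT plan mu_N -> mu_m
    # Position gradient of <g, pi>: sum_j pi_ji * grad_{x_i} ||y_j - x_i||^2.
    grad_x = 2.0 * (pi.sum(0)[:, None] * x - pi.T @ y)
    # Dual potential beta_i = zeta * log(b_i / w_i) drives the weight update.
    beta = zeta * np.log(np.maximum(b / w, 1e-300))
    x = x - lr_x * grad_x
    w = w * np.exp(-lr_w * (beta - beta.mean()))         # mirror step on simplex
    return x, w / w.sum()
```

Iterating `edot_step` with a fixed sampler realizes the minibatch SGD loop; the full algorithm with the exact gradients is in Appendix C.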
Proposition 2. (Convergence of the Simple EDOT). The Simple EDOT generates a sequence $\left({\mu}_{m}^{\left(i\right)}\right)$ in the compact set ${X}^{m}\times \Delta $. If the set of limit points of $\left({\mu}_{m}^{\left(i\right)}\right)$ does not intersect ${X}^{m}\times \partial \Delta $, then $\left({\mu}_{m}^{\left(i\right)}\right)$ converges to a stationary point in ${X}^{m}\times \mathrm{Int}(\Delta )$, where $\mathrm{Int}(\cdot)$ denotes the interior.
In simulations, we fixed $k=2$ to reduce the computational complexity and fixed the regularizer $\zeta =0.01$ for $X$ of diameter 1, scaling it proportionally to $\mathrm{diam}{\left(X\right)}^{k}$ (see the next section). Such a choice of $\zeta $ is not only small enough to reduce the error between the EOT estimate ${W}_{k,\zeta}$ and the true ${W}_{k}$, but also ensures that ${e}^{-g(x,y)/\zeta}$ and its byproducts in the SK are distinguishable from 0 in double-precision format.
Examples of discretization: We demonstrate our algorithm on the following:
Example (1): $\mu $ is the uniform distribution on $X=[0,1]$.
Example (2): $\mu $ is a mixture of two truncated normal distributions on $X=[0,1]$ with PDF $f\left(x\right)=0.3\varphi (x;0.2,0.1)+0.7\varphi (x;0.7,0.2)$, where $\varphi (x;\xi ,\sigma )$ is the density of the truncated normal distribution on $[0,1]$ with expectation $\xi $ and standard deviation $\sigma $.
Example (3): $\mu $ is a mixture of two truncated normal distributions on $X={[0,1]}^{2}$, where the two components are $\varphi (x;0.2,0.1)\varphi (y;0.3,0.2)$ with weight $0.3$ and $\varphi (x;0.7,0.2)\varphi (y;0.6,0.15)$ with weight $0.7$.
Let $N=100$ for all plots in this section. Figure 2a–c plots the discretizations (${\mu}_{m}$) for Examples (1)–(3) with $m=5,5,$ and 7, respectively. Figure 2f illustrates the convergence rate of ${W}_{k,\zeta}^{k}(\mu ,{\mu}_{m})$ versus the SGD steps for Example (2), with ${\mu}_{m}$ obtained by a 5-point EDOT.
Figure 2d,e plot the entropy-regularized Wasserstein ${W}_{k,\zeta}^{k}(\mu ,{\mu}_{m})$ versus $m$, comparing EDOT and naive sampling for Examples (1) and (2). Here, the ${\mu}_{m}$s are: (a) from the EDOT with $3\le m\le 7$ in Example (1) and $3\le m\le 8$ in Example (2), shown by ×s in the figures; (b) from naive sampling, simulated using a Monte Carlo of volume 20,000 for each size from 3 to 200. Figure 2d,e demonstrate the effectiveness of the EDOT: as indicated by the orange horizontal dashed line, even the 5-point EDOT discretization in these two examples outperformed 95% of the naive samplings of size 40, as well as 75% of the naive samplings of size over 100 (the orange dash-dot lines).
An example of a transference plan: In Figure 3a, we illustrate the efficiency of the EDOT on an OT problem: $X=Y=[0,1]$, where the marginals $\mu $ and $\nu $ are truncated normal (mixtures); $\mu $ has two components (red curve on the left), while $\nu $ has only one (red curve on the top). The cost function is the squared Euclidean distance, and $\lambda =\zeta =0.01$.
The left of Figure 3a shows a $5\times 5$ EDOT approximation with ${W}_{k,\zeta}^{k}(\mu ,{\mu}_{5})=4.792\times {10}^{-3}$, ${W}_{k,\zeta}^{k}(\nu ,{\nu}_{5})=5.034\times {10}^{-3}$, and ${W}_{k,\zeta}^{k}(\gamma ,{\gamma}_{5,5})=8.446\times {10}^{-3}$. The high-density area of the EOT plan is correctly covered by EDOT estimating points with high weights. The right shows a $25\times 25$ naive approximation with ${W}_{k,\zeta}^{k}(\mu ,{\mu}_{7})=5.089\times {10}^{-3}$, ${W}_{k,\zeta}^{k}(\nu ,{\nu}_{7})=2.222\times {10}^{-2}$, and ${W}_{k,\zeta}^{k}(\gamma ,{\gamma}_{7,7})=2.563\times {10}^{-2}$. The naive estimate’s highest-weight points missed the region where the true EOT plan has the most density.
5. Methods of Improvement
I. Adaptive EDOT: The computational cost of the simple EDOT increases with the dimensionality and diameter of the underlying space. A discretization with a large $m$ is needed to capture higher-dimensional distributions, which results in an increase in the number of parameters for calculating the gradient of ${W}_{k,\zeta}^{k}$: $md$ for the positions ${y}_{i}$ and $m-1$ for the weights ${w}_{i}$. Such an increase not only raises the complexity of each step, but also requires more steps for the SGD to converge. Furthermore, the calculation has a higher complexity ($\mathcal{O}\left(mN\right)$ for each normalization in the Sinkhorn iteration).
We propose to reduce the computational complexity using a “divide and conquer” approach. The Wasserstein distance takes the $k$-th power of the distance function, ${d}_{X}^{k}$, as the cost function. The locality of the distance ${d}_{X}$ makes the solution to the OT/EOT problem local, meaning that probability mass is more likely to be transported to a close destination than to a remote one. Thus, we can “divide and conquer”: cut the space $X$ into small cells and solve the discretization problem independently on each.
To develop a “divide and conquer” algorithm, we need: (1) an adaptive dividing procedure that partitions $X={X}_{1}\bigsqcup \cdots \bigsqcup {X}_{\mathcal{I}}$ while balancing accuracy and computational intensity among the cells; and (2) a way to determine the discretization size ${m}_{i}$ and choose a proper regularizer ${\zeta}_{i}$ for each cell ${X}_{i}$. The pseudocode for all variations is given in Appendix C, Algorithms A2 and A3.
Choosing size $m$: An appropriate choice of ${m}_{i}$ balances the contributions to the Wasserstein distance among the subproblems as follows: Let ${X}_{i}$ be a manifold of dimension $d$, let $\mathrm{diam}\left({X}_{i}\right)$ be its diameter, and let ${p}_{i}=\mu \left({X}_{i}\right)$ be the probability of ${X}_{i}$. The entropy-regularized Wasserstein distance can be estimated as ${W}_{k,\zeta}^{k}=\mathcal{O}\left({p}_{i}{m}_{i}^{-k/d}\mathrm{diam}{\left({X}_{i}\right)}^{k}\right)$ [16,17]. The contribution to ${W}_{k,\zeta}^{k}(\mu ,{\mu}_{m})$ per point in the support of ${\mu}_{m}$ is $\mathcal{O}\left({p}_{i}{m}_{i}^{-(k+d)/d}\mathrm{diam}{\left({X}_{i}\right)}^{k}\right)$. Therefore, to balance each point’s contribution among the divided subproblems, we set ${m}_{i}\approx m\,\frac{{\left({p}_{i}\mathrm{diam}{\left({X}_{i}\right)}^{k}\right)}^{d/(k+d)}}{{\sum}_{j=1}^{\mathcal{I}}{\left({p}_{j}\mathrm{diam}{\left({X}_{j}\right)}^{k}\right)}^{d/(k+d)}}$.
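This allocation rule can be sketched as a small helper (hypothetical code of our own; the function name and the rounding policy are not from the paper):

```python
import numpy as np

def allocate_sizes(p, diam, m, k=2, d=2):
    """Split a budget of m atoms among I cells.

    p: (I,) cell probabilities mu(X_i); diam: (I,) cell diameters.
    Follows m_i proportional to (p_i * diam_i^k)^{d/(k+d)},
    rounding and keeping at least one atom per cell.
    """
    score = (p * diam ** k) ** (d / (k + d))
    return np.maximum(1, np.round(m * score / score.sum()).astype(int))
```

For Variation 1, `diam` would simply be replaced by the occupied-volume proxy $\mathrm{Vol}{({X}_{i})}^{1/d}$ discussed next.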
Occupied volume (Variation 1): A cell could be too vast (e.g., large in size with few points in a corner), resulting in a larger ${m}_{i}$ than needed. To fix this, we may replace $\mathrm{diam}\left({X}_{i}\right)$ above with $\mathrm{Vol}{\left({X}_{i}\right)}^{1/d}$, where $\mathrm{Vol}\left({X}_{i}\right)$ is the occupied volume, calculated by counting the number of nonempty cells at a certain resolution (levels in the previous binary division). The algorithm (Variation 1) first builds a binary tree to resolve and obtain the occupied volume of each cell, and then performs a tree traversal to assign the ${m}_{i}$.
Adjusting the regularizer $\zeta $: In ${W}_{k,\zeta}^{k}$, the SK is run on ${e}^{-g(x,y)/\zeta}$. Therefore, $\zeta $ should scale with ${d}_{X}^{k}$ to ensure that the transference plan is not affected by the scaling of ${d}_{X}$. Precisely, we may choose ${\zeta}_{i}=\mathrm{diam}{\left({X}_{i}\right)}^{k}{\zeta}_{0}$ for some constant ${\zeta}_{0}$.
The division: In theory, any refinement procedure that proceeds iteratively and eventually drives the diameter of each cell to 0 can be used for the division. In our simulations, we used an adaptive kd-tree-style cell refinement in a Euclidean space ${\mathbb{R}}^{d}$. Let $X$ be embedded into ${\mathbb{R}}^{d}$ within an axis-aligned rectangular region. We choose an axis ${\mathbf{x}}_{l}$ in ${\mathbb{R}}^{d}$ and evenly split the region along a hyperplane orthogonal to ${\mathbf{x}}_{l}$ (e.g., cut the square ${[0,1]}^{2}$ along the line $x=0.5$), constructing ${X}_{1}$ and ${X}_{2}$. Given the sample set $S$, we split it into two sample sets ${S}_{1}$ and ${S}_{2}$ according to which subregion each sample lies in. The corresponding ${m}_{i}$ and ${\zeta}_{i}$ can then be calculated as discussed above. Thus, two cells and their corresponding subproblems are constructed. If some ${m}_{i}$ is still too large, that cell is cut along another axis into two further cells; the full list of cells and subproblems is constructed recursively. In addition, another cutting method (Variation 2), which chooses the sparsest point as the cutting point via a sliding window, is sometimes useful in practice.
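The kd-tree-style refinement can be sketched roughly as follows (an illustrative recursion with names of our own choosing; it uses the box diameter and the per-cell budget rule from above, with a depth cap added as a safeguard that is not part of the paper's description):

```python
import numpy as np

def kd_cells(S, bounds, m, k=2, m_max=5, axis=0, depth=0):
    """Recursively halve an axis-aligned box until each cell's atom
    budget is at most m_max.  Returns a list of (samples, box, m_i).

    S: (N, d) samples inside the region; bounds: (d, 2) box; m: atom budget.
    """
    if m <= m_max or len(S) <= 1 or depth >= 12:
        return [(S, bounds, m)]
    lo, hi = bounds[axis]
    mid = 0.5 * (lo + hi)                       # even split along one axis
    left = S[:, axis] < mid
    halves = []
    for mask, (a, b) in ((left, (lo, mid)), (~left, (mid, hi))):
        sub = bounds.copy()
        sub[axis] = (a, b)
        p = mask.mean()                         # empirical cell probability
        diam = np.linalg.norm(sub[:, 1] - sub[:, 0])
        halves.append((S[mask], sub, p, diam))
    # allocate the budget by m_i ∝ (p_i * diam_i^k)^{d/(k+d)}
    d = S.shape[1]
    score = np.array([(p * diam ** k) ** (d / (k + d)) for _, _, p, diam in halves])
    if score.sum() == 0:
        score = np.ones(2)
    out = []
    for (Si, bi, _, _), s in zip(halves, score):
        mi = max(1, int(round(m * s / score.sum())))
        out.extend(kd_cells(Si, bi, mi, k, m_max, (axis + 1) % d, depth + 1))
    return out
```

Each leaf then becomes an independent EDOT subproblem with its own $m_i$ and rescaled $\zeta_i$.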
After obtaining the set of subproblems, we apply the EDOT in each cell and then combine the solutions ${\mu}_{{m}_{i}}^{\left(i\right)}={\sum}_{j=1}^{{m}_{i}}{w}_{j}^{\left(i\right)}{\delta}_{{y}_{j}^{\left(i\right)}}$ into the final result ${\mu}_{m}:={\sum}_{i=1}^{\mathcal{I}}{\sum}_{j=1}^{{m}_{i}}{p}_{i}{w}_{j}^{\left(i\right)}{\delta}_{{y}_{j}^{\left(i\right)}}$.
Figure 3b shows the optimal discretization for the example in Figure 2c with $m=30$, obtained by applying the EDOT with adaptive cell refinement and ${\zeta}_{i}=0.01\times \mathrm{diam}{\left({X}_{i}\right)}^{2}$.
II. On embedded CW complexes: Although the samples on a space $X$ are usually represented as vectors in ${\mathbb{R}}^{d}$, inducing an embedding $X\hookrightarrow {\mathbb{R}}^{d}$, the space $X$ often has its own structure as a CW complex (or simply a manifold) with a more intrinsic metric. Thus, if the CW complex structure is known, even piecewise, we may apply the refinement on $X$ with respect to its own metric, whereas direct discretization as a subset of ${\mathbb{R}}^{d}$ may result in low expressive efficiency.
We now illustrate the adaptive EDOT with an example of a normal mixture distribution on a sphere, mapped through stereographic projection. More examples (a truncated normal mixture over a Swiss roll and the discretization of a 2D optimal transference plan) are detailed in Appendix D.5.
On the sphere: The underlying space ${X}_{\mathrm{sphere}}$ is the unit sphere in ${\mathbb{R}}^{3}$. ${\mu}_{\mathrm{sphere}}$ is the pushforward of a normal mixture distribution on ${\mathbb{R}}^{2}$ under stereographic projection. The sample set ${S}_{\mathrm{sphere}}\sim {\mu}_{\mathrm{sphere}}$ over ${X}_{\mathrm{sphere}}$ is shown in Figure 4 on the left. Consider the (3D) Euclidean metric on ${X}_{\mathrm{sphere}}$ induced by the embedding. Figure 4a (right) plots the EDOT solution with refinement for ${\mu}_{m}$ with $m=40$. The resulting cell structure is shown as colored boxes.
To use the intrinsic metric instead, a CW complex was constructed: a point on the equator serves as the 0-cell, the rest of the equator as a 1-cell, and the upper and lower hemispheres as two (open) 2-cells. We mapped the upper and lower hemispheres onto a unit disk through stereographic projection with respect to the south and north pole, respectively. Then, we took the metric from spherical geometry and rewrote the distance function and its gradient in the natural coordinates of the unit disk.
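The hemisphere charts described above can be sketched in code (illustrative helpers of our own, assuming the south-pole projection maps the unit disk onto the upper hemisphere):

```python
import numpy as np

def stereo_to_sphere(z, from_south=True):
    """Map a point z = (u, v) in the unit disk to the unit sphere in R^3
    by inverse stereographic projection (chart for one hemisphere cell)."""
    u, v = z
    s = u * u + v * v
    p = np.array([2 * u, 2 * v, 1 - s]) / (1 + s)   # projection from the south pole
    return p if from_south else p * np.array([1, 1, -1])

def sphere_dist(p, q):
    """Intrinsic (great-circle) distance on the unit sphere."""
    return np.arccos(np.clip(np.dot(p, q), -1.0, 1.0))
```

The disk origin maps to a pole, and the disk boundary maps to the equator, so the two charts together cover the whole sphere; the gradient of `sphere_dist` in the disk coordinates is what the SGD of Section 3 would use.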
Figure 4b shows the refinement of the EDOT on the samples (in red) and the corresponding discretizations as colored points. More figures can be found in the Appendices.
6. Analysis of the Algorithms
In this section, we derive the complexity of the simple EDOT and the adaptive EDOT. In particular, we show the following:
Proposition 3. Let μ be a (continuous) probability measure on a space X. A simple EDOT of size m has time complexity $\mathcal{O}\left({(N+m)}^{2}mdL+NmL\log(1/\epsilon)\right)$ and space complexity $\mathcal{O}\left({(N+m)}^{2}\right)$, where N is the minibatch size (used to construct ${\mu}_{N}$ in each step to approximate μ), d is the dimension of X, L is the maximal number of SGD iterations, and $\epsilon$ is the error bound in the Sinkhorn calculation of the entropy-regularized optimal transference plan between ${\mu}_{N}$ and ${\mu}_{m}$.
Proposition 3 quantitatively shows that, when the adaptive EDOT is applied, the total complexities (in time and space) are reduced, because the magnitudes of both N and m are much smaller in each cell. The procedure of dividing the sample set S into subsets in the adaptive EDOT is similar to Quicksort, so the space and time complexities are similar. The similarity comes from the binary divide-and-conquer structure and from the fact that each split compares every sample against a pivot.
Proposition 4. For the preprocessing (job-list creation) of the adaptive EDOT, the time complexity is $\mathcal{O}({N}_{0}\log{N}_{0})$ in the best and average cases and $\mathcal{O}\left({N}_{0}^{2}\right)$ in the worst case, where ${N}_{0}$ is the total number of sample points; the space complexity is $\mathcal{O}({N}_{0}d+m)$, or simply $\mathcal{O}\left({N}_{0}d\right)$ as $m\ll {N}_{0}$.
Remark 2. The complexity is the same as that of Quicksort. The set of ${N}_{0}$ sample points is treated as the “true” distribution in the adaptive EDOT since, in the later EDOT steps for each cell, no further samples are taken, as it is hard for a sampler to produce a sample within a given cell. Postprocessing of the adaptive EDOT has $\mathcal{O}\left(m\right)$ complexity in both time and space.
Remark 3. For the two algorithm variations in Section 5, the occupied-volume estimation works in the same way as the original preprocessing step and has the same time complexity (by itself, since dividing must happen after the occupied volume of all cells is known); however, with the tree built, the original preprocessing becomes a tree traversal with (additional) time complexity $\mathcal{O}\left({N}_{0}\right)$ and (additional) space complexity $\mathcal{O}\left({N}_{0}\right)$ for storing the occupied volumes. For details on choosing cut points with window sliding, see Appendix C.5.
Comparison with naive sampling: Given a size-m discretization on X and a size-n discretization on Y, the EOT solution (Sinkhorn algorithm) has time complexity $\mathcal{O}(mn\log(1/\epsilon))$. In the EDOT, two discretization problems must be solved before applying the Sinkhorn iteration, while naive sampling requires nothing but sampling.
According to Proposition 3, solving a single continuous EOT problem using a size-m simple EDOT may incur a higher time complexity than naive sampling with an even larger sample size N (than m). However, naive sampling requires a known cost function $c:X\times Y\to \mathbb{R}$, whereas the EDOT only requires access to the distance functions ${d}_{X}$ and ${d}_{Y}$ on X and Y, respectively. In real applications, the cost may come from real-world experiments (or extra computations) performed for each pair $(x,y)$ in the discretization; thus, the size of the discretized distribution is critical for cost control, while ${d}_{X}$ and ${d}_{Y}$ usually come along with the spaces X and Y and are easy to compute. A further advantage of the EDOT arises when the marginal distributions ${\mu}_{X}$ and ${\nu}_{Y}$ are fixed across different cost functions: the discretizations can then be reused, so the cost of discretization is paid once, and the improvement it brings accumulates with each repetition.
7. Related Work and Discussion
Our original problem was the optimal transport problem between general distributions given as samplers (instead of integration oracles). We translated it into a discretization problem plus an OT problem between the discretizations.
I. Comparison with other discretization methods: Several other methods in the literature generate discrete distributions from arbitrary distributions; they are obtained via semi-continuous optimal transport, where the calculation of a weighted Voronoi diagram is needed. Calculating weighted Voronoi diagrams usually requires 1. that the cost function be a squared Euclidean distance and 2. the application of Delaunay triangulation, which is expensive in more than two dimensions. Furthermore, semi-continuous discretization may optimize only one of the positions and the weights of the atoms; this line of work is mainly based on [18] (optimized positions) and [19] (optimized weights).
We mainly compare against the prior work of [18], which focuses on the barycenter of a set of distributions under the Wasserstein metric. That work produces a discrete distribution called the Lagrangian discretization, which is of the form $\frac{1}{m}{\sum}_{i=1}^{m}{\delta}_{{x}_{i}}$ [2]. Other works, such as [20,21], find barycenters but do not create a discretization. Refs. [19,22] studied the discrete estimation of the 2-Wasserstein distance, locating discrete points through the clustering algorithm k-means++ and through weighted Voronoi diagram refinement, respectively; they then assigned weights, making the results non-Lagrangian discretizations. Ref. [19] (comparison in Figure 5) roughly follows a “divide-and-conquer” approach in selecting positions, but the discrete positions are not tuned directly according to the Wasserstein distance. Ref. [22] converges as the number of discrete points increases; however, it lacks a criterion (such as the Wasserstein objective in the EDOT) showing that its choice is a distinguished one among all possible converging algorithms.
By projecting the SGD gradient onto the tangent space of the submanifold ${X}^{m}\times \{{\mathbf{e}}_{m}/m\}=\{\frac{1}{m}\sum {\delta}_{{x}_{i}}\}$, or, equivalently, by fixing the learning rate on the weights to zero, the EDOT can estimate a Lagrangian discretization (denoted EDOT-Equal). A comparison among the methods on a map of the Canary Islands is shown in Figure 6. This example shows that, restricted to Lagrangian discretizations, our method obtains results similar to those of the methods in the literature, while in general the weighted EDOT can work better.
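A minimal sketch of this projection follows. The function name, signature, and gradient inputs are hypothetical illustrations; the actual gradients come from the entropy-regularized objective of Section 3:

```python
import numpy as np

def edot_sgd_step(x, w, grad_x, grad_w, lr_x=0.1, lr_w=0.1, equal=False):
    # One hypothetical SGD step on atom locations x (m x d) and weights w (m,).
    x = x - lr_x * grad_x
    if equal:
        # EDOT-Equal: zero learning rate on the weights, i.e. project the
        # gradient onto the tangent space of X^m x {e_m / m}.
        return x, np.full_like(w, 1.0 / w.size)
    # general EDOT: step the weights along the gradient projected onto the
    # tangent space of the simplex (mean removed so the total mass is kept)
    w = w - lr_w * (grad_w - grad_w.mean())
    w = np.clip(w, 1e-12, None)
    return x, w / w.sum()
```

With `equal=True` every step returns the uniform weights $1/m$, so only the atom positions are optimized, which is exactly the Lagrangian restriction.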
Moreover, the EDOT can be used to solve barycenter problems.
Note that applying adaptive EDOT to barycenter problems requires compatible divisions of the target distributions (i.e., a cell A from one target distribution transports entirely onto a discrete subset D, D transports onto a cell B of another target distribution, and so on).
We also tested these algorithms on discretizing grayscale and color images. A comparison of discretizations with 10 to 4000 points of a kitty image under EDOT, EDOT-Equal, and [18], together with estimates of their Wasserstein distances to the original image, is shown in Figure 7 and Figure 8.
Furthermore, the EDOT may be applied to the RGB channels of an image independently; the resulting discretizations are then plotted together in the corresponding colors. The results are shown in Figure 1 at the beginning of this paper.
Lagrangian discretization may have a disadvantage in representing repetitive patterns when the number of discretization points is incompatible with the pattern. In Figure 9, discretizing 16 objects with 24 points caused local weight incompatibility for the Lagrangian discretization, placing points between objects and increasing the Wasserstein distance. With the EDOT, the points lying outside the blue objects carried much smaller weights, so the patterned structure was better represented. In practice, patterns often occur in data (e.g., pictures of nature), and an incompatible number of points is easy to hit, since the equal-weight requirement is rigid; consequently, such patterns cannot be properly captured by Lagrangian discretization.
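The weight-incompatibility effect can be reproduced in a one-dimensional toy example of our own (not from the paper): two equal “objects” at 0 and 1 discretized with three points, with the 1-Wasserstein distance computed via the CDF formula ${W}_{1}=\int |{F}_{\mu}-{F}_{\nu}|\,dt$:

```python
import numpy as np

def w1_discrete(xs, ws, ys, vs):
    # 1-Wasserstein distance between discrete 1-D distributions (xs, ws)
    # and (ys, vs), via the integral of |F_mu(t) - F_nu(t)| over t.
    pts = np.union1d(xs, ys)
    F_mu = np.array([np.sum(np.asarray(ws)[np.asarray(xs) <= p]) for p in pts])
    F_nu = np.array([np.sum(np.asarray(vs)[np.asarray(ys) <= p]) for p in pts])
    return float(np.sum(np.abs(F_mu[:-1] - F_nu[:-1]) * np.diff(pts)))

# Target: two equal "objects" at 0 and 1, discretized with m = 3 points.
xs, ws = [0.0, 1.0], [0.5, 0.5]
lagrangian = w1_discrete(xs, ws, [0.0, 0.5, 1.0], [1/3, 1/3, 1/3])  # 1/6
weighted = w1_discrete(xs, ws, [0.0, 0.5, 1.0], [0.5, 0.0, 0.5])    # 0.0
```

Equal weights force a third of the mass onto a point between the two objects (distance 1/6), while the weighted discretization can down-weight the surplus point and represent the target exactly, which is the same effect seen in Figure 9.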
II. General k and general distance ${d}_{X}$: Our algorithms (simple EDOT, adaptive EDOT, and EDOT-Equal) work for a general choice of the parameter $k>1$ and any ${C}^{2}$ distance ${d}_{X}$ on X. For example, in Figure 4 part (b), the distance used on each disk was spherical (the arc length along the great circle passing through the two points), which cannot be isometrically reparametrized onto a plane with the Euclidean metric because of the difference in curvature.
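As an illustration, a spherical arc-length distance of this kind can be sketched as follows (our own minimal version, assuming points are given as nonzero vectors in 3-space and projected onto the unit sphere):

```python
import numpy as np

def arc_length(p, q):
    # Spherical distance: arc length along the great circle through p and q,
    # after normalizing both points onto the unit sphere.
    p = np.asarray(p, float); p = p / np.linalg.norm(p)
    q = np.asarray(q, float); q = q / np.linalg.norm(q)
    # clip guards against floating-point dot products slightly outside [-1, 1]
    return float(np.arccos(np.clip(p @ q, -1.0, 1.0)))
```

For instance, `arc_length([1, 0, 0], [0, 1, 0])` is a quarter great circle, i.e. pi/2 on the unit sphere. Any such ${C}^{2}$ distance can be plugged into the algorithms in place of the Euclidean one.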
III. Other possible impacts: As OT problems arise widely in many other areas, our algorithm can be applied accordingly, e.g., to choosing the locations and sizes of supermarkets or electrical substations in an area, or even of air conditioners in supercomputer rooms. Our divide-and-conquer methods are suitable for such real-world applications.
IV. OT for discrete distributions: Many algorithms have been developed to solve OT problems between two discrete distributions [3]. Linear programming algorithms were developed first, but their applications have been restricted by high computational complexity. Other methods usually solve problems under specific conditions; for example, [23] assumes a cost of the form $c(x,y)=h(x-y)$ for some h and applies the “back-and-forth” method, hopping between the two forms of the Kantorovich dual problem (on the two marginals, respectively) to obtain a gradient of the total cost with respect to the dual functions. In our work, we applied the EOT algorithm of [8] to estimate the OT solution of the discrete problem.
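The core of such entropy-regularized solvers is Sinkhorn scaling; the following is a minimal sketch of the general technique on a toy problem of our own, not the exact implementation of [8]:

```python
import numpy as np

def sinkhorn(mu, nu, C, eps, iters=500):
    # Entropy-regularized OT: alternately rescale the Gibbs kernel
    # K = exp(-C / eps) so the coupling's marginals match mu and nu.
    K = np.exp(-C / eps)
    u = np.ones_like(mu)
    for _ in range(iters):
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    return u[:, None] * K * v[None, :]  # m x n coupling matrix

# Toy problem: uniform marginals on a 5-point grid, squared-distance cost.
x = np.linspace(0.0, 1.0, 5)
C = (x[:, None] - x[None, :]) ** 2
P = sinkhorn(np.full(5, 0.2), np.full(5, 0.2), C, eps=0.5)
```

Both the kernel K and the returned coupling are of size m x n, which is why every method of this family scales with the product of the support sizes, as noted in the Introduction.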
8. Conclusions
We developed methods for efficiently approximating OT couplings with fixed-size $m\times n$ approximations. We provided bounds relating the discrete approximation to the original continuous problem. We implemented two algorithms, demonstrated their efficacy compared with naive sampling, and analyzed their computational complexity. Our approach offers a new way to compute OT plans efficiently.