Article

Subspace Detours Meet Gromov–Wasserstein

1
Laboratoire de Mathématiques de Bretagne Atlantique, Université Bretagne Sud, CNRS UMR 6205, 56000 Vannes, France
2
ENS Lyon, CNRS UMR 5668, LIP, 69342 Lyon, France
3
Department of Computer Science, Université Bretagne Sud, CNRS UMR 6074, IRISA, 56000 Vannes, France
4
IMT Atlantique, CNRS UMR 6285, Lab-STICC, 29238 Brest, France
*
Author to whom correspondence should be addressed.
Algorithms 2021, 14(12), 366; https://doi.org/10.3390/a14120366
Submission received: 9 November 2021 / Revised: 13 December 2021 / Accepted: 14 December 2021 / Published: 17 December 2021
(This article belongs to the Special Issue Optimal Transport: Algorithms and Applications)

Abstract

In the context of optimal transport (OT) methods, the subspace detour approach was recently proposed by Muzellec and Cuturi. It consists of first finding an optimal plan between the measures projected on a wisely chosen subspace and then completing it into a nearly optimal transport plan on the whole space. The contribution of this paper is to extend this category of methods to the Gromov–Wasserstein problem, which is a particular type of OT distance involving the specific geometry of each distribution. After deriving the associated formalism and properties, we give an experimental illustration on a shape matching problem. We also discuss a specific cost for which we can show connections with the Knothe–Rosenblatt rearrangement.

1. Introduction

Classical optimal transport (OT) has received a lot of attention recently, in particular in Machine Learning for tasks such as generative networks [1] or domain adaptation [2], to name a few. It generally relies on the Wasserstein distance, which builds an optimal coupling between distributions given a notion of distance between their samples. Yet, this metric cannot be used directly whenever the distributions lie in different metric spaces, and it lacks potentially important properties, such as translation or rotation invariance of the supports of the distributions, which can be useful when comparing shapes or meshes [3,4]. In order to alleviate those problems, custom solutions have been proposed, such as [5], in which invariances are enforced by optimizing over some class of transformations, or [6], in which distributions lying in different spaces are compared by optimizing over the Stiefel manifold to project or embed one of the measures.
Apart from these works, another meaningful OT distance to tackle these problems is the Gromov–Wasserstein (GW) distance, originally proposed in [3,7,8]. It is a distance between metric spaces and has several appealing properties such as geodesics or invariances [8]. Yet, the price to be paid lies in its computational complexity, which requires solving a nonconvex quadratic optimization problem with linear constraints. A recent line of work tends to compute approximations or relaxations of the original problem in order to spread its use in more data-intensive machine learning applications. For example, Peyré et al. [9] rely on entropic regularization and Sinkhorn iterations [10], while recent methods impose coupling with low-rank constraints [11] or rely on a sliced approach [12] or on mini-batch estimators [13] to approximate the Gromov–Wasserstein distance. In Chowdhury et al. [4], the authors propose to partition the space and to solve the optimal transport problem between a subset of points before finding a coupling between all the points.
In this work, we study the subspace detour approach for Gromov–Wasserstein. This class of methods was first proposed in the Wasserstein setting by Muzellec and Cuturi [14] and consists of (1) projecting the measures onto a wisely chosen subspace and finding an optimal coupling between them, and then (2) constructing a nearly optimal plan of the measures on the whole space using disintegration (see Section 2.2). Our main contribution is to generalize the subspace detour approach to different subspaces and to apply it to the GW distance. We derive some useful properties as well as closed-form solutions of this transport plan between Gaussian distributions. From a practical side, we provide a novel closed-form expression of the one-dimensional GW problem that allows us to efficiently compute the subspace detour transport plan when the subspaces are one-dimensional. Illustrations of the method are given on a shape matching problem, where we show good results at a cheaper computational cost compared to other GW-based methods. Interestingly enough, we also propose a separable quadratic cost for the GW problem that can be related to a triangular coupling [15], hence bridging the gap with Knothe–Rosenblatt (KR) rearrangements [16,17].

2. Background

In this section, we introduce all the necessary material to describe the subspace detours approach for classical optimal transport and relate it to the Knothe–Rosenblatt rearrangement. We show how to find couplings via the gluing lemma and measure disintegration. Then, we introduce the Gromov–Wasserstein problem for which we will derive the subspace detour in the next sections.

2.1. Classical Optimal Transport

Let μ, ν ∈ P(R^d) be two probability measures. The set of couplings between μ and ν is defined as:
Π(μ, ν) = {γ ∈ P(R^d × R^d) | π^1_# γ = μ, π^2_# γ = ν},
where π^1 and π^2 are the projections on the first and second coordinate (i.e., π^1(x, y) = x), respectively, and # is the push-forward operator, defined such that:
∀A ∈ B(R^d), T_# μ(A) = μ(T^{-1}(A)).

2.1.1. Kantorovitch Problem

There exist several types of couplings between probability measures; a non-exhaustive list can be found in [18] (Chapter 1). Among them, the so-called optimal coupling is the minimizer of the following Kantorovitch problem:
inf_{γ ∈ Π(μ, ν)} ∫ c(x, y) dγ(x, y)
with c being some cost function. The Kantorovitch problem (1) is known to admit a solution when c is non-negative and lower semi-continuous [19] (Theorem 1.7). When c(x, y) = ‖x − y‖₂², it defines the so-called Wasserstein distance:
W₂²(μ, ν) = inf_{γ ∈ Π(μ, ν)} ∫ ‖x − y‖₂² dγ(x, y).
When the optimal coupling is of the form γ = (Id, T)_# μ with T some deterministic map such that T_# μ = ν, T is called the Monge map.
In one dimension, with μ atomless, the solution to (2) is a deterministic coupling of the form [19] (Theorem 2.5):
T = F_ν^{-1} ∘ F_μ,
where F_μ is the cumulative distribution function of μ and F_ν^{-1} is the quantile function of ν. This map is also known as the increasing rearrangement map.
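For two empirical measures, this map can be approximated by matching empirical quantiles. Below is a minimal NumPy sketch (ours, not code from the paper) of the increasing rearrangement between two 1D samples.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)            # samples from mu
y = rng.exponential(size=1000)       # samples from nu

# Empirical CDF of mu evaluated at its own samples, followed by the
# quantile function of nu: T = F_nu^{-1} o F_mu (increasing rearrangement).
F_mu = (np.argsort(np.argsort(x)) + 0.5) / len(x)
T_x = np.quantile(y, F_mu)

# T is nondecreasing: sorting x sorts T(x) as well.
assert np.all(np.diff(T_x[np.argsort(x)]) >= 0)
```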

2.1.2. Knothe–Rosenblatt Rearrangement

Another interesting coupling is the Knothe–Rosenblatt (KR) rearrangement, which takes advantage of the increasing rearrangement in one dimension by iterating over the dimensions and using the disintegration of the measures. Concatenating all the increasing rearrangements between the conditional probabilities, the KR rearrangement produces a nondecreasing triangular map (i.e., T: R^d → R^d, for all x ∈ R^d, T(x) = (T_1(x_1), …, T_j(x_1, …, x_j), …, T_d(x)), and for all j, T_j is nondecreasing with respect to x_j), and a deterministic coupling (i.e., T_# μ = ν) [18,19,20].
Carlier et al. [21] made a connection between this coupling and optimal transport by showing that it can be obtained as the limit of optimal transport plans for a degenerated cost:
c_t(x, y) = ∑_{i=1}^d λ_i(t) (x_i − y_i)²,
where for all i ∈ {1, …, d} and t > 0, λ_i(t) > 0, and for all i ≥ 2, λ_i(t)/λ_{i−1}(t) → 0 as t → 0. This cost can be recast as in [22] as c_t(x, y) = (x − y)^T A_t (x − y), where A_t = diag(λ_1(t), …, λ_d(t)). This formalizes into the following theorem:
Theorem 1
([19,21]). Let μ and ν be two absolutely continuous measures on R^d with compact supports. Let γ_t be an optimal transport plan for the cost c_t, let T_K be the Knothe–Rosenblatt map between μ and ν, and γ_K = (Id × T_K)_# μ the associated transport plan. Then, we have γ_t → γ_K weakly (in distribution) as t → 0. Moreover, if the γ_t are induced by transport maps T_t, then T_t converges in L²(μ) to the Knothe–Rosenblatt rearrangement as t tends to zero.
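As a quick numerical illustration of this degeneration (our own sketch, not part of the paper), one can solve the OT problem for c_t with the POT library for decreasing t and check that the plan concentrates on the coupling that sorts the first coordinate, i.e., the first step of the KR map. Here d = 2 with λ_1(t) = 1 and λ_2(t) = t.

```python
import numpy as np
import ot  # Python Optimal Transport

rng = np.random.default_rng(0)
n = 50
X = rng.normal(size=(n, 2))
Y = rng.normal(size=(n, 2)) @ np.array([[2.0, 0.3], [0.3, 0.5]])
a = b = np.full(n, 1.0 / n)

for t in [1.0, 1e-2, 1e-4]:
    # degenerated cost c_t(x, y) = (x1 - y1)^2 + t (x2 - y2)^2
    M = (X[:, None, 0] - Y[None, :, 0]) ** 2 + t * (X[:, None, 1] - Y[None, :, 1]) ** 2
    G = ot.emd(a, b, M)
    # fraction of mass that matches the ranks of the first coordinate,
    # i.e. agrees with the first step of the Knothe-Rosenblatt map
    agree = G[np.argsort(X[:, 0]), np.argsort(Y[:, 0])].sum()
    print(t, agree)
```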

2.2. Subspace Detours and Disintegration

Muzellec and Cuturi [14] proposed another OT problem by optimizing over the couplings which share a measure on a subspace. More precisely, they defined subspace-optimal plans for which the shared measure is the OT plan between projected measures.
Definition 1
(Subspace-Optimal Plans [14], Definition 1). Let μ, ν ∈ P₂(R^d) and let E ⊂ R^d be a k-dimensional subspace. Let γ_E^* be an OT plan for the Wasserstein distance between μ_E = π^E_# μ and ν_E = π^E_# ν (with π^E the orthogonal projection on E). Then, the set of E-optimal plans between μ and ν is defined as Π_E(μ, ν) = {γ ∈ Π(μ, ν) | (π^E, π^E)_# γ = γ_E^*}.
In other words, the subspace OT plans are the transport plans of μ, ν that agree on the subspace E with the optimal transport plan γ_E^* on this subspace. To construct such a coupling γ ∈ Π_E(μ, ν), one can rely on the gluing lemma [18] or use the disintegration of the measure, as described in the following section.

2.2.1. Disintegration

Let (Y, 𝒴) and (Z, 𝒵) be measurable spaces, and (X, 𝒳) = (Y × Z, 𝒴 ⊗ 𝒵) the product measurable space. Then, for μ ∈ P(X), we denote μ_Y = π^Y_# μ and μ_Z = π^Z_# μ the marginals, where π^Y (respectively π^Z) is the projection on Y (respectively Z). Then, a family (K(y, ·))_{y ∈ Y} is a disintegration of μ if for all y ∈ Y, K(y, ·) is a measure on Z, for all A ∈ 𝒵, K(·, A) is measurable, and:
∀φ ∈ C(X),  ∫_{Y×Z} φ(y, z) dμ(y, z) = ∫_Y ∫_Z φ(y, z) K(y, dz) dμ_Y(y),
where C(X) is the set of continuous functions on X. We write μ = μ_Y ⊗ K. K is a probability kernel if for all y ∈ Y, K(y, Z) = 1. The disintegration of a measure actually corresponds to conditional laws in the context of probabilities. This concept will allow us to obtain measures on the whole space from marginals on subspaces.
In the case where X = R d , which is our setting of interest, we have existence and uniqueness of the disintegration (see Box 2.2 of [19] or Chapter 5 of [23] for the more general case).
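As a concrete example (ours, not from the paper), disintegrating a discrete measure on R² with respect to its first coordinate simply amounts to grouping the atoms by their first coordinate and normalizing the conditional weights:

```python
import numpy as np
from collections import defaultdict

# A discrete measure on R^2: mu = sum_i w_i * delta_{(y_i, z_i)}.
pts = np.array([[0.0, 1.0], [0.0, 2.0], [1.0, 1.0]])
w = np.array([0.2, 0.3, 0.5])

# Disintegration w.r.t. the first coordinate: mu = mu_Y (x) K,
# where K(y, .) is the conditional law of z given y.
mu_Y = defaultdict(float)
K = defaultdict(lambda: defaultdict(float))
for (y, z), wi in zip(pts, w):
    mu_Y[y] += wi
    K[y][z] += wi
for y in K:
    for z in K[y]:
        K[y][z] /= mu_Y[y]   # normalize: K(y, .) is a probability kernel
```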

2.2.2. Coupling on the Whole Space

Let us denote μ_{E⊥|E} and ν_{E⊥|E} the disintegrated measures on the orthogonal space (i.e., such that μ = μ_E ⊗ μ_{E⊥|E} and ν = ν_E ⊗ ν_{E⊥|E}; if we have densities, p(x_E, x_{E⊥}) = p_E(x_E) p_{E⊥|E}(x_{E⊥}|x_E)). Then, to obtain a transport plan between the two original measures on the whole space, we can look for another coupling between the disintegrated measures μ_{E⊥|E} and ν_{E⊥|E}. In particular, two such couplings are proposed in [14], the Monge–Independent (MI) plan:
π_MI = γ_E^* ⊗ (μ_{E⊥|E} ⊗ ν_{E⊥|E}),
where we take the independent coupling between μ_{E⊥|E}(x_E, ·) and ν_{E⊥|E}(y_E, ·) for γ_E^*-almost every (x_E, y_E), and the Monge–Knothe (MK) plan:
π_MK = γ_E^* ⊗ γ_{E⊥|E}^*,
where γ_{E⊥|E}^*((x_E, y_E), ·) is an optimal plan between μ_{E⊥|E}(x_E, ·) and ν_{E⊥|E}(y_E, ·) for γ_E^*-almost every (x_E, y_E). Muzellec and Cuturi [14] observed that MI is more adapted to noisy environments since it only computes the OT plan on the subspace. MK is more suited for applications where we want to prioritize some subspace but where all the directions still contain relevant information [14].
This subspace detour approach can be of much interest under the popular assumption that two distributions on R^d differ only in a low-dimensional subspace, as in the spiked transport model [24]. However, it is still required to find an adequate subspace. Muzellec and Cuturi [14] propose to either rely on a priori knowledge to select the subspace (by using, e.g., a reference dataset and a principal component analysis) or to optimize over the Stiefel manifold.
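To make the first stage of the construction concrete, here is a small sketch (ours) using POT: project two empirical measures onto a one-dimensional subspace E and compute the optimal plan γ_E^* between the projections; the MI or MK couplings then complete this plan on the whole space. The choice of the direction e below is arbitrary and only for illustration.

```python
import numpy as np
import ot

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
Y = rng.normal(size=(n, d)) + 2.0
a = b = np.full(n, 1.0 / n)

# E spanned by a unit vector e; pi_E is the orthogonal projection onto E
e = np.ones(d) / np.sqrt(d)
xE, yE = X @ e, Y @ e

# optimal plan between the projected (1D) measures
M = (xE[:, None] - yE[None, :]) ** 2
gamma_E = ot.emd(a, b, M)

# Any subspace-optimal plan must satisfy (pi_E, pi_E)_# gamma = gamma_E;
# Monge-Independent / Monge-Knothe complete gamma_E on the orthogonal space.
```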

2.3. Gromov–Wasserstein

Formally, the Gromov–Wasserstein distance allows us to compare metric measure spaces (mm-spaces), i.e., triplets (X, d_X, μ_X) and (Y, d_Y, μ_Y), where (X, d_X) and (Y, d_Y) are complete separable metric spaces and μ_X and μ_Y are Borel probability measures on X and Y [8], respectively, by computing:
GW(X, Y) = inf_{γ ∈ Π(μ_X, μ_Y)} ∬ L(d_X(x, x′), d_Y(y, y′)) dγ(x, y) dγ(x′, y′),
where L is some loss on R. It has actually been extended to other spaces by replacing the distances by cost functions c_X and c_Y, as, e.g., in [25]. Furthermore, it has many appealing properties, such as having invariances (which depend on the costs).
Vayer [26] notably studied this problem in the setting where X and Y are Euclidean spaces, with L(x, y) = (x − y)² and c(x, x′) = ⟨x, x′⟩ or c(x, x′) = ‖x − x′‖₂². In particular, let μ ∈ P(R^p) and ν ∈ P(R^q); the inner-GW problem is defined as:
InnerGW(μ, ν) = inf_{γ ∈ Π(μ, ν)} ∬ (⟨x, x′⟩_p − ⟨y, y′⟩_q)² dγ(x, y) dγ(x′, y′).
For this problem, a closed form in one dimension can be found when one of the distributions admits a density w.r.t. the Lebesgue measure:
Theorem 2
([26], Theorem 4.2.4). Let μ, ν ∈ P(R), with μ absolutely continuous with respect to the Lebesgue measure. Let F_μ(x) := μ(]−∞, x]) be the cumulative distribution function and F̄_μ(x) = μ(]x, +∞[) the anti-cumulative distribution function. Let T_asc(x) = F_ν^{-1}(F_μ(x)) and T_desc(x) = F̄_ν^{-1}(F̄_μ(x)). Then, an optimal solution of (4) is achieved either by γ = (Id × T_asc)_# μ or by γ = (Id × T_desc)_# μ.
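For uniform empirical samples of the same size, the ascending and descending candidates can simply be compared directly (this is the discrete analogue used later in Proposition 5). A minimal NumPy sketch (ours, with a hypothetical helper name) is given below.

```python
import numpy as np

def inner_gw_1d_uniform(x, y):
    """Inner-GW between two uniform 1D samples of the same size.

    Evaluates the ascending and descending rearrangements of Theorem 2
    and returns the best cost and the corresponding reordering of y.
    """
    xs = np.sort(x)
    ys_asc, ys_desc = np.sort(y), np.sort(y)[::-1]

    def cost(xs, ys):
        # (1/n^2) * sum_{i,k} (x_i x_k - y_i y_k)^2 for the matched samples
        n = len(xs)
        return ((np.outer(xs, xs) - np.outer(ys, ys)) ** 2).sum() / n ** 2

    c_asc, c_desc = cost(xs, ys_asc), cost(xs, ys_desc)
    return min(c_asc, c_desc), (ys_asc if c_asc <= c_desc else ys_desc)

rng = np.random.default_rng(0)
best_cost, match = inner_gw_1d_uniform(rng.normal(size=200), rng.normal(size=200))
```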

3. Subspace Detours for GW

In this section, we propose to extend subspace detours from Muzellec and Cuturi [14] with Gromov–Wasserstein costs. We show that we can even take subspaces of different dimensions and still obtain a coupling on the whole space using the Independent or the Monge–Knothe coupling. Then, we derive some properties analogously to Muzellec and Cuturi [14], as well as some closed-form solutions between Gaussians. We also provide a new closed-form expression of the inner-GW problem between one-dimensional discrete distributions and provide an illustration on a shape-matching problem.

3.1. Motivations

First, we adapt the definition of subspace optimal plans to different subspaces. Indeed, since the GW distance is adapted to distributions that have their own geometry, we argue that if we project both measures on the same subspace, then it is likely that the resulting coupling will not be coherent with that of GW. To illustrate this point, we use as a source distribution μ one moon of the two moons dataset and obtain a target ν by rotating μ by an angle of π/2 (see Figure 1). As GW with c(x, x′) = ‖x − x′‖₂² is invariant with respect to isometries, the optimal coupling is diagonal, as recovered on the left side of the figure. However, when choosing one subspace to project both the source and target distributions, we completely lose the optimal coupling between them. Nonetheless, by choosing one subspace for each measure more wisely (using here the first component of the principal component analysis (PCA) decomposition), we recover the diagonal coupling. This simple illustration underlines that the choice of both subspaces is important. A way of choosing the subspaces could be to project on the subspace containing the most information for each dataset, using, e.g., PCA independently on each distribution. Muzellec and Cuturi [14] proposed to optimize the optimal transport cost with respect to an orthonormal matrix with a projected gradient descent, which could be extended to an optimization over two orthonormal matrices in our context.
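As an illustration of this heuristic (our sketch, not the paper's code), one can take the first principal direction of each distribution separately and use them as E and F; below this is done on a synthetic anisotropic point cloud and a rotated copy of it.

```python
import numpy as np

def first_pc(X):
    """First principal direction of a point cloud (unit vector)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [0.0, 0.3]])   # source
theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
Y = X @ R.T                                                           # rotated target

e, f = first_pc(X), first_pc(Y)       # one 1D subspace per measure
xE, yF = X @ e, Y @ f                 # projections used by the detour
```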
By allowing for different subspaces, we obtain the following definition of subspace optimal plans:
Definition 2.
Let μ ∈ P₂(R^p), ν ∈ P₂(R^q), E be a k-dimensional subspace of R^p and F a k′-dimensional subspace of R^q. Let γ_{E×F}^* be an optimal transport plan for GW between μ_E = π^E_# μ and ν_F = π^F_# ν (with π^E (resp. π^F) the orthogonal projection on E (resp. F)). Then, the set of (E, F)-optimal plans between μ and ν is defined as Π_{E,F}(μ, ν) = {γ ∈ Π(μ, ν) | (π^E, π^F)_# γ = γ_{E×F}^*}.
Analogously to Muzellec and Cuturi [14] (Section 2.2), we can obtain from γ_{E×F}^* a coupling on the whole space by defining either the Monge–Independent plan π_MI = γ_{E×F}^* ⊗ (μ_{E⊥|E} ⊗ ν_{F⊥|F}) or the Monge–Knothe plan π_MK = γ_{E×F}^* ⊗ γ_{E⊥×F⊥|E×F}^*, where the OT plans are taken with some OT cost, e.g., GW.

3.2. Properties

Let E ⊂ R^p and F ⊂ R^q, and denote:
GW_{E,F}(μ, ν) = inf_{γ ∈ Π_{E,F}(μ, ν)} ∬ L(x, x′, y, y′) dγ(x, y) dγ(x′, y′)
the Gromov–Wasserstein problem restricted to subspace optimal plans (Definition 2). In the following, we show that Monge–Knothe couplings are optimal plans of this problem, which is a direct transposition of Proposition 1 in [14].
Proposition 1.
Let μ ∈ P(R^p) and ν ∈ P(R^q), E ⊂ R^p, F ⊂ R^q, and π_MK = γ_{E×F}^* ⊗ γ_{E⊥×F⊥|E×F}^*, where γ_{E×F}^* is an optimal coupling between μ_E and ν_F and, for γ_{E×F}^*-almost every (x_E, y_F), γ_{E⊥×F⊥|E×F}^*((x_E, y_F), ·) is an optimal coupling between μ_{E⊥|E}(x_E, ·) and ν_{F⊥|F}(y_F, ·). Then we have:
π_MK ∈ argmin_{γ ∈ Π_{E,F}(μ, ν)} ∬ L(x, x′, y, y′) dγ(x, y) dγ(x′, y′).
Proof. 
Let γ ∈ Π_{E,F}(μ, ν); then:
∬ L(x, x′, y, y′) dγ(x, y) dγ(x′, y′) = ∬ ( ∬ L(x, x′, y, y′) γ_{E⊥×F⊥|E×F}((x_E, y_F), (dx_{E⊥}, dy_{F⊥})) γ_{E⊥×F⊥|E×F}((x′_E, y′_F), (dx′_{E⊥}, dy′_{F⊥})) ) dγ_{E×F}^*(x_E, y_F) dγ_{E×F}^*(x′_E, y′_F).
However, for γ_{E×F}^*-a.e. (x_E, y_F), (x′_E, y′_F),
∬ L(x, x′, y, y′) γ_{E⊥×F⊥|E×F}((x_E, y_F), (dx_{E⊥}, dy_{F⊥})) γ_{E⊥×F⊥|E×F}((x′_E, y′_F), (dx′_{E⊥}, dy′_{F⊥})) ≥ ∬ L(x, x′, y, y′) γ_{E⊥×F⊥|E×F}^*((x_E, y_F), (dx_{E⊥}, dy_{F⊥})) γ_{E⊥×F⊥|E×F}^*((x′_E, y′_F), (dx′_{E⊥}, dy′_{F⊥}))
by definition of the Monge–Knothe coupling. By integrating with respect to γ_{E×F}^*, we obtain:
∬ L(x, x′, y, y′) dγ(x, y) dγ(x′, y′) ≥ ∬ L(x, x′, y, y′) dπ_MK(x, y) dπ_MK(x′, y′).
Therefore, π_MK is optimal among subspace optimal plans.    □
The key properties of GW that we would like to keep are its invariances. We show in two particular cases that we preserve them on the orthogonal spaces (since the measure on E × F is fixed).
Proposition 2.
Let μ ∈ P(R^p), ν ∈ P(R^q), E ⊂ R^p, F ⊂ R^q.
For L(x, x′, y, y′) = (‖x − x′‖₂² − ‖y − y′‖₂²)² or L(x, x′, y, y′) = (⟨x, x′⟩_p − ⟨y, y′⟩_q)², GW_{E,F} (5) is invariant with respect to isometries of the form f = (Id_E, f_{E⊥}) (resp. g = (Id_F, g_{F⊥})), with f_{E⊥} an isometry on E⊥ (resp. g_{F⊥} an isometry on F⊥) with respect to the corresponding cost (c(x, x′) = ‖x − x′‖₂² or c(x, x′) = ⟨x, x′⟩_p).
Proof. 
We propose a sketch of the proof; the full proof can be found in Appendix A.1. Let L(x, x′, y, y′) = (‖x − x′‖₂² − ‖y − y′‖₂²)², let f_{E⊥} be an isometry w.r.t. c(x_{E⊥}, x′_{E⊥}) = ‖x_{E⊥} − x′_{E⊥}‖₂², and let f: R^p → R^p be defined such that for all x ∈ R^p, f(x) = (x_E, f_{E⊥}(x_{E⊥})).
By using Lemma 6 of [27], we show that Π_{E,F}(f_# μ, ν) = {(f, Id)_# γ | γ ∈ Π_{E,F}(μ, ν)}. Hence, for all γ ∈ Π_{E,F}(f_# μ, ν), there exists γ̃ ∈ Π_{E,F}(μ, ν) such that γ = (f, Id)_# γ̃. By disintegrating γ̃ with respect to γ_{E×F}^* and using the properties of the pushforward, we can show that:
∬ (‖x − x′‖₂² − ‖y − y′‖₂²)² d(f, Id)_# γ̃(x, y) d(f, Id)_# γ̃(x′, y′) = ∬ (‖x − x′‖₂² − ‖y − y′‖₂²)² dγ̃(x, y) dγ̃(x′, y′).
Finally, by taking the infimum with respect to γ̃ ∈ Π_{E,F}(μ, ν), we find:
GW_{E,F}(f_# μ, ν) = GW_{E,F}(μ, ν).
   □

3.3. Closed-Form between Gaussians

We can also derive explicit formulas between Gaussians in particular cases. Let q ≤ p, μ = N(m_μ, Σ) ∈ P(R^p), and ν = N(m_ν, Λ) ∈ P(R^q) be two Gaussian measures with Σ = P_μ D_μ P_μ^T and Λ = P_ν D_ν P_ν^T. As previously, let E ⊂ R^p and F ⊂ R^q be k- and k′-dimensional subspaces, respectively. Following Muzellec and Cuturi [14], we represent Σ in an orthonormal basis of E ⊕ E⊥ and Λ in an orthonormal basis of F ⊕ F⊥, i.e., Σ = [Σ_E, Σ_{EE⊥}; Σ_{E⊥E}, Σ_{E⊥}]. Now, let us denote:
Σ/Σ_E = Σ_{E⊥} − Σ_{EE⊥}^T Σ_E^{-1} Σ_{EE⊥}
the Schur complement of Σ with respect to Σ_E. We know that the conditionals of Gaussians are Gaussians and that their covariances are the Schur complements (see, e.g., [28,29]).
For L(x, x′, y, y′) = (‖x − x′‖₂² − ‖y − y′‖₂²)², we have for now no certainty that the optimal transport plan is Gaussian. Let N_{p+q} denote the set of Gaussian measures on R^{p+q}. By restricting the minimization problem to Gaussian couplings, i.e., by solving:
GGW(μ, ν) = inf_{γ ∈ Π(μ, ν) ∩ N_{p+q}} ∬ (‖x − x′‖₂² − ‖y − y′‖₂²)² dγ(x, y) dγ(x′, y′),
Salmona et al. [30] showed that there is a solution γ* = (Id, T)_# μ ∈ Π(μ, ν) with μ = N(m_μ, Σ), ν = N(m_ν, Λ) and
∀x ∈ R^p, T(x) = m_ν + P_ν A P_μ^T (x − m_μ),
where A = (Ĩ_q D_ν^{1/2} (D_μ^{(q)})^{-1/2}  0_{q,p−q}) ∈ R^{q×p}, and Ĩ_q is of the form diag((±1)_{i≤q}).
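For reference, here is a small NumPy sketch (ours) of the map (7), under the assumptions that the eigenvalues in D_μ and D_ν are sorted in decreasing order and that Ĩ_q is taken equal to the identity; the helper name ggw_map is ours.

```python
import numpy as np

def ggw_map(m_mu, Sigma, m_nu, Lambda):
    """Gaussian-restricted GW map of Eq. (7), from R^p to R^q with q <= p.

    Assumes eigenvalues sorted in decreasing order and takes I~_q = I_q.
    """
    p, q = Sigma.shape[0], Lambda.shape[0]
    d_mu, P_mu = np.linalg.eigh(Sigma)
    d_nu, P_nu = np.linalg.eigh(Lambda)
    d_mu, P_mu = d_mu[::-1], P_mu[:, ::-1]     # decreasing order
    d_nu, P_nu = d_nu[::-1], P_nu[:, ::-1]
    A = np.zeros((q, p))
    A[:, :q] = np.diag(np.sqrt(d_nu) / np.sqrt(d_mu[:q]))
    B = P_nu @ A @ P_mu.T
    return lambda x: m_nu + (x - m_mu) @ B.T

# toy usage between a 3D and a 2D Gaussian
T = ggw_map(np.zeros(3), np.diag([3.0, 2.0, 1.0]), np.zeros(2), np.diag([4.0, 0.5]))
```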
By combining the results of Muzellec and Cuturi [14] and Salmona et al. [30], we obtain the following closed-form for Monge–Knothe couplings:
Proposition 3.
Suppose p ≥ q and k = k′. For the Gaussian-restricted GW problem (6), a Monge–Knothe transport map between μ = N(m_μ, Σ) ∈ P(R^p) and ν = N(m_ν, Λ) ∈ P(R^q) is, for all x ∈ R^p, T_MK(x) = m_ν + B(x − m_μ), where:
B = [T_{E,F}, 0; C, T_{E⊥,F⊥|E,F}]
with T_{E,F} an optimal transport map between N(0_E, Σ_E) and N(0_F, Λ_F) (of the form (7)), T_{E⊥,F⊥|E,F} an optimal transport map between N(0_{E⊥}, Σ/Σ_E) and N(0_{F⊥}, Λ/Λ_F), and C satisfying:
C = (Λ_{F⊥F} (T_{E,F}^T)^{-1} − T_{E⊥,F⊥|E,F} Σ_{E⊥E}) Σ_E^{-1}.
Proof. 
See Appendix A.2.1.    □
Suppose that k′ ≤ k, m_μ = 0, and m_ν = 0, and let T_{E,F} be an optimal transport map between μ_E and ν_F (of the form (7)). We can derive a formula for the Monge–Independent coupling both for the inner-GW problem and for the Gaussian-restricted GW problem.
Proposition 4.
π_MI = N(0_{p+q}, Γ), where Γ = [Σ, C; C^T, Λ] with
C = (V_E Σ_E + V_{E⊥} Σ_{E⊥E}) T_{E,F}^T (V_F^T + Λ_F^{-1} Λ_{FF⊥} V_{F⊥}^T),
where T_{E,F} is an optimal transport map, either for the inner-GW problem or for the Gaussian-restricted problem.
Proof. 
See Appendix A.2.2.    □

3.4. Computation of Inner-GW between One-Dimensional Empirical Measures

In practice, computing the Gromov–Wasserstein distance from samples of the distributions is costly. From a computational point of view, the subspace detour approach provides an interesting method with better computational complexity when choosing 1D subspaces. Moreover, we have the intuition that the GW problem between measures lying on smaller-dimensional subspaces has a better sample complexity than between the original measures, as is the case for the Wasserstein distance [31,32].
Below, we show that when both E and F are one-dimensional subspaces, the resulting GW problem between the projected measures can be solved in linear time. This relies on a new closed-form expression of the GW problem in 1D. Vayer et al. [12] provided a closed form for GW with c(x, x′) = ‖x − x′‖₂² in one dimension between discrete measures containing the same number of points and with uniform weights. However, in our framework, the 1D projections onto E, F may not have uniform weights, and we would also like to be able to compare distributions with different numbers of points. We provide in the next proposition a closed-form expression for the inner-GW problem between any one-dimensional discrete probability distributions:
Proposition 5.
Consider Σ_n = {a ∈ R_+^n, ∑_{i=1}^n a_i = 1} the probability simplex with n bins. For a vector a ∈ R^n, we denote a↓ the vector with values sorted decreasingly, i.e., a↓_1 ≥ ⋯ ≥ a↓_n. Let μ = ∑_{i=1}^n a_i δ_{x_i}, ν = ∑_{j=1}^m b_j δ_{y_j} ∈ P(R) × P(R) with (a, b) ∈ Σ_n × Σ_m. Suppose that x_1 ≤ ⋯ ≤ x_n and y_1 ≤ ⋯ ≤ y_m. Consider the problem:
min_{γ ∈ Π(a, b)} ∑_{i,j,k,l} (x_i x_k − y_j y_l)² γ_{ij} γ_{kl}.
Then, there exists γ ∈ {NW(a, b), NW(a↓, b)} such that γ is an optimal solution of (8), where NW is the North-West corner rule defined in Algorithm 1. As a corollary, an optimal solution of (8) can be found in O(n + m).
Algorithm 1 North-West corner rule NW(a, b)
  • Input: a ∈ Σ_n, b ∈ Σ_m
  • Initialize γ = 0_{n×m}, i = 1, j = 1
  • while i ≤ n and j ≤ m do
  •     γ_{ij} = min{a_i, b_j}
  •     a_i = a_i − γ_{ij}
  •     b_j = b_j − γ_{ij}
  •     If a_i = 0, i = i + 1; if b_j = 0, j = j + 1
  • end while
  • return γ ∈ Π(a, b)
Proof. 
Let γ ∈ Π(a, b). Then:
∑_{i,j,k,l} (x_i x_k − y_j y_l)² γ_{ij} γ_{kl} = ∑_{i,j,k,l} (x_i x_k)² γ_{ij} γ_{kl} + ∑_{i,j,k,l} (y_j y_l)² γ_{ij} γ_{kl} − 2 ∑_{i,j,k,l} x_i x_k y_j y_l γ_{ij} γ_{kl}.
However, ∑_{i,j,k,l} (x_i x_k)² γ_{ij} γ_{kl} = ∑_{i,k} (x_i x_k)² a_i a_k and ∑_{i,j,k,l} (y_j y_l)² γ_{ij} γ_{kl} = ∑_{j,l} (y_j y_l)² b_j b_l, so these terms do not depend on γ. Moreover, 2 ∑_{i,j,k,l} x_i x_k y_j y_l γ_{ij} γ_{kl} = 2 (∑_{i,j} x_i y_j γ_{ij})². Hence, the problem (8) is equivalent (in terms of the OT plan) to max_{γ ∈ Π(a,b)} (∑_{i,j} x_i y_j γ_{ij})², which is also equivalent to solving max_{γ ∈ Π(a,b)} |∑_{i,j} x_i y_j γ_{ij}|, or equivalently:
max_{γ ∈ Π(a,b)} ± ∑_{i,j} x_i y_j γ_{ij}.
We have two cases to consider. If the sign is +1, we have to solve min_{γ ∈ Π(a,b)} ∑_{i,j} (−x_i) y_j γ_{ij}. Since the points are sorted, the matrix c_{ij} = −x_i y_j satisfies the Monge property [33]:
∀(i, j) ∈ {1, …, n−1} × {1, …, m−1},  c_{i,j} + c_{i+1,j+1} ≤ c_{i+1,j} + c_{i,j+1}.
To see this, check that:
(−x_i) y_j + (−x_{i+1}) y_{j+1} − (−x_{i+1}) y_j − (−x_i) y_{j+1} = (−x_i)(y_j − y_{j+1}) + (−x_{i+1})(y_{j+1} − y_j) = (y_j − y_{j+1})(x_{i+1} − x_i) ≤ 0.
In this case, the North-West corner rule NW(a, b) defined in Algorithm 1 is known to produce an optimal solution to the linear problem (9) [33]. If the sign is −1, then changing x_i into −x_i concludes. □
We emphasize that this result is novel and generalizes [12] in the sense that the distributions do not need to have uniform weights and the same number of points. In addition, Theorem 2 is not directly applicable to this setting since it requires absolutely continuous distributions, which is not the case here. Both results are, however, related, as the solution obtained by using the NW corner rule on the sorted samples is the same as that obtained by considering the coupling obtained from the quantile functions. The previous result could also be used to define tractable alternatives to GW in the same manner as the Sliced Gromov–Wasserstein [12].
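Below is a small NumPy sketch (ours) of Algorithm 1 and of the selection rule of Proposition 5: run the North-West corner rule on the supports sorted increasingly, once as-is (co-monotone candidate) and once with the order of one side reversed (anti-monotone candidate), and keep the plan with the smaller inner-GW cost, evaluated with the expansion used in the proof. The helper names are ours.

```python
import numpy as np

def north_west(a, b):
    """North-West corner rule: a feasible plan in Pi(a, b) in O(n + m)."""
    a, b = a.astype(float).copy(), b.astype(float).copy()
    gamma = np.zeros((len(a), len(b)))
    i = j = 0
    while i < len(a) and j < len(b):
        m = min(a[i], b[j])
        gamma[i, j] = m
        a[i] -= m
        b[j] -= m
        if a[i] <= 1e-15:
            i += 1
        if b[j] <= 1e-15:
            j += 1
    return gamma

def inner_gw_cost(x, y, gamma):
    # sum_{i,j,k,l} (x_i x_k - y_j y_l)^2 gamma_ij gamma_kl, expanded as in the proof
    a, b = gamma.sum(1), gamma.sum(0)
    sx = (np.outer(x, x) ** 2 * np.outer(a, a)).sum()
    sy = (np.outer(y, y) ** 2 * np.outer(b, b)).sum()
    cross = (x @ gamma @ y) ** 2
    return sx + sy - 2 * cross

def inner_gw_1d(x, a, y, b):
    """Proposition 5: best plan among the two monotone candidates."""
    ix, iy = np.argsort(x), np.argsort(y)
    g1 = north_west(a[ix], b[iy])               # co-monotone candidate
    g2 = north_west(a[ix][::-1], b[iy])         # anti-monotone candidate
    # put the plans back in the original indexing
    G1 = np.zeros_like(g1)
    G1[np.ix_(ix, iy)] = g1
    G2 = np.zeros_like(g2)
    G2[np.ix_(ix[::-1], iy)] = g2
    c1, c2 = inner_gw_cost(x, y, G1), inner_gw_cost(x, y, G2)
    return (G1, c1) if c1 <= c2 else (G2, c2)
```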

3.5. Illustrations

We use the Python Optimal Transport (POT) library [34] to compute the different optimal transport problems involved in this illustration. We are interested here in solving a 3D mesh registration problem, which is a natural application of Gromov–Wasserstein [3] since it enjoys invariances with respect to isometries such as permutations and can also naturally exploit the topology of the meshes. For this purpose, we selected two base meshes from the Faust dataset [35], which provides ground truth correspondences between shapes. The information available from those meshes is geometrical (6890 vertex positions) and topological (mesh connectivity). These two meshes are represented, along with the visual results of the registration, in Figure 2. In order to visually depict the quality of the assignment induced by the transport map, we propagate through it a color code of the source vertices toward their associated counterpart vertices in the target mesh. Both the original color-coded source and the associated target ground truth are available on the first line of the illustration. To compute our method, we simply use as a natural subspace for both meshes the algebraic connectivity of the mesh's topological information, also known as the Fiedler vector [36] (the eigenvector associated with the second smallest eigenvalue of the un-normalized Laplacian matrix). Fiedler vectors are computed in practice using NetworkX [37] but could also be obtained by using power methods [38]. Reduced to a 1D optimal transport problem (8), we used Proposition 5 to compute the optimal coupling in O(n + m). Consequently, the computation time is very low (5 s on a standard laptop), and the associated matching is very good, with more than 98% of correct assignments. We qualitatively compare this result to Gromov–Wasserstein mappings induced by different cost functions in the second line of Figure 2: adjacency [39], weighted adjacency (weights given by distances between vertices), heat kernel (derived from the un-normalized Laplacian) [40], and, finally, geodesic distances over the meshes. On average, computing the Gromov–Wasserstein mapping using POT took around 10 min. Both methods based on adjacency fail to recover a meaningful mapping. The heat kernel allows us to map continuous areas of the source mesh but fails to recover a global structure. Finally, the geodesic distance gives a much more coherent mapping but has inverted the left and right of the human figure. Notably, a significant extra computation time was induced by the computation of the geodesic distances (1 h/mesh using the NetworkX [37] shortest path procedure). As a conclusion, and despite the simplification of the original problem, our method performs best with a speed-up of two orders of magnitude.
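For completeness, here is a sketch (ours) of how such a one-dimensional embedding can be obtained with NetworkX: the Fiedler vector of the mesh graph, computed from the un-normalized Laplacian, provides one scalar per vertex, which then plays the role of the projected measure in the 1D solver of Proposition 5. The toy graph below stands in for the mesh connectivity.

```python
import networkx as nx
import numpy as np

def fiedler_embedding(edges, n_vertices):
    """1D embedding of a mesh graph: the Fiedler vector of its Laplacian."""
    G = nx.Graph()
    G.add_nodes_from(range(n_vertices))
    G.add_edges_from(edges)
    # eigenvector associated with the second smallest Laplacian eigenvalue
    return np.asarray(nx.fiedler_vector(G, normalized=False, seed=0))

# toy example: a cycle graph standing in for a mesh connectivity
edges = [(i, (i + 1) % 20) for i in range(20)]
emb = fiedler_embedding(edges, 20)   # one scalar per vertex, used as x_E
```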

4. Triangular Coupling as Limit of Optimal Transport Plans for Quadratic Cost

Another interesting property of the Monge–Knothe coupling, derived in Muzellec and Cuturi [14], is that it can be obtained as the limit of classical optimal transport plans, similarly to Theorem 1, using a separable cost of the form:
c_t(x, y) = (x − y)^T P_t (x − y)
with P_t = V_E V_E^T + t V_{E⊥} V_{E⊥}^T and (V_E, V_{E⊥}) an orthonormal basis of R^p.
However, this property is not valid for the classical Gromov–Wasserstein cost (e.g., L(x, x′, y, y′) = (d_X(x, x′)² − d_Y(y, y′)²)² or L(x, x′, y, y′) = (⟨x, x′⟩_p − ⟨y, y′⟩_q)²), as the cost is not separable. Motivated by this question, we ask in the following whether we can derive a quadratic optimal transport cost for which we would have this property.
Formally, we derive a new quadratic optimal transport problem using the Hadamard product. We show that this problem is well defined and that it has interesting properties, such as invariance with respect to axes. We also show that it can be related to a triangular coupling in a similar fashion to the classical optimal transport problem with the Knothe–Rosenblatt rearrangement.

4.1. Construction of the Hadamard–Wasserstein Problem

In this part, we define the “Hadamard–Wasserstein” problem between μ ∈ P(R^d) and ν ∈ P(R^d) as:
HW²(μ, ν) = inf_{γ ∈ Π(μ, ν)} ∬ ‖x ⊙ x′ − y ⊙ y′‖₂² dγ(x, y) dγ(x′, y′),
where ⊙ is the Hadamard (element-wise) product. This problem differs from the Gromov–Wasserstein problem in the sense that we no longer compare intra-distances but rather the Hadamard products between vectors of the two spaces (in the same fashion as the classical Wasserstein distance). Hence, we need the two measures to belong to the same Euclidean space. Let us denote L the cost defined as:
∀x, x′, y, y′ ∈ R^d,  L(x, x′, y, y′) = ∑_{k=1}^d (x_k x′_k − y_k y′_k)² = ‖x ⊙ x′ − y ⊙ y′‖₂².
We observe that it coincides with the inner-GW loss (4) in one dimension. Therefore, by Theorem 2, we know that we have a closed-form solution in 1D.

4.2. Properties

First, we derive some useful properties of (12) which are usual for the regular Gromov–Wasserstein problem. Formally, we show that the problem is well defined and that it is a pseudometric with invariances with respect to axes.
Proposition 6.
Let μ, ν ∈ P(R^d).
  • The problem (12) always admits a minimizer.
  • HW is a pseudometric (i.e., it is symmetric, non-negative, HW ( μ , μ ) = 0 , and it satisfies the triangle inequality).
  • HW is invariant to reflection with respect to axes.
Proof. 
Let μ, ν ∈ P(R^d).
  • (x, x′) ↦ x ⊙ x′ is a continuous map; therefore, L is lower semi-continuous. Hence, by applying Lemma 2.2.1 of [26], we observe that γ ↦ ∬ L(x, x′, y, y′) dγ(x, y) dγ(x′, y′) is lower semi-continuous for the weak convergence of measures.
    Now, as Π(μ, ν) is a compact set (see the proof of Theorem 1.7 in [19] for the Polish space case and of Theorem 1.4 for the compact metric space case) and γ ↦ ∬ L dγ dγ is lower semi-continuous for the weak convergence, we can apply the Weierstrass theorem (Memo 2.2.1 in [26]), which states that (12) always admits a minimizer.
  • See Theorem 16 in [25].
  • For invariances, we first look at the properties that must be satisfied by a map T in order to have, for all x, x′, f(x, x′) = f(T(x), T(x′)), where f: (x, x′) ↦ x ⊙ x′.
    We find that for all x ∈ R^d and all 1 ≤ i ≤ d, |[T(x)]_i| = |x_i|, because, denoting (e_i)_{i=1}^d the canonical basis, we have:
    x ⊙ e_i = T(x) ⊙ T(e_i),
    which implies that:
    x_i = [T(x)]_i [T(e_i)]_i.
    However, f(e_i, e_i) = f(T(e_i), T(e_i)) implies [T(e_i)]_i² = 1, and therefore:
    |[T(x)]_i| = |x_i|.
    If we take for T a reflection with respect to the axes, then it indeed satisfies f(x, x′) = f(T(x), T(x′)). Moreover, this defines a proper equivalence relation, and therefore we have a distance on the quotient space. □
HW loses some properties compared to GW. Indeed, it is only invariant with respect to the axes, and it can only compare measures lying in the same Euclidean space in order for the distance to be well defined. Nonetheless, we show in the following that we can derive some links with triangular couplings, in the same way as the Wasserstein distance with KR.
Indeed, the cost L (13) is separable and reduces to the inner-GW loss in 1D, for which we have a closed-form solution. We can therefore define a degenerated version of it:
∀x, x′, y, y′ ∈ R^d,  L_t(x, x′, y, y′) = ∑_{k=1}^d (∏_{i=1}^{k−1} λ_t(i)) (x_k x′_k − y_k y′_k)² = (x ⊙ x′ − y ⊙ y′)^T A_t (x ⊙ x′ − y ⊙ y′)
with A_t = diag(1, λ_t(1), λ_t(1)λ_t(2), …, ∏_{i=1}^{d−1} λ_t(i)), such that for all t > 0 and all i ∈ {1, …, d−1}, λ_t(i) > 0 and λ_t(i) → 0 as t → 0. We denote by HW_t the problem (12) with the degenerated cost (14). We will therefore be able to decompose the objective as:
∬ L_t(x, x′, y, y′) dγ(x, y) dγ(x′, y′) = ∬ (x_1 x′_1 − y_1 y′_1)² dγ(x, y) dγ(x′, y′) + ∑_{k=2}^d (∏_{i=1}^{k−1} λ_t(i)) ∬ (x_k x′_k − y_k y′_k)² dγ(x, y) dγ(x′, y′)
and to use the same induction reasoning as [21].
Then, we can define a triangular coupling that differs from the Knothe–Rosenblatt rearrangement in the sense that each map will not necessarily be nondecreasing. Indeed, following Theorem 2, the solution of each 1D problem:
argmin_{γ ∈ Π(μ, ν)} ∬ (x x′ − y y′)² dγ(x, y) dγ(x′, y′)
is either (Id × T_asc)_# μ or (Id × T_desc)_# μ. Hence, at each step k ≥ 1, if we disintegrate the joint law of the k first variables as μ_{1:k} = μ_{1:k−1} ⊗ μ_{k|1:k−1}, the optimal transport map T(·|x_1, …, x_{k−1}) will be the solution of:
argmin_{T ∈ {T_asc, T_desc}} ∬ (x_k x′_k − T(x_k) T(x′_k))² μ_{k|1:k−1}(dx_k | x_{1:k−1}) μ_{k|1:k−1}(dx′_k | x_{1:k−1}).
We now state the main theorem, where we show that the limit of the OT plans obtained with the degenerated cost will be the triangular coupling we just defined.
Theorem 3.
Let μ and ν be two absolutely continuous measures on R^d with compact supports and such that ∫ ‖x‖₂⁴ μ(dx) < +∞ and ∫ ‖y‖₂⁴ ν(dy) < +∞. Let γ_t be an optimal transport plan for HW_t, let T_K be the alternate Knothe–Rosenblatt map between μ and ν as defined in the last paragraph, and let γ_K = (Id × T_K)_# μ be the associated transport plan. Then, we have γ_t → γ_K weakly (in distribution) as t → 0. Moreover, if the γ_t are induced by transport maps T_t, then T_t → T_K in L²(μ) as t → 0.
Proof. 
See Appendix B.2. □
However, we cannot extend this theorem to the subspace detour approach. Indeed, choosing A_t = V_E V_E^T + t V_{E⊥} V_{E⊥}^T with (V_E, V_{E⊥}) an orthonormal basis of R^d amounts to projecting x ⊙ x′ − y ⊙ y′ on E (respectively on E⊥), which is generally different from x_E ⊙ x′_E − y_E ⊙ y′_E (respectively x_{E⊥} ⊙ x′_{E⊥} − y_{E⊥} ⊙ y′_{E⊥}).

4.3. Solving Hadamard–Wasserstein in the Discrete Setting

In this part, we derive formulas to solve HW (12) numerically. Let x_1, …, x_n ∈ R^d, y_1, …, y_m ∈ R^d, α ∈ Σ_n, β ∈ Σ_m, and let p = ∑_{i=1}^n α_i δ_{x_i} and q = ∑_{j=1}^m β_j δ_{y_j} be two discrete measures in R^d. The Hadamard–Wasserstein problem (12) becomes, in the discrete setting:
HW²(p, q) = inf_{γ ∈ Π(p, q)} ∑_{i,j} ∑_{k,ℓ} ‖x_i ⊙ x_k − y_j ⊙ y_ℓ‖₂² γ_{i,j} γ_{k,ℓ} = inf_{γ ∈ Π(p, q)} E(γ)
with E(γ) = ∑_{i,j} ∑_{k,ℓ} ‖x_i ⊙ x_k − y_j ⊙ y_ℓ‖₂² γ_{i,j} γ_{k,ℓ}. As in [9], if we define:
L_{i,j,k,ℓ} = ‖x_i ⊙ x_k − y_j ⊙ y_ℓ‖₂²,
then we have:
E(γ) = ⟨L ⊗ γ, γ⟩,
where ⊗ is defined as:
L ⊗ γ = ( ∑_{k,ℓ} L_{i,j,k,ℓ} γ_{k,ℓ} )_{i,j} ∈ R^{n×m}.
We show in the next proposition a decomposition of L γ , which allows us to compute this tensor product more efficiently.
Proposition 7.
Let γ ∈ Π(p, q) = {M ∈ (R_+)^{n×m}, M 𝟙_m = p, M^T 𝟙_n = q}, where 𝟙_n = (1, …, 1)^T ∈ R^n. Let us denote X = (x_i ⊙ x_k)_{i,k} ∈ R^{n×n×d}, Y = (y_j ⊙ y_ℓ)_{j,ℓ} ∈ R^{m×m×d}, X^{(2)} = (‖X_{i,k}‖₂²)_{i,k} ∈ R^{n×n}, Y^{(2)} = (‖Y_{j,ℓ}‖₂²)_{j,ℓ} ∈ R^{m×m}, and, for all t ∈ {1, …, d}, X_t = (X_{i,k,t})_{i,k} ∈ R^{n×n} and Y_t = (Y_{j,ℓ,t})_{j,ℓ} ∈ R^{m×m}. Then:
L ⊗ γ = X^{(2)} p 𝟙_m^T + 𝟙_n q^T (Y^{(2)})^T − 2 ∑_{t=1}^d X_t γ Y_t^T.
Proof. 
First, we can start by writing:
L_{i,j,k,ℓ} = ‖x_i ⊙ x_k − y_j ⊙ y_ℓ‖₂² = ‖X_{i,k} − Y_{j,ℓ}‖₂² = ‖X_{i,k}‖₂² + ‖Y_{j,ℓ}‖₂² − 2⟨X_{i,k}, Y_{j,ℓ}⟩ = [X^{(2)}]_{i,k} + [Y^{(2)}]_{j,ℓ} − 2⟨X_{i,k}, Y_{j,ℓ}⟩.
We cannot directly apply Proposition 1 from [9] (as the third term is a scalar product), but by performing the same type of computation, we obtain:
L ⊗ γ = A + B + C
with
A_{i,j} = ∑_{k,ℓ} [X^{(2)}]_{i,k} γ_{k,ℓ} = ∑_k [X^{(2)}]_{i,k} ∑_ℓ γ_{k,ℓ} = ∑_k [X^{(2)}]_{i,k} [γ 𝟙_m]_k = [X^{(2)} γ 𝟙_m]_i = [X^{(2)} p]_i,
B_{i,j} = ∑_{k,ℓ} [Y^{(2)}]_{j,ℓ} γ_{k,ℓ} = ∑_ℓ [Y^{(2)}]_{j,ℓ} ∑_k γ_{k,ℓ} = ∑_ℓ [Y^{(2)}]_{j,ℓ} [γ^T 𝟙_n]_ℓ = [Y^{(2)} γ^T 𝟙_n]_j = [Y^{(2)} q]_j,
and
C_{i,j} = −2 ∑_{k,ℓ} ⟨X_{i,k}, Y_{j,ℓ}⟩ γ_{k,ℓ} = −2 ∑_{k,ℓ} ∑_{t=1}^d X_{i,k,t} Y_{j,ℓ,t} γ_{k,ℓ} = −2 ∑_{t=1}^d ∑_k [X_t]_{i,k} [Y_t γ^T]_{j,k} = −2 ∑_{t=1}^d [X_t (Y_t γ^T)^T]_{i,j} = −2 ∑_{t=1}^d [X_t γ Y_t^T]_{i,j}.
Finally, we have:
L ⊗ γ = X^{(2)} p 𝟙_m^T + 𝟙_n q^T (Y^{(2)})^T − 2 ∑_{t=1}^d X_t γ Y_t^T. □
From this decomposition, we can compute the tensor product L ⊗ γ with a complexity of O(d(n²m + m²n)) using only matrix multiplications (instead of O(dn²m²) for a naive computation).
Remark 1.
For the degenerated cost function (14), we just need to replace X and Y by X̃_t = A_t^{1/2} X and Ỹ_t = A_t^{1/2} Y in the previous proposition.
To solve this problem numerically, we can use the conditional gradient algorithm (Algorithm 2 in [41]). This algorithm only requires computing the gradient:
∇E(γ) = 2(A + B + C) = 2(L ⊗ γ)
at each step, as well as solving a classical OT problem. This algorithm is more efficient than solving the quadratic problem directly. Moreover, while it is a non-convex problem, the algorithm actually converges to a local stationary point [42].
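A compact sketch (ours) of this procedure is given below: the tensor product L ⊗ γ is computed with the decomposition of Proposition 7, and each conditional gradient step solves a linearized OT problem with ot.emd. For simplicity, we use the standard 2/(k+2) Frank–Wolfe step size instead of an exact line search; the helper names are ours.

```python
import numpy as np
import ot

def tensor_product(X, Y, p, q, G):
    """L (x) G following Proposition 7, in O(d (n^2 m + m^2 n))."""
    # X: (n, n, d) with X[i, k] = x_i * x_k (Hadamard), Y: (m, m, d) likewise
    X2 = (X ** 2).sum(-1)                    # ||X_{i,k}||^2
    Y2 = (Y ** 2).sum(-1)
    const = X2 @ p[:, None] + (Y2 @ q)[None, :]
    cross = sum(X[:, :, t] @ G @ Y[:, :, t].T for t in range(X.shape[-1]))
    return const - 2 * cross

def hadamard_wasserstein(x, y, p, q, n_iter=50):
    """Conditional gradient (Frank-Wolfe) sketch for the discrete HW problem (12)."""
    X = x[:, None, :] * x[None, :, :]        # (n, n, d)
    Y = y[:, None, :] * y[None, :, :]        # (m, m, d)
    G = np.outer(p, q)                       # feasible initialization
    for k in range(n_iter):
        grad = 2 * tensor_product(X, Y, p, q, G)
        # shift to nonnegative costs (does not change the minimizer over Pi(p, q))
        D = ot.emd(p, q, grad - grad.min())  # linear minimization oracle
        G = G + 2.0 / (k + 2) * (D - G)      # standard FW step size
    return G, (tensor_product(X, Y, p, q, G) * G).sum()
```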
In Figure 3, we generated 30 points from 2 Gaussian distributions and computed the optimal coupling of HW_t for several values of t. These points have the same uniform weights. We plot the couplings between the points on the second row, and between the points projected on their first coordinate on the first row. Note that for discrete points, the Knothe–Rosenblatt coupling amounts to sorting the points with respect to the first coordinate if there is no ambiguity (i.e., x_1^{(1)} < ⋯ < x_n^{(1)}), as it amounts to performing optimal transport in one dimension [43] (Remark 2.28). For our cost, the optimal coupling in 1D can be either the increasing or the decreasing rearrangement. We observe on the first row of Figure 3 that the optimal coupling when t is close to 0 corresponds to the decreasing rearrangement, which corresponds well to the alternate Knothe–Rosenblatt map we defined in Section 4.2. This underlines the results provided in Theorem 3.

5. Discussion

We proposed in this work to extend the subspace detour approach to different subspaces and to other optimal transport costs such as Gromov–Wasserstein. Being able to project on different subspaces can be useful when the data are not aligned and do not share the same axes of interest, as well as when we are working between different metric spaces, as is the case, for example, with graphs. However, a question that arises is how to choose these subspaces. Since the method is mostly interesting when we choose one-dimensional subspaces, we proposed to use a PCA and to project on the first directions for data embedded in Euclidean spaces. For more complicated data such as graphs, we projected onto the Fiedler vector and obtained good results in an efficient way on a 3D mesh registration problem. More generally, Muzellec and Cuturi [14] proposed to perform a gradient descent on the loss with respect to orthonormal matrices. This approach is non-convex and is only guaranteed to converge to a local minimum. Designing such an algorithm, which would alternate between minimizations over two transformations in the Stiefel manifold, is left for future work.
The subspace detour approach for transport problems is meaningful whenever one can identify subspaces that gather most of the information of the original distributions, while making the estimate more robust and with a better sample complexity as the dimensions are lower. On the computational complexity side, and when we only have access to discrete data, the subspace detour approach brings a better computational complexity solely when the subspaces are chosen to be one-dimensional. Indeed, otherwise, we have the same complexity for solving the subspace detour as for solving the OT problem directly (since the complexity only depends on the number of samples). In this case, the 1D projection often gives distinct values for all the samples (for continuous-valued data), and hence the Monge–Knothe coupling is exactly the coupling in 1D. As such, information is lost on the orthogonal spaces. It can be artificially recovered by quantizing the 1D values (as experimented in practice in [14]), but the added value is not clear and deserves broader studies. For distributions that are absolutely continuous w.r.t. the Lebesgue measure, however, this limitation does not exist, but it comes with the extra cost of being able to efficiently compute the projected measure onto the subspace, which might require a discretization of the space and is therefore not practical in high dimension.
We also proposed a new quadratic cost HW, which we call Hadamard–Wasserstein, which allows us to define a degenerated cost for which the optimal transport plan converges to a triangular coupling. However, this cost loses many of the properties that make W₂ or GW appealing. Indeed, while HW is a quadratic cost, it uses a Euclidean norm between the Hadamard products of vectors and requires the two spaces to be the same (in order for the distance to be well defined). A workaround in the case X = R^p and Y = R^q with p ≤ q would be to “lift” the vectors of R^p into vectors of R^q with padding, as is proposed in [12], or to project the vectors of R^q onto R^p as in [6]. Yet, for some applications where only the distance/similarity matrices are available, a different strategy still needs to be found. Another concern is the limited invariance properties (only with respect to axial symmetries in our case). Nevertheless, we expect that such a cost can be of interest in cases where invariance to symmetry is a desired property, such as in [44].

Author Contributions

Methodology, C.B., N.C., T.V., F.S., and L.D.; software, C.B. and N.C.; writing—original draft, C.B., N.C., T.V., F.S., and L.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by project DynaLearn from Labex CominLabs and Région Bretagne ARED DLearnMe. N.C. acknowledges funding from the ANR OTTOPIA AI chair (ANR-20-CHIA-0030). T.V. was supported in part by the AllegroAssai ANR project (ANR-19-CHIA-0009) and by the ACADEMICS grant of the IDEXLYON, project of the Université de Lyon, PIA operated by ANR-16-IDEX-0005.

Data Availability Statement

The FAUST dataset can be found at http://faust.is.tue.mpg.de (accessed on 1 September 2021).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
OT: Optimal Transport
GW: Gromov–Wasserstein
KR: Knothe–Rosenblatt
MI: Monge–Independent
MK: Monge–Knothe
PCA: Principal Component Analysis
POT: Python Optimal Transport

Appendix A. Subspace Detours

Appendix A.1. Proofs

Proof of Proposition 2. 
We first deal with L(x, x′, y, y′) = (‖x − x′‖₂² − ‖y − y′‖₂²)². Let f_{E⊥} be an isometry w.r.t. c(x_{E⊥}, x′_{E⊥}) = ‖x_{E⊥} − x′_{E⊥}‖₂², and let f: R^p → R^p be defined such that for all x ∈ R^p, f(x) = (x_E, f_{E⊥}(x_{E⊥})).
From Lemma 6 of Paty and Cuturi [27], we know that Π(f_# μ, ν) = {(f, Id)_# γ | γ ∈ Π(μ, ν)}. We can rewrite:
Π_{E,F}(f_# μ, ν) = {γ ∈ Π(f_# μ, ν) | (π^E, π^F)_# γ = γ_{E×F}^*} = {(f, Id)_# γ | γ ∈ Π(μ, ν), (π^E, π^F)_# (f, Id)_# γ = γ_{E×F}^*} = {(f, Id)_# γ | γ ∈ Π(μ, ν), (π^E, π^F)_# γ = γ_{E×F}^*} = {(f, Id)_# γ | γ ∈ Π_{E,F}(μ, ν)},
using f = (Id_E, f_{E⊥}), so that π^E ∘ f = π^E and (π^E, π^F)_# (f, Id)_# γ = (π^E, π^F)_# γ.
Now, for all γ ∈ Π_{E,F}(f_# μ, ν), there exists γ̃ ∈ Π_{E,F}(μ, ν) such that γ = (f, Id)_# γ̃, and we can disintegrate γ̃ with respect to γ_{E×F}^*:
γ̃ = γ_{E×F}^* ⊗ K
with K a probability kernel on (E⊥ × F⊥, B(E⊥) ⊗ B(F⊥)).
For γ E × F * almost every ( x E , y F ) , ( x E , y F ) , we have:
x E x E 2 2 + x E x E 2 2 y F y F 2 2 y F y F 2 2 2 ( f E , I d ) # K ( ( x E , y F ) , ( d x E , d y F ) ) ( f E , I d ) # K ( ( x E , y F ) , ( d x E , d y F ) ) = x E x E 2 2 + f E ( x E ) f E ( x E ) 2 2 y F y F 2 2 y F y F 2 2 2 K ( ( x E , y F ) , ( d x E , d y F ) ) K ( ( x E , y F ) , ( d x E , d y F ) ) = x E x E 2 2 + x E x E 2 2 y F y F 2 2 y F y F 2 2 2 K ( ( x E , y F ) , ( d x E , d y F ) ) K ( ( x E , y F ) , ( d x E , d y F ) )
using in the last line that ‖f_{E⊥}(x_{E⊥}) − f_{E⊥}(x′_{E⊥})‖₂ = ‖x_{E⊥} − x′_{E⊥}‖₂ since f_{E⊥} is an isometry.
By integrating with respect to γ E × F * , we obtain:
( x x 2 2 y y 2 2 2 ( f E , I d ) # K ( ( x E , y F ) , ( d x E , d y F ) ) ( f E , I d ) # K ( ( x E , y F ) , ( d x E , d y F ) ) ) d γ E × F * ( x E , y F ) d γ E × F * ( x E , y F ) = x x 2 2 y y 2 2 2 d γ ˜ ( x , y ) d γ ˜ ( x , y ) .
Now, we show that γ = (f, Id)_# γ̃ = γ_{E×F}^* ⊗ (f_{E⊥}, Id)_# K. Let φ be some bounded measurable function on R^p × R^q:
ϕ ( x , y ) d γ ( x , y ) = ϕ ( x , y ) d ( ( f , I d ) # γ ˜ ( x , y ) ) = ϕ ( f ( x ) , y ) d γ ˜ ( x , y ) = ϕ ( f ( x ) , y ) K ( x E , y F ) , ( d x E , d y F ) d γ E × F * ( x E , y F ) = ϕ ( ( x E , f E ( x E ) ) , y ) K ( x E , y F ) , ( d x E , d y F ) d γ E × F * ( x E , y F ) = ϕ ( x , y ) ( f E , I d ) # K ( x E , y F ) , ( d x E , d y F ) d γ E × F * ( x E , y F ) .
Hence, we can rewrite (A1) as:
x x 2 2 y y 2 2 2 d ( f , I d ) # γ ˜ ( x , y ) d ( f , I d ) # γ ˜ ( x , y ) = x x 2 2 y y 2 2 2 d γ ˜ ( x , y ) d γ ˜ ( x , y ) .
Now, by taking the infimum with respect to γ̃ ∈ Π_{E,F}(μ, ν), we find:
GW_{E,F}(f_# μ, ν) = GW_{E,F}(μ, ν).
For the inner product case, the same proof applies with linear isometries on E⊥. □

Appendix A.2. Closed-Form between Gaussians

Let q ≤ p, μ = N(m_μ, Σ) ∈ P(R^p), and ν = N(m_ν, Λ) ∈ P(R^q) be two Gaussian measures with Σ = P_μ D_μ P_μ^T and Λ = P_ν D_ν P_ν^T.
Let E ⊂ R^p be a subspace of dimension k and F ⊂ R^q a subspace of dimension k′.
We represent Σ in an orthonormal basis of E ⊕ E⊥ and Λ in an orthonormal basis of F ⊕ F⊥, i.e., Σ = [Σ_E, Σ_{EE⊥}; Σ_{E⊥E}, Σ_{E⊥}]. We denote Σ/Σ_E = Σ_{E⊥} − Σ_{EE⊥}^T Σ_E^{-1} Σ_{EE⊥} the Schur complement of Σ with respect to Σ_E. We know that the conditionals of Gaussians are Gaussians, with covariances given by the Schur complements (see, e.g., Rasmussen [28] or Von Mises [29]).

Appendix A.2.1. Quadratic GW Problem

For GW with c(x, x′) = ‖x − x′‖₂², we have for now no guarantee that there exists an optimal coupling which is a transport map. Salmona et al. [30] proposed to restrict the problem to the set of Gaussian couplings Π(μ, ν) ∩ N_{p+q}, where N_{p+q} denotes the set of Gaussian measures on R^{p+q}. In that case, the problem becomes:
GGW(μ, ν) = inf_{γ ∈ Π(μ, ν) ∩ N_{p+q}} ∬ (‖x − x′‖₂² − ‖y − y′‖₂²)² dγ(x, y) dγ(x′, y′).
In that case, they showed that an optimal solution is of the form T(x) = m_ν + P_ν A P_μ^T (x − m_μ) with A = (Ĩ_q D_ν^{1/2} (D_μ^{(q)})^{-1/2}  0_{q,p−q}) and Ĩ_q of the form diag((±1)_{i≤q}).
Since the problem is translation invariant, we can always solve the problem between the centered measures.
In the following, we suppose that k = k′. Let us denote T_{E,F} the optimal transport map for (A2) between N(0, Σ_E) and N(0, Λ_F). According to Theorem 4.1 in Salmona et al. [30], such a solution exists and is of the form (7). We also denote T_{E⊥,F⊥} the optimal transport map between N(0, Σ/Σ_E) and N(0, Λ/Λ_F) (which is well defined since we assumed p ≥ q and hence p − k ≥ q − k′ since k = k′).
We know that the Monge–Knothe transport map will be a linear map T_MK(x) = Bx with B a block triangular matrix of the form:
B = [T_{E,F}, 0_{k,p−k}; C, T_{E⊥,F⊥}] ∈ R^{q×p},
with C ∈ R^{(q−k)×k} and such that B Σ B^T = Λ (so that B is indeed a transport map between μ and ν).
Actually,
B Σ B T = T E , F Σ E T E , F T T E , F Σ E C T + T E , F Σ E E T E , F T ( C Σ E + T E , F Σ E E ) T E , F T ( C Σ E + T E , F Σ E E ) C T + ( C Σ E E + T E , F Σ E ) T E , F T .
First, we indeed have T_{E,F} Σ_E T_{E,F}^T = Λ_F, as T_{E,F} is a transport map between μ_E and ν_F. Then:
B Σ B T = Λ T E , F Σ E T E , F T = Λ F T E , F Σ E C T + T E , F Σ E E T E , F T = Λ F F ( C Σ E + T E , F Σ E E ) T E , F T = Λ F F ( C Σ E + T E , F Σ E E ) C T + ( C Σ E E + T E , F Σ E ) T E , F T = Λ F .
We have:
( C Σ E + T E , F Σ E E ) T E , F T = Λ F F C Σ E T E , F T = Λ F F T E , F Σ E E T E , F T .
As k = k , Σ E T E , F T R k × k and is invertible (as Σ E and Λ F are positive definite and T E , F = P μ E A E , F P ν F with A E , F = I ˜ k D ν F 1 1 D μ E 1 2 with positive values on the diagonals. Hence, we have:
C = (Λ_{F⊥F} (T_{E,F}^T)^{-1} − T_{E⊥,F⊥} Σ_{E⊥E}) Σ_E^{-1}.
Now, we still have to check the last two equations. First:
T E , F Σ E C T + T E , F Σ E E T E , F T = T E , F Σ E Σ E 1 T E , F 1 Λ F F T T E , F Σ E Σ E 1 Σ E E T T E , F T + T E , F Σ E E T E , F T = Λ F F .
For the last equation:
( C Σ E + T E , F Σ E E ) C T + ( C Σ E E + T E , F Σ E ) T E , F T = ( Λ F F ( T E , F T ) 1 T E , F Σ E E + T E , F Σ E E ) Σ E 1 ( T E , F 1 Λ F F T Σ E E T T E , F T ) + Λ F F ( T E , F T ) 1 Σ E 1 Σ E E T E , F T T E , F Σ E E Σ E 1 Σ E E T E , F T + T E , F Σ E T E , F T = Λ F F ( T E , F T ) 1 Σ E 1 T E , F 1 Λ F F T Λ F F ( T E , F T ) 1 Σ E 1 Σ E E T T E , F T T E , F Σ E E Σ E 1 T E , F 1 Λ F F T + T E , F Σ E E Σ E 1 Σ E E T T E , F T + T E , F Σ E E Σ E 1 T E , F 1 Λ F F T T E , F Σ E E Σ E 1 Σ E E T T E F T + Λ F F ( T E , F T ) 1 Σ E 1 Σ E E T E , F T T E , F Σ E E Σ E 1 Σ E E T T E , F T + T E , F Σ E T E , F T = Λ F F ( T E , F T ) 1 Σ E 1 T E , F 1 Λ F F T T E , F Σ E E Σ E 1 Σ E E T T E , F T + T E , F Σ E T E , F T
Now, using that ( T E , F T ) 1 Σ E 1 T E , F 1 = ( T E , F Σ E T E , F T ) 1 = Λ F 1 and Σ E Σ E E Σ E 1 Σ E E T = Σ / Σ E , we have:
( C Σ E + T E , F Σ E E ) C T + ( C Σ E E + T E , F Σ E ) T E , F T = Λ F F Λ F 1 Λ F F T + T E , F ( Σ E Σ E E Σ E 1 Σ E E T ) T E , F T = Λ F F Λ F 1 Λ F F T + Λ / Λ F = Λ F
Then, π MK is of the form ( I d , T MK ) # μ with:
T_MK(x) = m_ν + B(x − m_μ).

Appendix A.2.2. Closed-Form between Gaussians for Monge–Independent

Suppose that k′ ≤ k in order to be able to define the OT map between μ_E and ν_F.
For the Monge–Independent plan, π_MI = γ_{E×F}^* ⊗ (μ_{E⊥|E} ⊗ ν_{F⊥|F}), let (X, Y) ∼ π_MI. We know that π_MI is a degenerate Gaussian with a covariance of the form:
Cov(X, Y) = [Cov(X), C; C^T, Cov(Y)]
where Cov(X) = Σ and Cov(Y) = Λ. Moreover, we know that C is of the form:
[Cov(X_E, Y_F), Cov(X_E, Y_{F⊥}); Cov(X_{E⊥}, Y_F), Cov(X_{E⊥}, Y_{F⊥})].
Let us assume that m μ = m ν = 0 , then:
Cov ( X E , Y F ) = Cov ( X E , T E , F X E ) = E [ X E X E T ] T E , F T = Σ E T E , F T ,
Cov ( X E , Y F ) = E [ X E Y F T ] = E [ E [ X E Y F T | X E , Y F ] ] = E [ X E E [ Y F T | Y F ] ]
since Y F = T E , F X E , X E is σ ( Y F ) -measurable. Now, using the equation (A.6) from Rasmussen [28], we have:
E [ Y F | Y F ] = Λ F F Λ F 1 Y F = Λ F F Λ F 1 T E , F X E
and
E [ X E | X E ] = Σ E E Σ E 1 X E .
Hence:
Cov ( X E , Y F ) = E [ X E E [ Y F T | Y F ] ] = E [ X E X E T ] T E , F T Λ F 1 Λ F F T = Σ E T E , F T Λ F 1 Λ F F T .
We also have:
Cov ( X E , Y F ) = E [ X E X E T T E , F T ] = Σ E E T E , F T ,
and
Cov ( X E , Y F ) = E [ X E Y F T ] = E [ E [ X E Y F T | X E , Y F ] ] = E [ E [ X E | X E ] E [ Y F T | Y F ] ] by independence = E [ Σ E E Σ E 1 X E X E T T E , F T Λ F 1 Λ F F T ] = Σ E E T E , F T Λ F 1 Λ F F T .
Finally, we find:
C = Σ E T E , F T Σ E T E , F T Λ F 1 Λ F F T Σ E E T E , F T Σ E E T E , F T Λ F 1 Λ F F T .
By taking orthonormal bases (V_E, V_{E⊥}) and (V_F, V_{F⊥}), we can write it in a more compact way, as in Proposition 4 of Muzellec and Cuturi [14]:
C = (V_E Σ_E + V_{E⊥} Σ_{E⊥E}) T_{E,F}^T (V_F^T + Λ_F^{-1} Λ_{FF⊥} V_{F⊥}^T).
To check it, just expand the terms and see that C E , F = V E C V F T .

Appendix B. Knothe–Rosenblatt

Appendix B.1. Properties of (12)

Proposition A1.
In a slightly more general setting, let X_0 = X_1 = R^d, let f_0, f_1 be functions from R^d × R^d to R^d, and let μ_0 ∈ P(X_0), μ_1 ∈ P(X_1) be measures. Then, the family X_t = (X_0 × X_1, f_t, γ*) defines a geodesic between X_0 and X_1, where γ* is the optimal coupling of HW between μ_0 and μ_1, and
f_t((x_0, x′_0), (x_1, x′_1)) = (1 − t) f_0(x_0, x′_0) + t f_1(x_1, x′_1).
Proof. 
See Theorem 3.1 in [8]. □

Appendix B.2. Proof of Theorem 3

We first recall a useful theorem.
Theorem A1
(Theorem 2.8 in Billingsley [45]). Let Ω = X × Y be a separable space, and let P, P_n ∈ P(Ω) with marginals P_X (respectively P_{n,X}) and P_Y (respectively P_{n,Y}). Then, P_{n,X} ⊗ P_{n,Y} → P weakly if and only if P_{n,X} → P_X weakly, P_{n,Y} → P_Y weakly, and P = P_X ⊗ P_Y.
Proof of Theorem 3. 
The following proof is mainly inspired by the proofs of Theorem 1 in [21] (Theorem 2.1), [22] (Theorem 3.1.6), and [19] (Theorem 2.23).
Let μ, ν ∈ P(R^d) be absolutely continuous, with finite fourth moments and compact supports. We recall the problem HW_t:
HW_t²(μ, ν) = inf_{γ ∈ Π(μ, ν)} ∬ ∑_{k=1}^d (∏_{i=1}^{k−1} λ_t(i)) (x_k x′_k − y_k y′_k)² dγ(x, y) dγ(x′, y′),
with, for all t > 0 and i ∈ {1, …, d−1}, λ_t(i) > 0 and λ_t(i) → 0 as t → 0.
First, let us denote by γ_t the optimal coupling for HW_t for all t > 0. We want to show that γ_t → γ_K weakly as t → 0, with γ_K = (Id × T_K)_# μ and T_K our alternate Knothe–Rosenblatt rearrangement. Let γ ∈ Π(μ, ν) be such that γ_t → γ weakly as t → 0 (true up to a subsequence, as {μ} and {ν} are tight in P(X) and P(Y) when X and Y are Polish spaces; therefore, by [18] (Lemma 4.4), Π(μ, ν) is a tight set, and we can apply the Prokhorov theorem [19] (Box 1.4) to (γ_t)_t and extract a subsequence).
Part 1:
First, let us notice that:
HW t 2 ( μ , ν ) = k = 1 d i = 1 k 1 λ t ( i ) ( x k x k y k y k ) 2 d γ t ( x , y ) d γ t ( x , y ) = ( x 1 x 1 y 1 y 1 ) 2 d γ t ( x , y ) d γ t ( x , y ) + k = 2 d i = 1 k 1 λ t ( i ) ( x k x k y k y k ) 2 d γ t ( x , y ) d γ t ( x , y ) .
Moreover, as γ t is the optimal coupling between μ and ν , and γ K Π ( μ , ν ) ,
HW t 2 ( μ , ν ) k = 1 d i = 1 k 1 λ t ( i ) ( x k x k y k y k ) 2 d γ K ( x , y ) d γ K ( x , y ) = ( x 1 x 1 y 1 y 1 ) 2 d γ K ( x , y ) d γ K ( x , y ) + k = 2 d i = 1 k 1 λ t ( i ) ( x k x k y k y k ) 2 d γ K ( x , y ) d γ K ( x , y ) .
In our case, we have γ_t → γ weakly as t → 0; thus, by Theorem A1, we have γ_t ⊗ γ_t → γ ⊗ γ weakly. Using the fact that, for all i, λ_t(i) → 0 as t → 0 (and Lemma 1.8 of Santambrogio [19]; since we are on compact supports, we can bound the cost, which is continuous, by its maximum), we obtain the following inequality:
( x 1 x 1 y 1 y 1 ) 2 d γ ( x , y ) d γ ( x , y ) ( x 1 x 1 y 1 y 1 ) 2 d γ K ( x , y ) d γ K ( x , y ) .
Denoting γ¹ and γ_K¹ the marginals on the first variables, we can use the projection π¹(x, y) = (x_1, y_1), such that γ¹ = π¹_# γ and γ_K¹ = π¹_# γ_K. Hence, we get:
( x 1 x 1 y 1 y 1 ) 2 d γ 1 ( x 1 , y 1 ) d γ 1 ( x 1 , y 1 ) ( x 1 x 1 y 1 y 1 ) 2 d γ K 1 ( x 1 , y 1 ) d γ K 1 ( x 1 , y 1 ) .
However, $\gamma_K^1$ was constructed to be the unique optimal plan for this cost (either $T_{asc}$ or $T_{desc}$, according to [26] (Theorem 4.2.4)). Thus, we can deduce that $\gamma^1 = (\mathrm{Id} \times T_K^1)_\# \mu_1 = \gamma_K^1$.
Part 2:
We know that, for any $t > 0$, $\gamma_t$ and $\gamma_K$ share the same marginals. Thus, as previously, $\pi^1_\# \gamma_t$ has a cost at least as large as that of $\pi^1_\# \gamma_K$, which translates into
$$\iint (x_1 x_1' - y_1 y_1')^2 \, \mathrm{d}\gamma_K^1(x_1, y_1)\, \mathrm{d}\gamma_K^1(x_1', y_1') = \iint (x_1 x_1' - y_1 y_1')^2 \, \mathrm{d}\gamma^1(x_1, y_1)\, \mathrm{d}\gamma^1(x_1', y_1') \le \iint (x_1 x_1' - y_1 y_1')^2 \, \mathrm{d}\gamma_t^1(x_1, y_1)\, \mathrm{d}\gamma_t^1(x_1', y_1').$$
Therefore, we have the following inequality,
$$\iint (x_1 x_1' - y_1 y_1')^2 \, \mathrm{d}\gamma^1(x_1, y_1)\, \mathrm{d}\gamma^1(x_1', y_1') + \sum_{k=2}^{d} \prod_{i=1}^{k-1} \lambda_t^{(i)} \iint (x_k x_k' - y_k y_k')^2 \, \mathrm{d}\gamma_t(x, y)\, \mathrm{d}\gamma_t(x', y') \le HW_t^2(\mu, \nu) \le \iint (x_1 x_1' - y_1 y_1')^2 \, \mathrm{d}\gamma^1(x_1, y_1)\, \mathrm{d}\gamma^1(x_1', y_1') + \sum_{k=2}^{d} \prod_{i=1}^{k-1} \lambda_t^{(i)} \iint (x_k x_k' - y_k y_k')^2 \, \mathrm{d}\gamma_K(x, y)\, \mathrm{d}\gamma_K(x', y').$$
We can subtract the first term and factorize by $\lambda_t^{(1)} > 0$:
$$\sum_{k=2}^{d} \prod_{i=1}^{k-1} \lambda_t^{(i)} \iint (x_k x_k' - y_k y_k')^2 \, \mathrm{d}\gamma_t(x, y)\, \mathrm{d}\gamma_t(x', y') = \lambda_t^{(1)} \Big( \iint (x_2 x_2' - y_2 y_2')^2 \, \mathrm{d}\gamma_t(x, y)\, \mathrm{d}\gamma_t(x', y') + \sum_{k=3}^{d} \prod_{i=2}^{k-1} \lambda_t^{(i)} \iint (x_k x_k' - y_k y_k')^2 \, \mathrm{d}\gamma_t(x, y)\, \mathrm{d}\gamma_t(x', y') \Big) \le \lambda_t^{(1)} \Big( \iint (x_2 x_2' - y_2 y_2')^2 \, \mathrm{d}\gamma_K(x, y)\, \mathrm{d}\gamma_K(x', y') + \sum_{k=3}^{d} \prod_{i=2}^{k-1} \lambda_t^{(i)} \iint (x_k x_k' - y_k y_k')^2 \, \mathrm{d}\gamma_K(x, y)\, \mathrm{d}\gamma_K(x', y') \Big).$$
By dividing by $\lambda_t^{(1)}$ and taking the limit $t \to 0$ as in the first part, we get
$$\iint (x_2 x_2' - y_2 y_2')^2 \, \mathrm{d}\gamma(x, y)\, \mathrm{d}\gamma(x', y') \le \iint (x_2 x_2' - y_2 y_2')^2 \, \mathrm{d}\gamma_K(x, y)\, \mathrm{d}\gamma_K(x', y'). \quad \text{(A3)}$$
Now, the two terms depend only on $(x_2, y_2)$ and $(x_2', y_2')$. We project on the first two coordinates, i.e., let $\pi^{1,2}(x, y) = ((x_1, x_2), (y_1, y_2))$ and $\gamma^{1,2} = \pi^{1,2}_\# \gamma$, $\gamma_K^{1,2} = \pi^{1,2}_\# \gamma_K$. Using the disintegration of measures, we know that there exist kernels $\gamma^{2|1}$ and $\gamma_K^{2|1}$ such that $\gamma^{1,2} = \gamma^1 \otimes \gamma^{2|1}$ and $\gamma_K^{1,2} = \gamma_K^1 \otimes \gamma_K^{2|1}$, where
$$\forall A \in \mathcal{B}(X \times Y), \quad (\mu \otimes K)(A) = \iint \mathbb{1}_A(x, y)\, K(x, \mathrm{d}y)\, \mu(\mathrm{d}x).$$
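For instance, in the discrete case, if $\mu = \sum_i a_i \delta_{x_i}$ and $K(x_i, \cdot) = \sum_j k_{ij} \delta_{y_j}$ with $\sum_j k_{ij} = 1$ for each $i$, then $\mu \otimes K = \sum_{i,j} a_i k_{ij}\, \delta_{(x_i, y_j)}$: a coupling whose first marginal is $\mu$ and whose conditional distributions given $x_i$ are the kernels $K(x_i, \cdot)$.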
We can rewrite the previous Equation (A3) as
$$\iint (x_2 x_2' - y_2 y_2')^2 \, \mathrm{d}\gamma(x, y)\, \mathrm{d}\gamma(x', y') = \iint\!\!\iint (x_2 x_2' - y_2 y_2')^2 \, \gamma^{2|1}\big((x_1, y_1), (\mathrm{d}x_2, \mathrm{d}y_2)\big)\, \gamma^{2|1}\big((x_1', y_1'), (\mathrm{d}x_2', \mathrm{d}y_2')\big)\, \mathrm{d}\gamma^1(x_1, y_1)\, \mathrm{d}\gamma^1(x_1', y_1') \le \iint\!\!\iint (x_2 x_2' - y_2 y_2')^2 \, \gamma_K^{2|1}\big((x_1, y_1), (\mathrm{d}x_2, \mathrm{d}y_2)\big)\, \gamma_K^{2|1}\big((x_1', y_1'), (\mathrm{d}x_2', \mathrm{d}y_2')\big)\, \mathrm{d}\gamma_K^1(x_1, y_1)\, \mathrm{d}\gamma_K^1(x_1', y_1'). \quad \text{(A4)}$$
Let us first assume that the marginals of $\gamma^{2|1}((x_1, y_1), \cdot)$ are indeed $\mu^{2|1}(x_1, \cdot)$ and $\nu^{2|1}(y_1, \cdot)$. Then, by definition of $\gamma_K^{2|1}$, as it is optimal for the GW cost with inner products, we have, for all $(x_1, y_1), (x_1', y_1')$,
$$\iint (x_2 x_2' - y_2 y_2')^2 \, \gamma_K^{2|1}\big((x_1, y_1), (\mathrm{d}x_2, \mathrm{d}y_2)\big)\, \gamma_K^{2|1}\big((x_1', y_1'), (\mathrm{d}x_2', \mathrm{d}y_2')\big) \le \iint (x_2 x_2' - y_2 y_2')^2 \, \gamma^{2|1}\big((x_1, y_1), (\mathrm{d}x_2, \mathrm{d}y_2)\big)\, \gamma^{2|1}\big((x_1', y_1'), (\mathrm{d}x_2', \mathrm{d}y_2')\big). \quad \text{(A5)}$$
Moreover, we know from the first part that $\gamma^1 = \gamma_K^1$; then, by integrating with respect to $(x_1, y_1)$ and $(x_1', y_1')$, we have
$$\iint\!\!\iint (x_2 x_2' - y_2 y_2')^2 \, \gamma_K^{2|1}\big((x_1, y_1), (\mathrm{d}x_2, \mathrm{d}y_2)\big)\, \gamma_K^{2|1}\big((x_1', y_1'), (\mathrm{d}x_2', \mathrm{d}y_2')\big)\, \mathrm{d}\gamma^1(x_1, y_1)\, \mathrm{d}\gamma^1(x_1', y_1') \le \iint\!\!\iint (x_2 x_2' - y_2 y_2')^2 \, \gamma^{2|1}\big((x_1, y_1), (\mathrm{d}x_2, \mathrm{d}y_2)\big)\, \gamma^{2|1}\big((x_1', y_1'), (\mathrm{d}x_2', \mathrm{d}y_2')\big)\, \mathrm{d}\gamma^1(x_1, y_1)\, \mathrm{d}\gamma^1(x_1', y_1'). \quad \text{(A6)}$$
By (A4) and (A6), we deduce that we actually have an equality, and we get
$$\iint \bigg( \iint (x_2 x_2' - y_2 y_2')^2 \, \gamma^{2|1}\big((x_1, y_1), (\mathrm{d}x_2, \mathrm{d}y_2)\big)\, \gamma^{2|1}\big((x_1', y_1'), (\mathrm{d}x_2', \mathrm{d}y_2')\big) - \iint (x_2 x_2' - y_2 y_2')^2 \, \gamma_K^{2|1}\big((x_1, y_1), (\mathrm{d}x_2, \mathrm{d}y_2)\big)\, \gamma_K^{2|1}\big((x_1', y_1'), (\mathrm{d}x_2', \mathrm{d}y_2')\big) \bigg)\, \mathrm{d}\gamma^1(x_1, y_1)\, \mathrm{d}\gamma^1(x_1', y_1') = 0. \quad \text{(A7)}$$
However, we know by (A5) that the inner part of (A7) is nonnegative; thus, for $\gamma^1$-a.e. $(x_1, y_1), (x_1', y_1')$,
$$\iint (x_2 x_2' - y_2 y_2')^2 \, \gamma_K^{2|1}\big((x_1, y_1), (\mathrm{d}x_2, \mathrm{d}y_2)\big)\, \gamma_K^{2|1}\big((x_1', y_1'), (\mathrm{d}x_2', \mathrm{d}y_2')\big) = \iint (x_2 x_2' - y_2 y_2')^2 \, \gamma^{2|1}\big((x_1, y_1), (\mathrm{d}x_2, \mathrm{d}y_2)\big)\, \gamma^{2|1}\big((x_1', y_1'), (\mathrm{d}x_2', \mathrm{d}y_2')\big).$$
From this, we can conclude as in the first part that $\gamma^{2|1} = \gamma_K^{2|1}$ (by uniqueness of the optimal map), and thus $\gamma^{1,2} = \gamma_K^{1,2}$.
It remains to show that the marginals of $\gamma^{2|1}((x_1, y_1), \cdot)$ and $\gamma_K^{2|1}((x_1, y_1), \cdot)$ are indeed the same, i.e., $\mu^{2|1}(x_1, \cdot)$ and $\nu^{2|1}(y_1, \cdot)$. Let $\phi$ and $\psi$ be continuous functions; then we have to show that, for $\gamma^1$-a.e. $(x_1, y_1)$, we have
$$\int \phi(x_2)\, \gamma^{2|1}\big((x_1, y_1), (\mathrm{d}x_2, \mathrm{d}y_2)\big) = \int \phi(x_2)\, \mu^{2|1}(x_1, \mathrm{d}x_2), \qquad \int \psi(y_2)\, \gamma^{2|1}\big((x_1, y_1), (\mathrm{d}x_2, \mathrm{d}y_2)\big) = \int \psi(y_2)\, \nu^{2|1}(y_1, \mathrm{d}y_2).$$
As we want to prove this for $\gamma^1$-a.e. $(x_1, y_1)$, it is sufficient to prove that, for every continuous function $\xi$,
$$\iint \xi(x_1, y_1)\, \phi(x_2)\, \gamma^{2|1}\big((x_1, y_1), (\mathrm{d}x_2, \mathrm{d}y_2)\big)\, \mathrm{d}\gamma^1(x_1, y_1) = \iint \xi(x_1, y_1)\, \phi(x_2)\, \mu^{2|1}(x_1, \mathrm{d}x_2)\, \mathrm{d}\gamma^1(x_1, y_1), \qquad \iint \xi(x_1, y_1)\, \psi(y_2)\, \gamma^{2|1}\big((x_1, y_1), (\mathrm{d}x_2, \mathrm{d}y_2)\big)\, \mathrm{d}\gamma^1(x_1, y_1) = \iint \xi(x_1, y_1)\, \psi(y_2)\, \nu^{2|1}(y_1, \mathrm{d}y_2)\, \mathrm{d}\gamma^1(x_1, y_1).$$
First, we can use the projections $\pi^x(x, y) = x$ and $\pi^y(x, y) = y$. Moreover, we know that $\gamma^1 = (\mathrm{Id} \times T_K^1)_\# \mu_1$. The alternate Knothe–Rosenblatt rearrangement is, as the usual one, bijective (because $\mu$ and $\nu$ are absolutely continuous); thus, since we assume that $\nu$ satisfies the same hypotheses as $\mu$, we also have $\gamma^1 = ((T_K^1)^{-1} \times \mathrm{Id})_\# \nu_1$. Let us write $\tilde{T}_K^1 = (T_K^1)^{-1}$. Then, the equalities that we want to show are:
$$\iint \xi\big(x_1, T_K^1(x_1)\big)\, \phi(x_2)\, \gamma_x^{2|1}\big((x_1, T_K^1(x_1)), \mathrm{d}x_2\big)\, \mathrm{d}\mu_1(x_1) = \iint \xi\big(x_1, T_K^1(x_1)\big)\, \phi(x_2)\, \mu^{2|1}(x_1, \mathrm{d}x_2)\, \mathrm{d}\mu_1(x_1), \qquad \iint \xi\big(\tilde{T}_K^1(y_1), y_1\big)\, \psi(y_2)\, \gamma_y^{2|1}\big((\tilde{T}_K^1(y_1), y_1), \mathrm{d}y_2\big)\, \mathrm{d}\nu_1(y_1) = \iint \xi\big(\tilde{T}_K^1(y_1), y_1\big)\, \psi(y_2)\, \nu^{2|1}(y_1, \mathrm{d}y_2)\, \mathrm{d}\nu_1(y_1),$$
where $\gamma_x^{2|1}$ and $\gamma_y^{2|1}$ denote the marginals of $\gamma^{2|1}$ on the $x$ and $y$ variables, respectively.
Indeed, we have
$$\iint \xi\big(x_1, T_K^1(x_1)\big)\, \phi(x_2)\, \gamma_x^{2|1}\big((x_1, T_K^1(x_1)), \mathrm{d}x_2\big)\, \mathrm{d}\mu_1(x_1) = \int \xi\big(x_1, T_K^1(x_1)\big)\, \phi(x_2)\, \mathrm{d}\gamma^{1,2}\big((x_1, x_2), (y_1, y_2)\big) = \int \xi\big(x_1, T_K^1(x_1)\big)\, \phi(x_2)\, \mathrm{d}\gamma_x^{1,2}(x_1, x_2) = \iint \xi\big(x_1, T_K^1(x_1)\big)\, \phi(x_2)\, \mu^{2|1}(x_1, \mathrm{d}x_2)\, \mathrm{d}\mu_1(x_1).$$
We can do the same for the ν part by symmetry.
Part 3:
Now, we can proceed in the same way by induction. Let $\ell \in \{2, \dots, d\}$ and suppose that the result is true in dimension $\ell - 1$ (i.e., $\gamma^{1:\ell-1} = \pi^{1:\ell-1}_\# \gamma = \gamma_K^{1:\ell-1}$).
For this part of the proof, we rely on [19] (Theorem 2.23). We can build a measure $\gamma_K^t \in \mathcal{P}(\mathbb{R}^d \times \mathbb{R}^d)$ such that:
$$\pi^x_\# \gamma_K^t = \mu, \qquad \pi^y_\# \gamma_K^t = \nu, \qquad \pi^{1:\ell-1}_\# \gamma_K^t = \eta_{t,\ell}, \quad \text{(A8)}$$
where $\eta_{t,\ell}$ is the optimal transport plan between $\mu^{1:\ell-1} = \pi^{1:\ell-1}_\# \mu$ and $\nu^{1:\ell-1} = \pi^{1:\ell-1}_\# \nu$ for the objective:
$$\sum_{k=1}^{\ell-1} \prod_{i=1}^{k-1} \lambda_t^{(i)} \iint (x_k x_k' - y_k y_k')^2 \, \mathrm{d}\gamma(x, y)\, \mathrm{d}\gamma(x', y').$$
By the induction hypothesis, we have $\eta_{t,\ell} \xrightarrow[t \to 0]{\mathcal{D}} \pi^{1:\ell-1}_\# \gamma_K$. To build such a measure, we can first disintegrate $\mu$ and $\nu$:
$$\mu = \mu^{1:\ell-1} \otimes \mu^{\ell:d\,|\,1:\ell-1}, \qquad \nu = \nu^{1:\ell-1} \otimes \nu^{\ell:d\,|\,1:\ell-1},$$
and then pick the Knothe transport $\gamma_K^{\ell:d\,|\,1:\ell-1}$ between $\mu^{\ell:d\,|\,1:\ell-1}$ and $\nu^{\ell:d\,|\,1:\ell-1}$. Thus, taking $\gamma_K^t = \eta_{t,\ell} \otimes \gamma_K^{\ell:d\,|\,1:\ell-1}$, $\gamma_K^t$ indeed satisfies the conditions (A8).
Hence, we have:
$$\sum_{k=1}^{\ell-1} \prod_{i=1}^{k-1} \lambda_t^{(i)} \iint (x_k x_k' - y_k y_k')^2 \, \mathrm{d}\gamma_K^t(x, y)\, \mathrm{d}\gamma_K^t(x', y') = \sum_{k=1}^{\ell-1} \prod_{i=1}^{k-1} \lambda_t^{(i)} \iint (x_k x_k' - y_k y_k')^2 \, \mathrm{d}\eta_{t,\ell}(x_{1:\ell-1}, y_{1:\ell-1})\, \mathrm{d}\eta_{t,\ell}(x_{1:\ell-1}', y_{1:\ell-1}') \le \sum_{k=1}^{\ell-1} \prod_{i=1}^{k-1} \lambda_t^{(i)} \iint (x_k x_k' - y_k y_k')^2 \, \mathrm{d}\gamma_t(x, y)\, \mathrm{d}\gamma_t(x', y'),$$
and therefore:
$$\sum_{k=1}^{\ell-1} \prod_{i=1}^{k-1} \lambda_t^{(i)} \iint (x_k x_k' - y_k y_k')^2 \, \mathrm{d}\gamma_K^t(x, y)\, \mathrm{d}\gamma_K^t(x', y') + \sum_{k=\ell}^{d} \prod_{i=1}^{k-1} \lambda_t^{(i)} \iint (x_k x_k' - y_k y_k')^2 \, \mathrm{d}\gamma_t(x, y)\, \mathrm{d}\gamma_t(x', y') \le HW_t^2(\mu, \nu) \le \sum_{k=1}^{\ell-1} \prod_{i=1}^{k-1} \lambda_t^{(i)} \iint (x_k x_k' - y_k y_k')^2 \, \mathrm{d}\gamma_K^t(x, y)\, \mathrm{d}\gamma_K^t(x', y') + \sum_{k=\ell}^{d} \prod_{i=1}^{k-1} \lambda_t^{(i)} \iint (x_k x_k' - y_k y_k')^2 \, \mathrm{d}\gamma_K^t(x, y)\, \mathrm{d}\gamma_K^t(x', y').$$
As before, by subtracting the first term and dividing by $\prod_{i=1}^{\ell-1} \lambda_t^{(i)}$, we obtain, up to terms that vanish when $t \to 0$:
$$\iint (x_\ell x_\ell' - y_\ell y_\ell')^2 \, \mathrm{d}\gamma_t(x, y)\, \mathrm{d}\gamma_t(x', y') \le \iint (x_\ell x_\ell' - y_\ell y_\ell')^2 \, \mathrm{d}\gamma_K^t(x, y)\, \mathrm{d}\gamma_K^t(x', y').$$
For the right-hand side, using that $\gamma_K^t = \eta_{t,\ell} \otimes \gamma_K^{\ell:d\,|\,1:\ell-1}$, we have:
$$\iint (x_\ell x_\ell' - y_\ell y_\ell')^2 \, \mathrm{d}\gamma_K^t(x, y)\, \mathrm{d}\gamma_K^t(x', y') = \iint\!\!\iint (x_\ell x_\ell' - y_\ell y_\ell')^2 \, \gamma_K^{\ell:d\,|\,1:\ell-1}\big((x_{1:\ell-1}, y_{1:\ell-1}), (\mathrm{d}x_{\ell:d}, \mathrm{d}y_{\ell:d})\big)\, \gamma_K^{\ell:d\,|\,1:\ell-1}\big((x_{1:\ell-1}', y_{1:\ell-1}'), (\mathrm{d}x_{\ell:d}', \mathrm{d}y_{\ell:d}')\big)\, \mathrm{d}\eta_{t,\ell}(x_{1:\ell-1}, y_{1:\ell-1})\, \mathrm{d}\eta_{t,\ell}(x_{1:\ell-1}', y_{1:\ell-1}') = \iint\!\!\iint (x_\ell x_\ell' - y_\ell y_\ell')^2 \, \gamma_K^{\ell\,|\,1:\ell-1}\big((x_{1:\ell-1}, y_{1:\ell-1}), (\mathrm{d}x_\ell, \mathrm{d}y_\ell)\big)\, \gamma_K^{\ell\,|\,1:\ell-1}\big((x_{1:\ell-1}', y_{1:\ell-1}'), (\mathrm{d}x_\ell', \mathrm{d}y_\ell')\big)\, \mathrm{d}\eta_{t,\ell}(x_{1:\ell-1}, y_{1:\ell-1})\, \mathrm{d}\eta_{t,\ell}(x_{1:\ell-1}', y_{1:\ell-1}').$$
Let us denote, for $\eta_{t,\ell}$-almost every $(x_{1:\ell-1}, y_{1:\ell-1}), (x_{1:\ell-1}', y_{1:\ell-1}')$,
$$GW\big(\mu^{\ell\,|\,1:\ell-1}, \nu^{\ell\,|\,1:\ell-1}\big) = \iint (x_\ell x_\ell' - y_\ell y_\ell')^2 \, \gamma_K^{\ell\,|\,1:\ell-1}\big((x_{1:\ell-1}, y_{1:\ell-1}), (\mathrm{d}x_\ell, \mathrm{d}y_\ell)\big)\, \gamma_K^{\ell\,|\,1:\ell-1}\big((x_{1:\ell-1}', y_{1:\ell-1}'), (\mathrm{d}x_\ell', \mathrm{d}y_\ell')\big),$$
then
$$\iint (x_\ell x_\ell' - y_\ell y_\ell')^2 \, \mathrm{d}\gamma_K^t(x, y)\, \mathrm{d}\gamma_K^t(x', y') = \iint GW\big(\mu^{\ell\,|\,1:\ell-1}, \nu^{\ell\,|\,1:\ell-1}\big)\, \mathrm{d}\eta_{t,\ell}(x_{1:\ell-1}, y_{1:\ell-1})\, \mathrm{d}\eta_{t,\ell}(x_{1:\ell-1}', y_{1:\ell-1}').$$
By Theorem A1, we have $\eta_{t,\ell} \otimes \eta_{t,\ell} \xrightarrow[t \to 0]{\mathcal{D}} \pi^{1:\ell-1}_\# \gamma_K \otimes \pi^{1:\ell-1}_\# \gamma_K$. So, if
$$\eta \mapsto \iint GW\big(\mu^{\ell\,|\,1:\ell-1}, \nu^{\ell\,|\,1:\ell-1}\big)\, \mathrm{d}\eta\, \mathrm{d}\eta$$
is continuous over the transport plans between $\mu^{1:\ell-1}$ and $\nu^{1:\ell-1}$, we have
$$\iint (x_\ell x_\ell' - y_\ell y_\ell')^2 \, \mathrm{d}\gamma_K^t(x, y)\, \mathrm{d}\gamma_K^t(x', y') \xrightarrow[t \to 0]{} \iint GW\big(\mu^{\ell\,|\,1:\ell-1}, \nu^{\ell\,|\,1:\ell-1}\big)\, \pi^{1:\ell-1}_\# \gamma_K(\mathrm{d}x_{1:\ell-1}, \mathrm{d}y_{1:\ell-1})\, \pi^{1:\ell-1}_\# \gamma_K(\mathrm{d}x_{1:\ell-1}', \mathrm{d}y_{1:\ell-1}')$$
and
$$\iint GW\big(\mu^{\ell\,|\,1:\ell-1}, \nu^{\ell\,|\,1:\ell-1}\big)\, \pi^{1:\ell-1}_\# \gamma_K(\mathrm{d}x_{1:\ell-1}, \mathrm{d}y_{1:\ell-1})\, \pi^{1:\ell-1}_\# \gamma_K(\mathrm{d}x_{1:\ell-1}', \mathrm{d}y_{1:\ell-1}') = \iint (x_\ell x_\ell' - y_\ell y_\ell')^2 \, \mathrm{d}\gamma_K(x, y)\, \mathrm{d}\gamma_K(x', y')$$
by plugging in the expression of $GW$ and using the disintegration $\gamma_K = \pi^{1:\ell-1}_\# \gamma_K \otimes \gamma_K^{\cdot\,|\,1:\ell-1}$.
For the continuity, we can apply [19] (Lemma 1.8) (as in [19] (Corollary 2.24)) with $X = Y = \mathbb{R}^{\ell-1} \times \mathbb{R}^{\ell-1}$, $\tilde{X} = \tilde{Y} = \mathcal{P}(\Omega)$ with $\Omega \subset \mathbb{R}^{d-\ell+1} \times \mathbb{R}^{d-\ell+1}$, and $c(a, b) = GW(a, b)$, which can be bounded on compact supports by $\max |c|$. Moreover, we use Theorem A1 and the fact that $\eta_{t,\ell} \otimes \eta_{t,\ell} \xrightarrow[t \to 0]{\mathcal{D}} \gamma_K^{1:\ell-1} \otimes \gamma_K^{1:\ell-1}$.
By taking the limit $t \to 0$, we now obtain:
$$\iint (x_\ell x_\ell' - y_\ell y_\ell')^2 \, \mathrm{d}\gamma(x, y)\, \mathrm{d}\gamma(x', y') \le \iint (x_\ell x_\ell' - y_\ell y_\ell')^2 \, \mathrm{d}\gamma_K(x, y)\, \mathrm{d}\gamma_K(x', y').$$
We can now disintegrate with respect to $\gamma^{1:\ell-1}$ as before. We just need to prove that the marginals coincide, which is done by taking as test functions
$$\xi(x_1, \dots, x_{\ell-1}, y_1, \dots, y_{\ell-1})\, \phi(x_\ell), \qquad \xi(x_1, \dots, x_{\ell-1}, y_1, \dots, y_{\ell-1})\, \psi(y_\ell),$$
and using the fact that the measures are concentrated on $y_k = T_K^k(x_{1:k})$.
Part 4:
Therefore, we indeed have $\gamma_t \xrightarrow[t \to 0]{\mathcal{D}} \gamma_K$. Finally, for the $L^2$ convergence, we have:
$$\int \|T_t(x) - T_K(x)\|_2^2\, \mu(\mathrm{d}x) = \int \|y - T_K(x)\|_2^2\, \mathrm{d}\gamma_t(x, y) \xrightarrow[t \to 0]{} \int \|y - T_K(x)\|_2^2\, \mathrm{d}\gamma_K(x, y) = 0,$$
since $\gamma_t = (\mathrm{Id} \times T_t)_\# \mu$ and $\gamma_K = (\mathrm{Id} \times T_K)_\# \mu$. Hence, $T_t \xrightarrow[t \to 0]{L^2} T_K$. □
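As a complement, the following NumPy sketch (ours, purely illustrative; the function name is hypothetical) implements the one-dimensional building block used coordinate by coordinate by the alternate Knothe–Rosenblatt rearrangement: for the inner-product cost $(x x' - y y')^2$, the optimal map between two one-dimensional samples of equal size is a monotone rearrangement, either ascending ($T_{asc}$) or descending ($T_{desc}$), whichever gives the smaller cost (cf. [26] (Theorem 4.2.4)). We assume uniform weights.

```python
import numpy as np

def gw_1d_inner_product_map(x, y):
    """Ascending or descending rearrangement for the cost (x x' - y y')^2.

    x, y: 1D arrays of the same length, seen as uniform empirical measures.
    Returns T with T[i] the image of x[i].
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    ix, iy = np.argsort(x), np.argsort(y)
    a, c = np.mean(x ** 2), np.mean(y ** 2)
    b_asc = np.mean(x[ix] * y[iy])           # ascending-ascending matching
    b_desc = np.mean(x[ix] * y[iy[::-1]])    # ascending-descending matching
    # The cost of a monotone matching is a^2 - 2 b^2 + c^2, so the larger |b| wins.
    T = np.empty_like(x)
    if b_asc ** 2 >= b_desc ** 2:
        T[ix] = y[iy]                        # T_asc
    else:
        T[ix] = y[iy[::-1]]                  # T_desc
    return T
```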

References

  1. Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein generative adversarial networks. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, Australia, 6–11 August 2017; pp. 214–223. [Google Scholar]
  2. Courty, N.; Flamary, R.; Tuia, D.; Rakotomamonjy, A. Optimal transport for domain adaptation. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1853–1865. [Google Scholar] [CrossRef] [PubMed]
  3. Mémoli, F. Gromov–Wasserstein distances and the metric approach to object matching. Found. Comput. Math. 2011, 11, 417–487. [Google Scholar] [CrossRef]
  4. Chowdhury, S.; Miller, D.; Needham, T. Quantized Gromov–Wasserstein. In Proceedings of the Machine Learning and Knowledge Discovery in Databases, Research Track—European Conference, ECML PKDD 2021, Bilbao, Spain, 13–17 September 2021; Proceedings, Part III. Oliver, N., Pérez-Cruz, F., Kramer, S., Read, J., Lozano, J.A., Eds.; Springer: Berlin/Heidelberg, Germany, 2021; Volume 12977, pp. 811–827. [Google Scholar] [CrossRef]
  5. Alvarez-Melis, D.; Jegelka, S.; Jaakkola, T.S. Towards optimal transport with global invariances. In Proceedings of the The 22nd International Conference on Artificial Intelligence and Statistics, PMLR, Naha, Japan, 16–18 April 2019; pp. 1870–1879. [Google Scholar]
  6. Cai, Y.; Lim, L.H. Distances between probability distributions of different dimensions. arXiv 2020, arXiv:2011.00629. [Google Scholar]
  7. Mémoli, F. On the use of Gromov-Hausdorff Distances for Shape Comparison. In Proceedings of the 4th Symposium on Point Based Graphics, PBG@Eurographics 2007, Prague, Czech Republic, 2–3 September 2007; Botsch, M., Pajarola, R., Chen, B., Zwicker, M., Eds.; Eurographics Association: Switzerland, Geneve, 2007; pp. 81–90. [Google Scholar] [CrossRef]
  8. Sturm, K.T. The space of spaces: Curvature bounds and gradient flows on the space of metric measure spaces. arXiv 2012, arXiv:1208.0434. [Google Scholar]
  9. Peyré, G.; Cuturi, M.; Solomon, J. Gromov–Wasserstein averaging of kernel and distance matrices. In Proceedings of the International Conference on Machine Learning, PMLR, New York, NY, USA, 19–24 June 2016; pp. 2664–2672. [Google Scholar]
  10. Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. Adv. Neural Inf. Process. Syst. 2013, 26, 2292–2300. [Google Scholar]
  11. Scetbon, M.; Peyré, G.; Cuturi, M. Linear-Time Gromov Wasserstein Distances using Low Rank Couplings and Costs. arXiv 2021, arXiv:2106.01128. [Google Scholar]
  12. Vayer, T.; Flamary, R.; Courty, N.; Tavenard, R.; Chapel, L. Sliced Gromov–Wasserstein. Adv. Neural Inf. Process. Syst. 2019, 32, 14753–14763. [Google Scholar]
  13. Fatras, K.; Zine, Y.; Majewski, S.; Flamary, R.; Gribonval, R.; Courty, N. Minibatch optimal transport distances; analysis and applications. arXiv 2021, arXiv:2101.01792. [Google Scholar]
  14. Muzellec, B.; Cuturi, M. Subspace Detours: Building Transport Plans that are Optimal on Subspace Projections. In Proceedings of the Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, Vancouver, BC, Canada, 8–14 December 2019; Wallach, H.M., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E.B., Garnett, R., Eds.; 2019; pp. 6914–6925. [Google Scholar]
  15. Bogachev, V.I.; Kolesnikov, A.V.; Medvedev, K.V. Triangular transformations of measures. Sb. Math. 2005, 196, 309. [Google Scholar] [CrossRef] [Green Version]
  16. Knothe, H. Contributions to the theory of convex bodies. Mich. Math. J. 1957, 4, 39–52. [Google Scholar] [CrossRef]
  17. Rosenblatt, M. Remarks on a multivariate transformation. Ann. Math. Stat. 1952, 23, 470–472. [Google Scholar] [CrossRef]
  18. Villani, C. Optimal Transport: Old and New; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2008; Volume 338. [Google Scholar]
  19. Santambrogio, F. Optimal transport for applied mathematicians. Birkhäuser NY 2015, 55, 94. [Google Scholar]
  20. Jaini, P.; Selby, K.A.; Yu, Y. Sum-of-Squares Polynomial Flow. In Proceedings of the 36th International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 3009–3018. [Google Scholar]
  21. Carlier, G.; Galichon, A.; Santambrogio, F. From Knothe’s transport to Brenier’s map and a continuation method for optimal transport. SIAM J. Math. Anal. 2010, 41, 2554–2576. [Google Scholar] [CrossRef] [Green Version]
  22. Bonnotte, N. Unidimensional and Evolution Methods for Optimal Transportation. Ph.D. Thesis, Université Paris-Sud, Paris, France, 2013. [Google Scholar]
  23. Ambrosio, L.; Gigli, N.; Savaré, G. Gradient Flows: In Metric Spaces and in the Space of Probability Measures; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2008. [Google Scholar]
  24. Niles-Weed, J.; Rigollet, P. Estimation of Wasserstein distances in the spiked transport model. arXiv 2019, arXiv:1909.07513. [Google Scholar]
  25. Chowdhury, S.; Mémoli, F. The Gromov–Wasserstein distance between networks and stable network invariants. Inf. Inference A J. IMA 2019, 8, 757–787. [Google Scholar] [CrossRef] [Green Version]
  26. Vayer, T. A Contribution to Optimal Transport on Incomparable Spaces. Ph.D. Thesis, Université de Bretagne Sud, Vannes, France, 2020. [Google Scholar]
  27. Paty, F.P.; Cuturi, M. Subspace robust Wasserstein distances. In Proceedings of the 36th International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 5072–5081. [Google Scholar]
  28. Rasmussen, C.E. Gaussian processes in machine learning. In Summer School on Machine Learning; Springer: Berlin/Heidelberg, Germany, 2003; pp. 63–71. [Google Scholar]
  29. Von Mises, R. Mathematical Theory of Probability and Statistics; Academic Press: Cambridge, MA, USA, 1964. [Google Scholar]
  30. Salmona, A.; Delon, J.; Desolneux, A. Gromov–Wasserstein Distances between Gaussian Distributions. arXiv 2021, arXiv:2104.07970. [Google Scholar]
  31. Weed, J.; Bach, F. Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. Bernoulli 2019, 25, 2620–2648. [Google Scholar] [CrossRef] [Green Version]
  32. Lin, T.; Zheng, Z.; Chen, E.; Cuturi, M.; Jordan, M. On projection robust optimal transport: Sample complexity and model misspecification. In Proceedings of the International Conference on Artificial Intelligence and Statistics, PMLR, Virtual, 13–15 April 2021; pp. 262–270. [Google Scholar]
  33. Burkard, R.E.; Klinz, B.; Rudolf, R. Perspectives of Monge Properties in Optimization. Discret. Appl. Math. 1996, 70, 95–161. [Google Scholar] [CrossRef] [Green Version]
  34. Flamary, R.; Courty, N.; Gramfort, A.; Alaya, M.Z.; Boisbunon, A.; Chambon, S.; Chapel, L.; Corenflos, A.; Fatras, K.; Fournier, N.; et al. POT: Python Optimal Transport. J. Mach. Learn. Res. 2021, 22, 1–8. [Google Scholar]
  35. Bogo, F.; Romero, J.; Loper, M.; Black, M.J. FAUST: Dataset and evaluation for 3D mesh registration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 3794–3801. [Google Scholar]
  36. Fiedler, M. Algebraic connectivity of graphs. Czechoslov. Math. J. 1973, 23, 298–305. [Google Scholar] [CrossRef]
  37. Hagberg, A.; Swart, P.; Chult, D.S. Exploring Network Structure, Dynamics, and Function Using NetworkX; Technical Report; Los Alamos National Lab. (LANL): Los Alamos, NM, USA, 2008.
  38. Wu, J.P.; Song, J.Q.; Zhang, W.M. An efficient and accurate method to compute the Fiedler vector based on Householder deflation and inverse power iteration. J. Comput. Appl. Math. 2014, 269, 101–108. [Google Scholar] [CrossRef]
  39. Xu, H.; Luo, D.; Carin, L. Scalable Gromov–Wasserstein learning for graph partitioning and matching. Adv. Neural Inf. Process. Syst. 2019, 32, 3052–3062. [Google Scholar]
  40. Chowdhury, S.; Needham, T. Generalized spectral clustering via Gromov–Wasserstein learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics, PMLR, Virtual, 13–15 April 2021; pp. 712–720. [Google Scholar]
  41. Vayer, T.; Courty, N.; Tavenard, R.; Flamary, R. Optimal transport for structured data with application on graphs. In Proceedings of the 36th International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 6275–6284. [Google Scholar]
  42. Lacoste-Julien, S. Convergence rate of frank-wolfe for non-convex objectives. arXiv 2016, arXiv:1607.00345. [Google Scholar]
  43. Peyré, G.; Cuturi, M. Computational optimal transport: With applications to data science. Found. Trends® Mach. Learn. 2019, 11, 355–607. [Google Scholar] [CrossRef]
  44. Nagar, R.; Raman, S. Detecting approximate reflection symmetry in a point set using optimization on manifold. IEEE Trans. Signal Process. 2019, 67, 1582–1595. [Google Scholar] [CrossRef] [Green Version]
  45. Billingsley, P. Convergence of Probability Measures; John Wiley & Sons: Hoboken, NJ, USA, 2013. [Google Scholar]
Figure 1. From left to right: Data (moons); OT plan obtained with GW for $c(x, x') = \|x - x'\|_2^2$; Data projected on the first axis; OT plan obtained between the projected measures; Data projected on their first PCA component; OT plan obtained between the projected measures.
Figure 2. Three-dimensional mesh registration. (First row) Source and target meshes, color code of the source, ground truth color code on the target, and result of the subspace detour using Fiedler vectors as the subspace. (Second row) After recalling the expected ground truth for ease of comparison, we present the results of different Gromov–Wasserstein mappings obtained with metrics based on adjacency, heat kernel, and geodesic distances.
Figure 3. Degenerated coupling. On the first row, the points are projected on their first coordinate, and we plot the optimal coupling. On the second row, we plot the optimal coupling between the original points.