Fused Gromov-Wasserstein Distance for Structured Objects

Titouan Vayer; Laetitia Chapel; Remi Flamary; Romain Tavenard; Nicolas Courty

doi:10.3390/a13090212

,

and

¹

CNRS, IRISA, Université Bretagne-Sud, F-56000 Vannes, France

²

CNRS, OCA Lagrange, Université Côte d’Azur, F-06000 Nice, France

³

CNRS, LETG, Université Rennes, F-35000 Rennes, France

^*

Author to whom correspondence should be addressed.

Algorithms2020, 13(9), 212;https://doi.org/10.3390/a13090212

This article belongs to the Special Issue Efficient Graph Algorithms in Machine Learning

Version Notes

Order Reprints

Review Reports

Abstract

Optimal transport theory has recently found many applications in machine learning thanks to its capacity to meaningfully compare various machine learning objects that are viewed as distributions. The Kantorovitch formulation, leading to the Wasserstein distance, focuses on the features of the elements of the objects, but treats them independently, whereas the Gromov–Wasserstein distance focuses on the relations between the elements, depicting the structure of the object, yet discarding its features. In this paper, we study the Fused Gromov-Wasserstein distance that extends the Wasserstein and Gromov–Wasserstein distances in order to encode simultaneously both the feature and structure information. We provide the mathematical framework for this distance in the continuous setting, prove its metric and interpolation properties, and provide a concentration result for the convergence of finite samples. We also illustrate and interpret its use in various applications, where structured objects are involved.

Keywords:

optimal transport; GRAPHS and Structured objects; Wasserstein and Gromov-Wasserstein distances

1. Introduction

We focus on the comparison of structured objects, i.e., objects defined by both a feature and a structure information. Abstractly, the feature information cover all the attributes of an object; for example, it can model the value of a signal when objects are time series, or the node labels over a graph. In shape analysis, the spatial positions of the nodes can be regarded as features, or, when objects are images, local color histograms can describe the image’s feature information. As for the structure information, it encodes the specific relationships that exist among the components of the object. In a graph context, nodes and edges are representative of this notion, so that each label of the graph may be linked to some others through the edges between the nodes. In a time series context, the values of the signal are related to each other through a temporal structure. This representation can be related with the concept of relational reasoning (see [1]), where some entities (or elements with attributes, such as the intensity of a signal) coexist with some relations or properties between them (or some structure, as described above). Including structural knowledge about objects in a machine learning context has often been valuable in order to build more generalizable models. As shown in many contexts, such as graphical models [2,3], relational reinforcement learning [4], or Bayesian nonparametrics [5], considering objects as a complex composition of entities together with their interactions is crucial in order to learn from small amounts of data.

Unlike recent deep learning end-to-end approaches [6,7] that attempt to avoid integration of prior knowledge or assumptions about the structure wherever possible, ad hoc methods, depending on the kind of structured objects involved, aim to build meaningful tools that include structure information in the machine learning process. In graph classification, the structure can be taken into account through dedicated graph kernels, in which the structure drives the combination of the feature information [8,9,10]. In a time series context, Dynamic Time Warping and related approaches are based on the similarity between the features while allowing limited temporal (i.e., structural) distortion in the time instants that are matched [11,12]. Closely related, an entire field has focused on predicting the structure as an output and it has been deployed on tasks, such as segmenting an image into meaningful components or predicting a natural language sentence [10,13,14].

All of these approaches rely on meaningful representations of the structured objects that are involved. In this context, distributions or probability measures can provide an interesting representation for machine learning data. This allows their comparison within the Optimal Transport (OT) framework that provides a meaningful way of comparing distributions by capturing the underlying geometric properties of the space through a cost function. When the distributions dwell in a common metric space

(Ω, d)

, the Wasserstein distance defines a metric between these distributions under mild assumptions [15]. In contrast, the Gromov–Wasserstein distance [16,17] aims at comparing distributions that support live in different metric spaces through the intrinsic pair-to-pair distances in each space. Unifying both distances, the Fused Gromov–Wasserstein distance was proposed in a previous work in [18] and used in the discrete setting to encode, in a single OT formulation, both feature and structure information of structured objects. This approach considers structured objects as joint distributions over a common feature space associated with a structure space specific to each object. An OT formulation is derived by considering a tradeoff between the feature and the structure costs, respectively, defined with respect to the Wasserstein and the Gromov–Wasserstein standpoints.

This paper presents the theoretical foundations of this distance and states the mathematical properties of the

F G W

metric in the general setting. We first introduce a representation of structured objects using distributions. We show that classical Wasserstein and Gromov–Wasserstein distance can be used in order to compare either the feature information or the structure information of the structured object but that they both fail at comparing the entire object. We then present the Fused Gromov–Wasserstein distance in its general formulation and derive some of its mathematical properties. Particularly, we show that it is a metric in a given case, we give a concentration result, and we study its interpolation and geodesic properties. We then provide a conditional-gradient algorithm to solve the quadratic problem resulting from

F G W

in the discrete case and we conclude by illustrating and interpreting the distance in several applicative contexts.

Notations. Let

P (Ω)

be the set of all probability measures on a space

Ω

and

B (A)

the set of all Borel sets of a

σ

-algebra A. We note # the push-forward operator, such, that for a measurable function T,

B \in B (A)

,

T # μ (B) = μ (T^{- 1} (B))

.

We note supp

(μ)

the support of

μ \in P (Ω)

is the minimal closed subset

A \subset Ω

such that

μ (Ω \ A) = 0

. Informally, this is the set where the measure “is not zero”.

For two probability measures

μ \in P (A)

and

ν \in P (B)

we note

Π (μ, ν)

the set of all couplings or matching measures of

μ

and

ν

, i.e., the set

{π \in P (Ω \times Ω) | \forall (A_{0}, B_{0}) \in B (A) \times B (B), π (A_{0} \times B) = μ (A_{0}), π (Ω \times Ω_{0}) = ν (B_{0})}

.

For two metric spaces

(X, d_{X})

and

(Y, d_{Y})

, we define the distance

d_{X} \oplus d_{Y}

on

X \times Y

such that, for

(x, y), (x^{'}, y^{'}) \in X \times Y, d_{X} \oplus d_{Y} ((x, y), (x^{'}, y^{'})) = d_{X} (x, x^{'}) + d_{Y} (y, y^{'})

.

We note the simplex of N bins as

Σ_{N} = {a \in {(R_{+})}^{N}, \sum_{i} a_{i} = 1}

. For two histograms

a \in Σ_{n}

and

b \in Σ_{m}

we note with some abuses

Π (a, b)

the set of all couplings of a and b, i.e., the set

Π (a, b) = {π \in R_{+}^{n \times m} | \sum_{i} π_{i, j} = b_{j}; \sum_{j} π_{i, j} = a_{i}}

. Finally, for

x \in Ω

,

δ_{x}

denotes the dirac measure in x.

Assumption. In this paper, we assume that all metric spaces are non-trivial Polish metric spaces (i.e., separable and completely metrizable topological spaces) and that all measures are Borel.

2. Structured Objects as Distributions and Fused Gromov–Wasserstein Distance

The notion of structured objects used in this paper is inspired from the discrete point of view where one aims at comparing labeled graphs. More formally, we consider undirected labeled graphs as tuples of the form

G = (V, E, ℓ_{f}, ℓ_{s})

, where

(V, E)

are the set of vertices and edges of the graph.

ℓ_{f} : V \to Ω

is a labelling function that associates each vertex

v_{i} \in V

with a feature

a_{i} \overset{def}{=} ℓ_{f} (v_{i})

in some feature metric space

(Ω, d)

. Similarly,

ℓ_{s} : V \to X

maps a vertex

v_{i}

from the graph to its structure representation

x_{i} \overset{def}{=} ℓ_{s} (v_{i})

in some structure space

(X, d_{X})

specific to each graph.

d_{X} : X \times X \to R_{+}

is a symmetric application which aims at measuring the similarity between the nodes in the graph. In the graph context,

d_{X}

can either encode the neighborhood information of the nodes, the edge information of the graph or more generally it can model a distance between the nodes such as the shortest path distance or the harmonic distance [19]. When

d_{X}

is a metric, such as the shortest-path distance, we naturally endow the structure with the metric space

(X, d_{X})

.

In this paper, we propose enriching the previous definition of a structured object with a probability measure which serves the purpose of signaling the relative importance of the object’s elements. For example, we can add a probability (also denoted as weight)

{(h_{i})}_{i} \in Σ_{n}

to each node in the graph. This way, we define a fully supported probability measure

μ = \sum_{i} h_{i} δ_{(x_{i}, a_{i})}

, which includes all the structured object information (see Figure 1 for a graphical depiction).

Figure 1. Discrete structured object (left) can be described by a labeled graph with

{(a_{i})}_{i}

the feature information of the object and

{(x_{i})}_{i}

the structure information. If we enrich this object with a histogram

{(h_{i})}_{i}

aiming at measuring the relative importance of the nodes, we can represent the structured object as a fully supported probability measure

μ

over the couple space of feature and structure with marginals

μ_{X}

and

μ_{A}

on the structure and the features respectively (right).

This graph representation for objects with a finite number of points/vertices can be generalized to the continuous case and leads to a more general definition of structured objects:

Definition 1.

(Structured objects). A structured object over a metric space

(Ω, d)

is a triplet

(X \times Ω, d_{X}, μ)

, where

(X, d_{X})

is a metric space and μ is a probability measure over

X \times Ω

.

(Ω, d)

is denoted as the feature space, such that

d : Ω \times Ω \to R_{+}

is the distance in the feature space and

(X, d_{X})

the structure space, such that

d_{X} : X \times X \to R_{+}

is the distance in the structure space. We will note

μ_{X}

and

μ_{A}

the structure and feature marginals of μ.

Definition 2.

(Space of structured objects). We note

X

the set of all metric spaces. The space of all structured objects over

(Ω, d)

will be written as

S (Ω)

and is defined by all the triplets

(X \times Ω, d_{X}, μ)

where

(X, d_{X}) \in X

and

μ \in P (X \times Ω)

. To avoid finiteness issues in the rest of the paper we define for

p \in N^{*}

the space

S_{p} (Ω) \subset S (Ω)

such that if

(X \times Ω, d_{X}, μ) \in S_{p} (Ω)

we have:

\int_{Ω} d {(a, a_{0})}^{p} d μ_{A} (a) < + \infty

(1)

(the finiteness of this integral does not depend on the choice of

a_{0}

)

\int_{X \times X} d_{X} {(x, x^{'})}^{p} d μ_{X} (x) d μ_{X} (x^{'}) < + \infty .

(2)

For the sake of simplicity, and when it is clear from the context, we will sometimes denote only by

μ

the whole structured object. The marginals

μ_{X}, μ_{A}

encode very partial information since they focus only on independent feature distributions or only on the structure. This definition encompasses the discrete setting discussed in above. More precisely, let us consider a labeled graph of n nodes with features

A = {(a_{i})}_{i = 1}^{n}

with

a_{i} \in Ω

and

X = {(x_{i})}_{i = 1}^{n}

the structure representation of the nodes. Let

{(h_{i})}_{i = 1}^{n}

be an histogram, then the probability measure

μ = \sum_{i = 1}^{n} h_{i} δ_{(x_{i}, a_{i})}

defines structured object in the sense of Definition 1, since it lies in

P (X \times Ω)

. In this case, an example of

μ

,

μ_{X}

, and

μ_{A}

is provided in Figure 1.

Note that the set of structured objects is quite general and also allows considering discrete probability measures of the form

μ = \sum_{i, j = 1}^{p, q} h_{i, j} δ_{(x_{i}, a_{j})}

with

p, q

possibly different than n. We propose focusing on a particular type of structured objects, namely the generalized labeled graphs, as described in the following definition:

Definition 3.

(Generalized labeled graph). We call generalized labeled graph a structured object

(X \times Ω, d_{X}, μ) \in S_{p} (Ω)

such that μ can be expressed as

μ = (I \times ℓ_{f}) # μ_{X}

where

ℓ_{f} : X \to Ω

is surjective and pushes

μ_{X}

forward to

μ_{A}

, i.e.,

ℓ_{f} # μ_{X} = μ_{A}

.

This definition implies that there exists a function

ℓ_{f}

, which associates a feature

a = l_{f} (x)

to a structure point

x \in X

and, as such, one structure point can not have two different features. The labeled graph described by

μ = \sum_{i = 1}^{n} h_{i} δ_{(x_{i}, a_{i})}

is a particular instance of a generalized labeled graph in which

l_{f}

is defined by

l_{f} (x_{i}) = a_{i}

.

2.1. Comparing Structured Objects

We now aim to define a notion of equivalence between two structured objects

(X \times Ω, d_{X}, μ)

and

(Y \times Ω, d_{Y}, ν)

. We note in the following

ν_{Y}, ν_{B}

the marginals of

ν

. Intuitively, two structured objects are the same if they share the same feature information, if their structure information are lookalike, and if the probability measures are corresponding in some sense. In this section, we present mathematical tools for individual comparison of the elements of structured objects. First, our formalism implies comparing metric spaces, which can be done via the notion of isometry.

Definition 4.

(Isometry). Let

(X, d_{X})

and

(Y, d_{Y})

be two metric spaces. An isometry is a surjective map

f : X \to Y

that preserves the distances:

\forall x, x^{'} \in X, d_{Y} (f (x), f (x^{'})) = d_{X} (x, x^{'}) .

(3)

An isometry is bijective, since for

f (x) = f (x^{'})

we have

d_{Y} (f (x), f (x^{'})) = 0 = d_{X} (x, x^{'})

and hence

x = x^{'}

(in the same way

f^{- 1}

is also a isometry). When it exists, X and Y share the same “size” and any statement about X, which can be expressed through its distance is transported to Y by the isometry f.

Example 1.

Let us consider the two following graphs whose discrete metric spaces are obtained as shortest path between the vertices (see corresponding graphs in Figure 2).

(\begin{matrix} x_{1} \\ x_{2} \\ x_{3} \\ x_{4} \end{matrix}), \underset{d_{X} (x_{i}, x_{j})}{\underset{︸}{(\begin{matrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 2 \\ 1 & 1 & 0 & 2 \\ 1 & 2 & 2 & 0 \end{matrix})}} and (\begin{matrix} y_{1} \\ y_{2} \\ y_{3} \\ y_{4} \end{matrix}), \underset{d_{Y} (y_{i}, y_{j})}{\underset{︸}{(\begin{matrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 2 & 2 \\ 1 & 2 & 0 & 1 \\ 1 & 2 & 1 & 0 \end{matrix})}} .

Figure 2. Two isometric metric spaces. Distances between the nodes are given by the shortest path, and the weight of each edge is equal to 1.

These spaces are isometric since the map f, such that

f (x_{1}) = y_{1}

,

f (x_{2}) = y_{3}

,

f (x_{3}) = y_{4}

,

f (x_{4}) = y_{2}

verifies Equation (3).

The previous definition can be used in order to compare the structure information of two structured objects. Regarding the feature information, because they all lie in the same ambient space

Ω

, a natural way for comparing them is by the standard set equality

A = B

. Finally, in order to compare measures on different spaces, the notion of preserving map can be used.

Definition 5.

(Measure preserving map). Let

Ω_{1}, μ_{1} \in P (Ω_{1})

and

Ω_{2}, μ_{2} \in P (Ω_{2})

be two measurable spaces. A function (usually called a map)

f : Ω_{1} \to Ω_{2}

is said to be measure preserving if it transports the measure

μ_{1}

on

μ_{2}

such that

f # μ_{1} = μ_{2} .

If there exists such a measure preserving map, the properties about measures of

Ω_{1}

are transported via f to

Ω_{2}

.

Combining these two ideas together leads to the notion of measurable metric spaces (often called mm-spaces [17]), i.e., a metric space

(X, d_{X})

enriched with a probability measure and described by a triplet

(X, d_{X}, μ_{X} \in P (X))

. An interesting notion for comparing mm-spaces is the notion of isomorphism.

Definition 6.

(Isomorphism). Two mm-spaces

(X, d_{X}, μ_{X}), (Y, d_{Y}, μ_{Y})

are isomorphic if there exists a surjective measure preserving isometry

f : s u p p (μ_{X}) \to s u p p (ν_{Y})

between the support of the measures

μ_{X}, ν_{Y}

.

Example 2.

Let us consider two mm-spaces

(X = {x_{1}, x_{2}}, d_{X} = {1}, μ_{X} = {\frac{1}{2}, \frac{1}{2}})

and

(Y = {y_{1}, y_{2}}, d_{Y} = {1}, μ_{Y} = {\frac{1}{4}, \frac{3}{4}})

, as depicted in Figure 3. These spaces are isometric, but not isomorphic, as there exists no measure preserving map between them.

Figure 3. Two isometric but not isomorphic spaces.

All of this considered, we can now define a notion of equivalence between structured objects.

Definition 7.

(Strong isomorphism of structured objects).

Two structured objects are said to be strongly isomorphic if there exists an isomorphism I between the structures such that

f = (I, i d)

is bijective between

s u p p (μ)

and

s u p p (ν)

and measure preserving. More precisely, f satisfies the following properties:

P.1: $f # μ = ν$ .
P.2: The function f statisfies:

$\forall (x, a) \in s u p p (μ), f (x, a) = (I (x), a) .$

(4)
P.3: The function $I : s u p p (μ_{X}) \to s u p p (ν_{Y})$ is surjective, satisfies $I # μ_{X} = ν_{Y}$ and:

$\forall x, x^{'} \in {s u p p (μ_{X})}^{2}, d_{X} (x, x^{'}) = d_{Y} (I (x), I (x^{'})) .$

(5)

It is easy to check that the strong isomorphism defines an equivalence relation over

S_{p} (Ω)

.

Remark 1.

The function f described in this definition can be seen as a feature, structure, and measure preserving function. Indeed, fromP.1f is measure preserving. Moreover,

(X, d_{X}, μ_{X})

and

(Y, d_{Y}, ν_{Y})

are isomorphic through I. Finally usingP.1andP.2we have that

μ_{A} = ν_{B}

, so that the feature information is also preserved.

Example 3.

To illustrate this definition, we consider a simple example of two discrete structured objects:

\underset{x_{i}, a_{i}}{\underset{︸}{(\begin{matrix} (x_{1}, a_{1}) \\ (x_{2}, a_{2}) \\ (x_{3}, a_{3}) \\ (x_{4}, a_{4}) \end{matrix})}}, \underset{d_{X} (x_{i}, x_{j})}{\underset{︸}{(\begin{matrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 2 \\ 1 & 1 & 0 & 2 \\ 1 & 2 & 2 & 0 \end{matrix})}}, \underset{h_{i}}{\underset{︸}{(\begin{matrix} 1 / 4 \\ 1 / 4 \\ 1 / 4 \\ 1 / 4 \end{matrix})}} and \underset{y_{i}, b_{i}}{\underset{︸}{(\begin{matrix} (y_{1}, b_{1}) \\ (y_{2}, b_{2}) \\ (y_{3}, b_{3}) \\ (y_{4}, b_{4}) \end{matrix})}}, \underset{d_{Y} (y_{i}, y_{j})}{\underset{︸}{(\begin{matrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 2 & 2 \\ 1 & 2 & 0 & 1 \\ 1 & 2 & 1 & 0 \end{matrix})}}, \underset{h_{i}^{'}}{\underset{︸}{(\begin{matrix} 1 / 4 \\ 1 / 4 \\ 1 / 4 \\ 1 / 4 \end{matrix})}}

with for i,

a_{i} = b_{i}

and for

i \neq j

,

a_{i} \neq a_{j}

(see Figure 4). The two structured objects have isometric structures and same features individually, but they are not strongly isomorphic. One possible map

f = (f_{1}, f_{2})

, such that

f_{1}

leads to an isometry is

f (x_{1}, a_{1}) = (y_{1}, b_{1})

,

f (x_{2}, a_{2}) = (y_{3}, b_{3})

,

f (x_{3}, a_{3}) = (y_{4}, b_{4})

,

f (x_{4}, a_{4}) = (y_{2}, b_{2})

. Yet, this map does not satisfy

f_{2} (x, .) = I_{d}

for any x, since

f (x_{2}, a_{2}) = (y_{3}, b_{3})

and

a_{2} \neq b_{3}

. The other possible functions, such that

f_{1}

leads to an isometry are simply permutations of this example, yet it is easy to check that none of them verifiesP.2(for example, with

f (x_{2}, a_{2}) = (y_{4}, b_{4})

).

Figure 4. Two structured objects with isometric structures and identical features that are not strongly isomorphic. The color of the nodes represent the node feature and each edge represents a distance of 1 between the connected nodes.

2.2. Background on OT Distances

The Optimal Transport (OT) framework defines distances between probability measures that describe either the feature or the structure information of structured objects.

Wasserstein distance. The classical OT theory aims at comparing probability measures

μ_{A} \in P (Ω), ν_{B} \in P (Ω)

. In this context the quantity:

d_{W, p} (μ_{A}, ν_{B}) = {(\inf_{π \in Π (μ_{A}, ν_{B})} \int_{Ω \times Ω} d {(a, b)}^{p} d π (a, b))}^{\frac{1}{p}}

(6)

is usually called the p-Wasserstein distance (also known, for

p = 1

, as Earth Mover’s distance [20] in the computer vision community) between distributions

μ_{A}

and

ν_{B}

. It defines a distance on probability measures, especially

d_{W, p} (μ_{A}, ν_{B}) = 0

iff

μ_{A} = ν_{B}

. This distance also has a nice geometrical interpretation as it represents an optimal cost (w.r.t. d) to move the measure

μ_{A}

onto

ν_{B}

with

π (a, b)

the amount of probability mass shifted from a to b (see Figure 5). To this extent, the Wasserstein distance quantifies how “far”

μ_{A}

is from

ν_{B}

by measuring how “difficult” it is to move all the mass from

μ_{A}

onto

ν_{B}

. Optimal transport can deal with smooth and discrete measures and it has proved to be very useful for comparing distributions in a shared space, but with different (and even non-overlapping) supports.

Figure 5. Example of a coupling between two discrete measures on the same ground space equipped with a distance d that will define the Wasserstein distance. (Left): the discrete measures on

Ω

. (Right): one possible coupling between these measures that conserves the mass. Image adapted from [21] [Figure 2.6].

Gromov–Wasserstein distance. In order to compare measures whose support are not necessarily in the same ambient space [16,17] define a new OT distance. By relaxing the classical Hausdorff distance [15,17], authors build a distance over the space of all mm-spaces. For two compact mm-spaces

(X, d_{X}, μ_{X} \in P (X))

and

(Y, d_{Y}, ν_{Y} \in P (Y))

, the Gromov–Wasserstein distance is defined as:

d_{G W, p} (μ_{X}, ν_{Y}) = {(\inf_{π \in Π (μ_{X}, ν_{Y})} \int_{X \times Y \times X \times Y} L {(x, y, x^{'}, y^{'})}^{p} d π (x, y) d π (x^{'}, y^{'}))}^{\frac{1}{p}}

(7)

where:

L (x, y, x^{'}, y^{'}) = | d_{X} (x, x^{'}) - d_{Y} (y, y^{'}) |

The Gromov–Wasserstein distance depends on the choice of the metrics

d_{X}

and

d_{Y}

and with some abuse of notation we denote the entire mm-space by its probability measure. When it is not clear from the context, we will specify using

d_{G W, p} (d_{X}, d_{Y}, μ_{X}, ν_{Y})

. The resulting coupling tends to associate pairs of points with similar distances within each pair (see Figure 6). The Gromov–Wasserstein distance allows for the comparison of measures over different ground spaces and defines a metric over the space of all mm-spaces quotiented by the isomoprhisms (see Definitions 4 and 5). More precisely, it vanishes if the two mm-spaces are isomorphic. This distance has been used in the context of relational data e.g., in shape comparison [17,22], deep metric alignment [23], generative modelling [24] or to align single-cell multi-omics datasets [25].

Figure 6. Gromov–Wasserstein coupling of two mm-spaces

X = (X, d_{X}, μ_{X})

and

Y = (Y, d_{Y}, ν_{Y})

. Left: the mm-spaces have nothing in common. Similarity between pairwise distances is measured by

| d_{X} (x, x^{'}) - d_{Y} (y, y^{'}) |

. Right: an admissible coupling of

μ_{X}

and

ν_{Y}

. Image adapted from [21] [Figure 2.6].

2.3. Fused Gromov–Wasserstein Distance

Building on both Gromov–Wasserstein and Wasserstein distances, we define the Fused Gromov–Wasserstein (

F G W

) distance on the space of structured objects:

Definition 8.

(Fused Gromov-Wasserstein distance). Let

α \in [0, 1]

and

p, q \geq 1

. We consider

(X \times Ω, d_{X}, μ) \in S_{p q} (Ω)

and

(Y \times Ω, d_{Y}, ν) \in S_{p q} (Ω)

. The Fused-Gromov–Wasserstein distance is defined as:

d_{F G W, α, p, q} (μ, ν) = {(\underset{π \in Π (μ, ν)}{i n f} E_{p, q, α} (π))}^{\frac{1}{p}}

(8)

where

E_{p, q, α} (π) = \int_{{(X \times Ω \times Y \times Ω)}^{2}} {((1 - α) d {(a, b)}^{q} + α L {(x, y, x^{'}, y^{'})}^{q})}^{p} d π ((x, a), (y, b)) d π ((x^{'}, a^{'}), (y^{'}, b^{'}))

Figure 7 illustrates this definition. When it is clear from the context we will simply note

d_{F G W}

instead for

d_{F G W, α, p, q}

for brevity.

α

acts as a trade-off parameter between the cost of the structures represented by

L (x, y, x^{'}, y^{'})

and the feature cost

d (a, b)

. In this way, the convex combination of both terms leads to the use of both information in one formalism resulting on a single map

π

that “moves” the mass from one joint probability measure to the other.

Figure 7. Illustration of Definition 8. The figure shows two structured objects

(X \times Ω, d_{X}, μ)

and

(Y \times Ω, d_{Y}, μ)

. The feature space

Ω

is the common space for all features. The two metric spaces

(X, d_{X})

and

(Y, d_{Y})

represent the structures of the two structured objects, the similarity between all pair-to-pair distances of the structure points is measured by

L (x, y, x^{'}, y^{'})

.

μ

and

ν

are the joint measures on the structure space and the feature space.

Many desirable properties arise from this definition. Among them, one can define a topology over the space of structured objects using the

F G W

distance to compare structured objects, in the same philosophy as for Wasserstein and Gromov–Wasserstein distances. The definition also implies that

F G W

acts as a generalization of both Wasserstein and Gromov-Wasserstein distances, with

F G W

achieving an interpolation between these two distances. More remarkably,

F G W

distance also realizes geodesic properties over the space of structured objects, allowing the definition of gradient flows. All of these properties are detailed in the next section. Before reviewing them, we first compare

F G W

with

G W

and W (by assuming for now that

F G W

exists, which will be shown later in Theorem 1).

Proposition 1.

(Comparaison between

F G W

,

G W

and W). We have the following results for two structured objects μ and ν:

The following inequalities hold:

$d_{F G W, α, p, q} (μ, ν) \geq (1 - α) d_{W, p q} {(μ_{A}, ν_{B})}^{q}$

(9)

$d_{F G W, α, p, q} (μ, ν) \geq α d_{G W, p q} {(μ_{X}, ν_{Y})}^{q}$

(10)
Let us suppose that the structure spaces $(X, d_{X})$ , $(Y, d_{Y})$ are part of a single ground space $(Z, d_{Z})$ (i.e., $X, Y \subset Z$ and $d_{X} = d_{Y} = d_{Z}$ ). We consider the Wasserstein distance between μ and ν for the distance on $Z \times Ω$ : $\tilde{d} ((x, a), (y, b)) = (1 - α) d (a, b) + α d_{Z} (x, y)$ . Then:

$d_{F G W, α, p, 1} (μ, ν) \leq 2 d_{W, p} (μ, ν) .$

(11)

Proof of this proposition can be found in Section 7.1. In particular, following this proposition, when the

F G W

distance vanishes then both

G W

and W distances vanish so that the structure and the feature of the structure object are individually “the same” (with respect to their corresponding equivalence relation). However, the converse is not necessarily true, as shown in the following example.

Example 4.

(Toy trees). We construct two trees as illustrated in Figure 8 where the 1D node features are shown with colors. The shortest path between the nodes is used to capture the structures of the two structured objects and the Euclidean distance is used for the features. We consider uniform weights on all nodes. Figure 8 illustrates the differences between

F G W

,

G W

, and W distances. The left part is the Wasserstein coupling between the features: red nodes are transported on red ones and the blue nodes on the blue ones but tree structures are completely discarded. In this case, the Wasserstein distance vanishes. In the right part, we compute the Gromov–Wasserstein distance between the structures: all couples of points are transported to another couple of points, which enforces the matching of tree structures without taking into account the features. Because structures are isometric, the Gromov–Wasserstein distance is null. Finally, we compute the

F G W

using intermediate α (center), the bottom and first level structure is preserved as well as the feature matching (red on red and blue on blue) and

F G W

discriminates the two structured objects.

Figure 8. Difference on transportation maps between

F G W

,

G W

, and W distances on synthetic trees. (Left) the W distance between the features is null since feature information are the same, (Middle)

F G W

with

α \in] 0, 1 [

is different from zero and discriminate the two structured objects, and (Right)

G W

between the two isometric structures is null.

3. Mathematical Properties of $FGW$

In this section, we establish some mathematical properties of the

F G W

distance. The first result relates to the existence of the

F G W

distance and the topology of the space of structured objects. We then prove that the

F G W

distance is indeed a distance regarding the equivalence relation between structured objects, as defined in Defintion 7, allowing us to derive a topology on

S (Ω)

.

3.1. Topology of the Structured Object Space

The

F G W

distance has the following properties:

Theorem 1.

(Metric properties). Let

p, q \geq 1

,

α \in] 0, 1 [

and

μ, ν \in S_{p q} (Ω)

. The functional

π \to E_{p, q, α} (π)

always achieves an infimum

π^{*}

in

Π (μ, ν)

s.t.

d_{F G W, α, p, q} (μ, ν) = E_{p, q, α} (π^{*}) < + \infty

. Moreover:

•: $d_{F G W, α, p, q}$ is symmetric and, for $q = 1$ , satisfies the triangle inequality. For $q \geq 2$ , the triangular inequality is relaxed by a factor $2^{q - 1}$ .
•: For $α \in] 0, 1 [$ , $d_{F G W, α, p, q} (μ, ν) = 0$ if an only if there exists a bijective function $f = (f_{1}, f_{2}) : s u p p (μ) \to s u p p (ν)$ such that:

$f # μ = ν$

(12)

$\forall (x, a) \in s u p p (μ), f_{2} (x, a) = a$

(13)

$\forall (x, a), (x^{'}, a^{'}) \in s u p p {(μ)}^{2}, d_{X} (x, x^{'}) = d_{Y} (f_{1} (x, a), f_{1} (x^{'}, a^{'}))$

(14)
•: If $(μ, ν)$ are generalized labeled graphs then $d_{F G W, α, p, q} (μ, ν) = 0$ if and only if $(X \times Ω, d_{X}, μ)$ and $(Y \times Ω, d_{Y}, ν)$ are strongly isomorphic.

Proof of this theorem can be found in Section 7.2. The identity of indiscernibles is the most delicate part to prove and it is based on using the Gromov–Wasserstein distance between the spaces

X \times Ω

and

Y \times Ω

. The previous theorem states that

F G W

is a distance over the space of generalized labeled graphs endowed with the strong isomorphism as equivalence relation defined in Definition 7. More generally, for any structured objects the equivalence relation is given by (12)–(14). Informally, invariants of the

F G W

are structured objects that have both the same structure and the same features in the same place. Despite the fact that

q = 1

leads to a proper metric, we will further see in Section 4.1 that the case

q = 2

can be computed more efficiently using a separability trick from [26].

Theorem 1 allows a wide set of applications for

F G W

, such as k-nearest-neighbors, distance-substitution kernels, pseudo-Euclidean embeddings, or representative-set methods [27,28,29]. Arguably, such a distance allows for a better interpretation than to end-to-end learning machines, such as neural networks, because the

π

matrix exhibits the relationships between the elements of the objects in a pairwise comparison.

3.2. Can We Adapt W and GW for Structured Objects?

Despite the appealing properties of both Wasserstein and Gromov–Wasserstein distances, they fail at comparing structured objects by focusing only on the feature and structure marginals, respectively. However, with some hypotheses, one could adapt these distances for structured objects.

Adapting Wasserstein. If the structure spaces

(X, d_{X})

and

(Y, d_{Y})

are part of a same ground space

(Z, d_{Z})

, i.e., (

X, Y \subset Z

and

d_{X} = d_{Y} = d_{Z}

), one can build a distance

\hat{d} = d_{Z} \oplus d

between couples

(x, a)

and

(y, b)

and apply the Wasserstein distance, so as to compare the two structured objects. In this case, when the Wasserstein distance vanishes it implies that the structured objects are equal in the sense

μ = ν

. This approach is very related with the one discussed in [30], where the authors define the Transportation

L^{p}

distance for signal analysis purposes. Their approach can be viewed as a transport between two joint measures

μ (X \times Ω) = λ ({(x, f (x)) | x \in X \subset Z = R^{d}; f (x) \in A \subset R^{m}})

,

ν (Y \times Ω) = λ ({(y, g (y)) | y \in Y \subset Z = R^{d}; g (y) \in B \subset R^{m}})

for function

f, g : Z \to R^{m}

representative of the signal values and

λ

the Lebesgue measure. The distance for the transport is defined as

\hat{d} ((x, f (x)), (y, g (y))) = \frac{1}{α} {∥ x - y ∥}_{p}^{p} + {∥ f (x) - g (y) ∥}_{p}^{p}

for

α > 0

and

{∥ \cdot ∥}_{p}

the

l_{p}

norm. In this case,

f (x)

and

g (y)

can be interpreted as encoding the feature information of the signal, while

x, y

encode its structure information. This approach is very interesting, but cannot be used on structured objects, such as graphs that will not share a common structure embedding space.

Adapting Gromov-Wasserstein. The Gromov–Wasserstein distance can also be adapted to structured objects by considering the distances

(1 - β) d_{X} \oplus β d

and

(1 - β) d_{Y} \oplus β d

within each space

X \times Ω

and

Y \times Ω

, respectively, and

β \in] 0, 1 [

. When the resulting

G W

distance vanishes, structured objects are isomorphic with respect to

(1 - β) d_{X} \oplus β d

and

(1 - β) d_{Y} \oplus β d

. However, the strong isomorphism is stronger than this notion, since the isomorphism allows for “permuting the labels”, but not the strong isomorphism. More precisely, we have the following lemma:

Lemma 1.

Let

(X \times Ω, d_{X}, μ), (Y \times Ω, d_{Y}, ν)

be two structured objects and

β \in] 0, 1 [

.

If

(X \times Ω, d_{X}, μ)

and

(Y \times Ω, d_{Y}, ν)

are strongly isomorphic then

(X \times Ω, (1 - β) d_{X} \oplus β d, μ)

and

(Y \times Ω, (1 - β) d_{Y} \oplus β d, ν)

are isomorphic. However the converse is not true in general.

Proof.

To see this, if we consider f as defined in Theorem 1, then, for

(x, a), (x^{'}, b) \in {(supp (μ))}^{2}

, we have

d_{X} (x, x^{'}) = d_{Y} (I (x), I (x^{'}))

. In this way:

(1 - β) d_{X} (x, x^{'}) + β d (a, b) = (1 - β) d_{Y} (I (x), I (x^{'})) + β d (a, b)

(15)

which can be rewritten as:

(1 - β) d \oplus β d_{X} ((x, a), (x^{'}, b)) = (1 - β) d \oplus β d_{Y} (f (x, a), f (x^{'}, b))

(16)

and so f is an isometry with respect to

(1 - β) d \oplus β d_{X}

and

(1 - β) d \oplus β d_{Y}

. Because f is also measure preserving and surjective

(X \times Ω, (1 - β) d_{X} \oplus β d, μ)

and

(Y \times Ω, (1 - β) d_{Y} \oplus β d, ν)

are isomorphic. □

However, the converse is not necessarily true, as it is easy to cook up an example with the same structure but with permuted labels, so that objects are isomorphic but not strongly isomorphic. For example, in the tree example Figure 4, the structures are isomorphic and the distances between the features within each space are the same between each structured objects, so that

(X \times Ω, (1 - β) d_{X} \oplus β d, μ)

and

(Y \times Ω, (1 - β) d_{Y} \oplus β d, ν)

are isomorphic, yet not strongly isomorphic, as shown in the example since

F G W > 0

.

3.3. Convergence of Structured Objects

The metric property naturally endows the structured object space with a notion of convergence, as described in the next definition:

Definition 9.

Convergence of structured objects.

Let

{((X_{n} \times A_{n}, d_{X_{n}}, μ_{n}))}_{n \in N}

be a sequence of structured objects. It converges to

(X \times Ω, d_{X}, μ)

in the Fused Gromov–Wasserstein sense if:

lim_{n \to \infty} d_{F G W, α, p, 1} (μ_{n}, μ) = 0

(17)

Using Proposition 1, it is straightforward to see that if the sequence converges in the

F G W

sense, both the features and the structure converge respectively in the Wasserstein and Gromov–Wasserstein sense (see [17] for the definition of convergence in the Gromov–Wasserstein sense).

An interesting question arises from this definition. Let us consider a structured object

(X \times Ω, d_{X}, μ)

and let us sample the joint distribution so as to consider

{({(x_{i}, a_{i})}_{i \in {1, \dots, n}}, d_{X}, μ_{n})}_{n \in N}

with

μ_{n} = \frac{1}{n} \sum_{i = 1}^{n} δ_{x_{i}, a_{i}}

where

(x_{i}, a_{i}) \in X \times Ω

are sampled from

μ

. Does this sequence converges to

(X \times Ω, d_{X}, μ)

in the

F G W

sense and how fast is the convergence?

This question can be answered thanks to a notion of “size” of a probability measure. For the sake of conciseness, we will not exhaustively present the theory, but the reader can refer to [31] for more details. Given a measure

μ

on

Ω

, we denote as

d i m_{p}^{*} (μ)

its upper Wasserstein dimension. It coincides with the intuitive notion of “dimension” when the measure is sufficiently well behaved. For example, for any absolutely continuous measure

μ

with respect to the Lebesgue measure on

{[0, 1]}^{d}

, we have

d i m_{p}^{*} (μ) = d

for any

p \in [1, \frac{d}{2}]

. Using this definition and the results presented in [31], we answer the question of convergence of finite sample in the following theorem (proof can be found in Section 7.3):

Theorem 2.

Convergence of finite samples and a concentration inequality

Let

p \geq 1

. We have:

lim_{n \to \infty} d_{F G W, α, p, 1} (μ_{n}, μ) = 0 .

Moreover, suppose that

s > d_{p}^{*} (μ)

. Then there exists a constant C that does not depend on n such that:

E [d_{F G W, α, p, 1} (μ_{n}, μ)] \leq C n^{- \frac{1}{s}} .

(18)

The expectation is taken over the i.i.d samples

(x_{i}, a_{i})

. A particular case of this inequality is when

α = 1

so that we can use the result above to derive a concentration result for the Gromov-Wasserstein distance. More precisely, if

ν_{n} = \frac{1}{n} \sum_{i} δ_{x_{i}}

denotes the empirical measure of

ν \in P (X)

and if

s^{'} > d_{p}^{*} (ν)

, we have:

E [d_{G W, p} (ν_{n}, ν)] \leq C^{'} n^{- \frac{1}{s^{'}}} .

(19)

This result is a simple application of the convergence of finite sample properties of the Wasserstein distance, since in this case

μ_{n}

and

μ

are part of the same ground space so that (18) derive naturally from (11) and the properties of Wasserstein. In contrast to the Wasserstein distance case, this inequality is not necessarily sharp and future work will be dedicated to the study of its tightness.

3.4. Interpolation Properties between Wasserstein and Gromov-Wasserstein Distances

F G W

distance is a generalization of both Wasserstein and Gromov–Wasserstein distances in the sense that it achieves an interpolation between them. More precisely, we have the following theorem:

Theorem 3.

Interpolation properties.

As α tends to zero, one recovers the Wasserstein distance between the features information and as α goes to one, one recovers the Gromov–Wasserstein distance between the structure information:

lim_{α \to 0} d_{F G W, α, p, q} (μ, ν) = {(d_{W, q p} (μ_{A}, ν_{B}))}^{q}

(20)

lim_{α \to 1} d_{F G W, α, p, q} (μ, ν) = {(d_{G W, q p} (μ_{X}, ν_{Y}))}^{q}

(21)

Proof of this theorem can be found in Section 7.4.

This result shows that

F G W

can revert to one of the other distances. In machine learning, this allows for a validation of the

α

parameter to better fit the data properties (i.e., by tuning the relative importance of the feature vs. structure information). One can also see the choice of

α

as a representation learning problem and its value can be found by optimizing a given criterion.

3.5. Geodesic Properties

One desirable property in OT is the underlying geodesics defined by the mass transfer between two probability distributions. These properties are useful in order to define the dynamic formulation of OT problems. This dynamic point of view is inspired by fluid dynamics and found its origin in the Wasserstein context with [32]. Various applications in machine learning can be derived from this formulation: interpolation along geodesic paths were used in computer graphics for color or illumination interpolations [33]. More recently, Ref. [34] used Wasserstein gradient flows in an optimization context, deriving global minima results for non-convex particles gradient descent. In [35], the authors used Wasserstein gradient flows in the context of reinforcement learning for policy optimization.

The main idea of this dynamic formulation is to describe the optimal transport problem between two measures as a curve in the space of measures minimizing its total length. We first describe some generality about geodesic spaces and recall classical results for dynamic formulation in both Wasserstein and Gromov–Wasserstein contexts. In a second part, we derive new geodesic properties in the

F G W

context.

Geodesic spaces. Let

(X, d)

be a metric space and

x, y

two points in X. We say that a curve

w : [0, 1] \to X

joining the endpointsx and y (i.e., with

w (0) = x

and

w (1) = y

) is a constant speed geodesic if it satisfies

d (w (t), w (s)) \leq | t - s | d (w (0), w (1)) = | t - s | d (x, y)

for

t, s \in [0, 1]

. Moreover, if

(X, d)

is a length space (i.e., if the distance between two points of X is equal to the infimum of the lengths of the curves connecting these two points) then the converse is also true and a constant speed geodesic satisfies

d (w (t), w (s)) = | t - s | d (x, y)

. It is easy to compute distances along such curves, as they are directly embedded into

R

.

In the Wasserstein context, if the ground space is a complete separable, locally compact length space, and if the endpoints of the geodesic are given, then there exists a geodesic curve. Moreover, if the transport between the endpoints is unique, then there is a unique displacement interpolation between the endpoints (see Corollary 7.22 and 7.23 in [15]). For example, if the ground space is

R^{d}

and the distance between the points is measured via the

ℓ_{2}

norm, then geodesics exist and are uniquely determined (this can be generalized to strictly convex costs). In the Gromov–Wasserstein context, there always exists constant speed geodesics as long as the endpoints are given. These geodesics are unique modulo the isomorphism equivalence relation (see [16]).

The $F G W$ case. In this paragraph, we suppose that

Ω = R^{d}

. We are interested in finding a geodesic curve in the space of structured objects i.e., a constant speed curve of structured objects joining two structured objects. As for Wasserstein and Gromov–Wasserstein, the structured object space endowed with the Fused Gromov–Wasserstein distance maintains some geodesic properties. The following result proves the existence of such a geodesic and characterizes it:

Theorem 4.

Constant speed geodesic.

Let

p \geq 1

and

(X \times Ω, d_{X}, μ_{0})

and

(Y \times Ω, d_{Y}, μ_{1})

in

S_{p} (R^{d})

. Let

π^{*}

be an optimal coupling for the Fused Gromov–Wasserstein distance between

μ_{0}, μ_{1}

, and

t \in [0, 1]

. We equip

R^{d}

with the

ℓ_{m}

norm for

m \geq 1

.

We define

η_{t} : X \times Ω \times Y \times Ω \to X \times Y \times Ω

such that:

\forall (x, a), (y, b) \in X \times Ω \times Y \times Ω, η_{t} (x, a, y, b) = (x, y, (1 - t) a + t b)

(22)

Then:

{(X \times Y \times Ω, (1 - t) d_{X} \oplus t d_{Y}, μ_{t} = η_{t} # π^{*})}_{t \in [0, 1]}

(23)

is a constant speed geodesic connecting

(X \times Ω, d_{X}, μ_{0})

and

(Y \times Ω, d_{Y}, μ_{1})

in the metric space

(S_{p} (R^{d}), d_{F G W, α, p, 1})

.

Proof of the previous theorem can be found in Section 7.5. In a sense, this result combines the geodesics in the Wasserstein space and in the space of all mm-spaces, since it suffices to interpolate the distances in the structure space and the features to construct a geodesic. The main interest is that it defines the minimum path between two structured objects. For example, when considering two discrete structured objects represented by the measures

μ = \sum_{i = 1}^{n} h_{i} δ_{(x_{i}, a_{i})}

and

ν = \sum_{j = 1}^{m} g_{j} δ_{(y_{j}, b_{j})}

, the interpolation path is given for

t \in [0, 1]

by the measure

μ_{t} = \sum_{i = 1}^{n} \sum_{j = 1}^{m} π^{*} (i, j) δ_{(x_{i}, y_{j}, (1 - t) a_{i} + t b_{j})}

where

π^{*}

is an optimal coupling for the

F G W

distance. However this geodesic is difficult to handle in practice, since it requires the computation of the cartesian product

X \times Y

. To overcome this obstacle, an extension using theFrÃ©chet mean is defined in Section 4.2. The proper definition and properties of velocity fields associated to this geodesic is postponed to further works.

4. $FGW$ in the Discrete Case

In practice, structured objects are often discrete and can be defined using the labeled graph formalism described previously. In this section, we discuss how to compute

F G W

efficiently. We also provide an algorithm for the computation of Fréchet means.

4.1. FGW in the Discrete Case

We consider two structured objects

μ = \sum_{i = 1}^{n} h_{i} δ_{(x_{i}, a_{i})}

and

ν = \sum_{j = 1}^{m} g_{j} δ_{(y_{j}, b_{j})}

, as described previously. We note

M_{A B}

the matrix

M_{A B} (i, j) = d (a_{i}, b_{j})

and

C_{1}, C_{2}

the matrices

C_{1} (i, k) = d_{X} (x_{i}, x_{k})

,

C_{2} (j, l) = d_{Y} (y_{j}, y_{l})

(see Figure 9). The Fused Gromov–Wasserstein distance is defined as

d_{F G W, α, p, q} (μ, ν) = {(min_{π \in Π (h, g)} E_{p, q, α} (π))}^{\frac{1}{p}}

(24)

where:

E_{p, q, α} (π) = \sum_{i, j, k, l} {((1 - α) M_{A B} {(i, j)}^{q} + α {| C_{1} (i, k) - C_{2} (j, l) |}^{q})}^{p} π_{i, j} π_{k, l} .

Figure 9.

F G W

loss for a coupling

π

depends on both a similarity between each feature of each node of each graph

{(d (a_{i}, b_{j}))}_{i, j}

and between all intra-graph structure similarities

{(|C_{1} (i, k) - C_{2} (j, l)|)}_{i, j, k, l}

.

Solving the related Quadratic Optimization problem. Equation (24) is clearly a quadratic problem w.r.t.

π

. Note that, despite the apparent

O (m^{2} n^{2})

complexity of computing the tensor product, one can simplify the sum to complexity

O (m n^{2} + m^{2} n)

[26] when considering

p = 1, q = 2

. In this case, the

F G W

computation problem can be re-written as finding

π^{*}

, such that:

π^{*} = \underset{π \in Π (h, g)}{arg min} vec {(π)}^{T} Q (α) vec (π) + vec {(D (α))}^{T} vec (π)

(25)

where

Q = - 2 α C_{2} \otimes_{K} C_{1}

and

D (α) = (1 - α) M_{A B}

.

\otimes_{K}

denote the Kronecker product of two matrices, vec the column-stacking operator. With such form, the resulting optimal map can be seen as a quadratic regularized map from initial Wasserstein [36,37]. However, unlike these approaches, we have a quadratic but probably non-convex term. The gradient G that arises from Equation (24) can be expressed with the following partial derivative w.r.t.

π

:

G = (1 - α) M_{A B}^{q} + 2 α {(\sum_{k, l} {| C_{1} (i, k) - C_{2} (j, l) |}^{q} π_{k, l})}_{i, j}

(26)

Solving a large scale QP with a classical solver can be computationally expensive. In [36], the authors propose a solver for a graph regularized optimal transport problem whose resulting optimization problem is also a QP. We can then directly use their conditional gradient defined in Algorithm 1 to solve our optimization problem. It only needs at each iteration to compute the gradient in Equation (26) and solve a classical OT problem for instance with a network flow algorithm. The line-search part is a constrained minimization of a second degree polynomial function that is adapted to the non convex loss in Algorithm 2. While the problem is non convex, conditional gradient is known to converge to a local stationary point [38]. When

C_{1}, C_{2}

are squared Euclidean distance matrices the problem reduces to a concave graph matching so that the line-search of the conditional gradient Algorithm 1 always leads to 1 [39,40].

Algorithm 1. Conditional Gradient (CG) for

F G W

.

1:: $π^{(0)} \leftarrow μ_{X} μ_{Y}^{⊤}$
2:: for $i = 1, \dots,$ do
3:: $G \leftarrow$ Gradient from Equation (26) w.r.t. $π^{(i - 1)}$
4:: ${\tilde{π}}^{(i)} \leftarrow$ Solve OT with ground loss G
5:: $τ^{(i)} \leftarrow$ Line-search for loss (24) with $τ \in (0, 1)$ using Algorithm 2
6:: $π^{(i)} \leftarrow (1 - τ^{(i)}) π^{(i - 1)} + τ^{(i)} {\tilde{π}}^{(i)}$
7:: end for

Algorithm 2. Line-search for CG (

q = 2

).

1:: $c_{C_{1}, C_{2}}$ from Equation (6) in [26]
2:: $a = - 2 α ⟨ C_{1} {\tilde{π}}^{(i)} C_{2}, {\tilde{π}}^{(i)} ⟩$
3:: $b = ⟨ (1 - α) M_{A B} + α c_{C_{1}, C_{2}}, {\tilde{π}}^{(i)} ⟩$ $- 2 α (⟨ C_{1} {\tilde{π}}^{(i)} C_{2}, π^{(i - 1)} ⟩ + ⟨ C_{1} π^{(i - 1)} C_{2}, {\tilde{π}}^{(i)} ⟩)$
4:: if $a > 0$ then
5:: $τ^{(i)} \leftarrow \min (1, \max (0, \frac{- b}{2 a}))$
6:: else
7:: $τ^{(i)} \leftarrow 1$ if $a + b < 0$ else $τ^{(i)} \leftarrow 0$
8:: end if

4.2. Structured Optimal Transport Barycenter

An interesting use of the

F G W

distance is to define a barycenter of a set of structured data as a Fréchet mean. In that context, one seeks the structured object that minimizes the sum of the (weighted)

F G W

distances with a given set of objects. OT barycenters have many desirable properties and applications [26,41], yet no formulation can leverage both structural and feature information in the barycenter computation. Here, we propose to use the

F G W

distance to compute the barycenter of a set of K structured objects

{(μ_{k})}_{k} \in S {(Ω)}^{K}

associated with structures

{(C_{k})}_{k}

, features

{(B_{k})}_{k}

and base histograms

{(h_{k})}_{k}

.

We suppose that the feature space is

Ω = (R^{d}, ℓ_{2})

and

p = 1

. For simplicity, we assume that the base histograms and the histogram h associated to the barycenter are known and fixed. Note that it could be also included in the optimization process, as suggested in [26].

In this context, for a fixed

N \in N

and

{(λ_{k})}_{k}

, such that

\sum_{k} λ_{k} = 1

, we aim to find:

\min_{μ} \sum_{k} λ_{k} d_{F G W, α, 1, q} (μ, μ_{k}) = \min_{C, A \in R^{N \times d}, {(π_{k})}_{k}} \sum_{k} λ_{k} E_{1, q, α} (M_{A B_{k}}, C, C_{k}, π_{k})

(27)

Note that this problem is convex w.r.t. C and A, but not w.r.t.

π_{k}

. Intuitively, looking for a barycenter means finding feature values supported on a fixed size support, and the structure that relates them. Interestingly enough, there are several variants of this problem, where features or structure can be fixed for the barycenter. Solving the related simpler optimization problems extend straightforwardly.

Solving the barycenter problem with Block Coordinate Descent (BCD). We propose minimizing Equation (27) using a BCD algorithm, i.e., iteratively minimizing with respect to the couplings

π_{k}

, the structure metric C and the feature vector A.The minimization of this problem w.r.t.

{(π_{k})}_{k}

is equivalent to computing K independent Fused Gromov–Wasserstein distances using the algorithm presented above. We suppose that the feature space is

Ω = (R^{d}, ℓ_{2})

and we consider

q = 2

. Minimization w.r.t. C in this case has a closed form (see Prop. 4 in [26]):

C \leftarrow \frac{1}{h h^{T}} \sum_{k} λ_{k} π_{k}^{T} C_{k} π_{k}

(28)

where h is the histogram of the barycenter and the division is computed pointwise. Minimization w.r.t. A can be computed with (Equation (8) in [42]):

A \leftarrow \sum_{k} λ_{k} B_{k} π_{k}^{T} diag (\frac{1}{h}) .

(29)

5. Examples and Applications

In this section we derive some applications of

F G W

in graph contexts such as classification, clustering and coarsening of graphs.

5.1. Illustrations of FGW

In this section, we present several applications of

F G W

as a distance between structured objects and provide an interpretation of the OT matrix.

Example with one-dimensional (1D) features and structure spaces.Figure 10 illustrates the differences between Wasserstein, Gromov–Wasserstein, and Fused Gromov–Wasserstein couplings

π^{*}

. In this example, both the feature and structure space are one-dimensional (Figure 10 left). The feature space (vertical axis) denotes two clusters among the elements of both objects illustrated in the OT matrix

M_{A B}

, the structure space (horizontal axis) denotes a noisy temporal sequence along the indexes illustrated in the matrices

C_{1}

and

C_{2}

(Figure 10 center). Wasserstein respects the clustering, but forgets the temporal structure, Gromov–Wasserstein respects the structure, but do not take the clustering into account. Only FGW retrieves a transport matrix respecting both feature and structure.

Figure 10. Illustration of the difference between W,

G W

, and

F G W

couplings. (left) Empirical distributions

μ

with 20 samples and

ν

with 30 samples which color is proportional to their index. (middle) Cost matrices in the feature (

M_{A B}

) and structure domains (

C_{1}, C_{2}

) with similar samples in white. (right) Solution for all methods. Dark blue indicates a non zero coefficient of the transportation map. Feature distances are large between points laying on the diagonal of

M_{A B}

such that Wasserstein maps is anti-diagonal but unstructured. Fused Gromov–Wasserstein incorporates both feature and structure maps in a single transport map.

Example on two simple images. We extract a

28 \times 28

image from the MNIST dataset and generate a second one through translation or mirroring of the digit in the original image. We use pixel gray levels of the pixels as the features, and the structure is defined as the city-block distance on the pixel coordinate grid. We use equal weights for all of the pixels in the image. Figure 11 shows the different couplings obtained when only considering either the features, the structure only or both information.

F G W

aligns the pixels of the digits, recovering the correct order of the pixels, while both Wassertein and Gromov–Wasserstein distances fail at providing a meaningful transportation map. Note that in the Wasserstein and Gromov-Wasserstein case, the distances are equal to 0, whereas

F G W

manages to spot that the two images are different. Additionally, note that, in the

F G W

sense, the original digit and its mirrored version are also equivalent as there exists an isometry between their structure spaces, making

F G W

invariant to rotations or flips in the structure space in this case.

Figure 11. Couplings obtained when considering (Top left) the features only, where we have

d_{W, 1}^{Ω} = 0

(Top right) the structure only, with

d_{G W, 1} = 0

(Bottom left, and right) both the features and the structure, with

d_{F G W, 0.1, 1, 2}^{Ω}

. For readibility issues, only the couplings starting from non white pixels on the left picture are depicted.

Example on time series data. One of the main assets of

F G W

is that it can be used on a wide class of objects and time series are one more example of this. We consider here 25 monodimensional time series composed of two humps in

[0, 1]

with random uniform height between 0 and 1. The signals are distributed according to two classes translated from each other with a fixed gap. The

F G W

distance is computed by considering d as the euclidean distance between the features of the signals (here the value of the signal in each point) and

d_{X}

and

d_{Y}

as the euclidean distance between timestamps.

A two-dimensional (2D) embedding is computed from a

F G W

distance matrix between a number of examples in this dataset with multidimensional scaling (MDS) [43] in Figure 12 (top). One can clearly see that the representation with a reasonable

α

value in the center is the most discriminant one. This can be better understood by looking as the OT matrices between the classes. Figure 12 (bottom) illustrates the behavior of

F G W

on one pair of examples when going from Wasserstein to Gromov–Wasserstein. The black line depicts the matching provided by the transport matrix and one can clearly see that while Wasserstein on the left assigns samples completely independently to their temporal position, the Gromov-Wasserstein on the right tends to align perfectly the samples (note that it could have reversed exactly the alignment with the same loss), but discards the values in the signal. Only the true

F G W

in the center finds a transport matrix that both respects the time sequences and aligns similar values in the signals.

Figure 12. Behavior of trade-off parameter

α

on a toy time series classification problem.

α

is increasing from left (

α = 0

: Wasserstein distance) to right (

α = 1

: Gromov–Wasserstein distance). (top row) Two-dimensional (2D)-embedding is computed from the set of pairwise distances between samples with multidimensional scaling (MDS) (bottom row) illustration of couplings between two sample time series from opposite classes.

5.2. Graph-Structured Data Classification

In this section, we adress the question of training a classifier for graph data and evaluate the FGW distance used in a kernel with SVM.

Learning problem and datasets. We consider 12 widely used benchmark datasets divided into 3 groups. BZR, COX2 [44], PROTEINS, ENZYMES [45], CUNEIFORM [46] and SYNTHETIC [47] are vector attributed graphs. MUTAG [48], PTC-MR [49], and NCI1 [50] contain graphs with discrete attributes derived from small molecules. IMDB-B, IMDB-M [51] contain unlabeled graphs derived from social networks. All of the datasets are available in [52].

Experimental setup. Regarding the feature distance matrix

M_{A B}

between node features, when dealing with real valued vector attributed graphs, we consider the

ℓ_{2}

distance between the labels of the vertices. In the case of graphs with discrete attributes, we consider two settings: in the first one, we keep the original labels (denoted as raw); we also consider a Weisfeiler-Lehman labeling (denoted as wl) by concatenating the labels of the neighbors. A vector of size h is created by repeating this procedure h times [49,53]. In both cases, we compute the feature distance matrix by using

d (a_{i}, b_{j}) = \sum_{k = 0}^{H} δ (τ (a_{i}^{k}), τ (b_{j}^{k}))

where

δ (x, y) = 1

if

x \neq y

else

δ (x, y) = 0

and

τ (a_{i}^{k})

denotes the concatenated label at iteration k (for

k = 0

original labels are used). Regarding the structure distances C, they are computed by considering a shortest path distance between the vertices.

For the classification task, we run a SVM using the indefinite kernel matrix

e^{- γ F G W}

, which is seen as a noisy observation of the true positive semidefinite kernel [54]. We compare classification accuracies with the following state-of-the-art graph kernel methods: (SPK) denotes the shortest path kernel [45], (RWK) the random walk kernel [55], (WLK) the Weisfeler Lehman kernel [53], and (GK) the graphlet count kernel [56]. For real valued vector attributes, we consider the HOPPER kernel (HOPPERK) [47] and the propagation kernel (PROPAK) [57]. We build upon the GraKel library [58] to construct the kernels and C-SVM to perform the classification. We also compare

F G W

with the PATCHY-SAN framework for CNN on graphs (PSCN) [9] building on our own implementation of the method.

To provide a comparison between the methods, most of the papers about graph classification usually perform a nested cross validation (using nine-fold for training, one for testing, and reporting the average accuracy of this experiment repeated 10 times), and report accuracies of the other methods taken from the original papers. However, these comparisons are not fair because of the high variance (on most datasets) w.r.t. the folds chosen for training and testing. This is why, in our experiments, the nested cross validation is performed on the same folds for training and testing for all methods [59]. In the result, Table 1, Table 2 and Table 3, we write in bold the best score for each dataset and we add a (*) when the best score does not yield to a significative improvement (based on a Wilcoxon signed rank test on the test scores) compared to the second best one. Note that, because of their small sizes, we repeat the experiments 50 times for MUTAG and PTC-MR datasets. For all methods using SVM, we cross validate the parameter

C \in {10^{- 7}, 10^{- 6}, \dots, 10^{7}}

. The range of the WL parameter h is

{0, 1 \dots, 10}

, and we also compute this kernel with h fixed at

2, 4

. The decay factor

λ

for RWK

{10^{- 6}, 10^{- 5} \dots, 10^{- 2}}

, for the GK kernel we set the graphlet size

κ = 3

and cross validate the precision level

ϵ

and the confidence

δ

as in the original paper [56]. The

t_{\max}

parameter for PROPAK is chosen within

{1, 3, 5, 8, 10, 15, 20}

. For PSCN, we choose the normalized betweenness centrality as labeling procedure and cross validate the batch size in

{10, 15, \dots, 35}

and number of epochs in

{10, 20, \dots, 100}

. Finally, for

F G W

,

γ

is cross validated within

{2^{- 10}, 2^{- 9}, \dots, 2^{10}}

and

α

is cross validated via a logspace search in

[0, 0.5]

and symmetrically

[0.5, 1]

(15 values are drawn).

Table 1. Average classification accuracy on the graph datasets with vector attributes.

Table 2. Average classification accuracy on the graph datasets with discrete attributes.

Table 3. Average classification accuracy on the graph datasets with no attributes.

Vector attributed graphs. The average accuracies reported in Table 1 show that FGW is a clear state-of-the-art method and performs best on four out of six datasets with performances in the error bars of the best methods on the other two datasets. The results for CUNEIFORM are significantly below those from the original paper [46], which can be explained by the fact that the method in this paper uses a graph convolutional approach specially designed for this dataset and that experiment settings are different. In comparison, the other competitive methods are less consistent, as they exhibit some good performances on some datasets only.

Discrete labeled graphs. We first note in Table 2 that

F G W

using WL attributes outperforms all competitive methods, including

F G W

with raw features. Indeed, the WL attributes allow for more finely encoding the neighborood of the vertices by stacking their attributes, whereas FGW with raw features only consider the shortest path distance between vertices, not their sequence of labels. This result calls for using meaningful feature and/or structure matrices in the FGW definition, which can be dataset-dependant, in order to enhance the performances. We also note that

F G W

with WL attributes outperforms the WL kernel method, highlighting the benefit of an optimal transport-based distance over a kernel-based similarity. Surprisingly, the results of PSCN are significantly lower than those from the original paper. We believe that it comes from the difference between the fold assignment for training and testing, which suggests that PSCN is difficult to tune.

Non-attributed graphs. The particular case of the GW distance for graph classification is also illustrated on social datasets, which contain no labels on the vertices. Note that no artificial features were added to these datasets. Accuracies reported in Table 3 show that it greatly outperforms the SPK and GK graph kernel methods. This is, to the best of our knowledge, the first application of the Gromov–Wasserstein distance for social graph classification and it highlights the fact that

G W

is a good metric for comparing the graph structures.

Comparison between $F G W$ , W and $G W$ . During the validation step, the optimal value of

α

was consistently selected inside the

] 0, 1 [

interval, i.e., excluding 0 and 1, suggesting that both structure and feature information are necessary.

5.3. Graph Barycenter and Compression

In this experiment, we use

F G W

to compute barycenters and approximations of toy graphs. In the first example, we generate graphs following either a circle or 8 symbol with 1D features following a sine and linear variation, respectively. For each example, the number of nodes is drawn randomly between 10 and 25, Gaussian noise is added to the features and a small noise is applied to the structure (some connections are randomly added). An example graph with no noise is provided for each class in the first column of Figure 13. One can see from there that the circle class has a feature varying smoothly (sine) along the graph, but the 8 has a sharp feature change at its center (so that low pass filtering would loose some information). Some examples of the generated graphs are provided in the 2nd-to-7th columns of Figure 13. We compute the

F G W

barycenter containing 10 samples while using the shortest path distance between the nodes as the structural information and the

ℓ_{2}

distance for the features. We recover an adjacency matrix by thresholding the similarity matrix C given by the barycenter. The threshold is tuned, so as to minimize the Frobenius norm between the original C matrix and the shortest path matrix constructed after thresholding C. Resulting barycenters are showed in Figure 13 for

n = 15

and

n = 7

nodes. First, one can see that the barycenters are denoised in both the feature space and structure space. Also note that the sharp change at the center of the 8 class is conserved in the barycenters, which is a nice result when compared to other divergences that tend to smooth-out their barycenters (

ℓ_{2}

, for instance). Finally, note that, by selecting the number of nodes in the barycenter, one can compress the graph or estimate a “high resolution” representation from all the samples. To the best of our knowledge, no other method can compute such graph barycenters. Finally, note that

F G W

is interpretable, because the resulting OT matrix provides correspondence between the nodes from the samples and those from the barycenter.

Figure 13. Illustration of

F G W

graph barycenter. The first column illustrates the original settings with the noiseless graphs, and columns 2 to 7 are noisy samples that constitute the datasets. Columns 8 and 9 show the barycenters for each setting, with different number of nodes. Blue nodes indicates a feature value close to

- 1

, yellow nodes close to 1.

In the second experiment, we evaluate the ability of FGW to perform graph approximation and compression on a Stochastic Block Model graph [60,61]. The question is to see whether estimating an approximated graph can recover the relation between the blocks and perform simultaneously a community clustering on the original graph (using the OT matrix). We generate two community graphs illustrated in the left column of Figure 14. We can see that the relation between the blocks is sparse and has a “linear” structure, the example in the first line has features that follow the blocks (noisy but similar in each block) whereas the example in the second line has two modes per block. The first graph approximation (top line) is done with four nodes and we can recover both the blocks in the graph and the average feature on each blocks (colors on the nodes). The second problem is more complex due to the two modes per block but one can see that when approximating the graph with eight nodes we recover both the structure between the blocks and the sub-clusters in each block, which illustrates the strength of

F G W

that encodes both features and structures.

Figure 14. Example of community clustering on graphs using

F G W

. (top) Community clustering with four communities and uniform features per cluster. (bottom) Community clustering with four communities and bimodal features per cluster (and two nodes per cluster in the approximate graph).

5.4. Unsupervised Learning: Graphs Clustering

In the last experiment, we evaluate the ability of

F G W

to perform a clustering of multiple graphs and to retrieve meaningful barycenters of such clusters. To do so, we generate a dataset of four groups of community graphs. Each graph follows a simple Stochastic Block Model [60,61] and the groups are defined w.r.t. the number of communities inside each graph and the distribution of their labels. More precisely, labels are drawn from a 1D gaussian distribution specific to each community: each feature of a node is drawn from a law

N (n_{c_{1}} + n_{c_{2}}, σ)

, where

n_{c_{1}}

is the index of the community and

n_{c_{2}} \in {0, 1}

to create two modes per community. The dataset is composed of 40 graphs (10 graphs per group) and the number of nodes of each graph is drawn randomly from

{20, 30, \dots, 50}

, as illustrated in Figure 15. We perform a k-means clustering using the

F G W

barycenter defined in Equation (27) as the centroid of the groups and the

F G W

distance for the cluster assignment. We set the number of nodes of each centroid to 30. We perform a thresholding on the pairwise similarity matrix C of the centroid at the end in order to obtain an adjacency matrix for visualization purposes. The threshold value is chosen, so as to minimize the distance that is induced by the Frobenius norm between the original matrix C and the shortest path matrix obtained from the adjacency matrix. The evolution of the barycenters along the iterations is reported in Figure 15. We can see that these centroids recover community structures and feature distributions that are representative of their cluster content. On this example, note that the clustering perfectly recovers the known groups in the dataset. To the best of our knowledge, there exists no other method that is able to perform a clustering of graphs of arbitrary size and to retrieve the average graph in each cluster without having to solve a pre-image problem.

Figure 15. (left) Examples from the clustering dataset, color indicates the labels. (right) Evolution of the centroids of each cluster in the k-means clustering, from the random initialization until convergence to the final centroid.

6. Conclusions

This paper presents the Fused Gromov–Wasserstein (

F G W

) distance. Inspired by both Wasserstein and Gromov–Wasserstein distances,

F G W

can compare structured objects by including the inherent relations that exist between the elements of the objects, constituting their structure information, and their feature information, part of a common ground space between each structured objects. We have stated mathematical results about this new distance, such as metric, interpolation, and geodesic properties. We have also provided a concentration result for the convergence of finite samples. In the discrete case, algorithms to compute

F G W

itself and related FrÃ©chet means are provided. The use of this new distance is illustrated on problems involving structured objects, such as time series embedding, graph classification, graph barycenter computation, and graph clustering. Several questions are raised by this work. From a practical side, the FGW method is quite expensive to compute and further works will try to lower the computational complexity of the underlying optimization problem to ensure better scalability to very large graphs. Moreover, while we mostly consider in this work structure of graphs described by the shortest path, other choices could be made, such as the distances based on the Laplacian of the graph. Finally, from a theoretical point of view, it is often valuable that the geodesic path be unique, so as to defined properly objects, such as gradient flows. One interesting result would be, for example, to see if the geodesic is unique with respect to the strong isomorphism relation.

7. Proofs of the Mathematical Properties

This section presents all of the proofs of previous theorems and results. We note

P_{i} # μ

the projection on the i-th marginal of

μ

. We will frequently use the following lemma:

Lemma 2.

Let

q \geq 1

. We have:

\forall x, y \in R_{+}, {(x + y)}^{q} \leq 2^{q - 1} (x^{q} + y^{q}) .

(30)

Proof.

Indeed, if

q > 1

{(x + y)}^{q} = {({(\frac{1}{2^{q - 1}})}^{\frac{1}{q}} \frac{x}{{(\frac{1}{2^{q - 1}})}^{\frac{1}{q}}} + {(\frac{1}{2^{q - 1}})}^{\frac{1}{q}} \frac{y}{{(\frac{1}{2^{q - 1}})}^{\frac{1}{q}}})}^{q} \leq {[{(\frac{1}{2^{q - 1}})}^{\frac{1}{q - 1}} + {(\frac{1}{2^{q - 1}})}^{\frac{1}{q - 1}}]}^{q - 1} (\frac{x^{q}}{\frac{1}{2^{q - 1}}} + \frac{y^{q}}{\frac{1}{2^{q - 1}}}) = \frac{x^{q}}{\frac{1}{2^{q - 1}}} + \frac{y^{q}}{\frac{1}{2^{q - 1}}} .

Last inequality is a consequence of HÃlder inequality. The result remains valid for

q = 1

. □

7.1. Proof of Proposition 1—Comparison between FGW, GW and W

Proof.

Of the Proposition.

For the two inequalities (9) and (10), let

π

be an optimal coupling for the Fused Gromov-Wasserstein distance between

μ

and

ν

(assuming its existence for now). Clearly:

\begin{matrix} d_{F G W, α, p, q} (μ, ν) & = {(\int_{{(X \times Ω \times Y \times Ω)}^{2}} {((1 - α) d {(a, b)}^{q} + α L {(x, y, x^{'}, y^{'})}^{q})}^{p} d π ((x, a), (y, b)) d π ((x^{'}, a^{'}), (y^{'}, b^{'})))}^{\frac{1}{p}} \\ \geq {(\int_{X \times Ω \times Y \times Ω} {(1 - α)}^{p} d {(a, b)}^{p q} d π ((x, a), (y, b)))}^{\frac{1}{p}} = (1 - α) {(\int_{Ω \times Ω} d {(a, b)}^{p q} d P_{2, 4} # π (a, b))}^{\frac{1}{p}} \end{matrix}

Becasue

π \in Π (μ, ν)

the coupling

P_{2, 4} # π

is in

Π (μ_{A}, ν_{B})

. So by suboptimality:

d_{F G W, α, p, q} (μ, ν) \geq (1 - α) {(d_{W, p q} (μ_{A}, ν_{B}))}^{q}

which proves Equation (9). Same reasoning is used for Equation (10).

For the last inequality (11), let

π \in Π (μ, ν)

be any admissible coupling. By suboptimality:

\begin{matrix} d_{F G W, α, p, 1} (μ, ν) \leq {(\int_{{(X \times Ω \times Y \times Ω)}^{2}} {((1 - α) d (a, b) + α | d_{Z} (x, x^{'}) - d_{Z} (y, y^{'}) |)}^{p} d π ((x, a), (y, b)) d π ((x^{'}, a^{'}), (y^{'}, b^{'})))}^{\frac{1}{p}} \\ \overset{(*)}{\leq} {(\int_{{(X \times Ω \times Y \times Ω)}^{2}} {((1 - α) d (a, b) + α d_{Z} (x, y) + α d_{Z} (x^{'}, y^{'}))}^{p} d π ((x, a), (y, b)) d π ((x^{'}, a^{'}), (y^{'}, b^{'})))}^{\frac{1}{p}} \\ \leq {(\int_{{(X \times Ω \times Y \times Ω)}^{2}} {((1 - α) d (a, b) + α d_{Z} (x, y) + (1 - α) d (a^{'}, b^{'}) + α d_{Z} (x^{'}, y^{'}))}^{p} d π ((x, a), (y, b)) d π ((x^{'}, a^{'}), (y^{'}, b^{'})))}^{\frac{1}{p}} \\ \overset{(* *)}{\leq} 2 {(\int_{X \times Ω \times Y \times Ω} {((1 - α) d (a, b) + α d_{Z} (x, y))}^{p} d π ((x, a), (y, b)))}^{\frac{1}{p}} \end{matrix}

(*) is the triangle inequality of

d_{Z}

and (**) Minkowski inequality. Since this inequality is true for any admissible coupling

π

we can apply it with the optimal coupling for the Wasserstein distance defined in the proposition and the claim follows. □

7.2. Proof of Theorem 1—Metric Properties of FGW

We prove the theorem point by point: first the existence, then the equality relation and finally the triangle inequality statement. We first recall the definition of weak convergence of probability measure in a metric space [62]:

Definition 10.

(Weak-convergence on a metric space). Let

{(μ_{n})}_{n \in N}

be a sequence of probability measures on

(X, B (d))

where

(X, d)

is a completely metrizable topological space (Polish space) and

B (d)

is the Borel σ algebra. We say that

{(μ_{n})}_{n \in N}

converges weakly to μ in

(X, B (d))

if for all continuous and bounded functions

f : X \to R

:

\int_{X} f d μ_{n} \to \int_{X} f d μ

(31)

We also recall the definition of semi continuity in a metric space:

Definition 11.

(Lower-semi continuity). On a metric space

(X, d)

a function

f : X \to R \cup {+ \infty}

is said to be lower semi-continuous (l.s.c.) if for every sequence

x_{n} \to x

we have

f (x) \leq lim inf f (x_{n})

.

To state the existence of a minimizer, we will rely on the following lemma:

Lemma 3.

Let W be a Polish space. If

f : W \times W \to R_{+} \cup + \infty

is lower semi-continuous, then the functional

J : P (W) \to R \cup + \infty

with

J (μ) = \int \int f (w, w^{'}) d μ (w) d μ (w^{'})

is l.s.c. for the weak convergence of measures.

Proof.

f is l.s.c. and bounded from below by 0. We can consider

{(f_{k})}_{k}

a sequence of continuous and bounded functions converging increasingly to f (see e.g., [63]). By the monotone convergence theorem

J_{k} (μ) \to J (μ) \overset{d e f}{=} {sup}_{k} J_{k} (μ) = {sup}_{k} \int \int f_{k} d μ d μ

.

Moreover every

J_{k}

is continuous for the weak convergence. Using Theorem 2.8 [62] on the Polish space

W \times W

we know that if

μ_{n}

converges weakly to

μ

then the product measure

μ_{n} \otimes μ_{n}

converges weakly to

μ \otimes μ

. In this way

{lim}_{n \to \infty} J_{k} (μ_{n}) = J_{k} (μ)

since

f_{k}

are continuous and bounded. In particular every

J_{k}

is l.s.c.

We can conclude that J is l.s.c. as the supremum of l.s.c. functionals on the metric space of

(P (W), δ)

(see e.g., [63]). Here we equipped

P (W)

with a metric

δ

as e.g.,

δ (μ, ν) = \sum_{k = 1}^{\infty} 2^{- k} | \int_{W} f_{k} d μ - \int_{W} f_{k} d ν |

(see Remark 5.11 in [64]). □

Proposition 2.

Existence of an optimal coupling for the

F G W

distance. For

p, q \geq 1

,

π \to E_{p, q, α} (π)

always achieves a infimum

π^{*}

in

Π (μ, ν)

such that

d_{F G W, α, p, q} (μ, ν) = E_{p, q, α} (π^{*}) < + \infty

.

Proof.

Since

X \times Ω

and

Y \times Ω

are Polish spaces we known that

Π (μ, ν) \subset P (X \times Ω \times Y \times Ω)

is compact (Theorem 1.7 in [63]), so by applying Weierstrass theorem we can conclude that the infimum is attained at some

π^{*} \in Π (μ, ν)

if

π \to E_{p, q, α} (π)

is l.s.c.

We will use Lemma 3 to prove that the functionnal is l.s.c. on

Π (μ, ν)

. If we consider

W = X \times Ω \times Y \times Ω

which is a a metric space endowed with the distance

d_{X} \oplus d \oplus d_{Y} \oplus d

and

f ((w = (x, a, y, b), w^{'} = (x^{'}, a^{'}, y^{'}, b^{'})) = {((1 - α) d {(a, b)}^{q} + α L {(x, y, x^{'}, y^{'})}^{q})}^{p}

then f is l.s.c. by continuity of d,

d_{X}

and

d_{Y}

. With the previous reasoning we can conclude that the infimum is attained. Finally finiteness come from:

\begin{matrix} \int_{{(X \times Ω \times Y \times Ω)}^{2}} {((1 - α) d {(a, b)}^{q} + α L {(x, y, x^{'}, y^{'})}^{q})}^{p} d π ((x, a), (y, b)) d π ((x^{'}, a^{'}), (y^{'}, b^{'})) \\ \overset{*}{\leq} \int 2^{p - 1} (1 - α) d {(a, b)}^{q p} d μ_{A} (a) d ν (b) + \int 2^{p - 1} α L {(x, y, x^{'}, y^{'})}^{q p} d μ_{X} (x) d μ_{X} (x^{'}) d ν_{Y} (y^{'}) d ν_{Y} (y^{'}) \\ \overset{* *}{<} + \infty \end{matrix}

(32)

where in (*) we used Lemma 30 and in (**) that

μ, ν

are in

S_{p q} (Ω)

. □

Proposition 3.

Equality relation. For

α \in] 0, 1 [

,

d_{F G W, α, p, q} (μ, ν) = 0

if an only if there exists a bijective function

f = (f_{1}, f_{2}) : s u p p (μ) \to s u p p (ν)

such that:

f # μ = ν

(33)

\forall (x, a) \in s u p p (μ), f_{2} (x, a) = a

(34)

\forall (x, a), (x^{'}, a^{'}) \in s u p p {(μ)}^{2}, d_{X} (x, x^{'}) = d_{Y} (f_{1} (x, a), f_{1} (x^{'}, a^{'}))

(35)

Moreover if

(μ, ν)

are generalized labeled graphs then

d_{F G W, α, p, q} (μ, ν) = 0

if and only if

(X, d_{X}, μ)

and

(Y, d_{Y}, ν)

are strongly isomorphic.

Proof.

For the first point, let us assume that there exists a function f verifying (33)–(35). We consider the map

π = (I_{d} \times f) # μ \in Π (μ, ν)

. We note

f = (f_{1}, f_{2})

. Then:

\begin{matrix} E_{p, q, α} (π) & = \int_{{(X \times Ω \times Y \times Ω)}^{2}} {((1 - α) d {(a, b)}^{q} + α L ({(x, y, x^{'}, y^{'})}^{q})}^{p} d π ((x, a), (y, b)) d π ((x^{'}, a^{'}), (y^{'}, b^{'})) \\ = \int_{{(X \times Ω)}^{2}} {((1 - α) d {(a, f_{2} (x, a))}^{q} + α L {((x, f_{1} (x, a), x^{'}, f_{1} (x^{'}, a^{'}))}^{q})}^{p} d μ (x, a) d μ (x^{'}, a^{'}) \\ = \int_{{(X \times Ω)}^{2}} {((1 - α) d {(a, f_{2} (x, a))}^{q} + α {| d_{X} (x, x^{'}) - d_{Y} (f_{1} (x, a), f_{1} (x^{'}, a^{'})) |}^{q})}^{p} d μ (x, a) d μ (x^{'}, a^{'}) \\ = 0 \end{matrix}

(36)

Conversely, suppose that

d_{F G W, α, p, q} (μ, ν) = 0

. To prove the existence of a map

f : supp (μ) \to supp (ν)

verifying (33)–(35) we will use the Gromov-Wasserstein properties. We are looking for a vanishing Gromov-Wassersein distance between the spaces

X \times Ω

and

Y \times Ω

equipped with our two measures

μ

and

ν

.

More precisely, we define for

((x, a), (y, b), (x^{'}, a^{'}), (y^{'}, b^{'})) \in {(X \times Ω \times Y \times Ω)}^{2}

and

β \in] 0, 1 [

:

d_{X \times Ω} ((x, a), (x^{'}, a^{'})) = (1 - β) d_{X} (x, x^{'}) + β d (a, a^{'})

and

d_{Y \times Ω} ((y, b), (y^{'}, b^{'})) = (1 - β) d_{Y} (y, y^{'}) + β d (b, b^{'})

We will prove that

d_{G W, p} (d_{X \times Ω}, d_{Y \times Ω}, μ, ν) = 0

. To show that we will bound the Gromov cost with the metrics

d_{X \times Ω}, d_{Y \times Ω}

by the Gromov cost with the metrics

d_{X}, d_{Y}

and a Wasserstein cost.

Let

π \in Π (μ, ν)

be any admissible transportation plan. Subsequently, for

n \geq 1

:

\begin{matrix} J_{n} (d_{X \times Ω}, d_{Y \times Ω}, π) \overset{d e f}{=} \int_{{(X \times Ω \times Y \times Ω)}^{2}} L {(x, y, x^{'}, y^{'})}^{n} d π ((x, a), (y, b)) d π ((x^{'}, a^{'}), (y^{'}, b^{'})) \\ = \int_{{(X \times Ω \times Y \times Ω)}^{2}} {| (1 - β) (d_{X} (x, x^{'}) - d_{Y} (y, y^{'})) + β (d (a, a^{'}) - d (b, b^{'})) |}^{n} d π ((x, a), (y, b)) d π ((x^{'}, a^{'}), (y^{'}, b^{'})) \\ \leq \int_{{(X \times Ω \times Y \times Ω)}^{2}} (1 - β) {| d_{X} (x, x^{'}) - d_{Y} (y, y^{'}) |}^{n} d π ((x, a), (y, b)) d π ((x^{'}, a^{'}), (y^{'}, b^{'})) \\ + \int_{{(X \times Ω \times Y \times Ω)}^{2}} β {| d (a, a^{'}) - d (b, b^{'}) |}^{n} d π ((x, a), (y, b)) d π ((x^{'}, a^{'}), (y^{'}, b^{'})) \end{matrix}

using Jensen inequality with convexity of

t \to t^{n}

and subadditivity of |.|. We note

(*)

the first term above and

(* *)

the second term above. By the triangle inequality property of d we have:

(**)

\leq β \int_{{(X \times Ω \times Y \times Ω)}^{2}} (d (a, b) + d (a^{'}, b^{'}))^{n} d π ((x, a), (y, b)) d π ((x^{'}, a^{'}), (y^{'}, b^{'})) \overset{d e f}{=} β M_{n} (π)

such that we have shown:

\forall π \in Π (μ, ν), \forall n \geq 1, J_{n} (d_{X \times Ω}, d_{Y \times Ω}, π) \leq (1 - β) J_{n} (d_{X}, d_{Y}, π) + β M_{n} (π)

(37)

Now let

π_{*}

be an optimal coupling for

d_{F G W, α, p, q}

between

μ

and

ν

. By hypothesis

d_{F G W, α, p, q} (μ, ν) = 0

so that:

J_{q p} (d_{X}, d_{Y}, π *) = 0

(38)

and:

H_{q p} (π^{*}) = 0 .

(39)

Subsequently,

\int_{(X \times Ω \times Y \times Ω)} d {(a, b)}^{q p} d π^{*} ((x, a), (y, b)) = 0

which implies that d is zero

π^{*}

a.e. so that

\int_{(X \times Ω \times Y \times Ω)} d {(a, b)}^{m} d π^{*} ((x, a), (y, b)) = 0

for any

m \in N^{*}

. In this way:

\begin{matrix} M_{q p} (π^{*}) = β \int_{{(X \times Ω \times Y \times Ω)}^{2}} \sum_{h} (\binom{q p}{h}) d {(a, b)}^{h} d {(a^{'}, b^{'})}^{q p - h} d π^{*} ((x, a), (y, b)) d π^{*} ((x^{'}, a^{'}), (y^{'}, b^{'})) \\ = β \sum_{h} (\binom{q p}{h}) (\int_{(X \times Ω \times Y \times Ω)} d {(a, b)}^{h} d π^{*} ((x, a), (y, b))) (\int_{(X \times Ω \times Y \times Ω)} d {(a^{'}, b^{'})}^{q p - h} d π^{*} ((x^{'}, a^{'}), (y^{'}, b^{'}))) = 0 \end{matrix}

Using Equation (37), we have shown

J_{q p} (d_{X \times Ω}, d_{Y \times Ω}, π_{*}) = 0

which implies that

d_{G W, p} (d_{X \times Ω}, d_{Y \times Ω}, μ, ν) = 0

for the coupling

π^{*}

.

Thanks to the Gromov–Wasserstein properties (see [17]) this states the existence of an isometry between

supp (μ)

and

supp (ν)

. So there exists a surjective function

f = (f_{1}, f_{2}) : supp (μ) \to supp (ν)

, which verifies P.1 and:

\forall ((x, a), (x^{'}, a^{'})) \in {(supp (μ))}^{2}, d_{X \times Ω} ((x, a), (x^{'}, a^{'})) = d_{Y \times Ω} (f (x, a), f (x^{'}, a^{'}))

(40)

or equivalently:

\forall ((x, a), (x^{'}, a^{'})) \in {(supp (μ))}^{2}, (1 - β) d_{X} (x, x^{'}) + β d (a, a^{'}) = (1 - β) d_{Y} (f_{1} (x, a), f_{1} (x^{'}, a^{'})) + β d (f_{2} (x, a), f_{2} (x^{'}, a^{'}))

(41)

In particular,

π^{*}

is concentrated on

{(x, y) = f (x, y)}

or equivalently

π^{*} = (I_{d} \times f) # μ

. Injecting

π^{*}

in (39) leads to:

\begin{matrix} H_{q p} (π^{*}) & = \int_{(X \times Ω \times Y \times Ω)} d {(a, b)}^{q p} d π^{*} ((x, a), (y, b)) = \int_{X \times Ω} d {(a, f_{2} (x, a))}^{q p} d μ (x, a) = 0 \end{matrix}

(42)

Which implies:

\forall (x, a) \in supp (μ), f_{2} (x, a) = a

(43)

Moreover, using the equality (41), we can conclude that:

\forall (x, a) (x^{'}, a^{'}) \in supp {(μ)}^{2}, d_{X} (x, x^{'}) = d_{Y} (f_{1} (x, a), f_{1} (x^{'}, a^{'}))

(44)

In this way, f verifies all of the properties (33)–(35).

Moreover, suppose that

μ

and

ν

are generalized labeled graphs. In this case there exists

ℓ_{f} : X \to A

surjective such that

μ = (I \times ℓ_{f}) # μ_{X}

. Afterwards, (44) implies that:

\forall (x, x^{'}) \in supp {(μ_{X})}^{2}, d_{X} (x, x^{'}) = d_{Y} (f_{1} (x, ℓ_{f} (x)), f_{1} (x^{'}, ℓ_{f} (x^{'})))

(45)

We define

I : supp (μ_{X}) \to supp (μ_{Y})

such that

I (x) = f_{1} (x, ℓ_{f} (x))

. Then we have by (45)

d_{X} (x, x^{'}) = d_{Y} (I (x), I (x^{'}))

for

(x, x^{'}) \in {supp (μ_{X})}^{2}

. Overall we have

f (x, a) = (I (x), a)

for all

(x, a) \in supp (μ)

. Also since

f # μ = ν

we have

I # μ_{X} = ν_{Y}

.

Moreover, I is a surjective function. Indeed let

y \in supp (ν_{Y})

. Let

b \in supp (ν_{B})

such that

(y, b) \in supp (ν)

. By surjectivity of f there exists

(x, a) \in supp (μ)

such that

(y, b) = f (x, a) = (I (x), a)

so that

y = I (x)

.

Overall, f satisfies all P.1, P.2 and P.3 if

μ

and

ν

are generalized labeled graphs. The converse is also true using the reasoning in (36). □

Proposition 4.

Symmetry and triangle inequality.

d_{F G W, α, p, q}

is symmetric and for

q = 1

satisfies the triangle inequality. For

q \geq 2

the triangle inequality is relaxed by a factor

2^{q - 1}

.

To prove this result we will use the following lemma:

Lemma 4.

Let

(X \times Ω, d_{X}, μ), (Y \times Ω, d_{Y}, β), (Z \times Ω, d_{Z}, ν) \in S {(Ω)}^{3}

. For

(x, a), (x^{'}, a^{'}) \in {(X \times Ω)}^{2}

,

(y, b), (y^{'}, b^{'}) \in {(Y \times Ω)}^{2}

and

(z, c), (z^{'}, c^{'}) \in {(Z \times Ω)}^{2}

we have:

L {(x, z, x^{'}, z^{'})}^{q} \leq 2^{q - 1} (L {(x, y, x^{'}, y^{'})}^{q} + L {(y, z, y^{'}, z^{'})}^{q})

(46)

d {(a, c)}^{q} \leq 2^{q - 1} (d {(a, b)}^{q} + d {(b, c)}^{q})

(47)

Proof.

Direct consequence of (30) and triangle inequalities of

d, d_{X}, d_{Y}, d_{Z}

. □

Proof of Proposition 4.

To prove the triangle inequality of

d_{F G W, α, p, q}

distance for arbitrary measures we will use the Gluing lemma which stresses the existence of couplings with a prescribed structure. Let

(X \times Ω, d_{X}, μ), (Y \times Ω, d_{Y}, β), (Z \times Ω, d_{Z}, ν) \in S {(Ω)}^{3}

.

Let

π_{1} \in Π (μ, β)

and

π_{2} \in Π (β, ν)

be optimal transportation plans for the Fused Gromov-Wasserstein distance between

μ

,

β

and

β

,

ν

respectively. By the Gluing Lemma (see [15] and Lemma 5.3.2 in [65]) there exists a probability measure

π \in P ((X \times Ω) \times (Y \times Ω) \times (Z \times Ω))

with marginals

π_{1}

on

(X \times Ω) \times (Y \times Ω)

and

π_{2}

on

(Y \times Ω) \times (Z \times Ω)

. Let

π_{3}

be the marginal of

π

on

(X \times Ω) \times (Z \times Ω)

. By construction

π_{3} \in Π (μ, ν)

. So by suboptimality of

π_{3}

:

\begin{matrix} d_{F G W, α, p, q} (d_{X}, d_{Z}, μ, ν) \leq {(\int_{{(X \times Ω \times Z \times Ω)}^{2}} {((1 - α) d {(a, c)}^{q} + α L {(x, z, x^{'}, z^{'})}^{q})}^{p} d π_{3} ((x, a), (z, c)) d π_{3} ((x^{'}, a^{'}), (z^{'}, c^{'})))}^{\frac{1}{p}} \\ = {(\int_{{(X \times Ω \times Y \times Ω \times Z \times Ω)}^{2}} {((1 - α) d {(a, c)}^{q} + α L {(x, z, x^{'}, z^{'})}^{q})}^{p} d π ((x, a), (y, b), (z, c)) d π ((x^{'}, a^{'}), (y^{'}, b^{'}), (z^{'}, c^{'})))}^{\frac{1}{p}} \\ \overset{(*)}{\leq} 2^{q - 1} (\int_{{(X \times Ω \times Y \times Ω \times Z \times Ω)}^{2}} {((1 - α) d {(a, b)}^{q} + (1 - α) d {(b, c)}^{q} + α L {(x, y, x^{'}, y^{'})}^{q} + α L {(y, z, y^{'}, z^{'})}^{q})}^{p} \\ d π ((x, a), (y, b), (z, c)) d π ((x^{'}, a^{'}), (y^{'}, b^{'}), (z^{'}, c^{'})))^{\frac{1}{p}} \\ \overset{(* *)}{\leq} 2^{q - 1} ({(\int_{{(X \times Ω \times Y \times Ω \times Z \times Ω)}^{2}} {((1 - α) d {(a, b)}^{q} + α L {(x, y, x^{'}, y^{'})}^{q})}^{p} d π ((x, a), (y, b), (z, c)) d π ((x^{'}, a^{'}), (y^{'}, b^{'}), (z^{'}, c^{'})))}^{\frac{1}{p}} \\ + {(\int_{{(X \times Ω \times Y \times Ω \times Z \times Ω)}^{2}} {((1 - α) d {(b, c)}^{q} + α L {(y, z, y^{'}, z^{'})}^{q})}^{p} d π ((x, a), (y, b), (z, c)) d π ((x^{'}, a^{'}), (y^{'}, b^{'}), (z^{'}, c^{'})))}^{\frac{1}{p}}) \\ = 2^{q - 1} ({(\int_{{(X \times Ω \times Y \times Ω)}^{2}} {((1 - α) d {(a, b)}^{q} + α L {(x, y, x^{'}, y^{'})}^{q})}^{p} d π_{1} ((x, a), (y, b)) d π_{1} ((x^{'}, a^{'}), (y^{'}, b^{'})))}^{\frac{1}{p}} \\ + {(\int_{{(Y \times Ω \times Z \times Ω)}^{2}} {((1 - α) d {(b, c)}^{q} + α L {(y, z, y^{'}, z^{'})}^{q})}^{p} d π_{2} ((y, b), (z, c)) d π_{2} ((y^{'}, b^{'}), (z^{'}, c^{'})))}^{\frac{1}{p}}) \\ = 2^{q - 1} (d_{F G W, α, p, q} (μ, β) + d_{F G W, α, p, q} (β, ν)) \end{matrix}

with (*) comes from (46) and (47) and (**) is Minkowski inequality. So when

q = 1

,

d_{F G W, α, p, q}

satisfies the triangle inequality and when

q > 1

,

d_{F G W, α, p, q}

satisfies a relaxed triangle inequality so that it defines a semi-metric as described previously. □

7.3. Proof of Theorem 2—Convergence and Concentration Inequality

Proof.

The proof of the convergence in

F G W

directly stems from the weak convergence of the empirical measure and Lemma 3. Moreover, since

μ_{n}

and

μ

are both in the same ground space, we have:

d_{F G W, α, p, 1} (μ_{n}, μ) \leq 2 d_{W, p} (μ_{n}, μ) \Rightarrow E [d_{F G W, α, p, 1} (μ_{n}, μ)] \leq 2 E [d_{W, p} (μ_{n}, μ)] .

We can directly apply Theorem 1 in [31] to state the inequality. □

7.4. Proof of Theorem 3—Interpolation Properties between GW and W

Proof.

Let

π_{O T} \in Π (μ_{A}, ν_{B})

be an optimal coupling for the

p q

-Wasserstein distance between

μ_{A}

and

ν_{B}

. We can use the same Gluing lemma (Lemma 5.3.2 in [65]) to construct:

ρ \in P (\underset{π_{O T}}{\underset{︸}{\overset{μ}{\overset{︷}{X \times Ω}} \times \overset{ν}{\overset{︷}{Ω \times Y}})}}

such that

ρ \in Π (μ, ν)

and

P_{2, 3} # ρ = π_{O T}

.

Moreover we have:

\int_{Ω \times Ω} d {(a, b)}^{p q} d π_{O T} (a, b) = \int_{X \times Ω \times Ω \times Y} d {(a, b)}^{p q} d ρ (x, a, b, y)

(48)

Let

α \geq 0

and

π_{α}

optimal plan for the Fused Gromov-Wasserstein distance between

μ

and

ν

.

We can deduce that:

\begin{matrix} d_{F G W, α, p, q} {(μ, ν)}^{p} - {(1 - α)}^{p} d_{W, p q} {(μ_{A}, ν_{B})}^{p q} \\ = \int_{{(X \times Ω \times Y \times Ω)}^{2}} {((1 - α) d {(a, b)}^{q} + α L {(x, y, x^{'}, y^{'})}^{q})}^{p} d π_{α} ((x, a), (y, b)) d π_{α} ((x^{'}, a^{'}), (y^{'}, b^{'})) - \int_{Ω \times Ω} {(1 - α)}^{p} d {(a, b)}^{p q} d π_{O T} (a, b) \\ \overset{(*)}{\leq} \int_{{(X \times Ω \times Y \times Ω)}^{2}} {((1 - α) d {(a, b)}^{q} + α L {(x, y, x^{'}, y^{'})}^{q})}^{p} d ρ (x, a, b, y) d ρ (x^{'}, a^{'}, b^{'}, y^{'}) - \int_{X \times Ω \times Y \times Ω} {(1 - α)}^{p} d {(a, b)}^{p q} d ρ (x, a, b, y) \\ = {(1 - α)}^{p} \int_{{(X \times Ω \times Y \times Ω)}^{2}} d {(a, b)}^{p q} d ρ (x, a, b, y) d ρ (x^{'}, a^{'}, b^{'}, y^{'}) - {(1 - α)}^{p} \int_{X \times Ω \times Y \times Ω} d {(a, b)}^{p q} d ρ (x, a, b, y) \\ + \sum_{k = 0}^{p - 1} (\binom{p}{k}) {(1 - α)}^{k} α^{p - k} \int_{{(X \times Ω \times Y \times Ω)}^{2}} d {(a, b)}^{q k} L {(x, y, x^{'}, y^{'})}^{q (p - k)} d ρ (x, a, b, y) d ρ (x^{'}, a^{'}, b^{'}, y^{'}) \\ = \sum_{k = 0}^{p - 1} (\binom{p}{k}) {(1 - α)}^{k} α^{p - k} \int_{{(X \times Ω \times Y \times Ω)}^{2}} d {(a, b)}^{q k} L {(x, y, x^{'}, y^{'})}^{q (p - k)} d ρ (x, a, b, y) d ρ (x^{'}, a^{'}, b^{'}, y^{'}) . \end{matrix}

We note

H_{k} = \int_{{(X \times Ω \times Y \times Ω)}^{2}} d {(a, b)}^{q k} L {(x, y, x^{'}, y^{'})}^{q (p - k)} d ρ (x, a, b, y) d ρ (x^{'}, a^{'}, b^{'}, y^{'})

.

Using (9) we have shown that:

(1 - α) {(d_{W, p q} (μ_{A}, ν_{B}))}^{q} \leq d_{F G W, α, p, q} (μ, ν) \leq {({(1 - α)}^{p} {(d_{W, p q} (μ_{A}, ν_{B}))}^{p q} + \sum_{k = 0}^{p - 1} (\binom{p}{k}) {(1 - α)}^{k} α^{p - k} H_{k})}^{\frac{1}{p}}

Accordingly,

lim_{α \to 0} d_{F G W, α, p, q} (μ, ν) = {(d_{W, p q} (μ_{A}, ν_{B}))}^{q}

.

For the case

α \to 1

we rather consider

π_{G W} \in Π (μ_{X}, ν_{Y})

an optimal coupling for the

p q

-Gromov–Wasserstein distance between

μ_{X}

and

ν_{Y}

and we construct

γ \in P (\underset{π_{G W}}{\underset{︸}{\overset{μ}{\overset{︷}{X \times Ω}} \times \overset{ν}{\overset{︷}{Ω \times Y}})}}

such that

γ \in Π (μ, ν)

and

P_{2, 3} # ρ = π_{G W}

. In the same way as previous reasoning we can derive:

α {(d_{G W, p q} (μ_{X}, ν_{Y}))}^{q} \leq d_{F G W, α, p, q} (μ, ν) \leq (α^{p} {(d_{G W, p q} (μ_{X}, ν_{Y}))}^{p q} + \sum_{k = 0}^{p - 1} (\binom{p}{k}) {(1 - α)}^{p - k} α^{k} J_{k})^{\frac{1}{p}}

(49)

with

J_{k} = \int d {(a, b)}^{q (p - k)} L {(x, y, x^{'}, y^{'})}^{q k} d ρ (x, a, b, y) d ρ (x^{'}, a^{'}, b^{'}, y^{'})

. In this way

lim_{α \to 1} d_{F G W, α, p, q} (μ, ν) = {(d_{G W, p q} (μ_{X}, ν_{Y}))}^{q}

. □

7.5. Proof of Theorem 4—Constant Speed Geodesic

Proof.

Let

t, s \in [0, 1]

. Recalling:

\forall (x, a), (y, b) \in X \times Ω \times Y \times Ω, η_{t} (x, a, y, b) = (x, y, (1 - t) a + t b)

(50)

We note

S_{t} = {(X \times Y \times Ω, (1 - t) d_{X} \oplus t d_{Y}, μ_{t} = η_{t} # π^{*})}_{t \in [0, 1]}

and

d_{t} = (1 - t) d_{X} \oplus t d_{Y}

. Let

∥ . ∥

be any

ℓ_{m}

norm for

m \geq 1

. It suffices to prove:

d_{F G W, α, p, 1} (μ_{t}, μ_{s}) \leq | t - s | d_{F G W, α, p, 1} (μ_{0}, μ_{1})

(51)

To do so, we consider

Δ_{s}^{t} \in P (X \times Y \times Ω \times X \times Y \times Ω)

defined by

Δ_{s}^{t} = (η_{t} \times η_{s}) # π^{*} \in Π (μ_{t}, μ_{s})

and the following “diagonal” coupling:

d γ_{s}^{t} ((x, y), a, (x^{″}, y^{″}), b) = d Δ_{s}^{t} ((x, y), a, (x^{″}, y^{″}), b) d δ_{(x_{0}, x_{1})} (x_{0}^{″}, x_{1}^{″})

(52)

Subsquently,

γ_{s}^{t} \in P (X \times Y \times Ω \times X \times Y \times Ω)

and since

Δ_{s}^{t} \in Π (μ_{t}, μ_{s})

then

γ_{s}^{t} \in Π (μ_{t}, μ_{s})

So by suboptimality:

\begin{matrix} d_{F G W, α, p, 1} {(μ_{t}, μ_{s})}^{p} \leq \int_{{(X \times Y \times Ω \times X \times Y \times Ω)}^{2}} {((1 - α) d (a, b) + α | d_{t} [(x, y), (x^{'}, y^{'})] - d_{s} [(x^{″}, y^{″}), (x^{″'}, y^{″'})] |)}^{p} \\ d γ_{s}^{t} (x, y, a, x^{''}, y^{''}, b) d γ_{s}^{t} (x^{'}, y^{'}, a^{'}, x^{'''}, y^{'''}, b^{'}) \\ = \int_{{(X \times Y \times Ω \times X \times Y \times Ω)}^{2}} {((1 - α) d (a, b) + α | d_{t} [(x, y), (x^{'}, y^{'})] - d_{s} [(x, y), (x^{'}, y^{'})] |)}^{p} \\ d Δ_{s}^{t} (x, y, a, x, y, b) d Δ_{s}^{t} (x^{'}, y^{'}, a^{'}, x^{'}, y^{'}, b^{'}) \\ = \int_{{(X \times Ω \times Y \times Ω)}^{2}} {((1 - α) ∥ (1 - t) a + t b - (1 - s) a - s b ∥ + α | (1 - t) d_{X} (x, x^{'}) + t d_{Y} (y, y^{'}) - (1 - s) d_{X} (x, x^{'}) + s d_{Y} (y, y^{'}) |)}^{p} \\ d π^{*} (x, a, y, b) d π^{*} (x^{'}, a^{'}, y^{'}, b^{'}) \\ = {| t - s |}^{p} \int_{{(X \times Ω \times Y \times Ω)}^{2}} {((1 - α) ∥ a - b ∥ + α | d_{X} (x, x^{'}) - d_{Y} (y, y^{'}) |)}^{p} d π^{*} (x, a, y, b) d π^{*} (x^{'}, a^{'}, y^{'}, b^{'}) \end{matrix}

So

d_{F G W, α, p, 1} (μ_{t}, μ_{s}) \leq | t - s | d_{F G W, α, p, 1} (d_{0}, d_{1}, μ_{0}, μ_{1})

. □

Author Contributions

Conceptualization, T.V., L.C., N.C., R.T. and R.F.; methodology, T.V., L.C., N.C., R.T. and R.F.; software, T.V., R.F., L.C. and N.C.; formal analysis, T.V., L.C., N.C., R.T. and R.F.; investigation, T.V., L.C., N.C., R.T. and R.F.; resources, T.V., L.C., N.C., R.T. and R.F.; data curation, T.V., R.F., L.C. and N.C.; writing—original draft preparation, T.V.; writing—review and editing, T.V., L.C., N.C., R.T. and R.F.; visualization, T.V., L.C. and R.F.; supervision, L.C., N.C., R.T. and R.F.; project administration, L.C., N.C., R.T. and R.F.; funding acquisition, L.C., N.C., R.T. and R.F. All authors have read and agreed to the published version of the manuscript.

Funding

This work is partially funded through the projects OATMIL ANR-17-CE23-0012, 3IA Cote dÁzur Investments ANR-19-P3IA-0002 and MATS ANR-18-CE23-0006 of the French National Research Agency (ANR).

Conflicts of Interest

The authors declare no conflict of interest.

References

Battaglia, P.W.; Hamrick, J.B.; Bapst, V.; Sanchez-Gonzalez, A.; Zambaldi, V.; Malinowski, M.; Tacchetti, A.; Raposo, D.; Santoro, A.; Faulkner, R.; et al. Relational inductive biases, deep learning, and graph networks. arXiv 2018, arXiv:1806.01261. [Google Scholar]
Pearl, J. Fusion, Propagation, and Structuring in Belief Networks. Artif. Intell. 1986, 29, 241–288. [Google Scholar] [CrossRef]
Pearl, J. Causality: Models, Reasoning and Inference, 2nd ed.; Cambridge University Press: New York, NY, USA, 2009. [Google Scholar]
Džeroski, S.; De Raedt, L.; Driessens, K. Relational Reinforcement Learning. Mach. Learn. 2001, 43, 7–52. [Google Scholar] [CrossRef]
Hjort, N.; Holmes, C.; Mueller, P.; Walker, S. Bayesian Nonparametrics: Principles and Practice; Cambridge University Press: Cambridge, UK, 2010. [Google Scholar]
LeCun, Y.; Bengio, Y.; Hinton, G.E. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016. [Google Scholar]
Shervashidze, N.; Schweitzer, P.; van Leeuwen, E.J.; Mehlhorn, K.; Borgwardt, K.M. Weisfeiler-Lehman Graph Kernels. J. Mach. Learn. Res. 2011, 12, 2539–2561. [Google Scholar]
Niepert, M.; Ahmed, M.; Kutzkov, K. Learning Convolutional Neural Networks for Graphs. In Proceedings of the International Conference on Machine Learning Research, New York, NY, USA, 20–22 June 2016; Volume 48, pp. 2014–2023. [Google Scholar]
Bakir, G.H.; Hofmann, T.; Schölkopf, B.; Smola, A.J.; Taskar, B.; Vishwanathan, S.V.N. Predicting Structured Data (Neural Information Processing); The MIT Press: Cambridge, MA, USA, 2007. [Google Scholar]
Sakoe, H.; Chiba, S. Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans. Acoust. Speech Signal Process. 1978, 26, 43–49. [Google Scholar] [CrossRef]
Cuturi, M.; Blondel, M. Soft-DTW: A Differentiable Loss Function for Time-Series. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), Sydney, Australia, 6–11 August 2017; Volume 70, pp. 894–903. [Google Scholar]
Nowozin, S.; Gehler, P.V.; Jancsary, J.; Lampert, C.H. Advanced Structured Prediction; The MIT Press: Cambridge, MA, USA, 2014. [Google Scholar]
Niculae, V.; Martins, A.; Blondel, M.; Cardie, C. SparseMAP: Differentiable Sparse Structured Inference. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; Volume 80, pp. 3796–3805. [Google Scholar]
Villani, C. Optimal Transport: Old and New, 2009th ed.; Grundlehren der Mathematischen Wissenschaften; Springer: Berlin/Heidelberg, Germany, 2008. [Google Scholar]
Sturm, K.T. The space of spaces: Curvature bounds and gradient flows on the space of metric measure spaces. arXiv 2012, arXiv:1208.0434. [Google Scholar]
Memoli, F. Gromov Wasserstein Distances and the Metric Approach to Object Matching. Found. Comput. Math. 2011, 1–71. [Google Scholar] [CrossRef]
Vayer, T.; Courty, N.; Tavenard, R.; Chapel, L.; Flamary, R. Optimal Transport for structured data with application on graphs. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; Volume 97, pp. 6275–6284. [Google Scholar]
Verma, S.; Zhang, Z.L. Hunt For The Unique, Stable, Sparse And Fast Feature Learning On Graphs. In Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 88–98. [Google Scholar]
Rubner, Y.; Tomasi, C.; Guibas, L.J. The Earth Mover’s Distance as a Metric for Image Retrieval. Int. J. Comput. Vis. 2000, 40, 99–121. [Google Scholar] [CrossRef]
Peyré, G.; Cuturi, M. Computational Optimal Transport. Found. Trends Mach. Learn. 2019, 11, 355–607. [Google Scholar] [CrossRef]
Solomon, J.; Peyré, G.; Kim, V.G.; Sra, S. Entropic Metric Alignment for Correspondence Problems. ACM Trans. Graph. 2016, 35, 72:1–72:13. [Google Scholar] [CrossRef]
Ezuz, D.; Solomon, J.; Kim, V.G.; Ben-Chen, M. GWCNN: A Metric Alignment Layer for Deep Shape Analysis. Comput. Graph. Forum 2017, 36, 49–57. [Google Scholar] [CrossRef]
Bunne, C.; Alvarez-Melis, D.; Krause, A.; Jegelka, S. Learning Generative Models across Incomparable Spaces. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 851–861. [Google Scholar]
Demetci, P.; Santorella, R.; Sandstede, B.; Noble, W.S.; Singh, R. Gromov-Wasserstein optimal transport to align single-cell multi-omics data. bioRxiv 2020. [Google Scholar] [CrossRef]
Peyré, G.; Cuturi, M.; Solomon, J. Gromov-Wasserstein averaging of kernel and distance matrices. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), New York, NY, USA, 19–24 June 2016; pp. 2664–2672. [Google Scholar]
Haasdonk, B.; Bahlmann, C. Learning with Distance Substitution Kernels. In Pattern Recognition; Rasmussen, C.E., Bülthoff, H.H., Schölkopf, B., Giese, M.A., Eds.; Springer: Berlin, Heidelberg, 2004; pp. 220–227. [Google Scholar]
Borg, I.; Groenen, P. Modern Multidimensional Scaling: Theory and Applications; Springer: Berlin/Heidelberg, Germany, 2005. [Google Scholar]
Bachem, O.; Lucic, M.; Krause, A. Practical Coreset Constructions for Machine Learning. arXiv 2017, arXiv:1703.06476. [Google Scholar]
Thorpe, M.; Park, S.; Kolouri, S.; Rohde, G.K.; Slepčev, D. A Transportation L^p Distance for Signal Analysis. J. Math. Imaging Vis. 2017, 59, 187–210. [Google Scholar] [CrossRef]
Jonathan Weed, F.B. Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. arXiv 2017, arXiv:1707.00087. [Google Scholar]
Benamou, J.D.; Brenier, Y. A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem. Numer. Math. 2000, 84, 375–393. [Google Scholar] [CrossRef]
Bonneel, N.; van de Panne, M.; Paris, S.; Heidrich, W. Displacement Interpolation Using Lagrangian Mass Transport. In Proceedings of the 2011 SIGGRAPH Asia Conference, Hong Kong, China, 11–15 December 2011; 158, pp. 1–12. [Google Scholar] [CrossRef]
Chizat, L.; Bach, F. On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018. [Google Scholar]
Zhang, R.; Chen, C.; Li, C.; Duke, L.C. Policy Optimization as Wasserstein Gradient Flows. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; Volume 80, pp. 5741–5750. [Google Scholar]
Ferradans, S.; Papadakis, N.; Peyré, G.; Aujol, J.F. Regularized discrete optimal transport. SIAM J. Imaging Sci. 2014, 7, 1853–1882. [Google Scholar] [CrossRef]
Flamary, R.; Courty, N.; Tuia, D.; Rakotomamonjy, A. Optimal transport with Laplacian regularization: Applications to domain adaptation and shape matching. In NIPS Workshop on Optimal Transport and Machine Learning; OTML: Montreal, QC, Canada, 2014. [Google Scholar]
Lacoste-Julien, S. Convergence rate of Frank-Wolfe for non-convex objectives. arXiv 2016, arXiv:1607.00345. [Google Scholar]
Maron, H.; Lipman, Y. (Probably) Concave Graph Matching. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; pp. 408–418. [Google Scholar]
Redko, I.; Vayer, T.; Flamary, R.; Courty, N. CO-Optimal Transport. arXiv 2020, arXiv:2002.03731. [Google Scholar]
Agueh, M.; Carlier, G. Barycenters in the Wasserstein space. SIAM J. Math. Anal. 2011, 43, 904–924. [Google Scholar] [CrossRef]
Cuturi, M.; Doucet, A. Fast Computation of Wasserstein Barycenters. In Proceedings of the 31st International Conference on Machine Learning, Bejing, China, 22–24 June 2014; Volume 32, pp. 685–693. [Google Scholar]
Kruskal, J.B. Nonmetric multidimensional scaling: A numerical method. Psychometrika 1964, 29, 115–129. [Google Scholar] [CrossRef]
Sutherland, J.J.; O’brien, L.A.; Weaver, D.F. Spline-fitting with a genetic algorithm: A method for developing classification structure-activity relationships. J. Chem. Inf. Comput. Sci. 2003, 43, 1906–1915. [Google Scholar] [CrossRef] [PubMed]
Borgwardt, K.M.; Kriegel, H.P. Shortest-Path Kernels on Graphs. In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM’05), Houston, TX, USA, 27–30 November 2005; pp. 74–81. [Google Scholar] [CrossRef]
Kriege, N.; Fey, M.; Fisseler, D.; Mutzel, P.; Weichert, F. Recognizing Cuneiform Signs Using Graph Based Methods. In Proceedings of the International Workshop on Cost-Sensitive Learning (COST), San Diego, CA, USA, 3–5 May 2018. [Google Scholar]
Feragen, A.; Kasenburg, N.; Petersen, J.; de Bruijne, M.; Borgwardt, K. Scalable kernels for graphs with continuous attributes. In Proceedings of the Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–10 December 2013; pp. 216–224. [Google Scholar]
Debnath, A.K.; Lopez de Compadre, R.L.; Debnath, G.; Shusterman, A.J.; Hansch, C. Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity. J. Med. Chem. 1991, 34, 786–797. [Google Scholar] [CrossRef] [PubMed]
Kriege, N.M.; Giscard, P.; Wilson, R.C. On Valid Optimal Assignment Kernels and Applications to Graph Classification. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016. [Google Scholar]
Wale, N.; Watson, I.A.; Karypis, G. Comparison of descriptor spaces for chemical compound retrieval and classification. Knowl. Inf. Syst. 2008, 14, 347–375. [Google Scholar] [CrossRef]
Yanardag, P.; Vishwanathan, S. Deep Graph Kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, Australia, 10–13 August 2015; pp. 1365–1374. [Google Scholar]
Kersting, K.; Kriege, N.M.; Morris, C.; Mutzel, P.; Neumann, M. Benchmark Data Sets for Graph Kernels. 2016. Available online: https://ls11-www.cs.tu-dortmund.de/staff/morris/graphkerneldatasets (accessed on 26 August 2020).
Vishwanathan, S.V.N.; Schraudolph, N.N.; Kondor, R.; Borgwardt, K.M. Graph Kernels. J. Mach. Learn. Res. 2010, 11, 1201–1242. [Google Scholar]
Luss, R.; d’Aspremont, A. Support Vector Machine Classification with Indefinite Kernels. In Proceedings of the 20th International Conference on Neural Information Processing Systems, Kitakyushu, Japan, 3 December 2007; pp. 953–960. [Google Scholar]
Gärtner, T.; Flach, P.; Wrobel, S. On graph kernels: Hardness results and efficient alternatives. In Learning Theory and Kernel Machines; Springer: Berlin/Heidelberg, Germany, 2003; pp. 129–143. [Google Scholar]
Shervashidze, N.; Vishwanathan, S.V.N.; Petri, T.H.; Mehlhorn, K.; Borgwardt, K. Efficient graphlet kernels for large graph comparison. In Artificial Intelligence and Statistics; Hilton Clearwater Beach Resort: Clearwater Beach, FL, USA, 2009. [Google Scholar]
Neumann, M.; Garnett, R.; Bauckhage, C.; Kersting, K. Propagation kernels: Efficient graph kernels from propagated information. Mach. Learn. 2016, 102, 209–245. [Google Scholar] [CrossRef]
Siglidis, G.; Nikolentzos, G.; Limnios, S.; Giatsidis, C.; Skianis, K.; Vazirgianis, M. GraKeL: A Graph Kernel Library in Python. arXiv 2018, arXiv:1806.02193. [Google Scholar]
Shchur, O.; Mumme, M.; Bojchevski, A.; Günnemann, S. Pitfalls of Graph Neural Network Evaluation. arXiv 2018, arXiv:1811.05868.2018. [Google Scholar]
Wang, Y.J.; Wong, G.Y. Stochastic blockmodels for directed graphs. J. Am. Stat. Assoc. 1987, 82, 8–19. [Google Scholar] [CrossRef]
Nowicki, K.; Snijders, T.A.B. Estimation and prediction for stochastic blockstructures. J. Am. Stat. Assoc. 2001, 96, 1077–1087. [Google Scholar] [CrossRef]
Billingsley, P. Convergence of Probability Measures, 2nd ed.; Wiley Series in Probability and Statistics: Probability and Statistics; A Wiley-Interscience Publication; John Wiley & Sons Inc.: New York, NY, USA, 1999. [Google Scholar]
Santambrogio, F. Optimal Transport for Applied Mathematicians; Birkäuser: Basel, Switzerland, 2015. [Google Scholar]
Ambrosio, L.; Gigli, N.; Savare, G. Gradient Flows in Metric Spaces and in the Space of Probability Measures; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2005. [Google Scholar] [CrossRef]
Ambrosio, L.; Gigli, N.; Savare, G. Gradient Flows: In Metric Spaces and in the Space of Probability Measures; Lectures in Mathematics; ETH Zürich, Birkhäuser: Basel, Switzerland, 2005. [Google Scholar]

Figure 1. Discrete structured object (left) can be described by a labeled graph with

{(a_{i})}_{i}

the feature information of the object and

{(x_{i})}_{i}

the structure information. If we enrich this object with a histogram

{(h_{i})}_{i}

aiming at measuring the relative importance of the nodes, we can represent the structured object as a fully supported probability measure

μ

over the couple space of feature and structure with marginals

μ_{X}

and

μ_{A}

on the structure and the features respectively (right).

Figure 2. Two isometric metric spaces. Distances between the nodes are given by the shortest path, and the weight of each edge is equal to 1.

Figure 3. Two isometric but not isomorphic spaces.

Figure 4. Two structured objects with isometric structures and identical features that are not strongly isomorphic. The color of the nodes represent the node feature and each edge represents a distance of 1 between the connected nodes.

Figure 5. Example of a coupling between two discrete measures on the same ground space equipped with a distance d that will define the Wasserstein distance. (Left): the discrete measures on

Ω

. (Right): one possible coupling between these measures that conserves the mass. Image adapted from [21] [Figure 2.6].

Figure 6. Gromov–Wasserstein coupling of two mm-spaces

X = (X, d_{X}, μ_{X})

and

Y = (Y, d_{Y}, ν_{Y})

. Left: the mm-spaces have nothing in common. Similarity between pairwise distances is measured by

| d_{X} (x, x^{'}) - d_{Y} (y, y^{'}) |

. Right: an admissible coupling of

μ_{X}

and

ν_{Y}

. Image adapted from [21] [Figure 2.6].

Figure 7. Illustration of Definition 8. The figure shows two structured objects

(X \times Ω, d_{X}, μ)

and

(Y \times Ω, d_{Y}, μ)

. The feature space

Ω

is the common space for all features. The two metric spaces

(X, d_{X})

and

(Y, d_{Y})

represent the structures of the two structured objects, the similarity between all pair-to-pair distances of the structure points is measured by

L (x, y, x^{'}, y^{'})

.

μ

and

ν

are the joint measures on the structure space and the feature space.

Figure 8. Difference on transportation maps between

F G W

,

G W

, and W distances on synthetic trees. (Left) the W distance between the features is null since feature information are the same, (Middle)

F G W

with

α \in] 0, 1 [

is different from zero and discriminate the two structured objects, and (Right)

G W

between the two isometric structures is null.

Figure 9.

F G W

loss for a coupling

π

depends on both a similarity between each feature of each node of each graph

{(d (a_{i}, b_{j}))}_{i, j}

and between all intra-graph structure similarities

{(|C_{1} (i, k) - C_{2} (j, l)|)}_{i, j, k, l}

.

Figure 10. Illustration of the difference between W,

G W

, and

F G W

couplings. (left) Empirical distributions

μ

with 20 samples and

ν

with 30 samples which color is proportional to their index. (middle) Cost matrices in the feature (

M_{A B}

) and structure domains (

C_{1}, C_{2}

) with similar samples in white. (right) Solution for all methods. Dark blue indicates a non zero coefficient of the transportation map. Feature distances are large between points laying on the diagonal of

M_{A B}

such that Wasserstein maps is anti-diagonal but unstructured. Fused Gromov–Wasserstein incorporates both feature and structure maps in a single transport map.

Figure 11. Couplings obtained when considering (Top left) the features only, where we have

d_{W, 1}^{Ω} = 0

(Top right) the structure only, with

d_{G W, 1} = 0

(Bottom left, and right) both the features and the structure, with

d_{F G W, 0.1, 1, 2}^{Ω}

. For readibility issues, only the couplings starting from non white pixels on the left picture are depicted.

Figure 12. Behavior of trade-off parameter

α

on a toy time series classification problem.

α

is increasing from left (

α = 0

: Wasserstein distance) to right (

α = 1

: Gromov–Wasserstein distance). (top row) Two-dimensional (2D)-embedding is computed from the set of pairwise distances between samples with multidimensional scaling (MDS) (bottom row) illustration of couplings between two sample time series from opposite classes.

Figure 13. Illustration of

F G W

graph barycenter. The first column illustrates the original settings with the noiseless graphs, and columns 2 to 7 are noisy samples that constitute the datasets. Columns 8 and 9 show the barycenters for each setting, with different number of nodes. Blue nodes indicates a feature value close to

- 1

, yellow nodes close to 1.

Figure 14. Example of community clustering on graphs using

F G W

. (top) Community clustering with four communities and uniform features per cluster. (bottom) Community clustering with four communities and bimodal features per cluster (and two nodes per cluster in the approximate graph).

Figure 15. (left) Examples from the clustering dataset, color indicates the labels. (right) Evolution of the centroids of each cluster in the k-means clustering, from the random initialization until convergence to the final centroid.

Table 1. Average classification accuracy on the graph datasets with vector attributes.

Vector Attributes	BZR	COX2	CUNEIFORM	ENZYMES	PROTEIN	SYNTHETIC
FGW sp	85.12 ± 4.15 *	77.23 ± 4.86	76.67 ± 7.04	71.00 ± 6.76	74.55 ± 2.74	100.00 ± 0.00
HOPPERK	84.15 ± 5.26	79.57 ± 3.46	32.59 ± 8.73	45.33 ± 4.00	71.96 ± 3.22	90.67 ± 4.67
PROPAK	79.51 ± 5.02	77.66 ± 3.95	12.59 ± 6.67	71.67 ± 5.63 *	61.34 ± 4.38	64.67 ± 6.70
PSCN k = 10	80.00 ± 4.47	71.70 ± 3.57	25.19 ± 7.73	26.67 ± 4.77	67.95 ± 11.28	100.00 ± 0.00
PSCN k = 5	82.20 ± 4.23	71.91 ± 3.40	24.81 ± 7.23	27.33 ± 4.16	71.79 ± 3.39	100.00 ± 0.00

Table 2. Average classification accuracy on the graph datasets with discrete attributes.

Discrete Attr.	MUTAG	NCI1	PTC-MR
FGW raw sp	83.26 ± 10.30	72.82 ± 1.46	55.71 ± 6.74
FGW wl h = 2 sp	86.42 ± 7.81	85.82 ± 1.16	63.20 ± 7.68
FGW wl h = 4 sp	88.42 ± 5.67	86.42 ± 1.63	65.31 ± 7.90
GK k = 3	82.42 ± 8.40	60.78 ± 2.48	56.46 ± 8.03
RWK	79.47 ± 8.17	58.63 ± 2.44	55.09 ± 7.34
SPK	82.95 ± 8.19	74.26 ± 1.53	60.05 ± 7.39
WLK	86.21 ± 8.48	85.77 ± 1.07	62.86 ± 7.23
WLK h=2	86.21 ± 8.15	81.85 ± 2.28	61.60 ± 8.14
WLK h = 4	83.68 ± 9.13	85.13 ± 1.61	62.17 ± 7.80
PSCN k = 10	83.47 ± 10.26	70.65 ± 2.58	58.34 ± 7.71
PSCN k = 5	83.05 ± 10.80	69.85 ± 1.79	55.37 ± 8.28

Table 3. Average classification accuracy on the graph datasets with no attributes.

Without Attribute	IMDB-B	IMDB-M
GW sp	63.80 ± 3.49	48.00 ± 3.22
GK k = 3	56.00 ± 3.61	41.13 ± 4.68
SPK	55.80 ± 2.93	38.93 ± 5.12

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Fused Gromov-Wasserstein Distance for Structured Objects

Abstract

1. Introduction

2. Structured Objects as Distributions and Fused Gromov–Wasserstein Distance

2.1. Comparing Structured Objects

2.2. Background on OT Distances

2.3. Fused Gromov–Wasserstein Distance

3. Mathematical Properties of $FGW$

3.1. Topology of the Structured Object Space

3.2. Can We Adapt W and GW for Structured Objects?

3.3. Convergence of Structured Objects

3.4. Interpolation Properties between Wasserstein and Gromov-Wasserstein Distances

3.5. Geodesic Properties

4. $FGW$ in the Discrete Case

4.1. FGW in the Discrete Case

4.2. Structured Optimal Transport Barycenter

5. Examples and Applications

5.1. Illustrations of FGW

5.2. Graph-Structured Data Classification

5.3. Graph Barycenter and Compression

5.4. Unsupervised Learning: Graphs Clustering

6. Conclusions

7. Proofs of the Mathematical Properties

7.1. Proof of Proposition 1—Comparison between FGW, GW and W

7.2. Proof of Theorem 1—Metric Properties of FGW

7.3. Proof of Theorem 2—Convergence and Concentration Inequality

7.4. Proof of Theorem 3—Interpolation Properties between GW and W

7.5. Proof of Theorem 4—Constant Speed Geodesic

Author Contributions

Funding

Conflicts of Interest

References

Article Metrics

Citations

Article Access Statistics

Fused Gromov-Wasserstein Distance for Structured Objects

Abstract

1. Introduction

2. Structured Objects as Distributions and Fused Gromov–Wasserstein Distance

2.1. Comparing Structured Objects

2.2. Background on OT Distances

2.3. Fused Gromov–Wasserstein Distance

3. Mathematical Properties of FGW

3.1. Topology of the Structured Object Space

3.2. Can We Adapt W and GW for Structured Objects?

3.3. Convergence of Structured Objects

3.4. Interpolation Properties between Wasserstein and Gromov-Wasserstein Distances

3.5. Geodesic Properties

4. FGW in the Discrete Case

4.1. FGW in the Discrete Case

4.2. Structured Optimal Transport Barycenter

5. Examples and Applications

5.1. Illustrations of FGW

5.2. Graph-Structured Data Classification

5.3. Graph Barycenter and Compression

5.4. Unsupervised Learning: Graphs Clustering

6. Conclusions

7. Proofs of the Mathematical Properties

7.1. Proof of Proposition 1—Comparison between FGW, GW and W

7.2. Proof of Theorem 1—Metric Properties of FGW

7.3. Proof of Theorem 2—Convergence and Concentration Inequality

7.4. Proof of Theorem 3—Interpolation Properties between GW and W

7.5. Proof of Theorem 4—Constant Speed Geodesic

Author Contributions

Funding

Conflicts of Interest

References

Article Metrics

Citations

Article Access Statistics

3. Mathematical Properties of $FGW$

4. $FGW$ in the Discrete Case