1. Introduction
We focus on the comparison of structured objects, i.e., objects defined by both feature and structure information. Abstractly, the feature information covers all the attributes of an object; for example, it can model the value of a signal when objects are time series, or the node labels of a graph. In shape analysis, the spatial positions of the nodes can be regarded as features, or, when objects are images, local color histograms can describe the image’s feature information. The structure information, in turn, encodes the specific relationships that exist among the components of the object. In a graph context, nodes and edges are representative of this notion, so that each label of the graph may be linked to some others through the edges between the nodes. In a time series context, the values of the signal are related to each other through a temporal structure. This representation can be related to the concept of relational reasoning (see [
1]), where some entities (or elements with attributes, such as the intensity of a signal) coexist with some relations or properties between them (or some structure, as described above). Including structural knowledge about objects in a machine learning context has often been valuable in order to build more generalizable models. As shown in many contexts, such as graphical models [
2,
3], relational reinforcement learning [
4], or Bayesian nonparametrics [
5], considering objects as a complex composition of entities together with their interactions is crucial in order to learn from small amounts of data.
Unlike recent deep learning end-to-end approaches [
6,
7] that attempt to avoid integration of prior knowledge or assumptions about the structure wherever possible,
ad hoc methods, depending on the kind of structured objects involved, aim to build meaningful tools that include structure information in the machine learning process. In graph classification, the structure can be taken into account through dedicated graph kernels, in which the structure drives the combination of the feature information [
8,
9,
10]. In a time series context, Dynamic Time Warping and related approaches are based on the similarity between the features while allowing limited temporal (i.e., structural) distortion in the time instants that are matched [
11,
12]. Closely related, an entire field has focused on predicting the structure as an output and it has been deployed on tasks, such as segmenting an image into meaningful components or predicting a natural language sentence [
10,
13,
14].
All of these approaches rely on meaningful representations of the structured objects that are involved. In this context, distributions or probability measures can provide an interesting representation for machine learning data. This allows their comparison within the Optimal Transport (OT) framework that provides a meaningful way of comparing distributions by capturing the underlying geometric properties of the space through a cost function. When the distributions dwell in a common metric space
$(\mathsf{\Omega},d)$, the Wasserstein distance defines a metric between these distributions under mild assumptions [
15]. In contrast, the Gromov–Wasserstein distance [
16,
17] aims at comparing distributions whose supports live in different metric spaces through the intrinsic pair-to-pair distances in each space. Unifying both distances, the Fused Gromov–Wasserstein distance was proposed in previous work [
18] and used in the discrete setting to encode, in a single OT formulation, both feature and structure information of structured objects. This approach considers structured objects as joint distributions over a common feature space associated with a structure space specific to each object. An OT formulation is derived by considering a tradeoff between the feature and the structure costs, respectively, defined with respect to the Wasserstein and the Gromov–Wasserstein standpoints.
This paper presents the theoretical foundations of this distance and states the mathematical properties of the $FGW$ metric in the general setting. We first introduce a representation of structured objects using distributions. We show that the classical Wasserstein and Gromov–Wasserstein distances can be used to compare either the feature information or the structure information of a structured object, but that both fail at comparing the entire object. We then present the Fused Gromov–Wasserstein distance in its general formulation and derive some of its mathematical properties. In particular, we show that it is a metric in a given case, we give a concentration result, and we study its interpolation and geodesic properties. We then provide a conditional-gradient algorithm to solve the quadratic problem resulting from $FGW$ in the discrete case, and we conclude by illustrating and interpreting the distance in several application contexts.
Notations. Let $\mathcal{P}(\mathsf{\Omega})$ be the set of all probability measures on a space $\mathsf{\Omega}$ and $\mathcal{B}(A)$ the Borel $\sigma $-algebra of a space A. We denote by # the push-forward operator: for a measurable function T and $B\in \mathcal{B}(A)$, $T\#\mu (B)=\mu ({T}^{-1}(B))$.
We denote by supp$(\mu )$ the support of $\mu \in \mathcal{P}(\mathsf{\Omega})$, i.e., the minimal closed subset $A\subset \mathsf{\Omega}$ such that $\mu (\mathsf{\Omega}\backslash A)=0$. Informally, this is the set where the measure “is not zero”.
For two probability measures $\mu \in \mathcal{P}(A)$ and $\nu \in \mathcal{P}(B)$, we denote by $\mathsf{\Pi}(\mu ,\nu )$ the set of all couplings or matching measures of $\mu $ and $\nu $, i.e., the set $\{\pi \in \mathcal{P}(A\times B)\phantom{\rule{0.166667em}{0ex}}|\phantom{\rule{0.166667em}{0ex}}\forall ({A}_{0},{B}_{0})\in \mathcal{B}(A)\times \mathcal{B}(B),\pi ({A}_{0}\times B)=\mu ({A}_{0}),\pi (A\times {B}_{0})=\nu ({B}_{0})\}$.
For two metric spaces $(X,{d}_{X})$ and $(Y,{d}_{Y})$, we define the distance ${d}_{X}\oplus {d}_{Y}$ on $X\times Y$ such that, for $(x,y),({x}^{\prime},{y}^{\prime})\in X\times Y,\phantom{\rule{4pt}{0ex}}{d}_{X}\oplus {d}_{Y}((x,y),({x}^{\prime},{y}^{\prime}))={d}_{X}(x,{x}^{\prime})+{d}_{Y}(y,{y}^{\prime})$.
We denote the simplex of N bins by ${\mathsf{\Sigma}}_{N}=\{a\in {({\mathbb{R}}_{+})}^{N},{\sum}_{i}{a}_{i}=1\}$. For two histograms $a\in {\mathsf{\Sigma}}_{n}$ and $b\in {\mathsf{\Sigma}}_{m}$ we denote, with a slight abuse of notation, by $\mathsf{\Pi}(a,b)$ the set of all couplings of a and b, i.e., the set $\mathsf{\Pi}(a,b)=\{\pi \in {\mathbb{R}}_{+}^{n\times m}|{\sum}_{i}{\pi}_{i,j}={b}_{j};{\sum}_{j}{\pi}_{i,j}={a}_{i}\}$. Finally, for $x\in \mathsf{\Omega}$, ${\delta}_{x}$ denotes the Dirac measure at x.
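To make the discrete coupling set $\mathsf{\Pi}(a,b)$ concrete, the following minimal sketch checks the two marginal constraints on a candidate matrix. The helper name `is_coupling` and the histograms are illustrative assumptions, not part of the paper; the only fact used is that the product coupling ${\pi}_{i,j}={a}_{i}{b}_{j}$ always belongs to $\mathsf{\Pi}(a,b)$.

```python
from itertools import product


def is_coupling(pi, a, b, tol=1e-12):
    """Return True if pi (n x m nested list) is nonnegative,
    has row sums a and column sums b (up to tol)."""
    n, m = len(a), len(b)
    nonneg = all(pi[i][j] >= 0 for i, j in product(range(n), range(m)))
    rows_ok = all(abs(sum(pi[i]) - a[i]) <= tol for i in range(n))
    cols_ok = all(
        abs(sum(pi[i][j] for i in range(n)) - b[j]) <= tol for j in range(m)
    )
    return nonneg and rows_ok and cols_ok


# Hypothetical histograms in Sigma_2 and Sigma_3.
a = [0.5, 0.5]
b = [0.25, 0.25, 0.5]

# Independent (product) coupling: pi_ij = a_i * b_j.
product_coupling = [[ai * bj for bj in b] for ai in a]

# A matrix with the right row sums but the wrong column sums.
not_a_coupling = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]

print(is_coupling(product_coupling, a, b))
print(is_coupling(not_a_coupling, a, b))
```

The second matrix fails only on the column constraint, which illustrates that both marginals have to be enforced.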
Assumption. In this paper, we assume that all metric spaces are non-trivial Polish metric spaces (i.e., separable and completely metrizable topological spaces) and that all measures are Borel.
2. Structured Objects as Distributions and Fused Gromov–Wasserstein Distance
The notion of structured objects used in this paper is inspired by the discrete point of view, where one aims at comparing labeled graphs. More formally, we consider undirected labeled graphs as tuples of the form
$\mathcal{G}=(\mathcal{V},\mathcal{E},{\ell}_{f},{\ell}_{s})$, where
$(\mathcal{V},\mathcal{E})$ are the set of vertices and edges of the graph.
${\ell}_{f}:\mathcal{V}\to \mathsf{\Omega}$ is a labelling function that associates each vertex
${v}_{i}\in \mathcal{V}$ with a feature
${a}_{i}\stackrel{\mathrm{def}}{=}{\ell}_{f}({v}_{i})$ in some feature metric space
$(\mathsf{\Omega},d)$. Similarly,
${\ell}_{s}:\mathcal{V}\to X$ maps a vertex
${v}_{i}$ from the graph to its structure representation
${x}_{i}\stackrel{\mathrm{def}}{=}{\ell}_{s}({v}_{i})$ in some structure space
$(X,{d}_{X})$ specific to each graph.
${d}_{X}:X\times X\to {\mathbb{R}}_{+}$ is a symmetric map that measures the similarity between the nodes in the graph. In the graph context,
${d}_{X}$ can either encode the neighborhood information of the nodes, the edge information of the graph or more generally it can model a distance between the nodes such as the shortest path distance or the harmonic distance [
19]. When
${d}_{X}$ is a metric, such as the shortest-path distance, we naturally endow the structure with the metric space
$(X,{d}_{X})$.
In this paper, we propose enriching the previous definition of a structured object with a probability measure which serves the purpose of signaling the relative importance of the object’s elements. For example, we can add a probability (also denoted as weight)
${({h}_{i})}_{i}\in {\mathsf{\Sigma}}_{n}$ to each node in the graph. This way, we define a fully supported probability measure
$\mu ={\sum}_{i}{h}_{i}{\delta}_{({x}_{i},{a}_{i})}$, which includes all the structured object information (see
Figure 1 for a graphical depiction).
This graph representation for objects with a finite number of points/vertices can be generalized to the continuous case and leads to a more general definition of structured objects:
Definition 1. (Structured objects). A structured object over a metric space $(\mathsf{\Omega},d)$ is a triplet $(X\times \mathsf{\Omega},{d}_{X},\mu )$, where $(X,{d}_{X})$ is a metric space and μ is a probability measure over $X\times \mathsf{\Omega}$. $(\mathsf{\Omega},d)$ is called the feature space, with $d:\mathsf{\Omega}\times \mathsf{\Omega}\to {\mathbb{R}}_{+}$ the distance in the feature space, and $(X,{d}_{X})$ is called the structure space, with ${d}_{X}:X\times X\to {\mathbb{R}}_{+}$ the distance in the structure space. We denote by ${\mu}_{X}$ and ${\mu}_{A}$ the structure and feature marginals of μ.
Definition 2. (Space of structured objects).
We denote by $\mathbb{X}$ the set of all metric spaces. The space of all structured objects over $(\mathsf{\Omega},d)$ is written $\mathbb{S}(\mathsf{\Omega})$ and consists of all the triplets $(X\times \mathsf{\Omega},{d}_{X},\mu )$ where $(X,{d}_{X})\in \mathbb{X}$ and $\mu \in \mathcal{P}(X\times \mathsf{\Omega})$. To avoid finiteness issues in the rest of the paper, we define for $p\in {\mathbb{N}}^{*}$ the space ${\mathbb{S}}_{p}(\mathsf{\Omega})\subset \mathbb{S}(\mathsf{\Omega})$ such that if $(X\times \mathsf{\Omega},{d}_{X},\mu )\in {\mathbb{S}}_{p}(\mathsf{\Omega})$ we have:
$$\int_{\mathsf{\Omega}}d{(a,{a}_{0})}^{p}\,\mathrm{d}{\mu}_{A}(a)<+\infty \qquad (1)$$
(the finiteness of this integral does not depend on the choice of ${a}_{0}$) and
$$\int_{X\times X}{d}_{X}{(x,{x}^{\prime})}^{p}\,\mathrm{d}{\mu}_{X}(x)\,\mathrm{d}{\mu}_{X}({x}^{\prime})<+\infty .\qquad (2)$$
For the sake of simplicity, and when it is clear from the context, we will sometimes denote only by
$\mu $ the whole structured object. The marginals
${\mu}_{X},{\mu}_{A}$ encode very partial information since they focus only on independent feature distributions or only on the structure. This definition encompasses the discrete setting discussed above. More precisely, let us consider a labeled graph of
n nodes with features
$A={({a}_{i})}_{i=1}^{n}$ with
${a}_{i}\in \mathsf{\Omega}$ and
$X={({x}_{i})}_{i=1}^{n}$ the structure representation of the nodes. Let
${({h}_{i})}_{i=1}^{n}$ be a histogram, then the probability measure
$\mu ={\sum}_{i=1}^{n}{h}_{i}{\delta}_{({x}_{i},{a}_{i})}$ defines a structured object in the sense of Definition 1, since it lies in
$\mathcal{P}(X\times \mathsf{\Omega})$. In this case, an example of
$\mu $,
${\mu}_{X}$, and
${\mu}_{A}$ is provided in
Figure 1.
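The discrete construction above can be sketched in a few lines. This is only an illustration under stated assumptions: the structure is encoded by the shortest-path distance of an unweighted graph, the node weights are uniform unless given, and the names `shortest_path_matrix`, `structured_object` and `feature_marginal`, as well as the 4-node path graph, are hypothetical.

```python
import math


def shortest_path_matrix(n, edges):
    """All-pairs shortest-path lengths of an unweighted, undirected graph
    on nodes 0..n-1 (Floyd-Warshall)."""
    d = [[0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for i, j in edges:
        d[i][j] = d[j][i] = 1
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d


def structured_object(n, edges, features, weights=None):
    """Discrete structured object mu = sum_i h_i * delta_(x_i, a_i):
    node i carries structure point x_i (row i of C), feature a_i and mass h_i."""
    if weights is None:
        weights = [1.0 / n] * n  # uniform histogram (assumption)
    assert len(features) == n == len(weights)
    assert abs(sum(weights) - 1.0) < 1e-9
    return {
        "C": shortest_path_matrix(n, edges),
        "features": list(features),
        "h": list(weights),
    }


def feature_marginal(obj):
    """Feature marginal mu_A: total mass carried by each distinct feature."""
    mu_A = {}
    for a, h in zip(obj["features"], obj["h"]):
        mu_A[a] = mu_A.get(a, 0.0) + h
    return mu_A


# Hypothetical labeled path graph 0 - 1 - 2 - 3 with scalar node features.
obj = structured_object(4, [(0, 1), (1, 2), (2, 3)], [0.0, 1.0, 1.0, 0.0])
print(obj["C"][0][3])         # shortest-path distance between the two end nodes
print(feature_marginal(obj))  # the marginal forgets which node carries which feature
```

The feature marginal only records how much mass each feature value receives, which is why it cannot distinguish two graphs that carry the same labels at different places.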
Note that the set of structured objects is quite general and also allows considering discrete probability measures of the form $\mu ={\sum}_{i,j=1}^{p,q}{h}_{i,j}{\delta}_{({x}_{i},{a}_{j})}$ with $p,q$ possibly different from n. We propose focusing on a particular type of structured objects, namely the generalized labeled graphs, as described in the following definition:
Definition 3. (Generalized labeled graph). We call generalized labeled graph a structured object $(X\times \mathsf{\Omega},{d}_{X},\mu )\in {\mathbb{S}}_{p}(\mathsf{\Omega})$ such that μ can be expressed as $\mu =(I\times {\ell}_{f})\#{\mu}_{X}$ where ${\ell}_{f}:X\to \mathsf{\Omega}$ is surjective and pushes ${\mu}_{X}$ forward to ${\mu}_{A}$, i.e., ${\ell}_{f}\#{\mu}_{X}={\mu}_{A}$.
This definition implies that there exists a function ${\ell}_{f}$ that associates a feature $a={\ell}_{f}(x)$ with each structure point $x\in X$ and, as such, one structure point cannot have two different features. The labeled graph described by $\mu ={\sum}_{i=1}^{n}{h}_{i}{\delta}_{({x}_{i},{a}_{i})}$ is a particular instance of a generalized labeled graph in which ${\ell}_{f}$ is defined by ${\ell}_{f}({x}_{i})={a}_{i}$.
2.1. Comparing Structured Objects
We now aim to define a notion of equivalence between two structured objects $(X\times \mathsf{\Omega},{d}_{X},\mu )$ and $(Y\times \mathsf{\Omega},{d}_{Y},\nu )$. In the following, we denote by ${\nu}_{Y},{\nu}_{B}$ the marginals of $\nu $. Intuitively, two structured objects are the same if they share the same feature information, if their structures are alike, and if their probability measures correspond in some sense. In this section, we present mathematical tools for the individual comparison of the elements of structured objects. First, our formalism implies comparing metric spaces, which can be done via the notion of isometry.
Definition 4. (Isometry).
Let $(X,{d}_{X})$ and $(Y,{d}_{Y})$ be two metric spaces. An isometry is a surjective map $f:X\to Y$ that preserves the distances:
$$\forall x,{x}^{\prime}\in X,\quad {d}_{Y}(f(x),f({x}^{\prime}))={d}_{X}(x,{x}^{\prime}).\qquad (3)$$
An isometry is bijective, since for $f(x)=f({x}^{\prime})$ we have ${d}_{Y}(f(x),f({x}^{\prime}))=0={d}_{X}(x,{x}^{\prime})$ and hence $x={x}^{\prime}$ (in the same way ${f}^{-1}$ is also an isometry). When it exists, X and Y share the same “size” and any statement about X that can be expressed through its distance is transported to Y by the isometry f.
Example 1. Let us consider the following two graphs, whose discrete metric spaces are obtained from the shortest-path distances between the vertices (see the corresponding graphs in Figure 2). These spaces are isometric since the map f, such that $f({x}_{1})={y}_{1}$, $f({x}_{2})={y}_{3}$, $f({x}_{3})={y}_{4}$, $f({x}_{4})={y}_{2}$, satisfies Equation (3).
The previous definition can be used to compare the structure information of two structured objects. Regarding the feature information, because all features lie in the same ambient space $\mathsf{\Omega}$, a natural way of comparing them is the standard set equality $A=B$. Finally, in order to compare measures on different spaces, the notion of measure preserving map can be used.
Definition 5. (Measure preserving map).
Let $({\mathsf{\Omega}}_{1},{\mu}_{1}\in \mathcal{P}({\mathsf{\Omega}}_{1}))$ and $({\mathsf{\Omega}}_{2},{\mu}_{2}\in \mathcal{P}({\mathsf{\Omega}}_{2}))$ be two measurable spaces equipped with probability measures. A function (usually called a map) $f:{\mathsf{\Omega}}_{1}\to {\mathsf{\Omega}}_{2}$ is said to be measure preserving if it transports the measure ${\mu}_{1}$ onto ${\mu}_{2}$, i.e.,
$$f\#{\mu}_{1}={\mu}_{2}.\qquad (4)$$
If there exists such a measure preserving map, the properties about measures of ${\mathsf{\Omega}}_{1}$ are transported via f to ${\mathsf{\Omega}}_{2}$.
Combining these two ideas together leads to the notion of metric measure spaces (often called
mm-spaces [
17]), i.e., a metric space
$(X,{d}_{X})$ enriched with a probability measure and described by a triplet
$(X,{d}_{X},{\mu}_{X}\in \mathcal{P}(X))$. An interesting notion for comparing mm-spaces is the notion of isomorphism.
Definition 6. (Isomorphism). Two mm-spaces $(X,{d}_{X},{\mu}_{X}),(Y,{d}_{Y},{\nu}_{Y})$ are isomorphic if there exists a surjective measure preserving isometry $f:supp({\mu}_{X})\to supp({\nu}_{Y})$ between the supports of the measures ${\mu}_{X},{\nu}_{Y}$.
Example 2. Let us consider two mm-spaces $(X=\{{x}_{1},{x}_{2}\},{d}_{X}=\{1\},{\mu}_{X}=\{\frac{1}{2},\frac{1}{2}\})$ and $(Y=\{{y}_{1},{y}_{2}\},{d}_{Y}=\{1\},{\nu}_{Y}=\{\frac{1}{4},\frac{3}{4}\})$, as depicted in Figure 3. These spaces are isometric, but not isomorphic, as there exists no measure preserving map between them.
All of this considered, we can now define a notion of equivalence between structured objects.
Definition 7. (Strong isomorphism of structured objects).
Two structured objects are said to be strongly isomorphic if there exists an isomorphism I between the structures such that $f=(I,id)$ is bijective between $supp(\mu )$ and $supp(\nu )$ and measure preserving. More precisely, f satisfies the following properties:
- P.1
$f\#\mu =\nu $.
- P.2
The function f satisfies: $$\forall (x,a)\in supp(\mu ),\quad f(x,a)=(I(x),a).\qquad (5)$$
- P.3
The function $I:supp({\mu}_{X})\to supp({\nu}_{Y})$ is surjective, satisfies $I\#{\mu}_{X}={\nu}_{Y}$ and: $$\forall x,{x}^{\prime}\in supp({\mu}_{X}),\quad {d}_{X}(x,{x}^{\prime})={d}_{Y}(I(x),I({x}^{\prime})).\qquad (6)$$
It is easy to check that the strong isomorphism defines an equivalence relation over ${\mathbb{S}}_{p}(\mathsf{\Omega})$.
Remark 1. The function f described in this definition can be seen as a feature, structure, and measure preserving function. Indeed, from P.1, f is measure preserving. Moreover, $(X,{d}_{X},{\mu}_{X})$ and $(Y,{d}_{Y},{\nu}_{Y})$ are isomorphic through I. Finally, using P.1 and P.2, we have that ${\mu}_{A}={\nu}_{B}$, so that the feature information is also preserved.
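For small discrete objects, strong isomorphism can be checked by brute force. The sketch below assumes fully supported discrete measures with one feature per node, so that a candidate map is a permutation `sigma` of the nodes; it then tests mass preservation (P.1), feature preservation (P.2) and distance preservation (P.3). The function name and the 3-node path graphs with colour labels are hypothetical, and the enumeration is only feasible for a handful of nodes.

```python
from itertools import permutations


def find_strong_isomorphism(C1, feats1, h1, C2, feats2, h2, tol=1e-9):
    """Return a permutation sigma (node i of object 1 -> node sigma[i] of
    object 2) realising a strong isomorphism, or None if there is none."""
    n = len(h1)
    if len(h2) != n:
        return None
    for sigma in permutations(range(n)):
        # P.1: the map transports the masses of mu onto those of nu.
        mass_ok = all(abs(h1[i] - h2[sigma[i]]) <= tol for i in range(n))
        # P.2: the feature attached to a node is left unchanged.
        feat_ok = all(feats1[i] == feats2[sigma[i]] for i in range(n))
        # P.3: the structure map is an isometry.
        dist_ok = all(
            abs(C1[i][j] - C2[sigma[i]][sigma[j]]) <= tol
            for i in range(n)
            for j in range(n)
        )
        if mass_ok and feat_ok and dist_ok:
            return sigma
    return None


# Hypothetical example: shortest-path matrix of the path graph 0 - 1 - 2.
C = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
h = [1 / 3, 1 / 3, 1 / 3]
feats_a = ["red", "blue", "blue"]   # red label on an end node
feats_b = ["blue", "blue", "red"]   # same graph read from the other end
feats_c = ["blue", "red", "blue"]   # red label on the central node

iso_ab = find_strong_isomorphism(C, feats_a, h, C, feats_b, h)
iso_ac = find_strong_isomorphism(C, feats_a, h, C, feats_c, h)
print(iso_ab)  # the reversal of the path
print(iso_ac)  # None: same structure, same labels, but labels in different places
```

The second pair has isometric structures and equal feature marginals, yet no single map satisfies P.2 and P.3 at the same time.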
Example 3. To illustrate this definition, we consider a simple example of two discrete structured objects supported on ${({x}_{i},{a}_{i})}_{i\in \{1,\dots ,4\}}$ and ${({y}_{i},{b}_{i})}_{i\in \{1,\dots ,4\}}$, with ${a}_{i}={b}_{i}$ for all i and ${a}_{i}\ne {a}_{j}$ for $i\ne j$ (see Figure 4). The two structured objects have isometric structures and the same features individually, but they are not strongly isomorphic. One possible map $f=({f}_{1},{f}_{2})$, such that ${f}_{1}$ leads to an isometry, is $f({x}_{1},{a}_{1})=({y}_{1},{b}_{1})$, $f({x}_{2},{a}_{2})=({y}_{3},{b}_{3})$, $f({x}_{3},{a}_{3})=({y}_{4},{b}_{4})$, $f({x}_{4},{a}_{4})=({y}_{2},{b}_{2})$. Yet, this map does not satisfy ${f}_{2}(x,\cdot )=\mathrm{id}$ for any x, since $f({x}_{2},{a}_{2})=({y}_{3},{b}_{3})$ and ${a}_{2}\ne {b}_{3}$. The other possible functions such that ${f}_{1}$ leads to an isometry are simply permutations of this example, yet it is easy to check that none of them satisfies P.2 (for example, with $f({x}_{2},{a}_{2})=({y}_{4},{b}_{4})$).
2.2. Background on OT Distances
The Optimal Transport (OT) framework defines distances between probability measures that describe either the feature or the structure information of structured objects.
Wasserstein distance. The classical OT theory aims at comparing probability measures
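${\mu}_{A}\in \mathcal{P}(\mathsf{\Omega}),{\nu}_{B}\in \mathcal{P}(\mathsf{\Omega})$. In this context the quantity:
$${d}_{W,p}({\mu}_{A},{\nu}_{B})={\left(\inf_{\pi \in \mathsf{\Pi}({\mu}_{A},{\nu}_{B})}\int_{\mathsf{\Omega}\times \mathsf{\Omega}}d{(a,b)}^{p}\,\mathrm{d}\pi (a,b)\right)}^{\frac{1}{p}}\qquad (7)$$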
is usually called the
p-Wasserstein distance (also known, for
$p=1$, as Earth Mover’s distance [
20] in the computer vision community) between distributions
${\mu}_{A}$ and
${\nu}_{B}$. It defines a distance on probability measures; in particular,
${d}_{W,p}({\mu}_{A},{\nu}_{B})=0$ iff ${\mu}_{A}={\nu}_{B}$. This distance also has a nice geometrical interpretation as it represents an optimal cost (
w.r.t. d) to move the measure
${\mu}_{A}$ onto
${\nu}_{B}$ with
$\pi (a,b)$ the amount of probability mass shifted from
a to
b (see
Figure 5). To this extent, the Wasserstein distance quantifies how “far”
${\mu}_{A}$ is from
${\nu}_{B}$ by measuring how “difficult” it is to move all the mass from
${\mu}_{A}$ onto
${\nu}_{B}$. Optimal transport can deal with smooth and discrete measures and it has proved to be very useful for comparing distributions in a shared space, but with different (and even non-overlapping) supports.
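As a small numerical illustration of the p-Wasserstein distance, the sketch below treats two uniform empirical measures on the real line with the same number of points. It relies on two standard facts that are not stated in the text and are used here as assumptions: in this uniform, equal-size setting an optimal coupling can be chosen among permutations, and in one dimension with $p\ge 1$ the sorted (monotone) matching is optimal. Function names and sample values are hypothetical.

```python
from itertools import permutations


def wasserstein_1d_uniform(xs, ys, p=1):
    """p-Wasserstein distance between two uniform empirical measures on R
    with the same number of atoms, using the sorted (monotone) matching."""
    assert len(xs) == len(ys)
    n = len(xs)
    cost = sum(abs(x - y) ** p for x, y in zip(sorted(xs), sorted(ys)))
    return (cost / n) ** (1 / p)


def wasserstein_bruteforce_uniform(xs, ys, p=1):
    """Same quantity obtained by enumerating all permutation couplings
    (only feasible for a handful of points)."""
    assert len(xs) == len(ys)
    n = len(xs)
    best = min(
        sum(abs(xs[i] - ys[s[i]]) ** p for i in range(n))
        for s in permutations(range(n))
    )
    return (best / n) ** (1 / p)


# Hypothetical samples with non-overlapping values.
xs = [0.0, 1.0, 3.0]
ys = [2.0, 5.0, 1.0]

w1 = wasserstein_1d_uniform(xs, ys, p=1)
w1_check = wasserstein_bruteforce_uniform(xs, ys, p=1)
print(w1, w1_check)  # both equal 4/3: total displacement 1 + 1 + 2 over 3 points
```

The distance is zero only when the two samples coincide as multisets, in line with the identity of indiscernibles stated above.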
Gromov–Wasserstein distance. In order to compare measures whose supports are not necessarily in the same ambient space, the authors of [
16,
17] define a new OT distance. By relaxing the classical Gromov–Hausdorff distance [
15,
17], they build a distance over the space of all mm-spaces. For two compact mm-spaces
$(X,{d}_{X},{\mu}_{X}\in \mathcal{P}(X))$ and
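$(Y,{d}_{Y},{\nu}_{Y}\in \mathcal{P}(Y))$, the Gromov–Wasserstein distance is defined as:
$${d}_{GW,p}({\mu}_{X},{\nu}_{Y})={\left(\inf_{\pi \in \mathsf{\Pi}({\mu}_{X},{\nu}_{Y})}\int_{X\times Y\times X\times Y}L{(x,y,{x}^{\prime},{y}^{\prime})}^{p}\,\mathrm{d}\pi (x,y)\,\mathrm{d}\pi ({x}^{\prime},{y}^{\prime})\right)}^{\frac{1}{p}}\qquad (8)$$
where:
$$L(x,y,{x}^{\prime},{y}^{\prime})=|{d}_{X}(x,{x}^{\prime})-{d}_{Y}(y,{y}^{\prime})|.\qquad (9)$$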
The Gromov–Wasserstein distance depends on the choice of the metrics
${d}_{X}$ and
${d}_{Y}$ and with some abuse of notation we denote the entire mm-space by its probability measure. When it is not clear from the context, we will specify using
${d}_{GW,p}({d}_{X},{d}_{Y},{\mu}_{X},{\nu}_{Y})$. The resulting coupling tends to associate pairs of points with similar distances within each pair (see
Figure 6). The Gromov–Wasserstein distance allows for the comparison of measures over different ground spaces and defines a metric over the space of all mm-spaces quotiented by the isomorphisms (see Definition 6). More precisely, it vanishes if and only if the two mm-spaces are isomorphic. This distance has been used in the context of relational data, e.g., in shape comparison [
17,
22], deep metric alignment [
23], generative modelling [
24] or to align single-cell multi-omics datasets [
25].
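The discrete Gromov–Wasserstein objective can be evaluated for any given coupling with a direct quadruple sum. The sketch below assumes the usual discrete form $\sum_{i,j,k,l}|{C}_{1}(i,k)-{C}_{2}(j,l)|^{p}{\pi}_{i,j}{\pi}_{k,l}$ and only evaluates this cost; it does not solve the minimization over couplings. The function name `gw_cost` and the two distance matrices (the same 3-node path graph with two different node orderings) are hypothetical.

```python
def gw_cost(C1, C2, pi, p=1):
    """Gromov-Wasserstein objective of a coupling pi between two finite
    metric spaces given by their distance matrices C1 (n x n) and C2 (m x m)."""
    n, m = len(C1), len(C2)
    total = 0.0
    for i in range(n):
        for j in range(m):
            if pi[i][j] == 0:
                continue  # pairs without mass do not contribute
            for k in range(n):
                for l in range(m):
                    distortion = abs(C1[i][k] - C2[j][l]) ** p
                    total += distortion * pi[i][j] * pi[k][l]
    return total


# Path graph end - centre - end, nodes ordered (end, centre, end).
C1 = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
# The same graph with nodes ordered (centre, end, end).
C2 = [[0, 1, 1], [1, 0, 2], [1, 2, 0]]

t = 1.0 / 3
# Coupling supported on the isometry x0 -> y1, x1 -> y0, x2 -> y2.
pi_iso = [[0, t, 0], [t, 0, 0], [0, 0, t]]
# Coupling that matches nodes by index and ignores the geometry.
pi_index = [[t, 0, 0], [0, t, 0], [0, 0, t]]

print(gw_cost(C1, C2, pi_iso))    # 0.0: pairwise distances are preserved
print(gw_cost(C1, C2, pi_index))  # 4/9: two pairs of distances are distorted by 1
```

Since the first coupling reaches a zero cost, the two mm-spaces are isomorphic and their Gromov–Wasserstein distance vanishes, whatever the labelling of the nodes.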
2.3. Fused Gromov–Wasserstein Distance
Building on both Gromov–Wasserstein and Wasserstein distances, we define the Fused Gromov–Wasserstein ($FGW$) distance on the space of structured objects:
Definition 8. (Fused Gromov-Wasserstein distance).
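Let $\alpha \in [0,1]$ and $p,q\ge 1$. We consider $(X\times \mathsf{\Omega},{d}_{X},\mu )\in {\mathbb{S}}_{pq}(\mathsf{\Omega})$ and $(Y\times \mathsf{\Omega},{d}_{Y},\nu )\in {\mathbb{S}}_{pq}(\mathsf{\Omega})$. The Fused Gromov–Wasserstein distance is defined as:
$${d}_{FGW,\alpha ,p,q}(\mu ,\nu )={\left(\inf_{\pi \in \mathsf{\Pi}(\mu ,\nu )}{E}_{p,q,\alpha}(\pi )\right)}^{\frac{1}{p}}\qquad (10)$$
where
$${E}_{p,q,\alpha}(\pi )=\int_{{(X\times \mathsf{\Omega}\times Y\times \mathsf{\Omega})}^{2}}{\left((1-\alpha )d{(a,b)}^{q}+\alpha L{(x,y,{x}^{\prime},{y}^{\prime})}^{q}\right)}^{p}\,\mathrm{d}\pi ((x,a),(y,b))\,\mathrm{d}\pi (({x}^{\prime},{a}^{\prime}),({y}^{\prime},{b}^{\prime})).\qquad (11)$$
Figure 7 illustrates this definition. When it is clear from the context we will simply write
${d}_{FGW}$ instead of
${d}_{FGW,\alpha ,p,q}$ for brevity.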
$\alpha $ acts as a trade-off parameter between the cost of the structures represented by
$L(x,y,{x}^{\prime},{y}^{\prime})$ and the feature cost
$d(a,b)$. In this way, the convex combination of both terms brings both sources of information into one formalism, resulting in a single coupling
$\pi $ that “moves” the mass from one joint probability measure to the other.
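The role of the trade-off parameter can be seen on a toy discrete example. The sketch below assumes that, in the discrete case, the objective takes the form $\sum_{i,j,k,l}{\left((1-\alpha )d{({a}_{i},{b}_{j})}^{q}+\alpha |{C}_{1}(i,k)-{C}_{2}(j,l)|^{q}\right)}^{p}{\pi}_{i,j}{\pi}_{k,l}$ with scalar features and $d(a,b)=|a-b|$. It only evaluates this cost for given couplings and does not minimize it. The function name `fgw_cost` and the two labeled path graphs are hypothetical.

```python
def fgw_cost(C1, feats1, C2, feats2, pi, alpha=0.5, p=1, q=1):
    """Fused objective of a coupling pi: convex combination of the feature
    cost |a_i - b_j|^q and the structure cost |C1[i][k] - C2[j][l]|^q."""
    n, m = len(feats1), len(feats2)
    total = 0.0
    for i in range(n):
        for j in range(m):
            if pi[i][j] == 0:
                continue
            feat = abs(feats1[i] - feats2[j]) ** q
            for k in range(n):
                for l in range(m):
                    if pi[k][l] == 0:
                        continue
                    struct = abs(C1[i][k] - C2[j][l]) ** q
                    mixed = (1 - alpha) * feat + alpha * struct
                    total += mixed ** p * pi[i][j] * pi[k][l]
    return total


# Two copies of the path graph 0 - 1 - 2 (shortest-path distances).
C = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
feats_mu = [1.0, 0.0, 0.0]  # label 1 on an end node
feats_nu = [0.0, 1.0, 0.0]  # label 1 on the central node

t = 1.0 / 3
pi_structure = [[t, 0, 0], [0, t, 0], [0, 0, t]]  # preserves distances
pi_feature = [[0, t, 0], [t, 0, 0], [0, 0, t]]    # preserves labels

for alpha in (0.0, 0.5, 1.0):
    print(
        alpha,
        fgw_cost(C, feats_mu, C, feats_nu, pi_structure, alpha),
        fgw_cost(C, feats_mu, C, feats_nu, pi_feature, alpha),
    )
```

With $\alpha =0$ only the features matter and the label-preserving coupling has zero cost; with $\alpha =1$ only the structure matters and the distance-preserving coupling has zero cost. For $\alpha =0.5$ both of these couplings pay a positive cost (1/3 and 2/9, respectively), which is the behaviour the fused formulation is designed to capture.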
Many desirable properties arise from this definition. Among them, one can define a topology over the space of structured objects using the $FGW$ distance to compare structured objects, in the same philosophy as for Wasserstein and Gromov–Wasserstein distances. The definition also implies that $FGW$ acts as a generalization of both Wasserstein and Gromov-Wasserstein distances, with $FGW$ achieving an interpolation between these two distances. More remarkably, $FGW$ distance also realizes geodesic properties over the space of structured objects, allowing the definition of gradient flows. All of these properties are detailed in the next section. Before reviewing them, we first compare $FGW$ with $GW$ and W (by assuming for now that $FGW$ exists, which will be shown later in Theorem 1).
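Proposition 1. (Comparison between $FGW$, $GW$ and W). We have the following results for two structured objects μ and ν:
- •
The following inequalities hold:
$${d}_{FGW,\alpha ,p,q}(\mu ,\nu )\ge (1-\alpha )\,{d}_{W,pq}{({\mu}_{A},{\nu}_{B})}^{q},$$
$${d}_{FGW,\alpha ,p,q}(\mu ,\nu )\ge \alpha \,{d}_{GW,pq}{({\mu}_{X},{\nu}_{Y})}^{q}.$$
- •
Let us suppose that the structure spaces $(X,{d}_{X})$, $(Y,{d}_{Y})$ are part of a single ground space $(Z,{d}_{Z})$ (i.e., $X,Y\subset Z$ and ${d}_{X}={d}_{Y}={d}_{Z}$). We consider the Wasserstein distance between μ and ν for the distance on $Z\times \mathsf{\Omega}$: $\tilde{d}((x,a),(y,b))=(1-\alpha )d(a,b)+\alpha {d}_{Z}(x,y)$. Then:
$${d}_{FGW,\alpha ,p,1}(\mu ,\nu )\le 2\,{d}_{W,p}(\mu ,\nu ).$$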
Proof of this proposition can be found in
Section 7.1. In particular, following this proposition, when the
$FGW$ distance vanishes then both
$GW$ and
W distances vanish, so that the structure and the features of the structured object are individually “the same” (with respect to their corresponding equivalence relation). However, the converse is not necessarily true, as shown in the following example.
Example 4. (Toy trees).
We construct two trees as illustrated in Figure 8, where the 1D node features are shown with colors. The shortest path between the nodes is used to capture the structures of the two structured objects and the Euclidean distance is used for the features. We consider uniform weights on all nodes. Figure 8 illustrates the differences between $FGW$, $GW$, and W distances. The left part is the Wasserstein coupling between the features: red nodes are transported onto red ones and blue nodes onto blue ones, but the tree structures are completely discarded. In this case, the Wasserstein distance vanishes. In the right part, we compute the Gromov–Wasserstein distance between the structures: all pairs of points are transported to another pair of points, which enforces the matching of tree structures without taking the features into account. Because the structures are isometric, the Gromov–Wasserstein distance is null. Finally, we compute $FGW$ using an intermediate α (center): the bottom and first-level structure is preserved, as well as the feature matching (red on red and blue on blue), and $FGW$ discriminates the two structured objects.
3. Mathematical Properties of $\mathbf{FGW}$
In this section, we establish some mathematical properties of the $FGW$ distance. The first result relates to the existence of the $FGW$ distance and the topology of the space of structured objects. We then prove that the $FGW$ distance is indeed a distance with respect to the equivalence relation between structured objects, as defined in Definition 7, allowing us to derive a topology on $\mathbb{S}(\mathsf{\Omega})$.
3.1. Topology of the Structured Object Space
The $FGW$ distance has the following properties:
Theorem 1. (Metric properties). Let $p,q\ge 1$, $\alpha \in ]0,1[$ and $\mu ,\nu \in {\mathbb{S}}_{pq}(\mathsf{\Omega})$. The functional $\pi \to {E}_{p,q,\alpha}(\pi )$ always attains its infimum at some ${\pi}^{*}$ in $\mathsf{\Pi}(\mu ,\nu )$, so that ${d}_{FGW,\alpha ,p,q}(\mu ,\nu )={E}_{p,q,\alpha}{({\pi}^{*})}^{\frac{1}{p}}<+\infty $. Moreover:
- •
${d}_{FGW,\alpha ,p,q}$ is symmetric and, for $q=1$, satisfies the triangle inequality. For $q\ge 2$, the triangle inequality is relaxed by a factor ${2}^{q-1}$.
- •
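For $\alpha \in ]0,1[$, ${d}_{FGW,\alpha ,p,q}(\mu ,\nu )=0$ if and only if there exists a bijective function $f=({f}_{1},{f}_{2}):supp(\mu )\to supp(\nu )$ such that:
$$f\#\mu =\nu ,\qquad (12)$$
$$\forall (x,a)\in supp(\mu ),\quad {f}_{2}(x,a)=a,\qquad (13)$$
$$\forall (x,a),({x}^{\prime},{a}^{\prime})\in supp{(\mu )}^{2},\quad {d}_{X}(x,{x}^{\prime})={d}_{Y}({f}_{1}(x,a),{f}_{1}({x}^{\prime},{a}^{\prime})).\qquad (14)$$
- •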
If $(\mu ,\nu )$ are generalized labeled graphs then ${d}_{FGW,\alpha ,p,q}(\mu ,\nu )=0$ if and only if $(X\times \mathsf{\Omega},{d}_{X},\mu )$ and $(Y\times \mathsf{\Omega},{d}_{Y},\nu )$ are strongly isomorphic.
Proof of this theorem can be found in
Section 7.2. The identity of indiscernibles is the most delicate part to prove and it is based on using the Gromov–Wasserstein distance between the spaces
$X\times \mathsf{\Omega}$ and
$Y\times \mathsf{\Omega}$. The previous theorem states that
$FGW$ is a distance over the space of generalized labeled graphs endowed with the strong isomorphism as equivalence relation defined in Definition 7. More generally, for any structured objects the equivalence relation is given by (
12)–(
14). Informally, invariants of the
$FGW$ are structured objects that have both the same structure and the same features in the same place. Despite the fact that
$q=1$ leads to a proper metric, we will further see in
Section 4.1 that the case
$q=2$ can be computed more efficiently using a separability trick from [
26].
Theorem 1 allows a wide set of applications for
$FGW$, such as
k-nearest-neighbors, distance-substitution kernels, pseudo-Euclidean embeddings, or representative-set methods [
27,
28,
29]. Arguably, such a distance allows for a better interpretation than end-to-end learning machines, such as neural networks, because the
$\pi $ matrix exhibits the relationships between the elements of the objects in a pairwise comparison.
3.2. Can We Adapt W and GW for Structured Objects?
Despite the appealing properties of both Wasserstein and Gromov–Wasserstein distances, they fail at comparing structured objects by focusing only on the feature and structure marginals, respectively. However, with some hypotheses, one could adapt these distances for structured objects.
Adapting Wasserstein. If the structure spaces
$(X,{d}_{X})$ and
$(Y,{d}_{Y})$ are part of the same ground space
$(Z,{d}_{Z})$ (i.e.,
$X,Y\subset Z$ and
${d}_{X}={d}_{Y}={d}_{Z}$), one can build a distance
$\widehat{d}={d}_{Z}\oplus d$ between couples
$(x,a)$ and
$(y,b)$ and apply the Wasserstein distance, so as to compare the two structured objects. In this case, when the Wasserstein distance vanishes it implies that the structured objects are equal in the sense
$\mu =\nu $. This approach is closely related to the one discussed in [
30], where the authors define the Transportation
${L}^{p}$ distance for signal analysis purposes. Their approach can be viewed as a transport between two joint measures
$\mu (X\times \mathsf{\Omega})=\lambda (\{(x,f(x))\phantom{\rule{4pt}{0ex}}|\phantom{\rule{4pt}{0ex}}x\in X\subset Z={\mathbb{R}}^{d};\phantom{\rule{4pt}{0ex}}f(x)\in A\subset {\mathbb{R}}^{m}\})$,
$\nu (Y\times \mathsf{\Omega})=\lambda (\{(y,g(y))\phantom{\rule{4pt}{0ex}}|\phantom{\rule{4pt}{0ex}}y\in Y\subset Z={\mathbb{R}}^{d};\phantom{\rule{4pt}{0ex}}g(y)\in B\subset {\mathbb{R}}^{m}\})$ for functions
$f,g:Z\to {\mathbb{R}}^{m}$ representative of the signal values and
$\lambda $ the Lebesgue measure. The distance for the transport is defined as
$\widehat{d}((x,f(x)),(y,g(y)))=\frac{1}{\alpha}{\parallel x-y\parallel}_{p}^{p}+{\parallel f(x)-g(y)\parallel}_{p}^{p}$ for
$\alpha >0$ and
${\parallel \cdot \parallel}_{p}$ the
${\ell}_{p}$ norm. In this case,
$f(x)$ and
$g(y)$ can be interpreted as encoding the feature information of the signal, while
$x,y$ encode its structure information. This approach is appealing, but it cannot be used on structured objects such as graphs, which do not share a common structure embedding space.
Adapting Gromov–Wasserstein. The Gromov–Wasserstein distance can also be adapted to structured objects by considering the distances $(1-\beta ){d}_{X}\oplus \beta d$ and $(1-\beta ){d}_{Y}\oplus \beta d$ within each space $X\times \mathsf{\Omega}$ and $Y\times \mathsf{\Omega}$, respectively, with $\beta \in ]0,1[$. When the resulting $GW$ distance vanishes, the structured objects are isomorphic with respect to $(1-\beta ){d}_{X}\oplus \beta d$ and $(1-\beta ){d}_{Y}\oplus \beta d$. However, strong isomorphism is a stronger notion, since isomorphism allows for “permuting the labels”, whereas strong isomorphism does not. More precisely, we have the following lemma:
Lemma 1. Let $(X\times \mathsf{\Omega},{d}_{X},\mu ),(Y\times \mathsf{\Omega},{d}_{Y},\nu )$ be two structured objects and $\beta \in ]0,1[$.
If $(X\times \mathsf{\Omega},{d}_{X},\mu )$ and $(Y\times \mathsf{\Omega},{d}_{Y},\nu )$ are strongly isomorphic then $(X\times \mathsf{\Omega},(1-\beta ){d}_{X}\oplus \beta d,\mu )$ and $(Y\times \mathsf{\Omega},(1-\beta ){d}_{Y}\oplus \beta d,\nu )$ are isomorphic. However the converse is not true in general.
Proof. To see this, if we consider
f as defined in Definition 7, then, for
$(x,a),({x}^{\prime},b)\in {(\mathrm{supp}(\mu ))}^{2}$, we have
${d}_{X}(x,{x}^{\prime})={d}_{Y}(I(x),I({x}^{\prime}))$. In this way:
$$(1-\beta ){d}_{X}(x,{x}^{\prime})+\beta d(a,b)=(1-\beta ){d}_{Y}(I(x),I({x}^{\prime}))+\beta d(a,b),$$
which can be rewritten as:
$$\left((1-\beta ){d}_{X}\oplus \beta d\right)((x,a),({x}^{\prime},b))=\left((1-\beta ){d}_{Y}\oplus \beta d\right)(f(x,a),f({x}^{\prime},b)),$$
and so
f is an isometry with respect to
$(1-\beta ){d}_{X}\oplus \beta d$ and
$(1-\beta ){d}_{Y}\oplus \beta d$. Because
f is also measure preserving and surjective,
$(X\times \mathsf{\Omega},(1-\beta ){d}_{X}\oplus \beta d,\mu )$ and
$(Y\times \mathsf{\Omega},(1-\beta ){d}_{Y}\oplus \beta d,\nu )$ are isomorphic. □
However, the converse is not necessarily true, as it is easy to construct an example with the same structure but with permuted labels, so that the objects are isomorphic but not strongly isomorphic. For example, in the tree example of
Figure 4, the structures are isomorphic and the distances between the features within each space are the same for both structured objects, so that
$(X\times \mathsf{\Omega},(1-\beta ){d}_{X}\oplus \beta d,\mu )$ and
$(Y\times \mathsf{\Omega},(1-\beta ){d}_{Y}\oplus \beta d,\nu )$ are isomorphic, yet not strongly isomorphic, as shown in the example since
$FGW>0$.
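The failure of the converse can also be checked on a minimal discrete example. The sketch below is an illustration under assumptions that are not taken from the paper: three categorical labels compared with the discrete metric (distance 1 between distinct labels), a 3-node path graph for both structures, and uniform weights, so that every bijection between the nodes is measure preserving. All function and variable names are hypothetical.

```python
from itertools import permutations


def discrete_metric(a, b):
    """Feature distance assumed for this illustration."""
    return 0.0 if a == b else 1.0


def product_metric_matrix(C, feats, beta):
    """Matrix of (1 - beta) * d_X (+) beta * d on the atoms (x_i, a_i)."""
    n = len(feats)
    return [
        [
            (1 - beta) * C[i][k] + beta * discrete_metric(feats[i], feats[k])
            for k in range(n)
        ]
        for i in range(n)
    ]


def isometries(D1, D2, tol=1e-9):
    """All permutations s with D1[i][k] == D2[s[i]][s[k]] for every i, k."""
    n = len(D1)
    return [
        s
        for s in permutations(range(n))
        if all(
            abs(D1[i][k] - D2[s[i]][s[k]]) <= tol
            for i in range(n)
            for k in range(n)
        )
    ]


# Shortest-path matrix of the path graph 0 - 1 - 2, used for both objects.
C = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
feats_mu = ["A", "B", "C"]
feats_nu = ["B", "A", "C"]  # two labels swapped
beta = 0.5

D_mu = product_metric_matrix(C, feats_mu, beta)
D_nu = product_metric_matrix(C, feats_nu, beta)

# Isomorphisms for the product metrics (uniform weights are always preserved).
product_isometries = isometries(D_mu, D_nu)

# Strong isomorphisms: isometries of the structure that also keep the labels.
strong_isomorphisms = [
    s
    for s in isometries(C, C)
    if all(feats_mu[i] == feats_nu[s[i]] for i in range(len(feats_mu)))
]

print(product_isometries)   # non-empty: isomorphic for the product metrics
print(strong_isomorphisms)  # empty: not strongly isomorphic
```

The product metric only sees distances between labels within each object, so swapping two labels that are equidistant from all others goes unnoticed, whereas strong isomorphism requires each label to stay attached to the matched node.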
3.3. Convergence of Structured Objects
The metric property naturally endows the structured object space with a notion of convergence, as described in the next definition:
Definition 9. (Convergence of structured objects).
Let ${\left(({X}_{n}\times {A}_{n},{d}_{{X}_{n}},{\mu}_{n})\right)}_{n\in \mathbb{N}}$ be a sequence of structured objects. It converges to $(X\times \mathsf{\Omega},{d}_{X},\mu )$ in the Fused Gromov–Wasserstein sense if:
$$\lim_{n\to \infty}{d}_{FGW,\alpha ,p,1}({\mu}_{n},\mu )=0.$$
Using Proposition 1, it is straightforward to see that if the sequence converges in the
$FGW$ sense, both the features and the structure converge respectively in the Wasserstein and Gromov–Wasserstein sense (see [
17] for the definition of convergence in the Gromov–Wasserstein sense).
An interesting question arises from this definition. Let us consider a structured object $(X\times \mathsf{\Omega},{d}_{X},\mu )$ and let us sample the joint distribution so as to consider ${({\{({x}_{i},{a}_{i})\}}_{i\in \{1,\dots ,n\}},{d}_{X},{\mu}_{n})}_{n\in \mathbb{N}}$ with ${\mu}_{n}=\frac{1}{n}{\displaystyle \sum _{i=1}^{n}}{\delta}_{({x}_{i},{a}_{i})}$, where the $({x}_{i},{a}_{i})\in X\times \mathsf{\Omega}$ are sampled from $\mu $. Does this sequence converge to $(X\times \mathsf{\Omega},{d}_{X},\mu )$ in the $FGW$ sense, and how fast is the convergence?
This question can be answered thanks to a notion of “size” of a probability measure. For the sake of conciseness, we do not present the theory exhaustively; the reader can refer to [31] for more details. Given a measure $\mu $ on $\mathsf{\Omega}$, we denote by ${d}_{p}^{*}(\mu )$ its upper Wasserstein dimension. It coincides with the intuitive notion of “dimension” when the measure is sufficiently well behaved. For example, for any measure $\mu $ that is absolutely continuous with respect to the Lebesgue measure on ${[0,1]}^{d}$, we have ${d}_{p}^{*}(\mu )=d$ for any $p\in [1,\frac{d}{2}]$. Using this definition and the results presented in [31], we answer the question of the convergence of finite samples in the following theorem (the proof can be found in Section 7.3):
Theorem 2. Convergence of finite samples and a concentration inequality
Let $p\ge 1$. We have:
$$\underset{n\to \infty}{\lim}{d}_{FGW,\alpha ,p,1}({\mu}_{n},\mu )=0.$$
Moreover, suppose that $s>{d}_{p}^{*}(\mu )$. Then there exists a constant $C$ that does not depend on $n$ such that:
$$\mathbb{E}\left[{d}_{FGW,\alpha ,p,1}({\mu}_{n},\mu )\right]\le C{n}^{-\frac{1}{s}}.$$
The expectation is taken over the i.i.d. samples $({x}_{i},{a}_{i})$. A particular case of this inequality is $\alpha =1$, which yields a concentration result for the Gromov–Wasserstein distance. More precisely, if ${\nu}_{n}=\frac{1}{n}{\sum}_{i}{\delta}_{{x}_{i}}$ denotes the empirical measure of $\nu \in \mathcal{P}(X)$ and if ${s}^{\prime}>{d}_{p}^{*}(\nu )$, we have:
$$\mathbb{E}\left[{d}_{GW,p}({\nu}_{n},\nu )\right]\le {C}^{\prime}{n}^{-\frac{1}{{s}^{\prime}}}.$$

This result is a simple application of the finite-sample convergence properties of the Wasserstein distance, since in this case ${\mu}_{n}$ and $\mu $ are part of the same ground space, so that (18) derives naturally from (11) and the properties of the Wasserstein distance. In contrast to the Wasserstein case, this inequality is not necessarily sharp, and future work will be dedicated to the study of its tightness.
3.4. Interpolation Properties between Wasserstein and Gromov-Wasserstein Distances
The $FGW$ distance generalizes both the Wasserstein and Gromov–Wasserstein distances in the sense that it interpolates between them. More precisely, we have the following theorem:
Theorem 3. Interpolation properties.
As $\alpha $ tends to zero, one recovers the Wasserstein distance between the feature information, and as $\alpha $ tends to one, one recovers the Gromov–Wasserstein distance between the structure information.

This result shows that $FGW$ can revert to either of the two other distances. In machine learning, this allows the $\alpha $ parameter to be validated so as to better fit the data properties (i.e., by tuning the relative importance of the feature vs. structure information). One can also see the choice of $\alpha $ as a representation learning problem, in which its value is found by optimizing a given criterion.
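The role of $\alpha $ can be illustrated on a tiny discrete example. The sketch below is not the authors' algorithm: it assumes the $FGW$ objective takes the form ${\sum}_{i,j,k,l}\left((1-\alpha )d({a}_{i},{b}_{j})+\alpha |{d}_{X}({x}_{i},{x}_{k})-{d}_{Y}({y}_{j},{y}_{l})|\right){\pi}_{ij}{\pi}_{kl}$ (i.e., $p=q=1$), uses uniform weights, and only searches over permutation couplings, so for $\alpha >0$ it gives an upper bound on the optimal value in general. The function names are hypothetical.

```python
import itertools

def fgw_cost_perm(C1, a, C2, b, perm, alpha):
    """Assumed FGW objective (p = q = 1) for the coupling pi = (1/n) * permutation."""
    n = len(a)
    w = 1.0 / n
    total = 0.0
    for i in range(n):
        for k in range(n):
            feat = abs(a[i] - b[perm[i]])                    # feature cost d(a_i, b_sigma(i))
            struct = abs(C1[i][k] - C2[perm[i]][perm[k]])    # structure distortion
            total += ((1 - alpha) * feat + alpha * struct) * w * w
    return total

def best_over_perms(C1, a, C2, b, alpha):
    """Smallest objective value over all permutation couplings (brute force)."""
    n = len(a)
    return min(fgw_cost_perm(C1, a, C2, b, p, alpha)
               for p in itertools.permutations(range(n)))

# Same 3-node path structure, same multiset of labels, but labels permuted.
C = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
a = [0.0, 1.0, 2.0]
b = [0.0, 2.0, 1.0]

cost_features_only = best_over_perms(C, a, C, b, 0.0)   # alpha = 0: features only
cost_structure_only = best_over_perms(C, a, C, b, 1.0)  # alpha = 1: structure only
cost_fused = best_over_perms(C, a, C, b, 0.5)           # both terms active
```

On this example the feature-only and structure-only costs both vanish, whereas `cost_fused` is strictly positive: no single permutation matches the labels and preserves the path structure at the same time. For a fixed coupling, the assumed objective is linear in $\alpha $, which is what makes the two endpoints easy to read off.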
3.5. Geodesic Properties
One desirable property in OT is the existence of underlying geodesics defined by the mass transfer between two probability distributions. These properties are useful for defining the dynamic formulation of OT problems. This dynamic point of view is inspired by fluid dynamics and, in the Wasserstein context, originates in [32]. Various applications in machine learning can be derived from this formulation: interpolation along geodesic paths has been used in computer graphics for color or illumination interpolation [33]. More recently, Ref. [34] used Wasserstein gradient flows in an optimization context, deriving global minima results for non-convex particle gradient descent. In [35], the authors used Wasserstein gradient flows in the context of reinforcement learning for policy optimization.
The main idea of this dynamic formulation is to describe the optimal transport problem between two measures as a curve of minimal total length in the space of measures. We first give some general facts about geodesic spaces and recall classical results on the dynamic formulation in both the Wasserstein and Gromov–Wasserstein contexts. We then derive new geodesic properties in the $FGW$ context.
Geodesic spaces. Let $(X,d)$ be a metric space and $x,y$ two points in $X$. We say that a curve $w:[0,1]\to X$ joining the endpoints $x$ and $y$ (i.e., with $w(0)=x$ and $w(1)=y$) is a constant speed geodesic if it satisfies $d(w(t),w(s))\le |t-s|d(w(0),w(1))=|t-s|d(x,y)$ for $t,s\in [0,1]$. Moreover, if $(X,d)$ is a length space (i.e., if the distance between two points of $X$ is equal to the infimum of the lengths of the curves connecting these two points), then the converse is also true and a constant speed geodesic satisfies $d(w(t),w(s))=|t-s|d(x,y)$. It is easy to compute distances along such curves, as they are directly embedded into $\mathbb{R}$.
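As a concrete instance of this definition, the straight line $w(t)=(1-t)x+ty$ in ${\mathbb{R}}^{d}$ with the Euclidean distance is a constant speed geodesic. The short sketch below checks the identity $d(w(t),w(s))=|t-s|d(x,y)$ numerically on a few values of $t$ and $s$; the points and the helper names are chosen only for illustration.

```python
import math

def dist(u, v):
    """Euclidean distance between two points of R^d given as lists."""
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

def w(t, x, y):
    """Straight-line curve from x (t = 0) to y (t = 1)."""
    return [(1 - t) * xi + t * yi for xi, yi in zip(x, y)]

x, y = [0.0, 0.0], [3.0, 4.0]
ts = [0.0, 0.25, 0.5, 1.0]

# Largest deviation from d(w(t), w(s)) = |t - s| d(x, y) over the grid.
max_gap = max(abs(dist(w(t, x, y), w(s, x, y)) - abs(t - s) * dist(x, y))
              for t in ts for s in ts)
```

The value of `max_gap` is zero up to floating-point error, so the distance travelled along the curve grows linearly with $|t-s|$.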
In the Wasserstein context, if the ground space is a complete, separable, locally compact length space, and if the endpoints of the geodesic are given, then there exists a geodesic curve. Moreover, if the transport between the endpoints is unique, then there is a unique displacement interpolation between the endpoints (see Corollaries 7.22 and 7.23 in [15]). For example, if the ground space is ${\mathbb{R}}^{d}$ and the distance between the points is measured via the ${\ell}_{2}$ norm, then geodesics exist and are uniquely determined (this can be generalized to strictly convex costs). In the Gromov–Wasserstein context, there always exist constant speed geodesics as long as the endpoints are given. These geodesics are unique modulo the isomorphism equivalence relation (see [16]).
The $FGW$ case. In this paragraph, we suppose that $\mathsf{\Omega}={\mathbb{R}}^{d}$. We are interested in finding a geodesic curve in the space of structured objects, i.e., a constant speed curve of structured objects joining two structured objects. As for Wasserstein and Gromov–Wasserstein, the structured object space endowed with the Fused Gromov–Wasserstein distance retains some geodesic properties. The following result proves the existence of such a geodesic and characterizes it:
Theorem 4. Constant speed geodesic.
Let $p\ge 1$ and $(X\times \mathsf{\Omega},{d}_{X},{\mu}_{0})$ and $(Y\times \mathsf{\Omega},{d}_{Y},{\mu}_{1})$ in ${\mathbb{S}}_{p}({\mathbb{R}}^{d})$. Let ${\pi}^{*}$ be an optimal coupling for the Fused Gromov–Wasserstein distance between ${\mu}_{0},{\mu}_{1}$, and $t\in [0,1]$. We equip ${\mathbb{R}}^{d}$ with the ${\ell}_{m}$ norm for $m\ge 1$.
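We define ${\eta}_{t}:X\times \mathsf{\Omega}\times Y\times \mathsf{\Omega}\to X\times Y\times \mathsf{\Omega}$ such that:
$$\forall (x,a,y,b)\in X\times \mathsf{\Omega}\times Y\times \mathsf{\Omega},\quad {\eta}_{t}(x,a,y,b)=(x,y,(1-t)a+tb).$$
Then:
$${\left(X\times Y\times \mathsf{\Omega},(1-t){d}_{X}\oplus t{d}_{Y},{\mu}_{t}={\eta}_{t}\#{\pi}^{*}\right)}_{t\in [0,1]}$$
is a constant speed geodesic connecting $(X\times \mathsf{\Omega},{d}_{X},{\mu}_{0})$ and $(Y\times \mathsf{\Omega},{d}_{Y},{\mu}_{1})$ in the metric space $\left({\mathbb{S}}_{p}({\mathbb{R}}^{d}),{d}_{FGW,\alpha ,p,1}\right)$, where ${\eta}_{t}\#{\pi}^{*}$ denotes the push-forward of ${\pi}^{*}$ by ${\eta}_{t}$.

The proof of the previous theorem can be found in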
Section 7.5. In a sense, this result combines the geodesics in the Wasserstein space and in the space of all mm-spaces, since it suffices to interpolate the distances in the structure space and the features to construct a geodesic. The main interest is that it defines the shortest path between two structured objects. For example, when considering two discrete structured objects represented by the measures $\mu ={\sum}_{i=1}^{n}{h}_{i}{\delta}_{({x}_{i},{a}_{i})}$ and $\nu ={\sum}_{j=1}^{m}{g}_{j}{\delta}_{({y}_{j},{b}_{j})}$, the interpolation path is given for $t\in [0,1]$ by the measure ${\mu}_{t}={\sum}_{i=1}^{n}{\sum}_{j=1}^{m}{\pi}^{*}(i,j){\delta}_{({x}_{i},{y}_{j},(1-t){a}_{i}+t{b}_{j})}$, where ${\pi}^{*}$ is an optimal coupling for the $FGW$ distance. However, this geodesic is difficult to handle in practice, since it requires the computation of the Cartesian product $X\times Y$. To overcome this obstacle, an extension using the Fréchet mean is defined in Section 4.2. The proper definition and properties of the velocity fields associated with this geodesic are left for future work.
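The discrete interpolation path can be sketched in a few lines. The code below is a minimal illustration under stated assumptions: the coupling `pi` is supplied by hand and is not claimed to be optimal (computing ${\pi}^{*}$ requires an $FGW$ solver, which is outside this sketch), features are vectors of ${\mathbb{R}}^{d}$ stored as lists, and the structure on $X\times Y$ is taken to be $(1-t){d}_{X}({x}_{i},{x}_{k})+t{d}_{Y}({y}_{j},{y}_{l})$, in line with the remark that it suffices to interpolate the distances in the structure space and the features. The function name `interpolate` is hypothetical.

```python
def interpolate(C1, a, C2, b, pi, t):
    """Atoms, weights and structure matrix of mu_t for a given coupling pi (n x m).

    Each atom is (i, j, feature), standing for (x_i, y_j, (1 - t) a_i + t b_j),
    and carries the mass pi[i][j]; pairs with zero mass are dropped.
    """
    atoms, weights = [], []
    for i in range(len(a)):
        for j in range(len(b)):
            if pi[i][j] > 0:
                feat = [(1 - t) * ai + t * bj for ai, bj in zip(a[i], b[j])]
                atoms.append((i, j, feat))
                weights.append(pi[i][j])
    # Interpolated structure between atoms (x_i, y_j) and (x_k, y_l).
    D = [[(1 - t) * C1[i][k] + t * C2[j][l] for (k, l, _) in atoms]
         for (i, j, _) in atoms]
    return atoms, weights, D

# Two small structured objects with one-dimensional features.
C1 = [[0, 1], [1, 0]]
a = [[0.0], [1.0]]
C2 = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
b = [[2.0], [4.0], [6.0]]
# Hand-picked coupling with marginals (0.5, 0.5) and (0.25, 0.25, 0.5).
pi = [[0.25, 0.25, 0.0],
      [0.0, 0.0, 0.5]]

atoms_start, weights_start, D_start = interpolate(C1, a, C2, b, pi, 0.0)
atoms_half, weights_half, D_half = interpolate(C1, a, C2, b, pi, 0.5)
atoms_end, weights_end, D_end = interpolate(C1, a, C2, b, pi, 1.0)
```

At $t=0$ the atoms carry the features of the first object and the structure ${d}_{X}$, and at $t=1$ those of the second object. The number of atoms is the number of nonzero entries of the coupling, which can be as large as $nm$; this is the practical cost of working on $X\times Y$ mentioned above.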
6. Conclusions
This paper presents the Fused Gromov–Wasserstein ($FGW$) distance. Inspired by both the Wasserstein and Gromov–Wasserstein distances, $FGW$ can compare structured objects by including the inherent relations that exist between the elements of the objects, which constitute their structure information, and their feature information, which lies in a ground space common to all structured objects. We have stated mathematical results about this new distance, such as metric, interpolation, and geodesic properties. We have also provided a concentration result for the convergence of finite samples. In the discrete case, algorithms to compute $FGW$ itself and related Fréchet means are provided. The use of this new distance is illustrated on problems involving structured objects, such as time series embedding, graph classification, graph barycenter computation, and graph clustering. Several questions are raised by this work. From a practical point of view, $FGW$ is quite expensive to compute, and future work will aim to lower the computational complexity of the underlying optimization problem to ensure better scalability to very large graphs. Moreover, while in this work we mostly consider graph structures described by the shortest path, other choices could be made, such as distances based on the Laplacian of the graph. Finally, from a theoretical point of view, it is often valuable for the geodesic path to be unique, so as to properly define objects such as gradient flows. One interesting result would be, for example, to determine whether the geodesic is unique with respect to the strong isomorphism relation.