Fused Gromov-Wasserstein Distance for Structured Objects

Optimal transport theory has recently found many applications in machine learning thanks to its capacity to meaningfully compare various machine learning objects that are viewed as distributions. The Kantorovich formulation, leading to the Wasserstein distance, focuses on the features of the elements of the objects but treats them independently, whereas the Gromov-Wasserstein distance focuses on the relations between the elements, depicting the structure of the object, yet discarding its features. In this paper, we study the Fused Gromov-Wasserstein distance that extends the Wasserstein and Gromov-Wasserstein distances in order to encode simultaneously both the feature and structure information. We provide the mathematical framework for this distance in the continuous setting, prove its metric and interpolation properties, and provide a concentration result for the convergence of finite samples. We also illustrate and interpret its use in various applications where structured objects are involved.

In this paper we focus on the comparison of structured objects, i.e., objects defined by both feature and structure information. Abstractly, the feature information covers all the attributes of an object. For example, it can model the value of a signal when objects are time series, or the node labels in a graph context. In shape analysis, the spatial positions of the nodes can be regarded as features, or, when objects are images, local color histograms can describe the image's feature information. As for the structure information, it encodes the specific relationships that exist among the components of the object. In a graph context, nodes and edges are representative of this notion, so that each label of the graph may be linked to some others through the edges between the nodes. In a time series context, the values of the signal are related to each other through a temporal structure. This representation is clearly related to the concept of relational reasoning (see [6]), where some entities (or elements with attributes such as the intensity of a signal) coexist with some relations or properties between them (or some structure as described above).
Including structural knowledge about objects in a machine learning context has often been valuable in order to build more generalizable models. As shown in many contexts such as graphical models [24,25], relational reinforcement learning [12] or Bayesian nonparametrics [15], considering machine learning objects as a complex composition of entities together with their interactions is crucial in order to learn from small amounts of data. For a review of relational reasoning and its consequences, see [6].
Unlike recent deep learning end-to-end approaches [17,13] that attempt to avoid integration of prior knowledge or assumptions about the structure wherever possible, ad hoc methods, depending on the kind of structured objects involved, aim to build meaningful tools that include structure information in the machine learning process. In graph classification, the structure can be taken into account through dedicated graph kernels, in which the structure drives the combination of the feature information [30,21,36]. In a time series context, Dynamic Time Warping and related approaches are based on the similarity between the features while allowing limited temporal distortion in the time instants that are matched [11,29]. Closely related, an entire field has focused on predicting the structure as an output and has been deployed on tasks such as segmenting an image into meaningful components or predicting a natural language sentence [5,23,20].
All these approaches rely on meaningful representations of the structured objects that are involved. In this context, machine learning objects can usefully be described as distributions or probability measures. This allows them to be compared within the Optimal Transport (OT) framework, which provides an elegant way of comparing distributions by capturing the underlying geometric properties of the space through a cost function. When distributions dwell in a common metric space $(\Omega, d)$, the Wasserstein distance defines a metric between these distributions [35]. In contrast, the Gromov-Wasserstein distance [31,19] aims at comparing distributions that live in different metric spaces through the intrinsic pair-to-pair distances in each space. Unifying both distances, the Fused Gromov-Wasserstein distance was proposed in [34] and used in the discrete setting to encode, in a single OT formulation, both feature and structure information of structured objects. This approach considers structured objects as joint distributions over a common feature space associated with a structure space specific to each object. An OT formulation is derived by considering a tradeoff between the feature and the structure costs, respectively defined with respect to the Wasserstein and the Gromov-Wasserstein standpoints.
This paper presents the theoretical foundations of this distance and states the mathematical properties of the FGW metric in the general setting. We first introduce a representation of structured objects using distributions. We show that the classical Wasserstein and Gromov-Wasserstein distances can be used to compare either the feature information or the structure information of the structured object, but that both fail at comparing the entire object. We then present the Fused Gromov-Wasserstein distance in its general formulation and derive some of its mathematical properties. In particular, we show that it is a metric in a given case, we give a concentration result, and we study its interpolation and geodesic properties. We conclude by illustrating and interpreting the distance in several applicative contexts.
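NOTATIONS Let $P(\Omega)$ be the set of all probability measures on a space $\Omega$ and $B(A)$ the set of all Borel sets of a $\sigma$-algebra $A$. We note $\#$ the push-forward operator such that, for a measurable map $T$ and a measure $\mu$, $T\#\mu(B) = \mu(T^{-1}(B))$ for every measurable set $B$. A measure $\mu$ on a set $\Omega$ is said to be fully supported if $\mathrm{supp}[\mu] = \Omega$, where $\mathrm{supp}[\mu]$ is the minimal closed subset $A \subset \Omega$ such that $\mu(\Omega \setminus A) = 0$. Informally, this is the set where the measure "lives". We note $P_i\#\mu$ the projection on the $i$-th marginal of $\mu$.
For two probability measures $\mu \in P(A)$ and $\nu \in P(B)$ we note $\Pi(\mu, \nu)$ the set of all couplings or matching measures of $\mu$ and $\nu$, i.e. the set
$$\Pi(\mu, \nu) = \{\pi \in P(A \times B) \mid P_1\#\pi = \mu,\ P_2\#\pi = \nu\}.$$
We note the simplex of $N$ bins as
$$\Sigma_N = \Big\{a \in (\mathbb{R}_+)^N \ \Big|\ \sum_{i} a_i = 1\Big\}.$$
For two histograms $a \in \Sigma_n$ and $b \in \Sigma_m$ we note, with some abuse of notation, $\Pi(a, b)$ the set of all couplings of $a$ and $b$, i.e. the set
$$\Pi(a, b) = \Big\{\pi \in \mathbb{R}_+^{n \times m} \ \Big|\ \sum_{j} \pi_{i,j} = a_i,\ \sum_{i} \pi_{i,j} = b_j\Big\}.$$
We also note $\otimes$ the tensor product, i.e. for a tensor $L = (L_{i,j,k,l})$, $L \otimes B$ is the matrix $\big(\sum_{k,l} L_{i,j,k,l} B_{k,l}\big)_{i,j}$. Finally, for $x \in \Omega$, $\delta_x$ denotes the Dirac measure at $x$.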
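The discrete notations above can be made concrete with a small sketch. It is only an illustration in plain Python (no dependencies): the function names `is_coupling` and `tensor_matrix_product`, the list-of-lists encoding of matrices and tensors, and the numerical tolerance are assumptions of this example, not part of the paper.

```python
from math import isclose


def is_coupling(pi, a, b, tol=1e-9):
    """Return True if the matrix pi (list of rows) belongs to Pi(a, b)."""
    n, m = len(a), len(b)
    if len(pi) != n or any(len(row) != m for row in pi):
        return False
    # Entries must be non-negative.
    if any(v < -tol for row in pi for v in row):
        return False
    # Row sums must give a, column sums must give b.
    rows_ok = all(isclose(sum(pi[i]), a[i], abs_tol=tol) for i in range(n))
    cols_ok = all(
        isclose(sum(pi[i][j] for i in range(n)), b[j], abs_tol=tol)
        for j in range(m)
    )
    return rows_ok and cols_ok


def tensor_matrix_product(L, B):
    """Compute (L ⊗ B)[i][j] = sum over k, l of L[i][j][k][l] * B[k][l]."""
    return [
        [
            sum(
                L[i][j][k][l] * B[k][l]
                for k in range(len(B))
                for l in range(len(B[0]))
            )
            for j in range(len(L[0]))
        ]
        for i in range(len(L))
    ]


# Hypothetical histograms; the product coupling a b^T always lies in Pi(a, b).
a = [0.5, 0.5]
b = [0.25, 0.25, 0.5]
product_coupling = [[ai * bj for bj in b] for ai in a]
```

The product coupling is the simplest element of $\Pi(a, b)$, which shows in particular that the set is never empty.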
ASSUMPTION In the paper we suppose that all metric spaces are Polish and non-trivial, and that all measures are Borel.
FIG. 1. A structured object (left) can be described by a labelled graph with $(a_i)_i$ the feature information of the object and $(x_i)_i$ the structure information. If we enrich this object with a histogram $(h_i)_i$ measuring the relative importance of the nodes, we can represent the structured object as a fully supported probability measure $\mu$ over the product space of features and structures, with marginals $\mu_X$ and $\mu_A$ on the structure and the features respectively (right).

Structured objects as distributions and Fused Gromov Wasserstein distance
We can represent structured objects in the discrete case by a labelled graph $G$ described by $(\{x_i, a_i\})_{i \in \{1,\dots,n\}}$, where $A = (a_1, \dots, a_n) \in \Omega^n$ is the set of labels (also called features) and $X = (x_1, \dots, x_n)$ a representation of the graph vertices. To this extent, the features $a_i$ are structured by the intrinsic relation between the vertices $x_i$ (see Fig. 1). Note that in this model, we suppose that we can encode the relation between the vertices in the ambient space of the vertices $x_i$ through the distance $d_X$ in this space. In many applications in machine learning, objects are readily endowed with a notion of distance between their points, hence defining metric spaces. As such, structured objects can be viewed as couples of $X \times A$, where $A$ is a subset of some metric space $(\Omega, d)$ representative of the feature space and $(X, d_X)$ is a metric space representative of the structure, with $d_X$ the distance between elements of the space, modeling the intrinsic relationships between points of the structured object.
In this paper, we propose to enrich the previous definition of a structured object with a (fully supported) probability measure which serves the purpose of signaling the relative importance of the object's elements. For example, we can add weights $(h_i)_i \in \Sigma_n$ to each node in the graph defined previously. This way, we have created a fully supported probability measure $\mu = \sum_i h_i \delta_{(x_i, a_i)}$ which includes all the structured object information (see Fig. 1).
These considerations lead to the following formal definition of a structured object and of the space it belongs to:
DEFINITION 2.1 Structured objects. A structured object over a metric space $(\Omega, d)$ is the triplet $(X \times A, d_X, \mu)$, where $(X, d_X)$ is a compact metric space, $A$ is a compact subset of $\Omega$ and $\mu$ is a fully supported probability measure over $X \times A$. $(\Omega, d)$ is called the feature space, $A$ is the feature information of the structured object and $(X, d_X)$ its structure information.

DEFINITION 2.2 Space of structured objects.
A structured object over $(\Omega, d)$ (simply denoted as $\Omega$) is an element of the following space:
$$H(\Omega) = \{(X \times A, d_X, \mu) \mid X \in \mathcal{X},\ A \in C(\Omega),\ \mu \in P(X \times A)\},$$
where $C(\Omega)$ is the set of all compact subsets of $\Omega$, $\mathcal{X}$ the set of all compact metric spaces and $P(X \times A)$ the set of fully supported probability measures on $X \times A$.
We will note $\mu_X$ and $\mu_A$ the structure and feature (fully supported) marginals of $\mu$. Those marginals encode very partial information, since they focus only on independent feature distributions or only on the structure. An example of $\mu$, $\mu_X$ and $\mu_A$ is provided for a labeled graph in Fig. 1. With this definition, the features of all structured objects are directly comparable since they live in the same ambient space $(\Omega, d)$. For the sake of simplicity, and when it is clear from the context, we will denote the whole structured object by $\mu$ only.
With this definition, we only consider the fully supported case. Although the mathematical results can be extended to measures that are not fully supported, this leads to discussions about the support of the measures, and for the sake of clarity we omit this extension here. In the following paragraphs, $(X \times A, d_X, \mu)$ and $(Y \times B, d_Y, \nu)$ are structured objects.

Comparing structured objects
We now aim to define a notion of equivalence between two structured objects. Intuitively, two structured objects are the same if they share the same feature information, if their structure information is alike and if the probability measures correspond in some sense. In this section, we present mathematical tools for comparing the elements of structured objects individually.
First, our formalism implies comparing metric spaces, which can be done via the notion of isometry.

DEFINITION 2.3 Isometry
Let $(X, d_X)$ and $(Y, d_Y)$ be two metric spaces. An isometry is a surjective map $f : X \to Y$ that preserves the distances:
$$\forall x, x' \in X, \quad d_Y(f(x), f(x')) = d_X(x, x').$$
An isometry is bijective, since for $f(x) = f(x')$ we have $d_Y(f(x), f(x')) = 0 = d_X(x, x')$ and hence $x = x'$ (in the same way $f^{-1}$ is also an isometry). When such a map exists, $X$ and $Y$ have the same size and any "metric statement" in the first space is "transported" to the second space by the isometry $f$.
EXAMPLE 2.4 Let us consider two graphs whose discrete metric spaces are obtained from the shortest path between the vertices (see the corresponding graphs in Figure 2).
The previous definition can be used to compare the structure information of two structured objects. Regarding the feature information, since the features all lie in the same ambient space $\Omega$, a natural way of comparing them is the standard equality $A = B$. Finally, in order to compare measures on different spaces, the notion of measure preserving map can be used.

DEFINITION 2.5 Preserving application
Let $(\Omega_1, \mu_1 \in P(\Omega_1))$ and $(\Omega_2, \mu_2 \in P(\Omega_2))$ be two measurable spaces. A map $f : \Omega_1 \to \Omega_2$ is said to be measure preserving if it transports the measure $\mu_1$ onto $\mu_2$, that is, $f\#\mu_1 = \mu_2$. If there exists such a measure preserving map, the properties about measures of $\Omega_1$ are transported via $f$ to $\Omega_2$.
FIG. 2. Two isometric metric spaces. Distances between the nodes are given by the shortest path, and the weight of each edge is equal to 1.
Let us now consider a metric measure space (denoted mm-space), i.e. a metric space $(X, d_X)$ enriched with a probability measure and described by a triplet $(X, d_X, \mu_X \in P(X))$. An interesting notion for comparing mm-spaces is the notion of isomorphism.
DEFINITION 2.6 Isomorphism. Two mm-spaces $(X, d_X, \mu_X)$, $(Y, d_Y, \mu_Y)$ are isomorphic if there exists a measure preserving isometry $f : X \to Y$ between them.
EXAMPLE 2.7 Let us consider the two mm-spaces depicted in Figure 3. These spaces are isometric but not isomorphic, as there exists no measure preserving map between them.
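For finite mm-spaces, isometry and isomorphism can be checked by brute force over all bijections. The sketch below is only an illustration in plain Python: the function names, the encoding of a finite mm-space as a distance matrix plus a list of point masses, and the two-point example (its weights are hypothetical and are not taken from Figure 3) are all assumptions of this example. The enumeration is factorial in the number of points, so it is only meant for toy cases.

```python
from itertools import permutations
from math import isclose


def isometries(dX, dY):
    """Yield every bijection f (f[i] is the image of point i) preserving distances."""
    n = len(dX)
    if len(dY) != n:
        return
    for f in permutations(range(n)):
        if all(
            isclose(dX[i][j], dY[f[i]][f[j]], abs_tol=1e-12)
            for i in range(n)
            for j in range(n)
        ):
            yield f


def are_isomorphic(dX, muX, dY, muY):
    """True if some isometry also transports the point masses muX onto muY."""
    return any(
        all(isclose(muX[i], muY[f[i]], abs_tol=1e-12) for i in range(len(muX)))
        for f in isometries(dX, dY)
    )


# Two points at distance 1 in both spaces, with different (hypothetical) weights.
d = [[0, 1], [1, 0]]
mu_uniform = [0.5, 0.5]
mu_skewed = [0.25, 0.75]
```

With these values, both bijections between the two spaces are isometries, yet none of them matches the weights, so the spaces are isometric without being isomorphic.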
All this considered, we can now define a notion of equivalence between structured objects.
DEFINITION 2.8 Equivalence of structured objects. Two structured objects $(X \times A, d_X, \mu)$ and $(Y \times B, d_Y, \nu)$ are said to be equivalent (and the equivalence relation is denoted $\sim$) if there exists a map $f = (f_1, f_2) : X \times A \to Y \times B$ such that $f_1 : X \to Y$ is an isometry, $f_2 = Id$ and $f$ is measure preserving, i.e. $f\#\mu = \nu$. It is clear that this defines an equivalence relation over $H(\Omega)$.
EXAMPLE 2.9 To illustrate this definition, we consider a simple example of two structured objects with $\forall i,\ a_i = b_i$ and $\forall i \neq j,\ a_i \neq a_j$ (see Figure 4). The two structured objects have isometric structures and the same features individually, but they are not equivalent. One possible map $f = (f_1, f_2)$ with $f_1$ an isometry is such that $f_2(a_2) = b_3$. This map does not verify $f_2 = Id$ since $a_2 \neq b_3$. The other possible maps such that $f_1$ is an isometry are simple permutations of this example, and it is easy to check that none of them verifies $f_2 = Id$ (for example with $f(x_2, a_2) = (y_4, b_4)$).
FIG. 4. Two structured objects with isometric structures and identical features that are not equivalent. The color of the nodes represents the node feature and each edge represents a distance of 1 between the connected nodes.
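For small discrete structured objects, the equivalence can again be tested by enumerating bijections. The following is a minimal sketch under stated assumptions: the function name `are_equivalent`, the encoding of an object as (distance matrix, feature list, weight list), and the path-graph example are hypothetical and do not reproduce Figure 4. It further assumes that the couples $(x_i, a_i)$ are pairwise distinct, so that an equivalence amounts to a permutation of the points that is an isometry, keeps each feature unchanged and preserves the weights.

```python
from itertools import permutations


def are_equivalent(dX, a, h, dY, b, g):
    """Brute-force equivalence test for two discrete structured objects."""
    n = len(a)
    if len(b) != n:
        return False
    for f in permutations(range(n)):
        isometry = all(
            dX[i][j] == dY[f[i]][f[j]] for i in range(n) for j in range(n)
        )
        same_features = all(a[i] == b[f[i]] for i in range(n))  # f_2 = Id
        same_masses = all(h[i] == g[f[i]] for i in range(n))    # measure preserving
        if isometry and same_features and same_masses:
            return True
    return False


# Shortest-path distances of a path graph with 4 nodes and unit edges.
path = [[abs(i - j) for j in range(4)] for i in range(4)]
uniform = [0.25] * 4
a = [1, 2, 3, 4]
b_swapped = [1, 3, 2, 4]   # same feature set, two inner labels exchanged
b_reversed = [4, 3, 2, 1]  # the object read from the other end
```

The path graph only has two isometries (the identity and the reversal), so exchanging the two inner labels produces an object with the same structure and the same set of features that is nevertheless not equivalent to the first one.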

Background on OT distances
The Optimal Transport (OT) framework defines useful distances between probability measures that describe either the feature or the structure information of structured objects.
WASSERSTEIN DISTANCE When the probability measures live in the same metric space $(\Omega, d)$, the quantity:
$$d^{\Omega}_{W,p}(\mu_A, \nu_B) = \Big(\inf_{\pi \in \Pi(\mu_A, \nu_B)} \int_{A \times B} d(a, b)^p \, d\pi(a, b)\Big)^{1/p}$$
is usually called the $p$-Wasserstein distance (also known with $p = 1$ as the Earth Mover's distance [28] in the computer vision community) between distributions $\mu_A$ and $\nu_B$.
Optimal transport theory defines a distance on probability measures such that $d^{\Omega}_{W,p}(\mu_A, \nu_B) = 0$ iff $\mu_A = \nu_B$. This distance also has a nice geometrical interpretation, as it represents the optimal cost w.r.t. $d$ to move the measure $\mu_A$ onto $\nu_B$, with $\pi(a, b)$ the amount of probability mass shifted from $a$ to $b$ (see Figure 5). To this extent, the Wasserstein distance quantifies how "far" $\mu_A$ is from $\nu_B$ by measuring how "difficult" it is to move all the mass from $\mu_A$ onto $\nu_B$. Optimal transport can deal with both smooth and discrete measures and has proved very useful for comparing distributions in a shared space but with different (and even non-overlapping) supports.
FIG. 5. Example of coupling between two discrete measures on the same ground space equipped with a distance $d$ that will define the Wasserstein distance. Left: the discrete measures on $\Omega$. Right: one possible coupling between these measures which respects the mass conservation. Image inspired from [26, Fig. 2.6].
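In the special case of two uniform discrete measures with the same number of atoms, an optimal coupling can be chosen as a permutation (the transport cost is linear in $\pi$ and the extreme points of the coupling set are then permutation matrices up to scaling), so the $p$-Wasserstein distance can be computed by enumeration on toy examples. The sketch below relies on that assumption; the function name and the example points are hypothetical, and a practical implementation would use a linear programming or dedicated OT solver instead.

```python
from itertools import permutations


def wasserstein_uniform(a, b, p=1, d=lambda u, v: abs(u - v)):
    """p-Wasserstein distance between uniform measures on the atoms a and b.

    Assumes len(a) == len(b) and uniform weights 1/n on both sides, so that
    the search can be restricted to permutations. Factorial cost: toy sizes only.
    """
    n = len(a)
    if len(b) != n:
        raise ValueError("both measures must have the same number of atoms")
    best = min(
        sum(d(a[i], b[s[i]]) ** p for i in range(n))
        for s in permutations(range(n))
    )
    return (best / n) ** (1 / p)


# Translating every atom by 1 costs exactly 1 for p = 1.
shifted = wasserstein_uniform([0, 1, 2], [1, 2, 3], p=1)
# Identical measures listed in a different order are at distance 0.
same = wasserstein_uniform([0, 1], [1, 0], p=1)
```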
GROMOV WASSERSTEIN DISTANCE In order to compare measures that are not necessarily in the same ambient space, [31,19] define an OT-like distance. By relaxing the classical Gromov-Hausdorff distance [19,35], which is intractable in practice, the authors build a distance over the space of all metric spaces. For two compact mm-spaces $\mathcal{X} = (X, d_X, \mu_X \in P(X))$ and $\mathcal{Y} = (Y, d_Y, \nu_Y \in P(Y))$, the Gromov-Wasserstein distance is defined as:
$$d_{GW,p}(\mu_X, \nu_Y) = \Big(\inf_{\pi \in \Pi(\mu_X, \nu_Y)} J_p(\pi)\Big)^{1/p},$$
where
$$J_p(\pi) = \int_{X \times Y \times X \times Y} L(x, y, x', y')^p \, d\pi(x, y)\, d\pi(x', y'), \qquad L(x, y, x', y') = |d_X(x, x') - d_Y(y, y')|.$$
Note that, with some abuse of notation, we denote the entire mm-space by its probability measure, and that the Gromov-Wasserstein distance depends on the choice of the metrics $d_X$ and $d_Y$. When it is not clear from the context we will denote by $J_p(d_X, d_Y, \pi)$ the Gromov-Wasserstein loss. The resulting coupling tends to associate pairs of points with similar distances within each pair (see Figure 6). The Gromov-Wasserstein distance defines a metric over the space of all metric spaces quotiented by measure preserving isometries (see Definitions 2.3 and 2.5), thus allowing the comparison of measures over different ground spaces. This distance has been used for shape comparison in [18] and is invariant to rotations and translations in either space.
SOME ADAPTATIONS OF W AND GW Despite the appealing properties of both Wasserstein and Gromov-Wasserstein distances, they fail at comparing structured objects as originally defined, since they focus only on the feature and structure marginals respectively. However, with some hypotheses, one could adapt these distances to structured objects.
If the structure spaces are part of a same ground space $(Z, d_Z)$, one can build a distance between couples $(x, a)$ and $(y, b)$ and apply the Wasserstein distance so as to compare the two structured objects. In this case, when the Wasserstein distance vanishes, the structured objects are equal in the sense of equality between the structures and the features respectively ($X = Y$ and $A = B$). This approach is closely related to the one discussed in [33], where the authors define the Transportation $L^p$ distance for signal analysis purposes. Their approach can be viewed as a transport between two joint measures defined, through the Lebesgue measure $\lambda$, on the points $(x, f(x))$ and $(y, g(y))$, where $f$ and $g$ are functions representative of the signal values. The distance for the transport is defined as $d((x, f(x)), (y, g(y))) = \frac{1}{\alpha}\|x - y\|_p^p + \|f(x) - g(y)\|_p^p$ for $\alpha > 0$ and $\|\cdot\|_p$ the $\ell_p$ norm. In this case $f(x)$ and $g(y)$ can be interpreted as encoding the feature information of the signal, while $x, y$ encode its structure information. In contrast to the FGW approach, the invariants of this approach are the feature and structure preserving maps from $X \times A$ to $Y \times B$, whereas the invariants for the FGW distance are the feature preserving maps that are isometries in the structure space (as seen further in Theorem 3.1).
The Gromov-Wasserstein distance can also be adapted to structured objects by considering, for example, the distances $d_X \oplus d$ and $d_Y \oplus d$ within each space $X \times A$ and $Y \times B$ respectively. When the resulting distance vanishes, the structured objects are isomorphic with respect to $d_X \oplus d$ and $d_Y \oplus d$, which is a weaker conclusion than the one obtained when FGW vanishes. Indeed, as seen in Theorem 3.1, FGW is null iff the structured objects are equivalent, and in such a case $(X \times A, d_X \oplus d, \mu)$ and $(Y \times B, d_Y \oplus d, \nu)$ are de facto isomorphic. However, the converse is not necessarily true. For example, in Fig. 4 the structures are isometric and the distances between the features within each space are the same for both structured objects, so $(X \times A, d_X \oplus d, \mu)$ and $(Y \times B, d_Y \oplus d, \nu)$ are isomorphic, yet not equivalent as shown in the example.

Fused Gromov-Wasserstein distance
Building on both Gromov-Wasserstein and Wasserstein distances, we define the Fused Gromov-Wasserstein (FGW) distance on $H(\Omega)$:
DEFINITION 2.10 Fused Gromov-Wasserstein distance. The Fused Gromov-Wasserstein distance is defined for $\alpha \in [0, 1]$ and $p, q \geq 1$ as:
$$d^{\Omega}_{FGW,\alpha,p,q}(\mu, \nu) = \Big(\inf_{\pi \in \Pi(\mu, \nu)} E_{p,q,\alpha}(\pi)\Big)^{1/p},$$
where
$$E_{p,q,\alpha}(\pi) = \int_{(X \times A \times Y \times B)^2} \big((1 - \alpha)\, d(a, b)^q + \alpha\, L(x, y, x', y')^q\big)^p \, d\pi((x, a), (y, b))\, d\pi((x', a'), (y', b')).$$
This definition is illustrated in Figure 7. $\alpha$ acts as a trade-off parameter between the structure term represented by $L(x, y, x', y')$ and the feature term $d(a, b)$. In this way, the convex combination of both terms uses both kinds of information in a single formalism, resulting in a single map $\pi$ that behaves as the optimal map with respect to the structure and the feature costs in order to "move" the mass from one joint probability measure to the other.
In contrast to the work presented in [34], where the trade-off is defined via $d(a, b)^q + \alpha L(x, y, x', y')^q$ for $\alpha \in [0, \infty[$, we rather consider a convex combination $(1 - \alpha) d(a, b)^q + \alpha L(x, y, x', y')^q$ for $\alpha \in [0, 1]$. Both formulations are strictly equivalent, since any optimal plan w.r.t. the cost with the convex combination leads to an optimal plan w.r.t. the other cost and conversely (see 5.7 for more details). However, the definition with the convex combination of a feature and a structure cost carries more theoretical properties, such as the interpolation (Theorem 3.5).
Many desirable properties arise from this definition. Among them, one can define a topology over the space of structured objects using the FGW distance to compare structured objects, in the same philosophy as for the Wasserstein and Gromov-Wasserstein distances. The definition also implies that FGW acts as a generalization of both Wasserstein and Gromov-Wasserstein distances, with FGW achieving an interpolation of these two distances. More remarkably, the FGW distance also enjoys geodesic properties over the space of structured objects, allowing the definition of gradient flows. All these properties are detailed in the next section. Before reviewing them, we first compare FGW with the GW and W distances and state the following proposition (assuming for now that FGW exists, which will be shown later).
PROPOSITION 2.11 Comparison between FGW, GW and W.
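For discrete structured objects and a fixed coupling, the FGW objective reduces to a finite sum. The sketch below evaluates it in plain Python, assuming the objective has the form $\sum_{i,j,k,l} \big((1-\alpha)\, d(a_i, b_j)^q + \alpha\, |C_X(i,k) - C_Y(j,l)|^q\big)^p \pi_{i,j} \pi_{k,l}$; the function name `fgw_cost`, the argument layout and the two-point example are hypothetical. It only evaluates the cost of a given coupling and does not perform the minimization over $\Pi(\mu, \nu)$.

```python
def fgw_cost(pi, feat_cost, CX, CY, alpha, p=1, q=1):
    """Evaluate the discrete FGW objective for a fixed coupling pi.

    feat_cost[i][j] is d(a_i, b_j); CX and CY are the structure matrices.
    """
    n, m = len(CX), len(CY)
    total = 0.0
    for i in range(n):
        for j in range(m):
            if pi[i][j] == 0:
                continue  # zero-mass pairs contribute nothing
            for k in range(n):
                for l in range(m):
                    if pi[k][l] == 0:
                        continue
                    structure = abs(CX[i][k] - CY[j][l])
                    cost = (1 - alpha) * feat_cost[i][j] ** q + alpha * structure ** q
                    total += cost ** p * pi[i][j] * pi[k][l]
    return total


# Two 2-point objects: same features (0 and 1), structure distances 1 and 2.
CX = [[0, 1], [1, 0]]
CY = [[0, 2], [2, 0]]
M = [[0, 1], [1, 0]]                  # |a_i - b_j| for a = b = (0, 1)
diagonal = [[0.5, 0.0], [0.0, 0.5]]   # matches equal features
swapped = [[0.0, 0.5], [0.5, 0.0]]    # matches different features
```

On this toy example the feature-matching coupling costs $\alpha / 2$ while the swapped coupling costs $1 - \alpha / 2$, which illustrates how $\alpha$ weighs the structure mismatch against the feature mismatch.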
• The following inequalities hold:
$$d^{\Omega}_{FGW,\alpha,p,q}(\mu, \nu) \geq (1 - \alpha)\, d^{\Omega}_{W,pq}(\mu_A, \nu_B)^q,$$
$$d^{\Omega}_{FGW,\alpha,p,q}(\mu, \nu) \geq \alpha\, d_{GW,pq}(\mu_X, \nu_Y)^q.$$
• Let us suppose that the structure spaces $(X, d_X)$, $(Y, d_Y)$ are part of a single ground space $(Z, d_Z)$ (i.e. $d_X = d_Y = d_Z$). We consider the Wasserstein distance between $\mu$ and $\nu$ (well defined in this case) for the distance on $Z \times \Omega$: $\tilde{d}((x, a), (y, b)) = (1 - \alpha)\, d(a, b) + \alpha\, d_Z(x, y)$. Then:
$$d^{\Omega}_{FGW,\alpha,p,1}(\mu, \nu) \leq 2\, d^{Z \times \Omega}_{W,p}(\mu, \nu).$$
In particular, following this proposition, when the FGW distance is null then both the GW and W distances vanish, so that the structure and the features of the structured objects are individually "the same" (with respect to their corresponding equivalence relation). However, the converse is not necessarily true, as shown in the following example.
FIG. 8. Difference in transportation maps between the FGW, GW and W distances on synthetic trees. On the left, the W distance between the features is null since the feature information is the same; in the middle, FGW is different from zero and discriminates the two structured objects; on the right, the GW distance between the two isometric structures is null.
TOY TREES We construct two trees as illustrated in Figure 8, where the 1D node features are shown with colors. The shortest path between the nodes is used to capture the structures of the two structured objects and the Euclidean distance is used for the features. Figure 8 illustrates the differences between the FGW, GW and W distances. The left part is the Wasserstein distance between the features: red nodes are transported onto red ones and blue nodes onto blue ones, but the tree structures are completely discarded. In this case, the Wasserstein distance vanishes. In the right part, we compute the Gromov-Wasserstein distance between the structures: all couples of points are transported to another couple of points, which enforces the matching of tree structures without taking the features into account. Since the structures are isometric, the Gromov-Wasserstein distance is null. Finally, we compute FGW using an intermediate $\alpha$ (center): the bottom and first level structure is preserved as well as the feature matching (red on red and blue on blue), and FGW discriminates the two structured objects.

Mathematical properties of FGW
In this section, we establish some mathematical properties of the FGW distance. The first result relates to the existence of the FGW distance and to the topology of the space of structured objects. We then prove that the FGW distance is indeed a distance with respect to the equivalence relation between structured objects defined in Definition 2.8, allowing us to derive a topology on $H(\Omega)$.
Proof of this theorem can be found in Section 5.
The previous theorem states that FGW is a distance over $H(\Omega)$ quotiented by the measure preserving maps that are feature and structure preserving (through an isometry). Informally, invariants of FGW are objects that have both the same structure and the same features "in the same place". In other words, the FGW distance vanishes iff the structured objects are equivalent with respect to the equivalence relation $\sim$ defined in Definition 2.8. Theorem 3.1 allows a wide set of applications for FGW such as k-nearest-neighbors, distance-substitution kernels, pseudo-Euclidean embeddings, or representative-set methods [14,9,4]. Arguably, such a distance allows for a better interpretation than end-to-end learning machines such as neural networks, because the matrix $\pi$ exhibits the relationships between the elements of the objects.
The metric property naturally endows the structured object space with a notion of convergence: a sequence $(S_n)_{n \in \mathbb{N}}$ of structured objects converges to a structured object $S$ in the FGW sense if the FGW distance between $S_n$ and $S$ tends to $0$ as $n \to \infty$. Using Prop. 2.11, it is straightforward to see that, if $(S_n)_{n \in \mathbb{N}}$ converges to $S$ in the FGW sense, both the features and the structure of $(S_n)_{n \in \mathbb{N}}$ converge, respectively in the Wasserstein and the Gromov-Wasserstein sense (see [19] for the definition of convergence in the Gromov-Wasserstein sense).
An interesting question arises from this definition. Let us consider a structured object $S = (X \times A, d_X, \mu)$ and let us sample the joint distribution so as to consider the empirical measure $\mu_n = \frac{1}{n} \sum_{i=1}^n \delta_{(x_i, a_i)}$ built from $n$ i.i.d. samples $(x_i, a_i)$ of $\mu$. Does the resulting structured object converge to $S$ in the FGW sense, and how fast is the convergence?
To answer this question, we will use the theory developed in [16]. We recall the following definitions. Let $S$ be a subset of some Polish metric space $\Omega$. The $\varepsilon$-covering number of $S$, denoted $N_\varepsilon(S)$, is the minimum integer $m$ such that there exist closed balls $B_1, \dots, B_m$ of diameter $\varepsilon$ which cover $S$, i.e. $S \subset \cup_{i=1}^m B_i$. The $\varepsilon$-dimension of $S$ is defined by:
$$\dim_\varepsilon(S) = \frac{\log N_\varepsilon(S)}{-\log \varepsilon}.$$
Given a measure $\mu$ on $\Omega$, we consider its $(\varepsilon, \tau)$-covering number
$$N_\varepsilon(\mu, \tau) = \inf\{N_\varepsilon(S) : \mu(S) \geq 1 - \tau\},$$
which represents the smallest $\varepsilon$-covering number of sufficiently "large" subsets (with respect to $\mu$). The $(\varepsilon, \tau)$-dimension of $\mu$ is then defined as:
$$\dim_\varepsilon(\mu, \tau) = \frac{\log N_\varepsilon(\mu, \tau)}{-\log \varepsilon}.$$
The upper Wasserstein dimension is defined by:
$$d^*_p(\mu) = \inf\Big\{s \in (2p, \infty) : \limsup_{\varepsilon \to 0} \dim_\varepsilon\big(\mu, \varepsilon^{\frac{sp}{s - 2p}}\big) \leq s\Big\}.$$
This notion of dimension exists due to the monotonicity of $\dim_\varepsilon(\mu, \tau)$ and coincides with the intuitive notion of "dimension" when the measure is sufficiently well behaved. For example, for any measure $\mu$ that is absolutely continuous with respect to the Lebesgue measure on $[0, 1]^d$, we have $d^*_p(\mu) = d$ for any $p \in [1, d/2]$. For more general cases see Prop. 7 in [16]. Using these definitions and the results in [16], we answer the question of convergence of finite samples in the following proposition (the proof can be found in Section 5):
PROPOSITION 3.4 Convergence of finite samples and a concentration inequality. Let $p \geq 1$. We have:
$$\lim_{n \to \infty} d^{\Omega}_{FGW,\alpha,p,1}(\mu_n, \mu) = 0.$$
Moreover, suppose that $s > d^*_p(\mu)$. Then there exists a constant $C$ that does not depend on $n$ such that:
$$\mathbb{E}\big[d^{\Omega}_{FGW,\alpha,p,1}(\mu_n, \mu)\big] \leq C\, n^{-1/s}.$$
A particular case of this inequality is when $\alpha = 1$, so that we can use the result above to derive a concentration result for the Gromov-Wasserstein distance. More precisely, if $\nu_n = \frac{1}{n} \sum_i \delta_{x_i}$ denotes the empirical measure of $\nu \in P(X)$ and if $s > d^*_p(\nu)$, we have:
$$\mathbb{E}\big[d_{GW,p}(\nu_n, \nu)\big] \leq C'\, n^{-1/s}.$$
To the best of our knowledge, this is the first result about concentration for the Gromov-Wasserstein distance. In contrast to the Wasserstein distance case, it is not necessarily sharp, but it proves that considering the GW and FGW distances by sampling a continuous distribution makes sense, as the finite samples concentrate around the expectation.

Interpolation properties between Wasserstein and Gromov-Wasserstein distances
In this section, we prove that the FGW distance is a generalization of both Wasserstein and Gromov-Wasserstein distances, in the sense that it achieves an interpolation between them. More precisely, we have the following theorem:
THEOREM 3.5 Interpolation properties. As $\alpha$ tends to zero, one recovers the Wasserstein distance between the feature information, and as $\alpha$ goes to one, one recovers the Gromov-Wasserstein distance between the structure information:
$$\lim_{\alpha \to 0} d^{\Omega}_{FGW,\alpha,p,q}(\mu, \nu) = d^{\Omega}_{W,pq}(\mu_A, \nu_B)^q, \qquad \lim_{\alpha \to 1} d^{\Omega}_{FGW,\alpha,p,q}(\mu, \nu) = d_{GW,pq}(\mu_X, \nu_Y)^q.$$
The proof of this theorem can be found in Section 5.
This result shows that FGW can revert to one of the other distances. In machine learning, this allows for a validation of the $\alpha$ parameter to better fit the data properties (i.e. by tuning the relative importance of the feature vs. structure information). One can also see the choice of $\alpha$ as a representation learning problem, and its value can be found by optimizing a given criterion.
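As a sanity check of the two endpoint cases (not a substitute for the proof in Section 5, which handles the limits themselves), one can evaluate the objective at $\alpha = 0$ and $\alpha = 1$. The computation below assumes that the FGW objective integrates $\big((1-\alpha)\, d(a,b)^q + \alpha\, L(x,y,x',y')^q\big)^p$ against $\pi \otimes \pi$, and that every coupling of the marginals $(\mu_A, \nu_B)$ or $(\mu_X, \nu_Y)$ can be lifted to a coupling of $(\mu, \nu)$, which follows from a standard gluing argument.

```latex
\begin{align*}
E_{p,q,0}(\pi)
  &= \iint d(a,b)^{pq}\, d\pi((x,a),(y,b))\, d\pi((x',a'),(y',b'))
   = \int d(a,b)^{pq}\, d\pi((x,a),(y,b)), \\
\Big(\inf_{\pi \in \Pi(\mu,\nu)} E_{p,q,0}(\pi)\Big)^{1/p}
  &= \Big(d^{\Omega}_{W,pq}(\mu_A,\nu_B)^{pq}\Big)^{1/p}
   = d^{\Omega}_{W,pq}(\mu_A,\nu_B)^{q}, \\
E_{p,q,1}(\pi)
  &= \iint L(x,y,x',y')^{pq}\, d\pi((x,a),(y,b))\, d\pi((x',a'),(y',b')), \\
\Big(\inf_{\pi \in \Pi(\mu,\nu)} E_{p,q,1}(\pi)\Big)^{1/p}
  &= \Big(d_{GW,pq}(\mu_X,\nu_Y)^{pq}\Big)^{1/p}
   = d_{GW,pq}(\mu_X,\nu_Y)^{q}.
\end{align*}
```

The first line uses the fact that the integrand does not depend on the primed variables and that $\pi$ is a probability measure. The exponent $q$ on the right-hand sides comes from taking the power $1/p$ of a quantity of order $pq$.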

Geodesic properties
In this section we present some geodesic properties of the FGW distance. These properties are useful in order to define a dynamic formulation of OT problems. This dynamic point of view is inspired by fluid dynamics and found its origin in the Wasserstein context with [7]. Various applications in machine learning can be derived from this formulation: interpolation along geodesic paths was used in computer graphics for color or illumination interpolations [8]; more recently, [10] used Wasserstein gradient flows in an optimization context, deriving global minima results for non-convex particle gradient descent and paving the way for new methods for training neural networks; [38] used Wasserstein gradient flows in the context of reinforcement learning for policy optimization.
The main idea of this dynamic formulation is to describe the optimal transport problem between two measures as a curve in the space of measures minimizing its total length. We first recall some general facts about geodesic spaces and classical results for the dynamic formulation in both the Wasserstein and Gromov-Wasserstein contexts. In a second part, we derive new geodesic properties in the FGW context.
GENERALITY ABOUT GEODESIC SPACES Let $(X, d)$ be a metric space and $x, y$ two points in $X$. We say that a curve $w : [0, 1] \to X$ joining the endpoints $x$ and $y$ (i.e. with $w(0) = x$ and $w(1) = y$) is a constant speed geodesic if it satisfies $d(w(t), w(s)) \leq |t - s|\, d(w(0), w(1)) = |t - s|\, d(x, y)$ for $t, s \in [0, 1]$. Moreover, if $(X, d)$ is a length space (i.e. if the distance between two points of $X$ is equal to the infimum of the lengths of the curves connecting these two points), then the converse is also true and a constant speed geodesic satisfies $d(w(t), w(s)) = |t - s|\, d(x, y)$. It is easy to compute distances along such a curve, as they are directly embedded into $\mathbb{R}$.
In the Wasserstein context, if the ground space is a complete, separable, locally compact length space and if the endpoints of the geodesic are given, then there exists a geodesic curve. Moreover, if the optimal transport between the endpoints is unique, then there is a unique displacement interpolation between the endpoints (see Corollaries 7.22 and 7.23 in [35]). For example, if the ground space is $\mathbb{R}^d$ and the distance between the points is measured via the $\ell_2$ norm, then geodesics exist and are uniquely determined (this can be generalized to strictly convex costs).
In the Gromov-Wasserstein context, constant speed geodesics always exist as long as the endpoints are given, and these geodesics are unique modulo the isomorphism equivalence relation (see [31]).
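The constant speed property can be checked numerically on a grid of times. The sketch below does so for the straight segment between two points of $\mathbb{R}^2$ with the Euclidean distance, where the equality $d(w(t), w(s)) = |t - s|\, d(x, y)$ is expected to hold; the helper names `segment` and `is_constant_speed`, the grid and the tolerance are assumptions of this example.

```python
from math import dist, isclose


def segment(x, y):
    """Return the curve w(t) = (1 - t) x + t y joining x to y."""
    def w(t):
        return tuple((1 - t) * xi + t * yi for xi, yi in zip(x, y))
    return w


def is_constant_speed(w, d, grid, tol=1e-9):
    """Check d(w(t), w(s)) == |t - s| * d(w(0), w(1)) for all t, s on the grid."""
    total = d(w(0.0), w(1.0))
    return all(
        isclose(d(w(t), w(s)), abs(t - s) * total, abs_tol=tol)
        for t in grid
        for s in grid
    )


grid = [i / 10 for i in range(11)]
straight = segment((0.0, 0.0), (3.0, 4.0))


def accelerating(t):
    """Same image as the segment, but traversed at a non-constant speed."""
    return (3.0 * t * t, 4.0 * t * t)
```

The second curve joins the same endpoints along the same line but fails the test, which shows that the property constrains the parametrization and not only the path.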
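THE FGW CASE In this paragraph we suppose that $\Omega = \mathbb{R}^d$.
We are interested in finding a geodesic curve in the space $\big(H(\mathbb{R}^d), d^{\mathbb{R}^d}_{FGW,\alpha,p,q}\big)$, i.e. a constant speed curve of structured objects joining two structured objects. As for Wasserstein and Gromov-Wasserstein, the structured object space endowed with the Fused Gromov-Wasserstein distance maintains some geodesic properties. The following result proves the existence of such a geodesic and characterizes it:
THEOREM 3.6 Let $S_0 = (X_0 \times A_0, d_{X_0}, \mu_0)$ and $S_1 = (X_1 \times A_1, d_{X_1}, \mu_1)$ be two structured objects in $H(\mathbb{R}^d)$. Let $\pi^*$ be the optimal coupling for the Fused Gromov-Wasserstein distance between those two objects and $t \in [0, 1]$. We equip $\mathbb{R}^d$ with any $\ell_m$ norm, for $m \geq 1$. We define $\eta_t : X_0 \times A_0 \times X_1 \times A_1 \to X_0 \times X_1 \times \mathbb{R}^d$ by $\eta_t(x_0, a_0, x_1, a_1) = (x_0, x_1, (1 - t) a_0 + t a_1)$, and $\hat{A}_t$ as the set of interpolated features $(1 - t) a_0 + t a_1$ for $a_0 \in A_0$, $a_1 \in A_1$. Then
$$\big(X_0 \times X_1 \times \hat{A}_t,\ (1 - t)\, d_{X_0} \oplus t\, d_{X_1},\ \mu_t = \eta_t\#\pi^*\big)_{t \in [0, 1]} \qquad (3.11)$$
is a constant speed geodesic connecting $S_0$ and $S_1$.
From the existence of a geodesic in the structured object space, one can wonder if this geodesic is unique, so as to properly define the velocity field associated to the geodesic curve. Informally, if one tries to define the speed of a particle passing a point $p$ (here a structured object) at a time $t$, then the uniqueness of this particle passing through $p$ at $t$ seems mandatory. The following result proves that it is indeed the case, modulo the equivalence relation of structured objects $\sim$, in the case where the distance is $d^{\mathbb{R}^d}_{FGW,\alpha,1,q}$. Let $p = 1$ and $q \geq 2$. We equip $\mathbb{R}^d$ with the $\ell_q$ norm. Then each geodesic $(S_t)_{t \in [0, 1]}$ in $H(\mathbb{R}^d)$ is of the same form as stated in Eq. (3.11).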
More precisely, for each geodesic $(S_t)_{t\in[0,1]}$ in $\mathcal{H}(\mathbb{R}^d)$ there exists an optimal coupling $\pi^* \in \mathcal{P}(X_0 \times A_0 \times X_1 \times A_1)$ of the measures $\mu_0$ and $\mu_1$ of the endpoints, for the $d^{\mathbb{R}^d}_{FGW,\alpha,1,q}$ distance, such that for each $t \in [0, 1]$ a representative of the equivalence class of $S_t$ for ∼ is given by the form of Eq. (3.11), with $\eta_t$ and $\hat{A}_t$ defined in Theorem 3.6. Proofs of the previous theorems can be found in Section 5.
In a sense, this result combines the geodesics in the Wasserstein space and in the space of all metric spaces, since it suffices to interpolate the distances in the structure space and the features to construct the geodesic. The main interest is that it defines the minimum path between two structured objects. For example, considering two discrete structured objects represented by the measures $\mu_0 = \sum_{i=1}^n h_i \delta_{(x_i, a_i)}$ and $\mu_1 = \sum_{j=1}^m g_j \delta_{(y_j, b_j)}$, the interpolation path is given for $t \in [0, 1]$ by the measure $\mu_t = \sum_{i=1}^n \sum_{j=1}^m \pi^*(i, j)\, \delta_{(x_i, y_j, (1-t)a_i + t b_j)}$, where $\pi^*$ is the optimal coupling for the FGW distance. However, this geodesic is difficult to handle in practice since it requires the computation of the Cartesian product $X_0 \times X_1$. To overcome this obstacle, an extension using the Fréchet mean is defined in Section 4.3. The proper definition and properties of velocity fields associated with this geodesic are postponed to future work.
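The discrete interpolation path described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `interpolate_measure` and the toy inputs are hypothetical, the coupling is assumed to be already optimal, and only the feature part of each atom is interpolated (the index pair (i, j) stands for the structure point (x_i, y_j)).

```python
def interpolate_measure(coupling, feats0, feats1, t):
    """Atoms of mu_t = sum_ij pi(i, j) * delta_(x_i, y_j, (1 - t) a_i + t b_j).

    Each atom is returned as ((i, j), mass, feature), where the index pair
    (i, j) stands for the structure point (x_i, y_j) of X_0 x X_1.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    atoms = []
    for i, row in enumerate(coupling):
        for j, mass in enumerate(row):
            if mass > 0.0:  # only coupled pairs carry mass
                feat = [(1.0 - t) * a + t * b
                        for a, b in zip(feats0[i], feats1[j])]
                atoms.append(((i, j), mass, feat))
    return atoms


# Toy inputs (hypothetical): a coupling assumed optimal, 1D features.
coupling = [[0.5, 0.0], [0.0, 0.5]]
feats0 = [[0.0], [2.0]]
feats1 = [[1.0], [4.0]]
midpoint = interpolate_measure(coupling, feats0, feats1, 0.5)
```

In the worst case the support of the interpolated measure has n × m atoms, which is the practical difficulty with the Cartesian product mentioned above.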

Examples and applications for the discrete case
In this section, we illustrate the behavior of FGW on simple cases where structured objects are involved.

FGW in the discrete case
In the following section, $(\Omega, d)$ is the feature space and $\mathcal{X}_n$ is the set of all discrete metric spaces of size $n \geq 1$. Picking a structured object of size n in the discrete case amounts to choosing a metric space $(X, C)$ and a set of n elements $(x_i, a_i)$, where $(a_i)_i$ denotes the feature information and $(x_i)_i$ the structure information. The matrix entry $C(i, j)$ compares the structure points $x_i$ and $x_j$. From this set we derive a fully supported probability measure µ by choosing a histogram $h \in \Sigma_n$ and setting $\mu = \sum_{i=1}^n h_i \delta_{(x_i, a_i)}$. More precisely, $\mathcal{H}(\Omega) = \bigcup_{n \in \mathbb{N}} \mathcal{H}_n(\Omega)$, where
$$\mathcal{H}_n(\Omega) = \Big\{ \mu = \sum_{i=1}^n h_i \delta_{(x_i, a_i)} \;:\; (X, C) \in \mathcal{X}_n,\ (a_i)_i \in \Omega^n,\ h \in \Sigma_n \Big\}.$$
This set includes all graphs with any number of vertices (each from a given metric space), where each vertex $x_i$ is associated with a feature $a_i$ in Ω and a weight $h_i$, with h on the simplex.
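In the next paragraphs, $\mu \in \mathcal{H}_n(\Omega)$ and $\nu \in \mathcal{H}_m(\Omega)$ are structured data as described above, with histograms h and g respectively. We suppose that $C_1$ and $C_2$ are the distance matrices inherent to the structure information of µ and ν respectively, and that $a_i$, $b_j$ are the features. Let $p, q \geq 1$. Using the previous notations, the Fused Gromov-Wasserstein distance is defined as:
$$d_{FGW,\alpha,p,q}(\mu, \nu) = \Big( \min_{\pi \in \Pi(h, g)} E_{p,q,\alpha}(\pi) \Big)^{1/p}$$
where:
$$E_{p,q,\alpha}(\pi) = \sum_{i,j,k,l} \Big( (1-\alpha)\, d(a_i, b_j)^q + \alpha\, |C_1(i,k) - C_2(j,l)|^q \Big)^p \pi_{i,j}\, \pi_{k,l}$$
and $\Pi(h, g)$ is the set of couplings with marginals h and g.
Algorithms for solving the numerical optimization above are given in [34]. They rely on the conditional gradient method but converge only to a local minimum because the optimization problem is non-convex. We used these algorithms for all the applications below.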
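To make the discrete objective concrete, the sketch below evaluates it for a given coupling. It assumes the objective has the form $\sum_{i,j,k,l} ((1-\alpha) d(a_i,b_j)^q + \alpha |C_1(i,k) - C_2(j,l)|^q)^p \pi_{i,j}\pi_{k,l}$, raised to the power 1/p, consistent with the integrand used in the proofs of Section 5. The function name `fgw_cost` is hypothetical and the features are scalars, so that d(a, b) = |a − b|; this is an illustration, not the solver of [34].

```python
def fgw_cost(pi, feats_a, feats_b, C1, C2, alpha, p=1, q=2):
    """FGW objective of a given coupling pi, raised to the power 1/p.

    Features are scalars here, so d(a, b) = |a - b| (an assumption made
    to keep the sketch short). Runs in O(n^2 m^2) time.
    """
    n, m = len(feats_a), len(feats_b)
    total = 0.0
    for i in range(n):
        for j in range(m):
            if pi[i][j] == 0.0:
                continue  # pairs without mass do not contribute
            feat = (1.0 - alpha) * abs(feats_a[i] - feats_b[j]) ** q
            for k in range(n):
                for l in range(m):
                    if pi[k][l] == 0.0:
                        continue
                    struct = alpha * abs(C1[i][k] - C2[j][l]) ** q
                    total += (feat + struct) ** p * pi[i][j] * pi[k][l]
    return total ** (1.0 / p)


# Two identical two-node objects (toy data).
feats = [0.0, 1.0]
C = [[0.0, 1.0], [1.0, 0.0]]
aligned = [[0.5, 0.0], [0.0, 0.5]]   # matches node i to node i
swapped = [[0.0, 0.5], [0.5, 0.0]]   # isometric, but mixes up the features
cost_aligned = fgw_cost(aligned, feats, feats, C, C, alpha=0.5)
cost_swapped = fgw_cost(swapped, feats, feats, C, C, alpha=0.5)
```

For any admissible coupling this value is an upper bound on the FGW distance; the distance itself is the minimum over all couplings.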

Illustrations of FGW
In this section, we present several applications of FGW as a distance between structured objects and provide an interpretation of the OT matrix.
EXAMPLE WITH 1D FEATURES AND STRUCTURE SPACES Figure 9 illustrates the differences between the Wasserstein, Gromov-Wasserstein and Fused Gromov-Wasserstein couplings $\pi^*$. In this example both the feature and structure spaces are 1-dimensional (Figure 9 left). The features form two clusters among the elements of both objects, as illustrated in the feature cost matrix $M_{AB}$, while the structure is a noisy temporal sequence along the indices, as illustrated in the matrices $C_1$ and $C_2$ (Figure 9 center). Wasserstein respects the clustering but forgets the temporal structure, whereas Gromov-Wasserstein respects the structure but does not take the clustering into account. Only FGW retrieves a transport matrix respecting both features and structure.
EXAMPLE ON TWO SIMPLE IMAGES We extract one 28 × 28 image from the MNIST dataset and generate a second one by simply re-centering the digit in the frame. Features represent the gray level of each pixel, the structure is defined as the city-block distance on the pixel coordinate grid, and we use equal weights for all the pixels in the image. Figure 10 shows the different couplings obtained when considering the features only, the structure only, or both. FGW aligns the pixels of the digits, recovering the correct order of the pixels, while both the Wasserstein and Gromov-Wasserstein distances fail to provide a meaningful transportation map. Note that in the Wasserstein and Gromov-Wasserstein cases the distances are equal to 0, whereas FGW detects that the two images are different. Also note that, in the FGW sense, the original digit and its mirrored version are equivalent, since there exists an isometry between their structure spaces, making FGW invariant to rotations or flips in the structure space in this case.
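The same phenomenon can be reproduced on a tiny example. The sketch below is not the MNIST experiment: it uses a hypothetical 5-pixel "image" and a shifted copy, and it searches only over permutation couplings with uniform weights by brute force (a simplifying assumption; in general this gives an upper bound on each distance, although a value of 0 is exact since 0 is the smallest possible cost).

```python
from itertools import permutations


def perm_cost(sigma, a, b, C1, C2, alpha, q=1):
    """FGW objective (p = 1) of the coupling sending i to sigma[i],
    with uniform weights 1/n on both objects."""
    n = len(a)
    total = 0.0
    for i in range(n):
        for k in range(n):
            feat = abs(a[i] - b[sigma[i]]) ** q
            struct = abs(C1[i][k] - C2[sigma[i]][sigma[k]]) ** q
            total += (1.0 - alpha) * feat + alpha * struct
    return total / (n * n)


def best_over_permutations(a, b, C1, C2, alpha):
    """Smallest cost among all permutation couplings (brute force)."""
    return min(perm_cost(s, a, b, C1, C2, alpha)
               for s in permutations(range(len(a))))


# A 5-pixel "image" with one bright pixel, and a shifted copy of it.
a = [0, 1, 0, 0, 0]
b = [0, 0, 1, 0, 0]
C = [[abs(i - j) for j in range(5)] for i in range(5)]  # city-block on a line

w = best_over_permutations(a, b, C, C, alpha=0.0)    # features only
gw = best_over_permutations(a, b, C, C, alpha=1.0)   # structure only
fgw = best_over_permutations(a, b, C, C, alpha=0.5)  # features and structure
```

Here `w` and `gw` vanish because the gray levels can be matched exactly and the two pixel grids are isometric, while `fgw` stays strictly positive: neither isometry of the line (identity or flip) maps the bright pixel onto the bright pixel.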
TIME SERIES EXAMPLE One of the main assets of FGW is that it can be used on a wide class of objects, and time series are one more example of this. We consider here 25 one-dimensional time series composed of two humps in [0, 1] with heights drawn uniformly at random between 0 and 1. Signals are distributed in two classes that are translated from each other by a fixed gap. The FGW distance is computed by taking d as the Euclidean distance between the features of the signals (here the value of the signal at each point) and $d_X$ and $d_Y$ as the Euclidean distance between timestamps. A 2D embedding is computed with multidimensional scaling (MDS) from an FGW distance matrix between a number of examples in this dataset (Figure 11, top). One can clearly see that the representation with a reasonable α value, in the center, is the most discriminant one. This can be better understood by looking at the OT matrices between the classes. Figure 11 (bottom) illustrates the behavior of FGW on one pair of examples when going from Wasserstein to Gromov-Wasserstein. The black lines depict the assignment provided by the transport matrix. While Wasserstein (left) assigns samples completely independently of their temporal position, Gromov-Wasserstein (right) tends to align the samples perfectly (note that it could have exactly reversed the alignment with the same loss) but discards the values of the signal. Only the true FGW (center) finds a transport matrix that both respects the time sequences and aligns similar values in the signals.
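As an illustration of this setup, the inputs of an FGW computation between two time series can be assembled as follows. The function name `time_series_costs` is hypothetical; the sketch only builds the feature cost matrix and the two structure matrices from signal values and timestamps, and leaves the optimization to a solver such as those of [34].

```python
def time_series_costs(values_a, times_a, values_b, times_b):
    """Cost matrices for comparing two 1D time series.

    M[i][j]  : distance between the signal values (features).
    C1[i][k] : distance between timestamps of the first series (structure).
    C2[j][l] : distance between timestamps of the second series (structure).
    In one dimension the Euclidean distance is the absolute difference.
    """
    M = [[abs(u - v) for v in values_b] for u in values_a]
    C1 = [[abs(s - t) for t in times_a] for s in times_a]
    C2 = [[abs(s - t) for t in times_b] for s in times_b]
    return M, C1, C2


# Toy signals (hypothetical): the series may have different lengths.
M, C1, C2 = time_series_costs([0.0, 1.0, 0.0], [0.0, 0.5, 1.0],
                              [0.0, 1.0], [0.25, 0.75])
```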

Structured Optimal Transport Barycenter
An interesting use of the FGW distance is to define a barycenter of a set of structured data as a Fréchet mean. In that context, one seeks the structured object that minimizes the sum of the (weighted) FGW distances to a given set of objects. OT barycenters have many desirable properties and applications [1,27], yet no formulation can leverage both structural and feature information in the barycenter computation. Here we propose to use the FGW distance to compute the barycenter of structured objects. We suppose that the feature space is $\Omega = (\mathbb{R}^d, \ell_2)$ and $p = 1$. For simplicity, we assume that the base histograms and the histogram h associated with the barycenter are known and fixed.
In this context, for a fixed $N \in \mathbb{N}$ and weights $(\lambda_k)_k$ such that $\sum_k \lambda_k = 1$, we aim to find the structure matrix C and the features A of a barycenter with N nodes, together with the couplings $\pi_k$ to each input object, that minimize the $\lambda_k$-weighted sum of FGW costs. Note that this problem is convex w.r.t. C and A but not w.r.t. $\pi_k$. An algorithm to solve this problem is presented in [34]. Intuitively, looking for a barycenter means finding feature values supported on a fixed-size support, and the structure that relates them. Interestingly enough, there are several variants of this problem, where the features or the structure can be fixed for the barycenter. The approach extends straightforwardly to these simpler optimization problems.
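The convexity in A can be illustrated with one sub-step of a block-coordinate scheme. The sketch below rests on assumptions that are not stated above: q = 2 with a squared Euclidean feature cost, and couplings held fixed. Under these assumptions and with p = 1, the feature part of the objective is $\sum_k \lambda_k \sum_{i,j} \pi_k(i,j)\, \|a_i - b^k_j\|^2$ up to the factor $(1-\alpha)$, and setting its gradient to zero gives $a_i = \frac{1}{h_i} \sum_k \lambda_k \sum_j \pi_k(i,j)\, b^k_j$. The function name is hypothetical, and this is not necessarily the update used in [34].

```python
def update_barycenter_features(couplings, feats_list, lambdas, h):
    """Closed-form feature update for fixed couplings (squared Euclidean cost).

    couplings[k][i][j]: mass between barycenter node i and node j of object k
                        (row i of each coupling sums to h[i]).
    feats_list[k][j]:   feature vector of node j of object k.
    lambdas[k]:         barycenter weight of object k (weights sum to 1).
    """
    n = len(h)
    dim = len(feats_list[0][0])
    new_feats = [[0.0] * dim for _ in range(n)]
    for lam, pi, feats in zip(lambdas, couplings, feats_list):
        for i in range(n):
            for j, mass in enumerate(pi[i]):
                for c in range(dim):
                    new_feats[i][c] += lam * mass * feats[j][c] / h[i]
    return new_feats


# Two toy objects with two nodes each, matched node to node.
h = [0.5, 0.5]
pi = [[0.5, 0.0], [0.0, 0.5]]
A = update_barycenter_features([pi, pi],
                               [[[1.0], [3.0]], [[3.0], [5.0]]],
                               [0.5, 0.5], h)
```

With equal weights and node-to-node couplings, each barycenter feature is simply the average of the matched features.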

GRAPH BARYCENTER AND COMPRESSION In this experiment, we use FGW to compute barycenters and approximations of toy graphs.
In the first example, we generate graphs following either a circle or an 8 symbol, with 1D features following a sine and a linear variation respectively. For each example, the number of nodes is drawn randomly between 10 and 25, Gaussian noise is added to the features, and a small noise is applied to the structure (some connections with the third neighbors are randomly added). An example graph with no noise is provided for each class in the first column of Figure 12. One can see that the circle class has a feature varying smoothly (sine) along the graph, whereas the 8 has a sharp feature change at its center (so that low-pass filtering would lose some information). Some examples of the generated graphs are provided in the 2nd to 7th columns of Figure 12. We compute the FGW barycenter of 10 samples, using the shortest-path distance between the nodes as the structural information and the $\ell_2$ distance for the features. We recover an adjacency matrix by thresholding the similarity matrix C given by the barycenter. The threshold is tuned so as to minimize the Frobenius norm between the original matrix C and the shortest-path matrix constructed after thresholding C. The resulting barycenters are shown in Figure 12 for n = 15 and n = 7 nodes. First, one can see that the barycenters are denoised both in the feature space and in the structure space. Also note that the sharp change at the center of the 8 class is preserved in the barycenters, which is a nice result compared to other divergences that tend to smooth out their barycenters ($\ell_2$ for instance). Moreover, by selecting the number of nodes in the barycenter, one can compress the graph or estimate a "high resolution" representation from all the samples. To the best of our knowledge, no other method can compute such graph barycenters. Finally, note that FGW is interpretable because the resulting OT matrix provides a correspondence between the nodes of the samples and those of the barycenter.
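The thresholding step described above can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: the function names are hypothetical, C is assumed to be symmetric, an edge is created whenever an off-diagonal entry of C is at or below the threshold, and candidate thresholds are taken among the entries of C.

```python
import math


def shortest_path_matrix(adj):
    """All-pairs hop counts of an unweighted graph (Floyd-Warshall).
    Unreachable pairs keep an infinite distance."""
    n = len(adj)
    dist = [[0.0 if i == j else (1.0 if adj[i][j] else math.inf)
             for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


def adjacency_from_structure(C):
    """Threshold C into an adjacency matrix, keeping the threshold whose
    shortest-path matrix is closest to C in Frobenius norm."""
    n = len(C)
    candidates = sorted({C[i][j] for i in range(n) for j in range(n) if i != j})
    best_adj, best_err = None, math.inf
    for thr in candidates:
        adj = [[1 if i != j and C[i][j] <= thr else 0 for j in range(n)]
               for i in range(n)]
        sp = shortest_path_matrix(adj)
        # Disconnected graphs give an infinite error and are never selected.
        err = math.sqrt(sum((C[i][j] - sp[i][j]) ** 2
                            for i in range(n) for j in range(n)))
        if err < best_err:
            best_adj, best_err = adj, err
    return best_adj, best_err


# Noisy shortest-path matrix of a 4-node path graph (toy data).
C_noisy = [[0.0, 1.1, 2.0, 2.9],
           [1.1, 0.0, 0.9, 2.1],
           [2.0, 0.9, 0.0, 1.0],
           [2.9, 2.1, 1.0, 0.0]]
adj, err = adjacency_from_structure(C_noisy)
```

On this toy matrix the selected threshold keeps exactly the three edges of the path, since any smaller threshold disconnects the graph and any larger one shortens the longest paths.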
In the second experiment, we evaluate the ability of FGW to perform graph approximation and compression on a simple Stochastic Block Model graph [37,22]. The question is whether estimating an approximated graph can recover the relation between the blocks and simultaneously perform a community clustering of the original graph (using the OT matrix). We generate two community graphs, illustrated in the left column of Figure 13. The relation between the blocks is sparse and has a 'linear' structure. The example in the first row has features that follow the blocks (noisy but similar within each block), whereas the example in the second row has two modes per block. The first graph approximation (top row) is done with 4 nodes, and we recover both the blocks in the graph and the average feature of each block (colors on the nodes). The second problem is more complex due to the two modes per block, but when approximating the graph with 8 nodes we recover both the structure between the blocks and the sub-clusters in each block, which illustrates the strength of FGW: encoding both features and structure.
MESH BARYCENTER We show in this section another example of a barycenter. We aim at interpolating between unregistered 3D meshes.
Here, we consider the problem of interpolating between k = 2 meshes in 3D that share a common topology but not the same number of vertices. Such an interpolation is realized by setting $\lambda_1 = \lambda$ and $\lambda_2 = 1 - \lambda$ and varying λ between 0 and 1. We interpolate between two quadrupeds, a deer and a cat, which are triangular meshes with 460 and 989 vertices respectively. This is a particularly difficult problem, since no prior matching between the meshes is available. It has long been considered in the computational geometry and vision communities (e.g. [2,32]), and generally requires user intervention. In our setting, the structure of the barycenter is set to be that of the cat: the barycenter should have the same topological structure. Our method then only solves for the vertex positions $X \in \mathbb{R}^{989 \times 3}$. The topological structures $C_1$ and $C_2$ are set to be the shortest path along the mesh between two vertices, which is a good approximation of the geodesic distance on the manifold. Results are presented in Figure 14 for $\lambda \in \{0.75, 0.5, 0.25\}$. A good way of assessing the quality of the results is to visually check that the consistency of the manifold mesh is preserved throughout the interpolation. The first row shows the resulting interpolation when the weight on the structure is set to a high value. When only 3D distances are used to match the shapes (bottom row), one can see that points belonging to different parts of the meshes are matched, because of the different densities of points in the two meshes. This results in a highly unrealistic mesh.

Proofs of the mathematical properties
This section presents the proofs of the previous theorems and results. We will frequently use the following lemma: LEMMA 5.1 Let $q \geq 1$. We claim:

FIG. 14. Interpolation of a cat and a deer mesh using FGW. (First row) Interpolation using the FGW distance with a high α value. (Bottom row) Same with a very low α value, i.e. the mesh structure is almost not taken into account.

Proof of the Proposition
For the two inequalities (2.5) and (2.6), let π be the optimal coupling for the Fused Gromov-Wasserstein distance between µ and ν (assuming its existence for now). Clearly: So by suboptimality: which proves (2.5). The same reasoning is used for (2.6).
For the last inequality (2.7), let $\pi \in \Pi(\mu, \nu)$ be any admissible coupling. By suboptimality, $d^{\Omega}_{FGW,\alpha,p,1}(\mu, \nu)$ is bounded through a chain of inequalities in which (*) is the triangle inequality of $d_Z$ and (**) is the Minkowski inequality. Since this inequality is true for any admissible coupling π, we can apply it with the optimal coupling for the Wasserstein distance defined in the proposition, and the claim follows.

Proof of the theorem 3.1 Metric properties of FGW
We prove the theorem point by point: first the existence, then the equality relation and finally the triangle inequality statement. We first recall the following lemma (Lemma 10.3 in [18]): LEMMA 5.2 Let $(W, d_W)$ be a compact metric space and $M$ be a subset of $\mathcal{P}(W)$ which is sequentially compact for the weak convergence.
If $\varphi : W \times W \to \mathbb{R}$ is Lipschitz for the $L^1$ metric on $W \times W$ (the sum of the distances $d_W$ between corresponding components), then the map $\mu \mapsto I(\mu) = \int\!\!\int_{W \times W} \varphi(w, w')\, d\mu(w)\, d\mu(w')$ admits a minimizer in $M$.
The sequential compactness of $\Pi(\mu, \nu)$ is a classical result (see Lemma 4.4 in [35]). To prove the existence of the FGW distance, we use Lemma 5.2, which states the existence of a minimizer for the integral of φ. The main idea is to rewrite the definition of FGW in the form of the lemma.
We first consider the case p = 1 and use Lemma 5.2 with $W = X \times A \times Y \times B$ and for $(w, w') \in W \times W$: We equip $W$ with the metric: So by Lemma 5.2 it suffices to show that $\varphi$ is Lipschitz on $W \times W$ with respect to: We also consider $g(t) = t^q$ and, These notations will be useful to prove the case q > 1 from the case q = 1. Indeed, we will show that $\varphi_1$ and $\varphi_2$ are 1-Lipschitz w.r.t. $d$, which proves the result for q = 1. Using the boundedness of $g$, $\varphi_1$ and $\varphi_2$ over compact sets, we will conclude for q > 1.
To prove that $\varphi_1$ and $\varphi_2$ are 1-Lipschitz w.r.t. $d$, we have to show that for i = 1, 2 and with by definition: We first consider $\varphi_1$: The last inequality is a consequence of the triangle inequalities of $d_X$ and $d_Y$.
The last inequality is a consequence of the triangle inequality of $d$. So $\varphi_2$ is 1-Lipschitz w.r.t. $d$. Since all metric spaces are compact, $\varphi_1$ and $\varphi_2$ are bounded by constants $M_1$ and $M_2$ respectively. Then the restriction $g_1$ of $g$ to $[0, M_1]$ and the restriction $g_2$ of $g$ to $[0, M_2]$ are Lipschitz with constants bounded by $q M_1^{q-1}$ and $q M_2^{q-1}$ respectively, so by Lemma 5.2 there exists a minimizer for p = 1. For p > 1 the same reasoning shows that $\varphi^p$ is Lipschitz with constant $p((1-\alpha) q M_1^{q-1} + \alpha q M_2^{q-1})^{p-1}$, so there exists a minimizer for all p.
First, suppose that such a map exists. We consider the coupling $\pi = (\mathrm{Id} \times f)\#\mu \in \Pi(\mu, \nu)$. Then its cost is
$$\int\!\!\int \Big( (1-\alpha)\, d(a, f_2(a))^q + \alpha\, L(x, f_1(x), x', f_1(x'))^q \Big)^p d\mu(x, a)\, d\mu(x', a') = 0,$$
since $f_2(a) = a$ and $f_1$ is an isometry. So π is an optimal coupling and $d^{\Omega}_{FGW,\alpha,p,q}(\mu, \nu) = 0$. Conversely, suppose that $d^{\Omega}_{FGW,\alpha,p,q}(\mu, \nu) = 0$. To prove the existence of a map $f = (f_1, f_2) : X \times A \to Y \times B$ we will use the Gromov-Wasserstein properties. We are looking for a vanishing Gromov-Wasserstein distance between the spaces $X \times A$ and $Y \times B$ equipped with our two measures µ and ν and two distance functions.

Proof of Prop. 3.4 Convergence and concentration inequality
Proof.
The proof of the convergence in FGW derives directly from the weak convergence of the empirical measure and Lemma 5.2. For the concentration result, inequality (2.7) is valid between $\mu_n$ and µ since they are both defined on the same ground space. Then we have: We can directly apply Theorem 1 in [16] to state the inequality.

Proof of theorem 3.5 Interpolation properties between GW and W
Proof.
To prove the first point of the theorem, we want a converse inequality of (2.5) and (2.6) in the limit cases.
Let $\pi_{OT} \in \Pi(\mu_A, \nu_B)$ be the optimal coupling for the $pq$-Wasserstein distance between $\mu_A$ and $\nu_B$. We can use the same gluing lemma (Lemma 5.3.2 in [3]) to construct: Let $\alpha \geq 0$ and let $\pi_\alpha$ be an optimal plan for the Fused Gromov-Wasserstein distance between µ and ν.
Proof of the theorem. Let $q \geq 2$ and let $(S_t)_{t\in[0,1]} = (X_t \times A_t, d_{X_t}, \mu_t)_{t\in[0,1]}$ be a geodesic in $\mathcal{H}(\mathbb{R}^d)$. $\|\cdot\|$ denotes the $\ell_q$ norm.
The goal is to show that this geodesic in $(\mathcal{H}(\mathbb{R}^d), d^{\mathbb{R}^d}_{FGW,\alpha,1,q})$ is actually of the form: with $\pi^*$ an optimal coupling between the endpoints $(X_0 \times A_0, d_{X_0}, \mu_0)$ and $(X_1 \times A_1, d_{X_1}, \mu_1)$ for the $d^{\Omega}_{FGW,\alpha,1,q}$ distance, and $\eta_t$, $\hat{A}_t$ defined in Theorem 3.6. The equality of these two geodesics is understood with respect to the equivalence relation ∼ of structured objects defined previously.
In order to prove this result, we first consider discrete dyadic times $t = i 2^{-k}$ for $k \in \mathbb{N}$ and $i \in \{1, \dots, 2^k\}$, and we then extend by continuity to any $t \in [0, 1]$.

Conclusion
We have presented in this paper a new OT distance called the Fused Gromov-Wasserstein distance. Inspired by both the Wasserstein and Gromov-Wasserstein distances, the FGW distance can compare structured objects by including both the inherent relations that exist between the elements of the objects, which constitute their structure information, and their feature information, which lives in a ground space common to all structured objects. We stated mathematical results about this new distance, such as metric, interpolation and geodesic properties. We also gave a concentration result for the convergence of finite samples. We illustrated this new distance on structured objects and applied it to graph barycenter computation, graph clustering and mesh interpolation.

FIG. 6. Gromov-Wasserstein coupling of two mm-spaces $\mathcal{X} = (X, d_X, \mu_X)$ and $\mathcal{Y} = (Y, d_Y, \nu_Y)$. Left: the mm-spaces share nothing in common. Similarity between pairwise distances is measured by $|d_X(x, x') - d_Y(y, y')|$. Right: an admissible coupling of $\mu_X$ and $\nu_Y$. Image inspired from [26, Fig 10.8].
FIG. 7. Illustration of Definition 2.10. The figure shows two structured objects $(X \times A, d_X, \mu)$ and $(Y \times B, d_Y, \nu)$. The feature space Ω is the common space for all features. The two metric spaces $(X, d_X)$ and $(Y, d_Y)$ represent the structures of our two structured objects; the similarity between all pairwise distances of the structure points is measured by $L(x, y, x', y')$. µ and ν are the joint measures on the structure space and the feature space.

FIG. 9. Illustration of the difference between W, GW and FGW couplings. (Left) Empirical distributions µ with 20 samples and ν with 30 samples, whose color is proportional to their index. (Middle) Cost matrices in the feature ($M_{AB}$) and structure ($C_1$, $C_2$) domains, with similar samples in white. (Right) Solution for all methods. Dark blue indicates a non-zero coefficient of the transportation map between i and j. Feature distances are large between points lying on the diagonal of $M_{AB}$, so that the Wasserstein map is anti-diagonal but unstructured. Fused Gromov-Wasserstein incorporates both feature and structure information in a single transport map.

FIG. 10. Couplings obtained when considering (top left) the features only, where we have $d^{\Omega}_{W,1} = 0$; (top right) the structure only, with $d_{GW,1} = 0$; (bottom left and right) both the features and the structure, with $d^{\Omega}_{FGW,0.1,1,2}$. For readability, only the couplings starting from non-white pixels of the left picture are depicted.

FIG. 11. Behavior of the trade-off parameter α on a toy time series classification problem. α increases from left (α = 0: Wasserstein distance) to right (α = 1: Gromov-Wasserstein distance). (Top row) 2D embedding computed with MDS from the set of pairwise distances between samples. (Bottom row) Illustration of couplings between two sample time series from opposite classes.