Hierarchical and Unsupervised Graph Representation Learning with Loukas’s Coarsening

Abstract: We propose a novel algorithm for unsupervised graph representation learning with attributed graphs. It combines three advantages addressing some current limitations of the literature: (i) the model is inductive: it can embed new graphs without re-training in the presence of new data; (ii) the method takes into account both micro-structures and macro-structures by looking at the attributed graphs at different scales; (iii) the model is end-to-end differentiable: it is a building block that can be plugged into deep learning pipelines and allows for back-propagation. We show that combining a coarsening method having strong theoretical guarantees with mutual information maximization suffices to produce high-quality embeddings. We evaluate them on classification tasks with common benchmarks of the literature. We show that our algorithm is competitive with the state of the art among unsupervised graph representation learning methods.


Introduction
Graphs are a canonical way of representing objects and relationships among them. They have proved remarkably well suited in many fields such as chemistry, biology, social sciences or computer science in general. The connectivity information (edges) is often completed by discrete labels or continuous attributes on nodes, resulting in so-called attributed graphs. Many real-life problems involving high-dimensional objects and their links can be modeled using attributed graphs.
Machine learning offers several ways to solve problems such as classification, clustering or inference, provided that a sufficient amount of training examples is available. Yet, the most classical frameworks are devoted to data living in regular spaces (e.g. vector spaces), and they are not suitable to deal with attributed graphs. One way to overcome this issue is to represent or encode the attributed graphs in such a way that usual machine learning approaches are efficient. A recent take on that is known as graph representation learning [10]: the graphs are embedded in a fixed-dimensional latent space such that similar graphs share similar embeddings.
Three properties are desirable in order for a method of attributed graph representation learning to be widely applicable and expressive enough. We expect a method to be: I. Unsupervised, because labels are expensive and not always available; II. Inductive, so that computing the embedding of an unseen graph (not belonging to the training set) can be done on the fly (in contrast to transductive methods); III. Hierarchical, so as to take into account properties at both local and global scales; indeed, structured information in graphs can reside at various scales, from small neighborhoods to the entire graph.
In order to obtain these three desirable properties for attributed graph representation learning, the present work introduces a new Hierarchical Graph2Vec (HG2V) model. Like Graph2Vec [27], with which it shares some similarities, it is based on the maximization of some mutual information. Thanks to a proper use of coarsening, as proposed by Loukas [18], it is hierarchical and incorporates information at all scales, from micro-structures like node neighborhoods up to macro-structures (communities, bridges), by considering a pyramid of attributed graphs of decreasing size.
The article is organised as follows. Section 2 presents some related work in the literature. In Section 3, we introduce the notation and the required background. Section 4 is dedicated to the detailed presentation of our main contribution: the Hierarchical Graph2Vec method. In Section 5, an experimental study is reported that demonstrates the effectiveness of the framework for various tasks.

Related Work
The proposed method is related to a large spectrum of works, from kernel algorithms to graph neural networks.
Kernel methods. Graph kernels have become a well-established and widely used technique for learning graph representations [36,40]. They use handcrafted similarity measures between every pair of graphs. Some are restricted to discrete labels [32,33], while others can handle continuous attributes [7,16,25]. The main drawback of kernel methods is their quadratic time complexity. In addition, they are transductive: in order to embed a new example, they require re-training the model from scratch on the extended dataset.
Infomax Principle. This principle hypothesizes that good representations maximize the mutual information between the input data and its embedding. Deep Graph Infomax [39] and GraphSAGE [9] rely on negative sampling (an estimator of mutual information) to produce node embeddings for solving a classification task. InfoGraph [34] uses the same estimator to produce embeddings for entire attributed graphs.
Graph2Vec [27] was inspired by language models (especially the negative sampling of Word2Vec) and considers node embeddings as the vocabulary used to "write" a graph. The method works well but shows some limitations: it is transductive, it uses discrete labels, and it is inefficient for continuous attributes due to the use of a discrete hash function. Recurrent neural networks, relying on the same model, were shown to achieve similar results at lower cost [35].
Graph coarsening. The aim of graph coarsening is to produce a sequence of graphs of decreasing sizes; it can be done by node pooling, as with Loukas' algorithm [18], or by node decimation, as for example in Kron reduction [5,1]. Coarsening can also be combined with edge sparsification [2,5], so as to reduce the density of coarsened graphs. In another context, DiffPool [46] performs graph coarsening using pooling, but it learns a pooling function specific to each task in a supervised manner.
Graph Neural Networks. Developed after the renewed interest in neural networks, they are known to offer interesting graph embedding methods [10], in addition to solving several graph-related tasks; see [44] (and references therein) for a survey, and specifically the popular Chebyshev GNN [3], GCN [14] and GAT [38]. Stacking them to increase the receptive field may raise scalability issues [22].
Still, it has been shown that some easy problems on graphs (diameter approximation, shortest paths, cycle detection, s-t cut) cannot be solved by a GNN of insufficient depth [19]. By combining GNNs with coarsening, e.g. [18], and node pooling, e.g. [1,8,46], those impossibility results no longer apply. This combination thus helps in designing a method encompassing all structural properties of graphs.

Method

Definitions and background
The proposed method is inspired by Graph2Vec [27], which was itself built upon negative sampling and the Weisfeiler-Lehman (WL) algorithm [41]. The present section recalls the fundamental ideas of those algorithms. The Weisfeiler-Lehman method produces efficient descriptors of the topology of a graph. Negative sampling is a way to create embeddings of good quality through the maximization of mutual information.

Definition 1. An attributed graph is a tuple (V, A, Z), where V is the set of nodes, A ∈ R^{|V|×|V|} is a weighted adjacency matrix, and Z : V → R^n is the function that maps each node u ∈ V to its attribute vector Z(u). We also denote by Z ∈ R^{|V|×n} the matrix representation of these attributes, and by G the space of attributed graphs.

Definition 2. A graph embedding is a function E : G → R^d that maps each attributed graph to a vector in the latent space R^d for some non-negative integer d.

Weisfeiler-Lehman procedure (WL)
The seminal paper [41] proposes an algorithm initially created in an attempt to solve the graph isomorphism problem (whether or not graph isomorphism belongs to P is still an open problem). It maps the original graph, with discrete node labels, onto a sequence of labelled graphs, by repeatedly applying the same deterministic operator, as sketched in Fig. 1. The resulting distribution of labels distinguishes between graphs (although different graphs can share the same distribution). The method can be used to build efficient kernels for graph learning [33]. In the following, we will use the WL-Optimal Assignment kernel [17] as state of the art. The procedure to generate the labels is the following:

x_{l+1}(u) = hash( x_l(u), { x_l(v) : v ∈ N(u) } )    (2)

where N(u) is the set of neighbours of u.
The hashing function has to be injective in order to distinguish between different rooted subtrees. The notation {} emphasises the fact that the output only depends on the unordered set of labels ("bag of labels"). The procedure is iterated on l up to a (user-defined) depth L: the deeper, the wider the neighborhood used to distinguish between graphs.
By definition, the label x_l(u) of a node u at the l-th level depends only on the labels of the nodes at distance at most l from u. The output is invariant under node permutation, and hence is the same for isomorphic graphs. If graph g_i contains N_i nodes, then it produces N_i new labels per level, for a total of N_i L new labels at the end.
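As an illustration, the relabeling procedure can be sketched in a few lines. This is a hypothetical toy implementation, not the authors' code; the SHA1-based hash merely stands in for any function injective on (label, multiset-of-neighbour-labels) pairs:

```python
import hashlib
import networkx as nx

def wl_labels(graph, depth):
    """Run `depth` WL iterations and return the full bag of labels
    produced (initial labels plus one set of labels per iteration)."""
    labels = {u: "0" for u in graph.nodes}   # uniform initial labels
    bag = list(labels.values())
    for _ in range(depth):
        new_labels = {}
        for u in graph.nodes:
            # the update depends only on the unordered multiset of
            # neighbour labels, hence the sort before hashing
            neigh = sorted(labels[v] for v in graph.neighbors(u))
            key = labels[u] + "|" + ",".join(neigh)
            new_labels[u] = hashlib.sha1(key.encode()).hexdigest()[:8]
        labels = new_labels
        bag.extend(labels.values())
    return bag

g = nx.cycle_graph(6)
bag = wl_labels(g, 2)
# a 6-node graph produces 6 new labels per level: 6 initial + 2*6 = 18
print(len(bag))
```

Since the output depends only on the graph structure, isomorphic graphs yield the same bag of labels, as stated above.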
Figure 1: Two iterations of the WL algorithm, starting with uniform labels.

Negative Sampling and Mutual Information
In Word2Vec, a softmax objective (implying the expensive computation of a normalization constant) is replaced by a sequence of binary classification problems, where a discriminator T learns to distinguish between real and fake pairs. Instead of modeling the conditional distribution P_{Y|X}, the binary classifier learns to distinguish between examples coming from the joint distribution P_{X,Y} and examples coming from the product distribution P_X ⊗ P_Y. Negative sampling can be used to build the Jensen-Shannon estimator of the mutual information (MI) between X and Y (see [11] for the first usage of this estimator, [29] for its construction and [24] for other insights):

I_JS(X; Y) = E_{(x,y)∼P_{X,Y}}[ log σ(T_θ(x, y)) ] + E_{(x,y)∼P_X⊗P_Y}[ log σ(−T_θ(x, y)) ]

where T_θ : X × Y → R is the discriminator, i.e. a function parameterized by θ (whose exact form is specified by the user; it can be a neural network), and typically σ(x) = 1/(1 + e^{−x}). Maximizing this estimator of MI is equivalent to minimizing the cross entropy between the prediction σ(T(x, y)) and the labels of the binary classification, with P_{X,Y} (resp. P_X ⊗ P_Y) being the distribution of class 1 (resp. 2).
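A minimal numerical sketch of this estimator, under assumptions of ours: a toy scalar discriminator T(x, y) = xy, and negative pairs built by shuffling y (an approximation of the product distribution):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def js_mi_estimate(x, y, T, rng):
    """Jensen-Shannon MI estimate from negative sampling: positive
    pairs come from the joint distribution, negative pairs from an
    approximation of the product distribution (shuffled y)."""
    pos = np.log(sigmoid(T(x, y)))                    # real pairs
    neg = np.log(sigmoid(-T(x, rng.permutation(y))))  # fake pairs
    return pos.mean() + neg.mean()

rng = np.random.default_rng(0)
T = lambda x, y: x * y                 # toy discriminator
x = rng.normal(size=2000)
y = x + 0.1 * rng.normal(size=2000)    # strongly dependent on x
z = rng.normal(size=2000)              # independent of x
dep = js_mi_estimate(x, y, T, rng)
ind = js_mi_estimate(x, z, T, rng)
```

For independent variables the estimate stays near its floor 2 log σ(0) = 2 log(1/2); dependent pairs score strictly above it (dep > ind here).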

Graph2Vec
Graph2Vec [27] combines those two ideas to produce its graph embedding. The joint probability P_{XY} = P_{X|Y} P_Y is constructed by sampling a graph g from the dataset, and then by sampling a label x from the sequence generated by WL from this graph. Minimizing the cross entropy with respect to θ leads to the following expression for the loss:

L(θ) = − E_{(x,g)∼P_{X,Y}}[ log σ(T_θ(x, g)) ] − E_{(x,g)∼P_X⊗P_Y}[ log σ(−T_θ(x, g)) ]    (3)

The discriminator function T_θ(x, g) = θ_g · θ_x is taken as a dot product between a graph embedding θ_g and a node embedding θ_x. The resulting graph embedding is E(g) = θ_g, while the node embedding θ_x can be discarded. Optimizing this objective maximizes the mutual information between the labels and the graph embeddings, which is a way to compress information about the distribution of WL labels into the embedding.
Contribution: Hierarchical Graph2Vec (HG2V)

Our motivations for this work can be summarized as follows:
• we first show that WL fails to capture global-scale information, which is hurtful for many tasks;
• we then show that this flaw can be corrected by the use of graph coarsening; in particular, Loukas' coarsening exhibits good properties in this regard;
• we finally show that the advantage of GNNs over WL is their ability to handle continuity in node features.
Based on those observations, we propose a new algorithm building on graph coarsening and mutual information maximization, which we term Hierarchical Graph2Vec (or HG2V). It has the following properties:
• The training is unsupervised. No label is required. The representation can be used for different tasks.
• The model is inductive, trained once and for all with the graphs of the dataset, in linear time. The training dataset is used as a prior to embed new graphs, whatever their underlying distribution.
• It handles continuous node attributes by replacing the hash function of the WL procedure with a convolutional graph neural network. It can be combined with other learning layers, serving as a pre-processing step for feature extraction.
• The model is end-to-end differentiable. Its input and its output can be connected to other deep neural networks, so it can be used as a building block in a full pipeline. The signal of the loss can be back-propagated through the model to train feature extractors, or to retrain the model in transfer learning.
• The structures of the graph at all scales are summarized using Loukas' coarsening. The embedding combines a local view and a global view of the graph.
The resulting algorithm shares a similar spirit with Graph2Vec (MI maximization between node and graph descriptors), but it corrects some of its above-mentioned flaws. A high-level overview of the method is provided in Algorithm 1. Table 1 summarizes key properties of the method against the other ones found in the literature.
In the following, we introduce in Section 4.1 the Loukas coarsening method of [18], and detail how we use it in Section 4.2. Then, Section 4.3 deals with the continuity property achieved by GNNs, while Section 4.4 explains how to train our proposed model HG2V.
Result: Graph embedding E(g) for each graph g
Input: a training set of attributed graphs g, subset of G; the number of stages L; GNNs F_θ^l and H_θ^l with randomly initialized θ, 1 ≤ l ≤ L
foreach batch of attributed graphs do
    foreach graph g in the batch do
        (Sec. 4.1) run Loukas' algorithm on g to produce a sequence of coarsened graphs g_l, 1 ≤ l ≤ L
    foreach level 1 ≤ l ≤ L do
        foreach graph g in the batch do
            foreach node u in g_l do
                (Sec. 4.2) generate the local node embedding x_l(u) using H_θ^l
                let P(u) ∈ g_{l+1} be the image of u after pooling
                generate the global node embedding g_{l+1}(P(u)) using F_θ^l
                create the positive example (x_l(u), g_{l+1}(P(u)))
        foreach pair of graphs (g, g′) do
            foreach pair of nodes (u, v) ∈ (g, g′) do
                create the negative example (x_l(u), g′_{l+1}(P(v)))
        minimize the cross entropy of Eq. (3) between positive and negative examples, with discriminator T(x, y) = x · y, using Sec. 4.4
Algorithm 1: High-level version of the HG2V algorithm

Loukas's Coarsening
In this section, we detail the main drawback of the WL procedure, and the benefit of graph coarsening to overcome this issue. For simplicity, we will put aside the node attributes for a moment, and focus only on the graph structure.
Even in this simplified setting, WL appears to be sensitive to structural noise.

Weisfeiler-Lehman Sensitivity to Structural Noise
The ability of WL to discriminate all graph patterns comes with an inability to recognize a graph and its noisy counterpart as similar. Each edge added or removed can strongly perturb the histogram of labels produced by WL.
We perform experiments to evaluate the effect of adding or removing edges on different graphs. We randomly generate 100 graphs of 500 nodes each, belonging to four categories (cycle, tree, wheel and ladder), using the routines of the NetworkX library [20]. For each generated graph g, we randomly remove from 1 to 10 edges, sampled with independent and uniform probability, to create the graph g′. One may hope that such a small modification of this large edge set would not excessively perturb the labels of the WL procedure.
To evaluate the consequences of edge removal, we propose to use as similarity score the intersection over union of the histograms of labels of g and g′ at each stage 1 ≤ l ≤ 5:

S_l(g, g′) = |H_l(g) ∩ H_l(g′)| / |H_l(g) ∪ H_l(g′)|

where H_l(g) denotes the multiset of labels produced by WL on g at stage l. The average similarity score S_l(g, g′) over the 100 graphs is reported in Figure 2.
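The intersection-over-union of two label multisets can be computed directly with multiset operations; here is a short sketch (names are ours), where each bag is the multiset of WL labels of one graph at a given stage:

```python
from collections import Counter

def similarity(bag_g, bag_h):
    """Intersection over union of two multisets of WL labels.
    Counter's & and | implement multiset min and max respectively."""
    cg, ch = Counter(bag_g), Counter(bag_h)
    inter = sum((cg & ch).values())
    union = sum((cg | ch).values())
    return inter / union

# {a, a, b} vs {a, b, c}: intersection {a, b}, union {a, a, b, c}
print(similarity(["a", "a", "b"], ["a", "b", "c"]))  # 0.5
```

The score is 1 for identical bags and decreases toward 0 as the label histograms diverge.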
The similarity decreases monotonically with the number of edges removed, even when restricting the procedure to the first stage (neighborhood of width 1). On higher stages (wider neighborhoods) the effect is even worse. On graphs with a small diameter (such as the wheel graph or the 3-regular tree) a significant drop in similarity can be noticed. On the ladder graph and the cycle, sparse graphs with a large diameter, the effect of edge removal remains significant.
Loukas' method [18] is a randomized graph coarsening operation. This spectral reduction of attributed graphs offers the guarantee, in the spectral domain, that a coarsened graph approximates its larger version well. The idea is that the graph spectrum (eigenspaces and eigenvalues) describes global and local structures of graphs. Hence Loukas' coarsening, which approximates the low-pass spectrum well, will produce a smaller graph with the same "global shape" as the input one, as demonstrated in Figure 3. We refer to the original paper [18] for an explicit description of Loukas' method and an extensive list of properties and guarantees.
The interest of this coarsening method is that, if two graphs g_{l−1} and h_{l−1} are close enough, the coarsened graph g_l is itself a satisfying coarsening of h_{l−1}. By symmetry, the same result follows for h_l and g_{l−1}. Hence, one may hope that g_l and h_l share similar patterns (Figure 4), which is advantageous for the WL procedure.
This intuition can be confirmed experimentally. On four datasets (see Section 5 for their description), we sample two graphs g_0 and h_0 with independent and uniform probability. We measure their similarity using the Wasserstein distance between their spectra. Since Loukas' coarsening preserves the spectrum, we expect the distance between g_0 and h_0 to be correlated with the distance between their coarsened counterparts g_1 and h_1. Each dot in Figure 4 corresponds to one experiment; experiments are repeated 1000 times. Interestingly, this correlation strongly depends on the underlying dataset.

Hierarchy of neighborhoods
Taking advantage of the previous observation, we propose to build a hierarchy of coarsened graphs g_l using Loukas' method. It induces a hierarchy of nested neighborhoods u, P(u), P(P(u)), …, P^L(u) by pooling the nodes at each level.
We learn the node embedding g_l(u) (of node u) at each level. This node embedding is used to produce a local neighborhood embedding x_l(u) using the function H_θ^l, and to produce the node embedding of the next level, g_{l+1}(P(u)), using the function F_θ^l. Formally, the recurrent equations defining the successive embeddings are:

x_l(u) = H_θ^l(g_l)(u),    g_{l+1}(P(u)) = Pool( { F_θ^l(g_l)(u′) : u′ ∈ P^{−1}(P(u)) } )

The procedure is illustrated in Figure 5.
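For concreteness, the pooling step from one level to the next can be sketched as follows. This is a hypothetical helper (not the authors' implementation): `assignment` encodes the pooling map P as an array sending each node of level l to its supernode at level l+1, and features are summed, as in the Pool operator described later:

```python
import numpy as np

def pool_features(features, assignment, n_coarse):
    """Sum-pool node features to the coarsened graph: each node u of
    level l contributes its feature vector to its image P(u)."""
    coarse = np.zeros((n_coarse, features.shape[1]))
    for u, cluster in enumerate(assignment):
        coarse[cluster] += features[u]
    return coarse

feats = np.array([[1., 0.], [0., 1.], [2., 2.], [1., 1.]])
P = [0, 0, 1, 1]   # nodes 0,1 -> supernode 0; nodes 2,3 -> supernode 1
print(pool_features(feats, P, 2))  # [[1. 1.], [3. 3.]]
```

Composing this step L times yields the nested neighborhoods u, P(u), …, P^L(u) of the hierarchy.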

Handling continuous node attributes with Truncated Krylov
The WL algorithm uses a discrete hash function in Eq. (2), with the issue that nodes sharing similar, but not quite identical, neighborhoods are considered different. If the differences are caused by noise in computations or measures, they should not result in large differences in the labels. For that purpose, we relax the injectivity property of WL by replacing the hash with a function enjoying a continuity property.
It is known that Graph Neural Networks (GNN) have a discriminative power at least equivalent to WL [23,26,45]. We require the opposite, and we emphasize the importance of not having too strong a discriminator. We use the extension of the Gromov-Wasserstein distance to attributed graphs, which requires the mapping to preserve both edge weights and node attributes. The resulting distance is a special case of the Fused Gromov-Wasserstein distance [37].

Definition 4 (Fused Gromov-Wasserstein distance). Let g_1 = (V_1, A_1, Z_1) and g_2 = (V_2, A_2, Z_2) be two attributed graphs in G.
Define

d_G(g_1, g_2) = min_{π ∈ Π} [ Σ_{u,v} ‖Z_1(u) − Z_2(v)‖ π(u, v) + Σ_{u,u′,v,v′} |A_1(u, u′) − A_2(v, v′)| π(u, v) π(u′, v′) ]

where Π is the collection of measures over V_1 × V_2 with uniform marginals.
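To make the definition concrete, here is a sketch that evaluates the fused cost of one given coupling (d_G is the minimum of this cost over admissible couplings). The unweighted sum of the attribute term and the structure term is an assumption of this sketch; [37] introduces a trade-off parameter between the two:

```python
import numpy as np

def fused_gw_cost(A1, Z1, A2, Z2, pi):
    """Fused Gromov-Wasserstein cost of a coupling pi (a sketch)."""
    # attribute term: sum_{u,v} |Z1(u) - Z2(v)| * pi[u,v]
    attr = np.abs(Z1[:, None, :] - Z2[None, :, :]).sum(-1)
    attr_term = (attr * pi).sum()
    # structure term: sum |A1(u,u') - A2(v,v')| * pi[u,v] * pi[u',v']
    struct = np.abs(A1[:, :, None, None] - A2[None, None, :, :])
    struct_term = np.einsum('uv,wx,uwvx->', pi, pi, struct)
    return attr_term + struct_term

A = np.array([[0., 1.], [1., 0.]])   # one edge between two nodes
Z = np.array([[0.], [1.]])           # distinct node attributes
pi_id = np.eye(2) / 2                # identity coupling, uniform marginals
cost = fused_gw_cost(A, Z, A, Z, pi_id)   # 0 for identical graphs
```

Identical graphs with the identity coupling have zero cost; any coupling mismatching the attributes pays a strictly positive price.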
This definition allows us to state the continuity of GNNs under this metric.

Lemma 1 (GNN continuity). GNNs with continuous activation functions are continuous under the topology induced by the metric d_G.
GNNs are usually parameterized functions learnable with stochastic gradient descent, and hence fulfill this continuity constraint. Moreover, some attributes may be irrelevant and should be discarded, or sometimes the graph is only characterized by some function (for example a linear combination) of the attributes. Such a function can be inferred directly from the dataset itself, simply by looking at the co-occurrences of similar sub-graphs. Hence, the ability of a GNN to learn features relevant for a specific task is another advantage over WL.
Figure 3: Coarsening of four graphs built from MNIST digits using Loukas' algorithm. The bigger the node, the wider the neighborhood pooled. Similar digits share similar shapes.
Figure 4: Wasserstein distance between the spectra of graphs g_0 and h_0 sampled from different datasets (see Section 5 for their description), compared to the same distance between their coarsened graphs g_1 and h_1. It can be seen that there is a correlation between these two quantities: when the distance between g_0 and h_0 is small (resp. large), the distance between the coarsened graphs g_1 and h_1 tends to be small (resp. large).
GCN [14] is a baseline among this family of networks. Unfortunately, [22] have shown that GCNs behave badly when stacked, and proposed as a replacement the Truncated Block Krylov method, which we choose here. It amounts to considering, for a given node, a receptive field extended to the set of nodes at distance at most a. The resulting layer uses the normalized adjacency matrix Ã = D^{−1/2}(A + I)D^{−1/2} (as in GCN) of the graph g_l:

F_θ^l(X) = tanh( [X ‖ ÃX ‖ Ã²X ‖ … ‖ Ã^{a−1}X] θ_2^l + θ_3^l )    (11)

The vector θ is made of the trainable parameters. In Eq. (11), one recognizes a specific graph filter of order a, close to the ones considered by the Chebyshev polynomial approximation in [3].
Pool denotes the merging of nodes. Their features are summed, because summation preserves information on multisets, as argued in [45]. The tanh function ensures that the output for every node is bounded by 1. Empirically, the number of nodes pooled at once following Loukas' coarsening (see the illustration in Fig. 5) barely exceeds 5. Consequently, the output g_{l+1} remains small enough to avoid exploding gradient problems. The affine transformation X ↦ Xθ_2^l + θ_3^l in Eq. (11) is used to improve the expressiveness of the model thanks to a bias term.
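A minimal sketch of such a layer, assuming the concatenation-of-powers form of the Truncated Block Krylov network of [22]; `W` and `b` stand for the trainable θ, and this is an illustration rather than the authors' exact parameterization:

```python
import numpy as np

def krylov_layer(A, X, W, b, a):
    """One Truncated Block Krylov layer (a sketch): concatenate
    [X, ÃX, ..., Ã^{a-1}X], giving each node a receptive field of
    radius a, then apply an affine map followed by tanh."""
    n = len(A)
    deg = (A + np.eye(n)).sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    A_norm = d_inv_sqrt @ (A + np.eye(n)) @ d_inv_sqrt  # Ã
    blocks, cur = [X], X
    for _ in range(a - 1):
        cur = A_norm @ cur          # diffuse one more hop
        blocks.append(cur)
    K = np.concatenate(blocks, axis=1)   # Krylov block
    return np.tanh(K @ W + b)

A = np.array([[0., 1.], [1., 0.]])       # a single edge
X = np.array([[1.], [0.]])
W = np.ones((2, 1)); b = np.zeros(1)
out = krylov_layer(A, X, W, b, a=2)      # shape (2, 1), entries in (-1, 1)
```

The tanh keeps every node output bounded by 1, matching the boundedness argument above.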

Hierarchical Negative Sampling
Following the principles of Graph2Vec, we aim to maximize the mutual information between node embeddings and local neighborhood embeddings at each level. We achieve this by training a classifier to distinguish between positive and negative examples of such pairs of embeddings. Consider the l-th level of the hierarchical pyramid. The probability P_{Y_l} is built by sampling a graph g from the dataset, then by sampling a node u with uniform probability from this graph. P_{X_l|Y_l} is obtained by sampling a node from P^{−1}(u). This gives a pair (x_l(u′), g_{l+1}(u)) with u′ ∈ P^{−1}(u), sampled according to P_{X_l Y_l} = P_{Y_l} P_{X_l|Y_l}. The negative pairs are built as usual from the product probability P_{X_l} ⊗ P_{Y_l}. The corresponding loss function takes the form:

L_l(θ) = − E_{(x,y)∼P_{X_l,Y_l}}[ log σ(x · y) ] − E_{(x,y)∼P_{X_l}⊗P_{Y_l}}[ log σ(−x · y) ]

Following the approach of [21], each level of the pyramid is isolated from the others by a stop-gradient operation that prevents back-propagation. Hence, each θ_l is greedily trained with only one loss L_l. Each level of the pyramid corresponds to a different problem with its own loss. Thus, Truncated Block Krylov layers can be stacked safely without fearing vanishing gradients, despite the use of the tanh activation function. The overall method is described in Algorithm 1.

Figure 5: Single level of the pyramid. Local information x_l(u) centered around node u is extracted from graph g_l. The graph is coarsened to form a new graph g_{l+1}. There, g_{l+1}(P(u)) captures information at a larger scale, centered on node P(u). The pair (x_l(u), g_{l+1}(P(u))) is used as a positive example in the negative sampling algorithm, and it helps to maximize the mutual information between the global and local views.
A descriptor E_l(g) for each level l of the graph g is computed by global pooling over node embeddings. The final graph embedding is obtained by concatenating those descriptors. The sum aggregator is always preferred over the mean aggregator, because of its better ability to distinguish between multisets instead of distributions, as argued in [45]. Hence, the learned representation is:

E(g) = [ E_1(g) ‖ E_2(g) ‖ … ‖ E_L(g) ],    with E_l(g) = Σ_{u ∈ g_l} x_l(u)

In Loukas' method, the number of stages is not pre-determined in advance: only the final coarsened graph size can be chosen. When the number of stages produced by Loukas' method is not aligned with the depth of the neural network, the pooling is either cropped (too many stages) or completed with identity poolings (not enough stages). The resulting vector can be used for visualization purposes (PCA, t-SNE) or directly plugged into a classifier with labels on graphs.
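The sum-pool-then-concatenate readout can be sketched in a couple of lines (a hypothetical helper; each entry of the input list holds the node embeddings of one level of the pyramid):

```python
import numpy as np

def graph_embedding(node_embeddings_per_level):
    """Final embedding: sum-pool the node embeddings at each level,
    then concatenate the per-level descriptors E_l(g)."""
    return np.concatenate([Z.sum(axis=0) for Z in node_embeddings_per_level])

levels = [np.ones((5, 4)), np.ones((2, 4))]  # 5-node graph coarsened to 2 nodes
E = graph_embedding(levels)
print(E)  # [5. 5. 5. 5. 2. 2. 2. 2.]
```

With L levels of d-dimensional node embeddings, the final vector has dimension L·d, independent of the graph size.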

Complexity
Time. The coarsening of each graph can be pre-computed once and for all, in time linear in the number of nodes and edges [18]. Hence, the main bottleneck is the maximization of mutual information.
The complexity has four terms: the first is due to the exponentiation of the adjacency matrix, the second to diffusion along edges, the third to the forward pass through network layers, and the last to the Cartesian product used to create positive and negative examples. The most sensitive factor is the number of nodes, followed by the number of features and the batch size. These magnitudes allow handling graphs with hundreds of nodes efficiently on modern GPUs, with embeddings as big as 512, and up to 8 graphs per batch and 5 stages of coarsening. In practice, the bottleneck turns out to be the pre-computation of the coarsening, which does not benefit from GPU speed-up.
Space. Note that for datasets with small graphs, the embedding can be larger than the graph itself. However, when the number of nodes exceeds 50, the embedding size is always smaller than the adjacency matrix. Hence, this method is more suitable for graphs with at least dozens of nodes.
Experiments

An experimental evaluation is conducted on various attributed graph datasets, and the quality of the embeddings is assessed on supervised classification tasks in Section 5.2. Our method is inductive: the model can be trained over a dataset and be used to embed graphs coming from another dataset. This property is analysed in Section 5.3. HG2V differs from Graph2Vec by the usage of a GNN and of Loukas' coarsening. The influence of those two elements is analysed with ablative studies in Section 5.4.
Synthetic datasets. Additionally, we introduce a novel dataset for the community: Diffusion Limited Aggregation (DLA). The attributed graphs are created by a random process that makes the graphs scale-free, an interesting property that justifies the creation of a new benchmark. The code can be found at https://github.com/Algue-Rythme/DiffusionLimitedAgregation

Image datasets. We convert the popular MNIST and USPS datasets of the computer vision community into graphs, by removing blank pixels, and adding luminosity and (x, y) coordinates as node features. To solve the task, the method should be able to recognize the shape of the graph, which is a global property.

Table 3: Accuracy on classification tasks when training on some input distribution and performing inference on another. When the datasets for training and inference are identical, the percentage (in parentheses) corresponds to the sizes of the disjoint splits. The hyper-parameters selected are identical to the ones of Table 2.
The Frankenstein dataset was created in [30] by replacing node labels of the BURSI dataset with MNIST images.
Pre-processing. All the continuous node attributes are centered and normalized. The discrete node labels use one-hot encoding. When there is no node feature, we use the degree of the node instead.
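This pre-processing can be sketched as follows (a hypothetical helper of ours, covering the continuous-attribute and degree-fallback cases):

```python
import numpy as np
import networkx as nx

def node_features(graph, attrs=None):
    """Center and normalize continuous node attributes; fall back to
    the node degree when no attribute is available."""
    if attrs is None:
        X = np.array([[d] for _, d in graph.degree()], dtype=float)
    else:
        X = np.asarray(attrs, dtype=float)
    X = X - X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0          # guard against constant columns
    return X / std

g = nx.path_graph(4)             # degrees 1, 2, 2, 1
X = node_features(g)
print(X.ravel())                 # [-1.  1.  1. -1.]
```

One-hot encoding of discrete labels would replace the degree column in the labelled-graph case.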

Supervised classification
In the first task, the model is trained over all the attributed graphs in the dataset. The quality of the embeddings is assessed on a supervised classification task. The classifier used is the C-SVM with RBF kernel from the scikit-learn library.

Training Procedure
The embeddings are trained over 10 epochs. At each step, 8 graphs are picked randomly from the dataset, and all the vocabulary from these graphs is used to produce positive and negative pairs (by Cartesian product). Hence, the number of negative examples in each batch is higher than the number of positive examples. Consequently, we normalize the loss of negative samples to reduce the imbalance. The optimizer is Adam [4] and the learning rate follows a geometric decay, being divided by approximately 1000 over the 10 epochs.
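The geometric decay schedule can be sketched as follows (a small illustration of ours; the paper only specifies the overall division by roughly 1000, not the exact step rule):

```python
def geometric_decay(lr0, total_steps, factor=1000.0):
    """Per-step learning rates that divide lr0 by `factor` over the
    whole run: a constant ratio is applied at every step."""
    ratio = factor ** (-1.0 / (total_steps - 1))
    return [lr0 * ratio ** t for t in range(total_steps)]

lrs = geometric_decay(1e-3, total_steps=10)
# starts at lr0 = 1e-3 and ends ~1000x smaller, at 1e-6
```

In practice the same rule would be applied per optimizer step rather than per epoch, with `total_steps` set accordingly.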

Model selection
The relevant hyper-parameters of HG2V are the number of features d ∈ {16, 128, 256} at each stage, the receptive field of the Truncated Krylov layer a ∈ {2, 4}, and the maximum depth of Loukas' coarsening L ∈ {3, 5}. They are selected using a grid search. Five random splits of the dataset are generated: TrainVal (80% of the data) and Test (20% of the data). HG2V is trained over TrainVal+Test (the whole dataset). Then, a C-SVM is trained over TrainVal, using a 5-fold cross-validation grid search for the selection of its hyper-parameters (C, gamma). The average validation score of the best C-SVM classifier is used to select the hyper-parameters (d*, a*, L*) of HG2V. The average test score of the best C-SVM classifier (accuracy over the Test split) is reported in Table 2. Note that HG2V is trained over the whole dataset, but the labels of the Test split are unused (even for the selection of the best C-SVM classifier); hence the evaluation procedure is valid.

Baselines
We compare our work to various baselines of the literature.
Kernel methods. All the results reported are extracted from the corresponding papers [17,28], giving an idea of the best possible performance achievable by the WL-Optimal Assignment kernel method [17]. It almost always outperforms inductive methods based on neural networks. However, like every kernel-based method, it has quadratic time complexity in the number of graphs, which is prohibitive for dealing with large datasets. Due to its very high scores and its markedly different design, we consider it apart and we never highlight its results (even if they are the best).
DiffPool, GIN. We report the results of the rigorous benchmarks of [6], including the popular DiffPool [46] and GIN [45]. Those algorithms are end-to-end differentiable, but they are supervised. DiffPool also relies on graph coarsening, but its pooling function is learned, while Loukas' coarsening is task-agnostic.
InfoGraph. We also report the results of InfoGraph [34]. It is the closest method to our work: it is unsupervised, end-to-end differentiable, and also relies on mutual information maximization, but it does not benefit from coarsening. InfoGraph is currently the state of the art in unsupervised graph representation learning.

Table 4: Ablative studies. The accuracy on the test set is in column "Accuracy". The column "Delta" corresponds to the difference in average accuracy with Graph2Vec. OOM is Out of Memory Error.

Results
We obtain substantial improvements over the Graph2Vec baseline on many datasets, more specifically when the graphs are large and carry high-dimensional features. For FRANKENSTEIN, if we connect a randomly initialized 2-layer CNN to the input of the model for better feature extraction, the results improve and reach 66.50%. On the notably difficult REDDIT-B, we reach the state of the art among all considered methods, including the supervised ones. On REDDIT-MULTI-5K, we are only outperformed by GIN, by a small margin. The coarsening operation is beneficial to these datasets, considering the size of the graphs. On datasets with smaller graphs, the results are less significant.

Computation time.
Training on only 1 epoch already provides a strong baseline for the molecule datasets, and lasts less than 1 minute on a GTX 1080 GPU. The most challenging dataset was REDDIT-MULTI-5K, trained on a V100 GPU, with 5000 graphs and an average of 508 nodes and 595 edges per graph. The pre-computation of Loukas' coarsening required 40 minutes (which could be improved with parallelization, not implemented here). After this step, the runtime never exceeded 190 seconds per epoch, for a total training time of 70 minutes.

Inductive learning
The method is inductive: it allows us to train HG2V on only part of the data, or even on another dataset. The results are summarized in Table 3. The training set is used to extract relevant features, which are expected to be seen again during inference.
USPS is a dataset of handwritten digits similar to MNIST; hence we expect that a model trained on it can also embed MNIST graphs. The other datasets are split into Train (80% of the data) and Inference (20% of the data). The best hyper-parameters found in the previous section are kept as is, without further hyper-parameter tuning.

Results
We see that the model transfers easily from one dataset to another. MNIST seems to be a better prior than USPS, a behavior previously observed in transfer learning. On the other datasets, the accuracy remains within a reasonable range of the baseline. We deduce that a fraction of the dataset is sufficient to learn a good feature extractor.

Ablative Studies
We perform ablative studies by removing Loukas coarsening and the GNN separately. If we remove both, we fall back to Graph2Vec. The dimension of the embeddings is set to 1024.

Graph2Vec+GNN
We remove Loukas coarsening. The only difference with Graph2Vec is the replacement of WL iterations by forward passes through a GNN. All the available attributes are used.

Graph2Vec+Loukas
We remove the GNN. The resulting algorithm is Graph2Vec applied to the sequence of coarsened graphs. On the coarsened graphs, new labels are generated by concatenating and hashing the labels of the pooled nodes (as a WL iteration would do). The sequence of (unconnected) graphs is fed into Graph2Vec. Continuous attributes are ignored because WL cannot handle them. The results are summarized in Table 4.
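The concatenate-and-hash relabeling described above can be sketched as follows; the function and variable names are illustrative and not taken from the released code:

```python
import hashlib

def wl_relabel(pooled_labels):
    """Assign a new discrete label to each coarse node by hashing the
    sorted multiset of labels of the fine nodes pooled into it,
    mimicking what a WL iteration would do."""
    new_labels = {}
    for coarse_node, labels in pooled_labels.items():
        # Sort so the new label is invariant to the order of the pooled nodes.
        key = "|".join(sorted(map(str, labels)))
        new_labels[coarse_node] = hashlib.sha1(key.encode()).hexdigest()[:8]
    return new_labels

# Coarse nodes pooling the same multiset of labels get the same new label.
labels = wl_relabel({0: ["A", "B"], 1: ["B", "A"], 2: ["C"]})
assert labels[0] == labels[1]
assert labels[0] != labels[2]
```

Sorting before hashing makes the label permutation-invariant, which is the property the WL relabeling relies on.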

Results
On datasets with small graphs (fewer than 30 nodes on average), the use of coarsening is detrimental, resulting in a loss of accuracy compared to Graph2Vec. As soon as the graphs get large, coarsening leads to substantial improvements in accuracy. We also notice that the use of a GNN, with its ability to handle continuous attributes and to be trained to extract co-occurring features, leads to significant improvements on all datasets.

Latent space
Figure 6 illustrates some graphs (leftmost column) with their six closest neighbors in the learned latent space (columns 1 to 6), from the closest to the farthest, taken from MNIST and IMDB. We observe that isomorphic graphs share similar embeddings, and even when the graphs are not quite isomorphic they exhibit similar structures.
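Retrieving these neighbors is a plain nearest-neighbor query over the embedding vectors; a minimal sketch, assuming Euclidean distance (the metric actually used for the figure is not specified here):

```python
import math

def six_nearest(embeddings, query_idx, k=6):
    """Return the indices of the k embeddings closest to the query,
    sorted from closest to farthest (Euclidean distance)."""
    q = embeddings[query_idx]
    order = sorted((i for i in range(len(embeddings)) if i != query_idx),
                   key=lambda i: math.dist(q, embeddings[i]))
    return order[:k]

emb = [[0.0, 0.0], [1.0, 0.0], [0.1, 0.1], [5.0, 5.0]]
print(six_nearest(emb, 0, k=2))  # → [2, 1]
```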

Conclusion
We proposed a new method for graph representation learning. It is fully unsupervised, and it learns to extract features by mutual information maximization. The use of Loukas's coarsening allows us to tackle all scales simultaneously, thanks to its capacity to preserve the graph spectrum. The method is inductive, and can be trained on only a fraction of the data before being used in transfer learning settings. Despite being unsupervised, the method produces high-quality embeddings, leading to competitive results on classification tasks against supervised methods.
The continuity of S_t allows us to further conclude that the right-hand side of (18) must have limit 0, hence lim_{n→∞} F(g_n) = F(g). We just proved that lim_{n→∞} g_n = g implies lim_{n→∞} F(g_n) = F(g), which is precisely the definition of F being continuous w.r.t. the topology induced by d_G.

B Additional visualizations of the embeddings
We present other randomly sampled graphs and their six closest neighbors from the MNIST-Graph, IMDB and PTC datasets in Figures 7, 8 and 9, respectively.

C Details about the datasets
In this section we give additional details on some datasets we used.

C.1 DLA
The DLA dataset has been artificially generated with Diffusion Limited Aggregation [42], a random process that creates clusters of particles following a Brownian motion. Particles are added one by one; when two particles touch, they aggregate with some stickiness probability p.
The resulting structure is a tree, each particle being a node and each bond between touching particles being an edge. The resulting graphs have scale-free properties [43]. The degree distribution of the nodes and their positions in space depend on p.
We generated a total of 1000 graphs with 500 nodes each. This dataset is split into two classes, one with stickiness p = 1 and the other with stickiness p = 0.05. The attributes are the x and y coordinates of the particles, following a 2D Brownian motion for simplicity.
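The generation process can be sketched as follows. This is a minimal on-lattice variant for illustration only (the actual dataset uses off-lattice 2D Brownian motion, so details differ):

```python
import random

def dla_tree(n_particles, stickiness, box=15, seed=0):
    """Grow a DLA cluster on a 2D lattice.  Each new particle performs a
    random walk inside a bounded box until it is adjacent to the cluster,
    then sticks with probability `stickiness`; each sticking event creates
    one tree edge (child position, parent position)."""
    rng = random.Random(seed)
    occupied = {(0, 0)}           # seed particle at the origin
    edges = []
    moves = [(1, 0), (-1, 0), (0, 1), (0, -1)]

    def spawn():
        while True:
            p = (rng.randint(-box, box), rng.randint(-box, box))
            if p not in occupied:
                return p

    while len(occupied) < n_particles:
        x, y = spawn()
        while True:
            touching = [(x + dx, y + dy) for dx, dy in moves
                        if (x + dx, y + dy) in occupied]
            if touching and rng.random() < stickiness:
                occupied.add((x, y))
                edges.append(((x, y), touching[0]))  # bond = tree edge
                break
            dx, dy = rng.choice(moves)
            nx_, ny_ = x + dx, y + dy
            # stay inside the box and never step onto an occupied site
            if abs(nx_) <= box and abs(ny_) <= box and (nx_, ny_) not in occupied:
                x, y = nx_, ny_
    return edges

edges = dla_tree(40, stickiness=1.0)
assert len(edges) == 39  # a tree on 40 particles has 39 edges
```

Lower stickiness lets particles wander further along the cluster before attaching, which changes the degree distribution, as described above.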
It has been observed that Graph2Vec is unable to reach a good accuracy by relying solely on the node degree distribution, while Hierarchical Graph2Vec is able to use the features and reaches near-perfect accuracy. The code that generated this dataset can be found here: https://github.com/Algue-Rythme/DiffusionLimitedAgregation

C.2 MNIST and USPS
We produce graphs from the MNIST (resp. USPS) handwritten digits dataset. The graphs are created by removing all pixels with luminosity equal to 0 (resp. less than 0.3), mapping each remaining pixel to a node, and adding the x and y coordinates of each node to the vector of attributes. Due to the size of MNIST (70,000 images in total) we kept only the test split (10,000 images) to train the embeddings, while we kept the whole dataset for USPS (9298 images). These graphs nevertheless remain considerably larger than those of the other standard benchmarks.
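The image-to-graph conversion can be sketched as follows. The edge rule is an assumption (8-connectivity between kept pixels), since the construction of edges is not detailed above:

```python
def image_to_graph(image, threshold=0.0):
    """Convert a grayscale image (2D list of floats in [0, 1]) into an
    attributed graph: one node per pixel with luminosity above `threshold`,
    with node attributes (luminosity, x, y).  Edges connect neighboring
    kept pixels (8-connectivity, an assumption for this sketch)."""
    nodes, attrs = {}, []
    for y, row in enumerate(image):
        for x, lum in enumerate(row):
            if lum > threshold:
                nodes[(x, y)] = len(attrs)
                attrs.append((lum, x, y))
    edges = []
    for (x, y), i in nodes.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                j = nodes.get((x + dx, y + dy))
                if j is not None and j > i:  # emit each undirected edge once
                    edges.append((i, j))
    return attrs, edges

img = [[0.0, 0.9, 0.0],
       [0.0, 0.8, 0.7],
       [0.0, 0.0, 0.6]]
attrs, edges = image_to_graph(img)
print(len(attrs), len(edges))  # → 4 5
```

With `threshold=0.0` this matches the MNIST rule above; `threshold=0.3` gives the USPS rule.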
(a) Cycle: the 2-regular graph with one giant connected component.
(b) Tree: 3 children per node, except at the last level.
(c) Wheel: like the cycle graph, with an additional node connected to all the others.
(d) Ladder: two paths of 250 nodes each, where each pair of corresponding nodes is joined by an edge.
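These four families can be generated in a few lines; a pure-Python sketch producing edge lists (illustrative only, not the code used for the experiments):

```python
def cycle(n):
    """2-regular graph: one cycle through all n nodes."""
    return [(i, (i + 1) % n) for i in range(n)]

def wheel(n):
    """Cycle on n nodes plus a hub node connected to all of them."""
    return cycle(n) + [(n, i) for i in range(n)]

def ladder(n):
    """Two paths of n nodes; node i of one path joined to node i of the other."""
    rails = ([(i, i + 1) for i in range(n - 1)] +
             [(n + i, n + i + 1) for i in range(n - 1)])
    rungs = [(i, n + i) for i in range(n)]
    return rails + rungs

def tree(depth, k=3):
    """Balanced tree with k children per node, except at the last level."""
    n = sum(k ** d for d in range(depth + 1))
    return [((c - 1) // k, c) for c in range(1, n)]

print(len(ladder(250)))  # → 748
```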

Figure 2 :
Figure 2: Similarity score as a function of the number of edges removed, for different stages of WL iterations. The similarity score reaches 100% for identical sets of labels, and 0% for disjoint sets of labels.
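One natural instantiation of such a score is the Jaccard similarity between the two multisets of WL labels; the exact definition used for the figure is an assumption here:

```python
from collections import Counter

def label_similarity(labels_a, labels_b):
    """Jaccard similarity between two multisets of WL labels, in percent:
    100 for identical multisets, 0 for disjoint ones."""
    a, b = Counter(labels_a), Counter(labels_b)
    inter = sum((a & b).values())  # multiset intersection (min counts)
    union = sum((a | b).values())  # multiset union (max counts)
    return 100.0 * inter / union if union else 100.0

print(label_similarity(["x", "x", "y"], ["x", "x", "y"]))  # → 100.0
print(label_similarity(["x"], ["y"]))  # → 0.0
```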

Figure 6 :
Figure 6: Six nearest neighbors in the learned latent space for four graphs from IMDB-b and MNIST. Column 0 corresponds to a randomly chosen graph; the six nearest neighbors are then drawn in increasing distance order from left to right (1 to 6).

Figure 7 :
Figure 7: Six nearest neighbors in the latent space for four graphs from MNIST. Column 0 corresponds to the randomly chosen graph; the six nearest neighbors are then drawn in increasing distance order from left to right (1 to 6).

Figure 8 :
Figure 8: Six nearest neighbors in the latent space for ten graphs from IMDB. Column 0 corresponds to the randomly chosen graph; the six nearest neighbors are then drawn in increasing distance order from left to right (1 to 6).

Figure 9 :
Figure 9: Six nearest neighbors in the latent space for six graphs from PTC. Column 0 corresponds to the randomly chosen graph; the six nearest neighbors are then drawn in increasing distance order from left to right (1 to 6).

Table 2 :
Accuracy on classification tasks. HG2V is trained over both the TrainVal and Test splits, without using labels due to its unsupervised nature. Model selection for the C-SVM and the hyper-parameters of HG2V were done with 5-fold cross-validation over the TrainVal split. We report the accuracy over the Test split, averaged over 10 runs, with standard deviation. Unavailable results are marked as missing.
Let |V| be the maximum number of nodes in a graph, B the batch size, L the number of stages, a the order of the truncated Krylov, and d the dimension of the node embeddings. The complexity of the algorithm per batch is O