Real Quadratic-Form-Based Graph Pooling for Graph Neural Networks

Abstract: Graph neural networks (GNNs) have developed rapidly in recent years because they can operate over non-Euclidean data and possess promising predictive power in many real-world applications. Graph classification is one of the central problems for graph neural networks; it aims to predict the label of a graph by training a graph neural network over graph-structured datasets. The graph pooling scheme is an important component of graph neural networks for the graph classification objective. Previous works typically apply the graph pooling scheme in a linear manner. In this paper, we propose a real quadratic-form-based graph pooling framework for graph neural networks in graph classification. The quadratic form can capture pairwise relationships, which brings a stronger expressive power than existing linear forms. Experiments on benchmarks verify the effectiveness of the proposed quadratic-form-based graph pooling scheme in graph classification.


Introduction
Artificial intelligence has prospered in recent years due to the revolution of deep learning. Artificial neural networks attempt to simulate the power of the human brain and approach human-level performance in some tasks. Initially, deep neural networks dealt only with regular data, such as text and images. However, irregular data occur frequently in the real world, and developing effective techniques for this scenario is desirable. Later, graph neural networks (GNNs) emerged [1] and gave rise to wide research. This technique has many realistic applications, such as recommendation systems [2], traffic prediction [3] and drug target prediction [4].
Formally, GNNs are similar to deep neural networks in a layer-by-layer manner. They differ in that GNNs consider the relationship between nodes, which is reflected by edges in the underlying graph structure. As a result, GNNs not only use the features of data, but also consider the pairwise relation between data. From the statistical viewpoint, the data features in a graph are no longer independent. GNNs are one of the most popular techniques used to deal with this case.
GNNs learn low-dimensional embeddings for the nodes in a graph by aggregation and transformation. The aggregator usually acts on the neighbourhood of a node; the size of the neighbourhood determines the range of dependence of a node on other nodes. Formally, the feature representation of a node in the l-th layer of a graph neural network is generated by an aggregation function acting on the representations from the (l-1)-th layer. The convolution operation in deep neural networks can be regarded as a special aggregation over a local grid region. Consecutive aggregations lead to hierarchical node-level feature representations. The resulting representations can be used for different purposes, such as semi-supervised node classification and graph classification.
In this paper, we are concerned with the graph classification problem. Facing this problem, we need to transform the resulting embeddings into graph-level features so that a unified classifier can be followed for classification. This step is often called graph pooling [5], and produces graph-level features. In the training set, the sizes of graphs may be inconsistent; the graph pooling strategy must therefore be adaptive to different sizes. In other words, the graph pooling strategy should be invariant to arbitrary sizes. In addition, it should be invariant to the order of the nodes in a graph. Otherwise, an essentially identical graph could produce multiple prediction results due to a change in node order. This ambiguity should be avoided. For this reason, invariant GNNs have been intensively studied, such as [6,7]. Recall that GNNs are invariant if their output does not depend on the order of the input nodes. Recently, Keriven and Peyré [6] provided universal approximation theorems for invariant and equivariant GNNs. The invariance property makes the output of graph neural networks stable and adaptive, which is effective in real-world applications.
Previous works on graph neural networks for graph classification typically combine graph pooling with a multi-layer perceptron to generate the final graph-level features [6]. This kind of strategy has two drawbacks: (1) the scheme may introduce many parameters due to the multi-layer perceptron; (2) existing pooling schemes using a linear form neglect possible pairwise relationships. Hence, we propose a real quadratic-form-based graph pooling framework that generates graph-level features without a further multi-layer perceptron before classification. It comprises quadratic-form-based expressions. Generally, the number of quadratic forms should coincide with the number of classes. By comparison, quadratic-form-based graph pooling layers require the fewest parameters and take the pairwise relationship into account. It is worth mentioning that the proposed quadratic-form-based graph pooling can be easily implemented in popular open-source deep learning frameworks, such as TensorFlow and PyTorch.
The remainder of this article is organised as follows. The second section provides some related works on graph neural networks for graph classification. The third section provides some preliminaries on the problem formulation and the procedure of graph classification. The fourth section presents the proposed quadratic-form-based graph pooling framework. Experiments on benchmarks are conducted in the fifth section. Finally, we conclude this paper.

Related Works
In recent years, graph neural networks [1] have become a hot research topic in the field of machine learning. They can be used for semi-supervised node classification, link prediction and graph classification.
Graph convolutional networks (GCNs) [1] are important and have many variants [2,3,6,7]. Their graph convolution is constructed by a localised first-order approximation of spectral graph convolution. Mathematically, it is approximated with the help of a truncated summation of Chebyshev polynomials.
Graph isomorphism networks (GINs) [8] are simple architectures whose expressive power is comparable to the Weisfeiler-Lehman graph isomorphism test [9-11], which tests whether two graphs are essentially identical in the sense of topology. GINs can fit the training data well in most cases.
WL [12] can rapidly extract features by the Weisfeiler-Lehman graph isomorphism test. It transforms a graph into a sequence of graphs, whose node attributes reveal topological and label information.
Graph pooling is a key step in the graph classification problem with graph neural networks. Second-order graph pooling (SOPOOL) [13] can treat the challenge of variable sizes and isomorphic structures of graphs. Bilinear mapping, attentional pooling and hierarchical pooling are also developed in [13].
No matter how the nodes' order in a graph changes, GNNs must be invariant or equivariant (to permutation) because the underlying graph structure is essentially unchanged. There are few works on invariant and equivariant graph neural networks, such as [6,7]. Universal approximation theorems are provided for a better understanding of invariant and equivariant networks in [6], which extends the classical universal approximation result for a multi-layer perceptron (MLP) with a single hidden layer [14].
In heterogeneous GNNs, aggregation is often realised by meta-paths. HPN [15] employs an appropriate weighting strategy such that deeper embeddings become distinguishable. HetGNN [16] samples heterogeneous neighbours of a fixed size and then uses two modules to aggregate this neighbourhood information. HHIN [17] uses the random walk to generate meta-paths and the hyperbolic distance to evaluate proximity.

Preliminaries
In this section, we formulate the problem setting and the procedure of graph classification. The problem setting is stated mathematically, which gives a clear aim for the research. The procedure of graph classification provides the routine of the general pipeline on how to realise this aim.

Problem Setup and Notations
A graph G consists of a vertex set V and an edge set E. We write G = <V, E>. Any node v in V is assigned a feature $x_v \in \mathbb{R}^n$. Here, this feature reflects the quantitative information of node v. Let $A_G$ be the adjacency matrix, whose entries are 1 or 0. If two nodes are connected, i.e., there is an edge between them, the value is 1; otherwise, the value is 0. The adjacency can be given naturally, for example, by a molecule. In addition, the adjacency can also be constructed by hand; for example, two nodes are connected if the similarity between them is high. Given a dataset $\{(G_i, y_i)\}_{i=1}^{N}$ of graphs with graph labels, the task of graph classification is to establish a model on the basis of this dataset. The type of model in this paper is confined to graph neural networks. As a matter of fact, each $G_i$ can be regarded as a sample from an unknown distribution.
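As a small illustration, the adjacency matrix $A_G$ of a hand-constructed graph can be built directly from its edge set (the graph and values below are illustrative, not from the datasets used later):

```python
# Build the adjacency matrix A_G of a 4-node undirected graph from its edge set.
n = 4
edges = [(0, 1), (0, 2), (2, 3)]        # illustrative undirected edges

A = [[0] * n for _ in range(n)]
for i, j in edges:
    A[i][j] = A[j][i] = 1               # entry is 1 iff the two nodes are connected

# A is symmetric with zero diagonal:
# [[0, 1, 1, 0],
#  [1, 0, 0, 0],
#  [1, 0, 0, 1],
#  [0, 0, 1, 0]]
```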

Procedure of Graph Classification
The procedure of graph classification using graph neural networks contains the following five steps.
Step 1. Aggregate the information of the neighbourhood by some aggregation function. Mathematically, this process is expressed by

$a_v^{(l)} = \mathrm{AGGREGATE}^{(l)}\big(\{h_u^{(l-1)} : u \in N(v)\}\big).$

The common aggregation operators are SUM and AVG. Formally, the SUM-operator-based aggregation is

$a_v^{(l)} = \sum_{u \in N(v)} h_u^{(l-1)},$

and the AVG-operator-based aggregation divides this sum by $|N(v)|$, where N(v) is the neighbourhood of node v and $|N(v)|$ denotes the cardinality of the set N(v).
Step 2. Combine the aggregated features with the feature from the last layer. Mathematically, this is characterised as

$h_v^{(l)} = \mathrm{COMBINE}^{(l)}\big(h_v^{(l-1)}, a_v^{(l)}\big).$

Step 3. Use a graph pooling scheme to obtain a graph-level feature.
Step 4 (Optional). Employ a multi-layer perceptron to obtain a final graph-level feature.
Step 5. Choose an appropriate classifier.
Step 1 and Step 2 proceed alternately. The meaning of aggregation in Step 1 is the integration of representations in the neighbourhood of a node. In Figure 1, we illustrate the node v and its neighbourhood. By the above problem setting, each node admits a feature representation, and these feature representations can be gathered in some way. Because these nodes are connected to the node v, the aggregation makes sense. After aggregation, the resulting vector from the neighbourhood can be combined with the feature representation of node v. In this way, the representation of node v is enriched with the help of the information from the neighbourhood.
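Steps 1 and 2 can be sketched in plain Python on a toy graph; the three-node graph, the feature values, and the simple additive combine rule here are illustrative assumptions (real GNNs apply a learned transformation in the combine step):

```python
# Toy graph: adjacency list and 2-dimensional node features (illustrative values).
neighbours = {0: [1, 2], 1: [0], 2: [0]}
features = {0: [1.0, 2.0], 1: [3.0, 0.0], 2: [0.0, 1.0]}

def aggregate_sum(v):
    """Step 1: SUM-aggregate the features of v's neighbourhood N(v)."""
    agg = [0.0, 0.0]
    for u in neighbours[v]:
        agg = [a + f for a, f in zip(agg, features[u])]
    return agg

def combine(v, agg):
    """Step 2: combine the aggregated vector with v's own feature
    (element-wise addition here; a learned map in practice)."""
    return [x + a for x, a in zip(features[v], agg)]

agg0 = aggregate_sum(0)      # [3.0, 1.0]
h0_new = combine(0, agg0)    # [4.0, 3.0]
```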
In Step 3, the graph pooling scheme should satisfy the invariance property when the order of the nodes changes. In Figure 2, we simply illustrate this process. After obtaining the feature by the graph pooling operation, a multi-layer perceptron can follow to generate features of a higher level. Of course, this step is not necessary. It may bring additional parameters from the multi-layer perceptron, which increases the burden of network optimisation.
In Step 4, a classifier can be employed over final graph-level features for graph classification. For example, the classifier can be chosen as a softmax classifier that produces the probabilistic output.


Graph Pooling Framework
Graph pooling is a main step in the procedure of graph classification. In this section, we propose a novel graph pooling framework based on a (real) quadratic form that can capture possible pairwise relationships. We also provide an instantiation of this framework.

Review of General Real Quadratic Form from Linear Algebra
In linear algebra, a real quadratic form in variables $\mathbf{x} = (x_1, \ldots, x_n)^\top$ is generally expressed as

$q(\mathbf{x}) = \mathbf{x}^\top Q\, \mathbf{x} = \sum_{i=1}^{n} \sum_{j=1}^{n} Q_{ij}\, x_i x_j.$

We give a concrete example as follows:

$q(x_1, x_2) = x_1^2 + 4x_1x_2 + 3x_2^2 = (x_1, x_2)\begin{pmatrix} 1 & 2 \\ 2 & 3 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix},$

where the middle matrix is Q, which is real symmetric.
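A small numeric check of the expansion $\mathbf{x}^\top Q\,\mathbf{x} = \sum_{i,j} Q_{ij} x_i x_j$ in plain Python, with an illustrative symmetric Q:

```python
# Evaluate q(x) = x^T Q x for a small symmetric Q, both directly
# (matrix-vector products) and via the expansion sum_{i,j} Q_ij * x_i * x_j.
Q = [[1.0, 2.0],
     [2.0, 3.0]]          # real symmetric 2x2 matrix (illustrative)
x = [1.0, -1.0]

direct = sum(x[i] * sum(Q[i][j] * x[j] for j in range(2)) for i in range(2))
expanded = sum(Q[i][j] * x[i] * x[j] for i in range(2) for j in range(2))
# direct == expanded: 1 - 2 - 2 + 3 = 0.0
```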

The Proposed Approach
Observing the structure of a (real) quadratic form, every term contains the product of two elements. In other words, it is a quadratic coupling that encodes a pairwise relationship. Let $H = (h_1, \ldots, h_{n_G})^\top \in \mathbb{R}^{n_G \times f_G}$ be the embeddings in the k-th layer of some architecture of GNNs, extracted by the graph neural network, where $n_G$ is the number of vertices and $f_G$ is the latent dimension.
The architecture of GNNs is chosen as GIN [8] in this paper. The update formula for the nodes' representations is

$h_v^{(k)} = \mathrm{MLP}^{(k)}\Big(\big(1 + \epsilon^{(k)}\big)\, h_v^{(k-1)} + \sum_{u \in N(v)} h_u^{(k-1)}\Big),$

where $\mathrm{MLP}^{(k)}$ is the multi-layer perceptron in the k-th layer, $\epsilon^{(k)}$ is a scalar and N(v) denotes the neighbourhood of node v.
Here, the aggregation function is summation, which sums the representations in the neighbourhood of each node. Let

$P(\mathbf{x}) = \mathbf{x}^\top Q\, \mathbf{x},$

where $\mathbf{x}$ is a free variable and Q is specified as

$Q = \mathrm{ReLU}(H)\, \mathrm{ReLU}(H)^\top,$

and ReLU(z) = z, if z ≥ 0; 0, otherwise, applied element-wise.
In fact, Q can be regarded as a similarity matrix, as

$Q_{ij} = \langle \mathrm{ReLU}(h_i), \mathrm{ReLU}(h_j) \rangle,$

where the inner-product similarity between node i and node j is $\mathrm{ReLU}(h_i)^\top \mathrm{ReLU}(h_j)$ and $h_i$ denotes the i-th row of H. Hence, $x_i x_j Q_{ij}$ is a weighted summation of pairwise similarities in which the weight of each term is also a pairwise product. When the quadratic form is positive semi-definite, we have

$\mathbf{x}^\top Q\, \mathbf{x} = \|\mathrm{ReLU}(H)^\top \mathbf{x}\|^2 \ge 0.$

The above setting of Q exactly makes this hold because the product of the transpose of a matrix with itself must be positive semi-definite.
It is easy to implement this kind of quadratic form in PyTorch by invoking "nn.Linear" and "torch.norm".
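A minimal PyTorch sketch of one such quadratic-form pooling term. The shapes, the construction of Q from ReLU(H), the input H u to the form, and the class name are our illustrative reading of the scheme, not a definitive implementation:

```python
import torch
import torch.nn as nn

class QuadraticFormPooling(nn.Module):
    """One quadratic-form pooling term P = (H u)^T Q (H u) with
    Q = ReLU(H) ReLU(H)^T, i.e. the squared norm ||ReLU(H)^T H u||^2.
    Shapes and parametrisation are illustrative assumptions."""

    def __init__(self, f_g: int):
        super().__init__()
        self.u = nn.Linear(f_g, 1, bias=False)  # the parametric vector u

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (n_G, f_G) node embeddings; works for any number of nodes n_G.
        x = self.u(h).squeeze(-1)               # (n_G,)  node scores H u
        m = torch.relu(h).t() @ x               # (f_G,)  ReLU(H)^T (H u)
        return torch.norm(m) ** 2               # scalar  x^T Q x >= 0

pool = QuadraticFormPooling(f_g=4)
out = pool(torch.randn(7, 4))                   # 7-node toy graph
```

Because the output is a squared norm of a sum over nodes, it is non-negative and invariant to any permutation of the node order, matching the invariance requirement of Step 3.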
The resulting graph-level representation is then

$\mathbf{s} = \big(P(H\mathbf{u}_1), \ldots, P(H\mathbf{u}_C)\big)^\top,$

where C is the number of classes and $\mathbf{u}_i$ is a parametric vector that serves as a connection between the node embeddings and the quadratic form.
Here, we choose C quadratic forms for generating a final graph-level feature. When we use a softmax classifier for this feature, the probabilistic output is

$p_i = \frac{\exp(s_i)}{\sum_{j=1}^{C} \exp(s_j)}, \quad i = 1, \ldots, C.$

Then, the label of graph G is predicted as

$\hat{y} = \arg\max_{1 \le i \le C} p_i.$

We provide the derivatives of P as follows, which are an ingredient in first-order optimisation.

Proposition 1. The partial derivatives of $P(\mathbf{x}) = \mathbf{x}^\top Q\,\mathbf{x}$ with symmetric Q are

$\frac{\partial P}{\partial \mathbf{x}} = 2Q\mathbf{x} \quad \text{and} \quad \frac{\partial P}{\partial Q} = \mathbf{x}\mathbf{x}^\top.$

Proof. Recall that the expression of P is actually

$P = \sum_{i=1}^{n} \sum_{j=1}^{n} Q_{ij}\, x_i x_j.$
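The softmax output and predicted label can be sketched in a few lines of Python; the scores are toy values standing in for the C quadratic-form outputs:

```python
import math

def softmax(scores):
    """Probabilistic output p_i = exp(s_i) / sum_j exp(s_j)."""
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

scores = [2.0, 0.5, 1.0]                 # C = 3 quadratic-form outputs (toy)
probs = softmax(scores)
label = max(range(len(probs)), key=probs.__getitem__)   # argmax -> class 0
```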
Through basic knowledge of calculus, we have

$\frac{\partial P}{\partial x_k} = \sum_{j=1}^{n} Q_{kj} x_j + \sum_{i=1}^{n} Q_{ik} x_i = 2 \sum_{j=1}^{n} Q_{kj} x_j,$

where the last equality uses the symmetry of Q. To derive the second partial derivative, we need the following transformations:

$P = \mathbf{x}^\top Q\, \mathbf{x} = \mathrm{tr}\big(\mathbf{x}^\top Q\, \mathbf{x}\big) = \mathrm{tr}\big(Q\, \mathbf{x}\mathbf{x}^\top\big),$

where tr(·) denotes the trace of a matrix. Then, we have

$\frac{\partial P}{\partial Q} = \big(\mathbf{x}\mathbf{x}^\top\big)^\top = \mathbf{x}\mathbf{x}^\top.$
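The first identity, $\partial P / \partial \mathbf{x} = 2Q\mathbf{x}$ for symmetric Q, can be checked numerically with central finite differences (toy values):

```python
# Finite-difference check of dP/dx = 2 Q x for P(x) = x^T Q x, symmetric Q.
Q = [[2.0, 1.0],
     [1.0, 3.0]]          # illustrative symmetric matrix
x = [0.5, -1.0]

def P(v):
    return sum(Q[i][j] * v[i] * v[j] for i in range(2) for j in range(2))

eps = 1e-6
numeric = []
for k in range(2):
    xp = list(x); xp[k] += eps
    xm = list(x); xm[k] -= eps
    numeric.append((P(xp) - P(xm)) / (2 * eps))   # central difference

analytic = [2 * sum(Q[k][j] * x[j] for j in range(2)) for k in range(2)]
# analytic == [0.0, -5.0]; numeric agrees up to floating-point error
```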

Comparison
Recall that Step 4 is usually optional in the aforementioned procedure of graph classification. That is, a multi-layer perceptron is employed to obtain the final graph-level feature. Existing works on graph pooling typically add the multi-layer perceptron for better prediction performance. AVG and SUM are two simple graph pooling operators; the former averages and the latter sums. The advantage of these two operators is that they are parameter-free. When a fully connected (FC) layer follows, the number of parameters is f_G C. Hence, both AVG+FC and SUM+FC contain f_G C parameters. The recent second-order pooling method with attention, SOPOOL_attn [13], possesses a subsequent multi-layer perceptron. As a matter of fact, it uses a one-layer perceptron, i.e., a fully connected layer. Hence, its number of parameters is the sum of f_G and f_G C, where f_G is the number of parameters in the attention and f_G C is the number of parameters in the fully connected layer. As a summary, we display the comparison in Table 1. Table 1. Comparison on the amount of parameters.
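Plugging illustrative sizes into these counts (f_G = 64, C = 2; the count for the proposed pooling assumes C parametric vectors of dimension f_G, which is our reading of Table 1):

```python
# Parameter counts from the comparison, for illustrative sizes
# f_G = 64 hidden dimensions and C = 2 classes.
f_g, c = 64, 2

avg_fc = f_g * c             # AVG/SUM pooling (parameter-free) + FC layer
sopool_attn = f_g + f_g * c  # attention weights plus the FC layer
quadratic = f_g * c          # C parametric vectors u_i of dimension f_G (assumed)

# avg_fc == 128, sopool_attn == 192, quadratic == 128
```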

Method                  Multi-Layer Perceptron    #Parameters
AVG+FC                  Yes                       f_G C
SUM+FC                  Yes                       f_G C
SOPOOL_attn [13]        Yes                       f_G + f_G C
Quadratic-form (ours)   No                        f_G C

Dataset Description
There were nine graph classification datasets from [18] in our experiments. They can be roughly categorised into two kinds: bioinformatics and social network datasets. For clarity, we introduce them one by one. The statistics of the datasets are displayed in Table 2.
MUTAG is a bioinformatics dataset. It contains 188 graphs, each representing a nitro compound. The maximum number of nodes is 28 and the average is 18. Every node in a graph has one of seven discrete node labels [19]. There are two kinds of graph labels.
PTC is a bioinformatics dataset. It consists of 344 graphs, each representing a chemical compound [20]. The maximum number of nodes is 109 and the average is 25.6. Every node bears one of 19 discrete node labels. There are two kinds of graph labels.
PROTEINS is a bioinformatics dataset. It comprises 1113 graph structures of proteins. Nodes in the graphs represent secondary structure elements and are assigned discrete node labels. The maximum number of nodes is 620 and the average is 39.1. Edges indicate that two nodes are adjacent along the amino-acid sequence or in space. There are two kinds of graph labels.
NCI1 is a bioinformatics dataset. It includes 4110 graphs in all, each representing a chemical compound [21]. The maximum number of nodes is 111 and the average is 29.9. Every node possesses one of 37 discrete node labels. There are two kinds of graph labels.
COLLAB is a scientific collaboration dataset. It contains 5000 graphs in total. Each graph is an ego-network generated as in [22], and the dataset originates from public collaboration datasets [23]. Every ego-network includes researchers from a field, and its label is named by the field. The maximum number of nodes is 492 and the average is 74.5. There are three kinds of graph labels.
IMDB-BINARY is a movie collaboration dataset. It has 1000 graphs, each corresponding to an ego-network of actors/actresses. The dataset is induced from collaboration graphs on the Action and Romance genres. Actors/actresses are regarded as nodes and edges indicate that they appear in the same movie. The maximum number of nodes is 136 and the average is 19.8. Each graph is labelled with the corresponding genre, and the task is to predict the genre of a graph.
IMDB-MULTI is a multi-class version of IMDB-BINARY. It has 1500 ego-networks and three genre classes: Comedy, Romance and Sci-Fi. The maximum number of nodes is 89 and the average is 13.0.
REDDIT-BINARY is a dataset of 2000 graphs, each corresponding to an online discussion thread. Nodes in a graph correspond to the users in the corresponding discussion thread, and an edge signifies that one user responded to another. The number of graph classes is 2. The maximum number of nodes is 3783 and the average is 429.6.
REDDIT-MULTI5K resembles REDDIT-BINARY. In total, there are 5000 graphs, crawled from five different subreddits: worldnews, videos, AdviceAnimals, aww and mildlyinteresting. The number of graph classes is five. The maximum number of nodes is 3783 and the average is 508.5.

Comparison Methods
The following methods were chosen for comparison.
WL [12]: On the basis of the Weisfeiler-Lehman test of isomorphism on graphs, it generates a graph sequence whose topological and label information are fused into attributes for future graph classification.
PATCHY-SAN [24]: By mimicking image-based convolution that operates on locally connected regions, it extracts locally connected regions from graphs and acts in a patch-like way.
DGCNN [25]: This is a kind of neural network that takes the underlying graph as an input and trains a classifier for graph classification.
GIN-0+AVG/SUM: This is a composite method that combines GINs [8] with an average operator or summation operator.

Implementation Details
Quadratic-form-based graph pooling was inserted into flat GNNs from recent graph isomorphism networks (GINs) [8], which have strong expressive power. The original GINs utilise AVG/SUM graph pooling to generate a graph-level feature: SUM graph pooling on the bioinformatics datasets and AVG graph pooling on the social network datasets. Here, we employed the proposed quadratic-form-based graph pooling scheme and kept the rest of the GIN architecture unchanged. For the flat GNNs, we followed the same training process as in [8]. All GINs used in the experiments had five layers and two-layer perceptrons with batch normalisation [26]. The Adam optimiser [27] was chosen for optimising the graph neural networks with an annealing strategy: the learning rate was initialised as 0.01 and decayed by a factor of 0.5 every 50 epochs. The hidden dimension was tuned over {16, 32, 64, 128} and the batch size was chosen from {32, 64, 128}. Ten-fold cross-validation was used to obtain the final experimental results. The best number of epochs was determined by the best cross-validation result.
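The annealing schedule described above (initial rate 0.01, halved every 50 epochs) has a simple closed form, matching the behaviour of a standard step-decay schedule:

```python
def learning_rate(epoch: int, base_lr: float = 0.01,
                  decay: float = 0.5, step: int = 50) -> float:
    """Learning rate after `epoch` epochs under step annealing:
    the rate is multiplied by `decay` once every `step` epochs."""
    return base_lr * decay ** (epoch // step)

# learning_rate(0) == 0.01, learning_rate(49) == 0.01,
# learning_rate(50) == 0.005, learning_rate(120) == 0.0025
```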
For WL [12], the height parameter was set to 2. The classifier was a support vector machine implemented by LIBSVM [28]. For PATCHY-SAN [24], we performed 1-dimensional WL normalisation; the width parameter was set to the average number of nodes, and receptive field sizes were tuned over {5, 10}. For DGCNN [25], the AdaGrad algorithm [29] was used as the optimiser, with learning rate 0.05. All weights were initialised by sampling from the Gaussian distribution N(0, 0.01).

Experimental Result and Analysis
The experiments were performed on the aforementioned nine datasets, and the results are reported in Table 3. The last column is the result of our method, i.e., GIN-0 with the proposed quadratic-form-based graph pooling scheme. It can be seen from the experimental results that our method almost always achieves the best results. It behaves better than linear forms, such as AVG, SUM and SOPOOL_attn [13]. The cause may be that the quadratic form involves the pairwise relationship. The quadratic form is a non-linear expression: it can extract non-linear representations under a pairwise pattern for nodes, while a linear form cannot obtain any pairwise information. The expression of the quadratic form includes second-order information because it is a kind of second-order polynomial. We also found that the proposed method behaves better on bioinformatics datasets than on social network datasets; numerical evidence for this is provided in Table 4. The cause of this phenomenon may be that the objectivity of the graph structure is stronger in bioinformatics datasets, where the edges of a graph are not necessarily constructed by people, than in social network datasets. From Table 4, we can see that the standard deviation is close for the two kinds of datasets, which may reveal that the proposed graph pooling scheme is relatively stable across domains.

Conclusions
In this paper, we considered the graph classification problem with graph neural networks and proposed a quadratic-form-based graph pooling scheme. Under this scheme, a multi-layer perceptron does not necessarily need to follow, and the scheme requires the fewest parameters when compared with existing methods. It can be easily implemented in popular deep learning frameworks, such as TensorFlow and PyTorch. Experiments demonstrate the effectiveness of the proposed quadratic-form-based graph pooling scheme.