Subgraph Adaptive Structure-Aware Graph Contrastive Learning

Graph contrastive learning (GCL) has attracted growing attention and has been widely applied to graph learning tasks such as node classification and link prediction. Although GCL has achieved great success, and even outperforms supervised methods on some tasks, most GCL methods rely on node-level comparison and ignore the rich semantic information contained in graph topology, especially in social networks. A higher-level comparison, however, requires subgraph construction and encoding, which remain unsolved. To address this problem, we propose a subgraph adaptive structure-aware graph contrastive learning method (PASCAL), which is a subgraph-level GCL method. In PASCAL, we construct subgraphs by merging all motifs that contain the target node. We then encode them on the basis of the motif number distribution to capture the rich information hidden in subgraphs. By incorporating motif information, PASCAL captures richer semantic information hidden in local structures than other GCL methods. Extensive experiments on six benchmark datasets show that PASCAL outperforms state-of-the-art graph contrastive learning and supervised methods in most cases.


Introduction
Nowadays, deep learning technologies such as federated learning and reinforcement learning are widely used in various fields [1,2]. For graph learning, graph neural networks (GNNs) have gradually become the mainstream methods [3], e.g., GAT [4] and GraphSAGE [5], and have received considerable attention due to their outstanding performance in various tasks. Although GNNs have achieved great success, most existing GNNs are supervised methods that rely on a large amount of labeled data. This is one of the most widely acknowledged limitations of GNNs. Figure 1a,c show the embeddings of the test nodes learned by GCN [6] and GCNII [7] when trained with 20 samples per class. Figure 1b,d, respectively, show the embeddings of the test nodes when trained with 40% labeled data. Comparing the left and right columns of Figure 1, we find that, for both GCN and GCNII, the more labeled data are used for training, the higher the quality of the learned node embeddings. This dependence on labels greatly limits the performance of GNNs in downstream tasks. Although the many available data acquisition methods [8,9] make it easy to obtain massive data for training models, the data quality is often unsatisfactory, especially for social data [10]. On the one hand, the problem of incomplete data is widespread in practice [11]. On the other hand, data annotation is expensive because it requires considerable expertise in many areas [12]. In these contexts, it is difficult for GNNs to achieve excellent performance because enough labeled data cannot be obtained. Therefore, it is meaningful and necessary to develop unsupervised graph representation learning methods. The main idea of previous unsupervised graph representation learning methods, such as CSADW [13] and VGAE [14], is to reconstruct the graph topology.
However, these methods overemphasize the proximity of graphs and perform unsatisfactorily in some contexts [15]. Unlike traditional grid data, such as images and texts, graphs are a non-Euclidean form of structured data containing complex relational structures. Such structures generally have specific meanings in different graphs. For example, a triangle structure can represent triadic closure in a social network, while it can also represent a special chemical structure in a chemical molecular network. Therefore, subgraph-aware methods have been proposed to enhance the effectiveness of downstream tasks in graph representation learning.
Graph contrastive learning (GCL) is currently the most representative unsupervised graph learning method [16-18]. Unlike other deep learning techniques [19], the intuition behind GCL is to learn prior knowledge from the data itself by comparing different views of the original graph, which can be explained by mutual information (MI) and triplet loss [20]. However, most existing GCL methods rely on node-level comparison. In some scenarios, such as social networks, such comparison cannot adequately capture the semantic information hidden in the local topology, resulting in sub-optimal performance. Although some subgraph-based GCL methods (e.g., [21,22]) have been proposed, their subgraph construction methods fail to capture significant semantic information. Specifically, they usually use nearest neighbors or random walks to construct subgraphs, which is too simple to fully capture the structural information in complex networks. Moreover, some of them generalize poorly because they can only be used for specific downstream tasks such as graph classification [23].
Our work. To solve the above problems, we propose a motif-based GCL method for unsupervised node classification, entitled the subgraph adaptive structure-aware graph contrastive learning model (PASCAL). Concretely, we first adaptively extract a subgraph for each node based on its motif information to capture the rich semantic information hidden in the local structure. Then, we use the feature masking and edge dropping augmentation strategies to generate two different graph views. Next, we use our proposed subgraph aggregation method to calculate subgraph embeddings, which are then regarded as node features fed to the next GNN layer. Finally, the graph encoder is optimized by maximizing the mutual information between the same nodes in different graph views. We conduct extensive experiments on various academic and social network datasets. Compared with previous methods, PASCAL performs better in most cases. The contributions of this work are summarized as follows:
• Rich semantic information representation. We propose an effective motif-based graph contrastive learning method, called PASCAL, for unsupervised node classification tasks. PASCAL employs motifs to formulate patterns containing rich semantic information, which significantly enhances the effectiveness of graph contrastive learning.
• Subgraph aggregation and encoding strategy. We propose a motif-based subgraph aggregation and encoding strategy, which is a plug-and-play component.
In the following, we first introduce existing works and preliminaries in Sections 2 and 3, respectively. Then, we present the details of PASCAL in Section 4. Section 5 introduces the experimental results. The effectiveness of our proposed motif-based subgraph aggregation strategy for semi-supervised models is evaluated in Section 6. Moreover, we also analyze the learned attention weights in Section 6, which shows that our model is explainable.

Graph Contrastive Learning
The main idea of graph contrastive learning is to maximize the mutual information between the anchor node and its positive samples while separating the anchor from negative samples. Deep graph infomax (DGI) [15] first learns node embeddings by maximizing the similarity between node embeddings and graph embeddings. However, DGI is a graph-level model that requires computing the embedding of the whole graph, which is too expensive for large-scale graphs; therefore, some node-level models have been proposed [20,24]. Two strategies, i.e., adaptive negative sampling and data augmentation, have gradually proved their significance in enhancing the effectiveness of GCL. Adaptive augmentation can select the optimal strategy from a set of augmentation strategies [25]. It can also be employed to dynamically determine the optimal parameters of a specific strategy instead of pre-defining them [16,26]. As for negative sampling, selecting optimal (or high-quality) negative samples to calculate the contrastive loss is the most significant method [27]. Alternatively, SelfGNN [28] introduces the bootstrap your own latent (BYOL) mechanism into graph contrastive learning to avoid explicit negative sampling.

Motif-Based Graph Learning
The main purpose of motif-based graph learning is to capture high-order or local-level structural information to improve model performance [29], as in [30,31]. An open question of motif-based graph learning is how to integrate motifs and graph learning methods in a reasonable way. Some methods optimize existing models based on graph motifs [32-34]. In particular, Xia et al. [32] propose a motif-based high-order clustering algorithm that can effectively improve the clustering efficiency for large social networks. Other methods regard graph motifs as auxiliary information to preprocess the input graph [35-38]. For instance, Zhang et al. [36] design a motif-based clustering algorithm that divides the graph into several small networks for traffic speed prediction in large urban traffic networks.
In general, under the unsupervised setting, it is more critical to capture the semantic information hidden in the graph topology as supervised signals to address the scarcity of labeled data. Therefore, we design an unsupervised model that can better capture graph structure information by combining graph contrastive learning with motifs.

Table 1 summarizes all notations used in this paper. Note that all bold notations represent a matrix or vector.

Table 1. Notations used in this paper.

Notation: Description

Network Related:
$G, G_1, G_2$: Input graph and two augmented graph views
$V$: The node set of $G$
$E$: The edge set of $G$
$|V|$: The number of nodes in $G$
$\mathbf{A}, \mathbf{A}_1, \mathbf{A}_2$: The adjacency matrices of $G, G_1, G_2$
$\mathbf{X}, \mathbf{V}, \mathbf{U}$: The feature matrices of $G, G_1, G_2$
$F$: The dimension of the nodes' input features
$C$: The number of node categories
$\mathbf{Y} \in \mathbb{R}^{|V| \times C}$: The label matrix of $G$
$\mathbf{H}^l$: The node embedding matrix in layer $l$

Motif Related:
$\mathcal{M}$: Motif set we define in this paper
$M \in \mathcal{M}$: Some type of motif
$m^t$: An instance of a type-$t$ motif
$\mathbf{M}$: Motif prototype vector
$\mathbf{m}$: The motif embedding
$\mathbf{P}$: Motif information of each node
$\mathbf{Q}$: The number matrix of each type of motif per node
$S_i, \mathbf{s}_i$: The motif-based subgraph centered on node $v_i$ and its embedding
$\mathbf{S}^l$: The subgraph embedding matrix in layer $l$

Operation:
$f_\phi(\cdot)$: The graph encoder with parameter $\phi$
$\mathcal{T}_\alpha(\cdot)$: Augmentation function with parameter $\alpha$
$\mathcal{J}(\cdot)$: The final loss function of the model
$\delta(\cdot, \cdot)$: The function that calculates the MI between inputs
$\mu(\cdot, \cdot)$: The cosine similarity function
$g(\cdot)$: Projection function
$\mathcal{D}(\cdot)$: Contrastive loss of a positive pair
$\mathrm{Agg}(\cdot)$: Aggregator for aggregating multiple vectors
$\mathrm{Mean}(\cdot)$: Mean aggregator
$\mathrm{Att}(\cdot)$: Aggregator based on the attention mechanism

Problem Definition
Given an undirected graph $G = (V, \mathbf{A}, \mathbf{X})$, $V = \{v_1, v_2, \ldots, v_N\}$, $\mathbf{A} \in \{0, 1\}^{N \times N}$, and $\mathbf{X} \in \mathbb{R}^{|V| \times F}$ represent the node set, adjacency matrix, and node feature matrix, respectively. The goal of unsupervised node classification is to train a graph encoder $f_\phi(\cdot)$ without using node labels. $\mathbf{H} = f_\phi(\mathbf{X}, \mathbf{A})$ represents the final learned node embedding matrix, which can be used to predict node labels with a linear classifier or support vector machine trained on labeled data, i.e., $\mathbf{Y} = g_w(\mathbf{H})$, where $g_w(\cdot)$ represents the classifier.

Network Motif
A network motif is a special kind of low-order structure that carries rich semantic information and occurs frequently in the network [39]. Table 2 shows all types of third-order and fourth-order network motifs, i.e., motifs with three and four nodes, respectively. In this paper, we use the first five predefined motifs [35] as auxiliary information for subgraph generation and aggregation; their ids in Table 2 range from 1 to 5. Moreover, we also regard an edge as a special kind of motif in the experiments and denote it as a second-order motif, so a total of six kinds of motifs are used for subgraph generation and aggregation.

The Design of PASCAL
We propose a motif-based graph contrastive learning method called PASCAL. As shown in Figure 2, PASCAL mainly consists of five components, which are described in detail in Sections 4.1-4.5.

Subgraph Generator
In this work, we use pre-statistical node motif information to adaptively construct a subgraph for each node separately. As shown in Figure 3, we first find all motifs related to the target node $v_i$, denoted by $M_i = \{m_1, m_2, \ldots, m_n\}$, where each $m_j$ represents a motif containing $v_i$. Subsequently, we merge all of these motifs into the motif-based subgraph centered on node $v_i$. The final subgraph centered on node $v_i$ is represented as:

$$S_i = \bigcup_{m_j \in M_i} m_j = \{n_1, n_2, \ldots, n_K\}, \tag{1}$$

where $n_k$ represents the $k$th node in the motifs. Algorithm 1 gives the pseudocode of the subgraph construction.

Algorithm 1 Subgraph construction.
Input: Target node $v_i$, the set of motifs $M_i$ containing the target node
1: for each motif $m$ in $M_i$ do
2: Add all nodes of $m$ to the subgraph node set $V$
3: Add all edges of $m$ to the subgraph edge set $E$
4: end for
5: return the subgraph $S_i = (V, E)$

Figure 3. The process of generating the subgraph for node $v_i$ on the basis of its motif information. The red node represents the target node $v_i$ and the blue nodes represent the nodes that appear in the same motif as $v_i$. For all networks in the middle box, the blue and red nodes represent all motifs containing the target node.
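The merge step of Algorithm 1 can be sketched in a few lines of Python. This is only an illustration: the function name `build_motif_subgraph` and the representation of a motif as a list of undirected edges are assumptions made for this sketch, not part of the original method description.

```python
def build_motif_subgraph(target, motifs):
    """Merge every motif that contains `target` into one subgraph.

    `motifs` is a list of motifs, each given as a list of undirected edges
    (a hypothetical representation chosen for this sketch).
    """
    nodes, edges = {target}, set()
    for motif_edges in motifs:
        motif_nodes = {u for edge in motif_edges for u in edge}
        if target not in motif_nodes:
            continue  # skip motifs that do not involve the target node
        nodes |= motif_nodes
        # Store each undirected edge once, as a sorted pair.
        edges |= {tuple(sorted(edge)) for edge in motif_edges}
    return nodes, edges


motifs_of_v0 = [
    [(0, 1), (1, 2), (0, 2)],  # a triangle containing node 0
    [(0, 3), (3, 4)],          # a three-node path containing node 0
    [(1, 0)],                  # a single edge (second-order motif)
]
sub_nodes, sub_edges = build_motif_subgraph(0, motifs_of_v0)
```

Because the node and edge collections are sets, nodes and edges shared by several motifs are stored only once, which matches the idea of merging motifs rather than concatenating them.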

Augmentation
We use two augmentation strategies that are commonly used in GCL, feature masking and edge dropping, to generate two different graph views.
Edge Dropping. Each edge of the input graph is dropped with a fixed probability. Formally, given a graph $G = (V, E, \mathbf{A}, \mathbf{X})$, we first randomly sample a mask matrix $\mathbf{R} \in \{0, 1\}^{|V| \times |V|}$ whose entries follow a Bernoulli distribution, $R_{ij} \sim \mathcal{B}(1 - p_r)$, if $A_{ij} = 1$ in the input graph, and $R_{ij} = 0$ otherwise. Here, $p_r$ is the probability of dropping an edge. The adjacency matrix of the augmented graph is computed by Equation (2):

$$\tilde{\mathbf{A}} = \mathbf{A} \circ \mathbf{R}, \tag{2}$$

where $\circ$ represents the element-wise product.

Feature Masking. We randomly mask some dimensions of the input node features with zeros. Specifically, let $\mathbf{X}$ be the original feature matrix with rows $\mathbf{x}_1, \ldots, \mathbf{x}_N$. We first randomly sample a vector $\tilde{\mathbf{m}} \in \{0, 1\}^F$, where each dimension independently follows a Bernoulli distribution with probability $1 - p_m$, i.e., $\tilde{m}_i \sim \mathcal{B}(1 - p_m), \forall i$. The feature matrix of each view is computed by Equation (3):

$$\tilde{\mathbf{X}} = [\mathbf{x}_1 \circ \tilde{\mathbf{m}};\ \mathbf{x}_2 \circ \tilde{\mathbf{m}};\ \ldots;\ \mathbf{x}_N \circ \tilde{\mathbf{m}}]^{\top}. \tag{3}$$

Algorithm 2 summarizes the graph augmentation process of PASCAL.

Algorithm 2 Graph augmentation.
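Input: Input graph $G = (V, E, \mathbf{A}, \mathbf{X})$, edge dropping probability $p_r$, and feature masking probability $p_m$.
1: Construct two empty networks $G_1$, $G_2$
2: for $i = 1$ to $2$ do
3: Sample a mask matrix $\mathbf{R}$ with $R_{jk} \sim \mathcal{B}(1 - p_r)$ if $A_{jk} = 1$, and $R_{jk} = 0$ otherwise
4: $\tilde{\mathbf{A}} = \mathbf{A} \circ \mathbf{R}$, and $\tilde{E}$ is the corresponding edge set
5: $\tilde{\mathbf{X}}$ is the augmented feature matrix
6: for node $v$ in $V$ do
7: Sample a mask vector $\tilde{\mathbf{m}} \in \{0, 1\}^F$, where each dimension independently follows a Bernoulli distribution with probability $1 - p_m$
8: $\tilde{\mathbf{x}}_v = \mathbf{x}_v \circ \tilde{\mathbf{m}}$
9: end for
10: Augmented graph $G_i = (V, \tilde{E}, \tilde{\mathbf{A}}, \tilde{\mathbf{X}})$
11: end for
Output: Augmented graphs $G_1$, $G_2$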
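The two augmentations can be illustrated with a small standard-library sketch. The helper names `drop_edges` and `mask_features` are hypothetical, and two choices here are assumptions of this sketch rather than statements about the authors' implementation: the edge mask is sampled once per undirected edge so that the augmented adjacency matrix stays symmetric, and a single feature mask is shared by all nodes, as in the textual description of feature masking.

```python
import random


def drop_edges(adj, p_r, rng):
    """Keep each undirected edge with probability 1 - p_r."""
    n = len(adj)
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # One Bernoulli draw per existing edge; non-edges stay zero.
            if adj[i][j] == 1 and rng.random() >= p_r:
                out[i][j] = out[j][i] = 1
    return out


def mask_features(x, p_m, rng):
    """Zero out each feature dimension with probability p_m (shared mask)."""
    dim = len(x[0])
    mask = [1 if rng.random() >= p_m else 0 for _ in range(dim)]
    masked = [[value * keep for value, keep in zip(row, mask)] for row in x]
    return masked, mask


rng = random.Random(0)
adj = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
view_adj = drop_edges(adj, p_r=0.3, rng=rng)
view_x, mask = mask_features(x, p_m=0.3, rng=rng)
```

Calling both functions twice with independent random draws yields the two graph views that are later contrasted.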

Subgraph Aggregator
How to construct and encode subgraphs is the key to subgraph-level GCL. In this work, we design a motif-based subgraph aggregator to calculate the subgraph embeddings, which are regarded as node features fed into the graph encoder. Specifically, for each node $v_i$, the set of motifs of type $j$ containing $v_i$ is represented as $M_{ij} = \{m^1_{ij}, \ldots, m^n_{ij}\}$, where $j \in \{1, \ldots, t\}$ and $t$ is the number of motif types. The subscript $j$ indexes the different kinds of motif defined in Section 3.3. Our proposed motif-based subgraph aggregation strategy consists of the following three steps:
(1) For each motif $m^k_{ij} \in M_{ij}$, we use a sum aggregator $\mathrm{Sum}(\cdot)$ over the embeddings of the nodes in the motif to compute the motif embedding $\mathbf{m}^k_{ij}$.
(2) After Step 1, we obtain all motif embeddings of type $j$ containing $v_i$, denoted by $\mathbf{M}_{ij} = \{\mathbf{m}^1_{ij}, \ldots, \mathbf{m}^n_{ij}\}$. Then, we use a mean aggregator $\mathrm{Mean}(\cdot)$ to aggregate all motif embeddings. The prototype of the type-$j$ motif containing $v_i$ is represented by $\mathbf{m}_{ij} = \mathrm{Mean}(\mathbf{M}_{ij})$.
(3) Steps 1 and 2 are repeated for all kinds of motif. After obtaining the prototypes of all six kinds of motif containing $v_i$, denoted by $\mathbf{M}_i$, we use an aggregator $\mathrm{Agg}(\cdot)$ to compute the final embedding of the subgraph centered on $v_i$, denoted by $\mathbf{s}_i = \mathrm{Agg}(\mathbf{M}_i)$.
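For the function $\mathrm{Agg}(\cdot)$ used in Step 3, a mean or attention aggregator can be used. Let $\mathbf{M} \in \mathbb{R}^{|\mathcal{M}| \times N}$ denote the motif prototypes of a node, where $|\mathcal{M}|$ represents the number of motif types and $N$ represents the dimension of node embeddings, and let $\mathbf{m}_j$ denote the $j$th row of $\mathbf{M}$. The two aggregators are defined as follows:
• Mean Aggregator. The formula of $\mathrm{Mean}(\cdot)$ is as follows:

$$\mathrm{Mean}(\mathbf{M}) = \frac{1}{|\mathcal{M}|} \sum_{j=1}^{|\mathcal{M}|} \mathbf{m}_j. \tag{4}$$

• Attention Aggregator. We employ the attention mechanism used in UDAGCN [40] (shown in Equation (5)):

$$\mathrm{Att}(\mathbf{M}) = \sum_{j=1}^{|\mathcal{M}|} \alpha_j \mathbf{m}_j, \qquad \boldsymbol{\alpha} = \mathrm{Softmax}(f(\mathbf{M})), \tag{5}$$

where $f(\cdot)$ and $\mathrm{Softmax}(\cdot)$ represent the linear and softmax functions, respectively, and $f(\cdot)$ maps each prototype to a scalar score. Algorithm 3 shows the process of subgraph aggregation.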
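To make the two choices for Agg(·) concrete, the following is a minimal sketch of a mean aggregator and an attention aggregator over motif prototypes. The function names, the weight vector `w`, and the bias `b` are hypothetical; the attention variant assumes that the linear function produces one scalar score per prototype, followed by a softmax, which is one plausible reading of the description above.

```python
import math


def mean_agg(prototypes):
    """Average the motif prototypes dimension by dimension."""
    count = len(prototypes)
    return [sum(column) / count for column in zip(*prototypes)]


def attention_agg(prototypes, w, b=0.0):
    """Weight prototypes by softmax(linear score) and sum them."""
    scores = [sum(wi * xi for wi, xi in zip(w, p)) + b for p in prototypes]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(prototypes[0])
    pooled = [sum(a * p[d] for a, p in zip(alphas, prototypes)) for d in range(dim)]
    return pooled, alphas


protos = [[1.0, 2.0], [3.0, 4.0]]
mean_out = mean_agg(protos)
att_out, att_weights = attention_agg(protos, w=[1.0, 0.0])
```

With an all-zero weight vector, every prototype receives the same score, so the attention aggregator reduces to the mean aggregator; this is a convenient sanity check when swapping one for the other.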

Algorithm 3 Subgraph aggregation.
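Input: Pre-statistical motif information $M$, input graph $G = (V, E, \mathbf{X})$.
1: $\mathbf{S} \in \mathbb{R}^{|V| \times h}$ represents the subgraph embedding matrix
2: for node $v$ in $V$ do
3: $M_v \in M$ represents the motif information of $v$
4: for each type $t$ of motifs do
5: $M^t_v$ represents the set of motifs of type $t$ containing $v$
6: Compute the embedding of each motif in $M^t_v$ with $\mathrm{Sum}(\cdot)$
7: Compute the prototype $\mathbf{m}_t$ of type $t$ with $\mathrm{Mean}(\cdot)$
8: end for
9: Aggregate all motif prototypes as the subgraph embedding, i.e., $\mathbf{S}_v = \mathrm{Agg}(\{\mathbf{m}_1, \ldots, \mathbf{m}_n\})$
10: end for
Output: Subgraph embedding matrix $\mathbf{S}$

Figure 4 shows the process of calculating the prototype of the triangle motif for $v_i$. In the calculation of motif embeddings and motif prototypes, we use $\mathrm{Sum}(\cdot)$ and $\mathrm{Mean}(\cdot)$ for aggregation, respectively. Therefore, in practice, the calculation of subgraph embeddings can be simplified to matrix multiplication. Concretely, we define two matrices, $\mathbf{P} \in \mathbb{N}^{|\mathcal{M}| \times |V| \times |V|}$ and $\mathbf{Q} \in \mathbb{N}^{|\mathcal{M}| \times |V|}$, which record the nodes involved in the motifs and the number of motifs of each type, respectively. $P_{tij}$ represents the number of times $v_j$ appears in the motifs of type $t$ containing $v_i$. Likewise, $Q_{ti}$ denotes the number of motifs of type $t$ containing $v_i$. The computation of subgraph embeddings is formulated in Equation (6):

$$\mathbf{m}_{it} = \frac{1}{Q_{ti}} \sum_{j=1}^{|V|} P_{tij}\, \mathbf{h}_j, \qquad \mathbf{s}_i = \mathrm{Agg}(\{\mathbf{m}_{i1}, \ldots, \mathbf{m}_{i|\mathcal{M}|}\}), \tag{6}$$

where $\mathbf{h}_j$ is the embedding of node $v_j$.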
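The equivalence between the step-wise computation (sum inside each motif, then mean over motifs) and the count-based form with P and Q can be checked on a toy example. The helper names below are hypothetical, motifs are represented simply as tuples of node ids, and returning a zero vector for a node that has no motif of the given type is an assumption of this sketch.

```python
def count_matrices(motifs, num_nodes):
    """Return one row of P and one entry of Q for a single node and motif type."""
    p_row = [sum(1 for motif in motifs if j in motif) for j in range(num_nodes)]
    return p_row, len(motifs)


def stepwise_prototype(motifs, h):
    """Sum node embeddings inside each motif, then average over the motifs."""
    dim = len(h[0])
    if not motifs:
        return [0.0] * dim  # assumption: no motif of this type gives a zero vector
    sums = [[sum(h[v][d] for v in motif) for d in range(dim)] for motif in motifs]
    return [sum(column) / len(sums) for column in zip(*sums)]


def matrix_prototype(p_row, q, h):
    """Same prototype computed from the counts: (1 / Q) * sum_j P_j * h_j."""
    dim = len(h[0])
    if q == 0:
        return [0.0] * dim  # assumption: avoid division by zero
    return [sum(p_row[j] * h[j][d] for j in range(len(h))) / q for d in range(dim)]


h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]  # toy node embeddings
triangles_of_v0 = [(0, 1, 2), (0, 2, 3)]               # triangles containing node 0
p_row, q = count_matrices(triangles_of_v0, num_nodes=4)
proto_stepwise = stepwise_prototype(triangles_of_v0, h)
proto_matrix = matrix_prototype(p_row, q, h)
```

Since P and Q depend only on the graph topology, they can be computed once during preprocessing and reused in every layer and epoch.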

Graph Encoder
Two graph encoders, called PASCAL-concat and PASCAL-replace, respectively, are designed.
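• PASCAL-concat adds an aggregation layer before message passing to compute the subgraph embeddings $\mathbf{S}^l$, which are regarded as the node embeddings fed into the message passing layer. The feature update formula for each layer of the graph encoder is as follows:

$$\mathbf{H}^{l+1} = \sigma(\mathbf{A} \mathbf{S}^l \mathbf{W}^l). \tag{7}$$

• PASCAL-replace uses the subgraph aggregation to replace the original neighbor aggregation in the GNN. Therefore, the adjacency matrix is not used in the graph encoder, as shown in Equation (8):

$$\mathbf{H}^{l+1} = \sigma(\mathbf{S}^l \mathbf{W}^l), \tag{8}$$

where $\mathbf{S}^l$ is computed from $\mathbf{H}^l$ by the subgraph aggregation in Equation (6), $\mathbf{A}$ is the adjacency matrix of the corresponding graph view, $\mathbf{W}^l$ is the learnable weight matrix of layer $l$, and $\sigma(\cdot)$ is a nonlinear activation function.
For PASCAL-concat, we use both the feature masking and edge dropping augmentation strategies. However, as the adjacency matrix is not used in PASCAL-replace, only the feature masking augmentation strategy is used for it.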
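The difference between the two encoder variants can be sketched as follows. This is a deliberately simplified illustration: it assumes a GCN-style layer with an unnormalized adjacency matrix and a ReLU activation, and the helper names `concat_layer` and `replace_layer` are hypothetical. The actual message passing layer used by the authors may differ.

```python
def matmul(a, b):
    """Plain dense matrix product for lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(0.0, value) for value in row] for row in m]


def concat_layer(adj, s, w):
    """Concat variant: subgraph embeddings are propagated over the adjacency matrix."""
    return relu(matmul(matmul(adj, s), w))


def replace_layer(s, w):
    """Replace variant: subgraph aggregation stands in for neighbor aggregation."""
    return relu(matmul(s, w))


s = [[1.0, -1.0], [0.5, 2.0]]   # toy subgraph embeddings for two nodes
w = [[1.0, 0.0], [0.0, 1.0]]    # toy weight matrix (identity)
adj = [[0, 1], [1, 0]]          # the two nodes are connected
h_concat = concat_layer(adj, s, w)
h_replace = replace_layer(s, w)
```

In this sketch the replace variant never reads `adj`, which is why edge dropping has no effect on it and only feature masking can distinguish its two views.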

Comparator
To train a graph encoder that captures rich local semantic information in an unsupervised manner, similar to GRACE [20], we define a contrastive objective that maximizes the mutual information of the same node in two different graph views. Formally, we use $f_\phi(\cdot)$ to represent our motif-based graph encoder, and $G_1 = (\mathbf{V}, \mathbf{A}_1)$ and $G_2 = (\mathbf{U}, \mathbf{A}_2)$ denote the two graph views. For better comparison, we use a projection function $g_\gamma(\cdot): \mathbb{R}^{n \times d_h} \to \mathbb{R}^{n \times d_h}$ to map the node embeddings of the two graph views to the same contrast space. For any node, its embeddings in the two views are denoted by $\mathbf{u}_i$ and $\mathbf{v}_i$, which are treated as the anchor and positive sample, respectively. The pairwise objective for each positive pair $(\mathbf{u}_i, \mathbf{v}_i)$ is defined as Equation (9):

$$\mathcal{D}(\mathbf{u}_i, \mathbf{v}_i) = -\log \frac{e^{\delta(\mathbf{u}_i, \mathbf{v}_i)/\tau}}{e^{\delta(\mathbf{u}_i, \mathbf{v}_i)/\tau} + \sum_{k \neq i} e^{\delta(\mathbf{u}_i, \mathbf{v}_k)/\tau} + \sum_{k \neq i} e^{\delta(\mathbf{u}_i, \mathbf{u}_k)/\tau}}, \tag{9}$$

where $\delta(\mathbf{u}, \mathbf{v}) = \mu(g_\gamma(\mathbf{u}), g_\gamma(\mathbf{v}))$, $\mu(\cdot, \cdot)$ is the cosine similarity, and $\tau$ is a temperature parameter. Moreover, instead of deliberately choosing negative samples for the anchor, we treat all other nodes in the two graph views as negative samples. As the two views are symmetric, $\mathcal{D}(\mathbf{v}_i, \mathbf{u}_i)$ is defined analogously to $\mathcal{D}(\mathbf{u}_i, \mathbf{v}_i)$. Therefore, the final loss of PASCAL is defined as follows:

$$\mathcal{J} = \frac{1}{2N} \sum_{i=1}^{N} \left[ \mathcal{D}(\mathbf{u}_i, \mathbf{v}_i) + \mathcal{D}(\mathbf{v}_i, \mathbf{u}_i) \right]. \tag{10}$$

Overall, given the input graph and pre-statistical motif information, PASCAL first extracts a subgraph for each node, then performs graph augmentation to obtain different graph views and encodes the subgraphs based on motif information, and finally optimizes the graph encoder by maximizing the mutual information between the same node in different views. The pseudocode of PASCAL is summarized in Algorithm 4.
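A GRACE-style contrastive loss of the kind described above can be sketched with the standard library. Several simplifications are assumptions of this sketch: the projection function is taken to be the identity, embeddings are assumed to be non-zero so that the cosine similarity is defined, and the per-pair term is written as a negative log ratio so that smaller values are better. The function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pair_loss(i, anchor_view, other_view, tau):
    """Loss for node i: its copy in the other view is the only positive."""
    anchor = anchor_view[i]
    n = len(anchor_view)
    pos = math.exp(cosine(anchor, other_view[i]) / tau)
    # Negatives: every other node in the other view and in the anchor's own view.
    inter = sum(math.exp(cosine(anchor, other_view[k]) / tau) for k in range(n) if k != i)
    intra = sum(math.exp(cosine(anchor, anchor_view[k]) / tau) for k in range(n) if k != i)
    return -math.log(pos / (pos + inter + intra))


def contrastive_loss(u, v, tau=0.5):
    """Symmetric average of the pairwise losses over both views."""
    n = len(u)
    total = sum(pair_loss(i, u, v, tau) + pair_loss(i, v, u, tau) for i in range(n))
    return total / (2 * n)


u = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(u, [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss(u, [[0.0, 1.0], [1.0, 0.0]])
```

When the two views agree node by node, the loss is small; when each node's positive is swapped with another node's embedding, the loss grows, which is the behavior the encoder is trained to exploit.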

Algorithm 4 PASCAL-replace algorithm.
Input: Input graph $G = (\mathbf{A}, \mathbf{X})$, motif information $\mathbf{P}$ and $\mathbf{Q}$, graph encoder $f_\phi$, projection function $g_\theta$, discriminator $\Theta$, loss $\mathcal{J}$, and augmentation function $\mathcal{T}$.
1: for epoch = 1 to $n$ do
2: for $i = 1$ to $2$ do
3: for $l = 0$ to $k$ do

To achieve a comprehensive comparison, we conduct unsupervised node classification experiments on six datasets, which can be categorized into two groups: academic and social networks.
• Academic networks. Citationv1, DBLPv7, and ACMv9 are three citation networks extracted from Microsoft Academic Graph, the DBLP Computer Science Bibliography, and the Association for Computing Machinery, respectively [41]. Each of these three datasets has five types of node labels. In total, they have 8779, 5469, and 8769 nodes, with 13,590, 8090, and 14,798 edges, respectively.
• Social networks. Polblogs [42] is a directed network of hyperlinks between weblogs on US politics, recorded in 2005 by Adamic and Glance, which contains 1224 nodes in two categories and 16,718 edges. In this paper, we treat it as an undirected graph. Amazon-computers and Amazon-photo [43,44] are segments of the Amazon co-purchase graph, which contain 10 and 8 kinds of nodes, respectively. The nodes and edges, respectively, represent the goods and the frequency with which two goods are bought together.
Detailed information on the six datasets is summarized in Table 3. The last five rows give the average number of each motif per node, from which we can see that the five kinds of motifs defined in this paper occur frequently in the networks.

Table 3. Dataset statistics. "#Node" and "#Edge" represent the total number of nodes and edges. "#Classes" is the number of node types. "#M1_AVG"~"#M5_AVG" represent the average number of motifs per node for the motifs with ids 1~5 in Table 2.

Baselines
The baselines can be categorized into two types: unsupervised and supervised methods. For supervised baselines, we choose GCN [6], SGC [45], GCNII [7], and MORE [35]. As for unsupervised methods, we regard GAE [14], GRACE [20], MVGRL [24], and DGI [15] as baselines. The details of these methods are as follows:
• GCN [6]: A classic semi-supervised GNN method that learns the latent graph representation by extending the convolutional neural network to graph-structured data; it is widely used in various fields.
• SGC [45]: SGC transforms the nonlinear GCN into a simple linear model, which reduces the extra complexity of GCNs by repeatedly eliminating the nonlinearity between GCN layers and folding the resulting function into a linear transformation.
• GCNII [7]: It addresses the over-smoothing problem of GNNs by using residual connections and identity mapping, which greatly improves the performance of GNNs.
• MORE [35]: MORE is a motif-based graph learning method for social networks, which regards the motif information as additional attribute information of nodes. Its general idea is close to this work, and its performance on social networks is very competitive.
• GAE [14]: GAE is an unsupervised graph learning method based on autoencoders, which learns node representations by reconstructing the graph structure.
• DGI [15]: Different from traditional reconstruction-based unsupervised methods, DGI learns node embeddings by maximizing the mutual information between the input and the output. DGI is a groundbreaking graph contrastive learning algorithm with top-ranked performance.
• GRACE [20]: A cutting-edge unsupervised graph representation learning method based on contrastive learning. GRACE is also the basis of our proposed method.
• MVGRL [24]: MVGRL uses graph diffusion for graph augmentation, and then compares the node and graph embeddings of different views. It is one of the SOTA graph contrastive learning methods.

Experimental Details
To ensure the fairness of the experiments, all unsupervised methods used in this paper employ a linear classifier to predict node labels. We set the maximum number of epochs to 2000 and the tolerance to 20. After the model is fitted, we use 10% of the data to train the classifier, and the remaining 90% are used for testing. Unless otherwise stated, the graph encoder is a two-layer GNN, and the node embedding dimension is 128 for all datasets. We use the Adam optimizer with a learning rate of 0.001 to train the model. Node labels are predicted with either a linear classifier or a support vector machine; the linear classifier is the default.
For supervised algorithms, the hyperparameters on all datasets are those recommended by the original papers, and the node embedding dimension is 128. As for the dataset division, to facilitate comparison, we adopt the classic division used in GCN [6], that is, 20 samples of each class are used for training, 500 samples are used for validation, and another 1000 samples are used for testing. To prevent overfitting, we set the tolerance to 20 and the maximum number of epochs to 1000. Our code is developed with Python 3.7 and PyTorch 1.7.0+cu101. Our model is trained on a V100 GPU with 32 GB of memory.

Unsupervised Node Classification
For PASCAL, we use the mean aggregator described in Section 4.3 as the embedding aggregator. All methods are executed 20 times on each dataset, and the comparison results on the six datasets are summarized in Table 4.

Table 4. Summary of node classification results on 6 datasets. All experiments are carried out 20 times, and "best" and "avg", respectively, represent the best and average performance of the model. "PASCAL-replace" and "PASCAL-concat" represent the two PASCAL variants mentioned in Section 4.4 using different types of encoders. Bold numbers represent the best results on different datasets.

Several findings can be drawn from the experimental results in Table 4. First, compared with all unsupervised methods, our method achieves the best results on acmv9, dblpv7, citationv1, and polblogs, with an average performance improvement of almost 4% over GRACE. Although our method performs worse than the SOTA unsupervised method on the computers and photo datasets, it still outperforms GRACE, which uses the same type of comparison framework; this supports the effectiveness of PASCAL. Second, under the dataset division settings used in GCN [6], our unsupervised method is superior to all supervised methods on each dataset. Third, the performance of PASCAL-replace is much worse than that of PASCAL-concat. The reason is that PASCAL-replace does not use the graph adjacency matrix, which means that it only employs feature masking during data augmentation. Therefore, the two augmented views of PASCAL-replace are hardly distinguishable, so they cannot be contrasted well. The semi-supervised experimental results in Section 6 further support this conjecture.

Ablation Studies
In this section, we discuss the performance of different variants of PASCAL.

Mean-Agg vs. Att-Agg. When aggregating the prototypes of different motifs, we can use the mean aggregator or the attention aggregator. To compare the impact of different aggregators on model performance, we choose the attention mechanism used in UDAGCN [40] to aggregate multiple vectors, and both methods use the concat variant as the main framework. The experimental results are shown in Figure 5. The model performs comparably with the two aggregation methods, which suggests that our proposed motif-based subgraph aggregation strategy is effective and reliable.

Figure 5. The performance comparison when using different aggregators to formulate motif prototypes. "att" and "mean" represent the attention-based and mean-based aggregators, respectively. The horizontal axis represents the classification accuracy, and the vertical axis represents the dataset.

Motifs.
In this part, we study the impact of different variants of our proposed motif-based subgraph aggregation strategy on model performance. Concretely, we explore four different combinations of "second-order" and "degree-agg". Here, "second-order" indicates whether to use the second-order motif, i.e., an edge, when generating subgraphs. "degree-agg" means using a degree-based aggregation to calculate motif embeddings instead of the Sum(·) aggregator used in Section 4.3. If Sum(·) is used to calculate motif embeddings, two motifs A and B (the motifs with ids 1 and 2 in Table 2) that consist of the same three nodes have identical embeddings. Thus, we introduce a degree-based motif aggregation method. Specifically, the degree of each node in the motif is regarded as its weight, and the weighted sum of all node embeddings is regarded as the motif embedding.
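The motivation for the degree-based aggregation can be reproduced on a toy case. In the sketch below, a three-node path and a triangle over the same nodes are used purely as an illustration of two different motifs sharing a node set; the function names and the edge-list representation are assumptions of this sketch, and node degrees are counted inside the motif.

```python
def sum_agg(motif_edges, h):
    """Plain sum of the embeddings of the nodes that appear in the motif."""
    nodes = {u for edge in motif_edges for u in edge}
    dim = len(h[0])
    return [sum(h[v][d] for v in nodes) for d in range(dim)]


def degree_agg(motif_edges, h):
    """Sum of node embeddings weighted by each node's degree inside the motif."""
    degree = {}
    for a, b in motif_edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    dim = len(h[0])
    return [sum(degree[v] * h[v][d] for v in degree) for d in range(dim)]


h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # toy embeddings of nodes 0, 1, 2
open_path = [(0, 1), (1, 2)]                # degrees inside the motif: 1, 2, 1
triangle = [(0, 1), (1, 2), (0, 2)]         # degrees inside the motif: 2, 2, 2

same_with_sum = sum_agg(open_path, h) == sum_agg(triangle, h)
path_embedding = degree_agg(open_path, h)
triangle_embedding = degree_agg(triangle, h)
```

The plain sum cannot tell the two motifs apart because it only sees the node set, whereas the degree weights encode how the nodes are wired inside each motif.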
For example, suppose two motifs A and B, represented as $m_a$ and $m_b$, consist of the same three nodes $v_1, v_2, v_3$. Their motif embeddings are calculated by:

$$\mathbf{m}_a = \sum_{k=1}^{3} d_a(v_k)\, \mathbf{h}_k, \qquad \mathbf{m}_b = \sum_{k=1}^{3} d_b(v_k)\, \mathbf{h}_k,$$

where $d_a(v_k)$ and $d_b(v_k)$ denote the degree of $v_k$ within motifs A and B, respectively, and $\mathbf{h}_k$ is the embedding of $v_k$.

As shown in Table 5, we find that: (1) no matter which combination we use, our model performs better than GRACE; (2) in most cases, the combination that uses the second-order motif without degree-agg has the best performance; and (3) using the second-order motif slightly improves the performance of the model on most datasets.

Table 5. The performance of different variants of PASCAL. "second-order" indicates whether to use the second-order motif when constructing the subgraph. "degree-agg" indicates whether to consider node degree when calculating motif embeddings. A check mark means the option is used and a cross means it is not. "PASCAL-concat-mean" represents the PASCAL-concat variant in Section 4.4, which uses the mean aggregator in Section 4.3 to aggregate different types of motif prototypes. Bold numbers represent the best results on different datasets.

Classifier. In all previous experiments, the unsupervised methods employ a simple linear classifier to predict node labels. In this part, we compare the effect of different classifiers on the performance of the model. Specifically, we compare the performance of models using the linear classifier with that of models using the support vector machine (SVM); the results are shown in Table 6. We find that: (1) for both GRACE and PASCAL, using the more powerful SVM significantly improves model performance; and (2) even with the SVM, GRACE performs worse than PASCAL-concat with the linear classifier, which shows the power of our proposed model.

Table 6. The performance of GRACE and PASCAL-concat with different classifiers. "Linear" and "SVM" represent the use of the linear classifier and the SVM as node classifiers, respectively. Bold numbers represent the best results on different datasets.

Discussion
Complexity Analysis. Here, we briefly analyze the time complexity of PASCAL and compare it with GCN and GRACE. Let |E| represent the edge number in the graph; d be the embedding size; b and m, respectively, denote the batch size and the node number in a batch; γ denote the edge keep rate in PASCAL; and L represent the number of layers of the encoder. We compare them from four aspects, and Table 7 summarizes the comparison results.

• Preprocessing: GCN and GRACE do not need to preprocess the data, while PASCAL needs to collect the motif information of each node, which is one of its disadvantages. However, the motif information only needs to be computed once, so the preprocessing cost is acceptable.
• Adjacency Matrix: For GCN, the adjacency matrix has only 2|E| non-zero elements since no augmentation is required. GRACE and PASCAL are typical contrastive learning methods that need to generate two augmented views, so there are two adjacency matrices, each containing 2γ|E| non-zero elements.
• Encoder: All three models use a two-layer encoder architecture, so the time complexity is consistent.
In general, the time complexity of contrastive learning is higher than that of GCN. Compared with GRACE, the complexity of PASCAL is higher only because of the additional data preprocessing. However, given the significant performance gain of PASCAL, the time cost of data preprocessing is negligible.

Table 7. Complexity comparison of GCN, GRACE, and PASCAL.

Attention weight. In the attention variant of PASCAL, we use the attention mechanism to aggregate different types of motifs. Here, we analyze the learned attention weights. Specifically, we examine the relation between the attention weights and the number of motifs using the Pearson correlation coefficient, which measures the linear correlation between two variables. The results are summarized in Table 8. From Table 8, we find that the number of nodes with correlation coefficients greater than 0.9 in acmv9, citationv1, and dblpv7 is much larger than that in the other three datasets, which means that the motif types with more instances in computers, photo, and polblogs may be assigned smaller weights. This phenomenon is expected. The motif counts in Table 3 show that the distribution of motifs in these three datasets is more uneven. In this case, a large correlation coefficient would overly attenuate the influence of the other motifs on the model, potentially reducing model performance. Therefore, the results in Table 8 are intuitive, indicating that the attention weights learned by the model are meaningful.

Table 8. The statistics of the Pearson correlation coefficient between the number of motifs and the learned attention weights. "#<0" represents the number of nodes with a coefficient less than 0, and the same for the others. "#Node" represents the total number of nodes in each dataset.

Semi-supervised node classification. To further verify the power of our proposed motif-based subgraph aggregation strategy, we integrate it into GCN and GCNII in a concat manner for semi-supervised node classification tasks. Specifically, we first perform the motif-based subgraph aggregation on the output of the previous layer. The result is then treated as the updated node embeddings and fed into the next GNN layer.
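The attention-weight analysis described above can be sketched as follows: for each node, compute the Pearson correlation coefficient between its per-type motif counts and its per-type attention weights, then count how many nodes fall below 0 or above 0.9. This is a minimal sketch; the function names and the toy per-node data are hypothetical, and only the "#<0", ">0.9", and "#Node" statistics are taken from the text.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one variable is constant
    return cov / math.sqrt(var_x * var_y)


def summarize(coefficients, high=0.9):
    """Count nodes with a negative coefficient and with a coefficient above `high`."""
    valid = [c for c in coefficients if not math.isnan(c)]
    return {
        "#<0": sum(c < 0 for c in valid),
        f"#>{high}": sum(c > high for c in valid),
        "#Node": len(coefficients),
    }


# Hypothetical data for three nodes and five motif types:
# motif counts per type, and the attention weight assigned to each type.
motif_counts = [[12, 3, 0, 7, 1], [5, 5, 9, 0, 2], [1, 2, 3, 4, 5]]
attention = [
    [0.50, 0.15, 0.02, 0.28, 0.05],
    [0.30, 0.10, 0.05, 0.35, 0.20],
    [0.10, 0.15, 0.20, 0.25, 0.30],
]

coefficients = [pearson(c, w) for c, w in zip(motif_counts, attention)]
stats = summarize(coefficients)
```

A coefficient near 1 means a node's attention weights simply track its motif counts, while a negative coefficient means the more frequent motif types receive smaller weights.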
Figure 6a,b, respectively, show the performance of the four models on the five datasets and on incomplete Acmv9. From Figure 6a, we find that on most of the datasets, GCN-Motif and GCNII-Motif perform better than their base models, especially GCNII-Motif, which once again verifies the effectiveness of our proposed strategy. As Figure 6b shows, the performance of all models degrades as the ratio of missing edges increases. When the edge-dropping ratio is low, the improved model still performs better than the original model. However, when the ratio is too large (≥30%), the performance of the improved model is comparable to or even worse than that of the original model, which is in line with our intuition: because the improved model depends more heavily on the graph topology, its performance is affected more seriously when the original graph structure is excessively destroyed. Table 9 summarizes the GCN-Motif performance under the two integration modes, replace and concat. Unlike in the unsupervised framework, in the semi-supervised framework GCN-Motif-replace and GCN-Motif-concat perform comparably, which supports our conjecture in Section 5.2. Note that we use Mean-Agg as the motif prototype aggregator for all experiments in this section.
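The incomplete-graph setting discussed above can be approximated by removing a fixed fraction of edges at random before training. The sketch below is one hedged way to generate such graphs; the helper name `drop_edges`, the toy edge list, the chosen ratios, and the use of uniform sampling without replacement are all assumptions, since the text does not state how edges were removed.

```python
import random


def drop_edges(edges, drop_ratio, seed=0):
    """Remove a fixed fraction of undirected edges uniformly at random.

    Assumption: uniform sampling without replacement over the edge list.
    """
    if not 0.0 <= drop_ratio <= 1.0:
        raise ValueError("drop_ratio must lie in [0, 1]")
    rng = random.Random(seed)
    num_keep = len(edges) - round(drop_ratio * len(edges))
    return rng.sample(edges, num_keep)


# Hypothetical toy graph on 20 nodes.
edges = [(i, j) for i in range(20) for j in range(i + 1, 20) if (i + j) % 3 == 0]

# One incomplete graph per edge-dropping ratio; each model would then be
# trained and evaluated on every graph to trace a robustness curve.
incomplete = {ratio: drop_edges(edges, ratio) for ratio in (0.0, 0.1, 0.3, 0.5)}
```

Fixing the seed keeps the removed edges identical across models, so any performance gap at a given ratio reflects the model rather than the sampled graph.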

Figure 6. Performance comparison before and after integrating our proposed motif-based subgraph aggregation strategy. "GCN" and "GCNII" are two classic GNN models. "GCN-Motif" and "GCNII-Motif", respectively, represent the improved models based on the subgraph aggregation and encoding strategy proposed in this paper. The horizontal axis of (b) represents the random edge rate. (a) Node classification on the full dataset; (b) node classification on incomplete Acmv9.

Table 9. Performance of GCN when integrating different variants of our proposed motif-based subgraph aggregation strategy. "best" and "avg" represent the optimal and average performance over 20 experiments, respectively. Bold numbers represent the best results on different datasets.

Visualization. To show more intuitively the effect of our proposed motif-based subgraph aggregation strategy on the GNN framework, we use the t-SNE algorithm to visualize the test-set node embeddings learned by the model. Figure 7a,c show the test node embeddings of GCN and GCNII on citationv1, respectively. Figure 7b,d show the embeddings learned by GCN-Motif and GCNII-Motif, which incorporate our proposed subgraph strategy, on citationv1. Comparing the first and second columns of Figure 7, we find that the nodes of each category in the second column are more concentrated and the classification boundaries are clearer, which indicates that integrating our proposed subgraph strategy significantly improves the quality of the learned node embeddings.

Conclusions
In this work, we propose a structure-aware graph contrastive learning model called PASCAL, which considers subgraph-level embeddings. PASCAL adaptively constructs and encodes subgraphs based on the nodes' motif information, and further uses them as the input of the GNN encoder to capture the rich semantic information hidden in the local structure. Extensive experiments on six social and web benchmark datasets demonstrate the superior performance of PASCAL.
Although PASCAL performs well in unsupervised node classification tasks, it is not flawless. The motifs used in PASCAL are predefined, so it underperforms on some datasets such as Amazon Photo. The reason behind this phenomenon is that the distribution of motif types and numbers differs across datasets, and the five motifs predefined in this study may not be applicable to Amazon Photo. In future work, we will study how to automatically design and select the optimal motifs, which could significantly improve the generalization of PASCAL.

Conflicts of Interest:
The authors declare no conflicts of interest.

Abbreviations
The following abbreviations are used in this manuscript:

GCL      graph contrastive learning
GNN      graph neural network
GCN      graph convolution network
MI       mutual information
SOTA     state-of-the-art
PASCAL   subgraph adaptive structure-aware graph contrastive learning
DGI      deep graph infomax
BYOL     bootstrap your own latent