MBHAN: Motif-Based Heterogeneous Graph Attention Network

Graph neural networks are graph-based deep learning technologies that have attracted significant attention from researchers because of their powerful performance. Heterogeneous graph-based graph neural networks focus on the heterogeneity of the nodes and links in a graph, which makes them more effective at preserving semantic knowledge when representing data interactions in real-world graph structures. Unfortunately, most heterogeneous graph neural networks tend to transform heterogeneous graphs into homogeneous graphs when using meta-paths for representation learning. This paper therefore presents a novel motif-based hierarchical heterogeneous graph attention network algorithm, MBHAN, that addresses this problem by incorporating a hierarchical dual attention mechanism at the node level and the motif level. Node-level attention aims to learn the importance between a node and its neighboring nodes within its corresponding motif. Motif-level attention learns the importance of different motifs in the heterogeneous graph. Because different types of nodes in heterogeneous graphs have features in different vector spaces, MBHAN also aggregates the features of different types of nodes, so that they can jointly participate in downstream tasks after passing through segregated, independent shallow neural networks. MBHAN's superior network representation learning capability has been validated by extensive experiments on two real-world datasets.


Introduction
Graph neural networks (GNNs) have attracted extensive attention in academia as a powerful way of approaching deep representation learning for graph data. They have been proven to perform especially well in network analysis [1,2]. The basic idea of a graph neural network is to undertake representation learning on the nodes themselves, according to their local neighborhood information. This involves aggregating the information of each node and its surrounding nodes through a neural network. For example, in [3][4][5], node features and graph structure are used together to learn node embeddings. Convolutional operations can also be introduced into graph representation learning [6][7][8][9].
Alongside GNNs, a significant amount of interest has been shown in attention mechanisms [10], which encourage models to focus on the most salient parts of the data that will affect downstream tasks. Attention mechanisms have been highly effective when incorporated into deep neural network frameworks and are widely used across a range of different domains [11][12][13][14][15]. Graph Attention Networks (GAT) [16] assume that different neighboring nodes may play different roles for the core nodes. A self-attention mechanism can therefore be employed [17] to aggregate neighbor nodes and achieve an adaptive matching of weights that captures the different importance of different neighbors. However, GAT can only be applied to homogeneous graphs and cannot be easily migrated to heterogeneous graphs.
Heterogeneous graph-based neural network representation learning methods are a natural extension of deep learning approaches to the processing of structured graph data. The most popular approach, here, is to transform heterogeneous graphs into homogeneous graphs for representation learning through meta-paths [18]. Wang et al. [19], for instance, introduced a two-level hierarchical attention mechanism in graph neural networks where the node-level attention captures the relationship between neighboring nodes generated by a certain meta-path, while meta-path semantic attention captures the importance of the different meta-paths in the original heterogeneous graph. To make the most of the rich interaction information present in heterogeneous graphs focused on intent recommendation, Fan et al. [20] used meta-path guided neighbors to aggregate the node information and designed different aggregation functions based on different types of neighbor features. All of these methods require an expert to design and manually set the meta-paths for specific issues. There is also inevitably a loss of information during the transformation between heterogeneous and homogeneous graphs, and the selection of different meta-paths can lead to significant performance fluctuations in downstream tasks [21].
Numerous studies [22][23][24][25][26][27][28][29][30] have verified the superior performance of motifs as fundamental building blocks of non-random real-world graph-structured data for the purposes of graph representation learning. A motif is essentially a subgraph structure consisting of multiple nodes and links. The nodes that make up a motif are particularly closely related semantically. Thus, it seems reasonable to assume that, if the representation of nodes in motif-based subgraphs with different structural patterns can be captured, it will be possible to characterize different types of nodes in heterogeneous graphs from multiple perspectives. The Multiscale Convolutional Network (MCN) [30] is a graph convolutional neural network that introduces multiple weighted motif-based adjacency matrices to capture higher-order neighborhood information. Peng et al. [31] used motifs for subgraph normalization and designed the Motif-based Attentional Graph Convolutional Neural Network (MA-GCNN) for subgraph classification tasks. However, the MCN still requires an expert to pre-design the motif structure and does not consider motif schemas of other shapes in the graph. The MCN is also not scalable to heterogeneous graphs. MA-GCNN similarly pays no attention to the heterogeneity of nodes in the graph and can only be applied to subgraph classification tasks.
In view of the above, and inspired by the Heterogeneous Graph Attention Network (HAN) [19], this paper proposes an end-to-end Motif-based Heterogeneous Attention Network, MBHAN, which employs a hierarchical attention mechanism that includes node-level and motif-level attention. MBHAN is able to focus on the degree of importance of both a node's neighbors and the motif subgraph where the node is located.
The MBHAN approach presented in this paper offers the following contributions:
It efficiently learns multi-perspective representations for all the nodes in a heterogeneous graph, without having to artificially establish any meta-paths (or subgraphs).
It avoids the knowledge loss in the conversion of heterogeneous graphs to homogeneous graphs and can learn the global node information in heterogeneous graphs.
It can capture the subtle effects of different node types on downstream tasks, leading to more accurate knowledge mining.

Graph Neural Networks
Graph representation learning, also known as graph embedding, aims to represent the nodes in a network as low-dimensional, dense, real-valued vectors. The resulting vector representations can be conveniently used as the input for machine learning models, which can then be applied to common social network applications, such as visualization, node classification, link prediction, and community discovery [32]. The impressive results of neural networks on Euclidean data (e.g., images [33] and text [34]) have resulted in proposals to extend deep neural networks to the processing of data with graph (network) structures [4,5]. However, applying graph neural network techniques to heterogeneous graphs and distinguishing the heterogeneity of nodes and links in the graph structure has consistently proved challenging.
The first attempt to model multi-relational graph-structured data using GNNs involved Relational Graph Convolutional Networks (R-GCNs) [6], which maintain unique linear mapping weights for each different type of link. They also decompose relationship-specific parameters into linear combinations of several basis matrices, so as to be able to deal with networks with a large number of relationships. To handle the structure and node properties of heterogeneous graphs, Zhang et al. [35] began by introducing a random walk with restart strategy that can extract a fixed number of strongly correlated heterogeneous neighbors for each node and group them according to node types. They then used a type-specific Recurrent Neural Network (RNN) to encode the vertex features of each type of neighbor. The encoded representations of different types of neighbors were then aggregated using another RNN. To address the problem of heterogeneous graph GNNs still needing to artificially set meta-paths, with a consequent loss of accuracy in downstream tasks, Yun et al. [21] proposed a graph transformation network (GTN) that is capable of generating new network structures. It can identify useful interactions between unconnected nodes in the original graph, while learning valid node representations in the transformed graph in an end-to-end fashion. A similar approach that does not involve setting meta-paths is the Heterogeneous Graph Transformer (HGT) [36], which has parameters relating to different types of nodes and links, so as to be able to characterize the heterogeneous attention for each link. This enables HGT to generate specialized representations for different types of nodes and links.
Most existing graph neural networks based on heterogeneous graphs tend to use meta-paths to transform the heterogeneous graphs into homogeneous graphs for representation learning. This inevitably loses the information carried by the intermediate (non-end) nodes of the meta-paths. At present, most methods cannot perform global node representation learning.

Motifs
The notion of graph motifs, which are higher-order structures in a network, was first proposed by Milo et al. [27]. They are small subordinate structures consisting of multiple nodes. In real-world applications, motifs play a critical role in complex graph analysis. Benson et al. [37] used motifs to analyze higher-order clustering in the Caenorhabditis elegans neuronal network and the higher-order spectral network of airports in Canada and the United States. Zhou et al. [38] looked at how a star motif structure might correspond to a synthetic counterfeit personal account number in a bank's customer information network. Other studies [39,40] have shown that triangles consisting of three nodes form the basic motif structure in most real-world networks and that this plays an important role in network formation and evolution. Some motif-based representation learning methods have also been proposed. Motif2vec [25], for instance, aggregated and shuffled random walk sequences created for both a motif-based higher-order graph and an original graph. The ultimate sequences were then fed to a Skip-Gram model [41] to learn the node embedding.
In this paper, we focus on all the triangle motif schemas in the graphs we are working with and assemble all the "atomic-level" higher-order heterogeneous connectivity patterns, so as to eliminate any interference by man-made semantic assignments. Table 1 gives the notations used in this paper and their corresponding explanations. We will then define the most relevant concepts and the problem we are seeking to address, before introducing the MBHAN algorithm.

Preliminary Information
Definition 1. Heterogeneous graphs [42]: A heterogeneous graph is a network data structure, $G = (V, E)$, with a node type mapping $\phi : V \rightarrow T$ and a link type mapping $\psi : E \rightarrow R$, where $|T|$ is the number of node types and $|R|$ is the number of link types in the network, and $|T| > 1$ or $|R| > 1$. If $|T| = 1$ and $|R| = 1$, the graph, $G$, is a homogeneous network.

Example 1.
Typical examples of heterogeneous graphs are academic citation networks (see Figure 1). In Figure 1a, DBLP consists of four different types of nodes (author (A), paper (P), conference (C), and term (T)) and multiple types of links (A − P : an author writes a paper or a paper is written by an author; P − C : a paper is published in a conference or a conference publishes the paper; T − P : a paper contains a term or a term is mentioned in a paper). Figure 1b again contains multiple types of nodes, e.g., author (A), paper (P), topic (S), and publication venue (V). These are similarly connected by multiple types of links. It is worth noting that in these kinds of academic citation heterogeneous graphs, different types of nodes have different feature spaces. So, for the DBLP and ACM datasets used in this paper, only the paper (P) type nodes have initial features (bag-of-words vectors).
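Definition 1 and Example 1 can be made concrete with a small data structure. The following is a minimal sketch rather than the authors' implementation: the class name `HeteroGraph`, its methods, and the convention of naming a link type after the sorted pair of endpoint types (e.g. "A-P") are all assumptions introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class HeteroGraph:
    """Hypothetical container for a heterogeneous graph G = (V, E)."""
    node_type: dict = field(default_factory=dict)  # phi: node id -> node type
    link_type: dict = field(default_factory=dict)  # psi: link -> link type
    adj: dict = field(default_factory=dict)        # node id -> set of neighbours

    def add_node(self, v, t):
        self.node_type[v] = t
        self.adj.setdefault(v, set())

    def add_link(self, u, v):
        # Assumption: links are undirected and their type is determined
        # by the (sorted) types of the two endpoints.
        a, b = sorted((self.node_type[u], self.node_type[v]))
        self.link_type[frozenset((u, v))] = f"{a}-{b}"
        self.adj[u].add(v)
        self.adj[v].add(u)

    def is_heterogeneous(self):
        # Definition 1: |T| > 1 or |R| > 1.
        n_node_types = len(set(self.node_type.values()))
        n_link_types = len(set(self.link_type.values()))
        return n_node_types > 1 or n_link_types > 1


# A toy DBLP-like graph: one author, paper, conference, and term.
g = HeteroGraph()
for v, t in [("a1", "A"), ("p1", "P"), ("c1", "C"), ("t1", "T")]:
    g.add_node(v, t)
for u, v in [("a1", "p1"), ("p1", "c1"), ("p1", "t1")]:
    g.add_link(u, v)
```

In this toy graph there are four node types and three link types, so `g.is_heterogeneous()` holds.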

Table 1. Notations and explanations.
$G$: Heterogeneous graph
$V$: Node set
$E$: Link set
$v_i$: The $i$-th node
$T$: Type of node set
$R$: Type of link set
$m_i$: The $i$-th motif pattern
$G_{m_i}$: The subgraph in $G$ that satisfies the motif pattern, $m_i$
$N_i^m$: The set of neighboring nodes of $v_i$ in $G_m$
$h$: Node features
$e_{ij}^m$: Importance of node pair $(v_i, v_j)$ in $G_m$
$a_m$: Node-level attention vector in $G_m$
$\alpha_{ij}^m$: Weight of $G_m$-based node pair $(v_i, v_j)$
$q$: Motif-level attention vector
$w_t^{m_i}$: Importance of $t$-type nodes in $G_{m_i}$
$\beta_t^{m_i}$: Attention weight of $t$-type nodes in $G_{m_i}$
$Z_t$: Final representation vector of the $t$-type nodes

$G_M = (V_M, E_M)$ can be defined as the motif subgraph corresponding to the motif pattern, $M$, where $M$ is one of the motif patterns in $G$, $V_M \subseteq V$, and $E_M \subseteq E$. The motif pattern, $M$, has a fixed form that can be naturally observed when the structure of the heterogeneous graph, $G$, has been determined.
Example 2. MBHAN focuses on a motif pattern that consists of three nodes while taking into account node heterogeneity (see Figure 2). Given prior knowledge of the structural schema of the heterogeneous graph, the motif subgraph is a subset of the original heterogeneous graph based on different node types and compositions. Unlike the motif-based representation learning method, MBRep [24], where all the motif instances that satisfy a specific motif pattern are extracted, the node-level attention learning process of MBHAN can be performed entirely on the motif subgraph.
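One way to picture a type-based motif subgraph is sketched below. This is an illustration under stated assumptions, not the extraction procedure used in the paper: the function `motif_subgraph` is hypothetical, a three-type pattern such as ("A", "P", "C") is read as the chain A − P − C, and the subgraph simply keeps every link whose endpoint types are adjacent in that chain, without enumerating individual motif instances.

```python
def motif_subgraph(node_type, links, pattern):
    """Return (nodes, links) of the subgraph selected by a three-type pattern.

    node_type: dict node -> type
    links:     iterable of (u, v) pairs (undirected)
    pattern:   tuple of node types, e.g. ("A", "P", "C") for A - P - C
    """
    # Unordered type pairs that are adjacent in the pattern.
    allowed = {frozenset(pattern[i:i + 2]) for i in range(len(pattern) - 1)}
    kept_links = [
        (u, v) for u, v in links
        if frozenset((node_type[u], node_type[v])) in allowed
    ]
    kept_nodes = {x for link in kept_links for x in link}
    return kept_nodes, kept_links


node_type = {"a1": "A", "a2": "A", "p1": "P", "p2": "P", "c1": "C", "t1": "T"}
links = [("a1", "p1"), ("a2", "p1"), ("p1", "c1"), ("p1", "t1"), ("a2", "p2")]
nodes, kept = motif_subgraph(node_type, links, ("A", "P", "C"))
```

Here the term node "t1" and its link are dropped, while all author-paper and paper-conference links are kept, so node-level attention for this pattern would only ever see neighbors of the relevant types.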
As noted previously, there is a tendency for heterogeneous graph-based graph neural network methods to transform heterogeneous graphs into homogeneous graphs for representation learning via meta-path relationships, making it impossible to undertake global node representation learning. To deal with this, we propose the motif-based hierarchical attention graph neural network algorithm, MBHAN, an end-to-end global learning model that can capture subtle differences in the magnitude of attention at the node level and the motif subgraph level. Figure 3 shows the MBHAN algorithm framework. Several aspects of the MBHAN algorithm will be presented in this section, including its node-level attention mechanism (4.1), its motif subgraph-level attention mechanism (4.2), and its node feature mapping mechanism (4.3).

Node-Level Attention Mechanism
Before aggregating different motif subgraphs for different aspects of the nodes' representations, we first focus on the different roles played by the nodes' neighbors in each motif subgraph. To that end, we begin by looking at the significance of the features aggregated from the neighbors of each node in a specific motif subgraph. MBHAN employs a self-attention mechanism [10] that can learn the weights between different nodes. Given a motif subgraph, $G_m$, containing a pair of nodes, $(v_i, v_j)$, the node-level attention, $e_{ij}^m$, can be defined as follows:

$$ e_{ij}^{m} = \mathrm{att}_{node\text{-}level}\left(h_i, h_j; G_m\right) \qquad (1) $$

where $e_{ij}^m$ denotes the importance of node $v_j$ to node $v_i$; $\mathrm{att}_{node\text{-}level}$ refers to the deep neural network that performs the node-level attention [16]; and, for a given motif subgraph, $G_m$, $\mathrm{att}_{node\text{-}level}$ is shared for all its node pairs. From Equation (1), it can be seen that the attention level of the node pairs $(v_i, v_j)$ within the motif subgraph depends on their features. Unlike the approach adopted in [19], the types of node pairs are not necessarily the same in MBHAN. For example, in the "A − C − B" motif pattern subgraph in Figure 2, a node of type A is represented by aggregating the features of its first-order neighbors, which are type C nodes. The feature space transformations for different types of nodes will be presented in Section 4.3. Note also that the attention between $v_i$ and $v_j$ is asymmetric: the degree of importance of node $v_i$ to node $v_j$ is not necessarily the same as the degree of importance of node $v_j$ to node $v_i$. This is a fundamental property of heterogeneous graphs. Thus, Equation (1) can be more precisely expressed as:

$$ e_{ij}^{m} = \sigma\left(a_m^{\top} \cdot \left[h_i \,\|\, h_j\right]\right) \qquad (2) $$

where $h_i$ denotes the feature vector of node $v_i$; $\sigma$ denotes the activation function (LeakyReLU was selected in this case); $a_m$ denotes the node-level attention vector based on the motif subgraph, $G_m$; and $\|$ denotes the vector concatenation operation.
After obtaining the initial attention scores of all the first-order neighbors of node $v_i$, normalization is undertaken using a SoftMax function, and this gives the attention weights, $\alpha_{ij}^m$, for the degree of importance, $e_{ij}^m$, from node $v_j$ to node $v_i$:

$$ \alpha_{ij}^{m} = \mathrm{softmax}_j\left(e_{ij}^{m}\right) = \frac{\exp\left(\sigma\left(a_m^{\top} \cdot \left[h_i \,\|\, h_j\right]\right)\right)}{\sum_{u \in N_i^m} \exp\left(\sigma\left(a_m^{\top} \cdot \left[h_i \,\|\, h_u\right]\right)\right)} \qquad (3) $$

where $N_i^m$ denotes the neighboring nodes of node $v_i$ in $G_m$. It should be noted that $\alpha_{ij}^m$ is not symmetric, so nodes $v_i$ and $v_j$ do not contribute to each other equally. This is not only because of the order of the vector concatenation in the numerator of Equation (3), but also because they have different neighbors. The embedding vector of node $v_i$ in the motif subgraph $G_m$ is now an attention-based weighted aggregation of its neighboring nodes' features:

$$ z_i^{m} = \sigma\left(\sum_{j \in N_i^m} \alpha_{ij}^{m} \cdot h_j\right) \qquad (4) $$

As the attention weight, $\alpha_{ij}^m$, is based on the specific motif subgraph, $G_m$, it reflects only one aspect of the node representation, namely that of a specific motif pattern. In addition, the non-Euclidean spatial properties of graph data, especially heterogeneous graph data, exhibit high levels of variance [19]. To enrich the capacity of the model and stabilize the training process, MBHAN therefore extends the node-level attention to multi-head attention, which also helps to avoid overfitting. To do this, MBHAN repeats the node-level attention training process $K$ times and concatenates the learned representations to give the motif-specific embedding:

$$ z_i^{m} = \Big\Vert_{k=1}^{K} \sigma\left(\sum_{j \in N_i^m} \alpha_{ij}^{m,k} \cdot h_j\right) \qquad (5) $$

where $\alpha_{ij}^{m,k}$ is the attention weight learned by the $k$-th head. Given a heterogeneous graph, $G$, the set of its motif subgraphs $G_{m_0}, G_{m_1}, \ldots, G_{m_k}$ can be extracted easily after determining its network connectivity pattern. By inputting the features of the nodes, MBHAN can obtain the set of node representation vectors corresponding to all the motif subgraphs in the node-level attention neural network, $Z_{m_0}, Z_{m_1}, \ldots, Z_{m_k}$.
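The node-level attention described above (scoring with a shared attention vector, SoftMax normalization over the neighbors, weighted aggregation, and multi-head concatenation) can be sketched in NumPy as follows. This is a minimal forward-pass illustration, not the authors' code: the function names, the LeakyReLU slope of 0.2, and the use of LeakyReLU as the output activation are assumptions, and no training is performed.

```python
import numpy as np


def leaky_relu(x, slope=0.2):
    # The slope value is an assumption; the text only names LeakyReLU.
    return np.where(x > 0, x, slope * x)


def node_level_attention(h, neighbors, a):
    """Single-head node-level attention on one motif subgraph.

    h:         (n, d) float array of node features (already in a shared space)
    neighbors: dict node index -> list of neighbour indices in the subgraph
    a:         (2d,) attention vector of this motif subgraph
    Returns z (n, d) embeddings and alpha, a dict (i, j) -> weight.
    """
    z = np.zeros_like(h)
    alpha = {}
    for i, nbrs in neighbors.items():
        if not nbrs:
            continue
        # Raw scores: sigma(a^T [h_i || h_j]) for every neighbour j of i.
        e = np.array([leaky_relu(a @ np.concatenate([h[i], h[j]])) for j in nbrs])
        w = np.exp(e - e.max())  # subtract the max for numerical stability
        w /= w.sum()             # softmax over the neighbours of i
        for j, w_ij in zip(nbrs, w):
            alpha[(i, j)] = w_ij
        # Attention-weighted aggregation of the neighbours' features.
        z[i] = leaky_relu(w @ h[nbrs])
    return z, alpha


def multi_head(h, neighbors, heads):
    """Concatenate the outputs of K heads, one attention vector per head."""
    outs = [node_level_attention(h, neighbors, a)[0] for a in heads]
    return np.concatenate(outs, axis=1)


rng = np.random.default_rng(0)
h = rng.normal(size=(4, 3))
neighbors = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
heads = [rng.normal(size=6) for _ in range(2)]
z, alpha = node_level_attention(h, neighbors, heads[0])
Z_m = multi_head(h, neighbors, heads)  # shape (4, 6) for K = 2 heads
```

Because the weights are normalized per node, `alpha[(0, 1)]` and `alpha[(1, 0)]` generally differ, which is the asymmetry noted in the text.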

Motif Subgraph-Level Attention Mechanism
In heterogeneous graphs, each node consists of multifaceted semantic information, but a particular motif subgraph only reflects one view of that node given its current motif connectivity pattern. To more thoroughly learn the embedding of the nodes, it is necessary to fuse the representations of the corresponding nodes in all the motif subgraphs. To achieve this, MBHAN employs a motif subgraph-level semantic attention mechanism for each possible triangular motif subgraph in the heterogeneous graph. This enables it to learn the importance of different motif subgraphs for the final representation of the node and fuse them within an end-to-end learning process. Using the node embeddings learned in the node-level attention mechanism as input, the learning process for each motif subgraph can be represented as follows:

$$ \left(\beta^{m_0}, \beta^{m_1}, \ldots, \beta^{m_k}\right) = \mathrm{att}_{motif\text{-}level}\left(Z_{m_0}, Z_{m_1}, \ldots, Z_{m_k}\right) \qquad (6) $$

where $\mathrm{att}_{motif\text{-}level}$ refers to the execution of a motif subgraph-level deep learning attention process. This can capture the importance of all the triangular motif subgraphs in the heterogeneous graph for the final embeddings of the nodes. To capture the impact of each motif subgraph on the final node representation, MBHAN first uses a one-layer multilayer perceptron (MLP) to nonlinearly transform the embeddings of the nodes in the corresponding motif subgraphs. It then measures the importance of each motif for the final node embedding by using the motif semantic-level attention vector, $q$, where $W$ is the weight matrix and $b$ is the bias vector:

$$ w^{m_i} = \frac{1}{|V|} \sum_{v \in V} q^{\top} \cdot \tanh\left(W \cdot z_v^{m_i} + b\right) \qquad (7) $$

Note that MBHAN attempts to learn the representation of all the nodes in the heterogeneous graphs, so a simple weighted global average across the different types of nodes is clearly inappropriate. As the motif subgraph division of MBHAN is predicated on the heterogeneity of the nodes, different motif subgraphs will have semantic nuances due to the node types, and their influence on the node-level representation learning process will certainly vary.
So, to isolate the influence of node heterogeneity, different types of nodes should be treated differently in the attention mechanism's execution. Equation (7) can therefore be reformulated as:

$$ w_t^{m_i} = \frac{1}{|V_t|} \sum_{v \in V_t} q^{\top} \cdot \tanh\left(W \cdot z_v^{m_i} + b\right) \qquad (8) $$

where $w_t^{m_i}$ is the importance coefficient for the $t$-type nodes in the motif subgraph, $G_{m_i}$, and $V_t$ is the set of $t$-type nodes. To ensure the stability of the training process, different types of nodes all share the one-layer MLP parameters described above when calculating each separate attention coefficient. Having calculated the importance coefficients for different types of nodes in different motif subgraphs, they are normalized to obtain the motif subgraph-level attention weights for each node type, $t$, in $G_{m_i}$ by using a SoftMax function:

$$ \beta_t^{m_i} = \frac{\exp\left(w_t^{m_i}\right)}{\sum_{j=0}^{k} \exp\left(w_t^{m_j}\right)} \qquad (9) $$

$\beta_t^{m_i}$ can be interpreted as the contribution of the $t$-type nodes in $G_{m_i}$ to a specific task. The higher the value of $\beta_t^{m_i}$, the more important the $t$-type nodes in $G_{m_i}$. The importance varies across different tasks. By fusing the various motif-level weights as coefficients, the final $t$-type node embedding will be:

$$ Z_t = \Big\Vert_{i=0}^{k} \beta_t^{m_i} \cdot Z_{m_i}^{t} \qquad (10) $$

where $Z_{m_i}^{t}$ denotes the embeddings of the $t$-type nodes in $G_{m_i}$. Unlike the Heterogeneous Graph Attention Network (HAN) described in [19], which averages all meta-path-based node embeddings, MBHAN uses a vector concatenation strategy to accentuate the different influences of different motif subgraphs on the final node embeddings. As a result, the node representations in different motif subgraphs are reflected in the attention deep learning process at the motif subgraph level. For node types that do not exist in the motif subgraphs (e.g., nodes of type D in the "A − C − B" motif subgraph in Figure 2), the final global node representation, $Z$, of the heterogeneous graph employs zero-complement and feature alignment operations.
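The type-wise motif-level attention can be sketched in the same spirit. The code below is a hedged illustration rather than the paper's implementation: the function name is hypothetical, the tanh nonlinearity inside the one-layer MLP is an assumption borrowed from the semantic-level attention of HAN [19], and node types absent from a motif subgraph are assumed to have already been zero-complemented in the input matrices.

```python
import numpy as np


def motif_level_attention(Z, type_index, W, b, q):
    """Fuse per-motif embeddings with one weight per (node type, motif).

    Z:          list of (n, d) arrays, one per motif subgraph
    type_index: dict node type -> array of row indices of that type
    W, b, q:    (d, d2) weights, (d2,) bias, (d2,) attention vector,
                shared by all node types
    Returns fused: dict type -> (n_t, len(Z) * d) array, and
            beta:  dict type -> (len(Z),) attention weights.
    """
    fused, beta = {}, {}
    for t, idx in type_index.items():
        # Importance of each motif subgraph, averaged over the t-type nodes.
        w = np.array([np.mean(np.tanh(Z_m[idx] @ W + b) @ q) for Z_m in Z])
        e = np.exp(w - w.max())
        beta[t] = e / e.sum()  # softmax over the motif subgraphs
        # Weight each motif-specific embedding, then concatenate them.
        fused[t] = np.concatenate(
            [b_m * Z_m[idx] for b_m, Z_m in zip(beta[t], Z)], axis=1
        )
    return fused, beta


rng = np.random.default_rng(1)
Z = [rng.normal(size=(5, 4)) for _ in range(3)]  # three motif subgraphs
type_index = {"A": np.array([0, 1]), "P": np.array([2, 3, 4])}
W, b, q = rng.normal(size=(4, 6)), rng.normal(size=6), rng.normal(size=6)
fused, beta = motif_level_attention(Z, type_index, W, b, q)
```

Concatenating rather than summing keeps each motif's contribution in its own block of the final vector, which matches the design choice described in the text; the price is an output dimension that grows with the number of motif subgraphs.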
As shown in Figure 3, the final node representations consist of an aggregation of all the motif-specific subgraph semantics, for which different loss functions can be designed to apply the representations to different downstream tasks. For node classification tasks, MBHAN uses a cross-entropy loss function:

$$ L = -\sum_{l \in Y_L} Y_l \ln\left(C \cdot Z_l\right) \qquad (11) $$

where $C$ denotes the classifier parameters; $Y_L$ is the set of nodes with labels; and $Y_l$ and $Z_l$ are the labels and embeddings of the labeled nodes, respectively. Guided by the labeled node data, MBHAN can use back propagation to optimize the model and learn the embeddings of the nodes.
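As a concrete reading of the semi-supervised objective, the sketch below computes a cross-entropy loss over the labeled nodes only. It assumes a linear classifier followed by a softmax and integer class labels; the function name and these modeling details are illustrative assumptions, not specifics taken from the paper.

```python
import numpy as np


def masked_cross_entropy(Z, C, labels, labeled_idx):
    """Cross-entropy summed over the labelled nodes.

    Z:           (n, d) final node embeddings
    C:           (d, c) classifier weights (assumed linear)
    labels:      (n,) integer class ids
    labeled_idx: indices of the nodes whose labels may be used
    """
    logits = Z[labeled_idx] @ C
    logits = logits - logits.max(axis=1, keepdims=True)  # stabilise the softmax
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Log-probability assigned to the true class of each labelled node.
    picked = log_prob[np.arange(len(labeled_idx)), labels[labeled_idx]]
    return -picked.sum()


Z_demo = np.zeros((4, 3))
C_demo = np.ones((3, 2))
labels = np.array([0, 1, 0, 1])
loss = masked_cross_entropy(Z_demo, C_demo, labels, np.array([0, 2]))
```

With all-zero embeddings the classifier is uniform over the two classes, so the loss for two labeled nodes equals 2 ln 2; unlabeled nodes contribute nothing.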

Node Feature Mapping Mechanism
Having presented the details of the model for deep learning of the node-level and motif subgraph-level attention mechanisms, we now need to consider the incompatibility of feature spaces due to the heterogeneity of the nodes. Because we seek to preserve the rich semantics contained in different types of nodes and their links during the graph representation learning process, we are obliged to deal with there being different feature spaces for different types of nodes. Many meta-path-based heterogeneous graph representation learning methods transform heterogeneous graphs into homogeneous graphs for analysis [19,20]. This results in a loss of semantic information for the non-end nodes in the meta-paths. MBHAN aims to learn the embedding of global nodes. In the motif subgraphs subdivided by node types, the node-level attention learning process usually aggregates information from neighbors whose type differs from that of the current node. In real-world graph datasets, it is also quite possible that only features of specific types of nodes can be accurately extracted. In the DBLP heterogeneous graph dataset shown in Figure 1a, for instance, only features of Paper-type nodes (bag-of-words vectors) can be observed, while features relating to the Author, Term, and Conference-type nodes cannot be easily obtained.
To address the above problem, one can create a type-specific transformation matrix, $M_t$, for each node type, $t$, to project the features of different types of nodes into the same feature space. The feature conversion process for a node $v_i$ of type $t$ can be formulated as follows:

$$ h_i' = M_t \cdot h_i \qquad (12) $$

where $h_i$ and $h_i'$ are the original and transformed features of node $v_i$. In MBHAN, this type of transformation is implemented by employing a shallow perceptron whose output dimensionality is consistent with that of the node features that already exist. The specific feature transformation matrix parameters of the model can be guided and learned according to particular downstream tasks throughout the training process. Note that, when undertaking a feature transformation process for multiple types of nodes, the perceptron models should be segregated from each other, i.e., $M_t \neq M_{t'}$ for $t \neq t'$. We expect each transformation to serve only its specific node type, so as to avoid any possible interference between different types of nodes. Although Equation (12) provides the process for feature transformation, it does not address the problem of missing node features in heterogeneous graphs. Graph representation learning methods based on a random walk strategy [24,43,44] rest on an essential assumption: that the rich semantic interactions between node pairs will be reflected in the structure of the heterogeneous graphs. So, if an author has numerous important paper nodes connected to them in the field of data mining, it means that the author has contributed significantly to the field of data mining, and vice versa. If this assumption is correct, the node features obtained by graph analysis from the node connectivity patterns can describe the knowledge contained in the nodes from another perspective (i.e., a graph structure perspective).
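The type-specific projection can be illustrated with a few lines of NumPy. This sketch simplifies the shallow perceptron to a plain linear map, and the function name, the dictionary layout, and the dimensions are assumptions chosen for the example.

```python
import numpy as np


def project_features(features, node_type, M):
    """Map every node's features into a shared space with a per-type matrix.

    features:  dict node -> 1-D array (dimension may differ between types)
    node_type: dict node -> type
    M:         dict type -> (d_out, d_in_of_that_type) matrix, one per type
    """
    return {v: M[node_type[v]] @ x for v, x in features.items()}


features = {"p1": np.ones(5), "a1": np.ones(3)}  # papers: 5-d, authors: 3-d
node_type = {"p1": "P", "a1": "A"}
M = {"P": np.ones((4, 5)), "A": np.ones((4, 3))}  # separate matrix per type
projected = project_features(features, node_type, M)
```

After projection both nodes live in the same 4-dimensional space, so their features can be concatenated inside the node-level attention even though the raw inputs had different sizes.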
A classic "random walk + Skip-Gram" strategy is therefore typically included in the heterogeneous graph feature preprocessing process. In this paper, MBHAN adopts the approach of Grover et al. [45], where parametrically controlled node sequence selection is applied during the random walk process:

$$ \alpha_{pq}(s, x) = \begin{cases} \dfrac{1}{p}, & d_{sx} = 0 \\ 1, & d_{sx} = 1 \\ \dfrac{1}{q}, & d_{sx} = 2 \end{cases} \qquad (13) $$

where $s$ is the previous hop node; $c$ is the current node; $\alpha_{pq}$ is the transfer probability; $d_{sx}$ is the shortest path distance between node $s$ and node $c$'s neighboring node, $x$; $p$ is the return probability parameter, which corresponds to the Breadth First Search (BFS) of the walk process; and $q$ is the departure probability parameter, which corresponds to the Depth First Search (DFS). This approach can adapt the random walk strategy for a specific downstream task or a specific graph structure. In the experiments presented in Section 5, we set $p = q = 1$. This enables a fair comparison between MBHAN and other benchmark algorithms and avoids the need to set specific parameters for the datasets. Doing this collapses the node sequence generation process into a classic random walk process [43]. Finally, the sequence of nodes is fed into a Skip-Gram model to obtain the embedding of the nodes. This is treated as a "structural" feature, $h_i^{str}$, and the corresponding inherent features of the nodes themselves are treated as "semantic" features labeled $h_i^{sem}$. Equation (12) can then be reformulated as follows:

$$ h_i^{sem} = M_t \cdot h_i^{str} \qquad (14) $$

For any given heterogeneous graph, the input layer of the hierarchical attention model accepts nodes with "semantic" features directly, while nodes without intrinsic features have to pass through a random walk model and undergo the "structural" to "semantic" feature transformation described by Equation (14).
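The biased walk controlled by $p$ and $q$ can be sketched with the standard library alone. The code below assumes an unweighted, undirected graph stored as a dictionary of neighbor sets; the function names are hypothetical, and the Skip-Gram step that would consume the walks is deliberately left out.

```python
import random


def search_bias(adj, s, x, p, q):
    """Unnormalised transfer weight for moving to x, given previous node s."""
    if x == s:          # d_sx = 0: return to the previous node
        return 1.0 / p
    if x in adj[s]:     # d_sx = 1: x is also a neighbour of s
        return 1.0
    return 1.0 / q      # d_sx = 2: x moves the walk away from s


def biased_walk(adj, start, length, p=1.0, q=1.0, rng=None):
    """Generate one walk of at most `length` nodes starting at `start`."""
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        c = walk[-1]
        nbrs = sorted(adj[c])
        if not nbrs:
            break
        if len(walk) == 1:
            walk.append(rng.choice(nbrs))  # first step: no previous node yet
            continue
        s = walk[-2]
        weights = [search_bias(adj, s, x, p, q) for x in nbrs]
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk


adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
walk = biased_walk(adj, start=0, length=10, rng=random.Random(0))
```

With $p = q = 1$ every candidate receives the same weight, so the walk reduces to a uniform random walk, which is the setting used in the experiments.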
The overall MBHAN process is shown in Algorithm 1. In summary, MBHAN provides an end-to-end semi-supervised node representation learning approach that includes a feature space mapping process, a node-level attention mechanism, and a motif subgraph-level attention mechanism. For specific downstream tasks, MBHAN can automatically learn the relevant feature transformation parameters and node-level and motif subgraph-level attention weight parameters. In the next section, we report on an evaluation of the full-scale performance of MBHAN when handling a node classification task and a clustering task. We then analyze the impact of various hyperparameters on its performance.

Algorithm 1
… $G_{m_1}, \ldots, G_{m_k}$);
6: The number of attention heads: $K$;
7: Output: Node representation learning vectors: $Z$;
8: Generate node "structure" features using the random walk strategy;
9: $h^{str}$ ← Random Walk (c);
10: for every node type, $t$, in $T$ do:
11: if $v^t_i$ does not have semantic features $h^{sem}$ do;
12: $h^{sem}_i = M_t \cdot h^{str}$;
13: Integration of all node features → $h$;
14: for every motif sub-graph, $G_m$, in ($G_{m_0}, G_{m_1}, \ldots, G_{m_k}$);
15: for $k = 1 \ldots K$ do;
16: for …

Experiments
Before reporting on our evaluation of MBHAN's performance, we will outline the two heterogeneous graph datasets and the state-of-the-art benchmark heterogeneous graph representation learning methods that were used in our experiments.

Datasets
The experiments were performed using two real-world heterogeneous graph datasets, which are often used as benchmark datasets for evaluating the performance of proposed methods [19,36,46]. Table 2 shows their key statistics, including all the possible triangular motif patterns and details of the nodes and links for the corresponding subgraphs.

DBLP_four_area [47]: is a subset of the academic citation heterogeneous network DBLP (https://dblp.uni-trier.de). The database covers four domains: databases; data mining; information retrieval; and artificial intelligence. It contains four different types of nodes (Author, Paper, Term, and Conference) and three different types of links (Author↔Paper, Paper↔Term, and Paper↔Conference). The features of the Paper type nodes consist of their keyword bag-of-words vectors. The features of the Author type nodes are a composite of the bag-of-words vectors for all the Paper type nodes connected to this type of node. The Term type nodes and Conference type nodes do not have any features. The Paper type nodes in the dataset are labeled according to the research fields that correspond to their publication sites. The Author type nodes are labeled according to the research fields associated with their published papers. A graph schema of the dataset is shown in Figure 1a.

ACM (http://dl.acm.org/): Papers published in KDD, SIGMOD, SIGCOMM, MobiCOMM, and VLDB were extracted and grouped into three research areas: databases; wireless communication; and data mining. This dataset contains four different types of nodes (Author, Subject, Paper, and Venue) and three different types of links (Author↔Paper, Paper↔Subject, and Paper↔Venue). We used a dataset preprocessing approach similar to that used in [48], where the features of the Paper type nodes were taken to be the vectors of the bag-of-words elements given by their keywords. They were labeled according to the conference in which the papers were published. In this dataset, the Subject type nodes and Venue type nodes do not have any node features. A graph schema for this dataset is shown in Figure 1b.

Baseline Algorithms
MBHAN was compared with the following baseline algorithms, which together cover homogeneous graph representation learning methods, heterogeneous graph representation learning methods, and graph neural networks.
DeepWalk [43]: is a graph representation method based on a random walk strategy that is usually applied to homogeneous graphs. In the experiments, it treated the datasets as homogeneous graphs because it could not handle the heterogeneity of the nodes.
metapath2vec [44]: is a heterogeneous graph representation learning method based on a meta-path random walk that leverages Skip-Gram to learn the embedding of the nodes. In the experiments, all possible meta-paths consisting of three node types were processed. The reported results give the average performance.
GCN [9]: is a semi-supervised graph convolutional neural network designed for homogeneous graphs. As not all nodes in the datasets have features, the experiments used a meta-path based heterogeneous graph transformation operation. For the Author node classification and clustering task in the DBLP_four_area dataset, the meta-paths A − P − A, A − P − C − P − A, and A − P − T − P − A were employed for the transformation. For the Paper node classification and clustering task, the meta-paths P − A − P, P − C − P, and P − T − P were employed, with the given results being the average performance. For the ACM dataset, the meta-paths P − A − P and P − S − P were employed, with the average performance again being reported.
HAN [19]: is a hierarchical graph attention neural network that can be used for heterogeneous graphs. It takes into account both node-level attention and meta-path-level attention. We adopted the meta-path selection scheme given in the literature. The meta-paths A − P − A, A − P − C − P − A, and A − P − T − P − A were used for the Author node classification and clustering task in the DBLP_four_area dataset, while the meta-paths P − A − P, P − C − P, and P − T − P were used for the Paper node classification and clustering task. For the ACM dataset, the meta-paths P − A − P and P − S − P were used.
GTN [21]: is a heterogeneous graph neural network representation learning method that requires no prior knowledge. It generates new graph structures from multiple candidate adjacency matrices from the original graph to achieve more efficient graph convolution operations.
MBHAN_non_type: is a simplified version of MBHAN that does not distinguish between the node types when calculating the motif subgraph attention. To be precise, after calculating the importance coefficients for the different motif subgraphs in Equation (7), a softmax function is applied to normalize them across all node types together, rather than separately for each node type.

Implementation Details
For the full version of MBHAN, we first randomly initialized the parameters and optimized the model using Adam [49]. The learning rate was set at 0.005, the weight decay parameter was 0.001, the dimensionality of the motif subgraph-level attention vector, q, was 128, the number of attention heads, K, was set at 8, and the dropout ratio [50] was set at 0.6. To ensure that all of the experiments were fair, the semi-supervised graph neural network models, i.e., GCN, GAT, HAN, and GTN, used exactly the same split for their training, validation, and test sets (80%:10%:10%). For the random walk-based graph representation learning methods, i.e., DeepWalk and metapath2vec, the Skip-Gram model context window size was set at 5, the walk sequence length was set at 100, the number of walks per node was set at 5, the number of negative samples was set at 5, and the embedding vector dimension was set at 128. All of the experiments were executed on a Lenovo R9000P 2021 with a 3.2 GHz AMD processor, 64 GB of RAM, and an NVIDIA RTX 3060 laptop graphics card with 6 GB of video memory.

Multi-Class Classification
In multi-class node classification, there are more than two classes and each node in the graph is assigned exactly one label. MBHAN uses a fully connected linear layer to perform the multi-class node classification task. We employed 10-fold cross-validation and report the average Macro-F1 and Micro-F1 performance for MBHAN and all the baseline methods in Table 3. It can be seen that MBHAN outperformed all of the benchmark methods. Among the traditional graph representation learning methods, metapath2vec, which is guided by meta-paths, performed better than DeepWalk. This again confirms that considering the heterogeneity of the nodes preserves more knowledge in the graph. The graph neural network-based methods, e.g., GAT and GCN, not only retained the structural information for the graph, but also attempted to fuse the node features. These methods performed better than the traditional graph representation learning methods. Looking more closely at the results, because of its differential treatment of the nodes' neighboring objects, GAT was better able to capture the degree of importance of the node neighbors in the graph than the simple GCN approach, where the neighboring nodes are merely averaged. When compared with GAT, HAN not only focused on relevant knowledge of the nodes and their neighbors, but also differentiated the influence of substructures (subgraphs that satisfied the meta-paths) on the final embedding of the nodes from a high-dimensional (meta-path) perspective. Unfortunately, it remains the case that HAN transforms heterogeneous graphs into homogeneous ones, so there was inevitably some loss of information, and the effectiveness of the manually selected meta-paths fluctuated across the downstream tasks.
By contrast, GTN's convolution over multiple meta-path graphs enabled it to learn the importance of meta-paths of different lengths more adaptively and achieve a better performance than HAN.
Unlike GTN, MBHAN not only has the advantages of the HAN hierarchical attention mechanism, but also avoids the need for homogeneous transformation of heterogeneous graphs by using motif subgraphs, so it does not require a priori meta-path knowledge. MBHAN can also classify different types of nodes, such as the Author node and Paper node in the DBLP_four_area dataset, simultaneously without needing to change the graph neural network model. So, to sum up, MBHAN performed better than GTN because it can handle non-Euclidean spatially heterogeneous graph data with high degrees of variance and because using a hierarchical attention mechanism enables it to better capture the differentiation between nodes than other graph neural network methods. As a final observation, note that the performance of MBHAN was also better than the simplified MBHAN_non_type. This indicates that having different types of nodes in downstream tasks can mean that different motif subgraphs are of different importance, so being able to distinguish different node types makes it possible to capture this subtle differential knowledge more precisely.
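For readers less familiar with the two metrics reported in Table 3, the following sketch shows how Macro-F1 and Micro-F1 differ for single-label multi-class predictions. It is an illustration only: the function name `f1_scores` and the toy labels are hypothetical and are not taken from the experiments.

```python
def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in labels}
    fp = {c: 0 for c in labels}
    fn = {c: 0 for c in labels}
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[t] += 1  # t was the true class but was missed

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro-F1: average the per-class F1 scores, so every class counts equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro-F1: pool the counts over all classes before computing F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Toy labels using the four DBLP research areas as class names.
y_true = ["DB", "DB", "DM", "DM", "IR", "AI"]
y_pred = ["DB", "DM", "DM", "DM", "IR", "IR"]
macro_f1, micro_f1 = f1_scores(y_true, y_pred)
```

In the single-label setting, Micro-F1 coincides with accuracy, whereas Macro-F1 is pulled down by classes that are rarely predicted correctly (here, the "AI" class).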

Clustering
For the clustering performance evaluation, we used normalized mutual information (NMI). To this end, the k-means algorithm was employed in the experiments to cluster the final embedding of the nodes, and the number of clusters, K, was set to the number of classes. As the performance of the k-means algorithm is influenced by the initial cluster centers, we repeated the experiment 10 times and report here the average performance.
It can be seen from Table 4 that MBHAN performed better than all the baseline methods. By fusing the inherent features of the nodes in the graph, the graph neural network-based methods generally performed better than the traditional random walk-based graph representation learning methods. As GCN did not distinguish the importance of neighboring nodes, in most cases its performance was inferior to GAT. This underscores the fact that attention mechanisms capture more meaningful node embeddings in the graph neural network representation learning process. Note, however, that HAN, which does employ a hierarchical attention mechanism, performed less well than GTN, where there was no use of prior knowledge, and MBHAN. This is because it focused only on the importance of certain meta-paths and inevitably missed information about the influence of other high-dimensional node sets on the final embedding.
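The NMI score used for the clustering evaluation can be computed from two label assignments alone. The sketch below is a plain-Python illustration, assuming that the mutual information is normalized by the arithmetic mean of the two entropies (one of several common conventions; the paper does not state which one was used). The function name `nmi` is hypothetical, and the k-means step that produces the cluster labels is not shown.

```python
from collections import Counter
from math import log


def nmi(labels_true, labels_pred):
    """Normalized mutual information between two label assignments."""
    n = len(labels_true)
    count_t = Counter(labels_true)
    count_p = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))

    # Mutual information I(T; P) from the joint and marginal counts.
    mi = sum(
        (n_tp / n) * log(n * n_tp / (count_t[t] * count_p[p]))
        for (t, p), n_tp in joint.items()
    )
    # Entropies of the two assignments.
    h_t = -sum((c / n) * log(c / n) for c in count_t.values())
    h_p = -sum((c / n) * log(c / n) for c in count_p.values())

    if h_t == 0.0 or h_p == 0.0:
        # Degenerate case: at least one assignment has a single cluster.
        return 1.0 if h_t == h_p else 0.0
    # Assumption: normalize by the arithmetic mean of the entropies.
    return mi / ((h_t + h_p) / 2)


# NMI ignores how clusters are named: swapping cluster ids still scores 1.
same_partition = nmi([0, 0, 1, 1], [1, 1, 0, 0])
# Cluster labels that carry no information about the classes score 0.
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the score depends only on how nodes are grouped, it can be averaged directly over repeated k-means runs with different initial centers.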

Comparative Statistical Tests for the Different Algorithms
The previous two subsections involved the comparison of multiple algorithms on multiple datasets. However, the same algorithm may not have the same ranking on different datasets. We therefore undertook some statistical tests to assess the overall performance of each particular algorithm across the multiple datasets.
A Friedman test [51] can be employed to determine whether the performance of a particular algorithm (MBHAN, in this case) is significantly different from that of the other algorithms. It is a classic non-parametric statistical test whose null hypothesis is that there is no significant difference between the mean ranks of the algorithms being compared. If we let $Rank_i^j$ be the rank of the j-th method on the i-th dataset, its mean rank across the datasets can be calculated using Equation (15),

$$R_j = \frac{1}{S} \sum_{i=1}^{S} Rank_i^j \quad (15)$$

where $N$ is the number of methods to be compared and $S$ is the number of datasets. The Friedman statistic can then be computed using Equation (16).

$$\tau_{\chi^2} = \frac{12S}{N(N+1)} \left[ \sum_{j=1}^{N} R_j^2 - \frac{N(N+1)^2}{4} \right] \quad (16)$$

As $\tau_{\chi^2}$ was found to be too conservative, an improved version is given by Equation (17).

$$\tau_F = \frac{(S-1)\,\tau_{\chi^2}}{S(N-1) - \tau_{\chi^2}} \quad (17)$$
To perform an overall comparison, the experimental results for the different datasets need to be combined. S was set at 3 and N was set at 8. The Macro-F1, Micro-F1, and NMI Friedman test values were 41.200, 24.999, and 73.600, respectively. These are all much higher than the critical value (2.76) for a significance level of α = 0.05. Therefore, the null hypothesis can be safely rejected. In other words, there are significant performance differences between the different methods.
Because the null hypothesis is rejected, a post-hoc test can be performed. The Nemenyi test [51], which is calculated using Equation (18), defines the critical difference (CD) value beyond which two methods are significantly different with a certain confidence (1 − γ).

$$CD = q_\gamma \sqrt{\frac{N(N+1)}{6S}} \quad (18)$$

Here, $q_\gamma$ is the critical value based on the Studentized range statistic divided by $\sqrt{2}$.
The critical difference at the 95% confidence level was calculated (CD = 6.06). The results of the test are shown in Figure 4, where the mean rank of each method is marked by a dot, and the length of the horizontal bar across each dot shows the critical difference. For any two methods, one method outperforms the other if its mean rank is smaller. More strictly, if the mean ranks of any two methods differ by more than CD, the superiority is significant. According to the mean ranks and the results of the Nemenyi test, we can conclude that our MBHAN method achieves the best overall performance, and significantly outperforms some of these methods.
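The Friedman statistics and the Nemenyi critical difference described above can be reproduced with a short sketch. It uses the standard textbook forms of these tests; the function names and the rank table are hypothetical (the table is not the paper's data), and the value 3.031 for $q_\gamma$ with eight methods at α = 0.05 is taken from commonly published Nemenyi tables rather than from the paper itself.

```python
from math import sqrt


def mean_ranks(rank_table):
    """rank_table[i][j] is the rank of method j on dataset i."""
    s = len(rank_table)
    n = len(rank_table[0])
    return [sum(row[j] for row in rank_table) / s for j in range(n)]


def friedman_chi2(ranks, s):
    """Friedman chi-square statistic from the mean ranks of N methods."""
    n = len(ranks)
    return 12 * s / (n * (n + 1)) * (sum(r * r for r in ranks) - n * (n + 1) ** 2 / 4)


def friedman_f(chi2, n, s):
    """Less conservative F-distributed variant of the Friedman statistic."""
    return (s - 1) * chi2 / (s * (n - 1) - chi2)


def nemenyi_cd(q_gamma, n, s):
    """Critical difference between mean ranks for the Nemenyi test."""
    return q_gamma * sqrt(n * (n + 1) / (6 * s))


# Hypothetical ranks of N = 8 methods on S = 3 result sets (1 = best).
rank_table = [
    [8, 7, 6, 5, 4, 3, 2, 1],
    [8, 7, 5, 6, 4, 3, 2, 1],
    [8, 7, 6, 5, 3, 4, 2, 1],
]
ranks = mean_ranks(rank_table)
chi2 = friedman_chi2(ranks, s=3)
f_stat = friedman_f(chi2, n=8, s=3)

# Assumed table value q = 3.031 (8 methods, alpha = 0.05) gives CD of about 6.06.
cd = nemenyi_cd(3.031, n=8, s=3)
```

With N = 8 and S = 3 the square-root term equals exactly 2, so the critical difference is simply twice the tabulated critical value, which agrees with the CD of 6.06 reported above.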

Analysis of the Hyperparameters
The parameter sensitivity was analyzed by using Micro-F1 metrics for MBHAN's classification performance in relation to the Author type nodes in the DBLP_four_area dataset. The analyzed parameters included the embedding dimensions of the node-level attention mechanism output, the motif subgraph-level attention vector dimensions, q, and the multi-head attention parameter, K.
Node-level attention mechanism output dimensions: As the final node embedding of MBHAN was a concatenation of the outputs from the motif subgraphs' attention process, its size was impacted by the number of motif subgraphs that could be subdivided by the specific heterogeneous graphs. The results are shown in Figure 5a, where it can be seen that its performance initially improved as the number of embedding dimensions increased, but then began to slowly decrease. This is due to the fact that the final representation of the nodes requires a suitable dimension for encoding. Furthermore, when the output dimensions of the node-level attention process became too large and multiple motif subgraphs with various kinds of representations had to be stacked, this introduced redundancy into the node embedding. Note, however, that the dimensions of the node-level attention output did not cause significant performance fluctuations in the downstream tasks because MBHAN not only learned the attention weights at the node level and motif subgraph level, but also the differentiation between different types of heterogeneous graph node. Thus, the concatenation strategy for the final embedding of the nodes could assign useful knowledge to the representation feature space.

The motif subgraph-level attention vector, q: The results for how the motif subgraph-level attention mechanism was affected by the dimensionality of vector q are shown in Figure 5b. Here, it can be seen that MBHAN's performance increased in line with the dimensions of q and reached its best performance at 128. After that, its performance started to decline because of overfitting of the training process caused by the large number of dimensions.
The multi-head attention parameter, K: The influence of the multi-head attention parameter, K, on MBHAN's performance is shown in Figure 5c. When K = 1, MBHAN did not adopt the multi-head attention strategy. Its performance improved slightly as the number of heads increased. This is because the multi-head attention strategy essentially involves integrating several independent attention coefficient computations. This strategy not only characterizes the nodes from multiple perspectives, but also prevents overfitting and makes the training process more stable.

Conclusions
In this paper, we have proposed a method for motif-based hierarchical attentional graph neural network representation learning, called MBHAN. It consists of a node-level attention mechanism and a motif semantic attention mechanism. MBHAN does not require any prior knowledge, but instead seeks to reflect the node features from different perspectives. MBHAN treats the relative importance of various node types differently for the node embedding during the learning process. This enables it to capture subtle nuances caused by having different node types in the motif subgraphs. A full-scale evaluation of the performance of MBHAN was undertaken on two heterogeneous graph datasets, which involved node classification and clustering tasks, and the impact of various hyperparameters on its performance was also evaluated. The F1 and NMI metrics of the results and a statistical analysis show that the proposed method can outperform other state-of-the-art methods.
It should be noted that the feature space mapping proposed by MBHAN is still not able to completely solve the problem of there being incompatible feature spaces for different node types in heterogeneous graphs. This will therefore be the focus of our future research.