Directed Network Comparison Using Motifs

Analyzing and characterizing the differences between networks is a fundamental and challenging problem in network science. Most previous network comparison methods that rely on topological properties have been restricted to measuring differences between two undirected networks. However, many networks, such as biological networks, social networks, and transportation networks, exhibit inherent directionality and higher-order attributes that should not be ignored when comparing networks. Therefore, we propose a motif-based directed network comparison method that captures local, global, and higher-order differences between two directed networks. Specifically, we first construct a motif distribution vector for each node, which captures the information of a node’s involvement in different directed motifs. Then, the dissimilarity between two directed networks is defined on the basis of a matrix, which is composed of the motif distribution vector of every node and the Jensen–Shannon divergence. The performance of our method is evaluated via the comparison of six real directed networks with their null models, as well as their perturbed networks based on edge perturbation. Our method is superior to the state-of-the-art baselines and is robust with different parameter settings.


Introduction
Many systems in various domains, featuring intricate interaction relationships, can be effectively represented as complex networks [1], including social platforms [2,3], biological systems [4], and economic systems [5]. Due to the diversity of network forms [6,7] and the high-order features of networks [8,9], the precise measurement of similarity between different networks, namely the design of an effective network comparison method, has emerged as a central focus in network science. Network comparison aims to quantify the differences between two networks based on their topological structure, allowing the effective handling of different types of tasks [10,11]. For example, in the field of pattern recognition, network comparison can be applied to classify content such as images, documents, and videos [12]. In the biological domain, network comparison can be used to analyze which protein interactions may have equivalent functions [13]. In neuroscience, the comparison of brain networks contributes to understanding the functional differences between normal and pathological brains [14].
Network comparison was originally formulated as the graph isomorphism problem [15], which has been proven to belong to the complexity class NP [16].
In recent years, researchers have proposed various methods from different perspectives to measure the similarity between networks [17][18][19][20][21][22]. The majority of these methods have concentrated on the comparison of undirected networks. However, interactions among distinct entities in the real world are commonly asymmetric. In social networks, for instance, user i trusting user j does not necessarily imply reciprocal trust from j to i. The directionality of the interactions between nodes, which cannot be captured by an undirected network, has motivated research on directed network comparison. For example, Bagrow and Bollt [23] utilized portrait divergence, a measure based on the distribution of the shortest path lengths, to evaluate the structural similarity between networks. Koutra et al. [24] proposed DeltaCon, which calculates the Matusita distance between the similarity matrices of two networks. Sarajlic et al. [25] extended network distance measures to directed networks using directed graphlets, demonstrating their efficacy in distinguishing various directed networks. The centrality-based methods, which use measures such as degree [26], closeness [27], and clustering coefficient [28], compare networks based on the centrality values of each node. Although these methods can compare networks effectively to some extent, most of them do not consider the higher-order structure of a network, which has been shown to be ubiquitous in various complex systems [9]. Consequently, we propose using directed motifs to quantify the dissimilarity between two networks. Motifs are recurring subgraphs in a network that exhibit specific interaction patterns and facilitate understanding of the functionality of networks [29]. Motifs have been widely used in different network tasks, e.g., community detection [30], link prediction [31], and node ranking [32]. Compared with conventional methods, motif-based approaches have exhibited superior performance in tackling these problems.
To explore the similarity between different directed network structures, in this paper we propose a motif-based directed network comparison method $D_m$, i.e., we use motifs to examine smaller components of directed networks and thereby assess the similarity between networks. We start by constructing a node motif distribution matrix, where the elements in the matrix are obtained by computing the distribution of nodes appearing in different directed motifs. Due to computational complexity, we consider the motifs composed of 2 to 4 nodes and thus obtain 35 different directed motifs. We then use the Jensen-Shannon divergence to quantify the dissimilarity between two directed networks both locally and globally. We validate the effectiveness of $D_m$ in six real directed networks. Compared to the baseline methods, $D_m$ exhibits notable distinguishability and robustness in comparing networks.
The rest of the paper is organized as follows. Section 2 introduces the definition of motifs in a directed network and details the motif-based directed network comparison method. We describe the baseline methods and directed network datasets in Section 3. All experimental results are presented in Section 4. Section 5 summarizes the paper.

The definition of motifs in a directed network
A directed network is denoted as $G = (V, E)$, where $V$ and $E$ represent the node set and edge set, respectively. The number of nodes and the number of edges are given by $N$ and $M$. The adjacency relationship between two nodes in $G$ is given by the adjacency matrix $A$, with $A_{ij} = 1$ indicating that there is a directed edge from $v_i$ to $v_j$, and $A_{ij} = 0$ implying that there is no such edge. We note that, because $G$ is directed, $A$ is in general an asymmetric matrix.
Motifs are frequently recurring subgraph patterns in complex networks, each consisting of a small group of closely connected nodes and edges. Due to the high complexity of computing motifs in a network, we normally consider motifs formed by 2 to 4 nodes. Motifs play a crucial role in the study of complex networks, acting as fundamental building blocks for large complex networks, analogous to genes in biology. In a directed network, the motifs are formed by nodes with directed edges. We show examples of directed motifs in Figure 1. There are 35 directed motifs, each comprising 2 to 4 nodes, denoted as $m_1$ to $m_{35}$. For instance, there are two kinds of motifs if we consider two nodes, which are given by $m_1$ and $m_2$ in the figure.
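As a minimal illustration of how a node's involvement in motifs can be tallied, the sketch below counts, for each node, the two-node patterns it takes part in: a pair joined by a single directed edge, and a pair joined by edges in both directions. The function name, the column order, and any mapping of these two patterns onto $m_1$ and $m_2$ are assumptions made for illustration; Figure 1 fixes the actual labelling.

```python
from collections import defaultdict

def two_node_motif_counts(edges):
    """Count, per node, participation in the two 2-node directed patterns.

    Returns {node: [single, mutual]}, where `single` counts neighbours joined
    by exactly one directed edge and `mutual` counts neighbours joined by
    edges in both directions. Self-loops are ignored (assumption).
    """
    edge_set = {(u, v) for u, v in edges if u != v}
    counts = defaultdict(lambda: [0, 0])
    seen_pairs = set()
    for u, v in edge_set:
        pair = frozenset((u, v))
        if pair in seen_pairs:      # each unordered pair is counted once
            continue
        seen_pairs.add(pair)
        column = 1 if (v, u) in edge_set else 0
        counts[u][column] += 1
        counts[v][column] += 1
    return dict(counts)

# Toy network: 0 <-> 1 and 1 -> 2
counts = two_node_motif_counts([(0, 1), (1, 0), (1, 2)])
```

Dividing each node's counts by their sum would give that node's distribution over these two patterns; the same idea extends to the three-node and four-node motifs, at a higher enumeration cost.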

Motif-Based Directed Network Comparison Method
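Motifs contain important topological information of a network and are thus essential for network comparison. Based on the distinctive topological properties of directed motifs, we first compute the motif distribution in a directed network. As the time complexity of computing motifs is quite high, we use the motifs listed in Figure 1, which are formed by 2, 3, and 4 nodes, for the computation of the motif distribution. Specifically, we use a node motif distribution matrix $T$, in which the element $T_{ij}$ in row $i$ and column $j$ is the share of motif $m_j$ in the motif distribution of node $v_i$. The heterogeneity among the $N$ node motif distributions is measured by their Jensen-Shannon divergence, which is given by:

$$J(T) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{35} T_{ij} \log \frac{T_{ij}}{\mu_j},$$

where $\mu_j$ represents the average value of the $N$ motif distributions for motif $m_j$, calculated as follows:

$$\mu_j = \frac{1}{N} \sum_{i=1}^{N} T_{ij}.$$

Based on this divergence, we define the dispersion of directed network nodes (DNND), which quantifies the heterogeneity of connectivity between nodes.
Given two directed networks $G_1$ and $G_2$, the structural dissimilarity between them can be calculated based on their motif distribution matrices $T_1$ and $T_2$. We use $D_m(G_1, G_2)$ to represent the dissimilarity between $G_1$ and $G_2$. The dissimilarity $D_m$ comprises two terms, and we use a parameter φ (0 ≤ φ ≤ 1) to adjust their weights. The first term measures the difference between the average motif distributions of $G_1$ and $G_2$ and predominantly reflects the global distinctions between the two networks. The second term mainly describes the difference between the DNNDs of the two networks, indicating the local difference between them. A lower value of $D_m$ indicates a higher network similarity and vice versa.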
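The two-term structure described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the exact definition of $D_m$: the square roots, the normalisation of the global term by log 2, the normalisation of DNND by the logarithm of the number of motifs, and the skipping of nodes that appear in no motif are all assumptions, as are the function names. The inputs are per-node motif counts (one row per node, one column per motif).

```python
import math

def normalise_rows(counts):
    """Turn per-node motif counts into per-node motif distributions."""
    rows = []
    for row in counts:
        total = sum(row)
        if total > 0:  # assumption: nodes that appear in no motif are skipped
            rows.append([c / total for c in row])
    return rows

def js_divergence(rows):
    """Equal-weight Jensen-Shannon divergence of a list of distributions.

    Returns the divergence and the average distribution mu.
    """
    n = len(rows)
    mu = [sum(col) / n for col in zip(*rows)]
    total = 0.0
    for row in rows:
        for p, m in zip(row, mu):
            if p > 0:
                total += p * math.log(p / m)
    return max(total / n, 0.0), mu

def motif_dissimilarity(counts1, counts2, phi=0.5):
    """Weighted sum of a global term and a local term (assumed form)."""
    t1, t2 = normalise_rows(counts1), normalise_rows(counts2)
    j1, mu1 = js_divergence(t1)
    j2, mu2 = js_divergence(t2)
    k = len(mu1)                                   # number of motifs, k >= 2
    dnnd1, dnnd2 = j1 / math.log(k), j2 / math.log(k)  # assumed normalisation
    j_global, _ = js_divergence([mu1, mu2])
    global_term = math.sqrt(j_global / math.log(2))
    local_term = abs(math.sqrt(dnnd1) - math.sqrt(dnnd2))
    return phi * global_term + (1 - phi) * local_term

# Toy example with two motifs and three nodes per network
c1 = [[0, 1], [1, 1], [1, 0]]
c2 = [[1, 0], [1, 0], [1, 0]]
d_same = motif_dissimilarity(c1, c1)
d_diff = motif_dissimilarity(c1, c2)
```

With these normalisations both terms lie in [0, 1], so the sketch returns 0 for identical inputs and a value of at most 1 otherwise; setting φ closer to 1 emphasises the global term.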

Baselines
Portrait-based directed network comparison method [23]: For a directed network $G$, we construct a portrait matrix $B$ based on the distance between nodes. Each element $B_{l,k}$ represents the number of nodes that have $k$ nodes at a distance of $l$, where $0 \le l \le d$ and $d$ represents the diameter of $G$. We note that we utilize the shortest directed path length to calculate the distance between nodes. In addition, $B$ is independent of the ordering and labeling of the nodes. Based on $B_{l,k}$, we can derive the probability that a randomly selected node has $k$ nodes at a distance of $l$, which is given by $P(k \mid l) = B_{l,k} / N$. For two directed networks, $G_1$ and $G_2$, the probability distributions $Q_1$ and $Q_2$ are employed to interpret the rows of the network portraits for each of them.
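The construction of the portrait matrix can be sketched with a breadth-first search from every node. This is a minimal illustration, assuming an unweighted edge list; the function names and the dictionary-of-counters layout for $B$ are hypothetical choices, and the diameter is taken as the longest finite directed shortest path.

```python
from collections import Counter, deque

def bfs_distances(adj, source):
    """Directed shortest path lengths from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def portrait(nodes, edges):
    """Return B as {l: Counter({k: number of nodes with k nodes at distance l})}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    # One Counter per node: distance l -> number of nodes at that distance.
    shells = [Counter(bfs_distances(adj, s).values()) for s in nodes]
    diameter = max(max(c) for c in shells)
    B = {l: Counter() for l in range(diameter + 1)}
    for c in shells:
        for l in range(diameter + 1):
            B[l][c.get(l, 0)] += 1
    return B

def row_probabilities(B, n_nodes):
    """P(k | l) = B[l][k] / N for every row l of the portrait."""
    return {l: {k: cnt / n_nodes for k, cnt in row.items()} for l, row in B.items()}

# Toy network: the directed path 0 -> 1 -> 2
B = portrait([0, 1, 2], [(0, 1), (1, 2)])
P = row_probabilities(B, 3)
```

Each row of $B$ sums to $N$, so each row of `P` is a probability distribution over $k$.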
The dissimilarity between $G_1$ and $G_2$ is represented as $D_p(G_1, G_2)$ and is defined as the Jensen-Shannon divergence between $Q_1$ and $Q_2$:

$$D_p(G_1, G_2) = \frac{1}{2} KL(Q_1 \| Q^*) + \frac{1}{2} KL(Q_2 \| Q^*),$$

where $Q^* = (Q_1 + Q_2)/2$ is the mixture distribution of $Q_1$ and $Q_2$.
For the DeltaCon-based method [24], we assume that the similarity matrices for two directed and unweighted networks $G_1$ and $G_2$ are denoted as $S$ and $S'$, and the dissimilarity $D_d$ between them is given by the Matusita distance:

$$D_d(G_1, G_2) = \sqrt{\sum_{i=1}^{N} \sum_{j=1}^{N} \left( \sqrt{S_{ij}} - \sqrt{S'_{ij}} \right)^2}.$$

Closeness-based directed network comparison method: Centrality measures, such as degree, betweenness, and closeness, have been used to compare networks [27]. In our experiments, however, we find that closeness centrality outperforms the other centrality measures in network comparison. Therefore, we omit the other centrality measures and only use closeness for directed network comparison. Closeness centrality measures the importance of a node within a network by evaluating its proximity to the other nodes. The closeness centrality of a node $v_i$ is defined on the basis of $d_{ij}$, the directed shortest path length from node $v_i$ to node $v_j$.
For two directed networks $G_1$ and $G_2$, we compute the closeness centrality vector of each network, and the dissimilarity $D_c(G_1, G_2)$ based on closeness centrality is obtained by comparing the two vectors.
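A closeness-based comparison of this kind can be sketched as follows. Both formulas in the sketch are assumptions: closeness is computed over reachable nodes only (the number of reachable nodes divided by the sum of their directed distances, and 0 for a node that reaches no other node), and the two closeness vectors are compared by their mean absolute difference over a shared node set. The function names are hypothetical.

```python
from collections import deque

def closeness(nodes, edges):
    """Closeness of each node over the nodes it can reach (assumed variant)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    result = {}
    for s in nodes:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total = sum(dist.values())       # sum of directed distances from s
        reached = len(dist) - 1          # nodes reachable from s, excluding s
        result[s] = reached / total if total > 0 else 0.0
    return result

def closeness_dissimilarity(nodes, edges1, edges2):
    """Mean absolute difference of closeness vectors (assumed form of D_c)."""
    c1, c2 = closeness(nodes, edges1), closeness(nodes, edges2)
    return sum(abs(c1[v] - c2[v]) for v in nodes) / len(nodes)

nodes = [0, 1, 2]
path = [(0, 1), (1, 2)]
shortcut = path + [(0, 2)]
d_c = closeness_dissimilarity(nodes, path, shortcut)
```

In the toy example only node 0 changes its closeness (from 2/3 to 1) when the shortcut edge is added, so the result is (1/3)/3 = 1/9.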

Description of Directed Network Datasets
To evaluate the performance of our proposed method and the state-of-the-art baselines, we select six real-world directed networks from diverse domains, including biological networks, transportation networks, and social networks. The descriptions of the datasets are as follows:
Mac [33] describes the dominance behavior between adult female Japanese macaques. Each node denotes a macaque, and a directed edge from node $v_i$ to $v_j$ indicates the dominance of $v_i$ over $v_j$.
Caenorhabditis elegans (Elegans) [34] is a neural network of Caenorhabditis elegans. It uses directed edges to represent neural connections among neurons in the nervous system of Caenorhabditis elegans.
Physicians [35] is a directed network that describes the spread of innovation among physicians. A directed edge $(v_i, v_j)$ between two physicians $v_i$ and $v_j$ implies that $v_i$ would turn to $v_j$ for suggestions or for a discussion.
Email-Eu-core (Email) [36] is an email network that captures email interactions between members of a large European research institution. A directed edge between two staff members $v_i$ and $v_j$ means that $v_i$ has sent an email to $v_j$.
US airport [37] illustrates the flight connections between US airports. A directed edge $(v_i, v_j)$ between two airports $v_i$ and $v_j$ indicates that there is at least one flight from airport $v_i$ to $v_j$.
Chess [37] is a network that characterizes the interaction between players in international chess games within a month. A directed edge is formed from a white player to a black player in this network.
Table 1 shows the basic properties of the directed networks mentioned above, including the number of nodes ($N$), the number of edges ($M$), average degree (Ad), average shortest path length (Avl), and network diameter ($d$).

The dissimilarity between a real network and its null models
Null models, which retain specific network properties such as the degree distribution or clustering coefficient while randomly reshuffling network connections, are widely used as a tool for comparing network topology [38].
In this section, we propose three null models for directed networks to gradually change the network topology and use our comparison method to compare each directed network and its null models.
We extend the dk-series null models that were originally proposed for undirected networks to directed networks [39], which retain the degree distributions, correlations, and clustering of a real directed network to some extent. Concretely, the models are as follows:
Dk1.0 preserves the outdegree and indegree of each node while randomly rewiring each directed edge. Therefore, the degree sequence of the original network is preserved in the reshuffling process.
Dk2.0 reshuffles every edge in the network while maintaining the outdegree, indegree, and joint degree distribution of the original network.
Dk2.5 rewires every edge while preserving the distribution of the degree-dependent clustering coefficient.
We note that the newly formed directed edges must not already exist in the original network.
We show examples of how to generate the null models in Figure 2(a-c), where the blue dashed lines indicate the newly connected edges. In each panel, the left side shows the original network and the right side shows the network after the rewiring process. Figure 2(a) shows an instance for Dk1.0. Specifically, we disconnect the edges $(v_1, v_2)$ and $(v_3, v_4)$, and form new edges, i.e., $(v_1, v_4)$ and $(v_3, v_2)$. Therefore, the indegree and outdegree of each node are preserved in this process. Figure 2(b) demonstrates the generation of a random network via Dk2.0, which is stricter than Dk1.0. For example, if we disconnect the directed edge $(v_1, v_2)$, we need to find a node that has the same indegree and outdegree as $v_2$, and the appropriate node is $v_4$. Accordingly, we connect $v_1$ and $v_4$ and form a new directed edge $(v_1, v_4)$. Therefore, Dk2.0 maintains the degree sequence and the joint degree distribution of a network. In Figure 2(c), the degree (sum of indegree and outdegree) and clustering coefficient of each node are {2, 3, 3, 3, 3, 1, 1, 1, 1} and {1/2, 1/6, 0, 0, 1/6, 0, 0, 0, 0}, respectively. Therefore, the average clustering coefficients for nodes with degrees {1, 2, 3} are {0, 1/2, 1/12}, respectively, which are also called degree-dependent clustering coefficients. We disconnect the directed edges $(v_1, v_2)$ and $(v_4, v_3)$ and form new directed edges $(v_1, v_3)$ and $(v_4, v_2)$. In the rewired network, the degree-dependent clustering coefficient distribution is the same as in the original network.
For the Dk null models, a lower value of k implies greater disruption of the original network structure. In Figure 3, we use the motif-based directed network comparison method to quantify the dissimilarity between each of the directed networks and its three null models. Experimental results across the six networks suggest that as k increases, the similarity between the original network and its null models gradually increases. The dissimilarity observed in our approach aligns with the generation of the null models, providing further confirmation of the effectiveness and stability of our method for comparing directed networks from different domains.
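The Dk1.0 rewiring step described above, in which two edges $(a, b)$ and $(c, d)$ are replaced by $(a, d)$ and $(c, b)$, can be sketched as a degree-preserving edge swap. This is a minimal sketch; the function name, the `n_swaps` and `max_tries` parameters, and the rejection rules for self-loops and duplicate edges are illustrative assumptions.

```python
import random

def dk1_rewire(edges, n_swaps, seed=None, max_tries=10000):
    """Swap edge targets at random while keeping every in- and out-degree.

    A swap turns (a, b), (c, d) into (a, d), (c, b). It is rejected if it
    would create a self-loop, duplicate a current edge, or recreate an edge
    of the original network.
    """
    rng = random.Random(seed)
    original = set(edges)
    current = list(original)
    present = set(original)
    if len(current) < 2:
        return current
    done = tries = 0
    while done < n_swaps and tries < max_tries:
        tries += 1
        i, j = rng.sample(range(len(current)), 2)
        (a, b), (c, d) = current[i], current[j]
        new1, new2 = (a, d), (c, b)
        if a == d or c == b:
            continue                      # would create a self-loop
        if new1 in present or new2 in present:
            continue                      # would duplicate a current edge
        if new1 in original or new2 in original:
            continue                      # new edges must not exist originally
        present -= {current[i], current[j]}
        present |= {new1, new2}
        current[i], current[j] = new1, new2
        done += 1
    return current

edges = [(1, 2), (3, 4), (5, 6), (2, 3), (4, 5), (6, 1)]
rewired = dk1_rewire(edges, n_swaps=3, seed=0)
```

Because each swap only exchanges the targets of two edges, every node keeps its outdegree and indegree regardless of how many swaps are accepted. Dk2.0 and Dk2.5 would need additional acceptance checks on the joint degree distribution and the degree-dependent clustering coefficient, which are not shown here.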

Parameter Sensitivity Analysis
The motif-based directed network comparison method involves a parameter, denoted as φ, that determines how much importance is given to the global or local differences between two networks, with a larger value of φ indicating that more weight is given to the global difference and vice versa. Therefore, we perform a parameter analysis for φ in the six real-world directed networks by comparing each original network with its perturbed networks. The results are given in Figure 5, in which curves with different colors correspond to different values of φ (φ ∈ {0.1, 0.3, 0.5, 0.7, 0.9}). The curves exhibit a similar trend for different values of φ, and there is only a small deviation among the curves when f < 0. However, the differences in network dissimilarity across values of f are more pronounced for φ = 0.5 in most networks (except Physicians and Email), which means that both the global and the local differences between networks need to be considered for comparison. Therefore, we use φ = 0.5 in the above analysis.

Conclusion
In this paper, we introduce a comparison method $D_m$ that utilizes network motifs to assess the similarity between directed networks. The method, which considers both local and global differences between two directed networks as well as higher-order information, is based on node motif distributions and employs the Jensen-Shannon divergence. In detail, we use the motifs of sizes up to 4 that are listed in Figure 1 to compute the motif distribution of nodes in a directed network. Based on the Jensen-Shannon divergence and the motif distributions of nodes, we define the dispersion of directed network nodes (DNND) to quantify the heterogeneity of connectivity between nodes. Lastly, for two given directed networks, the dissimilarity between them is defined by combining the DNND values and the average motif distributions. Our method aims to better understand the internal connection patterns of the network nodes by capturing essential subgraph structures. To show the effectiveness of our method, we compare a directed network with its null models, which gradually change the structure of the original network. In addition, we compare our method with the baselines in characterizing the similarity between an original network and its perturbed networks. The results show that our method outperforms these baseline methods across networks from different domains.
Motifs have been widely used to address a range of tasks. In our analysis, we take into account the directionality of edges by utilizing directed motifs to compare directed networks. We limit our analysis to motifs of sizes up to 4 due to the high computational expense involved. Although considering larger motifs could potentially enhance the effectiveness of our approach, it may pose scalability challenges when dealing with large networks containing millions of nodes. Given the success of motifs in network comparison, we believe that developing efficient algorithms for computing motifs could be a promising avenue for research. This has the potential not only to enhance network comparison, but also to improve other network tasks such as community detection, node classification, and influence maximization.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figure 1: Motifs formed by 2 to 4 nodes in directed networks. All the motifs are labeled from $m_1$ to $m_{35}$.
$KL(\cdot \| \cdot)$ represents the Kullback-Leibler divergence between two distributions.
DeltaCon-based directed network comparison method [24]: DeltaCon considers the similarity between two networks by quantifying the difference of the r-step paths rather than the edges alone. Given a directed and unweighted network $G$ and its adjacency matrix $A$, the r-step paths are encoded in the similarity matrix $S = [I + \varepsilon^2 D - \varepsilon A]^{-1}$, where $D$ and $I$ are diagonal matrices with diagonal elements equal to the node degrees and 1, respectively, and $\varepsilon$ is a small positive constant.

Figure 2: Toy examples of three dk-series null models: (a) Dk1.0; (b) Dk2.0; (c) Dk2.5. The blue dashed lines indicate the newly connected edges. In (a), (b), and (c), the left panel shows the original network and the right panel shows the rewired network.

Figure 3: Comparison between real directed networks and their null models via the motif-based directed network comparison method. The null models are Dk1.0, Dk2.0, and Dk2.5. Smaller values in the heatmap indicate a higher similarity, and vice versa.

The comparison of a directed network and its perturbed networks
In this section, we perform perturbation experiments on the edges of six real directed networks to further assess the stability and applicability of the motif-based comparison method. Specifically, for each given network, we randomly add or remove edges with a certain proportion f, where the range of f is [−0.9, 0.9]. A positive value of f indicates that we randomly add a fraction |f| of directed edges to the network, and a negative value of f means that we randomly remove a fraction |f| of the directed edges. We compare the original network with the perturbed network using different network comparison methods, as shown in Figure 4. The four comparison methods ($D_m$, $D_p$, $D_d$, and $D_c$) show similar trends; that is, an increase in |f| makes the perturbed network differ more from the original network, which is consistent with intuition. This trend is especially pronounced when f is negative. However, the motif-based comparison method distinguishes networks much better than the baselines for positive values of f: the curves of the three baselines for f > 0 are flatter than those of our method. Taking the Mac network as an example (Figure 4(a)), the values of $D_p$ range only from 0.07 to 0.13 for f ∈ [0, 0.9], and the values of $D_p$ are the same for f = 0.1 and f = 0.2, which is unreasonable. $D_d$ and $D_c$ also show insignificant dissimilarities between networks in Figure 4(a)-(f). The baseline methods $D_p$ and $D_c$ are based on the distance between nodes, and $D_d$ considers the r-step paths of a network for network comparison. However, they do not consider the higher-order structure of a network, which may result in poor performance in network comparison.
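The edge perturbation described above can be sketched as follows. This is a minimal illustration: the function name and arguments are hypothetical, the number of changed edges is rounded to the nearest integer, self-loops are excluded, and candidate edges for addition are enumerated exhaustively, which is only practical for small networks.

```python
import random

def perturb_edges(nodes, edges, f, seed=None):
    """Randomly add (f > 0) or remove (f < 0) a fraction |f| of directed edges."""
    rng = random.Random(seed)
    edge_set = set(edges)
    n_change = round(abs(f) * len(edge_set))
    if f < 0:
        # Sorting gives a reproducible order for a fixed seed.
        removed = rng.sample(sorted(edge_set), n_change)
        return edge_set - set(removed)
    # Candidate new edges: ordered node pairs that are not yet connected.
    candidates = [(u, v) for u in nodes for v in nodes
                  if u != v and (u, v) not in edge_set]
    added = rng.sample(candidates, min(n_change, len(candidates)))
    return edge_set | set(added)

nodes = list(range(5))
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
more = perturb_edges(nodes, edges, f=0.5, seed=1)   # adds 2 edges
less = perturb_edges(nodes, edges, f=-0.5, seed=1)  # removes 2 edges
```

Averaging a dissimilarity measure over many such realizations for each value of f yields curves of the kind compared in this section.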

Figure 4: Dissimilarity between a real directed network and its perturbed network generated by randomly adding or deleting edges, where positive values of f indicate that we randomly add a fraction |f| of edges and negative values indicate that we randomly remove a fraction |f| of edges. We show results for networks: (a) Mac; (b) Elegans; (c) Physicians; (d) Email; (e) US airport; (f) Chess. The parameter φ of $D_m$ is set to 0.5. Each point in the figure is averaged over 100 realizations.

Figure 5: Parameter analysis for motif-based directed network comparison. We compare the real network with its perturbed network via edge addition or deletion. Different curves correspond to different values of φ, which is the only parameter in our method, with φ ∈ {0.1, 0.3, 0.5, 0.7, 0.9}. Positive values of f indicate random edge addition, and negative values indicate random edge deletion. We show results for networks: (a) Mac; (b) Elegans; (c) Physicians; (d) Email; (e) US airport; (f) Chess. All results are averaged over 100 realizations.

Table 1: Basic properties of real directed networks, where $N$, $M$, Ad, Avl, and $d$ represent the number of nodes, the number of edges, average degree, average shortest path length, and network diameter, respectively.