Abstract
Exploring motif statistics (such as motif frequencies) of a large graph is useful for understanding complex networks such as social and biological networks, but it is challenging because of its high computational cost. To address this challenge, many methods use approximate algorithms based on edge/path sampling techniques. However, state-of-the-art methods usually over-sample frequent motifs and under-sample rare motifs, and thus they fail in many real applications such as anomaly detection (i.e., finding rare patterns). Furthermore, it is not feasible to apply existing weighted sampling methods such as stratified sampling to solve this problem, because it is difficult to sample subgraphs from a large graph in a direct manner. In this paper, we observe that rare motifs of most real-world networks have “more edges” than frequent motifs, and that motifs with more edges are sampled by random edge sampling with higher probabilities. Based on these two observations, we propose a novel motif sampling method, Mosar, to estimate motif frequencies. In particular, Mosar samples frequent and rare motifs with different probabilities, and tends to sample motifs with low frequencies. As a result, the new method greatly reduces the estimation errors of these rare motifs. Finally, we conducted extensive experiments on a variety of real-world datasets with different sizes, and our experimental results show that Mosar is two orders of magnitude more accurate than state-of-the-art methods.
1. Introduction
Recently, exploring small connected subgraph patterns (i.e., motifs) in networks has attracted more and more attention in both academia and industry. These patterns have been widely used in various applications such as evolutionary pattern characterization in online social networks [,,,], pattern recognition in gene expression profiling [], interaction prediction in protein–protein networks [], and coarse-grained topology generation []. For example, Kunegis et al. [] studied the significance of subgraph patterns such as “the enemy of my enemy is my friend” and “the friend of my friend is my friend” to evaluate the stability of “friend or foe” social networks such as Slashdot Zoo (www.slashdot.org, accessed on 6 March 2022). Refs. [,] explored network traffic activity graphs (TAGs) and observed that TAGs of different applications (e.g., FTP, Web, and P2P) exhibited different motif patterns.
Motif frequency and concentration are two popular statistics studied in many applications. Suppose that there exist N k-node connected and induced subgraphs (CIS) in G, n of which are isomorphic to a motif M. Then, the motif frequency and concentration of M are defined as $n$ and $n/N$, respectively. The huge number of these subgraphs poses a great challenge for computing these two statistics. For instance, in two medium-sized networks, Slashdot [] and Epinions [], with only nodes and edges [], there exist more than four-node CIS. Furthermore, because the number of k-node CIS generally increases exponentially with k, the number of five-node CIS is higher in both of these graphs. To solve this problem, many existing works [,,,,] have explored approximate algorithms to estimate these statistics, making a trade-off between accuracy and computational time. These methods perform node sampling, edge sampling, or path sampling on the original graph and use the sampled graph to infer the statistics of all subgraphs in the original graph. Among them, the method in [] can estimate motif frequencies, as ours can, although the two algorithms differ and our method is deliberately biased towards rare motifs. The above sampling schemes usually prefer frequent motifs and under-sample rare ones (i.e., motifs with low frequencies). As a result, these methods exhibit large errors for estimating rare motifs’ statistics, and fail in many real applications such as anomaly detection (i.e., finding rare or unusual patterns) [,] and community search (i.e., finding the densest subgraphs and cliques) [,].
A potential way to solve the above problem is stratified sampling with the proportionate allocation strategy [], the basic idea of which can be described as follows. For a motif M with frequency n (i.e., G has n CIS isomorphic to M), suppose that each of these CIS is independently sampled with the same probability $p_M$. Then, we estimate the motif frequency n as $\hat{n} = m / p_M$, where m is the number of sampled CIS whose original CIS are isomorphic to M. Because m follows a binomial distribution with parameters n and $p_M$, the variance of $\hat{n}$ is $n (1 - p_M) / p_M$, which implies that we can reduce the estimation error by increasing $p_M$. With a fixed sampling budget, we can reduce the total estimation errors of characterizing all motifs by assigning larger probabilities to motifs with lower frequencies. However, it is not feasible to directly sample CIS in a graph with a pre-defined probability $p_M$, which hinders us from performing stratified sampling.
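The variance argument above can be checked numerically. The following is a minimal sketch, assuming that the number of sampled CIS m is binomially distributed with parameters n and p (the symbol names and the value n = 50 are illustrative choices, not taken from the paper). It computes the exact mean and variance of the estimator m / p and prints them next to the closed form n(1 - p)/p.

```python
from math import comb


def estimator_moments(n, p):
    """Exact mean and variance of n_hat = m / p, where m ~ Binomial(n, p)."""
    mean = 0.0
    second_moment = 0.0
    for m in range(n + 1):
        prob = comb(n, m) * p**m * (1 - p) ** (n - m)
        estimate = m / p
        mean += prob * estimate
        second_moment += prob * estimate**2
    return mean, second_moment - mean**2


n = 50  # hypothetical motif frequency, chosen only for illustration
for p in (0.05, 0.2, 0.5):
    mean, var = estimator_moments(n, p)
    # The last column is the closed form n * (1 - p) / p.
    print(p, round(mean, 6), round(var, 6), round(n * (1 - p) / p, 6))
```

The estimator is unbiased for every p, and its variance shrinks as p grows, which is why a rare motif benefits from a larger sampling probability.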
To address the above challenge, in this paper we propose a novel method called Mosar (Motif Sampling and Retrieving) to estimate all motif frequencies. Mosar first obtains a sampled graph from the graph G under study using random edge sampling, i.e., each edge in G is sampled with the same probability p. In our experiments, we observe that motifs with more edges usually have lower frequencies for many real-world graphs. Moreover, the probability of a k-node CIS s in G also appearing as a k-node CIS in increases with the number of edges in s, where k is the size of motifs under study. For example, an unclosed wedge and a triangle in G are observed in with probabilities and , respectively (note that when ). Thus, Mosar can be viewed as a novel weighted motif sampling method that tends to sample rare motifs. Clearly, may exhibit different motif statistics from G due to two kinds of uncertainties: (1) CIS in and their original CIS in G may be different; and (2) CIS are not sampled uniformly. For example, Figure 1 shows that a sampled graph has three-node directed motif concentrations which differ greatly from those of G; the Flickr graph [] is used as G and is obtained by randomly sampling each edge of G with the same probability 0.05. To remove the error introduced by these two uncertainties, Mosar retrieves the original CIS of all k-node CIS in and then builds a probabilistic model to “re-weight” sampled CIS to compute motif statistics. Our experiments on a variety of public datasets show that our method is two orders of magnitude more accurate than state-of-the-art methods.
Figure 1.
Motif statistics of the graph G and a sampled graph (the numbers are the motif IDs).
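To make the claim about edge-rich motifs concrete, the sketch below computes, by exhaustive enumeration, the probability that the node set of a small pattern is still connected after each of its edges is kept independently with probability p. It assumes that "appearing as a k-node CIS in the sampled graph" means exactly this; the helper names are hypothetical. Under that reading, a wedge survives with probability p^2 and a triangle with probability 3p^2 - 2p^3, which is larger for any 0 < p < 1.

```python
from itertools import combinations


def is_connected(nodes, edges):
    """Return True if `edges` connect all of `nodes` (depth-first search)."""
    nodes = list(nodes)
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen = {nodes[0]}
    stack = [nodes[0]]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(nodes)


def survival_probability(nodes, edges, p):
    """Probability that `nodes` stay connected when each edge is kept w.p. p."""
    total = 0.0
    for r in range(len(edges) + 1):
        for kept in combinations(edges, r):
            if is_connected(nodes, kept):
                total += p**r * (1 - p) ** (len(edges) - r)
    return total


p = 0.05
wedge = [(0, 1), (1, 2)]
triangle = [(0, 1), (1, 2), (0, 2)]
print(survival_probability(range(3), wedge, p))     # matches p**2 up to rounding
print(survival_probability(range(3), triangle, p))  # matches 3*p**2 - 2*p**3 up to rounding
```

With p = 0.05 the triangle is roughly three times as likely as the wedge to be observed, which is the bias towards rare motifs that Mosar exploits.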
The rest of this paper is organized as follows. Section 2 summarizes related work. The problem formulation is presented in Section 3. Section 4 presents our motif sampling method, Mosar, and the corresponding methods of estimating motif frequencies and concentrations. The performance evaluation and testing results are presented in Section 5. Concluding remarks are given in Section 6.
2. Related Work
There is an immense body of literature on the characterization of three-, four-, and five-node CIS in a single large graph. However, many of these works focus on the triangle counting problem [,,,,,,] and cannot be easily extended to count other CIS.
In this section, we briefly review practical algorithms that approximately count all three-, four-, and five-node CIS in a large static graph. While Alon et al. [] proposed a color-coding method to reduce the computational cost of counting subgraphs, it is not scalable to large graphs []. OmidiGenes et al. [] proposed a subgraph enumeration and counting method using edge sampling. However, this method suffers from unknown sampling bias. To estimate subgraph class concentrations, Kashtan et al. [] proposed a connected subgraph sampling method; however, their method is computationally expensive when calculating the weight of each sampled subgraph used for correcting the bias introduced by sampling. To address this drawback, FANMOD [] samples subgraphs by building a subgraph enumeration tree, which requires that the graph fit into memory. Recently, Paredes and Ribeiro [] proposed RAND-FaSE to estimate the frequency of all CIS with an efficient tree data structure, where the leaves are the subgraph occurrences. Wang et al. [] built a transition probability matrix between the motif statistics in the original and sampled graphs. With the motif statistics in the sampled graph, they provide an unbiased estimator for all three-, four-, and five-node CIS. Marco et al. [] presented a general algorithm using color coding to approximately count motifs beyond five nodes. Ryan et al. [] developed an unbiased graphlet estimation framework by sampling edges and their local neighborhoods. The Motivo algorithm proposed in [] scales well to larger graphs while providing more accurate counts of motifs, both for the most frequent motifs and for extremely rare motifs. The general framework proposed in [], called HONE, is used to learn structural node embeddings from networks through subgraph patterns in node neighborhoods. The random walks in [] have been used as the basis for many proximity-based community detection methods.
These methods are similar to the random edge sampling in the first step of our Mosar method, although with many differences in their implementation. In addition, Refs. [,,,,,] proposed sampling methods to estimate online social networks’ motif concentrations when the graph’s topology is not available in advance and it is costly to crawl the entire topology. However, the above methods under-sample rare motifs, and thus exhibit large errors for characterizing such motifs.
3. Problem Formulation
In this section, we introduce motif statistics. For readability, the notations used throughout the paper are listed in Table 1. We denote the graph of interest as a labeled undirected graph $G = (V, E, L)$, where V is the set of nodes, E is a set of undirected edges, and L is a set of labels associated with undirected edges . For example: (1) directed networks use labels to indicate the direction of the edges ; (2) signed networks use labels to mark edges as positive or negative; (3) a regular undirected graph can be represented by setting L to null.
Table 1.
Table of notations.
To formally define the motif frequency of G, first, we introduce a few notations. An induced subgraph of G, , is a subgraph with its edges and associated labels all in G, i.e., , , . Denote as the set of all CIS with k nodes in G, and . We provide a simple example in Figure 2, where . We partition into equivalence classes without overlapping where CIS within each are isomorphic. Next, we present several examples to illustrate our notations. Figure 3a reveals all three-node motifs of unlabeled undirected networks. When G is an unlabeled and undirected network, then the number of three-node motifs is , and and are the sets of CIS in G isomorphic to the first and second motifs in Figure 3a, respectively. Figure 3b reveals all three-node motifs when G is any signed network; in this case, . Figure 3c reveals all motifs with three nodes for any directed network; in such a case, . Figure 3d reveals all four-node motifs of any unlabeled and undirected network; in this case, . Figure 3e shows all five-node motifs of any unlabeled and undirected network; in this case, . Throughout the paper, is defined as the set of CIS in G that are isomorphic to the i-th k-node motif . Define the frequency of motif as , i.e., the number of CIS in . For example, includes two CIS for the directed graph G in Figure 2: (1) the CIS made up of a, b, and d, and (2) the CIS made up of a, c, and d. Thus, . In this paper, we focus on designing fast and accurate sampling methods to reduce the time needed to count motif frequencies.
Figure 2.
An example of G and .
Figure 3.
All three-node, four-node, and five-node motifs (the numbers are the motif IDs): (a) three-node undirected motifs; (b) three-node signed and undirected motifs; (c) three-node directed motifs; (d) four-node undirected motifs; (e) five-node undirected motifs.
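As a concrete illustration of motif frequency and concentration for the two three-node undirected motifs (the unclosed wedge and the triangle), the following sketch counts them by brute force on a toy graph. The function name and the example graph are hypothetical, and the cubic-time enumeration is only meant to make the definitions tangible, not to be used on large graphs.

```python
from itertools import combinations


def three_node_motif_frequencies(edges):
    """Return (wedges, triangles): counts of 3-node connected induced subgraphs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    wedges = triangles = 0
    for a, b, c in combinations(sorted(adj), 3):
        # Number of edges induced by the node set {a, b, c}.
        k = (b in adj[a]) + (c in adj[a]) + (c in adj[b])
        if k == 2:
            wedges += 1      # connected with two edges: unclosed wedge
        elif k == 3:
            triangles += 1   # all three edges present: triangle
    return wedges, triangles


# Toy graph: a triangle a-b-c with a pendant node d attached to c.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
wedges, triangles = three_node_motif_frequencies(edges)
total = wedges + triangles
print(wedges, triangles)                     # motif frequencies
print(wedges / total, triangles / total)     # motif concentrations
```

Node sets that induce fewer than two edges (such as {a, b, d} here) are not connected and therefore do not count as CIS.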
4. Motif Sampling and Retrieving
In this section, we start by introducing our Mosar method for motif sampling. After that, we present a probabilistic model to analyze its sampling bias. On the basis of this model, we put forward a method to correct the sampling error for estimating motif frequencies. Finally, we provide lower error bounds for our estimates.
4.1. Sampling Motifs over G
Figure 4 shows an overview of Mosar. Mosar first generates a subgraph of by iterating each edge and sampling it with the same probability p. We assume that can be fitted into memory, which can be easily achieved using a small p. Then, Mosar uses existing CIS enumeration methods such as [] to enumerate all k-node CIS of . For a graph s, let and denote the set of nodes and edges contained in s. For a k-node CIS of , let s be its original k-node CIS, which is defined as the k-node CIS of G with the same nodes in , i.e., . We can easily find that can be quite different from s. To eliminate the estimation error introduced by this uncertainty, when traversing , we combine the edge information of the original graph G to retrieve the s of the original graph. Formally, we let denote all k-node CIS of . Finally, we obtain all pairs of CIS and their original CIS, , i.e.,
Figure 4.
Overview of Mosar.
The pseudocode of Mosar is shown in Algorithm 1.
Algorithm 1: The pseudocode of Mosar.
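The sampling and retrieving steps described in Section 4.1 can be sketched as follows for three-node undirected motifs. This is an illustrative reading of the prose, not the authors' Algorithm 1: the function names are hypothetical, the brute-force triple enumeration stands in for the CIS enumeration method cited in the text, and the original graph is held in memory here, whereas the paper performs the retrieval with a second pass over the graph file.

```python
import random
from itertools import combinations


def build_adj(edges):
    """Adjacency sets of an undirected graph given as an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def three_node_cis(adj):
    """Yield (nodes, edge_count) for every 3-node connected induced subgraph."""
    for a, b, c in combinations(sorted(adj), 3):
        k = (b in adj[a]) + (c in adj[a]) + (c in adj[b])
        if k >= 2:
            yield (a, b, c), k


def mosar_pairs(edges, p, rng):
    """Return (nodes, edges in sampled CIS, edges in original CIS) triples."""
    # Step 1: keep each edge of G with the same probability p.
    sampled = [e for e in edges if rng.random() < p]
    adj_sampled = build_adj(sampled)
    # Step 2: enumerate CIS of the sampled graph, then retrieve the
    # subgraph of G induced by the same node set.
    adj_g = build_adj(edges)
    pairs = []
    for nodes, k_sampled in three_node_cis(adj_sampled):
        a, b, c = nodes
        k_orig = (b in adj_g[a]) + (c in adj_g[a]) + (c in adj_g[b])
        pairs.append((nodes, k_sampled, k_orig))
    return pairs


edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (2, 4)]
for nodes, k_sampled, k_orig in mosar_pairs(edges, 0.7, random.Random(1)):
    print(nodes, k_sampled, k_orig)
```

Each returned triple records how a sampled CIS (two edges for a wedge, three for a triangle) relates to its original CIS; a sampled wedge whose original has three edges is a triangle of G that lost one edge during sampling.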
4.2. Probabilistic Model of Mosar
We build a probabilistic model of pairs , which is similar to the model in []. Define as the probability that is isomorphic to motif given that s is isomorphic to motif , i.e.,
To obtain , we first compute , which is defined as the number of subgraphs of isomorphic to . For instance, , i.e., the triangle, includes three subgraphs isomorphic to , i.e., the unclosed wedge for the undirected graph in Figure 3a. Thus, we have for three-node undirected motifs. When , we let . For four- and five-node motifs, it is not easy to obtain manually; we use the method in [] to compute . Let ; then, we have
For example, we have and for the undirected three-node motifs in Figure 3a.
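For readers who want to experiment with the model, the sketch below encodes one reading of the transition probability that is consistent with the description above: if the original CIS has e_j edges, the sampled CIS has e_i edges, and phi is the number of subgraphs of the original motif isomorphic to the sampled motif, then exactly e_i specific edges must be kept and the remaining e_j - e_i dropped. The symbols phi, e_i, and e_j and the function name are assumptions introduced here, since the paper's own notation is not recoverable from the text.

```python
def transition_probability(phi, e_i, e_j, p):
    """P(sampled CIS is motif i | original CIS is motif j) under edge sampling.

    phi: number of subgraphs of motif j isomorphic to motif i (assumed given)
    e_i, e_j: number of edges of motif i and motif j
    p: probability that each edge is kept
    """
    if e_i > e_j:
        return 0.0  # sampling only removes edges, so motif i cannot have more
    return phi * p**e_i * (1 - p) ** (e_j - e_i)


p = 0.1
# Three-node undirected motifs: wedge (2 edges) and triangle (3 edges).
print(transition_probability(1, 2, 2, p))  # wedge stays a wedge: p^2
print(transition_probability(3, 2, 3, p))  # triangle seen as a wedge: 3 p^2 (1 - p)
print(transition_probability(1, 3, 3, p))  # triangle stays a triangle: p^3
```

The factor 3 in the second case reflects the three wedges contained in a triangle, as noted in the text.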
4.3. Motif Frequency Estimation
Using the probabilistic model above, we put forward a method to estimate motif frequencies. The pseudocode for motif frequency estimation is shown in Algorithm 2.
Algorithm 2: The pseudocode for Motif Frequency Estimation.
Define , , as the number of pairs , where is isomorphic to motif and s is isomorphic to motif . Then, the expectation of is computed as
When , we have the following estimator of :
Denote . Thus, we have estimators of , i.e., , . Let and be two k-node CIS in G isomorphic to the j-th k-node motif. Denote and as the induced subgraphs of node sets and in , respectively. Define as the probability that and are both isomorphic to the i-th k-node motif. We can easily find that when and have no common edges (i.e., ), and otherwise. For example, as shown in Figure 5, we have and for the undirected three-node motifs in Figure 3a. Then, we have the following theorem.
Figure 5.
Compute and for the undirected three-node motifs.
Theorem 1.
For each , is an unbiased estimator of , i.e.,
and the variance of is
Proof.
From (4), we have
Let $\mathbf{1}(\mathbb{P})$ denote the indicator function that equals one when the predicate $\mathbb{P}$ is true and zero otherwise. Define function
Then we can write as
Thus, is computed as
Finally, we have
In the derivation above, we use when , and when and have no common edges. □
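The unbiasedness claim can be verified exactly on a small graph by enumerating every possible edge sample. The sketch below does this for triangle counting with two estimators, one built from triangles of G that are observed as wedges and one from triangles observed as triangles. It assumes the estimators have the form "observed count divided by transition probability", which is an interpretation of the text rather than a quotation of the paper's equations; all names are hypothetical.

```python
from itertools import combinations


def adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def edge_count(adj, a, b, c):
    """Number of edges induced by {a, b, c}; missing nodes count as isolated."""
    has = lambda u, v: v in adj.get(u, ())
    return has(a, b) + has(a, c) + has(b, c)


def expected_triangle_estimates(edges, p):
    """Exact expectations of both triangle estimators over all edge samples."""
    adj_g = adjacency(edges)
    triples = list(combinations(sorted(adj_g), 3))
    p_wedge = 3 * p**2 * (1 - p)   # triangle observed as a wedge
    p_tri = p**3                   # triangle observed as a triangle
    exp_from_wedges = exp_from_triangles = 0.0
    for r in range(len(edges) + 1):
        for kept in combinations(edges, r):
            prob = p**r * (1 - p) ** (len(edges) - r)
            adj_s = adjacency(kept)
            m_wedge = m_tri = 0
            for a, b, c in triples:
                if edge_count(adj_g, a, b, c) != 3:
                    continue  # the original CIS is not a triangle
                k = edge_count(adj_s, a, b, c)
                m_wedge += (k == 2)
                m_tri += (k == 3)
            exp_from_wedges += prob * m_wedge / p_wedge
            exp_from_triangles += prob * m_tri / p_tri
    return exp_from_wedges, exp_from_triangles


k4 = list(combinations(range(4), 2))  # complete graph on 4 nodes: 4 triangles
print(expected_triangle_estimates(k4, 0.3))
```

Both expectations agree with the true triangle count of the complete graph on four nodes, up to floating-point rounding, for any 0 < p < 1.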
Example: For the undirected three-node motifs in Figure 3a, and are two estimators of , i.e., the number of triangles in G. Note that is the same estimator of in []. Let be the number of pairs of triangles that are not edge disjoint. Then, we have
and
We can easily find that is smaller than when . When and , we have and ; therefore, is times more accurate than .
Finally, we estimate using the following mixed estimator:
where parameters , and . is used to determine the relative importance of . Suppose that all are independent. Then, the variance of is
Next, we compute optimal to minimize . Define Lagrange function as
The derivatives of with respect to and are
and
To obtain a with the smallest error, we solve the equations , and , , and have
When it is difficult to compute exactly, we approximate and then set parameters as
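The Lagrangian argument above is the standard derivation of inverse-variance weighting: for independent unbiased estimators combined with weights that sum to one, the variance of the combination is minimized when each weight is proportional to the reciprocal of the corresponding variance. The sketch below implements that closed form; the function names are hypothetical, and it assumes the individual variances are known or have already been approximated.

```python
def optimal_weights(variances):
    """Weights that sum to one and minimize the variance of the weighted mean
    of independent unbiased estimators (inverse-variance weighting)."""
    inverse = [1.0 / v for v in variances]
    total = sum(inverse)
    return [x / total for x in inverse]


def mixed_variance(weights, variances):
    """Variance of sum_i w_i * X_i for independent X_i."""
    return sum(w * w * v for w, v in zip(weights, variances))


variances = [1.0, 3.0]  # illustrative values only
weights = optimal_weights(variances)
print(weights)
print(mixed_variance(weights, variances))
print(mixed_variance([0.5, 0.5], variances))  # plain averaging, for comparison
```

The optimally weighted combination never has a larger variance than the best single estimator, because putting all weight on that estimator is one of the feasible choices.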
4.4. Discussion
Compared to the online methods of analyzing streaming graphs in [,] (i.e., the graph of interest is given as a stream of edges and each edge can be accessed and processed only once), Mosar needs to pass over the graph file of interest twice, with the additional pass performed to remove uncertainty introduced by sampling. However, we observe that passing over the graph requires much less time than enumerating and classifying subgraphs even for a small sampling probability p. For example, in our experiments we observed that the computational time needed for passing over the graph file of interest on disk was no more than 7% of the time needed to enumerate and classify CIS in the sampled graph when . Thus, to sample the same number of CIS, Mosar requires effectively the same computational time as the methods in [,].
5. Data Evaluation
In this section, we first introduce our experimental datasets. We then present experimental results to evaluate the performance of our Mosar method in comparison with state-of-the-art methods. Our experiments were conducted on a server with a 2.39 GHz Quad-Core AMD Opteron(tm) 8379 HE processor and 128 GB of DRAM.
5.1. Datasets
We performed our experiments on the following publicly available datasets, which are summarized in Table 2.
Table 2.
Graph datasets used in our experiments. “edges” refers to the number of edges in the undirected graph generated by discarding edge labels. “max-degree” denotes the maximum number of edges incident to a node in the undirected graph.
- Online social networks: Flickr [], Pokec [], LiveJournal [], YouTube [], soc-Epinions1 [], and soc-Slashdot08 []. Flickr, LiveJournal, and YouTube are popular photo, blog, and video sharing websites, respectively, where a user can subscribe to other users’ updates such as photos, blogs, and videos. Pokec is the most popular online social network in Slovakia, and has been in existence for more than ten years. These networks can be represented by directed graphs, where nodes represent users and a directed edge from node u to node v indicates that user u subscribes to user v or user u tags user v as a friend. Soc-Epinions1 [] is a directed graph of the Epinions website in 2003, where a directed edge from node u to node v indicates that user u trusts user v. Soc-Slashdot08 and Soc-Slashdot09 [] are graphs of the technology-related news website Slashdot released in 2008 and 2009, respectively, where the edge between node u and node v means that user u has marked user v as a friend.
- Web graph: Web-Google []. The Web-Google dataset was released in 2002 by Google as a part of a Google Programming Contest; nodes represent web pages and directed edges represent hyperlinks between them.
- Signed networks: sign-Epinions, sign-Slashdot08, and sign-Slashdot09 []. Epinions and Slashdot networks can be represented by a signed graph, where a positive edge from user u to user v means that u trusts v on the Epinions website or u marks v as a friend on the Slashdot website. A negative edge from u to v means a distrust relationship on the Epinions website or that u tags user v as a foe on the Slashdot website.
- Collaboration networks: ca-HepTh [], ca-GrQc [], and ca-CondMat []. arXiv is an online repository of electronic preprints of scientific papers in many fields, such as mathematics, physics, and computer science. The datasets ca-GrQc, ca-HepTh, and ca-CondMat consist of arXiv e-prints and cover scientific collaborations between authors of papers submitted to the General Relativity and Quantum Cosmology category, the High Energy Physics - Theory category, and the Condensed Matter category, respectively []. These networks can all be represented by undirected graphs. If author u co-authored a paper with author v, the graph contains an undirected edge between u and v.
- Peer-to-peer network: p2p-Gnutella08 []. Gnutella is a peer-to-peer file sharing network. Nodes in the p2p-Gnutella08 dataset represent users in the Gnutella network and edges represent connections between Gnutella users.
- Communication network: Wiki-Talk []. Wikipedia is a free encyclopedia written collaboratively by volunteers around the world. Each registered user has a talk page that she/he and other users can edit in order to communicate and discuss updates to various articles on Wikipedia. Nodes in the Wiki-Talk dataset represent registered users on Wikipedia and a directed edge from node u to node v indicates that user u at least once edited a talk page of user v.
- Product network: com-Amazon []. The dataset was collected by crawling the Amazon website based on the Amazon website’s “Customers Who Bought This Item Also Bought” feature. If a product u is frequently co-purchased with product v, the graph contains an undirected edge from u to v.
5.2. Error Metric
Similar to [], in our experiments we studied the normalized root mean square error (NRMSE) to measure the relative error of the motif frequency estimate with respect to its true value , . is defined as:
where is defined as
The MSE decomposes into the sum of the variance and the squared bias of the estimator , both of which should be as small as possible to achieve good estimation performance. When is an unbiased estimator of , the MSE equals the variance; as a consequence, the NRMSE is equivalent to the normalized standard error of , which is . Note that our metric measures relative error, and thus we consider values as large as to be acceptable when is small. In our experiments, we average the estimates and calculate their NRMSEs over 100 runs.
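The error metric can be computed from repeated runs as in the following sketch. It assumes the usual definition, in which the NRMSE is the square root of the mean squared error divided by the true value, and it also returns the squared bias and variance so that the decomposition of the MSE can be checked. The function names and the four sample estimates are illustrative only.

```python
from math import sqrt


def nrmse(estimates, true_value):
    """Normalized root mean square error of repeated estimates."""
    mse = sum((x - true_value) ** 2 for x in estimates) / len(estimates)
    return sqrt(mse) / true_value


def squared_bias_and_variance(estimates, true_value):
    """Empirical squared bias and (population) variance of the estimates."""
    mean = sum(estimates) / len(estimates)
    variance = sum((x - mean) ** 2 for x in estimates) / len(estimates)
    return (mean - true_value) ** 2, variance


runs = [90.0, 110.0, 100.0, 120.0]  # hypothetical estimates from four runs
truth = 100.0
bias_sq, var = squared_bias_and_variance(runs, truth)
print(nrmse(runs, truth))
print(bias_sq + var)  # equals the empirical MSE
```

In the experiments described here the same computation would be applied to the 100 runs reported for each motif and sampling probability.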
5.3. Accuracy Results
First, we evaluated the performance of our method in estimating three-node motif frequencies on graphs with millions of nodes (Flickr, Pokec, LiveJournal, YouTube, Web-Google, and Wiki-Talk), comparing our results with the ground truth calculated via brute-force methods. Calculating the ground truth of four-node and five-node motif frequencies for large graphs is computationally intensive. Even for a relatively small graph such as soc-Slashdot08, enumerating and counting all of its three-node CIS takes almost 20 h. To overcome this difficulty, experiments with four-node CIS were performed on medium-size graphs (soc-Epinions1, soc-Slashdot08, soc-Slashdot09, com-DBLP, and com-Amazon), and experiments with five-node CIS were performed on four relatively small graphs (ca-GrQc, ca-HepTh, ca-CondMat, and p2p-Gnutella08) where the ground truth could be calculated. We additionally evaluated the performance of our method in estimating the motif frequencies of signed graphs such as sign-Epinions, sign-Slashdot08, and sign-Slashdot09.
5.3.1. Values of Three-, Four-, and Five-Node Motif Frequencies
Figure 6 and Table 3 show the real values of the three-, four-, and five-node motif frequencies of the graphs studied in this paper. Table 3 and Figure 6a show the real values of three-node undirected and directed motif frequencies, respectively, for Flickr, Pokec, LiveJournal, Wiki-Talk, and Web-Google. Here, undirected graphs are obtained by discarding the edge directions of directed graphs. Among all three-node directed motifs, the seventh motif exhibits the smallest frequency for all these directed graphs. Flickr, Pokec, LiveJournal, Wiki-Talk, and Web-Google have , , , , and three-node CIS, respectively. Figure 6b shows the real values of the three-node signed motif frequencies for the graphs sign-Epinions, sign-Slashdot08, and sign-Slashdot09. Sign-Epinions, sign-Slashdot08, and sign-Slashdot09 have , , and three-node CIS, respectively. Figure 6c shows the real values of four-node undirected motif frequencies for the graphs soc-Epinions1, soc-Slashdot08, soc-Slashdot09, and com-Amazon. Graphs soc-Epinions1, soc-Slashdot08, soc-Slashdot09, and com-Amazon have , , , and four-node CIS, respectively. Figure 6d shows the real values of five-node undirected motif frequencies for com-Amazon, com-DBLP, p2p-Gnutella08, ca-GrQc, ca-CondMat, and ca-HepTh. Com-Amazon, com-DBLP, p2p-Gnutella08, ca-GrQc, ca-CondMat, and ca-HepTh have , , , , , and five-node CIS, respectively.
Figure 6.
Real values of motif frequencies: (a) three-node directed motifs; (b) three-node signed motifs; (c) four-node undirected motifs; (d) five-node undirected motifs.
Table 3.
Real values of three-node undirected motif frequencies (i is the motif ID).
5.3.2. Estimating Three-Node Motif Frequencies
Table 4 shows the NRMSEs of our estimates of three-node undirected motif frequencies at and , respectively, for the graphs Flickr, Pokec, LiveJournal, Wiki-Talk, and Web-Google. The triangle motif with in Table 4 is the rarer of the two undirected motifs; thus, the result with is better than that of []. We can see that the NRMSE for is about ten times smaller than the NRMSE for . When , the NRMSEs are less than for all five graphs. Figure 7 shows the NRMSEs of our estimates of three-node directed motif frequencies at and . Likewise, we observe that the NRMSE at is almost ten times smaller than the NRMSE at . Our estimates of (i.e., the seventh three-node directed motif frequency) exhibit the largest NRMSE. Except for , the NRMSEs of the other motif frequency estimates are smaller than 0.01 when . Figure 8 shows the NRMSEs of our estimates of three-node signed and undirected motif frequencies for , , and using the graphs sign-Epinions, sign-Slashdot08, and sign-Slashdot09. For all three signed graphs, the NRMSEs are less than 0.5, 0.1, and 0.06 for , , and , respectively.
Table 4.
NRMSEs of , the concentration estimates of three-node undirected motifs for and , respectively (i is the motif ID).
Figure 7.
NRMSEs of , the estimates of three-node directed motif frequencies for and , respectively. Flickr, Pokec, LiveJournal, Wiki-Talk, and Web-Google have , , , , and three-node CIS, respectively: (a) ; (b) .
Figure 8.
NRMSEs of , the estimates of three-node signed and undirected motif frequencies for , , and , respectively. Sign-Epinions, sign-Slashdot08, and sign-Slashdot09 have , , and three-node CIS, respectively: (a) ; (b) ; (c) .
5.3.3. Estimating Four-Node Motif Frequencies
Figure 9 shows the NRMSEs of , the frequency estimates of four-node undirected motifs for , , and , respectively, using the graphs soc-Epinions1, soc-Slashdot08, soc-Slashdot09, and com-Amazon. We can see that the NRMSEs of the motif frequency estimates are smaller than 0.2, 0.1, and 0.07 for , , and , respectively.
Figure 9.
NRMSEs of , the motif frequency estimates of four-node undirected motifs for , and , respectively. Soc-Epinions1, soc-Slashdot08, soc-Slashdot09, and com-Amazon have , , , and four-node CIS, respectively: (a) ; (b) ; (c) .
5.3.4. Estimating Five-Node Motif Frequencies
Figure 10 shows the NRMSEs of , the estimates of five-node undirected motif frequencies for , , and , respectively. The experiment was conducted on the graphs com-Amazon, com-DBLP, p2p-Gnutella08, ca-GrQc, ca-CondMat, and ca-HepTh. We can see that most five-node undirected motifs of all graphs except ca-GrQc have NRMSEs smaller than 1 and 0.1 for and , respectively. For instance, the largest three graphs, com-Amazon, com-DBLP, and ca-CondMat, exhibit smaller errors than the other graphs, while the smallest graph, ca-GrQc, has a larger NRMSE.
Figure 10.
NRMSEs of , the motif frequency estimates of five-node undirected motifs for , , and , respectively. Com-Amazon, com-DBLP, p2p-Gnutella08, ca-GrQc, ca-CondMat, and ca-HepTh have , , , , , and five-node CIS, respectively: (a) com-Amazon; (b) com-DBLP; (c) p2p-Gnutella08; (d) ca-GrQc; (e) ca-CondMat; (f) ca-HepTh.
5.4. Comparison to Previous Work
5.4.1. Motif Concentration Estimation
Figure 11a–c show the results of our method for estimating three-, four-, and five-node motif concentrations in comparison with the state-of-the-art methods FANMOD [], PSRW [], and Minfer [] with the same computational time. We set the same edge sampling probability for Mosar and Minfer and observed that these two methods have almost the same runtime. This is because Mosar and Minfer spend much less time reading the graph files than enumerating and classifying the subgraphs. For example, the computational time needed to pass over the graph file of interest on disk was 3.8%, 6%, 7%, 5%, and 1.9% of the time required to enumerate and classify subgraphs in the sampled graph for Flickr, LiveJournal, Pokec, Web-Google, and Wiki-Talk, respectively, when using Mosar and Minfer to estimate three-node directed motif frequencies and set . Figure 11 shows that the errors of Mosar are almost one order of magnitude smaller than those of the other methods for estimating concentrations of three- and four-node rare motifs, and two orders of magnitude smaller than those of Minfer for estimating concentrations of five-node rare motifs.
Figure 11.
Accuracy of our method for estimating motif concentrations in comparison with state-of-the-art methods: (a) (Flickr) , three-node directed motif concentrations; (b) (soc-Epinions1) , four-node undirected motif concentrations; (c) (com-Amazon) , five-node undirected motif concentrations.
5.4.2. Triangle Counting
We compared the performance of our method for estimating the number of triangles with the state-of-the-art method []. To compare Mosar and under the same computational cost, we set the parameters of as . As noted above, the runtime of Mosar is then almost the same as that of , and the probabilities of observing a triangle (sampled as a closed or unclosed wedge) are and for Mosar and , respectively. Let be the number of triangles and be an estimate of ; then, the variance of is nearly . Thus, the variance of Mosar is up to three times smaller than that of . This is consistent with the results shown in Figure 12, where . We can see that the NRMSE of Mosar is nearly 1.7 times smaller than that of , which matches a threefold reduction in variance because $\sqrt{3} \approx 1.7$.
Figure 12.
Accuracy of our method for estimating the number of triangles in comparison with .
6. Conclusions
In this paper, we develop a weighted motif sampling method, Mosar, to accurately estimate the frequency of both frequent and rare motifs. Mosar first obtains a sampled graph and then enumerates all CIS in . To reduce the estimation errors, Mosar samples rare motifs with higher probabilities. We build a probabilistic model of the CIS in both and G, then use this to derive a motif frequency estimation method with a theoretical guarantee. Finally, we performed experiments on various publicly available datasets to evaluate the performance of our Mosar method. Our experimental results show that Mosar is over two orders of magnitude more accurate than the current state-of-the-art algorithms. In the future, we plan to extend our method to dynamic graphs with edge insertions and deletions.
Author Contributions
Conceptualization, W.F. and Y.Q.; methodology, W.F. and Y.Q.; software, W.F. and Y.Q.; validation, Y.Q.; formal analysis, W.F. and Y.Q.; investigation, W.F. and Y.Q.; resources, P.W. and J.T.; data curation, Y.Q.; writing—original draft preparation, W.F.; writing—review and editing, W.G., Y.Q., P.W. and J.T.; visualization, W.F.; supervision, W.G., P.W. and J.T.; project administration, W.G.; funding acquisition, W.G. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by National Key R&D Program of China (2021YFB1715600).
Data Availability Statement
Publicly available datasets were analyzed in this study. This data can be found here: [http://snap.stanford.edu/data/index.html], accessed on 6 March 2022.
Conflicts of Interest
The authors declare that they have no conflict of interest.
References
- Chun, H.; Yeol Ahn, Y.; Kwak, H.; Moon, S.; Ho Eom, Y.; Jeong, H. Comparison of Online Social Relations in Terms of Volume vs. Interaction: A Case Study of Cyworld. In Proceedings of the SIGCOMM: Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, Seattle, WA, USA, 17–22 August 2008; pp. 57–59.
- Kunegis, J.; Lommatzsch, A.; Bauckhage, C. The slashdot zoo: Mining a social network with negative edges. In Proceedings of the 18th International Conference on World Wide Web, Madrid, Spain, 20–24 April 2009; pp. 741–750.
- Zhao, J.; Lui, J.C.S.; Towsley, D.; Guan, X.; Zhou, Y. Empirical Analysis of the Evolution of Follower Network: A Case Study on Douban. In Proceedings of the 30th IEEE International Conference on Computer Communications (IEEE INFOCOM 2011), Shanghai, China, 10–15 April 2011; pp. 941–946.
- Ugander, J.; Backstrom, L.; Kleinberg, J. Subgraph frequencies: Mapping the empirical and extremal geography of large graph collections. In Proceedings of the 22nd International Conference on World Wide Web, New York, NY, USA, 13–17 May 2013; pp. 1307–1318.
- Shen-Orr, S.S.; Milo, R.; Mangan, S.; Alon, U. Network motifs in the transcriptional regulation network of Escherichia coli. Nat. Genet. 2002, 31, 64–68.
- Albert, I.; Albert, R. Conserved network motifs allow protein–protein interaction prediction. Bioinformatics 2004, 20, 3346–3352.
- Itzkovitz, S.; Levitt, R.; Kashtan, N.; Milo, R.; Itzkovitz, M.; Alon, U. Coarse-Graining and Self-Dissimilarity of Complex Networks. Phys. Rev. E 2005, 71, 016127.
- Jin, Y.; Sharafuddin, E.; Zhang, Z.L. Unveiling Core Network-wide Communication Patterns through Application Traffic Activity Graph Decomposition. In Proceedings of the Eleventh International Joint Conference on Measurement and Modeling of Computer Systems, Seattle, WA, USA, 15 June 2009; pp. 49–60.
- Iliofotou, M.; Faloutsos, M.; Mitzenmacher, M. Exploiting Dynamicity in Graph-based Traffic Analysis: Techniques and Applications. In Proceedings of the 5th International Conference on Emerging Networking Experiments and Technologies, Rome, Italy, 1–4 December 2009; pp. 241–252.
- Leskovec, J.; Lang, K.J.; Dasgupta, A.; Mahoney, M.W. Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters. Internet Math. 2009, 6, 29–123.
- Richardson, M.; Agrawal, R.; Domingos, P. Trust Management for the Semantic Web. In Proceedings of the 7th International Symposium on Wearable Computers (ISWC 2003), White Plains, NY, USA, 21–23 October 2003; pp. 351–368.
- Wang, P.; Lui, J.C.; Zhao, J.; Ribeiro, B.; Towsley, D.; Guan, X. Efficiently Estimating Motif Statistics of Large Networks. ACM Trans. Knowl. Discov. Data 2014, 9, 1–27.
- Kashtan, N.; Itzkovitz, S.; Milo, R.; Alon, U. Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs. Bioinformatics 2004, 20, 1746–1758.
- Wernicke, S. Efficient Detection of Network Motifs. Trans. Comput. Biol. Bioinform. 2006, 3, 347–359.
- Bhuiyan, M.A.; Rahman, M.; Rahman, M.; Hasan, M.A. GUISE: Uniform Sampling of Graphlets for Large Graph Analysis. In Proceedings of the IEEE International Conference on Data Mining, Brussels, Belgium, 10–13 December 2012; pp. 91–100.
- Wang, P.; Qi, Y.; Lui, J.C.; Towsley, D.; Zhao, J.; Tao, J. Inferring higher-order structure statistics of large networks from sampled edges. Trans. Knowl. Data Eng. 2017, 31, 61–74.
- Shin, K.; Eliassi-Rad, T.; Faloutsos, C. Patterns and anomalies in k-cores of real-world graphs with applications. Knowl. Inf. Syst. 2018, 54, 677–710.
- Eswaran, D. Mining Anomalies Using Static and Dynamic Graphs. Ph.D. Thesis, Carnegie Mellon University, Pittsburgh, PA, USA, 2020.
- Yuan, L.; Qin, L.; Zhang, W.; Chang, L.; Yang, J. Index-based densest clique percolation community search in networks. Trans. Knowl. Data Eng. 2017, 30, 922–935.
- Fang, Y.; Huang, X.; Qin, L.; Zhang, Y.; Zhang, W.; Cheng, R.; Lin, X. A survey of community search over big graphs. VLDB J. 2020, 29, 353–392.
- Sarndal, C.E.S.; Swensson, B.; Wretman, J. Model Assisted Survey Sampling; Springer: New York, NY, USA, 1992.
- Mislove, A.; Marcon, M.; Gummadi, K.P.; Druschel, P.; Bhattacharjee, B. Measurement and Analysis of Online Social Networks. In Proceedings of the 2007 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, Kyoto, Japan, 27–31 August 2007; pp. 29–42.
- Tsourakakis, C.E.; Kang, U.; Miller, G.L.; Faloutsos, C. Doulion: Counting Triangles in Massive Graphs with a Coin. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, France, 28 June–1 July 2009.
- Pavan, A.; Tangwongsan, K.; Tirthapura, S.; Wu, K.L. Counting and Sampling Triangles from a Graph Stream. In Proceedings of the 39th International Conference on Very Large Data Bases 2013, (VLDB 2013), Riva del Garda, Italy, 30 August 2013; pp. 1870–1881.
- Jha, M.; Seshadhri, C.; Pinar, A. A Space Efficient Streaming Algorithm for Triangle Counting Using the Birthday Paradox. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, IL, USA, 11–14 August 2013; pp. 589–597.
- Lim, Y.; Kang, U. Mascot: Memory-efficient and accurate sampling for counting local triangles in graph streams. In Proceedings of the 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Sydney, Australia, 10–13 August 2015; pp. 685–694.
- Stefani, L.D.; Epasto, A.; Riondato, M.; Upfal, E. Triest: Counting local and global triangles in fully dynamic streams with fixed memory size. Trans. Knowl. Discov. Data 2017, 11, 1–50.
- Jung, M.; Lim, Y.; Lee, S.; Kang, U. FURL: Fixed-memory and uncertainty reducing local triangle counting for multigraph streams. Data Min. Knowl. Discov. 2019, 33, 1225–1253.
- Shin, K.; Oh, S.; Kim, J.; Hooi, B.; Faloutsos, C. Fast, accurate and provable triangle counting in fully dynamic graph streams. Trans. Knowl. Discov. Data 2020, 14, 1–39.
- Alon, N.; Yuster, R.; Zwick, U. Color-coding. J. ACM 1995, 42, 844–856.
- Jha, M.; Seshadhri, C.; Pinar, A. Path Sampling: A Fast and Provable Method for Estimating 4-Vertex Subgraph Counts. In Proceedings of the 24th International Conference on World Wide Web, Florence, Italy, 18–22 May 2015; pp. 495–505.
- Omidi, S.; Schreiber, F.; Masoudi-nejad, A. MODA: An efficient algorithm for network motif discovery in biological networks. Genes Genet Syst. 2009, 84, 385–395.
- Paredes, P.; Ribeiro, P. Rand-fase: Fast approximate subgraph census. Soc. Netw. Anal. Min. 2015, 5, 17.
- Bressan, M.; Chierichetti, F.; Kumar, R.; Leucci, S.; Panconesi, A. Motif counting beyond five nodes. Trans. Knowl. Discov. Data 2018, 12, 1–25.
- Rossi, R.A.; Zhou, R.; Ahmed, N.K. Estimation of graphlet counts in massive networks. Trans. Neural Netw. Learn. Syst. 2018, 30, 44–57.
- Bressan, M.; Leucci, S.; Panconesi, A. Motivo: Fast Motif Counting via Succinct Color Coding and Adaptive Sampling. Proc. VLDB Endow. 2019, 12, 1651–1663.
- Rossi, R.A.; Ahmed, N.K.; Koh, E.; Kim, S.; Rao, A.; Abbasi-Yadkori, Y. A Structural Graph Representation Learning Framework. In Proceedings of the 13th International Conference on Web Search and Data Mining, Houston, TX, USA, 3–7 February 2020; pp. 483–491.
- Rossi, R.A.; Jin, D.; Kim, S.; Ahmed, N.K.; Koutra, D.; Lee, J.B. On Proximity and Structural Role-Based Embeddings in Networks: Misconceptions, Techniques, and Applications. ACM Trans. Knowl. Discov. Data 2020, 14, 1–37.
- Chen, X.; Li, Y.; Wang, P.; Lui, J.C. A General Framework for Estimating Graphlet Statistics via Random Walk. arXiv 2016, arXiv:1603.07504.
- Wang, P.; Zhao, J.; Zhang, X.; Li, Z.; Cheng, J.; Lui, J.C.; Towsley, D.; Tao, J.; Guan, X. MOSS-5: A fast method of approximating counts of 5-node graphlets in large graphs. Trans. Knowl. Data Eng. 2017, 30, 73–86.
- Yang, C.; Lyu, M.; Li, Y.; Zhao, Q.; Xu, Y. Ssrw: A scalable algorithm for estimating graphlet statistics based on random walk. In Proceedings of the 23rd International Conference, DASFAA, Gold Coast, QLD, Australia, 21–24 May 2018; pp. 272–288.
- Paramonov, K.; Shemetov, D.; Sharpnack, J. Estimating graphlet statistics via lifting. In Proceedings of the 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 587–595.
- Ahmed, N.; Duffield, N.; Neville, J.; Kompella, R. Graph Sample and Hold: A Framework for Big-Graph Analytics. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 24–27 August 2014.
- Takac, L.; Zabovsky, M. Data Analysis in Public Social Networks. In Proceedings of the DTI, Łomża, Poland, 28–29 May 2012; pp. 1–6.
- Leskovec, J.; Huttenlocher, D.; Kleinberg, J. Predicting Positive and Negative Links in Online Social Networks. In Proceedings of the 19th International Conference on World Wide Web, Raleigh, NC, USA, 26–30 April 2010; pp. 641–650.
- Google Programming Contest. 2002. Available online: http://www.google.com/programming-contest/ (accessed on 10 June 2021).
- Leskovec, J.; Huttenlocher, D.; Kleinberg, J. Signed Networks in Social Media. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, Atlanta, GA, USA, 10–15 April 2010; pp. 1361–1370.
- Yang, J.; Leskovec, J. Defining and Evaluating Network Communities Based on Ground-Truth. In Proceedings of the PTDM 2012: Practical Theories of Exploratory Data Mining, Brussels, Belgium, 10–13 December 2012; pp. 745–754.
- Ripeanu, M.; Foster, I.T.; Iamnitchi, A. Mapping the Gnutella Network: Properties of Large-Scale Peer-to-Peer Systems and Implications for System Design. IEEE Internet Comput. J. 2002, 6, 50–57.
- Leskovec, J.; Kleinberg, J.; Faloutsos, C. Graph Evolution: Densification and Shrinking Diameters. Trans. Knowl. Discov. Data 2007, 1, 2-es.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

