Mosar: Efficiently Characterizing Both Frequent and Rare Motifs in Large Graphs

Guo, Wenhua; Feng, Wenqian; Qi, Yiyan; Wang, Pinghui; Tao, Jing

doi:10.3390/app12147210


by Wenhua Guo ¹, Wenqian Feng ²,*, Yiyan Qi ³,*, Pinghui Wang ³ and Jing Tao ³

¹ State Key Laboratory for Manufacturing Systems Engineering, Xi’an Jiaotong University, Xi’an 710049, China

² School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an 710049, China

³ MOE Key Laboratory for Intelligent Networks and Network Security, Xi’an Jiaotong University, Xi’an 710049, China

* Authors to whom correspondence should be addressed.

Appl. Sci. 2022, 12(14), 7210; https://doi.org/10.3390/app12147210

Submission received: 1 June 2022 / Revised: 30 June 2022 / Accepted: 4 July 2022 / Published: 18 July 2022

(This article belongs to the Special Issue Trends and Prospects in Data Mining Techniques for Big Graph/Spatial Data)


Abstract: Exploring motif statistics (such as motif frequencies) of a large graph is useful for understanding complex networks such as social and biological networks, but it is challenging because of high computational costs. To address this challenge, many methods explore approximate algorithms using edge/path sampling techniques. However, state-of-the-art methods usually over-sample frequent motifs and under-sample rare motifs, and thus they fail in many real applications such as anomaly detection (i.e., finding rare patterns). Furthermore, it is not feasible to apply existing weighted sampling methods such as stratified sampling to solve this problem, because it is difficult to sample subgraphs from a large graph in a direct manner. In this paper, we observe that rare motifs of most real-world networks have “more edges” than frequent motifs, and that motifs with more edges are sampled by random edge sampling with higher probabilities. Based on these two observations, we propose a novel motif sampling method, Mosar, to estimate motif frequencies. In particular, Mosar samples frequent and rare motifs with different probabilities, and tends to sample motifs with low frequencies. As a result, the new method greatly reduces the estimation errors of these rare motifs. Finally, we conducted extensive experiments on a variety of real-world datasets of different sizes, and our experimental results show that Mosar is two orders of magnitude more accurate than state-of-the-art methods.

Keywords:

motif; subgraph sampling; graph mining

1. Introduction

Recently, exploring small connected subgraph patterns (i.e., motifs) in networks has attracted more and more attention in both academia and industry. These patterns have been widely used in various applications such as evolutionary pattern characterization in online social networks [1,2,3,4], pattern recognition in gene expression profiling [5], interaction prediction in protein–protein networks [6], and coarse-grained topology generation [7]. For example, Kunegis et al. [2] studied the significance of subgraph patterns such as “the enemy of my enemy is my friend” and “the friend of my friend is my friend” to evaluate the stability of “friend or foe” social networks such as Slashdot Zoo (www.slashdot.org, accessed on 6 March 2022). Refs. [8,9] explored network traffic activity graphs (TAGs) and observed that TAGs of different applications (e.g., FTP, Web, and P2P) exhibited different motif patterns.

Motif frequency and concentration are two popular statistics studied in many applications. Suppose that there exist N k-node connected and induced subgraphs (CISes) in G, and that n of these CISes are isomorphic to a motif M. Then, the motif frequency and concentration of M are defined as n and

n / N

, respectively. The huge number of these subgraphs poses a great challenge for computing these two statistics. For instance, in two medium-sized networks, Slashdot [10] and Epinions [11], with only

1.0 \times 10^{5}

nodes and

1.0 \times 10^{6}

edges [12], there exist more than

2.0 \times 10^{10}

four-node CIS. Furthermore, because the number of k-node CIS generally increases exponentially with k, the number of five-node CIS is even higher in both of these graphs. To solve this problem, many existing works [12,13,14,15,16] have explored approximate algorithms to estimate these statistics, making a trade-off between accuracy and computational time. These methods perform node sampling, edge sampling, or path sampling on the original graph and use the sampled graph to infer the statistics of all subgraphs in the original graph. The above sampling schemes usually prefer frequent motifs and under-sample rare ones (i.e., motifs with low frequencies). As a result, these methods exhibit large errors when estimating rare motifs’ statistics, and fail in many real applications such as anomaly detection (i.e., finding rare or unusual patterns) [17,18] and community search (i.e., finding the densest subgraphs and cliques) [19,20]. Among them, the method in [16] also estimates motif frequencies, as ours does; however, the two algorithms differ, and our method is deliberately biased towards rare motifs.

A potential way to solve the above problem is stratified sampling with the proportionate allocation strategy [21], the basic idea of which can be simply described as follows. For a motif M with frequency n (i.e., G has n CIS isomorphic to M), we suppose that each of its CISes (i.e., CIS in G isomorphic to M) is independently sampled with the same probability

γ

. Then, we estimate the motif frequency n as

\hat{n} = \frac{m}{γ}

, where m is the number of sampled CIS for which the original CIS are isomorphic to M. We can easily find that the variance of

\hat{n}

is

n (\frac{1}{γ} - 1)

, which implies that we can reduce the estimation error by increasing

γ

. With a fixed sampling budget, we can reduce the total estimation errors of characterizing all motifs by assigning larger probabilities to motifs with lower frequencies. However, it is not feasible to directly sample CIS in a graph with pre-defined probability

γ

, which hinders us from performing stratified sampling.
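The variance expression above follows from treating m as a binomial random variable with n trials and success probability γ. The following minimal sketch checks this exactly for one small case; the function name `estimator_moments` and the values n = 20 and γ = 1/4 are illustrative assumptions, not taken from the paper.

```python
from fractions import Fraction
from math import comb


def estimator_moments(n, gamma):
    """Exact mean and variance of n_hat = m / gamma with m ~ Binomial(n, gamma)."""
    mean = Fraction(0)
    second_moment = Fraction(0)
    for m in range(n + 1):
        # Probability that exactly m of the n CISes are sampled.
        prob = comb(n, m) * gamma**m * (1 - gamma) ** (n - m)
        estimate = Fraction(m) / gamma
        mean += prob * estimate
        second_moment += prob * estimate * estimate
    return mean, second_moment - mean * mean


# Hypothetical values chosen only for illustration.
n, gamma = 20, Fraction(1, 4)
mean, variance = estimator_moments(n, gamma)
print(mean, variance)              # mean equals n, variance equals n * (1/gamma - 1)
print(n * (1 / gamma - 1))
```

Because the variance scales with 1/γ − 1, giving a rare motif a larger γ directly shrinks its estimation error, which is the motivation for sampling rare motifs with higher probability.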

To address the above challenge, in this paper we propose a novel method called Mosar (Motif Sampling and Retrieving) to estimate all motif frequencies. Mosar first obtains a sampled graph

G^{*}

from the graph G under study using random edge sampling, i.e., each and every edge in G is sampled with the same probability p. In our experiments, we observe that motifs with more edges usually have lower frequencies for many real-world graphs. Moreover, the probability of a k-node CIS s in G also appearing as a k-node CIS in

G^{*}

increases with the number of edges in s, where k is the size of motifs under study. For example, an unclosed wedge and a triangle in G are observed in

G^{*}

with probabilities

p^{2}

and

3 p^{2} - 2 p^{3}

, respectively (note that

3 p^{2} - 2 p^{3} \approx 3 p^{2}

when

p < 0.1

). Thus, Mosar can be simply viewed as a novel weighted motif sampling method, and it tends to sample rare motifs. Clearly,

G^{*}

may exhibit different motif statistics from G due to two kinds of uncertainties: (1) CIS in

G^{*}

and their original CIS in G may be different; and (2) CIS are not sampled uniformly. For example, Figure 1 shows that a sample graph

G^{*}

has three-node directed motif concentrations which differ greatly from G; the Flickr graph [22] is used as G and

G^{*}

is obtained by randomly sampling each edge of G with the same probability 0.05. To remove the error introduced by these two uncertainties, Mosar retrieves the original CIS of all k-node CIS in

G^{*}

and then builds a probabilistic model to “re-weight” sampled CIS to compute motif statistics. Our experiments on a variety of public datasets show that our method is two orders of magnitude more accurate than state-of-the-art methods.
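The wedge and triangle probabilities quoted in this paragraph can be checked by enumerating every outcome of random edge sampling. The sketch below is only an illustration: the helper names `is_connected` and `survival_probability` and the value p = 1/10 are assumptions introduced here.

```python
from fractions import Fraction
from itertools import product


def is_connected(nodes, edges):
    """Return True if `edges` connect all of `nodes` (simple depth-first search)."""
    nodes = set(nodes)
    seen = set()
    stack = [next(iter(nodes))]
    while stack:
        u = stack.pop()
        if u in seen:
            continue
        seen.add(u)
        for a, b in edges:
            if a == u and b not in seen:
                stack.append(b)
            elif b == u and a not in seen:
                stack.append(a)
    return seen == nodes


def survival_probability(edges, p):
    """Probability that the nodes of a CIS still form a CIS after edge sampling."""
    nodes = {u for e in edges for u in e}
    q = 1 - p
    total = 0
    for keep in product([False, True], repeat=len(edges)):
        kept = [e for e, k in zip(edges, keep) if k]
        if is_connected(nodes, kept):
            total += p ** len(kept) * q ** (len(edges) - len(kept))
    return total


p = Fraction(1, 10)  # hypothetical sampling probability
wedge = [(0, 1), (1, 2)]
triangle = [(0, 1), (1, 2), (0, 2)]
print(survival_probability(wedge, p), p**2)
print(survival_probability(triangle, p), 3 * p**2 - 2 * p**3)
```

For small p, the triangle is therefore observed roughly three times as often as the wedge, which is the bias towards denser (and typically rarer) motifs that Mosar exploits.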

The rest of this paper is organized as follows. Section 2 summarizes related work. The problem formulation is presented in Section 3. Section 4 presents our motif sampling method, Mosar, and the corresponding methods of estimating motif frequencies and concentrations. The performance evaluation and testing results are presented in Section 5, and concluding remarks follow.

2. Related Work

There is an immense body of literature on the characterization of three-, four-, and five-node CIS in a single large graph. However, many of these works focus on the triangle counting problem [23,24,25,26,27,28,29] and cannot be easily extended to count other CIS.

In this section, we briefly review practical algorithms that approximately count all three-, four-, and five-node CIS in a large static graph. While Alon et al. [30] proposed a color-coding method to reduce the computational cost of counting subgraphs, it is not scalable to large graphs [31]. OmidiGenes et al. [32] proposed a subgraph enumeration and counting method using edge sampling; however, this method suffers from unknown sampling bias. To estimate subgraph class concentrations, Kashtan et al. [13] proposed a connected subgraph sampling method; however, their method is computationally expensive when calculating the weight of each sampled subgraph, which is used to correct the bias introduced by sampling. To address this drawback, FANMOD [14] samples subgraphs by building a subgraph enumeration tree, which requires that the graph fit into memory. Recently, Paredes and Ribeiro [33] proposed RAND-FaSE to estimate the frequency of all CIS with an efficient tree data structure, where the leaves are the subgraph occurrences. Wang et al. [16] built a transition probability matrix between the motif statistics in the original and sampled graphs. With the motif statistics in the sampled graph, they provide an unbiased estimator for all three-, four-, and five-node CIS. Marco et al. [34] presented a general algorithm using colour coding to approximately count motifs beyond five nodes. Ryan et al. [35] developed an unbiased graphlet estimation framework by sampling edges and their local neighbourhood. The Motivo algorithm proposed in [36] scales to larger graphs while providing more accurate motif counts than earlier methods, both for the most frequent motifs and for extremely rare motifs. The general framework proposed in [37], called HONE, learns structural node embeddings from networks through subgraph patterns in node neighborhoods. Random walks [38] have been used as the basis for many proximity-based community detection methods. These methods are similar to the random edge sampling in the first step of our Mosar method, although they differ in many implementation details. In addition, Refs. [12,15,39,40,41,42] proposed sampling methods to estimate online social networks’ motif concentrations when the graph’s topology is not available in advance and it is costly to crawl the entire topology. However, the above methods under-sample rare motifs, and thus exhibit large errors for characterizing such motifs.

3. Problem Formulation

In this section, we introduce motif statistics. For readability, the notations used throughout the paper are listed in Table 1. We denote the graph of interest as a labeled undirected graph

G = (V, E, L)

, where V is the set of nodes, E is a set of undirected edges, and L is a set of labels

l_{u, v}

associated with undirected edges

(u, v) \in E

. For example: (1) directed networks use labels

l_{u, v} \in \{\to, \leftarrow, \leftrightarrow\}

to indicate the direction of the edges

(u, v) \in E

; (2)

l_{u, v} \in \{+, -\}

for edges in signed networks having positive or negative labels; (3) a regular undirected graph can be represented by setting L to null.

To formally define the motif frequency of G, first, we introduce a few notations. An induced subgraph of G,

G^{'} = (V^{'}, E^{'}, L^{'})

, is a subgraph with its edges and associated labels all in G, i.e.,

V^{'} \subset V

,

E^{'} = \{(u, v) : u, v \in V^{'}, (u, v) \in E\}

,

L^{'} = \{l_{u, v} : (u, v) \in E^{'}\}

. Denote

C^{(k)}

as the set of all CIS with k nodes in G, and

n^{(k)} = | C^{(k)} |

. We provide a simple example in Figure 2, where

n^{(3)} = 3

. We partition

C^{(k)}

into

T_{k}

equivalence classes

C_{1}^{(k)}, \dots, C_{T_{k}}^{(k)}

without overlapping where CIS within each

C_{i}^{(k)}

are isomorphic. Next, we present several examples to illustrate our notations. Figure 3a reveals all three-node motifs of unlabeled undirected networks. When G is an unlabeled and undirected network, then the number of three-node motifs is

T_{3} = 2

, and

C_{1}^{(3)}

and

C_{2}^{(3)}

are the sets of CIS in G isomorphic to the first and second motifs in Figure 3a, respectively. Figure 3b reveals all three-node motifs when G is any signed network; in this case,

T_{3} = 7

. Figure 3c reveals all motifs with three nodes for any directed network; in such a case,

T_{3} = 13

. Figure 3d reveals all four-node motifs of any unlabeled and undirected network; in this case,

T_{4} = 6

. Figure 3e shows all five-node motifs of any unlabeled and undirected network; in this case,

T_{5} = 21

. Throughout the paper,

C_{i}^{(k)}

is defined as the set of CIS in G that are isomorphic to the i-th k-node motif

M_{i}^{(k)}

. Define the frequency of motif

M_{i}^{(k)}

as

n_{i}^{(k)} = | C_{i}^{(k)} |

, i.e., the number of CIS in

C_{i}^{(k)}

. For example,

C_{1}^{(3)}

includes two CIS for the directed graph G in Figure 2: (1) the CIS made up of a, b, and d, and (2) the CIS made up of a, c, and d. Thus,

n_{1}^{(3)} = 2

. In this paper, we focus on designing fast and accurate sampling methods to reduce the time needed to count motif frequencies.
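As a concrete illustration of these definitions for unlabeled undirected graphs with k = 3, the sketch below counts the two motif frequencies by brute force. The edge list is a hypothetical toy graph (it is not the graph of Figure 2), and `count_three_node_motifs` is a name introduced here for illustration.

```python
from itertools import combinations


def count_three_node_motifs(edges):
    """Return (wedges, triangles) for an unlabeled undirected graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    wedges = 0
    triangle_corners = 0
    for center, nbrs in adj.items():
        for a, b in combinations(sorted(nbrs), 2):
            if b in adj[a]:
                triangle_corners += 1  # each triangle is seen once per corner
            else:
                wedges += 1            # unclosed wedge centered at `center`
    return wedges, triangle_corners // 3


# Hypothetical toy graph: one triangle (a, b, c) plus a pendant edge (c, d).
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
n1, n2 = count_three_node_motifs(toy_edges)
total = n1 + n2
print("frequencies:", n1, n2)
print("concentrations:", n1 / total, n2 / total)
```

Exact counting of this kind touches every pair of neighbors of every node, which is what becomes prohibitive on large graphs and motivates the sampling approach.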

4. Motif Sampling and Retrieving

In this section, we start by introducing our Mosar method for motif sampling. After that, we present a probabilistic model to analyze its sampling bias. On the basis of this model, we put forward a method to correct the sampling error for estimating motif frequencies. Finally, we provide lower error bounds for our estimates.

4.1. Sampling Motifs over G

Figure 4 shows an overview of Mosar. Mosar first generates a subgraph

G^{*} = (V^{*}, E^{*}, L^{*})

of

G = (V, E, L)

by iterating each edge and sampling it with the same probability p. We assume that

G^{*}

can be fitted into memory, which can be easily achieved using a small p. Then, Mosar uses existing CIS enumeration methods such as [16] to enumerate all k-node CIS of

G^{*}

. For a graph s, let

V (s)

and

E (s)

denote the set of nodes and edges contained in s. For a k-node CIS

s^{*}

of

G^{*}

, let s be its original k-node CIS, which is defined as the k-node CIS of G with the same nodes in

s^{*}

, i.e.,

V (s) = V (s^{*})

. We can easily find that

s^{*}

can be quite different from s. To eliminate the estimation error introduced by this uncertainty, when traversing

s^{*}

, we combine the edge information of the original graph G to retrieve the s of the original graph. Formally, we let

C^{(k, *)}

denote all k-node CIS of

G^{*}

. Finally, we obtain all pairs of CIS

s^{*} \in C^{(k, *)}

and their original CIS,

s \in C^{(k)}

, i.e.,

\begin{matrix} S_{G^{*}, G}^{(k)} = \{(s^{*}, s) : s^{*} \in C^{(k, *)}, s \in C^{(k)}, V (s^{*}) = V (s)\} . \end{matrix}

(1)

The pseudocode of Mosar is shown in Algorithm 1.

Algorithm 1: The pseudocode of Mosar.
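The body of Algorithm 1 is not reproduced in this text, so the following is only a sketch of the sample-then-retrieve procedure described above, under simplifying assumptions: k = 3, an unlabeled undirected graph, and the whole graph held in memory (the paper instead retrieves original CIS with a second pass over the graph file). All function names are hypothetical.

```python
import random
from itertools import combinations


def adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def motif_id(nodes, adj):
    """1 = unclosed wedge, 2 = triangle, None = not connected (three nodes)."""
    a, b, c = nodes
    m = sum((y in adj.get(x, ())) for x, y in ((a, b), (a, c), (b, c)))
    return {2: 1, 3: 2}.get(m)


def mosar_pairs(edges, p, rng):
    """Count pairs (s*, s): s* has motif i in G*, its original s has motif j in G."""
    adj = adjacency(edges)
    sampled = [e for e in edges if rng.random() < p]  # random edge sampling
    adj_s = adjacency(sampled)
    counts = {}
    seen = set()
    for center, nbrs in adj_s.items():
        for a, b in combinations(sorted(nbrs), 2):
            nodes = tuple(sorted((center, a, b)))
            if nodes in seen:  # a triangle is reachable from each of its corners
                continue
            seen.add(nodes)
            # Retrieve the original CIS on the same node set from G.
            key = (motif_id(nodes, adj_s), motif_id(nodes, adj))
            counts[key] = counts.get(key, 0) + 1
    return counts


toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
print(mosar_pairs(toy_edges, 0.5, random.Random(0)))
```

The returned dictionary plays the role of the counts of pairs in Equation (1), grouped by the motif of s* and the motif of s; these counts are what the estimators in the following subsections consume.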

4.2. Probabilistic Model of Mosar

We build a probabilistic model of pairs

(s^{*}, s) \in S_{G^{*}, G}^{(k)}

, which is similar to the model in [16]. Define

P_{i, j}

as the probability that

s^{*}

is isomorphic to motif

M_{i}^{(k)}

given that s is isomorphic to motif

M_{j}^{(k)}

, i.e.,

\begin{matrix} P_{i, j} = P (M (s^{*}) = M_{i}^{(k)} | M (s) = M_{j}^{(k)}) . \end{matrix}

(2)

To obtain

P_{i, j}

, first of all, we compute

ϕ_{i, j}

, which is defined as the quantity of subgraphs of

M_{j}^{(k)}

isomorphic to

M_{i}^{(k)}

. For instance,

M_{2}^{(3)}

, i.e., the triangle, includes three subgraphs isomorphic to

M_{1}^{(3)}

, i.e., the unclosed wedge for the undirected graph in Figure 3a. Thus, we have

ϕ_{1, 2} = 3

for three-node undirected motifs. When

i = j

, we let

ϕ_{i, j} = 1

. For four- and five-node motifs, it is difficult to obtain

ϕ_{i, j}

manually; we use the method in [16] to compute

ϕ_{i, j}

. Let

q = 1 - p

; then, we have

\begin{matrix} P_{i, j} = ϕ_{i, j} p^{| E (M_{i}^{(k)}) |} q^{| E (M_{j}^{(k)}) | - | E (M_{i}^{(k)}) |} . \end{matrix}

(3)

For example, we have

P_{1, 2} = 3 p^{2} q

and

P_{2, 2} = p^{3}

for the undirected three-node motifs in Figure 3a.
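Equation (3) can be evaluated directly once the edge counts and the values of ϕ are known. The sketch below hard-codes them for the two undirected three-node motifs only; the table names `EDGES` and `PHI` and the function name are assumptions made for illustration, and four- and five-node motifs would need the ϕ values computed as in [16].

```python
from fractions import Fraction

# |E(M_i)| for the unclosed wedge (id 1) and the triangle (id 2).
EDGES = {1: 2, 2: 3}
# phi_{i,j}: number of subgraphs of M_j isomorphic to M_i; absent pairs mean 0.
PHI = {(1, 1): 1, (1, 2): 3, (2, 2): 1}


def transition_probability(i, j, p):
    """P_{i,j} from Equation (3) for undirected three-node motifs."""
    phi = PHI.get((i, j), 0)
    if phi == 0:
        return 0  # M_i cannot arise from M_j by deleting edges
    q = 1 - p
    return phi * p ** EDGES[i] * q ** (EDGES[j] - EDGES[i])


p = Fraction(1, 20)  # hypothetical sampling probability
for i, j in [(1, 1), (1, 2), (2, 2), (2, 1)]:
    print((i, j), transition_probability(i, j, p))
```

Summing P_{1,2} and P_{2,2} gives 3p²q + p³ = 3p² − 2p³, the probability that a triangle of G is still observed as some three-node CIS in the sampled graph.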

4.3. Motif Frequency Estimation

Using the probabilistic model above, we put forward a method to estimate motif frequencies. The pseudocode for motif frequency estimation is shown in Algorithm 2.

Algorithm 2: The pseudocode for Motif Frequency Estimation.

Define

m_{i, j}^{(k)}

,

1 \leq i, j \leq T_{k}

, as the number of pairs

(s^{*}, s) \in S_{G^{*}, G}^{(k)}

, where

s^{*}

is isomorphic to motif

M_{i}^{(k)}

and s is isomorphic to motif

M_{j}^{(k)}

. Then, the expectation of

m_{i, j}^{(k)}

is computed as

\begin{matrix} E [m_{i, j}^{(k)}] = P_{i, j} n_{j}^{(k)} . \end{matrix}

(4)

When

P_{i, j} > 0

, we have the following estimator of

n_{j}^{(k)}

:

\begin{matrix} {\hat{n}}_{j}^{(k, i)} = \frac{m_{i, j}^{(k)}}{P_{i, j}} . \end{matrix}

(5)

Denote

Z_{j} = \{i : i = 1, \dots, T_{k}, \text{ and } P_{i, j} > 0\}

. Thus, we have

| Z_{j} |

estimators of

n_{j}^{(k)}

, i.e.,

{\hat{n}}_{j}^{(k, i)}

,

i \in Z_{j}

. Let

s_{1}

and

s_{2}

be two k-node CIS in G isomorphic to the j-th k-node motif. Denote

s_{1}^{*}

and

s_{2}^{*}

as the induced subgraphs of node sets

V (s_{1})

and

V (s_{2})

in

G^{*}

, respectively. Define

π_{i, j} (s_{1}, s_{2})

as the probability that

s_{1}^{*}

and

s_{2}^{*}

are both isomorphic to the i-th k-node motif. We can easily find that

π_{i, j} (s_{1}, s_{2}) = P_{i, j}^{2}

when

s_{1}

and

s_{2}

have no common edges (i.e.,

E (s_{2}) \cap E (s_{1}) = \emptyset

), and

π_{i, j} (s_{1}, s_{2}) > P_{i, j}^{2}

otherwise. For example, as shown in Figure 5, we have

π_{1, 2} (s_{1}, s_{2}) = p^{4} q + 4 p^{3} q^{2}

and

π_{2, 2} (s_{1}, s_{2}) = p^{5}

for the undirected three-node motifs in Figure 3a. Then, we have the following theorem.

Theorem 1.

For each

i \in Z_{j}

,

{\hat{n}}_{j}^{(k, i)}

is an unbiased estimator of

n_{j}^{(k)}

, i.e.,

\begin{matrix} E ({\hat{n}}_{j}^{(k, i)}) = n_{j}^{(k)}, \end{matrix}

(6)

and the variance of

{\hat{n}}_{j}^{(k, i)}

is

\begin{matrix} \begin{matrix} Var ({\hat{n}}_{j}^{(k, i)}) = \frac{1}{P_{i, j}} n_{j}^{(k)} (1 - P_{i, j}) \\ + \frac{1}{P_{i, j}^{2}} \sum_{s_{1}, s_{2} \in C_{j}^{(k)}, s_{1} \neq s_{2}, E (s_{1}) \cap E (s_{2}) \neq \emptyset} (π_{i, j} (s_{1}, s_{2}) - P_{i, j}^{2}) . \end{matrix} \end{matrix}

(7)

Proof.

From (4), we have

\begin{matrix} E ({\hat{n}}_{j}^{(k, i)}) = \frac{E (m_{i, j}^{(k)})}{P_{i, j}} = n_{j}^{(k)} . \end{matrix}

(8)

Let

1 (X)

denote the indicator function, which equals one when the predicate

X

is true and zero otherwise. Define the function

\begin{matrix} δ (s) = 1 (\exists s^{*} : (s^{*}, s) \in S_{G^{*}, G}^{(k)}, M (s^{*}) = M_{i}^{(k)}) . \end{matrix}

(9)

Then we can write

{\hat{n}}_{j}^{(k, i)}

as

\begin{matrix} {\hat{n}}_{j}^{(k, i)} = \frac{1}{P_{i, j}} \sum_{s \in C_{j}^{(k)}} δ (s) \end{matrix}

(10)

Thus,

Var ({\hat{n}}_{j}^{(k, i)})

is computed as

\begin{matrix} \begin{matrix} Var ({\hat{n}}_{j}^{(k, i)}) = \frac{1}{P_{i, j}^{2}} Var (\sum_{s \in C_{j}^{(k)}} δ (s)) \\ = \frac{1}{P_{i, j}^{2}} \sum_{s_{1} \in C_{j}^{(k)}} \sum_{s_{2} \in C_{j}^{(k)}} Cov (δ (s_{1}), δ (s_{2})) \\ = \frac{1}{P_{i, j}^{2}} \sum_{s_{1} \in C_{j}^{(k)}} \sum_{s_{2} \in C_{j}^{(k)}} (π_{i, j} (s_{1}, s_{2}) - P_{i, j}^{2}) . \end{matrix} \end{matrix}

(11)

Finally, we have

\begin{matrix} \begin{matrix} Var ({\hat{n}}_{j}^{(k, i)}) = \frac{1}{P_{i, j}} n_{j}^{(k)} (1 - P_{i, j}) \\ + \frac{1}{P_{i, j}^{2}} \sum_{s_{1}, s_{2} \in C_{j}^{(k)}, s_{1} \neq s_{2}, E (s_{1}) \cap E (s_{2}) \neq \emptyset} (π_{i, j} (s_{1}, s_{2}) - P_{i, j}^{2}) . \end{matrix} \end{matrix}

(12)

In the derivation above, we use

π_{i, j} (s_{1}, s_{2}) = P_{i, j}

when

s_{1} = s_{2}

, and

π_{i, j} (s_{1}, s_{2}) = P_{i, j}^{2}

when

s_{1}

and

s_{2}

have no common edges. □
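The joint probabilities π used in the theorem can be verified by enumerating all sampling outcomes over the union of the two edge sets. The sketch below assumes, as the quoted values suggest, that the two triangles in the example share exactly one edge; the function name and node labels are hypothetical.

```python
from fractions import Fraction
from itertools import product


def joint_probability(s1, s2, edges_kept, p):
    """P(both s1* and s2* retain exactly `edges_kept` of their own edges).

    For triangles, retaining 2 edges yields an unclosed wedge (motif 1)
    and retaining 3 edges yields a triangle (motif 2).
    """
    all_edges = sorted(set(s1) | set(s2))
    q = 1 - p
    total = 0
    for keep in product([False, True], repeat=len(all_edges)):
        kept = {e for e, k in zip(all_edges, keep) if k}
        if len(kept & set(s1)) == edges_kept and len(kept & set(s2)) == edges_kept:
            total += p ** len(kept) * q ** (len(all_edges) - len(kept))
    return total


p = Fraction(1, 10)  # hypothetical sampling probability
q = 1 - p
s1 = [("a", "b"), ("a", "c"), ("b", "c")]
s2 = [("b", "c"), ("b", "d"), ("c", "d")]  # shares edge (b, c) with s1
print(joint_probability(s1, s2, 2, p), p**4 * q + 4 * p**3 * q**2)
print(joint_probability(s1, s2, 3, p), p**5)
```

When the two triangles share no edge, the same enumeration factorizes into the product of the two marginal probabilities, which is why only overlapping pairs contribute to the second term of the variance.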

Example: For the undirected three-node motifs in Figure 3a,

{\hat{n}}_{2}^{(3, 1)}

and

{\hat{n}}_{2}^{(3, 2)}

are two estimators of

n_{2}^{(3)}

, i.e., the number of triangles in G. Note that

{\hat{n}}_{2}^{(3, 2)}

is the same estimator of

n_{2}^{(3)}

as in [23]. Let

Γ

be the number of pairs of triangles that are not edge disjoint. Then, we have

\begin{matrix} \begin{matrix} Var ({\hat{n}}_{2}^{(3, 1)}) \\ = \frac{n_{2}^{(3)} (3 p^{2} q - 9 p^{4} q^{2}) + 2 Γ (p^{4} q + 4 p^{3} q^{2} - 9 p^{4} q^{2})}{9 p^{4} q^{2}} . \end{matrix} \end{matrix}

(13)

and

\begin{matrix} Var ({\hat{n}}_{2}^{(3, 2)}) = \frac{n_{2}^{(3)} (p^{3} - p^{6}) + 2 Γ (p^{5} - p^{6})}{p^{6}} . \end{matrix}

(14)

We can easily find that

Var ({\hat{n}}_{2}^{(3, 1)})

is smaller than

Var ({\hat{n}}_{2}^{(3, 2)})

when

p < \frac{5}{6}

. When

p ≪ 1

and

n_{2}^{(3)} ≫ p Γ

, we have

Var ({\hat{n}}_{2}^{(3, 1)}) \approx \frac{n_{2}^{(3)}}{3 p^{2}}

and

Var ({\hat{n}}_{2}^{(3, 2)}) \approx \frac{n_{2}^{(3)}}{p^{3}}

; therefore,

{\hat{n}}_{2}^{(3, 1)}

is

\sqrt{\frac{3}{p}}

times more accurate than

{\hat{n}}_{2}^{(3, 2)}

.
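To make the comparison concrete, Equations (13) and (14) can be evaluated numerically. In the sketch below, the triangle count, the number Γ of overlapping triangle pairs, and the sampling probability are hypothetical values chosen only for illustration.

```python
from fractions import Fraction
from math import sqrt


def var_wedge_based(n_tri, gamma, p):
    """Equation (13): variance of the estimator built from sampled wedges."""
    q = 1 - p
    P = 3 * p**2 * q
    pi = p**4 * q + 4 * p**3 * q**2
    return (n_tri * (P - P**2) + 2 * gamma * (pi - P**2)) / P**2


def var_triangle_based(n_tri, gamma, p):
    """Equation (14): variance of the estimator built from sampled triangles."""
    P = p**3
    pi = p**5
    return (n_tri * (P - P**2) + 2 * gamma * (pi - P**2)) / P**2


# Hypothetical inputs: 10^6 triangles, 10^6 overlapping pairs, p = 0.01.
n_tri, gamma, p = 10**6, 10**6, Fraction(1, 100)
v1 = var_wedge_based(n_tri, gamma, p)
v2 = var_triangle_based(n_tri, gamma, p)
print("std ratio:", sqrt(v2 / v1), "approximation sqrt(3/p):", sqrt(3 / p))
```

With these inputs the ratio of standard deviations is close to, but slightly below, the approximation because the Γ term is not entirely negligible.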

Finally, we estimate

n_{j}^{(k)}

using the following mixed estimator:

\begin{matrix} {\hat{n}}_{j}^{(k)} = \sum_{i \in Z_{j}} α_{i, j} {\hat{n}}_{j}^{(k, i)}, \end{matrix}

(15)

where parameters

0 \leq α_{i, j} \leq 1

, and

\sum_{i \in Z_{j}} α_{i, j} = 1

.

α_{i, j}

is used to determine the relative importance of

{\hat{n}}_{j}^{(k, i)}

. Suppose that all

{\hat{n}}_{j}^{(k, i)}

are independent. Then, the variance of

{\hat{n}}_{j}^{(k)}

is

\begin{matrix} Var ({\hat{n}}_{j}^{(k)}) = \sum_{i \in Z_{j}} α_{i, j}^{2} Var ({\hat{n}}_{j}^{(k, i)}) . \end{matrix}

(16)

Next, we compute optimal

α_{i, j}

to minimize

Var ({\hat{n}}_{j}^{(k)})

. Define Lagrange function

ψ

as

\begin{matrix} ψ = \sum_{i \in Z_{j}} α_{i, j}^{2} Var ({\hat{n}}_{j}^{(k, i)}) + λ (\sum_{i \in Z_{j}} α_{i, j} - 1) . \end{matrix}

(17)

The derivatives of

ψ

with respect to

α_{i, j}

and

λ

are

\begin{matrix} \frac{\partial ψ}{\partial α_{i, j}} = 2 α_{i, j} Var ({\hat{n}}_{j}^{(k, i)}) + λ, i \in Z_{j}, \end{matrix}

(18)

and

\begin{matrix} \frac{\partial ψ}{\partial λ} = \sum_{i \in Z_{j}} α_{i, j} - 1 . \end{matrix}

(19)

To obtain a

{\hat{n}}_{j}^{(k)}

with the smallest error, we solve the equations

\frac{\partial ψ}{\partial λ} = 0

, and

\frac{\partial ψ}{\partial α_{i, j}} = 0

,

i \in Z_{j}

, and have

\begin{matrix} α_{i, j} = \frac{{Var}^{- 1} ({\hat{n}}_{j}^{(k, i)})}{\sum_{l \in Z_{j}} {Var}^{- 1} ({\hat{n}}_{j}^{(k, l)})}, i \in Z_{j} . \end{matrix}

(20)

When it is difficult to compute

Var ({\hat{n}}_{j}^{(k, i)})

exactly, we approximate

Var ({\hat{n}}_{j}^{(k, i)}) \approx n_{j}^{(k)} (1 - P_{i, j}) P_{i, j}^{- 1}

and then set parameters

α_{i, j}

as

\begin{matrix} α_{i, j} = \frac{P_{i, j} {(1 - P_{i, j})}^{- 1}}{\sum_{l \in Z_{j}} P_{l, j} {(1 - P_{l, j})}^{- 1}}, i \in Z_{j} . \end{matrix}

(21)
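A minimal sketch of how the weights of Equation (21) and the mixed estimator of Equation (15) fit together is given below. The function names and the input counts are hypothetical, and the code assumes every P_{i,j} it receives is strictly below one.

```python
from fractions import Fraction


def mixing_weights(P_col):
    """Equation (21): alpha_{i,j} proportional to P_{i,j} / (1 - P_{i,j})."""
    raw = {i: P / (1 - P) for i, P in P_col.items() if P > 0}
    total = sum(raw.values())
    return {i: w / total for i, w in raw.items()}


def mixed_estimate(m_col, P_col):
    """Equation (15): weighted sum of the estimators m_{i,j} / P_{i,j}."""
    alpha = mixing_weights(P_col)
    return sum(alpha[i] * m_col.get(i, 0) / P_col[i] for i in alpha)


# Triangle column (j = 2) of the three-node undirected model with p = 1/10.
p = Fraction(1, 10)
P_col = {1: 3 * p**2 * (1 - p), 2: p**3}
m_col = {1: 30, 2: 1}  # hypothetical observed pair counts m_{1,2} and m_{2,2}
print(mixing_weights(P_col))
print(float(mixed_estimate(m_col, P_col)))
```

Because P_{1,2} is much larger than P_{2,2} for small p, almost all of the weight goes to the wedge-based estimator, consistent with the variance comparison in the example above.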

4.4. Discussion

Compared to the online methods of analyzing streaming graphs in [16,43] (i.e., the graph of interest is given as a stream of edges and each edge can be accessed and processed only once), Mosar needs to pass over the graph file of interest twice, with the additional pass performed to remove uncertainty introduced by sampling. However, we observe that passing over the graph requires much less time than enumerating and classifying subgraphs even for a small sampling probability p. For example, in our experiments we observed that the computational time needed for passing over the graph file of interest on disk was no more than 7% of the time needed to enumerate and classify CIS in the sampled graph when

p = 0.01

. Thus, to sample the same number of CIS, Mosar requires effectively the same computational time as the methods in [16,43].

5. Data Evaluation

In this section, we first introduce our experimental datasets. We then present experimental results that evaluate the performance of our Mosar method against state-of-the-art methods. Our experiments were conducted on a server with a Quad-Core AMD Opteron (tm) 8379 HE 2.39 GHz CPU and 128 GB of DRAM.

5.1. Datasets

We performed our experiments on the following publicly available datasets, which are summarized in Table 2.

Online social networks: Flickr [22], Pokec [44], LiveJournal [44], YouTube [22], soc-Epinions1 [11], soc-Slashdot08 [10], and soc-Slashdot09 [10]. Flickr, LiveJournal, and YouTube are popular photo, blog, and video sharing websites, respectively, where a user can subscribe to other users’ updates such as photos, blogs, and videos. Pokec is the most popular online social network in Slovakia, and has been in existence for more than ten years. These networks can be represented by directed graphs, where nodes represent users and a directed edge from node u to node v indicates that user u subscribes to user v or that user u tags user v as a friend. Soc-Epinions1 [11] is a directed graph of the Epinions website in 2003, where a directed edge from node u to node v indicates that user u trusts user v. Soc-Slashdot08 and soc-Slashdot09 [10] are graphs of the technology-related news website Slashdot released in 2008 and 2009, respectively, where an edge between node u and node v means that user u has marked user v as a friend.
Web graph: Web-Google [46]. The Web-Google dataset was released in 2002 by Google as a part of a Google Programming Contest; nodes represent web pages and directed edges represent hyperlinks between them.
Signed networks: sign-Epinions, sign-Slashdot08, and sign-Slashdot09 [47]. The Epinions and Slashdot networks can be represented by signed graphs, where a positive edge from user u to user v means that u trusts v on the Epinions website or that u marks v as a friend on the Slashdot website. A negative edge from u to v means that u distrusts v on the Epinions website or that u tags v as a foe on the Slashdot website.
Collaboration networks: ca-HepTh [50], ca-GrQc [50], and ca-CondMat [50]. arXiv is an online repository of electronic preprints of scientific papers in many fields, such as mathematics, physics, and computer science. The datasets ca-GrQc, ca-HepTh, and ca-CondMat consist of arXiv e-prints and cover scientific collaborations between authors of papers submitted to the General Relativity and Quantum Cosmology category, the High Energy Physics - Theory category, and the Condensed Matter category, respectively [50]. These networks can all be represented by undirected graphs. If author u co-authored a paper with author v, the graph contains an undirected edge between u and v.
Peer-to-peer network: p2p-Gnutella08 [49]. Gnutella is a peer-to-peer file sharing network. Nodes in the p2p-Gnutella08 dataset represent users in the Gnutella network and edges represent connections between Gnutella users.
Communication network: Wiki-Talk [45]. Wikipedia is a free encyclopedia written collaboratively by volunteers around the world. Each registered user has a talk page that she/he and other users can edit in order to communicate and discuss updates to various articles on Wikipedia. Nodes in the Wiki-Talk dataset represent registered users on Wikipedia and a directed edge from node u to node v indicates that user u at least once edited a talk page of user v.
Product network: com-Amazon [48]. The dataset was collected by crawling the Amazon website based on its “Customers Who Bought This Item Also Bought” feature. If a product u is frequently co-purchased with product v, the graph contains an undirected edge between u and v.

5.2. Error Metric

Similar to [16], in our experiments we studied the normalized root mean square error (NRMSE) to measure the relative error of the motif frequency estimate

{\hat{n}}_{i}

with respect to its true value

n_{i}

,

i = 1, 2, \dots

.

NRMSE ({\hat{n}}_{i})

is defined as:

\begin{matrix} NRMSE ({\hat{n}}_{i}) = \frac{\sqrt{MSE ({\hat{n}}_{i})}}{n_{i}}, i = 1, 2, \dots, \end{matrix}

(22)

where

MSE ({\hat{n}}_{i})

is defined as

\begin{matrix} MSE ({\hat{n}}_{i}) = E [{({\hat{n}}_{i} - n_{i})}^{2}] = var ({\hat{n}}_{i}) + {(E [{\hat{n}}_{i}] - n_{i})}^{2} . \end{matrix}

(23)

Note that

MSE ({\hat{n}}_{i})

decomposes into the sum of the variance and the squared bias of the estimator

{\hat{n}}_{i}

, both of which are important and must be as small as possible to achieve better estimation performance. When

{\hat{n}}_{i}

is an unbiased estimator of

n_{i}

, then

MSE ({\hat{n}}_{i}) = var ({\hat{n}}_{i})

. As a consequence,

NRMSE ({\hat{n}}_{i})

is the equivalent of the normalized standard error of

{\hat{n}}_{i}

, which is

NRMSE ({\hat{n}}_{i}) = \sqrt{var ({\hat{n}}_{i})} / n_{i}

. Note that NRMSE is a relative error; thus, we consider values as large as

NRMSE ({\hat{n}}_{i}) = 1

to be acceptable when

n_{i}

is small. In our experiments, we average the estimates and calculate their NRMSEs over 100 runs.
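The metric in Equations (22) and (23) can be computed from repeated runs as in the following sketch, where the expectation is replaced by the empirical mean over runs. The function name `nrmse` and the sample estimates are hypothetical.

```python
from math import sqrt


def nrmse(estimates, truth):
    """Empirical NRMSE: square root of the mean squared error, divided by the truth."""
    mse = sum((e - truth) ** 2 for e in estimates) / len(estimates)
    return sqrt(mse) / truth


# Hypothetical estimates of a motif frequency whose true value is 100.
runs = [90, 110, 95, 105]
print(nrmse(runs, 100))
```

Because the error is normalized by the true frequency, the same absolute deviation counts for much more on a rare motif than on a frequent one.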

5.3. Accuracy Results

First, we evaluated the performance of our method in estimating three-node motif frequencies on graphs with millions of nodes (Flickr, Pokec, LiveJournal, YouTube, Web-Google, and Wiki-Talk), comparing our results with the ground truth calculated via brute-force methods. Calculating the ground truth of four-node and five-node motif frequencies for large graphs is computationally intensive. Even for a relatively small graph such as soc-Slashdot08, enumerating and counting all of its three-node CIS takes almost 20 h. To overcome this difficulty, experiments with four-node CIS were performed on medium-size graphs (soc-Epinions1, soc-Slashdot08, soc-Slashdot09, com-DBLP, and com-Amazon), and experiments with five-node CIS were performed on relatively small graphs (ca-GrQc, ca-HepTh, ca-CondMat, and p2p-Gnutella08) where the ground truth could be calculated. We also evaluated the performance of our method in estimating the motif frequencies of signed graphs, namely sign-Epinions, sign-Slashdot08, and sign-Slashdot09.

5.3.1. Values of Three-, Four-, and Five-Node Motif Frequencies

Figure 6 and Table 3 show the real values of the three-, four-, and five-node motif frequencies of the graphs studied in this paper. Table 3 and Figure 6a show the real values of the three-node motif frequencies for the undirected and directed graphs of Flickr, Pokec, LiveJournal, Wiki-Talk, and Web-Google, respectively. Here, undirected graphs are obtained by discarding the edge directions of directed graphs. Among all three-node directed motifs, the seventh motif exhibits the smallest frequency for all these directed graphs. Flickr, Pokec, LiveJournal, Wiki-Talk, and Web-Google have

1.35 \times 10^{10}

,

2.02 \times 10^{9}

,

6.90 \times 10^{9}

,

1.2 \times 10^{10}

, and

7.00 \times 10^{8}

three-node CIS, respectively. Figure 6b reveals the actual values of the three-node signed motif frequencies for the graphs Sign-Epinions, sign-Slashdot08, and sign-Slashdot09. Sign-Epinions, sign-Slashdot08, and sign-Slashdot09 have

1.72 \times 10^{8}

,

6.72 \times 10^{7}

, and

7.25 \times 10^{7}

three-node CIS, respectively. Figure 6c reveals the actual values of four-node undirected motif frequencies for the graphs soc-Epinions1, soc-Slashdot08, soc-Slashdot09, and com-Amazon. Graphs soc-Epinions1, soc-Slashdot08, soc-Slashdot09, and com-Amazon have

2.58 \times 10^{10}

,

2.17 \times 10^{10}

,

2.42 \times 10^{10}

, and

1.78 \times 10^{8}

four-node CIS, respectively. Figure 6d reveals the actual values of five-node undirected motif frequencies for com-Amazon, com-DBLP, p2p-Gnutella08, ca-GrQc, ca-CondMat, and ca-HepTh. Com-Amazon, com-DBLP, p2p-Gnutella08, ca-GrQc, ca-CondMat, and ca-HepTh have

8.50 \times 10^{9}

,

3.34 \times 10^{10}

,

3.92 \times 10^{8}

,

3.64 \times 10^{7}

,

3.32 \times 10^{9}

, and

8.73 \times 10^{7}

five-node CIS, respectively.

5.3.2. Estimating Three-Node Motif Frequencies

Table 4 shows the NRMSEs of our estimates of three-node undirected motif frequencies for $p = 0.01$ and $p = 0.05$ on the graphs Flickr, Pokec, LiveJournal, Wiki-Talk, and Web-Google. The triangle motif ($id = 2$) is the rarer of the two undirected three-node motifs in Table 4; thus, the result for $id = 2$ is better than that of [16]. We can see that the NRMSE for $p = 0.05$ is about ten times smaller than the NRMSE for $p = 0.01$. When $p = 0.01$, the NRMSEs are less than $0.05$ for all five graphs. Figure 7 shows the NRMSEs of our estimates of three-node directed motif frequencies for $p = 0.01$ and $p = 0.05$. Likewise, we observe that the NRMSE at $p = 0.05$ is almost ten times smaller than the NRMSE at $p = 0.01$. Our estimate of $n_{7}^{(3)}$ (i.e., the seventh three-node directed motif frequency) exhibits the largest NRMSE. Except for $n_{7}^{(3)}$, the NRMSEs of the other motif frequency estimates are smaller than 0.01 when $p = 0.05$. Figure 8 shows the NRMSEs of our estimates of three-node signed and undirected motif frequencies for $p = 0.01$, $p = 0.05$, and $p = 0.1$ on the graphs sign-Epinions, sign-Slashdot08, and sign-Slashdot09. For all three signed graphs, the NRMSEs are less than 0.5, 0.1, and 0.06 for $p = 0.01$, $p = 0.05$, and $p = 0.1$, respectively.
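
To make the error figures above easier to interpret, the following Python sketch computes an NRMSE from repeated runs. It assumes the usual definition (root mean squared error over independent estimates, divided by the true value); Section 5.2 gives the authors' exact definition. The function name `nrmse` and the sample numbers are hypothetical and are not taken from the experiments.

```python
import math


def nrmse(estimates, truth):
    """Normalized root mean squared error of repeated estimates.

    Assumed definition: sqrt(mean((estimate - truth)^2)) / truth.
    """
    if truth == 0:
        raise ValueError("NRMSE is undefined when the true value is zero")
    mse = sum((e - truth) ** 2 for e in estimates) / len(estimates)
    return math.sqrt(mse) / truth


if __name__ == "__main__":
    # Hypothetical estimates of a motif frequency whose true value is 100.
    runs = [90.0, 110.0, 100.0, 100.0]
    print(f"NRMSE = {nrmse(runs, 100.0):.4f}")
```

Under this definition, an NRMSE below 0.05 means that the typical estimation error is below 5% of the true motif frequency.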

5.3.3. Estimating Four-Node Motif Frequencies

Figure 9 shows the NRMSEs of $\hat{n}_{i}^{(4)}$, the frequency estimates of four-node undirected motifs, for $p = 0.05$, $p = 0.1$, and $p = 0.2$, respectively, on the graphs soc-Epinions1, soc-Slashdot08, soc-Slashdot09, and com-Amazon. We can see that the NRMSEs of the other motif frequency estimates are smaller than 0.2, 0.1, and 0.07 for $p = 0.05$, $p = 0.1$, and $p = 0.2$, respectively.

5.3.4. Estimating Five-Node Motif Frequencies

Figure 10 shows the NRMSEs of $\hat{n}_{i}^{(5)}$, the estimates of five-node undirected motif frequencies, for $p = 0.1$, $p = 0.2$, and $p = 0.3$, respectively. The experiment was conducted on the graphs com-Amazon, com-DBLP, p2p-Gnutella08, ca-GrQc, ca-CondMat, and ca-HepTh. We can see that most five-node undirected motifs of all graphs except ca-GrQc have NRMSEs smaller than 1 and 0.1 for $p = 0.05$ and $p = 0.1$, respectively. For instance, the largest three graphs, com-Amazon, com-DBLP, and ca-CondMat, exhibit smaller errors than the other graphs, while the smallest graph, ca-GrQc, has a larger NRMSE.

5.4. Comparison to Previous Work

5.4.1. Motif Concentration Estimation

Figure 11a–c show the results of our method for estimating three-, four-, and five-node motif concentrations in comparison with the state-of-the-art methods FANMOD [14], PSRW [12], and Minfer [16] under the same computational time. We set the same edge sampling probability for Mosar and Minfer and observed that these two methods have almost the same runtime. This is because Mosar and Minfer spend much less time reading the graph files than enumerating and classifying the subgraphs. For example, when using Mosar and Minfer to estimate three-node directed motif frequencies with $p = 0.01$, the time needed to pass over the graph file on disk was 3.8%, 6%, 7%, 5%, and 1.9% of the time required to enumerate and classify subgraphs in the sampled graph for Flickr, LiveJournal, Pokec, Web-Google, and Wiki-Talk, respectively. Figure 11 shows that the errors of Mosar are almost one order of magnitude smaller than those of the other methods for estimating concentrations of three- and four-node rare motifs, and two orders of magnitude smaller than those of Minfer for estimating concentrations of five-node rare motifs.

5.4.2. Triangle Counting

We compared the performance of our method for estimating the number of triangles with the state-of-the-art method $gSH_{T}$ [43]. To compare Mosar and $gSH_{T}$ under the same computational cost, we set the parameters of $gSH_{T}$ as $gSH_{T}(p, p)$. As alluded to above, the runtime of Mosar is then almost the same as that of $gSH_{T}(p, p)$, and the probabilities of observing a triangle (sampled as a closed or unclosed wedge) are $p_{T} \approx 3 p^{2} q$ and $p_{T} \approx p^{2}$ for Mosar and $gSH_{T}$, respectively. Let $n_{T}$ be the number of triangles and $\hat{n}_{T}$ be an estimate of $n_{T}$; then, the variance of $\hat{n}_{T}$ is nearly $\frac{n_{T}}{p_{T}}$. Because $p_{T}$ is up to three times larger for Mosar, the variance of Mosar is up to three times smaller than that of $gSH_{T}$. This is consistent with the results shown in Figure 12, where $p = 0.01$: the NRMSE of Mosar is nearly 1.7 times smaller than that of $gSH_{T}$.
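
As a quick arithmetic check of the comparison above, the following Python sketch takes the approximation $Var(\hat{n}_{T}) \approx n_{T} / p_{T}$ at face value and computes the implied ratio of the two NRMSEs. The function names are hypothetical; only the two observation probabilities ($3 p^{2} q$ and $p^{2}$) come from the text.

```python
import math


def observe_prob_mosar(p):
    """Approximate probability that Mosar observes a triangle: 3 p^2 q."""
    q = 1.0 - p
    return 3.0 * p ** 2 * q


def observe_prob_gsh(p):
    """Approximate probability that gSH_T(p, p) observes a triangle: p^2."""
    return p ** 2


def nrmse_ratio(p):
    """NRMSE(gSH_T) / NRMSE(Mosar), assuming Var ~ n_T / p_T for both methods.

    The n_T terms cancel, so the ratio is sqrt(p_T(Mosar) / p_T(gSH_T)) = sqrt(3 q).
    """
    return math.sqrt(observe_prob_mosar(p) / observe_prob_gsh(p))


if __name__ == "__main__":
    print(f"NRMSE ratio at p = 0.01: {nrmse_ratio(0.01):.3f}")
```

The variance ratio is $3q$, which approaches 3 as $p$ becomes small, so the NRMSE ratio approaches $\sqrt{3} \approx 1.73$. This matches the factor of about 1.7 reported for $p = 0.01$.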

6. Conclusions

In this paper, we developed a weighted motif sampling method, Mosar, to accurately estimate the frequencies of both frequent and rare motifs. Mosar first obtains a sampled graph $G^{*}$ and then enumerates all CIS in $G^{*}$. To reduce the estimation errors, Mosar samples rare motifs with higher probabilities. We built a probabilistic model of the CIS in both $G^{*}$ and G, and then used it to derive a motif frequency estimation method with a theoretical guarantee. Finally, we performed experiments on various publicly available datasets to evaluate the performance of Mosar. Our experimental results show that Mosar is over two orders of magnitude more accurate than the current state-of-the-art algorithms. In the future, we plan to extend our method to dynamic graphs with edge insertions and deletions.

Author Contributions

Conceptualization, W.F. and Y.Q.; methodology, W.F. and Y.Q.; software, W.F. and Y.Q.; validation, Y.Q.; formal analysis, W.F. and Y.Q.; investigation, W.F. and Y.Q.; resources, P.W. and J.T.; data curation, Y.Q.; writing—original draft preparation, W.F.; writing—review and editing, W.G., Y.Q., P.W. and J.T.; visualization, W.F.; supervision, W.G., P.W. and J.T.; project administration, W.G.; funding acquisition, W.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China (2021YFB1715600).

Data Availability Statement

Publicly available datasets were analyzed in this study. This data can be found here: [http://snap.stanford.edu/data/index.html], accessed on 6 March 2022.

Conflicts of Interest

The authors declare that they have no conflict of interest.

References

1. Chun, H.; Yeol Ahn, Y.; Kwak, H.; Moon, S.; Ho Eom, Y.; Jeong, H. Comparison of Online Social Relations in Terms of Volume vs. Interaction: A Case Study of Cyworld. In Proceedings of the SIGCOMM: Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, Seattle, WA, USA, 17–22 August 2008; pp. 57–59.
2. Kunegis, J.; Lommatzsch, A.; Bauckhage, C. The slashdot zoo: Mining a social network with negative edges. In Proceedings of the 18th International Conference on World Wide Web, Madrid, Spain, 20–24 April 2009; pp. 741–750.
3. Zhao, J.; Lui, J.C.S.; Towsley, D.; Guan, X.; Zhou, Y. Empirical Analysis of the Evolution of Follower Network: A Case Study on Douban. In Proceedings of the 30th IEEE International Conference on Computer Communications (IEEE INFOCOM 2011), Shanghai, China, 10–15 April 2011; pp. 941–946.
4. Ugander, J.; Backstrom, L.; Kleinberg, J. Subgraph frequencies: Mapping the empirical and extremal geography of large graph collections. In Proceedings of the 22nd International Conference on World Wide Web, New York, NY, USA, 13–17 May 2013; pp. 1307–1318.
5. Shen-Orr, S.S.; Milo, R.; Mangan, S.; Alon, U. Network motifs in the transcriptional regulation network of Escherichia coli. Nat. Genet. 2002, 31, 64–68.
6. Albert, I.; Albert, R. Conserved network motifs allow protein–protein interaction prediction. Bioinformatics 2004, 4863, 3346–3352.
7. Itzkovitz, S.; Levitt, R.; Kashtan, N.; Milo, R.; Itzkovitz, M.; Alon, U. Coarse-Graining and Self-Dissimilarity of Complex Networks. Phys. Rev. E 2005, 71, 016127.
8. Jin, Y.; Sharafuddin, E.; Zhang, Z.L. Unveiling Core Network-wide Communication Patterns through Application Traffic Activity Graph Decomposition. In Proceedings of the Eleventh International Joint Conference on Measurement and Modeling of Computer Systems, Seattle, WA, USA, 15 June 2009; pp. 49–60.
9. Iliofotou, M.; Faloutsos, M.; Mitzenmacher, M. Exploiting Dynamicity in Graph-based Traffic Analysis: Techniques and Applications. In Proceedings of the 5th International Conference on Emerging Networking Experiments and Technologies, Rome, Italy, 1–4 December 2009; pp. 241–252.
10. Leskovec, J.; Lang, K.J.; Dasgupta, A.; Mahoney, M.W. Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters. Internet Math. 2009, 6, 29–123.
11. Richardson, M.; Agrawal, R.; Domingos, P. Trust Management for the Semantic Web. In Proceedings of the 7th International Symposium on Wearable Computers (ISWC 2003), White Plains, NY, USA, 21–23 October 2003; pp. 351–368.
12. Wang, P.; Lui, J.C.; Zhao, J.; Ribeiro, B.; Towsley, D.; Guan, X. Efficiently Estimating Motif Statistics of Large Networks. ACM Trans. Knowl. Discov. Data 2014, 9, 1–27.
13. Kashtan, N.; Itzkovitz, S.; Milo, R.; Alon, U. Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs. Bioinformatics 2004, 20, 1746–1758.
14. Wernicke, S. Efficient Detection of Network Motifs. Trans. Comput. Biol. Bioinform. 2006, 3, 347–359.
15. Bhuiyan, M.A.; Rahman, M.; Rahman, M.; Hasan, M.A. GUISE: Uniform Sampling of Graphlets for Large Graph Analysis. In Proceedings of the IEEE International Conference on Data Mining, Brussels, Belgium, 10–13 December 2012; pp. 91–100.
16. Wang, P.; Qi, Y.; Lui, J.C.; Towsley, D.; Zhao, J.; Tao, J. Inferring higher-order structure statistics of large networks from sampled edges. Trans. Knowl. Data Eng. 2017, 31, 61–74.
17. Shin, K.; Eliassi-Rad, T.; Faloutsos, C. Patterns and anomalies in k-cores of real-world graphs with applications. Knowl. Inf. Syst. 2018, 54, 677–710.
18. Eswaran, D. Mining Anomalies Using Static and Dynamic Graphs. Ph.D. Thesis, Carnegie Mellon University, Pittsburgh, PA, USA, 2020.
19. Yuan, L.; Qin, L.; Zhang, W.; Chang, L.; Yang, J. Index-based densest clique percolation community search in networks. Trans. Knowl. Data Eng. 2017, 30, 922–935.
20. Fang, Y.; Huang, X.; Qin, L.; Zhang, Y.; Zhang, W.; Cheng, R.; Lin, X. A survey of community search over big graphs. VLDB J. 2020, 29, 353–392.
21. Sarndal, C.E.S.; Swensson, B.; Wretman, J. Model Assisted Survey Sampling; Springer: New York, NY, USA, 1992.
22. Mislove, A.; Marcon, M.; Gummadi, K.P.; Druschel, P.; Bhattacharjee, B. Measurement and Analysis of Online Social Networks. In Proceedings of the 2007 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, Kyoto, Japan, 27–31 August 2007; pp. 29–42.
23. Tsourakakis, C.E.; Kang, U.; Miller, G.L.; Faloutsos, C. Doulion: Counting Triangles in Massive Graphs with a Coin. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, France, 28 June–1 July 2009.
24. Pavany, A.; Tirthapuraz, K.T.S.; Wu, K.L. Counting and Sampling Triangles from a Graph Stream. In Proceedings of the 39th International Conference on Very Large Data Bases 2013, (VLDB 2013), Riva del Garda, Italy, 30 August 2013; pp. 1870–1881.
25. Jha, M.; Seshadhri, C.; Pinar, A. A Space Efficient Streaming Algorithm for Triangle Counting Using the Birthday Paradox. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, IL, USA, 11–14 August 2013; pp. 589–597.
26. Lim, Y.; Kang, U. Mascot: Memory-efficient and accurate sampling for counting local triangles in graph streams. In Proceedings of the 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Sydney, Australia, 10–13 August 2015; pp. 685–694.
27. Stefani, L.D.; Epasto, A.; Riondato, M.; Upfal, E. Triest: Counting local and global triangles in fully dynamic streams with fixed memory size. Trans. Knowl. Discov. Data 2017, 11, 1–50.
28. Jung, M.; Lim, Y.; Lee, S.; Kang, U. FURL: Fixed-memory and uncertainty reducing local triangle counting for multigraph streams. Data Min. Knowl. Discov. 2019, 33, 1225–1253.
29. Shin, K.; Oh, S.; Kim, J.; Hooi, B.; Faloutsos, C. Fast, accurate and provable triangle counting in fully dynamic graph streams. Trans. Knowl. Discov. Data 2020, 14, 1–39.
30. Alon, N.; Yuster, R.; Zwick, U. Color-coding. J. ACM 1995, 42, 844–856.
31. Jha, M.; Seshadhri, C.; Pinar, A. Path Sampling: A Fast and Provable Method for Estimating 4-Vertex Subgraph Counts. In Proceedings of the 24th International Conference on World Wide Web, Florence, Italy, 18–22 May 2015; pp. 495–505.
32. Omidi, S.; Schreiber, F.; Masoudi-nejad, A. MODA: An efficient algorithm for network motif discovery in biological networks. Genes Genet Syst. 2009, 84, 385–395.
33. Paredes, P.; Ribeiro, P. Rand-fase: Fast approximate subgraph census. Soc. Netw. Anal. Min. 2015, 5, 17.
34. Bressan, M.; Chierichetti, F.; Kumar, R.; Leucci, S.; Panconesi, A. Motif counting beyond five nodes. Trans. Knowl. Discov. Data 2018, 12, 1–25.
35. Rossi, R.A.; Zhou, R.; Ahmed, N.K. Estimation of graphlet counts in massive networks. Trans. Neural Netw. Learn. Syst. 2018, 30, 44–57.
36. Bressan, M.; Leucci, S.; Panconesi, A. Motivo: Fast Motif Counting via Succinct Color Coding and Adaptive Sampling. Proc. VLDB Endow. 2019, 12, 1651–1663.
37. Rossi, R.A.; Ahmed, N.K.; Koh, E.; Kim, S.; Rao, A.; Abbasi-Yadkori, Y. A Structural Graph Representation Learning Framework. In Proceedings of the 13th International Conference on Web Search and Data Mining, Houston, TX, USA, 3–7 February 2020; pp. 483–491.
38. Rossi, R.A.; Jin, D.; Kim, S.; Ahmed, N.K.; Koutra, D.; Lee, J.B. On Proximity and Structural Role-Based Embeddings in Networks: Misconceptions, Techniques, and Applications. ACM Trans. Knowl. Discov. Data 2020, 14, 1–37.
39. Chen, X.; Li, Y.; Wang, P.; Lui, J.C. A General Framework for Estimating Graphlet Statistics via Random Walk. arXiv 2016, arXiv:1603.07504.
40. Wang, P.; Zhao, J.; Zhang, X.; Li, Z.; Cheng, J.; Lui, J.C.; Towsley, D.; Tao, J.; Guan, X. MOSS-5: A fast method of approximating counts of 5-node graphlets in large graphs. Trans. Knowl. Data Eng. 2017, 30, 73–86.
41. Yang, C.; Lyu, M.; Li, Y.; Zhao, Q.; Xu, Y. Ssrw: A scalable algorithm for estimating graphlet statistics based on random walk. In Proceedings of the 23rd International Conference, DASFAA, Gold Coast, QLD, Australia, 21–24 May 2018; pp. 272–288.
42. Paramonov, K.; Shemetov, D.; Sharpnack, J. Estimating graphlet statistics via lifting. In Proceedings of the 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 587–595.
43. Ahmed, N.; Duffield, N.; Neville, J.; Kompella, R. Graph Sample and Hold: A Framework for Big-Graph Analytics. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 24–27 August 2014.
44. Takac, L.; Zabovsky, M. Data Analysis in Public Social Networks. In Proceedings of the DTI, Omza, Poland, 28–29 May 2012; pp. 1–6.
45. Leskovec, J.; Huttenlocher, D.; Kleinberg, J. Predicting Positive and Negative Links in Online Social Networks. In Proceedings of the 19th International Conference on World Wide Web, Raleigh, NC, USA, 26–30 April 2010; pp. 641–650.
46. Google Programming Contest. 2002. Available online: http://www.google.com/programming-contest/ (accessed on 10 June 2021).
47. Leskovec, J.; Huttenlocher, D.; Kleinberg, J. Signed Networks in Social Media. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, Atlanta, GA, USA, 10–15 April 2010; pp. 1361–1370.
48. Yang, J.; Leskovec, J. Defining and Evaluating Network Communities Based on Ground-Truth. In Proceedings of the PTDM 2012: Practical Theories of Exploratory Data Mining, Brussels, Belgium, 10–13 December 2012; pp. 745–754.
49. Ripeanu, M.; Foster, I.T.; Iamnitchi, A. Mapping the Gnutella Network: Properties of Large-Scale Peer-to-Peer Systems and Implications for System Design. IEEE Internet Comput. J. 2002, 6, 50–57.
50. Leskovec, J.; Kleinberg, J.; Faloutsos, C. Graph Evolution: Densification and Shrinking Diameters. Trans. Knowl. Discov. Data 2007, 1, 2-es.

Figure 1. Motif statistics of the graph G and a sampled graph $G^{*}$ (the numbers are the motif IDs).

Figure 2. An example of G and $C^{(3)}$.

Figure 3. All three-node, four-node, and five-node motifs (the numbers are the motif IDs): (a) three-node undirected motifs; (b) three-node signed and undirected motifs; (c) three-node directed motifs; (d) four-node undirected motifs; (e) five-node undirected motifs.

Figure 4. Overview of Mosar.

Figure 5. Compute $\pi_{1, 2} (s_{1}, s_{2})$ and $\pi_{2, 2} (s_{1}, s_{2})$ for the undirected three-node motifs.

Figure 6. Real values of motif frequencies: (a) three-node directed motifs; (b) three-node signed motifs; (c) four-node undirected motifs; (d) five-node undirected motifs.

Figure 7. NRMSEs of $\hat{n}_{i}^{(3)}$, the estimates of three-node directed motif frequencies, for $p = 0.01$ and $p = 0.05$, respectively. Flickr, Pokec, LiveJournal, Wiki-Talk, and Web-Google have $1.35 \times 10^{10}$, $2.02 \times 10^{9}$, $6.90 \times 10^{9}$, $1.2 \times 10^{10}$, and $7.00 \times 10^{8}$ three-node CIS, respectively: (a) $p = 0.01$; (b) $p = 0.05$.

Figure 8. NRMSEs of $\hat{n}_{i}^{(3)}$, the estimates of three-node signed and undirected motif frequencies, for $p = 0.01$, $p = 0.05$, and $p = 0.1$, respectively. Sign-Epinions, sign-Slashdot08, and sign-Slashdot09 have $1.72 \times 10^{8}$, $6.72 \times 10^{7}$, and $7.25 \times 10^{7}$ three-node CIS, respectively: (a) $p = 0.01$; (b) $p = 0.05$; (c) $p = 0.1$.

Figure 9. NRMSEs of $\hat{n}_{i}^{(4)}$, the frequency estimates of four-node undirected motifs, for $p = 0.05$, $p = 0.1$, and $p = 0.2$, respectively. Soc-Epinions1, soc-Slashdot08, soc-Slashdot09, and com-Amazon have $2.58 \times 10^{10}$, $2.17 \times 10^{10}$, $2.42 \times 10^{10}$, and $1.78 \times 10^{8}$ four-node CIS, respectively: (a) $p = 0.05$; (b) $p = 0.1$; (c) $p = 0.2$.

Figure 10. NRMSEs of $\hat{n}_{i}^{(5)}$, the frequency estimates of five-node undirected motifs, for $p = 0.1$, $p = 0.2$, and $p = 0.3$, respectively. Com-Amazon, com-DBLP, p2p-Gnutella08, ca-GrQc, ca-CondMat, and ca-HepTh have $8.50 \times 10^{9}$, $3.34 \times 10^{10}$, $3.92 \times 10^{8}$, $3.64 \times 10^{7}$, $3.32 \times 10^{9}$, and $8.73 \times 10^{7}$ five-node CIS, respectively: (a) com-Amazon; (b) com-DBLP; (c) p2p-Gnutella08; (d) ca-GrQc; (e) ca-CondMat; (f) ca-HepTh.

Figure 11. Accuracy of our method for estimating motif concentrations in comparison with state-of-the-art methods: (a) (Flickr) $p = 0.01$, three-node directed motif concentrations; (b) (soc-Epinions1) $p = 0.1$, four-node undirected motif concentrations; (c) (com-Amazon) $p = 0.2$, five-node undirected motif concentrations.

Figure 12. Accuracy of our method for estimating the number of triangles in comparison with $gSH_{T}$.

Table 1. Table of notations.

$G$	$G = (V, E, L)$ is the graph of interest
$G^{*}$	$G^{*} = (V^{*}, E^{*}, L^{*})$ is a sampled graph
$V (s), s \in C^{(k)}$	set of nodes in k-node CIS s
$E (s), s \in C^{(k)}$	set of edges in k-node CIS s
$M (s)$	motif class ID of CIS s
$T_{k}$	number of k-node subgraph classes
$M_{i}^{(k)}$	i-th k-node motif
$C^{(k)}$	set of k-node CIS in G
$C^{(k, *)}$	set of k-node CIS in $G^{*}$
$C_{i}^{(k)}$	set of CIS in G isomorphic to $M_{i}^{(k)}$
$n^{(k)} = | C^{(k)} |$	number of k-node CIS in G
$n_{i}^{(k)} = | C_{i}^{(k)} |$	number of CIS in G isomorphic to $M_{i}^{(k)}$
$\omega_{i}^{(k)} = \frac{n_{i}^{(k)}}{n^{(k)}}$	concentration of motif $M_{i}^{(k)}$ in G
$P$	matrix $P = [P_{i, j}]_{1 \leq i, j \leq T_{k}}$
$P_{i, j}$	probability that $s^{*}$ is isomorphic to $M_{i}^{(k)}$ given that s is isomorphic to $M_{j}^{(k)}$
$Z_{j}$	$Z_{j} = \{i : i = 1, \dots, T_{k}, \text{ and } P_{i, j} > 0\}$
$\phi_{i, j}$	number of subgraphs of $M_{j}^{(k)}$ isomorphic to $M_{i}^{(k)}$
$S_{G^{*}, G}^{(k)}$	$S_{G^{*}, G}^{(k)} = \{(s^{*}, s) : s^{*} \in C^{(k, *)}, s \in C^{(k)}, V (s^{*}) = V (s)\}$
$\pi_{i, j} (s_{1}, s_{2})$	$s_{1}$ and $s_{2}$ are two k-node CIS in G isomorphic to motif $M_{j}^{(k)}$. Let $s_{1}^{*}$ and $s_{2}^{*}$ be the induced subgraphs of node sets $V (s_{1})$ and $V (s_{2})$ in $G^{*}$, respectively. $\pi_{i, j} (s_{1}, s_{2})$ is defined as the probability that $s_{1}^{*}$ and $s_{2}^{*}$ are both isomorphic to motif $M_{i}^{(k)}$.
$m_{i, j}^{(k)}$	number of elements $(s^{*}, s) \in S_{G^{*}, G}^{(k)}$, where $s^{*}$ is isomorphic to motif $M_{i}^{(k)}$ and s is isomorphic to motif $M_{j}^{(k)}$
$p$	probability of sampling an edge
$q$	$q = 1 - p$
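
To illustrate how the matrix $P$ in Table 1 can be used, the following Python sketch treats the simplest case: three-node undirected motifs (1 = wedge, 2 = triangle) under plain independent edge sampling with probability $p$. This is a simplifying assumption for illustration and not necessarily Mosar's weighted sampling scheme. Under it, a wedge survives with probability $p^{2}$, a triangle survives with probability $p^{3}$, and a triangle is observed as a wedge with probability $3 p^{2} q$, since each triangle contains three wedges. The function names are hypothetical.

```python
def transition_matrix(p):
    """P[i][j]: probability that a CIS of motif j+1 in G is seen as motif i+1 in G*.

    Assumes independent edge sampling with probability p.
    Index 0 is the wedge (2 edges); index 1 is the triangle (3 edges).
    """
    q = 1.0 - p
    return [
        [p ** 2, 3.0 * p ** 2 * q],  # observed as a wedge
        [0.0, p ** 3],               # observed as a triangle
    ]


def estimate_frequencies(m_wedge, m_triangle, p):
    """Invert E[m] = P n for the upper-triangular 2x2 case.

    m_wedge and m_triangle are the motif counts observed in the sampled graph.
    Returns the estimated numbers of wedges and triangles in G.
    """
    P = transition_matrix(p)
    n_triangle = m_triangle / P[1][1]
    n_wedge = (m_wedge - P[0][1] * n_triangle) / P[0][0]
    return n_wedge, n_triangle


if __name__ == "__main__":
    # Hypothetical expected counts for a graph with 1000 wedges and 100 triangles.
    p = 0.5
    P = transition_matrix(p)
    m_wedge = P[0][0] * 1000 + P[0][1] * 100
    m_triangle = P[1][1] * 100
    print(estimate_frequencies(m_wedge, m_triangle, p))
```

Because the triangle count is divided by $p^{3}$, the estimate of the rarer motif is the most sensitive to sampling noise, which is the difficulty that motivates sampling rare motifs with higher probabilities.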

Table 2. Graph datasets used in our experiments. “Edges” refers to the number of edges in the undirected graph generated by discarding edge labels. “Max-Degree” denotes the maximum number of edges incident to a node in the undirected graph.

Graph	Nodes	Edges	Max-Degree
Flickr [22]	1,715,255	15,555,041	27,236
Pokec [44]	1,632,803	22,301,964	14,854
LiveJournal [22]	5,189,809	48,688,097	15,017
YouTube [22]	1,138,499	2,990,443	28,754
Wiki-Talk [45]	2,394,385	4,659,565	100,029
Web-Google [46]	875,713	4,322,051	6332
soc-Epinions1 [11]	75,897	405,740	3044
soc-Slashdot08 [10]	77,360	469,180	2539
soc-Slashdot09 [10]	82,168	504,230	2552
sign-Epinions [47]	119,130	704,267	3558
sign-Slashdot08 [47]	77,350	416,695	2537
sign-Slashdot09 [47]	82,144	504,230	2552
com-DBLP [48]	317,080	1,049,866	343
com-Amazon [48]	334,863	925,872	549
p2p-Gnutella08 [49]	6301	20,777	97
ca-GrQc [50]	5241	14,484	81
ca-CondMat [50]	23,133	93,439	279
ca-HepTh [50]	9875	25,937	65

Table 3. Real values of three-node undirected motif frequencies (i is the motif ID).

i	Flickr	Pokec	LiveJournal	Wiki-Talk	Web-Google
undirected three-node motifs
1	$1.30 \times 10^{10}$	$1.99 \times 10^{9}$	$6.59 \times 10^{9}$	$1.26 \times 10^{10}$	$6.87 \times 10^{8}$
2	$5.49 \times 10^{8}$	$3.26 \times 10^{7}$	$3.11 \times 10^{8}$	$9.20 \times 10^{6}$	$1.34 \times 10^{7}$
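
As a small worked example, the following Python sketch applies the concentration definition from Table 1, $\omega_{i}^{(k)} = n_{i}^{(k)} / n^{(k)}$, to the frequencies in Table 3. The dictionary `TABLE3` copies the table values (motif 1 first, then motif 2); the function name `three_node_summary` is hypothetical.

```python
# Three-node undirected motif frequencies from Table 3: (motif 1, motif 2).
TABLE3 = {
    "Flickr": (1.30e10, 5.49e8),
    "Pokec": (1.99e9, 3.26e7),
    "LiveJournal": (6.59e9, 3.11e8),
    "Wiki-Talk": (1.26e10, 9.20e6),
    "Web-Google": (6.87e8, 1.34e7),
}


def three_node_summary(n1, n2):
    """Return the total number of three-node CIS and the two concentrations."""
    total = n1 + n2
    return total, n1 / total, n2 / total


if __name__ == "__main__":
    for graph, (n1, n2) in TABLE3.items():
        total, omega1, omega2 = three_node_summary(n1, n2)
        print(f"{graph}: n = {total:.3e}, omega_1 = {omega1:.4f}, omega_2 = {omega2:.4f}")
```

In every graph, motif 2 accounts for only a few percent or less of the three-node CIS, which illustrates why it is the harder motif to estimate from a sample.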

Table 4. NRMSEs of $\hat{n}_{i}^{(3)}$, the frequency estimates of three-node undirected motifs, for $p = 0.01$ and $p = 0.05$, respectively (i is the motif ID).

i	Flickr	Pokec	LiveJournal	Wiki-Talk	Web-Google
$p = 0.01$
1	$8.3 \times 10^{-3}$	$1.3 \times 10^{-2}$	$2.8 \times 10^{-2}$	$2.5 \times 10^{-2}$	$2.6 \times 10^{-2}$
2	$1.3 \times 10^{-2}$	$1.3 \times 10^{-2}$	$1.7 \times 10^{-2}$	$4.4 \times 10^{-2}$	$2.4 \times 10^{-2}$
$p = 0.05$
1	$4.2 \times 10^{-3}$	$5.0 \times 10^{-3}$	$3.4 \times 10^{-3}$	$1.3 \times 10^{-2}$	$1.2 \times 10^{-2}$
2	$4.7 \times 10^{-3}$	$4.1 \times 10^{-3}$	$4.3 \times 10^{-3}$	$1.6 \times 10^{-2}$	$8.1 \times 10^{-3}$

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

