A Parallel Approach for Frequent Subgraph Mining in a Single Large Graph Using Spark

Frequent subgraph mining (FSM) plays an important role in graph mining and has attracted a great deal of attention in areas such as bioinformatics, web data mining and social networks. In this paper, we propose SSIGRAM (Spark based Single Graph Mining), a Spark based parallel frequent subgraph mining algorithm for a single large graph. To address the two computational challenges of FSM, we conduct subgraph extension and support evaluation in parallel across all the distributed cluster worker nodes. In addition, we employ a heuristic search strategy and three novel optimizations in the support evaluation process: load balancing, pre-search pruning and top-down pruning, which significantly improve performance. Extensive experiments with four different real-world datasets demonstrate that the proposed algorithm outperforms the existing GRAMI (Graph Mining) algorithm by an order of magnitude for all datasets and can work with a lower support threshold.


Introduction
Many relationships among objects in a variety of applications such as chemistry, bioinformatics, computer vision, social networks, text retrieval and web analysis can be represented in the form of graphs. Frequent subgraph mining (FSM) is a well-studied problem in the graph mining area which supports many real-world application scenarios such as retail suggestion engines [1], protein-protein interaction networks [2], relationship prediction [3], intrusion detection [4], event prediction [5], text sentiment analysis [6], image classification [7], etc. For example, mining frequent subgraphs from a massive event interaction graph [8] can help to find recurring interaction patterns between people or organizations which may be of interest to social scientists.
There are two broad categories of frequent subgraph mining: (i) graph transaction-based FSM; and (ii) single graph-based FSM [9]. In graph transaction-based FSM, the input data comprise a collection of small-size or medium-size graphs called transactions, i.e., a graph database. In single graph-based FSM, the input data comprise one very large graph. The FSM task is to enumerate all subgraphs with support above the minimum support threshold. Graph transaction-based FSM uses transaction-based support counting while single graph-based FSM adopts occurrence-based counting. Mining frequent subgraphs in a single graph is more complicated and computationally demanding because multiple instances of identical subgraphs may overlap. In this paper, we focus on frequent subgraph mining in a single large graph.
The "bottleneck" for frequent subgraph mining algorithms on a single large graph is the computational complexity incurred by the two core operations: (i) efficient generation of all subgraphs of various sizes; and (ii) subgraph isomorphism evaluation (support evaluation), i.e., determining whether a graph is an exact match of another one. Let $N$ and $n$ be the number of vertexes of input graph $G$ and subgraph $S$, respectively. Typically, the complexity of subgraph generation is $O(2^{N^2})$ and that of support evaluation is $O(N^n)$. Thus, the total complexity of an FSM algorithm is $O(2^{N^2} \cdot N^n)$, which is exponential in terms of problem size. In recent years, numerous algorithms for single graph-based FSM have been proposed. Nevertheless, most of them are sequential algorithms that require much time to mine large datasets, including SiGraM (Single Graph Mining) [10], GERM (Graph Evolution Rule Miner) [11] and GRAMI (Graph Mining) [12]. Meanwhile, researchers have also used parallel and distributed computing techniques to accelerate the computation, in which two parallel computing frameworks are mainly used: MapReduce [13][14][15][16][17][18] and MPI (Message Passing Interface) [19]. The existing MapReduce implementations of parallel FSM algorithms are all based on Hadoop [20] and are designed for graph transactions rather than for a single graph. They often reach IO (Input and Output) bottlenecks because they spend a lot of time moving data in and out of the disk during the iterations of the algorithm. Besides, some of these algorithms cannot support mining via subgraph extension [14,15]; that is to say, users must provide the size of the subgraph as input. In addition, although the MPI based methods, such as DistGraph [19], usually perform well, they are geared towards tightly interconnected HPC (High Performance Computing) machines, which are less accessible to most users. Moreover, for MPI-based algorithms, it is hard to combine multiple machine learning or data mining algorithms into a single pipeline from distributed data storage to feature selection and training, which is common for machine learning. Fault tolerance is also left to the application developer.
In this paper, we propose SSIGRAM, a parallel frequent subgraph mining algorithm for a single large graph using the Apache Spark framework. Spark [21] is an in-memory, MapReduce-like, general-purpose distributed computation platform which provides a high-level interface for users to build applications. Unlike previous MapReduce frameworks such as Hadoop, Spark mainly stores intermediate data in memory, effectively reducing the number of disk input/output operations. In addition, a benefit of the "ML (Machine Learning) Pipelines" design [22] in Spark is that we can not only mine frequent subgraphs efficiently but also combine the mining process seamlessly with other machine learning algorithms such as classification, clustering or recommendation.
To address the two computational challenges of FSM, we conduct the subgraph extension and support evaluation across all the distributed cluster worker nodes. For subgraph extension, our approach generates all subgraphs in parallel through FFSM-Join (Fast Frequent Subgraph Mining) and FFSM-Extend proposed by Huan et al. [23], which is an efficient solution for candidate subgraph enumeration. When computing subgraph support, we adopt the constraint satisfaction problem (CSP) model proposed in [12] as the support evaluation method. The CSP support satisfies the downward closure property (DCP), also known as the anti-monotonic (or Apriori) property, which means that a subgraph g can be frequent only if all of its subgraphs are frequent. As a result, we employ a breadth first search (BFS) strategy in our SSIGRAM algorithm. At each iteration, the generated subgraphs are distributed to every executor across the Spark cluster for solving the CSP. Then, the infrequent subgraphs are removed while the remaining frequent subgraphs are passed to the next iteration for candidate subgraph generation.
In practice, support evaluation is more complicated than subgraph extension and takes most of the time during the mining process. As a result, besides parallel mining, our SSIGRAM algorithm also employs a heuristic search strategy and three novel optimizations in the support counting process: load balancing, pre-search pruning and top-down pruning, which significantly improve the performance. Notably, SSIGRAM can also be applied to directed graphs, weighted subgraph mining or uncertain graph mining with the slight modifications introduced in [24,25]. In summary, our main contributions to frequent subgraph mining in a single large graph are threefold:

• First, we propose SSIGRAM, a novel parallel frequent subgraph mining algorithm in a single large graph using Spark, which is different from the Hadoop MapReduce based and MPI based parallel algorithms. SSIGRAM can also easily be combined with the underlying Hadoop distributed storage and with other machine learning algorithms.

• Second, we conduct subgraph extension and support counting in parallel, targeting the two core steps with high computational complexity in frequent subgraph mining. In addition, we provide a heuristic search strategy and three optimizations for the support computing operation.

• Third, extensive experimental performance evaluations are conducted with four real-world graphs. The proposed SSIGRAM algorithm outperforms the GRAMI method by at least one order of magnitude with the same memory allocated.

The paper is organized as follows. The problem formalization is provided in Section 2. Our SSIGRAM algorithm and its optimizations are presented in Section 3. In Section 4, extensive experiments to evaluate the performance of the proposed algorithm are conducted and analyzed. The work is summarized and conclusions are drawn in Section 5.

Formalism
A graph $G = (V, E)$ is defined to be a set of vertexes (nodes) $V$ which are interconnected by a set of edges (links) $E \subseteq V \times V$ [26]. A labelled graph additionally has a labeling function $L$ that assigns labels to $V$ and $E$. Usually, the graphs used in FSM are assumed to be labelled simple graphs, i.e., unweighted and undirected labelled graphs with no loops and no multiple links between any two distinct nodes [27]. To simplify the presentation, SSIGRAM is illustrated with an undirected graph with a single label for each node and edge. Nevertheless, as mentioned above, SSIGRAM can also be extended to support either directed or weighted graphs. In the following, a number of widely used definitions that are needed later in this paper are introduced.

Definition 1. (Labelled Graph):
A labelled graph can be represented as $G = (V, E, L_V, L_E, \varphi)$, where $V$ is a set of vertexes and $E \subseteq V \times V$ is a set of edges. $L_V$ and $L_E$ are sets of vertex and edge labels, respectively. $\varphi$ is a label function that defines the mappings $V \to L_V$ and $E \to L_E$.

Definition 2. (Subgraph):
For example, Figure 1a illustrates a labelled event interaction graph. Node labels represent the event actor's type (e.g., GOV: government) in CAMEO (Conflict And Mediation Event Observations) codes [28] and edge labels represent the event type [28] between the two actors. Figure 1b,c shows two subgraphs of Figure 1a. Figure 1b (v 1

where S denotes the set of subgraphs in $G$ with support greater than or equal to $\tau$.

For a subgraph $G_1$ and input graph $G$, the straightforward way to compute the support of $G_1$ in $G$ is to count all isomorphisms of $G_1$ in $G$ [29]. Unfortunately, such a method does not satisfy the downward closure property (DCP) since there are cases where a subgraph appears fewer times than its supergraph. For example, in Figure 1a, the single node graph REF (Refugee) appears three times, while its supergraph REF-GOV (with edge label 3) appears four times. Without the DCP, the search space cannot be pruned and an exhaustive search is unavoidable [30]. To address this issue, we employ the minimum image (MNI) based support introduced in [31], which is anti-monotonic.
) is mapped to, which is defined as $\min\{|\Phi(v)| : v \in V_1\}$, where $\Phi(v)$ is the set of unique mappings for each $v \in V_1$, i.e., the distinct nodes of $G$ that $v$ is mapped to across all isomorphisms.

Figure 2 illustrates the four isomorphisms of a subgraph $G_1 \equiv$ A-B-C-A in the input graph. For example, one of the isomorphisms is $\varphi = \{u_1, u_4, u_6, u_8\}$, shown in the second column in Figure 2c. There are four isomorphisms for the subgraph $G_1$ in Figure 2b. The set of unique mappings for the vertex $v_1$ is $\{u_1, u_2, u_8\}$. The numbers of unique mappings over the subgraph vertices $\{v_1, v_2, v_3, v_4\}$ are 3, 3, 2 and 2, respectively. Thus, the MNI support of $G_1$ is $\min\{3, 3, 2, 2\} = 2$.
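The MNI computation in the example above can be sketched in a few lines of Python. This is only an illustration: the function name `mni_support` and the list `ISOMORPHISMS` are hypothetical, and the four mappings are made-up data chosen to reproduce the counts 3, 3, 2 and 2 quoted in the text (only the first mapping is taken from the paper; the others are not the actual contents of Figure 2).

```python
from typing import Dict, List, Set


def mni_support(isomorphisms: List[Dict[str, str]]) -> int:
    """Minimum-image support: the smallest number of distinct graph nodes
    that any single pattern vertex is mapped to over all isomorphisms."""
    if not isomorphisms:
        return 0
    images: Dict[str, Set[str]] = {}
    for phi in isomorphisms:
        for pattern_vertex, graph_node in phi.items():
            images.setdefault(pattern_vertex, set()).add(graph_node)
    return min(len(nodes) for nodes in images.values())


# Hypothetical isomorphisms of a four-vertex pattern (assumed data).
ISOMORPHISMS = [
    {"v1": "u1", "v2": "u4", "v3": "u6", "v4": "u8"},
    {"v1": "u2", "v2": "u3", "v3": "u6", "v4": "u8"},
    {"v1": "u2", "v2": "u5", "v3": "u7", "v4": "u8"},
    {"v1": "u8", "v2": "u4", "v3": "u6", "v4": "u1"},
]

print(mni_support(ISOMORPHISMS))
```

Because the minimum is taken per pattern vertex rather than per occurrence, overlapping occurrences cannot inflate the support, which is what makes the measure anti-monotonic.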

Definition 6. (Adjacency Matrix):
The adjacency matrix of a graph $G_i$ with $n$ vertexes is defined as an $n \times n$ matrix $M$, in which every diagonal entry corresponds to a distinct vertex in $G_i$ and is filled with the label of this vertex, and every off-diagonal entry in the lower triangle corresponds to a pair of vertices in $G_i$ and is filled with the label of the edge between the two vertices, or zero if there is no edge. For an undirected graph, the upper triangle is always a mirror of the lower triangle.

Definition 7. (Maximal Proper Submatrix):
For an $m \times m$ matrix $A$, an $n \times n$ matrix $B$ is the maximal proper submatrix of $A$ iff $B$ is obtained by removing the last nonzero entry from $A$. For example, the last non-zero entry of $M_2$ in Figure 3 is y in the bottom row.

Definition 8. (Canonical Adjacency Matrix):
Let matrix $M$ denote the adjacency matrix of graph $G_i$. Code($M$) represents the code of $M$, which is defined as the sequence formed by concatenating the lower triangular entries of $M$ (including entries on the diagonal) from left to right and top to bottom, respectively. The canonical adjacency matrix (CAM) of graph $G_i$ is the one that produces the maximal code, using lexicographic order [23]. Obviously, a CAM's maximal proper submatrix is also a CAM.
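To make Definitions 6 and 8 concrete, the following Python sketch builds the code of an adjacency matrix for a given vertex ordering and finds the maximal code by brute force over all orderings. It is an assumption-laden illustration, not the FFSM procedure of [23]: the names `code`, `canonical_code`, `G_A` and `G_B` are hypothetical, labels are compared as plain strings with "0" standing for "no edge", and the factorial-time enumeration is only practical for very small graphs.

```python
from itertools import permutations
from typing import Dict, FrozenSet, Hashable, Sequence, Tuple

NodeLabels = Dict[Hashable, str]
EdgeLabels = Dict[FrozenSet[Hashable], str]


def code(order: Sequence[Hashable], node_label: NodeLabels,
         edge_label: EdgeLabels) -> Tuple[str, ...]:
    """Concatenate the lower-triangular entries (diagonal included),
    left to right and top to bottom, for the given vertex ordering."""
    entries = []
    for i, vi in enumerate(order):
        for j in range(i):
            # Off-diagonal entry: edge label, or "0" when there is no edge.
            entries.append(edge_label.get(frozenset((vi, order[j])), "0"))
        entries.append(node_label[vi])  # diagonal entry: vertex label
    return tuple(entries)


def canonical_code(node_label: NodeLabels, edge_label: EdgeLabels) -> Tuple[str, ...]:
    """Maximal code over all vertex orderings (brute force)."""
    return max(code(p, node_label, edge_label) for p in permutations(node_label))


# Two hypothetical encodings of the same labelled path a -x- b -y- a.
G_A = ({1: "a", 2: "b", 3: "a"},
       {frozenset((1, 2)): "x", frozenset((2, 3)): "y"})
G_B = ({10: "a", 20: "a", 30: "b"},
       {frozenset((10, 30)): "y", frozenset((20, 30)): "x"})

print(code([1, 2, 3], *G_A))
print(canonical_code(*G_A) == canonical_code(*G_B))
```

Since the maximal code does not depend on how the vertices happen to be numbered, two isomorphic labelled graphs yield the same canonical code, which is what allows duplicate candidates to be detected during enumeration.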

Definition 9. (Suboptimal CAM):
Given a graph $G$, a suboptimal canonical adjacency matrix (suboptimal CAM) of $G$ is an adjacency matrix $M$ of $G$ such that its maximal proper submatrix $N$ is the CAM of the graph that $N$ represents [23]. A CAM is also a suboptimal CAM. A proper suboptimal CAM is defined as a suboptimal CAM that is not the CAM of the graph it represents.

The SSIGRAM Approach
This section first elaborates upon the framework of the proposed algorithm, before describing the detailed procedure of parallel subgraph extension, parallel support evaluation and three optimization strategies.

Framework
Figure 4 illustrates the proposed framework of our SSIGRAM approach. It contains two major components: parallel subgraph extension and parallel support evaluation. The green notations are the main Spark RDD transformations or actions used in the algorithm pipeline, e.g., map, join, etc. (A resilient distributed dataset (RDD) is the core abstraction of Spark: a fault-tolerant collection of elements that can be operated on in parallel.) We discuss the details of the framework below.

Parallel Subgraph Extension
Our approach employs a breadth first search (BFS) strategy that generates all subgraphs in parallel through FFSM-Join and FFSM-Extend proposed in [23]. As in [23], we organize all the suboptimal CAMs of subgraphs in a graph $G$ into a rooted tree that follows these rules: (i) The root of the tree is an empty matrix. (ii) Each node in the tree is a distinct subgraph of $G$, represented by its suboptimal CAM, which is either a CAM or a proper suboptimal CAM. (iii) For a given non-root node (with suboptimal CAM $M$), its parent is the graph represented by the maximal proper submatrix of $M$. The completeness of the suboptimal CAM tree is guaranteed by Theorem 1. For the formal proof, we refer to the appendix in [23].
Theorem 1. For a graph $G$, let $C_{k-1}$ and $C_k$ be sets of the suboptimal CAMs of all the subgraphs with $(k - 1)$ edges and $k$ edges ($k \geq 2$). Every member of set $C_k$ can be enumerated unambiguously either by joining two members of set $C_{k-1}$ or by extending a member in $C_{k-1}$.
Algorithm 1 shows how the subgraph extension process is conducted in parallel. The extension process is parallelized at the parent subgraph scale (Lines 6-15), which means that each group of subgraphs with the same parent is sent to an executor in the Spark cluster for extension. The FFSM operator is provided by [32], which implements FFSM-Join and FFSM-Extend. After extension, all of the extended results are collected from the cluster to the driver node (Line 16). Because the extension runs on more than one executor at the same time, the indexes of the newly generated subgraphs from different executors may be duplicated. As a result, the subgraph indexes are reassigned at the end (Line 17).
To perform a parallel subgraph extension, Line 11 and Line 13 conduct the joining and extension of CAMs across all Spark executors. The overall complexity is $O(n^2 \cdot m)$, where $n$ is the number of nodes in the subgraph and $m$ is the number of edges. A complete graph with $n$ vertices consists of $n(n - 1)/2$ edges. Thus, the final complexity is $O(m^2)$.

Parallel Support Evaluation
Our SSIGRAM approach employs the CSP model [12] as the subgraph support evaluation strategy. The constraint satisfaction problem (CSP) is an efficient method for finding subgraph isomorphisms (Definition 3), which is illustrated as follows:
1. $\mathcal{X}$ is an ordered set of variables which contains a variable $x_v$ for each node $v \in V_1$.
2. $\mathcal{D}$ is the set of domains for each variable $x_v \in \mathcal{X}$. Each domain is a subset of $V$.
3. Set $\mathcal{C}$ contains the following constraint rules:
For example, the CSP model of a subgraph in Figure 1b under graph Figure 1a is:
Theorem 2 [12] describes the relation between subgraph isomorphism and the CSP model. Intuitively, the CSP model is similar to a template, in which each variable in $\mathcal{X}$ is a slot. A solution is a correct slot fitting which assigns a different node of $G$ to each node of $G_1$, such that the labels of the corresponding nodes and edges match. For instance, a solution to the CSP of the above example is the assignment $(x_{v_1}, x_{v_2}, x_{v_3}) = (u_2, u_4, u_5)$. If there exists a solution that assigns a node $u$ to variable $x_v$, then this assignment is valid.
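The constraint rules of set $\mathcal{C}$ are not legible in this copy. Judging from the description of a solution above (a different node of $G$ for each node of $G_1$, with matching node and edge labels) and from the CSP formulation of [12], they can plausibly be written as follows. This is a hedged reconstruction rather than the authors' exact statement; in particular, the use of $\varphi$ and $\varphi_1$ for the label functions of $G$ and $G_1$, and of $E_1$ for the edge set of $G_1$, is assumed notation.

```latex
\begin{align*}
\text{(a)}\quad & x_v \neq x_{v'}
  && \forall\, v, v' \in V_1,\ v \neq v' \\
\text{(b)}\quad & \varphi(x_v) = \varphi_1(v)
  && \forall\, x_v \in \mathcal{X} \\
\text{(c)}\quad & (x_v, x_{v'}) \in E \ \text{ and } \ \varphi(x_v, x_{v'}) = \varphi_1(v, v')
  && \forall\, (v, v') \in E_1
\end{align*}
```

Rule (a) is the all-different constraint, rule (b) matches node labels, and rule (c) requires every edge of $G_1$ to be mapped to an edge of $G$ with the same label.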
In Theorem 3 [12], $ASS_{valid}(x_v)$ is the total count of valid assignments of variable $x_v$; the MNI support of $G_1$ in $G$ is the minimum of this count over all variables $x_v \in \mathcal{X}$.
According to Theorem 3 [12], we can consider the CSP of subgraph $G_1$ to graph $G$ and check the count of valid assignments of each variable. If there exist $\tau$ or more valid assignments for every variable, in other words, at least $\tau$ nodes in each domain $D_1, \ldots, D_n$ for the corresponding variables $x_{v_1}, \ldots, x_{v_n}$, then subgraph $G_1$ is frequent under the MNI support. Thus, the main idea of the heuristic search strategy is: if any variable domain is left with fewer than $\tau$ candidates during the search process, then the subgraph cannot be frequent.
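The frequency check described above can be sketched as a small backtracking search. This is a minimal single-machine illustration under stated assumptions, not the paper's ISFREQUENT procedure: the names `has_solution`, `is_frequent`, `PATTERN` and `GRAPH` are hypothetical, graphs are stored as plain dictionaries, domains are initialised by node label only, and none of the Spark-side optimizations are included.

```python
from typing import Dict, FrozenSet, Set, Tuple

# A labelled undirected graph: (node -> label, frozenset({a, b}) -> edge label).
Graph = Tuple[Dict[str, str], Dict[FrozenSet[str], str]]


def has_solution(pattern: Graph, graph: Graph,
                 domains: Dict[str, Set[str]],
                 assignment: Dict[str, str]) -> bool:
    """Backtracking search: can the partial assignment be completed?"""
    p_nodes, p_edges = pattern
    _, g_edges = graph
    unassigned = [v for v in p_nodes if v not in assignment]
    if not unassigned:
        return True
    v = unassigned[0]
    for u in sorted(domains[v]):
        if u in assignment.values():
            continue  # each pattern node needs a different graph node
        consistent = all(
            frozenset((v, w)) not in p_edges
            or g_edges.get(frozenset((u, uw))) == p_edges[frozenset((v, w))]
            for w, uw in assignment.items()
        )
        if consistent:
            assignment[v] = u
            found = has_solution(pattern, graph, domains, assignment)
            del assignment[v]
            if found:
                return True
    return False


def is_frequent(pattern: Graph, graph: Graph, tau: int) -> bool:
    """True if every variable has at least tau valid assignments."""
    p_nodes, _ = pattern
    g_nodes, _ = graph
    domains = {v: {u for u, lab in g_nodes.items() if lab == p_nodes[v]}
               for v in p_nodes}
    if any(len(dom) < tau for dom in domains.values()):
        return False
    for v in p_nodes:
        count = 0
        for u in sorted(domains[v]):
            if has_solution(pattern, graph, domains, {v: u}):
                count += 1
                if count == tau:
                    break  # enough valid assignments for this variable
            else:
                domains[v].discard(u)  # invalid assignment: shrink the domain
                if len(domains[v]) < tau:
                    return False  # the subgraph cannot be frequent
        if count < tau:
            return False
    return True


# Hypothetical data: pattern A -x- B, and a graph with two such edges.
PATTERN: Graph = ({"v1": "A", "v2": "B"}, {frozenset(("v1", "v2")): "x"})
GRAPH: Graph = (
    {"u1": "A", "u2": "A", "u5": "A", "u3": "B", "u4": "B", "u6": "B"},
    {frozenset(("u1", "u3")): "x", frozenset(("u2", "u4")): "x"},
)

print(is_frequent(PATTERN, GRAPH, 2), is_frequent(PATTERN, GRAPH, 3))
```

The search for a variable stops as soon as $\tau$ valid assignments are found, and the whole check stops as soon as any domain drops below $\tau$, so the full set of isomorphisms is never enumerated.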

Optimizing Support Evaluation
After subgraph extension, all the newly generated subgraphs are sent to the next procedure for support evaluation. As mentioned in the Introduction, support evaluation is an NP-hard problem which takes $O(N^n)$ time. The complexity is exponential if we exhaustively search all the valid assignments.
Owing to the iterative and incremental design of RDDs and the join transformation in Spark, we save the CSP domain data of every generated subgraph. As shown by the two green join labels in Figure 4, the first join operation combines the newly generated subgraphs and frequent edges to get the extended edges, while the second join combines the newly generated subgraphs and extended edges to generate the search space, i.e., the CSP domain data. In addition, to speed up the support evaluation process, we also propose three optimizations, namely, load balancing, pre-search pruning and top-down pruning, the execution order of which is illustrated at the top of Figure 4.

Load Balancing
The support evaluation process is parallelized at the subgraph scale, which means that each subgraph is sent to an executor in the Spark cluster for support evaluation. The search space is highly dependent on the subgraph's CSP domain size. However, new subgraphs may have very different domain sizes, so some executors finish searching quickly while others are very slow. The final execution time of the whole cluster depends on the executor that finishes last.
To overcome this imbalance, the subgraphs distributed to the various executors should have roughly the same domain sizes. Algorithm 2 illustrates the detailed process. Because the domain of the present subgraph is incrementally generated from the parent subgraph's domain in the last iteration, we save the domain sizes of all subgraphs in each iteration. Then, according to the saved domain sizes of the parent subgraphs, the newly generated subgraphs are re-ordered and partitioned to different executors (Lines 6-9).
Let $n$ be the number of nodes of subgraph $S$, i.e., the domain size. Load balancing can be done in $O(n)$ time.
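One simple way to realise such a re-ordering and partitioning is a greedy "largest first" assignment, sketched below in plain Python. This is a stand-in technique chosen for illustration and should not be read as the content of Algorithm 2: the function `balance`, the dictionary `PARENT_DOMAIN_SIZE` and its values are hypothetical, and in a Spark job the resulting grouping would have to be applied through a partitioning step rather than Python lists.

```python
import heapq
from typing import Dict, List


def balance(subgraph_cost: Dict[str, int], num_partitions: int) -> List[List[str]]:
    """Assign subgraphs to partitions, largest estimated cost first,
    always placing the next subgraph on the currently lightest partition."""
    heap = [(0, i) for i in range(num_partitions)]  # (current load, partition id)
    heapq.heapify(heap)
    partitions: List[List[str]] = [[] for _ in range(num_partitions)]
    ordered = sorted(subgraph_cost.items(), key=lambda kv: (-kv[1], kv[0]))
    for subgraph_id, cost in ordered:
        load, idx = heapq.heappop(heap)
        partitions[idx].append(subgraph_id)
        heapq.heappush(heap, (load + cost, idx))
    return partitions


# Hypothetical saved domain sizes of the parent subgraphs (assumed data).
PARENT_DOMAIN_SIZE = {"s1": 9, "s2": 7, "s3": 6, "s4": 5, "s5": 3}

parts = balance(PARENT_DOMAIN_SIZE, 2)
loads = [sum(PARENT_DOMAIN_SIZE[s] for s in part) for part in parts]
print(parts, loads)
```

Using the parent's domain size as the cost estimate follows the text: the child's domain is derived from the parent's, so the parent's size is available before the child's search space has been built.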

Pre-Search Pruning
Because the input single large graph we consider is an undirected labelled graph, if a node and its neighbors have the same node label and the edges between them also have the same label, this introduces redundant search space. This phenomenon can be common in graphs, especially when the graphs have few node labels and edge labels. For example, in Figure 5, $G$ is the input graph and $G_1$ a subgraph. The CSP search space of $G_1$ is illustrated at the bottom. The assignments in dashed lines are added to the search space when iteratively building the CSP domain data of $G_1$, although they are redundant because they violate the first rule in Definition 10. Here, $u_1$ is assigned twice, to $v_1$ and $v_3$ ($\{u_1, u_2, u_1\}$, $\{u_1, u_3, u_1\}$, $\{u_1, u_4, u_1\}$). If this redundant search space is pruned before calculating the actual support, the search is accelerated considerably.
Let $N$ and $n$ be the number of nodes of input graph $G$ and subgraph $S$. Pre-search pruning searches for redundant space for every node of $S$ among its neighbors in $G$, the complexity of which is $O(n \cdot N^3)$.
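The effect of this pruning can be illustrated on an explicitly enumerated search space. The sketch below is deliberately simplified: the names `search_space`, `prune_repeated` and `DOMAINS` are hypothetical, the domains are made-up data loosely modelled on the $u_1$ example above, and the real procedure works on the stored CSP domain data rather than on an enumerated list of tuples.

```python
from itertools import product
from typing import List, Sequence, Tuple


def search_space(domains: Sequence[Sequence[str]]) -> List[Tuple[str, ...]]:
    """All combinations of one candidate per variable."""
    return list(product(*domains))


def prune_repeated(assignments: List[Tuple[str, ...]]) -> List[Tuple[str, ...]]:
    """Drop assignments that give the same graph node to two variables."""
    return [a for a in assignments if len(set(a)) == len(a)]


# Hypothetical domains for (v1, v2, v3); v1 and v3 share the candidate u1.
DOMAINS = [["u1"], ["u2", "u3", "u4"], ["u1", "u5"]]

full = search_space(DOMAINS)
pruned = prune_repeated(full)
print(len(full), len(pruned))
```

Every tuple removed here would have been rejected during the search anyway, so removing it beforehand changes only the amount of work and not the resulting support.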

Top-Down Pruning
Both the FFSM-Join and the FFSM-Extend operation add one edge at a time to the parent subgraph when generating new subgraphs and constructing the suboptimal CAM tree. Therefore, since a parent subgraph higher up in the suboptimal CAM tree is a substructure of its child subgraphs, assignments that were pruned from the domains of the parent cannot be valid assignments for any of its children [12]. For instance, Figure 6a shows part of a subgraph generation tree, in which $G_1$ is extended to $G_2$ and $G_3$, and $G_3$ is then extended to $G_4$. The marked nodes in different colors represent the assignments pruned from top to bottom. Invalid assignments from parent subgraphs are pruned from all their child subgraphs, which reduces the search space considerably. Take subgraph $G_4$ in Figure 6 as an example: when considering variable $x_{v_1}$, the search space has a size of 3 × 2 × 3 × 2 = 36 combinations, whereas without top-down pruning the search space size is 5 × 3 × 5 × 4 = 300 combinations.
Top-down pruning iterates over every node in $S$ and over every value in each domain. Thus, the overall complexity is $O(m \cdot N)$.
After introducing pre-search pruning and top-down pruning, we give the pseudocode ISFREQUENT for heuristically checking whether a subgraph $s$ is frequent in Algorithm 3. Pre-search pruning is conducted at Lines 4 to 7, and top-down pruning at Lines 11 to 23.
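The arithmetic in this example can be reproduced with a short sketch. The domain contents below are hypothetical (Figure 6 is not available here); only the domain sizes 5, 3, 5, 4 before pruning and 3, 2, 3, 2 after pruning are taken from the text. The names `inherit_pruning`, `search_space_size`, `CHILD_DOMAINS` and `PARENT_PRUNED` are likewise assumptions.

```python
from math import prod
from typing import Dict, Set


def inherit_pruning(child_domains: Dict[str, Set[str]],
                    parent_pruned: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Remove from each child domain the assignments already found
    invalid for the same variable in the parent subgraph."""
    return {v: dom - parent_pruned.get(v, set())
            for v, dom in child_domains.items()}


def search_space_size(domains: Dict[str, Set[str]]) -> int:
    """Number of combinations, i.e., the product of the domain sizes."""
    return prod(len(dom) for dom in domains.values())


# Hypothetical domains of a four-variable subgraph before pruning.
CHILD_DOMAINS = {
    "v1": {"u1", "u2", "u3", "u4", "u5"},
    "v2": {"u6", "u7", "u8"},
    "v3": {"u1", "u2", "u3", "u4", "u5"},
    "v4": {"u9", "u10", "u11", "u12"},
}
# Hypothetical assignments pruned while evaluating the ancestors.
PARENT_PRUNED = {
    "v1": {"u4", "u5"},
    "v2": {"u8"},
    "v3": {"u1", "u2"},
    "v4": {"u11", "u12"},
}

pruned_domains = inherit_pruning(CHILD_DOMAINS, PARENT_PRUNED)
print(search_space_size(CHILD_DOMAINS), search_space_size(pruned_domains))
```

The reduction is multiplicative: removing one or two candidates from each domain shrinks the search space from 300 to 36 combinations in this example.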

The SSIGRAM Algorithm
Finally, we show the detailed pipeline of the SSIGRAM approach in Algorithm 4. SSIGRAM starts by loading the input graph using Spark GraphX (Line 2). Then, all frequent edges are identified at Line 4. In each iteration, parallel subgraph extension is conducted at Line 11 and parallel support evaluation at Lines 14 to 22, in which load balancing, pre-search pruning and top-down pruning are applied. The complexity of SSIGRAM is O(n

Experimental Evaluation
In this section, the performance of the proposed algorithm SSIGRAM is evaluated using four real-world datasets with different sizes from different domains. First, the experimental setup is introduced. The performance of SSIGRAM is then evaluated.

Experimental Setup
Dataset: We experiment on four real graph datasets, whose main characteristics are summarized in Table 1.
Aviation (http://ailab.wsu.edu/subdue/). This dataset contains a list of event records extracted from the aviation safety database. The events are transformed to a graph which consists of 100 K nodes and 133 K edges. The nodes represent event ids and attribute values. Edges represent attribute names and the "near_to" relationship between two events.
GDELT (https://bigquery.cloud.google.com/table/gdelt-bq:full.events?pli=1). This dataset is constructed from part of the raw events exported from the GDELT (Global Data on Events, Location and Tone) dataset. It consists of 1.5 M nodes and 2.8 M edges. Similar to the Aviation dataset, nodes represent events and attribute values (and are labelled with event types and actual attribute values). Edges represent attribute names and the "relate_to" relationship between two events.
Twitter (http://socialcomputing.asu.edu/datasets/Twitter). This graph models the social news of Twitter and consists of 11 M nodes and 85 M edges. Each node represents a Twitter user and each edge represents an interaction between two users. The original graph does not have labels, so we randomly added 40 labels to the nodes, drawn from a Gaussian distribution. In detail, the mean value was set to 50 and the standard deviation to 15. Generated vertex labels less than 0 were all set to 1.
Comparison Method: We compare the proposed SSIGRAM algorithm with GRAMI [12], using the GRAMI_UNDIRECTED_SUBGRAPHS version of GRAMI provided by its authors.
Running Environment: All the experiments with SSIGRAM are conducted on Apache Spark (version 1.6.1) deployed on Apache Hadoop YARN (version 2.7.1). The number of executors is set to 20, each with 6 GB of memory and 1 core running at 2.4 GHz. The driver program is also given 6 GB of memory, and the maximum result size is set to 2 GB. Thus, the total memory allocated from YARN is 128 GB (20 × 6 GB + 6 GB + 2 GB). For the sake of fairness, GRAMI is run on a Linux (Ubuntu 14.04) machine running at 2.4 GHz with 128 GB RAM.
Performance Metrics: The support threshold $\tau$ is the key evaluation parameter as it determines whether a subgraph is frequent. Decreasing $\tau$ results in an exponential increase in the number of possible candidates and thus an exponential increase in the running time of the mining algorithms. For a given time budget, an efficient algorithm should be able to solve mining problems for low $\tau$ values. When $\tau$ is given, efficiency is determined by the running time. In addition, we also report the total number of subgraphs each algorithm identified under each $\tau$ value, as a check on the correctness of the SSIGRAM algorithm.

Experimental Results
Performance: The top part of Figure 7 shows the performance comparison between SSIGRAM and GRAMI on the DBLP, Aviation, GDELT and Twitter datasets. The number of results grows exponentially when the support threshold $\tau$ decreases. Thus, the running time of all algorithms also grows exponentially. Our results indicate that SSIGRAM outperforms GRAMI by an order of magnitude for all datasets. For the bigger dataset GDELT and lower $\tau$ (9600), GRAMI ran out of memory and was not able to produce a result. For the smaller dataset Aviation and bigger $\tau$ (2200, 2300 and 2400), GRAMI is faster because the resource scheduling of Hadoop YARN in SSIGRAM costs 10 to 20 s. In this circumstance, there is no need to use parallel mining algorithms since GRAMI can give results within a few seconds. For the Twitter dataset, SSIGRAM is about five times faster than GRAMI because of the existence of nodes with a big degree. When a subgraph involves such a node, SSIGRAM does not go to the next iteration until the executor finishes calculating the support of this subgraph. The bottom part of Figure 7 illustrates the total numbers of identified frequent subgraphs on each dataset. The identical numbers of frequent subgraphs found by SSIGRAM and GRAMI support the correctness of our SSIGRAM algorithm.
Optimization: Figure 8 demonstrates the effect of the three optimizations discussed above on the DBLP and GDELT datasets. For both datasets, SSIGRAM with all optimizations (denoted by All opts. in Figure 8) performs best. For the DBLP dataset, when $\tau$ > 3500, load balancing is the most effective optimization while as $\tau$ becomes bigger, pre-search pruning becomes the most effective. For the GDELT dataset, pre-search pruning is always the most effective optimization. When no optimization is involved (denoted by No opt. in Figure 8), the algorithm performs worst. The effect of each optimization strategy varies with the input graph and the threshold.
Parallelism: Finally, to evaluate the effect of the number of executors, we fix the support of each dataset and vary the num-executors parameter of the Spark configuration. To keep the memory allocated from YARN the same, we set num-executors to 20, 15, 10, 5 and 1 with executor-memory being 6 GB, 8 GB, 12 GB, 24 GB and 120 GB, respectively. The results shown in Figure 9 lead to three major observations. First, compared with GRAMI's performance shown in Figure 7, the proposed algorithm outperforms the GRAMI algorithm even when only one executor is used. This is because the complexity of SSIGRAM is less than that of GRAMI ($O(n \cdot N^{n-1})$), especially when $n$ is large. Second, the runtime decreases overall as the parallelism increases for each dataset. Third, when num-executors is bigger than 10, the performance improvement is less obvious because the final performance depends on a few time-consuming subgraphs. Most executors then wait until these subgraphs are finished, and adding more executors cannot avoid this.

Conclusions
In this paper, we propose SSIGRAM, a novel parallel frequent subgraph mining algorithm in a single large graph using Spark, which conducts subgraph extension and support counting in parallel, focusing on the two core steps with high computational complexity in frequent subgraph mining. In addition, we provide a heuristic search strategy and three optimizations for the support computing operation. Finally, extensive experimental performance evaluations are conducted with four graph datasets, showing the effectiveness of the proposed SSIGRAM algorithm.
Currently, the parallel execution is conducted at the scale of individual generated subgraphs. When a subgraph involves a node with a very big degree, SSIGRAM does not go to the next iteration until the executor finishes calculating the support of this subgraph. In the future, we plan to design a strategy that decomposes the evaluation task for this type of subgraph across all executors, accelerating the search further.

Discussion
The algorithm proposed in this paper is applied to FSM on a single large certain graph, in which each edge definitely exists. However, uncertain graphs are also common and have practical importance in the real world, e.g., in telecommunication or electrical networks. In the uncertain graph model [33], each edge of a graph is associated with a probability that quantifies the likelihood that this edge exists in the graph. Usually the existence of edges is assumed to be independent.
There are also two types of FSM on uncertain graphs: transaction based and single graph based. Most existing work on FSM on uncertain graphs is developed in transaction settings, i.e., multiple small/medium uncertain graphs. FSM on uncertain graph transactions under expected semantics considers a subgraph frequent if its expected support is greater than the threshold. Representative algorithms include Mining Uncertain Subgraph patterns (MUSE) [34], Weighted MUSE (WMUSE) [35], Uncertain Graph Index (UGRAP) [36] and Mining Uncertain Subgraph patterns under Probabilistic semantics (MUSE-P) [37]. They are proposed under expected semantics or probabilistic semantics. Here, we mainly discuss the measurement of uncertainty and the application of the techniques proposed in this paper to a single uncertain graph.
The measurement of uncertainty is important when considering an uncertain graph. Building on the labelled graph of Definition 1, an uncertain graph is a tuple $G^u = (G, P)$, where $G$ is the backbone labelled graph and $P : E \to (0, 1]$ is a probability function that assigns each edge $e \in E$ an existence probability, denoted by $P(e)$. An uncertain graph $G^u$ implies $2^{|E|}$ possible graphs in total, each of which is a structure that $G^u$ may exist as. Because edges are assumed to exist independently, the existence probability of an implied graph $G_i$ with edge set $E_i \subseteq E$ can be computed by the joint probability distribution:
$$P(G_i) = \prod_{e \in E_i} P(e) \prod_{e \in E \setminus E_i} \bigl(1 - P(e)\bigr).$$
Generally speaking, FSM on a single uncertain graph can also be divided into two phases: subgraph extension and support evaluation. The subgraph extension phase is the same as that for FSM on the backbone graph $G$. Thus, techniques used in this paper, such as the canonical adjacency matrix for representing subgraphs and the parallel extension of subgraphs, can be used for a single uncertain graph.
The biggest difference lies in the support evaluation phase. The support of a subgraph $g$ in an uncertain graph $G^u$ is measured by expected support. A straightforward procedure to compute the expected support is to generate all implied graphs, compute the support of the subgraph in every implied graph (which can be accomplished by the CSP model used in this paper), and aggregate these values into the expected support. Formally, the expected support is the expectation of the support over the implied graphs:
$$eSup(g, G^u) = \sum_{i=1}^{2^{|E|}} P(G_i) \cdot Sup(g, G_i),$$
where $G_i$ is an implied graph of $G^u$. The support measure $Sup$ can be the MNI support introduced in Definition 5, which is computed efficiently. Thus, given an uncertain graph $G^u = (G, P)$ and an expected support threshold $\tau$, FSM on an uncertain graph finds all subgraphs $g$ whose expected support is no less than the threshold, i.e., the set $\{g \mid eSup(g, G^u) \geq \tau \wedge g \subseteq G\}$.
Furthermore, let $P_j(g, G^u)$ denote the aggregated probability that the support of $g$ in an implied graph is no less than $j$:
$$P_j(g, G^u) = \sum_{G_i \in \Delta_j(g)} P(G_i),$$
where $\Delta_j(g) = \{G_i \mid Sup(g, G_i) \geq j\}$. The expected support can be reformulated as:
$$eSup(g, G^u) = \sum_{j=1}^{M_s} P_j(g, G^u),$$
where $M_s$ is the maximum support of $g$ among all implied graphs of $G^u$. For the detailed proof, we refer to [25]. However, it is #P-hard to compute $eSup(g, G^u)$ because of the huge number of implied graphs ($2^{|E|}$), which means that it is rather time consuming to obtain exact frequent subgraph results even using the parallel evaluation on the Spark platform proposed in this paper. Approximate evaluation with an error tolerance that allows some false positive frequent subgraphs is a common approach. Special optimization techniques beyond those in this paper must also be designed. Therefore, the modifications for expected support and some potential optimizations remain problems for further study in order to make the proposed algorithm suitable for mining frequent subgraphs on a single uncertain graph.
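For a very small uncertain graph, the expected support can be checked by brute force. The sketch below enumerates all implied graphs and computes the expected support both directly, as the probability-weighted sum of supports, and through the aggregated probabilities that the support is at least $j$. It rests on strong simplifying assumptions: the pattern is a single edge between an A-labelled and a B-labelled node, every edge of the toy graph is assumed to match that pattern, and the names `mni_single_edge`, `implied_graphs`, `expected_support`, `expected_support_tail` and `EDGES` are hypothetical.

```python
from itertools import product
from typing import Iterator, List, Tuple

# (A-labelled endpoint, B-labelled endpoint, existence probability)
Edge = Tuple[str, str, float]


def mni_single_edge(present: List[Tuple[str, str]]) -> int:
    """MNI support of the one-edge pattern A-B in an implied graph."""
    if not present:
        return 0
    return min(len({a for a, _ in present}), len({b for _, b in present}))


def implied_graphs(edges: List[Edge]) -> Iterator[Tuple[float, List[Tuple[str, str]]]]:
    """Yield (existence probability, present edges) for all 2^|E| implied graphs."""
    for mask in product([False, True], repeat=len(edges)):
        prob = 1.0
        present = []
        for keep, (a, b, p) in zip(mask, edges):
            prob *= p if keep else (1.0 - p)  # independent edge existence
            if keep:
                present.append((a, b))
        yield prob, present


def expected_support(edges: List[Edge]) -> float:
    """Direct definition: sum of P(G_i) * Sup(g, G_i)."""
    return sum(prob * mni_single_edge(present)
               for prob, present in implied_graphs(edges))


def expected_support_tail(edges: List[Edge]) -> float:
    """Reformulation: sum over j of the probability that Sup >= j."""
    worlds = list(implied_graphs(edges))
    max_sup = max(mni_single_edge(present) for _, present in worlds)
    return sum(
        sum(prob for prob, present in worlds if mni_single_edge(present) >= j)
        for j in range(1, max_sup + 1)
    )


# Hypothetical uncertain graph with three edges (assumed data).
EDGES: List[Edge] = [("a1", "b1", 0.5), ("a2", "b2", 0.8), ("a1", "b2", 0.3)]

print(expected_support(EDGES), expected_support_tail(EDGES))
```

The enumeration doubles with every additional edge, which illustrates why exact evaluation is impractical on large graphs and why approximate evaluation is commonly used.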

Figure 1. (a) An event interaction graph; nodes represent the event actors (labelled with their type code) and edges represent the event (labelled with event type code); and (b,c) two subgraphs of (a) (MED: Media, REF: Refugee, GOV: Government).

Figure 4. Framework of the SSIGRAM (Spark based Single Graph Mining) approach. Green notations on the right side are the main Spark resilient distributed dataset (RDD) transformations or actions used during the algorithm pipeline. Abbreviations: CAM (Canonical Adjacency Matrix), HDFS (Hadoop Distributed File System), FFSM (Fast Frequent Subgraph Mining).

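Theorem 2. A solution of the subgraph $G_1$ to graph $G$ CSP corresponds to a subgraph isomorphism of $G_1$ to $G$.
Theorem 3. Let $(\mathcal{X}, \mathcal{D}, \mathcal{C})$ be the subgraph CSP of $G_1$ under graph $G$. The MNI support of $G_1$ in $G$ equals $\min_{x_v \in \mathcal{X}} ASS_{valid}(x_v)$, where $ASS_{valid}(x_v)$ is the total count of valid assignments of variable $x_v$.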

Figure 5. Constraint satisfaction problem (CSP) search space of the subgraph G 1 with the input graph G.

Figure 6. (a) The subgraph generation tree; and (b) the corresponding variables and domains. Marked nodes represent the pruned assignments from top to bottom.

Figure 7. Performance of SSIGRAM and GRAMI on the four different datasets. Abbreviations: DBLP (DataBase systems and Logic Programming), GDELT (Global Data on Events, Location and Tone).

Figure 8. The effect of each optimization. All opts.: All optimizations enabled; Load balancing: Only load balancing enabled; Pre-pruning: Only pre-search pruning enabled; Top-down prune: Only top-down pruning enabled; No Opt.: No optimization strategies involved.

Figure 9. The effect of the number of executors on each dataset.