An Algorithm for Mining Frequent Approximate Subgraphs with Structural and Label Variations in Graph Collections

Jaramillo-Olivares, Daybelis; Carrasco-Ochoa, Jesús Ariel; Martínez-Trinidad, José Francisco

doi:10.3390/app15147880


by Daybelis Jaramillo-Olivares *, Jesús Ariel Carrasco-Ochoa and José Francisco Martínez-Trinidad

Department of Computer Science, Instituto Nacional de Astrofísica, Óptica y Electrónica, Luis Enrique Erro # 1, Tonantzintla, Puebla 72840, Mexico

* Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(14), 7880; https://doi.org/10.3390/app15147880

Submission received: 17 May 2025 / Revised: 12 June 2025 / Accepted: 21 June 2025 / Published: 15 July 2025


Abstract

Using graphs as a data structure is a simple way to represent relationships between objects. This has raised the need for algorithms to process, analyze, and extract meaningful information from graphs. Accordingly, frequent subgraph mining (FSM) algorithms have been reported in the literature to discover interesting, unexpected, and useful patterns in graph databases. Frequent subgraph mining involves discovering subgraphs whose frequency is no less than a user-specified threshold; this can be performed exactly or approximately. Although several algorithms for mining frequent approximate subgraphs exist, mining this type of subgraph in graph collections has scarcely been addressed. Thus, we propose AGCM-SLV, an algorithm for mining frequent approximate subgraphs within a graph collection that allows structural and label variations. Unlike other FSM approaches, our proposed algorithm tracks subgraph occurrences and their structural dissimilarities, allows user-defined partial similarities between node and edge labels, and captures frequent approximate subgraphs (patterns) that would otherwise be overlooked. Experiments on real-world datasets demonstrate that our algorithm identifies more patterns than the most similar state-of-the-art algorithm with a shorter runtime. We also present experiments in which we add white noise to the graph collection at different levels, revealing that over 99% of the patterns extracted without noise are preserved under noisy conditions, which indicates that the proposed algorithm is noise-tolerant.

Keywords:

graph mining; frequent approximate subgraphs; pattern mining; structural variations; label variations

1. Introduction

In several areas, such as government, industry, and science, knowledge derived from data is of great importance. Therefore, it is important to develop methods for extracting useful information and knowledge [1,2,3,4]. This problem is studied in pattern mining, a data mining technique that aims to discover interesting, unexpected, and useful patterns [5].

A trend in pattern mining is the development of algorithms designed to analyze complex data such as graphs [6,7]. Graphs have become popular due to their flexibility and simplicity in representing relationships between objects [8,9] and are widely used in science and technology to describe and visualize data [10]. These characteristics have led to the use of graphs in many applications, such as medicine [11,12], chemistry [13,14], bioinformatics [15,16], social networks [17,18,19], linguistics [20,21], computer vision [22,23,24], and image classification [25,26], among others. Despite their advantages, data mining algorithms for pattern mining cannot be directly applied to graphs due to their complex structure. Consequently, it is necessary to develop algorithms capable of mining patterns in graphs [27]. Mining patterns in graphs refers to the mining of frequent subgraphs in a single graph or graph collection.

Frequent subgraph mining (FSM) involves discovering subgraphs that occur at least as often as a user-specified threshold [28,29,30] in a single graph or graph collection. FSM can be performed exactly or approximately. Frequent exact subgraph mining aims to mine subgraphs with one-to-one correspondence (isomorphism). Several algorithms have been developed to mine exact frequent subgraphs [31,32,33,34,35,36,37,38,39,40]. In contrast, frequent approximate subgraph mining aims to mine subgraphs allowing variations in their structure or labels. Therefore, approximate FSM can deal with distortions found in real-world applications, such as noisy data, data diversity, and uncertain data [41,42,43].

However, despite its usefulness in real-world applications, approximate FSM introduces significant computational challenges. The problem is inherently more complex than frequent exact subgraph mining, as it requires approximate matching techniques that are more computationally expensive.

Although approximate FSM has already addressed structural variations, most algorithms reported in the literature address frequent approximate subgraph mining with structural variations in a single graph [44,45,46]. While the literature mentions that the algorithms for mining subgraphs in a single graph can be extended to mine subgraphs in graph collections [47], it is not an easy task because subgraphs obtained by an algorithm for a single graph differ from those obtained by an algorithm for a graph collection.

The literature describes only one algorithm, REAFUM [48], for mining frequent approximate subgraphs in an undirected graph collection. This algorithm can handle structural and label variations in edges and nodes. Nevertheless, it mines the subgraphs from a small subset of representative graphs in the collection. Therefore, this paper addresses the problem of mining all frequent approximate subgraphs in graph collections, allowing structural and label variations in nodes and edges.

The contribution of this paper is the AGCM-SLV (Approximate Graph Collection Miner with Structural and Label Variations) algorithm (see Section 4), which is designed to mine all frequent approximate subgraphs within an undirected labeled graph collection, allowing variations in both node and edge labels as well as in the structure of the graphs. For handling label variations, the proposed algorithm allows defining partial similarities between the labels of nodes and edges. For handling structural variations, AGCM-SLV stores all the occurrences of a frequent subgraph and its corresponding dissimilarity.

Experiments on real-world datasets reveal that the proposed algorithm mines more frequent approximate subgraphs with structural and label variations in a lower runtime than REAFUM [48], the most similar state-of-the-art algorithm.

This paper is organized as follows: In Section 2, the related work is presented. Section 3 provides some basic concepts. Section 4 introduces the proposed algorithm AGCM-SLV (Approximate Graph Collection Miner with Structural and Label Variations) for mining frequent approximate subgraphs in graph collections that exhibit structural and label variations. In Section 5, the experimental results are shown and discussed. Finally, we provide our conclusions in Section 6.

2. Related Work

In this section, we discuss the previous works most related to ours. Specifically, we focus on frequent approximate subgraph mining algorithms. Frequent approximate subgraph mining has been approached in a single graph or a graph collection.

2.1. Frequent Approximate Subgraphs in a Single Graph

Graph mining algorithms reported in the literature have addressed three problems in a single graph: only structural variations, only label variations, and a combination of label and structural variations.

Algorithms such as SUBDUE [49], SUBDUECL [50], GREW [51], the proposed algorithm in [52], and Ap-FSM [53] address structural variations in a single graph. SUBDUE is based on compressing parts of the graph using node contraction into a single node. Each compression is a frequent segment within the graph, preserving the best ones. SUBDUECL improves the selection of frequent subgraphs when compressing by counting positive and negative examples. GREW is a heuristic algorithm that mines frequent subgraphs from a large graph; it uses edge contraction, compressing frequent subgraphs into new nodes. In [52], an approximate graph matching algorithm that uses node contraction in smaller degree nodes in a single graph is introduced. Another approach is followed by Ap-FSM [53], which mines frequent subgraphs from a single large graph using the distributed framework Pregel. This algorithm selects representative graphs through a sampling technique from the original large graph and focuses on mining frequent subgraphs that are structurally similar to the selected graphs.

In contrast, gApprox [44] introduces variations in node labels through a list of similar labels and also incorporates structural variations in edges. The algorithm gApprox uses a variation of edit distance, and to avoid analyzing the subgraph more than once, it introduces a traversal without redundancies. Label variations in a single large graph are also considered in [54], where a cost matrix between the labels is introduced.

Algorithms such as AGrAP [45], MaxAFG [55], the proposed algorithm in [56], and MFP [46] have explored label and structural variations. AGrAP mines all frequent subgraphs in a single undirected labeled graph, allowing label and structure variations in nodes and edges. It also uses a variation of the graph edit distance and traverses the graph using a depth-first search. MaxAFG is an extension of AGrAP; this algorithm mines only maximal frequent subgraphs. The proposed algorithm in [56] introduces a hybrid structural–semantic graph mining approach to mine frequent subgraphs with structural variations based on graph edit distance and semantic node variations in an uncertain graph based on the probability of the presence of an edge. MFP performs frequent subgraph mining, allowing label and structure variations on a single labeled directed graph.

TIPTAP [57] shows the applicability of the structural variations when dealing with an evolving graph, a graph whose structure changes over time. The variations allowed by TIPTAP are based on the insertion and deletion of edges.

2.2. Frequent Approximate Subgraphs in a Graph Collection

Graph mining algorithms reported in the literature have addressed three problems in a graph collection: only structural variations, only label variations, and a combination of label and structural variations.

RAM [58] and UMFGAMW [59] algorithms address structural variations in a graph collection. RAM allows for structural variations in the edges by allowing their absence. It does not allow label variations, and it does not allow structural variations in the nodes. It does not mine all the frequent subgraphs because it performs a random selection of candidates, eliminating those with similar characteristics to candidates that have already been analyzed. UMFGAMW [59] mines uncertain maximal frequent subgraphs in a graph collection. Due to the nature of uncertain graphs, it mines approximate subgraphs with a similar structure based on the probability of an edge’s presence.

APGM [42] and VEAM [41] have addressed label variations. APGM mines clique frequent subgraphs by representing the graph using a canonical adjacency matrix (CAM). To handle the label variations, it uses a node probability substitution matrix. VEAM mines all frequent subgraphs with label variations in nodes and edges. It uses two probability substitution matrices.

Algorithms such as AMgMiner [60], MgVEAM [61], and CliqueAMgMiner [62] have been reported in the literature for frequent subgraph mining in a collection of multigraphs with label variations. AMgMiner preprocesses the multigraphs to convert them into canonical adjacency matrix codes. It uses two probability substitution matrices to allow for label variations in nodes and edges. MgVEAM, unlike AMgMiner, introduces a method to mine multigraphs without prior preprocessing by creating a canonical adjacency matrix code that takes multigraphs into account. CliqueAMgMiner mines the clique subgraphs of a multigraph collection.

Regarding label and structural variations in a graph collection, the REAFUM [48] framework allows label and structural variations in nodes and edges. It consists of three stages: selection of representative graphs; expansion of the frequent nodes in search of approximate frequent subgraphs, taking into account the variations in label and structure; and, finally, a consensual refinement. To the best of our knowledge, REAFUM is the only algorithm reported in the literature for mining frequent subgraphs in a graph collection that allows label and structural variations in nodes and edges. However, it does not mine all the frequent subgraphs in a collection, since it mines them only from a subset of representative graphs, and it does not allow partial similarity in labels.

Table 1 shows that allowing structural variations for mining frequent approximate subgraphs in graph collections has been scarcely studied. This problem has been explored a bit more for mining frequent approximate subgraphs in a single graph, with a few options reported in the literature. Even though there is an algorithm called REAFUM reported in the literature for mining frequent subgraphs in graph collections considering label and structural variations, it does not mine all frequent approximate subgraphs; it only mines some representative frequent subgraphs. To the best of our knowledge, no algorithm can mine all frequent approximate subgraphs in a graph collection considering both structural and label variations. Therefore, as shown in Table 1, we propose AGCM-SLV.

3. Notation and Preliminaries

This section introduces the notation and definitions used throughout the paper. First, we define an undirected labeled simple graph, since it is the type of graph the graph collection contains.

Definition 1 (Undirected labeled simple graph). An undirected labeled simple graph $G$ is a five-tuple $G = (N, E, L, μ, ϕ)$, where $N \neq \emptyset$ is the set of nodes, $E \subset N \times N$ is the set of undirected edges, $L$ is the set of labels for nodes and edges, $μ : N \to L$ is the node labeling function that assigns to every node in $N$ a label from $L$, and $ϕ : E \to L$ is the edge labeling function that assigns to every edge in $E$ a label from $L$.

From now on, whenever the term graph is used, it will refer to an undirected labeled simple graph.

Next, we define graph collection, i.e., the dataset on which the proposed algorithm performs the mining process.

Definition 2 (Graph collection). A graph collection $D$ is a set of $n$ graphs denoted by $D = \{G_{1}, G_{2}, \dots, G_{n}\}$.

We also define a subgraph since the proposed algorithm mines frequent approximate subgraphs in a graph collection.

Definition 3 (Subgraph). Let $G = (N, E, L, μ, ϕ)$ and $G' = (N', E', L', μ', ϕ')$ be two undirected labeled simple graphs. $G'$ is a subgraph of $G$, denoted by $G' \subseteq G$, if the following conditions hold, where $f$ denotes the mapping that sends each node of $G'$ to its corresponding node in $G$:

$N' \subseteq N$, i.e., the nodes of $G'$ are a subset of the nodes of $G$, and $E' \subseteq E$, i.e., the edges of $G'$ are a subset of the edges of $G$,
$\forall u \in N' : μ'(u) = μ(f(u))$ such that $μ'(u) \in L'$ and $μ(f(u)) \in L$,
$\forall (u, v) \in E' : (f(u), f(v)) \in E$, and $ϕ'(u, v) = ϕ(f(u), f(v))$ such that $ϕ'(u, v) \in L'$ and $ϕ(f(u), f(v)) \in L$.

An important part of frequent approximate subgraph mining is determining the similarity between graphs. Therefore, a graph similarity measure is required to determine how similar the subgraphs can be. The dissimilarity function used is based on the graph edit distance [63], which consists of counting the operations required to transform one graph into another.

Definition 4 (Dissimilarity function). Let $G_{1} = (N_{1}, E_{1}, L_{1}, μ_{1}, ϕ_{1})$ and $G_{2} = (N_{2}, E_{2}, L_{2}, μ_{2}, ϕ_{2})$ be two graphs, $γ$ be a node label similarity, and $δ$ be an edge label similarity. The dissimilarity function used in the proposed algorithm is the sum of all the costs required to transform a graph $G_{1}$ into a graph $G_{2}$. Therefore, the dissimilarity of transforming $G_{1}$ into $G_{2}$ is given by:

$$\mathrm{dissimilarity}(G_{1}, G_{2}) = \sum_{i = 1}^{k} CN_{i} + \sum_{j = 1}^{m} CE_{j} \quad (1)$$

where $CN_{i}$ is the cost of the $i$-th of the $k$ node modifications, and $CE_{j}$ is the cost of the $j$-th of the $m$ edge modifications. (Note: Depending on the specific structure of the graphs, the number of node modifications $k$ or the number of edge modifications $m$ can be zero.)

The possible modification costs are:

Substitution of labels on a node or an edge.

The node label substitution cost is $1 - γ (ν_{1}, ν_{2})$ , where $γ (ν_{1}, ν_{2}) \in [0, 1]$ represents the similarity between the label of the node $ν_{1}$ and the label of the node $ν_{2}$ .
The edge label substitution cost is $1 - δ (ϵ_{1}, ϵ_{2})$ , where $δ (ϵ_{1}, ϵ_{2}) \in [0, 1]$ represents the similarity between the label of the edge $ϵ_{1}$ and the label of the edge $ϵ_{2}$ .

Insertion of a node or an edge.

The node insertion cost of inserting a $ν_{1}$ -labeled node is 1.
The edge insertion cost of inserting an $ϵ_{1}$ -labeled edge is 1.

Deletion of a node or an edge.

The node deletion cost of deleting a $ν_{2}$ -labeled node is 1.
The edge deletion cost of deleting an $ϵ_{2}$ -labeled edge is 1.
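
To make the cost model concrete, the following is a minimal sketch of Equation (1), assuming the edit operations that transform one graph into the other are already known. The `Edit` class and its field names are hypothetical and introduced only for illustration; finding the cheapest set of edits is a separate problem that this sketch does not address.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edit:
    """One edit operation on a node or an edge (hypothetical representation)."""
    kind: str                # "substitute", "insert", or "delete"
    target: str              # "node" or "edge"
    similarity: float = 0.0  # label similarity in [0, 1]; used only by substitutions


def edit_cost(edit: Edit) -> float:
    """Cost of a single edit: 1 - similarity for substitutions, 1 otherwise."""
    if edit.kind == "substitute":
        if not 0.0 <= edit.similarity <= 1.0:
            raise ValueError("similarity must lie in [0, 1]")
        return 1.0 - edit.similarity
    if edit.kind in ("insert", "delete"):
        return 1.0
    raise ValueError(f"unknown edit kind: {edit.kind}")


def dissimilarity(edits: list[Edit]) -> float:
    """Sum the node costs (CN_i) and the edge costs (CE_j), as in Equation (1)."""
    node_cost = sum(edit_cost(e) for e in edits if e.target == "node")
    edge_cost = sum(edit_cost(e) for e in edits if e.target == "edge")
    return node_cost + edge_cost


edits = [
    Edit("substitute", "node", similarity=0.75),  # cost 0.25
    Edit("insert", "edge"),                       # cost 1
    Edit("delete", "node"),                       # cost 1
]
total = dissimilarity(edits)  # 0.25 + 1 + 1
```

With a dissimilarity threshold of, say, 2, this particular transformation would exceed the threshold, since the three edits sum to 2.25.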

Since the proposed algorithm performs frequent approximate subgraph mining, it must be defined when a subgraph is considered frequent. To establish this, we introduce the following definitions.

First, we consider whether a subgraph allowing label and structural variations appears in a graph collection. If it appears, each appearance is viewed as an occurrence of the subgraph:

Definition 5 (Occurrence). Let $D = \{G_{1}, G_{2}, \dots, G_{n}\}$ be a graph collection of $n$ undirected labeled simple graphs, $G'$ be a subgraph, $β$ be a dissimilarity threshold, and $\mathrm{dissimilarity}(G_{i}, G_{j})$ be the graph dissimilarity function described in Definition 4. Then $G''$, a subgraph of $G_{i} \in D$, is an occurrence of $G'$ if $\mathrm{dissimilarity}(G', G'') \leq β$.

Next, we define the support of a subgraph, which counts the number of graphs in the collection in which the subgraph has at least one occurrence:

Definition 6 (Support). Let $D = \{G_{1}, G_{2}, \dots, G_{n}\}$ be a graph collection of $n$ undirected labeled simple graphs and $β$ a dissimilarity threshold. The support of a subgraph $G'$ in $D$, denoted as $\mathrm{Support}(G', D)$, is defined as:

$$\mathrm{Support}(G', D) = \frac{\sum_{j = 1}^{n} f(G', G_{j})}{|D|} \quad (2)$$

where

$$f(G', G_{j}) = \begin{cases} 1 & \text{if } \exists\, G'' \text{ subgraph of } G_{j} \text{ such that } G'' \text{ is an occurrence of } G' \text{ in } G_{j} \\ 0 & \text{otherwise} \end{cases} \quad (3)$$
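
As an illustration of Equations (2) and (3), the sketch below computes the support from a list of occurrences that have already passed the dissimilarity check of Definition 5. Representing an occurrence as a `(graph_id, payload)` pair is an assumption made only for this example.

```python
def support(occurrences, n_graphs):
    """Fraction of graphs in the collection holding at least one occurrence.

    occurrences: iterable of (graph_id, payload) pairs (hypothetical layout).
    n_graphs:    |D|, the number of graphs in the collection.
    """
    if n_graphs <= 0:
        raise ValueError("the collection must contain at least one graph")
    # Several occurrences in the same graph count once, as f(G', G_j) is 0 or 1.
    graphs_with_occurrence = {graph_id for graph_id, _ in occurrences}
    return len(graphs_with_occurrence) / n_graphs


# Two occurrences in graph 0 and one in graph 2, in a collection of 4 graphs.
value = support([(0, "o1"), (0, "o2"), (2, "o3")], n_graphs=4)
```

Here the support is 2/4, because only two distinct graphs contain an occurrence even though three occurrences were found.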

Before defining a frequent subgraph, we first define a frequent node, which is the simplest form of a frequent subgraph, consisting of a single node:

Definition 7 (Frequent node). Let $D = \{G_{1}, G_{2}, \dots, G_{n}\}$ be a graph collection of $n$ undirected labeled simple graphs and $α$ be a support threshold. A node $ν$ is said to be frequent in the graph collection $D$ if the number of graphs in which it appears, divided by the number of graphs in the collection, is greater than or equal to the support threshold $α$.

Finally, we define frequent subgraph:

Definition 8 (Frequent subgraph). Let $D = \{G_{1}, G_{2}, \dots, G_{n}\}$ be a graph collection of $n$ undirected labeled simple graphs, $α$ a support threshold, and $β$ a dissimilarity threshold. A subgraph $G'$ is considered frequent within the graph collection $D$ if there exists at least one frequent node $ν_{f}$, according to $α$ (see Definition 7), such that for each occurrence $G''$ of $G'$, according to $β$ (see Definition 5), $ν_{f} \in N_{G''}$ (the node set of $G''$), and $\mathrm{Support}(G', D) \geq α$.

Another important aspect to consider is to reduce the search space and speed up the mining process. For this purpose, we use the anti-monotonicity property, which is defined as follows:

Definition 9 (Anti-monotonicity property). Let $D$ be a graph collection of $n$ undirected labeled simple graphs, and $G'$ and $G''$ be two graphs in the graph collection $D$. If $G''$ is a subgraph of $G'$, then $\mathrm{Support}(G'', D) \geq \mathrm{Support}(G', D)$.

The anti-monotonicity property states that if a subgraph is not frequent, then none of its supergraphs is frequent; thus, they can be pruned without affecting the search.
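
Spelled out with the support threshold $α$, the pruning rule is the contrapositive of Definition 9. A short derivation, using only the symbols already defined above:

$$G'' \subseteq G' \;\Rightarrow\; \mathrm{Support}(G'', D) \geq \mathrm{Support}(G', D)$$

$$\mathrm{Support}(G'', D) < α \;\Rightarrow\; \mathrm{Support}(G', D) \leq \mathrm{Support}(G'', D) < α$$

Hence, no supergraph $G'$ of an infrequent subgraph $G''$ can reach the threshold.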

4. Proposed Algorithm

This section presents the proposed algorithm for mining all frequent approximate subgraphs with structural and label variations in nodes and edges in a graph collection, named AGCM-SLV (Approximate Graph Collection Miner with Structural and Label Variations).

AGCM-SLV works as follows: First, the algorithm mines all frequent nodes to obtain a set of frequent subgraphs containing a single node. After that, a subgraph expansion process begins, where frequent subgraphs are expanded into larger candidate subgraphs by connecting neighboring nodes and adjacent edges according to a depth-first traversal strategy. Each expansion (candidate subgraph) is verified to determine whether it satisfies both the frequency and dissimilarity thresholds. If it does not, the candidate subgraph cannot be expanded further and is pruned (discarded) due to the anti-monotonicity property. The expansion process continues until there are no more frequent subgraphs to expand.

Each mined frequent subgraph $P$ is associated with its occurrences in the graph collection, which determine the subgraph’s frequency. These occurrences can have structural and label variations in both nodes and edges. Label variations result from substituting node or edge labels, while structural variations result from additional or missing nodes and edges with respect to the frequent subgraph. Because of these variations in labels and structure, each occurrence has a dissimilarity value with respect to the frequent approximate subgraph it represents. This dissimilarity cannot exceed a predefined dissimilarity threshold ($β$). The pseudocode for the proposed algorithm is shown in Algorithm 1.

Algorithm 1 Proposed algorithm: AGCM-SLV

Require: D: graph collection, $α$: support threshold, $β$: dissimilarity threshold, $γ$: node dictionary (optional), $δ$: edge dictionary (optional)
Ensure: FS: all unique FREQUENT_SUBGRAPHS

$P \leftarrow$ obtain all FREQUENT_NODES in the graph collection D
for every subgraph $G'$ in P do
    candidate_subgraphs_list ← EXPAND_FREQUENT_SUBGRAPHS($G'$)
    for every candidate_subgraph in candidate_subgraphs_list do
        for every occurrence o of $G'$ do
            expanded_occurrence ← EXPAND_OCCURRENCE(o)
            if dissimilarity(expanded_occurrence, candidate_subgraph) ≤ $β$ then
                expanded_occurrence is an occurrence of candidate_subgraph
        if Support(candidate_subgraph, D) ≥ $α$ then
            if candidate_subgraph has not been mined then
                insert candidate_subgraph into P and into FS
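
The control flow of Algorithm 1 can be sketched as a small driver that leaves every graph-specific step to caller-supplied functions. This is only an illustrative skeleton, not the authors' implementation: the function `mine`, its parameters, and the string-based toy stand-ins are assumptions made so that the loop can be run end to end, and the sketch returns only the expanded patterns, not the initial frequent nodes.

```python
def mine(frequent_nodes, expand_subgraph, expand_occurrence, dissimilarity,
         n_graphs, alpha, beta, canonical=lambda pattern: pattern):
    """Grow patterns from frequent nodes, keeping those that stay frequent.

    frequent_nodes maps each single-node pattern to its occurrences; every
    occurrence is a tuple whose first element is the id of its host graph.
    All callbacks are hypothetical stand-ins supplied by the caller.
    """
    stack = list(frequent_nodes.items())         # patterns still to expand
    seen = {canonical(p) for p in frequent_nodes}
    frequent = []
    while stack:
        pattern, occurrences = stack.pop()       # LIFO order, depth-first flavour
        for candidate in expand_subgraph(pattern):
            kept = [new_occ
                    for occ in occurrences
                    for new_occ in expand_occurrence(occ, candidate)
                    if dissimilarity(new_occ, candidate) <= beta]
            support = len({occ[0] for occ in kept}) / n_graphs
            key = canonical(candidate)
            if support >= alpha and key not in seen:
                seen.add(key)
                frequent.append(candidate)
                stack.append((candidate, kept))  # infrequent candidates are dropped


    return frequent


# Toy stand-ins: "graphs" are strings, patterns grow one letter to the right,
# and dissimilarity is the number of mismatching letters.
graphs = ["abc", "abd", "xbc"]
nodes = {
    "a": [(0, 0, 1), (1, 0, 1)],
    "b": [(0, 1, 2), (1, 1, 2), (2, 1, 2)],
    "c": [(0, 2, 3), (2, 2, 3)],
}


def grow(pattern):
    return [pattern + letter for letter in "abcdx"]


def extend(occ, candidate):
    g, start, end = occ
    return [(g, start, end + 1)] if end < len(graphs[g]) else []


def mismatches(occ, candidate):
    g, start, end = occ
    return sum(x != y for x, y in zip(graphs[g][start:end], candidate))


result = mine(nodes, grow, extend, mismatches,
              n_graphs=len(graphs), alpha=0.6, beta=0)
```

In this toy run with `beta=0`, only the patterns `"ab"` and `"bc"` reach the support threshold; the `canonical` hook is where a real implementation would plug in a duplicate check for isomorphic subgraphs.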

4.1. Frequent Nodes

In the first stage, all frequent subgraphs formed by a single node are mined by searching through all nodes in the entire graph collection to find those that fulfill the predefined support threshold. These frequent subgraphs, consisting of a single node, are known as frequent nodes (see Definition 7) and are the first set of frequent candidate subgraphs.
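
A minimal sketch of this first stage follows, under the assumption that a node is identified by its label and that only exact label matches are counted. The data layout (one `node_id -> label` dictionary per graph) and the function name are hypothetical.

```python
from collections import Counter


def frequent_node_labels(collection, alpha):
    """Return the node labels that appear in at least a fraction alpha of the graphs.

    collection: list of graphs, each given as a dict node_id -> label (assumed layout).
    """
    if not collection:
        return set()
    counts = Counter()
    for node_labels in collection:
        # A label is counted once per graph, however many nodes carry it.
        counts.update(set(node_labels.values()))
    n_graphs = len(collection)
    return {label for label, count in counts.items() if count / n_graphs >= alpha}


collection = [
    {0: "A", 1: "B"},
    {0: "A", 1: "A", 2: "C"},
    {0: "B"},
]
labels = frequent_node_labels(collection, alpha=0.6)
```

Labels `A` and `B` each appear in two of the three graphs and are kept, while `C` appears in only one graph and is discarded.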

4.2. Expand Frequent Subgraph

The next step in the proposed algorithm is to expand the current mined frequent subgraphs to obtain new, larger candidate subgraphs. This is achieved by joining each frequent subgraph with one of its neighbor nodes through an edge or by adding an edge in case there is a cycle. This expansion is carried out for every neighbor node or every adjacent edge of the frequent subgraph. Figure 1 shows how a frequent subgraph can be expanded.
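
The two kinds of expansion (adding a neighbor node with its connecting edge, or adding only an edge that closes a cycle) can be sketched as follows. The sketch enumerates one-edge expansions of a connected subgraph inside a single host graph; the adjacency-dictionary layout and the function name are assumptions for illustration, and labels are carried along but not compared.

```python
def one_edge_expansions(adjacency, nodes, edges):
    """List every way to grow a connected subgraph by exactly one edge.

    adjacency: dict node -> {neighbor: edge_label} of the host graph.
    nodes:     node ids already in the subgraph.
    edges:     edges already in the subgraph, each a frozenset of two node ids.
    """
    nodes, edges = frozenset(nodes), frozenset(edges)
    seen = set()
    expansions = []
    for u in nodes:
        for v in adjacency[u]:
            edge = frozenset((u, v))
            if edge in edges or edge in seen:
                continue
            seen.add(edge)
            # v outside the subgraph: a neighbor node joins through this edge.
            # v inside the subgraph: only the edge is added, closing a cycle.
            expansions.append((nodes | {v}, edges | {edge}))
    return expansions


# Triangle 1-2-3 with a tail node 4 attached to node 3.
adjacency = {
    1: {2: "a", 3: "b"},
    2: {1: "a", 3: "c"},
    3: {1: "b", 2: "c", 4: "d"},
    4: {3: "d"},
}
path_nodes = {1, 2, 3}
path_edges = {frozenset((1, 2)), frozenset((1, 3))}
expansions = one_edge_expansions(adjacency, path_nodes, path_edges)
```

For the path 2-1-3 this yields two candidates: one that adds the edge between nodes 2 and 3 (a cycle, no new node) and one that adds node 4 through its edge to node 3.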

4.3. Expand Occurrences

After the candidate subgraphs are generated, the process of identifying their corresponding occurrences within the graph collection begins. To do this, each occurrence of the frequent subgraph that originated the candidate subgraph is expanded.

To explain the expansion process in detail, the following considerations must be established:

If the node dictionary $(γ)$ or edge dictionary $(δ)$ , or both, are included, they must be considered when evaluating similarities between labels. Figure 2 illustrates the structure of these dictionaries, each composed of three elements (origin label, similar label, and similarity percentage). In the node dictionary, the node label ‘A’ has a 90% similarity to label ‘B’, while in the edge dictionary, the edge label ‘A’ has a 100% similarity to the edge label ‘B’. It is important to note that the dictionaries’ similarities do not apply in both directions; for example, in the node dictionary, label ‘A’ has a 90% similarity to label ‘B’, but this does not imply that label ‘B’ has a 90% similarity to label ‘A’.
The new node, $ν_{n e w}$ , is the last node added to a frequent subgraph to form a candidate subgraph.
A node in a subgraph may have a corresponding node (not necessarily with the same label) in the occurrence, which we will call a mapped node. Since the nodes differ, the maximum allowable dissimilarity must be controlled.
Not all nodes in the occurrence need to be mapped to nodes in the subgraph. Nodes that are not mapped are called extra nodes.
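
The directional nature of the dictionaries can be illustrated with a small lookup. The sketch assumes that similarities are stored as fractions keyed by `(origin, similar)` label pairs, that identical labels have similarity 1, and that pairs absent from the dictionary have similarity 0; these conventions and the function name are assumptions, not details confirmed by the text.

```python
def label_similarity(dictionary, origin, target):
    """Similarity of replacing `origin` by `target`; the lookup is directional."""
    if origin == target:
        return 1.0                                 # assumed: identical labels match fully
    return dictionary.get((origin, target), 0.0)   # assumed: unlisted pairs are dissimilar


# Values taken from the example above: A -> B is 90% for nodes, 100% for edges.
node_dictionary = {("A", "B"): 0.9}
edge_dictionary = {("A", "B"): 1.0}

forward = label_similarity(node_dictionary, "A", "B")   # listed direction
backward = label_similarity(node_dictionary, "B", "A")  # reverse direction is not implied
```

The substitution cost of Definition 4 is then one minus the returned similarity, so the listed direction costs about 0.1 while the unlisted reverse direction costs 1.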

Before expanding the occurrence $o_{curr}$, it must be verified whether, without modifying it, it fulfills the dissimilarity threshold. Figure 3b shows an example where the occurrence does not expand but is considered a candidate occurrence.

To expand each occurrence $o_{curr}$ of the frequent subgraph that originated the candidate subgraph, the first step is to search for all possible nodes for expansion within the graph containing $o_{curr}$. Each potential expansion node $ν_{exp}$ must be similar to $ν_{new}$. The extra nodes within $o_{curr}$ are also considered in the search. If the $β$-occurrence $o_{curr}$ and the candidate subgraph belong to the same graph, the nodes of the candidate subgraph are not considered in the search. From this, there are three possibilities:

If $ν_{exp}$ is a neighbor of the occurrence $o_{curr}$, a new candidate occurrence $o_{new}$ is created by joining the occurrence $o_{curr}$ with the node $ν_{exp}$, and $ν_{exp}$ is mapped to $ν_{new}$. Figure 3c shows an example of this case, where $ν_{exp} = C$ is a neighbor of the occurrence $B \stackrel{a}{-} A$.
If $ν_{exp}$ is already in the occurrence $o_{curr}$ as an extra node, a new candidate occurrence $o_{new}$ is created, which stays the same as $o_{curr}$, but in it, $ν_{exp}$ is mapped to $ν_{new}$. Figure 3d shows an example where $ν_{exp} = C$ is already in the occurrence but was not mapped; $C$ was an extra node of $o_{curr}$.
If, on the other hand, $ν_{exp}$ is in the graph of $o_{curr}$ but a path must be created to connect $ν_{exp}$ with $o_{curr}$, the shortest paths connecting $ν_{exp}$ to each node of $o_{curr}$ are searched for. If one path includes another, only the shortest path is considered. For each different path found, a new candidate occurrence $o_{new}$ is created by joining the nodes of the occurrence $o_{curr}$ with the nodes and edges of the path. The intermediate nodes of the path are marked as extra nodes, and $ν_{exp}$ is mapped to $ν_{new}$. Figure 3e shows an example where a path is required to connect $ν_{exp} = C$. The path $\stackrel{d}{-} Y \stackrel{b}{-}$ is formed to reach $C$.
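
The third case can be sketched with a breadth-first search from $ν_{exp}$ toward the occurrence. This is a simplification: it returns a single shortest path to the nearest occurrence node, whereas the text describes searching the shortest paths to each node of the occurrence. The function name, the adjacency layout, and the toy graph (loosely modeled on the example above) are assumptions.

```python
from collections import deque


def shortest_path_to_occurrence(adjacency, occurrence_nodes, v_exp):
    """Breadth-first search from v_exp until any node of the occurrence is reached.

    Returns the path [v_exp, ..., occurrence_node], or None if no path exists.
    """
    parents = {v_exp: None}
    queue = deque([v_exp])
    while queue:
        node = queue.popleft()
        if node in occurrence_nodes:
            path = []
            while node is not None:      # walk back to v_exp through the parents
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in adjacency[node]:
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


# Toy host graph: the occurrence is B - A, and C is reachable only through Y.
adjacency = {"B": ["A"], "A": ["B", "Y"], "Y": ["A", "C"], "C": ["Y"]}
path = shortest_path_to_occurrence(adjacency, {"A", "B"}, "C")
extra_nodes = path[1:-1]  # intermediate nodes, to be marked as extra nodes
```

In the toy graph the path runs from `C` through `Y` to `A`, so `Y` is the only intermediate node and would be marked as an extra node of the new candidate occurrence.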

Each occurrence $o_{curr}$ can also be expanded through its neighboring edges. If $ϵ_{new}$ is a neighbor of the occurrence $o_{curr}$, a new candidate occurrence $o_{new}$ is created by joining $o_{curr}$ with the edge $ϵ_{new}$. Figure 3f shows an example where the occurrence $o_{curr}$ is expanded by adding the edge $ϵ_{new}$, labeled $d$, and as a result, a cycle is formed.

4.4. Verifying the Support of a Candidate Subgraph

To verify if the candidate subgraph fulfills the threshold and becomes a frequent subgraph, the expansion process ends by verifying if all the new occurrences created fulfill the dissimilarity threshold to support the candidate subgraph. Thus, using Definition 4, the dissimilarity between each new occurrence and the candidate subgraph is computed. This function sums the nodes’ and edges’ modification cost (insertion, deletion, substitution). The substitution of node and edge labels is defined by dictionaries, which specify the substitution cost of the labels. Occurrences that fulfill the dissimilarity threshold are considered occurrences of the candidate subgraph and maintained as candidate subgraph occurrences. Occurrences that do not fulfill the dissimilarity threshold are eliminated.

The support of a candidate subgraph is calculated using its occurrences to determine whether the candidate subgraph is a frequent subgraph (see Definition 6). If the support is greater than or equal to the support threshold, the candidate subgraph becomes a frequent subgraph. Otherwise, the candidate subgraph is pruned (discarded), according to the anti-monotonicity property. This property implies that if an expansion of a candidate subgraph does not fulfill the support threshold, then no further expansions will fulfill this threshold. Thus, no more frequent subgraphs can be obtained by expanding this candidate.

Since the processes followed by the proposed algorithm could produce repeated frequent subgraphs, a frequent subgraph is stored only if it has not been stored previously.

4.5. Completeness

Given a graph collection $D$, a support threshold $α$, and a dissimilarity threshold $β$, AGCM-SLV aims to mine all frequent subgraphs within the graph collection $D$, considering both structural and label variations and allowing a dissimilarity of at most $β$ between a frequent subgraph and its occurrences. We can ensure this because all frequent nodes in the graph collection are mined through an exhaustive search, and from these, expansions are made to obtain larger frequent subgraphs until they can no longer be expanded to form new frequent subgraphs; candidates that are not frequent are pruned by applying the anti-monotonicity property. Each frequent subgraph expansion creates a new candidate subgraph for each edge that connects a neighboring node. The frequent subgraphs may include cycles, which implies inserting an edge without inserting a node. This way, all frequent subgraphs that contain the subgraph being expanded are considered as candidates. For each candidate, the occurrences of the expanded subgraph are also expanded. Each occurrence expansion considers structural differences and label differences through dictionaries. The dissimilarity is verified to guarantee that every occurrence fulfills the dissimilarity threshold $β$. Then, the frequency is verified to ensure it satisfies the support threshold $α$. Hence, all frequent approximate subgraphs that satisfy the dissimilarity and the frequency thresholds are mined. All of the above ensures that no frequent subgraph is lost; thus, the proposed algorithm mines all the frequent approximate subgraphs according to Definition 8.

To formally prove that AGCM-SLV can mine all frequent subgraphs with structural and label variations, we will show by induction on the size of the frequent subgraph that AGCM-SLV can find all frequent subgraphs as candidate subgraphs and that, for each candidate, AGCM-SLV finds all its corresponding occurrences.

Let $P(k)$ be a frequent subgraph of size $k$.

Base case: When $k = 1$, since the first stage of AGCM-SLV mines all frequent nodes satisfying the support threshold ($α$) by an exhaustive search over the whole graph collection $D$, every subgraph $P(1)$ is mined by AGCM-SLV. For every subgraph mined, all its occurrences are found.
Induction hypothesis: AGCM-SLV mines the set of all frequent approximate subgraphs $P(k)$ of size $k$ in $D$ (with structural and label variations) and all their occurrences.
Induction step: We will prove that if $P(k + 1)$ is a frequent subgraph with structural and label variations in $D$, AGCM-SLV mines $P(k + 1)$ and all its occurrences.
If $P(k + 1)$ is a frequent subgraph, then due to the anti-monotonicity property, there is a frequent subgraph $P(k)$ such that $P(k) \subseteq P(k + 1)$. Then, by the induction hypothesis, $P(k)$ and all its occurrences in $D$ are found by AGCM-SLV. Then, by applying the subgraph expansion described in Section 4.2 to $P(k)$, we get $P(k + 1)$ as a candidate subgraph. Now, applying the occurrence expansion described in Section 4.3 to each occurrence $O_{k}$ of $P(k)$, AGCM-SLV finds all occurrences of $P(k + 1)$, because every occurrence $O_{k + 1}$ of $P(k + 1)$ comes from an occurrence $O_{k}$ of $P(k)$.
Since the expanded occurrences $O_{k + 1}$ come from the occurrences $O_{k}$, by hypothesis they are also occurrences of $P(k + 1)$; hence, $P(k + 1)$ is a frequent subgraph with the expanded occurrences $O_{k + 1}$ as its occurrences.
Hence, $P(k + 1)$ and all its occurrences $O_{k + 1}$ are mined by AGCM-SLV.
Therefore, it is concluded that all frequent subgraphs are mined by AGCM-SLV.

4.6. Time Complexity

First, the proposed algorithm obtains the frequent nodes $(N_{frequent})$ in the collection. To do this, all the nodes in the collection are visited; therefore, the time complexity is $O(N_{total} + E_{total})$.

D = Graph collection $G_{1}, G_{2}, \dots, G_{i}$.
$N_{total}$ = Total number of nodes in the graph collection.
$E_{total}$ = Total number of edges in the graph collection.
$N_{frequent}$ = Frequent nodes in the graph collection.

From these frequent nodes $(N_{frequent})$, the expansion of the subgraph begins. The expansion is made to every edge connected to the frequent subgraphs, and in each expansion, the subgraph grows by one node. In the first iteration, the frequent subgraph is a node; therefore, the frequent subgraph can grow by at most $(\bar{N} - 1)$ nodes, where $\bar{N}$ is the average number of nodes per graph in the collection D. In the second iteration, it can grow by at most $(\bar{N} - 2)$ nodes, and so on. Hence, in the $i^{th}$ iteration, it can grow by at most $(\bar{N} - i)$ nodes, finishing when the candidate subgraph is not frequent. In the worst-case scenario, we need to consider all possible permutations of nodes, because every possible node could be explored; the total number of distinct subgraphs could be approximated by $\bar{N}_{graph}!$.

$\bar{N}_{graph}$ = Average number of nodes per graph.

$T(n) = O(N_{total} + E_{total}) + O(N_{frequent} \cdot |D| \cdot \bar{N}_{graph}!)$

(4)

Thus, the time complexity of the proposed algorithm is combinatorial.
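To get a feel for how quickly the bound in Equation (4) grows, the following minimal Python sketch evaluates it as a plain operation count. The function name `worst_case_bound` and its parameters are hypothetical, constant factors are ignored, and the average number of nodes per graph is assumed to be rounded to an integer so that the factorial is defined.

```python
import math


def worst_case_bound(n_total, e_total, n_frequent, n_graphs, avg_nodes):
    """Evaluate (N_total + E_total) + N_frequent * |D| * avg_nodes! as a count.

    Assumption: avg_nodes is the average number of nodes per graph,
    rounded to an integer; constant factors hidden by O(.) are ignored.
    """
    scan_cost = n_total + e_total  # visiting every node and edge once
    expansion_cost = n_frequent * n_graphs * math.factorial(avg_nodes)
    return scan_cost + expansion_cost


if __name__ == "__main__":
    # The factorial term dominates as soon as graphs have more than a few nodes.
    for avg_nodes in (4, 8, 14):
        print(avg_nodes, worst_case_bound(1000, 1000, 5, 100, avg_nodes))
```

Even for graphs with about 14 nodes on average, the factorial term is already on the order of 10^10 per frequent node and graph, which illustrates why the worst case is only reachable on small collections.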

5. Experiments

In this section, we present a set of experiments performed to assess our proposed algorithm, AGCM-SLV, in different scenarios. Section 5.1 shows, through a toy example, how AGCM-SLV mines frequent approximate subgraphs with structural and label variations. Section 5.2 presents a comparative analysis between AGCM-SLV and REAFUM, the most similar algorithm found in the literature. In Section 5.3, the impact of varying parameters on the algorithm’s performance is assessed. Section 5.4 shows how the algorithm performs in terms of runtime and memory usage when the dataset increases in size. Finally, Section 5.5 presents how noise impacts the frequent approximate subgraphs mined by the proposed algorithm.

The datasets used for the experiments are small synthetic datasets and datasets from Network Repository [64], a graph and network repository containing hundreds of real-world networks and benchmark datasets. The experiments were conducted using a computer with two Intel Xeon E5-2620 at 2.40 GHz processors, 256 GB RAM, and Linux Ubuntu 22.

5.1. Frequent Approximate Subgraphs Mined by the Proposed Algorithm

To demonstrate how the proposed algorithm mines frequent approximate subgraphs with structural and label variations, we applied AGCM-SLV to the small graph collection shown in Figure 4, varying the dissimilarity threshold ($β$) and introducing different label similarities in the node dictionary ($γ$) and edge dictionary ($δ$).

In a first experiment, we set the support threshold ($α$) to 100% and the dissimilarity threshold ($β$) to zero, resulting in AGCM-SLV mining one frequent subgraph, $B$, as shown in Table 2. Since no modifications are allowed, this frequent subgraph has an exact match.

In a second experiment, we set the support threshold ($α$) to 100% and the dissimilarity threshold ($β$) to one, resulting in AGCM-SLV mining the additional frequent approximate subgraph $B \frac{D}{} A$, for two frequent subgraphs in total, as shown in the first column of Table 3. These frequent approximate subgraphs do not present structural variations, as only a single modification in the occurrences is allowed. This modification can be a label substitution on a node or edge.

To demonstrate the label substitution, consider the frequent approximate subgraph $B \frac{D}{} A$, which has the occurrence $B \frac{D}{} Z$ on $g_{2}$, where we have a node label substitution: the node label ‘A’ is substituted by the node label ‘Z’. Additionally, the frequent approximate subgraph $B \frac{D}{} A$ has two occurrences with edge label substitutions on $g_{3}$: $B \frac{E}{} A$ and $B \frac{F}{} A$.

In a third experiment, we set the support threshold ($α$) to 100% and increased the dissimilarity threshold ($β$) to two to allow for structural variations. With these parameter values, AGCM-SLV mines five additional frequent approximate subgraphs, for a total of seven mined frequent subgraphs, as shown in the first column of Table 4.

To demonstrate the structural variations, consider the frequent approximate subgraph $B \frac{D}{} A$ and its occurrence $B$ on $g_{2}$, which requires two structural variations: the deletion of node $A$ and of its corresponding edge.

In a fourth experiment, we set the support threshold ($α$) to 100% and the dissimilarity threshold ($β$) to two while providing label similarities in nodes through the node dictionary ($γ$). The dictionary indicates that the node label ‘Z’ has 50% similarity with node label ‘A’ and node label ‘Y’ has 50% similarity with node label ‘A’. In this scenario, AGCM-SLV mines the seven frequent approximate subgraphs shown in the first column of Table 4, plus one frequent approximate subgraph shown in the first column of Table 5.

To demonstrate the use of the node dictionary ($γ$), consider the occurrence $Z \frac{D}{} B \frac{B}{} Y$ on $g_{2}$ of the frequent subgraph $A \frac{A}{} B \frac{D}{} A$. In this scenario, the frequent node $B$ was expanded to $A$ through the labeled edge ‘D’, while the occurrence was expanded from $B$ to $Z$ through the labeled edge ‘D’. The substitution between the node labels ‘A’ and ‘Z’ requires a dissimilarity cost of 0.5. The frequent subgraph was then expanded from the frequent node $B$ to $A$ through the labeled edge ‘A’, and the occurrence was expanded from $B$ to $Y$ through the labeled edge ‘B’. This expansion requires an edge label substitution between ‘B’ and ‘A’, with a dissimilarity cost of 1, as well as a node label substitution between ‘A’ and ‘Y’, with a dissimilarity cost of 0.5. As a result, the frequent subgraph $A \frac{A}{} B \frac{D}{} A$ and its occurrence $Z \frac{D}{} B \frac{B}{} Y$ on $g_{2}$ have a dissimilarity of 2 (0.5 + 1 + 0.5).
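The cost accounting in this example can be reproduced with a short Python sketch. It assumes that a label substitution costs one minus the similarity stored in the dictionary, and that labels without a dictionary entry have similarity zero; this rule is inferred from the examples in this section (50% similarity gives a cost of 0.5, 100% similarity gives equivalence) and is not taken from the algorithm's definition. The names `substitution_cost` and `aligned_dissimilarity` are hypothetical, and the alignment between pattern and occurrence is given by hand rather than computed.

```python
def substitution_cost(pattern_label, occurrence_label, similarities):
    """Cost of substituting one label by another.

    Assumption: cost = 1 - similarity, with similarity 0 when the pair
    is not listed in the dictionary and 1 when the labels are identical.
    """
    if pattern_label == occurrence_label:
        return 0.0
    pair = (pattern_label, occurrence_label)
    similarity = similarities.get(pair, similarities.get(pair[::-1], 0.0))
    return 1.0 - similarity


def aligned_dissimilarity(node_pairs, edge_pairs, node_dict, edge_dict):
    """Sum substitution costs over already aligned (pattern, occurrence) labels."""
    node_cost = sum(substitution_cost(p, o, node_dict) for p, o in node_pairs)
    edge_cost = sum(substitution_cost(p, o, edge_dict) for p, o in edge_pairs)
    return node_cost + edge_cost


# Pattern A -A- B -D- A versus occurrence Z -D- B -B- Y on g2,
# aligned as in the text: (B -D- A) with (B -D- Z), (B -A- A) with (B -B- Y).
node_pairs = [("B", "B"), ("A", "Z"), ("A", "Y")]
edge_pairs = [("D", "D"), ("A", "B")]
node_dict = {("A", "Z"): 0.5, ("A", "Y"): 0.5}  # 50% similarities
edge_dict = {}  # no partial similarities between edge labels

total = aligned_dissimilarity(node_pairs, edge_pairs, node_dict, edge_dict)
print(total)  # 0.5 + 0.5 + 1.0
```

Under the same assumption, raising both node similarities to 1.0 removes the node costs entirely, so only edge label substitutions contribute to the dissimilarity.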

In a fifth experiment, we set the support threshold ($α$) to 100% and the dissimilarity threshold ($β$) to two, but changed the label similarities in nodes through the node dictionary ($γ$) file, which now indicates that the node label ‘Z’ has 100% similarity with node label ‘A’ and node label ‘Y’ has 100% similarity with node label ‘A’. In this scenario, AGCM-SLV mines three additional frequent subgraphs, as shown in the first column of Table 6, for a total of 11 mined frequent subgraphs.

These additional frequent approximate subgraphs were mined due to increased similarity among certain node labels, allowing them to appear in all the graphs within the collection while fulfilling the support threshold. The occurrence $Y \frac{B}{} B \frac{D}{} Z$ on $g_{2}$ of the frequent approximate subgraph $A \frac{E}{} B \frac{F}{} A$ serves as an example. According to the node dictionary ($γ$), the node labels ‘Z’ and ‘Y’ are equivalent to ‘A’. The variations in the occurrence involve two modifications: the substitution of both edge labels.

Finally, in a sixth experiment, we set the support threshold ($α$) to 100% and the dissimilarity threshold ($β$) to two. We introduced label similarities in edges through the edge dictionary ($δ$) and maintained the same label similarities in nodes. As before, according to the node dictionary ($γ$), node labels ‘Z’ and ‘Y’ are considered 100% similar to node label ‘A’. Moreover, according to the edge dictionary ($δ$), the edge label ‘F’ is 100% similar to the edge label ‘D’. In this scenario, AGCM-SLV mines an additional frequent approximate subgraph, as shown in the first column of Table 7, resulting in a total of 12 mined frequent approximate subgraphs.

The additional frequent approximate subgraph was mined due to the introduction of the edge label dictionary ($δ$), which considers edge label ‘F’ equivalent to edge label ‘D’, finding the occurrence $A \frac{E}{} B \frac{F}{} A$ on $g_{3}$. The variations in the occurrence involve two modifications: substituting the edge label ‘E’ with ‘A’ and deleting the edge labeled ‘C’.

5.2. Comparison Between AGCM-SLV and REAFUM

This section presents a comparison between our proposed algorithm (AGCM-SLV) and REAFUM, which is the most similar algorithm to ours reported in the literature. Section 5.2.1 analyzes the differences between the frequent subgraphs mined by both algorithms, while Section 5.2.2 compares the number of frequent subgraphs mined and the runtime spent by each algorithm.

5.2.1. Comparison of the Frequent Approximate Subgraphs Mined by Both Algorithms

A comparison of the frequent approximate subgraphs mined by AGCM-SLV and REAFUM was performed. Both algorithms present inherent similarities, such as mining frequent subgraphs in a dataset considering structural and label variations in nodes and edges; nevertheless, this comparison aims to focus on the key differences, such as that REAFUM is designed to mine only some frequent subgraphs, by selecting some representative graphs and performing a representative frequent approximate subgraph refinement. In contrast, AGCM-SLV aims to mine all frequent approximate subgraphs in the dataset. Moreover, REAFUM cannot assign partial similarities to labels in nodes and edges. Meanwhile, AGCM-SLV allows the definition of partial similarities among labels of nodes and edges. To ensure a fair analysis, some considerations were taken into account: only the approximate frequent subgraph mining stage of REAFUM was considered without selecting representative graphs and representative approximate frequent subgraph refinement. For AGCM-SLV, no partial similarities between labels were used, as REAFUM cannot use them.

In this experiment, the small graph collection depicted in Figure 5 was used to illustrate the differences between the frequent approximate subgraphs mined by both algorithms. The parameters set in both algorithms were a support threshold ($α$) of 100% and a dissimilarity threshold ($β$) of two, resulting in AGCM-SLV mining 31 frequent subgraphs, while REAFUM mined 20 frequent subgraphs. The results also indicated that both algorithms mined 19 frequent approximate subgraphs in common; AGCM-SLV mined 12 additional frequent approximate subgraphs that REAFUM did not mine, while REAFUM mined one additional frequent approximate subgraph that AGCM-SLV did not mine.

The additional frequent approximate subgraph mined by REAFUM, as shown in Figure 6, is $A \frac{1}{} B \frac{2}{} E$ on graph $g_{1}$, with occurrences $A \frac{1}{} B \frac{2}{} G$ on graph $g_{2}$ and $C \frac{1}{} A \frac{2}{} E$ on graph $g_{3}$. Unlike REAFUM, our algorithm analyzed $A \frac{1}{} B \frac{2}{} E$ as a candidate subgraph and pruned it because its occurrence on graph $g_{3}$ exceeded the dissimilarity threshold and, consequently, the candidate did not meet the required support threshold. The occurrence on $g_{3}$ would require three modifications: the deletion of the edge from $A$ to $E$, the insertion of an edge from $C$ to $E$, and the substitution of the node label ‘C’ by the node label ‘B’, which exceeds the dissimilarity threshold of two.

AGCM-SLV mined 12 frequent approximate subgraphs that REAFUM did not. To understand why this happens, the frequent subgraph expansion processes of both algorithms were analyzed. Both algorithms analyze the frequent subgraph’s neighboring nodes while building candidate subgraphs from a previously found frequent subgraph. However, REAFUM connects a neighboring node with all edges that connect it. In contrast, AGCM-SLV creates a separate candidate subgraph for each edge that connects a neighboring node, resulting in a greater diversity of frequent subgraphs. Figure 7 provides an example of this. This figure describes the frequent subgraph expansion process of $Y \frac{3}{} A$ in graph $g_{3}$. In AGCM-SLV, the node $Z$ is expanded, creating three candidate frequent subgraphs: first, using the edge $Y \frac{3}{} Z$, then the edge $A \frac{3}{} Z$, and as a later expansion, both edges. In contrast, REAFUM expands node $Z$ using both edges, resulting in just one candidate subgraph.
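The difference between the two expansion strategies can be illustrated with a small Python sketch on the example of Figure 7. It is only a schematic model, not either algorithm's implementation: a subgraph is represented as a set of labeled edges written as hypothetical `(node, edge_label, node)` tuples, node labels are assumed to identify nodes uniquely, and the helper names `expand_per_edge` and `expand_all_edges` are introduced here for illustration.

```python
def expand_per_edge(pattern, connecting_edges):
    """Per-edge expansion: one candidate for each connecting edge not yet used."""
    return {pattern | {edge} for edge in connecting_edges if edge not in pattern}


def expand_all_edges(pattern, connecting_edges):
    """Per-node expansion: a single candidate using every connecting edge."""
    return pattern | frozenset(connecting_edges)


# Frequent subgraph Y -3- A; the neighboring node Z is connected to both Y and A.
pattern = frozenset({("Y", "3", "A")})
edges_to_z = [("Y", "3", "Z"), ("A", "3", "Z")]

# Per-edge strategy: two candidates first, then a later expansion that
# adds the remaining edge (closing the cycle) yields the third one.
first_level = expand_per_edge(pattern, edges_to_z)
later_level = set()
for candidate in first_level:
    later_level |= expand_per_edge(candidate, edges_to_z)
per_edge_candidates = first_level | later_level

# Per-node strategy: only the candidate containing both edges.
per_node_candidate = expand_all_edges(pattern, edges_to_z)

print(len(per_edge_candidates), per_node_candidate in per_edge_candidates)
```

In this model the per-node candidate is one of the three per-edge candidates, which matches the observation that the per-edge strategy explores a superset of the candidates around node $Z$.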

5.2.2. Comparison Regarding the Number of Frequent Approximate Subgraphs Mined and Runtime

In this experiment, we compared the performance of REAFUM and our proposed algorithm (AGCM-SLV) with a real-world dataset (PTC-FM [64]). We evaluated how the variation in the parameters $α$ and $β$ impacts the number of frequent subgraphs mined and the runtime.

PTC-FM is a dataset that contains compounds labeled according to carcinogenicity in female mice. This dataset includes 349 graphs where nodes represent atoms labeled by atom type, and edges represent chemical bonds and are labeled by bond type.

This experiment uses specific portions of the PTC-FM dataset based on its size. Table 8 shows the datasets built from PTC-FM for our experiment.

Figure 8a shows how the number of frequent approximate subgraphs mined is affected by varying the dissimilarity threshold, with the support threshold set constant at 90%. The figure contains three horizontal plots showing the number of frequent approximate subgraphs mined, each plot corresponding to a specific value of the dissimilarity threshold (1, 2, and 3). The colored bars represent the number of frequent approximate subgraphs mined for the different cases (both, only REAFUM, and only AGCM-SLV). It is important to emphasize that each plot has a different number of frequent subgraphs on the y-axis scale. In the first plot, when the dissimilarity threshold is 1, the number of frequent approximate subgraphs is up to 100; in the second plot, when the dissimilarity threshold is 2, the number of frequent approximate subgraphs is up to 310; and in the third plot, when the dissimilarity threshold is 3, the number of frequent approximate subgraphs extends up to 800. We need to focus on each plot to compare the number of frequent approximate subgraphs mined by both algorithms. The results reveal that while both algorithms mine most of the frequent subgraphs represented by a blue-colored bar, there are also cases where an algorithm mines additional frequent subgraphs represented by a cyan-colored bar for our proposed algorithm and a red-colored bar for REAFUM. Notice that the red-colored bars representing REAFUM in Figure 8a are not easily seen (see, for example, the second graph of Figure 8a in the last column corresponding to 121–150), since there are a few cases where a small number of frequent approximate subgraphs are mined only by REAFUM. To observe how varying the dissimilarity threshold impacts the number of frequent subgraphs mined, we compare the increase in frequent subgraphs mined across the three plots. The results show that both algorithms mine more frequent approximate subgraphs as the dissimilarity threshold increases.

Figure 8b shows the corresponding runtime in which the frequent approximate subgraphs of Figure 8a were mined by both algorithms. This figure also contains three horizontal plots, which, in this case, show the runtime of mining the frequent approximate subgraphs for both algorithms. Each plot corresponds to a dissimilarity threshold value (1, 2, and 3); the number of frequent approximate subgraphs mined is represented by colored bars for the different cases (both, only REAFUM, and only AGCM-SLV), and the runtime is represented by different line colors for each algorithm (REAFUM and AGCM-SLV). It is important to emphasize that each plot has a different time scale on the y-axis. In the first plot, when the dissimilarity threshold is 1, the runtime is less than 30 min; in the second plot, when the dissimilarity threshold is 2, the runtime is less than 3 h and 30 min; and in the third plot, when the dissimilarity threshold is 3, the runtime grows to more than 22 h. To compare the runtime of both algorithms, we need to focus on each plot, where it is observed that our algorithm’s runtime is shorter in all cases. To observe how varying the dissimilarity threshold impacts the runtime, we compare the increase in the runtime across the three plots. The results show that as the dissimilarity threshold increases, the runtime increases for both algorithms, which is related to the increase in the number of frequent approximate subgraphs mined. However, the runtime increases significantly for REAFUM compared to our proposed algorithm.

Figure 9 is similar to Figure 8, except that it uses a constant support threshold of 80%. Figure 9a contains three horizontal plots showing the number of frequent approximate subgraphs mined, one for each dissimilarity threshold value (1, 2, and 3). The colored bars represent the number of frequent approximate subgraphs mined for the different cases (both, only REAFUM, and only AGCM-SLV). It is important to analyze how the support threshold impacts the number of frequent approximate subgraphs that are mined. At an 80% support threshold, more frequent subgraphs are mined compared to a 90% support threshold, as fewer occurrences are required for a subgraph to be considered frequent, as shown across the plots.

Figure 9b, on the other hand, contains three horizontal plots, one for each dissimilarity threshold value (1, 2, and 3), showing the corresponding runtime for both algorithms. The red line represents REAFUM, and the blue line represents AGCM-SLV. The results reveal an increase in the number of frequent approximate subgraphs mined as a result of a smaller support threshold, which also impacts the rise in runtime. As with the previous experiment, if the dissimilarity threshold increases, there is a consistent growth in the number of frequent subgraphs and runtime. However, the results show that our proposed algorithm’s runtime increases at a slower rate than REAFUM’s, which is consistent with the previous experiment.

These experiments reveal that varying the dissimilarity or support thresholds impacts the number of frequent approximate subgraphs mined and the runtime for both REAFUM and the proposed algorithm. However, our proposed algorithm mines more frequent subgraphs in a shorter runtime than REAFUM.

5.3. Analysis of the Performance of AGCM-SLV

We evaluated the performance of our proposed algorithm, AGCM-SLV, on the real-world PTC dataset [64] by analyzing how variations in the dissimilarity threshold ($β$) and support threshold ($α$) impact both the number of frequent approximate subgraphs mined and the runtime.

The PTC dataset [64] contains compounds labeled according to carcinogenicity in male mice (MM), male rats (MR), female mice (FM), and female rats (FR). In these four graph collections, the nodes represent atoms and are labeled by atom type, and the edges represent chemical bonds and are labeled by bond type. Table 9 shows the number of graphs in each graph collection and the average number of nodes and edges for the four PTC graph collections.

Figure 10a shows the impact of varying the dissimilarity and support thresholds on the number of frequent approximate subgraphs mined. This figure contains three vertical plots showing the number of frequent approximate subgraphs mined across the four graph collections of PTC (FM, FR, MM, and MR). Each plot corresponds to a specific dissimilarity threshold value (1, 2, or 3). In each plot, the colored bars represent the number of frequent approximate subgraphs mined for each support threshold (70%, 80%, and 90%).

To observe the impact of varying the dissimilarity threshold, we compare the increase in the number of frequent approximate subgraphs across the three plots. The results show that as the dissimilarity threshold increases, the number of frequent approximate subgraphs mined also increases. This is because allowing more modifications increases the number of occurrences per candidate subgraph that fulfills the dissimilarity threshold, leading to more frequent approximate subgraphs being mined.

On the other hand, to observe the impact of varying the support threshold, we compare the variation within each plot across the different colored bars. The results reveal that the number of frequent approximate subgraphs mined increases as the support threshold decreases. This is because a lower support threshold requires a subgraph to appear in fewer graphs of the collection to be considered frequent.
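As a rough illustration of why a lower support threshold admits more subgraphs, the sketch below converts a percentage threshold into the minimum number of graphs that must contain an occurrence. It assumes that support is counted as the number of graphs with at least one occurrence and that the percentage is rounded up; the function name `min_support_count` is hypothetical, and the figure of 349 graphs is the size of PTC-FM reported in Section 5.2.2.

```python
def min_support_count(n_graphs, alpha_percent):
    """Smallest number of graphs that reaches alpha_percent of the collection.

    Assumption: support is the number of graphs containing at least one
    occurrence, and the threshold is rounded up (integer ceiling).
    """
    return (n_graphs * alpha_percent + 99) // 100


if __name__ == "__main__":
    for alpha in (90, 80, 70):
        print(alpha, min_support_count(349, alpha))
```

Under these assumptions, lowering the threshold from 90% to 70% on PTC-FM reduces the required number of supporting graphs by 70, so candidates that were pruned at the higher threshold can become frequent.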

Figure 10b shows the corresponding runtime for mining the frequent approximate subgraphs shown in Figure 10a. This figure also contains three vertical plots, which indicate the runtime of mining the frequent approximate subgraphs across the four graph collections of PTC (FM, FR, MM, and MR). Each plot corresponds to a specific dissimilarity threshold value (1, 2, or 3). The colored lines represent the runtimes for each support threshold (70%, 80%, and 90%).

To observe the impact of varying the dissimilarity threshold on runtime, we need to compare the increase in the runtime across the three plots. It is important to emphasize that each plot has a different timescale on the y-axis. In the first plot, the runtime is less than 3 min with a dissimilarity threshold of 1; in the second plot, it is less than 28 min with a dissimilarity threshold of 2; and in the third plot, it extends up to 7 h with a dissimilarity threshold of 3. As the dissimilarity threshold increases, the runtime also increases because of the rise in the number of frequent approximate subgraphs mined.

On the other hand, to observe how varying the support threshold impacts the runtime, we must compare the variation within each plot with the color lines representing the support thresholds. The runtime increases as the support threshold decreases due to the increase in the number of frequent approximate subgraphs mined. The results reveal that the runtime increases at a higher rate when the dissimilarity threshold is increased compared with the support threshold. For example, when the dissimilarity threshold is 1, the runtime for all the different support thresholds is a few minutes, and it increases to several hours when the dissimilarity threshold is 3.

5.4. Scalability

The PTC dataset was used for the proposed algorithm’s scalability tests. PTC is a dataset composed of four graph collections: PTC-FM, PTC-FR, PTC-MM, and PTC-MR. These collections are described in Table 9. The four PTC collections were combined into a single collection containing 1380 graphs with an average of 14.24 nodes and 14.63 edges per graph.

Multiple tests were conducted using portions of the collections described above based on their size. Each test was performed using dissimilarity thresholds of 1 and 2. Regarding the support threshold, a constant support threshold was set to 70% because a higher threshold did not produce any frequent subgraphs. In each test, runtime and maximum RAM usage were measured to evaluate the resources required to execute our algorithm.

Initially, we ran our algorithm on 100 graphs from the combined collection, using the parameters described above. Subsequent tests were performed by adding 100 graphs per experiment until all the graphs in the collection were included.

The measured runtimes are presented in Figure 11a, and the maximum memory usage is shown in Figure 11b. We evaluated all 1380 graphs in the collection with both dissimilarity thresholds in less than 1 h using a maximum of 120 GB of RAM.

These tests exhibit growth in terms of runtime and memory usage; however, the runtime increased at a faster rate than the memory usage. The results reveal that increasing the dissimilarity threshold significantly affected the runtime, leading to longer runtimes. However, this effect does not extend to the memory usage.

5.5. Impact of Random White Noise in the Proposed Algorithm

This experiment evaluates the impact of random white noise on the proposed algorithm for mining frequent approximate subgraphs with structural and label variations. To perform this experiment, we introduced 5%, 10%, 15%, and 20% of noise to the PTC datasets in Table 9. Then, we applied our proposed algorithm to the noisy datasets to mine frequent approximate subgraphs at the different noise levels.

Noise was added randomly, including the addition of nodes with their corresponding edges, the addition of only edges, and label substitutions in both nodes and edges. For each noisy dataset, we mined frequent approximate subgraphs using AGCM-SLV, employing dissimilarity thresholds of 1 and 2. In all cases, we used a support threshold of 90%. The frequent approximate subgraphs mined from the original, noise-free datasets served as reference.
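One possible way to inject this kind of noise is sketched below in Python. The text does not specify how a noise percentage is turned into a number of modifications, so the sketch assumes that the level is a fraction of the total number of nodes and edges of a graph; the graph representation (a dictionary of node labels and a dictionary of edge labels keyed by ordered node pairs) and the function name `add_noise` are likewise hypothetical.

```python
import random


def add_noise(nodes, edges, level, node_labels, edge_labels, rng):
    """Return noisy copies of a labeled undirected graph.

    nodes: dict node_id (int) -> label; edges: dict (u, v) with u < v -> label.
    Assumption: round(level * (nodes + edges)) random modifications are applied.
    """
    nodes, edges = dict(nodes), dict(edges)  # leave the input untouched
    n_mods = round(level * (len(nodes) + len(edges)))
    for _ in range(n_mods):
        op = rng.choice(["add_node", "add_edge", "sub_node", "sub_edge"])
        if op == "add_node":
            # New node attached to a random existing node by a new edge.
            new_id = max(nodes) + 1
            anchor = rng.choice(sorted(nodes))
            nodes[new_id] = rng.choice(node_labels)
            edges[(anchor, new_id)] = rng.choice(edge_labels)
        elif op == "add_edge":
            missing = [(u, v) for u in nodes for v in nodes
                       if u < v and (u, v) not in edges]
            if missing:
                edges[rng.choice(missing)] = rng.choice(edge_labels)
        elif op == "sub_node":
            target = rng.choice(sorted(nodes))
            others = [lab for lab in node_labels if lab != nodes[target]]
            if others:
                nodes[target] = rng.choice(others)
        elif op == "sub_edge" and edges:
            target = rng.choice(sorted(edges))
            others = [lab for lab in edge_labels if lab != edges[target]]
            if others:
                edges[target] = rng.choice(others)
    return nodes, edges


clean_nodes = {0: "C", 1: "O", 2: "N"}
clean_edges = {(0, 1): "1", (1, 2): "2"}
rng = random.Random(0)  # fixed seed so that runs are repeatable
noisy_nodes, noisy_edges = add_noise(
    clean_nodes, clean_edges, 0.2, ["C", "O", "N"], ["1", "2"], rng)
```

The sketch only adds elements or substitutes labels, mirroring the three kinds of noise listed above; it never deletes nodes or edges.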

Figure 12 shows the impact of different levels of noise on the frequent approximate subgraphs mined. This figure contains four bar graphs, each corresponding to a different PTC dataset (FM, FR, MM, and MR). In each graph, the x-axis represents different noise levels (0% (noise-free), 5%, 10%, 15%, and 20%) grouped by dissimilarity thresholds (1 and 2). The bars are stacked vertically, with each bar’s height indicating the number of frequent approximate subgraphs mined. Each bar is split into two segments: a blue segment labeled Preserved that represents the frequent approximate subgraphs preserved from those mined in the original noise-free (0% noise level) dataset, and a green segment labeled Additional that represents the new frequent approximate subgraphs mined at the current noise level.

The results reveal that as noise increases, the number of frequent approximate subgraphs also increases, suggesting that certain subgraphs become frequent because the noise introduces additional occurrences. These occurrences allow previously infrequent subgraphs to now satisfy the required support threshold and become frequent. Moreover, when noise is included, the majority (99.49% on average) of the frequent approximate subgraphs mined from the noise-free dataset (see 0% in Figure 12) were preserved, indicating the algorithm’s tolerance to white noise. Table 10 shows that every frequent subgraph (100%) is maintained when the dissimilarity threshold is 1, and 98.88% are preserved when the dissimilarity threshold is 2. Thus, only a few of the frequent approximate subgraphs are lost. This happens because occurrences disappear (i.e., an occurrence that previously required only two modifications may now need more, making it no longer match the frequent approximate subgraph). As a result, these subgraphs no longer satisfy the required support threshold.
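The preserved and additional counts discussed above amount to simple set operations, as the following Python sketch shows. It assumes that each mined subgraph can be written in some canonical form so that equal subgraphs compare equal; the string encoding used here and the function name `preservation` are hypothetical, and the pattern sets are made-up toy data rather than results from the experiments.

```python
def preservation(reference, noisy):
    """Compare patterns mined from a noisy dataset against the noise-free ones.

    Returns (preserved count, additional count, preserved percentage).
    Assumption: patterns are given in a canonical, hashable form.
    """
    preserved = reference & noisy   # mined both with and without noise
    additional = noisy - reference  # mined only after adding noise
    percentage = 100.0 * len(preserved) / len(reference)
    return len(preserved), len(additional), percentage


# Toy pattern sets, encoded as strings purely for illustration.
reference_patterns = {"B", "B-D-A", "A-A-B", "B-E-A"}
noisy_patterns = {"B", "B-D-A", "A-A-B", "B-F-A", "Z-D-B"}

print(preservation(reference_patterns, noisy_patterns))
```

In this toy example, three of the four reference patterns survive and two new ones appear, which corresponds to the blue and green bar segments described for Figure 12.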

Due to the random nature of the introduced noise, datasets with higher noise levels may sometimes maintain more frequent approximate subgraphs than those with lower noise levels. As the noise level increases, additional frequent approximate subgraphs may appear that are not present in the noise-free dataset. However, the majority of frequent subgraphs from the original dataset are consistently maintained across the different noise levels, which shows the algorithm’s low sensitivity to white noise.

6. Conclusions

Mining frequent approximate subgraphs with structural and label variations in an undirected graph collection has been scarcely studied in the literature. In this paper, we have proposed AGCM-SLV, an algorithm for mining frequent approximate subgraphs with structural and label variations in graph collections. Our algorithm follows a depth-first traversal strategy starting from frequent nodes (which are single-node subgraphs) and expands them, verifying both the support and dissimilarity thresholds while discarding those that do not meet the defined thresholds.

To assess the proposed algorithm, we conducted experiments comparing it with REAFUM, the most similar algorithm to ours reported in the literature. From these experiments, we can conclude that, because AGCM-SLV expands frequent subgraphs through their neighboring edges instead of only their neighboring nodes (as REAFUM does), our proposed algorithm mines, in a shorter runtime, frequent subgraphs that REAFUM cannot mine.

On the other hand, we also performed experiments to analyze the runtime behavior of AGCM-SLV, varying dissimilarity and support thresholds, on real-world datasets. These experiments reveal that AGCM-SLV’s runtime increases at a higher rate when the dissimilarity threshold is increased compared with the support threshold.

Our scalability experiments allow us to conclude that, although AGCM-SLV exhibits fast growth in terms of runtime and memory usage as the collection grows, the increase in runtime is faster than the increase in memory usage, especially when increasing the dissimilarity threshold.

Finally, we assessed the impact of random white noise in the graph collection when our proposed algorithm is applied for mining frequent approximate subgraphs with structural and label variations. This experiment allows us to conclude that AGCM-SLV is tolerant to white noise, since more than 99% of the frequent approximate subgraphs mined without noise are preserved under noisy conditions.

In our experiments, we address approximate label similarities through dictionaries, as this approach enables the easy definition of similarities among labels across different domains. However, the algorithm can be easily adapted to include user-defined similarity functions for more complex or domain-specific scenarios.

The main limitation of the proposed algorithm is its time complexity, as the algorithm can only be used on relatively small collections of graphs. However, to our knowledge, it is the first algorithm to address the mining of all frequent approximate subgraphs with structural and label variations in graph collections allowing user-defined partial similarities between node and edge labels. This limitation presents an area of opportunity for the development of more efficient algorithms that can operate with larger collections of graphs. Another interesting problem to address as future work is the development of algorithms for mining frequent subgraphs with structural and label variations in dynamic collections of graphs or for streaming data problems, where the graphs change over time. Extending the proposed algorithm to handle more complex relationships between graph nodes, such as those involving multiple edges between them, as in multigraphs, is a task that also deserves future study. Finally, evaluating our algorithm in a real-life application to solve a real-life problem is a very important line of future research.

Author Contributions

Conceptualization, D.J.-O., J.A.C.-O. and J.F.M.-T.; methodology, D.J.-O., J.A.C.-O. and J.F.M.-T.; software, D.J.-O.; validation, D.J.-O.; formal analysis, D.J.-O.; investigation, D.J.-O.; resources, D.J.-O.; data curation, D.J.-O.; writing—original draft preparation, D.J.-O.; writing—review and editing, J.A.C.-O. and J.F.M.-T.; visualization, D.J.-O.; supervision, J.A.C.-O. and J.F.M.-T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Secretaría de Ciencia, Humanidades, Tecnología e Innovación (SECIHTI) through the scholarship grant 778343.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this research are publicly available in [64].

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Cook, D. Mining Graph Data; Wiley-Interscience: Hoboken, NJ, USA, 2007.
2. Kalra, D. Applications, Models and Uses of Data Mining in E-Governance for Sustainable Development. In Proceedings of the 7th International Research Symposium of the SGBED, Dubai, United Arab Emirates, 17–19 December 2018; pp. 263–272.
3. Widmer, T.; Klein, A.; Wachter, P.; Meyl, S. Predicting material requirements in the automotive industry using data mining. In Proceedings of the International Conference on Business Information Systems, Seville, Spain, 26–28 June 2019; pp. 147–161.
4. Fadli, A.; Nugraha, A.W.W.; Aliim, M.S.; Taryana, A.; Kurniawan, Y.I.; Purnomo, W.H. Simple Correlation Between Weather and COVID-19 Pandemic Using Data Mining Algorithms. IOP Conf. Ser. Mater. Sci. Eng. 2020, 982, 012015.
5. Chee, C.H.; Jaafar, J.; Aziz, I.A.; Hasan, M.H.; Yeoh, W. Algorithms for frequent itemset mining: A literature review. Artif. Intell. Rev. 2019, 52, 2603–2621.
6. Fournier-Viger, P.; He, G.; Cheng, C.; Li, J.; Zhou, M.; Lin, J.C.W.; Yun, U. A survey of pattern mining in dynamic graphs. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2020, 10, e1372.
7. Fournier-Viger, P.; Gan, W.; Wu, Y.; Nouioua, M.; Song, W.; Truong, T.; Duong, H. Pattern mining: Current challenges and opportunities. In Proceedings of the International Conference on Database Systems for Advanced Applications, Virtual Event, 11–14 April 2022; pp. 34–49.
8. Aggarwal, C. Managing and Mining Graph Data; Springer: New York, NY, USA, 2010.
9. Cortés, X.; Conte, D.; Cardot, H. Learning edit cost estimation models for graph edit distance. Pattern Recognit. Lett. 2019, 125, 256–263.
10. Hinterberger, H. Graph. In Encyclopedia of Database Systems; Liu, L., Özsu, M.T., Eds.; Springer: New York, NY, USA, 2018; pp. 1635–1636.
11. Yang, P.; Wang, H.; Huang, Y.; Yang, S.; Zhang, Y.; Huang, L.; Zhang, Y.; Wang, G.; Yang, S.; He, L.; et al. LMKG: A large-scale and multi-source medical knowledge graph for intelligent medicine applications. Knowl.-Based Syst. 2024, 284, 111323.
12. Zhou, T.; Xu, P.; Wang, L.; Tang, Y. High-Risk HPV Cervical Lesion Potential Correlations Mining over Large-Scale Knowledge Graphs. Appl. Sci. 2024, 14, 2456.
13. Sander, R. MEXPLORER 1.0.0—A mechanism explorer for analysis and visualization of chemical reaction pathways based on graph theory. Geosci. Model Dev. 2024, 17, 2419–2425.
14. Rajan, K.; Brinkhaus, H.O.; Zielesny, A.; Steinbeck, C. Advancements in hand-drawn chemical structure recognition through an enhanced DECIMER architecture. J. Cheminform. 2024, 16, 78.
15. Matsumoto, N.; Moran, J.; Choi, H.; Hernandez, M.E.; Venkatesan, M.; Wang, P.; Moore, J.H. KRAGEN: A knowledge graph-enhanced RAG framework for biomedical problem solving using large language models. Bioinformatics 2024, 40, btae353.
16. Hadipour, H.; Li, Y.Y.; Sun, Y.; Deng, C.; Lac, L.; Davis, R.; Cardona, S.T.; Hu, P. GraphBAN: An inductive graph-based approach for enhanced prediction of compound-protein interactions. Nat. Commun. 2025, 16, 2541.
17. Li, L.; Ding, P.; Chen, H.; Wu, X. Frequent Pattern Mining in Big Social Graphs. IEEE Trans. Emerg. Top. Comput. Intell. 2022, 6, 638–648.
18. Agouti, T. Graph-based modeling using association rule mining to detect influential users in social networks. Expert Syst. Appl. 2022, 202, 117436.
19. Dileo, M.; Zignani, M.; Gaito, S. Temporal graph learning for dynamic link prediction with text in online social networks. Mach. Learn. 2024, 113, 2207–2226.
20. Abdulla, H.H.H.A.; Awad, W.S. Text Classification of English News Articles using Graph Mining Techniques. In Proceedings of the ICAART (3), Online, 3–5 February 2022; pp. 926–937.
21. Mahapatra, R.; Samanta, S.; Pal, M.; Allahviranloo, T.; Kalampakas, A. A Study on Linguistic Z-Graph and Its Application in Social Networks. Mathematics 2024, 12, 2898.
22. Andriyanov, N. Application of Graph Structures in Computer Vision Tasks. Mathematics 2022, 10, 4021.
23. Hetang, C.; Xue, H.; Le, C.; Yue, T.; Wang, W.; He, Y. Segment Anything Model for Road Network Graph Extraction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Seattle, WA, USA, 17–18 June 2024; pp. 2556–2566.
24. Elsharkawi, I.; Sharara, H.; Rafea, A. SViG: A Similarity-Thresholded Approach for Vision Graph Neural Networks. IEEE Access 2025, 13, 19379–19387.
25. Li, P.; Chen, P.; Zhang, D. Cross-Modal Feature Representation Learning and Label Graph Mining in a Residual Multi-Attentional CNN-LSTM Network for Multi-Label Aerial Scene Classification. Remote Sens. 2022, 14, 2424.
26. Scaringi, R.; Fiameni, G.; Vessio, G.; Castellano, G. GraphCLIP: Image-graph contrastive learning for multimodal artwork classification. Knowl.-Based Syst. 2025, 310, 112857.
27. ur Rehman, S.; Khan, A.U.; Fong, S.J. Graph mining: A survey of graph mining techniques. In Proceedings of the Seventh International Conference on Digital Information Management (ICDIM 2012), Macao, China, 22–24 August 2012; pp. 88–92.
28. Jiang, C.; Coenen, F.; Zito, M. A survey of frequent subgraph mining algorithms. Knowl. Eng. Rev. 2013, 28, 75–105.
29. Ramraj, T.; Prabhakar, R. Frequent Subgraph Mining Algorithms—A Survey. Procedia Comput. Sci. 2015, 47, 197–204.
30. Mrzic, A.; Meysman, P.; Bittremieux, W.; Moris, P.; Cule, B.; Goethals, B.; Laukens, K. Grasping frequent subgraph mining for bioinformatics applications. BioData Min. 2018, 11, 20.
31. Inokuchi, A.; Washio, T.; Motoda, H. An Apriori-Based Algorithm for Mining Frequent Substructures from Graph Data. In Proceedings of the Principles of Data Mining and Knowledge Discovery, Lyon, France, 13–16 September 2000; Zighed, D.A., Komorowski, J., Żytkow, J., Eds.; Springer: Berlin/Heidelberg, Germany, 2000; pp. 13–23.
32. Kuramochi, M.; Karypis, G. Frequent subgraph discovery. In Proceedings of the 2001 IEEE International Conference on Data Mining, San Jose, CA, USA, 29 November–2 December 2001; pp. 313–320.
33. Yan, X.; Han, J. gSpan: Graph-based substructure pattern mining. In Proceedings of the 2002 IEEE International Conference on Data Mining, Maebashi City, Japan, 9–12 December 2002; pp. 721–724.
34. Huan, J.; Bandyopadhyay, D.; Wang, W.; Snoeyink, J.; Prins, J.; Tropsha, A. Comparing Graph Representations of Protein Structure for Mining Family-Specific Residue-Based Packing Motifs. J. Comput. Biol. 2005, 12, 657–671.
35. Nijssen, S.; Kok, J.N. The Gaston Tool for Frequent Subgraph Mining. Electron. Notes Theor. Comput. Sci. 2005, 127, 77–87.
36. Elseidy, M.; Abdelhamid, E.; Skiadopoulos, S.; Kalnis, P. GraMi: Frequent Subgraph and Pattern Mining in a Single Large Graph. Proc. VLDB Endow. 2014, 7, 517–528.
37. Jagannadha Rao, D.; Kalpana, P.; Polepally, V.; Nagendra Prabhu, S. HE-Gaston algorithm for frequent subgraph mining with hadoop framework. Expert Syst. Appl. 2024, 251, 123971.
38. Sahu, S.; Chawla, M.; Khare, N.; Singh, B. Mining approximate frequent subgraph with sampling techniques. Mater. Today Proc. 2023, 81, 395–402.
39. Chen, X.; Cai, J.; Chen, G.; Gan, W.; Broustet, A. FCSG-Miner: Frequent closed subgraph mining in multi-graphs. Inf. Sci. 2024, 665, 120363.
40. Leng, F.; Li, F.; Bao, Y.; Zhang, T.; Yu, G. FSM-BC-BSP: Frequent Subgraph Mining Algorithm Based on BC-BSP. Appl. Sci. 2024, 14, 3154.
41. Acosta-Mendoza, N.; Gago-Alonso, A.; Pagola, J. Frequent approximate subgraphs as features for graph-based image classification. Knowl.-Based Syst. 2012, 17, 381–392.
42. Jia, Y.; Zhang, J.; Huan, J. An efficient graph-mining method for complicated and noisy data with real-world applications. Knowl. Inf. Syst. 2011, 28, 423–447.
43. Li, J.; Zou, Z.; Gao, H. Mining frequent subgraphs over uncertain graph databases under probabilistic semantics. VLDB J. 2012, 21, 753–777.
44. Chen, C.; Yan, X.; Zhu, F.; Han, J. gApprox: Mining Frequent Approximate Patterns from a Massive Network. In Proceedings of the Seventh IEEE International Conference on Data Mining (ICDM 2007), Omaha, NE, USA, 28–31 October 2007; pp. 445–450.
45. Flores-Garrido, M.; Carrasco-Ochoa, J.A.; Martínez-Trinidad, J. AGraP: An algorithm for mining frequent patterns in a single graph using inexact matching. Knowl. Inf. Syst. 2015, 44, 385–406.
46. Driss, K.; Boulila, W.; Leborgne, A.; Gançarski, P. Mining frequent approximate patterns in large networks. Int. J. Imaging Syst. Technol. 2021, 31, 1265–1279.
47. Vanetik, N.; Gudes, E.; Shimony, S. Computing frequent graph patterns from semistructured data. In Proceedings of the 2002 IEEE International Conference on Data Mining, Maebashi City, Japan, 9–12 December 2002; pp. 458–465.
48. Li, R.; Wang, W. REAFUM: Representative Approximate Frequent Subgraph Mining. In Proceedings of the 2015 SIAM International Conference on Data Mining, Vancouver, BC, Canada, 30 April–2 May 2015; pp. 757–765.
49. Holder, L.B.; Cook, D.J.; Djoko, S. Substructure Discovery in the SUBDUE System. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, AAAIWS’94, Seattle, WA, USA, 31 July–1 August 1994; pp. 169–180.
50. Gonzalez, J.; Holder, L.B.; Cook, D.J. Graph Based Concept Learning. In Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence, Austin, TX, USA, 30 July–3 August 2000; p. 1072.
51. Kuramochi, M.; Karypis, G. GREW-A Scalable Frequent Subgraph Discovery Algorithm. In Proceedings of the Fourth IEEE International Conference on Data Mining, ICDM’04, Brighton, UK, 1–4 November 2004; pp. 439–442.
52. Dwivedi, S.P.; Singh, R. Error-Tolerant Graph Matching using Node Contraction. Pattern Recognit. Lett. 2018, 116, 58–64.
53. Bhatia, V.; Rani, R. Ap-FSM: A parallel algorithm for approximate frequent subgraph mining using Pregel. Expert Syst. Appl. 2018, 106, 217–232.
54. Anchuri, P.; Zaki, M.J.; Barkol, O.; Golan, S.; Shamy, M. Approximate graph mining with label costs. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD’13, Chicago, IL, USA, 11–14 August 2013; pp. 518–526.
55. Flores-Garrido, M.; Carrasco-Ochoa, J.A.; Martínez-Trinidad, J.F. Mining Maximal Frequent Patterns in a Single Graph Using Inexact Matching. Knowl.-Based Syst. 2014, 66, 166–177.
56. Moussaoui, M.; Zaghdoud, M.; Akaichi, J. A New Framework of Frequent Uncertain Subgraph Mining. Procedia Comput. Sci. 2018, 126, 413–422.
57. Nasir, M.A.U.; Aslay, C.; Morales, G.D.F.; Riondato, M. TipTap: Approximate Mining of Frequent k-Subgraph Patterns in Evolving Graphs. ACM Trans. Knowl. Discov. Data 2021, 15, 1–35.
58. Zhang, S.; Yang, J. RAM: Randomized Approximate Graph Mining. In Proceedings of the 20th International Conference on Scientific and Statistical Database Management, SSDBM’08, Hong Kong, China, 9–11 July 2008; pp. 187–203.
59. Wu, D.; Ren, J.; Sheng, L. Uncertain maximal frequent subgraph mining algorithm based on adjacency matrix and weight. Int. J. Mach. Learn. Cybern. 2018, 9, 1445–1455.
60. Acosta-Mendoza, N.; Gago-Alonso, A.; Carrasco-Ochoa, J.; Martínez-Trinidad, J.F.; Pagola, J. A New Algorithm for Approximate Pattern Mining in Multi-graph Collections. Knowl.-Based Syst. 2016, 109, 198–207.
61. Acosta-Mendoza, N.; Gago-Alonso, A.; Carrasco-Ochoa, J.; Martínez-Trinidad, J.F.; Pagola, J. Extension of Canonical Adjacency Matrices for Frequent Approximate Subgraph Mining on Multi-Graph Collections. Int. J. Pattern Recognit. Artif. Intell. 2017, 31.
62. Acosta-Mendoza, N.; Carrasco-Ochoa, J.A.; Martínez-Trinidad, J.F.; Gago-Alonso, A.; Medina-Pagola, J.E. Mining Clique Frequent Approximate Subgraphs from Multi-Graph Collections. Appl. Intell. 2020, 50, 878–892.
63. Sanfeliu, A.; Fu, K.S. A distance measure between attributed relational graphs for pattern recognition. IEEE Trans. Syst. Man Cybern. 1983, SMC-13, 353–362.
64. Rossi, R.A.; Ahmed, N.K. The Network Data Repository with Interactive Graph Analytics and Visualization. In Proceedings of the AAAI, Austin, TX, USA, 25–30 January 2015.

Figure 1. Cases of subgraph expansion during the pattern growth process. (a) The frequent subgraph is highlighted in pink. (b) Expansion of the frequent subgraph joining the neighbor node $A$ through the edge $\frac{d}{}$, highlighted in yellow. (c) Expansion of the frequent subgraph adding the edge $\frac{d}{}$, highlighted in yellow, which creates a cycle.

Figure 2. Structure of node dictionary $γ$ and edge dictionary $δ$. Each dictionary stores the similarities between the labels of nodes and edges, respectively.

Figure 3. Cases of occurrence expansion. (a) An example of a candidate subgraph is highlighted in green. (b) The occurrence, highlighted in pink, does not expand. (c) The node $ν_{exp}$, highlighted in yellow along with its connecting edge $\frac{b}{}$, is a neighbor of the occurrence $o_{curr}$, which is highlighted in pink. (d) The node $ν_{exp}$ is already in the occurrence $o_{curr}$, highlighted in pink (as an extra node). (e) Node $ν_{exp}$ is in the graph of $o_{curr}$, but a path, highlighted in yellow, must be created to connect $ν_{exp}$ with $o_{curr}$, highlighted in pink. (f) The edge $ϵ_{new}$, highlighted in yellow, is a neighbor of $o_{curr}$, which is highlighted in pink.

Figure 4. Graph collection consisting of three graphs used to demonstrate how the proposed algorithm mined frequent approximate subgraphs with structural and label variations.

Figure 5. Graph collection consisting of three graphs used to compare the frequent approximate subgraphs mined by the proposed algorithm and REAFUM.

Figure 6. Additional frequent approximate subgraph, mined by REAFUM in graph $g_{1}$, and its occurrences in the graphs $g_{2}$ and $g_{3}$, highlighted in blue.

Figure 7. Expansion process of REAFUM and the proposed algorithm (AGCM-SLV) in the frequent approximate subgraph (highlighted in blue) within graph $g_{3}$.

Figure 8. Number of frequent approximate subgraphs mined and runtime of AGCM-SLV and REAFUM varying the dissimilarity threshold ($β$) with a constant support threshold ($α$) of 90%. (a) Number of frequent approximate subgraphs; (b) Runtime.

Figure 9. Number of frequent approximate subgraphs mined and runtime of AGCM-SLV and REAFUM varying the dissimilarity threshold ($β$) with a constant support threshold ($α$) of 80%. (a) Number of frequent approximate subgraphs; (b) Runtime.

Figure 10. Number of frequent approximate subgraphs mined and runtime of AGCM-SLV varying the support threshold from 70% to 90% and the dissimilarity threshold from 1 to 3. (a) Number of frequent approximate subgraphs mined; (b) Runtime.

Figure 11. Runtime and memory usage of the proposed algorithm as the graph collection size increases. (a) Runtime; (b) Memory usage.

Figure 12. Number of frequent approximate subgraphs mined by the proposed algorithm when varying the dissimilarity thresholds $β = 1$ and $β = 2$ with a constant support threshold $α$ = 90%.

Table 1. Characteristics of frequent approximate subgraphs algorithms.

Algorithm	Label Variations	Structural Variations
For single graph
SUBDUE (1994)	N/A	Node contraction
gApprox (2007)	Nodes	Edges
AGraP (2014)	Nodes and edges	Nodes and edges
MaxAFG (2014)	Nodes and edges	Nodes and edges
Ap-FSM (2018)	N/A	Based on edge uncertainty
MFP (2021)	Nodes and edges	Nodes and edges
TIPTAP (2021)	N/A	Based on edge variation
For graph collection
RAM (2008)	N/A	Edges
APGM (2010)	Nodes	N/A
VEAM (2012)	Nodes and edges	N/A
REAFUM (2015)	Nodes and edges	Nodes and edges
AMGMiner (2016)	Nodes and edges	N/A
MgVEAM (2017)	Nodes and edges	N/A
UMFGAMW (2018)	N/A	Based on edge uncertainty
CliqueAMGMiner (2020)	Nodes and edges	N/A
Proposed AGCM-SLV	Nodes and edges	Nodes and edges

Table 2. Frequent subgraphs mined by AGCM-SLV in the graph collection of Figure 4 with $α$ = 100% and $β$ = 0.

Frequent Subgraph	Occurrences
$B$	$B$ $g_{1}$ , $B$ $g_{2}$ , $B$ $g_{3}$

Table 3. Frequent approximate subgraphs mined by AGCM-SLV with $α$ = 100% and $β$ = 1.

Frequent Subgraph	Occurrences
$B$	$B$ $g_{1}$ , $B$ $g_{2}$ , $B$ $g_{3}$
$B \frac{D}{} A$	$B \frac{D}{} A$ $g_{1}$ , $B \frac{D}{} Z$ $g_{2}$ , $B \frac{E}{} A$ $g_{3}$ , $B \frac{F}{} A$ $g_{3}$
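
The role of the support threshold $α$ in the tables above can be sketched in Python as follows. The sketch assumes that the support of a subgraph is the fraction of graphs in the collection that contain at least one of its occurrences, so that several occurrences in the same graph count once. The function names and the data layout are hypothetical and serve only as an illustration.

```python
from typing import Iterable, Tuple


def support(occurrences: Iterable[Tuple[str, str]], num_graphs: int) -> float:
    """Fraction of graphs that hold at least one occurrence.

    Each occurrence is a (graph_id, description) pair; this layout is an
    assumption made for the example.
    """
    graphs_with_occurrence = {graph_id for graph_id, _ in occurrences}
    return len(graphs_with_occurrence) / num_graphs


def is_frequent(occurrences, num_graphs: int, alpha: float) -> bool:
    """A subgraph is frequent when its support is at least alpha."""
    return support(list(occurrences), num_graphs) >= alpha


# Occurrences of B-D-A as listed in Table 3 (two of them lie in g3).
occ = [("g1", "B-D-A"), ("g2", "B-D-Z"), ("g3", "B-E-A"), ("g3", "B-F-A")]
print(is_frequent(occ, num_graphs=3, alpha=1.0))  # prints True
```

Dropping the two occurrences in $g_{3}$ would leave a support of 2/3, which falls below $α$ = 100%.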

Table 4. Frequent approximate subgraphs mined by AGCM-SLV with $α$ = 100% and $β$ = 2.

Frequent Subgraph	Occurrences
$B$	$B$ $g_{1}$ , $B$ $g_{2}$ , $B$ $g_{3}$
$B \frac{D}{} A$	$B \frac{D}{} A$ $g_{1}$ , $B$ $g_{2}$ , $B \frac{B}{} Y$ $g_{2}$ , $B \frac{D}{} Z$ $g_{2}$ , $B$ $g_{3}$ , $B \frac{E}{} A$ $g_{3}$ , $B \frac{F}{} A$ $g_{3}$
$B \frac{E}{} A$	$B \frac{E}{} A$ $g_{3}$ , $B$ $g_{1}$ , $B \frac{A}{} A$ $g_{1}$ , $B \frac{D}{} A$ $g_{1}$ , $B$ $g_{2}$ , $B \frac{B}{} Y$ $g_{2}$ , $B \frac{D}{} Z$ $g_{2}$
$B \frac{B}{} Y$	$B \frac{B}{} Y$ $g_{2}$ , $B$ $g_{1}$ , $B \frac{A}{} A$ $g_{1}$ , $B \frac{D}{} A$ $g_{1}$ , $B$ $g_{3}$ , $B \frac{B}{} Y$ $g_{2}$ , $B \frac{D}{} Z$ $g_{2}$
$B \frac{A}{} A$	$B \frac{A}{} A$ $g_{1}$ , $B$ $g_{2}$ , $B \frac{B}{} Y$ $g_{2}$ , $B \frac{D}{} Z$ $g_{2}$ , $B$ $g_{3}$ , $B \frac{E}{} A$ $g_{3}$ , $B \frac{F}{} A$ $g_{3}$
$B \frac{F}{} A$	$B \frac{F}{} A$ $g_{3}$ , $B$ $g_{1}$ , $B \frac{A}{} A$ $g_{1}$ , $B \frac{D}{} A$ $g_{1}$ , $B$ $g_{3}$ , $B \frac{B}{} Y$ $g_{2}$ , $B \frac{D}{} Z$ $g_{2}$
$B \frac{D}{} Z$	$B \frac{D}{} Z$ $g_{2}$ , $B$ $g_{1}$ , $B \frac{A}{} A$ $g_{1}$ , $B \frac{D}{} A$ $g_{1}$ , $B$ $g_{3}$ , $B \frac{E}{} A$ $g_{3}$ , $B \frac{F}{} A$ $g_{3}$

Table 5. Additional frequent approximate subgraph mined by AGCM-SLV with $α$ = 100% and $β$ = 2, where $γ$ indicates that node label ‘Z’ has a 50% similarity to node label ‘A’ and node label ‘Y’ has a 50% similarity to node label ‘A’.

Frequent Subgraph	Occurrences
$A \frac{A}{} B \frac{D}{} A$	$A \frac{A}{} B \frac{D}{} A$ $g_{1}$ , $Z \frac{D}{} B \frac{B}{} Y$ $g_{2}$ , $A \frac{E}{} B \frac{F}{} A$ $g_{3}$

Table 6. Additional frequent approximate subgraphs mined by AGCM-SLV with $α$ = 100% and $β$ = 2, where $γ$ indicates that node label ‘Z’ has a 100% similarity to node label ‘A’ and node label ‘Y’ has a 100% similarity to node label ‘A’.

Frequent Subgraph	Occurrences
$A \frac{E}{} B \frac{F}{} A$	$A \frac{E}{} B \frac{F}{} A$ $g_{3}$ , $A \frac{A}{} B \frac{D}{} A$ $g_{1}$ , $Y \frac{B}{} B \frac{D}{} Z$ $g_{2}$
$B \frac{A}{} A \frac{C}{} A$	$B \frac{A}{} A \frac{C}{} A$ $g_{1}$ , $B \frac{B}{} Y \frac{C}{} Z$ $g_{2}$ , $B \frac{D}{} Z \frac{C}{} Y$ $g_{2}$ , $B \frac{E}{} A \frac{G}{} Z$ $g_{3}$
$B \frac{D}{} A \frac{C}{} A$

Table 7. Additional frequent subgraph mined by AGCM-SLV with $α$ = 100% and $β$ = 2, where $γ$ indicates that node label ‘Z’ has a 100% similarity to node label ‘A’ and node label ‘Y’ has a 100% similarity to node label ‘A’, and $δ$ indicates that edge label ‘F’ has a 100% similarity to edge label ‘D’.

Frequent Subgraph	Occurrences

Table 8. Datasets built from PTC-FM for our experiment.

Dataset	Total Graphs	Average Nodes	Average Edges
1–10	10	16.8	17.1
11–20	10	14.2	14.7
21–30	10	11.9	12.6
31–40	10	12.2	13
1–20	20	15.5	15.9
21–40	20	12.05	12.8
41–60	20	13.1	13.45
61–80	20	14.2	14.3
1–30	30	14.3	14.8
31–60	30	12.8	13.3
61–90	30	13.2	13.3
91–120	30	13.93	14.37
1–40	40	13.77	14.35
41–80	40	13.65	13.88
81–120	40	13.25	13.6
121–160	40	14.12	14.3
1–50	50	13.12	13.56
51–100	50	13.34	13.66
101–150	50	14.58	14.9
151–200	50	15.46	16.02

Table 9. Description of PTC datasets.

Dataset	Total Graphs	Average Nodes	Average Edges
PTC-FM	349	14.11	14.48
PTC-FR	351	14.56	15.00
PTC-MM	336	13.97	14.32
PTC-MR	344	14.29	14.69

Table 10. Percentage of frequent approximate subgraphs preserved across different noise levels, compared to those mined in the noise-free datasets, employing a support threshold $α$ = 90% and dissimilarity thresholds $β = 1$ and $β = 2$.

Noise Level	$β = 1$	$β = 2$
PTC-FM
5%	100%	100%
10%	100%	100%
15%	100%	100%
20%	100%	98.18%
PTC-FR
5%	100%	96.36%
10%	100%	98.18%
15%	100%	100%
20%	100%	100%
PTC-MM
5%	100%	100%
10%	100%	97.96%
15%	100%	100%
20%	100%	100%
PTC-MR
5%	100%	100%
10%	100%	100%
15%	100%	94.55%
20%	100%	94.55%
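
The percentages in Table 10 can be read as the share of noise-free patterns that are still found after noise is added. A minimal Python sketch of that computation is given below. It assumes that each pattern can be represented by a hashable canonical form (here a plain string), and the pattern sets are made up for illustration rather than taken from the experiments.

```python
from typing import Set


def preserved_percentage(clean: Set[str], noisy: Set[str]) -> float:
    """Percentage of noise-free patterns that also appear in the noisy run."""
    if not clean:
        raise ValueError("the noise-free pattern set must not be empty")
    return 100.0 * len(clean & noisy) / len(clean)


# Hypothetical pattern sets: 55 noise-free patterns, one of which is lost
# under noise, plus one pattern that appears only in the noisy run.
clean_patterns = {f"p{i}" for i in range(55)}
noisy_patterns = (clean_patterns - {"p0"}) | {"extra"}
print(round(preserved_percentage(clean_patterns, noisy_patterns), 2))  # prints 98.18
```

Patterns that appear only in the noisy run do not affect this measure, since it is computed relative to the noise-free set.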

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Jaramillo-Olivares, D.; Carrasco-Ochoa, J.A.; Martínez-Trinidad, J.F. An Algorithm for Mining Frequent Approximate Subgraphs with Structural and Label Variations in Graph Collections. Appl. Sci. 2025, 15, 7880. https://doi.org/10.3390/app15147880
