Graph Stream Compression Scheme Based on Pattern Dictionary Using Provenance

Abstract: With recent advancements in network technology and the increasing popularity of the internet, the use of social network services and Internet of Things devices has flourished, leading to a continuous generation of large volumes of graph stream data, where changes, such as additions or deletions of vertices and edges, occur over time. Additionally, owing to the need for the efficient use of storage space and security requirements, graph stream data compression has become essential in various applications. Even though various studies on graph compression methods have been conducted, most of them do not fully reflect the dynamic characteristics of graph streams and the complexity of large graphs. In this paper, we propose a compression scheme using provenance data to efficiently process and analyze large graph stream data. It obtains provenance data by analyzing graph stream data and builds a pattern dictionary based on this to perform dictionary-based compression. By improving the existing dictionary-based graph compression methods, it enables more efficient dictionary management through tracking pattern changes and evaluating their importance using provenance. Furthermore, it considers the relationships among sub-patterns using an FP-tree and performs pattern dictionary management that updates pattern scores based on time. Our experiments show that the proposed scheme outperforms existing graph compression methods in key performance metrics, such as compression rate and processing time.


Introduction
Graph data structures are now widely used to express complex structures in fields such as social networks, the Internet of Things, mobile devices, and bioinformatics. For instance, in social networks, users can be represented as vertices and their follow or friend relationships can be represented as edges [1][2][3][4]. These graphs are continuously updated with the registration of new users or changes in relationships among the existing users. Similarly, in bioinformatics, genetic information can be represented as vertices and the relationships among genes as edges, which also change with new gene discoveries or research findings. Such graphs change in real time and quickly accumulate information. Graphs with vertices and edges that change frequently are called dynamic graphs [5,6]. In contrast, static graphs refer to those with a fixed structure that does not change over time. Efficient analysis algorithms can be developed for static graphs owing to the consistency of their entire graph structure [7][8][9][10].
Dynamic graphs involve data that change continuously and are generated over time. They may also require changes, such as the addition or deletion of vertices or edges, to be processed sequentially; these changes occur in real time in the graph structure. As mentioned above, in a social network environment, data are updated with each event, such as adding or deleting friends, or interactions, such as comments or likes on posts. These data can be applied to human network analysis, user interest recommendations, etc. [11][12][13][14][15][16].
In a dynamic environment, changes in the graph occur through the addition of new vertices, deletion of existing vertices, and creation and removal of edges. In such an environment, complex analysis and processing algorithms that reflect changes in the graph in real time are required. One of the main challenges in a dynamic environment is that the size of the graph continuously increases over time. To efficiently manage ever-increasing data within limited storage space, graph compression is essential, allowing for the efficient use of storage space to accommodate increasing amounts of data [17][18][19][20][21][22]. Techniques that incorporate graph pattern mining methods also exist [23][24][25][26][27][28]. Generally, graph compression is performed using graph mining techniques to select frequently occurring sub-graphs as reference patterns, recording changes that occur in these reference patterns. These techniques can effectively compress data while preserving important information from the graph.
Existing methods of compressing graphs have the advantage of high compression rates [13,22]; however, they are not suitable for real-time data processing owing to their significant computational costs in the preprocessing stage. These approaches involve a time-consuming compression process, which constrains immediate processing in a graph stream environment. Pattern extraction techniques, such as those in [14][15][16][23][24][25][26], provide high accuracy and compression rates; however, their long processing times during graph compression also make them difficult to apply in real-time environments. As a solution to these issues, the use of provenance has been proposed [29][30][31].
Research studies often focus on summarizing graphs, but they frequently encounter limitations in compression rates. Gou et al. introduced GSS, a sketch-based approach that delivers high update speed and accuracy for graph stream summarization [19]. However, the impact of GSS on compression rates under various graph stream characteristics requires further investigation. The existing methods based on the minimum description length or graph partitioning also demonstrate constraints in managing the dynamic nature of graph streams. To tackle these challenges, we propose a novel graph stream compression scheme that leverages provenance information to develop a pattern dictionary. The proposed scheme enables high compression rates while efficiently accommodating the evolving properties of graph streams.
Provenance is metadata that tracks the change history and origin of data, allowing data changes to be traced. For example, in the case of Wikipedia, the process of multiple users creating, modifying, and deleting documents is recorded as provenance data, increasing the size of the original data by tens of times. This implies that managing provenance data can be more complex and challenging than managing the original data. However, if provenance data are used effectively, they can make graph storage more efficient and lead to more effective pattern management.
In this paper, we propose a new graph stream compression scheme that combines existing techniques for compressing large static graphs with provenance techniques. The proposed scheme compresses graph streams incrementally, considering changes in vertices and edges over time, and efficiently records and compresses these changes along with patterns in an in-memory environment using provenance data. By managing patterns based on provenance and maintaining their size, the proposed scheme effectively handles important patterns and increases the space efficiency of pattern management, retaining only essential patterns. Furthermore, the proposed scheme assigns scores to the latest patterns to respond to real-time changes in the graph stream while maintaining a certain compression rate.
To support this novel approach, we developed efficient algorithms for maintaining a dynamic dictionary of patterns and optimizing compression based on the relationships among patterns. This enables our method to achieve high compression rates while supporting fast update and query processing. Extensive experiments on real-world graph stream datasets demonstrate the superiority of the proposed scheme compared with existing graph compression methods that utilize pattern mining techniques. The results confirm that the proposed pattern dictionary management technique effectively improves upon existing pattern dictionary-based graph compression methods and outperforms graph compression techniques in a stream environment in terms of compression ratio, memory usage, and processing time.
This paper is organized as follows: Section 2 analyzes the existing schemes and enumerates the limitations of existing methods. Section 3 describes the proposed graph compression scheme. Section 4 presents a performance evaluation of the proposed scheme to confirm its superiority over the existing methods.

Related Work
This section reviews existing graph compression research, covering techniques that focus on the structure of graphs and techniques that compress frequent patterns in graphs. Some previous studies focus on the graph's structural characteristics. StarZIP [13] performs structural compression by identifying and reordering only sub-graph sets in a star shape, applying an encoding technique for compression, and incrementally processing the graphs. It also uses snapshots to create graph stream objects, which are then shattered to transform undirected graphs into directed ones. GraphZIP [22] compresses graph streams using clique-shaped sub-graphs. This technique decomposes the original graph into a set of cliques and then compresses each set of clique graphs. During this process, it focuses on minimizing the disk I/O by using efficient graph representation structures, such as CSC (compressed sparse column) or CSR (compressed sparse row). The main goal of these methods is to maximize space efficiency by compressing the graph as much as possible, without considering frequent or important patterns.
Research on compressing frequent patterns in graphs utilizes graph mining techniques. Graph mining is a process of extracting useful information and patterns from graphs, used for analyzing large volumes of complex network data. During this process, various algorithms and techniques are applied to consider the relationships among vertices and edges, graph structure, and dynamic changes. Frequent pattern mining [1,5-7,11,15,21] is a method of extracting meaningful information or patterns from data, used in various fields, such as pattern analysis, anomaly detection, and recommendation services. FP-Growth is a method that efficiently finds frequent patterns using an FP-tree structure [32,33]. In this method, minimum support is applied to find patterns that appear more than a certain number of times, and the support represents the proportion of the data that include a specific pattern out of the total data. The FP-tree starts with a null root node (vertex), storing the item and its repetition count in each vertex, and the final FP-tree is constructed after reading all transactions. However, unlike frequent pattern mining in static environments, frequent pattern mining in graph stream environments becomes more complex owing to dynamic data changes and information overflow. FP-streaming [14] dynamically detects frequent patterns over time to solve the problem of time-varying data input amounts. In these methods, the construction process of the pattern tree is similar to that of the FP-tree. The tree is constructed using a given similarity and support, managing time and support information in a table to detect all frequent patterns.
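The FP-tree construction described above can be illustrated with a minimal Python sketch. It is not taken from the cited works: the names FPNode and build_fp_tree are hypothetical, the toy transactions are invented, and minimum support is treated as an absolute occurrence count rather than a proportion.

```python
from collections import Counter


class FPNode:
    """One FP-tree vertex: an item and the number of transactions passing through it."""

    def __init__(self, item=None, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_support):
    """Build an FP-tree, keeping only items that occur in >= min_support transactions."""
    # First pass: count the number of transactions containing each item.
    freq = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in freq.items() if c >= min_support}

    root = FPNode()  # null root vertex
    # Second pass: insert each transaction with items in descending frequency order.
    for t in transactions:
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
            node.count += 1
    return root, frequent


# Invented toy transactions, for illustration only.
transactions = [["a", "b"], ["b", "c", "d"], ["a", "b", "c"], ["b", "c"]]
root, frequent = build_fp_tree(transactions, min_support=2)
```

Ordering items by descending frequency lets transactions that share frequent items share a prefix in the tree, which is what keeps the structure compact.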
GraphZip [11] proposes a dictionary-based compression method that increases compression efficiency by utilizing frequent patterns in graph streams. This technique applies dictionary-based compression along with the minimum description length principle to identify the maximum compression patterns in the graph stream. The input data comprise vertices, edges, and their labels; the process starts with storing vertex information and then processing edge information in a stream format. It sets a batch size to process vertices in bulk, and starts with a single edge to expand and generate patterns. Scores are assigned to the generated patterns, and all patterns are scored taking into account the occurrence frequency at the time of appearance and the total number of edges. Then, they are sorted in descending order to remove patterns that exceed the dictionary size. For each batch, the collected patterns are compressed and the information is recorded in a file. This process is repeated to complete the compression. However, a limitation of GraphZip [11] is that it does not utilize time-related information in pattern management. For example, a pattern that appears early in the graph, but no longer appears later, may still be maintained in memory if it is large and has received a high score. This can lead to wasted storage space and performance degradation. This issue becomes more critical in a graph stream environment, wherein data change continuously. Therefore, considering changes in the importance of patterns over time is an essential element in graph stream compression.
Our previous work introduced the idea of leveraging provenance metadata for the pattern-based compression of large RDF graphs [29][30][31]. The present study extends this approach to the more challenging context of graph streams, where the dynamic nature of data requires adaptive pattern mining and incremental compression. These studies propose a graph pattern-based compression method for large RDF documents using metadata, namely provenance data, which represent information on or the history of the data. This method reduces the size of string data by encoding specific graph patterns. While it exhibits superior compression performance to existing methods, it uses a statistics-based compression approach, which means that the compression performance can vary with parameter adjustments, and some data loss can occur. Furthermore, the compression process using provenance data can take longer than the existing techniques do. This is because compression methods including provenance data require processing and storing additional information. This significantly affects their overall performance, as finding a balance between compression efficiency and processing time can be a major constraint, especially in applications where real-time data processing is emphasized.
In graph stream environments, the size and arrival speed of data make it difficult to process them efficiently with traditional graph storage structures. Therefore, graph summarization or graph compression techniques are essential to effectively handling graph streams. Graph summarization aims to reduce the size of a graph while maintaining its characteristics and structure, while graph compression is reversible, allowing the original graph to be restored from its compressed form. Gou et al. proposed a graph stream compression technique called Graph Stream Sketch (GSS) [19]. GSS has linear memory usage and a high update speed, and supports most graph queries. It also shows much higher accuracy compared with that of existing methods. GSS compresses the original graph into a small graph sketch using hash functions and stores it in a novel data structure that combines fingerprints and hash addresses for differentiation. However, GSS has the limitation of high time costs when performing successor/precursor queries, as it needs to scan the mapped rows or columns. Moreover, the compression efficiency of GSS is not very high, resulting in lower compression ratios compared with existing lossless compression methods.
Existing graph compression techniques, primarily designed for static graphs, struggle to address the unique challenges posed by graph streams, such as high update rates, temporal dynamics, and scalability. To overcome these limitations, we propose a novel graph stream compression scheme that leverages provenance information to capture the temporal evolution of patterns. The proposed scheme maintains a dynamic dictionary of patterns and exploits their relationships to achieve high compression rates while supporting efficient update and query processing, thereby addressing the challenges of graph stream compression.

Overall Processing Procedure
This study proposes a graph stream compression scheme based on a pattern dictionary using provenance. The proposed scheme selects sub-graph patterns with high influence in the graph as reference patterns, considering the relationship between each pattern and its sub-patterns during this process and assigning scores accordingly. Previous studies [13,22] proposed schemes that offer high compression rates. However, they are not suitable for real-time data generation environments due to the significant amounts of time taken in the preprocessing stage to identify graph structures (star, clique, bipartite, etc.). Moreover, as the values being transformed increase, efficiency may decline compared with that when storing uncompressed data, and if specific graph structures are not found, the compression rate may decrease, leading to overall performance degradation.
On the other hand, the scheme proposed in [19] selects common graphs that appear in both vertices and edges as reference patterns, and describes the changed parts in those patterns. However, this scheme is not suitable for stream environments, as it is constrained by the maximum size of graph patterns, and the pattern search time increases with the size of the patterns. Additionally, if a pattern completely different from the reference patterns appears, the applicable patterns may be limited or nonexistent, necessitating a re-search for patterns, which can cause a decrease in the compression rate. Other considerations exist when detecting frequent patterns in a graph stream and applying them to compression. First, it is necessary to use limited storage space efficiently while also detecting frequent patterns rapidly.
In graph streams, data are continuously input and the analysis target data change in real time. Therefore, after a certain amount of data have been entered, it is necessary to delete data to free up memory for analyzing the next input data. Additionally, as the data that enter over time vary, the same patterns do not always appear, and they can change over time. Therefore, repeatedly verifying the isomorphism of the sub-graphs being compared for pattern detection is necessary. Typical subgraph isomorphism algorithms [3] involve finding patterns that match the graph's pattern perfectly; however, in this study, two graphs G_1 and G_2 are treated as matching when a subgraph of G_1 is isomorphic to a subgraph of G_2. Therefore, cases such as G_1 ⊆ G_2 are also considered isomorphic graphs. Lastly, when finding the pattern with the highest compression rate, considering time and frequency is crucial. This allows for the efficient application of frequent patterns to compression.
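The relaxed matching notion above, in which a pattern contained in a larger graph counts as a match, can be illustrated with a small brute-force check. This sketch is not the matching procedure used in the paper: the function name is_subgraph_match is hypothetical, graphs are assumed to be lists of directed edges over integer vertex IDs, and the exhaustive backtracking search is exponential in the worst case, so it is only suitable for tiny patterns.

```python
def is_subgraph_match(pattern_edges, graph_edges):
    """Return True if the pattern can be embedded in the graph.

    An embedding is an injective vertex mapping under which every pattern
    edge is also an edge of the graph (extra graph edges are allowed).
    """
    p_vertices = sorted({v for e in pattern_edges for v in e})
    g_vertices = sorted({v for e in graph_edges for v in e})
    graph = set(graph_edges)

    def extend(mapping):
        # All pattern vertices are mapped consistently: an embedding exists.
        if len(mapping) == len(p_vertices):
            return True
        pv = p_vertices[len(mapping)]
        for gv in g_vertices:
            if gv in mapping.values():
                continue  # keep the mapping injective
            mapping[pv] = gv
            # Check every pattern edge whose endpoints are both mapped so far.
            consistent = all((mapping[a], mapping[b]) in graph
                             for a, b in pattern_edges
                             if a in mapping and b in mapping)
            if consistent and extend(mapping):
                return True
            del mapping[pv]
        return False

    return extend({})


path = [(0, 1), (1, 2)]
triangle = [(10, 20), (20, 30), (30, 10)]
path_in_triangle = is_subgraph_match(path, triangle)
triangle_in_path = is_subgraph_match([(0, 1), (1, 2), (2, 0)], [(10, 20), (20, 30)])
```

Here the two-edge path is found inside the directed triangle, whereas the triangle cannot be embedded in a two-edge path.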
Figure 1 illustrates the overall structure of the proposed incremental frequent pattern compression scheme. The detailed modules are described in the following section. The main modules include the reference pattern generator, graph manager, pattern dictionary, and compressor. The graph manager organizes raw data into graph form and delivers them to the reference pattern generator. It also assigns scores to the reference pattern candidates generated by the reference pattern generator, which is responsible for the initial step of finding patterns in each graph. The patterns identified here become reference pattern candidates and are sent to the graph manager. The pattern dictionary stores information on each pattern. The compressor performs the compression process.

Graph Manager and Reference Pattern Generator
Figure 2 shows the structure and operation of the graph manager and reference pattern generator. The graph manager consists of two main modules. The first is the graph constructor. The graph constructor takes raw graph stream data as input and builds them into graph form.
The main functions of the compressor are as follows: First, it manages the pattern dictionary while deleting unimportant patterns. During the deletion process, all patterns are written to disk so that patterns used in previous compressions can still be decoded. The compressor divides the graph stream into fixed batch units for file management, and each compressed file includes a header specifying the patterns used in that file.
This structure aims toward the efficient compression and management of graph stream data, enhancing the efficiency of real-time data processing within limited memory and storage space. Additionally, the proposed scheme considers the continuous and dynamic nature of a graph stream, thus making it effective for data storage and processing.

Algorithm 1. Incremental frequent pattern mining
Input: graph G, batch B, dictionary P
Output: updated pattern dictionary P′
1 for each pattern p in P:
2   for each match m of pattern p embedded in batch B:

The reference pattern generator creates reference pattern candidates through a three-step process. First, in the frequent pattern initialization stage, the graph's frequent patterns are initialized. Second, in the incremental frequent pattern mining stage, frequent patterns are progressively searched for and identified. Lastly, in the reference pattern candidate generation stage, reference pattern candidates are created based on the results of incremental frequent pattern mining. The graph manager uses these reference pattern candidates to check for isomorphism between previously explored graphs and the current graph, thus repeating the process of expanding pattern graphs. During the initial exploration of the graph, as there are no reference patterns stored in the pattern dictionary, single-edge patterns are created as the initial state of all graphs, and these edges are then expanded to generate various pattern graphs. This process identifies and creates efficient compression patterns considering the dynamic and continuous characteristics of graph stream data.
Algorithm 1 shows the process followed by the reference pattern generator. In a graph stream, when vertices enter in sequence as in G(v), the patterns, P(v), stored in the pattern dictionary are indexed in the order they arrive (line 3). For instance, if graph G(v) = [132, 12, 51, 43, 21, 19, 12, 3, ...] is input, the input graph can be represented as the pattern graph P(v) = [0, 1, 2, 3, 4, 5, 1, 6, ...]. At this time, vertices already assigned an index use their existing index. Thus, if vertex 12 is input again, it is mapped as vertex 1. The next step is to check if two vertices are adjacent and proceed with the expansion (lines 4-5). Pattern expansion initializes new edges if the starting vertex of edge e ∈ g does not exist in the pattern or if the arriving vertex does not exist (lines 6-10), after which the expanded pattern is added (line 12). In this process, first, a new pattern is initialized; it checks if the edge's source vertex already exists in the pattern; if not, it adds that vertex to the new pattern and updates vertex_map. Last, the edges not processed in batch B are added to the dictionary, P, as single-edge patterns (line 15).
After executing Algorithm 1, each pattern is assigned a score. Subsequently, the results are sorted in descending order to determine the reference pattern candidates. For each pattern, the reference pattern score (RPS) is calculated to update the pattern dictionary if it is isomorphic to a subgraph g of graph G_B. Equation (1) calculates the pattern score. Here, α is a value between 0 and 1.
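The vertex indexing and single-edge expansion steps described for Algorithm 1 can be sketched in Python as follows. This is an illustrative approximation rather than the authors' algorithm: index_vertices and expand_pattern are hypothetical names, patterns are assumed to be frozensets of directed edges over indexed vertices, and scoring by Equation (1) is deliberately left out.

```python
def index_vertices(stream, vertex_map=None):
    """Map raw vertex IDs to dense indices in order of first arrival."""
    vertex_map = {} if vertex_map is None else vertex_map
    indexed = []
    for v in stream:
        if v not in vertex_map:
            # A vertex seen for the first time receives the next free index.
            vertex_map[v] = len(vertex_map)
        indexed.append(vertex_map[v])
    return indexed, vertex_map


def expand_pattern(pattern_edges, batch_edges):
    """Grow a pattern by one edge for every batch edge adjacent to it."""
    vertices = {v for e in pattern_edges for v in e}
    expanded = []
    for src, dst in batch_edges:
        if (src, dst) in pattern_edges:
            continue  # edge already belongs to the pattern
        if src in vertices or dst in vertices:
            expanded.append(pattern_edges | {(src, dst)})
    return expanded


# The vertex sequence from the example in the text.
indexed, vertex_map = index_vertices([132, 12, 51, 43, 21, 19, 12, 3])

# Hypothetical single-edge pattern and batch, for illustration only.
pattern = frozenset({(0, 1)})
grown = expand_pattern(pattern, [(1, 2), (3, 4), (0, 1)])
```

With the sequence from the text, vertex 12 keeps index 1 on its second appearance, and only the batch edge (1, 2), which shares a vertex with the pattern, produces an expanded pattern.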

Managing Pattern Dictionary
All patterns are stored in an in-memory space called a dictionary; this process is performed using the graph pattern manager described in Section 3.2. The creation and management of the pattern dictionary are some of the main roles of the graph manager, and are accomplished through the pattern manager. The first step involves pruning using time stamps and the FP-tree. This process determines and manages the importance of patterns. The time stamp indicates the lifespan of each pattern, while the FP-tree is used to assess the importance of patterns. It maintains the top-K most frequently used patterns to limit the size of the pattern dictionary. Each reference pattern consists of a pattern ID, the pattern's time stamp, and scores assigned considering the pattern's size and frequency. Here, the RPS is used as an indicator of the pattern's importance. To effectively use provenance data, patterns extracted through graph pattern mining can use provenance information to track changes in the graph. Finally, the graph is transformed and stored through a dictionary encoding process.
Figure 3 demonstrates how the graph pattern manager manages the pattern dictionary over time. Assuming a size of six for the pattern dictionary, during this process, patterns are stored in the pattern dictionary at each point in time, namely t1, t2, and t3, and updated through the graph pattern manager whenever the pattern dictionary is refreshed. At t1 and t2, since the size of the pattern dictionary is larger than the number of existing patterns, new patterns can be added to the pattern dictionary. At t3, no spare capacity exists in the pattern dictionary; therefore, patterns are deleted based on their scores. In Figure 3, the pattern with the lowest score, p1, is deleted, and a new pattern, p7, is added. The graph explorer performs the role of discerning the similarity between existing reference and new patterns. This includes the process of calculating the similarity between the reference and new patterns through subgraph isomorphism tests for each graph. Most existing studies determine reference patterns as graphs where vertices and edges commonly appear, and use a method to describe changes from the reference pattern in the pattern itself. However, if only perfectly matching graph patterns are searched for compression, then too few patterns would be available for compression, thus making it difficult to apply compression. Therefore, this study only performs comparative operations with patterns that are determined through the RPS value by the reference pattern generator and have a similarity level above a certain threshold, thereby reducing the overall computational cost.
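The score-based replacement illustrated in Figure 3 can be sketched as a bounded dictionary that evicts its lowest-scoring entry when capacity is exceeded. The class name PatternDictionary and the numeric scores are hypothetical; the actual scores would come from the RPS, which is not reproduced here.

```python
class PatternDictionary:
    """Bounded pattern store that evicts the lowest-scoring pattern when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.scores = {}  # pattern ID -> score

    def add(self, pattern_id, score):
        """Insert or update a pattern; return the evicted pattern ID, if any."""
        self.scores[pattern_id] = score
        if len(self.scores) <= self.capacity:
            return None
        # Over capacity: drop the pattern with the lowest score.
        victim = min(self.scores, key=self.scores.get)
        del self.scores[victim]
        return victim


# Capacity of six, as in the Figure 3 example; the scores are invented.
dictionary = PatternDictionary(capacity=6)
for i, score in enumerate([0.1, 0.4, 0.6, 0.3, 0.8, 0.7], start=1):
    dictionary.add(f"p{i}", score)
evicted = dictionary.add("p7", 0.5)
```

Under these invented scores, adding p7 to the full dictionary evicts p1. Note that in this simple sketch a newly added pattern would itself be evicted if its score were the lowest.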
The graph pattern manager plays a crucial role in determining and managing the importance of patterns, using a pruning algorithm with time stamps and the FP-tree, and limiting the number of patterns stored in the pattern dictionary. These elements are combined to calculate the RPS, and patterns are kept in the dictionary in the order of their scores. Actual graph streams cannot always guarantee consistent patterns, as these may change over time. Considering these dynamic characteristics, this study sets a threshold γ; if a specific pattern does not occur for a certain number of time windows matching this threshold, that pattern is deleted. This takes into account that the importance of patterns may change over time, removing older or less important patterns to enhance memory efficiency and focus more on the latest data. This process is defined as the time stamp trim process. Through this process, dynamic changes in graph stream data can be effectively reflected, and pattern dictionary management can be optimized to enhance the efficiency of real-time data processing.
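The time stamp trim process can be sketched as a per-window update of consecutive-miss counters. This is a minimal illustration under stated assumptions: the function name timestamp_trim is hypothetical, a pattern is assumed to be removed once its consecutive misses exceed γ, and the contents of the third window (including pattern p5) are invented for the walk-through.

```python
def timestamp_trim(misses, seen_in_window, gamma):
    """Update consecutive-miss counters for one window and drop stale patterns.

    misses maps each pattern ID to the number of consecutive windows in
    which it did not appear. Patterns whose count exceeds gamma are removed.
    """
    updated = {}
    for pid in set(misses) | set(seen_in_window):
        # A pattern seen in this window is reset; otherwise its miss count grows.
        count = 0 if pid in seen_in_window else misses[pid] + 1
        if count <= gamma:
            updated[pid] = count
    return updated


# Windows W_0 and W_1 follow the example in the text; W_2 is hypothetical.
windows = [{"p1", "p2", "p3"}, {"p2", "p3", "p4"}, {"p3", "p5"}]
state = {}
for window in windows:
    state = timestamp_trim(state, window, gamma=1)
```

With γ = 1, p1 survives its first miss in the second window and is dropped after its second consecutive miss in the third window, while p2 and p4 each carry a single miss.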
Graphs are constructed through several batch (B_n) processes. Each batch contains edge information, and the batches are combined according to the window size to form the graph. The sizes of the batch and window are specified by the user. An example of graph and pattern creation is as follows. As shown in Figure 4, when γ = 1 and W_batch = 3, the window W_0 consists of three batches as follows: W_0 = {B_0, B_1, B_2}. The patterns occurring at this point are shown at the top of the figure as p1, p2, p3, which represent the patterns of the graph entered up to that point. At the next point in time, W_1, the patterns of the input graph are p2, p3, p4. As time progresses from W_0 to W_1, pattern p1 is not input; therefore, the time stamp of that pattern is adjusted to T − n, that is, T − 1. Here, when the number of consecutive time windows where a pattern does not appear exceeds the threshold γ, that pattern is considered unimportant and removed. For instance, setting γ = 1 means a pattern will be discarded if it fails to appear even once. In the figure, because pattern p1 is not input at the next time point W_2, it is removed at time W_2. Similarly, because patterns p4 and p2 are not input at the next time point, their time stamp values are adjusted to T − 1.
The proposed scheme applies the FP-Growth algorithm to the patterns stored in the dictionary to prune the infrequent patterns. This policy keeps the size of the pattern dictionary at a manageable level, maintaining only patterns appropriate for an in-memory environment. In other words, patterns with low usage are removed from the pattern dictionary, while those with high usage are updated to maintain the most current patterns; this enhances the compression efficiency in the dynamically changing graph stream environment. Figure 5 shows an example of the pruning process applied with the FP-Growth algorithm. The proposed scheme structures the vertices of the FP-tree as follows.
(P_id, B, T, Frequency)W_0.
When the frequency threshold is 3 and the time threshold is 2, vertices in the FP-tree whose frequency value is below the threshold are deemed infrequent patterns and removed. Thus, pattern P_1, which does not meet the threshold in window W_1, is removed. Any other patterns that do not meet the threshold in the window are removed as well. This process reduces the storage cost of maintaining the patterns and secures space for new patterns to be added at the next point in time.
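The aging and threshold rules described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it does not build an FP-tree or run FP-Growth, the `PatternNode` record and the `process_window` function are hypothetical names, and reading T as "consecutive windows without an occurrence" and B as "batch of last observation" are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class PatternNode:
    """Hypothetical record for one FP-tree vertex: (P_id, B, T, Frequency)."""
    pattern_id: str
    batch: int      # B: batch in which the pattern was last observed (assumed)
    missed: int     # T: consecutive windows without an occurrence (assumed)
    frequency: int  # accumulated occurrence count


def process_window(nodes, seen_counts, batch, min_frequency, gamma):
    """Age, refresh, and prune pattern records after one window."""
    # Refresh patterns that occurred in this window and age the others.
    for pattern_id, node in nodes.items():
        count = seen_counts.get(pattern_id, 0)
        if count > 0:
            node.missed = 0
            node.frequency += count
            node.batch = batch
        else:
            node.missed += 1
    # Register patterns that appear for the first time.
    for pattern_id, count in seen_counts.items():
        if pattern_id not in nodes:
            nodes[pattern_id] = PatternNode(pattern_id, batch, 0, count)
    # Keep a pattern only if it passes both thresholds.
    return {
        pattern_id: node
        for pattern_id, node in nodes.items()
        if node.missed <= gamma and node.frequency >= min_frequency
    }


# Windows loosely following the Figure 4 narrative (the counts are made up).
windows = [
    {"p1": 3, "p2": 3, "p3": 3},  # W0
    {"p2": 3, "p3": 3, "p4": 3},  # W1: p1 is absent once
    {"p3": 3},                    # W2: p1 is absent a second time
]
nodes = {}
for batch, seen_counts in enumerate(windows):
    nodes = process_window(nodes, seen_counts, batch, min_frequency=3, gamma=1)
    print(batch, sorted(nodes))
```

With γ = 1, p1 survives W1 with an aged time stamp and is dropped at W2, which matches the behavior described for Figure 4.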
Figure 6 shows how input patterns are stored in the pattern dictionary, together with the information kept on each pattern. The pattern dictionary stores various pieces of information on each pattern, such as pattern ID, frequency, size, provenance, batch, and window. The reference pattern generator and graph manager use the content stored in the pattern dictionary to determine the reference pattern and decide which patterns to treat as frequently occurring. A window consists of several batches. Figure 6 shows an example where a window is composed of three batches. Each batch involves the process of expanding the patterns stored in the pattern dictionary. That is, when performing three batches, the graph is composed of a total of four edges, which is the maximum number of batches + 1.
When the input graph is as shown on the left side in Figure 7, the initial form of the graph in the first batch is made up of single edges. These graphs all have the same form but are distinguished by their vertex labels. In the second batch, each pattern is expanded. Here, provenance records how each pattern has been expanded from the data already present in the pattern dictionary. The scheme writes all updated information and patterns in the pattern dictionary to disk to maintain information on all patterns. Using this information, the original data can be restored without any loss during graph compression and restoration.
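To make the role of provenance concrete, the sketch below stores, for each dictionary entry, the pattern it was expanded from and the edge that was added, and then recovers the full edge list by following those links. The `DictEntry` layout and the `expand` and `restore_edges` helpers are hypothetical and only illustrate the idea; the actual dictionary also holds frequency, size, and window information, and its on-disk format is not specified here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Edge = Tuple[str, str]


@dataclass(frozen=True)
class DictEntry:
    """Hypothetical dictionary entry that keeps one step of provenance."""
    pattern_id: int
    parent_id: Optional[int]  # pattern this one was expanded from
    added_edge: Edge          # edge appended in this expansion step
    batch: int                # batch in which the expansion happened


def expand(dictionary, parent_id, edge, batch):
    """Create a new pattern by adding one edge to an existing pattern."""
    pattern_id = len(dictionary) + 1
    dictionary[pattern_id] = DictEntry(pattern_id, parent_id, edge, batch)
    return pattern_id


def restore_edges(dictionary, pattern_id):
    """Follow provenance links back to the single-edge ancestor."""
    edges = []
    current = pattern_id
    while current is not None:
        entry = dictionary[current]
        edges.append(entry.added_edge)
        current = entry.parent_id
    return list(reversed(edges))


dictionary = {}
p1 = expand(dictionary, None, ("A", "B"), batch=1)  # single edge
p2 = expand(dictionary, p1, ("B", "C"), batch=2)    # expanded in batch 2
p3 = expand(dictionary, p2, ("C", "D"), batch=3)    # expanded in batch 3
print(restore_edges(dictionary, p3))
# [('A', 'B'), ('B', 'C'), ('C', 'D')]
```

Because every expansion step is recorded, a pattern identifier alone is enough to rebuild the edges it stands for, which is what allows restoration without loss.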

Graph Compression Process
The compression process in this study, similar to that in the existing studies, involves (i) reading data up to the batch size and window size of the input original graph, (ii) constructing the graph, (iii) finding patterns, and (iv) representing these patterns with pattern identifiers. Figure 7 illustrates the proposed graph compression process and its structure. As described earlier, each component in the proposed scheme exists in memory, and the actual compression is performed by the compressor. In Figure 7, the compression steps represent the stages in compressing the graph. First, the input is read and represented as a graph. Then, when a graph is constructed and a certain pattern, P1, is formed by A − B − C − D, the part corresponding to the pattern is replaced with P1, and the details regarding that pattern are written along with the pattern. This process is repeated for each batch to perform the actual compression, as represented on the right side of Figure 7. That is, as many compressed graph files are created as there are batches, whereas the pattern dictionary only appends the updated pattern information to a single file.
Algorithm 2 presents the graph compression process. First, the graph for compression is initialized (line 1). Next, the set marked_matched_patterns, which tracks the patterns used for compression, is initialized (line 2). The findMatchedPatternsOfPatternInGraph function finds the matched patterns, i.e., the parts of the existing graph that match each pattern p in the given pattern dictionary P (line 5). A matched pattern found in the original graph is a part to be expressed as a compressed graph pattern. Afterward, the elements connecting the original graph and the patterns are added (lines 7-14). Finally, all vertices and edges not included in any pattern are added to the compressed graph. This step preserves the structure and connectivity of the original graph. The resulting graph object is then returned and written to disk to conclude the compression process (lines 16-21).
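The overall flow of Algorithm 2 can be sketched in a few lines. This is a deliberately simplified stand-in and not the algorithm itself: patterns are matched by exact comparison of vertex-labeled edges instead of subgraph matching, edges are treated as undirected, and the `compress`, `decompress`, and `norm` helpers are hypothetical names introduced for the example.

```python
def norm(u, v):
    """Normalize an undirected edge so that (u, v) and (v, u) compare equal."""
    return (u, v) if u <= v else (v, u)


def compress(edges, pattern_dictionary):
    """Replace edge sets that match dictionary patterns with pattern ids.

    edges: iterable of (u, v) pairs for one batch.
    pattern_dictionary: {pattern_id: list of (u, v) pairs}.
    Larger patterns are tried first; matching is an exact label comparison.
    """
    remaining = {norm(u, v) for u, v in edges}
    matched_patterns = []
    ordered = sorted(pattern_dictionary.items(), key=lambda kv: -len(kv[1]))
    for pattern_id, pattern_edges in ordered:
        wanted = {norm(u, v) for u, v in pattern_edges}
        if wanted and wanted <= remaining:
            matched_patterns.append(pattern_id)
            remaining -= wanted
    # Edges not covered by any pattern are kept as they are.
    return {"patterns": matched_patterns, "edges": sorted(remaining)}


def decompress(compressed, pattern_dictionary):
    """Expand pattern ids back into edges and merge them with the rest."""
    edges = set(compressed["edges"])
    for pattern_id in compressed["patterns"]:
        edges |= {norm(u, v) for u, v in pattern_dictionary[pattern_id]}
    return sorted(edges)


pattern_dictionary = {"P1": [("A", "B"), ("B", "C"), ("C", "D")]}
batch = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
packed = compress(batch, pattern_dictionary)
print(packed)
# {'patterns': ['P1'], 'edges': [('D', 'E')]}
print(decompress(packed, pattern_dictionary) == sorted(batch))
# True
```

The A − B − C − D path is replaced by the identifier P1, while the uncovered edge (D, E) is carried over unchanged so that the original graph can be rebuilt.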

Performance Evaluation
The performance of the proposed scheme was evaluated against that of several existing methods to validate its effectiveness. The experiments had two main focuses. First, a dataset with repeated patterns was created following [27] to verify whether the proposed scheme could identify the patterns necessary for compression. Second, experiments were conducted on various real-world graph datasets. Table 1 presents the experimental environment. The proposed graph stream compression method is implemented in Python 3.8, leveraging key libraries such as NumPy 1.21 and SciPy 1.7 for efficient numerical computations. NetworkX 2.6 is used for graph data structures and algorithms, and python-igraph 0.9.6 is used for fast graph processing. The core algorithms for pattern mining, dictionary maintenance, and compression are built from scratch, while the above libraries are used for data preprocessing, result analysis, and performance comparison.
Equation (2) calculates the size of the compressed graph relative to that of the original one. Using this equation, the extent of compression relative to the size of the original data can be determined. The performance evaluation assesses changes in compression time and rate depending on the pattern dictionary size in a graph stream environment. Additionally, the change in compression rate with the batch size can be examined; thus, the performance of the proposed scheme can be verified and validated under various conditions.
Compression ratio = Compressed graph size / Original graph size (2)
Finally, the experiments in this study were conducted with a window size, W_batch, of 3, a threshold, γ, of 2, and an α of 0.5.
First, an experiment was conducted to verify whether the proposed scheme could find the target subgraphs. The datasets presented in Table 2 cover six patterns: three-clique, four-clique, four-star, four-path, five-path, and eight-tree. Experimental datasets with a desired pattern frequency were created following [27]. Each dataset consisted of 1000 vertices and 10,000 edges. Table 2 summarizes the experimental results on these synthetic graph datasets, each including a specific type of subgraph pattern (e.g., three-clique, four-path). "Cov." stands for coverage, indicating the percentage of the dataset covered by the corresponding pattern. "Accuracy" measures the ratio of correctly identified patterns to the total number of true patterns. When the running time exceeded 10 min (600 s), the run was treated as failed and marked as timed out. The experimental results show that the proposed scheme, while generally taking more time than GraphZip [11], accurately found all the subgraph patterns. Furthermore, the proposed scheme found accurate subgraph patterns faster than SUBDUE [27]. The proposed scheme achieves 100% accuracy for all tested datasets, demonstrating its ability to thoroughly detect the embedded patterns, which is crucial for effective compression.
Table 3 presents the real-world datasets used in the experiment. These datasets are frequently used in applications such as graph pattern mining. The table also lists the labels used for the datasets in subsequent experiments, the number of vertices and edges, a description of the dataset contents, and their size on disk. We evaluate the performance of the proposed scheme on five real-world graph stream datasets from various domains, as summarized in Table 3 (see Appendix A for the source of each dataset). The datasets are chosen not only because they are widely used benchmarks in graph mining research, but also because of their diversity in graph size, density, and temporal characteristics. Specifically, DBLP represents a co-authorship network exhibiting steady growth over time, while YouTube and Skitter contain more dynamic and bursty interactions. LiveJournal and NBER capture large-scale social and citation networks, respectively, allowing us to test the scalability of the proposed scheme. By covering a broad spectrum of graph streams, we aim to provide a comprehensive evaluation of the proposed scheme's effectiveness and robustness.
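For measurements on such datasets, the compression ratio in Equation (2) can be computed from on-disk sizes. The helper below is a small sketch under stated assumptions: the function name and parameters are hypothetical, the sizes are plain byte counts, and whether the pattern dictionary file is counted as part of the compressed size is a choice made here for illustration, not something the evaluation specifies.

```python
def compression_ratio(original_bytes, batch_file_bytes, dictionary_bytes=0):
    """Compressed graph size divided by original graph size (lower is better).

    original_bytes: size of the original graph data.
    batch_file_bytes: sizes of the compressed files, one per batch.
    dictionary_bytes: size of the pattern dictionary file, if it is counted.
    """
    if original_bytes <= 0:
        raise ValueError("original_bytes must be positive")
    compressed_bytes = sum(batch_file_bytes) + dictionary_bytes
    return compressed_bytes / original_bytes


# Made-up sizes: three compressed batch files plus one dictionary file.
ratio = compression_ratio(1_000_000, [120_000, 110_000, 95_000], 25_000)
print(ratio)  # 0.35
```

A value of 0.35 would mean that the compressed output occupies 35% of the original size.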
Table 4 presents the average number of patterns, average processing time, and average compression rate as a function of the batch size. Owing to the characteristics of stream-based pattern mining techniques, both the processing time and the amount of memory required increase with the batch size. Consequently, an appropriate batch size may need to be chosen depending on the graph compression application. In this study, the batch size was fixed at 300 to match the size of the pattern dictionary. This allowed for a stable measurement of the performance of the proposed scheme.
Figure 8 shows the time spent on each step of the overall graph compression process as a function of the pattern dictionary size. Specifically, Figure 8a and Figure 8b show the results for the proposed scheme and for one of the existing schemes, GraphZip, respectively. The proposed scheme takes longer in pattern management owing to its more complex policy for scoring graph patterns. In pattern-based compression methods, finding patterns occupies most of the time; the proposed scheme generally spends more than 10% of the total time on managing patterns, while GraphZip spends less than 5%. In other words, the proposed scheme spends more time managing and allocating more complex patterns than the existing methods do, thereby demonstrating a higher level of graph pattern recognition and compression efficiency.
Additionally, experiments on some datasets confirmed that the proposed scheme required less time than the existing methods. This implies that the proposed scheme can balance efficiency and performance depending on the situation.
Figure 9 shows the compression rate and the time taken for compression as a function of the pattern dictionary size. The left-side plots show the compression rate, while the right-side ones show the compression execution time. The proposed scheme demonstrated superior performance in most experiments, especially on larger datasets. One of the existing methods, GraphZip [11], assigns a value to each pattern based on size and frequency, which may lead to performance degradation if high-scored patterns do not reappear.
In contrast, the proposed scheme considers frequency, size, and time for each pattern. Although it takes more time owing to the more complex calculations required for each pattern, it retains the more important patterns and deletes the less significant ones, thus managing the pattern dictionary more effectively.
Figure 10 shows the results of a performance comparison between the proposed scheme and existing incremental graph pattern extraction methods, using the largest dataset, LiveJournal. GRAMI [28] and FSM [10], which, like GraphZip [11], find frequent patterns in graphs in a stream environment, were included in the experiment. The horizontal axis of the figure represents the percentage of batch processing progress in the graph stream data, divided into intervals of 5%. That is, at a value of 5 on the horizontal axis, approximately 5% of the total graph stream has been processed. The total execution time is as shown in Figure 9c. For this dataset, with a batch size of 300, approximately 115,000 iterations of processing were performed. The experimental results showed that for up to 70% of the entire process, GraphZip outperformed the proposed scheme in terms of processing time; beyond that point, the proposed scheme outperformed GraphZip. This indicates that the proposed scheme performs faster computations by eliminating unnecessary patterns. In contrast, the existing methods GRAMI and FSM saw an exponential increase in computation time in the latter half as the number of patterns to compare increased.
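As a rough aid for reading the horizontal axis of Figure 10, the percentages can be mapped back to batch iterations. The sketch below assumes the approximate total of 115,000 iterations stated above; the function name is hypothetical.

```python
def progress_checkpoints(total_batches, step_percent=5):
    """Map each progress percentage to the batch index where it is reached."""
    return {
        percent: (total_batches * percent + 99) // 100  # integer ceiling
        for percent in range(step_percent, 101, step_percent)
    }


checkpoints = progress_checkpoints(115_000)
print(checkpoints[5], checkpoints[70], checkpoints[100])
# 5750 80500 115000
```

Under this assumption, each 5% interval corresponds to roughly 5750 batches, and the 70% mark discussed above falls at roughly batch 80,500.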

Conclusions
In this paper, we proposed an incremental frequent pattern-based compression scheme for processing graph streams. It identifies frequent patterns through graph pattern mining, assigns scores to the patterns, and selects reference patterns from among them. Additionally, it utilizes pattern and provenance information to leverage the change history of graphs. The proposed scheme was observed to detect patterns and perform compression faster than the existing techniques. It can be applied to dynamically changing stream graphs and improves space efficiency by storing only the most efficient patterns in memory. Furthermore, by maintaining the latest patterns, the proposed scheme improves the compression efficiency and processing speed of graph streams that change in real time.
Performance evaluation results showed that in environments with repeated patterns, the proposed scheme performed similarly to the existing methods. For real-world data, it yielded better compression rates and faster processing times in most environments. The smaller pattern dictionary sizes used in this scheme facilitate more effective compression compared with other pattern mining schemes. Especially in real-time processing environments with strict latency limits, the proposed scheme outperformed other graph compression or graph pattern mining schemes in terms of processing time.
Despite the promising results, the proposed scheme has some limitations that need to be addressed in future research. One important issue is the lack of consideration for data security and privacy in the proposed graph stream compression method. While the proposed scheme enables the efficient compression of large-scale, ever-growing graph streams, it does not incorporate any specific mechanisms to protect sensitive information that may be present in the data. This could lead to privacy breaches or a misuse of personal data when the compressed graph is stored or processed in various application domains. Furthermore, we did not establish clear criteria or quantitative limits for the formation of new graph vertices. Moreover, the statistical characteristics of the experimental datasets were not analyzed in detail. These aspects should be further investigated to ensure the robustness and generalizability of the proposed compression method. In addition, in some cases the scheme does not significantly excel in compression rate, and guaranteeing performance for more complex patterns or scalability might be difficult. To address these issues, we plan to enhance its performance via techniques utilizing GPUs and to conduct experiments with various other stream-based graph compression schemes.

Figure 1. Overall structure of the proposed scheme.

Figure 2 shows the structure and operation of the graph manager and reference pattern generator. The graph manager consists of two main modules. The first is the graph constructor, which takes raw graph stream data as input and builds them into a graph. Raw graph stream data consist of edge data, for example {(v1, v2), (v2, v3), ...}, which represents an edge stream. The graph constructed by the graph constructor is then passed to the reference pattern generator to find frequent patterns. The second module, the graph pattern manager, receives reference pattern candidates from the reference pattern generator, and stores and manages them in the pattern dictionary. Pattern management includes deciding reference patterns based on pattern frequency and size, and deleting less important patterns. This is explained in detail in Section 3.3.
Appl. Sci. 2024, 14, x FOR PEER REVIEW
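The graph constructor step can be illustrated with a short sketch that groups an edge stream into batches, combines consecutive batches into windows, and builds an adjacency structure. All function names here are hypothetical, and the assumption that the window slides forward by one batch at a time is made for the example only.

```python
from collections import deque
from itertools import islice


def batches(edge_stream, batch_size):
    """Yield consecutive lists of at most batch_size edges."""
    iterator = iter(edge_stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def sliding_windows(edge_stream, batch_size, window_batches):
    """Yield the edges of each window made of window_batches recent batches."""
    recent = deque(maxlen=window_batches)
    for batch in batches(edge_stream, batch_size):
        recent.append(batch)
        if len(recent) == window_batches:
            yield [edge for b in recent for edge in b]


def build_graph(edges):
    """Build an undirected adjacency map from a list of (u, v) edges."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


stream = [("v1", "v2"), ("v2", "v3"), ("v3", "v4"), ("v4", "v5")]
for window in sliding_windows(stream, batch_size=1, window_batches=3):
    print(sorted(build_graph(window)))
# ['v1', 'v2', 'v3', 'v4']
# ['v2', 'v3', 'v4', 'v5']
```

Using generators keeps only the most recent batches in memory, which suits a stream whose total length is not known in advance.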

Figure 2. Process flow between graph manager and reference pattern generator.

Figure 3. Example of pattern dictionary management by graph manager over time.

Figure 5. Pruning process applied with the FP-Growth algorithm.

Figure 6. Construction process of pattern dictionary and utilization of provenance.

Figure 7. Graph compression process and structure.

Figure 8. Time requirement for each process step in the complete process: (a) proposed scheme; (b) GraphZip.

Figure 9. Compression rates (left column) and runtimes (right column) of the proposed scheme on various real-world datasets for different pattern dictionary sizes. The compression rate is the ratio of the compressed graph size to the original graph size; lower is better. The runtime includes both the mining and encoding phases. The proposed scheme consistently outperforms the baseline (GraphZip) in terms of compression rate, especially on larger datasets (e.g., Skitter, NBER, and LiveJournal). Moreover, the performance gap tends to widen as the dictionary size increases ((a) 30, (b) 50, (c) 100), indicating the effectiveness of our dictionary maintenance strategy based on dynamic pattern scoring. On the other hand, the runtime of the proposed scheme is slightly higher than that of GraphZip in most cases, which can be expected given the additional complexity of provenance-based pattern extraction. However, the gap is not significant and is justified by the substantial improvement in compression performance.

Figure 10. Comparison of performance metrics of the proposed scheme and other pattern mining schemes in streaming environments.

Table 2. Synthetic graph experiment results.

Table 3. Real-world datasets utilized in the experiment.

Table 4. Average number of patterns and processing time according to batch size.