An Efﬁcient Spark-Based Hybrid Frequent Itemset Mining Algorithm for Big Data

: Frequent itemset mining (FIM) is a common approach for discovering hidden frequent patterns from transactional databases used in prediction, association rules, classiﬁcation, etc. Apriori is an FIM elementary algorithm with iterative nature used to ﬁnd the frequent itemsets. Apriori is used to scan the dataset multiple times to generate big frequent itemsets with different cardinalities. Apriori performance descends when data gets bigger due to the multiple dataset scan to extract the frequent itemsets. Eclat is a scalable version of the Apriori algorithm that utilizes a vertical layout. The vertical layout has many advantages; it helps to solve the problem of multiple datasets scanning and has information that helps to ﬁnd each itemset support. In a vertical layout, itemset support can be achieved by intersecting transaction ids (tidset/tids) and pruning irrelevant itemsets. However, when tids become too big for memory, it affects algorithms efﬁciency. In this paper, we introduce SHFIM (spark-based hybrid frequent itemset mining), which is a three-phase algorithm that utilizes both horizontal and vertical layout diffset instead of tidset to keep track of the differences between transaction ids rather than the intersections. Moreover, some improvements are developed to decrease the number of candidate itemsets. SHFIM is implemented and tested over the Spark framework, which utilizes the RDD (resilient distributed datasets) concept and in-memory processing that tackles MapReduce framework problem. We compared the SHFIM performance with Spark-based Eclat and dEclat algorithms for the four benchmark datasets. Experimental results proved that SHFIM outperforms Eclat and dEclat Spark-based algorithms in both dense and sparse datasets in terms of execution time.


Introduction
We are currently living in the big data age. Data appear everywhere in a variety of formats and types ranging from structured to unstructured data and are produced by a huge number of sources across a wide range of disciplines and types, including transactional systems, user interactions, social networks, the Internet of Things, the World Wide Web, and many others. Companies and individuals gather and store all these generated data to analyze them for insight, knowledge, and decision making. Therefore, we have been swamped with big data, not just because we already have large amounts of data that need to be processed but also because the amount of data is rapidly growing every moment. The concept of big data has some properties that are collectively known as the "3Vs" model. Volume is defined as the amount of data; enormous amounts of data are generated and gathered. Velocity refers to the high rate at which data are created, gathered, and processed (streams, batch, near-real-time, and real-time). Variety indicates the different types of data: audio, images/video, and text; conventional structured data; and mixed data. In addition, there are two more features added to the "3Vs" model and known as the "5Vs" model. Veracity refers to how much the data is accurate and trusted when it comes from various Data 2022, 7, 11 3 of 22 elements in addressing FIM problems, diffset will enhance execution time and memory consumption because smaller sets will be mined in less time and consume less space. FIM algorithms are efficient when practicing on small datasets, but when the data gets bigger, they start to suffer. Hence, big data parallel processing frameworks including Hadoop, Spark, and Flink [19] are employed to accelerate the execution time of FIM algorithms and, consequently, the frequent itemset patterns are discovered in a reasonable period of time.
Many people mistakenly believe Hadoop and MapReduce [20,21] are synonymous; however, this is not the case. Hadoop was first released in 2007 as an open-source implementation of the MapReduce processing engine coupled with a distributed file system [4], but it has since grown into a vast web of projects covering every aspect of a big data workflow, including data collection, storage, processing, and more. MapReduce [22] is one of the most common parallel processing frameworks used to write the intermediate files generated to local disk to use these files in reduce phase; this process consumes lots of time. Unlike MapReduce, Spark stores the intermediate results in cache memory to speed up data processing. Spark also provides some other libraries such as Spark SQL, Spark streaming, Spark MLIB in one framework, and supports different languages such as Scala, Python, and Java. Some work is being done to support R as well.
Spark [23,24] is one of the most effective open-source big data tools used in running large-scale data analytics applications across clusters of nodes. Spark is a parallel processing framework that can handle both batch and real-time analytics and data processing workloads. It was designed to fill the shortcomings of Hadoop [21], especially in the data processing. Fault tolerance in Spark [25,26] aims to recover any failure or other loss in data that happened by rebuilding the RDDs by applying all the transformations found in the DAG (directed acyclic graph). Furthermore, Spark moves the processing and computations close to the data instead of moving the data itself to the computations; this is called data locality. Data locality is the process of moving the processing to data location instead of moving the data through the networks, and thus reducing the network communication and increasing the throughput of the system.
Bloom Filter (BF) [27] is a compact data structure that is used in the probabilistic representation of a set of variables to make sure whether an element in the set exist certainly or isn't existing in the set. BF is a very quick way of checking if something is in a set of data that gives one of two answers either: "x is not in this data set" or "x might be in this data set." BF provides a particularly useful optimization which means most of the queries can return "not found" without having to start looking through your indexes.
Recently, big data have impacted many aspects of life, including economics, politics, and culture, as well as science and technology. Big data have a lot of advantages in terms of production, trading, services, management, discovery, and so on; however, it also has a lot of challenges in terms of computer science and information technology, such as management, storage, and processing and analysis, security, and visualization. Data Mining generally and Frequent Itemsets Mining (FIM) especially play a critical role in that impact. Parallel FIM algorithms are still inefficient to cope with large amounts of big data, despite various techniques and methodologies addressing execution efficiency and memory optimization. FIM must be equipped with methodologies and solutions to operate effectively and reliably on such large datasets without out-of-memory and quickly enough to satisfy users. These requirements are significantly challenging for FIM. As a result, it is important to develop a new efficient algorithm for discovering the frequent itemset mining pattern as quickly as possible to meet customers' expectations with the wise use of memory.
In this paper, we introduce SHFIM, a new hybrid Spark-based method that attempts to improve FIM efficiency while detecting common patterns. SHFIM is divided into three phases, the first of which tries to extract the first frequent itemset. Next, the second phase concentrates on extracting the second most frequent itemset without candidate generation and converting the horizontal dataset layout to a vertical layout in a single step using BF. Because the candidate generation and vertical layout transformation processes are time expensive in conventional FIM algorithms, we combine them into a single step Data 2022, 7, 11 4 of 22 using BF. Finally, in the third phase, SHFIM extracts the remaining frequent itemsets in parallel using diffset, which is more memory and time efficient. Because FIM is an iterative problem that requires many dataset scans, we chose Spark because it is 100 times quicker than Hadoop in data processing due to in-memory processing and lazy evaluation, which causes Spark to delay the execution of transformations until action is encountered. Furthermore, Spark can considerably improve the efficiency of repetitive algorithms that visit the same dataset regularly. As a result, the proposed algorithm is developed and tested using the Spark framework and written in such a way that it improves time by leveraging Spark best practices such as using broadcasted variables to speed up the lookup process and reduceByKey instead of groupByKey to avoid the shuffling steps. At last, we compared SHFIM performance with Spark-based algorithms (Eclat and dEclat) built on the same cluster, using four benchmark datasets (two sparse and two dense) to evaluate execution time.
The contribution work is summarized as follows (1) Developing and designing a new hybrid framework and algorithm on Spark. (2) Utilizing both the horizontal and vertical layout (3) Developing some improvements to eliminate the candidate itemset generation (4) SHFIM uses diffset to minimize the memory consumption (5) SHFIM uses shared variables to share datasets among cluster nodes (6) Conducting an Experiment on a multi-node cluster using four datasets. (7) Evaluate and discuss the proposed algorithm (SHFIM) against other FIM Algorithms.
The rest of the paper is structured as follows: Section 2 discusses FIM algorithms and their disadvantages. Section 3 demonstrates the fundamental concepts of FIM. SHFIM's proposed framework is depicted in Section 4. Section 5 addresses the Spark cluster requirements, dataset characteristics and SHFIM experiments. The evaluation and discussion are covered in Section 6. Section 7 depicts the suggested algorithm's complexity. Section 7 is dedicated to the conclusion and future work.

Related Work
This section provides an overview of the most common FIM algorithms used in large data mining. Section 2.1 addresses horizontal layout-based algorithms, Section 2.2 discusses vertical layout-based algorithms, and Section 2.3 discusses tree-based FIM techniques.

Horizontal Layout-Based Algorithms
EAFIM [28] is a parallel version of Apriori that has been implemented over Spark that enhances the Apriori by generating the candidate itemsets "on-the-fly" and applying a database reduction after each iteration. EAFIM outperforms YAFIM in almost all iterations and performs better than R-Apriori in all iterations except the second one due to the BF.
Adaptive-Miner [29] is a parallel version of Apriori that has been implemented over Spark. It makes execution plans before every iteration to decide which data structure should be used either BF or hash-Trees based on the cost of execution plans. The approach of taking decisions before every iteration makes Adaptive-Miner scalable and more efficient than YAFIM. Adaptive-Miner is used to generate the candidate itemsets in a separate step, then iterates over the dataset to fetch transaction by transaction to calculate the support of each candidate like YAFIM.
HFIM [30] is a hybrid version of Apriori that has been developed over Spark. It uses the vertical layout rather than the horizontal layout to mitigate the iterative dataset scanning of Apriori. The vertical layout takes less amount of storage and no need to iterate over the dataset many times. A dataset in the horizontal layout is partitioned and distributed over the cluster to generate the candidate itemsets in the workers rather than the driver and make a dataset reduction to keep only the items that are frequent only. Datasets are shared among cluster nodes using broadcasting variables. HFIM outperformed the YAFIM algorithm because of the usage of the vertical layout.
DFIMA [31] is a parallel version of the Apriori algorithm that has been developed over the Spark framework. It is a Matrix-based pruning technique that decreases the number of Data 2022, 7, 11 5 of 22 generated candidates' itemsets. DFIMA outperformed the PFP [32] concerning speed and scalability. Broadcasting variables has been used to enhance the efficiency of the algorithm. DFIMA and PFP seem to run faster when min_sup is low. R-Apriori [33] is a parallel version of the Apriori algorithm that has been implemented over Spark which enhances the execution time of the YAFIM algorithm by adding a BF in the second iteration. As a result, it reduces the runtime due to eliminating the time taken by candidate generation at the second iteration. R-Apriori outperformed YAFIM in the second iteration and performs the same as YAFIM in all other iterations.
YAFIM [34] is a parallel version of the Apriori algorithm that has been developed on the Spark framework. Spark is specially designed for in-memory parallel computing to support iterative and interactive data mining. All Apriori-based algorithms that have been implemented over MapReduce, YAFIM achieved 18x speedup on average for various benchmarks. YAFIM generates the candidate itemsets in a separate step through applying a Cartesian product between (k-1 frequent itemset) and (k-1 frequent itemset), then iterates over the dataset (transaction by transaction) to calculate the support of each candidate.

Vertical Layout-Based Algorithms
DFP [35] is a parallel version of Eclat that was introduced as a distributed approach to increase the mining efficiency for association rules. The transmission cost will be increased if the communication between clients (nodes) increased. The suggested DFP method effectively avoided this difficulty by moving the mining result from the executing client to the server. The experiment findings also revealed that the suggested technique had much greater computational efficiency than DistEclat and BigFIM. The suggested technique decreases computation efficiency by estimating the required memory and clients. By calculating the maximum number of TIDs for each client, the DFP algorithm overcomes memory constraints and repetitive scans, and the extra TIDs are handled by an extended mechanism known as MP.
RDD-Eclat [36] is a parallel Eclat algorithm entitled RDD-Eclat and the implementation of its five variations on the Spark RDD framework. EclatV1 is the first version, while the others are EclatV2, EclatV3, EclatV4, and EclatV5. Each version is the consequence of a new technique and heuristic being applied to the preceding variant. Each variant is the consequence of a new technique and heuristic being applied to the preceding variant. After EclatV1, the filtered transaction approach is used, and in EclatV4 and EclatV5, heuristics for equivalence class partitioning are used. It has been shown that using the filtered transaction approach to reduce the size of the original dataset enhances the performance. Furthermore, the equivalence class partitioning techniques considerably reduced the execution time.
SVT [37] is a hybrid algorithm that has been implemented over the Spark framework. Unlike the traditional vertical Layout, it follows the way of switching between the Eclat (tidset) and dEclat (diffset) relying on the dataset density. If the dataset is sparse, it uses Eclat. Otherwise, it uses dEclat. The experimental results proved that the SVT outperforms the YAFIM in execution time performance.
PEclat [38] is a hybrid algorithm that has been implemented over the Hadoop MapReduce framework. PEclat works on the vertical layout datasets to increase the speed of the algorithm. Unlike the traditional algorithms that use just one vertical format either tidset or diffset, it uses the two formats simultaneously on the item level rather than the dataset level. It works without putting into consideration if the dataset is dense or even sparse. PEclat follows the strategy of mixset in which it decides to continue with diffset or tidset based on the tidset or diffset size. If the size of tidset is greater than or equal to the size of diffset, then continue using tidset, and the opposite is right.
BigFIM [39], introduces two algorithms in the area of big data and frequent pattern mining over Hadoop (Map Reduce). The first algorithm is the DistEclat (Distributed Eclat) which is a new parallel version of Eclat that has been developed on Hadoop. The second algorithm is the BigFIM, a scalable hybrid algorithm between Apriori and Eclat to mine Data 2022, 7, 11 6 of 22 big datasets. The experimental results proved that DistEclat is the fastest among all the algorithms in dense datasets, however, it may fail because it requires lots of memory.

Tree-Based Algorithms
Map-Optimize-Reduce [40] is a parallel version of FP-Growth that has been implemented over Hadoop MapReduce. It is based on the Map-Optimize-Reduce framework which decreases the search time and enhances the performance of the MapReduce framework. It builds a CAN tree that makes the FP-Growth algorithm performs better than using FP-Tree and enhances the database efficiency through making some data preprocessing and normalization.
DFPS [41] is a parallel version of FP-Growth that has been implemented over the Spark framework. It works in the same way as MapReduce; it distributes the workload and computation cost over the cluster. It allows each node to build the conditional FP-tree and mine the tree to extract the frequent pattern. DFPS showed good performance in terms of time and scalability against Apriori.
PFP [32] is a parallel version of FP-Growth [42] that has been developed over the Hadoop MapReduce framework. PFP generates frequent itemset without candidate generation step through constructing an FP-Tree into the workers' memory. It performs well when the memory is big enough in data centers and may fail if the memory cannot fit the tree. Table 1 shows a comparative analysis of FIM algorithms. Apriori has a problem with multiple database scanning, which is time-consuming. Eclat uses the vertical layout, which is faster than the horizontal layout, however, it has a problem with the dense datasets. It takes lots of time due to the tidset intersections. If the memory is not sufficient, Eclat may fail. dEclat is a scalable version of Eclat using diffset. Although dEclat is memory-wise and can mine data in less amount of time, it shows deficient performance with sparse datasets. Researchers do not recommend starting with Eclat or dEclat and start with a memory-wise (Breadth-First Search) algorithm like Apriori instead, then switch to Depth First Search. Thus, this research aims to propose SHFIM hybrid spark-based algorithm that enhances the execution time of extracting the frequent pattern from big datasets.
FIM (Frequent Itemset Mining) offers three approaches: horizontal, vertical, and tree based. Horizontal like Apriori, vertical like Eclat, and tree-based, like PFP. Each algorithm has advantages and disadvantages; for example, Apriori is the best in terms of memory, but the worst in terms of execution time. Eclat, on the other hand, is the fastest in terms of execution time, but it requires extensive processing due to the intersections. PFP examines the dataset once to generate a tree data structure. Although PFP is effective in some circumstances, it has several drawbacks. When at least a set of projected transactions does not fit into the machine's memory, it fails. When the projected transactions significantly overlap, the communication overhead can be very high, because in that situation overlap of data is sent to the network several time [2,39]. This variance and difference in algorithms encourage researchers to develop a new type of algorithm that can utilize the advantages of current algorithms known as Hybrid. Therefore, we designed our hybrid algorithm to start as Apriori to take advantage of memory early on, then move data to a BF to avoid memory and time overhead in producing candidates and layout transformation, and finally, leverage diffset to take advantage of memory and time together. Diffset enhances execution time and memory consumption because it shrinks itemsets into smaller sets that will be mined in less time and consume less space [18]. Furthermore, we designed and developed the algorithm using Spark best practices to reduce network traffic and shuffling between worker nodes. • A scalable hybrid FIM algorithm can mine truly big data.
• Cannot deal with lower frequencies and sometimes out of memory happens.

MapReduce Apriori/Eclat
Dist-Eclat [39] • The fastest algorithm among others regardless of the candidate generation time.

•
Accelerate the speed using a strategy of simple load balancing.
• Cannot deal with a large amount of data due to the memory and processing needed. • Switch between a different approach based before each iteration to improve the execution time.
• Consume some time to identify which approach should apply. • Requires multiple dataset scan.

Spark Apriori
SVT [37] • Reduces the cost of communication across nodes in the cluster. • Distributes workload among the clusters' nodes.
• Computation cost and memory usage are high. • Needs more memory in low minimum supports.
• Requires multiple dataset scan. • Shows bad performance in lower minimum supports.
Spark Apriori RDD-Eclat [36] • Shows better performance in sparse datasets. • Increases ECLAT performance by applying partitioning and some heuristics.
• Shows bad performance in dense datasets.

•
May fail if the memory is not sufficient.

Preliminaries
The problem of the frequent pattern mining is to find correlations between elements in a database. The problem is summarized as follows: Find all patterns P that appears in at least a fraction of the transactions in a database D containing transactions T 1 . . . T N . The list of notations used in this paper is shown in Table 2.

Definition (Pattern)
A pattern is described as a collection of objects, events, or items that occur frequently in a database. Formally, a pattern P in a database D is defined as a subset of elements P ⊆ {i 1 . . . i n }∈ D that represent valuable data properties [43].

Definition (Frequency of a Pattern)
The frequency of a pattern P is defined as the number or the percent of data records that contain the subset of items indicated by P. In other words, a set of items I = {i 1 . . . i n }, in a database D consisting of R = {r 1 . . . r m } ∈ D data records, with each record containing a set of items ∀r∈ R: r ⊆ I, the frequency of P is defined as |{∀r ∈ R:P ⊆ r, r ⊆ I ∈ D}|. The FIM process is extremely difficult due to the vast number of candidate patterns (itemsets) that must be processed [43].
Given an n-item dataset, the number of patterns (or itemsets) of size k is n k = n! k!(n−k)! for any k ≤ n. As a result, with such a dataset, the total number of possible patterns is 2 n -1, and the complexity of locating patterns of interest is in exponential order. When the frequency of each pattern is determined, the complexity increases to O ((2 n − 1) × m × n) for a dataset of m records and n items. To provide some light on the complexity, a dataset of only 33 items (which is a small number of items) yields more than 8.5 × 10 9 possible patterns, which is more than the world population in 2018. (Almost 7.7 billion people) [43].

Definition (Itemset)
Assume that I is a set of items. An itemset is a collection of items from I. A k-itemset is an itemset created from k items. An itemset S is considered a subset of an itemset I if all of the items in S are also in I, denoted as S ⊆ I [18].

Definition (Frequent Itemset)
Consider D is a dataset of transactions, each transaction T contains a set of items. I is an itemset and a subset of transaction T. I is frequent if its support (the number of transactions contains I) defined as, σ (I) = |{T|I ⊆T∧T∈D}| is greater than or equal to minimum support (min_sup) given by the user, denoted as FI = {I|σ(I) ≥ min_sup} [14].
As a running example, consider the given dataset D in Table 3. D contains 4 transactions {T 1 . . . T 4 }; each transaction contains a set of items. Our goal is to discover the frequent itemsets whose support is greater than or equal to min_sup. Let min_sup = 3, so that the first frequent itemsets =

Definition (Tidset)
Consider D to be a dataset containing a set of items i. The tidset of an itemset I is the set of transactions identifiers that have the item, represented by tidset(I) = {t.tid | t ∈ D, I ⊆ t}. Thus, σ(I) = |tidset(I)|, denoted as Equation (1). Additionally, k-itemset's tidset is formed by intersecting the tidsets of two (k − 1)-itemsets that share the first k − 2 items, represented by Equation (2) [16].

SHFIM Proposed Framework
In this section, we introduce SHFIM hybrid algorithm to implement a frequent pattern mining framework for analyzing big data in a parallel and distributed way over Spark. The main goal of SHFIM is to improve the execution time performance and reduce the search space to extract the frequent Itemset from big data. SHFIM algorithm is a Scalabased algorithm implemented over Spark cluster with Spark best practices to reduce the execution time required to extract the frequent itemsets.
Unlike Eclat and dEclat algorithms, SHFIM can handle thousands and even millions of variable-length transactions without leaking or spilling memory. The SHFIM framework is presented in Figure 2 and is divided into three stages: (1) singletons extraction phase, (2) second frequent itemset extraction phase in vertical arrangement, and (3) K

SHFIM Proposed Framework
In this section, we introduce SHFIM hybrid algorithm to implement a frequent pattern mining framework for analyzing big data in a parallel and distributed way over Spark. The main goal of SHFIM is to improve the execution time performance and reduce the search space to extract the frequent Itemset from big data. SHFIM algorithm is a Scala-based algorithm implemented over Spark cluster with Spark best practices to reduce the execution time required to extract the frequent itemsets.
Unlike Eclat and dEclat algorithms, SHFIM can handle thousands and even millions of variable-length transactions without leaking or spilling memory. The SHFIM framework is presented in Figure 2 and is divided into three stages: (1) singletons extraction phase, (2) second frequent itemset extraction phase in vertical arrangement, and (3) K frequent itemset extraction phase. The next subsections go through each of these three phases in depth.

Singletons Extraction Phase
The goal of this phase is to extract the first frequent itemset. It is a single iteration that uses a single map function and a single reduce function to extract the first frequent itemset. This phase's inputs are the dataset in horizontal format and the user-defined min_sup. Each line in the dataset represents a transaction, which has a set of items separated by

Singletons Extraction Phase
The goal of this phase is to extract the first frequent itemset. It is a single iteration that uses a single map function and a single reduce function to extract the first frequent itemset. This phase's inputs are the dataset in horizontal format and the user-defined min_sup. Each line in the dataset represents a transaction, which has a set of items separated by spaces, the first of which is the transaction id. For example, transaction #10 should look like {10, 22,24,35,39,71,91}, where 10 is the tid (Transaction id) and the rest are the items.
The SHFIM algorithm pseudo-code is shown in Algorithm 1, in which SHFIM begins reading the dataset and assigns each partition of data to a worker node in the Spark cluster. To flatten the itemsets, each worker node uses a flatMap function on its data. When the getFirstFrequentItemset method is used, it returns an RDD containing the first frequent itemsets in a (key, value) pair, where the key is an itemset and value is its support. In other words, (A, 5) is a 1-itemset pair containing an itemset (A) with support of five.  Figure 3 shows the visual workflow of this phase in SHFIM, which reads the data in horizontal layout format from HDFS and loads it into Spark RDD, where a map function is applied to construct a key-value pair. Following that, it uses a reduce function to aggregate the values based on their keys. At the end of this phase, SHFIM filters out itemsets/keys whose support is less than min_sup, and the first frequent itemset is retrieved and stored in the BF. Figure 3 shows the visual workflow of this phase in SHFIM, which reads the data in horizontal layout format from HDFS and loads it into Spark RDD, where a map function is applied to construct a key-value pair. Following that, it uses a reduce function to aggregate the values based on their keys. At the end of this phase, SHFIM filters out itemsets/keys whose support is less than min_sup, and the first frequent itemset is retrieved and stored in the BF.  Each mapper accepts a collection of transactions (e.g., Mapper-1 takes transactions T1 and T2). To separate each transaction into discrete items, mappers use a flatMap function on each record in the RDD. The map function is used to create corresponding (key, value) pairs, where the key is an itemset and value is 1. As running example of a dataset D in Table 3, T1 = {A, B, C, D, F} is mapped into (A,1), (B,1), (C,1), (D,1), and (F1). Subsequently, the reducers use the reduceByKey method to aggregate the data based on the key and filter away irrelevant itemsets, leaving only those with support greater than or equal to min sup. If min sup = 3, SHFIM stores in a BF only (A, C, D, F) because their support is more than or equal to 3. Finally, the BF is shared among the cluster's machines so that it can be accessed locally by each worker node.  Each mapper accepts a collection of transactions (e.g., Mapper-1 takes transactions T1 and T2). To separate each transaction into discrete items, mappers use a flatMap function on each record in the RDD. The map function is used to create corresponding (key, value) pairs, where the key is an itemset and value is 1. As running example of a dataset D in Table 3, T1 = {A, B, C, D, F} is mapped into (A,1), (B,1), (C,1), (D,1), and (F1). Subsequently, the reducers use the reduceByKey method to aggregate the data based on the key and filter away irrelevant itemsets, leaving only those with support greater than or equal to min sup. If min sup = 3, SHFIM stores in a BF only (A, C, D, F) because their support is more than or equal to 3. Finally, the BF is shared among the cluster's machines so that it can be accessed locally by each worker node.

2nd Frequent Itemsets in a Vertical Layout Phase
This phase aims to extract the second frequent itemset in a vertical layout. The input of this phase is the first frequent itemset (Singletons). Mappers take each transaction and prune it according to the singletons stored in the BF, producing all feasible candidates together with their tid in pairs. This step eliminates the process of candidate generation, which is time-consuming in most FIM algorithms.
The workflow of this phase is illustrated visually in Figure 5, where the BF reads every item in each transaction from the horizontal layout dataset. Afterward, BF prunes items (that are not present in BF) and then produces key-value (key is the itemset and value is the TID) pairs candidates from items that are remained after pruning. Next, reducers aggregate the values based on their keys to convert the data into a vertical layout such as (Itemset, {list of tids}). At the end of this phase, SHFIM filters out the itemsets whose support is less than min_sup, and then the second frequent itemset is extracted. Line 7 in Algorithm 1 has a function called findPairsBloomFilter, which accepts three parameters (data RDD, singletons, min_sup) and returns the second frequent itemset in vertical layout tid format.

2nd Frequent Itemsets in a Vertical Layout Phase
This phase aims to extract the second frequent itemset in a vertical layout. The input of this phase is the first frequent itemset (Singletons). Mappers take each transaction and prune it according to the singletons stored in the BF, producing all feasible candidates together with their tid in pairs. This step eliminates the process of candidate generation, which is time-consuming in most FIM algorithms.
The workflow of this phase is illustrated visually in Figure 5, where the BF reads every item in each transaction from the horizontal layout dataset. Afterward, BF prunes items (that are not present in BF) and then produces key-value (key is the itemset and value is the TID) pairs candidates from items that are remained after pruning. Next, reducers aggregate the values based on their keys to convert the data into a vertical layout such as (Itemset, {list of tids}). At the end of this phase, SHFIM filters out the itemsets whose support is less than min_sup, and then the second frequent itemset is extracted. Line 7 in Algorithm 1 has a function called findPairsBloomFilter, which accepts three parameters (data RDD, singletons, min_sup) and returns the second frequent itemset in vertical layout tid format. ducers aggregate the values based on their keys to convert the data into a vertical layout such as (Itemset, {list of tids}). At the end of this phase, SHFIM filters out the itemsets whose support is less than min_sup, and then the second frequent itemset is extracted. Line 7 in Algorithm 1 has a function called findPairsBloomFilter, which accepts three parameters (data RDD, singletons, min_sup) and returns the second frequent itemset in vertical layout tid format.  Figure 6 illustrates how SHFIM extracts the second frequent itemsets in a vertical format. A Spark RDD is used to load the transactional dataset from HDFS. Mappers take each transaction and prune it so that only the items listed in the BF are retained, and produce all feasible (key, value) pairs from the trimmed transaction, where the key represents the newly generated itemset and the value represents its tid (Transaction id). Subsequently, Reducers use the reduceByKey transformation to combine all pairs based on their key to build a new RDD containing a collection of (key, value) pairs where the key is the itemset and value is its tidset and filter out itemsets whose support is less than min_sup.  Figure 6 illustrates how SHFIM extracts the second frequent itemsets in a vertical format. A Spark RDD is used to load the transactional dataset from HDFS. Mappers take each transaction and prune it so that only the items listed in the BF are retained, and produce all feasible (key, value) pairs from the trimmed transaction, where the key represents the newly generated itemset and the value represents its tid (Transaction id). Subsequently, Reducers use the reduceByKey transformation to combine all pairs based on their key to build a new RDD containing a collection of (key, value) pairs where the key is the itemset and value is its tidset and filter out itemsets whose support is less than min_sup. , and so on, and filter out itemsets whose support less than min_sup. Finally, the second frequent itemsets in vertical format are stored in Spark RDD and ready for the next phase.

K Frequent Itemset Extraction Phase
The goal of this phase is to use diffset to extract the K frequent itemsets, where K ≥ 3. The input of this phase is the (k − 1) frequent itemset in the form of key value pair RDD, where the key is the itemset and value is the tidset. Figure 7 presents the workflow of this phase which begins with producing the candidate k-itemsets from (k − 1) frequent itemsets and broadcasts the generated candidate list among cluster nodes. Next, mappers traverse each itemset in (k − 1) frequent itemset to map with their corresponding candidate k-itemset. As a result, a key value pair RDD is generated, where key is the candidate k-itemset and value is its mapped (k − 1) itemset. Afterward, reducers integrate the values in the paired RDD based on their key to produce another key-value pair, but the key, in this case, is the candidate k-itemset and the value represents a list of all corresponding (k − 1) frequent itemsets. Finally, the diffset is computed for each candidate k-itemset based on Equations (3) and (5) in the preliminaries section. At the end of this phase, SHFIM filters out the itemsets/keys whose support is less , and so on, and filter out itemsets whose support less than min_sup. Finally, the second frequent itemsets in vertical format are stored in Spark RDD and ready for the next phase.

K Frequent Itemset Extraction Phase
The goal of this phase is to use diffset to extract the K frequent itemsets, where K ≥ 3. The input of this phase is the (k − 1) frequent itemset in the form of key value pair RDD, where the key is the itemset and value is the tidset. Figure 7 presents the workflow of this phase which begins with producing the candidate k-itemsets from (k − 1) frequent itemsets and broadcasts the generated candidate list among cluster nodes. Next, mappers traverse each itemset in (k − 1) frequent itemset to map with their corresponding candidate k-itemset. As a result, a key value pair RDD is generated, where key is the candidate k-itemset and value is its mapped (k − 1) itemset.
Afterward, reducers integrate the values in the paired RDD based on their key to produce another key-value pair, but the key, in this case, is the candidate k-itemset and the value represents a list of all corresponding (k − 1) frequent itemsets. Finally, the diffset is computed for each candidate k-itemset based on Equations (3) and (5) in the preliminaries section. At the end of this phase, SHFIM filters out the itemsets/keys whose support is less than min_sup, and the k-frequent itemset is extracted. Because the FIM is an iterative approach, SHFIM repeats the third phase for each k > 3 to extract the remaining frequent itemsets, and this process lasts until no more frequent itemsets can be obtained.  Figure 8 shows an example for the third phase and how SHFIM extracts the frequent k-itemsets. Mapper-1 takes AD, AC, and DF as vertical itemsets and then applies a filter on the candidate RDD to keep only candidates whose itemsets are already in the second frequent itemset, such as ABC, which is a candidate 3-itemset that is derived from AB and AC, both of which are present in the frequent 2-itemset. After applying the filter function, mapper-1 generates a (key, value) pair RDD in which key is the candidate list and value is an itemset such as ({ACD, ACF}, AC) is derived from AC, whereas ({ACD, ADF}, AD) is derived from AD. On contrary, DF has an empty list of candidates and will be ignored because there are no other 2-itemsets that share the same parent or start with (D). Consequently, a flatMap function is applied on the RDD to split the candidates into separated key-value pairs. As a running example, (

Performance Evaluation
This section evaluates the performance of the SHFIM algorithm by comparing its execution time performance against Eclat and dEclat, where Eclat is a Spark-based vertical layout FIM algorithm depending on intersections between itemset tidset. On the other SHFIM uses the function generatesKCandidates in Algorithm 1 line 12 to generate candidate k-itemsets. The candidate generation rule states that every two (k − 1)-itemsets must share the same prefix and be different in the last. In other words, let ABC be a 3-itemset candidate derived from the 2-itemsets AB, AC, which share the same prefix (A) but differ in the last element B and C. In Algorithm 1 line 18, SHFIM calls the function assignCandidateToItemset, which returns a (key, value) pair RDD. This function uses a flatMap to assign or map every candidate k-itemset to each frequent (k − 1)-itemset. SHFIM invokes a function getDiffsetFromTID where K = 3 and getDiffsetFromDiff where K > 3 on lines 19-23. Finally, a filter function is invoked to filter out itemsets whose support is greater than or equal to the min_sup. Figure 8 shows an example for the third phase and how SHFIM extracts the frequent k-itemsets. Mapper-1 takes AD, AC, and DF as vertical itemsets and then applies a filter on the candidate RDD to keep only candidates whose itemsets are already in the second frequent itemset, such as ABC, which is a candidate 3-itemset that is derived from AB and AC, both of which are present in the frequent 2-itemset. After applying the filter function, mapper-1 generates a (key, value) pair RDD in which key is the candidate list and value is an itemset such as ({ACD, ACF}, AC) is derived from AC, whereas ({ACD, ADF}, AD) is derived from AD. On contrary, DF has an empty list of candidates and will be ignored because there are no other 2-itemsets that share the same parent or start with (D). Consequently, a flatMap function is applied on the RDD to split the candidates into separated key-value pairs. As a running example, ({ACD, ACF}, AC) is transformed using flatMap into two pairs (ACD, AC), (ACF, AC). Reducer-1 combines (ACD, AC) and (ACD, AD) to generate (ACD, Map (AC ≥ {T1, T2, T4}, AD ≥ {T1, T2, T4})). Next, mapper-1 calculates the diffset of each candidate itemset. Therefore, ACD is a candidate itemset has both AC and AD, d(ACD) = t(AC) − t(AD) = {T1, T2, T4} − {T1, T2, T4} = {}, using Equation (4), whereas σ(ACD) = s(AC) − |d (ACD)| = 3 − 0 = 3, using Equation (5). Finally, repeat the third phase in each subsequent k > 3 until no frequent itemsets are existed.

Performance Evaluation
This section evaluates the performance of the SHFIM algorithm by comparing its execution time performance against Eclat and dEclat, where Eclat is a Spark-based vertical layout FIM algorithm depending on intersections between itemset tidset. On the other hand, dEclat is a Spark-based vertical layout FIM algorithm depending on calculating differences between itemsets. Moreover, we implemented Eclat and dEclat on the same environment of the SHFIM algorithm to make sure that the comparison between them all is done on the same cluster using same specifications.

Performance Evaluation
This section evaluates the performance of the SHFIM algorithm by comparing its execution time performance against Eclat and dEclat, where Eclat is a Spark-based vertical layout FIM algorithm depending on intersections between itemset tidset. On the other hand, dEclat is a Spark-based vertical layout FIM algorithm depending on calculating differences between itemsets. Moreover, we implemented Eclat and dEclat on the same environment of the SHFIM algorithm to make sure that the comparison between them all is done on the same cluster using same specifications.

Dataset
In this section, we present the characteristics of datasets that have been used in the evaluation. The proposed solution (SHFIM) has been tested using four datasets [44] which are the most common datasets and the benchmark in all frequent pattern mining algorithms in big data. Table 4 shows the characteristics of the datasets used in performance evaluation. The first dataset is a sparse T1014D100k dataset generated using IBM's generators, whereas, the second dataset is a Retail dataset which is a sparse dataset containing transactions that happened from the half of Dec 1999 to the end of Nov 2000 for a UK-based online non-store system. While the third is a chess dataset, which is a dense dataset containing the chess king vs. king and rook end positions. Finally, the fourth dataset is the dense mushroom dataset for 23 gilled mushroom species. The dataset selection criteria we use are to apply the algorithm to both sparse and dense datasets to evaluate algorithm efficiency and analyze the influence of each dataset on our algorithm.

Experiment and Result
In this section, we will show the experimental results of SHFIM algorithm against Eclat and dEclat in terms of execution time. Figure 9a compares SHFIM, dEclat, and Eclat in different min_sup starting from 7% and ending at 0.6%. At min_sup (7-2%), FIM algorithms generate the frequent itemset in a single iteration. Similarly, Eclat and dEclat do the same operation (convert the dataset from horizontal to a vertical layout, then compute the support), and they nearly reach the same execution time. Unlike Eclat and dEclat, SHFIM only computes each itemset support, then filters out itemsets with support less than min_sup and generates singletons without transforming the dataset into a vertical layout. From min_sup (1%), the iterations start to increase and require Eclat and dEclat to generate the second candidate itemsets by applying intersections (Eclat) or differences (dEclat) between itemsets to extract the second frequent itemset. SHFIM, on the other hand, eliminates the time required to produce the candidate itemsets for the second frequent itemset using the BF. As a result, it saves a lot of memory because it avoids loading the memory with second itemset candidates.

T1014D100k Dataset
In Figure 9b, a comparison between SHFIM, Eclat, and dEclat on iteration level at min_sup (0.6%). It is worth noting that for any min_sup, each algorithm should complete several iterations. SHFIM outperforms Eclat and dEclat in all iterations at min_sup (0.6%), particularly at the second iteration. Eclat and dEclat finish in about 287 s, and 821 s, respectively, whereas SHFIM takes only 19 s. Figure 10a shows a comparison between the vertical layout algorithms for different min_sup (30-0.2%). We noticed that dEclat and Eclat have recorded the same results at higher supports (30% to 3%) since it is one iteration, and they accomplish the same task "extract the first frequent itemset in vertical layout". On contrary, SHFIM executes almost twice faster than dEclat and Eclat in higher supports (30% to 3%) that have only one iteration because SHFIM uses only a flatMap in the first iteration, which requires timeless than Eclat and dEclat time required to transform into the vertical layout. Notably, Eclat outperforms dEclat in most iterations due to the sparsity of the Retail dataset and Eclat ability to compete in the sparse datasets.

T1014D100k Dataset
In Figure 9b, a comparison between SHFIM, Eclat, and dEclat on iteration level at min_sup (0.6%). It is worth noting that for any min_sup, each algorithm should complete several iterations. SHFIM outperforms Eclat and dEclat in all iterations at min_sup (0.6%), particularly at the second iteration. Eclat and dEclat finish in about 287 s, and 821 s, respectively, whereas SHFIM takes only 19 s. Figure 10a shows a comparison between the vertical layout algorithms for different min_sup (30-0.2%). We noticed that dEclat and Eclat have recorded the same results at higher supports (30% to 3%) since it is one iteration, and they accomplish the same task "extract the first frequent itemset in vertical layout". On contrary, SHFIM executes almost twice faster than dEclat and Eclat in higher supports (30% to 3%) that have only one iteration because SHFIM uses only a flatMap in the first iteration, which requires timeless than Eclat and dEclat time required to transform into the vertical layout. Notably, Eclat outperforms dEclat in most iterations due to the sparsity of the Retail dataset and Eclat ability to compete in the sparse datasets.

Retail Dataset
A comparison between SHFIM, Eclat, and dEclat on iteration level at min_sup (1%) is shown in Figure 10b. SHFIM outperforms Eclat and dEclat in terms of execution time in all iterations except the last iteration. At the last iteration, Eclat outperforms SHFIM and dEclat due to the sparsity of the Retail dataset. On the other hand, dEclat spent a long time in the second and third iterations. As a result, Eclat and dEclat consume about 26 and 318 s, respectively, whereas SHFIM takes only 12 s.
"extract the first frequent itemset in vertical layout". On contrary, SHFIM executes almost twice faster than dEclat and Eclat in higher supports (30% to 3%) that have only one iteration because SHFIM uses only a flatMap in the first iteration, which requires timeless than Eclat and dEclat time required to transform into the vertical layout. Notably, Eclat outperforms dEclat in most iterations due to the sparsity of the Retail dataset and Eclat ability to compete in the sparse datasets.  In Figure 11b, a comparison between SHFIM, Eclat, and dEclat on iteration level at min_sup (30%). In all iterations, SHFIM outperformed dEclat and Eclat in terms of execution time, except for the 4th iteration. At the 4th iteration, SHFIM took longer than dEclat. Because of the density of the Mushroom dataset, the driver must create candidates in a long time. Finally, SHFIM has a total runtime of 164 s, while Eclat and dEclat have total runtimes of 303 and 194 s, respectively. Figure 12a shows a comparison between SHFIM, Eclat, and dEclat for different min_sup (90%-85%). SHFIM executes faster than Eclat and slower than dEclat in different min_sup. At min_sup (90%), SHFIM has recorded 41 s while Eclat and dEclat have recorded 98 and 31 s, respectively. Notably, due to the high dataset density of chess dataset, Eclat has recorded the slowest performance.  In Figure 11b, a comparison between SHFIM, Eclat, and dEclat on iteration level at min_sup (30%). In all iterations, SHFIM outperformed dEclat and Eclat in terms of execution time, except for the 4th iteration. At the 4th iteration, SHFIM took longer than dEclat. Because of the density of the Mushroom dataset, the driver must create candidates in a long time. Finally, SHFIM has a total runtime of 164 s, while Eclat and dEclat have total runtimes of 303 and 194 s, respectively. Figure 12a shows a comparison between SHFIM, Eclat, and dEclat for different min_sup (90%-85%). SHFIM executes faster than Eclat and slower than dEclat in different min_sup. At min_sup (90%), SHFIM has recorded 41 s while Eclat and dEclat have recorded 98 and 31 s, respectively. Notably, due to the high dataset density of chess dataset, Eclat has recorded the slowest performance.

Chess Dataset
In Figure 12b, a comparison between SHFIM, Eclat, and dEclat on iteration level at min_sup (85%). SHFIM outperformed Eclat in all iterations, whereas took nearly as long as dEclat in all iterations except the third and fourth, because of the time required by the driver to produce candidates and save the results to HDFS causes SHFIM to take longer than usual. SHFIM takes only 61 s to complete, but Eclat and dEclat take roughly 375 and 55 s, respectively. Figure 12a shows a comparison between SHFIM, Eclat, and dEclat for different min_sup (90%-85%). SHFIM executes faster than Eclat and slower than dEclat in different min_sup. At min_sup (90%), SHFIM has recorded 41 s while Eclat and dEclat have recorded 98 and 31 s, respectively. Notably, due to the high dataset density of chess dataset, Eclat has recorded the slowest performance. In Figure 12b, a comparison between SHFIM, Eclat, and dEclat on iteration level at min_sup (85%). SHFIM outperformed Eclat in all iterations, whereas took nearly as long as dEclat in all iterations except the third and fourth, because of the time required by the driver to produce candidates and save the results to HDFS causes SHFIM to take longer than usual. SHFIM takes only 61 s to complete, but Eclat and dEclat take roughly 375 and 55 s, respectively.  Table 5 shows the average time (in seconds) of the FIM algorithm on the four datasets. The results conclude that the SHFIM algorithm has recorded 7, 5, and 39 s in T1014D100k, Retail, and Mushroom datasets respectively, which is the best performance among the others. While in chess dataset, SHFIM has recorded 51 s, which is an intermediate performance between Eclat and dEclat. We discovered that the SHFIM is better suited to datasets containing thousands of variable-length transactions because, at high and low min sup, the SHFIM can adapt to datasets containing thousands, if not millions, of variablelength transactions without a memory leak or spilling due to the enhancements made. Furthermore, the SHFIM performs admirably in dense and sparse datasets with low and high min_sup, and it is a scalable approach that can mine large amounts of data. SHFIM demonstrated considerably higher execution time performance with datasets containing a high proportion of variable-length transactions, which is the big data nature.  Table 6 summarizes the frequent itemsets for various min_sup in both dense and sparse datasets. The summary includes the number of transactions associated with each dataset sample. Min_sup is calculated by multiplying the number of transactions by the predefined min_sup% to determine if an itemset is frequent or not. In addition, it contains the number of frequent itemsets for each dataset based on min_sup. The tidset size per itemset is the number of transactions' ids in each itemset resulting from intersections between itemsets. Finally, diffset size per itemset is the number of transactions' ids that are located in each itemset due to itemset differences. For example, mushroom is an 8124-transaction dense dataset that contains 2735 frequent itemsets whose supports are larger than min_sup (2437). Each frequent itemset must have a tidset count ranging from 2437 to 8124. On the other hand, diffset size is at least two or three times less than tidset size since the difference reduces data significantly. As a result, diffset is memory-conscious, compact, and quick to process.

Time Complexity
In this section, we are going to present the time complexity of the SHFIM. The complexity of SHFIM can be influenced by factors such as min_sup threshold, the average number of items per transaction, the number of transactions, and the total number of dataset items. Suppose we have a Dataset D contains n number of transactions, k number of items, and m as the number of items in the biggest transaction. In phase 1, we need to traverse each item in every transaction to discover the first frequent itemset. This process takes in the worst-case O(n × m). In Phase 2, BF starts to access each transaction item and prune the items that are not present in the BF and creates pairs from the items that are present after pruning, which takes in the worst-case O(n × m). Furthermore, BF keeps tracking of each transaction ID for each pair created which costs O(1) then finally reduce the results to create the vertical layout which costs O(1). The time complexity of phase 2 can be measured by O(n × m + 1 + 1) equal to O (n × m). Assume F k is the vertical frequent dataset from phase 2 that contains N number of itemsets. Let the number of candidates' itemsets of iteration k is C k . Phase 3 in SHFIM is an iterative process; hence, we find the time complexity for kth iteration/pass. Phase 3 begins with k = 3 till the end of iterations in which SHFIM starts to generate C k candidates from F k−1 , which takes O N k time. Afterward, all the candidates should be mapped with their originated itemsets which are present F k-1 which takes O(N) time. We further visit each candidate after mapping to calculate the difference between each mapped itemsets which requires O(C k ), and it requires O(1) to calculate the differences between its itemsets. We assume that the shared or broadcasted candidate list data requires O(C k ) time. Therefore, phase 3 takes O( N k × N × C k × C k ) time to be completed. Since phase 3 is an iterative process based on k, the complexity will be O( kn ∑ k=3 N k × N × C k 2 ), where k ≥ 3. Finally, the total complexity of SHFIM proposed algorithm is the summation of the complexity of phase 1, phase 2, and phase 3.

Conclusions
FIM is the most common technique used in discovering frequent patterns from transactional datasets. Frequent patterns have a wide effect in many applications including classifications, market basket analysis, mobile computing, web mining, etc. Apriori is computing intensive algorithm; therefore, lots of resources (Memory and processing) are required. Moreover, Apriori uses horizontal data representation and has some challenges such as multiple dataset scans and candidate generating in each iteration, which makes Apriori suffer from big data. Vertical data representation is smaller than horizontal representation in size and carries information through tidsets about each itemset support. Eclat uses vertical data representation (tidset) and achieved observed performance in sparse In this paper, we proposed SHFIM (Spark-based Hybrid Frequent Itemset Mining) a novel algorithm that utilizes both the horizontal and vertical layouts to solve the drawbacks in both Apriori and Eclat. SHFIM is a three phases algorithm, which works perfectly in a distributed environment. Phases one and two use the horizontal layout to extract the first and second frequent itemset. Phase three is an iterative process to extract k frequent itemset in k iterations. This phase uses mainly diffset to enhance execution time and memory consumption because it shrinks itemsets into smaller sets that will be mined in less time and consume less space. The support of an itemset is calculated by exploiting the vertical layout in every worker node. As the vertical layout size is smaller than the horizontal layout, therefore it requires less memory and less execution time. We developed SHFIM on Spark framework due to its ability to deal with the iterative problem better than Hadoop MapReduce. Spark is 100 times quicker than Hadoop in data processing and has lots of features such as in-memory processing, RDD data structure, broadcasting variables, partitioning of data, and applied Spark best practices to reduce data shuffling between nodes. These features make the Spark the best choice for us that help SHFIM to deal with big data efficiently and increase its execution time performance. Extensive experiments have been conducted between SHFIM, Eclat, and dEclat over Spark clusters for dense and sparse datasets. The Experimental results proved that SHFIM can compete well in both dense and sparse datasets and shows noticeably better performance in either lower or higher min_sup in terms of execution time than others in datasets that have lots of variable-length transactions which is the nature of big data. In the future work, we are planning to enhance the SHFIM be more efficient. The results proved that the use of tidset, diffset, and Bloom Filter accelerate the speed of FIM in big datasets. We plan to choose between tidset and diffset on the itemset itself rather than the whole dataset instead of applying the diffset and continue using diffset from the third iteration in the whole dataset.