Algorithms
  • Article
  • Open Access

28 November 2018

New and Efficient Algorithms for Producing Frequent Itemsets with the Map-Reduce Framework

Department of Computer Science, Ben-Gurion University, Beer-Sheva 8410501, Israel
Department of Computer Science, Open University, Ra’anana 4353701, Israel
Author to whom correspondence should be addressed.
This article belongs to the Special Issue MapReduce for Big Data

Abstract

The Map-Reduce (MR) framework has become a popular framework for developing new parallel algorithms for Big Data. Efficient data mining of big data and of distributed databases has become an important problem. In this paper, we focus on algorithms producing association rules and frequent itemsets. After reviewing the most recent algorithms that perform this task within the MR framework, we present two new algorithms: one for producing closed frequent itemsets, and a second for producing frequent itemsets when the database is updated and new data is added to the old database. Both algorithms include novel optimizations which are suitable for the MR framework, as well as for other parallel architectures. A detailed experimental evaluation shows the effectiveness and advantages of the algorithms over existing methods when it comes to large distributed databases.

1. Introduction

The amount of information generated in our world has grown at an exponential rate over the last few decades. The rise of the internet, the growth in the number of internet users, social networks with user-generated data, and other digital processes have contributed to petabytes of data being generated and analyzed. This process resulted in a new term: Big Data. Classical databases (DB) are unable to handle data of such size and velocity. Therefore, special tools were developed for this task. One of the common tools in use today is the Map-Reduce (MR) framework [1]. It was originally developed by Google, but currently the most researched version is an open-source project called Hadoop [2]. MR provides a parallel distributed model and framework that scales to thousands of machines.
While more recent parallel architectures such as Spark [3] exist today, arguably, in terms of developed algorithms, MR is the most popular framework for contemporary large-scale data analytics [4]. The original MR paper has been cited more than twenty-five thousand times. Therefore, MR is the focus of the present paper. In addition, the algorithms presented in this paper include optimizations that can be applied to any parallel architecture that processes large distributed databases where each node processes one chunk of data; thus, their applicability extends beyond the MR framework.
Association Rules Mining (ARM) is an important problem in Data Mining and has been heavily researched since the 1990s. It is solved in two steps: first, finding all Frequent Itemsets (FI) in a process called Frequent Itemset Mining (FIM), and then generating the rules themselves from the FI. FIM is the most computationally intensive part of ARM [5,6,7]; solving FIM efficiently therefore allows efficient solving of the ARM problem. Most studies have thoroughly covered centralized settings, both for static datasets [8] and for data streams [9]. With the growth of data, classical FIM/ARM algorithms that were designed to run on a single machine had to be adapted to a parallel environment. Recently, a few solutions were proposed for running classical FIM/ARM algorithms in the Map-Reduce framework [10,11,12,13]. These algorithms find frequent itemsets for a given static database. Our goal in this paper is to improve these algorithms in some common and important scenarios.
There are several strategies to handle FIM more efficiently. One of the strategies, which is a major branch in this research field, is mining closed frequent itemsets instead of frequent itemsets in order to discover non-redundant association rules. A set of closed frequent itemsets is proven to be a complete yet compact representation of the set of all frequent itemsets [14]. Mining closed frequent itemsets instead of frequent itemsets saves computation time and memory usage, and produces a compacted output. Many algorithms, like Closet, Closet+, CHARM and FP-Close [8], have been presented for mining closed frequent itemsets in centralized datasets. Handling very large databases is more challenging than mining centralized data in the following aspects: (1) the distributed settings are in a shared-nothing environment (one can of course share data, however it is very expensive in terms of communication), meaning that assumptions like shared memory and shared storage, that lie at the base of most algorithms, no longer apply; (2) data transfer is more expensive than data processing, meaning that performance measurements change; (3) the data is huge and cannot reside on a single node. This paper will describe our scheme for distributed mining of closed frequent itemsets which overcomes the drawbacks of existing algorithms.
Another strategy for efficiently handling FI is to mine FI and then always keep FI up-to-date. There is a need for an algorithm that will be able to update the FI effectively when the database is updated, instead of re-running the full FIM algorithm on the whole DB from scratch. There are incremental versions of FIM and ARM algorithms [15,16] for single machine execution. Some of these algorithms can even suit a distributed environment [17], but not the MR model. Because the MR model is more limited than general distributed or parallel computation models, the existing algorithms cannot be used in their current form. They must be adjusted and carefully designed for the MR model to be efficient.
Our contributions in this paper are:
  • A novel algorithm for mining closed frequent itemsets in big, distributed data settings, using the Map-Reduce paradigm. Using Map-Reduce makes our algorithm very pragmatic and relatively easy to implement, maintain and execute. In addition, our algorithm does not require a separate duplication elimination step, which is common to most known algorithms (avoiding this step makes both the mapper and the reducer somewhat more complicated, but yields better performance).
  • A general algorithm for mining incremental frequent itemsets for general distributed environments with additional optimizations of the algorithm. Some of the optimizations are unique for the Map-Reduce environment but can be applied to other similar architectures.
  • We conducted extensive experimental evaluation of our new algorithms and show their behavior under various conditions and their advantages over existing algorithms.
The rest of the paper is structured as follows. Section 2 reviews background and related work. Section 3 presents the closed frequent itemset mining algorithm. Section 4 discusses the incremental frequent itemset algorithm. Section 5 concludes the paper. A preliminary short presentation of the algorithms has appeared in [18,19].

3. Mining Closed Frequent Itemsets with Map-Reduce

3.1. Problem Definition

Let I = {i_1, i_2, …, i_m} be a set of items with a lexicographic order. An itemset x is a set of items such that x ⊆ I. A transactional database D = {t_1, t_2, …, t_n} is a set of itemsets, each called a transaction. Each transaction in D is uniquely identified by a transaction identifier (TID) and is assumed to be sorted lexicographically. The difference between a transaction and an itemset is that an itemset is an arbitrary subset of I, while a transaction is a subset of I that exists in D and is identified by its TID, tid. The support of an itemset x in D, denoted sup_D(x), or simply sup(x) when D is clear from the context, is the number of transactions in D that contain x (sometimes it is given as a percentage of the transactions).
Given a user-defined minimum support, denoted minSup, an itemset x is called frequent if sup(x) ≥ minSup × |D|.
Let T ⊆ D be a subset of transactions from D and let x be an itemset. We define the following two functions f and g:
f(T) = { i ∈ I | ∀t ∈ T, i ∈ t }
g(x) = { t ∈ D | ∀i ∈ x, i ∈ t }
Function f returns the intersection of all the transactions in T, and function g returns the set of all the transactions in D that contain x. Notice that g is antitone, meaning that for two itemsets x_1 and x_2: x_1 ⊆ x_2 ⇒ g(x_2) ⊆ g(x_1). It is trivial to see that sup(x) = |g(x)|. The function h = f ∘ g is called the Galois operator or closure operator.
An itemset x is closed in D if h(x) = x. Equivalently, an itemset x is closed in D if no proper superset of x has the same support in D.
Given a database D and a minimum support minSup, the closed frequent itemset mining problem is to find all itemsets that are both frequent and closed in D.
Let I = {a, b, c, d, e, f}, let minSup = 60% (i.e., a minimum count of 3 transactions) and let D be the transaction database presented in Table 2. Consider the itemset {c}. It is a subset of transactions t_1, t_3, t_4 and t_5, meaning that sup({c}) = 4, which is greater than minSup × |D| = 3. However, {c, f}, which is a proper superset of {c}, is also a subset of the same transactions. Thus {c} is not a closed itemset, since sup({c, f}) = sup({c}). The list of all closed frequent itemsets is: {a}, {c, f}, {e} and {c, e, f}.
Table 2. D Example. TID = transaction identifier.
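To make the closure operator concrete, the following is a minimal Python sketch of f, g and h = f ∘ g over a small in-memory database. It is illustrative only: the transactions below are hypothetical stand-ins (the contents of Table 2 are not reproduced here), and the paper's actual implementation is in Java on Hadoop.

```python
# Minimal sketch of the Galois/closure operator h = f ∘ g on a toy database.
# The transactions below are illustrative; they are NOT the contents of Table 2.
D = {
    "t1": {"a", "c", "d", "e", "f"},
    "t2": {"a", "b"},
    "t3": {"b", "c", "e", "f"},
    "t4": {"a", "c", "d", "f"},
    "t5": {"c", "e", "f"},
}

def g(x, db):
    """All TIDs of transactions that contain itemset x."""
    return {tid for tid, t in db.items() if x <= t}

def f(tids, db):
    """Intersection of the transactions identified by tids."""
    ts = [db[tid] for tid in tids]
    return set.intersection(*ts) if ts else set()

def h(x, db):
    """Closure operator: h(x) = f(g(x))."""
    return f(g(x, db), db)

def is_closed(x, db):
    return h(x, db) == set(x)

sup = lambda x: len(g(x, D))
print(sup({"c"}), h({"c"}, D), is_closed({"c"}, D))   # here {c} closes to {c, f}, so it is not closed
```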
We now present an algorithm for mining frequent closed itemsets in a distributed setting, using the Map-Reduce paradigm. It uses the generator idea mentioned in Section 2.2.

3.2. The Algorithm

3.2.1. Overview

Our algorithm is iterative, where each iteration is a Map-Reduce job. The inputs for iteration i are:
  • D , the transaction database
  • C_{i−1}, the set of the closed frequent itemsets found in the previous iteration (C_0, the input for the first iteration, is the empty set).
The output of iteration i is C_i, a set of closed frequent itemsets that have a generator of length i. If C_i ≠ ∅, then another iteration, i + 1, is performed. Otherwise, the algorithm stops. As mentioned earlier, each iteration is a Map-Reduce job (line 7 in Algorithm 1; see details in Algorithms 2, 3 and 4), comprised of a map phase and a reduce phase. The map phase, which corresponds to the g function, emits sets of items called closure generators (or simply generators). The reduce phase, which corresponds to the f function, finds the closure that each generator produces and decides whether or not it should be added to C_i. Each set added to C_i is paired with its generator. The generator is needed for the next iteration.
The output of the algorithm, which is the set of all closed frequent itemsets, is the union of all the C_i sets.
Before the iterations begin, we found that a pre-processing phase that finds only the frequent items greatly improves performance, even though it requires another Map-Reduce job and its output must be shared among all mapper tasks. This Map-Reduce job simply counts the support of all items and keeps only the frequent ones.
The pseudo-code of the algorithm is presented below (Algorithm 1). We provide explanations of the important steps in the algorithm.
Algorithm 1 (pseudo-code figure).
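The following is a minimal sketch of the iterative driver described above, written in Python for readability (the paper's implementation is in Java on Hadoop, and Algorithm 1 itself is given as a figure). The job wrappers are caller-supplied and hypothetical; representing C_0 as a single empty closure/generator pair is a sketch-level convention that makes the first iteration emit single items, matching the run example in Section 3.2.4.

```python
# Illustrative driver loop for the iterative closed-FIM scheme (not the authors' Algorithm 1).

def mine_closed_frequent_itemsets(db_path, min_sup, run_preprocess_job, run_closure_job):
    """run_preprocess_job / run_closure_job: caller-supplied, hypothetical wrappers
    around the two Map-Reduce jobs (frequent-item counting and one closure iteration)."""
    frequent_items = run_preprocess_job(db_path, min_sup)   # pre-process job: frequent single items
    closed = []
    prev = [(frozenset(), frozenset())]                      # C_0, modeled as one (empty closure, empty generator) pair
    while prev:                                              # stop when C_i is empty
        # One MR job per iteration: map emits generators, reduce outputs C_i
        # as (closure, generator) pairs.
        prev = run_closure_job(db_path, min_sup, prev, frequent_items)
        closed.extend(prev)
    return closed                                            # the union of all C_i
```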

3.2.2. Definitions

To better understand the algorithm, we need some definitions:
Definition 1.
Let p be an itemset, and let c be a closed itemset, such that h ( p ) = c , then p is called a generator of c .
Note, that a closed itemset might have more than one generator: in the example above, both { c } and { f } are the generators of { c , f } .
Definition 2.
An execution of a map function on a single transaction is called a map task.
Definition 3.
An execution of a reduce function on a specific key is called a reduce task.

3.2.3. Algorithm Steps

Map Step
A map task in iteration i gets three parameters as input: (1) the set of all the closed frequent itemsets (with their generators) found in the previous iteration, denoted C_{i−1} (which is shared among all the mappers of the same iteration); (2) a single transaction, denoted t; and (3) the set of all frequent items in D (this set is also shared among all the mappers, in the same iteration and in all iterations). Note that in the Hadoop implementation, the mapper gets a set of transactions called a Split, and the mapper object calls the map function for each transaction in its own Split only.
For each c ∈ C_{i−1}, if c ⊆ t, then t holds the potential of finding new closed frequent itemsets by looking at the complement of c in t (line 3). For each item ∈ (t \ c), we check whether item is frequent (line 5). If so, we concatenate item to the generator of c (denoted c.generator), thus creating g (we denote the added item as g.item), a potential new generator of other closed frequent itemsets (line 6). The function emits a message where g is the key and the tuple (t, 1) is the value (line 7). The “1” is later summed up and used to count the support of the itemset.
Notice that g is not only a generator, but always a minimal generator: concatenating an item that is not in the closure is guaranteed to reach another minimal generator. More precisely, the map task generates all minimal generators that are supersets of c.generator by one additional item and that are supported by t. Since all transactions are processed, every minimal generator with a support of at least one is emitted at some point (this is proven later). The pseudo-code of the map function is presented in Algorithm 2.
Algorithm 2 (pseudo-code figure).
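Below is an illustrative Python sketch of this map step (the authors' actual mapper is a Java/Hadoop class; emit stands for the framework's key/value output call, and carrying the added item alongside the key is a sketch-level choice so the reducer can apply the duplication test).

```python
# Sketch of the map step for iteration i (illustrative, not the authors' code).

def map_task(t, prev_closed, frequent_items, emit):
    """t: one transaction (frozenset of items).
    prev_closed: C_{i-1} as (closure, generator) pairs.
    frequent_items: globally frequent single items from the pre-process job."""
    for closure, generator in prev_closed:
        if closure <= t:                                # c is contained in t (line 3)
            for item in t - closure:                    # candidate items from t \ c
                if item in frequent_items:              # keep only frequent items (line 5)
                    g = frozenset(generator | {item})   # new potential generator (line 6)
                    # The added item travels with the key so the reducer can run
                    # the duplication test; the "1" is summed into the support.
                    emit((g, item), (t, 1))             # line 7
```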
Combine Step
A combiner is not a part of the Map-Reduce programming paradigm, but a Hadoop implementation detail that minimizes the data transferred between map and reduce tasks. Hadoop gives the user the option of providing a combiner function that is to run on the map output on the same machine running the mapper, and the output of the combiner function is the input for the reduce function.
In our implementation, we have used a combiner, which is quite similar to the reducer but much simpler. The input to the combiner is a key and a collection of values: the key is the generator g (which is an itemset), and the values are tuples, each composed of a transaction containing g and a number s indicating the support of that tuple. Since the combiner is “local” by nature, it makes no use of the minimum support parameter, which must be applied from a global point of view. The combiner sums the support of the input tuples, stores it in the variable sum, and then intersects the transactions of the tuples to obtain t.
The combiner emits a message where g is the key and the tuple ( t , s u m ) is the value. The pseudo-code of the combiner function is presented in Algorithm 3.
Algorithm 3 (pseudo-code figure).
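An illustrative sketch of such a combiner is shown below (Python; the real combiner is a Hadoop Java class, and the key layout follows the map sketch above).

```python
# Sketch of the combine step: sum local supports and intersect local transactions.

def combine(key, values, emit):
    """key: the generator (plus added item, as in the map sketch).
    values: iterable of (transaction_or_partial_intersection, count) tuples."""
    total = 0
    acc = None
    for t, s in values:
        total += s
        acc = set(t) if acc is None else (acc & set(t))   # running intersection
    emit(key, (frozenset(acc), total))                    # (t', sum) for the reducer
```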
Reduce Step
The reduce task receives as input a key, a collection of values, and the minimum support. The key is the generator g (which is an itemset); the values are a collection (t_1, s_1), …, (t_n, s_n) of n tuples, each composed of a set of items t_i (an intersection of transactions produced by a combiner), all containing g, and a number s_i indicating the support of that tuple. In addition, it gets, as a parameter, the user-given minimum support, minSup. The reducer is depicted in Algorithm 4.
First, the frequency property is checked: sup(g) = Σ_{i=1}^{n} s_i ≥ minSup × |D|. If it holds, then the intersection of t_1, …, t_n is computed, producing a closure denoted c. If the item that was added in the map step is lexicographically greater than the first item in c \ g, then c is a duplicate and can be discarded. Otherwise, a new closed frequent itemset is discovered and is added to C_i.
In other words, if the test in line 7 passes, then it is guaranteed that the same closure c is found (and kept) in another reduce task, namely the one that gets c from its lexicographically first minimal generator (as is proven later).
The pseudo-code of the reduce function is presented below.
In line 5 of the algorithm, we apply the f function, which is simply an intersection of all the transactions in T. Notice that we do not need to read all of T and store it in RAM; T can be treated as a stream, reading transactions one at a time and performing the intersection incrementally.
Algorithm 4 (pseudo-code figure).
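The following Python sketch illustrates the reduce step under the same sketch-level conventions as above (it is not the authors' Java code; D_size and min_sup stand for |D| and minSup).

```python
# Sketch of the reduce step: frequency test, closure computation, duplication test.

def reduce_task(key, values, min_sup, D_size, emit):
    """key: (generator g, added item); values: iterable of (itemset, count) tuples."""
    g, added_item = key
    support = 0
    closure = None
    for t, s in values:                      # values can be streamed; no need to keep T in RAM
        support += s
        closure = set(t) if closure is None else (closure & set(t))
    if support < min_sup * D_size:           # frequency test
        return
    rest = sorted(closure - set(g))          # c \ g in lexicographic order
    if rest and added_item > rest[0]:        # duplication test (line 7 of Algorithm 4)
        return                               # the same closure is kept by another reduce task
    emit(frozenset(closure), (g, support))   # a new closed frequent itemset, with its generator
```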

3.2.4. Run Example

Consider the example database D in Table 2 with a minimum support of two transactions (minSup = 40%). To simulate a distributed setting, we assume that each transaction t i resides on a different machine in the network (mapper node), denoted m i .
1st Map Step. We track node m_1. Its input is the transaction t_1, and since this is the first iteration, C_{i−1} = C_0 = ∅. For each item in the input transaction, we emit a message containing the item as a key and the transaction as a value. So, the messages that m_1 emits are the following: <{a}, {a,c,d,e,f}>, <{c}, {a,c,d,e,f}>, <{d}, {a,c,d,e,f}>, <{e}, {a,c,d,e,f}>, and <{f}, {a,c,d,e,f}>. A similar mapping process is done on the other nodes.
1st Reduce Step. According to the Map-Reduce paradigm, a reducer task is assigned to every key. We follow the reducer tasks assigned to keys { a } , { c } and { f } , denoted R a , R c , and R f respectively.
First, consider R a . According to the Map-Reduce paradigm, this reduce task receives in addition to the key { a } , all the transactions in { D } that contain that key: t 1 , t 2 and t 4 . First, we must test the frequency: there are three transactions containing the key. Since minSup * | D | = 2, we pass the frequency test and go on. Next, we intersect all the transactions, producing the closure { a } . The final check is whether the closure is lexicographically larger than the generator. In our case it is not (because the generator and closure are equal), so we add { a } to C 1 .
Next, consider R c . This reduce task receives the key { c } , and transactions t 1 , t 3 , t 4 and t 5 . Since the number of transactions is four, we pass the frequency test. The intersection of the transactions is the closure { c , f } . Finally, { c } is lexicographically smaller than { c , f } , so we add { c , f } to C 1 .
Finally, consider R f . The transactions that contain the set { f } are t 1 , t 3 , t 4 and t 5 . We pass the frequency test, but the intersection is { c , f } , just like in reduce task R c , so we have a duplicate result. However, { f } is lexicographically greater than { c , f } , so this closure is discarded.
The final set of all closed frequent itemsets found in the first iteration is: C_1 = {{a : a}, {c,f : c}, {e : e}} (the itemset after the colon is the generator of the closure).
2nd Map Step. As before, we follow node m_1. This time the set of closed frequent itemsets is not empty, so according to the algorithm, we iterate over all c ∈ C_1. If the input transaction t contains c, we extend the generator of c with each item in t \ c, one at a time, and emit it. So, the messages that m_1 emits are the following:
< { a , c } , { a , c , d , e , f } > , < { a , d } , { a , c , d , e , f } > , < { a , e } , { a , c , d , e , f } > , < { a , f } , { a , c , d , e , f } > , < { c , d } , { a , c , d , e , f } > , < { c , e } , { a , c , d , e , f } > , < { c , f } , { a , c , d , e , f } > , < { e , f } , { a , c , d , e , f } > .
2nd Reduce Step. Consider reduce task R_{ac}. According to the Map-Reduce paradigm, this reduce task receives all the messages containing the key {a,c}, which come from transactions t_1 and t_4. Since minSup = 2, we pass the frequency test. Next, we consider the key {a,c} as a generator and intersect all the transactions, getting the closure {a,c,d,f}. The final check is whether the added item c is lexicographically greater than the first item of the closure minus the generator. In our case it is not, so we add {a,c,d,f} to the set of closed frequent itemsets.
The full set of closed frequent itemsets is shown in Table 3. Next, we prove the soundness and completeness of the algorithm.
Table 3. Closed Frequent Itemsets of D.

3.2.5. Soundness

The mapper phase makes sure that the input to the reducer is a key, which is a subset of items p, and the set of all transactions that contain p, which by definition is T = g(p). The reducer first checks that sup(p) ≥ minSup × |D| by checking |T| ≥ minSup × |D|, and then performs an intersection of all the transactions in T, which by definition is the result of the function f(T), and outputs the result. So, by definition, every output is the result of f ∘ g, which is a closed frequent itemset.

3.2.6. Completeness

We need to show that the algorithm outputs all the closed frequent itemsets. Assume, by way of contradiction, that c = {i_1, …, i_n} is a closed frequent itemset that is not produced. Suppose first that c has no proper subset that is a closed frequent itemset. Then, for every item i_j ∈ c, sup(i_j) = sup(c) and g(i_j) = g(c), and therefore h(i_j) = h(c) = c. Since h(i_j) = c, i_j is a generator of c, and the algorithm will output c in the first iteration.
Suppose now that c has one or more proper subsets that are closed frequent itemsets. We examine the largest one and denote it l. l is generated by the algorithm, because its generator is shorter than the generator of c. We denote its generator g_l, meaning that g(l) = g(g_l). Since g is antitone and since g_l ⊆ c, we have g(c) ⊆ g(g_l). What we show next is that if we add to g_l one of the items not in l, we generate c. Consider an item i such that i ∈ c \ l, and let g'_l = g_l ∪ {i}. Therefore, g(g'_l) = g(g_l) ∩ g(i) = g(l) ∩ g(i). Assume that g(g'_l) ≠ g(c). This implies that g'_l is a generator of a closed frequent itemset h(g'_l) that is a proper superset of l and a proper subset of c, in contradiction to l being the largest closed subset of c. Therefore g(g'_l) = g(c), meaning that c will be found by the mapper by adding an item to g_l (see lines 3–4 in Algorithm 2, the mapper).

3.2.7. Duplication Elimination

As we saw in the run example in Section 3.2.4, a closed itemset can have more than one generator, meaning that two different reduce tasks can produce the same closed itemset. Furthermore, these two reduce tasks can be in two different iterations. We have to identify duplicate closed itemsets and eliminate them. The naive way to eliminate duplications is by submitting another Map-Reduce job that sends all identical closed itemsets to the same reducer. However, this means that we need another Map-Reduce job for that, which greatly damages performance. Line 7 in Algorithm 4 takes care of that without the need for another Map-Reduce round. In the run example, we have already seen how it works when the duplication happens on the same round.
What we would like to show is that the duplication elimination step does not “lose” any closed itemsets. We now explain the method.
Suppose that the itemset c = {i_1, i_2, …, i_n} is a closed frequent itemset and that g = {i_{g_1}, i_{g_2}, …, i_{g_m}}, m < n, is a generator of it, such that h(g) = c. According to our algorithm, g was created by adding an item to a previously found itemset; we denote that itemset f and the added item i_{g_j}, such that g = f ∪ {i_{g_j}}. Suppose that i_{g_j} is lexicographically greater than the first item of c \ g. Our algorithm will then eliminate c, so we should show that c can be produced by a different generator. Let i_k be the smallest item in c \ g. Since i_k ∈ c and c is frequent, and since i_k ∉ g, then surely i_k ∉ f, meaning that the algorithm will add it to f, creating g′ = f ∪ {i_k}. It is possible that h(g′) ≠ c; however, if we keep growing the generator with the smallest remaining items, we will eventually obtain c.

3.3. Experiments

We have performed several experiments in order to verify the efficiency of our algorithm and to compare it with other renowned algorithms.

3.3.1. Data

We tested our algorithm on both real and synthetic datasets. The real dataset was downloaded from the FIMI repository [35,36], and is called “webdocs”. It contains close to 1.7 million transactions (each transaction is a web document) with 5.3 million distinct items (each item is a word). The maximal length of a transaction is about 71,000 items. The size of the dataset is 1.4 gigabytes (GB). A detailed description of the “webdocs” dataset, that also includes various statistics, can be found in [36].
The synthetic dataset was generated using the IBM data generator [37]. We have generated six million transactions with an average of ten items per transaction—a total of 100,000 items. The total size of the input data is 600 MB.

3.3.2. Setup

We ran all the experiments on the Amazon Elastic Map-Reduce [31] infrastructure. Each run was executed on sixteen machines; each is an SSD-based instance storage for fast I/O performance with a quad core CPU and 15 GB of memory. All machines run Hadoop version 2.6.0 with Java 8.

3.3.3. Measurement

We used communication–cost (see Section 2.6) as the main measurement for comparing the performance of the different algorithms. The input records to each map task and reduce task were simply counted and summed up at the end of the execution. This count is performed on each machine in a distributive manner. The implementation of Hadoop provides an internal input records counter that makes the counting and summing task extremely easy. Communication–cost is an infrastructure-free measurement, meaning that it is not affected by weaker/stronger hardware or temporary network overloads, making it our measurement of choice. However, we also measured the time of execution. We ran each experiment three times and gave the average time.

3.3.4. Experiments Internals

We have implemented the following algorithms: (1) an adaptation of Closet to Map-Reduce; (2) the AFOPT-close adaptation to Map-Reduce; and (3) our proposed algorithm. All algorithms were implemented in Java 8, taking advantage of its new lambda expressions support.
We ran the algorithms on the two datasets with different minimum supports, and measured the communication cost and execution time for each run.

3.3.5. Results

The first batch of runs was conducted on the synthetic dataset. The results can be seen in Figure 4 and Figure 5. In Figure 4, the lines represent the communication cost of each of the three algorithms for different minimum supports. The bars present the number of closed frequent itemsets found for each minimum support. The number of closed frequent itemsets depends only on the minimum support and gets higher as the minimum support gets lower. As can be seen, our algorithm outperforms the others in terms of communication cost for all the minimum supports. In addition, the growth gradient of its communication cost is lower than that of the others, meaning that further decreases in the minimum support will make the difference even greater. Figure 5 shows the running time of the three algorithms for the same minimum supports. Again, as can be seen, our algorithm outperforms the others.
Figure 4. Communication–cost of the algorithms on synthetic data.
Figure 5. Running time of the algorithms on the synthetic data.
In the second batch of runs, we ran the implemented algorithms on the real dataset with four different minimum supports, and measured the communication cost and execution time for each run. The results can be seen in the figures below (Figure 6 and Figure 7). The figures are analogous to the two previous figures, and as can be seen, our algorithm outperforms the existing algorithms.
Figure 6. Communication–cost of the algorithms on real data.
Figure 7. Comparing execution time of the algorithms on real data.

4. Incremental Frequent Itemset Mining with Map-Reduce

4.1. Problem Definition

Let D be a database of transactions, I a set of items and minSup the minimum support level as described in Section 3.1. Define FI to be the set of all frequent itemsets in D: FI(D, minSup) = { x | sup_D(x) ≥ minSup × |D| }. Let PK be some previous knowledge that we produced during the FIM process of D. Let Δ be the set of additional transactions, and let D+ be the new database defined as D+ = D ∪ Δ. The problem is to find FI(D+, minSup), the set of all frequent itemsets in the updated database. We denote FI(Δ) by deltaFI. We may omit minSup from functions when the support level is clear from the context. We call any x such that x ∈ FI a frequent itemset, or simply “frequent”.

4.2. The Algorithms

4.2.1. General Scheme

We first propose a general algorithm (Algorithm 5) for incremental frequent itemset mining. It can be used with any distributed or parallel framework, but it also suits the Map-Reduce model. The algorithm is loosely based on the FUP [15] algorithm and shares similarities with ARMIDB [7] (see Section 2.3). The idea is to first find all frequent itemsets in the new data only (deltaFI), then unite (join) the new frequent itemsets with the old frequent itemsets, and finally revalidate itemsets whose status is “unknown” in either of the databases. This algorithm is general because it places no constraints on the FIM algorithm that is used (any Map-Reduce algorithm for finding frequent itemsets is suitable). We show some optimizations of it later.
Algorithm 5 (pseudo-code figure).
Brief description of the algorithm (Algorithm 5): Step 1 is the execution of the MR main program (the “driver”). Step 2 gets all the parameters and fills the MR job configuration. Steps 3 and 4 execute any standard MR FIM algorithm on the incremental DB (Δ) and save its output back to HDFS (Step 4). For example, using IMRApriori as the MR FIM algorithm means first running an MR job whose mapper finds the FI of each data Split and whose reducer merges all “local” FIs into a single candidate list; a second MR job then uses its mapper to count all candidate occurrences in each data Split, and its reducer to sum the candidate occurrences and output only the “globally” frequent itemsets.
Steps 5 and 6 check whether this is an incremental run; if not, the algorithm simply stops. If it is an incremental run, then we need to find the “globally” frequent itemsets (GFI) of D+, i.e., FI(D+, minSup), from FI(Δ) and the previous FI(D). As mentioned in Section 2.3, when adding new transactions, locally frequent itemsets have three options. To determine which option applies to each itemset, we propose using an MR job for joining itemsets (Steps 7–8). Any join MR algorithm may be used here (during result evaluation, we used a repartition join for this task). The key of the Join reducer is the itemset itself, and the list of values holds the occurrences of the itemset in the different DB parts, together with its count. During our evaluation, Step 7 (the Join’s mapper) read itemsets from the databases and output them together with their database mark. Step 8 (the Join’s reducer) read an itemset and all its database marks, and determined the further processing required for it. There are three cases (a sketch of this decision appears after the list below):
  • If it is locally frequent in both Δ and the old D, then it is globally frequent, so it may be output immediately (Step 9).
  • If it is locally frequent only in Δ, then we need to count it in the old D (Steps 10–11, using the additional count MR job).
  • If it is locally frequent only in the old D, then we need to count it in Δ (Steps 12–13, using the same MR job as in Steps 10–11 with different input).
All three outputs (9, 11 and 13) are collected in step 14, and represent together FI ( D + , minSup ) .
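The three-way decision of the Join reducer can be sketched as follows (illustrative Python; the database markers 'delta' and 'old' and the output names are hypothetical, not the authors' implementation).

```python
# Sketch of the Join reducer's decision (Steps 8-13 of the General Scheme).

def join_reduce(itemset, marks, emit):
    """marks: the database markers attached by the Join mapper for this itemset."""
    in_delta = "delta" in marks              # locally frequent in Δ
    in_old = "old" in marks                  # locally frequent in the old D
    if in_delta and in_old:
        emit("globally_frequent", itemset)   # Step 9: frequent in both parts
    elif in_delta:
        emit("recount_in_old_D", itemset)    # Steps 10-11: must be counted in the old D
    elif in_old:
        emit("recount_in_delta", itemset)    # Steps 12-13: must be counted in Δ
```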
The proposed scheme contains at least three different kinds of MR jobs:
  • Find deltaFI by using any suitable MR algorithm (may have more than one job inside).
  • Join MR job. Any Join algorithm may be in use. The Mapper output is just a copy of the input (Identity function); the Reducer should have three output files/directories (instead of just one) for each case.
  • Count itemsets inside the database. The same MR count algorithm may be used for both old D and Δ . Counts in both DBs could be executed in parallel on the same MR cluster.
There is (at most) one pass over the old D for counting—Steps 10–13 use the same algorithm for counting. The mapper reads a list of itemsets for counting and counts them in its data split. The reducer summarizes each itemset, and leaves only “globally” frequent itemsets. There is no requirement for any additional input for general FIM (e.g., previous knowledge, PK), but any advanced algorithm may use any additional acquired knowledge from mining old D as an input to the incremental algorithm.

4.2.2. Early Pruning Optimizations

In the scheme described above, only one step requires accessing the old D, whose size may be huge compared to Δ. This is the step of recounting locally frequent itemsets from Δ that did not appear in FI(D). To minimize access to the old D, we suggest using early pruning techniques that consider the relation between the size of the old D and the size of Δ. These techniques are additions to the early pruning of the IMRApriori technique, but they are not unique to Map-Reduce algorithms and could be used in any incremental FIM algorithm. All of the following lemmas try to determine, as early as possible, whether a candidate itemset can potentially be frequent.
Let inc be the size of Δ relative to the size of D, and let n be the size of D (n = |D|); then the size of Δ is inc × n, i.e., inc × |D|.
Observation 1:
x ∈ FI(D+, minSup) ⟺ sup_{D+}(x) ≥ minSup × (|D| + inc × |D|)
Proof. 
sup_D(x) + sup_Δ(x) = sup_{D+}(x) ≥ minSup × |D+| = minSup × (|D| + |Δ|) = minSup × (|D| + inc × |D|). □
Lemma 1.
(Absolute Count):
x ∈ FI(D, minSup × (1 + inc)) ⟹ x ∈ FI(D+, minSup)
Proof of Lemma 1.
x ∈ FI(D, minSup × (1 + inc)) ⟹ sup_{D+}(x) = sup_D(x) + sup_Δ(x) ≥ sup_D(x) ≥ minSup × (1 + inc) × |D| = minSup × (|D| + inc × |D|) = minSup × (|D| + |Δ|) = minSup × |D+|;
i.e., sup_D(x) ≥ minSup × (1 + inc) × n ⟹ x ∈ FI(D+, minSup).
Lemma 1 ensures that if x is “very” frequent in the old D (a support of at least minSup × (1 + inc)), then it will be frequent in D+ even if it does not appear in Δ at all. □
Lemma 2.
(Minimum Count):
x ∈ FI(D+, minSup) ⟹ sup_D(x) ≥ n × (minSup + minSup × inc − inc)
Proof. 
sup_Δ(x) ≤ |Δ| = inc × n;
|Δ| + sup_D(x) ≥ sup_Δ(x) + sup_D(x)
≥ minSup × |D+| = minSup × (1 + inc) × n;
sup_D(x) ≥ minSup × (1 + inc) × n − |Δ|
= minSup × (1 + inc) × n − inc × n
= n × (minSup + minSup × inc − inc).
To decide whether x can be in FI(D+, minSup) without even looking at Δ, we need to know whether x ∈ FI(D, minSup + minSup × inc − inc). Of course, minSup + minSup × inc − inc may be less than zero (when minSup < inc / (inc + 1)); in that case, there is no minimum level.
Lemma 2 puts a lower bound on the number of occurrences of itemset x in the old D that is necessary for x to possibly appear in the FI of D+ (even if x appears in 100% of the transactions in Δ, it must obey this criterion, and therefore it is a pruning condition). See below for its use for non-frequent itemsets. □
Lemma 3.
sup_Δ(x) ≥ minSup × (1 + inc) × n ⟹ x ∈ FI(D+, minSup)
Proof. 
Similar to the conclusion of Lemma 1:
sup_{D+}(x) = sup_Δ(x) + sup_D(x) ≥ sup_Δ(x) ≥ minSup × (1 + inc) × n = minSup × |D+|. □
Lemma 3 tells us that if x is “very” frequent in Δ, and Δ is large enough or minSup is small enough, then x will appear in the FI of D+ (even if it never appeared in the old D). This lemma is also a pruning condition: if itemset x satisfies it, then there is no need to count it in the old D.
Observation 2 (Absolute Count Delta):
x ∈ FI(Δ, minSup × (1 + 1/inc)) ⟹ x ∈ FI(D+, minSup)
Proof. 
x ∈ FI(Δ, minSup × (1 + 1/inc)) ⟹ sup_Δ(x) ≥ minSup × (1 + 1/inc) × |Δ| = minSup × (1 + 1/inc) × inc × n = minSup × (1 + inc) × n.
To use Lemma 3 in this way, minSup × (1 + 1/inc) must be at most one. □
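A quick numeric sanity check of the bounds from Lemmas 1 and 3, with illustrative numbers that are not taken from the paper:

```python
# Illustrative check of the Lemma 1 and Lemma 3 bounds (hypothetical numbers).
n, inc, min_sup = 1000, 0.1, 0.2       # |D| = 1000, |Δ| = inc*n = 100, minSup = 20%
d_plus = n * (1 + inc)                 # |D+| = 1100

# Lemma 1: sup_D(x) >= minSup*(1+inc)*n makes x frequent in D+ even if sup_Δ(x) = 0.
bound = min_sup * (1 + inc) * n        # 220
assert bound + 0 >= min_sup * d_plus   # 220 >= 220, so x is frequent in D+

# Lemma 3 would need sup_Δ(x) >= 220, but |Δ| = 100, so it cannot fire here;
# equivalently, minSup*(1 + 1/inc) = 2.2 > 1 (the usability condition above).
assert min_sup * (1 + 1/inc) > 1
```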
To use the above lemmas in our algorithm, we modify the FIM algorithm to keep each itemset together with its potential “minimum count” and “maximal count” (over the Splits of Δ and D). Each Split that has information about the exact count of an itemset adds that count to the totals of “minimum count” and “maximal count” (the potential count lies between these values). When there is no information from a Split about some itemset x, we use the observations from IMRApriori: we add Ceil(|Split size| × minSup) − 1 to the “maximal count” (otherwise x would have been locally frequent and we would have exact information about it), and the contribution to the “minimum count” is 0 (the total “minimum count” is not updated). This is done in the Reducer of Stage 1 of IMRApriori. Let χ_i(x) be an indicator function that is 1 if x was locally frequent in Split S_i and 0 otherwise. The reducer outputs a triple <x, mincount, maxcount>, where
mincount = Σ_{i=1}^{|Splits|} χ_i(x) × sup_{S_i}(x)
maxcount = Σ_{i=1}^{|Splits|} [ χ_i(x) × sup_{S_i}(x) + (1 − χ_i(x)) × (⌈|S_i| × minSup⌉ − 1) ]
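An illustrative Python sketch of these bounds for a single itemset follows; the per-Split input format is hypothetical (a list of (split size, local support or None) pairs, where None means the itemset was not locally frequent in that Split).

```python
# Sketch of the mincount/maxcount bounds for one itemset x (illustrative only).
import math

def potential_counts(splits, min_sup):
    """splits: list of (split_size, local_support_or_None) pairs for itemset x."""
    mincount = maxcount = 0
    for size, local_sup in splits:
        if local_sup is not None:                  # exact local count is known
            mincount += local_sup
            maxcount += local_sup
        else:                                      # x was not locally frequent here
            maxcount += math.ceil(size * min_sup) - 1
    return mincount, maxcount
```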
Note that when the exact count is known, mincount equals maxcount. If maxcount < minSup × |D+|, then x is pruned (by the original IMRApriori criterion). If mincount ≥ minSup × |D+|, then x is globally frequent and does not need to be recounted in the missed Splits. The previously defined lemmas are applied in the algorithm during the Join in Step 8. At this point, we already know the sizes of D and Δ, and therefore we know n and inc, so we can compare the potential counts of itemsets directly to the database sizes. The map phase of the Join extracts the potential counts for each DB part (old D or Δ) from the input and outputs them immediately, together with the DB “marker” (a variable that indicates whether the part is D or Δ). Algorithm 6 determines the total potential counts and makes the decision about the further processing of each itemset.
The Split size information and the total size of D are passed as “previous knowledge” (PK) input to the incremental FIM algorithm of the General Scheme at Step 3, and, in our implementation of IMRApriori, in phase 1.
Algorithm 6 (pseudo-code figure).
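In the spirit of Algorithm 6, the per-itemset decision can be sketched as follows (illustrative Python, not the authors' code; mincount and maxcount are the bounds accumulated above, and the return labels are hypothetical).

```python
# Sketch of the decision made for each itemset during the Join, given its potential counts.
import math

def decide(mincount, maxcount, n, inc, min_sup):
    d_plus = n * (1 + inc)                     # |D+| = |D| + |Δ|
    threshold = math.ceil(min_sup * d_plus)    # required count in D+
    if maxcount < threshold:
        return "prune"                         # can never become frequent in D+
    if mincount >= threshold:
        return "globally_frequent"             # frequent without any recounting
    return "recount"                           # must be recounted in the missed part
```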

4.2.3. Early Pruning Example

The following is an example of the “minimal count” (mincount) and “maximal count” (maxcount) for a small DB with 1001 transactions (|D| = 1001). If we have two Mappers, the MR framework splits the DB into two Splits of roughly equal size. Table 4 shows the calculation per Split with a support ratio of 20%. For example, for an itemset to be frequent in D, it needs to be contained in at least ⌈1001 × 0.2⌉ = 201 transactions. This example examines the itemsets A = {a} and B = {b} and their possible appearance in D.
Table 4. Early pruning example values for D .
Table 4 shows how A is preemptively pruned in D. Table 5 shows an example of an incremental computation with Δ. If Δ consists of only five identical transactions {a} (|Δ| = 5, so the minimum support level within Δ is 1), then FI(Δ, minSup) = {{a}}. Also, |D+| = 1006, and the new minimum threshold is 202. The following table shows how the decisions for A and B are made:
Table 5. Early pruning example values for incremental D and Δ .
This example shows that B cannot be frequent in D+, so it will not even be sent to Δ for recounting (had we omitted this optimization, we would have had to go over Δ again to count B).
By using the early pruning optimization, we reduce the number of candidates, which reduces the output of MR and saves CPU resources in future jobs that would otherwise have to count the non-viable “candidates”. The above optimization is valid for any distributed framework.
Next, we show an optimization which is tailored specifically to Map-Reduce.

4.2.4. Map-Reduce Optimized Algorithm

There are few known drawbacks of the Map-Reduce framework [4,38] that can harm the performance of any algorithm. We will concentrate on the overhead of establishing a new computational job, creation of a physical process for the Mapper and the Reducer, on each of the distributed machines, and I/O consumption when it needs to read or write data from/to a remote location, e.g., read input from HDFS.
Our performance evaluation (see Section 4.3) of the General Scheme, even with the early pruning optimization, showed that the CPU time of the algorithms is lower than that of fully processing D+ from scratch, but the parallel run time could be the same. This happens for databases that are small and whose delta is also small. The first reason is that the Incremental Scheme has many more MR jobs than non-incremental FIM. The job creation overhead of all these jobs adds up, and if it is of the same order as the total algorithm time (which happens for small databases/deltas), the incremental algorithm may offer no run-time benefit (although each machine in the cluster runs faster and consumes less energy). Another reason is that the General Scheme needs to read Δ several times from a remote location, since it is required as input for different jobs. In the non-incremental FIM algorithm, the number of I/O reads of the whole D+ (D and Δ) depends on the underlying FIM algorithm, plus the final output of FI(D+). In the General Scheme, the same FIM is executed only on Δ with an output of FI(Δ), but all of FI(Δ) must then be read back from the network disk to join it with FI(D); moreover, the whole Δ must be read again for the recounting step.
To work around these limitations of the Incremental Scheme, we suggest reducing the number of jobs that are used in the General Scheme. It will allow us to reduce start times, and will imply less I/O communication. We will start with the observation that the Join job (Steps 7–8 of Algorithm 5) is required to read the output of the previous FIM job (Step 4) immediately. We suggest merging the FIM output of Step 4 with the Join job. It should receive an additional input ( FI ( D ) ), and instead of writing FI ( Δ ) , it will “join” the results. It will still have three outputs like the original Join job. All optimizations discussed in Section 4.2.2 should be also applied in this combined step.
The next job that would be removed is the recounting step in Δ (Steps 12–13). The only itemsets that could be qualified for this output are itemsets that were not frequent in Δ , and were frequent in D . We suggest counting all itemsets from FI ( D ) during Step 3 of FIM in Δ , as there is already a pass over Δ anyway. IMRApriori phase 2 could be enhanced to do the counting not only for new candidates, but also for FI ( D ) . The updated algorithm is depicted in Figure 8.
Figure 8. Optimized incremental algorithm with reduced overhead.
Our performance evaluation also revealed additional conditions under which the incremental algorithm performs worse than the non-incremental one. This happens when the input of a Split is very small and the minimum support level is also small. Under such conditions, the minimum number of occurrences required for an itemset to become a candidate frequent itemset is very low (it may be as low as a single transaction), and then almost all combinations of items become candidates. Such a small job may run longer than mining the whole D+ with non-incremental FIM. To overcome this problem, we propose a few simple but effective techniques:
  • When Δ is split by the MR framework, it is divided into chunks of a predefined equal size; only the last chunk may differ in size. We need to make sure that the last chunk is larger than the preceding chunks (rather than smaller). Fortunately, MR systems like Hadoop append the last, smaller input part to a preceding chunk of the predefined size, so the last Split is actually larger than the others.
  • If the total input divided by the minimum number of Splits is still too small, it is preferable to control the number of Splits manually. In most cases, it is better to sacrifice parallel computation and gain speed with fewer workers, or even a single worker. Once again, the splitting process can be controlled in Hadoop via configuration parameters.
  • If Δ is still very small for a single worker to process it effectively, it is better to use a non-incremental algorithm for the calculation of D + .
The detection of Split sizes, configuration of Split number, and deciding which algorithm to use could be implemented in the main driver of the MR algorithm.
Merging MR jobs allowed us to arrive at an algorithm that has only two steps. The first step is the Map-Reduce FIM step for Δ only; for comparison, the non-incremental algorithm performs a Map-Reduce FIM calculation on the much bigger D+. The second step is the optional step of counting candidates in the old D (at most one pass over the old D), and it is triggered only when there is an itemset that must be recounted in the old D. The price of our algorithm is a slightly more complicated FIM input and output step.

4.2.5. Reuse of Discarded Information

During the MR algorithm, we generate many itemset candidates which were “locally” frequent in some Splits but were discarded in the end because they were not globally frequent. We propose keeping these previously discarded itemsets. For this, we keep the non-frequent itemsets (NFI(D)) in another file that is created during the FIM process on D. In our algorithm from Section 4.2.1, we require this file as additional input (Previous Knowledge). The algorithm then “joins” FI(Δ) with FI(D) ∪ NFI(D). With this additional information available, there is a higher chance that an itemset has an exact count, so it will not need to be recounted. As we join itemsets from FI(Δ) with NFI(D), we reduce the number of counts over D. Since the old D tends to be much larger than Δ, reducing (and possibly completely eliminating) the counting improves the run time of the algorithm. It is important to mention that the NFI tends to be much smaller than all possible itemset combinations of D, and we can keep these itemsets since they were already calculated and saved by at least one mapper anyway.

4.3. Experimental Evaluation

4.3.1. Data

The first tested dataset is synthetically generated T20I10D100000K (will be referenced as T20) [37]. It contains almost 100,000,000 (D100000K) transactions of an average length of 20 (T20), and an average length of maximal potential itemset of 10 (I10). It is 13.7 GB in size. The second dataset is “WebDocs” [35,36] (similar to Section 3.3.1).
The datasets were each cut into two equal halves. The first half of each dataset was used as the baseline of 100% size ( D ). The other part was used to generate the different Δ . For example, WebDocs was cut to a file size of 740 MB. Its delta of 5% was cut from the part that was left out, and its size was then 37 MB. The running time of the incremental algorithms on the 5% delta was compared to the full process of the 777 MB file (merged the 740 MB base file and the 5% delta of 37 MB as a single file for the test). Similarly, the T20 baseline of 100% has a size of 6.7 GB with 10% increments of 700 MB. We used different minSup values for each dataset. For T20, we tested minSup 0.1% and 0.2%. WebDocs was tested with 15%, 20%, 25%, 30% (although we show graphs only of 15% and 20%).

4.3.2. Setup

We ran all the experiments on the Google Compute Engine Cloud (GCE) by directly spawning VMs with Hadoop version 1 (old numbering 0.20.XX). We used different cluster sizes with 4 cores, 5 cores, 10 cores and 20 cores, and GCE instance types of n1-standard-1 or n1-standard-2.

4.3.3. Measurement

We measured various times: “run time” is measured by the “Driver” program from the start of the algorithm until all outputs are ready (“Driver” is responsible for communicating with the MR API). The “CPU time” is the time that all cluster machines consumed as measured by the MR framework.
During the experiments, we changed the size of Δ , minimal support (the lower the support, the larger the candidate set size and therefore the algorithm run time should go up), and cluster sizes.

4.3.4. Experiment Internals

We compare the performance of the three algorithms described in Section 4.2. We denote the algorithm from Section 4.2.2 (the incremental algorithm with IMRApriori and the early pruning optimization) as “Delta”. The algorithm from Section 4.2.4, with the minimum number of steps/jobs, is called “DeltaMin”. The algorithm from Section 4.2.5, which also keeps the counts of non-frequent itemsets, is called “DeltaMinKeep”. The baseline for the comparison is the previously published non-incremental FIM algorithm IMRApriori [10] run on D+; it is called “Full”.

4.3.5. Results

Figure 9a,b demonstrate the run and CPU times of each algorithm for dataset T20 on GCE cluster of size 5 and minSup 0.1%. It shows that incremental algorithms behave better than full in both parameters. The increase in delta size increases the computation time.
Figure 9. (a) T20 Run Time minSup 0.1% Cluster 5; (b) T20 CPU Time minSup 0.1% Cluster 5; (c) T20 Run Time minSup 0.2% Cluster 5; (d) T20 CPU Time minSup 0.2% Cluster 5; (e) T20 Run Time minSup 0.1%, inc 10%; (f) T20 CPU Time minSup 0.1%, inc 10%; (g) WebDocs Run time minSup 15% inc 10%; (h) WebDocs CPU time minSup 15% inc 10%; (i) WebDocs minSup 20% Run Time; (j) WebDocs minSup 20% CPU Time; (k) WebDocs minSup 15% Run Time; (l) WebDocs minSup 15% CPU Time; (m) WebDocs minSup 15% Run Time Close Up; (n) WebDocs minSup 15% CPU Time Close Up.
Figure 9c,d are similar to Figure 9a,b, but they depict the algorithm’s behavior for minSup 0.2%. In this case, the run time of “Delta” is higher than “Full” and “DeltaMin” behaves almost the same as “Full”. The CPU time of “Delta” and “DeltaMin” is still lower than that of “Full”. “DeltaMinKeep” is better by all parameters than “Full”. The large difference between “DeltaMinKeep” and “DeltaMin” is explained by eliminating the need to run the recount Job in old D .
Figure 9e,f show the algorithms’ run time and CPU time behavior for T20, minSup 0.1%, and inc 10%, as the GCE cluster size changes from 5 to 10 and to 20 nodes/cores. We can see that the incremental algorithms scale well with more cores added to the system, although not linearly. “Full” recalculation scaled even worse when the cluster size changed. This is explained by the fact that as long as the input data size divided by the number of workers is larger than the HDFS block size, each mapper gets exactly the same Split, and its work is the same. Once the cluster size scales above this point, the data split size gets lower than the HDFS block. In most algorithms, the smaller the input, the faster the algorithm runs. In our case, this is true for the counting jobs (the second step of IMRApriori, or the recounting in case of an old DB check) and the “Join” job. The first step of IMRApriori finds all the FI of its data Split by running Apriori on it. This algorithm is not linear in its input size and may perform worse on a very small input (see Section 4.2.4).
Figure 9g,h show the algorithm’s run and CPU time behavior for WebDocs, minSup 15% and inc 10% as GCE cluster size changes from 5 to 10 and to 20 nodes/cores. We can see that incremental algorithms scale well when more cores are added to the system. “Full” recalculation CPU time does not scale that well when cluster size changes, similar to the previous case.
Figure 9i,j show WebDocs with minSup 20% run, and CPU time as delta size varies. It shows that run time of “Delta” is no better than “Full”, but its CPU time is better.
Figure 9k,l show graphs similar to Figure 9i,j, but for minSup 15%. As frequent itemset mining with a minSup of 15% is more computationally intensive than with 20%, the resulting times are higher. We can see that in this case, all incremental algorithms behave much better than the full algorithm. In some cases, a full run takes 2–3 times longer than the incremental algorithms.
Figure 9m,n are similar to Figure 9k,l, but show a close up of the incremental algorithms only, and their run/CPU times. It shows that “Delta” behaves worse than “DeltaMin”, and “DeltaMin” behaves worse than “DeltaMinKeep”.
Our evaluation shows that, in terms of CPU time, the incremental algorithms perform similarly to or better than full recalculation on smaller datasets with larger support, while the run time was not always better for the less optimized “Delta” and “DeltaMin” algorithms.
As the support threshold decreased, all incremental algorithms had better run and CPU times than “Full” re-computation. They outperformed the “Full” algorithm by large factors (in some of the tests, even by a factor of 10). The explanation is that mining FI from scratch requires much more time than merely counting candidates.
The run and CPU times of “DeltaMin” are always superior to those of “Delta”, and “DeltaMinKeep” showed results superior to “DeltaMin” in both run time and CPU time. The largest differences were observed in executions where the algorithm managed to completely eliminate the counting step over the old D.
Algorithm run time and CPU time showed almost linear growth with an increase in input size (increase in Δ size), as long as the Split size stayed constant.
Cluster size change showed that larger clusters improve the run time. It is not always linear, as there is a limit at which splitting the input into too many small chunks generates too many locally frequent itemsets, which requires longer recounting steps.

4.3.6. Comparison to Previous Works

The FUP algorithm [15] was the first to provide an Incremental Scheme which is based on mining the Δ . The algorithm is not distributed or parallel. It mines Δ by iterative steps from candidates of size 1 to K, and stops when no more candidates are available. At each step, this algorithm scans the old D to check the validity of its candidates. Implementing this algorithm in MR would require K scans over the old D , which would generate K-times more I/O than our algorithms and would be less effective.
ARM IDB [7] provides optimizations on incremental mining by using the TID–list intersection and its “LAPI” optimization. The algorithm does not deal with a distributed environment (i.e., MR), so it has no way to scale out.
Incoop and DryadInc do not support more than one input for a DAG (and we need to have both the DB and the candidate set as inputs). As there is no known way to overcome this difference, a direct comparison is not possible. Moreover, these systems cannot extract useful information from knowledge of the algorithm's goal or its specific implementation, and therefore cannot use it to improve their run time.

4.3.7. The Algorithm Relation to the Spark Architecture

Spark [3] is a distributed parallel system that has recently gained popularity. The main difference from Map-Reduce is that it tries to perform all computations in memory. Spark uses the notion of Resilient Distributed Datasets (RDDs), which can be recomputed in case of failure. In contrast, Map-Reduce saves all intermediate and final results on a local or a distributed file system (DFS). The use of in-memory computation and an in-memory distributed cache allows Spark to achieve better performance: it performs far fewer I/O operations since data is cached in memory. This provides a significant boost to many algorithms.
Our general algorithm does not change when the computation framework is switched to Spark. We still need to assume that there is a way to compute frequent itemsets in Spark (and indeed Spark’s MLlib library contains an FIM algorithm, currently based on FP-Growth: the FPGrowth class). Joining two lists/tables is also easily done in Spark, which has many kinds of joins implemented (the “join” function). The last part is recounting transactions in the different datasets, which can again be easily done via a set of calls to the “map” and “reduceByKey” Spark functions. Broadcasting and caching the datasets of all new potential frequent itemsets on each node will improve the overall run time (SparkContext.broadcast); this is possible because the set of potential frequent itemsets tends to be much smaller than the datasets themselves and can be cached in the memory of each partition. As the calculation of frequent itemsets is a computationally intensive operation, the Spark incremental scheme for FIM would outperform full re-computation in most cases, similarly to Map-Reduce.
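The building blocks mentioned above can be sketched in PySpark as follows. This is an illustrative sketch only, not the authors' implementation: the file paths, the support threshold, and the candidate-broadcast layout are hypothetical, and only standard MLlib/RDD calls (FPGrowth.train, broadcast, flatMap, reduceByKey) are used.

```python
# Illustrative PySpark sketch: FIM on Δ, broadcasting candidates, recounting in the old D.
from pyspark import SparkContext
from pyspark.mllib.fpm import FPGrowth

sc = SparkContext(appName="incremental-fim-sketch")
min_sup = 0.2                                              # hypothetical support threshold

# Hypothetical input paths; each line is a space-separated transaction.
delta = sc.textFile("hdfs:///data/delta.txt").map(lambda line: list(set(line.split())))
old_d = sc.textFile("hdfs:///data/old_d.txt").map(lambda line: set(line.split()))

# Frequent itemsets of Δ only (Step 3 of the General Scheme), via MLlib's FP-Growth.
model = FPGrowth.train(delta, minSupport=min_sup, numPartitions=4)
delta_fi = model.freqItemsets().map(lambda fi: (frozenset(fi.items), fi.freq))

# Broadcast the candidate itemsets that must be checked against the old D.
candidates = sc.broadcast([items for items, _ in delta_fi.collect()])

# Recount each candidate in the old D with map + reduceByKey.
old_counts = (old_d.flatMap(lambda t: [(c, 1) for c in candidates.value if c <= t])
                    .reduceByKey(lambda a, b: a + b))
```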
The early pruning optimization that was introduced in Section 4.2.2 would have a positive effect in Spark too, as there will be less itemsets to cache in memory, and less itemsets to check against the different datasets.
Job reduction optimizations from Section 4.2.4 are less effective in Spark for the following reasons. Spark does not spawn a new process/VM for each task but uses multi-threading, so it achieves a better job start time. Spark also encourages writing programs declaratively, so some "job merging" is achieved naturally by letting the SparkContext executors decide how to solve the problem most effectively. However, if a dataset is much larger than the total RAM of the cluster, the cache is frequently flushed and data is re-read from disk; in such cases, manually forcing the data to be read only once is still beneficial, so joining several computations into a single pass over the data remains preferable.
The reuse of discarded data from Section 4.2.5 tends to reduce the number of itemsets that need to be checked in the old dataset. The issue is that the default FIM algorithm in Spark, being based on FP-Growth, does not produce any intermediate results. If another FIM algorithm that does produce such additional itemsets were used, we suggest applying this optimization as well.
Spark Streaming is based on re-running the algorithm on "micro-batches" of newly arrived data, and our algorithm could be used for this task. However, the batch sizes that make such streaming computation efficient should be studied further.
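As a purely illustrative sketch, one way to wire such micro-batches into the incremental sketch shown earlier uses the standard Spark Streaming API; the 60-second batch interval, the socket source, and the space-separated transaction encoding are all assumptions made for this example.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical wiring of micro-batches into IncrementalFimSketch above.
// Batch interval, socket source and transaction format are assumptions.
object StreamingFimSketch {

  def run(sc: SparkContext, oldDB: RDD[Array[String]], minSupport: Double): Unit = {
    val ssc = new StreamingContext(sc, Seconds(60))
    val lines = ssc.socketTextStream("localhost", 9999)   // hypothetical transaction feed
    val transactions = lines.map(_.split(" "))             // one transaction per line

    // Treat every micro-batch as a new Delta and re-run the incremental step.
    transactions.foreachRDD { delta =>
      if (!delta.isEmpty()) {
        val frequent = IncrementalFimSketch.run(sc, oldDB, delta, minSupport)
        frequent.collect().foreach(println)
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

In a real deployment, the old database and the maintained set of frequent itemsets would also have to be updated after each processed batch; how large the batches should be for this to pay off is exactly the open question raised above.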

5. Conclusions

This work presented methods for mining frequent itemsets. For closed frequent itemset mining, we presented a new, distributed, parallel algorithm using the popular Map-Reduce programming paradigm. Besides its novelty, using Map-Reduce makes this algorithm easy to implement: it relieves the programmer from handling the concurrency, synchronization and node management that are part of a distributed environment, and lets them focus on the algorithm itself.
The incremental frequent itemset mining algorithms presented in this work range from a General Scheme that can be used in any distributed environment to a heavily optimized Map-Reduce version that, in our experiments, mostly performs much better than the other algorithms. The lower the support rate, the harder the computation, and the greater the benefit that the incremental algorithms provide.
A general direction for future research for both presented schemes is to implement and test them on other distributed environments such as Spark. We expect most of the proposed algorithms to remain effective, although some optimizations may become redundant as the distributed engine becomes more efficient and incurs less overhead.

Author Contributions

Y.G. contributed to the research on, and experimental evaluation of, closed frequent itemsets. K.K. contributed to the research on, and experiments with, incremental frequent itemsets. E.G. participated in the research and supervision of all the article's topics. All authors wrote parts of this paper.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Dean, J.; Ghemawat, S. MapReduce: Simplified Data Processing on Large Clusters; ACM: New York, NY, USA, 2008. [Google Scholar]
  2. Apache: Hadoop. Available online: http://hadoop.apache.org/ (accessed on 1 January 2016).
  3. Zaharia, M.; Chowdhury, M.; Das, T.; Dave, A.; Ma, J.; McCauley, M.; Franklin, M.; Shenker, S.; Stoica, I. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, San Jose, CA, USA, 25–27 April 2012. [Google Scholar]
  4. Doulkeridis, C.; Nørvåg, K. A survey of large-scale analytical query processing in MapReduce. VLDB J. 2014, 23, 355–380. [Google Scholar] [CrossRef]
  5. Agrawal, R.; Imieliński, T.; Swami, A. Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington, DC, USA, 25–28 May 1993; pp. 207–216. [Google Scholar]
  6. Agrawal, R.; Srikant, R. Fast Algorithms for Mining Association Rules; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1994. [Google Scholar]
  7. Duaimi, M.G.; Salman, A. Association rules mining for incremental database. Int. J. Adv. Res. Comput. Sci. Technol. 2014, 2, 346–352. [Google Scholar]
  8. Han, J.; Cheng, H.; Xin, D.; Yan, X. Frequent pattern mining: Current status and future directions. Data Min. Knowl. Discovery 2007, 15, 55–86. [Google Scholar] [CrossRef]
  9. Cheng, J.; Ke, Y.; Ng, W. A survey on algorithms for mining frequent itemsets over data streams. Knowl. Inf. Syst. 2008, 16, 1–27. [Google Scholar] [CrossRef]
  10. Farzanyar, Z.; Cercone, N. Efficient mining of frequent itemsets in social network data based on MapReduce framework. In Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, Niagara Falls, ON, Canada, 25–28 August 2013; pp. 1183–1188. [Google Scholar]
  11. Li, N.; Zeng, L.; He, Q.; Shi, Z. Parallel implementation of apriori algorithm based on MapReduce. In Proceedings of the 2012 13th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel & Distributed Computing, Kyoto, Japan, 8–10 August 2012. [Google Scholar]
  12. Woo, J. Apriori-map/reduce algorithm. In Proceedings of the 2012 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2012), Las Vegas, NV, USA, 16–19 July 2012. [Google Scholar]
  13. Yahya, O.; Hegazy, O.; Ezat, E. An efficient implementation of Apriori algorithm based on Hadoop-Mapreduce model. Int. J. Rev. Comput. 2012, 12, 59–67. [Google Scholar]
  14. Pasquier, N.; Bastide, Y.; Taouil, R.; Lakhal, L. Discovering frequent closed itemsets for association rules. In Proceedings of the Database Theory ICDT 99, Jerusalem, Israel, 10–12 January 1999; pp. 398–416. [Google Scholar]
  15. Cheung, D.W.; Han, J.; Wong, C.Y. Maintenance of discovered association rules in large databases: An incremental updating technique. In Proceedings of the Twelfth International Conference on Data Engineering, New Orleans, LA, USA, 26 February–1 March 1996; pp. 106–114. [Google Scholar]
  16. Thomas, S.; Bodagala, S.; Alsabti, K.; Ranka, S. An efficient algorithm for the incremental updation of association rules in large databases. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA, USA, 14–17 August 1997; pp. 263–266. [Google Scholar]
  17. Das, A.; Bhattacharyya, D.K. Rule Mining for Dynamic Databases; Springer: Berlin/Heidelberg, Germany, 2004. [Google Scholar]
  18. Gonen, Y.; Gudes, E. An improved mapreduce algorithm for mining closed frequent itemsets. In Proceedings of the IEEE International Conference on Software Science, Technology and Engineering (SWSTE), Beer-Sheva, Israel, 23–24 June 2016; pp. 77–83. [Google Scholar]
  19. Kandalov, K.; Gudes, E. Incremental Frequent Itemsets Mining with MapReduce; Springer: Cham, Switzerland, 2017; pp. 247–261. [Google Scholar]
  20. Agrawal, R.; Shafer, J. Parallel mining of association rules. IEEE Trans. Knowl. Data Eng. 1996, 8, 962–969. [Google Scholar] [CrossRef]
  21. Zaki, M.J.; Parthasarathy, S.; Ogihara, M.; Li, W. New Algorithms for Fast Discovery of Association Rules; University of Rochester: Rochester, NY, USA, 1997. [Google Scholar]
  22. Lucchese, C.; Orlando, S.; Perego, R. Parallel mining of frequent closed patterns: Harnessing modern computer architectures. In Proceedings of the Seventh IEEE International Conference on Data Mining, Omaha, NE, USA, 28–31 October 2007; pp. 242–251. [Google Scholar]
  23. Lucchese, C.; Mastroianni, C.; Orlando, S.; Talia, D. Mining@home: Toward a public-resource computing framework for distributed data mining. Concurrency Comput. Pract. Exp. 2009, 22, 658–682. [Google Scholar] [CrossRef]
  24. Liang, Y.-H.; Wu, S.-Y. Sequence-growth: A scalable and effective frequent itemset mining algorithm for big data based on mapreduce framework. In Proceedings of the 2015 IEEE International Congress on Big Data, New York, NY, USA, 27 June–2 July 2015; pp. 393–400. [Google Scholar]
  25. Wang, S.-Q.; Yang, Y.-B.; Chen, G.-P.; Gao, Y.; Zhang, Y. Mapreduce based closed frequent itemset mining with efficient redundancy filtering. In Proceedings of the 2012 IEEE 12th International Conference on Data Mining Workshops, Brussels, Belgium, 10 December 2012; pp. 449–453. [Google Scholar]
  26. Liu, G.; Lu, H.; Yu, J.; Wang, W.; Xiao, X. Afopt: An efficient implementation of pattern growth approach. In Proceedings of the Third IEEE International Conference on Data Mining, Melbourne, FL, USA, 19–22 November 2003. [Google Scholar]
  27. Borthakur, D. The Hadoop Distributed File System: Architecture and Design. In: Hadoop Project Website. 2007. Available online: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.pdf (accessed on 1 January 2016).
  28. Bhatotia, P.W.; Rodrigues, R.; Acar, U.A.; Pasquin, R. Incoop: MapReduce for incremental computations. In Proceedings of the 2nd ACM Symposium on Cloud Computing, Cascals, Portugal, 26–28 October 2011. [Google Scholar]
  29. Popa, L.; Budiu, M.; Yu, Y.; Isard, M. DryadInc: Reusing work in large-scale computations. In Proceedings of the USENIX Workshop on Hot Topics in Cloud Computing, San Diego, CA, USA, 15 June 2009. [Google Scholar]
  30. Afrati, F.N.; Ullman, J.D. Optimizing joins in a map-reduce environment. In Proceedings of the 13th International Conference on Extending Database Technology, Lausanne, Switzerland, 22–26 March 2010; pp. 99–110. [Google Scholar]
  31. Amazon: Elastic Mapreduce (EMR). Available online: https://aws.amazon.com/elasticmapreduce/ (accessed on 1 June 2015).
  32. Gunarathne, T.; Wu, T.-L.; Qiu, J.; Fox, G. MapReduce in the Clouds for Science. In Proceedings of the 2010 IEEE Second International Conference on Cloud Computing Technology and Science, Indianapolis, IN, USA, 30 November–3 December 2010; pp. 565–572. [Google Scholar]
  33. Blanas, S.; Patel, J.M.; Ercegovac, V.; Rao, J.; Shekita, E.J.; Tian, Y. A comparison of join algorithms for log processing in mapreduce. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, Indianapolis, IN, USA, 6–10 June 2010; pp. 975–986. [Google Scholar]
  34. Afrati, F.N.; Ullman, J.D. Optimizing multiway joins in a map-reduce environment. IEEE Trans. Knowl. Data Eng. 2011, 23, 1282–1298. [Google Scholar] [CrossRef]
  35. Goethals, B. Frequent Itemset Mining Dataset. Available online: http://fimi.ua.ac.be/data (accessed on 1 June 2015).
  36. Lucchese, C.; Orlando, S.; Perego, R.; Silvestri, F. Webdocs: A real-life huge transactional dataset. In Proceedings of the ICDM Workshop on Frequent Itemset Mining Implementations, Brighton, UK, 1 November 2004; p. 2. [Google Scholar]
  37. Agrawal, R.; Srikant, R. Quest Synthetic Data Generator; IBM Almaden Research Center: San Jose, CA, USA. Available online: http://www.almaden.ibm.com/cs/quest/syndata.html (mirror: http://sourceforge.net/projects/ibmquestdatagen/) (accessed on 1 January 2016).
  38. Ekanayake, J.; Li, H.; Zhang, B.; Gunarathne, T.; Bae, S.; Qiu, J.; Fox, G. Twister: A runtime for iterative mapreduce. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, Chicago, IL, USA, 21–25 June 2010; pp. 810–818. [Google Scholar]
