Mathematics | Article | Open Access

3 April 2022

Efficient Mining Support-Confidence Based Framework Generalized Association Rules

1 Institute of Information Technology, Corvinus University of Budapest, 1093 Budapest, Hungary
2 Department of Information Systems, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh 11671, Saudi Arabia
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Machine Learning, Statistics and Big Data

Abstract

Mining association rules is one of the most critical data mining problems, intensively studied since its inception. Several approaches have been proposed in the literature to extend the basic association rule framework to extract more general rules, including the negation operator. This extension is expected to bring valuable knowledge about an examined dataset to the user. However, the efficient extraction of such rules is challenging, especially for sparse datasets. This paper focuses on the extraction of literalsets, i.e., sets of present and absent items. Generalized association rules can then be straightforwardly derived from these literalsets. To this end, we introduce and prove the soundness of a theorem that paves the way to speeding up the costly computation of the support of a literalset. Furthermore, we introduce FasterIE, an efficient algorithm that puts the proved theorem to work to extract the whole set of frequent literalsets. The FasterIE algorithm devises very efficient strategies that minimize, as far as possible, the number of node visits in the explored search space. Finally, we have carried out experiments on benchmark datasets to back the effectiveness claim of the proposed algorithm versus its competitors.

1. Introduction

Discovering association rules is a fundamental and essential subject in data mining and has been extensively investigated since its inception in [1,2]. Over the past few years, the use of association rule mining in varied application scenarios [3,4,5,6,7] has been intensely discussed [8,9]. The idea consists of discovering co-occurrence relationships, where the presence of some items suggests the presence of others. A typical example of an association rule mining application is market basket analysis, where the discovered rules can lead to important marketing and strategic management decisions. The process of mining for association rules has two phases: (i) mining for frequent itemsets; and (ii) generating strong association rules from the discovered frequent itemsets.
Traditional association rule mining algorithms were developed to find associations between items present in a transactional database. Nevertheless, in many domains, one might be interested in discovering association rules that take into account the absence of some items, in order to identify conflicting or complementary items. These rules are commonly called generalized association rules [10,11,12]. However, incorporating the negation operator into the association rule framework is far from a straightforward task. Indeed, mining generalized association rules gives rise to several critical issues:
  • When negative items are considered, the length of the transactions increases to reach a value equal to n, where n stands for the number of items in the mined dataset. Since the complexity of standard association rules mining algorithms is very sensitive to the transaction length, these algorithms would break down for such datasets. Indeed, computing supports of itemsets with negation is a very time-consuming step.
  • For sparse datasets, a large number of the items are not present in each transaction leading to an overwhelming amount of association rules with negation. Consequently, it is nearly impossible for end-users to comprehend or validate such a high number of the extracted association rules, thereby limiting the usefulness of the mined results.
A large number of researchers have tried to prune the pattern search space, for a more efficient exploration, using the following methods: (i) defining various forms of generalized association rules; (ii) incorporating attribute correlations or rule interestingness measures; and (iii) relying on additional background information concerning the data.
As opposed to this, we propose a new approach that stays within the strict bounds of the original support-confidence framework. Our proposal is thus intuitive to users, i.e., no additional parameters are required. We proceed in two steps to extract generalized association rules: (i) all frequent generalized literalsets are extracted; and (ii) all valid generalized association rules are straightforwardly derived from the frequent literalsets. Here, the fulfillment of the validity criterion is assessed through the confidence metric, which must be at least equal to a user-defined threshold, called minconf.
A scrutiny of the wealth of related work enables us to draw the following landscape of challenges:
  • All the surveyed approaches could only extract a particular case of generalized association rules. This limitation is due to the intractability of the generalized literalset extraction step.
  • The computation of the support of the negative part of a literalset is far from a trivial task. Even if the generalized support can be expressed in terms of the positive part of the literalset, doing so leads to a barely bearable computational overhead. Indeed, most of these itemsets are infrequent, and we need to explicitly delve into the disk-resident database to compute their associated support values.
Keeping these drawbacks in mind, we focus on the first and most challenging step of generalized association rule mining, i.e., the extraction of frequent literalsets. To this end, we propose a new algorithm, called FasterIE, for extracting frequent literalsets, together with a new method to compute the support of literalsets efficiently. Our approach outperforms its competitors from the literature on benchmark datasets.
The remainder of the paper is organized as follows. In Section 2, we present some basic definitions used throughout the paper. Section 3 reviews the dedicated related work. Section 4 introduces an extended form of association rules that considers the absence of items. Next, in Section 5, we discuss the drawbacks of the naive approach, which uses classical algorithms such as Apriori [13] to extract frequent literalsets, and we introduce a new method for computing the support of a literalset based on the respective supports of its subsets. Section 6 thoroughly details the FasterIE algorithm dedicated to extracting the whole set of frequent literalsets. Experimental results are described in Section 7, along with the comparison of FasterIE performances to those of existing algorithms. Finally, Section 8 concludes the paper and points out issues for future work.

2. Basic Concepts and Terminology

This section provides some fundamental notions used in the remainder of the paper. Furthermore, we recall the problem of positive association rule extraction as defined in [13]. The recent past has witnessed a shift in the focus of the association rule mining community, which is now focusing more on an extended form of association rules, called negative association rules.
Let ℐ = {i_1, i_2, …, i_m} be a set of m items. A transaction over ℐ is a couple T = (tid, I), where tid is the transaction identifier and I is a set of items such that I ⊆ ℐ. A transaction database D over ℐ is a set of transactions over ℐ. A transaction T is said to support a set X if and only if X ⊆ I.
Let X ⊆ ℐ be a positive itemset; if X contains k items, then X is said to be a positive k-itemset. The absolute support of a positive itemset X is given by Supp(X) = |{tid | (tid, I) ∈ D, X ⊆ I}|. If the support of X is greater than or equal to a user-defined minimum threshold minsup, then X is called frequent.
A positive association rule is defined as a correlation between two sets of items [13]. It is sketched as R: X ⇒ Y, such that X, Y ⊆ ℐ and X ∩ Y = ∅. An association rule R is said to be based on the itemset X ∪ Y, and the itemsets X and Y are called, respectively, the premise and the conclusion of R.
To assess the validity of an association rule R, two metrics are commonly used [13]: (i) the support: the support of the rule R, denoted Supp(R), is given by Supp(X ∪ Y); (ii) the confidence: it expresses the conditional probability of finding Y in a transaction containing X. The confidence of the rule R, denoted Conf(R), is given by Conf(R) = Supp(X ∪ Y) / Supp(X). To be valid, an association rule must have a confidence greater than or equal to a user-defined minimum confidence threshold, denoted minconf.
Negative association rules were first mentioned in [14]. A negative association rule extends a positive association rule R: X ⇒ Y to four basic rules R_1: X̄ ⇒ Y, R_2: X ⇒ Ȳ, R_3: X̄ ⇒ Ȳ and R_4: X ⇒ Y, where R_4 is a positive rule and the other three are negative rules whose premise and/or conclusion represents the negation of an itemset (a negative itemset). The semantic meaning of a negative itemset X̄ is the non-simultaneous presence of the items included in X. The extraction of such rules is based on the following observation:
Supp(X̄ ⇒ Y) = Supp(X̄ ∪ Y) = Supp(Y) − Supp(X ∪ Y).
Therefore, the support of the negative itemsets, on which negative association rules are based, can be deduced from the support of positive itemsets.
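To make these definitions concrete, here is a minimal Python sketch (our illustration, not code from the paper) that computes the support and confidence of a positive rule and checks the observation above on a small, made-up transaction database.

# A minimal sketch of the support-confidence framework on an illustrative database.
D = [{"a", "b", "c"}, {"a", "c", "d"}, {"a", "d", "e"}, {"b", "c", "e"}, {"a", "b", "d"}]

def supp(itemset, db=D):
    """Absolute support: number of transactions containing every item of `itemset`."""
    return sum(1 for t in db if set(itemset) <= t)

def conf(premise, conclusion, db=D):
    """Confidence of the rule premise => conclusion."""
    return supp(set(premise) | set(conclusion), db) / supp(premise, db)

print(supp({"a", "c"}), conf({"a"}, {"c"}))        # 2 and 0.5 on this toy data
# Observation used for negative rules: Supp(X-bar union Y) = Supp(Y) - Supp(X union Y)
X, Y = {"a"}, {"c"}
lhs = sum(1 for t in D if not (X <= t) and Y <= t)  # transactions with Y but without X
assert lhs == supp(Y) - supp(X | Y)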

4. Efficient Extraction of Generalized Association Rules

We usher in this section by defining an extended form of association rules, called generalized association rules, which takes into account the presence as well as the absence of items.
Let ℐ = {i_1, i_2, …, i_m} be a set of items and 𝓛 = ℐ ∪ {ī | i ∈ ℐ} be the set of literals, such that a literal is either an item i (called a positive literal) or its opposite ī (called a negative literal). Let L be a subset of 𝓛 containing k non-opposite literals; then L is called a k-literalset. Let L be a k-literalset composed of p positive literals and (k − p) negative literals. Then, L is said to be a p-positive literalset, i.e., a (k − p)-negative literalset. We denote by PosVar(L), PosPart(L) and NegPart(L), respectively, the positive variation, the set of positive literals, and the set of negative literals of L. Formally, these three notions are defined as follows:
Definition 8.
Let L be a literalset such that L = {i_1, i_2, …, i_p, j̄_1, j̄_2, …, j̄_l}. Then:
PosVar(L) = {i_1, i_2, …, i_p, j_1, j_2, …, j_l};
PosPart(L) = {i_1, i_2, …, i_p};
NegPart(L) = {j̄_1, j̄_2, …, j̄_l}.
Let D be a transaction database over a set of items ℐ. A transaction T of D is said to support a literalset L whenever it supports PosPart(L) and does not contain the item opposed to any literal of NegPart(L), i.e.,
Supp(L) = |{tid | (tid, I) ∈ D, PosPart(L) ⊆ I and ∀ j̄ ∈ NegPart(L), j ∉ I}|.
A literalset L is said to be frequent if and only if its support is at least equal to a minimum threshold minsup. It is worth underscoring that the set FL of frequent literalsets is downward closed, i.e., the support of literalsets fulfills the anti-monotonicity property, as is the case for the set of frequent itemsets. Indeed, if L ∈ FL, then every subset L_1 ⊆ L is also frequent. Conversely, if L ∉ FL, then every superset L_2 ⊇ L is infrequent.
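The following Python sketch illustrates the support definition above on a small, made-up transaction database (not the Table 1 of the paper); encoding negative literals as ("not", item) is an illustrative convention of ours, not the paper's notation.

# Literalset support computed directly from the definition, on an illustrative database.
D = [{"a", "b", "c"}, {"a", "c", "d"}, {"a", "d", "e"}, {"b", "c", "e"}, {"a", "b", "d"}]

def supports(transaction, literalset):
    """A transaction supports L iff it contains PosPart(L) and none of the items negated in NegPart(L)."""
    pos = {l for l in literalset if not isinstance(l, tuple)}
    neg = {l[1] for l in literalset if isinstance(l, tuple)}
    return pos <= transaction and not (neg & transaction)

def supp(literalset, db=D):
    return sum(1 for t in db if supports(t, literalset))

L = {"a", ("not", "b"), ("not", "c")}   # the literalset a b-bar c-bar
print(supp(L))                          # 1 on this toy database
# Anti-monotonicity: every subset of L has a support at least as large.
assert supp({"a", ("not", "b")}) >= supp(L) and supp({"a"}) >= supp(L)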
Example 1.
Let us consider the transaction database shown in Table 1, over the set of items ℐ = {a, b, c, d, e}. The literalset a b̄ c̄ is a 3-literalset, and it is also a 1-positive literalset. Its support value is equal to Supp(a b̄ c̄) = 2, while PosVar(a b̄ c̄) = abc, PosPart(a b̄ c̄) = a and NegPart(a b̄ c̄) = b̄ c̄. Let minsup = 2; a b̄ c̄ is then a frequent literalset, and all its subsets are also frequent literalsets. For example, Supp(a b̄) = 3 ≥ 2.
Table 1. A transaction database D .
We define a generalized association rule as a correlation between two literalsets, having the following form: R: L_1 ⇒ L_2, where L_1, L_2 ⊆ 𝓛 and L_1 ∩ L_2 = ∅. A generalized association rule is said to be valid if and only if its support, i.e., the support of L_1 ∪ L_2, is at least equal to minsup and its confidence is at least equal to minconf.
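As an illustration of how valid generalized rules could be derived from a frequent literalset, the following sketch (ours, not the paper's code) enumerates the bipartitions of a literalset and keeps the rules whose confidence reaches minconf; `supp` is assumed to be a literalset support function such as the one sketched above.

from itertools import combinations

def derive_rules(L, minconf, supp):
    """Every bipartition (L1, L2) of the frequent literalset L yields a candidate rule L1 => L2."""
    L = list(L)
    rules = []
    for r in range(1, len(L)):                     # non-empty premise and conclusion
        for premise in combinations(L, r):
            conclusion = tuple(l for l in L if l not in premise)
            c = supp(set(L)) / supp(set(premise))  # Conf(L1 => L2) = Supp(L1 u L2) / Supp(L1)
            if c >= minconf:
                rules.append((premise, conclusion, c))
    return rules

# e.g., derive_rules({"a", ("not", "b")}, 0.5, supp)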

5. Efficient Computation of the Support of Literalsets

The extraction process of generalized association rules can be split into two steps as follows:
  • Extract frequent literalsets;
  • Derive valid generalized association rules: this step is the least computationally demanding. Indeed, for each frequent literalset L, we derive all possible pairs (L_1, L_2) such that L_1 ∪ L_2 = L and L_1 ∩ L_2 = ∅, and keep those for which the minconf constraint is fulfilled.
For this purpose, the remainder of this section is devoted to the tricky and challenging task of extracting frequent literalsets. We usher in this development by discussing a straightforward, naive brute-force approach.

5.1. A Naive Brute-Force Approach

A naive brute-force approach consists of augmenting each transaction of the original dataset with new item identifiers representing the absence of each item from the transaction and then straightforwardly applying a classical algorithm such as Apriori [13] on a generalized transaction database such as the one given in Table 2.
Table 2. A generalized transaction database D .
Nevertheless, this approach was shown to be inefficient, especially during the step dedicated to the computation of literalset supports [21]. Indeed, to compute the supports of the candidate k-literalsets, the algorithm has to check, for each k-subset of a transaction T = (tid, L) (where L is a set of literals such that L ⊆ 𝓛), whether it belongs to the set of candidate k-literalsets. Since the length of each transaction has been increased to a value equal to n = |ℐ|, the number of k-subsets that have to be checked rises considerably. The computation of literalset supports thus becomes a very time-consuming, even intractable, step.

5.2. Toward an Efficient Computation of the Support of a Literalset

As underscored before, extracting generalized association rules from the extended transaction database is impractical whenever the classical mining approach is used. Thus, it would be interesting to devise a solution that permits the extraction of generalized association rules directly from the original transaction database. Nevertheless, computing the supports of literalsets then becomes problematic. In other words, how can we compute the support of a literalset from transactions that contain only the present items? In such a situation, the inclusion-exclusion principle offers an efficient option. Indeed, this well-known principle has been extensively used in many enumeration problems [22]. Moreover, it was used in [21,23] to compute the support of a literalset. Given a literalset L = {i_1, …, i_m, j̄_1, …, j̄_n}, its support is computed as follows:
Supp(L) = ∑_{S ⊆ {j_1, …, j_n}} (−1)^|S| × Supp({i_1, …, i_m} ∪ S)    (1)
Example 2.
Let a b ¯ c ¯ d ¯ be a literalset. Then, its support is computed as follows:
Supp(a b̄ c̄ d̄) = Supp(a) − Supp(ab) − Supp(ac) − Supp(ad) + Supp(abc) + Supp(abd) + Supp(acd) − Supp(abcd).
Hence, we notice that the support of a literalset L can be deduced by only considering the supports of positive itemsets. Indeed, the support of a literalset L is determined from the support of PosVar(L) and those of the subsets of NegPart(L). However, it is worth putting forward that the positive itemsets needed to compute the support of a literalset are not necessarily frequent ones. Consequently, as a flagrant drawback, these approaches [21,23] need to perform supplementary accesses to the dataset to count the supports of these infrequent positive itemsets. To tackle this insufficiency, Boulicaut et al. proposed a potential solution, which consists of providing an approximate value of the support of a literalset by ignoring infrequent positive itemsets [21]. Thus, the more infrequent positive itemsets there are, the less accurate and the less scalable this approach becomes.
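The inclusion-exclusion computation of Equation (1) can be sketched as follows; `supp_pos` stands for any routine returning the support of a positive itemset (e.g., a database scan) and is an assumption of this illustration rather than a function of the original approaches.

from itertools import combinations

def supp_by_inclusion_exclusion(pos_items, neg_items, supp_pos):
    """Equation (1): the support of a literalset from the supports of positive itemsets only."""
    total = 0
    for r in range(len(neg_items) + 1):
        for S in combinations(neg_items, r):
            total += (-1) ** len(S) * supp_pos(set(pos_items) | set(S))
    return total

# Example 2 above: Supp(a b-bar c-bar d-bar) expands into the 8 positive-itemset terms shown.
# supp_by_inclusion_exclusion({"a"}, {"b", "c", "d"}, supp_pos)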
In the following, we introduce a new theorem that reduces the number of accesses to the database. Nevertheless, first, we intuitively illustrate the driving idea through an example.
Example 3.
Let us consider the transaction database D depicted by Table 1. Figure 1 shows the sets of transactions that contain the literals a, b, and c, respectively. At a glance, we can notice that:
Supp(a) = Supp(a b̄ c̄) + Supp(a b̄ c) + Supp(a b c̄) + Supp(abc)
Supp(a) = Supp(a b̄) + Supp(a b c̄) + Supp(abc)
Supp(a) = Supp(a b̄) + Supp(a c̄) − Supp(a b̄ c̄) + Supp(abc)
Figure 1. Sets representing transactions containing literals a, b, and c.
As a consequence, we can deduce the following observation:
Supp(a b̄ c̄) = −Supp(a) + Supp(a b̄) + Supp(a c̄) + Supp(abc)
As we can see, the support of the literalset a b̄ c̄ can be deduced from the supports of its strict subsets and that of its positive variation PosVar(a b̄ c̄). Consequently, we guarantee a decrease in the number of accesses to the dataset. To generalize this observation, we propose to compute the support of a literalset as follows:
Theorem 1.
Let L = {i_1, …, i_m, j̄_1, …, j̄_n} be a literalset. Then, the support of L is equal to
Supp(L) = (−1)^n × Supp({i_1, …, i_m, j_1, …, j_n}) − ∑_{S ⊊ {j̄_1, …, j̄_n}} (−1)^|S′| × Supp({i_1, …, i_m} ∪ S)    (2)
with |S′| = |S| if n is even and |S′| = |S| + 1 if n is odd; equivalently, each term of the subtracted sum carries the overall sign (−1)^(|S|+n+1).
Proof. 
For any literalset X and any item i, each transaction containing X either contains i or does not contain it; hence Supp(X) = Supp(X ∪ {i}) + Supp(X ∪ {ī}), i.e.,
Supp(X ∪ {ī}) = Supp(X) − Supp(X ∪ {i}).    (E1)
Since (−1)^|S′| = (−1)^(|S|+n) under the convention of the theorem, the claim can be restated as
Supp({i_1, …, i_m, j̄_1, …, j̄_n}) = (−1)^n × Supp({i_1, …, i_m, j_1, …, j_n}) + ∑_{S ⊊ {j̄_1, …, j̄_n}} (−1)^(|S|+n+1) × Supp({i_1, …, i_m} ∪ S).    (H_n)
We show (H_n) by induction on n, the number of negative literals. It is fulfilled for both n = 0 and n = 1. Indeed,
  • For n = 0, the sum is empty and (H_0) reduces to Supp({i_1, …, i_m}) = (−1)^0 × Supp({i_1, …, i_m}).
  • For n = 1, applying (E1) to the literalset {i_1, …, i_m} and the item j_1, we obtain Supp({i_1, …, i_m, j̄_1}) = Supp({i_1, …, i_m}) − Supp({i_1, …, i_m, j_1}), which is exactly (H_1), since the only strict subset S = ∅ contributes (−1)^(0+1+1) × Supp({i_1, …, i_m}) = +Supp({i_1, …, i_m}).
We now suppose that (H_n) is true for n and show that it holds for n + 1. Let A = {i_1, …, i_m} and N = {j̄_1, …, j̄_n}. Applying (E1) to the literalset A ∪ N and the item j_{n+1}, we obtain:
Supp(A ∪ N ∪ {j̄_{n+1}}) = Supp(A ∪ N) − Supp(A ∪ {j_{n+1}} ∪ N).
According to the hypothesis (H_n), we have:
Supp(A ∪ N) = (−1)^n × Supp(A ∪ {j_1, …, j_n}) + ∑_{S ⊊ N} (−1)^(|S|+n+1) × Supp(A ∪ S)    (E2)
and, applying (H_n) to the positive part A ∪ {j_{n+1}},
Supp(A ∪ {j_{n+1}} ∪ N) = (−1)^n × Supp(A ∪ {j_1, …, j_{n+1}}) + ∑_{S ⊊ N} (−1)^(|S|+n+1) × Supp(A ∪ {j_{n+1}} ∪ S).    (E3)
Subtracting (E3) from (E2), each term Supp(A ∪ S) of (E2) is paired with the term Supp(A ∪ {j_{n+1}} ∪ S) of (E3), and (E1) gives Supp(A ∪ S) − Supp(A ∪ {j_{n+1}} ∪ S) = Supp(A ∪ {j̄_{n+1}} ∪ S). Thus,
Supp(A ∪ N ∪ {j̄_{n+1}}) = (−1)^n × Supp(A ∪ {j_1, …, j_n})    (E4)
                          + (−1)^(n+1) × Supp(A ∪ {j_1, …, j_{n+1}})
                          + ∑_{S ⊊ N} (−1)^(|S|+n+1) × Supp(A ∪ {j̄_{n+1}} ∪ S).
Let us rewrite the term (E4). According to (H_n),
(−1)^n × Supp(A ∪ {j_1, …, j_n}) = Supp(A ∪ N) − ∑_{S ⊊ N} (−1)^(|S|+n+1) × Supp(A ∪ S) = ∑_{S ⊆ N} (−1)^(|S|+n) × Supp(A ∪ S),    (E5)
since the term S = N carries the coefficient (−1)^(|N|+n) = (−1)^(2n) = 1. By replacing (E4) by (E5), and noting that (−1)^(|S|+n) = (−1)^(|S|+(n+1)+1) and that (−1)^(|S|+n+1) = (−1)^(|S ∪ {j̄_{n+1}}|+(n+1)+1), we obtain:
Supp(A ∪ N ∪ {j̄_{n+1}}) = (−1)^(n+1) × Supp(A ∪ {j_1, …, j_{n+1}})
                          + ∑_{S ⊆ N} (−1)^(|S|+(n+1)+1) × Supp(A ∪ S)
                          + ∑_{S ⊊ N} (−1)^(|S ∪ {j̄_{n+1}}|+(n+1)+1) × Supp(A ∪ {j̄_{n+1}} ∪ S).
Since the strict subsets of N ∪ {j̄_{n+1}} are exactly the subsets S ⊆ N together with the sets S ∪ {j̄_{n+1}} with S ⊊ N, the two sums merge and we conclude that:
Supp({i_1, …, i_m, j̄_1, …, j̄_{n+1}}) = (−1)^(n+1) × Supp({i_1, …, i_m, j_1, …, j_{n+1}}) + ∑_{S ⊊ {j̄_1, …, j̄_{n+1}}} (−1)^(|S|+(n+1)+1) × Supp({i_1, …, i_m} ∪ S),
which is precisely (H_{n+1}). □
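The following Python sketch (our illustration, not the authors' implementation) encodes the recurrence of Theorem 1 with the explicit signs (−1)^(|S|+n+1) and cross-checks it against direct counting on a small, made-up database.

from itertools import combinations

D = [{"a", "b", "c"}, {"a", "c", "d"}, {"a", "d", "e"}, {"b", "c", "e"}, {"a", "b", "d"}]

def supp_direct(pos, neg, db=D):
    """Support of a literalset by direct counting: pos items present, neg items absent."""
    return sum(1 for t in db if set(pos) <= t and not (set(neg) & t))

def supp_theorem(pos, neg, db=D):
    """Theorem 1: PosVar term plus signed supports of the strict subsets of NegPart."""
    n = len(neg)
    total = (-1) ** n * supp_direct(set(pos) | set(neg), set(), db)   # positive variation
    for r in range(n):                                                # strict subsets of NegPart(L)
        for S in combinations(sorted(neg), r):
            total += (-1) ** (len(S) + n + 1) * supp_direct(pos, S, db)
    return total

assert supp_theorem({"a"}, {"b", "c"}) == supp_direct({"a"}, {"b", "c"})
assert supp_theorem({"a"}, {"b", "c", "d"}) == supp_direct({"a"}, {"b", "c", "d"})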

6. The FasterIE Algorithm for an Efficient Extraction of Frequent Literalsets

In what follows, we put the focus on the most computationally demanding step of the generalized association rule mining process, namely, the extraction of frequent literalsets. Indeed, this step is considered the critical phase of the process. To this end, we introduce a new algorithm, called FasterIE, permitting us to extract the frequent literalsets from the original database. In the following, we present the FasterIE main principle and the underlying data structure.
The FasterIE algorithm adopts a bottom-up traversal of the search space. Hence, starting from the empty set, it determines frequent literalsets in a growing manner and stores them into a prefix tree (aka trie) [24]. Figure 2 (Left) shows a prefix tree that stores all strict subsets of the literalset a b̄ c̄ d̄ that can be extracted from the database D depicted in Table 1. The prefix tree nodes are ordered according to the lexicographic order on literals (the lexicographic order used is a ≺ ⋯ ≺ z ≺ ā ≺ ⋯ ≺ z̄). Each path starting from the root node of the prefix tree represents a literalset, where the integer kept in the last node on the path stands for the support of the literalset, e.g., the left-most path from the node labeled “∅, 5” to the node labeled “c̄, 2” represents the literalset a b̄ c̄, whose support value is equal to 2.
Figure 2. (Left): The prefix tree containing the strict subsets of a b̄ c̄ d̄. (Right): The bottom-most node d̄ (encircled) represents the candidate literalset a b̄ c̄ d̄ generated from the frequent literalsets a b̄ c̄ and a b̄ d̄. The support value associated with this node is initialized to 0. The arrows show the subsets that have to be checked.
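A minimal sketch of such a prefix-tree node is given below; the class name, the encoding of negative literals as "not x", and the stored support values are illustrative assumptions, not the paper's data structure.

class TrieNode:
    def __init__(self, label=None, supp=0):
        self.label = label     # a literal, e.g. "a" or "not b"; None for the root
        self.supp = supp       # support of the literalset spelled by the root-to-node path
        self.children = {}     # children indexed by their literal label (ordering not enforced here)

    def child(self, label):
        return self.children.get(label)

    def add_child(self, label, supp=0):
        return self.children.setdefault(label, TrieNode(label, supp))

# The path root -> "a" -> "not b" -> "not c" stores Supp(a b-bar c-bar) in its last node
# (the support values below are illustrative, not those of Table 1).
root = TrieNode(supp=5)
root.add_child("a", 4).add_child("not b", 3).add_child("not c", 2)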
In the following, we thoroughly describe the different steps of the FasterIE algorithm, whose pseudo-code is presented by Algorithm 1.
In the following, we describe the main routines invoked by the FasterIE algorithm, namely Generate-frequent-1-literalsets, Generate-next-level, and Partial-Computation-Support.
Algorithm 1: FasterIE Algorithm
  Data: (database D, minsup)
  Result: FL
   Begin
1  Set of frequent literalsets FL ← ∅;
2  FL ← Generate-frequent-1-literalsets(D);
3  do
4    Set of candidates CL ← Generate-next-level(FL);
5    foreach literalset L in CL do
6      Partial-Computation-Support(L, root node n∅);
7    Scan D to compute the support of the positive variation of each literalset in CL;
8    CL ← Prune-Infrequent-literalsets(CL, minsup);
9    FL ← FL ∪ CL;
10 while CL is non-empty;
11 return FL;
End

6.1. The Generate-Frequent-1-Literalsets Procedure

The Generate-frequent-1-literalsets procedure scans the transaction database to find the set of frequent 1-literalsets. To this end, it uses a temporary |ℐ|-sized array, where the ith entry represents the support of the positive literal i. Initially, the entries of the array are set to 0. Then, for each scanned transaction T of the database, the support of the literal i is incremented if i is contained in T. Straightforwardly, we can deduce the support of each negative literal ī from that of its opposite i, thanks to Supp(ī) = Supp(∅) − Supp(i) = |D| − Supp(i). The procedure creates the root node n∅, which represents the empty set with a support value equal to |D|, together with its child nodes representing the frequent literals with their associated supports.
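A possible rendering of this procedure in Python is sketched below; the function name and the encoding of negative literals are illustrative assumptions, but the counting logic follows the description above (one scan for the positive literals, complementation for the negative ones).

def generate_frequent_1_literalsets(db, items, minsup):
    counts = {i: 0 for i in items}          # the |I|-sized array of the procedure
    for transaction in db:
        for i in transaction:
            counts[i] += 1                  # one scan counts every positive literal
    frequent = {}
    for i in items:
        if counts[i] >= minsup:
            frequent[frozenset({i})] = counts[i]
        neg_supp = len(db) - counts[i]      # Supp(i-bar) = |D| - Supp(i)
        if neg_supp >= minsup:
            frequent[frozenset({("not", i)})] = neg_supp
    return frequent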

6.2. The Generate-Next-Level Procedure

During an iteration k, the procedure uses the prefix tree to generate the candidate k-literalsets. For this purpose, Generate-next-level creates, for each pair of (k − 1)-literalsets L_1 and L_2 sharing the same first (k − 2) literals in the prefix tree, a candidate child node n_{L_1 ∪ L_2}. Furthermore, the procedure leverages the anti-monotonicity property of the support measure to prune candidate k-literalsets that have at least one infrequent (k − 1)-subset. Figure 2 (Right) illustrates the Generate-next-level procedure at work.
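The following sketch illustrates, under the assumption that literalsets are kept as tuples sorted by the literal order, how the prefix join and the anti-monotone pruning of Generate-next-level can be realized; it operates on plain collections rather than on the prefix tree used by FasterIE, and the "~a" notation for negative literals is ours.

from itertools import combinations

ORDER = lambda lit: (lit.startswith("~"), lit.lstrip("~"))   # a < ... < z < a-bar < ... < z-bar

def generate_next_level(frequent):
    """`frequent` is a set of (k-1)-literalsets, each a tuple sorted by ORDER."""
    prev = sorted(frequent)
    candidates = []
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            l1, l2 = prev[i], prev[j]
            if l1[:-1] != l2[:-1]:                           # must share a (k-2)-literal prefix
                continue
            if l1[-1].lstrip("~") == l2[-1].lstrip("~"):     # never join a literal with its opposite
                continue
            cand = tuple(sorted(l1 + (l2[-1],), key=ORDER))
            if all(sub in frequent for sub in combinations(cand, len(cand) - 1)):
                candidates.append(cand)                      # anti-monotone pruning passed
    return candidates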

6.3. Computing Supports of the Literalsets

The purpose of this step is to compute the respective supports of candidate literalsets. To this end, we propose to split this phase into two sub-phases as follows:

6.3.1. The Partial-Computation-Support

To compute the support of a candidate k-literalset L, we first call the Partial-Computation-Support procedure, whose pseudo-code is given by Algorithm 2. This procedure only computes the value of the subtractive term of Equation (2) (cf. Theorem 1). To do so, the supports of the subsets of L sharing PosPart(L) are required. It is important to note that these support values were already determined during previous iterations. To this end, Partial-Computation-Support uses an array of size |L|, denoted by Z. The ith entry of Z, denoted by Z[i], contains the ith literal of L.
Algorithm 2: Partial-Computation-Support Procedure
  Data: (literalset L, node n)
  /* assert: Supp(L) stores the support of the literalset L */
  /* assert: Z stores the literals of the literalset L */
   Begin
1  i := 0;
2  while Z[i] is not the last positive literal in L do
3    n := the child of n labeled Z[i];
4    i := i + 1;
5  Supp(L) := 0;
6  Explore(Z, i, n, Supp(L));
End
This procedure traverses the prefix tree starting from the root node. Two pointers are used. The first pointer p runs through the elements of Z and is initialized to the first element. The second pointer q runs through the nodes of the prefix tree and is initialized to the root node n∅. For a literal Z[i] referenced by p, Partial-Computation-Support checks whether Z[i] is the last positive literal of L. If it is not, it runs through the children of the node referenced by q to locate the node with label Z[i]. Otherwise, Z[i] is the last positive literal of L, and we begin retrieving the supports of the literalsets involved in Theorem 1, since they share PosPart(L). Indeed, we explore the descendants of the node referenced by q by recursively invoking the Explore procedure, whose pseudo-code is given by Algorithm 3.
Algorithm 3: Explore Procedure
  Data: (Z, i, n, Supp(L))
   Begin
1  n := the child of n labeled Z[i];
2  Supp(L) := Supp(L) ± n.Supp;  /* the sign is given by Theorem 1 and depends on the parity of the number of negative literals on the path from the root to n */
3  for (j := i + 1; j < |L|; j := j + 1)
4    Explore(Z, j, n, Supp(L));
  End
In fact, this procedure looks for the children of the node referenced by q whose labels are included in NegPart(L). Then, for each such child node n_c, the support of L is updated with the support of n_c (with the sign given by Theorem 1), and Explore is called recursively. The search process comes to an end whenever either pointer reaches the end of its structure.
Example 4.
In Figure 3, the Partial-Computation-Support procedure is illustrated for the candidate literalset a b ¯ c ¯ d ¯ . The arrows indicate the nodes that are summed.
Figure 3. Partial-Computation-Support at work for the candidate literalset a b ¯ c ¯ d ¯ .

6.3.2. Computation of Supports of Positive Variations

Once the subtractive term of each candidate k-literalset L is computed, the FasterIE algorithm computes the first term which represents the support of PosVar ( L ) , cf. Theorem 1. It is important to note that this computation requires only one scan of the database for the whole set of the candidate k-literalsets.
Finally, after computing the supports of the candidate k-literalsets, the algorithm deletes the leaves whose support value is lower than minsup (cf. Algorithm 1, line 8).

6.4. Optimization Issues

It is noteworthy that FasterIE has to make many node visits through the prefix tree to compute the support of a literalset. Consequently, to improve the performance of the FasterIE algorithm, we devise strategies that minimize, as far as possible, the number of node visits.
  • Strategy 1: The first optimization is based on the following observation. As shown before, during the partial counting of the support of a candidate literalset, the algorithm explores nodes that have already been visited during the subset-checking step. For example, in Figure 3, the framed nodes were already visited when the subsets of a b̄ c̄ d̄ were handled. Thus, combining these two steps is advantageous.
  • Strategy 2: According to Theorem 1, we can remark that some supports needed to compute the support of a literalset L are also required to compute the supports of the subsets of L sharing PosPart(L). For example, we have:
Supp(a c̄ d̄) = −Supp(a) + Supp(a c̄) + Supp(a d̄) + Supp(acd)    (3)
Supp(a b̄ c̄ d̄) = Supp(a) − Supp(a b̄) − Supp(a c̄) − Supp(a d̄) + Supp(a b̄ c̄) + Supp(a b̄ d̄) + Supp(a c̄ d̄) − Supp(abcd)    (4)
Consequently, we can replace the terms of Equation (4) shared with Equation (3) by Supp(PosVar(a c̄ d̄)):
Supp(a b̄ c̄ d̄) = −Supp(a b̄) + Supp(a b̄ c̄) + Supp(a b̄ d̄) + Supp(acd) − Supp(abcd)    (5)
According to Equation (5), instead of looking for Supp(a), Supp(a c̄), Supp(a d̄), and Supp(a c̄ d̄), we only have to retrieve Supp(PosVar(a c̄ d̄)).
To generalize this example, we propose to further refine the computation of the support of a literalset L as follows:
Proposition 1.
Let L = {i_1, …, i_m, j̄_1, …, j̄_n} be a literalset. Then,
Supp(L) = (−1)^n × Supp({i_1, …, i_m, j_1, …, j_n}) + (−1)^(n+1) × Supp({i_1, …, i_m, j_2, …, j_n}) + ∑_{S ⊊ {j̄_2, …, j̄_n}} (−1)^|S′| × Supp({i_1, …, i_m, j̄_1} ∪ S)
with |S′| = |S| if n is even and |S′| = |S| + 1 if n is odd.
Proof. 
According to Theorem 1, we have:
Supp({i_1, …, i_m, j̄_1, …, j̄_n}) = (−1)^n × Supp({i_1, …, i_m, j_1, …, j_n}) − ∑_{S ⊊ {j̄_1, …, j̄_n}} (−1)^|S′| × Supp({i_1, …, i_m} ∪ S).
A strict subset S of {j̄_1, …, j̄_n} either does not contain j̄_1, in which case S is any subset of {j̄_2, …, j̄_n}, or contains j̄_1, in which case S = {j̄_1} ∪ S″ with S″ ⊊ {j̄_2, …, j̄_n} and |S| = |S″| + 1 (this shift of cardinality turns the subtracted terms into added ones). Splitting the sum accordingly, we obtain:
Supp({i_1, …, i_m, j̄_1, …, j̄_n}) = (−1)^n × Supp({i_1, …, i_m, j_1, …, j_n})
                                   − ∑_{S ⊆ {j̄_2, …, j̄_n}} (−1)^|S′| × Supp({i_1, …, i_m} ∪ S)    (E6)
                                   + ∑_{S ⊊ {j̄_2, …, j̄_n}} (−1)^|S′| × Supp({i_1, …, i_m, j̄_1} ∪ S).
The last sum is exactly the last term of the proposition; it remains to evaluate the term (E6). By applying Theorem 1 to the literalset {i_1, …, i_m, j̄_2, …, j̄_n}, which contains n − 1 negative literals (so that the parity, and hence the sign of its sum, flips with respect to our fixed convention for |S′|), we obtain:
Supp({i_1, …, i_m, j̄_2, …, j̄_n}) = (−1)^(n−1) × Supp({i_1, …, i_m, j_2, …, j_n}) + ∑_{S ⊊ {j̄_2, …, j̄_n}} (−1)^|S′| × Supp({i_1, …, i_m} ∪ S).
Hence,
∑_{S ⊊ {j̄_2, …, j̄_n}} (−1)^|S′| × Supp({i_1, …, i_m} ∪ S) = Supp({i_1, …, i_m, j̄_2, …, j̄_n}) − (−1)^(n−1) × Supp({i_1, …, i_m, j_2, …, j_n}).    (E7)
Moreover, the term S = {j̄_2, …, j̄_n} of (E6) carries the coefficient (−1)^|S′| = (−1)^((n−1)+n) = −1. By replacing (E7) in (E6), we deduce that:
− ∑_{S ⊆ {j̄_2, …, j̄_n}} (−1)^|S′| × Supp({i_1, …, i_m} ∪ S)
   = − ∑_{S ⊊ {j̄_2, …, j̄_n}} (−1)^|S′| × Supp({i_1, …, i_m} ∪ S) + Supp({i_1, …, i_m, j̄_2, …, j̄_n})
   = (−1)^(n−1) × Supp({i_1, …, i_m, j_2, …, j_n})
   = (−1)^(n+1) × Supp({i_1, …, i_m, j_2, …, j_n}),
which is precisely the second term of the proposition. □
However, it is essential to underscore that, to apply this optimization, the support of the positive variation of a literalset has to be stored in its corresponding node.
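As a small sanity check of this rewriting, the following sketch verifies on a made-up database (not Table 1) that Equations (4) and (5) yield the same value, namely the support obtained by direct counting.

D = [{"a", "b", "c"}, {"a", "c", "d"}, {"a", "d", "e"}, {"b", "c", "e"}, {"a", "b", "d"}]

def supp(pos, neg=(), db=D):
    """Support of a literalset: items of `pos` present, items of `neg` absent."""
    return sum(1 for t in db if set(pos) <= t and not (set(neg) & t))

eq4 = (supp("a") - supp("a", "b") - supp("a", "c") - supp("a", "d")
       + supp("a", "bc") + supp("a", "bd") + supp("a", "cd") - supp("abcd"))
eq5 = -supp("a", "b") + supp("a", "bc") + supp("a", "bd") + supp("acd") - supp("abcd")
assert eq4 == eq5 == supp("a", "bcd")   # all equal Supp(a b-bar c-bar d-bar)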

7. Experimental Evaluation

To assess the performance of the FasterIE algorithm, we carried out experiments on benchmark datasets taken from the UCI Machine Learning Database Repository (the datasets, accessed on 7 November 2021, are available at http://www.ics.uci.edu/mlearn/MLRepository.html).

7.1. Assessing Optimizations Benefits

The first series of experiments was performed to compare the first version of FasterIE to the second one, i.e., the one using the optimizations mentioned above, denoted by FasterIE+. According to Figure 4, we can notice that the optimized version largely outperforms the first version of FasterIE, especially as the minsup value is lowered. For example, for the lowest threshold, FasterIE+ is 32 times, 6 times, 8 times, and 7 times as fast as FasterIE for the Nursery, Monks, Flare, and Zoo datasets, respectively. This can be explained by the fact that both introduced optimizations considerably reduce the number of visited nodes during the literalset support computation step.
Figure 4. Comparison of FasterIE performances vs. those of FasterIE+.

7.2. Performance of the FasterIE Algorithm

In the following, we evaluate the FasterIE algorithm in its optimized version. To this end, two different series of experiments were held as follows:
  • The first series of experiments: This series consists of comparing FasterIE versus the naive brute-force approach. To this end, we first extended the tested databases. Then, we used the efficient Bodon implementation [25] of the Apriori algorithm to extract frequent literalsets (this implementation, accessed on 4 September 2021, is available at http://fimi.cs.helsinki.fi/). According to Figure 5, we notice that FasterIE largely outperforms Apriori. Indeed, our algorithm performs 1072 times faster than its competitor Apriori. A takeaway message from this first series of experiments is that the brute-force naive approach is, as expected, far from being scalable.
    Figure 5. Comparison of the performances of FasterIE and those of the existing algorithms.
  • The second series of experiments: In this series, we compare the FasterIE algorithm versus its competitors, i.e., those extracting frequent literalsets from the original dataset. In [23], Calders and Goethals presented three methods for computing the support of a literalset (these approaches were used to extract the non-derivable itemsets [26]). We leveraged these approaches to implement three algorithms, denoted by BruteForceIE, CombinedIE, and QIE, in order to extract frequent literalsets. As aforementioned, these methods have to perform further accesses to the dataset to compute the required supports of several infrequent positive itemsets. It is worth noting that we omit the experimental results of QIE because it is a very time-consuming algorithm. For example, for the Zoo database, it takes more than eight hours for a minsup value equal to 60%. Glancing at Figure 5, we notice that the FasterIE algorithm outperforms BruteForceIE by many orders of magnitude. This is explained by the fact that BruteForceIE performs a high number of database scans to determine the respective literalset supports. Indeed, the algorithm has to scan the database for each support computation. Consequently, the larger the negative part of a literalset is, the slower the algorithm becomes. This conclusion is reasonably expected, since the number of terms of Equation (1) grows exponentially with the number of negative literals. As we have already underscored, the larger the negative part of a literalset, the trickier and more challenging its support computation. This is where our approach comes into play: according to Proposition 1, some supports needed to compute the support of a literalset L can be reused to compute the supports of the subsets of L sharing its positive part; part of the expression is thus rewritten in terms of an already stored support, which decreases the length of the negative part that has to be expanded. By and large, the FasterIE algorithm sharply outperforms CombinedIE, which in turn outperforms the BruteForceIE algorithm. Indeed, the CombinedIE algorithm reduces the I/O cost by storing all transactions in a trie-like data structure [27].

8. Conclusions

Generalized association rule mining is a highly relevant yet challenging problem in data mining that has caught many researchers' interest. Indeed, when negative items are considered, the length of the transactions increases. Thus, standard data mining algorithms, and especially the step of computing the supports of itemsets with negation, would break down.
This paper focuses on a critical step of generalized association rule mining, namely the extraction of frequent literalsets. Indeed, this step constitutes the basis of the mining process of generalized association rules. To this end, we proposed a new algorithm, called FasterIE, for extracting frequent literalsets. In addition, we devised an efficient method that overcomes the problem of computing the support of literalsets. Experimental results show the proposed approach's efficiency compared to the existing algorithms.
The number of generalized association rules can be overwhelming. Thus, it is nearly impossible for end-users to comprehend or validate so many rules. In this line, we plan to tackle the following thriving challenges:
  • Mining generic bases of top-K generalized association rules [28]: The massive number of association rules drawn from even reasonably sized datasets has bootstrapped the development of more acute techniques or methods to reduce the size of the reported rule sets. The sought-after goal would be to define an "irreducible" nucleus of the set of generalized association rules. From such a generic basis of generalized association rules, it is possible to infer all valid association rules via an adequate axiomatic system. We also consider exploring the benefit of applying this newly defined generic basis to the regulation of Pregnancy Associated Breast Cancer gene expressions [29].
  • A conceptual coverage composed of generalized literalsets [6,30]: This issue explores the thriving opportunity to define a generalized conceptual coverage by generalized intent and extent parts. Would it be better, or more convenient, to describe some properties by the absence of the other ones?
  • Identification of biclusters in gene expression data [31]: Indeed, biclusters can capture positive or negative correlations. A negative-correlation bicluster is a bicluster where the expression values of some genes tend to be the complete opposite of those of the other genes, i.e., given two genes G_1 and G_2 affected by the same condition C, if G_1 goes up while G_2 goes down, then G_1 and G_2 exhibit a negative correlation pattern.

Author Contributions

Conceptualization, A.M., F.H. and S.A.; Formal analysis, A.M., F.H. and S.A.; Investigation, A.M., F.H. and S.A.; Methodology, A.M., F.H. and S.A.; Project administration, A.M., F.H. and S.A.; Software, A.M., F.H. and S.A.; Supervision, A.M., F.H. and S.A.; Validation, A.M., F.H. and S.A.; Writing—original draft, A.M.; Writing—review & editing, F.H. and S.A. All authors have read and agreed to the published version of the manuscript.

Funding

This project was supported by Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2022R236), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.

Acknowledgments

This project was supported by Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2022R236), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Solanki, S.K.; Patel, J.T. A Survey on Association Rule Mining. In Proceedings of the Fifth International Conference on Advanced Computing Communication Technologies, Haryana, India, 21–22 February 2015; pp. 212–216.
  2. Sharma, R.; Kaushik, M.; Peious, S.A.; Bazin, A.; Shah, S.A.I.F., Jr.; Ben Yahia, S.; Draheim, D. A Novel Framework for Unification of Association Rule Mining, Online Analytical Processing and Statistical Reasoning. IEEE Access 2022, 10, 12792–12813.
  3. Fister, I.I.F., Jr. Association Rules over Time. In Frontiers in Nature-Inspired Industrial Optimization; Springer: Singapore, 2022; pp. 1–16.
  4. Fournier-Viger, P.; Li, J.; Lin, J.C.; Truong Chi, T.; Uday Kiran, R. Mining cost-effective patterns in event logs. Knowl.-Based Syst. 2020, 191, 105241.
  5. Mouakher, A.; Ben Yahia, S. Anthropocentric Visualisation of Optimal Cover of Association Rules. In Proceedings of the 7th International Conference on Concept Lattices and Their Applications, Sevilla, Spain, 19–21 October 2010; Volume 672, pp. 211–222.
  6. Mouakher, A.; Ben Yahia, S. QualityCover: Efficient binary relation coverage guided by induced knowledge quality. Inf. Sci. 2016, 355–356, 58–73.
  7. Mouakher, A.; Ragobert, A.; Gerin, S.; Ko, A. Conceptual Coverage Driven by Essential Concepts: A Formal Concept Analysis Approach. Mathematics 2021, 9, 2694.
  8. Shahin, M.; Arakkal Peious, S.; Sharma, R.; Kaushik, M.; Ben Yahia, S.; Shah, S.A.; Draheim, D. Big data analytics in association rule mining: A systematic literature review. In Proceedings of the 3rd International Conference on Big Data Engineering and Technology (BDET), Singapore, 16–18 January 2021; pp. 40–49.
  9. Sharmila, S.; Vijayarani, S. Association rule mining using fuzzy logic and whale optimization algorithm. Soft Comput. 2021, 25, 1431–1446.
  10. Bagui, S.; Probal, D. Mining Positive and Negative Association Rules in Hadoop’s MapReduce Environment. In Proceedings of the ACMSE 2018 Conference, ACMSE’18, Richmond, KY, USA, 29–31 March 2018; Association for Computing Machinery: New York, NY, USA, 2018.
  11. Wu, X.; Zhang, C.; Zhang, S. Efficient mining of both positive and negative association rules. ACM Trans. Inf. Syst. 2004, 22, 381–405.
  12. Mahmood, S.; Shahbaz, M.; Guergachi, A. Negative and Positive Association Rules Mining from Text Using Frequent and Infrequent Itemsets. Sci. World J. 2014, 2014, 973750.
  13. Agrawal, R.; Imielinski, T.; Swami, A. Mining association rules between sets of items in large databases. In Proceedings of the ACM-SIGMOD International Conference on Management of Data (SIGMOD 1993), Washington, DC, USA, 26–28 May 1993; pp. 207–216.
  14. Amir, A.; Feldman, R.; Kashi, R. A new versatile method for association generation. In Proceedings of the 1st European Symposium on Data Mining and Knowledge Discovery (PKDD 1997), Trondheim, Norway, 24–27 June 1997; pp. 221–231.
  15. Savasere, A.; Omiecinski, E.; Navathe, S. Mining for strong negative associations in a large database of customer transactions. In Proceedings of the 14th International Conference on Data Engineering (ICDE 1998), Orlando, FL, USA, 23–27 February 1998; pp. 494–502.
  16. Morzy, M. Efficient mining of dissociation rules. In Proceedings of the 8th International Conference on Data Warehousing and Knowledge Discovery (DaWak 2006), Krakow, Poland, 4–8 September 2006.
  17. Piatetsky-Shapiro, G. Discovery, Analysis, and Presentation of Strong Rules. In Knowledge Discovery in Databases; Piatetsky-Shapiro, G., Frawley, W.J., Eds.; AAAI/MIT Press: Cambridge, MA, USA, 1991; pp. 229–248.
  18. Antonie, M.; Zaïane, O. Mining positive and negative association rules: An approach for confined rules. In Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2004), Pisa, Italy, 20–24 September 2004; pp. 27–38.
  19. Tan, P.; Kumar, V. Interestingness measures for association patterns: A perspective. In Proceedings of the International Workshop on Postprocessing in Machine Learning and Data Mining, Boston, MA, USA, 20–23 August 2000.
  20. Cornelis, C.; Yan, P.; Zhang, X.; Chen, G. Mining positive and negative association rules from large databases. In Proceedings of the International Conference on Cybernetics and Intelligent Systems (CIS 2006), Bangkok, Thailand, 19–21 November 2006; pp. 613–618.
  21. Boulicaut, J.F.; Bykowski, A.; Jeudy, B. Towards the tractable discovery of association rules with negations. In Proceedings of the 4th International Conference on Flexible Query Answering Systems (FQAS 2000), Warsaw, Poland, 25–28 October 2000; pp. 425–434.
  22. Knuth, D.E. Fundamental Algorithms; Addison-Wesley: Reading, MA, USA, 1997.
  23. Calders, T.; Goethals, B. Quick Inclusion-Exclusion. In Proceedings of the 4th International Workshop on Knowledge Discovery in Inductive Databases (KDID 2005), Porto, Portugal, 3 October 2005.
  24. Fredkin, E. Trie memory. Commun. ACM 1960, 3, 490–499.
  25. Bodon, F. A fast Apriori implementation. In Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI 2003), Melbourne, FL, USA, 19 December 2003.
  26. Calders, T.; Goethals, B. Non-derivable itemset mining. Data Min. Knowl. Discov. 2007, 14, 171–206.
  27. Borgelt, C.; Kruse, R. Induction of association rules: Apriori implementation. In Proceedings of the 15th Conference on Computational Statistics (COMPSTAT 2002), Berlin, Germany, 24–28 August 2002; pp. 395–400.
  28. Ben Yahia, S.; Gasmi, G.; Mephu Nguifo, E. A new generic basis of “factual” and “implicative” association rules. Intell. Data Anal. 2009, 13, 633–656.
  29. Bouasker, S.; Inoubli, W.; Ben Yahia, S.; Diallo, G. Pregnancy Associated Breast Cancer Gene Expressions: New Insights on Their Regulation Based on Rare Correlated Patterns. IEEE ACM Trans. Comput. Biol. Bioinform. 2021, 18, 1035–1048.
  30. Mouakher, A.; Ben Yahia, S. On the efficient stability computation for the selection of interesting formal concepts. Inf. Sci. 2019, 472, 15–34.
  31. Houari, A.; Ayadi, W.; Ben Yahia, S. A new FCA-based method for identifying biclusters in gene expression data. Int. J. Mach. Learn. Cybern. 2018, 9, 1879–1893.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
