Algorithms
  • Article
  • Open Access

1 November 2013

Multi-Core Parallel Gradual Pattern Mining Based on Multi-Precision Fuzzy Orderings

1 Efrei-AllianSTIC, Villejuif 94800, France
2 LIRMM, University Montpellier 2 - CNRS, Montpellier 34095, France
3 Toluca Institute of Technology, Toluca 64849, Mexico
4 Apizaco Institute of Technology, Apizaco 90300, Mexico
This article belongs to the Special Issue Algorithms for Multi Core Parallel Computation

Abstract

Gradual patterns aim at describing co-variations of data such as the higher the size, the higher the weight. In recent years, such patterns have been increasingly studied from the data mining point of view. The extraction of such patterns relies on efficient and smart orderings that can be built among data: for instance, when ordering the data with respect to the size, the data are also ordered with respect to the weight. However, in many application domains, it is hardly possible to consider that data values are crisply ordered. When considering gene expression, it is not true from the biological point of view that Gene 1 is more expressed than Gene 2 if their levels of expression differ only in the tenth decimal place. We thus consider fuzzy orderings and fuzzy gamma rank correlation. In this paper, we address two major problems related to this framework: (i) the high memory consumption and (ii) the precision, representation and efficient storage of the fuzzy concordance degrees versus the loss or gain of computing power. For this purpose, we consider multi-precision matrices represented using sparse matrices, coupled with parallel algorithms. Experimental results show the interest of our proposal.

1. Introduction

In data mining, mining for frequent patterns (in this paper, the words item and pattern are considered synonyms) has been extensively studied in recent years. Among the patterns that can be discovered, gradual patterns aim at describing co-variations of attributes, such as the higher the size, the higher the weight. Such a gradual pattern relies on the fact that when the size increases, the weight also increases, the objects being ranked with respect to their size and weight. However, real-world databases may contain information that can hardly be ranked in a crisp manner. For instance, gene expression levels are measured by instruments and are imperfect. For this reason, an expression level can hardly be declared as being greater than another one if they differ only by a small value. We thus claim that orderings must be considered as being soft. Fuzzy orderings and fuzzy ranking indeed make it possible to handle the vagueness, ambiguity or imprecision present in problems of deciding between fuzzy alternatives and uncertain data [1,2,3,12,16]. However, although fuzzy orderings and fuzzy rank correlation measures bring great benefits, these techniques prevent us from considering binary relations (greater than / lower than) and in-machine binary representations, which are efficient from the memory consumption and computation time (binary masks) points of view. The representation and efficient storage of the vagueness and imprecision of the data is indeed a complex challenge, as studied in [2]. We thus propose a framework to address the high memory consumption, the representation, the precision and the efficient storage of the fuzzy concordance degrees, by using sparse matrices and high-performance computing (parallel programming).
This paper is organized as follows: Section 2 reports existing work on fuzzy orderings, gradual pattern mining and parallel data mining. Section 3 presents our gradual item set mining algorithm and our framework to address the high memory consumption, the representation, precision and efficient storage of the fuzzy concordance degrees. Experimental results are presented in Section 4. Section 5 is our conclusion.

3. Parallel Fuzzy Gradual Pattern Mining Based on Multi-Precision Fuzzy Orderings

In this section, we detail our approach.

3.1. Managing Multi-Precision

Concerning the implementation of the matrices of concordance degrees c̃p(i, j), we address two important issues: (i) memory consumption; and (ii) the precision of the representation of each concordance degree c̃p(i, j).
In order to reduce memory consumption, we represent and store each matrix of concordance degrees according to the Binary Fuzzy Matrix Multi-precision Format, where each c̃p(i, j) ∈ [0, 1] is represented with a precision ranging from 2 up to 52 bits.
Because we generate itemset candidates from the frequent k-itemsets, only the matrices of the (k-1)-level frequent gradual itemsets are kept in memory while being used to generate the matrices of the k-level gradual itemset candidates. If the support of a gradual itemset candidate C(k, q) is less than the minimum threshold, then C(k, q) is pruned and its matrix of fuzzy concordance degrees c̃p(i, j) is removed from memory.
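The level-wise retention and pruning described above can be sketched as follows. This is a minimal Python illustration in which, as a stand-in, the support of a candidate is taken to be the average of its concordance degrees; the actual definition in our framework relies on fuzzy gamma rank correlation.

```python
def support(conc):
    """Illustrative support: average concordance degree over all
    ordered pairs (i, j), i != j. NOTE: a stand-in for the paper's
    fuzzy gamma rank correlation-based support."""
    n = len(conc)
    total = sum(conc[i][j] for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))

def prune_level(candidates, min_support):
    """Apriori-style pruning: drop every candidate whose support is
    below the threshold, releasing its concordance matrix."""
    return {name: m for name, m in candidates.items()
            if support(m) >= min_support}
```

Dropping the whole matrix of a pruned candidate is what keeps the memory footprint bounded between levels.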
As seen in the previous section, fuzzy orderings are interesting but consume large amounts of memory when the concordance degrees are stored as floating-point numbers.
On the other hand, binary matrices are very efficient regarding both memory and time consumption; we thus consider binary vectors in order to represent the fuzzy degrees. The size of these vectors determines the precision we manage. Figure 4 shows how values are represented at a precision of 3 bits.
Figure 3. Illustration of the real matrix of fuzzy concordance degrees.
Each c̃p(i, j) ∈ [0, 1] is thus represented with a precision ranging from 1 bit (crisp case) to n bits (52 in our implementation). n bits allow representing up to 2^n values. Figure 3 shows the real matrix of fuzzy concordance degrees; Figure 4 shows how the same values are represented at a precision of 3 bits.
Figure 4. Illustration of the binary matrix of fuzzy concordance degrees with a precision of three bits.
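A minimal sketch of this n-bit representation: a degree in [0, 1] is mapped to one of the 2^n equally spaced values that an n-bit code can hold (the function names are ours, for illustration).

```python
def quantize(degree, bits):
    """Map a fuzzy concordance degree in [0, 1] to an integer code
    on `bits` bits (2**bits representable levels)."""
    levels = (1 << bits) - 1          # highest code, e.g., 7 for 3 bits
    return round(degree * levels)

def dequantize(code, bits):
    """Recover the approximate degree from its n-bit code."""
    return code / ((1 << bits) - 1)
```

At 3 bits, a degree of 0.6 is stored as the code 4 and read back as 4/7 ≈ 0.571: the quantization error is bounded by half a step, i.e., 1/(2·(2^n − 1)).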
In our Algorithm 1, the concept of matrix of concordance degrees plays an important role.
Algorithm 1 Fuzzy Orderings-based Gradual Itemset Mining

3.2. Coupling Multi-Precision and Parallel Programming

The evaluation of the correlation, the computation of the support, and the generation of gradual pattern candidates require huge amounts of processing time and memory, and raise load-balancing issues. In order to reduce memory consumption, each matrix of fuzzy concordance degrees c̃p(i, j) is represented and stored according to the Binary Fuzzy Matrix Multi-precision Format, where each c̃p(i, j) ∈ [0, 1] is represented with a precision ranging from 2 up to 52 bits. In order to reduce processing time, we parallelize the program using OpenMP, a shared-memory API that is ideally suited for multi-core architectures [9].
Figure 5 gives an overall view of the parallel version of the two regions of our fuzzyMGP algorithm: in the first region, the extraction of gradual patterns of size k = 2 is parallelized; in the second region, the extraction cycle for gradual patterns of size k > 2 is parallelized.
Figure 5. Parallel extraction of gradual patterns (parfuzzyMGP).
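The control flow of these two regions can be sketched as follows. Our implementation uses OpenMP; the Python sketch below only illustrates the structure, with hypothetical callbacks eval_pair (evaluation of a pair of attributes) and extend (generation of the next-level candidates from a frequent itemset).

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def mine_parallel(attributes, eval_pair, extend, min_support, workers=4):
    """Sketch of the two parallel regions of parfuzzyMGP.
    eval_pair(a, b) -> (itemset, support); extend(itemset) -> list of
    (itemset, support) candidates of the next size."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Region 1: gradual patterns of size k = 2, one task per attribute pair.
        pairs = list(combinations(attributes, 2))
        level = [r for r in pool.map(lambda p: eval_pair(*p), pairs)
                 if r[1] >= min_support]
        frequent = list(level)
        # Region 2: level-wise cycle for k > 2, parallel over candidates.
        while level:
            nested = pool.map(lambda item: extend(item[0]), level)
            level = [c for cands in nested for c in cands
                     if c[1] >= min_support]
            frequent.extend(level)
    return frequent
```

In the OpenMP version, each region corresponds to a parallel worksharing loop; the shared-memory model lets all threads read the retained (k-1)-level matrices without copying them.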
In the experiments reported below, we aim at studying how multi-precision impacts performance, regarding the trade-off between high precision with high memory consumption and low precision with low memory consumption. The behavior of the algorithms is studied with respect to the number of bits allocated for storing the degrees. The question raised is whether there exists a threshold beyond which it is useless to allocate more memory space. This threshold may depend on the database.
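The stakes of this trade-off can be quantified with simple arithmetic: a dense matrix of concordance degrees over n lines holds n² entries, so its size grows linearly with the chosen precision (a sketch that ignores the additional savings brought by the sparse representation).

```python
def matrix_bits(n_lines, precision_bits):
    """Bits required by one dense n x n matrix of concordance
    degrees stored at the given precision."""
    return n_lines * n_lines * precision_bits

# At 3000 lines, a 64-bit floating-point matrix needs 16 times
# the memory of a 4-bit representation, and 32 times that of 2 bits.
ratio = matrix_bits(3000, 64) // matrix_bits(3000, 4)
```

Since one matrix is kept per retained itemset, these factors multiply across the whole (k-1)-level frontier, which is why lowering the precision can make an otherwise infeasible database tractable.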

4. Experiments

4.1. Databases and Computing Resources

We conducted experiments on two sets of databases.
The first is a set of synthetic databases generated in order to study scalability, containing hundreds of attributes and lines that can easily be split so as to obtain several databases.
The second database, called Amadeus Exoplanete, comes from astrophysics and consists of 97,718 instances and 60 attributes [10]. In this paper, we report experimental results of parallel gradual pattern mining on three subsets of this database, of 1000, 2000, and 3000 instances with 15 attributes.
In order to demonstrate the benefit of high performance computing on fuzzy data mining, our experiments are run on an IBM supercomputer, more precisely on two servers:
  • an IBM dx360 M3 server embedding computing nodes configured with 2 × 2.66 GHz six-core Intel (Westmere) processors, 24 GB DDR3 1,066 MHz RAM and Infiniband (40 Gb/s), reported as Intel; and
  • an IBM x3850 X5 server running 8 processors embedding ten Intel (Westmere) cores each, i.e., 80 cores at 2.26 GHz, with 1 TB DDR3 memory (1,066 MHz) and Infiniband (40 Gb/s), reported as SMP (because of its shared memory).

4.2. Measuring Performances

In our experiments, we report the speedup of our algorithms regarding the database size and complexity [11]. Speedup is computed in order to prove the efficiency of our solution on high performance platforms and thus its scalability in order to tackle very large problems.
The speedup of a parallel program expresses the relative reduction of response time that can be obtained by a parallel execution on p processors or cores compared to the best sequential implementation of that program. The speedup Speedup(p) of a parallel program with parallel execution time T(p) is defined as

Speedup(p) = T(1) / T(p)
where:
  • p is the number of processors/cores or threads;
  • T(1) is the execution time of the sequential program (with one thread or core);
  • T(p) is the execution time of the parallel program with p processors, cores, or threads.
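Applied to concrete timings, the definition reads as follows (a small helper; efficiency, i.e., speedup divided by p, is the standard companion measure).

```python
def speedup(t_seq, t_par):
    """Speedup(p) = T(1) / T(p)."""
    return t_seq / t_par

def efficiency(t_seq, t_par, p):
    """Fraction of the ideal linear speedup achieved by p threads."""
    return speedup(t_seq, t_par) / p

# e.g., a 120 s sequential run completed in 20 s on 8 threads
# yields a speedup of 6.0, i.e., 75% parallel efficiency.
```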

4.3. Main Results

We first notice that computing time is impacted by the choice of the minimum threshold, but is not noticeably affected by a small difference of precision (see Figure 6). Furthermore, precision has no impact at all on the measured speedups.
Figure 6, Figure 7, Figure 8 and Figure 9 show that we can achieve very good accelerations on the synthetic databases, even at a relatively high level of parallelization (more than 50 processing units). In particular, the Intel nodes show a good speedup at low precision (Figure 10), which shows the interest of managing multi-precision in order to adapt to the available computing resources (memory).
Regarding the real astrophysics database, our experiments show that on the Intel nodes, 2000 lines can be managed at a precision of 4 bits (Figure 11), and up to 3000 lines at a precision of 2 bits (Figure 12), while it is impossible to manage 3000 lines at 4 bits due to memory consumption limits. On the SMP nodes, our experiments show excellent speedup and scale-up, even over large databases, without memory explosion (Figure 13).
Figure 6. Execution time and Speedup related to the number of threads on Intel nodes for synthetic database of 150 attributes at precisions 6 and 9 bits.
Figure 7. Execution time and Speedup related to the number of threads on Intel nodes for synthetic database of 200 attributes at precision 8 bits with minimum threshold values 0.411, 0.412 and 0.413.
Figure 8. Execution time and Speedup related to the number of threads on SMP nodes for synthetic database of 200 attributes at precision 32 bits with minimum threshold values 0.411, 0.412, 0.414.
Figure 9. Execution time and Speedup related to the number of threads on SMP nodes for synthetic database of 300 attributes at precision 6 bits with minimum threshold values 0.4181 and 0.4183.
Figure 10. Execution time and Speedup related to the number of threads on Intel nodes for synthetic database of 300 attributes at precision 2 bits with minimum threshold values 0.17 and 0.18.
Figure 11. Execution time and Speedup related to the number of threads on Intel nodes for real database of 2000 lines at precision 4 bits with minimum threshold values 0.17 and 0.18.
Figure 12. Execution time and Speedup related to the number of threads on Intel nodes for real database of 3000 lines at precision 2 bits with minimum threshold values 0.17 and 0.18.
Figure 13. Execution time, Speedup and Scaleup related to the number of threads on SMP nodes for real database of 3000 lines at precision 12 bits with minimum threshold values 0.17 and 0.18.

5. Conclusions

In this paper, we address the extraction of gradual patterns when considering fuzzy orderings. This allows for dealing with imperfection in the datasets, when values can hardly be crisply ordered. For instance, this situation often occurs when considering data collected from sensors. In this case, the measurement error leads to values that can be considered as being similar even if they are not equal. The extent to which they can be considered similar is handled by considering fuzzy orderings and fuzzy gamma rank correlation, which we propose to introduce into gradual pattern mining algorithms. We show that the parallelization of such algorithms is necessary to remain scalable regarding both memory consumption and runtime. Memory consumption is indeed challenging in our framework, as introducing fuzzy ranking prevents us from using a single bit for representing that one value is greater than another. We thus introduce the notion of precision and propose an efficient storage of the fuzzy concordance degrees that can be tuned (from 2 to 52 bits) in order to manage the trade-off between memory consumption and the loss or gain of computing power.

Acknowledgments

This work was realized with the support of HPC@LR, a Center of Competence in High-Performance Computing from the Languedoc-Roussillon region, funded by the Languedoc-Roussillon region, Europe and the Université Montpellier 2 Sciences et Techniques. The HPC@LR Center is equipped with an IBM hybrid supercomputer. The authors would also like to thank the AMADEUS CNRS MASTODONS project (Analysis of MAssive Data in Earth and Universe Sciences) for providing real data.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Bodenhofer, U. Fuzzy Orderings of Fuzzy Sets. In Proceedings of the 10th IFSA World Congress, Istanbul, Turkey, 30 June–2 July 2003; pp. 500–5007.
  2. Koh, H.-W.; Hüllermeier, E. Mining Gradual Dependencies Based on Fuzzy Rank Correlation. In Combining Soft Computing and Statistical Methods in Data Analysis; Volume 77, Advances in Intelligent and Soft Computing; Springer: Heidelberg, Germany, 2010; pp. 379–386.
  3. Lin, N.P.; Chueh, H. Fuzzy Correlation Rules Mining. In Proceedings of the 6th WSEAS International Conference on Applied Computer Science, Hangzhou, China, 15–17 April 2007; pp. 13–18.
  4. Laurent, A.; Lesot, M.-J.; Rifqi, M. GRAANK: Exploiting Rank Correlations for Extracting Gradual Itemsets. In Proceedings of the Eighth International Conference on Flexible Query Answering Systems (FQAS’09), Springer, Roskilde, Denmark, 26–28 October 2009; Volume LNAI 5822, pp. 382–393.
  5. Quintero, M.; Laurent, A.; Poncelet, P. Fuzzy Ordering for Fuzzy Gradual Patterns. In Proceedings of the FQAS 2011, Springer, Ghent, Belgium, 26–28 October 2011; Volume LNAI 7022, pp. 330–341.
  6. Di Jorio, L.; Laurent, A.; Teisseire, M. Mining Frequent Gradual Itemsets from Large Databases. In Proceedings of the International Conference on Intelligent Data Analysis (IDA’09), Lyon, France, 31 August–2 September, 2009.
  7. Quintero, M.; Laurent, A.; Poncelet, P.; Sicard, N. Fuzzy Orderings for Fuzzy Gradual Dependencies: Efficient Storage of Concordance Degrees. In Proceedings of the FUZZ-IEEE Conference, Brisbane, Australia, 10–15 June 2012.
  8. El-Rewini, H.; Abd-El-Barr, M. Advanced Computer Architecture and Parallel Processing; Wiley: Hoboken, NJ, USA, 2005.
  9. Rauber, T.; Rünger, G. Parallel Programming: For Multicore and Cluster Systems; Springer: Berlin/Heidelberg, Germany, 2010. [Google Scholar]
  10. Debosscher, J.; Sarro, L.M. Automated supervised classification of variable stars in the CoRoT programme: Method and application to the first four exoplanet fields. Astron. Astrophys. 2009, 506, 519–534. [Google Scholar] [CrossRef]
  11. Hill, M.D. What is scalability? ACM SIGARCH Comput. Archit. News 1990, 18, 18–21. [Google Scholar] [CrossRef]
  12. Bodenhofer, U.; Klawonn, F. Robust rank correlation coefficients on the basis of fuzzy orderings: Initial steps. Mathw. Soft Comput. 2008, 15, 5–20.
  13. Calders, T.; Goethals, B.; Jaroszewicz, S. Mining Rank-Correlated Sets of Numerical Attributes. In Proceedings of the KDD’06, 20–23 August 2006; ACM: Philadelphia, PA, USA, 2006.
  14. Flynn, M. Some computer organizations and their effectiveness. IEEE Trans. Comput. 1972, C-21, 948–960. [Google Scholar] [CrossRef]
  15. Hüllermeier, E. Association Rules for Expressing Gradual Dependencies. In Proceedings of the PKDD Conference, Helsinki, Finland, 19–23 August 2002; Volume LNCS 2431, pp. 200–211.
  16. Zadeh, L.A. Similarity relations and fuzzy orderings. Inf. Sci. 1971, 3, 177–200. [Google Scholar] [CrossRef]
