Article

Efficient False Positive Control Algorithms in Big Data Mining

1 School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China
2 Northeastern University, Shenyang 110819, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(8), 5006; https://doi.org/10.3390/app13085006
Submission received: 14 February 2023 / Revised: 8 April 2023 / Accepted: 13 April 2023 / Published: 16 April 2023
(This article belongs to the Special Issue Big Data Engineering and Application)

Abstract
A typical hypothesis testing task in statistical analysis is determining whether a pattern is significantly associated with a specific class label. In big data mining scenarios, this leads to highly challenging multiple hypothesis testing problems, as millions or billions of hypothesis tests in large-scale exploratory data analysis can produce a large number of false positive results. The permutation testing-based FWER control method (PFWER) is theoretically effective for dealing with multiple hypothesis testing. In practice, however, this theoretical approach faces a serious computational efficiency problem: computing an appropriate FWER false positive control threshold with PFWER takes an extremely long time and is practically infeasible on medium- or large-scale data. Although some methods for accelerating the FWER threshold calculation have been proposed, most of them are stand-alone, and there is still considerable room for efficiency improvement. To address this problem, this paper proposes a distributed PFWER false positive threshold calculation method for large-scale data, which significantly improves computational efficiency over existing approaches. The FP-growth algorithm is first used for pattern mining, and the mining process reduces the computation of invalid patterns through pruning operations and an index optimization that merges patterns with indexed transactions. On this basis, distributed computing is introduced: the constructed FP tree is decomposed into a set of subtrees, each corresponding to a subtask, and all subtrees (subtasks) are distributed to different computing nodes. Each node independently calculates a local significance threshold for its assigned subtasks.
Finally, all local results are aggregated to compute the FWER false positive control threshold, which is completely consistent with the theoretical result. Experimental results on 11 real-world datasets demonstrate that the proposed distributed algorithm significantly improves the computational efficiency of PFWER while preserving its theoretical accuracy.

1. Introduction

In statistical analysis, we often need to test whether a pattern is significantly associated with a given class label, which is the classical hypothesis testing problem [1]. Due to ever-increasing data sizes, this task frequently has to be carried out on large datasets: for example, detecting whether a genetic pattern in massive bioinformatics data is significantly associated with a certain disease [2], or whether a user behavior pattern in massive market shopping data is significantly associated with the sale of a certain item [3]. This raises a challenging multiple hypothesis testing issue, because millions or billions of hypothesis tests in large-scale exploratory data analysis can produce many false positives, leading to a substantial waste of resources [4].
The FWER control method based on permutation testing (PFWER) has been theoretically shown to be effective for mitigating multiple hypothesis testing problems [5,6]. Compared with traditional FWER control methods (e.g., the Bonferroni correction [7], the SRB algorithm [8], the Simes algorithm [9], Hochberg's method [10], etc.), it has received much attention for its ability to control the overall probability of false positives at a lower level without assuming independent and identical distributions. The PFWER control method works by randomly permuting the class labels in the original data a certain number of times and recalculating the significance threshold (i.e., p-value) that satisfies the FWER constraint [11]. Because the initial association between class labels and the dataset is randomly perturbed, the p-values corrected by the PFWER control technique can better control the false positives of the overall results in a more realistic scenario, i.e., one in which the assumption of independent and identical distributions between variables is not required.
Although the PFWER control method can theoretically produce more reasonable FWER thresholds, it is highly computationally intensive. Each class label permutation requires calculating the corresponding p-value for all patterns embedded in the data (typically on the order of the original data size) and selecting the smallest p-value among them, and the same process is typically repeated 1000 to 10,000 times [11,12]. The FastWY algorithm [13] exploits the inherent properties of discrete test statistics and successfully reduces the computational burden of the Westfall–Young permutation-based procedure. The Westfall–Young Light algorithm [5] is based on an incremental search strategy in which the enumerated frequent patterns are computed only once, reducing the running time of the p-value computation task by several orders of magnitude through p-value pre-computation. These PFWER control methods, however, are all single-machine algorithms, and there is still room for significant efficiency improvements.
To address the aforementioned problem, a distributed FWER false positive threshold calculation method for large-scale data is proposed in this article. The computational efficiency is greatly improved when compared to current methods. The FP-growth algorithm is used first for pattern mining, and the mining process lowers the computation of invalid patterns by merging patterns with index transactions via pruning operations and index optimization. On this basis, the concept of distributed computing is introduced, and the constructed FP tree is decomposed into a set of subtrees, each of which corresponds to a subtask, and all subtrees (subtasks) are distributed to different computing nodes, each of which independently computes the local significance threshold based on the assigned subtasks. Finally, the results of all nodes’ local computations are aggregated, and the FWER false positive control thresholds that are completely consistent with the theoretical results are calculated.
The main contributions of this paper are as follows.
(1)
A distributed PFWER false positive control algorithm is proposed. Based on the proof that the threshold calculation task is decomposable, the PFWER false-positive control threshold calculation problem on large data is extended to a distributed solvable problem through task decomposition and the merging of local results. Theoretical analysis and experimental findings indicate that the algorithm outperforms similar algorithms in terms of execution efficiency.
(2)
An FP tree with an index structure and a pruning strategy is proposed. The pruning strategy can reduce the number of condition trees constructed, and the index structure can reduce the computation of redundant patterns in FP tree construction. The experimental findings show that the two strategies can significantly reduce the number of traversals of the dataset and the pattern computation overhead, which greatly improves computational efficiency.
The paper is structured as follows: Section 2 is an introduction to the relevant concepts and techniques. Section 3 introduces the distributed PFWER false positive control algorithm. Section 4 tests the correctness and computational efficiency of the distributed PFWER false positive control algorithm through experiments and provides a theoretical analysis of the experimental results. Section 5 concludes the paper and discusses the focus of future work.

2. Related Concepts and Techniques

The main purpose of false positive control is to correct for multiple hypothesis testing so as to reduce the occurrence of errors, which has a wide range of applications in both scientific research and practical production. With the continuous improvement of technology, a large amount of data has been generated, and the correction of multiple hypothesis testing in the era of big data has become the focus of more and more researchers and companies. This section introduces the concepts of hypothesis testing, multiple hypothesis testing, false positives, and p-value calculation. Next, three false positive control methods are introduced, namely the direct adjustment method, the permutation-based method, and the holdout evaluation method. Finally, several currently popular distributed computing frameworks are introduced.

2.1. Concepts Related to False Positives

2.1.1. Hypothesis Testing

In statistics, hypothesis testing is a method of inferring properties of the population from a sample under certain hypotheses. Hypothesis testing formulates the hypothesis to be tested based on the idea of proof by contradiction and calculates the probability that the hypothesis holds using appropriate statistical methods, applying the small-probability principle. The specific steps of hypothesis testing are as follows. First, establish the null hypothesis H 0 and the alternative hypothesis H 1 . The null hypothesis is usually set as the opposite of the conclusion the researcher wants to draw and is the hypothesis to be tested, while the alternative hypothesis is usually the conclusion the researcher wants to reach. Next, choose an appropriate method to calculate the test statistic. Then, based on the statistic, calculate the probability p that the null hypothesis is true. If p > α , the null hypothesis H 0 is not rejected; otherwise, H 0 is rejected and the alternative hypothesis H 1 is accepted, where α is called the significance level. Researchers usually set the significance level to 0.05 in a one-tailed hypothesis test.
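As an illustration, the steps above can be sketched with a one-tailed binomial test; the coin-flip scenario and the helper function below are illustrative, not part of the paper.

```python
from math import comb

def binomial_p_value(n, k, p0=0.5):
    """One-tailed p-value: P(X >= k) under H0: X ~ Binomial(n, p0)."""
    return sum(comb(n, i) * p0**i * (1 - p0)**(n - i) for i in range(k, n + 1))

# H0: the coin is fair (p = 0.5); H1: the coin is biased towards heads.
alpha = 0.05                    # significance level
p = binomial_p_value(100, 58)   # 58 heads observed in 100 flips
reject_h0 = p <= alpha          # here p > alpha, so H0 is not rejected
```

Even though 58 heads looks suspicious, the p-value exceeds 0.05, so under the procedure above the null hypothesis of a fair coin is retained.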
Hypothesis testing is a statistical judgment based on "small probability events". Since the sample is random and the chosen significance level α varies, the result of the test may differ from the real situation, so a hypothesis test may be incorrect. Errors in hypothesis testing are generally classified into two categories [14,15]. A Type I error [16] rejects the null hypothesis H 0 when H 0 is in fact true, i.e., the error of rejecting a true null hypothesis. A Type II error accepts the null hypothesis H 0 when H 0 is false. Ideally, the probability of both errors would be small, but for a fixed sample size it is impossible to reduce both at the same time: if the probability of one error decreases, the probability of the other increases. The only way to reduce both types of error is to increase the amount of data to be tested. Therefore, for a given amount of data, only one type of error can be controlled.

2.1.2. Multiple Hypothesis Testing

Hypothesis testing can solve the single hypothesis testing problem, but in the era of big data the amount of data involved is huge, and a single hypothesis test is no longer sufficient. Multiple hypothesis testing is therefore used to handle large-scale data [17,18]. Multiple hypothesis testing is an effective method for large-scale statistical inference problems. It treats all the individual hypothesis tests proposed on a sample as a whole, i.e., a test family, and tests every hypothesis in the family simultaneously. For example, n hypotheses H 1 , H 2 , …, H n can be proposed for a given sample, and each individual test may commit a Type I or a Type II error; the results over all n tests can be summarized as shown in Table 1.
As shown in Table 1, the results of the n-fold hypothesis test fall into four cases, denoted by U, V, T, and S, respectively, and R denotes the number of rejections of the null hypothesis H 0 . The number of correct rejections of H 0 is S, the number of correct acceptances of H 0 is U, the number of Type I errors (false positives) is V, and the number of Type II errors (false negatives) is T. As in single hypothesis testing, Type I (false positive) errors in multiple hypothesis testing can cause incalculable harm to everyday applications and subsequent scientific research, so this paper focuses on the false positive control problem in multiple hypothesis testing. In Table 1, the number of false positive errors committed in the n-fold hypothesis test is V; to reduce the harm caused by false positives, they must be controlled, i.e., V must be reduced.
In multiple hypothesis testing, controlling p ≤ α in each individual test as in a single hypothesis test, even though α is small, can lead to an overall significance level that is far too high, resulting in a large number of false positives. For example, if the significance level in an n-fold hypothesis test is α , then the expected number of false positives is n α ; if n is very large, n α also becomes very large, generating many false positives. Therefore, multiple hypothesis tests must be corrected to reduce the occurrence of false positives.
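Under the simplifying assumption of n independent tests, the inflation described above is easy to quantify; the numbers below are illustrative.

```python
def fwer(alpha, n):
    """Probability of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n

alpha, n = 0.05, 1000
uncorrected = fwer(alpha, n)      # close to 1: almost surely some false positive
expected_fp = n * alpha           # expected number of false positives: n*alpha = 50
bonferroni = fwer(alpha / n, n)   # Bonferroni-corrected FWER stays at or below alpha
```

With 1000 tests at α = 0.05, a false positive is virtually guaranteed without correction, which is exactly why the corrections discussed next are needed.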
The FWER (family-wise error rate) is the probability of making at least one false positive error in an n-fold hypothesis test. Controlling the FWER is the more commonly used approach for multiple hypothesis testing. Commonly used FWER correction methods are the Bonferroni correction [7], the step-down algorithm [9], and the step-up algorithm [10].
The FDR (false discovery rate) [19] is the expected proportion of false positives among the rejected null hypotheses. The FDR method relaxes the control of false positives compared to the above methods but can significantly improve statistical power. Commonly used FDR correction methods are the BH method [19], the ABH method [20], the TST method [21], etc.

2.1.3. False Positive

A false positive is reporting a result that does not in fact have the positive characteristic as a positive result. In statistics, it refers to a Type I error in hypothesis testing: the null hypothesis H 0 is actually correct, but after a series of calculations it is rejected, and the alternative hypothesis H 1 (the result expected by the researcher) is incorrectly accepted. When the alternative hypothesis H 1 is chosen as the conclusion, a positive result is obtained; if the null hypothesis H 0 is chosen, a negative result is obtained. A false positive is thus the incorrect acceptance of the alternative hypothesis H 1 , and the probability of making this type of error does not exceed α . As a simple example, a man goes to a hospital for a physical examination, and the doctor reads the report and congratulates the patient on being pregnant. The null hypothesis H 0 here is that the patient is not pregnant, and the alternative hypothesis H 1 is that the patient is pregnant. Although the null hypothesis is true (the patient is not pregnant), the report shows that the patient is pregnant, i.e., the false alternative hypothesis is accepted. This is clearly a false positive error. The example also shows that false positive errors in hypothesis testing cause incalculable damage to routine applications and subsequent scientific studies by reporting to the researcher a phenomenon that does not exist at all.

2.1.4. Calculation of p-Value

Parametric tests make assumptions about the parameters of a known population distribution, whereas nonparametric tests make no assumptions about the form of the population distribution. Since the population distribution is unknown in the efficient control of false positives on large datasets, nonparametric tests are used [22,23]. Commonly used methods are Barnard's exact test and Fisher's exact test, which are described separately below.
(1)
Fisher’s exact test
Fisher's exact test [24,25] is a method for analyzing the statistical significance of a contingency table. It is based on the hypergeometric distribution and is usually used to test the association between two categorical variables. Fisher's exact test can be used to verify whether the row variable is associated with the column variable in a 2 × 2 contingency table. The null hypothesis H 0 established by Fisher's exact test for the 2 × 2 contingency table is that there is no association between the row and column variables. A method is then needed to calculate the cumulative probability p, and the null hypothesis is rejected if p ≤ α , where each table probability p i follows the hypergeometric distribution shown in Equation (1).
$p_i = \binom{a+b}{a}\binom{c+d}{c} \Big/ \binom{n}{a+c} = \binom{a+b}{b}\binom{c+d}{d} \Big/ \binom{n}{b+d}$ (1)
One implementation of Fisher's exact test, the SF algorithm, can perform a one-sided or a two-sided test, and the one-sided test is further divided into a left-sided and a right-sided test. Using a 0 to denote the observed frequency in the current table, the left-sided p-value is given by Equation (2) and the right-sided p-value by Equation (3). The two-sided test sums the probabilities of all tables whose probability p i is no greater than p 0 , the probability of the observed table a = a 0 , as shown in Equation (4).
$p = \sum_{a \le a_0} p_i$ (2)
$p = \sum_{a \ge a_0} p_i$ (3)
$p = \sum_{p_i \le p_0} p_i$ (4)
The above formulas use the 2 × 2 contingency table shown in Table 2.
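A minimal sketch of the two-sided test described by Equations (1)–(4), summing the hypergeometric probabilities of all tables (with the same margins) no more likely than the observed one; the function name and the floating-point tolerance are our own.

```python
from math import comb

def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, col1 = a + b, a + c        # fixed row and column margins

    def table_prob(x):               # Equation (1): hypergeometric probability
        return comb(row1, x) * comb(n - row1, col1 - x) / comb(n, col1)

    p0 = table_prob(a)               # probability of the observed table
    lo = max(0, col1 - (n - row1))   # feasible range of the top-left cell
    hi = min(row1, col1)
    return sum(table_prob(x) for x in range(lo, hi + 1)
               if table_prob(x) <= p0 + 1e-12)

# Fisher's classic "lady tasting tea" table:
p = fisher_exact_p(3, 1, 1, 3)       # two-sided p = 34/70
```

Restricting the sum to `a <= a0` or `a >= a0` instead would yield the left-sided and right-sided p-values of Equations (2) and (3).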
(2)
Barnard’s exact test
Barnard's exact test is an unconditional test [26], implemented by treating the observed frequency of the hypothesis to be tested in the real dataset as a random variable. The unconditional test must therefore also take into account the frequency of the pattern, and the different scenarios that can occur in the real dataset, before assessing the association between the hypothesis and the label. Computing the p-value of the unconditional test requires exploring the space of possible values of a nuisance parameter that describes the unknown process generating the database. Barnard's exact test can also be used to analyze the 2 × 2 contingency table, again using Table 2. To calculate its p-values, we first introduce the nuisance parameter π ∈ (0, 1). Let x = a + c; the probability of a table is then given by Equation (5). For all y ∈ [0, n] and a fixed nuisance parameter π ∈ (0, 1), the Barnard exact test probability is given by Equation (6).
$p(x, a \mid \pi) = \binom{a+b}{a}\binom{c+d}{x-a}\,\pi^{x}(1-\pi)^{n-x}$ (5)
$p(y, \varepsilon, \pi) = \sum_{(x,a)\,:\,p(x,a \mid \pi) \le p(y,\varepsilon \mid \pi)} p(x,a \mid \pi)$ (6)
When calculating the actual p-value, Barnard's exact test must eliminate the dependence on the nuisance parameter π , which requires a large computational effort.
Comparing Fisher's exact test and Barnard's exact test, the two nonparametric methods for calculating p-values from a 2 × 2 contingency table, Barnard's exact test needs an unknown nuisance parameter in its calculation and is therefore more complicated than Fisher's exact test, while the difference in accuracy between the two is not significant. This paper therefore uses Fisher's exact test for subsequent p-value calculations.

2.2. False Positive Control-Related Methods

False positive control methods for multiple hypothesis testing can be broadly classified into two categories: FWER control methods and FDR control methods. FWER control is more stringent than FDR control, while FDR control achieves higher statistical power. Therefore, for multiple testing problems that require strict control of the number of false positives, an FWER control method is required; for multiple testing problems in exploratory studies, an FDR control method is preferred. Analyzing the problem further from the perspective of hypothesis testing, this paper uses two class labels W 1 and W 0 to denote the "range" of the parameters, since a "hypothesis" is a virtual determination of the range to which the real parameters belong. The null hypothesis H 0 can then be regarded as the real parameters belonging to label W 1 , and the alternative hypothesis H 1 as the real parameters belonging to label W 0 . In this paper, a transaction dataset serves as the real parameter, so the null hypothesis H 0 becomes that transaction T i belongs to label W 1 . Let S i be the set of items contained in a transaction T i ; if a transaction T i contains the item set S i and the label of that transaction is W 1 , then we can define a rule L: S i → W 1 , which turns this into a false positive control problem for multiple hypothesis testing in association rule mining. This section briefly describes three methods for correcting multiple hypothesis testing in association rule mining: the direct adjustment method, the permutation-based approach, and the holdout evaluation method.
  • Direct adjustment method: The direct adjustment method controls false positives directly using an FWER or FDR algorithm. A common direct adjustment method for FWER is the Bonferroni correction [27,28], which calculates each hypothesis's p-value and considers it significant if the p-value is not greater than α / n . A common direct adjustment method for FDR is the BH procedure [19], where the p-values are sorted in ascending order p 1 ≤ … ≤ p n ; starting from i = n and moving downward, the first i for which p i ≤ i α / n holds makes H 1 , …, H i statistically significant.
  • Permutation-based approach: The permutation-based approach [29] randomly shuffles the class labels, recombines them with the transactions, and recalculates the p-values [30,31]. Since the individual hypothesis tests are dependent on each other, the random permutation is used to break the association between the transactions and the class labels. The distribution of the recalculated p-values is, therefore, an approximation of the null distribution, which allows a more precise determination of the truncation threshold (corrected significance threshold) of the p-values.
To keep the FWER under the α level, a set of n label permutations is randomly generated to break the association between transactions and class labels. A truncated p-value (significance threshold) is eventually found such that the probability of at least one false positive error is no greater than α . To find the truncated p-value, the smallest p-value obtained in each permutation is collected, these minimum p-values are ranked from lowest to highest, and the ⌊ α n ⌋-th value among them is used as the truncation threshold.
To control the FDR at the α level, n label permutations are randomly generated and each p-value is adjusted as follows. Let p 1 , p 2 , …, p n be the p-values calculated from the hypotheses to be tested after permuting the labels. The method proposed by Benjamini and Hochberg is then applied to these adjusted p-values until the truncated p-value is found.
  • Holdout evaluation method: The holdout evaluation method [32] divides the dataset into two parts, an exploration dataset and an evaluation dataset. The hypotheses to be tested are first identified on the exploration dataset, and the hypotheses with p-values no greater than α are then passed to the evaluation dataset for validation. To control the FWER at the α level, the Bonferroni correction [27,28] can be used to adjust the p-values of the hypotheses tested on the evaluation dataset. To control the FDR at the α level, the method proposed by Benjamini and Hochberg can be used in a similar way.
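The two direct adjustment procedures above can be sketched as follows; this is a minimal illustration, and the function names are our own.

```python
def bonferroni(pvals, alpha):
    """Reject H_i when p_i <= alpha / n."""
    n = len(pvals)
    return [p <= alpha / n for p in pvals]

def benjamini_hochberg(pvals, alpha):
    """BH step-up: sort p-values ascending, find the largest rank i with
    p_(i) <= i * alpha / n, and reject the i hypotheses with smallest p-values."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank * alpha / n:
            k = rank
    reject = [False] * n
    for idx in order[:k]:
        reject[idx] = True
    return reject

pvals = [0.01, 0.02, 0.03, 0.50]
bonf = bonferroni(pvals, 0.05)          # only p <= 0.05/4 = 0.0125 survives
bh = benjamini_hochberg(pvals, 0.05)    # BH rejects the three smallest
```

On the same p-values, Bonferroni rejects one hypothesis while BH rejects three, reflecting the stringency difference between FWER and FDR control discussed above.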
The permutation-based approach preserves the dependencies between hypotheses and finds corrected significance thresholds more accurately than the direct adjustment approach, but it requires significant computational overhead. The holdout evaluation method is less computationally expensive than the permutation-based method, but its performance may be affected by the data partitioning, which can cause some hypotheses simply never to be found. Weighing the advantages and disadvantages of these false positive control methods, and following Liu's research [33], this paper uses the permutation-based method for FWER false positive control.
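The permutation-based FWER threshold search can be sketched as below; `min_pvalue_fn` is a hypothetical helper that returns the minimum p-value over all tested patterns for a given label assignment, and the quantile convention follows the ⌊ α n ⌋-th smallest minimum p-value described above.

```python
import random

def permutation_fwer_threshold(min_pvalue_fn, labels, n_perm, alpha, seed=0):
    """Estimate the FWER-corrected significance threshold by label permutation."""
    rng = random.Random(seed)
    min_pvals = []
    for _ in range(n_perm):
        permuted = labels[:]
        rng.shuffle(permuted)            # break the pattern-label association
        min_pvals.append(min_pvalue_fn(permuted))
    min_pvals.sort()
    k = max(0, int(alpha * n_perm) - 1)  # index of the floor(alpha*n)-th smallest
    return min_pvals[k]
```

Patterns whose p-value falls below the returned threshold are reported as significant, with the probability of at least one false positive kept at or below α; the cost is one full p-value computation per permutation, which is exactly the overhead the distributed algorithm in Section 3 targets.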

2.3. Pattern Mining-Related Techniques

Frequent pattern mining [34] is one of the most widely studied problems in data mining. Whereas deep learning gradually transforms an initial "low-level" feature representation into a "high-level" one through multi-layer processing by simulated neural networks [35,36] and completes complex classification and other learning tasks with a "simple model", frequent pattern mining is a key step of association rule mining in data mining. Frequent patterns generally refer to sets of items that occur with high frequency in a dataset. For example, items that frequently appear together in a shopping basket dataset (e.g., toothbrush and toothpaste) form a frequent itemset, and a sequence in the shopping basket database (e.g., first buy flour, then eggs, then a basin) is called a frequent sequence if it appears frequently in the shopping data. Commonly used frequent pattern mining algorithms include Apriori, FP-Growth, and others.
The Apriori algorithm [37] is a commonly used pattern mining algorithm that exploits prior knowledge. Its core idea is that if an itemset is frequent, then all of its subsets are also frequent; i.e., if {toothbrush, toothpaste} is frequent, then {toothbrush} and {toothpaste} must also be frequent, and if {insoles} is not frequent, then its superset {shoes, insoles} cannot be frequent either.
The Apriori algorithm computes iteratively, using the frequent k-itemsets to generate candidate (k + 1)-itemsets. Its specific steps are: traverse the dataset, count each item, and determine each item's support. The set of all items that satisfy the minimum support forms the frequent 1-itemsets. The frequent 1-itemsets are then used to find the frequent 2-itemsets, and so on, until all frequent patterns are found.
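The iteration above can be sketched as a minimal Apriori implementation (illustrative, not the paper's code):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return all frequent itemsets with their support counts."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # frequent 1-itemsets
    items = {i for t in transactions for i in t}
    current = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    frequent = {s: support(s) for s in current}
    k = 2
    while current:
        # join step plus Apriori pruning: every (k-1)-subset must be frequent
        candidates = set()
        for a in current:
            for b in current:
                u = a | b
                if len(u) == k and all(frozenset(sub) in frequent
                                       for sub in combinations(u, k - 1)):
                    candidates.add(u)
        current = [c for c in candidates if support(c) >= min_support]
        frequent.update({c: support(c) for c in current})
        k += 1
    return frequent

transactions = [['a', 'b'], ['b', 'c'], ['a', 'b', 'c'], ['a', 'b']]
frequent = apriori(transactions, min_support=2)
```

On this toy dataset, {a, c} is pruned because its support is 1, so {a, b, c} is never even generated as a candidate, illustrating the subset property described above.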
The FP-Growth algorithm [38] is a frequent itemset mining method proposed by Jiawei Han. It stores the items of the dataset, sorted by support, in an FP-Tree, records the support count at each node, and mines frequent itemsets from the FP-Tree.
The FP-Growth algorithm is implemented in the following steps. First, the dataset is scanned to prepare the construction of the item header table: items with support no less than the minimum support threshold are collected and arranged in descending order of support. Second, the dataset is scanned again to create the item header table and the FP tree in descending order of support. After the header table and the FP tree are created, the pattern mining operation is performed: for each item in the header table, the tree structure is recursively traversed to construct that item's conditional pattern base, which is the set of prefix paths from the root node to the nodes holding that item. If the resulting conditional tree is a single path, the recursion ends by enumerating all combinations on the path; otherwise, the conditional tree construction continues to be invoked recursively until a single path is formed.
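A compact sketch of the FP tree construction and the recursive conditional pattern base mining described above; the class and function names are our own, and for brevity the sketch always recurses on the conditional tree rather than special-casing single paths.

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_tree(transactions, min_support):
    """Two passes: count supports, then insert items in descending support order."""
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    freq = {i: c for i, c in counts.items() if c >= min_support}
    root, header = FPNode(None, None), defaultdict(list)  # header table: item -> nodes
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i)):
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += 1
    return root, header, freq

def mine(header, freq, min_support, suffix=()):
    """Recursively mine frequent patterns from conditional pattern bases."""
    patterns = {}
    for item in sorted(freq, key=lambda i: freq[i]):   # least frequent first
        patterns[tuple(sorted(suffix + (item,)))] = sum(n.count for n in header[item])
        # conditional pattern base: prefix paths of every node holding `item`
        cond = []
        for node in header[item]:
            path, p = [], node.parent
            while p.item is not None:
                path.append(p.item)
                p = p.parent
            cond.extend([path] * node.count)
        _, cond_header, cond_freq = build_tree(cond, min_support)
        if cond_freq:
            patterns.update(mine(cond_header, cond_freq, min_support, suffix + (item,)))
    return patterns

transactions = [['a', 'b'], ['b', 'c'], ['a', 'b', 'c'], ['a', 'b']]
root, header, freq = build_tree(transactions, min_support=2)
patterns = mine(header, freq, min_support=2)
```

Unlike Apriori, the dataset is scanned only twice; all further work happens on the (usually much smaller) tree, which is also the structure the distributed algorithm in Section 3 decomposes into subtrees.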

2.4. Distributed Computing Frameworks

  • Hadoop framework: Hadoop [21] is a distributed infrastructure framework developed by the Apache Foundation. It is mainly used to solve the problems of massive data storage and analysis, and it can be applied to logistics warehouses, the retail industry, recommendation systems, the insurance and finance industries, and the artificial intelligence industry. Hadoop is suitable for processing large-scale data and can handle millions of records or more [39,40]. Hadoop uses HDFS for distributed file management, which automatically saves multiple copies of the data and can recover data from backups on other nodes in case of power failures or program bugs, thus increasing the system's fault tolerance.
The core components of Hadoop 2.x are HDFS, Yarn, and MapReduce. HDFS is a distributed file system used to manage and store data.
The MapReduce framework is a computing model that works on top of Hadoop. It automatically divides computational data and computational tasks, automatically assigns tasks and computes them on each node of the cluster, and finally aggregates the results of the computation on each node. In the Reduce phase, each Reduce task obtains the results of the computation on each machine performing the Map task according to its own partition number and merges them.
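The Map, shuffle, and Reduce phases described above can be illustrated with a single-process word count; a real MapReduce job distributes the map and reduce tasks across cluster nodes and partitions the keys among reducers.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map task: emit a (word, 1) pair for every word in its input split."""
    return [(word, 1) for word in document.split()]

def shuffle(mapped):
    """Group intermediate pairs by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapped):
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce task: aggregate the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}

splits = ["big data mining", "big data engineering"]   # one split per map task
counts = reduce_phase(shuffle(map(map_phase, splits)))
```

Each map task sees only its own split, and each reduce key is processed independently, which is what lets the framework scale both phases horizontally.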
  • Spark framework: Spark is an in-memory big data processing engine [41]. Spark makes up for the shortcomings of the Hadoop 1.x framework, which is poorly suited to iterative computing, has very slow performance, and is highly coupled. Spark supports multiple programming languages, so big data developers can choose the most suitable language according to the usage scenario and their own coding habits. Spark can be installed and used on laptops as well as on large server clusters; it is convenient for beginners to learn with and can also process large-scale data in actual production applications [42]. Spark supports SQL, stream processing, and machine learning tasks.
Spark is a unified platform for writing big data applications, with a unified API that makes applications easier and more efficient to write. Spark does not provide persistent data storage, so it must be used together with distributed storage systems such as HDFS or message queues. Spark is more powerful than earlier big data processing engines: it ships with software libraries for processing structured data and running machine learning algorithms, and it also supports library packages provided by the open-source community.
A Spark application consists of a driver process and a set of executor processes. The driver process runs on the master node of the cluster; its role is mainly to maintain Spark-related information, handle user input and output, and distribute tasks. The executor processes carry out the specific tasks assigned to them, perform the actual computation, and report their status to the master node once the work is complete.

3. PFWER-Based Distributed False Positive Control Algorithm

The FWER control method is appropriate for multiple hypothesis testing problems that require strict control of false positive errors. In this paper, a transactional dataset with binary labels is selected as the computational vehicle for the distributed false positive control algorithm. Because there is a certain degree of dependence among the hypotheses in a transactional dataset, the computed p-values are also dependent, so this section uses the Westfall–Young Light algorithm [5], based on the Westfall and Young [30,43] permutation procedure, for the computation. This algorithm controls the FWER at the $\alpha$ level, but its implementation involves a large number of resampling and permutation operations and is therefore very slow. The main objective of this section is thus to improve the computational speed and accuracy of the false positive control algorithm on large-scale data using a distributed strategy.

3.1. Problem Definition

Definition 1.
Let $l_0, l_1$ be two class labels. The transaction dataset is $D = \{(T_1, l_1), (T_2, l_2), \ldots, (T_n, l_n)\}$, where each transaction $T_i$ is a set of items, i.e., $T_i = \{t_1, t_2, \ldots, t_k\}$. Each transaction $T_i$ in the transaction dataset carries a binary class label $l_i \in \{l_0, l_1\}$.
Definition 2.
Let the pattern $S$ be a set of items, i.e., $S = \{t_1, t_2, \ldots, t_i\}$, $t_i \in \{1, \ldots, m\}$. Let $\sigma(S)$ denote the number of transactions in $D$ containing pattern $S$, $\sigma_1(S)$ the number of transactions in $D$ with label $l_1$ containing pattern $S$, and $\sigma_0(S)$ the number of transactions in $D$ with label $l_0$ containing pattern $S$. Based on the above two definitions, a $2 \times 2$ contingency table can be constructed, as shown in Table 3.
Definition 3.
The null hypothesis $H_0$ is that pattern $S$ is not significantly associated with the label $l_i$. Let $\delta$ be the corrected significance level; the null hypothesis is rejected, and pattern $S$ is considered significantly associated with the label $l_i$, if and only if its p-value $\leq \delta$.
Definition 4.
A false positive is the probability of finding an incorrect association (Type I error) [5].
Section 2.1.4 showed that the p-value calculation method used in this paper is Fisher's exact test. Fisher's exact test treats the margins $n, n_1, \sigma(S)$ of the $2 \times 2$ contingency table as fixed. Thus, under the null hypothesis that pattern $S$ and the labels $l_i$ are independent of each other, $\sigma_1(S)$ follows the hypergeometric distribution shown in Equation (7).
$$p_F(\sigma_1(S) = a \mid \sigma(S), n_1, n) = \frac{\binom{n_1}{a}\binom{n-n_1}{\sigma(S)-a}}{\binom{n}{\sigma(S)}}$$
Let $b$ be the observed value of $\sigma_1(S)$ in the contingency table of $S$. The p-value obtained using Fisher's exact test, shown in Equation (8), is the cumulative sum of the probabilities of all values $\sigma_1(S) = a$ that are no more likely than the observed value $b$.
$$p_S^F(b) = \sum_{a \,:\, p_F(a \mid \sigma(S), n_1, n) \,\leq\, p_F(b \mid \sigma(S), n_1, n)} p_F(a \mid \sigma(S), n_1, n)$$
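A small Python sketch of Equations (7) and (8), using `math.comb` for the binomial coefficients; the small tolerance in the comparison is an implementation detail added here to guard against floating-point ties, not part of the test's definition.

```python
from math import comb

def hypergeom_pmf(a, sigma_s, n1, n):
    # Equation (7): P(sigma_1(S) = a) under the null hypothesis,
    # with the table margins n, n1, sigma(S) fixed.
    return comb(n1, a) * comb(n - n1, sigma_s - a) / comb(n, sigma_s)

def fisher_p_value(b, sigma_s, n1, n):
    # Equation (8): sum the probabilities of all outcomes `a` that are
    # no more likely than the observed value `b`.
    lo = max(0, sigma_s - (n - n1))   # smallest feasible sigma_1(S)
    hi = min(n1, sigma_s)             # largest feasible sigma_1(S)
    p_b = hypergeom_pmf(b, sigma_s, n1, n)
    return sum(hypergeom_pmf(a, sigma_s, n1, n)
               for a in range(lo, hi + 1)
               if hypergeom_pmf(a, sigma_s, n1, n) <= p_b + 1e-12)
```

For example, with $n = 10$, $n_1 = 5$, $\sigma(S) = 4$, and observed $b = 4$, only $a = 0$ and $a = 4$ are as unlikely as the observation, giving a p-value of $10/210 \approx 0.048$.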

3.2. Overall Framework of the Algorithm

The general framework of the distributed PFWER false positive control algorithm proposed in this paper is shown in Figure 1.
Since the null hypothesis $H_0$ proposed in this paper is that pattern $S$ is not significantly associated with label $l_i$, more than one pattern $S$ can be mined from the transactional dataset $D$. There are dependencies among the different patterns $S$, and hence among the p-values computed from the labels $l_i$, so PFWER false positive control is performed using the permutation method proposed by Westfall and Young [30,43]. The permutation-based method is very computationally intensive, so the Spark framework is used for parallel computing to improve the overall computational rate. The algorithm proposed in this section can be broadly divided into the following three stages.
  • Label permutation. According to the permutation method proposed by Westfall and Young [30,43], to calculate the truncated p-value (the corrected significance level $\delta$) accurately, a permutation operation must be performed on the labels $l_i$ (generally $j_r = 10^3$–$10^4$ permutations) in order to break the association between pattern $S$ and label $l_i$.
  • Finding the hypotheses to be tested in multiple hypothesis testing. Since each null hypothesis is composed of two key elements, a pattern $S$ and a label $l_i$, the main task of the second stage of the algorithm is to find all patterns $S$ and their corresponding labels $l_i$ in the transactional dataset $D$.
  • False positive correction. After finding the hypotheses to be tested and permuting the labels, the p-value of each hypothesis is calculated by Fisher's exact test. False positive correction is then performed according to the Westfall and Young [30,43] permutation method, which controls the FWER at the $\alpha$ level.

3.3. Index-Tree Algorithm

The hypothesis determination process mines a large number of redundant patterns, which slows down the computation. To solve this problem, this paper proposes an Index-Tree algorithm, which uses a pruning strategy to reduce the construction of conditional trees and, thus, the number of patterns computed. It also adopts an index optimization strategy to avoid the overhead of repeatedly traversing the dataset, further reducing the computation of redundant patterns and speeding up the overall false positive control.

3.3.1. Pattern Mining

The main purpose of pattern mining in this paper is to find all hypotheses. A hypothesis is composed of two key elements, a pattern $S$ and a label $l_i$, so in the hypothesis determination phase all patterns $S$ must first be mined. Each hypothesis is then determined by traversing the dataset to find the labels of the transactions containing the corresponding pattern.
As shown in Figure 2, this paper uses the FP-Growth algorithm for pattern mining. However, because the goal is to control false positives in multiple hypothesis testing, a p-value must be computed for every pattern $S$, which means the minimum support count in the FP-Growth algorithm must be set to 1. This makes plain FP-Growth very inefficient for pattern mining here. Moreover, pattern mining is only one step of the overall computation, to be followed by the PFWER false positive control calculation. It is therefore necessary to improve the FP-Growth algorithm, without changing the result of the PFWER false positive control, in order to reduce memory overhead and improve computational efficiency. To this end, a pruning operation and an index optimization operation are adopted to reduce redundant patterns and improve computational efficiency.

3.3.2. Pruning Operation

This section focuses on controlling false positive errors in multiple hypothesis testing using the PFWER control method. The FWER (family-wise error rate) is the probability of making at least one false positive error, and keeping this probability as small as possible means ensuring $FWER(\delta) \leq \alpha$. Reducing the significance level of $p_S^F(b)$ from the original $\alpha$ to $\delta$ guarantees $FWER(\delta) \leq \alpha$, so the problem becomes one of computing the significance threshold $\delta = \max\{\delta \mid FWER(\delta) \leq \alpha\}$. The Westfall–Young Light algorithm [5] performs $j_r = 10^3$–$10^4$ permutations of the labels $l_i$ to break the association between labels and patterns, and determines whether a false positive error has occurred in a permutation by checking whether $p_{min} \leq \delta$, where $p_{min} = \min_S p_S^F(b)$. The family-wise error rate is then estimated as shown in Equation (9).
$$FWER(\delta) = \frac{1}{j_r} \sum_{i=1}^{j_r} \mathbb{1}\left[p_{min}^{(i)} \leq \delta\right]$$
where $\mathbb{1}[p_{min}^{(i)} \leq \delta]$ equals 1 if $p_{min}^{(i)} \leq \delta$ holds and 0 otherwise. The final $\delta$ to be found is the $\alpha$-quantile of $\{p_{min}^{(i)}\}_{i=1}^{j_r}$.
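The two quantities above can be sketched in a few lines of Python. This is a minimal sketch: taking the $\alpha$-quantile as the $\lfloor \alpha j_r \rfloor$-th order statistic of the minimum p-values is an implementation choice assumed here, and the toy `p_mins` list is illustrative.

```python
def fwer(delta, p_mins):
    # Equation (9): fraction of permutations whose minimum p-value
    # falls at or below the candidate threshold delta.
    return sum(p <= delta for p in p_mins) / len(p_mins)

def corrected_threshold(p_mins, alpha):
    # delta = max{delta | FWER(delta) <= alpha}: the alpha-quantile of
    # the permutation distribution of minimum p-values.
    srt = sorted(p_mins)
    k = int(alpha * len(srt))      # permutations allowed to fall below delta
    return srt[k - 1] if k >= 1 else 0.0

# Toy permutation distribution of minimum p-values (j_r = 10).
p_mins = [0.001, 0.004, 0.010, 0.020, 0.500,
          0.700, 0.800, 0.900, 0.950, 0.990]
delta = corrected_threshold(p_mins, alpha=0.2)
```

Here `delta` comes out as 0.004: exactly 2 of the 10 permutation minima fall at or below it, so the estimated $FWER(\delta) = 0.2 \leq \alpha$.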
Theorem 1.
If $S_1 \subseteq S_2$ and $\sigma(S_1) = \sigma(S_2)$, then $\sigma_1(S_1) = \sigma_1(S_2)$ and $\sigma_0(S_1) = \sigma_0(S_2)$, and the same equalities hold for each permuted labelling.
Theorem 2.
If $S_1 \subseteq S_2$ and $\sigma(S_1) = \sigma(S_2)$, then $p_{S_1}^F(b) = p_{S_2}^F(b)$.
Proof. 
Since $\sigma(S_1) = \sigma(S_2)$ and the margins $n, n_1, \sigma(S)$ of the $2 \times 2$ contingency table are fixed, the three values $n$, $n_1$, and $\sigma(S)$ are equal for $S_1$ and $S_2$. Equations (10) and (11) follow from Equation (7), and Equation (12) follows directly from Equations (10) and (11). Substituting Equation (12) into the Fisher exact test formula yields $p_{S_1}^F(b) = p_{S_2}^F(b)$.
$$p_F(\sigma_1(S_1) = a \mid \sigma(S_1), n_1, n) = \frac{\binom{n_1}{a}\binom{n-n_1}{\sigma(S_1)-a}}{\binom{n}{\sigma(S_1)}}$$
$$p_F(\sigma_1(S_2) = a \mid \sigma(S_2), n_1, n) = \frac{\binom{n_1}{a}\binom{n-n_1}{\sigma(S_2)-a}}{\binom{n}{\sigma(S_2)}}$$
$$p_F(\sigma_1(S_1) = a \mid \sigma(S_1), n_1, n) = p_F(\sigma_1(S_2) = a \mid \sigma(S_2), n_1, n)$$
   □
Theorem 3.
If $S_1 \subseteq S_2$ and $\sigma(S_1) = \sigma(S_2)$, then only the p-value of pattern $S_1$ needs to be computed.
Proof. 
According to Equation (9), the final estimate of $FWER(\delta)$ depends on the $p_{min}^{(i)}$ after each permutation, with $p_{min} = \min_S p_S^F(b)$. By Theorem 2, if $S_1 \subseteq S_2$ and $\sigma(S_1) = \sigma(S_2)$, then $p_{S_1}^F(b) = p_{S_2}^F(b)$. If this common value is the minimum p-value in a permutation, then $p_{min}$ is the same whether it picks $p_{S_1}^F(b)$ or $p_{S_2}^F(b)$, so it is sufficient to compute only the p-value of pattern $S_1$, without computing the p-value of pattern $S_2$. If it is not the minimum p-value in the permutation, then since $p_{S_1}^F(b) = p_{S_2}^F(b)$, comparing $p_{min}$ against $p_{S_1}^F(b)$ gives the same result as comparing it against $p_{S_2}^F(b)$, so again it is sufficient to perform the calculation only once.    □
Theorem 4.
In the FP-Tree, suppose $\sigma(I_1) = \sigma(I_2)$ and $I_1.next = I_2$ in the item header table; for all $I_1.link.next$ and $I_2.link.next$, $\sigma(I_1.link.next) = \sigma(I_2.link.next)$; and in the FP-Tree, $I_1.link.next.child = I_2.link.next$ and $I_2.link.next.parent = I_1.link.next$. Then, letting $S_1 = S \cup \{I_1\}$ and $S_2 = S \cup \{I_1, I_2\}$, we have $S_1 \subseteq S_2$ and $\sigma(S_1) = \sigma(S_2)$.
Proof. 
Consider the dataset $\{\{I_2, I_5\}{:}1,\ \{I_1, I_3\}{:}2,\ \{I_1, I_2, I_3\}{:}1,\ \{I_1, I_2, I_3, I_5\}{:}1,\ \{I_1, I_2, I_3, I_4\}{:}2,\ \{I_2\}{:}4,\ \{I_1, I_3, I_4\}{:}2\}$; the FP-Tree constructed from it is shown in Figure 3. Here $\sigma(I_1) = \sigma(I_3)$, $I_3$ immediately follows $I_1$ in the item header table, and the support counts $\sigma(I.link.next)$ along the node links of $I_1$ and $I_3$ agree. In the FP-Tree, the parent of every $I_3$ node is an $I_1$ node and the child of every $I_1$ node is an $I_3$ node. Clearly, $\sigma(\{I_1\}) = \sigma(\{I_1, I_3\})$. Let $S_1 = S \cup \{I_1\}$ and $S_2 = S \cup \{I_1, I_3\}$; then $S_1 \subseteq S_2$, $\sigma(S_1) = \sigma(S \cup \{I_1\})$ and $\sigma(S_2) = \sigma(S \cup \{I_1, I_3\})$, so $\sigma(S_1) = \sigma(S_2)$.    □
Nodes $I_1$ and $I_3$ that satisfy the condition of Theorem 4 in the FP-Tree can be combined into one node $I_1$; that is, patterns $S_1$ and $S_2$ can be merged into one pattern. Then, by Theorems 1–3, only the p-value of pattern $S_1$ needs to be calculated, which reduces the amount of computation in memory and speeds up the single-machine algorithm.

3.3.3. Index Optimization

From the $2 \times 2$ contingency table, after mining a pattern $S$ we must find its support count $\sigma_1(S)$ over the transactions with $S \subseteq T_i$ and $l_i = l_1$, and this requires traversing the whole dataset once. Since the Westfall–Young Light algorithm [5] starts from a minimum support count of 1, the number of patterns to be mined is very large, and traversing the dataset once per mined pattern to find its $\sigma_1(S)$ would be too expensive. Instead, an index, namely the position of each transaction $T_i$, can be added during pattern mining to speed up the query, so that counting the transactions with $l_i = l_1$ takes only linear time. The transaction dataset $D$ with the index added is shown in Table 4.
The FP-Tree with index structure is constructed from the above dataset, as shown in Figure 4. The conditional pattern bases are then built on the indexed FP-Tree, in order of increasing support count: $I_5: \langle\{I_2, I_1\}: 8\rangle, \langle\{I_2\}: 0\rangle$; $I_4: \langle\{I_2, I_1\}: 3, 5\rangle, \langle\{I_1\}: 9, 12\rangle$; $I_1: \langle\{I_2\}: 2, 3, 5, 8\rangle$. Next, the indexed conditional FP-Tree is constructed from the indexed conditional pattern bases, and the patterns with index structure $S^I = (t_i, \{TID_i\})$ are mined from it.
The null hypothesis $H_0$ proposed in this paper is that pattern $S$ is not significantly associated with the label $l_i$, and the parameters to be tested can be written as $\theta = \{(S, l_i) \mid S \subseteq T_j,\ j = 1, \ldots, n,\ i = 0, 1\}$. According to Table 3 and Equation (9), for the selected dataset $D$ and the null hypothesis $H_0$, the key variables for false positive control are $n$, $n_1$, $\sigma(S)$, and the $\sigma_1(S)$ obtained for $l_i = l_1$ after label permutation. Here $n$ and $n_1$ are fixed once the sample dataset is selected, while $\sigma(S)$ and $\sigma_1(S)$ are the support counts over the transactions with $S \subseteq T_i$, and with $S \subseteq T_i$ and $l_i = l_1$, respectively. From the structure of dataset $D$, once the set of transactions $T_i$ containing pattern $S$ is known, the corresponding set of labels $l_i$ can be found: the support count $\sigma(S)$ is the size of that transaction set, and $\sigma_1(S)$ follows from the correspondence between transactions and labels. Therefore, the PFWER false positive control calculation does not need to know the specific pattern $S$; it only needs to know which sets of transactions support some pattern $S$. Finding these transaction sets matters more for the subsequent computation, so it is clearly advantageous to use a vertical data format for data mining.
The transactional dataset of Table 4 is converted into the vertical data format shown in Table 5. Mining then finds the patterns to be computed by intersecting the index sets of the items in each itemset. For example, the index set of the pattern $\{I_1, I_2\}$ is $TID(\{I_1, I_2\}) = TID(I_1) \cap TID(I_2) = \{2, 3, 5, 8\}$.
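The vertical-format intersection can be sketched directly in Python. The TID sets below are illustrative (chosen to be consistent with the example above), not a transcription of Table 5.

```python
# Vertical data format: each item maps to the set of transaction indices
# (TIDs) that contain it; pattern supports come from set intersections.
tid = {
    "I1": {2, 3, 5, 8, 9, 12},
    "I2": {0, 2, 3, 5, 8},
    "I4": {3, 5, 9, 12},
}

def tid_of(pattern):
    # TID(S) = intersection of the TID sets of the items in S;
    # sigma(S) is then simply len(tid_of(S)).
    items = iter(pattern)
    acc = set(tid[next(items)])
    for it in items:
        acc &= tid[it]
    return acc
```

With these sets, `tid_of(["I1", "I2"])` reproduces the index set $\{2, 3, 5, 8\}$ from the example, giving $\sigma(\{I_1, I_2\}) = 4$.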
Theorem 5.
If $TID(S_1) = TID(S_2)$, then $p_{S_1}^F(b) = p_{S_2}^F(b)$.
Proof. 
If $TID(S_1) = TID(S_2)$, then patterns $S_1$ and $S_2$ are contained in exactly the same transactions, so $|TID(S_1)| = |TID(S_2)|$ and hence $\sigma(S_1) = \sigma(S_2)$. Moreover, since labels correspond one-to-one with transactions, pattern $S_1$ belongs to the same transaction set as pattern $S_2$ even after the $j_r$ permutations, so for each permutation $\sigma_1(S_1) = \sigma_1(S_2)$. The total number of transactions $n$ and the number of labels with $l_i = l_1$, namely $n_1$, are fixed for a given dataset; substituting $\sigma(S)$, $\sigma_1(S)$, $n$, and $n_1$ into Equations (7) and (8) gives $p_{S_1}^F(b) = p_{S_2}^F(b)$.    □
Substituting into Equation (9) (the FWER false positive control formula), $p_{S_1}^F(b)$ and $p_{S_2}^F(b)$ have the same effect on the estimate; that is, different patterns with the same index set contribute identically to Equation (9), so the p-value calculation needs to be performed only once.
Based on the above problem analysis, it is clear that mining the set of transactions containing pattern S is more useful for the subsequent computation than mining all patterns in the dataset and then computing the corresponding dataset. Inspired by the vertical data format, the index tree is pruned again according to Theorem 5 to reduce the computation of invalid patterns generated in the data mining process.
According to the item header table in Figure 4, the conditional pattern base of $I_4$ is $\langle\{I_2, I_1\}: 3, 5\rangle, \langle\{I_1\}: 9, 12\rangle$, and the conditional tree constructed from it is shown in Figure 5. The FP-Growth algorithm mines the single-path conditional tree of $I_4$ by enumerating all combinations of the nodes on the path and joining each combination with $I_4$ to form a pattern. From the conditional tree of $I_4$ we would obtain $S_1 = (\{I_1, I_4\}, \{3, 5, 9, 12\})$, $S_2 = (\{I_2, I_4\}, \{3, 5\})$, and $S_3 = (\{I_1, I_2, I_4\}, \{3, 5\})$, but patterns $S_2$ and $S_3$ are exactly equivalent for the PFWER false positive control calculation, so there is no need to repeat the calculation; it suffices to know the index set of each node and substitute it into the FWER control formula. When a single-path conditional tree contains many nodes, this avoids a large amount of additional computational overhead.
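The de-duplication that Theorem 5 licenses can be sketched as follows; the three patterns are the ones from the $I_4$ example above, with each pattern paired with its index set.

```python
# Patterns mined from the I4 conditional tree: S2 and S3 share the
# index set {3, 5}, so by Theorem 5 only one p-value is needed for them.
patterns = [
    (("I1", "I4"), frozenset({3, 5, 9, 12})),
    (("I2", "I4"), frozenset({3, 5})),
    (("I1", "I2", "I4"), frozenset({3, 5})),
]

def dedup_by_index_set(patterns):
    # Keep one representative pattern per distinct TID (index) set.
    seen = {}
    for pat, tids in patterns:
        seen.setdefault(tids, pat)
    return seen

unique = dedup_by_index_set(patterns)
```

Only two distinct index sets survive, so only two p-value calculations enter the FWER control formula instead of three.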
The purpose of Algorithm 1 is to mine the index sets of the patterns, preparing the input for the subsequent PFWER false positive control. The first line of the algorithm constructs the set of frequent 1-items and calculates their support counts. The second line constructs the index tree. The third line calls Algorithm 2 to prune the index tree. If a conditional tree contains only a single path, the index sets of the nodes on that path are output; otherwise, the conditional tree for the pattern $b = \beta \cup \{a_i\}$ is constructed in lines ten to thirteen, and if that conditional tree is not empty, the algorithm is called recursively for mining. Finally, all index sets of the patterns are obtained.
The first line of Algorithm 2 iterates through the nodes in the item header table, and lines two to five determine whether two adjacent entries with the same support count in the item header table should be merged. If the corresponding nodes in the FP tree satisfy the pruning condition of Section 3.3.2, they are merged in line six, the item header table is updated, and the pruned index tree is returned.
Algorithm 1 Index-Tree
Require: $D = \{(T_i, l_i)\}$
Ensure: $In = \{TID_i\}$
 1: create item_1, σ ← size(index)
 2: IFP_Tree ← createTree(item_1, D)
 3: tree ← IPFP_Tree(IFP_Tree)
 4: IFP_Growth(tree, β)
 5: if tree contains a single path then
 6:    for node ∈ path do
 7:        output TID(β ∪ node)
 8:    end for
 9: else
10:    for each a_i ∈ (a_i, TID_b) do
11:        b ← β ∪ a_i, TID_b ← TID_a, σ ← size(TID_b)
12:        create(D_b)
13:        create(tree_b)
14:        tree_ib ← IPFP_Tree(tree_b)
15:        if tree_ib ≠ ∅ then
16:            IFP_Growth(tree_ib, b)
17:        end if
18:    end for
19: end if
Algorithm 2 IPFP-Tree
Require: items
Ensure: IPFP tree
 1: for i ∈ items do
 2:     if σ(Head(i)) = σ(Head(i − 1)) then
 3:         for node_i ∈ link_i, node_{i−1} ∈ link_{i−1} do
 4:             if σ(node_i) = σ(node_{i−1}) then
 5:                 if node_i.child = node_{i−1} and node_{i−1}.parent = node_i then
 6:                     remove(node_{i−1})
 7:                     update(Head)
 8:                 end if
 9:             end if
10:        end for
11:     end if
12:     i ← i + 1
13: end for
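To illustrate the merge rule behind Algorithm 2, here is a toy Python sketch for the simplest case: a single FP-tree path, represented as a list of (item, support count) pairs, where Theorem 4's global conditions are assumed to have been verified already. The function name and representation are inventions for this sketch, not part of the paper's implementation.

```python
def prune_path(path):
    # path: list of (item, support_count) pairs from root to leaf.
    # A child whose support count equals its parent's adds no new index
    # information (Theorem 4), so the two nodes collapse into one.
    merged = [path[0]]
    for item, cnt in path[1:]:
        prev_item, prev_cnt = merged[-1]
        if cnt == prev_cnt:
            # Merge the child into the parent: one combined node, same count.
            combined = prev_item + [item] if isinstance(prev_item, list) \
                       else [prev_item, item]
            merged[-1] = (combined, cnt)
        else:
            merged.append((item, cnt))
    return merged
```

For the Figure 3 example, $I_1$ and $I_3$ with equal counts collapse into one combined node, while $I_4$ with a smaller count stays separate.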

3.4. Distributed PFWER Control Algorithm

3.4.1. Label Replacement

The first stage of the distributed PFWER false positive control algorithm is label permutation, whose purpose is to break the relationship between labels and patterns. The labels therefore need to be shuffled, generally with $j_r = 10^3$–$10^4$ permutations. This process can be run in parallel on the cluster, as shown in Figure 6.
First, the label data are read with the sc.textFile() method and stored in labelRDD; the labels are then randomly permuted in parallel. The shuffled label sets on the cluster are then merged, yielding the permuted set of labels.
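A single-machine sketch of the permutation stage, assuming 0/1 integer labels (in the actual algorithm the shuffle runs in parallel on the RDD partitions; the function name and seeding are illustrative):

```python
import random

def permute_labels(labels, j_r, seed=0):
    # Generate j_r random permutations of the class labels; each
    # permutation breaks any pattern-label association while keeping
    # the label counts n1 and n0 fixed.
    rng = random.Random(seed)
    perms = []
    for _ in range(j_r):
        p = labels[:]
        rng.shuffle(p)
        perms.append(p)
    return perms

labels = [1, 1, 0, 0, 1, 0]
perms = permute_labels(labels, j_r=100)
```

Every permutation is a rearrangement of the same labels, so $n_1 = \sum_i l_i$ is preserved, which is what makes the margins of the contingency table fixed across permutations.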

3.4.2. Hypothesis Determination

The second part of the distributed PFWER false positive control algorithm finds the parameters to be tested in the multiple hypothesis test. Each null hypothesis is composed of two key elements, a pattern $S$ and a label $l_i$, and by the theorems of Section 3.3 the parameters actually needed in the computation are the set of all indexes mapped to each pattern $S$ and their corresponding labels $l_i$ in the transaction dataset $D$; the main task of this stage is to find these two parameters.
Combining the PFWER false positive control characteristics with the Index-Tree algorithm of Section 3.3 yields a distributed method for hypothesis determination that computes the pattern-mapped index sets and their labels in parallel. The method consists of three phases: a dataset partitioning phase, a frequent 1-item set and FP tree construction phase, and a grouped mining phase for the pattern-mapped index sets and their labels. Figure 7 shows the computational framework of distributed hypothesis determination: the dataset is divided into n partitions in the partitioning phase, and subsequent computations run in parallel. The main objective of the construction phase is to build frequent 1-item sets with index structure and, from them and the transactional dataset, an FP tree with index structure. Figure 8 illustrates the construction of the frequent 1-item sets, which proceeds as follows.
  • First, the items in the dataset are split using the flatMap operator to construct <key = item, value = index> key-value pairs in parallel, and the map operator constructs <key = item, value = 1> key-value pairs.
  • Second, the <key = item, value = 1> pairs are accumulated using the reduceByKey operator. In the result, the key is the item name and the value is the number of occurrences of the item in the dataset.
  • Next, the <key = item, value = index> pairs are aggregated using the groupByKey operator to obtain new <key = item, value = indexSet> pairs, where the value is the set of indexes of the transactions containing the key item.
  • Finally, the join operator combines <key = item, value = indexSet> and <key = item, value = count> into a new pair <key = item, value = count + indexSet>, which is output in descending order of count to obtain the item header table for subsequent calculations.
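The four steps above can be emulated in plain Python (the operator names follow the text; the defaultdict-based accumulation stands in for reduceByKey/groupByKey, and the toy transactions are illustrative):

```python
from collections import defaultdict

transactions = [["I2", "I1"], ["I2", "I1", "I3"], ["I2"], ["I1", "I3"]]

# flatMap / map: <item, index> and <item, 1> pairs.
pairs_index = [(item, tid) for tid, t in enumerate(transactions) for item in t]
pairs_one = [(item, 1) for item, _ in pairs_index]

# reduceByKey: accumulate occurrence counts per item.
counts = defaultdict(int)
for item, one in pairs_one:
    counts[item] += one

# groupByKey: collect the index set per item.
indexes = defaultdict(set)
for item, tid in pairs_index:
    indexes[item].add(tid)

# join + sort by descending count -> the item header table.
header = sorted(((i, counts[i], indexes[i]) for i in counts),
                key=lambda x: -x[1])
```

For this toy dataset the header table comes out as $I_2$ (count 3), $I_1$ (count 3), $I_3$ (count 2), each entry carrying its index set.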
The FP tree with index structure is constructed by traversing the transaction dataset based on the frequent 1-item sets with index structure. Next, the frequent 1-item set is divided into h groups, the group numbers are denoted by h i d , and each group contains a complete FP tree with an index structure. The conditional pattern base and the conditional pattern tree are constructed for each h i d group, and then the index set containing the patterns is mined using the Index-Tree algorithm. Since the labels correspond to the transaction data, the index set containing the patterns can be computed while the corresponding label set can be determined, and obviously, the two parameters related to the null hypothesis in the hypothesis test have been determined.

3.4.3. False Positive Control

This section uses the false positive control method proposed by Westfall and Young [30,43] to control the FWER at the $\alpha$ level. Its main idea is that a new resampled transactional dataset in which patterns and labels are unrelated can be generated simply by randomly permuting the class labels. One can then determine whether a false positive error has occurred by computing the minimum p-value after each permutation, $p_{min} = \min_S p_S^F$, and checking whether $p_{min} \leq \delta$ holds. The subsequent sections of this paper refer to this method as the WY permutation algorithm.
The disadvantage of the WY permutation algorithm is that, besides its many permutation operations, it is computationally expensive. Terada [13] and other researchers observed that in Fisher's exact test, when the margins $n, n_1, \sigma(S)$ of the $2 \times 2$ contingency table are fixed, Equations (7) and (8) show that the p-value is ultimately a function of $\sigma_1(S)$. Since the entries of the $2 \times 2$ contingency table are discrete and can take only finitely many values, $\sigma_1(S)$ is bounded, i.e., $\sigma_1(S) \in [\sigma_1(S)_{min}, \sigma_1(S)_{max}]$, where $\sigma_1(S)_{max} = \min(n_1, \sigma(S))$ and $\sigma_1(S)_{min} = \max(0, \sigma(S) - (n - n_1))$. From these bounds it can be further deduced that there exists a minimum attainable p-value $\varphi(\sigma(S))$ strictly greater than 0, as follows.
$$\varphi(\sigma(S)) = \min\left\{p_S^F(a) \mid \sigma_1(S)_{min} \leq a \leq \sigma_1(S)_{max}\right\}$$
According to Equation (8), the Fisher exact test p-value is a cumulative sum of terms from Equation (7), all of which are greater than 0, and the minimum attainable p-value $\varphi(\sigma(S))$ is reached when $\sigma_1(S) = \sigma_1(S)_{min}$ or $\sigma_1(S) = \sigma_1(S)_{max}$. The patterns $S$ with $\varphi(\sigma(S)) \leq \delta$ can then be called the set of testable patterns $\kappa(\delta)$, so that patterns not in $\kappa(\delta)$ cannot be statistically significant at level $\delta$. On this basis, a monotonically decreasing lower bound $\hat{\varphi}(\sigma)$ on the minimum attainable p-value can be introduced, as shown in Equation (14).
$$\hat{\varphi}(\sigma) = \begin{cases} \varphi(\sigma(S)), & 0 \leq \sigma(S) \leq n_1 \\ 1 \big/ \binom{n}{n_1}, & n_1 \leq \sigma(S) \leq n \end{cases}$$
The monotonically decreasing lower bound $\hat{\varphi}(\sigma)$ on the minimum attainable p-value gives $\hat{\kappa}(\delta) = \{S \mid \hat{\varphi}(\sigma) \leq \delta\}$, which satisfies $\kappa(\delta) \subseteq \hat{\kappa}(\delta)$ and, by monotonicity, can be rewritten as $\hat{\kappa}(\delta) = \{S \mid \sigma(S) \geq \sigma_\delta\}$. That is, only the patterns $S$ satisfying this condition are relevant to the PFWER false positive control calculation. Based on the above, the pseudo-code of the distributed PFWER false positive control algorithm is given in Algorithms 3 and 4.
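The minimum attainable p-value and the resulting testable-pattern pruning can be sketched in Python. This is a minimal sketch of Equation (13) and the set $\kappa(\delta)$ under the stated bounds; the helper names and the toy parameters are illustrative.

```python
from math import comb

def hypergeom_pmf(a, sigma_s, n1, n):
    # Equation (7), with the table margins fixed.
    return comb(n1, a) * comb(n - n1, sigma_s - a) / comb(n, sigma_s)

def fisher_p(b, sigma_s, n1, n):
    # Equation (8): cumulative probability of outcomes no likelier than b.
    lo, hi = max(0, sigma_s - (n - n1)), min(n1, sigma_s)
    p_b = hypergeom_pmf(b, sigma_s, n1, n)
    return sum(hypergeom_pmf(a, sigma_s, n1, n) for a in range(lo, hi + 1)
               if hypergeom_pmf(a, sigma_s, n1, n) <= p_b + 1e-12)

def phi(sigma_s, n1, n):
    # Equation (13): minimum attainable p-value over the bounds on
    # sigma_1(S); it is reached at one of the two endpoints.
    lo, hi = max(0, sigma_s - (n - n1)), min(n1, sigma_s)
    return min(fisher_p(lo, sigma_s, n1, n), fisher_p(hi, sigma_s, n1, n))

# Patterns whose minimum attainable p-value exceeds delta can never be
# significant, so they drop out of the testable set kappa(delta).
n, n1, delta = 10, 5, 0.05
testable = [s for s in range(1, n + 1) if phi(s, n1, n) <= delta]
```

For $n = 10$, $n_1 = 5$, $\delta = 0.05$, only support counts 4, 5, and 6 can possibly reach significance, so patterns with any other support can be pruned before the permutation step.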
Algorithm 3 DS-FWER(D)
Require: D
Ensure: δ
 1: label ← DistributedLabelPermutation(D)
 2: p_min^(i) ← 1
 3: σ ← 1, δ ← φ̂(σ)
 4: itemIndex ← flatMap(D), itemOne ← map(D)
 5: itemCount ← reduceByKey(itemOne), itemIndexs ← groupByKey(itemIndex)
 6: item ← itemCount.join(itemIndexs)
 7: tree ← createF1Tree(item), F1_tree ← IPFPTree(tree)
 8: itemGroup ← group(item)
 9: index ← IndexTree(itemGroup, F1_tree)
10: WY(index, label)
11: return the α quantile of {p_min^(i)}, i = 1, …, j_r
Algorithm 4 WY Algorithm
Require: index, label
Ensure: σ
 1: compute p_S^F(σ_1(S))
 2: for i = 1, …, j_r do
 3:     compute σ_1(S)
 4:     p_min^(i) ← min(p_min^(i), p_S^F(σ_1(S)))
 5: end for
 6: FWER(δ) ← (1/j_r) Σ_{i=1}^{j_r} 1[p_min^(i) ≤ δ]
 7: while FWER(δ) > α do
 8:     σ ← σ + 1, δ ← φ̂(σ)
 9:     FWER(δ) ← (1/j_r) Σ_{i=1}^{j_r} 1[p_min^(i) ≤ δ]
10: end while
11: for indexList ∈ index do
12:     compute σ(S)
13:     if σ(S) ≥ σ then
14:         WY(index, label)
15:     end if
16: end for
The first line of Algorithm 3 uses distributed label permutation to obtain the permuted label set with index positions; the second line initializes all minimum p-values of the $j_r$ permutations to 1; the third line initializes the minimum support of the patterns and, from it, the corrected significance threshold $\delta$ for the subsequent calculations. Lines four to seven construct, in parallel, the frequent 1-item sets with index structure and the FP tree, and line eight groups the frequent 1-item sets and distributes the groups to the nodes of the cluster. The Index-Tree algorithm is rewritten to take the FP tree and the frequent 1-item sets as input, and each node mines its index sets from the FP tree and its assigned group of frequent 1-items. Finally, the index sets and label sets are substituted into the WY permutation algorithm to obtain the $j_r$ minimum p-values $\{p_{min}^{(i)}\}_{i=1}^{j_r}$; setting the significance threshold of the p-value calculation to the $\alpha$-quantile of $\{p_{min}^{(i)}\}_{i=1}^{j_r}$ ultimately controls the FWER at the $\alpha$ level.
Algorithm 4 is the WY permutation algorithm. Line one computes, by Fisher's exact test, all p-values $p_S^F(\sigma_1(S))$ within the bounds. Lines two to four compute, for each of the $j_r$ permutations, the $\sigma_1(S)$ value of each index set and record the minimum p-value $p_{min}^{(i)}$. Line five estimates the current $FWER(\delta)$ from $\{p_{min}^{(i)}\}_{i=1}^{j_r}$. Lines six to eight perform a loop: while $FWER(\delta) > \alpha$, the minimum support is incremented by 1 and the significance threshold is updated, until $FWER(\delta) \leq \alpha$. The WY permutation algorithm is then executed on all mined index sets with $\sigma(S) \geq \sigma$ to find the final corrected significance threshold. Finally, the corrected significance thresholds found on the individual nodes are compared, and the smallest threshold among all nodes is the final result.
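A compact single-machine sketch of the core WY permutation step (without the support-threshold loop of lines six to eight), assuming 0/1 integer labels; the function names, seeding, and the order-statistic definition of the $\alpha$-quantile are assumptions of this sketch, not the paper's implementation.

```python
from math import comb
import random

def fisher_p(b, sigma_s, n1, n):
    # Fisher exact test p-value from Equations (7) and (8).
    pmf = lambda a: comb(n1, a) * comb(n - n1, sigma_s - a) / comb(n, sigma_s)
    lo, hi = max(0, sigma_s - (n - n1)), min(n1, sigma_s)
    p_b = pmf(b)
    return sum(pmf(a) for a in range(lo, hi + 1) if pmf(a) <= p_b + 1e-12)

def wy_threshold(index_sets, labels, j_r, alpha, seed=0):
    # For each permuted labelling, record the minimum p-value over all
    # patterns (given by their TID index sets), then return the
    # alpha-quantile of those minima as the corrected threshold delta.
    rng = random.Random(seed)
    n, n1 = len(labels), sum(labels)
    p_mins = []
    for _ in range(j_r):
        perm = labels[:]
        rng.shuffle(perm)
        p_min = 1.0
        for tids in index_sets:
            sigma1 = sum(perm[t] for t in tids)   # sigma_1(S) after permutation
            p_min = min(p_min, fisher_p(sigma1, len(tids), n1, n))
        p_mins.append(p_min)
    k = int(alpha * j_r)
    return sorted(p_mins)[k - 1] if k >= 1 else 0.0

labels = [1, 1, 1, 0, 0, 0]
index_sets = [{0, 1, 2}, {3, 4}, {0, 5}]
delta = wy_threshold(index_sets, labels, j_r=200, alpha=0.05)
```

Because the threshold is an order statistic of the permutation minima, lowering $\alpha$ can only lower (or keep) the resulting $\delta$, which mirrors the monotonicity used in the proof of Theorem 6.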

3.5. Proof of Correctness

Two points must be verified: first, the correctness of the data partitioning, and second, the correctness of the final result obtained by executing the WY permutation algorithm in parallel.
According to Section 3.3, the index sets of all patterns $S$ can be found and de-duplicated before performing the PFWER false positive control computation, reducing the amount of data to be computed while preserving the correctness of the result. The distributed false positive control algorithm groups the frequent 1-item sets with index structure, and each node mines index sets using the indexed FP tree and its assigned group. The Index-Tree algorithm determines the conditional pattern base for each item in the header table from the FP tree and then builds a conditional tree from the conditional pattern base for subsequent pattern mining. Therefore, as long as the initial indexed FP tree is consistent across the item header groups, the index sets obtained by the distributed computation are the same as those obtained on a single machine.
Theorem 6.
The minimum of the node-local significance thresholds is the overall significance threshold, and this overall threshold is identical to the significance threshold computed on a single machine.
Proof. 
The WY permutation algorithm performs σ = σ + 1 and δ = φ̂(σ) whenever FWER(δ) > α. Let I_{n1} and I_{n2} be two index sets on different nodes with supports σ(I_{n1}) and σ(I_{n2}), where σ(I_{n1}) < σ(I_{n2}). From Equation (9) and δ = max{δ | FWER(δ) ≤ α} we obtain δ_{I_{n2}} < δ_{I_{n1}}, which verifies that δ decreases monotonically as σ increases. Consequently, index sets whose support is smaller than the current support threshold can be ignored without affecting the final result, so the final significance threshold is the minimum of the thresholds obtained on all nodes and coincides with the stand-alone result. □
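Both facts used in the proof can be checked numerically with a small sketch (illustrative; names and the toy dataset size are ours): the probability of the most extreme contingency table, and hence the per-support threshold φ̂(σ), shrinks as the support σ grows, and the overall threshold is simply the minimum of the node-local thresholds.

```python
import math

def combine_node_results(node_thresholds):
    """Theorem 6: the overall corrected significance threshold is the
    minimum of the node-local thresholds."""
    return min(node_thresholds)

def extreme_table_p(sigma, n1=6, n=12):
    """Probability of the most extreme 2x2 table at support sigma for a toy
    dataset of n = 12 transactions with n1 = 6 label-1 rows (assumes
    sigma <= min(n1, n - n1)); a lower bound on any achievable p-value."""
    return math.comb(n1, sigma) / math.comb(n, sigma)
```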

4. Experiments and Performance Analysis

This section validates the algorithm experimentally in four areas. Section 4.3.1 determines the parameters used in distributed PFWER false positive control. Section 4.3.2 tests the pruning efficiency of the algorithm and verifies the effect of the pruning operation. Section 4.3.3 verifies the accuracy of the distributed PFWER false positive control computation. Section 4.3.4 tests operational efficiency by comparing the runtime of the distributed PFWER false positive control algorithm with that of the stand-alone PFWER algorithm on different datasets. Together, these experiments verify, on the one hand, that the distributed and stand-alone algorithms produce the same false positive control results and, on the other hand, that the distributed algorithm improves the computation rate. Different datasets are used to demonstrate the robustness and general applicability of the algorithms.

4.1. Experimental Environment Configuration

The algorithm in this paper is written in Java and uses the Spark framework for distributed computation. The development environment is shown in Table 6.
Since the proposed algorithm is distributed, the main experiments are carried out on a cluster; the test cluster environment is shown in Table 7.

4.2. Experimental Dataset

The information on the datasets used in the experiments is shown in Table 8. We performed our experiments on 11 datasets, available from FIMI'04 (http://fimi.ua.ac.be, 7 June 2022), UCI (https://archive.ics.uci.edu/ml/index.php, 7 June 2022) and kdd2018 (https://github.com/VandinLab/TopKWY, 10 June 2022). Datasets marked (L) in the dataset description carry binary classification labels; datasets marked (U) are unlabeled. For datasets whose transactions are not divided into two categories, the single item whose frequency is closest to 0.5 is removed from the transaction dataset, artificially dividing the dataset into two groups; n/n1 denotes the ratio of the total number of transactions to the number of transactions with label l1, rounded to two decimal places.
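For the unlabeled (U) datasets, the labeling step just described can be sketched as follows. This is an illustrative Python sketch; the function name and the tie-breaking rule for items with equal frequency are ours.

```python
from collections import Counter

def binarize_by_pivot_item(transactions):
    """Pick the single item whose relative frequency is closest to 0.5,
    remove it from every transaction, and label each transaction by whether
    it contained that item; also return n/n1 rounded to two decimals."""
    n = len(transactions)
    freq = Counter(item for t in transactions for item in t)
    pivot = min(freq, key=lambda item: abs(freq[item] / n - 0.5))
    labeled = [(1 if pivot in t else 0, [i for i in t if i != pivot])
               for t in transactions]
    n1 = sum(label for label, _ in labeled)
    return pivot, labeled, round(n / n1, 2)
```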

4.3. Distributed PFWER False Positive Control Experiment

4.3.1. Determination of The Number of Permutations

  • Experimental description: This section determines the parameter used in distributed PFWER false positive control, i.e., the number of label permutations jr. Label permutation is essential for the accuracy of the distributed PFWER false positive control results; its purpose is to break any relationship between labels and patterns, so that the null hypothesis proposed in this paper (no association between pattern and label) is satisfied and inter-pattern dependencies do not bias the computation. The experiment tests how the number of permutations performed in the label permutation stage affects the false positive control results of the PFWER algorithm. The FP-Growth algorithm is used for the pattern mining operation in all comparison experiments.
  • Experimental analysis: The distributed PFWER false positive control uses a permutation-based approach for the control calculation. The known trade-off in setting jr is that a larger jr yields a more accurate estimate of the final corrected significance threshold, at the cost of a running time that grows with jr. Figures 9 and 10 show the results for different datasets under different values of jr.
The horizontal coordinate of Figure 9 is the number of permutations jr, and the vertical coordinate is the final support count. Figure 10 shows the running time for different datasets under different permutation counts; the horizontal coordinate is jr, and the vertical coordinate is the running time in seconds (s). Since label permutation is a random process, individual permutations may shuffle the label order poorly. Nevertheless, the overall results show that the support count stabilizes at jr = 10^3–10^4; increasing the number of permutations beyond this has little effect on the result but greatly increases the running time, so the experimental parameter chosen in this paper is jr = 10^3 or jr = 10^4.

4.3.2. Pruning Efficiency Analysis

(1)
Experimental description
The PFWER false positive control algorithm must enumerate all hypotheses to be tested in the dataset; these hypotheses consist of the patterns mined from the transaction set and their corresponding permuted labels, so pattern mining techniques are required. During the computation it was found that, when Fisher's exact test is used to calculate p-values and the WY permutation procedure is used for false positive control, certain pruning operations can reduce the amount of computation and speed it up without affecting the results.
The experiments in this section verify the effect of the pruning operations on the algorithm. As described above, pruning reduces the number of patterns for which the PFWER false positive control must be computed without affecting the control itself. The experiments therefore assess pruning efficiency in terms of both the number of patterns that must be computed before and after pruning and the change in the significance threshold.
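The pruning idea can be illustrated with a minimal sketch (ours, not the paper's exact implementation): a pattern whose minimum attainable p-value, over all possible label assignments, already exceeds the current threshold δ can never become significant, so its Fisher test need not be evaluated. The sketch assumes supports σ ≤ min(n1, n − n1).

```python
import math

def min_attainable_p(sigma, n1, n):
    """Smallest Fisher p-value any 2x2 table with margins (sigma, n1, n) can
    achieve: the probability of the most extreme table (assumes
    sigma <= min(n1, n - n1))."""
    return min(math.comb(n1, sigma), math.comb(n - n1, sigma)) / math.comb(n, sigma)

def prune_index_sets(index_sets, delta, n1, n):
    """Keep only the index sets that could still reach a p-value <= delta;
    discarding the rest cannot change the corrected threshold."""
    return [s for s in index_sets if min_attainable_p(len(s), n1, n) <= delta]
```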
(2)
Experimental analysis
The purple bars in Figure 11 show the number of patterns mined before the pruning operation, and the green bars show the number of patterns mined after the pruning operation. The experimental results show that the use of the pruning operation in the calculation of the PFWER false positive control can effectively reduce the number of patterns calculated, thus reducing the number of p-values that need to be calculated by Fisher’s exact test and thus can effectively improve the efficiency of the PFWER false positive control.
Table 9 shows the effect of pruning on the run speed of different datasets before and after the pruning operation, and it can be seen from the data in the table that for most of the datasets, the pruning operation can improve the run efficiency.
Figure 12 represents the changes in the support counts of different datasets before and after the pruning operation. From the experimental results in Figure 12, we can see that the results calculated by the PFWER false positive control algorithm before and after performing the pruning operation are basically the same, thus verifying the correctness of the pruning operation.
Figure 13 compares the significance thresholds of the PFWER false positive control with and without the pruning operation on different datasets, with the vertical coordinate on a base-10 logarithmic scale. Since the jr random label permutations affect the final significance threshold, some deviation between the thresholds obtained with and without pruning on individual datasets is acceptable.

4.3.3. Accuracy Test

(1)
Experiment Description
The experiments in this section verify the accuracy of the distributed PFWER false positive control computation. The distributed algorithm processes the transaction dataset and then performs the PFWER false positive control calculation in parallel on each node of the cluster. The essential requirement is that the results computed in the distributed case are consistent with those of the stand-alone computation; in particular, the corrected significance thresholds obtained by the two runs must be the same.
(2)
Experimental Analysis
Figure 14 compares the minimum support calculated by the distributed PFWER false positive control with that of the stand-alone PFWER false positive control. For the different datasets, the final minimum support is essentially the same in the distributed and stand-alone cases, demonstrating the accuracy of the distributed computation.
Figure 15 shows the final corrected significant threshold for the distributed PFWER false positive control versus the corrected significant threshold obtained from the PFWER false positive control in the stand-alone case, with the vertical coordinate as the logarithm with base 10. The experimental results show that the results of the corrected significance thresholds obtained for the single machine on different datasets are in general agreement with the results calculated by the distributed PFWER false positive control algorithm proposed in this paper.

4.3.4. Operational Efficiency Test

(1)
Experimental Description
The main purpose of using distributed techniques for the PFWER false positive control calculation is to improve its computational efficiency. The distributed PFWER false positive control algorithm reduces the experiment's running time without affecting its final results, since the set of hypotheses is reduced during hypothesis determination. In this section, the runtime of the distributed PFWER false positive control algorithm is compared with that of the stand-alone PFWER algorithm and the existing FastWY [13] and WYlight [5] algorithms on different datasets.
(2)
Experimental Analysis
The running times in Figure 16 are reported in seconds (s). The experiments compare the run times of the distributed PFWER false positive control algorithm, the stand-alone PFWER algorithm, the FastWY algorithm [13], and the WYlight algorithm [5] on different datasets. The results show that the distributed PFWER false positive control algorithm effectively improves computational speed while avoiding the memory limitations of stand-alone computation, and can efficiently perform false positive control on large-scale data.

4.4. Summary

The distributed PFWER false positive control algorithm has been analyzed and tested experimentally. The experimental data show that the distributed PFWER false positive control algorithm has the same control results as the stand-alone case and is better in terms of operational efficiency than running on a single machine. The algorithm can effectively address the problem of excessive computation in multiple hypothesis testing of false positive control for large data.

5. Conclusions

The PFWER control algorithm can obtain a single hypothesis-test significance threshold subject to an arbitrarily specified overall false positive level constraint without assuming an independent identical distribution. Since the PFWER control algorithm is highly time-consuming, this paper proposes a distributed solution to the PFWER control algorithm, which significantly improves the execution efficiency of the PFWER control algorithm without any loss in theoretical accuracy. Specifically, we abstract the PFWER control problem as a frequent pattern mining problem, and by adapting the FP growth algorithm and introducing distributed computing techniques, the constructed FP tree is decomposed into a set of subtrees, each corresponding to a subtask. All subtrees (subtasks) are distributed to different computing nodes, and each node independently computes the local significance threshold according to the assigned subtasks. The local computation outcomes from every node are aggregated, and the FWER false positive control thresholds are calculated to be exactly in line with the theoretical outcomes. To the best of our knowledge, this is the first paper to present a distributed PFWER control algorithm. Experimental results on real datasets show that the proposed algorithm is more computationally efficient than the comparison algorithm.
In the future, we may also consider using unconditional exact tests, i.e., Barnard's exact test, to calculate p-values in false positive control methods for multiple hypothesis testing. However, unconditional tests are generally more expensive than conditional tests (typically Fisher's exact test), because they account for all tables consistent with the observed pattern frequencies and require handling an unknown nuisance parameter in subsequent calculations. Another possible direction is to extend this paper's distributed algorithm to transaction datasets with multi-class labels, and to explore efficient distributed false positive control for multiple hypothesis testing on other types of datasets.

Author Contributions

Conceptualization, Y.Z.; methodology, Y.Z.; software, X.L., Y.S. and C.C.; validation, X.L., Y.S. and C.C.; formal analysis, X.L., Y.S. and C.C.; data curation, X.L., Y.S. and C.C.; writing—original draft preparation, X.L.; writing—review and editing, Y.Z., X.L., T.X., F.W., Y.S. and C.C.; visualization, Y.Z. and X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (No. 62032013 and 61772124).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Erdogmus, H. Bayesian Hypothesis Testing Illustrated: An Introduction for Software Engineering Researchers. ACM Comput. Surv. 2023, 55, 119:1–119:28. [Google Scholar] [CrossRef]
  2. Munoz, A.; Martos, G.; Gonzalez, J. Level Sets Semimetrics for Probability Measures with Applications in Hypothesis Testing. Methodol. Comput. Appl. Probab. 2023, 25, 21. [Google Scholar] [CrossRef]
  3. Li, Y.; Zhang, C.; Shelby, L.; Huan, T.C. Customers’ self-image congruity and brand preference: A moderated mediation model of self-brand connection and self-motivation. J. Prod. Brand Manag. 2022, 31, 798–807. [Google Scholar] [CrossRef]
  4. Jensen, R.I.T.; Iosifidis, A. Qualifying and raising anti-money laundering alarms with deep learning. Expert Syst. Appl. 2023, 214, 119037. [Google Scholar] [CrossRef]
  5. Llinares-López, F.; Sugiyama, M.; Papaxanthos, L.; Borgwardt, K.M. Fast and Memory-Efficient Significant Pattern Mining via Permutation Testing. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, 10–13 August 2015; Cao, L., Zhang, C., Joachims, T., Webb, G.I., Margineantu, D.D., Williams, G., Eds.; ACM: New York, NY, USA, 2015; pp. 725–734. [Google Scholar] [CrossRef]
  6. Dey, M.; Bhandari, S.K. FWER goes to zero for correlated normal. Stat. Probab. Lett. 2023, 193, 109700. [Google Scholar] [CrossRef]
  7. Audic, S.; Claverie, J.M. The significance of digital gene expression profiles. Genome Res. 1997, 7, 986–995. [Google Scholar]
  8. Holm, S. A simple sequentially rejective multiple test procedure. Scand. J. Stat. 1979, 6, 65–70. [Google Scholar]
  9. Simes, R.J. An improved Bonferroni procedure for multiple tests of significance. Biometrika 1986, 73, 751–754. [Google Scholar] [CrossRef]
  10. Hochberg, Y. A sharper Bonferroni procedure for multiple tests of significance. Biometrika 1988, 75, 800–802. [Google Scholar] [CrossRef]
  11. Pellegrina, L.; Vandin, F. Efficient Mining of the Most Significant Patterns with Permutation Testing. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD 2018), London, UK, 19–23 August 2018; Guo, Y., Farooq, F., Eds.; ACM: New York, NY, USA, 2018; pp. 2070–2079. [Google Scholar] [CrossRef]
  12. Hang, D.; Zeleznik, O.A.; Lu, J.; Joshi, A.D.; Wu, K.; Hu, Z.; Shen, H.; Clish, C.B.; Liang, L.; Eliassen, A.H.; et al. Plasma metabolomic profiles for colorectal cancer precursors in women. Eur. J. Epidemiol. 2022, 37, 413–422. [Google Scholar] [CrossRef]
  13. Terada, A.; Tsuda, K.; Sese, J. Fast Westfall-Young permutation procedure for combinatorial regulation discovery. In Proceedings of the IEEE International Conference on Bioinformatics & Biomedicine, Belfast, UK, 2–5 November 2014. [Google Scholar]
  14. Harvey, C.R.; Liu, Y. False (and Missed) Discoveries in Financial Economics. J. Financ. 2020, 75, 2503–2553. [Google Scholar] [CrossRef]
  15. Kelter, R. Power analysis and type I and type II error rates of Bayesian nonparametric two-sample tests for location-shifts based on the Bayes factor under Cauchy priors. Comput. Stat. Data Anal. 2022, 165, 107326. [Google Scholar] [CrossRef]
  16. Andrade, C. Multiple Testing and Protection Against a Type 1 (False Positive) Error Using the Bonferroni and Hochberg Corrections. Indian J. Psychol. Med. 2019, 41, 99–100. [Google Scholar] [CrossRef] [PubMed]
  17. Blostein, S.D.; Huang, T.S. Detecting small, moving objects in image sequences using sequential hypothesis testing. IEEE Trans. Signal Process. 1991, 39, 1611–1629. [Google Scholar] [CrossRef]
  18. Babu, P.; Stoica, P. Multiple Hypothesis Testing-Based Cepstrum Thresholding for Nonparametric Spectral Estimation. IEEE Signal Process. Lett. 2022, 29, 2367–2371. [Google Scholar] [CrossRef]
  19. Benjamini, Y.; Hochberg, Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J. R. Stat. Soc. Ser. B Methodological 1995, 57, 289–300. [Google Scholar] [CrossRef]
  20. Benjamini, Y.; Hochberg, Y. On the Adaptive Control of the False Discovery Rate in Multiple Testing With Independent Statistics. J. Educ. Behav. Stat. 2000, 25, 60–83. [Google Scholar] [CrossRef]
  21. Benjamini, Y.; Krieger, A.M.; Yekutieli, D. Adaptive linear step-up procedures that control the false discovery rate. Biometrika 2006, 93, 491–507. [Google Scholar]
  22. D’Alberto, R.; Raggi, M. From collection to integration: Non-parametric Statistical Matching between primary and secondary farm data. Stat. J. IAOS 2021, 37, 579–589. [Google Scholar] [CrossRef]
  23. Pawlak, M.; Lv, J. Nonparametric Testing for Hammerstein Systems. IEEE Trans. Autom. Control. 2022, 67, 4568–4584. [Google Scholar] [CrossRef]
  24. Carlson, J.M.; Heckerman, D.; Shani, G. Estimating False Discovery Rates for Contingency Tables. Technical Report MSR-TR-2009-53, 2009, 1–24. Available online: https://www.microsoft.com/en-us/research/publication/estimating-false-discovery-rates-for-contingency-tables/ (accessed on 13 February 2023).
  25. Bestgen, Y. Using Fisher’s Exact Test to Evaluate Association Measures for N-grams. arXiv 2021, arXiv:2104.14209. [Google Scholar]
  26. Pellegrina, L.; Riondato, M.; Vandin, F. SPuManTE: Significant Pattern Mining with Unconditional Testing. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019. [Google Scholar]
  27. Terada, A.; Sese, J. Bonferroni correction hides significant motif combinations. In Proceedings of the 13th IEEE International Conference on BioInformatics and BioEngineering (BIBE 2013), Chania, Greece, 10–13 November 2013; pp. 1–4. [Google Scholar] [CrossRef]
  28. Sultanov, A.; Protsyk, M.; Kuzyshyn, M.; Omelkina, D.; Shevchuk, V.; Farenyuk, O. A statistics-based performance testing methodology: A case study for the I/O bound tasks. In Proceedings of the 17th IEEE International Conference on Computer Sciences and Information Technologies (CSIT 2022), Lviv, Ukraine, 10–12 November 2022; pp. 486–489. [Google Scholar] [CrossRef]
  29. Paschali, M.; Zhao, Q.; Adeli, E.; Pohl, K.M. Bridging the Gap Between Deep Learning and Hypothesis-Driven Analysis via Permutation Testing; Springer: Cham, Switzerland, 2022. [Google Scholar]
  30. Westfall, P.H.; Young, S.S. Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment; John Wiley & Sons: New York, NY, USA, 1993. [Google Scholar]
  31. Schwender, H. Review of: Dudoit, S.; van der Laan, M.J. Multiple Testing Procedures with Applications to Genomics. Stat. Pap. 2009, 50, 681–682. [Google Scholar] [CrossRef]
  32. Webb, G.I. Discovering Significant Patterns. Mach. Learn. 2007, 68, 1–33. [Google Scholar] [CrossRef]
  33. Liu, G.; Zhang, H.; Wong, L. Controlling False Positives in Association Rule Mining. In Proceedings of the VLDB Endowment, Seattle, WA, USA, 29 August–3 September 2011. [Google Scholar]
  34. Yan, D.; Qu, W.; Guo, G.; Wang, X. PrefixFPM: A Parallel Framework for General-Purpose Frequent Pattern Mining. In Proceedings of the 2020 IEEE 36th International Conference on Data Engineering (ICDE), Dallas, TX, USA, 20–24 April 2020. [Google Scholar]
  35. Messner, W. Hypothesis Testing and Machine Learning: Interpreting Variable Effects in Deep Artificial Neural Networks using Cohen’s f2. arXiv 2023, arXiv:2302.01407. [Google Scholar]
  36. Yu, J.; Wen, Y.; Yang, L.; Zhao, Z.; Guo, Y.; Guo, X. Monitoring on triboelectric nanogenerator and deep learning method. Nano Energy 2022, 92, 106698. [Google Scholar] [CrossRef]
  37. Han, J.; Kamber, M.; Pei, J. Data Mining: Concepts and Techniques, 3rd ed.; Morgan Kaufmann: Burlington, MA, USA, 2011; pp. 248–253. [Google Scholar]
  38. Han, J.; Jian, P.; Yin, Y.; Mao, R. Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach. Data Min. Knowl. Discov. 2004, 8, 53–87. [Google Scholar] [CrossRef]
  39. White, T. Hadoop—The Definitive Guide: Storage and Analysis at Internet Scale, 2nd ed.; O’Reilly Media: Sebastopol, CA, USA, 2011. [Google Scholar]
  40. Ji, K.; Kwon, Y. New Spam Filtering Method with Hadoop Tuning-Based MapReduce Naïve Bayes. Comput. Syst. Sci. Eng. 2023, 45, 201–214. [Google Scholar] [CrossRef]
  41. Zaharia, M.; Chowdhury, M.; Das, T.; Dave, A.; Stoica, I. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, San Jose, CA, USA, 25–27 April 2012. [Google Scholar]
  42. Chambers, B.; Zaharia, M. Spark: The Definitive Guide: Big Data Processing Made Simple; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2018. [Google Scholar]
  43. Dalleiger, S.; Vreeken, J. Discovering Significant Patterns under Sequential False Discovery Control. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 14–18 August 2022; Zhang, A., Rangwala, H., Eds.; ACM: New York, NY, USA, 2022; pp. 263–272. [Google Scholar] [CrossRef]
Figure 1. Overall framework of distributed PFWER false positive control.
Figure 2. Pattern mining purpose.
Figure 3. Pattern mining purpose.
Figure 4. Pattern mining purpose.
Figure 5. Condition tree of I4.
Figure 6. Parallel label replacement.
Figure 7. Find hypothetical computing frameworks in parallel.
Figure 8. Constructing frequent 1-item sets.
Figure 9. The number of replacement experiments.
Figure 10. Run time changes.
Figure 11. The number of modes before and after pruning operations in different datasets.
Figure 12. Impact of pruning operation on support count.
Figure 13. Significant threshold before and after pruning operation.
Figure 14. PFWER support for different datasets.
Figure 15. Modified significance thresholds for different datasets of PFWER.
Figure 16. Runtime comparison of distributed PFWER control algorithms with existing algorithms.
Table 1. N-hypothesis test result table.
                                | Do Not Reject H0 | Reject H0 | Total
Original hypothesis H0 is true  | U                | V         | n0
Original hypothesis H0 is false | T                | S         | n − n0
Total                           | n − R            | R         | n
Table 2. 2 × 2 contingency table.
      | B1    | B2    | Total
A1    | a     | b     | a + b
A2    | c     | d     | c + d
Total | a + c | b + d | n
Table 3. A 2 × 2 contingency table.
Variables | Do Not Reject H0 | Reject H0      | Column Total
li = l1   | σ1(S)            | n1 − σ1(S)     | n1
li = l0   | σ0(S)            | n − n1 − σ0(S) | n − n1
Row total | σ(S)             | n − σ(S)       | n
Table 4. Transaction dataset with index.
Index TID | Labels | Transaction
0         | 0      | I2, I5
1         | 1      | I1, I3
2         | 1      | I1, I2, I3
3         | 0      | I2
4         | 0      | I1, I2, I3, I4
5         | 1      | I1, I2, I3, I4
6         | 1      | I2
7         | 1      | I1, I3
8         | 0      | I1, I2, I3, I5
9         | 0      | I1, I3, I4
10        | 1      | I2
11        | 1      | I2
12        | 0      | I1, I3, I4
Table 5. Vertical data format transaction dataset.
Item Set | TID-Set
I1       | 1, 2, 3, 5, 7, 8, 9, 12
I2       | 0, 2, 3, 4, 5, 6, 8, 10, 11
I3       | 1, 2, 3, 5, 7, 8, 9, 12
I4       | 3, 5, 9, 12
I5       | 0, 8
Table 6. Coding environment description.
Encoding Software and Hardware Environment
CPU                     | Intel(R) Core(TM) i7-10750H CPU @ 2.60 GHz 2.59 GHz
Memory                  | 16.00 GB
Hard disk               | 500 GB
Operating system        | Windows 10
System type             | 64-bit OS, x64-based processor
Development tools       | IDEA
Development environment | JDK 1.8, Hadoop 2.7.7, Spark 2.4.4
Table 7. Experimental environment description.
Test Software and Hardware Environment
CPU                      | Intel(R) Xeon(R) CPU E5-2420 0 @ 1.90 GHz
Memory                   | 24.00 GB
Hard disk                | 2 TB
Operating system         | Red Hat Enterprise Linux Server release 6.3
System type              | x86_64
Experimental environment | JDK 1.8, Hadoop 2.7.7, Spark 2.4.4
Table 8. Experimental dataset.
Dataset            | |D|       | Number of Items | Average Length of Transactions | n/n1
Mushroom (L)       | 8124      | 118             | 22                             | 2.08
Breast Cancer (L)  | 7325      | 1129            | 6.7                            | 1.11
A9a (L)            | 32,561    | 247             | 13.9                           | 4.17
Bms-Web1 (U)       | 58,136    | 60,978          | 2.51                           | 33.33
Bms-Web2 (U)       | 77,158    | 330,285         | 4.59                           | 25
Retail (U)         | 88,162    | 16,470          | 10.3                           | 2.13
Ijcnn1 (L)         | 91,701    | 44              | 13                             | 10
T10I4D100K_new (U) | 100,000   | 870             | 10.1                           | 12.5
Codrna (L)         | 271,617   | 16              | 8                              | 3.03
Covtype (L)        | 581,012   | 64              | 11.9                           | 2.04
Susy (U)           | 5,000,000 | 190             | 43                             | 2.08
Table 9. Time comparison before and after pruning.
Dataset       | Before pruning (s) | After pruning (s)
Mushroom      | 656.3              | 77.5
A9a           | 1706.9             | 1016.5
Bms-Web2      | 226.0              | 119.2
Breast Cancer | 833.9              | 5526.3
Cod-Rna       | 1066.3             | 844.2
Retail        | 53.4               | 39.5
Ijcnn1        | 8837.0             | 7157.1

Liu, X.; Zhao, Y.; Xu, T.; Wahab, F.; Sun, Y.; Chen, C. Efficient False Positive Control Algorithms in Big Data Mining. Appl. Sci. 2023, 13, 5006. https://doi.org/10.3390/app13085006

