The Iterative Exclusion of Compatible Samples Workflow for Multi-SNP Analysis in Complex Diseases

Abstract: Complex diseases are affected by various factors, and single-nucleotide polymorphisms (SNPs) are the basis for their susceptibility by affecting protein structure and gene expression. Complex diseases often arise from the interactions of multiple SNPs and are investigated using epistasis detection algorithms. Nevertheless, the computational burden associated with the "combination explosion" hinders these algorithms' ability to detect these interactions. To perform multi-SNP analysis in complex diseases, the iterative exclusion of compatible samples (IECS) workflow is proposed in this work. In the IECS workflow, qualitative comparative analysis (QCA) is first employed as the calculation engine to calculate the solution; second, the pattern is extracted from the prime implicant with the greatest raw coverage in the solution; then, the pattern is tested with the chi-square test in the source dataset; finally, all samples compatible with the pattern are excluded from the current dataset. This process is repeated until the QCA calculation has no solution or the iteration threshold is reached. The workflow was applied to analyze simulated datasets and an Alzheimer's disease dataset, and its performance was compared with that of the BOOST and MDR algorithms. The findings illustrated that IECS exhibits greater power with less computation and can be applied to perform multi-SNP analysis in complex diseases.


Introduction
Single-nucleotide polymorphism (SNP), the most prevalent form of genetic variation in the human genome, represents a third-generation genetic marker [1][2][3][4]. SNPs are connected to the occurrence of inherited diseases in humans [5], although the mechanism underlying this phenomenon is still poorly understood [6,7]. Some associations between SNPs and diseases have been discovered, including the primary effect of single SNPs, interactions between SNPs, and interactions between SNPs and the environment [8][9][10]. The main effects of single SNPs can be detected by single-point association analysis [11][12][13]. However, this approach can only explain a small portion of complex diseases. To explain more complex diseases, epistasis analysis is required to detect SNP-SNP interactions [14,15].
Studies of epistasis analysis methods started with small datasets. With the development of genome sequencing technologies, extensive volumes of data have been produced, resulting in the widespread implementation of genome-wide association studies (GWAS) [16]. GWAS have been carried out to identify sequence variations in the whole human genome and screen out the SNPs associated with diseases through single-point association analysis and epistasis analysis [17,18]. With the advancement of bioinformatics, numerous epistasis analysis methods have emerged, but epistasis analysis is faced with the challenge of combinatorial explosion since GWAS data are characterized by high dimensions [19].
For epistasis analysis, the methods can be mainly classified into searching, screening, and machine learning methods. The searching method transforms the mining of SNP-SNP interactions into a problem of searching for SNP combinations in an N-dimensional space. For instance, multifactor dimension reduction (MDR) is a searching method, proposed in 2001 [20], which can transform a structure of high dimensions into a structure of one dimension which consists of two levels (high risk or low risk). Following dimensionality reduction, the capability of the one-dimensional multifactor combination to identify and predict diseases can be evaluated through cross-validation and permutation tests [21]. In the following years, the MDR method has been continuously improved. For instance, an enhanced method named OR-MDR (odds ratio-based MDR) was introduced by incorporating the odds ratio as a risk indicator [22], which greatly improves the recognition ability but increases the amount of calculation. GMDR (generalized multifactor dimensionality reduction) broadens the data range of MDR to continuous variables [23]. MB-MDR (model-based multifactor dimensionality reduction) can be applied to investigate datasets with limited initial sample sizes [24]. MDRGPU is a GPU-based multifactor dimension reduction method with greatly improved computing speed [25]. QMDR is an algorithm that can identify models for quantitative traits [26]. MDR-ER, proposed in 2013, introduces a classifier function, which improves the probability of the correct classification of genotypes but also increases the amount of calculation [27]. Fuzzy MDR incorporates fuzzy set theory, in which the conditional variables can be fuzzy data between 0 and 1 [28]. UM-MDR is a unified model-based MDR method that reduces the error rate by using a regression framework with a semiparametric correction procedure [29]. The combination of classification-based multifactor dimensionality reduction (CMDR) with the differential evolution algorithm has led to the development of an innovative algorithm known as DECMDR, which improves the recognition ability but increases the amount of calculation [30]. Multi-objective MDR (MOMDR) regards the contingency table of MDR as the target equation and employs the classification accuracy and likelihood ratio to measure SNP-SNP interactions, which improves the recognition ability of MDR [31]. GFQMDR, proposed in 2018, is a method to detect interactions between genes for complex quantitative traits via generalized fuzzy classification, which can calculate multiple SNP interactions but carries a heavy computational burden [32].
The screening methods can effectively screen SNPs, delete a large number of noise sites, and retain the genetic correlation of data, thereby improving the calculation efficiency and recognition ability. In 2008, a two-stage method for epistasis analysis was reported. In this method, significant SNPs are first screened out and single SNPs with marginal effects significantly exceeding the threshold are retained, and then epistasis is identified based on the retained SNPs [33]. INTERSNP, proposed in 2009, can screen SNPs by combining SNP association, genomic location, and pathway information, and logistic regression (LR) is then used to identify higher-order epistasis based on the screened SNPs [34]. The efficient detection of all pairwise interactions in genome-wide case-control studies can be achieved through the application of the BOolean operation-based screening and testing (BOOST) approach [35]. BOOST introduces a Boolean expression of genotype, establishes a 3 × 3 contingency table, and adopts a two-stage searching approach. To evaluate all SNP pairs, a non-iterative method is utilized in the filtering stage to calculate the approximate likelihood ratio statistic [36], and the interactive impact of the chosen SNP pairs is evaluated using both the classical likelihood ratio test and the chi-square test during the testing stage [37,38].
Machine learning methods judge the phenotype of new data by learning the training data and select the SNP combinations with the strongest association with diseases by converting epistasis detection into a classification problem. Random forest (RF) [39], support vector machines (SVM) [40], neural networks (NNs) [41], and LR [42] are machine learning methods commonly used in epistasis analysis. Usually, machine learning models are difficult to interpret due to their complexity.
Some of the aforementioned algorithms can only analyze two-order SNP interactions, and some have a heavy computational workload or difficulty in interpretation. Boolean algebra is a rigorous logical calculation system that can obtain the combination of conditional variables for a specific result, which has the potential to be used to study the association between SNPs and complex diseases. Qualitative comparative analysis (QCA) [43], a configurational analysis method grounded in set theory and Boolean algebra, has been extensively applied in sociology [44,45] to examine the interplay between conditional variables and outcome variables. For an SNP dataset, there are generally many conditional variables and relatively few samples. Boolean minimization and simplification can only eliminate a small number of conditional variables, and therefore the complex solution of QCA usually has no simple prime implicants. Since complex diseases are often caused by mutated SNPs, the mutated SNPs can be extracted from the prime implicants screened according to the coverage and then combined into pathogenic patterns, followed by the chi-square test on the pathogenic pattern in the source data to check the association between the pathogenic pattern and complex diseases. To further mine the information, the samples compatible with the pathogenic pattern are excluded, and the remaining samples will be subjected to the next round of calculation. The four steps (QCA, pattern extraction, the chi-square test, and compatible sample exclusion) are iterated and form the IECS workflow.

Iterative Exclusion of Compatible Samples Workflow
In set theory, the definition of a subset is as follows: consider two sets X and Y; if every element in X is also an element in Y, then X is a subset of Y. A subset is a sufficient condition for a superset, and it can be logically deduced that if X, then Y. In real situations, complete subset relationships are very rare. Therefore, it is necessary to evaluate the extent to which a condition set is sufficient for an outcome set, namely, consistency. Consistency represents the proportion of samples with a particular antecedent or a combination of antecedents that share the same outcome. Coverage represents the extent to which the subset covers the target set, which can be used to measure the empirical importance of a particular antecedent or a combination of antecedents and represents the explanatory power of the dataset with respect to the result.
QCA explores how the outcome occurs as a whole by examining the subset relationship of sufficiency between the conditional variables and the outcome variable. In sufficiency analysis, the conditional variable is taken as the subset of the outcome variable, whose consistency is calculated by Equation (1).

$$\mathrm{Consistency}(X_i \le Y_i) = \frac{\sum_i \min(X_i, Y_i)}{\sum_i X_i} \qquad (1)$$

The coverage is calculated by Equation (2).

$$\mathrm{Coverage}(X_i \le Y_i) = \frac{\sum_i \min(X_i, Y_i)}{\sum_i Y_i} \qquad (2)$$

where X_i denotes the value of the conditional variable and Y_i denotes the value of the outcome variable for sample i.
The flow chart of IECS is presented in Figure 1. In IECS, the iteration of four steps (QCA, pattern extraction, the chi-square test, and exclusion of compatible samples) is used to analyze the sufficiency relationship between SNPs and complex diseases.
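As a minimal sketch of Equations (1) and (2), the two measures can be computed for crisp (0/1) data as follows. The function names and the toy vectors are illustrative assumptions, not part of the original workflow code.

```python
from typing import Sequence


def sufficiency_consistency(x: Sequence[float], y: Sequence[float]) -> float:
    """Equation (1): sum_i min(x_i, y_i) / sum_i x_i."""
    overlap = sum(min(xi, yi) for xi, yi in zip(x, y))
    total = sum(x)
    return overlap / total if total else 0.0


def sufficiency_coverage(x: Sequence[float], y: Sequence[float]) -> float:
    """Equation (2): sum_i min(x_i, y_i) / sum_i y_i."""
    overlap = sum(min(xi, yi) for xi, yi in zip(x, y))
    total = sum(y)
    return overlap / total if total else 0.0


# Toy data (assumed): x = 1 if the sample carries the SNP combination,
# y = 1 if the sample has the disease.
x = [1, 1, 1, 0, 0, 0]
y = [1, 1, 0, 1, 0, 0]
print(sufficiency_consistency(x, y))  # share of carriers that are cases
print(sufficiency_coverage(x, y))     # share of cases that are carriers
```

For 0/1 data, min(x_i, y_i) is 1 only when a sample both carries the combination and has the disease, so the two ratios reduce to simple proportions.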
QCA obtains the solution by constructing a truth table according to the dataset and performing Boolean minimization, simplification, and elimination of some conditional variables, and the resultant solution is a combination of multiple prime implicants. If there is no solution, the items obtained in previous rounds of iteration are output, and the IECS workflow is ended.
Pattern extraction selects the prime implicant with the greatest raw coverage in the solution, extracts the conditional variables with "1", and combines these conditional variables into a pattern. If all the conditional variables in the prime implicant are "0", the prime implicant with the second greatest raw coverage is selected, and so on. If no pattern can be extracted from any prime implicant in the solution, the items obtained in previous rounds of iteration are output, and the IECS workflow is ended.
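The pattern extraction rule described above can be sketched as follows. The representation of a prime implicant as a pair (retained literals, raw coverage) and the name `extract_pattern` are assumptions made for illustration; a real QCA engine will have its own output format.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

# Assumed representation: (retained literals as SNP -> 0/1, raw coverage)
PrimeImplicant = Tuple[Dict[str, int], float]


def extract_pattern(solution: List[PrimeImplicant]) -> Optional[FrozenSet[str]]:
    """Return the mutated SNPs of the highest-coverage usable prime implicant."""
    ranked = sorted(solution, key=lambda pi: pi[1], reverse=True)
    for literals, _raw_coverage in ranked:
        mutated = frozenset(snp for snp, value in literals.items() if value == 1)
        if mutated:          # skip prime implicants whose literals are all "0"
            return mutated
    return None              # no pattern can be extracted from this solution


solution = [
    ({"A": 0, "B": 1, "D": 1}, 0.557),  # greatest raw coverage
    ({"C": 1, "E": 0}, 0.300),
]
print(sorted(extract_pattern(solution)))
```

Returning `None` corresponds to the stopping case in which no pattern can be extracted from any prime implicant.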
The chi-square test is then employed to test whether the pattern is related to the complex disease in the source dataset.
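A minimal sketch of this step is given below, assuming a Pearson chi-square test on the 2 × 2 table of pattern carriers versus disease status, without continuity correction (the text does not state which variant is used). The function name, the dictionary-based sample format, and the outcome column name "S" are illustrative assumptions. For one degree of freedom, the p-value equals erfc(sqrt(chi2 / 2)), so only the standard library is needed.

```python
import math
from typing import Dict, Iterable, List, Tuple


def pattern_chi_square(samples: List[Dict[str, int]],
                       pattern: Iterable[str],
                       outcome: str = "S") -> Tuple[float, float]:
    """Pearson chi-square (df = 1) for pattern carriers vs. disease status."""
    pattern = list(pattern)
    a = b = c = d = 0
    for s in samples:
        carrier = all(s[snp] == 1 for snp in pattern)
        case = s[outcome] == 1
        if carrier and case:
            a += 1
        elif carrier:
            b += 1
        elif case:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:                      # degenerate table: no evidence
        return 0.0, 1.0
    chi2 = n * (a * d - b * c) ** 2 / margins
    p = math.erfc(math.sqrt(chi2 / 2))    # survival function of chi-square, df = 1
    return chi2, p


source = ([{"B": 1, "D": 1, "S": 1}] * 30 + [{"B": 1, "D": 1, "S": 0}] * 10
          + [{"B": 0, "D": 1, "S": 1}] * 10 + [{"B": 0, "D": 0, "S": 0}] * 30)
chi2, p = pattern_chi_square(source, ["B", "D"])
print(chi2, p)
```

Note that the test is always run on the source dataset, not on the reduced dataset that remains after earlier exclusions.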
Exclusion of compatible samples is performed to exclude all samples compatible with the pattern and subject the remaining samples to the next round of analysis.
This cycle is repeated until the preset maximum number of iterations is reached, at which point the results are output and the IECS workflow is ended.
IECS can work in two modes: the first mode restricts the number of iterations, while the second mode has no limitations on the number of iterations. In the first mode, assume that the number of iterations is set to k. If each round of QCA produces a solution and a pathogenic pattern can be extracted, the program will iteratively run until the preset number of iterations is reached. However, if QCA has no solution or no pathogenic pattern can be extracted in a certain round, the program will end and n pathogenic patterns will be obtained (n < k). On the other hand, in the second mode, the program will run iteratively until the next round of QCA calculation has no solution or no pathogenic pattern can be extracted. It is recommended to initially limit the number of iterations to a smaller value and then decide whether to increase the number of iterations or switch to the second mode after observing the results. This approach helps avoid excessive analysis time in the beginning.
The framework of IECS with data examples is presented in Figure 2. In the first mode, IECS performs the first round of analysis: QCA obtains n prime implicants, among which PI-1 has the greatest coverage (0.557). Therefore, the pathogenic pattern of simultaneous mutations of SNP B and SNP D is extracted from PI-1. The p-value (0.023) for this pathogenic pattern is calculated in the source dataset. Next, samples that are compatible with the pathogenic pattern (such as sample 2) are excluded. Then, it is checked whether the number of iterations has been reached. If so, the IECS workflow is ended, and all the items are output. If not, IECS continues with a further round of iteration on the remaining samples. During the iterations, if the solution of QCA is empty or the extracted pathogenic pattern is empty, the items obtained in previous rounds of iteration are output, and the IECS workflow is ended.
In the second mode, IECS works until a certain round of QCA solution is empty or the extracted pathogenic pattern is empty.
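The control flow of the two modes can be sketched as below. This is a hedged illustration only: `truth_table_engine` is a toy stand-in that returns unminimized truth-table rows passing the consistency threshold, not a real QCA engine with Boolean minimization, and all names, the sample format, and the default threshold of 0.8 are assumptions. Setting `k=None` corresponds to the second mode.

```python
from collections import defaultdict
from typing import Callable, Dict, FrozenSet, List, Optional, Tuple

Sample = Dict[str, int]                        # SNP name -> 0/1, plus outcome column
PrimeImplicant = Tuple[Dict[str, int], float]  # (retained literals, raw coverage)
PValue = Callable[[List[Sample], FrozenSet[str]], float]


def truth_table_engine(samples: List[Sample], threshold: float,
                       outcome: str) -> List[PrimeImplicant]:
    """Toy stand-in for QCA (assumption): no Boolean minimization is performed."""
    groups = defaultdict(list)
    for s in samples:
        key = tuple(sorted((k, v) for k, v in s.items() if k != outcome))
        groups[key].append(s[outcome])
    cases = sum(s[outcome] for s in samples)
    if cases == 0:
        return []                              # empty solution
    return [(dict(key), sum(ys) / cases)
            for key, ys in groups.items()
            if sum(ys) / len(ys) >= threshold]


def iecs(samples: List[Sample], engine=truth_table_engine,
         k: Optional[int] = None, threshold: float = 0.8, outcome: str = "S",
         p_value: Optional[PValue] = None):
    """Iterate: solve, extract pattern, test on source data, exclude samples."""
    current = list(samples)
    items = []
    while k is None or len(items) < k:         # mode 1 if k is set, else mode 2
        solution = engine(current, threshold, outcome)
        pattern = None
        for literals, _cov in sorted(solution, key=lambda pi: pi[1], reverse=True):
            mutated = frozenset(snp for snp, v in literals.items() if v == 1)
            if mutated:
                pattern = mutated
                break
        if pattern is None:                    # no solution or no pattern: stop
            break
        p = p_value(list(samples), pattern) if p_value else None
        items.append((pattern, p))             # tested on the source dataset
        remaining = [s for s in current
                     if not all(s[snp] == 1 for snp in pattern)]
        if len(remaining) == len(current):     # guard against a stalled loop
            break
        current = remaining
    return items


data = ([{"A": 1, "B": 1, "C": 0, "D": 0, "S": 1}] * 3
        + [{"A": 0, "B": 0, "C": 1, "D": 1, "S": 1}] * 2
        + [{"A": 0, "B": 0, "C": 0, "D": 0, "S": 0}] * 4
        + [{"A": 1, "B": 0, "C": 0, "D": 1, "S": 0}] * 2)
for pattern, p in iecs(data):
    print(sorted(pattern), p)
```

Passing the engine and the p-value function as parameters keeps the loop independent of any particular QCA implementation; the stall guard is an addition for safety and is not described in the text.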

Algorithm 1 IECS
Input: k: threshold of iterations; consistency threshold: threshold of consistency; U: set of samples.
Output: Solution: the SNP combinations with the p-value of the chi-square test.

Analysis of Necessary Conditions
Analysis of necessary conditions considers complex diseases as the subsets of single SNPs and calculates the consistency and coverage parameters. Then, single SNPs with consistency and coverage greater than the threshold are selected, followed by a chi-square test to screen single SNPs as the necessary conditions (with statistical significance) of complex diseases.
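A minimal sketch of the screening step follows, assuming the usual set-theoretic definitions for necessity, in which the roles of condition and outcome are swapped relative to sufficiency: consistency is sum_i min(x_i, y_i) / sum_i y_i and coverage is sum_i min(x_i, y_i) / sum_i x_i. The function names are hypothetical, and the default thresholds (0.9 and 0.5) are those quoted later for the Alzheimer's disease analysis.

```python
from typing import Sequence, Tuple


def necessity_scores(x: Sequence[int], y: Sequence[int]) -> Tuple[float, float]:
    """Return (consistency, coverage) of condition x as necessary for outcome y."""
    overlap = sum(min(xi, yi) for xi, yi in zip(x, y))
    consistency = overlap / sum(y) if sum(y) else 0.0   # outcome inside condition
    coverage = overlap / sum(x) if sum(x) else 0.0      # relevance of condition
    return consistency, coverage


def is_necessary(x: Sequence[int], y: Sequence[int],
                 min_consistency: float = 0.9, min_coverage: float = 0.5) -> bool:
    """Screen a single condition before the chi-square test."""
    consistency, coverage = necessity_scores(x, y)
    return consistency > min_consistency and coverage > min_coverage


# Toy data (assumed): x = 1 means "no mutation of the SNP" (a negated condition).
no_mutation = [1, 1, 1, 1, 1, 0, 1, 1]
disease = [1, 1, 1, 1, 0, 0, 0, 0]
print(necessity_scores(no_mutation, disease), is_necessary(no_mutation, disease))
```

Conditions that pass this screen would then be submitted to the chi-square test to check statistical significance.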

Performance Measurements
The recognition ability (power) and runtimes of MDR, BOOST, and IECS were compared. Power was measured as the ratio of the number of datasets identified by the algorithm to the total number of datasets [35]. Power is calculated as follows:

$$\mathrm{Power} = \frac{N_T}{N_D}$$

where N_T denotes the number of identified datasets, a dataset being counted as identified when the whole solution has at least one item that is the same as an item in the logical expression of the pathogenic model [46], and N_D denotes the total number of datasets, which was set to 1000 in this experiment. Runtime is obtained by calculating the average time that the program runs on each dataset of each dataset group.
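The counting rule for power can be illustrated with a short sketch. Representing each item as a frozenset of SNP names, and the function name `power`, are assumptions made for this example.

```python
from typing import FrozenSet, Iterable, List, Set


def power(solutions: List[Iterable[FrozenSet[str]]],
          model_items: Set[FrozenSet[str]]) -> float:
    """N_T / N_D: share of datasets whose solution shares an item with the model."""
    identified = sum(1 for solution in solutions
                     if set(solution) & model_items)
    return identified / len(solutions)


# Pathogenic model A x B + C x D = S has two items: {A, B} and {C, D}.
model = {frozenset("AB"), frozenset("CD")}
solutions = [
    [frozenset("AB")],                    # identified
    [frozenset("AE")],                    # not identified
    [frozenset("CD"), frozenset("BE")],   # identified
    [],                                   # empty solution
]
print(power(solutions, model))
```

Here two of the four toy datasets count as identified, so the sketch reports a power of 0.5.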

Simulated Data
Suppose that a disease S is caused by the simultaneous mutation of SNP-A and SNP-B or of SNP-C and SNP-D, and that E is added to stand for any other SNP. The configuration table of all logical combinations for disease S is expressed as A × B × C × D × E × S, and the pathogenic model is recorded as A × B + C × D = S.
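A dataset following this pathogenic model could be generated along the following lines. This is only a sketch under stated assumptions: the text does not specify how noise is injected or how mutation frequencies are chosen, so here noise is taken to mean the fraction of samples whose outcome label is flipped, and each SNP is mutated independently with probability `mutation_rate` (both hypothetical choices).

```python
import random
from typing import Dict, List


def simulate(n: int, noise: float, seed: int = 0,
             mutation_rate: float = 0.5) -> List[Dict[str, int]]:
    """Generate n samples under A x B + C x D = S, then flip a share of labels."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        s = {snp: int(rng.random() < mutation_rate) for snp in "ABCDE"}
        s["S"] = int((s["A"] and s["B"]) or (s["C"] and s["D"]))
        samples.append(s)
    # Assumed noise model: flip the outcome of exactly round(noise * n) samples.
    for i in rng.sample(range(n), round(noise * n)):
        samples[i]["S"] ^= 1
    return samples


def violates_model(s: Dict[str, int]) -> bool:
    """True if the sample's outcome disagrees with A x B + C x D."""
    return s["S"] != int((s["A"] and s["B"]) or (s["C"] and s["D"]))


data = simulate(n=1000, noise=0.3)
print(sum(violates_model(s) for s in data))  # number of noisy samples
```

SNP E never enters the outcome rule, so it plays the role of an irrelevant variable that the analysis should eliminate.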

Alzheimer's Disease Data
The etiology of Alzheimer's disease (a neurodegenerative disorder) remains unknown [47]. The performance of IECS was further tested in a real dataset of Alzheimer's disease downloaded from the Kaggle website. This dataset encompasses 257 Chinese individuals diagnosed with sporadic Alzheimer's disease along with 242 control subjects exhibiting normal cognitive function. The average age of the patients at examination was 76.7 years, and the average age of the controls was 80.0 years.

Results and Discussion
All calculations were executed on the same computer with the configuration as follows: CPU, Intel(R) Core(TM) i7-8700 CPU @ 3.20 GHz; RAM, 16.0 GB; OS, Windows 10 64 Bit.

Simulated Data Experiment
For Collection I, the power levels of IECS, MDR, and BOOST are presented in Figure 3a. The results revealed that when the noise was lower than or equal to 60%, IECS exhibited a greater power compared to both the MDR and BOOST methods; when the noise level reached 70%, the power of IECS was slightly lower than that of MDR, but greater than that of BOOST. IECS utilizes QCA for calculation and then extracts the pattern for the chi-square test, which can minimize the negative impact caused by noise and then more accurately identify SNP interactions.
Noise showed a great influence on the power: the power of IECS, MDR, and BOOST was close to 1 when the noise was 10% and gradually decreased with increasing noise. A greater noise ratio represents more interfered samples. For MDR, a greater noise ratio means a greater error probability in defining different combinations of SNP pairs and accordingly a higher probability of incorrect results in the cross-validation, thereby leading to the smaller power of the algorithm; for BOOST, it means a lower probability that the distribution of the contingency table is consistent with the pathogenic model, so the power is smaller; and for IECS, it means a greater reduction in the consistency and coverage during QCA, which has a greater impact on the identification process and then leads to smaller power.

Figure 3. (a) The power levels of IECS, MDR, and BOOST for simulated data with different noise levels. When the noise was less than or equal to 60%, the power of IECS was greatest; when the noise was 70%, the power of IECS was slightly less than that of MDR, but greater than that of BOOST. (b) The power levels of the three algorithms with different numbers of samples. With the increase in the number of samples, the power of MDR and BOOST increases slowly and that of IECS remains constant; the more samples, the more information about the pathogenic model obtained by the MDR and BOOST algorithms and the greater the power. However, in IECS, the QCA calculation engine becomes insensitive to the number of samples beyond a certain amount.
The runtimes of IECS, MDR, and BOOST for simulated data with different noise ratios are shown in Table 1. In general, BOOST is the fastest, followed by IECS and then MDR. MDR performs the permutation test on multiple new datasets generated by randomly shuffling the outcome of the original samples and then carries out MDR analysis on these new datasets, which is very computationally intensive, making it the slowest of the three methods. BOOST employs an approximate approach to evaluate all pairs of loci by calculating the approximate likelihood ratio in a non-iterative way, which reduces the runtime by simplifying the calculation. With increasing noise ratio, the runtime remains almost constant.
For Collection II, the power levels of IECS, MDR, and BOOST for simulated data with different numbers of samples are presented in Figure 3b. With an increasing number of samples, the power of MDR and BOOST increases slowly, while that of IECS remains almost constant. With more samples, more information about the pathogenic model could be obtained by MDR and BOOST, which would contribute to a greater power. However, in IECS, the QCA calculation engine becomes insensitive to the number of samples beyond a certain amount. When the noise is constant, the number of excluded samples will be adjusted proportionally, and then the extracted pattern is almost unchanged, resulting in the constant power of IECS.
The runtimes of IECS, MDR, and BOOST for simulated data with different numbers of samples are presented in Table 1. With an increasing number of samples, the runtime of MDR and IECS will increase, because more information needs to be calculated for more samples, and therefore more runtime is consumed. BOOST employs an approximate approach to evaluate all pairs of loci by calculating the approximate likelihood ratio in a non-iterative way, which is not sensitive to the number of samples, resulting in an almost constant runtime.
Based on the above results of simulated datasets with a comprehensive comparison of power and runtime, IECS has a stronger recognition ability for pathogenic models with an acceptable runtime.

Alzheimer's Disease Data Experiment
The dominant model was adopted: homozygous wild-type genotypes were coded as 0, and heterozygous genotypes (one wild-type and one mutant allele) or homozygous mutant genotypes were coded as 1.
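The dominant coding can be sketched as a small helper. The genotype string format (for example "CA" or "C/A") and the function name `dominant_code` are assumptions for illustration; the actual dataset layout is not described here.

```python
def dominant_code(genotype: str, wild_type: str) -> int:
    """Return 0 for homozygous wild type, 1 if any mutant allele is present."""
    alleles = [a for a in genotype.upper() if a.isalpha()]  # drop "/" or "|"
    if not alleles:
        raise ValueError(f"no alleles found in genotype {genotype!r}")
    return 0 if all(a == wild_type.upper() for a in alleles) else 1


# Example for a C > A variant, where C is the wild-type allele.
for genotype in ("CC", "CA", "A/A"):
    print(genotype, dominant_code(genotype, wild_type="C"))
```

Under this coding, heterozygous and homozygous mutant genotypes are indistinguishable, which matches the binary (crisp-set) input that the QCA step works with.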
Four iterations were completed in the IECS workflow. The pattern of the simultaneous mutation of IVS22+36 C>A and 3'UTR159 C>T was extracted in the first round; the IVS17−294 C>T mutation pattern was extracted in the second round; the simultaneous mutation pattern of IVS3+106 T>G, IVS10−5 C>T, and 3'UTR159 C>T was extracted in the third round; and the IVS22+36 C>A mutation pattern was extracted in the fourth round. The results of the QCA analysis are given in Table 2.
The chi-square test was employed to analyze the predictive power of the four patterns in the source dataset. The p-value of the pattern with simultaneous mutation of IVS3+106 T>G, IVS10−5 C>T, and 3'UTR159 C>T was 0.909, which was greater than 0.05, and therefore this pattern was excluded from the pathogenic model.
The relationship between SNPs and Alzheimer's disease was thus obtained by IECS: if IVS22+36 C>A and 3'UTR159 C>T are simultaneously mutated, or IVS17−294 C>T or IVS22+36 C>A is mutated, the individual may develop Alzheimer's disease.
The Alzheimer's disease data were also analyzed by MDR and BOOST, and the results and comparisons with those of IECS are shown in Table 3. IECS obtained three significant items with a runtime of 4.496 s. MDR obtained one significant item, namely, the simultaneous mutation of IVS22+36 C>A and 3'UTR159 C>T, with a runtime of 42.591 s. BOOST obtained an insignificant item, namely, the simultaneous mutation of IVS10−5 C>T and IVS22+36 C>A, with a runtime of 0.368 s. Previous research has demonstrated the interactive association of Alzheimer's disease with two SNPs, namely, IVS22+36 C>A and 3'UTR159 C>T, located within introns [22]. This interaction was determined using the multifactor dimensionality reduction method based on a log-linear model and the multifactor dimensionality reduction algorithm. IVS17−294 C>T in introns was associated with an increase in the risk for Alzheimer's disease, as indicated by the statistical analysis and the haplotype analysis; in addition, IVS22+36 C>A in introns was also related to a higher risk of Alzheimer's disease [48].
For the Alzheimer's disease dataset, IECS obtained more results than MDR and BOOST with a relatively short runtime, and its results are all supported by the literature, demonstrating that IECS can detect multiple SNPs related to complex diseases.
For the analysis of necessary conditions, we took Alzheimer's disease as the outcome and eight SNPs as the conditional variables. The relation of necessary conditions was expressed logically as "No A, then no B". A conditional variable is deemed necessary if its consistency exceeds 0.9 and its coverage surpasses 0.5 [49]. There are four conditional variables necessary for Alzheimer's disease: ∼−204 G>C, ∼C.401 A>G, ∼IVS10−5 C>T, and ∼IVS15+144 T>A ("∼" denotes no mutation of the SNP). A chi-square test was then conducted on these necessary conditional variables, taking the disease as the outcome. The results are shown in Table 4. "No mutation of −204 G>C" under Alzheimer's disease is significant with a p-value of 0.008, suggesting that if the SNP of −204 G>C is mutated, the individual will not develop Alzheimer's disease. The other three conditional variables are not significant. In a previous study, analysis of the transcription factor binding site performed by Consite showed that the mutation at the position of −204 G>C enables it to enhance the expression of neprilysin and reduce the accumulation of Aβ (amyloid beta) in the brain, which possibly hinders Alzheimer's disease [48]. The relationships between SNPs and Alzheimer's disease obtained by IECS and the analysis of necessary conditions are shown in Figure 4. All results are supported by the literature.

Conclusions
The IECS workflow with QCA as the calculation engine was proposed and applied to analyze simulated datasets and the real dataset of Alzheimer's disease, and its performance was compared with that of the BOOST and MDR algorithms. The results revealed that IECS has greater power with relatively less computation cost. IECS has a relatively acceptable runtime and can compute high-dimensional pathogenic patterns with greater power. IECS could be applied to multi-SNP analysis in complex diseases as well as gene-gene and gene-environment interactions to explore the causes of complex diseases. In further research, we would use IECS to analyze more datasets to explore complex diseases and accelerate the computing speed of IECS.
Author Contributions: J.G.: idea conceptualization, project administration, and funding acquisition. L.Z., W.X. and X.Z.: methodology and validation. W.X.: article writing. All authors have …

Figure 1. Flow chart of IECS workflow. IECS utilizes the iteration of four steps (QCA, pattern extraction, the chi-square test, and exclusion of compatible samples) to analyze the relationship between SNPs and complex diseases.


Figure 2. Framework of IECS with data examples. For the data examples, the row name is the sample ID; each column indicates one SNP, except the last column, which indicates whether there is a disease. For conditions, red squares indicate the mutation of the SNP, and green squares indicate no mutation of the SNP. For the result, red squares indicate the disease, and green squares indicate no disease. In the first mode, IECS performs the first round of analysis, and QCA obtains n prime implicants, among which PI-1 has the greatest coverage (0.557). Therefore, the pathogenic pattern of simultaneous mutations of SNP B and SNP D is extracted from PI-1. The p-value (0.023) for this pathogenic pattern is calculated in the source dataset. Next, samples that are compatible with the pathogenic pattern (such as sample 2) are excluded. This cycle is repeated until the preset maximum number of iterations is reached, at which point the results are output and the IECS workflow is ended. During the iterations, if the solution of QCA is empty or the extracted pathogenic pattern is empty, the items obtained in previous rounds of iteration are output, and the IECS workflow is ended. In the second mode, IECS works until a certain round of QCA solution is empty or the extracted pathogenic pattern is empty.


Algorithms 2023, 16, x FOR PEER REVIEW

Table 1. Runtimes of IECS, MDR, and BOOST for simulated data with different noise percentages and numbers of samples.

Table 2. QCA results for Alzheimer's disease data of four iterations.

Table 3. Comparison of IECS, MDR, and BOOST for Alzheimer's disease data.

Table 4. Results of analysis of necessary conditions for Alzheimer's disease data. Tildes "~" indicate no mutation of the SNP.