Integrating Statistical and Machine-Learning Approaches for Meta-Analysis of Bisphenol A Exposure Datasets Reveals Effects on Mouse Gene Expression within Pathways of Apoptosis and Cell Survival

Bisphenols are important environmental pollutants that are extensively studied due to their various detrimental effects, although the molecular mechanisms behind these effects are less well understood. Like other environmental pollutants, bisphenols are being tested in various experimental models, creating large expression datasets deposited in open-access storage. The meta-analysis of such datasets is, however, very complicated for various reasons. Here, we developed an integrated statistical and machine-learning approach for the meta-analysis of bisphenol A (BPA) exposure datasets from different mouse tissues. We constructed three joint datasets following three different strategies for dataset integration: in particular, using all common genes from the datasets, only uncorrelated genes, and only non-co-expressed genes, respectively. By applying machine learning methods to these datasets, we identified genes whose expression was significantly affected in all of the BPA microarray data tested: genes involved in the regulation of cell survival (Tnfr2, Hgf-Met, Agtr1a, and Bdkrb2, signaling through Mapk8 (Jnk1)), DNA repair (Hgf-Met, Mgmt), apoptosis (Tmbim6, Bcl2, Apaf1), and cellular junctions (F11r, Cldnd1, Ctnnd1, and Yes1). Our results highlight the benefit of combining existing datasets for the integrated analysis of a specific topic when individual datasets are limited in size.


Introduction
Bisphenols have been in commercial use as plasticizers for over 70 years. They are reported to be estrogenic mimics that may interfere with hormonal homeostasis. One very prevalent bisphenol, bisphenol A (BPA), is used for manufacturing polysulfones and polycarbonate plastics, epoxy resins, and thermal paper. BPA is considered an endocrine disruptor. The growing amounts of functional genomics data from various sources necessitate innovative and more powerful methods to utilize these data for novel analyses.
In this manuscript, we studied gene expression changes in four available microarray datasets of mice under the influence of BPA exposure. The standard approach to analysing gene expression changes is to perform differential expression analysis with statistical tests for differences in intensity [27]. We performed this traditional differential gene expression analysis on the individual GEO datasets. However, this method suffers from various issues, such as the uncertainty of choosing a p-value threshold that selects the right set of "important" genes in terms of biological effects, and the necessity of dealing with the problem of multiple comparisons. In contrast, machine learning methods, especially feature selection methods, are widely used today in gene expression analysis, providing the ability to select the right set of "important" genes in terms of the quality of the prediction model [27,28]. In this study, we focused on applying machine learning methods for feature selection (FS) to reveal key genes influenced by BPA exposure.
We constructed three joint datasets following three different correlation-based preprocessing approaches, namely using all of the genes common across the four GEO datasets, only uncorrelated genes, and only non-co-expressed genes, respectively. By applying machine learning methods to these joint datasets, we identified genes whose expression was significantly changed in all of the BPA microarray data tested. We went on to determine that a subset of these genes is involved in the regulation of cell survival and apoptosis. Our results highlight the benefit of combining existing datasets for the integrated analysis of a specific topic when individual datasets are limited in size, in our case when studying the effects of BPA.

Differential Gene Expression Analysis
Differential gene expression analysis was performed at several statistical significance thresholds. As described in the Methods section, we declared a gene differentially expressed if an observed expression difference between two experimental conditions had an adjusted p-value < 0.05. We also performed the same analysis with an adjusted p-value < 0.1, a non-adjusted p-value < 0.05, and a non-adjusted p-value < 0.1 (Figure 1). After applying multiple testing correction, the analysis determined that GSE26728 was the only dataset with differentially expressed genes. None of the other datasets examined showed any differentially expressed genes, either with an adjusted p-value < 0.05 or with an adjusted p-value < 0.1. In contrast, all the datasets showed differentially expressed genes with both a non-adjusted p-value < 0.05 and a non-adjusted p-value < 0.1. Therefore, we could state that there were no common differentially expressed genes among the four datasets.

Machine Learning Methods
In our study, we found that ensemble-based methods (Section 4.2.1) tended to overfit the data studied (Table 1). Both the Random Forest (RF) model and the Support Vector Machine (SVM) ensemble model were able to learn the training dataset, producing 1.0 training accuracy, but failed to generalize, producing a test accuracy only slightly higher than 0.5. Due to the large differences between training and test accuracies for the fitted models, we did not use feature sets from these models in any subsequent analysis. In contrast, the iterative model was able to construct more meaningful feature sets before it overfit our data. The iterative feature selection procedure (Section 4.2.2), with two binary classification models, the Naïve Bayesian classifier (NB) and Logistic Regression (LR), was applied to the datasets. The resulting feature sets, composed of selected genes, were used to train a single SVM model in order to demonstrate the predictive ability of the selected features (Section 4.3). Tables 2 and 3 show the test/training cross-validation accuracies and ROC AUC scores of the SVM model. Although the training accuracies and ROC AUC scores remained close to 1.0, the differences between the training and test scores decreased significantly, showing the ability of the models to generalize.

Table 2. Test/train cross-validation accuracies of the SVM model, trained on all genes and on genes selected by the iterative feature selection procedure with the Naïve Bayesian classifier or the Logistic Regression classifier. The SVM model was applied to the simple scaled (simple_scaled), without correlated genes (without_correlated), and without co-expressed genes (without_coexpressed) datasets. In contrast to all genes as a feature set, the genes selected by the iterative procedure show predictive ability.

                                            Simple_Scaled   Without_Correlated   Without_Coexpressed
All genes                                   0.54/1.0        0.55/1.0             0.54/1.0
Top genes from Naive Bayesian classifier    -               0.82/0.9             0.74/0.84
Top genes from Logistic Regression          0.6/0.73        -                    -

Table 3. Test/train cross-validation ROC AUC scores of the SVM model, trained on all genes and on genes selected by the iterative feature selection procedure with the Naïve Bayesian classifier or the Logistic Regression classifier. The SVM model was applied to the simple scaled (simple_scaled), without correlated genes (without_correlated), and without co-expressed genes (without_coexpressed) datasets. In contrast to all genes as a feature set, the genes selected by the iterative procedure show predictive ability.

Gene Lists Analysis
In the next step, we analyzed the number of appearances of each feature in the feature sets obtained by 100 runs of the iterative feature selection procedure on each of the datasets (Section 4.4.1). We considered the most frequent features to be the most important genes in terms of distinguishing between the BPA-exposed and control samples. For these genes, pathway analysis was performed using DAVID [29] to determine the most enriched pathways and biological processes within each dataset (Section 4.4.2). This revealed that the most frequent genes from the simple scaled dataset (Figure 2A and Table 4), the without correlated genes dataset (Figure 2B and Table 5), and the without co-expressed genes dataset (Figure 2C and Table 6) did not cluster together in any Gene Ontology (GO) biological processes (BP). By examining the top genes for all of the datasets, we could observe 24 common genes (Table 7).

Figure 2.
Feature frequencies, obtained by 100 runs of the iterative feature selection procedure, for (A) the simple scaled dataset, (B) the dataset without correlated genes, and (C) the dataset without co-expressed genes. There are 4 features for the simple scaled dataset, 6 features for the dataset without correlated genes, and 7 features for the dataset without co-expressed genes that have noticeably higher frequencies than the other features.

Table 4. The most frequent genes within 100 runs of the iterative feature selection procedure on the simple scaled dataset.

Entrez ID   Gene Symbol   Gene Name                                          Frequency/100
654810      Appbp2os      Amyloid beta precursor protein (cytoplasmic ...)   33

Next, the 24 common genes were used for pathway analysis using DAVID to find the most enriched biological processes. The functional annotation in DAVID showed that one cluster of genes (Dbp, P2ry1, Tbl1x, Nrip1, and Yes1) was related to GO:0045944: the positive regulation of transcription from the RNA polymerase II promoter. There were also two genes belonging to the ephrin receptor signaling pathway (GO:0048013), Efnb2 and Efna4.
We then examined the intersection of the top 30 genes from the machine learning models, using a cut-off of 20 appearances. It was expected that the important (most frequent) features for the datasets without correlated and without co-expressed genes would be similar due to the similarity of the pre-processing procedures. Five common genes were found among the top 30 genes for these datasets: F11r, Pfkfb1, Zfp839, Csn1s2b, and Yes1. Moreover, there were two genes (Rad9a, Senp5) that appeared in the top 30 genes of the simple scaled dataset only.
The genes from the top 30 important genes dataset were also utilized in the pathway analysis using DAVID to determine the most enriched biological processes. Although most of the genes did not form any obvious clusters, the functional annotation in DAVID showed that the two largest clusters of Gene Ontology (GO) biological processes (BP) were related to the regulation of apoptosis (GO:0042981) and proteolysis (Table 8). In fact, two clusters, having some gene overlap, were related to the general regulation of apoptosis or the negative regulation of apoptosis (GO:0043066) (Table 8 and Figure 3). Many of the recovered genes, including Tnfr2, Hgf-Met, Agtr1a, and Bdkrb2, signal through Mapk8 (also known as Jnk1) to regulate cell survival. One of these pathways, Hgf-Met, also functions to regulate another recovered gene, Mgmt, to allow for DNA repair. Two of the recovered genes, Tmbim6 and Bcl2l1, inhibit Bax in order to prevent apoptosis, while one gene, Apaf1, is necessary for forming apoptosomes to induce apoptosis. Cellular junctions are also centrally important for cell survival, and four of the genes recovered, F11r, Cldnd1, Ctnnd1, and Yes1, function to maintain cellular junctions.

Discussion
Bisphenols are important pollutants that have extensively infiltrated the environment. Their potential to disrupt physiology has led several groups to perform microarray analysis using biological material from mice after BPA intervention. The large datasets created by these works have been deposited in public data banks [21,22], but no further analysis had been performed. Using machine learning, we mined a subset of these microarray datasets and were able not only to define a method for performing a meta-analysis of these large datasets but also to identify pathways conserved across different BPA interventions within a species.
Our research confirms the importance of combining datasets in a meta-analysis but also highlights the importance of different pre-processing steps before applying machine learning methods, especially for small datasets. In this study, we focused on various correlation-based gene averaging processes. We showed that using these strategies leads to different, but in general related, solutions. This suggests that co-expression-based pre-processing produces a dataset modification that promotes a solution candidate. Using hard voting, the final result was aggregated from the results of running three models on the dataset modifications. Strategies to improve this process might include a deeper investigation of the differences in outcomes between the models, as well as more sensitive aggregation.
BPA exposure has been shown to disrupt mitochondrial integrity, leading to elevated ROS levels and apoptosis rates in human granulosa and HT-22 cells, which are derived from the mouse brain [30,31]. BPA has also been reported to inhibit the proliferation of neural progenitor cells and rat embryonic midbrain cells through the suppression of the JNK signaling pathway [32,33]. Our gene ontology analysis using DAVID indicated that a significant group of the top 30 transcripts recovered from the machine learning models were linked to the regulation of apoptosis (see Table 8). Furthermore, by using DAVID and STRING, it became evident that many of the genes encoding these transcripts categorized as being involved in the regulation of apoptosis were, in fact, in cell survival pathways that converged on Mitogen-Activated Protein Kinase 8 (Mapk8, also known as Jnk1) (Figure 3). The protein of one of the transcripts recovered, Hepatocyte growth factor (Hgf), activates MET Proto-Oncogene, Receptor Tyrosine Kinase (Met), which not only regulates Mapk8 activation but also leads to increased transcriptional expression of another gene in our dataset, O-6-Methylguanine-DNA Methyltransferase (Mgmt) [34]. The Mgmt and Rad9a proteins, whose transcript levels were also shown to be affected by BPA, are both involved in repairing damaged DNA [35]. Furthermore, in mouse macrophages, it was determined that BPA-induced mitochondrial disruption reduced BCL2 protein expression, which led to caspase-dependent apoptosis [36]. In our study, one of the top 30 genes was Bcl2l1, whose protein is known to inhibit Bax-induced apoptosis (Figure 3). Furthermore, the protein of another gene recovered in the study, Apaf1, acts downstream of Bax and, along with Caspase-9, forms an apoptosome to induce apoptosis (Figure 3) [37]. Interestingly, we also showed that Tmbim6 transcript levels are affected by BPA, and Tmbim6 was shown to inhibit Bax-induced apoptosis (Figure 3) [38].
Three of the top 30 genes, F11 Receptor (F11r, also known as Jam-1), Claudin Domain Containing 1 (Cldnd1), and Catenin Delta 1 (Ctnnd1), are associated with either tight or adherens junctions (see Table 8 and Figure 3). In a recent study of the reproductive toxicity of BPA, male CD-1 mice were orally administered BPA, and the results showed that this exposure was sufficient to induce disorders in spermatogenesis, including damage to the tight junctions between Sertoli cells [39]. Another study examined the effect of BPA on the expression levels of tight junction (TJ) transcripts in the uterus of female rats during early pregnancy. This study found profound alterations in the TJ gene transcript levels of uterine epithelial cells when the rats were exposed to BPA, which led to changes in fluid and ion transport across the epithelium, blocking the receptivity of the uterus to blastocyst implantation [40]. In fact, this study saw profound effects on the transcript levels of claudins such as Cldnd1, including low expression levels or even the loss of expression.
Interestingly, our study recovered transcripts for two receptors that are in pathways regulated by Angiotensin I Converting Enzyme (ACE): Angiotensin II Receptor Type 1a (Agtr1a) and Bradykinin Receptor B2 (Bdkrb2) (see Table 8 and Figure 3). In both rat cardiac cells and human endothelial cell lines, it was shown that BPA was proangiogenic, including through the upregulation of Nitric Oxide Synthase 3 [41-43]. In another report, it was discovered in rat striatum that the inhibition of ACE was able to alleviate the ROS-inducing effects of a BPA + 1-methyl-4-phenylpyridinium ion (MPP(+)) mixture [44]. Interestingly, both Agtr1a and Bdkrb2 signal upstream of Nos3, where Agtr1a leads to Nos3 inhibition and Bdkrb2 leads to activation (Figure 3).
In terms of computational methods, in this paper, we suggest a new cross-validation-based greedy feature selection algorithm with three different preprocessing strategies. Using this approach, one has the flexibility to incorporate different machine learning models and stopping criteria into the feature selection procedure depending on the properties of the data. We also provided gene importance analysis based on the frequencies of the genes' appearances in the feature lists from 100 runs of the proposed algorithm. For small datasets, this process is more stable than feature selection techniques based on a single run.
Our results highlight the value of integrating data from multiple datasets for co-analysis, revealing new biological knowledge. However, a key limitation of our study is the continued scarcity of publicly available microarray data on BPA exposure, which restricted our investigation to baseline machine learning methods. This is also an important constraint for analyzing the differences between the results from the datasets without correlated and without co-expressed genes. We used co-expression analysis with the WGCNA package for each GEO dataset, but this package should be used with caution for datasets with fewer than 15 samples [45]. This means that a pre-processing method should be chosen attentively based on the available data.
In summary, we developed a new approach for the meta-analyses of microarray data, which could be very useful for analyzing other datasets relating to any environmental pollutants. The pathways that we have identified align well with the previous evidence for the molecular actions of BPA and prompt further studies into pathways that relate to the regulation of cell survival, DNA repair, apoptosis, and cellular junctions.

Dataset Collection of BPA-Exposure-Related Data
We restricted our survey to datasets devoted to BPA-exposure experiments using male mice. Four publicly available microarray datasets from the GEO database were examined: GSE26728 [21], GSE126297 [22], GSE43977 [43], and GSE44088 [43]. In GSE26728, liver gene expression was measured from CD-1 mice exposed for 28 days to bisphenol A at doses of 0 (controls), 50 (TDI or low dose), or 5000 µg/kg/day (NOAEL or high dose) via food spiking [21]. The GSE126297 dataset used pancreatic islets from OF1 male mice after exposure of the organisms to 100 µg/kg/day (two injections of 50 µg/kg/day) for four days [22]. The GSE43977 and GSE44088 datasets used hepatic samples from C57BL/6J mice [43] after exposure to ~21.93 mM (5000 ppm) for 7 days and 10 µM for 24 h, respectively. The four datasets comprise 41 samples in total: 21 untreated control samples and 20 treated samples.
We examined each dataset separately for differential expression analysis. For ML-based analysis, we combined the datasets following three different strategies. A detailed description of all pre-processing procedures follows.

Data Pre-Processing for Differential Expression Analysis of Individual Datasets
In the bioinformatic pipeline, we examined each dataset separately; the datasets themselves were provided as log2-transformed values. Expression data files were pre-processed using the R limma package (version 3.42.0) [46]. We annotated the datasets with Entrez IDs and dropped NA values. We defined low-expression genes with a constant threshold for log-transformed probe intensity values and removed them manually from the dataset [47]. We also removed probe replicates using the avereps function and performed quantile normalization using the normalizeBetweenArrays function.

Data Pre-Processing for Machine Learning-Based Analysis for Combined Datasets
In order to analyze combined datasets, we reduced each dataset to the set of genes common to all datasets. This left us with four datasets having 6742 genes each. Then, we scaled the intensity values for each gene in each dataset into the range of 0 to 1, following Equation (1):

x' = (x − min(x)) / (max(x) − min(x)), (1)

where x is an intensity value for the specific gene and min(x) and max(x) are its minimum and maximum over the samples of a dataset. Finally, we combined the scaled datasets into a single dataset, following three different strategies. The first strategy was not to use any modification. The second and third strategies use two different ways to construct independent feature sets in order to meet the requirement of machine learning algorithms with independence assumptions between the features.
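The per-gene scaling of Equation (1), followed by stacking the scaled datasets, can be sketched as follows (the matrices are illustrative toy data, not the study's expression values):

```python
import numpy as np

def scale_genes(X):
    """Scale each gene (column) of a samples-x-genes matrix into [0, 1],
    following Equation (1): x' = (x - min) / (max - min)."""
    mn, mx = X.min(axis=0), X.max(axis=0)
    rng = np.where(mx > mn, mx - mn, 1.0)   # guard against constant genes
    return (X - mn) / rng

# Two toy "datasets" sharing the same two common genes (columns);
# each is scaled per gene, then the rows are stacked into one joint matrix.
d1 = np.array([[1.0, 10.0], [3.0, 30.0]])
d2 = np.array([[2.0, 5.0], [4.0, 15.0], [6.0, 25.0]])
joint = np.vstack([scale_genes(d1), scale_genes(d2)])   # shape (5, 2)
```

Scaling each dataset separately before concatenation removes platform-specific intensity offsets, so samples from different GEO series become comparable per gene.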
Simple scaled dataset. The first strategy is to combine four datasets without any modifications, resulting in a dataset with a matrix size of 41 × 6742.
Dataset without correlated genes. In the second strategy, we built a correlation graph. In this graph, vertices correspond to genes, and edges connect pairs of genes whose Pearson correlation reaches the α level. Then, we replaced each connected component with the averaged value of its vertices. Thus, the new dataset consists of uncorrelated elements, representing genes or averaged groups of genes. We varied α from 0.7 to 0.99 and finally used 0.7 because, for higher levels, most of the genes did not belong to any correlation cluster. This strategy resulted in a dataset with a shape of 41 × 5704.
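The second strategy can be sketched with numpy and a small union-find over the correlation graph. This is a minimal illustration on toy data, and it assumes the edge criterion is the absolute Pearson correlation reaching α (the handling of negative correlations is not specified in the text):

```python
import numpy as np

def average_correlated(X, alpha=0.7):
    """Replace each connected component of the gene correlation graph
    (|Pearson r| >= alpha) with the mean of its member genes (columns)."""
    n_genes = X.shape[1]
    corr = np.corrcoef(X, rowvar=False)      # gene-gene correlation matrix

    # Union-find over genes: an edge joins genes with |r| >= alpha.
    parent = list(range(n_genes))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(n_genes):
        for j in range(i + 1, n_genes):
            if abs(corr[i, j]) >= alpha:
                parent[find(i)] = find(j)

    # Average the columns of each connected component.
    roots = sorted({find(i) for i in range(n_genes)})
    return np.column_stack(
        [X[:, [i for i in range(n_genes) if find(i) == r]].mean(axis=1)
         for r in roots])

X = np.array([[1.0, 2.0, 5.0],
              [2.0, 4.0, 3.0],
              [3.0, 6.0, 8.0],
              [4.0, 8.0, 1.0]])
reduced = average_correlated(X, alpha=0.7)   # genes 0 and 1 collapse into one column
```

On the toy matrix, genes 0 and 1 are perfectly correlated and merge, while gene 2 stays separate, shrinking 3 feature columns to 2, which mirrors the 6742-to-5704 reduction reported for α = 0.7.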
Dataset without co-expressed genes. In the third strategy, we used the R package WGCNA (version 1.46) [48] to build co-expression clustering based on biweight midcorrelation. For the combined scaled dataset, we analyzed gene co-expression with the following steps. First, we clustered the samples (in contrast to the clustering of genes described later) with the hclust function to check for potential outliers. Figure 4A shows a sample tree without any outliers.

Then, we built a gene-gene similarity network with soft-threshold power selection using the pickSoftThreshold function. Figure 4B,C show the soft-threshold power selection. We chose a threshold equal to 7 (this value is the lowest power for which the scale-free topology fit index curve flattens out upon reaching a high value).
In the next step, we built the corresponding gene network and identified modules within it. Figure 4D shows the heatmap for the gene network. Each row and column of the heatmap corresponds to a single gene. The heatmap can depict adjacencies or topological overlaps, with light colors denoting low adjacency (overlap) and darker colors denoting higher adjacency (overlap). In addition, the gene dendrograms and module colors are plotted along the top and left side of the heatmap. Based on the results presented in Figure 4D, one can conclude that the genes taken into account do not show strong co-expression.
Finally, we averaged genes among each cluster. In total, 3 clusters with 1094 genes were averaged. This strategy resulted in a dataset with a matrix size of 41 × 5651. As a result, we have obtained the simple combined dataset, the dataset without correlated genes, and the dataset without co-expressed genes.

Differential Gene Expression Analysis
We performed differential expression analysis using the R package limma (version 3.42.0) [46]. Benjamini-Hochberg correction was applied as the multiple testing correction. A gene was declared differentially expressed if the observed fold change between two experimental conditions was at least 1.5 and statistically significant (adjusted p-value < 0.05).

Machine Learning-Based Genes Selection
We used machine learning to build binary classification models considering genes as features and then to find "important" features in terms of distinguishing BPA-exposed samples from control ones. In particular, we used two different machine learning-based approaches, namely ensemble-based methods and a new iterative feature selection procedure.

Ensemble-Based Approach
We applied Random Forest and Support Vector Machine (SVM) ensemble methods, with feature bagging, to all three datasets. Random Forest is a widely used classification model; we used it with 1000 Gini impurity-based trees and 100 features in each tree. In order to find the most "important" genes, we used Gini importance, which is computed as the total reduction of Gini impurity brought by that feature. Similar to Random Forest, we built an SVM ensemble with feature bagging. As the feature importance criterion, we used the weights assigned to each feature by the SVM classifier.
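The Gini-importance criterion itself can be illustrated without the full 1000-tree ensemble. The minimal numpy sketch below scores each feature by the impurity reduction of a single median split, i.e. the quantity a Random Forest accumulates over all its trees as Gini importance; the toy data are illustrative:

```python
import numpy as np

def gini(y):
    """Gini impurity of a binary label vector."""
    if y.size == 0:
        return 0.0
    p = y.mean()
    return 2.0 * p * (1.0 - p)

def stump_importance(X, y):
    """Gini impurity reduction from splitting each feature at its median;
    a Random Forest sums exactly these reductions across all tree nodes."""
    base = gini(y)
    scores = []
    for j in range(X.shape[1]):
        mask = X[:, j] <= np.median(X[:, j])
        left, right = y[mask], y[~mask]
        w = mask.mean()
        scores.append(base - (w * gini(left) + (1 - w) * gini(right)))
    return np.array(scores)

# Toy data: feature 0 separates the classes, feature 1 is noise.
X = np.array([[0.1, 5.0], [0.2, 1.0], [0.9, 4.0], [0.8, 2.0]])
y = np.array([0, 0, 1, 1])
imp = stump_importance(X, y)   # feature 0 gets the highest score
```

Ranking genes by this score and keeping the top-scoring ones is the ensemble-based route to "important" genes described above.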

Iterative Feature Selection Procedure
We constructed a cross-validation-based greedy feature selection procedure (Figure 5). At each step, this procedure tries to expand the feature set by adding a new feature: it fits a model with the different alternatives and selects the feature that is best in terms of cross-validation accuracy at that step.

Figure 5. The algorithm of the cross-validation-based greedy selection procedure. The algorithm takes as inputs the following parameters: dataset X (gene features of one of the three datasets: simple scaled, without correlated genes, and without co-expressed genes), BinaryClassifier (a binary classification function), AccuracyDelta (the minimum significant difference in the accuracy score), and MaxDecreaseCounter (the maximum number of steps to evaluate in case of accuracy decrease). The iterative feature selection procedure returns a subset of selected features.
An alternative to this idea could be the Recursive Feature Elimination procedure (RFE), which fits a model once and iteratively removes the weakest feature until the specified number of features is reached. The reason why we did not use the RFE procedure is its inability to control the fitting process, while our greedy selection algorithm provides the opportunity to set up useful stopping criteria. We stopped when there was no significant increase in cross-validation accuracy, which helped us overcome overfitting.
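The greedy procedure of Figure 5 can be sketched in a few lines of pure Python. Here `score` stands in for the cross-validation accuracy of the chosen BinaryClassifier on a candidate feature set, and `accuracy_delta`/`max_decrease` mirror the AccuracyDelta and MaxDecreaseCounter parameters; rolling back to the best-scoring set after the decrease counter fires is an assumption of this sketch:

```python
def greedy_feature_selection(features, score, accuracy_delta=0.01, max_decrease=2):
    """Greedy forward selection: repeatedly add the feature that maximises
    score(selected + [f]); stop after `max_decrease` consecutive steps
    without an improvement of at least `accuracy_delta`."""
    selected, remaining = [], list(features)
    best_set, best_score, decrease = [], 0.0, 0
    while remaining and decrease < max_decrease:
        step = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(step)
        remaining.remove(step)
        step_score = score(selected)
        if step_score >= best_score + accuracy_delta:
            best_set, best_score, decrease = list(selected), step_score, 0
        else:
            decrease += 1
    return best_set, best_score

# Toy score: features 'a' and 'b' are jointly informative, the rest are noise.
def toy_score(feats):
    return 0.5 + 0.2 * ('a' in feats) + 0.2 * ('b' in feats)

chosen, acc = greedy_feature_selection(['a', 'b', 'c', 'd'], toy_score)
```

Because the stopping criterion is driven by cross-validation accuracy rather than a preset feature count, the procedure halts as soon as additional genes stop paying for themselves, which is the overfitting control described above.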
Because of the small number of samples in our dataset, we used a 50/50 split in cross-validation. This led to an issue of unstable feature selection at each step. In order to reduce this instability, we ran the procedure 100 times and counted each gene's appearances in the "important genes" lists.
The crucial step of the algorithm is training a binary classifier, which could be any appropriate classification model. In our study, we focused on strong baseline models. We used Logistic Regression with L1 and L2 penalties for the simple combined dataset and the Naive Bayesian classifier for the datasets without correlated or co-expressed genes. The Naive Bayesian classifier is known to be a strong baseline for problems with independence assumptions between the features. It assigns a class label y_NB from the possible classes Y following the maximum a posteriori principle (Equation (2)):

y_NB = argmax_{y∈Y} P(y) ∏_i P(x_i | y), (2)

under the "naive" assumption that all features are mutually independent (Equation (3)):

P(x_1, x_2, ..., x_n | y) = P(x_1 | y) P(x_2 | y) ... P(x_n | y), (3)

where x_i stands for the intensity value of a specific gene i, y stands for a class label, P(x_i | y) stands for the probability of the intensity value x_i given class y, and P(y) stands for the probability of class y. Both probabilities P(x_i | y) and P(y) are estimated with relative frequencies in the training set. Logistic Regression is a simple model that assigns class probabilities with a sigmoid function of a linear combination (Equation (4)):

P(y = 1 | x) = σ(w · x), (4)

where x stands for the vector of all intensity values, w stands for the vector of linear coefficients, y stands for a class label, and σ is the sigmoid function. We used it with ElasticNet regularization, which includes penalties on the L1 and L2 norms of the weight vector w.
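The two baseline classifiers of Equations (2)-(4) can be sketched directly. This is an illustrative stdlib-only sketch with made-up probability tables over binarised intensities (0 = low, 1 = high), not the fitted models from the study:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_prob(w, x):
    """Equation (4): P(y = 1 | x) = sigma(w . x)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def naive_bayes_predict(prior, cond, x):
    """Equation (2): argmax_y P(y) * prod_i P(x_i | y), with the
    conditional probabilities looked up from pre-estimated tables."""
    best, best_p = None, -1.0
    for y, p_y in prior.items():
        p = p_y
        for i, xi in enumerate(x):
            p *= cond[y][i][xi]
        if p > best_p:
            best, best_p = y, p
    return best

# Hypothetical two-gene example: both genes tend to be "high" when exposed.
prior = {'control': 0.5, 'exposed': 0.5}
cond = {'control': [{0: 0.9, 1: 0.1}, {0: 0.8, 1: 0.2}],
        'exposed': [{0: 0.2, 1: 0.8}, {0: 0.3, 1: 0.7}]}
label = naive_bayes_predict(prior, cond, [1, 1])   # -> 'exposed'
```

The product over per-gene conditionals in `naive_bayes_predict` is exactly the independence factorisation of Equation (3), which is why this classifier pairs naturally with the decorrelated datasets.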

Genes Selection Validation
In order to demonstrate the predictive ability of the selected features, we used them in the SVM classifier, which is known to be a strong model for binary classification. We checked the increase in cross-validation ROC AUC scores for each feature set.
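The ROC AUC score used for this validation has a simple rank-based (Mann-Whitney) form; a minimal stdlib sketch, with illustrative classifier scores:

```python
def roc_auc(labels, scores):
    """ROC AUC via the Mann-Whitney U statistic: the probability that a
    randomly chosen positive is scored above a randomly chosen negative."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    # assign average ranks, handling ties in the scores
    rank_of = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg = (i + 1 + j) / 2.0          # average of ranks i+1 .. j
        for k in range(i, j):
            rank_of[k] = avg
        i = j
    n_pos = sum(l for _, l in pairs)
    n_neg = n - n_pos
    rank_sum = sum(r for r, (_, l) in zip(rank_of, pairs) if l == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])   # -> 0.75
```

An AUC near 0.5 means the feature set carries no class signal (as with the all-genes baseline in Table 3), while values approaching 1.0 indicate the selected genes separate exposed from control samples.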

Identification of the Most Important Genes
We calculated the genes' appearances in the feature lists from 100 runs of the algorithm (Figure 5). From these frequencies, we were able to rank the genes in each dataset in terms of their importance for binary classification.
In order to compare gene lists to each other, we built a summary table using the top 30 genes of each dataset. We also annotated them with corresponding p-values from differential expression analysis.

Annotation and Pathway Analysis
Pathway enrichment analysis was performed in DAVID (Database for Annotation, Visualization and Integrated Discovery) and PANTHER, using the Gene Ontology (GO) and Reactome databases (PMID: 22543366; PMID: 30804569; PMID: 31691815). The MetaCore default setting of false discovery rate (FDR) < 0.05 was used as the threshold for significance in the enrichment analysis.