# Statistical Approach for Biologically Relevant Gene Selection from High-Throughput Gene Expression Data


## Abstract


## 1. Background

## 2. Materials and Methods

#### 2.1. Motivation

#### 2.2. Data Source

#### 2.3. Methods

#### 2.3.1. Notations

Let $X_{N \times M} = [x_{im}]$ be the GE data matrix, where $x_{im}$ represents the expression of the $i^{th}$ ($i = 1, 2, …, N$) gene in the $m^{th}$ ($m = 1, 2, …, M$) sample/subject; $x_m$ be the $N$-dimensional vector of expression values of genes for the $m^{th}$ sample; ${y}_{m}$ be the outcome variable for the target class label of the $m^{th}$ sample, taking values {+1, −1} for case and control conditions, respectively; $M_1$ and $M_2$ be the number of GE samples in the case and control classes, respectively (${M}_{1}+{M}_{2}=M$); $({\overline{x}}_{i1},{S}_{i1}^{2})$ and $({\overline{x}}_{i2},{S}_{i2}^{2})$ be the mean and variance of the $i^{th}$ gene for the case and control classes, respectively; ${\overline{x}}_{i}$ be the mean of the $i^{th}$ gene across all $M$ samples; and ${S}_{ij}$ be the covariance between the $i^{th}$ and $j^{th}$ genes.

#### 2.3.2. Maximum Relevance and Minimum Redundancy (MRMR) Filter

The relevance of the $i^{th}$ gene over the given classes (i.e., case and control) is computed through the F-statistic [12] and is expressed as:

where $w_i$ (≥ 0) is the weight associated with the $i^{th}$ gene. The functions $F(i)$ and $R(i, j)$ in Equation (3) represent the F-statistic for the $i^{th}$ gene and Pearson’s correlation coefficient between the $i^{th}$ and $j^{th}$ genes, respectively. In other words, the weight of the $i^{th}$ gene is its F-statistic adjusted by the average absolute correlation of the $i^{th}$ gene with the remaining genes.
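The description of the MRMR weight can be illustrated with a minimal Python sketch. It uses the textbook two-class one-way ANOVA F-statistic and Pearson's correlation. The function names (`f_statistic`, `pearson`, `mrmr_weights`) are hypothetical, and the quotient form of the adjustment (F-statistic divided by the average absolute correlation) is an assumption, since the text only says the F-statistic is "adjusted" by that average.

```python
import math

def f_statistic(case, control):
    """Two-class one-way ANOVA F-statistic for a single gene."""
    m1, m2 = len(case), len(control)
    mean1 = sum(case) / m1
    mean2 = sum(control) / m2
    grand = (sum(case) + sum(control)) / (m1 + m2)
    # Between-class sum of squares (1 degree of freedom for 2 classes).
    between = m1 * (mean1 - grand) ** 2 + m2 * (mean2 - grand) ** 2
    # Within-class sum of squares (M - 2 degrees of freedom).
    within = (sum((x - mean1) ** 2 for x in case)
              + sum((x - mean2) ** 2 for x in control))
    return between / (within / (m1 + m2 - 2))

def pearson(u, v):
    """Pearson's correlation coefficient between two expression vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def mrmr_weights(X, y):
    """X: N gene rows of M expression values; y: class labels (+1 / -1)."""
    n_genes = len(X)
    weights = []
    for i in range(n_genes):
        case = [x for x, lab in zip(X[i], y) if lab == 1]
        control = [x for x, lab in zip(X[i], y) if lab == -1]
        relevance = f_statistic(case, control)
        redundancy = sum(abs(pearson(X[i], X[j]))
                         for j in range(n_genes) if j != i) / (n_genes - 1)
        # ASSUMPTION: quotient form of the relevance/redundancy adjustment.
        weights.append(relevance / redundancy)
    return weights
```

The sketch does not guard against constant genes or zero redundancy, either of which would cause a division by zero.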

#### 2.3.3. Support Vector Machine (SVM)

where $k_i$ and $b$ are the weight of the $i^{th}$ gene and the bias, respectively. Here, we assume that the GE samples for the 2 classes are linearly separable. In other words, we can select 2 parallel hyperplanes that separate the case and control classes in such a way that the distance between them is maximum.

The $k_i$’s are obtained by minimizing the objective function in Equation (9). Through the principle of maxima-minima, we have:
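For orientation, the textbook hard-margin SVM that fits this description (linearly separable classes, two parallel hyperplanes at maximum distance) can be written as follows, with $k_i$ as the SVM weight of the $i^{th}$ gene and $b$ as the bias. This is the standard formulation, given as an assumed sketch rather than a reproduction of the numbered equations.

```latex
\min_{k,\,b}\ \frac{1}{2}\sum_{i=1}^{N} k_i^{2}
\quad \text{subject to} \quad
y_m\left(\sum_{i=1}^{N} k_i\, x_{im} + b\right) \ge 1,
\qquad m = 1, 2, \dots, M .
```

Under this formulation the distance between the two hyperplanes is $2/\sqrt{\sum_i k_i^2}$, so minimizing the objective maximizes the margin.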

#### 2.3.4. Proposed Hybrid Approach of Gene Selection

by MRMR ($w_i$) or by SVM ($k_i$), especially when $w_i$ and $k_i$ are negatively correlated. Hence, we propose a statistical approach that combines the SVM and MRMR weights under a sound statistical framework, where genes are selected through p-values computed using the NP test statistic, which is described as follows.

The gene weights were rescaled through min-max (minimax) normalization. Then, ${w}_{i}$ and $k_i$ were ranked in ascending order of their magnitudes and assigned the ranks ${\gamma}_{i}^{MR}$ and ${\gamma}_{i}^{SV}$ for the $i^{th}$ gene, respectively. We then developed a technique, quadratic integration, for integrating the gene scores based on these ranks, which automatically assigns more weight to higher values of $w_i$ and $k_i$. The quadratic integration score can be expressed as:
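The two preprocessing steps named in this paragraph, min-max normalization and ascending ranking, can be sketched in Python as below. The helper names are hypothetical and the tie-breaking rule (original order) is an assumption. The quadratic integration score itself is defined by Equation (13) and is not implemented here.

```python
def min_max(values):
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def ascending_ranks(values):
    """Rank 1 goes to the smallest value, rank N to the largest.

    ASSUMPTION: ties are broken by original order (sorted() is stable).
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks
```

With ascending ranks, the gene with the largest weight receives the largest rank, which is consistent with the statement that higher weights contribute more to the integrated score.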

If $SD_i$ in Equation (13) is used alone for ranking genes, the method becomes a filter approach and leads to the selection of spuriously associated genes. Hence, we used a bootstrap procedure under a subject sampling model to obtain the empirical distribution of $SD_i$ and to compute the statistical significance value for the $i^{th}$ ($i = 1, 2, …, N$) gene. The bootstrap procedure is described below.
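A subject-sampling bootstrap of this kind can be sketched in Python as follows: subjects (columns of the GE matrix) are resampled with replacement together with their class labels, genes (rows) stay fixed, and a gene-scoring function is re-evaluated on every bootstrap matrix. The names `bootstrap_matrix`, `bootstrap_scores` and `score_fn` are hypothetical. The sketch does not check that both classes are present in each resample, which a full implementation would likely need.

```python
import random

def bootstrap_matrix(X, y, rng):
    """Resample subjects (columns) with replacement; keep genes (rows) fixed."""
    m = len(y)
    cols = [rng.randrange(m) for _ in range(m)]
    Xb = [[row[c] for c in cols] for row in X]
    yb = [y[c] for c in cols]
    return Xb, yb

def bootstrap_scores(X, y, score_fn, B, seed=0):
    """Return B vectors of gene scores, one per bootstrap GE matrix.

    score_fn(Xb, yb) is any function that returns one score per gene.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(B):
        Xb, yb = bootstrap_matrix(X, y, rng)
        scores.append(score_fn(Xb, yb))
    return scores
```

Collecting the scores of one gene across the B resamples gives the empirical distribution used for significance testing.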

Let $P_{ib}$ be a random variable (rv) that gives the position of the $i^{th}$ gene in the $b^{th}$ bootstrap GE matrix. Then, another rv can be defined based on $P_{ib}$ (without loss of generality), given as:

This rv gives the rank score of the $i^{th}$ ($i = 1, 2, …, N$) gene in the $b^{th}$ ($b = 1, 2, …, B$) bootstrap GE matrix. Note that the distribution of the rank scores of genes, computed from a bootstrap GE data matrix, is symmetric around the median value (as rank scores are a function of ranks). The median and the third quartile (${Q}_{3}$) are 0.5 and 0.75, respectively.

To determine whether the $i^{th}$ gene is biologically relevant to the condition/trait under study, the following null hypothesis can be tested.

$i^{th}$ gene over all possible bootstrap samples.

To test $H_0$, we define another rv ${Z}_{ib}$, as:

For testing $H_0$ vs. $H_1$, the test statistic for the $i^{th}$ gene, ${W}_{i}$, was developed, and is given as:

$W_i$ in Equation (19) is the sum of the ranks of the positive signed scores for the $i^{th}$ gene over the $B$ bootstrap samples. Further, ${U}_{ib}$ in Equation (19) is a Bernoulli rv, and its probability mass function can be given as:

$W_i$ in Equation (19) under $H_0$ can be obtained as:

$W_i$ in Equation (19) becomes:
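A statistic defined as the sum of the ranks of the positive signed scores is the form of the Wilcoxon signed-rank statistic. The Python sketch below computes it for one gene from its B bootstrap rank scores and derives a one-sided p-value from the usual large-sample normal approximation, with null mean $B(B+1)/4$ and null variance $B(B+1)(2B+1)/24$. The reference value `q` from which scores are measured, the one-sided direction, and the function name `signed_rank_test` are assumptions made for illustration.

```python
import math

def signed_rank_test(scores, q):
    """One-sided signed-rank test of H0: centre <= q against H1: centre > q.

    scores: bootstrap rank scores of one gene; q: assumed reference value.
    Returns (W, p) where W is the sum of ranks of the positive differences.
    """
    z = [s - q for s in scores if s != q]      # drop zero differences
    B = len(z)
    order = sorted(range(B), key=lambda b: abs(z[b]))
    ranks = [0.0] * B
    pos = 0
    while pos < B:                             # average ranks over ties in |z|
        end = pos
        while end + 1 < B and abs(z[order[end + 1]]) == abs(z[order[pos]]):
            end += 1
        avg = (pos + end) / 2 + 1
        for t in range(pos, end + 1):
            ranks[order[t]] = avg
        pos = end + 1
    W = sum(r for r, d in zip(ranks, z) if d > 0)
    mean = B * (B + 1) / 4                     # null mean of W
    var = B * (B + 1) * (2 * B + 1) / 24       # null variance of W
    stat = (W - mean) / math.sqrt(var)
    p = 0.5 * math.erfc(stat / math.sqrt(2))   # upper-tail normal probability
    return W, p
```

The sketch assumes at least one non-zero difference. For small B an exact null distribution would be preferable to the normal approximation.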

The p-value for the $i^{th}$ ($i = 1, 2, …, N$) gene is computed, and this testing procedure is repeated for the remaining $N − 1$ genes. Let ${p}_{1},{p}_{2},\dots ,{p}_{N}$ be the corresponding p-values for all the genes in the GE data, and α be the level of significance. Here, we assume that all genes in the GE data are equally important for trait development; hence, we employed the Hochberg procedure [49] to correct for multiple testing and to compute the adjusted (adj.) p-values for the genes. It is worth noting that Hochberg’s procedure is computationally simple, quite popular in genomic data analysis [50], and more powerful than Holm’s procedure [51]. The algorithm for Hochberg’s procedure [49] is as follows.
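Hochberg's step-up adjustment can be sketched in a few lines of Python. With the p-values sorted in ascending order, $p_{(1)} \le \dots \le p_{(N)}$, the adjusted value for rank $j$ is the minimum of $(N - l + 1)\,p_{(l)}$ over all $l \ge j$, capped at 1. The function name `hochberg_adjust` is hypothetical, and the code is a sketch of the standard procedure rather than the authors' implementation.

```python
def hochberg_adjust(pvalues):
    """Hochberg step-up adjusted p-values, returned in the input order."""
    n = len(pvalues)
    # Walk from the largest p-value down to the smallest.
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for k, idx in enumerate(order):
        # The k-th largest p-value (k = 0 for the largest) has ascending
        # rank j = n - k, so its multiplier n - j + 1 equals k + 1.
        candidate = (k + 1) * pvalues[idx]
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted
```

A gene is then declared significant when its adjusted p-value does not exceed α.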

#### 2.4. Comparative Performance Analysis of the Proposed Approach

Gene sets (**G**) of various sizes, given in Supplementary Table S10, were selected through the 10 gene selection methods, including the proposed BSM approach. These gene sets were then validated with respect to subject classification, QTL testing and GO analysis.

#### 2.4.1. Performance Analysis with Subject Classification

The gene sets **G** were validated by their ability to discriminate the class labels of subjects/samples between case (+1) and control (−1). A gene set selected through a method that provides maximum discrimination between the subjects of the 2 groups (i.e., case vs. control), as measured by classification accuracy (CA), is considered highly relevant. The expressions for the mean CA and the standard error (SE) in CA, computed through the varying window size technique, are given in Equations (25) and (26).

Let $n$ be the number of genes in **G**, $S$ be the size of the windows (i.e., the number of ranked genes per window) and $L$ be the sliding length. Then, the total number of windows becomes $K=\left(n-S\right)/L$. The genes in **G**, arranged in different windows along with their expression values, were then used in SVM classifiers with 4 basis functions, i.e., linear (SVM-LBF), radial (SVM-RBF), polynomial (SVM-PBF) and sigmoidal (SVM-SBF), to compute CA over a 5-fold cross validation. Let $CA_1, CA_2, …, CA_K$ be the CAs for the sliding windows; then the mean CA and SE in CA can be defined as:
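The windowing and the summary statistics can be sketched in Python as below. The window count follows $K = (n - S)/L$ from the text (integer division is used here), and the SE is taken as the sample standard deviation divided by $\sqrt{K}$, which is an assumption about the exact form of the SE expression. The function names are hypothetical, and the classification step (SVM with 5-fold cross validation) is left out.

```python
import math

def sliding_windows(ranked_genes, size, step):
    """Split a ranked gene list into K = (n - S) // L overlapping windows."""
    n = len(ranked_genes)
    k = (n - size) // step
    return [ranked_genes[w * step: w * step + size] for w in range(k)]

def mean_and_se(accuracies):
    """Mean CA and its standard error over K windows (requires K >= 2)."""
    k = len(accuracies)
    mean = sum(accuracies) / k
    # ASSUMPTION: sample variance with K - 1 in the denominator.
    var = sum((a - mean) ** 2 for a in accuracies) / (k - 1)
    return mean, math.sqrt(var / k)
```

In use, each window's genes would be passed to a classifier, and the resulting accuracies $CA_1, …, CA_K$ would be summarized with `mean_and_se`.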

#### 2.4.2. Performance Analysis with QTL Testing

where **G** is the gene set selected by a method, $Qstat$ is an rv whose values represent the number of genes covered by QTLs, $Q$ is the set of associated QTLs, and the indicator function in Equation (27) is given in Equation (28).

where $g_i^c[a, b] \in$ **G** ($a$ and $b$ represent the start and stop positions, in bp, of the gene $g_i$ on chromosome $c$) and $q_t^c[d, e] \in Q$ ($d$ and $e$ represent the start and stop positions of the QTL $q_t$ on chromosome $c$).

$Qstat$ counts the genes in **G** that are covered by QTLs. Further, using Equation (29), the statistical significance value (p-value) associated with **G** can be computed. In other words, this p-value reveals the enrichment significance of **G** with trait-specific QTLs. Higher values of $Qstat$ and $-\log_{10}$(p-value) indicate better performance of the gene selection method, and vice versa.
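The QTL coverage count and a hypergeometric enrichment p-value can be sketched in Python as follows. Two points are assumptions: a gene is treated as covered only when it lies entirely inside a QTL on the same chromosome, and the p-value is the upper tail $P(X \ge Qstat)$ of a hypergeometric distribution. The tuple layout and function names are hypothetical.

```python
from math import comb

def covered(gene, qtl):
    """gene = (chrom, a, b); qtl = (chrom, d, e), positions in bp.

    ASSUMPTION: coverage means full containment on the same chromosome.
    """
    return gene[0] == qtl[0] and qtl[1] <= gene[1] and gene[2] <= qtl[2]

def qstat(genes, qtls):
    """Number of selected genes covered by at least one QTL."""
    return sum(1 for g in genes if any(covered(g, q) for q in qtls))

def hypergeom_upper_tail(x, population, successes, draws):
    """P(X >= x) for X ~ Hypergeometric(population, successes, draws).

    population: all genes; successes: genes covered by QTLs overall;
    draws: size of the selected gene set.
    """
    upper = min(successes, draws)
    total = comb(population, draws)
    return sum(comb(successes, k) * comb(population - successes, draws - k)
               for k in range(x, upper + 1)) / total
```

For plotting, the p-value would typically be transformed to $-\log_{10}$(p-value), so that larger values indicate stronger enrichment.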

#### 2.4.3. Performance Analysis with GO Enrichment

**G** [52], as there exists a direct relationship between the semantic similarity of gene pairs and their structural (sequence) similarity [53,54]. Under this comparison setting, we assessed the performance of the 10 gene selection methods, including the proposed method, using a GO-based biological relevance criterion. In other words, different gene sets were first selected through these methods, and the GO-based criterion was then computed for each selected gene set. For this purpose, we developed a GO-based semantic distance measure to assess the GO-based biological relevance of the gene sets **G** selected through the proposed and existing gene selection methods. The GO-based semantic distance measure ($d_{ij}$) between the $i^{th}$ and $j^{th}$ genes is expressed in Equation (30) as:

where $GO_i = \{go_{i1}, go_{i2}, …, go_{iI}\}$ and $GO_j = \{go_{j1}, go_{j2}, …, go_{jJ}\}$ are the 2 sets of GO terms that annotate the $i^{th}$ and $j^{th}$ genes in **G**, respectively. Further, the GO-based average biologically relevant score for **G** (for a gene selection method) can be developed based on Equation (30) and is shown in Equation (31).

where ${D}_{G}^{avg}$ is the average biologically relevant score of **G** based on GO annotations. Using Equation (31), the ${D}_{G}^{avg}$ scores under the MF, BP and CC taxonomies were computed for each of the gene sets selected through the different methods. A lower value of ${D}_{G}^{avg}$ indicates better performance of the gene selection method, and vice versa.
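The overall shape of this computation, a set-based distance between the GO annotations of two genes averaged over all gene pairs in a set, can be sketched in Python as below. The Jaccard distance is used purely as a stand-in and is not claimed to be the distance of Equation (30). The function names and the example term labels are hypothetical.

```python
from itertools import combinations

def go_distance(terms_i, terms_j):
    """ASSUMED stand-in: Jaccard distance between two sets of GO terms."""
    a, b = set(terms_i), set(terms_j)
    union = a | b
    if not union:
        # ASSUMPTION: two unannotated genes are treated as maximally distant.
        return 1.0
    return 1.0 - len(a & b) / len(union)

def average_distance(annotations):
    """Mean pairwise distance over all gene pairs (needs >= 2 genes).

    annotations: dict mapping a gene id to its set of GO terms.
    """
    pairs = list(combinations(sorted(annotations), 2))
    return sum(go_distance(annotations[i], annotations[j])
               for i, j in pairs) / len(pairs)
```

For example, `average_distance({"g1": {"t1", "t2"}, "g2": {"t2"}, "g3": {"t3"}})` averages the distances 0.5, 1 and 1. As with ${D}_{G}^{avg}$, a lower value means the genes in the set share more annotation.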

## 3. Results and Discussion

#### 3.1. Computation of Genes Selection Criteria through Proposed Approach

#### 3.2. Comparative Performance Analysis Based on Subject Classification

#### 3.3. Comparative Performance Analysis Based on QTL Testing

#### 3.4. Comparative Performance Analysis Based on GO Analysis

#### 3.5. Comparative Performance Analysis Based on Runtime

## 4. Developed R Software Package

## 5. Conclusions

## Supplementary Materials

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## Availability of Data and Material

## References

1. Reuter, J.A.; Spacek, D.V.; Snyder, M.P. High-Throughput Sequencing Technologies. Mol. Cell **2015**, 58, 586–597.
2. Trevino, V.; Falciani, F.; Barrera-Saldaña, H.A. DNA Microarrays: A Powerful Genomic Tool for Biomedical and Clinical Research. Mol. Med. **2007**, 13, 527–541.
3. Charpe, A.M. DNA Microarray. In Advances in Biotechnology; Springer: New Delhi, India, 2014; pp. 71–104.
4. Barrett, T.; Wilhite, S.E.; Ledoux, P.; Evangelista, C.; Kim, I.F.; Tomashevsky, M. NCBI GEO: Archive for functional genomics data sets—Update. Nucleic Acids Res. **2012**, 41, D991–D995.
5. Das, S.; Meher, P.K.; Rai, A.; Bhar, L.M.; Mandal, B.N. Statistical approaches for gene selection, hub gene identification and module interaction in gene co-expression network analysis: An application to aluminum stress in soybean (Glycine max L.). PLoS ONE **2017**, 12, e0169605.
6. Wang, J.; Chen, L.; Wang, Y.; Zhang, J.; Liang, Y.; Xu, D. A Computational Systems Biology Study for Understanding Salt Tolerance Mechanism in Rice. PLoS ONE **2013**, 8, e64929.
7. Golub, T.R.; Slonim, D.K.; Tamayo, P.; Huard, C.; Gaasenbeek, M.; Mesirov, J.P. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science **1999**, 286, 531–537.
8. Guyon, I.; Weston, J.; Barnhill, S.; Vapnik, V. Gene selection for cancer classification using support vector machines. Mach. Learn. **2002**.
9. Saeys, Y.; Inza, I.; Larranaga, P. A review of feature selection techniques in bioinformatics. Bioinformatics **2007**, 23, 2507–2517.
10. Liang, Y.; Zhang, F.; Wang, J.; Joshi, T.; Wang, Y.; Xu, D. Prediction of Drought-Resistant Genes in Arabidopsis thaliana Using SVM-RFE. PLoS ONE **2011**, 6, e21750.
11. Díaz-Uriarte, R.; Alvarez de Andrés, S. Gene selection and classification of microarray data using random forest. BMC Bioinform. **2006**, 7, 3.
12. Peng, H.; Long, F.; Ding, C. Feature selection based on mutual information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy. IEEE Trans. Pattern Anal. Mach. Intell. **2005**.
13. Mundra, P.A.; Rajapakse, J.C. SVM-RFE with MRMR Filter for Gene Selection. IEEE Trans. Nanobioscience **2010**, 9, 31–37.
14. Das, S.; Pandey, P.; Rai, A.; Mohapatra, C. A computational system biology approach to construct gene regulatory networks for salinity response in rice (Oryza sativa). Indian J. Agric. Sci. **2015**, 85, 1546–1552.
15. Kursa, M.B. Robustness of Random Forest-based gene selection methods. BMC Bioinform. **2014**.
16. Inza, I.; Larrañaga, P.; Blanco, R.; Cerrolaza, A.J. Filter versus wrapper gene selection approaches in DNA microarray domains. Artif. Intell. Med. **2004**.
17. Lazar, C.; Taminau, J.; Meganck, S.; Steenhoff, D.; Coletta, A.; Molter, C. A survey on filter techniques for feature selection in gene expression microarray analysis. IEEE/ACM Trans. Comput. Biol. Bioinform. **2012**.
18. Cui, X.; Churchill, G.A. Statistical tests for differential expression in cDNA microarray experiments. Genome Biol. **2003**.
19. Das, S.; Meher, P.K.; Pradhan, U.K.; Paul, A.K. Inferring gene regulatory networks using Kendall’s tau correlation coefficient and identification of salinity stress responsive genes in rice. Curr. Sci. **2017**, 112.
20. Ding, C.; Peng, H. Minimum redundancy feature selection from microarray gene expression data. In Computational Systems Bioinformatics, CSB2003, Proceedings of the 2003 IEEE Bioinformatics Conference; IEEE Comput. Soc. **2003**, 523–528.
21. Chen, Y.W.; Lin, C.J. Combining SVMs with various feature selection strategies. Stud. Fuzziness Soft Comput. **2006**.
22. Hossain, A.; Willan, A.R.; Beyene, J. An improved method on wilcoxon rank sum test for gene selection from microarray experiments. Commun. Stat. Simul. Comput. **2013**.
23. Troyanskaya, O.G.; Garber, M.E.; Brown, P.O.; Botstein, D.; Altman, R.B. Nonparametric methods for identifying differentially expressed genes in microarray data. Bioinformatics **2002**.
24. Cheng, T.; Wang, Y.; Bryant, S.H. FSelector: A Ruby gem for feature selection. Bioinformatics **2012**, 28, 2851–2852.
25. Radovic, M.; Ghalwash, M.; Filipovic, N.; Obradovic, Z. Minimum redundancy maximum relevance feature selection approach for temporal gene expression data. BMC Bioinform. **2017**, 18, 9.
26. Ding, C.; Peng, H. Minimum redundancy feature selection from microarray gene expression data. J. Bioinform. Comput. Biol. **2005**, 3, 185–205.
27. Zhang, G.-L.; Pan, L.-L.; Huang, T.; Wang, J.-H. The transcriptome difference between colorectal tumor and normal tissues revealed by single-cell sequencing. J. Cancer **2019**, 10, 5883–5890.
28. Kohavi, R.; John, G.H. Wrappers for feature subset selection. Artif. Intell. **1997**.
29. Hearst, M.A.; Dumais, S.T.; Osuna, E.; Platt, J.; Scholkopf, B. Support vector machines. IEEE Intell. Syst. **1998**, 13, 18–28.
30. Duan, K.B.; Rajapakse, J.C.; Wang, H.; Azuaje, F. Multiple SVM-RFE for gene selection in cancer classification with expression data. IEEE Trans. Nanobioscience **2005**.
31. Tao, X.; Wu, X.; Huang, T.; Mu, D. Identification and Analysis of Dysfunctional Genes and Pathways in CD8+ T Cells of Non-Small Cell Lung Cancer Based on RNA Sequencing. Front. Genet. **2020**.
32. Ting, K.M.; Witten, I.H. Stacking bagged and dagged models. In ICML ′97: Proceedings of the Fourteenth International Conference on Machine Learning; Douglas, H., Fisher, E.D., Eds.; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1997; pp. 367–375.
33. Li, J.R.; Huang, T. Predicting and analyzing early wake-up associated gene expressions by integrating GWAS and eQTL studies. Biochim. Biophys. Acta Mol. Basis Dis. **2018**.
34. Sun, L.; Kong, X.; Xu, J.; Xue, Z.; Zhai, R.; Zhang, S. A Hybrid Gene Selection Method Based on ReliefF and Ant Colony Optimization Algorithm for Tumor Classification. Sci. Rep. **2019**.
35. Mahi, M.; Baykan, Ö.K.; Kodaz, H. A new hybrid method based on Particle Swarm Optimization, Ant Colony Optimization and 3-Opt algorithms for Traveling Salesman Problem. Appl. Soft Comput. **2015**, 30, 484–490.
36. Sohn, I.; Owzar, K.; George, S.L.; Kim, S.; Jung, S.H. A permutation-based multiple testing method for time-course microarray experiments. BMC Bioinform. **2009**.
37. Ritchie, M.E.; Phipson, B.; Wu, D.; Hu, Y.; Law, C.W.; Shi, W. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. **2015**, 43, e47.
38. Knijnenburg, T.A.; Wessels, L.F.A.; Reinders, M.J.T.; Shmulevich, I. Fewer permutations, more accurate P-values. Bioinformatics **2009**.
39. Das, S.; Rai, A.; Mishra, D.C.; Rai, S.N. Statistical approach for selection of biologically informative genes. Gene **2018**, 655.
40. Lai, C.; Reinders, M.J.T.; van’t Veer, L.J.; Wessels, L.F.A. A comparison of univariate and multivariate gene selection techniques for classification of cancer datasets. BMC Bioinform. **2006**.
41. Das, S.; Rai, A.; Mishra, D.C.; Rai, S.N. Statistical Approach for Gene Set Analysis with Trait Specific Quantitative Trait Loci. Sci. Rep. **2018**, 8, 2391.
42. Tiwari, S.; Kumar, V.; Singh, B.; Rao, A.; Mithra, S.V.A. Mapping QTLs for Salt Tolerance in Rice (Oryza sativa L) by Bulked Segregant Analysis of Recombinant Inbred Lines Using 50K SNP Chip. PLoS ONE **2016**, 11, e0153610.
43. Gene Ontology Consortium. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res. **2004**.
44. Gautier, L.; Cope, L.; Bolstad, B.M.; Irizarry, R.A. Affy—Analysis of Affymetrix GeneChip data at the probe level. Bioinformatics **2004**.
45. Ware, D. Gramene: A resource for comparative grass genomics. Nucleic Acids Res. **2002**.
46. Tian, T.; Liu, Y.; Yan, H.; You, Q.; Yi, X.; Du, Z. AgriGO v2.0: A GO analysis toolkit for the agricultural community, 2017 update. Nucleic Acids Res. **2017**.
47. Sahani, M.; Linden, J. Advances in Neural Information Processing Systems, Processing Systems: Proceedings from the 2002, 2003; MIT Press: Cambridge, MA, USA, 2003.
48. Efron, B.; Tibshirani, R.J. An Introduction to the Bootstrap; Springer: Boston, MA, USA, 1993.
49. Benjamini, Y.; Hochberg, Y. Multiple Hypotheses Testing with Weights. Scand. J. Stat. **1997**, 24, 407–418.
50. Li, Q.; Brown, J.B.; Huang, H.; Bickel, P.J. Measuring reproducibility of high-throughput experiments. Ann. Appl. Stat. **2011**, 5, 1752–1779.
51. Chen, S.-Y.; Feng, Z.; Yi, X. A general introduction to adjustment for multiple comparisons. J. Thorac. Dis. **2017**, 9, 1725–1729.
52. Mazandu, G.K.; Mulder, N.J. Information content-based gene ontology functional similarity measures: Which one to use for a given biological data type? PLoS ONE **2014**.
53. Lord, P.W.; Stevens, R.D.; Brass, A.; Goble, C.A. Investigating semantic similarity measures across the gene ontology: The relationship between sequence and annotation. Bioinformatics **2003**.
54. Wang, J.Z.; Du, Z.; Payattakool, R.; Yu, P.S.; Chen, C.F. A new method to measure the semantic similarity of GO terms. Bioinformatics **2007**.
55. Ouyang, S.; Zhu, W.; Hamilton, J.; Lin, H.; Campbell, M.; Childs, K. The TIGR Rice Genome Annotation Resource: Improvements and new features. Nucleic Acids Res. **2007**.
56. Glazko, G.V.; Emmert-Streib, F. Unite and conquer: Univariate and multivariate approaches for finding differentially expressed gene sets. Bioinformatics **2009**.

**Figure 1.** Operational procedure for data integration and the use of the proposed BSM approach. (**A**) Outline of the data integration used in this study for the application of the BSM approach. The first step is the integration and meta-analysis of GE datasets obtained from various GE studies. Gene selection methods are then applied to the meta GE data. (**B**) Flowchart depicting the implemented algorithm of the BSM approach. $W_i^{(S)}$’s and $W_i^{(M)}$’s are the N-dimensional vectors of weights computed through the SVM and MRMR approaches, respectively. $G_i$’s and $R_i$’s are the N-dimensional vectors of gene lists and corresponding gene rank scores. SVM and MRMR stand for the support vector machine and Maximum Relevance and Minimum Redundancy algorithms, respectively. The $p_i$-value is the statistical significance value for the $i^{th}$ gene. α is the desired level of statistical significance.

**Figure 2.** Graphical analysis of the proposed BSM approach with the SVM-MRMR approach for abiotic stress datasets. The distributions of gene weights computed from the SVM-MRMR approach are shown for the (**A**) salinity; (**B**) cold; and (**C**) drought stress datasets in rice. The distributions of adjusted p-values computed from the proposed BSM approach are shown for the (**A1**) salinity; (**B1**) cold; and (**C1**) drought stress datasets.

**Figure 3.** Classification-based comparative performance analysis of gene selection methods through SVM-LBF and SVM-PBF classifiers for abiotic stress datasets. The horizontal axis represents the gene selection methods. The vertical axis represents the post-selection classification accuracy obtained by using the varying sliding window size technique. The classification accuracies over the window sizes are presented as boxes. The bars on the boxes represent the standard errors. The distributions of classification accuracies are shown for cold stress with the SVM-LBF (**A1**) and SVM-PBF (**A2**) classifiers, for salinity stress with the SVM-LBF (**B1**) and SVM-PBF (**B2**) classifiers, and for drought stress with the SVM-LBF (**C1**) and SVM-PBF (**C2**) classifiers.

**Figure 4.** Classification-based comparative performance analysis of gene selection methods through SVM-RBF and SVM-SBF classifiers for abiotic stress datasets. The horizontal axis represents the gene selection methods. The vertical axis represents the post-selection classification accuracy obtained by using the varying sliding window size technique. The classification accuracies over the window sizes are presented as boxes. The distributions of classification accuracies are shown for cold stress with the SVM-RBF (**A1**) and SVM-SBF (**A2**) classifiers, for salinity stress with the SVM-RBF (**B1**) and SVM-SBF (**B2**) classifiers, and for drought stress with the SVM-RBF (**C1**) and SVM-SBF (**C2**) classifiers.

**Figure 5.** Comparative performance analysis of gene selection methods through the distribution of the Qstat statistic. The horizontal axis represents the informative gene sets obtained through the gene selection methods. The vertical axis represents the value of the Qstat statistic. The distributions of the Qstat statistic are shown for the (**A**) salinity; (**B**) cold; (**C**) drought; (**D**) bacterial; (**E**) fungal; and (**F**) insect stress datasets in rice. The lines in different colors represent different gene selection methods.

**Figure 6.** Comparative performance analysis of gene selection methods through the distribution of p-values from the QTL-hypergeometric test. The horizontal axis represents the gene sets obtained through the gene selection methods. The vertical axis represents the value of −log10(p-value) from the QTL-hypergeometric test. The distributions of −log10(p-value) are shown for the (**A**) salinity; (**B**) cold; (**C**) drought; (**D**) bacterial; (**E**) fungal; and (**F**) insect stress datasets in rice. The lines in different colors represent different gene selection methods.

Sl. No. | Description | #Series | Series ID | #Genes | #Samples | Stress Type |
---|---|---|---|---|---|---|
1 | Salinity stress | 3 | GSE14403, GSE16108, GSE6901 | 6637 | 45 (23, 22) | Abiotic |
2 | Cold stress | 4 | GSE31077, GSE33204, GSE37940, GSE6901 | 8840 | 28 (15, 13) | Abiotic |
3 | Drought stress | 5 | GSE6901, GSE26280, GSE21651, GSE23211, GSE24048 | 9078 | 70 (35, 35) | Abiotic |
4 | Bacterial (xanthomonas) stress | 3 | GSE19239, GSE36093, GSE36272 | 8356 | 74 (37, 37) | Biotic |
5 | Fungal (blast) stress | 2 | GSE41798, GSE7256 | 7072 | 26 (13, 13) | Biotic |
6 | Insect (brown plant hopper) stress | 1 | GSE29967 | 7241 | 18 (12, 6) | Biotic |

**Table 2.** Comparative performance analysis of the gene selection methods through the MF GO-based biologically relevant score for abiotic stresses in rice.

Gene set size / Method | MRMR | SVM | SVM-MRMR | IG | GR | Wilcox | t | PCR | F | BSM |
---|---|---|---|---|---|---|---|---|---|---|
**Salt stress in rice** | | | | | | | | | | |
10 | 0.98 | 0.95 | 0.97 | 0.92 | 0.89 | 0.93 | 0.93 | 0.96 | 0.96 | 0.88 |
20 | 0.97 | 0.89 | 0.93 | 0.92 | 0.86 | 0.89 | 0.89 | 0.91 | 0.91 | 0.86 |
50 | 0.92 | 0.91 | 0.92 | 0.90 | 0.90 | 0.87 | 0.87 | 0.92 | 0.92 | 0.85 |
100 | 0.92 | 0.90 | 0.89 | 0.90 | 0.88 | 0.87 | 0.88 | 0.92 | 0.91 | 0.83 |
150 | 0.90 | 0.89 | 0.90 | 0.89 | 0.88 | 0.87 | 0.87 | 0.90 | 0.91 | 0.83 |
200 | 0.90 | 0.89 | 0.88 | 0.89 | 0.87 | 0.88 | 0.88 | 0.90 | 0.90 | 0.84 |
500 | 0.90 | 0.90 | 0.89 | 0.90 | 0.90 | 0.89 | 0.90 | 0.89 | 0.89 | 0.83 |
**Cold stress in rice** | | | | | | | | | | |
10 | 0.82 | 0.84 | 0.82 | 0.92 | 0.99 | 0.92 | 0.87 | 0.77 | 0.77 | 0.75 |
20 | 0.93 | 0.88 | 0.93 | 0.95 | 0.93 | 0.88 | 0.90 | 0.91 | 0.88 | 0.71 |
50 | 0.91 | 0.88 | 0.91 | 0.93 | 0.90 | 0.91 | 0.91 | 0.92 | 0.92 | 0.73 |
100 | 0.91 | 0.90 | 0.91 | 0.90 | 0.88 | 0.91 | 0.91 | 0.91 | 0.91 | 0.74 |
150 | 0.90 | 0.89 | 0.90 | 0.89 | 0.89 | 0.89 | 0.90 | 0.91 | 0.91 | 0.72 |
200 | 0.90 | 0.89 | 0.90 | 0.89 | 0.88 | 0.89 | 0.90 | 0.90 | 0.90 | 0.73 |
500 | 0.90 | 0.88 | 0.90 | 0.90 | 0.89 | 0.88 | 0.89 | 0.88 | 0.89 | 0.73 |
**Drought stress in rice** | | | | | | | | | | |
10 | 0.82 | 0.86 | 0.81 | 0.90 | 0.93 | 0.65 | 0.76 | 0.76 | 0.76 | 0.71 |
20 | 0.79 | 0.86 | 0.78 | 0.91 | 0.90 | 0.80 | 0.81 | 0.81 | 0.81 | 0.75 |
50 | 0.88 | 0.84 | 0.87 | 0.88 | 0.90 | 0.84 | 0.88 | 0.89 | 0.89 | 0.75 |
100 | 0.89 | 0.89 | 0.88 | 0.89 | 0.89 | 0.88 | 0.88 | 0.88 | 0.88 | 0.76 |
150 | 0.88 | 0.88 | 0.87 | 0.89 | 0.88 | 0.88 | 0.88 | 0.88 | 0.88 | 0.76 |
200 | 0.88 | 0.88 | 0.87 | 0.88 | 0.89 | 0.89 | 0.88 | 0.88 | 0.88 | 0.74 |
500 | 0.88 | 0.88 | 0.87 | 0.88 | 0.88 | 0.89 | 0.88 | 0.87 | 0.87 | 0.73 |

**Table 3.** Comparative performance analysis of the gene selection methods through the BP GO-based biologically relevant score for abiotic stresses in rice.

Gene set size / Method | MRMR | SVM | SVM-MRMR | IG | GR | Wilcox | t | PCR | F | BSM |
---|---|---|---|---|---|---|---|---|---|---|
**Salt stress in rice** | | | | | | | | | | |
10 | 0.86 | 0.94 | 0.86 | 0.92 | 0.97 | 0.90 | 0.90 | 0.88 | 0.88 | 0.83 |
20 | 0.90 | 0.91 | 0.90 | 0.89 | 0.91 | 0.92 | 0.92 | 0.84 | 0.85 | 0.84 |
50 | 0.89 | 0.90 | 0.88 | 0.88 | 0.90 | 0.88 | 0.89 | 0.88 | 0.88 | 0.82 |
100 | 0.88 | 0.89 | 0.86 | 0.89 | 0.89 | 0.85 | 0.86 | 0.89 | 0.87 | 0.82 |
150 | 0.87 | 0.89 | 0.90 | 0.88 | 0.89 | 0.85 | 0.85 | 0.89 | 0.89 | 0.83 |
200 | 0.87 | 0.89 | 0.86 | 0.88 | 0.89 | 0.84 | 0.85 | 0.89 | 0.88 | 0.82 |
500 | 0.87 | 0.89 | 0.87 | 0.87 | 0.89 | 0.86 | 0.86 | 0.86 | 0.86 | 0.82 |
**Cold stress in rice** | | | | | | | | | | |
10 | 0.79 | 0.82 | 0.79 | 0.86 | 0.94 | 0.91 | 0.90 | 0.79 | 0.79 | 0.79 |
20 | 0.93 | 0.89 | 0.93 | 0.90 | 0.88 | 0.86 | 0.88 | 0.90 | 0.86 | 0.82 |
50 | 0.88 | 0.89 | 0.88 | 0.90 | 0.88 | 0.88 | 0.87 | 0.89 | 0.90 | 0.71 |
100 | 0.88 | 0.89 | 0.88 | 0.89 | 0.87 | 0.90 | 0.88 | 0.89 | 0.89 | 0.74 |
150 | 0.89 | 0.88 | 0.89 | 0.88 | 0.88 | 0.88 | 0.87 | 0.88 | 0.88 | 0.73 |
200 | 0.89 | 0.87 | 0.89 | 0.87 | 0.87 | 0.87 | 0.87 | 0.88 | 0.84 | 0.73 |
500 | 0.88 | 0.86 | 0.88 | 0.86 | 0.86 | 0.84 | 0.86 | 0.87 | 0.83 | 0.71 |
**Drought stress in rice** | | | | | | | | | | |
10 | 0.86 | 0.79 | 0.85 | 0.81 | 0.89 | 0.62 | 0.83 | 0.83 | 0.83 | 0.73 |
20 | 0.84 | 0.79 | 0.83 | 0.89 | 0.90 | 0.80 | 0.84 | 0.84 | 0.84 | 0.72 |
50 | 0.88 | 0.81 | 0.87 | 0.88 | 0.88 | 0.81 | 0.88 | 0.88 | 0.88 | 0.72 |
100 | 0.87 | 0.84 | 0.86 | 0.88 | 0.88 | 0.84 | 0.86 | 0.87 | 0.87 | 0.72 |
150 | 0.86 | 0.84 | 0.85 | 0.88 | 0.88 | 0.84 | 0.87 | 0.87 | 0.87 | 0.71 |
200 | 0.86 | 0.84 | 0.85 | 0.87 | 0.87 | 0.85 | 0.86 | 0.86 | 0.86 | 0.72 |
500 | 0.87 | 0.85 | 0.86 | 0.86 | 0.87 | 0.87 | 0.86 | 0.85 | 0.83 | 0.72 |

**Table 4.** Comparative performance analysis of the gene selection methods through the CC GO-based biologically relevant score for abiotic stresses in rice.

Gene set size / Method | MRMR | SVM | SVM-MRMR | IG | GR | Wilcox | t | PCR | F | BSM |
---|---|---|---|---|---|---|---|---|---|---|
**Salt stress in rice** | | | | | | | | | | |
10 | 0.77 | 0.71 | 0.70 | 0.94 | 0.97 | 0.93 | 0.93 | 0.95 | 0.95 | 0.78 |
20 | 0.88 | 0.87 | 0.85 | 0.92 | 0.90 | 0.91 | 0.91 | 0.88 | 0.88 | 0.81 |
50 | 0.88 | 0.89 | 0.86 | 0.92 | 0.92 | 0.90 | 0.90 | 0.89 | 0.89 | 0.84 |
100 | 0.88 | 0.90 | 0.8 | 0.91 | 0.89 | 0.86 | 0.86 | 0.88 | 0.88 | 0.83 |
150 | 0.87 | 0.90 | 0.87 | 0.90 | 0.89 | 0.86 | 0.87 | 0.88 | 0.88 | 0.83 |
200 | 0.87 | 0.89 | 0.85 | 0.90 | 0.90 | 0.88 | 0.89 | 0.88 | 0.88 | 0.83 |
500 | 0.88 | 0.90 | 0.88 | 0.89 | 0.90 | 0.88 | 0.89 | 0.87 | 0.87 | 0.82 |
**Cold stress in rice** | | | | | | | | | | |
10 | 0.78 | 0.80 | 0.78 | 0.96 | 0.81 | 0.87 | 0.86 | 0.70 | 0.70 | 0.70 |
20 | 0.88 | 0.86 | 0.88 | 0.96 | 0.87 | 0.87 | 0.89 | 0.81 | 0.83 | 0.71 |
50 | 0.86 | 0.89 | 0.86 | 0.90 | 0.85 | 0.84 | 0.85 | 0.89 | 0.90 | 0.73 |
100 | 0.88 | 0.90 | 0.88 | 0.90 | 0.81 | 0.83 | 0.84 | 0.87 | 0.87 | 0.74 |
150 | 0.88 | 0.89 | 0.88 | 0.90 | 0.82 | 0.82 | 0.86 | 0.87 | 0.88 | 0.74 |
200 | 0.87 | 0.90 | 0.87 | 0.90 | 0.84 | 0.85 | 0.86 | 0.87 | 0.85 | 0.73 |
500 | 0.88 | 0.89 | 0.88 | 0.89 | 0.86 | 0.97 | 0.86 | 0.88 | 0.87 | 0.73 |
**Drought stress in rice** | | | | | | | | | | |
10 | 0.82 | 0.86 | 0.81 | 0.91 | 0.89 | 0.83 | 0.87 | 0.87 | 0.87 | 0.74 |
20 | 0.89 | 0.85 | 0.88 | 0.93 | 0.90 | 0.87 | 0.89 | 0.89 | 0.89 | 0.74 |
50 | 0.86 | 0.88 | 0.85 | 0.91 | 0.87 | 0.87 | 0.88 | 0.88 | 0.88 | 0.73 |
100 | 0.87 | 0.87 | 0.86 | 0.89 | 0.86 | 0.87 | 0.88 | 0.88 | 0.88 | 0.74 |
150 | 0.87 | 0.87 | 0.86 | 0.90 | 0.85 | 0.85 | 0.87 | 0.87 | 0.87 | 0.74 |
200 | 0.87 | 0.87 | 0.86 | 0.89 | 0.86 | 0.86 | 0.87 | 0.87 | 0.87 | 0.73 |
500 | 0.87 | 0.86 | 0.86 | 0.89 | 0.87 | 0.88 | 0.87 | 0.86 | 0.85 | 0.72 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Das, S.; Rai, S.N.
Statistical Approach for Biologically Relevant Gene Selection from High-Throughput Gene Expression Data. *Entropy* **2020**, *22*, 1205.
https://doi.org/10.3390/e22111205
