Feature Selection Strategies for Deep Learning-Based Classification in Ultra-High-Dimensional Genomic Data
Abstract
1. Introduction
- Problems with the accurate estimation of model parameters. More specifically, as demonstrated by Giraud [3], standard errors of estimates increase linearly with the number of model dimensions, which, as a consequence, makes statistical and biological inferences based on them less reliable. Furthermore, when the variation of a dependent variable is described by numerous independent covariates (features), false positive associations can arise from fitting patterns to “noise” [5]. The accurate estimation of model parameters in a high-dimensional setup is also hampered by numerical inaccuracies that may result from the large number of numerical operations, often performed on the verge of overflow or underflow.
- Reduced interpretability of model parameters due to correlations between subsets of features, which induces the risk of fitting false patterns of variation of the dependent variable to noise in the independent variables.
- The limited applicability of traditional hypothesis testing based on p-values, for which multiple testing corrections may fail because the underlying assumption of independence between single tests is violated, leading to inflated Type I error rates.
- Specifically for the task of classification, in a multidimensional space many data points lie near the true class boundaries, leading to ambiguous class assignments.
2. Results
2.1. Feature Selection
2.2. Classification
3. Discussion
4. Materials and Methods
4.1. Sequenced Individuals and Their Genomic Information
4.2. Feature Selection
4.2.1. SNP Aggregation Based on Linkage Disequilibrium (SNP Tagging)
4.2.2. SNP Selection Based on One-Dimensional Supervised Rank Aggregation (1D-SRA)
- From a dataset comprising N = 1825 individuals and P = 11,915,233 SNPs, K = 47,660 reduced datasets were sampled, each containing all N individuals but only a random subset of S = 250 SNPs.
- For each of the K reduced datasets, a multinomial logistic regression model was fitted for the classification of B = 5 breeds (Equation (1); a standard formulation is sketched after this list).
- Model performance (MP) of each of the K models was expressed by the quality of fit of the corresponding multinomial logistic regression model, quantified by the cross-entropy loss (CE) metric (Equation (2); also sketched after this list).
- The resulting model performance matrix C is a sparse matrix of K rows, one per reduced model, and P + 1 columns: the first P columns contain the feature performances (FPs) and the last column contains the model performance. Precisely, the k-th row contains the FPs of the S SNPs fitted in the k-th reduced model, while the FPs of the remaining P − S SNPs were set to zero, and the quality of fit of this model was expressed by the reciprocal of the cross-entropy loss. Due to the large dimensions of C, its disc storage was resolved using the memory map approach implemented in the NumPy library (version 1.24.3) [25]; a minimal sketch of this approach is given after this list.
- The aggregation step combined the effects of all SNPs from the reduced multinomial logistic regression models using a mixed linear model; a sketch of such a model is given after this list.
- The final step in the feature selection process comprised grouping the SNP effect estimates obtained from the mixed linear model into sets relevant and irrelevant for classification, so that the relevant features were further used to train the DL-based classifier. This was performed by applying the one-dimensional K-means clustering algorithm to define two clusters, as implemented in the Scikit-learn library (version 1.3.0) [28] (a minimal sketch follows Algorithm 1 below). Using K-means allowed us to avoid setting an arbitrary cutoff value to separate the relevant and irrelevant SNP sets. This procedure assumed that features with greater importance for classification resulted in models with high MP, which corresponds to a low cross-entropy loss [7].
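For reference, the multinomial logistic regression model and the cross-entropy loss can be written in their standard forms as follows; this is a sketch of the generic formulation and not necessarily the exact parameterisation used in Equations (1) and (2):

$$P(y_i = b \mid \mathbf{x}_i) = \frac{\exp\left(\beta_{0b} + \mathbf{x}_i^{\top}\boldsymbol{\beta}_b\right)}{\sum_{c=1}^{B} \exp\left(\beta_{0c} + \mathbf{x}_i^{\top}\boldsymbol{\beta}_c\right)}, \qquad b = 1, \ldots, B$$

$$CE = -\frac{1}{N} \sum_{i=1}^{N} \sum_{b=1}^{B} y_{ib} \log \hat{P}(y_i = b \mid \mathbf{x}_i)$$

where $\mathbf{x}_i$ denotes the genotypes of the S SNPs of individual i in a reduced dataset, $\boldsymbol{\beta}_b$ the corresponding SNP effects for breed b, and $y_{ib}$ the one-hot indicator of the true breed of individual i.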
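A minimal sketch of the memory-mapped storage of the model performance matrix C follows; the file name and the float32 data type are illustrative assumptions and are not stated in the text.

```python
import numpy as np

K, P = 47_660, 11_915_233

# disk-backed array for the K x (P + 1) model performance matrix C;
# the file name and dtype are illustrative assumptions
C = np.memmap("model_performance_C.dat", dtype=np.float32,
              mode="w+", shape=(K, P + 1))

# filling one row k (snp_idx, fp_values and mp_k would come from the
# k-th reduced model; shown here only to illustrate the layout):
# C[k, snp_idx] = fp_values     # FPs of the S sampled SNPs
# C[k, P] = mp_k                # last column: model performance

C.flush()  # push pending writes to disk
```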
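A minimal sketch of a mixed linear model of this type, with the last column of C as the dependent variable and the feature performances as random regressors; the notation is assumed here and may differ from the paper's formulation:

$$\mathbf{y} = \mathbf{1}\mu + \mathbf{F}\mathbf{u} + \mathbf{e}, \qquad \mathbf{u} \sim N(\mathbf{0}, \mathbf{I}\sigma^2_u), \quad \mathbf{e} \sim N(\mathbf{0}, \mathbf{I}\sigma^2_e)$$

where $\mathbf{y}$ is the K × 1 vector of model performances, $\mathbf{F}$ is the K × P matrix of feature performances (the first P columns of C), $\mathbf{u}$ is the P × 1 vector of aggregated SNP effects to be estimated, and $\mathbf{e}$ is the residual term.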
Algorithm 1: 1D-SRA

input:
    Feature data X (N × P)
    Target class Y (N × 1)
constants:
    P = 11,915,233  // total number of features (SNPs)
    S = 250         // reduced number of features
    K = 47,660      // number of reduced models

step A1: generate reduced data sets and assess performance of reduced models
    shuffle the vector of all P features
    for each reduced model k in K                     // over reduced models
        randomly sample S features without replacement from the shuffled vector
        fit a multinomial logistic regression model to N individuals described by S features
            (dependent variable: target class; independent variables: features)
        collect model_performance_k = cross-entropy loss
        for each s in S                               // over features
            feature_performance_k_s = max(SNP effect estimate)
            feature_index_k_s = feature Id from feature vector
        end for
    end for

step A2: generate model performance matrix C
    generate a model performance matrix of K rows and P + 1 columns filled with zeros
    for each k in K                                   // over reduced models
        C[k, P + 1] = model_performance_k
        for each s in S                               // over features
            C[k, feature_index_k_s] = feature_performance_k_s
        end for
    end for

step A3: supervised rank aggregation based on matrix C
    run LMM (dependent variable: model performance C[:, P + 1];
             independent variables: feature performance C[:, 1:P])
    collect 1:P vector of LMM parameter estimates

step A4: feature selection based on LMM parameter estimates
    define 2 clusters of LMM parameter estimates based on 1D-K-means clustering
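A minimal sketch of step A4, assuming the LMM SNP effect estimates are available as a NumPy vector (the input file name is hypothetical); the rule that the cluster with the larger centroid holds the relevant SNPs is an assumption made here for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# u_hat: vector of P SNP effect estimates from the mixed linear model
u_hat = np.loadtxt("lmm_snp_effects.txt")            # hypothetical input file

# one-dimensional K-means with two clusters (relevant vs. irrelevant)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(u_hat.reshape(-1, 1))    # column vector for 1D data

# assume the cluster with the larger centroid contains the relevant SNPs
relevant_cluster = int(np.argmax(kmeans.cluster_centers_.ravel()))
relevant_snps = np.where(labels == relevant_cluster)[0]
```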
4.2.3. SNP Selection Based on Multidimensional Supervised Rank Aggregation (MD-SRA)
- Five model performance matrices were created by repeating steps 1–4 described above. Since the performance of a single feature occurs only once within each model performance matrix, the K rows of each model performance matrix were collapsed into one dense vector of non-zero FPs, forming five model performance vectors.
- Relevant and irrelevant groups of features were defined by assuming two clusters in a five-dimensional K-means clustering of the matrix V that was formed by appending the five dense model performance vectors. Matrix V was weighted by multiplying each element by the reciprocal of the cross-entropy loss of the corresponding reduced model. The L2 norm of the five-dimensional coordinates of the cluster centroids was used to identify the cluster containing the relevant features, which corresponded to the higher norm in the five-dimensional space (a minimal sketch follows Algorithm 2 below).
Algorithm 2: MD-SRA

input:
    Feature data X (N × P)
    Target class Y (N × 1)
constants:
    P = 11,915,233  // total number of features (SNPs)
    S = 250         // reduced number of features
    K = 47,660      // number of reduced models
    W = 5           // number of model performance vectors

step B1: generate dense model performance matrix V
    for each v in W                                   // over model performance vectors
        execute step A1 of the 1D-SRA algorithm
        execute step A2 of the 1D-SRA algorithm
        transform performance matrix C by applying C[k, 1:P] · C[k, P + 1]
        v[1:P] = non-zero feature performances from the P feature columns of matrix C
        generate matrix V by appending dense vectors v
    end for

step B2: feature selection based on weighted elements of V
    define 2 clusters of matrix V based on MD-K-means clustering
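A minimal sketch of step B2, assuming the weighted matrix V (P rows, W = 5 columns, with the weighting from step B1 already applied) is stored in a hypothetical NumPy file.

```python
import numpy as np
from sklearn.cluster import KMeans

V = np.load("weighted_feature_performance_V.npy")    # hypothetical, shape (P, 5)

# five-dimensional K-means with two clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(V)

# the cluster whose centroid has the larger L2 norm in the five-dimensional
# space is taken to contain the SNPs relevant for classification
norms = np.linalg.norm(kmeans.cluster_centers_, axis=1)
relevant_cluster = int(np.argmax(norms))
relevant_snps = np.where(labels == relevant_cluster)[0]
```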
4.3. Classification
- The input consisted of the SNP genotypes that resulted from each of the three feature selection algorithms (tagSNPs selected by SNP tagging, or SNPs representing the relevant cluster selected by 1D-SRA or MD-SRA, respectively), as well as the breed class assignments for each individual (five breeds).
- First, in the convolution phase, (i) a sequential 1D Convolutional Neural Network (1D-CNN) layer was applied, followed by the Rectified Linear Unit (ReLU) activation function, (ii) a batch normalisation (BN) layer, and (iii) a max pooling layer with a pool size of five. In particular, for each 1D convolutional layer, 10 kernels of size 150 were applied to the input to capture local patterns by sliding across the input. After each 1D-CNN layer, a batch normalisation layer was used to normalise the output so that each layer received input with the same mean and variance. This was followed by a max-pooling layer, which selected the maximum value from each feature map window to decrease the dimensionality (Figure 6).
- In the last section of the DL architecture, the output of the 1D-CNN was flattened into a one-dimensional vector, which was then processed by three hidden dense layers with the ReLU activation function (Figure 6).
- The final classification was performed by the last layer of the DNN, implementing the softmax activation function, $\sigma(z)_b = \exp(z_b) / \sum_{c=1}^{B} \exp(z_c)$, where $z$ represents the output of the last layer and $B$ is the number of classes (i.e., breeds). A minimal sketch of the full architecture is given after this list.
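A minimal sketch of the described architecture using the Keras Sequential API; the number of convolution blocks (two here), the dense layer sizes, the loss, and the optimiser are illustrative assumptions and are not taken from the paper.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_classifier(n_snps: int, n_classes: int = 5) -> keras.Model:
    model = keras.Sequential([
        layers.Input(shape=(n_snps, 1)),           # one genotype per SNP position
        # convolution block: Conv1D (10 kernels, size 150) + ReLU, BN, max pooling
        layers.Conv1D(filters=10, kernel_size=150, activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling1D(pool_size=5),
        # a second, identical convolution block (number of blocks is assumed)
        layers.Conv1D(filters=10, kernel_size=150, activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling1D(pool_size=5),
        # flatten and three hidden dense layers with ReLU (sizes assumed)
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dense(128, activation="relu"),
        layers.Dense(64, activation="relu"),
        # softmax output over the five breed classes
        layers.Dense(n_classes, activation="softmax"),
    ])
    # loss and optimiser are placeholders, not the paper's exact settings
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

For example, `build_classifier(n_snps=773_069)` would build the network for an input set of the size selected by SNP tagging.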
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Sartori, F.; Codicè, F.; Caranzano, I.; Rollo, C.; Birolo, G.; Fariselli, P.; Pancotti, C. A Comprehensive Review of Deep Learning Applications with Multi-Omics Data in Cancer Research. Genes 2025, 16, 648.
- Ballard, J.L.; Wang, Z.; Li, W.; Shen, L.; Long, Q. Deep Learning-Based Approaches for Multi-Omics Data Integration and Analysis. BioData Min. 2024, 17, 38.
- Giraud, C. Introduction to High-Dimensional Statistics, 2nd ed.; Chapman and Hall/CRC: Boca Raton, FL, USA, 2021; ISBN 978-1-003-15874-5.
- Johnstone, I.M.; Titterington, D.M. Statistical Challenges of High-Dimensional Data. Philos. Trans. R. Soc. A 2009, 367, 4237–4253.
- Fan, J.; Li, R. Statistical Challenges with High Dimensionality. In Proceedings of the International Congress of Mathematicians, Zurich, Switzerland, 22–30 August 2006.
- Sujithra, L.R.; Kuntha, A. Review of Classification and Feature Selection Methods for Genome-Wide Association SNP for Breast Cancer. In Artificial Intelligence for Sustainable Applications; Umamaheswari, K., Vinoth Kumar, B., Somasundaram, S.K., Eds.; Wiley: Hoboken, NJ, USA, 2023; pp. 55–78. ISBN 978-1-394-17458-4.
- Jain, R.; Xu, W. Supervised Rank Aggregation (SRA): A Novel Rank Aggregation Approach for Ensemble-Based Feature Selection. Recent Adv. Comput. Sci. Commun. 2024, 17, e030124225206.
- Viharos, Z.J.; Kis, K.B.; Fodor, Á.; Büki, M.I. Adaptive, Hybrid Feature Selection (AHFS). Pattern Recognit. 2021, 116, 107932.
- Jiang, J. Asymptotic Properties of the Empirical BLUP and BLUE in Mixed Linear Models. Stat. Sin. 1998, 8, 861–885.
- Henderson, C.R. Applications of Linear Models in Animal Breeding; University of Guelph: Guelph, ON, Canada, 1984.
- Yang, J.; Zaitlen, N.A.; Goddard, M.E.; Visscher, P.M.; Price, A.L. Advantages and Pitfalls in the Application of Mixed-Model Association Methods. Nat. Genet. 2014, 46, 100–106.
- Adadi, A. A Survey on Data-efficient Algorithms in Big Data Era. J. Big Data 2021, 8, 24.
- Yu, L.; Liu, H. Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution. In Proceedings of the Twentieth International Conference on Machine Learning, Washington, DC, USA, 21–24 August 2003; Volume 2, pp. 856–863.
- Kamalov, F.; Sulieman, H.; Moussa, S.; Reyes, J.A.; Safaraliev, M. Nested Ensemble Selection: An Effective Hybrid Feature Selection Method. Heliyon 2023, 9, e19686.
- Ikotun, A.M.; Ezugwu, A.E.; Abualigah, L.; Abuhaija, B.; Heming, J. K-Means Clustering Algorithms: A Comprehensive Review, Variants Analysis, and Advances in the Era of Big Data. Inf. Sci. 2023, 622, 178–210.
- Barrera-García, J.; Cisternas-Caneo, F.; Crawford, B.; Gómez Sánchez, M.; Soto, R. Feature Selection Problem and Metaheuristics: A Systematic Literature Review about Its Formulation, Evaluation and Applications. Biomimetics 2023, 9, 9.
- Mamdouh Farghaly, H.; Abd El-Hafeez, T. A High-Quality Feature Selection Method Based on Frequent and Correlated Items for Text Classification. Soft Comput. 2023, 27, 11259–11274.
- Pinto, R.C.; Engel, P.M. A Fast Incremental Gaussian Mixture Model. PLoS ONE 2015, 10, e0139931.
- Wan, H.; Wang, H.; Scotney, B.; Liu, J. A Novel Gaussian Mixture Model for Classification. In Proceedings of the 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC), Bari, Italy, 6–9 October 2019; pp. 3298–3303.
- Zhao, Y.; Shrivastava, A.K.; Tsui, K.L. Regularized Gaussian Mixture Model for High-Dimensional Clustering. IEEE Trans. Cybern. 2019, 49, 3677–3688.
- van der Auwera, G.A.; O’Connor, B.D. Genomics in the Cloud: Using Docker, GATK, and WDL in Terra, 1st ed.; O’Reilly Media: Sebastopol, CA, USA, 2020; ISBN 978-1-4919-7519-0.
- Browning, B.L.; Zhou, Y.; Browning, S.R. A One-Penny Imputed Genome from Next-Generation Reference Panels. Am. J. Hum. Genet. 2018, 103, 338–348.
- Danecek, P.; Bonfield, J.K.; Liddle, J.; Marshall, J.; Ohan, V.; Pollard, M.O.; Whitwham, A.; Keane, T.; McCarthy, S.A.; Davies, R.M.; et al. Twelve Years of SAMtools and BCFtools. GigaScience 2021, 10, giab008.
- Purcell, S. PLINK. Available online: https://zzz.bwh.harvard.edu/plink/ (accessed on 6 August 2025).
- Harris, C.R.; Millman, K.J.; Van Der Walt, S.J.; Gommers, R.; Virtanen, P.; Cournapeau, D.; Wieser, E.; Taylor, J.; Berg, S.; Smith, N.J.; et al. Array Programming with NumPy. Nature 2020, 585, 357–362.
- Lidauer, M.; Matilainen, K.; Mäntysaari, E.; Pitkänen, T.; Taskinen, M.; Strandén, I. Technical Reference Guide for MiX99 Solver; Natural Resources Institute Finland (Luke): Jokioinen, Finland, 2022.
- Strandén, I.; Lidauer, M. Solving Large Mixed Linear Models Using Preconditioned Conjugate Gradient Iteration. J. Dairy Sci. 1999, 82, 2779–2787.
- Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-Learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
- Chollet, F. Keras. 2015. Available online: https://keras.io/ (accessed on 5 August 2025).
- Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. TensorFlow: A System for Large-Scale Machine Learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation, Savannah, GA, USA, 2–4 November 2016; pp. 265–283.
- Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2017, arXiv:1412.6980.
- Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollar, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 318–327.
- Lundberg, S.M.; Lee, S.-I. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30.
- König, I.R.; Auerbach, J.; Gola, D.; Held, E.; Holzinger, E.R.; Legault, M.-A.; Sun, R.; Tintle, N.; Yang, H.-C. Machine Learning and Data Mining in Complex Genomic Data—A Review on the Lessons Learned in Genetic Analysis Workshop 19. BMC Genet. 2016, 17, S1.
Method | Number of Selected SNPs | Reduction Rate | Relative Computational Time | Disc Storage |
---|---|---|---|---|
SNP tagging | 773,069 | 93.51% | x1.0 | No intermediate files |
1D-SRA | 4,392,322 | 63.14% | x37.7 | 3.1 TB |
MD-SRA | 3,886,351 | 67.39% | x2.2 | 227 MB |
Method | Macro F1-Score (Validation of the Training Dataset) | AUC (Validation of the Training Dataset) | Macro F1-Score (Test Dataset) | AUC (Test Dataset) |
---|---|---|---|---|
SNP tagging | 74.65% ± 16.63% | 0.9627 ± 0.0357 | 86.87% | 0.9847 |
1D-SRA | 90.09% ± 5.49% | 0.9915 ± 0.0058 | 96.81% | 0.9968 |
MD-SRA | 88.22% ± 9.53% | 0.9842 ± 0.0168 | 95.12% | 0.9976 |