Predicting the S. cerevisiae Gene Expression Score by a Machine Learning Classifier
Abstract
1. Introduction
2. Materials and Methods
2.1. Introduction to ML: Basic Concepts
2.2. Main Stages of the Applied Procedure
2.3. The Data Platform for AI
2.4. Initial Data
2.5. Data Balancing
2.6. Attribute Selection
3. Results
3.1. Finding the Optimal Model
3.2. Example Topology of the Tree
3.3. Best-Predicting Routes
3.4. Finding the Dependence of the Test Classification Correctness on the Presence of the Selected Attribute
4. Discussion
5. Conclusions
- The physical attributes of genes are the most important for determining gene expression scores when they are examined by different attribute evaluators.
- The random forest classification algorithm may be adapted to successfully predict the intensity of gene expression and indicate the best routes and parameters for this prediction.
- Logistic attributes dominate other attributes in random forest prediction, but the predictive role of attribute ensembles may be stronger than that of single attributes.
- The genetic type attributes may not be the most important in determining the expression score, but they cannot be ignored.
- The random forest classifier is a promising tool for in silico experiments.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Crick, F. Central Dogma of Molecular Biology. Nature 1970, 227, 561–563.
- Ralston, A.; Shaw, K. Gene expression regulates cell differentiation. Nat. Educ. 2008, 1, 127–131.
- Wright, J. Gene Control; ED-Tech Press: Essex, UK, 2020; p. 18.
- Phillips, T. Regulation of transcription and gene expression in eukaryotes. Nat. Educ. 2008, 1, 199.
- Paudel, B.P.; Xu, Z.-Q.; Jergic, S.; Oakley, A.J.; Sharma, N.; Brown, S.H.J.; Bouwer, J.C.; Lewis, P.J.; Dixon, N.E.; van Oijen, A.M.; et al. Mechanism of transcription modulation by the transcription-repair coupling factor. Nucleic Acids Res. 2022, 50, 5688–5712.
- Whiteside, S.T.; Goodbourn, S. Signal transduction and nuclear targeting: Regulation of transcription factor activity by subcellular localisation. J. Cell Sci. 1993, 104 Pt 4, 949–955.
- Kim, S.; Kaang, B.K. Epigenetic regulation and chromatin remodeling in learning and memory. Exp. Mol. Med. 2017, 49, e281.
- Bumgarner, R. Overview of DNA microarrays: Types, applications, and their future. Curr. Protoc. Mol. Biol. 2013, 22, 1.
- Wang, Z.; Gerstein, M.; Snyder, M. RNA-Seq: A revolutionary tool for transcriptomics. Nat. Rev. Genet. 2009, 10, 57–63.
- SGD Project. 2025. Available online: https://sites.google.com/view/yeastgenome-help/function-help/expression-data (accessed on 8 April 2025).
- Siwiak, M.; Zielenkiewicz, P. A Comprehensive, Quantitative, and Genome-Wide Model of Translation. PLoS Comput. Biol. 2010, 6, e1000865.
- Minca, E.C.; Al-Rohil, R.N.; Wang, M.; Harms, P.W.; Ko, J.S.; Collie, A.M.; Kovalyshyn, I.; Prieto, V.G.; Tetzlaff, M.T.; Billings, S.D.; et al. Comparison between melanoma gene expression score and fluorescence in situ hybridization for the classification of melanocytic lesions. Mod. Pathol. 2016, 29, 832–843.
- Siwiak, M.; Zielenkiewicz, P. Transimulation-Protein Biosynthesis Web Service. PLoS ONE 2013, 8, e73943.
- Mehdi, A.M.; Patrick, R.; Bailey, T.L.; Bodén, M. Predicting the Dynamics of Protein Abundance. Mol. Cell. Proteom. 2014, 13, 1330–1340.
- Li, W.; Yin, Y.; Quan, X.; Zhang, H. Gene Expression Value Prediction Based on XGBoost Algorithm. Front. Genet. 2019, 10, 1077.
- Mitchell, T. Machine Learning; McGraw Hill: New York, NY, USA, 1997; ISBN 0-07-042807-7.
- Chicco, D. Ten quick tips for machine learning in computational biology. BioData Min. 2017, 10, 35.
- Ho, T.K. Random Decision Forests. In Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, Canada, 14–16 August 1995; pp. 278–282.
- Spiesser, T.W.; Diener, C.; Barberis, M.; Klipp, E. What Influences DNA Replication Rate in Budding Yeast? PLoS ONE 2010, 5, e10203.
- Laso, M.V.; Zhu, D.; Sagliocco, F.; Brown, A.J.; Tuite, M.F.; McCarthy, J.E. Inhibition of translational initiation in the yeast Saccharomyces cerevisiae as a function of the stability and position of hairpin structures in the mRNA leader. J. Biol. Chem. 1993, 268, 6453–6462.
- Johnson, A.; Lewis, J.; Alberts, B. The Shape and Structure of Proteins. In Molecular Biology of the Cell, 4th ed.; Garland Science: New York, NY, USA, 2002. Available online: https://www.ncbi.nlm.nih.gov/books/NBK26830/ (accessed on 8 April 2025).
- Frank, E.; Hall, M.A.; Witten, I.H. The WEKA Workbench. Online Appendix for “Data Mining: Practical Machine Learning Tools and Techniques”, 4th ed.; Morgan Kaufmann: Burlington, MA, USA, 2016.
- Oromendia, A.B.; Dodgson, S.E.; Amon, A. Aneuploidy causes proteotoxic stress in yeast. Genes Dev. 2012, 26, 2696–2708.
- Kunkel, J.; Luo, X.; Capaldi, A.P. Integrated TORC1 and PKA signaling control the temporal activation of glucose-induced gene expression in yeast. Nat. Commun. 2019, 10, 3558.
- Gasch, A.P.; Spellman, P.T.; Kao, C.M.; Carmel-Harel, O.; Eisen, M.B.; Storz, G.; Botstein, D.; Brown, P.O. Genomic expression programs in the response of yeast cells to environmental changes. Mol. Biol. Cell 2000, 11, 4241–4257.
- Cherry, J.M.; Adler, C.; Ball, C.; Chervitz, S.A.; Dwight, S.S.; Hester, E.T.; Jia, Y.; Juvik, G.; Roe, T.; Schroeder, M.; et al. SGD: Saccharomyces Genome Database. Nucleic Acids Res. 1998, 26, 73–79.
- Yanofsky, C. Establishing the Triplet Nature of the Genetic Code. Cell 2007, 128, 815–818.
- Edgar, R.; Domrachev, M.; Lash, A.E. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 2002, 30, 207–210.
- Nasa, C.; Suman. Evaluation of Different Classification Techniques for WEB Data. Int. J. Comput. Appl. 2012, 52, 34–40.
- WEKA. 2025. Available online: https://weka.sourceforge.io/doc.dev/index.html?overview-summary.html (accessed on 8 April 2025).
- le Cessie, S.; van Houwelingen, J.C. Ridge Estimators in Logistic Regression. Appl. Stat. 1992, 41, 191–201.
- Aha, D.; Kibler, D. Instance-based learning algorithms. Mach. Learn. 1991, 6, 37–66.
- Cleary, J.G.; Trigg, L.E. K*: An Instance-based Learner Using an Entropic Distance Measure. In Proceedings of the 12th International Conference on Machine Learning, Tahoe City, CA, USA, 9–12 July 1995; pp. 108–114.
- Holte, R.C. Very simple classification rules perform well on most commonly used datasets. Mach. Learn. 1993, 11, 63–91.
- Quinlan, R. C4.5: Programs for Machine Learning; Morgan Kaufmann Publishers: San Mateo, CA, USA, 1993.
- Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
- Mantione, K.J.; Kream, R.M.; Kuzelova, H.; Ptacek, R.; Raboch, J.; Samuel, J.M.; Stefano, G.B. Comparing bioinformatic gene expression profiling methods: Microarray and RNA-Seq. Med. Sci. Monit. Basic Res. 2014, 20, 138–142.
- Bergemann, T.L.; Wilson, J. Proportion statistics to detect differentially expressed genes: A comparison with log-ratio statistics. BMC Bioinform. 2011, 12, 228.
- Koch, C.M.; Chiu, S.F.; Akbarpour, M.; Bharat, A.; Ridge, K.M.; Bartom, E.T.; Winter, D.R. A Beginner’s Guide to Analysis of RNA Sequencing Data. Am. J. Respir. Cell Mol. Biol. 2018, 59, 145–157.
- Quackenbush, J. Microarray data normalization and transformation. Nat. Genet. 2002, 32 (Suppl. S4), 496–501.
- Li, J.; Liang, Q.; Song, W.; Marchisio, M.A. Nucleotides upstream of the Kozak sequence strongly influence gene expression in the yeast S. cerevisiae. J. Biol. Eng. 2017, 11, 25.
- Arhondakis, S.; Auletta, F.; Torelli, G.; D’Onofrio, G. Base composition and expression level of human genes. Gene 2004, 325, 165–169.
- Singh, R.; Sophiarani, Y. A report on DNA sequence determinants in gene expression. Bioinformation 2020, 16, 422–431.
Mean Normalized Rank | Attribute | Meaning |
---|---|---|
0.02741 | MM | the molecular mass of the protein |
0.025221 | ProtL | the length of the protein-coding sequence |
0.017287 | TransL | the length of the transcript |
0.015501 | Fun | the gene function |
0.007004 | nC | the number of cytosine bases |
0.006534 | 5’UTRL | the length of the 5’ untranslated region |
Selected Aspects of Features and Conditions | Attribute | Meaning |
---|---|---|
Physical properties | TransL | the length of the transcript |
Physical properties | 5’UTRL | the length of the 5’ untranslated region |
Physical properties | ProtL | the length of the protein-coding sequence |
Physical properties | MM | the molecular mass of the protein |
Experimental conditions | Exp | the experiment type |
Experimental conditions | Time | the time of the experiment |
Logistic | Chro | the order number of the chromosome |
Logistic | Fun | the gene function |
Genetic | U3 | base of the 5’UTR (3rd before start codon) |
Genetic | U2 | base of the 5’UTR (2nd before start codon) |
Genetic | U1 | base of the 5’UTR (1st before start codon) |
Genetic | AA2 | amino acid (2nd coded by DNA, 1st after methionine) |
Genetic | S4 | base of the coding sequence (4th, 1st after start codon) |
Genetic | S5 | base of the coding sequence (5th, 2nd after start codon) |
Genetic | S6 | base of the coding sequence (6th, 3rd after start codon) |
Statistic | nA | the number of adenine bases |
Statistic | nT | the number of thymine bases |
Statistic | nG | the number of guanine bases |
Statistic | nC | the number of cytosine bases |
Class | L, M, H | the low, moderate, and high expression scores (ES) |
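
The positional and count attributes defined in the table above can be derived directly from sequence strings. The following is only a minimal sketch, not the authors' pipeline: the function name `extract_attributes` and its two inputs are hypothetical, and it assumes that the base counts nA, nT, nG, and nC are taken over the coding sequence, which the table does not state.

```python
from collections import Counter


def extract_attributes(utr5: str, cds: str) -> dict:
    """Derive sequence-based attributes from a 5'UTR and a coding sequence.

    Hypothetical helper for illustration only. Assumption: base counts
    are taken over the coding sequence (the source does not specify this).
    """
    utr5, cds = utr5.upper(), cds.upper()
    if len(utr5) < 3 or len(cds) < 6:
        raise ValueError("need at least 3 UTR bases and 6 coding bases")
    counts = Counter(cds)
    return {
        "5'UTRL": len(utr5),   # length of the 5' untranslated region
        "ProtL": len(cds),     # length of the protein-coding sequence
        "U3": utr5[-3],        # 3rd base before the start codon
        "U2": utr5[-2],        # 2nd base before the start codon
        "U1": utr5[-1],        # 1st base before the start codon
        "S4": cds[3],          # 4th coding base (1st after the start codon)
        "S5": cds[4],          # 5th coding base
        "S6": cds[5],          # 6th coding base
        "nA": counts["A"],
        "nT": counts["T"],
        "nG": counts["G"],
        "nC": counts["C"],
    }


# Toy sequences, not real yeast data.
example = extract_attributes("ACGTCAA", "ATGTCTGGA")
print(example["U3"], example["U2"], example["U1"])  # bases preceding ATG
```

The remaining attributes (MM, Fun, Chro, Exp, Time, AA2) would come from annotation or experiment metadata rather than from the raw strings, so they are left out of this sketch.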
Classifier | Correctly Classified Instances, cci [%] | Comments | References |
---|---|---|---|
ZeroR | 33.3 | Predicts the mean (for a numeric class) or the mode (for a nominal class). Constructs a frequency table for the target and selects its most frequent value. | [29] |
BayesNet | 41.3 | Uses various search algorithms and quality measures. Base class for a Bayes Network classifier. Provides data structures (network structure, conditional probability distributions, etc.) and facilities common to Bayes Network learning algorithms such as K2 and B. | [30] |
Logistic | 28.6 | Constructs and uses a multinomial logistic regression model with a ridge estimator. | [31] |
MultilayerPerceptron | 54.0 | Uses backpropagation to classify instances. This network can be built by hand, created by an algorithm, or both. The network can also be monitored and modified during the training time. The nodes in this network are all sigmoid (except for when the class is numeric, in which case the output nodes become unthresholded linear units). | [30] |
IBk | 44.4 | K-nearest neighbors classifier. Can select the appropriate value of K based on cross-validation. Can also apply distance weighting. | [32] |
Kstar | 68.3 | An instance-based classifier. The class of a test instance is based upon the class of those training instances similar to it, as determined by some similarity function. It differs from other instance-based learners in that it uses an entropy-based distance function. | [33] |
OneR | 58.7 | Uses the minimum-error attribute for prediction, discretizing numeric attributes. | [34] |
J48 | 65.1 | Generates a pruned or unpruned C4.5 decision tree. | [35] |
RandomForest | 77.8 | Constructs a forest of random trees. | [36] |
RandomTree | 42.9 | Constructs a tree that considers K randomly chosen attributes at each node. Performs no pruning. Also has an option to allow the estimation of class probabilities (or the target mean in the regression case) based on a hold-out set (backfitting). | [30] |
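
The ZeroR baseline in the table above can be reproduced in a few lines, which helps to put the other scores in context. This is a minimal sketch, not WEKA's implementation: the function `zero_r_accuracy` is hypothetical, and the label set assumes 63 test instances with 21 per class, as suggested by the TP + FN counts in the per-class results table.

```python
from collections import Counter


def zero_r_accuracy(train_labels, test_labels):
    """Percentage of test labels matched by always predicting the training mode."""
    mode = Counter(train_labels).most_common(1)[0][0]
    correct = sum(1 for label in test_labels if label == mode)
    return 100.0 * correct / len(test_labels)


# Assumed balanced label set: 21 low, 21 moderate, 21 high (see lead-in).
labels = ["L"] * 21 + ["M"] * 21 + ["H"] * 21
baseline = zero_r_accuracy(labels, labels)
print(round(baseline, 1))
```

With three balanced classes, any constant prediction matches one third of the instances, so a classifier is informative only to the extent that it exceeds this level. Under the same 63-instance assumption, the 77.8% reported for RandomForest would correspond to 49 correctly classified instances.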
Importance | Number of Nodes | Attribute |
---|---|---|
0.73 | 6906 | Exp |
0.61 | 9276 | Time |
0.54 | 712 | Fun |
0.41 | 529 | ProtL |
0.41 | 717 | TransL |
0.4 | 411 | MM |
0.39 | 541 | 5’UTRL |
0.39 | 160 | U2 |
0.37 | 158 | U3 |
0.37 | 189 | AA2 |
0.35 | 150 | Chro |
0.35 | 146 | S6 |
0.33 | 156 | U1 |
0.32 | 109 | nG |
0.31 | 74 | S4 |
0.3 | 50 | S5 |
0.29 | 92 | nC |
0.29 | 163 | nA |
0.27 | 114 | nT |
Class | TP | FP | TN | FN | TPR | FPR | TNR | FNR | ACC |
---|---|---|---|---|---|---|---|---|---|
L | 18 | 1 | 41 | 3 | 0.857143 | 0.02381 | 0.97619 | 0.142857 | 0.936508 |
M | 21 | 9 | 33 | 0 | 1 | 0.214286 | 0.785714 | 0 | 0.857143 |
H | 14 | 0 | 42 | 7 | 0.666667 | 0 | 1 | 0.333333 | 0.888889 |
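
The rate columns in the table above follow from the four counts by the standard one-vs-rest definitions. The sketch below recomputes them; the function name `rates` is hypothetical, and the counts are copied from the table.

```python
def rates(tp, fp, tn, fn):
    """One-vs-rest rates from the counts of a single class."""
    return {
        "TPR": tp / (tp + fn),                   # sensitivity
        "FPR": fp / (fp + tn),
        "TNR": tn / (fp + tn),                   # specificity
        "FNR": fn / (tp + fn),
        "ACC": (tp + tn) / (tp + fp + tn + fn),
    }


# (TP, FP, TN, FN) per class, as listed in the table.
counts = {"L": (18, 1, 41, 3), "M": (21, 9, 33, 0), "H": (14, 0, 42, 7)}

for cls, c in counts.items():
    print(cls, {name: round(value, 6) for name, value in rates(*c).items()})
```

Note that ACC here is a per-class, one-vs-rest accuracy, so it is not the same quantity as the overall percentage of correctly classified instances.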
Class | Route | Attempts | False Predictions |
---|---|---|---|
L | ProtL < 2910; ProtL ≥ 514.5; nA ≥ 122.5; 5’UTRL < 264.5 | 41 | 0 |
M | ProtL ≥ 2910; 5’UTRL ≥ 64.5; Exp = Dia | 120 | 0 |
H | S5 = T; Exp = Dia; U3 = A; ProtL ≥ 469.5; MM < 71,288.1; 5’UTRL ≥ 72 | 71 | 0 |
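
Each route in the table above can be read as a conjunction of attribute tests along one path of a tree. The following sketch encodes the L and M routes and checks whether a gene record satisfies them. The data layout, the `follows_route` helper, and the example record are hypothetical, and treating a route as a plain conjunction is an assumption about how the table is meant to be read.

```python
import operator

# Comparison symbols used in the routes, mapped to Python operators.
OPS = {"<": operator.lt, ">=": operator.ge, "==": operator.eq}

# Conditions copied from the table; the tuple layout is illustrative.
ROUTE_L = [("ProtL", "<", 2910), ("ProtL", ">=", 514.5),
           ("nA", ">=", 122.5), ("5'UTRL", "<", 264.5)]
ROUTE_M = [("ProtL", ">=", 2910), ("5'UTRL", ">=", 64.5), ("Exp", "==", "Dia")]


def follows_route(gene, route):
    """Return True if the gene record passes every test on the route."""
    return all(OPS[op](gene[attr], value) for attr, op, value in route)


# Hypothetical gene record, not taken from the dataset.
gene = {"ProtL": 1200, "nA": 400, "5'UTRL": 100, "Exp": "Dia"}
print(follows_route(gene, ROUTE_L), follows_route(gene, ROUTE_M))
```

A record that passes a route would be assigned the class in the first column of that row; records that pass no listed route are simply not covered by these best-predicting paths.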
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Pawłowski, P.H.; Zielenkiewicz, P. Predicting the S. cerevisiae Gene Expression Score by a Machine Learning Classifier. Life 2025, 15, 723. https://doi.org/10.3390/life15050723