Two-Stage Probability-Enhanced Regression on Property Matrices and LLM Embeddings Enables State-of-the-Art Prediction of Gene Knockdown by Modified siRNAs
Abstract
1. Introduction
2. Results
2.1. Data Collection and Engineering
2.1.1. Data Extraction and Curation
2.1.2. Feature Engineering
2.2. Data Analysis
2.2.1. Exploratory Data Analysis
2.2.2. Correlation Analysis
2.2.3. Sequence Analysis
2.2.4. Mismatch Effect on Gene Knockdown
2.2.5. Modifications Influence on siRNA Efficacy
2.3. Predictive ML Models
2.3.1. siRNA Classifiers
Binary Classifier
Multiclass Classifier
2.3.2. Knockdown Efficacy Quantitative Prediction
Descriptors and Models Comparison
Probability-Enhanced Approach
LOGO Experiment
3. Materials and Methods
3.1. Data Processing
3.2. Feature Engineering
3.3. ML Models Development
3.4. Classification Models
3.5. Regression Models
3.6. Probability-Enhanced Approach
3.7. Leave-One-Gene-Out (LOGO) Experiment
3.8. Visualization
3.9. AI-Assisted Figure Generation
4. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Arkin, M.R.; Tang, Y.; Wells, J.A. Small-Molecule Inhibitors of Protein-Protein Interactions: Progressing toward the Reality. Chem. Biol. 2014, 21, 1102–1114.
2. Voinnet, O. Induction and Suppression of RNA Silencing: Insights from Viral Infections. Nat. Rev. Genet. 2005, 6, 206–220.
3. Prussia, A.; Thepchatri, P.; Snyder, J.P.; Plemper, R.K. Systematic Approaches towards the Development of Host-Directed Antiviral Therapeutics. Int. J. Mol. Sci. 2011, 12, 4027–4052.
4. Ros, X.B.-D. Toward Learning the Rules That Predict siRNA Efficacy. Mol. Ther.-Nucleic Acids 2023, 33, 543–544.
5. Batra, N.; Tu, M.-J.; Yu, A.-M. Molecular Engineering of Functional SiRNA Agents. ACS Synth. Biol. 2024, 13, 1906–1915.
6. Redman, M.; King, A.; Watson, C.; King, D. What Is CRISPR/Cas9? Arch. Dis. Child. Educ. Pract. Ed. 2016, 101, 213.
7. Metwally, A.A.; Nayel, A.A.; Hathout, R.M. In Silico Prediction of siRNA Ionizable-Lipid Nanoparticles In Vivo Efficacy: Machine Learning Modeling Based on Formulation and Molecular Descriptors. Front. Mol. Biosci. 2022, 9, 1042720.
8. Müller, M.; Avar, M.; Heinzer, D.; Emmenegger, M.; Aguzzi, A.; Pelkmans, L.; Berry, S. High Content Genome-Wide siRNA Screen to Investigate the Coordination of Cell Size and RNA Production. Sci. Data 2021, 8, 162.
9. Sun, Y.-C.; Qiu, Z.-Z.; Wen, F.-L.; Yin, J.-Q.; Zhou, H. Revealing Potential Diagnostic Gene Biomarkers Associated with Immune Infiltration in Patients with Renal Fibrosis Based on Machine Learning Analysis. J. Immunol. Res. 2022, 2022, 3027200.
10. Kavya, K.V.; Vargheese, S.; Shukla, S.; Khan, I.; Dey, D.K.; Bajpai, V.K.; Thangavelu, K.; Vivek, R.; Kumar, R.T.R.; Han, Y.-K.; et al. A Cationic Amino Acid Polymer Nanocarrier Synthesized in Supercritical CO2 for Co-Delivery of Drug and Gene to Cervical Cancer Cells. Colloids Surf. B Biointerfaces 2022, 216, 112584.
11. Wittrup, A.; Lieberman, J. Knocking down Disease: A Progress Report on siRNA Therapeutics. Nat. Rev. Genet. 2015, 16, 543–552.
12. Chen, X.; Mangala, L.S.; Rodriguez-Aguayo, C.; Kong, X.; Lopez-Berestein, G.; Sood, A.K. RNA Interference–Based Therapy and Its Delivery Systems. Cancer Metastasis Rev. 2018, 37, 107–124.
13. Meng, Z.; Lu, M. RNA Interference-Induced Innate Immunity, Off-Target Effect, or Immune Adjuvant? Front. Immunol. 2017, 8, 331.
14. Ahmed, F.; Raghava, G.P.S. Designing of Highly Effective Complementary and Mismatch siRNAs for Silencing a Gene. PLoS ONE 2011, 6, e23443.
15. Sioud, M.; Furset, G.; Cekaite, L. Suppression of Immunostimulatory siRNA-Driven Innate Immune Activation by 2′-Modified RNAs. Biochem. Biophys. Res. Commun. 2007, 361, 122–126.
16. Song, X.; Wang, X.; Ma, Y.; Liang, Z.; Yang, Z.; Cao, H. Site-Specific Modification Using the 2′-Methoxyethyl Group Improves the Specificity and Activity of siRNAs. Mol. Ther.-Nucleic Acids 2017, 9, 242–250.
17. Fluiter, K.; Mook, O.R.F.; Baas, F. The Therapeutic Potential of LNA-Modified siRNAs: Reduction of Off-Target Effects by Chemical Modification of the siRNA Sequence. In siRNA and miRNA Gene Silencing; Sioud, M., Ed.; Methods in Molecular Biology; Humana Press: Totowa, NJ, USA, 2009; Volume 487, pp. 1–15. ISBN 978-1-60327-546-0.
18. Bramsen, J.B.; Pakula, M.M.; Hansen, T.B.; Bus, C.; Langkjær, N.; Odadzic, D.; Smicius, R.; Wengel, S.L.; Chattopadhyaya, J.; Engels, J.W.; et al. A Screen of Chemical Modifications Identifies Position-Specific Modification by UNA to Most Potently Reduce siRNA off-Target Effects. Nucleic Acids Res. 2010, 38, 5761–5773.
19. Khvorova, A.; Watts, J.K. The Chemical Evolution of Oligonucleotide Therapies of Clinical Utility. Nat. Biotechnol. 2017, 35, 238–248.
20. Lorenz, R.; Bernhart, S.H.; Höner Zu Siederdissen, C.; Tafer, H.; Flamm, C.; Stadler, P.F.; Hofacker, I.L. ViennaRNA Package 2.0. Algorithms Mol. Biol. 2011, 6, 26.
21. Zuker, M. Mfold Web Server for Nucleic Acid Folding and Hybridization Prediction. Nucleic Acids Res. 2003, 31, 3406–3415.
22. Han, Y.; He, F.; Chen, Y.; Liu, Y.; Yu, H. SiRNA Silencing Efficacy Prediction Based on a Deep Architecture. BMC Genom. 2018, 19, 669.
23. Liu, B.; Huang, H.; Liao, W.; Pan, X.; Jin, C.; Yuan, Y. DeepSipred: A Deep-Learning-Based Approach on siRNA Inhibition Prediction. In Proceedings of the BIC 2024: 2024 4th International Conference on Bioinformatics and Intelligent Computing, Beijing, China, 26–28 January 2024.
24. Liu, X.; Liu, C.; Zhou, J.; Chen, C.; Qu, F.; Rossi, J.J.; Rocchi, P.; Peng, L. Promoting siRNA Delivery via Enhanced Cellular Uptake Using an Arginine-Decorated Amphiphilic Dendrimer. Nanoscale 2015, 7, 3867–3875.
25. siRNAmod: First siRNA Chemical Modification Database|CRDD. Available online: http://crdd.osdd.net/servers/sirnamod/sub.php (accessed on 20 August 2024).
26. Bento, A.P.; Hersey, A.; Félix, E.; Landrum, G.; Gaulton, A.; Atkinson, F.; Bellis, L.J.; De Veij, M.; Leach, A.R. An Open Source Chemical Structure Curation Pipeline Using RDKit. J. Cheminformatics 2020, 12, 51.
27. Dong, J.; Yao, Z.-J.; Zhang, L.; Luo, F.; Lin, Q.; Lu, A.-P.; Chen, A.F.; Cao, D.-S. PyBioMed: A Python Library for Various Molecular Representations of Chemicals, Proteins and DNAs and Their Interactions. J. Cheminformatics 2018, 10, 16.
28. Willighagen, E.L.; Mayfield, J.W.; Alvarsson, J.; Berg, A.; Carlsson, L.; Jeliazkova, N.; Kuhn, S.; Pluskal, T.; Rojas-Chertó, M.; Spjuth, O.; et al. The Chemistry Development Kit (CDK) v2.0: Atom typing, depiction, molecular formulas, and substructure searching. J. Cheminformatics 2017, 9, 33, Erratum in J. Cheminformatics 2017, 9, 53.
29. Ki, K.H.; Park, D.Y.; Lee, S.H.; Kim, N.Y.; Choi, B.M.; Noh, G.J. The Optimal Concentration of siRNA for Gene Silencing in Primary Cultured Astrocytes and Microglial Cells of Rats. Korean J. Anesthesiol. 2010, 59, 403.
30. Wang, X.; Wang, X.; Varma, R.K.; Beauchamp, L.; Magdaleno, S.; Sendera, T.J. Selection of Hyperfunctional siRNAs with Improved Potency and Specificity. Nucleic Acids Res. 2009, 37, e152. Available online: https://www.researchgate.net/publication/38027476_Selection_of_hyperfunctional_siRNAs_with_improved_potency_and_specificity#pf8 (accessed on 14 December 2024).
31. Optimizing siRNA Transfection for RNAi. Low Transfection Efficiency and Low Cell Viability Are the Most Frequent Causes of Unsuccessful Gene Silencing Experiments. Available online: https://www.thermofisher.com/us/en/home/references/ambion-tech-support/rnai-sirna/tech-notes/optimizing-sirna-transfection-for-rnai.html (accessed on 14 December 2024).
32. Safari, F.; Rahmani Barouji, S.; Tamaddon, A.M. Strategies for Improving siRNA-Induced Gene Silencing Efficiency. Adv. Pharm. Bull. 2017, 7, 603–609. Available online: https://pmc.ncbi.nlm.nih.gov/articles/PMC5788215/ (accessed on 26 December 2024).
33. Wu, H.; Ma, H.; Ye, C.; Ramirez, D.; Chen, S.; Montoya, J.; Shankar, P.; Wang, X.A.; Manjunath, N. Improved siRNA/shRNA Functionality by Mismatched Duplex. PLoS ONE 2011, 6, e28580.
34. Das, G.; Harikrishna, S.; Gore, K. Influence of Sugar Modifications on the Nucleoside Conformation and Oligonucleotide Stability: A Critical Review. Chem. Rec. 2022, 22, e202200174.
35. Chernikov, I.V.; Ponomareva, U.A.; Chernolovskaya, E.L. Structural Modifications of siRNA Improve Its Performance In Vivo. Int. J. Mol. Sci. 2023, 24, 956.
36. Gangopadhyay, S.; Gore, K.R. Advances in siRNA Therapeutics and Synergistic Effect on siRNA Activity Using Emerging Dual Ribose Modifications. RNA Biol. 2022, 19, 452–467.
37. Martinelli, D.D. From Sequences to Therapeutics: Using Machine Learning to Predict Chemically Modified siRNA Activity. Genomics 2024, 116, 110815.
38. Liu, T.; Huang, J.; Luo, D.; Ren, L.; Ning, L.; Huang, J.; Lin, H.; Zhang, Y. Cm-siRPred: Predicting Chemically Modified siRNA Efficiency Based on Multi-View Learning Strategy. Int. J. Biol. Macromol. 2024, 264, 130638.
39. Dong, X.; Zheng, W. Cheminformatics Modeling of Gene Silencing for Both Natural and Chemically Modified siRNAs. Molecules 2022, 27, 6412.
40. La Rosa, M.; Fiannaca, A.; La Paglia, L.; Urso, A. A Graph Neural Network Approach for the Analysis of siRNA-Target Biological Networks. Int. J. Mol. Sci. 2022, 23, 14211.
41. Dar, S.A.; Gupta, A.K.; Thakur, A.; Kumar, M. SMEpred Workbench: A Web Server for Predicting Efficacy of Chemically Modified siRNAs. RNA Biol. 2016, 13, 1144–1151.
42. Murali, R.; John, P.G.; Peter, S.D. Soft Computing Model for Optimized siRNA Design by Identifying off Target Possibilities Using Artificial Neural Network Model. Gene 2015, 562, 152–158.
43. Cock, P.J.A.; Antao, T.; Chang, J.T.; Chapman, B.A.; Cox, C.J.; Dalke, A.; Friedberg, I.; Hamelryck, T.; Kauff, F.; Wilczynski, B.; et al. Biopython: Freely Available Python Tools for Computational Molecular Biology and Bioinformatics. Bioinformatics 2009, 25, 1422–1423.
44. Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; Koyama, M. Optuna: A Next-Generation Hyperparameter Optimization Framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining; Association for Computing Machinery, New York, NY, USA, 4–8 August 2019; pp. 2623–2631.






| Model | RDKit RMSE Train | RDKit RMSE Test | RDKit R² | RDKit CV R² | PyBioMed RMSE Train | PyBioMed RMSE Test | PyBioMed R² | PyBioMed CV R² | CDK RMSE Train | CDK RMSE Test | CDK R² | CDK CV R² |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LGBM | 10.56 | 15.57 | 0.711 | 0.681 | 10.61 | 15.37 | 0.717 | 0.682 | 10.07 | 15.44 | 0.716 | 0.676 |
| XGB | 8.18 | 15.77 | 0.704 | 0.652 | 8.38 | 16.32 | 0.683 | 0.650 | 8.04 | 16.17 | 0.688 | 0.649 |
| RF | 8.45 | 16.35 | 0.682 | 0.640 | 8.46 | 16.69 | 0.668 | 0.638 | 8.42 | 16.58 | 0.670 | 0.640 |

| Study | Descriptors (Gene) | Descriptors (siRNA) | Model | RMSE | PCC (R) | R² |
|---|---|---|---|---|---|---|
| Current work | Mistral 7B embeddings | RDKit | Two-stage probability-enhanced LGBM-based approach | 12.27 | 0.91 | 0.84 |
| Liu et al., 2024 [38] | - | Property matrices | Cross-attention CNN | 16.97 | 0.83 | 0.69 |
| Dong et al., 2022 [39] | - | Property matrices (BCUT) | Partial least squares (PLS) | 13.50 | 0.82 | 0.67 |
| La Rosa et al., 2022 [40] | Graph nodes + k-mers | Graph nodes + k-mers | GNN (HinSAGE) | 14.23 | 0.74 | 0.49 |
| Dar et al., 2016 [41] | - | Mononucleotide composition | SVM | - | 0.80 | 0.64 |
| Murali et al., 2015 [42] | Energy scores | Energy scores | MLP | - | 0.74 | 0.55 |
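
For readers who want to reproduce the kind of metrics reported in the comparison table (RMSE, PCC and R²), the following is a minimal sketch of their standard definitions in plain Python. The function names and the toy knockdown values are hypothetical and chosen only for illustration; this is not the authors' evaluation code, and the exact evaluation protocol used in the paper is not shown here.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient (PCC) between two sequences."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy data (hypothetical): observed vs. predicted knockdown, in percent.
y_true = [10.0, 40.0, 60.0, 90.0]
y_pred = [20.0, 30.0, 70.0, 80.0]

print(f"RMSE = {rmse(y_true, y_pred):.2f}")
print(f"PCC  = {pearson_r(y_true, y_pred):.3f}")
print(f"R2   = {r2_score(y_true, y_pred):.3f}")
```

Note that R² computed this way is not simply the square of the PCC: the two coincide only when predictions are already optimally scaled and centered against the observations.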
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Golovkin, I.; Shatkovskii, D.; Serov, N. Two-Stage Probability-Enhanced Regression on Property Matrices and LLM Embeddings Enables State-of-the-Art Prediction of Gene Knockdown by Modified siRNAs. Int. J. Mol. Sci. 2025, 26, 11791. https://doi.org/10.3390/ijms262411791