Genome-Wide Inference of Essential Genes in Dirofilaria immitis Using Machine Learning
Abstract
1. Introduction
2. Results
2.1. Genome Assembly, Quality of the Assembly, Gene Prediction and Annotation
2.2. Identification of Predictors of Essential Genes in D. immitis
2.3. Association Between Essential Genes and Transcription Profiles
2.4. Essential Genes of D. immitis Are Inferred to Be Involved Predominantly in Ribosome Biogenesis, Translation, RNA Binding/Processing and Signalling
2.5. Linking Essential Genes to Genome Locations, and Their Transcription to Cell Type or Tissue
3. Discussion
4. Materials and Methods
4.1. Nucleic Acid Sequence Data Sets
4.2. Genome Assembly, Prediction of Repeats and Genes, and Comparative Analysis
4.3. Predicting and Ranking Essential Genes by ML and Associated Analyses
4.3.1. Feature Extraction and Selection
4.3.2. Predicting and Ranking of Essential Genes
4.3.3. Analysis of Transcription and Clustering
4.3.4. Genomic, Tissue and Functional Annotation of Essential Genes
Supplementary Materials
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Features | D. immitis (This Study) | B. malayi (GCA_000002995.5) |
---|---|---|
Genome size (bp) | 94,056,743 | 88,235,797 |
Number of scaffolds | 84 | 197 |
N50 (bp); L50 | 15,667,105; 3 | 14,214,749; 3 |
N90 (bp); L90 | 15,025,590; 5 | 13,467,244; 5 |
Genome GC content (%) | 27.8 | 28.5 |
Repetitive sequences (%) | 9.7 | 18.3 |
Exonic proportion; incl. introns (%) | 16.5; 56.0 | 18.6; 55.5 |
Number of protein-coding genes predicted (isoforms) | 11,852 (14,247) | 11,350 (not reported) |
Mean; median gene size (bp) | 4447; 3101 | 4315; 3015 |
Mean; median protein-coding gene length (bp) | 1292; 933 | 1317; 1095 |
Mean exon number per protein-coding gene | 9.4 | 9.1 |
Mean; median exon length (bp) | 140; 129 | 161; 140 |
Mean; median intron length (bp) | 375; 243 | 354; 231 |
Coding G+C content (%) | 38.4 | 39.1 |
BUSCO completeness: complete; complete + fragmented (%) | 93.6; 94.4 | 98.9; 99.1 |
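
The N50/L50 and N90/L90 rows above are contiguity statistics. As a minimal sketch (not the authors' pipeline), the function below computes them under the standard definition: sort scaffolds from longest to shortest, accumulate lengths until the chosen fraction of the total assembly size is reached, then report the length of the last scaffold added (Nx) and the number of scaffolds used (Lx). The function name `nx_lx` and the toy scaffold lengths are assumptions made for illustration and are not taken from the study.

```python
from typing import Sequence, Tuple


def nx_lx(lengths: Sequence[int], fraction: float = 0.5) -> Tuple[int, int]:
    """Return (Nx, Lx) for a set of scaffold lengths.

    Nx is the length of the shortest scaffold in the smallest set of
    longest scaffolds that together cover at least `fraction` of the
    total assembly size; Lx is the number of scaffolds in that set.
    `nx_lx` is a hypothetical helper name, not part of any library.
    """
    if not lengths:
        raise ValueError("at least one scaffold length is required")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the interval (0, 1]")

    ordered = sorted(lengths, reverse=True)  # longest scaffold first
    threshold = fraction * sum(ordered)      # bases that must be covered

    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count

    # Not reachable for fraction <= 1, kept to make the control flow explicit.
    raise AssertionError("threshold was never reached")


# Toy scaffold lengths (illustrative only, not data from the table).
toy_lengths = [50, 40, 30, 20, 10]
n50, l50 = nx_lx(toy_lengths, 0.5)
n90, l90 = nx_lx(toy_lengths, 0.9)
print(f"N50={n50}; L50={l50}")
print(f"N90={n90}; L90={l90}")
```

Read this way, the table row "N50 (bp); L50 | 15,667,105; 3" says that the three longest D. immitis scaffolds already cover at least half of the 94,056,743 bp assembly, and that the shortest of those three is 15,667,105 bp long.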
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Campos, T.L.; Korhonen, P.K.; Young, N.D.; Sumanam, S.B.; Bullard, W.; Harrington, J.M.; Song, J.; Chang, B.C.H.; Marhöfer, R.J.; Selzer, P.M.; Gasser, R.B. Genome-Wide Inference of Essential Genes in Dirofilaria immitis Using Machine Learning. Int. J. Mol. Sci. 2025, 26, 9923. https://doi.org/10.3390/ijms26209923