CryoVirusDB: An Annotated Dataset for AI-Based Virus Particle Identification in Cryo-EM Micrographs
Abstract
1. Introduction
2. Materials and Methods
2.1. Raw Data Acquisition and Preprocessing
2.2. Software and Computation Tools
3. Results
3.1. Particle-Picking Workflow
3.1.1. Micrograph Import
3.1.2. Motion Correction and Patch-Based CTF Estimation of Micrographs
3.1.3. Manual Particle Picking and 2D Class Formation
3.1.4. Template-Based Picking
3.1.5. Manual Particle Inspection and Extraction
3.2. Data Organization of CryoVirusDB
3.3. Data Validation
3.3.1. 2D Particle Class Validation
3.3.2. 3D Density Map Validation
3.4. Machine Learning Workflows with CryoVirusDB
3.4.1. Denoising and Preprocessing Cryo-EM Micrographs
3.4.2. Generating Labels for Input Micrographs
3.4.3. Machine Learning Model Development: Inputs and Expected Outputs
4. Discussion and Conclusions
Supplementary Materials
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| AI | Artificial Intelligence |
| ML | Machine Learning |
| CTF | Contrast Transfer Function |
| Cryo-EM | Cryogenic Electron Microscopy |
| CSV | Comma-Separated Values |
| EMDB | Electron Microscopy Data Bank |
| EMPIAR | Electron Microscopy Public Image Archive |
| FSC | Fourier Shell Correlation |
| GSFSC | Gold Standard Fourier Shell Correlation |
| NCC | Normalized Cross-Correlation |
| PDB | Protein Data Bank |
| PS | Power Spectrum |
| Å | Angstrom |
| kV | Kilovolt |
| e/Å² | Electrons per Square Angstrom (unit of electron dose) |
| µm | Micrometer |
References
- Renaud, J.-P.; Chari, A.; Ciferri, C.; Liu, W.-T.; Rémigy, H.-W.; Stark, H.; Wiesmann, C. Cryo-EM in drug discovery: Achievements, limitations and prospects. Nat. Rev. Drug Discov. 2018, 17, 471–492.
- Chua, E.Y.; Mendez, J.H.; Rapp, M.; Ilca, S.L.; Tan, Y.Z.; Maruthi, K.; Kuang, H.; Zimanyi, C.M.; Cheng, A.; Eng, E.T.; et al. Better, Faster, Cheaper: Recent Advances in Cryo-Electron Microscopy. Annu. Rev. Biochem. 2022, 91, 1–32.
- Wu, M.; Lander, G.C. Present and Emerging Methodologies in Cryo-EM Single-Particle Analysis. Biophys. J. 2020, 119, 1281–1289.
- Dhakal, A.; Gyawali, R.; Wang, L.; Cheng, J. A large expert-curated cryo-EM image dataset for machine learning protein particle picking. Sci. Data 2023, 10, 392.
- Dhakal, A.; Gyawali, R.; Wang, L.; Cheng, J. CryoPPP: A Large Expert-Labelled Cryo-EM Image Dataset for Machine Learning Protein Particle Picking. bioRxiv 2023.
- Hryc, C.F.; Chen, D.H.; Chiu, W. Near-atomic resolution cryo-EM for molecular virology. Curr. Opin. Virol. 2011, 1, 110–117.
- Jiang, W.; Tang, L. Atomic cryo-EM structures of viruses. Curr. Opin. Struct. Biol. 2017, 46, 122–129.
- Wrapp, D.; Wang, N.; Corbett, K.S.; Goldsmith, J.A.; Hsieh, C.-L.; Abiona, O.; Graham, B.S.; McLellan, J.S. Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation. Science 2020, 367, 1260–1263.
- Walls, A.C.; Park, Y.-J.; Tortorici, M.A.; Wall, A.; McGuire, A.T.; Veesler, D. Structure, Function, and Antigenicity of the SARS-CoV-2 Spike Glycoprotein. Cell 2020, 181, 281–292.e6.
- Yan, R.; Zhang, Y.; Li, Y.; Xia, L.; Guo, Y.; Zhou, Q. Structural basis for the recognition of SARS-CoV-2 by full-length human ACE2. Science 2020, 367, 1444–1448.
- Hauser, B.M.; Sangesland, M.; Denis, K.J.S.; Lam, E.C.; Case, J.B.; Windsor, I.W.; Feldman, J.; Caradonna, T.M.; Kannegieter, T.; Diamond, M.S.; et al. Rationally designed immunogens enable immune focusing following SARS-CoV-2 spike imprinting. Cell Rep. 2022, 38, 110561.
- Ong, E.; Huang, X.; Pearce, R.; Zhang, Y.; He, Y. Computational design of SARS-CoV-2 spike glycoproteins to increase immunogenicity by T cell epitope engineering. Comput. Struct. Biotechnol. J. 2021, 19, 518–529.
- Castro, K.M.; Scheck, A.; Xiao, S.; Correia, B.E. Computational design of vaccine immunogens. Curr. Opin. Biotechnol. 2022, 78, 102821.
- Dhakal, A.; McKay, C.; Tanner, J.J.; Cheng, J. Artificial intelligence in the prediction of protein-ligand interactions: Recent advances and future directions. Brief. Bioinform. 2022, 23, bbab476.
- Earl, L.A.; Subramaniam, S. Cryo-EM of viruses and vaccine design. Proc. Natl. Acad. Sci. USA 2016, 113, 8903–8905.
- Dhakal, A.; Gyawali, R.; Cheng, J. Predicting Protein-Ligand Binding Structure Using E(n) Equivariant Graph Neural Networks. bioRxiv 2023. bioRxiv:2023.02.26.23286462.
- Scheres, S.H.W. RELION: Implementation of a Bayesian approach to cryo-EM structure determination. J. Struct. Biol. 2012, 180, 519–530.
- Tang, G.; Peng, L.; Baldwin, P.R.; Mann, D.S.; Jiang, W.; Rees, I.; Ludtke, S.J. EMAN2: An extensible image processing suite for electron microscopy. J. Struct. Biol. 2007, 157, 38–46.
- Dhakal, A.; Gyawali, R.; Wang, L.; Cheng, J. Artificial intelligence in cryo-EM protein particle picking: Recent advances and remaining challenges. Brief. Bioinform. 2025, 26, bbaf011.
- Dhakal, A.; Gyawali, R.; Wang, L.; Cheng, J. CryoTransformer: A transformer model for picking protein particles from cryo-EM micrographs. Bioinformatics 2024, 40, btae109.
- He, F.; Yang, Z.; Gao, M.; Poudel, B.; Dhas, N.S.E.S.; Gyawali, R.; Dhakal, A.; Cheng, J.; Xu, D. Adapting Segment Anything Model (SAM) through Prompt-based Learning for Enhanced Protein Identification in Cryo-EM Micrographs. arXiv 2023, arXiv:2311.16140.
- Bepler, T.; Morin, A.; Rapp, M.; Brasch, J.; Shapiro, L.; Noble, A.J.; Berger, B. Positive-unlabeled convolutional neural networks for particle picking in cryo-electron micrographs. Nat. Methods 2019, 16, 1153–1160.
- Wagner, T.; Merino, F.; Stabrin, M.; Moriya, T.; Antoni, C.; Apelbaum, A.; Hagel, P.; Sitsel, O.; Raisch, T.; Prumbaum, D.; et al. SPHIRE-crYOLO is a fast and accurate fully automated particle picker for cryo-EM. Commun. Biol. 2019, 2, 1–13.
- Gyawali, R.; Dhakal, A.; Wang, L.; Cheng, J. CryoSegNet: Accurate cryo-EM protein particle picking by integrating the foundational AI image segmentation model and attention-gated U-Net. Brief. Bioinform. 2024, 25, bbae282.
- Gyawali, R.; Dhakal, A.; Wang, L.; Cheng, J. CryoVirusDB: An Annotated Dataset for AI-Based Virus Particle Identification in Cryo-EM Micrographs. Zenodo 2023.
- Conley, M.J.; McElwee, M.; Azmi, L.; Gabrielsen, M.; Byron, O.; Goodfellow, I.G.; Bhella, D. Calicivirus VP2 forms a portal-like assembly following receptor engagement. Nature 2019, 565, 377–381.
- Castells-Graells, R.; Hesketh, E.L.; Johnson, J.E.; Ranson, N.A.; Lawson, D.M.; Lomonossoff, G.P. Decoding Virus Maturation with Cryo-EM Structures of Intermediates. EMPIAR. Available online: https://www.ebi.ac.uk/empiar/EMPIAR-11060/ (accessed on 1 December 2025).
- Ho, K.L.; Gabrielsen, M.; Beh, P.L.; Kueh, C.L.; Thong, Q.X.; Streetley, J.; Tan, W.S.; Bhella, D. Structure of the Macrobrachium rosenbergii nodavirus: A new genus within the Nodaviridae? PLoS Biol. 2018, 16, e3000038.
- Shakeel, S.; Westerhuis, B.M.; Domanska, A.; Koning, R.I.; Matadeen, R.; Koster, A.J.; Bakker, A.Q.; Beaumont, T.; Wolthers, K.C.; Butcher, S.J. Multiple capsid-stabilizing interactions revealed in a high-resolution structure of an emerging picornavirus causing neonatal sepsis. Nat. Commun. 2016, 7, 11387.
- Flatt, J.W.; Domanska, A.; Seppälä, A.L.; Butcher, S.J. Identification of a conserved virion-stabilizing network inside the interprotomer pocket of enteroviruses. Commun. Biol. 2021, 4, 1–8.
- Chandler-Bostock, R.; Mata, C.P.; Bingham, R.J.; Dykeman, E.C.; Meng, B.; Tuthill, T.J.; Rowlands, D.J.; Ranson, N.A.; Twarock, R.; Stockley, P.G. Assembly of infectious enteroviruses depends on multiple, conserved genomic RNA-coat protein contacts. PLoS Pathog. 2020, 16, e1009146.
- Thompson, R.F.; Iadanza, M.G.; Hesketh, E.L.; Rawson, S.; Ranson, N.A. Collection, pre-processing and on-the-fly analysis of data for high-resolution, single-particle cryo-electron microscopy. Nat. Protoc. 2019, 14, 100–118.
- Castells-Graells, R.; Ribeiro, J.R.S.; Domitrovic, T.; Hesketh, E.L.; Scarff, C.A.; Johnson, J.E.; Ranson, N.A.; Lawson, D.M.; Lomonossoff, G.P. Plant-expressed virus-like particles reveal the intricate maturation process of a eukaryotic virus. Commun. Biol. 2021, 4, 1–12.
- Iudin, A.; Korir, P.K.; Somasundharam, S.; Weyand, S.; Cattavitello, C.; Fonseca, N.; Salih, O.; Kleywegt, G.J.; Patwardhan, A. EMPIAR: The Electron Microscopy Public Image Archive. Nucleic Acids Res. 2023, 51, D1503–D1511.
- Punjani, A.; Rubinstein, J.L.; Fleet, D.J.; Brubaker, M.A. cryoSPARC: Algorithms for rapid unsupervised cryo-EM structure determination. Nat. Methods 2017, 14, 290–296.
- Pettersen, E.F.; Goddard, T.D.; Huang, C.C.; Meng, E.C.; Couch, G.S.; Croll, T.I.; Morris, J.H.; Ferrin, T.E. UCSF ChimeraX: Structure visualization for researchers, educators, and developers. Protein Sci. 2021, 30, 70–82.
- De Castro, E.; Hulo, C.; Masson, P.; Auchincloss, A.; Bridge, A.; Le Mercier, P. ViralZone 2024 provides higher-resolution images and advanced virus-specific resources. Nucleic Acids Res. 2024, 52, D817–D821.
- Baldwin, P.R.; Penczek, P.A. The Transform Class in SPARX and EMAN2. J. Struct. Biol. 2007, 157, 250–261.
- Bepler, T.; Kelley, K.; Noble, A.J.; Berger, B. Topaz-Denoise: General deep denoising models for cryoEM and cryoET. Nat. Commun. 2020, 11, 5208.
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Computer Vision – ECCV 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer International Publishing: Berlin/Heidelberg, Germany, 2014; pp. 740–755.








| SN | EMPIAR ID | Virus Type | Number of Micrographs | Micrograph Size (px) | Particle Diameter (px) | Defocus Range (μm) | Number of Virus Particles | Average Particles per Micrograph |
|---|---|---|---|---|---|---|---|---|
| 1 | 10192 [26] | Feline calicivirus | 1000 | (4096, 4096) | 470 | −1.2 to −3.5 | 9660 | 9.66 |
| 2 | 11060 [27] | Nudaurelia capensis omega virus | 1276 | (4096, 4096) | 516 | −0.70 to −2.2 | 11,916 | 9.34 |
| 3 | 10203 [28] | Macrobrachium rosenbergii nodavirus | 1000 | (3838, 3710) | 377 | −1.0 to −2.5 | 16,601 | 16.60 |
| 4 | 10033 [29] | Human parechovirus 3 | 1000 | (4096, 4096) | 350 | −0.42 to −2.34 | 55,732 | 55.73 |
| 5 | 10652 [30] | Coxsackievirus B4 | 1127 | (3838, 3710) | 374 | −0.60 to −3.0 | 11,144 | 9.89 |
| 6 | 10341 [31] | Bovine enterovirus | 1274 | (4096, 4096) | 376 | −0.75 to −3.5 | 22,694 | 17.81 |
| 7 | 10193 [26] | Feline calicivirus | 1000 | (4096, 4096) | 516 | −1.2 to −3.5 | 96,126 | 96.13 |
| 8 | 10205 [32] | Cowpea mosaic virus | 1000 | (4096, 4096) | 310 | −0.50 to −3.5 | 81,037 | 81.04 |
| 9 | 10555 [33] | Nudaurelia capensis omega virus | 1264 | (4096, 4096) | 564 | −0.70 to −2.7 | 34,488 | 27.28 |
| Total | | | 9941 | | | | 339,398 | 34.14 |
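
The totals and per-micrograph averages in the table above follow from simple arithmetic on the micrograph and particle counts. The sketch below recomputes them as a consistency check; the counts are copied from the table, while the names `DATASETS`, `summarize`, and `per_dataset_average` are illustrative and not part of CryoVirusDB.

```python
# Rows copied from the dataset summary table: (EMPIAR ID, micrographs, particles).
# The variable and function names are illustrative, not part of CryoVirusDB.
DATASETS = [
    ("10192", 1000, 9660),
    ("11060", 1276, 11916),
    ("10203", 1000, 16601),
    ("10033", 1000, 55732),
    ("10652", 1127, 11144),
    ("10341", 1274, 22694),
    ("10193", 1000, 96126),
    ("10205", 1000, 81037),
    ("10555", 1264, 34488),
]


def summarize(rows):
    """Return (total micrographs, total particles, overall particles per micrograph)."""
    total_micrographs = sum(m for _, m, _ in rows)
    total_particles = sum(p for _, _, p in rows)
    return total_micrographs, total_particles, total_particles / total_micrographs


def per_dataset_average(rows):
    """Map each EMPIAR ID to its particles-per-micrograph ratio."""
    return {empiar_id: p / m for empiar_id, m, p in rows}


if __name__ == "__main__":
    n_mic, n_part, avg = summarize(DATASETS)
    print(f"Total: {n_mic} micrographs, {n_part} particles, {avg:.2f} per micrograph")
    for empiar_id, value in per_dataset_average(DATASETS).items():
        print(f"EMPIAR-{empiar_id}: {value:.2f}")
```

Note that the overall figure of 34.14 is the ratio of the two totals (339,398 / 9941), not the mean of the nine per-dataset averages, which would weight small and large datasets equally.
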

| SN | EMPIAR ID | Pixel Size (Å) | Electron Dose (e/Å²) | Detector |
|---|---|---|---|---|
| 1 | 10192 | 1.065 | 63 | FEI FALCON III (4k × 4k) |
| 2 | 11060 | 1.065 | 46 | FEI FALCON III (4k × 4k) |
| 3 | 10203 | 1.06 | 36 | GATAN K2 QUANTUM (4k × 4k) |
| 4 | 10033 | 1.14 | 36 | FEI FALCON II (4k × 4k) |
| 5 | 10652 | 1.06 | 47 | GATAN K2 SUMMIT (4k × 4k) |
| 6 | 10341 | 1.065 | 49.5 | FEI FALCON III (4k × 4k) |
| 7 | 10193 | 1.065 | 63 | FEI FALCON III (4k × 4k) |
| 8 | 10205 | 1.065 | 67.5 | FEI FALCON III (4k × 4k) |
| 9 | 10555 | 1.0651 | 72 | FEI FALCON III (4k × 4k) |
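
Combining the particle diameters (in pixels) from the dataset summary table with the pixel sizes listed here gives an approximate physical diameter for each dataset. The sketch below performs that conversion. It assumes the tabulated pixel size applies directly to the micrographs on which the diameters were measured (no binning or rescaling), which the tables do not state explicitly; the names `ROWS`, `diameter_angstrom`, and `diameter_nm` are illustrative.

```python
# (EMPIAR ID, particle diameter in px, pixel size in Å per px), copied from the tables.
# Assumption: the pixel size applies to the micrographs the diameters were measured on.
ROWS = [
    ("10192", 470, 1.065),
    ("11060", 516, 1.065),
    ("10203", 377, 1.06),
    ("10033", 350, 1.14),
    ("10652", 374, 1.06),
    ("10341", 376, 1.065),
    ("10193", 516, 1.065),
    ("10205", 310, 1.065),
    ("10555", 564, 1.0651),
]


def diameter_angstrom(diameter_px, pixel_size_angstrom):
    """Convert a diameter from pixels to angstroms."""
    return diameter_px * pixel_size_angstrom


def diameter_nm(diameter_px, pixel_size_angstrom):
    """Convert a diameter from pixels to nanometres (1 nm = 10 Å)."""
    return diameter_angstrom(diameter_px, pixel_size_angstrom) / 10.0


if __name__ == "__main__":
    for empiar_id, d_px, apix in ROWS:
        d_a = diameter_angstrom(d_px, apix)
        print(f"EMPIAR-{empiar_id}: {d_px} px -> {d_a:.1f} Å ({d_a / 10.0:.1f} nm)")
```

The resulting values describe the picking diameter used for annotation and should be read as approximate, since a picking diameter is not necessarily identical to the capsid diameter.
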

| EMPIAR 10205 | 2D Particle Class Statistics (Topaz) | 2D Particle Class Statistics (CryoVirusDB) |
|---|---|---|
| Number of Picked Particles | 155,953 | 81,037 |
| Weighted Average Resolution of 2D classes (N = 50) | 9.41 Å | 6.59 Å |
| Weighted Average Resolution of 2D classes (N = 10) | 13.42 Å | 10.96 Å |

| EMPIAR 10193 | 2D Particle Class Statistics (Topaz) | 2D Particle Class Statistics (CryoVirusDB) |
|---|---|---|
| Number of Picked Particles | 239,852 | 96,126 |
| Weighted Average Resolution of 2D classes (N = 50) | 18.52 Å | 15.02 Å |
| Weighted Average Resolution of 2D classes (N = 10) | 23.68 Å | 21.72 Å |

| EMPIAR 10205: 3D Density Map Statistics | Topaz, Trial 1 | Topaz, Trial 2 | Topaz, Trial 3 | CryoVirusDB, Trial 1 | CryoVirusDB, Trial 2 | CryoVirusDB, Trial 3 |
|---|---|---|---|---|---|---|
| Number of Picked Particles | 155,953 | 155,953 | 155,953 | 81,037 | 81,037 | 81,037 |
| GSFSC Resolution (Å) | 6.97 | 6.59 | 6.48 | 4.34 | 4.40 | 4.47 |
| No Mask Resolution (Å) | 9.3 | 9.5 | 9.2 | 7.9 | 8.8 | 8.1 |
| Loose Mask Resolution (Å) | 7.7 | 7.2 | 7.3 | 5.7 | 7.1 | 6.3 |
| Tight Mask Resolution (Å) | 6.8 | 6.6 | 6.5 | 4.3 | 4.5 | 4.4 |
| Corrected Mask Resolution (Å) | 7 | 6.6 | 6.5 | 4.3 | 4.4 | 4.5 |

| EMPIAR 10193: 3D Density Map Statistics | Topaz, Trial 1 | Topaz, Trial 2 | Topaz, Trial 3 | CryoVirusDB, Trial 1 | CryoVirusDB, Trial 2 | CryoVirusDB, Trial 3 |
|---|---|---|---|---|---|---|
| Number of Picked Particles | 239,852 | 239,852 | 239,852 | 96,126 | 96,126 | 96,126 |
| GSFSC Resolution (Å) | 5.86 | 5.74 | 5.82 | 5.16 | 5.22 | 5.18 |
| No Mask Resolution (Å) | 12 | 13 | 11 | 12 | 9.4 | 9.1 |
| Loose Mask Resolution (Å) | 5.9 | 6.1 | 5.8 | 5.3 | 5.8 | 5.5 |
| Tight Mask Resolution (Å) | 5.8 | 5.8 | 5.7 | 5.2 | 5.2 | 5.2 |
| Corrected Mask Resolution (Å) | 5.9 | 5.7 | 5.6 | 5.2 | 5.2 | 5.3 |
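
To compare the two particle sets across the three trials, the GSFSC resolutions reported above can be averaged per picker. The following is a minimal sketch using only the tabulated values; the names `GSFSC`, `mean_resolution`, and `improvement` are illustrative, and the plain arithmetic mean is one reasonable summary rather than a statistic reported in the table.

```python
from statistics import mean

# GSFSC resolutions in Å for three trials, copied from the 3D density map tables.
# The dictionary layout and function names are illustrative.
GSFSC = {
    "10205": {"Topaz": [6.97, 6.59, 6.48], "CryoVirusDB": [4.34, 4.40, 4.47]},
    "10193": {"Topaz": [5.86, 5.74, 5.82], "CryoVirusDB": [5.16, 5.22, 5.18]},
}


def mean_resolution(dataset, picker):
    """Arithmetic mean of the three trial resolutions (Å) for one dataset and picker."""
    return mean(GSFSC[dataset][picker])


def improvement(dataset):
    """Topaz mean minus CryoVirusDB mean, in Å.

    A positive value means the CryoVirusDB particles gave a numerically
    smaller (finer) resolution.
    """
    return mean_resolution(dataset, "Topaz") - mean_resolution(dataset, "CryoVirusDB")


if __name__ == "__main__":
    for dataset in GSFSC:
        topaz = mean_resolution(dataset, "Topaz")
        curated = mean_resolution(dataset, "CryoVirusDB")
        print(f"EMPIAR-{dataset}: Topaz {topaz:.2f} Å, CryoVirusDB {curated:.2f} Å, "
              f"difference {improvement(dataset):.2f} Å")
```

On these values the mean GSFSC resolution is finer for the CryoVirusDB particles by about 2.28 Å on EMPIAR 10205 and 0.62 Å on EMPIAR 10193, despite the smaller number of picked particles in both cases.
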
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Gyawali, R.; Dhakal, A.; Wang, L.; Cheng, J. CryoVirusDB: An Annotated Dataset for AI-Based Virus Particle Identification in Cryo-EM Micrographs. Viruses 2026, 18, 224. https://doi.org/10.3390/v18020224

