Data Augmentation and Synthetic Data Generation in Rare Disease Research: A Scoping Review
Abstract
1. Introduction
2. Materials and Methods
2.1. Literature Search and Selection
2.2. Full-Text Analysis and Metadata Extraction
3. Results
3.1. Temporal Trends of Publication
3.2. Study Focus
3.3. Rare Diseases Classification
3.4. Data Type and Data Size
3.5. Method
4. Discussion
4.1. Data Augmentation
4.2. Synthetic Data Generation
4.3. Advantages and Limitations
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| GANs | Generative adversarial networks |
| VAEs | Variational autoencoders |
| AI | Artificial intelligence |
| IEEE | Institute of Electrical and Electronics Engineers |
| SMOTE | Synthetic Minority Over-sampling Technique |
| ADASYN | Adaptive Synthetic Sampling |
| G&P | Geometric and photometric transformations |
| DGMs | Deep generative models |
| MRI | Magnetic resonance imaging |
| CT | Computed tomography |




| Database | Search Strings |
|---|---|
| PubMed | data augmentation AND rare disease; data augmentation AND one of 9784 individual rare disease names; synthetic data AND rare disease; synthetic data AND one of 9784 individual rare disease names |
| IEEE | data augmentation AND rare disease; synthetic data AND rare disease |
| Scopus | data augmentation AND rare disease; synthetic data AND rare disease |
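
The search strings in the table above follow a simple pattern: each method term is combined with either the generic phrase "rare disease" or an individual disease name. A minimal sketch of how such strings could be enumerated is shown below. The function name `build_queries` and the two example disease names are illustrative assumptions; the 9784-name list used for PubMed is not reproduced here, and this is not presented as the tooling the authors used.

```python
from itertools import product

# Method terms taken from the search-string table.
METHOD_TERMS = ["data augmentation", "synthetic data"]


def build_queries(disease_names, include_generic=True):
    """Return 'method AND target' strings for every method/target pair.

    `disease_names` is a hypothetical stand-in for the list of individual
    rare disease names; the generic target 'rare disease' is added first
    when `include_generic` is True.
    """
    targets = (["rare disease"] if include_generic else []) + list(disease_names)
    return [f"{method} AND {target}" for method, target in product(METHOD_TERMS, targets)]


# Illustrative names only (assumption, not the actual search list).
example_names = ["Niemann-Pick disease type C", "Smith-Magenis syndrome"]
queries = build_queries(example_names)
for query in queries:
    print(query)
```

With the full name list, the same loop would yield two strings per disease plus the two generic strings, which matches the four PubMed patterns in the table.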

| Metadata | Definition | Categories or Range |
|---|---|---|
| Publication year | Article publication year | Between 2018 and 2025 |
| Study focus | Focus of the article on data augmentation and/or synthetic data generation for rare diseases | Diagnosis; Treatment; Prognosis; Differential diagnosis |
| Rare disease | Disease(s) or group of diseases investigated in the article | |
| Data type | Where the inputs for the model(s) are derived from | Imaging data; Audio data; Clinical data; Laboratory data; Omics data; Multi-omics data |
| Data size (before) | Quantity related to a particular type of data sample before augmentation and synthesis | |
| Data size (after) | Quantity related to a particular type of data sample after augmentation and synthesis | |
| Method | The methodology used for data augmentation and/or the generation of synthetic data | Classical augmentation; Deep generative models; Oversampling techniques; Rule/model-based generation approaches; Frameworks and tools |
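
Among the method categories above, oversampling techniques such as SMOTE create new minority-class samples by interpolating between an existing sample and one of its nearest neighbours. The sketch below illustrates that interpolation step with NumPy only. It is a simplified, SMOTE-style illustration rather than a reference implementation, and the function name, the default `k`, and the toy data are assumptions made for the example.

```python
import numpy as np


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority samples.

    Each synthetic row lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    Requires at least two minority samples.
    """
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n_samples = len(minority)
    k = min(k, n_samples - 1)

    # Pairwise Euclidean distances; the diagonal is masked so that a
    # sample is never selected as its own neighbour.
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(n_samples)
        neighbour = neighbours[base, rng.integers(k)]
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic[i] = minority[base] + gap * (minority[neighbour] - minority[base])
    return synthetic


# Toy minority class with four two-feature samples (illustrative values).
toy_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_rows = smote_like_oversample(toy_minority, n_new=8)
print(new_rows.shape)
```

Because every synthetic row is a convex combination of two real rows, the generated values stay within the range of the original data, which is one reason such methods cannot add variability beyond what the cohort already contains.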

| | Data Augmentation | Synthetic Data Generation |
|---|---|---|
| Scale of expansion | 2–4× increase over the original dataset | Up to 10× or more, depending on model type and training data |
| Main techniques | Geometric and photometric transformations, patch resampling, and oversampling | Deep generative models |
| Strengths | Simple, interpretable, computationally efficient, low risk of artifacts | Can model complex variability, enhance diversity |
| Limitations | Limited to existing data variability; may amplify original biases | Computationally demanding; potential for artifacts; validation and interpretability remain challenging |
| Common applications | Image-based diagnosis and prognosis, small clinical cohorts | Simulation of disease variability, virtual cohorts, and data sharing in restricted domains |
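
To make the "geometric and photometric transformations" row concrete, the following sketch applies one flip, one rotation, and one brightness change to a single image, giving a 4× expansion that sits at the upper end of the 2–4× range in the table. The specific transforms, the `brightness_range` value, and the function name are assumptions chosen for illustration; they are not taken from any of the reviewed studies.

```python
import numpy as np


def augment_image(image, rng, brightness_range=0.2):
    """Return the original image plus three simple augmented variants.

    `image` is a 2-D array with intensities scaled to [0, 1]. Note that
    the 90-degree rotation preserves the array shape only for square images.
    """
    flipped = np.fliplr(image)                 # geometric: horizontal flip
    rotated = np.rot90(image)                  # geometric: 90-degree rotation
    factor = 1.0 + rng.uniform(-brightness_range, brightness_range)
    brightened = np.clip(image * factor, 0.0, 1.0)  # photometric: brightness scaling
    return [image, flipped, rotated, brightened]


rng = np.random.default_rng(0)
sample = rng.random((8, 8))          # stand-in for a grayscale image patch
variants = augment_image(sample, rng)
print(len(variants))                 # four images from one original
```

Every variant is a deterministic or lightly randomized function of the original pixels, which illustrates the limitation noted in the table: the augmented set cannot contain variability that the source images lack.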
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Finetti, R.; Roncaglia, B.; Visibelli, A.; Spiga, O.; Santucci, A. Data Augmentation and Synthetic Data Generation in Rare Disease Research: A Scoping Review. Med. Sci. 2025, 13, 260. https://doi.org/10.3390/medsci13040260

