1. Introduction
The use of Machine Learning (ML) in healthcare is increasingly dependent on the availability of high-quality data. However, access to sufficiently large and representative datasets remains a persistent challenge due to privacy constraints, ethical considerations, and the high cost of medical data collection. Moreover, healthcare data often suffers from class imbalance, where rare but clinically critical outcomes are underrepresented, making it difficult for ML models to learn robust and generalizable decision rules. These challenges highlight the need for alternative approaches that enable the development of reliable predictive models while preserving patient confidentiality and ensuring statistical validity.
Synthetic data generation has emerged as a powerful solution to these limitations. By creating artificial data that replicates the statistical properties of real-world datasets, synthetic data offers a way to overcome scarcity, mitigate class imbalance, and reduce the risk of overfitting. In contrast to simple data augmentation, synthetic data generation maintains both the marginal distributions of variables and the complex interdependencies between them, which are crucial for healthcare applications. Preserving these relationships ensures that ML models can capture meaningful patterns without compromising the integrity of predictions. Recent advances in generative modelling, including probabilistic frameworks, Bayesian networks, and neural-based approaches, have further strengthened the ability to produce synthetic data that is both realistic and diverse [1,2].
In the specific context of patient triage, these challenges are particularly pronounced. Triage datasets are often small, as they rely on structured questionnaires or medical assessments collected under resource-constrained conditions. For this study, the initial dataset comprised 2125 responses from a triage questionnaire aimed at assessing patient care priority. While valuable, this dataset was insufficient to train models capable of generalizing to diverse real-world scenarios. To address this, we employed synthetic data generation as a central step in our methodology, ensuring the expansion of the dataset while maintaining its statistical integrity.
The proposed approach is guided by three key objectives. First, preservation of complex relationships between variables such as age, marital status, education level, perceived health, and beliefs about triage criteria, as these interdependencies critically affect predictive accuracy. Second, mitigation of overfitting, by expanding the dataset and providing greater variability in training samples, thereby enhancing generalization capacity. Third, support for model robustness, enabling ML models to handle diverse patient profiles and improve decision reliability in clinical practice. Together, these objectives ensure that the synthetic dataset not only mimics real-world variability but also serves as a reliable foundation for predictive modelling in sensitive healthcare applications.
To achieve these goals, we implemented the Synthetic Data Vault (SDV) framework, which combines advanced probabilistic modelling and ML to generate synthetic data that faithfully reflects the statistical structure of the original dataset [3,4,5]. The SDV approach was selected for its ability to automatically detect variable types, capture complex correlations, and scale efficiently to large volumes of synthetic data. By leveraging Gaussian Copula models within SDV, we ensured that the generated data preserved both distributions and dependencies, enabling the training of robust models for binary and multi-class triage classification.
Finally, to address the problem of class imbalance, we explicitly integrated data balancing into the generation process. Two classification schemes were considered: a binary model distinguishing between low- and high-priority patients, and a multi-class model offering finer stratification across four priority levels. In both cases, synthetic data was generated to balance class distributions, reducing bias and enhancing predictive performance. The statistical similarity between real and synthetic data was evaluated using a comprehensive set of descriptive measures (e.g., mean, standard deviation, skewness, kurtosis), complemented by visual inspection of distributions. Iterative refinement ensured that the synthetic dataset aligned closely with the original data, thereby providing a statistically consistent yet diversified foundation for predictive modelling.
This work proposes and evaluates a synthetic data generation methodology tailored to healthcare triage applications. By combining SDV-based generative models with explicit class balancing, we demonstrate how synthetic data can enhance the development of robust ML models in both binary and multi-class classification tasks. The contributions of this study are threefold: (i) a systematic methodology for generating and validating synthetic healthcare data that preserves statistical properties and inter-variable relationships; (ii) the application of SDV and Gaussian Copula modelling to expand small triage datasets while addressing class imbalance; and (iii) a demonstration of the effectiveness of synthetic data in supporting reliable ML-based decision making for patient prioritization.
2. Related Work
In healthcare, clinical datasets are often imbalanced, and societal data eliciting support for the criteria that should guide patient prioritization are frequently imbalanced and biased as well. Synthetic data generation therefore offers a promising alternative: it preserves the statistical patterns of real data without exposing individual records [6]. In this context, synthetic datasets accurately mimic joint distributions and structural dependencies, enabling simulation, behavioural modelling and algorithm development when real samples are scarce or legally restricted [7,8]. Synthetic data are used to address (i) data scarcity; and (ii) privacy and regulation.
Modern ML and deep learning algorithms require large amounts of data to achieve robust performance. In medicine, the rarity of certain conditions and logistical or ethical hurdles in data collection limit sample size. Synthetic generation enlarges training corpora while retaining key statistical properties, boosting predictive accuracy even in small-sample scenarios [8]. In turn, health data are highly sensitive and regulated [9]. Even with anonymisation, re-identification risks remain. High-fidelity synthetic data, decoupled from real patients, facilitate research and innovation within ethical and legal boundaries [6,8].
Access to high-quality and diverse healthcare data is often limited due to privacy concerns, regulatory constraints, and the rarity of certain clinical scenarios. To address these challenges, synthetic data generation has emerged as a valuable technique for augmenting datasets, enabling model training, testing, and validation without compromising patient confidentiality. Early work relied on classical statistical techniques; recent years have seen a shift toward ML-based generative models capable of learning complex variable dependencies and producing novel combinations (“sampled zeros”) [6,7,10]. These techniques aim to produce realistic and representative data that preserve the statistical properties of the original dataset, thereby supporting robust and generalizable ML models for patient prioritization and other healthcare tasks.
Bayesian Networks (BN) encode conditional relationships through directed acyclic graphs. Sampling from the learned joint distribution allows the generation of synthetic records. Despite their interpretability, they become computationally infeasible as dimensionality increases, and they struggle to model non-linear dependencies, compromising both data quality and privacy [10,11]. Limitations include (i) exponential growth in learning complexity with the number of variables [3]; (ii) poor representation of highly non-linear relationships [11]; and (iii) a tendency to memorize real data, putting privacy at risk [11].
Neural generative models such as Variational Autoencoders (VAE) and Generative Adversarial Networks (GAN) have gained prominence. VAEs map data to a latent space through variational inference and are effective with continuous variables, including longitudinal and eye-tracking datasets [6,7]. However, they struggle with categorical attributes and require significant computational resources for large datasets [4,11]. GANs, on the other hand, combine a generator and a discriminator in an adversarial training dynamic. For tabular data, variants such as the Conditional Tabular GAN (CTGAN) apply conditional techniques to handle categorical variables and class imbalance, improving synthetic fidelity and predictive performance when combined with real data [4,8].
Despite their potential, neural models face significant challenges in sensitive domains such as healthcare. They often operate as black boxes, making it difficult to trace and justify the generation process, which harms transparency and regulatory compliance [11]. Additionally, problems such as “mode collapse”, where the generator produces low-diversity data, can result in synthetic datasets that omit rare but clinically significant patterns [11]. These models impose a considerable computational burden, require careful hyperparameter tuning, and are sensitive to training conditions, leading to long training times and an increased risk of runtime errors when working with large datasets [3,4,11].
Synthia is an open-source multidimensional synthetic data generator implemented in Python (version 3.13) [12]. It uses copula-based models, which capture the statistical properties of the observed data in terms of both individual variable behaviour and the interdependencies between variables [12]. Synthia is further supported by Functional Principal Component Analysis (FPCA), an extension of principal component analysis in which the data consist of functions rather than vectors [12]. Synthia is a powerful tool for generating multidimensional synthetic data while preserving complex relationships. However, its applicability is limited to offline processing, and it may have shortcomings in accurately replicating the variable types and distribution functions of the original data [13].
SDV emerges as a robust and modular solution based on copula theory, specifically designed for tabular data. A Gaussian copula is a function that couples a multivariate distribution function to its univariate margins, describing the dependency structure through a multivariate normal distribution applied to transformed uniform variables [10,14]. Univariate marginals are modelled with Gaussian Mixture Models (GMM), probabilistic models composed of multiple Gaussian components; the marginals are then linked through a multivariate normal distribution after the data are transformed using Empirical Cumulative Distribution Functions (ECDF) [10,14]. Sklar’s Theorem ensures that any multivariate distribution can be decomposed into its marginal distributions and a copula, enabling the reuse of marginals across domains [6,14].
The SDV framework offers several advantages that make it particularly suitable for generating healthcare data. First, it demonstrates high statistical fidelity and clinical realism by preserving the structural and statistical properties of the original dataset, including relationships between tables and columns [3,4]. Additionally, SDV ensures the preservation of variability and the inclusion of rare cases, and it eliminates the need to choose a specific copula family, providing flexibility to capture complex dependencies [15].
Given the limitations pointed out in traditional synthetic data generation methods, SDV emerges as a robust and highly adaptable solution. From a privacy perspective, SDV offers enhanced protection, since its probabilistic approach prevents memorization of real data points, promoting an implicit form of differential privacy [4,10]. Furthermore, SDV is recognized for its computational efficiency and scalability, processing large datasets with reduced runtime and greater stability compared to models like GAN and VAE [3,4]. Lastly, among the models included in the SDV library, CopulaGAN and GaussianCopulaSynthesizer are particularly noteworthy for their capacity to model complex dependencies in tabular data. CopulaGAN combines the adversarial training dynamics of GANs with copula-based statistical modelling [5,10,15]. Copulas are multivariate functions that separate marginal distributions from the dependency structure, enabling the model to better capture non-linear relationships between variables, an essential feature when working with heterogeneous healthcare data [16]. The GaussianCopulaSynthesizer, on the other hand, simplifies this approach by assuming a Gaussian copula structure: it transforms variables into a standard normal space using probability integral transforms and models their joint distribution with a multivariate Gaussian copula. This model has demonstrated high effectiveness in reproducing realistic synthetic datasets while maintaining the integrity of variable correlations and supporting both continuous and categorical data [3,5,15].
Although GAN, VAE, Synthia, and BN have merits in specific domains, SDV represents a more robust, explainable, and efficient alternative for synthetic data generation in sensitive contexts such as healthcare [6]. Within the SDV framework, the GaussianCopulaSynthesizer was selected over CopulaGAN due to its stronger statistical grounding, greater transparency, and lower computational complexity [5]. While CopulaGAN leverages adversarial training to capture complex non-linear dependencies, it also introduces challenges such as increased training instability and reduced interpretability. In contrast, the GaussianCopulaSynthesizer, by assuming a Gaussian copula structure, provides a simpler yet effective model that preserves variable correlations and supports mixed data types with high fidelity [3,4,5]. This balance of rigour, explainability, and scalability makes SDV, and specifically the GaussianCopulaSynthesizer, the ideal choice for this study, enabling the generation of realistic, privacy-preserving synthetic data suitable for healthcare predictive modelling [3,4,5].
4. Experiments and Results
To support the development of robust and generalizable ML models, synthetic data was generated to expand the training dataset, mitigate overfitting, and address the imbalance across classes. Several generation strategies were explored, each aiming to refine the synthetic dataset in a way that preserved the statistical properties of the original data while improving class balance.
Fidelity metrics, including mean, standard deviation, skewness, and kurtosis, were computed separately for each variable in both the real and synthetic datasets. The absolute differences between corresponding metrics were then averaged across all variables to obtain a global fidelity score. This method allows for a fine-grained evaluation of the statistical similarity between real and synthetic data distributions.
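The fidelity score just described can be sketched as follows. The helper names are ours, and the two-column example dataset is hypothetical; a real evaluation would run over every shared variable of the real and synthetic tables.

```python
import statistics

def moments(values):
    """Mean, standard deviation, skewness, and excess kurtosis of one variable."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    skew = sum(((v - mean) / sd) ** 3 for v in values) / n
    kurt = sum(((v - mean) / sd) ** 4 for v in values) / n - 3
    return mean, sd, skew, kurt

def fidelity_score(real, synthetic):
    """Average absolute difference of the four moments across all shared variables.
    Lower is better; 0 means identical moments."""
    diffs = []
    for var in real:
        diffs.extend(abs(r - s)
                     for r, s in zip(moments(real[var]), moments(synthetic[var])))
    return sum(diffs) / len(diffs)

# Hypothetical example: identical columns yield a perfect (zero) score.
real = {"age": [30, 40, 50, 60, 70], "dec1": [1, 2, 2, 3, 5]}
print(fidelity_score(real, real))  # 0.0
```

Averaging per-variable moment differences into one number keeps the evaluation fine-grained (each variable contributes its own deviations) while still producing a single global figure to compare generation strategies.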
4.3. Binary Scenario
Refined Synthetic Data. The final approach focused on precisely addressing the inherent class imbalance by controlling the number of synthetic instances generated per class. Unlike previous strategies, this method did not simply replicate the global distribution of the original dataset or apply balancing as a preliminary step. Instead, it adopted a targeted strategy, determining the exact number of synthetic cases needed for each class so that, when combined with the original real data, the final dataset would total 100,000 instances with fully balanced class representation.
This refined approach was applied to both the binary and multi-class classification settings, as it proved to be the most effective and robust strategy for producing high quality synthetic data. Its goal was not only to correct imbalances but also to provide a strong, generalizable foundation for training ML models with fair exposure to all class scenarios.
Data Distribution. A refined synthetic data generation strategy was implemented that explicitly considered class balance while still being grounded in the original imbalanced real data; it was applied to both the binary and multi-class cases. The goal was to ensure that the generated dataset reflected the real-world distribution while also providing a sufficient number of cases for each class to train the models effectively. A crucial aspect of validating the synthetic data is comparing its distribution with that of the real data. To ensure that the generated data reflects the statistical characteristics of the real dataset, a thorough comparative analysis was conducted in which each variable was compared using statistical metrics.
Table 3 compares the distribution of all variables between the original and synthetic datasets for the third approach. In this strategy, different amounts of synthetic data were generated per class to counteract the real dataset’s imbalance, ensuring that after combining both sources, each class was equally represented.
Under this refined approach, real and synthetic data show high alignment across most variables. This method successfully maintains the structural integrity of all variables while introducing a controlled level of variability that strengthens the dataset’s robustness. For instance, the age distribution in the synthetic data, although slightly shifted towards older individuals, broadens the variability without distorting the overall population profile. This is particularly valuable, as it ensures representation of a wider range of patient scenarios, enriching model generalization.
Similarly, gender, marital status, and employment status show strong consistency between real and synthetic data, with only marginal statistical deviations that fall within acceptable tolerance. The education level variable also exhibits nearly perfect alignment, showing that categorical structures are reliably replicated. The minor differences observed reduce the risk of overfitting to narrow real-world patterns while preserving the realism needed for credible decision support models. Ranges remain consistent, and categorical levels are fully preserved, ensuring no loss of semantic meaning or data integrity.
In summary, the synthetic dataset successfully replicates the key statistical characteristics of the original data while enhancing balance and diversity.
Table 4 compares the distributions of the decision-related variables (DEC1 to DEC5) between the real dataset and the synthetic data generated under the refined synthetic data approach. This approach focuses on generating synthetic samples in a way that, when combined with the original data, results in balanced class representation across the target variable while maintaining the distributional characteristics of the predictors.
The statistical comparison across the five decision variables indicates that the synthetic data captures key distributional characteristics present in the real dataset. For all variables, the medians and inter-quartile ranges are preserved, reflecting consistency in the core distribution shape. However, the synthetic data exhibits slightly lower means across all variables, coupled with higher standard deviations. This indicates a broader spread and a modest shift in central tendency, potentially reflecting the increased representation of lower values in the synthetic samples. Despite these variations, the alignment in median and quartile values supports the conclusion that the synthetic data remains representative of the real data’s structure. This validates the suitability of the generated samples for downstream tasks, while fulfilling the goal of class balancing imposed by the refined synthetic data strategy.
Data Balancing. The method began by analyzing the number of real cases in each class (275 in class 0 and 2800 in class 1). Based on this, the appropriate number of synthetic samples was generated individually for each class to ensure that both reached equal representation in the final dataset. This means generating proportionally more data for the underrepresented class (class 0) and fewer for the dominant class (class 1), correcting the existing bias without distorting the statistical properties of the original data.
A total of 49,725 synthetic instances were generated for class 0 and 47,200 for class 1. When combined with the original real data, each class reached exactly 50,000 instances, resulting in a perfectly balanced final dataset. This equal class distribution provides a solid foundation for developing classification models, enabling the algorithm to learn how to predict patient priority levels without being affected by the structural bias present in the original dataset. In addition to improving class balance, this controlled generation ensured that each synthetic record adhered to the learned statistical relationships and logical consistency of the original dataset. By doing so, the model avoided common pitfalls associated with naive oversampling, such as redundancy or data artefacts.
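The per-class arithmetic described above is simple enough to state as code. This is a sketch of the computation only, with function names of our choosing; the class counts are the figures reported in the text.

```python
def synthetic_counts(real_counts, target_total):
    """How many synthetic instances to generate per class so that, combined with
    the real data, every class holds an equal share of target_total."""
    per_class = target_total // len(real_counts)
    return {cls: per_class - n_real for cls, n_real in real_counts.items()}

# Binary scenario figures from the text: 275 real cases in class 0, 2800 in class 1.
need = synthetic_counts({0: 275, 1: 2800}, target_total=100_000)
print(need)  # {0: 49725, 1: 47200}
```

Each class then totals exactly 50,000 instances once the real records are added back, matching the balanced dataset described above.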
4.4. Multi-Class Scenario
Refined Synthetic Data. Following the success of the refined balancing strategy in the binary classification scenario, the same methodology was extended to the multi-class configuration. Given the more complex distribution of classes in this scenario, with varying levels of under-representation, a careful and systematic approach was adopted to ensure an even class distribution while preserving the statistical integrity of the original data. Rather than applying a generic oversampling method or maintaining the natural imbalance, this refined approach aimed to generate the exact number of synthetic instances needed per class. The objective was to ensure that, when combined with the real dataset, each of the four classes would be equally represented, resulting in a final dataset composed of 100,000 balanced instances, 25,000 per class.
Data Distribution. This scenario involved four distinct priority levels: non-priority, low priority, priority, and high priority. The synthetic data was carefully generated to ensure equal representation across these categories, with each class totalling 25,000 instances after combining real and synthetic records. Before moving forward with the evaluation of individual variables, it was essential to validate the overall distribution of the synthetic data. A comparative distribution analysis was performed between the real and synthetic data across all variables, highlighting how well the synthetic data replicates the structure of the real dataset in this multi-class setting. This analysis served as a critical step in verifying the fidelity and reliability of the synthetic dataset. Particular attention was given to variables that influence classification performance, ensuring that the synthetic data maintained both statistical consistency and semantic realism.
Table 5 presents a comparison of the distribution for the variables between the real dataset and the synthetic data generated under the final multi-class balancing approach. The table provides an overview of key distributional statistics to assess the fidelity of the synthetic data in replicating the structure of the original dataset.
The refined synthetic dataset under the multi-class scenario shows strong alignment with the real data while introducing small variations that enhance representativeness. The age variable presents a slightly higher mean and skewness, capturing a broader spread of cases and improving balance across groups.
Gender, marital status, and education level remain almost identical to the real data, preserving structural fidelity. For employment status, the synthetic data introduces slightly higher variability, which is beneficial for generalization. Overall, this refined approach ensures both realism and controlled heterogeneity, making it especially suited for robust multi-class modelling.
Table 6 presents a comparative summary of the statistical distribution of five decision-related variables (DEC1–DEC5) between the real dataset and the synthetically generated dataset, both derived under the final multi-class balancing strategy. This evaluation is crucial to assess whether the synthetic data can reliably mimic the key distributional characteristics of decision-making variables found in the original dataset, which are central to downstream modelling and interpretation.
The descriptive statistics demonstrate that the synthetic data maintains a close approximation to the real dataset across all five decision variables. The mean and median values are generally consistent, with DEC1 and DEC2 showing slightly higher central tendency in the real data, while the synthetic data exhibits marginally increased dispersion as reflected in higher standard deviations for all variables.
The general shape and symmetry of the distributions are preserved, and the synthetic dataset offers a faithful representation of the real decision variables, supporting its adequacy for subsequent analytical tasks within the multi-class framework.
The statistical comparison confirmed that the synthetic data is a highly reliable representation of the real dataset. The preservation of means, medians, and quartiles of the variables highlights the model’s ability to faithfully replicate central tendency and variability, ensuring that the synthetic data reflects the essential statistical structure of the real dataset. While minor differences exist, they do not significantly impact the overall validity of the dataset. The high degree of similarity in statistical metrics suggests that the synthetic data can be confidently used for analysis and modelling, maintaining the essential characteristics of the original real-world data.
Data Balancing. The process began with an analysis of the real data distribution across the four priority levels: class 0 (non-priority), class 1 (low priority), class 2 (priority), and class 3 (high priority). Based on the number of real instances in each class, synthetic data was generated in a targeted manner to complement the deficit in each category: 24,955 synthetic samples for class 0, 24,253 for class 1, 23,071 for class 2, and 24,637 for class 3. When combined with the respective real data, these additions brought each class to exactly 25,000 instances. This resulted in a fully balanced multi-class dataset with a total of 100,000 examples, ensuring fair and equal representation of all priority levels for model training.
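As a consistency check, the real class counts implied by these figures can be recovered directly (25,000 per class minus the synthetic counts reported above); the dictionary below only restates the numbers given in the text.

```python
# Synthetic instances generated per class, as reported in the text.
synthetic = {0: 24_955, 1: 24_253, 2: 23_071, 3: 24_637}
target_per_class = 25_000

# Real instances per class implied by the stated per-class totals.
real = {cls: target_per_class - n for cls, n in synthetic.items()}
print(real)  # {0: 45, 1: 747, 2: 1929, 3: 363}
print(sum(real.values()) + sum(synthetic.values()))  # 100000
```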
In addition to achieving balance, special care was taken to ensure that the synthetic records remained statistically and semantically coherent. The average values of key decision variables within each class stayed within the expected ranges defined for their respective categories. Variables such as age, political orientation, and service usage patterns remained within realistic and medically plausible boundaries. By applying this refined and controlled synthesis approach, the resulting multi-class dataset provides a robust and fair foundation for machine learning applications, enhancing the model’s ability to learn from all classes equally and generalize effectively across different patient profiles.
Given this overall performance, this approach was selected as the basis for developing the algorithm integrated into the patient triage application, as it proved to be the most effective in terms of distributional fidelity and data balancing. The next step will be to apply machine learning algorithms to this dataset to build and validate the predictive components of the system.
4.6. Discussion
The experimental results demonstrate that the proposed synthetic data generation approach, based on the SDV Gaussian Copula Synthesizer, effectively reproduces the statistical structure of the original healthcare dataset while ensuring data privacy. The analysis of distributional metrics, including mean, standard deviation, skewness, and kurtosis, confirms a high degree of similarity between real and synthetic data, indicating that the generated samples maintain the integrity and variability of the original variables. These findings suggest that the probabilistic, statistically grounded nature of SDV provides a stable and explainable alternative to neural generative models such as GANs and VAEs, which often suffer from training instability and overfitting.
When compared with previous studies (Table 7), the proposed approach extends existing work in several dimensions. Unlike prior GAN-based studies that focused primarily on binary classification and small datasets [3,8], this study successfully handles both binary and multi-class healthcare triage problems. Moreover, the SDV framework enables the expansion of the dataset from 2125 real records to 100,000 total records without compromising statistical fidelity. This scalability highlights the robustness of the probabilistic copula-based modelling approach and its ability to support applications that require larger and more diverse datasets.
From a practical standpoint, these results have important implications for data-driven decision making in healthcare. The ability to generate realistic, privacy-preserving synthetic data enables hospitals, research institutions, and policymakers to conduct advanced modelling and triage simulations without exposing sensitive patient information. This can facilitate the development of predictive tools, improve ethical resource allocation, and support compliance with data protection regulations.
However, several limitations should be acknowledged. First, the present study focused on statistical fidelity and did not include a full evaluation of model performance metrics or statistical distance measures such as the Jensen–Shannon divergence or the Kolmogorov–Smirnov statistic. Second, the dataset used contained primarily demographic and decision-related attributes, which may not fully represent the complexity of clinical data. Looking ahead, the next stage of this research will focus on developing an explainable prioritization platform that leverages the generated synthetic data to support transparent and interpretable triage decision making. This platform will combine data-driven prioritization models with explainability mechanisms, enabling healthcare professionals to better understand and justify the recommendations produced by AI-assisted systems.
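One of the distance measures named above, the two-sample Kolmogorov–Smirnov statistic, is straightforward to add in future evaluations. The sketch below is illustrative only (it is not part of the present study's evaluation) and uses only the standard library.

```python
def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the two
    empirical CDFs. 0 means identical samples; 1 means fully separated ones."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        t = min(a[i], b[j])
        # Advance past all values <= t in both samples so ties are handled together.
        while i < len(a) and a[i] <= t:
            i += 1
        while j < len(b) and b[j] <= t:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

print(ks_statistic([1, 2, 3, 4], [1, 2, 3, 4]))  # 0.0
print(ks_statistic([1, 2, 3], [10, 11, 12]))     # 1.0
```

Applied per variable to the real and synthetic columns, this would complement the moment-based fidelity score with a distribution-wide distance.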
In summary, the proposed SDV-based approach demonstrates that statistically grounded models can generate scalable, high-fidelity, and privacy-respecting synthetic data suitable for healthcare triage applications. Ongoing work will extend this research to include more complex data modalities, quantitative similarity metrics, and access to the implementation upon request, thus reinforcing transparency and reproducibility in synthetic data generation research.
From a privacy perspective, the generation of synthetic data inherently supports data protection principles. Because the SDV Gaussian Copula Synthesizer reproduces only the statistical relationships among variables and not the actual records, no individual information from the original dataset is retained. This mechanism effectively prevents data re-identification, enabling the use of realistic datasets for analysis and model training in compliance with privacy regulations.