Recent Advances in Synthetic Data Generation

A special issue of Electronics (ISSN 2079-9292). This special issue belongs to the section "Computer Science & Engineering".

Deadline for manuscript submissions: closed (31 December 2022) | Viewed by 29805

Special Issue Editors


Guest Editor
1. Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), 20009 Donostia-San Sebastián, Spain
2. Biodonostia Health Research Institute, eHealth Group, Paseo Doctor Begiristain, s/n, 20014 San Sebastián, Spain
Interests: health; software; data; network; data preparation; QoD; Synthetic data generation for data security / privacy

Guest Editor
School of Computing, Engineering and Intelligent Systems, Ulster University, Derry~Londonderry, UK
Interests: patient rehabilitation; virtual reality; artificial intelligence; computer games

Special Issue Information

Dear Colleagues,

Scientific and technological advances in recent decades have led to the digitization and increased generation and collection of data describing real-world applications or processes. In addition, machine learning models and artificial intelligence applications built on data have been proven to improve management and decision making about these applications and processes.

Despite the potential of data-based solutions, several issues prevent or delay their development. The most notable are access to data and how well the captured sample represents the real population. Access to real data can be delayed or even prevented for reasons such as privacy, security and intellectual property, or because the technology needed to capture and prepare data of sufficient quality must first be developed. Sample representativeness, which relates to class imbalance and the representation of rare and extreme events, is critical for machine learning (ML) model performance.

Synthetic data (SD) is described in this context as “any production data applicable to a given situation that are not obtained by direct measurement”. SD has three key use cases: (i) data augmentation, to balance datasets or supplement available data before training an ML model; (ii) privacy preservation, to allow safe and private sharing of sensitive data; and (iii) simulation, for estimating and teaching systems in situations that have not been observed in reality.
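Use case (i) can be illustrated with a minimal sketch. It is not a method from this Special Issue: it balances a toy two-class dataset by random oversampling with small Gaussian jitter, one of the simplest augmentation baselines. The function name `oversample_minority`, the `jitter` parameter and the toy data are assumptions made for illustration.

```python
import random
from collections import Counter


def oversample_minority(rows, labels, jitter=0.05, seed=0):
    """Balance classes by resampling under-represented rows with Gaussian noise.

    Hypothetical helper for illustration; `jitter` is the noise standard deviation.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the largest class
    new_rows, new_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            base = rng.choice(pool)
            # A synthetic row is a real row of the same class plus noise.
            new_rows.append([x + rng.gauss(0.0, jitter) for x in base])
            new_labels.append(cls)
    return new_rows, new_labels


# Toy data (assumed): four "normal" samples and one "rare" sample.
rows = [[1.0, 2.0], [1.1, 1.9], [0.9, 2.1], [1.0, 2.2], [5.0, 7.0]]
labels = ["normal", "normal", "normal", "normal", "rare"]
balanced_rows, balanced_labels = oversample_minority(rows, labels)
print(Counter(balanced_labels))
```

The original rows are kept unchanged and only the under-represented class receives synthetic rows. Generative approaches such as those covered by this Special Issue aim to go beyond this kind of perturbed copying.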

The need for a comprehensive solution to exploit developments in Big Data and AI technology has never been greater, and synthetic data generation (SDG) research has been underway for some time with promising results in various application areas, including healthcare, cybersecurity, industrial processes, and energy consumption. Research has addressed the SDG of different data modalities (written natural language, images, video, tabular data, time series data, etc.) using different technological approaches.

The main objective of this Special Issue is to bring together diverse, novel and impactful research on synthetic data generation, thereby accelerating research in this field and the adoption of these techniques for real-world applications.

Contributions from different application domains, use cases and data modalities are sought by this Special Issue.

Submissions should be of high enough quality for an international journal and should not be submitted or published elsewhere. However, extended versions of conference papers that show significant improvement (over 30%) can be considered for review in this Special Issue. In addition, we welcome review papers covering the subjects of this Special Issue.

Technical Program Committee Members:

  1. Dr. Debbie Rankin, Ulster University
  2. Dr. Ane Alberdi, Mondragon Unibertsitatea
  3. Dr. Rodrigo Cilla, Vicomtech - BRTA

Dr. Gorka Epelde Unanue
Dr. Darryl Charles
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Electronics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Synthetic data generation
  • Generative adversarial networks
  • Privacy preserving data
  • Data augmentation
  • Artificial intelligence
  • Healthcare
  • Imbalanced learning

Published Papers (6 papers)

Research

17 pages, 9026 KiB  
Article
Nonparametric Generation of Synthetic Data Using Copulas
by Juan P. Restrepo, Juan Carlos Rivera, Henry Laniado, Pablo Osorio and Omar A. Becerra
Electronics 2023, 12(7), 1601; https://doi.org/10.3390/electronics12071601 - 29 Mar 2023
Viewed by 2135
Abstract
This article presents a novel nonparametric approach to generate synthetic data using copulas, which are functions that explain the dependency structure of the real data. The proposed method addresses several challenges faced by existing synthetic data generation techniques, such as the preservation of complex multivariate structures present in real data. By using all the information from real data and verifying that the generated synthetic data follows the same behavior as the real data under homogeneity tests, our method is a significant improvement over existing techniques. Our method is easy to implement and interpret, making it a valuable tool for solving class imbalance problems in machine learning models, improving the generalization capabilities of deep learning models, and anonymizing information in finance and healthcare domains, among other applications.
(This article belongs to the Special Issue Recent Advances in Synthetic Data Generation)
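As a rough illustration of copula-based generation, the sketch below uses a Gaussian copula with empirical marginals for two variables. This is a swapped-in, simpler technique and not the nonparametric method of the article above: the dependence is summarized by a single correlation of rank-based normal scores. All function names and the toy data are assumptions.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def normal_scores(column):
    """Map a column to normal scores through its ranks (pseudo-observations)."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def pearson(a, b):
    """Pearson correlation of two equally long, non-constant sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    var_a = sum((p - ma) ** 2 for p in a)
    var_b = sum((q - mb) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


def empirical_quantile(column_sorted, u):
    """Inverse empirical CDF: the order statistic at probability level u."""
    idx = min(int(u * len(column_sorted)), len(column_sorted) - 1)
    return column_sorted[idx]


def sample_gaussian_copula(x, y, n_samples, seed=0):
    """Draw synthetic (x, y) pairs that keep the marginals and the rank correlation."""
    rng = random.Random(seed)
    rho = pearson(normal_scores(x), normal_scores(y))
    xs, ys = sorted(x), sorted(y)
    samples = []
    for _ in range(n_samples):
        # Correlated standard normals (2x2 Cholesky factor written out by hand).
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(max(0.0, 1.0 - rho ** 2)) * rng.gauss(0.0, 1.0)
        # Normal CDF gives uniforms; the empirical quantile maps back to data scale.
        u1, u2 = STD_NORMAL.cdf(z1), STD_NORMAL.cdf(z2)
        samples.append((empirical_quantile(xs, u1), empirical_quantile(ys, u2)))
    return samples


# Toy data (assumed for illustration).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 6.2, 3.9, 8.1, 12.5, 9.8]
synthetic = sample_gaussian_copula(x, y, 200)
print(synthetic[:3])
```

Because the marginals are mapped back through order statistics, every synthetic coordinate is a value that was observed in the real data; only the pairing of values is new.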

17 pages, 2135 KiB  
Article
A Novel Fusion Approach Consisting of GAN and State-of-Charge Estimator for Synthetic Battery Operation Data Generation
by Kei Long Wong, Ka Seng Chou, Rita Tse, Su-Kit Tang and Giovanni Pau
Electronics 2023, 12(3), 657; https://doi.org/10.3390/electronics12030657 - 28 Jan 2023
Cited by 2 | Viewed by 1977
Abstract
The recent success of machine learning has accelerated the development of data-driven lithium-ion battery state estimation and prediction. The lack of accessible battery operation data is one of the primary concerns with the data-driven approach. However, research on battery operation data augmentation is rare. When coping with data sparsity, one popular approach is to augment the dataset by producing synthetic data. In this paper, we propose a novel fusion method for synthetic battery operation data generation. It combines a generative adversarial network (GAN)-based generation module and a state-of-charge estimator. The generation module generates battery operation features, namely the voltage, current, and temperature. The features are then fed into the state-of-charge estimator, which calculates the relevant state of charge. The results of the evaluation reveal that our method can produce synthetic data with distributions similar to the actual dataset and performs well in downstream tasks.

15 pages, 868 KiB  
Article
Statistical Validation of Synthetic Data for Lung Cancer Patients Generated by Using Generative Adversarial Networks
by Luis Gonzalez-Abril, Cecilio Angulo, Juan Antonio Ortega and José-Luis Lopez-Guerra
Electronics 2022, 11(20), 3277; https://doi.org/10.3390/electronics11203277 - 12 Oct 2022
Cited by 4 | Viewed by 2046
Abstract
The development of healthcare patient digital twins in combination with machine learning technologies helps doctors in therapeutic prescription and in minimally invasive intervention procedures. The confidentiality of medical records or limited data availability in many health domains are drawbacks that can be overcome with the generation of synthetic data that conform to real data. The use of generative adversarial networks (GANs) for the generation of synthetic data of lung cancer patients has been previously introduced as a tool to solve this problem in the form of anonymized synthetic patients. However, generated synthetic data are mainly validated from the machine learning domain (loss functions) or expert domain (oncologists). In this paper, we propose statistical decision making as a validation tool: Is the model good enough to be used? Does the model pass rigorous hypothesis testing criteria? We show for the case at hand how loss functions and hypothesis validation are not always well aligned.
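To make the idea of hypothesis testing as a validation tool concrete, here is a minimal sketch using a two-sample Kolmogorov-Smirnov test on a single numeric variable. The choice of test is an assumption and may differ from the tests used in the article above; the function names and toy samples are hypothetical, and the critical value is the usual large-sample approximation, so it is only indicative for samples this small.

```python
import math
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for v in sorted(set(a) | set(b)):
        fa = bisect_right(a, v) / len(a)  # share of a that is <= v
        fb = bisect_right(b, v) / len(b)  # share of b that is <= v
        d = max(d, abs(fa - fb))
    return d


def ks_reject(a, b, alpha=0.05):
    """True if 'same distribution' is rejected at level alpha (asymptotic threshold)."""
    n, m = len(a), len(b)
    critical = math.sqrt(-0.5 * math.log(alpha / 2.0)) * math.sqrt((n + m) / (n * m))
    return ks_statistic(a, b) > critical


# Toy samples (assumed): one synthetic set close to the real data, one far from it.
real = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
close = [1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5]
far = [11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0]
print(ks_reject(real, close), ks_reject(real, far))
```

A generator can reach a low training loss and still fail a test of this kind on some variable, which is the sort of misalignment the abstract refers to.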

21 pages, 9663 KiB  
Article
The “Coherent Data Set”: Combining Patient Data and Imaging in a Comprehensive, Synthetic Health Record
by Jason Walonoski, Dylan Hall, Karen M. Bates, M. Heath Farris, Joseph Dagher, Matthew E. Downs, Ryan T. Sivek, Ben Wellner, Andrew Gregorowicz, Marc Hadley, Francis X. Campion, Lauren Levine, Kevin Wacome, Geoff Emmer, Aaron Kemmer, Maha Malik, Jonah Hughes, Eldesia Granger and Sybil Russell
Electronics 2022, 11(8), 1199; https://doi.org/10.3390/electronics11081199 - 9 Apr 2022
Cited by 4 | Viewed by 9197
Abstract
The “Coherent Data Set” is a novel synthetic data set that leverages structured data from Synthea™ to create a longitudinal, “coherent” patient-level electronic health record (EHR). Comprised of synthetic patients, the Coherent Data Set is publicly available, reproducible using Synthea™, and free of the privacy risks that arise from using real patient data. The Coherent Data Set provides complex and representative health records that can be leveraged by health IT professionals without the risks associated with de-identified patient data. It includes familial genomes that were created through a simulation of the genetic reproduction process; magnetic resonance imaging (MRI) DICOM files created with a voxel-based computational model; clinical notes in the style of traditional subjective, objective, assessment, and plan notes; and physiological data that leverage existing Systems Biology Markup Language (SBML) models to capture non-linear changes in patient health metrics. HL7 Fast Healthcare Interoperability Resources (FHIR®) links the data together. The models can generate clinically logical health data, but ensuring clinical validity remains a challenge without comparable data to substantiate results. We believe this data set is the first of its kind and a novel contribution to practical health interoperability efforts.

10 pages, 1626 KiB  
Article
MaWGAN: A Generative Adversarial Network to Create Synthetic Data from Datasets with Missing Data
by Thomas Poudevigne-Durance, Owen Dafydd Jones and Yipeng Qin
Electronics 2022, 11(6), 837; https://doi.org/10.3390/electronics11060837 - 8 Mar 2022
Cited by 5 | Viewed by 2396
Abstract
The creation of synthetic data is important for a range of applications, for example, to anonymise sensitive datasets or to increase the volume of data in a dataset. When the target dataset has missing data, then it is common to just discard incomplete observations, even though this necessarily means some loss of information. However, when the proportion of missing data is large, discarding incomplete observations may not leave enough data to accurately estimate their joint distribution. Thus, there is a need for data synthesis methods capable of using datasets with missing data, to improve accuracy and, in more extreme cases, to make data synthesis possible. To achieve this, we propose a novel generative adversarial network (GAN) called MaWGAN (for masked Wasserstein GAN), which creates synthetic data directly from datasets with missing values. As with existing GAN approaches, the MaWGAN synthetic data generator generates samples from the full joint distribution. We introduce a novel methodology for comparing the generator output with the original data that does not require us to discard incomplete observations, based on a modification of the Wasserstein distance and easily implemented using masks generated from the pattern of missing data in the original dataset. Numerical experiments are used to demonstrate the superior performance of MaWGAN compared to (a) discarding incomplete observations before using a GAN, and (b) imputing missing values (using the GAIN algorithm) before using a GAN.
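The masking idea in this abstract can be sketched roughly as follows. This is an illustrative assumption about how a missing-data mask might be applied, not the MaWGAN implementation: the mask of a real row (1 = observed, 0 = missing) is applied to both the real row and a generated row, so the comparison only sees coordinates that were actually observed. The names `missing_mask`, `apply_mask` and `masked_mean_gap` are hypothetical, and the mean absolute gap is a crude stand-in for a critic-based Wasserstein estimate.

```python
import math


def missing_mask(row):
    """1.0 where the value is observed, 0.0 where it is missing (NaN)."""
    return [0.0 if math.isnan(v) else 1.0 for v in row]


def apply_mask(row, mask):
    """Replace masked-out coordinates with 0.0 and keep the rest unchanged."""
    return [0.0 if m == 0.0 else v for v, m in zip(row, mask)]


def masked_mean_gap(real_rows, fake_rows):
    """Mean absolute difference over observed coordinates only."""
    total, count = 0.0, 0.0
    for real, fake in zip(real_rows, fake_rows):
        mask = missing_mask(real)
        r, f = apply_mask(real, mask), apply_mask(fake, mask)
        total += sum(abs(a - b) for a, b in zip(r, f))
        count += sum(mask)
    return total / count if count else 0.0


# Toy rows (assumed): each real row has one missing entry, generated rows are complete.
nan = float("nan")
real = [[1.0, nan, 3.0], [nan, 2.0, 2.0]]
fake = [[1.5, 9.0, 3.0], [7.0, 2.0, 1.0]]
print(masked_mean_gap(real, fake))
```

The generated values 9.0 and 7.0 fall on masked coordinates and do not affect the result, so no real row has to be discarded for being incomplete.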

17 pages, 12551 KiB  
Article
Incorporation of Synthetic Data Generation Techniques within a Controlled Data Processing Workflow in the Health and Wellbeing Domain
by Mikel Hernandez, Gorka Epelde, Andoni Beristain, Roberto Álvarez, Cristina Molina, Xabat Larrea, Ane Alberdi, Michalis Timoleon, Panagiotis Bamidis and Evdokimos Konstantinidis
Electronics 2022, 11(5), 812; https://doi.org/10.3390/electronics11050812 - 4 Mar 2022
Cited by 15 | Viewed by 9402
Abstract
To date, the use of synthetic data generation techniques in the health and wellbeing domain has been mainly limited to research activities. Although several open source and commercial packages have been released, they have been oriented to generating synthetic data as a standalone data preparation process and not integrated into a broader analysis or experiment testing workflow. In this context, the VITALISE project is working to harmonize Living Lab research and data capture protocols and to provide controlled processing access to captured data to industrial and scientific communities. In this paper, we present the initial design and implementation of our synthetic data generation approach in the context of the VITALISE Living Lab controlled data processing workflow, together with identified challenges and future developments. By uploading data captured from Living Labs, generating synthetic data from them, developing analysis locally with synthetic data, and then executing them remotely with real data, the utility of the proposed workflow has been validated. Results have shown that the presented workflow helps accelerate research on artificial intelligence, ensuring compliance with data protection laws. The presented approach has demonstrated how state-of-the-art synthetic data generation techniques can be adopted for real-world applications.