Article

The Problem of Fairness in Synthetic Healthcare Data

1 Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY 12180, USA
2 OptumLabs, Eden Prairie, MN 55344, USA
3 Rensselaer Institute for Data Exploration and Applications, Troy, NY 12180, USA
4 LISN, CNRS/INRIA, Université Paris-Saclay, 91190 Gif-sur-Yvette, France
5 ChaLearn, San Francisco, CA 94115, USA
6 Department of Mathematics, Rensselaer Polytechnic Institute, Troy, NY 12180, USA
* Author to whom correspondence should be addressed.
Academic Editors: Fabio Aiolli and Mirko Polato
Entropy 2021, 23(9), 1165; https://doi.org/10.3390/e23091165
Received: 8 July 2021 / Revised: 25 August 2021 / Accepted: 30 August 2021 / Published: 4 September 2021
(This article belongs to the Special Issue Representation Learning: Theory, Applications and Ethical Issues)
Access to healthcare data such as electronic health records (EHR) is often restricted by laws established to protect patient privacy. These restrictions hinder the reproducibility of existing results based on private healthcare data and also limit new research. Synthetically generated healthcare data solve this problem by preserving privacy and enabling researchers and policymakers to drive decisions and methods based on realistic data. Healthcare data can include information about multiple inpatient and outpatient visits, making it a time-series dataset that is often influenced by protected attributes such as age, gender, and race. The COVID-19 pandemic has exacerbated health inequities, with certain subgroups experiencing poorer outcomes and less access to healthcare. To combat these inequities, synthetic data must “fairly” represent diverse minority subgroups so that conclusions drawn on synthetic data are correct and the results generalize to real data. In this article, we develop two fairness metrics for synthetic data and apply them to all subgroups defined by protected attributes to assess bias in three published synthetic research datasets. These covariate-level disparity metrics revealed that synthetic data may not be representative at the univariate and multivariate subgroup levels and, thus, that fairness should be addressed when developing data generation methods. We discuss the need for measuring fairness in synthetic healthcare data to enable the development of robust machine learning models that create more equitable synthetic healthcare datasets.
Keywords: synthetic data; healthcare; fairness; covariate; temporal; time-series; disparate impact; health inequities
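
The covariate-level disparity metrics themselves are defined in the full text. As a rough, illustrative sketch of the general idea (comparing how each protected-attribute subgroup is represented in synthetic versus real data, in the spirit of the disparate-impact "80% rule" named in the keywords), the following Python snippet uses hypothetical DataFrames real_ehr and synthetic_ehr and a hypothetical column name "race"; it is not the metric proposed in the article.

import pandas as pd

def subgroup_representation_ratio(real: pd.DataFrame,
                                  synthetic: pd.DataFrame,
                                  attribute: str) -> pd.Series:
    # For each value of a protected attribute, compute the ratio
    # P(subgroup | synthetic) / P(subgroup | real).
    # A ratio near 1 suggests proportional representation; values far
    # from 1 flag over- or under-represented subgroups.
    real_dist = real[attribute].value_counts(normalize=True)
    synth_dist = synthetic[attribute].value_counts(normalize=True)
    # Align on the categories seen in the real data; subgroups missing
    # from the synthetic data get a ratio of 0.
    return synth_dist.reindex(real_dist.index).fillna(0) / real_dist

# Hypothetical usage (column and variable names are assumptions):
# ratios = subgroup_representation_ratio(real_ehr, synthetic_ehr, "race")
# flagged = ratios[(ratios < 0.8) | (ratios > 1.25)]  # 80%-rule-style bounds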
MDPI and ACS Style

Bhanot, K.; Qi, M.; Erickson, J.S.; Guyon, I.; Bennett, K.P. The Problem of Fairness in Synthetic Healthcare Data. Entropy 2021, 23, 1165. https://doi.org/10.3390/e23091165

AMA Style

Bhanot K, Qi M, Erickson JS, Guyon I, Bennett KP. The Problem of Fairness in Synthetic Healthcare Data. Entropy. 2021; 23(9):1165. https://doi.org/10.3390/e23091165

Chicago/Turabian Style

Bhanot, Karan, Miao Qi, John S. Erickson, Isabelle Guyon, and Kristin P. Bennett. 2021. "The Problem of Fairness in Synthetic Healthcare Data" Entropy 23, no. 9: 1165. https://doi.org/10.3390/e23091165
