Article
Peer-Review Record

Privacy and Utility of Private Synthetic Data for Medical Data Analyses

Appl. Sci. 2022, 12(23), 12320; https://doi.org/10.3390/app122312320
by Arno Appenzeller 1,2,*, Moritz Leitner 1,2, Patrick Philipp 2, Erik Krempel 3 and Jürgen Beyerer 1,2
Reviewer 2: Anonymous
Submission received: 2 November 2022 / Revised: 18 November 2022 / Accepted: 27 November 2022 / Published: 1 December 2022
(This article belongs to the Special Issue Advanced Technologies for Data Privacy and Security)

Round 1

Reviewer 1 Report

*1. What is the main question addressed by the research?*

The study establishes that a range of privacy issues arises from the growing availability and use of sensitive personal data. This is especially true when handling sensitive health information, and these issues must be addressed.

*2. Do you consider the topic original or relevant in the field? Does it address a specific gap in the field?*

In the world of information security, an innovative and important topic currently being discussed is the privacy and utility of private synthetic data for medical data analytics. This paper discusses these difficulties and comes to the conclusion that private synthetic data generators have significant advantages over more conventional methods, but they also call for an in-depth analysis depending on the application.

*3. What does it add to the subject area compared with other published material?*

This study investigated the risks associated with the privacy of medical records in an effort to solve information security concerns. Private synthetic data generators have been found to provide substantial benefits over conventional methods; however, further exploration may be necessary depending on the specific use case. This adds to the area of privacy in medical record studies.

*4. What specific improvements should the authors consider regarding the methodology? What further controls should be considered?*

The paper evaluates three different private synthetic data generators on their use case-specific privacy and utility. It reports findings for the use case of analyzing continuous heart rate measurements from different individuals. This is crucial, and various other types of medical data and their characteristics should also be provided.

*5. Are the conclusions consistent with the evidence and arguments presented and do they address the main question posed?*

This work shows that private synthetic data generators have significant advantages over traditional techniques, but that they also require a thorough analysis of each application.

*6. Are the references appropriate?*

Yes

*7. Please include any additional comments on the tables and figures.*

The legends in Figures 3 and 4 are not clear.

Author Response

Dear Reviewer,

 

We thank you for your valuable feedback.

Based on your review and other reviews we received, we updated the manuscript (changes are highlighted in blue).

 

We also agree that more and different data sets need to be evaluated. In our paper, we also recommend that a use case-specific analysis be carried out for any new use case or data set. While our use case is very specific to the data set, we see no possibility of analyzing other data sets as part of this work. However, we agree that this should be done in future work, and it is an ongoing research direction of ours.

 

To address your feedback on the legends in the figures, we have enlarged the legends in all figures to make them clearer and more readable.

 

If you have additional feedback or questions, we look forward to making further improvements or answering your questions.

 

Best regards

Reviewer 2 Report

The paper is about Privacy and Utility of Private Synthetic Data for Medical Data Analyses. I have the following comments:

1. What are the main limitations of related work?

2. Summarize the contributions of the paper in bullets.

3. Regarding the data that shows heart beats versus time, I believe more information about the participants' life and work is needed. For example, for people who work during the day and sleep at night, the heart rate will peak during the day and reach its minimum during sleep. This hypothesis is incorrect if a person works at night and sleeps during the day.

4. Explain in more detail the data sets used in this research.

5. What is the complexity of the proposed model?

6. Write a section to explain the limitations of the research.

7. How do the results compare to previous studies?

Author Response

Dear Reviewer,

 

We want to thank you for your feedback.

Based on your review and other reviews we received, we updated and improved the manuscript (changes are highlighted in blue).

 

We also want to address your specific points with this answer:

 

1) The main limitations are the focus on the specific data set and the specific use case. While this paper provides an in-depth look at the data and the use case, we did not generalize our approach, look at more use cases, or provide general experiments. However, we believe that there is much related work in which these technologies have been evaluated in a general way, and we think that other use cases could be considered as future work. Another limitation is the choice of the attacker. The attacker in our paper is very specific to the use case. More general attackers could cover broader privacy issues.

Additionally, our choice of technologies (SmartNoise) limits the results; experiments with more data generators could be performed. Furthermore, we used the pre-defined parameters of SmartNoise.

 

2) We added a bullet point list to the end of the introduction to provide a list of the contributions of the paper. In addition, we also provide those bullet points here:

* Overview of existing technologies and implementations for private synthetic data generation.

* Introduction of a use case-specific privacy and utility metric for synthetic medical data.

* Evaluation of three approaches to private synthetic data generation in terms of privacy and utility.

 

3) The data set itself contains no information about the behavior of the participants (see Point 4)). However, our attacker model assumes that the attacker might have such background knowledge about an individual (at which time of day the person works or when the person is active, e.g., doing a workout).

We agree that data sets with additional information about the individuals could underline our results, and this is considered for future work.

 

4) The data set is an open-source data set provided by the RTI International Institute.

It is a crowd sourced data set of Fitbit health and activity data that was collected through a survey via Amazon Mechanical Turk. The data set contains steps and heart rate data of 30 participants over a one-month period.

Section 4.2 contains a description of the data set which we extended in the revised manuscript.

 

5) Our attacker model focuses on re-identification attacks against victims.

While there are many other privacy and security attack vectors, we assume that there is no other privacy or security leakage (e.g., due to insecure data transmission). The attacker is able to use background knowledge for the re-identification. Detailed background knowledge about an individual makes the attacker very powerful. We assume that an attacker knows the usual daily schedule of a victim. This knowledge, in combination with a heart rate curve, can lead to re-identification.

While this model is strongly linked to our use case, attackers that use background knowledge are rather common, so the model can be transferred to other use cases.
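As an illustration only, the idea of combining schedule knowledge with a heart rate curve can be sketched in a few lines of Python. This is not the attacker implementation evaluated in the paper: the function names (`schedule_profile`, `reidentify`), the encoding of the schedule as +1/-1/0 per hour, the use of Pearson correlation as the matching score, and the toy heart rate values are all assumptions made for this sketch.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def schedule_profile(active_hours, sleep_hours):
    """Hypothetical background knowledge, one value per hour of the day:
    +1 when the victim is known to be active, -1 when asleep, 0 otherwise."""
    return [1 if h in active_hours else -1 if h in sleep_hours else 0
            for h in range(24)]


def reidentify(profile, hourly_curves):
    """Return the record id whose hourly heart rate curve correlates best with the profile."""
    return max(hourly_curves, key=lambda rid: pearson(profile, hourly_curves[rid]))


# Toy hourly mean heart rates (made-up values, not taken from the data set).
# Day worker: asleep 23:00-07:00, workout at 17:00.
day_worker = [55] * 7 + [75] * 10 + [110] + [75] * 5 + [55]
# Night worker: working 00:00-08:00, asleep 08:00-16:00.
night_worker = [80] * 8 + [55] * 8 + [75] * 8

# The attacker knows the victim works out at 17:00 and sleeps from 23:00 to 07:00.
profile = schedule_profile(active_hours={17}, sleep_hours={23, 0, 1, 2, 3, 4, 5, 6})
curves = {"record_a": night_worker, "record_b": day_worker}

print(reidentify(profile, curves))
```

In this toy setting, the schedule correlates positively with the day worker's curve and negatively with the night worker's curve, which is why the attacker's background knowledge about the individual schedule matters more than any general assumption about day and night.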

 

6) We added a subsection on limitations to the discussion chapter (7.2), where we discuss the points also mentioned in answer 1).

 

7) At the end of the related work section we wrote that there is no similar study doing the same evaluation as we did.

However, we saw similar studies that came to the same conclusion that the use of private synthetic data is promising but needs to be evaluated depending on the use case.

This is also what the paper by Stadler et al. "A Privacy Mirage" concludes, which we also discuss in related work.

Another paper that suggests the same is "Synthetic data in machine learning for medicine and healthcare" (https://www.nature.com/articles/s41551-021-00751-8), where the authors say: "In addition to creating regulatory standards for synthetic-data quality, regulations and evaluation metrics should also be developed for models that assess not only realism but also failure modes, such as information leakage".

We added this paper to related work.

 

If you have additional feedback or questions, we look forward to making further improvements or answering your questions.

 

Best regards

Round 2

Reviewer 2 Report

The authors addressed my comments.
