Defending against Reconstruction Attacks through Differentially Private Federated Learning for Classification of Heterogeneous Chest X-ray Data

Privacy regulations and the physical distribution of heterogeneous data are often primary concerns for the development of deep learning models in a medical context. This paper evaluates the feasibility of differentially private federated learning for chest X-ray classification as a defense against data privacy attacks. To the best of our knowledge, we are the first to directly compare the impact of differentially private training on two different neural network architectures, DenseNet121 and ResNet50. Extending the federated learning environments previously analyzed in terms of privacy, we simulated a heterogeneous and imbalanced federated setting by distributing images from the public CheXpert and Mendeley chest X-ray datasets unevenly among 36 clients. Both non-private baseline models achieved an area under the receiver operating characteristic curve (AUC) of 0.94 on the binary classification task of detecting the presence of a medical finding. We demonstrate that both model architectures are vulnerable to privacy violation by applying image reconstruction attacks to local model updates from individual clients. The attack was particularly successful during later training stages. To mitigate the risk of a privacy breach, we integrated Rényi differential privacy with a Gaussian noise mechanism into local model training. We evaluate model performance and attack vulnerability for privacy budgets ε∈{1,3,6,10}. The DenseNet121 achieved the best utility-privacy trade-off with an AUC of 0.94 for ε=6. Model performance deteriorated slightly for individual clients compared to the non-private baseline. The ResNet50 only reached an AUC of 0.76 in the same privacy setting. Its performance was inferior to that of the DenseNet121 for all considered privacy constraints, suggesting that the DenseNet121 architecture is more robust to differentially private training.


Introduction
The development of machine learning models for medical use cases often requires collecting large amounts of sensitive patient data. Medical datasets are usually scattered across multiple sites and subject to rigorous privacy constraints of both an ethical and a regulatory nature [1]. The effectiveness of anonymization in enabling data sharing depends on the type of data and cannot always prevent re-identification [2]. Federated learning has been gaining attention as a method for training machine learning models on distributed data in a privacy-preserving manner. In a federated learning setting, holders of sensitive data can make their data available for machine learning without sharing it with other parties. Over several iterations, a central server distributes an initial model to multiple clients holding the data, e.g., medical institutions, which individually train their models and return them to the server for aggregation. Federated learning provides a basic level of privacy through the principle of data minimization, i.e., data collection and processing are restricted to the necessary minimum. However, it cannot by itself formally guarantee privacy [3]. It has been shown that input data can successfully be reconstructed from model gradients [4,5]. In addition to the threat of data reconstruction, attacks disclosing the presence of a specific data sample or property in the training data pose a serious privacy risk for individual contributors [6].
Measures to prevent privacy breaches of machine learning models are subject to ongoing research. Differential privacy is a concept actively explored in this field. Intuitively, the goal of differential privacy is to limit the impact of a single data sample or a subset of the data on the outcome of a function computed on the data, thereby providing a guarantee that no or little information can be inferred about individual samples [6]. However, the application of differential privacy is known to decrease the utility of the machine learning model, characterized by a use-case-specific trade-off between utility and privacy [3,7–9]. Despite the potential of differentially private federated learning in healthcare, research interest has increased only recently and remains limited to selected use cases.
Vast amounts of medical image data are currently produced in daily medical practice. Chest X-rays play an essential role in diagnosing a variety of diseases, such as pneumonia [10], as well as recently in studying COVID-19 [11]. Automatic diagnosis assistance may substantially support the work of radiologists, which is particularly of interest in the face of ongoing medical specialist shortages [12]. Digital support systems may also mitigate the impact of error sources in human assessment that occur systematically, e.g., due to increased workload and varying professional experience [13]. Increased costs for the healthcare system and potentially fatal misdiagnoses can thereby be avoided.
We evaluate the potential of privacy-preserving federated learning for the use case of disease classification on chest X-ray images. As a key contribution, we directly compare two popular image classification model architectures, DenseNet121 and ResNet50, in terms of the effects of differentially private training on model performance and privacy preservation. Extending previous work, we introduce a federated environment that is subject to data heterogeneity and imbalance. We demonstrate that the basic federated learning setting is vulnerable to privacy violation through the successful application of reconstruction attacks. We specifically compare the vulnerability to privacy breach and the effect of differential privacy on a previously unconsidered complex model, DenseNet121, with the previously studied ResNet architecture. Our results endorse the conjecture that reconstruction attacks pose a realistic threat within the federated learning paradigm, even for large and complex model architectures. We integrate Rényi differential privacy into the federated learning process and investigate how it affects the utility-privacy trade-off for our use case. Two measures of privacy are addressed: the privacy budget ε as part of the formal differential privacy guarantee, and the susceptibility of the local models to reconstruction attacks. Our results suggest that the DenseNet121 is a promising architecture for feasible privacy-preserving model training on X-ray images. This novel insight may direct future research and applications in that area. This paper is structured as follows: In Section 2, we briefly present previous work related to privacy-preserving federated learning for the task of X-ray classification. We introduce the datasets used in Section 3.1 and explain our federated learning setup in Section 3.2.
We provide background information on the Deep Leakage from Gradients (DLG) attack (Section 3.3) and on the integration of differential privacy into the training of neural networks (Section 3.4). We present the results on model performance in a basic federated learning setting (Section 4.1), demonstrate the susceptibility of our federated learning models to reconstruction attacks (Section 4.2), and finally evaluate the impact of differential privacy on model performance and attack vulnerability (Section 4.3). We discuss and summarize our findings in Sections 5 and 6.

Related Work
The healthcare sector especially profits from privacy-preserving machine learning due to the natural sensitivity of the underlying patient data [1,14,15]. A wide range of applications demonstrate that federated learning is a potential fit for leveraging diverse types of medical data, including electronic health records [16], genomic data [17], and time-series data from wearables [18]. Examples related to medical image classification include brain tumor segmentation [9,19], classification and survival prediction on whole slide images in pathology [20], classification of functional magnetic resonance images (fMRI) [21], and breast density classification from mammographic images [22]. One large research area is concerned with the classification of chest X-ray images. Çallı et al. [23] provide an overview of recent deep learning advances in this field but do not consider federated learning. The feasibility of federated learning on chest X-rays has previously been benchmarked for both the CheXpert [24] and the Mendeley [25] datasets. Chakravarty et al. [26] enhance a ResNet18 architecture with a graph neural network for federated learning on CheXpert data with site-specific data distributions. Nath et al. [27] deploy a DenseNet121 model for a real-world, physically distributed implementation of federated learning on CheXpert. Banerjee et al. [28] determine the ResNet18 architecture to be superior for federated learning on the Mendeley data in comparison with ResNet50, DenseNet121, and MobileNet. Table 1 summarizes related work on chest X-ray classification with DenseNet or ResNet architectures.

Table 1. Overview of related works evaluating deep neural networks on the CheXpert or Mendeley datasets using DenseNet or ResNet architectures. The mentioned models are not necessarily exhaustive; some papers evaluate additional ResNet and DenseNet architectures. We also include our paper at the bottom for comparison with related work.
Note: (non-)IID corresponds to a (non-)independent and identically distributed data distribution. DP corresponds to the use of differential privacy (ε = 6).

Surveys on current developments in the field of privacy-preserving machine learning and federated learning describe potential threat models and privacy attacks [3,31,32]. Zhu et al. [5] originally proposed the Deep Leakage from Gradients (DLG) attack, which allows a malicious server instance to reconstruct complete data samples from received model gradients. Subsequent improvements to the idea include analytical label reconstruction [33], improved loss functions for gradient matching [4,34], and an extension towards larger batch sizes [35]. DLG and other privacy attacks have been identified as a severe threat to federated learning. Wei et al. [36] evaluate the impact of attack initialization, optimization method, and training parameters, including batch size, image resolution, and activation function, on the success of the DLG attack against a small network.

Differential Privacy in Federated Learning
Ensuring privacy and protecting against reconstruction in a practicable manner is not yet fully explored and remains an open problem for the federated learning paradigm [1,3,37,38]. A key method that finds wide use in federated learning research is differential privacy, first proposed by Dwork [39] in the context of database systems. Generally, it describes the addition of carefully crafted noise into a system to prevent learning too much about single data instances, while measuring the remaining risk. Mironov [40] introduced the variant of Rényi differential privacy, which defines a tighter bound on the privacy loss. Differential privacy in federated learning is often achieved using differentially-private stochastic gradient descent (DP-SGD) [7,41,42], an algorithm that clips per-sample gradients and determines the appropriate noise scale. The combination of federated learning and differential privacy has been explored in multiple medical use cases, including prediction of mortality and adverse drug reactions from electronic health records [43], brain tumor segmentation [9], classification of pathology whole slide images [20], detection of diabetic retinopathy in images of the retina [44], and identification of lung cancer in histopathologic images [45].
Most similarly to this work, Kaissis et al. [8] demonstrate a framework for the implementation and evaluation of privacy-preserving machine learning in a federated learning setting on the Mendeley dataset evenly distributed among three clients. They combine a ResNet18 model with a secure multi-party computation protocol and differential privacy and compare the success of reconstruction attacks on centralized and federated learning models. We extend their setting by considering larger networks, simulating a scenario with heterogeneous data unevenly distributed among a larger number of clients, and evaluating the impact of different parameters on model performance and the model's vulnerability to reconstruction attacks.

Materials and Methods
In this section, we first provide information about the used datasets. Then, we go over the three central pieces of our article: the federated learning baseline and our heterogeneous data distribution, the reconstruction attack, and finally the introduction of differential privacy as a defense against the attack.

Data
CheXpert [24] comprises 224,316 images of 65,240 adult patients in total, where 234 images are labeled by professional radiologists for use as a validation set. We only considered frontal view images as this accounts for the higher prevalence of frontal view images in the clinical setting and ensures compatibility with the Mendeley dataset. Each image is labeled with one or more of thirteen classes referring to a medically relevant finding, or with "No Finding". Following previous work [46,47], uncertain labels were considered as negative (U-Zeroes method).
The Mendeley chest X-ray dataset version 3 [25] contains 5856 images of pediatric patients and is split into original training and test sets with 5232 and 624 images, respectively. Each image is labeled as either "Normal", "Viral Pneumonia", or "Bacterial Pneumonia". For convenience, we assume that "Normal" in the Mendeley dataset corresponds to "No Finding" in the CheXpert dataset. To ensure compatibility between the dataset labels, our primary setting is a binary classification task based on the "No Finding" or "Normal" labels, indicating the presence or absence of a medically relevant condition.

Federated Learning
Successful training of a deep learning model usually relies on the availability of a single large, high-quality dataset, which requires prior data collection and curation, both potentially associated with great expense in time and resources. Despite such efforts, data transfer or direct access to the data often still cannot be granted due to patient privacy concerns. Federated learning enables model training on scattered data that remains at the participants' sites at all times [48].
A typical federated learning system consists of a central server that orchestrates the training procedure and several clients that communicate with the server (Figure 1). The server initializes a model and distributes the model parameters to its clients. In parallel, a subset of clients trains the model individually on their data for a defined number of epochs, which is equivalent to local stochastic gradient descent (SGD) optimization. The clients send their local models back to the server, where they are aggregated through federated averaging [48]. The new global model is again distributed among the clients, and the process is repeated until convergence or until a defined number of communication rounds has been reached.

Figure 1. In the federated learning setup, the server first initializes a model and distributes the model parameters to its clients. Over several iterations, each client trains the model individually on its data for a defined number of local epochs, sends the parameters of its locally trained model back to the server for aggregation, and receives a global model aggregated from all trained local models.
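The aggregation step described above reduces to a dataset-size-weighted average of the clients' parameters, as in federated averaging [48]. A minimal sketch, with model parameters flattened into plain lists and all names our own:

```python
def fed_avg(client_params, client_sizes):
    """Federated averaging: weight each client's parameter vector by its
    local dataset size and normalize by the total number of samples."""
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(params[i] * n for params, n in zip(client_params, client_sizes)) / total
        for i in range(dim)
    ]

# Two clients holding 1 and 3 samples: the larger client dominates the average.
global_params = fed_avg([[1.0, 2.0], [3.0, 4.0]], [1, 3])  # → [2.5, 3.5]
```

In a real system each entry would be a full weight tensor rather than a scalar, but the weighting logic is identical.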

Experimental Setup
In real-world use cases, datasets are expected to vary between clients in a federated learning setting. We reflect this in our simulation of a federated environment by combining two public X-ray datasets representing heterogeneous target populations: adult and pediatric patients. Our federated learning setup comprises 36 clients that each hold a subset of X-ray images from either the CheXpert or the Mendeley dataset. The clients represent hospitals or other medical institutions that provide their collected X-rays for the development of a classification model. The sizes of the clients' datasets were chosen to create a highly imbalanced setting that includes clients with very few data points, representing small institutions that make their limited amount of data available as soon as it is collected.
We simulated five clients with large subsets of the original CheXpert training set and thirty-one clients with small subsets of either the original CheXpert validation set or the original Mendeley training set. We randomly split the patients whose images are part of the original CheXpert training set into five equal parts and assigned each part randomly to one of the five large clients. Table 2 shows the distribution of the CheXpert validation data and Mendeley data among the remaining 31 clients, split into training, validation, and test set sizes. These clients are used as targets for the reconstruction attacks in Sections 4.2 and 4.3.3. Each client's dataset was further split into dedicated training, validation, and test sets, comprising 70%, 15%, and 15% of the client's data, respectively. Clients' datasets smaller than 50 images were split equally among the subsets. Datasets comprising fewer than ten images were used solely for training, omitting local validation and testing. All splits were performed randomly, and no specific label distribution was enforced. We ensured that there was no patient overlap between clients or between the training, validation, and test splits within each client's dataset.
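The split rules above can be made concrete. A sketch, assuming the 70/15/15 fractions are rounded to whole images and the equal split for datasets under 50 images is an integer three-way division (both rounding choices are our assumption):

```python
def split_sizes(n_images):
    """Per-client split sizes following the rules described above:
    fewer than 10 images -> training only; fewer than 50 -> equal thirds;
    otherwise 70% / 15% / 15% for train / validation / test."""
    if n_images < 10:
        return n_images, 0, 0
    if n_images < 50:
        third = n_images // 3
        return n_images - 2 * third, third, third
    train = round(0.70 * n_images)
    val = round(0.15 * n_images)
    return train, val, n_images - train - val

# Examples: a tiny client, a mid-size client, and a client with 100 images.
print(split_sizes(5), split_sizes(30), split_sizes(100))
```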

Model Training
We compared a densely connected network (DenseNet) [49] and a residual network (ResNet) [50] because both architectures have proven especially successful for the task of X-ray image classification [30,51]. We monitored the local models' performance during training on their clients' validation sets. The global, aggregated model was evaluated using the average performance over the clients' validation sets. The clients' test sets were held back for an unbiased, internal evaluation of the final global model after training had finished. As the performance metric, we used the area under the receiver operating characteristic curve (AUC).
Both the DenseNet121 and ResNet50 models were initialized with parameters pretrained on ImageNet data. A fully connected layer with sigmoid activation and the adjusted number of output neurons replaced the original final classification layer. We modified the model architectures to accept one-channel grayscale instead of three-channel RGB inputs to reduce unnecessary model complexity. To still leverage the pretrained model parameters, we summed the three-channel parameters of the first model layer to obtain new weights for the one-channel input. Images were resized to 224 × 224 pixels and normalized with ImageNet parameters adapted to grayscale color encoding by averaging over the input channels, yielding the normalization parameters µ = 0.449 and σ = 0.226. We did not apply any data augmentation methods.
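The grayscale adaptation admits a short sketch: the normalization constants follow from averaging the standard ImageNet per-channel statistics, and a pretrained RGB first-layer kernel collapses to one channel by summing over the channel axis. Nested lists stand in for the actual weight tensors here:

```python
# Average the standard ImageNet per-channel statistics to obtain
# single-channel normalization parameters.
imagenet_mean = [0.485, 0.456, 0.406]
imagenet_std = [0.229, 0.224, 0.225]
mu = round(sum(imagenet_mean) / 3, 3)    # 0.449
sigma = round(sum(imagenet_std) / 3, 3)  # 0.226

def collapse_rgb_kernel(kernel_rgb):
    """Sum a [3][k][k] pretrained first-layer kernel over the RGB channel
    axis, yielding a [1][k][k] kernel for one-channel grayscale input."""
    k = len(kernel_rgb[0])
    return [[[sum(kernel_rgb[c][i][j] for c in range(3)) for j in range(k)]
             for i in range(k)]]
```

Summing (rather than averaging) the channel weights preserves the magnitude of the pretrained first-layer response for a grayscale image replicated across three channels.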
Because multiple training rounds on the same dataset increase the risk of privacy leakage, we did not perform hyperparameter tuning and settled on standard hyperparameters. Training ran for at most 20 communication rounds, and each client participated in every round. We set the local batch size to ten and adapted it accordingly for clients with fewer than ten data points. To avoid overfitting, clients performed a single local epoch [2,19]. Early stopping was applied if the AUC value of the global model did not improve for five consecutive rounds. We minimized the binary cross-entropy loss using SGD with an initial learning rate of 0.01. The learning rate was reduced by a factor of 0.1 upon reaching a performance plateau, i.e., after the AUC of the global model had not improved for three consecutive rounds. The global model with the highest mean AUC across all clients was selected as the best final model.
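The scheduling rules above (reduce the learning rate tenfold after three stale rounds, stop after five) can be collected in a small controller. A sketch with hypothetical class and method names of our own choosing:

```python
class PlateauController:
    """Tracks the global model's AUC across communication rounds: reduces
    the learning rate tenfold after lr_patience rounds without improvement
    and signals early stopping after stop_patience stale rounds."""

    def __init__(self, lr=0.01, lr_patience=3, stop_patience=5, factor=0.1):
        self.lr = lr
        self.lr_patience = lr_patience
        self.stop_patience = stop_patience
        self.factor = factor
        self.best_auc = float("-inf")
        self.stale_rounds = 0

    def update(self, auc):
        """Record this round's global AUC; returns True when training should stop."""
        if auc > self.best_auc:
            self.best_auc = auc
            self.stale_rounds = 0
        else:
            self.stale_rounds += 1
            if self.stale_rounds == self.lr_patience:
                self.lr *= self.factor
        return self.stale_rounds >= self.stop_patience
```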
In private training, the privacy loss is difficult to track for some layer types. This includes active batch normalization layers, which are part of both the DenseNet and ResNet architectures, as they create arbitrary dependencies between samples within a single batch [52]. We refer to rendering model layers untrainable as layer freezing and experimented with different freezing techniques to avoid the intractable privacy loss that results from training batch normalization layers. We considered full model training (no layer freezing), freezing batch normalization layers, and freezing all layers but the final classification layer.

Reconstruction Attack
Federated learning enables model training on distributed data without direct data sharing. However, while it satisfies the principle of data minimization by eliminating the need for data transfer, it is not by itself sufficiently privacy-preserving. Sensitive information about the training data can be inferred from shared models, which has been demonstrated in a variety of privacy attacks, including inference of class representatives [53], property inference [54], membership inference [55], and sample reconstruction [5].
We assume the server to be an honest-but-curious adversary with full knowledge of the federated as well as local training procedures [3]. It correctly orchestrates and executes the required computations. However, it has white-box access to shared model parameters and can passively investigate them without interfering with the training process.
Reconstruction attacks aim to recover data samples from trained model parameters. A disclosure implies a serious privacy risk, as X-ray images may reveal information about the patient's identity [56] and sensitive properties, such as patient age [57]. Since reconstruction attacks can be conducted passively and with little auxiliary information, they constitute a relevant vulnerability within our threat model.
The DLG attack enables pixel-wise reconstruction of training images from the model gradients obtained during SGD [5]. The attack comprises the following steps:

1. Randomly initialize a dummy input x′ and a dummy label y′.
2. Fit the given initial model with the dummy data and obtain dummy gradients ∇θ′.
3. Quantify the difference between the dummy gradients ∇θ′ and the original gradients ∇θ using the Euclidean (ℓ2) distance as the cost function: D = ‖∇θ′ − ∇θ‖².
4. Iteratively minimize the distance between the dummy and original gradients by adjusting the dummy input and label using the objective x′*, y′* = arg min x′,y′ ‖∇θ′ − ∇θ‖².
5. End the optimization when the loss is sufficiently small, indicating complete reconstruction of the input data, or when a maximum number of iterations has been reached.
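To make the loop concrete, the following toy sketch runs steps 1-5 against a two-parameter linear model with squared loss, with the label assumed known. The setup is purely illustrative and far simpler than the paper's DenseNet/ResNet configuration; all names are our own:

```python
# Toy sketch of the DLG loop: a linear model f(x) = w.x with squared loss.
W = [0.5, -0.3]                    # model parameters known to the server
X_TRUE, Y_TRUE = [1.0, 2.0], 1.0   # the client's private training sample

def grads(x, y):
    """Gradient of the loss (w.x - y)^2 with respect to w."""
    residual = sum(wi * xi for wi, xi in zip(W, x)) - y
    return [2.0 * residual * xi for xi in x]

def matching_cost(x_dummy, g_true, y):
    """Step 3: squared Euclidean distance between dummy and original gradients."""
    g = grads(x_dummy, y)
    return sum((a - b) ** 2 for a, b in zip(g, g_true))

g_true = grads(X_TRUE, Y_TRUE)              # gradients observed by the server
x_dummy, lr, eps = [0.1, 0.1], 0.05, 1e-6   # step 1: dummy initialization
initial_cost = matching_cost(x_dummy, g_true, Y_TRUE)
for _ in range(5000):                       # step 4: minimize by gradient descent
    base = matching_cost(x_dummy, g_true, Y_TRUE)
    g_num = []
    for i in range(len(x_dummy)):           # numerical gradient of the cost
        shifted = list(x_dummy)
        shifted[i] += eps
        g_num.append((matching_cost(shifted, g_true, Y_TRUE) - base) / eps)
    candidate = [xi - lr * gi for xi, gi in zip(x_dummy, g_num)]
    if matching_cost(candidate, g_true, Y_TRUE) < base:
        x_dummy = candidate
    else:
        lr *= 0.5                           # backtrack when a step overshoots
final_cost = matching_cost(x_dummy, g_true, Y_TRUE)
# step 5: with the cost near zero, x_dummy approximates the private sample
```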
Following subsequent work, we used an improved version of the attack. We assume that labels can be reconstructed analytically [33] and restrict the optimization to the image data. We used a loss function based on the cosine similarity between the original and dummy gradients and the Adam optimizer, as proposed by Geiping et al. [4]. The cosine similarity loss is defined as

L(x′) = 1 − ⟨∇θ, ∇θ′⟩ / (‖∇θ‖₂ ‖∇θ′‖₂) + α · TV(x′),

where TV(x′) is the total variation of the dummy image x′, weighted by a small factor α acting as an image prior. The loss is minimized based on the sign of its gradient.
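The two loss components can be written out directly. A sketch with gradients flattened into plain vectors; the anisotropic form of the total variation is our assumption:

```python
import math

def total_variation(img):
    """Anisotropic total variation of a 2-D image: the sum of absolute
    differences between horizontally and vertically adjacent pixels."""
    h = sum(abs(row[j + 1] - row[j]) for row in img for j in range(len(row) - 1))
    v = sum(abs(img[i + 1][j] - img[i][j])
            for i in range(len(img) - 1) for j in range(len(img[0])))
    return h + v

def cosine_loss(g_dummy, g_true, dummy_img, alpha=0.01):
    """1 - cosine similarity between flattened gradient vectors,
    plus the total-variation image prior weighted by alpha."""
    dot = sum(a * b for a, b in zip(g_dummy, g_true))
    norms = (math.sqrt(sum(a * a for a in g_dummy))
             * math.sqrt(sum(b * b for b in g_true)))
    return 1.0 - dot / norms + alpha * total_variation(dummy_img)
```

Perfectly aligned gradients and a flat dummy image yield a loss of zero; orthogonal gradients yield a loss of one plus the prior term.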
To evaluate the vulnerability of the local models to the DLG attack in our federated learning setting, we simulated an adversarial server that applies the reconstruction attack to model updates received from individual clients. We chose an arbitrary client holding one image from the Mendeley dataset to evaluate the impact of model layer freezing and attack time on image reconstruction quality. We then attacked other clients holding up to ten training images to demonstrate that they are also susceptible to a privacy breach. We conducted three trials per attack, initializing dummy images from a random normal distribution. We determined the best result as the trial with the lowest cosine similarity loss. The initial learning rate of the Adam optimizer for the attack was 0.1. We adopted the strategy from Geiping et al. [4] and reduced the learning rate by a factor of 0.1 after 3/8th, 5/8th, and 7/8th of the maximum number of iterations. Each trial ran for 20,000 optimization steps. The total variation factor α for the cosine similarity loss was 0.01. We inferred the model gradients by computing the absolute difference between the original model parameters and the local model parameters after local training.
For quantitative evaluation of attack success, we used the peak signal-to-noise ratio (PSNR), measured in decibels (dB):

PSNR = 10 · log₁₀(MAX_I² / MSE),

where MAX_I is the difference between the minimum and the maximum possible pixel values, and MSE is the mean squared error between the two images.

In addition to quantifying attack success with the PSNR measure, we demonstrate to what degree sensitive patient information can be derived from reconstructed X-ray images. Even if an individual cannot always be identified directly from a particular image, statistical knowledge about demographic information and other sensitive properties in a given dataset may lead to unwanted conclusions about individuals. We compare the performance of auxiliary models that predict demographic patient information from original and reconstructed X-rays. Because no demographic patient information is available for the Mendeley data, we focus our evaluation on clients holding parts of the CheXpert dataset. We centrally trained two auxiliary ResNet50 models on the original CheXpert training data to predict patient sex and age. Sex was encoded as a binary category, and the corresponding loss function for model training was binary cross-entropy. The loss function for age prediction was the MSE in years between the true and the predicted age, and the sigmoid activation function in the age prediction model was replaced by a rectified linear unit (ReLU). Validation was carried out on a dedicated part of the CheXpert training data. The models achieved a validation AUC of 0.97 on sex classification and a mean absolute error (MAE) of 6.0 years on age prediction. We applied the auxiliary models to reconstructed images from clients that hold subsets of the original CheXpert validation data, thus ensuring that the models were only used for inference on images on which they had not been trained.
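The PSNR computation is a few lines over flattened images. A minimal sketch, with MAX_I defaulting to a pixel range of 1.0:

```python
import math

def psnr(original, reconstructed, max_i=1.0):
    """Peak signal-to-noise ratio in dB between two flattened images;
    max_i is the span between minimum and maximum possible pixel values."""
    mse = sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)
    if mse == 0.0:
        return float("inf")  # identical images: perfect reconstruction
    return 10.0 * math.log10(max_i ** 2 / mse)
```

Higher values indicate a closer reconstruction; identical images yield an infinite PSNR.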

Differential Privacy
Dwork [39] originally proposed the notion of differential privacy in the context of database systems. Differential privacy guarantees that the amount of information revealed about any individual record during a query remains unchanged regardless of whether the record is included in the database at the time of the query. Put differently, the probability of receiving a specific output from a query on a database should be almost the same whether or not an individual record is part of the database. Here, almost means that the probabilities do not differ by more than a specific factor, which is captured by the privacy budget or privacy loss ε. In the context of machine learning, we regard model training as a function of a dataset, equivalent to a query that runs on a database. Intuitively, differential privacy applied to machine learning means that training a model on a dataset should likely result in the same model that would be obtained when removing a single sample from the dataset.
The formal definition of (ε, δ)-differential privacy is as follows:

Pr[M(x) ∈ S] ≤ e^ε · Pr[M(y) ∈ S] + δ,

where M is the (randomized) query or function, x and y are parallel databases that differ in at most one entry, S ⊆ Range(M), and δ is a small term relaxing the guarantee, usually interpreted as the probability that it fails. A randomized mechanism M can be obtained by adding noise to the original function, drawn from a statistical random distribution, e.g., the Laplacian or the Gaussian distribution. The amount of noise necessary to achieve (ε, δ)-differential privacy is scaled to the ℓ2-sensitivity of the function, which is the maximum distance between the outputs of the function run on two parallel databases. Differential privacy has two qualities important to its application in machine learning. The output of a differentially private random mechanism remains differentially private under the application of any data-independent function (closure under post-processing). The privacy loss can be analyzed cumulatively over several applications of a mechanism on the same database (composability). We use the variant of Rényi differential privacy, based on the Rényi divergence, in combination with a Gaussian noise mechanism, which allows for a tighter estimate of the privacy loss over composite mechanisms than (ε, δ)-differential privacy [40]. Differentially-private stochastic gradient descent (DP-SGD) is commonly deployed for integrating differential privacy into model training [7]. DP-SGD adds two main steps to the SGD algorithm:

1. Bounding the function's sensitivity by clipping the per-sample gradient ℓ2-norms to a clipping value C.
2. Adding Gaussian noise to the gradients, scaled to the sensitivity enforced by Step 1.
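The two steps can be sketched for a single batch, with gradients as plain vectors. The mapping from the noise multiplier to a Rényi guarantee is left to a privacy accountant, and all names here are our own:

```python
import math
import random

def dp_sgd_step(per_sample_grads, clip_c, noise_multiplier, rng=None):
    """One differentially private gradient aggregation:
    (1) clip each per-sample gradient to an l2-norm of at most clip_c,
    (2) average and add Gaussian noise scaled to the enforced sensitivity."""
    rng = rng or random.Random(0)
    clipped = []
    for g in per_sample_grads:
        norm = math.sqrt(sum(v * v for v in g))
        scale = min(1.0, clip_c / norm) if norm > 0 else 1.0
        clipped.append([v * scale for v in g])
    n = len(clipped)
    avg = [sum(g[i] for g in clipped) / n for i in range(len(clipped[0]))]
    sigma = noise_multiplier * clip_c / n
    return [a + rng.gauss(0.0, sigma) for a in avg]

# With the noise multiplier at zero, only the clipping step acts:
# a gradient of norm 5 is scaled down to the clipping value of 1.
noisy_grad = dp_sgd_step([[3.0, 4.0]], clip_c=1.0, noise_multiplier=0.0)  # ≈ [0.6, 0.8]
```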
We applied DP-SGD locally during training at the clients' sites. Private training was limited to at most ten communication rounds. A privacy accountant tracked the ε-guarantees for a specified list of orders α of the Rényi divergence over communication rounds. This yields the optimal (α, ε)-pair at the end, where ε is the lowest bound on the privacy loss in combination with the respective α. Because the differentially private mechanism is closed under post-processing, aggregation of private model parameters yields a private global model that does not incur a larger privacy loss on individual clients' data than that upper bounded by local DP-SGD.
To investigate the relationship between privacy and model performance for our use case, we limited the privacy budget to ε ∈ {1, 3, 6, 10}. We tracked α values in [1.1, 10.9] in steps of 0.1 and in [12, 63] in steps of 1.
If δ is equal to or greater than the inverse of the dataset size, it would allow for leakage of a whole record or data sample without violating the privacy constraint [58]. As this is unacceptable, δ should be smaller than the inverse of the dataset size [8]. Because the size of each client's dataset varies, we determined δ per client as

δ_k = 1 / |x_k|,

where |x_k| is the number of data samples in the training dataset of client k. Because some clients' datasets are very small, leading to high probabilities of the privacy guarantee being violated, we additionally capped δ at a maximum value of 10⁻².
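A sketch of the per-client δ computation, assuming the 10⁻² bound acts as a cap so that δ never exceeds it:

```python
def client_delta(n_train_samples, delta_cap=1e-2):
    """Per-client delta: the inverse of the training-set size, capped so
    that very small datasets do not yield an unusably weak guarantee."""
    return min(1.0 / n_train_samples, delta_cap)

# A client with 1000 training images gets delta = 1e-3;
# a client with only 5 images is capped at 1e-2.
```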
We bounded the sensitivity of the training function by clipping per-sample gradients. An effective bound is a compromise between excessive clipping, which leads to biased aggregated gradient estimates that do not adequately represent the underlying true gradients, and a loose clipping bound that forces the addition of an exaggerated amount of noise. We employed global norm clipping, i.e., gradients were clipped uniformly over the course of training. Abadi et al. propose using the median of the unclipped gradient ℓ2-norms as the clipping value C [7]. We could not obtain unclipped gradient norms directly because we considered the clients' datasets unavailable for non-private training. As a solution, we ran a few epochs of non-private, centralized training on an auxiliary chest X-ray dataset with the same training parameters as in our federated learning scenario [8]. We used the original Mendeley test set, which was not part of any client's dataset. We randomly picked 5% of the dataset as validation and test sets to validate the training procedure. Because centralized training on the Mendeley test set converged quickly, we tracked the medians over the first three epochs. We obtained median gradient norms of 0.42 (DenseNet121) and 0.62 (ResNet50) for models with frozen batch normalization layers, and 1.24 (DenseNet121) and 0.72 (ResNet50) for models with all layers frozen except the final layer.

Implementation

Results
This section follows a structure similar to that of the previous one. We first show the results of the federated learning baseline. Then, we go over the evaluation of the reconstruction attack on the system, where we analyze different factors that impact attack success. Finally, we assess the impact of differential privacy, first on the federated learning effectiveness, and then on the reconstruction attack.

Federated Learning Baseline
We trained both models on the binary classification of the "No Finding" label. The best global models achieved an AUC value of 0.935 (DenseNet121) and 0.938 (ResNet50). The average AUC was larger on clients holding Mendeley data (0.96 for DenseNet121 and 0.95 for ResNet50) compared to clients with CheXpert data (0.85 for DenseNet121 and 0.87 for ResNet50). These results confirm the ability of deep learning models to reach a high classification performance on the Mendeley dataset [8,28], even when trained in a heterogeneous setting.
Given the implications of different layer freezing techniques for privacy, we compared the outcomes of full model training (no layer freezing), freezing batch normalization layers, and freezing all layers but the final classification layer (Table 3).
For both models, the performance after full model training and training with frozen batch normalization layers was similar with a maximum difference in AUC of 0.022 on the test sets. We conclude that freezing batch normalization layers did not impede model training in our setting. In contrast, rendering all layers untrainable except for the final layer significantly decreased performance. This confirms outcomes from previous work where this transfer learning technique was found to be inferior to including more layers in training updates [51]. Table 3. Mean AUCs of the best global DenseNet121 and ResNet50 models, evaluated on the clients' test sets. Batch norm. refers to freezing of batch normalization layers, All but last to freezing all parameters except for the final classification layer. Training with frozen batch normalization layers delivered similar results to full model training.

Reconstruction Attack
We attacked local models of arbitrary clients with varying layer freezing techniques, attack time points and batch sizes.

Impact of Layer Freezing
We applied the reconstruction attack to a single client's local model with a batch size of one during the first communication round. Table 4 reports the mean PSNR and sample standard deviation over three trials per experiment. Figure 2 shows the reconstructed images of the best attack trials. The attack was only successful in the case of batch normalization layer freezing, indicated by larger mean PSNR values of 12.29 (ResNet50) and 10.98 (DenseNet121). Training the full model as well as fine-tuning only the output layer prevented the recovery of any useful image features in this setting. We further observed that the DenseNet121 seems to be more robust to leakage from gradients in this example, although the ResNet50 is the larger architecture in terms of parameter count, containing more than three times as many trainable parameters as the DenseNet121. Table 4. Impact of layer freezing on the attack success during early training. We report the mean PSNR and sample standard deviation (STD) over all images obtained from three attack trials per setting. The batch size is kept constant at one. The attack was only successful on models with frozen batch normalization layers. The results highlight that shared model updates with partial layer freezing are practically relevant targets for privacy violation. Cases of attack failure, however, do not provide a formal privacy guarantee. Other factors such as the privacy-breaking properties of active batch normalization layers in the case of full model training need to be considered for a comprehensive assessment of model privacy.

Impact of Training Stage
We applied the attack during the initial communication round and after four rounds of training. We refer to the settings as an attack in early and late training stages, respectively. Figure 3 compares the images obtained from early and late attacks. The late attack was significantly more successful on both ResNet50 (mean PSNR 18.42 ± 5.25) and DenseNet121 (mean PSNR 11.7 ± 2.23). At the same time, the variation between trials was greater for the late attack. The observation that the attack was more successful as training progressed does not confirm previous evidence, which suggests that reconstruction is less successful from pretrained models [4] and during later training stages [8]. Attack success has been associated with the magnitude of the gradients' ℓ2-norms, which are usually largest at the beginning when the model starts training on previously unseen data [8]. In Figure 4, we investigate how the ℓ2-norms of our models' gradients changed as training progressed. We show the exemplary case of the pre-trained DenseNet121. Results were similar for the ResNet50, for which we refer to Appendix A. For each layer of every client's local model, we tracked the median ℓ2-norm during training. We display the per-layer mean values of all tracked medians over the local models. For both model architectures, the norms were greater during the first round of training than in the following iterations. Subsequent changes are more subtle and lack continuity. We validated that the attacked client's model did not constitute an exception to this behavior.
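The per-layer bookkeeping behind this analysis can be sketched as follows. This is a simplified illustration that operates on plain arrays rather than actual model layers; the helper name `per_layer_median_norms` and the dictionary-of-lists layout are our assumptions.

```python
import numpy as np

def per_layer_median_norms(grad_history):
    """Median l2-norm per layer over a training run.

    grad_history maps a layer name to the list of that layer's
    gradient arrays, one entry per tracked training step. Returns
    the median l2 norm for each layer, i.e., the statistic tracked
    per layer and per local model in our experiments.
    """
    return {
        layer: float(np.median([np.linalg.norm(g) for g in grads]))
        for layer, grads in grad_history.items()
    }
```

Averaging these medians over all clients' local models then yields the per-layer values displayed in the figure.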
Since our attacks were more successful during late training and we observed overall smaller gradient norms as training progressed, we could not associate larger gradient norms with increased attack success.

Impact of Batch Size
We investigated the impact of the training batch size in the setting where the attack was most successful, i.e., on models trained with frozen batch normalization layers attacked during late training. We attacked clients for which the considered batch size was equal to the available number of training images. The setting is equivalent to clients with larger datasets sharing model updates after every processed batch. A batch size of ten reduced attack success as the mean PSNR values over the batch decreased to 9.01 (ResNet50) and 8.15 (DenseNet121) (Table 5). While the quality of the reconstructions varied for individual images within a batch, at least one image out of each batch became recognizable. We note that the order of images in a batch may not be preserved in the reconstruction of larger batches, preventing a direct comparison between original and reconstructed data points. To assign a reconstructed image to its original for evaluation, we first obtained the PSNR of each original image with each reconstructed image. We then determined the first original-reconstruction pair as the one with the largest PSNR value. The next best pair was determined considering the PSNR values between the remaining original and reconstructed images. We iterated the procedure until all images had been assigned. Figure 5 shows the best-reconstructed images out of each batch, demonstrating that all considered batch sizes permit severe privacy breaches on individual data samples. Table 5. Impact of batch size on attack success during late training. We report the mean PSNR and sample standard deviation (STD) over all images obtained from three attack trials per setting. Models were trained with frozen batch normalization layers (cf. Figure 2). Attack success deteriorated with a batch size of ten, but not significantly with smaller batch sizes.
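The greedy PSNR-based assignment described above can be sketched as follows. This is an illustrative re-implementation under the assumption of images scaled to [0, 1]; the function names and the tuple-based bookkeeping are ours, not the paper's code.

```python
import numpy as np

def psnr(original, reconstruction, max_val=1.0):
    """Peak signal-to-noise ratio between two images in [0, max_val]."""
    mse = np.mean((original - reconstruction) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def greedy_match(originals, reconstructions):
    """Greedily pair reconstructions with originals by descending PSNR.

    The pair with the highest PSNR is fixed first, then the next best
    pair among the remaining images, and so on until every image has
    been assigned. Returns (original_index, reconstruction_index) pairs.
    """
    scores = [
        (psnr(o, r), i, j)
        for i, o in enumerate(originals)
        for j, r in enumerate(reconstructions)
    ]
    scores.sort(key=lambda t: t[0], reverse=True)
    used_o, used_r, pairs = set(), set(), []
    for _, i, j in scores:
        if i not in used_o and j not in used_r:
            pairs.append((i, j))
            used_o.add(i)
            used_r.add(j)
    return pairs
```

With two originals and two reconstructions in shuffled order, the procedure recovers the permutation by pairing each reconstruction with its closest original.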

Inference of Demographic Properties
Finally, to investigate the leakage of sensitive patient information from reconstructed images, we applied the attack to 15 clients holding CheXpert validation data subsets. We included five clients each, holding one, two, and four training images, yielding 35 images in total. The setting was the same as for the attacks on Mendeley clients. We attacked models trained with frozen batch normalization layers during late training. Then, we predicted the patients' age and sex from the original X-rays and from the reconstructed images using auxiliary models to demonstrate that the images leak sensitive information. Table 6 summarizes the auxiliary model predictions. The low baseline performance of the auxiliary models on original images compared to the classifier validation estimate is probably due to the small sample size of 35 images. Superior results on images reconstructed from the ResNet50 in the case of sex prediction suggest an increased susceptibility to privacy violation of this architecture compared to the DenseNet121. Table 6. Performance of the auxiliary models for predicting patient sex and age from X-ray images. We compare the classification/regression of original images, and images reconstructed from local ResNet50 and DenseNet121 models. All attacked clients provided 35 images in total. Metrics reported are AUC for sex prediction and the mean absolute error (MAE) in years for age regression.

Differentially Private Federated Learning
As a countermeasure to the reconstruction attack, we evaluate the introduction of local differential privacy into our training process. The following sections detail the implications that come with that added protection. Table 7 reports the models' performance with privacy budgets ε ∈ {1, 3, 6, 10}. We include the non-private baseline performance for comparison. Batch normalization layer parameters were not updated during model training. We report the exact privacy budget spent by each local model as optimal (α, ε)-pairs in Appendix C. We compare the utility-privacy trade-off between the two model architectures in Figure 6. The DenseNet121 performed better than ResNet50 for all considered privacy budgets. As expected, a stronger privacy guarantee claimed a higher cost in accuracy for both models. The degradation was more pronounced in the ResNet50 with an AUC difference of 0.24 between ε = 10 and ε = 1. The private DenseNet121 performed equally well compared to its non-private counterpart for both ε = 10 and ε = 6, suggesting that increasing the privacy budget beyond ε = 6 does not benefit model performance. For ε = 6, the DenseNet121 achieved an AUC of 0.937, the ResNet50 only 0.764. We expect that model evaluation in our setting with imbalanced data distribution tends to be unreliable on clients with less data. Incidental good results on those clients may bias the global model's performance estimate. To provide a more meaningful assessment of the model performance under privacy conditions, we investigated the performance of the best global private DenseNet121 model on individual clients compared to the best non-private model. We visualize the comparison for ε = 6 in Figure 7. We provide the figures for other considered ε-values in Appendix B. Private training demanded a systematic cost in performance for clients holding large amounts of CheXpert data.
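The noisy gradient aggregation underlying this protection can be sketched as follows. This is a minimal illustration of the Gaussian mechanism as used in DP-SGD-style training, not the authors' implementation (private training in practice relies on a DP library with Rényi accounting); the function and parameter names are ours.

```python
import numpy as np

def dp_sgd_step(per_sample_grads, clip_bound, noise_multiplier, rng):
    """One differentially private gradient aggregation step.

    Each per-sample gradient is clipped to `clip_bound` in l2 norm
    (bounding sensitivity), the clipped gradients are summed, Gaussian
    noise with standard deviation noise_multiplier * clip_bound is
    added, and the result is averaged over the batch.
    """
    clipped = []
    for g in per_sample_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_bound / max(norm, 1e-12)))
    total = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_bound, size=total.shape)
    return (total + noise) / len(per_sample_grads)
```

The noise multiplier is what the Rényi accountant translates into a spent (α, ε) budget over the course of local training; with a multiplier of zero the step reduces to ordinary clipped averaging.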
AUC values on those clients' datasets decreased by 0.03 (ε = 10 and ε = 6) and 0.06 (ε = 3) on average from non-private to private training.

Model Performance
We conclude that the impact of private training on model accuracy, also at moderate privacy budgets, needs to be carefully assessed on the client level. Further potential weak points of the resulting model, such as performance on underrepresented patient subgroups, require additional consideration.

Additional Training Techniques
We evaluate the effect of additional training techniques on model performance: training only the final layer (All but last layer freezing), client subsampling, and layer-wise gradient clipping. All experiments were carried out with a privacy budget of ε = 10. We found that none of the techniques introduced an advantage for private model training. When restricting training to the final layer, the performance of the DenseNet121 decreased significantly compared to Batch norm. freezing (AUC 0.707 vs. 0.925), while that of the ResNet50 remained similar (AUC 0.871 vs. 0.861).
In a separate experiment, we introduced a client subsampling procedure where the maximum number of global communication rounds was set to ten and the maximum number of rounds for which each client could be selected to five. The fraction of clients chosen each round was 0.3, resulting in eleven clients selected per round. This way, less of the available privacy budget was effectively spent during the clients' local training because each client participated in fewer training rounds in total. However, there was no improvement in model performance. The DenseNet121 and ResNet50 achieved AUC values of 0.836 and 0.819, respectively. A potential explanation is that clients with small datasets were selected frequently during subsampling, but could not contribute as effectively to the global model as clients with larger datasets. Model accuracy degraded more heavily after a few rounds during the subsampling experiment, indicating stronger local overfitting, which was amplified by the lower number of contributing clients.
Finally, instead of uniformly clipping the norm of each gradient value, we specified an individual clipping bound for each model layer. We utilized the per-layer median gradient norms from the auxiliary training experiment on the Mendeley test set (Section 3.4). The models' AUC values converged to 0.5, indicating that model training failed for our use case when employing layer-wise gradient clipping. The variation between individual clipping values may be too large, preventing the model parameters from retaining any information that is usable in combination with other layers' parameters.

Vulnerability to Reconstruction Attack
We attempted to reconstruct the training image from the local model shared by a Mendeley client during private training. We performed the attack on models with frozen batch normalization layers during late training. Table 8 compares the mean PSNR over three trials between non-private and private training. The PSNR on all images from private models was significantly smaller than in the non-private setting. Figure 8 confirms that the reconstructed images from both model architectures did not leak any visual parts of the training images. Differentially private training under all considered privacy budgets therefore successfully prevented the attack. To validate that no sensitive information was leaked, we applied the auxiliary models (first introduced in Section 4.2.4) to predict patient age and sex from images reconstructed from private models. Table 9 compares their performance on original and recovered images in private and non-private settings. We attacked the model with the weakest privacy guarantee of ε = 10. The AUC values of 0.49 and 0.47 on sex prediction indicate that the classifier's performance was equivalent to random label assignment in the private setting. The age predictions deviated by around 19 years on average from the patients' true age. Differentially private model training prevented both auxiliary models from predicting usable information about the patients' demographic properties. Table 9. Performance of the auxiliary models for predicting patient sex and age from X-ray images. We compare the predictions on original images, and images reconstructed from local ResNet50 and DenseNet121 models in the non-private and private setting with ε = 10. Images reconstructed from private models leaked no usable information about the selected properties.
We conclude that, in our federated learning setting, differential privacy is an effective countermeasure against sample reconstruction from gradients, and that no sensitive information could be inferred from the reconstructed images.

Discussion
In our federated learning setup, the effectiveness of model aggregation is limited by data heterogeneity and imbalance. The federated averaging algorithm weights the local model updates with respect to the clients' dataset size in relation to the overall amount of available data [48]. This led to a strong emphasis on model updates from clients with large CheXpert subsets in our case, while updates from clients with fewer images contributed less to model aggregation. One option to mitigate data imbalance is to aggregate models after a specified number of batches instead of local epochs [8,59]. However, sharing intermediate models more frequently will increase the susceptibility to reconstruction attacks since the updates are obtained on small batches rather than the client's whole dataset. Improving the aggregation process under consideration of privacy costs is left for future work.
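The dataset-size weighting of federated averaging discussed above can be sketched in a few lines. This is a minimal illustration with flat parameter vectors, assuming a single aggregation step; the helper name `federated_average` is ours.

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """Dataset-size-weighted model aggregation (FedAvg).

    Each client's parameter vector is weighted by its share of the
    total number of training samples. Clients with large datasets
    (e.g., large CheXpert subsets) therefore dominate the aggregate,
    while clients with few images contribute little.
    """
    total = sum(client_sizes)
    agg = np.zeros_like(client_weights[0], dtype=float)
    for w, n in zip(client_weights, client_sizes):
        agg += (n / total) * w
    return agg
```

With two clients holding one and three samples, the second client's parameters receive three quarters of the weight, which illustrates the imbalance effect described above.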
The applied attack has shown that the two considered deep machine learning models are susceptible to reconstruction of sensitive data from gradients. Most notably, and contrary to previous work, we found that the attack was more successful in later training stages and for pre-trained models. Our privacy evaluation framework is limited by the choice of the DLG attack as a qualitative measure for model vulnerability. Even though we found reconstruction not successful under certain conditions, including full model training, restricting training to the final layer, and attacking the DenseNet121 at an early training stage, it cannot be assumed that model training would be privacy-preserving in these cases. Minor modifications of the attack scheme may improve attack success even in supposedly safe settings. Moreover, reconstruction attacks are only one example among a range of deliberate privacy breaches that neural networks are vulnerable to. Extending our privacy evaluation framework to include other privacy threats, e.g., property inference without data reconstruction, will provide further insights into potential vulnerabilities of the federated learning paradigm. Since the main limitation of DLG is its restriction to small datasets, it will be particularly valuable to capture the consequences of privacy breaches for clients with large amounts of data. From a security perspective, demonstrating that these attacks are practically feasible, albeit under limited circumstances, is sufficient for considering the machine learning process vulnerable to privacy violation. Countermeasures must constantly be re-evaluated for their effectiveness as a better understanding of privacy threats evolves.
Our privacy evaluation was further constrained to a limited choice of privacy budgets. While choosing ε = 6 delivered the best utility-privacy trade-off for our use case, which is in line with previous work [8], it may not be the optimal lower bound. We specifically suggest empirically examining choices of ε in the range [3,6] to potentially improve upon our results in future work. We also note that while all considered privacy constraints prevented the success of reconstruction attacks, smaller ε values still formally provide stronger privacy guarantees that offer protection against threats beyond the limited case of the reconstruction attack.
A key implication of our results is that the DenseNet121 architecture proved more robust against private model training with regard to performance than the ResNet50. This observation is potentially related to the greater ability of the DenseNet121 to withstand reconstruction attacks. Although the model contains overall fewer parameters than the ResNet50, its dense structure may, to a certain degree, offer a natural defense against reconstruction from trained parameters as well as perturbation of parameter updates during private training. This outcome suggests substantial differences in the suitability of individual model types for privacy-preserving machine learning, which requires further validation.
In the medical context, fairness is crucial for the safe deployment of machine learning algorithms. Rare diseases or conditions must be reliably detected despite the restricted availability of representative data. Furthermore, a model should perform with equal accuracy on all patient subgroups. We uncovered performance differences between individual clients' data in our federated learning baseline, revealing that the classification produced better results on Mendeley than on CheXpert data. This potentially reflects the ability of deep learning models to recognize pneumonia as an abnormal finding particularly well since the pathologic X-rays in the Mendeley dataset only include cases of pneumonia. More thorough investigations are required to reveal other potential biases, e.g., with respect to patient subgroups. It is further known that underrepresented classes and population subgroups are potentially affected more strongly by model performance degradation when applying differential privacy [60]. Because model performance evaluation on Mendeley clients was less reliable due to smaller amounts of data, it remains an open question how exactly these clients were affected by the integration of differential privacy. For practical applications, it is mandatory to thoroughly investigate how privacy mechanisms affect the model's performance on different types of data to identify a potential underlying bias.
We did not consider the practical implementation of federated learning between different institutions with regard to communication time, required infrastructure, costs and validation of correct computational execution. The focus of our paper lies in analyzing the threat of data reconstruction and the effectiveness of differential privacy against it. While simulated use cases like ours are vital to prepare for leveraging differential privacy in real-world cases where sensitive data is involved, further case studies are required to investigate aspects of practicability for privacy-preserving federated learning on a large scale.

Conclusions and Future Work
We simulated a collaborative machine learning use case in which 36 institutions provide their diverse chest X-ray data collections for the development of a classification model. Two main concerns in this scenario are the physical separation of the data sources and the privacy of patients to whom the data belongs. We employed the paradigm of federated learning as a solution for machine learning on dispersed data. Throughout our experiments, we compared two large network architectures: DenseNet121 and ResNet50.
Extending previous evidence, we demonstrated that individual X-rays can be reconstructed from shared model updates within the federated learning setting from those networks using the DLG attack. It is especially successful during later training stages.
As a step towards privacy-preserving distributed learning, we integrated Rényi differential privacy with a Gaussian noise mechanism into the federated learning process. The DenseNet121 achieved the best utility-privacy trade-off with a mean AUC of 0.937 for ε = 6, where we identified an expected cost in accuracy of 0.03 in terms of the AUC on CheXpert clients' data compared to the non-private baseline. The results suggest that ε ∈ [3,6] are suitable candidates for private model training depending on the specific demands on model privacy and performance for the respective application. Overall, we found the DenseNet121 model superior to ResNet50 with regard to private model training for all considered ε values.
The adverse impact of differential privacy on model performance must be carefully considered, particularly for medical use cases. Our results endorse that differentially private federated learning is feasible at a small cost in model accuracy for the classification of heterogeneous chest X-ray data. As real-world medical use cases become more complex in practice, future work may elaborate on the potential of differentially private federated learning for multi-label X-ray classification where heterogeneous data from a broader range of sources is effectively integrated under consideration of an improved bound on the privacy budget. We identified the DenseNet121 as a robust model architecture suitable for differentially private training. Further comparison with other neural network architectures may reveal key indicators for the suitability of different model types and provide guidance in the choice of models for privacy-preserving machine learning. We further suggest to extend our evaluation framework in future work to consider the vulnerability to other types of privacy breaches, enabling a comprehensive qualitative assessment of model privacy. Finally, other variants of differential privacy, e.g., Gaussian differential privacy [61], may offer suitable alternatives to the application of Rényi differential privacy providing yet tighter bounds on the privacy loss.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

Appendix A. Gradient ℓ2-Norms
We investigated the models' gradients' ℓ2-norms during training to assess how they correlate with attack success. Figure A1 shows how the norms change in the ResNet50. The norms were greater during the first round of training than in the following iterations. In our experiments, image reconstruction was better on models from later rounds, suggesting that the magnitude of gradient ℓ2-norms is not a primary indicator for attack success.

Appendix B. Client-Level AUC of Private DenseNet121 Models
We investigated the performance of the best global private DenseNet121 model on individual clients compared to the best non-private model. We visualize the comparison for ε ∈ {3, 6, 10} in Figure A2. Because training was unsuccessful for ε = 1, we do not evaluate model performance for this case in detail. Private training demanded a systematic cost in performance for clients holding large amounts of CheXpert data for all considered privacy budgets. Because clients with Mendeley data hold fewer images, the results on individual test sets of those clients were subject to greater variation.