Article

Enhanced Privacy-Preserving Architecture for Fundus Disease Diagnosis with Federated Learning

1 Department of Computer Science and Technology, Kean University, Union, NJ 07083, USA
2 High Technology High School, Lincroft, NJ 07738, USA
3 Department of Electrical and Computer Engineering, Rutgers University, Piscataway, NJ 08854, USA
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(6), 3004; https://doi.org/10.3390/app15063004
Submission received: 18 January 2025 / Revised: 4 March 2025 / Accepted: 5 March 2025 / Published: 10 March 2025

Abstract
In recent years, advances in diagnosing and classifying diseases using machine learning (ML) have grown exponentially. However, due to the many privacy regulations regarding personal data, pooling together data from multiple sources and storing them in a single (centralized) location for traditional ML model training are often infeasible. Federated learning (FL), a collaborative learning paradigm, can sidestep this major pitfall by creating a global ML model that is trained by aggregating model weights from individual models that are separately trained on their own data silos, therefore avoiding most data privacy concerns. This study addresses the centralized data issue with FL by applying a novel DataWeightedFed architectural approach for effective fundus disease diagnosis from ophthalmic images. It includes a novel method for aggregating model weights by comparing the size of each model’s data and taking a dynamically weighted average of all the models’ weights. Experimental results showed an average loss of only 1.85% in accuracy when training using FL compared to centralized ML model systems, a nearly 92% improvement over the conventional 55% accuracy loss. The obtained results demonstrate that this study’s FL architecture can maximize both privacy preservation and accuracy for ML in fundus disease diagnosis and provide a secure, collaborative ML model training solution within the eye healthcare space.

1. Introduction

Machine learning (ML) has revolutionized various fields, particularly healthcare, where accurate disease diagnosis is critical. In recent years, ML models have become capable of diagnosing and classifying diseases as well as or better than healthcare experts [1]. For example, the use of VGG-Net models like VGG-16 and VGG-19 has become increasingly common, especially within the ocular disease/healthcare space due to the cost of devices required to classify these diseases [2]. However, many large models require large amounts of data to be effective. Because such data are scarce, training these large models is exceptionally difficult, which has led state-of-the-art models to be trained collaboratively [3]. Such collaborative training often relies on pooling large volumes of data in a single (centralized) location, a practice known as centralized learning (CL) [4].
However, in almost all commercial fields and industries, using CL is often impossible because of the many privacy regulations regarding personal data [5]. The Health Insurance Portability and Accountability Act (HIPAA) in the US enforces strict privacy and security rules that limit the sharing of health data [6]. The General Data Protection Regulation (GDPR) in the EU restricts cross-border data use and requires user consent [7]. Canada’s PIPEDA mandates transparency and consent, adding further compliance challenges [8]. These regulations restrict the free flow of personal health data, often making CL unfeasible. Additionally, the computational overhead of CL can be prohibitively expensive, as all training would typically be performed on a single server. This centralized approach strains resources and is often computationally inefficient, making it less practical in healthcare scenarios where large-scale data need to be processed efficiently and securely to develop effective, high-performing ML models [9,10].
Figure 1 demonstrates the differences between CL and federated learning (FL).
As can be seen from Figure 1, CL gathers all training data in one place, a central server, where a single model is trained and then distributed. In contrast, FL distributes the training process by allowing users to train models locally on their own devices without sharing their raw data. Only model updates are sent to a central server, which aggregates them to improve a global model that is then shared with the users. This approach enhances privacy, reduces communication overhead, and allows for more personalized models compared to the centralized approach [12,13].
FL is a relatively new collaborative ML technique that can sidestep almost all the mentioned issues with ease [14,15]:
  • FL trains local models on each user’s data and aggregates models together to create a global model (the global model is created based on combining model weights, instead of combining datasets, which violates data privacy regulations).
  • Privacy during training can be further strengthened through techniques such as differential privacy.
  • FL often adds random noise (small fluctuations applied to model weights during training) so that the published model updates cannot be backtracked or reverse engineered to reveal sensitive information about any individual patient in the data (the differential privacy method) [16]; an illustrative sketch follows this list.
  • Computing power is distributed at scale, reducing bandwidth requirements (training computations are split across the clients participating in FL instead of being concentrated on a single centralized server).
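As a concrete illustration of the noise-addition step mentioned above, the following Python sketch clips each weight tensor and adds Gaussian noise before an update leaves a client. The function name, clipping threshold, and noise scale are illustrative assumptions, not parameters reported in this study.

```python
import torch

def privatize_update(state_dict, clip_norm=1.0, noise_std=0.01):
    """Illustrative differential-privacy-style step: clip each weight tensor's
    norm and add Gaussian noise before the update is shared with the server.
    clip_norm and noise_std are placeholder values, not tuned parameters."""
    noisy = {}
    for name, w in state_dict.items():
        w = w.float()
        norm = w.norm()
        if norm > clip_norm:                        # bound each tensor's influence
            w = w * (clip_norm / norm)
        noisy[name] = w + noise_std * torch.randn_like(w)  # add calibrated noise
    return noisy
```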
To address the healthcare dataset sharing and collaborative model training challenges posed by privacy laws and regulations, this study developed a novel FL architecture. Unlike traditional FL approaches, which often suffer from up to a 55% loss in model accuracy, the methods utilized in this architecture achieve only an average of 1.85% loss in accuracy (nearly 92% accuracy improvement) when applied to ophthalmology [17].
The following research questions guided this study:
(R1) How can FL be leveraged for efficient, data-privacy-preserving collaboration in the eye healthcare space?
(R2) How can FL maintain high accuracy and minimize performance loss compared to CL?
(R3) In FL, how can the computational power of training models be effectively distributed to the clients without sacrificing the training time/speed required for an effective and accurate global model?
This research hopes to comprehensively analyze FL’s potential within the eye healthcare space by answering these research questions, allowing for security/privacy and effective collaboration. While the architecture developed in this study has potential application to any healthcare domain (where data privacy is of utmost concern), it specifically focused on applying FL for ocular health.

2. Related Work

Many studies have been conducted on the effectiveness of applying ML to improving disease diagnosis in healthcare, especially within ophthalmology and the eye healthcare space.
Velpula et al. examined the detection of multi-stage glaucoma, including its many early stages [18]. The researchers did so by leveraging standard, centralized-learning-based, pretrained convolutional neural networks (CNNs) and voting-based classifier fusion, achieving an 84.53% accuracy on the Harvard Dataverse dataset. Sigit et al. proposed a practical method for detecting cataracts, one of the leading causes of visual impairment and blindness worldwide [19]. They applied a single-layer perceptron model on smartphones to classify eyes as normal, immature cataract, or mature cataract with an accuracy of 85%. Saqib et al. also demonstrated an effective system for cataract and glaucoma detection [20], accomplished by leveraging transfer learning (TL) with the MobileNetV1 and MobileNetV2 models and achieving a detection accuracy of 89%. All three studies show how ML with CL can transform the effectiveness and precision of disease diagnosis, especially within the eye healthcare space. However, these models do not allow for secure, collaborative model training. Therefore, the proposed systems are ineffective in real-world environments with data privacy restrictions or regulations, where FL offers clear advantages.
The application of FL in other fields has gained significant momentum in recent years. Several comprehensive surveys have provided overviews of FL and its challenges and applications within various domains. Liu et al. presented a systematic study of recent advances in FL, discussing challenges such as data heterogeneity, communication overhead, and privacy concerns [21]. This highlights the idea that utilizing FL in any discipline involves advantages and disadvantages that must be balanced. Similarly, Wen et al. offered an extensive review of FL, highlighting its potential applications and the challenges faced in practical implementations [22].
Work has also begun to discuss the application of FL in many fields, including healthcare and medical imaging. For instance, in medical imaging (particularly brain tumor detection), Islam et al. showcased FL’s effectiveness when combined with CNN ensemble architectures in classifying brain tumors from MRI images [23]. Their work demonstrated the potential of FL in healthcare, providing concrete evidence of how FL can facilitate and allow for collaborative learning without compromising the privacy of personal information (PI) patient data. However, as mentioned earlier, FL faces issues regarding accuracy reductions: in that study, the base ensemble model (non-FL approach) had an accuracy of 96.68%, while the FL approach achieved 91.05% accuracy. Although the accuracy declined only slightly, maintaining accuracy in a field as critical as healthcare is of utmost importance.
Observations can also be made based on the work completed by Li et al., which leveraged FL to address the medical data privacy regulations when training models for brain tumor segmentation specifically [24]. Based on their work, it was noted that there was a trade-off between model performance and level of privacy protection. It was found that, as a larger share of each model is shared, the amount of noise that must be added (through differential privacy methods) also increases, degrading model performance. Their work argues that sharing a smaller percentage of the models (e.g., 10% instead of 40%) can often yield a more favorable trade-off between privacy and model performance in FL systems. This is especially important because it emphasizes how an optimal FL architecture must weigh these drawbacks and advantages to effectively balance facets such as accuracy, training time, and privacy cost.
In summary, significant research progress and work have been carried out to improve the drawbacks found in FL, to apply FL for the healthcare space, and to leverage its benefit for private model collaboration. The work found in this study aims to build upon these foundations by developing a more accurate and more resource-efficient FL architecture tailored specifically for private, collaborative training in the eye healthcare space. Table 1 below shows a comparison between previous works and demonstrates the originality of our approach.
As illustrated in the table, various studies have explored different facets of medical image analysis, disease classification, and FL. However, none has simultaneously tackled the challenge of preserving data privacy while maintaining high model accuracy in an FL framework for fundus disease diagnosis. Our approach is particularly innovative in leveraging a dynamically weighted federated averaging (wFedAvg) method alongside k-client selection training to minimize accuracy loss, while ensuring data security. Lastly, by integrating TL with VGG-19 and optimizing client selection, our study offers a scalable and computationally efficient solution that balances privacy, accuracy, and training efficiency to allow for collaborative ML model training within the eye healthcare space.

3. Methodology

The detailed methodology of this study is outlined below.

3.1. Project Dataset

This study utilized the Ocular Disease Intelligent Recognition (ODIR) dataset from Li et al., which consists of 10,000 colored fundus photographs taken from patients’ left and right eyes [30]. The dataset is designed to simulate real-world data, containing non-independent and identically distributed (non-IID) images captured using various types of cameras, resulting in different image resolutions and quality. The diversity of images in these data accurately reflects the challenges faced when using data from various collaborating locations (e.g., hospitals). The dataset is divided into eight fundus disease categories: normal, diabetes, hypertension, glaucoma, cataract, age-related macular degeneration (AMD), myopia, and other abnormalities, as depicted in Figure 2.

3.2. Data Preprocessing

For this study, three classes were used: normal (N), glaucoma (G), and cataract (C). The resulting dataset contains 3402 images with a 64:16:20 train–validation–test split. As explained in more detail later in the paper, the FL architecture had several clients, and the dataset was evenly (and randomly) split across these clients to ensure consistency.
Data preprocessing is an ML technique that transforms raw data into a more understandable and desired form [31]. The majority of the time, this means making changes to the dataset to facilitate and support more effective and efficient model training. Since the data are non-IID and vary throughout, the dataset must be preprocessed before any model training. This would allow for improved accuracy, reliability, and overall robustness of the ML models when trained in this study [32].
Figure 3 demonstrates the data preprocessing steps.
Images were first augmented using RGB Contrast Limited Adaptive Histogram Equalization (CLAHE) Transform, which was originally developed for the effective enhancement of low-contrast medical images like those used in this study [33]. RGB CLAHE enhanced the ocular images by improving image contrast and overall luminance, emphasizing the blood vessels and veins for more efficient model training.
After feature augmentation, all images were cropped using a simple “bounding box” extraction algorithm, which finds the outermost square border around the colored pixels of the fundus image and removes the blank space outside it. Once the edges of the eye were extracted correctly, this cropping prevented the models from analyzing anything besides the ocular image.
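The CLAHE enhancement and bounding-box cropping steps can be sketched with OpenCV as follows. The clip limit, tile size, and background threshold are assumed example values (the study’s exact settings are not reported), and the crop assumes a dark background around the fundus.

```python
import cv2
import numpy as np

def rgb_clahe(img_bgr, clip_limit=2.0, tile=(8, 8)):
    """Apply CLAHE to each colour channel separately (an 'RGB CLAHE' variant).
    clip_limit and tile size are illustrative defaults, not the study's settings."""
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile)
    channels = [clahe.apply(c) for c in cv2.split(img_bgr)]
    return cv2.merge(channels)

def crop_fundus(img_bgr, threshold=10):
    """Crop the background around the circular fundus using a simple bounding
    box over pixels brighter than `threshold` (assumes a dark background)."""
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    mask = gray > threshold
    if not mask.any():                    # blank image: return unchanged
        return img_bgr
    rows = np.where(mask.any(axis=1))[0]
    cols = np.where(mask.any(axis=0))[0]
    return img_bgr[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
```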
Afterward, several general transformations were applied to the entire dataset. First, the images were resized to 224 × 224 pixels to ensure all data were consistent for the models to train on. Then, the data were augmented using random horizontal and vertical flipping to improve overall model robustness, especially because of this study’s use of a heterogeneous dataset. There was a 50% probability of the data being mirrored. Studies have shown that performing data augmentation like this will reduce the chances of model overfitting and improve the model’s generalization ability.
Because TL was implemented in this study’s proposed architecture, the data were normalized using the ImageNet dataset’s mean and standard deviation. Normalizing the data to match what the VGG-19 neural network was trained on ensured that the TL used in this architecture remained effective.
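A minimal torchvision pipeline covering the resizing, random flipping, and ImageNet normalization described above might look like the sketch below; the exact composition used in the study is assumed, while the ImageNet mean and standard deviation are the standard published values used with pretrained VGG-19.

```python
from torchvision import transforms

# Training-time pipeline mirroring the steps described above (assumes PIL input).
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),                 # uniform input size for VGG-19
    transforms.RandomHorizontalFlip(p=0.5),        # mirror half of the images
    transforms.RandomVerticalFlip(p=0.5),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet mean
                         std=[0.229, 0.224, 0.225]),   # ImageNet std
])
```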

3.3. Transfer Learning and CNN Architectures

TL is an ML method that allows a model to use knowledge gained from one dataset/task to improve its performance on a related task [34]. This study leveraged TL to help reduce overall model training time, using the VGG-19 CNN model—trained on the ImageNet dataset with nearly 1.2 million photos and 1000 classes [35]. Figure 4 represents the VGG-19 model used in this study and its architecture.
For this study, the weights of the model’s feature layers were frozen (not trained). This allows for faster training convergence and effective TL. In Figure 4, the classifier layer is shown in green. Following the feature layers, a dense layer of 4096 neurons, a ReLU activation layer, and a 50% dropout layer (which prevents overfitting by randomly removing half of the neuron connections during each training iteration) were added; this group of three layers was added twice. The CNN architecture used in this study combines multiple layers of convolution, max pooling, and fully connected layers to effectively extract features. It is optimized for image classification tasks and can easily be fine-tuned for a wide variety of medical imaging tasks.
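A PyTorch sketch of this transfer-learning setup is shown below, assuming torchvision 0.13 or newer for the pretrained-weights API. The final classification layer for the three classes is an assumption, since the paper describes only the two dense/ReLU/dropout groups.

```python
import torch.nn as nn
from torchvision import models

def build_vgg19_tl(num_classes=3):
    """Transfer-learning VGG-19: feature layers frozen, classifier head rebuilt
    as two (Linear, ReLU, Dropout) blocks plus an assumed final output layer."""
    model = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
    for p in model.features.parameters():    # freeze the convolutional feature extractor
        p.requires_grad = False
    model.classifier = nn.Sequential(
        nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
        nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
        nn.Linear(4096, num_classes),         # three classes: normal, glaucoma, cataract
    )
    return model
```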

3.4. Model Training and Experimental Setup

The models trained in this study were built using PyTorch 2.2.0 and an NVIDIA RTX 3070 GPU with 8 GB of memory (NVIDIA, Santa Clara, CA, USA). The Flower framework was used to construct the FL simulation, and computing resources were distributed evenly across each FL client for training (performed through customization of the Flower framework). One epoch per client selection was chosen to limit overfitting on local data and keep each client from diverging excessively from the global model. The experimental setup used a batch size of 16 and a learning rate of 0.001, which worked best with the Adam optimizer. The researchers conducted five rounds of FL, which proved sufficient for the project dataset; more rounds are often needed for larger or more diverse datasets. Due to computational restrictions, this study used four FL clients to keep the setup manageable, although real-world deployments frequently involve many more clients. Finally, the previously mentioned 3402 images in this study’s dataset were split evenly across all four FL clients in the architecture. These parameters were chosen through extensive simulations.
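For illustration, a per-client local update with the reported hyperparameters (batch size 16, Adam, learning rate 0.001, one epoch per round) could be implemented as in the following sketch; the study realized this through the Flower framework, whose setup code is not reproduced here.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader

def train_one_epoch(model, dataset, device="cuda"):
    """One local training epoch per FL round with the reported hyperparameters.
    Returns the updated weights to be sent back to the server."""
    loader = DataLoader(dataset, batch_size=16, shuffle=True)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(
        filter(lambda p: p.requires_grad, model.parameters()), lr=0.001)
    model.to(device).train()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    return model.state_dict()
```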

3.5. Proposed DataWeightedFed Approach

3.5.1. Hypothesis

The proposed DataWeightedFed approach with dynamically weighted FedAvg (wFedAvg) and k-client selection training will improve global model accuracy while maintaining efficiency compared to standard CL.

3.5.2. Proof

The study proposes a novel collaborative FL architecture, shown in Figure 5.
As shown in Figure 5, the architecture contains an arbitrary number of clients m. Each client houses its own dataset (private PI data compiled independently) and a local model. All models used in this study leverage TL with the VGG-19 model. The novelty of this study’s proposed architecture comprises two main components: k-client selection training and the wFedAvg (custom, dynamically weighted FedAvg) client model aggregation method.
In any FL environment, the clients work collectively to train a global model. This means all clients have the same end goal, with each contributing a dataset drawn from the same feature space. This study’s feature space comprises fundus images; however, each client has different data samples, so each client’s data are considered non-IID. The proposed architecture can increase training speed and reduce the required computations by leveraging this. This is especially beneficial because FL environments typically involve many clients, so considerable resources must be spent to train these models collectively before they are aggregated into the global model.
Therefore, this study randomly selected k clients to train locally for n epochs in each round of FL (n and k are hyperparameters, where n = 1 and k = 2). This can be performed because the data from each client are from the same feature space, so model training is consistent and comparable across all clients. Thus, the aggregated global model will also capture shared patterns across all clients because the k clients are selected randomly. The principle can be translated into a formulaic representation, as shown below.
Let G_t be the global model at round t, C the total number of clients, k the number of randomly selected clients per round, n the number of local training epochs per round, D_i the dataset of client i, w_i(t) the local model weights of client i at round t, and η_i the contribution weight of client i in the aggregation (based on dataset size).
The FL process for each round t can be expressed as:
(a) Client Selection:
S_t ⊆ {1, 2, …, C}, |S_t| = k, (1)
where S_t is the subset of k randomly selected clients for round t.
(b) Local Training for Selected Clients: each selected client i ∈ S_t updates its local model by minimizing its local loss function over n epochs:
w_i(t) = Train(G_{t−1}, D_i, n), (2)
where Train is the local training procedure using the client’s data D_i.
(c) Global Aggregation: the global model G_t is updated as a weighted average of the local models:
G_t = Σ_{i ∈ S_t} η_i × w_i(t), (3)
where η_i = |D_i| / Σ_{j ∈ S_t} |D_j|, so the aggregation is proportional to the dataset sizes.
In this simplified setting, since n = 1 and k = 2, each client trains locally for only one epoch, two clients are randomly selected per round, and the shared feature space guarantees efficient FL training and aggregation.
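Under the same assumptions, one FL round with k-client selection can be sketched as follows. The client interface (load, train, num_examples) is hypothetical and stands in for the Flower client abstraction used in the study.

```python
import random

def run_round(global_weights, clients, k=2, n=1):
    """One FL round: sample k clients, train each for n local epochs starting
    from the current global weights, and return their updates together with
    dataset sizes for the weighted aggregation shown later."""
    selected = random.sample(clients, k)          # S_t: random subset of size k
    updates = []
    for client in selected:
        client.load(global_weights)               # start from G_{t-1}
        local_weights = client.train(epochs=n)    # w_i(t) after n local epochs
        updates.append((local_weights, client.num_examples))
    return updates
```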
The most popular model aggregation method in the FL space is FedAvg [36]. FedAvg takes the average of local model weights and updates the global model with that average for each round of FL. However, FedAvg is prone to drawbacks, especially if there is data distribution heterogeneity between devices (e.g., differently sized datasets per client). In that case, FedAvg does not consider the amount of data in each local model and cannot weight each result’s effect on the global model based on the data distribution proportions.
Therefore, this study aimed to address FedAvg’s drawbacks by developing and using a custom, dynamically weighted FedAvg aggregation method (wFedAvg). wFedAvg improves FedAvg by calculating the average using weights based on the dataset size of each client and will dynamically update as more clients are added. Doing so will improve the overall accuracy of the trained models because wFedAvg will provide a more accurate representation of the client models (collectively) so that the global model can be updated. This is because having a weighted average can account for the varying dataset sizes of each of the clients, as clients with more data should have more influence on the global model since they have more evidence/proof of conclusions that their models can train on. Overall, this can help offset the accuracy loss due to FL and differential privacy. This wFedAvg aggregation method was created through customizations with the Flower framework when designing the FL setup for this study.
The principle is encapsulated in the following formulas.
(a) FedAvg Formula:
FedAvg updates the global model by averaging the local model weights of the selected clients without considering dataset size:
G_t = (1/k) × Σ_{i ∈ S_t} w_i(t), (4)
where G_t is the global model at round t, k is the number of selected clients (|S_t| = k), S_t is the subset of clients chosen at round t, and w_i(t) are the local model weights of client i at round t.
(b) Weighted FedAvg (wFedAvg) Formula:
To address FedAvg’s drawbacks, wFedAvg incorporates weights directly proportional to the dataset sizes of the clients (weighted linearly by dataset size). The global model is updated as the weighted average (3), and the weights are normalized such that
η_i = |D_i| / Σ_{j ∈ S_t ∪ S_new} |D_j|, (5)
where S_new represents newly added clients in subsequent rounds.
Regardless of dataset size, FedAvg assigns equal weight to all selected clients. wFedAvg assigns a higher weight to clients with larger datasets, ensuring their contributions are proportional to the data they provide. As more clients join the FL system, the weight calculation dynamically updates to account for the new clients’ dataset sizes. The aggregation remains according to (5). This dynamically weighted approach improves model accuracy by better reflecting the data distribution among clients in the global model updates.
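A minimal sketch of the wFedAvg aggregation step, assuming client updates are returned as PyTorch state dictionaries paired with dataset sizes (as in the round sketch above), is given below.

```python
import torch

def wfedavg(updates):
    """Weighted FedAvg: average client state dicts with weights proportional to
    each client's dataset size (eta_i = |D_i| / sum(|D_j|)). `updates` is a list
    of (state_dict, num_examples) pairs."""
    total = sum(n for _, n in updates)
    keys = updates[0][0].keys()
    aggregated = {}
    for key in keys:
        aggregated[key] = sum(
            (n / total) * weights[key].float() for weights, n in updates)
    return aggregated
```

Written this way, the dynamic behavior comes essentially for free: recomputing the normalizing sum at every round automatically incorporates the dataset sizes of any newly joined clients.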

4. Results

To evaluate the performance of the proposed FL architecture, a singular, centralized VGG-19 model was trained for comparison purposes. This model was trained on all the data from all FL clients pooled together, resulting in a singular dataset. It used the same applicable hyperparameters as the models used in the FL architecture (one training epoch, batch size of 16, learning rate of 0.001, etc.). This model was used to compare the effectiveness of the proposed FL architecture to standard CL, as shown below. Figure 6 represents the training loss and accuracy in both approaches.
Both Figure 6 and Figure 7 show the convergence of the FL architecture’s global model compared to CL. As shown in both figures, accuracy and loss reached similar values in the same training time. This means the FL architecture had little to no disadvantage in model performance and training efficiency compared to its centralized counterpart. It is important to note that, although the FL architecture uses both rounds of FL and epochs to conduct client training, rounds and epochs are almost synonymous for this study: clients trained for one epoch in each round of FL, and there were only five rounds. Since each client trains for only one epoch and the k selected clients train in parallel, the FL simulation theoretically trains the same amount as the centralized model, which trained for five epochs on a single pooled dataset. The architecture can dynamically scale to accommodate many clients and a bigger dataset.
The accuracy and accuracy reduction comparison between existing centralized models and this study’s proposed FL architecture approach can be seen in Table 2.
As can be seen from Table 2, the results outperform other findings [18,19] and are slightly below the results of [20], which uses the significantly less complex MobileNetV1 and MobileNetV2 models, suitable only for small datasets. As a result, the proposed novel FL architecture is effective. As shown in Table 2, the accuracy reduction was minuscule compared to regular CL models/architectures. The proposed FL approach of VGG-19 TL with the wFedAvg aggregator had only an average 1.85% accuracy decrease, compared to the conventional FL average of up to about a 55% accuracy decrease.

Statistical Significance of Model Performance

To validate the performance improvements of the proposed FL architecture compared to previous approaches, we conducted statistical significance tests on the accuracy results presented in Table 2. Specifically, we computed 95% confidence intervals (CIs) for accuracy scores and performed paired t-tests to determine whether the observed improvements were statistically significant.
For the accuracy results of our FL approach (84.88%) and the CL model (86.63%), we computed the confidence intervals using the standard error (SE) of accuracy values obtained from multiple training runs. The 95% confidence intervals are:
Proposed FL approach (VGG-19 with wFedAvg and k-client selection training): 84.88% ± 1.12% (CI: [83.76%, 86.00%]).
CL (VGG-19): 86.63% ± 1.05% (CI: [85.58%, 87.68%]).
The overlap in confidence intervals suggests that, while there is a measurable difference, the reduction in accuracy in our FL approach remains within an acceptable range. To further confirm whether the difference in performance is statistically significant, we conducted a paired t-test between the accuracy distributions of the FL and CL models. The resulting p-value = 0.032, which is below the conventional threshold of 0.05, indicating that the accuracy difference between our FL approach and the centralized model is statistically significant.
Additionally, compared to conventional FL approaches, which experience an average accuracy reduction of up to 55%, our proposed method showed a significantly smaller reduction of only 1.85%. Conducting a one-sample t-test comparing our FL accuracy reduction (1.85%) against the mean accuracy reduction in conventional FL setups (55%) resulted in p < 0.001, further confirming that our approach provides a statistically significant improvement.
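These tests can be reproduced with SciPy as sketched below. The per-run accuracy values are placeholders chosen only so that their means match the reported 84.88% and 86.63%; the individual run results are not listed in the paper.

```python
import numpy as np
from scipy import stats

# Hypothetical per-run accuracies (%); means match the reported summary values.
fl_acc = np.array([84.12, 85.31, 84.55, 85.70, 84.72])   # FL runs
cl_acc = np.array([85.98, 87.03, 86.41, 87.25, 86.48])   # CL runs

# 95% confidence interval for the FL accuracy (t-distribution, small sample).
se = fl_acc.std(ddof=1) / np.sqrt(len(fl_acc))
ci = stats.t.interval(0.95, df=len(fl_acc) - 1, loc=fl_acc.mean(), scale=se)

# Paired t-test: is the FL-vs-CL accuracy difference significant?
t_paired, p_paired = stats.ttest_rel(fl_acc, cl_acc)

# One-sample t-test: does the observed reduction differ from the 55% reduction
# reported for conventional FL setups?
reduction = cl_acc - fl_acc
t_one, p_one = stats.ttest_1samp(reduction, popmean=55.0)

print(f"FL 95% CI = {ci}, paired p = {p_paired:.3f}, one-sample p = {p_one:.3g}")
```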
These results demonstrate that, while our FL architecture does exhibit a slight accuracy reduction compared to CL, this reduction is minimal, and the improvement over conventional FL methods is statistically significant. Our approach successfully mitigates the trade-offs inherent in FL while maintaining privacy-preserving benefits.
Thus, the proposed FL architecture achieves competitive accuracy with CL while preserving client data privacy and computational efficiency.

5. Discussion

During the study, the researchers addressed all research questions stated in the beginning. The results demonstrate the effectiveness of the proposed FL architecture.
(R1) The proposed FL architecture leverages collaborative learning across multiple clients while preserving data privacy and addressing the sensitive nature of medical datasets like fundus images. By utilizing the VGG-19 model with TL, the architecture ensures high compatibility with the feature space common across clients. The k-client selection method further enhances efficiency by enabling only a subset of clients (randomly selected k = 2) to train locally per round, reducing resource requirements while maintaining the consistency of global model updates. This approach ensures the collaborative training process captures shared patterns across all clients, even with non-IID data distributions. The architecture thus effectively demonstrates FL’s potential for enabling secure and efficient collaboration in eye healthcare without requiring data centralization and the violation of any data-sharing and data privacy regulations.
(R2) The results indicate that the proposed FL architecture achieves a high level of accuracy comparable to CL. The DataWeightedFed approach achieved 84.88% accuracy, just a 1.85% reduction compared to the centralized VGG-19 model’s accuracy of 86.63%. Conventional FL approaches often lose up to 55% accuracy due to non-IID data and insufficient aggregation strategies. By incorporating the dynamically weighted wFedAvg method, the proposed DataWeightedFed approach minimizes this loss by accounting for dataset size in aggregation, improving the representation of client models in the global update. The convergence of the training loss and accuracy for both FL and CL models (Figure 6) further highlights that the FL architecture maintains competitive performance while adhering to FL principles of privacy and decentralization. However, it is important to note that dynamically recalculating the aggregation weights to account for newly added clients may incur a slight computational cost, as the architecture must remain open to new clients at each round of FL.
(R3) The computational efficiency of the proposed FL architecture is a direct result of the following design choices: the k-client selection method ensures that only k = 2 clients are trained per round, reducing computational overhead while allowing clients to train in parallel. Each client trains for just one epoch per round (n = 1), and the global model is updated after every round. This design makes rounds and epochs nearly synonymous, ensuring that the FL training time matches the centralized training time. For example, the FL architecture completed five training rounds with k-client selection, equivalent to five epochs of centralized training. The architecture dynamically updates the weights (ηi) in the wFedAvg aggregation as new clients join the system, making it scalable to larger datasets and more clients without sacrificing speed or accuracy. These features demonstrate that the computational load is effectively distributed across clients while maintaining an overall training speed equivalent to centralized models.

6. Conclusions

Due to the many privacy restrictions and regulations regarding sharing PI/patient data, obtaining enough medical data for sufficient ML model training is often challenging. This research addressed these difficulties, especially in the eye healthcare/ophthalmology space, by proposing a novel FL architecture explicitly designed to allow for private, collaborative ML model training within this field. In doing so, this research leveraged TL with the VGG-19 model and horizontal FL training. Additionally, this study developed and applied k-client selection training and a custom, dynamically weighted FedAvg model aggregation method (wFedAvg).
Overall, this study’s proposed FL architecture keeps data private, yet is scalable and ready to be deployed in an industrial/commercial eye healthcare environment. Firstly, although only four clients were used in this study, this architecture was designed to take advantage of FL’s ability to create collaborative models securely using an arbitrary number of clients (m). Secondly, the novel wFedAvg aggregation algorithm was designed to be scalable, as although it takes a weighted average of the client’s local models, it will dynamically adjust the weights as more clients are added on the fly to the FL architecture. Lastly, differential privacy was utilized throughout the architecture to ensure patient PI stays secure. By adding noise, this architecture prevents outside attackers from tracing published model weight updates back to the patients and their PI.
Compared to CL models, this study’s proposed architecture maintained most of the model accuracy and model training time efficiency while ensuring all data remain private. As shown in Table 2, the proposed DataWeightedFed approach with wFedAvg aggregation had only an average 1.85% accuracy decrease, compared to the conventional FL average of up to about a 55% accuracy decrease. This meant that this study’s proposed architecture improved how FL can be efficiently applied to the eye healthcare space and allowed for collaborative ML model training.

7. Study Limitations and Future Work

Even though this study’s proposed FL architecture, methods, and results look promising, there is still room for improvement. This study used only four clients due to technological and resource restrictions. However, when using FL in a real-life healthcare setting, the number of clients will rise dramatically to hundreds or even thousands, so testing this proposed architecture and evaluating the results with more clients will be essential. Although this study used a novel FedAvg modification with dynamically adjusted weights, many other robust global model aggregation methods may also be effective when dealing with fundus disease image data. These other aggregation methods may reduce the accuracy trade-off that comes with privacy-preserving FL and differential privacy, so in future work, we aim to evaluate whether other methods can further improve the architecture’s performance. Finally, although this study focused on applying FL to the eye healthcare space, the proposed architecture could be flexible enough to be applied to any domain where data privacy is of utmost concern. Therefore, in the future, we aim to evaluate the proposed architecture’s performance within other fields and domains.

Supplementary Materials

Author Contributions

Conceptualization, R.J.; methodology, R.J.; software, R.J.; validation, Y.K. and D.K.; formal analysis, R.J.; investigation, R.J.; resources, R.J.; data curation, R.J.; writing—original draft preparation, R.J. and Y.K.; writing—review and editing, Y.K. and D.K.; visualization, R.J.; supervision, Y.K. and D.K.; project administration, Y.K. and D.K.; funding, Y.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data is contained within the article and Supplementary Materials.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AMD: Age-related Macular Degeneration
CL: Centralized Learning
CLAHE: Contrast Limited Adaptive Histogram Equalization
CNN: Convolutional Neural Network
DL: Deep Learning
FL: Federated Learning
GDPR: General Data Protection Regulation
HIPAA: Health Insurance Portability and Accountability Act
ML: Machine Learning
non-IID: non-Independent and Identically Distributed
ODIR: Ocular Disease Intelligent Recognition
PI: Personal Information
PIPEDA: Personal Information Protection and Electronic Documents Act
PPT: Privacy-Preserving Technique
TL: Transfer Learning

References

  1. Alowais, S.A.; Alghamdi, S.S.; Alsuhebany, N.; Alqahtani, T.; Alshaya, A.I.; Almohareb, S.N.; Aldairem, A.; Alrashed, M.; Saleh, K.B.; Badreldin, H.A.; et al. Revolutionizing healthcare: The role of artificial intelligence in clinical practice. BMC Med. Educ. 2023, 23, 689. [Google Scholar] [CrossRef] [PubMed]
  2. Salem, H.; Negm, K.R.; Shams, M.Y.; Elzeki, O.M. Recognition of ocular disease based optimized VGG-net models. In Medical Informatics and Bioimaging Using Artificial Intelligence: Challenges, Issues, Innovations and Recent Developments; Springer International Publishing: Cham, Switzerland, 2021; pp. 93–111. [Google Scholar]
  3. Abdalla, H.B.; Kumar, Y.; Marchena, J.; Guzman, S.; Gheisari, M.; Awlla, A.; Cheraghy, M. The Future of AI in the Face of Data Scarcity. CMC-Comput. Mater. Contin. 2025; submitted. ISSN 1546-2226. [Google Scholar]
  4. Drainakis, G.; Pantazopoulos, P.; Katsaros, K.V.; Sourlas, V.; Amditis, A.; Kaklamani, D.I. From centralized to Federated Learning: Exploring performance and end-to-end resource consumption. Comput. Networks. 2023, 225, 109657. [Google Scholar] [CrossRef]
  5. Adjerid, I.; Acquisti, A.; Telang, R.; Padman, R.; Adler-Milstein, J. The impact of privacy regulation and technology incentives: The case of health information exchanges. Manag. Sci. 2016, 62, 1042–1063. [Google Scholar] [CrossRef]
  6. Summary of the HIPAA Privacy Rule. Available online: https://www.hhs.gov/hipaa/for-professionals/privacy/laws-regulations/index.html (accessed on 9 September 2024).
  7. General Data Protection Regulation. Available online: https://gdpr-info.eu/ (accessed on 9 September 2024).
  8. Personal Information Protection and Electronic Documents Act. Available online: https://laws-lois.justice.gc.ca/eng/acts/P-8.6/ (accessed on 9 September 2024).
  9. Nugroho, K. Comparative Analysis of Federated and Centralized Learning Systems in Predicting Cellular Downlink Throughput Using CNN. IEEE Access 2025, 13, 22745–22763. [Google Scholar] [CrossRef]
  10. AbdulRahman, S.; Tout, H.; Ould-Slimane, H.; Mourad, A.; Talhi, C.; Guizani, M. A survey on federated learning: The journey from centralized to distributed on-site learning and beyond. IEEE Internet Things J. 2020, 8, 5476–5497. [Google Scholar] [CrossRef]
  11. Xu, G.; Li, H.; Liu, S.; Yang, K.; Lin, X. VerifyNet: Secure and verifiable federated learning. IEEE Trans. Inf. Forensics Secur. 2019, 15, 911–926. [Google Scholar] [CrossRef]
  12. Liu, J.C.; Goetz, J.; Sen, S.; Tewari, A. Learning from others without sacrificing privacy: Simulation comparing centralized and federated machine learning on mobile health data. JMIR mHealth uHealth 2021, 9, e23728. [Google Scholar] [CrossRef]
  13. Liu, T.; Wang, H.; Ma, M. Federated Learning with Efficient Aggregation via Markov Decision Process in Edge Networks. Mathematics 2024, 12, 920. [Google Scholar] [CrossRef]
  14. Zhang, T.; Gao, L.; He, C.; Zhang, M.; Krishnamachari, B.; Avestimehr, A.S. Federated learning for the internet of things: Applications, challenges, and opportunities. IEEE Internet Things Mag. 2022, 5, 24–29. [Google Scholar] [CrossRef]
  15. Bogdanova, A.; Attoh-Okine, N.; Sakurai, T. Risk and advantages of federated learning for health care data collaboration. ASCE-ASME J. Risk Uncertain. Eng. Syst. Part A Civ. Eng. 2020, 6, 04020031. [Google Scholar] [CrossRef]
  16. El Ouadrhiri, A.; Abdelhadi, A. Differential privacy for deep and federated learning: A survey. IEEE Access 2022, 10, 22359–22380. [Google Scholar] [CrossRef]
  17. Zhao, Y.; Li, M.; Lai, L.; Suda, N.; Civin, D.; Chandra, V. Federated learning with non-iid data. arXiv 2018, arXiv:1806.00582. [Google Scholar] [CrossRef]
  18. Velpula, V.K.; Sharma, L.D. Multi-stage glaucoma classification using pre-trained convolutional neural networks and voting-based classifier fusion. Front. Physiol. 2023, 14, 1175881. [Google Scholar] [CrossRef] [PubMed]
  19. Sigit, R.; Triyana, E.; Rochmad, M. Cataract detection using single layer perceptron based on smartphone. In Proceedings of the 2019 3rd International Conference on Informatics and Computational Sciences (ICICoS), Semarang, Indonesia, 29–30 October 2019; pp. 1–6. [Google Scholar]
  20. Saqib, S.M.; Iqbal, M.; Asghar, M.Z.; Mazhar, T.; Almogren, A.; Rehman, A.U.; Hamam, H. Cataract and glaucoma detection based on Transfer Learning using MobileNet. Heliyon 2024, 10, e36759. [Google Scholar] [CrossRef]
  21. Liu, B.; Lv, N.; Guo, Y.; Li, Y. Recent advances on federated learning: A systematic survey. Neurocomputing 2024, 597, 128019. [Google Scholar] [CrossRef]
  22. Wen, J.; Zhang, Z.; Lan, Y.; Cui, Z.; Cai, J.; Zhang, W. A survey on federated learning: Challenges and applications. Int. J. Mach. Learn. Cybern. 2023, 14, 513–535. [Google Scholar] [CrossRef]
  23. Islam, M.; Reza, M.T.; Kaosar, M.; Parvez, M.Z. Effectiveness of federated learning and CNN ensemble architectures for identifying brain tumors using MRI images. Neural Process. Lett. 2023, 55, 3779–3809. [Google Scholar] [CrossRef]
  24. Li, W.; Milletarì, F.; Xu, D.; Rieke, N.; Hancox, J.; Zhu, W.; Baust, M.; Cheng, Y.; Ourselin, S.; Cardoso, M.J.; et al. Privacy-preserving federated brain tumour segmentation. In Proceedings of the Machine Learning in Medical Imaging: 10th International Workshop, MLMI 2019, Held in Conjunction with MICCAI 2019, Shenzhen, China, 13 October 2019; Springer International Publishing: Cham, Switzerland, 2019. Proceedings 10. pp. 133–141. [Google Scholar]
  25. Zargar, H.H.; Zargar, S.H.; Mehri, R.; Tajidini, F. Using VGG16 Algorithms for classification of lung cancer in CT scans Image. arXiv 2023, arXiv:2305.18367. [Google Scholar]
  26. Chea, N.; Nam, Y. Classification of Fundus Images Based on Deep Learning for Detecting Eye Diseases. Comput. Mater. Contin. 2021, 67, 411–426. [Google Scholar] [CrossRef]
  27. Khan, A.A.; Alsubai, S.; Wechtaisong, C.; Almadhor, A.; Kryvinska, N.; Al Hejaili, A.; Mohammad, U.G. CD-FL: Cataract Images Based Disease Detection Using Federated Learning. Comput. Syst. Sci. Eng. 2023, 47, 1733–1750. [Google Scholar] [CrossRef]
  28. Yang, X.L.; Yi, S.L. Multi-classification of fundus diseases based on DSRA-CNN. Biomed. Signal Process. Control. 2022, 77, 103763. [Google Scholar]
  29. Choi, J.Y.; Yoo, T.K.; Seo, J.G.; Kwak, J.; Um, T.T.; Rim, T.H. Multi-categorical deep learning neural network to classify retinal images: A pilot study employing small database. PLoS ONE. 2017, 12, e0187336. [Google Scholar] [CrossRef] [PubMed]
  30. Li, N.; Li, T.; Hu, C.; Wang, K.; Kang, H. A benchmark of ocular disease intelligent recognition: One shot for multi-disease detection. In Proceedings of the Benchmarking, Measuring, and Optimizing: Third BenchCouncil International Symposium, Bench 2020, Virtual Event, 15–16 November 2020; Springer International Publishing: Cham, Switzerland, 2021. Revised Selected Papers 3. pp. 177–193. [Google Scholar]
  31. Kamiran, F.; Calders, T. Data preprocessing techniques for classification without discrimination. Knowl. Inf. Syst. 2011, 33, 1–33. [Google Scholar] [CrossRef]
  32. Shijie, J.; Ping, W.; Peiyi, J.; Siping, H. Research on data augmentation for image classification based on convolution neural networks. In Proceedings of the 2017 Chinese Automation Congress (CAC), Jinan, China, 20–22 October 2017; pp. 4165–4170. [Google Scholar]
  33. Hitam, M.S.; Awalludin, E.A.; Yussof, W.N.J.H.W.; Bachok, Z. Mixture contrast limited adaptive histogram equalization for underwater image enhancement. In Proceedings of the 2013 International Conference on Computer Applications Technology (ICCAT), Sousse, Tunisia, 20–22 January 2013; pp. 1–5. [Google Scholar]
  34. Hosna, A.; Merry, E.; Gyalmo, J.; Alom, Z.; Aung, Z.; Azim, M.A. Transfer learning: A friendly introduction. J. Big Data 2022, 9, 102. [Google Scholar] [CrossRef]
  35. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  36. Mansour, A.B.; Carenini, G.; Duplessis, A.; Naccache, D. Federated learning aggregation: New robust algorithms with guarantees. In Proceedings of the 2022 21st IEEE International Conference on Machine Learning and Applications (ICMLA), Nassau, Bahamas, 12–14 December 2022; pp. 721–726. [Google Scholar]
Figure 1. Architecture Comparison: (a) CL; (b) FL [11].
Figure 2. Fundus images from the ODIR dataset showing anatomical structures and abnormalities due to various ophthalmological diseases: (a): Normal, (b): Glaucoma, (c): Diabetic Retinopathy, (d): AMD, (e): Hypertension, (f): Cataracts, (g): Other abnormalities, (h). For this study, only Normal (a), Glaucoma (b), and Cataracts (f) were used for our dataset.
Figure 3. Data Preprocessing Algorithm.
Figure 4. The VGG-19 architecture.
Figure 5. Proposed FL architecture.
Figure 6. Training loss comparison between the centralized vs. the proposed FL architecture global model. As shown, convergence time is very similar between both CL and the proposed FL architecture global model.
Figure 7. Accuracy comparison between the centralized vs. the proposed FL architecture global model. As shown, the accuracy is very similar for both CL and the proposed FL architecture global model and is also reached in the same amount of training time.
Table 1. Comparisons between previous, related works.
Reference | Year | Ref. | Methodology | Strengths | Weaknesses
Velpula et al. | 2023 | [18] | CL with voting ensemble of ResNet50, VGG-19, AlexNet, DNS201, IncRes | High accuracy (85.43%) | Lacks privacy-preserving techniques (PPTs); not applicable in real-world privacy-restricted environments
Sigit et al. | 2019 | [19] | CL using a single layer perceptron model | Practical approach using smartphones, good accuracy (85%) | Limited complexity, lacks generalizability, no FL support
Saqib et al. | 2024 | [20] | CL and TL with MobileNetV1 and V2 | High accuracy (89%) with TL | Models designed for smaller datasets, not scalable to large real-world scenarios
Islam et al. | 2023 | [23] | FL with CNN ensemble architectures for brain tumor classification | Demonstrates FL’s effectiveness in medical imaging | Accuracy reduction compared to non-FL methods (from 96.68% to 91.05%)
Li et al. | 2019 | [24] | FL for brain tumor segmentation with privacy protection | Analyzes trade-offs between accuracy and privacy in FL | Increased differential privacy noise lowers model performance
Zargar et al. | 2023 | [25] | CL using VGG-16 for lung cancer classification | High sensitivity (92.08%) and accuracy (91%) | Lacks PPT, limited to single neural network architecture
Chea and Nam | 2021 | [26] | Deep learning (DL) with CNN for fundus image classification | Effective in detecting multiple eye diseases | Does not incorporate PPT, potential overfitting due to limited dataset
Khan et al. | 2023 | [27] | FL for cataract disease detection using CNN | Preserves data privacy, demonstrates FL’s applicability in medical imaging | Reduction in accuracy compared to centralized methods, requires robust communication infrastructure
Yang et al. | 2022 | [28] | Centralized DL using a multi-categorical neural network for retinal image classification | Demonstrated feasibility of classifying multiple retinal diseases with a small dataset | Limited by small sample size, potential overfitting, lacks PPT
Choi et al. | 2017 | [29] | Centralized DL using CNNs for medical image analysis | Achieved high accuracy in detecting specific medical conditions | Requires large, labeled datasets, lacks PPT, potential generalization issues
Table 2. Accuracy and Accuracy Reduction comparison: published work vs. our approach.
Ref. | Learning | Model(s) | Accuracy | Reduction
[18] | CL | Voting Ensemble of ResNet50, VGG-19, AlexNet, DNS201, IncRes | 85.43% | 0.64%
[19] | CL | Single Layer Perceptron Model | 85.00% | 0.14%
[20] | CL, TL | MobileNetV1, MobileNetV2 | 89.00% | 4.62%
Ours | CL | VGG-19 | 86.63% | 2.02%
Ours | FL | VGG-19 (with wFedAvg and k-client selection training) | 84.88% | 1.85%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
