FedCycle: An Improved Federated Learning Framework for Assessment Across Modalities and Domains

Dundar, Betul; Sezer, Ebru Akcapinar; Okay, Feyza Yildirim; Ozdemir, Suat

doi:10.3390/electronics15081752

¹ Department of Computer Engineering, Hacettepe University, Ankara 06800, Türkiye

² Ministry of National Education, Ankara 06420, Türkiye

³ Department of Computer Engineering, Gazi University, Ankara 06570, Türkiye

* Author to whom correspondence should be addressed.

Electronics 2026, 15(8), 1752; https://doi.org/10.3390/electronics15081752

Submission received: 1 March 2026 / Revised: 6 April 2026 / Accepted: 15 April 2026 / Published: 21 April 2026

(This article belongs to the Special Issue Federated Learning and Its Application)

Abstract

Artificial Intelligence (AI) systems based on traditional Deep Learning (DL) are expected to play a leading role in the early detection of various diseases in healthcare applications. However, these systems face two major challenges: protecting patient privacy and obtaining sufficiently large, high-quality datasets to train reliable models. In traditional DL, collecting data from different sources on a single central server increases system complexity and raises serious privacy and security concerns. Federated Learning (FL) makes it possible to train models locally at multiple data locations while collaboratively improving a global model without exposing raw data, making it a promising architectural solution for privacy preservation. Although previous studies have reported that FL can achieve performance comparable to centralized DL approaches, traditional FL approaches often struggle to maintain consistent performance across different settings. This limitation becomes more noticeable when heterogeneous data distributions, modalities, and domains are involved. In these situations, client drift, overfitting, and the limited generalization capability of the global model arise as major challenges. Thus, this study presents FedCycle as an incremental improvement of the FedAvg algorithm. By modifying the aggregation frequency, it aims to overcome these drawbacks and make the global model more stable and efficient. FedCycle eliminates centralized data collection, enhances data security, and effectively reduces client drift and overfitting by supporting model training across heterogeneous data distributions, modalities, and domains. The performance evaluation involves extensive experiments using various real-world breast cancer image datasets, namely BREAKHIS, ROBOFLOW, RSNA, BUSI, and BCFPP. The presented method is evaluated against both traditional DL and FL approaches using accuracy, precision, recall, F1-score, and AUC. The findings confirm that applying fine-tuning within FedCycle reduces overfitting during training. As a result, FedCycle achieves performance improvements of 7.75% and 4.65% in accuracy and F1-score on the RSNA and BCFPP datasets compared to traditional DL approaches, while also providing an average improvement of approximately 1.5% in accuracy and F1-score across BREAKHIS, ROBOFLOW, and BUSI datasets compared to FedAvg.

Keywords: deep learning; federated learning; healthcare dataset; artificial intelligence

1. Introduction

Recent developments in Artificial Intelligence (AI) have increasingly enabled new research opportunities in diverse fields. One branch of AI models, Deep Learning (DL), has achieved significant success in both industry and academia. The applications are very diverse, such as autonomous vehicles, fraud detection, healthcare, and entertainment [1,2,3,4]. In the case of healthcare, DL, for example, produces remarkable results in medical imaging analysis and disease diagnosis of various conditions such as skin cancer, tuberculosis, brain tumors, and breast cancer [5,6,7,8]. Therefore, DL allows clinicians to make more accurate predictions and design more effective treatment plans using electronic health data.

DL approaches offer great benefits in processing and interpreting complicated and large-scale data. Their main advantages include better decision-making, improved diagnostic accuracy, and more efficient patient care through the automatic identification of significant patterns. These approaches are becoming increasingly advantageous in fields like healthcare, where the ability to handle and analyze various multidimensional types of data is very important.

On the other hand, although traditional DL approaches have some advantages, they still face challenges in data quality, privacy, and security during data transfer. In traditional centralized DL approaches, a global server collects data from various sources. This leads to ethical and legal problems, especially in healthcare systems, where patient data are extremely sensitive and are protected by regulations such as the Health Insurance Portability and Accountability Act (HIPAA) in the United States (US) [9]. Centralized data collection therefore puts the security and privacy of patient data at risk. Moreover, differences in data standards, infrastructure problems, and institutional policies hinder data sharing and centralized data collection.

Federated Learning (FL) is an innovative technology that allows hospitals and similar institutions to train their models on local data without transferring raw data [10]. As an effective option to improve the privacy and security of healthcare datasets, FL complies with data privacy rules and avoids gathering diverse datasets in one location. FL’s primary goal is to allow local model training on phones, cloud servers, and various institutions simultaneously. In this process, the global model is updated by transmitting the parameters of the local models to a global server. As a result, various data owners can participate in the training process while protecting local data in terms of confidentiality, trustworthiness, and accuracy. FL also eliminates the need for the global server to have access to the entire dataset.

FL has advantages, but it is still less effective than traditional DL approaches. One of the main reasons for this performance gap is the difficulty of synchronizing models over distributed networks. This paper presents FedCycle as an incremental improvement of the FedAvg algorithm that addresses its limitations. FedCycle modifies the aggregation schedule by performing more frequent global updates, which helps to improve training stability. In this setting, the global model is updated after each local training epoch, while traditional FL approaches usually perform aggregation after several local epochs. Frequent synchronization allows the model to be updated continuously, which leads to more stable training under heterogeneous data distributions. As a result, FedCycle yields a more stable and efficient training process while maintaining the privacy and security advantages of FL.

In this study, different scenarios are designed to evaluate the performance of FedCycle. The aim is to assess FedCycle’s robustness and effectiveness across homogeneous and heterogeneous settings with differences in data distribution, modality, and domain. A homogeneous structure is used in Scenario 1, whereas heterogeneous structures are used in Scenarios 2 and 3.

FedCycle provides a robust method for improving performance in FedAvg across different fields. Its application is particularly critical in healthcare, where it helps reduce overfitting and client drift and preserves patient privacy across heterogeneous medical datasets in terms of data distribution, modality, and domain. This makes it a practical solution for managing sensitive medical data distributed in multiple healthcare institutions. To our knowledge, this is the first FL study to use different data structures in breast cancer images. FedCycle demonstrates strong performance in handling heterogeneous data. In particular, fine-tuning effectively improves FedCycle’s performance. The key contributions of this study are as follows:

FedCycle provides an incremental improvement to the FedAvg algorithm by allowing all clients to perform local training and synchronously update the global model after each local epoch.
FedCycle offers an effective solution to the client drift and overfitting problems commonly observed in FedAvg by combining model updates in a cyclical and balanced manner, thereby achieving more stable and consistent global convergence.
FedCycle is evaluated for different scenarios, including various data distributions, modalities, and domains. It demonstrates superior performance across all clients, whether trained in different subsets of the same dataset or on entirely heterogeneous datasets.
By combining breast cancer datasets with various modalities and formats, this work addresses the problem of heterogeneous and non-Independent and Identically Distributed (non-IID) medical data, highlighting both domain and modality heterogeneity.

The remainder of this paper is structured as follows: Section 2 provides detailed information on existing FL studies, and Section 3 presents an overview of the FL approach and its challenges. Section 4 gives details about FedCycle, and the methodology is given in Section 5. The experimental scenarios are given in Section 6. Results and discussions of the three scenarios, ablation studies, and statistical analysis are given in Section 7. Finally, Section 8 provides concluding remarks.

2. Related Work

FL is an advanced approach that has recently been used across many domains. It has developed rapidly to the point of providing solutions to problems related to security weaknesses, the inability to collect data on a single global server, and the heterogeneity of data. These issues cause some difficulties in developing accurate and robust models. FL emerged as a promising approach to address the drawbacks of traditional DL approaches, which include network traffic, data quality, availability, data transmission issues, and concerns regarding data privacy. This section summarizes relevant studies in the literature.

2.1. FL for Medical Images

Darzidehkalani et al. [11] examined the technical challenges of FL approaches and the issues related to their real-world applications. The authors discussed problems such as decreased model accuracy, inter-algorithm bias, and non-standardized data caused by collecting data in different distributed environments. The paper provides a detailed review of the current state of FL, identifying various problems and suggesting different solutions. Their observations significantly impacted current research and development in the field and opened new directions for addressing the challenges of FL.

Xie et al. [12] addressed the challenge of centralizing medical data due to the large number of patients and privacy regulations that hinder the safe sharing of data. The FL approach was used to overcome this problem. FL was combined with Secure Multi-Party Computation (SMC) and Differential Privacy (DP), which maintained model weights and generated random noise to prevent data inference. In the proposed method, federated clients downloaded a model, trained it locally, and uploaded the encrypted parameters using the Cheon-Kim-Kim-Song (CKKS) Homomorphic Encryption (HE) algorithm. The federated server then aggregated these parameters using the Federated Average Algorithm (FedAvg) and updated the model using Advanced Encryption Standard (AES) to ensure privacy. The researchers tested four models: ResNet152, DenseNet201, MobileNetV2-100, and EfficientNetB7 for the diagnosis of breast cancer using the BREAKHIS dataset. The results showed that FL performed similarly to traditional DL approaches, thus validating the effectiveness of the framework. ResNet152 performed best, demonstrating high reliability.

To assess the efficiency of FL for the digital analysis of tissue sections, Baid et al. [13] built a framework that involved vertical partitioning of a dataset containing cancer Whole Slide Images (WSIs) from 12 anatomical regions, thus allowing all patients with a particular cancer type to be included. The Visual Geometry Group (VGG) network, a deep Convolutional Neural Network (CNN) architecture that had been successfully applied to ImageNet classification, was chosen because it can perform classification and regression tasks efficiently. The lower layers of the VGG conducted feature extraction, whereas the upper sigmoid layers served as classifiers. The model and training process were integrated with the OpenFL library, allowing training across multiple regions without data sharing. The results showed that the FL model achieved similar quantitative and qualitative performance to a model where all training data were centralized. The study also showed that FL can classify tumor-infiltrating lymphocytes with performance comparable to centralized training, while preserving patient data privacy across institutions.

In the study conducted by Wicaksan et al. [14], two main strategies in FL were identified. The first strategy was the creation of a federated model consisting only of feature extraction layers for feature matching. The second strategy was the use of a federated feature extractor to guide the training of each model. This allowed each client to build a local model using features extracted from the global model. Their technique, named Customized FL (CusFL), was tested on multi-source medical image datasets for prostate cancer detection and skin lesion classification. CusFL is a personalized approach where each client can build an optimized model according to its own data by sharing knowledge through FL. This allows each client to reduce the effect of irrelevant parameters and extract useful features from the global model. Extensive experiments on prostate cancer and the classification of malignant skin lesions showed that CusFL performs better than other methods and traditional FL.

Jiménez-Sánchez et al. [15] presented a memory-aware Curriculum Learning (CL) method for use in an FL environment. This method controlled the order of training examples, with a focus on training examples that were forgotten after applying the global model. It was also combined with unsupervised domain adaptation to address domain shift. The proposed method improved the classification performance. The CL methods prioritized learning through a systematic and stepwise process. Training examples were either weighted using a scoring function or presented to the optimizer in a specific order. This ordering improved classification accuracy by influencing the local minima reached by the optimizer. This study was based on previous work showing the impact of CL in medical image classification. The proposed Fed-Align CL method achieved the highest average Area Under the ROC Curve (AUC) and Precision–Recall AUC (PR-AUC).

Tan et al. [16] used a Transfer Learning (TL) approach within the FL framework for the prediction of breast cancer. Their contribution included using TL to extract features from regions of interest in images. Furthermore, the Synthetic Minority Oversampling Technique (SMOTE) was used to process the data, using balanced and unbalanced breast X-ray datasets. The experimental results of different DL, TL and FL models in the clusters were presented. This study proposed an effective TL-based DL model for the automatic detection and diagnosis of breast cancer. A key aspect was that the knowledge gained from one problem was applied to another. A pre-trained model was used for feature extraction with a Convolutional Neural Network (ConvNet). The extracted features were then classified using a linear classifier. ConvNets analyzed breast X-ray images to detect normal and abnormal cases, identify calcifications or masses, and distinguish between benign and malignant findings. The model achieved a higher accuracy than the others.

2.2. FL for Data Heterogeneity

Recent work has started to address data heterogeneity through multimodal FL. Most traditional FL approaches assume a single data modality, which is often unrealistic. In practice, clients may have different data modalities, leading to heterogeneous and incomplete data distributions. A recent survey by Che et al. [17] provided a taxonomy of multimodal FL settings and highlighted differences in available modalities across clients as a key challenge. Differences in modalities make model alignment and aggregation more difficult and can negatively affect both performance and communication efficiency.

Li et al. [18] studied the negative impact of statistical heterogeneity (non-IID data distributions) on model performance among clients in FL environments. The study demonstrated that variations in the global model’s learning process result from each client’s unique data distribution and computational capabilities. By including a moderating term during local training, the Federated Proximal (FedProx) algorithm aims to reduce client deviation from the global model. This strategy produced a more stable and balanced learning process, particularly in situations with statistically heterogeneous data and system conditions. Their experimental results showed that FedProx offers more stable convergence than FedAvg. However, the main drawback of FedProx is its dependence on an extra hyperparameter $\mu$. If this parameter is not selected correctly, performance may decrease. Moreover, it does not show a significant advantage over FedAvg when the data distributions are close to IID.
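The role of the moderating (proximal) term can be illustrated with a minimal sketch. It assumes the usual FedProx local objective, in which a penalty proportional to the squared distance from the global weights, scaled by the hyperparameter described above, is added to the client's loss. The function name `fedprox_local_step` and the flat-list weight representation are hypothetical choices made for illustration, not part of the cited work's code.

```python
def fedprox_local_step(w_local, w_global, grad, lr, mu):
    """Take one gradient step on F_k(w) + (mu / 2) * ||w - w_global||^2.

    `grad` is the gradient of the client's own loss F_k at `w_local`.
    The extra term mu * (w_local - w_global) is the gradient of the
    proximal penalty; it pulls the local model back toward the global one.
    """
    return [
        wl - lr * (g + mu * (wl - wg))
        for wl, wg, g in zip(w_local, w_global, grad)
    ]


# With a zero loss gradient, only the proximal pull acts on the weights.
pulled = fedprox_local_step([1.0], [0.0], [0.0], lr=0.1, mu=1.0)

# With mu = 0 the update is an ordinary gradient descent step.
plain = fedprox_local_step([1.0], [0.0], [0.5], lr=0.1, mu=0.0)
```

Setting `mu` too high keeps clients close to the global model but limits how much they learn locally, which is one way to read the sensitivity to this hyperparameter noted above.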

The Stochastic Controlled Averaging for FL (SCAFFOLD) approach presented by Karimireddy et al. [19] aims to mitigate the client drift that occurs in FL environments. The study highlighted how updates based on local data cause the local models to gradually diverge from the global model, leading to slower and less stable training. Control variables maintained at both the client and server sides were used to align local updates with the global optimization direction. Particularly in non-IID data distributions, the experimental results showed that SCAFFOLD provided faster and more stable convergence than the FedAvg method. However, the method is more complex, since extra control variables must be maintained for each client and the server. This increases both memory usage and communication costs, which is a major drawback in real-world applications.
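The way control variables correct a local update can be sketched as follows. This is a minimal illustration that assumes the standard SCAFFOLD local step, in which the client gradient is shifted by the difference between the server and client control variables (often called control variates). The function name and the flat-list representation are hypothetical, and the rule for refreshing the control variables themselves is omitted.

```python
def scaffold_local_step(y, grad, c_local, c_global, lr):
    """Apply one drift-corrected local step: y <- y - lr * (g - c_i + c).

    `c_local` (c_i) estimates this client's typical update direction and
    `c_global` (c) estimates the average direction over all clients, so
    (c - c_i) steers the local step toward the global direction.
    """
    return [
        yi - lr * (g - ci + c)
        for yi, g, ci, c in zip(y, grad, c_local, c_global)
    ]


# Toy numbers: the raw gradient 0.5 is corrected to 0.5 - 0.2 + 0.1 = 0.4.
corrected = scaffold_local_step([1.0], [0.5], [0.2], [0.1], lr=0.1)
```

Each client has to store its own control variable and exchange updates to it with the server, which is the source of the additional memory and communication cost mentioned above.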

Gupta et al. [20] suggested MIMS-FL, an FL approach that classifies depression from heterogeneous multimodal data by processing each data type, including text, audio, and Electroencephalography (EEG), separately. Each dataset was evaluated individually. Long Short-Term Memory (LSTM), Bidirectional Long Short-Term Memory (Bi-LSTM), and CNN models were tested for each type of data, and Bi-LSTM produced the best results (voice: 99.33%, EEG: 98.95%, text: 96.97%). The system maintained high accuracy during testing in both IID and non-IID scenarios. The proposed approach can process different data modalities independently while preserving data privacy. This feature distinguishes it from other depression classification approaches.

The Multimodal Federated Cross Prototype Learning (MFCPL) approach was proposed by Le et al. [21] to address the issue of missing modalities in clients with different modalities. The approach used cross-modal prototypes and modality-specific contrastive mechanisms to process data from each modality at the local and global levels. In order to account for statistical heterogeneity and missing modality issues in global model evaluation, experiments were conducted on four distinct multimodal datasets, each of which was tested independently on its own dataset. The results showed that reducing the misalignment caused by missing modalities improved the global model’s performance and the capacity for generalization.

Feng et al. [22] evaluated multimodal FL on ten datasets: CREMA-D, MELD, MOSI, MOSEI (emotion), UCF101 (action), PAMAP2 (human activity), MIMIC-IV (EHR-healthcare), CrisisMMD, MVSA-Single and MVSA-Multiple (social media). Each dataset was independently evaluated within its own task. Multiple modalities, such as image, text, audio, and sensor data, were used. Heterogeneity was simulated within each dataset through non-IID distributions and controlled missing settings. Real datasets were used, but clients were created by splitting each dataset internally rather than assigning one dataset per client. FedAvg, FedProx, FedOpt, and FedRS were compared under missing modality, missing label, and noisy label settings. The results showed that FedOpt achieved higher classification performance in standard and low-heterogeneity settings. FedRS was more robust when the label distribution was highly imbalanced. Under missing modality, the performance of all approaches decreased noticeably, but FedOpt remained relatively more stable. Under noisy labels, the performance differences between approaches were smaller, and no single method consistently performed better than the others. FedAvg provided stable baseline performance across datasets, but its performance decreased noticeably under modality-level heterogeneity, especially when modalities were missing. It should be noted that heterogeneity in this study was defined through missing modalities, missing labels, noisy labels, and differences in data distribution across clients.

Ridolfi et al. [23] implemented an FL framework on Raspberry Pi devices and used the CIFAR-10 dataset for image classification. The study does not contain modality diversity. They distributed the dataset with different data sizes to each client, which created data heterogeneity. The study used the FedAvg algorithm with a CNN-based model and evaluated performance in a setup that included real Raspberry Pi devices and virtual clients. The main contribution was the examination of FL under limited resources and varying client conditions, with a focus on communication and training time. The results showed that FedAvg achieved accuracy close to centralized training (around 60–65%), while client differences and hardware limitations affected convergence speed and overall training time.

2.3. Limitations of Existing Studies and Motivation

The proposed solutions address the issues of bias and statistical data heterogeneity in healthcare by implementing the FL approach, which allows for the processing of diverse health data from different hospitals or institutions while ensuring data confidentiality. Existing studies on modality heterogeneity generally involve data such as text and audio. Furthermore, there are very few studies that use TL approaches on completely different datasets [24]. Although studies on datasets with different modalities are limited, classification studies involving different modalities and domains within a specific healthcare dataset are not yet available. Moreover, studies that use real-world healthcare imaging datasets from different domains as clients are still limited. In most existing work, clients are created by splitting a dataset rather than using data collected from different institutions. As a result, the setting in which such datasets are used as independent clients to update a shared global model remains underexplored. There is no empirical evidence that the algorithms used for statistically heterogeneous data are better than FedAvg in all conditions [25]. Moreover, FedAvg-based algorithms developed for non-IID data (FedProx, SCAFFOLD, FedSGD) are insufficient due to communication costs and complexity. This study covers heterogeneity in modalities, domains, and data distributions. In this study, FedCycle, an incremental training method based on FedAvg, was presented to reduce client drift and improve FL model performance, and its performance was evaluated on multiple real-world datasets.

Table 1 shows the main contributions and methods used in previous studies.

3. Federated Learning

FL is one of the most important DL approaches. It has been developing for the last few years. FL preserves distributed data and trains models on local devices, unlike traditional centralized data collection and processing. This approach is particularly useful for sectors such as healthcare and finance, which have high privacy and security requirements.

Accessing large datasets stored in a single location is difficult, and privacy issues make this problem harder. These challenges have contributed to the rapid adoption of FL. Collecting sensitive data such as medical images, personal health information, and financial data on a global server can lead to privacy and security risks. Furthermore, collecting and standardizing large amounts of data on a single server is quite challenging. Instead of transferring data to a global server, FL keeps data distributed across local devices and performs model training on these devices. Each device independently computes updates to the global model based on its local data without sharing raw data. FL can thus reduce network traffic, eliminate the need for data sharing, and support privacy through distributed model training. These features have made it a preferred approach in areas such as healthcare, finance, and the Internet of Things (IoT).

FL develops robust and high-performance models during training by using DL techniques such as CNNs, Recurrent Neural Networks (RNNs), Autoencoders (AEs), and Transformer-based architectures. Local training is performed automatically on the end devices. This eliminates the need for human intervention. Multiple institutions can act as clients and contribute to the model training process in healthcare systems. A global server combines training parameters from clients to update the global model. Techniques such as FedAvg [26], FedProx [18], and SCAFFOLD [19], which are derived from FedAvg, are widely used to update the global model. These approaches are secure and privacy-preserving techniques that minimize the risk of data leakage to third parties. Nonetheless, the implementation of FL still has challenges. These are high communication costs, non-uniform data distribution, individual client bias, privacy and security issues, and the problem of model generalization [27,28]. Different data distributions and formats across devices can affect model generalization and lead to client drift. Proper model generalization is very important in terms of the reliability and acceptance of FL.

In healthcare systems, critical issues such as data standardization, patient privacy, data security, bias, modality and domain heterogeneity are still common in FL approaches. Most of these issues have a direct impact on model performance and reliability in both theoretical and practical applications. Data in a hospital might be heterogeneous in terms of quality, format, size, modality, and domain. This is a common problem across hospitals performing medical imaging, where the latest data management standards are not consistently followed and data standardization is not effectively applied. Data management also becomes more complex due to the lack of clear methods for sorting and handling patient records. In addition, hospital pre-selection results in bias in FL studies. When sensitive information is protected by strict privacy rules, unauthorized access to private data is a serious security and privacy concern. It is very important to ensure that the models are secure and robust in real-world applications.

Among these fundamental challenges, data heterogeneity and bias present critical limitations in healthcare applications of FL. These two factors directly affect the accuracy and stability of the global model. Our study focuses on these two problems within the context of FedCycle. The definitions of these two problems are provided below.

(a) Data, modality, and domain heterogeneity: The distributed nature of the data limits the verification of the integrity and quality of the FL model’s results. In almost all cases, the accuracy of the FL model is reduced by heterogeneous data, depending on the degree of heterogeneity. Since local models are trained using data from different patient profiles, a global model may not adequately represent all profiles. In some cases, heterogeneity in data distributions, modalities, and domains leads to client drift, overfitting, and overall model performance degradation.

(b) Bias: Bias is a common problem in distributed networks. In such networks, a neural network may prioritize data from certain clients, leading to better performance on those clients and worse performance on others. This bias can arise from differences in the size, modality, or distribution of data between clients. Furthermore, the FL algorithm itself can introduce bias.

Federated Average (FedAvg) Algorithm

The FedAvg algorithm is an FL approach that minimizes communication and bias [26]. The process begins when the devices receive the current model from the global server. Each device updates this model using its own local data and then transmits the parameters to the global server. The global server integrates the received updates into the global model using FedAvg, then redistributes the aggregated model to all client devices. It is important to note that clients only share the model’s weights with the global server. Training data are stored locally on each device, and the global server does not store individual updates.

FedAvg determines the new weights of the global model by averaging the model weights received from the clients, typically weighting each client by the size of its local dataset, and then updating the global model with these new weights. In this way, the global model is upgraded with the contributions of the clients.
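The averaging step can be sketched in a few lines. The example assumes the commonly used data-size-weighted form of FedAvg and represents each model as a flat list of floats. The function name `fedavg` and its parameters are hypothetical names introduced for illustration.

```python
def fedavg(client_weights, client_sizes):
    """Return the data-size-weighted average of the clients' model weights.

    `client_weights` holds one flat weight list per client, and
    `client_sizes` holds the number of local training samples per client.
    """
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


# Two clients; the second holds three times as much data as the first,
# so the average lies closer to the second client's weights.
new_global = fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

Passing equal sizes for all clients turns this into a plain arithmetic mean of the client models.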

FedAvg performs the computation on client devices, thus reducing data transmission to a minimum. It is a powerful tool for managing FL and enhances privacy and security.

4. FedCycle: An Incremental Improvement of FedAvg

FL reduces communication delays and conserves network resources, such as spectrum and transmission power, during training. It also ensures secure, privacy-preserving healthcare systems and reduces the risk of data leakage to third parties.

However, traditional FL approaches may lead to suboptimal model updates, particularly under heterogeneous settings, including differences in data distributions, domains, or modalities across clients. This study presents FedCycle (Figure 1) to deal with this limitation. FedCycle is an incremental improvement of the FedAvg algorithm and can be seen as a scheduling variant of FedAvg. It modifies the aggregation schedule through more frequent global updates. This improves model performance.

In traditional FL approaches, each client updates the global model after completing all training epochs. On the other hand, in FedCycle, each client incrementally updates the global model after each local training epoch. The FedCycle process begins with the creation of a global model with random weights. Each client iteratively trains its local model for one epoch using its own dataset over a fixed number of rounds. After each local training epoch, the clients calculate the average weights by combining the local and global model weights. It is important to note that FedCycle does not modify the underlying optimization or aggregation rules of FedAvg. It uses the FedAvg algorithm when calculating average weights. These average weights are then sent back to the global server and all other clients. This method progressively improves the performance of the global model by repeatedly performing the client training and weight averaging steps over multiple rounds. Using more communication rounds with one epoch of local training per round allows clients to explore the local dataset more extensively and add more sophisticated updates to the global model.

Let K be the total number of clients, and let $w_{t}$ denote the global model at the beginning of communication round t. Let $w_{t + 1}^{m_{j}}$ be the local model obtained after client $m_{j}$ completes one local training epoch in round t, for $j = 1, \dots, K$. The FedCycle update for the global model at the end of each local epoch is defined as:

w_{t + 1} = \frac{1}{K} \sum_{j = 1}^{K} w_{t + 1}^{m_{j}},

(1)

where $w_{t + 1}$ denotes the global model obtained after aggregating all client updates at the end of communication round t, and $w_{t + 1}^{m_{j}}$ represents the local model of client $m_{j}$ after completing one local training epoch in round t.

Equation (1) shows that the global model is updated after each local epoch using a simple average of all client models. This process is repeated for all rounds, enabling faster integration of client information.

The core principle of FedCycle is its ability to leverage data from distributed clients to improve the overall learning process. By allowing clients to incrementally update the global model, FedCycle enables deeper learning. It also makes it easier for the global model to adapt to the characteristics of local data. This approach generally yields better bias rates and model performance than FedAvg. Algorithm 1 illustrates the FedCycle process.

Algorithm 1 FedCycle: Aggregation Strategy Based on FedAvg

1: Server-side initialization:
2: Initialize global model $w_{0}$
3: for each communication round $t = 0, 1, \dots, T - 1$ do
4:     Broadcast $w_{t}$ to all clients.
5:     Local training:
6:     for each client $m_{j}$, $j = 1$ to K synchronously do
7:         $w_{t + 1}^{m_{j}} \leftarrow ClientEnd (m_{j}, w_{t}, 1)$ {Train client $m_{j}$ for 1 local epoch}
8:     end for
9:     Global aggregation:
10:     $w_{t + 1} \leftarrow \frac{1}{K} \sum_{j = 1}^{K} w_{t + 1}^{m_{j}}$ {Aggregate client models to form the new global model}
11: end for
12: Broadcast final model $w_{T}$ to all clients
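
As an illustrative sketch only (not the authors' implementation), the loop of Algorithm 1 can be mimicked in plain Python on a toy one-parameter regression problem. The helper names `local_epoch`, `average`, and `fedcycle`, the three synthetic clients, and the learning rate are assumptions introduced for this example; `local_epoch` merely stands in for ClientEnd.

```python
def local_epoch(weights, data, lr=0.1):
    """One pass of per-sample SGD on the toy loss (w*x - y)^2 (stand-in for ClientEnd)."""
    w = list(weights)
    for x, y in data:
        grad = 2.0 * (w[0] * x - y) * x
        w[0] -= lr * grad
    return w


def average(models):
    """Unweighted mean of the client weight vectors, as in Equation (1)."""
    k = len(models)
    return [sum(m[i] for m in models) / k for i in range(len(models[0]))]


def fedcycle(client_data, rounds, w0):
    """Broadcast, train every client for one local epoch, then aggregate; repeat."""
    w = list(w0)
    for _ in range(rounds):
        local_models = [local_epoch(w, data) for data in client_data]
        w = average(local_models)
    return w


# Three hypothetical clients whose data follow different slopes (1, 2, 3),
# which makes the local objectives heterogeneous.
clients = [[(x, s * x) for x in (0.5, 1.0, 1.5)] for s in (1.0, 2.0, 3.0)]
w_final = fedcycle(clients, rounds=50, w0=[0.0])
print(w_final)  # settles near the mean slope of the three clients
```

In this toy setting every client applies the same contraction toward its own optimum, so the averaged model settles at the mean of the client optima. Real CNN training does not share this property.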

Theoretical Insight on Aggregation Frequency

Let $F_{j} (w)$ denote the local objective function of client $m_{j}$, and let the global objective be defined as

F (w) = \frac{1}{K} \sum_{j = 1}^{K} F_{j} (w) .

(2)

Under heterogeneous data distributions, the local objectives $F_{j} (w)$ may differ from the global objective $F (w)$. In standard FedAvg, each client performs E local training epochs prior to global aggregation. Starting from the global model $w_{t}$, the deviation of the local model after E local epochs can be expressed as

\| w_{t, E}^{m_{j}} - w_{t} \| \leq \eta \sum_{e = 0}^{E - 1} \| \nabla F_{j} (w_{t, e}^{m_{j}}) \| .

(3)

This inequality follows from recursively expanding the local updates and applying the triangle inequality, where $\eta$ denotes the learning rate. Assuming that the local gradients are bounded by a constant G, i.e., $\| \nabla F_{j} (w) \| \leq G$, the above expression yields

\| w_{t, E}^{m_{j}} - w_{t} \| \leq E \eta G .

(4)

This result indicates that the bound on the deviation between local and global models grows linearly with the number of local epochs E. Therefore, in heterogeneous settings, large values of E may lead to significant client drift. Reducing the number of local epochs limits this deviation. As a result, more frequent aggregation maintains closer alignment across clients and leads to more stable convergence under heterogeneous settings.

In the presented FedCycle, aggregation is performed after each local training epoch, i.e., $E = 1$. In this case, the deviation becomes as follows:

\| w_{t, 1}^{m_{j}} - w_{t} \| \leq \eta G,

(5)

which limits the accumulation of client drift.
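
The bounds in Equations (3) to (5) can be checked numerically on a toy problem. The sketch below is an illustration under stated assumptions rather than part of the original analysis: each local epoch is modeled as a single full-batch gradient step (which matches the form of Equation (3); with mini-batch training an epoch contains many steps and the bound scales with the step count), the local objective is a hypothetical quadratic, and G is taken as the largest gradient norm observed along the trajectory.

```python
import math


def grad(w, target):
    """Gradient of the toy local objective F_j(w) = 0.5 * ||w - target||^2."""
    return [wi - ti for wi, ti in zip(w, target)]


def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def local_drift(w_global, target, lr, steps):
    """Run `steps` gradient steps from w_global.

    Returns (drift, sum of gradient norms, largest gradient norm).
    """
    w = list(w_global)
    grad_norms = []
    for _ in range(steps):
        g = grad(w, target)
        grad_norms.append(norm(g))
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    drift = norm([a - b for a, b in zip(w, w_global)])
    return drift, sum(grad_norms), max(grad_norms)


w_t = [0.0, 0.0]
client_optimum = [3.0, 4.0]  # hypothetical client optimum, far from w_t
lr = 0.1
report = {}
for E in (1, 5, 10):
    drift, grad_sum, G = local_drift(w_t, client_optimum, lr, E)
    # drift <= lr * grad_sum is Equation (3); lr * grad_sum <= E * lr * G is Equation (4)
    report[E] = (drift, lr * grad_sum, E * lr * G)
print(report)
```

For E = 1 the drift equals the single-step bound of Equation (5), and the measured drift grows as E increases while staying below E η G.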

Moreover, frequent aggregation repeatedly aligns local models with the global model before excessive adaptation to client-specific data occurs. From a generalization perspective, let $R_{j} (w)$ and ${\hat{R}}_{j} (w)$ denote the population and empirical risks of client $m_{j}$, respectively. The generalization gap can be written as $| R_{j} (w) - {\hat{R}}_{j} (w) |$, and it generally increases as the model becomes more adapted to local data. When multiple local epochs are carried out without aggregation, the model may start to overfit to client-specific distributions, leading to a larger gap. In contrast, more frequent aggregation restricts this effect by repeatedly aligning local models with the global objective. In this sense, it acts as an implicit regularization mechanism and helps reduce overfitting at the client level. As a result, the training process becomes more stable, particularly under heterogeneous data distributions. Therefore, the aggregation frequency directly controls both optimization stability and generalization behavior.

5. Methodology

In Scenario 1, the datasets were copied to a local machine and processed in an Anaconda environment. These experiments were conducted on a laptop equipped with an 11th Generation Intel® Core™ i7-1165G7 processor (Intel Corporation, Santa Clara, CA, USA), up to 4.70 GHz, 8 GB of 3200 MHz DDR4 memory, and 512 GB of SSD storage. The integrated Intel® Iris® Xe Graphics served as the hardware accelerator within the Anaconda environment.

For Scenarios 2 and 3, the datasets were obtained from Kaggle and processed using Google Colab. This environment was chosen to accelerate the training of large-scale, pre-trained CNN-based models. The experiments in Scenarios 2 and 3 were conducted in a Google Colab environment equipped with an Intel® Xeon® CPU running at 2.00 GHz. During the experiments, CPU usage was approximately 1.1%, with a total of 334.56 GB of RAM, of which 2.63 GB (1.5%) was utilized. The system had a total disk capacity of 225 GB, with 22 GB (9.8%) in use, leaving 202 GB of available storage. Additionally, a v2-8 TPU was employed as the hardware accelerator to enhance computational performance.

The KaggleHub, TensorFlow, and Keras libraries were used for data collection, pre-processing, and implementing pre-trained CNN-based models for all scenarios. In all scenarios, the datasets are divided into 80% for training, 10% for validation, and 10% for testing.
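
A minimal sketch of an 80/10/10 split is given below for illustration. The function name `split_80_10_10`, the fixed seed, and the choice of a plain (non-stratified) shuffle are assumptions; the text does not state how the split was drawn.

```python
import random


def split_80_10_10(items, seed=0):
    """Shuffle a copy of `items` and cut it into train/validation/test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 8 // 10   # 80% for training
    n_val = n // 10         # 10% for validation; the remainder is the test set
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Hypothetical file identifiers standing in for image paths.
train, val, test = split_80_10_10(range(100))
print(len(train), len(val), len(test))
```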

To evaluate the performance of the model, accuracy, precision, recall, F1-score, and AUC were calculated for the test data. In addition, the accuracy and loss values of the training, validation, and test data were evaluated to detect potential overfitting in Scenarios 2 and 3. Client drift was quantified as the L2 (Euclidean) norm of the parameter difference between local and global models after each communication round ($\| w_{k}^{(t)} - w^{(t)} \|_{2}$). In this study, pre-trained CNN-based models were employed as the primary DL approach for classification. Furthermore, the global model was fine-tuned to improve its adaptation to client-specific data distributions in Scenario 1 and Scenario 3. All experiments were repeated multiple times (5 runs), and the reported performance metrics are presented as the mean ± standard deviation across runs.
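
The client drift metric described above can be sketched as follows. This is an illustrative helper, not the authors' code: the name `client_drift` and the representation of a model as a list of flat per-layer weight lists are assumptions made for the example.

```python
import math


def client_drift(local_layers, global_layers):
    """L2 norm of the difference between local and global parameters.

    Each model is given as a list of layers, each layer a flat list of floats.
    """
    if len(local_layers) != len(global_layers):
        raise ValueError("models have a different number of layers")
    squared = 0.0
    for local_w, global_w in zip(local_layers, global_layers):
        if len(local_w) != len(global_w):
            raise ValueError("layer sizes differ")
        squared += sum((a - b) ** 2 for a, b in zip(local_w, global_w))
    return math.sqrt(squared)


# Hypothetical two-layer models.
local_model = [[3.0], [0.0, 4.0]]
global_model = [[0.0], [0.0, 0.0]]
print(client_drift(local_model, global_model))
```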

5.1. Dataset

The datasets used in this study consist of real-world clinical imaging data representing real patient samples rather than synthetic or simulated data. Unlike commonly used non-IID partition strategies such as the Dirichlet distribution, where a single dataset is artificially divided across clients, the heterogeneity in this study arises from entirely different datasets assigned to each client. This leads to differences not only in label distribution but also in domain and modality.

In particular, the datasets differ in imaging modality, collection conditions, and data characteristics, resulting in variations that are not limited to standard label-skew or quantity-skew settings. This introduces a more complex form of heterogeneity compared to simulated non-IID partitions. In contrast, Dirichlet-based partitioning introduces only statistical heterogeneity within a single data domain, while our setup incorporates both statistical and structural heterogeneity across different domains and modalities.

Such a setup reflects real-world FL scenarios in healthcare. Data are naturally distributed across institutions with different sources, devices, and protocols. As a result, the data distributions across clients are inherently non-IID, rather than being artificially partitioned from a single dataset. The BREAKHIS [29] dataset was used for Scenario 1, three different datasets (BREAKHIS, ROBOFLOW [30], RSNA [31]) were used for Scenario 2, and five different datasets (BREAKHIS, ROBOFLOW, RSNA, BUSI, BCFPP) were used for Scenario 3. The datasets contain two types of images: normal and cancer, and the experiments are formulated as a binary classification problem. Each dataset has different types, characteristics, and distributions. Descriptions of the datasets are given in the following items.

BREAKHIS: The data consist of color, high-resolution histopathological microscopy images used for breast cancer diagnosis, classified into normal and cancer classes to aid in the detection and classification of breast cancer.
BUSI: The number of patients is 600. The data contain grayscale breast ultrasound (US) images obtained from female patients between 25 and 75 years old.
RSNA: The data consist of grayscale mammography (X-ray) images taken for screening purposes.
ROBOFLOW: The ROBOFLOW data contain grayscale mammography (X-ray) images collected from different sources.
BCFPP: The data are obtained from grayscale computed tomography (CT) images.

The number of images according to the scenarios is given in Table 2.

5.2. Pre-Processing

The data were pre-processed to prepare them for model training. The following steps were performed for each dataset in all experimental scenarios:

Image loading and resizing: All images were read in the appropriate format and resized to the model’s input size specified by the dataset.
Color channel standardization: Single-channel (grayscale) images were converted to RGB format and expanded to three channels. In the BCFPP dataset, single-channel images were converted to RGB, and the labels were converted to a one-dimensional tensor.
Normalization: Image pixels were normalized to match the model’s pre-trained weights. Pixels were scaled to the range [0, 1].
Dataset optimization: The data were divided into batches and shuffled using the TensorFlow Dataset library. Prefetch caching was also used during training to improve performance.
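
The channel standardization and normalization steps listed above can be sketched in plain Python as follows. The function names, the nested-list image representation, and the assumption of 8-bit pixel values (maximum 255) are introduced for illustration only; resizing, batching, and prefetching are omitted.

```python
def to_three_channels(gray_image):
    """Repeat each grayscale pixel across three channels (H x W -> H x W x 3)."""
    return [[[p, p, p] for p in row] for row in gray_image]


def scale_to_unit(image, max_value=255.0):
    """Scale pixel values to [0, 1]; max_value=255 assumes 8-bit input."""
    return [[[c / max_value for c in pixel] for pixel in row] for row in image]


gray = [[0, 128], [255, 64]]  # hypothetical 2 x 2 grayscale image
rgb = scale_to_unit(to_three_channels(gray))
print(rgb[1][0])
```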

It should be emphasized that the class distributions were balanced to ensure equal representation across classes, and consistent randomization procedures (shuffling) were applied throughout training and validation to minimize bias and overfitting. The pre-processed image samples are shown in Figure 2.

5.3. Pre-Trained CNN Models

The use of pre-trained CNN models in medical image analysis has significantly improved cancer diagnosis. This study examines five different CNN-based models for diagnosing cancer in breast images: MobileNetV2, EfficientNetB0, MobileNetV3-Small, DenseNet201, and ResNet152. Initially, the experiments included Inception models alongside EfficientNetB0, MobileNetV2, and ResNet152. However, Inception was later excluded from further training on different datasets. The models were trained on an extended dataset where class imbalance was addressed through oversampling within each client. No random sampling or dynamic selection of clients was performed during training. The learning rate was initialized at 0.001 and allowed to decrease automatically if the validation loss stopped improving for several epochs. The ReduceLROnPlateau callback was used for this adjustment and was applied only when necessary. In practice, no learning rate reduction was triggered during training, and the learning rate remained constant across all experiments. EarlyStopping was also enabled as a safeguard against overfitting by monitoring validation performance. However, it was not activated in any experiment and all models were trained for the full 10 epochs. Therefore, the results reported in this study correspond to the final epoch. The hyperparameters and the number of parameters of all models are given in Table 3 and Table 4.

6. Experimental Scenarios

In this study, different scenarios are designed to evaluate the performance of FedCycle (Figure 3). (i) The first scenario represents a single-source homogeneous structure, as both the data type and the source are identical. (ii) In the second scenario, a heterogeneous structure is created based on the data distribution, modality, and domain across three datasets: two mammography datasets and one containing histopathological images. (iii) In the third scenario, a heterogeneous structure based on data distribution, modality, and domain is built from five different datasets.

6.1. Baseline FL Algorithms for Comparison

This section compares the presented FedCycle method with benchmark FL algorithms in terms of accuracy and communication cost.

SCAFFOLD [19], FedProx [18], FedSGD [32], FedNova [33], FedAvg [26], and FedCycle were evaluated in both homogeneous and heterogeneous settings (Table 5). FedProx, SCAFFOLD, FedSGD, and FedNova are all derived from the FedAvg algorithm with additional mechanisms. The results indicate that these added mechanisms do not consistently improve performance. Although these algorithms aim to reduce client drift or cope with heterogeneity, some of them introduce extra communication or additional processing, which leads to longer training times and does not consistently improve accuracy and loss compared to FedAvg in our experimental setup. In contrast, FedCycle follows the same simple update structure as FedAvg and avoids extra communication messages and extra local computation. While it uses multiple communication rounds, FedCycle achieves higher accuracy with substantially shorter training time than other complex baselines, showing that its performance comes from incremental update behavior rather than increased algorithmic complexity.

Scenario 3 is also evaluated for all baseline algorithms. The results show that although these methods introduce additional mechanisms to address heterogeneity, they do not consistently outperform FedAvg across all datasets in this real-world experimental setup. In contrast, FedCycle achieves the highest accuracy across all datasets while preserving comparable time efficiency to FedAvg. This shows that the observed performance improvements are mainly related to the aggregation schedule rather than increased algorithmic complexity.

6.2. Scenario 1: FedCycle with a Single Dataset Based on Data Homogeneity

In the first scenario, the presented method was applied to a single BREAKHIS histopathology dataset. The dataset is divided into four equal clients. Each client is trained on its respective subset of the dataset and contributes updates to the global model after each local epoch. To improve the model’s generalization ability, data augmentation was applied using the albumentations library. Horizontal flipping, rotation, brightness and contrast adjustment, cropping, and blurring were applied for augmentation. Furthermore, oversampling was applied to eliminate class imbalance within the training data of each client. Before local training, the number of samples in each class was examined, and the class with the highest sample count was taken as a reference. Samples from the remaining classes were then randomly duplicated until all classes reached the same size.

In this way, a balanced training dataset was obtained for each client, ensuring equal class representation without generating synthetic data. Fine-tuning was used to improve the model’s accuracy and robustness by refining the pre-trained weights.
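
The per-client oversampling procedure described above (take the largest class as reference, then randomly duplicate samples of the other classes) can be sketched as follows. The function name `oversample` and the fixed seed are assumptions made for this example.

```python
import random
from collections import defaultdict


def oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until all classes match the largest one."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw existing samples with replacement; no synthetic data is generated.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Hypothetical client data: four samples of class 0 and one of class 1.
xs, ys = oversample(["a", "b", "c", "d", "e"], [0, 0, 0, 0, 1])
print(ys.count(0), ys.count(1))
```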

The experiments were conducted using images at multiple magnification factors (40X, 100X, 200X, and 400X). During training, all magnifications were used together, and the CNN-based models (EfficientNetB0, MobileNetV2, ResNet152, InceptionV3, and DenseNet201) were compared in a centralized manner on the entire dataset.

Following the initial configuration, the method was further trained on subsets of the BREAKHIS dataset to extend the experimental scope. In contrast to the previous configuration, the global model was updated at longer intervals, specifically after every 2, 3, and 4 local epochs.

This configuration evaluates how the aggregation interval affects performance. If aggregating every 2, 3, or 4 epochs had performed similarly to or better than aggregating after every epoch, the longer interval could have been used to reduce training time and resource consumption.

6.3. Scenario 2: FedCycle with Three Different Datasets Based on Modality and Domain Heterogeneity

In the second scenario, the experiments were extended beyond the existing BREAKHIS histopathological images by adding two different mammography datasets. Traditional centralized learning, FedAvg, and the presented FedCycle method were applied to the BREAKHIS, RSNA, and ROBOFLOW datasets. Each dataset is assigned to an individual client, which trains on its own data and contributes updates to the global model after each local epoch. The pre-trained CNN-based models compared are EfficientNetB0 and MobileNetV2. The performance of traditional centralized learning, FedAvg, and FedCycle was then compared.

The computational time was also reported for each model. This scenario evaluates the performance of FedCycle on three datasets with different characteristics. In this context, the BREAKHIS, RSNA, and ROBOFLOW datasets were used as individual clients. This tests the robustness and generalizability of the method when dealing with diverse data sources.

6.4. Scenario 3: FedCycle with Five Different Datasets Based on Modality and Domain Heterogeneity

In the third scenario, a heterogeneous structure, including heterogeneity in data distributions, modalities, and domains, was constructed using histopathological, mammography, ultrasound, and CT images. These included the BREAKHIS, RSNA, ROBOFLOW, BUSI, and BCFPP datasets. Each dataset serves as a separate client within the FL framework. Fine-tuning was used to improve the model’s accuracy and robustness by refining the pre-trained weights in this scenario.

Each client was trained locally on a dataset with modality, data distribution, and domain heterogeneity, while FedCycle aggregated the parameters into the global model after each local epoch. The experiment compares the following CNN-based models: EfficientNetB0, MobileNetV3, MobileNetV2, ResNet152, and DenseNet201. The performance of traditional centralized learning, FedAvg, and FedCycle was then compared.

The main objective of this scenario is to broaden the scope of the experiments by incorporating heterogeneous datasets alongside the existing datasets and by increasing the variety of models employed. This enables a more thorough analysis of the performance of FedCycle on mixed datasets with different modalities, domains, and data distributions.

7. Performance Results and Discussions

This section presents a comprehensive performance evaluation of the presented FedCycle method and compares its performance with traditional centralized learning and FedAvg in different scenarios. The evaluation is conducted using various metrics (loss, accuracy, precision, recall, F1-score, and AUC) to ensure a comprehensive comparison.

7.1. Results of Scenario 1

Traditional centralized learning, FedAvg, and the presented FedCycle were tested on the divided BREAKHIS dataset using the EfficientNetB0, MobileNetV2, InceptionV2, and ResNet152 models in Scenario 1. The results of the experiments are summarized in Table 6. According to the results, FedCycle is approximately 3% more accurate than the FedAvg approach. This suggests that the global model is more balanced and robust.

The performance of FedCycle was tested using various combinations of epochs in Scenario 1. As can be seen in Table 6, updating the global model after 1 epoch yields the best results, with 97.5% accuracy, so it is used in subsequent scenarios. As the number of epochs increases, accuracy decreases, so the 1-epoch structure is favored.

7.2. Results of Scenario 2

In CNN-based models, centralized training was performed on each dataset separately using EfficientNetB0 and MobileNetV2. The results are given in Table 7.

In centralized learning, MobileNetV2 outperforms EfficientNetB0 across all datasets, achieving high success rates due to direct data access, particularly achieving 96.78% accuracy on BREAKHIS. Training and validation accuracies converge quickly, and loss values decrease rapidly. On RSNA, MobileNetV2 reaches 78.93% accuracy with steady convergence, indicating no overfitting. Despite being deeper, EfficientNetB0 shows lower accuracy (e.g., 70.20% on RSNA) and a tendency to overfit.

In FedAvg, each dataset is treated as a client, and the global model is updated via FedAvg. On RSNA, MobileNetV2 accuracy decreases to 62.68%, with volatile validation loss and early saturation of accuracy curves. On BREAKHIS, although accuracy is high (96.94%), validation loss fluctuates, suggesting an imbalanced generalization. On ROBOFLOW, accuracy is limited (65.45%), and the gap between validation and test accuracy indicates pronounced overfitting. Overfitting and client drift tendencies are shown in Figure 4.

The presented FedCycle method incrementally updates the global model using FedAvg across clients, unlike traditional FL approaches, thereby improving accuracy on datasets that are heterogeneous in terms of data distributions, modalities, and domains. On RSNA, FedCycle-MobileNet achieves 70.73%, an improvement of about 8 percentage points in accuracy over FedAvg (62.68%), with better alignment between validation and test curves and a smoother loss decrease. On BREAKHIS, accuracy is 86.97%, slightly higher than FedAvg, demonstrating effective synchronization and a balance between local and global performance. ROBOFLOW achieves 72.95%, outperforming both centralized learning and FedAvg, showing FedCycle’s suitability for mixed data structures.

In this scenario, where fine-tuning is not applied, FedCycle alone improves accuracy but shows slight overfitting and client drift effects during training. The results obtained show that MobileNetV2 is the most efficient model in terms of training time.

7.3. Results of Scenario 3

Traditional centralized learning, FedAvg, and the presented FedCycle method were compared on the BREAKHIS, BUSI, RSNA, ROBOFLOW, and BCFPP datasets using EfficientNetB0, MobileNetV2, MobileNetV3-Small, DenseNet, and ResNet models in Scenario 3. The results are given in Table 8, Table 9 and Table 10.

Centralized learning for Scenario 3 shows consistent training and strong generalization. Training accuracy generally increases steadily, while validation and test accuracy stabilize, indicating effective regularization (e.g., dropout and batch normalization) and avoiding overfitting. DenseNet201 and EfficientNetB0 models exhibit lower loss values and faster convergence, whereas MobileNetV2 and MobileNetV3, being lighter, improve more slowly but steadily. Accuracy rises sharply on BREAKHIS and BUSI, reflecting homogeneous data, while fluctuations in RSNA and ROBOFLOW suggest higher variance. The steady decrease in loss for BCFPP indicates a balanced learning environment.

FedAvg for Scenario 3 displays more fluctuating validation and test curves due to heterogeneous client data and asynchronous updates. Although training accuracy reaches certain levels, validation and test accuracy are lower. On RSNA and ROBOFLOW, fluctuations indicate client drift, which limits generalization for clients with less data. Loss curves converge more slowly than in centralized learning, and validation losses remain higher, suggesting suboptimal synchronization. DenseNet201 and EfficientNetB0 stabilize faster, while MobileNetV2 shows higher variance on low-capacity clients. FedAvg performance on BREAKHIS and BUSI is similar to centralized models, but RSNA and ROBOFLOW show fluctuations; BCFPP remains the most stable. As seen in Figure 5, FedAvg is competitive, but suffers from client drift, unstable convergence, and limited generalization under heterogeneous data.

The presented FedCycle method employs cyclic synchronization to guarantee that the final weights of the clients are always consistent. The training, validation, and test curves progress much more steadily and simultaneously, with higher test accuracies than FedAvg. FedCycle has a smoothing effect on the fine-tuning steps, especially on RSNA and ROBOFLOW, as their fluctuations are less pronounced. The loss curves reach parity faster, and the difference between training and validation losses becomes smaller, indicating improved global accuracy and well-balanced local contributions. DenseNet201 and EfficientNetB0 are the most stable ones, whereas MobileNet maintains steady validation despite being lightweight. The results demonstrate that MobileNet has the lowest computational time. Figure 5 shows that FedCycle with fine-tuning not only provides lower overfitting and client drift but also offers accuracy improvements.

Additional result graphs are provided in Appendix A.

Client drift graphs for FedAvg and FedCycle are shown in Figure 6. The drift magnitude is measured using the L2 norm of the local model updates. In comparison to FedAvg, FedCycle generates lower average L2 drift values for the majority of datasets, particularly for BREAKHIS (19.09 → 6.18), BUSI (7.35 → 2.04), and ROBOFLOW (11.33 → 5.08).

The higher drift observed for BREAKHIS in FedAvg indicates that its local updates have a larger impact on the global update during aggregation. This implies that FedCycle helps reduce the divergence between local and global models in Scenario 3 and produces more stable local updates.

Consequently, FedCycle with fine-tuning decreases overfitting, substantially reduces client drift, and improves accuracy compared to FedAvg. FedCycle successfully adapts to heterogeneous data distributions, modalities, and domains. This implies that the method provides a global synchronous learning mechanism that can be used in different data scenarios in federated environments.

7.4. Ablation Studies

7.4.1. Aggregation Frequency Analysis in Scenario 1

To analyze the impact of the number of local epochs, an ablation study was conducted in Scenario 1 using MobileNetV2 with FedCycle. All experimental settings are kept identical, and only the number of local epochs is varied. Table 11 reports the corresponding accuracy and loss values.

7.4.2. Fine Tuning Analysis

In our fine-tuning strategy, each client received the current global model and continued training on its own local data at the beginning of each training round. During this process, the feature extraction backbone was only partially updated: the earlier layers were kept frozen, while the remaining layers, together with the classification head, were allowed to learn. In our implementation, approximately the top 30% of the layers closest to the output were set as trainable, and the remaining layers were frozen. Layers were taken in order from input to output, and the same rule was used for all architectures. This setup lets the upper layers adapt to the target data while keeping low-level features unchanged. Similar partial fine-tuning approaches are used in medical imaging [34,35], especially when data are limited. The same fine-tuning setup was used for all methods within each scenario. In scenarios where fine-tuning is enabled, all models use the same frozen and trainable layers, while in scenarios without fine-tuning, all layers remain trainable.
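
The layer-freezing rule described above can be sketched as a small helper that marks which layers are trainable. The function name `trainable_mask` and the rounding rule (rounding the trainable count up) are assumptions; the text only states that approximately the top 30% of layers, counted from input to output, are trainable.

```python
def trainable_mask(num_layers, trainable_percent=30):
    """Return one flag per layer in input-to-output order; True means the layer is updated."""
    # Integer ceiling of num_layers * trainable_percent / 100 (rounding up is an assumption).
    n_trainable = (num_layers * trainable_percent + 99) // 100
    n_frozen = num_layers - n_trainable
    return [index >= n_frozen for index in range(num_layers)]


# Hypothetical backbone with 10 layers: the first 7 stay frozen, the last 3 are trained.
mask = trainable_mask(10)
print(mask)
```

With a Keras model, such flags could be assigned to the `trainable` attribute of each entry in `model.layers` before compiling.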

In the experiments, train–validation differences of the BREAKHIS, BUSI, RSNA, ROBOFLOW, and BCFPP datasets on MobileNetV2 using FedAvg and FedCycle are compared in Table 12. This ablation examines how fine-tuning affects generalization by comparing train–validation gaps under the same experimental conditions.

7.4.3. Sensitivity to Client Numbers

In our experiments, we assumed that all defined clients were available and participated in each communication round. However, in real-world federated environments, client availability may vary due to system constraints or network conditions. To evaluate client sensitivity, some existing datasets (BREAKHIS and RSNA) were divided into separate subsets and treated as individual clients. As the presented FedCycle method is designed for heterogeneous settings, each client was defined to represent different modalities, domains, and data distributions. Therefore, the number of clients was kept limited in our experiments. It should be noted that the number of clients can be increased depending on application requirements, and FedCycle can be applied in settings with a larger number of clients.

The effect of varying the number of clients was analyzed by evaluating FedAvg and FedCycle under homogeneous and heterogeneous settings with MobileNetV2. Table 13 summarizes the corresponding accuracy and loss results.

7.4.4. Communication Cost Analysis

In this analysis, the communication cost is measured in terms of the total amount of data transmitted during training. The communication costs of FedCycle and FedAvg are compared in various scenarios in Table 14 using MobileNetV2. For comparison, we also simulated the communication costs, assuming 200 communication rounds for FedAvg, which corresponds to 2000 rounds for FedCycle, to illustrate the potential scaling effect.

The communication cost is computed as

C = 2 \times N \times R \times S,

(6)

where N denotes the number of clients, R represents the number of communication rounds, and S corresponds to the size of the transmitted model parameters. The factor of 2 reflects the upload and download processes in each communication round. The global model is downloaded by the clients, and their updated models are uploaded back to the server.

In standard FedAvg, each client performs E local epochs before aggregation, whereas in FedCycle, aggregation is performed after each local epoch. If both methods are compared under the same total number of local epochs, then the number of communication rounds in FedCycle becomes:

R_{FedCycle} = E R_{FedAvg}

(7)

Therefore, the communication overhead ratio is:

\frac{C_{comm}^{FedCycle}}{C_{comm}^{FedAvg}} = \frac{R_{FedCycle}}{R_{FedAvg}} = E

(8)

All communication costs are calculated using the experimental setup corresponding to each scenario.
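
Equations (6) to (8) can be reproduced with a short sketch. The function names are hypothetical, and the model size of 14 MB is an assumption obtained from roughly 3.5 × 10^6 MobileNetV2 parameters stored as 32-bit floats; it is not a value taken from Table 14.

```python
def communication_cost(num_clients, rounds, model_size):
    """Equation (6): C = 2 * N * R * S (one download and one upload per client per round)."""
    return 2 * num_clients * rounds * model_size


def fedcycle_rounds(fedavg_rounds, local_epochs):
    """Equation (7): same total number of local epochs, one aggregation per epoch."""
    return local_epochs * fedavg_rounds


MODEL_SIZE_MB = 14.0   # assumed: 3.5e6 parameters * 4 bytes
N_CLIENTS = 3          # as in Scenario 2
E = 10                 # local epochs per FedAvg round
R_FEDAVG = 1           # hypothetical number of FedAvg rounds

cost_fedavg = communication_cost(N_CLIENTS, R_FEDAVG, MODEL_SIZE_MB)
cost_fedcycle = communication_cost(N_CLIENTS, fedcycle_rounds(R_FEDAVG, E), MODEL_SIZE_MB)
ratio = cost_fedcycle / cost_fedavg  # Equation (8): equals E
print(cost_fedavg, cost_fedcycle, ratio)
```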

7.4.5. Communication Efficiency Analysis

To evaluate whether the additional communication cost of FedCycle is justified, the communication efficiency was analyzed in terms of performance gain per unit communication cost. In this context, performance gain is defined as the difference in accuracy between FedCycle and FedAvg. The efficiency metric is given by:

Δ Efficiency = \frac{{Acc}_{FedCycle} - {Acc}_{FedAvg}}{C_{FedCycle} - C_{FedAvg}} .

(9)

The communication cost differences for each scenario are computed from Table 14 as:

Δ C_{S 1} = 967.6, Δ C_{S 2} = 725.8, Δ C_{S 3} = 1209.6 .

(10)

Since communication cost is defined at the scenario level, the same $Δ C$ value is used for all datasets within the corresponding scenario.
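
As a worked illustration of Equation (9), the sketch below combines the RSNA accuracies quoted for Scenario 2 (70.73% for FedCycle and 62.68% for FedAvg) with the Scenario 2 cost difference from Equation (10). The function name is hypothetical, and whether Table 15 reports exactly this pairing is an assumption.

```python
def delta_efficiency(acc_fedcycle, acc_fedavg, delta_cost):
    """Equation (9): accuracy gain per unit of additional communication cost."""
    if delta_cost == 0:
        raise ValueError("cost difference must be non-zero")
    return (acc_fedcycle - acc_fedavg) / delta_cost


DELTA_C_S2 = 725.8  # Scenario 2 value from Equation (10), in the units of Table 14
efficiency_rsna = delta_efficiency(70.73, 62.68, DELTA_C_S2)
print(efficiency_rsna)  # accuracy points gained per unit of extra communication
```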

Table 15 presents the communication efficiency analysis across scenarios.

7.4.6. Time Efficiency Analysis

Table 16 shows the training time efficiencies of centralized DL, FedAvg, and FedCycle across the scenarios. The time efficiency is computed as:

Time Efficiency = \frac{1}{T_{total}}

(11)

where $T_{total}$ is the total training time in seconds. Time efficiency indicates how quickly a model is trained; higher values correspond to faster training. All time efficiency measurements are obtained under the experimental setup corresponding to each scenario.

7.4.7. Computational Complexity Analysis

The computational complexity was estimated using the MobileNetV2 backbone with approximately $3.5 \times 10^{6}$ parameters and an input size of $224 \times 224$. The cost of the measured training computation (forward pass, loss computation, and backward gradient calculation) is 1,733,202,630 FLOPs per iteration.

Let I denote the number of training iterations per epoch:

I = ⌈\frac{N}{B}⌉,

(12)

where N is the number of training samples and $B = 32$ is the batch size.

According to the dataset sizes in Table 2, the total number of iterations per epoch is 1236 for Scenario 1, 439 for Scenario 2, and 550 for Scenario 3. Since the total number of local epochs is 10, the total number of training iterations becomes 12,360, 4390, and 5500 for Scenarios 1, 2, and 3.

Thus, the approximate computational cost is:

C_{comp} = T_{iter} \times F_{train},

(13)

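where $T_{iter}$ is the total number of training iterations and $F_{train}$ is the measured training cost in FLOPs per iteration, which gives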

C_{comp, S 1} \approx 2.14 \times 10^{13},

(14)

C_{comp, S 2} \approx 7.61 \times 10^{12},

(15)

C_{comp, S 3} \approx 9.53 \times 10^{12} .

(16)
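The iteration counts and costs in Equations (12)–(16) can be reproduced from the dataset sizes in Table 2. The sketch below assumes that the ceiling in Equation (12) is applied to each dataset separately and the results are summed per scenario; this assumption is not stated explicitly in the text, but it matches the reported counts of 1236, 439, and 550 iterations per epoch.

```python
import math

FLOPS_PER_ITERATION = 1_733_202_630   # measured training cost per iteration
BATCH_SIZE = 32
TOTAL_LOCAL_EPOCHS = 10

# Number of images per dataset and scenario, from Table 2.
DATASET_SIZES = {
    "S1": {"BREAKHIS": 39_545},
    "S2": {"BREAKHIS": 1_531, "RSNA": 9_412, "ROBOFLOW": 3_047},
    "S3": {"BREAKHIS": 1_574, "RSNA": 8_485, "ROBOFLOW": 3_047,
           "BUSI": 780, "BCFPP": 3_600},
}


def iterations_per_epoch(sizes: dict) -> int:
    """Sum of ceil(N / B) over the datasets of one scenario (assumed per-dataset ceiling)."""
    return sum(math.ceil(n / BATCH_SIZE) for n in sizes.values())


def computational_cost(sizes: dict) -> int:
    """Equation (13): total training iterations times FLOPs per iteration."""
    total_iterations = iterations_per_epoch(sizes) * TOTAL_LOCAL_EPOCHS
    return total_iterations * FLOPS_PER_ITERATION


for scenario, sizes in DATASET_SIZES.items():
    print(f"{scenario}: {iterations_per_epoch(sizes)} iterations/epoch, "
          f"{computational_cost(sizes):.2e} FLOPs")
```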

7.5. Statistical Analysis

The statistical analysis results across datasets are summarized in Table 17.

To analyze the effect of the number of clients, one-way ANOVA was used for the BREAKHIS, RSNA, and ROBOFLOW datasets. The BUSI and BCFPP datasets were only available for the 5- and 7-client settings, so they were evaluated using Welch’s t-test. The null hypothesis (H0) assumes that the number of clients does not affect classification accuracy, whereas the alternative hypothesis (H1) assumes that it does. All p-values were above 0.05. Therefore, H0 could not be rejected, suggesting that changing the number of clients did not lead to a meaningful difference in accuracy.

To analyze the effect of the number of training epochs, one-way ANOVA was applied. In this case, H0 assumes that accuracy does not differ across various epoch settings, whereas H1 assumes that the number of epochs influences accuracy. The analysis yielded $F = 38.95$ with a p-value of $4.55 \times 10^{-6}$. H0 was rejected because the p-value is less than 0.05. This indicates that the number of training epochs has a statistically significant effect on accuracy.

The effect of fine-tuning was evaluated using a paired t-test. Here, H0 assumes that there is no difference in the train–validation gap between the fine-tuning ON and OFF settings, while H1 assumes that there is a difference. According to the results, enabling fine-tuning reduces the gap across all datasets (p < 0.001). So, H0 was rejected, and the OFF setting resulted in a larger gap compared to the ON setting.

For the comparison between FedAvg and FedCycle, H0 assumes that there is no difference in accuracy between the two approaches, while H1 assumes that there is a difference. The p-values were below 0.05 for the BREAKHIS, ROBOFLOW, and BCFPP datasets, so H0 was rejected, indicating statistically significant accuracy differences between FedAvg and FedCycle. In contrast, the p-values for the remaining datasets (BUSI and RSNA) were above 0.05. Thus, the null hypothesis could not be rejected for them.

The statistical results show that the performance difference between FedAvg and FedCycle varies across datasets. Significant improvements are observed in BREAKHIS, ROBOFLOW, and BCFPP, whereas the differences are not significant in BUSI and RSNA. These differences may be due to the properties of each dataset. In the BUSI dataset, the number of samples is relatively small. Because of this, it is harder to detect small performance differences. In the RSNA dataset, higher variability and a more complex data distribution make the results less consistent across runs, which weakens statistical significance. In contrast, datasets with more pronounced heterogeneity benefit more from frequent aggregation, as it helps control client drift and stabilize training. Therefore, FedCycle provides more reliable improvements, especially in settings where data heterogeneity has a stronger effect on the training process.
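The test statistics used in this section can be sketched with the Python standard library alone. The code below computes the one-way ANOVA F statistic and Welch's t statistic with its Welch–Satterthwaite degrees of freedom; the accuracy values are hypothetical placeholders and not the measurements behind Table 17. Converting a statistic to a p-value requires the F or t distribution, for which a statistics package such as SciPy (`scipy.stats.f_oneway`, `scipy.stats.ttest_ind` with `equal_var=False`, `scipy.stats.ttest_rel`) could be used instead.

```python
from math import sqrt
from statistics import mean, variance


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of sample groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum(sum((x - mean(g)) ** 2 for x in g) for g in groups)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t_stat = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t_stat, df


# Hypothetical accuracies over repeated runs (placeholders, not the paper's data).
acc_3_clients = [0.871, 0.868, 0.874]
acc_5_clients = [0.869, 0.872, 0.866]
acc_7_clients = [0.867, 0.870, 0.865]

f_stat, df_b, df_w = one_way_anova_f([acc_3_clients, acc_5_clients, acc_7_clients])
t_stat, df = welch_t(acc_5_clients, acc_7_clients)
print(f"ANOVA: F({df_b}, {df_w}) = {f_stat:.3f}")
print(f"Welch: t = {t_stat:.3f}, df = {df:.2f}")
```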

7.6. Discussion

From a theoretical perspective, the behavior of FedCycle can be explained by its epoch-level global synchronization strategy. In FedAvg, clients perform several local updates before aggregation, which may cause their local models to gradually diverge from the global model when data are heterogeneous. By synchronizing the global model after each local epoch, FedCycle reduces the accumulation of such divergence by incorporating client updates at an earlier stage. This more frequent integration stabilizes the training process and leads to smoother and more consistent convergence, particularly under statistically non-IID data distributions, different modalities, and domains. Importantly, these improvements result from the scheduling of model updates rather than changes to the underlying FedAvg optimization.
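The effect of the aggregation schedule on divergence can be illustrated with a toy simulation. The sketch below is not the implementation used in this study: it assumes a scalar model, two clients whose local losses have different optima (a stand-in for heterogeneous data), plain gradient steps as "local epochs", and sample-size-weighted averaging as in FedAvg. It compares aggregating after every epoch with aggregating after five epochs under the same budget of ten local epochs.

```python
def aggregate(client_weights, client_sizes):
    """FedAvg-style average of client weights, weighted by local sample counts."""
    total = sum(client_sizes)
    return sum(w * n for w, n in zip(client_weights, client_sizes)) / total


def local_epoch(weight, optimum, lr=0.1):
    """One gradient step on the toy local loss 0.5 * (weight - optimum) ** 2."""
    return weight - lr * (weight - optimum)


def run(total_epochs, epochs_per_round, optima, sizes, w0=0.0):
    """Return (final global weight, number of rounds, largest pre-aggregation drift)."""
    if total_epochs % epochs_per_round != 0:
        raise ValueError("epochs_per_round must divide total_epochs")
    rounds = total_epochs // epochs_per_round
    w_global, max_drift = w0, 0.0
    for _ in range(rounds):
        local_weights = []
        for optimum in optima:
            w = w_global                      # each client starts from the global model
            for _ in range(epochs_per_round):
                w = local_epoch(w, optimum)
            local_weights.append(w)
        # Drift: how far a client moved away from the global model before aggregation.
        max_drift = max(max_drift, max(abs(w - w_global) for w in local_weights))
        w_global = aggregate(local_weights, sizes)
    return w_global, rounds, max_drift


OPTIMA = [-1.0, 1.0]     # heterogeneous clients pull in opposite directions
SIZES = [100, 100]

_, rounds_avg, drift_avg = run(10, 5, OPTIMA, SIZES)      # aggregate every 5 epochs
_, rounds_cycle, drift_cycle = run(10, 1, OPTIMA, SIZES)  # aggregate every epoch
print(f"every 5 epochs: {rounds_avg} rounds, max drift {drift_avg:.3f}")
print(f"every epoch:    {rounds_cycle} rounds, max drift {drift_cycle:.3f}")
```

In this toy setting the per-epoch schedule uses five times as many rounds and keeps each client closer to the global model before aggregation, which mirrors the trade-off between communication cost and drift discussed in this section.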

The effectiveness of FedCycle has been demonstrated across multiple experimental scenarios involving heterogeneous datasets in terms of data distributions, modalities, and domains. As discussed in the Dataset section, the heterogeneity in this study arises from multi-source data rather than simulated non-IID partitioning. The results show that centralized learning performs well only in homogeneous data scenarios. Although traditional FL approaches preserve data privacy in distributed environments, they show low validation stability and client drift due to heterogeneous data distributions, modalities, domains, and synchronization imbalances. In contrast, the presented FedCycle method combines the privacy advantages of FL with the accuracy of centralized learning. It also achieves more stable validation and testing curves by reducing client drift and overfitting. These results indicate that FedCycle achieves a favorable balance between accuracy and stability in federated environments.

In the following items, the ablation results are discussed to clarify how key design choices affect the behavior of the presented method and how these effects relate to the main experimental findings.

Effects of Aggregation Frequency (Epoch Number): Table 11 shows that the highest accuracy is obtained when one local epoch is used. As the number of local epochs increases, the accuracy gradually decreases, and the loss becomes higher.
In this setting, the number of local epochs also determines how frequently global aggregation is performed. Using a single local epoch corresponds to more frequent aggregation, whereas larger epoch numbers lead to less frequent synchronization between local and global models. The observed results indicate that more frequent aggregation helps maintain model alignment and improves performance, while increasing the number of local epochs leads to performance degradation. This suggests that the performance gain mainly stems from the aggregation schedule rather than other training factors. Based on the results, one epoch is used in all experiments.
Effects of Fine-Tuning: Table 12 shows that fine-tuning greatly reduces overfitting, especially on RSNA and ROBOFLOW. FedCycle performs strongly both with and without fine-tuning. However, fine-tuning further improves the method by providing better generalization across the data. In the BREAKHIS dataset, accuracy remains strong in FedCycle without fine-tuning, while fine-tuning primarily improves loss.
Effects of Client Numbers: Table 13 shows that varying the number of clients causes only minor changes in accuracy and loss. FedCycle shows similar behavior across different client settings and achieves performance comparable to FedAvg. In some cases, it performs slightly better, particularly under heterogeneous data. Overall, its performance shows only small changes as the number of clients increases.
This is also supported by the results. In the homogeneous setting, accuracy decreases by about 0.63% for FedAvg and 0.51% for FedCycle as the number of clients increases. In the heterogeneous setting, the change remains within about 0.4–1.2% for FedAvg and 0.1–1.1% for FedCycle.
As the number of clients increases, the level of heterogeneity and client drift is expected to grow. In such cases, less frequent aggregation can lead to larger divergence between local and global models. FedCycle applies aggregation after each local epoch. As a result, the difference between local and global models becomes smaller, and training remains stable. Under these conditions, the model is expected to remain stable and maintain similar performance.
Effects of Communication Cost: As shown in Table 14, the communication cost increases approximately linearly with the number of clients and communication rounds. As FedAvg performs aggregation after multiple local epochs, it requires fewer communication rounds. Therefore, it has a lower communication cost in all scenarios. In contrast, FedCycle has more communication rounds and higher communication costs because it performs aggregation after each local epoch. This behavior is consistent with the theoretical analysis in Equations (7) and (8), where the communication overhead increases proportionally with the number of communication rounds. Accordingly, each method is evaluated under its natural communication setting, allowing us to observe the trade-off between communication cost and accuracy. FedCycle increases communication overhead because it performs aggregation more often, but this results in consistent accuracy improvements across scenarios. In practice, this can be acceptable in cases where training stability and performance are more important than communication cost, especially with heterogeneous data. However, in environments with limited communication resources, the higher overhead may reduce its usability.
FedCycle can be used in the same settings where FedAvg is already preferred. It keeps the same simple structure and does not introduce extra parameters or additional local computation. It is suitable for both homogeneous and heterogeneous data, where more frequent aggregation helps keep training stable. In these cases, improvements in accuracy and training stability can make the additional communication cost acceptable. It may not be a good choice when communication bandwidth is very limited. It is also not intended for use with more complex methods that already introduce additional overhead.
Effects of Communication Efficiency: As shown in Table 15, all efficiency values are positive. This shows that the additional communication improves performance across all scenarios. The values are small because the communication cost is high, but they remain positive in every case. This suggests that the extra communication cost is supported by the observed gains in accuracy, particularly under heterogeneous settings.
Effects of Time Efficiency: As shown in Table 16, both FedAvg and FedCycle outperform centralized learning in terms of time efficiency across all scenarios. This is because all clients update the global model synchronously within the same training process, whereas centralized learning aggregates training time across datasets. FedCycle achieves slightly higher efficiency than FedAvg in Scenario 1, remains comparable in Scenario 3, and is slightly lower in Scenario 2. Overall, the results indicate that FedCycle maintains a reasonable training time despite requiring more communication rounds.
Effects of Computational Complexity: As shown in Equations (14)–(16), since both FedAvg and FedCycle use the same total number of local epochs, their client-side computational costs remain approximately the same:

$C_{comp}^{FedCycle} \approx C_{comp}^{FedAvg} .$

(17)

Therefore, the main additional overhead of FedCycle comes from more frequent communication and aggregation, rather than increased local computation.
Effects of Computational Overhead: From a computational point of view,

$T_{comp}^{FedCycle} \approx T_{comp}^{FedAvg}$

(18)

since both methods perform the same total number of local training epochs. Although FedCycle introduces higher communication overhead, the results show that this does not lead to a significant increase in computational cost in practice.

8. Conclusions

DL has been successfully applied in many fields in recent years. However, in scenarios where privacy is critical and data sharing is restricted, traditional DL approaches may be inadequate. To overcome this limitation, an FL approach has been developed in which model training is performed on local devices and only model parameters are transmitted to a global server. Nevertheless, traditional FL approaches often do not achieve the same level of performance as traditional DL approaches. To address this, the current study presents FedCycle as an incremental improvement of the FedAvg algorithm that modifies the aggregation schedule. FedCycle aggregates the client weights at the global server after every local epoch and repeats this for a given number of cycles. Traditional centralized learning, FL, and FedCycle approaches are compared using healthcare datasets. The performance of FedCycle across multiple datasets under different experimental scenarios is also evaluated in this study. The results show that FedCycle performs better than FedAvg, with less overfitting and client drift, as well as higher accuracy, faster convergence rates, and better efficiency in communication and privacy. FedCycle’s incremental updates provide a more stable global model for decentralized clients. This improves adaptation to local data and results in more robust training. Applying fine-tuning within FedCycle reduces overfitting. As a result, FedCycle achieves performance improvements of 7.75% and 4.65% in terms of accuracy and F1-score on the RSNA and BCFPP datasets compared to traditional DL approaches, while also providing an average improvement of approximately 1.5% in accuracy and F1-score across BREAKHIS, ROBOFLOW, and BUSI datasets compared to FedAvg.

Consequently, FedCycle provides a balance between the high accuracy of centralized learning and the privacy advantages of the FL approach. Our future work will focus on testing FedCycle in different domains and improving the method.

Author Contributions

B.D.: Writing—original draft, Visualization, Validation, Software, Methodology, Investigation, Data curation. E.A.S.: Supervision, Methodology, Conceptualization. F.Y.O.: Writing—original draft, Visualization, Validation, Investigation. S.O.: Supervision, Methodology, Conceptualization. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original data presented in the study are openly available in the following dataset. The BREAKHIS dataset is available on Kaggle at https://www.kaggle.com/datasets/ambarish/breakhis (accessed on 1 February 2024). The BUSI dataset is available on Kaggle at https://www.kaggle.com/ds/4519784 (accessed on 1 January 2025). The RSNA Screening Mammography Breast Cancer Detection dataset is available via AWS Open Data at https://registry.opendata.aws/rsna-screening-mammography-breast-cancer-detection/ (accessed on 1 July 2024). The Breast Cancer dataset (Roboflow) is available at https://universe.roboflow.com/hayder17/breast-cancer-detection (accessed on 1 July 2024). The Breast Cancer CT (Fully Preprocessed) dataset (BCFPP) is available on Kaggle at https://www.kaggle.com/ds/2026269 (accessed on 1 January 2025).

Acknowledgments

During the preparation of this manuscript, the authors used ChatGPT-4 (OpenAI) to assist with minor language refinements. The tool was not used to generate scientific content, data, results, or interpretations. All content was critically reviewed, revised, and approved by the authors, who take full responsibility for the originality, accuracy, and integrity of the manuscript. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

All authors declare that they have no conflicts of interest.

Appendix A. Accuracy and Loss Results of FedCycle MobileNetV2 Model in Scenario 3

References

1. Bogdoll, D.; Nitsche, M.; Zöllner, J.M. Anomaly Detection in Autonomous Driving: A Survey. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), New Orleans, LA, USA, 19–20 June 2022; pp. 4487–4498.
2. El Kafhali, S.; Tayebi, M.; Sulimani, H. An Optimized Deep Learning Approach for Detecting Fraudulent Transactions. Information 2024, 15, 227.
3. Neumann, D.; Lutz, A.; Müller, K.; Samek, W. A Privacy Preserving System for Movie Recommendations Using Federated Learning. arXiv 2023, arXiv:2303.04689.
4. Litjens, G.; Kooi, T.; Bejnordi, B.E.; Setio, A.A.A.; Ciompi, F.; Ghafoorian, M.; van der Laak, J.A.; van Ginneken, B.; Sánchez, C.I. A survey on deep learning in medical image analysis. Med. Image Anal. 2017, 42, 60–88.
5. Kaushik, H.; Yadav, I.; Yadav, R.; Sharma, N.; Sharma, P.K.; Biswas, A. Brain Tumor Detection and Classification Using Deep Learning Techniques and MRI Imaging. In Proceedings of the Parul University International Conference on Engineering and Technology (PiCET 2025), Vadodara, India, 2–3 May 2025; pp. 1453–1457.
6. Rajkomar, A.; Oren, E.; Chen, K.; Dai, A.M.; Hajaj, N.; Hardt, M.; Liu, P.J.; Liu, X.; Marcus, J.; Sun, M.; et al. Scalable and Accurate Deep Learning with Electronic Health Records. npj Digit. Med. 2018, 1, 18.
7. Naqvi, M.; Gilani, S.Q.; Syed, T.; Marques, O.; Kim, H.-C. Skin Cancer Detection Using Deep Learning—A Review. Diagnostics 2023, 13, 1911.
8. Margarat, S.G.; Hemalatha, G.; Mishra, A.; Shaheen, H.; Maheswari, K.; Tamijeselvan, S.; Pavan Kumar, U.; Banupriya, V.; Ferede, A.W. Early Diagnosis of Tuberculosis Using Deep Learning Approach for IoT-Based Healthcare Applications. Comput. Intell. Neurosci. 2022, 2022, 3357508.
9. U.S. Department of Health and Human Services. Summary of the HIPAA Privacy Rule. 2025. Available online: https://www.hhs.gov/hipaa/for-professionals/privacy/laws-regulations/index.html (accessed on 21 October 2025).
10. Singh, P.; Singh, M.K.; Singh, R.; Singh, N. Federated learning: Challenges, methods, and future directions. In Federated Learning for IoT Applications; Springer: Berlin/Heidelberg, Germany, 2022; pp. 199–214.
11. Darzidehkalani, E.; Ghasemi-Rad, M.; Ooijen, P.M.A.V. Federated Learning in Medical Imaging: Methods, Challenges, and Considerations. J. Am. Coll. Radiol. 2022, 19, 975–982.
12. Li, L.; Xie, N.; Yuan, S. A Federated Learning Framework for Breast Cancer Histopathological Image Classification. Electronics 2022, 11, 3767.
13. Baid, U.; Pati, S.; Kurc, T.M.; Gupta, R.; Bremer, E.; Abousamra, S.; Bakas, S. Federated Learning for the Classification of Tumor-Infiltrating Lymphocytes. arXiv 2022, arXiv:2203.16622.
14. Wicaksana, J.; Yan, Z.; Yang, X.; Liu, Y.; Fan, L.; Cheng, K.T. Customized Federated Learning for Multi-Source Decentralized Medical Image Classification. IEEE J. Biomed. Health Inform. 2022, 26, 5596–5607.
15. Jiménez-Sánchez, A.; Tardy, M.; Ballester, M.A.G.; Mateus, D.; Piella, G. Memory-Aware Curriculum Federated Learning for Breast Cancer Classification. Comput. Methods Programs Biomed. 2023, 229, 107318.
16. Tan, Y.N.; Tinh, V.P.; Lam, P.D.; Nam, N.H.; Khoa, T.A. A Transfer Learning Approach to Breast Cancer Classification in a Federated Learning Framework. IEEE Access 2023, 11, 27462–27476.
17. Che, L.; Wang, J.; Zhou, Y.; Ma, F. Multimodal Federated Learning: A Survey. Sensors 2023, 23, 6986.
18. Li, T.; Sahu, A.K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; Smith, V. Federated Optimization in Heterogeneous Networks. In Proceedings of the Machine Learning and Systems (MLSys), Austin, TX, USA (Virtual), 2–4 March 2020; Volume 2, pp. 429–450.
19. Karimireddy, S.P.; Kale, S.; Mohri, M.; Reddi, S.J.; Stich, S.U.; Suresh, A.T. SCAFFOLD: Stochastic Controlled Averaging for Federated Learning. In Proceedings of the 37th International Conference on Machine Learning (ICML), Vienna, Austria, 13–18 July 2020; Volume 119, pp. 5132–5143.
20. Gupta, C.; Khullar, V. Modality independent federated multimodal classification system detached EEG, audio and text data for IID and non-IID conditions. Biomed. Signal Process. Control 2025, 108, 107938.
21. Le, H.Q.; Thwal, C.M.; Qiao, Y.; Tun, Y.L.; Nguyen, M.N.H.; Huh, E.N.; Hong, C.S. Cross-Modal Prototype Based Multimodal Federated Learning Under Severely Missing Modality. Inf. Fusion 2025, 122, 103219.
22. Feng, T.; Bose, D.; Zhang, T.; Hebbar, R.; Ramakrishna, A.; Gupta, R.; Zhang, M.; Avestimehr, S.; Narayanan, S. FedMultimodal: A Benchmark for Multimodal Federated Learning. arXiv 2023, arXiv:2306.09486.
23. Ridolfi, L.; Naseh, D.; Shinde, S.S.; Tarchi, D. Implementation and Evaluation of a Federated Learning Framework on Raspberry PI Platforms for IoT 6G Applications. Future Internet 2023, 15, 358.
24. Tahosin, M.S.; Sheakh, M.A.; Alam, M.J.; Hassan, M.M.; Bairagi, A.K.; Abdulla, S.; Alshathri, S.; El-Shafai, W. FedVGM: Enhancing Federated Learning Performance on Multi-Dataset Medical Images with XAI. IEEE J. Biomed. Health Inform. 2025, 29, 3600361.
25. Wang, J.; Das, R.; Joshi, G.; Kale, S.; Xu, Z.; Zhang, T. On the Unreasonable Effectiveness of Federated Averaging with Heterogeneous Data. arXiv 2024, arXiv:2206.04723.
26. McMahan, H.B.; Ramage, D.; Talwar, K.; Zhang, L. Learning Differentially Private Recurrent Language Models. arXiv 2017, arXiv:1710.06963.
27. Li, T.; Sahu, A.K.; Talwalkar, A.; Smith, V. Federated learning: Challenges, methods, and future directions. IEEE Signal Process. Mag. 2020, 37, 50–60.
28. Kairouz, P.; McMahan, H.B.; Avent, B. Advances and open problems in federated learning. Found. Trends Mach. Learn. 2021, 14, 1–210.
29. Mbarish, K. BreakHis—Breast Cancer Histopathological Database. 2019. Kaggle. Available online: https://www.kaggle.com/datasets/ambarish/breakhis (accessed on 1 February 2024).
30. Ali, H. Breast Cancer Detection. 2023. Roboflow. Available online: https://universe.roboflow.com/hayder17/breast-cancer-detection (accessed on 1 July 2024).
31. Dutta, G. RSNA Screening Mammography Breast Cancer Detection. 2023. Available online: https://registry.opendata.aws/rsna-screening-mammography-breast-cancer-detection/ (accessed on 1 July 2024).
32. McMahan, H.B.; Moore, E.; Ramage, D.; Hampson, S.; Arcas, B.A.y. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), Ft. Lauderdale, FL, USA, 20–22 April 2017; Volume 54, pp. 1273–1282.
33. Wang, J.; Liu, Q.; Liang, H.; Joshi, G.; Poor, H.V. Tackling the Objective Inconsistency Problem in Heterogeneous Federated Optimization. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada (Virtual), 6–12 December 2020; Volume 33, pp. 7615–7625.
34. Khowaja, S.A.; Nadeem, M.; Saba, T.; Rehman, A. SelfFed: Self-Supervised Federated Learning for Medical Images. arXiv 2023, arXiv:2307.01514.
35. Davila, A.; Colan, J.; Hasegawa, Y. Comparison of fine-tuning strategies for transfer learning in medical image classification. Image Vis. Comput. 2024, 146, 105012.

Figure 1. Overall Structure of FedCycle. The dashed outer frame indicates the aggregation loop.

Figure 2. Samples from Different Datasets, Categorized as Normal or Cancer. (a) BREAKHIS Normal (color images). (b) BREAKHIS Cancer (color images). (c) BUSI Normal and Cancer (grayscale images). (d) RSNA Normal and Cancer (grayscale images). (e) ROBOFLOW Normal and Cancer (grayscale images). (f) BCFPP Normal and Cancer (grayscale images).

Figure 3. Different Experimental Scenarios of FedCycle. The dashed outer frame indicates the aggregation loop. Bidirectional arrows denote client–server communication. Unidirectional arrows indicate local data flow and processing steps. Green check marks and red crosses denote enabled and disabled preprocessing steps, respectively.

Figure 4. FedAvg Overfitting Graphs in Scenario 2.

Figure 5. FedCycle Improvement on Overfitting in Scenario 3. (a) FedAvg DenseNet Accuracy Results of BUSI; (b) FedAvg DenseNet Loss Results of BUSI; (c) FedAvg DenseNet Accuracy Results of ROBOFLOW; (d) FedAvg DenseNet Loss Results of ROBOFLOW; (e) FedCycle DenseNet Accuracy Results of ROBOFLOW; (f) FedCycle DenseNet Loss Results of ROBOFLOW; (g) FedCycle DenseNet Accuracy Results of BUSI; (h) FedCycle DenseNet Loss Results of BUSI.

Figure 6. Client Drift Graphs in Scenario 3.

Table 1. Literature Review of FL Approaches for Medical Images and Data Heterogeneity.

Publication	Year	Dataset	Contribution	Results
Federated Learning for Medical Images
[12]	2022	BREAKHIS	Evaluated Methods: ResNet, DenseNet, MobileNet, EfficientNet. The highest accuracy in the study is DenseNet-201.	DenseNet-201 Accuracy: 98.6%. ResNet, MobileNet, EfficientNet: 94–97%.
[13]	2022	The Cancer Genome Atlas (TCGA)	The VGG network architecture, successfully used in ImageNet classification, was favored for its ability to perform classification and regression tasks effectively.	DL: 0.75, Consensus Model: 0.89
[14]	2022	Public Multi-Source Medical Dataset	CusFL has demonstrated superior performance compared to other personalization and standard FL methods when classifying multi-source medical image datasets, by using the federated model solely as a guide for feature alignment.	Avg. ACC: 0.8–7% increase. Avg. AUC: 0.3–8.7% increase. Better in a 32-client setting.
[15]	2023	Full Field Digital Mammography (FFDM)	The proposed FedAlign CL method achieved the highest average AUC and PR-AUC values in medical image classification.	FedAlign CL: AUC: 0.967, PR-AUC: 0.962. Traditional FL and CL approaches: AUC 0.94–0.95
[16]	2023	Digital database of mammography screenings from the University of South Florida	The results of the ConvNets indicated that their model achieved higher accuracy compared to other models.	ConvNet Model Accuracy: 97.8%. Other CNN and traditional approaches: 94–96%.
Federated Learning for Data Heterogeneity
[18]	2020	Synthetic Dataset, FEMNIST, Shakespeare Dataset, Sentiment140 (Sent140)	The FedProx algorithm is suggested as a solution to the instability of FL in non-IID data distributions.	FedProx 22% better than FedAvg
[19]	2020	Synthetic dataset, MNIST, CIFAR-10, Shakespeare dataset (LEAF benchmark)	SCAFFOLD solves the problem of client drift in FL that happens when data distributions are non-IID by using control variables.	FedSGD 0.766, FedAvg 0.787, SCAFFOLD 0.801.
[20]	2025	Audio: Lanzhou University’s Second Affiliated Hospital. Text: Depression data taken from the Reddit platform. EEG: Collected data.	This study proposes a system (MIMS-FL) capable of modality-independent depression classification under FL by processing EEG, audio, and text datasets separately.	Proposed Approach: (IID) Audio 100%, EEG 99.89%, Text 100%. (Non-IID) Audio 99.7%, EEG 99.5%, Text 100%. Traditional DL approaches (LSTM, CNN): (IID) Audio 99.33%, EEG 98.95%, Text 96.97%. (Non-IID) Audio 99.33%, EEG 98.95%, Text 96.97%
[21]	2025	UCI-HAR, Hateful Memes, MELD, PTB-XL	The proposed MFCPL method improves the performance and generalization of the global model in multimodal FL, even with missing modalities.	MFCPL performance improvements are 1.85%, 1.36% and 1.77% on UCI-HAR, Hateful Memes and MELD dataset. Modality missing: 80%.
[22]	2023	CREMA-D, MELD, MOSI, MOSEI, UCF101, PAMAP2, MIMIC-IV, CrisisMMD, MVSA-Single, MVSA-Multiple	FedMultimodal provides a benchmark that evaluates multimodal FL with text, audio, and image data under different conditions, showing that modality-level heterogeneity significantly impacts performance, especially when modalities are missing.	Under missing modalities, the performance of all approaches (FedAvg, FedProx, FedOpt, FedRS) decreases by 15–30%; missing labels cause a 5–10% decrease; noisy labels cause a 3–8% decrease. FedOpt achieves the highest performance. FedRS is more robust under label imbalance.
[23]	2023	CIFAR-10	FedAvg is used. The main contribution was the examination of FL under limited resources and varying client conditions, with a focus on communication and training time.	FedAvg achieved accuracy close to centralized training (around 60–65%)

Table 2. Number of Images of Datasets According to the Scenarios.

Dataset	Scenario 1	Scenario 2	Scenario 3
BREAKHIS	39,545	1531	1574
RSNA	-	9412	8485
ROBOFLOW	-	3047	3047
BUSI	-	-	780
BCFPP	-	-	3600

Table 3. Number of Parameters for CNN-based Models.

Model	Number of Parameters
EfficientNetB0	$5.3 \times 10^{6}$
MobileNetV2	$3.5 \times 10^{6}$
MobileNetV3 Small	$2.9 \times 10^{6}$
ResNet152	$6.0 \times 10^{7}$
DenseNet201	$2.0 \times 10^{7}$

Table 4. Hyperparameters of CNN-based Models.

Hyperparameter	Value
Initial Learning rate	0.001
Image size	224 × 224
Batch size	32
Optimizer	Adam
Max Number of Epochs	10
Dropout	0.4
Weight decay	0.0001

Table 5. MobileNetV2 Performance Results of SCAFFOLD, FedProx, FedSGD, FedNova, FedAvg, FedCycle. Time Efficiency is calculated as a Ratio to FedAvg. Bold values indicate the best performance.

Dataset	Metrics	SCAFFOLD	FedProx	FedSGD	FedNova	FedAvg	FedCycle
Scenario 1: Homogeneous Dataset
	Loss	0.522 ± 0.0068	0.410 ± 0.0059	0.352 ± 0.0075	0.654 ± 0.0081	0.366 ± 0.0055	0.407 ± 0.0041
BREAKHIS	Accuracy	0.815 ± 0.0049	0.905 ± 0.0036	0.939 ± 0.0048	0.811 ± 0.0055	0.948 ± 0.0040	0.979 ± 0.0020
	Time (s)	40,629.936	24,646.615	11,811.803	10,253.775	900.432	1051.080
	Time Eff.	0.02	0.04	0.08	0.09	1.00	0.86
	Comm. Cost	2.15 GB	107.6 MB	1.08 GB	1.08 GB	107.6 MB	1.08 GB
Scenario 2: Heterogeneous Datasets
	Loss	0.587 ± 0.0123	0.610 ± 0.0284	0.546 ± 0.0617	0.980 ± 0.0312	0.479 ± 0.0125	0.380 ± 0.0096
BREAKHIS	Accuracy	0.678 ± 0.0104	0.754 ± 0.0216	0.787 ± 0.0489	0.614 ± 0.0263	0.869 ± 0.0058	0.870 ± 0.0049
	Loss	0.830 ± 0.0146	0.822 ± 0.0321	0.768 ± 0.0694	0.642 ± 0.0347	0.702 ± 0.0151	0.591 ± 0.0121
ROBOFLOW	Accuracy	0.599 ± 0.0118	0.610 ± 0.0253	0.645 ± 0.0556	0.626 ± 0.0289	0.655 ± 0.0081	0.730 ± 0.0062
	Loss	0.783 ± 0.0154	0.799 ± 0.0348	0.750 ± 0.0726	0.704 ± 0.0369	0.797 ± 0.0178	0.684 ± 0.0134
RSNA	Accuracy	0.604 ± 0.0127	0.623 ± 0.0275	0.606 ± 0.0589	0.618 ± 0.0306	0.627 ± 0.0086	0.707 ± 0.0060
	Time (s)	44,193.203	28,200.632	12,166.657	13,181.712	5902.137	7563.755
	Time Eff.	0.13	0.21	0.49	0.45	1.00	0.78
	Comm. Cost	1.58 GB	80.6 MB	0.79 GB	0.79 GB	80.6 MB	0.79 GB
Scenario 3: Heterogeneous Datasets
Dataset	Metrics	SCAFFOLD	FedProx	FedSGD	FedNova	FedAvg	FedCycle
	Loss	0.563 ± 0.0135	0.582 ± 0.0261	0.531 ± 0.0584	0.948 ± 0.0298	0.025 ± 0.0034	0.025 ± 0.0086
BREAKHIS	Accuracy	0.699 ± 0.0112	0.771 ± 0.0203	0.803 ± 0.0461	0.633 ± 0.0248	0.992 ± 0.0010	0.999 ± 0.0035
	Loss	0.780 ± 0.0158	0.801 ± 0.0304	0.751 ± 0.0668	0.629 ± 0.0326	0.338 ± 0.0020	0.336 ± 0.0136
ROBOFLOW	Accuracy	0.608 ± 0.0126	0.622 ± 0.0241	0.658 ± 0.0527	0.639 ± 0.0275	0.854 ± 0.0046	0.867 ± 0.0086
	Loss	0.756 ± 0.0166	0.772 ± 0.0332	0.721 ± 0.0695	0.689 ± 0.0348	0.234 ± 0.0052	0.233 ± 0.0114
RSNA	Accuracy	0.619 ± 0.0134	0.638 ± 0.0262	0.621 ± 0.0564	0.632 ± 0.0291	0.880 ± 0.0050	0.890 ± 0.0068
	Loss	0.496 ± 0.0149	0.512 ± 0.0248	0.468 ± 0.0559	0.451 ± 0.0286	0.272 ± 0.0073	0.268 ± 0.0061
BUSI	Accuracy	0.731 ± 0.0121	0.748 ± 0.0196	0.762 ± 0.0438	0.739 ± 0.0254	0.940 ± 0.0064	0.945 ± 0.0058
	Loss	0.662 ± 0.0153	0.681 ± 0.0273	0.638 ± 0.0602	0.611 ± 0.0307	0.220 ± 0.0039	0.217 ± 0.0037
BCFPP	Accuracy	0.649 ± 0.0128	0.667 ± 0.0215	0.682 ± 0.0476	0.658 ± 0.0268	0.916 ± 0.0035	0.925 ± 0.0033
	Time (s)	57,799.392	40,850.214	25,280.417	26,840.593	19,770.808	20,759.348
	Time Eff.	0.34	0.48	0.78	0.74	1.00	0.95
	Comm. Cost	2.62 GB	134.4 MB	1.31 GB	1.31 GB	134.4 MB	1.31 GB
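
As a reading aid, the Time Eff. rows above appear consistent with a simple ratio: the shortest training time in a scenario divided by each method's training time. This section does not state that definition, so the sketch below treats it as an assumption. The function name `relative_time_efficiency` is illustrative, and the inputs are the Scenario 1 Time (s) values.

```python
# Training times in seconds, copied from the Scenario 1 "Time (s)" row.
times_s = {
    "SCAFFOLD": 40629.936,
    "FedPROX": 24646.615,
    "FedSGD": 11811.803,
    "FedNova": 10253.775,
    "FedAvg": 900.432,
    "FedCycle": 1051.080,
}


def relative_time_efficiency(times):
    """Assumed definition: fastest time divided by each method's time."""
    fastest = min(times.values())
    return {name: round(fastest / t, 2) for name, t in times.items()}


eff = relative_time_efficiency(times_s)
for name, value in eff.items():
    print(f"{name}: {value:.2f}")
```

Under this assumed definition the fastest method scores 1.00 and every other method scores the fraction of its run time that the fastest method would have needed.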

Table 6. Performance Results of Scenario 1. Epoch Number represents aggregation frequency. Bold values indicate the best performance.

Model	Loss	Accuracy	Precision	Recall	F1-Score	AUC
EfficientNet	0.085 ± 0.0021	0.976 ± 0.0009	0.951 ± 0.0020	0.965 ± 0.0017	0.958 ± 0.0007	0.998 ± 0.0003
MobileNet	0.328 ± 0.0026	0.889 ± 0.0013	0.722 ± 0.0024	0.982 ± 0.0019	0.832 ± 0.0014	0.995 ± 0.0006
ResNet	0.296 ± 0.0023	0.923 ± 0.0011	0.846 ± 0.0021	0.896 ± 0.0018	0.870 ± 0.0012	0.996 ± 0.0005
Inception	0.625 ± 0.0028	0.796 ± 0.0015	0.796 ± 0.0023	0.791 ± 0.0020	0.793 ± 0.0016	0.994 ± 0.0006
MobileNet FedAvg	0.366 ± 0.0055	0.948 ± 0.0040	0.942 ± 0.0048	0.814 ± 0.0045	0.873 ± 0.0041	0.997 ± 0.0011
MobileNet FedCycle 1 Epoch	0.407 ± 0.0041	0.979 ± 0.0020	0.976 ± 0.0032	0.949 ± 0.0036	0.962 ± 0.0023	0.999 ± 0.0006
MobileNet FedCycle 2 Epochs	0.472 ± 0.0044	0.970 ± 0.0026	0.968 ± 0.0038	0.960 ± 0.0035	0.964 ± 0.0029	0.998 ± 0.0007
MobileNet FedCycle 3 Epochs	0.485 ± 0.0047	0.966 ± 0.0031	0.962 ± 0.0041	0.967 ± 0.0031	0.965 ± 0.0034	0.998 ± 0.0009
MobileNet FedCycle 4 Epochs	0.491 ± 0.0051	0.962 ± 0.0037	0.964 ± 0.0044	0.961 ± 0.0037	0.963 ± 0.0038	0.997 ± 0.0010
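
The F1-Score column can be cross-checked against the Precision and Recall columns with the standard harmonic-mean formula. The sketch below is a sanity check on the reported means, not the authors' evaluation code; because the table reports rounded averages, agreement is only expected to within about 0.001.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) means copied from Table 6.
table6 = {
    "EfficientNet": (0.951, 0.965, 0.958),
    "MobileNet": (0.722, 0.982, 0.832),
    "ResNet": (0.846, 0.896, 0.870),
    "Inception": (0.796, 0.791, 0.793),
    "MobileNet FedAvg": (0.942, 0.814, 0.873),
    "MobileNet FedCycle 1 Epoch": (0.976, 0.949, 0.962),
}

recomputed = {}
for model, (p, r, reported) in table6.items():
    recomputed[model] = f1_score(p, r)
    print(f"{model}: recomputed {recomputed[model]:.3f}, reported {reported:.3f}")
```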

Table 7. Performance Results of Scenario 2. Bold values indicate the best performance.

Dataset	Metric	EfficientNet	MobileNet	FedAvg EfficientNet	FedAvg MobileNet	FedCycle EfficientNet	FedCycle MobileNet
BREAKHIS	Loss	0.337 ± 0.0092	0.097 ± 0.0061	0.608 ± 0.0148	0.479 ± 0.0125	0.477 ± 0.0112	0.380 ± 0.0096
	Accuracy	0.864 ± 0.0048	0.968 ± 0.0026	0.788 ± 0.0069	0.869 ± 0.0058	0.850 ± 0.0055	0.870 ± 0.0049
	Precision	0.725 ± 0.0065	0.958 ± 0.0038	0.780 ± 0.0082	0.845 ± 0.0067	0.807 ± 0.0064	0.878 ± 0.0056
	Recall	0.915 ± 0.0059	0.939 ± 0.0041	0.774 ± 0.0090	0.868 ± 0.0063	0.813 ± 0.0068	0.873 ± 0.0054
	F1-Score	0.809 ± 0.0052	0.948 ± 0.0033	0.777 ± 0.0075	0.856 ± 0.0060	0.810 ± 0.0061	0.876 ± 0.0051
	AUC	0.960 ± 0.0027	0.984 ± 0.0019	0.935 ± 0.0049	0.949 ± 0.0040	0.958 ± 0.0031	0.971 ± 0.0029
ROBOFLOW	Loss	0.709 ± 0.0156	0.786 ± 0.0162	0.733 ± 0.0174	0.702 ± 0.0151	0.673 ± 0.0136	0.591 ± 0.0121
	Accuracy	0.725 ± 0.0074	0.775 ± 0.0068	0.682 ± 0.0085	0.655 ± 0.0081	0.718 ± 0.0067	0.730 ± 0.0062
	Precision	0.626 ± 0.0089	0.719 ± 0.0076	0.659 ± 0.0092	0.642 ± 0.0088	0.690 ± 0.0075	0.726 ± 0.0068
	Recall	0.654 ± 0.0093	0.698 ± 0.0080	0.675 ± 0.0096	0.668 ± 0.0090	0.702 ± 0.0078	0.720 ± 0.0070
	F1-Score	0.640 ± 0.0081	0.708 ± 0.0072	0.667 ± 0.0089	0.655 ± 0.0084	0.696 ± 0.0071	0.723 ± 0.0065
	AUC	0.839 ± 0.0040	0.862 ± 0.0032	0.804 ± 0.0050	0.788 ± 0.0049	0.829 ± 0.0036	0.845 ± 0.0031
RSNA	Loss	0.689 ± 0.0139	0.674 ± 0.0132	0.824 ± 0.0186	0.797 ± 0.0178	0.700 ± 0.0141	0.684 ± 0.0134
	Accuracy	0.702 ± 0.0065	0.789 ± 0.0054	0.580 ± 0.0092	0.627 ± 0.0086	0.702 ± 0.0062	0.707 ± 0.0060
	Precision	0.723 ± 0.0073	0.702 ± 0.0061	0.535 ± 0.0101	0.686 ± 0.0090	0.697 ± 0.0069	0.706 ± 0.0066
	Recall	0.645 ± 0.0084	0.744 ± 0.0068	0.484 ± 0.0108	0.699 ± 0.0092	0.718 ± 0.0073	0.754 ± 0.0071
	F1-Score	0.682 ± 0.0076	0.723 ± 0.0060	0.509 ± 0.0097	0.692 ± 0.0087	0.707 ± 0.0066	0.729 ± 0.0063
	AUC	0.813 ± 0.0035	0.836 ± 0.0030	0.739 ± 0.0055	0.781 ± 0.0048	0.808 ± 0.0035	0.822 ± 0.0032
	Time (s)	4671.655	4403.037	5912.741	5902.137	11,725.384	7563.755

Table 8. DL Performance Results of Scenario 3. Bold values indicate the best performance.

Dataset	Metric	EfficientNetB0	MobileNetV2	DenseNet	ResNet	MobileNetV3
BREAKHIS	Loss	0.278 ± 0.0065	0.097 ± 0.0081	0.033 ± 0.0084	0.503 ± 0.0102	0.021 ± 0.0072
	Accuracy	0.879 ± 0.0030	0.968 ± 0.0040	0.988 ± 0.0042	0.726 ± 0.0055	0.994 ± 0.0035
	Precision	0.875 ± 0.0018	0.958 ± 0.0024	0.985 ± 0.0025	0.722 ± 0.0033	0.993 ± 0.0021
	Recall	0.887 ± 0.0026	0.939 ± 0.0035	0.989 ± 0.0036	0.736 ± 0.0046	0.994 ± 0.0030
	F1-Score	0.881 ± 0.0021	0.948 ± 0.0029	0.987 ± 0.0030	0.729 ± 0.0039	0.994 ± 0.0025
	AUC	0.999 ± 0.0015	0.999 ± 0.0020	0.998 ± 0.0017	0.805 ± 0.0022	0.999 ± 0.0018
BUSI	Loss	0.207 ± 0.0045	0.185 ± 0.0058	0.165 ± 0.0055	0.150 ± 0.0068	0.301 ± 0.0210
	Accuracy	0.931 ± 0.0028	0.964 ± 0.0036	0.939 ± 0.0034	0.947 ± 0.0042	0.879 ± 0.0231
	Precision	0.864 ± 0.0055	0.952 ± 0.0070	0.855 ± 0.0065	0.876 ± 0.0080	0.690 ± 0.0560
	Recall	0.881 ± 0.0020	0.914 ± 0.0024	0.934 ± 0.0023	0.934 ± 0.0030	0.995 ± 0.0041
	F1-Score	0.872 ± 0.0050	0.932 ± 0.0064	0.892 ± 0.0060	0.904 ± 0.0075	0.815 ± 0.0356
	AUC	0.901 ± 0.0035	0.911 ± 0.0045	0.910 ± 0.0039	0.871 ± 0.0053	0.835 ± 0.0231
RSNA	Loss	0.331 ± 0.0051	0.422 ± 0.0059	0.442 ± 0.0061	0.649 ± 0.0884	0.335 ± 0.0055
	Accuracy	0.858 ± 0.0049	0.682 ± 0.0058	0.790 ± 0.0059	0.636 ± 0.0527	0.866 ± 0.0053
	Precision	0.835 ± 0.0056	0.849 ± 0.0066	0.727 ± 0.0067	0.491 ± 0.1280	0.806 ± 0.0061
	Recall	0.761 ± 0.0061	0.745 ± 0.0071	0.640 ± 0.0073	0.099 ± 0.0815	0.829 ± 0.0066
	F1-Score	0.796 ± 0.0057	0.793 ± 0.0064	0.681 ± 0.0066	0.169 ± 0.1078	0.818 ± 0.0060
	AUC	0.820 ± 0.0040	0.819 ± 0.0045	0.704 ± 0.0042	0.601 ± 0.0711	0.834 ± 0.0038
ROBOFLOW	Loss	0.247 ± 0.0038	0.327 ± 0.0049	0.428 ± 0.0050	0.430 ± 0.0065	0.406 ± 0.0043
	Accuracy	0.899 ± 0.0016	0.904 ± 0.0021	0.796 ± 0.0020	0.804 ± 0.0028	0.818 ± 0.0019
	Precision	0.873 ± 0.0026	0.885 ± 0.0034	0.802 ± 0.0032	0.807 ± 0.0041	0.746 ± 0.0029
	Recall	0.822 ± 0.0014	0.826 ± 0.0018	0.781 ± 0.0019	0.800 ± 0.0026	0.762 ± 0.0016
	F1-Score	0.847 ± 0.0019	0.855 ± 0.0025	0.792 ± 0.0024	0.804 ± 0.0033	0.754 ± 0.0021
	AUC	0.822 ± 0.0013	0.838 ± 0.0017	0.807 ± 0.0016	0.780 ± 0.0022	0.772 ± 0.0014
BCFPP	Loss	0.477 ± 0.0060	0.673 ± 0.0081	0.699 ± 0.0085	0.699 ± 0.0087	0.286 ± 0.0068
	Accuracy	0.850 ± 0.0050	0.718 ± 0.0074	0.702 ± 0.0081	0.702 ± 0.0083	0.884 ± 0.0058
	Precision	0.807 ± 0.0055	0.690 ± 0.0092	0.697 ± 0.0092	0.697 ± 0.0096	0.808 ± 0.0062
	Recall	0.813 ± 0.0048	0.702 ± 0.0085	0.718 ± 0.0081	0.718 ± 0.0089	0.861 ± 0.0056
	F1-Score	0.810 ± 0.0052	0.696 ± 0.0090	0.707 ± 0.0088	0.707 ± 0.0092	0.834 ± 0.0060
	AUC	0.874 ± 0.0038	0.865 ± 0.0045	0.723 ± 0.0048	0.704 ± 0.0050	0.863 ± 0.0039
	Time (s)	22,667.054	44,028.752	9691.586	35,210.641	4084.859

Table 9. FedAvg Performance Results of Scenario 3. Bold values indicate the best performance.

Dataset	Metric	EfficientNetB0	MobileNetV2	DenseNet	ResNet	MobileNetV3
BREAKHIS	Loss	0.309 ± 0.0270	0.025 ± 0.0034	0.048 ± 0.0061	0.228 ± 0.0215	0.010 ± 0.0025
	Accuracy	0.854 ± 0.0218	0.992 ± 0.0010	0.984 ± 0.0048	0.911 ± 0.0189	0.998 ± 0.0008
	Precision	0.859 ± 0.0240	0.991 ± 0.0000	0.985 ± 0.0045	0.926 ± 0.0176	0.999 ± 0.0007
	Recall	0.851 ± 0.0107	0.993 ± 0.0015	0.982 ± 0.0052	0.906 ± 0.0198	0.998 ± 0.0010
	F1-Score	0.856 ± 0.0138	0.992 ± 0.0007	0.984 ± 0.0047	0.916 ± 0.0182	0.998 ± 0.0006
	AUC	0.999 ± 0.0142	0.999 ± 0.0007	0.998 ± 0.0024	0.997 ± 0.0091	0.999 ± 0.0003
BUSI	Loss	0.243 ± 0.0041	0.272 ± 0.0073	0.246 ± 0.0035	0.358 ± 0.0075	0.248 ± 0.0028
	Accuracy	0.939 ± 0.0014	0.940 ± 0.0064	0.911 ± 0.0048	0.894 ± 0.0062	0.891 ± 0.0039
	Precision	0.830 ± 0.0023	0.877 ± 0.0056	0.770 ± 0.0095	0.794 ± 0.0070	0.721 ± 0.0110
	Recall	0.927 ± 0.0016	0.911 ± 0.0059	0.938 ± 0.0032	0.762 ± 0.0105	0.951 ± 0.0024
	F1-Score	0.876 ± 0.0019	0.894 ± 0.0071	0.846 ± 0.0068	0.778 ± 0.0086	0.820 ± 0.0074
	AUC	0.876 ± 0.0021	0.874 ± 0.0078	0.871 ± 0.0033	0.860 ± 0.0042	0.890 ± 0.0036
RSNA	Loss	0.298 ± 0.0450	0.234 ± 0.0052	0.4722 ± 0.0602	0.621 ± 0.0815	0.213 ± 0.0343
	Accuracy	0.872 ± 0.0157	0.880 ± 0.0050	0.771 ± 0.0230	0.640 ± 0.0334	0.933 ± 0.0118
	Precision	0.810 ± 0.0286	0.821 ± 0.0026	0.722 ± 0.0351	0.499 ± 0.0423	0.877 ± 0.0239
	Recall	0.851 ± 0.0605	0.832 ± 0.0032	0.606 ± 0.0514	0.317 ± 0.0680	0.944 ± 0.0157
	F1-Score	0.830 ± 0.0351	0.827 ± 0.0015	0.659 ± 0.0463	0.388 ± 0.0591	0.910 ± 0.0188
	AUC	0.789 ± 0.0380	0.791 ± 0.0016	0.755 ± 0.0231	0.704 ± 0.0294	0.822 ± 0.0093
ROBOFLOW	Loss	0.257 ± 0.0280	0.338 ± 0.0020	0.561 ± 0.0347	0.478 ± 0.0291	0.341 ± 0.0216
	Accuracy	0.886 ± 0.0360	0.854 ± 0.0046	0.703 ± 0.0418	0.756 ± 0.0386	0.870 ± 0.0169
	Precision	0.805 ± 0.0151	0.752 ± 0.0029	0.759 ± 0.0289	0.655 ± 0.0437	0.780 ± 0.0243
	Recall	0.863 ± 0.0392	0.825 ± 0.0030	0.658 ± 0.0471	0.506 ± 0.0701	0.850 ± 0.0175
	F1-Score	0.833 ± 0.0325	0.787 ± 0.0033	0.704 ± 0.0364	0.571 ± 0.0529	0.814 ± 0.0197
	AUC	0.801 ± 0.0350	0.797 ± 0.0036	0.763 ± 0.0180	0.735 ± 0.0263	0.811 ± 0.0096
BCFPP	Loss	0.439 ± 0.0035	0.220 ± 0.0039	0.416 ± 0.0827	0.428 ± 0.0913	0.266 ± 0.0735
	Accuracy	0.803 ± 0.0014	0.916 ± 0.0035	0.788 ± 0.0186	0.777 ± 0.0214	0.900 ± 0.0148
	Precision	0.666 ± 0.0016	0.845 ± 0.0037	0.690 ± 0.0219	0.657 ± 0.0267	0.820 ± 0.0196
	Recall	0.776 ± 0.0012	0.894 ± 0.0034	0.638 ± 0.0294	0.689 ± 0.0349	0.888 ± 0.0179
	F1-Score	0.717 ± 0.0017	0.869 ± 0.0036	0.663 ± 0.0241	0.672 ± 0.0286	0.852 ± 0.0164
	AUC	0.839 ± 0.0039	0.856 ± 0.0040	0.833 ± 0.0119	0.826 ± 0.0144	0.877 ± 0.0079
	Time (s)	25,050.146	19,770.808	26,068.463	27,729.569	4471.982

Table 10. FedCycle Performance Results of Scenario 3. Bold values indicate the best performance.

Dataset	Metric	EfficientNetB0	MobileNetV2	DenseNet	ResNet	MobileNetV3
BREAKHIS	Loss	0.304 ± 0.0186	0.025 ± 0.0086	0.047 ± 0.0194	0.228 ± 0.0218	0.010 ± 0.0172
	Accuracy	0.863 ± 0.0137	0.999 ± 0.0035	0.993 ± 0.0121	0.911 ± 0.0164	0.998 ± 0.0110
	Precision	0.870 ± 0.0152	0.977 ± 0.0014	0.990 ± 0.0134	0.926 ± 0.0172	0.999 ± 0.0118
	Recall	0.855 ± 0.0107	0.980 ± 0.0047	0.997 ± 0.0102	0.906 ± 0.0149	0.998 ± 0.0101
	F1-Score	0.862 ± 0.0087	0.978 ± 0.0025	0.993 ± 0.0091	0.916 ± 0.0136	0.997 ± 0.0093
	AUC	0.999 ± 0.0061	0.999 ± 0.0018	0.998 ± 0.0027	0.994 ± 0.0041	0.999 ± 0.0028
BUSI	Loss	0.240 ± 0.0098	0.268 ± 0.0061	0.243 ± 0.0030	0.358 ± 0.0062	0.248 ± 0.0023
	Accuracy	0.953 ± 0.0011	0.945 ± 0.0058	0.920 ± 0.0039	0.894 ± 0.0054	0.891 ± 0.0034
	Precision	0.840 ± 0.0024	0.886 ± 0.0049	0.779 ± 0.0079	0.794 ± 0.0061	0.721 ± 0.0098
	Recall	0.941 ± 0.0013	0.922 ± 0.0047	0.947 ± 0.0027	0.762 ± 0.0086	0.951 ± 0.0021
	F1-Score	0.888 ± 0.0015	0.904 ± 0.0056	0.854 ± 0.0061	0.777 ± 0.0071	0.820 ± 0.0066
	AUC	0.890 ± 0.0011	0.902 ± 0.0039	0.886 ± 0.0020	0.872 ± 0.0022	0.889 ± 0.0021
RSNA	Loss	0.294 ± 0.0156	0.233 ± 0.0114	0.468 ± 0.0184	0.621 ± 0.0201	0.213 ± 0.0146
	Accuracy	0.877 ± 0.0069	0.890 ± 0.0068	0.775 ± 0.0072	0.640 ± 0.0084	0.933 ± 0.0063
	Precision	0.819 ± 0.0046	0.824 ± 0.0042	0.729 ± 0.0060	0.499 ± 0.0078	0.877 ± 0.0052
	Recall	0.855 ± 0.0268	0.869 ± 0.0118	0.612 ± 0.0209	0.317 ± 0.0346	0.944 ± 0.0158
	F1-Score	0.836 ± 0.0048	0.846 ± 0.0059	0.665 ± 0.0064	0.387 ± 0.0082	0.909 ± 0.0056
	AUC	0.816 ± 0.0034	0.824 ± 0.0041	0.780 ± 0.0020	0.701 ± 0.0023	0.917 ± 0.0019
ROBOFLOW	Loss	0.254 ± 0.0103	0.336 ± 0.0136	0.556 ± 0.0159	0.478 ± 0.0176	0.341 ± 0.0164
	Accuracy	0.896 ± 0.0080	0.867 ± 0.0086	0.706 ± 0.0107	0.756 ± 0.0129	0.870 ± 0.0136
	Precision	0.813 ± 0.0065	0.763 ± 0.0027	0.770 ± 0.0091	0.655 ± 0.0114	0.780 ± 0.0119
	Recall	0.871 ± 0.0108	0.829 ± 0.0060	0.664 ± 0.0168	0.506 ± 0.0198	0.850 ± 0.0127
	F1-Score	0.841 ± 0.0104	0.795 ± 0.0044	0.712 ± 0.0149	0.571 ± 0.0172	0.814 ± 0.0121
	AUC	0.827 ± 0.0073	0.815 ± 0.0031	0.806 ± 0.0043	0.768 ± 0.0050	0.834 ± 0.0035
BCFPP	Loss	0.437 ± 0.0029	0.217 ± 0.0037	0.412 ± 0.0068	0.428 ± 0.0074	0.266 ± 0.0031
	Accuracy	0.813 ± 0.0012	0.925 ± 0.0033	0.798 ± 0.0145	0.777 ± 0.0191	0.900 ± 0.0016
	Precision	0.669 ± 0.0013	0.849 ± 0.0031	0.693 ± 0.0178	0.657 ± 0.0240	0.820 ± 0.0019
	Recall	0.780 ± 0.0010	0.903 ± 0.0028	0.648 ± 0.0215	0.689 ± 0.0286	0.888 ± 0.0020
	F1-Score	0.721 ± 0.0011	0.875 ± 0.0030	0.670 ± 0.0192	0.672 ± 0.0234	0.853 ± 0.0018
	AUC	0.865 ± 0.0023	0.869 ± 0.0035	0.852 ± 0.0056	0.846 ± 0.0068	0.893 ± 0.0010
	Time (s)	27,555.161	20,759.348	27,371.886	27,729.569	12,295.046

Table 11. Effect of the Number of Local Epochs on MobileNetV2 using the FedCycle in Scenario 1. Bold values indicate the best performance.

Aggregation Frequency (Epoch Number)	Accuracy	Loss
1	0.979 ± 0.0020	0.407 ± 0.0041
2	0.970 ± 0.0026	0.472 ± 0.0044
3	0.966 ± 0.0031	0.485 ± 0.0047
4	0.962 ± 0.0037	0.491 ± 0.0051
10	0.948 ± 0.0040	0.366 ± 0.0055

Table 12. Average Train-Validation Gap Values for Fine-Tuning (FT) ON–OFF. Bold values indicate the best performance.

Dataset	Metric	FedAvg (FT ON)	FedAvg (FT OFF)	FedCycle (FT ON)	FedCycle (FT OFF)
BREAKHIS	Accuracy	0.0021	0.0041	0.0027	0.0016
BREAKHIS	Loss	0.2970	0.2769	0.0300	0.0311
BUSI	Accuracy	0.0452	0.0953	0.0865	0.1474
BUSI	Loss	0.2869	0.1788	0.1763	0.2307
RSNA	Accuracy	0.1636	0.4220	0.0519	0.1435
RSNA	Loss	0.3368	0.6695	0.1436	0.3170
ROBOFLOW	Accuracy	0.1457	0.2170	0.0923	0.2453
ROBOFLOW	Loss	0.3798	0.3831	0.2228	0.6764
BCFPP	Accuracy	0.1693	0.1816	0.0156	0.1424
BCFPP	Loss	0.1832	0.2470	0.0460	0.2176

Table 13. Sensitivity to the Number of Clients under Heterogeneous Data using MobileNetV2.

Homogeneous Dataset
Method	Dataset	Acc (2 clients)	Loss (2 clients)	Acc (4 clients)	Loss (4 clients)	Acc (6 clients)	Loss (6 clients)
FedAvg	BREAKHIS	0.951 ± 0.0035	0.358 ± 0.0052	0.948 ± 0.0040	0.366 ± 0.0055	0.945 ± 0.0043	0.375 ± 0.0058
FedCycle	BREAKHIS	0.981 ± 0.0019	0.398 ± 0.0040	0.979 ± 0.0020	0.407 ± 0.0041	0.976 ± 0.0021	0.415 ± 0.0042
Heterogeneous Datasets
Method	Dataset	Acc (3 clients)	Loss (3 clients)	Acc (5 clients)	Loss (5 clients)	Acc (7 clients)	Loss (7 clients)
FedAvg	BREAKHIS	0.994 ± 0.0008	0.023 ± 0.0031	0.992 ± 0.0010	0.025 ± 0.0034	0.990 ± 0.0018	0.027 ± 0.0039
	BREAKHIS_2	–	–	–	–	0.997 ± 0.0012	0.027 ± 0.0035
	BUSI	–	–	0.940 ± 0.0064	0.272 ± 0.0073	0.936 ± 0.0068	0.280 ± 0.0078
	RSNA	0.885 ± 0.0045	0.225 ± 0.0050	0.880 ± 0.0050	0.234 ± 0.0052	0.874 ± 0.0054	0.242 ± 0.0055
	RSNA_2	–	–	–	–	0.880 ± 0.0051	0.245 ± 0.0054
	ROBOFLOW	0.858 ± 0.0039	0.330 ± 0.0022	0.854 ± 0.0046	0.338 ± 0.0020	0.848 ± 0.0048	0.346 ± 0.0032
	BCFPP	–	–	0.916 ± 0.0035	0.220 ± 0.0039	0.911 ± 0.0040	0.228 ± 0.0045
FedCycle	BREAKHIS	0.999 ± 0.0027	0.024 ± 0.0074	0.999 ± 0.0035	0.025 ± 0.0086	0.998 ± 0.0038	0.027 ± 0.0092
	BREAKHIS_2	–	–	–	–	0.998 ± 0.0036	0.027 ± 0.0089
	BUSI	–	–	0.945 ± 0.0058	0.268 ± 0.0061	0.940 ± 0.0062	0.276 ± 0.0066
	RSNA	0.894 ± 0.0059	0.225 ± 0.0109	0.890 ± 0.0068	0.233 ± 0.0114	0.884 ± 0.0072	0.241 ± 0.0121
	RSNA_2	–	–	–	–	0.884 ± 0.0071	0.241 ± 0.0110
	ROBOFLOW	0.870 ± 0.0083	0.328 ± 0.0128	0.867 ± 0.0086	0.336 ± 0.0136	0.860 ± 0.0089	0.344 ± 0.0141
	BCFPP	–	–	0.925 ± 0.0033	0.217 ± 0.0037	0.919 ± 0.0039	0.225 ± 0.0041

Table 14. Comparisons of Communication Cost.

Scenario	Method	Clients	Cost (MB)	Simulated Cost (GB)	Communication Overhead (×)
Scenario 1	FedAvg	4	107.6	21.02	1.0×
Scenario 1	FedCycle	4	1075.2	210.16	10.0×
Scenario 2	FedAvg	3	80.6	15.74	1.0×
Scenario 2	FedCycle	3	806.4	157.42	10.0×
Scenario 3	FedAvg	5	134.4	26.25	1.0×
Scenario 3	FedCycle	5	1344	262.50	10.0×
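
The overhead column can be reproduced directly from the Cost (MB) column. The sketch below divides the FedCycle cost by the FedAvg cost for each scenario and also reports the FedAvg cost per client, a derived figure that is not given in the table. The names `table14` and `overhead_factor` are illustrative.

```python
# (scenario, clients, FedAvg cost MB, FedCycle cost MB), copied from Table 14.
table14 = [
    ("Scenario 1", 4, 107.6, 1075.2),
    ("Scenario 2", 3, 80.6, 806.4),
    ("Scenario 3", 5, 134.4, 1344.0),
]


def overhead_factor(fedavg_mb, fedcycle_mb):
    """Ratio of FedCycle cost to FedAvg cost (FedAvg is the 1.0x baseline)."""
    return fedcycle_mb / fedavg_mb


summary = {}
for scenario, clients, fedavg_mb, fedcycle_mb in table14:
    summary[scenario] = {
        "overhead": round(overhead_factor(fedavg_mb, fedcycle_mb), 1),
        "fedavg_mb_per_client": round(fedavg_mb / clients, 2),
    }
    print(scenario, summary[scenario])
```

In all three scenarios the ratio rounds to 10.0, and the FedAvg cost per client comes out at roughly 26.9 MB, so the total cost in the table scales with the number of clients.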

Table 15. Communication Efficiency Analysis Across Scenarios.

Scenario	Dataset	$\Delta$ Acc	$\Delta$ Efficiency
Scenario 1	BREAKHIS	0.031	$3.20 \times 10^{-5}$
Scenario 2	BREAKHIS	0.005	$6.89 \times 10^{-6}$
Scenario 2	RSNA	0.009	$1.24 \times 10^{-5}$
Scenario 2	ROBOFLOW	0.012	$1.65 \times 10^{-5}$
Scenario 3	BREAKHIS	0.007	$5.79 \times 10^{-6}$
Scenario 3	BUSI	0.005	$4.13 \times 10^{-6}$
Scenario 3	RSNA	0.010	$8.27 \times 10^{-6}$
Scenario 3	ROBOFLOW	0.013	$1.07 \times 10^{-5}$
Scenario 3	BCFPP	0.009	$7.44 \times 10^{-6}$

Table 16. Comparisons of Time Efficiency.

Scenario	Number of Clients	Method	Time Efficiency (per s)
Scenario 1	4	Deep Learning	$8.25 \times 10^{-5}$
		FedAvg	$9.59 \times 10^{-5}$
		FedCycle	$1.015 \times 10^{-4}$
Scenario 2	3	Deep Learning	$2.27 \times 10^{-4}$
		FedAvg	$1.694 \times 10^{-4}$
		FedCycle	$1.323 \times 10^{-4}$
Scenario 3	5	Deep Learning	$2.27 \times 10^{-5}$
		FedAvg	$5.06 \times 10^{-5}$
		FedCycle	$4.82 \times 10^{-5}$
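
For Scenarios 2 and 3, the values in Table 16 appear to match the reciprocal of the MobileNetV2 training times in the Time (s) rows of Tables 7 to 10. That reading is an inference from the numbers, not a definition stated in this section, and the sketch below does not attempt to reproduce the Scenario 1 entries. The function name `time_efficiency` is illustrative.

```python
def time_efficiency(seconds):
    """Assumed definition: reciprocal of the total training time (per second)."""
    return 1.0 / seconds


# (scenario, method, training time in s, value reported in Table 16).
rows = [
    ("Scenario 2", "Deep Learning", 4403.037, 2.27e-4),
    ("Scenario 2", "FedAvg", 5902.137, 1.694e-4),
    ("Scenario 2", "FedCycle", 7563.755, 1.323e-4),
    ("Scenario 3", "Deep Learning", 44028.752, 2.27e-5),
    ("Scenario 3", "FedAvg", 19770.808, 5.06e-5),
    ("Scenario 3", "FedCycle", 20759.348, 4.82e-5),
]

reciprocals = []
for scenario, method, seconds, reported in rows:
    value = time_efficiency(seconds)
    reciprocals.append(value)
    print(f"{scenario} {method}: 1/t = {value:.3e}, reported {reported:.3e}")
```

The recomputed values agree with the table to about three significant figures; small differences in the last digit are expected from rounding.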

Table 17. Statistical Analysis Results Across Datasets.

Dataset	Client Num.	FT ON/OFF	FedAvg/FedCycle
BREAKHIS	$F = 0.0225$, $p = 0.9778$	$t = 42.72$, $p = 0.00055$	$F = 17.33$, $p = 0.0141$
BUSI	$t = 1.0201$, $p = 0.3656$	$t = 39.07$, $p = 0.00065$	$F = 1.01$, $p = 0.3728$
RSNA	$F = 1.7691$, $p = 0.2489$	$t = 41.75$, $p = 0.00057$	$F = 4.21$, $p = 0.1094$
ROBOFLOW	$F = 1.0673$, $p = 0.4013$	$t = 56.38$, $p = 0.00031$	$F = 7.81$, $p = 0.0491$
BCFPP	$t = 2.16$, $p = 0.097$	$t = 46.73$, $p = 0.00046$	$F = 10.50$, $p = 0.0317$
