A Comparative Analysis of Federated Learning for Multi-Class Breast Cancer Classification in Ultrasound Imaging

Elshenawy, Marwa Ali; Tawfik, Noha S.; Hamada, Nada; Kadry, Rania; Fayed, Salema; Ghatwary, Noha

doi:10.3390/ai6120316

by Marwa Ali Elshenawy ¹,*, Noha S. Tawfik ¹,*, Nada Hamada ¹, Rania Kadry ¹, Salema Fayed ² and Noha Ghatwary ²

¹ Computer Engineering Department, College of Engineering and Technology, Arab Academy for Science, Technology and Maritime Transport, Abou Kir, Alexandria P.O. Box 1029, Egypt

² Computer Engineering Department, College of Engineering and Technology, Arab Academy for Science, Technology and Maritime Transport, Smart Village Campus, Giza P.O. Box 2033, Egypt

* Authors to whom correspondence should be addressed.

AI 2025, 6(12), 316; https://doi.org/10.3390/ai6120316

Submission received: 13 October 2025 / Revised: 30 November 2025 / Accepted: 1 December 2025 / Published: 4 December 2025

(This article belongs to the Special Issue Artificial Intelligence for Future Healthcare: Advancement, Impact, and Prospect in the Field of Cancer)

Abstract

Breast cancer is the second leading cause of cancer-related mortality among women. Early detection enables timely treatment, improving survival outcomes. This paper presents a comparative evaluation of federated learning (FL) frameworks for multiclass breast cancer classification using ultrasound images drawn from three datasets: BUSI, BUS-UCLM, and BCMID, which include 600, 38, and 323 patients, respectively. Five state-of-the-art networks were tested, with MobileNet, ResNet, and InceptionNet identified as the most effective for FL deployment. Two aggregation strategies, FedAvg and FedProx, were assessed under varying levels of data heterogeneity in two- and three-client settings. The experimental results indicate that the FL models outperformed local and centralized training, mitigating the adverse impacts of data isolation and domain shift. In the two-client federations, FL achieved up to 8% higher accuracy and almost 6% higher macro-F1 scores on average than local and centralized training. FedProx on MobileNet maintained stable performance in the three-client federation, with a best average accuracy of 73.31% and a macro-F1 of 67.3% despite stronger heterogeneity. These results suggest that the proposed multiclass model has the potential to support clinical workflows by assisting in automated risk stratification. If deployed, such a system could allow radiologists to prioritize high-risk patients more effectively. The findings emphasize the potential of federated learning as a scalable, privacy-preserving infrastructure for collaborative medical imaging and breast cancer diagnosis.

Keywords:

federated learning; machine learning; breast cancer detection; multiclass classification; privacy-preserving machine learning; decentralized learning; medical image analysis

1. Introduction

The integration of Artificial Intelligence (AI) into medical practice is driving a transformative shift in healthcare, especially in oncology, where AI is increasingly instrumental in improving diagnostic accuracy, tailoring treatment regimens, and accelerating research. Cancer continues to rank among the leading causes of morbidity and mortality worldwide, with an estimated 18 million new cases and 10 million deaths reported annually (excluding non-melanoma skin cancer) in recent years [1]. Among the various types of cancer, breast cancer holds particular significance. It was the most commonly diagnosed cancer among women in 2022, with more than 2.3 million new cases and approximately 670,000 deaths worldwide [2]. Studies suggest that if current trends persist, new breast cancer cases will increase by 38% and deaths by 68% by 2050, reaching around 3.2 million new cases and 1.1 million deaths per year [3]. In this context, the ability of AI to analyze large and complex datasets (including medical images, genetic information, and electronic health records) supports earlier breast cancer detection, better risk assessment, and personalized treatments guided by precision oncology, ultimately helping reduce mortality and improve patient outcomes [4,5,6,7].

Deep neural networks (DNNs) have become the dominant approach for classifying breast lesions on ultrasound (US) images, demonstrating substantial improvements in diagnostic accuracy and clinical interpretability. Early post-2021 studies mainly adopted transfer learning with pre-trained convolutional neural networks (CNNs) such as ResNet, DenseNet, and EfficientNet, achieving high accuracy and AUC in differentiating benign from malignant lesions on standard datasets. More recent work has moved toward customized DNN architectures tailored to ultrasound characteristics [8,9], such as attention-based CNNs, hybrid CNN–transformer classifiers, and multi-view fusion networks that integrate multiple ultrasound planes for improved context [10]. Efforts to enhance generalization include synthetic data generation with GANs, domain adaptation, and ensemble learning, all of which have reduced overfitting and improved robustness across ultrasound vendors and imaging protocols [11,12]. Overall, DNN-based classifiers since 2021 have evolved from simple binary predictors to clinically oriented, interpretable systems, achieving performance comparable to experienced radiologists in controlled studies while moving steadily towards real-world validation.

Despite the promising performance of deep neural network (DNN) models for breast ultrasound classification, data security and patient privacy during model training remain significant challenges. DNNs require large, diverse datasets to achieve generalizable performance, but medical institutions are often reluctant to share raw ultrasound images due to strict privacy regulations such as HIPAA and GDPR, as well as ethical concerns about patient-identifiable information [13,14]. Traditional centralized training strategies aggregate data from multiple hospitals into a single repository. This increases the risk of data breaches, unauthorized access, and the potential misuse of sensitive medical records [15]. Moreover, anonymization alone is insufficient, as re-identification attacks can sometimes reconstruct private features from model gradients or embeddings [16].

Federated Learning (FL) offers an alternative solution by enabling decentralized training across multiple institutions without exposing raw data [17]. In this paradigm, only model parameters or updates are shared between clients and a central server, preserving data locality and reducing privacy risks while maintaining model performance across diverse ultrasound domains. Recent studies have shown that Federated Learning (FL) can effectively train breast cancer classification networks across datasets from multiple hospitals, achieving accuracy comparable to centralized models while maintaining compliance with privacy-preserving regulations.

In this study, we propose, compare, and evaluate a federated learning (FL) framework designed to enhance medical image diagnosis while maintaining data privacy and computational efficiency. To demonstrate its effectiveness, we apply the proposed approach to multi-class breast cancer classification (normal, benign, and malignant) using ultrasound imaging as a representative case study. The major contributions of this work include (i) the implementation and comprehensive evaluation of a federated learning framework specifically configured for multi-class ultrasound breast cancer diagnosis and (ii) a comprehensive comparative analysis of centralized, local, and federated paradigms across three publicly available datasets to assess diagnostic performance and robustness. Furthermore, to extend our experimental analysis, we introduce an ablation study examining the effect of different loss functions on the best-performing model configuration. Experimental results confirm that the proposed FL framework effectively addresses the data accessibility and privacy challenges associated with centralized training while preserving high diagnostic accuracy and enabling secure, collaborative knowledge exchange among distributed medical institutions.

2. Related Work

Recent studies have investigated the integration of federated learning (FL) with deep learning models to improve the accuracy of breast cancer detection while preserving patient privacy. AlSalman et al. [18] proposed a federated learning framework for breast cancer detection designed to classify mammogram images into malignant or benign categories. The model applied deep convolutional neural networks (DCNNs) and used multiple clients, such as hospitals and clinics, to train local models on their own datasets using local gradient descent. Afterwards, a global server aggregated the locally trained models into a weighted global model. The proposed model achieved a detection accuracy of 98.9% on three large-scale datasets: VinDr-Mammo, CMMD, and INBreast. Waly et al. [19] applied a custom U-Net within a federated learning framework on two ultrasound datasets: the Baheya Foundation dataset (Egypt) and Bus-set Dataset B (Spain). The model reached an accuracy of 89.5%, with precision values in the range of 0.54–0.58. The approach improved generalization and safeguarded data privacy, but the study faced limitations due to the small dataset size, ineffective pre-processing, and possible overfitting.

Recent studies have explored federated learning (FL) for breast cancer detection across imaging modalities such as mammography, ultrasound, and histopathology [20]. Using different datasets (e.g., CBIS-DDSM, INbreast, BreakHis, BUSI, and MIAS) and deep models (e.g., CNNs, U-Net, ResNet, YOLO), these studies achieved up to 99% accuracy, highlighting FL's potential; the main remaining challenges were data heterogeneity and the lack of benchmarking standards. Gupta et al. [21] proposed an ensemble FL framework using YOLOv6 with FedAvg and homomorphic encryption, attaining 98% accuracy on BreakHis and 97% on BUSI. Their method outperformed centralized models while preserving privacy and enabling real-time inference, though it was limited by non-IID data and minimal clinical validation. Another study investigated the use of synthetic data generation to enhance the performance of FL [22], applying FedAvg and FedProx as baseline algorithms to classify benign and malignant lesions on multiple ultrasound datasets (BUSI, BUS-BRA, and UDIAT). By applying class-specific Deep Convolutional Generative Adversarial Networks, the approach achieved an average AUC of 0.92 for FedAvg and 0.95 for FedProx, improving performance for small and non-IID datasets. However, the authors noted that excessive reliance on synthetic images could negatively affect performance.

A hybrid federated framework combining convolutional neural networks (CNNs) with graph neural networks (GNNs) was introduced in [23]. The model was pre-trained in federated mode for spatial and geometric learning to classify breast tumors from multi-center ultrasound datasets, including BUSI, BUS, and OASBUD. It achieved AUC-ROC values of 0.911, 0.871, and 0.767 on BUSI, BUS, and OASBUD, respectively, demonstrating the effectiveness of integrating spatial and geometric learning across distributed sites. A comprehensive survey on cross-hospital collaborative research [24] reviewed FL approaches across diverse data types, including medical images, electronic health records (EHRs), and clinical and administrative data. It assessed horizontal, vertical, and transfer learning, as well as regulatory compliance (HIPAA, GDPR), privacy, and security. Major strengths included enabling multi-institutional research and preserving patient privacy, whereas weaknesses centered on standardization challenges, vulnerability to security attacks, communication overhead, and incomplete clinical validation.

A web application called BreastInsight was developed using a federated learning framework for real-time breast cancer classification while preserving patient data privacy [25]. The model trained on multiple imaging modalities, including histopathology (BreakHis) and ultrasound (BUSI) images, utilized four variations of the Swin Transformer (Tiny, Small, Base, and Large) as base models, with outputs aggregated by a Random Forest meta-learner. It achieved impressive accuracies of 99.14% for BreakHis and 98.27% for BUSI. However, the model encountered challenges such as high computational demands, sensitivity to noise, coordination overhead in federated learning, and limited interpretability of ensemble predictions.

Three deep learning approaches were compared, including centralized and federated schemes [26]. These comprised individual AI models, feature space ensembles, and a hybrid model combining global Vision Transformer and CNN features. The models were evaluated on binary, multi-class, and BI-RADS classifications. In the centralized scenario, three clients aggregated their outputs using a global AI model. The federated model, trained on 734 cases with 41 risk factors, achieved accuracies of 98.65%, 97.30%, and 95.59% for binary, multi-class, and BI-RADS tasks, respectively, with an AUC of 0.970 [0.917–1.000]. However, the small dataset and ambiguity in borderline BI-RADS cases affected interpretability, as noted in the LIME analysis.

Shukla et al. [27] proposed a privacy-preserving machine learning model that integrates Differential Privacy (DP) and Federated Learning (FL) for decentralized breast cancer diagnosis in healthcare organizations. This FL system allows institutions to train models locally without sharing patient data, while DP protects against attacks and data leakage. The model achieved an accuracy of 96.1%, comparable to centralized models but with better privacy. However, it relies on a relatively small dataset limited to numeric features and does not fully address challenges like data heterogeneity. Bechar et al. [28] introduced a framework combining FL and Transfer Learning (TL) for collaborative cancer detection, using large-scale datasets like BreakHis, which contains 7909 histopathological images. This model allows local training while sharing only model parameters, achieving an accuracy of 96.32%. However, it also suffers from limited data heterogeneity and lacks other modalities like MRI and mammography. A hybrid approach in [29] utilized FL with LightGBM for cancer detection across institutions, incorporating Shapley Additive exPlanations (SHAP) for model interpretation. This framework explored multimodal data from Kaggle, achieving an accuracy of 98.3% with high precision and recall. Despite its strengths, the dataset may not reflect real clinical diversity, necessitating wider testing in diverse federated settings. Additionally, the computational demands of SHAP may limit its application in low-resource clinics.

While federated learning has been increasingly explored in breast cancer imaging, prior studies have largely been confined to binary classification tasks (benign vs. malignant) and to imaging modalities such as histopathology and mammography.

To date, limited research has explored multi-class federated learning for breast cancer classification using ultrasound data. Existing studies have primarily focused on binary classification tasks, and the only known attempt at a multi-class federated framework was built for breast cancer histopathological image classification [30]. To address this gap, this study systematically analyzes federated learning approaches for multi-class breast ultrasound classification across multiple institutions, providing a comparative assessment of their performance under realistic data heterogeneity.

3. Materials and Methods

3.1. Datasets

In this section, we provide a detailed description of the three breast ultrasound datasets used to evaluate our model. Moreover, Figure 1 illustrates samples of normal, benign, and malignant cases from each dataset.

3.1.1. BUSI: Breast Ultrasound Images Dataset

The Breast Ultrasound Images Dataset (BUSI) [31] is a collection of annotated ultrasound images gathered in 2018 from 600 female patients, with a total of 780 ultrasound images. The images are categorized into three classes: normal, benign, and malignant, with 133 normal images, 437 benign images, and 210 malignant images. These images are sourced from patients with various breast diseases, ensuring a diverse representation of real-world situations. Every image is annotated by professional physicians, with labels that specify the lesion type and its features. BUSI is intended to improve the precision of automated diagnostic instruments and assist physicians in analyzing ultrasound images.

3.1.2. BUS-UCLM: Breast Ultrasound Lesion Segmentation Dataset

The BUS-UCLM (Breast Ultrasound Lesion Segmentation) dataset [32] consists of 683 images obtained from 38 patients between 2022 and 2023 using a Siemens ACUSON S2000 Ultrasound System. It is categorized into three classes based on the diagnostic results: Normal (419 samples), Benign (174 samples), and Malignant (90 samples). Each image comes with a ground truth mask that indicates normal breast tissue (black label), benign tissue (green label), and malignant lesions (red label). This diverse collection of ultrasound images depicting various breast lesions makes the dataset a valuable resource for training and evaluating segmentation algorithms.

3.1.3. BCMID: Breast Cancer Multimodal Imaging Dataset

The Breast Cancer Multimodal Imaging Dataset (BCMID) [33] was collected from 323 adult female patients between 2019 and 2022, with ages ranging from 26 to 82 years at the time of evaluation, and contains a total of 1145 ultrasound images. The dataset was acquired from Ayadi Hospital in Alexandria, Egypt. Data for each patient include ultrasound and mammography scans, along with a validated report that summarizes the physician's diagnostic conclusion based on the BI-RADS score. Our model incorporated only the ultrasound scans. The dataset consists of three classes: normal, benign, and malignant, with 152 images in the normal class, 734 images in the benign class, and 259 images in the malignant class. The total number of ultrasound images exceeds the number of patients, as each patient may have multiple ultrasound views and/or scans acquired on different dates.

3.2. Data Splits and Pre-Processing

As shown in Table 1, the three datasets exhibit distinct and imbalanced class proportions, highlighting the heterogeneous nature of the local data distributions. BUS-UCLM contains a majority of normal cases (61%), whereas BUSI and BCMID are dominated by benign samples (56% and 64%, respectively). The absolute differences in class proportions across clients reach up to 48 percentage points (e.g., normal class: 61% in BUS-UCLM vs. 13% in BCMID). Pairwise comparison of class distributions yielded Jensen–Shannon divergence (JSD) values of 0.278 for BUSI–BUS-UCLM and 0.263 for BCMID–BUS-UCLM, with an average divergence of 0.19. These results quantitatively indicate substantial label-distribution dissimilarity among clients and confirm the non-IID nature of the data.
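To make the label-skew comparison concrete, the following is a minimal sketch of how a Jensen–Shannon divergence between two class distributions can be computed. It uses the per-class image counts listed in Section 3.1; the function name `js_divergence` and the choice of a base-2 logarithm are assumptions made for illustration. The text does not state the logarithm base, whether the divergence or its square root is reported, or which data split the counts come from, so the printed values should not be expected to match the figures quoted above.

```python
import numpy as np

def js_divergence(p_counts, q_counts, base=2.0):
    """Jensen-Shannon divergence between two class-count vectors.

    The log base is an assumption; with base 2 the result lies in [0, 1].
    """
    p = np.asarray(p_counts, dtype=float)
    q = np.asarray(q_counts, dtype=float)
    p = p / p.sum()          # convert counts to proportions
    q = q / q.sum()
    m = 0.5 * (p + q)        # mixture distribution

    def kl(a, b):
        mask = a > 0         # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log(a[mask] / b[mask])) / np.log(base)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Class counts (normal, benign, malignant) as listed in Section 3.1.
counts = {
    "BUSI": [133, 437, 210],
    "BUS-UCLM": [419, 174, 90],
    "BCMID": [152, 734, 259],
}

names = list(counts)
for a in range(len(names)):
    for b in range(a + 1, len(names)):
        value = js_divergence(counts[names[a]], counts[names[b]])
        print(f"{names[a]} vs {names[b]}: JSD = {value:.3f}")
```

A value of zero means identical class proportions, so larger values indicate stronger label skew between two clients.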

Prior to model training, all available ultrasound images with validated diagnostic labels (normal, benign, or malignant) were included in the analysis, while corrupted, duplicate, or unlabeled images were excluded. Only ultrasound images (excluding mammograms) were utilized from the BCMID dataset. Each dataset was treated as an independent client in the federated learning framework, and individual patients could have multiple ultrasound views or scans representing different lesion orientations. A consistent dataset partitioning scheme was used throughout the study. All three datasets were divided into training, validation, and test sets using a conventional ratio of 70% for training, 10% for validation, and 20% for testing. To prevent data leakage, this partitioning was performed strictly at the patient level rather than the image level. Consequently, all images associated with a single patient (including complex cases in the BUS-UCLM dataset where a patient may present with differing diagnostic classes) were restricted to a single subset (training, validation, or test). The test set remained completely held out and unseen during the training and hyperparameter tuning phases. Furthermore, this identical patient-level split was maintained across all experimental settings (centralized, local, and federated) to guarantee a fair and unbiased evaluation of model generalization on new data.
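The patient-level partitioning described above can be sketched as follows. This is an illustrative implementation, not the authors' code: the function name `patient_level_split`, the seed value, and the input format (one patient identifier per image) are assumptions. Because the 70/10/20 ratio is applied to patients rather than images, the resulting image proportions are only approximately 70/10/20.

```python
import random

def patient_level_split(image_patient_ids, train_frac=0.7, val_frac=0.1, seed=42):
    """Assign image indices to train/val/test so no patient spans two subsets.

    image_patient_ids: list with one (hypothetical) patient ID per image.
    The remaining fraction (here 20%) of patients goes to the test set.
    """
    patients = sorted(set(image_patient_ids))
    random.Random(seed).shuffle(patients)   # fixed seed for reproducibility

    n_train = round(train_frac * len(patients))
    n_val = round(val_frac * len(patients))

    # Map each patient to exactly one subset.
    assignment = {}
    for rank, pid in enumerate(patients):
        if rank < n_train:
            assignment[pid] = "train"
        elif rank < n_train + n_val:
            assignment[pid] = "val"
        else:
            assignment[pid] = "test"

    # Every image follows its patient's assignment.
    split = {"train": [], "val": [], "test": []}
    for idx, pid in enumerate(image_patient_ids):
        split[assignment[pid]].append(idx)
    return split

# Toy example: 10 patients with 3 images each.
ids = [f"patient_{i}" for i in range(10) for _ in range(3)]
split = patient_level_split(ids)
print({name: len(indices) for name, indices in split.items()})
```

Reusing the same seed and patient list in the centralized, local, and federated runs keeps the held-out test patients identical across settings.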

The data preprocessing workflow involved a dataset-specific step where images from the BCMID dataset were cropped to focus more precisely on the regions of interest. Following this, all images were uniformly resized to 224 × 224 pixels to match the input requirements of the pre-trained ML models. Pixel values were normalized by scaling pixel intensities to the [0, 1] range to ensure consistency across datasets. These standard preprocessing steps were applied identically for training, validation, and testing in all experimental setups. Although data augmentation is commonly used in the literature to enhance model performance and mitigate bias or class imbalance, it was deliberately excluded in this study to ensure a true simulation of real-world clinical conditions. To visualize feature-level differences across datasets (Figure 2), we extracted deep image features from each dataset, applied PCA for dimensionality reduction followed by t-SNE for visualization, and color-coded the embeddings by dataset. Distinct clusters correspond to dataset-specific distributions, confirming the presence of domain shift arising from differences in imaging devices, acquisition protocols, preprocessing techniques, and patient populations.

3.3. Models

In this section, we briefly describe the five state-of-the-art (SOTA) classification networks used to assess performance on the selected datasets. Our objective is to identify the optimal network to integrate into our federated learning model at the aggregation stage. The models selected for evaluation are DenseNet [34], EfficientNet [35], Inception [36], ResNet [37], and MobileNet [38].

3.3.1. DenseNet

DenseNet [34] is a CNN architecture that introduces direct connections between all layers within a dense block, promoting feature reuse, improved gradient flow, and compactness. Each layer receives inputs from all preceding layers, enabling multi-scale feature learning. Transition layers with batch normalization, 1 × 1 convolutions, and 2 × 2 average pooling are used between dense blocks to manage feature map dimensions. Its dense connectivity enables effective learning of both global and local features, making it well-suited for detecting subtle calcification patterns.

3.3.2. EfficientNet

EfficientNet [35] is a CNN architecture that achieves a balance between accuracy and efficiency using a compound scaling strategy that uniformly scales depth, width, and input resolution. The baseline model, EfficientNet-B0, is built with mobile inverted bottleneck (MBConv) blocks and depthwise separable convolutions, enabling significant parameter and FLOP reduction. Each block integrates expansion, depthwise, and projection layers, along with batch normalization and squeeze-and-excitation for adaptive feature recalibration.

3.3.3. Inception

An Inception network [36] is a CNN model based on Inception modules, which apply multiple parallel convolutional filters of different sizes (e.g., 1 × 1, 3 × 3, and 5 × 5), enabling the network to capture features at various scales and select the most effective features at each layer. This design allows the network to grow wider and deeper while keeping computational cost low, thereby improving performance in tasks such as image classification and object detection.

3.3.4. ResNet

ResNet [37] is a convolutional neural network architecture that incorporates residual learning to enhance the training of extremely deep networks. The basic concept involves skip connections that bypass one or more layers. This enables the network to learn residual functions (the difference between the desired output and the input) rather than the entire mapping directly. This architecture mitigates the vanishing-gradient problem and eases network training. By improving the flow of gradients through the network, ResNet achieves strong results on multiple tasks, especially image classification.

3.3.5. MobileNet

The MobileNet [38] series of models was included because of its design focus on computational efficiency, particularly in low-resource or edge deployment environments. The original MobileNet is built on depthwise separable convolutions, which drastically reduce the number of parameters and operations while preserving accuracy for visual recognition tasks. It is particularly suitable for image classification and object recognition on devices with constrained computing capabilities, such as smartphones and IoT devices. Its efficiency enables developers to deploy deep neural network architectures in resource-limited settings, bringing sophisticated machine learning functionality to mobile contexts.
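As a rough illustration of why depthwise separable convolutions shrink a model, the sketch below counts the weights of a standard convolution and of its depthwise separable counterpart. The layer sizes (a 3 × 3 kernel with 32 input and 64 output channels) are hypothetical and not taken from the study, and bias terms are ignored.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard convolution: one k x k x c_in filter per output channel."""
    return kernel * kernel * c_in * c_out

def depthwise_separable_params(kernel, c_in, c_out):
    """Weights in a depthwise (k x k per input channel) plus pointwise (1 x 1) pair."""
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise

# Hypothetical layer: 3x3 kernel, 32 input channels, 64 output channels.
std = standard_conv_params(3, 32, 64)          # 18432
sep = depthwise_separable_params(3, 32, 64)    # 2336
print(f"standard: {std}, separable: {sep}, ratio: {sep / std:.3f}")
```

The ratio equals 1/c_out + 1/k², so for 3 × 3 kernels the separable layer needs roughly one eighth to one ninth of the weights once the number of output channels is large.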

3.4. Aggregation Techniques

Traditional machine learning typically assumes that data are Independent and Identically Distributed (IID), i.e., that data across all clients follow the same statistical distribution. In contrast, Federated Learning (FL) operates on decentralized and statistically heterogeneous (non-IID) data distributed across clients, posing significant challenges for model convergence and generalization [39,40]. If our hospital datasets were IID, every client would have a similar ratio of normal, benign, and malignant cases, and ultrasound features (such as image quality or contrast) would be uniform. However, real-world multi-institutional collaborations invariably result in non-IID data. Our experimental setup clearly highlights the statistical heterogeneity between clients: the three datasets, BUSI, BCMID, and BUS-UCLM, exhibit label and quantity skew. Overall, the combined data are dominated by benign cases (51.5%), while the individual client distributions differ; for example, BUS-UCLM is predominantly composed of the normal class.

In this context, an essential component of the Federated Learning (FL) framework is how the server combines model updates received from clients. The aggregation algorithm greatly influences how quickly, reliably, and efficiently the training process converges. In this study, we examine two commonly adopted aggregation methods. The first, Federated Averaging (FedAvg) [39], is the core FL algorithm in which the server updates its global model by computing a weighted average of the parameters of the client models. Specifically, if $w_i^t$ denotes the model weights of client $i$ in round $t$, and $n_i$ represents its number of local data samples, the server performs the following update:

$$w^{t+1} = \sum_{i=1}^{K} \frac{n_i}{N} w_i^t \tag{1}$$

where $K$ is the number of clients and $N = \sum_{k=1}^{K} n_k$ represents the total number of data samples across all clients. In practice, each client trains its local model on its own data for a predefined number of epochs, then sends $w_i^t$ to the server. The server uses Equation (1) to compute a weighted average of these updates, producing the new global model $w^{t+1}$ that is broadcast back to all clients for the next round of training.
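The weighted average in Equation (1) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the aggregation code used in the study (which relies on the Flower framework); the function name `fedavg`, the representation of a model as a list of arrays, and the toy client sizes are assumptions.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Aggregate client models with a sample-size-weighted average (Equation (1)).

    client_weights: one list of NumPy arrays (layers) per client.
    client_sizes:   number of local training samples per client.
    """
    total = float(sum(client_sizes))            # total sample count over all clients
    num_layers = len(client_weights[0])
    return [
        sum((n / total) * weights[layer]
            for weights, n in zip(client_weights, client_sizes))
        for layer in range(num_layers)
    ]

# Toy example: two clients, one layer each; the second client holds 3x more data.
client_a = [np.array([0.0, 0.0])]
client_b = [np.array([4.0, 8.0])]
global_model = fedavg([client_a, client_b], client_sizes=[1, 3])
print(global_model[0])   # [3. 6.]
```

Because the weights are proportional to local sample counts, the client with more data pulls the global model closer to its own parameters, which is one source of bias when client sizes differ greatly.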

Although FedAvg is efficient in terms of communication, it often faces challenges when the client data are heterogeneous (non-IID). To overcome this limitation, we also employed Federated Proximal (FedProx) [40], an algorithm specifically designed to manage statistical heterogeneity by adjusting the local training objective. Specifically, for client $i$, the standard local loss function $F_i(w_i^t)$ is augmented with a proximal term that penalizes large deviations from the current global model $w^t$:

$$F_i^{prox}(w_i^t) = F_i(w_i^t) + \frac{\mu}{2} \| w_i^t - w^t \|_2^2 \tag{2}$$

where $\mu \geq 0$ is the proximal term coefficient controlling the strength of the regularization, and $\| w_i^t - w^t \|_2^2$ denotes the squared Euclidean norm of the difference between the local model and the global model. In practice, each client minimizes $F_i^{prox}(w_i^t)$ during local training and then sends the updated $w_i^t$ to the server, which aggregates the client models using the same weighted averaging scheme as in Equation (1). By incorporating this proximal regularization, FedProx stabilizes training and improves convergence under non-IID conditions by preventing local updates from drifting too far from the global model.
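The effect of the proximal term in Equation (2) can be illustrated with a small NumPy sketch. The helper names, the plain gradient-descent step, and the numeric values below are assumptions for illustration only; the study itself trains Keras models with the Adam optimizer.

```python
import numpy as np

def fedprox_objective(local_loss, w_local, w_global, mu):
    """Local loss plus the proximal penalty (mu / 2) * ||w_local - w_global||^2."""
    return local_loss + 0.5 * mu * np.sum((w_local - w_global) ** 2)

def fedprox_gradient(local_grad, w_local, w_global, mu):
    """Gradient of the FedProx objective: the penalty adds mu * (w_local - w_global)."""
    return local_grad + mu * (w_local - w_global)

def local_step(w_local, local_grad, w_global, mu, lr=0.1):
    """One plain gradient-descent step on the FedProx objective (illustrative only)."""
    return w_local - lr * fedprox_gradient(local_grad, w_local, w_global, mu)

w_global = np.zeros(2)
w_local = np.array([1.0, 2.0])

# With mu = 0 the objective reduces to the plain local loss used with FedAvg.
print(fedprox_objective(1.0, w_local, w_global, mu=0.0))
# With mu > 0, drifting away from the global model is penalized.
print(fedprox_objective(1.0, w_local, w_global, mu=0.5))
# Even with a zero local gradient, the step pulls w_local back toward w_global.
print(local_step(w_local, np.zeros(2), w_global, mu=0.5))
```

Larger values of the coefficient keep clients closer to the global model at the cost of slower adaptation to local data, so it is usually treated as a tunable hyperparameter.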

4. Experiments and Results

This section presents the experimental design, baseline comparisons and federated learning evaluations performed in this study. All experiments aim to assess the feasibility and effectiveness of federated learning for multi-class breast cancer ultrasound classification across multiple institutions. The experimental design is structured into three phases, each validating a core aspect of applying Federated Learning (FL) to multi-class breast cancer classification.

The underlying server–client architecture is illustrated in Figure 3. This decentralized training setup consists of a Central Server that coordinates three distinct clients: the Breast Cancer Multimodal Imaging Dataset (BCMID), acquired from Ayadi Hospital in Alexandria, Egypt; the BUSI dataset, also sourced from medical centers in Egypt; and the BUS-UCLM dataset, originating from a university hospital in Spain. This architecture ensures that the sensitive ultrasound images remain strictly localized on each client's secure device, preserving data privacy. In every communication round, each client performs local training on its proprietary data, generating updated model weights, which are the only parameters transmitted to the Central Server. The server then aggregates these updates through weighted averaging to produce an improved Global AI Model.

4.1. Experimental Settings

The models utilized across all experiments—centralized, local, and federated—were constructed using the Keras API with a TensorFlow backend. All core training parameters remained constant across modalities: models were trained using the Adam optimizer with a fixed learning rate of 0.001, a batch size of 32, and a maximum of 20 epochs, subject to early stopping based on validation loss. These specific hyperparameters were chosen based on established values in related literature, providing a robust and widely used baseline. Maintaining a fixed set of parameters was a deliberate choice to ensure a fair and direct comparison between the training paradigms, thereby isolating the performance differences attributable to the learning framework (centralized, local, or federated) rather than to model-specific tuning.

For multi-class classification, the output layer employed the softmax activation function, optimized initially via the categorical cross-entropy loss function. To support reproducibility, all environments were initialized using a fixed random seed, and deterministic behavior was enforced in the TensorFlow configuration. The federated setting was implemented and orchestrated using the open-source Flower framework [41], which follows a clear server–client architecture: a central server coordinates the training process (handling client selection and model aggregation), while decentralized clients execute local training tasks and transmit model updates back. Centralized and local experiments were conducted in the Google Colab environment with a CPU backend, whereas federated experiments were run locally on an Intel(R) Core(TM) i7-10510U CPU @ 1.80 GHz (2.30 GHz) with 16 GB RAM. This client hardware configuration was deliberately chosen to simulate real-world scenarios, particularly in contexts like hospitals, where participating entities typically lack advanced GPUs or high-end computational resources.

4.2. Evaluation Metrics

To assess the performance of the different approaches in classifying breast ultrasound images into normal, benign, and malignant categories, several standard evaluation metrics were employed, namely Accuracy, Macro Precision, Macro Recall, and Macro F1-Score. These metrics provide complementary insights into model performance, particularly under class imbalance conditions common in medical imaging.

Accuracy measures the overall proportion of correctly classified samples among all predictions and is defined as:

Accuracy = \frac{\sum_{i = 1}^{C} T P_{i}}{\sum_{i = 1}^{C} (T P_{i} + F N_{i})},

(3)

where C is the number of classes (three in this study), and TP_{i}, FP_{i}, FN_{i} and TN_{i} represent the true positives, false positives, false negatives and true negatives for class i, respectively. Because every sample belongs to exactly one true class, the denominator equals the total number of test samples. Although accuracy reflects the overall correctness of predictions, it fails to account for class imbalance and may therefore overestimate model performance.

To ensure a fair evaluation across classes, we employ macro averaging, which computes each metric independently for every class and then takes the unweighted mean. The macro precision and macro recall are defined as:

Macro Precision = \frac{1}{C} \sum_{i = 1}^{C} \frac{T P_{i}}{T P_{i} + F P_{i}},

(4)

Macro Recall = \frac{1}{C} \sum_{i = 1}^{C} \frac{T P_{i}}{T P_{i} + F N_{i}} .

(5)

Macro averaging treats all classes equally, ensuring that minority classes (such as malignant) contribute equally to the final metric. In contrast, a weighted average would bias the result toward majority classes (e.g., benign).

The F1-Score, which is the harmonic mean of precision and recall, balances the trade-off between false positives and false negatives:

F1_{i} = 2 \times \frac{{Precision}_{i} \times {Recall}_{i}}{{Precision}_{i} + {Recall}_{i}} .

(6)

The macro F1-Score is then computed as:

Macro F1-Score = \frac{1}{C} \sum_{i = 1}^{C} F1_{i} .

(7)

The F1-Score is particularly important in medical image classification, where both sensitivity and precision are critical: high recall ensures that most malignant cases are correctly identified, while high precision minimizes false alarms that could lead to unnecessary diagnostic procedures.
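
As an illustration of how these metrics relate, the sketch below computes accuracy (the fraction of correctly classified samples) and the macro-averaged precision, recall, and F1-score from a confusion matrix using only the Python standard library. The confusion matrix is hypothetical and not taken from the study; returning 0 when a denominator is zero is an assumed convention.

```python
from typing import Dict, List


def macro_metrics(cm: List[List[int]]) -> Dict[str, float]:
    """cm[t][p] counts samples of true class t predicted as class p."""
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    precisions: List[float] = []
    recalls: List[float] = []
    f1_scores: List[float] = []
    for i in range(n_classes):
        tp = cm[i][i]
        fp = sum(cm[t][i] for t in range(n_classes)) - tp  # column minus diagonal
        fn = sum(cm[i]) - tp                               # row minus diagonal
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)
    correct = sum(cm[i][i] for i in range(n_classes))
    return {
        "accuracy": correct / total,
        "macro_precision": sum(precisions) / n_classes,
        "macro_recall": sum(recalls) / n_classes,
        "macro_f1": sum(f1_scores) / n_classes,
    }


# Hypothetical confusion matrix; rows and columns: normal, benign, malignant.
example = [
    [8, 2, 0],
    [1, 16, 3],
    [0, 4, 6],
]
for name, value in macro_metrics(example).items():
    print(f"{name}: {value:.4f}")
```

Each class contributes one term to the macro averages regardless of how many samples it has, which is why a weak minority class lowers the macro scores more than it lowers accuracy.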

4.3. Baseline Performance and Architectural Comparison

The centralized learning experiment, which trains the model on the pooled data of all three clients, provides the primary performance benchmark. In an ideal scenario, this pooled result represents the theoretical upper bound of the achievable accuracy. Although it is generally assumed that training on larger and more diverse datasets leads to improved model generalization, this assumption does not always hold in medical imaging contexts. Previous research has shown that simply aggregating disparate data sources can introduce variability that confuses the model and reduces its ability to distinguish true diagnostic features, leading to unexpected performance drops [42,43].

To identify the most suitable architectures for the federated experiments, we first evaluate five state-of-the-art (SOTA) CNN models: DenseNet201, EfficientNetV2M, InceptionNetV3, MobileNet, and ResNet50V2. Each model is trained and validated independently on the three ultrasound datasets: BUSI, BCMID, and BUS-UCLM. All models are fine-tuned from ImageNet-pretrained weights using identical training protocols, input resolution, optimizer settings, and early stopping criteria to ensure a fair comparison. The results are provided in Table 2, where the epoch column denotes the final training epoch reached before termination by the early stopping criterion. Based on the macro-averaged metrics, which provide a balanced assessment across the multi-class breast cancer datasets, ResNet50V2 achieved the highest accuracy of 68%, while both MobileNet and InceptionNetV3 followed with 67%. Although DenseNet201 achieved the highest macro-F1 (69%), accuracy was used as the selection criterion. Accordingly, MobileNet, InceptionNetV3, and ResNet50V2 were chosen for the subsequent federated learning experiments.

It is worth noting that our preliminary experiments also included several Vision Transformer (ViT) architectures [44] and hybrid CNNs like ConvNeXt [45]. These models were excluded from the main federated analysis due to an unfavorable trade-off between performance and computational cost, as they did not generalize effectively to our ultrasound data. The ViT models, for instance, yielded centralized accuracies in the 64–71% range, offering only marginal improvement over our selected CNNs (up to 70.6% for the best ViT, compared to 68% for ResNet50V2), but required a substantially larger parameter count (up to 305.5M versus 3.8–53.81M for the CNNs). Furthermore, the ConvNeXt models showed significantly lower performance, achieving only 61.0% accuracy (ConvNeXt-Base) in the centralized setting. Given this gap between computational demand and classification performance, we deemed these architectures impractical for our target of resource-constrained clinical deployment, which justifies our focus on CNN-based deep learning networks.

4.4. Two-Client Federation: Heterogeneity and Architectural Robustness

While the Centralized result provides a performance benchmark, the independent Local Learning experiments, where models are trained exclusively on their respective institutional datasets (BUSI, BCMID, or BUS-UCLM), establish the minimum expected performance. These results reflect how each model performs when trained in complete isolation without external data exposure. The main goal of Federated Learning is to successfully bridge the performance gap between these two extremes, Centralized and Local, aiming to achieve performance metrics that significantly surpass the poor generalization of the local model while approaching or exceeding the centralized baseline across the various client test sets.

The initial centralized baseline is used to select the top three target architectures by assessing their highest achievable accuracy, as shown in the previous section. Next, the independent local baselines for the selected architectures are executed on each client’s isolated data.

This experiment compares the test performance of centralized, local, and federated learning under identical experimental conditions. The results are reported on the held-out test set of each client to ensure fairness and reproducibility. We implement and compare two popular FL algorithms: FedAvg and FedProx. FedAvg represents the standard baseline aggregation method, while FedProx introduces a proximal regularization term to mitigate the effects of data heterogeneity. Both algorithms are evaluated using the three selected architectures, MobileNet, ResNet50V2, and InceptionNetV3, in all three possible heterogeneous two-client pairings (BUSI + BCMID, BUSI + BUS-UCLM, and BCMID + BUS-UCLM). In this two-client federation setup, the global model is updated using the weighted aggregation in Equation (1), while each client’s local objective can include the proximal term from Equation (2) to mitigate the effects of non-IID data.
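
The effect of the proximal term on a client's local update can be illustrated with a toy example. The sketch assumes the standard FedProx local objective, F_k(w) + (mu / 2) * ||w - w_global||^2, so that each gradient step receives the extra term mu * (w - w_global). The one-dimensional loss, the learning rate, and the value of mu are hypothetical and chosen only to make the behavior visible.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def fedprox_local_step(w: Sequence[float],
                       w_global: Sequence[float],
                       grad_fn: Callable[[Sequence[float]], Sequence[float]],
                       lr: float,
                       mu: float) -> Vector:
    """One gradient step on F_k(w) + (mu / 2) * ||w - w_global||^2."""
    grad = grad_fn(w)
    return [wi - lr * (gi + mu * (wi - wgi))
            for wi, gi, wgi in zip(w, grad, w_global)]


def run_local_training(w_global: Sequence[float],
                       grad_fn: Callable[[Sequence[float]], Sequence[float]],
                       lr: float, mu: float, steps: int) -> Vector:
    """Start from the global model and take `steps` local updates."""
    w = list(w_global)
    for _ in range(steps):
        w = fedprox_local_step(w, w_global, grad_fn, lr, mu)
    return w


def local_grad(w: Sequence[float]) -> Vector:
    # Hypothetical client loss 0.5 * (w - 4)^2, whose optimum is w = 4.
    return [wi - 4.0 for wi in w]


# mu = 0 reduces to the plain local update used with FedAvg.
w_plain = run_local_training([0.0], local_grad, lr=0.1, mu=0.0, steps=200)
w_prox = run_local_training([0.0], local_grad, lr=0.1, mu=1.0, steps=200)
print(w_plain, w_prox)
```

With mu = 0 the client converges to its own optimum, while a positive mu keeps the local solution between the client optimum and the global model, which is the mechanism that limits divergence under non-IID data.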

The comparative results are presented in Table 3, Table 4 and Table 5, which detail the per-client metrics on the respective test sets. The metrics presented include accuracy, macro-precision, macro-recall, and macro-F1 score. It is important to note that the reported centralized learning setup is defined per combination of clients: for each pair of institutions, a centralized model is trained on the pooled data of the two participating datasets and evaluated individually on each client’s respective test set.
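
The resulting two-client experimental grid can be enumerated with a short sketch. The client, architecture, and algorithm names come from the text; the variable names and the tuple layout are hypothetical conveniences.

```python
from itertools import combinations, product

clients = ["BUSI", "BCMID", "BUS-UCLM"]
architectures = ["MobileNet", "ResNet50V2", "InceptionNetV3"]
strategies = ["FedAvg", "FedProx"]

# All unordered two-client pairings.
pairings = list(combinations(clients, 2))

# One federated run per (pairing, architecture, aggregation strategy).
federated_runs = list(product(pairings, architectures, strategies))

# One pooled centralized model per (pairing, architecture), evaluated
# separately on the test set of each client in the pair.
centralized_evals = [
    (pair, arch, test_client)
    for pair, arch in product(pairings, architectures)
    for test_client in pair
]

print(len(pairings), len(federated_runs), len(centralized_evals))
```

The enumeration makes explicit that the centralized baseline differs between pairings, since each pooled model only sees the two datasets of its own pair.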

The BCMID + BUS-UCLM federation represents the most heterogeneous data scenario, with substantial label and quantity skew between clients. The evaluated architectures experienced moderate performance degradation compared to the other two-client settings. In the MobileNet and ResNet architectures, FedAvg achieved marginally higher performance on the BCMID client in isolation. This can be attributed to the fact that BCMID is a larger dataset than BUS-UCLM, which allows FedAvg’s direct averaging mechanism to converge more effectively when one client dominates the data distribution.

Overall, the results demonstrate that federated learning generally outperformed independent Local models across all two-client combinations, confirming the effectiveness of federated aggregation in mitigating data isolation. In most settings, FedProx achieved slightly higher stability and more balanced macro-averaged metrics, particularly under stronger heterogeneity, as in the BUSI + BCMID pairing. For all three architectures (MobileNet, ResNet50V2, and InceptionNetV3), federated models matched or exceeded the performance of traditional learning while preserving decentralized data privacy.

4.5. Scalability and Generalization with Multi-Client Federation

This experiment evaluates the scalability and generalization capability of the federated learning framework. In this setting, all three clients, BUSI, BCMID, and BUS-UCLM, participate in a collaborative training process. This configuration simulates a more complex multi-institutional collaboration where data distributions, imaging protocols, and patient demographics may vary substantially across sites. Table 6 summarizes the results for all selected architectures (MobileNet, ResNet50V2, and InceptionNetV3), reporting classification accuracy and macro-F1 scores evaluated on each client’s test set.

As shown, FedAvg exhibited a marked degradation in performance, with the average accuracy dropping to 67.6% and macro-F1 to 57.33%, even below the Local baseline (71.0%, 66%). This decline reflects the sensitivity of FedAvg to severe inter-client heterogeneity. In contrast, FedProx improved the global model’s stability, achieving 73.3% accuracy and surpassing all baselines across most clients for MobileNet. BUSI and BUS-UCLM, the datasets with smaller class imbalance and more consistent imaging conditions, benefited the most from FedProx, while BCMID showed a moderate gain. For the ResNet50V2 architecture, a similar yet less favorable pattern was observed. Although FedProx mitigated the severe degradation seen with FedAvg, its performance remained slightly below the Local baseline, with an average accuracy of 69.1% and a macro-F1 of 63%. For the InceptionNetV3 model, federated performance remained comparable to the centralized benchmark, with FedProx achieving 66.6% accuracy and 59.7% macro-F1, a modest improvement over FedAvg.

As presented in Table 6, the three-client federation combining BUSI, BCMID, and BUS-UCLM achieved consistent performance improvements over both local and centralized training across architectures. For the MobileNet model, FedProx reached an average accuracy of 73.31% and a macro-F1 of 67.31%, outperforming the average Local (71.00%, 66.33%) and Centralized (72.09%, 55.20%) settings. A similar pattern was observed with ResNet50V2, where FedProx achieved 69.07% average accuracy and a macro-F1 of 63.16%, achieving comparable accuracy to centralized training while surpassing it in F1-score. However, FedAvg retained a slight advantage, specifically on the BCMID client, for this architecture. These results confirm that incorporating data from all three institutions enhanced the robustness and generalization of the global model, despite increased heterogeneity across clients.

To validate the statistical reliability of the best-performing configuration, MobileNet with FedProx, we conducted the DeLong test to compare the AUC-ROC curves across training paradigms. The analysis yields two findings that support the efficacy of the proposed FL framework. First, the comparison between Centralized and FedProx showed no statistically significant difference for any client (all p > 0.05), indicating that our privacy-preserving FedProx model achieves diagnostic performance comparable to that of centralized data pooling in this experimental setting. Second, while Centralized training proved statistically superior to Local training for the BCMID (p < 0.001) and BUSI (p = 0.011) datasets, the improvement of FedProx over Local training only approached significance (p = 0.09 for BCMID). Although statistical significance at α = 0.05 was not reached, these p-values, combined with the consistent accuracy gains reported in Table 6, suggest a positive trend in generalizability and clinical robustness.
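
For readers unfamiliar with the DeLong test, the following standard-library sketch compares the AUCs of two models scored on the same binary-labeled test samples. It is a direct O(mn) implementation of the structural-components formulation and is not the code used in this study. How the three-class problem was reduced to ROC curves is not specified here, so the binary (for example, one-vs-rest) setting is an assumption, and the labels and scores are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def _psi(pos: float, neg: float) -> float:
    """Pairwise kernel: 1 if the positive is ranked higher, 0.5 on ties."""
    if pos > neg:
        return 1.0
    return 0.5 if pos == neg else 0.0


def _components(labels: Sequence[int],
                scores: Sequence[float]) -> Tuple[float, List[float], List[float]]:
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    v10 = [sum(_psi(p, q) for q in neg) / len(neg) for p in pos]
    v01 = [sum(_psi(p, q) for p in pos) / len(pos) for q in neg]
    return sum(v10) / len(v10), v10, v01


def _cov(a: Sequence[float], b: Sequence[float]) -> float:
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def delong_test(labels: Sequence[int], scores_a: Sequence[float],
                scores_b: Sequence[float]) -> Tuple[float, float, float]:
    """Return (auc_a, auc_b, two-sided p). Needs >= 2 positives and negatives."""
    auc_a, v10_a, v01_a = _components(labels, scores_a)
    auc_b, v10_b, v01_b = _components(labels, scores_b)
    m, n = len(v10_a), len(v01_a)
    var = ((_cov(v10_a, v10_a) + _cov(v10_b, v10_b)
            - 2 * _cov(v10_a, v10_b)) / m
           + (_cov(v01_a, v01_a) + _cov(v01_b, v01_b)
              - 2 * _cov(v01_a, v01_b)) / n)
    if var <= 0.0:
        # Degenerate case: the AUC difference has no estimated variance.
        return auc_a, auc_b, 1.0 if auc_a == auc_b else 0.0
    z = (auc_a - auc_b) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return auc_a, auc_b, p_value


# Hypothetical scores of two models on the same six test samples.
y = [1, 1, 1, 0, 0, 0]
model_a = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
model_b = [0.6, 0.3, 0.7, 0.5, 0.4, 0.2]
print(delong_test(y, model_a, model_b))
```

The covariance terms account for both models being evaluated on the same samples, which is what distinguishes this test from comparing two independent AUC estimates.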

Across all datasets, the ROC analysis in Figure 4 shows comparable discrimination performance between the Federated Proximal (FedProx) and Centralized models, while the Local models showed more variability. For the BCMID dataset, the Centralized model achieved the highest AUC (0.804), followed closely by FedProx (0.790), with Local training performing notably lower (0.752). In the BUSI dataset, the Local model unexpectedly achieved the highest AUC (0.949), while the Centralized and FedProx models yielded similar performance (0.924 and 0.918, respectively). For the BUS-UCLM dataset, the AUC values of Centralized (0.908) and Local (0.904) were nearly identical, with FedProx slightly lower at 0.874. These trends were supported by DeLong’s statistical tests: significant differences appeared only in the Centralized vs. Local comparison for BCMID and BUSI, whereas Local vs. FedProx and Centralized vs. FedProx showed no statistically significant differences across all datasets. Overall, FedProx consistently provided performance that was statistically indistinguishable from Centralized training while maintaining stability across heterogeneous client distributions.

4.6. Ablation Study: Adaptive Loss Function for Heterogeneity Robustness

The comparative analysis shows that inter-client heterogeneity in the three-client federation poses a significant challenge, limiting the performance gains of standard federated learning algorithms, including FedProx. To address this, we propose a targeted contribution focused on client-level optimization. Instead of altering the server-side aggregation, we investigate whether adapting the client’s local training objective can better accommodate its specific data distribution. Based on its superior average performance in the three-client scenario, as highlighted in Table 6, the MobileNet architecture combined with the FedProx algorithm was selected as the base for this investigation. We investigate the Tversky loss, a function commonly used in medical image analysis, particularly for segmentation tasks with highly imbalanced classes [46]. The Tversky Index (TI) generalizes both the Dice coefficient and the Jaccard index and is defined as

TI = \frac{TP}{TP + α FP + β FN}

(8)

The key parameters of the Tversky loss are α and β, which explicitly control the penalties for False Positives (FP) and False Negatives (FN), respectively. In the context of medical diagnostics, a False Negative (missing a malignant tumor) is often far more detrimental than a False Positive. To embed this clinical priority into the model’s objective, one can set β > α, thereby pushing the model to learn features that minimize the risk of missing positive cases. In our proposed FL setting, we conducted an ablation study consisting of three experiments, as summarized in Table 7. The baseline is the standard Categorical Cross-Entropy (CCE) loss function. The second experiment replaces CCE with a standard Tversky loss using commonly cited fixed parameters (α = 0.3, β = 0.7). The third employs a combined loss with 50% fixed Tversky and 50% CCE to test a blended, more stable approach.
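
The three loss configurations can be sketched in plain Python as follows. The adaptation of the Tversky index to multi-class classification is not detailed here, so this sketch makes several assumptions: soft TP, FP, and FN counts are computed per class from one-hot labels and softmax outputs, the loss is one minus the mean per-class Tversky index, and a small smoothing constant `eps` avoids division by zero. All function names and example values are hypothetical.

```python
import math
from typing import Sequence

Batch = Sequence[Sequence[float]]


def tversky_loss(y_true: Batch, y_prob: Batch,
                 alpha: float = 0.3, beta: float = 0.7,
                 eps: float = 1e-7) -> float:
    """1 - mean per-class Tversky index computed from soft counts."""
    n_classes = len(y_true[0])
    indices = []
    for c in range(n_classes):
        tp = sum(t[c] * p[c] for t, p in zip(y_true, y_prob))
        fp = sum((1 - t[c]) * p[c] for t, p in zip(y_true, y_prob))
        fn = sum(t[c] * (1 - p[c]) for t, p in zip(y_true, y_prob))
        indices.append((tp + eps) / (tp + alpha * fp + beta * fn + eps))
    return 1.0 - sum(indices) / n_classes


def cce_loss(y_true: Batch, y_prob: Batch, eps: float = 1e-7) -> float:
    """Mean categorical cross-entropy over the batch."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        total -= sum(tc * math.log(max(pc, eps)) for tc, pc in zip(t, p))
    return total / len(y_true)


def combined_loss(y_true: Batch, y_prob: Batch, weight: float = 0.5) -> float:
    """Blend of Tversky and CCE; weight = 0.5 gives the 50/50 configuration."""
    return (weight * tversky_loss(y_true, y_prob)
            + (1.0 - weight) * cce_loss(y_true, y_prob))


# Hypothetical one-hot labels and softmax outputs for three samples.
labels = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.3, 0.5]]
print(cce_loss(labels, probs),
      tversky_loss(labels, probs),
      combined_loss(labels, probs))
```

With β > α, probability mass missing from the true class (soft FN) is penalized more heavily than mass assigned to a wrong class (soft FP), while the cross-entropy term in the blend keeps a per-sample gradient signal.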

The primary benefit of introducing the adaptive loss functions is demonstrated by their impact on the most challenging and imbalanced dataset, BCMID. The Fixed Tversky loss configuration yields a marked improvement for BCMID, raising its accuracy from 61.29% (baseline) to 66.94% and its Macro F1-score from 54% to 58.84%. However, this optimization comes with a trade-off: the same loss causes a performance drop on the BUS-UCLM dataset, with Macro F1 declining from the baseline. In contrast, the combined loss finds a different balance: while it does not improve BCMID as much as the Tversky loss (reaching 62.90%), it elevates the performance on the BUS-UCLM dataset, maintaining higher stability across the public benchmarks. This suggests that while the Tversky component penalizes false negatives, the addition of cross-entropy provides necessary gradient stability, allowing the model to converge to a higher sensitivity for minority classes, as visualized in the confusion matrices in Figure 5.

5. Discussion

In our experiments, the combined multi-institutional centralized learning did not consistently outperform each institution’s independent learning despite having access to aggregated data. Although centralized models occasionally achieved slightly higher average accuracy, their performance gains were inconsistent. This observation reflects a known challenge of naively pooling data across heterogeneous medical sources. In medical imaging, this heterogeneity often appears as domain shift, arising from differences in scanner manufacturers, imaging protocols, and patient populations across hospitals [47]. When such disparate data are simply pooled, negative transfer can occur, where the unique, high-quality features learned from one institution’s consistent protocol are effectively treated as noise by the training process, resulting in a generalized model that performs similarly to local models. Similar observations have been reported in the previous literature, where incorporating data from multiple institutions sometimes even degraded model performance [42,43,48].

The comparative analysis confirms that the FL approach remains effective in mitigating the negative impact of data heterogeneity specifically within the domain of multi-institutional ultrasound classification. In particular, the observed performance gains of both FedAvg and FedProx over independent Local baselines highlight FL’s capacity to transfer complementary diagnostic knowledge across clients without compromising data privacy. For example, in the BUSI + BCMID federation (Table 3), FedProx on MobileNet improved the average Macro F1-Score from 65% in the Local setting to 68%, while accuracy rose from 67.9% to 74.5%. A similar trend was observed with InceptionNetV3, where FedProx improved average accuracy from 61.0% to 67.6% and Macro F1 from 57% to 62%. Similarly, in the BUSI + BUS-UCLM combination (Table 4), FedAvg on MobileNet achieved an average accuracy of 83.6% compared to 78.5% locally, with a corresponding increase in Macro F1-Score from 74% to 79%. Even in the most heterogeneous pairing (BCMID + BUS-UCLM, Table 5), FedProx on MobileNet improved accuracy from 66.3% to 67.7%. InceptionNetV3 followed the same modest improvement pattern, with accuracy increasing from 60.0% to 61.2%, further confirming the limited but positive transfer effect. Consistent with prior research, FedProx exhibited greater stability under statistically heterogeneous (non-IID) conditions. In the BUSI + BCMID federation, FedProx produced more consistent macro-F1 scores across clients, indicating improved robustness to imbalance even when average performance differences were marginal. It is worth noting, however, that FedAvg outperformed FedProx in certain configurations, particularly where inter-client heterogeneity was moderate. For instance, in the BUSI + BUS-UCLM setup using the MobileNet architecture (Table 4), FedAvg achieved a slightly higher average accuracy than FedProx. 
InceptionNetV3 also showed near-equivalent performance between FedAvg (73.5%) and FedProx (74.1%), supporting the notion that under moderate heterogeneity, both aggregation methods yield comparable outcomes. The marginal advantage of FedAvg on MobileNet may be attributed to its simpler averaging mechanism being more effective when client gradients are already well aligned due to similar data distributions. Some theoretical and empirical analyses suggest that under mild non-IID settings, FedAvg’s performance can be comparable to that of FedProx, whereas FedProx tends to be more advantageous under stronger heterogeneity [40].

The results highlight important characteristics of each training strategy under varying data conditions. Centralized training generally delivered the strongest or near-strongest ROC curves, reflecting its advantage when all training data are pooled; however, this approach is often infeasible due to privacy and data-sharing restrictions. Local training, while effective on the BUSI dataset, exhibited substantial variability, underperforming markedly on BCMID and showing only moderate performance on BUS-UCLM. This instability likely reflects sensitivity to limited or non-representative client data, making local-only models unreliable in non-IID settings. In contrast, FedProx consistently achieved AUC values close to those of the Centralized model across all datasets, without any statistically significant performance drop. This suggests that FedProx effectively mitigates heterogeneity-induced client drift while preserving discriminative performance. Importantly, although FedProx did not significantly outperform either Local or Centralized models, its stability and robustness across datasets make it the most dependable choice when data cannot be centralized and when client data distributions differ. These findings emphasize FedProx’s suitability for real-world federated learning environments that must balance privacy constraints with model performance.

To further elaborate on the diagnostic significance of this contrast, the utility of FedProx lies mainly in its ability to manage client drift, the fundamental convergence failure in non-IID FL. By introducing the proximal regularization term from Equation (2), FedProx restrains the local model from diverging too far from the global model during client training. In practice, when a client like BCMID has a severe majority-class imbalance (64% benign), standard FedAvg allows the local model to over-optimize toward benign features, causing its updates to conflict with the global model. This results in the drastic performance drop seen in the three-client federation, where the Macro F1-Score fell to 57.33% for MobileNet. In contrast, FedProx’s regularization counters this tendency, transferring knowledge in a more stable manner. This helps preserve diagnostic performance for the minority, clinically critical classes, Normal and Malignant, leading to the superior Macro F1-Score of 66.42%. Furthermore, this principle is supported by our ablation study in Section 4.6, where the introduction of the Tversky loss, a local, client-side regularizer, achieved similar stability and boosted the global Macro F1-Score for the most imbalanced client, BCMID. This shows that model stabilization, whether through proximal regularization of the local update (FedProx) or client-side loss function design (Tversky), is essential for maintaining reliable multi-class diagnostic accuracy under high data heterogeneity.

The client-level performance gains from federated learning were not uniformly distributed across datasets, but instead reflected each dataset’s internal class balance and size. In the BUSI + BCMID federation running on MobileNet, the BCMID client, although the largest in sample size, benefited the most due to its strong benign-class dominance, with accuracy improving from 55.6% to 68.0% and macro-F1 from 50% to 57% under FedProx, while BUSI’s already robust performance remained stable (80.3% to 81.0%). Similarly, in the BCMID + BUS-UCLM federation, the BCMID client again exhibited noticeable improvement, with its accuracy increasing from 55.6% (Local) to 64.9% under FedAvg and 58.9% under FedProx, and corresponding macro-F1 scores rising from 50% to 57% and 55%, respectively. However, the FedAvg gain came at the cost of severe degradation on the BUS-UCLM client (63.15%), indicating client drift, whereas FedProx balanced the two clients more effectively. This benefit pattern highlights that FL disproportionately aids clients constrained by data imbalance rather than size, while well-balanced sites contribute stability to the global model, showing how FL can transfer complementary diagnostic representations from more balanced datasets to highly skewed ones. In the BUSI + BUS-UCLM federation, both institutions achieved consistent gains, although the relative improvement was more evident for BUS-UCLM, reflecting FL’s ability to enhance smaller and slightly imbalanced datasets. Our results also align with recent studies emphasizing that federated strategies outperform isolated local training in medical imaging domains characterized by strong domain shifts [21]. However, the reduced gains in the BCMID + BUS-UCLM pairing suggest that extreme heterogeneity still poses convergence challenges, a limitation also identified in [13,15], which emphasized that non-IID distributions remain a primary obstacle to the clinical implementation of FL.

In federated learning research, it is commonly observed that increasing the number of participating clients can reduce global model performance, mainly because the aggregation process must accommodate a wider range of data characteristics [15,49]. Nevertheless, the results in Table 6 demonstrate that expanding the federation to include BUSI, BCMID, and BUS-UCLM improves the robustness and diagnostic balance of the global model compared to both local and centralized approaches. The MobileNet configuration achieved the strongest results, with an average accuracy of 73.31%. The observed decline in FedAvg reflects the algorithm’s sensitivity to severe statistical heterogeneity between clients, as the participating datasets (BUSI, BCMID, and BUS-UCLM) differ in both class distribution and sample quantity. In such highly non-IID conditions, FedAvg’s straightforward parameter averaging leads to conflicting gradient updates, causing the global model to oscillate between divergent local optima rather than converging to a stable state. However, increasing the number of clients also introduces notable challenges. As more institutions participate, variations in data distributions, annotation styles, and acquisition devices amplify heterogeneity. A per-dataset inspection reveals that the impact of expanding from a two-client to a three-client federation is not uniform across sites. For MobileNet (FedProx), BUSI’s peak two-client performance occurred in the BUSI + BUS-UCLM pairing (accuracy = 83.55%, Table 4), but this fell to 80.92% in the three-client setup, a drop of 2.63 percentage points. BCMID exhibits its strongest two-client result when paired with BUSI, with an accuracy of ≈68.00% (Table 3), yet its accuracy in the three-client federation is 61.29%, a decline of 6.85 percentage points relative to that best two-client pairing. However, BCMID does improve relative to its performance in the BCMID + BUS-UCLM pairing (58.87% in Table 5), gaining around 2.4 percentage points in the three-client case.

Analyzing the performance across federation scenarios reveals a consistent pattern in which the in-house clinical dataset (BCMID) benefits greatly from collaboration with public benchmarks. As illustrated in Figure 6, in the two-client settings (Table 3, Table 4 and Table 5), the pairing of BUSI and BCMID yielded the largest improvement, with the private client gaining 12.51% in accuracy over local training, illustrating the benefits of FL knowledge transfer. This trend persists in the three-client federation (Table 6), where the private client achieved a 5.65% gain compared to modest improvements for the public clients. Most notably, the proposed loss framework (Section 4.6) amplified this advantage: while the public datasets BUSI and BUS-UCLM saw limited improvements, the in-house BCMID dataset achieved a substantial ≈6% boost. This demonstrates how resource-constrained clinical institutions can leverage high-quality external representations to improve local diagnostic accuracy without sharing sensitive data, while the performance standards of established public benchmarks are maintained.

Overall, the present findings show that federated learning provides a potential alternative that combines privacy preservation, by not requiring direct data sharing, with scalable multi-institutional learning. The uniform behavior observed across our selected models indicates that this performance pattern could reasonably be expected to hold for other neural networks. The comparative results highlight that federated learning most benefits the clients with more limited or skewed data, in line with previous studies [18,22], which reported similar cross-institutional gains for underrepresented datasets in mammography and pathology classification.

Limitations

While this study demonstrates the architectural robustness and heterogeneity mitigation capabilities of Federated Learning (FL) for multi-class breast cancer classification, it is important to acknowledge several limitations that define the scope of the current work and suggest directions for future research. A primary advantage of FL is its promise of enhanced data privacy; however, this study focuses solely on the performance and heterogeneity-handling aspects of FL protocols and does not incorporate advanced cryptographic or differential privacy mechanisms. In other words, no additional privacy protection was applied beyond the distributed nature of the training process itself. Moreover, the transition from our simulated environment to a live clinical deployment within a routine hospital environment presents practical logistical, computational, and human-centric challenges that this study did not address. A key consideration is the complexity of deploying and maintaining the federated orchestration system (such as the Flower framework). While local computation is efficient, the iterative communication rounds central to FL introduce significant overhead due to network latency. Our analysis did not evaluate mitigation strategies, such as scheduling communication during off-peak hours or optimizing the number of aggregation rounds to balance performance and time costs. Additionally, the operational robustness of a real-world system was assumed to be ideal, yet varying levels of digital infrastructure readiness across institutions could hinder deployment. Finally, the successful adoption of such models relies not only on technical metrics but also on overcoming clinical resistance arising from skepticism or lack of trust in AI-driven decision support systems. Another limitation is that the analysis used only three distinct datasets (BUSI, BCMID, and BUS-UCLM). Although this allowed testing of all two-client combinations and a comprehensive three-client setup, validating the true scalability of the FL framework in a real-world setting would require a significantly larger number of clients (e.g., 5 to 10 hospitals). This study was constrained by the limited availability of high-quality, publicly accessible ultrasound datasets that offer multi-class annotations (Normal, Benign, Malignant). Consequently, while the current experimental design simulates heterogeneity, it does not fully confirm generalizability to entirely new patient populations or diverse imaging protocols. Future validation on a completely independent, unseen dataset from a distinct clinical site is required to verify the model’s robustness in scenarios outside the training federation.

A methodological limitation of our study is the absence of repeated experimental runs to calculate inter-run variance. This constraint was imposed by the significant computational and time resources required to execute full, end-to-end federated experiments across all centralized, local and federation scenarios. Consequently, all reported results are derived from a single, deterministically seeded execution. Fixing the seed maximizes reproducibility and keeps the training conditions identical across all comparisons, but it does not quantify how much the results would vary under different seeds.
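To make the notion of inter-run variance concrete, the following is a minimal sketch of how repeated seeded runs could be summarized. It is not the pipeline used in this study: `run_experiment` is a hypothetical placeholder that returns a synthetic score, and the seed list is an arbitrary assumption. A real version would seed the deep learning framework, run the full federated experiment, and return the test metric.

```python
import random
import statistics


def run_experiment(seed: int) -> float:
    """Hypothetical stand-in for one full training run.

    Returns a synthetic score so that the sketch is runnable.
    The value is illustrative only and is not a result from the paper.
    """
    rng = random.Random(seed)  # local generator: same seed, same output
    return 0.70 + rng.uniform(-0.02, 0.02)


def summarize_runs(seeds):
    """Return the mean and sample standard deviation across seeded runs."""
    scores = [run_experiment(s) for s in seeds]
    return statistics.mean(scores), statistics.stdev(scores)


# Five seeds are an arbitrary choice for illustration.
mean_score, std_score = summarize_runs([0, 1, 2, 3, 4])
print(f"mean={mean_score:.4f}, std={std_score:.4f}")
```

Reporting the mean together with the standard deviation over several seeds would show whether the differences between training paradigms exceed run-to-run noise.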

6. Conclusions and Future Work

This study presented a comparative analysis of federated learning (FL) approaches for multi-class breast cancer classification across three categories (benign, malignant, and normal). It helps address the current gap in applying federated learning to multi-class breast ultrasound classification, an area where prior studies have been limited to binary tasks. Using three datasets (BUSI, BUS-UCLM and BCMID), two public and one in-house, we evaluated multiple CNN-based architectures under three training paradigms: centralized, independent/local, and federated learning. The results demonstrate that FL provides a promising direction for the development of collaborative and privacy-preserving models in medical imaging, effectively addressing the data-sharing restrictions that often hinder clinical AI research.

Across experiments, FL showed robust performance against both architectural and statistical heterogeneity. In particular, FedProx consistently provided greater stability and higher average performance compared to FedAvg, especially in the three-client setting, where the data were more strongly non-IID. Furthermore, FL outperformed centralized training in nearly all scenarios, suggesting that direct data pooling can introduce domain shift and reduce generalization in a traditional multi-institutional learning context. Ultimately, this framework has the potential to translate into actionable clinical support by enabling automated risk stratification and triage to prioritize high-risk patients and optimize diagnostic workflows while maintaining cross-institutional privacy.
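The difference between the two aggregation strategies compared above can be illustrated with a small sketch. It assumes the standard formulations: FedAvg averages client weights in proportion to local sample counts, and FedProx adds a proximal term (mu/2)·||w − w_global||² to each client's local objective. The function names, the toy two-parameter weight vectors, and the value of `mu` are illustrative assumptions, not the configuration used in the experiments; the client sizes are the training-split totals implied by Table 1.

```python
def fedavg(client_weights, client_sizes):
    """Sample-size-weighted average of client weight vectors (FedAvg)."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


def fedprox_gradient(local_grad, local_w, global_w, mu):
    """Add the gradient of the proximal term (mu/2) * ||w - w_global||^2.

    The extra term mu * (w - w_global) pulls each local update back
    toward the current global model, limiting client drift.
    """
    return [g + mu * (w - gw) for g, w, gw in zip(local_grad, local_w, global_w)]


# Training-split sizes summed from Table 1: BUSI, BCMID, BUS-UCLM.
sizes = [550, 779, 445]
# Toy weight vectors, purely illustrative.
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
global_w = fedavg(weights, sizes)

# mu = 0.1 is an assumed value; with mu = 0 the update is the plain local gradient.
grad = fedprox_gradient([0.5, -0.5], weights[0], global_w, mu=0.1)
print(global_w, grad)
```

With `mu` set to zero the proximal correction vanishes, so the local step is the same as in FedAvg; larger values keep heterogeneous clients closer to the shared model at the cost of less local adaptation.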

Future work will focus on integrating personalization mechanisms at the client level to better adapt global models to local data distributions. Furthermore, a comprehensive hyperparameter optimization and sensitivity analysis could be conducted to maximize the absolute diagnostic performance of the models and assess the robustness of our comparative findings under different training configurations. To further strengthen privacy guarantees, future research should also incorporate and rigorously validate robust differential privacy mechanisms within the federated learning workflow. Additionally, exploring more advanced architectures, such as Vision Transformers and hybrid CNN–transformer models, may further enhance diagnostic accuracy. Investigating harmonization strategies for imaging protocols across institutions could help reduce domain shift and improve cross-site model consistency. Finally, expanding the study to include a larger number of institutions and datasets will help re-validate the scalability and generalizability of federated learning in real-world clinical environments.

Author Contributions

Conceptualization, N.S.T. and M.A.E.; Methodology, N.S.T. and M.A.E.; Software, N.S.T., M.A.E. and N.H.; Validation, N.G. and N.H.; Formal Analysis, N.G., N.S.T. and M.A.E.; Investigation, N.S.T. and M.A.E.; Resources, N.G.; Data Curation, N.G.; Writing—Original Draft Preparation, R.K. and S.F.; Writing—Review and Editing, N.S.T., M.A.E. and N.G.; Visualization, N.G.; Supervision, R.K. and S.F.; Project Administration, N.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data supporting the results reported in this article originate from three sources. Two datasets are publicly available on Kaggle (all accessed on 16 September 2025): the BUSI dataset is available at https://www.kaggle.com/datasets/sabahesaraki/breast-ultrasound-images-dataset, and the BUS-UCLM dataset is available at https://www.kaggle.com/datasets/orvile/bus-uclm-breast-ultrasound-dataset. The third dataset, BCMID, is a multi-modal dataset curated by the authors and is currently under publication. While the data is publicly registered at https://zenodo.org/records/14970848, direct public download is restricted due to patient privacy and ongoing publication processes. The BCMID data can be made available upon reasonable request and through personal communication with the corresponding author (noha.abdelsalam@aast.edu).

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Ultrasound image samples from three distinct datasets, BUSI, BUS-UCLM, and BCMID, illustrating normal, benign, and malignant instances. Each ultrasound image emphasizes specific characteristics associated with its class. Normal tissue lacks focal masses and shows heterogeneous background parenchyma. In contrast, benign lesions are generally rounded or oval with smooth borders, while malignant lesions often appear with irregular shapes and non-parallel orientation. The overall contrast, texture, and brightness variations among the three datasets visually illustrate the substantial domain shift and emphasize the challenge of data heterogeneity that may lead to underperformance of centralized models.

Figure 2. Visualization of Feature Space Separation between Datasets across all three data splits. Colors represent the dataset origin (BUSI, BCMID, or BUS-UCLM). (a) Train Split; (b) Validation Split; (c) Test Split.

Figure 3. Federated Learning Architecture for Multi-Center Breast Ultrasound Classification, where CL_i represents the clients, while W_i denotes the weights of local training.

Figure 4. ROC curves for MobileNet under the three experiments (Local, Centralized, and FedProx): (a) BCMID dataset; (b) BUSI dataset; (c) BUS-UCLM dataset.

Figure 5. Confusion Matrix Analysis across Loss Functions. The rows correspond to the three experimental configurations: Baseline (Row 1), Fixed Tversky Loss (Row 2), and Combined Loss (Row 3). The columns represent the participating clients: Client 1 (BCMID, in-house), Client 2 (BUSI, public), and Client 3 (BUS-UCLM, public).

Figure 6. Performance Gain Analysis across Federation Scenarios and Architectures. (a) 2-Client Federation: BUSI + BCMID; (b) 2-Client Federation: BUS-UCLM + BCMID; (c) 3-Client Federation.

Table 1. Detailed class distribution across datasets, data partitions, and totals.

Dataset	Train Normal	Train Benign	Train Malignant	Validation Normal	Validation Benign	Validation Malignant	Test Normal	Test Benign	Test Malignant	Total
BUSI	89	309	152	13	48	17	31	80	41	780
BCMID	96	499	184	18	75	22	38	160	53	1145
BUS-UCLM	284	113	48	31	15	15	103	45	27	681
Class Total	469	921	384	62	138	54	172	285	121	2606
Split Total	1774			254			578			2606

Table 2. Baseline centralized performance comparison of five CNN architectures. The best two models (in bold) are selected for subsequent experiments. All metrics are Macro-Averaged to ensure balanced evaluation across all three classes.

Model	Parameters (M)	Accuracy (%)	F1-Score (%)	Precision (%)	Recall (%)	Epochs
ResNet50V2	24.62	68	62	67	63	7
DenseNet201	19.3	65	69	67	65	8
EfficientNetV2	53.81	43	34	38	43	6
InceptionNetV3	22.85	67	64	64	64	12
MobileNet	3.76	67	64	65	64	12
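As an illustration of the macro-averaged metrics reported in Table 2, the sketch below computes macro precision, recall, and F1 from a confusion matrix. It assumes the common definition in which each metric is computed per class and then averaged with equal class weights; the function name and the toy confusion matrix are hypothetical and do not correspond to any result in this study.

```python
def macro_metrics(confusion):
    """Macro precision, recall, and F1 from a square confusion matrix.

    confusion[i][j] counts samples of true class i predicted as class j.
    Each class contributes equally, regardless of its sample count.
    """
    k = len(confusion)
    precisions, recalls, f1s = [], [], []
    for c in range(k):
        tp = confusion[c][c]
        predicted = sum(confusion[r][c] for r in range(k))  # column sum
        actual = sum(confusion[c])  # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
    return sum(precisions) / k, sum(recalls) / k, sum(f1s) / k


# Toy 3-class matrix (rows: true Normal, Benign, Malignant), illustrative only.
toy = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 3, 7],
]
precision, recall, f1 = macro_metrics(toy)
print(f"precision={precision:.4f}, recall={recall:.4f}, f1={f1:.4f}")
```

Because every class carries the same weight, a model that neglects a minority class is penalized even when overall accuracy remains high, which is why macro averaging suits the imbalanced class distribution in Table 1.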

Table 3. Two-Client Federated Learning Results (

BUSI + BCMID