1. Introduction
Liver cancer remains one of the most prevalent and lethal malignancies worldwide, with increasing incidence and mortality reported in recent global statistics [1]. Accurate classification of liver tumours from computed tomography (CT) is crucial for diagnosis and treatment planning. Conventional deep learning methods in medical imaging typically rely on centralized datasets. In practice, however, medical data are distributed across institutions, creating data silos. Moreover, strict privacy regulations, such as the General Data Protection Regulation (GDPR) [2], limit cross-institutional sharing of raw patient data and hinder the development of robust large-scale medical AI systems.
Federated learning (FL) has emerged as a promising paradigm to address this challenge [3,4]. By enabling collaborative model training without exchanging raw local data, FL supports privacy-preserving multi-centre learning. In realistic medical networks, participating clients may correspond to central hospitals, regional hospitals, and local clinics, each with distinct data scales and case compositions. This naturally induces heterogeneous (non-IID) client distributions and unequal local optimization behaviours.
1.1. Federated Learning Algorithms
Federated learning was introduced by McMahan et al. through Federated Averaging (FedAvg), which updates a global model by aggregating locally trained client models [3]. Although effective under IID data, FedAvg often exhibits degraded performance and unstable convergence under heterogeneous (non-IID) client distributions [4,5]. Such heterogeneity can arise from both label-distribution skew and feature-distribution shift, leading to client drift and unreliable global aggregation. FedProx adds a proximal regularization term to constrain excessive deviation between local and global models [6]. The proximal coefficient $\mu$ controls the stability–plasticity trade-off, motivating our ablation study. Other variants, including FedBN (Federated Learning with Local Batch Normalization) and FedAMP (Federated Attentive Message Passing), introduce normalization- and attention-based mechanisms to alleviate feature shift and improve client collaboration [7,8]. In this work, we adopt FedProx as the core optimization framework and evaluate its behaviour under different non-IID intensities.
1.2. Federated Learning in Medical Imaging
FL has gained increasing attention in medical AI due to its privacy-preserving collaborative training paradigm [9,10]. Prior studies have applied FL to brain tumour segmentation [10], COVID-19 screening [11], and multi-institutional medical image analysis [12]. More recent studies extend FL to brain tumour classification and heterogeneous MRI data, highlighting practical non-IID challenges [13,14,15]. However, most existing medical FL work focuses on segmentation and commonly evaluates within the same dataset or domain [16]. Cross-dataset external validation across different scanners and acquisition protocols remains underexplored, despite being critical for real-world deployment. Recent HCC-focused FL studies on CT/MR imaging further motivate this direction [17,18]. Recent evidence synthesis also highlights unresolved issues in cross-site robustness, evaluation consistency, and deployment readiness [19].
1.3. Transfer Learning and Lightweight Architectures
Transfer learning (TL), where models pretrained on large-scale natural-image datasets such as ImageNet are adapted to medical tasks, is widely used in medical imaging and is particularly beneficial when labelled data are limited [20,21]. The optimal fine-tuning depth (e.g., frozen backbone versus partial fine-tuning) remains an important design choice under scarce local data. Recent architectures such as Vision Transformers (ViT) and EfficientNet have demonstrated strong recognition performance and are increasingly adopted in medical classification [22,23]. At the same time, deployment constraints motivate lightweight networks such as MobileNetV3 [24]. In federated settings, model size directly determines communication payload per round, making the performance–efficiency trade-off particularly critical [5].
Nevertheless, applying FL to liver tumour classification still faces key challenges. First, most medical FL studies focus on segmentation [10,16], while binary classification, especially under non-IID distributions, remains underexplored. Second, the trade-off between model capacity and communication efficiency is critical for clinical deployment, yet systematic comparisons between heavyweight backbones and lightweight CNNs in FL are limited. Third, the effectiveness of transfer learning from natural-image pretraining versus training from scratch in federated medical settings remains unclear.
In this paper, we propose a FedProx-based federated framework for multi-centre liver tumour classification and conduct a comprehensive empirical study. Our contributions are fourfold: (1) We establish a robust federated classification pipeline and a cross-domain benchmark by training on LiTS and evaluating on external 3D-IRCADb, enabling rigorous assessment of out-of-distribution generalisation. (2) We systematically compare diverse backbones, including ResNet-50, EfficientNet-B3, ViT-B/16, and MobileNetV3-Small, and quantify their performance–efficiency trade-offs in terms of predictive performance and communication overhead. (3) We investigate transfer learning by comparing ImageNet initialization with training from scratch, showing that pretraining consistently stabilises federated convergence, particularly for data-sparse clients. (4) We design a case-level partition strategy with minimum-validation guarantees to support robust evaluation when local samples are limited ().
Overall, we position this study as a careful empirical comparison under realistic heterogeneity and domain shift, rather than a claim of universal superiority of a single FL optimizer across all settings.
2. Methods
2.1. Problem Definition
We formulate liver tumour diagnosis as a multi-centre binary classification task under federated learning. Suppose there are K participating clients (e.g., hospitals). Client k owns a local dataset $\mathcal{D}_k = \{(x_i, y_i)\}_{i=1}^{n_k}$, where $x_i$ is a 2D CT slice and $y_i \in \{0, 1\}$ denotes the class label (normal vs. tumour). The objective is to learn a global model parametrised by w without sharing raw patient data across clients.
Training proceeds over communication rounds $t = 1, \dots, T$. At round t, the server broadcasts the current global model $w^{t}$ to all clients. Each client performs local optimization on $\mathcal{D}_k$ and returns updated parameters $w_k^{t+1}$. The server then aggregates all client updates to obtain the next global model $w^{t+1}$. In this work, all clients participate in each communication round (full participation), following common multi-institutional FL practice [25,26].
2.2. Local Training Objective
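Given a neural network $f_w$ with parameters $w$, the predicted probability for class $c$ is denoted by $p_w(c \mid x)$. On client k, we use the cross-entropy loss:
$$\mathcal{L}_k(w) = -\frac{1}{n_k} \sum_{i=1}^{n_k} \log p_w(y_i \mid x_i) \tag{1}$$
where the sum runs over the $n_k$ local training samples $(x_i, y_i)$ of client k.
Under non-IID data, FedAvg may suffer from client drift and unstable convergence. We therefore employ FedProx, which augments the local objective with a proximal term:
$$\min_{w} \; \mathcal{L}_k(w) + \frac{\mu}{2} \lVert w - w^{t} \rVert_2^2 \tag{2}$$
where $\mu \ge 0$ is the proximal coefficient and $w^{t}$ is the broadcast global model at round t. When $\mu = 0$, Equation (2) reduces to the FedAvg local objective.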
For implementation, each client performs E local epochs per round using Adam. The default setting is $E = 3$, with ablation over $E \in \{1, 3, 5\}$. For FedProx, we use a fixed positive $\mu$ by default and study three positive values of $\mu$ (Table 4), together with $\mu = 0$ (FedAvg baseline).
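As a rough illustration of the local objective described above, the sketch below evaluates a cross-entropy loss plus a proximal penalty on flat parameter lists. The function names, the flat-list parameter representation, and the value of `mu` used in any call are assumptions made for illustration; the study itself trains PyTorch models with Adam, which is not reproduced here.

```python
import math

def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the true class.

    probs:  per-sample class-probability lists, e.g. [[0.9, 0.1], ...]
    labels: integer class labels (0 = normal, 1 = tumour)
    """
    eps = 1e-12  # guard against log(0)
    return -sum(math.log(p[y] + eps) for p, y in zip(probs, labels)) / len(labels)

def proximal_penalty(w_local, w_global, mu):
    """(mu / 2) times the squared L2 distance between local and global weights."""
    return 0.5 * mu * sum((a - b) ** 2 for a, b in zip(w_local, w_global))

def fedprox_objective(probs, labels, w_local, w_global, mu):
    """Local FedProx objective; mu = 0 recovers the plain (FedAvg) local loss."""
    return cross_entropy(probs, labels) + proximal_penalty(w_local, w_global, mu)
```

The penalty grows with the distance between the local weights and the broadcast global weights, so a larger `mu` keeps local training closer to the global model.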
2.3. Server Aggregation
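After receiving $\{w_k^{t+1}\}_{k=1}^{K}$, the server performs sample-size weighted aggregation:
$$w^{t+1} = \sum_{k=1}^{K} \frac{n_k}{\sum_{j=1}^{K} n_j} \, w_k^{t+1} \tag{3}$$
where $n_k$ is the number of local training samples on client k. This gives proportionally larger influence to clients with more local data.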
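The sample-size weighted aggregation can be sketched in a few lines. This is a minimal illustration on flat parameter lists; the function name `aggregate` and the list-based representation are assumptions, not the authors' code.

```python
def aggregate(client_weights, client_sizes):
    """Sample-size weighted average of client parameter vectors.

    client_weights: one flat list of parameters per client
    client_sizes:   number of local training samples per client
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(n * w[i] for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]
```

For example, with two clients holding 1 and 3 samples, the second client's parameters receive three quarters of the weight in the new global model.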
Figure 1 summarizes the overall pipeline, including slice-level label construction, case-level client partitioning (IID/Mild/Severe), and federated optimization with client-side validation and held-out external testing. Within this pipeline, Figure 2 shows the repeated server–client aggregation across communication rounds, and Algorithm 1 provides the full procedure.
Algorithm 1: FedProx-based liver tumour classification
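The round structure (broadcast, local update, weighted aggregation) can be illustrated with a deliberately small toy problem. In the sketch below each client holds a scalar model and a quadratic local loss with its own optimum; this stands in for network training and is an assumption for illustration only, as are the function names, learning rate, and step counts.

```python
def local_update(w_global, target, mu, lr=0.1, steps=10):
    """Gradient descent on 0.5*(w - target)^2 + 0.5*mu*(w - w_global)^2."""
    w = w_global
    for _ in range(steps):
        grad = (w - target) + mu * (w - w_global)  # loss gradient + proximal gradient
        w -= lr * grad
    return w

def run_federated(targets, sizes, mu, rounds=50, w0=0.0):
    """Full-participation rounds: broadcast, local update, weighted average."""
    w = w0
    total = sum(sizes)
    for _ in range(rounds):
        local = [local_update(w, c, mu) for c in targets]       # client side
        w = sum(n * wk for wk, n in zip(local, sizes)) / total  # server side
    return w
```

In this toy setting a larger `mu` keeps each local update closer to the broadcast model, which is the drift-limiting effect the proximal term is meant to provide.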
2.4. Network Architectures and Training Strategies
To study the trade-off between model capacity and deployment efficiency, we evaluate four backbones: ResNet-50, EfficientNet-B3, ViT-B/16, and MobileNetV3-Small. For all backbones, the original classifier is replaced with a task-specific two-class prediction head.
Training strategies. We compare three settings: (i) scratch, where all parameters are randomly initialized and trained; (ii) pretrained-freeze, where the ImageNet-pretrained backbone is frozen and only the classification head is trained; (iii) pretrained-finetune-last, where the ImageNet-pretrained backbone is used and both the classification head and the last stage/block are updated.
Importantly, in our current federated implementation, each client still transmits and receives the full model state at every communication round. Therefore, freezing backbone layers reduces local trainable parameters and computation, but does not reduce communication payload in the current protocol.
For reproducibility, the trainable modules in pretrained-finetune-last are: ResNet-50 (layer4 + classification head), EfficientNet-B3 (last stage of features, blocks 7–8 + head), ViT-B/16 (last encoder block + head), and MobileNetV3-Small (final block of features + head).
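One simple way to express such a freezing scheme is to select trainable parameters by name prefix. The sketch below is framework-free; the parameter names are a hypothetical, abbreviated list written in the style of torchvision's ResNet-50 (`layer4.`, `fc.`), and the helper `select_trainable` is an illustrative assumption rather than part of the released code.

```python
def select_trainable(param_names, trainable_prefixes):
    """Map each parameter name to True (update) or False (freeze)."""
    return {
        name: any(name.startswith(p) for p in trainable_prefixes)
        for name in param_names
    }

# Hypothetical, abbreviated parameter list in torchvision ResNet-50 style.
names = [
    "conv1.weight",
    "layer3.0.conv1.weight",
    "layer4.0.conv1.weight",
    "fc.weight",
    "fc.bias",
]
# pretrained-finetune-last for ResNet-50: layer4 plus the classification head.
flags = select_trainable(names, ("layer4.", "fc."))
trainable = [n for n, on in flags.items() if on]
```

Note that under the protocol described above, all entries of `names` would still be exchanged each round; the flags only determine which parameters receive local gradient updates.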
These architectural choices are consistent with recent medical FL trends that emphasize both representation quality and deployment efficiency [19,27].
2.5. Datasets
Training domain (LiTS). We use the public Liver Tumour Segmentation (LiTS) dataset as the source domain for federated training [28]. LiTS is not a native multi-centre FL dataset with institution IDs; therefore, we construct a simulated multi-centre protocol by partitioning LiTS at case level into $K = 5$ clients. LiTS contains contrast-enhanced abdominal CT volumes with voxel-level liver and tumour annotations. We construct a slice-level binary classification dataset: a slice is labelled as tumour if it contains any tumour voxel, and as normal if it contains liver tissue but no tumour voxel. Slices containing neither liver nor tumour are discarded.
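The slice-labelling rule can be written down directly. The sketch below operates on nested lists rather than real CT volumes, and the mask encoding (1 = liver, 2 = tumour) is an assumption that should be checked against the annotation files actually used; the function names are illustrative.

```python
def label_slice(mask_slice, liver_id=1, tumour_id=2):
    """Return 1 (tumour), 0 (normal), or None (discard) for one 2D mask.

    mask_slice: 2D list of integer voxel labels.
    The ids 1 = liver and 2 = tumour are assumed, not taken from the paper.
    """
    values = {v for row in mask_slice for v in row}
    if tumour_id in values:
        return 1  # any tumour voxel makes the slice tumour-positive
    if liver_id in values:
        return 0  # liver present but no tumour
    return None   # neither liver nor tumour: slice is dropped

def build_slice_labels(mask_volume):
    """Keep (slice_index, label) pairs for slices that contain liver or tumour."""
    labelled = ((i, label_slice(s)) for i, s in enumerate(mask_volume))
    return [(i, y) for i, y in labelled if y is not None]
```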
External test domain (3D-IRCADb). We use 3D-IRCADb as a fully held-out external test set to evaluate out-of-distribution (OOD) generalisation [29]. Compared with LiTS, 3D-IRCADb differs in acquisition protocols, scanner characteristics, and patient population, introducing a practical domain shift. We apply the same preprocessing and slice-level labelling pipeline as in LiTS. Importantly, 3D-IRCADb is never used for training, partition construction, or hyperparameter selection; all model selection and ablation decisions are based solely on LiTS client-side validation.
2.6. Preprocessing and Dataset Construction
CT volumes are standardised via resampling and intensity normalisation, and converted into 2D axial slices. Selected slices are resized to a fixed input resolution. During training, greyscale slices are replicated to three channels to match ImageNet-pretrained backbones [20]. Data augmentation includes horizontal flipping, small-angle rotation, and mild random resized cropping. During evaluation, only deterministic resizing and normalisation are used.
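A minimal sketch of three of these steps (intensity normalisation, channel replication, horizontal flipping) is given below on nested lists. The clipping window of -100 to 400 is an assumed example value, not a setting reported in the paper, and resampling, resizing, rotation, and cropping are omitted.

```python
def window_normalise(slice_hu, lo=-100.0, hi=400.0):
    """Clip intensities to [lo, hi] and rescale to [0, 1].

    The window bounds are assumed example values.
    """
    return [[(min(max(v, lo), hi) - lo) / (hi - lo) for v in row] for row in slice_hu]

def to_three_channels(slice_2d):
    """Replicate one greyscale slice into three identical channels."""
    return [[row[:] for row in slice_2d] for _ in range(3)]

def horizontal_flip(slice_2d):
    """Mirror a slice left to right (training-time augmentation only)."""
    return [row[::-1] for row in slice_2d]
```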
2.7. Federated Client Partitioning
We use $K = 5$ clients to simulate small-to-medium multi-centre collaboration. The partitioning is performed at case level (never slice level) to avoid leakage across client splits.
Partition protocol. Given all LiTS cases, we first compute a tumour-burden score per case (tumour-positive slices divided by liver slices). We then assign cases to clients as follows:
IID: shuffled case list is split to keep client sample sizes and tumour burden approximately balanced.
Mild non-IID: cases are sorted by tumour burden and distributed with a weak skew so that clients show moderate differences.
Severe non-IID: cases are grouped by burden quantiles and allocated to maximize inter-client distribution shift.
This setting defines how samples are separated across clients and directly corresponds to the IID/Mild/Severe protocols used in all tables.
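The partition protocol above can be approximated with a single skew parameter. The sketch below is one possible reading, not the authors' released procedure: `skew = 0` gives a shuffled round-robin split, `skew = 1` gives contiguous burden-sorted blocks, and intermediate values are an assumed stand-in for the mild setting. Unlike the IID protocol described above, the `skew = 0` case here does not explicitly balance slice counts or tumour burden.

```python
import random

def tumour_burden(n_tumour_slices, n_liver_slices):
    """Tumour-positive slices divided by liver slices for one case."""
    return n_tumour_slices / n_liver_slices if n_liver_slices else 0.0

def partition_cases(burden_by_case, num_clients, skew, seed=0):
    """Assign whole cases (never individual slices) to clients.

    skew = 0.0 -> shuffled round-robin (IID-like)
    skew = 1.0 -> contiguous burden-sorted blocks (severe-like)
    """
    rng = random.Random(seed)
    cases = sorted(burden_by_case, key=burden_by_case.get)
    clients = [[] for _ in range(num_clients)]
    leftovers = []
    for rank, case in enumerate(cases):
        if rng.random() < skew:
            # Block assignment: low-burden cases go to early clients.
            clients[rank * num_clients // len(cases)].append(case)
        else:
            leftovers.append(case)
    rng.shuffle(leftovers)
    for i, case in enumerate(leftovers):
        clients[i % num_clients].append(case)
    return clients
```

Because assignment happens per case, all slices of a patient end up on the same client, which is what prevents leakage across client splits.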
Each client is further split into training and validation subsets at case level using a fixed train/validation ratio. A minimum-validation constraint is enforced (at least one validation case per client whenever feasible) to avoid empty validation subsets and ensure stable metric reporting. Per-client case/slice counts for each generated split are recorded in the released split manifests. To quantify the designed heterogeneity, we report tumour-positive slice ratios across five clients under IID/Mild/Severe settings in Table 1; the variance increases substantially from IID to Severe.
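The minimum-validation constraint can be sketched as a clamp on the validation count. The function name and the `val_fraction` argument are assumptions for illustration; the validation fraction is left as a parameter rather than fixed to any value from the study.

```python
import random

def split_client_cases(cases, val_fraction, seed=0):
    """Case-level train/validation split for one client.

    Guarantees at least one validation case whenever the client holds two
    or more cases, while keeping the training subset non-empty.
    """
    cases = list(cases)
    random.Random(seed).shuffle(cases)
    n_val = int(round(len(cases) * val_fraction))
    if len(cases) >= 2:
        n_val = min(max(n_val, 1), len(cases) - 1)  # keep both subsets non-empty
    else:
        n_val = 0  # a single case cannot be split ("whenever feasible")
    return cases[n_val:], cases[:n_val]
```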
2.8. Compared Methods and Experimental Protocol
We compare:
FedAvg: federated averaging without proximal regularisation [3], implemented as $\mu = 0$ in Equation (2).
FedProx: proximal regularised federated optimisation ($\mu > 0$) for improved stability under non-IID settings [6].
For each backbone, we compare transfer learning (ImageNet initialisation) and training from scratch (random initialisation) [21]. Unless otherwise specified, the main setting uses pretrained weights with a frozen backbone, while fine-tuning depth is analysed separately.
Ablation studies. We perform: (i) proximal-coefficient ablation over three positive values of $\mu$ (plus $\mu = 0$ for FedAvg), (ii) local-epoch ablation with $E \in \{1, 3, 5\}$, and (iii) fine-tuning strategy comparison between freeze and finetune-last.
2.9. Implementation Details
All experiments are implemented in PyTorch 2.5.1+cu121 (with torchvision 0.20.1+cu121 and torchaudio 2.5.1+cu121) [30]. Unless otherwise stated, we use Adam [31] with a fixed learning rate and batch size 16. Federated training uses a fixed number of communication rounds $T$ by default, and each client performs $E = 3$ local epochs per round (with $E \in \{1, 3, 5\}$ for ablation). Global aggregation follows sample-size weighted averaging in Equation (3).
For FedProx, a fixed positive $\mu$ is used by default, with ablation over three positive values of $\mu$ (and $\mu = 0$ as FedAvg). All experiments are run on a single workstation equipped with a CUDA-enabled GPU.
2.10. Evaluation Metrics
We report Accuracy, ROC-AUC, F1-score, tumour-class Recall, Specificity, and confusion matrix statistics. For F1/Recall/Specificity, the tumour probability threshold is fixed at 0.5 across all methods for fair comparison.
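Let TP, TN, FP, and FN denote true positive, true negative, false positive, and false negative counts, respectively. For binary classification:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad \mathrm{Specificity} = \frac{TN}{TN + FP},$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
AUC is computed as the area under the ROC curve:
$$\mathrm{AUC} = \int_{0}^{1} \mathrm{TPR} \; d(\mathrm{FPR}),$$
where $\mathrm{TPR}$ is Recall and $\mathrm{FPR} = FP / (FP + TN) = 1 - \mathrm{Specificity}$.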
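The thresholded metrics and the AUC can be computed from labels and predicted tumour probabilities as sketched below. The helper names are illustrative, the comparison `p >= threshold` is an assumed tie convention, and the AUC uses the rank-based (pairwise) form, which equals the area under the empirical ROC curve.

```python
def binary_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy, recall, specificity, precision, and F1 at a fixed threshold."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1

    def ratio(a, b):
        return a / b if b else 0.0  # define empty ratios as 0

    recall = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        "f1": ratio(2 * precision * recall, precision + recall),
    }

def roc_auc(y_true, y_prob):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```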
We report both (i) averaged client-side validation performance, and (ii) external test performance on 3D-IRCADb, to assess cross-domain generalisation.
3. Results
3.1. Main Experimental Results on Federated Validation and External Test
Table 2 compares four backbones under the most challenging federated setting ($K = 5$, severe non-IID). We report AUC, F1-score, Recall, and Specificity on both averaged client-side validation (LiTS) and external testing (3D-IRCADb).
Across models, tumour recall remains high on LiTS validation, whereas external AUC is consistently lower, indicating a domain gap between LiTS and 3D-IRCADb. ImageNet pretraining generally improves external robustness (e.g., ResNet-50 FedAvg external AUC of 0.5605 vs. 0.4892 under scratch). FedProx is competitive with FedAvg across settings, with modest gains in some configurations rather than a uniform advantage.
Model-wise, no backbone dominates every metric. ResNet-50 and ViT-B/16 can be competitive in AUC under selected settings, while MobileNetV3-Small provides the best practical trade-off between external performance and communication cost. Therefore, our conclusions emphasize robust empirical comparison under heterogeneous settings instead of claiming a single universally best model.
3.2. Impact of Non-IID Intensity
To assess robustness against statistical heterogeneity, we evaluate three partition regimes (IID, Mild, Severe) using MobileNetV3-Small (pretrained, $K = 5$).
Table 3 reports results under this IID/Mild/Severe protocol. External AUC varies slightly across intensities, while specificity remains low across all settings, indicating a persistent threshold-level bias toward the positive class. FedProx remains comparable to FedAvg across intensities.
The low external specificity should be interpreted cautiously for deployment. Possible reasons include class imbalance, centre-level acquisition differences, and threshold mismatch between internal validation and external cohorts. This also suggests that site-adaptive threshold selection or probability calibration may be necessary in practical use.
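As one concrete form of site-adaptive threshold selection, a threshold can be chosen on a small labelled sample from the target site by maximising Youden's J (recall + specificity - 1). This technique is introduced here only as an illustration and was not evaluated in the study; the function name and any data passed to it are hypothetical.

```python
def youden_threshold(y_true, y_prob):
    """Pick the threshold maximising recall + specificity - 1 on site data.

    Returns (threshold, J). Candidate thresholds are the observed scores;
    a sample is predicted positive when its score is >= threshold.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both classes to choose a threshold")
    best_t, best_j = 0.5, -1.0
    for t in sorted(set(y_prob)):
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= t)
        tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < t)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j
```

When all scores sit above 0.5, as in the high-recall, low-specificity pattern described above, the fixed 0.5 threshold labels every sample positive, whereas a site-specific threshold can still separate the classes if their score ranking is informative.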
3.3. Parameter Sensitivity Analysis ($\mu$, E, and K)
We further investigate key hyperparameters (Table 4, Table 5 and Table 6). As shown in Table 4, larger $\mu$ improves specificity at a modest cost in AUC/recall, reflecting a sensitivity–specificity trade-off. Table 5 shows that increasing local epochs reduces communication frequency but can increase local drift.
Table 3. Impact of non-IID intensity levels on external test performance (MobileNetV3-Small, $K = 5$, pretrained).
| Intensity | Method | AUC | F1 | Recall | Spec. |
|---|---|---|---|---|---|
| IID | FedAvg | 0.57 | 0.81 | 0.97 | 0.08 |
| IID | FedProx | 0.57 | 0.81 | 0.97 | 0.07 |
| Mild | FedAvg | 0.58 | 0.81 | 0.99 | 0.02 |
| Mild | FedProx | 0.59 | 0.81 | 0.99 | 0.01 |
| Severe | FedAvg | 0.60 | 0.81 | 0.98 | 0.05 |
| Severe | FedProx | 0.60 | 0.81 | 0.97 | 0.06 |
Table 4. Sensitivity to proximal coefficient $\mu$ (MobileNetV3-Small, $K = 5$, severe non-IID, $E = 3$).
| $\mu$ | AUC | F1 | Recall | Spec. |
|---|---|---|---|---|
| 0 (FedAvg) | 0.60 | 0.81 | 0.98 | 0.05 |
| | 0.60 | 0.81 | 0.98 | 0.05 |
| | 0.60 | 0.81 | 0.98 | 0.06 |
| | 0.60 | 0.81 | 0.97 | 0.06 |
Table 5. Sensitivity to local epochs E (MobileNetV3-Small, $K = 5$, severe non-IID, default $\mu$ for FedProx).
| E | Method | AUC | F1 | Recall | Spec. |
|---|---|---|---|---|---|
| 1 | FedAvg | 0.59 | 0.80 | 0.97 | 0.04 |
| 1 | FedProx | 0.59 | 0.80 | 0.97 | 0.05 |
| 3 | FedAvg | 0.60 | 0.81 | 0.98 | 0.05 |
| 3 | FedProx | 0.60 | 0.81 | 0.97 | 0.06 |
| 5 | FedAvg | 0.59 | 0.81 | 0.99 | 0.03 |
| 5 | FedProx | 0.59 | 0.81 | 0.98 | 0.04 |
Table 6. Sensitivity to number of clients K (MobileNetV3-Small, pretrained, $E = 3$). Note: $K = 1$ is IID; $K \in \{3, 5\}$ are severe non-IID.
| K | Method | AUC | F1 | Recall | Spec. |
|---|---|---|---|---|---|
| 1 | FedAvg | 0.53 | 0.81 | 0.99 | 0.05 |
| 1 | FedProx | 0.55 | 0.81 | 1.00 | 0.02 |
| 3 (Severe) | FedAvg | 0.55 | 0.81 | 0.99 | 0.01 |
| 3 (Severe) | FedProx | 0.55 | 0.81 | 0.99 | 0.02 |
| 5 (Severe) | FedAvg | 0.60 | 0.81 | 0.98 | 0.05 |
| 5 (Severe) | FedProx | 0.60 | 0.81 | 0.97 | 0.06 |
To complement these findings, Figure 3 shows 50-round convergence under severe non-IID settings. The validation curves indicate stable optimisation dynamics, with progressive improvement in accuracy and F1 in later rounds.
For client-number sensitivity, Table 6 includes $K \in \{1, 3, 5\}$. We note that $K = 1$ results were obtained under IID partition, whereas $K \in \{3, 5\}$ correspond to severe non-IID; therefore, $K = 1$ serves as a reference rather than a strictly matched condition. Under severe non-IID, increasing clients from $K = 3$ to $K = 5$ improves external AUC (0.55 to 0.60), suggesting benefits from broader client diversity.
3.4. Ablation on Fine-Tuning Depth
Table 7 compares transfer-learning depth. Freezing the backbone yields higher AUC, whereas finetune-last improves specificity (from 6.05% to 10.60%). This suggests that adapting higher-level features can reduce false positives, with a modest reduction in ranking performance.
3.5. Communication and Computation Efficiency
Table 8 reports the model size and estimated per-round communication cost under the pretrained-freeze setting. In our current pipeline, clients transmit and receive the full model state each round; therefore, communication cost is determined by total model size even when most layers are frozen. Under this protocol, MobileNetV3-Small incurs the lowest per-round communication burden while maintaining competitive external performance.
Per-round communication per client is estimated as downlink + uplink = $2 \times$ model size (FP32). For K participating clients, total communication per round is $2K \times$ model size. If only trainable parameters were transmitted, communication under freeze could be further reduced; this is left for future engineering optimization.
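This estimate is simple arithmetic on the parameter count. The sketch below assumes 4 bytes per FP32 parameter and decimal megabytes, and it ignores non-parameter state such as normalisation buffers; the function name and the parameter count in the usage note are hypothetical and are not values from Table 8.

```python
def per_round_comm_mb(num_params, num_clients, bytes_per_param=4):
    """Return (per-client MB, total MB) per round for full-state exchange.

    Each client downloads and uploads the full FP32 model once per round,
    so the per-client cost is twice the model size.
    """
    model_mb = num_params * bytes_per_param / 1e6
    per_client = 2 * model_mb
    return per_client, num_clients * per_client
```

For a hypothetical model with one million parameters and five clients, this gives 8 MB per client and 40 MB in total per round.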
4. Discussion
Multi-centre liver tumour datasets naturally exhibit substantial heterogeneity due to differences in patient populations and imaging characteristics. Under such non-IID settings, local updates may diverge and destabilize global aggregation [5,15,25,26]. The proximal regularization in FedProx constrains local optimization by limiting deviation from the global model, which can alleviate client drift compared with FedAvg [6]. In our experiments, FedProx shows competitive performance and more stable behaviour as data heterogeneity increases, although the improvement varies across evaluation metrics and configurations [19,32].
The ablation results on the proximal coefficient $\mu$ and the number of local epochs E highlight the trade-off between communication efficiency and training stability. Increasing $\mu$ restricts local deviation and can improve specificity, with modest trade-offs in AUC or recall. Larger E reduces communication frequency but may amplify client drift under heterogeneous data [3,19,33]. The communication cost in our setting is primarily determined by model size. Even with frozen backbones, large models such as ViT-B/16 incur substantially higher transmission overhead than lightweight architectures such as MobileNetV3-Small, suggesting that communication efficiency is more effectively improved through compact architectures or parameter-efficient update strategies [19].
Our results confirm the effectiveness of ImageNet pretraining, which provides transferable low-level visual representations for medical images [34]. However, the domain gap between natural and medical images remains relevant. Comparing different strategies, freezing the backbone tends to preserve ranking performance (AUC), whereas partial fine-tuning of later layers improves specificity by adapting higher-level features to CT-specific patterns [21,27]. This trade-off is particularly important when local datasets are small.
Evaluation on the external 3D-IRCADb dataset provides a strict out-of-distribution test [19,35]. Domain shifts caused by scanner differences and acquisition protocols can reduce model generalisation and increase false positives [36]. In our experiments, tumour recall remains relatively high on the external dataset, while specificity decreases, indicating that improving robustness to domain shift remains an important challenge for clinical deployment. This behaviour also implies that fixed thresholds selected on internal validation may not transfer optimally to unseen centres, and calibration or site-adaptive thresholding should be considered before routine use.
Several limitations remain. First, our current 2D slice-based formulation does not explicitly capture volumetric context; future work could explore 3D CNNs or transformer-based sequence models to model inter-slice dependencies [37]. Second, although FL preserves data locality, it does not inherently prevent gradient leakage; integrating differential privacy or secure aggregation mechanisms would further strengthen privacy protection [38,39]. Third, our external evaluation is retrospective; prospective multi-centre validation with centre-specific calibration protocols is still required for deployment. Finally, incorporating multimodal clinical information may further improve model robustness and clinical utility [17,18].
Further supporting analyses, extended tables, non-IID partition statistics, and detailed reproducibility instructions can be found in the Supplementary Materials.
5. Conclusions
This study investigates multi-centre liver tumour classification under realistic data heterogeneity and domain shift in a federated setting. We develop a FedProx-based training pipeline and establish a cross-domain benchmark by training on LiTS clients and evaluating on the external 3D-IRCADb dataset. Across multiple backbones (ResNet-50, EfficientNet-B3, ViT-B/16, and MobileNetV3-Small), FedProx is generally comparable to FedAvg, with slightly more stable behaviour in several non-IID settings. We observe a clear validation-to-external gap and low external specificity, highlighting that external-domain evaluation and careful threshold handling are necessary for medical FL. ImageNet pretraining improves cross-domain behaviour relative to scratch training, particularly for data-sparse clients. MobileNetV3-Small provides a favourable performance–efficiency trade-off, and the case-level partitioning strategy with minimum-validation guarantees offers a practical evaluation protocol for small-client federated scenarios.