Multi-Centre Liver Tumour Classification via Federated Learning: Investigating Data Heterogeneity, Transfer Learning, and Model Efficiency

Zhu, Degang; Wei, Shiqi; Zhang, Xinming

doi:10.3390/computers15050286


by Degang Zhu ^1,†, Shiqi Wei ^2,† and Xinming Zhang ^1,*

¹ School of Computer Science and Technology, University of Science and Technology of China, Hefei 230094, China

² School of Artificial Intelligence and Data Science, University of Science and Technology of China, Hefei 230094, China

^* Author to whom correspondence should be addressed.

^† These authors contributed equally to this work.

Computers 2026, 15(5), 286; https://doi.org/10.3390/computers15050286

Submission received: 26 March 2026 / Revised: 22 April 2026 / Accepted: 25 April 2026 / Published: 1 May 2026

(This article belongs to the Special Issue Machine and Deep Learning in the Health Domain (3rd Edition))


Abstract

This paper investigates federated multi-centre liver tumour classification from contrast-enhanced CT under realistic data heterogeneity and domain shift. To address the practical constraint that medical data are often siloed across institutions, we develop a FedProx-based federated learning pipeline that enables collaborative training without exchanging raw patient data. Using the LiTS dataset as the training domain, we construct a slice-level binary classification task based on voxel-level annotations, while rigorously assessing out-of-distribution generalisation on an external held-out dataset, 3D-IRCADb. We conduct comprehensive experiments across multiple backbone architectures, including ResNet-50, EfficientNet-B3, ViT-B/16, and MobileNetV3-Small, comparing FedProx and FedAvg under three heterogeneity intensities (IID, mild non-IID, and severe non-IID). Furthermore, we evaluate transfer learning strategies, ranging from frozen backbones to partial fine-tuning of the last stage, and perform ablations on the proximal coefficient $μ$ and local epochs E to characterise optimisation behaviour. Our results show that FedProx is generally comparable to FedAvg, with slightly more stable behaviour in some heterogeneous settings. We also observe a clear validation-to-external gap, indicating that external-domain robustness remains challenging and requires cautious interpretation for deployment. ImageNet pretraining yields consistent gains, particularly for data-sparse clients, while partial fine-tuning enhances adaptation to CT-specific features. Finally, MobileNetV3-Small offers a favourable performance–efficiency trade-off by reducing communication payload and computation cost, supporting practical deployment on resource-constrained clinical edge devices.

Keywords: federated learning; liver tumour classification; non-IID data; transfer learning; cross-domain generalisation

1. Introduction

Liver cancer remains one of the most prevalent and lethal malignancies worldwide, with increasing incidence and mortality reported in recent global statistics [1]. Accurate classification of liver tumours from computed tomography (CT) is crucial for diagnosis and treatment planning. Conventional deep learning methods in medical imaging typically rely on centralized datasets. In practice, however, medical data are distributed across institutions, creating data silos. Moreover, strict privacy regulations, such as the General Data Protection Regulation (GDPR) [2], limit cross-institutional sharing of raw patient data and hinder the development of robust large-scale medical AI systems.

Federated learning (FL) has emerged as a promising paradigm to address this challenge [3,4]. By enabling collaborative model training without exchanging raw local data, FL supports privacy-preserving multi-centre learning. In realistic medical networks, participating clients may correspond to central hospitals, regional hospitals, and local clinics, each with distinct data scales and case compositions. This naturally induces heterogeneous (non-IID) client distributions and unequal local optimization behaviours.

1.1. Federated Learning Algorithms

Federated learning was introduced by McMahan et al. through Federated Averaging (FedAvg), which updates a global model by aggregating locally trained client models [3]. Although effective under IID data, FedAvg often exhibits degraded performance and unstable convergence under heterogeneous (non-IID) client distributions [4,5]. Such heterogeneity can arise from both label-distribution skew and feature-distribution shift, leading to client drift and unreliable global aggregation. FedProx adds a proximal regularization term to constrain excessive deviation between local and global models [6]. The proximal coefficient $μ$ controls the stability–plasticity trade-off, motivating our ablation study. Other variants, including FedBN (Federated Learning with Local Batch Normalization) and FedAMP (Federated Attentive Message Passing), introduce normalization- and attention-based mechanisms to alleviate feature shift and improve client collaboration [7,8]. In this work, we adopt FedProx as the core optimization framework and evaluate its behaviour under different non-IID intensities.

1.2. Federated Learning in Medical Imaging

FL has gained increasing attention in medical AI due to its privacy-preserving collaborative training paradigm [9,10]. Prior studies have applied FL to brain tumour segmentation [10], COVID-19 screening [11], and multi-institutional medical image analysis [12]. More recent studies extend FL to brain tumour classification and heterogeneous MRI data, highlighting practical non-IID challenges [13,14,15]. However, most existing medical FL work focuses on segmentation and commonly evaluates within the same dataset or domain [16]. Cross-dataset external validation across different scanners and acquisition protocols remains underexplored, despite being critical for real-world deployment. Recent HCC-focused FL studies on CT/MR imaging further motivate this direction [17,18]. Recent evidence synthesis also highlights unresolved issues in cross-site robustness, evaluation consistency, and deployment readiness [19].

1.3. Transfer Learning and Lightweight Architectures

Transfer learning (TL), where models pretrained on large-scale natural-image datasets such as ImageNet are adapted to medical tasks, is widely used in medical imaging and is particularly beneficial when labelled data are limited [20,21]. The optimal fine-tuning depth (e.g., frozen backbone versus partial fine-tuning) remains an important design choice under scarce local data. Recent architectures such as Vision Transformers (ViT) and EfficientNet have demonstrated strong recognition performance and are increasingly adopted in medical classification [22,23]. At the same time, deployment constraints motivate lightweight networks such as MobileNetV3 [24]. In federated settings, model size directly determines communication payload per round, making the performance–efficiency trade-off particularly critical [5].

Nevertheless, applying FL to liver tumour classification still faces key challenges. First, most medical FL studies focus on segmentation [10,16], while binary classification—especially under non-IID distributions—remains underexplored. Second, the trade-off between model capacity and communication efficiency is critical for clinical deployment, yet systematic comparisons between heavyweight backbones and lightweight CNNs in FL are limited. Third, the effectiveness of transfer learning from natural-image pretraining versus training from scratch in federated medical settings remains unclear.

In this paper, we propose a FedProx-based federated framework for multi-centre liver tumour classification and conduct a comprehensive empirical study. Our contributions are fourfold: (1) We establish a robust federated classification pipeline and a cross-domain benchmark by training on LiTS and evaluating on external 3D-IRCADb, enabling rigorous assessment of out-of-distribution generalisation. (2) We systematically compare diverse backbones, including ResNet-50, EfficientNet-B3, ViT-B/16, and MobileNetV3-Small, and quantify their performance–efficiency trade-offs in terms of predictive performance and communication overhead. (3) We investigate transfer learning by comparing ImageNet initialization with training from scratch, showing that pretraining consistently stabilises federated convergence, particularly for data-sparse clients. (4) We design a case-level partition strategy with minimum-validation guarantees to support robust evaluation when local samples are limited ($K \leq 5$).

Overall, we position this study as a careful empirical comparison under realistic heterogeneity and domain shift, rather than a claim of universal superiority of a single FL optimizer across all settings.

2. Methods

2.1. Problem Definition

We formulate liver tumour diagnosis as a multi-centre binary classification task under federated learning. Suppose there are K participating clients (e.g., hospitals). Client k owns a local dataset $D_{k} = \{(x_{i}, y_{i})\}_{i = 1}^{n_{k}}$, where $x_{i}$ is a 2D CT slice and $y_{i} \in \{0, 1\}$ denotes the class label (normal vs. tumour). The objective is to learn a global model parametrised by w without sharing raw patient data across clients.

Training proceeds over communication rounds $t = 0, 1, \dots, T - 1$. At round t, the server broadcasts the current global model $w^{t}$ to all clients. Each client performs local optimization on $D_{k}$ and returns updated parameters $w_{k}^{t + 1}$. The server then aggregates all client updates to obtain the next global model $w^{t + 1}$. In this work, all clients participate in each communication round (full participation), following common multi-institutional FL practice [25,26].

2.2. Local Training Objective

Given a neural network $f(\cdot; w)$, the predicted probability for class $c \in \{0, 1\}$ is denoted by $p_{c}(x; w)$. On client k, we use the cross-entropy loss:

$$L_{CE}(w; D_{k}) = - \frac{1}{n_{k}} \sum_{(x, y) \in D_{k}} \sum_{c \in \{0, 1\}} \mathbb{1}[y = c] \log p_{c}(x; w). \tag{1}$$

Under non-IID data, FedAvg may suffer from client drift and unstable convergence. We therefore employ FedProx, which augments the local objective with a proximal term:

$$\min_{w} \; L_{CE}(w; D_{k}) + \frac{μ}{2} \lVert w - w^{t} \rVert_{2}^{2}, \tag{2}$$

where $μ \geq 0$ is the proximal coefficient and $w^{t}$ is the broadcast global model at round t. When $μ = 0$, Equation (2) reduces to the FedAvg local objective.
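As an illustration of Equation (2), the following minimal sketch evaluates the proximal penalty and its gradient on plain Python lists. It is not the authors' implementation: the function names and the toy values are assumptions made for this example, and the cross-entropy value is passed in as a precomputed number rather than evaluated on data.

```python
from typing import List, Sequence


def proximal_penalty(w: Sequence[float], w_global: Sequence[float], mu: float) -> float:
    """Return (mu / 2) * ||w - w_global||_2^2, the proximal term of Equation (2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(w, w_global))
    return 0.5 * mu * squared_distance


def fedprox_objective(ce_loss: float, w: Sequence[float],
                      w_global: Sequence[float], mu: float) -> float:
    """Local objective: cross-entropy value plus the proximal penalty."""
    return ce_loss + proximal_penalty(w, w_global, mu)


def proximal_gradient(w: Sequence[float], w_global: Sequence[float], mu: float) -> List[float]:
    """Gradient of the proximal term with respect to w, i.e. mu * (w - w_global)."""
    return [mu * (a - b) for a, b in zip(w, w_global)]


# Toy values (hypothetical, not taken from the paper).
w_local = [1.0, 2.0, 3.0]
w_broadcast = [0.0, 2.0, 1.0]

# With mu = 0 the penalty vanishes and the objective is the FedAvg one.
print(fedprox_objective(0.5, w_local, w_broadcast, mu=0.0))
# With mu = 1e-2 the squared distance (5.0) adds 0.5 * 0.01 * 5.0 to the loss.
print(fedprox_objective(0.5, w_local, w_broadcast, mu=1e-2))
```

The gradient form shows why larger $μ$ limits drift: every local step is pulled back towards the broadcast model in proportion to the current deviation.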

For implementation, each client performs E local epochs per round using Adam. The default setting is $E = 3$, with ablation over $E \in \{1, 3, 5\}$. For FedProx, we use $μ = 10^{-2}$ by default and study $μ \in \{10^{-4}, 10^{-3}, 10^{-2}\}$, together with $μ = 0$ (FedAvg baseline).

2.3. Server Aggregation

After receiving $\{w_{k}^{t + 1}\}_{k = 1}^{K}$, the server performs sample-size weighted aggregation:

$$w^{t + 1} = \sum_{k = 1}^{K} \frac{n_{k}}{\sum_{j = 1}^{K} n_{j}} \, w_{k}^{t + 1}, \tag{3}$$

where $n_{k}$ is the number of local training samples on client k. This gives proportionally larger influence to clients with more local data. Figure 1 summarizes the overall pipeline, including slice-level label construction, case-level client partitioning (IID/Mild/Severe), and federated optimization with client-side validation and held-out external testing. Within this pipeline, Figure 2 shows the repeated server–client aggregation across communication rounds, and Algorithm 1 provides the full procedure.
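The weighted average in Equation (3) can be sketched as follows. This is a simplified illustration under stated assumptions, not the released code: each client model is represented as a dictionary mapping a (hypothetical) parameter name to a flat list of floats, and the function name `aggregate` is introduced only for this example.

```python
from typing import Dict, List


def aggregate(client_weights: List[Dict[str, List[float]]],
              client_sizes: List[int]) -> Dict[str, List[float]]:
    """Sample-size weighted average of client parameters, as in Equation (3)."""
    if len(client_weights) != len(client_sizes):
        raise ValueError("one sample count is required per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total number of training samples must be positive")

    aggregated: Dict[str, List[float]] = {}
    for name, reference in client_weights[0].items():
        aggregated[name] = [
            sum((n_k / total) * weights[name][i]
                for weights, n_k in zip(client_weights, client_sizes))
            for i in range(len(reference))
        ]
    return aggregated


# Two toy clients (hypothetical values): the first holds three times more samples.
clients = [{"head": [0.0, 4.0]}, {"head": [4.0, 0.0]}]
sizes = [30, 10]
print(aggregate(clients, sizes))  # the larger client contributes with weight 0.75
```

With sizes 30 and 10 the coefficients are 0.75 and 0.25, so the aggregated vector lies closer to the first client's parameters, which is the "proportionally larger influence" described above.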

Algorithm 1: FedProx-based liver tumour classification

2.4. Network Architectures and Training Strategies

To study the trade-off between model capacity and deployment efficiency, we evaluate four backbones: ResNet-50, EfficientNet-B3, ViT-B/16, and MobileNetV3-Small. For all backbones, the original classifier is replaced with a task-specific two-class prediction head.

Training strategies. We compare three settings: (i) scratch, where all parameters are randomly initialized and trained; (ii) pretrained-freeze, where the ImageNet-pretrained backbone is frozen and only the classification head is trained; (iii) pretrained-finetune-last, where the ImageNet-pretrained backbone is used and both the classification head and the last stage/block are updated.

Importantly, in our current federated implementation, each client still transmits and receives the full model state at every communication round. Therefore, freezing backbone layers reduces local trainable parameters and computation, but does not reduce communication payload in the current protocol.

For reproducibility, the trainable modules in pretrained-finetune-last are: ResNet-50 (layer4 + classification head), EfficientNet-B3 (last stage of features, blocks 7–8 + head), ViT-B/16 (last encoder block + head), and MobileNetV3-Small (final block of features + head).
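The three training strategies differ only in which parameters receive gradient updates. The sketch below expresses that selection as a name-prefix filter for ResNet-50. It is an assumption-laden illustration: the prefixes `layer4.` and `fc.` follow torchvision's ResNet naming and may not match the released code, the function `trainable_flags` is hypothetical, and the other backbones are omitted because their module names are not stated in the text.

```python
from typing import Dict, Iterable, Tuple

# Assumed torchvision-style prefixes for ResNet-50 (not confirmed by the paper).
RESNET50_HEAD: Tuple[str, ...] = ("fc.",)
RESNET50_LAST_STAGE_AND_HEAD: Tuple[str, ...] = ("layer4.", "fc.")


def trainable_flags(param_names: Iterable[str], strategy: str,
                    head_prefixes: Tuple[str, ...],
                    finetune_prefixes: Tuple[str, ...]) -> Dict[str, bool]:
    """Map each parameter name to True if it is updated under the given strategy."""
    flags: Dict[str, bool] = {}
    for name in param_names:
        if strategy == "scratch":
            flags[name] = True  # everything is trained from random initialisation
        elif strategy == "pretrained-freeze":
            flags[name] = name.startswith(head_prefixes)  # head only
        elif strategy == "pretrained-finetune-last":
            flags[name] = name.startswith(finetune_prefixes)  # last stage + head
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return flags


# A few representative parameter names (illustrative subset, not the full model).
names = ["conv1.weight", "layer3.0.conv1.weight", "layer4.2.conv3.weight", "fc.weight"]
for strategy in ("scratch", "pretrained-freeze", "pretrained-finetune-last"):
    print(strategy, trainable_flags(names, strategy, RESNET50_HEAD,
                                    RESNET50_LAST_STAGE_AND_HEAD))
```

In a PyTorch model, such flags would typically be applied by setting `requires_grad` on each entry of `model.named_parameters()`. As noted above, this changes local computation only; the full model state is still exchanged each round.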

These architectural choices are consistent with recent medical FL trends that emphasize both representation quality and deployment efficiency [19,27].

2.5. Datasets

Training domain (LiTS). We use the public Liver Tumour Segmentation (LiTS) dataset as the source domain for federated training [28]. LiTS is not a native multi-centre FL dataset with institution IDs; therefore, we construct a simulated multi-centre protocol by partitioning LiTS at case level into $K \in \{1, 3, 5\}$ clients. LiTS contains contrast-enhanced abdominal CT volumes with voxel-level liver and tumour annotations. We construct a slice-level binary classification dataset: a slice is labelled as tumour if it contains any tumour voxel, and as normal if it contains liver tissue but no tumour voxel. Slices containing neither liver nor tumour are discarded.
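The slice-labelling rule can be written as a small function. The sketch below assumes a mask encoding of 0 for background, 1 for liver, and 2 for tumour; that encoding and the function name `label_slice` are assumptions for illustration and should be checked against the annotation format actually used.

```python
from typing import List, Optional

# Assumed mask encoding (not stated in the text): 0 background, 1 liver, 2 tumour.
LIVER = 1
TUMOUR = 2


def label_slice(mask: List[List[int]]) -> Optional[int]:
    """Return 1 (tumour), 0 (normal), or None when the slice should be discarded."""
    values = {v for row in mask for v in row}
    if TUMOUR in values:
        return 1  # any tumour voxel makes the slice tumour-positive
    if LIVER in values:
        return 0  # liver tissue without tumour is a normal slice
    return None   # neither liver nor tumour: slice is dropped


print(label_slice([[0, 1], [1, 2]]))  # tumour-positive
print(label_slice([[0, 1], [1, 1]]))  # normal
print(label_slice([[0, 0], [0, 0]]))  # discarded
```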

External test domain (3D-IRCADb). We use 3D-IRCADb as a fully held-out external test set to evaluate out-of-distribution (OOD) generalisation [29]. Compared with LiTS, 3D-IRCADb differs in acquisition protocols, scanner characteristics, and patient population, introducing a practical domain shift. We apply the same preprocessing and slice-level labelling pipeline as in LiTS. Importantly, 3D-IRCADb is never used for training, partition construction, or hyperparameter selection; all model selection and ablation decisions are based solely on LiTS client-side validation.

2.6. Preprocessing and Dataset Construction

CT volumes are standardised via resampling and intensity normalisation, and converted into 2D axial slices. Selected slices are resized to $224 \times 224$. During training, greyscale slices are replicated to three channels to match ImageNet-pretrained backbones [20]. Data augmentation includes horizontal flipping, small-angle rotation, and mild random resized cropping. During evaluation, only deterministic resizing and normalisation are used.

2.7. Federated Client Partitioning

We use $K \in \{1, 3, 5\}$ clients to simulate small-to-medium multi-centre collaboration. The partitioning is performed at case level (never slice level) to avoid leakage across client splits.

Partition protocol. Given all LiTS cases, we first compute a tumour-burden score per case (tumour-positive slices divided by liver slices). We then assign cases to clients as follows:

IID: shuffled case list is split to keep client sample sizes and tumour burden approximately balanced.
Mild non-IID: cases are sorted by tumour burden and distributed with a weak skew so that clients show moderate differences.
Severe non-IID: cases are grouped by burden quantiles and allocated to maximize inter-client distribution shift.

This setting defines how samples are separated across clients and directly corresponds to the IID/Mild/Severe protocols used in all tables.
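A rough sketch of the two extreme regimes is given below. It is not the released partitioning code: the function names are hypothetical, the IID variant balances only case counts through a shuffled round-robin (it does not explicitly balance tumour burden), the severe variant uses contiguous burden-sorted groups as a stand-in for quantile grouping, and the mild regime is omitted because its skew strength is not specified in the text.

```python
import random
from typing import Dict, List, Sequence


def tumour_burden(tumour_slices: int, liver_slices: int) -> float:
    """Tumour-positive slices divided by liver slices for one case."""
    return tumour_slices / liver_slices if liver_slices else 0.0


def partition_iid(case_ids: Sequence[str], num_clients: int, seed: int = 0) -> List[List[str]]:
    """Shuffle cases and deal them round-robin so client sizes differ by at most one."""
    shuffled = list(case_ids)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[k::num_clients] for k in range(num_clients)]


def partition_severe(burden: Dict[str, float], num_clients: int) -> List[List[str]]:
    """Sort cases by burden and give each client one contiguous burden group."""
    ordered = sorted(burden, key=burden.get)
    base, extra = divmod(len(ordered), num_clients)
    clients: List[List[str]] = []
    start = 0
    for k in range(num_clients):
        size = base + (1 if k < extra else 0)
        clients.append(ordered[start:start + size])
        start += size
    return clients


# Toy case-level burdens (hypothetical values).
burden = {"a": tumour_burden(1, 10), "b": tumour_burden(9, 10),
          "c": tumour_burden(5, 10), "d": tumour_burden(3, 10)}
print(partition_iid(list(burden), num_clients=2))
print(partition_severe(burden, num_clients=2))  # low-burden vs high-burden client
```

Because whole cases are assigned to clients, all slices of a patient stay on one client, which is what prevents slice-level leakage between splits.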

Each client is further split into training and validation subsets at case level with a ratio of $0.8 / 0.2$. A minimum-validation constraint is enforced (at least one validation case per client whenever feasible) to avoid empty validation subsets and ensure stable metric reporting. Per-client case/slice counts for each generated split are recorded in the released split manifests. To quantify the designed heterogeneity, we report tumour-positive slice ratios across five clients under IID/Mild/Severe settings in Table 1; the variance increases substantially from IID to Severe.
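One way to realise the 0.8/0.2 case-level split with the minimum-validation constraint is sketched below. The function name, the rounding rule, and the interpretation of "whenever feasible" (a client needs at least two cases before one can be held out) are assumptions made for this example.

```python
import random
from typing import List, Sequence, Tuple


def split_client_cases(case_ids: Sequence[str], val_ratio: float = 0.2,
                       seed: int = 0) -> Tuple[List[str], List[str]]:
    """Split one client's cases into (train, val) at case level."""
    shuffled = list(case_ids)
    random.Random(seed).shuffle(shuffled)

    n_val = round(len(shuffled) * val_ratio)
    if len(shuffled) >= 2:
        # Keep at least one validation case and at least one training case.
        n_val = min(max(n_val, 1), len(shuffled) - 1)
    else:
        # A single case cannot be split; it is kept for training (assumed policy).
        n_val = 0
    return shuffled[n_val:], shuffled[:n_val]


train, val = split_client_cases(["c1", "c2"])
print(len(train), len(val))  # a two-case client still gets one validation case

train, val = split_client_cases([f"c{i}" for i in range(10)])
print(len(train), len(val))  # ten cases follow the nominal 0.8/0.2 ratio
```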

2.8. Compared Methods and Experimental Protocol

We compare:

FedAvg: federated averaging without proximal regularisation [3], implemented as $μ = 0$ in Equation (2).
FedProx: proximal regularised federated optimisation ($μ > 0$) for improved stability under non-IID settings [6].

For each backbone, we compare transfer learning (ImageNet initialisation) with training from scratch (random initialisation) [21]. Unless otherwise specified, the main setting uses pretrained weights with a frozen backbone, while fine-tuning depth is analysed separately.

Ablation studies. We perform: (i) proximal-coefficient ablation with $μ \in \{10^{-4}, 10^{-3}, 10^{-2}\}$ (plus $μ = 0$ for FedAvg), (ii) local-epoch ablation with $E \in \{1, 3, 5\}$, and (iii) fine-tuning strategy comparison between freeze and finetune-last.

2.9. Implementation Details

All experiments are implemented in PyTorch 2.5.1+cu121 (with torchvision 0.20.1+cu121 and torchaudio 2.5.1+cu121) [30]. Unless otherwise stated, we use Adam [31] with learning rate $1 \times 10^{-4}$ and batch size 16. Federated training uses $T = 10$ communication rounds by default, and each client performs $E = 3$ local epochs per round (with $E \in \{1, 3, 5\}$ for ablation). Global aggregation follows sample-size weighted averaging in Equation (3).

For FedProx, the default setting is $μ = 10^{-2}$, with ablation over $μ \in \{10^{-4}, 10^{-3}, 10^{-2}\}$ (and $μ = 0$ as FedAvg). All experiments are run on a single workstation equipped with a CUDA-enabled GPU.

2.10. Evaluation Metrics

We report Accuracy, ROC-AUC, F1-score, tumour-class Recall, Specificity, and confusion matrix statistics. For F1/Recall/Specificity, the tumour probability threshold is fixed at 0.5 across all methods for fair comparison.

Let TP, TN, FP, and FN denote true positive, true negative, false positive, and false negative counts, respectively. For binary classification:

$$\mathrm{Recall} = \frac{TP}{TP + FN}, \quad \mathrm{Specificity} = \frac{TN}{TN + FP}. \tag{4}$$

$$F1 = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}. \tag{5}$$

AUC is computed as the area under the ROC curve:

$$\mathrm{AUC} = \int_{0}^{1} \mathrm{TPR}(\mathrm{FPR}) \, d(\mathrm{FPR}), \tag{6}$$

where $\mathrm{TPR}$ is Recall and $\mathrm{FPR} = 1 - \mathrm{Specificity}$.
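The metrics in Equations (4) to (6) can be checked on a small example. The sketch below is illustrative only: the helper names and the toy labels and scores are hypothetical, a slice is treated as positive when its tumour probability is at least 0.5 (the handling of an exact tie at 0.5 is an assumption), and AUC is computed through the equivalent pairwise-ranking form, in which tied positive/negative pairs count one half.

```python
from typing import Sequence, Tuple


def confusion_counts(labels: Sequence[int], scores: Sequence[float],
                     threshold: float = 0.5) -> Tuple[int, int, int, int]:
    """Return (TP, TN, FP, FN) at a fixed probability threshold."""
    tp = tn = fp = fn = 0
    for y, s in zip(labels, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        elif predicted == 1 and y == 0:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn) if tp + fn else 0.0


def specificity(tn: int, fp: int) -> float:
    return tn / (tn + fp) if tn + fp else 0.0


def f1_score(tp: int, fp: int, fn: int) -> float:
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve as the fraction of correctly ranked pos/neg pairs."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Toy predictions (hypothetical): three tumour slices, two normal slices.
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.2]
tp, tn, fp, fn = confusion_counts(labels, scores)
print(tp, tn, fp, fn)
print(recall(tp, fn), specificity(tn, fp), f1_score(tp, fp, fn), roc_auc(labels, scores))
```

Note that AUC depends only on the ranking of scores, whereas Recall, Specificity, and F1 depend on the chosen threshold, so the two groups of metrics can move independently.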

We report both (i) averaged client-side validation performance, and (ii) external test performance on 3D-IRCADb, to assess cross-domain generalisation.

3. Results

3.1. Main Experimental Results on Federated Validation and External Test

Table 2 compares four backbones under the most challenging federated setting ($K = 5$, severe non-IID). We report AUC, F1-score, Recall, and Specificity on both averaged client-side validation (LiTS) and external testing (3D-IRCADb).

Across models, tumour recall remains high on LiTS validation, whereas external AUC is consistently lower, indicating a domain gap between LiTS and 3D-IRCADb. ImageNet pretraining generally improves external robustness (e.g., ResNet-50 FedAvg external AUC of 0.5605 vs. 0.4892 under scratch). FedProx is competitive with FedAvg across settings, with modest gains in some configurations rather than a uniform advantage.

Model-wise, no backbone dominates every metric. ResNet-50 and ViT-B/16 can be competitive in AUC under selected settings, while MobileNetV3-Small provides the best practical trade-off between external performance and communication cost. Therefore, our conclusions emphasize robust empirical comparison under heterogeneous settings instead of claiming a single universally best model.

3.2. Impact of Non-IID Intensity

To assess robustness against statistical heterogeneity, we evaluate three partition regimes (IID, Mild, Severe) using MobileNetV3-Small (pretrained, $K = 5$). Table 3 reports results under this IID/Mild/Severe protocol. External AUC varies only slightly across intensities (0.57 to 0.60), while specificity remains low in all settings (0.01 to 0.08), indicating a persistent threshold-level bias toward the positive class. FedProx remains comparable to FedAvg across intensities.

The low external specificity should be interpreted cautiously for deployment. Possible reasons include class imbalance, centre-level acquisition differences, and threshold mismatch between internal validation and external cohorts. This also suggests that site-adaptive threshold selection or probability calibration may be necessary in practical use.

3.3. Parameter Sensitivity Analysis ($μ$, E, and K)

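We further investigate key hyperparameters (Table 4, Table 5 and Table 6). As shown in Table 4, increasing $μ$ from 0 to $10^{-2}$ raises specificity slightly (0.05 to 0.06) and lowers recall slightly (0.98 to 0.97), while AUC and F1 are unchanged at the reported precision (0.60 and 0.81), reflecting a weak sensitivity–specificity trade-off. Increasing local epochs reduces communication frequency but can increase local drift; in Table 5, AUC is 0.59 at $E = 1$, 0.60 at $E = 3$, and 0.59 at $E = 5$ for both methods.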

Table 3. Impact of non-IID intensity levels on external test performance (MobileNetV3-Small, $K = 5$, pretrained).

Intensity	Method	AUC	F1	Recall	Spec.
IID	FedAvg	0.57	0.81	0.97	0.08
IID	FedProx	0.57	0.81	0.97	0.07
Mild	FedAvg	0.58	0.81	0.99	0.02
Mild	FedProx	0.59	0.81	0.99	0.01
Severe	FedAvg	0.60	0.81	0.98	0.05
Severe	FedProx	0.60	0.81	0.97	0.06

Table 4. Sensitivity to proximal coefficient $μ$ (MobileNetV3-Small, $K = 5$, severe non-IID, $E = 3$).

$μ$	AUC	F1	Recall	Spec.
0 (FedAvg)	0.60	0.81	0.98	0.05
$10^{- 4}$	0.60	0.81	0.98	0.05
$10^{- 3}$	0.60	0.81	0.98	0.06
$10^{- 2}$	0.60	0.81	0.97	0.06

Table 5. Sensitivity to local epochs E (MobileNetV3-Small, $K = 5$, severe non-IID, $μ = 0.01$ for FedProx).

E	Method	AUC	F1	Recall	Spec.
1	FedAvg	0.59	0.80	0.97	0.04
1	FedProx	0.59	0.80	0.97	0.05
3	FedAvg	0.60	0.81	0.98	0.05
3	FedProx	0.60	0.81	0.97	0.06
5	FedAvg	0.59	0.81	0.99	0.03
5	FedProx	0.59	0.81	0.98	0.04

Table 6. Sensitivity to number of clients K (MobileNetV3-Small, pretrained, $E = 3$). Note: $K = 1$ is IID; $K = 3, 5$ are severe non-IID.

K	Method	AUC	F1	Recall	Spec.
1	FedAvg	0.53	0.81	0.99	0.05
1	FedProx	0.55	0.81	1.00	0.02
3 (Severe)	FedAvg	0.55	0.81	0.99	0.01
3 (Severe)	FedProx	0.55	0.81	0.99	0.02
5 (Severe)	FedAvg	0.60	0.81	0.98	0.05
5 (Severe)	FedProx	0.60	0.81	0.97	0.06

To complement these findings, Figure 3 shows 50-round convergence under severe non-IID settings. The validation curves indicate stable optimisation dynamics, with progressive improvement in accuracy and F1 in later rounds.

For client-number sensitivity, Table 6 includes $K \in \{1, 3, 5\}$. We note that $K = 1$ results were obtained under IID partition, whereas $K = 3, 5$ correspond to severe non-IID; therefore, $K = 1$ serves as a reference rather than a strictly matched condition. Under severe non-IID, increasing clients from $K = 3$ to $K = 5$ improves external AUC (from 0.55 to 0.60 for both FedAvg and FedProx), suggesting benefits from broader client diversity.

3.4. Ablation on Fine-Tuning Depth

Table 7 compares transfer-learning depth. Freezing the backbone yields higher AUC, whereas finetune-last improves specificity (from 6.05% to 10.60%). This suggests that adapting higher-level features can reduce false positives, with a modest reduction in ranking performance.

3.5. Communication and Computation Efficiency

Table 8 reports the model size and estimated per-round communication cost under the pretrained-freeze setting. In our current pipeline, clients transmit and receive the full model state each round; therefore, communication cost is determined by total model size even when most layers are frozen. Under this protocol, MobileNetV3-Small incurs the lowest per-round communication burden while maintaining competitive external performance.

Per-round communication per client is estimated as downlink + uplink = $2 \times$ model size (FP32). For K participating clients, total communication per round is $2K \times$ model size. If only trainable parameters were transmitted, communication under freeze could be further reduced; this is left for future engineering optimization.
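The estimate above is simple arithmetic, sketched here for concreteness. The parameter count used in the example is a hypothetical round number, not a value from Table 8, and the helper names are introduced only for illustration. The sketch counts 4 bytes per FP32 parameter and ignores non-parameter buffers and serialisation overhead.

```python
BYTES_PER_FP32 = 4


def model_size_mb(num_parameters: int) -> float:
    """Approximate FP32 model size in megabytes (1 MB = 1e6 bytes)."""
    return num_parameters * BYTES_PER_FP32 / 1e6


def per_client_round_mb(num_parameters: int) -> float:
    """Downlink plus uplink for one client in one round: 2 x model size."""
    return 2 * model_size_mb(num_parameters)


def total_round_mb(num_parameters: int, num_clients: int) -> float:
    """Traffic over all K participating clients in one round: 2K x model size."""
    return num_clients * per_client_round_mb(num_parameters)


# Hypothetical model with 2.5 million parameters, K = 5 clients, T = 10 rounds.
params = 2_500_000
print(model_size_mb(params))             # size of one model copy
print(per_client_round_mb(params))       # one client's traffic per round
print(total_round_mb(params, 5))         # all clients, one round
print(10 * total_round_mb(params, 5))    # all clients over ten rounds
```

Because the cost scales linearly with the parameter count, a backbone with ten times more parameters incurs ten times the traffic under this protocol, regardless of how many layers are frozen.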

4. Discussion

Multi-centre liver tumour datasets naturally exhibit substantial heterogeneity due to differences in patient populations and imaging characteristics. Under such non-IID settings, local updates may diverge and destabilize global aggregation [5,15,25,26]. The proximal regularization in FedProx constrains local optimization by limiting deviation from the global model, which can alleviate client drift compared with FedAvg [6]. In our experiments, FedProx shows competitive performance and more stable behaviour as data heterogeneity increases, although the improvement varies across evaluation metrics and configurations [19,32].

The ablation results on the proximal coefficient $μ$ and the number of local epochs E highlight the trade-off between communication efficiency and training stability. Increasing $μ$ restricts local deviation and can improve specificity, with modest trade-offs in AUC or recall. Larger E reduces communication frequency but may amplify client drift under heterogeneous data [3,19,33]. The communication cost in our setting is primarily determined by model size. Even with frozen backbones, large models such as ViT-B/16 incur substantially higher transmission overhead than lightweight architectures such as MobileNetV3-Small, suggesting that communication efficiency is more effectively improved through compact architectures or parameter-efficient update strategies [19].

Our results confirm the effectiveness of ImageNet pretraining, which provides transferable low-level visual representations for medical images [34]. However, the domain gap between natural and medical images remains relevant. Comparing different strategies, freezing the backbone tends to preserve ranking performance (AUC), whereas partial fine-tuning of later layers improves specificity by adapting higher-level features to CT-specific patterns [21,27]. This trade-off is particularly important when local datasets are small.

Evaluation on the external 3D-IRCADb dataset provides a strict out-of-distribution test [19,35]. Domain shifts caused by scanner differences and acquisition protocols can reduce model generalisation and increase false positives [36]. In our experiments, tumour recall remains relatively high on the external dataset, while specificity decreases, indicating that improving robustness to domain shift remains an important challenge for clinical deployment. This behaviour also implies that fixed thresholds selected on internal validation may not transfer optimally to unseen centres, and calibration or site-adaptive thresholding should be considered before routine use.

Several limitations remain. First, our current 2D slice-based formulation does not explicitly capture volumetric context; future work could explore 3D CNNs or transformer-based sequence models to model inter-slice dependencies [37]. Second, although FL preserves data locality, it does not inherently prevent gradient leakage; integrating differential privacy or secure aggregation mechanisms would further strengthen privacy protection [38,39]. Third, our external evaluation is retrospective; prospective multi-centre validation with centre-specific calibration protocols is still required for deployment. Finally, incorporating multimodal clinical information may further improve model robustness and clinical utility [17,18].

Further supporting analyses, extended tables, non-IID partition statistics, and detailed reproducibility instructions can be found in the Supplementary Materials.

5. Conclusions

This study investigates multi-centre liver tumour classification under realistic data heterogeneity and domain shift in a federated setting. We develop a FedProx-based training pipeline and establish a cross-domain benchmark by training on LiTS clients and evaluating on the external 3D-IRCADb dataset. Across multiple backbones (ResNet-50, EfficientNet-B3, ViT-B/16, and MobileNetV3-Small), FedProx is generally comparable to FedAvg, with slightly more stable behaviour in several non-IID settings. We observe a clear validation-to-external gap and low external specificity, highlighting that external-domain evaluation and careful threshold handling are necessary for medical FL. ImageNet pretraining improves cross-domain behaviour relative to scratch training, particularly for data-sparse clients. MobileNetV3-Small provides a favourable performance–efficiency trade-off, and the case-level partitioning strategy with minimum-validation guarantees offers a practical evaluation protocol for small-client federated scenarios.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/computers15050286/s1.

Author Contributions

D.Z. and S.W. contributed equally to this work. D.Z. designed the study and developed the methodology. S.W. conducted experiments and analysed the results. X.Z. supervised the project and finalized the manuscript. All authors reviewed and approved the final manuscript.

Funding

This work was supported in part by the National Key Research and Development Program of China under Grant 2020YFB2103803.

Data Availability Statement

The datasets used in this study are publicly available: LiTS [28] and 3D-IRCADb [29]. The source code for federated training and evaluation is publicly available at https://github.com/wisky321/multi-centre-liver-tumour-classification-via-federated-learning (accessed on 10 April 2026). Raw medical data are not redistributed and should be obtained from the official dataset sources. Processed data and split manifests are available from the corresponding author upon reasonable request.

Acknowledgments

The authors thank the collaborators for helpful discussions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Sung, H.; Ferlay, J.; Siegel, R.L.; Laversanne, M.; Soerjomataram, I.; Jemal, A.; Bray, F. Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries. CA Cancer J. Clin. 2021, 71, 209–249.
2. European Parliament and Council. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016. Off. J. Eur. Union 2016, 119, 1–88.
3. McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; Arcas, B.A.y. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Ft. Lauderdale, FL, USA, 20–22 April 2017; pp. 1273–1282.
4. Li, T.; Sahu, A.K.; Talwalkar, A.; Smith, V. Federated Learning: Challenges, Methods, and Future Directions. IEEE Signal Process. Mag. 2020, 37, 50–60.
5. Kairouz, P.; McMahan, H.B.; Avent, B. Advances and Open Problems in Federated Learning. Found. Trends Mach. Learn. 2021, 14, 1–210.
6. Li, T.; Sahu, A.K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; Smith, V. Federated Optimization in Heterogeneous Networks. Proc. Mach. Learn. Syst. 2020, 2, 429–450.
7. Li, X.; Jiang, M.; Zhang, X.; Kamp, M.; Dou, Q. FedBN: Federated Learning on Non-IID Features via Local Batch Normalization. In Proceedings of the International Conference on Learning Representations, Virtual, 3–7 May 2021.
8. Hu, Z.; Shou, L.; Chen, K.; Ye, G.; Liu, Z. Federated Learning with Attentive Message Passing. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–12 December 2020.
9. Rieke, N.; Hancox, J.; Li, W.; Milletari, F.; Roth, H.R.; Albarqouni, S.; Bakas, S.; Galtier, M.N.; Landman, B.A.; Maier-Hein, K.; et al. The Future of Digital Health with Federated Learning. npj Digit. Med. 2020, 3, 119.
10. Sheller, M.J.; Reina, G.A.; Edwards, B.; Martin, J.; Bakas, S. Federated Learning in Medicine: Facilitating Multi-Institutional Collaborations without Sharing Patient Data. Sci. Rep. 2020, 10, 12598.
11. Dayan, I.; Roth, H.R.; Zhong, A.; Harouni, A.; Gentili, A.; Abidin, A.Z.; Liu, A.; Costa, A.B.; Wood, B.J.; Tsai, C.S.; et al. Federated Learning for Predicting Clinical Outcomes in Patients with COVID-19. Nat. Med. 2021, 27, 1735–1743.
12. Kaissis, G.A.; Makowski, M.R.; Rückert, D.; Braren, R.F. Secure, Privacy-Preserving and Federated Machine Learning in Medical Imaging. Nat. Mach. Intell. 2021, 3, 473–484.
13. Le Dinh Viet, K.; Le Ha, K.; Quoc, T.N.; Hoang, V.T. MRI Brain Tumor Classification Based on Federated Deep Learning. In Proceedings of the 2023 Zooming Innovation in Consumer Technologies Conference, Novi Sad, Serbia, 29–31 May 2023; pp. 131–135.
14. Aggarwal, M.; Khullar, V.; Goyal, N.; Rastogi, R.; Singh, A.; Torres, V.Y.; Albahar, M.A. Privacy Preserved Collaborative Transfer Learning Model with Heterogeneous Distributed Data for Brain Tumor Classification. Int. J. Imaging Syst. Technol. 2024, 34, e22994.
15. Gong, C.; Liu, X.; Zhou, J. Federated Learning in Non-IID Brain Tumor Classification. In Proceedings of the 5th International Symposium on Artificial Intelligence for Medicine Science, Amsterdam, The Netherlands, 13–17 August 2024; pp. 1–8.
16. Li, W.; Milletarì, F.; Xu, D.; Rieke, N.; Hancox, J.; Zhu, W.; Baust, M.; Cheng, Y.; Ourselin, S.; Cardoso, M.J.; et al. Privacy-Preserving Federated Brain Tumour Segmentation. In Proceedings of the Machine Learning in Clinical Neuroimaging, Virtual, 4 October 2020.
17. Hsiao, C.H.; Lin, F.Y.S.; Sun, T.L.; Liao, Y.Y.; Wu, C.H.; Lai, Y.C.; Wu, H.P.; Liu, P.R.; Xiao, B.R.; Chen, C.H.; et al. Precision and Robust Models on Healthcare Institution Federated Learning for Predicting HCC on Portal Venous CT Images. IEEE J. Biomed. Health Inform. 2024, 28, 4674–4687.
18. Uzdur, B.; Tekeli, E.; Ibrikci, T.; Ur Rashid, H.; Ramachandran, G. Diagnosis of Hepatocellular Carcinoma (HCC) Liver Cancer Using Federated Learning on MR Images. Cukurova Univ. Muhendis. Fak. Derg. 2025, 40, 531–544.
19. Ng, D. Federated Learning in Medical Imaging: A Systematic Review. Med. Image Anal. 2023, 86, 102789.
20. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami Beach, FL, USA, 20–25 June 2009; pp. 248–255.
21. Tajbakhsh, N.; Shin, J.Y.; Gurudu, S.R.; Hurst, R.T.; Kendall, C.B.; Gotway, M.B.; Liang, J. Convolutional Neural Networks for Medical Image Analysis: Full Training or Fine Tuning? IEEE Trans. Med. Imaging 2016, 35, 1299–1312.
22. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations, Virtual, 3–7 May 2021.
23. Tan, M.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114.
24. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324.
25. Roth, H.R.; Chang, K.H.; Yang, D. Federated Learning for Multi-Institutional Medical Image Segmentation. IEEE Trans. Med. Imaging 2022, 41, 2633–2644.
26. Pati, S.; Baid, U.; Edwards, B.; Sheller, M.; Wang, S.H.; Reina, G.A.; Foley, P.; Gruzdev, A.; Karkada, D.; Davatzikos, C.; et al. Federated Learning Enables Big Data for Rare Cancer Boundary Detection. Nat. Commun. 2022, 13, 734.
27. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 11976–11986.
28. Bilic, P.; Christ, P.F.; Vorontsov, E.; Chlebus, G.; Chen, H.; Dou, Q.; Fu, C.W.; Han, X.; Heng, P.A.; Hesser, J.; et al. The Liver Tumor Segmentation Benchmark (LiTS). Med. Image Anal. 2023, 84, 102680.
29. Soler, L.; Hostettler, A.; Agnus, V.; Charnoz, A.; Fasquel, J.B.; Moreau, J.; Osswald, A.B.; Bouhadjar, M.; Marescaux, J. 3D Image Reconstruction for Comparison of Algorithm Database; Technical Report; IRCAD: Strasbourg, France, 2010.
30. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32; Curran Associates, Inc.: Red Hook, NY, USA, 2019; pp. 8024–8035.
31. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980.
32. Zhao, Y.; Li, M.; Lai, L.; Suda, N.; Civin, D.; Chandra, V. Federated Learning with Non-IID Data. arXiv 2018, arXiv:1806.00582.
33. Wang, J.; Charles, Z.; Xu, Z.; Joshi, G.; McMahan, H.B.; Al-Shedivat, M.; Andrew, G.; Avestimehr, S.; Daly, K.; Data, D.; et al. A Field Guide to Federated Optimization. arXiv 2021, arXiv:2107.06917.
34. Raghu, M.; Zhang, C.; Kleinberg, J.; Bengio, S. Transfusion: Understanding Transfer Learning for Medical Imaging. Adv. Neural Inf. Process. Syst. 2019, 32, 3342–3352.
35. Pang, J. Federated Learning for Medical Imaging: A Health Data Management Perspective. IEEE J. Biomed. Health Inform. 2021, 25, 4127–4140.
36. Geirhos, R.; Jacobsen, J.H.; Michaelis, C.; Zemel, R.; Brendel, W.; Bethge, M.; Wichmann, F.A. Shortcut Learning in Deep Neural Networks. Nat. Mach. Intell. 2020, 2, 665–673.
37. Liu, Z.; Ning, J.; Cao, Y.; Wei, Y.; Zhang, Z.; Lin, S.; Hu, H. Video Swin Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 3202–3211.
38. Abadi, M.; Chu, A.; Goodfellow, I.; McMahan, H.B.; Mironov, I.; Talwar, K.; Zhang, L. Deep Learning with Differential Privacy. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security, Vienna, Austria, 24–28 October 2016; pp. 308–318.
39. Zhu, L.; Liu, Z.; Han, S. Deep Leakage from Gradients. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019.

Figure 1. Overall experimental workflow of the proposed federated liver tumour classification framework, including both the federated training stage (client-side local updates with server aggregation) and the evaluation stage (client-side validation and external testing on 3D-IRCADb). Blue boxes with snowflake icons represent components that are frozen during training and testing; red boxes with sun icons represent trainable components; green boxes represent data processing steps; arrows indicate the training flow.

Figure 2. Illustration of a realistic medical federated learning scenario. A central server communicates model parameters with heterogeneous clients (e.g., central hospitals, regional hospitals, and clinics), where local data distributions are non-IID and institution-dependent. Blue arrows indicate upload of local model parameters from clients to the server, and green arrows indicate download of the global model parameters from the server to clients.

Figure 3. Convergence curves over 50 communication rounds under severe non-IID ($K = 5$, MobileNetV3-Small, FedProx, $μ = 0.01$, $E = 3$, freeze). Panels report validation loss, accuracy, ROC-AUC, F1, tumour recall, and specificity.


Table 1. Tumour slice ratios across 5 clients under different non-IID intensities.

Intensity	C1	C2	C3	C4	C5	Variance
IID	0.181	0.183	0.181	0.181	0.182	$1.0 \times 10^{- 6}$
Mild	0.178	0.181	0.183	0.182	0.184	$4.0 \times 10^{- 6}$
Severe	0.054	0.108	0.161	0.234	0.353	$1.1 \times 10^{- 2}$
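
The Variance column appears to be consistent with the population variance (dividing by the number of clients, $K = 5$) of each row's tumour-slice ratios. The following is a minimal sketch that recomputes it from the rounded values in Table 1. The choice of the population form is an inference from the numbers, not something stated here, and because the inputs are rounded to three decimals the results only approximately match the reported column.

```python
from statistics import pvariance

# Tumour-slice ratios per client (C1..C5), copied from Table 1.
ratios = {
    "IID": [0.181, 0.183, 0.181, 0.181, 0.182],
    "Mild": [0.178, 0.181, 0.183, 0.182, 0.184],
    "Severe": [0.054, 0.108, 0.161, 0.234, 0.353],
}

# Population variance (divide by K = 5); this choice is an assumption.
variances = {name: pvariance(values) for name, values in ratios.items()}

for name, var in variances.items():
    print(f"{name:>6}: {var:.1e}")
```

The severe setting differs from the other two by roughly four orders of magnitude, which is the gap the partitioning is designed to create.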

Table 2. Main experimental results under the $K = 5$ severe non-IID setting ($E = 3$). Validation metrics are averaged across LiTS clients; external metrics are evaluated on the held-out 3D-IRCADb set. Pre. and Scr. denote pretrained and scratch initialization, respectively. External-test results are reported as mean ± std over 3 seeds.


Model	Method	Init	Val. AUC	Val. F1	Val. Recall	Val. Spec.	Ext. AUC	Ext. F1	Ext. Recall	Ext. Spec.
ResNet-50	FedAvg	Pre.	0.69	0.80	0.95	0.14	0.58 ± 0.01	0.81 ± 0.00	0.98 ± 0.01	0.07 ± 0.03
	FedAvg	Scr.	0.74	0.81	1.00	0.02	0.50 ± 0.03	0.81 ± 0.01	0.99 ± 0.02	0.02 ± 0.04
	FedProx	Pre.	0.70	0.80	0.93	0.19	0.58 ± 0.01	0.81 ± 0.00	0.98 ± 0.01	0.09 ± 0.02
	FedProx	Scr.	0.72	0.77	0.84	0.41	0.54 ± 0.03	0.81 ± 0.02	0.96 ± 0.06	0.11 ± 0.09
EfficientNet-B3	FedAvg	Pre.	0.69	0.80	0.96	0.08	0.55 ± 0.01	0.79 ± 0.01	0.93 ± 0.02	0.09 ± 0.02
	FedAvg	Scr.	0.74	0.77	0.87	0.29	0.55 ± 0.04	0.81 ± 0.00	0.99 ± 0.01	0.04 ± 0.04
	FedProx	Pre.	0.70	0.80	0.94	0.14	0.55 ± 0.01	0.79 ± 0.01	0.93 ± 0.02	0.10 ± 0.02
	FedProx	Scr.	0.72	0.76	0.80	0.46	0.54 ± 0.01	0.80 ± 0.01	0.95 ± 0.03	0.10 ± 0.03
ViT-B/16	FedAvg	Pre.	0.73	0.82	0.95	0.20	0.56 ± 0.02	0.80 ± 0.01	0.95 ± 0.02	0.10 ± 0.03
	FedAvg	Scr.	0.73	0.81	1.00	0.00	0.56 ± 0.00	0.81 ± 0.00	1.00 ± 0.00	0.00 ± 0.00
	FedProx	Pre.	0.73	0.82	0.95	0.20	0.56 ± 0.02	0.80 ± 0.01	0.95 ± 0.02	0.10 ± 0.03
	FedProx	Scr.	0.74	0.81	1.00	0.00	0.58 ± 0.02	0.81 ± 0.00	1.00 ± 0.00	0.00 ± 0.00
MobileNetV3-S	FedAvg	Pre.	0.72	0.82	0.96	0.14	0.59 ± 0.01	0.81 ± 0.00	0.98 ± 0.01	0.03 ± 0.02
	FedAvg	Scr.	0.72	0.79	0.90	0.26	0.57 ± 0.02	0.79 ± 0.01	0.90 ± 0.03	0.19 ± 0.03
	FedProx	Pre.	0.72	0.81	0.95	0.18	0.59 ± 0.01	0.81 ± 0.00	0.98 ± 0.01	0.04 ± 0.02
	FedProx	Scr.	0.71	0.77	0.86	0.25	0.57 ± 0.01	0.79 ± 0.00	0.91 ± 0.02	0.16 ± 0.04
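
To help read the recall and specificity columns together, the sketch below computes recall, specificity, precision, and F1 from confusion-matrix counts using their standard definitions, with tumour as the positive class. The counts are hypothetical and chosen only for illustration; they are not taken from the study. They show how a classifier that labels every slice as tumour can keep a high F1 while specificity drops to zero.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary metrics with tumour as the positive class."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }

# Hypothetical counts: 680 tumour slices and 320 non-tumour slices,
# with a model that predicts "tumour" for every slice.
all_positive = binary_metrics(tp=680, fp=320, tn=0, fn=0)
print(all_positive)
```

Because F1 ignores true negatives, it should be read alongside specificity and AUC rather than on its own.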

Table 7. Fine-tuning depth ablation (MobileNetV3-Small, $K = 5$, severe non-IID, FedProx).


Strategy	AUC	F1	Recall	Spec.
freeze (head only)	0.60	0.81	0.97	0.06
finetune-last block	0.58	0.81	0.95	0.11

Table 8. Model parameter scale and estimated per-round communication cost under pretrained-freeze (FP32).

Backbone	Params	Comm. Cost (Client/Round)
ResNet-50	25.6 M	204.8 MB
EfficientNet-B3	12.2 M	97.6 MB
ViT-B/16	86.6 M	692.8 MB
MobileNetV3-Small	2.5 M	20.0 MB
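
The costs in Table 8 are consistent with each client exchanging the full parameter set twice per round (one download of the global model and one upload of the local model) at 4 bytes per FP32 parameter, taking 1 MB as 10^6 bytes. This reading is inferred from the reported numbers rather than stated here. The sketch below reproduces the column under that assumption; the function and constant names are illustrative.

```python
BYTES_PER_FP32 = 4        # size of one FP32 parameter
TRANSFERS_PER_ROUND = 2   # assumed: one download + one upload per client

# Parameter counts in millions, copied from Table 8.
params_millions = {
    "ResNet-50": 25.6,
    "EfficientNet-B3": 12.2,
    "ViT-B/16": 86.6,
    "MobileNetV3-Small": 2.5,
}

def comm_cost_mb(params_m: float) -> float:
    """Per-client, per-round traffic in MB (1 MB = 10**6 bytes)."""
    # Millions of parameters times bytes per parameter gives MB directly.
    return params_m * BYTES_PER_FP32 * TRANSFERS_PER_ROUND

costs = {name: round(comm_cost_mb(p), 1) for name, p in params_millions.items()}
for name, cost in costs.items():
    print(f"{name}: {cost} MB")
```

Under this accounting the cost scales linearly with parameter count, so the roughly 35-fold gap between ViT-B/16 and MobileNetV3-Small carries over directly to per-round traffic.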

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhu, D.; Wei, S.; Zhang, X. Multi-Centre Liver Tumour Classification via Federated Learning: Investigating Data Heterogeneity, Transfer Learning, and Model Efficiency. Computers 2026, 15, 286. https://doi.org/10.3390/computers15050286

