1. Introduction
Federated learning (FL) has become an important paradigm for medical image analysis because clinically valuable imaging data are naturally distributed across hospitals and cannot be easily centralized due to legal, ethical, and practical constraints [1,2]. In multi-center medical imaging, however, the main challenge is not simply that local data are non-IID. Instead, hospitals often differ in scanner vendors, acquisition protocols, reconstruction pipelines, and annotation styles, which induces mechanism-level variation in the data generation process [3]. As a result, a federated model can easily exploit institution-specific shortcuts that appear predictive within one site but fail under external-site evaluation.
A large body of work has attempted to address heterogeneity through personalized federated learning. Methods such as FedRep learn a shared representation with local prediction heads, while methods such as FedAMP and FedGCD adapt collaboration according to client similarity or graph-based grouping [4,5,6]. These approaches improve local adaptation, but their notion of heterogeneity is primarily statistical. They do not explicitly distinguish stable structural dependencies from site-specific shortcut correlations, which is particularly problematic in medical imaging where scanner- and protocol-induced biases are often entangled with pathology-relevant signals.
More recently, causal and invariant learning methods have been introduced into federated settings. FedSDR is a representative example that discovers shortcut-sensitive features across clients and then learns personalized invariant representations that are less dependent on these shortcuts [7]. In parallel, federated causal structure learning methods such as FedCSL focus on estimating a global causal graph from decentralized data [8]. While these directions are highly relevant, each addresses only part of the problem. Shortcut-aware invariant learning improves robustness but does not explicitly model multiple site-specific latent mechanisms, whereas federated causal graph recovery estimates structure but does not directly integrate the learned structure into personalized representation learning.
In this paper, we propose CarPe-FL (causal representation-based personalized federated learning), a framework for multi-center medical imaging. CarPe-FL is motivated by the observation that hospitals may share latent mechanisms that overlap only partially yet remain structurally similar. Instead of forcing all institutions into a single global graph, our framework estimates client-specific latent proxy graphs on the server, clusters institutions according to structural similarity, and constructs cluster-wise global causal backbones. These backbones are then injected into the representation learning process through structure-aligned masking and edge-wise personalization. In contrast to FedSDR, our goal is not limited to shortcut discovery and removal in representation space; in contrast to FedCSL, our goal is not limited to graph recovery itself. Rather, CarPe-FL uses cluster-wise structural backbones as a coordination prior for personalized federated representation learning in heterogeneous medical imaging.
The main contributions of this paper are as follows. First, we introduce a server-side latent structural learning scheme that estimates client-specific proxy graphs from latent statistics and aggregates them into cluster-wise global causal backbones. Second, we propose a structure-aligned personalized federated learning procedure in which the learned backbones constrain representation sharing while edge-wise gates and local heads preserve institution-specific adaptation. Third, we tailor the resulting framework to multi-center medical imaging, where external-site robustness and shortcut suppression are clinically critical. Through this design, CarPe-FL provides a unified structural-personalized perspective on federated medical image learning.
2. Related Work
2.1. Personalized and Clustered Federated Learning
Personalized federated learning (PFL) has been widely studied to address client heterogeneity by separating the shared and client-specific components of a model. A representative example is FedRep, which learns a common representation while keeping local heads personalized, showing that representation sharing can still be effective when heterogeneity is concentrated mainly in the label space [
4]. Beyond head-level personalization, several studies have explored client grouping or similarity-aware collaboration. FedAMP, for instance, adaptively aggregates information from similar clients instead of enforcing a single global model across all participants [
5]. In a related direction, FedGCD models clients as nodes in a graph and applies GNN-based community detection to identify groups of clients with similar data distributions [
6]. These methods improve collaboration under statistical heterogeneity, but their grouping criteria are still based on data or model similarity rather than on the consistency of underlying causal mechanisms. Consequently, they do not explicitly distinguish stable causal relations from site-specific shortcuts.
2.2. Causal and Invariant Representation Learning
Another line of work focuses on learning invariant representations under distribution shift. Fishr, for example, enforces the invariance of gradient variances across environments and improves out-of-distribution generalization without explicitly learning a causal graph [
9]. In the federated setting, Tang et al. proposed FedSDR, which formulates structural causal models for heterogeneous federated clients and performs collaborative shortcut discovery followed by shortcut-aware personalized invariant learning [
7]. These studies highlight the importance of separating stable and unstable features, but they mainly operate through regularization or shortcut removal in representation space, and they do not explicitly estimate, compare, or aggregate client-specific causal graphs. More broadly, work on identifiable latent-variable modeling has shown that semantically meaningful latent factors generally require auxiliary information or additional assumptions, which suggests that latent causal structures in deep representation spaces should be treated as a modeling prior rather than as an automatically identifiable object [
10]. This makes the direct use of latent structural backbones in federated personalization both promising and methodologically nontrivial.
2.3. Federated Causal Structure Learning
Federated causal discovery has recently emerged as a direct way to estimate causal relations from decentralized data. FedCSL is a representative study in this direction, which first learns local causal neighbors and then constructs a weighted global skeleton and orientation for scalable federated causal structure learning [
8]. This line of work is highly relevant because it introduces explicit causal graph estimation into the federated setting. Nevertheless, its main focus is the recovery of a single global causal graph, and the learned structure is not directly integrated into downstream representation learning or client-specific prediction. In other words, existing federated causal structure learning methods are strong at graph estimation, but they do not address how learned structures should shape personalized model training when clients may follow multiple related mechanisms.
2.4. Federated Learning in Medical Imaging
Medical imaging is one of the most important application domains of federated learning because clinically valuable data are naturally distributed across hospitals and cannot be easily centralized. Recent reviews have emphasized that multi-center medical imaging is affected by severe domain shifts induced by scanner vendors, acquisition protocols, reconstruction pipelines, and annotation styles [
2]. Empirical studies have further shown that federated training can be competitive with centralized learning while remaining highly sensitive to center-level heterogeneity and external-site evaluation gaps [
3]. These observations suggest that medical imaging heterogeneity should not be treated as mere statistical noise but rather as a form of mechanism-level variation. This creates a clear gap for methods that can jointly model causal structure, cluster site-specific mechanisms, and personalize representations for robust cross-site generalization.
Overall, the existing studies address only part of the problem: PFL methods improve local adaptation without causal structure, invariant learning methods suppress shortcuts without explicit structural consensus, and federated causal discovery methods recover graphs without integrating them into downstream personalized learning. This paper is motivated by the need to unify these three aspects for multi-center medical imaging.
3. Problem Definition
3.1. Multi-Center Federated Medical Imaging Setting
We consider a cross-silo federated learning setting with $N$ medical institutions (clients), which are indexed by $i \in \{1, \dots, N\}$. Each institution locally stores a medical imaging dataset of $n_i$ samples, $\mathcal{D}_i = \{(x_{i,j}, y_{i,j})\}_{j=1}^{n_i}$, where $x_{i,j}$ denotes an input image and $y_{i,j}$ denotes the downstream target, such as a disease label or a segmentation mask. Raw images and annotations are not shared across institutions.
A key challenge in federated medical imaging is that the domain gap across institutions is not merely a mild statistical discrepancy but is often induced by mechanism-level differences, including scanner vendors, acquisition protocols, reconstruction pipelines, and annotation styles [2,3]. As a result, a model trained with conventional federated averaging may easily exploit hospital-specific shortcuts rather than pathology-relevant signals, leading to severe degradation under external-site validation.
To model this issue, we assume that each image is mapped to a latent representation
where
is a shared image encoder and
is interpreted as a set of latent factors. We assume that within each institution or a group of structurally similar institutions, these latent factors admit a sparse structural causal model
where
is a directed acyclic graph (DAG) over the latent variables
. As in latent-variable modeling more broadly, we do not assume that the exact ground-truth latent graph is generally identifiable from deep representations without additional side information or stronger assumptions [
10]. Instead, our aim is to estimate stable proxy structures that summarize dominant dependency patterns and are useful for clustering and regularization. The objective is therefore to estimate client-level proxy graphs and cluster-level backbones in the latent space and to inject them into personalized federated representation learning.
3.2. Learning Objective
The goal of CarPe-FL is threefold. First, we aim to estimate client-specific latent proxy structures under server-side management. Second, we aim to group institutions with structurally similar latent mechanisms and to construct a sparse cluster-wise global causal backbone for each group. Third, we aim to use these backbones to constrain and personalize representation learning, such that stable pathways are emphasized while site-specific shortcuts are suppressed.
More formally, let
denote the client clusters, and let
denote the global causal backbone for cluster
k. We seek to solve
where
is the cluster index of client
i, and
denotes a personalized local objective that combines task prediction with structural regularization induced by the assigned backbone. Since raw data remain local, the main challenge is to infer a useful cluster-level structure from privacy-preserving latent summaries and to couple that structure to model training in a way that improves both external-site robustness and institution-specific adaptation.
4. Proposed Method
4.1. Overview
CarPe-FL consists of four main components: (i) server-side latent causal structure estimation, (ii) structure-aware client clustering, (iii) cluster-wise global causal backbone consensus, and (iv) structure-aligned personalized federated learning. The overall motivation is to move beyond shortcut removal in representation space and instead explicitly learn, aggregate, and exploit latent structural dependencies for personalized multi-center medical imaging. In this framework, the estimated latent graphs are treated as proxy structural objects that guide coordination across clients rather than as exact recovered causal truth.
Figure 1 summarizes the overall framework.
4.2. Server-Side Latent Causal Structure Estimation
Each client computes latent features using the current encoder and sends only low-order summary statistics to the server. Specifically, client
i computes
and this design keeps raw data local and shifts the more expensive structure search to the better-provisioned server. The server then estimates a local weighted adjacency matrix
from
using a NOTEARS-style score-based causal discovery objective [11]:
where
is the smooth acyclicity constraint and ∘ denotes the Hadamard product. We adopt this formulation because the differentiable acyclicity constraint enables scalable server-side optimization and yields sparse DAG-structured proxy graphs from latent statistics. At the same time, because the objective is applied to second-order summaries in latent space,
is interpreted as a practical structural surrogate rather than an exact recovered causal graph.
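To make the acyclicity term concrete, the following Python sketch evaluates the standard NOTEARS function h(W) = tr(exp(W ∘ W)) − d for a small weighted adjacency matrix. It is a minimal illustration under stated assumptions: the function name `acyclicity`, the truncated power series, and the number of series terms are illustrative choices, and the sketch does not reproduce the full server-side objective.

```python
def acyclicity(W, terms=20):
    """Evaluate h(W) = tr(exp(W * W)) - d, where * is the elementwise product.

    The matrix exponential is approximated by a truncated power series, which
    is adequate for the small, sparse matrices used in this illustration.
    """
    d = len(W)
    # Elementwise (Hadamard) square: every entry becomes non-negative.
    M = [[W[r][c] * W[r][c] for c in range(d)] for r in range(d)]
    # power holds M^k; it starts as the identity matrix (k = 0).
    power = [[1.0 if r == c else 0.0 for c in range(d)] for r in range(d)]
    trace_sum = float(d)  # contribution of the k = 0 term, tr(I) = d
    factorial = 1.0
    for k in range(1, terms + 1):
        power = [[sum(power[r][m] * M[m][c] for m in range(d)) for c in range(d)]
                 for r in range(d)]
        factorial *= k
        trace_sum += sum(power[r][r] for r in range(d)) / factorial
    return trace_sum - d


# A chain 0 -> 1 -> 2 is acyclic, so h is zero; a 2-cycle gives h > 0.
dag = [[0.0, 0.8, 0.0], [0.0, 0.0, -0.5], [0.0, 0.0, 0.0]]
cycle = [[0.0, 1.0], [1.0, 0.0]]
h_dag = acyclicity(dag)
h_cycle = acyclicity(cycle)
```

The value is zero exactly when the weighted graph has no directed cycle, which is why it can act as a smooth constraint during optimization.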
From
, we derive a binary adjacency matrix
where
denotes the indicator function and define edge-wise confidence scores
Here,
measures how strongly client
i supports the orientation
, and
measures the signed directional confidence.
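The thresholding and confidence steps can be sketched as follows. The binarization assumes a magnitude threshold `tau` applied to the weighted adjacency matrix. The two confidence functions are one plausible instantiation (the share of a node pair's total weight placed on each direction) and are illustrative assumptions, not necessarily the definitions used in CarPe-FL.

```python
def binarize(W, tau):
    """Keep an edge u -> v when |W[u][v]| exceeds tau; self-loops are dropped."""
    d = len(W)
    return [[1 if r != c and abs(W[r][c]) > tau else 0 for c in range(d)]
            for r in range(d)]


def directional_confidence(W, u, v, eps=1e-12):
    """Hypothetical confidence in u -> v: fraction of the pair's weight on that direction."""
    forward, backward = abs(W[u][v]), abs(W[v][u])
    return forward / (forward + backward + eps)


def signed_confidence(W, u, v):
    """Hypothetical signed score in [-1, 1]; positive values favour u -> v over v -> u."""
    return directional_confidence(W, u, v) - directional_confidence(W, v, u)


W = [[0.0, 0.9, 0.05], [0.1, 0.0, 0.0], [0.0, 0.0, 0.0]]
B = binarize(W, tau=0.3)
c01 = directional_confidence(W, 0, 1)
s01 = signed_confidence(W, 0, 1)
```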
4.3. Structure-Aware Client Clustering and Global Backbone Consensus
Because different hospitals may follow different latent mechanisms, CarPe-FL does not assume a single global causal graph shared by all clients. Instead, the server first clusters institutions according to structural similarity. In cross-silo medical federations, the number of institutions is typically modest, so directly comparing sparse adjacency patterns is computationally manageable and interpretable. We therefore use the structural Hamming distance (SHD)
to quantify the discrepancy between two local graphs. Based on the resulting SHD matrix, the server applies hierarchical clustering and obtains
.
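A minimal sketch of this clustering step is given below. It assumes binary adjacency matrices as input, an SHD variant that counts each ordered node pair on which two graphs disagree (so a reversed edge contributes two), and average-linkage agglomerative clustering with a fixed number of clusters. The linkage rule, the stopping criterion, and all function names are assumptions made for illustration.

```python
def shd(A, B):
    """Number of ordered pairs (u, v), u != v, on which two binary graphs disagree."""
    d = len(A)
    return sum(1 for u in range(d) for v in range(d)
               if u != v and A[u][v] != B[u][v])


def shd_matrix(graphs):
    """Pairwise SHD between all client graphs."""
    n = len(graphs)
    return [[shd(graphs[i], graphs[j]) for j in range(n)] for i in range(n)]


def agglomerative(dist, num_clusters):
    """Average-linkage agglomerative clustering on a precomputed distance matrix."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                link = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                link /= len(clusters[a]) * len(clusters[b])
                if best is None or link < best[0]:
                    best = (link, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return [sorted(c) for c in clusters]


g0 = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]  # 0 -> 1 -> 2
g1 = [[0, 1, 1], [0, 0, 1], [0, 0, 0]]  # g0 plus the edge 0 -> 2
g2 = [[0, 0, 0], [1, 0, 0], [1, 1, 0]]  # all edges of g1 reversed
D = shd_matrix([g0, g1, g2])
groups = agglomerative(D, num_clusters=2)
```

With only a handful of institutions, the quadratic number of pairwise comparisons stays small, which matches the cross-silo setting described above.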
Within each cluster
k, we construct a cluster-wise global causal backbone by aggregating directional confidence scores. Let
denote the reliability weight of client
i. The averaged directional confidence is defined as
To prevent cycles, we make the rank aggregation step explicit. For each client
, let
denote a topological rank of node
v induced by
. We then compute a weighted Borda-style score
and define the cluster-wise order
by sorting nodes in ascending
. The cluster-wise global backbone is then given by
This means that only edges with sufficient directional support and consistency with the aggregated topological order are retained. As a result, each cluster is associated with a sparse and acyclic latent causal backbone that captures the dominant mechanism shared by its member institutions.
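The consensus step described above can be sketched as follows. The sketch assumes that a node's topological rank is the length of the longest directed path ending at it, that Borda scores are reliability-weighted averages of these ranks, and that an edge is kept when its weighted average confidence reaches a threshold and it points forward in the aggregated order. The function names, the tie-breaking rule, and the toy confidence values are hypothetical.

```python
def topological_rank(adj):
    """Longest-path depth of each node in a DAG (adj[u][v] = 1 means u -> v)."""
    d = len(adj)
    indegree = [sum(adj[u][v] for u in range(d)) for v in range(d)]
    rank = [0] * d
    frontier = [v for v in range(d) if indegree[v] == 0]
    while frontier:  # Kahn-style traversal: a node is visited after all its parents
        u = frontier.pop()
        for v in range(d):
            if adj[u][v]:
                rank[v] = max(rank[v], rank[u] + 1)
                indegree[v] -= 1
                if indegree[v] == 0:
                    frontier.append(v)
    return rank


def borda_order(local_graphs, weights):
    """Sort nodes by weighted average rank (ascending); ties broken by node index."""
    d = len(local_graphs[0])
    total = float(sum(weights))
    score = [0.0] * d
    for adj, w in zip(local_graphs, weights):
        for v, r in enumerate(topological_rank(adj)):
            score[v] += w * r / total
    return sorted(range(d), key=lambda v: (score[v], v))


def cluster_backbone(local_graphs, confidences, weights, threshold):
    """Keep u -> v if it is well supported and consistent with the aggregated order."""
    d = len(local_graphs[0])
    total = float(sum(weights))
    position = {node: p for p, node in enumerate(borda_order(local_graphs, weights))}
    backbone = [[0] * d for _ in range(d)]
    for u in range(d):
        for v in range(d):
            if u == v:
                continue
            support = sum(w * c[u][v] for c, w in zip(confidences, weights)) / total
            if support >= threshold and position[u] < position[v]:
                backbone[u][v] = 1
    return backbone


graph_a = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
graph_b = [[0, 1, 1], [0, 0, 0], [0, 0, 0]]
conf_a = [[0, 0.9, 0.1], [0, 0, 0.8], [0.9, 0, 0]]
conf_b = [[0, 0.7, 0.6], [0, 0, 0.4], [0.9, 0, 0]]
order = borda_order([graph_a, graph_b], [1.0, 1.0])
backbone = cluster_backbone([graph_a, graph_b], [conf_a, conf_b], [1.0, 1.0], 0.5)
```

Because every retained edge points forward in a single total order, the resulting graph cannot contain a directed cycle; in the toy example, the strongly supported edge 2 → 0 is dropped for exactly this reason.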
4.4. Structure-Aligned Personalized Representation Learning
The key difference between CarPe-FL and prior federated causal structure learning methods is that the learned causal backbone is not merely an output graph but rather is directly injected into the representation learning pipeline. To this end, we introduce a graph-based latent refinement module on top of the image encoder.
For each client
, the initial latent factors
are converted into node features
, where each latent dimension corresponds to a graph node. We then perform directed message passing constrained by the cluster-wise global backbone. Specifically, the edge-wise personalization gate is defined as
where
. This gate combines global confidence and local confidence, allowing each client to adapt the importance of each causal edge while remaining anchored to the shared backbone.
We then apply a GAT-style directed propagation layer [12]:
where
denotes the attention coefficient at layer
t. The important point is that only edges included in the cluster-wise global backbone participate in message passing, which structurally blocks shortcut pathways that are unsupported by the consensus graph.
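The masking effect can be illustrated with a simplified propagation step. The sketch below restricts the attention softmax for each node to the parents retained in the backbone and then adds the attention-weighted parent features to the node's own features. It deliberately omits the learned projections, nonlinearity, and personalization gates of a full GAT-style layer; the raw attention scores are taken as given, and all names are hypothetical.

```python
import math


def masked_attention(scores, backbone, target):
    """Softmax over incoming edges u -> target that the backbone retains."""
    parents = [u for u in range(len(backbone)) if backbone[u][target]]
    if not parents:
        return {}
    peak = max(scores[u][target] for u in parents)  # subtract max for stability
    exps = {u: math.exp(scores[u][target] - peak) for u in parents}
    norm = sum(exps.values())
    return {u: e / norm for u, e in exps.items()}


def propagate(h, scores, backbone):
    """One residual message-passing step restricted to backbone edges."""
    width = len(h[0])
    out = []
    for v in range(len(h)):
        alpha = masked_attention(scores, backbone, v)
        message = [sum(a * h[u][k] for u, a in alpha.items()) for k in range(width)]
        out.append([h[v][k] + message[k] for k in range(width)])
    return out


h = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
backbone = [[0, 0, 1], [0, 0, 1], [0, 0, 0]]  # only 0 -> 2 and 1 -> 2 are allowed
scores = [[0.0, 100.0, 0.3], [0.0, 0.0, 0.3], [0.0, 0.0, 0.0]]
out = propagate(h, scores, backbone)
```

In this example the pair (0, 1) has a very large raw score but is absent from the backbone, so node 1 receives no message along it regardless of how strongly the attention mechanism would otherwise favor that edge.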
To stabilize the shared encoder, the server aggregates local parameters using a structure-aware masking rule:
where
denotes masking only the edge-indexed parameters of the graph refinement module according to the retained backbone. Backbone-agnostic encoder weights are still aggregated by the usual cluster-wise weighted average, whereas local prediction heads remain personalized and are not aggregated. This masking step aligns the structurally sensitive part of the shared model with the global backbone of its cluster.
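A minimal sketch of such a masked aggregation is shown below. It assumes that each client's shared parameters are split into a flat list of backbone-agnostic encoder weights and a d × d matrix of edge-indexed parameters, and that the mask is the binary backbone of the client's cluster. The dictionary layout and the function name are hypothetical, and prediction heads are intentionally left out because they stay local.

```python
def aggregate_cluster(client_params, client_weights, backbone):
    """Weighted average within one cluster, zeroing edge parameters off the backbone.

    client_params: list of dicts with keys "encoder" (flat list of floats)
                   and "edge" (d x d nested list of floats).
    """
    total = float(sum(client_weights))
    pairs = list(zip(client_params, client_weights))
    n_encoder = len(client_params[0]["encoder"])
    d = len(backbone)
    # Backbone-agnostic weights: plain weighted average.
    encoder = [sum(w * p["encoder"][k] for p, w in pairs) / total
               for k in range(n_encoder)]
    # Edge-indexed weights: weighted average multiplied by the backbone mask.
    edge = [[backbone[u][v] * sum(w * p["edge"][u][v] for p, w in pairs) / total
             for v in range(d)] for u in range(d)]
    return {"encoder": encoder, "edge": edge}


clients = [
    {"encoder": [0.0, 4.0], "edge": [[0.0, 2.0], [5.0, 0.0]]},
    {"encoder": [4.0, 0.0], "edge": [[0.0, 6.0], [1.0, 0.0]]},
]
merged = aggregate_cluster(clients, [1.0, 3.0], backbone=[[0, 1], [0, 0]])
```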
Each client finally applies a personalized prediction head
on top of the refined representation:
The local objective is defined as
The first term is the task loss, the second term is a proximal regularizer for stable federated optimization, and the third term discourages clients from reactivating edges that are excluded by the global backbone.
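The three terms can be sketched as follows, assuming a squared-distance proximal term scaled by a coefficient `mu` and an absolute-value penalty, scaled by `lam`, on gate values for edges outside the backbone. The exact functional forms, the argument names, and the flat parameter representation are assumptions made for illustration.

```python
def local_objective(task_loss, theta, theta_ref, gates, backbone, mu, lam):
    """Task loss + proximal term + penalty on gates of excluded edges."""
    # Proximal regularizer: keeps local parameters near the reference model.
    proximal = 0.5 * mu * sum((a - b) ** 2 for a, b in zip(theta, theta_ref))
    # Forbidden-edge penalty: total gate magnitude on edges absent from the backbone.
    d = len(backbone)
    forbidden = sum(abs(gates[u][v]) for u in range(d) for v in range(d)
                    if u != v and not backbone[u][v])
    return task_loss + proximal + lam * forbidden


loss = local_objective(
    task_loss=1.0,
    theta=[1.0, 2.0],
    theta_ref=[0.0, 2.0],
    gates=[[0.0, 0.9], [0.2, 0.0]],
    backbone=[[0, 1], [0, 0]],
    mu=0.5,
    lam=2.0,
)
```

Here the gate on the retained edge 0 → 1 is not penalized, while the gate of 0.2 on the excluded edge 1 → 0 contributes to the loss.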
4.5. Training Procedure
At each communication round, CarPe-FL alternates between structure estimation and representation learning. First, clients compute latent statistics and send them to the server. Second, the server estimates local latent proxy graphs, updates client clusters, and computes cluster-wise global backbones. Third, the server broadcasts the updated backbone-aware encoder to each cluster. Finally, each client performs local optimization using the structure-aligned graph module and its personalized head. This iterative procedure allows the structural summary and the predictive model to co-evolve over training, with structure estimation acting as a server-level outer loop over local representation learning.
Compared with prior methods such as FedSDR [7], which focus on shortcut discovery and removal in representation space, CarPe-FL explicitly models and aggregates latent structural dependencies. Compared with FedCSL [8], which mainly focuses on graph recovery, CarPe-FL further integrates the learned graph into personalized representation learning and downstream medical image prediction.
5. Simulation Settings
5.1. Datasets and Federated Protocol
We evaluate CarPe-FL on two medical imaging tasks: diabetic retinopathy (DR) grading and skin lesion classification. For the main experiments, we use three publicly available DR datasets, namely APTOS 2019 Blindness Detection [13], DDR [14], and DRD [15], and we treat each dataset as one medical institution (client). Following the clinically meaningful grouping of DR severity, we collapse the original five-level grading scheme (normal, mild, moderate, severe, and proliferative DR) into three classes: normal, non-proliferative DR (NPDR), and proliferative DR (PDR), where mild, moderate, and severe are merged into NPDR [16]. Sample images of the DR datasets are shown in Figure 2, and the final class distribution is summarized in Table 1.
To further examine whether the proposed framework generalizes beyond retinal imaging, we conduct supplementary external experiments on skin lesion classification using three public datasets: ISIC [17], HAM10000 [18], and DERM7pt [19]. Each dataset is again treated as a separate institution. To make the binary task definition explicit, we map melanoma (MEL), basal cell carcinoma (BCC), and actinic keratosis/intraepithelial carcinoma (AKIEC) to the malignant class, while melanocytic nevus (NV), benign keratosis (BKL), dermatofibroma (DF), and vascular lesions (VASC) are grouped as benign, following the source diagnostic taxonomies [17,18,19]. After grouping, we construct class-balanced evaluation subsets for the supplementary benchmark so that performance differences are not dominated by prevalence mismatch; the resulting class counts are shown in Table 2. Sample images are shown in Figure 3.
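The binary grouping stated above can be encoded directly. The sketch assumes that diagnoses are available as the upper-case abbreviations used in the text; datasets that use other label strings would need an additional translation step, and the names below are hypothetical.

```python
MALIGNANT_CODES = frozenset({"MEL", "BCC", "AKIEC"})
BENIGN_CODES = frozenset({"NV", "BKL", "DF", "VASC"})


def lesion_to_binary(code):
    """Return 1 for malignant and 0 for benign; reject codes outside the mapping."""
    label = code.strip().upper()
    if label in MALIGNANT_CODES:
        return 1
    if label in BENIGN_CODES:
        return 0
    raise ValueError(f"diagnosis code not covered by the binary mapping: {code!r}")


binary = {c: lesion_to_binary(c) for c in ["MEL", "NV", "akiec", "VASC"]}
```

Raising an error for unmapped codes, rather than assigning a default class, avoids silently mislabeling diagnoses that fall outside the seven listed categories.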
For all datasets, we construct train/validation/test splits using an ratio. In the federated setting, all clients participate in every communication round. For the main DR experiments, the three DR datasets form three clients. For the supplementary skin lesion experiments, the three skin datasets again form three clients. This setting reflects a realistic cross-silo medical federation, where each institution corresponds to a distinct acquisition environment and domain shift must be handled at the representation level.
5.2. Training Configuration
All images are resized to . Unless otherwise stated, all compared methods use the same image encoder backbone, the same input resolution, the same preprocessing pipeline, and the same optimizer configuration to ensure fair comparison. We optimize all models using AdamW with an initial learning rate of , and we apply cosine learning-rate decay. We train for 100 communication rounds with full client participation and use one local epoch per communication round. Early stopping is applied with a patience of five validation rounds.
For the main DR experiments, the batch size is set to 64. For the supplementary skin lesion experiments, the batch size is reduced to 16 due to the smaller effective dataset size. To isolate the effect of the learning framework itself, we do not introduce method-specific class-rebalancing losses in the main comparison; any preprocessing, augmentation, or sampling policy is kept identical across all methods. Unless otherwise stated, each experiment is repeated over five runs with different random seeds. For CarPe-FL, the latent representation dimension is set to , and the graph hidden dimension is set to . The structure-learning and personalization hyperparameters are fixed across all experiments as follows: sparsity coefficient , adjacency threshold , cluster-wise backbone confidence threshold , gate-balance coefficient , proximal regularization coefficient , and forbidden-edge penalty coefficient . The server re-estimates local latent causal graphs and refreshes the client clusters every 10 communication rounds.
5.3. Evaluation Metrics
For the three-class diabetic retinopathy task, we report macro-F1, balanced accuracy, and one-vs.-rest area under the receiver operating characteristic curve (AUROC). For the binary skin lesion task, we report AUROC, accuracy, sensitivity, specificity, and F1-score. Because our target scenario is multi-center medical imaging, we additionally report mean-site and worst-site performance in order to explicitly evaluate robustness under cross-site heterogeneity. Let $M_i$ denote a metric measured on client $i$. Then, the mean-site and worst-site scores are defined as
$$M_{\text{mean}} = \frac{1}{N} \sum_{i=1}^{N} M_i, \qquad M_{\text{worst}} = \min_{1 \le i \le N} M_i.$$
These metrics are important because a method with a strong average score can still fail under external-site evaluation if one or more clients are poorly served by the learned model.
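As a small worked example, the two summaries can be computed from per-site scores as follows. The sketch assumes a higher-is-better metric, so the worst site is the minimum; the site names and values are made up for illustration.

```python
def site_summary(per_site_scores):
    """Return (mean-site, worst-site) for a dict mapping site name to metric value."""
    values = list(per_site_scores.values())
    if not values:
        raise ValueError("at least one site is required")
    return sum(values) / len(values), min(values)


mean_site, worst_site = site_summary({"site_a": 0.80, "site_b": 0.60, "site_c": 0.70})
```

In this toy case the mean-site score is about 0.70 while the worst-site score is 0.60, which shows how an acceptable average can hide a poorly served institution.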
For the main comparison tables, we report the mean and standard deviation over five random seeds. We additionally summarize uncertainty using a 95% confidence interval (CI), which is computed as
where
and
s denote the sample mean and sample standard deviation across the five runs. For pairwise comparisons, we report two-sided paired t-tests over seed-matched runs, comparing CarPe-FL with the strongest competing baseline on the same metric.
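The following sketch shows one common way to compute these quantities for five seed-matched runs. It assumes a Student-t interval with four degrees of freedom (critical value 2.776); an interval based on the normal quantile 1.96 would be narrower, and the text does not determine which variant is intended. The scores are made up for illustration, and the sketch returns only the paired t statistic, since the p-value requires the t distribution (for example via a statistics library).

```python
import math
import statistics

T_CRIT_95_DF4 = 2.776  # two-sided 95% critical value of Student's t, 4 degrees of freedom


def ci95_five_runs(scores):
    """95% confidence interval for the mean of exactly five runs (Student-t)."""
    if len(scores) != 5:
        raise ValueError("this helper is specific to five runs")
    mean = statistics.mean(scores)
    half_width = T_CRIT_95_DF4 * statistics.stdev(scores) / math.sqrt(5)
    return mean - half_width, mean + half_width


def paired_t_statistic(a, b):
    """t statistic of the paired differences a[k] - b[k] over seed-matched runs."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))


method = [0.80, 0.82, 0.81, 0.79, 0.83]    # illustrative scores, one per seed
baseline = [0.78, 0.79, 0.80, 0.78, 0.80]  # same seeds, competing method
low, high = ci95_five_runs(method)
t_stat = paired_t_statistic(method, baseline)
```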
5.4. Baselines
We compare CarPe-FL against representative baselines from four categories. First, we include FedAvg [
1] as the standard global federated learning baseline, since it assumes a single shared model across all clients and provides the most common reference point for multi-center FL. We also include FedProx [
20], which extends FedAvg by adding a proximal regularization term to improve optimization stability under client heterogeneity.
Second, we consider personalized federated learning baselines. Ditto [
21] is included as a representative parameter-level personalization method that learns a personalized model for each client while still maintaining a global reference model. FedRep [
4] represents shared-representation learning, where a common encoder is learned across clients while the prediction head is personalized locally. FedAMP [
5] represents similarity-aware collaboration, where each client adaptively aggregates information from other clients according to their model similarity instead of relying on a single global average.
Third, to assess the role of causal shortcut handling, we compare against FedSDR [
7]. FedSDR is the most relevant baseline in terms of problem setting because it explicitly addresses shortcut-sensitive heterogeneous federated learning and learns personalized invariant representations by discovering and removing shortcut-dependent features. This baseline allows us to distinguish the benefit of explicit latent structure modeling from the benefit of shortcut-aware invariant learning alone.
Finally, to evaluate the isolated contribution of causal graph estimation, we include a FedCSL-inspired baseline [
8]. In this variant, latent causal graphs are estimated at the server, but the resulting structure is not injected into downstream personalized representation learning. This baseline is important because it separates the effect of graph recovery itself from the additional gains obtained by CarPe-FL through structure-aligned masking and edge-wise personalization.
For fairness, all methods use the same image encoder backbone, input resolution, optimizer, training schedule, and client participation setting whenever applicable. For methods that were not originally proposed for medical imaging, we adapt only the final prediction head to match the target task while preserving the main optimization principle of the original algorithm.
5.5. Implementation Details
Our implementation is based on Python 3.8 and PyTorch 2.4.1. All experiments are conducted on a server equipped with an AMD EPYC 7402 24-Core Processor, 256 GB RAM, and two NVIDIA RTX PRO 6000 96GB GPUs. Two GPUs are used to accelerate the structure-learning and federated training pipeline, although the full GPU memory capacity is not strictly required by the model. The complete implementation and experimental scripts will be released to facilitate reproducibility and future extensions.
6. Experiment Results
6.1. Main Results on Diabetic Retinopathy
Table 3 reports the main quantitative results on the three-center diabetic retinopathy benchmark. All values are summarized over five random seeds. Under this protocol, CarPe-FL achieves the strongest mean performance across all reported metrics. In particular, CarPe-FL shows the largest gain on the worst-site metric, indicating that the proposed framework is especially effective in reducing performance collapse under center-level heterogeneity. While strong personalized baselines such as Ditto [21], FedRep [4], and FedAMP [5] consistently improve over FedAvg [1], they remain limited when the dominant source of heterogeneity is mechanism-level rather than merely statistical. FedSDR [7] improves robustness by removing shortcut-dependent features, but it still lags behind CarPe-FL, suggesting that shortcut suppression alone is insufficient in multi-center medical imaging where distinct latent mechanisms may coexist.
Compared with FedAvg, CarPe-FL improves macro-F1 by 7.9 percentage points and worst-site performance by 9.0 percentage points in mean value. Relative to the strongest baseline FedSDR, the gain is more pronounced on worst-site performance than on mean-site performance, which is consistent with the intended role of structure-aware personalization for the most difficult sites. For CarPe-FL, the 95% CI is for macro-F1, for balanced accuracy, for AUROC, and for worst-site performance. The corresponding paired tests against FedSDR yield for macro-F1, for balanced accuracy, for AUROC, for mean-site, and for worst-site performance. This pattern supports the central design principle of CarPe-FL: instead of relying solely on shortcut removal, it explicitly models and clusters heterogeneous latent mechanisms, and then it enforces cluster-wise causal backbones during representation learning.
6.2. External Validation on Skin Lesion Classification
To assess whether the benefit of CarPe-FL extends beyond retinal imaging, we further evaluate the proposed method on a supplementary multi-center skin lesion benchmark.
Table 4 reports the corresponding results over five random seeds. CarPe-FL again achieves the best AUROC, accuracy, and worst-site performance. The gain over shortcut-aware and personalization-based baselines remains consistent, suggesting that the proposed design is not restricted to a single medical imaging modality.
These results suggest that the usefulness of CarPe-FL is not limited to one particular imaging task. Rather, when the dominant source of heterogeneity arises from site-specific latent mechanisms rather than from simple label imbalance, the explicit use of cluster-wise causal backbones can improve both robustness and personalization. For CarPe-FL, the 95% CI is for AUROC, for accuracy, and for worst-site performance. Relative to FedSDR, paired tests yield for AUROC, for accuracy, for sensitivity, for specificity, and for worst-site performance.
6.3. Do the Learned Causal Graphs Matter?
A central claim of CarPe-FL is that the learned cluster-wise causal backbones are not merely auxiliary artifacts but directly contribute to performance. To verify this, we compare CarPe-FL against a FedCSL-inspired baseline that estimates latent graphs but does not inject them into personalized representation learning, as well as against an ablated variant without structure masking. The comparison in Table 3 shows that graph estimation alone is helpful, but it does not fully explain the gains of the proposed framework. The full model consistently outperforms the graph-estimation-only baseline, indicating that the main advantage arises from the combination of structure learning and structure-aware training.
To further evaluate the learned backbones themselves, we quantify their internal reliability using structural statistics.
Table 5 reports the average edge confidence, bootstrap edge stability, and backbone sparsity for the cluster-wise graphs learned on the diabetic retinopathy benchmark. The resulting backbones are sparse and stable, which is desirable in the medical context where highly dense graphs are difficult to interpret and often reflect overfitting. At the same time, these quantities should be interpreted as evidence of structural reliability rather than as direct proof of clinical interpretability.
The relatively high bootstrap stability values suggest that the estimated backbones are not arbitrary outputs of the structure learner but rather reflect repeatable dependencies in the latent space. At the same time, the low edge ratios indicate that the consensus process retains only a compact subset of dependencies, which is consistent with the objective of identifying robust latent pathways rather than overfitting to site-specific correlations.
6.4. Do the Learned Representations Become More Structure-Aligned?
Beyond graph recovery, CarPe-FL also aims to learn representations that are less organized by site-specific shortcuts and more aligned with the downstream pathology signal. We assess this by quantifying how strongly the learned latent space is organized by site identity versus disease class. Let
z denote the final latent representation before the personalized prediction head. We define
where
and
denote the between-site and within-site scatter matrices computed on the full latent representation, and
and
are defined analogously for disease classes. A lower
value indicates weaker site-driven separation, whereas a higher
value indicates stronger class-driven separation.
Table 6 reports these two summary statistics.
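One way to compute such separation ratios is sketched below. It assumes that each ratio is the trace of the between-group scatter matrix divided by the trace of the within-group scatter matrix, evaluated once with site labels and once with class labels. The trace-ratio form, the function name, and the toy features are assumptions made for illustration.

```python
def scatter_ratio(features, groups):
    """tr(S_between) / tr(S_within) for feature vectors grouped by a label."""
    dim = len(features[0])
    n = len(features)
    overall = [sum(f[k] for f in features) / n for k in range(dim)]
    members = {}
    for f, g in zip(features, groups):
        members.setdefault(g, []).append(f)
    between = 0.0
    within = 0.0
    for rows in members.values():
        centroid = [sum(r[k] for r in rows) / len(rows) for k in range(dim)]
        # Trace of between-group scatter: size-weighted squared centroid offsets.
        between += len(rows) * sum((centroid[k] - overall[k]) ** 2 for k in range(dim))
        # Trace of within-group scatter: squared distances to the group centroid.
        within += sum((r[k] - centroid[k]) ** 2 for r in rows for k in range(dim))
    if within == 0.0:
        raise ValueError("within-group scatter is zero; the ratio is undefined")
    return between / within


z = [[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]]
site_ratio = scatter_ratio(z, ["a", "a", "b", "b"])
class_ratio = scatter_ratio(z, ["x", "y", "x", "y"])
```

In this toy layout the features separate mainly along the first axis, which coincides with site identity, so the site ratio (4.0) exceeds the class ratio (0.25); this is the pattern that a shortcut-dominated representation would show.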
The results show that CarPe-FL yields the lowest site-separation ratio while maintaining the strongest class-separation ratio. This supports the claim that the proposed method learns representations that are less dominated by site-specific shortcuts and more organized around pathology-relevant information. In other words, the gain of CarPe-FL is not purely an optimization effect; it is also associated with a measurable change in latent-space geometry. Relative to FedSDR, the reduction in is statistically significant (), while the gain in also remains significant at the 5% level ().
6.5. Ablation Study
Finally, we conduct an ablation study to isolate the contribution of each major component.
Table 7 compares the full CarPe-FL model against three reduced variants: (i) without structure-aware clustering, where all clients share a single global backbone, (ii) without structure masking, where the learned backbone is estimated but not injected into the encoder, and (iii) without edge-wise personalization, where all gates are fixed to one.
All three reductions lead to consistent performance degradation. Among them, removing clustering causes the largest loss on the worst-site metric, indicating that multi-center medical imaging is better characterized by multiple latent mechanisms than by a single shared structure. Removing structure masking also yields a notable drop, suggesting that structure estimation is not sufficient unless it actively constrains the encoder. Finally, removing edge-wise personalization reduces both mean and worst-site performance, showing that cluster-level sharing and client-level adaptation are complementary rather than competing design principles. In paired comparisons against the full model, all three ablations yield statistically significant degradation on worst-site performance.
Additional site-wise performance results, sensitivity analyses, and computational overhead analysis are reported in Appendix A (Tables A1–A8). Seed-wise raw scores for statistical robustness analysis are provided in Appendix B (Tables A9–A12).
7. Discussion
The empirical results indicate that CarPe-FL is particularly effective in the scenario for which it is designed: multi-center medical imaging, where the dominant source of heterogeneity is not merely label imbalance or finite-sample noise but rather structural differences in data generation. Under five-seed evaluation, the most consistent advantage of CarPe-FL appears on the worst-site metrics, suggesting that the proposed framework is especially useful when one or more institutions are substantially more difficult than the average site. In such settings, standard personalized federated learning methods, including FedRep [4] and Ditto [21], improve local adaptation but remain fundamentally agnostic to the origin of heterogeneity. Likewise, shortcut-aware approaches such as FedSDR [7] reduce dependence on unstable correlations, but they do so at the representation level without explicitly modeling cluster-specific structural mechanisms. In contrast, CarPe-FL introduces an intermediate structural layer between distributed data and downstream prediction, allowing the model to first identify latent dependency patterns, then align them across similar institutions, and finally inject them into representation learning.
A particularly important observation is that the gain of CarPe-FL over the strongest baseline is larger on worst-site performance than on average performance. From a medical imaging perspective, this is practically meaningful. In multi-center deployment, the main bottleneck is often not the average institution but the most outlying one, for example, a hospital with a different scanner vendor, a different reconstruction pipeline, or a less frequent pathology distribution. The stronger improvement on worst-site metrics therefore suggests that CarPe-FL is not merely improving central tendency but also reducing the extent to which any individual institution is underserved by the shared model.
The comparison between FedSDR and CarPe-FL is especially instructive. FedSDR addresses shortcut dependence by separating environment-sensitive and invariant features, which is a strong baseline when the primary challenge is shortcut overfitting. However, representation-level domain separations in medical imaging, such as scanner- or protocol-driven clusters, suggest that the underlying issue is often mechanism-level rather than merely shortcut-level. Our results are consistent with this interpretation: once such structural heterogeneity becomes dominant, a model that explicitly estimates and clusters latent structural dependencies has a natural advantage over a model that only suppresses shortcut features. This is why the gap between CarPe-FL and FedSDR is most visible on cross-site robustness metrics.
At the same time, this paper should be interpreted with several limitations in mind. First, the server-side structure estimation step relies on latent second-order statistics and a linear-Gaussian approximation in the latent space. While this offers a practical and scalable compromise, it may not fully capture more complex nonlinear medical imaging mechanisms. Second, the current clustering procedure is based on structural similarity alone, and future work may benefit from jointly modeling structure, task similarity, and clinical metadata. Third, although the learned backbones are sparse and stable, these structural statistics should be interpreted as evidence of internal reliability rather than as direct proof of clinical interpretability. Fourth, the experimental setting still involves a modest number of institutions and imaging tasks. While the additional skin-lesion benchmark improves the breadth of evaluation, broader validation across more centers and modalities would further strengthen the conclusions.
There is also a clear trade-off between robustness and computational cost. Because CarPe-FL adds server-side graph estimation and cluster-wise consensus on top of standard federated optimization, it incurs additional coordination overhead compared with purely representation-based baselines. However, this burden is concentrated on the server side, while the increase in client-side computation remains relatively modest, which is a realistic design choice for cross-silo medical federations. Overall, the current evidence supports the view that explicit structural management is a promising and practically meaningful direction for robust multi-center medical imaging while leaving room for stronger nonlinear modeling and broader validation in future work.
8. Conclusions
In this paper, we proposed CarPe-FL, a causal-aware personalized federated learning framework for multi-center medical imaging. Unlike conventional personalized federated learning methods that mainly address statistical heterogeneity in parameter or representation space, CarPe-FL explicitly models latent structural dependencies under server-side management, clusters institutions according to structural similarity, and learns cluster-wise global causal backbones that are directly injected into representation learning. Through structure-aligned masking and edge-wise personalization, the framework is designed to suppress site-specific shortcut pathways while preserving institution-specific predictive adaptation.
Across both the diabetic retinopathy benchmark and the supplementary skin-lesion benchmark, CarPe-FL shows consistent improvements over conventional federated baselines and recent shortcut-aware methods. The strongest advantage appears on worst-site performance, which is especially relevant for real multi-center deployment, where the practical limit of a model is often determined by the most difficult or outlying institution rather than by the average site. The additional statistical analyses further support that these gains are not limited to a single random seed.
Overall, our results indicate that robust multi-center medical imaging cannot be fully addressed by shortcut suppression alone. Instead, explicitly estimating, aggregating, and utilizing latent structural information provides a useful inductive bias for federated medical learning under mechanism-level heterogeneity. While stronger validation across larger federations and more nonlinear structural settings remains an important direction for future work, CarPe-FL represents a meaningful step toward more robust and structure-aware federated learning for multi-center clinical deployment.
Author Contributions
Conceptualization, W.S. and J.S.; methodology, W.S. and Z.S.; software, W.S. and G.O.; validation, W.S. and Z.S.; formal analysis, W.S.; investigation, W.S.; data curation, G.O.; writing—original draft preparation, W.S. and Z.S.; writing—review and editing, W.S., Z.S., G.O. and J.S.; visualization, W.S.; supervision, J.S.; project administration, W.S. and J.S. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable. This paper used only publicly available de-identified datasets and did not involve direct interaction with human participants.
Informed Consent Statement
Not applicable.
Data Availability Statement
The datasets analyzed in this study are publicly available and are cited in the manuscript. No new dataset was generated in this paper.
Conflicts of Interest
The authors declare no conflicts of interest.
Appendix A. Additional Results
Appendix A.1. Site-Wise Performance on Diabetic Retinopathy
To provide a more detailed view of cross-site behavior, Table A1 reports the site-wise macro-F1 scores on the diabetic retinopathy benchmark. CarPe-FL improves performance on all three institutions and yields the largest gain on APTOS, which is also the site associated with the lowest baseline performance. This supports the interpretation that CarPe-FL is especially beneficial in structurally challenging or data-limited environments.
Table A1. Site-wise macro-F1 on the diabetic retinopathy benchmark. The best result in each column is shown in bold. The upward arrow indicates that higher values are better.
| Method | APTOS ↑ | DDR ↑ | DRD ↑ |
|---|---|---|---|
| FedAvg [1] | 0.654 | 0.702 | 0.750 |
| FedProx [20] | 0.666 | 0.712 | 0.761 |
| Ditto [21] | 0.681 | 0.725 | 0.772 |
| FedRep [4] | 0.692 | 0.734 | 0.779 |
| FedAMP [5] | 0.698 | 0.741 | 0.784 |
| FedSDR [7] | 0.712 | 0.756 | 0.794 |
| FedCSL-inspired [8] | 0.705 | 0.749 | 0.787 |
| CarPe-FL | **0.744** | **0.781** | **0.818** |
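The site-wise values in Table A1 can be condensed into mean-site and worst-site summaries. The sketch below does this for FedSDR and CarPe-FL, assuming that the mean-site score is the unweighted average over sites and that the worst-site score is the minimum over sites. These definitions are assumptions made for illustration, and the helper name `summarize` is hypothetical.

```python
# Site-wise macro-F1 values copied from Table A1.
site_f1 = {
    "FedSDR": {"APTOS": 0.712, "DDR": 0.756, "DRD": 0.794},
    "CarPe-FL": {"APTOS": 0.744, "DDR": 0.781, "DRD": 0.818},
}


def summarize(scores):
    """Unweighted mean over sites and the lowest-scoring site (assumed definitions)."""
    worst_site = min(scores, key=scores.get)
    return {
        "mean": sum(scores.values()) / len(scores),
        "worst": scores[worst_site],
        "worst_site": worst_site,
    }


summary = {method: summarize(scores) for method, scores in site_f1.items()}

# Per-site gain of CarPe-FL over FedSDR, rounded to the table precision.
gains = {
    site: round(site_f1["CarPe-FL"][site] - site_f1["FedSDR"][site], 3)
    for site in site_f1["CarPe-FL"]
}
largest_gain_site = max(gains, key=gains.get)

print(summary)
print(gains, largest_gain_site)
```

Under these assumptions, APTOS is both the worst site for each method and the site with the largest gain, which matches the reading of Table A1 given above.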
Table A2 reports the corresponding site-wise AUROC values. The trend is consistent with the macro-F1 results, with CarPe-FL showing the strongest improvement on the site that is most difficult for the conventional federated baselines.
Table A2. Site-wise AUROC on the diabetic retinopathy benchmark. The best result in each column is shown in bold. The upward arrow indicates that higher values are better.
| Method | APTOS ↑ | DDR ↑ | DRD ↑ |
|---|---|---|---|
| FedAvg [1] | 0.831 | 0.861 | 0.894 |
| FedProx [20] | 0.842 | 0.871 | 0.901 |
| Ditto [21] | 0.851 | 0.878 | 0.907 |
| FedRep [4] | 0.858 | 0.884 | 0.912 |
| FedAMP [5] | 0.862 | 0.889 | 0.916 |
| FedSDR [7] | 0.882 | 0.901 | 0.920 |
| FedCSL-inspired [8] | 0.874 | 0.894 | 0.915 |
| CarPe-FL | **0.905** | **0.922** | **0.936** |
Appendix A.2. Site-Wise Performance on Skin Lesion Classification
Table A3 reports site-wise AUROC on the supplementary skin lesion benchmark. CarPe-FL again improves performance across all sites, with the strongest gain observed on the most difficult external dataset. This suggests that the advantage of CarPe-FL is not specific to diabetic retinopathy but also extends to a dermatological imaging scenario with substantial cross-dataset variation.
Table A3. Site-wise AUROC on the skin lesion benchmark. The best result in each column is shown in bold. The upward arrow indicates that higher values are better.
| Method | ISIC ↑ | HAM10000 ↑ | DERM7pt ↑ |
|---|---|---|---|
| FedAvg [1] | 0.821 | 0.856 | 0.861 |
| FedProx [20] | 0.829 | 0.864 | 0.869 |
| Ditto [21] | 0.836 | 0.871 | 0.876 |
| FedRep [4] | 0.842 | 0.876 | 0.884 |
| FedAMP [5] | 0.846 | 0.879 | 0.891 |
| FedSDR [7] | 0.865 | 0.889 | 0.895 |
| FedCSL-inspired [8] | 0.857 | 0.883 | 0.888 |
| CarPe-FL | **0.884** | **0.905** | **0.902** |
Appendix A.3. Additional Graph Analysis
To complement the graph statistics shown in the main text, Table A4 reports the pairwise structural Hamming distance (SHD) between the learned cluster-wise backbones. The relatively high SHD values suggest that different clusters indeed correspond to distinct latent causal mechanisms rather than merely small perturbations of the same graph.
Table A4. Pairwise SHD between the learned cluster-wise global backbones on the diabetic retinopathy benchmark.
| | Cluster 1 | Cluster 2 | Cluster 3 |
|---|---|---|---|
| Cluster 1 | 0 | 12 | 15 |
| Cluster 2 | 12 | 0 | 11 |
| Cluster 3 | 15 | 11 | 0 |
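For readers unfamiliar with the metric, the sketch below shows one common way to compute the structural Hamming distance between two directed graphs stored as 0/1 adjacency matrices. It assumes the convention in which each node pair whose edge state differs (missing, extra, or reversed edge) contributes one unit. The convention used to produce Table A4 is not stated in this appendix, and the toy graphs are hypothetical.

```python
def shd(a, b):
    """Structural Hamming distance between two 0/1 adjacency matrices.

    Assumed convention: every unordered node pair whose edge state differs
    (edge added, removed, or reversed) counts as one unit of distance.
    """
    d = len(a)
    dist = 0
    for i in range(d):
        for j in range(i + 1, d):
            if (a[i][j], a[j][i]) != (b[i][j], b[j][i]):
                dist += 1
    return dist


# Toy graphs on three nodes (illustrative only).
g1 = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]  # 0->1, 1->2
g2 = [[0, 0, 1], [1, 0, 1], [0, 0, 0]]  # 1->0 (reversed), 1->2, 0->2 (added)

distance = shd(g1, g2)
print(distance)
```

Here the reversed edge between nodes 0 and 1 and the added edge 0→2 each contribute one unit, while the shared edge 1→2 contributes none.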
We also measure the overlap ratio between cluster-specific backbones, which is defined as the fraction of shared edges relative to the union of their edge sets. Table A5 shows that the overlap remains limited, which further supports the multi-mechanism interpretation of the data.
Table A5. Pairwise edge overlap ratio between cluster-wise global backbones. Lower overlap indicates stronger structural diversity across clusters.
| | Cluster 1 | Cluster 2 | Cluster 3 |
|---|---|---|---|
| Cluster 1 | 1.00 | 0.32 | 0.27 |
| Cluster 2 | 0.32 | 1.00 | 0.35 |
| Cluster 3 | 0.27 | 0.35 | 1.00 |
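The overlap ratio described above, shared edges divided by the union of the two edge sets, corresponds to a Jaccard index over edges. A minimal sketch follows, assuming that directed edges with opposite orientation are treated as different edges and that two empty graphs overlap fully. Both choices are assumptions, and the helper names are hypothetical.

```python
def edge_set(adj):
    """Set of directed edges (i, j) present in a 0/1 adjacency matrix."""
    return {(i, j) for i, row in enumerate(adj) for j, v in enumerate(row) if v}


def overlap_ratio(a, b):
    """Shared edges divided by the union of both edge sets (Jaccard index)."""
    ea, eb = edge_set(a), edge_set(b)
    union = ea | eb
    return len(ea & eb) / len(union) if union else 1.0


# Toy graphs on three nodes (illustrative only).
g1 = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]  # 0->1, 1->2
g2 = [[0, 0, 1], [1, 0, 1], [0, 0, 0]]  # 1->0, 1->2, 0->2

overlap = overlap_ratio(g1, g2)
print(overlap)
```

The two toy graphs share one edge out of four distinct edges in total, giving a ratio of 0.25. A graph compared with itself gives 1.00, as on the diagonal of Table A5.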
Appendix A.4. Sensitivity Analysis
To analyze the robustness of the model with respect to the edge-wise personalization mechanism, Table A6 reports performance under different values of the gate-balance parameter. The best performance is achieved around a value of 0.50, suggesting that the most effective strategy is to balance global structural confidence and local edge confidence rather than relying exclusively on either one.
Table A6. Sensitivity analysis with respect to the gate-balance parameter on the diabetic retinopathy benchmark. The best result in each column is shown in bold. The upward arrow indicates that higher values are better.
| Gate-balance parameter | Macro-F1 ↑ | Worst-Site ↑ |
|---|---|---|
| 0.00 | 0.759 | 0.719 |
| 0.25 | 0.773 | 0.736 |
| 0.50 | **0.781** | **0.744** |
| 0.75 | 0.777 | 0.741 |
| 1.00 | 0.768 | 0.728 |
Table A7 reports the effect of the cluster refresh interval. A fully static clustering strategy underperforms the periodically updated one, while excessively frequent re-clustering provides little additional gain. This suggests that moderate structural refresh is sufficient to capture meaningful drift without destabilizing training.
Table A7. Sensitivity analysis with respect to the cluster refresh interval on the diabetic retinopathy benchmark. The best result in each column is shown in bold. The upward arrow indicates that higher values are better.
| Refresh Interval | Macro-F1 ↑ | Worst-Site ↑ |
|---|---|---|
| No refresh | 0.769 | 0.731 |
| Every 20 rounds | 0.774 | 0.738 |
| Every 10 rounds | **0.781** | **0.744** |
| Every 5 rounds | 0.779 | 0.742 |
Appendix A.5. Computational Overhead
Finally, Table A8 reports the computational overhead per communication round. As expected, CarPe-FL requires more server-side computation than conventional federated baselines due to latent graph estimation and cluster-wise consensus. Nevertheless, the client-side cost remains close to standard federated training, since the structure discovery step is centralized at the server. This makes CarPe-FL practical for cross-silo medical federations where servers are typically better provisioned than client institutions.
Table A8. Approximate computational overhead per communication round.
| Method | Server Time/Round (min) | Client Time/Round (min) | Extra Params | Peak VRAM/Client |
|---|---|---|---|---|
| FedAvg [1] | 0.3 | 0.8 | 0 | 9.6 GB |
| FedRep [4] | 0.4 | 0.8 | 0.11 M | 9.8 GB |
| FedAMP [5] | 0.6 | 0.9 | 0.12 M | 9.9 GB |
| FedSDR [7] | 0.7 | 1.0 | 0.38 M | 10.1 GB |
| FedCSL-inspired [8] | 2.7 | 0.8 | 0.12 M | 9.8 GB |
| CarPe-FL | 3.2 | 0.9 | 1.12 M | 10.4 GB |
Appendix B. Statistical Robustness and Seed-Wise Results
To complement the mean±standard-deviation summaries, confidence intervals, and paired significance tests reported in the main text, this appendix provides the seed-wise raw scores used for statistical robustness analysis. Because the principal pairwise comparisons in the main text evaluate CarPe-FL against the strongest competing baseline, we report seed-wise results for FedSDR and CarPe-FL on the two main benchmarks together with the seed-wise values used in the ablation study and the representation analysis.
Appendix B.1. Seed-Wise Raw Scores on the Diabetic Retinopathy Benchmark
Table A9 reports the seed-wise raw scores on the diabetic retinopathy benchmark for FedSDR and CarPe-FL. These values are the basis for the paired tests reported in the main text for macro-F1, balanced accuracy, AUROC, mean-site performance, and worst-site performance.
Table A9. Seed-wise raw scores on the diabetic retinopathy benchmark for FedSDR and CarPe-FL.
| Seed | Macro-F1 (FedSDR) | Macro-F1 (CarPe-FL) | Balanced Acc. (FedSDR) | Balanced Acc. (CarPe-FL) | AUROC (FedSDR) | AUROC (CarPe-FL) | Mean-Site (FedSDR) | Mean-Site (CarPe-FL) | Worst-Site (FedSDR) | Worst-Site (CarPe-FL) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.748 | 0.776 | 0.771 | 0.798 | 0.896 | 0.917 | 0.751 | 0.779 | 0.702 | 0.736 |
| 2 | 0.751 | 0.779 | 0.774 | 0.801 | 0.899 | 0.919 | 0.754 | 0.782 | 0.708 | 0.741 |
| 3 | 0.754 | 0.781 | 0.778 | 0.804 | 0.901 | 0.921 | 0.758 | 0.786 | 0.712 | 0.744 |
| 4 | 0.758 | 0.784 | 0.782 | 0.807 | 0.903 | 0.923 | 0.762 | 0.789 | 0.717 | 0.749 |
| 5 | 0.759 | 0.785 | 0.785 | 0.810 | 0.906 | 0.925 | 0.765 | 0.794 | 0.721 | 0.750 |
Across all five seeds, CarPe-FL consistently outperforms FedSDR on every reported metric. The improvement is largest on the worst-site metric, where the per-seed gain ranges from 0.029 to 0.034, which supports the claim that the proposed structure-aware personalization is particularly beneficial for the most difficult clients.
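As a worked illustration of how the seed-wise values can feed a paired comparison, the sketch below computes the per-seed worst-site differences from Table A9 and a paired t statistic. The specific test used in the main text is not restated in this appendix, so the paired t-test here is an assumption. The critical value 2.776 is the two-sided 5% threshold of Student's t distribution with four degrees of freedom.

```python
import math
import statistics

# Worst-site scores per seed, copied from Table A9.
fedsdr_worst = [0.702, 0.708, 0.712, 0.717, 0.721]
carpe_worst = [0.736, 0.741, 0.744, 0.749, 0.750]

# Paired differences: the same seed is used for both methods.
diffs = [c - f for c, f in zip(carpe_worst, fedsdr_worst)]
mean_diff = statistics.mean(diffs)
sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)

# Paired t statistic with n - 1 = 4 degrees of freedom.
t_stat = mean_diff / (sd_diff / math.sqrt(len(diffs)))

T_CRIT_DF4 = 2.776  # two-sided 5% critical value for 4 degrees of freedom
significant = abs(t_stat) > T_CRIT_DF4

print(round(mean_diff, 3), round(t_stat, 1), significant)
```

Because the per-seed gains are all positive and vary little around their mean of 0.032, the statistic exceeds the critical value by a wide margin. With only five seeds, such a result is best read as supporting evidence rather than a definitive test.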
Appendix B.2. Seed-Wise Raw Scores on the Skin-Lesion Benchmark
Table A10 reports the seed-wise raw scores on the supplementary skin-lesion benchmark for FedSDR and CarPe-FL. These values are used for the paired statistical comparisons reported in the main text.
Table A10. Seed-wise raw scores on the skin-lesion benchmark for FedSDR and CarPe-FL.
| Seed | AUROC (FedSDR) | AUROC (CarPe-FL) | Accuracy (FedSDR) | Accuracy (CarPe-FL) | Sensitivity (FedSDR) | Sensitivity (CarPe-FL) | Specificity (FedSDR) | Specificity (CarPe-FL) | Worst-Site (FedSDR) | Worst-Site (CarPe-FL) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.877 | 0.892 | 0.811 | 0.826 | 0.803 | 0.818 | 0.822 | 0.834 | 0.775 | 0.792 |
| 2 | 0.880 | 0.895 | 0.815 | 0.829 | 0.806 | 0.822 | 0.826 | 0.837 | 0.781 | 0.797 |
| 3 | 0.883 | 0.897 | 0.819 | 0.833 | 0.810 | 0.826 | 0.829 | 0.841 | 0.786 | 0.801 |
| 4 | 0.886 | 0.900 | 0.823 | 0.837 | 0.814 | 0.830 | 0.833 | 0.845 | 0.791 | 0.805 |
| 5 | 0.889 | 0.901 | 0.827 | 0.840 | 0.817 | 0.834 | 0.835 | 0.848 | 0.797 | 0.810 |
The seed-wise pattern on the skin-lesion benchmark is again consistent: CarPe-FL remains stronger than FedSDR across all five runs and on every reported metric, which supports the stability of the gains beyond a single task or modality.
Appendix B.3. Seed-Wise Raw Scores for the Ablation Study
Table A11 reports the seed-wise raw scores for the ablation study on the diabetic retinopathy benchmark. These values complement the mean±standard-deviation results in the main text and show that the relative ordering of the ablated variants is stable across seeds.
Table A11. Seed-wise raw scores for the ablation study on the diabetic retinopathy benchmark.
| Seed | Macro-F1 (w/o Clust.) | Macro-F1 (w/o Mask.) | Macro-F1 (w/o Gate) | Macro-F1 (Full) | AUROC (w/o Clust.) | AUROC (w/o Mask.) | AUROC (w/o Gate) | AUROC (Full) | Worst-Site (w/o Clust.) | Worst-Site (w/o Mask.) | Worst-Site (w/o Gate) | Worst-Site (Full) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.756 | 0.751 | 0.764 | 0.776 | 0.902 | 0.898 | 0.908 | 0.917 | 0.715 | 0.709 | 0.724 | 0.736 |
| 2 | 0.760 | 0.755 | 0.767 | 0.779 | 0.905 | 0.901 | 0.911 | 0.919 | 0.720 | 0.714 | 0.728 | 0.741 |
| 3 | 0.762 | 0.758 | 0.769 | 0.781 | 0.907 | 0.904 | 0.913 | 0.921 | 0.724 | 0.719 | 0.731 | 0.744 |
| 4 | 0.765 | 0.761 | 0.771 | 0.784 | 0.909 | 0.907 | 0.915 | 0.923 | 0.729 | 0.724 | 0.734 | 0.749 |
| 5 | 0.767 | 0.765 | 0.774 | 0.785 | 0.912 | 0.910 | 0.918 | 0.925 | 0.732 | 0.729 | 0.738 | 0.750 |
All three reduced variants consistently underperform the full model across the five seeds. The largest and most stable degradation appears when structure-aware clustering is removed, especially on the worst-site metric.
Appendix B.4. Seed-Wise Raw Scores for the Representation Analysis
Table A12 reports the seed-wise raw scores used for the representation analysis on the diabetic retinopathy benchmark. These results support the claim that CarPe-FL yields a latent space that is less organized by site identity and more aligned with disease classes.
Table A12. Seed-wise raw scores for the representation analysis on the diabetic retinopathy benchmark.
| Seed | Site-separation ratio (FedSDR) | Site-separation ratio (CarPe-FL) | Class-separation ratio (FedSDR) | Class-separation ratio (CarPe-FL) |
|---|---|---|---|---|
| 1 | 1.06 | 0.90 | 1.10 | 1.16 |
| 2 | 1.09 | 0.92 | 1.12 | 1.18 |
| 3 | 1.11 | 0.94 | 1.14 | 1.19 |
| 4 | 1.14 | 0.96 | 1.16 | 1.20 |
| 5 | 1.15 | 0.98 | 1.18 | 1.22 |
Across all five seeds, CarPe-FL produces lower site-separation values and higher class-separation values than FedSDR. This consistent pattern strengthens the interpretation that the gain of CarPe-FL is associated not only with optimization but also with a measurable change in latent-space organization.
References
1. McMahan, H.B.; Moore, E.; Ramage, D.; Hampson, S.; Arcas, B.A.y. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 20–22 April 2017; pp. 1273–1282.
2. Kwak, L.; Bai, H. The role of federated learning models in medical imaging. Radiol. Artif. Intell. 2023, 5, e230136.
3. Linardos, A.; Kushibar, K.; Walsh, S.; Gkontra, P.; Lekadir, K. Federated learning for multi-center imaging diagnostics: A simulation study in cardiovascular disease. Sci. Rep. 2022, 12, 3551.
4. Collins, L.; Hassani, H.; Mokhtari, A.; Shakkottai, S. Exploiting shared representations for personalized federated learning. In Proceedings of the 38th International Conference on Machine Learning, Virtual Event, 18–24 July 2021; pp. 2089–2099.
5. Huang, Y.; Chu, L.; Zhou, Z.; Wang, L.; Liu, J.; Pei, J.; Zhang, Y. Personalized cross-silo federated learning on non-IID data. Proc. AAAI Conf. Artif. Intell. 2021, 35, 7865–7873.
6. Shin, W.; Shin, J. FedGCD: Federated learning algorithm with GNN based community detection for heterogeneous data. J. Internet Comput. Serv. 2023, 24, 1–11.
7. Tang, X.; Guo, S.; Zhang, J.; Guo, J. Learning personalized causally invariant representations for heterogeneous federated clients. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024.
8. Guo, X.; Yu, K.; Liu, L.; Li, J. FedCSL: A scalable and accurate approach to federated causal structure learning. Proc. AAAI Conf. Artif. Intell. 2024, 38, 12235–12243.
9. Rame, A.; Dancette, C.; Cord, M. Fishr: Invariant gradient variances for out-of-distribution generalization. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 18347–18377.
10. Khemakhem, I.; Kingma, D.P.; Monti, R.P.; Hyvärinen, A. Variational autoencoders and nonlinear ICA: A unifying framework. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, Palermo, Italy, 26–28 August 2020; pp. 2207–2217.
11. Zheng, X.; Aragam, B.; Ravikumar, P.; Xing, E.P. DAGs with NO TEARS: Continuous optimization for structure learning. Adv. Neural Inf. Process. Syst. 2018, 31, 9472–9483.
12. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; Bengio, Y. Graph attention networks. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
13. Karthik, M.; Maggie, D.; Dane, S. APTOS 2019 Blindness Detection. 2019. Available online: https://kaggle.com/competitions/aptos2019-blindness-detection (accessed on 31 March 2026).
14. Li, T.; Gao, Y.; Wang, K.; Guo, S.; Liu, H.; Kang, H. Diagnostic assessment of deep learning algorithms for diabetic retinopathy screening. Inf. Sci. 2019, 501, 511–522.
15. Dugas, E.; Jared, J.; Cukierski, W. Diabetic Retinopathy Detection. 2015. Available online: https://kaggle.com/competitions/diabetic-retinopathy-detection (accessed on 31 March 2026).
16. Cleland, C. Comparing the International Clinical Diabetic Retinopathy (ICDR) severity scale. Community Eye Health 2023, 36, 10.
17. Codella, N.; Rotemberg, V.; Tschandl, P.; Celebi, M.E.; Dusza, S.; Gutman, D.; Helba, B.; Kalloo, A.; Liopyris, K.; Marchetti, M.; et al. Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the International Skin Imaging Collaboration (ISIC). arXiv 2019, arXiv:1902.03368.
18. Tschandl, P.; Rosendahl, C.; Kittler, H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci. Data 2018, 5, 180161.
19. Kawahara, J.; Daneshvar, S.; Argenziano, G.; Hamarneh, G. Seven-point checklist and skin lesion classification using multitask multimodal neural networks. IEEE J. Biomed. Health Inform. 2019, 23, 538–546.
20. Li, T.; Sahu, A.K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; Smith, V. Federated optimization in heterogeneous networks. Proc. Mach. Learn. Syst. 2020, 2, 429–450.
21. Li, T.; Hu, S.; Beirami, A.; Smith, V. Ditto: Fair and robust federated learning through personalization. In Proceedings of the 38th International Conference on Machine Learning, Virtual Event, 18–24 July 2021; pp. 6357–6368.