Causal Representation-Based Personalized Federated Learning with Causal Graph Consensus for Medical Imaging

Shin, Wooseok; Shen, Zhiqiang; Oh, Gyutae; Shin, Jitae

doi:10.3390/electronics15101983


by Wooseok Shin ¹, Zhiqiang Shen ², Gyutae Oh ¹ and Jitae Shin ¹,*

¹ Department of Electrical and Computer Engineering, Sungkyunkwan University, Suwon 16419, Republic of Korea

² Department of Computer Science and Engineering, Sungkyunkwan University, Suwon 16419, Republic of Korea

* Author to whom correspondence should be addressed.

Electronics 2026, 15(10), 1983; https://doi.org/10.3390/electronics15101983

Submission received: 5 April 2026 / Revised: 26 April 2026 / Accepted: 3 May 2026 / Published: 7 May 2026

(This article belongs to the Special Issue AI-Driven Medical Image/Video Processing)


Abstract

Medical image federated learning has emerged as a practical solution for multi-center collaboration without centralizing sensitive data. However, the dominant source of heterogeneity in medical imaging is often not merely at the statistical level but also at the mechanism level, arising from scanner vendors, acquisition protocols, reconstruction pipelines, and annotation styles. Such heterogeneity encourages models to rely on site-specific shortcuts rather than pathology-relevant signals, which leads to poor external-site generalization. To address this problem, we propose CarPe-FL, which is a causal representation-based personalized federated learning framework for medical imaging. CarPe-FL maps images into a latent factor space, estimates client-specific latent causal structures under server-side management, clusters institutions according to structural similarity, and constructs cluster-wise global causal backbones. These backbones are then injected into federated representation learning through structure-aligned masking and edge-wise personalization, while personalized heads capture institution-specific prediction behavior. In this way, CarPe-FL aims to suppress shortcut-dependent pathways while preserving clinically meaningful local adaptation. The proposed framework is expected to provide a principled solution for robust, personalized, and interpretable federated learning in multi-center medical imaging.

Keywords:

federated learning; multi-center medical imaging; personalized learning; causal structure learning; out-of-distribution generalization

1. Introduction

Federated learning (FL) has become an important paradigm for medical image analysis because clinically valuable imaging data are naturally distributed across hospitals and cannot be easily centralized due to legal, ethical, and practical constraints [1,2]. In multi-center medical imaging, however, the main challenge is not simply that local data are non-IID. Instead, hospitals often differ in scanner vendors, acquisition protocols, reconstruction pipelines, and annotation styles, which induces mechanism-level variation in the data generation process [3]. As a result, a federated model can easily exploit institution-specific shortcuts that appear predictive within one site but fail under external site evaluation.

A large body of work has attempted to address heterogeneity through personalized federated learning. Methods such as FedRep learn a shared representation with local prediction heads, while methods such as FedAMP and FedGCD adapt collaboration according to client similarity or graph-based grouping [4,5,6]. These approaches improve local adaptation, but their notion of heterogeneity is primarily statistical. They do not explicitly distinguish stable structural dependencies from site-specific shortcut correlations, which is particularly problematic in medical imaging where scanner- and protocol-induced biases are often entangled with pathology-relevant signals.

More recently, causal and invariant learning methods have been introduced into federated settings. FedSDR is a representative example that discovers shortcut-sensitive features across clients and then learns personalized invariant representations that are less dependent on these shortcuts [7]. In parallel, federated causal structure learning methods such as FedCSL focus on estimating a global causal graph from decentralized data [8]. While these directions are highly relevant, each addresses only part of the problem. Shortcut-aware invariant learning improves robustness but does not explicitly model multiple site-specific latent mechanisms, whereas federated causal graph recovery estimates structure but does not directly integrate the learned structure into personalized representation learning.

In this paper, we propose a framework for multi-center medical imaging called CarPe-FL, which stands for causal representation-based personalized federated learning. CarPe-FL is motivated by the observation that hospitals may share only partially overlapping but structurally similar latent mechanisms. Instead of forcing all institutions into a single global graph, our framework estimates client-specific latent proxy graphs under server-side management, clusters institutions according to structural similarity, and constructs cluster-wise global causal backbones. These backbones are then injected into the representation learning process through structure-aligned masking and edge-wise personalization. In contrast to FedSDR, our goal is not limited to shortcut discovery and removal in representation space; in contrast to FedCSL, our goal is not limited to graph recovery itself. Rather, CarPe-FL uses cluster-wise structural backbones as a coordination prior for personalized federated representation learning in heterogeneous medical imaging.

The main contributions of this paper are as follows. First, we introduce a server-side latent structural learning scheme that estimates client-specific proxy graphs from latent statistics and aggregates them into cluster-wise global causal backbones. Second, we propose a structure-aligned personalized federated learning procedure in which the learned backbones constrain representation sharing while edge-wise gates and local heads preserve institution-specific adaptation. Third, we tailor the resulting framework to multi-center medical imaging, where external-site robustness and shortcut suppression are clinically critical. Through this design, CarPe-FL provides a unified structural-personalized perspective on federated medical image learning.

2. Related Work

2.1. Personalized and Clustered Federated Learning

Personalized federated learning (PFL) has been widely studied to address client heterogeneity by separating the shared and client-specific components of a model. A representative example is FedRep, which learns a common representation while keeping local heads personalized, showing that representation sharing can still be effective when heterogeneity is concentrated mainly in the label space [4]. Beyond head-level personalization, several studies have explored client grouping or similarity-aware collaboration. FedAMP, for instance, adaptively aggregates information from similar clients instead of enforcing a single global model across all participants [5]. In a related direction, FedGCD models clients as nodes in a graph and applies GNN-based community detection to identify groups of clients with similar data distributions [6]. These methods improve collaboration under statistical heterogeneity, but their grouping criteria are still based on data or model similarity rather than on the consistency of underlying causal mechanisms. Consequently, they do not explicitly distinguish stable causal relations from site-specific shortcuts.

2.2. Causal and Invariant Representation Learning

Another line of work focuses on learning invariant representations under distribution shift. Fishr, for example, enforces the invariance of gradient variances across environments and improves out-of-distribution generalization without explicitly learning a causal graph [9]. In the federated setting, Tang et al. proposed FedSDR, which formulates structural causal models for heterogeneous federated clients and performs collaborative shortcut discovery followed by shortcut-aware personalized invariant learning [7]. These studies highlight the importance of separating stable and unstable features, but they mainly operate through regularization or shortcut removal in representation space, and they do not explicitly estimate, compare, or aggregate client-specific causal graphs. More broadly, work on identifiable latent-variable modeling has shown that semantically meaningful latent factors generally require auxiliary information or additional assumptions, which suggests that latent causal structures in deep representation spaces should be treated as a modeling prior rather than as an automatically identifiable object [10]. This makes the direct use of latent structural backbones in federated personalization both promising and methodologically nontrivial.

2.3. Federated Causal Structure Learning

Federated causal discovery has recently emerged as a direct way to estimate causal relations from decentralized data. FedCSL is a representative study in this direction, which first learns local causal neighbors and then constructs a weighted global skeleton and orientation for scalable federated causal structure learning [8]. This line of work is highly relevant because it introduces explicit causal graph estimation into the federated setting. Nevertheless, its main focus is the recovery of a single global causal graph, and the learned structure is not directly integrated into downstream representation learning or client-specific prediction. In other words, existing federated causal structure learning methods are strong at graph estimation, but they do not address how learned structures should shape personalized model training when clients may follow multiple related mechanisms.

2.4. Federated Learning in Medical Imaging

Medical imaging is one of the most important application domains of federated learning because clinically valuable data are naturally distributed across hospitals and cannot be easily centralized. Recent reviews have emphasized that multi-center medical imaging is affected by severe domain shifts induced by scanner vendors, acquisition protocols, reconstruction pipelines, and annotation styles [2]. Empirical studies have further shown that federated training can be competitive with centralized learning while remaining highly sensitive to center-level heterogeneity and external-site evaluation gaps [3]. These observations suggest that medical imaging heterogeneity should not be treated as mere statistical noise but rather as a form of mechanism-level variation. This creates a clear gap for methods that can jointly model causal structure, cluster site-specific mechanisms, and personalize representations for robust cross-site generalization.

Overall, the existing studies address only part of the problem: PFL methods improve local adaptation without causal structure, invariant learning methods suppress shortcuts without explicit structural consensus, and federated causal discovery methods recover graphs without integrating them into downstream personalized learning. This paper is motivated by the need to unify these three aspects for multi-center medical imaging.

3. Problem Definition

3.1. Multi-Center Federated Medical Imaging Setting

We consider a cross-silo federated learning setting with $N$ medical institutions (clients), which are indexed by $i \in \{1, \dots, N\}$. Each institution locally stores a medical imaging dataset

$$D_{i} = \{(x_{i,j}, y_{i,j})\}_{j=1}^{n_{i}},$$

where $x_{i,j} \in \mathbb{R}^{H \times W \times C}$ denotes an input image and $y_{i,j}$ denotes the downstream target, such as a disease label or a segmentation mask. Raw images and annotations are not shared across institutions.

A key challenge in federated medical imaging is that the domain gap across institutions is not merely a mild statistical discrepancy but is often induced by mechanism-level differences, including scanner vendors, acquisition protocols, reconstruction pipelines, and annotation styles [2,3]. As a result, a model trained with conventional federated averaging may easily exploit hospital-specific shortcuts rather than pathology-relevant signals, leading to severe degradation under external-site validation.

To model this issue, we assume that each image is mapped to a latent representation

$$z_{i,j} = f_{\theta}(x_{i,j}) \in \mathbb{R}^{d},$$

where $f_{\theta}$ is a shared image encoder and $z_{i,j} = [Z_{1}, \dots, Z_{d}]$ is interpreted as a set of latent factors. We assume that within each institution or a group of structurally similar institutions, these latent factors admit a sparse structural causal model

$$Z_{v} = g_{v}\big(Z_{\mathrm{PA}_{G_{i}}(v)}, \varepsilon_{v}\big), \qquad \varepsilon_{u} \perp \varepsilon_{v} \ \text{for} \ u \neq v,$$

where $G_{i} = (V, E_{i})$ is a directed acyclic graph (DAG) over the latent variables $V = \{Z_{1}, \dots, Z_{d}\}$. As in latent-variable modeling more broadly, we do not assume that the exact ground-truth latent graph is generally identifiable from deep representations without additional side information or stronger assumptions [10]. Instead, our aim is to estimate stable proxy structures that summarize dominant dependency patterns and are useful for clustering and regularization. The objective is therefore to estimate client-level proxy graphs and cluster-level backbones in the latent space and to inject them into personalized federated representation learning.

3.2. Learning Objective

The goal of CarPe-FL is threefold. First, we aim to estimate client-specific latent proxy structures under server-side management. Second, we aim to group institutions with structurally similar latent mechanisms and to construct a sparse cluster-wise global causal backbone for each group. Third, we aim to use these backbones to constrain and personalize representation learning, such that stable pathways are emphasized while site-specific shortcuts are suppressed.

More formally, let $C_{1}, \dots, C_{K^{\star}}$ denote the client clusters, and let $A_{k}^{\mathrm{glob}}$ denote the global causal backbone for cluster $k$. We seek to solve

$$\min_{\{\theta_{i}, \phi_{i}\}, \{A_{k}^{\mathrm{glob}}\}, \{C_{k}\}} \ \sum_{i=1}^{N} L_{i}\big(\theta_{i}, \phi_{i}; A_{c(i)}^{\mathrm{glob}}\big),$$

where $c(i)$ is the cluster index of client $i$, and $L_{i}$ denotes a personalized local objective that combines task prediction with structural regularization induced by the assigned backbone. Since raw data remain local, the main challenge is to infer a useful cluster-level structure from privacy-preserving latent summaries and to couple that structure to model training in a way that improves both external-site robustness and institution-specific adaptation.

4. Proposed Method

4.1. Overview

CarPe-FL consists of four main components: (i) server-side latent causal structure estimation, (ii) structure-aware client clustering, (iii) cluster-wise global causal backbone consensus, and (iv) structure-aligned personalized federated learning. The overall motivation is to move beyond shortcut removal in representation space and instead explicitly learn, aggregate, and exploit latent structural dependencies for personalized multi-center medical imaging. In this framework, the estimated latent graphs are treated as proxy structural objects that guide coordination across clients rather than as exact recovered causal truth.

Figure 1 summarizes the overall framework.

4.2. Server-Side Latent Causal Structure Estimation

Each client computes latent features using the current encoder and sends only low-order summary statistics to the server. Specifically, client $i$ computes

$$\mu_{i} = \frac{1}{n_{i}} \sum_{j=1}^{n_{i}} z_{i,j}, \qquad \Sigma_{i} = \frac{1}{n_{i}} \sum_{j=1}^{n_{i}} (z_{i,j} - \mu_{i})(z_{i,j} - \mu_{i})^{\top}.$$

This design keeps raw data local and shifts the more expensive structure search to the better-provisioned server. The server then estimates a local weighted adjacency matrix $W_{i} \in \mathbb{R}^{d \times d}$ from $\Sigma_{i}$ using a NOTEARS-style score-based causal discovery objective [11]:

$$\min_{W_{i}} \ \frac{1}{2} \mathrm{Tr}\big((I - W_{i})^{\top} \Sigma_{i} (I - W_{i})\big) + \lambda_{1} \|W_{i}\|_{1} \quad \text{s.t.} \quad h(W_{i}) = 0, \ (W_{i})_{vv} = 0,$$

where $h(W_{i}) = \mathrm{Tr}(\exp(W_{i} \circ W_{i})) - d$ is the smooth acyclicity constraint and $\circ$ denotes the Hadamard product. We adopt this formulation because the differentiable acyclicity constraint enables scalable server-side optimization and yields sparse DAG-structured proxy graphs from latent statistics. At the same time, because the objective is applied to second-order summaries in latent space, $W_{i}$ is interpreted as a practical structural surrogate rather than an exact recovered causal graph.
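As a rough illustration of the quantities in this subsection, the following NumPy sketch computes the client summaries $\mu_i$ and $\Sigma_i$ and evaluates the NOTEARS-style score and the acyclicity term for a candidate $W_i$. The function names are hypothetical, the trace of the matrix exponential is obtained from eigenvalues for brevity, and the constrained optimizer itself (for example an augmented Lagrangian loop) is deliberately omitted.

```python
import numpy as np

def client_summary(z):
    """Mean and 1/n-normalised covariance of latent features z with shape (n_i, d)."""
    mu = z.mean(axis=0)
    centered = z - mu
    sigma = centered.T @ centered / z.shape[0]
    return mu, sigma

def acyclicity(w):
    """h(W) = Tr(exp(W o W)) - d, which is zero when W encodes a DAG."""
    # Tr(exp(M)) equals the sum of exp(eigenvalue) over the eigenvalues of M.
    eigvals = np.linalg.eigvals(w * w)
    return float(np.real(np.sum(np.exp(eigvals)))) - w.shape[0]

def notears_score(w, sigma, lam1):
    """0.5 * Tr((I - W)^T Sigma (I - W)) + lam1 * ||W||_1; constraints are not enforced here."""
    r = np.eye(w.shape[0]) - w
    return 0.5 * np.trace(r.T @ sigma @ r) + lam1 * np.abs(w).sum()
```

Only `mu` and `sigma` would leave the client in this reading; the score and constraint would be evaluated on the server.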

From $W_{i}$, we derive a binary adjacency matrix

$$A_{i} = \mathbb{I}(|W_{i}| > \tau_{w}),$$

where $\mathbb{I}(\cdot)$ denotes the indicator function applied entry-wise, and we define edge-wise confidence scores

$$p_{u \to v}^{(i)} = \sigma\!\left(\frac{W_{i}[u, v]}{s_{w}}\right), \qquad C_{u \to v}^{(i)} = p_{u \to v}^{(i)} - p_{v \to u}^{(i)}.$$

Here, $p_{u \to v}^{(i)}$ measures how strongly client $i$ supports the orientation $u \to v$, and $C_{u \to v}^{(i)} \in [-1, 1]$ measures the signed directional confidence.
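The thresholding and confidence scores can be sketched in a few lines. This is a minimal illustration that assumes $\sigma$ is the logistic sigmoid and that $s_w$ is a positive scale constant; neither is specified further in the text, and the function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def binarize(w, tau_w):
    """A_i = 1[|W_i| > tau_w], applied entry-wise."""
    return (np.abs(w) > tau_w).astype(int)

def directional_confidence(w, s_w):
    """Return p with p[u, v] = sigmoid(W[u, v] / s_w) and C = p - p^T."""
    p = sigmoid(w / s_w)  # assumption: sigma is the logistic sigmoid
    return p, p - p.T
```

Because $C$ is the difference of a matrix and its transpose, it is antisymmetric: support for $u \to v$ is exactly the negative of support for $v \to u$.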

4.3. Structure-Aware Client Clustering and Global Backbone Consensus

Because different hospitals may follow different latent mechanisms, CarPe-FL does not assume a single global causal graph shared by all clients. Instead, the server first clusters institutions according to structural similarity. In cross-silo medical federations, the number of institutions is typically modest, so directly comparing sparse adjacency patterns is computationally manageable and interpretable. We therefore use the structural Hamming distance (SHD)

$$\mathrm{SHD}(A_{i}, A_{j}) = \sum_{u \neq v} \mathbb{1}\big[A_{i}[u, v] \neq A_{j}[u, v]\big]$$

to quantify the discrepancy between two local graphs. Based on the resulting SHD matrix, the server applies hierarchical clustering and obtains $\{C_{k}\}_{k=1}^{K^{\star}}$.
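A small sketch of this step is given below. The SHD follows the definition above; the clustering routine is an assumption, since the text does not state the linkage or the stopping rule. Here average linkage is used and merging stops once the closest pair of clusters is farther apart than a hypothetical `max_dist`.

```python
import numpy as np

def shd(a, b):
    """Count off-diagonal entries where two binary adjacency matrices disagree."""
    off_diag = ~np.eye(a.shape[0], dtype=bool)
    return int(np.sum((a != b) & off_diag))

def shd_matrix(graphs):
    n = len(graphs)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = shd(graphs[i], graphs[j])
    return dist

def agglomerative_clusters(dist, max_dist):
    """Average-linkage merging until the closest pair exceeds max_dist (assumed rule)."""
    clusters = [[i] for i in range(dist.shape[0])]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.mean([dist[i, j] for i in clusters[a] for j in clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > max_dist:
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]
```

With only a handful of institutions, the quadratic pair search is negligible compared with structure estimation.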

Within each cluster $k$, we construct a cluster-wise global causal backbone by aggregating directional confidence scores. Let $w_{i} \propto n_{i}$ denote the reliability weight of client $i$. The averaged directional confidence is defined as

$$\bar{C}_{u \to v}^{(k)} = \frac{\sum_{i \in C_{k}} w_{i} C_{u \to v}^{(i)}}{\sum_{i \in C_{k}} w_{i}}.$$

To prevent cycles, we make the rank aggregation step explicit. For each client $i \in C_{k}$, let $r_{i}(v)$ denote a topological rank of node $v$ induced by $A_{i}$. We then compute a weighted Borda-style score

$$s_{v}^{(k)} = \sum_{i \in C_{k}} w_{i} r_{i}(v),$$

and define the cluster-wise order $\pi_{k}^{*}$ by sorting nodes in ascending $s_{v}^{(k)}$. The cluster-wise global backbone is then given by

$$A_{k}^{\mathrm{glob}}[u, v] = 1 \ \Leftrightarrow \ \bar{C}_{u \to v}^{(k)} \geq \tau_{k} \ \land \ \pi_{k}^{*}(u) < \pi_{k}^{*}(v).$$

This means that only edges with sufficient directional support and consistency with the aggregated topological order are retained. As a result, each cluster is associated with a sparse and acyclic latent causal backbone that captures the dominant mechanism shared by its member institutions.
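The consensus rule can be illustrated as follows. Two choices in this sketch are assumptions rather than details from the text: the topological rank $r_i(v)$ is taken to be the longest-path depth of $v$ in $A_i$, and ties in the Borda-style score are broken by node index. Function names are hypothetical, and each `adj` is expected to be acyclic with `adj[u, v] = 1` meaning $u \to v$.

```python
import numpy as np

def topological_rank(adj):
    """Longest-path depth of every node in a DAG (assumed definition of r_i(v))."""
    d = adj.shape[0]
    rank = np.zeros(d, dtype=int)
    indeg = adj.sum(axis=0).astype(int)
    frontier = [v for v in range(d) if indeg[v] == 0]
    while frontier:
        nxt = []
        for u in frontier:
            for v in np.flatnonzero(adj[u]):
                rank[v] = max(rank[v], rank[u] + 1)
                indeg[v] -= 1
                if indeg[v] == 0:
                    nxt.append(v)
        frontier = nxt
    return rank

def consensus_backbone(conf_list, adj_list, weights, tau_k):
    """Weighted mean confidence, Borda-style order, then thresholded order-consistent edges."""
    w = np.asarray(weights, dtype=float)
    c_bar = np.tensordot(w, np.stack(conf_list), axes=1) / w.sum()
    scores = sum(wi * topological_rank(a) for wi, a in zip(w, adj_list))
    order = np.argsort(scores, kind="stable")   # ascending score, ties by index
    pos = np.empty(len(order), dtype=int)
    pos[order] = np.arange(len(order))          # pos[v] = position of v in the order
    backbone = (c_bar >= tau_k) & (pos[:, None] < pos[None, :])
    return c_bar, pos, backbone
```

Since every retained edge points from an earlier to a later position in one fixed order, the returned backbone cannot contain a cycle.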

4.4. Structure-Aligned Personalized Representation Learning

The key difference between CarPe-FL and prior federated causal structure learning methods is that the learned causal backbone is not merely an output graph but rather is directly injected into the representation learning pipeline. To this end, we introduce a graph-based latent refinement module on top of the image encoder.

For each client $i \in C_{k}$, the initial latent factors $z_{i,j}$ are converted into node features $H_{i,j}^{(0)} \in \mathbb{R}^{d \times d_{h}}$, where each latent dimension corresponds to a graph node. We then perform directed message passing constrained by the cluster-wise global backbone. Specifically, the edge-wise personalization gate is defined as

$$\alpha_{u \to v}^{(i)} = \big[\bar{C}_{u \to v}^{(k)}\big]_{+}^{\beta} \, \big[p_{u \to v}^{(i)}\big]_{+}^{1 - \beta}, \qquad 0 \leq \beta \leq 1,$$

where $[\cdot]_{+} = \max(\cdot, 0)$. This gate combines global confidence and local confidence, allowing each client to adapt the importance of each causal edge while remaining anchored to the shared backbone.

We then apply a GAT-style directed propagation layer [12]:

$$h_{v}^{(t+1)} = \sigma\Big(W_{0} h_{v}^{(t)} + \sum_{u : A_{k}^{\mathrm{glob}}[u, v] = 1} \alpha_{u \to v}^{(i)} \, a_{uv}^{(t)} \, W_{1} h_{u}^{(t)}\Big),$$

where $a_{uv}^{(t)}$ denotes the attention coefficient at layer $t$. The important point is that only edges included in the cluster-wise global backbone participate in message passing, which structurally blocks shortcut pathways that are unsupported by the consensus graph.
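The gate and a single propagation step can be sketched in NumPy as below. This is an illustration under stated assumptions: the nonlinearity $\sigma$ is taken to be ReLU, the attention coefficients $a_{uv}^{(t)}$ are passed in as a precomputed matrix rather than learned, and all function names are hypothetical.

```python
import numpy as np

def edge_gate(c_bar, p_local, beta):
    """alpha = [c_bar]_+^beta * [p_local]_+^(1 - beta), entry-wise."""
    return np.maximum(c_bar, 0.0) ** beta * np.maximum(p_local, 0.0) ** (1.0 - beta)

def propagate(h, backbone, gate, attn, w0, w1):
    """One backbone-masked message-passing step.

    h has shape (d, d_h) with row v holding node v's features;
    backbone, gate and attn have shape (d, d) indexed as [u, v] for the edge u -> v.
    """
    coeff = np.where(backbone, gate * attn, 0.0)   # zero out non-backbone edges
    messages = coeff.T @ (h @ w1.T)                # row v: sum_u coeff[u, v] * W1 h_u
    return np.maximum(h @ w0.T + messages, 0.0)    # assumption: sigma is ReLU
```

Edges outside the backbone contribute nothing regardless of their gate or attention value, which is the masking behaviour described above.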

To stabilize the shared encoder, the server aggregates local parameters using a structure-aware masking rule:

$$\theta_{g}^{(k)} = \frac{\sum_{i \in C_{k}} n_{i} \big(\theta_{i} \odot A_{k}^{\mathrm{glob}}\big)}{\sum_{i \in C_{k}} n_{i}},$$

where $\odot A_{k}^{\mathrm{glob}}$ denotes masking only the edge-indexed parameters of the graph refinement module according to the retained backbone. Backbone-agnostic encoder weights are still aggregated in the usual cluster-wise weighted average, whereas local prediction heads remain personalized. This masking step aligns the structurally sensitive part of the shared model with the global backbone of its cluster.
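One possible reading of this aggregation rule is sketched below. It assumes, purely for illustration, that each client's edge-indexed parameters can be stored as a $d \times d$ array aligned with the backbone and that all remaining shared weights are held in a separate array; the actual parameter layout is not specified in the text.

```python
import numpy as np

def aggregate_cluster(edge_params, other_params, n_samples, backbone):
    """Sample-size-weighted average within one cluster.

    edge_params: list of (d, d) arrays, masked by the backbone before averaging.
    other_params: list of arrays with backbone-agnostic weights, averaged as usual.
    """
    w = np.asarray(n_samples, dtype=float)
    w = w / w.sum()
    masked = sum(wi * (p * backbone) for wi, p in zip(w, edge_params))
    shared = sum(wi * p for wi, p in zip(w, other_params))
    return masked, shared
```

Personalized heads are intentionally absent from this function because they never leave the client.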

Each client finally applies a personalized prediction head $\phi_{i}$ on top of the refined representation:

$$\hat{y}_{i,j} = h_{\phi_{i}}\big(H_{i,j}^{(L)}\big).$$

The local objective is defined as

$$L_{i} = \frac{1}{n_{i}} \sum_{(x, y) \in D_{i}} \ell\big(h_{\phi_{i}}(H^{(L)}), y\big) + \lambda \big\|\theta_{i} - \theta_{g}^{(k)}\big\|_{2}^{2} + \rho \sum_{u \to v : A_{k}^{\mathrm{glob}}[u, v] = 0} \big(\alpha_{u \to v}^{(i)}\big)^{2}.$$

The first term is the task loss, the second term is a proximal regularizer for stable federated optimization, and the third term discourages clients from reactivating edges that are excluded by the global backbone.
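To make the three terms concrete, the sketch below assembles the objective from an already computed mean task loss. It assumes that the forbidden-edge sum ranges over ordered pairs with $u \neq v$ and that parameters are available as flat arrays; the function name is hypothetical.

```python
import numpy as np

def local_objective(task_loss, theta_i, theta_g, gate, backbone, lam, rho):
    """Task loss + lam * ||theta_i - theta_g||^2 + rho * sum of squared gates on excluded edges."""
    prox = lam * np.sum((theta_i - theta_g) ** 2)
    off_diag = ~np.eye(gate.shape[0], dtype=bool)      # assumption: self-loops are not counted
    excluded = off_diag & ~backbone.astype(bool)       # edges with A_glob[u, v] = 0
    penalty = rho * np.sum(gate[excluded] ** 2)
    return task_loss + prox + penalty
```

In an autograd framework the same expression would be written with tensors so that gradients reach both the parameters and the gates.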

4.5. Training Procedure

At each communication round, CarPe-FL alternates between structure estimation and representation learning. First, clients compute latent statistics and send them to the server. Second, the server estimates local latent proxy graphs, updates client clusters, and computes cluster-wise global backbones. Third, the server broadcasts the updated backbone-aware encoder to each cluster. Finally, each client performs local optimization using the structure-aligned graph module and its personalized head. This iterative procedure allows the structural summary and the predictive model to co-evolve over training, with structure estimation acting as a server-level outer loop over local representation learning.

Compared with prior methods such as FedSDR [7], which focus on shortcut discovery and removal in representation space, CarPe-FL explicitly models and aggregates latent structural dependencies. Compared with FedCSL [8], which mainly focuses on graph recovery, CarPe-FL further integrates the learned graph into personalized representation learning and downstream medical image prediction.

5. Simulation Settings

5.1. Datasets and Federated Protocol

We evaluate CarPe-FL on two medical imaging tasks: diabetic retinopathy (DR) grading and skin lesion classification. For the main experiments, we use three publicly available DR datasets, namely APTOS 2019 Blindness Detection [13], DDR [14], and DRD [15], and we treat each dataset as one medical institution (client). Following the clinically meaningful grouping of DR severity, we collapse the original five-level grading scheme (normal, mild, moderate, severe, and proliferative DR) into three classes: normal, NPDR, and PDR, where mild, moderate, and severe are merged into NPDR [16]. Sample images of the DR datasets can be seen in Figure 2, and the final class distribution is summarized in Table 1.
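The label grouping described above amounts to a small lookup table. The sketch below keys on grade names rather than integer codes, because the numeric encodings used by each dataset are not given here; the names `DR_GRADE_TO_CLASS` and `collapse_dr_grade` are hypothetical.

```python
# Five-level DR grades collapsed into the three classes used in the experiments.
DR_GRADE_TO_CLASS = {
    "normal": "normal",
    "mild": "NPDR",
    "moderate": "NPDR",
    "severe": "NPDR",
    "proliferative": "PDR",
}

def collapse_dr_grade(grade):
    """Map a grade name (case-insensitive) to normal, NPDR, or PDR."""
    return DR_GRADE_TO_CLASS[grade.strip().lower()]
```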

To further examine whether the proposed framework generalizes beyond retinal imaging, we conduct supplementary external experiments on skin lesion classification using three public datasets: ISIC [17], HAM10000 [18], and DERM7pt [19]. Each dataset is again treated as a separate institution. To make the binary task definition explicit, we map melanoma (MEL), basal cell carcinoma (BCC), and actinic keratosis/intraepithelial carcinoma (AKIEC) to the malignant class, while melanocytic nevus (NV), benign keratosis (BKL), dermatofibroma (DF), and vascular lesions (VASC) are grouped as benign, following the source diagnostic taxonomies [17,18,19]. After grouping, we construct class-balanced evaluation subsets for the supplementary benchmark so that performance differences are not dominated by prevalence mismatch; the resulting class counts are shown in Table 2. Sample images are shown in Figure 3.
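The binary grouping can be written down directly from the diagnosis abbreviations listed above. This is a minimal sketch; the function name is hypothetical, and how each dataset spells its diagnosis codes is an assumption that may need adapting.

```python
MALIGNANT = {"MEL", "BCC", "AKIEC"}
BENIGN = {"NV", "BKL", "DF", "VASC"}

def to_binary_label(code):
    """Return 1 for malignant and 0 for benign diagnosis codes."""
    code = code.strip().upper()
    if code in MALIGNANT:
        return 1
    if code in BENIGN:
        return 0
    raise ValueError(f"unmapped diagnosis code: {code}")
```

Raising on unknown codes avoids silently assigning a class to diagnoses that the grouping does not cover.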

For all datasets, we construct train/validation/test splits using an $8:1:1$ ratio. In the federated setting, all clients participate in every communication round. For the main DR experiments, the three DR datasets form three clients. For the supplementary skin lesion experiments, the three skin datasets again form three clients. This setting reflects a realistic cross-silo medical federation, where each institution corresponds to a distinct acquisition environment and domain shift must be handled at the representation level.

5.2. Training Configuration

All images are resized to $224 \times 224$. Unless otherwise stated, all compared methods use the same image encoder backbone, the same input resolution, the same preprocessing pipeline, and the same optimizer configuration to ensure fair comparison. We optimize all models using AdamW with an initial learning rate of $10^{-3}$, and we apply cosine learning-rate decay. We train for 100 communication rounds with full client participation and use one local epoch per communication round. Early stopping is applied with a patience of five validation rounds.
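The schedule and stopping rule can be sketched in plain Python. Two details are assumptions not stated in the text: the learning rate is updated once per communication round and decays to zero, and early stopping monitors a validation score where higher is better.

```python
import math

def cosine_lr(round_idx, total_rounds=100, base_lr=1e-3, min_lr=0.0):
    """Cosine decay from base_lr at round 0 to min_lr at total_rounds."""
    progress = min(round_idx, total_rounds) / total_rounds
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

class EarlyStopper:
    """Signal a stop after `patience` consecutive validation rounds without improvement."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("-inf")
        self.bad_rounds = 0

    def step(self, val_score):
        if val_score > self.best:
            self.best = val_score
            self.bad_rounds = 0
        else:
            self.bad_rounds += 1
        return self.bad_rounds >= self.patience
```

In a PyTorch training loop the same decay is typically obtained from a built-in scheduler instead of a hand-written function.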

For the main DR experiments, the batch size is set to 64. For the supplementary skin lesion experiments, the batch size is reduced to 16 due to the smaller effective dataset size. To isolate the effect of the learning framework itself, we do not introduce method-specific class-rebalancing losses in the main comparison; any preprocessing, augmentation, or sampling policy is kept identical across all methods. Unless otherwise stated, each experiment is repeated over five runs with different random seeds. For CarPe-FL, the latent representation dimension is set to $d = 128$, and the graph hidden dimension is set to $d_{h} = 128$. The structure-learning and personalization hyperparameters are fixed across all experiments as follows: sparsity coefficient $\lambda_{1} = 10^{-3}$, adjacency threshold $\tau_{w} = 0.05$, cluster-wise backbone confidence threshold $\tau_{k} = 0.6$, gate-balance coefficient $\beta = 0.5$, proximal regularization coefficient $\lambda = 10^{-3}$, and forbidden-edge penalty coefficient $\rho = 10^{-2}$. The server re-estimates local latent causal graphs and refreshes the client clusters every 10 communication rounds.
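For reference, the reported hyperparameters can be collected in a single configuration object. The class and field names below are hypothetical, and the choice to trigger a refresh at rounds 0, 10, 20, and so on is an assumption about how "every 10 communication rounds" is counted.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CarPeFLConfig:
    latent_dim: int = 128        # d
    graph_hidden_dim: int = 128  # d_h
    lambda1: float = 1e-3        # sparsity coefficient
    tau_w: float = 0.05          # adjacency threshold
    tau_k: float = 0.6           # backbone confidence threshold
    beta: float = 0.5            # gate-balance coefficient
    lam: float = 1e-3            # proximal regularization coefficient
    rho: float = 1e-2            # forbidden-edge penalty coefficient
    rounds: int = 100            # communication rounds
    refresh_every: int = 10      # structure and cluster refresh period

def should_refresh_structure(round_idx, cfg):
    """True on rounds where graphs and clusters are re-estimated (assumed to start at round 0)."""
    return round_idx % cfg.refresh_every == 0
```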

5.3. Evaluation Metrics

For the three-class diabetic retinopathy task, we report macro-F1, balanced accuracy, and one-vs.-rest area under the receiver operating characteristic curve (AUROC). For the binary skin lesion task, we report AUROC, accuracy, sensitivity, specificity, and F1-score. Because our target scenario is multi-center medical imaging, we additionally report mean-site and worst-site performance in order to explicitly evaluate robustness under cross-site heterogeneity. Let $M_{i}$ denote a metric measured on client $i$. Then, the mean-site and worst-site scores are defined as

$$M_{\text{mean-site}} = \frac{1}{N} \sum_{i=1}^{N} M_{i}, \qquad M_{\text{worst-site}} = \min_{1 \leq i \leq N} M_{i}.$$

These metrics are important because a method with a strong average score can still fail under external-site evaluation if one or more clients are poorly served by the learned model.
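Both summary scores follow directly from the per-client metric values. The helper below is a minimal sketch with a hypothetical name; the site names and numbers in the usage note are made up for illustration and are not results from this study.

```python
def site_summary(per_site):
    """Mean-site and worst-site scores from a mapping of client name to metric value."""
    values = list(per_site.values())
    return {
        "mean_site": sum(values) / len(values),
        "worst_site": min(values),
    }
```

For example, `site_summary({"site_a": 0.9, "site_b": 0.6, "site_c": 0.9})` has a mean-site score of 0.8 but a worst-site score of 0.6, which is the kind of gap the worst-site metric is meant to expose.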

For the main comparison tables, we report the mean and standard deviation over five random seeds. We additionally summarize uncertainty using a 95% confidence interval (CI), which is computed as

$$\bar{m} \pm t_{0.975, 4} \frac{s}{\sqrt{5}},$$

where $\bar{m}$ and $s$ denote the sample mean and sample standard deviation across the five runs. For pairwise comparisons, we report two-sided paired t-tests over seed-matched runs, using CarPe-FL versus the strongest competing baseline on the same metric.
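The interval can be computed with the standard library alone. The sketch hard-codes the critical value $t_{0.975, 4} \approx 2.776$ for four degrees of freedom, so it is only valid for exactly five runs; the function name is hypothetical.

```python
import math
import statistics

T_CRIT_975_DF4 = 2.776445  # two-sided 95% critical value of Student's t with 4 degrees of freedom

def ci95_five_runs(values):
    """Return (lower, upper) of the 95% CI for the mean of exactly five runs."""
    if len(values) != 5:
        raise ValueError("this helper assumes exactly five runs")
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1 in the denominator)
    half_width = T_CRIT_975_DF4 * s / math.sqrt(5)
    return mean - half_width, mean + half_width
```

A paired t-test additionally needs the t distribution's CDF to obtain a p-value, which is usually taken from a statistics library rather than computed by hand.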

5.4. Baselines

We compare CarPe-FL against representative baselines from four categories. First, we include FedAvg [1] as the standard global federated learning baseline, since it assumes a single shared model across all clients and provides the most common reference point for multi-center FL. We also include FedProx [20], which extends FedAvg by adding a proximal regularization term to improve optimization stability under client heterogeneity.

Second, we consider personalized federated learning baselines. Ditto [21] is included as a representative parameter-level personalization method that learns a personalized model for each client while still maintaining a global reference model. FedRep [4] represents shared-representation learning, where a common encoder is learned across clients while the prediction head is personalized locally. FedAMP [5] represents similarity-aware collaboration, where each client adaptively aggregates information from other clients according to their model similarity instead of relying on a single global average.

Third, to assess the role of causal shortcut handling, we compare against FedSDR [7]. FedSDR is the most relevant baseline in terms of problem setting because it explicitly addresses shortcut-sensitive heterogeneous federated learning and learns personalized invariant representations by discovering and removing shortcut-dependent features. This baseline allows us to distinguish the benefit of explicit latent structure modeling from the benefit of shortcut-aware invariant learning alone.

Finally, to evaluate the isolated contribution of causal graph estimation, we include a FedCSL-inspired baseline [8]. In this variant, latent causal graphs are estimated at the server, but the resulting structure is not injected into downstream personalized representation learning. This baseline is important because it separates the effect of graph recovery itself from the additional gains obtained by CarPe-FL through structure-aligned masking and edge-wise personalization.

For fairness, all methods use the same image encoder backbone, input resolution, optimizer, training schedule, and client participation setting whenever applicable. For methods that were not originally proposed for medical imaging, we adapt only the final prediction head to match the target task while preserving the main optimization principle of the original algorithm.

5.5. Implementation Details

Our implementation is based on Python 3.8 and PyTorch 2.4.1. All experiments are conducted on a server equipped with an AMD EPYC 7402 24-Core Processor, 256 GB RAM, and two NVIDIA RTX PRO 6000 96GB GPUs. Two GPUs are used to accelerate the structure-learning and federated training pipeline, although the full GPU memory capacity is not strictly required by the model. The complete implementation and experimental scripts will be released to facilitate reproducibility and future extensions.

6. Experiment Results

6.1. Main Results on Diabetic Retinopathy

Table 3 reports the main quantitative results on the three-center diabetic retinopathy benchmark. All values are summarized over five random seeds. Under this protocol, CarPe-FL achieves the strongest mean performance across all reported metrics. In particular, CarPe-FL shows the largest gain on the worst-site metric, indicating that the proposed framework is especially effective in reducing performance collapse under center-level heterogeneity. While strong personalized baselines such as Ditto [21], FedRep [4], and FedAMP [5] consistently improve over FedAvg [1], they remain limited when the dominant source of heterogeneity is mechanism-level rather than merely statistical. FedSDR [7] improves robustness by removing shortcut-dependent features, but it still lags behind CarPe-FL, suggesting that shortcut suppression alone is insufficient in multi-center medical imaging where distinct latent mechanisms may coexist.

Compared with FedAvg, CarPe-FL improves macro-F1 by 7.9 percentage points and worst-site performance by 9.0 percentage points in mean value. Relative to the strongest baseline, FedSDR, the gain is more pronounced on worst-site performance than on mean-site performance, which is consistent with the intended role of structure-aware personalization for the most difficult sites. For CarPe-FL, the 95% CI is $[0.776, 0.786]$ for macro-F1, $[0.798, 0.810]$ for balanced accuracy, $[0.917, 0.925]$ for AUROC, and $[0.737, 0.751]$ for worst-site performance. The corresponding paired tests against FedSDR yield $p = 0.006$ for macro-F1, $p = 0.005$ for balanced accuracy, $p = 0.004$ for AUROC, $p = 0.007$ for mean-site, and $p = 0.003$ for worst-site performance. This pattern supports the central design principle of CarPe-FL: instead of relying solely on shortcut removal, it explicitly models and clusters heterogeneous latent mechanisms, and then it enforces cluster-wise causal backbones during representation learning.
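As a quick arithmetic check, the gains quoted in this subsection can be read directly off the mean values in Table 3. The symbol $\Delta$ is introduced here only for illustration and denotes a difference of means, so a value of 0.079 corresponds to 7.9 percentage points:

$$
\begin{aligned}
\Delta_{\text{macro-F1}}(\text{CarPe-FL}, \text{FedAvg}) &= 0.781 - 0.702 = 0.079, \\
\Delta_{\text{worst-site}}(\text{CarPe-FL}, \text{FedAvg}) &= 0.744 - 0.654 = 0.090, \\
\Delta_{\text{mean-site}}(\text{CarPe-FL}, \text{FedSDR}) &= 0.786 - 0.758 = 0.028, \\
\Delta_{\text{worst-site}}(\text{CarPe-FL}, \text{FedSDR}) &= 0.744 - 0.712 = 0.032.
\end{aligned}
$$

The last two lines make the comparison with FedSDR explicit: the worst-site gain (3.2 points) exceeds the mean-site gain (2.8 points).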

6.2. External Validation on Skin Lesion Classification

To assess whether the benefit of CarPe-FL extends beyond retinal imaging, we further evaluate the proposed method on a supplementary multi-center skin lesion benchmark. Table 4 reports the corresponding results over five random seeds. CarPe-FL again achieves the best AUROC, accuracy, and worst-site performance. The gain over shortcut-aware and personalization-based baselines remains consistent, suggesting that the proposed design is not restricted to a single medical imaging modality.

These results suggest that the usefulness of CarPe-FL is not limited to one particular imaging task. Rather, when the dominant source of heterogeneity arises from site-specific latent mechanisms rather than from simple label imbalance, the explicit use of cluster-wise causal backbones can improve both robustness and personalization. For CarPe-FL, the 95% CI is $[0.892, 0.902]$ for AUROC, $[0.826, 0.840]$ for accuracy, and $[0.791, 0.811]$ for worst-site performance. Relative to FedSDR, paired tests yield $p = 0.009$ for AUROC, $p = 0.011$ for accuracy, $p = 0.018$ for sensitivity, $p = 0.013$ for specificity, and $p = 0.008$ for worst-site performance.

6.3. Do the Learned Causal Graphs Matter?

A central claim of CarPe-FL is that the learned cluster-wise causal backbones are not merely auxiliary artifacts but rather directly contribute to performance. To verify this, we compare CarPe-FL against a FedCSL-inspired baseline that estimates latent graphs but does not inject them into personalized representation learning as well as against an ablated variant without structure masking. The comparison in Table 3 shows that graph estimation alone is helpful, but it does not fully explain the gains of the proposed framework. The full model consistently outperforms the graph-estimation-only baseline, indicating that the main advantage arises from the combination of structure learning and structure-aware training.

To further evaluate the learned backbones themselves, we quantify their internal reliability using structural statistics. Table 5 reports the average edge confidence, bootstrap edge stability, and backbone sparsity for the cluster-wise graphs learned on the diabetic retinopathy benchmark. The resulting backbones are sparse and stable, which is desirable in the medical context where highly dense graphs are difficult to interpret and often reflect overfitting. At the same time, these quantities should be interpreted as evidence of structural reliability rather than as direct proof of clinical interpretability.

The relatively high bootstrap stability values suggest that the estimated backbones are not arbitrary outputs of the structure learner but rather reflect repeatable dependencies in the latent space. At the same time, the low edge ratios indicate that the consensus process retains only a compact subset of dependencies, which is consistent with the objective of identifying robust latent pathways rather than overfitting to site-specific correlations.
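The exact definitions of the Table 5 statistics are not restated in this section, so the following Python sketch shows only one plausible reading. It assumes that bootstrap edge stability is the average, over backbone edges, of the fraction of bootstrap-resampled graphs that contain the edge, and that the edge ratio is the number of retained edges divided by the number of possible directed edges without self-loops. The function names and the edge-set representation are hypothetical.

```python
from typing import Iterable, Set, Tuple

Edge = Tuple[int, int]  # directed edge (parent, child) between latent factors


def bootstrap_edge_stability(backbone: Set[Edge],
                             resamples: Iterable[Set[Edge]]) -> float:
    """Mean fraction of bootstrap graphs that contain each backbone edge.

    Assumed definition, used for illustration only.
    """
    graphs = list(resamples)
    if not backbone or not graphs:
        raise ValueError("need a non-empty backbone and at least one resample")
    per_edge = [
        sum(edge in graph for graph in graphs) / len(graphs)
        for edge in backbone
    ]
    return sum(per_edge) / len(per_edge)


def edge_ratio(backbone: Set[Edge], num_factors: int) -> float:
    """Retained edges divided by all possible directed edges (no self-loops)."""
    if num_factors < 2:
        raise ValueError("need at least two latent factors")
    possible = num_factors * (num_factors - 1)
    return len(backbone) / possible


if __name__ == "__main__":
    # Toy backbone over 4 latent factors and 4 bootstrap re-estimates.
    backbone = {(0, 1), (1, 2)}
    resamples = [{(0, 1), (1, 2)}, {(0, 1)}, {(0, 1), (2, 1)}, {(1, 2), (0, 1)}]
    print(bootstrap_edge_stability(backbone, resamples))
    print(edge_ratio(backbone, num_factors=4))
```

Under this reading, a reversed edge in a resample counts as a miss for the original direction, which makes the stability score sensitive to orientation as well as adjacency.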

6.4. Do the Learned Representations Become More Structure-Aligned?

Beyond graph recovery, CarPe-FL also aims to learn representations that are less organized by site-specific shortcuts and more aligned with the downstream pathology signal. We assess this by quantifying how strongly the learned latent space is organized by site identity versus disease class. Let $z$ denote the final latent representation before the personalized prediction head. We define

$$
R_{site} = \frac{\operatorname{tr}(B_{site})}{\operatorname{tr}(W_{site})}, \qquad R_{class} = \frac{\operatorname{tr}(B_{class})}{\operatorname{tr}(W_{class})},
$$

where $B_{site}$ and $W_{site}$ denote the between-site and within-site scatter matrices computed on the full latent representation, and $B_{class}$ and $W_{class}$ are defined analogously for disease classes. A lower $R_{site}$ value indicates weaker site-driven separation, whereas a higher $R_{class}$ value indicates stronger class-driven separation. Table 6 reports these two summary statistics.
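The ratio defined above can be computed without forming the scatter matrices, because the trace of a scatter matrix equals a sum of squared Euclidean distances. The sketch below is a minimal pure-Python illustration. It assumes the standard LDA-style convention in which the between-group scatter is weighted by group size, which this section does not state explicitly. The function name `separation_ratio` is hypothetical.

```python
from collections import defaultdict
from typing import Hashable, List, Sequence


def separation_ratio(z: Sequence[Sequence[float]],
                     groups: Sequence[Hashable]) -> float:
    """Return tr(B) / tr(W) for latent vectors z grouped by site or class.

    tr(B) = sum_k n_k * ||mu_k - mu||^2   (size-weighted; an assumption)
    tr(W) = sum_i ||z_i - mu_{g(i)}||^2
    """
    if len(z) != len(groups) or len(z) == 0:
        raise ValueError("z and groups must be non-empty and of equal length")
    dim = len(z[0])

    def mean(rows: Sequence[Sequence[float]]) -> List[float]:
        return [sum(r[d] for r in rows) / len(rows) for d in range(dim)]

    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Collect the latent vectors that belong to each site or class.
    members = defaultdict(list)
    for vec, g in zip(z, groups):
        members[g].append(vec)

    global_mean = mean(z)
    tr_between = 0.0
    tr_within = 0.0
    for rows in members.values():
        mu = mean(rows)
        tr_between += len(rows) * sq_dist(mu, global_mean)
        tr_within += sum(sq_dist(r, mu) for r in rows)

    if tr_within == 0.0:
        raise ValueError("within-group scatter is zero; the ratio is undefined")
    return tr_between / tr_within


if __name__ == "__main__":
    # Toy latent space: the first coordinate separates the two sites,
    # the second coordinate separates the two classes.
    z = [[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]]
    site = ["A", "A", "B", "B"]
    label = [0, 1, 0, 1]
    print("R_site :", separation_ratio(z, site))
    print("R_class:", separation_ratio(z, label))
```

Calling the same function once with site identifiers and once with disease labels gives the two statistics, so both are measured on the same latent vectors and differ only in the grouping.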

The results show that CarPe-FL yields the lowest site-separation ratio while maintaining the strongest class-separation ratio. This supports the claim that the proposed method learns representations that are less dominated by site-specific shortcuts and more organized around pathology-relevant information. In other words, the gain of CarPe-FL is not purely an optimization effect; it is also associated with a measurable change in latent-space geometry. Relative to FedSDR, the reduction in $R_{site}$ is statistically significant ($p = 0.004$), while the gain in $R_{class}$ also remains significant at the 5% level ($p = 0.021$).

6.5. Ablation Study

Finally, we conduct an ablation study to isolate the contribution of each major component. Table 7 compares the full CarPe-FL model against three reduced variants: (i) without structure-aware clustering, where all clients share a single global backbone, (ii) without structure masking, where the learned backbone is estimated but not injected into the encoder, and (iii) without edge-wise personalization, where all gates are fixed to one.

All three reductions lead to consistent performance degradation. Among them, removing clustering causes the largest loss on the worst-site metric, indicating that multi-center medical imaging is better characterized by multiple latent mechanisms than by a single shared structure. Removing structure masking also yields a notable drop, suggesting that structure estimation is not sufficient unless it actively constrains the encoder. Finally, removing edge-wise personalization reduces both mean and worst-site performance, showing that cluster-level sharing and client-level adaptation are complementary rather than competing design principles. In paired comparisons against the full model, all three ablations yield statistically significant degradation on worst-site performance ($p < 0.05$).

Additional site-wise performance results, graph analyses, sensitivity analyses, and a computational overhead analysis are reported in Appendix A (Tables A1–A8). Seed-wise raw scores for the statistical robustness analysis are provided in Appendix B (Tables A9–A12).

7. Discussion

The empirical results indicate that CarPe-FL is particularly effective in the scenario for which it is designed: multi-center medical imaging, where the dominant source of heterogeneity is not merely label imbalance or finite-sample noise but rather structural differences in data generation. Under five-seed evaluation, the most consistent advantage of CarPe-FL appears on the worst-site metrics, suggesting that the proposed framework is especially useful when one or more institutions are substantially more difficult than the average site. In such settings, standard personalized federated learning methods, including FedRep [4] and Ditto [21], improve local adaptation but remain fundamentally agnostic to the origin of heterogeneity. Likewise, shortcut-aware approaches such as FedSDR [7] reduce dependence on unstable correlations, but they do so at the representation level without explicitly modeling cluster-specific structural mechanisms. In contrast, CarPe-FL introduces an intermediate structural layer between distributed data and downstream prediction, allowing the model to first identify latent dependency patterns, then align them across similar institutions, and finally inject them into representation learning.

A particularly important observation is that the gain of CarPe-FL over the strongest baseline is larger on worst-site performance than on average performance. From a medical imaging perspective, this is practically meaningful. In multi-center deployment, the main bottleneck is often not the average institution but the most outlying one, for example, a hospital with a different scanner vendor, a different reconstruction pipeline, or a less frequent pathology distribution. The stronger improvement on worst-site metrics therefore suggests that CarPe-FL is not merely improving central tendency but also reducing the extent to which any individual institution is underserved by the shared model.

The comparison between FedSDR and CarPe-FL is especially instructive. FedSDR addresses shortcut dependence by separating environment-sensitive and invariant features, which is a strong baseline when the primary challenge is shortcut overfitting. However, representation-level domain separations in medical imaging, such as scanner- or protocol-driven clusters, suggest that the underlying issue is often mechanism-level rather than merely shortcut-level. Our results are consistent with this interpretation: once such structural heterogeneity becomes dominant, a model that explicitly estimates and clusters latent structural dependencies has a natural advantage over a model that only suppresses shortcut features. This is why the gap between CarPe-FL and FedSDR is most visible on cross-site robustness metrics.

At the same time, this paper should be interpreted with several limitations in mind. First, the server-side structure estimation step relies on latent second-order statistics and a linear-Gaussian approximation in the latent space. While this offers a practical and scalable compromise, it may not fully capture more complex nonlinear medical imaging mechanisms. Second, the current clustering procedure is based on structural similarity alone, and future work may benefit from jointly modeling structure, task similarity, and clinical metadata. Third, although the learned backbones are sparse and stable, these structural statistics should be interpreted as evidence of internal reliability rather than as direct proof of clinical interpretability. Fourth, the experimental setting still involves a modest number of institutions and imaging tasks. While the additional skin-lesion benchmark improves the breadth of evaluation, broader validation across more centers and modalities would further strengthen the conclusions.

There is also a clear trade-off between robustness and computational cost. Because CarPe-FL adds server-side graph estimation and cluster-wise consensus on top of standard federated optimization, it incurs additional coordination overhead compared with purely representation-based baselines. However, this burden is concentrated on the server side, while the increase in client-side computation remains relatively modest, which is a realistic design choice for cross-silo medical federations. Overall, the current evidence supports the view that explicit structural management is a promising and practically meaningful direction for robust multi-center medical imaging while leaving room for stronger nonlinear modeling and broader validation in future work.

8. Conclusions

In this paper, we proposed CarPe-FL, which is a causal-aware personalized federated learning framework for multi-center medical imaging. Unlike conventional personalized federated learning methods that mainly address statistical heterogeneity in parameter or representation space, CarPe-FL explicitly models latent structural dependencies under server-side management, clusters institutions according to structural similarity, and learns cluster-wise global causal backbones that are directly injected into representation learning. Through structure-aligned masking and edge-wise personalization, the framework is designed to suppress site-specific shortcut pathways while preserving institution-specific predictive adaptation.

Across both the diabetic retinopathy benchmark and the supplementary skin-lesion benchmark, CarPe-FL shows consistent improvements over conventional federated baselines and recent shortcut-aware methods. The strongest advantage appears on worst-site performance, which is especially relevant for real multi-center deployment, where the practical limit of a model is often determined by the most difficult or outlying institution rather than by the average site. The additional statistical analyses further support that these gains are not limited to a single random seed.

Overall, our results indicate that robust multi-center medical imaging cannot be fully addressed by shortcut suppression alone. Instead, explicitly estimating, aggregating, and utilizing latent structural information provides a useful inductive bias for federated medical learning under mechanism-level heterogeneity. While stronger validation across larger federations and more nonlinear structural settings remains an important direction for future work, CarPe-FL represents a meaningful step toward more robust and structure-aware federated learning for multi-center clinical deployment.

Author Contributions

Conceptualization, W.S. and J.S.; methodology, W.S. and Z.S.; software, W.S. and G.O.; validation, W.S. and Z.S.; formal analysis, W.S.; investigation, W.S.; data curation, G.O.; writing—original draft preparation, W.S. and Z.S.; writing—review and editing, W.S., Z.S., G.O. and J.S.; visualization, W.S.; supervision, J.S.; project administration, W.S. and J.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable. This paper used only publicly available de-identified datasets and did not involve direct interaction with human participants.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets analyzed in this study are publicly available and are cited in the manuscript. No new dataset was generated in this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Additional Results

Appendix A.1. Site-Wise Performance on Diabetic Retinopathy

To provide a more detailed view of cross-site behavior, Table A1 reports the site-wise macro-F1 scores on the diabetic retinopathy benchmark. CarPe-FL improves performance on all three institutions and yields the largest gain on APTOS, which is also the site associated with the lowest baseline performance. This supports the interpretation that CarPe-FL is especially beneficial in structurally challenging or data-limited environments.

Table A1. Site-wise macro-F1 on the diabetic retinopathy benchmark. The best result in each column is shown in bold. The upward arrow indicates that higher values are better.

Method	APTOS ↑	DDR ↑	DRD ↑
FedAvg [1]	0.654	0.702	0.750
FedProx [20]	0.666	0.712	0.761
Ditto [21]	0.681	0.725	0.772
FedRep [4]	0.692	0.734	0.779
FedAMP [5]	0.698	0.741	0.784
FedSDR [7]	0.712	0.756	0.794
FedCSL-inspired [8]	0.705	0.749	0.787
CarPe-FL	0.744	0.781	0.818
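The per-site gains behind the statement about APTOS can be checked with a few lines of Python. The sketch below copies three rows of Table A1 and computes the difference between CarPe-FL and a chosen baseline at each site. The dictionary and function names are hypothetical and serve only this illustration.

```python
from typing import Dict

SITES = ("APTOS", "DDR", "DRD")

# Site-wise macro-F1 values copied from Table A1.
MACRO_F1 = {
    "FedAvg": (0.654, 0.702, 0.750),
    "FedSDR": (0.712, 0.756, 0.794),
    "CarPe-FL": (0.744, 0.781, 0.818),
}


def site_gains(method: str, baseline: str) -> Dict[str, float]:
    """Per-site macro-F1 difference (method minus baseline), rounded to 3 d.p."""
    return {
        site: round(m - b, 3)
        for site, m, b in zip(SITES, MACRO_F1[method], MACRO_F1[baseline])
    }


def largest_gain_site(method: str, baseline: str) -> str:
    """Site at which the method improves most over the baseline."""
    gains = site_gains(method, baseline)
    return max(gains, key=gains.get)


if __name__ == "__main__":
    for baseline in ("FedAvg", "FedSDR"):
        print(baseline, site_gains("CarPe-FL", baseline),
              "largest:", largest_gain_site("CarPe-FL", baseline))
```

With either FedAvg or FedSDR as the reference, the largest difference falls on APTOS, the site with the lowest baseline scores in Table A1.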

Table A2 reports the corresponding site-wise AUROC values. The trend is consistent with the macro-F1 results, with CarPe-FL showing the strongest improvement on the site that is most difficult for the conventional federated baselines.

Table A2. Site-wise AUROC on the diabetic retinopathy benchmark. The best result in each column is shown in bold. The upward arrow indicates that higher values are better.

Method	APTOS ↑	DDR ↑	DRD ↑
FedAvg [1]	0.831	0.861	0.894
FedProx [20]	0.842	0.871	0.901
Ditto [21]	0.851	0.878	0.907
FedRep [4]	0.858	0.884	0.912
FedAMP [5]	0.862	0.889	0.916
FedSDR [7]	0.882	0.901	0.920
FedCSL-inspired [8]	0.874	0.894	0.915
CarPe-FL	0.905	0.922	0.936

Appendix A.2. Site-Wise Performance on Skin Lesion Classification

Table A3 reports site-wise AUROC on the supplementary skin lesion benchmark. CarPe-FL again improves performance across all sites, with the strongest gain observed on the most difficult external dataset. This suggests that the advantage of CarPe-FL is not specific to diabetic retinopathy but rather extends to a dermatological imaging scenario with substantial cross-dataset variation.

Table A3. Site-wise AUROC on the skin lesion benchmark. The best result in each column is shown in bold. The upward arrow indicates that higher values are better.

Method	ISIC ↑	HAM10000 ↑	DERM7pt ↑
FedAvg [1]	0.821	0.856	0.861
FedProx [20]	0.829	0.864	0.869
Ditto [21]	0.836	0.871	0.876
FedRep [4]	0.842	0.876	0.884
FedAMP [5]	0.846	0.879	0.891
FedSDR [7]	0.865	0.889	0.895
FedCSL-inspired [8]	0.857	0.883	0.888
CarPe-FL	0.884	0.905	0.902

Appendix A.3. Additional Graph Analysis

To complement the graph statistics shown in the main text, Table A4 reports the pairwise structural Hamming distance (SHD) between the learned cluster-wise backbones. The relatively high SHD values suggest that different clusters indeed correspond to distinct latent causal mechanisms rather than merely small perturbations of the same graph.

Table A4. Pairwise SHD between the learned cluster-wise global backbones on the diabetic retinopathy benchmark.

	Cluster 1	Cluster 2	Cluster 3
Cluster 1	0	12	15
Cluster 2	12	0	11
Cluster 3	15	11	0

We also measure the overlap ratio between cluster-specific backbones, which is defined as the fraction of shared edges relative to the union of their edge sets. Table A5 shows that the overlap remains limited, which further supports the multi-mechanism interpretation of the data.

Table A5. Pairwise edge overlap ratio between cluster-wise global backbones. Lower overlap indicates stronger structural diversity across clusters.

	Cluster 1	Cluster 2	Cluster 3
Cluster 1	1.00	0.32	0.27
Cluster 2	0.32	1.00	0.35
Cluster 3	0.27	0.35	1.00
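Both graph-comparison statistics in this subsection can be computed from directed edge sets. The following sketch is illustrative only. The overlap ratio follows the definition given above (shared edges over the union of edge sets). For SHD, conventions differ on how a reversed edge is scored, so the sketch assumes that a reversal counts as a single difference, which may not match the convention used for Table A4. The function names are hypothetical.

```python
from typing import Set, Tuple

Edge = Tuple[int, int]  # directed edge (parent, child)


def edge_overlap_ratio(a: Set[Edge], b: Set[Edge]) -> float:
    """Shared directed edges divided by the union of both edge sets."""
    union = a | b
    if not union:
        return 1.0  # two empty graphs are treated as identical
    return len(a & b) / len(union)


def structural_hamming_distance(a: Set[Edge], b: Set[Edge]) -> int:
    """Number of node pairs whose edge status differs between the graphs.

    A missing, extra, or reversed edge each counts once (assumed convention).
    """
    def state(edges: Set[Edge], i: int, j: int) -> Tuple[bool, bool]:
        return ((i, j) in edges, (j, i) in edges)

    # Only node pairs touched by at least one edge can differ.
    pairs = {tuple(sorted(edge)) for edge in a | b}
    return sum(state(a, i, j) != state(b, i, j) for i, j in pairs)


if __name__ == "__main__":
    g1 = {(0, 1), (1, 2)}
    g2 = {(1, 0), (1, 2), (2, 3)}  # one reversal and one extra edge
    print("SHD    :", structural_hamming_distance(g1, g2))
    print("overlap:", edge_overlap_ratio(g1, g2))
```

The two measures are complementary: SHD grows with graph size and reports an absolute count of disagreements, whereas the overlap ratio is normalized to the interval from 0 to 1 and equals 1.00 only for identical edge sets, as on the diagonal of Table A5.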

Appendix A.4. Sensitivity Analysis

To analyze the robustness of the model with respect to the edge-wise personalization mechanism, Table A6 reports performance under different values of the gate-balance parameter $\beta$. The best performance is achieved around $\beta = 0.5$, suggesting that the most effective strategy is to balance global structural confidence and local edge confidence rather than relying exclusively on either one.

Table A6. Sensitivity analysis with respect to the gate-balance parameter $\beta$ on the diabetic retinopathy benchmark. The best result in each column is shown in bold. The upward arrow indicates that higher values are better.

$β$	Macro-F1 ↑	Worst-Site ↑
0.00	0.759	0.719
0.25	0.773	0.736
0.50	0.781	0.744
0.75	0.777	0.741
1.00	0.768	0.728

Table A7 reports the effect of the cluster refresh interval. A fully static clustering strategy underperforms the periodically updated one, while excessively frequent re-clustering provides little additional gain. This suggests that moderate structural refresh is sufficient to capture meaningful drift without destabilizing training.

Table A7. Sensitivity analysis with respect to the cluster refresh interval on the diabetic retinopathy benchmark. The best result in each column is shown in bold. The upward arrow indicates that higher values are better.

Refresh Interval	Macro-F1 ↑	Worst-Site ↑
No refresh	0.769	0.731
Every 20 rounds	0.774	0.738
Every 10 rounds	0.781	0.744
Every 5 rounds	0.779	0.742

Appendix A.5. Computational Overhead

Finally, Table A8 reports the computational overhead per communication round. As expected, CarPe-FL requires more server-side computation than conventional federated baselines due to latent graph estimation and cluster-wise consensus. Nevertheless, the client-side cost remains close to standard federated training, since the structure discovery step is centralized at the server. This makes CarPe-FL practical for cross-silo medical federations where servers are typically better provisioned than client institutions.

Table A8. Approximate computational overhead per communication round.

Method	Server Time/Round (min)	Client Time/Round (min)	Extra Params	Peak VRAM/Client
FedAvg [1]	0.3	0.8	0	9.6 GB
FedRep [4]	0.4	0.8	0.11 M	9.8 GB
FedAMP [5]	0.6	0.9	0.12 M	9.9 GB
FedSDR [7]	0.7	1.0	0.38 M	10.1 GB
FedCSL-inspired [8]	2.7	0.8	0.12 M	9.8 GB
CarPe-FL	3.2	0.9	1.12 M	10.4 GB

Appendix B. Statistical Robustness and Seed-Wise Results

To complement the mean±standard-deviation summaries, confidence intervals, and paired significance tests reported in the main text, this appendix provides the seed-wise raw scores used for statistical robustness analysis. Because the principal pairwise comparisons in the main text evaluate CarPe-FL against the strongest competing baseline, we report seed-wise results for FedSDR and CarPe-FL on the two main benchmarks together with the seed-wise values used in the ablation study and the representation analysis.

Appendix B.1. Seed-Wise Raw Scores on the Diabetic Retinopathy Benchmark

Table A9 reports the seed-wise raw scores on the diabetic retinopathy benchmark for FedSDR and CarPe-FL. These values are the basis for the paired tests reported in the main text for macro-F1, balanced accuracy, AUROC, mean-site performance, and worst-site performance.

Table A9. Seed-wise raw scores on the diabetic retinopathy benchmark for FedSDR and CarPe-FL.

Seed	Macro-F1		Balanced Acc.		AUROC		Mean-Site		Worst-Site
	FedSDR	CarPe-FL	FedSDR	CarPe-FL	FedSDR	CarPe-FL	FedSDR	CarPe-FL	FedSDR	CarPe-FL
1	0.748	0.776	0.771	0.798	0.896	0.917	0.751	0.779	0.702	0.736
2	0.751	0.779	0.774	0.801	0.899	0.919	0.754	0.782	0.708	0.741
3	0.754	0.781	0.778	0.804	0.901	0.921	0.758	0.786	0.712	0.744
4	0.758	0.784	0.782	0.807	0.903	0.923	0.762	0.789	0.717	0.749
5	0.759	0.785	0.785	0.810	0.906	0.925	0.765	0.794	0.721	0.750

Across all five seeds, CarPe-FL consistently outperforms FedSDR on every reported metric. The improvement is especially stable on the worst-site metric, which supports the claim that the proposed structure-aware personalization is particularly beneficial for the most difficult clients.
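The seed-wise values in Table A9 make it possible to recompute interval estimates. The sketch below assumes that the 95% confidence intervals are two-sided Student-t intervals over the five seeds; the procedure is not stated explicitly in the text, so this is an assumption. The critical value 2.776 for 4 degrees of freedom is hard-coded to keep the example free of external dependencies, and the function name is hypothetical.

```python
import math
import statistics
from typing import Sequence, Tuple

# Two-sided 95% Student-t critical values, keyed by degrees of freedom.
# Only the five-seed case (4 degrees of freedom) is needed here.
T_CRIT_95 = {4: 2.776}


def t_interval_95(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (lower, upper) of a two-sided 95% t-interval for the mean."""
    n = len(scores)
    df = n - 1
    if df not in T_CRIT_95:
        raise ValueError(f"no critical value stored for {df} degrees of freedom")
    mean = statistics.fmean(scores)
    # statistics.stdev is the sample standard deviation (n - 1 denominator).
    half_width = T_CRIT_95[df] * statistics.stdev(scores) / math.sqrt(n)
    return mean - half_width, mean + half_width


# CarPe-FL columns copied from Table A9.
CARPE_MACRO_F1 = [0.776, 0.779, 0.781, 0.784, 0.785]
CARPE_WORST_SITE = [0.736, 0.741, 0.744, 0.749, 0.750]

if __name__ == "__main__":
    for name, scores in (("macro-F1", CARPE_MACRO_F1),
                         ("worst-site", CARPE_WORST_SITE)):
        lower, upper = t_interval_95(scores)
        print(f"{name}: [{lower:.3f}, {upper:.3f}]")
```

Under this assumption, the recomputed intervals for macro-F1 and worst-site performance land close to the intervals quoted in Section 6.1, which suggests that a t-interval over seeds is a reasonable reading of the reported values.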

Appendix B.2. Seed-Wise Raw Scores on the Skin-Lesion Benchmark

Table A10 reports the seed-wise raw scores on the supplementary skin-lesion benchmark for FedSDR and CarPe-FL. These values are used for the paired statistical comparisons reported in the main text.

Table A10. Seed-wise raw scores on the skin-lesion benchmark for FedSDR and CarPe-FL.

Seed	AUROC		Accuracy		Sensitivity		Specificity		Worst-Site
	FedSDR	CarPe-FL	FedSDR	CarPe-FL	FedSDR	CarPe-FL	FedSDR	CarPe-FL	FedSDR	CarPe-FL
1	0.877	0.892	0.811	0.826	0.803	0.818	0.822	0.834	0.775	0.792
2	0.880	0.895	0.815	0.829	0.806	0.822	0.826	0.837	0.781	0.797
3	0.883	0.897	0.819	0.833	0.810	0.826	0.829	0.841	0.786	0.801
4	0.886	0.900	0.823	0.837	0.814	0.830	0.833	0.845	0.791	0.805
5	0.889	0.901	0.827	0.840	0.817	0.834	0.835	0.848	0.797	0.810

The seed-wise pattern on the skin-lesion benchmark is again consistent: CarPe-FL remains stronger than FedSDR across all five runs and on every reported metric, which supports the stability of the gains beyond a single task or modality.

Appendix B.3. Seed-Wise Raw Scores for the Ablation Study

Table A11 reports the seed-wise raw scores for the ablation study on the diabetic retinopathy benchmark. These values complement the mean±standard-deviation results in the main text and show that the relative ordering of the ablated variants is stable across seeds.

Table A11. Seed-wise raw scores for the ablation study on the diabetic retinopathy benchmark.

Seed	Macro-F1				AUROC				Worst-Site
	w/o Clust.	w/o Mask.	w/o Gate	Full	w/o Clust.	w/o Mask.	w/o Gate	Full	w/o Clust.	w/o Mask.	w/o Gate	Full
1	0.756	0.751	0.764	0.776	0.902	0.898	0.908	0.917	0.715	0.709	0.724	0.736
2	0.760	0.755	0.767	0.779	0.905	0.901	0.911	0.919	0.720	0.714	0.728	0.741
3	0.762	0.758	0.769	0.781	0.907	0.904	0.913	0.921	0.724	0.719	0.731	0.744
4	0.765	0.761	0.771	0.784	0.909	0.907	0.915	0.923	0.729	0.724	0.734	0.749
5	0.767	0.765	0.774	0.785	0.912	0.910	0.918	0.925	0.732	0.729	0.738	0.750

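All three reduced variants consistently underperform the full model across the five seeds, and their relative ordering is identical in every seed. In Table A11, the variant without structure masking has the lowest scores on all three metrics, followed by the variant without structure-aware clustering and then the variant without edge-wise personalization.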

Appendix B.4. Seed-Wise Raw Scores for the Representation Analysis

Table A12 reports the seed-wise raw scores used for the representation analysis on the diabetic retinopathy benchmark. These results support the claim that CarPe-FL yields a latent space that is less organized by site identity and more aligned with disease classes.

Table A12. Seed-wise raw scores for the representation analysis on the diabetic retinopathy benchmark.

Seed	FedSDR $R_{site}$	CarPe-FL $R_{site}$	FedSDR $R_{class}$	CarPe-FL $R_{class}$
1	1.06	0.90	1.10	1.16
2	1.09	0.92	1.12	1.18
3	1.11	0.94	1.14	1.19
4	1.14	0.96	1.16	1.20
5	1.15	0.98	1.18	1.22

Across all five seeds, CarPe-FL produces lower $R_{site}$ values and higher $R_{class}$ values than FedSDR. This consistent pattern strengthens the interpretation that the gain of CarPe-FL is associated not only with optimization but also with a measurable change in latent-space organization.

References

1. McMahan, H.B.; Moore, E.; Ramage, D.; Hampson, S.; Arcas, B.A.y. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 20–22 April 2017; pp. 1273–1282.
2. Kwak, L.; Bai, H. The role of federated learning models in medical imaging. Radiol. Artif. Intell. 2023, 5, e230136.
3. Linardos, A.; Kushibar, K.; Walsh, S.; Gkontra, P.; Lekadir, K. Federated learning for multi-center imaging diagnostics: A simulation study in cardiovascular disease. Sci. Rep. 2022, 12, 3551.
4. Collins, L.; Hassani, H.; Mokhtari, A.; Shakkottai, S. Exploiting shared representations for personalized federated learning. In Proceedings of the 38th International Conference on Machine Learning, Virtual Event, 18–24 July 2021; pp. 2089–2099.
5. Huang, Y.; Chu, L.; Zhou, Z.; Wang, L.; Liu, J.; Pei, J.; Zhang, Y. Personalized cross-silo federated learning on non-IID data. Proc. AAAI Conf. Artif. Intell. 2021, 35, 7865–7873.
6. Shin, W.; Shin, J. FedGCD: Federated learning algorithm with GNN based community detection for heterogeneous data. J. Internet Comput. Serv. 2023, 24, 1–11.
7. Tang, X.; Guo, S.; Zhang, J.; Guo, J. Learning personalized causally invariant representations for heterogeneous federated clients. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024.
8. Guo, X.; Yu, K.; Liu, L.; Li, J. FedCSL: A scalable and accurate approach to federated causal structure learning. Proc. AAAI Conf. Artif. Intell. 2024, 38, 12235–12243.
9. Rame, A.; Dancette, C.; Cord, M. Fishr: Invariant gradient variances for out-of-distribution generalization. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 18347–18377.
10. Khemakhem, I.; Kingma, D.P.; Monti, R.P.; Hyvärinen, A. Variational autoencoders and nonlinear ICA: A unifying framework. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, Palermo, Italy, 26–28 August 2020; pp. 2207–2217.
11. Zheng, X.; Aragam, B.; Ravikumar, P.; Xing, E.P. DAGs with NO TEARS: Continuous optimization for structure learning. Adv. Neural Inf. Process. Syst. 2018, 31, 9472–9483.
12. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; Bengio, Y. Graph attention networks. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
13. Karthik, M.; Maggie, D.; Dane, S. APTOS 2019 Blindness Detection. 2019. Available online: https://kaggle.com/competitions/aptos2019-blindness-detection (accessed on 31 March 2026).
14. Li, T.; Gao, Y.; Wang, K.; Guo, S.; Liu, H.; Kang, H. Diagnostic assessment of deep learning algorithms for diabetic retinopathy screening. Inf. Sci. 2019, 501, 511–522.
15. Dugas, E.; Jared, J.; Cukierski, W. Diabetic Retinopathy Detection. 2015. Available online: https://kaggle.com/competitions/diabetic-retinopathy-detection (accessed on 31 March 2026).
16. Cleland, C. Comparing the International Clinical Diabetic Retinopathy (ICDR) severity scale. Community Eye Health 2023, 36, 10.
17. Codella, N.; Rotemberg, V.; Tschandl, P.; Celebi, M.E.; Dusza, S.; Gutman, D.; Helba, B.; Kalloo, A.; Liopyris, K.; Marchetti, M.; et al. Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the International Skin Imaging Collaboration (ISIC). arXiv 2019, arXiv:1902.03368.
18. Tschandl, P.; Rosendahl, C.; Kittler, H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci. Data 2018, 5, 180161.
19. Kawahara, J.; Daneshvar, S.; Argenziano, G.; Hamarneh, G. Seven-point checklist and skin lesion classification using multitask multimodal neural networks. IEEE J. Biomed. Health Inform. 2019, 23, 538–546.
20. Li, T.; Sahu, A.K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; Smith, V. Federated optimization in heterogeneous networks. Proc. Mach. Learn. Syst. 2020, 2, 429–450.
21. Li, T.; Hu, S.; Beirami, A.; Smith, V. Ditto: Fair and robust federated learning through personalization. In Proceedings of the 38th International Conference on Machine Learning, Virtual Event, 18–24 July 2021; pp. 6357–6368.

Figure 1. Schematic diagram of the proposed framework.

Figure 2. Sample images of the diabetic retinopathy datasets.

Figure 3. Sample images of the skin lesion datasets.

Table 1. Class distribution of the diabetic retinopathy datasets used in the main federated experiments.

Dataset	Normal (0)	NPDR (1)	PDR (2)
APTOS [13]	1805	1562	295
DDR [14]	6266	5343	913
DRD [15]	5000	8608	708

Table 2. Class distribution of the skin lesion datasets used in the supplementary external experiments.

Dataset	Benign	Malignant
ISIC [17]	584	584
HAM10000 [18]	1113	1113
DERM7pt [19]	252	252

Table 3. Main results on the diabetic retinopathy benchmark. Results are reported as mean ± standard deviation over five seeds. The best mean is shown in bold. The upward arrow indicates that higher values are better.

Method	Macro-F1 ↑	Balanced Acc. ↑	AUROC ↑	Mean-Site ↑	Worst-Site ↑
FedAvg [1]	0.702 ± 0.008	0.731 ± 0.009	0.862 ± 0.007	0.706 ± 0.009	0.654 ± 0.012
FedProx [20]	0.713 ± 0.007	0.742 ± 0.008	0.871 ± 0.006	0.718 ± 0.008	0.669 ± 0.011
Ditto [21]	0.726 ± 0.006	0.751 ± 0.007	0.879 ± 0.006	0.730 ± 0.008	0.681 ± 0.010
FedRep [4]	0.735 ± 0.006	0.759 ± 0.007	0.885 ± 0.005	0.739 ± 0.007	0.691 ± 0.009
FedAMP [5]	0.741 ± 0.005	0.766 ± 0.006	0.889 ± 0.005	0.744 ± 0.007	0.698 ± 0.009
FedSDR [7]	0.754 ± 0.005	0.778 ± 0.006	0.901 ± 0.004	0.758 ± 0.006	0.712 ± 0.008
FedCSL-inspired [8]	0.747 ± 0.006	0.771 ± 0.006	0.894 ± 0.005	0.749 ± 0.007	0.705 ± 0.008
CarPe-FL	0.781 ± 0.004	0.804 ± 0.005	0.921 ± 0.003	0.786 ± 0.005	0.744 ± 0.006
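
The Mean-Site and Worst-Site columns and the "mean ± standard deviation over five seeds" format can be read as a two-step aggregation: first across sites within a seed, then across seeds. The sketch below illustrates that reading under explicit assumptions: Mean-Site is taken as the unweighted average of per-site scores, Worst-Site as the minimum per-site score, and the spread as the sample standard deviation. The score matrix holds placeholder values, not results from this study, and the function name `summarize` is hypothetical.

```python
from statistics import mean, stdev

# Placeholder scores, NOT results from the paper.
# Each row is one seed; each column is one site.
scores = [
    [0.70, 0.75, 0.65],
    [0.72, 0.74, 0.66],
    [0.71, 0.76, 0.64],
    [0.69, 0.75, 0.67],
    [0.73, 0.75, 0.63],
]


def summarize(per_seed_site):
    """Aggregate a seeds-by-sites score matrix into (mean, std) summaries."""
    # Step 1: collapse sites within each seed.
    mean_site = [mean(row) for row in per_seed_site]  # assumed: unweighted site average
    worst_site = [min(row) for row in per_seed_site]  # assumed: weakest site per seed
    # Step 2: collapse seeds. stdev() is the sample standard deviation (n - 1);
    # whether the paper uses n or n - 1 is an assumption here.
    return {
        "mean_site": (mean(mean_site), stdev(mean_site)),
        "worst_site": (mean(worst_site), stdev(worst_site)),
    }


summary = summarize(scores)
for name, (mu, sd) in summary.items():
    print(f"{name}: {mu:.3f} ± {sd:.3f}")
```

Taking the minimum within each seed before averaging means the worst site may differ from seed to seed; an alternative convention would fix the worst site after averaging over seeds.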

Table 4. Results on the supplementary skin lesion benchmark. Results are reported as mean ± standard deviation over five seeds. The best mean is shown in bold. The upward arrow indicates that higher values are better.

Method	AUROC ↑	Accuracy ↑	Sensitivity ↑	Specificity ↑	Worst-Site ↑
FedAvg [1]	0.846 ± 0.009	0.781 ± 0.010	0.769 ± 0.011	0.794 ± 0.010	0.742 ± 0.013
FedProx [20]	0.854 ± 0.008	0.790 ± 0.009	0.776 ± 0.010	0.803 ± 0.009	0.751 ± 0.012
Ditto [21]	0.861 ± 0.007	0.798 ± 0.008	0.785 ± 0.009	0.810 ± 0.008	0.760 ± 0.011
FedRep [4]	0.868 ± 0.007	0.804 ± 0.008	0.793 ± 0.009	0.816 ± 0.008	0.767 ± 0.010
FedAMP [5]	0.872 ± 0.006	0.808 ± 0.008	0.798 ± 0.008	0.820 ± 0.007	0.771 ± 0.010
FedSDR [7]	0.883 ± 0.005	0.819 ± 0.007	0.810 ± 0.008	0.829 ± 0.007	0.786 ± 0.009
FedCSL-inspired [8]	0.876 ± 0.006	0.812 ± 0.007	0.804 ± 0.008	0.821 ± 0.007	0.778 ± 0.010
CarPe-FL	0.897 ± 0.004	0.833 ± 0.006	0.826 ± 0.007	0.841 ± 0.006	0.801 ± 0.008
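
Because every skin lesion dataset in Table 2 is class-balanced, the columns of Table 4 admit a simple consistency check. With P positives and N negatives, accuracy is (TP + TN)/(P + N); when P = N this reduces to (sensitivity + specificity)/2. The sketch below applies that identity to the reported means, assuming that the evaluation splits preserve the 1:1 balance, which is not stated in this section. The names `table4` and `balanced_identity_gap` are illustrative.

```python
# Reported means copied from Table 4: (accuracy, sensitivity, specificity).
table4 = {
    "FedAvg": (0.781, 0.769, 0.794),
    "FedProx": (0.790, 0.776, 0.803),
    "Ditto": (0.798, 0.785, 0.810),
    "FedRep": (0.804, 0.793, 0.816),
    "FedAMP": (0.808, 0.798, 0.820),
    "FedSDR": (0.819, 0.810, 0.829),
    "FedCSL-inspired": (0.812, 0.804, 0.821),
    "CarPe-FL": (0.833, 0.826, 0.841),
}


def balanced_identity_gap(acc, sens, spec):
    """Absolute gap between accuracy and the mean of sensitivity and specificity."""
    return abs(acc - (sens + spec) / 2)


gaps = {method: balanced_identity_gap(*row) for method, row in table4.items()}

for method, gap in gaps.items():
    print(f"{method}: gap = {gap:.4f}")
```

For every method the gap is at most about 0.001, which is what rounding to three decimals would produce, so the accuracy, sensitivity, and specificity columns are mutually consistent under the balanced-split assumption.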

Table 5. Structural properties of the learned cluster-wise global backbones on the diabetic retinopathy benchmark. Results are reported as mean ± standard deviation over five seeds. An upward arrow indicates that higher values are better, and a downward arrow indicates that lower values are better.

Statistic	Cluster 1	Cluster 2	Cluster 3
Average edge confidence ↑	0.81 ± 0.02	0.78 ± 0.03	0.84 ± 0.02
Bootstrap edge stability ↑	0.76 ± 0.03	0.72 ± 0.04	0.79 ± 0.03
Backbone sparsity (edge ratio) ↓	0.18 ± 0.01	0.21 ± 0.02	0.17 ± 0.01

Table 6. Quantitative analysis of the learned representations on the diabetic retinopathy benchmark. Results are reported as mean ± standard deviation over five seeds. A lower site-separation ratio (↓) is better, while a higher class-separation ratio (↑) is better. The best result in each column is shown in bold.

Method	Site-Separation Ratio ↓	Class-Separation Ratio ↑
FedAvg [1]	1.42 ± 0.06	1.00 ± 0.04
FedRep [4]	1.28 ± 0.05	1.06 ± 0.03
FedSDR [7]	1.11 ± 0.04	1.14 ± 0.03
CarPe-FL	0.94 ± 0.03	1.19 ± 0.02

Table 7. Ablation study of CarPe-FL on the diabetic retinopathy benchmark. Results are reported as mean ± standard deviation over five seeds. The best result in each column is shown in bold. The upward arrow indicates that higher values are better.

Variant	Macro-F1 ↑	AUROC ↑	Worst-Site ↑
Without clustering	0.762 ± 0.005	0.907 ± 0.004	0.724 ± 0.007
Without structure masking	0.758 ± 0.006	0.904 ± 0.005	0.719 ± 0.008
Without edge-wise personalization	0.769 ± 0.004	0.913 ± 0.004	0.731 ± 0.006
Full CarPe-FL	0.781 ± 0.004	0.921 ± 0.003	0.744 ± 0.006
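
The contribution of each component in Table 7 can be summarized as the drop in the mean score when that component is removed. The sketch below computes these differences from the reported means only; it ignores the standard deviations and is not a significance test. The names `full`, `ablations`, and `drops` are illustrative.

```python
# Reported means copied from Table 7.
full = {"macro_f1": 0.781, "auroc": 0.921, "worst_site": 0.744}

ablations = {
    "Without clustering": {"macro_f1": 0.762, "auroc": 0.907, "worst_site": 0.724},
    "Without structure masking": {"macro_f1": 0.758, "auroc": 0.904, "worst_site": 0.719},
    "Without edge-wise personalization": {"macro_f1": 0.769, "auroc": 0.913, "worst_site": 0.731},
}

# Drop relative to the full model, rounded to the table's three decimals.
drops = {
    variant: {metric: round(full[metric] - value, 3) for metric, value in row.items()}
    for variant, row in ablations.items()
}

for variant, row in drops.items():
    print(variant, row)
```

On these means, removing structure masking costs the most on all three metrics (0.023 Macro-F1, 0.017 AUROC, 0.025 Worst-Site), and removing edge-wise personalization costs the least (0.012, 0.008, 0.013).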
