Article

Causal Inference-Based Self-Supervised Cross-Domain Fundus Image Segmentation

by Qiang Li 1, Qiyi Zhang 1, Zheqi Zhang 2, Hengxin Liu 1,* and Weizhi Nie 2

1 School of Microelectronics, Tianjin University, Tianjin 300072, China
2 School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(9), 5074; https://doi.org/10.3390/app15095074
Submission received: 27 March 2025 / Revised: 27 April 2025 / Accepted: 30 April 2025 / Published: 2 May 2025

Abstract

Accurate glaucoma diagnosis relies on precise segmentation of the optic disc (OD) and optic cup (OC) in retinal images. However, despite the development of numerous automatic segmentation models, the lack of annotations in the target domain and domain shift among datasets continue to limit their segmentation performance. To address these issues, we propose a Causal Self-Supervised Network (CSSN) that leverages self-supervised learning to enhance model performance. First, we construct a Structural Causal Model (SCM) and employ backdoor adjustment to convert the conventional conditional distribution into an interventional distribution, effectively severing the influence of style information on feature extraction and pseudo-label generation. Subsequently, the low-frequency components of source and target domain images are exchanged via Fourier transform to simulate cross-domain style transfer. The original target images and their style-transferred counterparts are then processed by a dual-path segmentation network to extract their respective features, and a confidence-based pseudo-label fusion strategy is employed to generate more reliable pseudo-labels for self-supervised learning. In addition, we employ adversarial training and cross-domain contrastive learning to further reduce style discrepancies between domains. The former aligns feature distributions across domains using a feature discriminator, effectively mitigating the adverse effects of style inconsistency, while the latter minimizes the feature distance between original and style-transferred images, thereby ensuring structural consistency. Experimental results demonstrate that our method achieves more accurate OD and OC segmentation in the target domain during testing, thereby confirming its efficacy in cross-domain adaptation tasks.

1. Introduction

Glaucoma is the world’s second leading cause of blindness [1], with its primary pathologic mechanism attributed to the degeneration of retinal ganglion cells. Without timely intervention, the progressive damage to the optic nerve can result in irreversible vision loss. However, early screening and prompt treatment can significantly reduce these risks. The cup-to-disc ratio (CDR)—defined as the ratio of the vertical cup diameter (VCD) to the vertical disc diameter (VDD)—is a widely used diagnostic indicator for glaucoma, as a higher CDR often signals an increased risk of the disease. Given that manually delineating the optic disc (OD) and optic cup (OC) and computing the CDR is both time-consuming and labor-intensive, there is an urgent need for research into automated segmentation methods to improve the efficiency and accuracy of glaucoma diagnosis.
In traditional methods for automatic segmentation of the OD and OC, a variety of approaches have been proposed, including active contour models [2,3], Hough-transform-based detection [2,4,5], morphological operations [6,7], weakly supervised techniques [8], among others. For example, Mary et al. [2] proposed a cascaded pipeline in which the red channel is first enhanced by adaptive histogram equalization and subjected to morphological vessel removal and binarization; a circular Hough transform then generates initial disc contours, which are iteratively evolved under the gradient vector flow (GVF) energy functional to accurately delineate the disc boundary. Zhu et al. [5] described a similar pipeline that converts fundus images to a luminance representation for artifact removal, extracts edges via Sobel or Canny operators, detects circular candidates via the Hough transform, and refines them by intensity thresholding to precisely locate the disc. Welfer et al. [6] designed a two-stage adaptive morphological framework: in Stage 1, vessel skeletonization—using reconstruction, top-hat filtering and skeletonization—localizes the disc region; in Stage 2, a marker-controlled watershed transform adaptively segments the exact boundary. Choukikar et al. [9] converted color retina images to grayscale with histogram equalization, performed multilevel thresholding followed by morphological erosion and dilation to extract boundary candidates, and then fitted a circle to the resulting points to determine the disc center and radius.
However, these traditional approaches typically depend on handcrafted feature extraction, heuristic processing pipelines, and domain-specific parameter tuning, which render them sensitive to image artifacts, inter-patient variability, and inconsistent acquisition conditions. With the advent of deep learning, convolutional neural networks and related architectures have been widely adopted for OD and OC segmentation, learning hierarchical feature representations directly from data and demonstrating superior robustness and accuracy.
Although numerous deep learning–based automated segmentation models have achieved excellent performance in segmenting OD and OC, most rely on training with labeled source-domain images. In practical applications, target-domain images are typically unlabeled, and variations in imaging equipment and acquisition conditions introduce a significant distribution discrepancy between source and target domains, resulting in a marked decline in conventional models’ target-domain performance. While assembling a fundus-image dataset that encompasses all domain styles could enable the model to adapt to diverse styles and thereby mitigate the effects of domain shift, such an undertaking is extremely time-consuming, labor-intensive, and demands substantial additional computational resources. Consequently, recent studies have explored domain adaptation techniques to enhance the generalization ability of cross-domain segmentation. However, existing methods [10,11,12,13,14] suffer from two major limitations: first, they predominantly utilize labeled source-domain data for supervision, lacking effective constraints on unlabeled images; second, the use of a shared network for processing images from different domains makes it challenging to completely eliminate interference caused by domain style differences.
To tackle the issue of domain style interference in cross-domain segmentation, we propose a self-supervised cross-domain segmentation method based on causal inference. First, to generate more reliable pseudo-labels, we introduce a causal inference strategy. By constructing a structural causal model (SCM), we explicitly delineate the causal relationships among target images, feature maps, domain style interference, and prediction outcomes, and employ a back-door adjustment strategy to sever the confounding pathways introduced by domain style. This converts the conditional distribution of pseudo-label generation into an interventional distribution, thereby enhancing the reliability of the generated pseudo-labels and providing robust support for subsequent model training. In this process, we utilize the Fourier transform to achieve style transformation between source and target domain images, simulating the stratification operation in back-door adjustment so that the model can fully account for interference under different domain styles. Moreover, we design a dual-path segmentation network that processes source and target domain style images separately, ensuring that each network exclusively extracts features corresponding to a single domain style, thus further mitigating the adverse effects of domain style interference. In addition, we incorporate adversarial learning and cross-domain contrastive learning strategies. Through adversarial training, the model can align the feature representations of original and style-transferred images, thereby effectively narrowing the visual gap between domains and mitigating the adverse effects of style discrepancies on segmentation performance. Meanwhile, cross-domain contrastive learning minimizes the distance between positive sample pairs, thereby reinforcing the structural consistency between images of different styles. These strategies not only help to further reduce domain style interference but also enhance the model’s generalization capability on cross-domain data.
Our main contributions can be summarized as follows:
  • We propose a causal inference-based pseudo-label fusion module for self-supervised learning that effectively reduces domain style bias and imposes constraints on target domain images.
  • We introduce adversarial learning [15,16] and cross-domain contrastive learning mechanisms, which reduce the distribution discrepancy between source and target domains.
  • We conduct extensive experiments on three publicly available datasets, and the results fully demonstrate the effectiveness of the proposed modules as well as a significant improvement in overall performance.

2. Related Works

Currently, deep learning models have been widely applied in the medical field [17,18,19,20], which has led to the development of numerous automatic segmentation networks for the OD and OC. Fu et al. [21] designed an end-to-end multi-label deep network (M-Net) that jointly segments the OD and OC through a multi-scale U-Net architecture with side-output layers, while incorporating a polar transformation to enhance spatial constraints and balance the data, thereby improving the accuracy of the cup-to-disc ratio for glaucoma screening.
However, due to domain shift phenomena that may degrade model performance, researchers have begun to incorporate domain adaptation methods to alleviate the interference caused by distribution discrepancies across different domains. For instance, Wang et al. [10] designed a patch-based output space adversarial learning (pOSAL) framework, which addresses the domain adaptation problem by combining morphology-aware segmentation loss with patch-level adversarial learning. Liu et al. [22] designed the CFEA network that simultaneously applies adversarial losses at multiple levels in both the encoder and decoder to extract domain-invariant features; it further incorporates a self-ensembling mechanism by building a teacher network based on historical predictions to guide the training of the student network, thereby achieving temporal smoothing. Chen et al. [12] designed an IOSUDA framework that, in the input space, employs image translation to decompose images into shared content features and domain-specific style features, and in the output space, utilizes adversarial learning to enforce consistency in segmentation results. Xu et al. [14] proposed a domain adaptation network named MeFDA, which enhances prediction confidence in the target domain via both direct and adversarial entropy minimization and improves semantic consistency between the source and target domains by exchanging low-frequency information through Fourier transform. He et al. [23] designed a self-ensembling-based model that jointly extracts mask and boundary information, enforcing their consistency with a mask-boundary segmentation loss; simultaneously, they adopted an output-level adversarial domain adaptation technique to align the prediction results of the source and target domains.
Recently, some studies have begun to explore the use of causal inference methods to address the challenges posed by domain shift. Chen et al. [24] designed the CIADA method, which constructs a causal graph for the source and target domain data using a constraint-based PC algorithm and performs feature classification and mapping based on causal influences; subsequently, graph structure and temporal sequence features are extracted via a two-dimensional processing approach, and finally, an adversarial domain adaptation and fine-tuning strategy is employed to build the detection model.

3. Methods

3.1. Causal Inference

In the self-supervised cross-domain fundus image segmentation task, the model is primarily supervised by source-domain images and their corresponding labels. However, as illustrated in Figure 1, the same sample can exhibit substantial visual differences under different domain styles, and this style-induced variation inevitably perturbs the training process, introducing systematic biases into pseudo-label generation. Specifically, during feature extraction, the model may capture intrinsic style noise from images in a particular domain, leading to biased pseudo-labels and ultimately degrading segmentation performance.
To mitigate this confounding effect, we conducted a causal analysis of the task and constructed a corresponding SCM, as illustrated in Figure 2. This model comprises four variables: the target domain image $X$, the feature map $F$ extracted from the target domain image, the interference information $C$ induced by domain style, and the segmentation result $Y$. In this causal graph, the arrow $X \to F$ indicates that the feature map $F$ is extracted from the target domain image $X$, while the arrow $F \to Y$ signifies that the pseudo-label $Y$ is generated based on the feature $F$. Moreover, $C \to \{F, Y\}$ shows that the domain style shift interferes with both the feature extraction process and the pseudo-label generation, thereby acting as a confounding factor.
Due to the existence of a backdoor path, directly generating pseudo-labels using the conditional probability $P(Y \mid F)$ leads to bias. Therefore, we adopt a causal intervention approach to cut off the backdoor path and eliminate the confounding effect, replacing the conditional distribution with the interventional distribution $P(Y \mid do(F))$ for pseudo-label generation. To accurately compute $P(Y \mid do(F))$, we intervene on $F$ so as to sever its causal connections with all non-descendant nodes, thereby ensuring that the remaining causal pathways faithfully reflect the relationship $P(Y \mid do(F))$.
In this process, we assume that the confounder C can be observed and stratified; in our context, the set C represents the domain style information from both the source and target domains. To this end, we introduce the Fourier transform to simulate such stratification—by exchanging the low-frequency components between the source and target domain images, we effectively strip away the confounding domain style components. Subsequently, both the intrinsic feature map extracted from the target domain image and the feature map obtained after applying a source-domain style transformation are concurrently fed into the pseudo-label generation module to produce pseudo-labels based on causal inference.
Based on the above analysis, the backdoor adjustment can be expressed as:
$$P(Y \mid do(F)) = \sum_{c \in C} P(Y \mid F, c)\, P(c \mid do(F)) = \sum_{c \in C} P(Y \mid F, c)\, P(c)$$
where $P(Y \mid F, c)$ denotes the conditional probability of generating the pseudo-label $Y$ given the feature $F$ and the confounding factor $c$, and $P(c)$ represents the marginal distribution of the confounder.
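For intuition, only two style strata actually arise in our setting: the source style $c_s$ and the target style $c_t$. Assuming, purely for illustration, equal prior weights $P(c_s) = P(c_t) = 1/2$, the adjustment reduces to a two-term mixture:
$$P(Y \mid do(F)) = \tfrac{1}{2}\, P(Y \mid F, c_s) + \tfrac{1}{2}\, P(Y \mid F, c_t)$$
which is exactly the form approximated by the dual-path prediction fusion of Section 3.2, with confidence-based weights taking the place of the equal priors.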

3.2. Causal Inference-Based Pseudo-Label Fusion Module

Based on the above causal inference strategy, we apply Fourier transform to the target-domain images for style transfer. As noted in [14], the low-frequency components of an image generally capture properties such as background, illumination, and other style-related attributes. Therefore, by replacing the low-frequency components of the original image with those obtained via Fourier transform, we can achieve an approximate image-domain transformation.
First, we denote the Fast Fourier Transform (FFT) [25] by $\mathcal{F}_{ou}(\cdot)$ and its inverse by $\mathcal{F}_{ou}^{-1}(\cdot)$. In our algorithm, we apply the FFT to each image to obtain its spectra (amplitude spectrum $f_{am}(x)$ and phase spectrum $f_{ph}(x)$):
$$f_{am}(x_s),\ f_{ph}(x_s) = \mathcal{F}_{ou}(x_s), \qquad f_{am}(x_t),\ f_{ph}(x_t) = \mathcal{F}_{ou}(x_t)$$
During training, we randomly select a source-domain image $x_s$ and a target-domain image $x_t$, and replace the low-frequency part of $f_{am}(x_t)$ with that of $f_{am}(x_s)$ to obtain the style-transferred image $x_{t \to s}$. Specifically, we treat the center of the amplitude spectrum as the zero-frequency point and construct a square window $\alpha$ of side length $a$ centered at this point. We then remove the contents of $f_{am}(x_t)$ within $\alpha$ and fill them with the corresponding contents of $f_{am}(x_s)$, resulting in the modified amplitude spectrum $f_{am}(x_{t \to s})$. Finally, by combining $f_{am}(x_{t \to s})$ with the original phase spectrum $f_{ph}(x_t)$ and applying the inverse FFT, we obtain the style-transferred image:
$$x_{t \to s} = \mathcal{F}_{ou}^{-1}\!\left(f_{am}(x_{t \to s}),\ f_{ph}(x_t)\right)$$
The resulting image $x_{t \to s}$ retains the ground-truth segmentation mask of the original target-domain image $x_t$, while its domain gap relative to the source-domain image $x_s$ is significantly reduced, making it effectively aligned with the source domain.
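As a concrete illustration, the following NumPy sketch performs this low-frequency amplitude swap; the function name, the per-channel processing, and the default window size $a$ are illustrative assumptions rather than the authors' exact implementation.

```python
import numpy as np

def fourier_style_transfer(x_t, x_s, a=64):
    """Swap the central (low-frequency) amplitude of target image x_t with that of
    source image x_s while keeping x_t's phase. Both inputs are H x W x C arrays
    of identical shape; the returned image carries the source-domain style."""
    x_ts = np.zeros_like(x_t, dtype=np.float32)
    h, w = x_t.shape[:2]
    cy, cx = h // 2, w // 2          # zero-frequency point after fftshift
    half = a // 2
    for c in range(x_t.shape[2]):    # each channel is transformed independently
        ft = np.fft.fftshift(np.fft.fft2(x_t[..., c]))
        fs = np.fft.fftshift(np.fft.fft2(x_s[..., c]))
        amp_t, pha_t = np.abs(ft), np.angle(ft)
        amp_s = np.abs(fs)
        # fill the a x a window around the spectrum center with the source amplitude
        amp_t[cy - half:cy + half, cx - half:cx + half] = \
            amp_s[cy - half:cy + half, cx - half:cx + half]
        # recombine the modified amplitude with the original target phase
        ft_new = np.fft.ifftshift(amp_t * np.exp(1j * pha_t))
        x_ts[..., c] = np.real(np.fft.ifft2(ft_new))
    return np.clip(x_ts, 0, 255)

# usage (hypothetical arrays): x_t2s = fourier_style_transfer(target_img, source_img, a=64)
```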
This operation is analogous to the stratification process in backdoor adjustment, where different domain styles are considered separately. Next, we feed the source-style transformed image $x_{t \to s}$ and the original target domain image $x_t$ into two segmentation networks that share the same backbone architecture (DeepLabV3+ [26]) but do not share weights, thereby obtaining their corresponding feature maps. This design effectively decouples the feature extraction process under different domain styles, which better implements the stratification and more accurately reflects the relationship $P(Y \mid do(F))$.
In terms of backbone architecture, we adopt MobileNetV2 [27]—pretrained on ImageNet—as the encoder within DeepLabV3+. MobileNetV2’s core building block is the inverted residual: a 1 × 1 convolution first expands the low-dimensional input features to a higher dimension, followed by a 3 × 3 depthwise convolution for spatial filtering, and finally a 1 × 1 projection back to the original channel dimensionality. When the input and output channels match, a residual connection is applied, which mitigates gradient vanishing and feature degradation while dramatically reducing both parameter count and computational cost. This lightweight feature extractor therefore strikes an effective balance between representational capacity and inference speed, making it particularly well suited to medical imaging tasks with limited data and constrained hardware resources.
Built upon this encoder, DeepLabV3+ incorporates an Atrous Spatial Pyramid Pooling (ASPP) module that applies parallel atrous convolutions at multiple dilation rates to capture contextual information from local to global scales without sacrificing feature-map resolution. Its lightweight decoder then fuses shallow, high-resolution feature maps with the ASPP outputs and employs depthwise separable transposed convolutions for upsampling, thereby refining boundary details. This backbone design achieves a harmonious trade-off between multi-scale adaptability, high-fidelity edge recovery, and computational efficiency, effectively addressing the large scale variations and fine boundary requirements inherent to OD/OC segmentation.
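One possible instantiation of the dual-path backbone is sketched below using the segmentation_models_pytorch package; the package choice and the two-channel (disc/cup) output head are assumptions made for illustration, not the authors' released code.

```python
import segmentation_models_pytorch as smp

def build_path():
    # DeepLabV3+ with an ImageNet-pretrained MobileNetV2 encoder (ASPP + light decoder)
    return smp.DeepLabV3Plus(
        encoder_name="mobilenet_v2",
        encoder_weights="imagenet",
        in_channels=3,
        classes=2,                 # two sigmoid maps: optic disc and optic cup
    )

# identical architectures, separate (non-shared) weights: one path per domain style
M_s = build_path()   # processes source-style images (x_s and x_t->s)
M_t = build_path()   # processes target-style images (x_t and x_s->t)
```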
After obtaining the feature maps from both the target domain image and its source-style transformed counterpart, we proceed to generate pseudo-labels using these features. The structure of the pseudo-label generation module is shown in Figure 3. In this figure, the dashed arrow denotes that the maximum-square loss $L_m$ is applied to the fused pseudo-label $\hat{y}_t^f$ during training.
In this process, we also introduce a confidence-based mechanism to enhance the reliability of the pseudo-labels. First, for the prediction outputs of the segmentation network $M_s$ (or $M_t$), we compute the confidence of each pixel; then, we fuse the resulting confidence maps obtained from the same image (e.g., $x_t$ and $x_{t \to s}$) and perform pixel-level normalization.
It is important to note that $\hat{y}_t$ comprises the OD prediction probability map $\hat{y}_t^{(disc)}$ and the OC prediction probability map $\hat{y}_t^{(cup)}$, whereas $\hat{y}_{t \to s}$ consists of $\hat{y}_{t \to s}^{(disc)}$ and $\hat{y}_{t \to s}^{(cup)}$. Accordingly, we compute the confidence maps for these four probability maps—denoted as $\hat{O}_t^{(disc)}$, $\hat{O}_{t \to s}^{(disc)}$, $\hat{O}_t^{(cup)}$, and $\hat{O}_{t \to s}^{(cup)}$—and then concatenate $\hat{O}_t^{(disc)}$ with $\hat{O}_{t \to s}^{(disc)}$ and feed the result into a softmax layer for normalization. The computation process is illustrated in Figure 4. An analogous operation is performed for $\hat{O}_t^{(cup)}$ and $\hat{O}_{t \to s}^{(cup)}$. The resulting normalized confidence maps $\hat{O}_t$ and $\hat{O}_{t \to s}$ are then used to weight the corresponding probability maps and produce the pixel-wise pseudo-label $\hat{y}_t^f$, which serves as the self-supervision target in the target-domain segmentation loss.
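A minimal PyTorch sketch of the confidence-weighted fusion for one class (disc or cup) is shown below; the specific per-pixel confidence measure (distance of the sigmoid probability from 0.5) and the final thresholding step are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def fuse_pseudo_label(y_t, y_ts):
    """y_t, y_ts: (B, H, W) sigmoid probabilities for one class, predicted by M_t
    on x_t and by M_s on x_t->s, respectively."""
    # per-pixel confidence: distance of the probability from the decision boundary 0.5
    o_t, o_ts = (y_t - 0.5).abs(), (y_ts - 0.5).abs()
    # concatenate the two confidence maps and normalize them pixel-wise with softmax
    w = F.softmax(torch.stack([o_t, o_ts], dim=0), dim=0)      # (2, B, H, W)
    # confidence-weighted combination of the two probability maps
    y_fused = w[0] * y_t + w[1] * y_ts
    # binarize to obtain the pixel-wise pseudo-label used for self-supervision
    return (y_fused > 0.5).float(), y_fused
```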
Moreover, because the prediction probabilities are imbalanced across classes (in fundus images, the OD region typically surrounds the OC region and the boundary of the OD is more distinct), the OD class tends to exhibit higher prediction confidence, which may bias training. To mitigate this bias and further enhance the reliability of the fused pseudo-label $\hat{y}_t^f$, we incorporate the max-square loss [28], which attenuates the dominant influence of high-confidence classes during training. The max-square loss $L_m$ is defined as follows:
$$L_m = -\sum_{i=1}^{H \times W} \sum_{c=1}^{C} \left[ \left(\hat{y}_t^{f(i,c)}\right)^2 + \left(1 - \hat{y}_t^{f(i,c)}\right)^2 \right]$$
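In code, this regularizer can be written compactly as below; the mean reduction (rather than the raw pixel sum) and the leading minus sign, which turns maximization of the squared terms into a minimizable loss, are implementation assumptions.

```python
import torch

def max_square_loss(y_fused):
    """y_fused: (B, C, H, W) fused pseudo-label probabilities for the OD/OC channels.
    Maximizing p^2 + (1 - p)^2 pushes predictions away from 0.5, but its gradient is
    linear in p, so confident (OD-dominated) pixels do not overwhelm training the way
    they would under entropy minimization."""
    return -((y_fused ** 2) + ((1.0 - y_fused) ** 2)).mean()
```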
Finally, we supervise the model training with the fused pseudo-labels. The pixel-level cross-entropy loss is adopted as the segmentation loss, defined as:
$$L_{seg}(x, y) = -\sum_{i=1}^{H \times W} \sum_{c=1}^{C} y^{(i,c)} \cdot \log \sigma\!\left(F(x)^{(i,c)}\right)$$
where $H$ and $W$ represent the height and width of the image, $i$ indexes the $i$-th pixel, $c$ indexes the category (OD or OC), and $\sigma(\cdot)$ is the sigmoid function. $y^{(i,c)} \in \{0, 1\}$ indicates whether the $i$-th pixel in the segmentation map belongs to category $c$, and $\sigma(F(x))$ (i.e., $\hat{y}$) denotes the predicted segmentation of $x$. The target segmentation loss corresponding to the pseudo-labels is defined as:
$$L_{seg}^{t} = L_{seg}\!\left(x_t, \hat{y}_t^f\right) + L_{seg}\!\left(x_{t \to s}, \hat{y}_t^f\right)$$
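A sketch of these losses is given below; note that binary_cross_entropy_with_logits also includes the complementary $(1-y)\log(1-\sigma(\cdot))$ term, so it is a standard multi-label variant of Equation (5) rather than a literal transcription.

```python
import torch.nn.functional as F

def seg_loss(logits, target):
    """Pixel-wise cross-entropy for the two sigmoid OD/OC maps.
    logits: raw outputs F(x) of shape (B, 2, H, W); target: labels in [0, 1]."""
    return F.binary_cross_entropy_with_logits(logits, target)

def seg_loss_target(logits_t, logits_ts, pseudo_label):
    # both paths are supervised with the same fused pseudo-label,
    # as in the target segmentation loss above
    return seg_loss(logits_t, pseudo_label) + seg_loss(logits_ts, pseudo_label)
```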

3.3. Source Domain Image Style Transfer and Adversarial Training Mechanism

To fully leverage labeled source domain images for domain adaptation, we also apply a Fourier transform to the source domain images to generate target-style counterparts, denoted as $x_{s \to t}$. Subsequently, the original source domain images $x_s$ and their style-transferred counterparts $x_{s \to t}$ are fed into the source segmentation network $M_s$ and the target segmentation network $M_t$, respectively. This design ensures that each network processes images with a single style, thereby further reducing the interference introduced by mixed domain styles and yielding purer feature extraction. The overall model architecture is illustrated in Figure 5.
For the labeled source domain images, we supervise the segmentation results using a pixel-wise cross-entropy loss. Specifically, let $y_s$ denote the ground-truth segmentation labels for $x_s$. Based on the same segmentation loss function as in Equation (5), the supervised segmentation loss for the source images is defined as:
$$L_{seg}^{s} = L_{seg}(x_s, y_s) + L_{seg}(x_{s \to t}, y_s)$$
Furthermore, to reduce the visual discrepancy between the original and the style-transferred images, we introduce an adversarial training mechanism. Specifically, we employ a feature discriminator to compare the features extracted from the original source domain images with those from the style-transferred images. Within our dual-path network architecture, the adversarial loss can be expressed as:
$$L_{adv} = L_{adv}\!\left(x_s, x_{t \to s}\right) + L_{adv}\!\left(x_t, x_{s \to t}\right)$$
Through this adversarial training, the model can further mitigate the adverse effects of inter-domain style discrepancies on segmentation performance.
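The sketch below shows one way to realize the feature discriminator and the two sides of the adversarial objective; the network depth, channel widths, and the non-saturating GAN formulation are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDiscriminator(nn.Module):
    """Small fully convolutional discriminator operating on encoder feature maps."""
    def __init__(self, in_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 1, 3, padding=1),       # patch-level real/fake logits
        )

    def forward(self, feat):
        return self.net(feat)

def adv_loss_generator(disc, feat_transferred):
    """Segmentation-network-side term: make style-transferred features look 'real'
    to the discriminator, aligning them with the original-domain features."""
    pred = disc(feat_transferred)
    return F.binary_cross_entropy_with_logits(pred, torch.ones_like(pred))

def adv_loss_discriminator(disc, feat_original, feat_transferred):
    """Discriminator-side term: separate original features from transferred ones."""
    pred_real, pred_fake = disc(feat_original), disc(feat_transferred.detach())
    return (F.binary_cross_entropy_with_logits(pred_real, torch.ones_like(pred_real))
            + F.binary_cross_entropy_with_logits(pred_fake, torch.zeros_like(pred_fake)))
```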

3.4. Cross-Domain Contrastive Learning

The underlying idea is that features extracted from images before and after domain transformation should preserve structural consistency. To achieve this, we design a cross-domain contrastive loss [29].
Specifically, let $x_p$ and $x_q$ denote a pair of positive samples. The contrastive loss is computed as:
$$L_{con}(x_p, x_q) = -\log \frac{\exp\!\left(-d\!\left(F(x_p), F(x_q)\right) / 2\tau^2\right)}{\sum_{j=1}^{N} \mathbb{1}_{[p \neq j]} \exp\!\left(-d\!\left(F(x_p), F(x_j)\right) / 2\tau^2\right)},$$
where $\exp(-d(\cdot)/2\tau^2)$ denotes the Gaussian kernel function used to measure similarity, $\mathbb{1}_{[p \neq j]} \in \{0, 1\}$ is an indicator function that equals 1 if $p \neq j$, $d(\cdot)$ represents the Euclidean distance between feature maps, and $N$ is the total number of images in a batch [30].
In our framework, there are two types of positive pairs: one consisting of a target image $x_t$ and its source-style transformed version $x_{t \to s}$, and another comprising a source image $x_s$ and its target-style transformed version $x_{s \to t}$. The overall cross-domain contrastive loss is then defined as:
$$L_c = L_{con}\!\left(x_s, x_{s \to t}\right) + L_{con}\!\left(x_t, x_{t \to s}\right)$$
By reducing the distance between the feature maps of these positive pairs—extracted by $M_s$ and $M_t$—the network better preserves structural similarity and enhances prediction consistency. This improved consistency further bolsters the reliability of the fused pseudo-labels used in subsequent training stages.
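The following sketch implements the Gaussian-kernel contrastive term as reconstructed above; flattening the feature maps into vectors and the default temperature $\tau$ are assumptions made for illustration.

```python
import torch

def contrastive_loss(features, p, q, tau=1.0):
    """features: (N, D) flattened feature maps of all images in a batch;
    p, q: indices of a positive pair, e.g. x_t and its transferred version x_t->s."""
    def kernel(a, b):
        # Gaussian kernel on the Euclidean distance between two feature vectors
        return torch.exp(-torch.dist(a, b) / (2 * tau ** 2))

    pos = kernel(features[p], features[q])
    # the denominator runs over every j != p (it also contains the positive sample q)
    denom = torch.stack([kernel(features[p], features[j])
                         for j in range(features.shape[0]) if j != p]).sum()
    return -torch.log(pos / denom)
```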

3.5. Loss Function

The two separate paths $M_s$ and $M_t$ are trained simultaneously under a total loss:
$$L_{total} = L_{seg}^{s} + L_{adv} + \lambda_{seg}^{t} L_{seg}^{t} + \lambda_m L_m + \lambda_c L_c$$
where $\lambda_{seg}^{t}$, $\lambda_m$, and $\lambda_c$ are the weights of the respective loss terms.
During the testing phase, where only target-domain data is available, the trained target-path network $M_t$ is used directly to evaluate the segmentation performance of the model.
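Putting the pieces together, one training step combines the terms as in the sketch below; the individual loss tensors and the numerical weights are placeholders, since the exact settings are not restated in this section.

```python
import torch

# placeholders standing in for the loss terms computed earlier in the step
loss_seg_s = torch.tensor(0.0, requires_grad=True)   # supervised loss on x_s and x_s->t
loss_adv = torch.tensor(0.0, requires_grad=True)     # adversarial feature alignment
loss_seg_t = torch.tensor(0.0, requires_grad=True)   # self-supervision with fused pseudo-labels
loss_m = torch.tensor(0.0, requires_grad=True)       # max-square regularization
loss_c = torch.tensor(0.0, requires_grad=True)       # cross-domain contrastive consistency

# illustrative placeholder weights for the respective loss terms
lambda_seg_t, lambda_m, lambda_c = 1.0, 0.1, 0.01

loss_total = (loss_seg_s
              + loss_adv
              + lambda_seg_t * loss_seg_t
              + lambda_m * loss_m
              + lambda_c * loss_c)
loss_total.backward()   # one optimizer step then updates M_s and M_t jointly
```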

4. Experiments

4.1. Datasets and Implementation Details

We validated the proposed method on three cross-domain datasets, with detailed information provided in Table 1. The REFUGE dataset [31] is divided into Train and Validation/Test subsets. The Train set, serving as the source domain, comprises 400 images with OD and OC segmentation annotations acquired using a Zeiss Visucam 500. In contrast, the Validation/Test set, serving as the target domain, contains 400 training images and 400 testing images captured using a Canon CR-2. Additionally, the Drishti-GS dataset [32] from India includes 101 images with segmentation annotations provided by multiple ophthalmologists. Lastly, the RIM-ONE-r3 dataset [33] from Spain, collected with a Canon EOS 5D, comprises 99 training images and 60 testing images. Because our framework uses self-supervised learning, the target-domain training images are treated as unlabeled data solely for adaptation and pseudo-label generation; their ground-truth masks are not used during training, and only the target-domain test images are employed for final evaluation.
We construct a dual-path image segmentation network, $M_s$ and $M_t$, using the DeepLabV3+ [26] framework. The network employs MobileNetV2 [27] as its feature extractor. Training is conducted on a server with an NVIDIA GTX 1080 Ti GPU, over 200 epochs with a batch size of 4. The Adam optimizer [34] is employed with an initial learning rate of $1 \times 10^{-3}$, which is decayed by a factor of 0.2 every 100 epochs. To augment the training dataset, we apply a range of random transformations, including random cropping and scaling, random rotation and flipping, elastic deformation, salt-and-pepper noise, and random region erasing. All augmentation operations are performed online via internal random sampling, so that each image receives different augmentation effects in different epochs and iterations.
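A simplified version of this online augmentation pipeline is sketched below with OpenCV and NumPy; the parameter ranges are illustrative, and the elastic deformation step is omitted for brevity.

```python
import numpy as np
import cv2

def augment(image, mask, rng=np.random.default_rng()):
    """Online augmentation applied jointly to a fundus image and its mask."""
    h, w = image.shape[:2]
    # random rotation and horizontal flip
    M = cv2.getRotationMatrix2D((w / 2, h / 2), rng.uniform(-30, 30), 1.0)
    image = cv2.warpAffine(image, M, (w, h))
    mask = cv2.warpAffine(mask, M, (w, h), flags=cv2.INTER_NEAREST)
    if rng.random() < 0.5:
        image, mask = np.fliplr(image).copy(), np.fliplr(mask).copy()
    # random crop followed by rescaling back to the original resolution
    s = rng.uniform(0.8, 1.0)
    ch, cw = int(h * s), int(w * s)
    y0, x0 = rng.integers(0, h - ch + 1), rng.integers(0, w - cw + 1)
    image = cv2.resize(image[y0:y0 + ch, x0:x0 + cw], (w, h))
    mask = cv2.resize(mask[y0:y0 + ch, x0:x0 + cw], (w, h), interpolation=cv2.INTER_NEAREST)
    # salt-and-pepper noise on the image only
    noise = rng.random(image.shape[:2])
    image[noise < 0.01] = 0
    image[noise > 0.99] = 255
    # random region erasing
    eh, ew = h // 10, w // 10
    ey, ex = rng.integers(0, h - eh), rng.integers(0, w - ew)
    image[ey:ey + eh, ex:ex + ew] = 0
    return image, mask
```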
We use the Dice coefficient ($DI$) as a metric to evaluate the segmentation performance of the model. The metric is calculated as:
$$DI = \frac{2 N_{tp}}{2 N_{tp} + N_{fp} + N_{fn}},$$
where $N_{tp}$, $N_{fp}$, and $N_{fn}$ correspond to the pixel counts of true positives, false positives, and false negatives, respectively.
In addition, we adopt the absolute error $\delta$ to evaluate the discrepancy between the predicted $CDR$ ($CDR_p$) and the ground-truth $CDR$ ($CDR_g$). The calculation of $\delta$ is as follows:
$$\delta = \left| CDR_p - CDR_g \right|$$
where $CDR = VCD / VDD$, and $VCD$ and $VDD$ are the vertical cup diameter and vertical disc diameter, respectively.
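For completeness, both evaluation metrics can be computed as in the following sketch; the vertical-diameter helper is an assumption about how VCD and VDD are measured from binary masks.

```python
import numpy as np

def dice(pred, gt):
    """Dice coefficient DI between binary masks (1 = foreground)."""
    tp = np.logical_and(pred == 1, gt == 1).sum()
    fp = np.logical_and(pred == 1, gt == 0).sum()
    fn = np.logical_and(pred == 0, gt == 1).sum()
    return 2 * tp / (2 * tp + fp + fn)

def vertical_diameter(mask):
    """Vertical extent (in pixels) of a binary OD or OC mask."""
    rows = np.where(mask.any(axis=1))[0]
    return rows.max() - rows.min() + 1 if rows.size else 0

def cdr_error(pred_cup, pred_disc, gt_cup, gt_disc):
    """delta = |CDR_p - CDR_g| with CDR = VCD / VDD."""
    cdr_p = vertical_diameter(pred_cup) / vertical_diameter(pred_disc)
    cdr_g = vertical_diameter(gt_cup) / vertical_diameter(gt_disc)
    return abs(cdr_p - cdr_g)
```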

4.2. Performance Comparison with Prior Methods

We compare our model with other domain adaptation works on the three datasets. Among the compared methods, refs. [35,36,37] were originally designed for other tasks, while refs. [10,11,12,13,14,23] are existing domain adaptation frameworks for OD and OC segmentation.

4.2.1. Quantitative Analysis

Table 2 shows that our method outperforms existing domain adaptation approaches in terms of $DI_{cup}$, $DI_{disc}$, and $\delta$. Specifically, our approach yields higher $DI_{disc}$ and $DI_{cup}$ scores on the Drishti-GS dataset while achieving the lowest $\delta$ value among all compared methods. Although the RIM-ONE-r3 dataset exhibits a more pronounced domain shift relative to REFUGE—thus complicating the adaptation process—our method still attains $DI_{disc}$ scores of 0.922 and 0.958 on RIM-ONE-r3 and REFUGE Validation/Test, respectively. This improvement is primarily attributable to the incorporation of causal inference-based pseudo-label supervision, which effectively reduces the adverse impact of mixed domain styles on model predictions.

4.2.2. Qualitative Analysis

Figure 6, Figure 7 and Figure 8 show the visual segmentation results of the proposed method and several open-source methods, i.e., BEAL [11], IOSUDA [12] and MeFDA [14]. As can be seen, while all methods achieve satisfactory OD segmentation due to its clear boundary, our method produces smoother contours. For OC segmentation, the other methods often fail to generate accurate boundaries. For instance, in the second row of the RIM-ONE-r3 results, the predicted segmentation is much smaller than the ground truth, whereas in the fourth row of the REFUGE Validation/Test results, their predictions include regions that do not belong to the OC. In contrast, our method outperforms these approaches in OC segmentation, as demonstrated in the second row of the Drishti-GS results, where the outline and shape of the segmented results are closer to the ground truth.

4.3. Discussion on Causal Inference-Based Pseudo-Label Fusion Module

To evaluate the effectiveness of our proposed causal inference–based pseudo-label fusion module, we designed a series of ablation experiments to comprehensively assess the impact of different pseudo-label generation strategies on model performance. In our framework, the fused pixel-wise pseudo-label $\hat{y}_t^f$ serves as the self-supervision target for unlabeled target-domain images—substituting for the unavailable ground-truth masks and being directly incorporated into the target-domain segmentation loss $L_{seg}^{t}$—thereby guiding the network to learn accurate OD/OC delineations under domain shift.

4.3.1. Effectiveness of Pseudo-Label Fusion

To demonstrate the effectiveness of the causal inference-based pseudo-label fusion, we compared the segmentation results of three schemes: one that does not employ pseudo-labels for self-supervision, one that generates pseudo-labels using a single-path approach solely based on target domain images, and the final method that incorporates causal inference. The corresponding results are presented in Table 3. The experimental findings indicate that the use of the pseudo-label module enables the model to exploit the information from target domain images, thereby significantly enhancing segmentation performance and validating the effectiveness of self-supervised learning. Moreover, incorporating the causal inference strategy further optimizes pseudo-label generation by effectively mitigating the interference caused by mixed domain styles, resulting in more accurate and reliable pseudo-labels for supervising the model. This strategy not only enhances the generalization ability of the model but also provides a robust theoretical foundation for cross-domain adaptation.

4.3.2. Effectiveness of Confidence-Based Dynamic Fusion

During the pseudo-label generation process, we also introduced a confidence map generation mechanism. To validate the effectiveness of this strategy, we compared it with a simple averaging fusion method, which combines the dual-path prediction results using equal weights (i.e., multiplying each segmentation output by 1/2) to obtain the fused pseudo-labels; the relevant results are presented in Table 4. The results show that the confidence-based dynamic fusion strategy is capable of adaptively adjusting the weight of each pixel based on its prediction certainty during the fusion process, thereby generating more refined and accurate pseudo-labels that provide a more reliable supervisory signal for model training.

4.4. Discussion on Domain Transformation Methods

In our causal inference-based pseudo-label generation module, we employ the Fourier transform as the tool for domain style transformation, generating images with varying domain styles to simulate the stratification operation in backdoor adjustment. The primary rationale for choosing the Fourier transform is that, due to the large size of the original images and memory limitations, CycleGAN can only process compressed images, leading to a degradation in image quality; in contrast, the Fourier transform can directly handle the original images, thereby achieving a more effective style conversion. Moreover, as a deep learning model, CycleGAN’s performance is constrained by the scale of the training dataset, and given the limited number of fundus images available, the Fourier transform can generate the required transferred images without additional training.
To validate these points, we conducted comparative experiments using CycleGAN [38] as the domain transformation tool, comparing the results of the CycleGAN-based method with those of the Fourier transform-based method. As shown in Figure 9, although CycleGAN produces transferred images that are satisfactory in terms of domain style, noticeable differences in detail and contour exist compared to the original images. When converting target domain images to the source domain, the size of the OD is significantly enlarged. Further quantitative analysis supports this conclusion; as presented in Table 5, the overall model’s segmentation performance on three datasets indicates that the Fourier transform achieves superior domain style transformation.

4.5. Loss Ablation Study

To further validate the effectiveness of our approach, we conducted an ablation study on the loss functions. We progressively incorporated the loss components used in the overall loss function and recorded the corresponding performance on three datasets, as shown in Table 6. The contributions of each loss term are analyzed as follows:
  • Baseline Segmentation Loss ($L_{seg}^{s}$): The results obtained using only $L_{seg}^{s}$, which relies solely on the labeled source data, establish the baseline segmentation performance.
  • Adversarial Loss ($L_{adv}$): By incorporating $L_{adv}$, we observed significant performance gains—especially on the RIM-ONE-r3 dataset, which exhibits a larger domain shift.
  • Target Segmentation Loss ($L_{seg}^{t}$): Next, we incorporated the target segmentation loss $L_{seg}^{t}$, which is based on the pseudo-labels generated by our causal inference-based pseudo-label fusion module. The addition of $L_{seg}^{t}$ resulted in substantial improvements across all three datasets, thereby demonstrating the reliability of the predicted pseudo-labels.
  • Maximum Square Loss ($L_m$): Incorporating the maximum square loss $L_m$ further improved segmentation performance by regularizing pseudo-label predictions and balancing high-confidence outputs.
  • Cross-domain Contrastive Loss ($L_c$): Finally, the cross-domain contrastive loss $L_c$ was introduced to constrain the features extracted from the original images and their style-transferred counterparts, ensuring consistency between the dual-path outputs.

5. Conclusions

In this paper, we propose a novel cross-domain fundus image segmentation framework, Causal Self-Supervised Network (CSSN). Our approach constructs a structural causal model and introduces a causal intervention mechanism to effectively eliminate the interference of domain style information in pseudo-label generation, thereby enabling efficient self-supervised learning. Specifically, we employ a Fourier transform to convert the image style, simulating the stratification process inherent in backdoor adjustment. Subsequently, a dual-path segmentation network is used to separately extract features from the original and style-transformed images, and a confidence fusion strategy is leveraged to generate more reliable pseudo-labels. Furthermore, we incorporate adversarial training and cross-domain contrastive learning to effectively narrow the feature discrepancy between different domains, significantly enhancing the model’s generalization ability and segmentation performance.

Author Contributions

Q.L.: Conceptualization, Methodology, and Writing—review & editing; Q.Z.: Software and Writing—original draft; Z.Z.: Investigation; H.L.: Validation, Supervision, and Writing—review & editing; W.N.: Funding acquisition. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China (62272337), the Natural Science Foundation of Tianjin (16JCZDJC31100) and Tianjin Natural Science Foundation (No. 23JCQNJC01520).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The code supporting the findings of this study is publicly available at the following GitHub repository: https://github.com/SingularZh/SSCN-for-OD-and-OC-segmentation.git. The REFUGE dataset analyzed in this work can be accessed via its DOI: https://dx.doi.org/10.21227/tz6e-r977. The RIM-ONE-r3 dataset used herein is available for download at: https://medimrg.webs.ull.es/RIM-ONE-r3.zip.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Tham, Y.-C.; Li, X.; Wong, T.Y.; Quigley, H.A.; Aung, T.; Cheng, C.-Y. Global Prevalence of Glaucoma and Projections of Glaucoma Burden Through 2040: A Systematic Review and Meta-Analysis. Ophthalmology 2014, 121, 2081–2090. [Google Scholar] [CrossRef]
  2. Mary, M.C.V.S.; Rajsingh, E.B.; Jacob, J.K.K.; Anandhi, D.; Amato, U.; Selvan, S.E. An Empirical Study on Optic Disc Segmentation Using an Active Contour Model. Biomed. Signal Process. Control 2015, 18, 19–29. [Google Scholar] [CrossRef]
  3. Gagan, J.H.; Shirsat, H.S.; Kamath, Y.S.; Kuzhuppilly, N.I.R.; Kumar, J.R.H. Automated Optic Disc Segmentation Using Basis Splines-Based Active Contour. IEEE Access 2022, 10, 88152–88163. [Google Scholar] [CrossRef]
  4. Gopalakrishnan, A.; Almazroa, A.; Raahemifar, K.; Lakshminarayanan, V. Optic Disc Segmentation Using Circular Hough Transform and Curve Fitting. In Proceedings of the 2015 2nd International Conference on Opto-Electronics and Applied Optics (IEM OPTRONIX), Vancouver, BC, Canada, 5–17 October 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 1–4. [Google Scholar]
  5. Zhu, X.; Rangayyan, R.M. Detection of the Optic Disc in Images of the Retina Using the Hough Transform. In Proceedings of the 2008 30th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Vancouver, BC, Canada, 20–25 August 2008; IEEE: Piscataway, NJ, USA, 2008; pp. 3546–3549. [Google Scholar]
  6. Welfer, D.; Scharcanski, J.; Kitamura, C.M.; Dal Pizzol, M.M.; Ludwig, L.W.B.; Marinho, D.R. Segmentation of the Optic Disk in Color Eye Fundus Images Using an Adaptive Morphological Approach. Comput. Biol. Med. 2010, 40, 124–137. [Google Scholar] [CrossRef]
  7. Morales, S.; Naranjo, V.; Angulo, J.; Alcañiz, M. Automatic Detection of Optic Disc Based on PCA and Mathematical Morphology. IEEE Trans. Med. Imaging 2013, 32, 786–796. [Google Scholar] [CrossRef]
  8. Lu, Z.; Chen, D. Weakly Supervised and Semi-Supervised Semantic Segmentation for Optic Disc of Fundus Image. Symmetry 2020, 12, 145. [Google Scholar] [CrossRef]
  9. Choukikar, P.; Patel, A.K.; Mishra, R.S. Segmenting the Optic Disc in Retinal Images Using Thresholding. Int. J. Comput. Appl. 2014, 94, 6–10. [Google Scholar] [CrossRef]
  10. Wang, S.; Yu, L.; Yang, X.; Fu, C.-W.; Heng, P.-A. Patch-based output space adversarial learning for joint optic disc and cup segmentation. IEEE Trans. Med. Imaging 2019, 38, 2485–2495. [Google Scholar] [CrossRef]
  11. Wang, S.; Yu, L.; Li, K.; Yang, X.; Fu, C.-W.; Heng, P.-A. Boundary and entropy-driven adversarial learning for fundus image segmentation. Int. J. Med. Image Comput. Comput.-Assist. Interv. 2019, 2019, 102–110. [Google Scholar]
  12. Chen, C.; Wang, G. IOSUDA: An Unsupervised Domain Adaptation with Input and Output Space Alignment for Joint Optic Disc and Cup Segmentation. Appl. Intell. 2021, 51, 3880–3898. [Google Scholar] [CrossRef]
  13. Kadambi, S.; Wang, Z.; Xing, E. WGAN Domain Adaptation for the Joint Optic Disc-and-Cup Segmentation in Fundus Images. Int. J. Comput. Assist. Radiol. Surg. 2020, 15, 1205–1213. [Google Scholar] [CrossRef]
  14. Xu, S.-P.; Li, T.-B.; Zhang, Z.-Q.; Song, D. Minimizing-Entropy and Fourier Consistency Network for Domain Adaptation on Optic Disc and Cup Segmentation. IEEE Access 2021, 9, 153985–153994. [Google Scholar] [CrossRef]
  15. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. Adv. Neural Inf. Process. Syst. 2014, 27. Available online: https://proceedings.neurips.cc/paper_files/paper/2014/file/f033ed80deb0234979a61f95710dbe25-Paper.pdf (accessed on 25 January 2025).
  16. Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; March, M.; Lempitsky, V. Domain-Adversarial Training of Neural Networks. J. Mach. Learn. Res. 2016, 17, 1–35. [Google Scholar]
  17. Guan, Y.; Zhang, L.; Li, J.; Xu, X.; Yan, Y.; Zhang, L. A Lightweight Entropy–Curvature-Based Attention Mechanism for Meningioma Segmentation in MRI Images. Appl. Sci. 2025, 15, 3401. [Google Scholar] [CrossRef]
  18. Guo, B.; Cao, N.; Zhang, R.; Yang, P. SCENet: Small Kernel Convolution with Effective Receptive Field Network for Brain Tumor Segmentation. Appl. Sci. 2024, 14, 11365. [Google Scholar] [CrossRef]
  19. Zou, C.; Jeon, W.-S.; Ju, H.-R.; Rhee, S.-Y. A Dual-Headed Teacher–Student Framework with an Uncertainty-Guided Mechanism for Semi-Supervised Skin Lesion Segmentation. Electronics 2025, 14, 984. [Google Scholar] [CrossRef]
  20. Tang, Y.; Guo, Y.; Wang, H.; Song, T.; Lu, Y. Uncertainty-Aware Semi-Supervised Method for Pectoral Muscle Segmentation. Bioengineering 2025, 12, 36. [Google Scholar] [CrossRef]
  21. Fu, H.; Cheng, J.; Xu, Y.; Wong, D.W.K.; Liu, J.; Cao, X. Joint Optic Disc and Cup Segmentation Based on Multi-Label Deep Network and Polar Transformation. IEEE Trans. Med. Imaging 2018, 37, 1597–1605. [Google Scholar] [CrossRef]
  22. Liu, P.; Kong, B.; Li, Z.; Zhang, S.; Fang, R. CFEA: Collaborative Feature Ensembling Adaptation for Domain Adaptation in Unsupervised Optic Disc and Cup Segmentation. Int. J. Med. Image Comput. Comput.-Assist. Interv. 2019, 2019, 521–529. [Google Scholar]
  23. He, Y.; Kong, J.; Liu, D.; Li, J.; Zheng, C. Self-ensembling with mask-boundary domain adaptation for optic disc and cup segmentation. Eng. Appl. Artif. Intell. 2024, 129, 107635. [Google Scholar] [CrossRef]
  24. Chen, Y.; Ji, Y.; Wang, H.; Hao, X.; Yang, Y.; Ma, Y.; Yu, D. Causal Inference-Based Adversarial Domain Adaptation for Cross-Domain Industrial Intrusion Detection. IEEE Trans. Ind. Inform. 2024, 21, 970–979. [Google Scholar] [CrossRef]
  25. Schneider, M. A Review of Nonlinear FFT-Based Computational Homogenization Methods. Acta Mech. 2021, 232, 2051–2100. [Google Scholar] [CrossRef]
  26. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2018; pp. 801–818. [Google Scholar]
  27. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. Mobilenetv2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 4510–4520. [Google Scholar]
  28. Chen, M.; Xue, H.; Cai, D. Domain Adaptation for Semantic Segmentation with Maximum Squares Loss. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 2090–2099. [Google Scholar]
  29. Peng, L.; Mo, Y.; Xu, J.; Shen, J.; Shi, X.; Li, X.; Shen, H.T.; Zhu, X. GRLC: Graph Representation Learning with Constraints. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 8609–8622. [Google Scholar] [CrossRef]
  30. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A Simple Framework for Contrastive Learning of Visual Representations. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; PMLR: New York, NY, USA, 2020; pp. 1597–1607. [Google Scholar]
  31. Orlando, J.I.; Fu, H.; Breda, J.B.; van Keer, K.; Bathula, D.R.; Diaz-Pinto, A.; Fang, R.; Heng, P.-A.; Kim, J.; Lee, J.; et al. REFUGE Challenge: A Unified Framework for Evaluating Automated Methods for Glaucoma Assessment from Fundus Photographs. Med. Image Anal. 2020, 59, 101570. [Google Scholar] [CrossRef]
  32. Sivaswamy, J.; Krishnadas, S.; Chakravarty, A.; Joshi, G.; Tabish, A.S. A Comprehensive Retinal Image Dataset for the Assessment of Glaucoma from the Optic Nerve Head Analysis. JSM Biomed. Imaging Data Pap. 2015, 2, 1004. [Google Scholar]
  33. Fumero, F.; Alayón, S.; Sanchez, J.L.; Sigut, J.; Gonzalez-Hernandez, M. RIM-ONE: An Open Retinal Image Database for Optic Nerve Evaluation. Int. J.-Comput.-Based Med. Syst. 2011, 2011, 1–6. [Google Scholar]
  34. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  35. Zhang, Y.; Miao, S.; Mansi, T.; Liao, R. Task Driven Generative Modeling for Unsupervised Domain Adaptation: Application to X-Ray Image Segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Granada, Spain, 16–20 September 2018; Springer: Cham, Switzerland, 2018; pp. 599–607. [Google Scholar]
  36. Hoffman, J.; Wang, D.; Yu, F.; Darrell, T. FCNs in the Wild: Pixel-Level Adversarial and Constraint-Based Adaptation. arXiv 2016, arXiv:1612.02649. [Google Scholar]
  37. Javanmardi, M.; Tasdizen, T. Domain Adaptation for Biomedical Image Segmentation Using Adversarial Training. In Proceedings of the IEEE International Symposium on Biomedical Imaging (ISBI), Washington, DC, USA, 4–7 April 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 554–558. [Google Scholar]
  38. Zhu, J.-Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 2223–2232. [Google Scholar]
Figure 1. Comparison of the same sample under different domain styles.
Figure 2. Structural Causal Model for mitigating domain style interference. In this model, target domain images (X) undergo feature extraction to generate feature maps (F), from which pseudo labels (Y) are derived. Meanwhile, the domain style confounder C introduces confounding style effects in both feature extraction and pseudo label generation, inducing bias in Y via backdoor paths.
Figure 3. Workflow of the Causal Inference-based Pseudo-Label Generation Module. First, a Fourier transform is applied to the target domain images to convert them into the source domain style, thereby achieving style separation; subsequently, dual segmentation networks extract features from both the original and transformed images, and a confidence map generation mechanism integrates these features to ultimately generate reliable pseudo-labels.
Figure 4. Schematic diagram of the generation process of confidence weight masks. The confidence of the OD and OC prediction probability maps is computed separately, with the resulting maps for each class concatenated and normalized via softmax to produce the final confidence map.
Figure 5. The overall architecture of CSSN is depicted. Images with different domain styles are generated using the Fourier transform. Features are then extracted from both the original target images and the style-transferred images, and a causal inference-based pseudo-label fusion module is employed to generate reliable pseudo-labels for self-supervised training. Meanwhile, adversarial training and cross-domain contrastive learning further reduce the feature discrepancy, ensuring structural consistency.
Figure 6. Visual comparison on Drishti-GS dataset. Compared with BEAL [11], IOSUDA [12] and MeFDA [14]. The green contours represent the segmented optic disc, while the blue contours represent the segmented optic cup.
Figure 7. Visual comparison on RIM-ONE-r3 dataset. Compared with BEAL [11], IOSUDA [12] and MeFDA [14]. The green contours represent the segmented optic disc, while the blue contours represent the segmented optic cup.
Figure 8. Visual comparison on REFUGE Validation/Test dataset. Compared with BEAL [11], IOSUDA [12] and MeFDA [14]. The green contours represent the segmented optic disc, while the blue contours represent the segmented optic cup.
Figure 9. Visualization results of domain transformation using CycleGAN and Fourier transform.
Table 1. The datasets used in the proposed method.
Domain   Dataset                  Number (Train/Test)   Image Size
Source   REFUGE Train             400/0                 2124 × 2056
Target   Drishti-GS               50/51                 2047 × 1759
Target   RIM-ONE-r3               99/60                 2144 × 1424
Target   REFUGE Validation/Test   400/400               1634 × 1634
Table 2. Quantitative segmentation results for OD and OC on three public datasets. The best results are highlighted in bold.
Method                    Drishti-GS                  RIM-ONE-r3                  REFUGE Val
                          DI_cup   DI_disc  δ         DI_cup   DI_disc  δ         DI_cup   DI_disc  δ
TD-GAN [35]               0.747    0.924    0.117     0.728    0.853    0.118     -        -        -
Hoffman et al. [36]       0.851    0.959    0.093     0.755    0.852    0.109     -        -        -
WGAN [13]                 0.840    0.954    0.106     -        -        -         -        -        -
Javanmardi et al. [37]    0.849    0.961    0.091     0.779    0.853    0.103     -        -        -
OSAL-pixel [10]           0.851    0.962    0.089     0.778    0.854    0.097     0.869    0.932    0.059
pOSAL [10]                0.858    0.965    0.082     0.787    0.865    0.089     0.875    0.946    0.051
BEAL [11]                 0.862    0.961    0.084     0.810    0.898    0.090     0.852    0.948    0.055
IOSUDA [12]               0.775    0.940    0.091     0.723    0.907    0.095     0.829    0.954    0.057
CFEA [22]                 -        -        -         -        -        -         0.863    0.942    0.052
MeFDA [14]                0.866    0.959    0.082     0.821    0.909    0.087     0.880    0.956    0.049
OADA [23]                 0.873    0.965    0.085     0.816    0.904    0.094     0.885    0.952    0.044
CSSN (Ours)               0.876    0.971    0.081     0.818    0.922    0.083     0.885    0.958    0.049
Table 3. Performance of different pseudo-label generation strategies. The best results are highlighted in bold.
Dataset      Method                       DI_cup   DI_disc   δ
Drishti-GS   w/o pseudo-labels            0.822    0.941     0.097
             Single-path pseudo-labels    0.847    0.952     0.085
             Causal Inference-based       0.876    0.971     0.081
RIM-ONE-r3   w/o pseudo-labels            0.757    0.873     0.103
             Single-path pseudo-labels    0.798    0.896     0.093
             Causal Inference-based       0.818    0.922     0.083
REFUGE Val   w/o pseudo-labels            0.831    0.941     0.056
             Single-path pseudo-labels    0.840    0.952     0.052
             Causal Inference-based       0.885    0.958     0.049
Table 4. Comparison of experimental results using different pseudo-label fusion methods. The best results are highlighted in bold.
Dataset      Method        DI_cup   DI_disc   δ
Drishti-GS   Avg-pooling   0.871    0.966     0.087
             Ours          0.876    0.971     0.081
RIM-ONE-r3   Avg-pooling   0.816    0.918     0.087
             Ours          0.818    0.922     0.083
REFUGE Val   Avg-pooling   0.861    0.947     0.053
             Ours          0.885    0.958     0.049
Table 5. Performance of different domain transformation methods. The best results are highlighted in bold.
Dataset      Method     DI_cup   DI_disc   δ
Drishti-GS   CycleGAN   0.853    0.942     0.089
             Fourier    0.876    0.971     0.081
RIM-ONE-r3   CycleGAN   0.793    0.911     0.094
             Fourier    0.818    0.922     0.083
REFUGE Val   CycleGAN   0.861    0.923     0.056
             Fourier    0.885    0.958     0.049
Table 6. Ablation study of various loss functions. The best results are highlighted in bold.
Dataset      Method                                      DI_cup   DI_disc   δ
Drishti-GS   L_seg^s                                     0.822    0.941     0.097
             L_seg^s + L_adv                             0.829    0.939     0.093
             L_seg^s + L_adv + L_seg^t                   0.867    0.958     0.084
             L_seg^s + L_adv + L_seg^t + L_m             0.869    0.961     0.084
             L_seg^s + L_adv + L_seg^t + L_m + L_c       0.876    0.971     0.081
RIM-ONE-r3   L_seg^s                                     0.757    0.873     0.103
             L_seg^s + L_adv                             0.781    0.883     0.097
             L_seg^s + L_adv + L_seg^t                   0.805    0.897     0.090
             L_seg^s + L_adv + L_seg^t + L_m             0.814    0.913     0.085
             L_seg^s + L_adv + L_seg^t + L_m + L_c       0.818    0.922     0.083
REFUGE Val   L_seg^s                                     0.831    0.941     0.056
             L_seg^s + L_adv                             0.849    0.951     0.055
             L_seg^s + L_adv + L_seg^t                   0.871    0.953     0.054
             L_seg^s + L_adv + L_seg^t + L_m             0.873    0.959     0.049
             L_seg^s + L_adv + L_seg^t + L_m + L_c       0.885    0.958     0.049