Article
Peer-Review Record

Causal Inference-Based Self-Supervised Cross-Domain Fundus Image Segmentation

by Qiang Li 1, Qiyi Zhang 1, Zheqi Zhang 2, Hengxin Liu 1,* and Weizhi Nie 2
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Appl. Sci. 2025, 15(9), 5074; https://doi.org/10.3390/app15095074
Submission received: 27 March 2025 / Revised: 27 April 2025 / Accepted: 30 April 2025 / Published: 2 May 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This manuscript describes a Causal Self-Supervised Network (CSSN) utilizing a Structural Causal Model (SCM), aimed at accurate segmentation of the optic disc (OD) and optic cup (OC) in retinal photographs. A pseudo-label fusion strategy is then used to generate reliable labels for self-supervised learning. Some issues might be considered:

1. In Section 3.1 and Figure 1, a SCM is described. The effects of the domain style confounder C might be illustrated with concrete example images, both before and after application of the confounding.

2. In Section 3.1, pseudo-labels are described. It might be clarified as to what these labels represent, since the target task is (pixel-level) segmentation, and not classification. Figure 2 for example appears to indicate a single pseudo-label (L_m) for the full OD image. While pseudo-label fusion and generation is later discussed in Section 4.3, it remains unclear as to what role the pseudo-labels play in the model.

3. In Table 1, although Drishti-GS, RIM-ONE-r3 and REFUGE are stated as the target domain datasets, they are apparently also used to train the model too (since the train number is non-zero). This might be clarified, since the assessment of cross-domain segmentation would require the target domain to be unknown to the model.

4. In Table 6, ablation results are presented. It might be clarified as to which dataset(s) these results were from.

 

Author Response

Comment 1: In Section 3.1 and Figure 1, a SCM is described. The effects of the domain style confounder C might be illustrated with concrete example images, both before and after application of the confounding.

Response 1: Thank you for your insightful comment. “Confounding” refers specifically to the interference introduced by domain-style shifts between source and target images. To illustrate this, we have added Figure 1 in Section 3.1, which presents three representative samples showing each image in its original target-domain style alongside its style-transformed counterpart. This visual comparison illustrates the effects of the domain-style confounder C. The revised content is as follows:

In the self-supervised cross-domain fundus image segmentation task, the model is primarily supervised by source-domain images and their corresponding labels. However, as illustrated in Figure 1, the same sample can exhibit substantial visual differences under different domain styles, and this style-induced variation inevitably perturbs the training process, introducing systematic biases into pseudo-label generation. Specifically, during feature extraction, the model may capture intrinsic style noise from images in a particular domain, leading to biased pseudo-labels and ultimately degrading segmentation performance.

 

Comment 2: In Section 3.1, pseudo-labels are described. It might be clarified as to what these labels represent, since the target task is (pixel-level) segmentation, and not classification. Figure 2 for example appears to indicate a single pseudo-label (L_m) for the full OD image. While pseudo-label fusion and generation is later discussed in Section 4.3, it remains unclear as to what role the pseudo-labels play in the model.

Response 2: Thank you for your detailed comments. We apologize for any confusion caused by our insufficiently precise description of the pseudo-label component in Section 3.1 and Figure 2. In fact, Lm is not a pseudo-label but a maximum-square loss designed to mitigate class imbalance in the prediction probabilities, whereas the true pseudo-label is a pixel-wise OD/OC segmentation probability mask for the target-domain images; it fully satisfies the requirements of a pixel-level segmentation task and serves as the self-supervision signal in the target-domain segmentation loss. In the figure, the dashed arrow indicates that the maximum-square loss Lm is applied to the fused pseudo-label during training. To make this distinction clear, we have supplemented the description of the pseudo-label generation process in Section 3 and further elaborated the definition of the pseudo-label and its specific role in model training in Section 4.3. The revised manuscript now reads as follows:

“After obtaining the feature maps from both the target domain image and its source-style transformed counterpart, we proceed to generate pseudo-labels using these features. The structure of the pseudo-label generation module is shown in Figure 3. In this figure, the dashed arrow denotes that the maximum-square loss Lm is applied to the fused pseudo-label during training.”

“It is important to note that the prediction from the target-domain image comprises an OD prediction probability map and an OC prediction probability map, and the prediction from its source-style transformed counterpart likewise consists of an OD and an OC probability map. Accordingly, we compute the confidence maps for these four probability maps, concatenate the two OD confidence maps, and feed the result into a softmax layer for normalization. The computation process is illustrated in Figure 4. An analogous operation is performed for the two OC confidence maps. The resulting normalized confidence maps are then used to weight the corresponding probability maps and produce the pixel-wise pseudo-label, which serves as the self-supervision target in the target-domain segmentation loss.”

“To evaluate the effectiveness of our proposed causal inference–based pseudo-label fusion module, we designed a series of ablation experiments to comprehensively assess the impact of different pseudo-label generation strategies on model performance. In our framework, the fused pixel-wise pseudo-label serves as the self-supervision target for unlabeled target-domain images: it substitutes for the unavailable ground-truth masks and is directly incorporated into the target-domain segmentation loss, thereby guiding the network to learn accurate OD/OC delineations under domain shift.”
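To make the fusion step concrete, the following PyTorch sketch shows one way such confidence-weighted pseudo-label fusion could be computed. It is a minimal illustration under our own assumptions (confidence taken as each pixel's distance from 0.5, placeholder tensor shapes and variable names), not the authors' implementation.

import torch
import torch.nn.functional as F

def confidence(p):
    # Assumed confidence measure: each pixel's distance from the 0.5 decision boundary.
    return (p - 0.5).abs()

def fuse(p_a, p_b):
    # Normalize the two confidence maps per pixel with a softmax and use them as fusion weights.
    c = torch.cat([confidence(p_a), confidence(p_b)], dim=1)   # (B, 2, H, W)
    w = F.softmax(c, dim=1)
    return w[:, 0:1] * p_a + w[:, 1:2] * p_b

# Placeholder predictions standing in for the two network outputs (B, 1, H, W):
B, H, W = 2, 256, 256
p_t_od, p_t_oc = torch.rand(B, 1, H, W), torch.rand(B, 1, H, W)   # from the target-domain image
p_s_od, p_s_oc = torch.rand(B, 1, H, W), torch.rand(B, 1, H, W)   # from its source-style counterpart

fused_od, fused_oc = fuse(p_t_od, p_s_od), fuse(p_t_oc, p_s_oc)
pseudo_label = (torch.cat([fused_od, fused_oc], dim=1) > 0.5).float()  # pixel-wise OD/OC pseudo-label

In practice the fused probability maps would stand in for ground-truth masks in the target-domain segmentation loss, as described above.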

 

Comment 3: In Table 1, although Drishti-GS, RIM-ONE-r3 and REFUGE are stated as the target domain datasets, they are apparently also used to train the model too (since the train number is non-zero). This might be clarified, since the assessment of cross-domain segmentation would require the target domain to be unknown to the model.

Response 3: Thank you for this valuable suggestion. Because our model adopts a self-supervised learning strategy, we do include target-domain images during training; however, these images are entirely unlabeled and are used only for self-supervised adaptation and pseudo-label generation. We do not use their ground-truth segmentation masks for training—only the source-domain labels are used for supervised learning—while the target-domain masks are reserved solely for final evaluation on the test split. This design ensures that no target-domain annotations are seen during training.

To make this clear, we have added the following statement to Section 4.1:

“We validated the proposed method on three cross-domain datasets, with detailed information provided in Table 1. The REFUGE dataset [31] is divided into Train and Validation/Test subsets. The Train set, serving as the source domain, comprises 400 images with OD and OC segmentation annotations acquired using a Zeiss Visucam 500. In contrast, the Validation/Test set, serving as the target domain, contains 400 training images and 400 testing images captured using a Canon CR-2. Additionally, the Drishti-GS dataset [32] from India includes 101 images with segmentation annotations provided by multiple ophthalmologists. Lastly, the RIM-ONE-r3 dataset [33] from Spain, collected with a Canon EOS 5D, comprises 99 training images and 60 testing images. Because our framework uses self-supervised learning, the target-domain training images are treated as unlabeled data solely for adaptation and pseudo-label generation; their ground-truth masks are not used during training, and only the target-domain test images are employed for final evaluation.”

 

Comment 4: In Table 6, ablation results are presented. It might be clarified as to which dataset(s) these results were from.

Response 4: Thank you for your constructive suggestions. We have updated Table 6 by adding a leftmost column that specifies the dataset for each ablation result, so that readers can easily see which dataset each experiment was performed on.

 

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The manuscript proposes a model for cross-domain segmentation of the optic disc and optic cup in retinal images. The proposed model does not significantly improve the segmentation results when compared to existing domain adaptation models on the public datasets used in the manuscript. Listed below are my comments and suggestions.

  1. The introduction fails to mention other approaches for segmenting OD and OC in literature besides deep learning. For completeness, it will be helpful to readers to mention the traditional and other weakly supervised segmentation methods. Why are deep learning methods commonly used in recent segmentation models compared to the other methods?
  2. Line 75, adversarial learning mechanisms exist in the literature. Please include references to the original work, as it is not new in this manuscript.
  3. Line 121-122: Can this limitation be reduced by increasing the variability in the training samples for existing models? Will using a large enough sample size of images from different sensors for training existing models reduce this effect?
  4. Line 157: The domain transformation using the Fourier transform is not adequately described. Please include a description of the filtering process i.e. the masking applied to the image in the frequency domain before reversing the result back to the spatial domain.
  5. Briefly describe the backbone architecture on Line 165. What factors influence that choice compared to others?
  6. Please provide details of the feature extractor (MobileNetV2) employed by the network. i.e. number of filters or output size, etc.
  7. Is a batch size of 4 optimal for the network? How does the model performance vary as the batch size increases?
  8. Please clarify the type of augmentations applied to the training dataset. Random or fixed for each sample image?

Author Response

Comment 1: The introduction fails to mention other approaches for segmenting OD and OC in literature besides deep learning. For completeness, it will be helpful to readers to mention the traditional and other weakly supervised segmentation methods. Why are deep learning methods commonly used in recent segmentation models compared to the other methods?

Response 1: Thank you for your constructive suggestions. We have revised the Introduction to include representative traditional OD/OC segmentation methods—such as active contour models, Hough-transform-based detection, morphological operations and weakly supervised techniques—and added citations for each. We also added a brief discussion explaining why, in recent years, deep learning approaches have become predominant for OD/OC segmentation. The specific additions in the revised manuscript are as follows:

“In traditional methods for automatic segmentation of the OD and OC, a variety of approaches have been proposed, including active contour models [2,3], Hough‐transform‐based detection [2,4,5], morphological operations [6,7], weakly supervised techniques [8], among others. For example, Mary et al. [2] proposed a cascaded pipeline in which the red channel is first enhanced by adaptive histogram equalization and subjected to morphological vessel removal and binarization; a circular Hough transform then generates initial disc contours, which are iteratively evolved under the gradient vector flow (GVF) energy functional to accurately delineate the disc boundary. Zhu et al. [5] described a similar pipeline that converts fundus images to a luminance representation for artifact removal, extracts edges via Sobel or Canny operators, detects circular candidates via the Hough transform, and refines them by intensity thresholding to precisely locate the disc. Welfer et al. [6] designed a two-stage adaptive morphological framework: in Stage 1, vessel skeletonization—using reconstruction, top-hat filtering and skeletonization—localizes the disc region; in Stage 2, a marker-controlled watershed transform adaptively segments the exact boundary. Choukikar et al. [9] converted color retina images to grayscale with histogram equalization, performed multilevel thresholding followed by morphological erosion and dilation to extract boundary candidates, and then fitted a circle to the resulting points to determine the disc center and radius.

However, these traditional approaches typically depend on handcrafted feature extraction, heuristic processing pipelines, and domain-specific parameter tuning, which render them sensitive to image artifacts, inter-patient variability, and inconsistent acquisition conditions. With the advent of deep learning, convolutional neural networks and related architectures have been widely adopted for OD and OC segmentation, learning hierarchical feature representations directly from data and demonstrating superior robustness and accuracy.”
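For illustration only, a minimal OpenCV sketch of such a traditional pipeline (red-channel enhancement followed by a circular Hough transform for optic disc localization) might look as follows; all parameter values are hypothetical and are not taken from the cited works.

import cv2
import numpy as np

def locate_disc(fundus_bgr):
    red = fundus_bgr[:, :, 2]                                   # OD appears brightest in the red channel
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(red)                                 # adaptive histogram equalization
    blurred = cv2.medianBlur(enhanced, 11)                      # suppress vessels and noise
    circles = cv2.HoughCircles(blurred, cv2.HOUGH_GRADIENT, dp=1, minDist=200,
                               param1=60, param2=30, minRadius=40, maxRadius=120)
    if circles is None:
        return None
    x, y, r = np.round(circles[0, 0]).astype(int)               # strongest circular candidate
    return x, y, r

# Example usage (hypothetical file name):
# x, y, r = locate_disc(cv2.imread("fundus.png"))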

 

Comment 2: Line 75, adversarial learning mechanisms exist in the literature. Please include references to the original work, as it is not new in this manuscript.

Response 2: Thank you for this valuable suggestion. We have added citations to the seminal adversarial learning works, specifically referencing Goodfellow et al. (2014) and Ganin et al. (2016), both of which are highly representative in this field. The revised content is as follows:

“We introduce adversarial learning [15,16] and cross-domain contrastive learning mechanisms, which reduce the distribution discrepancy between source and target domains.”

 

Comment 3: Line 121-122: Can this limitation be reduced by increasing the variability in the training samples for existing models? Will using a large enough sample size of images from different sensors for training existing models reduce this effect?

Response 3: Thanks for your constructive suggestions. While it is true that training on a large, multi-sensor annotated dataset can enhance a model's robustness to style variations, in practice constructing such a dataset demands substantial time and financial investment, and requires additional computational resources for preprocessing and training—factors that run counter to our goal of achieving efficient cross-domain segmentation with limited resources. Moreover, in clinical settings, image annotation depends on senior ophthalmologists, and the lack of standardized protocols across different imaging devices further increases both the difficulty and cost of data collection. In addition, due to patient privacy concerns and the sensitive nature of medical data, available datasets are often small, making it difficult to assemble a dataset rich in diverse classes.

By contrast, our proposed CSSN method leverages existing labeled source‐domain data together with a small set of unlabeled target‐domain images. Through Fourier‐based stratification and a dual‐path causal back‐door adjustment strategy, CSSN effectively severs the confounding influence of style variations during feature extraction and pseudo‐label generation, thereby greatly reducing reliance on large‐scale annotated datasets while also alleviating the annotation burden on experts.

We have added a discussion of these points to the manuscript. The specific revisions are as follows:

“Although numerous deep learning–based automated segmentation models have achieved excellent performance in segmenting OD and OC, most rely on training with labeled source-domain images. In practical applications, target-domain images are typically unlabeled, and variations in imaging equipment and acquisition conditions introduce a significant distribution discrepancy between source and target domains, resulting in a marked decline in conventional models’ target-domain performance. While assembling a fundus-image dataset that encompasses all domain styles could enable the model to adapt to diverse styles and thereby mitigate the effects of domain shift, such an undertaking is extremely time-consuming, labor-intensive, and demands substantial additional computational resources. Consequently, recent studies have explored domain adaptation techniques to enhance the generalization ability of cross-domain segmentation. However, existing methods [10-14] suffer from two major limitations: first, they predominantly utilize labeled source-domain data for supervision, lacking effective constraints on unlabeled images; second, the use of a shared network for processing images from different domains makes it challenging to completely eliminate interference caused by domain style differences.”

 

Comment 4: Line 157: The domain transformation using the Fourier transform is not adequately described. Please include a description of the filtering process, i.e., the masking applied to the image in the frequency domain before reversing the result back to the spatial domain.

Response 4: Thank you for this valuable comment. To improve clarity, we have expanded the Methods section with a detailed description of our Fourier-based domain transformation. The revised content is as follows:

Based on the above causal inference strategy, we apply Fourier transform to the target-domain images for style transfer. As noted in [14], the low-frequency components of an image generally capture properties such as background, illumination, and other style-related attributes. Therefore, by replacing the low-frequency components of the original image with those obtained via Fourier transform, we can achieve an approximate image-domain transformation.

First, we apply the Fast Fourier Transform (FFT) [25] to each image to obtain its frequency spectrum, from which we take the amplitude spectrum and the phase spectrum.

During training, we randomly select a source-domain image and a target-domain image, and replace the low-frequency part of the target image's amplitude spectrum with that of the source image to obtain the style-transferred image. Specifically, we treat the center of the amplitude spectrum as the zero-frequency point and construct a square window of side length a centered at this point. We then remove the contents of the target amplitude spectrum within this window and fill them with the corresponding contents of the source amplitude spectrum, resulting in a modified amplitude spectrum. Finally, by combining this modified amplitude spectrum with the original phase spectrum of the target image and applying the inverse FFT, we obtain the style-transferred image.

The resulting image retains the ground-truth segmentation mask of the original target-domain image, while its domain gap relative to the source-domain image is significantly reduced, making it effectively aligned with the source domain.

We believe this additional detail will help readers fully understand and reproduce our Fourier transform–based style transfer.
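As a reading aid, the following NumPy sketch implements the low-frequency amplitude swap described above under our own assumptions (single-channel images of equal size, a hypothetical window parameter corresponding to the side length a); it is not the authors' exact code.

import numpy as np

def fourier_style_transfer(x_t, x_s, window=32):
    # x_t: target-domain image, x_s: source-domain image (2-D arrays of equal size).
    f_t = np.fft.fftshift(np.fft.fft2(x_t))        # target spectrum, zero frequency at the center
    f_s = np.fft.fftshift(np.fft.fft2(x_s))        # source spectrum
    amp_t, pha_t = np.abs(f_t), np.angle(f_t)
    amp_s = np.abs(f_s)

    h, w = x_t.shape
    cy, cx, a = h // 2, w // 2, window // 2
    # Square window around the zero-frequency point: swap in the source low-frequency amplitude.
    amp_t[cy - a:cy + a, cx - a:cx + a] = amp_s[cy - a:cy + a, cx - a:cx + a]

    f_new = amp_t * np.exp(1j * pha_t)             # recombine with the original target phase
    return np.real(np.fft.ifft2(np.fft.ifftshift(f_new)))  # back to the spatial domain

# Placeholder arrays standing in for a target and a source fundus image:
x_t, x_s = np.random.rand(256, 256), np.random.rand(256, 256)
x_t_to_s = fourier_style_transfer(x_t, x_s)

Keeping the target phase preserves the image structure, so only the style-related low-frequency content changes.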

 

Comment 5: Briefly describe the backbone architecture on Line 165. What factors influence that choice compared to others?

Response 5: Thank you very much for this insightful comment. In response, we have expanded the description of our backbone architecture. Specifically, we now explain that we employ an ImageNet-pretrained MobileNetV2 as the encoder in DeepLabV3+, detail its inverted-residual structure and Atrous Spatial Pyramid Pooling (ASPP), and outline the design of its lightweight decoder. We also clarify our rationale for this choice—namely, its excellent balance between representational power and inference speed, its ability to capture multi-scale context while preserving high-resolution features, and its suitability for medical image tasks with limited training data and hardware constraints.

The revised text is as follows:

“In terms of backbone architecture, we adopt MobileNetV2 [27]—pretrained on ImageNet—as the encoder within DeepLabV3+. MobileNetV2’s core building block is the inverted residual: a 1×1 convolution first expands the low-dimensional input features to a higher dimension, followed by a 3×3 depthwise separable convolution for spatial filtering, and finally a 1×1 projection back to the original channel dimensionality. When the input and output channels match, a residual connection is applied, which mitigates gradient vanishing and feature degradation while dramatically reducing both parameter count and computational cost. This lightweight feature extractor therefore strikes an effective balance between representational capacity and inference speed, particularly well suited to medical imaging tasks with limited data and constrained hardware resources.

Built upon this encoder, DeepLabV3+ incorporates an Atrous Spatial Pyramid Pooling (ASPP) module that applies parallel atrous convolutions at multiple dilation rates to capture contextual information from local to global scales without sacrificing feature-map resolution. Its lightweight decoder then fuses shallow, high-resolution feature maps with the ASPP outputs and employs depthwise separable transposed convolutions for upsampling, thereby refining boundary details. This backbone design achieves a harmonious trade-off between multi-scale adaptability, high-fidelity edge recovery, and computational efficiency, effectively addressing the large scale variations and fine boundary requirements inherent to OD/OC segmentation.”
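For reference, a short PyTorch sketch of the inverted-residual block summarized above (1×1 expansion, 3×3 depthwise convolution, 1×1 linear projection, with a residual connection when input and output shapes match); this follows the published MobileNetV2 design and is not the authors' code.

import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1, expand_ratio=6):
        super().__init__()
        hidden = in_ch * expand_ratio
        self.use_residual = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),              # 1x1 expansion
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride, 1,
                      groups=hidden, bias=False),                 # 3x3 depthwise convolution
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False),             # 1x1 linear projection
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out

y = InvertedResidual(32, 32)(torch.randn(1, 32, 64, 64))   # residual path active: same channels, stride 1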

 

Comment 6: Please provide details of the feature extractor (MobileNetV2) employed by the network, i.e., number of filters or output size, etc.

Response 6: Thank you for this valuable suggestion. We have supplemented the manuscript with a detailed description of the MobileNetV2 feature extractor. The added text now reads:

“In terms of backbone architecture, we adopt MobileNetV2 [27]—pretrained on ImageNet—as the encoder within DeepLabV3+. MobileNetV2's core building block is the inverted residual: a 1×1 convolution first expands the low-dimensional input features to a higher dimension, followed by a 3×3 depthwise separable convolution for spatial filtering, and finally a 1×1 projection back to the original channel dimensionality. When the input and output channels match, a residual connection is applied, which mitigates gradient vanishing and feature degradation while dramatically reducing both parameter count and computational cost. This lightweight feature extractor therefore strikes an effective balance between representational capacity and inference speed, particularly well suited to medical imaging tasks with limited data and constrained hardware resources.”

 

Comment 7: Is a batch size of 4 optimal for the network? How does the model performance vary as the batch size increases?

Response 7: We thank the reviewer for this valuable suggestion. In the initial submission we chose a batch size of 4 mainly because of hardware constraints: each fundus image passes through two independent DeepLabV3+ branches, and on a single NVIDIA 1080 Ti this already consumes a substantial share of GPU memory while additional space must be reserved for optimiser states and mixed-precision buffers.

To verify that this setting is appropriate, we retrained the model with batch sizes of 8 and 16 while keeping all other hyper-parameters unchanged. As shown in the table in the attached Word file, enlarging the batch size to 8 or 16 yields no noticeable improvement in DIcup, DIdisc, or δ; in some runs the metrics even decline slightly. This suggests that, at our current model and data scale, larger batches do not provide the expected benefits, and the reduced gradient noise may in fact weaken implicit regularisation.

 

Furthermore, medical image-segmentation models are often deployed on clinical devices where computational resources are limited and high-memory GPUs are not always available. Retaining a smaller batch size therefore not only makes training feasible with the existing hardware but also offers a configuration that better reflects real-world deployment conditions. Once again, we appreciate the reviewer’s comment, which prompted a more comprehensive empirical analysis of this hyper-parameter choice.

 

Comment 8: Please clarify the type of augmentations applied to the training dataset. Random or fixed for each sample image?

Response 8: Thank you very much for this insightful comment. During the training phase, we apply a series of random augmentations to each sample, including random cropping and scaling, random rotation and flipping, elastic deformation, salt-pepper noise, and random area erasing. All of these operations rely on internal random sampling mechanisms; therefore, the same image will receive different augmentation effects in different epochs or iterations, rather than having a single fixed augmentation applied once per sample. To clarify this point, we have added the following description to the manuscript:

“To augment the training dataset, we apply a range of random transformations, including random cropping and scaling, random rotation and flipping, elastic deformation, salt–pepper noise, and random region erasing. All augmentation operations are performed online via internal random sampling mechanisms, so that each image receives different augmentation effects in different epochs or iterations.”
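A rough sketch of such an online random augmentation pipeline, assuming torchvision and purely illustrative parameter values (the authors' exact operations and settings may differ):

import torch
from torchvision import transforms

def salt_pepper(img, p=0.01):
    # Simple salt-and-pepper noise on a CxHxW tensor in [0, 1].
    noise = torch.rand(img.shape[1:])          # one HxW noise map shared across channels
    img = img.clone()
    img[:, noise < p / 2] = 0.0                # "pepper"
    img[:, noise > 1 - p / 2] = 1.0            # "salt"
    return img

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(512, scale=(0.8, 1.0)),   # random cropping and scaling
    transforms.RandomRotation(20),                         # random rotation
    transforms.RandomHorizontalFlip(),                     # random flipping
    transforms.ToTensor(),
    transforms.ElasticTransform(alpha=50.0),               # elastic deformation (applied to the tensor)
    transforms.Lambda(salt_pepper),                        # salt-pepper noise
    transforms.RandomErasing(p=0.5),                       # random region erasing
])

For segmentation training, the geometric transforms would in practice be applied jointly to images and masks; only the image branch is sketched here.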

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

We thank the authors for addressing our previous comments.

 

Reviewer 2 Report

Comments and Suggestions for Authors

The authors have addressed comments and suggestions in the revised version of the manuscript, providing background information and improving the clarity of the methods. I do not have any further comments.
