Article

Boundary-Aware Transformer for Optic Cup and Disc Segmentation in Fundus Images

1 AI Development Team, Sensorway, 140 Tongil-ro, Deogyang-gu, Goyang-si 10594, Republic of Korea
2 Division of Computer Science and Engineering, Sahmyook University, 815 Hwarang-ro, Nowon-gu, Seoul 01795, Republic of Korea
3 Institute of Convergence Science, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul 02841, Republic of Korea
* Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(9), 5165; https://doi.org/10.3390/app15095165
Submission received: 22 March 2025 / Revised: 2 May 2025 / Accepted: 4 May 2025 / Published: 6 May 2025
(This article belongs to the Special Issue Machine Learning in Biomedical Sciences)

Abstract:
Segmentation of the Optic Disc (OD) and Optic Cup (OC) boundaries in fundus images is a critical step for early glaucoma diagnosis, but accurate segmentation is challenging due to low boundary contrast and significant anatomical variability. To address these challenges, this study proposes a novel segmentation framework that integrates structure-preserving data augmentation, Boundary-aware Transformer Attention (BAT), and Geometry-aware Loss. We enhance data diversity while preserving vascular and tissue structures through truncated Gaussian-based sampling and colormap transformations. BAT strengthens boundary recognition by globally learning the inclusion relationship between the OD and OC within the skip connection paths of U-Net. Additionally, Geometry-aware Loss, which combines the normalized Hausdorff Distance with the Dice Loss, reduces fine-grained boundary errors and improves boundary precision. The proposed model outperforms existing state-of-the-art models across five public datasets—DRIONS-DB, Drishti-GS, REFUGE, G1020, and ORIGA—and achieves Dice scores of 0.9127 on Drishti-GS and 0.9014 on REFUGE for OC segmentation. For joint segmentation of the OD and OC, it attains high Dice scores of 0.9892 on REFUGE, 0.9782 on G1020, and 0.9879 on ORIGA. Ablation studies validate the independent contributions of each component and demonstrate their synergistic effect when combined. Furthermore, the proposed model more accurately captures the relative size and spatial alignment of the OD and OC and produces smooth and consistent boundary predictions in clinically significant regions such as the region of interest (ROI). These results support the clinical applicability of the proposed method in medical image analysis tasks requiring precise, boundary-focused segmentation.

1. Introduction

Glaucoma is one of the leading causes of blindness worldwide, often progressing without noticeable symptoms in its early stages, making it difficult for patients to detect any abnormalities. As the disease advances, it gradually narrows the visual field and can eventually lead to complete blindness. Due to these characteristics, glaucoma is often referred to as the “Silent Thief of Sight”, emphasizing the critical importance of early detection, continuous monitoring, and appropriate treatment. Major risk factors include optic nerve damage caused by elevated intraocular pressure (IOP), along with various environmental factors such as age, gender, genetic predisposition, and ethnicity [1].
Traditionally, glaucoma diagnosis has relied on methods such as visual field testing, intraocular pressure measurement, and direct examination of the optic nerve head. However, these techniques depend heavily on the expertise of medical professionals and face limitations when applied to large patient populations due to factors such as examiner fatigue and the high costs associated with repeated testing. Consequently, there has been growing research interest in automated glaucoma diagnosis methods based on retinal fundus images. Among these, the Cup-to-Disc Ratio (CDR) is widely recognized as a key quantitative indicator for assessing glaucoma risk. The CDR is defined as the ratio of the diameter of the OC to that of the OD, with a larger CDR generally indicating a higher risk of glaucoma. Figure 1 provides a visual illustration of the CDR concept and the spatial relationship between the OD and OC.
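As a concrete illustration of the CDR, the short sketch below computes a vertical CDR from binary OD and OC masks; the function name, the use of the vertical extent, and the NumPy-based implementation are illustrative assumptions rather than a procedure taken from this paper.

```python
import numpy as np

def vertical_cdr(od_mask: np.ndarray, oc_mask: np.ndarray) -> float:
    """Illustrative vertical Cup-to-Disc Ratio from binary masks of shape (H, W) with values 0/1."""
    od_rows = np.where(od_mask.any(axis=1))[0]  # image rows containing disc pixels
    oc_rows = np.where(oc_mask.any(axis=1))[0]  # image rows containing cup pixels
    if od_rows.size == 0 or oc_rows.size == 0:
        return 0.0
    od_diameter = od_rows.max() - od_rows.min() + 1  # vertical disc extent in pixels
    oc_diameter = oc_rows.max() - oc_rows.min() + 1  # vertical cup extent in pixels
    return float(oc_diameter) / float(od_diameter)   # larger values generally indicate higher glaucoma risk
```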
Previous research [2] focused on classifying glaucoma presence using multiple visual representations and ensemble classification models, without performing structural segmentation. However, for accurate glaucoma diagnosis, it is essential to first achieve precise segmentation of the OD and OC boundaries. Therefore, this study redefines the problem with a focus on fine-grained segmentation, aiming to establish a foundation for improving structure-based diagnostic accuracy. Nevertheless, segmenting the OD and OC boundaries remains technically challenging. The two structures exhibit low boundary contrast and high morphological similarity within fundus images, making them difficult to distinguish. Moreover, variations caused by disease progression, lighting conditions, and interpatient differences further complicate the task. Although various deep learning-based segmentation studies have been actively pursued, most have struggled to clearly capture the structural inclusion relationship between the OD and OC or have suffered from insufficient boundary awareness.
To address these challenges, we propose a novel segmentation model designed to precisely delineate the structures of the OD and OC, which has the following three key contributions:
  • Advanced Data Augmentation for Enhanced Generalization: We introduce a data augmentation method based on truncated Gaussian sampling and colormap transformations to preserve boundary information while capturing diverse imaging conditions and patient characteristics. This approach enhances the structural features of the fundus, which are often difficult to observe in original images, and improves OD and OC segmentation performance through minimally distorted data augmentation.
  • Boundary-aware Transformer Attention (BAT): To explicitly model the anatomical inclusion relationship between the OD and OC, we designed the BAT module. The BAT module is incorporated into the skip connections of the U-Net architecture, enhancing boundary recognition capability by leveraging multi-resolution contextual features.
  • Geometry-aware Loss Function: Conventional Dice Loss and IoU Loss are effective for pixel-level accuracy but have limitations in improving boundary precision. This study introduces a Geometry-aware Loss by incorporating a normalized Hausdorff Distance to reduce boundary shape distortions and quantitatively correct structure-based errors.

2. Related Work

Accurate segmentation of the OD and OC in retinal fundus images is a critical component for early glaucoma diagnosis. Various deep learning-based approaches have been proposed to achieve this, primarily aiming to capture the structural characteristics of the OD and OC while ensuring precise boundary segmentation and stable computation of clinical indicators such as the CDR. The most widely used baseline architecture is U-Net, introduced by Ronneberger et al. [3], which employs an encoder–decoder structure with skip connections to effectively recover fine structures in medical images. On the Drishti-GS dataset, U-Net achieved an IoU of 0.9190 and a Dice score of 0.9559 for OD, as well as an IoU of 0.7416 and a Dice score of 0.8383 for OC. While U-Net showed high accuracy for large structures like the OD, it performed relatively poorly for the smaller OC regions with low boundary contrast. To overcome these limitations, Oktay et al. [4] proposed Attention U-Net, which integrates an attention mechanism into U-Net to suppress irrelevant background information and enhance performance by focusing visual attention. Zhou et al. [5] and Long et al. [6] attempted to improve boundary delineation by employing FCN-based architectures with streamlined decoding strategies. Zilly et al. [7] and Sevastopolsky et al. [8] enhanced OD and OC segmentation performance by applying post-processing and boundary refinement techniques to the encoder’s CNN features, achieving OD Dice scores of 0.9730 and OC Dice scores between 0.8500 and 0.8910. These studies demonstrated that optimization techniques could significantly improve performance even with simple network structures. More recently, there have been efforts to maximize performance through multi-scale feature integration and complex loss function designs. Feng et al. [9] applied a Residual U-Net-based structure with specialized loss weighting, achieving an OD Dice score of 0.9741 and an OC Dice score of 0.8893. Zhu et al. [10] and Gu et al. [11] reported outstanding performance with OD Dice scores of 0.9743 and 0.9746 and OC Dice scores of 0.9083 and 0.8992, respectively. These approaches particularly focused on capturing subtle variations in OC boundaries, achieving clinically meaningful results. However, these CNN-based approaches, relying primarily on local operations, still face limitations in fully capturing the structural inclusion relationship between the OD and OC and modeling global contextual information. To address these issues, Transformer-based models have recently gained attention [12,13,14].
Transformers can learn global information from images through the self-attention mechanism, enhancing the recognition of fine structures and boundary accuracy. The Vision Transformer (ViT) by Dosovitskiy et al. [15] processes images by splitting them into patches, and in addition to classification tasks, ViT-based segmentation models achieved a mean IoU (mIoU) of 47.6% on the ADE20K dataset, demonstrating performance comparable to or better than CNN-based models like ResNet. Huang et al. [16] proposed CCNet, which introduced Criss-Cross Attention to reduce the computational complexity inherent in traditional Transformers by limiting interactions to pixels in the same rows and columns while still effectively capturing global contextual information. CCNet achieved an mIoU of 81.9% on the Cityscapes dataset, improving both efficiency and performance. Liu et al. [17] introduced the Swin Transformer, which overcomes limitations of the ViT by adopting a hierarchical structure and a Shifted Window Attention mechanism. By applying local attention within windows and shifting the windows between layers, the Swin Transformer effectively captures global information while maintaining computational efficiency. This approach showed outstanding performance in tasks requiring fine structural detail, such as segmentation, achieving a 53.5% mIoU on the ADE20K dataset, outperforming CNN-based models like DeeplabV3+ [18], and clearly demonstrating the potential of Transformer-based segmentation.
However, existing Transformer-based approaches mostly focus on simply combining pretrained models with U-Net structures, lacking explicit designs to capture the structural inclusion relationship between the OD and OC or to enhance boundary recognition. Most approaches aimed to improve pixel-level accuracy, and attempts to quantitatively correct boundary misalignments or structural distance differences using boundary-aware loss functions were rare. To overcome the limitations of conventional CNN and Transformer-based segmentation models, this study proposes a new segmentation framework that explicitly learns the structural relationship between the OD and OC through the design of the BAT and enhances boundary recognition by introducing a Geometry-aware Loss based on the normalized Hausdorff Distance.

3. Proposed Method

3.1. Data Augmentation: Truncated Gaussian, Colormap

To enhance the performance of deep learning-based segmentation models, a sufficient amount of training data is essential. Consequently, the field of computer vision has actively explored data augmentation techniques, including geometric transformations such as rotation, translation, scaling, and brightness adjustment, as well as methods using generative models like Generative Adversarial Networks (GANs) [19] and Variational Autoencoders (VAEs) [20].
In medical imaging, the availability of training data is often constrained by ethical and privacy regulations. Furthermore, ensuring the structural fidelity of medical images is critical for reliable diagnosis. Several studies have indicated that augmentations based on generative models may compromise anatomical integrity, thereby elevating the risk of diagnostic errors. Therefore, it is essential to prioritize augmentation techniques that not only increase data diversity but also preserve anatomical structures and sharpen boundary features.
To meet these demands, this study applies a data augmentation technique combining truncated Gaussian sampling with colormap transformations. Truncated Gaussian sampling balances the emphasis between central regions (e.g., around the optic disc) and boundary regions (e.g., vascular and tissue boundaries), enabling the generation of diverse local variations without distorting the original structural properties of the images. Meanwhile, colormap transformations adjust brightness and contrast while preserving the global color distribution of the original images, contributing to improved boundary recognition performance. Through these methods, we enhanced data diversity while preserving vascular and tissue structures, thereby effectively improving the generalization performance of the segmentation model.
To address this issue, we propose a VAE-based data augmentation method incorporating truncated Gaussian sampling. While conventional VAEs generate new data by sampling randomly from the latent space, our approach samples from a standard normal distribution $\mathcal{N}(0, 1)$ but restricts it to the truncated range $[-\tau, \tau]$:

$$z \sim \mathcal{TN}(0, 1; -\tau, \tau), \quad \tau \leq 1$$

Here, $\tau \leq 1$ is set as a threshold to preserve the structural characteristics of the original data, guiding the generation of samples that closely resemble the originals.
This approach minimizes distortion of the OD and OC boundaries, thereby enhancing both the stability and accuracy of network training. A visual comparison is presented in Figure 2. Without applying truncated Gaussian sampling (c), significant structural distortion and deformation are observed compared to the original image (a), whereas the proposed method (b) successfully preserves the boundary structures of the original.
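The following is a minimal sketch of the truncated sampling idea in a VAE latent space, assuming a trained decoder named `decoder` and a latent dimension `latent_dim`; the rejection-sampling strategy and the default τ value are our own illustrative choices.

```python
import torch

def sample_truncated_latent(batch_size: int, latent_dim: int, tau: float = 1.0) -> torch.Tensor:
    """Draw z ~ N(0, 1) restricted to [-tau, tau] by resampling out-of-range entries."""
    z = torch.randn(batch_size, latent_dim)
    out_of_range = z.abs() > tau
    while out_of_range.any():                          # resample only the entries outside [-tau, tau]
        z[out_of_range] = torch.randn(int(out_of_range.sum()))
        out_of_range = z.abs() > tau
    return z

# Hypothetical usage with a trained VAE decoder:
# z = sample_truncated_latent(batch_size=16, latent_dim=128, tau=1.0)
# augmented_images = decoder(z)  # samples stay close to the training distribution, preserving structure
```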
To further enhance the structural features of fundus images, we additionally applied a variety of color mapping techniques. In retinal fundus imaging, the low color contrast between the OD and OC often leads to ambiguous boundaries. By transforming the color space, these boundaries can be accentuated, improving both human interpretability and model performance. In consultation with clinical experts, we selected color mapping methods that are both diagnostically relevant and practically applicable.
First, the Red-Free conversion removes the red channel, thereby enhancing the visibility of vascular structures and improving the distinction between the OD and OC regions. Second, the Grayscale conversion eliminates color information and focuses solely on luminance, which helps emphasize structural intensity differences and contributes to better segmentation performance. Third, the Jet colormap applies a high-contrast color scale that amplifies boundary distinctions, enabling clearer identification of the OD and OC shapes. Fourth, the Viridis conversion employs a smoothly transitioning color gradient designed to minimize perceptual distortion while effectively highlighting differences between the OD and OC. Lastly, the Inferno colormap enhances contrast in low-light regions, adding a sense of depth and making the OD and OC boundaries more visually prominent.
These transformations go beyond simple color adjustments by significantly improving the perceptibility of OD and OC boundaries, thereby enabling the model to learn boundary-aware features more effectively. Examples of these color-mapped images are visualized in Figure 3.
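A compact sketch of the five color transformations described above is given below, using OpenCV and Matplotlib colormaps; the exact conversion settings used in this study may differ.

```python
import cv2
import numpy as np
import matplotlib.cm as cm

def colormap_variants(rgb: np.ndarray) -> dict:
    """Generate Red-Free, Grayscale, Jet, Viridis, and Inferno views of an RGB fundus image (H x W x 3, uint8)."""
    gray = cv2.cvtColor(rgb, cv2.COLOR_RGB2GRAY)
    red_free = rgb.copy()
    red_free[..., 0] = 0                                                  # remove the red channel to emphasize vessels
    jet = cv2.applyColorMap(gray, cv2.COLORMAP_JET)                       # high-contrast scale (returned in BGR order)
    viridis = (cm.viridis(gray / 255.0)[..., :3] * 255).astype(np.uint8)  # perceptually smooth gradient
    inferno = (cm.inferno(gray / 255.0)[..., :3] * 255).astype(np.uint8)  # boosts contrast in dark regions
    return {"red_free": red_free, "grayscale": gray, "jet": jet, "viridis": viridis, "inferno": inferno}
```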

3.2. Boundary-Aware Transformer Attention

To improve the accuracy of boundary delineation between the OD and OC in fundus images, we propose the BAT module. While the conventional CNN-based U-Net architecture excels at learning local feature-level information, it has limitations in explicitly capturing global structural relationships, relative object positioning, and boundary information. In particular, accurately segmenting the OC located within the OD requires more than local features alone, as it is difficult to precisely distinguish the complex relationship between the two structures.
The self-attention mechanism in Transformers can selectively emphasize highly relevant regions across an entire image, making it advantageous for integrating broader contextual information compared to standard CNNs. However, conventional self-attention generates the Query, Key, and Value from the same feature map, limiting its ability to explicitly learn boundary relationships between structurally distinct objects. While it can model positional correlations, it is not optimized for clearly distinguishing boundaries between different structures.
To overcome these limitations, we designed the BAT module. In BAT, the Query (Q) is extracted from the OC feature map to focus attention on the relatively smaller cup region, while the Key (K) is generated from the OD feature map to encode the structural relationship between the cup and disc. The Value (V) is taken from the global feature map, integrating the relationship between Q and K to produce the final segmentation output. This independent setting of Q, K, and V directly reflects the inclusion structure of the OC within the OD and is implemented as the input configuration for the Transformer attention. As a result, the BAT enables boundary-focused feature enhancement, effectively compensating for the structural boundary recognition that conventional U-Net-based segmentation models tend to miss.
BAT is designed to take the feature maps extracted at each resolution stage of the U-Net encoder as inputs, allowing the model to explicitly learn boundary relationships between the OD and OC across multiple spatial scales. This design not only improves fine boundary recognition but also contributes to enhanced precision and structural consistency in segmentation performance.
$$Q = f_{\mathrm{emb}}(F_{OC}), \quad K = f_{\mathrm{emb}}(F_{OD}), \quad V = f_{\mathrm{emb}}(F_{Global})$$

where $f_{\mathrm{emb}}(\cdot)$ denotes a linear transformation that embeds the feature maps into the appropriate input dimension for the Transformer.
The BAT module includes a transformation step that converts the feature maps extracted from the intermediate skip connections of the U-Net’s convolutional layers into suitable inputs for Transformer attention. To achieve this, the feature maps are passed through an embedding layer, which projects them into fixed-dimensional vectors as follows:
$$F_{\mathrm{emb}} = W_{\mathrm{emb}} F + b_{\mathrm{emb}}, \quad F_{\mathrm{emb}} \in \mathbb{R}^{32 \times 32 \times d}$$

where $W_{\mathrm{emb}}$ is a learnable weight matrix, $b_{\mathrm{emb}}$ is the bias term, and $d$ denotes the model dimension used in the Transformer. Following the embedding step, Multi-Head Self-Attention (MHSA) is applied to learn the structural relationship between the OD and OC. Each attention head plays a critical role in distinguishing the OC boundary within the OD, and through the attention weights, the model is able to capture key boundary features essential for accurate segmentation.
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$
The embedded feature maps are processed through MHSA, which enables the model to learn the structural relationship between the OD and OC across multiple attention heads. The resulting attention output is then passed through a Feedforward Network (FFN) followed by layer normalization to ensure stability and enhance feature representation.
Let $F_{\mathrm{skip}}$ denote the feature map originally propagated through the U-Net's skip connection. To minimize information loss and preserve local details, the BAT output is merged with $F_{\mathrm{skip}}$ in a residual manner, effectively combining both global contextual cues and local structural features.

$$F_{BAT} = \mathrm{LayerNorm}\big(\mathrm{FFN}\big(\mathrm{LayerNorm}(\mathrm{MHSA}(Q, K, V) + F_{\mathrm{skip}})\big)\big)$$
The feature maps refined through the BAT module are subsequently passed to the decoder of the U-Net to generate the final segmentation output. Let the output of the BAT module be denoted as $F_{BAT}$; this serves as the final feature representation used in the decoder, effectively integrating both boundary-aware global context and localized structural information for improved segmentation performance.

$$F_{\mathrm{decoder}} = \mathrm{Concat}(F_{BAT}, F_{\mathrm{skip}})$$
The decoder generates the segmentation mask based on these combined feature maps, ultimately producing outputs that clearly distinguish the boundary of the OC located within the OD. The overall architecture of the proposed BAT structure is visualized on the left side of Figure 4. The effectiveness of the BAT module is visually demonstrated on the right side of Figure 4. Before applying the BAT module, uniform activation appears around the center of the optic disc, but the boundary with the optic cup is not distinctly highlighted. After applying BAT, the boundary between the OD and OC becomes significantly clearer and more pronounced. Particularly within the ROI image, where the relative proportion of the OC and OD is larger compared to the full fundus image, it is even more evident that the BAT module substantially enhances fine local boundary refinement around the boundaries.
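The PyTorch sketch below condenses the BAT block described above: Q, K, and V are embedded from the OC, OD, and global feature maps, the attention output is merged residually with the skip features, and the result is concatenated with the skip path. The channel count, model dimension, head count, FFN width, and the way the OC/OD/global maps are obtained are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

class BATBlock(nn.Module):
    """Boundary-aware Transformer Attention: Q from OC features, K from OD features, V from global features."""
    def __init__(self, in_channels: int, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.q_emb = nn.Linear(in_channels, d_model)    # f_emb for the OC feature map
        self.k_emb = nn.Linear(in_channels, d_model)    # f_emb for the OD feature map
        self.v_emb = nn.Linear(in_channels, d_model)    # f_emb for the global feature map
        self.skip_emb = nn.Linear(in_channels, d_model)
        self.mhsa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm2 = nn.LayerNorm(d_model)
        self.out_proj = nn.Linear(d_model, in_channels)

    def forward(self, f_oc, f_od, f_global, f_skip):
        # Each input is a (B, C, H, W) feature map taken from a skip-connection stage.
        b, c, h, w = f_skip.shape
        tokens = lambda f: f.flatten(2).transpose(1, 2)          # (B, H*W, C) token sequence
        q, k, v = self.q_emb(tokens(f_oc)), self.k_emb(tokens(f_od)), self.v_emb(tokens(f_global))
        attn, _ = self.mhsa(q, k, v)                             # boundary-focused cross-structure attention
        x = self.norm1(attn + self.skip_emb(tokens(f_skip)))     # residual merge with the skip features
        f_bat = self.norm2(self.ffn(x))                          # F_BAT as in the equation above
        f_bat = self.out_proj(f_bat).transpose(1, 2).reshape(b, c, h, w)
        return torch.cat([f_bat, f_skip], dim=1)                 # F_decoder = Concat(F_BAT, F_skip)
```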

3.3. Loss Function: Geometry-Aware Loss

To more precisely delineate the boundaries between the OD and OC, this study proposes a Geometry-aware Loss that combines Dice Loss with normalized Hausdorff Distance Loss. Conventional segmentation models typically use Cross-Entropy or Dice Loss to measure the degree of overlap between the predicted mask and the ground truth, focusing on maximizing overall region matching. However, these loss functions have limitations in fully capturing fine-grained segmentation performance at the boundary areas. In glaucoma diagnosis, where boundary recognition between the OD and OC is critical, relying solely on pixel-centric loss functions makes it difficult to achieve stable and high precision.
Accordingly, we designed a loss structure that considers both region overlap and boundary precision. The Dice Loss effectively measures the overall degree of overlap between two regions, while the Hausdorff Distance captures local boundary errors by measuring the maximum distance between the predicted and ground truth boundaries. In particular, the Hausdorff Distance emphasizes extreme boundary outliers, encouraging the model to learn even subtle errors near the boundary. In this study, we normalized the Hausdorff Distance by the diagonal length of the image, reducing the scale gap between the Dice Loss and Hausdorff Distance when combined, thereby improving both training stability and boundary recognition performance. The combination of the Dice Loss and normalized Hausdorff Distance Loss effectively reduces prediction errors near the OD/OC boundaries and plays a critical role in elevating boundary precision and overall segmentation quality to clinically meaningful levels.
$$L_{\mathrm{Dice}} = 1 - \frac{2 \sum_{i} y_i \hat{y}_i}{\sum_{i} y_i + \sum_{i} \hat{y}_i}$$

where $y_i$ denotes the ground truth label, and $\hat{y}_i$ represents the predicted segmentation output. While the Dice Loss effectively evaluates the overall region overlap, it is less sensitive to discrepancies near object boundaries, which is a critical limitation in tasks requiring fine boundary localization.
To overcome this, we incorporate Hausdorff Distance Loss to improve the precision of boundary delineation between the OD and OC. The Hausdorff Distance measures the maximum deviation between the closest points of two sets, in this case, between the predicted and ground truth segmentation boundaries. By minimizing this distance, the model is encouraged to align its predicted boundaries more closely with the actual anatomical contours.
$$L_{HD} = \max\left\{ \sup_{x \in Y} \inf_{y \in \hat{Y}} d(x, y), \; \sup_{x \in \hat{Y}} \inf_{y \in Y} d(x, y) \right\}$$

where $Y$ denotes the set of boundary points in the ground truth, $\hat{Y}$ represents the set of predicted boundary points, and $d(x, y)$ is the Euclidean distance between points $x$ and $y$. The inf operator gives the distance from a point to its closest point in the other set, and the sup operator takes the largest such distance over all points, capturing the worst case. In this context, the Hausdorff Distance $L_{HD}$ serves as a metric that quantifies the worst-case deviation between the predicted and actual boundaries, making it particularly effective for penalizing large boundary errors in segmentation tasks.
However, the Hausdorff Distance is sensitive to image scale, as its values can vary significantly depending on image resolution. When combined directly with the Dice Loss, which is bounded in the range [ 0 , 1 ] , the mismatch in scale can destabilize the training process. While the Dice Loss yields normalized values, the Hausdorff Distance may range from tens to hundreds of pixels.
To resolve this issue, we normalized the Hausdorff Distance Loss by dividing it by the maximum possible diagonal distance of the image. This allows the loss values to be on a comparable scale, enabling more stable and balanced joint optimization with the Dice Loss. The final loss function, which combines the Dice Loss with the normalized Hausdorff Distance Loss, is designed to simultaneously improve both region-level segmentation and boundary accuracy.
$$L_{\mathrm{total}} = \lambda_1 L_{\mathrm{Dice}} + \lambda_2 L_{HD}^{\mathrm{norm}}$$

where $\lambda_1$ and $\lambda_2$ are weighting factors for the Dice Loss and the Hausdorff Distance Loss, respectively. Through empirical evaluation, we found that setting $\lambda_1 = 1.0$ and $\lambda_2 = 0.3$ yielded the most stable convergence and superior segmentation performance. During the early stages of training, the $L_{HD}^{\mathrm{norm}}$ term exhibited relatively high values, reflecting initial boundary discrepancies. However, as training progressed, this term gradually decreased, demonstrating its positive contribution to boundary recognition performance over time.
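A simplified sketch of the combined loss is given below. The Dice term is differentiable, while the Hausdorff term is computed here on boundary points extracted with OpenCV and SciPy and then normalized by the image diagonal; this mirrors the description above but is an evaluation-style illustration, since the paper does not spell out a differentiable surrogate. The weights follow the reported λ1 = 1.0 and λ2 = 0.3.

```python
import math

import cv2
import numpy as np
import torch
from scipy.spatial.distance import directed_hausdorff

def dice_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Soft Dice loss over probability maps of shape (B, H, W)."""
    inter = (pred * target).sum(dim=(1, 2))
    denom = pred.sum(dim=(1, 2)) + target.sum(dim=(1, 2))
    return (1 - (2 * inter + eps) / (denom + eps)).mean()

def normalized_hausdorff(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """Symmetric Hausdorff Distance between boundary point sets, scaled by the image diagonal."""
    pred_edges = cv2.Canny(pred_mask.astype(np.uint8) * 255, 100, 200).nonzero()
    gt_edges = cv2.Canny(gt_mask.astype(np.uint8) * 255, 100, 200).nonzero()
    p, g = np.stack(pred_edges, axis=1), np.stack(gt_edges, axis=1)
    if len(p) == 0 or len(g) == 0:
        return 1.0                                              # worst case when a boundary is missing
    hd = max(directed_hausdorff(p, g)[0], directed_hausdorff(g, p)[0])
    return hd / math.hypot(*pred_mask.shape)                    # keeps the term on a scale comparable to Dice

def geometry_aware_loss(pred: torch.Tensor, target: torch.Tensor,
                        lam1: float = 1.0, lam2: float = 0.3) -> torch.Tensor:
    """Weighted combination of Dice loss and the normalized Hausdorff term (first sample shown for brevity)."""
    hd = normalized_hausdorff(pred.detach().cpu().numpy()[0] > 0.5,
                              target.cpu().numpy()[0] > 0.5)
    return lam1 * dice_loss(pred, target) + lam2 * hd
```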

3.4. Proposed Architecture

In this study, we propose a segmentation network incorporating the BAT module to more precisely delineate the boundaries between the OD and OC in fundus images. The proposed model is based on the conventional U-Net architecture but integrates the BAT module, which is specialized for boundary recognition and structural relationship learning, to enhance overall segmentation performance. Figure 5 provides a visual overview of the proposed network architecture.
While the conventional CNN-based U-Net is effective at extracting the local features of individual objects, it has limitations in explicitly learning the structural relationships, such as inclusion and relative positioning, needed to distinguish the OC boundary located within the OD. Particularly, the boundary between the cup and disc is difficult to recognize accurately using only local features due to low color contrast and similar shapes. To address this limitation, we inserted the BAT module within the skip connections, leveraging Transformer attention to strengthen the recognition of structural relationships and boundary-focused learning.
The overall network consists of three parts: an encoder, the BAT module, and a decoder. First, the encoder follows the convolution-based structure of U-Net, transforming the input image ( 512 × 512 × 3 ) into multi-resolution feature maps. Intermediate features extracted at each encoding stage are stored via skip connections for later use in the decoder. The intermediate feature maps extracted from the encoder are processed through the BAT module using Transformer attention. Placed within the skip connections, BAT performs attention operations that reflect the hierarchical relationship between OD and OC feature maps, enhancing information specifically relevant to boundary recognition. This allows the network to learn clearer representations of the OC boundaries contained within the OD. The refined feature maps are then passed to the decoder, which combines them with the corresponding skip connections to reconstruct the final segmentation output. The decoder performs repeated upsampling and convolution operations to gradually increase the spatial resolution and ultimately generates a segmentation output of size 512 × 512 . A softmax activation function is applied at the final stage to produce the segmentation mask predicting the boundaries of the OD and OC.
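The schematic below shows one way the BAT blocks could sit on the skip paths of an otherwise standard U-Net, reusing the `BATBlock` sketch from Section 3.2; the encoder and decoder internals, the per-stage channel counts, and the way OC/OD/global features are routed are all assumptions.

```python
import torch
import torch.nn as nn

class BATUNet(nn.Module):
    """Schematic U-Net with BAT blocks on the skip connections (encoder/decoder stages omitted)."""
    def __init__(self, encoder: nn.Module, decoder: nn.Module, skip_channels=(64, 128, 256, 512)):
        super().__init__()
        self.encoder = encoder                                             # assumed to yield per-stage skip features
        self.decoder = decoder                                             # assumed to consume Concat(F_BAT, F_skip) per stage
        self.bat_blocks = nn.ModuleList([BATBlock(c) for c in skip_channels])  # BATBlock from the Section 3.2 sketch

    def forward(self, x: torch.Tensor) -> torch.Tensor:                   # x: (B, 3, 512, 512)
        skips, bottleneck = self.encoder(x)
        refined = [bat(f_oc=s, f_od=s, f_global=s, f_skip=s)               # in the full model, Q/K/V come from OC/OD/global maps
                   for bat, s in zip(self.bat_blocks, skips)]
        logits = self.decoder(bottleneck, refined)                         # repeated upsampling + convolution
        return logits.softmax(dim=1)                                       # OD/OC/background probability maps
```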
This design builds on the strength of CNNs in local feature learning while complementing it with global structural and boundary-focused information through the BAT module, enabling precise boundary delineation between the OC and OD. As a result, the proposed structure improves boundary segmentation accuracy compared to the conventional U-Net and more effectively captures the structural inclusion relationships within fundus images.

4. Experiment Implementation

4.1. Experiment Setup

In this study, we evaluated the generalization performance of the proposed model using five publicly available fundus image datasets with diverse characteristics and conditions. These datasets differ in terms of imaging equipment, resolution, patient populations, and image quality, thereby indirectly reflecting a variety of clinical conditions. Each dataset possesses unique features related to imaging environments, resolution, annotation methods, and the presence or absence of glaucoma. Therefore, independent experiments were conducted without mixing datasets, focusing on dataset-specific analysis. When a predefined train/test split was available, the training set was further divided into train and validation sets in an 8:2 ratio. When no predefined split was provided, the entire dataset was partitioned into train, validation, and test sets with a 7:2:1 ratio for the experiments. Detailed information is summarized in Table 1.
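The split policy described above can be reproduced with a few lines; the helper name and random seed below are assumptions.

```python
import random

def split_indices(n_images: int, has_official_test: bool, seed: int = 42):
    """Return (train, val, test) index lists: 8:2 when an official test split exists, otherwise 7:2:1."""
    idx = list(range(n_images))
    random.Random(seed).shuffle(idx)
    if has_official_test:
        cut = int(0.8 * n_images)
        return idx[:cut], idx[cut:], None           # test set comes from the dataset's predefined split
    n_train, n_val = int(0.7 * n_images), int(0.2 * n_images)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```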
The DRIONS-DB (Digital Retinal Images for Optic Nerve Segmentation Database) is a fundus image database designed to evaluate the performance of optic disc segmentation algorithms. It consists of 110 fundus images collected from 55 patients randomly selected at the Ophthalmology Service of Miguel Servet Hospital in Zaragoza, Spain. The images were captured using a color analog fundus camera at a resolution of 600 × 400 pixels. Each image was annotated by two medical experts, with OD boundaries marked using 36 landmarks per expert.

The Drishti-GS dataset, collected from an Indian patient population, consists of 101 retinal images divided into 50 for training and 51 for testing. Each image is provided in PNG format with a resolution of 2896 × 1944 pixels and includes both normal and glaucomatous eyes. The training set contains manual segmentation labels for the OD and OC regions based on annotations from multiple experts.

REFUGE (Retina Fundus Glaucoma Challenge) is a dataset provided through the Retina Fundus Glaucoma Challenge held in conjunction with MICCAI in Spain. It consists of 1200 fundus images collected from multiple centers, split into training (400 images), validation (400 images), and testing (400 images) sets. The images have resolutions of 2124 × 2056 and 1634 × 1634 pixels and include both normal and glaucomatous cases, reflecting diverse imaging conditions and quality variations.

ORIGA (an online retina fundus image database for glaucoma analysis and research) was collected by the Singapore Eye Research Institute and contains a total of 650 fundus images, including 168 images from glaucomatous patients. It provides accurate boundary labels for the OD and OC as well as CDR information. The images generally exhibit low brightness and a higher level of noise.

The G1020 dataset is a large-scale, high-resolution fundus image dataset provided in JPG format with a resolution of 3004 × 2423 pixels. It includes precise annotation data for the locations of the OD and OC as well as the size of the neuroretinal rim. The dataset features high image quality with well-preserved fine structural details.
All experiments were conducted in an Ubuntu 22.04 environment using an Intel 14th-generation i9 CPU and an Nvidia 4090 GPU. PyTorch 2.4.1 was used as the deep learning framework. The batch size was set to 16. Depending on the experimental conditions, the learning rate was varied between 0.01 and 0.0001, which was key to achieving efficient optimization and faster convergence.
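For reference, a minimal PyTorch training configuration consistent with the stated settings (batch size 16, learning rate swept between 0.0001 and 0.01) might look like the following; the optimizer and scheduler choices are assumptions, as the paper does not name them.

```python
import torch
from torch.utils.data import DataLoader

def build_training(model: torch.nn.Module, train_set, lr: float = 1e-3):
    """Hypothetical setup; only the batch size and the learning-rate range come from the text above."""
    loader = DataLoader(train_set, batch_size=16, shuffle=True, num_workers=4, pin_memory=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)                # lr varied within [1e-4, 1e-2]
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
    return loader, optimizer, scheduler
```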

4.2. Evaluation Metrics

The segmentation performance of the OD and OC on fundus images was quantitatively evaluated using the Intersection over Union (IoU) and Dice Coefficient, as shown in Figure 6.
The IoU, also referred to as the Jaccard Index, evaluates the similarity between the predicted segmentation mask and the ground truth by calculating the ratio of their intersection area to their union area.
$$\mathrm{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}} = \frac{|\hat{Y} \cap Y|}{|\hat{Y} \cup Y|}$$

where $\hat{Y}$ is the predicted segmentation mask, and $Y$ is the ground truth.
The Dice Coefficient score is conceptually similar to the IoU but places greater emphasis on the degree of overlap between two regions. It is defined as
$$\mathrm{Dice} = \frac{2 \times \text{Area of Overlap}}{\text{Total Area}} = \frac{2\,|\hat{Y} \cap Y|}{|\hat{Y}| + |Y|}$$

where $\hat{Y}$ and $Y$ represent the predicted and ground truth segmentation masks, respectively.
The Dice Coefficient score ranges from 0 to 1, with values closer to 1 indicating a high level of agreement between the prediction and the ground truth. Compared to the IoU, the Dice Coefficient score is more sensitive to small objects, making it particularly effective for evaluating the segmentation of smaller anatomical structures such as the OC.
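Both metrics reduce to a few tensor operations; the sketch below assumes binarized masks of shape (B, H, W) and an epsilon for numerical stability.

```python
import torch

def iou_score(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """IoU (Jaccard Index) between binary masks of shape (B, H, W)."""
    inter = (pred * target).sum(dim=(1, 2))
    union = ((pred + target) > 0).float().sum(dim=(1, 2))
    return ((inter + eps) / (union + eps)).mean()

def dice_score(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Dice coefficient; more sensitive than IoU to small structures such as the OC."""
    inter = (pred * target).sum(dim=(1, 2))
    return ((2 * inter + eps) / (pred.sum(dim=(1, 2)) + target.sum(dim=(1, 2)) + eps)).mean()
```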

5. Results and Analysis

5.1. Performance Analysis on Different Datasets

Table 2 and Figures 7–16 present a comprehensive comparison of the segmentation performance between the proposed BAT-based segmentation network and several existing models across five public fundus image datasets: DRIONS-DB, Drishti-GS, REFUGE, G1020, and ORIGA. The proposed model consistently achieved the highest IoU and Dice scores across all datasets, with the most notable performance improvements observed in the OC region, which is particularly challenging due to its smaller size and lower contrast with surrounding tissues. To provide a more intuitive understanding of the model’s effectiveness, Figures 7–16 visualize the segmentation results on both the entire fundus image and the region of interest (ROI). These visual comparisons allow for detailed analysis of whether the proposed model maintains spatial consistency across full images and how precisely it distinguishes the boundaries within localized regions.
The baseline models used for comparison include the U-Net, a widely adopted encoder–decoder architecture known for robust local feature extraction; the Attention U-Net, which extends U-Net with attention gates to suppress irrelevant background features; DeepLabV3, which utilizes Atrous Spatial Pyramid Pooling (ASPP) to capture multi-scale contextual information; TransUNet, a hybrid architecture combining CNN-based U-Net encoders with Transformer modules to improve global context modeling; and Swin-UNet, which leverages hierarchical self-attention to effectively integrate multi-resolution features.
Together, these quantitative and qualitative results demonstrate that the proposed BAT-based segmentation network significantly outperformed existing methods, particularly in challenging anatomical regions, by providing more accurate and consistent boundary delineation through the integration of global contextual reasoning and boundary-aware attention.
The BAT-based model proposed in this study has a computational complexity of approximately 130 GFLOPs for an input resolution of 512 × 512 , with an average inference time of about 22 ms on an NVIDIA RTX 4090. While this represents a slight increase in computation compared to U-Net (90 GFLOPs) and Attention U-Net (110 GFLOPs), it remains relatively lightweight compared to more recent segmentation models such as DeepLabV3+ (210 GFLOPs), TransUNet (430 GFLOPs), and Swin-UNet (330 GFLOPs). Since the primary objective was to improve the accuracy through enhanced boundary recognition and structural preservation rather than to achieve real-time processing, this modest increase in computational load is considered acceptable.
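Per-image latency figures of the kind quoted above can be measured with a simple timing loop such as the one below; the warm-up count and dummy input shape are assumptions, and FLOP counts would require an external profiler that is not shown here.

```python
import time
import torch

@torch.no_grad()
def average_inference_ms(model: torch.nn.Module, n_runs: int = 100, warmup: int = 10,
                         size=(1, 3, 512, 512)) -> float:
    """Average inference latency in milliseconds for a single 512 x 512 input."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(size, device=device)
    for _ in range(warmup):                         # exclude one-time CUDA initialization costs
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) * 1000.0 / n_runs
```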
Figure 7. Segmentation results on fundus images from the DRIONS-DB dataset. (a) input fundus images, (b) ground truth, (c) predicted masks, and (d) predictions from the proposed model with attention maps.
Figure 8. Segmentation results on ROI images from the DRIONS-DB dataset. (a) ROI images, (b) ROI ground truth, (c) predicted masks, and (d) predictions from the proposed model with attention maps.
DRIONS-DB is a dataset that provides manual annotations only for the OD region, without any segmentation labels for the OC. Accordingly, segmentation performance on this dataset was evaluated solely based on the OD region. DRIONS-DB is known for its challenging characteristics, particularly due to the faint and unclear OD boundaries, which often hinder the performance of conventional models.
Existing methods tend to produce blurred segmentation masks, struggling to localize the OD boundary accurately. In contrast, the proposed model equipped with BAT achieved significantly superior results, recording an IoU of 0.9758 and a Dice score of 0.9877. These results outperformed the best-performing baseline model, Swin-UNet, which achieved an IoU of 0.9631 and a Dice score of 0.9677—representing an improvement of 1.27% in IoU and 2.06% in Dice.
This performance gain can be attributed to BAT’s ability to more precisely capture boundary information, enabling the network to distinguish subtle structural edges with higher fidelity. As illustrated in Figure 7 and Figure 8, the proposed model shows a clearer separation of boundaries compared to existing models both on full fundus images and on ROI visualizations.
Figure 9. Segmentation results on fundus images from the Drishti-GS dataset. (a) input fundus images, (b) OD ground truth, (c) OD predicted mask, (d) OC ground truth, (e) OC predicted mask, and (f) predictions from the proposed model with attention maps.
Figure 10. Segmentation results on ROI images from the Drishti-GS dataset. (a) ROI images, (b) ROI OD ground truth, (c) ROI OD predicted mask, (d) ROI OC ground truth, (e) ROI OC predicted mask, and (f) predictions from the proposed model with attention maps.
The Drishti-GS dataset provides separate annotations for both the OD and OC, enabling a detailed evaluation of segmentation performance for each structure. Due to the relatively small dataset size, there is potential for limited model generalization during training. Additionally, in some images, the boundaries between the OD and OC are visually ambiguous, leading existing segmentation models to produce incomplete or imprecise boundary predictions.
Despite these challenges, the proposed model achieved the highest performance on the Drishti-GS dataset. Specifically, for OD segmentation, it recorded an IoU of 0.9756 and a Dice score of 0.9783. For the more challenging OC region, the model achieved an IoU of 0.8405 and a Dice score of 0.9127, surpassing the performance of the best existing model, Swin-UNet, which obtained an IoU of 0.8305 and a Dice score of 0.9027.
These results demonstrate that the BAT enables the model to maintain strong generalization performance even on small-scale datasets, effectively enhancing boundary precision. As shown in Figure 9 and Figure 10, the proposed model delivers clearer shape and contour delineation in both full fundus and ROI-based visualizations, supporting its robustness in structural segmentation.
Figure 11. Segmentation results on fundus images from the REFUGE dataset. (a) input fundus images, (b) ground truth, (c) predicted masks, and (d) predictions from the proposed model with attention maps.
Figure 12. Segmentation results on ROI images from the REFUGE dataset. (a) ROI images, (b) ROI ground truth, (c) predicted masks, and (d) predictions from the proposed model with attention maps.
The REFUGE dataset, comprising a large number of images collected under various clinical conditions, is well suited for evaluating both the learning capacity and generalization ability of segmentation models. Its diversity makes it particularly valuable for assessing model robustness and applicability in real-world scenarios.
The proposed model achieved an IoU of 0.9786 and a Dice score of 0.9892, outperforming the previous best model, Swin-UNet, with an improvement of approximately 1.75% in the IoU and 1.72% in the Dice score. These results highlight the effectiveness of the proposed model in learning boundary-specific features, even under varied imaging conditions.
As shown in Figure 11 and Figure 12, the proposed model delivers sharper boundary delineation in both full fundus images and ROI regions. Moreover, the attention maps clearly indicate that the model successfully focuses on the boundary areas, confirming that the BAT attention mechanism contributed significantly to the improved performance.
Figure 13. Segmentation results on fundus images from the ORIGA dataset. (a) input fundus images, (b) ground truth, (c) predicted masks, and (d) predictions from the proposed model with attention maps.
The ORIGA dataset reflects a wide range of optical conditions and imaging techniques, which introduces additional challenges for segmentation models. In particular, several images in the dataset exhibit ambiguous boundaries between the OD and OC, leading many existing models to produce inaccurate or inconsistent boundary predictions.
Despite these difficulties, the proposed model achieved an IoU of 0.976 and a Dice score of 0.9879, representing the highest performance on this dataset. Compared to the previous best-performing model, Swin-UNet (IoU of 0.960, Dice score of 0.9719), our approach shows a 1.6% improvement in the Dice score, demonstrating its superior ability to handle complex and low-contrast boundary regions.
Figure 14. Segmentation results on ROI images from the ORIGA dataset. (a) ROI images, (b) ROI ground truth, (c) predicted masks, and (d) predictions from the proposed model with attention maps.
As shown in Figure 13 and Figure 14, the proposed model performed more precise boundary segmentation than existing models in both full fundus images and ROI views. In particular, the model more accurately reflects the relative size and spatial alignment of the OD and OC and produces smooth, consistent boundary predictions, especially in ROI regions where fine structural detail is critical for clinical interpretation.
Figure 15. Segmentation results on fundus images from the G1020 dataset. (a) input fundus images, (b) ground truth, (c) predicted masks, and (d) predictions from the proposed model with attention maps.
The G1020 dataset contains fundus images of varying quality, making it well-suited for evaluating the robustness of segmentation models. Existing segmentation methods often struggle with low-quality or low-contrast images, leading to inaccurate boundary predictions and performance degradation under such conditions.
The proposed model demonstrated strong performance on the G1020 dataset, achieving an IoU of 0.9696 and a Dice score of 0.9783 and significantly outperforming Swin-UNet, which achieved an IoU of 0.9503 and a Dice score of 0.9613. These results indicate that the proposed model is capable of maintaining stable and accurate predictions even across images with varying levels of quality thanks to its enhanced boundary awareness and contextual understanding.
Figure 16. Segmentation results on ROI images from the G1020 dataset. (a) ROI images, (b) ROI ground truth, (c) predicted masks, and (d) predictions from the proposed model with attention maps.
As illustrated in Figure 15 and Figure 16, the proposed model achieved sharper and more reliable boundary segmentation compared to existing methods, both in full images and ROI views. In particular, for the OC region, which is more prone to boundary ambiguity in low-quality images, the proposed model was able to produce clearer and more accurate delineations, where baseline models tended to generate blurred or incomplete segmentations.

5.2. Ablation Study

In this study, we conducted an ablation study to evaluate the individual and combined contributions of the proposed BAT, Geometry-aware Loss, and data augmentation techniques to the overall model performance. Table 3 presents the results of this analysis, showing how the IoU and Dice coefficients changed across three benchmark datasets (REFUGE, G1020, and ORIGA) as each component was incrementally integrated into the model.
This study conducted an ablation experiment to independently analyze the contribution of the proposed structure-preserving data augmentation, BAT module, and Geometry-aware Loss to model performance and to verify the synergistic effect when these components are combined. The baseline model was set as a standard U-Net without data augmentation, Geometry-aware Loss, or BAT applied. In this configuration, it achieved performances of an IoU of 0.9171 and a Dice score of 0.9177 on REFUGE, an IoU of 0.9097 and a Dice score of 0.9068 on G1020, and an IoU of 0.9045 and a Dice score of 0.9164 on ORIGA. These results show that a basic U-Net can perform segmentation to a certain extent, but limitations exist in boundary precision.
Applying truncated Gaussian sampling and colormap transformation improved the performance on the REFUGE dataset, with the IoU increasing from 0.9171 to 0.9247 and the Dice coefficient from 0.9177 to 0.9353. This suggests that the increased data diversity contributed to enhanced boundary recognition performance. However, data augmentation alone was insufficient to overcome limitations in structural relationship learning and boundary precision. When Geometry-aware Loss, incorporating the normalized Hausdorff Distance, was applied to the baseline, the performance improved to an IoU of 0.9302 and a Dice score of 0.9396 on the REFUGE dataset. Notably, the Dice coefficient showed a larger improvement, indicating enhanced segmentation precision around the boundaries.

Applying only the BAT module to the baseline resulted in an IoU of 0.9468 and a Dice coefficient of 0.9574 on REFUGE, demonstrating that by learning the structural inclusion relationship between the OD and OC, the model was able to recognize boundaries more clearly. The greater improvement in the Dice score compared to the IoU supports the effectiveness of BAT in specifically enhancing boundary precision.

Finally, the full model integrating all components (data augmentation + Geometry-aware Loss + BAT) achieved the highest performance, with REFUGE scores of an IoU of 0.9786 and a Dice score of 0.9892, G1020 scores of an IoU of 0.9696 and a Dice score of 0.9783, and ORIGA scores of an IoU of 0.976 and a Dice score of 0.9879. Although significant performance improvements were observed when only combining Geometry-aware Loss and BAT, the full model with additional data augmentation achieved even higher precision. This demonstrates that each component contributes to different aspects—enhancing data diversity, strengthening boundary recognition, and preserving structural information—and that their combination yields a synergistic effect. In summary, the ablation study empirically confirms that the three proposed components each independently lead to meaningful performance improvements and act complementarily to maximize the model’s boundary recognition and structural preservation capabilities.
Figure 17 presents a visual comparison of the ROI-based ablation study results on the REFUGE and ORIGA datasets. Each row corresponds to a different patient’s fundus image, and each column shows the output of a specific model configuration.
In the baseline model, the OD boundary appears irregular and jagged, and the segmentation error is especially pronounced in the OC region. This is likely due to the model’s inability to capture structural relationships or encode boundary-specific information effectively. When data augmentation was added, the boundary irregularities around the OD persisted, but the OC prediction showed noticeable improvement. This indicates that the truncated Gaussian-based augmentation and colormap transformations helped emphasize structural features in the fundus images, contributing to better OC localization.
From the point where Geometry-aware Loss was applied, the outer OD boundary became smoother, and much of the jaggedness was corrected. The precision of OC segmentation also improved significantly. This demonstrates that incorporating the Hausdorff Distance into the loss function made a substantial contribution to boundary learning, enabling the model to reduce structural errors more effectively.
When BAT was applied, the overall shapes of both the OD and OC boundaries closely resembled the ground truth, although small deviations remained. The segmentation results clearly reflected the hierarchical relationship between the OD and OC, confirming that BAT effectively captures global context and structural containment between the two regions.
Finally, the proposed model, which integrates data augmentation, Geometry-aware Loss, and BAT, produced segmentation results that were nearly indistinguishable from the ground truth. The boundaries of both the OD and OC were smooth, consistent, and accurately located. This outcome illustrates how the synergistic combination of all three components leads to significant improvements in both boundary precision and structural consistency.
Overall, these visual results validate that while each component contributes to performance in a distinct yet complementary manner, their integration in the full model enables it to deliver predictions that most closely resemble the ground truth, thereby confirming the effectiveness of the proposed approach.

5.3. State-of-the-Art Comparison

Table 4 presents a comparative evaluation of the segmentation performance between the proposed model and several state-of-the-art (SOTA) methods on three benchmark public datasets: DRIONS-DB, Drishti-GS, and REFUGE. The performance was assessed using the IoU values and Dice coefficients for both the OD and OC regions.
On the DRIONS-DB dataset, the proposed model achieved an IoU of 0.9758 and a Dice score of 0.9877, outperforming the best existing method [25] (IoU of 0.9474, Dice score of 0.9768) by 2.84 percentage points in the IoU and 1.09 points in the Dice score. This significant improvement indicates that the proposed BAT effectively enhances the model’s ability to learn precise OD boundaries, while the incorporation of Geometry-aware Loss substantially contributes to boundary refinement and segmentation precision.
In the Drishti-GS dataset, the proposed model achieved an IoU of 0.9756 and a Dice score of 0.9783 for the OD and an IoU of 0.8405 and a Dice score of 0.9127 for the OC, showing comparable performance to the current best-performing method [27] (OD: IoU of 0.950, Dice score of 0.9763; OC: IoU of 0.845, Dice score of 0.9196). While the OD segmentation performance was on par with the SOTA models, the OC Dice score of 0.9127 achieved by our model is noteworthy, given that the OC is structurally smaller, more variable in shape, and often exhibits unclear boundaries. These results underscore the model’s strength in learning the spatial containment relationship between the OD and OC through BAT and the added value of Geometry-aware Loss in capturing fine-grained boundary details.
On the REFUGE dataset, the proposed model attained an IoU of 0.9786 and Dice score of 0.9892 for the OD and an IoU of 0.8798 and Dice score of 0.9014 for the OC, outperforming both [24] (OD Dice score of 0.9693, OC Dice score of 0.9082) and [31] (OD Dice score of 0.9504, OC Dice score of 0.8546). Notably, the proposed model improved the OC Dice score by 4.68 percentage points compared to [31] and the OD Dice score by 1.99 points compared to [24]. These results demonstrate that the combination of boundary-sensitive loss functions and Transformer-based global context modeling leads to robust and consistent segmentation performance, even under diverse imaging conditions.
Overall, the results across all three datasets validate the effectiveness of the proposed framework and establish its superiority over existing methods, particularly in boundary localization and structural accuracy.

6. Discussion

This study proposed a novel segmentation framework that integrates boundary-focused data augmentation, BAT, and Geometry-aware Loss to achieve precise delineation of OD and OC boundaries in fundus images. Each of the three components independently contributed to performance improvements, and when combined, they produced a synergistic effect that maximized overall performance.
Experimental results showed that the proposed model consistently outperformed existing SOTA models across diverse datasets, including DRIONS-DB, Drishti-GS, REFUGE, G1020, and ORIGA. The model maintained high performance despite variations in ethnicity, imaging devices, resolution, and pathology levels across datasets, suggesting strong generalization capability against diverse boundary deformations and structural differences.
The improvement in fine segmentation performance for small structures like the OC supports the effectiveness of the BAT module. BAT explicitly learns the structural inclusion relationship by independently generating the Query and Key from the OC and OD feature maps, respectively, enabling more precise boundary recognition compared to conventional CNN and Transformer-based models. Geometry-aware Loss further enhanced boundary precision by correcting fine errors near boundaries that conventional Dice Loss-based approaches tend to miss.
While CNN-based models (U-Net, Attention U-Net, and DeepLabV3+) excel at local feature extraction, they struggle to explicitly model structural inclusion relationships. Transformer-based models (TransUNet and Swin-UNet) capture global context, but conventional self-attention is not optimized for distinguishing boundaries between different structures. In contrast, the BAT module provides fundamental differentiation by performing boundary-focused attention based on structural inclusion.
Ablation study results showed that data augmentation alone improved the Dice score by 1.76 percentage points, adding Geometry-aware Loss yielded a further 2.19-point improvement, and applying BAT contributed an additional 4.97-point gain. The full model integrating all components achieved a total improvement of 7.15 percentage points (from 0.9177 to 0.9892 on REFUGE). These results empirically confirm that each component yields a meaningful improvement individually and that the components produce a complementary synergistic effect when combined.
Visual analysis also demonstrated that the proposed model generated smoother and more consistent boundary representations for the OD and OC compared to conventional models. Particularly in ROI-based analysis, BAT-based structural learning showed clear superiority in boundary recognition and fine local boundary refinement compared to existing approaches.

7. Limitation and Future Work

In this study, independent training and evaluation were conducted on five publicly available fundus image datasets—DRIONS-DB, Drishti-GS, REFUGE, G1020, and ORIGA. By utilizing multiple datasets that reflect a variety of imaging devices, resolutions, and patient characteristics, we aimed to objectively validate the generalization capability of the proposed model.
Although our research team possesses separately collected hospital-based private clinical data, they were not directly utilized in this study for the following reasons: (1) the absence of high-quality annotations for OD and OC boundaries, (2) ethical restrictions, including compliance with personal data protection laws, and (3) an insufficient number and distribution of patients across age groups and genders, which limited the feasibility of statistical analysis. Due to these constraints, all experiments were conducted using publicly available datasets. However, in real-world clinical settings, models are required to generalize across various factors such as ethnicity, gender, age, and imaging conditions. In future work, we plan to adopt patient-specific fine-tuning strategies and condition-aware learning techniques to further expand the applicability of the model to diverse patient populations and imaging scenarios.
Although this study focused on OD and OC structure segmentation in fundus images, building a complete system for early glaucoma diagnosis will require the integration of additional downstream tasks, such as Cup-to-Disc Ratio computation, disease risk assessment, and longitudinal monitoring. In future research, we aim to develop a multi-task learning framework based on the proposed segmentation model to facilitate the integrated analysis of quantitative biomarkers and clinical diagnosis.
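As an illustration of the first of these downstream tasks, the vertical CDR can be read directly off the predicted masks as the ratio of the vertical cup diameter to the vertical disc diameter (cf. Figure 1). The helper below is a hypothetical NumPy sketch, not part of the proposed pipeline.

```python
import numpy as np

def vertical_cdr(od_mask: np.ndarray, oc_mask: np.ndarray) -> float:
    """Vertical cup-to-disc ratio (VCD / VDD) from binary OD and OC masks."""
    def vertical_diameter(mask: np.ndarray) -> int:
        rows = np.where(mask.any(axis=1))[0]
        return 0 if rows.size == 0 else int(rows.max() - rows.min() + 1)
    vdd = vertical_diameter(od_mask > 0)
    vcd = vertical_diameter(oc_mask > 0)
    return vcd / vdd if vdd > 0 else float("nan")
```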
Additionally, the proposed model currently faces limitations under extreme conditions, such as severe image quality degradation or the presence of rare or atypical structural abnormalities. Future work will focus on improving the architecture and incorporating hard-case-focused learning strategies to enhance robustness in such challenging scenarios.
Finally, while the current study designed an architecture optimized for fundus imaging, the generalizability of the proposed method to other medical imaging modalities, such as Optical Coherence Tomography (OCT), has not yet been explored. Given that Boundary-aware Transformer Attention and Geometry-aware Loss are well suited for tasks requiring precise boundary recognition, future work will aim to expand the model to different imaging modalities and strengthen its universality.

8. Conclusions

This study proposed a novel segmentation network that integrates structure-preserving data augmentation, BAT, and Geometry-aware Loss to achieve more precise delineation of the boundaries between the OD and OC in fundus images. Through truncated Gaussian sampling and various colormap transformations, we preserved vascular and tissue structures while introducing only meaningful morphological variations, thereby achieving data diversity and structural preservation simultaneously. The BAT module was designed to explicitly learn the structural inclusion relationship between the OD and OC using multi-resolution feature maps, effectively improving boundary recognition performance. Additionally, to address the limitations of conventional Dice Loss and to quantitatively reduce errors near boundaries, we introduced a Geometry-aware Loss based on a normalized Hausdorff Distance.
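For illustration, the truncated Gaussian sampling step can be sketched as a reparameterized draw whose noise is restricted to a fixed number of standard deviations; the truncation bound of 2.0 is an assumed value, and the surrounding VAE encoder and decoder are omitted.

```python
import torch

def sample_truncated_latents(mu: torch.Tensor, logvar: torch.Tensor, trunc: float = 2.0):
    """Reparameterized VAE sampling with noise drawn from a truncated N(0, 1),
    keeping generated variants close to the learned anatomical structure."""
    std = torch.exp(0.5 * logvar)
    eps = torch.empty_like(std)
    torch.nn.init.trunc_normal_(eps, mean=0.0, std=1.0, a=-trunc, b=trunc)
    return mu + eps * std
```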
Across multiple publicly available datasets, the proposed model outperformed existing state-of-the-art segmentation models, with particularly notable improvements in the OC regions where boundary recognition is critical. Ablation studies and visual analyses demonstrated that each component independently contributed to meaningful performance improvements, and when combined, the three components exhibited a complementary synergistic effect, achieving the highest levels of precision and boundary consistency.
In conclusion, this study presents an integrated approach capable of achieving both precise boundary recognition and structural preservation, which are essential requirements for medical image segmentation. The proposed method shows strong potential for future expansion into glaucoma diagnosis, diagnostic assistance systems, and real-world clinical applications, and it is expected to be applicable to various medical image analysis tasks where precise boundary segmentation is critical.

Author Contributions

Conceptualization, S.W. and D.-S.E.; methodology, S.W. and D.-S.E.; software, S.W.; validation, B.K.; formal analysis, S.W.; investigation, B.K.; resources, S.W.; data curation, B.K.; writing—original draft preparation, S.W.; writing—review and editing, B.K. and D.-S.E. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Acknowledgments

The authors would like to thank the members of the Future Information Network Architecture Laboratory at Korea University.

Conflicts of Interest

Author Soohyun Wang was employed by the company Sensorway. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
OD: Optic Disc
OC: Optic Cup
CDR: Cup-to-Disc Ratio
ROI: Region of Interest
BAT: Boundary-aware Transformer Attention
ViT: Vision Transformer
CCNet: Criss-Cross Network
OCT: Optical Coherence Tomography
CNN: Convolutional Neural Network
IoU: Intersection over Union
mIoU: Mean Intersection over Union
GT: Ground Truth
FFN: Feedforward Network
MHSA: Multi-Head Self-Attention
HD: Hausdorff Distance
VAE: Variational Autoencoder
SOTA: State-of-the-Art

References

1. Yamamoto, T.; Kitazawa, Y. Vascular pathogenesis of normal-tension glaucoma: A possible pathogenetic factor, other than intraocular pressure, of glaucomatous optic neuropathy. Prog. Retin. Eye Res. 1998, 17, 127–143.
2. Wang, S.; Kim, B.; Kang, J.; Eom, D.-S. Precision Diagnosis of Glaucoma with VLLM Ensemble Deep Learning. Appl. Sci. 2024, 14, 4588.
3. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Munich, Germany, 5–9 October 2015; pp. 234–241.
4. Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.Y.; Kainz, B.; et al. Attention U-Net: Learning Where to Look for the Pancreas. arXiv 2018, arXiv:1804.03999.
5. Zhou, P.; Yang, X.L.; Wang, X.G.; Hu, B.; Zhang, L.; Zhang, W.; Si, H.R.; Zhu, Y.; Li, B.; Huang, C.L.; et al. A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature 2020, 579, 270–273.
6. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
7. Zilly, J.; Buhmann, J.M.; Mahapatra, D. Glaucoma Detection Using Entropy Sampling and Ensemble Learning for Automatic Optic Cup and Disc Segmentation. Comput. Med. Imaging Graph. 2017, 55, 28–41.
8. Sevastopolsky, A. Optic Disc and Cup Segmentation Methods for Glaucoma Detection with Modification of U-Net Convolutional Neural Network. Pattern Recognit. Image Anal. 2017, 27, 618–624.
9. Feng, Y.; Yu, P.; Li, J.; Cao, Y.; Zhang, J. Phosphatidylinositol 4-kinase B is required for the ciliogenesis of zebrafish otic vesicle. J. Genet. Genom. 2020, 47, 627–636.
10. Zhu, Q.; Chen, X.; Meng, Q.; Song, J.; Luo, G.; Wang, M.; Shi, F.; Chen, Z.; Xiang, D.; Pan, L.; et al. GDCSeg-Net: General Optic Disc and Cup Segmentation Network for Multi-Device Fundus Images. Biomed. Opt. Express 2021, 12, 6529–6544.
11. Gu, Z.; Cheng, J.; Fu, H.; Zhou, K.; Hao, H.; Zhao, Y.; Zhang, T.; Gao, S.; Liu, J. CE-Net: Context Encoder Network for 2D Medical Image Segmentation. IEEE Trans. Med. Imaging 2019, 38, 2281–2292.
12. He, J.; Ma, Y.; Yang, M.; Yang, W.; Wu, C.; Chen, S. TAC-UNet: Transformer-Assisted Convolutional Neural Network for Medical Image Segmentation. Quant. Imaging Med. Surg. 2024, 14, 8824.
13. Heidari, M.; Kazerouni, A.; Soltany, M.; Azad, R.; Aghdam, E.K.; Cohen-Adad, J.; Merhof, D. HiFormer: Hierarchical Multi-scale Representations Using Transformers for Medical Image Segmentation. In Proceedings of the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023.
14. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. arXiv 2021, arXiv:2102.04306.
15. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929.
16. Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; Liu, W. CCNet: Criss-Cross Attention for Semantic Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 603–612.
17. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022.
18. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 833–851.
19. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Networks. arXiv 2014, arXiv:1406.2661.
20. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. arXiv 2013, arXiv:1312.6114.
21. Fan, Z.; Rong, Y.; Cai, X. Optic Disk Detection in Fundus Image Based on Structured Learning. IEEE J. Biomed. Health Inform. 2017, 22, 224–234.
22. Abdullah, M.; Fraz, M.M.; Barman, S.A. Localization and Segmentation of Optic Disc in Retinal Images Using Circular Hough Transform and Grow-Cut Algorithm. PeerJ 2016, 4, e2003.
23. Zahoor, M.N.; Fraz, M.M. Fast Optic Disc Segmentation in Retina Using Polar Transform. IEEE Access 2017, 5, 12293–12300.
24. Yi, Y.; Jiang, Y.; Zhou, B.; Zhang, N.; Dai, J.; Huang, X.; Zeng, Q.; Zhou, W. C2FTFNet: Coarse-to-Fine Transformer Network for Joint Optic Disc and Cup Segmentation. Comput. Biol. Med. 2023, 164, 107215.
25. Joshi, A.; Sharma, K.K. Graph Deep Network for Optic Disc and Optic Cup Segmentation for Glaucoma Disease Using Retinal Imaging. Comput. Biol. Med. 2022, 148, 105800.
26. Bhattacharya, R.; Hussain, R.; Chatterjee, A.; Paul, D.; Chatterjee, S.; Dey, D. PY-net: Rethinking Segmentation Frameworks with Dense Pyramidal Operations for Optic Disc and Cup Segmentation from Retinal Fundus Images. Biomed. Signal Process. Control 2023, 85, 104895.
27. Vangaveti, V.; Kumar, M.P.; Mitra, K. Glaucoma Identification Using Convolutional Neural Networks Ensemble for Optic Disc and Cup Segmentation. In Proceedings of the IEEE International Conference on Image Processing (ICIP), Singapore, 20–23 October 2024; pp. 2720–2724.
28. Orlando, J.I.; Fu, H.; Breda, J.B.; Van Keer, K.; Bathula, D.R.; Diaz-Pinto, A.; Bogunović, H. Refuge Challenge: A Unified Framework for Evaluating Automated Methods for Glaucoma Assessment from Fundus Photographs. Med. Image Anal. 2020, 59, 101570.
29. Liu, B.; Pan, D.; Song, H. Joint Optic Disc and Cup Segmentation Based on Densely Connected Depthwise Separable Convolution Deep Network. BMC Med. Imaging 2021, 21, 6.
30. Zhou, W.; Ji, J.; Jiang, Y.; Wang, J.; Qi, Q.; Yi, Y. EARDS: EfficientNet and Attention-Based Residual Depthwise Separable Convolution for Joint OD and OC Segmentation. Front. Neurosci. 2023, 17, 1139181.
31. Almubarak, H.; Bazi, Y.; Alajlan, N. Two-Stage Mask-RCNN Approach for Detecting and Segmenting the Optic Nerve Head, Optic Disc, and Optic Cup in Fundus Images. Appl. Sci. 2020, 10, 3833.
Figure 1. Illustration of OD, OC, and macula in fundus images. The vertical CDR is calculated as the ratio between the vertical cup diameter (VCD) and the vertical disc diameter (VDD). A higher CDR (e.g., 0.6) indicates a larger optic cup relative to the disc, which is a key indicator in glaucoma diagnosis.
Figure 2. Structure-preserving data augmentation using VAE. Latent vectors are sampled from a truncated Gaussian distribution to generate anatomically realistic variations of the optic disc region.
Figure 3. Examples of colormap-based augmentation. These conversions enhance visual diversity while preserving anatomical boundaries of the optic disc and cup.
Figure 4. Structure of the proposed BAT module (left) and attention map visualization (right). The inputs to the attention mechanism are set as Query from the optic cup features, Key from the optic disc features, and Value from the global feature map, enabling boundary-aware representation learning.
Figure 5. Overall architecture of the proposed segmentation network. The BAT modules are embedded into the skip connections of a U-Net backbone, with multi-scale features processed through boundary-aware attention to enhance the delineation of optic cup and disc regions.
Figure 6. Comparison of IoU (Intersection over Union) and Dice Coefficient scores across different models.
Figure 17. Ablation study results for the proposed model. (a) original image, (b) ground truth (GT) mask, (c) baseline model, (d) baseline + data augmentation, (e) baseline + Geometry-aware Loss, (f) baseline + BAT, and (g) proposed model.
Table 1. Datasets used in this work.

| Database | Healthy (Normal) | Glaucoma (Abnormal) | Resolution |
| DRIONS-DB | 84 | 26 | 600 × 400 |
| Drishti-GS | 70 | 31 | 2896 × 1944 |
| REFUGE | 1440 | 160 | 2124 × 2056 (train), 1634 × 1634 (test) |
| ORIGA | 482 | 165 | 3072 × 2048 |
| G1020 | 625 | 296 | 3004 × 2423 |
Table 2. Segmentation performance comparison on various datasets. Each cell reports IoU / Dice.

| Model | DRIONS-DB | Drishti-GS (OD) | Drishti-GS (OC) | REFUGE | G1020 | ORIGA |
| U-Net [3] | 0.926 / 0.9324 | 0.9253 / 0.9555 | 0.7869 / 0.8848 | 0.9186 / 0.9292 | 0.9096 / 0.9183 | 0.9158 / 0.9276 |
| Attention U-Net [4] | 0.9459 / 0.9712 | 0.9454 / 0.9664 | 0.8137 / 0.8987 | 0.9328 / 0.9473 | 0.9345 / 0.9575 | 0.9239 / 0.9539 |
| DeepLabV3+ [18] | 0.9508 / 0.9627 | 0.9506 / 0.9533 | 0.8155 / 0.8877 | 0.9536 / 0.9642 | 0.9446 / 0.9534 | 0.9510 / 0.9629 |
| TransUNet [14] | 0.9359 / 0.9471 | 0.9454 / 0.9654 | 0.8047 / 0.8941 | 0.9297 / 0.9387 | 0.9196 / 0.9556 | 0.9239 / 0.9629 |
| Swin-UNet [17] | 0.9631 / 0.9677 | 0.9556 / 0.9603 | 0.8305 / 0.9027 | 0.9613 / 0.9725 | 0.9503 / 0.9613 | 0.9600 / 0.9719 |
| Proposed | 0.9758 / 0.9877 | 0.9756 / 0.9783 | 0.8405 / 0.9127 | 0.9786 / 0.9892 | 0.9696 / 0.9783 | 0.9760 / 0.9879 |
Table 3. Ablation study results on key components of the proposed model. Each cell reports IoU / Dice.

| Model | Aug | Geo. Loss | BAT | REFUGE | G1020 | ORIGA |
| Baseline | - | - | - | 0.9171 / 0.9177 | 0.9097 / 0.9068 | 0.9045 / 0.9164 |
| Baseline + | ✓ | - | - | 0.9247 / 0.9353 | 0.9157 / 0.9244 | 0.9221 / 0.9340 |
|  | - | ✓ | - | 0.9302 / 0.9396 | 0.9237 / 0.9389 | 0.9368 / 0.9415 |
|  | - | - | ✓ | 0.9468 / 0.9574 | 0.9378 / 0.9465 | 0.9442 / 0.9561 |
|  | ✓ | ✓ | - | 0.9365 / 0.9471 | 0.9275 / 0.9362 | 0.9339 / 0.9458 |
|  | ✓ | - | ✓ | 0.9651 / 0.9673 | 0.9478 / 0.9564 | 0.9541 / 0.9660 |
|  | - | ✓ | ✓ | 0.9567 / 0.9681 | 0.9581 / 0.9691 | 0.9672 / 0.9784 |
| Proposed | ✓ | ✓ | ✓ | 0.9786 / 0.9892 | 0.9696 / 0.9783 | 0.9760 / 0.9879 |
Table 4. The segmentation results of OD and OC obtained by the proposed approach and compared approaches.

| Database | Author | OD IoU | OD Dice | OC IoU | OC Dice |
| DRIONS-DB | Fan et al. [21] | 0.8473 | 0.9137 | - | - |
|  | Abdullah et al. [22] | 0.8510 | 0.9102 | - | - |
|  | Zahoor et al. [23] | 0.8862 | 0.9378 | - | - |
|  | Sevastopolsky et al. [8] | 0.8900 | 0.9400 | - | - |
|  | Yi et al. [24] | 0.9363 | 0.9679 | - | - |
|  | Joshi et al. [25] | 0.9474 | 0.9768 | - | - |
|  | Proposed | 0.9758 | 0.9877 | - | - |
| Drishti-GS | Bhattacharya et al. [26] | 0.944 | 0.971 | - | 0.876 |
|  | Sevastopolsky et al. [8] | 0.9444 | 0.9739 | 0.8050 | 0.8910 |
|  | Zhu et al. [10] | 0.9501 | 0.9743 | 0.8334 | 0.9083 |
|  | Gu et al. [11] | 0.9506 | 0.9746 | 0.8213 | 0.8992 |
|  | Yi et al. [24] | 0.9531 | 0.9768 | 0.8538 | 0.9195 |
|  | Vangaveti et al. [27] | 0.950 | 0.974 | 0.845 | 0.916 |
|  | Proposed | 0.9756 | 0.9783 | 0.8405 | 0.9127 |
| REFUGE | Mammoth [28] | - | 0.9361 | - | 0.8667 |
|  | SDSAIRC [28] | - | 0.9436 | - | 0.8315 |
|  | NKSG [29] | - | 0.9488 | - | 0.8643 |
|  | VRT [28] | - | 0.9532 | - | 0.8600 |
|  | CUHKMED [28] | - | 0.9602 | - | 0.8826 |
|  | Zhou et al. [30] | 0.915 | 0.955 | 0.802 | 0.887 |
|  | Almubarak et al. [31] | - | 0.9504 | - | 0.8546 |
|  | Yi et al. [24] | - | 0.9693 | - | 0.9082 |
|  | Vangaveti et al. [27] | 0.925 | 0.961 | 0.808 | 0.894 |
|  | Liu et al. [29] | - | 0.9601 | - | 0.8903 |
|  | Proposed | 0.9786 | 0.9892 | 0.8798 | 0.9014 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
