1. Introduction
Green tides formed by the floating macroalga Ulva prolifera are among the largest algal bloom phenomena worldwide, with major outbreaks occurring in the Yellow Sea, causing severe impacts on the marine ecological environment, fisheries, coastal tourism, and maritime transportation [
1,
2]. Therefore, rapid and accurate monitoring of its spatial distribution and dynamics using satellite remote sensing is of great significance for marine ecological security and coastal economic development [
3].
However, due to spectral confusion with underlying seawater, suspended matter, turbidity variations, and sea surface foam, accurately extracting scattered
Ulva prolifera patches from remote sensing imagery remains challenging [
4]. In addition, its floating morphology is strongly affected by winds and currents, resulting in irregular and dynamic boundaries that often lead to false negatives and false positives during segmentation [
5].
Traditional remote sensing approaches for
Ulva prolifera monitoring primarily rely on empirical spectral indices, such as the Floating Algae Index (FAI) and Normalized Difference Phytoplankton Index (NDPI), which can enhance the contrast between algae and water bodies to some extent [
6,
7]. Nevertheless, due to complex sea states and spectral variability under different imaging conditions, these index-based methods often suffer from threshold sensitivity and limited generalization, making them inadequate for large-scale, real-time, and high-precision monitoring [
8].
With the rapid development of deep learning, convolutional neural networks (CNNs) have demonstrated strong feature extraction capabilities in remote sensing image analysis [
9]. Deep learning methods have been widely applied to
Ulva prolifera semantic segmentation and have significantly outperformed traditional approaches under complex marine backgrounds [
10].
Fully Convolutional Networks (FCNs) enable end-to-end pixel-level prediction and have proven effective for large-scale
Ulva prolifera extraction [
11]. U-Net improves boundary integrity through encoder–decoder skip connections [
12,
13], while High-Resolution Networks (HRNet) preserve fine-grained details via parallel multi-scale feature processing [
14]. PSPNet and DeepLab further enhance boundary delineation by aggregating multi-scale contextual information [
15,
16]. More recently, SegFormer, with its lightweight Transformer architecture and efficient MLP decoder, has further improved the robustness and accuracy of algal bloom semantic segmentation [
17].
In addition to generic semantic segmentation frameworks, boundary-aware segmentation methods have attracted increasing attention in fine-grained target extraction tasks [
18]. Existing studies typically enhance contour representation by introducing explicit edge branches, auxiliary boundary supervision losses, or joint semantic-boundary learning strategies [
19]. For example, some approaches employ dedicated boundary branches to explicitly model object contours, thereby improving contour continuity and boundary integrity [
20], while others adopt multi-task learning frameworks to jointly optimize region segmentation and boundary prediction, alleviating boundary ambiguity and fragmented segmentation results [
21]. These methods have demonstrated strong boundary recovery capabilities in both natural image segmentation and high-precision remote sensing applications [
22].
However, most existing boundary-aware methods are primarily designed for generic objects or terrestrial remote sensing scenarios [
23], with limited consideration of the low-contrast boundaries, irregular floating morphology, and complex background interference characteristics of
Ulva prolifera in marine remote sensing imagery. In particular, when small scattered patches and large continuous boundaries coexist, conventional boundary supervision strategies often struggle to balance local detail recovery and global semantic consistency [
24]. Therefore, designing a lightweight architecture with stronger boundary sensitivity and targeted feature enhancement remains of significant research value for
Ulva prolifera semantic segmentation.
Despite these advances, existing feature extraction networks still face limitations in complex marine environments. On the one hand, environmental noise and background interference from clouds, waves, and reefs restrict the ability to identify scattered algal patches, leading to missed detections. On the other hand, in low-contrast regions, boundary contours of Ulva prolifera are often incompletely captured, resulting in segmentation gaps. Furthermore, most current methods directly adopt generic semantic segmentation models for remote sensing without targeted adaptations for the unique characteristics of Ulva prolifera, thereby constraining their practical applicability.
This study aims to address the limited applicability of existing models for Ulva prolifera monitoring in the Yellow Sea by proposing an enhanced SegFormer-based semantic segmentation framework optimized for complex marine conditions.
Contributions
The main contributions of this work, highlighting the methodological innovations and their distinctions from existing studies, are summarized as follows:
Unlike existing generic boundary-aware segmentation methods that are primarily designed for natural images or terrestrial remote sensing targets, we introduce a dedicated boundary supervision branch specifically tailored to the low-contrast and irregular boundaries of floating Ulva prolifera, which improves contour continuity and fine-structure preservation.
To further enhance feature discrimination in shallow layers, lightweight ECA channel attention modules are embedded into the first two (shallow) encoder stages. This design strengthens edge-sensitive and texture-related feature representation with minimal computational overhead, making it particularly suitable for small scattered patches and complex marine backgrounds.
Extensive experiments and comparative analyses against classical CNN-based models, recent Transformer-based architectures, and representative boundary-aware segmentation methods demonstrate that the proposed framework achieves superior performance in both global semantic consistency and local boundary accuracy.
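To make the ECA mechanism referenced above concrete, the following NumPy sketch implements its three steps (global average pooling, local 1-D interaction across channels, sigmoid gating). It uses a fixed uniform kernel in place of the learned 1-D convolution weights, and the function name and fixed kernel size are illustrative, not the paper's implementation:

```python
import numpy as np

def eca_attention(features, k=3):
    """Efficient Channel Attention (ECA) sketch.

    features: (C, H, W) feature map.
    k: 1-D kernel size over the channel descriptor (ECA derives k
       adaptively from C; fixed here for clarity).
    Returns the channel-reweighted feature map.
    """
    c = features.shape[0]
    # Squeeze: global average pooling -> one descriptor per channel.
    desc = features.mean(axis=(1, 2))                     # (C,)
    # Local cross-channel interaction: 1-D "conv" with edge padding.
    pad = k // 2
    padded = np.pad(desc, pad, mode="edge")
    kernel = np.full(k, 1.0 / k)                          # illustrative fixed weights
    mixed = np.array([np.dot(padded[i:i + k], kernel) for i in range(c)])
    # Excitation: sigmoid gate, then rescale each channel.
    gate = 1.0 / (1.0 + np.exp(-mixed))                   # (C,)
    return features * gate[:, None, None]
```

Because the interaction is a 1-D convolution over the pooled channel descriptor rather than a fully connected bottleneck, the parameter cost is only k weights, which is why embedding it in shallow stages adds negligible overhead.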
2. Related Work
Traditional methods for
Ulva prolifera semantic segmentation mainly rely on empirically designed spectral indices, such as the Floating Algae Index (FAI) and the Normalized Difference Phytoplankton Index (NDPI). To improve robustness under marine environmental interference, researchers have proposed enhanced indices and adaptive threshold strategies. For instance, Cao et al. [
25] proposed the Algal Bloom Detection Index (ABDI), while Hu [
26] introduced FAI and an atmospheric correction strategy to improve cross-scene applicability. Other studies, including regional chlorophyll-based monitoring [
27] and water-specific spectral indices [
28], further improved detection capability in particular environments. Although these methods are computationally efficient, they remain highly dependent on threshold settings and are sensitive to suspended matter and atmospheric variations, which limits their robustness in large-scale and precise segmentation tasks.
With the development of deep learning, CNN-based methods have been increasingly applied to
Ulva prolifera analysis. Early studies mainly focused on object detection frameworks [
29], including FRCNN, RFCN, SSD [
30], lightweight YOLO variants [
31], and improved Cascade R-CNN models [
32]. These methods improved deployment efficiency and inference speed compared with traditional spectral index approaches. However, object detection methods cannot provide pixel-level boundaries, making them insufficient for accurate distribution mapping and fine-grained ecological monitoring.
Semantic segmentation methods [
33,
34] further advanced pixel-level extraction of
Ulva prolifera. Representative methods include Mask R-CNN [
35,
36], which was introduced into algal bloom monitoring by Jesus et al. [
37], and CNN-based comparative studies with different backbones [
38]. More recently, several specialized architectures have been developed, including attention-enhanced U-Net variants [
39,
40], synthetic-data-assisted segmentation frameworks [
41,
42], and Transformer-based hybrid models such as HySwinFormer [
43]. These methods have significantly improved segmentation accuracy, yet challenges remain in balancing computational efficiency, boundary precision, and robustness under complex marine interference.
Recently, Transformer-based segmentation frameworks have attracted increasing attention due to their strong global modeling capability. Among them, SegFormer has achieved a favorable balance between accuracy and efficiency through its lightweight encoder and all-MLP decoder. Existing improvements mainly focus on integrating channel attention modules, such as ECA [
44,
45], or combining ASPP and advanced upsampling mechanisms [
46,
47]. Although these methods improve contextual modeling and feature recovery, they still lack targeted optimization for the low-contrast boundaries, irregular morphology, and complex spectral background of
Ulva prolifera in marine remote sensing imagery.
Boundary-aware segmentation methods have attracted increasing attention in recent years for addressing common issues in semantic segmentation tasks, such as blurred boundaries, contour discontinuity, and loss of fine-grained structures [
48]. Existing studies typically enhance contour modeling by introducing explicit boundary branches, boundary supervision loss functions, or joint region-boundary optimization frameworks [
18,
19,
20,
21,
22,
23,
24]. A representative category of methods employs dedicated edge branches to explicitly learn object contours, where auxiliary supervision guides the backbone network to focus on high-frequency boundary details, thereby improving contour continuity and local structural integrity [
49]. For example, Gated-SCNN introduces an additional shape stream parallel to the semantic branch and utilizes a gating mechanism to dynamically fuse semantic and boundary features, achieving strong boundary recovery performance in natural image segmentation tasks [
18]. In addition, several studies further improve boundary localization accuracy by incorporating boundary-constrained optimization objectives, such as Boundary Loss, Dice Loss, or Hausdorff Distance Loss [
50].
In remote sensing image segmentation, boundary-aware strategies have also been widely applied to high-precision tasks such as building extraction, road segmentation, and coastline detection [
51,
52,
53]. Due to the large-scale variation in remote sensing targets, complex backgrounds, and boundary sensitivity to noise interference, region-level supervision alone is often insufficient to constrain contour information, making explicit boundary supervision particularly effective for improving object completeness and edge continuity [
54]. However, most existing boundary-aware methods are primarily designed for terrestrial targets, such as buildings, roads, and agricultural parcels, whose boundaries generally exhibit relatively regular geometric structures [
55]. In contrast, floating
Ulva prolifera in marine remote sensing imagery presents substantially different characteristics, including highly irregular contours, fragmented local morphology, dense small-scale patches, and low-contrast boundaries against seawater backgrounds [
56]. These characteristics make conventional boundary branch methods prone to over-smoothing, local contour discontinuity, and missed segmentation of scattered regions [
57].
Compared with existing representative boundary-aware methods, the proposed approach differs significantly in design philosophy. First, instead of adopting complex dual-stream architectures or independent shape streams, such as those used in Gated-SCNN, this study preserves the original lightweight SegFormer backbone and introduces only an auxiliary boundary supervision branch at the decoder stage, thereby enhancing boundary sensitivity with minimal parameter overhead. Second, the boundary labels are not obtained through additional manual annotation but are automatically generated from semantic masks via morphological gradient operations, which reduces annotation cost and improves method transferability. More importantly, the proposed framework further integrates shallow-layer ECA channel attention to strengthen near-infrared-sensitive features around boundary regions, enabling boundary supervision to improve not only spatial contour recovery but also spectral discriminability in low-contrast regions. This collaborative design of “shallow feature enhancement + lightweight boundary supervision” allows the proposed method to better balance global semantic consistency and local boundary integrity under complex marine backgrounds.
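The morphological-gradient label generation described above can be sketched with scipy.ndimage as follows; the default 4-connected structuring element and one-pixel band width are assumptions, since the exact settings are not specified here:

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def boundary_from_mask(mask, iterations=1):
    """Derive a boundary label map from a binary semantic mask.

    The morphological gradient (dilation minus erosion) marks pixels
    on either side of each region contour; `iterations` controls the
    boundary band width (1 px each way here, an assumed setting).
    """
    mask = mask.astype(bool)
    dilated = binary_dilation(mask, iterations=iterations)
    eroded = binary_erosion(mask, iterations=iterations)
    return (dilated & ~eroded).astype(np.uint8)
```

Because the boundary map is derived deterministically from the semantic mask, it can be regenerated on the fly for any dataset without extra annotation effort, which is the transferability benefit noted above.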
4. Experiments and Analysis
4.1. Experimental Dataset
The remote sensing data used in this study are derived from China’s independently developed Coastal Zone Imager (CZI, 50 m resolution) onboard the HY-1C, HY-1D, and HY-1E satellites. The HY-1C satellite, launched in 2018, is equipped with both the Coastal Zone Imager (CZI) and the Medium-Resolution Imaging Spectrometer (PMRIS, 1 km resolution), enabling large-scale and continuous observations of ocean color parameters. The HY-1D satellite, launched in 2020, provides improved spectral resolution and radiometric accuracy, while HY-1E further enhances spatiotemporal resolution and quantitative observation capability. These satellites cover key nearshore regions of China, especially the Yellow Sea, offering reliable data support for long-term monitoring of Ulva prolifera blooms.
Compared with widely used international satellites such as Landsat, Sentinel-2, and MODIS, the HY-1 series is specifically designed for ocean color observation and exhibits strong sensitivity to water spectral characteristics. This makes it particularly suitable for detecting marine phenomena such as Ulva prolifera green tides.
The dataset used in this study, referred to as the
HYU dataset, consists of remote sensing images in
GeoTIFF format along with corresponding
Ulva prolifera mask annotations. The original data contain multiple spectral bands, from which a subset of bands is selected for model input. Detailed input configuration and preprocessing strategies are described in
Section 4.3.
Ground-truth annotations are generated from vector boundaries obtained through inversion and expert interpretation. These vector files (shapefiles) are subsequently rasterized to produce pixel-wise mask labels aligned with the input images.
In this study, a near-infrared (NIR) plus blue-green (BG) band combination is adopted instead of the conventional red-based configuration.
This choice is consistent with widely adopted practices in marine and aquatic remote sensing, where NIR-based band combinations are commonly used to enhance the contrast between floating targets and surrounding water bodies [
26,
58,
59,
60].
Specifically, due to the strong absorption of NIR radiation by water, the background typically exhibits very low reflectance, while floating algae show relatively higher reflectance, resulting in a more distinguishable spectral response. Meanwhile, the blue-green bands are more sensitive to variations in water constituents such as chlorophyll concentration, suspended matter, and shallow-water features, providing complementary information for discriminating
Ulva prolifera from complex marine backgrounds [
61,
62,
63].
The red band is highly sensitive to water turbidity and suspended particles [
64,
65,
66], which may introduce background interference in marine environments. In this study, the target region is the Yellow Sea and Bohai Sea, which are characterized by relatively high turbidity and abundant suspended sediments [
67,
68,
69]. Under such conditions, incorporating the red band is more likely to amplify turbidity-related signals and introduce additional interference, thereby reducing its discriminative capability for
Ulva prolifera. As a result, its effectiveness becomes limited, especially under complex conditions with varying turbidity levels. In contrast, the near-infrared (NIR) band provides more stable and distinctive responses, as water exhibits strong absorption while floating algae show high reflectance. Therefore, replacing the red band with NIR can effectively enhance the separability between
Ulva prolifera and surrounding water.
Therefore, the NIR-BG combination provides improved discriminability and robustness for Ulva prolifera detection under complex coastal conditions.
Representative samples of the dataset are shown in
Figure 5.
The dataset is partitioned as shown in
Table 3.
4.2. Evaluation Metrics
To validate the effectiveness of the proposed improvements, ablation experiments were conducted on the HYU dataset using SegFormer as the baseline system. Model performance was evaluated using mean Intersection over Union (mIoU), F1 scores, precision, and recall.
mIoU measures the average percentage of overlap between predicted and actual Ulva prolifera pixels.
The F1 score considers both precision and recall, providing a balanced measure of model accuracy.
Precision represents the proportion of correctly predicted Ulva prolifera pixels among all predicted Ulva prolifera pixels.
Recall denotes the proportion of correctly predicted Ulva prolifera pixels among all actual Ulva prolifera pixels.
Let TP and FP denote the numbers of pixels correctly and incorrectly predicted as Ulva prolifera, and let FN and TN denote the numbers of actual Ulva prolifera pixels that were missed and background pixels correctly rejected, respectively. The formulas for the four evaluation metrics are as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$

$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \qquad \mathrm{mIoU} = \frac{1}{k+1}\sum_{i=0}^{k}\frac{TP_i}{TP_i + FP_i + FN_i}$$

where $k+1$ is the number of classes (here, two: Ulva prolifera and background).
To further evaluate the effectiveness of the proposed boundary supervision branch, boundary-aware evaluation metrics are introduced in addition to region-based metrics such as mIoU and F1-score. Given that the proposed method explicitly enhances boundary representation, these metrics provide a more targeted assessment of contour quality.
Specifically, the Boundary F1-score (BF-score) is employed to measure the alignment between predicted boundaries and ground-truth contours. Let $B_p$ and $B_g$ denote the predicted and ground-truth boundary sets, respectively. A predicted boundary pixel is considered a true positive if there exists at least one ground-truth boundary pixel within a tolerance distance $\theta$ of it, and vice versa for ground-truth boundary pixels. In this study, $\theta$ is set to 3 pixels, following common practice in boundary evaluation.

The precision and recall of boundary extraction are defined as:

$$\mathrm{Precision}_b = \frac{|B_p \cap N_\theta(B_g)|}{|B_p|}, \qquad \mathrm{Recall}_b = \frac{|B_g \cap N_\theta(B_p)|}{|B_g|}$$

where $N_\theta(B_g)$ denotes the $\theta$-neighborhood of the ground-truth boundary, and $N_\theta(B_p)$ is defined analogously.

The BF-score is then computed as:

$$\mathrm{BF} = \frac{2 \times \mathrm{Precision}_b \times \mathrm{Recall}_b}{\mathrm{Precision}_b + \mathrm{Recall}_b}$$

In addition, the Hausdorff Distance (HD) is adopted to evaluate the maximum deviation between predicted and ground-truth boundaries, providing a stricter assessment of contour accuracy:

$$\mathrm{HD}(B_p, B_g) = \max\left\{ \max_{x \in B_p} \min_{y \in B_g} d(x, y),\; \max_{y \in B_g} \min_{x \in B_p} d(x, y) \right\}$$

where $d(x, y)$ denotes the Euclidean distance between two boundary points. A smaller HD indicates better boundary alignment.
Overall, these boundary-focused metrics complement traditional region-based metrics, enabling a more comprehensive and reliable evaluation of segmentation performance, particularly for boundary-sensitive tasks such as Ulva prolifera semantic segmentation.
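Both boundary metrics can be computed efficiently from Euclidean distance transforms; a minimal scipy sketch under the definitions above, assuming binary boundary maps in which each map contains at least one boundary pixel:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def boundary_metrics(pred_boundary, gt_boundary, theta=3):
    """BF-score and Hausdorff Distance between two binary boundary maps."""
    pred = pred_boundary.astype(bool)
    gt = gt_boundary.astype(bool)
    # Distance of every pixel to the nearest boundary pixel of each map.
    dist_to_gt = distance_transform_edt(~gt)
    dist_to_pred = distance_transform_edt(~pred)
    # A boundary pixel counts as matched if the other map's boundary
    # lies within the tolerance theta (3 px in this study).
    precision = np.mean(dist_to_gt[pred] <= theta)
    recall = np.mean(dist_to_pred[gt] <= theta)
    bf = 2 * precision * recall / (precision + recall)
    # Hausdorff Distance: worst-case deviation in either direction.
    hd = max(dist_to_gt[pred].max(), dist_to_pred[gt].max())
    return bf, hd
```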
4.3. Experimental Settings
4.3.1. Experimental Environment and Training Settings
All experiments in this study were conducted on a server equipped with an NVIDIA GeForce RTX 4090 GPU. The operating system is Ubuntu 20.04, and the deep learning framework used is PyTorch 1.13 with CUDA 11.7. Data preprocessing and visualization were performed using Python 3.8 along with common scientific computing libraries such as NumPy 1.23.5, OpenCV 4.6.0, and Matplotlib 3.5.3.
The training configurations of the dataset are summarized in
Table 4.
4.3.2. Input Configuration and Preprocessing
The original data used in this study are multispectral remote sensing images in GeoTIFF format, containing multiple spectral bands. During the preprocessing stage, three bands, namely near-infrared (NIR), green (G), and blue (B), are selected from the original data. The NIR band is used to replace the conventional red (R) band, forming a three-channel (NIR, G, B) composite, which enhances the spectral separability between Ulva prolifera and the surrounding seawater.
Considering the large spatial coverage and high resolution of remote sensing images, a sliding-window strategy is employed to crop the original images into patches. Specifically, image tiles of size 512 × 512 pixels are generated to serve as inputs for model training. During this process, vector data representing ocean regions are utilized to filter the generated patches, retaining tiles that intersect with target areas as well as a small number of background ocean samples, while excluding irrelevant regions such as land and dense clouds, thereby improving data utilization efficiency.
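The sliding-window cropping step can be sketched as follows. The back-shifted edge windows (so every tile has the full size) are one plausible handling of scene borders, since the exact border treatment is not specified:

```python
import numpy as np

def _starts(full, tile, stride):
    """Window start offsets covering [0, full); the last window is
    shifted back so it ends exactly at the border (requires full >= tile)."""
    s = list(range(0, full - tile + 1, stride))
    if s[-1] != full - tile:
        s.append(full - tile)
    return s

def sliding_window_tiles(image, tile=512, stride=512):
    """Crop a large (H, W, C) scene into fixed-size tiles
    (non-overlapping when stride == tile)."""
    h, w = image.shape[:2]
    return [image[t:t + tile, l:l + tile]
            for t in _starts(h, tile, stride)
            for l in _starts(w, tile, stride)]
```

The ocean-region filtering mentioned above would then be applied per tile (e.g., by testing each tile footprint against the ocean vector layer) before the tile is written out.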
For spectral preprocessing, a global percentile-based contrast stretching method is adopted to reduce radiometric differences among images and enhance contrast. Specifically, the 2nd and 98th percentiles are computed for each band, followed by linear stretching to normalize pixel values into the range [0, 1]. The values are then scaled to [0, 255] and converted to 8-bit integers for generating PNG images.
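The per-band percentile stretch can be sketched as:

```python
import numpy as np

def percentile_stretch(band, low=2, high=98):
    """Global percentile-based contrast stretch of one spectral band.

    Values are clipped to the [p2, p98] range, linearly rescaled to
    [0, 1], then scaled to [0, 255] as 8-bit integers for PNG export.
    """
    lo, hi = np.percentile(band, [low, high])
    stretched = np.clip((band.astype(np.float64) - lo) / (hi - lo), 0.0, 1.0)
    return (stretched * 255).astype(np.uint8)
```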
For annotation generation, vector boundaries obtained from manual interpretation are rasterized to produce pixel-wise mask labels that are spatially aligned with the image tiles. All preprocessing operations are consistently applied across the training, validation, and test sets to ensure fairness and reproducibility of the experiments.
This preprocessing pipeline ensures spectral consistency while improving data processing efficiency, making it suitable for large-scale remote sensing semantic segmentation tasks.
4.3.3. Data Augmentation
After input configuration and preprocessing, online data augmentation is applied to the generated PNG image patches during training to improve the generalization ability of the model and alleviate overfitting. Considering the large-scale variations, diverse observation angles, and complex sea surface textures in remote sensing imagery, a combination of geometric and spectral augmentation strategies is adopted.
For geometric augmentation, random horizontal flipping and vertical flipping are independently applied with a probability of 0.5 to enhance robustness to different viewing directions. In addition, random rotations (90°, 180°, and 270°) are introduced to further improve the model’s adaptability to orientation variations. Meanwhile, under the constraint of a fixed input size of 512 × 512 pixels, random cropping and scale perturbation are employed to simulate observations at different spatial resolutions.
For spectral augmentation, controlled perturbations in brightness and contrast are applied to simulate varying imaging conditions and marine environments. Considering that the color differences between small-scale Ulva prolifera targets and the background, as well as those between large-scale boundaries and seawater, are relatively subtle, saturation variation is not introduced to avoid altering discriminative spectral characteristics.
Specifically, the magnitudes of brightness and contrast perturbations are constrained within a limited range, ensuring that the augmentation only induces minor adjustments to the overall radiometric distribution without changing the relative spectral relationships between Ulva prolifera and the background. This strategy mitigates the influence of illumination changes, atmospheric variations, and sensor inconsistencies while preserving semantic consistency, and complements the proposed boundary supervision branch, thereby improving segmentation accuracy in boundary regions under complex marine backgrounds.
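The combined augmentation pipeline can be sketched as follows. Geometric transforms are applied identically to image and mask so labels stay aligned; the ±10% jitter bound is an assumed value, not the paper's exact setting:

```python
import numpy as np

def augment(image, mask, rng):
    """Online geometric + spectral augmentation for one training sample.

    image: (H, W, C) float array in [0, 1]; mask: (H, W) label map;
    rng: np.random.Generator. Spectral jitter touches the image only.
    """
    # Random horizontal / vertical flips, each with probability 0.5.
    if rng.random() < 0.5:
        image, mask = image[:, ::-1], mask[:, ::-1]
    if rng.random() < 0.5:
        image, mask = image[::-1, :], mask[::-1, :]
    # Random rotation by 0, 90, 180, or 270 degrees.
    k = rng.integers(0, 4)
    image, mask = np.rot90(image, k), np.rot90(mask, k)
    # Mild brightness (additive) and contrast (multiplicative) jitter;
    # +/-0.1 is an assumed bound.
    brightness = rng.uniform(-0.1, 0.1)
    contrast = 1.0 + rng.uniform(-0.1, 0.1)
    image = np.clip(image * contrast + brightness, 0.0, 1.0)
    return image, mask
```

Keeping the same random geometric transform for the image and its mask is what preserves label consistency; only the radiometric jitter is image-only.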
It should be noted that all data augmentation operations are only applied during the training phase, while no augmentation is performed on the validation and test sets to ensure objective and fair evaluation.
This augmentation strategy effectively expands the training data distribution without altering label consistency, thereby improving the model’s adaptability to complex marine environments.
4.4. Ablation Experiment
4.4.1. Ablation Experiments on Modules with Different Mechanisms
The original SegFormer model and various network variants incorporating different mechanisms were evaluated on the HYU dataset. The comparison of evaluation metrics is summarized in
Table 5.
From the ablation results, it can be observed that introducing either the boundary supervision branch or the ECA channel attention module alone improves segmentation performance. Specifically, the boundary supervision branch notably enhances recall (from 69.84% to 75.67%) and the BFScore (from 77.91% to 83.06%), demonstrating its effectiveness in boundary localization and fine-grained structure recovery. Meanwhile, embedding the ECA module into shallow features improves precision (from 91.75% to 92.84%) and the BFScore (from 77.91% to 80.62%), indicating that channel attention strengthens feature discrimination while suppressing irrelevant responses. The Hausdorff Distance (HD) also decreases (from 17.42 to 15.03 pixels), reflecting improved boundary continuity.
When both mechanisms are combined in the proposed network, all metrics achieve the best performance. Compared with the baseline SegFormer, mIoU, F1, precision, and recall improve by 6.80%, 4.76%, 2.37%, and 6.22%, respectively, while the BFScore increases by 6.45%, and HD decreases by 5.95 pixels. These results demonstrate that integrating boundary-aware learning and channel attention not only complements the encoder–decoder backbone but also simultaneously enhances global semantic consistency and local boundary accuracy, leading to substantial overall improvements in segmentation performance.
The changes in segmentation accuracy during training for the original and improved SegFormer models are shown in
Figure 6 and
Figure 7.
4.4.2. Ablation Experiments on the Embedding Layer of the ECA Module
To further investigate the effect of the ECA module at different network layers, multiple ablation experiments (A1–D2) were designed to determine the optimal insertion position of the ECA module. The shallow layers correspond to the first two encoder layers [0, 1], and the deep layers correspond to the last two layers [2, 3]. These configurations were evaluated both independently and in combination with the boundary supervision branch. The comparison of evaluation metrics is summarized in
Table 6.
Experimental results are summarized in
Table 6. Without boundary supervision, activating ECA in shallow layers (A1) mainly improved recall (74.14%) compared with the baseline, confirming that shallow channel attention strengthens responses to edges and texture details. However, due to limited suppression of false positives, the gain in
precision was modest (93.01%). Activating ECA in deeper layers (A2) yielded a more substantial improvement in
precision (94.49%) but a relatively smaller increase in recall (72.83%), indicating that high-level semantic attention better suppresses background interference and enhances class discrimination.
After introducing boundary supervision, shallow-layer ECA with boundary supervision (B1) significantly boosted recall (75.67%) and the BFScore (82.75%), enabling more accurate boundary and fine-structure recovery. Deep-layer ECA with boundary supervision (B2) further improved precision (93.90%) and overall consistency (mIoU = 71.88%, F1 = 83.64%, HD = 15.12).
When combining ECA with boundary supervision at specific feature layers, the proposed network (D2) achieved the best performance across all metrics: mIoU = 72.61%, F1 = 84.14%, Precision = 94.12%, Recall = 76.06%, BFScore = 84.36%, and HD = 11.47 pixels. This demonstrates that activating ECA in shallow layers under boundary supervision effectively balances global semantic modeling and local boundary detail recovery, substantially enhancing segmentation performance.
Overall, the results indicate that shallow-layer ECA is particularly effective when combined with boundary supervision, as it directly strengthens low-level spatial details crucial for precise boundary delineation.
4.5. Comparative Experiments
4.5.1. Baseline Model Comparative Experiments
In this experiment, the prediction results of the proposed method were compared with the ground-truth labels and several representative semantic segmentation networks to evaluate its effectiveness. The selected baseline models include
HRNet,
PSPNet,
U-Net, the original
SegFormer, and
SegFormer-ASPP (SegFormer with Atrous Spatial Pyramid Pooling). The quantitative comparison results are summarized in
Table 7.
The proposed method outperforms traditional convolutional networks (HRNet, PSPNet, and U-Net) as well as the original SegFormer model on the HYU dataset. Specifically, compared with the baseline SegFormer, the proposed method improves mIoU from 65.81% to 72.61%, F1 from 79.38% to 84.14%, Precision from 91.75% to 94.12%, and Recall from 69.84% to 76.06%. Meanwhile, the BFScore increases from 77.91% to 84.36%, and the Hausdorff Distance (HD) decreases from 17.42 pixels to 11.47 pixels, indicating a substantial improvement in both regional segmentation accuracy and boundary localization quality. The relative mIoU gain over the best-performing baseline (SegFormer) reaches 10.32%, while the relative BFScore improvement reaches 8.31%, further highlighting the effectiveness of the proposed strategy in jointly enhancing semantic consistency and boundary detail recovery.
From the comparative results, U-Net and PSPNet exhibit limitations in recovering fine boundary structures and suppressing complex marine background interference, resulting in relatively lower F1, Recall, and BFScore values, as well as larger HD values. In contrast, the proposed method leverages shallow-layer ECA enhancement and boundary supervision to improve edge-sensitive feature responses while preserving global semantic information, thereby achieving simultaneous improvements in precision, recall, and boundary continuity.
In addition, SegFormer-ASPP (SegF-ASPP) performs slightly worse than the original SegFormer, with mIoU and Recall decreasing by 1.49% and 2.23%, respectively. Its BFScore is also slightly lower, while the HD value increases, suggesting weaker boundary consistency. This indicates that although ASPP improves multi-scale context aggregation, it may weaken the representation of small fragmented targets and boundary continuity in this task. These results further demonstrate that, for Ulva prolifera with irregular morphology and weak boundaries, shallow feature enhancement and explicit boundary learning are more effective optimization strategies.
Overall, the proposed network significantly improves semantic segmentation accuracy, particularly in complex boundary extraction, small-object segmentation, and robustness to marine noise. The consistent improvements in both BFScore and HD further verify the superiority of the proposed method in boundary-aware segmentation.
The model complexity and efficiency of HRNet, PSPNet, U-Net, the original SegFormer, and SegFormer-ASPP were compared with the proposed method, as summarized in
Table 8.
In this experiment, GFLOPs are computed with an input size of 512 × 512. The computation time refers to the average time required to complete a single iteration (i.e., one forward and backward pass of a batch) on the HYU dataset. All computation times are measured under the same experimental conditions, including identical hardware (NVIDIA RTX 4090), batch size (8), and input resolution (512 × 512). The reported time corresponds to the average iteration time over multiple batches after a warm-up phase, excluding data loading overhead. This ensures a fair comparison of computational efficiency across different models.
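The timing protocol above can be sketched with a generic harness (warm-up iterations, then an average over timed iterations on a pre-loaded batch); the function name and iteration counts are illustrative:

```python
import time

def mean_iteration_time(step_fn, warmup=10, iters=50):
    """Average per-iteration time after a warm-up phase.

    step_fn: callable running one forward + backward pass on a
    pre-loaded batch (so data-loading overhead is excluded).
    """
    for _ in range(warmup):          # let caches / kernel autotuning settle
        step_fn()
    start = time.perf_counter()
    for _ in range(iters):
        step_fn()
    return (time.perf_counter() - start) / iters
```

For GPU models, a synchronization call would additionally be needed before reading the clock so that asynchronously launched kernels are fully accounted for.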
The proposed method introduces lightweight improvements to SegFormer, including the ECA module and the boundary supervision branch. As a result, the number of parameters increases only slightly from 27.5 M to 27.612 M, while the computational cost remains comparable. Despite this negligible increase in model complexity, a significant improvement in mIoU is achieved, demonstrating a favorable trade-off between computational efficiency and segmentation performance, and indicating strong practical applicability.
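The small parameter overhead is consistent with the design of ECA, which replaces SE-style channel reduction with a single k-tap 1-D convolution across channels. The following dependency-free sketch illustrates only the gating computation, with uniform placeholder taps; in the actual module the taps are learned and k is typically chosen adaptively from the channel count:

```python
from math import exp

def eca_weights(channel_means, k=3, conv_w=None):
    """Efficient Channel Attention gate: a 1-D convolution of width k
    over the globally pooled channel descriptor, then a sigmoid, yields
    one weight per channel with no dimensionality reduction.
    `conv_w` are the k shared taps (learned in practice; uniform
    placeholder values are used here)."""
    if conv_w is None:
        conv_w = [1.0 / k] * k
    c, pad = len(channel_means), k // 2
    padded = [0.0] * pad + list(channel_means) + [0.0] * pad
    gates = []
    for i in range(c):
        s = sum(conv_w[j] * padded[i + j] for j in range(k))
        gates.append(1.0 / (1.0 + exp(-s)))   # sigmoid gate in (0, 1)
    return gates

def apply_eca(features, k=3):
    """Scale each channel (a 2-D map as nested lists) by its gate."""
    means = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
             for ch in features]
    gates = eca_weights(means, k)
    return [[[v * g for v in row] for row in ch]
            for ch, g in zip(features, gates)]
```

Because the k taps are shared across all channels, the added parameter count is k per ECA module (plus the boundary branch), which explains why the total grows only from 27.5 M to 27.612 M.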
Figure 8 presents the prediction results of HRNet, PSPNet, U-Net, the original SegFormer, SegFormer-ASPP, and the proposed method on the HYU dataset. Correctly predicted pixels are marked in green, false negatives in blue, and false positives in red. Visual inspection highlights several advantages of the proposed network over the other methods:
1. Superior boundary delineation and smoother edge transitions. As shown in Figure 8, the proposed network produces more continuous and accurate boundary contours, particularly in low-contrast regions where Ulva patches merge with complex marine backgrounds. Compared with other methods, the predicted boundaries are smoother and more consistent with the ground truth.
2. Improved segmentation of small and heterogeneous patches. The proposed method demonstrates stronger capability in recovering small scattered Ulva patches and heterogeneous mixed regions. Benefiting from shallow-layer ECA enhancement and explicit boundary supervision, the model effectively reduces missed predictions and preserves clearer local structures.
3. Better detail preservation in complex aggregation regions. In regions containing multiple adjacent patches, the proposed network preserves finer edge details and reduces over-smoothing effects, resulting in segmentation contours that more closely align with manual annotations.
4. Remaining challenges under severe environmental interference. Despite the improved boundary sensitivity and semantic discrimination, minor false positives and false negatives may still occur under strong reflections, water ripples, or surrounding plankton interference, suggesting room for further optimization in complex spectral environments.
4.5.2. Comparison with Representative Advanced Segmentation Models
To further validate the effectiveness, robustness, and generalization capability of the proposed method, additional comparative experiments were conducted using several representative advanced semantic segmentation models on the HYU dataset. The selected baselines include DeepLabV3+, Swin-UNet, Gated-SCNN, and OCRNet, which respectively represent strong convolutional neural network-based architectures, recent Transformer-based segmentation frameworks, classical boundary-aware segmentation methods, and high-performance context modeling networks.
The selection of these comparison models is based on their wide adoption, strong citation impact, and representative methodological characteristics in the field of semantic segmentation. Specifically, DeepLabV3+ is introduced as a widely recognized CNN-based baseline with excellent multi-scale contextual modeling capability. Swin-UNet is employed as a representative hierarchical Transformer segmentation architecture with strong global feature modeling ability. Gated-SCNN is selected as a classical boundary-aware segmentation framework to evaluate the effectiveness of the proposed boundary supervision strategy. OCRNet is further included as a strong semantic segmentation baseline with enhanced object-context representation capability.
By comparing the proposed method with these representative models, this study aims to comprehensively evaluate its performance in terms of global semantic consistency, local boundary delineation, and robustness under complex marine background interference.
As shown in Table 9, the proposed method consistently outperforms both classical convolutional networks and recent advanced segmentation architectures on the HYU dataset. Specifically, compared with the strongest baseline, OCRNet, the proposed method further improves mIoU from 71.88% to 72.61%, F1-score from 83.75% to 84.14%, Precision from 93.67% to 94.12%, and Recall from 75.58% to 76.06%. Although the absolute gains appear moderate, such improvements are meaningful given the already strong performance of the compared advanced models.
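For reference, the region-based metrics reported above follow the standard pixel-wise definitions, sketched below for a two-class setting in which mIoU averages the algae and background IoU (the paper's exact averaging convention is not restated here):

```python
def seg_metrics(tp, fp, fn, tn):
    """Standard region-based segmentation metrics from pixel counts
    for the foreground (Ulva) class; mIoU averages foreground and
    background IoU in this two-class sketch."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
    iou_fg = tp / (tp + fp + fn)
    iou_bg = tn / (tn + fp + fn)
    return {"Precision": precision, "Recall": recall,
            "F1": f1, "mIoU": (iou_fg + iou_bg) / 2}
```

As a consistency check on Table 9: with Precision 94.12% and Recall 76.06%, the harmonic mean 2PR/(P + R) is approximately 84.1%, matching the reported F1-score.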
Compared with the CNN-based baseline DeepLabV3+, the proposed method improves mIoU by 2.77% and the BFScore by 5.80%, indicating that the proposed architecture achieves better global semantic consistency while substantially enhancing boundary delineation capability. This improvement mainly benefits from the complementary design of shallow-layer ECA feature enhancement and explicit boundary supervision, which is more suitable for irregular floating Ulva prolifera contours than purely context-driven CNN architectures.
Compared with the Transformer-based Swin-UNet, the proposed method achieves higher performance across all region-based and boundary-based metrics. In particular, the BFScore increases from 80.42% to 84.36%, while HD decreases from 14.28 to 11.47 pixels. This demonstrates that, although hierarchical Transformer architectures possess strong global modeling ability, they may still be insufficient in capturing weak boundaries and small scattered patches under low-contrast marine conditions. In contrast, the proposed boundary supervision branch explicitly constrains contour learning, significantly improving local structural recovery.
More importantly, compared with the representative boundary-aware model Gated-SCNN, the proposed method still achieves superior results in both semantic and boundary metrics, with mIoU increasing by 1.25%, the BFScore improving by 1.63%, and HD decreasing by 1.44 pixels. This result strongly demonstrates that the proposed boundary supervision branch is not a simple auxiliary edge branch, but a task-oriented optimization strategy specifically designed for the complex boundary characteristics of Ulva prolifera in marine remote sensing imagery.
From a mechanism perspective, the shallow-layer ECA modules enhance edge-sensitive channel responses and improve discrimination between Ulva prolifera and surrounding seawater, particularly in low-contrast regions and small fragmented patches. Meanwhile, the boundary supervision branch provides explicit contour constraints, effectively reducing false positives caused by wave reflections and false negatives caused by weak algae-water transitions. The combination of these two components enables the network to simultaneously preserve global semantic consistency and local boundary continuity. Overall, the quantitative results confirm that the proposed method achieves a better balance between semantic representation, boundary sensitivity, and robustness to complex marine interference, thereby providing a more reliable solution for large-scale Ulva prolifera monitoring.
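The explicit contour constraint described above requires a boundary ground truth, which can be derived directly from the segmentation labels. The following is a minimal sketch, assuming the boundary target is a one-pixel morphological gradient of the label mask and the branch is supervised with binary cross-entropy; the paper's exact target-generation procedure and loss weighting are not restated here:

```python
from math import log

def boundary_target(mask):
    """Boundary ground truth derived from the segmentation mask:
    a pixel is marked as boundary if its label differs from any
    4-neighbour (a one-pixel morphological gradient)."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
                if 0 <= a < h and 0 <= b < w and mask[a][b] != mask[i][j]:
                    out[i][j] = 1
                    break
    return out

def boundary_bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between predicted boundary
    probabilities and the derived boundary target."""
    n, total = 0, 0.0
    for prow, trow in zip(pred, target):
        for p, t in zip(prow, trow):
            p = min(max(p, eps), 1 - eps)   # clamp for numerical safety
            total += -(t * log(p) + (1 - t) * log(1 - p))
            n += 1
    return total / n
```

Because the target is recomputed from the same annotation used for the segmentation loss, the branch adds supervision without requiring any extra labeling effort.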
These results further demonstrate that the proposed method is not merely a simple boundary-branch extension, but a task-oriented optimization framework specifically designed for marine Ulva prolifera semantic segmentation.
5. Conclusions
This study presents ECAB-SegFormer, an enhanced SegFormer-based network for high-precision semantic segmentation of Ulva prolifera in remote sensing imagery. The network integrates a boundary supervision branch and embeds Efficient Channel Attention (ECA) modules in shallow decoder layers to improve edge-sensitive feature representation while retaining the original SegFormer encoder–decoder structure. This design effectively addresses challenges caused by irregular Ulva morphology, small scattered patches, and complex marine backgrounds.
Extensive experiments on the HYU dataset demonstrate that ECAB-SegFormer consistently outperforms classical convolutional networks (HRNet, PSPNet, and U-Net) and recent segmentation models (SegFormer, SegFormer-ASPP, Swin-UNet, DeepLabV3+, and Gated-SCNN). Specifically, compared with the original SegFormer, ECAB-SegFormer improves mIoU from 65.81% to 72.61%, F1 from 79.38% to 84.14%, Precision from 91.75% to 94.12%, Recall from 69.84% to 76.06%, and the BFScore from 81.46% to 86.34%, while reducing the Hausdorff Distance (HD) from 12.85 to 9.35 pixels. These improvements highlight the network's ability to capture fine boundaries, accurately segment small or heterogeneous Ulva patches, and suppress false positives and false negatives under complex marine conditions.
The results confirm that the proposed method achieves superior edge delineation, robust segmentation in challenging marine environments, and strong generalization across varied spatial distributions. The main contributions of this work include: (1) embedding ECA modules in shallow decoder layers to enhance feature discrimination, (2) introducing a boundary supervision branch to explicitly model and preserve edges, and (3) demonstrating state-of-the-art performance over both classical CNNs and modern Transformer-based segmentation networks. ECAB-SegFormer provides a reliable tool for large-scale Ulva monitoring, ecological early warning, and intelligent marine mapping, with future work focused on further improving robustness against spectral imaging interference.