1. Introduction
Bathymetric surveying is one of the fundamental tasks in marine surveying, and precise seafloor measurement is of great significance for marine engineering construction, marine scientific research, ship navigation safety, and other related fields. As the primary equipment for seafloor topographic data acquisition, the Multibeam Echo Sounder (MBES) can efficiently collect high-precision, high-density 3D point cloud data [1,2]. However, the raw point clouds acquired by this system are essentially unstructured, spatially discrete points lacking effective semantic information, which poses enormous challenges for extracting valuable topographic features directly from the data. Therefore, achieving automated recognition and segmentation of seafloor topographic units has become a critical issue that urgently needs to be addressed in seafloor topographic data processing.
Seafloor topography recognition and classification methods have undergone an evolution from manual interpretation to intelligent automation. Prior to the widespread adoption of modern surveying techniques, geomorphological classification relied predominantly on researcher expertise and subjective interpretation, lacking systematic rigor. With the rapid advancement of marine exploration technologies, researchers began to employ geometric morphometric parameters—including bathymetry, slope, topographic relief, Terrain Ruggedness Index (TRI), and curvature—and to discriminate geomorphological units by setting empirical thresholds [3,4]. Harris et al. (2014) produced a global map of seafloor geomorphology based on IHO standards, using a semi-automated approach that combined manual delineation with algorithm-assisted analysis and integrated parameters such as slope, topographic relief, and Topographic Position Index (TPI) to classify multiple geomorphological units [5]. However, these threshold- and geometry-based methods are inherently limited by their time-consuming and labor-intensive nature, sensitivity to threshold settings, and poor generalization capability, rendering them inadequate for the efficient automated processing of large-scale seafloor topographic data.
To overcome the reliance of traditional methods on manual expertise and fixed thresholds, researchers have begun to introduce machine learning techniques into the field of seafloor topographic and geomorphological analysis. Such methods can extract multi-dimensional features from topographic data and construct data-driven mapping models, enabling autonomous recognition and delineation of geomorphological units by the model and thereby reducing, to some extent, dependence on subjective decision rules. Masetti et al. (2018) proposed a seafloor segmentation algorithm based on bathymetric and acoustic backscatter data, which automatically identified and merged geomorphological units by analyzing the similarity between topographic morphometric features and backscatter textures, achieving automated seafloor segmentation [6]. Giannakopoulos et al. (2025) employed geomorphometric methods to extract morphological features from multibeam bathymetric data and combined them with a Random Forest classifier to identify seafloor pockmarks [7]. However, these methods still rely on hand-crafted feature engineering, exhibit limited adaptive capability in delineating complex topographic boundaries, and suffer from insufficient cross-dataset transferability and generalization performance under varying conditions, restricting their capacity for fine-grained recognition in complex seafloor scenarios.
In recent years, deep learning models, particularly convolutional neural networks (CNNs), have advanced rapidly. Through end-to-end feature learning, these models eliminate the need for hand-crafted feature engineering and automatically extract discriminative features from raw data, thereby opening new avenues for automated seafloor topographic recognition. However, research applying deep learning to seafloor topography segmentation remains relatively limited. Inspired by technical developments in the general point cloud processing field, existing deep learning-based methods for this task can be broadly categorized into two groups. The first group processes data directly in 3D space, relying on deep learning models capable of end-to-end feature extraction and classification on raw point clouds, such as PointNet [8], PointNet++ [9], and RandLA-Net [10]. The second group utilizes 2D image-based segmentation methods, the core idea of which is to project 3D point clouds into regular gridded 2D images or Digital Elevation Models (DEMs) for subsequent feature extraction and segmentation using well-established 2D convolutional neural networks. Each approach has distinct advantages and limitations: 3D direct-processing methods better preserve three-dimensional topological relationships, yet suffer from high computational costs and low efficiency; furthermore, multibeam point clouds typically lack rich attribute information, which hinders adequate model learning. By contrast, 2D image-based methods offer significant advantages in computational efficiency and engineering practicality, but sacrifice the original three-dimensional spatial topological information. Moreover, their segmentation accuracy is constrained by the image generation method and quality, necessitating targeted improvements to enhance precision.
To this end, this study adopts the 2D image-based segmentation paradigm and proposes a lightweight seafloor topography recognition and segmentation method based on bimodal image feature fusion, from the perspectives of image generation and model optimization, aiming to improve segmentation accuracy and robustness. Specifically, at the image generation level, an early fusion strategy is adopted in which a pseudocolor image based on per-image adaptive range mapping and a grayscale image based on global fixed-range mapping—both generated from point clouds via continuous curvature tension spline interpolation—are concatenated channel-wise at the input level. This achieves complementarity between local texture details and absolute water depth information, enhancing the model’s ability to perceive topographic features. At the model optimization level, a lightweight Efficient Channel Attention (ECA) module [11] is embedded after the Spatial Pyramid Pooling-Fast (SPPF) module of the backbone network to adaptively recalibrate channel weights, thereby reinforcing the contribution of the grayscale channel. Furthermore, a weighted BCE-Dice joint loss function [12] is constructed to alleviate the class imbalance problem and optimize boundary segmentation accuracy, ultimately improving the overall quality of image segmentation.
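To make the early fusion step concrete, the following minimal sketch (assuming the point cloud has already been rasterized to a depth grid by the interpolation step, and with the colormap passed in by the caller rather than fixed to the paper's actual palette) builds the four-channel input from a per-image adaptive pseudocolor mapping and a global fixed-range grayscale mapping:

```python
import numpy as np

def fuse_bimodal(depth_grid, global_min, global_max, colormap):
    """Early-fusion sketch: concatenate a per-image pseudocolor image with a
    globally normalized grayscale image into one 4-channel input array."""
    # Per-image adaptive range mapping -> emphasizes local texture detail
    local = (depth_grid - depth_grid.min()) / (np.ptp(depth_grid) + 1e-9)
    pseudo = colormap(local)[..., :3]                    # H x W x 3, values in [0, 1]
    # Global fixed-range mapping -> preserves absolute water depth
    gray = np.clip((depth_grid - global_min) / (global_max - global_min), 0.0, 1.0)
    return np.concatenate([pseudo, gray[..., None]], axis=-1)  # H x W x 4
```

Concatenating at the input level is what distinguishes this early fusion from late (decision-level) fusion: every downstream convolution sees local texture and the absolute depth datum jointly.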
3. Materials and Experiments
3.1. Experimental Data
The experimental dataset used in this study is derived from multi-voyage measured seafloor topography data, comprising various typical scenarios in shallow, medium, and deep waters, thereby ensuring good representativeness and feature diversity. Representative image examples of these scenarios are shown in Table 1.
Table 1. Representative seafloor topographic examples.
The original dataset consists of 341 pairs of bimodal images (pseudo-color and grayscale images in one-to-one correspondence), including 29 flat images without topographic features. To ensure training generalization and evaluation objectivity, the dataset is randomly split into training, validation, and test sets in an 8:1:1 ratio, with a balanced distribution of all topographic sample types maintained across the three subsets. To address the issue of limited original sample size, targeted data augmentation is applied using operations such as flipping, translation, and color jittering (color jittering is applied only to pseudo-color images; grayscale images undergo no such augmentation). The data augmentation pipeline is illustrated in Figure 3. This augmentation yields a total of 3069 images, effectively enhancing the diversity of target features. Finally, manual annotation of topographic targets in the images is performed, resulting in a high-quality seafloor topography dataset.
Figure 3. Data augmentation methods.
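The augmentation policy described above (geometric operations shared by both modalities, photometric jitter restricted to the pseudo-color image) can be sketched as follows; the function name, shift range, and jitter strength are illustrative, not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_pair(pseudo, gray, jitter=0.1):
    """Apply geometric transforms identically to both modalities;
    color jitter touches the pseudo-color image only."""
    if rng.random() < 0.5:                               # horizontal flip (both)
        pseudo, gray = pseudo[:, ::-1], gray[:, ::-1]
    shift = int(rng.integers(-8, 9))                     # small translation (both)
    pseudo = np.roll(pseudo, shift, axis=1)
    gray = np.roll(gray, shift, axis=1)
    # Brightness jitter on the pseudo-color image only: the grayscale
    # channel encodes absolute depth and must remain unperturbed.
    factor = 1.0 + rng.uniform(-jitter, jitter)
    return np.clip(pseudo * factor, 0.0, 1.0), gray
```

Keeping the grayscale channel out of photometric augmentation preserves the global fixed-range depth mapping that the fusion strategy relies on.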
3.2. Experimental Setup and Evaluation Metrics
The experimental hardware configuration was an NVIDIA GeForce RTX 4060 Laptop GPU, and the software environment consisted of Python 3.10, CUDA 11.8, cuDNN 8.6.0, and PyTorch 2.8.0. Training was run for up to 300 epochs with an initial learning rate of 0.0005 and a batch size of 4, using early stopping with a patience of 50 epochs (training terminates if the validation loss does not decrease for 50 consecutive epochs).
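The early-stopping rule can be sketched generically as below; `step_fn` is a hypothetical callback that runs one epoch of training and returns the validation loss:

```python
def train_with_early_stopping(step_fn, max_epochs=300, patience=50):
    """Early-stopping sketch: stop when the validation loss has not
    improved for `patience` consecutive epochs."""
    best, wait = float("inf"), 0
    for epoch in range(max_epochs):
        val_loss = step_fn(epoch)        # one epoch of training + validation
        if val_loss < best:
            best, wait = val_loss, 0     # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:         # plateau for `patience` epochs
                break
    return best, epoch
```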
To systematically evaluate the comprehensive performance of the improved model proposed in this study on the seafloor topography segmentation task, classic metrics in the field of object detection and segmentation were adopted, including: Precision, Recall, mean Average Precision (mAP), Parameters (Params), and GFLOPs. Their definitions are as follows:
TP (True Positives): number of positive samples correctly classified as positive.
TN (True Negatives): number of negative samples correctly classified as negative.
FP (False Positives): number of negative samples incorrectly classified as positive.
FN (False Negatives): number of positive samples incorrectly classified as negative.
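From these counts, the listed metrics follow their standard definitions:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} AP_i
```

where $AP_i$ is the area under the precision-recall curve for class $i$ and $N$ is the number of classes; Params and GFLOPs measure model size and computational cost, respectively.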
3.3. Experimental Design
To comprehensively evaluate the performance of the improved YOLO11n-seg model proposed in this study, three experiments are designed and conducted under identical parameter configurations to ensure fair and reliable comparisons.
- (1) Ablation Study: Using the original YOLO11n-seg model as the baseline, the bimodal early fusion strategy, the ECA mechanism, and the weighted BCE-Dice joint loss function are gradually incorporated. This experiment aims to verify the individual contribution and necessity of each improved module for segmentation performance.
- (2) Comparative Experiments: Under identical experimental settings and on the same dataset, the proposed method is compared with several mainstream segmentation models to evaluate its segmentation performance on multibeam point cloud images.
- (3) Back-Projection Validation Experiments: Since some downstream applications (e.g., point cloud simplification and topographic modeling) are implemented in 3D space, back-projection validation experiments are designed. By recording the index mapping between point clouds and images, the segmentation masks predicted by the model are back-projected to the 3D point cloud space. Typical blocks are randomly selected to verify the integrity and accuracy of the topographic region segmentation.
4. Results and Discussion
4.1. Ablation Study Results and Analysis
Following the ablation study design described in Section 3.3, we conducted experiments to evaluate the individual contributions of the three proposed improvement modules. The results are presented in Table 2 and analyzed below.
From the ablation results, it can be observed that all three improvement strategies proposed in this study exert positive effects on the model’s segmentation performance, forming a progressive and complementary synergistic mechanism that gradually enhances the precision and robustness of seafloor topography segmentation.
- (1) Bimodal early fusion strategy. This strategy alleviates the problem of incomplete information representation in single-modality images and reduces the false positive rate. After introducing the early fusion strategy on the baseline model, all evaluation metrics are significantly improved, among which the precision is enhanced most remarkably. This indicates that the early fusion strategy effectively remedies the incomplete information representation of single-modality images through the complementarity between local texture details of pseudocolor images and global depth information of grayscale images, enabling the model to more reliably distinguish real topography from false positive signals of flat seabed, while simultaneously providing a high-quality multi-channel input foundation for subsequent feature learning.
- (2) ECA mechanism. This mechanism adaptively recalibrates channel weights and improves recall. After introducing the ECA attention mechanism on the basis of early fusion, the model’s recall and mean average precision (mAP) are further improved, while the precision shows a slight decrease. This reflects that the ECA module effectively reinforces the contribution of the grayscale depth channel to the final segmentation decision through adaptive channel weight recalibration, thereby enhancing the model’s ability to perceive and extract weak-texture topography and small targets. Although accompanied by a slight increase in false detections due to the introduction of a small amount of noise, the overall performance of the model is significantly enhanced. Leveraging the multi-channel feature foundation established by bimodal fusion, the ECA module can build cross-modal dependency relationships within the fused four-channel feature tensor, achieving dynamic equilibrium between texture information and depth information, and avoiding the suboptimal state of “information complementarity but uneven utilization.”
- (3) Loss function optimization. This optimization alleviates class imbalance and improves the problems of fragmented segmentation boundaries and incomplete regions. After introducing the loss function optimization, the model’s segmentation precision and boundary quality are significantly improved, indicating that the loss function effectively enhances the discriminative ability of topographic region boundaries on fused features. Specifically, the BCE loss provides stable pixel-level gradients, while the Dice loss suppresses the dominant effect of flat seabed background on loss computation through regional overlap constraints. Together, they achieve a balance between pixel-wise precision and regional integrity.
Finally, after integrating all three optimizations, all metrics except reach near-optimal levels, with absolute improvements of 9.1, 6.9, 8.9, 7.6, 5.8, and 7.6 percentage points over the baseline model, demonstrating the best overall segmentation performance. The three improvements form a clear complementary relationship: the bimodal early fusion strategy integrates texture and depth information at the input layer in a complementary manner, providing a complete feature foundation with both local details and a global depth datum for subsequent modules; the ECA module then balances the contributions of different channels on this basis, enhancing the detection capability for weak-texture targets and small targets; and the BCE-Dice loss transforms the boundary information and detected targets brought by the former two into complete, smooth segmentation masks through regional overlap constraints, ultimately achieving an overall performance improvement.
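The channel recalibration that the ECA module contributes to this synergy can be sketched as follows (a NumPy illustration; a uniform kernel stands in for the learned 1-D convolution weights, and the real module additionally derives the kernel size adaptively from the channel count):

```python
import numpy as np

def eca_attention(features, kernel_size=3):
    """ECA sketch: squeeze by global average pooling, model local
    cross-channel interaction with a 1-D convolution, then rescale channels."""
    c, h, w = features.shape
    squeeze = features.mean(axis=(1, 2))                 # global average pool -> (C,)
    pad = kernel_size // 2
    padded = np.pad(squeeze, pad, mode="edge")
    kernel = np.full(kernel_size, 1.0 / kernel_size)     # stand-in for learned weights
    conv = np.convolve(padded, kernel, mode="valid")     # local cross-channel mixing, (C,)
    weights = 1.0 / (1.0 + np.exp(-conv))                # sigmoid gate per channel
    return features * weights[:, None, None]             # channel-wise recalibration
```

Unlike SE-style attention, ECA avoids a channel-reducing bottleneck: the 1-D convolution over the pooled vector adds only a handful of parameters, which is why embedding it after SPPF keeps the model lightweight.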
To verify the statistical significance of the performance improvement, we conducted repeated training and testing for both the baseline and the proposed full model using five different random seeds under a fixed dataset split. On the test set, the baseline model achieved an mAP50_M of 0.855 ± 0.012 (mean ± standard deviation), whereas the proposed method achieved 0.926 ± 0.011. A paired t-test yielded p < 0.05, indicating a statistically significant difference and confirming that the performance gain is stable and reliable.
The segmentation comparison in Figure 4 provides further intuitive visual validation of the effectiveness of the proposed improvements. (a) The baseline model produces extensive false-positive errors in flat seabed regions, misclassifying typical flat areas without discernible topographic relief as terrain. By contrast, the proposed method effectively suppresses spurious responses in flat regions by introducing absolute water depth information through bimodal fusion. (b) The baseline model exhibits obvious missed detections of small isolated topographic units, whereas the proposed method effectively recovers these missed targets by leveraging the ECA mechanism to reinforce depth-channel responses. (c) The baseline model suffers from boundary fragmentation and localization deviations, with edges of topographic units appearing serrated and discontinuous; the proposed method achieves tighter, smoother, and more precise edge fitting through loss-function optimization. In summary, through the synergistic effect of the three improvements, the proposed method significantly enhances the accuracy of seafloor topography discrimination and the robustness of segmentation.
In terms of parameter scale, with the gradual introduction of the improved strategies, the model’s parameters and computational cost increase only slightly. The full model has 9.8 GFLOPs and 2.84 M parameters, fully preserving its lightweight property. For computational efficiency, taking a typical block of 300,000 points as an example, the average time cost for bimodal image generation (supporting offline preprocessing) is approximately 47.8 s, and YOLO inference takes about 0.1 s. Overall, the improved YOLO11n-seg model in this study achieves a remarkable enhancement in segmentation accuracy while maintaining low computational complexity. Combined with the offline image generation pipeline, the proposed scheme presents sound feasibility and practicality.
4.2. Comparison with Existing Segmentation Models
To comprehensively evaluate the segmentation performance of the improved model proposed in this study, five representative segmentation models—YOLOv5n-seg, YOLOv8n-seg, Mask R-CNN [22], CondInst [23], and Mask2Former [24]—were selected for comparison under identical experimental settings and training conditions. The results are presented in Table 3.
As can be observed from the comparison results, the proposed improved method significantly outperforms all compared models in segmentation accuracy (mAP50_M). In the longitudinal comparison within the YOLO family, the three lightweight generations exhibit a clear progressive accuracy improvement: YOLOv5n-seg and YOLOv8n-seg achieve mAP50_M of 0.823 and 0.834, respectively, while the YOLO11n-seg baseline reaches 0.852. The proposed method further improves upon this baseline to 0.928, validating both the advancement of selecting YOLO11n-seg as the baseline and the effectiveness of the proposed improvement strategies. In the cross-architecture comparison, the two-stage model Mask R-CNN (0.876) and the single-stage instance segmentation model CondInst (0.874) achieve comparable accuracy. The Transformer-based Mask2Former attains the highest accuracy among the compared models at 0.891, yet still trails the proposed method by a noticeable margin of 3.7 percentage points. Furthermore, the proposed method maintains only 2.84 M parameters, far fewer than Mask R-CNN (43.97 M), CondInst (33.98 M), and Mask2Former (44.00 M), achieving a superior balance between accuracy and lightweight design. In summary, the proposed method surpasses all compared mainstream models in accuracy while maintaining an extremely low parameter count, demonstrating a favorable accuracy–efficiency trade-off in seafloor topography segmentation tasks and validating its effectiveness and applicability for multibeam point cloud image segmentation.
4.3. Back-Projection of Segmentation Results
During the bimodal image generation stage, the index mapping between the original 3D point cloud and image pixels is recorded synchronously. After the improved YOLO11n-seg model completes inference and outputs instance segmentation masks, each pixel label in the masks is back-projected onto the corresponding 3D point cloud according to this mapping, yielding point cloud segmentation results with semantic labels. Given that seafloor topographic edges often exhibit gradual transitional characteristics, strict back-projection based on binary masks may misclassify boundary points, potentially leading to the loss of critical topographic transition information in subsequent applications. To this end, this study performs a 1-pixel dilation on the segmentation results prior to back-projection, conservatively classifying edge regions as topographic areas to ensure the integrity of topographic segmentation and the reliability of engineering applications.
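A minimal sketch of this back-projection step, assuming the recorded index mapping is available as an N x 2 array of (row, col) pixel indices per point and using a 4-neighbourhood dilation for the 1-pixel expansion (the actual implementation may use an 8-neighbourhood):

```python
import numpy as np

def dilate_mask(mask):
    """1-pixel binary dilation (4-neighbourhood), no external dependencies."""
    out = mask.copy()
    out[1:, :] |= mask[:-1, :]
    out[:-1, :] |= mask[1:, :]
    out[:, 1:] |= mask[:, :-1]
    out[:, :-1] |= mask[:, 1:]
    return out

def back_project(mask, point_to_pixel):
    """Back-projection sketch: dilate the predicted mask by one pixel, then
    label each 3-D point via its recorded (row, col) pixel index."""
    dilated = dilate_mask(mask.astype(bool))
    rows, cols = point_to_pixel[:, 0], point_to_pixel[:, 1]
    return dilated[rows, cols]        # boolean topographic label per point
```

Dilating before look-up conservatively assigns boundary points to the topographic class, matching the paper's rationale of preserving gradual topographic transitions.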
In this study, twenty typical blocks are randomly selected from the test set (covering various scenarios including flat regions, shallow-water topography, medium-water topography, and deep-water topography) for visual validation. Figure 5 presents the back-projection results of four representative blocks. Figure 5a,b illustrates composite scenarios containing both terrain and flat regions: in Figure 5a, strip-like topography is distributed along the block edge; in Figure 5b, a large continuous seamount landform occupies the center of the block, dividing the flat seabed into several isolated regions. The segmentation results indicate that, after the pixel dilation operation, the model accurately identifies topographic boundaries and clearly captures the transitional characteristics between terrain and flat regions. Figure 5c shows a block dominated entirely by terrain with no flat regions, exhibiting overall complex elevation variations, which validates the model’s reliable extraction capability in pure terrain scenarios. Figure 5d depicts a deep-sea plain with an average water depth of approximately 4600 m, showing no significant topographic relief; the segmentation results exhibit no false topographic detection alarms, confirming the model’s reliability in flat region identification.
The above results demonstrate that after the image-domain segmentation results are back-projected to the point cloud space via index mapping, the integrity and boundary accuracy of terrain regions are well maintained, with clear distinction between flat and terrain regions and no systematic deviations.
5. Conclusions
This study proposes a seafloor topography recognition and segmentation method based on YOLO11n-seg with bimodal image feature fusion, addressing both image generation and model optimization to improve segmentation accuracy and robustness for multibeam seafloor topography images. Through continuous curvature tension spline interpolation, multibeam point cloud data are projected into bimodal 2D images. On the basis of the YOLO11n-seg baseline model, three targeted improvements are introduced: the early fusion strategy, the ECA channel attention mechanism, and the BCE-Dice joint loss function optimization. These measures effectively alleviate color drift and the problem of incomplete information representation in single-modality images, significantly enhancing the model’s ability to perceive and discriminate topographic features. Experimental results show that the proposed method achieves an mAP@50 of 92.8%, a precision of 94.0%, and a recall of 79.5% on the self-constructed dataset, representing absolute improvements of 7.6, 7.6, and 5.8 percentage points over the original YOLO11n-seg baseline model, while preserving the baseline’s lightweight property. This demonstrates promising application potential for real-time processing on AUVs and in other resource-constrained scenarios. In terms of engineering scalability and practical value, the method can be adapted to diverse marine survey tasks, including AUV/ROV underwater positioning and navigation with adaptive path planning, differentiated simplification of multibeam seafloor topographic point clouds, seafloor geomorphological feature extraction and semantic labeling, and conventional marine surveying engineering. The method has so far been validated offline on measured data; subsequent work will promote its deployment on actual operational platforms to verify real-time processing performance and operational reliability in real working environments.
Future work will focus on introducing measured data from multiple sources and sea areas to strengthen the model’s generalization ability, and exploring the extension from topography-flat binary segmentation to fine recognition and segmentation of multi-class topographic and geomorphic features. Meanwhile, adversarial training strategies will be considered to enhance the model’s robustness to potential environmental noise and abnormal disturbances, further improving its anti-interference capability in complex marine environments.