Article

A Method for Seafloor Topography Recognition and Segmentation Based on Bimodal Image Feature Fusion with YOLO11 Model

1
Department of Oceanography and Hydrography, Dalian Naval Academy, Dalian 116018, China
2
School of Software, Northeastern University, Shenyang 110169, China
3
College of Electrical Engineering, Naval University of Engineering, Wuhan 430033, China
*
Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2026, 14(10), 903; https://doi.org/10.3390/jmse14100903
Submission received: 6 March 2026 / Revised: 4 May 2026 / Accepted: 12 May 2026 / Published: 13 May 2026
(This article belongs to the Section Ocean Engineering)

Abstract

Accurate recognition and segmentation of seafloor topographic units is of great significance for marine surveying and engineering applications. Efficient segmentation of multibeam bathymetric point clouds typically requires projecting them into two-dimensional images. However, segmentation methods based on single-modality images suffer from incomplete information representation and insufficient model adaptability, which often lead to blurred boundaries, false positives, and missed detections, thereby limiting segmentation accuracy. To address these challenges, this study proposes a seafloor topography recognition and segmentation method based on YOLO11n-seg with bimodal image feature fusion, from the perspectives of image generation and model optimization, aiming to improve segmentation accuracy and robustness. First, an early fusion strategy for bimodal images is adopted. Two types of images generated from point clouds via continuous curvature tension spline interpolation are concatenated at the input level, fusing local texture details with absolute water depth information, thereby enhancing the model’s ability to perceive topographic features. Second, a lightweight Efficient Channel Attention (ECA) module is embedded after the Spatial Pyramid Pooling-Fast (SPPF) module of the backbone network. This module adaptively calibrates channel weights, reinforcing the contribution of the grayscale channel to the final segmentation decision. Finally, a weighted BCE-Dice joint loss function is constructed to mitigate class imbalance between flat seabed and topographic regions, while also optimizing boundary segmentation accuracy. Experimental results on a self-constructed multibeam image dataset demonstrate that the proposed method achieves an mAP@50 of 92.8%, representing an absolute improvement of 7.6 percentage points over the baseline model. Notably, the model has only 2.84 M parameters, maintaining a lightweight profile.

1. Introduction

Bathymetric surveying is one of the fundamental tasks in marine surveying, and precise measurement of seafloor topography is of great significance for marine engineering construction, marine scientific research, ship navigation safety, and other related fields. As the primary equipment for seafloor topographic data acquisition, the Multibeam Echo Sounder (MBES) can efficiently collect high-precision, high-density 3D point cloud data [1,2]. However, the raw point clouds acquired by this system are essentially unstructured, spatially discrete points lacking effective semantic information, which poses significant challenges to extracting valuable topographic features directly from the data. Achieving automated recognition and segmentation of seafloor topographic units has therefore become a critical and urgent issue in seafloor topographic data processing.
Seafloor topography recognition and classification methods have undergone an evolution from manual interpretation to intelligent automation. Prior to the widespread adoption of modern surveying techniques, geomorphological classification relied predominantly on researcher expertise and subjective interpretation, lacking systematic rigor. With the rapid advancement of marine exploration technologies, researchers began to employ geometric morphometric parameters—including bathymetry, slope, topographic relief, Terrain Ruggedness Index (TRI), and curvature—and to discriminate geomorphological units by setting empirical thresholds [3,4]. Harris et al. (2014) produced a global map of seafloor geomorphology based on IHO standards, using a semi-automated approach that combined manual delineation with algorithm-assisted analysis and integrated parameters such as slope, topographic relief, and Topographic Position Index (TPI) to classify multiple geomorphological units [5]. However, these threshold- and geometry-based methods are inherently limited by their time-consuming and labor-intensive nature, sensitivity to threshold settings, and poor generalization capability, rendering them inadequate for the efficient automated processing of large-scale seafloor topographic data.
To overcome the reliance of traditional methods on manual expertise and fixed thresholds, researchers have begun to introduce machine learning techniques into the field of seafloor topographic and geomorphological analysis. Such methods can extract multi-dimensional features from topographic data and construct data-driven mapping models, enabling autonomous recognition and delineation of geomorphological units by the model and thereby reducing, to some extent, dependence on subjective decision rules. Masetti et al. (2018) proposed a seafloor segmentation algorithm based on bathymetric and acoustic backscatter data, which automatically identified and merged geomorphological units by analyzing the similarity between topographic morphometric features and backscatter textures, achieving automated seafloor segmentation [6]. Giannakopoulos et al. (2025) employed geomorphometric methods to extract morphological features from multibeam bathymetric data and combined them with a Random Forest classifier to identify seafloor pockmarks [7]. However, these methods still rely on hand-crafted feature engineering, exhibit limited adaptive capability in delineating complex topographic boundaries, and suffer from insufficient cross-dataset transferability and generalization performance under varying conditions, restricting their capacity for fine-grained recognition in complex seafloor scenarios.
In recent years, deep learning models, particularly convolutional neural networks (CNNs), have advanced rapidly. Through end-to-end feature learning, these models eliminate the need for hand-crafted feature engineering and automatically extract discriminative features from raw data, thereby opening new avenues for automated seafloor topographic recognition. However, research applying deep learning to seafloor topography segmentation remains relatively limited. Inspired by technical developments in the general point cloud processing field, existing deep learning-based methods for this task can be broadly categorized into two groups. The first group processes data directly in 3D space, relying on deep learning models capable of end-to-end feature extraction and classification on raw point clouds, such as PointNet [8], PointNet++ [9], and RandLA-Net [10]. The second group utilizes 2D image-based segmentation methods, the core idea of which is to project 3D point clouds into regular gridded 2D images or Digital Elevation Models (DEMs) for subsequent feature extraction and segmentation using well-established 2D convolutional neural networks. Each approach has distinct advantages and limitations: 3D direct-processing methods better preserve three-dimensional topological relationships, yet suffer from high computational costs and low efficiency; furthermore, multibeam point clouds typically lack rich attribute information, which hinders adequate model learning. By contrast, 2D image-based methods offer significant advantages in computational efficiency and engineering practicality, but sacrifice the original three-dimensional spatial topological information. Moreover, their segmentation accuracy is constrained by the image generation method and quality, necessitating targeted improvements to enhance precision.
To this end, this study adopts the 2D image-based segmentation paradigm and proposes a lightweight seafloor topography recognition and segmentation method based on bimodal image feature fusion, from the perspectives of image generation and model optimization, aiming to improve segmentation accuracy and robustness. Specifically, at the image generation level, an early fusion strategy is adopted in which a pseudocolor image based on per-image adaptive range mapping and a grayscale image based on global fixed-range mapping—both generated from point clouds via continuous curvature tension spline interpolation—are concatenated channel-wise at the input level. This achieves complementarity between local texture details and absolute water depth information, enhancing the model’s ability to perceive topographic features. At the model optimization level, a lightweight Efficient Channel Attention (ECA) module [11] is embedded after the Spatial Pyramid Pooling-Fast (SPPF) module of the backbone network to adaptively recalibrate channel weights, thereby reinforcing the contribution of the grayscale channel. Furthermore, a weighted BCE-Dice joint loss function [12] is constructed to alleviate the class imbalance problem and optimize boundary segmentation accuracy, ultimately improving the overall quality of image segmentation.

2. Methods

2.1. YOLO11 Baseline Model

The YOLO (You Only Look Once) model, proposed by Joseph Redmon et al. in 2015 [13], is a one-stage object detection framework. Owing to its outstanding real-time inference capability and excellent engineering adaptability, it has become one of the mainstream foundational models in the field of computer vision. In recent years, with continuous algorithmic iteration and optimization, the application scenarios of the YOLO series have been continually expanded. Beyond general visual tasks, these models have been widely applied in specialized domains such as marine remote sensing monitoring and underwater target detection and recognition. Peng et al. (2024) combined a DDPM diffusion model with the YOLOv5 detection network to propose an adversarial enhancement generation method for side-scan sonar images, effectively improving the detection accuracy of underwater shipwrecks and other targets through an iterative training strategy [14]. Bakirci and Bayraktar (2024) evaluated the performance of the YOLO11 algorithm in SAR image ship detection, verifying its effectiveness in open-ocean and complex coastal environments [15]. Liu and Sun (2022) applied YOLOX to shore-based intelligent ship monitoring systems, demonstrating the excellent performance of the YOLO series in maritime real-time monitoring [16].
Through continuous evolution, the YOLO11 model released in 2024 builds upon the overall architecture of YOLOv8 while introducing in-depth optimizations to its three core components—the Backbone, Neck, and Head—achieving a superior balance among detection accuracy, inference speed, and parameter count compared with the widely adopted YOLOv5 and YOLOv8 models [17]. The YOLO11 model retains the classic three-tier architecture of Backbone–Neck–Head. Specifically, the Backbone network performs layer-wise feature extraction and abstraction on the input image through modules including CBS, C3K2, SPPF, and C2PSA, generating a set of feature maps at different resolutions. The Neck network then integrates the multi-scale features from the Backbone via upsampling, downsampling, and cross-level feature fusion, constructing a multi-scale feature pyramid with strong semantic representation capability. Finally, the Head network decodes these multi-scale features through lightweight convolutional layers, predicting the spatial locations, object categories, and instance masks of targets, thereby enabling efficient multi-object recognition and segmentation.
Compared with semantic segmentation algorithms that output only pixel-level labels, YOLO series models exhibit mature instance segmentation capabilities, enabling them to independently distinguish different topographic units and generate dedicated masks. This satisfies the demand for independent multi-target delineation and differentiated analysis in complex seafloor scenarios. Considering the computational efficiency, lightweight deployment requirements, and real-time demands of seafloor topography processing, this study adopts the lightweight YOLO11n-seg variant as the baseline model. This variant maintains an extremely low parameter count and fast inference speed, making it well suited for large-scale, rapid interpretation of seafloor topographic data while guaranteeing segmentation accuracy.

2.2. Overall Workflow of the Proposed Method

The overall workflow of the proposed seafloor topography recognition and segmentation method based on the YOLO11n-seg model with bimodal image feature fusion is illustrated in Figure 1. First, unordered point cloud data is transformed into a regular bimodal image representation through point cloud data preprocessing and continuous curvature tension spline interpolation. Subsequently, topographic regions are identified and segmented from the images using the improved lightweight YOLO11n-seg segmentation network.

2.3. Data Preprocessing and Generation of Bimodal Images

The data used in this study were collected by a Multibeam Echo Sounder (MBES) system, the core of which comprises peripheral auxiliary sensors, a multibeam acoustic subsystem, and post-processing software. This system simultaneously acquires the spatial coordinates (longitude, latitude, and depth) of the seafloor topography, as well as auxiliary measurement information including tidal data, positioning data, and sound-speed profiles. To ensure data quality, prior to entering the image generation and segmentation pipeline of this study, the raw point cloud data must undergo standardized preprocessing operations—including sound-speed correction, tidal level correction, and gross error elimination—using CARIS HIPS and SIPS software (Version 11.3), so as to obtain high-precision, reliable seafloor topographic point cloud data that serve as the foundation for subsequent bimodal image generation and analysis [18].
To reconcile the need to preserve the natural curvature of seafloor topography with the demand for high-resolution detail representation, this study adopts the continuous curvature tension spline interpolation algorithm to convert 3D discrete point clouds into regular gridded 2D images. Compared with conventional interpolation methods, this algorithm introduces an adjustable tension factor to flexibly balance topographic smoothness and detail fidelity. It also effectively avoids the spurious extreme values that Inverse Distance Weighting (IDW) tends to produce in sparse point cloud areas, and overcomes the poor adaptability of Kriging interpolation to complex seafloor topography. These advantages make it particularly well suited to seafloor topographic scenarios where flat areas and abrupt topographic features coexist. The reliability and cross-scenario applicability of this method have been validated by internationally recognized seafloor topographic models and multi-region measured data, enabling the method to provide high-quality 2D image inputs for subsequent deep learning models [19,20,21]. The appropriate determination of image resolution is directly related to the balance between topographic feature preservation and computational efficiency. Leveraging the physical detection characteristics of the multibeam echo sounder, this study derives the corresponding central beam footprint size from the average water depth of the current interpolation area and adopts it as the spatial resolution for generating bimodal images. This setting strategy ensures that image details match the actual detection capability of the sounding system, preventing the introduction of spurious details by oversampling while avoiding the loss of topographic features caused by undersampling. While preserving valid topographic information to the maximum extent, this strategy balances computational efficiency with result reliability.
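As a concrete illustration of the footprint-based resolution rule described above, the sketch below derives a grid resolution from the mean water depth; the flat-seafloor footprint approximation $d \approx 2z\tan(\theta/2)$ and the 1° beamwidth are illustrative assumptions rather than parameters reported in this study:

```python
import math

def central_footprint_size(mean_depth_m: float, beamwidth_deg: float = 1.0) -> float:
    """Approximate central (nadir) beam footprint diameter on a flat seafloor.

    For a downward-looking beam, the footprint grows with depth and the
    across-track beamwidth: d ~ 2 * z * tan(theta / 2).
    """
    theta = math.radians(beamwidth_deg)
    return 2.0 * mean_depth_m * math.tan(theta / 2.0)

# Deeper water -> larger footprint -> coarser grid resolution.
res_shallow = central_footprint_size(30.0)    # about 0.52 m for a 1-degree beam
res_deep = central_footprint_size(1000.0)     # about 17.5 m
```

Deeper survey areas thus receive proportionally coarser grids, so the generated image detail stays matched to the sounder's actual resolving power.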

2.4. Improvements to the YOLO11n-Seg Model

Based on the YOLO11n-seg model, this study introduces targeted improvements that mainly involve the design of a bimodal image early fusion strategy, the integration of the ECA mechanism, and the construction of a weighted loss function. The architecture of the improved model is illustrated in Figure 2.

2.4.1. Early Fusion of Bimodal Images

The baseline YOLO11n-seg model by default accepts single-source three-channel RGB images as input. However, single-modality images projected from multibeam point clouds suffer from inherent drawbacks. The per-image adaptive range mapping strategy can effectively highlight local relative topographic variations, yet it is susceptible to color drift caused by dynamic range stretching, resulting in inconsistent representations of the same terrain type across different image frames. Moreover, in flat areas, minor noise tends to be over-amplified into spurious topographic responses, leading to erroneous model judgments. In contrast, the global fixed-range mapping strategy features a unified global depth datum; however, its wide mapping range and low contrast may obscure pixel-level differences at topographic edges and subtle undulations, easily causing blurred segmentation boundaries. Accurate seafloor topography segmentation is difficult to achieve when either single-modality image is used independently. To this end, this study generates two types of 2D images via projection from the same set of multibeam point clouds. The first is a three-channel pseudocolor image based on per-image adaptive range mapping. Color mapping is performed by clipping the 2–98% quantile range of water depth in a single image to eliminate extreme depth outliers and optimize the visualization range, thereby maximizing the retention of valid topographic detail differences within limited color levels and enhancing local texture representation. The second is a single-channel grayscale image based on global fixed-range mapping, which preserves absolute water depth information and provides a unified depth datum.
On this basis, this study modifies the model architecture by adopting a bimodal early fusion strategy. The two image modalities are concatenated channel-wise at the input layer to construct a 4-channel fusion tensor as the model input. This enables the model to simultaneously utilize the prominent local texture details from pseudocolor images and the water depth information carried by grayscale images, thereby achieving information complementarity. Compared with intermediate fusion at the feature layer or late fusion at the decision layer, the early fusion strategy adopted in this study offers two distinct advantages. First, it requires no additional learnable parameters or computational modules, thus preserving the lightweight characteristic of the original model. Second, since both image modalities are derived from the same set of multibeam point clouds, effective information integration can be accomplished directly at the input layer without relying on deep network layers for feature alignment, allowing for a better trade-off between inference efficiency and segmentation performance in edge deployment scenarios.

2.4.2. ECA Mechanism

Although the bimodal early fusion strategy achieves complementarity between local texture and absolute water depth information, the characteristic differences between the two image types may cause inter-channel information imbalance. High-frequency texture features from pseudocolor images tend to dominate during gradient updates, which may lead the model to overly rely on local texture information while relatively suppressing the absolute water depth information carried by the grayscale channel. This weakens the model’s discriminative capability between flat areas and topographic regions, thereby degrading segmentation accuracy. To address this issue, this study introduces the Efficient Channel Attention (ECA) module. Its core innovation lies in modeling inter-channel dependencies without dimensionality reduction, enabling local cross-channel interaction through 1D convolutions. This approach not only avoids the information loss caused by dimensionality reduction in the SE module but also introduces only a small number of learnable parameters, making it more lightweight than complex attention modules such as CBAM. This aligns well with the objective of maintaining a lightweight model architecture in this study [11].
Considering the architectural characteristics of YOLO11n-seg, this study embeds the ECA module immediately after the SPPF module in the backbone network. The SPPF module aggregates features from multiple receptive fields via multi-scale pooling, producing feature maps enriched with contextual information. Introducing the ECA module at this stage enables adaptive recalibration of the channel weights of the fused cross-channel features, guiding the model to leverage bimodal information more evenly. Its core working principle comprises the following four steps:
Step 1. Channel-wise Global Average Pooling. This step serves as the foundational feature processing stage of the ECA module, aiming to aggregate the spatial information of each channel into a global representation. Global average pooling is performed independently on each channel of the input feature $X \in \mathbb{R}^{H \times W \times C}$, compressing the $H \times W$ two-dimensional spatial features of each channel into a scalar value representing the global information of that channel, thereby yielding a channel feature vector $\mathbf{y} \in \mathbb{R}^{C}$ of length $C$.
Step 2. Adaptive Calculation of 1D Convolution Kernel Size. After obtaining the channel feature vector $\mathbf{y}$, the ECA module adaptively determines the kernel size $k$ for the subsequent 1D convolution based on the channel dimension $C$ via a predefined non-linear mapping function. This mapping function is defined as:
$$k = \psi(C) = \left| \frac{\log_2 C}{\gamma} + \frac{b}{\gamma} \right|_{\mathrm{odd}}$$
where $|\cdot|_{\mathrm{odd}}$ denotes the nearest odd integer.
This mapping function is defined by the bias parameter b and the scaling coefficient γ, establishing a non-linear correspondence between channel dimension and kernel size. This ensures that the coverage of local cross-channel interactions aligns with the channel dimension: high-dimensional channels automatically receive larger k values to capture longer-range dependencies, while low-dimensional channels receive smaller k values to avoid redundant computation. The key advantage of this adaptive design is that it adapts to the varying complexity of inter-channel dependencies in feature maps at different depths of the CNN without requiring manual layer-wise hyperparameter tuning, significantly reducing computational cost.
Step 3. Cross-Channel Interaction and Attention Weight Generation. A 1D convolution with kernel size $k$ is performed on the channel feature vector $\mathbf{y}$ obtained in Step 1 to establish local dependencies between each channel and its $k$ nearest neighboring channels. Subsequently, the Sigmoid activation function normalizes the convolution output to the range (0, 1), generating an attention weight vector $\boldsymbol{\omega} \in \mathbb{R}^{C}$ that denotes the importance of each channel. This mechanism ensures that the attention weight of each channel is jointly determined by itself and its $k$ adjacent channels, thereby efficiently capturing local cross-channel interaction information.
Step 4. Feature Recalibration. The attention weight vector $\boldsymbol{\omega}$ is expanded and broadcast to match the spatial dimensions of the original input feature map. A channel-wise multiplication operation is then applied to reweight the feature map, outputting the recalibrated optimized feature map. This enhances the model’s segmentation robustness and accuracy in complex seafloor topography scenarios.
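The four steps above can be sketched as follows in NumPy; the uniform averaging kernel is a stand-in for the module's learned 1D convolution weights, and $\gamma = 2$, $b = 1$ follow the defaults commonly used with ECA:

```python
import numpy as np

def eca(x: np.ndarray, gamma: float = 2.0, b: float = 1.0) -> np.ndarray:
    """Efficient Channel Attention over a feature map x of shape (C, H, W)."""
    C = x.shape[0]

    # Step 1: channel-wise global average pooling -> vector of length C.
    y = x.mean(axis=(1, 2))

    # Step 2: adaptive kernel size k = |log2(C)/gamma + b/gamma|_odd.
    t = int(abs(np.log2(C) / gamma + b / gamma))
    k = t if t % 2 == 1 else t + 1

    # Step 3: local cross-channel interaction via 1D convolution + sigmoid.
    pad = k // 2
    y_pad = np.pad(y, pad, mode="edge")
    kernel = np.full(k, 1.0 / k)             # stand-in for learned weights
    conv = np.convolve(y_pad, kernel, mode="valid")
    w = 1.0 / (1.0 + np.exp(-conv))          # attention weights in (0, 1)

    # Step 4: broadcast weights over spatial dims and recalibrate.
    return x * w[:, None, None]
```

Note that the per-channel weight depends only on a channel and its $k$ neighbors, which is what keeps the parameter and compute overhead negligible.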

2.4.3. Loss Function Optimization

To address the potential class imbalance between foreground and background in the seafloor topography segmentation task and to enhance the model’s ability to segment the overall shape of topographic regions, this study constructs a BCE-Dice joint loss function. Through weighted combination, it achieves complementary advantages of pixel-level and region-level supervision, thereby optimizing segmentation performance.
The Binary Cross-Entropy (BCE) loss measures the discrepancy between predicted outputs and ground-truth labels pixel-wise via the log-likelihood function, providing stable gradient signals that facilitate rapid model convergence in the early training stages. It is defined as:
$$L_{\mathrm{BCE}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ g_i \log(p_i) + (1 - g_i) \log(1 - p_i) \right]$$
where $g_i \in \{0, 1\}$ denotes the ground-truth label of pixel $i$; in the seafloor topography segmentation task of this study, $g_i = 1$ indicates that the pixel belongs to a topographic region, and $g_i = 0$ indicates that it belongs to a flat seabed region. $p_i$ represents the predicted probability of pixel $i$ belonging to the topographic region. However, when the numbers of foreground and background pixels are severely imbalanced, the optimization process tends to be dominated by the background class, leading to missed detection of topographic regions. In contrast, the Dice loss function, based on the Dice similarity coefficient, optimizes from the perspective of regional overlap. Its expression is given by:
$$L_{\mathrm{Dice}} = 1 - \frac{2 \sum_{i=1}^{N} p_i g_i + \epsilon}{\sum_{i=1}^{N} p_i + \sum_{i=1}^{N} g_i + \epsilon}$$
where $g_i$ is the ground-truth label of pixel $i$, $p_i$ is the predicted probability of pixel $i$ belonging to the topographic region, and $\epsilon$ is a small smoothing constant used to prevent division-by-zero errors when the denominator approaches zero, ensuring numerical stability and stable gradient backpropagation during model training. The Dice loss does not rely on pixel-wise independent comparisons; instead, it focuses on the overall overlap between the predicted and ground-truth regions. As a result, it exhibits strong robustness to class imbalance and contributes to more complete segmentation of topographic regions. However, its training process may experience significant fluctuations in the early stages. To compensate for the limitations of each individual loss function, this study constructs a joint loss function via weighted summation, where the weight coefficients are determined through tuning on the validation set (a ratio of 2:3 achieves the optimal balance between topographic boundary accuracy and regional integrity). The formula is given below:
$$L = 0.4\, L_{\mathrm{BCE}} + 0.6\, L_{\mathrm{Dice}}$$
This combination enables the model to not only leverage the pixel-level accurate classification capability provided by the BCE loss during training, but also strengthen the ability to identify and segment the overall structure of topographic regions via the higher-weighted Dice loss, thereby achieving more robust segmentation performance in complex seafloor scenarios.
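The joint loss under the 0.4/0.6 weighting described above translates directly into code; the NumPy sketch below assumes predicted probabilities and binary masks of the same shape:

```python
import numpy as np

def bce_dice_loss(p: np.ndarray, g: np.ndarray,
                  w_bce: float = 0.4, w_dice: float = 0.6,
                  eps: float = 1e-6) -> float:
    """Weighted BCE-Dice joint loss over predicted probabilities p and
    binary ground-truth masks g (both flattened internally)."""
    p = np.clip(p.ravel(), 1e-7, 1.0 - 1e-7)   # avoid log(0)
    g = g.ravel().astype(float)

    # Pixel-wise binary cross-entropy (stable early-training gradients).
    bce = -np.mean(g * np.log(p) + (1.0 - g) * np.log(1.0 - p))

    # Region-overlap Dice loss (robust to foreground/background imbalance).
    dice = 1.0 - (2.0 * np.sum(p * g) + eps) / (np.sum(p) + np.sum(g) + eps)

    return w_bce * bce + w_dice * dice
```

A perfect prediction drives both terms toward zero, while a prediction that misses a small topographic region is penalized mainly through the Dice term, which is exactly the imbalance behavior the weighting is meant to exploit.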

3. Materials and Experiments

3.1. Experimental Data

The experimental dataset used in this study is derived from multi-voyage measured seafloor topography data, comprising various typical scenarios in shallow, medium, and deep waters, thereby ensuring good representativeness and feature diversity. Representative image examples of these scenarios are shown in Table 1.
Table 1. Representative seafloor topographic examples.
Scenario | Pseudocolor Image | Grayscale Image
Flat region | Jmse 14 00903 i001 | Jmse 14 00903 i002
Shallow-water target | Jmse 14 00903 i003 | Jmse 14 00903 i004
Medium-water target | Jmse 14 00903 i005 | Jmse 14 00903 i006
Deep-water target | Jmse 14 00903 i007 | Jmse 14 00903 i008
The original dataset consists of 341 pairs of bimodal images (pseudocolor and grayscale images in one-to-one correspondence), including 29 flat images without topographic features. To ensure training generalization and evaluation objectivity, the dataset is randomly split into training, validation, and test sets in an 8:1:1 ratio, with a balanced distribution of all topographic sample types maintained across the three subsets. To address the limited original sample size, targeted data augmentation is applied using operations such as flipping, translation, and color jittering (color jittering is applied only to pseudocolor images; grayscale images undergo no such augmentation). The data augmentation pipeline is illustrated in Figure 3. This augmentation yields a total of 3069 images, effectively enhancing the diversity of target features. Finally, manual annotation of topographic targets in the images is performed, resulting in a high-quality seafloor topography dataset.
Figure 3. Data augmentation methods.
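The 8:1:1 random split described above can be sketched as follows; the helper name and fixed seed are illustrative, not from the paper:

```python
import random

def split_dataset(pair_ids, seed=0, ratios=(0.8, 0.1, 0.1)):
    """Randomly split bimodal image-pair IDs into train/val/test subsets.

    Each ID refers to one pseudocolor/grayscale pair, so both modalities of
    a pair always land in the same subset.
    """
    ids = list(pair_ids)
    random.Random(seed).shuffle(ids)          # reproducible shuffle
    n = len(ids)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]
```

Splitting by pair ID before augmentation also prevents augmented copies of one original image from leaking across subsets.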

3.2. Experimental Setup and Evaluation Metrics

The experimental hardware configuration employed in this study was an NVIDIA GeForce RTX 4060 Laptop GPU, and the software environment consisted of Python 3.10, CUDA 11.8, cuDNN 8.6.0, and PyTorch 2.8.0. The training process was set to 300 epochs with an initial learning rate of 0.0005 and a batch size of 4. Early stopping was configured with a patience of 50 epochs (training is terminated if the validation loss does not decrease for 50 consecutive epochs).
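The early-stopping rule above can be sketched as a simple patience check on the validation-loss history (an illustrative helper, not the actual training code):

```python
def should_stop(val_losses, patience=50):
    """Return True once the validation loss has not improved for
    `patience` consecutive epochs."""
    if len(val_losses) <= patience:
        return False          # not enough history yet
    best_before = min(val_losses[:-patience])
    # Stop if none of the last `patience` epochs beat the earlier best.
    return min(val_losses[-patience:]) >= best_before
```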
To systematically evaluate the comprehensive performance of the improved model proposed in this study on the seafloor topography segmentation task, classic metrics in the field of object detection and segmentation were adopted, including: Precision, Recall, mean Average Precision (mAP), Parameters (Params), and GFLOPs. Their definitions are as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$\mathrm{mAP} = \frac{1}{N} \sum_{c=1}^{N} AP_c$$
TP (True Positives): number of positive samples correctly classified as positive.
TN (True Negatives): number of negative samples correctly classified as negative.
FP (False Positives): number of negative samples incorrectly classified as positive.
FN (False Negatives): number of positive samples incorrectly classified as negative.
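These definitions translate directly into code; the helpers below are illustrative:

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN), guarding empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def mean_ap(ap_per_class):
    """mAP: mean of the per-class average precision values."""
    return sum(ap_per_class) / len(ap_per_class)
```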

3.3. Experimental Design

To comprehensively evaluate the performance of the improved YOLO11n-seg model proposed in this study, three experiments are designed and conducted under identical parameter configurations to ensure fair and reliable comparisons.
(1)
Ablation Study: Using the original YOLO11n-seg model as the baseline, the bimodal early fusion strategy, the ECA mechanism, and the weighted BCE-Dice joint loss function are gradually incorporated. This experiment aims to verify the individual contribution and necessity of each improved module for the segmentation performance.
(2)
Comparative Experiments: Under identical experimental settings and on the same dataset, the proposed method is compared with several mainstream segmentation models to evaluate its segmentation performance on multibeam point cloud images.
(3)
Back-Projection Validation Experiments: Since some downstream applications (e.g., point cloud simplification and topographic modeling) are implemented in 3D space, back-projection validation experiments are designed. By recording the index mapping between point clouds and images, the segmentation masks predicted by the model are back-projected to the 3D point cloud space. Typical blocks are randomly selected to verify the integrity and accuracy of the topographic region segmentation.

4. Results and Discussion

4.1. Ablation Study Results and Analysis

Following the ablation study design described in Section 3.3, we conducted experiments to evaluate the individual contributions of the three proposed improvement modules. The results are presented in Table 2 and analyzed below.
The ablation results show that all three improvement strategies have a positive effect on segmentation performance and act in a progressive, complementary manner, jointly enhancing the precision and robustness of seafloor topography segmentation.
(1) Bimodal early fusion strategy. This strategy alleviates the incomplete information representation of single-modality images and reduces the false-positive rate. After introducing early fusion on the baseline model, all evaluation metrics improve significantly, with precision improving most markedly. This indicates that early fusion compensates for the limitations of single-modality images through the complementarity between the local texture details of the pseudocolor image and the global depth information of the grayscale image, enabling the model to distinguish real topography from false-positive responses on flat seabed more reliably, while providing a high-quality multi-channel input foundation for subsequent feature learning.
(2) ECA mechanism. This mechanism adaptively recalibrates channel weights and improves recall. After introducing the ECA attention mechanism on top of early fusion, recall and mean average precision (mAP) improve further, while precision decreases slightly. This reflects that the ECA module reinforces the contribution of the grayscale depth channel to the final segmentation decision through adaptive channel-weight recalibration, enhancing the model's ability to perceive and extract weak-texture topography and small targets. Although a small amount of noise introduces slightly more false detections, overall performance is clearly enhanced. Building on the multi-channel feature foundation established by bimodal fusion, the ECA module constructs cross-modal dependencies within the fused four-channel feature tensor, achieving a dynamic balance between texture and depth information and avoiding the suboptimal state in which the modalities are complementary but unevenly exploited.
(3) Loss function optimization. This optimization alleviates class imbalance and mitigates fragmented segmentation boundaries and incomplete regions. After introducing the optimized loss function, segmentation precision and boundary quality improve significantly, indicating that the loss enhances the discriminability of topographic region boundaries on the fused features. Specifically, the BCE loss provides stable pixel-level gradients, while the Dice loss suppresses the dominance of the flat seabed background in the loss computation through a regional overlap constraint; together, they balance pixel-wise precision and regional integrity.
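The joint loss described in item (3) can be sketched as follows. This is a minimal PyTorch illustration assuming a binary terrain-vs-flat mask; the weighting scheme (`bce_weight`, `pos_weight`) and the function name are illustrative placeholders, not the paper's exact coefficients:

```python
import torch
import torch.nn.functional as F

def weighted_bce_dice_loss(logits, target, bce_weight=0.5, pos_weight=None, eps=1e-6):
    """Sketch of a weighted BCE + Dice joint segmentation loss.
    logits/target: tensors of shape (B, 1, H, W); target in {0, 1}."""
    # Pixel-wise BCE gives stable gradients; pos_weight can up-weight
    # the minority terrain class to counter class imbalance.
    bce = F.binary_cross_entropy_with_logits(logits, target, pos_weight=pos_weight)
    # Dice loss constrains regional overlap, suppressing the dominance
    # of the large flat-seabed background in the loss.
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum()
    dice = 1 - (2 * inter + eps) / (prob.sum() + target.sum() + eps)
    return bce_weight * bce + (1 - bce_weight) * dice
```

A near-perfect prediction drives both terms toward zero, while a confidently wrong prediction is penalized by both the pixel-wise and the regional term, which is the intended balance between pixel accuracy and mask integrity.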
Finally, after integrating all three optimizations, all metrics except R_B reach near-optimal levels, with absolute improvements of 9.1, 6.9, 8.9, 7.6, 5.8, and 7.6 percentage points in P_B, R_B, mAP50_B, P_M, R_M, and mAP50_M, respectively, over the baseline model, demonstrating the best overall segmentation performance. The three improvements form a clear complementary relationship: the bimodal early fusion strategy integrates texture and depth information at the input layer, providing subsequent modules with a complete feature foundation combining local detail and a global depth datum; the ECA module then balances the contributions of the different channels on this basis, enhancing detection of weak-texture and small targets; and the BCE-Dice loss converts the boundary information and detected targets supplied by the former two into complete, smooth segmentation masks through its regional overlap constraint, ultimately yielding the overall performance improvement.
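The input-level fusion that anchors this chain amounts to a simple channel concatenation of the two images generated from the same point cloud block. Tensor shapes and the function name below are illustrative; note that the network's first convolution must also be adapted to accept four input channels:

```python
import torch

def fuse_bimodal(pseudocolor: torch.Tensor, grayscale: torch.Tensor) -> torch.Tensor:
    """Early (input-level) fusion sketch: concatenate a 3-channel
    pseudocolor relief image with a 1-channel absolute-depth grayscale
    image into a single 4-channel tensor."""
    assert pseudocolor.shape[1] == 3 and grayscale.shape[1] == 1
    # Channels: [R, G, B, depth]; spatial alignment is assumed, since
    # both images are rendered from the same interpolated grid.
    return torch.cat([pseudocolor, grayscale], dim=1)  # (B, 4, H, W)
```

Because fusion happens before any convolution, all downstream layers see texture and absolute depth jointly, which is what allows the attention and loss modules to exploit their complementarity.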
To verify the statistical significance of the performance improvement, we conducted repeated training and testing for both the baseline and the proposed full model using five different random seeds under a fixed dataset split. On the test set, the baseline model achieved an mAP50_M of 0.855 ± 0.012 (mean ± standard deviation), whereas the proposed method achieved 0.926 ± 0.011. A paired t-test yielded p < 0.05, indicating a statistically significant difference and confirming that the performance gain is stable and reliable.
The segmentation comparison in Figure 4 provides further intuitive visual validation of the effectiveness of the proposed improvements. (a) The baseline model produces extensive false-positive errors in flat seabed regions (misclassifying flat areas lacking significant topographic relief as terrain; these areas exhibit no discernible topographic variation and are representative of typical flat seabed). By contrast, the proposed method effectively suppresses spurious responses in flat regions by introducing absolute water depth information through bimodal fusion. (b) The baseline model exhibits obvious missed detections of small isolated topographic units, whereas the proposed method effectively recovers these missed targets by leveraging the ECA mechanism to reinforce depth-channel responses. (c) The baseline model suffers from boundary fragmentation and localization deviations, with edges of topographic units appearing serrated and discontinuous; the proposed method achieves tighter, smoother, and more precise edge fitting through loss-function optimization. In summary, through the synergistic effect of the three improvements, the proposed method significantly enhances the accuracy of seafloor topography discrimination and the robustness of segmentation.
In terms of parameter scale, the improvement strategies increase the model's parameters and computational cost only slightly: the full model requires 9.8 GFLOPs and 2.84 M parameters, preserving its lightweight character. Regarding computational efficiency, for a typical block of 300,000 points, bimodal image generation (which can be performed offline) takes approximately 47.8 s on average, while YOLO inference takes about 0.1 s. Overall, the improved YOLO11n-seg model achieves a marked gain in segmentation accuracy while maintaining low computational complexity; combined with the offline image-generation pipeline, the proposed scheme is both feasible and practical.
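Consistent with this negligible overhead, the ECA module embedded after SPPF reduces to global average pooling followed by a very small 1D convolution, adding only a handful of parameters. A minimal PyTorch sketch is given below; the adaptive kernel-size rule (gamma = 2, b = 1) follows the original ECA-Net defaults, since this paper does not state its hyperparameters:

```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention sketch (Wang et al., CVPR 2020)."""
    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        # Kernel size adapted to the channel count, forced odd.
        t = int(abs(math.log2(channels) / gamma + b / gamma))
        k = t if t % 2 else t + 1
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) -> per-channel descriptor (B, C, 1, 1)
        y = self.pool(x)
        # Local cross-channel interaction via a 1D convolution over channels.
        y = self.conv(y.squeeze(-1).transpose(-1, -2)).transpose(-1, -2).unsqueeze(-1)
        # Recalibrate channel responses with the learned weights.
        return x * self.sigmoid(y)
```

Because the 1D convolution operates over the channel descriptor rather than a fully connected bottleneck, the parameter count stays in the single digits, which matches the small parameter deltas observed in the ablation study.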

4.2. Comparison with Existing Segmentation Models

To comprehensively evaluate the segmentation performance of the improved model proposed in this study, five representative segmentation models—YOLOv5n-seg, YOLOv8n-seg, Mask R-CNN [22], CondInst [23], and Mask2Former [24]—were selected for comparison under identical experimental settings and training conditions. The results are presented in Table 3.
As can be observed from the comparison results, the proposed method significantly outperforms all compared models in segmentation accuracy (mAP50_M). In the longitudinal comparison within the YOLO family, the three lightweight generations exhibit a clear progressive accuracy improvement: YOLOv5n-seg and YOLOv8n-seg achieve mAP50_M of 0.823 and 0.834, respectively, while the YOLO11n-seg baseline reaches 0.852. The proposed method further improves upon this baseline to 0.928, confirming both the merit of selecting YOLO11n-seg as the baseline and the effectiveness of the proposed improvement strategies. In the cross-architecture comparison, the two-stage model Mask R-CNN (0.876) and the single-stage instance segmentation model CondInst (0.874) achieve comparable accuracy. The Transformer-based Mask2Former attains the highest accuracy among the compared models at 0.891, yet still trails the proposed method by a noticeable margin of 3.7 percentage points. Furthermore, the proposed method has only 2.84 M parameters, far fewer than Mask R-CNN (43.97 M), CondInst (33.98 M), and Mask2Former (44.00 M). In summary, the proposed method surpasses all compared mainstream models in accuracy while maintaining an extremely low parameter count, demonstrating a favorable accuracy–efficiency trade-off and validating its effectiveness and applicability for multibeam point cloud image segmentation.

4.3. Back-Projection of Segmentation Results

During the bimodal image generation stage, the index mapping between the original 3D point cloud and image pixels is recorded synchronously. After the improved YOLO11n-seg model completes inference and outputs instance segmentation masks, each pixel label in the masks is back-projected onto the corresponding 3D points according to this mapping, yielding point cloud segmentation results with semantic labels. Given that seafloor topographic edges often exhibit gradual transitions, strict back-projection of binary masks may misclassify boundary points and thereby discard critical topographic transition information in downstream applications. To address this, this study applies a 1-pixel dilation to the segmentation results prior to back-projection, conservatively assigning edge regions to the topographic class to ensure the integrity of topographic segmentation and the reliability of engineering applications.
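A minimal sketch of this back-projection step is given below. It assumes the index mapping is stored as per-point (row, col) pixel coordinates; the array names and data layout are illustrative assumptions, not the paper's implementation:

```python
import numpy as np
from scipy.ndimage import binary_dilation

def backproject_mask(mask: np.ndarray, pixel_index: np.ndarray) -> np.ndarray:
    """mask: binary (H, W) segmentation mask from the image domain.
    pixel_index: (N, 2) array giving each point's (row, col) pixel,
    recorded during bimodal image generation. Returns per-point labels."""
    # 1-pixel dilation conservatively assigns gradual topographic edges
    # to the terrain class before back-projection.
    dilated = binary_dilation(mask, iterations=1)
    # Look up each point's label via the recorded index mapping.
    return dilated[pixel_index[:, 0], pixel_index[:, 1]]
```

Because the mapping is recorded at generation time, the back-projection itself is a constant-time lookup per point, consistent with the low inference-stage cost reported above.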
In this study, twenty typical blocks are randomly selected from the test set (covering various scenarios including flat regions and shallow-, medium-, and deep-water topography) for visual validation. Figure 5 presents the back-projection results of four representative blocks. Figure 5a,b illustrate composite scenarios containing both terrain and flat regions: in Figure 5a, strip-like topography is distributed along the block edge; in Figure 5b, a large continuous seamount landform occupies the center of the block, dividing the flat seabed into several isolated regions. The segmentation results indicate that, after the pixel dilation operation, the model accurately identifies topographic boundaries and clearly captures the transition between terrain and flat regions. Figure 5c shows a block dominated entirely by terrain with no flat regions and complex overall elevation variation, validating the model's reliable extraction capability in pure terrain scenarios. Figure 5d depicts a deep-sea plain with an average water depth of approximately 4600 m and no significant topographic relief; the segmentation results exhibit no false topographic detections, confirming the model's reliability in flat region identification.
The above results demonstrate that after the image-domain segmentation results are back-projected to the point cloud space via index mapping, the integrity and boundary accuracy of terrain regions are well maintained, with clear distinction between flat and terrain regions and no systematic deviations.

5. Conclusions

This study proposes a seafloor topography recognition and segmentation method based on YOLO11n-seg with bimodal image feature fusion, addressing both image generation and model optimization to improve the segmentation accuracy and robustness of multibeam seafloor topography images. Through continuous curvature tension spline interpolation, multibeam point cloud data are projected into bimodal 2D images. On the basis of the YOLO11n-seg baseline model, three targeted improvements are introduced: the early fusion strategy, the ECA channel attention mechanism, and the BCE-Dice joint loss function. These measures effectively alleviate color drift and the incomplete information representation of single-modality images, significantly enhancing the model's ability to perceive and discriminate topographic features. Experimental results show that the proposed method achieves an mAP@50 of 92.8%, a precision of 94%, and a recall of 79.5% on the self-constructed dataset, representing absolute improvements of 7.6, 7.6, and 5.8 percentage points over the YOLO11n-seg baseline model, while preserving the baseline's lightweight property. This demonstrates promising application potential for real-time processing in AUVs and other resource-constrained scenarios. In terms of engineering scalability and practical value, the method can be adapted to diverse marine survey tasks, including AUV/ROV underwater positioning and navigation with adaptive path planning, differentiated simplification of multibeam seafloor topographic point clouds, seafloor geomorphological feature extraction and semantic labeling, and conventional marine surveying engineering. The method has so far been validated offline on measured data; subsequent work will advance its deployment on operational platforms to validate real-time processing performance and reliability in real working environments.
Future work will focus on introducing measured data from multiple sources and sea areas to strengthen the model’s generalization ability, and exploring the extension from topography-flat binary segmentation to fine recognition and segmentation of multi-class topographic and geomorphic features. Meanwhile, adversarial training strategies will be considered to enhance the model’s robustness to potential environmental noise and abnormal disturbances, further improving its anti-interference capability in complex marine environments.

Author Contributions

Conceptualization, Y.C. and S.J.; methodology, D.L.; software, D.L. and Y.L.; writing—original draft preparation, D.L. and N.C.; writing—review and editing, D.L. and Y.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Raw geo-referenced bathymetric data cannot be shared publicly.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhao, J.H.; Liu, J.N. Multi-Beam Sounding and Image Data Processing; Wuhan University Press: Wuhan, China, 2008. [Google Scholar]
  2. Sun, H.P.; Li, Q.Q.; Bao, L.F.; Wu, Z.; Wu, L. Progress and Development Trend of Global Refined Seafloor Topography Modeling. Geomat. Inf. Sci. Wuhan Univ. 2022, 47, 1555–1567. [Google Scholar]
  3. Wilson, M.F.J.; O’Connell, B.; Brown, C.; Guinan, J.C.; Grehan, A.J. Multiscale Terrain Analysis of Multibeam Bathymetry Data for Habitat Mapping on the Continental Slope. Mar. Geod. 2007, 30, 3–35. [Google Scholar] [CrossRef]
  4. Lecours, V.; Dolan, M.F.J.; Micallef, A.; Lucieer, V.L. A review of marine geomorphometry, the quantitative study of the seafloor. Hydrol. Earth Syst. Sci. 2016, 20, 3207–3244. [Google Scholar] [CrossRef]
  5. Harris, P.T.; Macmillan-Lawler, M.; Rupp, J.; Baker, E.K. Geomorphology of the Oceans. Mar. Geol. 2014, 352, 4–24. [Google Scholar] [CrossRef]
  6. Masetti, G.; Mayer, L.A.; Ward, L.G. A Bathymetry- and Reflectivity-Based Approach for Seafloor Segmentation. Geosciences 2018, 8, 14. [Google Scholar] [CrossRef]
  7. Giannakopoulos, V.; Feldens, P.; Fakiris, E. Semi-Automated Mapping of Pockmarks from MBES Data Using Geomorphometry and Machine Learning-Driven Optimization. Remote Sens. 2025, 17, 2917. [Google Scholar] [CrossRef]
  8. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 77–85. [Google Scholar]
  9. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  10. Hu, Q.Y.; Yang, B.; Xie, L.H.; Rosa, S.; Guo, Y.L.; Wang, Z.H.; Trigoni, N.; Markham, A. RandLA-Net: Efficient Semantic Segmentation of Large-Scale Point Clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 11108–11117. [Google Scholar]
  11. Wang, Q.L.; Wu, B.G.; Zhu, P.F.; Li, P.H.; Zuo, W.M.; Hu, Q.H. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 11531–11539. [Google Scholar]
  12. Milletari, F.; Navab, N.; Ahmadi, S.-A. V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision, Stanford, CA, USA, 25–28 October 2016; pp. 565–571. [Google Scholar]
  13. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 779–788. [Google Scholar]
  14. Peng, C.; Jin, S.; Liu, H.; Zhang, W.; Xia, H. Adversarial enhancement generation method for side-scan sonar images based on DDPM–YOLO. Mar. Geod. 2024, 47, 526–554. [Google Scholar] [CrossRef]
  15. Bakirci, M.; Bayraktar, I. Assessment of YOLO11 for Ship Detection in SAR Imagery under Open Ocean and Coastal Challenges. In Proceedings of the 2024 21st International Conference on Electrical Engineering, Computing Science and Automatic Control (CCE), Mexico City, Mexico, 23–25 October 2024; pp. 1–6. [Google Scholar] [CrossRef]
  16. Liu, J.; Sun, W. YOLOX-based ship target detection for Shore-based monitoring. In Proceedings of the 2022 5th International Conference on Signal Processing and Machine Learning (SPML), Dalian, China, 4–6 August 2022; pp. 234–241. [Google Scholar] [CrossRef]
  17. Sapkota, R.; Flores-Calero, M.; Qureshi, R.; Badgujar, C.; Nepal, U.; Poulose, A.; Zeno, P.; Vaddevolu, U.B.P.; Khan, S.; Shoman, M.; et al. YOLO advances to its genesis: A decadal and comprehensive review of the You Only Look Once (YOLO) series. Artif. Intell. Rev. 2025, 58, 274. [Google Scholar] [CrossRef]
  18. Jin, S.H.; Bian, G. Modern Hydrographic Survey Technology; National Defense Industry Press: Beijing, China, 2025. [Google Scholar]
  19. Chen, Y.L.; Tang, Q.H.; Liu, X.Y.; Wang, Y.H. Construction of Offshore Digital Bathymetric Model Based on Multi-source Bathymetric Data Fusion. Adv. Mar. Sci. 2021, 39, 461–469. [Google Scholar]
  20. Fan, M.; Sun, Y.; Xing, Z.; Wang, Y.T.; Li, S.H.; Jin, J.Y. Bathymetry Fusion Techniques for High-Resolution Digital Bathymetric Modeling. Haiyang Xuebao 2017, 39, 130–137. [Google Scholar]
  21. Fang, Y. Research on Parameter Selection of Continuous Curvature Splines in Tension. Geospat. Inf. 2010, 8, 17–19. [Google Scholar]
  22. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  23. Tian, Z.; Shen, C.; Chen, H. Conditional Convolutions for Instance Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020. [Google Scholar]
  24. Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention Mask Transformer for Universal Image Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022. [Google Scholar]
Figure 1. Flow diagram of the seafloor topography recognition and segmentation method based on the YOLO11n-seg model with bimodal feature fusion.
Figure 2. Architecture of the Improved YOLO11n-seg Model.
Figure 4. Comparison of segmentation results. (a1–c1) Segmentation results of the baseline model; (a2–c2) segmentation results of the proposed method.
Figure 5. 3D back-projection segmentation results of four representative seafloor topographic blocks. Gray regions indicate flat seabed, and colored regions indicate topographic regions. (a,b) are composite scenarios where terrain and flat regions coexist; (c) is a region dominated entirely by terrain; (d) is a flat scenario with no significant topographic relief.
Table 2. Ablation Study Results.

| | Baseline | +Early Fusion | +Early Fusion, ECA | +Early Fusion, BCE-Dice | Full Model |
|---|---|---|---|---|---|
| Early Fusion | – | ✓ | ✓ | ✓ | ✓ |
| ECA | – | – | ✓ | – | ✓ |
| BCE-Dice | – | – | – | ✓ | ✓ |
| P_B | 0.84 | 0.903 | 0.887 | 0.917 | 0.931 |
| R_B | 0.719 | 0.733 | 0.799 | 0.751 | 0.788 |
| mAP50_B | 0.828 | 0.868 | 0.902 | 0.892 | 0.917 |
| P_M | 0.864 | 0.889 | 0.869 | 0.912 | 0.94 |
| R_M | 0.737 | 0.752 | 0.784 | 0.760 | 0.795 |
| mAP50_M | 0.852 | 0.885 | 0.893 | 0.899 | 0.928 |
| Params | 2,842,803 | 2,842,947 | 2,842,955 | 2,842,947 | 2,842,955 |
| GFLOPs | 9.7 | 9.8 | 9.8 | 9.8 | 9.8 |

Note: P_B = precision of bounding-box detection; R_B = recall of bounding-box detection; mAP50_B = mean average precision of bounding-box detection at IoU threshold 0.5; P_M = precision of mask segmentation; R_M = recall of mask segmentation; mAP50_M = mean average precision of mask segmentation at IoU threshold 0.5; Params = total number of trainable parameters; GFLOPs = giga floating-point operations.
Table 3. Comparison of segmentation performance.

| Model | mAP50_M | Params (M) |
|---|---|---|
| YOLOv5n-seg | 0.823 | 2.0 |
| YOLOv8n-seg | 0.834 | 3.26 |
| Mask R-CNN | 0.876 | 43.97 |
| CondInst | 0.874 | 33.98 |
| Mask2Former | 0.891 | 44.00 |
| Ours | 0.928 | 2.84 |

Share and Cite

MDPI and ACS Style

Liang, D.; Cui, Y.; Jin, S.; Liang, Y.; Chen, N. A Method for Seafloor Topography Recognition and Segmentation Based on Bimodal Image Feature Fusion with YOLO11 Model. J. Mar. Sci. Eng. 2026, 14, 903. https://doi.org/10.3390/jmse14100903


