3.1. Overall Architecture of DSM-Seg
The overall architecture of DSM-Seg is illustrated in
Figure 2. It is designed to address the challenges of semantic segmentation in FLS images caused by strong speckle noise, blurred boundaries, and SNR attenuation in far-field regions. The proposed framework consists of three functional modules, responsible for local feature modeling, semantic guidance generation, and global semantic fusion, respectively. First, the CNN-based feature extraction module employs a densely connected structure to extract multi-scale local feature maps, capturing clear texture details and boundary information that serve as the foundational representations for subsequent semantic integration. Next, the PSGM operates directly on shallow local feature maps without introducing an additional image branch. This module integrates edge gradients, dynamic convolutional responses, and sonar-specific range attenuation priors to generate semantic guidance maps that exhibit physical consistency and regional structural saliency. These maps highlight potential target regions while suppressing redundant noise, thereby providing structurally reliable prior cues to support the subsequent global fusion process. Finally, the global feature fusion module incorporates the RGFSC mechanism. Built on the sequential modeling capacity of the RWKV architecture, RGFSC introduces an explicit semantic constraint control flow. Specifically, the SAAM component enhances feature consistency among semantically similar regions, while the SPHF component suppresses irrelevant responses in non-target areas. These two mechanisms work in tandem to improve the semantic focus of long-range dependency modeling, enabling robust and efficient global feature fusion.
In summary, DSM-Seg achieves a tightly coupled integration of local semantic extraction, physically guided semantic priors, and semantically constrained global fusion. This design is particularly well-suited for FLS image segmentation tasks in real-world, complex underwater environments.
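To make this composition concrete, the following minimal PyTorch-style sketch shows how the three stages could be wired together in a forward pass; the class and argument names (LocalFeatureExtractor, PSGM, RGFSC, seg_head) are illustrative placeholders rather than the authors' implementation.

```python
import torch.nn as nn

class DSMSegSketch(nn.Module):
    """Illustrative composition of the three DSM-Seg stages.
    The submodules passed in are hypothetical stand-ins for the
    components described in Sections 3.2-3.4."""
    def __init__(self, encoder, psgm, rgfsc, seg_head):
        super().__init__()
        self.encoder = encoder      # CNN-based local feature extraction (Sec. 3.2)
        self.psgm = psgm            # physical prior-based semantic guidance (Sec. 3.3)
        self.rgfsc = rgfsc          # RWKV-based global fusion with constraints (Sec. 3.4)
        self.seg_head = seg_head    # per-pixel classifier

    def forward(self, sonar_image):
        local_feat = self.encoder(sonar_image)       # multi-scale local features
        prior_mask = self.psgm(local_feat)           # semantic prior map
        fused = self.rgfsc(local_feat, prior_mask)   # semantically constrained global fusion
        return self.seg_head(fused)                  # segmentation logits
```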
3.2. CNN-Based Local Semantic Feature Extraction
Given the prevalence of noise, low contrast, and complex textures in FLS images, DSM-Seg first requires the extraction of stable local features with a strong semantic representation capability to support subsequent segmentation. To this end, a CNN-based cascaded encoder–decoder structure is adopted as the initial feature extraction unit in DSM-Seg to obtain local semantic features while suppressing noise during downsampling. The encoder is constructed by stacking dense blocks and applying cross-layer feature fusion to enhance the representation of local structural information, thereby improving feature quality under noisy conditions. The decoder then progressively restores spatial resolution to produce a semantically rich and structurally informative local feature map.
To ensure that the extracted local features contain sufficient semantic information, the input sonar image is first processed by an L-layer encoder to extract multi-scale features and compress spatial dimensions, enabling higher-level semantic abstraction. The encoding process at layer i is defined in Equation (1), in which the i-th layer applies its convolution and dense connection operations, σ(⋅) is the activation function, and DownSample(⋅) performs spatial downsampling to highlight semantic features. The top-level encoded feature is denoted as ZL.
To progressively recover spatial resolution while preserving high-level semantics, a symmetric L-layer decoder is designed to upsample the feature maps layer by layer starting from ZL, as shown in Equation (2). Here, each decoder layer applies a transposed convolution, and UpSample(⋅) increases spatial resolution. Each decoder layer also integrates the corresponding encoder features through skip connections to enhance edge and detail representation.
Finally, the output of the top decoder layer is used as the local feature map, as defined in Equation (3). This feature map has a spatial resolution close to that of the original image and preserves the high-level semantic representation and noise suppression capabilities learned by the encoder, thus providing a high-quality input for the subsequent semantic guidance and global modeling with semantic constraints.
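A compact sketch of the encoder–decoder described above is given below, assuming simple two-convolution dense blocks, stride-2 downsampling, transposed-convolution upsampling, and skip connections; the number of levels, channel widths, and growth rate are illustrative choices, not the paper's configuration.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Two convolutions whose outputs are concatenated with the input (dense connectivity)."""
    def __init__(self, in_ch, growth=32):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(in_ch, growth, 3, padding=1), nn.ReLU(inplace=True))
        self.conv2 = nn.Sequential(nn.Conv2d(in_ch + growth, growth, 3, padding=1), nn.ReLU(inplace=True))
        self.out_ch = in_ch + 2 * growth

    def forward(self, x):
        y1 = self.conv1(x)
        y2 = self.conv2(torch.cat([x, y1], dim=1))
        return torch.cat([x, y1, y2], dim=1)


class LocalFeatureExtractor(nn.Module):
    """L-level encoder-decoder (L = 3 here) returning a local feature map at
    roughly the input resolution, in the spirit of Equations (1)-(3).
    Input height/width are assumed divisible by 2**levels."""
    def __init__(self, in_ch=1, levels=3):
        super().__init__()
        self.blocks, self.downs, enc_chs = nn.ModuleList(), nn.ModuleList(), []
        ch = in_ch
        for _ in range(levels):
            block = DenseBlock(ch)
            self.blocks.append(block)
            enc_chs.append(block.out_ch)
            self.downs.append(nn.Conv2d(block.out_ch, block.out_ch, 3, stride=2, padding=1))
            ch = block.out_ch
        self.ups, self.fuses = nn.ModuleList(), nn.ModuleList()
        for i in reversed(range(levels)):
            self.ups.append(nn.ConvTranspose2d(ch, enc_chs[i], 2, stride=2))   # UpSample(.)
            self.fuses.append(nn.Sequential(                                   # fuse skip features
                nn.Conv2d(2 * enc_chs[i], enc_chs[i], 3, padding=1), nn.ReLU(inplace=True)))
            ch = enc_chs[i]

    def forward(self, x):
        skips = []
        for block, down in zip(self.blocks, self.downs):
            x = block(x)                 # dense connections, Eq. (1)
            skips.append(x)
            x = down(x)                  # DownSample(.)
        for up, fuse, skip in zip(self.ups, self.fuses, reversed(skips)):
            x = up(x)                                   # transposed convolution, Eq. (2)
            x = fuse(torch.cat([x, skip], dim=1))       # skip connection fusion
        return x                                        # local feature map, Eq. (3)
```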
3.3. Physical Prior-Based Semantic Guidance Module (PSGM)
Accurate semantic segmentation of sonar imagery remains challenging because structural cues are limited and purely data-driven models struggle to capture physical degradation patterns. Owing to the imaging characteristics of forward-looking sonar, FLS images commonly suffer from severe speckle noise and range-dependent, non-uniform SNR attenuation, which blur target boundaries and produce ambiguous semantic regions, making it difficult to accurately delineate targets using data-driven deep feature learning alone. To address these limitations, the Physical Prior-Based Semantic Guidance Module is designed and integrated into DSM-Seg. This module combines Structure-Aware Context Aggregation (SACA) with domain-specific physical priors that are inherent to sonar imaging, thereby significantly enhancing the model’s ability to focus on critical structural regions and resist noise interference. The PSGM provides stable and effective semantic guidance for subsequent global feature fusion. A schematic illustration of the module is shown in
Figure 3.
First, the PSGM applies the Sobel operator to the local feature map, which is obtained from the CNN feature extraction module, in order to extract initial edge features. The computation is defined in Equation (4), in which Kx and Ky denote the Sobel convolution kernels in the x and y directions, respectively, and the convolution is performed over the local neighborhood centered at pixel coordinate (i, j). A scaling factor, derived from the speckle noise intensity and the global SNR of the image, is applied to the gradient magnitude to maintain sufficient contrast in regions affected by strong speckle noise.
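A minimal sketch of the Sobel-based edge extraction with a noise-aware scaling factor in the spirit of Equation (4) is shown below; the specific way the factor is derived from the estimated noise level and global SNR is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def sobel_edge_map(feat, eps=1e-6):
    """Per-channel Sobel gradient magnitude with a noise/SNR-dependent scaling.
    feat: (B, C, H, W) local feature map from the CNN extractor."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]], device=feat.device)
    ky = kx.t()
    c = feat.shape[1]
    kx = kx.view(1, 1, 3, 3).repeat(c, 1, 1, 1)
    ky = ky.view(1, 1, 3, 3).repeat(c, 1, 1, 1)
    gx = F.conv2d(feat, kx, padding=1, groups=c)   # horizontal gradient (Kx)
    gy = F.conv2d(feat, ky, padding=1, groups=c)   # vertical gradient (Ky)
    grad = torch.sqrt(gx ** 2 + gy ** 2 + eps)

    # Assumed scaling: larger when the global SNR is low, so edge contrast
    # is preserved in strongly speckled images.
    signal = feat.mean(dim=(1, 2, 3), keepdim=True)
    noise = feat.std(dim=(1, 2, 3), keepdim=True) + eps
    scale = 1.0 + torch.clamp(noise / (signal.abs() + eps), 0.0, 1.0)
    return scale * grad
```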
Subsequently, SACA is employed to more effectively exploit the structural physical priors inherent in sonar imagery. Specifically, SACA focuses on capturing the strongly continuous geometric structures that are commonly present in FLS images. This is achieved by applying a learnable convolutional kernel to extract feature responses, which are then combined with a bias term to produce a saliency-aware attention mask, as defined in Equation (5). In this process, the parameters and receptive field of the kernel are learned adaptively during training, enabling the model to enhance coherent structural responses while suppressing random speckle noise and isolated outlier points.
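The following sketch illustrates one way the learnable kernel, bias, and saliency mask of Equation (5) could be realized; the depthwise 7×7 kernel and the sigmoid activation are assumptions.

```python
import torch.nn as nn

class SACA(nn.Module):
    """Structure-Aware Context Aggregation sketch: a learnable kernel plus bias
    yields a saliency-aware attention mask (cf. Equation (5)). The relatively
    large depthwise kernel is assumed to favour continuous structures."""
    def __init__(self, channels, kernel_size=7):
        super().__init__()
        self.kernel = nn.Conv2d(channels, channels, kernel_size,
                                padding=kernel_size // 2, groups=channels, bias=True)
        self.act = nn.Sigmoid()

    def forward(self, edge_feat):
        return self.act(self.kernel(edge_feat))   # saliency mask in [0, 1]
```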
While this saliency mask effectively highlights structural continuity, it lacks sensitivity to range-based signal degradation. To account for the range-dependent attenuation characteristics of FLS imaging, a distance-adaptive SNR modulation strategy is therefore employed to suppress feature responses in far-field regions, thereby enhancing the saliency of near-field structures. The resulting modulation map is defined in Equation (6). In this equation, γ is a distance attenuation coefficient, which is estimated from calibration data by fitting the measured near- and far-range SNR to an exponential decay model and subsequently fine-tuned on the validation set to ensure both physical consistency and optimal segmentation performance, and the normalized distance from each pixel to the center of the sonar array models the degradation of signal intensity with increasing distance. The responses of distant pixels are attenuated accordingly, which effectively reduces interference from far-field noise while preserving clear structural features in near-range areas.
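A small sketch of the distance-adaptive modulation in the spirit of Equation (6) follows; the assumption that range grows with the image row index and that attenuation follows exp(−γ·d) is made purely for illustration.

```python
import torch

def range_modulation(saliency, gamma=2.0):
    """Distance-adaptive SNR modulation (cf. Equation (6)).
    Attenuates responses with the normalized range from the sonar array,
    here assumed to increase with the row index and modelled as an
    exponential decay exp(-gamma * d); both choices are illustrative."""
    h = saliency.shape[2]
    # normalized distance of each row from the array (row 0 = nearest range)
    d = torch.linspace(0.0, 1.0, h, device=saliency.device).view(1, 1, h, 1)
    return saliency * torch.exp(-gamma * d)
```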
Subsequently, the semantic guidance map is further refined through adaptive threshold-based morphological aggregation, resulting in a more stable guidance mask. Specifically, an adaptive threshold is computed to determine whether the gradient of each pixel exceeds its local value. A morphological maximization operation is then performed within a structural neighborhood Ω, aggregating adjacent high-gradient regions into the current pixel; this process smooths broken edges and removes isolated noise points. As defined in Equation (7), the Sobel gradient magnitude is compared against the adaptive threshold τ(⋅); the output is 1 if the gradient exceeds the threshold and 0 otherwise. The max operator performs morphological aggregation by merging all qualifying neighboring responses into the center pixel (i, j), thereby eliminating fragmentation and isolated noise in elongated terrain structures.
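The adaptive thresholding and morphological aggregation of Equation (7) could be approximated as follows; the local mean-plus-k·std threshold and the 3×3 neighborhood Ω are assumed instantiations.

```python
import torch
import torch.nn.functional as F

def morph_aggregate(grad_mag, window=7, k=1.0):
    """Adaptive thresholding followed by morphological max aggregation (cf. Equation (7)).
    The local threshold tau = local mean + k * local std is an assumed form; the
    max-pool merges neighbouring high-gradient responses into the center pixel."""
    mean = F.avg_pool2d(grad_mag, window, stride=1, padding=window // 2)
    sq_mean = F.avg_pool2d(grad_mag ** 2, window, stride=1, padding=window // 2)
    std = torch.sqrt(torch.clamp(sq_mean - mean ** 2, min=0.0))
    tau = mean + k * std
    binary = (grad_mag > tau).float()                 # 1 if gradient exceeds threshold
    # morphological max over the structural neighbourhood Omega (3x3 here)
    return F.max_pool2d(binary, kernel_size=3, stride=1, padding=1)
```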
Finally, to further improve the accuracy of the coarse guidance mask, a compensatory filtering operation Ψ(⋅) is introduced for refinement. The output is then binarized using a threshold β to obtain the final semantic prior map, as defined in Equation (8). A value of 1 indicates that the pixel (i, j) has a high probability of belonging to a target seabed terrain region under varying SNR conditions. As illustrated in Figure 3, a visual example of the resulting prior map demonstrates that the PSGM provides effective and meaningful semantic guidance.
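A sketch of the final refinement and binarization of Equation (8) is given below, with Ψ(⋅) approximated by a simple box filter and β left as a tunable threshold; both choices are assumptions.

```python
import torch.nn.functional as F

def semantic_prior_map(coarse_mask, beta=0.5):
    """Refinement and binarization yielding the final semantic prior map (cf. Equation (8)).
    The compensatory filter Psi(.) is approximated by a 3x3 box filter that fills
    small gaps before thresholding with beta; this concrete choice is assumed."""
    refined = F.avg_pool2d(coarse_mask, kernel_size=3, stride=1, padding=1)  # Psi(.)
    return (refined > beta).float()   # 1 = likely target seabed terrain region
```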
In summary, the proposed PSGM fully incorporates the physical priors that are specific to sonar imaging, particularly by effectively capturing continuous structural features and suppressing random speckle noise. In the PSGM, such sonar-specific priors are embedded into the network structure through Sobel-based structural edge extraction and distance-adaptive modulation, which emphasize highlight–shadow continuity and range-dependent attenuation, while morphological aggregation suppresses isolated speckle artifacts. This significantly improves the accuracy and stability of the generated semantic prior masks. By providing more reliable semantic guidance, the PSGM enhances the overall performance of the DSM-Seg framework, improving segmentation accuracy and noise robustness in complex seabed terrain scenarios.
3.4. RWKV-Based Global Fusion with Semantic Constraints (RGFSC)
After extracting local features and generating physically guided semantic cues, DSM-Seg must still effectively capture long-range dependencies across wide-area seabed terrain. However, FLS images often exhibit severe non-uniform noise, blurred boundaries, and complex backgrounds, making conventional similarity-based attention mechanisms prone to introducing irrelevant information, which hinders precise feature fusion.
To address this challenge, this paper introduces the RGFSC module, which combines a Bi-RWKV attention mechanism with dual semantic constraints to dynamically suppress noise and cross-region interference during long-range modeling. This enables a more accurate fusion of global and local features. Specifically, RGFSC incorporates two complementary semantic constraint mechanisms: SAAM and SPHF. SAAM leverages semantic priors to dynamically modulate the attention gates, enhancing intra-region correlation by emphasizing semantically consistent areas. In contrast, SPHF employs prior maps generated by the PSGM to strictly restrict global attention computation to semantically corresponding regions, thereby suppressing irrelevant signals and preventing noise accumulation during recursive inference. The overall structure of RGFSC is illustrated in
Figure 4. By integrating SAAM and SPHF, RGFSC retains the computational efficiency of the RWKV mechanism while significantly improving the semantic consistency of global representations, providing reliable support for subsequent fine-grained segmentation.
To define the RGFSC computation process in detail, the output of the Bi-WKV at position t is formulated in Equation (9), where w and u represent learnable channel-wise decay and bias vectors, respectively, and kt and vt denote the key and value projections at the current position.
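For reference, a naive quadratic-time sketch of a bidirectional WKV aggregation of this kind is shown below; the exact decay normalization used in Equation (9) may differ from this common Bi-RWKV form.

```python
import torch

def bi_wkv(k, v, w, u):
    """Naive O(T^2) reference of a bidirectional WKV aggregation (cf. Equation (9)).
    k, v: (T, C) key/value projections; w, u: (C,) channel-wise decay and bias.
    This follows a common Bi-RWKV formulation and is meant for illustration only."""
    T, C = k.shape
    out = torch.empty_like(v)
    pos = torch.arange(T, device=k.device).float()
    for t in range(T):
        dist = (pos - t).abs() - 1.0                          # distance-based decay exponent
        weight = torch.exp(-dist.unsqueeze(1) / T * w + k)    # (T, C) weights from other positions
        weight[t] = torch.exp(u + k[t])                       # current position uses the bias u
        num = (weight * v).sum(dim=0)
        den = weight.sum(dim=0)
        out[t] = num / (den + 1e-8)
    return out
```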
To measure semantic similarity between a query pixel (i, j) and a reference pixel (p, q) in the local feature space, the SAAM factor is defined in Equation (10). This factor is computed by projecting the local feature vector of the query pixel into an embedded space and comparing it with that of the reference pixel through an inner product, followed by an activation function σ(⋅). The output value, ranging from 0 to 1, serves as an attention weight: values close to 1 indicate strong semantic similarity, whereas values near 0 imply semantic irrelevance and should be suppressed. In addition to similarity, semantic consistency is enforced using the SPHF factor, defined in Equation (11), in which the semantic class of each pixel is taken from the PSGM prior map and ϵ is a small constant. This factor ensures that global attention is computed only between semantically corresponding regions. A value of 1 is assigned when both pixels belong to the same semantic region; otherwise, the interaction is down-weighted by ϵ, effectively eliminating irrelevant information exchange.
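The two factors could be implemented roughly as follows; the embedding dimension, the sigmoid choice for σ(⋅), and the way semantic classes are read from the PSGM prior map are assumptions.

```python
import torch
import torch.nn as nn

class SemanticGates(nn.Module):
    """Sketch of the SAAM similarity factor (Eq. (10)) and SPHF consistency factor (Eq. (11))."""
    def __init__(self, channels, embed_dim=64, eps=0.1):
        super().__init__()
        self.proj = nn.Linear(channels, embed_dim)   # embedding of local feature vectors
        self.eps = eps

    def saam(self, feat_q, feat_r):
        """feat_q, feat_r: (N, C) local feature vectors at query/reference pixels."""
        e_q, e_r = self.proj(feat_q), self.proj(feat_r)
        return torch.sigmoid((e_q * e_r).sum(dim=-1))    # similarity weight in (0, 1)

    def sphf(self, cls_q, cls_r):
        """cls_q, cls_r: (N,) semantic classes taken from the PSGM prior map."""
        same = (cls_q == cls_r).float()
        return same + (1.0 - same) * self.eps            # 1 if same region, else eps
```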
Based on the above definitions, the operations of SAAM and SPHF are formulated in Equations (12) and (13). SAAM performs semantic-aware feature aggregation by weighting the value vectors across the global space Ω based on the similarity factor g, while SPHF constrains the attention computation using the binary gating mask M. These mechanisms jointly enable focused and semantically consistent information fusion.
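A minimal sketch of this constrained aggregation, combining the soft SAAM weights and the SPHF gating for a single query pixel, is given below under the same assumptions.

```python
def constrained_aggregation(values, g, m, eps=1e-8):
    """Semantically constrained aggregation sketch (cf. Equations (12) and (13)).
    values: (N, C) value vectors over the global space Omega for one query pixel;
    g: (N,) SAAM similarity weights; m: (N,) SPHF gating factors."""
    w = g * m                                                # joint soft/hard semantic weight
    return (w.unsqueeze(-1) * values).sum(dim=0) / (w.sum() + eps)
```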
The final output of RGFSC is given in Equation (14), in which the input feature at time step t is passed through a sigmoid gate σ(⋅), ⊙ denotes element-wise multiplication, and the gated response is mapped by an output projection matrix. Both SAAM and SPHF are embedded into the attention and gating stages of the RWKV architecture to enforce dual semantic constraints, thereby enhancing the model’s ability to focus on consistent regions and suppress distracting content.
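The gated output stage of Equation (14) could look roughly as follows; the receptance-style gate computed from the input feature is an assumed detail.

```python
import torch
import torch.nn as nn

class RGFSCOutput(nn.Module):
    """Sketch of the gated output stage (cf. Equation (14)): the fused global response
    is modulated by a sigmoid gate derived from the input feature and then passed
    through an output projection."""
    def __init__(self, channels):
        super().__init__()
        self.receptance = nn.Linear(channels, channels)   # gate from the input feature
        self.out_proj = nn.Linear(channels, channels)     # output projection matrix

    def forward(self, x_t, fused_t):
        gate = torch.sigmoid(self.receptance(x_t))        # sigma(.)
        return self.out_proj(gate * fused_t)              # element-wise gating, then projection
```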
In summary, the RGFSC module enables semantically focused attention while substantially reducing noise and irrelevant contextual information. By integrating physical semantic priors and dual semantic constraints, the framework first provides region-level semantic localization through the PSGM, and then applies attention filtering based on both feature similarity and semantic consistency via the SAAM and SPHF mechanisms. In RGFSC, these physical semantic priors are embedded through semantic-aware attention modulation in SAAM, which emphasizes contextually consistent regions, and through hard filtering in SPHF, which suppresses bright artifacts lacking structural or shadow cues. This hierarchical strategy ensures that essential seabed structures are effectively captured across long-range dependencies. The final representation retains detailed local features from the CNN encoder while incorporating a high-level semantic structure, enabling DSM-Seg to maintain robust segmentation performance in complex environments.