3.1. Overall Architecture of DSM-Seg
The overall architecture of DSM-Seg is illustrated in
Figure 2. It is designed to address the challenges of semantic segmentation in FLS images caused by strong speckle noise, blurred boundaries, and SNR attenuation in far-field regions. The proposed framework consists of three functional modules, responsible for local feature modeling, semantic guidance generation, and global semantic fusion, respectively. First, the CNN-based feature extraction module employs a densely connected structure to extract multi-scale local feature maps, capturing clear texture details and boundary information that serve as the foundational representations for subsequent semantic integration. Next, the PSGM operates directly on shallow local feature maps without introducing an additional image branch. This module integrates edge gradients, dynamic convolutional responses, and sonar-specific range attenuation priors to generate semantic guidance maps that exhibit physical consistency and regional structural saliency. These maps highlight potential target regions while suppressing redundant noise, thereby providing structurally reliable prior cues to support the subsequent global fusion process. Finally, the global feature fusion module incorporates the RGFSC mechanism. Built on the sequential modeling capacity of the RWKV architecture, RGFSC introduces an explicit semantic constraint control flow. Specifically, the SAAM component enhances feature consistency among semantically similar regions, while the SPHF component suppresses irrelevant responses in non-target areas. These two mechanisms work in tandem to improve the semantic focus of long-range dependency modeling, enabling robust and efficient global feature fusion.
In summary, DSM-Seg achieves a tightly coupled integration of local semantic extraction, physically guided semantic priors, and semantically constrained global fusion. This design is particularly well-suited for FLS image segmentation tasks in real-world, complex underwater environments.
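To make this composition concrete, the following minimal PyTorch-style sketch shows how the three stages could be wired together in a forward pass; the class and argument names (LocalFeatureExtractor, PSGM, RGFSC, seg_head) are illustrative placeholders rather than the authors' implementation.

```python
import torch.nn as nn

class DSMSegSketch(nn.Module):
    """Illustrative composition of the three DSM-Seg stages.
    The submodules passed in are hypothetical stand-ins for the
    components described in Sections 3.2-3.4."""
    def __init__(self, encoder, psgm, rgfsc, seg_head):
        super().__init__()
        self.encoder = encoder      # CNN-based local feature extraction (Sec. 3.2)
        self.psgm = psgm            # physical prior-based semantic guidance (Sec. 3.3)
        self.rgfsc = rgfsc          # RWKV-based global fusion with constraints (Sec. 3.4)
        self.seg_head = seg_head    # per-pixel classifier

    def forward(self, sonar_image):
        local_feat = self.encoder(sonar_image)       # multi-scale local features
        prior_mask = self.psgm(local_feat)           # semantic prior map
        fused = self.rgfsc(local_feat, prior_mask)   # semantically constrained global fusion
        return self.seg_head(fused)                  # segmentation logits
```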
3.2. CNN-Based Local Semantic Feature Extraction
Given the prevalence of noise, low contrast, and complex textures in FLS images, DSM-Seg first requires the extraction of stable local features with a strong semantic representation capability to support subsequent segmentation. To this end, a CNN-based cascaded encoder–decoder structure is adopted as the initial feature extraction unit in DSM-Seg to obtain local semantic features while suppressing noise during downsampling. The encoder is constructed by stacking dense blocks and applying cross-layer feature fusion to enhance the representation of local structural information, thereby improving feature quality under noisy conditions. The decoder then progressively restores spatial resolution to produce a semantically rich and structurally informative local feature map.
To ensure that the extracted local features contain sufficient semantic information, the input sonar image is first processed by an L-layer encoder to extract multi-scale features and compress spatial dimensions, enabling higher-level semantic abstraction. The encoding process at layer i is defined in Equation (1), in which the i-th layer applies its convolution and dense connection operations, σ(⋅) is the activation function, and DownSample(⋅) performs spatial downsampling to highlight semantic features. The top-level encoded feature is denoted as ZL.
To progressively recover spatial resolution while preserving high-level semantics, a symmetric L-layer decoder is designed to upsample the feature maps layer by layer starting from ZL, as shown in Equation (2). Here, each decoder layer applies a transposed convolution, and UpSample(⋅) increases spatial resolution. Each decoder layer also integrates the corresponding encoder features through skip connections to enhance edge and detail representation.
Finally, the output of the top decoder layer is used as the local feature map, as defined in Equation (3). This feature map has a spatial resolution close to that of the original image and preserves the high-level semantic representation and noise suppression capabilities learned by the encoder, thus providing a high-quality input for the subsequent semantic guidance and global modeling with semantic constraints.
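A compact sketch of the encoder–decoder described above is given below, assuming simple two-convolution dense blocks, stride-2 downsampling, transposed-convolution upsampling, and skip connections; the number of levels, channel widths, and growth rate are illustrative choices, not the paper's configuration.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Two convolutions whose outputs are concatenated with the input (dense connectivity)."""
    def __init__(self, in_ch, growth=32):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(in_ch, growth, 3, padding=1), nn.ReLU(inplace=True))
        self.conv2 = nn.Sequential(nn.Conv2d(in_ch + growth, growth, 3, padding=1), nn.ReLU(inplace=True))
        self.out_ch = in_ch + 2 * growth

    def forward(self, x):
        y1 = self.conv1(x)
        y2 = self.conv2(torch.cat([x, y1], dim=1))
        return torch.cat([x, y1, y2], dim=1)


class LocalFeatureExtractor(nn.Module):
    """L-level encoder-decoder (L = 3 here) returning a local feature map at
    roughly the input resolution, in the spirit of Equations (1)-(3).
    Input height/width are assumed divisible by 2**levels."""
    def __init__(self, in_ch=1, levels=3):
        super().__init__()
        self.blocks, self.downs, enc_chs = nn.ModuleList(), nn.ModuleList(), []
        ch = in_ch
        for _ in range(levels):
            block = DenseBlock(ch)
            self.blocks.append(block)
            enc_chs.append(block.out_ch)
            self.downs.append(nn.Conv2d(block.out_ch, block.out_ch, 3, stride=2, padding=1))
            ch = block.out_ch
        self.ups, self.fuses = nn.ModuleList(), nn.ModuleList()
        for i in reversed(range(levels)):
            self.ups.append(nn.ConvTranspose2d(ch, enc_chs[i], 2, stride=2))   # UpSample(.)
            self.fuses.append(nn.Sequential(                                   # fuse skip features
                nn.Conv2d(2 * enc_chs[i], enc_chs[i], 3, padding=1), nn.ReLU(inplace=True)))
            ch = enc_chs[i]

    def forward(self, x):
        skips = []
        for block, down in zip(self.blocks, self.downs):
            x = block(x)                 # dense connections, Eq. (1)
            skips.append(x)
            x = down(x)                  # DownSample(.)
        for up, fuse, skip in zip(self.ups, self.fuses, reversed(skips)):
            x = up(x)                                   # transposed convolution, Eq. (2)
            x = fuse(torch.cat([x, skip], dim=1))       # skip connection fusion
        return x                                        # local feature map, Eq. (3)
```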
3.3. Physical Prior-Based Semantic Guidance Module (PSGM)
Accurate semantic segmentation of sonar imagery remains challenging because structural cues are limited and purely data-driven models struggle to capture physical degradation patterns. Owing to the imaging characteristics of forward-looking sonar, FLS images commonly suffer from severe speckle noise and range-dependent, non-uniform SNR attenuation, which blur target boundaries and produce ambiguous semantic regions, making it difficult to accurately delineate targets using data-driven deep feature learning alone. To address these limitations, the Physical Prior-Based Semantic Guidance Module is designed and integrated into DSM-Seg. This module combines Structure-Aware Context Aggregation (SACA) with domain-specific physical priors that are inherent to sonar imaging, thereby significantly enhancing the model’s ability to focus on critical structural regions and resist noise interference. The PSGM provides stable and effective semantic guidance for subsequent global feature fusion. A schematic illustration of the module is shown in
Figure 3.
First, the PSGM applies the Sobel operator to the local feature map, which is obtained from the CNN feature extraction module, in order to extract initial edge features. The computation is defined in Equation (4), in which Kx and Ky denote the Sobel convolution kernels in the x and y directions, respectively, and the convolution is performed over the local neighborhood centered at pixel coordinate (i, j). A scaling factor, derived from the speckle noise intensity and the global SNR of the image, is applied to the gradient magnitude to maintain sufficient contrast in regions affected by strong speckle noise.
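A minimal sketch of the Sobel-based edge extraction with a noise-aware scaling factor in the spirit of Equation (4) is shown below; the specific way the factor is derived from the estimated noise level and global SNR is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def sobel_edge_map(feat, eps=1e-6):
    """Per-channel Sobel gradient magnitude with a noise/SNR-dependent scaling.
    feat: (B, C, H, W) local feature map from the CNN extractor."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]], device=feat.device)
    ky = kx.t()
    c = feat.shape[1]
    kx = kx.view(1, 1, 3, 3).repeat(c, 1, 1, 1)
    ky = ky.view(1, 1, 3, 3).repeat(c, 1, 1, 1)
    gx = F.conv2d(feat, kx, padding=1, groups=c)   # horizontal gradient (Kx)
    gy = F.conv2d(feat, ky, padding=1, groups=c)   # vertical gradient (Ky)
    grad = torch.sqrt(gx ** 2 + gy ** 2 + eps)

    # Assumed scaling: larger when the global SNR is low, so edge contrast
    # is preserved in strongly speckled images.
    signal = feat.mean(dim=(1, 2, 3), keepdim=True)
    noise = feat.std(dim=(1, 2, 3), keepdim=True) + eps
    scale = 1.0 + torch.clamp(noise / (signal.abs() + eps), 0.0, 1.0)
    return scale * grad
```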
Subsequently, SACA is employed to more effectively exploit the structural physical priors inherent in sonar imagery. Specifically, SACA focuses on capturing the strongly continuous geometric structures that are commonly present in FLS images. This is achieved by applying a learnable convolutional kernel to extract feature responses, which are then combined with a bias term to produce a saliency-aware attention mask, as defined in Equation (5). In this process, the parameters and receptive field of the kernel are learned adaptively during training, enabling the model to enhance coherent structural responses while suppressing random speckle noise and isolated outlier points.
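The following sketch illustrates one way the learnable kernel, bias, and saliency mask of Equation (5) could be realized; the depthwise 7×7 kernel and the sigmoid activation are assumptions.

```python
import torch.nn as nn

class SACA(nn.Module):
    """Structure-Aware Context Aggregation sketch: a learnable kernel plus bias
    yields a saliency-aware attention mask (cf. Equation (5)). The relatively
    large depthwise kernel is assumed to favour continuous structures."""
    def __init__(self, channels, kernel_size=7):
        super().__init__()
        self.kernel = nn.Conv2d(channels, channels, kernel_size,
                                padding=kernel_size // 2, groups=channels, bias=True)
        self.act = nn.Sigmoid()

    def forward(self, edge_feat):
        return self.act(self.kernel(edge_feat))   # saliency mask in [0, 1]
```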
While this saliency mask effectively highlights structural continuity, it lacks sensitivity to range-based signal degradation. To account for the range-dependent attenuation characteristics of FLS imaging, a distance-adaptive SNR modulation strategy is therefore employed to suppress feature responses in far-field regions, thereby enhancing the saliency of near-field structures. The resulting modulation map is defined in Equation (6). In this equation, γ is a distance attenuation coefficient, which is estimated from calibration data by fitting the measured near- and far-range SNR to an exponential decay model and subsequently fine-tuned on the validation set to ensure both physical consistency and optimal segmentation performance, and the normalized distance from each pixel to the center of the sonar array models the degradation of signal intensity with increasing distance. The responses of distant pixels are attenuated accordingly, which effectively reduces interference from far-field noise while preserving clear structural features in near-range areas.
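A small sketch of the distance-adaptive modulation in the spirit of Equation (6) follows; the assumption that range grows with the image row index and that attenuation follows exp(−γ·d) is made purely for illustration.

```python
import torch

def range_modulation(saliency, gamma=2.0):
    """Distance-adaptive SNR modulation (cf. Equation (6)).
    Attenuates responses with the normalized range from the sonar array,
    here assumed to increase with the row index and modelled as an
    exponential decay exp(-gamma * d); both choices are illustrative."""
    h = saliency.shape[2]
    # normalized distance of each row from the array (row 0 = nearest range)
    d = torch.linspace(0.0, 1.0, h, device=saliency.device).view(1, 1, h, 1)
    return saliency * torch.exp(-gamma * d)
```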
Subsequently, the semantic guidance map is further refined through adaptive threshold-based morphological aggregation, resulting in a more stable guidance mask. Specifically, an adaptive threshold is computed to determine whether the gradient of each pixel exceeds its local value. A morphological maximization operation is then performed within a structural neighborhood Ω, aggregating adjacent high-gradient regions into the current pixel; this process smooths broken edges and removes isolated noise points. As defined in Equation (7), the Sobel gradient magnitude is compared against the adaptive threshold τ(⋅); the output is 1 if the gradient exceeds the threshold and 0 otherwise. The max operator performs morphological aggregation by merging all qualifying neighboring responses into the center pixel (i, j), thereby eliminating fragmentation and isolated noise in elongated terrain structures.
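The adaptive thresholding and morphological aggregation of Equation (7) could be approximated as follows; the local mean-plus-k·std threshold and the 3×3 neighborhood Ω are assumed instantiations.

```python
import torch
import torch.nn.functional as F

def morph_aggregate(grad_mag, window=7, k=1.0):
    """Adaptive thresholding followed by morphological max aggregation (cf. Equation (7)).
    The local threshold tau = local mean + k * local std is an assumed form; the
    max-pool merges neighbouring high-gradient responses into the center pixel."""
    mean = F.avg_pool2d(grad_mag, window, stride=1, padding=window // 2)
    sq_mean = F.avg_pool2d(grad_mag ** 2, window, stride=1, padding=window // 2)
    std = torch.sqrt(torch.clamp(sq_mean - mean ** 2, min=0.0))
    tau = mean + k * std
    binary = (grad_mag > tau).float()                 # 1 if gradient exceeds threshold
    # morphological max over the structural neighbourhood Omega (3x3 here)
    return F.max_pool2d(binary, kernel_size=3, stride=1, padding=1)
```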
Finally, to further improve the accuracy of the coarse guidance mask, a compensatory filtering operation Ψ(⋅) is introduced for refinement. The output is then binarized using a threshold β to obtain the final semantic prior map, as defined in Equation (8). A value of 1 indicates that the pixel (i, j) has a high probability of belonging to a target seabed terrain region under varying SNR conditions. As illustrated in Figure 3, a visual example of the resulting prior map demonstrates that the PSGM provides effective and meaningful semantic guidance.
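A sketch of the final refinement and binarization of Equation (8) is given below, with Ψ(⋅) approximated by a simple box filter and β left as a tunable threshold; both choices are assumptions.

```python
import torch.nn.functional as F

def semantic_prior_map(coarse_mask, beta=0.5):
    """Refinement and binarization yielding the final semantic prior map (cf. Equation (8)).
    The compensatory filter Psi(.) is approximated by a 3x3 box filter that fills
    small gaps before thresholding with beta; this concrete choice is assumed."""
    refined = F.avg_pool2d(coarse_mask, kernel_size=3, stride=1, padding=1)  # Psi(.)
    return (refined > beta).float()   # 1 = likely target seabed terrain region
```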
In summary, the proposed PSGM fully incorporates the physical priors that are specific to sonar imaging, particularly by effectively capturing continuous structural features and suppressing random speckle noise. In the PSGM, such sonar-specific priors are embedded into the network structure through Sobel-based structural edge extraction and distance-adaptive modulation, which emphasize highlight–shadow continuity and range-dependent attenuation, while morphological aggregation suppresses isolated speckle artifacts. This significantly improves the accuracy and stability of the generated semantic prior masks. By providing more reliable semantic guidance, the PSGM enhances the overall performance of the DSM-Seg framework, improving segmentation accuracy and noise robustness in complex seabed terrain scenarios.
3.4. RWKV-Based Global Fusion with Semantic Constraints (RGFSC)
After extracting local features and generating physically guided semantic cues, DSM-Seg must still effectively capture long-range dependencies across wide-area seabed terrain. However, FLS images often exhibit severe non-uniform noise, blurred boundaries, and complex backgrounds, making conventional similarity-based attention mechanisms prone to introducing irrelevant information, which hinders precise feature fusion.
To address this challenge, this paper introduces the RGFSC module, which combines a Bi-RWKV attention mechanism with dual semantic constraints to dynamically suppress noise and cross-region interference during long-range modeling. This enables a more accurate fusion of global and local features. Specifically, RGFSC incorporates two complementary semantic constraint mechanisms: SAAM and SPHF. SAAM leverages semantic priors to dynamically modulate the attention gates, enhancing intra-region correlation by emphasizing semantically consistent areas. In contrast, SPHF employs prior maps generated by the PSGM to strictly restrict global attention computation to semantically corresponding regions, thereby suppressing irrelevant signals and preventing noise accumulation during recursive inference. The overall structure of RGFSC is illustrated in
Figure 4. By integrating SAAM and SPHF, RGFSC retains the computational efficiency of the RWKV mechanism while significantly improving the semantic consistency of global representations, providing reliable support for subsequent fine-grained segmentation.
To define the RGFSC computation process in detail, the output of the Bi-WKV at position t is formulated in Equation (9), where w and u represent learnable channel-wise decay and bias vectors, respectively, and kt and vt denote the key and value projections at the current position.
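For reference, a naive quadratic-time sketch of a bidirectional WKV aggregation of this kind is shown below; the exact decay normalization used in Equation (9) may differ from this common Bi-RWKV form.

```python
import torch

def bi_wkv(k, v, w, u):
    """Naive O(T^2) reference of a bidirectional WKV aggregation (cf. Equation (9)).
    k, v: (T, C) key/value projections; w, u: (C,) channel-wise decay and bias.
    This follows a common Bi-RWKV formulation and is meant for illustration only."""
    T, C = k.shape
    out = torch.empty_like(v)
    pos = torch.arange(T, device=k.device).float()
    for t in range(T):
        dist = (pos - t).abs() - 1.0                          # distance-based decay exponent
        weight = torch.exp(-dist.unsqueeze(1) / T * w + k)    # (T, C) weights from other positions
        weight[t] = torch.exp(u + k[t])                       # current position uses the bias u
        num = (weight * v).sum(dim=0)
        den = weight.sum(dim=0)
        out[t] = num / (den + 1e-8)
    return out
```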
To measure semantic similarity between a query pixel (i, j) and a reference pixel (p, q) in the local feature space, the SAAM factor is defined in Equation (10). This factor is computed by projecting the local feature vector of the query pixel into an embedded space and comparing it with that of the reference pixel through an inner product, followed by an activation function σ(⋅). The output value, ranging from 0 to 1, serves as an attention weight: values close to 1 indicate strong semantic similarity, whereas values near 0 imply semantic irrelevance and should be suppressed. In addition to similarity, semantic consistency is enforced using the SPHF factor, defined in Equation (11), in which the semantic class of each pixel is taken from the PSGM prior map and ϵ is a small constant. This factor ensures that global attention is computed only between semantically corresponding regions. A value of 1 is assigned when both pixels belong to the same semantic region; otherwise, the interaction is down-weighted by ϵ, effectively eliminating irrelevant information exchange.
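The two factors could be implemented roughly as follows; the embedding dimension, the sigmoid choice for σ(⋅), and the way semantic classes are read from the PSGM prior map are assumptions.

```python
import torch
import torch.nn as nn

class SemanticGates(nn.Module):
    """Sketch of the SAAM similarity factor (Eq. (10)) and SPHF consistency factor (Eq. (11))."""
    def __init__(self, channels, embed_dim=64, eps=0.1):
        super().__init__()
        self.proj = nn.Linear(channels, embed_dim)   # embedding of local feature vectors
        self.eps = eps

    def saam(self, feat_q, feat_r):
        """feat_q, feat_r: (N, C) local feature vectors at query/reference pixels."""
        e_q, e_r = self.proj(feat_q), self.proj(feat_r)
        return torch.sigmoid((e_q * e_r).sum(dim=-1))    # similarity weight in (0, 1)

    def sphf(self, cls_q, cls_r):
        """cls_q, cls_r: (N,) semantic classes taken from the PSGM prior map."""
        same = (cls_q == cls_r).float()
        return same + (1.0 - same) * self.eps            # 1 if same region, else eps
```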
Based on the above definitions, the operations of SAAM and SPHF are formulated in Equations (12) and (13). SAAM performs semantic-aware feature aggregation by weighting the value vectors across the global space Ω based on the similarity factor g, while SPHF constrains the attention computation using the binary gating mask M. These mechanisms jointly enable focused and semantically consistent information fusion.
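A minimal sketch of this constrained aggregation, combining the soft SAAM weights and the SPHF gating for a single query pixel, is given below under the same assumptions.

```python
def constrained_aggregation(values, g, m, eps=1e-8):
    """Semantically constrained aggregation sketch (cf. Equations (12) and (13)).
    values: (N, C) value vectors over the global space Omega for one query pixel;
    g: (N,) SAAM similarity weights; m: (N,) SPHF gating factors."""
    w = g * m                                                # joint soft/hard semantic weight
    return (w.unsqueeze(-1) * values).sum(dim=0) / (w.sum() + eps)
```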
The final output of RGFSC is given in Equation (14), in which the input feature at time step t is passed through a sigmoid gate σ(⋅), ⊙ denotes element-wise multiplication, and the gated response is mapped by an output projection matrix. Both SAAM and SPHF are embedded into the attention and gating stages of the RWKV architecture to enforce dual semantic constraints, thereby enhancing the model’s ability to focus on consistent regions and suppress distracting content.
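The gated output stage of Equation (14) could look roughly as follows; the receptance-style gate computed from the input feature is an assumed detail.

```python
import torch
import torch.nn as nn

class RGFSCOutput(nn.Module):
    """Sketch of the gated output stage (cf. Equation (14)): the fused global response
    is modulated by a sigmoid gate derived from the input feature and then passed
    through an output projection."""
    def __init__(self, channels):
        super().__init__()
        self.receptance = nn.Linear(channels, channels)   # gate from the input feature
        self.out_proj = nn.Linear(channels, channels)     # output projection matrix

    def forward(self, x_t, fused_t):
        gate = torch.sigmoid(self.receptance(x_t))        # sigma(.)
        return self.out_proj(gate * fused_t)              # element-wise gating, then projection
```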
In summary, the RGFSC module enables semantically focused attention while substantially reducing noise and irrelevant contextual information. By integrating physical semantic priors and dual semantic constraints, the framework first provides region-level semantic localization through the PSGM, and then applies attention filtering based on both feature similarity and semantic consistency via the SAAM and SPHF mechanisms. In RGFSC, these physical semantic priors are embedded through semantic-aware attention modulation in SAAM, which emphasizes contextually consistent regions, and through hard filtering in SPHF, which suppresses bright artifacts lacking structural or shadow cues. This hierarchical strategy ensures that essential seabed structures are effectively captured across long-range dependencies. The final representation retains detailed local features from the CNN encoder while incorporating a high-level semantic structure, enabling DSM-Seg to maintain robust segmentation performance in complex environments.