Article

SonarNet: Global Feature-Based Hybrid Attention Network for Side-Scan Sonar Image Segmentation

1 School of Marine Science and Technology, Northwestern Polytechnical University, West Youyi Road, Xi’an 710072, China
2 Research and Development Institute of Northwestern Polytechnical University in Shenzhen, Shenzhen 518057, China
3 School of Equipment Management and UAV of Air Force Engineering University, No. 1 Jiazi, Changle East Road, Xi’an 710045, China
4 Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Jingming South Road, Kunming 650500, China
5 School of Information Science and Engineering, Shandong Agricultural University, Daizong Road, Tai’an 271000, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(14), 2450; https://doi.org/10.3390/rs17142450
Submission received: 28 May 2025 / Revised: 30 June 2025 / Accepted: 8 July 2025 / Published: 15 July 2025

Abstract

With the rapid advancement of deep learning techniques, side-scan sonar image segmentation has become a crucial task in underwater scene understanding. However, the complex and variable underwater environment poses significant challenges for salient object detection, with traditional deep learning approaches often suffering from inadequate feature representation and the loss of global context during downsampling, thus compromising the segmentation accuracy of fine structures. To address these issues, we propose SonarNet, a Global Feature-Based Hybrid Attention Network specifically designed for side-scan sonar image segmentation. SonarNet features a dual-encoder architecture that leverages residual blocks and a self-attention mechanism to simultaneously capture both global structural and local contextual information. In addition, an adaptive hybrid attention module is introduced to effectively integrate channel and spatial features, while a global enhancement block fuses multi-scale global and spatial representations from the dual encoders, mitigating information loss throughout the network. Comprehensive experiments on a dedicated underwater sonar dataset demonstrate that SonarNet outperforms ten state-of-the-art saliency detection methods, achieving a mean absolute error as low as 2.35%. These results highlight the superior performance of SonarNet in challenging sonar image segmentation tasks.

1. Introduction

Underwater sonar image segmentation is a fundamental technology in marine exploration and underwater navigation, with widespread applications in underwater archaeology, resource exploration, environmental monitoring, and military reconnaissance. Sonar technology leverages the propagation characteristics of acoustic waves in water to detect and identify underwater targets by emitting sound waves and analyzing their reflected signals. In traditional sonar imaging, multi-receiver Synthetic Aperture Sonar (SAS) techniques achieve data focusing and image reconstruction through methods such as the Legendre expansion [1] or frequency-domain interpolation. For instance, the imaging algorithm proposed in [2] enhances the integrity of target contours by optimizing phase compensation. However, these classical approaches typically rely on manually designed feature extraction rules, which makes it difficult to capture dynamic correlations between global structure and local detail in complex underwater environments. This limitation often results in insufficient segmentation accuracy, especially for fine structures or regions with weak boundaries. Image segmentation technology aims to automatically identify and delineate objects or regions of interest with significant or special attributes from complex sonar echo images or data, thus providing essential information for subsequent classification, recognition, and tracking tasks. In marine resource exploration, such as the search for seabed oilfields and natural gas hydrates, image segmentation has demonstrated its capability to efficiently localize potential resource points [3]. Furthermore, in scenarios such as underwater archaeological excavation, emergency rescue, or seabed construction, this technology enables the timely detection and avoidance of obstacles or potential hazards, thereby ensuring the safety of personnel and equipment [4]. Image segmentation can also extract critical information from sonar images, such as seabed topography and biological distribution, which are invaluable for marine scientists [5,6]. Moreover, it can sensitively capture abnormal seabed changes, such as crustal movement or submarine landslides, providing indispensable data for marine disaster prediction and prevention [7]. Compared with other underwater imaging modalities, such as optical imaging, sonar imaging stands out for its superior detection range, strong penetration capability, and robustness to turbid water environments [8]. Consequently, image segmentation in underwater sonar images plays an irreplaceable role in various fields, including underwater exploration, marine resource development, and seabed mapping [9].
Despite the rapid development of image segmentation methods in recent years, significant challenges remain. Underwater sonar images are often affected by complicated and variable environments, leading to unique texture and noise characteristics. Early methods mainly relied on traditional image processing techniques, but with the advent of deep learning, CNN-based segmentation approaches have become a research focus. However, several crucial problems persist. Most existing image segmentation methods are based on deep convolutional neural networks (CNNs), especially fully convolutional networks (FCNs) [10,11] and encoder–decoder structures such as U-Net [12]. These models often use pre-trained classification backbones like VGG [13] and ResNet [14] for feature extraction. Since these networks are primarily designed for natural images and trained on large-scale datasets such as ImageNet [15], there is a significant mismatch in data distribution compared to underwater sonar images, which can negatively impact segmentation performance. Moreover, as shown in Figure 1, encoder–decoder architectures may lose global location information as high-level semantic features are propagated through deep layers. This information loss impairs the model’s ability to capture holistic structures and long-range dependencies, resulting in inaccurate boundary localization and limited generalization.
To overcome these limitations, we propose an innovative Global Feature-Based Hybrid Attention Network, termed SonarNet, specifically tailored for image segmentation in underwater sonar images. As a simple analogy, imagine compressing a high-resolution map into a small thumbnail; while you can still see the general layout, it becomes much harder to pinpoint the exact location of specific landmarks. Similarly, as the encoder deepens and the feature maps shrink, the network loses detailed knowledge about where each feature was originally located in the input image. This can impair the segmentation of fine structures and the accurate delineation of object boundaries, especially in cluttered or noisy sonar images. To address this, our proposed architecture incorporates a dual-encoder design with a self-attention mechanism and global enhancement modules, which explicitly preserve and fuse global contextual and positional information throughout the network. Built upon the U-Net architecture, SonarNet introduces several key innovations. First, a dual-encoder structure with residual blocks and a self-attention mechanism is employed to capture both global structural and local spatial information, thereby alleviating the loss of positional information caused by downsampling and convolution. Second, an adaptive hybrid attention module is integrated to dynamically adjust attention allocation according to input features, enhancing the model’s adaptability to various underwater environments. Third, a multi-scale global enhancement block is embedded between the encoder and decoder, providing additional global features and location information to the decoder and significantly improving the accuracy of underwater sonar image segmentation. To validate the effectiveness of the proposed method, we conduct extensive experiments on a dedicated underwater sonar image dataset, comparing SonarNet against several state-of-the-art methods, including U-Net, PAGRN [16], SRM [17], UCF [18], CPD [19], Amulet [20], PiCANet [21], DGRL [22], Pool-Net [23], and U2-Net [24]. Experimental results demonstrate that SonarNet achieves superior generalization and robustness in image segmentation for sonar images, providing new insights and methodological advances for this field. The contributions of this paper are summarized as follows:
  • We propose a novel dual-encoder structure incorporating self-attention for global feature extraction. This architecture enables simultaneous extraction of spatial features and global location information, effectively mitigating the decay of global correlations caused by continuous convolution and downsampling.
  • We design an adaptive hybrid attention module that dynamically emphasizes or suppresses information based on input data. By jointly focusing on spatial and channel-wise features, this module enhances the model’s ability to accurately identify and localize target regions, improving generalization and robustness against complex underwater environments.
  • We introduce a global information enhancement module between the dual encoder and decoder. This module integrates global and spatial features from both encoding paths, providing the decoder with multi-scale global information and significantly improving segmentation accuracy.
The remainder of this paper is organized as follows: Section 2 reviews the related literature and models for image segmentation. Section 3 details the proposed SonarNet architecture. Section 4 presents experimental results and comparisons with baseline and state-of-the-art methods. Section 5 concludes the paper and outlines future research directions.

2. Related Works

2.1. Traditional and CNN-Based Sonar Image Segmentation

In the field of underwater sonar image segmentation, traditional methods primarily relied on handcrafted features and heuristic-based strategies [25,26,27,28]. With the rapid advancement of deep learning, convolutional neural network (CNN)-based approaches have gradually become the mainstream. Early studies focused on preprocessing sonar signals and optimizing imaging algorithms. For instance, ref. [29] improved synthetic aperture image resolution via non-uniform sampling compensation, while ref. [30] proposed an efficient signal simulation method to enhance the signal-to-noise ratio (SNR) of raw data. However, these early approaches lacked high-level abstraction capabilities for semantic features, making it difficult to address common challenges in underwater sonar images such as texture blurring and noise interference. Although deep learning enables end-to-end feature learning and has significantly improved segmentation performance, challenges remain, particularly in modeling global context for sonar image segmentation.
CNN-based methods can automatically learn and extract complex feature representations, greatly enhancing segmentation performance. For example, Li et al. [31] utilized multi-scale features extracted from CNNs to compute the segmentation value of each superpixel, effectively leveraging CNNs’ feature extraction capabilities. Wang et al. [32] employed two different CNNs to integrate local superpixel estimation with global search proposals, yielding more accurate segmentation maps. Zhao et al. [33] introduced a multi-context deep learning framework using two separate CNNs to extract both local and global contextual information, further enriching the feature set for image segmentation. Lee et al. [34] combined low-level heuristic features with high-level CNN features to improve segmentation accuracy. Notably, many of these methods process image patches as CNN inputs, which increases computational cost and may overlook crucial spatial information across the entire image, thus affecting segmentation accuracy. To address these limitations, current research trends are shifting towards pixel-wise segmentation map prediction, inspired by fully convolutional networks [35]. Wang et al. [11] used low-level cues to generate segmentation prior maps, guiding iterative predictions and improving accuracy. Liu et al. [36] proposed a two-stage network that first generates a coarse segmentation map, then integrates local context information and refines it through recursive layering. Hou et al. [37] introduced short connections in multi-scale side outputs to better capture fine image details. Luo et al. [38] and Zhang et al. [20] improved the U-shaped structure, utilizing multi-level contextual information for more accurate segmentation. Zhang et al. [16] and Liu et al. [21] integrated attention mechanisms with U-shaped models to guide feature fusion, further enhancing segmentation performance. Wang et al. [22] proposed a novel network that iteratively searches for target regions and refines them with local context information. Zhang et al. [39] adopted a bidirectional structure that facilitates information transmission between multi-level CNN features, leading to more precise segmentation. Xiao et al. [40] used one network to customize regions of dispersed attention, then another for segmentation, effectively improving specificity and accuracy.

2.2. Global Context and Attention Mechanisms

Despite advances in underwater sonar image segmentation, many methods still fail to fully exploit the global contextual information of images, which is essential for accurate segmentation. Recent studies have increasingly focused on global context extraction. For example, Wang et al. [17] employed the pyramid pooling module [41] to effectively capture global context and proposed a multi-stage optimization mechanism to refine segmentation accuracy. Zhang et al. [16] developed spatial and channel attention modules to accurately capture global information at each level, proposing a progressive attention guidance mechanism for further refinement. Wang et al. [18] designed an inception-like context weighting module to enhance global localization of targets, complemented by a boundary refinement module for local optimization. Liu et al. [21] repeatedly captured local and global pixel-level contextual attention and integrated it with the U-Net architecture for effective pixel-wise segmentation. Zhang et al. [42] designed local and global perception modules to extract relevant information from backbone features. Zeng et al. [43] specifically designed an attention module to predict the spatial distribution of foreground objects and effectively aggregate these features. Feng et al. [44] developed a global perception module and an attention feedback module to better explore object structures. Qin et al. [45] proposed a novel prediction-refinement model that achieves boundary-aware segmentation by stacking two U-Nets with different configurations and employing mixed losses. Liu et al. [23] developed an encoder-decoder architecture incorporating a global guidance module and a multi-scale feature aggregation module; the former extracts global positional features while the latter effectively fuses global and fine-grained information. Qin et al. [24] introduced a two-level nested U-shaped structure, increasing network depth to capture richer multi-scale contextual information. In the study of attention mechanisms, Woo et al. [46] designed the Convolutional Block Attention Module (CBAM), which integrates channel and spatial attention features. Notably, Dosovitskiy et al. [47] introduced the Vision Transformer by combining self-attention mechanisms and transformer architecture for computer vision, opening new possibilities for image segmentation. Inspired by these advanced strategies, our proposed method innovatively extends and enhances the dual-encoder structure adopted by Li et al. [48] in medical image segmentation. Specifically, we integrate a global feature extractor based on self-attention into the architecture, which efficiently captures and fuses multi-level global contextual information. This design significantly enhances the model’s ability to perceive global context and improves the accuracy and robustness of underwater sonar image segmentation.

3. Methodology

The proposed method in this paper consists of three core components: a dual encoder based on global feature extraction, a global enhancement module, and an adaptive hybrid attention network. The global feature extraction module is designed to capture comprehensive image information, while the global enhancement module improves the representational capacity of feature maps. The adaptive hybrid attention network further enhances feature extraction by jointly leveraging channel and spatial attention. Through the synergistic integration of these three modules, the proposed network achieves robust and accurate segmentation of underwater sonar images. Benefiting from its distinctive U-shaped structure and skip connections, U-Net has been widely adopted for various image segmentation tasks. In this study, the overall architecture of the proposed Global Feature-Based Hybrid Attention Network (SonarNet) is also constructed based on the U-Net framework.
As illustrated in Figure 2, the main workflow can be summarized as:
  • The input underwater sonar image is first processed by a global feature extractor to obtain a global feature map. Subsequently, a dual encoder is employed to generate high-level feature representations of size 512 × 64 × 64.
  • An adaptive hybrid attention network is applied to assign higher weights to channels containing important features, thereby emphasizing key regions within the features.
  • The decoder reconstructs these high-level features to restore the original input resolution, yielding the final segmentation result.
  • The global feature enhancement module is integrated into the skip connections between the dual encoder and the decoder. This module provides additional complementary information to the decoder, which further improves the accuracy and robustness of underwater sonar image segmentation.
The following subsections will provide a detailed description of the specific design and implementation of each of these three innovative modules.
Figure 2. Overall architecture of the proposed SonarNet. The network employs a dual-encoder structure with residual blocks and global feature extractors to capture both spatial and global information. Global Enhancement Modules (GEM) are embedded in skip connections to fuse multi-scale features, and an Adaptive Hybrid Attention Module (AHAM) further refines the decoded features, enabling accurate and robust underwater sonar image segmentation.

3.1. Dual Encoder Based on Global Feature Extraction

Inspired by global aperture synthesis and local detail optimization strategies in traditional SAS imaging [49], we propose a dual-encoder architecture that captures both spatial features and global context through parallel encoding paths. Specifically, the image encoding path leverages conventional convolutional operations to extract local spatial features, while the global encoding path employs a self-attention mechanism to capture long-range dependencies analogous to the global coverage of sonar beams. This design effectively mitigates the loss of contextual information due to repeated downsampling in conventional encoder structures.
As illustrated in Figure 3, the dual encoder consists of two branches: the left branch (image encoding path) and the right branch (global encoding path). The image encoding path, similar to the encoder in U-Net, utilizes residual blocks with 3 × 3 convolutional kernels and downsampling layers to extract high-level semantic features. In parallel, the global encoding path receives a global feature map generated by a self-attention-based global feature extractor, which models the relationships among all positions in the feature map to enhance global contextual representation. To prevent attenuation of global background information through convolution and downsampling, global detection is performed on the feature maps produced by the image encoding path. The resulting global features are then merged with those from the global encoding path as input for the next layer, thereby enhancing the preservation of global context. The self-attention mechanism, as depicted in Figure 3, calculates pairwise relationships between all positions in a feature map, enabling the network to dynamically focus on salient global features and suppress irrelevant information. However, the computational complexity of self-attention grows quadratically with the feature map size. To address this, we apply adaptive pooling to reduce the dimensionality of the feature maps before self-attention, and we then upsample the attended features after the operation. Formally, let $X_n$ denote the spatial feature map at layer $n$ with height $H$ and width $W$. The pooled feature map $G_n^1$ is computed as follows:
$$G_n^1 = \mathrm{AvePool}_{\frac{H}{\delta} \times \frac{W}{\delta}}\left(X_n\right)$$
where $\delta$ is the scaling factor controlling the pooling window size and stride. The self-attention mechanism operates on the sequence of spatial features. Given input elements $x = (x_1, x_2, \ldots, x_n)$, where $x_i \in \mathbb{R}^{d_x}$, the output sequence $z = (z_1, \ldots, z_n)$ is computed as follows:
$$z_i = \sum_{j=1}^{n} a_{ij}\left(x_j W^V\right)$$
where $W^V$ is the value matrix, and $a_{ij}$ is the attention weight between positions $i$ and $j$, computed via the following softmax function:
$$a_{ij} = \frac{\exp(e_{ij})}{\sum_{u=1}^{n} \exp(e_{iu})}$$
The compatibility score $e_{ij}$ is given by the following scaled dot-product:
$$e_{ij} = \frac{\left(x_i W^Q\right)\left(x_j W^K\right)^{\top}}{\sqrt{d_z}}$$
where $W^Q$ and $W^K$ are the query and key matrices, and $d_z$ is the channel dimension for scaling. To facilitate matrix operations, we define the following:
$$Q = G_n^1 W^Q, \quad K = G_n^1 W^K, \quad V = G_n^1 W^V$$
where $G_n^1$ is the pooled feature map, and $W^Q, W^K, W^V \in \mathbb{R}^{d_x \times d_z}$ are learnable parameters. The attention output is then as follows:
$$G_n^2 = \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$
where $d_k$ is the dimension of the key vectors. After self-attention, we restore the spatial resolution of the attended global feature map via upsampling as follows:
$$G_n = \mathrm{Upsample}_{H \times W}\left(\mathrm{Reshape}\left(G_n^2\right)\right)$$
where $G_n^2$ is reshaped and upsampled to match the original spatial size $(H, W)$, enabling subsequent fusion with the local feature path. This dual-path design enables the network to capture both local spatial details and long-range global context, thus improving segmentation performance on challenging underwater sonar images.
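For concreteness, the following PyTorch sketch shows one way the pooled self-attention above could be implemented. The class name, the single-head formulation, the 1 × 1 convolution projections for $W^Q$, $W^K$, $W^V$, and the default scaling factor $\delta = 4$ are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalFeatureExtractor(nn.Module):
    """Pooled single-head self-attention over a spatial feature map (a sketch)."""
    def __init__(self, channels, delta=4, d_z=None):
        super().__init__()
        self.delta = delta
        d_z = d_z or channels
        # Learnable projections W_Q, W_K, W_V, implemented here as 1x1 convolutions.
        self.q = nn.Conv2d(channels, d_z, kernel_size=1)
        self.k = nn.Conv2d(channels, d_z, kernel_size=1)
        self.v = nn.Conv2d(channels, d_z, kernel_size=1)
        self.proj = nn.Conv2d(d_z, channels, kernel_size=1)

    def forward(self, x):                       # x: (B, C, H, W), the spatial map X_n
        B, _, H, W = x.shape
        # Adaptive average pooling to (H/delta, W/delta) keeps attention tractable (G_n^1).
        g1 = F.adaptive_avg_pool2d(x, (H // self.delta, W // self.delta))
        q, k, v = self.q(g1), self.k(g1), self.v(g1)
        _, d, h, w = q.shape
        q = q.flatten(2).transpose(1, 2)        # (B, h*w, d)
        k = k.flatten(2)                        # (B, d, h*w)
        v = v.flatten(2).transpose(1, 2)        # (B, h*w, d)
        # Scaled dot-product attention with softmax-normalised weights (G_n^2).
        attn = torch.softmax(q @ k / d ** 0.5, dim=-1)
        g2 = (attn @ v).transpose(1, 2).reshape(B, d, h, w)
        # Reshape and bilinearly upsample back to (H, W) for fusion with the local path (G_n).
        return F.interpolate(self.proj(g2), size=(H, W), mode="bilinear", align_corners=False)

# Usage sketch: GlobalFeatureExtractor(64)(torch.randn(2, 64, 128, 128)).shape -> (2, 64, 128, 128)
```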

3.2. Adaptive Hybrid Attention Mechanism

The proposed Adaptive Hybrid Attention Mechanism (AHA) is designed to dynamically recalibrate the importance of both channel-wise and spatial features, thereby enhancing the network’s ability to extract discriminative representations for underwater sonar image segmentation. AHA consists of the following two main branches: channel attention and spatial attention. The channel attention branch focuses on modeling inter-channel dependencies, treating each channel as a potential feature detector and adaptively emphasizing informative channels. To compute channel attention, we aggregate spatial information using both global average pooling and global max pooling, thereby generating two distinct descriptors. These descriptors are passed through a shared Multi-Layer Perceptron (MLP) with one hidden layer, and their outputs are merged via element-wise summation to form the channel attention map.
As shown in Figure 4, given an input feature map $F \in \mathbb{R}^{C \times H \times W}$, the global average pooling and global max pooling across the spatial dimensions are computed as follows:
$$F_{\mathrm{avg}}^{c} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} F_{c,i,j}$$
$$F_{\mathrm{max}}^{c} = \max_{i,j}\left(F_{c,i,j}\right)$$
where $F_{c,i,j}$ is the value at channel $c$, row $i$, and column $j$. The aggregated descriptors $F_{\mathrm{avg}}^{c}$ and $F_{\mathrm{max}}^{c}$ ($\in \mathbb{R}^{C}$) are then processed by a shared MLP, expressed as follows:
$$W_c = \sigma\left(W_{c2} \cdot \mathrm{MLP}\left(W_{c1} \cdot F_{\mathrm{avg}} + W_{c1} \cdot F_{\mathrm{max}} + b_{c1}\right) + b_{c2}\right)$$
where $W_{c1} \in \mathbb{R}^{\frac{C}{r} \times C}$, $W_{c2} \in \mathbb{R}^{C \times \frac{C}{r}}$, $b_{c1}$ and $b_{c2}$ are bias terms, $r$ is the reduction ratio, and $\sigma$ denotes the sigmoid activation. The resulting channel attention weights $W_c$ are used to recalibrate the original feature map as follows:
$$F_C = F \cdot W_c$$
where $F_C$ is the channel-refined feature map. For the spatial attention branch, the focus is on modeling inter-spatial dependencies. Given $F_C$, we first aggregate channel information using average pooling and max pooling along the channel axis, concatenate the outputs, and apply convolutional layers to generate the following spatial attention map:
$$F_{\mathrm{cat}} = \mathrm{Cat}\left(\mathrm{AvgPool}(F_C), \mathrm{MaxPool}(F_C)\right)$$
$$W_s = \sigma\left(\mathrm{Conv}_{3 \times 3}\left(\sigma\left(\mathrm{Conv}_{7 \times 7}\left(F_{\mathrm{cat}}\right)\right)\right)\right)$$
where $W_s \in \mathbb{R}^{1 \times H \times W}$ is the spatial attention map, and $\sigma$ is the sigmoid function. The spatial attention weights $W_s$ are then applied to $F_C$ as follows:
$$F_{\mathrm{AHA}} = F_C \cdot W_s$$
In summary, the final AHA output is obtained by sequentially applying channel and spatial attention to the input feature map $F$ as follows:
$$F_{\mathrm{AHA}} = F \cdot W_c \cdot W_s$$
where $F$ is the input feature map, and $W_c$ and $W_s$ are the learned channel and spatial attention weights, respectively. By adaptively integrating both channel and spatial attention, the proposed AHA module effectively enhances the representational power of the network, facilitating more accurate segmentation of complex underwater sonar images.
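The listing below is a minimal PyTorch sketch of the AHA equations above. Only the reduction ratio $r = 16$ and the 7 × 7 and 3 × 3 spatial kernels are taken from the paper; the class name and the exact layer arrangement are assumptions.

```python
import torch
import torch.nn as nn

class AdaptiveHybridAttention(nn.Module):
    """Channel attention followed by spatial attention (a sketch of the AHA module)."""
    def __init__(self, channels, r=16):
        super().__init__()
        # Shared MLP for channel attention (roles of W_c1 and W_c2, reduction ratio r).
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
        )
        # Convolutions producing the spatial attention map W_s: 7x7, sigmoid, 3x3.
        self.spatial = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
            nn.Conv2d(1, 1, kernel_size=3, padding=1),
        )

    def forward(self, f):                                # f: (B, C, H, W)
        b, c, _, _ = f.shape
        # Channel attention: average- and max-pooled descriptors through the shared MLP.
        avg = self.mlp(f.mean(dim=(2, 3)))
        mx = self.mlp(f.amax(dim=(2, 3)))
        w_c = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        f_c = f * w_c                                    # channel-refined features F_C
        # Spatial attention: channel-wise average and max maps, concatenated.
        cat = torch.cat([f_c.mean(dim=1, keepdim=True),
                         f_c.amax(dim=1, keepdim=True)], dim=1)
        w_s = torch.sigmoid(self.spatial(cat))           # (B, 1, H, W)
        return f_c * w_s                                 # F_AHA
```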

3.3. Global Feature Enhancement Module

In U-Net-based underwater sonar image segmentation models, skip connections are commonly used to concatenate features from the encoder and decoder, enabling the network to leverage multi-scale information. Enhancing skip connections—such as by embedding attention mechanisms or increasing the receptive field through larger or dilated convolutions—has been shown to improve segmentation performance. Given the dual-encoder structure in our framework, effective fusion of features from different encoding paths becomes crucial for maximizing segmentation accuracy. Recent studies indicate that combining 5 × 5 and 7 × 7 convolutions can better capture the structural characteristics of underwater sonar targets. Motivated by this, we propose a Global Feature Enhancement Block to adaptively integrate global and spatial information.
As illustrated in Figure 5, the proposed module first applies 5 × 5 and 7 × 7 convolutions to the global feature map $T_g$ from the global encoding path. The results are then processed with global average pooling and a sigmoid activation to generate channel-wise weight maps $\lambda_{5 \times 5}$ and $\lambda_{7 \times 7}$, each representing the importance of the respective convolutional features as follows:
$$\lambda_{5 \times 5} = \mathrm{Sigmoid}\left(\mathrm{AvePool}_{1 \times 1}\left(\mathrm{Conv}_{5 \times 5}\left(T_g\right)\right)\right)$$
$$\lambda_{7 \times 7} = \mathrm{Sigmoid}\left(\mathrm{AvePool}_{1 \times 1}\left(\mathrm{Conv}_{7 \times 7}\left(T_g\right)\right)\right)$$
where $T_g$ is the global feature map, $\mathrm{AvePool}_{1 \times 1}(\cdot)$ denotes global average pooling, and $\mathrm{Sigmoid}(\cdot)$ is the sigmoid activation function. Next, the spatial feature map $T_s$ from the spatial encoding path is processed with 5 × 5 and 7 × 7 convolutions, and the resulting feature maps are multiplied by their respective global weights $\lambda_{5 \times 5}$ and $\lambda_{7 \times 7}$ in a channel-wise fashion. The enhanced features are then fused with the original spatial features to obtain the globally enhanced spatial feature map $G_{\mathrm{ge}}$ as follows:
$$G_{\mathrm{ge}} = \mathrm{Conv}_{5 \times 5}(T_s) * \lambda_{5 \times 5} + \mathrm{Conv}_{7 \times 7}(T_s) * \lambda_{7 \times 7} + T_s$$
where $\mathrm{Conv}_{5 \times 5}(\cdot)$ and $\mathrm{Conv}_{7 \times 7}(\cdot)$ denote 5 × 5 and 7 × 7 convolutions, respectively, and $*$ indicates channel-wise multiplication. This global enhancement strategy enables the network to adaptively integrate global context into spatial features at multiple scales, thereby significantly improving the segmentation accuracy and robustness in complex underwater sonar environments.
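A minimal PyTorch sketch of the Global Feature Enhancement Block described above follows; the class and argument names are illustrative, and keeping the channel count unchanged in each branch is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalEnhancementBlock(nn.Module):
    """Channel-wise weights from the global path recalibrate multi-scale
    convolutions of the spatial path (a sketch of the GEM/GE block)."""
    def __init__(self, channels):
        super().__init__()
        self.conv5_g = nn.Conv2d(channels, channels, kernel_size=5, padding=2)
        self.conv7_g = nn.Conv2d(channels, channels, kernel_size=7, padding=3)
        self.conv5_s = nn.Conv2d(channels, channels, kernel_size=5, padding=2)
        self.conv7_s = nn.Conv2d(channels, channels, kernel_size=7, padding=3)

    def forward(self, t_g, t_s):                 # global map T_g, spatial map T_s
        # lambda_{5x5}, lambda_{7x7}: global average pooling + sigmoid on the global path.
        lam5 = torch.sigmoid(F.adaptive_avg_pool2d(self.conv5_g(t_g), 1))   # (B, C, 1, 1)
        lam7 = torch.sigmoid(F.adaptive_avg_pool2d(self.conv7_g(t_g), 1))
        # G_ge: recalibrated 5x5 and 7x7 spatial features plus the residual T_s term.
        return self.conv5_s(t_s) * lam5 + self.conv7_s(t_s) * lam7 + t_s
```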

4. Experiments and Results

4.1. Dataset

To evaluate the effectiveness of the proposed method, we conducted experiments on a dedicated underwater sonar image dataset. The constructed dataset contains five categories of target objects with varying scales, including airplanes, ships, cars, humans, and mines. Representative examples from each category are presented in Figure 6, and the distribution of image counts per category is shown in Figure 7.
The side-scan sonar dataset consists of 300 underwater sonar images, each with a resolution of 512 × 512 pixels. This resolution is sufficient to capture detailed information about the underwater environment, providing a solid foundation for accurate image segmentation. Of the 300 images, 270 were used for training to enable the model to fully learn the diverse characteristics of underwater sonar images, while the remaining 30 images were reserved for testing to assess the model’s generalization and segmentation performance. To ensure the reliability of the evaluation, all test images were manually annotated by an expert. These annotations serve as the ground truth for quantitative comparisons with the model’s segmentation results, providing a rigorous basis for performance assessment.

4.2. Parameter Settings and Implementation Details

All experiments were conducted on a workstation equipped with an Intel(R) Xeon(R) CPU E5-2683 v4 and an NVIDIA RTX 3090-24G GPU. The PyTorch deep learning framework (version 1.13.1) was used, running on Python 3.8 and CUDA 12.3. All convolutional layers in the network were initialized using the He normal initializer [50]. Bias terms were set to zero unless otherwise specified. This strategy ensures stable model convergence and improved training performance. To enhance the robustness and generalization ability of the model, we applied several data augmentation techniques to the training dataset. Specifically, we used random horizontal and vertical flipping, random rotations within ±20°, and random cropping. These augmentations were applied on-the-fly during training to increase the variety of training samples and mitigate overfitting. The Adam optimizer was used with an initial learning rate of 0.01 and a weight decay of 0.005 to prevent overfitting. The learning rate was dynamically adjusted using a cosine annealing strategy [51] to accelerate convergence and improve performance. The training process consisted of 10 cycles, with each cycle comprising 60 epochs. All input images were resized to 512 × 512 pixels during both training and testing. The batch size was set to 4 to maximize GPU utilization and computational efficiency. In the Adaptive Hybrid Attention (AHA) module, the reduction ratio r was set to 16, which balances model capacity and computational cost. Other hyperparameters, such as the kernel size for convolutional operations in the global enhancement module, were selected based on standard practice and empirical validation (i.e., 5 × 5 and 7 × 7 convolutions). During training, the original images were reshaped and fed into the network. In the testing phase, segmentation results were reshaped to their original size for quantitative evaluation against the ground truth. To optimize the network parameters, the cross-entropy loss function was employed to measure the discrepancy between the predicted output $y^*$ and the ground truth $y$. The cross-entropy loss is defined as follows:
$$L(y^*, y) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log y_i^* + (1 - y_i) \log\left(1 - y_i^*\right) \right]$$
where N is the total number of pixels in the image. All these implementation details have been provided to ensure the reproducibility of our experiments.
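For reproducibility, the fragment below sketches a training loop consistent with the reported settings. The `model` and `train_loader` objects are hypothetical placeholders, the model is assumed to output per-pixel probabilities in [0, 1], and the warm-restart scheduler is our reading of the cited cosine annealing strategy [51].

```python
import torch
import torch.nn as nn

# Training-loop sketch: Adam (lr 0.01, weight decay 0.005), cosine annealing with
# warm restarts, 10 cycles of 60 epochs, pixel-wise binary cross-entropy.
def train(model, train_loader, device="cuda", cycles=10, epochs_per_cycle=60):
    model.to(device)
    criterion = nn.BCELoss()  # binary cross-entropy, matching the loss defined above
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=0.005)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
        optimizer, T_0=epochs_per_cycle)       # one restart per training cycle
    for epoch in range(cycles * epochs_per_cycle):
        for images, masks in train_loader:     # masks: binary ground-truth maps
            images, masks = images.to(device), masks.to(device)
            loss = criterion(model(images), masks)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```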

4.3. Evaluation Metrics

To comprehensively assess the performance of our method and compare it with other approaches, we adopted the following nine widely used evaluation metrics for image segmentation: Accuracy (Acc), Precision, Recall, Maximum F-Measure (Max F-Measure), Structural Measurement Score (SMeasure), Mean Intersection over Union (MIoU), $F_1$-score, Mean Absolute Error (MAE), and the Area Under the Precision–Recall (PR) curve (AUPR). Accuracy (Acc) measures the proportion of correctly classified pixels, including both foreground and background, and it serves as an overall indicator of model performance, expressed as follows:
$$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}$$
where $TP$, $TN$, $FP$, and $FN$ denote the numbers of true positive, true negative, false positive, and false negative pixels, respectively. Precision quantifies the proportion of correctly identified positive pixels among all pixels predicted as positive, while Recall (or Sensitivity) reflects the proportion of actual positive pixels that are correctly detected, expressed as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$
The F-measure, including $F_\beta$ and the $F_1$-score, represents a weighted harmonic mean of Precision and Recall, providing a balanced evaluation of segmentation quality. It is defined as follows:
$$F_\beta = \frac{(1 + \beta^2) \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2 \cdot \mathrm{Precision} + \mathrm{Recall}}$$
where $\beta$ is typically set to 0.3 to emphasize precision, and the $F_1$-score is obtained by setting $\beta = 1$. The Structural Measurement Score (SMeasure) evaluates the structural similarity between the predicted segmentation and the ground truth, considering edge connectivity, regional integrity, and boundary alignment. It is calculated as a weighted combination of object-aware and region-aware similarity components. Mean Absolute Error (MAE) measures the average absolute difference between the predicted segmentation and the ground truth, intuitively reflecting the overall prediction error as follows:
$$\mathrm{MAE} = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \left| S(x, y) - G(x, y) \right|$$
where $S(x, y)$ and $G(x, y)$ denote the predicted and ground truth values at pixel $(x, y)$, and $W$ and $H$ are the image width and height. Mean Intersection over Union (MIoU) evaluates the overlap between the predicted segmentation and the ground truth at the pixel level, providing a fair assessment across different target sizes, expressed as follows:
$$\mathrm{MIoU} = \frac{TP}{TP + FP + FN}$$
Finally, we use the Precision–Recall (PR) curve and its Area Under the Curve (AUPR) to analyze the trade-off between precision and recall across different thresholds. The AUPR provides a single scalar value, ranging from 0 to 1, with higher values indicating better overall segmentation performance. These metrics collectively provide a comprehensive and rigorous evaluation of segmentation accuracy, structural similarity, and error, ensuring fair comparison between different models in the context of sonar image segmentation.
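As a reference, the snippet below sketches how the threshold-dependent metrics above can be computed for a single prediction map. The 0.5 binarization threshold and the small epsilon guard are our own assumptions; the PR curve and AUPR would additionally require sweeping the threshold.

```python
import numpy as np

def segmentation_metrics(pred, gt, threshold=0.5, beta=0.3, eps=1e-8):
    """Pixel-wise metrics for one prediction map `pred` (values in [0, 1]) and a
    binary ground-truth mask `gt`. MAE uses the continuous map; the remaining
    metrics use the binarized mask."""
    mae = np.abs(pred - gt).mean()                       # mean absolute error over W x H pixels
    p = (pred >= threshold).astype(np.uint8)
    g = gt.astype(np.uint8)
    tp = np.sum((p == 1) & (g == 1))
    tn = np.sum((p == 0) & (g == 0))
    fp = np.sum((p == 1) & (g == 0))
    fn = np.sum((p == 0) & (g == 1))
    acc = (tp + tn) / (tp + tn + fp + fn + eps)
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    iou = tp / (tp + fp + fn + eps)                      # per-image IoU; MIoU averages over images
    return {"MAE": mae, "Acc": acc, "Precision": precision, "Recall": recall,
            "F_beta": f_beta, "F1": f1, "IoU": iou}
```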

4.4. Comparative Experiments

To comprehensively evaluate the performance of the proposed DE-AHA-GE-Net model on underwater sonar image segmentation, we conducted a series of rigorous comparative experiments. All experiments were performed on the constructed underwater sonar dataset, which features diverse environments and a variety of sonar target categories, thereby providing a robust foundation for assessing model generalization and segmentation capabilities. For fair and thorough comparison, we reproduced several state-of-the-art segmentation methods under identical experimental conditions. The compared methods include Amulet [20], CapSal [42], CPD [19], DGRL [22], PAGRN [16], PiCANet [21], Pool-Net [23], SRM [17], U2-Net [24], and UCF [18]. All models were evaluated using the same dataset splits, preprocessing pipeline, and evaluation metrics as our proposed approach, ensuring the reliability and objectivity of the results.
Quantitative Comparison: The detailed quantitative results are summarized in Table 1. We used nine representative metrics, including Accuracy (Acc), Precision, Recall, Max $F_1$-score (MaxF), Structural Measure (SM), Mean Intersection over Union (MIoU), $F_1$-score, Mean Absolute Error (MAE), and Area Under the Precision–Recall Curve (AUPR), to comprehensively assess segmentation performance from multiple perspectives. The results demonstrate that DE-AHA-GE-Net outperforms all competing methods across nearly all metrics. Specifically, our model achieved an accuracy of 98.77%, a precision of 93.59%, and a recall of 91.85%, indicating both high overall correctness and strong sensitivity in segmenting target regions. The MaxF score reached 93.06%, reflecting an excellent balance between precision and recall. The model also excelled in structural preservation, as indicated by an SM of 74.25%, and it outperformed others in pixel-level overlap with a MIoU of 86.37%. The $F_1$-score of 92.55% further confirms the robustness of the segmentation performance. Notably, our method achieved a very low MAE of 2.35%, demonstrating superior prediction accuracy, while the AUPR of 98.51% highlights exceptional reliability across different threshold settings. While some baseline methods, such as UCF, performed well in terms of precision, they lagged behind in other critical metrics, especially MAE, where UCF’s error was 2.42 percentage points higher than that of DE-AHA-GE-Net. This discrepancy may lead to significant error accumulation in practical applications. Furthermore, as shown in Figure 8, our approach yielded the most favorable PR curve and the highest AUPR among all competitors, underscoring its advantage in sonar image segmentation.
Qualitative Comparison: In practical applications, traditional synthetic aperture techniques (e.g., the sub-bottom profiler experiment in [52]) show that detecting low-contrast targets (such as objects buried in sediments) relies on high-SNR data. In addition to quantitative evaluation, we conducted visual comparisons to further illustrate the segmentation performance differences among the various methods. As presented in Figure 9, we selected representative results from ten methods under the same experimental conditions. While many models could segment major targets, several struggled to filter out noise in complex underwater scenes (e.g., CapSal, DGRL, and SRM). Other models, such as PAGRN and UCF, were effective in removing irrelevant noise, but at the expense of losing crucial spatial and edge details. This can be attributed to an imbalance between noise suppression and the preservation of global structural information in challenging sonar environments. In contrast, our DE-AHA-GE-Net, equipped with an adaptive hybrid attention mechanism, dynamically adjusts channel and spatial weights to enhance the response in target regions, especially under low signal-to-noise ratio (SNR) conditions. This design effectively suppresses background noise while maintaining accurate segmentation of significant targets. In summary, both quantitative and qualitative analyses demonstrate that DE-AHA-GE-Net achieves excellent performance in underwater sonar image segmentation. Its ability to eliminate irrelevant noise and precisely segment target regions is particularly advantageous for practical applications in complex and dynamic underwater environments.

4.5. Ablation Experiments

To validate the effectiveness of each component in the proposed SonarNet, we conducted ablation experiments on the underwater sonar dataset, with U-Net serving as the baseline model [53]. The ablation study was designed to evaluate the individual and combined contributions of the dual encoder (DE), Adaptive Hybrid Attention mechanism (AHA), and Global Feature Enhancement block (GE) to the overall performance.
For the quantitative evaluation, we incrementally integrated the DE, AHA, and GE modules into the baseline, and we assessed the resulting models using nine representative performance metrics. The detailed results are reported in Table 2 and illustrated in Figure 10. The inclusion of the dual encoder (DE) led to notable improvements across all metrics, with precision increasing from 83.39% to 88.18%. This demonstrates that DE is effective in enhancing saliency detection by preserving global contextual information during feature extraction. Importantly, the F1-score and MIoU improved by 3.10% and 2.28% respectively, while MAE decreased from 3.84% to 3.51%, confirming the positive impact of DE on segmentation accuracy and error reduction.
Building upon the DE module, the addition of the Adaptive Hybrid Attention (AHA) mechanism further improved all evaluation metrics. Specifically, accuracy rose from 98.07% to 98.42%, precision from 88.18% to 90.82%, and recall from 83.61% to 89.46%. The maximum F-measure, MIoU, and F1-score also experienced significant gains, while MAE dropped to 3.42%. These results highlight the effectiveness of AHA in focusing the model on salient regions while mitigating background noise, as reflected in the improved quantitative and visual segmentation performance. The integration of the Global Enhancement block (GE) aimed to strengthen the fusion of spatial and global features from the dual encoder, thereby enhancing the model’s capability to capture global contextual information. The experimental results indicate that the inclusion of GE led to further improvements in most metrics and a reduction of MAE by 0.35%. Although there was a slight decrease in recall and structural measurement (SM) due to intensified global feature fusion, the overall contribution of GE to underwater sonar salient object detection remains substantial.
In addition to quantitative analysis, we conducted qualitative comparisons to visually assess the impact of each module. As shown in Figure 11, the addition of the DE module increased the model’s sensitivity to scene details but also introduced some misclassification of noise. This issue was effectively mitigated by incorporating the GE module, which enhanced the accuracy of salient target detection and reduced noise interference. The introduction of the AHA module further improved detection accuracy, enabling the precise identification of important underwater targets. Overall, both the quantitative results and visualizations affirm that each proposed component—DE, AHA, and GE—makes a significant and complementary contribution to the performance of DE-AHA-GE-Net in underwater sonar image segmentation.

5. Conclusions

In this paper, we have investigated the incorporation of global feature information for underwater sonar salient object detection and proposed a novel framework, SonarNet, that integrates several lightweight yet effective modules. Specifically, we designed a global feature extraction module (GF) and a global feature enhancement module (GE) to capture and utilize global contextual cues. The global feature extractor, based on a combination of ResBlock and self-attention mechanisms, effectively preserves and aggregates information from all regions of the feature maps, thus alleviating the loss of global context caused by successive convolution and downsampling operations. Furthermore, we introduced an adaptive hybrid attention mechanism that dynamically balances channel-wise and spatial dependencies, enabling the network to more fully exploit both global and local features. This enhances the model’s representational capacity, particularly in complex underwater environments characterized by noise and low contrast. The global enhancement block further fuses global and spatial features from dual-encoder pathways, providing the decoder with enriched contextual information and contributing to improved segmentation accuracy. Comprehensive experiments on a challenging underwater sonar dataset, including comparisons with ten state-of-the-art salient object detection methods, demonstrate that our proposed SonarNet consistently achieves superior performance across multiple quantitative and qualitative metrics. These results validate the effectiveness of our approach and highlight its potential for practical applications in underwater sonar image analysis and target detection tasks.

Author Contributions

Investigation, H.Z.; Project administration, S.R.; Funding acquisition, Q.G.; Resources, L.F.; Data curation, Software & Writing—original draft, J.L.; Supervision & Writing—review & editing, H.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Science Foundation of China under Grant 62171368; in part by the Science, Technology and Innovation of Shenzhen Municipality under Grant KJZD20230923115505011; and in part by the Science, Technology and Innovation of Shenzhen Municipality under Grant JCYJ20241202124931042.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, X.; Yang, P.; Dai, X. Focusing multireceiver SAS data based on the fourth order Legendre expansion. Circuits Syst. Signal Process. 2019, 38, 2607–2629. [Google Scholar] [CrossRef]
  2. Zhang, X.; Yang, P. Imaging algorithm for multireceiver synthetic aperture sonar. J. Electr. Eng. Technol. 2019, 14, 471–478. [Google Scholar] [CrossRef]
  3. Miller, K.A.; Singh, H.; Caiti, A. An overview of seabed mining including the current state of development, environmental impacts, and knowledge gaps. Front. Mar. Sci. 2018, 4, 312755. [Google Scholar] [CrossRef]
  4. Singh, H.; Adams, R.; White, L. Imaging underwater for archaeology. J. Field Archaeol. 2000, 27, 319–328. [Google Scholar] [CrossRef]
  5. Caiti, A.; Brown, T.; Lee, S. Innovative technologies in underwater archaeology: Field experience, open problems, and research lines. Chem. Ecol. 2006, 22 (Suppl. S1), S383–S396. [Google Scholar] [CrossRef]
  6. Shortis, M.; Harvey, E.; Abdo, D. A review of underwater stereo-image measurement for marine biology and ecology applications. Oceanogr. Mar. Biol. 2016, 269–304. [Google Scholar]
  7. Lamarche, G.; Smith, J.; Brown, T. Quantitative characterisation of seafloor substrate and bedforms using advanced processing of multibeam backscatter—Application to Cook Strait, New Zealand. Cont. Shelf Res. 2011, 31, S93–S109. [Google Scholar] [CrossRef]
  8. Hu, K.; Wang, L.; Zhang, P. Overview of underwater 3D reconstruction technology based on optical images. J. Mar. Sci. Eng. 2023, 11, 949. [Google Scholar] [CrossRef]
  9. Reggiannini, M.; Moroni, D. The use of saliency in underwater computer vision: A review. Remote Sens. 2020, 13, 22. [Google Scholar] [CrossRef]
  10. Dai, J.; Li, Y.; He, K. R-FCN: Object detection via region-based fully convolutional networks. Adv. Neural Inf. Process. Syst. 2016, 379–387. [Google Scholar]
  11. Wang, L.; Zhang, X.; Liu, N. Saliency detection with recurrent fully convolutional networks. Lect. Notes Comput. Sci. 2016, 9907, 825–841. [Google Scholar]
  12. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. Med. Image Comput. Comput.-Assist. Interv. 2015, 9351, 234–241. [Google Scholar]
  13. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  14. He, K.; Zhang, X.; Ren, S. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  15. Deng, J.; Dong, W.; Socher, R. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  16. Zhang, X.; Yang, P.; Sun, M. Progressive attention guided recurrent network for salient object detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 714–722. [Google Scholar]
  17. Wang, T.; Zhang, P.; Liu, J. A stagewise refinement model for detecting salient objects in images. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 4039–4048. [Google Scholar]
  18. Zhang, P.; Wang, T.; Liu, J. Learning uncertain convolutional features for accurate saliency detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2123–2132. [Google Scholar]
  19. Wu, Z.; Su, L.; Huang, Q. Cascaded partial decoder for fast and accurate salient object detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27–28 October 2019; pp. 3907–3916. [Google Scholar]
  20. Zhang, P.; Wang, T.; Liu, J. AMulet: Aggregating multi-level convolutional features for salient object detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 202–211. [Google Scholar]
  21. Liu, N.; Han, J.; Yang, M.-H. PiCANet: Learning pixel-wise contextual attention for saliency detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 3089–3098. [Google Scholar]
  22. Wang, T.; Zhang, P.; Liu, J. Detect globally, refine locally: A novel approach to saliency detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3127–3135. [Google Scholar]
  23. Liu, J.-J.; Hou, Q.; Cheng, M.-M. A simple pooling-based design for real-time salient object detection. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3917–3926. [Google Scholar]
  24. Qin, X.; Zhang, Z.; Huang, C. U2-Net: Going deeper with nested U-structure for salient object detection. Pattern Recognit. 2020, 106, 107404. [Google Scholar] [CrossRef]
  25. Cheng, M.-M.; Mitra, N.J.; Huang, X. Global contrast based salient region detection. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 37, 569–582. [Google Scholar] [CrossRef] [PubMed]
  26. Jiang, H.; Wang, J.; Yuan, Z. Salient object detection: A discriminative regional feature integration approach. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 2083–2090. [Google Scholar]
  27. Li, X.; Lu, H.; Xu, X. Saliency detection via dense and sparse reconstruction. In Proceedings of the 2013 IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 2976–2983. [Google Scholar]
  28. Perazzi, F.; Krahenbuhl, P.; Pritch, Y. Saliency filters: Contrast based filtering for salient region detection. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 733–740. [Google Scholar]
  29. Zhang, X.; Cao, D. Synthetic aperture image enhancement with near-coinciding Nonuniform sampling case. Comput. Electr. Eng. 2024, 120, 109818. [Google Scholar] [CrossRef]
  30. Zhang, X. An efficient method for the simulation of multireceiver SAS raw signal. Multimed. Tools Appl. 2024, 83, 37351–37368. [Google Scholar] [CrossRef]
  31. Li, G.; Yu, Y. Visual saliency based on multiscale deep features. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 5455–5463. [Google Scholar]
  32. Wang, L.; Lu, H.; Ruan, X. Deep networks for saliency detection via local estimation and global search. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3183–3192. [Google Scholar]
  33. Zhao, R.; Ouyang, W.; Li, H. Saliency detection by multi-context deep learning. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1265–1274. [Google Scholar]
  34. Lee, G.; Tai, Y.-W.; Kim, J. Deep saliency with encoded low level distance map and high level features. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 660–668. [Google Scholar]
35. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
36. Liu, N.; Han, J. DHSNet: Deep hierarchical saliency network for salient object detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 678–686.
37. Hou, Q.; Cheng, M.-M.; Hu, X. Deeply supervised salient object detection with short connections. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5300–5309.
38. Luo, Z.; Mishra, A.; Achkar, A. Non-local deep features for salient object detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6593–6601.
39. Zhang, L.; Dai, J.; Lu, H. A bi-directional message passing model for salient object detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 1741–1750.
40. Xiao, H.; Feng, J.; Wei, Y. Deep salient object detection with dense connections and distraction diagnosis. IEEE Trans. Multimed. 2018, 20, 3239–3251.
41. Zhao, H.; Shi, J.; Qi, X. Pyramid scene parsing network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6230–6239.
42. Zhang, L.; Wang, T.; Liu, J. CapSal: Leveraging captioning to boost semantics for salient object detection. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 6024–6033.
43. Zeng, Y.; Zhuge, Y.; Lu, H. Multi-source weak supervision for saliency detection. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 6067–6076.
44. Feng, M.; Lu, H.; Ding, E. Attentive feedback network for boundary-aware salient object detection. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 1623–1632.
45. Qin, X.; Zhang, Z.; Huang, C. BASNet: Boundary-aware salient object detection. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 7479–7489.
46. Woo, S.; Park, J.; Lee, J.-Y. CBAM: Convolutional block attention module. Lect. Notes Comput. Sci. 2018, 11211, 3–19.
47. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
48. Li, Y.; Zhang, Z.; Liu, J. Dual encoder-based dynamic-channel graph convolutional network with edge enhancement for retinal vessel segmentation. IEEE Trans. Med. Imaging 2022, 41, 1975–1989.
49. Zhang, X.; Yang, P.; Feng, X.; Sun, H. Efficient imaging method for multireceiver SAS. IET Radar Sonar Navig. 2022, 16, 1470–1483.
50. He, K.; Zhang, X.; Ren, S. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1026–1034.
51. Loshchilov, I.; Hutter, F. SGDR: Stochastic gradient descent with warm restarts. arXiv 2016, arXiv:1608.03983.
52. Zhang, X.; Yang, P.; Sun, M. Experiment results of a novel sub-bottom profiler using synthetic aperture technique. Curr. Sci. 2022, 122, 461–464.
53. Wu, H.; Chen, S.; Wang, G. SCS-Net: A scale and context sensitive network for retinal vessel segmentation. Med. Image Anal. 2021, 70, 102025.
Figure 1. Visual analysis of underwater sonar images using different deep learning methods; (a) raw image; (b) ground truth; (c) CapSal; (d) PAGRN; (e) UCF; (f) our method.
Figure 3. Illustration of the global feature extraction module. The module first applies adaptive average pooling to the spatial feature map, followed by a self-attention mechanism that generates the query (Q), key (K), and value (V) matrices. The softmax-normalized attention scores are then used to compute global contextual features, which are upsampled and fused with the original spatial features to enhance global information representation.
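To make the data flow described in Figure 3 concrete, the following is a minimal PyTorch sketch of a pooled self-attention block of this kind. The pooled grid size, the single attention head, the 1 × 1 projection convolutions, and the additive fusion with the input are illustrative assumptions, not the exact configuration used in SonarNet.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalFeatureExtraction(nn.Module):
    """Sketch of a pooled self-attention module in the spirit of Figure 3."""

    def __init__(self, channels: int, pooled_size: int = 8):
        super().__init__()
        self.pooled_size = pooled_size
        # 1x1 convolutions produce the query, key, and value projections.
        self.to_q = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_k = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_v = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Adaptive average pooling reduces the spatial map to a small grid.
        pooled = F.adaptive_avg_pool2d(x, self.pooled_size)              # (B, C, s, s)
        q = self.to_q(pooled).flatten(2).transpose(1, 2)                 # (B, s*s, C)
        k = self.to_k(pooled).flatten(2)                                 # (B, C, s*s)
        v = self.to_v(pooled).flatten(2).transpose(1, 2)                 # (B, s*s, C)
        # Softmax-normalized attention scores over the pooled positions.
        attn = torch.softmax(q @ k / c ** 0.5, dim=-1)                   # (B, s*s, s*s)
        global_ctx = (attn @ v).transpose(1, 2).reshape(
            b, c, self.pooled_size, self.pooled_size)
        # Upsample the global context and fuse it with the original spatial features.
        global_ctx = F.interpolate(global_ctx, size=(h, w),
                                   mode="bilinear", align_corners=False)
        return x + global_ctx
```

Pooling before attention keeps the attention matrix small (s² × s² rather than HW × HW), which is the usual motivation for this arrangement on large sonar feature maps.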
Figure 4. Illustration of the Adaptive Hybrid Attention Mechanism (AHA). The module first computes channel attention by applying global average pooling and max pooling, followed by multi-layer perceptrons and sigmoid activation. In parallel, spatial attention is generated by concatenating spatially pooled features and processing them with convolutional layers and non-linear activations. The final attention maps are used to adaptively recalibrate the input feature map along both channel and spatial dimensions.
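A compact sketch of a channel-plus-spatial attention block of the kind shown in Figure 4, in the spirit of CBAM [46], is given below. The shared two-layer MLP, the reduction ratio of 16, and the 7 × 7 spatial kernel are assumptions for illustration rather than the paper's exact settings.

```python
import torch
import torch.nn as nn


class AdaptiveHybridAttention(nn.Module):
    """Sketch of a hybrid channel/spatial attention block (cf. Figure 4)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Shared MLP for the channel-attention branch.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # Convolution over the concatenated channel-wise mean/max maps.
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        # Channel attention: global average and max pooling, MLP, sigmoid.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        channel_att = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        x = x * channel_att
        # Spatial attention: pool along the channel axis, concatenate, convolve, sigmoid.
        spatial = torch.cat([x.mean(dim=1, keepdim=True),
                             x.amax(dim=1, keepdim=True)], dim=1)
        spatial_att = torch.sigmoid(self.spatial_conv(spatial))
        return x * spatial_att
```

Applying the channel weights before computing the spatial map, as above, is one common ordering; the caption does not fix this detail.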
Figure 5. Illustration of the Global Enhancement Block. Channel-wise weights are generated from global feature maps using 5 × 5 and 7 × 7 convolutions, global average pooling, and sigmoid activation. These weights are then used to recalibrate the corresponding spatial feature maps, which are fused together to produce globally enhanced output features for improved segmentation.
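The recalibration described in Figure 5 can be sketched as follows. Only the 5 × 5/7 × 7 convolutions, global average pooling, and sigmoid gating are stated in the caption, so the per-branch weighting and the additive fusion of the two recalibrated maps are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalEnhancementBlock(nn.Module):
    """Sketch of channel-wise recalibration of spatial features by global features."""

    def __init__(self, channels: int):
        super().__init__()
        # Two parallel convolutions over the global feature map.
        self.conv5 = nn.Conv2d(channels, channels, kernel_size=5, padding=2)
        self.conv7 = nn.Conv2d(channels, channels, kernel_size=7, padding=3)

    def forward(self, global_feat: torch.Tensor,
                spatial_feat: torch.Tensor) -> torch.Tensor:
        # Channel-wise weights: convolution -> global average pooling -> sigmoid.
        w5 = torch.sigmoid(F.adaptive_avg_pool2d(self.conv5(global_feat), 1))
        w7 = torch.sigmoid(F.adaptive_avg_pool2d(self.conv7(global_feat), 1))
        # Recalibrate the spatial feature map with each weight vector and fuse.
        return spatial_feat * w5 + spatial_feat * w7
```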
Figure 6. Examples of different categories of images in the underwater sonar dataset.
Figure 7. Number of images for different categories.
Figure 8. Precision–Recall curves of our model and other typical state-of-the-art models on the underwater sonar dataset.
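For reference, Precision–Recall curves such as those in Figures 8 and 10 can be traced by sweeping a binarization threshold over the predicted saliency map. The sketch below assumes NumPy arrays in [0, 1] and a 255-level threshold grid, which is a common but not universal convention.

```python
import numpy as np


def precision_recall_curve(pred: np.ndarray, gt: np.ndarray, n_thresholds: int = 255):
    """Trace a PR curve by thresholding a saliency map against a binary mask."""
    gt = gt.astype(bool)
    precisions, recalls = [], []
    for t in np.linspace(0.0, 1.0, n_thresholds, endpoint=False):
        binary = pred >= t
        tp = np.logical_and(binary, gt).sum()
        precisions.append(tp / max(binary.sum(), 1))   # avoid division by zero
        recalls.append(tp / max(gt.sum(), 1))
    return np.array(precisions), np.array(recalls)
```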
Figure 9. Visual comparison with 10 other methods, where (a) image, (b) gt, (c) ours, (d) U2Net, (e) PoolNet, (f) CapSal, (g) CPD, (h) DGRL, (i) PAGRN, (j) PiCANet, (k) SRM, (l) UCF, and (m) Amulet.
Figure 10. Precision–Recall curves of ablation experiments.
Figure 11. Visual effect comparison of ablation experiments, where (a) image, (b) gt, (c) UNet, (d) UNet+DE, (e) UNet+DE+GE, (f) UNet+AHA, and (g) UNet+DE+AHA+GE.
Table 1. Comparison with 10 state-of-the-art methods across 9 evaluation metrics, where red and green indicate the best and second-best results, respectively.

| Method  | Acc (%) | Precision (%) | Recall (%) | MaxF (%) | SM (%) | MIoU (%) | F1 (%) | MAE (%) | AUPR (%) |
|---------|---------|---------------|------------|----------|--------|----------|--------|---------|----------|
| Amulet  | 97.87   | 85.57         | 84.45      | 83.56    | 71.14  | 74.89    | 83.21  | 3.70    | 95.62    |
| CapSal  | 97.59   | 80.65         | 87.13      | 81.20    | 70.12  | 73.57    | 82.23  | 3.99    | 94.71    |
| CPD     | 98.35   | 89.41         | 86.43      | 88.31    | 73.16  | 80.21    | 87.42  | 2.95    | 97.59    |
| DGRL    | 98.31   | 88.45         | 87.32      | 87.95    | 73.16  | 80.33    | 87.57  | 3.04    | 97.38    |
| PAGRN   | 98.02   | 90.31         | 84.89      | 88.12    | 71.79  | 77.34    | 86.32  | 4.26    | 96.01    |
| PiCANet | 98.18   | 92.52         | 80.41      | 89.07    | 73.01  | 77.38    | 85.65  | 3.07    | 97.62    |
| PoolNet | 98.34   | 89.55         | 90.05      | 89.14    | 73.63  | 81.25    | 88.99  | 2.47    | 97.57    |
| SRM     | 97.72   | 81.03         | 88.65      | 81.75    | 71.66  | 74.27    | 83.46  | 3.76    | 95.98    |
| U2-Net  | 98.31   | 90.10         | 84.15      | 88.43    | 73.53  | 78.93    | 86.74  | 2.52    | 97.63    |
| UCF     | 96.52   | 94.14         | 66.83      | 84.77    | 69.02  | 65.25    | 76.94  | 4.77    | 94.54    |
| Ours    | 98.77   | 93.59         | 91.85      | 93.06    | 74.25  | 86.37    | 92.55  | 2.35    | 98.51    |
Table 2. Comparison of ablation experiments across 9 evaluation metrics.

| Method         | Acc (%) | Precision (%) | Recall (%) | MaxF (%) | SM (%) | MIoU (%) | F1 (%) | MAE (%) | AUPR (%) |
|----------------|---------|---------------|------------|----------|--------|----------|--------|---------|----------|
| UNet           | 98.07   | 83.39         | 83.61      | 83.06    | 71.24  | 75.02    | 82.97  | 3.84    | 96.37    |
| UNet+DE        | 97.91   | 88.18         | 86.90      | 86.51    | 72.37  | 77.30    | 86.07  | 3.51    | 95.88    |
| UNet+DE+GE     | 98.32   | 90.30         | 84.96      | 88.61    | 72.34  | 79.66    | 87.08  | 3.16    | 97.18    |
| UNet+AHA       | 98.42   | 90.82         | 89.46      | 90.39    | 72.54  | 82.51    | 89.99  | 3.42    | 97.46    |
| UNet+DE+AHA+GE | 98.77   | 93.59         | 91.85      | 93.06    | 74.25  | 86.37    | 92.55  | 2.35    | 98.51    |
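Tables 1 and 2 report standard pixel-wise saliency-segmentation metrics. The sketch below gives common definitions of Acc, Precision, Recall, F1, MIoU, and MAE, assuming a fixed 0.5 binarization threshold; the paper's exact evaluation protocol (e.g., for MaxF and SM) is not reproduced here.

```python
import numpy as np


def segmentation_metrics(pred: np.ndarray, gt: np.ndarray, threshold: float = 0.5) -> dict:
    """Common metric definitions for a saliency map in [0, 1] and a binary mask."""
    gt = gt.astype(bool)
    # MAE is computed on the raw (non-binarized) prediction.
    mae = np.abs(pred - gt.astype(float)).mean()
    binary = pred >= threshold
    tp = np.logical_and(binary, gt).sum()
    fp = np.logical_and(binary, ~gt).sum()
    fn = np.logical_and(~binary, gt).sum()
    tn = np.logical_and(~binary, ~gt).sum()
    acc = (tp + tn) / gt.size
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    # Mean IoU averaged over the foreground and background classes.
    miou = 0.5 * (tp / max(tp + fp + fn, 1) + tn / max(tn + fp + fn, 1))
    return {"Acc": acc, "Precision": precision, "Recall": recall,
            "F1": f1, "MIoU": miou, "MAE": mae}
```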
