Article

TriEncoderNet: Multi-Stage Fusion of CNN, Transformer, and HOG Features for Forward-Looking Sonar Image Segmentation

School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an 710072, China
*
Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2025, 13(12), 2295; https://doi.org/10.3390/jmse13122295
Submission received: 17 October 2025 / Revised: 29 November 2025 / Accepted: 30 November 2025 / Published: 3 December 2025
(This article belongs to the Section Ocean Engineering)

Abstract

Forward-looking sonar (FLS) image segmentation is essential for underwater exploration, yet it remains challenging due to low contrast, ambient noise, and complex backgrounds, which existing traditional and deep learning-based methods fail to address effectively. This paper presents TriEncoderNet, a novel model that simultaneously extracts local, global, and edge-related features through three parallel encoders. Specifically, the model integrates a convolutional neural network (CNN) for local feature extraction, a transformer for global context modeling, and a histogram of oriented gradients (HOG) encoder for edge and shape detection. The key innovations of TriEncoderNet include the CrossFusionTransformer (CFT) module, which effectively integrates local and global features to capture both fine details and comprehensive context, and the HOG attention gate (HAG) module, which enhances edge detection and preserves semantic consistency across diverse feature types. Additionally, TriEncoderNet introduces the hierarchical efficient transformer (HETransformer) with a lightweight multi-head self-attention mechanism to reduce computational overhead while maintaining global context modeling capability. Experimental results on the marine debris dataset and UATD dataset demonstrate the superior performance of TriEncoderNet. Specifically, it achieves an mIoU of 0.793 and mAP of 0.916 on the marine debris dataset, and an mIoU of 0.582 and mAP of 0.687 on the UATD dataset, outperforming state-of-the-art methods in both segmentation accuracy and robustness in challenging underwater environments.

1. Introduction

With the continued development of ocean engineering, seabed exploration and marine mapping have become critical components of marine research. Underwater sonar detection technology plays an important role in these efforts. Among these technologies, the forward-looking sonar (FLS) sensor has become an indispensable imaging device in underwater exploration due to its mature technology and low cost. FLS images effectively capture the underwater environment without being constrained by lighting conditions, offering rich visual information. Consequently, analysis and processing methods for FLS images have garnered significant attention in recent years [1,2]. One of the key tasks in this domain is FLS image segmentation, which serves as a foundational step for underwater detection and advanced semantic analysis [3]. While natural image segmentation has achieved remarkable progress in recent years, these algorithms cannot be directly applied to FLS images due to their unique characteristics. FLS image segmentation presents two major challenges: (1) Due to the influence of the water’s physical properties on acoustic wave propagation, phenomena such as attenuation and scattering occur, leading to a reduction in the contrast between the background and targets in FLS images, which weakens the semantic information. (2) Underwater environmental noise and ocean reverberation can cause significant interference, and the complex background often makes target feature extraction challenging [4,5,6].
Sonar image segmentation methods are mainly divided into two categories: traditional methods and deep learning-based methods. Traditional sonar image segmentation methods mainly include thresholding-based, clustering-based, Markov random field (MRF)-based, and level set-based methods [7,8,9,10,11,12]. For example, Li et al. [9] proposed an MRF-based method that formalizes the image segmentation problem as an energy minimization problem, leveraging MRF to capture statistical correlations within local regions and optimizing the energy function using a level-set approach to refine segmentation boundaries. Zhang et al. [10] proposed a level-set algorithm combining non-local means and heterogeneity filtering to improve FLS image saliency and segmentation accuracy for small underwater targets. However, the reliance on handcrafted features limits the robustness and generalizability of these methods.
Among the deep learning-based methods, convolutional neural networks (CNNs) are widely used in sonar image segmentation tasks with remarkable results due to their superior feature extraction capabilities. Typical CNN-based algorithms include U-Net [13], DeepLabv3+ [14], FCNN [15] and SegNet [16]. U-Net [13] employed a U-shaped structure that extracts high-level features during encoding and restores spatial resolution during decoding. Skip connections directly transfer encoder features to the decoder, helping recover fine details. DeepLabv3+ [14] utilized dilated convolutions to expand the receptive field, effectively capturing multi-scale context and addressing challenges posed by varying object sizes and blurred edges. MAANU-Net [17] replaced traditional skip connections with an atrous pyramid structure in the encoder–decoder pipeline, enabling more effective feature fusion. It also employed a nested U-shaped structure to integrate features from multiple encoders, further enhancing segmentation accuracy. LMA-Net [18] enhanced inference speed with a lightweight multi-scale attention mechanism and introduced a multi-scale feature attention gate to fuse large spatial features with decoding-stage features, effectively preserving high-precision details from upper-layer features. Although these methods perform well in modeling local features and multi-scale information, their ability to extract global features remains limited. Effectively extracting global contextual information and fusing it with local features is therefore the key to further improving sonar image segmentation performance.
The Vision Transformer (ViT) [19] brought the Transformer architecture from the field of Natural Language Processing (NLP) to Computer Vision (CV). With its powerful global context modeling capabilities, it has achieved remarkable results in image segmentation tasks [19,20,21,22,23]. Swin Transformer [20] introduced a hierarchical structure with sliding windows, restricting self-attention to local regions and gradually expanding the window size with increasing network depth, effectively capturing multi-scale features while reducing computational complexity. RegionViT [21] integrated the strengths of Vision Transformers and regional information. By employing Regional Self-Attention, it captured local feature relationships and incorporated global information, effectively addressing the dependencies between local regions and global context. Although Transformer-based image segmentation methods demonstrate remarkable capabilities in capturing global contextual relationships, they still face challenges in representing local features effectively.
To address this limitation, some approaches incorporate CNNs to enhance the expression of local features. By integrating local and global features, these methods aim to achieve a better balance between modeling local details and global context. iFormer [24] combined Inception’s convolution and pooling operations with Transformer self-attention, optimizing image feature extraction and representation. Using a channel-splitting mechanism and frequency ramp structure, it improved efficiency while effectively modeling fine details and global information. TransUNet [25] integrated Transformer modules into the U-Net framework, combining CNNs for local feature extraction with Transformers to encode contextual sequences. By leveraging self-attention, it captured global dependencies, enabling precise detail extraction and long-range interactions for better understanding of complex image structures. Currently, only a few sonar image segmentation methods employ Transformers for global context extraction, such as SonarNet [26]. This approach combined Transformers for global context modeling with CNNs for local feature extraction. In the final stage of the decoder, it fused these two types of features with different semantic meanings, significantly improving segmentation accuracy and highlighting the importance of global context features in sonar image segmentation. While these methods effectively integrate the respective strengths of Transformers and CNNs, they are constrained by a sequential paradigm where local features are extracted first, followed by global feature extraction. This design limits the comprehensive utilization of the interdependencies between local and global features across the network layers, thereby impeding deeper interaction and integration of these features. The sonar image segmentation methods discussed above are summarized in Table 1.
Beyond single-stream architectures, ensemble-based strategies have also emerged to boost segmentation performance. For instance, Carisi et al. [27] and Dhiyanesh et al. [28] utilize bagging or stacking to aggregate predictions from independently trained models, while Das et al. [29] and Dang et al. [30] employ weighted fusion or multi-layer stacking to refine final outputs. However, these decision-level ensemble paradigms often suffer from high computational redundancy due to running multiple heavy backbones. More importantly, they typically lack deep feature interaction, as independent branches cannot correct or guide each other during the encoding process. Consequently, developing a tightly coupled, multi-branch framework that enables stage-wise deep feature fusion—allowing global and local features to dynamically modulate each other at multiple resolutions—presents a more efficient and effective alternative to simple prediction averaging.
To address the aforementioned challenges, we propose TriEncoderNet, a novel model with a three-encoder architecture, which utilizes three parallel encoders to separately extract local high-resolution features, global contextual features, and histogram of oriented gradients (HOG) features. Feature fusion is performed at each downsampling stage, where local features capture fine-grained texture information of the target, global features provide a comprehensive view of the target through expanded receptive fields, and HOG features represent the target’s local edges and shape information. Considering the significant semantic differences between these features, we specifically design the CrossFusionTransformer (CFT) module and the HOG Attention Gate (HAG). The core objective of the CFT module is to deeply integrate local perceptual features with global contextual features. Through the CFT module, we can fully exploit the complementarity between local and global information. Local features enable precise modeling of minute details, while global features help mitigate the insufficiency of local information or reduce noise interference, thus enhancing the model’s ability to adapt to complex scenes while ensuring segmentation accuracy. The HAG module fuses HOG features with image features, maintaining semantic consistency across different feature types. This fusion improves the model’s perception of edge information, optimizing segmentation accuracy in edge regions and preventing the loss of edge details. Moreover, considering that pure Transformers often introduce excessive parameters and computational overhead when extracting global features, we design a hierarchical efficient Transformer (HETransformer). Our contributions are as follows:
  • A novel model TriEncoderNet is proposed in this work, which utilizes three parallel branches to extract different types of features simultaneously. CNNs are used to extract local high-resolution features. Transformers are employed to capture global contextual information. A HOG encoder is used to extract gradient features. The segmentation performance is significantly improved by effectively fusing the advantages of these three features.
  • TriEncoderNet introduces the CFT module and the HAG module. These two modules enable the deep fusion of features with different semantic meanings at multiple scales. The CFT module enhances information representation by effectively integrating the complementarity of local and global features, while the HAG improves the model’s ability to perceive edge information, significantly boosting segmentation accuracy.
  • A HETransformer is designed in TriEncoderNet, incorporating a lightweight multi-head self-attention (LMSA) mechanism. The LMSA mechanism facilitates cross-window interactions within local regions, captures global context, and effectively reduces computational complexity.
  • Through comparative experiments on two publicly available datasets, TriEncoderNet outperforms state-of-the-art sonar segmentation algorithms, achieving the best results in terms of performance.
This paper is organized as follows: The related works are reviewed in Section 2. The proposed method, including the overall architecture and the design of each module, is detailed in Section 3. Section 4 introduces the datasets and evaluation metrics used in the experiments and presents the comparative results and visualization analysis. The ablation study is presented in Section 5. In Section 6, we draw the conclusions of this work.

2. Related Work

2.1. CNN-Based Sonar Image Segmentation Methods

CNN-based sonar image segmentation algorithms, including modular architecture [31], CNN-based encoder–decoder structure [32], Seg2Sonar network [33], CNN with dynamic multi-scale dilated convolutions [34] and dual-channel segmentation network [35], perform well in local feature extraction but often struggle with capturing global features. Without a global perspective, the model cannot fully capture the deep semantic relationships between the target and background, making it more vulnerable to noise and reducing segmentation accuracy and robustness. Therefore, integrating global features is crucial for enhancing segmentation precision and model stability.

2.2. Transformer-Based Image Segmentation Methods

Vision Transformer (ViT) [19] used self-attention mechanisms on image patches to capture global dependencies, surpassing the limitations of traditional CNNs and showcasing the potential of Transformers in visual tasks. CoaT [36] achieved cross-scale information interaction by integrating serial and parallel blocks, effectively enabling the modeling of image features from fine to coarse and vice versa. Focal Transformer [37] introduced a strategy that combined fine-grained local attention with coarse-grained global attention, facilitating the capture of both short-range and long-range dependencies. Rajani et al. [38] introduced a multi-scale encoder for sonar image segmentation, combining patch merging with efficient convolutions to enhance feature extraction while reducing computational complexity.

2.3. Hybrid Transformer–CNN Models

To extract both global and local features, various hybrid Transformer–CNN models have been developed. H2Former [39] introduced the Multi-Scale Channel Attention (MSCA) module, which dynamically selected relevant features by fusing multi-scale global and local features while modeling inter-channel dependencies. UNETR [40] used a Transformer encoder to capture global features and convolutional operations at multiple resolutions for local feature extraction. ST-UNet [41] employed a dual-encoder structure with the Swin Transformer for global modeling and CNNs for local extraction, using a Spatial Interaction Module (SIM) to model pixel-level correlations between different feature semantics. TopFormer [42] employed Vision Transformers to extract global features from a Token Pyramid. During upsampling, it fused multi-scale local tokens with global features, achieving scale-aware semantic extraction and producing high-quality segmentation results. UNetFormer [43] combined CNNs for local detail extraction and a Transformer-based decoder with a global–local attention mechanism to integrate both global and local context.
However, most existing models use only high-level features that have already been processed by CNNs, overlooking the value of low-level features for establishing long-range dependencies. In addition, the large semantic gap between local and global features makes their fusion suboptimal. Moreover, the standard Transformer requires heavy computation to extract global features, which becomes particularly inefficient when processing high-resolution images, as the computational overhead increases significantly.

3. Method

3.1. Overall Architecture

The overall architecture of the proposed TriEncoderNet is shown in Figure 1. It aims to achieve more accurate image segmentation by extracting in parallel local detail features, global features, and Histogram of Oriented Gradients (HOG) features through multiple encoders. To address the high computational overhead associated with Transformer models when processing global features, TriEncoderNet introduces a HETransformer to effectively reduce computational complexity. Additionally, TriEncoderNet proposes the CFT module, which fuses CNN features and Transformer features at each downsampling stage of the encoder. This allows low-level features from different encoders to complement each other, preserving more detailed information. Furthermore, the CFT module utilizes a Cross-Attention Block to effectively integrate features with different semantic information, thereby enhancing the model’s ability to capture diverse feature types. To better leverage HOG features, TriEncoderNet incorporates an HAG module to achieve deep integration of HOG and image features. To further explore the correlation between HOG features and image features, an Attention Map Dropout mechanism is introduced, enabling the network to focus more effectively on the contribution of HOG features to edge information extraction during training.
TriEncoderNet takes Forward-Looking Sonar (FLS) images $X \in \mathbb{R}^{512 \times 512}$ as input. Initially, the input images are fed into both the HOG feature encoder and the CNN encoder. The image features are processed through a CNN block to generate CNN feature maps $X_{cnn}^{i} \in \mathbb{R}^{\frac{H}{2^i} \times \frac{W}{2^i} \times C}$, where $H$, $W$, and $C$ represent the height, width, and number of channels of the image, respectively, and $i \in \{0, 1, 2, 3, 4\}$ denotes the encoder layer index. Each CNN block consists of two convolutional layers, two batch normalization layers, and ReLU activation functions, designed to extract rich image features. The extracted CNN features are then simultaneously processed by multiple modules. Firstly, the image features are fed into the Transformer encoder to capture global contextual relationships, while also being passed to the next stage of the CNN encoder for further extraction of local detail features. Within the Transformer encoder, the input image features undergo a patch embedding operation, converting them into feature tokens $Z_{trans}^{i} \in \mathbb{R}^{n \times d}$, where $n = \frac{HW}{2^{2i} P^2}$ and $d$ is the dimension of the tokens. These tokens are then processed by the HETransformer for global context modeling, followed by downsampling via a patch merge layer. The tokens are subsequently converted back into feature maps through a feature mapping operation. The feature maps from both the Transformer and CNN encoders are then fed into the CFT module for feature fusion. The CFT module leverages a cross-attention mechanism to deeply integrate features from different encoders. The fused features are further processed by the HAG module, where they are combined with HOG features to achieve deep integration and enhance the representation of semantic information from three distinct sources. TriEncoderNet fuses semantic information at each scale and enhances feature representation through the mutual fusion of features across each scale. The fused features are finally passed through a decoder to restore the image resolution.
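To make the stage-wise data flow concrete, the following is a minimal PyTorch sketch of the tri-encoder wiring, assuming simple convolutional stand-ins for the HETransformer stages, the CFT and HAG fusion modules, and the HOG branch; the channel widths, stage count, and the crude decoder are illustrative placeholders rather than the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(c_in, c_out):
    """Two conv-BN-ReLU layers, matching the CNN block described above."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class TriEncoderNetSketch(nn.Module):
    """Three parallel branches with per-stage fusion (placeholder modules)."""
    def __init__(self, widths=(32, 64, 128, 256), n_classes=12):
        super().__init__()
        chs, n = (1,) + tuple(widths), len(widths)
        self.pool = nn.MaxPool2d(2)
        self.cnn = nn.ModuleList(conv_block(chs[i], chs[i + 1]) for i in range(n))
        # Stand-ins: HETransformer stage, HOG branch stage, CFT fusion, HAG fusion.
        self.trans = nn.ModuleList(conv_block(chs[i + 1], chs[i + 1]) for i in range(n))
        self.hog = nn.ModuleList(conv_block(chs[i], chs[i + 1]) for i in range(n))
        self.cft = nn.ModuleList(nn.Conv2d(2 * chs[i + 1], chs[i + 1], 1) for i in range(n))
        self.hag = nn.ModuleList(nn.Conv2d(2 * chs[i + 1], chs[i + 1], 1) for i in range(n))
        self.head = nn.Conv2d(widths[0], n_classes, 1)

    def forward(self, x, x_hog):
        feats, f_img, f_hog = [], x, x_hog
        for cnn, trans, hogc, cft, hag in zip(self.cnn, self.trans, self.hog, self.cft, self.hag):
            f_img = cnn(f_img)                            # local features
            f_glb = trans(f_img)                          # global-context branch (stand-in)
            f_hog = hogc(f_hog)                           # HOG branch features
            fused = cft(torch.cat([f_img, f_glb], 1))     # CFT-style local/global fusion
            fused = hag(torch.cat([fused, f_hog], 1))     # HAG-style HOG-guided fusion
            feats.append(fused)
            f_img, f_hog = self.pool(f_img), self.pool(f_hog)
        out = feats[-1]                                   # crude decoder: upsample + add skips
        for skip in reversed(feats[:-1]):
            out = F.interpolate(out, size=skip.shape[-2:], mode="bilinear", align_corners=False)
            out = out[:, : skip.shape[1]] + skip
        return self.head(out)
```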
To improve the readability of the mathematical formulations, the key parameters and notations used in this paper are summarized in Table 2.
In the following sections, we will provide a detailed description of key modules, including the HETransformer, CFT, HOG Extract Block, and HAG, along with their respective functionalities and contributions to the proposed architecture.

3.2. Hierarchical Efficient Transformer (HETransformer)

Before image features are processed by the HETransformer, they undergo a Patch Embedding operation to generate feature vectors. This process consists of two steps. First, the image is divided into multiple patches using Patch Partition, where each patch has a size of $P \times P$. Next, these patches are mapped into a high-dimensional space through a Linear Embedding layer, producing feature vectors $Z_{trans}^{i}$ of dimension $d$. These feature vectors $Z_{trans}^{i}$ are then fed into the HETransformer for self-attention computation. The architecture of the HETransformer is shown in Figure 2.
In the standard Transformer architecture [19], global self-attention is computed across all feature tokens, resulting in a computational complexity that scales quadratically with both the number of tokens ($n$) and their dimensionality ($d$). This high computational cost poses significant challenges during both training and inference. To mitigate this, the HETransformer reduces the number of tokens ($n$), substantially decreasing the computational overhead. As shown in the figure, the HETransformer adopts a dual strategy. First, feature tokens are transformed back into feature maps via a Feature Mapping operation and processed by the LMSA module. Second, a Windows Partition strategy [41] is employed, where the feature map is divided into multiple non-overlapping windows of $M \times M$ patches. Within each window, local self-attention is computed to model remote dependencies among patches, replacing the computationally expensive global self-attention. This step produces feature tokens $Z_{w}^{i} \in \mathbb{R}^{n \times d}$, where $n = M^2$, significantly reducing the overall computational complexity. To further enhance global context modeling, the HETransformer incorporates a Shifted Windows Partition strategy [41] in subsequent stages. By shifting the positions of windows, this approach ensures the effective integration of global information. Together, these strategies reduce the computational complexity of self-attention while preserving the ability to capture meaningful global dependencies. The computational formulation of the HETransformer is as follows:
$$\hat{Z}_{trans}^{i} = \mathrm{LMSA}\big(\mathrm{FM}(Z_{trans}^{i}),\ \mathrm{WP}(Z_{trans}^{i})\big) + Z_{trans}^{i}$$
$$\dot{Z}_{trans}^{i} = \mathrm{MLP}\big(\mathrm{LN}(\hat{Z}_{trans}^{i})\big) + \hat{Z}_{trans}^{i}$$
$$\tilde{Z}_{trans}^{i} = \mathrm{SLMSA}\big(\mathrm{FM}(\dot{Z}_{trans}^{i}),\ \mathrm{SWP}(\dot{Z}_{trans}^{i})\big) + \dot{Z}_{trans}^{i}$$
$$\ddot{Z}_{trans}^{i} = \mathrm{MLP}\big(\mathrm{LN}(\tilde{Z}_{trans}^{i})\big) + \tilde{Z}_{trans}^{i}$$
where $\mathrm{LMSA}(\cdot)$ and $\mathrm{SLMSA}(\cdot)$ represent the LMSA mechanism and the shifted LMSA mechanism, $\mathrm{FM}(\cdot)$ denotes Feature Mapping, $\mathrm{WP}(\cdot)$ and $\mathrm{SWP}(\cdot)$ refer to Windows Partition and Shifted Windows Partition, $\mathrm{MLP}(\cdot)$ stands for Multi-Layer Perceptron, and $\mathrm{LN}(\cdot)$ represents Layer Normalization.
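For clarity, the structure of one HETransformer block corresponding to the four equations above is sketched below; the `lmsa` and `lmsa_shifted` callables stand in for $\mathrm{LMSA}(\mathrm{FM}(\cdot), \mathrm{WP}(\cdot))$ and $\mathrm{SLMSA}(\mathrm{FM}(\cdot), \mathrm{SWP}(\cdot))$ and are assumed to be supplied, so this is a structural illustration rather than the paper's implementation.

```python
import torch.nn as nn

class HETransformerBlockSketch(nn.Module):
    """Regular-window LMSA, MLP, shifted-window LMSA, MLP, each with a residual."""
    def __init__(self, dim, lmsa, lmsa_shifted, mlp_ratio=4):
        super().__init__()
        self.lmsa = lmsa                  # assumed to compute LMSA(FM(z), WP(z))
        self.lmsa_shifted = lmsa_shifted  # assumed to compute SLMSA(FM(z), SWP(z))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mlp1 = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                  nn.Linear(mlp_ratio * dim, dim))
        self.mlp2 = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                  nn.Linear(mlp_ratio * dim, dim))

    def forward(self, z):                      # z: (B, n, d) token sequence
        z = self.lmsa(z) + z                   # regular-window LMSA + residual
        z = self.mlp1(self.norm1(z)) + z       # MLP(LN(.)) + residual
        z = self.lmsa_shifted(z) + z           # shifted-window LMSA + residual
        z = self.mlp2(self.norm2(z)) + z       # MLP(LN(.)) + residual
        return z
```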
The LMSA mechanism is depicted in Figure 2. LMSA takes two inputs, which are used to generate the query, key, and value. Specifically, LMSA performs a depth-wise convolution along the channel dimension of the input feature map, reducing its resolution by a factor of $k$ and yielding feature maps $x^{i} \in \mathbb{R}^{\frac{H}{2^i k} \times \frac{W}{2^i k} \times d}$. Here, $k$ is a hyperparameter, and the depth-wise convolution uses a kernel size of $k+1$, a stride of $k$, and a padding of $k/2$. The downsampled feature map is then subjected to a patch embedding operation, converting it into feature vectors. These vectors are further divided into windows using the Windows Partition method, where $\frac{M}{k} \times \frac{M}{k}$ patches are grouped into each window to create feature vectors $Z_{dw}^{i} \in \mathbb{R}^{n \times d}$, where $n = \frac{M^2}{k^2}$. Subsequently, a Layer Normalization (LN) operation is applied to generate $V_{dw}^{i}$ and $K_{dw}^{i}$. The second input to LMSA, $Z_{w}^{i}$, undergoes a Layer Normalization operation directly to produce $Q_{w}^{i}$. The LMSA calculation is formally expressed as follows:
$$V_{dw}^{i} = Z_{dw}^{i} W_{V}^{dw}, \quad K_{dw}^{i} = Z_{dw}^{i} W_{K}^{dw}, \quad Q_{w}^{i} = Z_{w}^{i} W_{Q}^{w}$$
where $W_{V}^{dw}$, $W_{K}^{dw}$, and $W_{Q}^{w}$ are the weight matrices corresponding to $V_{dw}^{i}$, $K_{dw}^{i}$, and $Q_{w}^{i}$, respectively.
Finally, the attention mechanism is calculated using the following formula:
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{Q_{w}^{i} (K_{dw}^{i})^{T}}{\sqrt{d}}\right) V_{dw}^{i}$$
The computational complexity of LMSA is $O\!\left(\left(\frac{2}{k^2}+1\right) n d^2 + \left(\frac{2M^2 + 2k + 1}{k^2} + 2\right) n d\right)$. In contrast, the computational complexities of the MSA modules in ViT and Swin Transformer are $O(4 n d^2 + 2 n^2 d)$ and $O(4 n d^2 + 2 M^2 n d)$, respectively. The computational complexity of LMSA is significantly lower than that of these classical approaches.
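The single-head PyTorch sketch below illustrates the core LMSA computation: queries come from full-resolution $M \times M$ windows, while keys and values come from a feature map reduced by a factor of $k$ via a depth-wise strided convolution, so each query window attends to only $(M/k)^2$ tokens. Multi-head splitting, positional terms, and the output projection are omitted, and the window size and $k$ used here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LMSASketch(nn.Module):
    def __init__(self, dim, window=8, k=2):
        super().__init__()
        self.M, self.k, self.dim = window, k, dim
        # Depth-wise conv with kernel k+1, stride k, padding k//2, as stated in the text.
        self.reduce = nn.Conv2d(dim, dim, kernel_size=k + 1, stride=k,
                                padding=k // 2, groups=dim)
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.norm_q, self.norm_kv = nn.LayerNorm(dim), nn.LayerNorm(dim)

    @staticmethod
    def _windows(x, m):
        # (B, C, H, W) -> (B * num_windows, m*m, C) non-overlapping windows of size m x m.
        B, C, H, W = x.shape
        x = x.view(B, C, H // m, m, W // m, m)
        return x.permute(0, 2, 4, 3, 5, 1).reshape(-1, m * m, C)

    def forward(self, x):                                   # x: (B, C, H, W), H, W divisible by M
        q_tok = self._windows(x, self.M)                    # full-resolution query windows
        kv_map = self.reduce(x)                             # downsample by factor k
        kv_tok = self._windows(kv_map, self.M // self.k)    # matching windows, fewer tokens
        q = self.q(self.norm_q(q_tok))
        k, v = self.kv(self.norm_kv(kv_tok)).chunk(2, dim=-1)
        attn = (q @ k.transpose(-2, -1)) / self.dim ** 0.5  # scaled dot-product attention
        return attn.softmax(dim=-1) @ v                     # (B * num_windows, M*M, C)
```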

3.3. CrossFusionTransformer (CFT)

Features from different encoders often exhibit significant semantic differences. However, these features also possess unique strengths that are complementary to one another. CNN encoders excel at capturing local features, effectively extracting high-precision local details and progressively deriving higher-level semantic features through multiple convolutional layers. In contrast, Transformer encoders are adept at capturing global relationships, capable of handling long-range dependencies and complex contextual information. They demonstrate strong capabilities in modeling interactions between distant regions within an image. To integrate and align the features from both types of encoders, we propose the CFT module, as illustrated in Figure 3.
First, the features from different encoders are transformed through Patch Embedding to obtain the transformer feature vectors $Z_{trans}^{i} \in \mathbb{R}^{n \times d}$ and the CNN feature vectors $Z_{cnn}^{i} \in \mathbb{R}^{n \times d}$, respectively. Subsequently, the two feature vectors undergo an initial fusion process, which can be expressed by the following equation:
$$Z_{f}^{i} = \mathrm{PE}\big(\mathrm{FM}\big(\mathrm{Concat}(Z_{trans}^{i}, Z_{cnn}^{i})\big)\big)$$
Here, $\mathrm{PE}(\cdot)$ represents Patch Embedding, and $Z_{f}^{i} \in \mathbb{R}^{n \times d}$ denotes the feature vector after the initial fusion.
However, simple feature fusion cannot eliminate the semantic gap between features. To address this, the transformer feature vector $Z_{trans}^{i}$, the CNN feature vector $Z_{cnn}^{i}$, and the fused feature vector $Z_{f}^{i}$ are introduced, with $Z_{f}^{i}$ serving as supplementary information. These multi-source features are used to extract the query, key, and value vectors in parallel. The specific calculations for this process are as follows:
$$V_{f}^{i} = Z_{f}^{i} W_{V}^{f}, \quad K_{cnn}^{i} = Z_{cnn}^{i} W_{K}^{cnn}, \quad Q_{cnn}^{i} = Z_{cnn}^{i} W_{Q}^{cnn}$$
$$K_{trans}^{i} = Z_{trans}^{i} W_{K}^{trans}, \quad Q_{trans}^{i} = Z_{trans}^{i} W_{Q}^{trans}$$
The matrices $W_{V}^{f}$, $W_{K}^{trans}$, $W_{K}^{cnn}$, $W_{Q}^{trans}$, and $W_{Q}^{cnn}$ represent the weight matrices corresponding to $V_{f}^{i}$, $K_{trans}^{i}$, $K_{cnn}^{i}$, $Q_{trans}^{i}$, and $Q_{cnn}^{i}$, respectively. After obtaining the query, key, and value vectors from the multi-source features, they are fed into Cross Attention (CA) Blocks for further processing. Specifically, one set of features ($V_{f}^{i}$, $K_{trans}^{i}$, $Q_{trans}^{i}$) is fed into one CA Block, while the other set ($V_{f}^{i}$, $K_{cnn}^{i}$, $Q_{cnn}^{i}$) is fed into another CA Block. Each CA Block performs identical computational operations. The detailed calculation process within a CA Block is as follows:
$$\mathrm{CAttention}(Q_{cnn}^{i}, K_{cnn}^{i}, V_{f}^{i}) = \mathrm{Softmax}\!\left(\frac{Q_{cnn}^{i} (K_{cnn}^{i})^{T}}{\sqrt{d}}\right) V_{f}^{i}$$
The output results of the dual-branch CA Block are normalized in parallel and then further processed by a Multi-Layer Perceptron (MLP). Finally, the feature vectors are fused through a concatenation operation. The fused features are subsequently mapped back into image features through Feature Mapping. A series of convolution operations is then applied to reduce the number of channels in the feature maps, enabling deeper feature integration and producing the final output.
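A compact single-head sketch of this fusion step is given below: the concatenated tokens supply the shared value $V_f$, and each branch attends with its own query/key pair in a cross-attention block before the two outputs are merged. The MLPs, normalization, multi-head splitting, and the final convolutional channel reduction described above are simplified, so this should be read as a structural illustration only.

```python
import torch
import torch.nn as nn

class CFTSketch(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.d = d
        self.fuse = nn.Linear(2 * d, d)                      # stand-in for Concat + FM + PE fusion
        self.wv_f = nn.Linear(d, d)                          # V from the fused tokens
        self.wq_c, self.wk_c = nn.Linear(d, d), nn.Linear(d, d)   # CNN-branch Q/K
        self.wq_t, self.wk_t = nn.Linear(d, d), nn.Linear(d, d)   # Transformer-branch Q/K
        self.out = nn.Linear(2 * d, d)                       # merge the two CA-Block outputs

    def _cross_attn(self, q, k, v):
        attn = (q @ k.transpose(-2, -1)) / self.d ** 0.5
        return attn.softmax(dim=-1) @ v

    def forward(self, z_trans, z_cnn):                       # tokens: (B, n, d)
        z_f = self.fuse(torch.cat([z_trans, z_cnn], dim=-1)) # initial fusion
        v = self.wv_f(z_f)
        out_cnn = self._cross_attn(self.wq_c(z_cnn), self.wk_c(z_cnn), v)
        out_trans = self._cross_attn(self.wq_t(z_trans), self.wk_t(z_trans), v)
        return self.out(torch.cat([out_cnn, out_trans], dim=-1))
```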

3.4. HOG Extract Block, HogConv Block and HOG Attention Gate

Inspired by research on ship classification in radar images [44] and facial recognition [45] using HOG features, we observed that HOG features excel in capturing edge information of objects in images. The specific structure of the HOG Extract Block is shown in Figure 4, and its details are described below.
First, the input FLS image undergoes preprocessing, with the specific steps outlined as follows:
$$X = \max\{X - \mathrm{mean}(X),\ 0\}$$
where $X$ represents the input FLS image and $\mathrm{mean}(\cdot)$ denotes the computation of the grayscale average. Next, the gradient for each pixel in the preprocessed image is calculated as follows:
$$G(x, y) = \sqrt{G_x(x, y)^2 + G_y(x, y)^2}$$
$$\theta(x, y) = \arctan\frac{G_y(x, y)}{G_x(x, y)}$$
where $(x, y)$ denotes each pixel in the image. The pixel gradient is described using the gradient magnitude $G(x, y)$ and the gradient direction $\theta(x, y)$. The horizontal and vertical gradient components $G_x(x, y)$ and $G_y(x, y)$ are computed as follows:
$$G_x(x, y) = V(x + 1, y) - V(x - 1, y)$$
$$G_y(x, y) = V(x, y + 1) - V(x, y - 1)$$
where V ( · ) represents the pixel value at each point in the image. The image is first divided into smaller cells, and the gradient magnitude and direction are calculated for each cell. The gradient magnitude and direction of each cell are then mapped into a histogram based on predefined orientation bins. Subsequently, neighboring cells are grouped into blocks, and block-level normalization is applied to the histograms of these cells. Finally, the combined histograms from all cells are concatenated to form the HOG feature vector. The resulting HOG feature vector is then resized to serve as the input HOG feature for the HOG encoder.
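As an illustration of this pipeline, the snippet below computes a dense HOG map for an FLS image with scikit-image; the preprocessing follows the mean-subtraction step above, while the cell size, block size, and nine orientation bins are illustrative choices rather than the paper's settings.

```python
import numpy as np
from skimage.feature import hog
from skimage.transform import resize

def hog_input(fls_image: np.ndarray, out_size=(512, 512)) -> np.ndarray:
    # Preprocessing: subtract the grayscale mean and clip negative values to zero.
    x = np.maximum(fls_image.astype(np.float32) - fls_image.mean(), 0.0)
    # Per-cell orientation histograms with block-wise normalization; the HOG
    # visualization is used here as a dense edge/shape map for the HOG encoder.
    _, hog_map = hog(x, orientations=9, pixels_per_cell=(8, 8),
                     cells_per_block=(2, 2), block_norm="L2-Hys",
                     visualize=True)
    # Resize the HOG map to the resolution expected by the HOG encoder.
    return resize(hog_map, out_size, anti_aliasing=True)
```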
To effectively extract HOG features, TriEncoderNet designs the HogConv Block, as illustrated in Figure 4. This module consists of three sets of convolutional operations, each followed by Batch Normalization and the ReLU activation function to introduce nonlinearity. Specifically, the first convolutional operation employs a 1 × 1 kernel and maintains the same number of channels, aiming to extract low-level features from the image. The second convolutional operation utilizes a 3 × 3 kernel with a stride of 1, doubling the number of channels to enhance the network’s representation capacity and depth. Next, a pooling operation is performed to achieve down-sampling. The final convolutional operation also uses a 3 × 3 kernel with a stride of 1, maintaining the channel count, and focuses on extracting higher-level semantic features.
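A direct sketch of the HogConv Block as described (a 1 × 1 convolution keeping the channel count, a 3 × 3 convolution doubling it, a pooling step for down-sampling, then a channel-preserving 3 × 3 convolution, each followed by Batch Normalization and ReLU) might look as follows; the padding values and the choice of max pooling are assumptions.

```python
import torch.nn as nn

def hogconv_block(c_in):
    c_mid = 2 * c_in
    return nn.Sequential(
        # 1x1 conv, same channel count: low-level feature extraction.
        nn.Conv2d(c_in, c_in, 1), nn.BatchNorm2d(c_in), nn.ReLU(inplace=True),
        # 3x3 conv, stride 1, doubling the channels.
        nn.Conv2d(c_in, c_mid, 3, stride=1, padding=1), nn.BatchNorm2d(c_mid), nn.ReLU(inplace=True),
        # Pooling for down-sampling (max pooling assumed here).
        nn.MaxPool2d(2),
        # 3x3 conv, stride 1, keeping the channel count: higher-level semantics.
        nn.Conv2d(c_mid, c_mid, 3, stride=1, padding=1), nn.BatchNorm2d(c_mid), nn.ReLU(inplace=True),
    )
```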
For the HOG features and the image features from the CFT module, TriEncoderNet designs a HAG to achieve feature fusion, as illustrated in Figure 5. First, the two features are concatenated and then undergo initial feature fusion through a 1 × 1 convolutional operation, followed by Batch Normalization and a ReLU activation function. Next, we employ $P$ Attention Extract branches to generate attention maps. By incorporating HOG features, these branches can automatically identify critical regions associated with the HOG features, particularly the object edges and contour regions that are crucial for FLS image segmentation tasks. Each Attention Extract branch consists of two convolutional layers designed to reduce the feature dimensionality, with an activation function introduced between the layers to add nonlinearity. Finally, the attention maps $\mathrm{Map}_m \in \mathbb{R}^{\frac{H}{2^i} \times \frac{W}{2^i}}$ are obtained through a Sigmoid activation function, where $\mathrm{Map} = [\mathrm{Map}_1, \mathrm{Map}_2, \ldots, \mathrm{Map}_P]$.
To encourage the model to explore more of the edge information brought by the HOG features, TriEncoderNet proposes an Attention Drop mechanism. This mechanism randomly eliminates $s$ attention maps by setting all values in these $s$ maps to zero. The remaining $N = P - s$ attention maps are then merged into a final attention map $\mathrm{Map}_{final} \in \mathbb{R}^{\frac{H}{2^i} \times \frac{W}{2^i}}$. The detailed calculation is as follows:
$$\mathrm{Map}_{final}(x, y) = \max\big(\mathrm{Map}_1(x, y), \mathrm{Map}_2(x, y), \ldots, \mathrm{Map}_N(x, y)\big)$$
where $x \in [1, \frac{H}{2^i}]$ and $y \in [1, \frac{W}{2^i}]$. Finally, $\mathrm{Map}_{final}$ is multiplied element-wise with the previously fused image features. Through multiple Attention Extract branches and the Attention Drop mechanism, the HAG assigns different weights to each feature region, further highlighting the shape and edge information brought by the HOG features. This not only effectively integrates HOG features with image features but also significantly enhances the utilization efficiency of the edge information provided by HOG features.
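The sketch below illustrates the HAG fusion and the Attention Drop mechanism, assuming that each Attention Extract branch outputs a single-channel map and that the merge uses the element-wise maximum described above; the number of branches $P$, the drop count $s$, and the branch layer widths are placeholders.

```python
import torch
import torch.nn as nn

class HAGSketch(nn.Module):
    def __init__(self, c_img, c_hog, p_branches=4, s_drop=1):
        super().__init__()
        self.P, self.s = p_branches, s_drop
        # Initial fusion: concatenation + 1x1 conv + BN + ReLU.
        self.fuse = nn.Sequential(nn.Conv2d(c_img + c_hog, c_img, 1),
                                  nn.BatchNorm2d(c_img), nn.ReLU(inplace=True))
        # Each Attention Extract branch: two convs that reduce the feature dimensionality.
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Conv2d(c_img, c_img // 2, 3, padding=1), nn.ReLU(inplace=True),
                          nn.Conv2d(c_img // 2, 1, 1))
            for _ in range(p_branches))

    def forward(self, x_img, x_hog):
        fused = self.fuse(torch.cat([x_img, x_hog], dim=1))
        # P sigmoid attention maps, stacked along the channel dimension: (B, P, H, W).
        maps = torch.cat([torch.sigmoid(b(fused)) for b in self.branches], dim=1)
        if self.training and self.s > 0:
            keep = torch.ones(self.P, device=maps.device)
            keep[torch.randperm(self.P)[: self.s]] = 0.0     # Attention Drop: zero s random maps
            maps = maps * keep.view(1, -1, 1, 1)
        final_map = maps.max(dim=1, keepdim=True).values     # element-wise max over remaining maps
        return fused * final_map                             # re-weight the fused features
```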
The intuition behind this design stems from the specific degradation mechanisms of underwater acoustics. While CNNs capture local texture details often obscured by speckle noise, and Transformers model global context to distinguish valid targets from reverberation shadows, the HOG encoder introduces explicit geometric priors to sharpen boundaries that are typically blurred in FLS imagery. By fusing these complementary views, the model achieves robust segmentation in complex underwater environments.

4. Experiment

This section provides a detailed introduction to the datasets used in the experiments, evaluation metrics, comparisons with other models and visualized comparison results.

4.1. FLS Image Dataset

Marine Debris Dataset [46]: This dataset consists of 1868 FLS images collected using an ARIS Explorer 3000 Sonar. It includes 11 target categories (excluding the background). Notably, Standing bottle and Shampoo bottle account for only 2% and 3%, respectively, while Wall accounts for 25%. In terms of pixels, categories like Standing bottle, Hook, Drink carton, and Shampoo bottle account for only 1%, 2%, 2%, and 2%, respectively, whereas Wall comprises 44%. This dataset suffers from significant class imbalance, especially for smaller objects. Sample images of the Marine Debris Dataset are shown in Figure 6.
UATD Dataset [47]: This dataset was collected using a Tritech Gemini 1200ik sonar in lakes and oceans. It contains 9200 FLS images across 10 target categories: sphere, cube, cylinder, human, plane, circular cage, square cage, metal barrel, tire, and ROV. While the original dataset was annotated with bounding boxes, we further refined the annotations using LabelMe to provide semantic segmentation masks for objects in each category across the entire dataset. Sample images of the UATD Dataset are shown in Figure 7.

4.2. Evaluation Metrics

We comprehensively evaluate TriEncoderNet from two aspects: segmentation accuracy and segmentation efficiency. Segmentation accuracy is assessed using mean Pixel Accuracy (mAP), Intersection over Union (IoU), and mean Intersection over Union (mIoU) as evaluation metrics. Segmentation efficiency is measured in terms of model size and inference time.
mAP measures the proportion of correctly classified pixels within each class, averaged across all classes. The calculation formula is as follows:
$$\mathrm{mAP} = \frac{1}{k+1} \sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij}}$$
where $p_{ij}$ represents the number of pixels that belong to class $i$ but are misclassified as class $j$, $p_{ii}$ represents the number of pixels correctly classified as class $i$, and $k$ is the total number of classes.
IoU calculates the ratio between the intersection and the union of the ground truth A and the segmentation result B. The specific formula is shown as follows:
$$\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|}$$
mIoU calculates the IoU for each class and then computes the average across all classes. The specific formula is shown as follows:
$$\mathrm{mIoU} = \frac{1}{k} \sum_{i=1}^{k} \mathrm{IoU}_i = \frac{1}{k} \sum_{i=1}^{k} \frac{|A_i \cap B_i|}{|A_i \cup B_i|}$$
where $k$ is the total number of classes, $A_i$ is the ground truth for class $i$, and $B_i$ is the segmentation result for class $i$.
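For reference, these metrics can be computed from a pixel-level confusion matrix as in the short sketch below; whether the background class is included in the averages is an evaluation-protocol choice not fixed here.

```python
import numpy as np

def segmentation_metrics(conf: np.ndarray, eps: float = 1e-10) -> dict:
    # conf[i, j]: number of pixels of class i predicted as class j (confusion matrix).
    tp = np.diag(conf).astype(np.float64)
    per_class_acc = tp / (conf.sum(axis=1) + eps)            # p_ii / sum_j p_ij
    union = conf.sum(axis=1) + conf.sum(axis=0) - tp
    iou = tp / (union + eps)                                 # |A_i ∩ B_i| / |A_i ∪ B_i|
    return {"mAP": per_class_acc.mean(), "mIoU": iou.mean(), "per_class_IoU": iou}
```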

4.3. Implementation Details

Due to the absence of an official split for the Marine Debris Dataset and the UATD dataset, we randomly divided the data into training, validation, and testing sets in a 7:2:1 ratio. The input FLS images were resized to $512 \times 512$, and data augmentation techniques such as random flipping, random cropping, random scaling, and color jittering were employed. During training, the Adam optimizer [48] was used for 160 epochs with a batch size of 8. A cosine annealing learning rate schedule was applied to accelerate model convergence, starting with an initial learning rate of $1 \times 10^{-4}$. To mitigate the class imbalance in the datasets, we utilized a joint loss function combining Dice Loss and Cross-Entropy (CE) Loss [49]. All hyperparameters were tuned on the validation set, and the final model performance was evaluated on the test set. We conducted training and inference for TriEncoderNet and the baseline models using an NVIDIA GeForce RTX 3090 Ti GPU. The hyperparameters and training details are listed in Table 3. For the comparison experiments, we retrained state-of-the-art (SOTA) methods with their recommended parameter settings. The same dataset partitioning strategy, training duration, batch size, and image augmentation methods as our approach were applied to ensure fairness.
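A minimal sketch of this training configuration (Adam, cosine annealing from $1 \times 10^{-4}$, 160 epochs, batch size 8, joint Dice + CE loss) is shown below; `model` and `train_loader` are placeholders, and the Dice formulation is one common variant rather than necessarily the exact implementation used.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def dice_loss(logits, target, eps=1.0):
    # Soft Dice over one-hot targets; `target` is a (B, H, W) tensor of class indices.
    probs = logits.softmax(dim=1)
    one_hot = F.one_hot(target, probs.shape[1]).permute(0, 3, 1, 2).float()
    inter = (probs * one_hot).sum(dim=(2, 3))
    denom = probs.sum(dim=(2, 3)) + one_hot.sum(dim=(2, 3))
    return 1.0 - ((2 * inter + eps) / (denom + eps)).mean()

def train(model, train_loader, epochs=160, device="cuda"):
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    ce = nn.CrossEntropyLoss()
    model.to(device).train()
    for _ in range(epochs):
        for images, masks in train_loader:          # batch size 8 in the paper
            images, masks = images.to(device), masks.to(device)
            logits = model(images)
            loss = ce(logits, masks) + dice_loss(logits, masks)   # joint CE + Dice loss
            opt.zero_grad()
            loss.backward()
            opt.step()
        sched.step()                                # cosine-annealed learning rate per epoch
```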

4.4. Comparison with the SOTA Methods

We conducted comparative experiments between TriEncoderNet and state-of-the-art (SOTA) methods on the Marine Debris Dataset and UATD dataset. The SOTA methods include CNN-based models such as U-Net [13], DeepLabv3+ [14], FCNN [15], UNet3+ [50], SegNet [16], as well as Transformer-based models such as PVT [22], TransUNet [25], Swin Transformer [20], MT-UNet [51], COAT [36], and SAM [49].
Marine Debris Dataset: The quantitative analysis results of TriEncoderNet and SOTA methods on the Marine Debris Dataset are shown in Table 4, where the bold font indicates the highest metric values. We evaluated the IoU metric for each category and used mIoU and mAP to assess overall performance. TriEncoderNet demonstrated superior performance across all three evaluation metrics.
Specifically, compared to the baseline model U-Net, TriEncoderNet improved mAP and mIoU by 9.4% and 10%, respectively. Among CNN-based methods, TriEncoderNet outperformed the best-performing UNet3+ by 4.3% and 4.4% in mAP and mIoU, respectively. Compared to the best-performing Transformer-based method, SAM-Sonar, TriEncoderNet still achieved improvements of 4.4% in mAP and 3% in mIoU. Furthermore, TriEncoderNet consistently led in IoU for each individual category compared to all other models.
UATD Dataset: The quantitative analysis results of TriEncoderNet and SOTA methods on the UATD dataset are presented in Table 5, where the bold font indicates the highest metric values. We evaluated the IoU metric for each category and used mIoU and mAP for overall assessment. TriEncoderNet achieved the best results across all segmentation tasks for the UATD dataset.
Compared to the baseline model U-Net, TriEncoderNet improved mIoU and mAP by 12.1% and 16.4%, respectively. Among CNN-based methods, TriEncoderNet outperformed the best-performing SegNet by 11.2% and 13.7% in mIoU and mAP, respectively. Compared to the best-performing Transformer-based method, Swin Transformer, TriEncoderNet achieved improvements of 3.2% in mIoU and 5.5% in mAP. Additionally, TriEncoderNet maintained a consistent lead in IoU for every individual category compared to all other models.
Robustness Analysis: Furthermore, to empirically quantify the model’s robustness against acoustic interference, we conducted stress tests on the Marine Debris Dataset using synthetic Gaussian and speckle noise, as reported in Table 6. Under severe noise conditions (e.g., speckle noise with $\sigma^2 = 0.10$), the baseline U-Net and Swin Transformer suffered catastrophic performance drops of 32.3% and 28.6%, respectively. In contrast, TriEncoderNet exhibited superior resilience, limiting the degradation to only 14.0%. This confirms that the HOG-based multi-stage fusion effectively preserves structural integrity even when pixel-level coherence is compromised by noise.
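For reproducibility of such stress tests, additive Gaussian and multiplicative speckle corruptions with variance $\sigma^2$ can be generated as in the sketch below; the clipping range assumes images normalized to [0, 1], and the exact corruption protocol behind Table 6 is not specified here beyond the two noise types.

```python
import numpy as np

def add_gaussian(img: np.ndarray, var: float = 0.10) -> np.ndarray:
    # Additive Gaussian noise with variance `var`.
    noise = np.random.normal(0.0, np.sqrt(var), img.shape)
    return np.clip(img + noise, 0.0, 1.0)

def add_speckle(img: np.ndarray, var: float = 0.10) -> np.ndarray:
    # Multiplicative speckle noise: img + img * n, with n ~ N(0, var).
    noise = np.random.normal(0.0, np.sqrt(var), img.shape)
    return np.clip(img + img * noise, 0.0, 1.0)
```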

4.5. Computational Efficiency Analysis

As shown in Figure 8, CNN-based models are significantly smaller and faster at inference than transformer-based models, primarily due to the unavoidable computational overhead of global attention mechanisms. TriEncoderNet incorporates both CNN and transformer-based parallel encoders. Despite this, its model size is still smaller than most transformer-based models, and its inference speed outperforms many of them. This improvement is attributed to the design of our HETransformer, which significantly reduces computational overhead compared to classical transformers. Although TriEncoderNet is not the smallest in terms of model size or the fastest in inference speed, it achieves the highest segmentation accuracy at 91.6%. With a model size of 49.07 MB and an inference speed of 37 fps, TriEncoderNet reduces the model size by 274.96 MB and improves inference speed by 185% compared to SAM, the second most accurate model. Overall, TriEncoderNet achieves the best segmentation performance with minimal computational overhead.
To provide a standardized comparison of computational cost independent of hardware platforms, we evaluated the Giga Floating-point Operations (GFLOPs) with an input size of 512 × 512 . As presented in Table 7, our TriEncoderNet achieves a competitive computational cost compared to other SOTA methods.

4.6. Visualization Analysis

To visually and comprehensively validate the performance of TriEncoderNet, we conducted a comparative analysis of segmentation results on the Marine Debris Dataset and UATD dataset. This analysis included comparisons with SOTA methods such as U-Net [13], UNet3+ [50], SegNet [16], TransUNet [25], and SAM [49].
Marine Debris Dataset: We performed visualization analysis on the test set of the Marine Debris Dataset, which included scenarios with complex-shaped objects (e.g., chains) and small-scale targets (e.g., hooks, drink cartons). As shown in Figure 9, TriEncoderNet demonstrated precise segmentation of underwater objects even in complex backgrounds. While other methods were also capable of segmenting various underwater targets, they were prone to errors due to prominent noise interference and complex backgrounds, leading to imprecise segmentation.
In Scene 1, where the chain is located in a complex background, CNN-based methods produce discontinuous segmentation results for the chain. In contrast, TriEncoderNet achieves precise and continuous segmentation. This is attributed to its dual ability to extract both local features and global contextual information, enabling it to effectively capture relevant target features. In Scene 2, TriEncoderNet is the only model capable of accurately segmenting the edges of the wall and cans. This advantage is due to the integration of HOG features, which enhance the capture of contour details. Other methods are significantly affected by noise, leading to incomplete or imprecise segmentation results. In Scene 3, for valve segmentation, only TriEncoderNet manages to clearly capture the branch details along the valve’s edges. This performance is made possible by its deep-level extraction of local detail features and the seamless fusion of HOG features with image characteristics. Other methods generally fail to address these crucial edge details effectively. In Scene 4, hooks are small-scale targets. CNN-based methods, while strong in local feature extraction, fail to segment the upper portion of the hook due to their lack of global contextual modeling. Transformer-based methods can roughly capture the upper portion of the hook’s shape but fall short in edge precision due to insufficient attention to local details. Only TriEncoderNet successfully achieves the most accurate segmentation, combining the strengths of both local detail extraction and global context awareness. In Scene 5, bottle segmentation demonstrates TriEncoderNet’s robustness. SegNet fails to fully segment the bottle’s shape, and other models inaccurately segment the bottle neck due to noise interference and low contrast with the background. TriEncoderNet, however, accurately captures the bottle’s complete shape, showcasing its superior robustness to noise and ability to capture fine details.
UATD Dataset: For the UATD dataset, we analyzed test set scenarios including similar objects (e.g., cube and cage), small-scale targets (e.g., ROV and ball), and sparse objects (e.g., human body), as shown in Figure 10.
On the UATD dataset, Scene 1 and Scene 2 involve similar targets (e.g., square cage and cube) and small-scale objects (e.g., tires). CNN-based methods are heavily affected by underwater noise and fail to segment the edges of the square cage or the full tire region. TriEncoderNet, however, achieves complete segmentation of these targets, demonstrating its adaptability and accuracy in challenging environments. In Scene 3, both TransUNet [25] and SAM failed to segment the plane’s wings, lacking the ability to extract high-precision local details. Furthermore, all other methods overlooked the extraction of potential HOG features during the overall segmentation process, resulting in inferior performance compared to TriEncoderNet. In Scene 4, where the target ROV is a small-scale object with weak semantic information and significant background noise, only TriEncoderNet achieved precise segmentation, demonstrating its exceptional capability in handling small objects. In Scene 6, when segmenting the human body, the low contrast between the target and the background led all other methods, except TriEncoderNet, to fail in segmenting the limbs of the human body. TriEncoderNet’s superior performance is attributed to its robust ability to extract detailed features and effectively model global contextual information.
The visualization analysis on the Marine Debris Dataset and UATD dataset reveals the primary challenges in underwater target segmentation: low contrast, subtle target features, and interference from complex backgrounds and noise. Traditional segmentation methods (CNN-based and Transformer-based models) show notable limitations when dealing with complex-shaped objects, small-scale targets, and sparse objects. These limitations include insufficient local detail extraction, inadequate global context modeling, and reduced segmentation accuracy under noisy conditions. In contrast, TriEncoderNet, with its multi-stage feature fusion, deep modeling of local details and global context, and the extraction and integration of HOG features, achieves precise segmentation across various challenging scenarios. It particularly excels in segmenting small-scale targets, edge details, and low-contrast objects, highlighting its robustness and adaptability for underwater FLS image segmentation.

5. Ablation Study

5.1. Ablation Study of Hand-Crafted Feature Fusions

We conducted comprehensive ablation experiments on the hand-crafted features introduced in TriEncoderNet, comparing several different types, including Canny edge features [52], grayscale histogram features [53], and Local Binary Pattern (LBP) features [54]. Table 8 summarizes the performance of the different hand-crafted feature fusions on the two test datasets, with bold font indicating the highest metric values.
From the experimental results, it is evident that HOG features outperform the other three feature types, leading to more accurate segmentation results and proving the effectiveness of HOG features. This superiority can be attributed to the fact that HOG features excel at capturing the distribution of local gradient orientations, effectively describing edge shapes and object structures in sonar images. By modeling the directional information of these edges, HOG features significantly enhance segmentation performance. Additionally, the results show that grayscale histogram features and LBP features negatively impact the segmentation performance. Compared to not incorporating any hand-crafted features, their inclusion results in a decrease in accuracy. We hypothesize that grayscale histogram features are overly sensitive to local brightness variations, potentially introducing noisy textures that hinder sonar image segmentation. On the other hand, LBP features, being global descriptors, fail to preserve local structural information, adversely affecting segmentation precision. While Canny edge features provide some benefits for sonar image segmentation, their susceptibility to noise interference limits their effectiveness. Consequently, the choice of appropriate hand-crafted features remains a research-worthy direction for further exploration.
Although gradient-based methods like Sobel operators and learned convolution layers offer marginal improvements over the baseline, they still fall short of the proposed HOG module. This performance gap is attributed to HOG’s histogram binning and block-wise normalization, which provide robust statistical representations against the speckle noise and signal attenuation inherent in FLS images—challenges that raw gradients and unnormalized features fail to address effectively. In contrast, grayscale histograms and LBP features degrade performance due to their sensitivity to local brightness variations and lack of structural preservation. Consequently, the normalized HOG descriptor proves to be the most effective geometric prior for enhancing segmentation in complex underwater environments.

5.2. Ablation Study of Proposed Blocks

To comprehensively validate the effectiveness of each module in TriEncoderNet and its contribution to overall model performance, we conducted extensive ablation experiments. The results, as presented in Table 9, demonstrate the impact of each module on two test datasets, with bold font indicating the highest metric values. U-Net was employed as the CNN-baseline, while Vision Transformer was used as the Transformer-baseline. The CNN–Transformer (CT) model features a dual-encoder architecture that combines CNN and Vision Transformer encoders, using simple concatenation as the fusion mechanism. The CNN–Transformer–Hog (CTH) model extends this structure by adopting a tri-encoder architecture, incorporating CNN, Vision Transformer, and HogExtract as the Hog feature encoder. While its structure aligns with TriEncoderNet, all fusion modules in CTH are replaced with concatenation.
Effectiveness of the Tri-Encoder Architecture: The results for the CT model demonstrate the advantages of using dual encoders to extract both global and local information. Compared to the CNN-baseline and Trans-baseline, which extract only one type of feature, CT achieves substantial improvements in segmentation accuracy. Specifically, on the Marine Debris Dataset, CT achieves a 6.5% increase in mIoU and a 7% increase in mAP compared to the CNN-baseline. This demonstrates that the combination of global and local feature extraction significantly enhances model performance.
Building on the CT model, the introduction of Hog features in the CTH model further improves segmentation accuracy. By incorporating Hog features through a third encoder, CTH supplements the global and local features with texture and contour information. This addition leads to notable performance gains across various evaluation metrics. For example, on the Marine Debris Dataset, CTH achieves higher mIoU and mAP scores than CT, showcasing the effectiveness of the tri-encoder structure.
Effectiveness of HETransformer: Replacing the Vision Transformer in CT with the proposed HETransformer (CT + HET) results in further improvements in segmentation accuracy on both datasets. Specifically, CT + HET achieves a 0.5% and 0.3% increase in mIoU and mAP, respectively, on the Marine Debris Dataset, and a 0.6% and 1.1% increase in mIoU and mAP, respectively, on the UATD dataset. Furthermore, integrating HETransformer into CT + CFT results in even greater performance gains. For instance, on the Marine Debris Dataset, mIoU and mAP improve by 0.7% each, while on the UATD dataset, the improvements are 0.7% and 0.8%, respectively. These results validate the efficiency of HETransformer, which not only reduces computational complexity but also enhances the ability to extract global features, leading to improved segmentation accuracy.
Effectiveness of CFT: Replacing the simple concatenation operation in CT with the proposed CFT fusion module (CT + CFT) significantly improves performance, particularly on the UATD dataset, where mIoU and mAP both increase by over 1%. When applied to CT-HET, the CFT module further boosts segmentation accuracy. The improvement stems from the CFT module’s ability to model the correlations between global and local features across multiple scales, enabling deeper integration of semantically distinct feature information. This contrasts with simple concatenation, which cannot fully capture the complementary relationships between such features.
Effectiveness of HAG: Adding the HAG module to CTH and CTH + CFT + HET further enhances segmentation performance. Unlike the straightforward concatenation of Hog features in CTH, the HAG module effectively captures detailed textures and contour features brought by Hog, resulting in higher segmentation accuracy. For example, on the UATD dataset and Marine Debris Dataset, adding HAG to CTH + CFT + HET leads to measurable improvements in both mIoU and mAP, showcasing the module’s ability to exploit the benefits of Hog features for refining segmentation precision. Additionally, replacing HAG with standard SE or CBAM blocks leads to performance degradation. This confirms that HAG’s specific cross-modal design, utilizing HOG priors and Attention Dropout, is more effective than standard self-attention at suppressing underwater noise.
The ablation experiments highlight the efficacy of each module in TriEncoderNet. The dual-encoder architecture of CT effectively combines global and local features, while the tri-encoder structure of CTH leverages Hog features to further enhance segmentation accuracy. Additionally, the integration of HETransformer, CFT, and HAG modules contributes to substantial improvements in segmentation precision, validating the overall design of TriEncoderNet and its computational efficiency.

6. Conclusions

In this study, we proposed a novel deep learning model to address the challenges of FLS image segmentation. The integration of the CFT module facilitates a seamless fusion of local and global features at multiple scales, enhancing the model’s ability to adapt to complex underwater environments. The HAG further improves edge detection, ensuring the preservation of critical boundary information. The HETransformer optimizes computational efficiency without compromising global context modeling capabilities. Comparative experiments on publicly available datasets demonstrate the effectiveness of TriEncoderNet. Despite the superior performance, we acknowledge certain limitations in our evaluation. Currently, the experiments are conducted on two public datasets. While these datasets cover various underwater scenarios, they may not fully represent extreme conditions such as cross-domain adaptation or synthetic-to-real transfer. Future work will focus on extending the evaluation to a wider range of benchmarks and exploring the model’s generalization capabilities on unannotated real-world sonar data.

Author Contributions

Conceptualization, J.L. and Y.D.; Data curation, Y.D.; Formal analysis, Y.D.; Funding acquisition, F.Z. and J.G.; Investigation, J.G. and Y.C.; Methodology, Y.D.; Project administration, J.G. and F.Z.; Software, Y.D.; Supervision, J.G.; Validation, G.C. and J.G.; Visualization, J.L.; Writing—original draft, Y.D. and J.L.; Writing—review and editing, J.L. and J.G. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Natural Science Foundation of China under Grant 52571392.

Data Availability Statement

Data are contained within the article.

Acknowledgments

We gratefully acknowledge the contributors of the Marine Debris dataset and UATD dataset for their open-source efforts.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
FLS   Forward-Looking Sonar
CNN   Convolutional Neural Network
CFT   CrossFusionTransformer
MRF   Markov Random Field
HETransformer   Hierarchical Efficient Transformer
HOG   Histogram of Oriented Gradients
HAG   HOG Attention Gate
CT   CNN–Transformer
CTH   CNN–Transformer–HOG
LMSA   Lightweight Multi-Head Self-Attention

References

  1. Long, H.; Shen, L.; Wang, Z.; Chen, J. Underwater forward-looking sonar images target detection via speckle reduction and scene prior. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5604413. [Google Scholar] [CrossRef]
  2. Zheng, L.; Hu, T.; Zhu, J. Underwater sonar target detection based on improved ScEMA YOLOv8. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1503505. [Google Scholar] [CrossRef]
  3. He, J.; Xu, H.; Li, S.; Yu, Y. Efficient SonarNet: Lightweight CNN grafted Vision Transformer embedding network for forward-looking sonar image segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4210317. [Google Scholar] [CrossRef]
  4. Zhao, D.; Ge, W.; Chen, P.; Hu, Y.; Dang, Y.; Liang, R.; Guo, X. Feature Pyramid U-Net with attention for semantic segmentation of forward-looking sonar images. Sensors 2022, 22, 8468. [Google Scholar] [CrossRef]
  5. Zheng, H.; Sun, Y.; Xu, H.; Zhang, L.; Han, Y.; Cui, S.; Li, Z. MLMFFNet: Multilevel Mixed Feature Fusion Network for Real-Time Forward-Looking Sonar Image Segmentation. IEEE J. Ocean. Eng. 2025, 50, 1356–1369. [Google Scholar] [CrossRef]
  6. Han, C.; Shen, Y.; Liu, Z. Three-Stage Distortion-Driven Enhancement Network for Forward-Looking Sonar Image Segmentation. IEEE Sens. J. 2025, 25, 3867–3878. [Google Scholar] [CrossRef]
  7. Zhou, T.; Wang, Y.; Zhang, L.; Chen, B.; Yu, X. Underwater multitarget tracking method based on threshold segmentation. IEEE J. Ocean. Eng. 2023, 48, 1255–1269. [Google Scholar] [CrossRef]
  8. Tian, Y.; Lan, L.; Sun, L. A review of sonar image segmentation for underwater small targets. In Proceedings of the 2020 International Conference on Pattern Recognition and Intelligent Systems, Athens, Greece, 30 July–2 August 2020; pp. 1–4. [Google Scholar]
  9. Li, J.; Jiang, P.; Zhu, H. A local region-based level set method with Markov random field for side-scan sonar image multi-level segmentation. IEEE Sens. J. 2020, 21, 510–519. [Google Scholar] [CrossRef]
  10. Zhang, M.; Cai, W.; Wang, Y.; Zhu, J. A level set method with heterogeneity filter for side-scan sonar image segmentation. IEEE Sens. J. 2024, 24, 584–595. [Google Scholar] [CrossRef]
  11. Wang, Y.; Zhou, K.; Tian, W.; Chen, Z.; Yang, D. Underwater sonar image segmentation by a novel joint level set model. J. Phys. Conf. Ser. 2022, 2173, 012040. [Google Scholar] [CrossRef]
  12. Wang, Y.; Liu, Z.; Li, G.; Lu, X.; Liu, X.; Zhang, H. Hybrid Modeling Based Semantic Segmentation of Forward-Looking Sonar Images. IEEE J. Ocean. Eng. 2025, 50, 380–393. [Google Scholar] [CrossRef]
  13. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015; Springer: Cham, Switzerland, 2015; Volume 18, pp. 234–241. [Google Scholar]
  14. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  15. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  16. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef] [PubMed]
  17. Liang, Y.; Zhu, X.; Zhang, J. Maanu-Net: Multi-level attention and atrous pyramid nested U-Net for wrecked objects segmentation in forward-looking sonar images. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; pp. 736–740. [Google Scholar]
  18. Zhao, D.; Zhou, H.; Chen, P.; Hu, Y.; Ge, W.; Dang, Y.; Liang, R. Design of forward-looking sonar system for real-time image segmentation with light multiscale attention net. IEEE Trans. Instrum. Meas. 2024, 73, 4501217. [Google Scholar] [CrossRef]
  19. Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  20. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual Event, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  21. Chen, C.-F.; Panda, R.; Fan, Q. RegionViT: Regional-to-local attention for vision transformers. arXiv 2021, arXiv:2106.02689. [Google Scholar]
  22. Wang, W.; Xie, E.; Li, X.; Fan, D.-P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual Event, 11–17 October 2021; pp. 568–578. [Google Scholar]
  23. He, J.; Yu, Y.; Xu, H. Reverberation Suppression and Multilayer Context Aware for Underwater Forward-Looking Sonar Image Segmentation. IEEE Trans. Instrum. Meas. 2025, 74, 5014517. [Google Scholar] [CrossRef]
  24. Si, C.; Yu, W.; Zhou, P.; Zhou, Y.; Wang, X.; Yan, S. Inception transformer. Adv. Neural Inf. Process. Syst. 2022, 35, 23495–23509. [Google Scholar]
  25. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. TransUNet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar] [CrossRef]
  26. He, J.; Chen, J.; Xu, H.; Yu, Y. SonarNet: Hybrid CNN-Transformer-HOG framework and multifeature fusion mechanism for forward-looking sonar image segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4203217. [Google Scholar] [CrossRef]
  27. Carisi, L.; Chiereghin, F.; Fantozzi, C.; Nanni, L. SAM-Based Input Augmentations and Ensemble Strategies for Image Segmentation. Information 2025, 16, 848. [Google Scholar] [CrossRef]
  28. Dhiyanesh, B.; Vijayalakshmi, M.; Saranya, P.; Viji, D. EnsembleEdgeFusion: Advancing semantic segmentation in microvascular decompression imaging with innovative ensemble techniques. Sci. Rep. 2025, 15, 17892. [Google Scholar] [CrossRef]
  29. Das, A.; Das Choudhury, S.; Das, A.K.; Samal, A.; Awada, T. EmergeNet: A novel deep-learning based ensemble segmentation model for emergence timing detection of coleoptile. Front. Plant Sci. 2023, 14, 1084778. [Google Scholar] [CrossRef]
  30. Dang, T.; Nguyen, T.T.; McCall, J.; Elyan, E.; Moreno-García, C.F. Two-layer Ensemble of Deep Learning Models for Medical Image Segmentation. Cogn. Comput. 2024, 16, 1141–1160. [Google Scholar] [CrossRef]
  31. Fan, Z.; Xia, W.; Liu, X.; Li, H. Detection and segmentation of underwater objects from forward-looking sonar based on a modified Mask RCNN. Signal Image Video Process. 2021, 15, 1135–1143. [Google Scholar] [CrossRef]
  32. Yang, D.; Cheng, C.; Wang, C.; Pan, G.; Zhang, F. Side-scan sonar image segmentation based on multi-channel CNN for AUV navigation. Front. Neurorobot. 2022, 16, 928206. [Google Scholar] [CrossRef] [PubMed]
  33. Huang, C.; Zhao, J.; Zhang, H.; Yu, Y. Seg2Sonar: A full-class sample synthesis method applied to underwater sonar image target detection, recognition, and segmentation tasks. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5909319. [Google Scholar] [CrossRef]
  34. Wang, Z.; Zhang, S.; Gross, L.; Zhang, C.; Wang, B. Fused adaptive receptive field mechanism and dynamic multiscale dilated convolution for side-scan sonar image segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5116817. [Google Scholar] [CrossRef]
  35. Huang, H.; Zuo, Z.; Sun, B.; Wu, P.; Zhang, J. DSA-SOLO: Double split attention SOLO for side-scan sonar target segmentation. Appl. Sci. 2022, 12, 9365. [Google Scholar] [CrossRef]
  36. Xu, W.; Xu, Y.; Chang, T.; Tu, Z. Co-scale conv-attentional image transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual Event, 11–17 October 2021; pp. 9981–9990. [Google Scholar]
  37. Yang, J.; Li, C.; Zhang, P.; Dai, X.; Xiao, B.; Yuan, L.; Gao, J. Focal self-attention for local-global interactions in vision transformers. arXiv 2021, arXiv:2107.00641. [Google Scholar]
  38. Rajani, H.; Gracias, N.; Garcia, R. A convolutional vision transformer for semantic segmentation of side-scan sonar data. Ocean Eng. 2023, 286, 115647. [Google Scholar] [CrossRef]
  39. He, A.; Wang, K.; Li, T.; Du, C.; Xia, S.; Fu, H. H2Former: An efficient hierarchical hybrid transformer for medical image segmentation. IEEE Trans. Med Imaging 2023, 42, 2763–2775. [Google Scholar] [CrossRef]
  40. Hatamizadeh, A.; Tang, Y.; Nath, V.; Yang, D.; Myronenko, A.; Landman, B.; Roth, H.R.; Xu, D. UNETR: Transformers for 3D medical image segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 574–584. [Google Scholar]
  41. He, X.; Zhou, Y.; Zhao, J.; Zhang, D.; Yao, R.; Xue, Y. Swin Transformer Embedding UNet for Remote Sensing Image Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4408715. [Google Scholar] [CrossRef]
  42. Zhang, W.; Huang, Z.; Luo, G.; Chen, T.; Wang, X.; Liu, W.; Yu, G.; Shen, C. TopFormer: Token pyramid transformer for mobile semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12083–12093. [Google Scholar]
  43. Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS J. Photogramm. Remote Sens. 2022, 190, 196–214. [Google Scholar] [CrossRef]
  44. Zhang, T.; Zhang, X.; Ke, X.; Liu, C.; Xu, X.; Zhan, X.; Wang, C.; Ahmad, I.; Zhou, Y.; Pan, D. HOG-ShipCLSNet: A novel deep learning network with HOG feature fusion for SAR ship classification. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5210322. [Google Scholar] [CrossRef]
  45. Yi, J.; Hou, J.; Huang, L.; Shi, H.; Hu, J. Partial occlusion face recognition based on CNN and HOG feature fusion. In Proceedings of the 2021 IEEE 4th International Conference on Electronics and Communication Engineering (ICECE), Xi’an, China, 17–19 December 2021; pp. 55–59. [Google Scholar]
  46. Singh, D.; Valdenegro-Toro, M. The marine debris dataset for forward-looking sonar semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual Event, 11–17 October 2021; pp. 3741–3749. [Google Scholar]
  47. Xie, K.; Yang, J.; Qiu, K. A dataset with multibeam forward-looking sonar for underwater object detection. Sci. Data 2022, 9, 739. [Google Scholar] [CrossRef]
  48. Kingma, D.P. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  49. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.-Y. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 4015–4026. [Google Scholar]
  50. Huang, H.; Lin, L.; Tong, R.; Hu, H.; Zhang, Q.; Iwamoto, Y.; Han, X.; Chen, Y.-W.; Wu, J. UNet 3+: A full-scale connected UNet for medical image segmentation. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtual Event, 4–8 May 2020; pp. 1055–1059. [Google Scholar]
  51. Wang, H.; Xie, S.; Lin, L.; Iwamoto, Y.; Han, X.-H.; Chen, Y.-W.; Tong, R. Mixed transformer U-Net for medical image segmentation. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 2390–2394. [Google Scholar]
  52. Zhou, M.; Zhou, Y.; Yang, D.; Song, K. Remote sensing image classification based on Canny operator enhanced edge features. Sensors 2024, 24, 3912. [Google Scholar] [CrossRef] [PubMed]
  53. Koike, H.; Ashizawa, K.; Tsutsui, S.; Kurohama, H.; Okano, S.; Nagayasu, T.; Kido, S.; Uetani, M.; Toya, R. Differentiation between heterogeneous GGN and part-solid nodule using 2D grayscale histogram analysis of thin-section CT image. Clin. Lung Cancer 2023, 24, 541–550. [Google Scholar] [CrossRef] [PubMed]
  54. Khayyat, M.M.; Zamzami, N.; Zhang, L.; Nappi, M.; Umer, M. Fuzzy-CNN: Improving personal human identification based on IRIS recognition using LBP features. J. Inf. Secur. Appl. 2024, 83, 103761. [Google Scholar] [CrossRef]
Figure 1. The overall architecture of the TriEncoderNet.
Figure 2. The architecture of the HETransformer.
Figure 3. The architecture of the CrossFusionTransformer.
Figure 4. The architecture of the HOG Extract Block and the HogConv Block.
Figure 5. The architecture of the HAG.
Figure 6. Sample images of the Marine Debris Dataset.
Figure 7. Sample images of the UATD Dataset.
Figure 8. Computational efficiency comparison with SOTA methods. Bubble size represents the model's mAP.
Figure 9. Visualization on the Marine Debris Dataset. (a) Ground truth. (b) U-Net. (c) UNet3+. (d) SegNet. (e) TransUNet. (f) SAM. (g) Ours.
Figure 10. Visualization on the UATD Dataset. (a) Ground truth. (b) U-Net. (c) UNet3+. (d) SegNet. (e) TransUNet. (f) SAM. (g) Ours.
Table 1. Summary of sonar image segmentation methods.

Categories | Methods | Advantages | Disadvantages
Traditional methods | Thresholding-based methods [7]; clustering-based methods [8]; Markov random field (MRF)-based methods [9]; level set-based methods [10,11] | Minimal training requirement; low computational complexity. | Limited robustness and generalizability.
Deep learning methods | CNN-based algorithms: U-Net [13], DeepLabv3+ [14], FCNN [15], SegNet [16], MAANU-Net [17], and LMA-Net [18] | Handle local features and multi-scale information well. | Insufficient global feature extraction.
Deep learning methods | Transformer-based methods: Swin Transformer [20], RegionViT [21] | Capture global contextual relationships remarkably well. | Insufficient representation of local features.
Deep learning methods | Hybrid approaches: iFormer [24], TransUNet [25], SonarNet [26] | Balanced modeling of local details and global context. | Limited exploitation of the interdependencies between local and global features.
Table 2. Description of key symbols and parameters used in TriEncoderNet.

Symbol | Description
H, W, C | Height, width, and channel number of the input image
P | Patch size used in patch embedding
d | Dimension of the feature vectors (tokens)
n | Number of tokens
M | Window size in window partition
k | Downsampling factor in LMSA
Z_trans | Feature vectors from the Transformer encoder
Z_cnn | Feature vectors from the CNN encoder
Z_hog | Feature vectors from the HOG encoder
Map | Attention maps generated in HAG
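To make the notation concrete, the sketch below shows how P, d, M, and n typically interact in a Swin-style pipeline: a P × P patch embedding produces n = (H/P)(W/P) tokens of dimension d, which are then grouped into M × M windows for windowed self-attention. This is a generic illustration under our own assumptions (including the choice of d), not the paper's exact implementation.

```python
# Generic illustration of the symbols in Table 2: a P x P patch embedding turns
# an (H, W, C) image into n = (H/P) * (W/P) tokens of dimension d, and the token
# grid is split into M x M windows for windowed self-attention (Swin-style).
# This is a sketch under our own assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

H, W, C = 512, 512, 1   # input height, width, channels
P, d, M = 4, 96, 7      # patch size, token dimension, window size (P and M as in Table 3)

patch_embed = nn.Conv2d(C, d, kernel_size=P, stride=P)  # patch embedding as a strided conv


def window_partition(feat: torch.Tensor, window: int) -> torch.Tensor:
    """Split a (B, d, h, w) feature map into (num_windows * B, window * window, d) tokens."""
    b, dim, h, w = feat.shape
    pad_h, pad_w = (-h) % window, (-w) % window          # pad so h, w are multiples of M
    feat = F.pad(feat, (0, pad_w, 0, pad_h))
    h2, w2 = feat.shape[2], feat.shape[3]
    feat = feat.view(b, dim, h2 // window, window, w2 // window, window)
    return feat.permute(0, 2, 4, 3, 5, 1).reshape(-1, window * window, dim)


x = torch.randn(1, C, H, W)
tokens = patch_embed(x)                 # (1, d, H/P, W/P): n = 128 * 128 tokens of dimension d
windows = window_partition(tokens, M)   # each row is one M x M window of tokens
print(tokens.shape, windows.shape)      # torch.Size([1, 96, 128, 128]) torch.Size([361, 49, 96])
```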
Table 3. Implementation details and hyperparameters.

Category | Parameter | Value
Training settings | Optimizer | Adam
Training settings | Initial learning rate | 1 × 10⁻⁴
Training settings | Batch size | 8
Training settings | Epochs | 160
Training settings | Weight decay | 1 × 10⁻⁴
Training settings | Learning rate scheduler | Cosine annealing
Model architecture | Input size | 512 × 512
Model architecture | Patch size (P) | 4
Model architecture | Window size (M) | 7
Data augmentation | Random flip | Probability 0.5
Data augmentation | Random rotation | [−10°, 10°]
Data augmentation | Random scaling | [0.5, 2.0]
Data augmentation | Color jittering | Brightness, contrast
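The settings in Table 3 map onto a conventional PyTorch training loop, sketched below. The one-layer stand-in model and random tensors are placeholders so the loop is runnable; they are not TriEncoderNet or the real data pipeline, and in practice the geometric augmentations of Table 3 would be applied jointly to each image and its mask.

```python
# Hedged sketch of the training configuration in Table 3 (Adam, cosine annealing,
# batch size 8, 512 x 512 inputs). The stand-in model and random tensors are
# placeholders only; they are not the paper's model or dataset.
import torch
import torch.nn as nn

EPOCHS, BATCH_SIZE, LR, WEIGHT_DECAY, NUM_CLASSES = 160, 8, 1e-4, 1e-4, 12

model = nn.Conv2d(1, NUM_CLASSES, kernel_size=3, padding=1)   # stand-in for the segmentation model
optimizer = torch.optim.Adam(model.parameters(), lr=LR, weight_decay=WEIGHT_DECAY)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)
criterion = nn.CrossEntropyLoss()

for epoch in range(2):                                        # 2 demo epochs here; Table 3 uses 160
    model.train()
    for _ in range(4):                                        # stand-in for the sonar DataLoader
        images = torch.randn(BATCH_SIZE, 1, 512, 512)         # input size from Table 3
        masks = torch.randint(0, NUM_CLASSES, (BATCH_SIZE, 512, 512))
        optimizer.zero_grad()
        loss = criterion(model(images), masks)                # per-pixel cross-entropy
        loss.backward()
        optimizer.step()
    scheduler.step()                                          # cosine annealing, stepped per epoch
```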
Table 4. Quantitative results of TriEncoderNet and SOTA methods on the Marine Debris Dataset. Per-class columns report IoU ↑; mIoU ↑ and MPA ↑ are the overall evaluation indices.

Model | Background | Bottle | Can | Chain | Drink Carton | Hook | Propeller | Shampoo Bottle | Standing Bottle | Tire | Valve | Wall | mIoU | MPA
U-Net | 0.99 | 0.81 | 0.54 | 0.69 | 0.69 | 0.65 | 0.54 | 0.73 | 0.49 | 0.86 | 0.59 | 0.80 | 0.699 | 0.816
DeepLabv3+ | 0.99 | 0.79 | 0.61 | 0.62 | 0.76 | 0.68 | 0.70 | 0.86 | 0.53 | 0.89 | 0.57 | 0.87 | 0.739 | 0.846
FCNN | 0.98 | 0.76 | 0.59 | 0.57 | 0.69 | 0.64 | 0.63 | 0.76 | 0.41 | 0.86 | 0.50 | 0.83 | 0.685 | 0.801
UNet3+ | 0.99 | 0.81 | 0.63 | 0.64 | 0.79 | 0.76 | 0.70 | 0.83 | 0.46 | 0.90 | 0.64 | 0.85 | 0.750 | 0.872
SegNet | 0.99 | 0.75 | 0.57 | 0.63 | 0.75 | 0.73 | 0.73 | 0.81 | 0.49 | 0.90 | 0.65 | 0.89 | 0.741 | 0.851
PVT | 0.99 | 0.83 | 0.56 | 0.70 | 0.71 | 0.68 | 0.64 | 0.76 | 0.46 | 0.87 | 0.60 | 0.82 | 0.718 | 0.848
TransUNet | 0.99 | 0.71 | 0.67 | 0.69 | 0.72 | 0.59 | 0.73 | 0.76 | 0.31 | 0.90 | 0.42 | 0.86 | 0.696 | 0.814
Swin Transformer | 0.99 | 0.82 | 0.67 | 0.63 | 0.74 | 0.67 | 0.79 | 0.80 | 0.50 | 0.87 | 0.63 | 0.88 | 0.749 | 0.871
MT-UNet | 0.99 | 0.84 | 0.54 | 0.69 | 0.77 | 0.70 | 0.71 | 0.84 | 0.44 | 0.90 | 0.65 | 0.88 | 0.746 | 0.870
CoaT | 0.99 | 0.80 | 0.56 | 0.71 | 0.70 | 0.67 | 0.53 | 0.72 | 0.51 | 0.88 | 0.61 | 0.82 | 0.708 | 0.823
SAM | 0.99 | 0.78 | 0.59 | 0.66 | 0.78 | 0.74 | 0.73 | 0.84 | 0.46 | 0.90 | 0.64 | 0.88 | 0.749 | 0.886
Ours | 0.99 | 0.85 | 0.70 | 0.73 | 0.81 | 0.77 | 0.79 | 0.86 | 0.54 | 0.91 | 0.66 | 0.90 | 0.793 | 0.916
Table 5. Quantitative results of TriEncoderNet and SOTA methods on the UATD Dataset. Per-class columns report IoU ↑; mIoU ↑ and MPA ↑ are the overall evaluation indices.

Model | Background | Cube | Ball | Cylinder | Human Body | Tire | Cage | Metal Bucket | Plane | ROV | mIoU | MPA
U-Net | 0.99 | 0.43 | 0.41 | 0.25 | 0.39 | 0.46 | 0.47 | 0.40 | 0.40 | 0.41 | 0.461 | 0.523
DeepLabv3+ | 0.99 | 0.38 | 0.41 | 0.26 | 0.34 | 0.40 | 0.45 | 0.43 | 0.35 | 0.41 | 0.442 | 0.491
FCNN | 0.99 | 0.32 | 0.31 | 0.16 | 0.31 | 0.30 | 0.43 | 0.47 | 0.22 | 0.20 | 0.371 | 0.421
UNet3+ | 0.99 | 0.41 | 0.45 | 0.26 | 0.40 | 0.42 | 0.49 | 0.52 | 0.45 | 0.30 | 0.469 | 0.558
SegNet | 0.99 | 0.39 | 0.47 | 0.28 | 0.42 | 0.44 | 0.50 | 0.41 | 0.42 | 0.38 | 0.470 | 0.550
PVT | 0.99 | 0.38 | 0.45 | 0.30 | 0.41 | 0.45 | 0.51 | 0.42 | 0.39 | 0.37 | 0.467 | 0.516
TransUNet | 0.99 | 0.36 | 0.60 | 0.26 | 0.43 | 0.49 | 0.45 | 0.58 | 0.43 | 0.36 | 0.495 | 0.588
Swin Transformer | 0.99 | 0.40 | 0.62 | 0.29 | 0.46 | 0.49 | 0.61 | 0.60 | 0.54 | 0.50 | 0.550 | 0.632
MT-UNet | 0.99 | 0.39 | 0.43 | 0.28 | 0.36 | 0.43 | 0.48 | 0.46 | 0.26 | 0.42 | 0.450 | 0.515
CoaT | 0.99 | 0.45 | 0.40 | 0.26 | 0.41 | 0.41 | 0.50 | 0.38 | 0.41 | 0.42 | 0.463 | 0.521
SAM | 0.99 | 0.51 | 0.59 | 0.27 | 0.45 | 0.51 | 0.59 | 0.61 | 0.56 | 0.46 | 0.554 | 0.641
Ours | 0.99 | 0.53 | 0.62 | 0.30 | 0.48 | 0.52 | 0.64 | 0.62 | 0.62 | 0.50 | 0.582 | 0.687
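The per-class IoU, mIoU, and mean pixel accuracy (MPA) values reported in Tables 4 and 5 follow the usual confusion-matrix definitions; a minimal sketch of that computation is shown below. The paper's exact evaluation script may differ in details such as how classes are averaged, so the snippet is illustrative only.

```python
# Standard confusion-matrix computation of per-class IoU, mean IoU, and mean
# pixel accuracy, as commonly used for metrics like those in Tables 4 and 5.
import numpy as np


def confusion_matrix(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> np.ndarray:
    """Rows index the ground-truth class, columns the predicted class."""
    mask = (gt >= 0) & (gt < num_classes)
    return np.bincount(num_classes * gt[mask] + pred[mask],
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)


def miou_and_mpa(cm: np.ndarray):
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp                         # predicted as class c, labelled otherwise
    fn = cm.sum(axis=1) - tp                         # labelled class c, predicted otherwise
    iou = tp / np.maximum(tp + fp + fn, 1)           # per-class IoU
    pixel_acc = tp / np.maximum(cm.sum(axis=1), 1)   # per-class pixel accuracy
    return iou.mean(), pixel_acc.mean()


# toy example with 12 classes, as in the Marine Debris Dataset
pred = np.random.randint(0, 12, size=(480, 320))
gt = np.random.randint(0, 12, size=(480, 320))
miou, mpa = miou_and_mpa(confusion_matrix(pred, gt, 12))
print(f"mIoU={miou:.3f}, MPA={mpa:.3f}")
```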
Table 6. Quantitative robustness analysis of TriEncoderNet and baseline methods under synthetic noise on the Marine Debris Dataset.

Model | Noise Type | Intensity (σ²) | mIoU | Performance Drop
U-Net | None | 0 | 0.699 | -
U-Net | Gaussian | 0.01 | 0.642 | −8.2%
U-Net | Gaussian | 0.05 | 0.515 | −26.3%
U-Net | Speckle | 0.05 | 0.588 | −15.9%
U-Net | Speckle | 0.10 | 0.473 | −32.3%
Swin Transformer | None | 0 | 0.749 | -
Swin Transformer | Gaussian | 0.01 | 0.695 | −7.2%
Swin Transformer | Gaussian | 0.05 | 0.580 | −22.6%
Swin Transformer | Speckle | 0.05 | 0.645 | −13.9%
Swin Transformer | Speckle | 0.10 | 0.535 | −28.6%
TriEncoderNet (Ours) | None | 0 | 0.793 | -
TriEncoderNet (Ours) | Gaussian | 0.01 | 0.771 | −2.8%
TriEncoderNet (Ours) | Gaussian | 0.05 | 0.705 | −11.1%
TriEncoderNet (Ours) | Speckle | 0.05 | 0.748 | −5.7%
TriEncoderNet (Ours) | Speckle | 0.10 | 0.682 | −14.0%
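Table 6 perturbs the test images with additive Gaussian noise and multiplicative speckle noise of variance σ². A simple way to generate such perturbations on images scaled to [0, 1] is sketched below; the clipping and the exact multiplicative model are our assumptions, since the paper's noise pipeline is not reproduced here.

```python
# Simple generators for the synthetic perturbations in Table 6: additive Gaussian
# noise and multiplicative speckle noise with variance sigma2. Images are assumed
# to be float arrays scaled to [0, 1]; the clipping choice is our assumption.
import numpy as np

rng = np.random.default_rng(0)


def add_gaussian_noise(img: np.ndarray, sigma2: float) -> np.ndarray:
    noisy = img + rng.normal(0.0, np.sqrt(sigma2), size=img.shape)
    return np.clip(noisy, 0.0, 1.0)


def add_speckle_noise(img: np.ndarray, sigma2: float) -> np.ndarray:
    # Multiplicative model: img * (1 + n), with n ~ N(0, sigma2).
    noisy = img * (1.0 + rng.normal(0.0, np.sqrt(sigma2), size=img.shape))
    return np.clip(noisy, 0.0, 1.0)


img = np.random.rand(512, 512)
for sigma2 in (0.01, 0.05):
    print("gaussian", sigma2, float(add_gaussian_noise(img, sigma2).std()))
for sigma2 in (0.05, 0.10):
    print("speckle", sigma2, float(add_speckle_noise(img, sigma2).std()))
```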
Table 7. Comparison of computational cost (GFLOPs) with SOTA methods. Input size is standardized to 512 × 512.

Method | GFLOPs (G)
U-Net | 126.4
DeepLabv3+ | 30
FCNN | 152.3
UNet3+ | 195.4
SegNet | 160.2
PVT | 46.5
TransUNet | 128
Swin Transformer | 80.5
MT-UNet | 75.1
CoaT | 94.8
SAM | 325.7
TriEncoderNet (Ours) | 54
Table 8. Performance comparison of different hand-crafted features on the Marine Debris Dataset and UATD Dataset.

Hand-Crafted Feature | Marine Debris mIoU | Marine Debris mAP | UATD mIoU | UATD mAP
None | 0.780 | 0.902 | 0.570 | 0.671
Canny | 0.785 | 0.908 | 0.575 | 0.677
Grayscale histogram | 0.776 | 0.890 | 0.557 | 0.657
LBP | 0.778 | 0.895 | 0.564 | 0.668
Sobel | 0.782 | 0.909 | 0.572 | 0.673
Unnormalized HOG | 0.785 | 0.906 | 0.562 | 0.667
Learned gradients | 0.786 | 0.901 | 0.576 | 0.670
HOG (Ours) | 0.793 | 0.916 | 0.582 | 0.687
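For reference, block-normalized HOG descriptors such as those compared in Table 8 can be computed with scikit-image. The orientation, cell, and block settings below are common defaults and are not necessarily those of the HOG Extract Block.

```python
# Computing block-normalized HOG features for a sonar image with scikit-image.
# The settings below are common defaults, not necessarily the exact configuration
# used by the HOG Extract Block.
import numpy as np
from skimage.feature import hog

image = np.random.rand(512, 512)  # stand-in for a grayscale FLS image in [0, 1]

features, hog_image = hog(
    image,
    orientations=9,            # number of gradient orientation bins
    pixels_per_cell=(8, 8),
    cells_per_block=(2, 2),
    block_norm="L2-Hys",       # block normalization ("Unnormalized HOG" in Table 8 skips this step)
    visualize=True,            # also return a visualization image
    feature_vector=True,
)
print(features.shape, hog_image.shape)
```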
Table 9. Ablation experiment of the proposed modules on the Marine Debris Dataset and UATD Dataset. Check marks indicate which proposed modules (HET, CFT, HAG) and the HOG encoder are enabled in each variant.

Model | HET | CFT | HAG | HOG | Marine Debris mIoU | Marine Debris mAP | UATD mIoU | UATD mAP
CNN-baseline | - | - | - | - | 0.699 | 0.816 | 0.461 | 0.523
Transformer-baseline | - | - | - | - | 0.752 | 0.880 | 0.523 | 0.636
CT | - | - | - | - | 0.764 | 0.886 | 0.555 | 0.650
CT + HET | ✓ | - | - | - | 0.769 | 0.889 | 0.561 | 0.661
CT + CFT | - | ✓ | - | - | 0.773 | 0.895 | 0.563 | 0.663
CT + CFT + HET | ✓ | ✓ | - | - | 0.780 | 0.902 | 0.570 | 0.671
CTH | - | - | - | ✓ | 0.775 | 0.891 | 0.559 | 0.657
CTH + HAG | - | - | ✓ | ✓ | 0.783 | 0.896 | 0.566 | 0.669
CTH + CFT + HET | ✓ | ✓ | - | ✓ | 0.787 | 0.910 | 0.575 | 0.676
CTH + CFT + HET + SE | ✓ | ✓ | - | ✓ | 0.785 | 0.908 | 0.576 | 0.673
CTH + CFT + HET + CBAM | ✓ | ✓ | - | ✓ | 0.788 | 0.904 | 0.576 | 0.669
CTH + CFT + HET + HAG (Ours) | ✓ | ✓ | ✓ | ✓ | 0.793 | 0.916 | 0.582 | 0.687
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
