Attentive SOLO for Sonar Target Segmentation

Imaging sonar systems play an important role in underwater target detection and location. Due to the influence of reverberation noise on imaging sonar systems, the task of sonar target segmentation is a challenging problem. In order to segment different types of targets in sonar images accurately, we proposed the gated fusion-pyramid segmentation attention (GF-PSA) module. Specifically, inspired by gated full fusion, we improved the pyramid segmentation attention (PSA) module by using gated fusion to reduce the noise interference during feature fusion and improve segmentation accuracy. Then, we improved the SOLOv2 (Segmenting Objects by Locations v2) algorithm with the proposed GF-PSA and named the improved algorithm Attentive SOLO. In addition, we constructed a sonar target segmentation dataset, named STSD, which contains 4000 real sonar images, covering eight object categories with a total of 7077 target annotations. The experimental results show that the segmentation accuracy of Attentive SOLO on STSD is as high as 74.1%, which is 3.7% higher than that of SOLOv2.


Introduction
As countries pay more attention to the ocean environment, marine exploration plays an increasingly important role in marine research. Moreover, the growing demand for marine surveys has greatly promoted the development of imaging sonar systems, which can be carried by a surveying ship, an unmanned surface vehicle (USV) or an unmanned underwater vehicle (UUV) to perform localization, identification and tracking tasks.
The sonar system forms images by transmitting sound waves and receiving their echoes: the transmitted sound wave is reflected back and received after encountering the target object. Therefore, the received echo carries the distinctive sound wave absorption characteristics of different objects. However, the performance of the sonar system is constrained by natural unstructured terrain. Due to the complexity of the underwater acoustic channel and the variability of sound wave scattering, the received echo is also mixed with interference, including environmental noise, reverberation and sonar self-noise, which makes accurate target segmentation of sonar images significantly more difficult.
Over the past few decades, researchers have proposed various traditional sonar image segmentation methods based on geometric features, probability models, level sets and Markov random field (MRF) theory. Chen et al. [1] established a new energy function combining a unified MRF (UMRF) and level sets (LS). The UMRF integrates pixel-level and region-level information to analyze inter-pixel and inter-region neighborhood relationships, and the LS then evolves according to the UMRF results so that the model can accurately segment sonar images. Ye et al. [2] proposed two new level sets for sonar image segmentation. Firstly, the local texture feature is extracted by using a Gauss-Markov random field model and integrated into the level set energy function to dynamically select the region of interest. Then, the proposed two-phase and multi-phase level set models are obtained by optimizing the energy function. Finally, the segmentation results are obtained according to the two level sets. Song et al. [3] employed simple linear iterative clustering (SLIC) to segment sonar images into homogeneous superpixels and then used a uniformity facet and adaptive intensity constraint strategy to optimize the MRF segmentation outcome in each iteration. Abu et al. [4] proposed a novel segmentation method called EnFK, in which local spatial and statistical information are treated as fuzzy terms. Firstly, the sonar images are denoised and both the processed and original images are fed into the segmentation process to enhance convergence. Then, the two novel fuzzy terms are used to obtain the final segmentation results. Wang et al. [5] proposed a region-growing-based segmentation using the likelihood ratio testing method (RGLT), which focuses on the regions of the target and shadows. Specifically, it obtains the seed points of the highlight and shadow regions by a likelihood ratio test based on the statistical probability distribution and then grows them according to a similarity criterion. Consequently, the method avoids processing the seabed reverberation area and considerably reduces the segmentation time. Although these methods perform well in some sonar image segmentation tasks, their segmentation performance degrades greatly when dealing with sonar images with severe noise and uneven intensity.
Since deep learning methods have been proven to achieve significant advantages in RGB image segmentation, detection and tracking, more and more researchers have tried to adopt deep learning methods for imaging sonar systems. Liu et al. [6] proposed a CNN with multi-scale inputs (MSCNN), which is trained with a strategy of data expansion and integration learning, and then combined the MSCNN with a Markov random field (MRF) to obtain the final segmentation map. Zhao et al. [7] proposed an encoder-decoder network called Dilated Convolutional Network (DcNet), which uses dilated convolution and depthwise separable convolution between the encoder and decoder to obtain more context information and improve segmentation accuracy. Wu et al. [8] proposed an encoder-decoder network named ECNet and utilized a fully convolutional network (FCN) and a deeply supervised network for end-to-end prediction. ECNet uses an encoder to obtain context information from sonar images and a decoder to recover high-resolution feature maps from the low-resolution context feature map, and finally obtains the pixel-based segmentation map from the output of the decoder. Wang et al. [9] used a depthwise separable residual module for multi-scale feature extraction of the target regions to suppress noise interference. A multi-channel feature fusion method was used to enhance the feature information of the convolution layer and an adaptive supervision function was used to classify pixels and objects of different categories. Jiao et al. [10] proposed a relative loss function method that aims to solve small-target segmentation in sonar images by simultaneously considering the probability of pixels in target and non-target regions. In this method, the loss function of the back propagation process of the FCN [11] is improved to speed up the convergence of the network and improve the segmentation accuracy.
Existing deep-learning-based segmentation methods can achieve good performance in most scenes, yet most of the above methods are semantic segmentation methods, which cannot identify the category of the targets. In this paper, we employed the instance segmentation method to deal with the sonar image segmentation task. An attention module is a common way to improve the accuracy of a segmentation algorithm. However, sonar images differ from RGB images: the imaging principle of sonar systems leads to high noise interference, weak boundary information and difficult target feature extraction. We therefore proposed the GF-PSA module, which analyzes the sonar feature maps at multiple scales, extracts boundary information at different scales and fuses the information to ensure the accuracy of segmentation. We embedded a GF-PSA module into SOLOv2 [12] and the new network is named Attentive SOLO. As SOLOv2 segments objects by locations, it ignores the channel-wise information between multi-scale feature maps, so we employed channel attention in GF-PSA to obtain more useful information from the multi-scale feature maps. In the process of feature fusion, we used the gated full fusion (GFF) [13] mechanism, which can reduce noise interference, to improve the fusion method in the pyramid segmentation attention (PSA) [14]. We tested the feasibility of the algorithm on a real sonar dataset named STSD and the results showed that our proposed Attentive SOLO method outperformed other existing segmentation algorithms on the sonar dataset.
The main contributions of this work are summarized as follows: (1) A new model named Attentive SOLO for sonar image segmentation is designed. The improved attention module of gated fusion pyramid segmentation was used to extract the boundary information of sonar image targets, improving the accuracy of the segmentation results. (2) A GF-PSA module is designed. The GFF was used to improve the fusion method of PSA, reducing the noise in the PSA module during feature fusion and improving the segmentation accuracy. (3) A sonar image dataset named STSD for sonar target segmentation is constructed.
The sonar dataset was collected by Pengcheng Laboratory in Shenzhen, Guangdong Province, China, in the sea area near Zhanjiang City, Guangdong Province. We annotated the dataset, which contains 4000 real sonar images, eight different object categories and 7077 instance annotations.
This paper is organized as follows. In Section 2, we review the related works, including instance segmentation algorithm, attention mechanisms, multi-scale feature fusion methods and sonar datasets. In Section 3, we describe the detailed content of Attentive SOLO. The experiments and results are reported in Section 4. Finally, we conclude our work in Section 5.

Instance Segmentation Algorithm
In recent years, the instance segmentation algorithm has improved considerably. Instance segmentation includes two tasks: semantic segmentation and object detection. Based on the two tasks, current instance segmentation methods are classified into two types: top-down and bottom-up methods. The top-down method firstly uses the target detection method to get an a priori bounding box and then semantic segmentation is performed within the priori bounding box. Representative algorithms include Mask R-CNN [15] and PANet [16]. The bottom-up method carries out semantic segmentation and then distinguishes different instances using clustering and metric learning. Representative algorithms include SGN [17] and SSAP [18]. All the above methods have two stages. Recently, borrowing from one-stage target detection research, some one-stage instance segmentation methods have been proposed. The one-stage methods can also be divided into two classes. One approach borrows from the idea of YOLO [19] and the representative algorithms include YOLACT [20] and SOLO [21]. Another approach is inspired by FCOS [22] and the representative algorithms are PolarMask [23] and AdaptIS [24]. Compared with other instance segmentation algorithms, SOLO adopts a fully convolutional, box-free and grouping-free approach to directly output the instance mask and the corresponding class, which balances speed and accuracy better.
Instance segmentation is widely used in many tasks, such as target tracking, human representation learning and underwater object detection. Zhou et al. [25] applied instance segmentation to the unsupervised video multi-target segmentation task, which solved the verification of unseen targets. Specifically, a novel network, which combines foreground region estimation and instance grouping, is proposed to characterize an unseen target. Further, the appearance model is employed to capture more fine-grained information. Zhou et al. [26] proposed a new bottom-up network, which utilizes sparse key points to simplify the human representation. In the training process, projected gradient descent and Dykstra's cyclic projection algorithm are used to supervise the whole learning process. Xu et al. [27] proposed an active Mask-Box Scoring R-CNN method, which efficiently balances the boxIoU and NMS score. In addition, a triplet-measure-based active learning (TBAL) method and a balanced-sampling method are used to improve the performance of the network.

Attention Mechanism
Attention mechanism has been extensively exploited in various deep learning tasks, such as image processing, speech recognition and natural language processing. By using an attention mechanism, CNNs can obtain more useful detailed information and suppress other useless information, improving network robustness.
At present, many excellent attention modules have been proposed. SENet [28], proposed by Hu et al., contains two parts: squeeze, which establishes dependencies between channels, and excitation, which recalibrates features. The structure of SENet is shown in Figure 1a, where H', W' and C' represent the height, width and channel number of the input X, respectively, and H, W and C represent those of the feature maps U and the result $\tilde{X}$. $F_{sq}(\cdot)$ is used to shrink the spatial size of U to 1 × 1. $F_{ex}(\cdot, W)$ is used to capture the dependencies between channels. $F_{scale}(\cdot, \cdot)$ denotes channel-wise multiplication. Jaderberg et al. [29] proposed Spatial Transformer Networks (STN), which use interpolation to apply an affine transformation between the input and output and thereby obtain their mapping relationship. As shown in Figure 1b, the input feature map U is passed to a localization network, which regresses the transformation parameters θ. The regular spatial grid G is transformed to the sampling grid $\mathcal{T}_{\theta}(G)$, which is applied to U to produce the output feature map V. Woo et al. [30] proposed a generic module called Convolutional Block Attention Module (CBAM), which can be integrated into an existing network with little computation. CBAM fuses channel attention and spatial attention in series (shown in Figure 1c). Firstly, a channel attention similar to SENet is employed to analyze the relationship between channels, except that a parallel max-pooling layer is added in CBAM. Then, the spatial attention utilizes global max-pooling and global average pooling to obtain two feature maps, which are aggregated by a concat operation, and the final output is generated by a sigmoid function. Fu et al. [31] proposed DANet, which contains two new attention modules: the Position Attention Module and the Channel Attention Module. DANet extracts global context information based on local features generated by a dilated residual network and obtains better feature expression for pixel-level prediction.

Figure 1. (a) represents SENet, which mainly focuses on the channel relationship between feature maps; (b) represents SpatialNet, which mainly focuses on the spatial relationship between feature maps; and (c) represents CBAM, which focuses on both channel and spatial attention.

Multi-Scale Feature Fusion
Fusing feature maps of different scales in convolutional neural networks is an important means to improve segmentation performance. Low-level feature maps contain more details, but they contain less semantic information because they have passed through fewer convolution layers. High-level feature maps have deeper semantic information and a larger receptive field than low-level feature maps, but their detection effect for small targets is poor. To solve this problem, FCN uses mid-level feature prediction to improve the detail structure of segmentation, whereas Hariharan et al. [32] directly combined multi-scale feature maps for prediction. The U-Net [33] model proposed by Ronneberger et al. incorporated skip connections between the encoder and decoder structures to reuse the low-level features. Zhang et al. [34] improved U-Net by fusing high-level features into low-level features. Lin et al. [35] suggested that every two adjacent feature maps in the feature pyramid network (FPN) be fused into a feature map and that the new feature maps be fused continuously to obtain a final feature map.
These feature fusion methods achieved remarkable results, but the useful information of each feature map is ignored in the fusion process, resulting in the fusion results being influenced by the semantic differences in each feature map.

Sonar Datasets
Though sonar image analysis has attracted a lot of research attention, publicly available sonar datasets are relatively limited. They can be divided into two types. The first type is generated from simulation data, such as the Multi-target Noise interference Sonar Dataset (MNSD) and the Single-target Reverberation interference Sonar Dataset (SRSD) [36]: the MNSD is generated by a three-dimensional imaging sonar data simulation experiment, and the SRSD is a single-class bionic dataset with seabed reverberation interference. Liu et al. [37] utilized an acoustic image simulator to generate a forward-looking sonar dataset and used CycleGAN to enhance the dataset.
The second type is collected in real environments. For example, the datasets used in [7] were collected by an autonomous underwater vehicle (AUV) equipped with dual-frequency side-scan sonar, yet the data were mainly collected near a seabed reef and a sand wave and do not contain common underwater targets. Moreover, the datasets used in [9] consist of sonar images with pseudo-color processing and contain three kinds of underwater targets, yet each image only contains one instance. Singh et al. [38] collected a forward-looking sonar dataset using an ARIS Explorer 3000 sensor. The dataset consists of 1868 forward-looking sonar images and 11 categories in total.
Presently, the sonar datasets are still relatively scarce. How to construct a dataset containing more targets and more scenes is still worthy of in-depth study. Therefore, this paper built a new sonar dataset. Table 1 compares several existing sonar datasets; as some datasets are not named, we describe them with the relevant references in the table.
Although STSD does not have the largest number of images among the datasets collected in real environments, it contains eight instance categories. Compared with the dataset in [38], STSD has more images, which helps to avoid overfitting.

Attentive SOLO
As illustrated in Figure 2, Attentive SOLO is built on a ResNet backbone, and an FPN is added after the backbone to deal with multi-scale targets. The proposed GF-PSA is embedded between ResNet and FPN to improve the feature extraction capability. Like SOLOv2, whose core idea is to realize instance segmentation by the position and size of objects, Attentive SOLO first divides the input image into an S × S grid. If the center of a target falls in a grid cell, that cell has two tasks, corresponding to the two branches in SOLOv2: the category branch predicts the semantic category and the mask branch predicts the instance mask. The prediction results of the two branches are associated through their reference grid cell, and the instance segmentation result of each cell is formed directly. Following SOLOv2 [12], the mask of the cell (i, j) is obtained as
$$M_{i,j} = G_{i,j} * F,$$
where $G_{i,j}$ is the convolution kernel predicted for that cell, $F$ is the shared mask feature map and $*$ denotes convolution. Finally, the instance segmentation results of all cells are collected and processed by matrix non-maximum suppression to form the final results.
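The grid assignment described above can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: it uses only the center point of a target (SOLO-style methods typically use a small center region), and the function name `grid_cell_for_center` and the grid size S = 12 are assumptions made for the example. The flat index follows the relation k = i × S + j used in the loss definition.

```python
def grid_cell_for_center(cx, cy, width, height, S):
    """Return (i, j, k) for the S x S grid cell that contains the point (cx, cy).

    i is the row index, j is the column index and k = i * S + j is the
    flat index of the cell. Points on the right or bottom border are
    clamped into the last cell.
    """
    j = min(int(cx / width * S), S - 1)   # column from the x coordinate
    i = min(int(cy / height * S), S - 1)  # row from the y coordinate
    k = i * S + j
    return i, j, k


# Hypothetical example: a 1024 x 1024 image, a 12 x 12 grid and a
# target whose center lies at (x, y) = (300, 700).
print(grid_cell_for_center(300.0, 700.0, 1024, 1024, 12))
```

The cell returned here would be the one responsible for both the category prediction and the mask prediction of that target.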



Gated Fusion Module
In CNNs, it is a common operation to fuse feature maps of different scales. The purpose is to combine the advantages of low-level and high-level feature maps and improve the prediction results. Concatenation is a simple operation that aggregates all the information in multiple feature maps by stitching them together. Addition is another straightforward way to combine all the information of different feature maps by adding features at each location. Both of these methods superimpose the information of each feature map, fusing useless information together.
Therefore, we use a gated fusion module to solve the problem of redundant useless information. Based on addition fusion, the information flow is controlled by gate maps. Specifically, each input $X_i$ is associated with a gate map $G_i \in [0, 1]^{H_i \times W_i}$, and during the fusion process, the feature vector of feature map i at the position (x, y) is fused into feature map j (i ≠ j) only when the value of the gate map $G_i(x, y)$ is greater than the value of $G_j(x, y)$. The gate map is determined by the following equation,
$$G_i = \mathrm{sigmoid}(w_i * X_i),$$
where $w_i \in \mathbb{R}^{1 \times 1 \times C_i}$ refers to a convolutional layer and $C_i$ represents the number of channels of feature map i. The specific operation of gated fusion is shown in Figure 3. With these gate maps, the addition-based fusion is defined, following GFF [13], as:
$$\tilde{X}_j = (1 + G_j) \cdot X_j + (1 - G_j) \cdot \sum_{i \neq j} G_i \cdot X_i,$$
where · denotes element-wise multiplication broadcasting in the channel dimension.
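As a rough illustration of gate-controlled additive fusion, the NumPy sketch below computes one gate map per feature map with a 1 × 1 convolution followed by a sigmoid and then applies the fusion rule from the GFF paper [13]. It assumes that all feature maps already share the same shape (C, H, W), and the function name `gated_fusion` and the weight layout are choices made for this example rather than details taken from the paper.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gated_fusion(features, weights):
    """Fuse same-shaped feature maps with gate-controlled addition.

    features: list of arrays with shape (C, H, W).
    weights:  list of arrays with shape (C,); each one plays the role of
              a 1 x 1 convolution that maps C channels to a single gate.
    Returns a list of fused arrays, one per input, each of shape (C, H, W).
    """
    # Gate map G_i = sigmoid(w_i * X_i), shape (H, W), values in [0, 1].
    gates = [sigmoid(np.tensordot(w, x, axes=([0], [0])))
             for x, w in zip(features, weights)]
    # G_i . X_i, with the gate broadcast along the channel dimension.
    gated = [g[None, :, :] * x for g, x in zip(gates, features)]
    total = sum(gated)

    fused = []
    for x, g, gx in zip(features, gates, gated):
        others = total - gx  # sum over i != j of G_i . X_i
        fused.append((1.0 + g)[None] * x + (1.0 - g)[None] * others)
    return fused


# Toy usage: three 2-channel 3 x 3 maps of ones with zero gate weights,
# so every gate equals 0.5.
maps = [np.ones((2, 3, 3)) for _ in range(3)]
zero_weights = [np.zeros(2) for _ in range(3)]
out = gated_fusion(maps, zero_weights)
print(out[0].shape)
```

A position where a map's own gate is close to 1 keeps mostly its own feature, while a position with a small gate mainly receives the gated features of the other maps.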


Gated Fusion-Pyramid Split Attention Block
The sonar images are not only subject to severe noise interference but are also affected by ocean currents. Ocean currents can drive sand and gravel to form shapes whose features are similar to those of the targets. To accurately separate the targets from the interference, we proposed the GF-PSA module. The structure of GF-PSA is shown in Figure 4. Firstly, the multi-scale feature maps are extracted by the proposed GF-SPC module. The GF-SPC module first divides the feature map X into n parts along the channel dimension, which are represented as $[X_0, X_1, X_2, \cdots, X_{n-1}]$. Each part has the size $X_i \in \mathbb{R}^{H \times W \times C'}$, $i = 0, 1, 2, \cdots, n - 1$, where the number of channels is $C' = C/n$. The size of the convolution kernel directly affects the size of the receptive field and thus the segmentation accuracy. To better segment objects of different scales, we used convolutional kernels with a size of $k_i = 2 \times (i + 1) + 1$. However, increasing the kernel size significantly increases the number of parameters. In order to handle the input tensor with different kernel sizes without increasing the computational effort, we utilized group convolution, in which the group size $g_i$ is set according to the kernel size $k_i$. Table 2 records the kernel size and group size used in the experiment when n = 4. After obtaining the multi-scale feature maps, we used gated fusion to aggregate the useful information from each scale feature map. The useful information was obtained by additive fusion controlled by the gate maps.
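To make the trade-off between kernel size and parameter count concrete, the following sketch evaluates $k_i = 2 \times (i + 1) + 1$ for n = 4 and compares the weight count of a dense convolution with that of a grouped convolution. The channel number C = 256 and the group sizes (1, 4, 8, 16) are hypothetical placeholders for illustration; they are not taken from Table 2.

```python
def kernel_sizes(n):
    """Kernel size of each branch, k_i = 2 * (i + 1) + 1 for i = 0..n-1."""
    return [2 * (i + 1) + 1 for i in range(n)]


def conv_params(c_in, c_out, k, groups=1):
    """Number of weights in a 2-D convolution without bias.

    With grouped convolution, each output channel only sees
    c_in / groups input channels.
    """
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * c_out * k * k


n = 4
C = 256                  # hypothetical channel count of the input X
branch_c = C // n        # C' = C / n channels per branch
groups = [1, 4, 8, 16]   # hypothetical group sizes, one per branch

for k, g in zip(kernel_sizes(n), groups):
    dense = conv_params(branch_c, branch_c, k)
    grouped = conv_params(branch_c, branch_c, k, g)
    print(f"k={k} g={g} dense={dense} grouped={grouped}")
```

Under these assumed settings, the 9 × 9 branch with 16 groups needs fewer weights than the dense 3 × 3 branch, which is the effect group convolution is used for here.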
In order to fuse context information at different scales, the SEWeight module is used to obtain channel-wise attention vectors from the feature maps at different scales so that the high-level feature map can fuse more detailed information. The channel-wise attention vector $Z_i$ corresponding to feature map $F_i$ and the whole set of channel-wise attention vectors are obtained in the following ways,
$$Z_i = \mathrm{SEWeight}(F_i), \quad i = 0, 1, 2, \cdots, n - 1,$$
$$Z = Z_0 \oplus Z_1 \oplus \cdots \oplus Z_{n-1},$$
where ⊕ represents the concat operation. In order to balance spatial attention while obtaining channel attention, we used cross-channel soft attention to adaptively obtain spatial attention weights and re-calibrate the obtained $Z_i$. The soft attention weights are computed as follows,
$$S_i = \mathrm{Softmax}(Z_i) = \frac{\exp(Z_i)}{\sum_{j=0}^{n-1} \exp(Z_j)},$$
where the Softmax operation was used to obtain the re-calibrated attention weight $S_i$, which includes all the position information in the space and the attention weights in the channels. By doing this, the global spatial and channel information was combined with local information. We multiplied the multi-scale attention vector $S_i$ with the corresponding feature map $F_i$ to obtain a new feature map $Y_i$ with multi-scale attention weights. Finally, we spliced the $Y_i$ together by the concat operation to obtain a new feature map with the same number of channels as the original feature map. From the above description, it is clear that GF-PSA can fuse useful information from multi-scale feature maps and consider both channel attention and spatial attention. Therefore, our proposed GF-PSA module has good information interaction between local and global attention.
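The re-weighting steps described above (SEWeight per branch, Softmax across branches, multiplication and concatenation) can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions: the SEWeight module is modelled as global average pooling followed by two fully connected layers, the weights are random, and the names `se_weight` and `psa_reweight` are introduced only for this example.

```python
import numpy as np


def se_weight(f, w1, w2):
    """Channel attention vector of one branch.

    f: (C, H, W) feature map; w1: (C_r, C) and w2: (C, C_r) are the two
    fully connected layers of an SE-style block. Returns a vector (C,).
    """
    squeezed = f.mean(axis=(1, 2))               # global average pooling
    hidden = np.maximum(w1 @ squeezed, 0.0)      # ReLU
    return 1.0 / (1.0 + np.exp(-(w2 @ hidden)))  # sigmoid


def psa_reweight(branches, params):
    """Apply Softmax-recalibrated channel attention to each branch.

    Returns the concatenated output (n * C, H, W) and the weights (n, C).
    """
    z = np.stack([se_weight(f, w1, w2)
                  for f, (w1, w2) in zip(branches, params)])  # (n, C)
    e = np.exp(z - z.max(axis=0, keepdims=True))
    s = e / e.sum(axis=0, keepdims=True)   # Softmax over the n branches
    y = [s_i[:, None, None] * f for s_i, f in zip(s, branches)]
    return np.concatenate(y, axis=0), s


# Toy usage: n = 4 branches with 8 channels each on a 5 x 5 map.
rng = np.random.default_rng(0)
branches = [rng.normal(size=(8, 5, 5)) for _ in range(4)]
params = [(rng.normal(size=(2, 8)), rng.normal(size=(8, 2))) for _ in range(4)]
out, s = psa_reweight(branches, params)
print(out.shape, s.shape)
```

The output has n × C' channels, matching the statement that the concatenated feature map keeps the channel count of the original input.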

Dataset
The dataset used in this paper, STSD, is a forward-looking sonar image dataset collected by Pengcheng Laboratory in Shenzhen, Guangdong Province, China, in the sea off Zhanjiang City, Guangdong Province. The data were collected using a Tritech Gemini 1200i (Aberdeenshire, UK) forward-looking sonar and consist of the original echo intensity information of the sonar in the form of a two-dimensional matrix. For the convenience of processing, the data are stored in BMP image format. The dataset contains eight categories, namely human body, tire, round cage, square cage, metal bucket, cube, sphere and cylinder. It contains 4000 sonar images, which are divided into a training set and a test set in a ratio of 3:1. STSD contains 7077 instance annotations. Table 3 shows the number and ratio of each target category in STSD. In the training process, the error between the predicted results and the ground truth is used to iteratively optimize the model parameters. The ground truth is the standard segmentation result obtained by manual annotation and is the key to establishing the benchmark of the segmentation result. We used the LabelMe software developed by MIT (Cambridge, MA, USA) to annotate the sonar images. However, the original sonar image contains different types of noise and is displayed as a nearly black image, which makes annotation extremely difficult. We therefore processed the images by logarithmic transformation so that the class of the targets contained in the sonar images could be determined manually. Figure 5 displays the original sonar image, the processed image and the ground truth we annotated.

Dataset
The dataset used in this paper, STSD, is a forward-looking sonar image dataset collected by Pengcheng Laboratory (Shenzhen, Guangdong Province, China) in the sea off Zhanjiang City, Guangdong Province. The data were acquired with a Tritech Gemini 1200i (Aberdeenshire, UK) forward-looking sonar and consist of the original echo intensity information of the sonar, which takes the form of a two-dimensional matrix. For convenience of processing, it is stored in BMP image format. The dataset contains eight categories: human body, tire, round cage, square cage, metal bucket, cube, sphere and cylinder. It contains 4000 sonar images, divided into training and test sets in a ratio of 3:1, with 7077 instance annotations in total. Table 3 shows the number and ratio of each target category in STSD. During training, the error between the predicted results and the ground truth is used to iteratively optimize the model parameters. The ground truth is the manually annotated reference segmentation, and it is the key to establishing the benchmark for the segmentation results. We used the LabelMe software developed by MIT (Cambridge, MA, USA) to annotate the sonar images. However, the original sonar image contains different types of noise and is displayed as a nearly black image, which makes annotation extremely difficult. We therefore applied a logarithmic transformation to the images so that the class of the targets they contain could be determined manually. Figure 5 displays the original sonar image, the processed image and the ground truth we annotated.
We conducted experiments with the original sonar images and the processed images separately and obtained similar results. This indicates that the image processing has little effect on the accuracy of the algorithm. Therefore, we still use the original sonar images in the experiments, but the images shown in this article are the log-transformed sonar images, so that the location and category of the targets can be identified by eye.
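The logarithmic transformation used for visualization can be illustrated with a minimal sketch. The text does not give the exact formula, so the common form s = c · log(1 + r) is assumed here, with c chosen so that the brightest pixel maps to 255. The function name `log_transform` and the toy matrix are hypothetical.

```python
import math

def log_transform(image, max_value=255):
    """Brighten a dark intensity image with s = c * log(1 + r).

    `image` is a 2D list of non-negative echo intensities. The scaling
    constant c is an assumption: it maps the brightest pixel to max_value.
    """
    peak = max(max(row) for row in image)
    if peak == 0:
        # A completely black image stays black.
        return [[0 for _ in row] for row in image]
    c = max_value / math.log1p(peak)
    return [[round(c * math.log1p(v)) for v in row] for row in image]

# Toy 2 x 3 "sonar image": mostly very dark with one strong echo.
echo = [[0, 1, 2], [4, 16, 255]]
enhanced = log_transform(echo)
```

Low intensities are stretched much more than high ones, which is why weak echoes become visible in a nearly black image while the ordering of pixel values is preserved.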

Implementation Detail
The hardware environment for training and testing is as follows: Intel Core i7-11800H @ 2.30 GHz CPU and NVIDIA GeForce RTX 3060 Laptop GPU. All experiments were conducted in real time using the same hardware environment. The software environment is Ubuntu 20.04, Python 3.7 and the PyTorch 1.8.0 framework. As SOLOv2 requires the size of an input image to be a multiple of 32, we set the sonar image size to 1024 × 1024. The loss function used to evaluate the performance of the model is as follows:
$$L = L_{cate} + \lambda L_{mask}$$
where $L_{cate}$ denotes the conventional focal loss [39] for semantic category classification, $L_{mask}$ is the loss for mask prediction and $\lambda$ is the weight that balances the two terms, as in SOLOv2. $L_{mask}$ is calculated as follows:
$$L_{mask} = \frac{1}{N_{pos}} \sum_{k} \mathbb{1}_{\{p^{*}_{i,j} > 0\}} \, d_{mask}(m_k, m^{*}_k)$$
where k, i and j are related by $k = i \times S + j$, $N_{pos}$ represents the number of positive samples, $m_k$ is the predicted mask, and $p^{*}$ and $m^{*}$ represent the category target and the mask target, respectively. $\mathbb{1}_{\{p^{*}_{i,j} > 0\}}$ is the indicator function, which is 1 when $p^{*}_{i,j} > 0$ and 0 otherwise. For $d_{mask}(\cdot, \cdot)$, we use the Dice loss used by SOLOv2:
$$L_{Dice} = 1 - D(p, q)$$
where D is the Dice function with the following expression:
$$D(p, q) = \frac{2 \sum_{x,y} (p_{x,y} \cdot q_{x,y})}{\sum_{x,y} p_{x,y}^{2} + \sum_{x,y} q_{x,y}^{2}}$$
where $p_{x,y}$ and $q_{x,y}$ refer to the pixel values located at (x, y) in the prediction mask p and the ground truth mask q.
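As a rough illustration of the mask loss described above, the following sketch computes the Dice function, the Dice loss and the average over positive grid cells in plain Python. It is not the authors' training code: the function names are hypothetical, masks are small nested lists rather than tensors, and the small `eps` term is an added assumption to avoid division by zero.

```python
def dice_coefficient(p, q, eps=1e-6):
    """Dice function D(p, q) for two masks given as 2D lists of equal shape."""
    inter = sum(pv * qv for pr, qr in zip(p, q) for pv, qv in zip(pr, qr))
    p_sq = sum(pv * pv for pr in p for pv in pr)
    q_sq = sum(qv * qv for qr in q for qv in qr)
    # eps is an assumption, not part of the formula in the text.
    return 2 * inter / (p_sq + q_sq + eps)

def dice_loss(p, q):
    """Dice loss: 1 - D(p, q)."""
    return 1 - dice_coefficient(p, q)

def mask_loss(pred_masks, target_masks, cate_targets):
    """Average Dice loss over positive grid cells (category target > 0).

    Index k runs over the flattened S x S grid, i.e. k = i * S + j.
    """
    positives = [k for k, c in enumerate(cate_targets) if c > 0]
    if not positives:
        return 0.0
    total = sum(dice_loss(pred_masks[k], target_masks[k]) for k in positives)
    return total / len(positives)

mask_a = [[1, 0], [0, 1]]
mask_b = [[0, 1], [1, 0]]  # no overlap with mask_a
```

For example, `mask_loss([mask_a, mask_a], [mask_a, mask_b], [1, 1])` averages a near-zero loss (perfect match) and a loss of 1 (no overlap), giving roughly 0.5, while a grid cell whose category target is 0 does not contribute at all.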

Evaluation Index
In the task of object segmentation, IoU (intersection over union) is usually used to determine whether a predicted result is a positive sample. IoU is the ratio of the intersection of the predicted mask and the ground truth mask to their union. It can be expressed as follows:
$$IoU = \frac{S_{overlap}}{S_{union}}$$
where $S_{overlap}$ refers to the overlapping area of the prediction mask and the ground truth mask and $S_{union}$ refers to their union area. We set a threshold t (t ∈ [0, 1]). When the value of IoU is greater than t, the predicted mask is considered a positive sample. The evaluation criteria of target segmentation tasks usually include two indexes: precision and recall. Precision is the ratio of the number of correctly classified positive samples to the number of all predicted positive samples. Recall is the ratio of the number of correctly classified positive samples to the number of all positive samples. The two indexes are calculated as follows:
$$Precision = \frac{T_P}{T_P + F_P}, \qquad Recall = \frac{T_P}{T_P + F_N}$$
where $T_P$ indicates true-positive cases, $F_P$ indicates false-positive cases and $F_N$ indicates false-negative cases. The precision-recall (PR) curve, which is obtained with precision as the y-axis and recall as the x-axis, is a useful graphical method for visualizing and evaluating the performance of target segmentation. The mAP (mean average precision) is a quantitative metric that evaluates the effectiveness of multicategory target segmentation by calculating the area under the PR curve. In this paper, we used the mAP calculation standard in COCO. We evaluated the mAP averaged over 10 IoU thresholds from 0.5 to 0.95 in increments of 0.05 (mAP0.5:0.95), as well as the mAP at thresholds t = 0.5 (mAP0.5) and t = 0.75 (mAP0.75). In addition, COCO also defines mAP_S, mAP_M and mAP_L, which are constrained by the target scale. However, the targets in STSD are all small, so we do not use these indicators in this paper.
AR (average recall) is the maximum recall given a fixed number of detections per image. Unlike mAP, AR is constrained only by the target scale; since the targets in STSD are all small, we report AR without scale constraints. Typically, FPS (frames per second) is used to measure the speed of an algorithm. In this paper, we also used this parameter to compare the real-time performance of each algorithm.
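The basic quantities in this section can be sketched in a few lines of plain Python. This is only an illustration of the definitions of IoU, precision and recall and of the COCO-style threshold list. It is not the COCO evaluation code, and the function names and toy masks are hypothetical.

```python
def mask_iou(pred, gt):
    """IoU of two binary masks given as 2D lists of 0/1 with equal shape."""
    pairs = [(p, g) for pr, gr in zip(pred, gt) for p, g in zip(pr, gr)]
    overlap = sum(1 for p, g in pairs if p and g)
    union = sum(1 for p, g in pairs if p or g)
    return overlap / union if union else 0.0

def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# The 10 IoU thresholds 0.50, 0.55, ..., 0.95 used for mAP0.5:0.95.
COCO_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

pred = [[1, 1], [0, 0]]
gt = [[1, 0], [0, 0]]
iou = mask_iou(pred, gt)  # one overlapping pixel, two pixels in the union

# A prediction counts as positive only if its IoU exceeds the threshold t.
is_positive_at_075 = iou > 0.75
```

Here the predicted mask covers two pixels and the ground truth one, so the IoU is 0.5 and the prediction is not counted as a positive sample at t = 0.75.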

Ablation Experiments
To obtain better results, we set different values of n and embedded the GF-PSA module at different positions in the network. The GF-SPC module splits the feature maps into n parts along the channel dimension, so the number of channels must be divisible by n. To ensure that each part of the feature maps contains enough information, the value of n cannot be too large. In this paper, we discuss six cases according to the value of n and the position of GF-PSA. The visual segmentation results of the different cases are shown in Figure 6. Among them, Model e has the highest precision for each target. We conducted a quantitative analysis of these models and evaluated their effectiveness by mAP0.5:0.95, mAP0.5, mAP0.75, AR (average recall) and FPS. Table 4 records the experimental results of the different cases. As the results show, the speeds of the six models are all around 12 FPS, but the accuracy differs considerably. By comparison, the value of n has a great influence on the results.
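The divisibility constraint on n can be made concrete with a small sketch. It only illustrates splitting along the channel dimension into n equal groups and is not the GF-SPC implementation: the function name `split_channels` is hypothetical, and a list of channel indices stands in for the actual feature maps.

```python
def split_channels(channels, n):
    """Split a list of channels into n equal, contiguous groups.

    Raises ValueError when the channel count is not divisible by n,
    which is the constraint on n discussed above.
    """
    if n <= 0 or len(channels) % n != 0:
        raise ValueError(
            f"{len(channels)} channels cannot be split evenly into {n} parts"
        )
    size = len(channels) // n
    return [channels[i * size:(i + 1) * size] for i in range(n)]

# Toy example: 8 channels split into 4 groups of 2 channels each.
groups = split_channels(list(range(8)), 4)
```

With a fixed channel count, a larger n leaves fewer channels per group, which is the reason given in the text for not choosing n too large.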

Comparative Experiment
To verify the performance of the proposed Attentive SOLO, we compare it with Mask R-CNN, YOLACT, PolarMask and SOLOv2 on the STSD dataset. Some experimental results are shown in Figure 7. Figure 7a shows the sonar image processed by logarithmic transformation. As the targets are too small to discern, we enlarged the masks and precisions in the corner of the result maps. Figure 7f shows the results of Attentive SOLO. From Figure 7, we can see that in scene (A), PolarMask missed the detection and Attentive SOLO obtains the highest precision (0.93) among all algorithms. In scene (B), YOLACT missed a detection and its precision is poor; the other algorithms have similar results. In scene (C), all algorithms detect the target accurately. We also performed a quantitative analysis of these algorithms. Table 5 displays the results: the mAP0.5 of Attentive SOLO is 7.3% higher than that of YOLACT, 6.5% higher than that of PolarMask, 3.1% higher than that of Mask R-CNN and 3.7% higher than that of SOLOv2. For the other evaluation indicators, Attentive SOLO also achieves the best results among the segmentation models.
Electronics 2022, 11, x FOR PEER REVIEW 13 of 16
In addition, we compared the proposed GF-PSA module with the SENet, CBAM, self-attention, DANet and PSA attention models. All attention mechanisms were added between ResNet and FPN, and Table 6 records the experimental results of adding the different attention models to SOLOv2. The GF-PSA module proposed in this paper performs best, achieving the best results on all four indicators. From Table 6, the mAP0.5 of GF-PSA is 9.3% higher than that of CBAM, 5.1% higher than that of DANet, 4.5% higher than that of PSA and 3.5% higher than that of SENet. These results show that gated fusion effectively improves the performance of PSA and that GF-PSA is more effective than the other attention mechanisms.
Although Attentive SOLO achieves better performance than the existing methods, it cannot segment some difficult images accurately (shown in Figure 8). Most of these failed cases involve the "human body" category. We analyzed the possible reasons: during data collection, we put the manikin into the sea, and its posture changed greatly with the ocean current, which makes it difficult to learn the features of the human body.

Figure 8. Some failed cases, all of which involve the "human body" category: (a) the processed image, (b) the ground truth, (c) the failed results.

Conclusions
On the basis of SOLOv2, this paper improves the multi-scale prediction using PSA and GFF. A sonar target segmentation algorithm named Attentive SOLO suitable for complex environments is proposed. The Attentive SOLO uses gated fusion to improve the fusion mechanism of PSA and adds it to SOLOv2, which significantly reduces the information redundancy in feature fusion and enhances the ability of context information transmission and network feature extraction. Experiments show that the proposed Attentive SOLO model has better detection accuracy than SOLOv2 and other instance