Sonar Image Target Detection Based on Simulated Stain-like Noise and Shadow Enhancement in Optical Images under Zero-Shot Learning

Abstract: There are many challenges in using side-scan sonar (SSS) images to detect objects. The challenge of object detection and recognition in sonar data is greater than in optical images due to the sparsity of detectable targets. The complexity of real-world underwater scanning presents additional difficulties, as different angles produce sonar images of varying characteristics. This heterogeneity makes it difficult for algorithms to accurately identify and detect sonar objects. To solve these problems, this paper presents a novel method for sonar image target detection based on a transformer and YOLOv7. Thus, two data augmentation techniques are introduced to improve the performance of the detection system. The first technique applies stain-like noise to the training optical image data to simulate the real sonar image environment. The second technique adds multiple shadows to the optical image and 3D data targets to represent the direction of the target in the sonar image. The proposed method is evaluated on a public sonar image dataset, and the experimental results demonstrate that the proposed method outperforms the state-of-the-art methods in terms of accuracy and speed. The experimental results show that our method achieves better precision.


Introduction
The advancement of sonar technology has yielded remarkable achievements in underwater exploration [1] and target detection [2,3]. Compared with the limitations of optical sensors in detecting targets, such as short detection distances and poor underwater visibility, SSS-based target detection methods have become increasingly popular and effective. These methods [4][5][6] have proven to be more effective in terms of distance and visibility, overcoming the limitations of traditional optical sensors. The lack of sonar sample data and image quality remains a common problem in sonar target recognition. Researchers have developed various approaches to address these challenges, such as utilizing deep learning algorithms [7][8][9][10] to enhance the image quality of sonar data and applying transfer learning [11,12], allowing for more robust detection and recognition results. These methods have shown promising results in simulated underwater environments, but more research is needed to validate their effectiveness in real-world scenarios or simulated environments. However, the high cost associated with underwater experiments [13], including the deployment and recycling of underwater targets, the use of diverse sonar devices, and the search for suitable experimental areas, has resulted in a lack of available samples. As a result, it is challenging to obtain sufficient data to evaluate the performance of target recognition algorithms in real-world scenarios.
Some researchers have employed deep convolutional neural networks (DCNNs) using style transfer to simulate the environment [4,11,12,14]. This has highly improved the performance of sonar image detection. Due to the complex underwater environment, some key features will be lost in the simulated environment. Several experts in the field have utilized semantic segmentation to classify targets in sonar images [15][16][17]. However, most of them have focused on image classification. Hence, this paper provides a comprehensive approach that considers the complexities of the underwater environment and employs feature enhancement techniques for accurate target detection. This approach is combined with the semantic segmentation method in an optical image dataset to address the lack of available samples and enable the evaluation of target recognition algorithms in real-world scenarios. This makes up for the loss of some features in the simulation environment despite the high cost associated with underwater experiments.
First, we used the semantic segmentation method [18,19] on optical images to extract the target. However, the limitation of optical single targets in images causes a low performance in object detection. Therefore, we propose using shadow enhancement on targets to solve the problem of sonar target features in the real environment to improve the performance of sonar target detection. Furthermore, we added stain-like noise on targets to simulate mud and sand obstruction and mutilated targets in the real environment. Finally, considering the style difference between sonar images and optical images, we used style transfer to enhance object features via frequency analysis in real sonar images.
The remainder of this paper is organized as follows: In Section 2, we provide an overview of the existing methods and highlight their shortcomings. In Section 3, we present our proposed methods, which combine data augmentation and simulation techniques. These techniques are based on shadow enhancement and the addition of stain-like noise to the data. Section 4 presents a comparison of our proposed methods with existing approaches, along with the training results. Additionally, we provide an analysis of the experiments we designed and conducted, along with a comparison of their results.

Related Works
Scholars have researched sonar synthetic image datasets [2,3] and zero-shot learning methods [4,11,14,20] to augment samples to overcome the shortage of samples and simulate sonar images. Pre-trained DCNNs and fine-tuning techniques are powerful methods for sonar image detection [21,22]. William et al. [16] present an approach for merging SSS data and bathymetry information to improve automatic shipwreck identification. The method combines raw SSS images with a 2D relief map into a composite RGB image and uses a supervised image segmentation approach to identify shipwrecks. Zhao et al. [4] utilized a combination of 3D modeling, amplified data, equipment noise, and image mechanisms to extract target features and simulate target damages and postures using a DCNN and a fine-tuning style-transfer method. Their approach achieved a precision of 85.3% and a recall of 94.5%. Li et al. [6] identified texture features as domain-specific features and proposed to narrow the domain gap by removing these features. This method successfully transferred knowledge from optical images to sonar image classification tasks. The approach shows promise for improving the performance of sonar image classification tasks. Lee et al. [12] employed StyleBankNet [23] to perform style transfer simulations on optical images of the human body, which improved sonar object detection and achieved a precision of 86%. The authors generated samples using CAD but noted that significant simulation work was required to generate sufficient samples. This approach shows promise for enhancing sonar object detection using simulated optical images. Song et al. [24] introduced an effective sonar segmentation approach that leverages speckle noise analysis for pixel-wise classification. This method involves a single-stream deep neural network (DNN) with multiple side outputs to optimize edge segmentation. Huo et al. [25] utilized a transfer learning method to leverage knowledge from the ImageNet dataset to classify underwater targets in an SSS image dataset they built. They proposed using a semisynthetic data generation method during the transfer process to produce sonar images that effectively compensate for insufficient data. Ochal et al. [20] conducted a comparison of multiple supervised and semi-supervised few-shot learning (FSL) methods using underwater optical and SSS imagery. The results indicate that FSL methods have significant advantages over simple transfer learning methods, such as fine-tuning a pre-trained model for underwater target classification. Yu et al. [9] proposed a novel method for underwater target recognition, integrating a transformer module and YOLOv5. The method also incorporates an attention mechanism to improve both accuracy and efficiency. Xu et al. [22] proposed an active instance segmentation method combining a region-based convolution neural network (R-CNN) and balanced sampling. The method has benefits when a limited number of labeled samples are available, leading to better results for underwater shipwreck detection.
These enhancements make the methods well-suited for underwater environments where target recognition is challenging. However, target features cannot be properly expressed without considering the image environment (the state of the target, such as target damage and corruption, target postures, etc.), and a simulated image cannot properly present sonar features. Many studies have focused on sample amplification and image-processing mechanisms for underwater target recognition but have not sufficiently considered the challenges posed by real-world underwater environments, such as mud and sand obstruction, missing target parts, multiple target states, and shadows and reflections in sonar data.

Our Methods
Our method is based on YOLOv7 with a transformer backbone model and addresses these issues by enhancing multiple shadows on the target. The detection process in a DCNN involves the relationship between the target features and the model features.
For the contributions of this study, we define three feature sets, namely optical target image features, shadow enhancement features, and random stain-like noise features, to describe the feature mapping process. By adding random stain-like noise to the target image, the optical image simulates a sediment-covered, mutilated target, which better represents the uncertainty of target states. Moreover, an image-processing method based on the existing style transfer method is proposed for data training to more closely represent the real data and enhance the object features. From the perspective of feature matching, the more target features the model contains, the higher the similarity.

Problem Definitions and Our Framework
A lack of samples is a common problem in target detection in sonar images, which leads to low model performance. Many methods are applied to transfer optical data to sonar data to improve target detection performance, but these methods do not fully consider the underwater environment. Given this issue, the key to successful deep learning work is preparing datasets with appropriate target features. In this section, we focus on our main contributions to this field, which include extracting complex features from datasets and utilizing zero-shot learning for target detection.
We define three types of features, $F_O$, $F_S$, and $F_N$, to describe the feature mapping process. $F_O(m,n) = \{f_m\}$, $f_m = \{x_1, x_2, \ldots, x_n\}$, $m = 1, 2, \ldots, M$ denotes the domain of the optical target features, where $f_m$ indicates the $m$th image feature and $x_n$ expresses the $n$th feature. $F_S(m,n)$ denotes the domain of the shadow enhancement features that extends from $F_O$. $F_N(m,n)$ denotes the domain of random stain-like noise that extends the features $F_O$. The optical image features of the target set $F_O(m,n)$ are expressed as

$$F_O(m,n) = \begin{bmatrix} x_{11} & x_{12} & \cdots & 0 & \cdots \\ x_{21} & x_{22} & \cdots & 0 & \cdots \\ \vdots & \vdots & & \vdots & \\ x_{m1} & x_{m2} & \cdots & 0 & \cdots \end{bmatrix}$$

To standardize the dimensions of all features, zero-padding is used to ensure that all images have the same dimensions. The target is extracted from the background, so the background is empty and presents as zero. The values $(x_1, x_2, \ldots, x_n)$ denote the features of the original target. By augmenting the features $F_O(m,n)$ with shadow features, the features $F_S(m,n)$ are obtained. Simultaneously, a simulation of the complex underwater environment is performed with the addition of random stain-like noise to obtain the features $F_N(m,n)$. We define an equation to describe the generated training data on one target, which combines the features by summing them:

$$F(m,n) = F_O(m,n) + F_S(m,n) + F_N(m,n) \quad (1)$$

The detection process aims to map the relationships between the real sonar target features $T$ and the features $F(m,n)$ for sonar target detection. We consider three major aspects in the dataset design: shadow augmentation of the optical targets, transfer of the optical images to a sonar style, and stain-like noise on the targets.
The process in Figure 1 includes four parts: First, different optical image datasets are integrated into target categories, such as airplanes, ships, cars, etc. Second, multiple shadows are generated and the direction for simulating the SSS image targets on the dataset is adjusted. Third, noise is generated on the targets to simulate covered and incomplete targets, whereby sediment occlusion on the seabed is simulated by adding stain-like noise. Fourth, YOLOv7 is used as the framework, and the transformer is used as the backbone method for data training. The detection model is then used to detect target objects in the sonar data.

Feature Enhancement and Augmentation Methods
Most existing methods focus on amplifying samples from an optical image dataset with less consideration of the optical background, which can impact the detection performance. The target shadow plays an important role in real sonar target image detection. Examples of real sonar images are shown in Figure 2. The shadows in an underwater acoustic image and an optical image are illustrated in [26]. Observation geometry given by the range and elevation angle is important for interpreting the highlight and shadow in an image. Examples of a shadow feature in sonar and optical images are shown in Figure 3. As shown in Figure 3, the target shadow in a real SSS image is always on the left or right side. The processes for a sonar shadow and an optical shadow are very similar [27]. The side-scan image formation process is briefly sketched in Figure 3a. The intensity of the corresponding pixel of the side-scan image depends on the amount of energy scattered back from the surface point. The traditional Lambertian model [28] permits us to derive the returned intensity from the parameters defining the observed scene [29]. The intensity $I$ at the point $p$ can be expressed as follows:

$$I(p) = K \, \Phi(p) \, R(p) \cos\theta(p) \quad (2)$$

where $\Phi$ represents the intensity of the illuminating sound wave at point $p$, $R$ is the reflectivity of the object, $\theta$ is the incidence angle of the wave front, and $K \in [0, 1]$ is a normalization constant. To obtain the maximum intensity return $I_{max}$ at any surface point, the reflectivity and incident intensity values are both set to 1 for the optimal surface orientation with respect to the incident illumination.
The vectors in Figure 3a are defined in a coordinate system relative to the sensor. To simplify the process in our experiment, we defined the seafloor as a flat surface with a constant value to express the surface intensity. Under this assumption, combining the expressions in (2) and applying finite difference methods to the gradients yields an expression for the intensity returned from an object point.
A shadow is a critical feature in deep learning detection work. The shadow of an object varies with the object's posture and the sonar position. We propose a method that uses a large amount of optical data and 3D model data [30] to improve an object's features via shadow enhancement in the training data. First, we split the optical target and background to reduce interference when the deep learning network extracts target features. Second, we generate a target image with multiple shadow features via a 3D model by adjusting the object and light position.
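As a rough numerical illustration of the Lambertian model discussed above, the following sketch evaluates the standard form I = K · Φ · R · cos θ for a few incidence angles. The function name `lambertian_intensity`, the default value of `k`, and the clamping of negative cosines to zero (to mimic an acoustic shadow) are assumptions made for this example and are not taken from the experiments.

```python
import math


def lambertian_intensity(phi, reflectivity, theta_deg, k=1.0):
    """Returned intensity I = K * Phi * R * cos(theta) under a Lambertian model.

    phi          -- intensity of the illuminating sound wave at the point
    reflectivity -- reflectivity R of the surface at the point
    theta_deg    -- incidence angle of the wave front, in degrees
    k            -- normalization constant, assumed to lie in [0, 1]
    """
    cos_theta = math.cos(math.radians(theta_deg))
    # Assumption: a surface facing away from the source returns no energy,
    # which is how a shadow region would appear in this simplified model.
    return k * phi * reflectivity * max(cos_theta, 0.0)


# Optimal orientation (theta = 0) with unit reflectivity and unit illumination.
i_max = lambertian_intensity(1.0, 1.0, 0.0)
# Oblique incidence returns less energy.
i_60 = lambertian_intensity(1.0, 1.0, 60.0)
# A surface turned away from the sonar returns nothing.
i_shadow = lambertian_intensity(1.0, 1.0, 120.0)
```

With unit illumination and reflectivity, the return is largest when the wave front hits the surface head-on and falls off with the cosine of the incidence angle.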
The overall process of target extraction and shadow generation is shown in Figure 1. The goal of a deep learning network is to extract object features. Since our experiments were based on zero-shot learning, the backgrounds of the optical images lack features from the sonar images. We adopted a fine-tuned DeepLabV3 [31,32] as the semantic segmentation method to extract target features from the optical image. The optical image was segmented and processed to obtain an image containing only the target, thus improving the model's recognition rate. In addition, we adopted 3D data and employed a fine-tuned exponential shadow maps (ESM) method [33][34][35], combining lighting and object position techniques for shadow simulation. Figure 1a-d show the processes of segmentation and shadow enhancement in our experiment.
Research on real sonar data showed that two-dimensional image shadow simulation cannot fully simulate the shadows in sonar images. A comparison between real sonar images and stylized 3D-generated images is shown in Figure 4. In a two-dimensional image, we define $G_{a,w}$ as the shadow enhancement function, $S(\cdot)$ as the semantic segmentation function, and $I_o$ as the original optical image. The enhanced features $f_{a,w}$ can be expressed as $f_{a,w} = G_{a,w}(S(I_o))$, $a \in \{0, 1, \ldots, 360\}$, where $a$ is the shadow angle and $w$ is the shadow width. The set of all $f_{a,w}$ forms the entire enhanced features from one original image.
The rotation matrix $R(\alpha, \beta, \gamma) = R_z(\alpha) R_y(\beta) R_x(\gamma)$ is used to calculate the target with its shadow in the image.
In the matrix, $R_z(\alpha)$ represents the rotation of an object around the z-axis by $\alpha$ degrees. $R_y(\beta)$ and $R_x(\gamma)$ represent the rotations around the y-axis and x-axis.
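The composition of rotations about the z-, y-, and x-axes described above can be sketched with NumPy as follows. The function names, the right-handed convention, the use of degrees, and the product order (z, then y, then x, read left to right) are assumptions made for this illustration.

```python
import numpy as np


def rot_z(deg):
    """Rotation about the z-axis by `deg` degrees (right-handed)."""
    a = np.radians(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def rot_y(deg):
    """Rotation about the y-axis by `deg` degrees (right-handed)."""
    a = np.radians(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])


def rot_x(deg):
    """Rotation about the x-axis by `deg` degrees (right-handed)."""
    a = np.radians(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])


def rotation(alpha, beta, gamma):
    """Combined rotation, assumed here to be R_z(alpha) @ R_y(beta) @ R_x(gamma)."""
    return rot_z(alpha) @ rot_y(beta) @ rot_x(gamma)


# Rotating the x unit vector by 90 degrees about z moves it onto the y-axis.
R = rotation(90, 0, 0)
p = R @ np.array([1.0, 0.0, 0.0])

# Any combination of angles still yields a proper rotation matrix.
R2 = rotation(30, 45, 60)
```

In a two-dimensional image only the in-plane rotation `rot_z` would be needed, whereas a 3D model can use all three angles to vary the object pose and light direction.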
In the three-dimensional model, the rotation is $R = R(\alpha, \beta, \gamma)$, and in the two-dimensional image, it reduces to $R = R_z(\alpha)$. In both cases, the rotation is combined with a transformation (move) matrix that displaces the shadow relative to the target. The shadow calculation applies these matrices to the target without the background to generate the target shadow and then combines the target and its shadow at each coordinate position $(x, y)$, which gives the final image matrix with the shadow in the two-dimensional image. Applying the same process to the target matrix in the 3D model gives the final image matrix with the shadow in the three-dimensional model. Figure 5 displays the generated 3D data shadow compared with the real sonar data. The displacement generally depends on the center of the original image. We define a 10 pixel displacement in a (512, 512) image in the examples. The rotation is performed around the image center, and the angle of the object's shadow is determined by the object's orientation and simulated lighting. Figure 7 displays the sample data for the shadow feature expansion of the 3D target using shadow feature enhancement methods. The simulated shadow enhancement with different targets is shown in Figure 8.
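To make the two-dimensional shadow step more concrete, the sketch below shifts a segmented target mask by a fixed displacement along a chosen angle and draws the target on top of the shifted copy. It is a simplified stand-in rather than the procedure used in the experiments: the function name `add_shadow`, the label encoding (0 background, 1 shadow, 2 target), and the pure translation of the mask are assumptions, while the 10 pixel displacement and the (512, 512) image size follow the example values given in the text.

```python
import numpy as np


def add_shadow(mask, angle_deg, displacement=10):
    """Label map with a displaced shadow: 0 = background, 1 = shadow, 2 = target.

    mask         -- 2D array, truthy on target pixels (background already removed)
    angle_deg    -- direction in which the shadow is cast, in degrees
    displacement -- shadow offset in pixels
    """
    mask = np.asarray(mask, dtype=bool)
    a = np.radians(angle_deg)
    dx = int(round(displacement * np.cos(a)))
    dy = int(round(displacement * np.sin(a)))
    h, w = mask.shape

    # Copy the mask into a shifted position, clipping at the image border.
    shadow = np.zeros_like(mask)
    dst_rows = slice(max(dy, 0), min(h + dy, h))
    dst_cols = slice(max(dx, 0), min(w + dx, w))
    src_rows = slice(max(-dy, 0), min(h - dy, h))
    src_cols = slice(max(-dx, 0), min(w - dx, w))
    shadow[dst_rows, dst_cols] = mask[src_rows, src_cols]

    out = np.zeros(mask.shape, dtype=np.uint8)
    out[shadow] = 1
    out[mask] = 2  # the target is drawn on top of its own shadow
    return out


# A 100 x 100 square target in a (512, 512) image.
target = np.zeros((512, 512), dtype=bool)
target[200:300, 200:300] = True

right = add_shadow(target, angle_deg=0)    # shadow cast to the right
left = add_shadow(target, angle_deg=180)   # shadow cast to the left
```

Calling the function with several angles on the same mask gives multiple shadow variants of one target, which is the kind of expansion the shadow enhancement step aims for.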

Stain-like Noise Method
Many scholars extract targets from optical images and simulate defects, which can effectively replicate the defects in the targets. However, these defect simulation methods are limited to the targets and do not adequately represent the surrounding environment and shadows.
Optical and acoustic images of an aircraft target were derived from [36] to better understand the real environment, as shown in Figure 9. With the rapid development of DCNNs, object features can be easily extracted from data. A DCNN is capable of extracting object features even under zero-shot learning, with minimal real-world conditions. This is because the training samples used for a DCNN are almost perfect and may not accurately represent the complexities and variabilities present in real-world conditions. We propose a random stain-like noise method to simulate the damage, occlusion, and other factors in real sonar imaging targets underwater, which can effectively improve recognition efficiency. The proposed method was proven to be effective in the experiments. The single process of generating data with stain-like noise is shown in Figure 11. Our study compared the recognition performance for different types of noise, and we found that random stain-like noise resulted in the highest performance, as depicted in Figure 11. However, stain-like noise can lead to overfitting, which affects the recognition accuracy. To address this, we fine-tuned the data and achieved a peak performance of 0.89 mean average precision (mAP) [37] when the noise occupied approximately 31% of the target image in our experimental data.
Figure 12 compares the performance trends for the different noise types in the noise-occupied area on the target. The noise-occupied area on the target ranged from 10% to 60%. We found that the difference between a noise-occupied area of less than 10% and no noise was minimal. The performance greatly decreased with a noise-occupied area of over 50% due to overfitting. In the simulation process for generating stain-like noise data, stain-like points are generated at positions $(x, y)$ on the target in the original image. $C$ is a constant that expresses the pixels of one stain-like point. The generated stain-like points are collected in an image matrix, in which zeros are used to create an empty background for the stain-like image. $h$ is the image height, and $w$ is the image width.
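One possible reading of the stain-like noise generation is a set of short random walks that stop once a maximum area proportion is reached. The sketch below follows that idea in plain Python. The function name `stain_mask`, its parameters, the one-pixel stain points, and the use of the whole grid (rather than the target region only) as the reference area are simplifying assumptions for illustration; the 31% ratio is the value reported in the text.

```python
import random


def stain_mask(width, height, n_stains, max_ratio, walks=200, seed=None):
    """Return a height x width grid of 0/1 values where 1 marks a stain pixel.

    n_stains  -- number of stain points (random-walk starting positions)
    max_ratio -- maximum proportion of the grid that may be stained
    walks     -- number of random-walk steps per stain point
    """
    rng = random.Random(seed)
    grid = [[0] * width for _ in range(height)]  # zeros form the empty background
    limit = int(max_ratio * width * height)      # cap on the number of stained pixels
    moves = [(0, -1), (0, 1), (1, 0), (-1, 0)]   # up, down, right, left
    stained = 0

    for _ in range(n_stains):
        x, y = rng.randrange(width), rng.randrange(height)
        for _ in range(walks):
            if stained >= limit:
                return grid  # maximum area proportion reached
            if grid[y][x] == 0:
                grid[y][x] = 1
                stained += 1
            # Pick a new walking direction and stay inside the image.
            dx, dy = rng.choice(moves)
            x = min(max(x + dx, 0), width - 1)
            y = min(max(y + dy, 0), height - 1)
    return grid


grid = stain_mask(64, 64, n_stains=20, max_ratio=0.31, walks=500, seed=0)
ratio = sum(map(sum, grid)) / (64 * 64)
```

The resulting mask could then be multiplied into or overlaid on the segmented target so that the stained pixels hide part of it, which imitates sediment cover and missing parts.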

Experiment and Analysis
In this section, we present a series of experiments to compare our proposed method with the existing methods. The experiments were conducted on different datasets. We report our method's performance using several evaluation metrics commonly used in the field.
In this study, we adopted precision, recall, and mAP to evaluate the model's performance. A true positive (TP) is a detection that the network reports as a target and that is correct. A false positive (FP) is a sample that does not belong to a class but is predicted to belong to it. A false negative (FN) is a sample that belongs to a class but is predicted not to belong to it.
Precision signifies the proportion of accurately predicted positive samples to the total number of predicted positive samples:

$$Precision = \frac{TP}{TP + FP}$$

Recall signifies the proportion of correctly predicted positive samples to the overall number of positive samples:

$$Recall = \frac{TP}{TP + FN}$$

With the results of our experiments, we believe that our method has the potential to be used in real-world applications and can contribute to the advancement of the underwater detection field.
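The precision and recall definitions above can be checked with a short worked example. The counts used here (90 true positives, 10 false positives, 30 false negatives) are hypothetical and are not taken from the experiments.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP) and recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical detector output: 90 correct detections, 10 false alarms,
# and 30 targets that were missed.
p, r = precision_recall(tp=90, fp=10, fn=30)
```

Here 90 of the 100 reported detections are correct, and 90 of the 120 real targets are found, so a detector can have high precision while still missing a noticeable share of targets.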
Our model can be fitted to customized target sizes, which can be defined in the training data. We adjusted the target size to (128, 128) in the training data. Figure 14 shows examples of the detection of different real sonar targets selected from our test results. In our experiments, we extracted target features from our trained model with different types of images with the t-distributed stochastic neighbor embedding (t-SNE) method [38] to better understand the impact of enhancement features on the target in an optical image, as shown in Figure 15. Based on the results, by enhancing the optical image targets, the enhanced target images have similar distributions to the real sonar targets.

Experimental Data
To increase the diversity of the target forms, we trained our model on a portion of the VOC2012 dataset, the human pose and NWPU VHR-10 [39,40] image datasets, and the Pascal and human pose 3D datasets [41]. We conducted a comparative experiment on the different datasets under the same batch, verifying 29 real aircraft wreck sonar images, five real body sonar images, and 43 real shipwreck sonar images that are publicly available on the internet. The results show that our model is effective in detecting both aircraft wrecks and shipwrecks in sonar images and can be used for practical applications in underwater target recognition. We used three types of targets with shadow enhancement and stain-like noise in our experiments (airplane, person, and ship), along with others. The enhanced training dataset and test data are shown in Table 1.

| Model | Precision | Recall | mAP (IOU = 0.5) |
| --- | --- | --- | --- |
| StyleBank + fastrcnn [12] | 0.860 | 0.705 | 0.786 |
| Whitening and coloring transform [14] | 0.875 | 0.836 | 0.75 |
| Improved style transfer + yolov5 [4] | 0.853 | 0.945 | 0.876 |
| Yolov5 + style transfer + regular-shaped noise [11] | 0.899 | 0.861 | 0.865 |
| Our method: Yolov7 (transformer backbone) + stain-shaped noise | 0.903 | 0.857 | 0.891 |

In the comparison table, the precision with our method is increased by 0.004 compared with the existing methods' highest precision. Our method's recall is decreased by 0.088 compared with the existing methods' top recall, and its mAP is increased by 0.015 compared with the existing top mAP.
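The mAP values in the comparison are reported at an IoU threshold of 0.5, meaning that a predicted box is matched to a ground-truth box only if their intersection over union reaches that threshold. A minimal sketch of the computation is given below; the function name `iou`, the `(x1, y1, x2, y2)` corner format, and the example boxes are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped to zero when the boxes are disjoint.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


truth = (0, 0, 10, 10)
close_pred = (2, 0, 12, 10)  # overlaps most of the ground-truth box
far_pred = (5, 0, 15, 10)    # overlaps only half of it

is_match_close = iou(truth, close_pred) >= 0.5
is_match_far = iou(truth, far_pred) >= 0.5
```

In this example the first prediction has an IoU of 80/120 and counts as a match, while the second has an IoU of 50/150 and would be scored as a false positive at the 0.5 threshold.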

Experiment with Different Noise Types
We employed the YOLOv7 framework with the transformer backbone model on different types of noise datasets to investigate the performance of each noise type, including Gaussian, salt and pepper, regular shapes, stain-like shapes, and no noise. Examples of the noise types are shown in Figure 16. Our analysis of the experimental results reveals that the highest mAP achieved for the recognition of stain-like noise was 0.824.

Experiment with Different Models
We conducted experiments using different models on the same dataset with shadow enhancement and random stain-like noise to further verify the detection performance. The comparison is shown in Table 4. Our experiments reveal that the combination of the two models exhibits better detection performance. The results show that the YOLOv7 framework with the transformer backbone model has significant potential to enhance object detection accuracy in various real-world applications.

Experiment on Shadow Enhancement
To verify whether the target shadow features increased the detection performance, we conducted an experiment using the YOLOv7 framework to compare two datasets: one with shadow enhancement features and another without. The results of the comparison are presented in Table 5.

| Data | No Shadow Enhancement | Shadow Enhancement |
| --- | --- | --- |
| mAP | 0.763 | 0.806 |

Our experimental results demonstrate that shadow enhancement is an effective data augmentation technique for improving the performance of sonar target detection models. Using simulation methods based on shadow enhancement can improve the model's ability to generalize to real-world scenarios, resulting in a higher recognition mAP of up to 0.806. It should be noted that the detection result uses the model without real data in the training phase (only enhanced optical images and 3D models are in the training dataset).

Conclusions
In this paper, we applied a transformer as the backbone model of YOLOv7 to improve the underwater detection performance, despite a lack of training data. We addressed the design considerations for complex underwater scenarios, the limitations of lost features with style transfer, and targets covered by mud and sand. Hence, we proposed a method that merges stain-like noise on a simulated target to overcome the constraints of the real environment. Furthermore, we removed the background from optical target images to focus the training model on target features and reduce useless information. Additionally, we used shadow enhancements on the targets in two-dimensional images and a CSM shadow generator on a 3D model. The method addressed the key features of the target shadows, which would otherwise be missing when directly using optical object style transfer. Using comparison experiments, we demonstrated that our proposed method could achieve a better target detection performance than other methods that do not include shape noise fusion and key feature enhancement in the training data.
Future research could, for instance, investigate the relationship between the percentage of noise occupying the target and the dataset size, target number, and target categories.
The three major aspects of the dataset design are (1) defining the dataset and augmentation from the optical image to extend multiple shadows on the same target; (2) transferring the optical image to a sonar-style image; and (3) designing stain-like noise on the target to simulate mud and sand obstruction. The data processing to generate the training data in our experiments is shown in Figure 1.

Figure 1. Data processing to generate training data on one target.

Figure 3. Shadow features in sonar and optical images.

Figure 5. Shadow image sample generated via 3D model and real sonar data.

Figure 6 displays the airplane sample data for the feature expansion of the target image in Figure 1c, using shadow feature enhancement methods.

Figure 6. Sample data for shadow feature expansion in two-dimensional image.

Figure 7. Sample data for shadow feature expansion in three-dimensional model.

Figure 9. Examples of incomplete targets. (a) Original sonar image; (b) original optical image; (c) front-view optical image of the target; (d) rear-view optical image of the target.

Many of the targets to be detected in actual sonar image applications are incomplete or defective targets. The diagram in Figure 10 illustrates varying degrees of burial of the targets by sediments, resulting in minimal obstruction, moderate obstruction, and significant obstruction of the targets.

Figure 10. Examples of varying degrees of burial of the targets by sediments. (a) Minimal obstruction of airplane; (b) moderate obstruction of airplane; (c) significant obstruction of airplane; (d) minimal obstruction of ship; (e) moderate obstruction of ship; and (f) significant obstruction of ship.

Figure 11. The single process of generating data with stain-like noise. (a) Random stain-like noise; (b) original image; (c) optical target; (d) sonar image background; and (e) sonar-style target with stain-like noise.

Figure 12. Performance trends for different noise types.

The process of stain-like noise is exhibited in Algorithm 1.

Algorithm 1: Process of stain noise
Input: Scale, the number of stain points; the area ratio; C, a constant value that expresses the pixels of one stain point.
Steps: Directions of walk (up, down, right, left); number of walks.
Output: The image with random stain points.
Initialize: Size of image (width, height); the maximum area proportion of stain points in the image. Calculate the scale from the area ratio.
for each stain point in scale do
    for step in walks do
        if the stain area is over the maximum area proportion then
            return image
        end
        if the direction is the one being walked then
            update the direction
        end
    end
end

Figure 13. Example of target with different area ratios with stain-like noise.

Figure 14. Detection of different real sonar targets.

Figure 15. Features extracted on different targets with t-SNE distribution.

Table 1. Training and test data.

Table 2. Comparison with existing methods' performances.

Table 3 presents the performances of the different types of noise in the YOLOv7 framework.

Table 3. Comparison between performances of different noise types.

Table 4. Comparison of different models' performances.

The results show that the YOLOv7 framework with the transformer backbone model achieved the highest recognition mAP of 0.891. We obtained the best performance in all model comparison experiments using stain-like shapes and shadow enhancement as the training dataset.

Table 5. Comparison of performance with and without shadow enhancement.