A Survey of Underwater Acoustic Target Recognition Methods Based on Machine Learning

Abstract: Underwater acoustic target recognition (UATR) technology has been implemented widely in the fields of marine biodiversity detection, marine search and rescue, and seabed mapping, providing an essential basis for human marine economic and military activities. With the rapid development of machine-learning-based technology in the acoustics field, these methods have received wide attention and show potential impact on UATR problems. This paper reviews current UATR methods based on machine learning. We focus mostly, but not solely, on the recognition of target-radiated noise from passive sonar. First, we provide an overview of the underwater acoustic acquisition and recognition process and briefly introduce the classical acoustic signal feature extraction methods. We then classify UATR methods by the machine learning algorithm used: UATR methods based on statistical learning, UATR methods based on deep learning models, and transfer learning and data augmentation technologies for UATR. Finally, the challenges of machine-learning-based UATR are summarized and directions for future UATR development are put forward.


Introduction
Acoustic waves are the only energy form known to humans that can travel long distances in water, and they are generally considered to be the best information carrier for sensing and recognizing underwater targets [1]. Exploring accurate UATR methods can promote the development of related fields, such as seabed mapping [2], marine biodiversity detection [3,4], and vessel target recognition [5]. The recognition and detection process based on underwater acoustic target signals is shown in Figure 1. It mainly includes target signal acquisition by a sonar system, array processing, data preprocessing, feature extraction, and target recognition [6]. This paper mainly reviews the target recognition methods based on sonar signals. Due to the complexity of the marine environment, the acoustic signals obtained by the spatiotemporal sampling of the sensor array in the ocean contain not only the target signals but also environmental noise and interference from other targets. Usually, the beam signals in the target direction are obtained through the spatial filtering of array processing and are then combined with other preprocessing steps to reduce the impact of noise. In the early stage, underwater acoustic targets were recognized by human ears. This method does not involve the feature extraction process shown in Figure 1: the acquired original signals, or the signals after preprocessing, are recognized directly by a human listener. The disadvantages of this recognition method are apparent. As the amount of data increases, the time and energy required increase proportionally, and the method can hardly meet the requirements of real-time recognition. Most importantly, the frequency range of underwater acoustic signals is broad, and listening for a long time to sounds that human ears cannot adapt to is inevitably harmful to human health.
In addition, researchers have used various methods to identify targets based on the spectral features of underwater acoustic signals. Among them, power spectrum analysis, low-frequency analysis and recording (LOFAR), and detection of envelope modulation on noise (DEMON) are commonly used to extract underwater acoustic signal features [7,8]. The Mel frequency cepstral coefficient (MFCC) and the gammatone frequency cepstral coefficient (GFCC) are also commonly used features for underwater acoustic signal processing [9]. With the rise of machine learning methods, researchers in the underwater acoustic field have turned their attention to building automatic underwater acoustic target recognition methods using machine learning models [10]. Some studies combine an underwater acoustic signal feature extraction method with a machine learning model and use the frequency domain features of the signal as the model input for target recognition [11]. This paper introduces the classical underwater acoustic signal feature extraction methods in Section 2. Deep learning models have a strong learning ability and can extract latent representations of signals. Therefore, some studies directly input the raw signals into a deep learning model for target recognition [12]. This paper reviews several studies on UATR using machine learning methods and divides them into three categories according to the models used. The first category is recognition methods based on statistical learning models. These methods construct statistical probability models using the labels and features of samples and then classify test samples. The second category is recognition methods using deep learning models. These methods use the powerful learning ability of deep neural networks to extract features from input samples and perform recognition.
The third category is transfer learning methods and data augmentation technologies, which are used to solve the problem of the lack of labeled underwater acoustic signal samples. This is a severe problem faced by UATR. At present, machine learning methods have shown better performance than traditional methods in the field of underwater acoustics, including underwater acoustic target recognition [5], underwater acoustic source localization [13], single-channel source separation [14], and so on. However, they are data-driven methods, and underwater acoustic data acquisition faces enormous challenges. At the same time, traditional signal processing methods are more interpretable than many machine learning algorithms, and interpretability is necessary for some systems and application scenarios. In addition, machine learning models with good recognition performance on specific scenes and data may not apply to other datasets. Because a model only learns the latent feature representation of its training data, it cannot be a universal method. However, even though machine learning methods still have many limitations, we can still see the development potential of such models in underwater acoustics fields [15][16][17].
This paper reviews the recent studies of UATR based on machine learning and analyzes the technical characteristics, performance, and challenges of these studies, which will provide a reference for researchers in the field of UATR technology. The next section describes the widely used feature extraction methods for UATR. Section 3 discusses the UATR technology based on machine learning models and related research works in detail. Section 4 analyzes the challenges encountered by the UATR methods based on machine learning. Section 5 gives the conclusion and discussion of this paper.

Data Preprocessing and Feature Extraction Methods
Acoustic signals are time sequence signals, but their frequency domain usually contains more information. Therefore, it is necessary to preprocess and extract features of the raw data to reduce data dimensions and suppress noise before inputting into recognition models.
Ship-radiated noise is one of the main research objects in the UATR field. It is short-term stationary, and the frequency spectrum obtained by the Fourier transform is relatively stable, which is conducive to the extraction of target recognition features. LOFAR spectra are generally used to characterize ship-radiated noise [8]. A LOFAR spectrum is a two-dimensional image of frequency and time obtained by a short-time Fourier transform (STFT). Narrowband components in ship-radiated noise can be found by extracting line spectral trajectories from the LOFAR spectrum [18]. Researchers can intuitively obtain the time-frequency distribution of the target signal energy through the LOFAR spectrum. This time-frequency spectrum analysis method is conducive to the detection and recognition of dynamic targets [19].
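As a minimal sketch of how a LOFAR-style time-frequency image can be computed, the following NumPy code applies a Hann-windowed STFT to a synthetic signal and normalizes each frame by its median so that narrowband lines stand out. The function name, the window and hop sizes, the median normalization, and the synthetic 250 Hz tone in white noise are all illustrative assumptions, not the processing chain of any cited study.

```python
import numpy as np

def lofar_spectrogram(x, fs, n_fft=1024, hop=512):
    """Return (freqs, times, image), where image is a (freq, time) log-power map in dB."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    log_power = 10.0 * np.log10(power + 1e-12)
    # Crude background normalization (assumption): subtract each frame's median level.
    log_power -= np.median(log_power, axis=1, keepdims=True)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    times = (np.arange(n_frames) * hop + n_fft / 2) / fs
    return freqs, times, log_power.T

# Synthetic example: one narrowband line at 250 Hz buried in white noise.
fs = 4000
t = np.arange(4 * fs) / fs
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 250.0 * t) + rng.normal(0.0, 1.0, t.size)

freqs, times, lofar = lofar_spectrogram(x, fs)
# Averaging over time and taking the strongest bin gives a simple line detector.
line_freq = freqs[np.argmax(lofar.mean(axis=1))]
```

In this toy case the strongest time-averaged bin sits at the frequency of the injected tone; tracking real line spectral trajectories requires more robust detection than a single argmax.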
The periodic rotation of the propeller in the nonuniform flow field gives the ship-radiated noise a unique rhythm. Researchers often use the DEMON spectral analysis method to estimate the shaft frequency and blade number of the target propeller [20]. DEMON analysis is a wideband demodulation technique that separates the modulation envelope caused by propeller cavitation from the ship-radiated noise signal and estimates the number of shafts, the shaft frequency, and the blade number through spectrum analysis and line spectrum detection. These propeller-related parameters are useful features for underwater target detection and recognition [21]. However, the DEMON analysis method performs poorly at low signal-to-noise ratio (SNR). Some researchers use the LOFAR and DEMON methods together to extract ship-radiated noise signal features [22,23].
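The demodulation idea behind DEMON can be sketched in a few lines. The code below assumes a square-law envelope detector, a block average as a crude low-pass filter and decimator, and a synthetic signal made of white noise amplitude-modulated at a hypothetical 5 Hz shaft rate; practical systems usually add band-pass filtering before demodulation and proper line detection afterwards.

```python
import numpy as np

def demon_spectrum(x, fs, decim=40):
    """Envelope spectrum of x: square-law detection, block-average decimation, FFT."""
    env = x ** 2                                      # square-law envelope detector
    n = (len(env) // decim) * decim
    env = env[:n].reshape(-1, decim).mean(axis=1)     # crude low-pass + decimation
    env = env - env.mean()                            # remove the DC component
    fs_env = fs / decim
    spec = np.abs(np.fft.rfft(env * np.hanning(len(env))))
    freqs = np.fft.rfftfreq(len(env), d=1.0 / fs_env)
    return freqs, spec

# Synthetic example: broadband noise modulated at an assumed shaft rate of 5 Hz.
fs = 8000
t = np.arange(10 * fs) / fs
rng = np.random.default_rng(0)
shaft_hz = 5.0
x = (1.0 + 0.5 * np.cos(2 * np.pi * shaft_hz * t)) * rng.normal(size=t.size)

freqs, spec = demon_spectrum(x, fs)
band = (freqs >= 1.0) & (freqs <= 50.0)               # search band for modulation lines
estimated_hz = freqs[band][np.argmax(spec[band])]
```

The strongest line in the search band approximates the modulation rate; harmonics of the shaft rate would appear as additional lines whose spacing relates to the blade number.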
Studies have shown that the human ear's perception of sound frequency is not linear. To make the extracted sound features more consistent with the perception mechanism of human ears, researchers proposed the MFCC feature extraction method. It filters the signal spectrum with a Mel filter bank and then takes the logarithm and an inverse Fourier transform (in practice usually a discrete cosine transform) to obtain the MFCCs. This feature extraction method has become the basis of speech processing [24][25][26]. In 2007, Lim et al. applied MFCC to UATR, and their research showed that this method has good potential for application in UATR [27]. Tong et al. proposed an effective UATR method in which they first extracted the MFCC features of three types of underwater targets and then classified them using the K-Nearest Neighbor algorithm [28]. Similarly, GFCC is an acoustic signal feature based on auditory perception, implemented with a Gammatone filter bank. Some studies have shown that GFCC features describe targets better than MFCC features under low SNR or interference conditions and yield better recognition accuracy and robustness [29][30][31]. In the field of UATR, many studies have shown the effectiveness and practicability of these two feature extraction methods [32][33][34].
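The MFCC pipeline described above (Mel filtering, logarithm, final transform) can be sketched as follows. This is an illustrative NumPy version under common assumptions: the 2595·log10(1 + f/700) Mel formula, triangular filters, a Hamming window, and a DCT-II as the final transform. The frame sizes and filter counts are arbitrary choices, not values taken from the cited studies.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters spaced uniformly on the Mel scale, shape (n_filters, n_fft//2 + 1)."""
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2))
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    fb = np.zeros((n_filters, fft_freqs.size))
    for i in range(n_filters):
        lo, mid, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def mfcc(x, fs, n_fft=512, hop=256, n_filters=26, n_coeffs=13):
    """Return an (n_frames, n_coeffs) matrix of MFCCs."""
    window = np.hamming(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2 / n_fft
    log_mel = np.log(power @ mel_filterbank(n_filters, n_fft, fs).T + 1e-12)
    # DCT-II over the filter axis decorrelates the log Mel energies.
    k = np.arange(n_coeffs)[:, None]
    n = np.arange(n_filters)[None, :]
    dct_basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))
    return log_mel @ dct_basis.T

# Example on one second of synthetic noise standing in for a recording.
fs = 8000
rng = np.random.default_rng(0)
x = rng.normal(size=fs)
features = mfcc(x, fs)
```

A GFCC front end would follow the same structure with the Mel filter bank replaced by a Gammatone filter bank.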
In fact, ship-radiated noise also contains components that change rapidly with time. Empirical mode decomposition (EMD) [35] can decompose complex non-stationary signals into signal components called intrinsic mode functions (IMFs). Huang proposed the Hilbert-Huang transform (HHT) method [35] to decompose signals into a sum of several IMFs. In contrast to the frequency definition of traditional time-frequency analysis methods, HHT uses phase derivatives to obtain frequencies and accurately describes the instantaneous frequency components of signals [36]. HHT can characterize local instantaneous characteristics, so it adapts well to non-stationary signals. In 2014, Zeng and Wang et al. applied HHT to underwater acoustic target recognition and achieved better recognition results than MFCC [37]. In 2010, Bao et al. proposed a ship classification approach based on EMD, which demonstrated the effectiveness of recognition by analyzing nonlinear features of radiated sound [38]. Other feature extraction methods in UATR include formant analysis, wavelet transform, linear predictive cepstral coefficient (LPCC), etc. [39][40][41].
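The phase-derivative definition of frequency used by HHT can be illustrated on a single mono-component signal. The sketch below builds the analytic signal with an FFT-based Hilbert transform and differentiates the unwrapped phase. It covers only the Hilbert spectral analysis step: the EMD sifting that produces the IMFs is omitted, and the linear chirp is a hypothetical stand-in for one IMF.

```python
import numpy as np

def analytic_signal(x):
    """FFT-based analytic signal: zero the negative frequencies, double the positive ones."""
    n = len(x)
    spectrum = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * h)

def instantaneous_frequency(x, fs):
    """Instantaneous frequency in Hz from the derivative of the unwrapped phase."""
    phase = np.unwrap(np.angle(analytic_signal(x)))
    return np.diff(phase) * fs / (2.0 * np.pi)

# Linear chirp from 50 Hz to 150 Hz over one second.
fs = 2000
t = np.arange(fs) / fs
x = np.cos(2.0 * np.pi * (50.0 * t + 50.0 * t ** 2))

inst_f = instantaneous_frequency(x, fs)
expected = 50.0 + 100.0 * (t[:-1] + 0.5 / fs)   # true frequency at the midpoints
mid_error = np.mean(np.abs(inst_f[500:1500] - expected[500:1500]))
```

Away from the signal edges the estimate follows the chirp closely, which a fixed-window Fourier analysis cannot do at the same time resolution.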
The preprocessing of raw data and feature extraction are very important to improve the accuracy of UATR. Therefore, it is necessary to make sufficient preparation for the preprocessing and feature extraction of the raw data, extract the features of the target, and reduce the redundant information, so that the recognition model can have a good performance.

UATR Methods Based on Machine Learning
This section reviews machine-learning-based UATR techniques, which use a machine learning model [42,43] to learn the mapping between underwater acoustic signals and their labels [44,45]. UATR methods based on machine learning are divided into three main categories: (1) methods based on statistical learning models, such as support vector machine (SVM), Gaussian mixture model (GMM), hidden Markov model (HMM), etc.; (2) methods based on deep learning algorithms, such as convolutional neural network (CNN), recurrent neural network (RNN), attention mechanism, etc.; (3) methods based on transfer learning and data augmentation strategies, which are proposed to solve the problem of insufficient data caused by the difficulty of data acquisition, storage, and labeling in the UATR field. In this section, the relevant studies on UATR based on machine learning methods are summarized in tables, and the feature extraction methods, datasets used, recognition results, and main contributions of these recognition methods are briefly explained.

UATR Methods Based on Statistical Learning
Statistical learning methods build probabilistic and statistical models for analysis and prediction on the basis of traditional statistics. They are relatively simple and have easy-to-understand parameters. Statistical learning models can achieve good results on small datasets and are less prone to overfitting. Therefore, in the UATR domain, many studies are based on statistical learning models [46,47]. Table 1 summarizes studies that apply statistical learning models to UATR and briefly describes the models used, feature extraction methods, datasets, recognition performance, and main contributions. One of the statistical learning methods widely used in UATR is SVM [48,49]. Its core idea is to find the decision surface between different classes of data such that the two classes of samples fall on opposite sides of the decision surface and lie far enough away from it [50]. The original SVM uses a planar decision surface, which requires the samples to be linearly separable, but this condition usually cannot be satisfied in practice. The SVM solution is to map the samples to a new space, usually of higher dimension, using a kernel function, and then find a linear decision surface in the new space for classification [49]. Statistical learning theory shows that SVM has two advantages. First, training is a convex optimization problem, so the solution obtained is a global optimum rather than a local optimum. Second, the algorithm is suitable for both linear and nonlinear problems. The computational complexity of SVM depends mainly on the number of support vectors rather than on the dimensionality of the sample space, which avoids the curse of dimensionality in a sense. Hence, it is suitable for datasets with a high-dimensional sample space.
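For reference, the kernel SVM described above can be written in its standard dual form. The notation below is the textbook soft-margin formulation and is an assumption in the sense that the survey does not state it explicitly: $x_i$ are training feature vectors, $y_i \in \{-1, +1\}$ their labels, $\alpha_i$ the dual variables, $C$ the penalty parameter, and the radial basis function is shown as one common kernel choice with width parameter $\gamma$.

```latex
\max_{\alpha}\; \sum_{i=1}^{N} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j \, y_i y_j \, K(x_i, x_j)
\quad \text{s.t.} \quad 0 \le \alpha_i \le C, \qquad \sum_{i=1}^{N} \alpha_i y_i = 0

f(x) = \operatorname{sign}\Big( \sum_{i \in \mathrm{SV}} \alpha_i y_i \, K(x_i, x) + b \Big),
\qquad
K(x, x') = \exp\big( -\gamma \, \lVert x - x' \rVert^{2} \big)
```

Only samples with $\alpha_i > 0$ (the support vectors, SV) contribute to the decision function $f(x)$, and the objective is a concave quadratic in $\alpha$, which is why the optimum found is global. The kernel matrix $K(x_i, x_j)$ has $N \times N$ entries.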
Ref. [48] uses a single-class SVM [56] for target detection in passive sonar systems. The single-class SVM addresses the situation in which only one kind of training data is available and the target data are expected to share the characteristics of the training data. It was proposed because, in some scenarios, it is hard to obtain negative samples or to define the range of negative samples accurately. In that case, the model needs to recognize the type of unknown samples according to the characteristics of one class of known samples. De Moura and de Seixas [48] use a ship dataset acquired from a real marine environment to train the single-class SVM model, and the SP index [57] is 73.18%. The classification performance of SVM largely depends on the kernel function. Ref. [49] uses the BAT algorithm [58] to optimize the kernel parameters of SVM. Compared with other parameter optimization algorithms, such as genetic algorithms (GA) and particle swarm optimization (PSO), the BAT algorithm has the advantage that it can conduct global and local searches simultaneously to avoid falling into a local optimum. The reported results show that the accuracy of the classifier using the BAT optimization algorithm is six percentage points higher than that using the PSO algorithm [49]. An ensemble of SVMs can improve the recognition accuracy of a UATR system. Yang et al. [50] propose a novel SVM ensemble algorithm combined with sample selection and feature selection methods (WSFSelect-SVME). The proposed model addresses two limitations of traditional ensemble SVM methods: (1) poor-quality training data lead to errors between actual and theoretical results, and (2) ensemble recognition systems usually have higher complexity and computational costs.
The experimental results on the UCI sonar dataset and real-world underwater acoustic target dataset show that the WSFSelect-SVME model obtains better recognition performance and robustness than Adaboost SVM ensemble algorithm.
A number of studies have applied the SVM model to UATR and reported strong results. However, solving for the support vectors involves computations on a matrix of order N (N is the number of samples), which requires a lot of memory and computing time when N is large. In addition, the conventional SVM algorithm only supports binary classification. A multi-class problem needs to be transformed into multiple binary classification problems, which reduces classification efficiency.
The GMM is an extension of a single Gaussian probability density function (PDF) and is composed of multiple Gaussian distributions. GMM can approximate density distributions of arbitrary shape and can be used for multi-class target recognition. According to the different parameters of the Gaussian PDF, each Gaussian model can be regarded as a class. The GMM model first calculates the probability value of an input sample. Whether the sample belongs to a given Gaussian distribution is then judged against a set threshold [59]. Research shows that GMM is suitable for modeling complex samples [60]. Parada and Cardénal-Lopez [51] proposed a GMM-based method to identify the two main sounds emitted by dolphins, whistle and pulse, as well as background noise. By introducing an uncertainty measure and the MUSIC algorithm [52] for feature extraction, the detection rate of GMM is increased from 87.5% to 90.3%, and the classification error rate is reduced from 23.6% to 18.1%.
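A minimal sketch of GMM-based scoring with a rejection threshold is given below. It assumes diagonal covariances and hand-set, hypothetical mixture parameters for two classes (in practice the parameters would be fitted, typically with the EM algorithm); the class names only echo the dolphin example above and do not reproduce the cited method.

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Log-density of one feature vector x under a diagonal-covariance GMM."""
    diff = x - means                                           # (K, d)
    log_comp = -0.5 * (np.sum(diff ** 2 / variances, axis=1)
                       + np.sum(np.log(2.0 * np.pi * variances), axis=1))
    a = np.log(weights) + log_comp                             # log of each weighted component
    m = a.max()
    return m + np.log(np.sum(np.exp(a - m)))                   # numerically stable log-sum-exp

# Hypothetical two-component models per class: (weights, means, variances).
models = {
    "whistle": (np.array([0.6, 0.4]), np.array([[0.0, 0.0], [1.0, 1.0]]), np.ones((2, 2))),
    "pulse": (np.array([0.5, 0.5]), np.array([[5.0, 5.0], [6.0, 4.0]]), np.ones((2, 2))),
}

def classify(x, models, reject_threshold=-20.0):
    """Pick the class whose GMM scores highest; reject if even the best score is too low."""
    scores = {name: gmm_log_likelihood(x, *params) for name, params in models.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > reject_threshold else "unknown"

label_a = classify(np.array([0.2, 0.1]), models)
label_b = classify(np.array([5.5, 4.6]), models)
label_c = classify(np.array([50.0, -50.0]), models)   # far from every component
```

The threshold value is arbitrary here; in a real system it would be tuned on validation data to trade detections against false alarms.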
The GMM model can only approximate the Gaussian distribution of the calculated data and cannot extract the deep abstract features of the acoustic signal. Although GMM fits existing samples well, the fitting to unknown samples is unstable. Most importantly, the recognition results of GMM in multidimensional features are not ideal [61]. Therefore, GMM is generally combined with other models to build a reliable UATR system.
Compared to GMMs, the advantage of HMMs is that they usually have a better prediction performance, instead of only focusing on fitting observed values [62]. During target recognition, HMM obtains the state transition probability matrix and observation probability matrix through training and makes decisions according to the maximum probability in the process of state transition [63]. In Kim et al. [53], a multi-direction target classification method based on HMM is proposed and applied to the classification of synthesized active sonar signals. Mohammed et al. [55] researched the efficiency and reliability of underwater acoustic target methods and proposed an HMM model based on Gammatone cepstral coefficient (GTCC). The experiment results on a dataset including ten types of ship and marine species show that the GTCC-based HMM model achieves an average accuracy of 89% under different SNR, which is 5 and 8 percentage points higher than ANN and statistical Euclidean distance classification, respectively.
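The maximum-probability decision described above can be sketched with the scaled forward algorithm: each class has its own HMM, and a test sequence is assigned to the class whose model gives the highest likelihood. The code assumes discrete observation symbols and hand-set, hypothetical transition and observation matrices; the cited studies use trained models over cepstral features.

```python
import numpy as np

def forward_log_likelihood(obs, pi, A, B):
    """Log P(obs | model) via the scaled forward algorithm.

    pi: (S,) initial state probabilities
    A:  (S, S) state transition matrix, A[i, j] = P(next state j | state i)
    B:  (S, M) observation matrix, B[i, k] = P(symbol k | state i)
    """
    alpha = pi * B[:, obs[0]]
    scale = alpha.sum()
    log_lik = np.log(scale)
    alpha = alpha / scale
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]      # propagate one step, then weight by the emission
        scale = alpha.sum()
        log_lik += np.log(scale)           # accumulate the log of the scaling factors
        alpha = alpha / scale
    return log_lik

# Two hypothetical class models that differ only in their observation matrices.
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.1, 0.9]])
B_class1 = np.array([[0.8, 0.2], [0.7, 0.3]])   # mostly emits symbol 0
B_class2 = np.array([[0.2, 0.8], [0.3, 0.7]])   # mostly emits symbol 1

obs = [0, 0, 0, 1, 0, 0]
scores = {
    "class1": forward_log_likelihood(obs, pi, A, B_class1),
    "class2": forward_log_likelihood(obs, pi, A, B_class2),
}
predicted = max(scores, key=scores.get)
```

Rescaling at each step avoids numerical underflow on long sequences, which matters for frame-level acoustic features.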
Statistical learning methods build and train models based on traditional statistical analysis, which can only roughly fit the distribution of samples and have limited ability to extract features. Statistical learning methods struggle to handle the recognition tasks with large samples due to their limited model capacity. Moreover, both GMMs and HMMs have default assumptions, but the underwater acoustic data, in reality, struggle to meet these assumptions, which affects the generalization of the model. To better extract and use the features of underwater acoustic signals for UATR, deep learning models with strong feature extraction ability have been applied in this field.

UATR Methods Based on Deep Learning
In recent years, with the improvement in the computing ability of computers, the research of deep learning (DL) based on neural networks [64] has developed rapidly. A deep learning model can be composed of network modules with multiple processing layers. These layers extract features with different levels of abstraction and automatically adjust parameters through back propagation until suitable data features are extracted for downstream tasks. Deep learning models are widely used in speech recognition, image processing, intelligent control, expert systems, and other fields with their powerful feature extraction ability [65]. Researchers in the UATR field have also turned their attention to deep learning algorithms [66]. Deep learning-based UATR models are generally supervised, which train deep neural networks on datasets with labels, and then the network can predict the type of unknown samples. Table 2 lists the relevant studies on UATR using deep learning models.  CNN is one of the mainstream deep learning architectures, which has been widely used in natural language processing, speech recognition, medical diagnosis, and other fields [75,76]. A basic convolutional neural network consists of the convolutional layer, activation function, and pooling layer. The convolutional layer is the core part of the network, and the convolutional kernel can be regarded as a feature recognizer. The training process of CNNs for UATR is to adjust the weights of the convolution kernel and make it suitable for target recognition. After the convolutional layer, the activation function enhances the generalization ability of the network by non-linear mapping between the input and output. Pooling can be regarded as a down-sampling operation, the main purpose of which is to reduce the resolution of the feature map. Common pooling methods include maximum pooling and average pooling [77]. The pooling operation is helpful to prevent overfitting of neural networks. 
When processing the target recognition tasks, CNNs send the output sample feature vectors to a fully connected layer to map the samples and labels [78]. In studies of UATR, some researchers use the acquired sonar image data as the CNN input for target recognition [79][80][81]. Others directly input time domain signals into CNN models to identify ship types [34,68]. In general, researchers first transform time domain signals into various spectrums and then use the CNN model to extract abstract features of spectrums and recognize underwater targets [8,82].
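The three basic CNN building blocks described above can be illustrated directly in NumPy. The sketch below uses a hand-set kernel that responds to a horizontal line, standing in for a constant-frequency tonal in a small hypothetical spectrogram; in a trained CNN the kernel weights are learned by backpropagation rather than chosen by hand. As in most deep learning frameworks, the "convolution" is implemented as a cross-correlation.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Single-channel 2-D cross-correlation with no padding and stride 1."""
    kh, kw = kernel.shape
    out = np.empty((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Rectified linear unit activation."""
    return np.maximum(x, 0.0)

def max_pool2d(x, size=2):
    """Non-overlapping max pooling; trailing rows and columns that do not fit are dropped."""
    h2, w2 = x.shape[0] // size, x.shape[1] // size
    return x[:h2 * size, :w2 * size].reshape(h2, size, w2, size).max(axis=(1, 3))

# Toy 6 x 6 "spectrogram" (frequency x time) with one tonal line in frequency row 2.
spectrogram = np.zeros((6, 6))
spectrogram[2, :] = 1.0

# Hand-set kernel acting as a horizontal-line feature recognizer.
kernel = np.array([[-1.0, -1.0, -1.0],
                   [2.0, 2.0, 2.0],
                   [-1.0, -1.0, -1.0]])

feature_map = relu(conv2d_valid(spectrogram, kernel))   # shape (4, 4)
pooled = max_pool2d(feature_map)                        # shape (2, 2)
```

The pooled map keeps the strong response near the tonal line while halving the resolution, which is the down-sampling role of pooling described above.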
Doan et al. [34] use a dense CNN to extract time domain signal features for UATR. The proposed target recognition network uses skip connections to reuse earlier feature maps, which mitigates the vanishing gradient problem. The experimental accuracy on a real-world dataset at 0 dB SNR reaches 98.85%. Xiaoping et al. [67] compared the recognition ability of CNN and LSTM models for complex underwater acoustic signals. Experimental results show that when the classifier takes the time domain signal as input, the accuracy of CNN on a dataset containing eight types of underwater targets and six types of ships is five percentage points higher than that of LSTM. Hu et al. [68] used depthwise separable convolution and dilated convolution for passive UATR for the first time. The dilated convolution enlarges the receptive field of the model without increasing the number of parameters, so that the features extracted by the model have better intra-class aggregation and inter-class separation. The proposed model achieves better recognition performance than a traditional CNN model.
RNNs are a kind of neural network that is good at processing sequence data. The input of an RNN at each time step contains the output of the previous time step, and this structure gives the model memory [83]. Underwater acoustic signals are complex time-varying signals with some correlation between frames, and the memory ability of RNNs makes them suitable for learning the features of such signals. In recent years, RNNs have become one of the major solutions for UATR. Wang et al. [69] proposed a hybrid time-series network, i.e., the combination of a bi-directional gated recurrent unit (Bi-GRU) and a multi-layer gated recurrent unit (GRU), for acoustic signal modulation identification in harsh underwater communication environments. The network optimizes its internal structure through the cascade order to obtain more hidden signal features. The experimental results show that the combined network of a 4-layer Bi-GRU and a 4-layer GRU has good recognition accuracy and robustness in an environment with serious interference. CNNs are effective local feature extractors, and the combination of CNN and LSTM can extract features from samples better. Kamal et al. [70] proposed a combined CNN and LSTM model for target recognition based on shallow sea acoustic data. First, the standardized data are convolved with a filter to generate a learnable time-frequency representation. Then, the abstract features of the time-frequency representation are further extracted using a three-layer two-dimensional convolution. A Bi-LSTM is used to capture the temporal features of the sequence in the forward and backward directions. Finally, a selective attention layer is used to select the most useful features for recognition. Experimental results on acoustic datasets collected in Indian Ocean shoals show that the recognition accuracy of this end-to-end deep learning model reaches 95.2%.
Attention is a kind of information selection and resource allocation method, which devotes limited resources to processing important information [84]. Generally, in order to select the information in the input set of vectors that is most important to the downstream task, the input information is represented in the form of key-value pairs. At the same time, query vectors are introduced, and the correlation between each input vector and the query vector is calculated by a scoring function [84]. To effectively extract the low-frequency spectrum under Doppler shift, Xue et al. designed a ResNet model with a channel attention mechanism [71]. The deep abstract spectral features of the target are extracted by ResNet. Then, the channel attention mechanism is used to weight the signal channels and complete information points in each channel. The targets are recognized by one-dimensional convolution. The recognition accuracy on a real-world dataset containing four kinds of underwater acoustic targets reaches 98.2%. Transformer [85] is an attention-based architecture that is widely used in image processing and natural language processing. The transformer model was introduced to the UATR field for the first time by Feng et al. and achieves good recognition performance [72]. Compared with CNN-based models, the transformer architecture can consider both global and local information.
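The query, key, and value mechanism can be sketched with scaled dot-product attention, the scoring function used in the Transformer. The choice of this particular scoring function and the toy matrices below are assumptions for illustration; the text above does not fix a scoring function, and the cited UATR models add learned projections and multiple heads.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d), K: (n_k, d), V: (n_k, d_v). Returns (output, attention weights)."""
    scores = Q @ K.T / np.sqrt(K.shape[1])   # correlation of each query with each key
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights

# Three key-value pairs; the query matches the second key.
K = 10.0 * np.eye(3)
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
Q = K[1:2]

output, weights = scaled_dot_product_attention(Q, K, V)
```

Because the query aligns with the second key, almost all of the attention weight falls on the second value vector, which is the information selection behavior described above.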
Deep learning is proving to be a promising tool for UATR. However, its application in this field is limited to a few methods. A range of deep learning methods and applications with excellent performance remains to be explored.

Transfer Learning and Data Augmentation Strategies for UATR
Even though deep learning methods have achieved good performance in the UATR field, it is extremely difficult to train a sufficiently reliable deep learning model without a large amount of labeled data [86]. Many studies have shown that transfer learning (TL) and data augmentation methods are effective ways to solve the problem of model training in the case of insufficient data [87]. TL methods first train a network model on a large and related dataset called the source domain, and then use a small target domain dataset to fine-tune the parameters so that the network model adapts to the new task requirements. These methods not only relieve the difficulty of training on an insufficient dataset but also reduce the training time on the target domain and yield robust models. Furthermore, data augmentation methods such as generative adversarial networks (GANs) [88] also provide a solution for model training in the case of insufficient data by expanding the dataset with newly generated samples. For training a UATR network, it is hard to construct a standard dataset of sufficient scale. Therefore, many researchers use transfer learning or data augmentation techniques for UATR. Table 3 lists the applications of TL and data augmentation methods in UATR. In recent years, many deep CNN models with remarkable image recognition performance have been proposed, and many researchers are trying to transfer pretrained deep CNN models to UATR tasks. ImageNet, a large image database, provides a good dataset on which to pretrain these CNN models [95]. The most direct application of transfer learning in the underwater target recognition field is to transfer a model pretrained on the ImageNet dataset to an underwater sonar image dataset. In the paper by Lipton et al. [83], TL technology is applied to sonar seabed image classification. The proposed method pretrains a VGG19 model using the ImageNet dataset.
Then, the parameters of VGG19 are fine-tuned using a dataset acquired in a real scenario together with semi-generated data. Experimental results show that the VGG19 network transferred from the ImageNet dataset achieves 97.76% accuracy, which is better than the results of SVM and shallow CNN networks. With the support of pretrained CNN models and the ImageNet dataset, many studies have chosen to transfer pretrained deep CNN models to underwater target recognition tasks. For example, a pretrained GoogleNet [89] is used for automatic underwater human body detection based on sonar images. In Fuchs et al. [90], a ResNet50 pretrained on ImageNet is transferred to forward-looking sonar (FLS) image data classification, and the accuracy reaches 95%. Underwater acoustic spectrums have a similar format to images. Therefore, the TL strategies mentioned above have immense potential in UATR. Ke et al. [82] proposed a one-dimensional convolutional autoencoder model to recognize ship-radiated noise. It is combined with a feature extraction method based on resonant sparse signal decomposition. The model is trained on a large unlabeled dataset and then fine-tuned using a small labeled dataset. The recognition accuracy of this model on the ShipsEar dataset [73] reaches 93.28%. Korkmaz et al. [91] compared the dolphin whistle recognition performance of PamGuard [96], a software package that automatically identifies marine mammals, a vanilla CNN, and VGG models using the transfer learning approach. The results showed that the mean recognition accuracy of the CNN model was much higher than that of the PamGuard software, while the VGG model using the transfer learning technique achieved a recognition accuracy a further 11.7 percentage points higher than the vanilla CNN. This study shows great potential for the deployment of marine biological detection systems using deep learning techniques.
Data augmentation methods build synthetic data by applying various transformations to the existing labeled data. Because ship-radiated signals are difficult to acquire, it is difficult to construct a sufficient amount of labeled training data. Researchers have tried to apply various data augmentation techniques to UATR tasks. Luo et al. [92] used a restricted Boltzmann machine (RBM) autoencoder to augment the ship-radiated noise signal dataset for training the UATR system. In this method, the RBM is used to encode the combined data of the power spectrum and demodulation spectrum of ship-radiated noise automatically without supervision. Then, reconstructed samples are obtained by decoding feature vectors layer by layer. After this data augmentation processing, the recognition accuracy of a 4-layer Back Propagation (BP) classifier is improved from 91.4% to 92.6%. Luo et al. [93] proposed a conditional deep convolutional generative adversarial network (cDCGAN) model for data augmentation. The cDCGAN uses CNNs to build the generator and discriminator and introduces label information into the training process. It increases the number of ship-radiated noise samples. A ResNet-based classifier is used to recognize the ship type. The test accuracy is improved from 90.94% to 96.32% after the data augmentation processing. In the work by Jiang et al. [94], an improved DCGAN [97] architecture is used to augment the training data of ship-radiated noise targets, and then the proposed S-ResNet is used as a classifier. The recognition accuracy of the S-ResNet classifier improved by about six percentage points after using data augmentation technology. Schmidhuber [64] used WGAN-GP [98] to expand the time domain signal and the LOFAR spectrum and used CNN and LSTM to classify underwater targets and ship signals.
The experimental results show that the recognition accuracy of the CNN model increased by 3.7 percentage points when training with the dataset was augmented.
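The generative models above are too involved for a short listing, but the basic principle of augmentation, producing new labeled samples by transforming existing ones, can be shown with two simple signal-level transformations. The sketch below is an illustration under stated assumptions and not a method from the cited works: the choice of transformations (a circular time shift and additive white Gaussian noise at a chosen signal-to-noise ratio), the 50 Hz toy tone, the label `vessel_A`, and all function names are hypothetical.

```python
import math
import random


def add_noise_at_snr(signal, snr_db, rng):
    """Return a copy of `signal` with white Gaussian noise added at `snr_db` dB."""
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def circular_shift(signal, shift):
    """Rotate the samples to the right by `shift` positions."""
    shift %= len(signal)
    return signal[-shift:] + signal[:-shift] if shift else list(signal)


def augment(signal, label, n_copies, snr_db=10.0, seed=0):
    """Create `n_copies` shifted, noisy variants that keep the original label."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_copies):
        shifted = circular_shift(signal, rng.randrange(len(signal)))
        out.append((add_noise_at_snr(shifted, snr_db, rng), label))
    return out


# Toy 50 Hz tone sampled at 1 kHz, standing in for one labeled recording.
tone = [math.sin(2 * math.pi * 50 * n / 1000) for n in range(1000)]
augmented = augment(tone, label="vessel_A", n_copies=4)
```

Such transformations enlarge the training set cheaply, but the synthetic samples stay close to the originals, which is why the works above turn to generative models.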

Challenges
The real marine environment is complex and changeable, and acquiring data from it poses various problems. For example, realistic underwater experiments are costly. Acoustic signals suffer spreading loss, absorption loss, and boundary loss as they propagate, so acquiring high-quality underwater sound signals consumes considerable time and energy. At the same time, due to the limitations of underwater communication hardware, it is difficult to guarantee the quality of the acquired data. There may be problems such as non-homogeneous resolution, an overly weak target signal, non-uniform intensity, and reverberation [98]. High-resolution sensors can improve the performance of sonar systems, but they are expensive. Therefore, there is currently a lack of publicly available datasets for UATR. In addition, data management, labeling, and storage also consume a lot of time and energy. Most UATR methods are tested on recordings collected in relatively simple sea areas, and they do not apply to signals from complex sea areas. On top of that, the underwater acoustic data collected for military purposes are mostly highly classified and hard to use for academic studies. Dataset imbalance is also one of the main problems for machine-learning-based UATR.
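To give a sense of scale for the propagation losses named above, the following sketch estimates one-way transmission loss from two of the three terms. It rests on assumptions that are not taken from this survey: spherical spreading (20 log10 of the range in meters) and Thorp's empirical formula for the absorption coefficient, with boundary loss ignored entirely. The function names are hypothetical, and the result is a rough order-of-magnitude estimate, not a propagation model.

```python
import math


def thorp_absorption_db_per_km(f_khz):
    """Thorp's empirical absorption coefficient in dB/km, with f in kHz."""
    f2 = f_khz ** 2
    return 0.11 * f2 / (1 + f2) + 44 * f2 / (4100 + f2) + 2.75e-4 * f2 + 0.003


def transmission_loss_db(range_m, f_khz):
    """Spherical spreading plus absorption, in dB re 1 m; boundary loss is ignored."""
    spreading = 20 * math.log10(range_m)
    absorption = thorp_absorption_db_per_km(f_khz) * range_m / 1000.0
    return spreading + absorption


# Compare the loss over 5 km at three frequencies.
for f in (0.1, 1.0, 10.0):
    print(f"{f:5.1f} kHz: {transmission_loss_db(5000, f):.1f} dB over 5 km")
```

Under these assumptions the spreading term dominates at low frequencies, while absorption grows quickly with frequency, which is one reason why distant targets are weak and high-quality recordings are costly to obtain.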
Due to the lack of public datasets, most current studies on UATR based on machine learning use self-constructed datasets to evaluate recognition performance. The published literature often does not provide detailed information about these datasets, so it is hard to compare the performance of various methods on a common basis. In the complex marine environment, underwater acoustic signals are affected by various factors, such as time, temperature, depth, salinity, geographic location, and sensor type [99]. Researchers must consider these factors comprehensively and design appropriate models, which also poses daunting challenges for underwater acoustic target recognition.
In response to the problem of insufficient training data, some researchers have used transfer learning and data augmentation techniques in UATR. However, data augmentation has certain limitations. It is often a simple transformation of the raw data. Even data generated by a neural network model follow a distribution similar to that of the known samples. Whether such data can represent the characteristics of data in the real environment remains to be verified. Transfer learning requires pretraining on a large amount of source-domain data. Whether a source domain close to the target domain can be found, and how to determine the appropriate source-domain size, are open problems. Moreover, pretraining the model in the source domain also requires a lot of computing resources and time.
In addition, it is a grand challenge to find model architectures and parameters suited to underwater acoustic signals and to improve the efficiency of network training, because training is subject to problems such as the model failing to converge due to vanishing or exploding gradients. Many machine learning models, and especially deep learning models based on neural networks, are black-box models with little interpretability. However, many practical application scenarios require the basis for the model's predictions. This limits the application scope of such methods to a large extent.
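One common remedy for exploding gradients is gradient clipping, which is mentioned here only as an illustration and is not a technique evaluated in this survey. It does not address vanishing gradients. The sketch below rescales all gradients by a single factor whenever their joint L2 norm exceeds a threshold; the per-layer gradient values, the threshold of 5.0, and the function name are hypothetical.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale all gradients by one factor so their joint L2 norm is <= max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total = math.sqrt(sum(g * g for layer in grads for g in layer))
    if total <= max_norm or total == 0.0:
        return [list(layer) for layer in grads], total
    scale = max_norm / total
    return [[g * scale for g in layer] for layer in grads], total


# Hypothetical per-layer gradients; the second layer has "exploded".
grads = [[0.3, -0.4], [120.0, -50.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
```

Because every component is multiplied by the same factor, the update direction is preserved and only the step length is bounded.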
Each method has its limitations. Statistical learning models have a small number of parameters and compute faster but are only suitable for small datasets. Deep learning models with more complex structures usually provide better recognition accuracy than statistical learning methods but require more computational resources and training time. In addition, deep learning models are less computationally efficient, as they are usually implemented in the high-level Python language. Deep learning models also require high-quality data and generalize poorly, which makes it hard to deploy current deep-learning-based UATR methods directly in underwater acoustic monitoring systems.

Conclusions and Discussion
This paper reviews recent studies of underwater acoustic target recognition based on machine learning, presents the processing flow from underwater acoustic signal acquisition to data processing and target recognition, analyzes the pros and cons of the machine learning frameworks used in relevant studies as well as their recognition performance, and summarizes the challenges faced by UATR based on machine learning. Due to the lack of training data, the current development trend of UATR is to combine manual features with machine learning methods. Section 2 introduced the feature extraction methods used in underwater acoustic target recognition and their respective characteristics. Section 3 summarized the UATR methods based on machine learning and, drawing on the cited papers, analyzed the scenarios to which different machine learning models apply. According to the machine learning methods used, this part was organized into three categories: (1) UATR methods based on statistical learning; (2) UATR technologies using CNNs, RNNs, and other deep learning methods; (3) the application of transfer learning and data augmentation technology for UATR. By surveying the relevant literature, we identified several problems and challenges facing UATR, including data acquisition, management, storage, and labeling, computing resource constraints, and some disadvantages of machine learning methods. The details were given in Section 4.
Machine learning methods have been widely used in natural language processing and computer vision and have achieved breakthroughs in those fields. By contrast, the application of machine learning to UATR has developed slowly, both nationally and internationally. Hence, much work remains to be done. First, the dataset problem is one of the main problems faced by underwater acoustic target recognition methods based on machine learning. Effective data acquisition and preprocessing methods are the focus of current progress in automatic UATR. At present, the most urgent need is to establish an open and standardized dataset for model training and testing, which would facilitate comparative analysis across studies and promote the development of UATR. At the same time, future work should pay attention to basic research on underwater acoustics and should try to fuse multiple spectral features to describe underwater targets from multiple dimensions. To address insufficient and unbalanced training data, appropriate transfer learning and data augmentation techniques need to be selected according to the actual situation to improve the performance of UATR. More studies should also be conducted to understand and evaluate existing state-of-the-art deep learning architectures and their application to underwater acoustic signal classification. In addition, interpretability is an important issue that needs to be tackled in the future development of UATR based on machine learning. Only when a model has a certain degree of interpretability can it be applied to more real-world scenarios and play a significant role. In the complex underwater environment, it is hard to solve all the challenges faced by UATR using a single form of data. Multimodal learning is one of the current research hotspots, aiming to learn information from multiple modalities and to enable the exchange and transformation of information between them [100]. Many researchers have tried to use multimodal learning for studies such as underwater navigation [101] and underwater communication [102]. Using multimodal data, such as acoustic, optical, and image data, for UATR offers new ideas for future research.