Study on Active Tracking of Underwater Acoustic Target Based on Deep Convolution Neural Network

Abstract: Active tracking of underwater acoustic targets is an important research direction in underwater acoustic signal processing and sonar, and it has long drawn researchers' attention. The commonly used Kalman filter active tracking (KFAT) method is effective; however, it has difficulty detecting low-SNR signals, and it easily loses a target after the azimuths of different targets overlap. This paper proposes a KFAT method based on a deep convolutional neural network (DCNN), which can effectively solve the problem of target loss. First, we use Kalman filtering to predict the azimuth and distance of the target. We then use the trained model to identify the azimuth-weighted time-frequency image, which gives the azimuth and label of the target, and obtain the target distance from the time at which the target appears in the time-frequency image. Finally, we associate the data according to the target category and update the target azimuth and distance for the current cycle. Both KFAT and DCNN-KFAT are simulated and tested in two cases: tracking low-SNR signals and tracking different targets with overlapping azimuths. The simulation results show that the DCNN-KFAT method overcomes the difficulty the KFAT method has in tracking a target at low SNR, as well as its tendency to lose a target when two different targets overlap in azimuth. It reduces the deviation range of active tracking to within 200 m, which is 500~700 m less than the KFAT method.


Introduction
In marine activities, active sonar is widely used in underwater moving target detection, recognition, tracking, seabed scanning, navigation, and communication [1][2][3][4][5][6][7]. Active sonar is an important device for active tracking of underwater acoustic targets. By periodically transmitting specific signals and analyzing the echo signals received by the array, the target's position, speed, distance, and other characteristic information can be obtained [8]. In recent years, the strong demand for marine development research and marine safety has placed higher requirements on the application of active sonar in tracking [9]. Research on active tracking methods for underwater moving targets therefore has both engineering and practical significance.
Current active tracking methods for underwater acoustic targets [10] have two drawbacks. First, when the azimuths of two different moving targets overlap, the target data cannot be correctly associated, and one target is lost. Second, when the target echo signal is weak and falls below the detection threshold, such low-SNR targets are difficult to track. Accurately detecting low-SNR targets and correctly associating the data of different targets have therefore become active research topics in the active tracking of underwater acoustic targets. The Kalman filter active tracking (KFAT) method is the most commonly used method for active tracking of underwater acoustic targets. It transforms the azimuth and distance of the target from polar coordinates to Cartesian coordinates and then establishes a motion model of the target for processing [11], which makes it possible to predict the azimuth and distance of the tracked target and to denoise the tracking results to improve tracking accuracy [12]. Lei [13] proposed that, when tracking a moving target with decorrelation applied to the conversion deviation of position and distance, the linear part (position measurement) and the nonlinear part (distance-rate measurement) be processed by a Kalman filter and an unscented Kalman filter (UKF), respectively. Bar-Shalom [14] studied the application of probabilistic data association (PDA) in different target tracking schemes, especially for targets with a low signal-to-noise ratio (SNR). The accuracy of the target azimuth and distance obtained after filtering depends on the original measurements, so reducing the deviation of the initial measurement improves the accuracy of active tracking.
The information on target scattering characteristics carried in the active sonar echo signal can be used to detect and distinguish different targets [15]. Researchers have used target scattering characteristics for target detection and classification in the field of underwater acoustics [16,17]. In 2007, Young [18] proposed a method that extracts features from the target echo signal by imitating auditory perception and classifies the target using Gaussian classification in machine learning. Deep learning has the advantage of automatically extracting target feature information from raw data through training, and it can be used for multi-target recognition and classification. The convolutional neural network (CNN) is a commonly used structure in deep learning that has greatly improved productivity and efficiency in areas such as computer vision, natural language processing, text and speech recognition, and object detection [19]. In 2015, Yang [20] combined a CNN with existing auditory perception models, extracting features of target radiated noise with mel-scale frequency cepstral coefficients to simulate the function of a complete auditory system and identify ship target radiated noise. The results show the feasibility of applying deep learning to the field of underwater acoustics. In 2019, Yao [21] proposed a deep learning method that builds an underwater acoustic signal feature extraction model based on a generative adversarial network and combines it with a deep neural network classifier for modulation recognition; the method effectively extracts classification features from underwater acoustic signals. Deep learning has wide application prospects in the field of underwater acoustic targets [22].
In this paper, we propose a DCNN-KFAT method for underwater acoustic moving target tracking. After the target is detected in the early stage of tracking, we calculate the target azimuth and distance, use beamforming to apply target azimuth-related weighting to the echo signal, convert the resulting target frequency-domain data into time-domain data, and generate a data set with the target time-frequency images as samples. We then train a DCNN on the data set to generate a model that identifies the target. In the follow-up process of tracking the target, we use a Kalman filter to predict the target azimuth and distance, weight the received echo signal, generate the time-frequency images of the signal for all candidate azimuths, and use the trained model for recognition. We determine the target category according to the output of the model and finally obtain the target azimuth and distance. The current recognition result is associated with the existing target, and the target tracking information is updated. The rest of the article is organized as follows: Section 2 describes the active tracking method proposed in this article, Section 3 describes the simulation test, Section 4 presents the discussion, and Section 5 gives the conclusion.

After the active sonar emits sound waves, the echo signals received continuously during the period include reverberation and scattering caused by impurities. Differences in target geometry also cause differences in the echo signal, so the information carried in the target echo can be used to track the target. Tracking a moving target with active sonar is a process of predicting the location, searching for the target, judging the target, and associating the data. The schematic diagram of the DCNN-KFAT process proposed in this paper is shown in Figure 1. We use periodic pulse signals as the active sonar transmission signals and receive the echo signals with a uniform linear array. In the preprocessing step, the target is detected by beamforming and matched filtering of the received echo signal, and the time-frequency spectrogram of the target is obtained after azimuth-weighting the signal according to the detected target azimuth; we then label the different targets, generate the data set, and train the DCNN on it.
In the subsequent tracking step, the echo signals are weighted by different azimuths to obtain a time-frequency spectrum image of the signal to be detected, and the trained model is used for identification to obtain the azimuth and distance of the target. All steps are described in detail in the rest of this section.

Active Sonar Echo Signal Preprocessing to Generate Data Set
In active tracking, the first step is to determine the moving target to be tracked and generate a target data set. The specific process is shown in Figure 2. First, we generate a weighting matrix containing the target azimuth information based on the target detected from the original array signal, and multiply it with the array signal matrix to obtain the frequency-domain data of the target echo signal. We then convert the frequency-domain data to time-domain data and apply the short-time Fourier transform to generate the time-frequency images of the target echo signal. Finally, the time-frequency images of different targets are labeled and stored in the data set.
After the active sonar transmits the signal, the array continues to receive the echo signal within a period. Each 1 s of accumulated echo signal Y(t) is Fourier transformed to generate a matrix in the frequency domain, and we obtain the spatial energy spectrum through matched filtering and beamforming. The spatial energy spectrum accumulated over the period is shown in Figure 3a. Suspected targets are selected through threshold detection, and the approximate position (distance r, azimuth θ) of each target is calculated from the speed of sound c and the echo arrival time t. This provides prior information for confirming the target, and the tracking target is then determined through data association. After the target is determined, a weighting matrix W is generated according to the detected target azimuth θ, as shown in Figure 3b.
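The detection step above can be illustrated with a minimal sketch. It assumes a monostatic sonar, so the echo travels out and back and the range is r = c·t/2, and it treats the spatial energy spectrum as a plain list of rows (one row per beam). The function names and the default sound speed of 1500 m/s are illustrative assumptions, not values from the paper.

```python
def echo_range(arrival_time_s, sound_speed_mps=1500.0):
    """Range of a target from its two-way echo arrival time: r = c * t / 2."""
    if arrival_time_s < 0:
        raise ValueError("arrival time must be non-negative")
    return sound_speed_mps * arrival_time_s / 2.0


def detect_cells(energy_spectrum, threshold):
    """Return (beam index, time index) of every cell whose energy exceeds the threshold."""
    return [
        (beam, k)
        for beam, row in enumerate(energy_spectrum)
        for k, energy in enumerate(row)
        if energy > threshold
    ]


# Toy spectrum: 2 beams x 3 time cells, one cell above the threshold.
spectrum = [[0.1, 0.9, 0.2], [0.2, 0.3, 0.1]]
suspects = detect_cells(spectrum, threshold=0.5)
```

Each suspected cell then maps to an azimuth through its beam index and to a range through `echo_range` applied to its arrival time.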
Appl. Sci. 2021, 11, x FOR PEER REVIEW 5 of 21
Figure 3. (a) Spatial spectrum of beams within a period, used to discover different targets by energy threshold detection.
We multiply the weighting matrix W with the echo signal matrix F(w) to obtain the weighted target frequency-domain data. After the transformation, the signal is concentrated on the azimuth of the target, which minimizes interference from other directions. We convert the weighted target frequency-domain data into time-domain data f(t) through the inverse Fourier transform, as shown in Figure 3c.
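As a hedged illustration of azimuth weighting, the sketch below applies delay-and-sum weights to a single frequency bin of a uniform linear array. The exact form of W in the paper is not recoverable from the text, so this is one common choice, not the authors' formula. It assumes azimuth measured from broadside, element spacing `spacing_m`, and a sound speed of 1500 m/s; all names are hypothetical.

```python
import cmath
import math


def steering_weights(num_elements, spacing_m, azimuth_deg, freq_hz, sound_speed=1500.0):
    """Delay-and-sum weights that undo the inter-element delays for one azimuth."""
    theta = math.radians(azimuth_deg)
    weights = []
    for m in range(num_elements):
        delay = m * spacing_m * math.sin(theta) / sound_speed
        weights.append(cmath.exp(2j * math.pi * freq_hz * delay) / num_elements)
    return weights


def plane_wave_bin(num_elements, spacing_m, azimuth_deg, freq_hz, sound_speed=1500.0):
    """Unit-amplitude plane wave from the given azimuth, seen in one frequency bin."""
    theta = math.radians(azimuth_deg)
    return [
        cmath.exp(-2j * math.pi * freq_hz * m * spacing_m * math.sin(theta) / sound_speed)
        for m in range(num_elements)
    ]


def weighted_bin(element_spectra, weights):
    """Weighted sum over the array elements for one frequency bin."""
    return sum(w * x for w, x in zip(weights, element_spectra))


# 16 elements at half-wavelength spacing (1500 Hz, 0.5 m), source at 20 degrees.
x = plane_wave_bin(16, 0.5, 20.0, 1500.0)
on_target = abs(weighted_bin(x, steering_weights(16, 0.5, 20.0, 1500.0)))
off_target = abs(weighted_bin(x, steering_weights(16, 0.5, -30.0, 1500.0)))
```

Steering to the true azimuth passes the signal at full amplitude, while steering elsewhere attenuates it, which is the sense in which the weighting concentrates the signal on the target azimuth. Repeating this for every bin and applying an inverse Fourier transform would give the time-domain data.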
Since the signal duration of each beamforming is 1 s, in order to fully include the target echo signal, the integration duration is set to 3 s.
Finally, we apply the short-time Fourier transform (STFT) to the integrated target time-domain data to generate the target time-frequency image.
In Formula (6), g*(t − u) is the window function. The time-frequency image of the target echo signal is shown in Figure 3d. In the time-frequency image, the horizontal axis is time and the vertical axis is frequency. Through Formulas (1)-(6), we fuse the target position information into the time-frequency image, which contains the variation of the target echo signal in the frequency domain over time and can be used to distinguish different targets and identify the position of the target.
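To make the time-frequency step concrete, here is a minimal pure-Python STFT sketch. It assumes a Hann window and a direct DFT per frame; the window type, window length, and hop size are illustrative choices, not parameters reported in the paper.

```python
import cmath
import math


def stft_magnitude(signal, window_len, hop):
    """Magnitude STFT: one row per time frame, one column per frequency bin."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / window_len) for n in range(window_len)]
    frames = []
    for start in range(0, len(signal) - window_len + 1, hop):
        segment = [signal[start + n] * window[n] for n in range(window_len)]
        row = []
        for k in range(window_len // 2 + 1):
            acc = sum(
                segment[n] * cmath.exp(-2j * math.pi * k * n / window_len)
                for n in range(window_len)
            )
            row.append(abs(acc))
        frames.append(row)
    return frames


# 8 Hz tone sampled at 64 Hz; with a 32-sample window the tone falls in bin 4.
tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(128)]
image = stft_magnitude(tone, window_len=32, hop=16)
peak_bins = [max(range(len(row)), key=row.__getitem__) for row in image]
```

Transposing `image` gives the layout described in the text, with time on the horizontal axis and frequency on the vertical axis.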
Because few underwater acoustic target samples are available, two methods are used to increase the number of samples in the data set. The first is to offset the starting integration time t of the target time-domain data Y(t) by a small amount, and the second is to offset the azimuth θ to change the weighting matrix W. We use these two methods to obtain more samples.
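The two augmentation methods can be sketched as a grid of offsets. The offset values below are hypothetical; the paper only states that the offsets are small.

```python
def augmentation_grid(start_time_s, azimuth_deg, time_offsets_s, azimuth_offsets_deg):
    """All (start time, azimuth) pairs obtained by combining the two kinds of offset."""
    return [
        (start_time_s + dt, azimuth_deg + da)
        for dt in time_offsets_s
        for da in azimuth_offsets_deg
    ]


# One detection at t = 2.0 s, azimuth 30 degrees, expanded into 9 sample settings.
settings = augmentation_grid(2.0, 30.0, [-0.1, 0.0, 0.1], [-1.0, 0.0, 1.0])
```

Each pair would drive one pass of the weighting and STFT steps, producing one additional time-frequency image for the same target label.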
According to the target category, we label the time-frequency images of the target samples and store them in the data set. In the data set, all samples contain the target's echo data, distance, and azimuth information. The data set is used to train the deep convolutional neural network model.

The Structure of the DCNN Model
In this article, we input the target data set into a deep convolutional neural network (DCNN) for training. A basic CNN consists of three structures: convolution, activation, and pooling [23]. A DCNN is usually composed of several such structures connected in sequence and adjusted within each layer. The three key features of a CNN are local receptive fields, weight sharing, and downsampling, which effectively reduce the number of network parameters and alleviate over-fitting of the model [24]. Convolution is the most basic and most important layer. The convolution operation extracts the features of the image [25]; it can enhance certain features of the original signal and reduce noise. Pooling layers reduce the amount of data to be processed while retaining useful information, and the downsampling blurs the specific location of features [26]. Pooling layers are generally divided into mean pooling and maximum pooling. The advantages of a CNN are that the convolution kernels are shared, high-dimensional data can be processed without difficulty, features do not need to be selected manually, and, once the weights are trained, the feature classification effect is good. The disadvantages are the need to adjust parameters, the need for a large sample size, and the fact that training is best done on a GPU. According to the characteristics of the target time-frequency image, we design a DCNN model. Figure 4 shows the model structure.
There are five convolutional layers in total in the DCNN model we designed, followed by a fully connected layer and an output layer.
As shown in Formula (7), the activation function we use is the rectified linear unit (ReLU), which performs a nonlinear mapping on the output of the convolutional layer. It is characterized by fast convergence and a simple gradient, which helps prevent the gradient from vanishing.
$$f(x) = \max(0, x) \tag{7}$$
Figure 5 shows the flow of the input data. The data flows into the 'convo1' convolution layer through the input layer, and then into the pooling layer. In this layer, there are eight convolution kernels. The size of the input data is 4096 × 1024, the size of the convolution kernel is 6 × 6, and the padding is set to "same", so the output feature map has the same size as the input data and has eight channels. We use 'ReLU' as the activation function. The pooling kernel is 4 × 4, the step is 1, and the feature map generated after pooling is 1024 × 256 × 8, that is, 8 channels of size 1024 × 256.
The data flow of the 'convo2' convolutional layer of the DCNN model is shown in Figure 6. The 'convo2' convolution layer takes the feature map output by the 'convo1' convolution layer as input, with size 1024 × 256 × 8. The 'convo2' layer has 32 convolution kernels of size 5 × 5, and the output feature map is 1024 × 256 × 32. The activation function is again 'ReLU'. The pooling kernel is set to 4 × 4, the step size is 1, and the feature map generated after pooling is 256 × 64 × 32, that is, 32 channels of size 256 × 64.
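The feature-map sizes reported for the layers can be checked with a short shape trace. The sizes in the text shrink by a factor of 4 along each axis after every pooling layer, so the sketch assumes exactly that downsampling factor and assumes that the "same"-padded convolutions leave the spatial size unchanged. The helper names are hypothetical.

```python
def pooled_shape(height, width, factor=4):
    """Spatial size after pooling that downsamples each axis by `factor`."""
    if height % factor or width % factor:
        raise ValueError("size must be divisible by the pooling factor")
    return height // factor, width // factor


def trace_shapes(input_shape, num_stages, factor=4):
    """Spatial sizes at the input and after each of `num_stages` conv + pool stages."""
    shapes = [input_shape]
    for _ in range(num_stages):
        shapes.append(pooled_shape(*shapes[-1], factor=factor))
    return shapes


# Input 4096 x 1024, then the outputs of the first four conv + pool stages.
shapes = trace_shapes((4096, 1024), num_stages=4)
# shapes == [(4096, 1024), (1024, 256), (256, 64), (64, 16), (16, 4)]
```

The trace reproduces the 1024 × 256 and 256 × 64 sizes quoted for the first two stages and arrives at 16 × 4 after the fourth stage.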
As shown in Figures 7-9, the data from 'convo2' passes through the 'convo3' convolution layer and the 'convo4' convolution layer, and then enters the 'convo5' convolution layer. The 'convo5' convolutional layer takes a 16 × 4 × 64 feature map as input and uses 'ReLU' as the activation function, followed by convolution and pooling.
The fully connected layer is used to "flatten" the input, that is, to make the multidimensional input one-dimensional; flattening is the usual transition from the convolutional layers to the fully connected layer. The fully connected layer has 2048 neuron nodes and is connected to the 'convo5' convolutional layer. The output layer uses the 'Softmax' function as the classifier, with the number of nodes set according to the number of tracking targets. The outputs y_1, y_2, ..., y_n from the nodes in the previous layer are used as confidence levels to generate a new output. The activation function of each node is softmax,
$$\mathrm{softmax}(y_i) = \frac{e^{y_i}}{\sum_{j=1}^{n} e^{y_j}} \tag{8}$$
The output of the Softmax function lies between 0 and 1, and the output values sum to 1, that is,
$$\sum_{i=1}^{n} \mathrm{softmax}(y_i) = 1 \tag{9}$$
According to Formula (9), the probability that a certain target exists on the beam azimuth can be calculated. After the DCNN model is built, the target data set can be input into the model for training, and the trained model can be used to identify the target.
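The behaviour of the Softmax output layer can be sketched in a few lines. Subtracting the maximum score before exponentiating is a standard numerical-stability trick and does not change the result; the example scores are made up.

```python
import math


def softmax(scores):
    """Map raw node outputs to probabilities that lie in (0, 1) and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Three output nodes, one per candidate target class.
probabilities = softmax([2.0, 1.0, 0.1])
predicted_class = max(range(len(probabilities)), key=probabilities.__getitem__)
```

The index of the largest probability gives the target category assigned to the time-frequency image on that beam azimuth.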

Tracking Process
In this article, we simplify the sonar working environment to two-dimensional plane observation. In the early stage of tracking, we obtain the target's motion state including distance r and azimuth θ through initial measurement, convert the target's information from polar coordinate form (r, θ) to Cartesian coordinates (x, y), and establish the target state equation and motion equation.
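A minimal sketch of the coordinate conversion follows. It assumes the azimuth θ is measured counter-clockwise from the x axis; the paper does not state its angle convention, so a different reference direction would swap or negate the components.

```python
import math


def polar_to_cartesian(distance, azimuth_deg):
    """Convert (r, theta) to (x, y), with theta measured from the x axis (assumed)."""
    theta = math.radians(azimuth_deg)
    return distance * math.cos(theta), distance * math.sin(theta)


def cartesian_to_polar(x, y):
    """Inverse conversion, returning (r, theta in degrees)."""
    return math.hypot(x, y), math.degrees(math.atan2(y, x))


x, y = polar_to_cartesian(1000.0, 30.0)
r_back, theta_back = cartesian_to_polar(x, y)
```

The inverse conversion is what turns a filtered Cartesian state back into the distance and azimuth used to choose the next echo window.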
In Formula (10), X_k is the target motion state matrix in period k, F is the state transition matrix, and G is the noise driving matrix. w_k and ϑ_k are uncorrelated zero-mean white noise processes with variance matrices Q and R, respectively; w_k is the input noise and ϑ_k is the observation noise. H is the observation matrix, and Z_k is the corresponding observation signal matrix {Z_1, Z_2, ..., Z_k}.
According to the existing data, the velocity components v_x and v_y of the target in the x and y directions are calculated. The position (x, y) and velocity (v_x, v_y) of the target are then expressed together in matrix form.
In the follow-up tracking process, the trained model is used to identify the azimuth-weighted time-frequency image of the echo signal. To reduce the amount of calculation, Kalman filtering is used to predict the motion state of the target. As shown in Formula (13), the minimum-variance estimate X̂_k obtained from this observation is used to predict the motion state X_pre of the target in the next cycle.
Formula (14) gives the prediction covariance matrix; P_k quantifies the quality of the prediction.
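As a sketch of the time update, the code below implements the standard Kalman prediction step for one axis of a constant-velocity model. The constant-velocity form of F and G, the per-axis decoupling, and the scalar acceleration variance `accel_var` are assumptions for illustration; the paper's own matrices are not reproduced in the text.

```python
def predict(state, cov, period, accel_var):
    """Kalman time update for state (position, velocity) on one axis.

    Uses F = [[1, T], [0, 1]] and G = [T^2 / 2, T], so that
    X_pre = F X and P_pre = F P F^T + G Q G^T with scalar Q = accel_var.
    """
    p, v = state
    (a, b), (c, d) = cov
    T = period
    state_pre = (p + T * v, v)
    cov_pre = (
        (a + T * (b + c) + T * T * d + accel_var * T ** 4 / 4,
         b + T * d + accel_var * T ** 3 / 2),
        (c + T * d + accel_var * T ** 3 / 2,
         d + accel_var * T * T),
    )
    return state_pre, cov_pre


# Target at x = 100 m moving at 2 m/s, predicted one 2 s cycle ahead, no process noise.
state_pre, cov_pre = predict((100.0, 2.0), ((1.0, 0.0), (0.0, 1.0)), period=2.0, accel_var=0.0)
```

Running the same function on the y axis gives the second coordinate, and the predicted (x, y) converts back to the arrival time and azimuth used to select the echo window.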
Formulas (13) and (14) describe the time update process of Kalman filtering. According to the predicted target motion state, the arrival time t of the target echo and the azimuth angle θ of the target are calculated. We then intercept the echo signal matrix near the predicted time, (t ± 1) s, centered on the predicted azimuth, (θ ± 5)°, and generate the time-frequency images of the signal to be detected for these azimuths and times according to Formulas (1)-(6) in Section 2.1. Part of the time-frequency images of the signal to be identified is shown in Figure 10.
We use the trained model to recognize the time-frequency image of the signal to be detected, obtain the distance and azimuth (r_k, θ_k) of the target in this tracking period, and convert them to Cartesian coordinates (x_k, y_k). The Kalman filter is then used to correct the deviation of the initial motion state Z_{k+1} of the target obtained in this observation: the filter gain is calculated first, then the filtered target state, and finally the covariance matrix is updated for the calculation of the next cycle. According to Formulas (15)-(17), the Kalman filter predicts the azimuth and distance of the target during tracking and corrects the deviation to obtain more accurate target state information. Finally, the target state information obtained in each cycle is updated in the tracking system to complete the tracking.
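The correction described in the tracking process (gain, filtered state, covariance update) can be sketched for one axis as follows. This is the standard Kalman measurement update under the assumption that only position is observed, so H = [1, 0] and the measurement variance is a scalar; the paper's own Formulas (15)-(17) are not reproduced here.

```python
def update(state_pre, cov_pre, measurement, meas_var):
    """Kalman measurement update for state (position, velocity) with a position measurement."""
    p, v = state_pre
    (a, b), (c, d) = cov_pre
    innovation_var = a + meas_var           # H P_pre H^T + R
    k0, k1 = a / innovation_var, c / innovation_var   # gain K = P_pre H^T / S
    innovation = measurement - p            # Z - H X_pre
    state = (p + k0 * innovation, v + k1 * innovation)
    cov = (
        ((1 - k0) * a, (1 - k0) * b),       # (I - K H) P_pre, first row
        (c - k1 * a, d - k1 * b),           # second row
    )
    return state, cov


# Predicted x = 104 m at 2 m/s; the model-derived measurement says 106 m.
state, cov = update((104.0, 2.0), ((5.0, 2.0), (2.0, 1.0)), measurement=106.0, meas_var=5.0)
```

With equal prediction and measurement variance in position, the filtered position lands halfway between the prediction and the measurement, and the velocity estimate is nudged upward.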

Setting of Simulation Signal
We use the bright spot echo model [27] to construct the echo signal of the active sonar, and preprocess the echo signal to generate a data set of the simulated target. The bright spot echo model is often used in the simulation of active sonar echo signals [28], which saves time and improves efficiency in technical verification.
The transmitted signal is reflected when it encounters the target. According to target bright spot model theory, at high frequencies the echo of a complex target is formed by the superposition of several wavelets. Each wavelet can be regarded as a wave emitted from a certain scattering point, called a bright spot, which can be a real bright spot or an equivalent bright spot. The echo signal of a single bright spot target can be expressed as

$$H(\vec{r}, \omega) = A(\vec{r}, \omega)\, e^{j(\omega\tau + \phi)} \tag{18}$$

In Formula (18), $A(\vec{r}, \omega)$ is the target scattering intensity factor, which depends on frequency; for narrowband signals, its value at the center frequency can be used. $\tau$ is the delay factor, determined by the sound path $\xi$ of the equivalent echo center point relative to a reference point: $\tau = 2\xi/c$, where $c$ is the sound speed. $\phi$ is the phase factor, that is, the phase jump generated when the echo is formed. The echo of a complex underwater target can be regarded as the superposition of several independent bright spot signals. When the target has N bright spots and the LFM signal is used as the transmitted signal, the total target echo signal S(t) after superposition is given by Formula (19). The echo of the simulated target is thus determined by a set of three parameters per bright spot, $A_n$, $\tau_n$, $\phi_n$ (n = 1, ..., N). In addition to the target echo and environmental noise, the receiving array of the active sonar also receives the waves scattered from the transmitted signal by random scatterers in the ocean, as well as seabed reverberation. In the simulation of this article, we ignored these effects.
Therefore, the time-domain form of the target echo signal received by a single array element is given by Formula (20), and the signal matrix received by an array of M elements is given by Formula (21). In the simulation experiment, we set up a simulated underwater acoustic environment with Gaussian white noise as the background noise of the marine environment. Active sonar can only work normally and recognize the target when the difference between the received signal level and the background interference level is greater than or equal to the detection threshold of the device. In this article, the active sonar transmitting transducer and the receiving array are placed at the same location, and the environmental noise is isotropic background interference. The SNR of the active sonar received signal is given by Formula (22), where SL is the source level of the transmitter, TL is the transmission loss from the transmitter to the target, TS is the target strength, DI is the receiving directivity index of the receiving array, DT is the detection threshold of the sonar processing device, and NL is the level of the environmental noise, which is the background interference, within the working bandwidth of the device.
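As an illustration of the superposition just described, the sketch below sums delayed, scaled, and phase-shifted copies of a transmitted pulse, one per bright spot. It assumes the common form in which each bright spot contributes $A_n e^{j\phi_n}$ times the pulse delayed by $\tau_n$; the names `bright_spot_echo`, `two_way_delay`, and `unit_pulse`, the nominal sound speed of 1500 m/s, and the rectangular placeholder pulse are all assumptions made for the example.

```python
import cmath
import math

def two_way_delay(path_m, sound_speed=1500.0):
    """Delay tau = 2 * xi / c for a sound path xi (assumed c = 1500 m/s)."""
    return 2.0 * path_m / sound_speed

def bright_spot_echo(t, spots, pulse):
    """Echo at time t from bright spots given as (A_n, tau_n, phi_n) tuples.

    `pulse` is any callable returning the transmitted waveform at a time.
    """
    return sum(a * cmath.exp(1j * phi) * pulse(t - tau)
               for a, tau, phi in spots)

def unit_pulse(t, width=1.0):
    """Placeholder transmit pulse: 1 inside [-width/2, width/2], else 0."""
    return 1.0 if abs(t) <= width / 2 else 0.0

# Two hypothetical bright spots 2 ms apart with opposite phase jumps.
tau0 = two_way_delay(3000.0)
spots = [(1.0, tau0, 0.0), (0.5, tau0 + 0.002, math.pi)]
echo_at_peak = bright_spot_echo(tau0, spots, unit_pulse)
```

Replacing `unit_pulse` with an LFM waveform gives the kind of multi-highlight echo used to distinguish targets: different sets of $(A_n, \tau_n, \phi_n)$ produce different interference patterns in the time-frequency image.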
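The role of each sonar parameter can be made concrete with a small calculation. The sketch below assumes that the SNR follows the standard noise-limited active sonar equation, SL − 2TL + TS − (NL − DI), with all quantities in dB; this is an assumption about the form of Formula (22), and the function names and numerical values are purely illustrative.

```python
def echo_snr_db(sl, tl, ts, nl, di):
    """Noise-limited active sonar SNR in dB: SL - 2*TL + TS - (NL - DI).

    The transmission loss is counted twice (out to the target and back).
    """
    return sl - 2.0 * tl + ts - (nl - di)

def is_detectable(sl, tl, ts, nl, di, dt):
    """The target is detectable when the SNR reaches the detection threshold."""
    return echo_snr_db(sl, tl, ts, nl, di) >= dt

# Illustrative values in dB; none are taken from the paper.
near = is_detectable(sl=220.0, tl=80.0, ts=10.0, nl=70.0, di=15.0, dt=10.0)
far = is_detectable(sl=220.0, tl=85.0, ts=10.0, nl=70.0, di=15.0, dt=10.0)
```

Because TL enters twice, a 5 dB increase in one-way transmission loss lowers the echo SNR by 10 dB, which is why a receding target drops below the detection threshold quickly.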
We use a linear frequency modulation (LFM) signal as the transmitted signal of the active sonar. The LFM signal not only improves the anti-interference ability and target recognition efficiency, making underwater target detection more effective, but the time-frequency image of its echo is also well suited to convolutional neural network training and recognition. The time function of the LFM signal is given by Formula (23), where $A$ is the amplitude of the LFM signal, $f_0$ is the center frequency, $k$ is the frequency change rate of the signal, and $t \in [-T/2, T/2]$, with $T$ the pulse width of the signal.
We set the parameters of the transmitted LFM signal to $f_0$ = 2500 Hz and $k$ = 2500 Hz/s, and the duration of the transmitted signal is 1 s. The time-domain waveform and frequency-domain waveform of the transmitted signal are shown in Figure 11. After transmitting the signal, the active sonar waits for 1 s before receiving, and the receiving time is 20 s. Waiting for 1 s reduces the effect of reverberation.
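The transmitted pulse with these parameters can be generated as follows. This is a sketch under stated assumptions: it uses the common complex-exponential LFM form with phase $2\pi(f_0 t + k t^2/2)$ on $t \in [-T/2, T/2]$, and a sampling rate of 16 kHz chosen only for the example; the function names are hypothetical.

```python
import cmath
import math

def lfm(t, amplitude=1.0, f0=2500.0, k=2500.0, pulse_width=1.0):
    """Complex LFM sample at time t (s); zero outside [-T/2, T/2]."""
    if abs(t) > pulse_width / 2:
        return 0j
    phase = 2.0 * math.pi * (f0 * t + 0.5 * k * t * t)
    return amplitude * cmath.exp(1j * phase)

def instantaneous_frequency(t, f0=2500.0, k=2500.0):
    """Derivative of the phase divided by 2*pi: f0 + k*t (Hz)."""
    return f0 + k * t

fs = 16000.0                      # assumed sampling rate, Hz
n_samples = int(fs * 1.0)         # 1 s pulse
samples = [lfm(-0.5 + i / fs) for i in range(n_samples)]
sweep = (instantaneous_frequency(-0.5), instantaneous_frequency(0.5))
```

Under this form, a center frequency of 2500 Hz and a rate of 2500 Hz/s over 1 s correspond to a sweep from 1250 Hz to 3750 Hz.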

Simulation Target Data Set Generation and Model Training
We implemented the model in the TensorFlow 2.0 framework. A neural network model built with TensorFlow 2.0 can be deployed across platforms and is more flexible than one built with TensorFlow 1.0. TensorFlow 2.0 places certain requirements on the hardware configuration. We train the model on GPUs, using a small laboratory server equipped with two Tesla T4 graphics cards and 2 × 15 GB of GPU memory.
In the simulation, following the data set generation method given in Section 2.1, we generated a data set of moving targets with different echo characteristics. The data set contains 1800 samples for each target and 1500 samples of ocean background noise, so the total number of samples is (1800 × num + 1500), where num is the number of tracked targets. Some sample images in the data set are shown in Figure 12. The order of the samples is randomly shuffled.
The data set is then input to the DCNN model designed in Section 2.2 for training, and the accuracy of the model is tested. During training, 60% of the data set is used as training samples and 40% as test samples. The main training parameters are as follows: the batch size is 40, the optimizer is "Adam", and the loss function is the multiclass loss function "categorical_crossentropy". The accuracy and loss of the DCNN model after 50, 100, and 200 training rounds are shown in Figures 13-15.
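The data set bookkeeping described above (sample counts, shuffling, the 60/40 split, and the batch size of 40) can be checked with a short sketch. The label encoding (0 for background noise, 1..num for targets), the fixed random seed, and the function names are assumptions made for illustration; the images themselves are replaced by their labels.

```python
import math
import random

def build_labels(num_targets, per_target=1800, noise=1500):
    """One label per sample: 1..num_targets for targets, 0 for noise."""
    labels = []
    for target_id in range(1, num_targets + 1):
        labels += [target_id] * per_target
    labels += [0] * noise          # assumed encoding for background noise
    return labels

def shuffle_and_split(items, train_fraction=0.6, seed=0):
    """Shuffle a copy of the items, then split into train and test parts."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

labels = build_labels(num_targets=2)
train, test = shuffle_and_split(labels)
batches_per_pass = math.ceil(len(train) / 40)   # batch size 40
```

For two targets this gives 1800 × 2 + 1500 = 5100 samples, of which 3060 are used for training and 2040 for testing, so one pass over the training set takes 77 batches.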
In Figure 13, after the DCNN model has been trained 50 times, the accuracy rises to about 0.95 and the loss decreases to about 0.26.
In Figure 14, after the DCNN model has been trained 100 times, the accuracy gradually increases to 0.97 and the loss decreases to 0.18.
In Figure 15, after the DCNN model has been trained 200 times, the accuracy increases to 0.986 and the loss decreases to 0.06. Training and testing show that the DCNN model designed in this paper learns the characteristics of the target echo signal efficiently and achieves high recognition accuracy.

Simulation
In the simulation verification, the DCNN-KFAT method proposed in this paper is tested in two cases: (a) tracking after the azimuths of two moving targets cross; (b) tracking a weak target. The test results of the new method are compared with those of the KFAT method.
We first test the ability of the proposed method to distinguish different targets and to associate target data during active tracking, using a simulation environment containing two moving targets. The two targets have different bright spot echo models, and the time-frequency image of the echo signal reflects the characteristic information of each target. Figure 16 shows the simulation results of the KFAT method and the DCNN-KFAT method; the coordinate axes are in km. In the Cartesian coordinate system, target 1 starts near (−4.5 km, 5 km) and moves away from the active sonar, while target 2 starts near (−2 km, 11 km) and moves toward the active sonar. Figure 16a,b show the simulation results of the KFAT method. We plot the simulated trajectory of each target, the observation results, and the Kalman-filtered results in Cartesian coordinates. The KFAT method loses a target after the azimuths overlap. Figure 16c,d show the results of the DCNN-KFAT method. The tracking result is more accurate than that of the KFAT method: after the azimuths overlap, the two targets are successfully identified and continue to be tracked.
For the two targets, the observation deviation of the DCNN-KFAT method is about 150~300 m and is stable at about 200 m. The deviation of target 1 increases to 300 m in the 400~500 s time period. Comparing the target positions shows that in this period the two targets are very close: their azimuth angles are nearly equal and their distances from the receiving array are almost the same. When the targets are this close, their active sonar echoes are similar in the frequency domain, so the deviation is likely to increase.
To test the DCNN-KFAT method in tracking low-SNR signals, we set up a further simulated moving target. Figure 18 shows the results of the simulation test; the coordinate axes represent distance in km. The red solid line in Figure 18 represents the set trajectory of the simulated target, which moves away from the active sonar. Figure 18a shows the observation results and filtered results using the KFAT method. As the target moves, the intensity of the target echo signal received by the array gradually decreases; when the echo signal intensity falls below the detection threshold, the active sonar cannot continue to track the moving target.
Figure 18b shows the observation results and filtered results using the DCNN-KFAT method. After the KFAT method loses the target, the DCNN-KFAT method can still identify the distance and azimuth of the tracked target from the echo signal and continue tracking it. In the early stage of tracking, the deviation of the KFAT method is relatively large, while that of the DCNN-KFAT method is relatively small. We analyzed the observation deviation of the two methods in detail; the results are shown in Figure 19.
Figure 19. The deviations of KFAT and DCNN-KFAT before and after filtering are listed separately, and the deviation of DCNN-KFAT is significantly smaller than that of KFAT, with an improvement of 500~700 m.
Figure 19a compares the tracking deviation of the KFAT method before and after filtering. The observation deviation of the KFAT method ranges from about 100 to 1000 m. In the initial tracking stage, the deviation is large, between 800 and 1000 m. After 300 s, that is, after 15 transmission cycles, the minimum observed deviation is around 100 m and the maximum is around 900 m. The deviation range of the Kalman filtering result is reduced to within 700 m, and as the tracking continues, it is gradually reduced to within 400 m. Figure 19b compares the deviation of the DCNN-KFAT method before and after filtering. The observation deviation is stable within 200 m, with slight fluctuations during tracking. Figure 19c compares the observation deviation of the two methods. The deviation of the KFAT method fluctuates sharply between 100 and 1000 m, whereas the observation deviation of the DCNN-KFAT method is stable at about 200 m, so the DCNN-KFAT method is clearly more stable. Analysis of the data shows that the observation deviation is mainly caused by estimation errors in the echo arrival time and the azimuth angle, and that the deviation caused by the azimuth angle increases with distance. Compared with the KFAT method, the DCNN-KFAT method greatly reduces the observation deviation and improves the accuracy of the original data. Figure 19d compares the deviations of the two methods after filtering. The deviation of the KFAT method after filtering fluctuates over a larger range, gradually narrowing from 100~1000 m to 150~700 m. The deviation of the DCNN-KFAT method after filtering remains stable at about 200 m.
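The two error sources named above can be related to position error with simple geometry. The sketch below is an illustration under stated assumptions, not the analysis performed in this paper: it takes a nominal sound speed of 1500 m/s, treats the range error as half the arrival-time error times the sound speed (two-way travel), and treats the cross-range error as the chord between two bearings at the same range. All function names and numbers are hypothetical.

```python
import math

def range_error_m(arrival_time_error_s, sound_speed=1500.0):
    """Range error from an arrival-time error, for two-way propagation."""
    return sound_speed * arrival_time_error_s / 2.0

def cross_range_error_m(range_m, azimuth_error_deg):
    """Chord length between two bearings separated by the azimuth error."""
    half_angle = math.radians(azimuth_error_deg) / 2.0
    return 2.0 * range_m * math.sin(half_angle)

def position_deviation_m(est_xy, true_xy):
    """Euclidean distance between estimated and true positions."""
    return math.hypot(est_xy[0] - true_xy[0], est_xy[1] - true_xy[1])

# Illustrative values: 0.2 s timing error, 1 degree azimuth error.
d_range = range_error_m(0.2)
d_cross_5km = cross_range_error_m(5000.0, 1.0)
d_cross_10km = cross_range_error_m(10000.0, 1.0)
```

For a fixed azimuth error the cross-range term doubles when the range doubles, while the timing term does not depend on range, which matches the observation that the azimuth-induced deviation grows with distance.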
It can be seen from the above simulation test results that the DCNN-KFAT method proposed in this paper is not only more accurate than the KFAT method in terms of target determination and target data association, but also greatly reduces the observation bias and improves the tracking accuracy.

Verification of Real World Signal
We further tested the performance of the DCNN-KFAT method using prerecorded real-world active sonar signals from a sea trial in May 2020. We extracted the part of the data containing two targets whose azimuths overlapped during their motion. The active sonar transmits LFM signals over the frequency band f = 320~400 Hz, and the duration of the transmitted signal is 10 s. The echo signal is received by a vertical array of 20 elements. The time-domain and frequency-domain waveforms of the received target echo signal are shown in Figure 20. As can be seen from Figure 20, in practical applications it is difficult to find the target quickly from the time domain or frequency domain alone, because the high energy of the low-frequency components in the ambient noise can mask the target echo signal. We used the method proposed in Section 2.1 to generate the target echo signal data set. Some of the data in the data set are shown in Figure 21.
We processed the data using the DCNN-KFAT method to obtain tracking results and compared them with the KFAT method. The results are shown in Figure 22. The actual motion trajectories of the two moving targets recorded by GPS are shown as black lines.
In Figure 22, for two moving targets with overlapping azimuths, the KFAT method loses one of the targets after the overlap, while the DCNN-KFAT method continues to track both targets without target loss.
In Figure 23, the DCNN-KFAT method is more accurate than the KFAT method, with deviations in the range of 100~200 m, which is within the range expected from the simulation. The DCNN-KFAT method proposed in this paper is thus further validated by processing and analyzing the prerecorded real-world signals. The new method continues tracking accurately after two moving targets overlap in azimuth, and its tracking deviation is smaller than that of the KFAT method.

Discussion
In active sonar tracking of underwater acoustic targets, the initial measurement information, data association, and target judgment all affect the tracking performance. In this article, we use a DCNN to improve the performance of active tracking of underwater acoustic targets, in particular to solve the problem of correctly associating target data after the target azimuths overlap. The samples in the data set are azimuth-weighted time-frequency images, which contain the target's echo feature information and azimuth information. Therefore, when tracking a target, the observation results given by the DCNN-KFAT method include the target's category and azimuth angle, which allows the data to be associated accurately.
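Association by category can be sketched as follows. This is an illustrative reading of the idea rather than the code used in this paper: each detection is assumed to carry a class label together with its distance and azimuth, label 0 is assumed to denote background noise, and each track is keyed by its target label. The function name and all values are hypothetical.

```python
def associate_by_label(tracks, detections, background_label=0):
    """Append each detection to the track that has the same class label.

    tracks: dict mapping label -> list of (range_m, azimuth_deg)
    detections: list of (label, range_m, azimuth_deg)
    Returns detections whose label has no existing track.
    """
    unmatched = []
    for label, range_m, azimuth_deg in detections:
        if label == background_label:
            continue                      # noise-only image, nothing to track
        if label in tracks:
            tracks[label].append((range_m, azimuth_deg))
        else:
            unmatched.append((label, range_m, azimuth_deg))
    return unmatched

# Two tracks whose azimuths coincide in the current cycle.
tracks = {1: [(6700.0, 132.0)], 2: [(11200.0, 100.0)]}
detections = [(2, 9000.0, 120.0), (1, 7400.0, 120.0), (0, 0.0, 0.0)]
unmatched = associate_by_label(tracks, detections)
```

Because the match is made on the label rather than on the nearest azimuth, two detections at the same bearing are still assigned to the correct tracks, which is the situation in which azimuth-only association loses a target.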
We simulated the moving targets using the bright spot echo model of active sonar, performed simulation experiments, and validated the results using prerecorded data. We found that the KFAT method, which uses matched filtering and beamforming and determines the time at which the target appears through threshold detection, has a relatively large deviation range. The DCNN-KFAT method instead uses the time-frequency image of the target and judges according to the energy in the frequency band of the target echo signal, so the target arrival time can be calculated more accurately. Due to the limitations of experimental conditions, verification was limited to simulation and prerecorded data, and we did not conduct dedicated sea trials of the tracking method. In the future, we hope to test in an actual marine environment to verify the performance of the DCNN-KFAT method in practical applications.

Conclusions
In this article, we generate echo signals of different underwater acoustic targets based on the bright spot echo model, and generate a data set of simulated target echo signals through weighting processing. The samples in the data set contain echo signal characteristics as well as target azimuth and distance information. We then built a DCNN model to learn the echo signals of underwater acoustic targets. We trained and tested the model with a data set of simulated signals, and the results showed that the accuracy of the model is high enough for active tracking. Finally, we validated the proposed DCNN-KFAT method with simulations and prerecorded sea trial data. The simulation results show that the method significantly improves active tracking and distinguishes different targets with similar echoes more accurately. Its data association and target judgment are simpler and more accurate than those of the KFAT method. The data recognized by the DCNN-KFAT method include the target category, target azimuth, and target distance, so the target data can be associated very accurately during data association and target determination. The method solves the problem that KFAT loses a target after the azimuths of two targets overlap, and it significantly improves tracking accuracy and range.
The research results of this paper can be used for active tracking of underwater acoustic targets and for building target data sets for deep learning training and recognition. The DCNN-KFAT method improves the range and accuracy of tracking and addresses the data association problem in underwater acoustic target tracking, which helps reduce target loss in engineering applications. The next step will be to test in a real marine environment to verify the performance of the proposed DCNN-KFAT method in practical applications.