Article

An Auditory Convolutional Neural Network for Underwater Acoustic Target Timbre Feature Extraction and Recognition

1 China Ship Research and Development Academy, Beijing 100101, China
2 College of Shipbuilding Engineering, Harbin Engineering University, Harbin 150001, China
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(16), 3074; https://doi.org/10.3390/rs16163074
Submission received: 13 June 2024 / Revised: 30 July 2024 / Accepted: 18 August 2024 / Published: 21 August 2024
(This article belongs to the Topic AI and Data-Driven Advancements in Industry 4.0)

Abstract

In order to extract the line-spectrum features of underwater acoustic targets in complex environments, an auditory convolutional neural network (ACNN) with the ability of frequency component perception, timbre perception and critical information perception is proposed in this paper, inspired by the human auditory perception mechanism. The model first uses a gammatone filter bank that mimics the excitation response of the cochlear basilar membrane to decompose the input time-domain signal into a number of sub-bands, which guides the network to perceive the line-spectrum frequency information of the underwater acoustic target. A sequence of convolution layers is then used to filter out interfering noise and enhance the line-spectrum components of each sub-band by simulating the process of calculating energy distribution features. An improved channel attention module is then connected to select the line spectra that are most critical for recognition; in this module, a new global pooling method is proposed and applied to better extract the intrinsic properties of the signal. Finally, the sub-band information is fused using a combination layer and a single-channel convolution layer to generate a vector with the same dimensions as the input signal at the output layer. A decision module with a Softmax classifier is added behind the auditory neural network and used to recognize the five classes of vessel targets in the ShipsEar dataset, achieving a recognition accuracy of 99.8%, an improvement of 2.7% over the previously proposed DRACNN method and improvements of varying degrees over the other eight compared methods. The visualization results show that the model significantly suppresses the interfering noise intensity and selectively enhances the line-spectrum energy of the radiated noise of underwater acoustic targets.

1. Introduction

Underwater acoustic target recognition (UATR) is a research hotspot in the field of passive sonar and an internationally recognized technical difficulty in underwater acoustic signal processing, with significant demand in harbor and waterway monitoring, among other applications. Affected by sound scattering, ocean background noise, and the complex underwater environment, UATR technology faces great challenges [1]. In recent years, with the rapid development of artificial intelligence technology, deep learning has gradually been replacing traditional classification methods and is widely used in UATR with remarkable achievements.
Compared with other deep learning models such as deep neural networks, recurrent neural networks, and deep belief networks, the convolutional neural network (CNN) is the most widely used in UATR due to its powerful deep feature extraction capability [2,3,4]. Researchers usually transform the time-domain signals of underwater acoustic target-radiated noise into time-frequency maps, such as the LOFAR spectrum [5], DEMON spectrum [6], and Mel spectrum [7], and then follow computer vision practice to achieve UATR using two-dimensional (2d) CNNs such as MobileNet [8], ResNet [9], VGG [10], DenseNet [11], etc. In order to suppress background noise interference and extract the critical features of the target, attention mechanisms, including the channel attention mechanism, spatial attention mechanism, and self-attention mechanism, are added to the CNN model to select the deep features adaptively according to different weights [12,13,14], which effectively improves the accuracy of target recognition. In order to reduce the number of model parameters and the amount of computation, some scholars have improved the classical models structurally by introducing advanced structures such as depthwise-separable convolutional layers and global average pooling layers, giving rise to many new deep learning models for UATR [15]. However, very few samples are currently available for UATR model training, which leads to serious model over-fitting. To address this problem, a generative adversarial network (GAN) is used for underwater acoustic target data augmentation prior to UATR, which utilizes the mutual confrontation between the generator and the discriminator to generate a large number of new samples similar to the real data, thus enlarging the size of the underwater acoustic target sample set [16,17,18].
The process of converting time-domain signals into time-frequency maps loses part of the feature information, which may reduce the accuracy of target recognition. For this reason, some end-to-end recognition methods based on time-domain signals and 1d CNN models have been proposed and proved to be effective [19,20,21]. Compared with the 2d CNN, the 1d CNN usually has fewer model parameters and offers low computational cost and high stability. In addition, the 1d CNN takes input data of shorter duration, which means that more samples can be generated from the same dataset and therefore the model can be trained more adequately [22], resulting in better UATR performance. Furthermore, a flexible combination of convolutional kernels of different sizes can be used to extract multi-scale features of underwater acoustic signals and enhance the representation of target characteristics. Because human knowledge of the inherent properties of underwater acoustic targets is still insufficient, researchers are overly concerned with accuracy improvement but pay little attention to the interpretability of the features extracted by deep learning, which leads to poor model generalization and low recognition accuracy in complex environments [23]. It is no exaggeration to say that the recognition performance of all current deep learning models is far from that of sonarmen.
Neuroscience researchers have found that the human ear is very sensitive to different sounds and exhibits a masking effect that allows humans to distinguish acoustic targets through slight differences and to perceive critical information about specific timbres in noisy environments [24]. This phenomenon provides inspiration for using deep learning to mimic how sonarmen process hydroacoustic signals. In this paper, the characteristics of underwater acoustic targets are analyzed based on LOFAR spectra of real data, and the specific sources of line spectra in different frequency bands are elucidated. The timbre feature extraction mechanisms of the human auditory system are summarized, and an end-to-end deep learning model is proposed for UATR: the Auditory Convolutional Neural Network (ACNN). It realizes line-spectrum frequency perception, line-spectrum energy enlargement, critical line-spectrum selection, and line-spectrum feature fusion, and thus realizes timbre feature extraction from audio signals. Secondly, a new method of constructing the channel attention mechanism is proposed in the design of the network model, which globally pools the feature map using the sub-band energy based on the signal characteristics of underwater acoustic targets and has a clearer physical meaning compared to global average pooling (GAP) and global maximum pooling (GMP). Finally, simulating the information processing and decision-making functions of the brain, a one-dimensional deep convolutional network is used to extract deep abstract features of different targets and predict the categories of underwater acoustic targets.
This paper is organized as follows: the line-spectrum characteristics of underwater acoustic targets are explained in Section 2. Section 3 introduces the human auditory mechanism, gives a detailed introduction to the structure and design ideas of the ACNN model, and provides a new UATR method based on the ACNN. The experiment details, including the experimental dataset, experimental process, and experimental results, are presented in Section 4. An overall discussion is concluded in Section 5.

2. Line Spectra of Underwater Acoustic Targets

Underwater acoustic target-radiated noise is a strongly periodic, non-stationary signal consisting of mechanical noise, propeller noise, and hydrodynamic noise, whose power spectrum is composed of line spectra and continuous spectra. Line spectra are frequency components with higher energy than the continuous spectrum and a stable amplitude over a period of time, and they are mainly generated by mechanical operation (both reciprocating and rotating) and structural vibration. The frequency of the line spectra is related to the vibration frequency or operating speed. Time-frequency line-spectra diagrams of four typical underwater acoustic targets are shown in Figure 1.
The time-frequency diagram of the small boat is shown in Figure 1a. This boat has a four-cylinder diesel engine running at 2500 revolutions per minute (rpm) when the radiated noise was measured, which means that each cylinder fires 20.8 times per second (one firing every two revolutions for a four-stroke engine), so there are four strong line spectra at 20.8 Hz, 41.6 Hz, 62.4 Hz, and 83.2 Hz. It is easy to see that the fundamental frequency value and the number of line spectra correspond to the rotational speed of the engine and the number of its cylinders, respectively, which are essential properties of underwater acoustic targets. Since the engine noise intensity is unsteady over long periods of time, there are several harmonic line spectra in the time-frequency diagram with a fundamental frequency of 20.8 Hz and an order greater than four, but the maximum frequency of line spectra generated by a diesel engine is generally not greater than 300 Hz.
The time-frequency diagram of the test vessel in a stationary state with only auxiliary machinery operating is shown in Figure 1b. It can be seen that there is a very strong line spectrum at 50 Hz, which has the same frequency as the current output from the diesel generator. In addition, pumps and other auxiliary machines produce a very large number of line spectra below 500 Hz; these components are easily masked by the main engine noise during ship navigation but are very significant in the radiated noise of non-diesel-powered ships.
The time-frequency diagram of the fishing vessel with a shaft system failure is shown in Figure 1c. The bearing periodically collides and rubs against the base during rotation, thus generating rhythmic radiated noise with an intensive line-spectrum cluster, normally in the range of 300 to 1000 Hz; for this vessel the center frequency of the cluster is 560 Hz. The frequency difference between any two adjacent line spectra in the cluster is approximately constant and equal to the propeller shaft frequency. This is a frequency-accompanying phenomenon resulting from the coupling between the line spectra reflecting the vibration frequency of the ship structure and the line spectra reflecting the rotational speed of the shaft, and it is a common characteristic of civilian ships.
The time-frequency diagram of the motor boat-radiated noise when its propeller rotates at a high speed is shown in Figure 1d. As the propeller rapidly cuts through the water, its blades resonate and produce radiated noise with several high-frequency line spectra; the frequencies of these line spectra are generally above 1000 Hz, with the highest frequency reaching up to 8000 Hz. This motor boat has line spectra at 6500 Hz and 7500 Hz, which are typical features that clearly distinguish it from the other classes of targets.

3. ACNN

3.1. Auditory Mechanisms

Listeners usually distinguish different sounds containing similar frequency components and the same loudness by their timbre, which can be simply represented by the structure of the energy distribution of a sound over frequency [25]. For underwater acoustic targets, mechanical noise, propeller noise, bearing noise, and structural vibration noise each have characteristic line spectra. Different targets have similar line-spectrum frequency bands, but their frequency energy distribution structures differ, making timbre an important feature for distinguishing them. Sonarmen recognize different underwater acoustic targets mainly by sensing their timbre features.
The human auditory system is a multilevel complex system with the ability to perceive sound information, mainly composed of the cochlea, auditory midbrain, auditory thalamus and auditory cortex. When the auditory system perceives sound, the cochlear basilar membrane first decomposes the sound into different frequency components based on a specific excitation response; then neurons in the primary auditory cortex that correspond to these frequencies are activated by the complex frequency components contained in the sound, and these neurons have specific sensitivities and are connected with each other to form a frequency topology [26]. After that, the different frequency components from the activation pattern of the primary auditory cortex are integrated and processed by the secondary auditory cortex to obtain the timbre information of the sound. When people focus on hearing a certain timbre in a noisy environment, in accordance with the masking effect, the primary auditory cortex allows all frequency information to pass through, but in the secondary auditory cortex only the neurons related to this timbre can be activated [27]. In addition, as the sound intensity changes over time, the activation pattern of auditory cortical neurons also changes with the change of the signal's timbre energy [28]; that is, the auditory cortex is more sensitive to transient and abrupt audio signals. Finally, the higher cerebral cortex extracts deep features from all the timbre information integrated by the auditory cortex for final recognition.
According to the above research results in the field of neuroscience, the human auditory perception mechanism can be simply summarized into the following three aspects: (1) decomposing the acoustic signal into different frequency bands; (2) extracting timbre features and focusing on important information; and (3) using deep features to recognize acoustic targets.

3.2. Model Structure and Theory

Inspired by auditory mechanisms, this paper attempts to simulate the process of human auditory perception with deep learning algorithms and proposes a model for underwater acoustic target recognition, the ACNN, which is composed of a frequency-component perception module, a timbre perception module, and a critical-information perception module. The model approximately emulates the whole process by which acoustic signals are perceived by the auditory system, and its structure is shown in Figure 2.

3.2.1. Frequency Perception Module

In order to give the ACNN model initial guidance to extract timbre features of underwater acoustic targets and achieve signal decomposition, the gammatone filter bank with a similar excitation response to that of the cochlear basilar membrane is used in the frequency perception module, which is shown in Equation (1).
g(t) = a t^(n−1) e^(−2πBt) cos(2πft + φ)     (1)
where a is the amplitude, n is the filter order, t is time in seconds, B is the bandwidth in Hz, φ is the phase of the carrier in radians, and f is the center frequency in Hz. f and B are set according to an equivalent rectangular bandwidth (ERB) filter bank cochlea model, as in Equations (2) and (3).
ERB(f) = 24.7 (4.37 f / 1000 + 1)     (2)
B = 1.019 × ERB(f)     (3)
In this module, according to the line-spectrum frequency distribution characteristics of underwater acoustic targets, 128 gammatone filter channels are generated for sensing frequency information with center frequencies ranging from 10 Hz to 10 kHz, and the frequency magnitude responses of the gammatone filter bank are shown in Figure 3.
It can be seen in Figure 3 that the center frequencies of the gammatone filter bank are approximately linearly distributed in the 10~300 Hz band, with a spacing of about 10 Hz between adjacent center frequencies, which provides good frequency resolution for the low-frequency line spectra generated by mechanical noise. Moreover, in the band above 300 Hz, the center frequency interval and bandwidth of the gammatone filters increase, so that the high-frequency line spectra generated by bearing friction and propeller resonance can be detected efficiently at a lower computational cost.
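To make the construction of this module concrete, the following is a minimal NumPy sketch of an ERB-spaced gammatone filter bank following Equations (1)-(3); the function names, the filter order of 4, and the impulse-response length are our own illustrative assumptions rather than implementation details taken from the paper.

```python
import numpy as np

def erb(f):
    # Equivalent rectangular bandwidth in Hz at center frequency f (Hz), Equation (2)
    return 24.7 * (4.37 * f / 1000.0 + 1.0)

def erb_space(f_low, f_high, n_channels):
    # n_channels center frequencies between f_low and f_high, equally spaced on the ERB-rate scale
    e_low, e_high = (21.4 * np.log10(4.37 * f / 1000.0 + 1.0) for f in (f_low, f_high))
    e = np.linspace(e_low, e_high, n_channels)
    return (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37

def gammatone_ir(fc, fs, order=4, duration=0.128):
    # Gammatone impulse response g(t) = a*t^(n-1)*exp(-2*pi*B*t)*cos(2*pi*fc*t), Equations (1) and (3)
    t = np.arange(0.0, duration, 1.0 / fs)
    bandwidth = 1.019 * erb(fc)
    g = t ** (order - 1) * np.exp(-2.0 * np.pi * bandwidth * t) * np.cos(2.0 * np.pi * fc * t)
    return g / np.max(np.abs(g))  # simple peak normalization

fs = 20000                                   # resampling rate used in Section 4.2
centers = erb_space(10.0, 10000.0, 128)      # 128 channels from 10 Hz to 10 kHz
bank = [gammatone_ir(fc, fs) for fc in centers]

# A frame x of shape (4096,) would be decomposed into 128 sub-band signals, e.g.:
# subbands = np.stack([np.convolve(x, h, mode="same") for h in bank])
```

With these settings, the low-frequency channels come out roughly 10 Hz apart while the spacing and bandwidth widen above 300 Hz, which roughly matches the behaviour described for Figure 3.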

3.2.2. Timbre Perception Module

In the timbre perception module, a convolutional block consisting of four 1d convolutional layers is connected after each gammatone filter output channel. The first three convolutional layers all have 16 channels with a convolutional kernel size of 5 × 1 and are used to extract the line-spectrum energy distribution features in different frequency bands and to filter out marine ambient noise. The fourth convolutional layer has only one channel with a convolutional kernel size of 1 × 1 and is used to reshape its input data and fully fuse the timbre features of the underwater acoustic target. In this module, the combination of two different-sized convolutional kernels also enables multi-scale feature extraction, allowing the model to better adapt to the effects of target velocity variations. Each gammatone filter channel has a convolutional block connected to it, and the operation of the convolutional blocks can be described as follows.
y_n = h(w_n ∗ x_n + b_n),   y_n ∈ R^(l×1),   n = 1, 2, …, N     (4)
where x_n is the output of the n-th gammatone filter channel; y_n, w_n and b_n are the output, weight matrix, and bias matrix of the n-th convolutional block, respectively; and h is the tanh activation function, which limits the output value of the convolutional block to the range −1 to 1 so that the numerical range of the output data is the same as that of the original signal. It is expressed as
tanh(α) = (e^α − e^(−α)) / (e^α + e^(−α))     (5)
Compared to the commonly used ReLU function, the tanh function better preserves the original information of the input signal while introducing nonlinear operations. Driven by a large amount of underwater acoustic target-radiated noise data, the kernel parameters of the convolutional blocks are optimized, so that interfering noise unrelated to the target characteristics is eliminated and the line-spectrum components of underwater acoustic targets are strongly enhanced.
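As a rough Keras illustration of one such block (the activation of the first three layers and the padding mode are not specified in the text and are assumed here):

```python
from tensorflow.keras import layers

def timbre_conv_block(subband, name):
    # subband: tensor of shape (batch, L, 1), the output of one gammatone channel
    x = subband
    for i in range(3):
        # three 16-channel 1-D convolutions with kernel size 5 extract energy-distribution
        # features in the sub-band and suppress marine ambient noise
        x = layers.Conv1D(16, 5, padding="same", activation="tanh",
                          name=f"{name}_conv{i + 1}")(x)
    # one single-channel 1x1 convolution fuses the 16 channels back into one sub-band signal,
    # with tanh keeping the output in [-1, 1] as in Equation (4)
    return layers.Conv1D(1, 1, padding="same", activation="tanh", name=f"{name}_fuse")(x)
```

One such block is attached to each of the 128 gammatone channels, so each frequency band is enhanced by its own learned filters.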

3.2.3. Critical Information Perception Module

In this module, a concatenate layer is first used to concatenate the outputs of all the convolutional blocks, obtaining a feature map Y with N channels, as shown in Equation (6).
Y = concatenate(y_1, y_2, …, y_N),   Y ∈ R^(l×1×N)     (6)
It is worth mentioning that, unlike feature maps output from ordinary convolutional layers, the feature map here has practical significance, as each channel corresponds to a different frequency band. On this basis, the channel attention (CA) [12] module, whose structure is shown in Figure 4, is used to generate a weight vector that weights all frequency channels in order to select a combination of frequency bands that is more critical to the target recognition task, achieving a secondary extraction of timbre features. This operation is represented as
Z_C = Y ⊗ σ = [y_1, y_2, …, y_N] ⊗ [σ_1, σ_2, …, σ_N]     (7)
where σ is the weight vector generated by the CA module, σ ∈ R^(1×1×N), and Z_C is the output of the CA module, Z_C ∈ R^(l×1×N). It can be seen from Figure 4 that the weight vector σ is derived from the input feature map Y through the computation of the global pooling layer, the dense connection layers and the nonlinear activation layer. As the most critical step of the channel attention mechanism, the global pooling operation is conventionally of two types, GAP and GMP, whose role is to downscale the high-dimensional feature map along the channel axes to obtain a one-dimensional vector with dimension equal to the number of channels, as shown in Figure 5.
GAP and GMP are obtained by averaging and maximizing the feature maps along the channel axes, respectively, and these two pooling methods are widely used in the image processing field due to their excellent feature abstraction capabilities. However, in this paper, unlike in traditional image processing, the feature map is a three-dimensional array with actual physical meaning, and each layer corresponds to the component of the original signal in a certain frequency band, so GAP and GMP are not applicable here. The specific reasons are as follows:
(1)
The underwater acoustic target-radiated noise signal can be viewed as a superposition of simple harmonic components, and the average value of any sub-band signal is roughly 0. Therefore, sub-band features cannot be effectively extracted using global average pooling.
(2)
Although the maximum value of each sub-band signal can characterize the properties of the target, the result of the maximum pooling operation is highly random due to the interference of ambient noise.
In this paper, a new weight calculation method called the global energy pooling (GEP) layer is proposed in the channel attention module, which achieves the dimension reduction of the feature map by replacing the mean and maximum computations with a normalized energy computation. The output value of GEP is represented as
ρ_n = [ y_n²(1) + y_n²(2) + … + y_n²(l) + … + y_n²(L) ] / L,   n = 1, 2, …, N     (8)
Then, the GEP layer output vector is processed through two dense layers and activated by a sigmoid function to obtain a weight vector taking values between 0 and 1. Equation (9) is the expression of the sigmoid activation function.
sigmoid(α) = 1 / (1 + e^(−α))     (9)
After that, the feature map output from the CA module, with dimensions l × 1 × N, is fed into a one-dimensional convolutional layer with one channel and a kernel size of 1 × 1 to achieve data compression, and the output of the ACNN model is obtained after tanh activation.
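A sketch of this critical information perception module with the proposed GEP layer might look as follows in Keras; the dense bottleneck size (reduction ratio) and the ReLU activation of the first dense layer are assumptions, since the text only specifies two dense layers and a sigmoid output.

```python
import tensorflow as tf
from tensorflow.keras import layers

def global_energy_pooling(y):
    # GEP, Equation (8): normalized energy (mean squared amplitude) of each channel
    # y has shape (batch, L, N); the result has shape (batch, N)
    return tf.reduce_mean(tf.square(y), axis=1)

def critical_information_module(y, reduction=8):
    # y: concatenated sub-band feature map of shape (batch, L, N), Equation (6)
    n = y.shape[-1]
    rho = layers.Lambda(global_energy_pooling)(y)               # (batch, N)
    s = layers.Dense(n // reduction, activation="relu")(rho)    # first dense layer (bottleneck)
    s = layers.Dense(n, activation="sigmoid")(s)                # weights in (0, 1), Equation (9)
    s = layers.Reshape((1, n))(s)                               # broadcast over the time axis
    z = layers.Multiply()([y, s])                               # channel-wise weighting, Equation (7)
    # single-channel 1x1 convolution compresses the weighted sub-bands to the model output
    return layers.Conv1D(1, 1, activation="tanh")(z)
```

Replacing the pooling call with a mean or maximum over the time axis would recover the conventional GAP or GMP variants compared in the ablation study.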

3.3. ACNN_DRACNN

In Ref. [12], a UATR method based on a deep residual attention convolutional neural network (DRACNN) was proposed by our team in July 2023, which achieved 97.1% recognition accuracy on the ShipsEar dataset [16] and showed excellent generalization on the DeepShip dataset with a recognition accuracy of 89.2% [17]. The DRACNN model consists of five residual attention convolution blocks (RACB) connected in series, and its input data dimensions are the same as the output data dimensions of the ACNN, which ensures the connectivity of data transfer between the auditory and decision models. In this paper, the DRACNN model is used to simulate the process of abstract feature extraction and decision-making in the human cerebral cortex, and it is connected behind the ACNN model, which simulates the human auditory perception mechanism, to obtain a new model called ACNN_DRACNN that finally achieves UATR with a Softmax classifier. The structure of the ACNN_DRACNN model is shown in Figure 6.
The input shape of the ACNN_DRACNN model is set to 4096 × 1; thus, the number of parameters and floating-point operations are 0.61 million (M) and 1.30 M, respectively. It should be noted that the number of parameters and floating-point operations of the ACNN model are 0.35 M and 0.69 M, respectively.
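For illustration only, chaining the auditory front-end and the decision network could be written as below; build_acnn and build_dracnn are hypothetical builder functions standing in for the layer-level definitions of Figures 2 and 6.

```python
from tensorflow.keras import Input, Model

def build_acnn_dracnn(acnn, dracnn, input_len=4096):
    # acnn maps a (4096, 1) frame to a (4096, 1) auditory feature vector;
    # dracnn maps that vector to Softmax probabilities over the five ShipsEar classes
    x = Input(shape=(input_len, 1))
    features = acnn(x)
    probs = dracnn(features)
    return Model(x, probs, name="ACNN_DRACNN")

# model = build_acnn_dracnn(build_acnn(), build_dracnn())
# model.summary()  # the parameter count should be on the order of the 0.61 M reported above
```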

4. UATR Experiment

4.1. Introduction to the Dataset

In this paper, the performance of the ACNN model in line-spectrum feature extraction and target recognition is validated on the ShipsEar dataset, which is available at http://atlanttic.uvigo.es/underwaternoise/ (accessed on 15 March 2023). The ShipsEar dataset was selected from audio recordings collected under different sea conditions in Vigo Harbor, Spain. It covers test data from the fall of 2012 through the summer of 2013 and contains 91 sound recordings of 11 vessel types, including fishing boats, ocean liners, trawlers, mussel boats, tugboats, dredgers, etc., and one background noise class. The recordings were made with autonomous digitalHyd SR-1 acoustic recorders, with a total duration of 3 h and 10 min and a sampling rate of 52,734 Hz. As shown in Table 1, the ship targets in this dataset were categorized into five categories by vessel length.

4.2. Experiments and Analysis of Results

Firstly, the data in the ShipsEar dataset are resampled to 20 kHz, which effectively retains the line-spectrum information below 10 kHz and doubles the signal duration for a given number of sampling points. Then, all the data in the ShipsEar dataset are divided into frames of 4096 points, with 2048 points of overlap between adjacent frames, so the duration of each frame is about 0.2 s. After each sample is zero-averaged and normalized as described in Equation (10), a sample set containing 16,537 samples is obtained. Next, 80% of all samples are randomly selected as the training set and the remaining 20% as the test set, so there are 13,230 samples in the training set and 3307 samples in the test set; as can be seen from Table 1, this sample set is unbalanced.
p(n) = [ s(n) − (1/4096) Σ_{j=1}^{4096} s(j) ] / max| s(n) − (1/4096) Σ_{j=1}^{4096} s(j) |,   n = 1, 2, …, 4096     (10)
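A sketch of this framing and normalization step is given below; resampling to 20 kHz is assumed to have been done already (for example with scipy.signal.resample), and the function name is illustrative.

```python
import numpy as np

FRAME_LEN = 4096   # points per frame (about 0.2 s at 20 kHz)
HOP = 2048         # 2048-point overlap between adjacent frames

def frame_and_normalize(signal):
    # Split a resampled recording into overlapping frames and apply the
    # zero-mean, peak normalization of Equation (10) to each frame.
    frames = []
    for start in range(0, len(signal) - FRAME_LEN + 1, HOP):
        s = signal[start:start + FRAME_LEN].astype(np.float64)
        s -= s.mean()                       # zero-average the frame
        peak = np.max(np.abs(s))
        if peak > 0:
            s /= peak                       # scale into [-1, 1]
        frames.append(s)
    return np.stack(frames)[..., np.newaxis]  # shape (num_frames, 4096, 1)
```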
The ACNN_DRACNN model proposed in this paper is built in a deep learning development environment with the Windows 10 operating system, Python 3.6.5, Keras 2.2.4, TensorFlow 1.14.0 and CUDA 10.0.130, and trained on a workstation equipped with an NVIDIA RTX 2080Ti GPU, an Intel Xeon Silver 4214R CPU, and 32 GB of RAM. The parameters of the ACNN_DRACNN model are initialized with random numbers drawn from a Gaussian distribution before model training; the optimizer is set to Adam, the initial learning rate is set to 0.001, the batch size is set sequentially to 8, 16, 32, and 64, and the number of epochs is set to 100. The model is trained with the multi-class cross-entropy cost function in Equation (11), and the cost function curves of the four experiments with different batch sizes are shown in Figure 7.
J = − Σ_{i=0}^{C−1} y_i log(p_i)     (11)
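The stated configuration corresponds roughly to the following Keras calls; this sketch is written against a current tf.keras API rather than the exact Keras 2.2.4 / TensorFlow 1.14 versions listed above, and the model variable is assumed to come from a builder such as the one sketched in Section 3.3.

```python
from tensorflow.keras.optimizers import Adam

def train(model, x_train, y_train, x_val, y_val, batch_size=64):
    # Adam optimizer, initial learning rate 0.001, multi-class cross-entropy (Equation (11)),
    # 100 epochs; batch sizes of 8, 16, 32 and 64 were compared in turn in the experiments
    model.compile(optimizer=Adam(learning_rate=1e-3),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model.fit(x_train, y_train,
                     validation_data=(x_val, y_val),
                     batch_size=batch_size, epochs=100)
```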
The training and validation curves in Figure 7 show that the ACNN_DRACNN model has good convergence on the ShipsEar dataset as the model iterates, and among four sets of experiments, the best convergence effect is achieved when the batch size is set to 64, at which time the recognition accuracy on the validation set is 99.87%. The confusion matrix of the recognition results is shown in Figure 8. It can be seen that the recognition error mainly stems from the model’s confusion between class A and class D targets. This is due to the fact that fishing vessels in class A targets and ro-ro vessels in class D targets have similar mechanical structures and working conditions, and their underwater radiated noise is also similar.
To present the recognition results, accuracy, precision, recall, and F1-score are adopted to evaluate the recognition performance of the ACNN_DRACNN model. The formulas for each indicator are as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)     (12)
Precision = TP / (TP + FP)     (13)
Recall = TP / (TP + FN)     (14)
F1_score = 2 × Precision × Recall / (Precision + Recall)     (15)
where TP is true positive, TN is true negative, FP is false positive, and FN is false negative. The recognition experiment results are shown in Table 2.
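These indicators can be computed from the test-set predictions, for example with scikit-learn (our own sketch; the paper does not state which tooling was used):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def per_class_report(y_true, y_pred, class_names=("A", "B", "C", "D", "E")):
    # Precision, recall and F1-score per class (Equations (13)-(15));
    # per-class accuracy is computed one-vs-rest (Equation (12)).
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=list(range(len(class_names))))
    for i, name in enumerate(class_names):
        acc = accuracy_score(y_true == i, y_pred == i)
        print(f"{name}: acc={acc:.4f}  prec={prec[i]:.4f}  rec={rec[i]:.4f}  f1={f1[i]:.4f}")
```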
Then, in order to investigate the effect of the number of gammatone filters and the global pooling layer of the CA module on the model's recognition performance, ablation experiments are carried out on the test set with all other parameters kept constant, with the number of filters set to 8, 16, 32, and 64, and the global pooling set to GAP, GMP and GEP, respectively. The recognition results are shown in Figure 9 and Figure 10.
With the same number of gammatone filter channels, the ACNN model using the GEP layer proposed in this paper achieves higher recognition accuracy and a smaller cost function value than the model using the GAP layer or the GMP layer, which indicates that the GEP layer improves the recognition performance and gives the model higher confidence in its recognition results. In addition, as the number of channels increases, the original signal is decomposed more finely, resulting in more explicit line-spectrum components carried by each sub-band signal, so the recognition accuracy increases and the value of the cost function decreases. It is worth mentioning that increasing the channel number also leads to a significant increase in model parameters and computation, so these two factors need to be weighed when selecting the channel number. Next, the ACNN_DRACNN proposed in this paper is compared with other state-of-the-art models on the same sample set, and the results are shown in Table 3.
Benefitting from the excellent line-spectrum feature extraction capability of the ACNN model, the ACNN_DRACNN model proposed in this paper reaches a target recognition accuracy of 99.8% on the ShipsEar dataset, an improvement of 2.7% over the DRACNN model proposed by our team last year, and its recognition accuracy is improved to varying degrees compared with several other state-of-the-art models. More importantly, the parameter count of the ACNN_DRACNN model is only 0.61 M, about 1/15th that of the A-ResNet model, and it requires only 13 M floating-point operations, about 1/100th that of the A-ResNet model. In other words, the model proposed in this paper has fewer parameters and fewer floating-point calculations, which means that less memory and fewer computational resources are required to run it, facilitating the deployment of the model on a small computer system and the fast implementation of target recognition. On our workstation, the training process consumes 23 ms per sample, and recognition consumes 2 ms per sample.
The signal-to-noise ratio (SNR) of the radiated noise signals of the underwater acoustic targets in the ShipsEar dataset is high due to the close proximity between the target and the hydrophone. The model's recognition performance was therefore tested after adding Gaussian noise at different signal-to-noise ratios to the original data, and the results are shown in Table 4. It can be seen that the target recognition accuracy increases with increasing SNR. When the SNR is −20 dB, the average recognition accuracy of the model is 66.3%, and when the SNR is greater than −5 dB, the average recognition accuracy of our method exceeds 85%. In addition, the ability of the model to detect environmental noise is significantly enhanced by adding noise.
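The noise-robustness test can be reproduced by adding white Gaussian noise at a prescribed SNR before inference; a simple sketch of that degradation step (our own, not from the paper):

```python
import numpy as np

def add_awgn(signal, snr_db):
    # Add white Gaussian noise so that the resulting signal-to-noise ratio equals snr_db
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

# example: degrade a normalized frame to -5 dB SNR before feeding it to the model
# noisy_frame = add_awgn(frame, snr_db=-5)
```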

4.3. Visualization of Features

In the previous section, the auditory neural network model was trained with samples from the ShipsEar dataset, establishing a data-driven mapping from the original signal to the output layer. In this section, the validation set samples are input into the trained model, and the output layer data are transformed to the frequency domain using the Fourier transform. A segment of data from each of the five categories is selected for analysis and comparison, as shown in Figure 11. For the five types of targets, a comparison of the power spectrum of the ACNN model output data with that of the input data shows that the energy of the output signals is suppressed or enhanced to different degrees in different frequency bands, and the continuous spectrum intensity of the output signal decreases by about 20 dB compared with that of the input signal, which implies that the line spectra are relatively enhanced by the same amount.
Under the constraint of the cross-entropy loss function, the model parameters are continuously updated by the error back-propagation algorithm to extract features from the original signal that can distinguish the five types of targets. It should be emphasized that not all the information in the original signal is valid for identifying the target, so this paper uses an improved channel attention mechanism to select the frequency bands that better reflect the differences between the five types of targets; therefore, the larger the attention weights, the better they reflect the essential target characteristics. The power spectra of the five types of targets are augmented in certain frequency bands, and the common denominator is the presence of line spectra in these bands, which suggests that the main feature relied upon by the model proposed in this paper to recognize targets is the combination of underwater radiated-noise line spectra. The dimensions of the ACNN model input data and output data are reduced from 4096 to 2 by the TSNE algorithm for visualization in feature space, as shown in Figure 12.
The raw signals of the underwater acoustic targets are disordered in the feature space, and it is almost impossible to find any information that distinguishes the targets in Figure 12a. In contrast, Figure 12b shows that the separability of the auditory-domain line-spectrum features extracted from the ACNN model output layer is clearly improved, which indicates that the ACNN model proposed in this paper contributes to the improvement of UATR performance and that the features extracted by the model are interpretable.
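The two-dimensional projection in Figure 12 can be reproduced with scikit-learn's TSNE; the sketch below is illustrative, and the perplexity and initialization settings are assumptions.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def tsne_scatter(features, labels, title):
    # Project 4096-dimensional frames to 2-D and color the points by class label
    embedded = TSNE(n_components=2, perplexity=30, init="pca",
                    random_state=0).fit_transform(features)
    plt.figure()
    plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, s=3, cmap="tab10")
    plt.title(title)
    plt.show()

# tsne_scatter(raw_frames, frame_labels, "Raw signals")          # cf. Figure 12a
# tsne_scatter(acnn_outputs, frame_labels, "ACNN output layer")  # cf. Figure 12b
```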

5. Conclusions

In this paper, an auditory neural network model with the ability of line-spectrum feature extraction of underwater acoustic targets is designed by simulating the human hearing mechanism and combined with the DRACNN model for UATR. On this basis, the main contributions of this article are as follows:
(1)
The design of the model is inspired by the human hearing mechanism, so the feature extraction process is highly interpretable, and the data processing results show that the finer the decomposition of the original signal, the higher the achievable target recognition accuracy.
(2)
In this paper, a new global pooling method called a GEP layer is proposed, which integrates traditional features with deep learning, and can provide higher recognition accuracy and recognition result confidence for the network model compared with global maximum pooling and global average pooling.
(3)
The ACNN_DRACNN model achieves 99.8% recognition accuracy on the ShipsEar dataset, which is a 2.7% improvement over the DRACNN model, and it has better-integrated performance than DarkNet, MobileNet, CRNN and other current state-of-the-art methods.
At the same time, there are still many issues worth studying based on our work, such as:
(1)
Expanding the dataset using simulated data or data augmentation to improve recognition and generalization ability under sample imbalance and missing data for typical operating conditions.
(2)
Mining the time-correlation features and transient working condition features in the original signal to improve the ability of portraying the essential characteristics of underwater acoustic targets.

Author Contributions

Conceptualization, F.J. and J.N.; methodology, F.J.; software, J.N. and S.L.; validation, J.N., F.J. and W.F.; formal analysis, J.N. and S.L.; investigation, W.F.; resources, J.N.; data curation, F.J. and F.J.; writing—original draft preparation, J.N. and S.L.; writing—review and editing, J.N. and F.J.; visualization, S.L., W.F. and J.N.; supervision, F.J.; project administration, F.J.; funding acquisition, F.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under grant number 52371356.

Data Availability Statement

Data are available in a publicly accessible repository. The datasets are openly available at http://atlanttic.uvigo.es/underwaternoise/ and via https://doi.org/10.1016/j.apacoust.2016.06.008 (accessed on 15 March 2023).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

Abbreviation | Full Name
UATR | Underwater acoustic target recognition
CNN | Convolutional neural network
GAN | Generative adversarial network
ACNN | Auditory convolutional neural network
DEMON | Detection of envelope modulation on noise
LOFAR | Low frequency analysis and recording
GAP | Global average pooling
GMP | Global maximum pooling
GEP | Global energy pooling
CA | Channel attention
DRACNN | Deep residual attention convolutional neural network
RACB | Residual attention convolution block
ERB | Equivalent rectangular bandwidth

References

  1. Luo, X.W.; Chen, L.; Zhou, H.L.; Cao, H.L. A Survey of Underwater Acoustic Target Recognition Methods Based on Machine Learning. J. Mar. Sci. Eng. 2023, 11, 384. [Google Scholar] [CrossRef]
  2. Jiang, J.G.; Wu, Z.N.; Lu, J.N.; Huang, M.; Xiao, Z.Z. Interpretable features for underwater acoustic target recognition. Measurement 2020, 173, 108586. [Google Scholar] [CrossRef]
  3. Wang, W.; Zhao, X.C.; Liu, D.L. Design and Optimization of 1D-CNN for Spectrum Recognition of Underwater Targets. Integr. Ferroelectr. 2021, 218, 164–179. [Google Scholar] [CrossRef]
  4. Kim, K.I.; Pak, M.I.; Chon, B.P.; Ri, C.H. A method for underwater acoustic signal classification using convolutional neural network combined with discrete wavelet transform. Int. J. Wavelets Multiresolut. Inf. Process. 2021, 19, 2050092. [Google Scholar] [CrossRef]
  5. Yao, Q.H.; Wang, Y.; Yang, Y.X. Underwater Acoustic Target Recognition Based on Data Augmentation and Residual CNN. Electronics 2023, 12, 1206. [Google Scholar] [CrossRef]
  6. Chen, L.; Luo, X.W.; Zhou, H.L. A ship-radiated noise classification method based on domain knowledge embedding and attention mechanism. Eng. Appl. Artif. Intell. 2024, 127, 10732. [Google Scholar] [CrossRef]
  7. Ju, Y.; Wei, Z.X.; Li, H.F.; Feng, X. A New Low SNR Underwater Acoustic Signal Classification Method Based on Intrinsic Modal Features Maintaining Dimensionality Reduction. Pol. Marit. Res. 2020, 27, 187–198. [Google Scholar] [CrossRef]
  8. Yao, H.Y.; Gao, T.; Wang, Y.; Wang, H.Y.; Chen, X. Mobile_ViT: Underwater Acoustic Target Recognition Method Based on Local–Global Feature Fusion. J. Mar. Sci. Eng. 2024, 12, 589. [Google Scholar] [CrossRef]
  9. Luo, X.W.; Zhang, M.H.; Liu, T.; Huang, M.; Xu, X.G. An Underwater Acoustic Target Recognition Method Based on Spectrograms with Different Resolutions. J. Mar. Sci. Eng. 2021, 9, 1246. [Google Scholar] [CrossRef]
  10. Ouyang, T.; Zhang, Y.J.; Zhao, H.L.; Cui, Z.W.; Yang, Y.; Xu, Y.J. A multi-color and multistage collaborative network guided by refined transmission prior for underwater image enhancement. Vis. Comput. 2024. [Google Scholar] [CrossRef]
  11. Yildiz, E.; Yuksel, M.E.; Sevgen, S. A Single-Image GAN Model Using Self-Attention Mechanism and DenseNets. Neurocomputing 2024, 596, 127873. [Google Scholar] [CrossRef]
  12. Ji, F.; Ni, J.S.; Li, G.N.; Liu, L.L.; Wang, Y.Y. Underwater Acoustic Target Recognition Based on Deep Residual Attention Convolutional Neural Network. J. Mar. Sci. Eng. 2023, 11, 1626. [Google Scholar] [CrossRef]
  13. Hong, F.; Liu, C.W.; Guo, L.J.; Chen, F.; Feng, H.H. Underwater Acoustic Target Recognition with a Residual Network and the Optimized Feature Extraction Method. Appl. Sci. 2021, 11, 1442. [Google Scholar] [CrossRef]
  14. Li, J.; Wang, B.X.; Cui, X.R.; Li, S.B.; Liu, J.H. Underwater Acoustic Target Recognition Based on Attention Residual Network. Entropy 2022, 24, 1657. [Google Scholar] [CrossRef] [PubMed]
  15. Zhang, H.Q.; Li, S.; Li, D.H.; Wang, Z.C.; Zhou, Q.X.; You, Q.X. Sonar image quality evaluation using deep neural network. IET Image Process. 2022, 16, 992–999. [Google Scholar] [CrossRef]
  16. Ashraf, H.; Shah, B.; Soomro, A.M.; Safdar, Q.A.; Halim, Z.; Shah, S.K. Ambient-noise Free Generation of Clean Underwater Ship Engine Audios from Hydrophones using Generative Adversarial Networks. Comput. Electr. Eng. 2022, 100, 107970. [Google Scholar]
  17. Wang, Z.; Liu, L.W.; Wang, C.Y.; Deng, J.J.; Zhang, K.; Yang, Y.C.; Zhou, J.B. Data Enhancement of Underwater High-Speed Vehicle Echo Signals Based on Improved Generative Adversarial Networks. Electronics 2022, 11, 2310. [Google Scholar] [CrossRef]
  18. Jin, G.H.; Liu, F.; Wu, H.; Song, Q.Z. Deep Learning-Based Framework for Expansion, Recognition and Classification of Underwater Acoustic Signal. J. Exp. Theor. Artif. Intell. 2019, 32, 205–218. [Google Scholar] [CrossRef]
  19. Ge, F.X.; Bai, Y.Y.; Li, M.J.; Zhu, G.P.; Yin, J.W. Label distribution-guided transfer learning for underwater source localization. J. Acoust. Soc. Am. 2022, 151, 4140–4149. [Google Scholar] [CrossRef]
  20. Ji, F.; Li, G.N.; Lu, S.Q.; Ni, J.S. Research on a Feature Enhancement Extraction Method for Underwater Targets Based on Deep Autoencoder Networks. Appl. Sci. 2024, 14, 1341. [Google Scholar] [CrossRef]
  21. Hao, Y.K.; Wu, X.J.; Wang, H.Y.; He, X.Y.; Hao, C.P.; Wang, Z.R.; Hu, Q. Underwater Reverberation Suppression via Attention and Cepstrum Analysis-Guided Network. J. Mar. Sci. Eng. 2023, 11, 313. [Google Scholar] [CrossRef]
  22. Li, Y.X.; Gu, Z.Y.; Fan, X.M. Research on Sea State Signal Recognition Based on Beluga Whale Optimization-Slope Entropy and One Dimensional-Convolutional Neural Network. Sensors 2024, 24, 1680. [Google Scholar] [CrossRef] [PubMed]
  23. Liu, D.L.; Shen, W.H.; Cao, W.J.; Hou, W.M.; Wang, B.Z. Design of Siamese Network for Underwater Target Recognition with Small Sample Size. Appl. Sci. 2022, 12, 10659. [Google Scholar] [CrossRef]
  24. Li, N.; Wang, L.B.; Ge, M.; Unoki, M.; Li, S.; Dang, J.W. Robust voice activity detection using an auditory-inspired masked modulation encoder based convolutional attention network. Speech Commun. 2024, 157, 103024. [Google Scholar] [CrossRef]
  25. Li, J.H.; Yang, H.H. The underwater acoustic target timbre perception and recognition based on the auditory inspired deep convolutional neural network. Appl. Acoust. 2021, 182, 108210. [Google Scholar] [CrossRef]
  26. Yang, H.H.; Li, J.H.; Shen, S.; Xu, G.H. A Deep Convolutional Neural Network Inspired by Auditory Perception for Underwater Acoustic Target Recognition. Sensors 2019, 19, 1104. [Google Scholar] [CrossRef]
  27. Reiterer, S.; Erb, M.; Grodd, W.; Wildgruber, D. Cerebral Processing of Timbre and Loudness: fMRI Evidence for a Contribution of Broca’s Area to Basic Auditory Discrimination. Brain Imaging Behav. 2008, 2, 1–10. [Google Scholar] [CrossRef]
  28. Occelli, F.; Suied, C.; Pressnitzer, D.; Edeline, J.M.; Gourévitch, B. A Neural Substrate for Rapid Timbre Recognition? Neural and Behavioral Discrimination of Very Brief Acoustic Vowels. Cereb. Cortex 2016, 26, 2483–2496. [Google Scholar] [CrossRef]
  29. Huang, G.; Liu, Z.; Maaten, L.V.D.; Kilian, Q.W. Densely Connected Convolutional Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269. [Google Scholar] [CrossRef]
  30. Pathak, D.; Raju, U. Shuffled-Xception-DarkNet-53: A content-based image retrieval model based on deep learning algorithm. Comput. Electr. Eng. 2023, 107, 108647. [Google Scholar] [CrossRef]
  31. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. RepVGG: Making VGG-style ConvNets Great Again. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021; pp. 13728–13737. [Google Scholar] [CrossRef]
  32. Liu, F.; Shen, T.S.; Luo, Z.L.; Zhao, D.; Guo, S.J. Underwater target recognition using convolutional recurrent neural networks with 3-D Mel-spectrogram and data augmentation. Appl. Acoust. 2021, 178, 107989. [Google Scholar] [CrossRef]
Figure 1. Time-frequency line-spectra diagrams. (a) Time-frequency diagram of the small boat. (b) Time-frequency diagram of the test vessel in a stationary state with only auxiliary machinery operating. (c) Time-frequency diagram of the fishing vessel with shaft system failure. (d) Time-frequency diagram of the motor boat-radiated noise when its propeller is rotating at a high speed.
Figure 2. ACNN model structure.
Figure 3. The frequency magnitude responses of the gammatone filter bank.
Figure 4. Channel attention mechanism.
Figure 5. Structure of the global pooling layer.
Figure 6. ACNN_DRACNN model structure.
Figure 7. Training curves of the ACNN_DRACNN model.
Figure 8. Confusion matrix of the recognition results when the batch size is 64.
Figure 9. Cost function value of the model on the validation dataset.
Figure 10. Recognition accuracy of the model on the validation dataset.
Figure 11. Power spectra of input data and output data. (a) Sample of category A. (b) Sample of category B. (c) Sample of category C. (d) Sample of category D. (e) Sample of category E.
Figure 12. Data visualization by TSNE. (a) Raw signals of the ShipsEar dataset. (b) Output data of the ACNN model.
Table 1. ShipsEar dataset details (duration in seconds).

Category | Type of Vessel | Files | Duration (s)
A | Fishing boats, trawlers, mussel boats, tugboats, dredgers | 17 | 1880
B | Motor boats, pilot boats, sailboats | 19 | 1567
C | Passenger ferries | 30 | 4276
D | Ocean liners, ro-ro vessels | 12 | 2460
E | Background noise recordings | 12 | 1145
Table 2. Recognition results for each class.

Category | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%)
A | 99.83 | 99.83 | 99.83 | 99.83
B | 99.82 | 99.99 | 99.82 | 99.91
C | 99.99 | 99.72 | 99.99 | 99.86
D | 99.87 | 99.87 | 99.87 | 99.87
E | 99.85 | 99.99 | 99.85 | 99.92
Average | 99.87 | 99.88 | 99.87 | 99.88
Table 3. Recognition performance comparison of this method with other state-of-the-art methods.

No. | Model | Accuracy (%) | Params (M) | Flops (G)
1 | DenseNet [29] | 90.15 | 6.96 | 0.610
2 | DarkNet [30] | 96.68 | 40.59 | 1.930
3 | RepVGG [31] | 97.05 | 7.83 | 0.420
4 | CRNN [32] | 91.44 | 3.88 | 0.110
5 | Auto-encoder [20] | 93.32 | 0.18 | 0.410
6 | ResNet [13] | 94.97 | 0.33 | 0.110
7 | A-ResNet [14] | 98.19 | 9.47 | 1.460
8 | MobileNet [8] | 94.02 | 2.23 | 0.140
9 | DRACNN [12] | 97.10 | 0.26 | 0.005
10 | ACNN_DRACNN | 99.87 | 0.61 | 0.013
Table 4. Recognition accuracy at different signal-to-noise ratios (%).

Category | −20 dB | −15 dB | −10 dB | −5 dB | 0 dB | 5 dB | 10 dB
A | 63.7 | 78.5 | 75.5 | 84.9 | 95.1 | 98.8 | 99.5
B | 65.9 | 79.6 | 80.3 | 85.0 | 97.4 | 99.1 | 99.4
C | 68.2 | 73.7 | 79.4 | 86.2 | 97.1 | 99.0 | 99.6
D | 52.5 | 67.0 | 77.8 | 84.8 | 96.9 | 98.5 | 99.6
E | 81.2 | 92.1 | 94.3 | 97.2 | 98.4 | 99.0 | 99.9
Average | 66.3 | 78.2 | 81.5 | 87.6 | 97.0 | 98.9 | 99.6
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.


