Article

A Dual-Stream Deep Learning-Based Acoustic Denoising Model to Enhance Underwater Information Perception

School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(17), 3325; https://doi.org/10.3390/rs16173325
Submission received: 8 July 2024 / Revised: 20 August 2024 / Accepted: 27 August 2024 / Published: 8 September 2024

Abstract

Estimating the line spectra of ship-radiated noise is a crucial remote sensing technique for detecting and recognizing underwater acoustic targets. Improving the signal-to-noise ratio (SNR) makes the low-frequency components of the target signal more prominent, which aids the detection of underwater acoustic signals using sonar. Based on the characteristics of the low-frequency narrow-band line spectra in underwater target radiated noise, we propose a dual-stream deep learning network with frequency characteristics transformation (DS_FCTNet) for line spectra estimation. The dual streams predict amplitude and phase masks separately and use an information exchange module to share learned features between the amplitude and phase spectra, aiding phase reconstruction and signal denoising. Additionally, a frequency characteristics transformation module is employed to extract convolutional features between channels, obtaining global correlations of the amplitude spectrum and enhancing the ability to learn target signal features. Experiments on ShipsEar, a dataset of underwater acoustic signals recorded by hydrophones deployed in shallow water, verify the effectiveness and rationality of the different modules within DS_FCTNet. Under low SNR conditions and with unknown ship types, the proposed DS_FCTNet model exhibits the best line spectrum enhancement compared with methods such as SEGAN and DPT_FSNet: SDR and SSNR are improved by 14.77 dB and 13.58 dB, respectively, enabling the detection of weaker target signals and laying the foundation for target localization and recognition applications.

1. Introduction

An important application of ocean remote sensing is using passive sonar to collect acoustic signals in the marine environment [1]. After reducing noise and interference, the extracted signals are used to locate and identify underwater vehicles [2]. With the development of ship vibration and noise reduction technology, the intensity of radiated noise in the medium- and high-frequency bands is effectively controlled. Meanwhile, hydrophones inevitably pick up underwater environmental noise, such as wind, rain, and marine life, which results in a relatively poor signal-to-noise ratio (SNR) for the received signal of the detection system [3]. This poses significant challenges for ocean remote sensing applications, such as ship detection [4], target recognition [5], localization [6], and tracking [7]. Estimating the line spectra of ship-radiated noise is a crucial remote sensing technique for detecting and recognizing underwater acoustic targets. Acoustic signal denoising can make the target signals in the low-frequency spectrum more prominent and enhance the underwater information perception capabilities of sonar systems, aiding in sonar-based underwater remote sensing detection.
Underwater acoustic signal denoising (UWASD) methods are mainly divided into two categories: traditional methods and deep learning-based methods. Traditional methods include spectral subtraction [8], Wiener filtering [9], wavelet transform [10], singular value decomposition (SVD) [11], and empirical mode decomposition (EMD) [12]. Traditional denoising algorithms mainly rely on accurate signal analysis and modeling, as well as the optimal fine-tuning of parameters. However, due to the complexity of the marine environment, underwater acoustic signals usually have a low SNR, and such methods often fail to achieve effective noise reduction under low SNR conditions.
In recent years, deep learning methods have been widely applied in underwater acoustics [13], electromagnetics [14], and audio enhancement [15]. In ocean remote sensing, researchers have studied UWASD methods to enhance the information sensing capabilities of sonar systems. Song et al. [16] proposed a method that combines attention mechanisms with bidirectional long short-term memory recurrent neural networks to separate target ship radiated noise from nonlinear mixed noise. Koh et al. [3] introduced WaveN2N, which learns noise cancellation and clean signal reconstruction from multi-channel array data in a self-supervised learning setting; it was subsequently applied to practical signal enhancement for single hydrophones or shorter array configurations. Inspired by [15], Song et al. [17] constructed a mask separation module using a dual-path network structure based on recurrent neural networks, achieving robust feature extraction and high-SNR separation for underwater targets. Zhou et al. [18,19,20] introduced a noise-aware deep learning model with a full-band–sub-band attention network for UWASD, incorporating an interaction module to transfer auxiliary information from the noise subnet to the target subnet. Zhou et al. [21] proposed a deep learning model for self-noise suppression in AUVs, capable of learning spectral features of noise from single-channel data and reconstructing clean signals in a self-supervised learning setup. Wang et al. [22] proposed a novel stacked convolutional sparse denoising autoencoder to accomplish the blind denoising task for underwater heterogeneous information data, with stacked sparse denoising autoencoders built to extract overcomplete sparse features. Russo et al. [23] and Testolin et al. [24] used deep denoising autoencoders to reduce noise interference. These methods mostly depend heavily on data and generalize poorly to unseen ships. Additionally, the denoised signal obtained by the neural network may be distorted compared to the original signal. Furthermore, under low SNR conditions, the signal components are almost completely overwhelmed by noise, significantly impacting the denoising effectiveness and resulting in minimal improvement in SNR.
To better address the problem of underwater target sensing under low SNR and unknown target types, a novel dual-stream deep learning network with frequency characteristics transformation (DS_FCTNet) is proposed for UWASD. The main contributions of this work are as follows:
  • We propose a dual-branch UWASD model based on deep neural networks, which predicts amplitude and phase masks separately. The amplitude and phase blocks exchange information through cross connections, improving the capability of feature representation. Therefore, compared to existing methods (such as SEGAN [25] and DPT_FSNet [26]), our model more effectively recovers the phase information of underwater acoustic signals.
  • In the amplitude block, a frequency characteristics transformation module that extracts convolutional features between channels was designed, obtaining global correlations of the amplitude spectrum. This enhances the reconstruction of the amplitude feature map, thus boosting the ability of the model to learn features from the underwater acoustic target signal.
  • Our proposed model effectively denoises signals from unknown ship types, better restoring line spectra in the low-frequency components and making the target signals in the low-frequency line spectra more prominent. This enhancement is beneficial for improving the accuracy of subsequent blind target detection and classification algorithms.
  • Extensive experiments were conducted to verify the validity of the proposed model. When compared with classic traditional algorithms and deep learning algorithms, our model demonstrated superior performance in underwater signal denoising.
In the following, we first analyze the characteristics of underwater signals and describe the architecture of the DS_FCTNet model in Section 2. Then, the dataset and environmental settings required for training the model are described in Section 3. Next, the performance of the proposed model is extensively evaluated and the results are discussed in Section 4. Finally, our work is concluded in Section 5.

2. Problem Analysis and DS_FCTNet Model

2.1. Problem Analysis

With the development of ship vibration and noise reduction technologies, the intensity of radiated noise has been effectively controlled in the mid-to-high frequency bands, making passive sonar detection of underwater targets increasingly challenging. However, the low-frequency narrow-band line spectrum components of underwater target radiated noise are difficult to eliminate. Such a signal manifests in the frequency domain as discrete frequency components concentrated within a small frequency range, characterized by strong stability and resistance to interference. The T-F domain diagrams of a clean signal and a mixture signal are shown in Figure 1, and the corresponding PSDs are shown in Figure 2. Due to their stable characteristics, high power, and long propagation distances, low-frequency narrow-band line spectrum signals in acoustic target radiated noise are easily captured and are important features for identifying and detecting ship targets. Therefore, line spectra in acoustic target radiated noise have always been crucial targets for modern sonar equipment to detect and capture. However, in the marine environment, the analysis of low-frequency line spectra can be less accurate due to the influence of marine environmental noise (waves, wind, thunderstorms, etc.). Underwater signal denoising, by improving the SNR, makes the target signals in the low-frequency line spectra more prominent, which helps to enhance the accuracy of target detection and classification algorithms. Furthermore, when enhancing underwater signals, the accuracy of the low-frequency line spectra directly affects the quality of the final results. Denoising provides a cleaner basis for high-quality signal reconstruction, especially when low-frequency signal components need to be restored or enhanced.

2.2. DS_FCTNet Model

As shown in Figure 3, the DS_FCTNet model first reads the noisy underwater acoustic signal and transforms the raw data into a spectrogram via the STFT. This spectrogram serves as the input of the model, $D_{\text{Input}} \in \mathbb{R}^{T \times F \times 2}$, where $T$ represents the number of time steps, indicating how many segments or frames the sound signal is divided into during the transformation into a spectrogram. Each time step typically represents a specific short time window within which a Fourier transform (or similar transformation) is performed to analyze the frequency content within that window. $F$ denotes the number of frequency bands, which are the divisions along the frequency axis in the spectrogram of the sound signal. Each band represents the energy or amplitude of the signal within a specific frequency range. The final dimension of two holds the real and imaginary parts.
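As an illustration, the following minimal sketch (ours, not the authors' code) builds this $(T, F, 2)$ input with PyTorch. The 512-sample window and 256-sample hop follow Table 1; the Hann window is an assumption.

```python
import torch

def make_model_input(waveform: torch.Tensor, n_fft: int = 512, hop: int = 256) -> torch.Tensor:
    """Turn a 1-D waveform into a (T, F, 2) real/imaginary spectrogram input."""
    window = torch.hann_window(n_fft)  # window choice is an assumption
    spec = torch.stft(waveform, n_fft=n_fft, hop_length=hop,
                      window=window, return_complex=True)  # (F, T) complex
    # note: n_fft=512 yields 257 frequency bins; the paper's F=256 suggests one bin is dropped
    spec = spec.transpose(0, 1)                             # (T, F)
    return torch.stack([spec.real, spec.imag], dim=-1)      # (T, F, 2)
```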
The input $D_{\text{input}}$ is fed into both the amplitude and phase streams to predict the amplitude and phase masks, yielding $D_A \in \mathbb{R}^{T \times F}$ and $D_P \in \mathbb{R}^{T \times F \times 2}$. The amplitude stream comprises an amplitude encoder, amplitude block, and amplitude decoder; the phase stream comprises a phase encoder, phase block, and phase decoder. The amplitude mask $D_A$ is element-wise multiplied by the magnitude of $D_{\text{input}}$ to obtain $D_m \in \mathbb{R}^{T \times F}$, which is then reshaped through a reshape layer to $D_m' \in \mathbb{R}^{T \times F \times 1}$. $D_m'$ is element-wise multiplied by $D_P$ to obtain the predicted time-frequency mask $D_{\text{out}} \in \mathbb{R}^{T \times F \times 2}$, where $\circ$ denotes element-wise multiplication. The operations are defined as:
$$D_A = \mathrm{Decoder}(\mathrm{Amplitude\_stream}(\mathrm{Encoder}(D_{\text{input}})))$$
$$D_P = \mathrm{Decoder}(\mathrm{Phase\_stream}(\mathrm{Encoder}(D_{\text{input}})))$$
$$D_m = D_A \circ D_{\text{input}}^{\text{Mag}}$$
$$D_m' = \mathrm{Reshape}(D_m)$$
$$D_{\text{out}} = D_m' \circ D_P$$
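To make the data flow concrete, here is a minimal sketch (ours, under the shapes defined above) of how the two predicted masks could be combined:

```python
import torch

def apply_masks(d_input: torch.Tensor, d_a: torch.Tensor, d_p: torch.Tensor) -> torch.Tensor:
    """Combine the amplitude mask D_A and phase mask D_P with the noisy input.

    d_input: (T, F, 2) real/imaginary spectrogram of the noisy signal
    d_a:     (T, F)    amplitude mask
    d_p:     (T, F, 2) phase mask
    """
    mag = torch.linalg.norm(d_input, dim=-1)  # |D_input|, shape (T, F)
    d_m = d_a * mag                           # element-wise amplitude masking
    d_m = d_m.unsqueeze(-1)                   # reshape to (T, F, 1)
    return d_m * d_p                          # broadcast to the (T, F, 2) output
```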

2.2.1. Encoder

Amplitude Encoder: This consists of two convolutional layers and two activation functions, as shown in Figure 4. The input $D_{\text{Input}}$ first passes through the first convolutional layer Conv1 ($1 \times 7$) and a ReLU activation function to obtain $D_{\text{conv1}} \in \mathbb{R}^{T \times F \times C_A}$. $D_{\text{conv1}}$ then passes through the second convolutional layer Conv2 ($7 \times 1$) and a ReLU activation function to produce $D_{\text{Out}}^{\text{A\_Encoder}} \in \mathbb{R}^{T \times F \times C_A}$. These two convolutional layers do not change the length of the time or frequency dimensions, only the number of channels.
Phase Encoder: This is composed of two convolutional layers, as shown in Figure 4. The input $D_{\text{Input}}$ first passes through convolutional layers Conv3 ($5 \times 3$) and Conv4 ($5 \times 3$) to obtain $D_{\text{Out}}^{\text{P\_Encoder}} \in \mathbb{R}^{T \times F \times C_P}$. As with the amplitude encoder, these two convolutional layers do not change the length of the time or frequency dimensions, only the number of channels.
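A sketch of the two encoders in PyTorch follows. Channel counts follow Table 1; the channels-first $(B, 2, T, F)$ layout and the "same" padding (chosen so the T and F dimensions are preserved, as the text requires) are our assumptions.

```python
import torch.nn as nn

class AmplitudeEncoder(nn.Module):
    """Two conv layers (1x7 then 7x1) with ReLU; padding keeps T and F unchanged."""
    def __init__(self, c_a: int = 96):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, c_a, kernel_size=(1, 7), padding=(0, 3)), nn.ReLU(),
            nn.Conv2d(c_a, c_a, kernel_size=(7, 1), padding=(3, 0)), nn.ReLU(),
        )
    def forward(self, x):    # x: (B, 2, T, F)
        return self.net(x)   # (B, C_A, T, F)

class PhaseEncoder(nn.Module):
    """Two 5x3 conv layers; only the channel count changes."""
    def __init__(self, c_p: int = 48):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, c_p, kernel_size=(5, 3), padding=(2, 1)),
            nn.Conv2d(c_p, c_p, kernel_size=(5, 3), padding=(2, 1)),
        )
    def forward(self, x):    # x: (B, 2, T, F)
        return self.net(x)   # (B, C_P, T, F)
```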

2.2.2. Dual-Stream Blocks (DSBs)

In the DS_FCTNet network model, the number of DSBs is set to three based on the experimental results.
Amplitude Stream: The amplitude stream models the energy distribution of underwater signals, helping the model understand which parts are the main concentration areas of the signal energy, which is beneficial for recovering low-power high-frequency components. As shown in Figure 5, the amplitude stream includes two FCT modules and three convolution layers, each equipped with Batch Normalization (BN) and ReLU activation functions. Using three convolution layers allows for handling the local correlations of frequency domain features, while the FCT modules capture global correlations in the frequency domain. The specific operations are as follows:
$$D_{\text{Out}}^{\text{FCT1}} = \mathrm{FCT1}(D_{\text{Out}}^{\text{A\_Encoder}}) = \mathrm{FCT1}(D_{\text{In}}^{\text{FCT1}})$$
$$D_{\text{Out}}^{\text{Conv8}} = \mathrm{Conv8}(D_{\text{Out}}^{\text{FCT1}})$$
$$D_{\text{Out}}^{\text{Conv9}} = \mathrm{Conv9}(\mathrm{ReLU}(D_{\text{Out}}^{\text{Conv8}}))$$
$$D_{\text{Out}}^{\text{Conv10}} = \mathrm{Conv10}(\mathrm{ReLU}(D_{\text{Out}}^{\text{Conv9}}))$$
$$D_{\text{Out}}^{\text{FCT2}} = \mathrm{FCT2}(\mathrm{ReLU}(D_{\text{Out}}^{\text{Conv10}}))$$
Phase Stream: The phase stream precisely adjusts the phase to ensure correct reconstruction of the waveform after denoising. Compared to the amplitude stream, the phase stream is more lightweight, consisting of two convolution layers in Figure 5. Before the phase information is passed into the convolution layers, it first undergoes Layer Normalization (LN), which, unlike Batch Normalization, normalizes all the features of a single data sample. LN reduces the internal covariate shift during the training of deep neural networks, thus speeding up the training process and stabilizing the model training. The specific operations are as follows:
$$D_{\text{Out}}^{\text{P\_conv11}} = \mathrm{Conv11}(\mathrm{LN}(D_{\text{Out}}^{\text{P\_Encoder}}))$$
$$D_{\text{Out}}^{\text{P\_conv12}} = \mathrm{Conv12}(\mathrm{LN}(D_{\text{Out}}^{\text{P\_conv11}}))$$
Communication: The communication module consists of a convolution layer and a Tanh activation function. Tanh maps input values to a range between −1 and 1, which makes it well suited to outputs that require negative values or data centered around zero. The main purpose of this module is to promote information sharing between the two independent processing branches, the amplitude stream and the phase stream. Through this design, each branch receives crucial information from the other, thus obtaining a more comprehensive and detailed representation of the input features. This complementary information exchange mechanism not only enhances the feature learning capability of the network but also improves the overall signal processing efficiency and accuracy. As shown in Figure 5, when the amplitude stream receives information from the phase stream, it adjusts the amplitude estimation more precisely, further refining the dynamic range and intensity model of the signal. Conversely, the phase stream, upon integrating information from the amplitude stream, can make more accurate phase adjustments, which is crucial for maintaining the naturalness and coherence of the reconstructed signal. This bidirectional adjustment of phase and amplitude significantly enhances the model's denoising capability, enabling it to effectively reduce background noise while preserving more signal details and characteristics. The specific operations are as follows:
$$D_{\text{Out}}^{\text{A\_SB}} = D_{\text{Out}}^{\text{FCT2}} \circ \mathrm{Tanh}(\mathrm{Conv}(D_{\text{Out}}^{\text{P\_conv12}}))$$
$$D_{\text{Out}}^{\text{P\_SB}} = D_{\text{Out}}^{\text{P\_conv12}} \circ \mathrm{Tanh}(\mathrm{Conv}(D_{\text{Out}}^{\text{FCT2}}))$$
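The gating above can be sketched as follows (our reading of the equations; the $1 \times 1$ kernel of the communication convolution is an assumption):

```python
import torch.nn as nn

class Communication(nn.Module):
    """Inject one stream's features into the other via a Tanh-gated convolution."""
    def __init__(self, c_other: int, c_own: int):
        super().__init__()
        # maps the other stream's channel count onto this stream's channel count
        self.gate = nn.Sequential(nn.Conv2d(c_other, c_own, kernel_size=1), nn.Tanh())

    def forward(self, own, other):
        # own: (B, C_own, T, F); other: (B, C_other, T, F)
        return own * self.gate(other)  # element-wise modulation, as in the equations

# e.g. the amplitude side: Communication(c_other=48, c_own=96)(fct2_out, phase_conv12_out)
```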
Frequency characteristics transformation (FCT): Composed of convolutional layers, reshape layers, fully connected layers, and residual connections, this structure functions similarly to an attention mechanism, as shown in Figure 6. It dynamically adjusts the weights of features to emphasize the more informative parts, thereby enhancing the ability of the model to learn target signal features. The input to the FCT passes through two convolutional layers and two reshape layers to produce the attention map $D_{\text{Out}}^{\text{FCT}'}$, reducing the channel dimension from $C_A$ to $C_R$. $D_{\text{Out}}^{\text{FCT}'}$ is element-wise multiplied by $D_{\text{In}}^{\text{FCT}}$ to produce $D_{\text{Out}}$. $D_{\text{Out}}$ is then processed through an FC layer to yield $D_{\text{Out}}^{\text{FC}}$, which represents the weights for the narrow-band frequency domain. Using a skip connection, $D_{\text{In}}^{\text{FCT}}$ is concatenated with $D_{\text{Out}}^{\text{FC}}$ to form $D_{\text{Out}}^{\text{Cat}}$, which is then processed by a $1 \times 1$ convolutional layer, Conv7, to change the channel dimensions. The specific operations are as follows:
$$D_{\text{Out}}^{\text{FCT}'} = \mathrm{Reshape}(\mathrm{Conv6}(\mathrm{Reshape}(\mathrm{ReLU}(\mathrm{Conv5}(D_{\text{In}}^{\text{FCT}})))))$$
$$D_{\text{Out}} = D_{\text{In}}^{\text{FCT}} \circ D_{\text{Out}}^{\text{FCT}'}$$
$$D_{\text{Out}}^{\text{FC}} = \mathrm{FC1}(D_{\text{Out}})$$
$$D_{\text{Out}}^{\text{Cat}} = \mathrm{Cat}(D_{\text{Out}}^{\text{FC}}, D_{\text{In}}^{\text{FCT}})$$
$$D_{\text{Out}}^{\text{FCT}} = \mathrm{ReLU}(\mathrm{Conv7}(D_{\text{Out}}^{\text{Cat}}))$$
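Because the exact reshape layout is not fully specified in the text, the following sketch fills in one plausible reading: per-time attention computed from channel-reduced features, a fully connected map along the frequency axis, and a concatenation skip. The kernel sizes of Conv6 and the reshape layout are assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

class FCT(nn.Module):
    """Sketch of the frequency characteristics transformation module (assumed layout)."""
    def __init__(self, c_a: int = 96, c_r: int = 8, n_freq: int = 256):
        super().__init__()
        self.att_conv = nn.Sequential(                  # Conv5: C_A -> C_R, then ReLU
            nn.Conv2d(c_a, c_r, kernel_size=1), nn.ReLU())
        self.att_1d = nn.Conv1d(c_r * n_freq, c_a,      # Conv6 (kernel size assumed)
                                kernel_size=9, padding=4)
        self.freq_fc = nn.Linear(n_freq, n_freq)        # FC1: weights along frequency
        self.out_conv = nn.Conv2d(2 * c_a, c_a, kernel_size=1)  # Conv7 after concat

    def forward(self, x):                               # x: (B, C_A, T, F)
        b, _, t, f = x.shape
        a = self.att_conv(x)                            # (B, C_R, T, F)
        a = a.permute(0, 1, 3, 2).reshape(b, -1, t)     # (B, C_R*F, T)
        a = self.att_1d(a).unsqueeze(-1)                # (B, C_A, T, 1) per-time weights
        x_att = x * a                                   # element-wise attention
        x_fc = self.freq_fc(x_att)                      # FC applied along the F axis
        cat = torch.cat([x_fc, x], dim=1)               # skip connection by concatenation
        return torch.relu(self.out_conv(cat))           # back to C_A channels
```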

2.2.3. Decoder

Amplitude Decoder: The amplitude decoder consists of a convolution layer and three fully connected (FC) layers; a variant that inserts a 600-unit BiLSTM is evaluated in Section 4.1.4. The activation function for the convolution layer is sigmoid, the first two FC layers each have 600 units with ReLU activation functions, and the last FC layer has $F$ units with a sigmoid activation function. The output $D_{\text{Out}}^{\text{A\_SB}}$ passes through a $1 \times 1$ convolution, reducing the channel size from $C_A$ to $C_R$, and then through the three fully connected layers to predict the amplitude mask $D_{\text{Out}}^{\Psi}$. The specific operations are as follows:
$$D_{\text{Out}}^{\text{Conv13}} = \mathrm{Sigmoid}(\mathrm{Conv13}(D_{\text{Out}}^{\text{A\_SB}}))$$
$$D_{\text{Out}}^{\text{FC2}} = \mathrm{ReLU}(\mathrm{FC2}(D_{\text{Out}}^{\text{Conv13}}))$$
$$D_{\text{Out}}^{\text{FC3}} = \mathrm{ReLU}(\mathrm{FC3}(D_{\text{Out}}^{\text{FC2}}))$$
$$D_{\text{Out}}^{\Psi} = \mathrm{Sigmoid}(\mathrm{FC4}(D_{\text{Out}}^{\text{FC3}}))$$
Phase Decoder: The phase decoder consists of a convolution layer and amplitude normalization. As shown in Figure 7, the output $D_{\text{Out}}^{\text{P\_SB}}$ passes through a $1 \times 1$ convolution, reducing the channel size from $C_P$ to 2, and then through amplitude normalization to predict the phase mask $D_{\text{Out}}^{\Phi}$. The specific operation is as follows:
$$D_{\text{Out}}^{\Phi} = \mathrm{AN}(\mathrm{Conv}(D_{\text{Out}}^{\text{P\_SB}}))$$
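A sketch of the two decoders follows (unit counts are from the text; the flattening of $(C_R, F)$ into the FC input is our assumption, and the BiLSTM is omitted to match the final model):

```python
import torch
import torch.nn as nn

class AmplitudeDecoder(nn.Module):
    """1x1 sigmoid conv + three FC layers predicting a (T, F) amplitude mask."""
    def __init__(self, c_a: int = 96, c_r: int = 8, n_freq: int = 256, hidden: int = 600):
        super().__init__()
        self.conv = nn.Conv2d(c_a, c_r, kernel_size=1)  # Conv13
        self.fc = nn.Sequential(                        # FC2, FC3, FC4
            nn.Linear(c_r * n_freq, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, x):                               # x: (B, C_A, T, F)
        b, _, t, f = x.shape
        h = torch.sigmoid(self.conv(x))                 # (B, C_R, T, F)
        h = h.permute(0, 2, 1, 3).reshape(b, t, -1)     # (B, T, C_R*F), layout assumed
        return self.fc(h)                               # (B, T, F) amplitude mask

class PhaseDecoder(nn.Module):
    """1x1 conv to 2 channels, then amplitude normalization to unit magnitude."""
    def __init__(self, c_p: int = 48, eps: float = 1e-8):
        super().__init__()
        self.conv = nn.Conv2d(c_p, 2, kernel_size=1)
        self.eps = eps

    def forward(self, x):                               # x: (B, C_P, T, F)
        p = self.conv(x)                                # (B, 2, T, F): real/imag
        mag = torch.linalg.norm(p, dim=1, keepdim=True)
        return p / (mag + self.eps)                     # unit-magnitude phase mask
```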

2.2.4. Loss Function

The loss function for the DS_FCTNet model is the mean squared error (MSE), which consists of two components, $L_A$ and $L_P$, where $L_A$ represents the loss due to amplitude prediction and $L_P$ the loss due to phase prediction. The total loss $L$ is a weighted combination of the two:
$$L = \lambda L_A + (1-\lambda) L_P$$
$$L_A = \mathrm{MSE}\left(|D_{\text{out}}|^{\text{compress}},\, |D_{\text{Input}}|^{\text{compress}}\right)$$
$$L_P = \mathrm{MSE}\left(D_{\text{out}} \frac{|D_{\text{out}}|^{\text{compress}}}{|D_{\text{out}}|},\; D_{\text{Input}} \frac{|D_{\text{Input}}|^{\text{compress}}}{|D_{\text{Input}}|}\right)$$
where $\lambda \in (0, 1)$ is a weighting parameter that controls the relative importance of $L_A$ and $L_P$ in the final loss $L$. $D_{\text{out}}$ represents the output spectrum of the DS_FCTNet model, and $|D_{\text{out}}|^{\text{compress}}$ represents the power-law-compressed spectrum obtained after taking the modulus of the output spectrum. Power-law compression alters the dynamic range of the spectrogram, making the weaker parts of the signal more prominent relative to the stronger parts, which aids in analyzing and highlighting subtle changes in the signal. $D_{\text{Input}}$ represents the spectrogram of the original underwater signal, and $|D_{\text{Input}}|^{\text{compress}}$ denotes the compressed spectrum obtained after taking the modulus of $D_{\text{Input}}$.
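A minimal sketch of this loss follows. The weighting `lam` and the 0.3 compression exponent are illustrative values, not taken from the paper.

```python
import torch

def dsfctnet_loss(d_out, d_ref, lam=0.5, power=0.3, eps=1e-8):
    """Weighted MSE over power-law-compressed amplitude and phase-aware spectra.

    d_out: (..., 2) real/imag output spectrogram
    d_ref: (..., 2) spectrogram of the clean reference signal (D_Input in the paper)
    """
    mag_out = torch.linalg.norm(d_out, dim=-1)
    mag_ref = torch.linalg.norm(d_ref, dim=-1)
    c_out, c_ref = mag_out ** power, mag_ref ** power    # power-law compression

    l_a = torch.mean((c_out - c_ref) ** 2)               # amplitude loss L_A

    # phase-aware term: rescale each complex bin to its compressed magnitude
    p_out = d_out * (c_out / (mag_out + eps)).unsqueeze(-1)
    p_ref = d_ref * (c_ref / (mag_ref + eps)).unsqueeze(-1)
    l_p = torch.mean((p_out - p_ref) ** 2)               # phase loss L_P

    return lam * l_a + (1 - lam) * l_p
```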

3. Experimental Setup

3.1. Dataset

The dataset used in this paper comes from ShipsEar [27], which includes various ship signal recordings and underwater noise signals. We selected the two ship types with the largest number of recordings and longest durations (passenger boat and motorboat, totaling 43 tracks) as the clean dataset, and three types of underwater noise (wind noise, flow noise, and rain noise) as the noise dataset. Initially, we segmented 41 clean recordings (leaving one passenger boat and one motorboat out) into 4105 clean signal pieces with a length of 3 s, a step size of 1 s, and a sampling rate of 16 kHz. These 4105 pieces were randomly mixed with the three noise types at SNRs of −15 dB, −10 dB, and −5 dB, resulting in 36,945 experimental data pieces for DS_FCTNet. The dataset was divided into a 70% training set, 20% validation set, and 10% test set.
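The mixing step can be sketched as follows (a standard SNR-controlled mix, our own illustration rather than the authors' script):

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then add it.

    Both arrays are assumed the same length (e.g. 3 s at 16 kHz = 48,000 samples).
    """
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

# e.g. one training example at -10 dB SNR:
# mixture = mix_at_snr(clean_3s, noise_3s, snr_db=-10)
```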
We validated the effectiveness of the DS_FCTNet model in three scenarios: (1) Dataset-I: known ships, where the signal comes from the same types of ships included in the training phase; (2) Dataset-II: untrained ships, referring to the remaining two ships mentioned earlier, where the signal comes from the same type of ship but different from those included during training; and (3) Dataset-III: unknown ship types, other ships from ShipsEar, meaning the signal comes from previously unknown types of ships.

3.2. Experimental Parameters

As shown in Table 1, the STFT window length was set to 512 samples and the frame shift to 256 samples; the number of channels in the amplitude block $C_A$ was 96, $C_R$ was 8, and the number of channels in the phase block $C_P$ was 48. DS_FCTNet was trained for 150 epochs with a batch size of 8, using the Adam optimizer with a learning rate of 0.0005. A warm-up phase helped to improve the stability of the training process.
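An illustrative training setup matching Table 1 is shown below; the warm-up length (here 5 epochs) is an assumption, as the paper does not state it.

```python
import torch
import torch.nn as nn

model = nn.Linear(1, 1)  # stand-in for the DS_FCTNet network
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)  # lr from Table 1
warmup_epochs = 5  # assumption; not given in the paper
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: min(1.0, (epoch + 1) / warmup_epochs))
# call scheduler.step() once per epoch over the 150-epoch, batch-size-8 run
```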

3.3. Evaluation Metrics

We used the signal distortion ratio (SDR), improvement in signal distortion ratio (SDRi), segmental signal-to-noise ratio (SSNR), and improvement in segmental signal-to-noise ratio (SSNRi) as objective measures to evaluate the performance of our proposed DS_FCTNet model and other UWASD models. SDR measures the ratio between the energy of the acoustic signal in the denoised signal and the energy of the remaining noise (including distortion), with higher SDR values usually indicating better denoising performance. SDRi is the difference between the SDR after and before denoising, providing a direct measure of the improvement brought by the denoising algorithm, with positive values indicating performance improvement. SSNR is the short-time SNR, which considers the short-term characteristics of the signal; it focuses on the local characteristics of the signal and is particularly suitable for evaluating the denoising of time-varying noise:
$$\mathrm{SDR} = 10\log_{10}\frac{\|d\|^2}{\|d-\hat{d}\|^2}$$
$$\mathrm{SSNR} = \frac{1}{T}\sum_{t=1}^{T} 10\log_{10}\frac{\|d_t\|^2}{\|d_t-\hat{d}_t\|^2}$$
where $d$ represents the original clean underwater signal, $\hat{d}$ represents the denoised underwater signal, $T$ is the number of sample blocks, and $d_t$ and $\hat{d}_t$ represent the clean and denoised underwater signal samples of the $t$-th segment, respectively.
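Both metrics are straightforward to compute; the sketch below follows the definitions above (the segment length is an assumption, as the paper does not state it):

```python
import numpy as np

def sdr(d: np.ndarray, d_hat: np.ndarray) -> float:
    """Signal distortion ratio in dB (higher is better)."""
    return 10 * np.log10(np.sum(d ** 2) / np.sum((d - d_hat) ** 2))

def ssnr(d: np.ndarray, d_hat: np.ndarray, seg_len: int = 256) -> float:
    """Segmental SNR: average the per-segment ratio over non-overlapping blocks."""
    n_seg = len(d) // seg_len
    vals = [sdr(d[i * seg_len:(i + 1) * seg_len], d_hat[i * seg_len:(i + 1) * seg_len])
            for i in range(n_seg)]
    return float(np.mean(vals))
```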

4. Experimental Comparison and Analysis

4.1. Comparison of Ablation Experiments

To verify the effectiveness of certain structures of the model, ablation experiments were conducted. This section primarily validates three modules of the DS_FCTNet model: the DSB, the communication module, and the amplitude decoding block. Specific model structures are described in Table 2. The performance of different structures under various SNRs is illustrated using the metrics SSNR and SDR as shown in Table 3.

4.1.1. With or Without Communication

First, we verified the effectiveness of the information exchange module for underwater signal denoising. The commonality between the With One DSB and Without Communication models is that both have a DSB module, an FCT module, and a BiLSTM layer; the difference is that the With One DSB model includes a communication module, whereas the Without Communication model does not. As Table 3 shows, compared with the Without Communication model, With One DSB has SDR gains of 0.11 dB, 0.26 dB, and 0.42 dB and SSNR gains of 0.13 dB, 0.31 dB, and 0.42 dB at −5 dB, −10 dB, and −15 dB, respectively. These gains demonstrate the effectiveness of the information exchange module in underwater signal denoising: through this module, the model attends to phase mask prediction while performing amplitude mask prediction, thus better restoring the phase of the underwater acoustic signal. Therefore, the other comparisons all use the communication module.

4.1.2. With or Without FCT

Next, we verified the effectiveness of the FCT module in underwater signal denoising. The difference between the Without FCT model and DS_FCTNet (ours) is that the DSBs in the Without FCT model do not contain an FCT module, whereas those in DS_FCTNet do. As shown in Table 3, compared with the Without FCT model, DS_FCTNet has SDR gains of 2.08 dB, 2.89 dB, and 3.59 dB and SSNR gains of 2.27 dB, 3.16 dB, and 3.66 dB at −5 dB, −10 dB, and −15 dB, respectively. These gains show that DS_FCTNet significantly outperforms the Without FCT model, proving that the FCT module greatly benefits underwater signal denoising. The FCT module utilizes an attention mechanism to extract convolutional features between channels, obtaining global correlations of the amplitude spectrum, which aids the reconstruction of amplitude feature maps and thereby effectively removes noise from underwater signals.

4.1.3. Compare the Number of DSBs

We also examined the effect of the number of DSB modules on underwater signal denoising. With One DSB, With Two DSBs, and With BiLSTM use one, two, and three DSB modules, respectively. As shown in Table 3, compared with With One DSB, With Two DSBs has SDR gains of 0.72 dB, 1.07 dB, and 1.37 dB and SSNR gains of 0.73 dB, 0.86 dB, and 1.41 dB at −5 dB, −10 dB, and −15 dB, respectively. Compared with With Two DSBs, With BiLSTM has SDR gains of 0.27 dB, 0.36 dB, and 0.48 dB and SSNR gains of 0.26 dB, 0.70 dB, and 0.50 dB at −5 dB, −10 dB, and −15 dB, respectively. The number of DSB modules affects denoising performance because each DSB contains both the FCT and communication modules, and stacking DSBs allows repeated information exchange between the amplitude and phase streams. Considering the number of parameters and the computational cost, the comparative study in this paper uses at most three DSB modules, arranged as shown in Figure 8.

4.1.4. With or Without BiLSTM

Lastly, we examined the impact of the BiLSTM in the amplitude decoding block on underwater signal denoising. Motivated by prior work, we compared the With BiLSTM model, which uses a BiLSTM, with DS_FCTNet (ours), which does not. As shown in Table 3, compared with the With BiLSTM model, DS_FCTNet has SDR gains of 0.18 dB, 0.35 dB, and 0.37 dB and SSNR gains of 0.32 dB, 0.37 dB, and 0.37 dB at −5 dB, −10 dB, and −15 dB, respectively. These gains indicate that using a BiLSTM does not improve the performance of the model. Therefore, the amplitude decoding block of our model does not use the BiLSTM structure.
Figure 9 and Figure 10 display the amplitude and PSD plots of signal segments processed by the trained models based on different structures. It is observable that after the clean signal is mixed with noise, the characteristics of the ship undergo significant changes, causing distortion. Compared to other model structures, our proposed DS_FCTNet model performs denoising, eliminating most of the noise and preserving the characteristics of the ship signal. As shown in Figure 10, it is evident that the high energy in the low-frequency range of the clean signal becomes indistinct after noise mixing, making it difficult to distinguish in the PSD of the mixed signal. Different model architectures, Without FCT, With One DSB, With Two DSBs, Without Communication, and With BiLSTM, retained the target signal’s characteristics to a lesser extent when denoising the mixed signal. Comparing the first and last subplots in Figure 10, the proposed model DS_FCTNet, through interactions between modules, significantly restores the high energy in the low-frequency range of the signal and preserves the target signal’s characteristics as much as possible.
Therefore, our proposed DS_FCTNet model is better adapted to the task of underwater noise reduction, effectively improving the SDR and SSNR gains.

4.2. Method Comparison

In this section, we demonstrate the superiority of our proposed DS_FCTNet model over traditional UWASD methods and two existing deep learning-based algorithms. The traditional denoising methods compared include wavelet transform, spectral subtraction with prior noise estimation, and Wiener filtering, while the deep learning algorithms include SEGAN and DPT_FSNET. For comparison, we used our DS_FCTNet model, which showed excellent performance in Section 4.1 and includes an information exchange module, three DSBs, and FCT.
To assess the denoising performance of the different methods under low SNR conditions, the SDR and SSNR metrics were used on three SNR scenarios generated from Dataset-I (−5 dB, −10 dB, and −15 dB). Table 4 presents the comparison results. DS_FCTNet achieved the best performance in all scenarios, significantly outperforming the other algorithms: it increased the SDR from −4.99 dB, −9.77 dB, and −14.71 dB to 5.28 dB, 4.79 dB, and 4.73 dB, and the SSNR from −3.75 dB, −8.17 dB, and −12.91 dB to 5.40 dB, 5.19 dB, and 4.78 dB, respectively. These results indicate that even under low SNR conditions, DS_FCTNet is an effective method for underwater signal denoising.
For a better analysis of the performance differences among these algorithms, a segment of underwater signal was denoised using the above methods; the resulting amplitude, PSD, and waveform plots are shown in Figure 11, Figure 12, and Figure 13, respectively. In the PSD of the clean signal in Figure 12, four major frequency components are marked. From the PSD of the signal mixed with noise, it can be seen that the frequencies marked by red frames are masked by noise. As frequency increases, the power of the frequency band gradually decreases, making it harder to recover the high-frequency components masked by noise. As Figure 12 shows, the signals denoised by the traditional algorithms struggle to approximate the original signal and fail to recover the low-power high-frequency components. Although the deep learning-based SEGAN and DPT_FSNET can recover some low-power high-frequency components, the frequencies marked by the right red frame are eliminated as noise, so the high-frequency components of the original signal are not accurately reconstructed. By contrast, our proposed DS_FCTNet better recovers the low-power high-frequency components. In the denoising waveform plots in Figure 13, the black line represents the underwater signal mixed with noise, and the red line represents the denoised signals produced by the different algorithms. Because the traditional algorithms, SEGAN, and DPT_FSNET fail to fully recover the low-power high-frequency components, their outputs align poorly in the time domain with the ground-truth acoustic signal.
Compared with the other algorithms, DS_FCTNet utilizes the FCT to extract convolutional features between channels and obtain global correlations of the amplitude spectrum, aiding the reconstruction of amplitude feature maps (as shown in Figure 11) and modeling the target signal and noise components more accurately, achieving a clearer signal even under low SNR. As highlighted in the red box in Figure 13, our proposed DS_FCTNet shows better waveform matching than the other algorithms. Additionally, we note that although DS_FCTNet achieves good noise reduction in Figure 12, the denoised signal does not completely match the clean signal. Due to limitations in acquisition conditions, even when hydrophones are sufficiently close to the target ship, the captured signals inevitably contain a small amount of underwater environmental noise, whose characteristics are similar to the noise mixed into the noisy signal. The denoising may therefore also remove some residual noise present in the original clean underwater signal, resulting in an incomplete match between the two signals.

4.3. Denoising Untrained Ships

In this section, DS_FCTNet is employed to denoise underwater recordings of ship types that were used for training, but from individual ships that were not included in the training set. The gains in SDRi and SSNRi for the different algorithms are shown in Table 5. Compared with the results on Dataset-I (Table 4), the SDRi and SSNRi of the underwater audio denoising algorithms decrease slightly, but the DS_FCTNet model still outperforms the other algorithms under all SNR conditions. For ships of the same type, the ships are in various states in the water during signal acquisition (e.g., waiting, entering, leaving, passing, arriving, departing, and close-range recording). Thus, the signals collected by the hydrophones inevitably contain a small amount of underwater environmental noise, causing variations in the spectrogram features and leading to differences in the denoising results for untrained ships of the same type.
To better verify the effectiveness of the proposed DS_FCTNet model for underwater audio denoising, a segment of audio from Dataset-II was analyzed. Figure 14 and Figure 15 show the denoised amplitude and PSD plots obtained with the different algorithms. From these plots, it can be observed that our proposed algorithm effectively recovers the low-power high-frequency components in the audio of untrained ships.

4.4. Denoising Unknown Ship Types

In this section, we use DS_FCTNet to denoise underwater audio recordings from ships and ship types that were not used in training. The gains in SDR and SSNR for the different algorithms are shown in Table 6. Compared with the results in Table 4, the SDR and SSNR of the underwater audio denoising algorithms improve, with the DS_FCTNet model performing exceptionally well under all SNR conditions. According to the PSD of the clean audio shown in Figure 16, the tested ship audio has fewer high-frequency components; after mixing with noise, fewer high-frequency components are affected by noise, which reduces the difficulty of recovering high frequencies and thereby improves the performance of the denoising algorithms.
To further verify the effectiveness of the proposed DS_FCTNet model for underwater audio denoising, a segment of audio from Dataset-III was analyzed. Figure 16 and Figure 17 show the denoised amplitude and PSD plots obtained through the underwater audio denoising algorithm. From the amplitude and PSD plots, it can be observed that our proposed algorithm effectively recovers the low-frequency line spectra of unknown types of ships.

4.5. Robustness Analysis

The robustness of the model is demonstrated in Table 4, where we validate the effectiveness and stability of our model under different SNRs of −15 dB, −10 dB, and −5 dB. Additionally, as shown in Table 5 and Table 6, we validate the model’s effectiveness using different datasets.

5. Conclusions

In this paper, a dual-branch deep learning-based UWASD model that predicts both amplitude and phase masks is proposed. Utilizing an information exchange module that swaps information between the amplitude and phase spectra, the model better reconstructs the phase information of the underwater target audio. This facilitates UWASD and aids sonar in conducting underwater remote sensing detection. Additionally, a frequency characteristics transformation module captures the global correlations of the amplitude spectrum, enhancing the ability to learn the characteristics of the target signal. We train and test on a portion of the ShipsEar dataset to verify the effectiveness and rationality of the different modules within DS_FCTNet. Under low SNR, DS_FCTNet increases the SDR from −4.99 dB, −9.77 dB, and −14.71 dB to 5.28 dB, 4.79 dB, and 4.73 dB, and the SSNR from −3.75 dB, −8.17 dB, and −12.91 dB to 5.40 dB, 5.19 dB, and 4.78 dB, respectively. Evaluated on untrained ships and unknown ship types from the ShipsEar dataset, the proposed DS_FCTNet proves effective across various assessment metrics, particularly for unknown ship types. The model is extensible: a classifier can be added after the decoder to enable target recognition and detection. In the future, we will simplify the model toward a more lightweight design and use a broader range of sea-condition noise to validate its effectiveness.

Author Contributions

Conceptualization, W.G., D.C. and Y.L.; methodology, D.C.; software, W.G.; validation, W.G., D.C. and Y.L.; formal analysis, W.G.; investigation, D.C.; resources, D.C.; data curation, W.G.; writing—original draft preparation, W.G.; writing—review and editing, Y.L.; visualization, D.C.; supervision, D.C. and Y.L.; project administration, D.C. and Y.L.; funding acquisition, D.C. and Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China under Grant No. 62301040, the Open Fund of State Key Laboratory of Acoustics under Grant SKLA202303, and the China Post-doctoral Science Foundation under Grant No. 2023M730266.

Data Availability Statement

The original data presented in the study are openly available in ShipsEar at https://underwaternoise.atlanttic.uvigo.es/, accessed on 1 November 2023.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Fernandes, J.d.C.V.; de Moura Junior, N.N.; de Seixas, J.M. Deep learning models for passive sonar signal classification of military data. Remote Sens. 2022, 14, 2648. [Google Scholar] [CrossRef]
  2. Hummel, H.I.; van der Mei, R.; Bhulai, S. A survey on machine learning in ship radiated noise. Ocean. Eng. 2024, 298, 117252. [Google Scholar] [CrossRef]
  3. Koh, S.; Chia, C.S.; Tan, B.A. Underwater signal denoising using deep learning approach. In Proceedings of the Global Oceans 2020: Singapore–US Gulf Coast, Singapore, 5–30 October 2020; pp. 1–6. [Google Scholar]
  4. Zhu, S.; Zhang, G.; Wu, D.; Jia, L.; Zhang, Y.; Geng, Y.; Liu, Y.; Ren, W.; Zhang, W. High Signal-to-Noise Ratio MEMS Noise Listener for Ship Noise Detection. Remote Sens. 2023, 15, 777. [Google Scholar] [CrossRef]
  5. Du, L.; Wang, Z.; Lv, Z.; Han, D.; Wang, L.; Yu, F.; Lan, Q. A Method for Underwater Acoustic Target Recognition Based on the Delay-Doppler Joint Feature. Remote Sens. 2024, 16, 2005. [Google Scholar] [CrossRef]
  6. Zhu, X.; Dong, H.; Salvo Rossi, P.; Landrø, M. Feature Selection Based on Principal Component Regression for Underwater Source Localization by Deep Learning. Remote Sens. 2021, 13, 1486. [Google Scholar] [CrossRef]
  7. Wang, M.; Qiu, B.; Zhu, Z.; Ma, L.; Zhou, C. Passive tracking of underwater acoustic targets based on multi-beam LOFAR and deep learning. PLoS ONE 2022, 17, e0273898. [Google Scholar] [CrossRef]
  8. Boll, S. Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process. 1979, 27, 113–120. [Google Scholar] [CrossRef]
  9. Chen, J.; Benesty, J.; Huang, Y.; Doclo, S. New insights into the noise reduction Wiener filter. IEEE Trans. Audio Speech Lang. Process. 2006, 14, 1218–1234. [Google Scholar] [CrossRef]
  10. Weiss, L.G.; Dixon, T.L. Wavelet-based denoising of underwater acoustic signals. J. Acoust. Soc. Am. 1997, 101, 377–383. [Google Scholar] [CrossRef]
  11. Alter, O.; Brown, P.O.; Botstein, D. Singular value decomposition for genome-wide expression data processing and modeling. Proc. Natl. Acad. Sci. USA 2000, 97, 10101–10106. [Google Scholar] [CrossRef]
  12. Li, Y.; Wang, L. A novel noise reduction technique for underwater acoustic signals based on complete ensemble empirical mode decomposition with adaptive noise, minimum mean square variance criterion and least mean square adaptive filter. Def. Technol. 2020, 16, 543–554. [Google Scholar] [CrossRef]
  13. Liu, Y.; Niu, H.; Li, Z. A multi-task learning convolutional neural network for source localization in deep ocean. J. Acoust. Soc. Am. 2020, 148, 873–883. [Google Scholar] [CrossRef] [PubMed]
  14. Hu, R.; Monebhurrun, V.; Himeno, R.; Yokota, H.; Costen, F. An uncertainty analysis on finite difference time-domain computations with artificial neural networks: Improving accuracy while maintaining low computational costs. IEEE Antennas Propag. Mag. 2022, 65, 60–70. [Google Scholar] [CrossRef]
  15. Le, X.; Chen, H.; Chen, K.; Lu, J. DPCRN: Dual-path convolution recurrent network for single channel speech enhancement. arXiv 2021, arXiv:2107.05429. [Google Scholar]
  16. Song, R.; Feng, X.; Wang, J.; Sun, H.; Zhou, M.; Esmaiel, H. Underwater Acoustic Nonlinear Blind Ship Noise Separation Using Recurrent Attention Neural Networks. Remote Sens. 2024, 16, 653. [Google Scholar] [CrossRef]
  17. Luo, Y.; Chen, Z.; Yoshioka, T. Dual-path rnn: Efficient long sequence modeling for time-domain single-channel speech separation. In Proceedings of the ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtual, 4–9 May 2020; pp. 46–50. [Google Scholar]
  18. Zhou, A.; Zhang, W.; Li, X.; Xu, G.; Zhang, B.; Ma, Y.; Song, J. A Novel Noise-Aware Deep Learning Model for Underwater Acoustic Denoising. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–13. [Google Scholar] [CrossRef]
  19. Zhou, A.; Zhang, W.; Xu, G.; Li, X.; Deng, K.; Song, J. dBSA-Net: Dual Branch Self-Attention Network for Underwater Acoustic Signal Denoising. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 31, 1851–1865. [Google Scholar] [CrossRef]
  20. Zhou, A.; Zhang, W.; Li, X.; Xu, G.; Zhang, B.; Song, J. Noise-Aware Subband Attention Network for Underwater Acoustic Signal Denoising. In Proceedings of the 2022 IEEE Smartworld, Ubiquitous Intelligence & Computing, Scalable Computing & Communications, Digital Twin, Privacy Computing, Metaverse, Autonomous & Trusted Vehicles (Smartworld/UIC/ScalCom/DigitalTwin/PriComp/Meta), Haikou, China, 15–18 December 2022; pp. 610–617. [Google Scholar]
  21. Zhou, W.; Li, J. Self-Noise Suppression for AUV without Clean Data: A Noise2Noise Approach. In Proceedings of the 2023 IEEE Underwater Technology (UT), Tokyo, Japan, 6–9 March 2023; pp. 1–5. [Google Scholar]
  22. Wang, X.; Zhao, Y.; Teng, X.; Sun, W. A stacked convolutional sparse denoising autoencoder model for underwater heterogeneous information data. Appl. Acoust. 2020, 167, 107391. [Google Scholar] [CrossRef]
  23. Russo, P.; Di Ciaccio, F.; Troisi, S. DANAE: A denoising autoencoder for underwater attitude estimation. arXiv 2020, arXiv:2011.06853. [Google Scholar]
  24. Testolin, A.; Diamant, R. Underwater acoustic detection and localization with a convolutional denoising autoencoder. In Proceedings of the 2019 IEEE 8th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), Piscataway, NJ, USA, 15–18 December 2019; pp. 281–285. [Google Scholar]
  25. Pascual, S.; Bonafonte, A.; Serra, J. SEGAN: Speech enhancement generative adversarial network. arXiv 2017, arXiv:1703.09452. [Google Scholar]
  26. Dang, F.; Chen, H.; Zhang, P. DPT-FSNet: Dual-path transformer based full-band and sub-band fusion network for speech enhancement. In Proceedings of the ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtual, 7–13 May 2022; pp. 6857–6861. [Google Scholar]
  27. Santos-Domínguez, D.; Torres-Guijarro, S.; Cardenal-López, A.; Pena-Gimenez, A. Shipsear: An underwater vessel noise database. Appl. Acoust. 2016, 113, 64–69. [Google Scholar] [CrossRef]
Figure 1. The T-F domain of clean signal and its mixed signal after adding noise.
Figure 2. The PSD of clean signal and its mixed signal after adding noise.
Figure 3. The architecture of the proposed DS_FCTNet model.
Figure 4. The diagram of the encoder. Above is the amplitude encoding layer, and below is the phase encoding layer.
Figure 5. The diagram of DSB, including the amplitude-stream block, phase-stream block, and communication.
Figure 6. The diagram of FCT.
Figure 7. The diagram of the decoder. Above is the amplitude decoding layer, and below is the phase decoding layer.
Figure 8. The arrangement of three DSB modules.
Figure 9. Comparison of the clean signal with the denoised signal using ablation experiments in the T-F domain.
Figure 10. Comparison of the clean signal with the denoised signal using ablation experiments in PSD.
Figure 11. Comparison of the clean signal with the denoised signal using methods in the T-F domain on Dataset-I.
Figure 12. Comparison of the clean signal with the denoised signal using methods in PSD on Dataset-I.
Figure 13. Comparison of the clean signal with the denoised signal using methods in the time domain on Dataset-I.
Figure 14. Comparison of the clean signal with the denoised signal using methods in the T-F domain on Dataset-II.
Figure 15. Comparison of the clean signal with the denoised signal using methods in PSD on Dataset-II.
Figure 16. Comparison of the clean signal with the denoised signal using methods for unknown ship types in PSD on Dataset-III.
Figure 17. Comparison of the clean signal with the denoised signal using methods for unknown ship types in the T-F domain on Dataset-III.
Table 1. The parameters of the network and experiment.

Parameter                            Value
Window length of STFT                512 samples
Frame shift of STFT                  256 samples
Channels of amplitude block (C_A)    96
C_R                                  8
Channels of phase block (C_P)        48
Epochs                               150
Batch size                           8
Learning rate (warm-up)              0.0005
Optimizer                            Adam
Table 2. Model introduction of ablation experiments.

Model                    Description
Without FCT              DS_FCTNet applied with three DSBs and BiLSTM, without FCT.
With one DSB             DS_FCTNet applied with FCT, one DSB, BiLSTM, and communication.
With two DSBs            DS_FCTNet applied with FCT, two DSBs, and BiLSTM.
Without communication    DS_FCTNet applied with FCT, one DSB, and BiLSTM, without communication.
With BiLSTM              DS_FCTNet applied with FCT, three DSBs, and BiLSTM.
DS_FCTNet (ours)         DS_FCTNet applied with FCT and three DSBs, without BiLSTM.
Table 3. Quantitative comparisons of ablation experiments on Dataset-I.

Methods                  −5 dB            −10 dB           −15 dB
                         SDR     SSNR     SDR     SSNR     SDR     SSNR
Noise                    −4.99   −3.75    −9.77   −8.17    −14.71  −12.91
Without FCT              3.20    3.27     1.90    2.03     1.14    1.12
With one DSB             3.97    4.07     3.01    3.26     2.51    2.50
With two DSBs            4.69    4.80     4.08    4.12     3.88    3.91
Without communication    3.86    3.94     2.75    2.95     2.09    2.08
With BiLSTM              4.96    5.08     4.44    4.82     4.36    4.41
DS_FCTNet (ours)         5.28    5.40     4.79    5.19     4.73    4.78
Table 4. Quantitative comparisons of methods on Dataset-I.

Methods                  −5 dB            −10 dB           −15 dB
                         SDR     SSNR     SDR     SSNR     SDR     SSNR
Noise                    −4.99   −3.75    −9.77   −8.17    −14.71  −12.91
Wavelet                  −1.10   0.17     −5.17   −2.36    −7.57   −4.55
Spectral-Subtraction     −0.88   −0.68    −3.21   −2.42    −6.10   −5.13
Wiener                   0.82    1.33     −3.11   −0.96    −6.27   −4.74
SEGAN                    2.69    2.74     1.50    1.54     0.80    0.77
DPT_FSNET                3.18    3.18     1.83    1.93     1.04    1.01
DS_FCTNet                5.28    5.40     4.79    5.19     4.73    4.78
Table 5. Quantitative comparisons of methods on Dataset-II.

Methods                  −5 dB            −10 dB           −15 dB
                         SDRi    SSNRi    SDRi    SSNRi    SDRi    SSNRi
Wavelet                  5.25    5.36     8.41    8.85     10.34   10.16
Spectral-Subtraction     4.58    4.53     18.06   8.20     9.73    8.42
Wiener                   6.43    6.50     12.48   7.63     7.16    5.79
SEGAN                    7.48    7.46     11.76   12.09    16.94   15.94
DPT_FSNET                7.52    7.51     11.39   11.70    17.23   16.68
DS_FCTNet                8.42    8.57     13.04   13.35    18.85   19.39
Table 6. Quantitative comparisons of methods on Dataset-III.

Methods                  −5 dB            −10 dB           −15 dB
                         SDR     SSNR     SDR     SSNR     SDR     SSNR
Noise                    −5.00   −4.64    −10.00  −9.56    −15.00  −13.63
Wavelet                  0.11    1.06     −4.51   −2.59    −8.71   −6.12
Spectral-Subtraction     −0.68   −0.75    −2.92   −2.83    −6.28   −5.59
Wiener                   0.87    1.22     −2.14   −1.40    −6.48   −5.15
SEGAN                    2.25    2.48     1.15    1.40     0.62    0.47
DPT_FSNET                3.16    3.29     3.07    3.28     2.55    2.64
DS_FCTNet                5.98    6.39     6.36    6.71     5.51    5.38
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
