1. Introduction
Line spectrum purification finds extensive application in underwater acoustic target detection and array signal processing [1,2,3,4,5]. It aims to obtain purified target line spectrum features by removing superimposed noise and interference, playing a significant role in underwater detection, identification, and localization [6,7]. In practical scenarios involving hydrophone arrays for underwater target detection, targets of interest are often contaminated by strong interference. As a result, the extracted frequency-domain features of the target are mixed with interference features. To mitigate the impact of interference on target line spectra, mainstream approaches rely on array interference suppression techniques. Examples include null-broadening methods [8,9], second-order cone programming (SOCP) null-steering beamforming [10,11], and eigenspace-based interference suppression methods [12,13,14,15].
Spatial interference suppression methods typically require angular separation between the interference and the target. Otherwise, if this separation is smaller than the Rayleigh resolution limit of the array, the suppression capability and line spectrum purification performance degrade significantly. For null-steering methods, designing the null beam is a global optimization task. When determining the depth and width of the null, factors such as sidelobe levels and mainlobe width must be comprehensively considered to ensure sufficient algorithm robustness. Consequently, under low signal-to-interference ratio (SIR) conditions, the designed beam may lack sufficient null depth or width. This can lead to leakage of interference energy, thereby reducing the effectiveness of line spectrum purification. Eigenspace-based methods rely on eigendecomposition of the covariance matrix to separate the signal and interference subspaces. In practical scenarios, spectral overlap between the target and interference signals can cause subspace mixing, leading to residual interference and suboptimal purification results.
In operational underwater detection scenarios, array conditions are often non-ideal due to factors such as installation errors and inconsistencies in the amplitude and phase responses across array channels [16]. Generally, traditional line spectrum purification methods are model-based [17]. These methods depend on an ideal array model. When the actual array exhibits imperfections, the interference suppression capability diminishes, resulting in degraded purification performance. In contrast, the approach introduced here employs machine learning. Machine learning methods operate in a data-driven manner and do not depend on a priori assumptions about the array model. Therefore, they demonstrate stronger robustness against array imperfections encountered in practice. Beyond array-based spatial processing, probabilistic model-based approaches such as hidden Markov models (HMM) have also been recently refined for weak line spectrum extraction [18], demonstrating improved detection performance under low signal-to-noise ratio (SNR) conditions. However, these methods often rely on prior assumptions about spectral line dynamics and require careful parameter tuning, which can limit their generalization across diverse scenarios.
Researchers have thus explored machine learning solutions. Some adopt a supervised learning paradigm to directly learn the mapping from data to physical features [
19,
20,
21,
22]. The underlying logic of supervised learning is to forcibly establish a statistical association between signal mixture patterns and target features using massive labeled data. Its essence is to approximate the “target–interference” distribution boundary in high-dimensional time–frequency space in a data-driven manner. The core assumption is that the spectra of the target and interference have implicit separability in the time–frequency domain. This separability can be automatically extracted through the nonlinear transformations of machine learning. Supervised learning heavily relies on labeled data, and labeling errors significantly impact system performance. Among recent deep learning approaches, diffusion models have been introduced for the recovery of multiple frequency lines in underwater acoustic signals [
23], achieving high-quality restoration by learning the reverse process of additive degradation. Meanwhile, dual-stream network architectures that separately predict amplitude and phase masks have been proposed for underwater acoustic denoising [
24], demonstrating improved signal reconstruction by exchanging information between amplitude and phase branches.
Other scholars utilize unsupervised learning paradigms to find order within data chaos [
25]. Unsupervised learning avoids the limitations of labeled data by using self-consistent optimization objectives (like generative adversarial loss or orthogonal constraints). This approach forces the model to uncover the underlying separation rules between target and interference within the mixed signal. The core assumption of unsupervised learning is that a separable boundary (e.g., statistical independence or distributional difference) exists between target and interference in latent space, and this boundary can be captured by machine learning. Although unsupervised learning does not require labeled data, its underlying decoupling assumption may be physically invalid. In real scenarios, ship targets and interference noise are not strictly independent. For instance, performance degrades markedly when interference and target spectral lines overlap. Furthermore, unsupervised models often require iterative optimization, leading to higher computational complexity.
This paper proposes a dual-input DenseBlock-based U-net architecture for line spectrum purification. The U-net framework was introduced by Ronneberger and colleagues in 2015 [26]. It is a matrix-to-matrix neural network whose output is a two-dimensional matrix rather than a one-dimensional classification vector. The DenseBlock is the primary building block of DenseNets, proposed by Huang et al. in 2017 [27]. Within a DenseBlock, each layer is directly connected to every subsequent layer. The DenseNet design promotes feature reuse, enhances feature flow, lowers parameter counts, and yields reduced error rates compared to traditional convolutional architectures.
Herein, two key improvements are made to the traditional U-net network to adapt it to the line spectrum purification problem: 1. The traditional two-dimensional convolutional layers and up-convolutional layers in the original U-net are substituted with DenseBlocks; and 2. the number of network inputs has been expanded from one to two two-dimensional matrices. The inputs are the time–frequency feature of the mixture (containing interference, target, and noise) and the time–frequency feature of the interference (containing only interference and noise). Through supervised learning, the network learns the differential component between these two inputs, corresponding to the target’s time–frequency features. Consequently, this method suppresses interference and noise, outputting a purified target line spectrum.
DenseBlock improves feature propagation efficiency. Consequently, the proposed approach exhibits strong generalization capability and robustness when the interference is relatively stationary. Given a defined set of signal and network models, a limited training dataset proves adequate. Moreover, the model is trainable using simulation datasets and subsequently deployed to real-world data. This helps mitigate the strong dependency on labeled real data in underwater target detection. Additionally, the output matrix reflects not only the frequency but also the intensity of the target line spectrum, thereby delivering more valuable insights for subsequent data processing.
To summarize, the main contributions of this paper are as follows:
A dual-input Dense U-net architecture is proposed, where DenseBlocks replace conventional convolutional layers to enhance feature propagation and reduce the parameter count;
The network simultaneously takes the time–frequency features of interference and the interference–target mixture as inputs, enabling it to learn interference suppression through supervised learning and output a purified target line spectrum;
Compared with traditional model-driven interference suppression methods, the proposed approach achieves an output SINR improvement of more than 8 dB under low SINR conditions and exhibits significantly stronger robustness to array position errors;
The network can be trained on simulation data and applied directly to real data, effectively addressing the scarcity of labeled samples in underwater acoustic detection.
Simulations are conducted to support these advantages. Experimental results provide further validation.
The rest of this paper is structured as follows:
Section 2 describes the signal and array model formulations.
Section 3 presents the dual-input DenseBlock-based U-net framework.
Section 4 employs simulation data to illustrate the proposed method’s advantages.
Section 5 utilizes experimental data to validate the proposed method.
Section 6 summarizes the main conclusions.
2. Problem Statement
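As shown in Figure 1, we consider a uniform linear array (ULA) consisting of $M$ elements composed of omnidirectional hydrophones. One interference signal and one target signal impinge upon the array from the far field under the plane-wave assumption. Consequently, the directions of arrival (DOA) for the interference and the target are denoted as $\theta_i$ and $\theta_t$, respectively.
When both interference and target are present, the received signal at the $m$-th element is denoted as $x_m(n)$. $x_m(n)$ is a broadband signal. First, an $L$-point Discrete Fourier Transform (DFT) is applied, as shown in Equation (1):
$$X_m(k) = \sum_{n=0}^{L-1} x_m(n)\, e^{-j 2\pi k n / L} \tag{1}$$
where $k$ is the frequency bin index, corresponding to the frequency ($f_s$ is the sampling frequency):
$$f_k = \frac{k f_s}{L} \tag{2}$$
The frequency-domain data received by the array can be expressed as:
$$\mathbf{X}(k) = [X_1(k), X_2(k), \ldots, X_M(k)]^{T} \tag{3}$$
The beamformer output $Y(k)$ for each subband is given by Equation (4):
$$Y(k) = \mathbf{w}^{H}(k)\, \mathbf{X}(k) \tag{4}$$
where $\mathbf{w}(k)$ is the narrowband beamforming weight vector. Different beam designs correspond to different weights. For example, for conventional beamforming (CBF), we have:
$$\mathbf{w}_{\mathrm{CBF}}(k) = \frac{1}{M}\, \mathbf{a}(f_k, \theta) \tag{5}$$
$$\mathbf{a}(f_k, \theta) = \left[1,\; e^{-j 2\pi f_k d \sin\theta / c},\; \ldots,\; e^{-j 2\pi f_k (M-1) d \sin\theta / c}\right]^{T} \tag{6}$$
$\mathbf{a}(f_k, \theta)$ is termed the array manifold vector, where $d$ is the ULA element spacing, $c$ is the sound speed, and $\theta$ is the beam steering direction.
Applying the inverse DFT (IDFT) to $Y(k)$ yields the beam’s time-domain output:
$$y(n) = \frac{1}{L} \sum_{k=0}^{L-1} Y(k)\, e^{j 2\pi k n / L} \tag{7}$$
The power spectrum of $y(n)$ is then estimated using Welch’s method. First, $y(n)$ is divided into $K$ overlapping segments of length $N$, denoted as $y_i(n)$, $i = 1, \ldots, K$. The power spectrum for each segment is computed using Equation (8):
$$P_i(f) = \frac{1}{N U} \left| \sum_{n=0}^{N-1} y_i(n)\, w(n)\, e^{-j 2\pi f n / N} \right|^{2}, \quad f = 0, 1, \ldots, F-1 \tag{8}$$
where $F$ is the number of frequency bins in the positive spectrum, and the normalization factor $U$ is defined in Equation (9) to ensure the power spectrum is asymptotically unbiased:
$$U = \frac{1}{N} \sum_{n=0}^{N-1} w^{2}(n) \tag{9}$$
$w(n)$ is the window function. The final power spectrum of $y(n)$ is:
$$P(f) = \frac{1}{K} \sum_{i=1}^{K} P_i(f) \tag{10}$$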
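The processing chain of this section (per-channel DFT, per-bin CBF weighting, IDFT, then Welch averaging) can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names, the Hann window from `np.hanning`, and the toy 200 Hz plane-wave scene are assumptions introduced here, and a one-sided DFT is used because the hydrophone data are real.

```python
import numpy as np

def cbf_beam_output(x, fs, theta_deg, d, c=1500.0):
    """Frequency-domain CBF for a ULA.

    x: real array of shape (M, L), one row per hydrophone.
    Returns the real beam time series y(n) of length L.
    """
    M, L = x.shape
    X = np.fft.rfft(x, axis=1)                    # X_m(k), one-sided because x is real
    f = np.fft.rfftfreq(L, d=1.0 / fs)            # f_k = k * fs / L
    m = np.arange(M)[:, None]                     # element index 0..M-1
    delay = m * d * np.sin(np.deg2rad(theta_deg)) / c
    A = np.exp(-2j * np.pi * f[None, :] * delay)  # manifold a(f_k, theta) for every bin
    Y = np.sum(np.conj(A) * X, axis=0) / M        # Y(k) = w^H(k) X(k) with w = a / M
    return np.fft.irfft(Y, n=L)                   # back to the time domain

def welch_power_spectrum(y, N, hop):
    """Average the windowed periodograms of overlapping length-N segments."""
    w = np.hanning(N)
    U = np.mean(w ** 2)                           # window power normalization
    starts = range(0, len(y) - N + 1, hop)
    segs = np.stack([y[s:s + N] * w for s in starts])
    P = np.abs(np.fft.rfft(segs, axis=1)) ** 2 / (N * U)
    return P.mean(axis=0)

# Toy scene (all values hypothetical): a 200 Hz tone arriving from -45 degrees.
fs, M, d, c, L = 2048.0, 20, 2.0, 1500.0, 4096
t = np.arange(L) / fs
tau = np.arange(M)[:, None] * d * np.sin(np.deg2rad(-45.0)) / c
x = np.cos(2 * np.pi * 200.0 * (t[None, :] - tau))

N = 1024
freqs = np.fft.rfftfreq(N, d=1.0 / fs)
P_on = welch_power_spectrum(cbf_beam_output(x, fs, -45.0, d, c), N, N // 2)
P_off = welch_power_spectrum(cbf_beam_output(x, fs, 30.0, d, c), N, N // 2)
peak_hz = freqs[np.argmax(P_on)]
```

Steering at the arrival direction keeps the tone at full strength, while steering at 30° attenuates it strongly. This is the spatial selectivity that the later sections rely on when forming the interference and mixture features.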
3. U-Net Network Structure
3.1. Preprocessing
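The long-duration received signal $\mathbf{X}$ is segmented into $R$ fragments. The signal for each segment is denoted as $\mathbf{X}_r$:
$$\mathbf{X}_r = [\mathbf{x}_{1,r}, \mathbf{x}_{2,r}, \ldots, \mathbf{x}_{M,r}]^{T}, \quad r = 1, \ldots, R \tag{11}$$
where $\mathbf{x}_{m,r} \in \mathbb{R}^{1 \times N}$ represents the received signal of the $m$-th hydrophone within the $r$-th segment, and $N$ denotes the number of snapshots.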
Within each signal segment, the directions of the interference and target are assumed to be constant. This assumption is based on the slow time-varying nature of underwater acoustic signals. In scenarios with faster signal variation, the segment duration can be shortened to maintain time-invariance, at the cost of reduced frequency resolution.
When the beam is steered towards the interference direction $\theta_i$, the corresponding array manifold vector $\mathbf{a}(f_k, \theta_i)$ and weight vector $\mathbf{w}(k, \theta_i)$ are obtained. Within the defined frequency band, the power spectrum steered towards $\theta_i$, denoted as $\mathbf{P}_r(\theta_i)$, can be estimated. Concatenating all $\mathbf{P}_r(\theta_i)$ yields the time–frequency record of the power spectrum in the interference direction, i.e., the interference’s time–frequency feature:
$$\mathbf{P}_I = [\mathbf{P}_1(\theta_i), \mathbf{P}_2(\theta_i), \ldots, \mathbf{P}_R(\theta_i)] \tag{12}$$
Similarly, when the beam is steered towards the target direction $\theta_t$, the time–frequency record of the power spectrum at the target direction, $\mathbf{P}_T$, is obtained.
In practical underwater detection environments, weak targets are often masked by strong interference. This causes the calculated $\mathbf{P}_T$ to contain features of the interference signal. This paper employs a dual-input U-net network. Given $\mathbf{P}_I$ and $\mathbf{P}_T$, the network removes the contamination of interference and noise from $\mathbf{P}_T$, leaving behind the purified target line spectrum features.
3.2. Network Architecture
To obtain fixed-size inputs for the network, a patch-based strategy is applied to the time–frequency feature matrices obtained from preprocessing. Each matrix is divided into non-overlapping patches of size 480 × 480. This size is chosen because it is divisible by powers of two (ensuring compatibility with the U-net architecture) and offers a practical trade-off between computational efficiency and preservation of fine-grained time–frequency details.
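The patch-based strategy can be illustrated with a short NumPy sketch. The text does not say how edges that do not fill a whole patch are handled, so the zero-padding below is an assumption, as are the function name and the example matrix shape.

```python
import numpy as np

def split_into_patches(tf, patch=480):
    """Split a 2-D time-frequency matrix into non-overlapping patch x patch tiles.

    Assumption: rows and columns that do not fill a whole tile are zero-padded.
    Returns an array of shape (num_tiles, patch, patch) in row-major tile order.
    """
    rows, cols = tf.shape
    pad_r = (-rows) % patch
    pad_c = (-cols) % patch
    padded = np.pad(tf, ((0, pad_r), (0, pad_c)))
    nr, nc = padded.shape[0] // patch, padded.shape[1] // patch
    tiles = padded.reshape(nr, patch, nc, patch).swapaxes(1, 2)
    return tiles.reshape(nr * nc, patch, patch)

# Hypothetical feature matrix: 500 time segments by 1875 frequency bins.
tf = np.arange(500 * 1875, dtype=float).reshape(500, 1875)
patches = split_into_patches(tf)
```

With three pooling stages, each side of a patch must be divisible by 8; 480 satisfies this (480 = 8 × 60).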
A novel U-net neural network structure is proposed, where traditional 2D convolutional layers are replaced with DenseBlocks. To enhance network performance, a Batch Normalization (BN) layer [28] and an activation function are added before each convolutional layer. The BN layer, activation function layer, and convolutional layer together form a composite function. Within the same DenseBlock, the output of each composite function is connected via shortcuts to the inputs of all subsequent layers. We define the growth rate $k$ of a DenseBlock as the number of feature maps produced by each convolutional layer. Generally, for a DenseBlock comprising $L$ layers with $k_0$ input channels, the resulting output feature maps contain $k_0 + k \times (L - 1)$ channels. In this paper, $k$ is set to 16 and $L$ is set to 3. The structure of a DenseBlock is illustrated in Figure 2.
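The channel bookkeeping inside a DenseBlock can be checked with a few lines of Python. The helper name and the input width of 32 channels are hypothetical; only the growth rate of 16 and the layer count of 3 come from the text. The sketch lists the number of channels entering each composite function, so its last entry equals the expression $k_0 + k \times (L - 1)$; if the output of the last composite function is also concatenated, the stack grows to $k_0 + k \times L$.

```python
def dense_block_input_channels(k0, k, num_layers):
    """Channels seen by each composite function in a DenseBlock.

    Layer l (1-based) receives the block input plus the k feature maps
    produced by each of the l - 1 preceding layers.
    """
    return [k0 + k * (layer - 1) for layer in range(1, num_layers + 1)]

k0 = 32  # hypothetical number of input channels
per_layer = dense_block_input_channels(k0, k=16, num_layers=3)
entering_last_layer = per_layer[-1]          # k0 + k * (L - 1)
with_last_output = entering_last_layer + 16  # k0 + k * L
```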
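The relationship between the input $\mathbf{x}$ and output $\mathbf{y}$ of a DenseBlock can be expressed as:
$$\mathbf{x}_\ell = H_\ell\left([\mathbf{x}_0, \mathbf{x}_1, \ldots, \mathbf{x}_{\ell-1}]\right), \quad \ell = 1, \ldots, L, \quad \mathbf{x}_0 = \mathbf{x} \tag{13}$$
where $H_\ell(\cdot)$ represents the composite function, $[\cdot]$ denotes channel-wise concatenation, and $\mathbf{y}$ is formed by concatenating the resulting feature maps. Inside each composite function block, the BN layer performs normalization on all feature maps within each batch. This accelerates network convergence and effectively mitigates gradient explosion and vanishing gradient problems. The BN layer is succeeded by an activation unit along with a 3 × 3 convolutional layer. This paper adopts the Rectified Linear Unit (ReLU) [29] for the activation function:
$$\mathrm{ReLU}(x) = \max(0, x) \tag{14}$$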
Additionally, the input feature map undergoes padding prior to convolution to guarantee that the input and output dimensions remain identical.
Using the described DenseBlocks, a U-net network is constructed, whose structure is shown in
Figure 3. Parameters for each part of the network are listed in
Table 1. Typically, a U-net network consists of a contracting path and an expanding path.
In this paper, the contracting path contains two 1 × 1 convolutional layers, two ReLU activation functions, four DenseBlocks, and three pooling layers. DenseBlocks perform feature extraction from incoming matrices. Pooling layers are employed to decrease the feature map dimensionality, thereby preventing overfitting.
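The effect of the three pooling layers on a 480 × 480 patch, and its reversal by the three upsampling layers of the expanding path, can be traced as follows. A resampling factor of 2 per layer is an assumption made for illustration, and the helper name is hypothetical.

```python
def unet_feature_sizes(size=480, n_levels=3, factor=2):
    """Trace the side length of a square feature map through the U-net.

    Assumes each pooling layer divides and each upsampling layer multiplies
    the side length by `factor` (2 is an assumption, not a Table 1 value).
    """
    down = [size]
    for _ in range(n_levels):
        if down[-1] % factor != 0:
            raise ValueError(f"side length {down[-1]} is not divisible by {factor}")
        down.append(down[-1] // factor)
    up = [down[-1]]
    for _ in range(n_levels):
        up.append(up[-1] * factor)
    return down, up

down_sizes, up_sizes = unet_feature_sizes()
```

Under this assumption a side length such as 500 would fail at the third pooling stage (125 is odd), which is one reason a patch size divisible by 8 is convenient.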
The expanding path contains three DenseBlocks, three upsampling layers, one 1 × 1 convolutional layer, and one Sigmoid activation function. DenseBlocks restore the feature information of the output matrix. Upsampling layers expand the dimensionality of feature maps so that they match the original input size. Simultaneously, three direct connections concatenate the outputs of the DenseBlocks in the contracting path with the inputs of the DenseBlocks in the expanding path. These direct connections help the network better preserve and utilize details from the input matrix while fully leveraging features from different hierarchical levels, thereby enhancing network performance and accuracy. However, these direct connections double the number of channels. Furthermore, DenseBlocks further increase the channel count, as mentioned earlier. Therefore, the upsampling layers also serve the important role of regulating the number of feature channels fed into subsequent DenseBlocks, preventing the feature maps from growing excessively thick. The final convolutional layer fuses features from all channels to generate a single output. A Sigmoid activation is adopted, defined as:
$$\mathrm{Sigmoid}(x) = \frac{1}{1 + e^{-x}} \tag{15}$$
Model training is performed via supervised learning using an Adam optimizer and mean squared error loss function, and detailed training hyperparameters are provided in
Section 4.1. MSE is selected for its direct correspondence to the regression nature of the task and its sensitivity to large errors, which helps suppress strong interference residuals.
The model in this paper considers only one interference and one target. However, the proposed method can be extended to multiple interference sources by merging the features of all interferences into one input or by increasing the number of input channels.
4. Simulations and Performance Analysis
4.1. Simulation Configuration
We consider a 20-element ULA with a 2 m spacing (with a sound speed of 1500 m/s, the half-wavelength corresponds to a frequency of approximately 375 Hz). A target signal and an interference signal are incident upon the array from the far field. A segment of real recorded ship noise is used as the interference signal. For more precise performance analysis, the target signal is simulated. Both signals contain line spectra and continuous spectra. The line spectra of the interference and target are distinguishable in frequency, but their continuous spectra overlap. The signal processing bandwidth is 0–960 Hz. The sampling frequency is set to 2048 Hz. Simulation noise is additive white Gaussian noise.
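The numbers in this configuration can be cross-checked with simple arithmetic. The sketch below uses only values stated in the text together with the 4000-point DFT length used for spectrum estimation; the variable names are illustrative.

```python
c = 1500.0      # sound speed in m/s
d = 2.0         # element spacing in m
fs = 2048       # sampling frequency in Hz
nfft = 4000     # DFT length in points
band_hz = 960   # upper edge of the processing band in Hz

half_wavelength_hz = c / (2.0 * d)       # frequency at which d equals half a wavelength
bin_spacing_hz = fs / nfft               # spacing between adjacent DFT bins
bins_in_band = band_hz * nfft // fs + 1  # bins from 0 Hz up to and including 960 Hz
```

The spacing equals half a wavelength at 375 Hz, the DFT bins are 0.512 Hz apart, and the 0–960 Hz band spans 1876 bins.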
A U-net network is trained on 200 samples with data augmentation [30,31,32]. Every sample comprises 500 time segments, and each time segment contains 1920 snapshots. For power spectrum estimation, a Hanning window with a length of 2048 points is applied to each segment, with a 50% overlap (1024 points). The DFT length is set to 4000 points. The training set covers different DOAs, SNRs, and target line spectra (varying numbers of spectral lines and frequencies). The network inputs are the mixed time–frequency feature and the interference time–frequency feature. The network output is the purified target line spectrum.
The network is trained using the Adam optimizer with an initial learning rate of 0.001, a batch size of five, and no learning rate scheduling. To prevent overfitting, early stopping is employed to dynamically control the number of training epochs: the validation loss is monitored, and training is terminated if the validation loss does not improve for 10 consecutive epochs (patience = 10), after which the network weights from the epoch with the lowest validation loss are restored. No explicit regularization techniques such as L2 regularization or Dropout are used, as early stopping itself serves as an implicit regularizer.
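The early-stopping rule described above can be sketched independently of any deep learning framework. In this illustration the training loop is replaced by a precomputed list of validation losses, and the function name is hypothetical; in a real loop the network weights would be saved whenever the best loss improves and restored on exit.

```python
def early_stopping(val_losses, patience=10):
    """Apply patience-based early stopping to a sequence of validation losses.

    Returns (best_epoch, best_loss, last_epoch_run). Epochs are 0-based.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: remember this epoch (weights would be snapshotted here).
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break  # no improvement for `patience` consecutive epochs
    return best_epoch, best_loss, epoch

# Hypothetical loss curve: improves for three epochs, then plateaus.
losses = [1.0, 0.8, 0.7] + [0.75] * 15
best_epoch, best_loss, stopped_at = early_stopping(losses, patience=10)
```

For this curve the best epoch is 2, and training stops at epoch 12 after ten consecutive epochs without improvement.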
An example of the proposed method is shown in Figure 4. Figure 4a,b show the two network inputs, Figure 4c shows the network output, and Figure 4d shows the supervised learning label, which represents the preset true target line spectrum. Visually, the proposed method demonstrates good purification capability. More in-depth analysis is presented below.
For performance comparison, three typical conventional interference suppression techniques are selected as baselines: second-order cone programming (SOCP) null-steering beamforming, minimum variance distortionless response (MVDR) null-broadening beamforming, and orthogonal projection (OP). The SOCP method, based on convex optimization, achieves robust interference suppression by constraining the null depth and width. The MVDR null-broadening method forms a wide null in the interference direction by applying a taper to the covariance matrix. The orthogonal projection method separates the signal and interference subspaces via eigendecomposition and cancels interference through subspace projection. These three methods represent the mainstream conventional interference suppression techniques in the optimization-based, adaptive beamforming, and subspace-based categories, respectively, providing a comprehensive comparison with the proposed data-driven approach.
Performance is evaluated using output SINR and normalized Hamming distance. Output SINR measures the purification capability from an energy perspective. Hamming distance quantifies the similarity between the network output feature map and the corresponding label. The computation procedure is as follows. First, the two-dimensional time–frequency feature map and its corresponding label are each converted into a binary hash sequence using the mean hash algorithm. Specifically, the mean gray value of all pixels in each feature map is calculated, and each pixel is compared with this mean: a pixel value greater than or equal to the mean is encoded as 1, otherwise it is encoded as 0, resulting in a binary hash sequence. The Hamming distance between the two hash sequences is then computed as the number of positions at which the corresponding bits differ. Finally, the normalized Hamming distance is obtained by dividing the Hamming distance by the total number of bits. Thus, a normalized Hamming distance closer to 1 indicates greater similarity between the two feature maps. To define the “zero point” of similarity, we compute the normalized distance between an output containing no target signal and the label, which serves as a reference baseline.
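The mean-hash and Hamming-distance procedure can be written in a few lines of NumPy. This is a sketch of the steps as described, with hypothetical function names and toy 8 × 8 maps. Under the literal definition, identical maps give a distance of 0; the complement 1 − d is the quantity that approaches 1 as the maps agree.

```python
import numpy as np

def mean_hash(feature_map):
    """Binary hash: 1 where a pixel is >= the mean of the whole map, else 0."""
    fm = np.asarray(feature_map, dtype=float)
    return (fm >= fm.mean()).astype(np.uint8).ravel()

def normalized_hamming_distance(a, b):
    """Fraction of hash bits that differ between two equally sized maps."""
    ha, hb = mean_hash(a), mean_hash(b)
    if ha.size != hb.size:
        raise ValueError("feature maps must have the same number of pixels")
    return np.count_nonzero(ha != hb) / ha.size

# Toy maps: a single spectral line at column 3 versus column 5.
label = np.zeros((8, 8))
label[:, 3] = 1.0
shifted = np.zeros((8, 8))
shifted[:, 5] = 1.0

d_same = normalized_hamming_distance(label, label)
d_shift = normalized_hamming_distance(label, shifted)
similarity_shift = 1.0 - d_shift
```

Here `d_same` is 0 and `d_shift` is 0.25, because 16 of the 64 hash bits differ when the line moves to another column.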
4.2. Interference Immunity Performance
The DOA for interference and target is −60.9° and −44.6°, respectively. For the SOCP method, a −30 dB null is formed between −56° and −66°. For the MVDR null-broadening method, the null is broadened over ±2° around the interference direction. For the U-net method, input samples are prepared according to
Section 2 and
Section 3 and fed into the network. The target in-band SNR is approximately 15 dB. By varying the interference energy, the input SINR ranges from −20 dB to 10 dB. The output SINR and normalized Hamming distance for each method are shown in
Figure 5.
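Varying the interference energy to reach a prescribed input SINR can be done by solving for a single gain. The sketch below assumes SINR is defined as target power over the sum of interference and noise powers, with the three components available separately; the function names and the random test signals are hypothetical.

```python
import numpy as np

def power(x):
    """Mean power of a signal."""
    return float(np.mean(np.abs(x) ** 2))

def interference_gain_for_sinr(target, interference, noise, sinr_db):
    """Gain g such that P_t / (g^2 * P_i + P_n) equals the requested SINR."""
    p_t, p_i, p_n = power(target), power(interference), power(noise)
    required_pi = p_t / (10.0 ** (sinr_db / 10.0)) - p_n
    if required_pi <= 0:
        raise ValueError("requested SINR is not reachable: it exceeds the SNR")
    return float(np.sqrt(required_pi / p_i))

def sinr_db(target, interference, noise):
    """SINR in dB from separately available components."""
    return 10.0 * np.log10(power(target) / (power(interference) + power(noise)))

rng = np.random.default_rng(0)
target = rng.standard_normal(4096)
interference = rng.standard_normal(4096)
noise = rng.standard_normal(4096) * 10.0 ** (-15.0 / 20.0)  # roughly 15 dB SNR

g = interference_gain_for_sinr(target, interference, noise, -20.0)
achieved_db = sinr_db(target, g * interference, noise)
```

Because the noise power is fixed, the reachable SINR is bounded above by the SNR, which is consistent with a sweep that stops at 10 dB for a 15 dB SNR.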
Figure 5a shows that the output SINR of all methods increases as the input SINR improves. When the input SINR is greater than −5 dB, the U-net method achieves an output SINR approximately 5 dB higher than traditional methods. When the input SINR decreases to −20 dB, the output SINR gap widens to 8–14 dB.
Figure 5b indicates that the estimation accuracy of all methods decreases with lower SINR. However, the decline for the U-net method is less pronounced, and its accuracy remains relatively high even under low SINR conditions. Therefore, the U-net method exhibits superior interference immunity compared to the traditional methods.
4.3. Robustness Against Array Imperfections
In real-world scenarios, the geometry of a hydrophone array may deviate from its nominal design. Element position errors introduce phase errors into the array manifold, affecting algorithm performance. Traditional interference suppression methods rely heavily on the ideal array model. Conversely, the U-net approach is inherently data-driven, which gives it greater robustness to array imperfections.
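For broadband signals, element position errors are used to quantify array imperfections. Assume the array position error is $\delta$. For each array element, a small perturbation $\delta_m$ drawn from a uniform distribution over $[-\delta, +\delta]$ is added to its position. The perturbed array manifold vector is expressed as:
$$\tilde{\mathbf{a}}(f_k, \theta) = \left[e^{-j 2\pi f_k \delta_1 \sin\theta / c},\; e^{-j 2\pi f_k (d + \delta_2) \sin\theta / c},\; \ldots,\; e^{-j 2\pi f_k ((M-1)d + \delta_M) \sin\theta / c}\right]^{T} \tag{16}$$
Employing $\tilde{\mathbf{a}}(f_k, \theta)$, the interference time–frequency feature with array errors is derived as $\tilde{\mathbf{P}}_I$, and the interference–target mixture time–frequency feature with array errors is derived as $\tilde{\mathbf{P}}_T$. In the test set, $\tilde{\mathbf{P}}_I$ and $\tilde{\mathbf{P}}_T$ are input to the U-net.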
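The effect of position errors on a beamformer designed for the ideal geometry can be illustrated numerically. The sketch assumes that each perturbation acts along the array axis and that the CBF weights are computed from the nominal positions while the signal arrives through the perturbed array; the function names and the 300 Hz test frequency are hypothetical.

```python
import numpy as np

def manifold(positions, f, theta_deg, c=1500.0):
    """Array manifold vector for elements at the given positions along the array axis."""
    return np.exp(-2j * np.pi * f * positions * np.sin(np.deg2rad(theta_deg)) / c)

def cbf_gain_under_position_error(M, d, f, theta_deg, delta, rng):
    """Power gain of nominal CBF weights applied to a perturbed array.

    Each element position is shifted by a draw from U[-delta, +delta].
    A gain of 1 means the look direction is passed without loss.
    """
    nominal = np.arange(M) * d
    perturbed = nominal + rng.uniform(-delta, delta, size=M)
    w = manifold(nominal, f, theta_deg) / M     # weights from the ideal model
    a_true = manifold(perturbed, f, theta_deg)  # response of the actual array
    return float(np.abs(np.vdot(w, a_true)) ** 2)  # |w^H a_true|^2

rng = np.random.default_rng(0)
gain_ideal = cbf_gain_under_position_error(20, 2.0, 300.0, -60.9, 0.0, rng)
gain_1m = cbf_gain_under_position_error(20, 2.0, 300.0, -60.9, 1.0, rng)
```

With no error the look-direction gain is 1; with errors of up to 1 m the mismatch between assumed and actual phases lowers it, which is the mechanism behind the degradation of model-based methods discussed in this subsection.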
As
δ increases from 0 m to 1 m, the output SINR and normalized Hamming distance for each method are shown in
Figure 6. Samples with array errors were not included in the training set.
Figure 6a shows that the output SINR of all methods decreases as the array position error increases. At an error of 0.2 m, the U-net’s output SINR is approximately 8 dB higher than SOCP, 6 dB higher than MVDR, and 10 dB higher than OP. At an error of 1 m, these gaps are approximately 4 dB, 3 dB, and 6 dB, respectively.
Figure 6b indicates that the estimation accuracy of all methods decreases with larger array position errors. The decline in accuracy for the U-net method is less significant. Even under substantial array errors, the U-net method maintains a relatively high level of estimation accuracy.
The strong robustness of the U-net approach stems from three main reasons: Firstly, the U-net method uses the beam output signal $y(n)$, obtained via CBF, to compute the power spectrum. CBF is the most robust beamformer in the spatial domain in the presence of white Gaussian noise. Secondly, Welch’s method is chosen to estimate the power spectrum $P(f)$ from $y(n)$. In classical spectrum estimation scenarios with strong interference, Welch’s method effectively suppresses spectral leakage of strong interference and other random disturbances due to its segment averaging and windowing, exhibiting significantly higher robustness than traditional methods like the periodogram or Bartlett’s method. Finally, the U-net method processes two-dimensional time–frequency features as input and output, utilizing temporal information not accessible to traditional model-based approaches. Consequently, the proposed U-net method exhibits superior robustness to array mismatch.
4.4. Robustness Against Noise
Figure 7 depicts the output SINR and normalized Hamming distance as input SNRs increase from −20 dB to 10 dB.
Figure 7a indicates that as input SNR increases, the output SINR and estimation accuracy of all methods improve. At an SNR of −20 dB, the U-net’s output SINR is approximately 10 dB higher than SOCP, 15 dB higher than MVDR, and 25 dB higher than OP. When SNR increases to 10 dB, this gap narrows to about 3 dB.
Figure 7b shows that the U-net maintains high estimation accuracy even under low SNR conditions. Therefore, the U-net method demonstrates higher robustness to noise.
5. Experimental Results
An experiment is designed to further validate the algorithm’s performance. A 20-element ULA composed of omnidirectional hydrophones is deployed underwater at a depth of approximately 30 m. The element spacing is 2 m (with a sound speed of 1500 m/s, the half-wavelength corresponds to a frequency of approximately 375 Hz). Multiple ships are present in the water area. A strong signal from one ship is selected as interference. An independent weak signal is extracted as the target and superimposed onto the signals received at each array element, positioning the target direction close to the strong interference. The radiated signals from both interference and target are ship noise containing several line spectra and continuous spectra. The signal processing bandwidth is 0–960 Hz. The sampling frequency is set to 2048 Hz. The signal processing parameters (DFT length, window type, and overlap) are the same as those used in the simulation settings (
Section 4.1).
As in the previous section, simulated signals are still used for the target signal during neural network training. Both interference and target signals in the training set contain line spectra and continuous spectra. Their line spectra do not overlap in frequency. The training dataset comprises 400 simulation samples. Data augmentation is applied. Samples cover various DOAs and SNRs, along with simulated target signals featuring 8–12 spectral lines randomly distributed between 0 and 960 Hz. The interference and target signals in the test set are actual recorded ship-radiated noise. The DOAs of interference and target are −63.4° and −45.7°, respectively, with an SINR of approximately −15 dB. The SOCP method forms a −30 dB null between −59° and −69°. The MVDR null-broadening method broadens the null over ±2° around the interference direction.
One case from the real data is shown in
Figure 8.
Figure 8a shows the interference’s time–frequency feature, clearly displaying line spectra at 529 Hz, 775 Hz, 780 Hz, etc.
Figure 8b shows the time–frequency feature of the interference–target mixture. Comparing with
Figure 8a, target line spectra, at 230 Hz, 284 Hz, 391 Hz, 474 Hz, 746 Hz, etc., are discernible.
Figure 8c–f show the output time–frequency features of the SOCP, MVDR, OP, and U-net methods, respectively. The results indicate that the target contains seven distinct line spectra. All methods suppress the line spectra and continuous spectrum of the interfering ship to some extent. However, the SOCP output still retains residual interference components around 490–505 Hz and the two interference lines at 775 Hz and 780 Hz. The MVDR output exhibits an overall lower SINR. The OP output shows reduced target line spectrum intensity. The U-net output contains fewer interference and noise components, achieving higher output SINR and superior accuracy.
It is noteworthy that the number of line spectra present in the actual target falls outside the range covered by the training set. Nevertheless, the U-net performs well even in this untrained scenario. This outcome demonstrates the high generalization capability of the proposed method.
6. Discussion and Conclusions
This paper proposes a dual-input Dense U-net network to address the problem of underwater acoustic target line spectrum purification in strong interference backgrounds. The network input is modified from a single matrix to dual matrices. The time–frequency feature of the interference and the mixture are simultaneously fed into the network. Through supervised learning, the network learns to remove interference features from the mixture, achieving the goal of purifying the target features. The network architecture replaces traditional two-dimensional convolutional layers with two-dimensional DenseBlocks. DenseBlocks achieve more efficient parameter utilization compared to cascaded standard convolutional layers, thereby yielding excellent purification capability as well as strong generalization to unseen scenarios. The U-net method achieves higher output SINR and accuracy compared to traditional model-based purification methods. Furthermore, the proposed method exhibits significantly higher adaptability relative to model-based methods when facing array imperfections, which implies enhanced robustness. Moreover, this dual-input Dense U-net model can be trained on simulation datasets while also performing well on real-world data. This effectively addresses the scarcity of labeled data and the strong dependency of supervised learning on annotations in the field of underwater acoustic detection.
The method also has some limitations. The interference time–frequency feature and the mixture time–frequency feature input to the network are derived from different time segments. Consequently, if the interference features vary too rapidly over time, the network may fail to completely eliminate the interference influence, leading to reduced output SINR. Continuous spectrum components in the feature might be misclassified as line spectra, causing false alarms. In the future, we plan to mitigate false alarms by optimizing the training set. Additionally, compared to other convolutional network structures such as ResNet, DenseBlocks consume more GPU memory.
Regarding real-time processing capability, when the interference and target are separable in the spatial domain, the two input features can be obtained simultaneously from the same time segment, and the inference can be performed efficiently on modern hardware, supporting quasi-real-time operation. However, when the interference and target are not separable in angle, the interference feature must be extracted from a different time segment (a period containing only interference), which introduces additional latency and does not guarantee real-time performance.
To contextualize our method, we compare it with several recent approaches cited in the Introduction. Ma et al. [
18] proposed an improved hidden Markov model for weak spectral line extraction; their method relies on manually tuned parameters, whereas our data-driven approach learns the suppression mapping directly from simulation data. Shen et al. [
23] introduced a diffusion-based iterative denoising algorithm (ADPDM) for frequency line recovery; it treats the mixture as a degraded version of the target, while our dual-input architecture explicitly uses interference-only segments to learn the differential component. Gao et al. [
24] developed a dual-stream network (DS_FCTNet) for underwater acoustic denoising; both methods employ multi-stream designs, but ours focuses on the interference–mixture relationship rather than amplitude–phase synergy. Collectively, these comparisons highlight that the proposed dual-input strategy complements existing approaches and addresses scenarios where conventional spatial-processing methods are challenged.
The method proposed in this paper holds broad application prospects. Although it is developed for underwater acoustic line spectrum purification, the core idea of removing interference by learning the differential component between mixture and interference features is not limited to this domain. Similar challenges exist in other systems that operate under strong interference backgrounds. For instance, modern high-resolution high-frequency surface wave radar (HR-HFSWR) systems provide fine-grained range-Doppler (RD) maps but face more severe interference challenges due to their high resolution; they are often severely degraded by various types of interference, including clutter, transient interference, and directional noise, which can elevate the noise floor in specific azimuths and mask weak targets. Recent studies have explored data-driven and multi-domain feature-based methods for interference suppression in HR-HFSWR, such as residual regression networks for range-Doppler spectrum purification [
33], space-time cascaded processing for transient interference mitigation [
34], and multiple fourth-order cumulant-based approaches for directional noise cancelation [
35]. These works demonstrate that interference suppression methods leveraging learned features and multi-dimensional information have strong potential for cross-domain applicability. Therefore, the proposed dual-input Dense U-net method is expected to be transferable to HR-HFSWR and other similar high-resolution systems, providing a new perspective for interference suppression and weak signal extraction in diverse environments. Beyond underwater detection and identification, it can also be applied in radio communications, biomedical engineering, speech signal processing, and numerous other fields.
Future work will focus on several directions: (1) optimizing the network architecture (e.g., using lightweight DenseBlocks or model pruning) to reduce memory consumption and improve inference speed; (2) extending the method to scenarios with multiple interfering sources by incorporating multi-channel inputs or feature merging; (3) exploring weakly supervised or unsupervised learning paradigms to relax the requirement for interference-only segments; and (4) conducting extensive field trials under diverse environmental conditions to further validate robustness and generalization.