1. Introduction
The passive sonar detection of underwater targets is one of the key technologies for maritime surveillance and ocean engineering, particularly in the very-low-frequency (VLF) band [1]. Unlike active sonar, which emits acoustic pulses and listens for returning echoes, passive sonar relies only on receiving ship-radiated or submarine-radiated sound. This makes passive sonar inherently stealthier and indispensable for naval defense, anti-submarine warfare, and long-term ocean monitoring [2,3]. Experiments and modeling have shown that ship-radiated noise is well described as a combination of broadband noise and line spectral signatures [4,5,6]. These line spectral fingerprints arise from engine rotation, shaft-line dynamics, and propeller cavitation, and they are commonly exploited by algorithms for passive sonar detection, tracking, and classification.
However, the performance of VLF passive sonar detection is strongly hampered by the characteristics of ocean noise, which is often non-Gaussian, strongly non-stationary, and highly variable in time and space [7]. Numerous experimental and modeling studies have shown that the underwater acoustic noise environment is non-Gaussian and impulsive in nature [8,9,10,11]. Typical modeling approaches include the Gaussian Mixture (GM) model [12], the Generalized Gaussian (GG) distribution [13], and the symmetric α-stable (SαS) distribution model [14]. Typical measured maritime ambient noise contains a considerable amount of strong random impulsive interference, with energy concentrated mainly at lower frequencies. Consequently, classical detection methods are effective in relatively simple acoustic environments in the higher frequency band (>100 Hz) but struggle in the very-low-frequency band (<100 Hz), where shipping traffic and highly dynamic natural noise from wind, waves, and marine life dominate [15,16]. Moreover, the Gaussianity assumption no longer matches the statistical features of real underwater environments, resulting in sub-optimal performance or worse [17,18,19].
For non-Gaussian signal detection, it is known that the optimal receiver constructs nonlinear filters that approach the optimal nonlinear transfer function [20]. This indicates that detection theories and methods for non-Gaussian noise should be nonlinear. On this basis, nonlinear methods commonly adopted for weak signal detection, such as chaos-based methods [21], stochastic resonance [2,22,23], higher-order statistics [24], and fractal dimension analysis [25], have been applied to better deal with non-Gaussian noise. However, optimal nonlinear systems frequently have complex architectures, and the aforementioned nonlinear methods cannot, in general, approximate the optimal nonlinear transfer function; more complex nonlinear systems have therefore been adopted for potential performance enhancement [26].
In recent years, the rapid advancement of deep learning has provided new opportunities for sonar signal analysis. Deep neural networks can automatically learn hierarchical features from raw or transformed acoustic signals, removing the need for handcrafted feature engineering [27]. Convolutional Neural Networks (CNNs), in particular, are well suited for sonar detection, as spectrograms of acoustic signals can be treated as two-dimensional images [28]. CNNs have demonstrated a strong ability to extract discriminative time–frequency patterns, handle noise robustly, and generalize across multiple environments [29]. Several recent studies have highlighted the success of CNNs in underwater acoustics [30]. For instance, CNNs trained on Log-Mel spectrogram features have shown improved performance in classifying ship-radiated noise, leveraging the Mel scale’s emphasis on perceptually important low-frequency bands [31]. Hybrid models combining CNNs with recurrent networks, such as CNN-LSTM frameworks, have further improved classification by modeling both spatial and temporal dynamics [32]. Attention-based CNNs and adaptive filter architectures have also emerged, showing resilience to environmental variability and enabling finer-grained spectral feature learning [33]. To enhance the feature design of the CNN input, Harmonic–Percussive Source Separation (HPSS) has been adopted for sonar signal analysis. It decomposes spectrograms into harmonic components (stable tonal structures) and percussive components (transient broadband structures). When applied to underwater signals, it can therefore suppress impulsive noise interference while retaining the tonal features of ship-radiated noise, improving the detectability of moving ships [34,35]. These advancements establish CNNs as a powerful tool for sonar detection, but their performance ultimately depends on the quality of the input features, especially in the very-low-frequency band with non-Gaussian impulsive background noise.
Motivated by the aforementioned studies, a compelling solution emerges from hybrid feature design. By combining Log-Mel spectrograms and HPSS harmonic features, one can exploit both the perceptually meaningful low-frequency energy distribution and the stable harmonic structures contained within the noise. Log-Mel spectrograms emphasize the low-frequency bands important for VLF sonar, whereas HPSS harmonics provide robustness against transient and wideband disturbances. Building on this insight, the present study proposes a hybrid Log-Mel and HPSS-aided CNN framework for VLF passive sonar detection. The key contributions are as follows:
(1) A systematic non-Gaussian statistical analysis of deep sea VLF ocean noise is given using a week-long noise dataset.
(2) A hybrid Log-Mel and HPSS feature-aided deep CNN detection framework is designed to highlight detailed low-frequency features while removing impulsive noise interference.
(3) A 10-layer optimal CNN is trained and tested with the hybrid features and comprehensively compared against conventional STFT, Log-Mel, and HPSS features, demonstrating its superior performance with experimental verification.
(4) The study offers a robust solution to the long-standing challenges of VLF remote passive sonar detection, especially under non-Gaussian impulsive background noise.
The rest of the paper is organized as follows. Section 2 describes the construction of the training and test datasets and gives a comprehensive analysis of the non-Gaussian characteristics of deep sea very-low-frequency ambient noise. In Section 3, the hybrid Log-Mel and HPSS feature-aided deep CNN detection framework is presented, and a 10-layer optimal CNN is trained and tested with hybrid features. In Section 4, the detection performance is verified with a dedicated deep sea experiment in the South China Sea, in which a moving ship first passed abeam and then moved away, demonstrating the superior detection performance of the proposed method. Section 5 presents the discussion. Finally, Section 6 presents the conclusions.
  2. Dataset and Analysis
This section analyzes the dataset from an eight-day navigation experiment, partitions the collected signals, and applies time–frequency analysis methods to investigate both ship-radiated noise and ocean ambient noise.
  2.1. Dataset Overview
The dataset employed in this study was obtained from an eight-day intermittent navigation and sea trial conducted by a vessel in the South China Sea. A total of 165.67 h of acoustic data were collected during the trial. The receiving hydrophone was deployed at a depth of approximately 1700 m, and the sound pressure channel data from the vector hydrophone were selected as the experimental input for the passive detection of underwater targets.
  2.2. Overall Experimental Process
During the eight-day sea trial, the test vessel conducted reciprocal navigation around the receiving hydrophone X. A total of 994 data files were collected by the hydrophone, each representing a 10 min segment. Based on the experimental log, a navigation record table was compiled, as shown in Table 1. Each file name gives the start time of a recorded segment and represents the subsequent 10 min of hydrophone data. The table also documents the vessel’s operational status: a value of 1 indicates that the vessel was underway, whereas 0 indicates that it was drifting with the engine shut down. In addition, the experimental log annotates the presence of external interference, where a value of 1 indicates interference and a value of 0 indicates no interference. Using this navigation record, the trajectory of the test vessel was reconstructed, as illustrated in Figure 1.
Figure 2a,b present the time–frequency plots of ship-radiated noise and ocean environmental noise, respectively. These plots clearly demonstrate the substantial differences between ship-radiated noise and ocean environmental noise in the time–frequency domain. In Figure 2a, which corresponds to a beamwise pass during the vessel’s approach toward hydrophone X, ship-radiated noise exhibits distinct line spectra, harmonic components, and a clear acoustic signature. In particular, within the 50–350 Hz range, clear line spectral components are observed, with the most prominent line located around 250 Hz and accompanied by multiple modulated line spectra. In contrast, Figure 2b shows that the energy of ocean environmental noise is primarily concentrated below 100 Hz, with no discernible patterns or regularities.
   2.3. Dataset Division
To facilitate subsequent analyses, a threshold distance of 30 km was adopted as the criterion for dividing the data into Sample 1 (signal) and Sample 0 (noise). The labeling criteria for both categories are summarized in 
Table 2, and this classification was employed to determine the presence or absence of a signal.
According to the aforementioned classification criteria, 48 samples were labeled as “1” (signal), whereas 946 samples were labeled as “0” (noise). A subset of the samples also contained additional interference.
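A distance-based labeling rule of this kind can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes that the only criterion is the great-circle distance between the vessel and the hydrophone (the full criteria of Table 2 may include further conditions), and the function names and coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2.0) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2.0) ** 2
    return 2.0 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def label_segment(ship_pos, hydrophone_pos, threshold_km=30.0):
    """Return 1 (signal) if the ship is within the threshold distance, else 0 (noise)."""
    distance = haversine_km(*ship_pos, *hydrophone_pos)
    return 1 if distance <= threshold_km else 0


# Hypothetical positions (latitude, longitude), chosen only for illustration.
hydrophone = (20.0, 116.0)
near_ship = (20.1, 116.0)   # roughly 11 km away
far_ship = (20.5, 116.0)    # roughly 56 km away
print(label_segment(near_ship, hydrophone), label_segment(far_ship, hydrophone))
```

In practice the vessel position for each 10 min file would come from the navigation record, and the interference flag could be carried alongside the label.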
The acoustic signals acquired by the hydrophone exhibit pronounced non-stationarity and temporal variability. To enhance signal quality for subsequent feature extraction, several preprocessing steps were applied, including DC removal, normalization, framing with windowing (20 s observation windows with a 10 s overlap), and data augmentation. These procedures yielded higher-quality acoustic signal segments of ship-radiated noise and ocean ambient noise, making them more suitable for further analysis.
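The framing step can be illustrated with a short, dependency-free sketch. It assumes peak normalization and a hypothetical sampling rate of 1000 Hz, neither of which is stated in the text; the tapering window and the augmentation step are omitted.

```python
import math
from statistics import fmean


def preprocess_and_frame(signal, fs, win_s=20.0, hop_s=10.0):
    """Remove the DC offset, peak-normalize, and cut into overlapping frames."""
    mean = fmean(signal)
    centered = [x - mean for x in signal]              # DC removal
    peak = max(abs(x) for x in centered) or 1.0        # guard against an all-zero input
    normalized = [x / peak for x in centered]          # peak normalization (assumed)
    win = int(round(win_s * fs))                       # 20 s observation window
    hop = int(round(hop_s * fs))                       # 10 s hop, i.e. 10 s overlap
    return [normalized[i:i + win]
            for i in range(0, len(normalized) - win + 1, hop)]


# One 10 min file at a hypothetical sampling rate of 1000 Hz.
fs = 1000
recording = [0.5 + math.sin(0.01 * i) for i in range(600 * fs)]
frames = preprocess_and_frame(recording, fs)
print(len(frames), len(frames[0]))
```

With these settings a 600 s file yields (600 - 20) / 10 + 1 = 59 frames of 20 s each.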
As noted in 
Section 2.3 (Dataset Division), the dataset suffers from an imbalance between ship-radiated noise and ocean background noise, with the number of noise samples substantially exceeding that of signal samples. During training, this imbalance introduces bias toward the majority class, thereby degrading network performance. Furthermore, due to confidentiality constraints, acquiring ship-radiated noise samples is costly. Prior work has demonstrated that data augmentation can significantly improve the performance of neural networks under such conditions.
For this study, the data collected from 20:00 on 16 May to 00:00 on 17 May were used as continuous navigation test data for the network, while the remaining data were framed, windowed, and augmented. All processed data were stored in .mat format files. In total, 46,580 signal samples and 45,300 noise samples were generated. These two categories were then combined and randomly partitioned into training, validation, and test sets using a 7:2:1 ratio. The training set was employed to optimize the model, the validation set was used to adjust hyperparameters and conduct preliminary evaluations, and the test set was used to assess the model’s generalization capability. The detailed data division is provided in 
Table 3.
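As a rough check on the partition sizes, the 7:2:1 random split can be sketched as below. The shuffling seed and the rounding rule (integer division, with the remainder assigned to the test set) are assumptions of this sketch, so the counts may differ slightly from those reported in Table 3.

```python
import random


def split_indices(n_samples, ratios=(7, 2, 1), seed=0):
    """Shuffle sample indices and partition them into train/validation/test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    total = sum(ratios)
    n_train = n_samples * ratios[0] // total
    n_val = n_samples * ratios[1] // total
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]   # remainder goes to the test set
    return train, val, test


n_total = 46580 + 45300   # signal samples + noise samples
train, val, test = split_indices(n_total)
print(len(train), len(val), len(test))
```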
  2.4. Non-Gaussian Characteristics Analysis of Deep Sea Very-Low-Frequency Ambient Noise
The deep sea environmental noise dataset was collected during a very-low-frequency acoustic sensing experiment conducted in the Dongsha area of the South China Sea. Figure 3 shows a set of 50 min time–frequency maps of typical measured marine ambient noise in the South China Sea. A large amount of strong random impulsive interference is visible, with energy concentrated mainly below 100 Hz.
To eliminate the influence of passing vessels in close proximity, all data segments containing noticeable ship passages were carefully excluded. A total of 30 segments, amounting to 3990 min of continuous recordings, were selected to construct the dataset. The key information of each segment is summarized in 
Table 4.
When extracting the dataset, both daylight (06:00–20:00) and evening (20:00–06:00) intervals were investigated to account for anticipated diurnal fluctuations in noise characteristics. During the experimental period (18–21 May), the South China Sea region was continuously influenced by strong impulsive sources such as seismic airguns and deep water blasting operations from petroleum exploration and scientific surveys. These impulsive signals, with repetition times ranging from seconds to minutes, constituted approximately half of the dataset. This illustrates the considerable impact of anthropogenic impulsive interference on the deep sea acoustic environment. Accordingly, 
Table 4 explicitly states whether each data segment has airgun pulse interference.
A non-Gaussian impulsive analysis was performed on the ocean environmental noise dataset acquired by the very-low-frequency (VLF) vector hydrophone between 16 and 22 May. An observation window of 20 s with a 10 s overlap was chosen for continuous analysis. Although the dispersion and symmetry of the noise can be defined by the mean and variance, the amplitude distribution during large sample averaging and signal preprocessing remains unbiased and symmetric. Since the lower-order moments of stable distributions are finite, parameter estimation can be performed using fractional lower-order moments or logarithmic moment estimation. Given the advantages of the logarithmic moment method, this study uses it to estimate the parameters of the recorded very-low-frequency ocean ambient noise. The theoretical framework and estimation approach based on the logarithmic moment method are outlined next.
In order to represent the actual deep sea ambient noise using the Lévy (α-stable) distribution, it is necessary to estimate the four parameters of the distribution from the collected data. However, since the Lévy distribution does not in general possess a closed-form probability density function, typical statistical approaches based on explicit density representations cannot be used. Moreover, for $\alpha < 2$ its second- and higher-order statistical moments do not exist. To circumvent this constraint, the concept of fractional lower-order moments (FLOMs) is used. Based on the logarithmic moment approach, the parameters of the Lévy distribution for measured low-frequency oceanic ambient noise can be determined. The relationship between the FLOMs and the characteristic exponent $\alpha$ as well as the dispersion parameter $\gamma$ in the characteristic function is given by
$$E\left(|X|^{p}\right) = C(p,\alpha)\,\gamma^{p/\alpha}, \quad 0 < p < \alpha,$$
where
$$C(p,\alpha) = \frac{2^{p+1}\,\Gamma\!\left(\frac{p+1}{2}\right)\Gamma\!\left(-\frac{p}{\alpha}\right)}{\alpha\sqrt{\pi}\,\Gamma\!\left(-\frac{p}{2}\right)}.$$
Let $Y = \log|X|$. If the moments of $Y$ are finite, then $X$ is referred to as a log-moment random variable. Its moment-generating function is defined as
$$M_Y(t) = E\left(e^{tY}\right) = E\left(|X|^{t}\right) = C(t,\alpha)\,\gamma^{t/\alpha}.$$
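Thus, the Lévy distribution possesses finite log-moments, and the second- and higher-order log-moments of $Y$ are determined solely by the characteristic exponent $\alpha$ as follows:
$$E(Y) = C_e\left(\frac{1}{\alpha} - 1\right) + \frac{1}{\alpha}\log\gamma,$$
$$E\left\{[Y - E(Y)]^{2}\right\} = \frac{\pi^{2}}{6}\left(\frac{1}{\alpha^{2}} + \frac{1}{2}\right),$$
$$E\left\{[Y - E(Y)]^{3}\right\} = 2\zeta(3)\left(\frac{1}{\alpha^{3}} - 1\right),$$
where $C_e \approx 0.5772$ is the Euler constant and $\zeta(\cdot)$ is the Riemann zeta function.
Since higher-order log-moment estimates are generally unreliable, only the first and second log-moments are typically used to estimate the parameters of the Lévy distribution. Underwater ambient noise is usually taken to be unbiased, with a location parameter of zero. The main focus is consequently on the noise intensity, which is described by the characteristic exponent $\alpha$ and the scale (dispersion) parameter $\gamma$. Their estimates are obtained using the following formulas:
$$\hat{\alpha} = \left(\frac{6\,\widehat{\mathrm{Var}}(Y)}{\pi^{2}} - \frac{1}{2}\right)^{-1/2}, \qquad \hat{\gamma} = \exp\left\{\hat{\alpha}\left[\hat{E}(Y) - C_e\left(\frac{1}{\hat{\alpha}} - 1\right)\right]\right\},$$
where $\hat{E}(Y)$ and $\widehat{\mathrm{Var}}(Y)$ denote the sample mean and sample variance of $Y$.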
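A minimal sketch of the logarithmic moment estimator is given below. It assumes zero-location symmetric α-stable data with characteristic function $\exp(-\gamma|t|^{\alpha})$ and uses the standard relations $E(Y) = C_e(1/\alpha - 1) + \log(\gamma)/\alpha$ and $\mathrm{Var}(Y) = (\pi^2/6)(1/\alpha^2 + 1/2)$ for $Y = \log|X|$; the function name and the clipping of $\hat{\alpha}$ at 2 are choices made for this illustration.

```python
import math
import random
from statistics import fmean, pvariance

EULER_GAMMA = 0.5772156649015329  # Euler constant C_e


def log_moment_estimate(samples):
    """Estimate (alpha, gamma) of zero-location symmetric alpha-stable data."""
    # Y = log|X|; exact zeros are skipped because log(0) is undefined.
    logs = [math.log(abs(x)) for x in samples if x != 0.0]
    mean_log = fmean(logs)
    var_log = pvariance(logs, mu=mean_log)

    # Invert Var(Y) = (pi^2 / 6) * (1 / alpha^2 + 1 / 2).
    inv_alpha_sq = 6.0 * var_log / math.pi ** 2 - 0.5
    # Sampling error can push the estimate past the Gaussian limit alpha = 2,
    # so 1 / alpha^2 is clipped at 1 / 4 (a choice made for this sketch).
    alpha = 1.0 / math.sqrt(max(inv_alpha_sq, 0.25))

    # Invert E(Y) = C_e * (1 / alpha - 1) + log(gamma) / alpha.
    gamma = math.exp(alpha * (mean_log - EULER_GAMMA * (1.0 / alpha - 1.0)))
    return alpha, gamma


# Sanity check on Gaussian samples, the alpha = 2 member of the family,
# for which gamma = sigma^2 / 2 under the convention exp(-gamma * |t|^alpha).
rng = random.Random(1)
gaussian = [rng.gauss(0.0, 1.0) for _ in range(50000)]
alpha_hat, gamma_hat = log_moment_estimate(gaussian)
print(f"alpha = {alpha_hat:.2f}, gamma = {gamma_hat:.2f}")
```

For unit-variance Gaussian input the estimates should land near $\alpha = 2$ and $\gamma = 0.5$, up to sampling error.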
To verify the efficiency of this estimation approach, Lévy noise with prescribed values of $\alpha$ and $\gamma$ was generated using the Janicki–Weron algorithm, and the parameters were subsequently estimated using the logarithmic moment method.
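The generation step can be sketched for the symmetric case as follows. The sketch uses the Chambers–Mallows–Stuck form of the generator that the Janicki–Weron algorithm builds on, restricted to zero skewness and zero location. The parameter values $\alpha = 1.5$ and $\gamma = 1$ are illustrative assumptions and are not the values used in the study. The generated samples are checked against the theoretical mean and variance of $\log|X|$.

```python
import math
import random
from statistics import fmean, pvariance

EULER_GAMMA = 0.5772156649015329  # Euler constant C_e


def symmetric_alpha_stable(alpha, gamma, n, seed=0):
    """Draw n symmetric alpha-stable samples with dispersion gamma."""
    rng = random.Random(seed)
    scale = gamma ** (1.0 / alpha)
    samples = []
    for _ in range(n):
        v = rng.uniform(-math.pi / 2.0, math.pi / 2.0)   # uniform phase
        w = max(rng.expovariate(1.0), 1e-300)            # unit exponential, kept above zero
        x = (math.sin(alpha * v) / math.cos(v) ** (1.0 / alpha)
             * (math.cos((1.0 - alpha) * v) / w) ** ((1.0 - alpha) / alpha))
        samples.append(scale * x)
    return samples


# Illustrative parameters (assumed, not taken from the study).
alpha_true, gamma_true = 1.5, 1.0
noise = symmetric_alpha_stable(alpha_true, gamma_true, 50000)

# Compare the sample log-moments with their theoretical values.
logs = [math.log(abs(x)) for x in noise if x != 0.0]
mean_theory = EULER_GAMMA * (1.0 / alpha_true - 1.0) + math.log(gamma_true) / alpha_true
var_theory = math.pi ** 2 / 6.0 * (1.0 / alpha_true ** 2 + 0.5)
print(f"mean: {fmean(logs):.3f} (theory {mean_theory:.3f})")
print(f"var:  {pvariance(logs):.3f} (theory {var_theory:.3f})")
```

Agreement between the sample and theoretical log-moments indicates that the logarithmic moment estimator would recover the prescribed parameters from such data.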
Accordingly, the analysis focuses on the impulsive and heavy-tailed features of the data, represented by fluctuations in the characteristic exponent $\alpha$. The results from the P-channel data are shown in Figure 4, which illustrates that deep sea ambient noise exhibits sparse, strongly impulsive, non-Gaussian behavior on the minute scale. In most cases, 
, but severe impulsive interferences exist. Values of 
 are primarily connected to seismic sources, such as airgun transmissions. 
$\alpha = 2$ corresponds to the Gaussian case. Although such events occur infrequently, they drastically affect the noise background and underline the intrinsic complexity of the marine acoustic environment.
To further validate these observations, the kurtosis distributions of noise segments in the frequency bands above 100 Hz and below 100 Hz were investigated, as shown in Figure 5a,b. This analysis reveals that while most segments display modest kurtosis values, a large number of outliers with extremely high values are also present, indicating the highly impulsive nature of the observed ambient noise. The red arrow marks the value for Gaussian noise.
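The effect of sparse impulses on kurtosis can be illustrated with a small synthetic example. This sketch assumes the Pearson definition of kurtosis (equal to 3 for Gaussian data) and uses artificial data; the band-limiting to above or below 100 Hz is assumed to be done upstream and is not shown.

```python
import random
from statistics import fmean


def kurtosis(samples):
    """Pearson kurtosis m4 / m2^2, which equals 3 for Gaussian data."""
    mu = fmean(samples)
    m2 = fmean((x - mu) ** 2 for x in samples)
    m4 = fmean((x - mu) ** 4 for x in samples)
    return m4 / m2 ** 2


# Synthetic segment: Gaussian background, then the same segment with
# 20 sparse, strong impulses added (amplitudes chosen for illustration).
rng = random.Random(0)
gaussian = [rng.gauss(0.0, 1.0) for _ in range(20000)]
impulsive = list(gaussian)
for i in range(0, len(impulsive), 1000):
    impulsive[i] += 30.0

k_gaussian = kurtosis(gaussian)
k_impulsive = kurtosis(impulsive)
print(f"Gaussian: {k_gaussian:.2f}, with impulses: {k_impulsive:.2f}")
```

Even though the impulses make up only 0.1% of the samples, they raise the kurtosis far above the Gaussian reference, which is the behavior described for the outlier segments.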
In summary, the experimental results reveal that deep sea ambient noise is not only non-Gaussian but also highly impulsive and variable, posing considerable problems for underwater detection systems. Conventional feature extraction algorithms struggle to capture these properties properly. In contrast, the Log-Mel spectrogram provides robustness against impulsive distortions through logarithmic compression, and HPSS (Harmonic–Percussive Source Separation) separates harmonic and transient components to emphasize structured signal information. Accordingly, the combined use of Log-Mel and HPSS features is better suited to the non-Gaussianity, strong impulsiveness, and variability of deep sea noise, improving both the robustness and the generalization capability of underwater target identification models.
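How HPSS keeps tonal lines while discarding broadband transients can be seen in a toy example. The sketch below assumes the median-filtering variant of HPSS with a binary mask, which may differ from the variant and parameters used in this study, and it operates on a small hand-made magnitude spectrogram rather than on measured data. Audio libraries such as librosa provide full implementations.

```python
from statistics import median


def running_median(values, width):
    """Median filter with a window that shrinks at the edges."""
    half = width // 2
    n = len(values)
    return [median(values[max(0, i - half):min(n, i + half + 1)]) for i in range(n)]


def hpss_harmonic(spec, width=5):
    """Return the harmonic part of a magnitude spectrogram spec[f][t]."""
    n_freq, n_time = len(spec), len(spec[0])
    # Filtering along time enhances stable tonal lines: harm[f][t].
    harm = [running_median(row, width) for row in spec]
    # Filtering along frequency enhances broadband transients: perc[t][f].
    perc = [running_median([spec[f][t] for f in range(n_freq)], width)
            for t in range(n_time)]
    # Binary mask: keep a bin only where the harmonic estimate dominates.
    return [[spec[f][t] if harm[f][t] >= perc[t][f] else 0.0
             for t in range(n_time)] for f in range(n_freq)]


# Toy spectrogram: weak background, one tonal line (row 3), one impulse (column 6).
spec = [[0.1] * 12 for _ in range(8)]
spec[3] = [1.0] * 12
for f in range(8):
    spec[f][6] = 1.0

harmonic = hpss_harmonic(spec)
print(harmonic[3][0], harmonic[0][6])
```

In this example the tonal line survives at every time step, whereas the impulse is zeroed in every frequency bin away from the line.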
  5. Discussions
Based on the methodological developments and experimental outcomes, the following points are highlighted.
(1) The construction of a dedicated dataset for passive underwater target identification, suited to the characteristics of ship transit, represents a foundational contribution of this work. The eight-day field experiment provided a realistic acoustic environment containing ship-radiated noise, ambient noise, and interference in the deep sea very-low-frequency band. The pre-processing and labeling methodology established a clear procedure for data preparation in this domain. Additionally, the application of data augmentation techniques addressed the key issue of class imbalance, which is a prevalent challenge in real-world underwater acoustic datasets. Most notably, the incorporation of Log-Mel and HPSS features, adopted from speech and audio signal processing, for characterizing ship-radiated noise proved highly effective. This cross-domain feature transfer highlights the potential of employing advanced audio processing techniques in underwater acoustics, and the extraction parameters provided offer useful guidance for future research.
(2) The proposed CNN-based detection framework has demonstrated both its feasibility and its superiority over standard energy detection approaches. The carefully designed 10-layer CNN architecture, tailored to ship-radiated noise characteristics, together with assessment metrics that bridge deep learning and signal detection theory, forms a comprehensive detection technique. The analysis of diverse network parameters provides useful insights into model optimization. The comparative analysis of time–frequency features demonstrates that Log-Mel features achieved the best performance by efficiently capturing the low-frequency properties of ship noise while maintaining computational efficiency. This result underlines the importance of feature design that matches both the physical features of the target signals and practical implementation constraints.
However, several issues and future directions demand attention:
(1) The complexity and variety of ship-radiated noise call for more robust feature representations. While the current features show promising results, future work should focus on using the representation learning capabilities of deep learning to construct more sophisticated feature fusion algorithms. This could involve attention mechanisms or multi-modal learning methods that better handle complicated maritime environments and improve the system’s generalization.
(2) The data imbalance problem remains a key concern. Although data augmentation techniques were applied in this study, more advanced alternatives, such as generative adversarial networks (GANs) for synthetic data generation or cost-sensitive learning approaches, deserve further consideration. As deep learning models are essentially data-driven, more capable data handling solutions will be crucial for progress in this field.
(3) Future work will also focus on constructing more efficient network topologies and investigating semi-supervised learning approaches to lessen the dependency on large labeled datasets, ultimately moving toward more practical and deployable underwater target detection systems.