Journal of Marine Science and Engineering
  • Article
  • Open Access

16 November 2025

Robust Dolphin Whistle Detection Based on Dually-Regularized Non-Negative Matrix Factorization in Passive Acoustic Monitoring

1
College of Oceanography and Space Informatics, China University of Petroleum (East China), Qingdao 266580, China
2
State Key Laboratory of Ocean Sensing and Ocean College, Zhejiang University, Zhoushan 316021, China
3
National Key Laboratory of Underwater Acoustic Technology, Harbin Engineering University, Zhoushan 150006, China
*
Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2025, 13(11), 2164; https://doi.org/10.3390/jmse13112164
(registering DOI)
This article belongs to the Section Ocean Engineering

Abstract

Underwater passive acoustic monitoring (PAM) is a core approach widely applied to the long-term, non-invasive detection of biological acoustic signals. Dolphin whistles are a fundamental component of vocal communication and exhibit intricate frequency-modulated structures. Robust detection of these whistles is essential for conserving dolphin species diversity, yet detection reliability is frequently degraded by underwater background noise. To address this issue, this paper presents an unsupervised enhancement method based on Dually-Regularized Non-Negative Matrix Factorization (DR-NMF). Beyond a standard data fidelity term, the proposed framework integrates two specialized regularizers: Overlapping Group Shrinkage and Group Lasso. The former promotes time–frequency continuity of whistle ridges, while the latter adaptively eliminates redundant bases, achieving an improved trade-off between structural integrity and noise suppression. The optimization procedure combines majorization–minimization, iteratively reweighted least squares, and proximal gradient techniques, all implemented within an alternating minimization scheme featuring nested inner–outer iterations. This architecture ensures stable convergence and computational practicality. Extensive experimental evaluations under diverse low signal-to-noise ratio (SNR) conditions reveal that the proposed method achieves a substantial improvement in recall without compromising precision, resulting in consistent enhancements in frame-level F1-scores. When applied to real-world dolphin whistle recordings, our method outperforms existing baseline approaches, demonstrating robustness in detecting whistle signals amid challenging marine environmental noise.

1. Introduction

Passive acoustic monitoring (PAM) has become extensively applied in marine mammal ecology as a non-invasive, cost-effective, and long-term monitoring approach  [,]. Traditional methods rely on visual surveys or short-term captures. In contrast, PAM facilitates continuous observation of acoustic behavior at large spatial and temporal scales without disturbing the natural activities of target species []. This capability provides unique advantages for population monitoring and conservation. The core components of PAM research include detection, classification, and localization [,]. Detection serves as the prerequisite for subsequent analyses, and in odontocete studies, whistles represent the most prominent detection target. Whistles carry essential information on individual communication and group behavior []. Therefore, whistle detection constitutes a pivotal front-end step, responsible for separating target signals from complex acoustic backgrounds. Accurate identification of whistle events further supports population size estimation and habitat-use evaluation [,]. Based on differences in frequency modulation patterns among species, researchers can also perform behavioral analyses and ecological niche characterization []. Thus, the accuracy of whistle detection is essential not only for ensuring robust localization and classification, but also for advancing the application of PAM in revealing ecological processes and supporting conservation strategies.
However, in real marine environments, whistle signals are often affected by broadband stationary noise [,]. Such interference greatly masks the energy of whistles and causes severe distortion and degradation of spectral contours []. Under these conditions, detectors show reduced sensitivity to target signals and fail to reliably identify low-energy or partially masked ridge features. At the same time, noise-induced artifacts increase the risk of false alarms and reduce detection specificity. This combined effect substantially weakens the performance of whistle detection [], making it difficult for researchers to accurately estimate dolphin population size, abundance, and behavioral patterns in shallow-water habitats.
Methods for whistle detection mainly include time-domain approaches [], time-frequency domain approaches, and other transform-domain approaches []. Studies have shown that dolphin whistles exhibit distinct modulation characteristics in the time–frequency domain [,]. During individual recognition, dolphins also use the spectral contour details of continuous whistles as important cues []. As a result, time–frequency domain detection methods have been more widely applied in practical research. Common techniques in this category typically combine spectrogram enhancement with thresholding []. For example, some studies have applied various noise-cancellation techniques, such as median filtering, to spectrograms and then used thresholding to retrieve whistle candidate regions []. Another method combines time–frequency adaptive binary image processing with morphological operations to achieve whistle tracking and enhancement []. In addition, a robust unsupervised detection approach has been proposed, which integrates whistle enhancement using gammatone multichannel processing and Savitzky–Golay filtering with adaptive thresholding, enabling effective recognition of dolphin whistle events []. Overall, time–frequency–based enhancement methods can suppress background noise while preserving whistle structure, and when combined with adaptive thresholding, they provide a significant improvement in whistle detection performance.
Traditional time–frequency domain enhancement methods for dolphin whistles include time–frequency smoothing [,] and empirical mode decomposition (EMD) []. These techniques can improve signal intelligibility to some extent. However, under low signal-to-noise ratio (SNR) conditions, the fine-grained time–frequency structure of whistles is easily masked by noise, which limits the enhancement capacity of these approaches and prevents effective recovery of the complete target signal. As a result, the features provided to detectors become insufficient, thereby weakening their discriminative power and leading to a sharp decline in recall. In recent years, deep neural network methods, such as denoising convolutional neural networks (DNCNNs) and denoising autoencoders (DAEs) [,], have shown strong performance in dolphin whistle enhancement. Nevertheless, these approaches require large amounts of clean whistle data and rely heavily on manual annotation and expert supervision, which are both time consuming and costly. Therefore, there is an urgent need for an unsupervised enhancement method that does not depend on manual labeling or large-scale clean datasets, while effectively improving whistle detection performance in ocean noise environments.
In the field of speech signal processing, structured sparse modeling methods, such as Overlapping Group Shrinkage (OGS) [,] and Group Lasso (GL) [], have been widely applied to speech enhancement, detection, and recognition, achieving strong performance. The main advantage of these methods lies in the introduction of structured priors into sparse modeling, which capture local continuity and global sparsity of signals and reflect their grouping and hierarchical characteristics. Similarly, dolphin whistles are produced by air sacs and controlled by laryngeal muscles [], resulting in unique time–frequency patterns with both sparsity and group structure [,]. Structured sparse modeling methods, therefore, show considerable potential for dolphin whistle enhancement. However, when applied directly to whistle spectrograms without explicit reconstruction models or data fidelity constraints, these methods often fail to balance the preservation of weak whistle components, the maintenance of ridge structure integrity, and effective noise suppression, which limits their effectiveness in whistle enhancement.
In this study, we propose an unsupervised enhancement framework based on regularized Non-Negative Matrix Factorization (NMF) [,]. NMF decomposes the amplitude spectrum matrix into non-negative bases and activations. A small number of narrowband bases, combined with continuous activations, characterize the whistles, while noise is dispersed across the remaining bases; under the low-rank approximation, this suppresses random disturbances and improves robustness at low SNR. In addition to the data fidelity term, the framework incorporates an OGS regularizer and a GL penalty. The OGS term promotes the continuity of spectral ridges through two-dimensional overlapping windows while suppressing block-like artifacts, whereas the GL term eliminates redundancy at the dictionary level by discouraging the participation of ineffective bases that would otherwise amplify residual noise. The joint effect of these two regularization strategies enables the enhanced spectrograms to preserve the structural integrity of whistle contours while markedly improving their separability from background noise, thereby providing a more robust input distribution for subsequent endpoint detection.
The remainder of this paper is organized as follows. Section 2 provides a detailed description of the proposed DR-NMF-based enhancement framework, including the formulation of the model, the associated optimization algorithm, and the design of the endpoint detection module. In addition, the experimental setup is also introduced. Section 3 reports the results of both the enhancement and detection, accompanied by a comprehensive performance analysis under varying noise conditions. Section 4 offers a critical discussion of the method, highlighting its advantages, as well as the inherent limitations that warrant further investigation. Finally, Section 5 concludes this paper.

2. Methods

2.1. STFT-Based Spectral Model of Noisy Whistles

For a PAM system with a sampling rate of $f_s$, the collected noisy whistle signal $x(n)$ can be represented as $x(n) = s(n) + p(n)$, $n = 1, \ldots, N$, where $s(n)$ denotes the clean whistle signal and $p(n)$ denotes the background noise. By applying the short-time Fourier transform (STFT), the time–frequency spectrogram $S_{k,m}$ of the signal $s(n)$ can be obtained, which can be expressed as follows:
$$S_{k,m} = \sum_{n=-\infty}^{\infty} s(n)\, w(n - mQ)\, e^{-j 2\pi k (n - mQ)/M}, \tag{1}$$
where $w(n)$ denotes the analysis window function, $m = 1, \ldots, M$ is the frame index, $k = 1, \ldots, K$ is the frequency index, and $Q$ represents the time shift in discrete samples. Subsequently, by applying the inverse short-time Fourier transform (ISTFT), the time-domain signal $s(n)$ can be reconstructed from $S_{k,m}$, which can be expressed as follows:
$$s(n) = \sum_{m} g(n - mQ)\, \frac{1}{M} \sum_{k=0}^{M-1} S_{k,m}\, e^{j 2\pi k (n - mQ)/M}, \tag{2}$$
where $g(n)$ denotes the synthesis window function. If the analysis and synthesis windows are not identical, the condition of perfect reconstruction requires satisfying the constant overlap-add (COLA) criterion, which can be expressed as
$$\sum_{m} w(n - mQ)\, g(n - mQ) = C(n). \tag{3}$$
If the window and hop size satisfy the COLA condition and $C(n)$ is a constant, then it suffices to divide the final reconstruction result by this constant.
Since the STFT is a linear transform, the signal decomposition can be expressed as follows:
$$X_{k,m} = S_{k,m} + P_{k,m}. \tag{4}$$
That is, the observed complex spectrogram equals the linear superposition of the signal spectrogram and the noise spectrogram. However, in practical processing, the complex spectrogram $X_{k,m}$ is usually not used directly; instead, its magnitude (amplitude spectrogram) or squared magnitude (power spectrogram) is taken as the object of analysis. Accordingly, the observation matrix is defined as follows:
$$Y_{k,m} = |X_{k,m}|^{\beta}. \tag{5}$$
Commonly, $\beta = 1$ (amplitude spectrogram) or $\beta = 2$ (power spectrogram) is adopted. For simplicity, $X_{k,m}$, $Y_{k,m}$, $S_{k,m}$, and $P_{k,m}$ are, respectively, denoted as $\mathbf{X}$, $\mathbf{Y}$, $\mathbf{S}$, and $\mathbf{P}$. Since the squared expansion of $\mathbf{S} + \mathbf{P}$ contains cross terms, it is strictly the case that $|\mathbf{X}|^{\beta} \neq |\mathbf{S}|^{\beta} + |\mathbf{P}|^{\beta}$. However, for the sake of simplified derivation and practical application in NMF, it is typically assumed that $s(n)$ and $p(n)$ are statistically independent or weakly correlated. Under this assumption, the cross terms can be neglected in expectation, yielding $\mathbb{E}|\mathbf{X}|^{2} \approx \mathbb{E}|\mathbf{S}|^{2} + \mathbb{E}|\mathbf{P}|^{2}$. Therefore, it can be directly assumed that
$$\mathbf{Y} \approx \mathbf{S} + \mathbf{P}, \tag{6}$$
where $\mathbf{Y}$, $\mathbf{S}$, and $\mathbf{P}$, respectively, denote the observed spectrogram matrix, the clean signal spectrogram matrix, and the noise spectrogram matrix. This formulation facilitates the establishment of a decomposition model in the non-negative domain, enabling the separation and reconstruction of noisy signals.
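The additivity assumption can be checked numerically. The following is a minimal sketch rather than the processing chain used in this work: it assumes a synthetic linear up-sweep in place of a whistle, white Gaussian noise, a Hann analysis window, and hypothetical names and settings (`stft`, `n_fft`, `hop`, a toy sampling rate of 8 kHz).

```python
import numpy as np

def stft(x, n_fft=256, hop=64):
    """Complex STFT with a Hann window; rows are frequency bins, columns are frames."""
    w = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[m * hop:m * hop + n_fft] * w for m in range(n_frames)], axis=1)
    return np.fft.rfft(frames, axis=0)

rng = np.random.default_rng(0)
fs = 8000                                    # assumed toy sampling rate (Hz)
t = np.arange(2 * fs) / fs
s = np.sin(2 * np.pi * (1000 * t + 250 * t ** 2))   # synthetic sweep, 1 kHz to 2 kHz
p = rng.standard_normal(t.size)                     # white Gaussian noise

S, P, X = stft(s), stft(p), stft(s + p)

# Linearity of the STFT: the complex spectrograms add exactly.
linear_ok = np.allclose(X, S + P)

# Power spectrograms add only approximately; the gap is the averaged cross term.
mean_power_X = np.mean(np.abs(X) ** 2)
mean_power_sum = np.mean(np.abs(S) ** 2 + np.abs(P) ** 2)
rel_gap = abs(mean_power_X - mean_power_sum) / mean_power_X
print(f"complex additivity exact: {linear_ok}, relative power gap: {rel_gap:.4f}")
```

For independent signal and noise the relative gap shrinks as more bins and frames are averaged, which is the sense in which the cross terms are neglected in expectation.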

2.2. Spectrogram Enhancement Model Based on DR-NMF and Iterative Solution Method

2.2.1. Derivation of the Objective Function

In the application of PAM, the observed time–frequency spectrogram is often composed of the superposition of target signals and noise. To effectively capture the latent structure of the signal, the NMF method is employed to model the observed spectrogram, thereby obtaining spectral bases and sparse, interpretable representations of the activation coefficients. Suppose the input clean spectrogram matrix is denoted as $\mathbf{S} \in \mathbb{R}_{+}^{K \times M}$, where $K$ represents the number of frequency channels and $M$ denotes the number of time frames. By constructing the NMF model, the goal is to find a non-negative matrix pair $(\mathbf{W}, \mathbf{H})$ such that
$$\hat{\mathbf{S}} \approx \mathbf{W}\mathbf{H}, \tag{7}$$
where $\hat{\mathbf{S}}$ denotes the reconstructed T-F representation of a dolphin whistle, $\mathbf{W} \in \mathbb{R}_{+}^{K \times r}$ denotes the spectral basis matrix, and $\mathbf{H} \in \mathbb{R}_{+}^{r \times M}$ denotes the activation matrix. In practice, since the observed signal inevitably suffers from noise and modeling errors, decomposition is usually performed directly on the observed spectrogram $\mathbf{Y}$. In this case, the model can be formulated as follows:
$$\mathbf{Y} = \mathbf{W}\mathbf{H} + \mathbf{E}, \tag{8}$$
where $\mathbf{E}$ denotes the residual component that cannot be represented by the low-rank non-negative decomposition.
To characterize the discrepancy between the observed spectrogram Y and the reconstructed spectrogram WH , an appropriate data fidelity term is introduced. In signal processing, a common assumption is that the noise follows an independent and identically distributed Gaussian distribution. Under this statistical assumption, maximum likelihood estimation can be reduced to minimizing the squared error between the observed data and the model reconstruction. Therefore, the fidelity term in the NMF model can be defined as the squared Frobenius norm:
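$$\mathcal{L}_{\text{fidelity}}(\mathbf{W}, \mathbf{H}) = \frac{1}{2} \left\| \mathbf{Y} - \mathbf{W}\mathbf{H} \right\|_F^2. \tag{9}$$
Since the loss function is convex in $\mathbf{W}$ when $\mathbf{H}$ is held fixed, and convex in $\mathbf{H}$ when $\mathbf{W}$ is held fixed, an alternating minimization strategy can be adopted.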
Based on the Gaussian noise assumption, the NMF fidelity term is introduced to ensure that the model can reasonably approximate the observed data under the reconstruction error criterion. However, relying solely on the fidelity term is often insufficient to handle complex broadband stationary noise environments. On the one hand, noise may introduce numerous scattered high-frequency artifacts, which, in turn, cause the temporal distribution of the activation matrix H to become fragmented and discontinuous; on the other hand, an excessive number of bases in W may participate in the decomposition indiscriminately, leading to increased redundancy of the model while weakening interpretability. Therefore, it is necessary to incorporate structural priors into the NMF framework in order to reflect the intrinsic regularity of whistle signals and to enhance robustness.
It is observed that the energy distribution of the activation matrix $\mathbf{H}$ exhibits pronounced local continuity and block-wise aggregation. If only the element-wise $\ell_1$ sparsity constraint is applied, the model tends to over-compress energy at isolated points, thereby disrupting the inherent trajectory continuity. Similarly, applying independent regularization along the temporal or frequency dimension alone cannot effectively capture cross-dimensional correlations, which easily leads to fragmented energy distributions. To address this issue, the OGS regularization is introduced on $\mathbf{H}$ to more appropriately characterize the local energy continuity within overlapping neighborhoods in the time–frequency domain. Specifically, let a set of overlapping local windows be denoted as $G_{p,q}$, each window centered at $(p, q)$ with a size of $K_1 \times K_2$. All elements within a window are treated as one group, and the $\ell_2$ norm of the group is defined accordingly. The OGS regularization term can be mathematically expressed as follows:
$$R_{\text{OGS}}(\mathbf{H}) = \sum_{p,q} \left\| \mathbf{H}_{G_{p,q}} \right\|_F = \sum_{p,q} \sqrt{\sum_{(i,j) \in G_{p,q}} H_{i,j}^{2}}, \tag{10}$$
where the subscripts $(i, j)$ index the entries of $\mathbf{H}$ that fall within the window $G_{p,q}$. This term can be reformulated in terms of convolution operations. Let the convolution kernel $\mathbf{G}$ be constructed as the outer product of all-ones vectors $\mathbf{h}_1$ and $\mathbf{h}_2$, such that the kernel size is $K_1 \times K_2$. Under this definition, the squared sum within a window can be equivalently expressed as the convolution between the matrix $\mathbf{H} \odot \mathbf{H}$ and the kernel $\mathbf{G}$:
$$\left[ \mathbf{G} * (\mathbf{H} \odot \mathbf{H}) \right]_{p,q} = \sum_{(i,j) \in G_{p,q}} H_{i,j}^{2}. \tag{11}$$
Therefore, the OGS regularization term can be compactly written as follows:
$$R_{\text{OGS}}(\mathbf{H}) = \sum_{p,q} \sqrt{\left[ \mathbf{G} * (\mathbf{H} \odot \mathbf{H}) \right]_{p,q}}. \tag{12}$$
In contrast to $\mathbf{H}$, the $j$-th column vector $\mathbf{W}_{:,j} \in \mathbb{R}_{+}^{K}$ of the spectral basis matrix $\mathbf{W} \in \mathbb{R}_{+}^{K \times r}$ represents the $j$-th candidate basis with dimension $K$, thereby spanning all frequency channels. Without additional constraints, all basis vectors may participate in the decomposition, which could result in excessive redundancy. To avoid this issue, the GL regularization ($\ell_{2,1}$ norm) is imposed on $\mathbf{W}$, which is expressed as follows:
$$R_{\text{GL}}(\mathbf{W}) = \sum_{j=1}^{r} \left\| \mathbf{W}_{:,j} \right\|_2. \tag{13}$$
This regularization term imposes sparsity constraints at the level of entire basis columns. When a basis contributes insufficiently and fails to play a substantive role in signal reconstruction, its entire column vector will be compressed to zero, thereby achieving automatic basis selection. Compared with element-wise l 1 sparsity, this column-wise sparsity mechanism can more effectively eliminate redundant or ineffective bases, rather than producing sparsity only at the level of individual elements.
Therefore, by incorporating OGS and GL into the NMF fidelity term, the overall objective function is formulated as follows:
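$$\begin{aligned}
\min_{\mathbf{W} \ge 0,\, \mathbf{H} \ge 0} J(\mathbf{W}, \mathbf{H}) &= \mathcal{L}_{\text{fidelity}}(\mathbf{W}, \mathbf{H}) + \lambda_{\text{ogs}} R_{\text{OGS}}(\mathbf{H}) + \lambda_{\text{gl}} R_{\text{GL}}(\mathbf{W}) \\
&= \frac{1}{2} \left\| \mathbf{Y} - \mathbf{W}\mathbf{H} \right\|_F^2 + \lambda_{\text{ogs}} \sum_{p,q} \sqrt{\left[ \mathbf{G} * (\mathbf{H} \odot \mathbf{H}) \right]_{p,q}} + \lambda_{\text{gl}} \sum_{j=1}^{r} \left\| \mathbf{W}_{:,j} \right\|_2,
\end{aligned} \tag{14}$$
where $\lambda_{\text{ogs}}$ and $\lambda_{\text{gl}}$ are weighting parameters that balance the fidelity term and the structural priors.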
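To make the three terms of the objective concrete, the sketch below evaluates them with NumPy. It is illustrative only: the helper names (`box_sum`, `dr_nmf_objective`), the zero-padded boundary handling with odd window sizes, and the random test matrices are assumptions, not details taken from this work.

```python
import numpy as np

def box_sum(A, k1, k2):
    """Sum of A over a k1-by-k2 window centred at each entry (zero padding, odd k1, k2)."""
    p1, p2 = k1 // 2, k2 // 2
    P = np.pad(A, ((p1, p1), (p2, p2)))
    out = np.zeros(A.shape)
    for di in range(k1):
        for dj in range(k2):
            out += P[di:di + A.shape[0], dj:dj + A.shape[1]]
    return out

def dr_nmf_objective(Y, W, H, lam_ogs, lam_gl, k1=3, k2=3):
    """Fidelity + OGS penalty on H + Group Lasso penalty on the columns of W."""
    fidelity = 0.5 * np.linalg.norm(Y - W @ H, "fro") ** 2
    r_ogs = np.sqrt(box_sum(H * H, k1, k2)).sum()   # one overlapping group per position
    r_gl = np.linalg.norm(W, axis=0).sum()          # sum of column-wise l2 norms
    return fidelity + lam_ogs * r_ogs + lam_gl * r_gl

rng = np.random.default_rng(0)
Y = rng.random((6, 8))
W = rng.random((6, 3))
H = rng.random((3, 8))
J = dr_nmf_objective(Y, W, H, lam_ogs=0.1, lam_gl=0.1)
print(f"J(W, H) = {J:.4f}")
```

With a 1 × 1 window each group contains a single entry, so the OGS term reduces to the element-wise $\ell_1$ norm of $\mathbf{H}$; larger windows couple neighbouring activations, which is what favours continuous ridges.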

2.2.2. Optimization of the Objective Function

Since different regularization terms are imposed on W and H, distinct solution methods are employed to derive their respective update rules, thereby implementing the alternating minimization optimization strategy.
In the optimization process, the OGS regularization term exhibits strong nonlinearity due to the presence of both square-root and convolution operations, which renders direct gradient-based updates difficult to implement. To address this issue, a majorization–minimization (MM) strategy is adopted, in which the original OGS regularizer is replaced by a sequence of progressively tightened quadratic upper bounds. This reformulation enables the problem to be solved in the form of an iteratively reweighted least squares (IRLS) scheme, thereby transforming the nonsmooth, group-coupled regularization into a tractable weighted quadratic optimization problem.
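Specifically, the derivation begins by invoking the classical upper-bound inequality of the square-root function. For any non-negative scalar $z \ge 0$ and any positive constant $b > 0$, the following inequality holds:
$$\sqrt{z + \varepsilon} \le \frac{z + \varepsilon}{2b} + \frac{b}{2}, \tag{15}$$
where $\varepsilon > 0$ is introduced as a smoothing term to avoid singularity. The inequality follows from the arithmetic–geometric mean inequality and becomes tight when $b = \sqrt{z + \varepsilon}$. It thus provides a locally tight upper bound on the square-root function that is linear in $z$, and hence quadratic in the entries of $\mathbf{H}$ when applied to the group energy in the OGS regularization. Defining the group energy as $S_{p,q}(\mathbf{H}) = \left[ \mathbf{G} * (\mathbf{H} \odot \mathbf{H}) \right]_{p,q}$ leads to the following:
$$\sqrt{S_{p,q}(\mathbf{H}) + \varepsilon} \le \frac{S_{p,q}(\mathbf{H}) + \varepsilon}{2\, r_{p,q}^{(t)}} + \frac{r_{p,q}^{(t)}}{2}, \tag{16}$$
where $r_{p,q}^{(t)} = \sqrt{S_{p,q}(\mathbf{H}^{(t)}) + \varepsilon}$ denotes the local energy value from the previous iteration. Consequently, an upper bound of $R_{\text{OGS}}(\mathbf{H})$ can be obtained as follows:
$$\lambda_{\text{ogs}} R_{\text{OGS}}(\mathbf{H}) \le \frac{\lambda_{\text{ogs}}}{2} \sum_{p,q} \left( \frac{S_{p,q}(\mathbf{H}) + \varepsilon}{r_{p,q}^{(t)}} + r_{p,q}^{(t)} \right). \tag{17}$$
By defining the terms independent of $\mathbf{H}$ as a constant $\mathrm{const} = \frac{\lambda_{\text{ogs}}}{2} \sum_{p,q} \left( \frac{\varepsilon}{r_{p,q}^{(t)}} + r_{p,q}^{(t)} \right)$, a quadratic surrogate function with respect to $\mathbf{H}$ can be constructed as $Q(\mathbf{H}; \mathbf{H}^{(t)})$:
$$Q(\mathbf{H}; \mathbf{H}^{(t)}) = \frac{1}{2} \left\| \mathbf{Y} - \mathbf{W}\mathbf{H} \right\|_F^2 + \frac{\lambda_{\text{ogs}}}{2} \sum_{p,q} \frac{S_{p,q}(\mathbf{H})}{r_{p,q}^{(t)}} + \mathrm{const}. \tag{18}$$
For any $\mathbf{H}$, the constructed surrogate function satisfies the following:
$$J(\mathbf{H}) = \frac{1}{2} \left\| \mathbf{Y} - \mathbf{W}\mathbf{H} \right\|_F^2 + \lambda_{\text{ogs}} R_{\text{OGS}}(\mathbf{H}) \le Q(\mathbf{H}; \mathbf{H}^{(t)}). \tag{19}$$
At $\mathbf{H} = \mathbf{H}^{(t)}$, the two functions take identical values, i.e., $Q(\mathbf{H}^{(t)}; \mathbf{H}^{(t)}) = J(\mathbf{H}^{(t)})$, where the OGS term in $J$ is understood in its $\varepsilon$-smoothed form. At the $(t+1)$-th update, the new solution is obtained by minimizing the surrogate function, namely $\mathbf{H}^{(t+1)} = \arg\min_{\mathbf{H}} Q(\mathbf{H}; \mathbf{H}^{(t)})$, from which the following descent chain can be derived:
$$J(\mathbf{H}^{(t+1)}) \le Q(\mathbf{H}^{(t+1)}; \mathbf{H}^{(t)}) \le Q(\mathbf{H}^{(t)}; \mathbf{H}^{(t)}) = J(\mathbf{H}^{(t)}). \tag{20}$$
This guarantees that the original objective function $J(\mathbf{H})$ is monotonically non-increasing.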
In Equation (18), the regularization term appears in the form of local group energy; however, this formulation still involves the quadratic sum of convolution and windowing, which makes it difficult to relate directly to the individual entries of matrix $\mathbf{H}$. To address this, it is necessary to interchange the order of convolution and summation. By applying the Fubini–Tonelli theorem, the double summation can be rearranged, leading to the following equivalent transformation:
$$\sum_{p,q} \frac{S_{p,q}(\mathbf{H})}{r_{p,q}^{(t)}} = \sum_{i,j} w_{i,j}^{(t)} H_{i,j}^{2}, \tag{21}$$
where each element of the weight matrix $\mathbf{w}^{(t)}$ corresponds to the aggregated result of being weighted within all local windows covering that position, and it is expressed as follows:
$$\mathbf{w}^{(t)} = \mathbf{G} * \frac{1}{\mathbf{r}^{(t)}}, \tag{22}$$
where the symbol $*$ denotes the convolution operation, the reciprocal is taken element-wise, and $\mathbf{r}^{(t)}$ is the matrix composed of the local quantities $r_{p,q}^{(t)}$ at all positions $(p, q)$. Through the above transformation, the OGS regularization term is converted from the complex representation of group energy into an element-wise weighted quadratic form, thereby allowing $Q(\mathbf{H}; \mathbf{H}^{(t)})$ to be simplified as follows:
$$Q(\mathbf{H}; \mathbf{H}^{(t)}) = \frac{1}{2} \left\| \mathbf{Y} - \mathbf{W}\mathbf{H} \right\|_F^2 + \frac{\lambda_{\text{ogs}}}{2} \sum_{i,j} w_{i,j}^{(t)} H_{i,j}^{2} + \mathrm{const}. \tag{23}$$
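The rearrangement from group energies to element-wise weights, and the majorization property of the quadratic bound, can both be verified numerically. The sketch below is a sanity check under stated assumptions: one group per position with zero padding at the borders, odd window sizes, and hypothetical names (`box_sum`, `smoothed_ogs`); the random matrices are placeholders.

```python
import numpy as np

def box_sum(A, k1, k2):
    """Sum of A over a k1-by-k2 window centred at each entry (zero padding, odd k1, k2)."""
    p1, p2 = k1 // 2, k2 // 2
    P = np.pad(A, ((p1, p1), (p2, p2)))
    out = np.zeros(A.shape)
    for di in range(k1):
        for dj in range(k2):
            out += P[di:di + A.shape[0], dj:dj + A.shape[1]]
    return out

rng = np.random.default_rng(1)
eps, k1, k2 = 1e-8, 3, 5
H_t = rng.random((4, 12))    # current iterate H^(t)
H = rng.random((4, 12))      # arbitrary candidate H

r = np.sqrt(box_sum(H_t * H_t, k1, k2) + eps)   # local energies r^(t)
w = box_sum(1.0 / r, k1, k2)                    # element-wise IRLS weights w^(t)

group_form = np.sum(box_sum(H * H, k1, k2) / r)   # sum over windows of S(H) / r
element_form = np.sum(w * H * H)                  # sum over entries of w * H^2

def smoothed_ogs(A):
    """Epsilon-smoothed OGS penalty."""
    return np.sqrt(box_sum(A * A, k1, k2) + eps).sum()

def quadratic_bound(A):
    """Upper bound on the smoothed OGS penalty built around H^(t)."""
    return 0.5 * np.sum((box_sum(A * A, k1, k2) + eps) / r + r)

print(group_form, element_form)
print(smoothed_ogs(H), quadratic_bound(H))
```

The two forms agree because a centred all-ones window with zero padding defines a symmetric linear operator, so moving it from $\mathbf{H} \odot \mathbf{H}$ onto $1/\mathbf{r}^{(t)}$ leaves the total unchanged. The bound touches the penalty at the current iterate and lies above it elsewhere.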
According to Equation (23), the overall gradient can be expressed as follows:
$$\nabla_{\mathbf{H}} Q(\mathbf{H}; \mathbf{H}^{(t)}) = \mathbf{W}^{\top}\mathbf{W}\mathbf{H} - \mathbf{W}^{\top}\mathbf{Y} + \lambda_{\text{ogs}}\, \mathbf{w}^{(t)} \odot \mathbf{H}. \tag{24}$$
In theory, the update of the matrix $\mathbf{H}$ can be obtained by setting the gradient to zero, leading to a linear system involving the OGS weights. However, in the proposed method, the coefficient matrix simultaneously contains both $\mathbf{W}^{\top}\mathbf{W}$ and the convolution operator associated with the IRLS-based OGS weights. As a result, the system is often severely ill-conditioned, numerically unstable, and highly sensitive to noise and scaling variations. Even when iterative solvers, such as preconditioned conjugate gradient (PCG), are employed, additional preconditioners and monotonicity safeguards are required, which considerably increase the computational cost and implementation complexity. Moreover, the closed-form solution does not inherently satisfy the nonnegativity constraint, and subsequent projection steps may disrupt the monotonic descent guaranteed by the MM framework, thereby undermining the reliability of the optimization process.
By contrast, the Lee–Seung type multiplicative update rule [], derived from the auxiliary function method, naturally guarantees nonnegativity and a monotonic non-increase in the objective. To construct a feasible iterative update, the gradient is decomposed into a positive part $\nabla^{+} = \mathbf{W}^{\top}\mathbf{W}\mathbf{H} + \lambda_{\text{ogs}}\, \mathbf{w}^{(t)} \odot \mathbf{H}$ and a negative part $\nabla^{-} = \mathbf{W}^{\top}\mathbf{Y}$. According to the Lee–Seung auxiliary function method, which provides the update in the form $\mathbf{H} \leftarrow \mathbf{H} \odot \nabla^{-} \oslash \nabla^{+}$, the update rule for $\mathbf{H}$ can thus be obtained as follows:
$$H_{i,j}^{(t+1)} = H_{i,j}^{(t)}\, \frac{(\mathbf{W}^{\top}\mathbf{Y})_{i,j}}{(\mathbf{W}^{\top}\mathbf{W}\mathbf{H}^{(t)})_{i,j} + \lambda_{\text{ogs}} \left( w_{i,j}^{(t)} H_{i,j}^{(t)} \right) + \varepsilon}, \tag{25}$$
where $\varepsilon > 0$ is a small positive constant introduced to prevent division by zero. Fixed points of the multiplicative update satisfy the Karush–Kuhn–Tucker (KKT) conditions of the surrogate problem and, within the OGS-regularized framework, the update preserves local group coupling while ensuring nonnegativity and a monotonic decrease in the objective. This helps maintain time–frequency ridge continuity in the proposed method.
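The behaviour of the multiplicative rule can be illustrated with a short experiment in which $\mathbf{W}$ is held fixed and only $\mathbf{H}$ is updated. This is a sketch under assumptions: random non-negative data, zero-padded odd windows, and hypothetical function names (`update_H`, `smoothed_objective`); it is not the implementation used for the reported results.

```python
import numpy as np

def box_sum(A, k1, k2):
    """Sum of A over a k1-by-k2 window centred at each entry (zero padding, odd k1, k2)."""
    p1, p2 = k1 // 2, k2 // 2
    P = np.pad(A, ((p1, p1), (p2, p2)))
    out = np.zeros(A.shape)
    for di in range(k1):
        for dj in range(k2):
            out += P[di:di + A.shape[0], dj:dj + A.shape[1]]
    return out

def update_H(Y, W, H, lam_ogs, k1=3, k2=3, eps=1e-8):
    """One multiplicative update of H with IRLS weights recomputed from the current H."""
    r = np.sqrt(box_sum(H * H, k1, k2) + eps)
    w = box_sum(1.0 / r, k1, k2)
    numer = W.T @ Y                                   # negative part of the gradient
    denom = W.T @ W @ H + lam_ogs * w * H + eps       # positive part plus safeguard
    return H * numer / denom

def smoothed_objective(Y, W, H, lam_ogs, k1=3, k2=3, eps=1e-8):
    """Fidelity plus the epsilon-smoothed OGS penalty (W fixed, so GL is omitted)."""
    ogs = np.sqrt(box_sum(H * H, k1, k2) + eps).sum()
    return 0.5 * np.linalg.norm(Y - W @ H, "fro") ** 2 + lam_ogs * ogs

rng = np.random.default_rng(0)
Y = rng.random((20, 30))
W = rng.random((20, 4))
H = rng.random((4, 30))

history = [smoothed_objective(Y, W, H, 0.5)]
for _ in range(20):
    H = update_H(Y, W, H, lam_ogs=0.5)
    history.append(smoothed_objective(Y, W, H, 0.5))
print(history[0], history[-1])
```

Because every factor in the update is non-negative, $\mathbf{H}$ never leaves the feasible set, and no projection step is needed after the update.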
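Within the modeling framework of NMF, the update of the basis matrix $\mathbf{W}$ can be formulated as the following optimization problem:
$$\min_{\mathbf{W} \ge 0}\; \frac{1}{2} \left\| \mathbf{Y} - \mathbf{W}\mathbf{H} \right\|_F^2 + \lambda_{\text{gl}} \sum_{j=1}^{r} \left\| \mathbf{W}_{:,j} \right\|_2. \tag{26}$$
Since the objective function contains a nonsmooth regularization term, directly applying the conventional gradient descent method cannot yield effective updates. Therefore, it is necessary to employ the proximal gradient method (PGM) to separate the optimization of the smooth and nonsmooth components.
According to the rules of matrix differentiation, the gradient of the smooth term $\frac{1}{2} \left\| \mathbf{Y} - \mathbf{W}\mathbf{H} \right\|_F^2$ with respect to $\mathbf{W}$ is as follows:
$$\nabla_{\mathbf{W}} f(\mathbf{W}) = \nabla_{\mathbf{W}}\, \frac{1}{2} \left\| \mathbf{Y} - \mathbf{W}\mathbf{H} \right\|_F^2 = \mathbf{W}\mathbf{H}\mathbf{H}^{\top} - \mathbf{Y}\mathbf{H}^{\top}. \tag{27}$$
The Lipschitz constant of the gradient is $\gamma = \left\| \mathbf{H}\mathbf{H}^{\top} \right\|_2$, and thus the step size is chosen as $\eta \in (0, 1/\gamma)$. At each iteration, the current basis matrix $\mathbf{W}^{(t)}$ is first updated by a single gradient descent step to obtain an intermediate point:
$$\mathbf{V} = \mathbf{W}^{(t)} - \eta \left( \mathbf{W}^{(t)}\mathbf{H}\mathbf{H}^{\top} - \mathbf{Y}\mathbf{H}^{\top} \right). \tag{28}$$
Then, the proximal operator is applied to the intermediate point $\mathbf{V}$, thereby simultaneously enforcing nonnegativity and group sparsity. The corresponding proximal subproblem is formulated as follows:
$$\mathbf{W}^{(t+1)} = \arg\min_{\mathbf{W} \ge 0}\; \frac{1}{2} \left\| \mathbf{W} - \mathbf{V} \right\|_F^2 + \eta \lambda_{\text{gl}} \sum_{j=1}^{r} \left\| \mathbf{W}_{:,j} \right\|_2. \tag{29}$$
Since the regularization term is decomposed column-wise, the proximal operator can be computed independently for each column. For the $j$-th column, the proximal subproblem can be expressed as follows:
$$\mathbf{W}_{:,j}^{(t+1)} = \arg\min_{\mathbf{w}_j \ge 0}\; \frac{1}{2} \left\| \mathbf{w}_j - \mathbf{v}_j \right\|_2^2 + \eta \lambda_{\text{gl}} \left\| \mathbf{w}_j \right\|_2, \tag{30}$$
where $\mathbf{w}_j$ denotes the $j$-th column of $\mathbf{W}$, and $\mathbf{v}_j$ denotes the $j$-th column of $\mathbf{V}$. The closed-form solution of this problem can be described as follows: if the Euclidean norm of the non-negative part of the vector, $\mathbf{v}_j^{+} = \max(\mathbf{v}_j, 0)$, is less than or equal to the threshold $\eta \lambda_{\text{gl}}$, then the entire column is shrunk to zero; otherwise, the result is a non-negative vector scaled proportionally, which can be expressed as follows []:
$$\mathbf{W}_{:,j}^{(t+1)} = \left( 1 - \frac{\eta \lambda_{\text{gl}}}{\left\| \mathbf{v}_j^{+} \right\|_2} \right)_{+} \mathbf{v}_j^{+}. \tag{31}$$
By consolidating the column-wise derivations, the complete update rule for the entire matrix $\mathbf{W}$ is obtained:
$$\mathbf{W}^{(t+1)} = (\mathbf{V})_{+}\, \mathrm{Diag}\!\left( \left\{ \left( 1 - \frac{\eta \lambda_{\text{gl}}}{\left\| \mathbf{V}_{:,j}^{+} \right\|_2} \right)_{+} \right\}_{j=1}^{r} \right), \tag{32}$$
where $(\mathbf{V})_{+} = \max(\mathbf{V}, 0)$ denotes the element-wise non-negative truncation, and the term inside $\mathrm{Diag}(\cdot)$ represents the column-wise soft-thresholding shrinkage.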
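The column-wise shrinkage has a simple numerical form, shown here on a small hand-checkable matrix. The function name `prox_group_nonneg` and the guard constant `tiny` are hypothetical choices made for this sketch.

```python
import numpy as np

def prox_group_nonneg(V, thresh, tiny=1e-12):
    """Non-negative truncation followed by column-wise group soft-thresholding."""
    U = np.maximum(V, 0.0)                        # element-wise non-negative part
    norms = np.linalg.norm(U, axis=0)             # l2 norm of each column
    scale = np.maximum(1.0 - thresh / np.maximum(norms, tiny), 0.0)
    return U * scale                              # broadcasts one factor per column

# Columns: a strong basis, a weak basis, and an all-negative column.
V = np.array([[3.0, 0.1, -1.0],
              [4.0, -0.2, -2.0]])
W_new = prox_group_nonneg(V, thresh=1.0)
print(W_new)
```

The first column has norm 5, so it is scaled by $1 - 1/5 = 0.8$ to $(2.4, 3.2)$. The second column has a non-negative part of norm 0.1, below the threshold, and the third has no non-negative part at all, so both are removed entirely. This is the automatic basis selection described above.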
Therefore, based on the alternating minimization optimization strategy, the optimal matrices $\mathbf{W}$ and $\mathbf{H}$ can be obtained after a certain number of iterations. Subsequently, the reconstructed time–frequency matrix $\hat{\mathbf{S}}$ can be derived by $\hat{\mathbf{S}} = \mathbf{W}\mathbf{H}$. The whistle enhancement process based on regularized NMF is presented in Algorithm 1.
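Algorithm 1 Dually-Regularized NMF Algorithm
  1: Input: Collected noisy whistle signal $x(n)$;
  2: Noisy whistle T-F representation $\mathbf{Y} = |\mathrm{STFT}(x(n))|^{\beta}$, with the STFT computed using Equation (1);
  3: Require: $\mathbf{Y}$, $r$, $\lambda_{\text{ogs}}$, $\lambda_{\text{gl}}$, $K_1$, $K_2$, $K_{\max}$;
  4: $(\mathbf{W}, \mathbf{H}) \leftarrow \mathrm{NNDSVD}(\mathbf{Y}, r)$; define kernels $\mathbf{h}_1$, $\mathbf{h}_2$;
  5: $k \leftarrow 0$;
  6: while $k < K_{\max}$ do:
  7:     H-update:
  8:         $\mathbf{d} \leftarrow \sqrt{\mathrm{conv2}(\mathbf{h}_1, \mathbf{h}_2, |\mathbf{H}|^{2}) + \varepsilon}$;
  9:         $\mathbf{D} \leftarrow \lambda_{\text{ogs}}\, \mathrm{conv2}(\mathbf{h}_1, \mathbf{h}_2, 1 ./ \mathbf{d}) \odot \mathbf{H}$;
 10:         $H_{i,j} \leftarrow H_{i,j}\, \dfrac{(\mathbf{W}^{\top}\mathbf{Y})_{i,j}}{(\mathbf{W}^{\top}\mathbf{W}\mathbf{H})_{i,j} + D_{i,j} + \varepsilon}$;
 11:     W-update:
 12:         $\mathbf{V} \leftarrow \mathbf{W} - \eta \left( \mathbf{W}(\mathbf{H}\mathbf{H}^{\top}) - \mathbf{Y}\mathbf{H}^{\top} \right)$;
 13:         $\mathbf{U} \leftarrow \max(\mathbf{V}, 0)$;
 14:         for each column $j$ do
 15:             $\mathbf{W}_{:,j} \leftarrow \left( 1 - \dfrac{\eta \lambda_{\text{gl}}}{\left\| \mathbf{U}_{:,j} \right\|_2 + \varepsilon} \right)_{+} \mathbf{U}_{:,j}$;
 16:         end for
 17:     $k \leftarrow k + 1$;
 18: end while
 19: return $\mathbf{W}$, $\mathbf{H}$;
 20: Output: Reconstructed whistle T-F representation $\hat{\mathbf{S}} \leftarrow \mathbf{W}\mathbf{H}$.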
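The alternating scheme can be sketched compactly in NumPy. The sketch departs from Algorithm 1 in ways that should be read as assumptions: it uses a random non-negative initialization instead of NNDSVD, a fixed step-size fraction of 0.9 of the Lipschitz bound, zero-padded odd windows, a single H update per outer iteration, and hypothetical names (`dr_nmf`, `objective`, `box_sum`). The test matrix is synthetic.

```python
import numpy as np

def box_sum(A, k1, k2):
    """Sum of A over a k1-by-k2 window centred at each entry (zero padding, odd k1, k2)."""
    p1, p2 = k1 // 2, k2 // 2
    P = np.pad(A, ((p1, p1), (p2, p2)))
    out = np.zeros(A.shape)
    for di in range(k1):
        for dj in range(k2):
            out += P[di:di + A.shape[0], dj:dj + A.shape[1]]
    return out

def objective(Y, W, H, lam_ogs, lam_gl, k1, k2, eps):
    """Fidelity + epsilon-smoothed OGS penalty on H + Group Lasso penalty on W."""
    ogs = np.sqrt(box_sum(H * H, k1, k2) + eps).sum()
    gl = np.linalg.norm(W, axis=0).sum()
    return 0.5 * np.linalg.norm(Y - W @ H) ** 2 + lam_ogs * ogs + lam_gl * gl

def dr_nmf(Y, r, lam_ogs=0.1, lam_gl=0.1, k1=3, k2=3, n_iter=100, eps=1e-8, seed=0):
    rng = np.random.default_rng(seed)
    K, M = Y.shape
    W = rng.random((K, r)) + eps          # random init (assumption; not NNDSVD)
    H = rng.random((r, M)) + eps
    history = [objective(Y, W, H, lam_ogs, lam_gl, k1, k2, eps)]
    for _ in range(n_iter):
        # H-update: IRLS weights from the current H, then a multiplicative step.
        d = np.sqrt(box_sum(H * H, k1, k2) + eps)
        D = lam_ogs * box_sum(1.0 / d, k1, k2) * H
        H = H * (W.T @ Y) / (W.T @ W @ H + D + eps)
        # W-update: gradient step, non-negative truncation, column-wise shrinkage.
        HHt = H @ H.T
        eta = 0.9 / max(np.linalg.norm(HHt, 2), eps)
        U = np.maximum(W - eta * (W @ HHt - Y @ H.T), 0.0)
        norms = np.linalg.norm(U, axis=0)
        W = U * np.maximum(1.0 - eta * lam_gl / (norms + eps), 0.0)
        history.append(objective(Y, W, H, lam_ogs, lam_gl, k1, k2, eps))
    return W, H, history

# Synthetic test: two non-negative components plus a small non-negative disturbance.
rng = np.random.default_rng(1)
Y = rng.random((30, 2)) @ rng.random((2, 40)) + 0.05 * rng.random((30, 40))
W, H, history = dr_nmf(Y, r=6, n_iter=50)
S_hat = W @ H
print(history[0], history[-1])
```

Tracking `history` is a convenient way to confirm that both update steps decrease the regularized objective in a given configuration before moving to real spectrograms.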

2.3. Endpoint Detector Based on Time–Frequency Spectrogram

In this paper, endpoint detection is performed on the time–frequency representations of the noisy observation matrix $\mathbf{Y} \in \mathbb{R}_{+}^{K \times M}$ and the reconstructed matrix $\hat{\mathbf{S}} \in \mathbb{R}_{+}^{K \times M}$. Taking the matrix $\mathbf{Y}$ as an example, $K$ denotes the number of frequency bins, $M$ is the number of frames, $f_s$ is the sampling rate, and $h$ is the frame shift. The frame-wise energy is defined as follows:
$$E_m = \sum_{k=1}^{K} Y_{k,m}^{2}, \quad m = 0, 1, \ldots, M - 1. \tag{33}$$
By normalizing as $\tilde{E}_m = E_m / \max_m(E_m)$, the overall amplitude variation is eliminated, and the time axis is given by $t_m = m h / f_s$. To distinguish active frames from silent ones, an adaptive threshold is introduced:
$$\tau = \mathrm{median}(\tilde{E}) + \alpha \left( \max(\tilde{E}) - \mathrm{median}(\tilde{E}) \right), \tag{34}$$
where $\alpha$ is set to 0.4, which biases the threshold toward the median. It was observed that the detection performance remained good under different SNRs and dolphin whistle conditions with this setting. Therefore, there is no need to adjust the parameter separately for each signal segment. Based on this threshold, the activity indicator function is defined as $A_m = \mathbb{1}\{\tilde{E}_m > \tau\}$, thereby obtaining a candidate sequence of active frames.
To avoid misjudging short-term energy fluctuations as event boundaries, duration constraints are imposed. Let the total duration be $T = M h / f_s$; then the minimum silence duration is defined as $\Delta_{\text{sil}} = \max(0.1, 0.01T)$ and the minimum segment duration is defined as $\Delta_{\text{dur}} = \max(0.01, 0.005T)$, which are converted into the corresponding frame numbers $L_{\text{sil}}$ and $L_{\text{dur}}$, respectively. By tracking the rising and falling edges of the detected activity indicator sequence, candidate onset–offset frame indices $(s_i, e_i)$ are obtained. If the gap between two adjacent segments is less than $L_{\text{sil}}$ and their mean energy is below the threshold, they are merged into a single continuous segment. Finally, segments shorter than $L_{\text{dur}}$ are removed, leaving only valid event intervals. The final endpoint set is, thus, given by $\mathcal{S} = \{(t_{s_i}, t_{e_i})\}_i$, where $t_{s_i} = s_i h / f_s$ and $t_{e_i} = e_i h / f_s$.
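The detection steps above can be sketched as follows. Several details are assumptions made for illustration: durations are taken to be in seconds, the offset index is the first inactive frame after a segment, and adjacent segments are merged on gap length alone because the energy condition on merging is not specified precisely enough to implement. The function name `detect_endpoints` and the toy spectrogram are hypothetical.

```python
import numpy as np

def detect_endpoints(Y, fs, hop, alpha=0.4):
    """Return (onset, offset) times in seconds from a non-negative spectrogram Y (K x M)."""
    E = np.sum(Y ** 2, axis=0)                 # frame-wise energy
    E = E / E.max()                            # normalised energy
    tau = np.median(E) + alpha * (E.max() - np.median(E))
    active = E > tau                           # activity indicator per frame

    T = Y.shape[1] * hop / fs                  # total duration (s)
    L_sil = int(round(max(0.1, 0.01 * T) * fs / hop))    # min silence, in frames
    L_dur = int(round(max(0.01, 0.005 * T) * fs / hop))  # min duration, in frames

    # Rising and falling edges of the indicator sequence.
    edges = np.diff(np.concatenate(([0], active.astype(int), [0])))
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1)         # first inactive frame after each segment

    segments = []
    for s, e in zip(starts, ends):
        if segments and s - segments[-1][1] < L_sil:
            segments[-1][1] = e                # short gap: merge with previous segment
        else:
            segments.append([s, e])
    return [(s * hop / fs, e * hop / fs) for s, e in segments if e - s >= L_dur]

# Toy spectrogram: low background with three bursts, two of them 5 frames apart.
Y = np.full((4, 200), 0.1)
Y[:, 20:50] = 1.0
Y[:, 55:80] = 1.0
Y[:, 150:170] = 1.0
segs = detect_endpoints(Y, fs=1000, hop=10)
print(segs)
```

In this toy case the frame period is 0.01 s and the minimum silence is 10 frames, so the first two bursts are merged into one event from 0.2 s to 0.8 s, and the third is reported separately from 1.5 s to 1.7 s.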

2.4. Experimental Setup

2.4.1. Evaluation Metrics

To evaluate the whistle-event detection performance of the proposed method, the frame-level $F_1$-score is calculated, which is defined as the harmonic mean of precision and recall. The frame-level $F_1$-score comprehensively reflects the effectiveness of the method in whistle detection. In the collected dolphin whistle signals, let there be $M$ frames containing whistles. After enhancement and detection, $TP$ frames are correctly identified as whistle frames (true positives), while $FN$ frames remain undetected (false negatives), such that $M = TP + FN$. Meanwhile, if whistle-absent frames are mistakenly judged as containing whistles, their number is denoted as $FP$ (false positives), and those correctly judged as non-whistle frames are denoted as $TN$ (true negatives). Under this definition, precision and recall can be expressed as
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}. \tag{35}$$
The $F_1$-score, as a comprehensive indicator, is then obtained by computing the harmonic mean of precision and recall, and it is expressed as follows:
$$\frac{1}{F_1} = \frac{1}{2} \left( \frac{1}{\mathrm{Precision}} + \frac{1}{\mathrm{Recall}} \right). \tag{36}$$
Therefore, the frame-level $F_1$-score is given by
$$F_1 = \frac{2\,TP}{2\,TP + FP + FN}. \tag{37}$$
The ground-truth annotations were obtained by combining the listening-based methods that audio engineers use to assess human auditory perception with the spectrogram inspection commonly used by zoologists.
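As a small illustration of the metric definitions in this subsection, the following Python sketch computes frame-level precision, recall, and $F_1$ from two 0/1 label sequences. The function name `frame_scores` and the convention of returning 0.0 when a denominator is zero are assumptions made here, not details taken from the paper.

```python
def frame_scores(truth, pred):
    """Frame-level precision, recall and F1 from two equal-length 0/1 sequences."""
    tp = sum(1 for t, p in zip(truth, pred) if t and p)
    fp = sum(1 for t, p in zip(truth, pred) if not t and p)
    fn = sum(1 for t, p in zip(truth, pred) if t and not p)
    # Zero-denominator cases are mapped to 0.0 (an assumed convention).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1
```

Because $F_1$ is a harmonic mean, it is pulled toward the smaller of the two quantities: a detector with perfect precision but a recall of 0.5 scores only about 0.67, which is why the recall gains reported later translate directly into $F_1$ gains.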

2.4.2. Baseline Methods

To validate the effectiveness and superiority of the proposed method, a systematic comparison was conducted against five representative baseline approaches. Baseline Method 1 (Blurring) employs a median filtering technique to smooth the signal, which is commonly applied to mitigate time–frequency blurring effects in dolphin whistle signals []. Baseline Method 2 (OGS) introduces a non-convex overlapping group sparsity regularization within a convex optimization framework [], achieving robust and effective signal denoising by reinforcing the constraints on group-sparse structures. Baseline Method 3 (GL) imposes group-sparse constraints based on predefined group structures [], thereby enhancing the accuracy of parameter estimation and the robustness to noise under strong group-sparsity conditions. Baseline Method 4 (NMF-OGS) imposes OGS regularization on the coefficient matrix $\mathbf{H}$ in standard NMF, suppressing scattered activations along the time axis and encouraging sparse patterns that are continuous in segments. Baseline Method 5 (NMF-GL) applies GL regularization to the basis matrix $\mathbf{W}$ in standard NMF to achieve group-level sparse selection.
To ensure objectivity and consistency in the comparison process, the performance of all methods was evaluated using the same detector. The implementation of the detector is described in detail in Section 2, and the evaluation metrics include precision, recall, and the $F_1$-score computed from them.

3. Results

3.1. Optimal Parameter Selection

To ensure that the proposed method achieves optimal performance in dolphin whistle enhancement and endpoint detection, several of the model's input parameters must be systematically optimized and validated. First, regarding the number of iterations, too few iterations may leave the decomposition incompletely converged, whereas too many incur excessive computational cost for limited performance gain. Based on repeated experimental validation, the number of iterations in this work is set to 200 to balance convergence accuracy and computational efficiency. Within each alternating update iteration, the update of matrix $\mathbf{H}$ additionally runs an inner loop of 5 iterations, which enhances the stability and accuracy of the OGS-regularized optimization. In addition, if the decomposition rank $r$ is set too small, the model will fail to adequately capture the time–frequency characteristics of whistle signals, while an excessively large $r$ may introduce noise components and reduce the effectiveness of sparsity regularization. Considering both signal complexity and model robustness, this paper sets $r = 128$ to maintain a balance between feature representation capacity and the enforcement of regularization constraints.
Within the framework of NMF, the configuration of regularization terms plays a decisive role in the final enhancement and endpoint detection performance. The proposed model involves four core hyperparameters, namely the OGS regularization coefficient $\lambda_{\mathrm{ogs}}$, the GL regularization coefficient $\lambda_{\mathrm{gl}}$, and the time–frequency neighborhood window sizes $(K_1, K_2)$ required by the OGS regularization. To determine the optimal combination of these hyperparameters, a grid search strategy is adopted to traverse the parameter space. The overall optimization objective function can be expressed as follows:
$$\min_{\mathbf{W}, \mathbf{H} \ge 0} \; \frac{1}{2}\left\|\mathbf{Y} - \mathbf{W}\mathbf{H}\right\|_F^2 + \lambda_{\mathrm{ogs}}\, \Omega_{\mathrm{OGS}}(\mathbf{H}; K_1, K_2) + \lambda_{\mathrm{gl}}\, \Omega_{\mathrm{GL}}(\mathbf{W}).$$
In the parameter search and optimization process, the search ranges of $\lambda_{\mathrm{ogs}}$ and $\lambda_{\mathrm{gl}}$ are set to [0.01, 0.1] with a step size of 0.01, while the search ranges of $K_1$ and $K_2$ are set to [3, 15] with a step size of 2. Each candidate parameter set is evaluated using the frame-level $F_1$-score obtained from endpoint detection, and the parameter combination yielding the highest score is regarded as the optimal solution, denoted as $(\lambda_{\mathrm{ogs}}, \lambda_{\mathrm{gl}}, K_1, K_2) = \arg\max_{\lambda_{\mathrm{ogs}}, \lambda_{\mathrm{gl}}, K_1, K_2} F_1$.
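The grid search over the stated ranges can be sketched as follows. This is only an illustration: `score_fn` is a hypothetical callback standing in for the full pipeline (enhancement followed by endpoint detection) that returns a frame-level $F_1$-score for one parameter set, and ties are resolved by keeping the first maximum, which is an assumption.

```python
from itertools import product

def grid_search(score_fn):
    """Exhaustively search (lambda_ogs, lambda_gl, K1, K2) and keep the best score."""
    lambdas = [round(0.01 * i, 2) for i in range(1, 11)]   # 0.01, 0.02, ..., 0.10
    windows = list(range(3, 16, 2))                         # 3, 5, ..., 15
    best_params, best_score = None, float("-inf")
    for params in product(lambdas, lambdas, windows, windows):
        score = score_fn(*params)        # hypothetical: run the pipeline, return F1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

With 10 values for each regularization coefficient and 7 for each window size, the grid contains 10 × 10 × 7 × 7 = 4900 candidate combinations, each requiring one full enhancement and detection run.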
It is noteworthy that during the alternating updates of $\mathbf{W}$ and $\mathbf{H}$, the choice of the gradient descent step size $\eta$ is also involved. In theory, the step size can be set according to the Lipschitz constant $\gamma$ as $\eta = 1/\gamma$, which guarantees convergence but is often overly conservative, thereby limiting both convergence speed and performance improvement. Based on extensive experimental comparisons, the step size in this work is ultimately set to $\eta = 0.5$, which balances numerical stability with practical performance without strictly relying on the Lipschitz bound.
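To make the role of the step size concrete, the sketch below shows one proximal gradient step on the basis matrix for the data-fidelity term plus a group-lasso penalty. It is a hedged illustration rather than the paper's implementation: the function names are invented here, each column of $\mathbf{W}$ is assumed to form one group, and the default step is the conservative $\eta = 1/\gamma$ with $\gamma$ taken as the spectral norm of $\mathbf{H}\mathbf{H}^{\top}$ (the paper itself reports using a fixed $\eta = 0.5$).

```python
import numpy as np

def prox_group_nonneg(V, thresh):
    """Prox of thresh * sum_k ||v_k||_2 with V >= 0; columns as groups (assumed)."""
    V = np.maximum(V, 0.0)                                    # enforce non-negativity
    norms = np.linalg.norm(V, axis=0, keepdims=True)
    scale = np.maximum(1.0 - thresh / np.maximum(norms, 1e-12), 0.0)
    return V * scale                                          # shrink or zero each column

def update_W(Y, W, H, lam_gl, eta=None):
    """One proximal gradient step on W for 0.5*||Y - WH||_F^2 + lam_gl * group norm."""
    HHt = H @ H.T
    if eta is None:
        eta = 1.0 / max(np.linalg.norm(HHt, 2), 1e-12)        # eta = 1 / gamma
    grad = W @ HHt - Y @ H.T                                  # gradient of the data term
    return prox_group_nonneg(W - eta * grad, eta * lam_gl)

def objective_W(Y, W, H, lam_gl):
    """Data-fidelity term plus the group-lasso penalty on the columns of W."""
    return 0.5 * np.linalg.norm(Y - W @ H) ** 2 + lam_gl * np.linalg.norm(W, axis=0).sum()
```

With the Lipschitz step, each call to `update_W` cannot increase `objective_W` for a non-negative starting point; a larger fixed step may converge faster in practice but loses that guarantee, which is the trade-off the paragraph above describes.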

3.2. Enhanced and Detected Results of Typical Whistle Segments

This paper selects three typical pure dolphin whistle clips emitted by two captive Southern bottlenose dolphins []: wav1, wav2, and wav3. These signals cover concave and upsweep morphological features and effectively characterize the modulation patterns commonly observed in acoustic monitoring of the target species. Such a selection helps prevent the evaluation from overfitting to a single type of whistle structure. According to Equation (2), the STFT was employed to convert the signals into the time–frequency domain. All STFT parameters in this article are configured as follows: a window length of 256 samples, an overlap of 128 samples, and an FFT size of 1024. The resulting clean spectrograms are presented in Figure 1a,c,e, which clearly reveal the tonal ridges characteristic of whistle contours.
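The STFT configuration stated above can be reproduced with a few lines of NumPy. This is a sketch under assumptions: the window type is not given in this section, so a Hann window is assumed, and the function name `stft_magnitude` is hypothetical.

```python
import numpy as np

def stft_magnitude(x, win_len=256, overlap=128, n_fft=1024):
    """Magnitude spectrogram: 256-sample frames, 128-sample overlap, 1024-point FFT."""
    hop = win_len - overlap
    window = np.hanning(win_len)                  # window type is an assumption
    n_frames = 1 + (len(x) - win_len) // hop      # only complete frames are kept
    frames = np.stack(
        [x[m * hop : m * hop + win_len] * window for m in range(n_frames)], axis=1
    )
    # Each frame is zero-padded from win_len to n_fft before the real FFT.
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=0))
```

Zero-padding each 256-sample frame to 1024 points yields 513 frequency bins per frame. It interpolates the spectrum on a finer grid but does not improve the underlying frequency resolution, which is still set by the 256-sample window.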
Figure 1. Comparison of the time–frequency spectrograms between the clean and noisy whistle segments under −5 dB SNR noise interference. Lighter colors indicate higher energy. (a) clean wav1; (b) noisy wav1; (c) clean wav2; (d) noisy wav2; (e) clean wav3; and (f) noisy wav3.
To assess the robustness of the proposed enhancement framework under noisy conditions, Gaussian white noise was artificially added to the clean whistle signals. The input SNRs were set in the range of −10 dB to −5 dB, representing the challenging acoustic environments typically encountered in shallow-water passive acoustic monitoring. Importantly, the noise was generated using a fixed random seed, thereby eliminating stochastic variability and ensuring the reproducibility of all experiments. The spectrograms of the noisy whistle signals are illustrated in Figure 1b,d,f. As shown, under an input SNR of −5 dB, the noisy spectrograms exhibit nearly uniform background fluctuations with pronounced random granular interference. This led to a flattening of the overall energy distribution, which, in turn, significantly diminished the contrast between whistle ridges and weak-energy segments relative to the background. Such degradation highlights the necessity of effective enhancement techniques, as traditional detection algorithms may fail to reliably distinguish whistle contours from the surrounding noise under these adverse acoustic conditions.
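The noise-injection step can be sketched as below: white Gaussian noise is scaled so that the mixture reaches a requested input SNR, and a fixed seed makes the result reproducible. The function name `add_white_noise` and the particular seed are assumptions for illustration; the SNR is defined here over the whole clip, which may differ from the authors' convention.

```python
import numpy as np

def add_white_noise(x, snr_db, seed=0):
    """Add white Gaussian noise to x so that the clip-level SNR equals snr_db."""
    rng = np.random.default_rng(seed)             # fixed seed for reproducibility
    noise = rng.standard_normal(len(x))
    p_signal = np.mean(x ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale the noise so that 10*log10(p_signal / p_noise_scaled) == snr_db.
    noise *= np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return x + noise
```

At −5 dB the scaled noise carries roughly 3.2 times the power of the whistle, and at −10 dB ten times, which is consistent with the nearly uniform background seen in the noisy spectrograms.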
Building upon the hyperparameter search strategy outlined in the preceding subsection, this study employed the frame-level $F_1$-score as the primary evaluation criterion. A comprehensive grid search was conducted on the validation set to explore the joint parameter space, and the best-scoring configuration over the grid was determined as $(\lambda_{\mathrm{ogs}}, \lambda_{\mathrm{gl}}, K_1, K_2) = (0.02, 0.03, 11, 3)$. This parameter set was thereafter fixed throughout the experimental stage to ensure consistency and comparability across evaluations involving the three whistle segments. As summarized in Table 1, under varying SNR conditions, the enhancement process yields marked improvements in recall across all three segments, reflecting the method’s capability to preserve weak signal components that are often masked by noise. Moreover, the frame-level $F_1$-score consistently surpassed the corresponding pre-enhancement results, underscoring the robustness of the proposed framework in low-SNR environments. Notably, the highest-performing sample achieved a maximum frame-level $F_1$-score of 0.9678, demonstrating the effectiveness of the optimized parameter combination in balancing noise suppression with signal fidelity.
Table 1. The detection precision, recall, and frame-level $F_1$-score of whistle segments under different SNRs before and after enhancement.
Figure 2 presents the time–frequency spectrogram results of the wav1, wav2, and wav3 segments under −5 dB SNR noise interference after being processed by the proposed enhancement method. It can be observed that the energy distribution of the whistles becomes clearer and more continuous, with the background noise–induced random fluctuations significantly suppressed, thereby allowing the whistle ridges and weak energy components to be more prominently revealed.
Figure 2. Time–frequency spectrograms of the whistle segments after enhancement under −5 dB SNR noise interference. Lighter colors indicate higher energy. (a) wav1; (b) wav2; and (c) wav3.
Furthermore, Figure 3 compares the energy curves and the corresponding endpoint detection results of three whistle segments before and after enhancement under −5 dB noise interference. It can be observed that the enhancement processing made the peak–valley differences of the energy curves more pronounced, while the low-amplitude fluctuations caused by noise were substantially suppressed, thereby improving the robustness and accuracy of endpoint detection. Compared with the distorted and fluctuating energy curves prior to enhancement, the onset and offset boundaries of the enhanced signals are more clearly defined. This demonstrates that the proposed method not only improves the interpretability of time–frequency representations, but also significantly enhances detection performance in practical endpoint detection tasks.
Figure 3. Comparison of the energy curves of whistle segments before and after enhancement under −5 dB SNR noise interference. The green and pink dashed lines represent the starting and ending points, respectively, of the detected audio clips. (a) noisy wav1; (b) enhanced wav1; (c) noisy wav2; (d) enhanced wav2; (e) noisy wav3; and (f) enhanced wav3.

3.3. Comparison with Other Methods

In the cross-method performance comparison, the wav1 whistle segment was selected as the test subject, and the proposed method was systematically evaluated against the five baseline methods under various low-SNR conditions. Specifically, under noise interference at different SNR levels, the enhanced signals obtained by each method were analyzed using the endpoint detection algorithm introduced in Section 2, with precision, recall, and the $F_1$-score adopted as evaluation metrics to ensure consistency and fairness in the assessment process. Figure 4 illustrates the precision, recall, and $F_1$-score obtained after enhancement by the proposed method and the five baseline methods at SNRs of −10 dB, −9 dB, −8 dB, −7 dB, −6 dB, and −5 dB.
Figure 4. Comparative results of the precision, recall, and $F_1$-score obtained after enhancement by the six methods under different SNR conditions. (a) Precision; (b) recall; and (c) $F_1$-score.
It can be observed that, under different SNR conditions, the six methods exhibited distinct differences in endpoint detection performance. In terms of precision, all of the methods maintained values close to 1 across almost all noise levels, with only the proposed method showing a slight decline at −10 dB. This phenomenon indicates that all of the methods demonstrated strong stability in suppressing false detections, as the detected segments were essentially true whistle signals. It should be noted, however, that although the precision of all methods consistently remained at a high level, this does not guarantee the completeness of detection results, as some true whistle segments were still missed. Therefore, it was necessary to further examine the recall metric to evaluate the ability of each method to capture true whistles under different SNR conditions. Regarding recall, the Blurring, OGS, GL, NMF-OGS, and NMF-GL methods generally performed poorly, particularly under low-SNR conditions, where recall rates remained within the range of 0.43–0.69, suggesting limitations in enhancing and preserving weak signals. In contrast, the proposed method exhibited significant advantages across all noise levels, with recall improving from 0.6686 to 0.9377 as the SNR increased from −10 dB to −5 dB, highlighting its enhanced capability of preserving whistle signals in low-SNR environments.
From the perspective of the overall performance metric, the $F_1$-score, the proposed method achieved the highest values under all SNR conditions and exhibited a gradual upward trend as the SNR increased. At −5 dB, the $F_1$-score of the proposed method reached 0.9678, which was markedly superior to Blurring (0.7210), OGS (0.7679), GL (0.8154), NMF-OGS (0.8054), and NMF-GL (0.8134). This demonstrates that the proposed method effectively improves recall while maintaining a high level of precision, thereby achieving overall performance superior to the baseline methods.

3.4. Experimental Results on Real Dolphin Whistle Recordings

To further validate the effectiveness and applicability of the proposed method in real-world scenarios, two bottlenose dolphin whistle segments were selected from an open-source acoustic dataset, denoted as wav4 and wav5 (available at WHOI Dolphin Whistle Database (https://whoicf2.whoi.edu/science/B/whalesounds), accessed on 17 August 2025). The signals were processed under the same experimental framework for enhancement and endpoint detection. Since marine noise in natural environments often exhibits complex non-stationary characteristics, particularly including broadband background noise and impulsive interference, validation on such data provides a more intuitive reflection of the algorithm’s robustness and generalization capability in natural conditions.
Figure 5 presents the original spectrograms of the wav4 and wav5 segments along with their enhanced versions. The original spectrograms were subject to substantial background noise interference. For the wav4 segment, the whistle energy was stronger, and most whistle tracks can still be recognized in the original state; however, noise masked some weak energy components, resulting in incomplete and inaccurate endpoint detection. After enhancement using the proposed method, the spectrogram of the wav4 segment exhibited a continuous and distinct energy distribution, with background noise significantly reduced. The whistle structure is thereby highlighted more clearly and endpoint detection performs well, which demonstrates the enhancement effect of the proposed method on common dolphin whistles.
Figure 5. Comparison of spectrograms of the wav4 and wav5 whistle segments before and after enhancement. Lighter colors indicate higher energy. (a) Noisy wav4; (b) enhanced wav4; (c) noisy wav5; and (d) enhanced wav5.
In contrast, the signal energy of the wav5 whistle was relatively weak. The whistle signal was almost completely submerged in the noise, and its energy trajectory was almost unrecognizable, making traditional endpoint detection methods basically ineffective in this situation. However, the enhanced spectrum processed by the proposed method effectively highlights the energy distribution of the whistle, indicating that the method still has a certain signal extraction ability for weak signals in the presence of noise. Although there was still some residual noise interference, the overall SNR had improved, allowing the whistle structure to be distinguished and, thus, restoring a certain effectiveness of endpoint detection in this segment.
Based on the comparison of the energy curves before and after enhancement, as shown in Figure 6, the effect of the enhancement method under different noise complexities can be analyzed more intuitively. For the wav4 segment, although the pre-enhancement energy curve already reflects the presence of certain whistles, its fluctuations are not sufficiently pronounced, and there exists some deviation between the threshold crossings and the ground-truth annotations. After enhancement with the proposed method, the energy curve exhibited more distinct fluctuations, thereby improving the overlap between the detected endpoints and the annotated intervals, which indicates higher detection accuracy. In contrast, for the wav5 segment, the pre-enhancement energy curve was almost entirely irregular, showing only random fluctuations close to the noise baseline, which made endpoint detection essentially ineffective. After enhancement, however, the energy curve transitioned from a disordered state to a structure with clear peaks. Although residual noise still interfered with certain intervals, the overall SNR was improved.
Figure 6. Comparison of the energy curves of the wav4 and wav5 whistle segments before and after enhancement. The green and pink dashed lines represent the starting and ending points, respectively, of the detected audio clips. (a) Noisy wav4; (b) enhanced wav4; (c) noisy wav5; and (d) enhanced wav5.
Table 2 summarizes the endpoint detection results of the whistle segments wav4 and wav5 under different enhancement methods. For the wav4 segment, the noisy signal achieved a precision of 1.0000 but a low recall of 0.3137, resulting in a modest $F_1$-score of 0.4775. The baseline enhancement methods (Blurring, OGS, GL, NMF-OGS, and NMF-GL) improved recall only modestly (to 0.4649, 0.4594, 0.3413, 0.4668, and 0.4631, respectively), and their $F_1$-scores remained below 0.65. In contrast, the proposed method substantially raised the recall to 0.7491 while preserving perfect precision, leading to the highest $F_1$-score of 0.8565.
Table 2. The detection precision, recall, and $F_1$-score for the different enhancement methods on the whistle segments wav4 and wav5.
For the wav5 segment, detection failed completely on both the noisy input and the outputs of all baseline enhancement methods, with all metrics equal to 0.0000, reflecting the severe masking effect of background noise. In contrast, the proposed method partially recovered valid whistle segments, with a precision of 0.9438, a recall of 0.3631, and an $F_1$-score of 0.5245. This demonstrates its effectiveness in extracting weak signals that were otherwise undetectable by conventional approaches.

3.5. Complexity Analysis

To evaluate the computational cost of the proposed algorithm, an approximate complexity analysis of the main sub-operations within a single outer iteration was conducted in terms of memory and arithmetic. Let the observation matrix be of size $K \times M$, the factorization rank be $r \le \min(K, M)$, and the number of outer iterations be $N_{\mathrm{it}}$. The update of $\mathbf{H}$ involves $I_h = 5$ inner iterations, with a two-dimensional OGS smoothing kernel of size $K_1 \times K_2$. In each inner iteration of the $\mathbf{H}$-update, the core computations are as follows: computing $\mathbf{W}^{\top}\mathbf{Y}$ with complexity $O(KrM)$, forming $\mathbf{W}^{\top}\mathbf{W}$ and multiplying it by $\mathbf{H}$ with complexity $O(Kr^2 + r^2M)$, and performing two local convolutions to construct the smoothing quantities required by the regularization with complexity $O(rMK_1K_2)$. Hence, the overall complexity of the $\mathbf{H}$-update can be expressed as $O\big(I_h(KrM + r^2M + Kr^2 + rMK_1K_2)\big)$. During the update of $\mathbf{W}$, the main computations include constructing $\mathbf{H}\mathbf{H}^{\top}$ with complexity $O(r^2M)$, computing $\mathbf{Y}\mathbf{H}^{\top}$ with complexity $O(KrM)$, and evaluating $\mathbf{W}\mathbf{H}\mathbf{H}^{\top}$ with complexity $O(Kr^2)$. The proximal operation performed at this stage involves row-wise soft-thresholding and vector-norm shrinkage with complexity $O(Kr)$. Therefore, the overall complexity of the $\mathbf{W}$-update can be expressed as $O(KrM + r^2M + Kr^2 + Kr)$, and the total computational complexity of a single outer iteration can be approximated as $O\big(I_h(KrM + Kr^2 + r^2M + rMK_1K_2) + (KrM + r^2M + Kr^2 + Kr)\big)$.
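The operation counts above can be turned into a rough calculator for comparing configurations. The sketch below simply evaluates the two expressions with all hidden constants set to one, so its outputs are relative cost indicators rather than runtimes; the function name is hypothetical.

```python
def ops_per_outer_iteration(K, M, r, K1, K2, inner=5):
    """Leading-order operation counts for one outer iteration (constants dropped)."""
    # H-update: inner iterations of W^T Y, (W^T W) H and the two OGS convolutions.
    h_update = inner * (K * r * M + K * r * r + r * r * M + r * M * K1 * K2)
    # W-update: H H^T, Y H^T, W (H H^T) and the proximal shrinkage.
    w_update = K * r * M + r * r * M + K * r * r + K * r
    return h_update, w_update
```

Since the $KrM$ term appears in both updates and is multiplied by $I_h$ in the first, the $\mathbf{H}$-update is expected to dominate the per-iteration cost whenever the number of frames $M$ is large relative to $r$.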
In this study, whistle signal processing was performed on an Intel Core i9-13900H processor with 9 GB RAM. The results show that the proposed method requires 8.785 s of runtime. Compared with GL (10.372 s), the proposed method reduces computation time, while providing stronger enhancement than OGS (3.146 s), Blurring (0.143 s), NMF-OGS (8.256 s), and NMF-GL (6.778 s), thereby achieving a better balance between performance and efficiency.

4. Discussion

4.1. Analysis of the Superiority of the Proposed Method

The proposed NMF enhancement framework demonstrates significant advantages in dolphin whistle detection under low SNR and complex noise conditions. Compared with conventional approaches, such as spectral subtraction and time–frequency smoothing [,], the proposed method not only maintains consistently high precision, but also substantially improves recall, thereby leading to overall gains in the $F_1$-score. The pivotal innovation lies in the joint incorporation of OGS and GL regularization within the NMF framework. The former enforces local time–frequency continuity to preserve whistle ridges while effectively suppressing the fragmented energy distributions caused by background noise []; the latter adaptively eliminates ineffective bases at the dictionary level, preventing redundant components from amplifying residual noise []. The synergistic effect of these two regularizers enables the enhanced spectrograms to strike a favorable balance between structural preservation and noise suppression, thereby providing more reliable inputs for subsequent endpoint detection.
In comparison with the five representative baseline methods, the proposed approach demonstrates superior performance under low-SNR conditions ranging from −10 dB to −5 dB. Although all methods maintain relatively high precision, Blurring, OGS, GL, NMF-OGS, and NMF-GL individually exhibit clear deficiencies in recall, particularly in preserving weak signals. By contrast, the proposed method consistently achieves substantial improvements in recall across different noise levels, indicating that the joint regularization possesses stronger robustness in retaining weak energy components. This finding aligns with observations in other bioacoustic studies, where a single sparsity constraint has often proven insufficient to ensure the detectability of weak target signals []. Notably, in experiments with real dolphin whistle recordings, the proposed method preserved high detection integrity in segments with complex background conditions and restored partial detectability under extremely low-visibility scenarios, thereby demonstrating strong practical applicability and generalization capability.

4.2. Limitations and Future Perspectives

Nevertheless, the proposed method still exhibits certain limitations. Under extremely low-SNR conditions, although the recall rate improves substantially, the method remains unable to fully recover all whistle components that are severely masked by noise, indicating that the model’s separation capability requires further enhancement in the presence of strong noise interference []. Complexity analysis reveals that while the dual regularization and the nested iterative scheme ensure stable performance, their computational cost may become a limiting factor in large-scale passive acoustic monitoring tasks []. Moreover, as the method is founded on sparsity and low-rank assumptions, its adaptability remains limited in environments characterized by nonstationarity or the superposition of multiple noise sources [].
Future research will proceed along several directions. Empirical validation can be broadened by expanding datasets and evaluation scenarios, including streaming benchmarks on continuous, long-duration recordings, and multi-site deployments. Incorporating multi-noise source modeling and adaptive noise estimation is expected to improve robustness in nonstationary and complex acoustic environments []. In parallel, motivated by recent advances in deep learning for bioacoustic signal detection and enhancement [,,], integrating the proposed regularized NMF framework with deep learning methods may enable joint optimization of enhancement and detection, thereby overcoming the limitations of conventional stage-wise processing. Finally, although extending this research to other types of acoustic signals is valuable, doing so will require new prior assumptions and tailored front-end designs.

5. Conclusions

This study presents an unsupervised enhancement framework rooted in dually-regularized NMF, specifically designed to address the challenges of dolphin whistle detection under low SNR. Beyond the conventional data fidelity term, two complementary structural priors are incorporated: an OGS regularizer, which promotes the continuity of whistle trajectories in the time–frequency domain; and a GL regularizer, which adaptively prunes redundant spectral bases. Together, these regularization strategies enhance the separability of whistle signals from noise while preserving their inherent structural integrity. To ensure efficient and reliable optimization, the activation matrix is updated using a majorization–minimization scheme with iteratively reweighted least squares (MM–IRLS), which stabilizes convergence even in the presence of non-smooth penalties. Meanwhile, the spectral basis matrix is refined via a proximal gradient method, allowing for the simultaneous enforcement of non-negativity and column-level sparsity constraints.
Comprehensive experiments conducted under a variety of low-SNR conditions demonstrate that the proposed approach achieves substantial improvements in recall while maintaining consistently high precision, resulting in robust and stable gains in the $F_1$-score. Comparative evaluations against representative baseline methods further corroborate the advantages of the framework, particularly in preserving weak whistles that are otherwise masked by background noise. Finally, the complexity analysis indicates that the algorithm achieves a favorable balance between accuracy and computational cost, confirming its practicality for deployment on modern processing platforms. This improvement in dolphin whistle detection performance enhances, to some extent, the practicality of PAM, and it also has a positive influence on population assessment and habitat studies.

Author Contributions

L.L.: Writing—Review and Editing, Writing—Original Draft, Funding Acquisition, Visualization, Methodology, Investigation, Formal Analysis, and Conceptualization. X.S.: Writing—Review and Editing, Writing—Original Draft, Visualization, Supervision, Methodology, and Conceptualization. S.H.: Writing—Review and Editing, Supervision, Project Administration, Methodology, and Conceptualization. X.C.: Writing—Review and Editing, Supervision, and Conceptualization. J.Z.: Writing—Review and Editing, Supervision, and Conceptualization. S.L.: Supervision, Project administration, and Conceptualization. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported, in part, by the Youth Project of the National Natural Science Foundation of China under Grant 1240042200; in part, by the Youth Project of Shandong Natural Science Foundation under Grant ZR2024QA011; in part, by the Stable Supporting Fund of Acoustic Science and Technology Laboratory under Grant JCKYS2025SSJS014; in part, by the Youth Project of Qingdao Natural Science Foundation of Qingdao Municipality under Grant 24-4-4-zrjj-1-jch; and in part, by the Fundamental Research Funds for the Central Universities under Grant 27RA2322010.

Data Availability Statement

Data will be made available on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Rice, A.; Širović, A.; Trickey, J.S.; Debich, A.J.; Gottlieb, R.S.; Wiggins, S.M.; Hildebrand, J.A.; Baumann-Pickering, S. Cetacean occurrence in the Gulf of Alaska from long-term passive acoustic monitoring. Mar. Biol. 2021, 168, 72.
  2. Zimmer, W.M. Passive Acoustic Monitoring of Cetaceans; Cambridge University Press: Cambridge, UK, 2011.
  3. Haver, S.M.; Rand, Z.; Hatch, L.T.; Lipski, D.; Dziak, R.P.; Gedamke, J.; Haxel, J.; Heppell, S.A.; Jahncke, J.; McKenna, M.F.; et al. Seasonal trends and primary contributors to the low-frequency soundscape of the Cordell Bank National Marine Sanctuary. J. Acoust. Soc. Am. 2020, 148, 845–858.
  4. Fujioka, E.; Soldevilla, M.S.; Read, A.J.; Halpin, P.N. Integration of passive acoustic monitoring data into OBIS-SEAMAP, a global biogeographic database, to advance spatially-explicit ecological assessments. Ecol. Inform. 2014, 21, 59–73.
  5. Hung, C.T.; Chu, W.Y.; Li, W.L.; Huang, Y.H.; Hu, W.C.; Chen, C.F. A case study of whistle detection and localization for humpback dolphins in Taiwan. J. Mar. Sci. Eng. 2021, 9, 725.
  6. Kershenbaum, A.; Sayigh, L.S.; Janik, V.M. The encoding of individual identity in dolphin signature whistles: How much information is needed? PLoS ONE 2013, 8, e77671.
  7. Kipnis, D.; Diamant, R. Graph-based clustering of dolphin whistles. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 2216–2227.
  8. Gregorietti, M.; Papale, E.; Ceraulo, M.; De Vita, C.; Pace, D.S.; Tranchida, G.; Mazzola, S.; Buscaino, G. Acoustic presence of dolphins through whistles detection in Mediterranean shallow waters. J. Mar. Sci. Eng. 2021, 9, 78.
  9. Simpson, S.D.; Miller, C.E. Identification of key discriminating variables between spinner dolphin (Stenella longirostris) whistle types. J. Acoust. Soc. Am. 2020, 148, 1136–1144.
  10. Ashokan, M.; Latha, G.; Ramesh, R. Analysis of shallow water ambient noise due to rain and derivation of rain parameters. Appl. Acoust. 2015, 88, 114–122.
  11. Santoso, T.B. Ambient noise characterization of shallow water environment. EMITTER Int. J. Eng. Technol. 2015, 3, 77–87.
  12. Hildebrand, J.A. Anthropogenic and natural sources of ambient noise in the ocean. Mar. Ecol. Prog. Ser. 2009, 395, 5–20.
  13. Juodakis, J.; Marsland, S.; Priyadarshani, N. A changepoint prefilter for sound event detection in long-term bioacoustic recordings. J. Acoust. Soc. Am. 2021, 150, 2469–2478.
  14. Qiao, G.; Ma, T.; Liu, S.; Zheng, N.; Babar, Z.; Yin, Y. Spectral entropy based dolphin whistle detection algorithm and its possible application for biologically inspired communication. In Proceedings of the OCEANS 2019-Marseille, Marseille, France, 17–20 June 2019; pp. 1–6.
  15. Azevedo, A.F.; Oliveira, A.M.; Rosa, L.D.; Lailson-Brito, J. Characteristics of whistles from resident bottlenose dolphins (Tursiops truncatus) in southern Brazil. J. Acoust. Soc. Am. 2007, 121, 2978–2983.
  16. Li, L.; Qiao, G.; Liu, S.; Qing, X.; Zhang, H.; Mazhar, S.; Niu, F. Automated classification of Tursiops aduncus whistles based on a depth-wise separable convolutional neural network and data augmentation. J. Acoust. Soc. Am. 2021, 150, 3861–3873.
  17. Janik, V.M.; Todt, D.; Dehnhardt, G. Signature whistle variations in a bottlenosed dolphin, Tursiops truncatus. Behav. Ecol. Sociobiol. 1994, 35, 243–248.
  18. Zhou, X.; Wu, R.; Chen, W.; Dai, M.; Zhu, P.; Xu, X. Thresholding Dolphin Whistles Based on Signal Correlation and Impulsive Noise Features Under Stationary Wavelet Transform. J. Mar. Sci. Eng. 2025, 13, 312.
  19. Gillespie, D.; Caillat, M.; Gordon, J.; White, P. Automatic detection and classification of odontocete whistles. J. Acoust. Soc. Am. 2013, 134, 2427–2437.
  20. Wang, X.; Jiang, J.; Duan, F.; Liang, C.; Li, C.; Sun, Z.; Lu, R.; Li, F.; Xu, J.; Fu, X. A method for enhancement and automated extraction and tracing of Odontoceti whistle signals base on time-frequency spectrogram. Appl. Acoust. 2021, 176, 107698.
  21. Li, L.; Wang, Q.; Qing, X.; Qiao, G.; Liu, X.; Liu, S. Robust unsupervised Tursiops aduncus whistle enhancement based on complete ensembled empirical optimal envelope local mean decomposition with adaptive noise. J. Acoust. Soc. Am. 2022, 152, 3360–3372.
  22. Pu, W.; Liu, S.; Qing, X.; Qiao, G.; Mazhar, S.; Ma, T. Automated extraction of baleen whale calls based on the pseudo-Wigner–Ville distribution. J. Acoust. Soc. Am. 2023, 153, 1564–1579.
  23. Giard, S.; Simard, Y.; Roy, N. Decadal passive acoustics time series of St. Lawrence estuary beluga. J. Acoust. Soc. Am. 2020, 147, 1874–1884.
  24. Seger, K.D.; Al-Badrawi, M.H.; Miksis-Olds, J.L.; Kirsch, N.J.; Lyons, A.P. An empirical mode decomposition-based detection and classification approach for marine mammal vocal signals. J. Acoust. Soc. Am. 2018, 144, 3181–3190.
  25. Vickers, W.; Milner, B.; Risch, D.; Lee, R. Robust North Atlantic right whale detection using deep learning models for denoising. J. Acoust. Soc. Am. 2021, 149, 3797–3812.
  26. Wang, H.; Wu, X.; Wang, Z.; Hao, Y.; Hao, C.; He, X.; Hu, Q. Low-Resource Generation Method for Few-Shot Dolphin Whistle Signal Based on Generative Adversarial Network. J. Mar. Sci. Eng. 2023, 11, 1086.
  27. Chen, P.Y.; Selesnick, I.W. Group-sparse signal denoising: Non-convex regularization, convex optimization. IEEE Trans. Signal Process. 2014, 62, 3464–3478.
  28. Chen, P.Y.; Selesnick, I.W. Translation-invariant shrinkage/thresholding of group sparse signals. Signal Process. 2014, 94, 476–489. [Google Scholar] [CrossRef]
  29. Bischl, B.; Eichhoff, M.; Weihs, C. Selecting Groups of Audio Features by Statistical Tests and the Group Lasso. In Proceedings of the Sprachkommunikation, Bochum, Germany, 6–8 October 2010; pp. 1–4. [Google Scholar]
  30. Song, Z.; Zhang, C.; Fu, W.; Gao, Z.; Ou, W.; Zhang, J.; Zhang, Y. Investigation on whistle directivity in the Indo-Pacific humpback dolphin (Sousa chinensis) through numerical modeling. J. Acoust. Soc. Am. 2022, 151, 3573–3579. [Google Scholar] [CrossRef]
  31. Brewer, A.M.; Castellote, M.; Van Cise, A.M.; Gage, T.; Berdahl, A.M. Communication in Cook Inlet beluga whales: Describing the vocal repertoire and masking of calls by commercial ship noise. J. Acoust. Soc. Am. 2023, 154, 3487–3505. [Google Scholar] [CrossRef]
  32. Ferrer-i-Cancho, R.; McCowan, B. The span of correlations in dolphin whistle sequences. J. Stat. Mech. Theory Exp. 2012, 2012, P06002. [Google Scholar] [CrossRef]
  33. Lee, D.D.; Seung, H.S. Algorithms for non-negative matrix factorization. Adv. Neural Inf. Process. Syst. 2000, 13. [Google Scholar]
  34. Lefevre, A.; Bach, F.; Févotte, C. Online algorithms for nonnegative matrix factorization with the Itakura-Saito divergence. In Proceedings of the 2011 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 16–19 October 2011; pp. 313–316. [Google Scholar]
  35. Lee, D.D.; Seung, H.S. Learning the parts of objects by non-negative matrix factorization. Nature 1999, 401, 788–791. [Google Scholar] [CrossRef]
  36. Beck, A. First-Order Methods in Optimization; SIAM: Philadelphia, PA, USA, 2017. [Google Scholar]
  37. Yuan, M.; Lin, Y. Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B Stat. Methodol. 2006, 68, 49–67. [Google Scholar] [CrossRef]
  38. Lim, J.S.; Oppenheim, A.V. Enhancement and bandwidth compression of noisy speech. Proc. IEEE 1979, 67, 1586–1604. [Google Scholar] [CrossRef]
  39. Loizou, P.C. Speech enhancement based on perceptually motivated Bayesian estimators of the magnitude spectrum. IEEE Trans. Speech Audio Process. 2005, 13, 857–869. [Google Scholar] [CrossRef]
  40. Kowalski, M.; Torrésani, B. Sparsity and persistence: Mixed norms provide simple signal models with dependent coefficients. Signal Image Video Process. 2009, 3, 251–264. [Google Scholar] [CrossRef]
  41. Jacob, L.; Obozinski, G.; Vert, J.P. Group lasso with overlap and graph lasso. In Proceedings of the 26th Annual International Conference on Machine Learning, Montreal, QC, Canada, 14–18 June 2009; pp. 433–440. [Google Scholar]
  42. Kogan, J.A.; Margoliash, D. Automated recognition of bird song elements from continuous recordings using dynamic time warping and hidden Markov models: A comparative study. J. Acoust. Soc. Am. 1998, 103, 2185–2196. [Google Scholar] [CrossRef]
  43. Cichocki, A.; Zdunek, R.; Phan, A.; Amari, S.I. Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-Way Data Analysis and Blind Source Separation; Wiley: Hoboken, NJ, USA, 2009. [Google Scholar]
  44. Cichocki, A.; Zdunek, R. Regularized alternating least squares algorithms for non-negative matrix/tensor factorization. In Proceedings of the International Symposium on Neural Networks; Springer: Berlin/Heidelberg, Germany, 2007; pp. 793–802. [Google Scholar]
  45. Candès, E.J.; Li, X.; Ma, Y.; Wright, J. Robust principal component analysis? J. ACM 2011, 58, 1–37. [Google Scholar] [CrossRef]
  46. Kahl, S.; Wood, C.M.; Eibl, M.; Klinck, H. BirdNET: A deep learning solution for avian diversity monitoring. Ecol. Inform. 2021, 61, 101236. [Google Scholar] [CrossRef]
  47. Stowell, D. Computational bioacoustics with deep learning: A review and roadmap. PeerJ 2022, 10, e13152. [Google Scholar] [CrossRef] [PubMed]
  48. Shiu, Y.; Palmer, K.; Roch, M.A.; Fleishman, E.; Liu, X.; Nosal, E.M.; Helble, T.; Cholewiak, D.; Gillespie, D.; Klinck, H. Deep neural networks for automated detection of marine mammal species. Sci. Rep. 2020, 10, 607. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
