Article

A Bird Vocalization Classification Method Based on Bidirectional FBank with Enhanced Robustness

1 College of Big Data and Intelligent Engineering, Southwest Forestry University, Kunming 650224, China
2 College of Science, Southwest Forestry University, Kunming 650224, China
3 College of Life Science, Northeast Normal University, Changchun 130024, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(9), 4913; https://doi.org/10.3390/app15094913
Submission received: 6 April 2025 / Revised: 20 April 2025 / Accepted: 21 April 2025 / Published: 28 April 2025

Abstract

Recent advances in audio signal processing and pattern recognition have made the classification of bird vocalizations a focus of bioacoustic research. However, the accurate classification of birdsongs is challenged by environmental noise and the limitations of traditional feature extraction methods. This study proposes the iWAVE-BiFBank method, an innovative approach combining improved wavelet adaptive denoising (iWAVE) and a bidirectional Mel-filter bank (BiFBank) for effective birdsong classification with enhanced robustness. The iWAVE method achieves adaptive optimization using the autocorrelation coefficient and peak-to-sum ratio (PSR), overcoming the manual adjustment requirements and incompleteness of traditional methods. BiFBank combines FBank and an inverse FBank (iFBank) to enhance feature representation. This fusion addresses the shortcomings of FBank and introduces novel transformation methods and filter designs to iFBank, with a focus on high-frequency components. The iWAVE-BiFBank method creates a robust feature set, which effectively reduces the noise of audio signals and captures both low- and high-frequency information. Experiments were conducted on a dataset of 16 bird species, and the proposed method was verified with a random forest (RF) classifier. The results show that iWAVE-BiFBank achieves an accuracy of 94.00%, with other indicators, including the F1 score, exceeding 93.00%, outperforming all other tested methods. Overall, the proposed method effectively reduces audio noise, comprehensively captures the characteristics of bird vocalizations, and provides improved classification performance.

1. Introduction

Birds play a critical role in biological monitoring and ecological conservation, aiding biologists in formulating biodiversity preservation strategies and providing ecologists with crucial information on the regional behaviors of birds. Consequently, the identification of bird vocalizations has become a key focus in bioacoustics. The traditional classification process consists of pre-processing, feature extraction, and classification [1]. Among these steps, effective denoising in pre-processing is essential for achieving accurate classification [2].
The noise contained in birdsong recordings mainly comes from environmental sources [3], such as wind, rain, and even the buzzing of insects. This kind of noise is typically modeled as additive white Gaussian noise (AWGN) [4,5], while birdsong itself is a typical non-stationary oscillatory signal. Wavelet transform (WT)-based denoising is regarded as the most effective approach for handling oscillatory signals, as it addresses the time-domain resolution limitations of the short-time Fourier transform (STFT) [6]. Many studies have demonstrated that the WT outperforms other techniques, such as the Fourier transform (FT) and discrete cosine transform (DCT) [7,8,9]. Multiresolution wavelet analysis was introduced by S. Mallat [10] in 1989 and led to another revolution in the field of computer vision, offering advantages in sparsity and multiple resolutions [11]. This study focuses on wavelet threshold denoising, which is essentially based on the discrete wavelet transform (DWT). The primary factors affecting its performance are the selection of the wavelet basis, the determination of the decomposition levels, the threshold selection rule, and the threshold function [12]. Most of these require manual setting, which dramatically limits the wide application of the wavelet transform [13]. To address this, we propose an improved Wavelet AdaptiVe dEnoising (iWAVE) method to enhance both applicability and performance, improving the robustness of the classification method.
Feature extraction is vital for bird vocalization classification, and classification performance depends heavily on feature quality. Although Mel-frequency cepstral coefficients (MFCCs) have been widely used, we prefer the Mel-filter bank (FBank) due to its lower computational requirements and higher feature correlations. However, FBank tends to emphasize low-frequency over high-frequency information during feature extraction. As a solution, an inverse FBank (iFBank) was introduced and combined with FBank to form a bidirectional FBank (BiFBank). This novel approach integrates low-frequency and high-frequency information and achieves the best classification performance among the tested features when using the random forest (RF) classifier [14].
Therefore, we propose the iWAVE and BiFBank methods for adaptive automatic recognition, aiming to improve classification performance and robustness in bird vocalization. The main contributions of this study are as follows:
(1)
We constructed the iWAVE wavelet threshold denoising method for bird vocalization, which adapts to determine the optimal decomposition levels, optimizes the threshold selection rule, and improves the threshold function. This method outperforms existing classical and improved techniques in most cases, thus supporting the wider application of wavelet-transform-based denoising.
(2)
We developed a horizontal linear fusion of FBank and iFBank to create BiFBank features, effectively integrating low-frequency and high-frequency information, reducing the loss of audio data, and making up for the shortcomings of traditional methods.
(3)
We demonstrated that FBank outperforms MFCC under the same conditions, providing researchers with a wider selection of features. The proposed iWAVE-BiFBank method achieves optimal performance and robustness on the tested features and compared with state-of-the-art methods.
The remainder of this paper is organized as follows: Section 2 reviews related work; Section 3 presents the dataset; Section 4 details the proposed methods; Section 5 outlines the experiments, analysis, and discussion; Section 6 addresses the limitations and future work; and Section 7 concludes this paper.

2. Related Work

The sparsity and multi-resolution characteristics of DWT are reflected in its ability to decompose the input signal, yielding a set of approximate coefficients (low-frequency components) and detailed coefficients (high-frequency components) at various decomposition levels [15]. The former is considered a useful signal, while the latter is regarded as noise. This effective denoising approach has been widely applied in birdsong recognition by researchers such as Xie et al. and Gong et al., enhancing the reliability of signal analysis and interpretation [16,17]. In other domains, Chi et al. integrated the WT, convolutional neural network (CNN), inception module, and long short-term memory (LSTM) network for structural damage diagnosis [18]. Threshold selection and threshold functions are critical factors in denoising performance. VisuShrink [19], RigrSure [20], and MiniMax [21] are classic methods for threshold selection, while the hard threshold proposed by Donoho and Johnstone [19] and the soft threshold introduced by Donoho [22] are well-established methods for threshold functions.
MFCC is the main feature used in bird vocalization classification, but its limitations are frequently addressed by integrating it with other features. For instance, Bonet-Solà and Alsina-Pagès integrated MFCC, gamma-tone cepstral coefficients (GTCCs), and narrowband (NB) features, employing k-nearest neighbor (kNN), neural network (NN), and Gaussian mixture model (GMM) classifiers [23]. Similarly, Andono et al. utilized MFCC and GTCC for feature extraction, achieving a birdsong classification accuracy of 78.11% on 264 bird vocalizations, which improved to 78.33% through the use of particle swarm optimization (PSO) [24,25]. In particular, Xie et al. extracted deep features by applying a CNN to Hilbert–Huang transform (HHT), short-time Fourier transform (STFT), and wavelet transform (WT) representations, fusing them with MFCC to achieve excellent results [16].
In addition to MFCC, FBank features have also been widely used in recognition. Sui et al. identified ten bird vocalizations using FBank with the transformer model, achieving an accuracy of 85.77% [26]. Peng et al. fused FBank with SincNet-based Sinc spectrograms while using 2D Gabor filters on the Sinc spectrograms for secondary feature extraction to enhance feature representations [27]. Furthermore, García-Ordás et al. employed a fully convolutional neural network with FBank, reaching 85.71% accuracy in the classification of 17 birdsongs [28]. Wang et al. adopted a different approach by integrating FBank, delta, and delta-delta FBank (3D-FBank) with a Phylogenetic Perspective Neural Network (PPNN), which led to an accuracy of 89.95% in classifying the songs of 500 bird species [29]. In other fields, Liao et al. utilized FBank in a deep neural network (DNN) to analyze giant panda vocalizations [30], and Cui and Wang applied FBank and Linear FBank (LFBank) to human voiceprint recognition [31]. Moreover, Wu et al. employed 3D-FBank with ResNet18 for underwater target identification [32].

3. Dataset

The dataset for this study was derived from Xeno-canto [33]. It comprises 263 audio recordings from 16 bird species belonging to 7 orders, 9 families, and 15 genera. The dataset was divided into a training set and a test set in a ratio of 8:2. The details are provided in Table 1.
As the rarity and varied vocal habits of different bird species can influence the robustness of a model, we focused on extracting the energy-dense segments of the Mel spectrum from each audio sample. This resulted in approximately 1000 samples per species, as illustrated in Figure 1.
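For illustration, this segment selection might be implemented along the following lines. This is a minimal sketch, not the authors' code: it assumes librosa for the Mel spectrogram, and the window length (seg_frames), hop settings, and number of segments kept per recording (n_keep) are illustrative placeholders.

```python
import numpy as np
import librosa

def energy_dense_segments(path, seg_frames=64, n_keep=8):
    """Keep the seg_frames-wide Mel-spectrogram windows with the most energy."""
    y, sr = librosa.load(path, sr=None)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512, hop_length=256)
    frame_energy = mel.sum(axis=0)
    # Rank non-overlapping candidate windows by total energy, keep the top ones.
    starts = range(0, max(mel.shape[1] - seg_frames, 1), seg_frames)
    ranked = sorted(starts, key=lambda s: -frame_energy[s:s + seg_frames].sum())
    return [mel[:, s:s + seg_frames] for s in ranked[:n_keep]]
```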

4. Research Methods

To recognize birdsongs more widely and effectively, the method proposed in this paper implements robust classification. Its structure is depicted in Figure 2. First, the input audio is pre-processed, including denoising with the iWAVE method, pre-emphasis, framing, and windowing. Then, BiFBank feature extraction is performed by fusing the FBank and iFBank features, where ⊗ represents the horizontal linear fusion operation. Finally, the random forest classifier is used to obtain the final results.

4.1. Improved Wavelet Adaptive Denoising

The setting of the decomposition level significantly influences the denoising effect, yet traditional wavelet threshold methods require manual adjustment. Additionally, the classic threshold selection rule is incomplete, as it often ignores the effect of the decomposition level. Furthermore, among the classic threshold functions, the hard threshold disrupts signal continuity, leading to the pseudo-Gibbs phenomenon in bird calls [34]. With the soft threshold, continuity comes at the cost of shrinking all retained wavelet coefficients by the threshold value, resulting in a large reconstruction error.
Hence, this study developed an improved Wavelet AdaptiVe dEnoising (iWAVE) method to enhance the effectiveness and applicability of wavelet denoising. Wavelet basis selection remains a trial-and-error process [35], while the other three aspects were improved to address the limitations mentioned above.

4.1.1. Adaptive Determination of Optimal Decomposition Level

The modulus square of the wavelet coefficients of additive white Gaussian noise (AWGN) under wavelet decomposition can be expressed as [4]
$$ E\left[\,|\omega|^{2}\,\right] = \frac{\|\psi\|^{2}\,\sigma^{2}}{2^{i}}, \qquad (1) $$
where i represents the decomposition level, ψ refers to the wavelet basis, σ is the standard deviation of the AWGN, and E denotes the mathematical expectation of the sub-band coefficients at each level, representing the randomness of the signal over time. The equation involves the decomposition scale and the wavelet basis, indicating that a larger scale more effectively distinguishes noise from the signal, thus improving denoising performance. However, experiments show that an excessively large decomposition level distorts the coefficients, thereby increasing the error of the inverse wavelet transform. Analysis of the wavelet coefficients at each level reveals that the detail coefficients retain part of the AWGN sequence after wavelet decomposition [36]. Subsequently, we perform a whitening test on the autocorrelation coefficient of the detail coefficients, defined as Corr_i^2:
$$ \mathrm{Corr}_i^{2} \le \frac{1.38}{\sqrt{N_i}}, \qquad (2) $$
where N_i is the length of the detail coefficients at level i.
When Equation (2) is satisfied, the detail coefficients still behave like white noise, and further decomposition is required. Otherwise, the optimal decomposition level is the current level minus one. In practice, however, we observed that as the noise intensity increases, the Corr_i^2 value becomes too small to determine the number of decomposition levels accurately. Inspired by Liu et al. and Yang et al., we noted that as the decomposition scale increases, certain approximation coefficients are misassigned to the detail coefficients, distorting the reconstructed signal [13,37]. Therefore, an autocorrelation restriction is introduced for the approximation coefficients, denoted Corr_i^1. It is generally accepted that when Corr_i^1 ∈ [0.8, 1.0], the approximation coefficients are highly correlated, so the restriction is set as
$$ \mathrm{Corr}_i^{1} \ge 0.8. \qquad (3) $$
When Corr_i^1 fails to satisfy Equation (3), further decomposition distorts the signal, resulting in the loss of valuable information during reconstruction. At this point, the loop is terminated, and the subsequent decomposition is not performed.
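A minimal sketch of this level-selection loop is given below, assuming PyWavelets for the DWT and using the lag-1 normalized autocorrelation as the autocorrelation coefficient (the paper does not specify the lag); the whiteness bound follows Equation (2) as reconstructed above.

```python
import numpy as np
import pywt

def lag1_autocorr(c):
    """Normalized lag-1 autocorrelation of a coefficient sequence."""
    c = np.asarray(c, dtype=float) - np.mean(c)
    denom = np.sum(c * c)
    return 0.0 if denom == 0 else np.sum(c[:-1] * c[1:]) / denom

def optimal_level(signal, wavelet="sym3", max_level=10):
    approx = np.asarray(signal, dtype=float)
    for i in range(1, max_level + 1):
        approx, detail = pywt.dwt(approx, wavelet)   # one more decomposition level
        if abs(lag1_autocorr(approx)) < 0.8:         # Eq. (3) violated: stop at level i
            return i
        if abs(lag1_autocorr(detail)) > 1.38 / np.sqrt(len(detail)):
            return max(i - 1, 1)                     # Eq. (2) violated: back off one level
    return max_level
```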

4.1.2. Optimized Threshold Selection Rule

In 1992, Donoho and Johnstone introduced the VisuShrink threshold selection rule [19]:
$$ \lambda = \sigma \sqrt{2 \ln N}, \qquad (4) $$
where λ denotes the threshold, and N denotes the length of the corresponding detail coefficient. The formula for calculating the standard deviation σ of the AWGN is expressed as follows:
$$ \sigma = \frac{\mathrm{MAD}}{0.6745}, \qquad (5) $$
where MAD represents the median of the absolute values of the detail coefficients. VisuShrink is widely used due to its strong performance and good interpretability, and many optimization methods build on it [34,37,38,39,40], as was also the case in this study.
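As a sketch, the VisuShrink threshold of Equations (4) and (5) for one detail sub-band can be computed as follows; estimating σ from the sub-band being thresholded is an assumption here (implementations often use the finest-level details only).

```python
import numpy as np

def visu_threshold(detail):
    """Universal (VisuShrink) threshold for one detail sub-band."""
    sigma = np.median(np.abs(detail)) / 0.6745          # Eq. (5): robust noise estimate
    return sigma * np.sqrt(2.0 * np.log(len(detail)))   # Eq. (4)
```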
Considering the adaptive determination of the optimal decomposition level, we observed noise reduction as the decomposition level increased, suggesting thresholds should be similarly decreased. Furthermore, our research indicated that the detail coefficients were not all noise but included valuable information. Thus, this paper employs the concept of “peak-to-sum ratio ( PSR )” to assist in the selection of threshold values [41]:
$$ \mathrm{PSR} = \frac{\max\left(|d(i)|\right)}{\sum_{j=1}^{N} |d(i,j)|}, \qquad (6) $$
where d(i) denotes the detail coefficients at level i, d(i,j) is the j-th of these coefficients, N is their length, and the denominator is the sum of the absolute values of all detail coefficients at level i. Our improved threshold rule, building on VisuShrink, is expressed as follows:
$$ \lambda_i = \frac{\sigma \sqrt{2 \ln N}}{N \cdot \mathrm{PSR} \cdot \log_{10}\left(i^{2} + 1\right)}. \qquad (7) $$
In Equation (7), λ_i denotes the threshold for each decomposition level. Useful information is concentrated in the coefficients with larger values in the wavelet transform domain. Therefore, if the detail coefficients contain a large value, the PSR increases, indicating that there is more valuable information at that level, and the threshold is lowered to retain a larger number of coefficients. Conversely, a smaller PSR means that the level contains more noise; the threshold then increases, and more noise is filtered out. At the same time, the threshold decreases as the decomposition scale increases.
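A sketch of this level-dependent rule, under the reconstruction of Equation (7) given above; `detail` holds the detail coefficients d(i) of level i, and `sigma` is the noise estimate of Equation (5).

```python
import numpy as np

def iwave_threshold(detail, sigma, i):
    n = len(detail)
    psr = np.max(np.abs(detail)) / np.sum(np.abs(detail))   # Eq. (6)
    lam = sigma * np.sqrt(2.0 * np.log(n))                  # VisuShrink base, Eq. (4)
    # Eq. (7): the threshold shrinks when PSR is large (more useful information)
    # and as the decomposition level i grows.
    return lam / (n * psr * np.log10(i ** 2 + 1.0))
```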

4.1.3. Improved Threshold Function

Classical threshold functions primarily consist of the hard threshold [19]
$$ \hat{\omega}_{i,j} = \begin{cases} \omega_{i,j}, & |\omega_{i,j}| \ge \lambda \\ 0, & |\omega_{i,j}| < \lambda, \end{cases} \qquad (8) $$
and the soft threshold [22]
$$ \hat{\omega}_{i,j} = \begin{cases} \mathrm{sgn}(\omega_{i,j}) \left( |\omega_{i,j}| - \lambda \right), & |\omega_{i,j}| \ge \lambda \\ 0, & |\omega_{i,j}| < \lambda, \end{cases} \qquad (9) $$
where i denotes the decomposition level, ω_{i,j} is the j-th wavelet coefficient at level i, and sgn(ω_{i,j}) denotes the sign of the coefficient.
Considering the disadvantages of both hard and soft thresholds, when the absolute value of the coefficient exceeds the threshold, the function should be as close as possible to the hard threshold. When the absolute value of the coefficient falls below the threshold, the function must maintain continuity to ensure a smooth connection between the two [42]. This paper proposes an improved threshold function:
$$ \hat{\omega}_{i,j} = \begin{cases} \mathrm{sgn}(\omega_{i,j}) \left( |\omega_{i,j}| - (0.1)^{\omega_{i,j}^{2} - \lambda^{2}} \cdot \dfrac{\lambda}{\alpha + 1} \right), & |\omega_{i,j}| \ge \lambda \\[6pt] \mathrm{sgn}(\omega_{i,j}) \cdot \dfrac{\alpha \lambda}{\alpha + 1} \cdot \exp\left( 10 \left( |\omega_{i,j}| - \lambda \right) \right), & |\omega_{i,j}| < \lambda, \end{cases} \qquad (10) $$
where α is a regulating factor used to adjust how closely the function approaches the hard threshold. For Equation (10), as |ω_{i,j}| → λ from either side, ω̂_{i,j} → (α/(α+1))·λ, so the proposed threshold function is continuous at λ, overcoming the defect of the hard threshold. When ω_{i,j} → ∞, (0.1)^{ω_{i,j}² − λ²} → 0, so ω̂_{i,j} → ω_{i,j}; the function introduces no fixed deviation to large coefficients, solving the problem of the soft threshold. A comparison of the improved threshold function with the hard and soft thresholds is shown in Figure 3, where α = 2. The proposed threshold function closely approximates the hard threshold outside the threshold region while ensuring a smooth and continuous transition within it.
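The three functions, as reconstructed above, can be sketched as follows; alpha is the regulating factor (α = 2 in Figure 3), and each branch is evaluated only on its own region to avoid numerical overflow.

```python
import numpy as np

def hard_threshold(w, lam):
    return np.where(np.abs(w) >= lam, w, 0.0)                               # Eq. (8)

def soft_threshold(w, lam):
    return np.where(np.abs(w) >= lam, np.sign(w) * (np.abs(w) - lam), 0.0)  # Eq. (9)

def improved_threshold(w, lam, alpha=2.0):
    w = np.asarray(w, dtype=float)
    out = np.empty_like(w)
    keep = np.abs(w) >= lam
    # Upper branch of Eq. (10): tends to the hard threshold as |w| grows.
    out[keep] = np.sign(w[keep]) * (
        np.abs(w[keep]) - 0.1 ** (w[keep] ** 2 - lam ** 2) * lam / (alpha + 1.0))
    # Lower branch: decays smoothly to zero and is continuous at |w| = lam.
    out[~keep] = np.sign(w[~keep]) * (alpha * lam / (alpha + 1.0)) * np.exp(
        10.0 * (np.abs(w[~keep]) - lam))
    return out
```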
The process of the improved wavelet adaptive denoising (iWAVE) is shown in Figure 4.
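Composing the pieces sketched in Sections 4.1.1-4.1.3 gives a compact view of the whole loop. The following reuses optimal_level(), iwave_threshold(), and improved_threshold() from the sketches above and is an illustration of the described procedure, not the authors' implementation.

```python
import numpy as np
import pywt

def iwave_denoise(signal, wavelet="sym3", alpha=2.0):
    level = optimal_level(signal, wavelet)               # Section 4.1.1
    coeffs = pywt.wavedec(signal, wavelet, level=level)  # [cA_n, cD_n, ..., cD_1]
    approx, details = coeffs[0], coeffs[1:]
    sigma = np.median(np.abs(details[-1])) / 0.6745      # Eq. (5) on the finest details
    denoised = [improved_threshold(d, iwave_threshold(d, sigma, i), alpha)
                for i, d in enumerate(reversed(details), start=1)]  # level 1 = finest
    return pywt.waverec([approx] + list(reversed(denoised)), wavelet)
```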

4.2. Bidirectional FBank Feature Extraction

This paper proposes the bidirectional FBank (BiFBank) feature, which combines FBank and an inverse FBank (iFBank). It simultaneously focuses on both low- and high-frequency information, mitigates the weaknesses of traditional FBank, reduces the loss of birdsong information, and provides a more complete feature representation.

4.2.1. FBank Feature Extraction

The main extraction steps of FBank and MFCC are illustrated in Figure 5. The primary difference between them is MFCC's use of the DCT, which reduces the feature correlation but also discards useful information [43]. Therefore, FBank was selected for bird vocalization classification.
FBank employs triangular filters based on the Mel scale to simulate human ear perception, which is characterized by non-linear frequency responses and is more sensitive to lower frequencies. On this basis, the Mel-scale transform is expressed as follows:
$$ \mathrm{Mel} = 2595 \cdot \log_{10}\left( \frac{f}{700} + 1 \right), \qquad (11) $$
where f denotes the linear frequency in Hz. Then, a set of equally spaced triangular filters is generated on the Mel scale, where the m-th triangular filter is
$$ H_{\mathrm{Mel}}(k) = \begin{cases} \dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m) \\[6pt] \dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) \le k \le f(m+1) \\[6pt] 0, & \text{otherwise}, \end{cases} \qquad (12) $$
where f(m−1), f(m), and f(m+1) represent the lower, central, and upper Mel frequencies of the m-th filter, respectively. Finally, the triangular filters generated with Equation (12) are transformed back to the linear frequency via Equation (13), resulting in the filter bank depicted in Figure 6:
$$ f = 700 \cdot \left( 10^{\frac{\mathrm{Mel}}{2595}} - 1 \right). \qquad (13) $$
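A minimal sketch of this construction, with 40 filters and a 512-point frame as in Table 3; mapping the edge frequencies to FFT bins in this way is a common convention and an assumption here, not the authors' exact code.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(f / 700.0 + 1.0)            # Eq. (11)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)        # Eq. (13)

def mel_filterbank(sr=48000, n_fft=512, n_filters=40):
    # Filter edges equally spaced on the Mel scale, mapped back to FFT bins.
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):                    # triangles of Eq. (12)
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        bank[m - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        bank[m - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    return bank
```

The FBank features of a frame are then the log filter-bank energies, e.g., np.log(power_spectrum @ mel_filterbank().T + 1e-10).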

4.2.2. Inverse FBank Feature Extraction

As illustrated in Figure 6, the traditional FBank is dense at low frequencies and sparse at high frequencies, emphasizing low-frequency information while neglecting high-frequency information. However, high-frequency information is equally crucial in practical applications and should be addressed.
In this study, we found that the energy of birdsong is concentrated mainly between 1000 and 4000 Hz. After framing and windowing, the signal energy is also concentrated in a corresponding band in the upper half of the spectrum (Nyquist sampling theorem). For example, when the sampling frequency is 48,000 Hz, the bird audio data concentrate from 1000 to 4000 Hz and from 20,000 to 23,000 Hz. If we focus only on the low-frequency part of the audio, we lose the equally significant high-frequency information. Therefore, in this study, a new filter bank was developed, referred to as the inverse FBank (iFBank). First, the iMel-scale transform is defined as follows:
$$ \mathrm{iMel} = 700 \cdot \left( 10^{\frac{f}{17270}} - 1 \right). \qquad (14) $$
Second, the filter bank is generated in the same manner as for FBank. After the equally spaced triangular filters are generated on the iMel scale, they are converted back to the linear frequency:
$$ f = 17270 \cdot \log_{10}\left( \frac{\mathrm{iMel}}{700} + 1 \right). \qquad (15) $$
Subsequently, the new filter bank based on the iMel scale is shown in Figure 7.
This filter bank is sparse at low frequencies and dense at high frequencies, indicating that iFBank emphasizes high-frequency information. BiFBank is then obtained through horizontal linear feature fusion of FBank and iFBank, focusing on low- and high-frequency information simultaneously; a code sketch follows below. The complete process of BiFBank is illustrated in Figure 8.
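Under the same assumptions as the FBank sketch above (and reusing mel_filterbank() from it), iFBank swaps in the iMel scale of Equations (14) and (15). The horizontal linear fusion is taken here to be per-frame concatenation of the two log filter-bank energy vectors; this reading of the fusion operator is an assumption.

```python
import numpy as np

def hz_to_imel(f):
    return 700.0 * (10.0 ** (f / 17270.0) - 1.0)         # Eq. (14)

def imel_to_hz(imel):
    return 17270.0 * np.log10(imel / 700.0 + 1.0)        # Eq. (15)

def imel_filterbank(sr=48000, n_fft=512, n_filters=40):
    # Same triangle construction as mel_filterbank(), but equal spacing on the
    # iMel scale makes the filters sparse at low and dense at high frequencies.
    edges = imel_to_hz(np.linspace(0.0, hz_to_imel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        bank[m - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        bank[m - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    return bank

def bifbank(power_spec, sr=48000, n_fft=512, n_filters=40):
    """power_spec: (n_frames, n_fft // 2 + 1) array of per-frame power spectra."""
    fb = np.log(power_spec @ mel_filterbank(sr, n_fft, n_filters).T + 1e-10)
    ifb = np.log(power_spec @ imel_filterbank(sr, n_fft, n_filters).T + 1e-10)
    return np.hstack([fb, ifb])   # the horizontal fusion (⊗ in Figure 2)
```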

5. Experiment and Result Analysis

5.1. Experimental Environment and Design

The experimental hardware platform was a desktop computer with 128 GB of memory and a 16-core, 32-thread CPU. The operating system was Windows 11 Professional (64-bit), and the experiments were conducted using Matlab R2022a.
To verify the effectiveness of the proposed method, two groups of experiments were designed for comparison and analysis; the features involved are described in Table 2. The first group consisted of three experiments. The first experiment verified the effectiveness of the adaptive algorithm for determining the optimal decomposition level. The second experiment used simulated signals to compare classical threshold selection rules and improved rules from other researchers with the rule proposed in this study. The third experiment verified the performance of the proposed threshold function both objectively and subjectively.
The other group of experiments used the random forest (RF) classifier to compare the classification performance of different features: MFCC, FBank, iWAVE-FBank, iWAVE-iFBank, and iWAVE-BiFBank. In the subsequent discussion, the robustness of the proposed method is further verified, and a comparison with state-of-the-art studies is given. The parameter settings used in the experiments are listed in Table 3.
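For the classification stage, a run along these lines could look as follows with scikit-learn's RandomForestClassifier. The synthetic arrays are placeholders for the real feature matrices, and max_features="log2" only approximates the 1 + log2(N) rule of Table 3; this is a usage sketch, not the authors' Matlab setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

rng = np.random.default_rng(77)                        # placeholder data only
X_train, y_train = rng.normal(size=(800, 80)), rng.integers(0, 16, 800)
X_test, y_test = rng.normal(size=(200, 80)), rng.integers(0, 16, 200)

clf = RandomForestClassifier(n_estimators=100,         # numTrees: 100 (Table 3)
                             max_depth=None,           # maxDepth: 0, i.e., unlimited
                             max_features="log2",      # close to 1 + log2(N)
                             random_state=0)           # randomSeed: 0
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print(accuracy_score(y_test, pred),
      f1_score(y_test, pred, average="macro"),
      cohen_kappa_score(y_test, pred))
```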

5.2. Evaluation System

Since clean counterparts of real recordings are unobtainable, denoising performance cannot be verified directly on them through evaluation indicators. Thus, the "Blocks" and "Bumps" signals in Matlab were used to simulate non-stationary random signals, with varying noise intensities added to emulate AWGN (the noise variance represents the intensity), as shown in Figure 9.
The signal–noise ratio ( SNR ) was used as an evaluation metric to measure denoising effectiveness [45]:
$$ \mathrm{SNR} = 10 \log_{10} \frac{\sum_{n} x(n)^{2}}{\sum_{n} \left( y(n) - x(n) \right)^{2}}\ (\mathrm{dB}), \qquad (16) $$
where n indexes the signal samples, y(n) represents the noisy signal, and x(n) represents the clean signal. The higher the SNR, the closer the denoised signal is to the clean signal, indicating better denoising effectiveness. Classification performance was evaluated with Accuracy, F1 score, Kappa, Precision, and Recall.
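Equation (16) translates directly into code; the short self-check below adds synthetic AWGN to a stand-in clean signal, with the noise variance playing the role of intensity as in Figure 9.

```python
import numpy as np

def snr_db(clean, test):
    """Eq. (16): SNR of `test` (noisy or denoised) against the clean reference."""
    return 10.0 * np.log10(np.sum(clean ** 2) / np.sum((test - clean) ** 2))

rng = np.random.default_rng(0)
x = np.sin(np.linspace(0.0, 8.0 * np.pi, 2048))         # stand-in clean signal
y = x + rng.normal(scale=np.sqrt(0.5), size=x.shape)    # AWGN with variance 0.5
print(f"SNR of the noisy signal: {snr_db(x, y):.2f} dB")
```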

5.3. Result Analysis

Group I: comparison with other denoising methods
(1)
Verification of the adaptive determination of the optimal decomposition level
Figure 10 presents the comparison results. The red dotted line represents the bound on Corr_i^2; when this value is exceeded, the loop terminates, and the optimal decomposition level is the current number of levels minus one. When the Corr_i^1 value falls below 0.8, the loop immediately terminates, and the optimal decomposition level is set to the current level. To illustrate this point more concretely, the autocorrelation coefficients at a noise intensity of three are shown in Figure 10. In addition, the SNR value increased up to the optimal decomposition level and decreased significantly beyond it.
(2)
Verification of the optimized threshold selection rule’s effectiveness
The proposed method was compared with other methods in different noise environments, and the results are shown in Table 4. The method with the highest SNR value is bolded (values within 0.01 of it are also bolded). In most cases, the proposed method has the best performance.
(3)
Verification of the improved threshold function’s effectiveness
The simulation signals were used to calculate the SNR to objectively reflect the performance of the proposed method, and the real signals were chosen to subjectively observe waveform differences after denoising with different threshold functions. Table 5 presents the performance of the three threshold functions in different noise environments.
As shown in Table 5, in all cases, the hard threshold outperforms the soft threshold. However, the hard threshold causes signal discontinuity, resulting in the undesirable pseudo-Gibbs phenomenon. Conversely, the soft threshold avoids this issue but introduces a permanent deviation in the signal, which is also undesirable. The proposed threshold function achieves optimal performance in most cases while preserving signal continuity and avoiding systematic deviation. To further verify the performance of the proposed function, the denoising effects of each function on real signals are compared in Figure 11.
In addition to the valuable signal components, the original audio contains noise of significant amplitude. All three threshold functions suppress this noise and reduce its amplitude effectively. The improved threshold function best retains the characteristics of the original waveform while reducing the noise intensity. The hard threshold causes a loss of continuity and waveform distortion, and the soft threshold introduces a permanent deviation; with both, the waveform amplitude is visibly amplified or attenuated relative to the original. The proposed method thus improves on the traditional threshold functions, establishing a foundation for the subsequent classification tasks.
Group II: comparison of different features in classification performance
The classification performances of five different features were compared. The evaluation metrics included Accuracy, F1 score, Kappa, Precision, and Recall. MFCC, FBank, iWAVE-FBank, iWAVE-iFBank, and iWAVE-BiFBank were compared under the same conditions. The classification experiments were repeated ten times, and the average of the results was taken as the final outcome, as shown in Table 6 and Figure 12.
As shown in Table 6 and Figure 12, the performance of FBank is significantly higher than that of MFCC. The comparison of FBank and iWAVE-FBank shows that denoising the audio data enhances performance on the classification task. Moreover, iFBank emphasizes high-frequency information, and its performance is superior to that of FBank, demonstrating its reliability and effectiveness. iWAVE-BiFBank has the best classification performance, with an accuracy 16.02 and 10.96 percentage points higher than that of MFCC and FBank, respectively.
Figure 13 shows the confusion matrices of FBank and iWAVE-BiFBank, highlighting the number of errors for each species. The error rate for Numenius phaeopus and Accipiter nisus is particularly high with MFCC and FBank. The results indicate that the proposed method increases the accuracy of Numenius phaeopus recognition from 64.14% to 87.24% and of Accipiter nisus from 76% to 94.5%.

5.4. Discussion

To further verify the robustness of the iWAVE-BiFBank method, boxplots of the accuracy of the results for 10 experiments are shown in Figure 14. iWAVE-BiFBank has good robustness and stability, with high accuracy. Furthermore, the robustness of FBank and iFBank based on iWAVE is significantly higher than that of the traditional MFCC and FBank.
The proposed method is compared with other state-of-the-art methods in Table 7. Previous studies usually focused on feature engineering without denoising, resulting in framework instability [46]; the introduction of iWAVE enhances the robustness of the classification performance. Furthermore, as shown in Table 7, researchers have frequently combined MFCC and/or FBank with machine learning and deep learning techniques [25,26,28,29,47,48,49,50]. However, MFCC and FBank generate filter banks based on the Mel scale, ignoring high-frequency information in the original audio and therefore limiting classification performance. iFBank, extracted on the iMel scale, focuses on high-frequency information, while BiFBank fully integrates low- and high-frequency information. Moreover, when processing the same frame, the BiFBank operations can be carried out within the same code loop, so the proposed method does not significantly increase the computational cost.
In summary, the iWAVE-BiFBank method achieves higher-accuracy bird vocalization classification, and iWAVE effectively reduces the interference of noise in the classifier. The BiFBank feature effectively addresses the limitations of traditional methods by focusing on both the low-frequency and high-frequency characteristics of the audio. The method offers good noise reduction performance, classification performance, and robustness. At the same time, the proposed method has the potential to be combined with emerging deep learning models: BiFBank, as a handcrafted feature, captures the essential characteristics of the audio, while deep learning models extract high-level features. Combining them can provide a multi-view analysis that enhances classification accuracy and generalizability.

6. Limitations and Future Scope

This paper proposes iWAVE-BiFBank, based on wavelet threshold denoising and FBank, which improves the birdsong classification performance, introduces a wider feature selection range, and provides technical support for biological monitoring, biodiversity strategies, and ecological protection. Although the method performs well, there are still some limitations that should be addressed in future work, mainly including the following:
  • Choosing a more effective pre-emphasis method that is closer to the birdsong acoustic attenuation model rather than using a simple high-pass filter [51].
  • Developing a method to separate overlapping vocalizations across the time domain, frequency domain, and spectrogram as well as attempting to recognize different bird vocalizations within overlapping audio.
  • Exploring more handcrafted and deep-learning-based features and constructing a multi-feature fusion method to enhance the classification performance of bird vocalizations in future studies. In addition, experiments could be conducted with different acoustic datasets containing more species of birds as well as different animal species to enhance the generalization of the method.

7. Conclusions

To classify bird vocalizations, this study proposed a method combining iWAVE denoising with BiFBank feature extraction, the latter fusing FBank and iFBank. Compared with traditional methods, the iWAVE method produced good results for both simulated and real signals at various noise intensities; in most cases, it obtained the highest objective SNR values and generated better waveforms. BiFBank integrates the advantages of FBank and iFBank and was combined with the proposed denoising method to create a comprehensive feature set for effective bird vocalization recognition. The random forest classifier was used to evaluate the classification performance of both traditional and proposed features. In the classification of the vocalizations of 16 bird species, the classification accuracy of iWAVE-BiFBank reached 94.00%, outperforming MFCC and FBank by 16.02 and 10.96 percentage points, respectively, and its other evaluation indicators were over 93%. The experimental results showed that the proposed method delivers better bird vocalization classification performance with enhanced stability and robustness.

Author Contributions

Conceptualization, C.P.; methodology, C.P., Y.Z. and Y.X.; software, C.P.; validation, C.P. and J.L.; formal analysis, C.P.; investigation, C.P. and J.L.; resources, Y.Z.; data curation, J.L.; writing—original draft preparation, C.P.; writing—review and editing, C.P., Y.Z. and J.L.; visualization, C.P. and J.L.; supervision, Y.Z., D.L. and Y.X.; project administration, Y.Z. and D.L.; funding acquisition, Y.Z. and D.L. All authors have read and agreed to the published version of this manuscript.

Funding

This study was supported by the National Natural Science Foundation of China (grant No. 32360388, grant No. 31860332, grant No. 31960142).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All data included in this study are available from the corresponding author or by accessing Xeno-canto.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Xie, J.; Colonna, J.G.; Zhang, J. Bioacoustic signal denoising: A review. Artif. Intell. Rev. 2021, 54, 3575–3597. [Google Scholar]
  2. Zhao, Z.; Zhang, S.; Xu, Z.; Bellisario, K.; Dai, N.; Omrani, H. Automated bird acoustic event detection and robust species classification. Ecol. Inform. 2017, 39, 99–108. [Google Scholar]
  3. Xie, J.; Zhong, Y.; Zhang, J.; Liu, S.; Ding, C.; Triantafyllopoulos, A. A review of automatic recognition technology for bird vocalizations in the deep learning era. Ecol. Inform. 2023, 73, 101927. [Google Scholar]
  4. Liu, H.; Wang, W.; Xiang, C.; Han, L.; Nie, H. A de-noising method using the improved wavelet threshold function based on noise variance estimation. Mech. Syst. Signal Process. 2018, 99, 30–46. [Google Scholar]
  5. Bnou, K.; Raghay, S.; Hakim, A. A wavelet denoising approach based on unsupervised learning model. EURASIP J. Adv. Signal Process. 2020, 2020, 36. [Google Scholar]
  6. Usman, A.M.; Ogundile, O.O.; Versfeld, D. Review of Automatic Detection and Classification Techniques for Cetacean Vocalization. IEEE Access 2020, 8, 105181–105206. [Google Scholar]
  7. Xiong, Z.; Ramchandran, K.; Orchard, M.T.; Zhang, Y.Q. A comparative study of DCT- and wavelet-based image coding. IEEE Trans. Circuits Syst. Video Technol. 1999, 9, 692–697. [Google Scholar]
  8. Halidou, A.; Mohamadou, Y.; Ari, A.; Zacko, E. Review of wavelet denoising algorithms. Multimed. Tools Appl. 2023, 82, 41539–41569. [Google Scholar]
  9. Singh, A.K.; Bhushan, S.; Vij, S.; Goyal, D.B.; Bălaş, V.E.; Mukherjee, A.; Hugo, C.; Albuquerque, V.D.; Gupta, A.K. Analysis and Comparison of DCT- and DWT-Based Image Compression Techniques. In Information Management and Machine Intelligence, Proceedings of the ICIMMI 2019, Jaipur, India, 14–15 December 2019; Springer: Singapore, 2021; pp. 45–55. [Google Scholar]
  10. Mallat, S.G.; Peyré, G. A Wavelet Tour of Signal Processing: The Sparse Way; Elsevier/Academic Press: Amsterdam, The Netherlands, 2009. [Google Scholar]
  11. Goyal, B.; Dogra, A.; Agrawal, S.; Sohi, B.S.; Sharma, A. Image denoising review: From classical to state-of-the-art approaches. Inf. Fusion 2020, 55, 220–264. [Google Scholar]
  12. Donoho, D.L.; Johnstone, I.M.; Kerkyacharian, G.; Picard, D. Density estimation by wavelet thresholding. Ann. Stat. 1996, 24, 508–539. [Google Scholar]
  13. Liu, Y.; Bei, G.; Jiang, Z.; Meng, Q.; Shi, H. Adaptive Noise Reduction Algorithm for Chaotic Signals Based on Wavelet Packet Transform. J. Electron. Inf. Technol. 2023, 45, 3676–3684. [Google Scholar]
  14. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar]
  15. Silberberg, M.; Grecco, H.E. Binlets: Data fusion-aware denoising enables accurate and unbiased quantification of multichannel signals. Inf. Fusion 2024, 101, 101999. [Google Scholar]
  16. Xie, S.; Lu, J.; Liu, J.; Zhang, Y.; Lv, D.; Chen, X. Multi-view features fusion for birdsong classification. Ecol. Inform. 2022, 72, 101893. [Google Scholar]
  17. Gong, L.; Xie, S.; Zhang, Y.; Xiong, Y.; Wang, X.; Li, J. A Robust Feature Extraction Method for Sound Signals Based on Gabor and MFCC. In Proceedings of the 6th International Conference on Communication and Information Systems (ICCIS), Chongqing, China, 14–16 October 2022; IEEE: Piscataway, NJ, USA, 2022; Volume 2022, pp. 49–55. [Google Scholar]
  18. Chi, Y.; Cai, C.; Ren, J.; Xue, Y.; Zhang, N. Damage location diagnosis of frame structure based on wavelet denoising and convolution neural network implanted with Inception module and LSTM. Struct. Health Monit. 2024, 23, 57–76. [Google Scholar]
  19. Donoho, D.L.; Johnstone, I.M. Ideal spatial adaptation by wavelet shrinkage. Biometrika 1994, 81, 425–455. [Google Scholar]
  20. Donoho, D.L.; Johnstone, I.M. Adapting to Unknown Smoothness via Wavelet Shrinkage. J. Am. Stat. Assoc. 1995, 90, 1200–1224. [Google Scholar]
  21. Sardy, S. Minimax threshold for denoising complex signals with Waveshrink. IEEE Trans. Signal Process. 2000, 48, 1023–1031. [Google Scholar]
  22. Donoho, D.L. De-noising by soft-thresholding. IEEE Trans. Inf. Theory 1995, 41, 613–640. [Google Scholar]
  23. Bonet-Solà, D.; Alsina-Pagès, R.M. A Comparative Survey of Feature Extraction and Machine Learning Methods in Diverse Acoustic Environments. Sensors 2021, 21, 1274. [Google Scholar] [CrossRef]
  24. Andono, P.N.; Shidik, G.F.; Prabowo, D.P.; Pergiwati, D.; Pramunendar, R.A. Bird Voice Classification Based on Combination Feature Extraction and Reduction Dimension with the K-Nearest Neighbor. IJIES 2022, 15, 262–272. [Google Scholar]
  25. Andono, P.N.; Shidik, G.F.; Prabowo, D.P.; Yanuarsari, D.H.; Sari, Y.; Pramunendar, R.A. Feature Selection on Gammatone Cepstral Coefficients for Bird Voice Classification Using Particle Swarm Optimization. IJIES 2023, 16, 254–264. [Google Scholar]
  26. Sui, T.; Yang, F.; Wang, R.; Li, X.; Liu, F. Research on bird sound recognition based on deep learning. In Proceedings of the International Conference on Automation Control, Algorithm, and Intelligent Bionics (ACAIB 2023), Xiamen, China, 28–30 April 2023; Ladaci, S., Kaswan, S., Eds.; SPIE: Bellingham, WA, USA, 2023; Volume 31, p. 85. [Google Scholar]
  27. Peng, C.; Zhang, Y.; Lv, D.; Lv, D.; Huang, Q.; Xiong, Y. FGS-Net: A Mutual Feature Fusion Residual Network Based on FBank and Gabor-Sinc Spectrogram for Birdsong Classification. In Proceedings of the 2024 10th International Conference on Computer and Communications (ICCC), Chengdu, China, 13–16 December 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 639–644. [Google Scholar]
  28. García-Ordás, M.T.; Rubio-Martín, S.; Benítez-Andrades, J.A.; Alaiz-Moretón, H.; García-Rodríguez, I. Multispecies bird sound recognition using a fully convolutional neural network. Appl. Intell. 2023, 53, 23287–23300. [Google Scholar]
  29. Wang, Q.; Song, Y.; Du, Y.; Yang, Z.; Cui, P.; Luo, B. Hierarchical-taxonomy-aware and attentional convolutional neural networks for acoustic identification of bird species: A phylogenetic perspective. Ecol. Inform. 2024, 80, 102538. [Google Scholar]
  30. Liao, Z.; Hu, S.; Hou, R.; Liu, M.; Xu, P.; Zhang, Z. Automatic recognition of giant panda vocalizations using wide spectrum features and deep neural network. MBE 2023, 20, 15456–15475. [Google Scholar]
  31. Cui, L.; Wang, Z. Study on Voiceprint Recognition Based on Mixed Features of LFBank and FBank. Comput. Sci. 2022, 49, 211000194. [Google Scholar]
  32. Wu, J.; Li, P.; Wang, Y.; Lan, Q.; Wang, X.; Wang, Z. VFR: The Underwater Acoustic Target Recognition Using Cross-Domain Pre-Training with FBank Fusion Features. J. Mar. Sci. Eng. 2023, 11, 263. [Google Scholar] [CrossRef]
  33. Lu, J.; Zhang, Y.; Danjv, L.; Xie, S.; Fu, Y.; Dan, L. Improved Broad Learning System for Birdsong Recognition. Appl. Sci. 2023, 13, 11009. [Google Scholar] [CrossRef]
  34. Zheng, W. Power Quality Disturbance Signal Denoising Algorithm Based on Improved Wavelet Threshold. Appl. Switchg. 2021, 1, 28–33. [Google Scholar]
  35. Yi, T.H.; Li, H.N.; Zhao, X.Y. Noise Smoothing for Structural Vibration Test Signals Using an Improved Wavelet Thresholding Technique. Sensors 2012, 12, 11205–11225. [Google Scholar] [CrossRef]
  36. Zhang, J.; Zhong, Q.; Dai, Y. The determination of the threshold and the decomposition order in threshold de-noising method based on wavelet transform. Proc. CSEE 2004, 24, 118–140. [Google Scholar]
  37. Yang, H.; Cheng, Y.; Li, G. A denoising method for ship radiated noise based on Spearman variational mode decomposition, spatial-dependence recurrence sample entropy, improved wavelet threshold denoising, and Savitzky-Golay filter. Alex. Eng. J. 2021, 60, 3379–3400. [Google Scholar] [CrossRef]
  38. Wang, J.; Sun, Y.; Sun, S. Recognition of Muscle Fatigue Status Based on Improved Wavelet Threshold and CNN-SVM. IEEE Access 2020, 8, 207914–207936. [Google Scholar] [CrossRef]
  39. Fu, R.; Zhang, J.; Wang, R.; Xu, T. Improved Wavelet Thresholding Function and Adaptive Thresholding for Noise Reduction. In Proceedings of the 11th International Conference on Intelligent Computing and Wireless Optical Communications (ICWOC), Chongqing, China, 16–18 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 25–30. [Google Scholar]
  40. Xie, B.; Xiong, Z.; Wang, Z.; Zhang, L.; Zhang, D.; Li, F. Gamma spectrum denoising method based on improved wavelet threshold. Nucl. Eng. Technol. 2020, 52, 1771–1777. [Google Scholar] [CrossRef]
  41. Srivastava, M.; Anderson, C.L.; Freed, J.H. A New Wavelet Denoising Method for Selecting Decomposition Levels and Noise Thresholds. IEEE Access 2016, 4, 3862–3877. [Google Scholar] [CrossRef] [PubMed]
  42. Tang, P.; Guo, B. Wavelet Denoising Based on Modified Threshold Function Optimization Method. J. Signal Process. 2017, 33, 102–112. [Google Scholar]
  43. Liu, F.; Shen, T.; Luo, Z.; Zhao, D.; Guo, S. Underwater target recognition using convolutional recurrent neural networks with 3-D Mel-spectrogram and data augmentation. Appl. Acoust. 2021, 178, 107989. [Google Scholar] [CrossRef]
  44. Daubechies, I. Orthonormal bases of compactly supported wavelets. Commun. Pure Appl. Math. 1988, 41, 909–996. [Google Scholar] [CrossRef]
  45. Zhang, Y.; Ding, W.; Pan, Z.; Qin, J. Improved Wavelet Threshold for Image De-noising. Front. Neurosci. 2019, 13, 39. [Google Scholar] [CrossRef]
  46. Pahuja, R.; Kumar, A. Sound-spectrogram based automatic bird species recognition using MLP classifier. Appl. Acoust. 2021, 180, 108077. [Google Scholar] [CrossRef]
  47. Ntalampiras, S. Bird species identification via transfer learning from music genres. Ecol. Inform. 2018, 44, 76–81. [Google Scholar] [CrossRef]
  48. Xie, S.; Zhang, Y.; Lv, D.; Xu, H.; Liu, J.; Yin, Y. Birdsongs recognition based on ensemble ELM with multi-strategy differential evolution. Sci. Rep. 2022, 12, 9739. [Google Scholar] [CrossRef] [PubMed]
  49. Wang, H.; Xu, Y.; Yu, Y.; Lin, Y.; Ran, J. An Efficient Model for a Vast Number of Bird Species Identification Based on Acoustic Features. Animals 2022, 12, 2434. [Google Scholar] [CrossRef] [PubMed]
  50. Kamarajugadda, R.; Battula, R.; Borra, C.R.; Durga, H.; Bypilla, V.; Reddy, S.S. Optimizing avian species recognition with MFCC features and deep learning models. Int. J. Inf. Tecnol. 2024, 16, 4621–4627. [Google Scholar] [CrossRef]
  51. Royle, J.A. Modelling sound attenuation in heterogeneous environments for improved bioacoustic sampling of wildlife populations. Methods Ecol. Evol. 2018, 9, 1939–1986. [Google Scholar] [CrossRef]
Figure 1. Distribution of the data before and after processing.
Figure 2. The structure of the proposed method.
Figure 3. Comparison of the improved threshold function with hard and soft thresholds.
Figure 4. The process of iWAVE.
Figure 5. The extraction steps of MFCC and FBank.
Figure 6. Mel filter bank.
Figure 7. iMel filter bank.
Figure 8. The complete extraction process of BiFBank.
Figure 9. From left to right are the signals in the original, 0.01, 0.5, 3, and 10 noise intensity.
Figure 10. Application of the adaptive determination of the optimal decomposition level.
Figure 11. Denoising performance comparison of proposed function and traditional functions.
Figure 12. Classification performance of various features.
Figure 13. Confusion matrix of FBank and iWAVE-BiFBank.
Figure 14. Boxplot of various features based on accuracy.
Table 1. The detailed information on the bird dataset.

| ID | Latin Name | Number of Audio Recordings | Duration of Bird Vocalization (s) |
|----|------------|----------------------------|-----------------------------------|
| 1 | Elachura formosa | 13 | 409.97 |
| 2 | Leiothrix lutea | 15 | 604.88 |
| 3 | Coturnix coturnix | 14 | 925.56 |
| 4 | Asio otus | 32 | 1001.03 |
| 5 | Grus grus | 19 | 835.76 |
| 6 | Numenius phaeopus | 30 | 860.62 |
| 7 | Larus canus | 11 | 497.83 |
| 8 | Accipiter nisus | 20 | 989.37 |
| 9 | Accipiter gentilis | 17 | 592.73 |
| 10 | Falco tinnunculus | 14 | 527.57 |
| 11 | Phasianus colchicus | 11 | 701.19 |
| 12 | Lagopus muta | 10 | 536.17 |
| 13 | Lyrurus tetrix | 8 | 784.16 |
| 14 | Cygnus cygnus | 15 | 619.89 |
| 15 | Phylloscopus trochiloides | 21 | 1516.88 |
| 16 | Francolinus pintadeanus | 13 | 612.85 |
Table 2. Description of each feature.

| Name of Feature | Description |
|-----------------|-------------|
| MFCC, FBank | Traditional features without denoising |
| iWAVE-FBank | Denoising method: iWAVE; feature extraction method: FBank |
| iWAVE-iFBank | Denoising method: iWAVE; feature extraction method: iFBank |
| iWAVE-BiFBank | Denoising method: iWAVE; feature extraction method: BiFBank |
Table 3. Parameter settings in experiments.

| Object | Relevant Parameter Settings |
|--------|-----------------------------|
| Wavelet basis | sym3 [44] |
| MFCC, FBank, iFBank, BiFBank | randomSeed: 77; pre-emphasis parameter: 0.97; frame size: 512; frame shift: 256; overlap rate: 0.5; window: Hamming; numFilters: 40; numMFCCCoefficients: 13 |
| RF [14] | numTrees: 100; maxDepth: 0; randomSeed: 0; numFeatures: 1 + log2(N) |
Table 4. SNR values (dB) of the proposed rule and other rules. Columns give the noise intensity (variance) for the Blocks and Bumps signals; the best value per column is bolded (values within 0.01 of it are also bolded).

| Threshold Selection Rule | Blocks, 0.01 | Blocks, 0.5 | Blocks, 3 | Blocks, 10 | Bumps, 0.01 | Bumps, 0.5 | Bumps, 3 | Bumps, 10 |
|--------------------------|--------------|-------------|-----------|------------|-------------|------------|----------|-----------|
| RigrSure [20] | **44.23** | 34.33 | 29.35 | 25.23 | 39.85 | 30.10 | **25.12** | 20.99 |
| MinMax [21] | 41.94 | 34.11 | 29.34 | 25.31 | **40.12** | **30.12** | 25.09 | **21.11** |
| Zheng [34] | 42.83 | 33.62 | 28.68 | 24.68 | 38.59 | 29.35 | 24.38 | 20.37 |
| Wang et al. [38] | 43.97 | 34.00 | 29.15 | 25.32 | 40.09 | 30.09 | 25.09 | 21.09 |
| Fu et al. [39] | 41.73 | 32.28 | 27.24 | 23.20 | 37.41 | 27.97 | 22.93 | 18.88 |
| Xie et al. [40] | 38.81 | 29.32 | 24.22 | 20.12 | 34.48 | 24.99 | 19.89 | 15.78 |
| Srivastava et al. [41] | 42.83 | 33.63 | 28.68 | 24.68 | 38.59 | 29.36 | 24.38 | 20.37 |
| Proposed Method | 44.02 | **34.35** | **29.40** | **25.42** | 39.86 | **30.11** | **25.12** | **21.12** |
Table 5. SNR values (dB) of the proposed function and traditional functions. Columns give the noise intensity (variance) for the Blocks and Bumps signals.

| Threshold Function | Blocks, 0.01 | Blocks, 0.5 | Blocks, 3 | Blocks, 10 | Bumps, 0.01 | Bumps, 0.5 | Bumps, 3 | Bumps, 10 |
|--------------------|--------------|-------------|-----------|------------|-------------|------------|----------|-----------|
| Hard-Threshold [19] | 35.76 | 24.30 | 19.98 | 17.66 | 32.71 | 24.26 | 19.87 | 15.70 |
| Soft-Threshold [22] | 28.43 | 21.87 | 18.21 | 16.26 | 27.52 | 20.50 | 16.21 | 13.48 |
| Proposed Method | 35.70 | 24.50 | 20.45 | 17.05 | 33.54 | 24.62 | 20.36 | 16.58 |
Table 6. Classification performance of different features. All values are percentages.

| Feature | Accuracy (±SD) | F1 Score (±SD) | Kappa (±SD) | Precision (±SD) | Recall (±SD) |
|---------|----------------|----------------|-------------|-----------------|--------------|
| MFCC | 77.98 ± 0.97 | 77.62 ± 0.77 | 76.38 ± 0.78 | 77.96 ± 0.72 | 77.84 ± 0.74 |
| FBank | 83.05 ± 0.93 | 83.14 ± 0.82 | 82.09 ± 0.86 | 83.18 ± 0.77 | 83.19 ± 0.82 |
| iWAVE-FBank | 88.67 ± 0.78 | 88.95 ± 0.42 | 88.26 ± 0.40 | 88.95 ± 0.32 | 88.93 ± 0.42 |
| iWAVE-iFBank | 89.58 ± 0.60 | 89.57 ± 0.45 | 89.02 ± 0.43 | 89.64 ± 0.42 | 89.62 ± 0.42 |
| iWAVE-BiFBank | 94.00 ± 0.44 | 93.57 ± 0.19 | 93.03 ± 0.20 | 93.54 ± 0.16 | 93.48 ± 0.19 |
Table 7. Comparisons with other methods.

| Study | Year | Method | Number of Bird Species | Performance (%) |
|-------|------|--------|------------------------|-----------------|
| Ntalampiras [47] | 2018 | MFCC with hidden Markov model (HMM)-based RF | 10 | Accuracy: 92.5 |
| Pahuja et al. [46] | 2021 | STFT with multi-layer perceptron (MLP) | 8 | Accuracy: 96.1, Precision: 84.5, Recall: 82.6 |
| Xie et al. [48] | 2022 | MFCC with ensemble-optimized extreme learning machine (ELM) | 9 | Accuracy: 89.05 |
| Wang et al. [49] | 2022 | FBank and MFCC with long short-term memory (LSTM) | 264 | Accuracy: 74.94 |
| Andono et al. [25] | 2023 | Combined MFCC, GTCC, NB, kNN, NN, GMM, and PSO | 264 | Accuracy: 78.33 |
| Sui et al. [26] | 2023 | FBank with transformer model | 10 | Accuracy: 85.77 |
| García-Ordás et al. [28] | 2023 | FBank with full CNN | 17 | Accuracy: 85.71 |
| Wang et al. [29] | 2024 | 3D-FBank with Phylogenetic Perspective Neural Network (PPNN) | 500 | Accuracy: 89.95 |
| Kamarajugadda et al. [50] | 2024 | MFCC with CNN, LSTM, and VGGish | 5 | Accuracy: 87.37 |
| Proposed method | 2024 | iWAVE-BiFBank | 16 | Accuracy: 94.00, F1 score: 93.57, Kappa: 93.03, Precision: 93.54, Recall: 93.48 |