1. Introduction
Biodiversity is a critical component for maintaining the stability and adaptability of ecosystems, and its effective monitoring is particularly crucial in the context of global climate change and the intensification of human activities [
1]. As one of the most environmentally sensitive biological groups within ecosystems, birds are widely regarded as ecological indicator species and play a pivotal role in ecosystem health assessments and biodiversity conservation efforts [
2]. Birds exhibit rapid responses to factors such as habitat quality, land-use changes, and climate fluctuations, making their population dynamics valuable indicators of ecosystem trends. In recent years, bird populations and biodiversity worldwide have sharply declined due to the combined effects of human activities and environmental changes [
3]. Consequently, enhancing bird monitoring capabilities, strengthening conservation mechanisms, and safeguarding the healthy development of bird populations have become increasingly vital.
Birds serve not only as biodiversity indicators but also as sentinels of habitat integrity, since many species exhibit distinct seasonal and behavioral vocal patterns [
4,
For example, migratory waterfowl such as the Canada Goose produce conspicuous honking sequences during spring courtship and territorial displays [
6], whereas the Whooper Swan emits low-frequency contact calls during molting periods. Such phenological shifts in acoustic signals reflect underlying physiological and ecological processes—breeding readiness, flock cohesion, or predation risk [
7]—and offer a non-invasive window into population health and ecosystem function.
Bird communities are typically species-rich, and birds themselves produce prominent vocalizations and are highly active during the day—traits that render them highly observable and representative in field surveys and long-term ecological monitoring [
8]. Notably, the species information, behavioral patterns, and temporal rhythms embedded in bird vocalizations provide a crucial data source for eco-acoustic research [
9]. Through continuous monitoring of avian soundscapes, researchers can obtain essential ecological data without disturbing natural behaviors, thereby offering a more comprehensive assessment of ecosystem structure and functionality [
10].
Traditional bird monitoring methods primarily include visual observation techniques [
2] and manual acoustic recording methods [
9]. These methods rely on human experts for species identification and the interpretation of bird calls [
8]. However, manual and visual surveys suffer from several drawbacks. They are time-consuming and labor-intensive, hindering large-scale, long-term studies. Observer expertise varies, introducing subjective bias. Surveys may also disturb bird behavior. Finally, many species call primarily at dawn, dusk, or in concealed habitats—further complicating manual monitoring [
10].
With the development of digital recording equipment and sensor technologies, passive acoustic monitoring (PAM) has rapidly become an effective tool in wildlife research [
11]. This method involves deploying automatic recorders in the field to continuously collect natural environmental sounds, enabling continuous monitoring of bird soundscapes [
12].
Despite the promise of passive acoustic monitoring (PAM), several ecological challenges persist in field deployment. Ambient soundscapes vary dramatically with habitat type—open wetlands, forest understories, and agricultural mosaics each impose distinct noise profiles (water flow, wind rustle, machinery hum) that can mask target vocalizations. Moreover, many species concentrate their calling activity at dawn and dusk or in concealed microhabitats, further complicating manual surveys and naïve automated detectors. Therefore, robust signal processing and intelligent classification methods are essential to disentangle overlapping sources, adapt to variable soundscapes, and ultimately translate raw acoustic data into actionable ecological insights [
13].
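As one concrete example of such preprocessing, a band-pass filter can suppress out-of-band noise before detection. The sketch below uses SciPy; the 1–8 kHz band and filter order are illustrative choices, not values from this study.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass(audio, sr, low_hz=1000.0, high_hz=8000.0, order=4):
    """Attenuate energy outside a band where many songbird calls fall.

    The 1-8 kHz band is illustrative; appropriate cutoffs depend on the
    target species and the noise profile of the deployment site.
    """
    sos = butter(order, [low_hz, high_hz], btype="bandpass", fs=sr,
                 output="sos")
    # Zero-phase filtering avoids shifting call onsets in time.
    return sosfiltfilt(sos, audio)

# Example: remove low-frequency machinery hum from a synthetic signal.
sr = 22050
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 3000 * t)        # "call" at 3 kHz
noise = 0.5 * np.sin(2 * np.pi * 60 * t)     # mains/machinery hum at 60 Hz
filtered = bandpass(signal + noise, sr)
```

Because the filter is applied forward and backward (`sosfiltfilt`), the effective attenuation below the low cutoff is doubled, which matters for persistent wind and water-flow rumble.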
Before the rise of deep learning, bird audio recognition primarily relied on manually extracted features combined with traditional machine learning models. Typical approaches included template matching methods (e.g., Dynamic Time Warping (DTW)) [
14] and feature-based classification methods such as Gaussian Mixture Models (GMM) [
15], Hidden Markov Models (HMM) [
16], Support Vector Machines (SVM) [
17], and Random Forests (RF) [
18]. These methods reduced human subjectivity to some extent, but they often struggled to achieve high accuracy when dealing with the highly variable and complex bird calls in field recordings [
19]. Traditional methods, which rely on manually selected acoustic features such as Mel-frequency cepstral coefficients (MFCC) and Linear Predictive Cepstral Coefficients (LPCC), face difficulties in capturing the complex patterns and long-range dependencies in bird calls, thus limiting the performance and generalization ability of the models [
20]. With the increase in data scale and environmental noise, the limitations of these traditional approaches have become more pronounced, highlighting the urgent need for more advanced methods to improve recognition robustness and accuracy.
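As an illustration of this classical pipeline, the sketch below trains SVM and Random Forest classifiers on per-clip feature vectors. The synthetic clusters stand in for MFCC summary statistics (which would in practice be extracted with an audio library); all sizes and parameters are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in for per-clip feature vectors: in practice these would be
# summary statistics (e.g., mean/std) of 13 MFCC coefficients per clip;
# here, random clusters emulate three well-separated species.
rng = np.random.default_rng(0)
n_per_class, n_mfcc = 100, 13
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(n_per_class, n_mfcc))
               for c in (0.0, 2.0, 4.0)])
y = np.repeat([0, 1, 2], n_per_class)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1337)
svm = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(f"SVM: {svm.score(X_te, y_te):.2f}  RF: {rf.score(X_te, y_te):.2f}")
```

On clean, well-separated features both models do well; the limitation described above appears when field noise and call variability blur these clusters.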
In recent years, owing to advancements in deep learning technologies, neural network-based bird audio recognition systems have emerged prolifically. The introduction of convolutional neural networks (CNNs) has significantly enhanced the performance of bird audio recognition. Researchers typically treat time–frequency representations of bird calls (e.g., spectrograms) as image inputs, with CNNs automatically extracting highly discriminative time–frequency features [
21]. Compared with handcrafted features, CNNs can learn local patterns in the bird call frequency spectrum (such as frequency modulation and harmonic structures) through convolutional filters, effectively capturing the temporal and frequency local structures of bird calls [
21]. In early applications, shallow CNNs demonstrated the potential to surpass traditional methods. For example, in the BirdCLEF2016 challenge, models that used convolutional networks to process spectrograms set a new record of 55%, validating the superiority of deep learning methods over traditional approaches [
22]. With the deepening of networks and the development of data augmentation techniques, more complex CNN architectures have achieved superior performance: Sankupellay et al. employed the ResNet-50 network to identify 46 bird species, achieving an accuracy of 72%, significantly higher than earlier traditional methods [
23]. A study using the Inception-v3 network on the BirdCLEF2019 dataset, which encompasses 695 bird species, resulted in a final accuracy of only approximately 16% [
However, this also revealed a limitation of CNNs: owing to the invariance of convolution to frequency translation, CNNs struggle to distinguish similarly shaped calls that occur in different frequency bands [
11]. This phenomenon highlights a gap in modeling along the frequency axis—traditional CNNs are more focused on local shape patterns and lack sensitivity to absolute frequency positions [
11]. Overall, CNN-based deep learning methods have significantly outperformed earlier techniques in bird audio recognition, although their local perceptual characteristics present new challenges.
Recently, several studies have explored wavelet-based time–frequency representations in bioacoustic analysis. Andén and Mallat [
25] introduced the wavelet scattering transform with Morlet wavelets to extract stable, deformation-invariant features from non-stationary signals, achieving state-of-the-art performance in limited-label audio classification. More recently, Gauthier et al. [
26] proposed parametric scattering networks that learn wavelet filter parameters in a data-driven fashion, demonstrating improved species classification accuracy under varying noise conditions. These findings suggest that wavelet-based methods can capture rapid frequency modulations and non-stationary chirp patterns, complementing STFT-derived spectrograms for bird audio classification.
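To illustrate why wavelet analysis suits frequency-modulated notes, the sketch below builds a small complex Morlet filter bank by direct convolution and applies it to a synthetic rising chirp; a production system would use an optimized scattering implementation instead. All parameters are illustrative.

```python
import numpy as np

def morlet_scalogram(x, sr, freqs, n_cycles=6.0):
    """Magnitude scalogram from a bank of complex Morlet wavelets.

    A minimal illustration of wavelet analysis for chirp-like bird notes,
    computed by direct convolution for clarity rather than speed.
    """
    out = np.empty((len(freqs), len(x)))
    for i, f in enumerate(freqs):
        # Gaussian-windowed complex exponential at centre frequency f;
        # higher frequencies get shorter windows (constant-Q behaviour).
        sigma_t = n_cycles / (2 * np.pi * f)
        t = np.arange(-4 * sigma_t, 4 * sigma_t, 1 / sr)
        wavelet = np.exp(2j * np.pi * f * t) * np.exp(-t**2 / (2 * sigma_t**2))
        wavelet /= np.abs(wavelet).sum()
        out[i] = np.abs(np.convolve(x, wavelet, mode="same"))
    return out

# A rising chirp, loosely mimicking a frequency-modulated bird note.
sr = 8000
t = np.arange(sr // 2) / sr
chirp = np.sin(2 * np.pi * (500 + 2000 * t) * t)
freqs = np.linspace(400, 3000, 27)
S = morlet_scalogram(chirp, sr, freqs)
```

The ridge of the scalogram tracks the chirp's instantaneous frequency, which is exactly the kind of rapid modulation that fixed-window STFT spectrograms smear.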
Bird calls exhibit distinct temporal properties, with variations in rhythm, syllable order, and duration across different species. Recurrent neural networks (RNNs) and their variants (long short-term memory (LSTM) networks and gated recurrent units (GRUs)) are adept at modeling sequential data, capturing temporal dependencies in bird call audio [
27]. Unlike CNNs, which focus solely on instantaneous spectral features, RNNs can memorize and utilize the state changes in bird calls over time, such as the sequence and duration of notes, thereby enhancing classification performance for complex calls [
21]. To simultaneously address spectral local feature extraction and temporal modeling, the convolutional–recurrent hybrid model (CRNN) was introduced [
28]. CRNNs typically use CNNs as the frontend to extract spectral features from each short-time frame, followed by RNNs to model these feature sequences, exhibiting superior performance in bird sound recognition. Gupta et al. compared various convolutional–recurrent networks on large bird call datasets, finding that hybrid models incorporating LSTM/GRU performed the best on large-scale data such as the Cornell Bird Call Challenge [
21]. Some studies have also introduced attention mechanisms to enhance the RNN’s focus on critical information: Noumida et al. constructed a hierarchical attention BiGRU model, achieving favorable results in multi-label bird species recognition. The attention mechanism enabled the model to focus on the most relevant time segments in the audio, thus improving classification accuracy [
29]. In summary, RNN and CRNN methods have addressed the shortcomings of pure CNNs in temporal sequence modeling, achieving remarkable progress in the sequential modeling of bird audio. By combining convolutional and recurrent networks, the model can more comprehensively represent bird call signals, achieving higher recognition performance than single-architecture approaches.
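A minimal CRNN of the kind described above can be sketched in PyTorch as follows; the layer sizes and pooling scheme are illustrative, not those of any cited model.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """CNN frontend over mel-spectrogram frames followed by a BiGRU.

    Shapes and layer sizes are illustrative only.
    """
    def __init__(self, n_mels=64, n_classes=20, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
            nn.MaxPool2d((2, 2)),               # halve frequency and time
            nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d((2, 1)),               # pool frequency only
        )
        self.gru = nn.GRU(input_size=32 * (n_mels // 4), hidden_size=hidden,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                       # x: (batch, 1, n_mels, time)
        f = self.cnn(x)                         # (batch, 32, n_mels//4, time//2)
        f = f.permute(0, 3, 1, 2).flatten(2)    # (batch, time//2, features)
        seq, _ = self.gru(f)                    # temporal modelling
        return self.fc(seq.mean(dim=1))         # pool over time, classify

logits = CRNN()(torch.randn(2, 1, 64, 100))
```

The convolutional stage extracts local spectral patterns per frame, and the bidirectional GRU then models note order and duration across frames — the division of labour described above.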
Over the past few years, with the remarkable success of transformer models in natural language processing and computer vision, their application has gradually extended to bird audio recognition. The transformer architecture, relying on its self-attention mechanism, enables the global modeling of dependencies across any positions in a sequence, which is particularly advantageous for capturing long-term correlated patterns across both time and frequency in bird calls [
30]. Early attempts include Puget’s STFT-Transformer model [
31], which used logarithmic Mel-spectra as input for the transformer, achieving impressive results in BirdCLEF2021. The model outperformed traditional CNN baselines in both recognition accuracy and processing speed. Furthermore, Tang et al. developed the Transound model, incorporating a Vision Transformer with an exceptionally large number of attention heads to encode bird call features such as Mel-frequency cepstral coefficients (MFCC). This model demonstrated a 10.64% increase in accuracy compared with the best existing CNN model [
32], showcasing the significant advantages of the transformer in bird call recognition tasks and its ability to uncover discriminative patterns that CNNs fail to capture. Additionally, some studies have combined transformers with convolutional networks to form hybrid architectures. For instance, Xiao et al. integrated multiple features—such as MFCC, chroma spectrograms, and spectral centroid—into an enhanced ResNet (AM-ResNet) and transformer module, achieving a recognition accuracy of 90.1% [
33]. Zhang et al. merged features extracted by deep CNNs from logarithmic Mel-spectra with features encoded by transformers (MFCC/chroma) to achieve an accuracy of 97.99% in classifying 20 bird species [
19]. These studies indicate that the transformer not only serves as an independent time–frequency modeler but, when combined with traditional CNNs, further enhances performance. The introduction of transformer-based methods has endowed bird audio recognition with the capability of global modeling, effectively mitigating the limitations of CNNs’ local receptive fields. This has proven especially beneficial for distinguishing cross-time and cross-frequency related patterns in long recordings or complex soundscapes.
Given the complex and dynamic patterns in bird calls, models based on a single architecture often fail to comprehensively capture all discriminative information. Consequently, various hybrid architectures and feature fusion strategies have emerged in recent years to combine the advantages of different models and features. For example, Liu et al. proposed a multi-scale convolutional model (EMSCNN), which uses convolutional kernels of varying sizes to extract both detailed and overall features of bird calls, then concatenates multi-scale features, achieving a recognition accuracy of 91.49% for 30 bird species [
34]. However, this model has a large number of parameters and simply concatenates the outputs of different convolution branches, lacking a mechanism for further exploring the correlations between features at different scales. The integration of attention mechanisms into deep learning models can guide the model’s focus on more important feature dimensions or time–frequency locations, which has also been increasingly applied in bird audio recognition. For instance, Gunawan et al. integrated the Convolutional Block Attention Module (CBAM) into convolutional networks for owl call classification, achieving superior performance compared with CNNs without attention [
35]. Another study designed the “MFF-ScSEnet” network, which combines multi-scale feature fusion with scSE attention, reaching over 96% accuracy in bird call classification [
36]. However, most of the currently applied attention modules focus on weighting the channel or spatial (time–frequency) dimensions, with few efforts focusing on separately designing attention mechanisms for the time and frequency axes. Existing methods treat the entire time–frequency plane as a spatial dimension, which may fail to distinguish which frequencies at which times are more important.
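One way to realize separate time- and frequency-axis attention is a pair of independent squeeze-and-excitation-style gates, sketched below in PyTorch; this is a generic illustration with illustrative sizes, not any specific published module.

```python
import torch
import torch.nn as nn

class AxisAttention(nn.Module):
    """Separate gates for the frequency axis and the time axis.

    Each axis is average-pooled, passed through a small bottleneck MLP,
    and used to re-weight the feature map along that axis only, so the
    network can learn which bands and which intervals matter.
    """
    def __init__(self, n_freq, n_time, reduction=4):
        super().__init__()
        self.freq_gate = nn.Sequential(
            nn.Linear(n_freq, n_freq // reduction), nn.ReLU(),
            nn.Linear(n_freq // reduction, n_freq), nn.Sigmoid())
        self.time_gate = nn.Sequential(
            nn.Linear(n_time, n_time // reduction), nn.ReLU(),
            nn.Linear(n_time // reduction, n_time), nn.Sigmoid())

    def forward(self, x):                          # x: (batch, ch, freq, time)
        w_f = self.freq_gate(x.mean(dim=(1, 3)))   # (batch, freq)
        w_t = self.time_gate(x.mean(dim=(1, 2)))   # (batch, time)
        return x * w_f[:, None, :, None] * w_t[:, None, None, :]

x = torch.randn(2, 8, 64, 100)
out = AxisAttention(64, 100)(x)
```

Because the frequency gate is indexed by absolute band position, this style of attention can recover sensitivity to absolute frequency that plain convolutions lack.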
To address the critical limitations of prior methods—namely, the insufficient joint modeling of local and global features, the lack of integrated spectral–temporal attention mechanisms, and the suboptimal discriminative capacity caused by simplistic feature fusion—this paper proposes DuSAFNet (Dual-path Spectro–Temporal Attention and Fusion Network) for bird sound recognition. The proposed network is built upon a shared backbone and employs a Dual-Path Feature Module (DPFM) with a dense-residual skip structure to facilitate multi-scale feature extraction. This architecture is complemented by a Spectral–Temporal Attention (STA) mechanism, which allows the network to selectively focus on key frequency bands and temporal windows, capturing both local and global features. To further enhance feature integration, a Gated Fusion Map (GFM) is utilized, enabling dynamic and context-sensitive fusion of features across the two paths. Additionally, the Temporal-Spatial Fusion Module (TSFM) provides deep contextual modeling of the features, enriching their temporal and spatial relationships. Finally, the integration of a multi-band ArcMarginProduct classifier further refines the model’s ability to distinguish between complex audio patterns, improving recognition accuracy across various bird species. The key innovations and contributions of this work are summarized as follows:
- (i)
Dual-Path Feature Module (DPFM)
A dual-path feature extraction module, comprising GrowthBranch and SkipBranch, is designed to extract features in parallel. The former employs dense growth units to capture fine-grained local textures, while the latter uses a residual skip structure to capture long-range context. This dual-path approach facilitates complementary modeling of features at different scales, enabling collaborative perception of time–frequency features in bird song.
- (ii)
Spectral–Temporal Attention (STA)
Attention weights are modeled separately on the frequency axis and time axis, allowing the network to automatically focus on the most discriminative frequency bands and time periods during training. This effectively addresses the limitations of traditional convolutional networks, which fail to adequately capture absolute frequency information and long-range temporal relationships.
- (iii)
Gated Fusion Map (GFM)
A lightweight gating mechanism is proposed, which dynamically adjusts the proportion of information flow from each branch during the fusion process. This self-adaptive approach suppresses redundant features while highlighting critical information, thereby enhancing the fusion efficiency and feature quality.
- (iv)
Temporal-Spatial Fusion Module (TSFM)
The LocalSpanAttention (local span temporal attention) and MultiscaleAttentionModule (multi-scale spatial-channel attention) are coupled to achieve unified modeling of local dependencies in the time domain and multi-scale re-scaling in the spatial-channel domain. This comprehensive approach enhances the recognition capability of the bird song spectrogram along both temporal and spatial dimensions.
- (v)
Multi-band ArcMarginProduct Classifier
The feature maps are divided into low, medium, and high frequency bands based on their height. The model leverages species-specific differences in bird calls across these frequency bands by using ArcMarginProduct with different scale factors and angular margins for classification. These features are weighted and fused using learnable weights, which explicitly increase the inter-class angular distance and improve fine-grained species discrimination.
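The ArcMarginProduct head referenced above follows the widely used additive angular margin formulation; a minimal single-band sketch (with generic scale and margin values, not the per-band settings of this work) is:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcMarginProduct(nn.Module):
    """Additive angular margin head, as commonly implemented.

    Scale s and margin m are generic defaults, not the per-band
    values used by the model described in the text.
    """
    def __init__(self, in_features, n_classes, s=30.0, m=0.50):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, in_features))
        self.s, self.m = s, m

    def forward(self, x, labels):
        # Cosine similarity between L2-normalized embeddings and class centres.
        cos = F.linear(F.normalize(x), F.normalize(self.weight))
        cos = cos.clamp(-1 + 1e-7, 1 - 1e-7)
        theta = torch.acos(cos)
        # Add the angular margin only on the ground-truth class, which
        # explicitly enlarges inter-class angular distance during training.
        target = F.one_hot(labels, cos.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.m), cos)
        return self.s * logits            # fed to a cross-entropy loss

head = ArcMarginProduct(in_features=128, n_classes=10)
logits = head(torch.randn(4, 128), torch.tensor([0, 3, 7, 9]))
```

In the multi-band variant described above, one such head per frequency band (with band-specific s and m) would produce logits that are then combined with learnable weights.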
3. Experiments and Results
3.1. Experimental Setup
3.1.1. Dataset Split
The 17,653 preprocessed bird audio segments were split and cleaned as described in
Section 2.1. In this study, the same 70% training and 30% testing split was retained, with 12,357 segments for training and 5296 for testing, thereby ensuring a consistent species-level distribution. Inference and performance evaluation were conducted exclusively on this fixed test set.
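A stratified split of this kind can be reproduced with scikit-learn's train_test_split; the file names and label layout below are hypothetical placeholders, not the actual dataset.

```python
from sklearn.model_selection import train_test_split

# Hypothetical segment paths and integer species labels (20 species,
# balanced here for illustration).
segments = [f"seg_{i:05d}.wav" for i in range(1000)]
labels = [i % 20 for i in range(1000)]

# stratify=labels preserves the species-level class distribution
# in both partitions, matching the protocol described above.
train_x, test_x, train_y, test_y = train_test_split(
    segments, labels, test_size=0.30, stratify=labels, random_state=1337)
```

Fixing random_state makes the partition reproducible across runs, which is what allows all models to be evaluated on one fixed test set.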
3.1.2. Hardware and Software Environment
All experiments were conducted on the same server, with the hardware and software configurations listed in
Table 2:
3.1.3. Hyperparameters and Training Details
To ensure the reproducibility of the experiments, all random operations were performed using a fixed random seed of 1337. The model input was a 224 × 224 three-channel log Mel spectrogram, with a batch size of 64. The number of training epochs was set to 150. The Adam optimizer was used with an initial learning rate of ; if the validation loss did not decrease over 10 consecutive epochs, the learning rate was reduced by a factor of 0.1, with a lower limit of . To prevent overfitting, Dropout (with ) was applied to the output of the MultiscaleAttentionModule; BatchNorm2d normalization was applied to all convolutional layer outputs.
During the training phase, several data augmentation techniques were applied to the input spectrograms: the grayscale spectrograms were first resized to 224 × 224, then randomly rotated by ±20°, followed by random cropping (scale ) and resizing back to 224 × 224. Additionally, a 50% chance of horizontal flipping was applied. After each reading, SpecAugment was applied to the single-channel grayscale spectrogram by first performing frequency masking and then time masking, randomly masking part of the Mel bands and time frames. During the validation phase, the spectrograms were uniformly scaled to 224 × 224 and normalized per channel to the range [–1, 1].
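The frequency-then-time masking step can be sketched directly in NumPy; the mask-width limits below are illustrative, not the values used in this study.

```python
import numpy as np

def spec_augment(spec, n_freq_masks=1, n_time_masks=1, max_f=16, max_t=24,
                 rng=None):
    """Frequency masking followed by time masking on a (n_mels, n_frames)
    spectrogram, in the order described in the text."""
    rng = np.random.default_rng() if rng is None else rng
    spec = spec.copy()
    n_mels, n_frames = spec.shape
    for _ in range(n_freq_masks):
        w = int(rng.integers(0, max_f + 1))
        f0 = int(rng.integers(0, n_mels - w + 1))
        spec[f0:f0 + w, :] = 0.0          # zero out a band of Mel bins
    for _ in range(n_time_masks):
        w = int(rng.integers(0, max_t + 1))
        t0 = int(rng.integers(0, n_frames - w + 1))
        spec[:, t0:t0 + w] = 0.0          # zero out a run of time frames
    return spec

aug = spec_augment(np.ones((128, 224)), rng=np.random.default_rng(0))
```

Masking in the spectrogram domain forces the model not to rely on any single band or interval, which helps with the occlusion and overlap typical of field recordings.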
During training, after each epoch, the loss and accuracy were calculated on the validation set. The model weights corresponding to the best validation loss were recorded, and the final training weights were saved at the end of all epochs.
3.1.4. Real-Time Inference Performance Evaluation
To assess the suitability of DuSAFNet for real-time acoustic monitoring, this research measured its inference latency and throughput on our target hardware platform as described in
Section 3.1.2. The average latency per inference was 31.5 ms, corresponding to a processing rate of approximately 31.7 frames per second (FPS). These results indicate that DuSAFNet comfortably meets near real-time requirements for field deployment, enabling prompt detection and classification of avian vocalizations with minimal delay. The achieved throughput further suggests that continuous audio streams can be processed without backlog, which is critical for reliable sound-based conservation systems.
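A latency/throughput measurement of this kind can be scripted as below; the model here is a stand-in layer, since the timing code only needs a callable module. Warm-up iterations are included so one-time initialization does not skew the average.

```python
import time
import torch

@torch.no_grad()
def measure_latency(model, input_shape, n_warmup=10, n_runs=100, device="cpu"):
    """Average single-sample inference latency (ms) and implied FPS.

    Uses a CPU wall-clock timer; on GPU, torch.cuda.synchronize() calls
    around the timed region are needed for accurate numbers.
    """
    model = model.to(device).eval()
    x = torch.randn(1, *input_shape, device=device)
    for _ in range(n_warmup):             # warm-up: caches, lazy init
        model(x)
    start = time.perf_counter()
    for _ in range(n_runs):
        model(x)
    latency = (time.perf_counter() - start) / n_runs
    return latency * 1e3, 1.0 / latency   # ms per inference, FPS

# Stand-in module with the 3-channel 224x224 input described above.
ms, fps = measure_latency(torch.nn.Conv2d(3, 8, 3), (3, 224, 224))
```

Averaging over many runs smooths out scheduler jitter; reporting both latency and FPS makes the real-time claim directly checkable.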
3.2. Evaluation Metrics
We evaluate multi-class bird species recognition from three perspectives: classification performance, predictive quality, and model complexity.
Accuracy measures overall correctness across all samples:

$$\mathrm{Accuracy} = \frac{1}{C}\sum_{i=1}^{C}\frac{TP_i + TN_i}{TP_i + TN_i + FP_i + FN_i}$$

where $C$ is the number of classes, and $TP_i$, $TN_i$, $FP_i$, $FN_i$ denote true positives, true negatives, false positives, and false negatives for class $i$. Values closer to 1 indicate better overall classification.

Precision and Recall are defined per class as

$$\mathrm{Precision}_i = \frac{TP_i}{TP_i + FP_i}, \qquad \mathrm{Recall}_i = \frac{TP_i}{TP_i + FN_i}$$

We report the macro averages:

$$\mathrm{Precision}_{\mathrm{macro}} = \frac{1}{C}\sum_{i=1}^{C}\mathrm{Precision}_i, \qquad \mathrm{Recall}_{\mathrm{macro}} = \frac{1}{C}\sum_{i=1}^{C}\mathrm{Recall}_i$$

Precision reflects the model’s ability to avoid false-positive errors, that is, the proportion of correctly predicted positives among all positive predictions. Recall reflects the model’s ability to capture true positives, that is, the proportion of correctly detected positives among all actual positive samples.

The F1 score for class $i$ is the harmonic mean of its precision and recall:

$$F1_i = \frac{2\,\mathrm{Precision}_i \cdot \mathrm{Recall}_i}{\mathrm{Precision}_i + \mathrm{Recall}_i}$$

and the macro-average F1 is

$$F1_{\mathrm{macro}} = \frac{1}{C}\sum_{i=1}^{C} F1_i$$
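The macro-averaged metrics described above can be computed directly with scikit-learn, as the toy example below shows.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy 3-class example: two samples per class.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

acc = accuracy_score(y_true, y_pred)
# average="macro" computes each metric per class, then takes the
# unweighted mean across classes, matching the definitions above.
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
```

Macro averaging weights every species equally, so rare species influence the score as much as common ones — the appropriate choice for the species-balanced evaluation described here.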
Params denotes the total number of learnable parameters (in millions), reflecting model size and storage requirements. GFLOPs denotes floating-point operations (in billions) per forward pass, reflecting inference cost. Larger values imply higher computational and deployment demands.
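The parameter count is straightforward to compute in PyTorch; GFLOPs, by contrast, are typically measured with a profiler library, so the sketch below (using an arbitrary toy model) covers only the parameter count.

```python
import torch.nn as nn

# Arbitrary toy model, for illustration only.
model = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(16 * 224 * 224, 10),
)

# Params in millions: total elements across all learnable tensors.
params_m = sum(p.numel() for p in model.parameters()) / 1e6
```

Note how the fully connected layer dominates the count — a common reason spectrogram classifiers pool aggressively before the classifier head.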
Together, these metrics provide comprehensive benchmarks for comparing model accuracy, predictive balance, and deployment feasibility in subsequent ablation and comparison experiments.
3.3. Ablation Studies
3.3.1. Progressive Ablation of Core Modules
To analyze the contribution of each core module in DuSAFNet, this research conducted a series of ablation experiments by progressively introducing modules.
Table 3 presents detailed results for each configuration. From these results, this research observes that the STA module significantly enhances early feature discriminability; the DPFM and GFM together capture multi-scale local and global information; the weighted tri-band ArcMarginProduct (W-ArcMargin) enhances frequency-specific discrimination; and finally, combining W-ArcMargin with the TSFM yields the best overall performance.
The baseline model (E0) consists of a shared convolutional stem, global average pooling, and a linear classifier. It achieves 33.12% accuracy, 45.03% precision, 33.48% recall, and 31.05% F1. Without any attention or fusion modules, this configuration fails to capture sufficiently discriminative time–frequency features of bird calls, resulting in poor classification across similar-sounding species.
Building upon E0, we add the STA. STA recalibrates features along both frequency and time dimensions, focusing on the most informative bands and intervals. As a result, E1 achieves 45.41% accuracy, 53.37% precision, 44.62% recall, and 43.86% F1. This improvement indicates that STA enables the network to suppress irrelevant noise and emphasize bird-specific spectral patterns, though some fine-grained temporal context remains unmodeled.
Next, E2 incorporates the DPFM, consisting of GrowthBranch and SkipBranch, together with the GFM to capture richer multi-scale local and global information. By adaptively weighting and fusing dual-path features, the model’s parameter count reaches 5.40 M, with 2.223 GFLOPs. On the test set, E2 achieves 95.66% accuracy, 95.50% precision, 95.60% recall, and 95.53% F1. This dramatic improvement demonstrates that DPFM and GFM are critical for effectively combining fine-grained texture cues (e.g., harmonic structures) with broader contextual patterns (e.g., syllable sequences), resulting in markedly improved discrimination among species.
Then, we replace the classifier with the weighted tri-band ArcMarginProduct (W-ArcMargin). The parameter count remains near 5.42 M, with 2.223 GFLOPs. On the test set, E3 achieves 93.07% accuracy, 93.05% precision, 93.00% recall, and 92.90% F1. Compared with E4, E3’s performance drop highlights that TSFM contributes additional spatio-temporal refinement. Nevertheless, W-ArcMargin preserves strong classification by explicitly enlarging angular separation in the low-, mid-, and high-frequency subspaces, reducing confusion between acoustically similar species.
Finally, the full DuSAFNet configuration integrates TSFM with the weighted ArcMarginProduct. The resulting model has 6.77 M parameters and requires 2.275 GFLOPs. On the test set, E4 attains 96.88% accuracy, 96.85% precision, 96.89% recall, and 96.83% F1. The combined spatio-temporal refinement and multi-frequency angular margin substantially reduce residual errors, particularly for species with overlapping frequency profiles, confirming that TSFM and W-ArcMargin work synergistically to maximize classification accuracy.
Collectively, these ablation results demonstrate that each module—STA, DPFM, GFM, W-ArcMargin, and TSFM—contributes positively to performance. In particular, E2 shows the decisive impact of multi-scale feature fusion, E3 highlights the effectiveness of weighted ArcMarginProduct for frequency-specific discrimination, and E4 confirms that adding TSFM yields the best overall results.
3.3.2. Ablation of DPFM Submodules
In order to thoroughly assess the impact of the DPFM on classification performance within DuSAFNet, this research conducted a series of ablation experiments. We individually evaluated the contributions of the GrowthBranch and SkipBranch outputs and compared simple concatenation against gated fusion for feature integration. To this end, we modified the E1 model by selectively removing or combining submodules and measured the subsequent performance. The results are summarized in
Table 4.
In experiment E2a, only the GrowthBranch was preserved and integrated with the output of the Spectral–Temporal Attention (STA) module before being passed to the subsequent layers. Under this configuration, the model comprises 0.642 M parameters and incurs a computational cost of 0.857 GFLOPs. On the test set, it achieves an accuracy of 92.65%, a precision of 92.53%, a recall of 92.54%, and an F1 score of 92.51%.
Compared with E2b, which retains only the SkipBranch, E2a exhibits a 3.18-percentage-point decrease in accuracy and a 3.22-point drop in F1 score, indicating that although the GrowthBranch effectively captures fine-grained local features, it lacks the global contextual modeling capability provided by the SkipBranch, thereby limiting overall classification performance.
Nevertheless, when compared with the enhanced baseline model E1, which omits any multi-path feature extraction modules and yields 45.41% accuracy and a 43.86% F1 score, E2a delivers substantial improvements of 47.24 and 48.65 percentage points, respectively, while introducing only approximately 0.52 M additional parameters and 0.736 GFLOPs in computational overhead. These results underscore the effectiveness of the GrowthBranch in significantly enhancing the discriminative power of STA-derived time–frequency representations, even in the absence of a complete dual-path structure, thereby highlighting its practical value for lightweight deployment scenarios.
By contrast, experiment E2b retained only the SkipBranch output alongside the STA output, resulting in 4.51 M parameters and 1.436 GFLOPs. Under this setup, the model achieved 95.83% accuracy, 95.75% precision, 95.77% recall, and a 95.73% F1 score. This substantial improvement demonstrates that the SkipBranch output carries richer global context, yielding a more pronounced performance gain than the GrowthBranch alone.
In experiment E2ab, the GrowthBranch and SkipBranch outputs were merged via simple concatenation, increasing parameters to 5.15 M and GFLOPs to 2.172. The test accuracy, precision, recall, and F1 score were 95.54%, 95.50%, 95.44%, and 95.46%, respectively. Concatenation thus performs on par with the SkipBranch alone, suggesting that a naive merge does not fully exploit the complementary information of the two paths.
To refine feature fusion, experiment E2 introduced the GFM to perform weighted fusion of the GrowthBranch and SkipBranch outputs. The model size rose to 5.40 M parameters with 2.223 GFLOPs. On the test set, accuracy increased to 95.66%, precision to 95.50%, recall to 95.60%, and the F1 score to 95.53%, a modest gain over simple concatenation. This confirms that gated fusion effectively emphasizes critical information during integration.
Overall, these experiments reveal that the SkipBranch contributes more significantly to performance improvement than the GrowthBranch, highlighting the importance of global context in bird-call classification. While simple concatenation provides some benefit, the gating mechanism further refines feature fusion, validating the utility of adaptive weighting. In summary, the DPFM submodule, through meticulous feature extraction and efficient fusion, greatly enhances DuSAFNet’s performance on bird-audio classification tasks.
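Gated fusion of two branches can be sketched as a learned convex combination; the 1×1-convolution gate below is a generic illustration, not the GFM itself.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Per-position convex gate between two feature branches.

    A 1x1 convolution predicts a gate g in (0, 1) from both branches,
    and the output is g * a + (1 - g) * b, so the mix is learned
    rather than fixed by concatenation.
    """
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid())

    def forward(self, a, b):              # both: (batch, ch, freq, time)
        g = self.gate(torch.cat([a, b], dim=1))
        return g * a + (1 - g) * b

a = torch.randn(2, 32, 16, 50)
b = torch.randn(2, 32, 16, 50)
fused = GatedFusion(32)(a, b)
```

Because the gate is predicted per position, the network can lean on the local-texture branch in some time–frequency regions and on the global-context branch in others — the adaptive weighting whose benefit the ablation above demonstrates.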
3.3.3. ArcMarginProduct Submodule Ablation
In order to thoroughly assess the impact of the multi-band ArcMarginProduct submodule within DuSAFNet, we designed ablation experiments focusing on the contribution of individual frequency bands and their weighted fusion to classification performance. This study compares three separate paths (low-, mid-, and high-frequency) each trained with ArcMarginProduct, alongside a path that fuses all three via learnable weights, aiming to explore the importance of frequency–band information and how weighted fusion optimizes overall performance. The results are summarized in
Table 5.
First, in experiment E4a, we trained using only low-frequency features with ArcMarginProduct. Under this configuration, the model achieved 83.97% accuracy, 84.43% precision, 83.95% recall, and an 82.00% F1 score on the test set. Although low-frequency features provide some discriminative value for certain species, the overall performance remains suboptimal, with an F1 score notably lower than mid- and high-frequency configurations. This indicates that, while containing some useful information, low-frequency features alone cannot sustain efficient classification across species.
Next, experiment E4b employed only mid-frequency features with ArcMarginProduct, resulting in a marked improvement: 96.34% accuracy, 96.31% precision, 96.29% recall, and a 96.29% F1 score. Mid-frequency features play a more critical role in bird-audio classification, offering richer discriminative cues that enhance performance.
In experiment E4c, only high-frequency features were used with ArcMarginProduct, yielding 96.58% accuracy, 96.55% precision, 96.55% recall, and a 96.53% F1 score. High-frequency features capture abundant fine-grained details and achieve high discriminability, effectively modeling subtle time–frequency variations.
Finally, in experiment E4, we introduced a tri-band weighted fusion of ArcMarginProduct, merging low, mid, and high-frequency outputs via learnable weights. This configuration achieved 96.88% accuracy, 96.85% precision, 96.89% recall, and a 96.83% F1 score. These results demonstrate that fusing features across frequency bands leverages each band’s strengths, substantially improving classification performance. The multi-band weighted fusion both enhances accuracy through information complementation and validates the efficacy of the weighted fusion mechanism.
In summary, low-frequency features alone yield limited performance, indicating insufficient support across most species. Mid- and high-frequency features provide more effective discriminative information, especially high-frequency. Ultimately, by introducing tri-band weighted fusion, the model achieves peak performance, showcasing the tremendous potential of multi-band feature fusion for bird-audio classification.
3.4. Comparison with Other Models
In this section, we conduct a comprehensive comparison between DuSAFNet and several established deep learning architectures, including recent models such as InceptionNeXt [51], ConvNeXt [52], and MnasNet [53]. All competing models were trained on the same preprocessed dataset, using log-Mel spectrograms as input. The training protocols and hyperparameters used in these experiments follow those described in Section 3.1.3. For the final evaluation, the models are tested using the weights that achieved the best validation performance. The results of this comparison validate DuSAFNet’s superior performance in the multi-class bird audio classification task.
Table 6 presents a summary of the results for ViT [30], VGG16 [54], 1D-CRNN [28], ResNet-50 [42], MobileNetV2 [55], LSTM [56], InceptionNeXt [51], ConvNeXt [52], MnasNet [53], and DuSAFNet.
Table 7 provides a model complexity summary, including parameter counts and inference GFLOPs for each model. These models are evaluated based on four key performance metrics: accuracy, precision, recall, and F1 score.
ViT, a vision transformer that has recently gained significant attention, achieves 94.11% accuracy, 94.12% precision, 93.97% recall, and a 94.00% F1 score. However, its 85.814 M parameters and 11.286 GFLOPs are considerably larger than DuSAFNet’s footprint, and the performance improvement it provides is not substantial in comparison to smaller models. This suggests that larger architectures may not necessarily yield expected improvements in specialized tasks such as bird audio classification, where specialized models focusing on task-specific features tend to be more effective. Additionally, the extensive computational demands of ViT make it less efficient for deployment in resource-constrained environments.
In comparison, ResNet-50 and MobileNetV2 achieve 95.96% and 96.37% accuracy, respectively, demonstrating strong performance among convolutional networks. However, their computational costs differ significantly: ResNet-50 uses 23.545 M parameters and 4.132 GFLOPs, whereas MobileNetV2 requires only 2.247 M parameters and 0.326 GFLOPs. Despite MobileNetV2 being more lightweight than ResNet-50, DuSAFNet surpasses both models in accuracy and F1 score while maintaining far fewer parameters and lower GFLOPs than ResNet-50. This highlights DuSAFNet’s superior efficiency and effectiveness, enabling higher performance while keeping computational overhead low. The relatively low computational cost of DuSAFNet makes it particularly well-suited for deployment in real-time systems with strict resource limitations.
While the 1D-CRNN and LSTM models achieve 88.82% and 84.01% accuracy, respectively, their performance lags behind when classifying bird audio from multi-channel spectrograms, even though they are designed to model sequential data. The gap can be attributed to their inherent limitations in capturing the spatial dependencies of spectrograms, which are crucial in bird audio classification and which DuSAFNet addresses through convolutional multi-path feature extraction and fusion, capturing both spatial and temporal information. DuSAFNet improves upon the 1D-CRNN by 8.06% in accuracy and surpasses the LSTM by 12.87%, demonstrating its superior ability to handle complex data representations for bird audio classification.
Among the newer models, InceptionNeXt achieves 95.02% accuracy, 94.90% precision, 94.98% recall, and a 94.92% F1 score. It uses 25.789 M parameters and incurs a computational cost of 4.189 GFLOPs. While it performs admirably, DuSAFNet surpasses InceptionNeXt in all four metrics, achieving an accuracy of 96.88% and an F1 score of 96.83% while maintaining only 6.770 M parameters and 2.275 GFLOPs. This demonstrates DuSAFNet’s superior efficiency in achieving high performance with significantly fewer computational resources, thus making it a more practical solution for real-world applications that require both high accuracy and computational efficiency.
Similarly, ConvNeXt achieves 96.37% accuracy and a 96.27% F1 score, using 27.831 M parameters and requiring 4.454 GFLOPs. DuSAFNet outperforms ConvNeXt with a slight but meaningful margin of 0.51% in accuracy and 0.56% in F1 score, while utilizing significantly fewer parameters and computational resources. This shows that DuSAFNet is not only more efficient but also more effective at capturing the features relevant to the bird audio classification task. The balance between performance and computational complexity is key to DuSAFNet’s competitive advantage.
MnasNet, which achieves 90.46% accuracy and a 90.29% F1 score, uses only 3.125 M parameters and 0.328 GFLOPs. While MnasNet is highly efficient, DuSAFNet outperforms it by a significant margin of 6.42% in accuracy and 6.54% in F1 score. This highlights DuSAFNet’s superior performance, even when compared with lightweight models designed for efficiency. The additional performance gain demonstrates that DuSAFNet’s multi-path feature fusion and advanced attention mechanisms contribute significantly to its higher classification accuracy.
In conclusion, DuSAFNet outperforms all the comparison models, including ViT, VGG16, 1D-CRNN, ResNet-50, MobileNetV2, LSTM, InceptionNeXt, ConvNeXt, and MnasNet, with 96.88% accuracy, 96.85% precision, 96.89% recall, and a 96.83% F1 score. DuSAFNet also demonstrates substantial advantages in terms of computational efficiency, with a relatively modest 6.770 M parameters and 2.275 GFLOPs, making it well-suited for real-world deployment. Specifically, DuSAFNet outperforms ViT by 2.77%, VGG16 by 32.61%, 1D-CRNN by 8.06%, ResNet-50 by 0.92%, and MobileNetV2 by 0.51%, validating its robust effectiveness in multi-class bird audio classification tasks.
Statistical Significance Analysis
To ensure that the reported improvements are not due to random variation, we performed each comparison experiment five times with different random seeds. For each model, we report the mean and 95% confidence interval (CI) for accuracy and F1 score, computed via the percentile bootstrap method with 10,000 resamples. Furthermore, we applied a paired bootstrap test to the metric differences between DuSAFNet and each baseline to estimate p-values.
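The bootstrap procedure described above can be sketched in NumPy as follows. The per-seed accuracy values below are hypothetical placeholders, not our actual measurements, and the function names are illustrative.

```python
import numpy as np

def percentile_ci(scores, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean of a small set of run scores."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    idx = rng.integers(0, len(scores), size=(n_boot, len(scores)))
    means = scores[idx].mean(axis=1)  # mean of each bootstrap resample
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), lo, hi

def paired_bootstrap_p(scores_a, scores_b, n_boot=10_000, seed=0):
    """Two-sided paired bootstrap test on per-run metric differences."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    idx = rng.integers(0, len(diffs), size=(n_boot, len(diffs)))
    boot_means = diffs[idx].mean(axis=1)
    # Fraction of resampled mean differences on the "wrong" side of zero.
    p_one_sided = min((boot_means <= 0).mean(), (boot_means >= 0).mean())
    return min(1.0, 2 * p_one_sided)

# Hypothetical per-seed accuracies for DuSAFNet vs. a baseline (5 runs each).
dusafnet = [96.91, 96.85, 96.88, 96.90, 96.84]
baseline = [95.92, 96.01, 95.95, 96.00, 95.93]
mean_acc, ci_lo, ci_hi = percentile_ci(dusafnet)
p_val = paired_bootstrap_p(dusafnet, baseline)
```

With only five runs per model the bootstrap distribution is coarse, which is why pairing the runs by seed (so each difference cancels shared sources of variation) is important for a meaningful p-value.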
Table 8 summarizes these results. All observed gains of DuSAFNet over ResNet-50 and ConvNeXt are statistically significant.
3.5. Comparative Analysis with Transformer-Based Architectures
This research further benchmarked DuSAFNet against three advanced and representative transformer-based models: a Squeezeformer [57] encoder combined with ResNet-50 [42], the Fast Audio Spectrogram Transformer (FAST) [58], and the pretrained Vision Transformer (ViT) [30]. The Squeezeformer+ResNet50 model comprises 139.53 M parameters and requires 9.24 GFLOPs per inference, achieving 96.45% accuracy, 96.36% precision, 96.42% recall, and a 96.36% F1 score. In contrast, FAST is extremely lightweight, with only 2.42 M parameters and 2.30 GFLOPs, yielding 95.86% accuracy, 95.75% precision, 95.82% recall, and a 95.77% F1 score. ViT (pretrained) achieves 94.11% accuracy, 94.12% precision, 93.97% recall, and a 94.00% F1 score, with a footprint of 85.814 M parameters and 11.286 GFLOPs. DuSAFNet, by comparison, achieves 96.88% accuracy and a 96.83% F1 score with only 6.77 M parameters and 2.275 GFLOPs, demonstrating a superior balance of classification performance, computational efficiency, and model compactness (see Table 9). These results underscore DuSAFNet’s novelty in delivering state-of-the-art accuracy with significantly lower complexity than large transformer models, while outperforming lightweight variants on both accuracy and parameter efficiency.
3.6. Generalization Experiments
To further evaluate model robustness in cross-dataset scenarios, we selected the Birdsdata dataset as a second external test set. This dataset comprises recordings of 20 bird species with a total of 14,311 labeled samples. For cross-dataset validation, DuSAFNet was re-trained on the Birdsdata dataset using the same preprocessing pipeline (mono 16 kHz, 3 s segments) and training hyperparameters as detailed in Section 3.1.3. The model was trained for an identical number of epochs, and the checkpoint achieving the best validation accuracy was used for final evaluation on the Birdsdata test set. Consequently, no zero-shot evaluation without fine-tuning was performed; the reported results reflect supervised training on Birdsdata.
In this experiment, we compared ResNet-34 [42], ResNet-50 [42], ViT [30], BirdNet [37], AMResNet [33], and DuSAFNet. To replicate the best prior results, we followed the data preprocessing and training protocols described in [33]. Table 10 lists each model’s final evaluation metrics on this dataset. The distribution of Mel spectrograms in the processed dataset is shown in Figure 8.
Since the protocols for AMResNet are not publicly disclosed, we replicated its results based on the description provided in its paper [33]. As Table 10 shows, DuSAFNet achieved 93.74% accuracy on this public dataset—significantly higher than ResNet-34 (89.50%), ResNet-50 (86.60%), and ViT (82.80%). Compared with AMResNet’s 92.60%, DuSAFNet improved by 1.14%, and it trailed BirdNet’s 93.84% by only 0.10%. DuSAFNet thus surpasses AMResNet and remains competitive with BirdNet, which highlights its robustness and effectiveness.
Crucially, DuSAFNet’s parameter count is only 6.77 M, far below ResNet-50’s 23.545 M and ViT’s 85.814 M, achieving high performance while substantially reducing model complexity. This further confirms DuSAFNet’s strong generalization ability across different datasets and validates the efficacy of multi-path fusion and multi-band ArcMarginProduct in cross-dataset scenarios. DuSAFNet achieves competitive accuracy while maintaining low computational overhead, making it particularly suitable for real-time applications where computational resources are often limited.
These results highlight DuSAFNet’s competitive edge in handling multi-class bird audio classification tasks with limited parameters and computational resources. Despite its lightweight architecture, DuSAFNet consistently outperforms other models, proving its effectiveness in both generalization and efficiency.
3.7. Visualization of Learned Features and Classification Performance
To gain a deeper understanding of the discriminative features learned by DuSAFNet in the bird audio classification task and its overall performance, this section presents visual analyses from three perspectives: (i) detailed per-class metrics, radar charts, and confusion matrix; (ii) average ROC curve; and (iii) t-SNE feature distribution.
3.7.1. Per-Class Performance Visualization
In this section, we conduct a comprehensive analysis of DuSAFNet’s per-class discrimination performance in the 18-bird-species classification task, leveraging the confusion matrix on the test set. Additionally, we present a detailed evaluation of the model’s performance across each class in terms of precision, recall, and F1 score, utilizing radar charts and quantitative tables. By examining the rows and columns of the confusion matrix, we can identify species that are prone to misclassification and uncover potential causes for these errors. The radar chart enables us to visually assess how most species’ performances are concentrated in the high-range area, while a few species exhibit some variation. Finally, we provide a table listing precision, recall, F1 score, and sample count for each class, which serves as a guide for future improvements in the model.
The test set consists of 5296 samples, and the confusion matrix (Figure 9) has rows representing the true labels and columns representing the predicted labels, with a total dimension of 18 × 18. Most of the values along the diagonal are concentrated between 295 and 322, indicating that over 95% of samples (approximately 300 per species) were correctly classified.
For example, Canada Goose (row 1, column 1) correctly identified 298 samples, Common Blackbird (row 2, column 2) identified 297 samples, and Common Cuckoo (row 3, column 3) identified 297 samples. Other species showed similar distributions. The diagonal blocks are predominantly dark blue, and the scale of the color bar has a maximum value of 320, ensuring that the maximum correct count (European Herring Gull, 311 samples) does not exceed the visualization range, thus avoiding saturation distortion.
Most off-diagonal elements are zero or close to white, with misclassification counts in the low single digits. For example, European Nightjar (row 9) has five samples misclassified as Common Kestrel (column 4) and two samples misclassified as Canada Goose (column 1); Mallard (row 12) has four samples misclassified as Dunlin (column 6) and two samples misclassified as Eurasian Woodcock (column 7); European Robin (row 10) has three samples misclassified as Common Blackbird (column 2); and Canada Goose (row 1) has three samples misclassified as Whooper Swan (column 18).
Although these misclassifications primarily occur between species with similar acoustic features, at a global level, the total misclassification count in any row or column remains below 5% of the total samples for that species. This is consistent with the overall model performance of F1 ≈ 96.8%, demonstrating DuSAFNet’s strong stability in distinguishing bird call features.
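The per-row check described above (misclassifications below 5% of each species’ samples) can be sketched directly from a confusion matrix. The 3 × 3 matrix below is a toy stand-in that mirrors the pattern in Figure 9, a dominant diagonal of roughly 300 per class with sparse off-diagonal confusions; it is not the actual 18 × 18 matrix.

```python
import numpy as np

def per_class_rates(cm):
    """Per-class recall and misclassification rate from a confusion
    matrix whose rows are true labels and columns are predictions."""
    cm = np.asarray(cm, dtype=float)
    row_totals = cm.sum(axis=1)          # samples per true class
    recall = np.diag(cm) / row_totals    # fraction correctly classified
    miss_rate = 1.0 - recall             # fraction misclassified per row
    return recall, miss_rate

# Toy 3-class matrix: dominant diagonal, sparse off-diagonal confusions.
cm = np.array([
    [298, 0, 3],    # e.g. 3 samples of class 0 confused with class 2
    [1, 297, 2],
    [0, 3, 297],
])
recall, miss = per_class_rates(cm)
```

Asserting that every entry of `miss` stays below 0.05 is exactly the "below 5% per row" criterion used in the analysis above.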
Ecologically, the one-way confusion between Canada Goose and Whooper Swan likely arises from their co-occurrence at temperate wetland stopover sites, where both species forage and rest together during migration. The shared low-frequency background noise of rippling water and wind can obscure subtle call distinctions, leading to asymmetric misclassification. Similarly, Mallard and Dunlin frequently occupy intertidal mudflats, producing calls against a backdrop of flowing water and wave action; this shared acoustic niche explains their near-symmetric confusion rates. Recognizing these habitat-driven overlaps can guide field recording protocols—such as deploying directional microphones or scheduling surveys during peak calling periods—to reduce misidentification in ecological monitoring.
From the asymmetry in rows and columns, we can observe a one-way shift in the misclassification of Canada Goose → Whooper Swan (3 misclassifications) and Whooper Swan → Canada Goose (0 misclassifications). This suggests an asymmetry in the “inclusiveness” and “coverage” of goose vocalizations. The frequency and amplitude variation in Canada Goose samples are more diverse, and these features are more likely to be “covered” by the feature subset of Whooper Swan, leading to fewer reverse misclassifications. Furthermore, Mallard and Dunlin exhibit near-symmetric distributions in the confusion matrix (Mallard → Dunlin: 4, Dunlin → Mallard: 2), as these species often share similar environmental noise in natural recordings (e.g., lake water flow, wind sounds), leading to confusion in low-frequency rhythmic features.
Moreover, the analysis of misclassification patterns reveals a clear “semantic neighbor misclassification” phenomenon. Species with similar vocal characteristics, such as Mallard and Dunlin, are more likely to be misclassified as each other. This indicates that improving feature representation for such species could be beneficial.
Additionally, DuSAFNet consistently achieves high performance across most species, with only minor fluctuations observed in Common Kestrel and Eurasian Woodcock. These species, known for their irregular, short vocalizations, are indeed more challenging in bird audio classification tasks. However, DuSAFNet demonstrates strong adaptability to a variety of vocalization types, excelling in distinguishing between most species despite the inherent complexity in such irregular sounds.
The model’s robust performance across diverse bird species suggests that its architecture, which emphasizes multi-path feature extraction and multi-band ArcMarginProduct, significantly improves classification performance. Specifically, DuSAFNet’s ability to capture both local and global features, aided by the fusion of different frequency bands, enhances its capacity to handle a wide range of acoustic patterns. Unlike other models, which may struggle with high-variance calls, DuSAFNet can effectively identify the critical discriminative features by leveraging its unique multi-path architecture and frequency-specific attention mechanism.
These results underscore DuSAFNet’s strength in modeling complex audio features, particularly in terms of its ability to handle the variability inherent in bird vocalizations. The effective integration of multi-scale features and multi-band attention mechanisms gives DuSAFNet a distinctive edge, making it well-suited for real-world tasks that require not only high accuracy but also generalization across diverse datasets and sound characteristics.
In conclusion, the visualization of per-class performance through confusion matrices, radar charts, and detailed tables offers valuable insights into DuSAFNet’s strengths and weaknesses. The model excels at distinguishing between most species, but its performance on species with irregular vocalizations, such as Common Kestrel and Eurasian Woodcock, can be further improved using targeted strategies. The analysis highlights the potential benefits of improving feature representations for species exhibiting similar vocal characteristics, paving the way for future refinements in data augmentation and model training.
The tightly clustered high metrics for most species reflect their stereotyped and repetitive call patterns—such as the melodious trills of Common Blackbird and the rapid “chip-chip” sequences of House Sparrow—which facilitate robust feature learning. In contrast, Common Kestrel emits brief, crepuscular hunting calls often masked by insect rustle and wind in open fields, while Eurasian Woodcock produces soft nocturnal “peenting” calls within dense forest undergrowth. These ecological behaviors underlie their slightly lower precision and recall, indicating that targeted noise reduction or time-specific sampling strategies could improve classification for crepuscular and nocturnal species.
After analyzing the confusion matrix, we present the overall performance of the 18 species in terms of precision, recall, and F1 score using the radar chart shown in Figure 10. The radial axes of the radar chart correspond to the bird species in the dataset, with blue solid lines representing precision, orange dashed lines representing recall, and green dotted lines representing F1 score. The majority of species have their three lines clustered closely in the range of 0.94 to 0.99, forming a near-circular shape, indicating highly balanced and near-optimal performance across all classes. For instance, Common Blackbird’s precision of 98.34%, recall of 99.66%, and F1 score of 99.00% nearly align with the top of the radar chart. Other species like House Sparrow, Whooper Swan, and Plaintive Cuckoo also exceed 96% across all three metrics, positioned just outside the inner circle. Only Common Kestrel (precision 92.81%, recall 93.82%, F1 score 93.31%) and Eurasian Woodcock (precision 93.69%, recall 93.27%, F1 score 93.48%) show slight dips in the radar chart, yet still maintain values well above 90%, suggesting a minor drop in classification performance for these minority classes but remaining within an acceptable range. The compact shape of the radar chart further verifies the conclusion from the confusion matrix: “the majority of species were correctly recognized with only a small number of misclassifications”.
For easier reference and further optimization, Table 11 lists the precision, recall, F1 score, and sample count for each species in this test set. This table not only presents the quantitative values of each species across the three key metrics but also offers clear guidance for prioritizing poorly performing categories during subsequent data collection and model optimization.
In conclusion, through the sparse misclassification pattern and row–column asymmetry analysis in the confusion matrix, the high-density metric distribution in the radar chart, and the detailed listing of precision, recall, and F1 scores for each class in Table 11, DuSAFNet demonstrates high precision and balanced classification ability in the 18-class bird audio classification task. Misclassifications and performance drops in minority classes indicate potential directions for further system improvements, including focusing on specific frequency bands, increasing data augmentation for rare classes, and introducing contrastive loss or dynamic weight adjustments for easily confused categories. Overall, these visualizations and quantitative analyses collectively validate DuSAFNet’s high stability and robustness at the class level, providing a solid foundation for future model deployment and optimization in real-world environments.
3.7.2. Guild-Level Performance
Leveraging the full complement of 18 species (Table 12), we organized them into five ecological guilds—waterfowl, passerines, shorebirds, raptors, and other birds—and computed the mean F1 score for each on the test set. Waterfowl (Whooper Swan 97.41%, Canada Goose 96.75%, Mallard 94.46%) achieved a mean F1 of 96.21%, reflecting their loud, repetitive low-frequency calls in relatively stable wetland soundscapes. Passerines (Common Blackbird 99.00%, House Sparrow 97.24%, Spotted Flycatcher 96.84%, and European Robin 94.70%) followed at 96.95%, benefitting from stereotyped song structures. The “other birds” group—including Common Quail 97.38%, European Herring Gull 97.19%, Rock Dove 95.98%, Common Cuckoo 97.86%, Plaintive Cuckoo 97.24%, European Nightjar 97.44%, and White Stork 94.47%—attained 96.79%, indicating robust but varied acoustic characteristics. Shorebirds (Dunlin 97.37%, Eurasian Woodcock 93.48%) scored 95.43%, and raptors (Common Kestrel 93.31%, Peregrine Falcon 94.51%) trailed at 93.91%, consistent with their brief, crepuscular vocalizations being more prone to ambient noise masking. This comprehensive guild-level analysis underscores how call amplitude, temporal complexity, and habitat noise collectively shape classification performance across a broad taxonomic spectrum.
3.7.3. Mean ROC Curve
To comprehensively assess the classification performance of the model, it is necessary to go beyond fixed-threshold metrics such as precision, recall, and F1 score and adopt threshold-independent methods to evaluate robustness under varying decision thresholds. The ROC curve and its corresponding Area Under the Curve (AUC) offer an overall measure of a model’s ability to distinguish between classes.
Figure 11 presents the ROC curves computed separately for all 18 bird species on the test set (totaling 5296 samples) and summarizes them into two aggregated curves: the macro-average and the micro-average. The horizontal axis represents the false positive rate (FPR), defined as the proportion of negative samples incorrectly classified as positive, while the vertical axis represents the true positive rate (TPR), indicating the proportion of positive samples correctly identified. The diagonal line indicates the baseline AUC value of 0.5 under random guessing. The macro-average ROC curve is shown in a blue solid line, and the micro-average in an orange dashed line.
From a conservation monitoring standpoint, the ability to maintain a >99% true positive rate at a <1% false positive rate is critical during sensitive periods such as breeding or migration. For example, setting conservative detection thresholds can ensure near-perfect detection of Whooper Swan territorial calls in spring nesting grounds or early arrival signals of Canada Goose at wintering wetlands without inundating analysts with false alarms. This threshold-independent robustness enables managers to deploy real-time alert systems that prioritize minimal missed detections of focal species, thereby improving long-term population assessments and adaptive management.
Both curves rise steeply along the vertical axis and run close to the top edge all the way to the upper-right corner, indicating that the model maintains nearly 100% TPR even at extremely low FPRs.
Specifically, both the macro and micro AUCs reach 0.998. The macro-average is obtained by computing the “one-vs.-rest” ROC for each class and averaging the individual AUCs, reflecting balanced performance across classes. In contrast, the micro-average aggregates predictions from all classes into a unified binary classification and computes a single AUC. Their near-identical values suggest that there is no significant imbalance or underperformance in any particular class and that DuSAFNet learns equally discriminative features for all species.
The shape of the ROC curves further implies that when the FPR is below 1%, the TPR already exceeds 99%, meaning a conservative confidence threshold can be set in real-world applications to suppress false alarms while still achieving high recall. This property is particularly important for ecological monitoring or field deployments, where minimizing missed detections is prioritized, even at the cost of tolerating minimal false positives, in exchange for efficient recognition of all target bird calls.
The close alignment of macro and micro AUCs also suggests that the sample size and spectral differences among species do not introduce significant bias. In many multi-class bird classification tasks, classes with scarce samples or overlapping spectral traits tend to have much lower AUCs. However, this phenomenon does not appear here, indicating that DuSAFNet’s feature extractor and tri-band ArcMarginProduct module effectively capture and differentiate features across all bands, ensuring that no class is suppressed due to data sparsity or spectral similarity.
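The macro/micro distinction described above can be sketched in plain NumPy using the rank-sum (Mann-Whitney) identity for AUC. This is an illustrative implementation on a toy 3-class problem, not the evaluation code used in the paper, and it ignores tie corrections between positive and negative scores; standard libraries such as scikit-learn provide equivalent functionality.

```python
import numpy as np

def binary_auc(y_true, scores):
    """ROC AUC for one binary problem via the rank-sum (Mann-Whitney) identity."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = np.asarray(y_true) == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def macro_micro_auc(y_onehot, probs):
    """Macro AUC: mean of per-class one-vs-rest AUCs.
    Micro AUC: all (sample, class) pairs pooled into one binary problem."""
    per_class = [binary_auc(y_onehot[:, k], probs[:, k])
                 for k in range(y_onehot.shape[1])]
    micro = binary_auc(y_onehot.ravel(), probs.ravel())
    return float(np.mean(per_class)), float(micro)

# Toy 3-class example with perfectly separable scores.
y = np.array([0, 0, 1, 1, 2, 2])
onehot = np.eye(3)[y]
probs = np.where(onehot == 1, 0.8, 0.1)  # correct class always scores highest
macro_auc, micro_auc = macro_micro_auc(onehot, probs)
```

On this perfectly separable toy data both averages equal 1.0; on real predictions the macro average weights every class equally while the micro average weights every sample equally, which is why their near-identical values in Figure 11 indicate balanced per-class performance.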
In summary, the macro and micro ROC curves and their AUC values shown in Figure 11 provide a comprehensive, threshold-independent validation of DuSAFNet’s performance on bird audio classification. The curves closely follow the left and upper boundaries, and the AUC of 0.998 strongly confirms the model’s high separability for all species, both at the global and per-class levels.
3.7.4. t-SNE Feature Distribution
In this experiment, five representative bird species from the test set (Canada Goose, Common Blackbird, Common Cuckoo, Common Kestrel, and Common Quail) were selected. The 512-dimensional features extracted by the MultiscaleAttentionModule, before global average pooling, were projected into a two-dimensional space using t-distributed Stochastic Neighbor Embedding (t-SNE), a nonlinear dimensionality reduction method that preserves local neighborhood similarities, for visualization.
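A minimal sketch of the projection step follows, using scikit-learn’s t-SNE on synthetic Gaussian clusters that stand in for the 512-dimensional features (the cluster parameters and sample counts are illustrative, not the actual extracted features):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
n_per_class, n_classes, dim = 20, 5, 512  # mirrors the 512-D feature vectors

# Hypothetical stand-in features: one Gaussian cluster per species.
features = np.vstack([
    rng.normal(loc=3.0 * k, scale=0.5, size=(n_per_class, dim))
    for k in range(n_classes)
])
labels = np.repeat(np.arange(n_classes), n_per_class)  # would color the scatter

# perplexity must be smaller than the number of samples.
embedding = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(features)
```

Because t-SNE preserves local neighborhood similarities rather than global distances, between-cluster spacing in the 2D plot should be interpreted qualitatively; only the compactness and separation of clusters carry meaning.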
As shown in Figure 12, the samples of each class form five compact and mutually separated clusters in the 2D plane. Canada Goose samples are mainly distributed in the right arc-shaped area, exhibiting coherence while maintaining aggregation, indicating low intra-class variation and clear separation from other species. The point cloud of Common Blackbird is distributed nearly vertically along the y-axis, showing extreme compactness and small intra-class variance. Common Cuckoo forms a diagonal band from the lower left to the upper right, partially overlapping with Canada Goose in coordinate space, but with minimal actual overlap, suggesting local spectral similarity yet global distinguishability.
The distinct clusters correspond closely to ecological guilds: waterfowl (Canada Goose, Whooper Swan, Mallard) form low-frequency “arc” clusters reflecting open-water calls, forest passerines (Common Blackbird, European Robin, Spotted Flycatcher) occupy tight vertical clusters tied to tonal song patterns, raptors (Common Kestrel, Peregrine Falcon) appear as narrow elongated strips capturing swift predatory calls, and shorebirds (Dunlin, Eurasian Woodcock) span intermediate regions associated with mudflat or undergrowth peents. These mappings validate that DuSAFNet’s feature space preserves ecological and behavioral distinctions, supporting its utility for guild-specific monitoring in diverse habitats.
Common Kestrel samples form a narrow, vertically elongated strip at the bottom, with no intersection with other clusters, indicating a high concentration of its feature vectors along specific dimensions. Common Quail spans a wider range on the x-axis, reflecting greater internal diversity, but still maintains clear separation from other species.
Overall, the five clusters show little to no overlap in the 2D projection. Only a few edge samples reside near cluster boundaries, but these marginal points account for a very small proportion and do not affect overall separability. These visualizations demonstrate that DuSAFNet maps different species into well-isolated subspaces in the high-dimensional feature space—preserving intra-cluster consistency while emphasizing inter-cluster distinctiveness—thus providing robust support for accurate acoustic discrimination in real-world deployments.
4. Discussion
This study introduces DuSAFNet, an innovative model specifically designed for bird audio classification, which integrates multi-path feature extraction with a multi-band ArcMarginProduct attention mechanism. The experimental findings presented herein demonstrate that DuSAFNet surpasses all the models it was compared against, including ViT, VGG16, 1D-CRNN, ResNet-50, MobileNetV2, LSTM, InceptionNeXt, ConvNeXt, and MnasNet, showing a significant enhancement in accuracy, precision, recall, and F1 score. Notably, DuSAFNet achieves an accuracy of 96.88%, a precision of 96.85%, a recall of 96.89%, and an F1 score of 96.83%, thereby establishing a new standard within the field.
A principal strength of DuSAFNet lies in its superior performance across a wide range of evaluation metrics when compared with existing models. For instance, DuSAFNet enhances accuracy over ViT by 2.77%, VGG16 by 32.61%, 1D-CRNN by 8.06%, ResNet-50 by 0.92%, MobileNetV2 by 0.51%, InceptionNeXt by 1.86%, ConvNeXt by 0.51%, and MnasNet by 6.42%, all while maintaining a substantially lower parameter count (6.770 M) and computational cost (2.275 GFLOPs). The relatively modest parameter count and reduced computational burden make DuSAFNet particularly advantageous for real-world applications, where both performance and efficiency are paramount.
These findings underscore the efficacy of multi-path feature extraction and multi-band ArcMarginProduct in enhancing the model’s capacity to capture discriminative features from bird vocalizations. In contrast to traditional models that often struggle to account for the high variability in acoustic patterns, DuSAFNet excels in extracting both local and global features, while its attention mechanism facilitates focused learning from critical frequency bands.
Further validation of DuSAFNet’s robustness was carried out through a generalization experiment utilizing the Birdsdata dataset, which encompasses recordings from 20 bird species. DuSAFNet demonstrated an accuracy of 93.74% on this external test set, outperforming ResNet-34, ResNet-50, ViT, and AMResNet, while achieving comparable performance to BirdNet. Although DuSAFNet trails BirdNet by a narrow 0.10%, its much lower parameter count (6.770 M) offers a significant advantage, rendering DuSAFNet more efficient for deployment in resource-limited environments. This cross-dataset evaluation further highlights the model’s robustness, illustrating its ability to generalize effectively across diverse datasets.
The multi-path fusion methodology employed by DuSAFNet proves to be instrumental in reducing model complexity while maintaining high levels of accuracy. By leveraging the multi-band ArcMarginProduct, the model enhances its ability to capture diverse acoustic patterns, thereby generalizing effectively across datasets with varying recording conditions.
To gain a deeper understanding of DuSAFNet’s strengths and limitations, we conducted a comprehensive per-class analysis using confusion matrices, radar charts, and t-SNE visualizations. The confusion matrix (Figure 9) shows that DuSAFNet correctly classifies over 95% of samples for most species, with only minor misclassifications between acoustically similar pairs such as Mallard and Dunlin. Radar charts (Figure 10) confirm consistently high precision and recall across 18 species, with only Common Kestrel and Eurasian Woodcock dipping slightly, reflecting their brief, crepuscular calls that are easily masked by ambient noise. The t-SNE projection (Figure 12) further demonstrates that DuSAFNet embeds each species into well-separated clusters, preserving intra-cluster cohesion and maximizing inter-cluster separation even under complex, overlapping soundscapes.
Beyond technical performance, DuSAFNet offers powerful applications for field ornithology and biodiversity monitoring. Its high-precision detection of discrete call types, such as territorial Whooper Swan honks or the dawn-chorus trills of the Common Blackbird, opens the door to automated phenological monitoring of migration timing and breeding onset. For example, applying DuSAFNet to continuous wetland recordings can generate fine-grained arrival and departure curves for stopover management. Likewise, passive monitoring of the nocturnal “peenting” of the Eurasian Woodcock via undergrowth-deployed microphones can reveal habitat-use patterns and sensitivity to disturbance, informing adaptive management of critical breeding and foraging sites.
While not yet implemented in our current pipeline, DuSAFNet’s species-specific detections could be integrated with established acoustic-ecological indices, such as the Acoustic Complexity Index (ACI) and the Normalized Difference Soundscape Index (NDSI), to produce a multidimensional assessment of ecosystem health. For example, future work might correlate call-rate time series from DuSAFNet with ACI trends to disentangle genuine biodiversity declines from shifts in abiotic noise. Moreover, feeding DuSAFNet’s outputs into occupancy or abundance models has the potential to yield robust estimates of population change across seasons and landscapes, thereby informing strategic conservation planning and IUCN Red List assessments.
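Such an integration is straightforward because the ACI can be computed from the same magnitude spectrograms the classifier consumes. The following minimal sketch implements the standard single-clump formulation (per-bin frame-to-frame variability normalized by total bin intensity, summed over bins); temporal clumping and band weighting, which field studies typically add, are omitted:

```python
import numpy as np

def acoustic_complexity_index(spec, eps=1e-12):
    """Acoustic Complexity Index over one temporal clump.

    spec: nonnegative magnitude spectrogram, shape (freq_bins, time_frames).
    For each frequency bin, sum the absolute frame-to-frame intensity
    differences and normalize by that bin's total intensity; the index
    is the sum of these ratios over all bins.
    """
    diffs = np.abs(np.diff(spec, axis=1)).sum(axis=1)   # per-bin variability
    totals = spec.sum(axis=1)                            # per-bin total energy
    return float((diffs / np.maximum(totals, eps)).sum())
```

A steady broadband hum scores zero while modulated biophony scores high, which is why correlating per-recording ACI with DuSAFNet call rates could help separate biotic activity changes from abiotic noise shifts.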
Looking ahead, DuSAFNet could be embedded within real-time monitoring platforms. A lightweight, quantized variant running on edge devices (e.g., bioacoustic loggers) might trigger automated alerts upon detecting rare or threatened species, supporting rapid anti-poaching or habitat-disturbance responses. Deploying directional microphone arrays at breeding sites, combined with DuSAFNet’s spectral–temporal attention, may reduce false positives and focus sampling during critical calling windows (dawn chorus, crepuscular hours). A user-friendly dashboard could visualize spatio-temporal heatmaps of call activity, empowering conservation practitioners without specialized training.
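The quantized edge variant described above is not part of the current pipeline. As a hedged illustration of the idea, the simplest scheme an edge runtime might apply is symmetric per-tensor int8 post-training weight quantization, which stores each weight tensor as int8 codes plus one float scale:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: w ≈ q * scale."""
    scale = max(float(np.abs(w).max()) / 127.0, 1e-12)  # one step of the grid
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor from int8 codes."""
    return q.astype(np.float32) * scale
```

For a 6.77 M-parameter model this alone cuts weight storage roughly 4x versus float32; real deployments would add activation quantization and calibration, or pruning and distillation as noted above.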
While this study focuses on single-source recordings, real-world field deployments often involve overlapping bird calls. In future work, we plan to integrate blind signal separation techniques, such as independent component analysis (ICA) [59] and non-negative matrix factorization (NMF) [60], to disentangle simultaneous vocalizations. Furthermore, combining these methods with microphone-array spatial filtering and beamforming will enable more robust separation in multi-source environments, enhancing DuSAFNet’s applicability to complex acoustic monitoring scenarios.
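As a sketch of the NMF part of this plan (textbook Lee–Seung multiplicative updates, not a tuned implementation; component count and iteration budget here are illustrative), a magnitude spectrogram can be factored into spectral templates and their activations, after which per-source spectrograms are rebuilt from subsets of components:

```python
import numpy as np

def nmf(V, k, iters=500, seed=0, eps=1e-9):
    """Factor a nonnegative spectrogram V (freq x time) as V ≈ W @ H.

    W: k spectral templates (freq x k); H: their time activations (k x time).
    Multiplicative updates minimize the Frobenius reconstruction error while
    keeping both factors nonnegative.
    """
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, k)) + eps
    H = rng.random((k, T)) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ (H @ H.T) + eps)
    return W, H
```

Assigning components to individual callers (and phase reconstruction for audio resynthesis) is the hard part that the planned ICA and beamforming work would address.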
Despite its robust performance, DuSAFNet currently treats each 3 s segment in isolation and does not explicitly address overlapping choruses or multi-species mixing.
Although DuSAFNet demonstrates high classification accuracy under a variety of conditions, it remains sensitive to intense or non-stationary background noise (e.g., strong wind, water flow, traffic sounds). In particular, low-energy calls may be masked by overlapping noise, leading to reduced recall for certain species. Our two-threshold VAD preprocessing removes silence but retains environmental noise to improve generalization; however, extreme noise cases can still degrade performance. Future work will explore advanced denoising techniques (e.g., wavelet filtering, noise-robust spectral transforms), synthetic noise augmentation, and domain adaptation methods to further mitigate the impact of environmental noise on model predictions.
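Of the denoising directions mentioned, spectral subtraction is the simplest to sketch. The illustrative version below estimates a per-bin noise profile from frames presumed to contain only background noise and subtracts it with a spectral floor to limit musical-noise artifacts; the frame-selection strategy and floor fraction are assumptions, not part of our pipeline:

```python
import numpy as np

def spectral_subtract(mag, noise_frames, floor=0.05):
    """Subtract an estimated noise profile from a magnitude spectrogram.

    mag: (freq_bins, time_frames) magnitude spectrogram.
    noise_frames: column indices believed to contain only background noise.
    floor: fraction of the original magnitude kept as a spectral floor,
           preventing negative values and hard zeros.
    """
    noise_profile = mag[:, noise_frames].mean(axis=1, keepdims=True)
    cleaned = mag - noise_profile
    return np.maximum(cleaned, floor * mag)
```

Stationary hums are suppressed well by this scheme; the non-stationary wind and water noise highlighted above is precisely where it fails, motivating the wavelet and domain-adaptation alternatives.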
Future work should extend the framework to multi-label classification and source-separation architectures to handle dense acoustic environments. Incorporating habitat metadata (e.g., vegetation structure, water depth) and meteorological covariates could further reduce context-driven errors. Critically, close collaboration with field ornithologists will be necessary to co-design sampling schemes that align DuSAFNet’s technical capabilities with on-the-ground conservation priorities—such as assessing population viability in fragmented landscapes.
5. Conclusions
In this study, we introduced DuSAFNet, a cutting-edge deep network for bird audio classification that synergizes multi-path feature extraction with a multi-band ArcMarginProduct attention mechanism. Extensive experiments, both in-domain and cross-dataset, demonstrate that DuSAFNet establishes a new benchmark in bioacoustic research, attaining 96.88% accuracy, 96.85% precision, 96.89% recall, and a 96.83% F1 score while requiring only 6.77 M parameters and 2.275 GFLOPs.
Beyond these technical feats, DuSAFNet delivers powerful ecological and conservation dividends. Its fine-grained, species-specific detections can be combined with acoustic indices (e.g., the Acoustic Complexity Index and the Normalized Difference Soundscape Index) and fed into occupancy or abundance models to disentangle genuine biodiversity shifts from background-noise fluctuations. This multidimensional framework yields robust spatiotemporal estimates of bird population dynamics, furnishing actionable metrics for habitat management, phenological monitoring, and IUCN Red List assessments.
Looking ahead, DuSAFNet’s versatile design readily extends to multi-label classification, overlapping choruses, and other taxa (e.g., insects, amphibians). Targeted data augmentation and contrastive learning can further refine its handling of species with irregular or brief vocalizations (such as Common Kestrel and Eurasian Woodcock). Integrating habitat metadata (vegetation type, water depth) and environmental covariates (weather, anthropogenic noise) will reduce context-driven errors. Deploying quantized, pruned, or distilled variants on edge devices and microphone arrays promises real-time alerts for rare or threatened species, empowering rapid anti-poaching and disturbance-response workflows.
In conclusion, DuSAFNet represents a significant leap in bioacoustic classification—marrying state-of-the-art performance, computational efficiency, and ecological utility. Its ongoing refinement and deep ecological integration will pave the way for truly holistic, real-world acoustic monitoring and conservation planning.