1. Introduction
Birds are among the most widely distributed animal groups, inhabiting almost all ecosystems, and their survival is an important indicator of global ecological health. However, the Global Red List of Birds indicates that 1445 bird species (12.95% of the total) are classified as Vulnerable (VU), Endangered (EN), or Critically Endangered (CR) [1]. Therefore, it is vital to protect birds. Birdsong is a primary means of communication among birds and is essential for interaction within and between populations. Different species exhibit distinct song patterns, and analyzing the subtle features of these patterns can yield information that cannot be obtained through images. In addition, recordings of birdsong are not affected by the field of view or lighting conditions [2]. Thus, identification of bird species based on song is one of the most important methods in ecological monitoring and a vital component of habitat monitoring and protection for endangered species [3]. However, due to the diversity of bird species and the complexity of birdsongs, the annotation of bird audio data is time-consuming. Therefore, making use of existing annotated data to achieve accurate birdsong recognition remains a challenge.
With the development of deep learning in the field of computer vision, novel exploratory mechanisms have been introduced to address the challenges of birdsong recognition. At present, birdsong classification technology mainly converts signals into spectral representations and carries out detailed analysis of the spectral features of birdsong [2]. Common spectral representations, such as Short-Time Fourier Transform (STFT) spectra, Mel spectrograms, and Mel Frequency Cepstrum Coefficients (MFCC), are used as inputs to convolutional neural network (CNN) models for birdsong classification. For example, Na et al. [4] combined features of Linear Prediction Cepstral Coefficients (LMs) and MFCC with a 3DCCNN-LSTM model to classify four bird species, obtaining an average accuracy of 97.9%. Xie et al. [5] combined Mel spectrograms, harmonic component spectrograms, and percussive component spectrograms with a CNN model, achieving an F1-score of 95.95% on a classification task of 43 bird species. Wang et al. [6] introduced an improved Transformer model for recognizing MFCC spectrograms of 15 bird species with an accuracy of 93.57%. Fu et al. [7] applied the wavelet transform to convert birdsongs into spectrograms, constructed a DR-ACGAN model, and incorporated a dynamic convolutional kernel into the classifier, achieving an accuracy of 97.60% with the VGG16 model. Kahl et al. [8] developed BirdNET, which used Mel spectrograms to successfully identify 984 bird species in North America and Europe, with an average accuracy of 0.791 in high-noise environments. These studies highlight the effectiveness of neural networks in the classification of birdsong spectrograms. However, most of them relied on supervised learning approaches and required a large amount of labeled training data. In addition, supervised learning methods are prone to generalization errors, spurious associations, and adversarial attacks [9].
With the advent of unsupervised learning [10,11,12,13], supervised information can now be obtained from unlabeled data. Self-supervised learning [14,15,16,17,18,19] is a subset of unsupervised learning, and contrastive learning [20,21,22,23,24,25] is a branch of self-supervised learning [9]. Contrastive learning is a discriminative method whose aim is to cluster similar samples more closely in the representation space while dispersing dissimilar samples. Through this process, the model learns more discriminative features, enabling it to better distinguish different samples and providing a solid foundation for solving downstream tasks. Existing contrastive learning methods include Simple Contrastive Learning Representation (SimCLR) [17], Momentum Contrast (MoCoV2) [26], Swapping Assignments between Views (SwAV) [19], Bootstrap Your Own Latent (BYOL) [14], and the Siamese Network for Contrastive Learning (SimSiam) [18]. The fine-tuned classification accuracy rates of these methods on ImageNet are shown in Figure 1: the accuracy of existing contrastive learning approaches is very close to that of supervised learning, indicating that these unsupervised feature extraction techniques are effective.
SimCLR is a classic contrastive learning algorithm with a simple structure, clever design, and high performance, and it has achieved remarkable success in computer vision. Li [27] used unlabeled data to train a feature extraction network with SimCLR and then added a linear classification layer for supervised training and fine-tuning with a small amount of labeled data, achieving remarkable results in intrusion detection for industrial control systems. Yang et al. [28] pre-trained SimCLR on unlabeled plant image data and fine-tuned the pre-trained model on labeled plant disease samples, obtaining an accuracy comparable to that of supervised learning. Shi et al. [29] combined SimCLR with the MetaFormer-2 model to classify snakes at a fine-grained level, with an accuracy of 83.8%. Sun et al. [30] pre-trained SimCLR on a dataset of chest digital radiography (DR) images and used it as the backbone of a fully convolutional one-stage (FCOS) object detection network to identify rib fractures in DR images. SimCLR leverages a large amount of unlabeled data to learn general data representations through data augmentation and transformations and then updates the network based on the InfoNCE loss (normalized temperature-scaled cross-entropy loss) [31,32]. The learned knowledge is then transferred to smaller models for specific downstream tasks. The fine-tuned downstream tasks in these studies achieved accuracy comparable to that of supervised learning, further demonstrating the powerful feature representation capability of SimCLR contrastive learning. However, this two-stage training process increases complexity, is time-consuming, requires additional resources, and poses challenges for optimization and debugging.
Zhang [33] further demonstrated that although self-supervised learning has been successful in several downstream tasks, its generality and applicability may not be fully realized if the model is fine-tuned using only labeled data from downstream tasks. In addition, existing supervised learning methods still require a significant amount of time [28]. Therefore, we address the challenge of utilizing the representation capability of self-supervised learning to reduce training time while achieving better accuracy. In this study, a Dual-Branch Supervised and Unsupervised Learning Network (DBS-NET) is proposed for birdsong classification; it employs a weight assignment strategy to balance the losses of the two branches and enhance the feature generalization ability of the backbone network. To verify the effectiveness of the DBS-NET model, experiments are carried out on a self-built 30-class dataset and the Birdsdata dataset. The main contributions of this paper are summarized as follows:
A Dual-Branch Network (DBS-NET) combining supervised and self-supervised learning is proposed for birdsong classification. Using a weight allocation strategy, unsupervised and supervised feature extraction are integrated into one framework for joint training, which effectively balances the contributions of the two branches.
An enhanced backbone based on an iterative dual-attention feature fusion module (iDAFF) is constructed, and an enhanced linear residual classifier is designed to further improve the classification capability of the model.
A class-imbalance weighted loss function is designed to calculate the weight of each category according to its frequency in the dataset. These weights are then used in the cross-entropy loss to ensure balanced training across categories.
3. Methods
This study proposes a Dual-Branch Network that combines supervised classification learning (SCL) and self-supervised contrastive learning (SSCL). The method leverages a multi-task learning framework to optimize the classification loss and the contrastive loss simultaneously. The strong discriminative features of supervised classification are integrated with the generalizable and robust features of SimCLR, which enhances the model's ability to deal with complex environments and improves classification accuracy. The overall structure of the method is illustrated in
Figure 3. The model inference process is shown in Algorithm 1.
Algorithm 1 DBS-NET inference algorithm
Input: Labeled audio segments
Output: Predicted class labels
1: Initialize the unsupervised-branch encoder f, the supervised-branch classification head, the projection head g, the number of epochs E, the batch size N, the loss weight coefficients λ1 and λ2, and the data augmentation functions
2: // Step 1: Data preprocessing: convert the audio segments to spectrograms using the wavelet transform (WT)
3: // Step 2: Training
4: for epoch = 1 to E do
5:   for each batch of N spectrograms do
6:     // 2.1: Data augmentation: generate two related views of each example for the unsupervised branch; augment the data for the supervised branch
7:     // 2.2: Encoder feature extraction
8:     Unsupervised branch: encode both views with f and map them into the projection space with g
9:     Supervised branch: encode the augmented batch and apply the classification head
10:    // 2.3: Compute the contrastive loss L_InfoNCE, the weighted cross-entropy loss L_CE, and the total loss L = λ1·L_CE + λ2·L_InfoNCE
11:    // 2.4: Optimize the model: update networks g and f to minimize L
12:  end for
13: end for
14: // Step 3: Predict the class labels with the supervised branch
3.1. Wavelet Transform
Existing studies usually convert birdsong signals into STFT spectrograms [34] or Mel spectrograms [35]. However, the fixed window function of the STFT means that time resolution and frequency resolution cannot be optimized simultaneously. In addition, Mel spectrograms are significantly affected by amplitude variations, which means that different intensities of the same sound can produce different features [36].
On the other hand, the wavelet transform has good time-frequency resolution, provides local information about the signal in both the time and frequency domains, and is better suited to non-stationary signals. Therefore, in this paper, the wavelet transform is used to convert birdsong signals into spectrograms. The wavelet transform is defined as follows:

$W(a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{+\infty} x(t)\, \psi^{*}\!\left(\frac{t - b}{a}\right) dt$  (1)

where a is the scale parameter, which controls the scale of the wavelet function; b is the time shift that controls the translation of the wavelet function; and $\psi(t)$ is the wavelet basis function ($\psi^{*}$ denotes its complex conjugate). The choice of wavelet basis function directly affects the time-frequency resolution and spectral characteristics of the wavelet transform. Therefore, the Morlet wavelet [37], which has a strong response to birdsong, is selected in this paper; its definition is shown in Formula (2):

$\psi(t) = \frac{1}{\sqrt{\pi f_b}} \exp(2\pi i f_c t) \exp\!\left(-\frac{t^2}{f_b}\right)$  (2)

where $f_c$ is the center frequency and $f_b$ is the bandwidth.
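To make the transform concrete, the following is a minimal NumPy sketch (not the paper's implementation) that computes a Morlet scalogram by correlating the signal with scaled wavelets; the function names and the default parameters fc and fb are illustrative choices, not values from the paper.

```python
import numpy as np

def morlet(t, fc=1.0, fb=1.0):
    # Complex Morlet wavelet: (1/sqrt(pi*fb)) * exp(2j*pi*fc*t) * exp(-t^2/fb)
    return (1.0 / np.sqrt(np.pi * fb)) * np.exp(2j * np.pi * fc * t) * np.exp(-t**2 / fb)

def cwt_scalogram(signal, scales, fs, fc=1.0, fb=1.0):
    # Discrete approximation of W(a, b): for each scale a, correlate the
    # signal with the conjugated, scaled wavelet (normalized by 1/sqrt(a)).
    n = len(signal)
    t = (np.arange(n) - n // 2) / fs          # time axis centered at zero
    out = np.empty((len(scales), n))
    for i, a in enumerate(scales):
        psi = morlet(t / a, fc, fb) / np.sqrt(a)
        # convolution with the time-reversed conjugate implements the CWT integral
        out[i] = np.abs(np.convolve(signal, np.conj(psi[::-1]), mode="same")) / fs
    return out

# A pure 50 Hz tone should concentrate its energy at the matching scale.
fs = 1000
tt = np.arange(fs) / fs
sig = np.sin(2 * np.pi * 50 * tt)
freqs = np.array([25.0, 50.0, 100.0])
scal = cwt_scalogram(sig, 1.0 / freqs, fs)    # scale a = fc / frequency with fc = 1
```

With fc = 1, the wavelet at scale a is centered at frequency 1/a Hz, so the scale list above probes 25, 50, and 100 Hz.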
3.2. Supervised Feature Representation
Supervised feature representation focuses on extracting discriminative features and learning class-specific representations by mining the label information in the data. Supervised classification learning (SCL) aims to make full use of labeled data to achieve effective feature representation. As shown in Figure 3, the supervised classification branch employs ResNet18 as the backbone network to encode input data into high-dimensional feature representations. ResNet18 has an 18-layer architecture that balances computational efficiency and representational power, and its residual connections improve gradient flow for stable training. Unlike deeper variants such as ResNet50 or ResNet101, it reduces both computational cost and overfitting risk. Its proven performance across diverse classification tasks further underscores its suitability as an encoder.
In the classification head, we design a linear residual structure, as shown in Figure 3. The formula is shown in (3):

$F_{out} = F_{in} + \mathrm{Linear}^{(n)}(F_{in})$  (3)

where $F_{out}$ is the final output feature, $F_{in}$ is the supervised encoder output feature, and n is the number of repetitions of the intermediate linear layer. The linear residual structure increases the number of linear layers in the classification head through skip connections, and the deepened nonlinear transformation enables the network to capture more complex feature patterns.
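As an illustration of this idea, here is a minimal NumPy sketch of such a head, assuming ReLU activations between the intermediate linear layers and a final linear layer producing class logits; the layer shapes and activation choice are assumptions, since the paper does not list them here.

```python
import numpy as np

def linear_residual_head(x, Ws, bs, W_out, b_out):
    # x:     (batch, d) encoder output features
    # Ws/bs: weights and biases of the n intermediate linear layers (assumed ReLU)
    # W_out/b_out: final linear layer producing class logits
    h = x
    for W, b in zip(Ws, bs):
        h = np.maximum(h @ W + b, 0.0)   # intermediate linear layer + ReLU
    h = h + x                            # skip connection back to the encoder output
    return h @ W_out + b_out             # class logits

# With zeroed intermediate weights, only the skip path contributes.
x = np.ones((2, 4))
out = linear_residual_head(x, [np.zeros((4, 4))], [np.zeros(4)], np.eye(4), np.zeros(4))
```

The skip connection guarantees that the head can always fall back to the identity mapping over the encoder features, which stabilizes training as the number of intermediate layers n grows.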
The spectrogram data in this study exhibit class imbalance, which causes the trained classifier to favor majority classes and ignore minority-class samples; this can distort the probability outputs of minority classes and thus affect the reliability of the classification results. To address this issue, a balanced cross-entropy loss function is proposed. First, the weights of the different categories are calculated based on their proportions in the dataset. Then, these weights are used to initialize the cross-entropy loss function.
The number of samples in category i is defined as $n_i$, and the weight $w_i$ of category i can be calculated by Formula (4):

$w_i = \frac{\sum_{j=1}^{C} n_j}{C \cdot n_i}$  (4)

where C is the total number of categories. The weighted cross-entropy loss function can then be defined as

$L_{CE} = -\sum_{i=1}^{C} w_i\, y_i \log(\hat{y}_i)$  (5)

where $y_i$ is the ground-truth label and $\hat{y}_i$ is the predicted probability for category i.
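A small NumPy sketch of this scheme follows; the inverse-frequency normalization used here is one common choice and may differ from the paper's exact Formula (4).

```python
import numpy as np

def class_weights(counts):
    # Inverse-frequency weights: rarer categories receive larger weights.
    # (One common scheme; the paper's exact normalization may differ.)
    counts = np.asarray(counts, dtype=float)
    return counts.sum() / (len(counts) * counts)

def weighted_cross_entropy(logits, labels, weights):
    # Numerically stable log-softmax, then a per-sample weight w_y on the
    # negative log-likelihood of the true class.
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    idx = np.arange(len(labels))
    return float((-weights[labels] * log_probs[idx, labels]).mean())

# 100 vs 10 samples: the minority class is weighted 10x the majority class.
w = class_weights([100, 10])
loss = weighted_cross_entropy(np.array([[0.0, 0.0]]), np.array([0]), np.array([1.0, 1.0]))
```

With uniform logits over two classes and unit weights, the loss reduces to log 2, which makes the helper easy to sanity-check.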
3.3. Self-Supervised Feature Representation
Self-supervised feature representation focuses on learning generalizable features from unlabeled data. The SimCLR contrastive learning framework is used to construct the self-supervised contrastive learning (SSCL) branch. The SSCL branch leverages data augmentation techniques to generate positive and negative sample pairs. As shown in Figure 4, positive pairs are obtained by applying different augmentation transformations to the same input sample, while negative pairs are formed by combining representations of different samples within the same batch. Each pair of augmented views (denoted $x_i$, $x_j$) is processed by the encoder $f(\cdot)$ to generate the corresponding feature representations $h_i$ and $h_j$, respectively. Subsequently, a projection head $g(\cdot)$, consisting of a fully connected layer and a nonlinear activation function, maps the feature representations into a latent projection space, producing the projected representations $z_i$ and $z_j$. The ultimate goal is to maximize the similarity of positive sample pairs while minimizing the similarity scores between negative sample pairs. The similarity scores are calculated using the InfoNCE loss, as shown in Formula (6):

$\ell_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)}$  (6)

where $\mathrm{sim}(\cdot,\cdot)$ represents the cosine similarity between two vectors and $\tau$ is the temperature scalar, a hyperparameter used to adjust the smoothness of the similarity distribution. This unsupervised framework utilizes a large amount of unlabeled data to learn general representations and applies them to smaller models for specific tasks.
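The standard SimCLR NT-Xent computation can be sketched in NumPy as follows; this is a reference implementation of the published loss, not the paper's code.

```python
import numpy as np

def nt_xent(z1, z2, tau=0.5):
    # z1, z2: (N, d) projected representations of the two augmented views.
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity via dot product
    sim = (z @ z.T) / tau
    n = len(z1)
    np.fill_diagonal(sim, -np.inf)                     # a view is never its own negative
    # the positive of view i is its augmented counterpart in the other half
    pos = np.concatenate([np.arange(n) + n, np.arange(n)])
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return float((logsumexp - sim[np.arange(2 * n), pos]).mean())

# Correctly aligned pairs should incur a lower loss than shuffled pairs.
z1 = np.array([[1.0, 0.0], [0.0, 1.0]])
aligned = nt_xent(z1, z1.copy())
mismatched = nt_xent(z1, z1[::-1].copy())
```

The denominator sums over all 2N − 1 other views in the batch, so every other sample in the batch acts as a negative without any explicit negative mining.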
3.4. Iteration Dual Attention Feature Fusion Module
In residual blocks, skip connections are usually used to fuse the original features X with the convolved output F(X). These connections provide an uninterrupted alternative path for the gradient during backpropagation, making the gradient flow smoother and mitigating the problem of vanishing gradients. However, traditional feature fusion methods, such as element-wise addition or concatenation, often lead to information redundancy and fail to effectively capture the contextual relationships between features.
To address this issue, we draw on the work of Dai et al. [38] and propose a Multi-Scale Channel and Spatial Attention Module (MS-CSAM), dual-attention feature fusion (DAFF), and iterative dual-attention feature fusion (iDAFF), as shown in Figure 5. In MS-CSAM, a multi-scale feature extraction mechanism combines maximum pooling, average pooling, and global average pooling to capture features at different scales. This is especially beneficial for tasks with significant differences in time-frequency features, such as birdsong classification, and can greatly improve the robustness and generalization ability of the model. In addition, MS-CSAM employs lightweight convolutions to reduce computational overhead while maintaining strong feature extraction capabilities, capturing fine-grained features and ensuring efficient model performance. The fusion process is described as follows.
We define two input features X and Y in a ResNet skip connection, where X represents the original feature and Y denotes the residual feature learned by the ResNet block. Based on the multi-scale channel attention module M, dual-attention feature fusion (DAFF) can be formulated as in Formula (7):

$Z = M(X \uplus Y) \otimes X + (1 - M(X \uplus Y)) \otimes Y$  (7)

In Formula (7), Z refers to the fused feature and ⊎ denotes the initial feature fusion, computed here as element-wise summation. DAFF is shown in Figure 5b, where the dashed line represents $1 - M(X \uplus Y)$. The fusion weight $M(X \uplus Y)$ consists of real numbers between 0 and 1, as does $1 - M(X \uplus Y)$, which allows the network to perform a soft selection between X and Y or a weighted average of them.

To form iterative dual-attention feature fusion (iDAFF), the initial feature fusion is itself replaced by a DAFF module. Hence, $X \uplus Y$ can be re-expressed as in Formula (8):

$X \uplus Y = M(X + Y) \otimes X + (1 - M(X + Y)) \otimes Y$  (8)

This iterative approach aims to optimize the quality of feature fusion through continuous refinement and adjustment, providing a more accurate and reliable data basis for subsequent analysis and application.
The traditional residual connections in the backbone network are replaced by the iDAFF mechanism. It combines local and global attention, and uses feature weighting for fusion. This approach improves the richness and expressiveness of feature representation, and maintains efficiency in terms of parameters and computational complexity. In addition, it significantly improves the performance of the network in handling complex tasks.
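To illustrate the fusion rule Z = M ⊗ X + (1 − M) ⊗ Y, here is a deliberately simplified NumPy sketch in which the learned multi-scale attention module is replaced by a sigmoid over a local term and a global-average-pooled term; the scalar weights w_local and w_global are hypothetical stand-ins for the learned convolutions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def daff(X, Y, w_local=1.0, w_global=1.0):
    # Simplified dual-attention fusion: Z = M * X + (1 - M) * Y, where the
    # fusion weight M in (0, 1) blends a local term (the summed features)
    # with a global term (their global average pooling over H and W).
    # w_local / w_global are hypothetical scalars standing in for learned convs.
    s = X + Y                                   # initial fusion by element-wise sum
    g = s.mean(axis=(-2, -1), keepdims=True)    # global average pooling branch
    M = sigmoid(w_local * s + w_global * g)     # attention weights in (0, 1)
    return M * X + (1.0 - M) * Y

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8, 8))   # (channels, H, W) original features
Y = rng.normal(size=(4, 8, 8))   # residual-branch features
Z = daff(X, Y)
```

Because M lies strictly between 0 and 1, every output element is a convex combination of the corresponding elements of X and Y, which is exactly the soft selection behavior described above.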
3.5. Dual-Branch Network
The proposed Dual-Branch Network combines the advantages of supervised and self-supervised learning for feature representation. The supervised branch focuses on learning discriminative and task-specific representations using labeled data, while the self-supervised branch leverages contrastive learning to extract robust and generalizable representations from unlabeled data. Both branches are jointly optimized under a unified framework to achieve complementary benefits, as shown in
Figure 6.
To integrate the contributions of both branches, the total loss is defined as a weighted combination of the supervised classification loss and the self-supervised contrastive loss:

$L_{total} = \lambda_1 L_{CE} + \lambda_2 L_{InfoNCE}$  (9)

where $\lambda_1$ and $\lambda_2$ are weight coefficients that control the contributions of the supervised and self-supervised losses, respectively, and the sum of the two weights is always 1. $L_{InfoNCE}$ represents the InfoNCE loss used in the self-supervised branch to maximize the similarity of positive pairs and minimize the similarity of negative pairs. $L_{CE}$ is the weighted cross-entropy loss calculated from the output of the supervised branch and the ground-truth labels.
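The weighted combination, with the two coefficients constrained to sum to 1, can be expressed as a one-line helper; lam_sup here is a hypothetical setting for the supervised weight, not a value reported by the paper.

```python
def total_loss(loss_ce, loss_nce, lam_sup=0.6):
    # L_total = lam_sup * L_CE + (1 - lam_sup) * L_InfoNCE; the two weights
    # always sum to 1, so a single coefficient parameterizes the trade-off.
    # lam_sup = 0.6 is an illustrative value, not the paper's setting.
    assert 0.0 <= lam_sup <= 1.0
    return lam_sup * loss_ce + (1.0 - lam_sup) * loss_nce
```

Parameterizing both weights by a single coefficient keeps the constraint satisfied automatically during any hyperparameter search over the branch balance.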