1. Introduction
Power transmission infrastructure constitutes the foundation of reliable electricity distribution across regions. In recent years, however, bird-related incidents on transmission lines have become increasingly frequent, encompassing fault types such as flashovers caused by bird droppings, short circuits due to bird contact, and contamination of equipment. These events severely compromise the safe and stable operation of transmission networks. Statistical analyses indicate that bird-related accidents account for approximately 15% to 35% of total transmission line faults in certain regions, resulting in considerable economic losses and supply disruptions, with this proportion continuing to rise [1, 2, 3]. Although recent studies have applied deep learning methods to detect bird-related hazards on transmission lines, including bird nest detection from UAV imagery [4], fine-grained species recognition, which is essential for differentiated prevention measures, remains a challenging open problem.
The severity and frequency of bird-induced faults are closely associated with species characteristics, temporal patterns, and geographical conditions. Large raptors whose wingspans exceed safety clearances pose direct fault risks, whereas smaller species may trigger cumulative faults through nesting, droppings accumulation, and roosting behaviors. Conventional mitigation strategies, including visual deterrent devices, physical barriers, and periodic manual inspections, suffer from limited effectiveness and high labor costs, and lack species-specific differentiated prevention measures. As the power grid continues to expand, there is a growing demand for precise prevention and control of bird-related faults. Advanced bird monitoring and early warning methods can provide avian population data to grid operation and maintenance personnel and offer predictive alerts for potential incidents, thereby furnishing essential support for differentiated fault prevention [5].
Beyond power grid applications, accurate bird species recognition holds broader scientific and ecological significance. Birds are well-established indicators of biodiversity and ecosystem health, with their long-term population trends widely used to assess habitat quality and global environmental change [6]. Automated species recognition further enables large-scale passive acoustic monitoring, supporting biodiversity conservation, ecological research, and citizen science initiatives [7]. The same underlying technical capability, namely accurate fine-grained bird species recognition under field conditions, therefore underpins both ecological monitoring and industry-specific applications such as bird-strike prevention on power transmission lines.
In the field of bird recognition, deep learning has substantially advanced both visual and acoustic identification approaches. On the visual side, fine-grained classification methods have been widely employed to distinguish closely related species with subtle inter-class differences: recent works based on attention mechanisms and Vision Transformers [8] have achieved competitive performance on public benchmarks such as NABirds and CUB; meanwhile, transfer learning approaches built on efficient convolutional networks like EfficientNet have demonstrated a favorable balance between accuracy and efficiency for bird species classification [9]. On the acoustic side, deep learning models such as BirdNET have enabled the recognition of nearly a thousand bird species from passive acoustic recordings under field conditions [7]. Nevertheless, single-modality recognition methods exhibit inherent limitations: image-based recognition struggles under adverse weather, nighttime conditions, and field-of-view blind spots, while audio-based recognition faces challenges from background noise and multi-source interference, and cannot provide spatial location information or identify silent individuals. Therefore, multimodal fusion recognition strategies that integrate complementary information offer a more accurate and practically valuable approach to bird species identification.
In summary, although considerable progress has been made in both image-based and acoustic-based bird recognition, several limitations remain. First, single-modality methods are inadequate for complex and variable field environments. Second, while multimodal fusion has matured into systematic paradigms (including early, feature-level, and decision-level fusion) in adjacent domains such as autonomous driving, medical imaging, and remote sensing [10], existing fusion methods in bird recognition still predominantly rely on fixed weights or feature concatenation [11, 12], failing to adaptively adjust according to the reliability of each modality’s prediction. Third, although recent learning-based fusion methods have introduced uncertainty estimation [13], they typically require additional trainable parameters and complex evidential modeling, and may suffer from weight degeneration toward a single modality. These limitations motivate the present work: to develop a parameter-free, sample-level adaptive fusion strategy guided by two complementary confidence indicators (information entropy and probability gap) for the typical fine-grained task of bird species recognition.
As illustrated in Figure 1, the central challenge in field bird species recognition is that the visual and acoustic modalities offer informative evidence under different conditions, but rarely both at once. In daytime open-view scenarios (Figure 1, Scenario A), the image modality provides distinctive visual cues, while the audio may be silent or noisy when the bird is at rest. Conversely, under dusk or partial occlusion (Figure 1, Scenario B), the image becomes degraded while a vocalizing bird still produces a distinctive acoustic signature. Existing fusion approaches that rely on fixed weighting or that require an additional trained fusion module cannot adapt to this sample-level variation in modality reliability. This work addresses the gap with a parameter-free, confidence–adaptive fusion strategy that judges, on each sample, which modality is currently more trustworthy, and lets the more reliable one lead the final prediction.
This paper proposes an audiovisual bird recognition method based on EfficientNet-B3 and ResNet-50, achieving accurate fine-grained species classification by leveraging the complementary strengths of visual and acoustic modalities. For the image branch, EfficientNet-B3 is adopted as the classifier; this model achieves an effective balance between computational efficiency and classification performance through compound scaling and the SE attention mechanism [14]. For the audio branch, ResNet-50 is employed to extract features from Mel spectrograms of bird vocalizations and perform classification [15]. To address the multimodal fusion problem, a confidence–adaptive fusion strategy is proposed that jointly considers information entropy and probability gap to dynamically compute fusion weights for each modality, enabling sample-level adaptive decision-making without additional trainable parameters. The experiments are conducted on the SSW60 multimodal bird recognition dataset [16], which contains image and audio data of 60 North American bird species and serves as a standard benchmark for audio–visual fine-grained classification research. The main contributions of this paper are as follows:
(1) An image classification branch based on EfficientNet-B3 is constructed, which fully exploits compound scaling and the SE attention mechanism to extract fine-grained features from bird images. On the SSW60 dataset, this branch achieves a Top-1 accuracy of 91.55%, outperforming conventional CNN models such as ResNet-50 and VGG-16.
(2) An audio classification branch based on ResNet-50 is designed, which converts bird vocalizations into Mel spectrograms and classifies them via the residual network, combined with a dense sampling inference strategy to fully utilize complete audio information. This branch achieves an accuracy of 68.20% on the audio classification task, outperforming AST [17] and VGG-16.
(3) A confidence–adaptive fusion strategy is proposed that jointly considers information entropy and probability gap to dynamically assess the reliability of each modality’s prediction. This strategy requires no trainable parameters and adaptively adjusts fusion weights based on the prediction confidence of each modality for the current sample, ultimately achieving a multimodal fusion accuracy of 95.30%, representing a 3.75 percentage-point improvement over the single-image modality.
The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 presents the proposed multimodal fusion recognition method, including the image classification branch, the audio classification branch, and the confidence–adaptive fusion strategy. Section 4 describes the experimental setup, evaluation metrics, and results analysis. Section 5 concludes this paper.
3. Proposed Multimodal Bird Recognition Method
To overcome the inherent limitations of single-modality methods in fine-grained bird species recognition, this paper proposes a confidence–adaptive audiovisual recognition method. The proposed method integrates complementary information from both visual and acoustic modalities and dynamically evaluates the reliability of each modality’s prediction via information entropy and probability gap, thereby enabling sample-level adaptive fusion.
3.1. Overview of the Multimodal Fusion Framework
The overall framework of the proposed method is illustrated in Figure 2. It consists of three components: an image classification branch, an audio classification branch, and a confidence–adaptive fusion module. The image branch employs EfficientNet-B3 to extract visual features and output class prediction probabilities. The audio branch utilizes ResNet-50 to extract acoustic features from Mel spectrograms of bird vocalizations and produce class prediction probabilities. The fusion module computes the confidence of each modality’s prediction based on information entropy and probability gap, dynamically assigns fusion weights accordingly, and produces an adaptive weighted fusion of the two modalities’ outputs.
Compared with existing audiovisual fusion methods, the proposed approach has the following characteristics: (1) the fusion module requires no trainable parameters, thereby avoiding the weight degeneration problem inherent in learning-based fusion; (2) fusion weights are computed dynamically at the sample level, adapting to the prediction confidence of each modality on the current sample and fully exploiting inter-modal complementarity; and (3) the information-theoretic fusion strategy possesses a clear theoretical foundation and good interpretability.
As illustrated in Figure 2, the end-to-end inference pipeline proceeds in six steps:
Step 1: Input. A paired sample, consisting of a bird image and a bird audio recording, is fed into the system.
Step 2: Preprocessing. The image is resized to 300 × 300 and normalized using ImageNet statistics (Section 3.2.1). In parallel, the audio waveform is resampled to 16 kHz, transformed into a log-Mel spectrogram via STFT and a 128-band Mel filter bank, replicated along the channel dimension, and resized to 224 × 224 (Section 3.3.1).
Step 3: Feature extraction. The preprocessed image is processed by EfficientNet-B3, which leverages compound scaling and SE attention to extract discriminative fine-grained visual features (Section 3.2.2). The preprocessed spectrogram is processed by ResNet-50 with a dense sampling inference strategy (multiple temporal windows) to fully exploit complete audio information (Section 3.3.3).
Step 4: Probability output. Each branch produces a softmax probability distribution over the K bird species: $p_{\text{img}}$ for the image branch and $p_{\text{aud}}$ for the audio branch (Section 3.4.1).
Step 5: Confidence estimation. For each modality, two complementary confidence indicators are computed from its probability distribution: the Shannon entropy $H_m$ (capturing overall distribution uncertainty) and the Top-1/Top-2 probability gap $G_m$ (capturing decision decisiveness). The two indicators are linearly combined into a unified composite confidence $C_m$ (Section 3.4.1 and Section 3.4.2).
Step 6: Adaptive fusion and output. The composite confidences of the two modalities are normalized to obtain the image-modality fusion weight α, with the audio modality assigned 1 − α. The weighted average of the two probability distributions, $p_{\text{final}}$, is computed, and the final predicted species ŷ is output (Section 3.4.2).
The entire pipeline requires no trainable parameters in the fusion module, enabling efficient sample-level adaptive decision-making at inference time.
Figure 2.
Framework of the proposed confidence–adaptive audiovisual fusion method for fine-grained bird species recognition.
3.2. Bird Recognition Model Under Image Modality
The image branch comprises image data preprocessing and feature extraction with the EfficientNet-B3 backbone.
3.2.1. Image Data Preprocessing
The input to the image branch is an RGB image of the target bird. Input images are first resized to 300 × 300 pixels to match the standard input size of EfficientNet-B3 and then normalized using the channel-wise statistics of the ImageNet dataset. Let the input image be $I \in \mathbb{R}^{H \times W \times 3}$, with pixel values scaled to [0, 1]; the normalization is defined as:
$$I_{\text{norm}} = \frac{I - \mu}{\sigma}$$
where $\mu$ = [0.485, 0.456, 0.406] and $\sigma$ = [0.229, 0.224, 0.225] are the per-channel mean and standard deviation of the ImageNet dataset, respectively. This normalization aligns the input distribution with that of the pretrained model, facilitating effective feature extraction via transfer learning.
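The per-channel normalization can be sketched for a single pixel as follows. This is a minimal illustration that assumes 8-bit RGB input scaled to [0, 1] before the ImageNet statistics are applied; the function name `normalize_pixel` is hypothetical.

```python
# ImageNet channel statistics quoted in the text (R, G, B order).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb_uint8):
    """Normalize one RGB pixel given as 8-bit integers in [0, 255].

    Assumption: values are first scaled to [0, 1], which is the range
    the ImageNet mean and standard deviation refer to.
    """
    scaled = [value / 255.0 for value in rgb_uint8]
    return [
        (s - mean) / std
        for s, mean, std in zip(scaled, IMAGENET_MEAN, IMAGENET_STD)
    ]


if __name__ == "__main__":
    print(normalize_pixel((0, 0, 0)))        # darkest possible pixel
    print(normalize_pixel((255, 255, 255)))  # brightest possible pixel
```

After normalization the channel values are no longer confined to [0, 1]; a black pixel maps to roughly -2.1 in the red channel and a white pixel to roughly 2.6 in the blue channel.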
To enhance the generalization capability of the model, the following data augmentation strategies are applied during training: random horizontal flipping with a probability of 0.5; random rotation within the range of ±15°; and color jittering that randomly adjusts brightness, contrast, and saturation. These augmentation strategies simulate variations in bird posture and illumination conditions encountered in real-world scenarios, effectively improving the model’s robustness to input perturbations.
3.2.2. EfficientNet-B3 Backbone Network
EfficientNet is a family of efficient CNNs designed through neural architecture search and compound scaling. Unlike conventional approaches that adjust network depth, width, or input resolution independently, EfficientNet simultaneously optimizes all three dimensions via a compound scaling coefficient $\phi$, achieving an optimal trade-off between computational cost and model performance. The compound scaling strategy is defined as:
$$d = \alpha_1^{\phi}, \qquad w = \beta_1^{\phi}, \qquad r = \gamma_1^{\phi}$$
where $d$, $w$, and $r$ denote the scaling factors for network depth, width, and resolution, respectively; $\alpha_1$, $\beta_1$, and $\gamma_1$ are the base scaling coefficients subject to the constraint $\alpha_1 \cdot \beta_1^{2} \cdot \gamma_1^{2} \approx 2$; and $\phi$ is a compound coefficient that controls the overall model scale.
EfficientNet-B3 corresponds to ϕ = 3, with a depth scaling factor of 1.4, a width scaling factor of 1.2, and an input resolution of 300 × 300. Compared with the baseline EfficientNet-B0, the B3 variant substantially enhances feature representation capability while maintaining high computational efficiency, making it well suited for fine-grained bird species classification.
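To make the scaling rule concrete, the sketch below evaluates it numerically. The base coefficients 1.2, 1.1, and 1.15 are those reported in the original EfficientNet paper and are assumed here, since this section does not state them; the function names are hypothetical.

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Return (depth, width, resolution) multipliers for coefficient phi.

    Assumption: alpha, beta, gamma are the base coefficients from the
    EfficientNet paper; they are not given in this section.
    """
    return alpha ** phi, beta ** phi, gamma ** phi


def cost_factor(alpha=1.2, beta=1.1, gamma=1.15):
    """Value of alpha * beta^2 * gamma^2, which should be close to 2."""
    return alpha * beta ** 2 * gamma ** 2


if __name__ == "__main__":
    print(round(cost_factor(), 4))  # constraint value, close to 2
    print(compound_scale(1))        # multipliers for phi = 1
```

Under this constraint each unit increase of the compound coefficient roughly doubles the computational cost. Released variant configurations, such as the 1.4 and 1.2 factors quoted for B3, are rounded settings and need not coincide exactly with the raw powers computed here.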
The fundamental building block of EfficientNet is the Mobile Inverted Bottleneck Convolution (MBConv) [35]. As shown in Figure 3, an MBConv block first expands the input channels to $k$ times the original dimensionality via a 1 × 1 convolution (expansion ratio, typically $k = 6$), then performs spatial feature extraction using depthwise separable convolution, and finally compresses the channels back to the original dimensionality through another 1 × 1 convolution. Additionally, MBConv incorporates the Squeeze-and-Excitation (SE) attention mechanism [36], which learns inter-channel dependencies through global average pooling and a two-layer fully connected network, adaptively recalibrating feature responses across channels. Let the input feature map be $X \in \mathbb{R}^{H \times W \times C}$; the SE module is computed as:
$$s = \sigma\left(W_2\,\delta\left(W_1\,\mathrm{GAP}(X)\right)\right), \qquad \tilde{X} = s \odot X$$
where GAP(·) denotes global average pooling; $W_1 \in \mathbb{R}^{(C/r) \times C}$ and $W_2 \in \mathbb{R}^{C \times (C/r)}$ are learnable weight matrices with $r$ being the reduction ratio; $\delta(\cdot)$ is the nonlinear activation between the two fully connected layers; $\sigma(\cdot)$ is the sigmoid activation function; and ⊙ denotes element-wise multiplication.
After feature extraction by the EfficientNet-B3 backbone, global average pooling compresses the spatial features into a one-dimensional vector, which is then mapped to a $K$-dimensional output space through a fully connected layer, where $K$ is the number of bird species. Let the backbone output feature map be $F \in \mathbb{R}^{H' \times W' \times C'}$; the classification output is computed as:
$$z_{\text{img}} = W_{\text{cls}} \cdot \mathrm{GAP}(F) + b_{\text{cls}}$$
where $z_{\text{img}} \in \mathbb{R}^{K}$ denotes the unnormalized log-probabilities (logits) output by the image branch, and $W_{\text{cls}}$ and $b_{\text{cls}}$ are the weight and bias parameters of the classification layer.
3.3. Bird Recognition Model Under Audio Modality
The audio branch consists of audio data preprocessing, data augmentation, and feature extraction with the ResNet-50 backbone enhanced by a dense sampling inference strategy.
3.3.1. Audio Data Preprocessing
The input to the audio branch is the raw waveform of bird vocalizations. The audio is first resampled to a standard sampling rate of 16 kHz to unify the temporal resolution across different sources. Subsequently, the one-dimensional time-domain signal is converted into a two-dimensional time–frequency representation via the Short-Time Fourier Transform (STFT). Let the raw audio signal be $x(t)$; the STFT is defined as:
$$X(m, \omega) = \sum_{t} x(t)\, w(t - mH)\, e^{-j\omega t}$$
where $w(\cdot)$ is the window function (a Hann window is used in this work), $m$ is the frame index, $H$ is the hop length, and $\omega$ is the angular frequency. The FFT size is set to 512 with a hop length of 128 samples, corresponding to a temporal resolution of approximately 8 ms.
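The framing arithmetic implied by these settings can be checked with a short sketch. It assumes the 16 kHz rate, FFT size of 512, and hop of 128 stated above; whether the signal is padded at its edges (the `center` flag) is not specified in the text and is treated as an option. All names are hypothetical.

```python
SAMPLE_RATE = 16_000  # Hz, after resampling
N_FFT = 512           # FFT size (window length in samples)
HOP = 128             # hop length in samples


def hop_ms():
    """Time step between consecutive frames, in milliseconds."""
    return 1000.0 * HOP / SAMPLE_RATE


def num_frames(num_samples, center=False):
    """Number of STFT frames for a signal of num_samples samples.

    center=False: only windows that fit entirely inside the signal.
    center=True: signal padded by N_FFT // 2 on both sides (assumption).
    """
    if center:
        return 1 + num_samples // HOP
    if num_samples < N_FFT:
        raise ValueError("signal shorter than one analysis window")
    return 1 + (num_samples - N_FFT) // HOP


def frames_to_seconds(frames):
    """Duration covered by a number of hops, in seconds."""
    return frames * HOP / SAMPLE_RATE


if __name__ == "__main__":
    print(hop_ms())                        # 8.0
    print(num_frames(10 * SAMPLE_RATE))    # 1247
    print(frames_to_seconds(400))          # 3.2
```

A 10 s clip therefore yields about 1250 frames, and a window of 400 frames spans 3.2 s of audio.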
To emulate the nonlinear perceptual characteristics of the human auditory system across different frequencies, the linear frequency axis is converted to the Mel scale. The mapping between the Mel scale and linear frequency is given by:
$$f_{\text{mel}} = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$
where $f$ is the linear frequency in Hz and $f_{\text{mel}}$ is the corresponding Mel frequency. A filter bank of 128 triangular filters is applied to map the power spectrum onto the Mel frequency domain, yielding the Mel spectrogram. A logarithmic transformation is then applied to convert multiplicative noise into additive noise while compressing the dynamic range:
$$M = \log\left(S_{\text{mel}} + \epsilon\right)$$
where $S_{\text{mel}}$ is the Mel spectrogram, $\epsilon = 10^{-10}$ is a numerical stability constant to prevent logarithmic underflow, $M \in \mathbb{R}^{F \times T}$ is the log-Mel spectrogram, $F = 128$ is the number of Mel frequency bands, and $T$ is the number of time frames.
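The Mel mapping and the logarithmic compression can be illustrated with the following sketch. It assumes the common 2595·log10(1 + f/700) form of the Mel scale and a natural logarithm; the function names are hypothetical, and the filter bank itself is omitted.

```python
import math


def hz_to_mel(f_hz):
    """Map a linear frequency in Hz to the Mel scale."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def log_mel(mel_power, eps=1e-10):
    """Apply log(S + eps) element-wise to a Mel power spectrogram.

    mel_power is a list of F rows (Mel bands), each with T frame values.
    Assumption: natural logarithm; eps guards against log(0).
    """
    return [[math.log(value + eps) for value in row] for row in mel_power]


if __name__ == "__main__":
    print(round(hz_to_mel(700.0), 2))   # 781.17
    print(round(hz_to_mel(8000.0), 2))  # Mel value at the 16 kHz Nyquist limit
    print(log_mel([[0.0, 1.0]]))
```

With a 16 kHz sampling rate, the 128 filters cover frequencies up to the 8 kHz Nyquist limit.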
To accommodate the ImageNet-pretrained ResNet-50, the single-channel log-Mel spectrogram is replicated three times along the channel dimension to form a three-channel feature tensor $S = [M; M; M] \in \mathbb{R}^{3 \times F \times T}$, matching the expected input format of the network. This channel-replication operation is a common practice when adapting ImageNet-pretrained vision backbones to audio classification [17], allowing the input format to remain compatible with the pretrained weights without modifying the backbone architecture.
The spectrogram is then normalized to the [0, 1] range:
$$S_{\text{norm}} = \frac{S - \min(S)}{\max(S) - \min(S) + \epsilon_1}$$
where $\epsilon_1 = 10^{-8}$ is a numerical stability constant to prevent division by zero. The normalized tensor is then resized to 224 × 224 pixels and standardized using the ImageNet statistics (mean [0.485, 0.456, 0.406], standard deviation [0.229, 0.224, 0.225]) to match the input distribution of the pretrained ResNet-50.
3.3.2. Data Augmentation Strategy
To improve the generalization ability and noise robustness of the audio classification model, the following data augmentation strategies are employed during training:
(1) Random temporal cropping: a contiguous time window is randomly sampled from the full spectrogram as a training sample. Let the temporal dimension of the original spectrogram be T; the crop window length is Tcrop = 400 frames (approximately 3 s of audio), and the starting position t0 is uniformly sampled from [0, T − Tcrop]. This strategy encourages the model to learn invariance to the temporal position of vocalizations.
(2) Frequency masking: inspired by the SpecAugment method [37], consecutive frequency bands are randomly masked along the frequency axis. The starting band index $f_0$ is randomly selected from [0, $F - F_{\text{mask}}$], with a mask width of $F_{\text{mask}} = 15$; values in the band [$f_0$, $f_0 + F_{\text{mask}}$) are set to zero. This strategy forces the model to learn local invariance along the frequency dimension, enhancing robustness to partial frequency information loss.
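The two augmentations can be sketched on a spectrogram stored as a list of frequency rows. This is an illustrative pure-Python version with hypothetical function names; returning the spectrogram unchanged when it is shorter than the crop window is an assumption, since the text does not describe that case.

```python
import random


def random_temporal_crop(spec, t_crop=400, rng=random):
    """Crop a random window of t_crop frames from an F x T spectrogram.

    Assumption: clips with fewer than t_crop frames are returned as-is.
    """
    total_frames = len(spec[0])
    if total_frames <= t_crop:
        return [row[:] for row in spec]
    t0 = rng.randint(0, total_frames - t_crop)  # both ends inclusive
    return [row[t0:t0 + t_crop] for row in spec]


def frequency_mask(spec, f_mask=15, rng=random):
    """Zero out f_mask consecutive frequency bands starting at a random f0."""
    n_bands = len(spec)
    f0 = rng.randint(0, n_bands - f_mask)
    return [
        [0.0] * len(row) if f0 <= band < f0 + f_mask else row[:]
        for band, row in enumerate(spec)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    spectrogram = [[1.0] * 1250 for _ in range(128)]  # F = 128, T = 1250
    cropped = random_temporal_crop(spectrogram, rng=rng)
    masked = frequency_mask(cropped, rng=rng)
    print(len(masked), len(masked[0]))  # 128 400
```

Both functions return new lists, so the original spectrogram is left untouched and can be re-augmented in later epochs.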
3.3.3. ResNet-50 Backbone Network
ResNet (Residual Network) effectively mitigates the vanishing gradient problem in deep networks by introducing skip connections, enabling the training of substantially deeper architectures. ResNet-50 comprises 49 convolutional layers and one fully connected layer, organized into an initial convolutional layer followed by four residual stages.
The residual block is the core building unit of ResNet. Let the input to the $l$-th layer be $x_l$; the output of the residual block is given by:
$$x_{l+1} = x_l + \mathcal{F}(x_l, \mathcal{W}_l)$$
where $\mathcal{F}(\cdot)$ is the residual function and $\mathcal{W}_l$ denotes the learnable parameters of the $l$-th layer. The skip connection allows gradients to propagate directly from deeper layers to shallower ones, significantly improving training efficiency in deep networks.
As shown in Figure 4, ResNet-50 adopts the bottleneck architecture, in which each residual block consists of three convolutional layers: a 1 × 1 convolution for dimensionality reduction, a 3 × 3 convolution for spatial feature extraction, and a 1 × 1 convolution for dimensionality restoration. This compress-then-expand design substantially reduces computational cost while preserving feature representation capacity. The four residual stages have channel dimensions of 256, 512, 1024, and 2048, respectively, with each stage containing multiple residual blocks.
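After feature extraction by the ResNet-50 backbone, global average pooling compresses the spatial features into a 2048-dimensional vector, which is then mapped to a $K$-dimensional output space through a fully connected layer:
$$z_{\text{aud}} = W_{\text{aud}} \cdot \mathrm{GAP}(F_{\text{res}}) + b_{\text{aud}}$$
where $z_{\text{aud}} \in \mathbb{R}^{K}$ denotes the unnormalized logits output by the audio branch, $F_{\text{res}}$ is the output feature map of the final residual stage of ResNet-50, and $W_{\text{aud}}$ and $b_{\text{aud}}$ are the weight and bias parameters of the audio classification layer.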
Since random temporal cropping during training uses only a portion of the audio, a dense sampling strategy is adopted at inference to fully exploit the complete audio information. Specifically, multiple time windows are extracted from the full spectrogram using a fixed-stride sliding window, each is passed through ResNet-50 to obtain predictions, and the logits from all windows are averaged.
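Let the number of sampling windows be $N_w = 5$, the window length be $T_{\text{crop}} = 400$, and the sliding stride be $S = 150$. The starting position of the $i$-th window is $t_i = (i - 1) \times S$, $i \in \{1, 2, \ldots, N_w\}$. The averaged logits across all windows serve as the prediction vector for the audio modality:
$$z_{\text{aud}} = \frac{1}{N_w} \sum_{i=1}^{N_w} z_{\text{aud}}^{(i)}$$
where $z_{\text{aud}}^{(i)}$ denotes the logits produced by ResNet-50 for the $i$-th window.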
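The window layout and logit averaging can be sketched as follows, using the stated values of five windows, 400 frames per window, and a stride of 150. The function names are hypothetical, and the per-window logits would in practice come from the audio network rather than the toy values used here.

```python
def window_starts(n_windows=5, stride=150):
    """Start frame of each window: t_i = (i - 1) * stride, i = 1..n_windows."""
    return [(i - 1) * stride for i in range(1, n_windows + 1)]


def average_logits(per_window_logits):
    """Element-wise mean of the logit vectors from all windows."""
    n_windows = len(per_window_logits)
    n_classes = len(per_window_logits[0])
    return [
        sum(window[k] for window in per_window_logits) / n_windows
        for k in range(n_classes)
    ]


if __name__ == "__main__":
    starts = window_starts()
    print(starts)                        # [0, 150, 300, 450, 600]
    print(starts[-1] + 400)              # last window ends at frame 1000
    print(average_logits([[1.0, 3.0], [3.0, 5.0]]))  # [2.0, 4.0]
```

With these settings the windows overlap by 250 frames and together span the first 1000 frames, about 8 s at an 8 ms hop. How recordings shorter than 1000 frames are handled is not stated in the text.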
3.4. Confidence-Based Adaptive Fusion Strategy
The central challenge of multimodal fusion lies in how to effectively integrate information from different modalities to improve the accuracy of the final decision. Existing fusion strategies are broadly categorized into three types: early fusion, late fusion, and hybrid fusion. Early fusion concatenates feature vectors from each modality and feeds them into a shared classifier for joint learning, which can capture cross-modal interactions but requires additional trainable parameters. Late fusion combines the outputs of individual modality classifiers through weighted aggregation; it is straightforward to implement and allows each branch to be optimized independently, but typically employs fixed or validation-tuned static weights that cannot adapt to sample-specific characteristics. Hybrid fusion combines the two, integrating information at both the feature level and the decision level.
Under practical conditions, the prediction reliability of different modalities varies considerably across samples. When the probability distribution output by the image modality for a given sample is highly concentrated, the image-based judgment is more reliable; conversely, a flatter distribution indicates greater classification uncertainty. The same applies to the audio modality. An ideal fusion strategy should therefore dynamically allocate fusion weights according to the prediction confidence of each modality on the current sample, allowing the more confident modality to dominate the final decision.
Information entropy is a classical measure of uncertainty in a probability distribution. For discrete distributions, lower entropy indicates a more concentrated distribution with less uncertainty, while higher entropy implies a more uniform distribution with greater uncertainty. In addition, the probability gap between the Top-1 and Top-2 predicted classes reflects the decisiveness of the model: a larger gap suggests a stronger preference for the top-ranked class and thus more reliable predictions. Based on the above analysis, this paper proposes to jointly consider information entropy and probability gap as two complementary confidence indicators, combining them through weighted aggregation to achieve sample-level adaptive fusion in which the more confident modality plays a dominant role.
3.4.1. Prediction Entropy and Confidence Computation
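First, the logits output by each modality are converted to probability distributions via the softmax function:
$$p_{\text{img}} = \mathrm{softmax}(z_{\text{img}}), \qquad p_{\text{aud}} = \mathrm{softmax}(z_{\text{aud}})$$
where $p_{\text{img}}, p_{\text{aud}} \in \mathbb{R}^{K}$ are the predicted probability distributions of the image and audio modalities, respectively.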
The Shannon entropy of each modality’s predicted probability distribution is then computed as:
$$H_m = -\sum_{k=1}^{K} p_{m,k} \log\left(p_{m,k} + \epsilon_2\right), \qquad m \in \{\text{img}, \text{aud}\}$$
where $\epsilon_2 = 10^{-8}$ is a numerical stability constant to prevent logarithmic underflow. The theoretical range of entropy is [0, $\log K$]: the minimum value of 0 is attained when the distribution degenerates to a deterministic one (i.e., one class has probability 1 and all others have 0), while the maximum value of $\log K$ is reached under a uniform distribution.
Based on the inverse relationship with entropy, the confidence of each modality is defined as the reciprocal of its entropy:
$$C_m^{\text{ent}} = \frac{1}{H_m}$$
This formulation ensures that the modality with lower entropy (i.e., a more certain prediction) receives a higher confidence score.
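The entropy range can be verified numerically with a small sketch, assuming the natural logarithm and a stability constant of 1e-8 inside the logarithm. The function names are hypothetical.

```python
import math


def shannon_entropy(p, eps=1e-8):
    """Shannon entropy of a probability vector, with eps inside the log."""
    return -sum(pk * math.log(pk + eps) for pk in p)


def normalized_entropy(p, eps=1e-8):
    """Entropy divided by log K, so that values lie approximately in [0, 1]."""
    return shannon_entropy(p, eps) / math.log(len(p))


if __name__ == "__main__":
    k = 60  # number of species in SSW60
    uniform = [1.0 / k] * k
    one_hot = [1.0] + [0.0] * (k - 1)
    print(round(shannon_entropy(uniform), 4))  # 4.0943, i.e. log 60
    print(shannon_entropy(one_hot))            # approximately 0
    print(round(normalized_entropy(uniform), 4))
```

Because the entropy of a near-deterministic prediction approaches zero, a plain reciprocal grows without bound in that regime; dividing by log K instead gives a bounded indicator that is easier to combine with other quantities.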
3.4.2. Probability Gap Indicator and Combined Fusion Strategy
This paper proposes a combined fusion strategy that simultaneously considers the entropy indicator and the probability gap indicator, yielding a more robust confidence estimate through weighted aggregation.
The probability gap indicator is defined as the difference between the Top-1 and Top-2 class probabilities; a larger gap indicates greater certainty in the model’s prediction:
$$G_m = p_m^{(1)} - p_m^{(2)}$$
where $p_m^{(1)}$ and $p_m^{(2)}$ denote the highest and second-highest values in the predicted probability distribution, respectively.
The normalized entropy and probability gap are combined through weighted aggregation to obtain the composite confidence:
$$C_m = \beta\left(1 - \tilde{H}_m\right) + (1 - \beta)\, G_m$$
where $\tilde{H}_m = H_m / \log K$ is the normalized entropy mapped to the [0, 1] interval, and $\beta \in [0, 1]$ is a balancing coefficient that controls the relative importance of the entropy and probability gap indicators. In this work, $\beta = 0.5$ is adopted to equally balance the contributions of the two indicators.
The composite confidence scores of the two modalities are normalized to obtain the fusion weight for the image modality:
$$\alpha = \frac{C_{\text{img}}}{C_{\text{img}} + C_{\text{aud}}}$$
Correspondingly, the fusion weight for the audio modality is 1 − α. By definition, α ∈ (0, 1); when the composite confidence of the image modality exceeds that of the audio modality, α > 0.5 and the image modality dominates the fusion, and vice versa.
The final fused prediction probability distribution is obtained by weighted averaging:
$$p_{\text{final}} = \alpha\, p_{\text{img}} + (1 - \alpha)\, p_{\text{aud}}$$
The classification decision selects the class with the highest probability in the fused distribution as the predicted result:
$$\hat{y} = \arg\max_{k \in \{1, \ldots, K\}} p_{\text{final},k}$$
In the proposed adaptive combined fusion strategy, the fusion weights are derived entirely from the predicted probability distributions of each modality, eliminating the need for additional training of a fusion module and avoiding the weight degeneration phenomenon caused by optimization objectives in learning-based fusion. The fusion weights are computed independently for each sample, allowing dynamic adjustment according to the prediction confidence of each modality. For example, when a sample has a clear image but the bird is silent, the image modality exhibits high confidence while the audio modality yields low confidence, and the fusion weight automatically shifts toward the image modality. Moreover, entropy, as a measure of uncertainty, directly reflects the model’s degree of certainty regarding its prediction, while the probability gap captures the clarity of the decision boundary. Using these as fusion criteria is both intuitive and highly interpretable. Furthermore, the fusion process involves only simple operations—softmax normalization, entropy computation, probability gap calculation, and weighted averaging—incurring negligible computational overhead.
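A minimal pure-Python sketch of the fusion rule described in this section is given below. It assumes the composite confidence takes the form C = β(1 − H/log K) + (1 − β)G and that the image weight is α = C_img/(C_img + C_aud). The function names, the clipping of the normalized entropy to [0, 1], and the small constant `tiny` (which yields α = 0.5 when both confidences are zero) are illustrative assumptions rather than details given in the text.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(p, eps=1e-8):
    """Shannon entropy with a stability constant inside the logarithm."""
    return -sum(pk * math.log(pk + eps) for pk in p)


def composite_confidence(p, beta=0.5):
    """beta * (1 - normalized entropy) + (1 - beta) * Top-1/Top-2 gap.

    Requires at least two classes. Clipping to [0, 1] is an assumption
    that absorbs the tiny offsets introduced by eps.
    """
    h_norm = min(max(entropy(p) / math.log(len(p)), 0.0), 1.0)
    top1, top2 = sorted(p, reverse=True)[:2]
    return beta * (1.0 - h_norm) + (1.0 - beta) * (top1 - top2)


def fuse(logits_img, logits_aud, beta=0.5, tiny=1e-12):
    """Return (predicted class index, image weight alpha, fused distribution)."""
    p_img, p_aud = softmax(logits_img), softmax(logits_aud)
    c_img = composite_confidence(p_img, beta)
    c_aud = composite_confidence(p_aud, beta)
    # tiny keeps the ratio defined when both confidences are exactly zero.
    alpha = (c_img + tiny) / (c_img + c_aud + 2.0 * tiny)
    p_final = [alpha * a + (1.0 - alpha) * b for a, b in zip(p_img, p_aud)]
    y_hat = max(range(len(p_final)), key=p_final.__getitem__)
    return y_hat, alpha, p_final


if __name__ == "__main__":
    # Confident image branch (class 0) versus a nearly flat audio branch.
    y_hat, alpha, p_final = fuse([4.0, 0.0, 0.0], [0.0, 0.5, 0.0])
    print(y_hat, round(alpha, 3))
```

In this toy case the image branch is sharply peaked while the audio branch is close to uniform, so the weight shifts strongly toward the image modality and the fused prediction follows it.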
3.5. Training Strategy
A two-stage independent training strategy is adopted, in which the image and audio classification models are optimized separately. The fusion module requires no training.
The image branch is initialized with ImageNet-pretrained EfficientNet-B3 weights and fine-tuned on the bird image dataset. The training configuration is as follows: the AdamW optimizer is used with an initial learning rate of $10^{-4}$ and a weight decay of $10^{-4}$; the learning rate is scheduled using cosine annealing, which smoothly decays the learning rate from its initial value to a minimum following a cosine curve; training runs for 15 epochs with a batch size of 32. The loss function is the cross-entropy loss:
$$\mathcal{L}_{\text{CE}} = -\sum_{k=1}^{K} y_k \log \hat{p}_k$$
where $y$ is the one-hot encoded ground-truth label vector and $\hat{p}$ is the predicted probability distribution.
The audio branch is initialized with ImageNet-pretrained ResNet-50 weights and fine-tuned on the bird audio dataset. Although audio spectrograms differ visually from natural images, the ImageNet-pretrained weights provide generic low-level features such as textures and edge detectors, which help accelerate model convergence. The training configuration is as follows: the AdamW optimizer is employed with an initial learning rate of $10^{-4}$ and a weight decay of $10^{-4}$; cosine annealing is used for learning rate scheduling; training runs for 30 epochs with a batch size of 16.
The loss function is likewise the cross-entropy loss between the one-hot label $y$ and the distribution $\hat{p}$ predicted by the audio branch:
$$\mathcal{L}_{\text{CE}} = -\sum_{k=1}^{K} y_k \log \hat{p}_k$$
Both branches employ the cosine annealing learning rate schedule, which smoothly decays the learning rate from the initial value to a minimum following a cosine function:
$$\eta_t = \eta_{\min} + \frac{1}{2}\left(\eta_{\max} - \eta_{\min}\right)\left(1 + \cos\left(\frac{t}{T_{\max}}\pi\right)\right)$$
where $\eta_t$ is the learning rate at the $t$-th epoch, $\eta_{\max}$ is the initial maximum learning rate, $\eta_{\min}$ is the minimum learning rate (set to $10^{-6}$ in this work), and $T_{\max}$ is the total number of training epochs. Compared with a fixed learning rate, cosine annealing refines model parameters with smaller steps in the later training stages, effectively preventing loss oscillation near local minima and improving the final convergence quality.
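The schedule can be evaluated directly from its closed form, as in the sketch below. It assumes the standard cosine annealing expression with a maximum rate of 1e-4 and a minimum of 1e-6, as stated in the text; the function name is hypothetical.

```python
import math


def cosine_lr(epoch, t_max, eta_max=1e-4, eta_min=1e-6):
    """Learning rate at a given epoch under cosine annealing."""
    cosine = math.cos(math.pi * epoch / t_max)
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + cosine)


if __name__ == "__main__":
    # Image branch: 15 epochs; audio branch: 30 epochs.
    for epoch in (0, 5, 10, 15):
        print(epoch, cosine_lr(epoch, t_max=15))
```

The rate starts at the maximum, passes through the midpoint of the two bounds halfway through training, and reaches the minimum at the final epoch. PyTorch's `torch.optim.lr_scheduler.CosineAnnealingLR` follows the same closed form through its `T_max` and `eta_min` arguments, although whether that scheduler was used here is not stated.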
After the two-stage training is completed, both branches can perform inference independently. During inference, the logits from the two branches are obtained separately, and the final prediction is computed using the confidence–adaptive fusion strategy described in Section 3.4. The entire fusion process requires no training and directly derives fusion weights from information entropy and probability gap.
4. Experimental Results and Analysis
This section experimentally validates the effectiveness of the proposed multimodal fusion recognition method. The experimental environment and dataset are first described, followed by the definition of evaluation metrics. The experimental analysis is then conducted from two perspectives: single-modality recognition performance and multimodal fusion effectiveness.
4.1. Experimental Setup and Dataset Description
The proposed multimodal fusion bird recognition model is experimentally evaluated under the following hardware and software configuration: Windows 10 operating system, Intel Core i5-12400F processor, NVIDIA GeForce RTX 3070Ti GPU (8 GB VRAM), and 32 GB DDR4 RAM. The deep learning framework is PyTorch 1.12 with Python 3.8 and CUDA 11.6.
The Sapsucker Woods 60 (SSW60) dataset is adopted for experimental validation. SSW60 is a multimodal bird recognition benchmark released by the Cornell Lab of Ornithology, specifically designed for audio–visual fine-grained classification research. The dataset covers 60 North American bird species, all of which can be observed at the Sapsucker Woods Sanctuary in Ithaca, New York, spanning multiple taxonomic groups including songbirds, woodpeckers, and raptors. The dataset comprises multi-modal data sources. For images, it contains 21,600 field photographs from iNaturalist 2021 and 10,221 images from NABirds, totaling 31,821 independent image samples. For audio, it includes 3861 independently recorded WAV files at a sampling rate of 22,050 Hz, mono-channel, each approximately 10 s in duration, with expert-verified annotations confirming the presence of the target species’ vocalizations. The official train/test split provided with the dataset is strictly followed throughout this work.
For image data, input images are uniformly resized to 300 × 300 pixels to match the standard input size of EfficientNet-B3 and then normalized using the ImageNet channel-wise statistics. During training, data augmentation is applied, including random horizontal flipping, random rotation, and color jittering.
For audio data, the waveforms are first resampled to 16,000 Hz. Spectral features are then extracted via STFT with an FFT size of 512 and a hop length of 128 samples. A bank of 128 triangular filters maps the power spectrum onto the Mel frequency domain, followed by a logarithmic transformation to obtain log-Mel spectrograms. The single-channel spectrogram is replicated along the channel dimension to match the three-channel input format of ResNet-50, normalized, resized to 224 × 224 pixels, and standardized using the ImageNet statistics. During training, random temporal cropping with a window length of 400 frames is applied, along with frequency masking augmentation.
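To make the temporal scale of these settings concrete, the following Python sketch works out the frame arithmetic. It assumes a centered STFT (one frame per hop plus one), which is a common default but is not stated in the text; the helper names `num_frames` and `random_crop_start` are hypothetical.

```python
import random

SAMPLE_RATE = 16_000   # Hz, after resampling
HOP_LENGTH = 128       # samples between consecutive STFT frames
CROP_FRAMES = 400      # training crop length in frames


def num_frames(duration_s, sample_rate=SAMPLE_RATE, hop_length=HOP_LENGTH):
    """Frame count of a centered STFT (assumption: center padding is used)."""
    return 1 + int(duration_s * sample_rate) // hop_length


def random_crop_start(total_frames, crop_frames=CROP_FRAMES, rng=random):
    """First frame of a random temporal crop; 0 if the clip is too short."""
    if total_frames <= crop_frames:
        return 0
    return rng.randrange(total_frames - crop_frames + 1)


frames_per_clip = num_frames(10.0)                       # a 10 s recording
crop_seconds = CROP_FRAMES * HOP_LENGTH / SAMPLE_RATE    # duration of one crop
```

Under these assumptions a 10 s recording yields 1251 frames, and a 400-frame crop covers 3.2 s, roughly a third of each recording.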
The detailed configuration parameters of each module are listed in Table 1. The image branch adopts the EfficientNet-B3 architecture with an input size of 300 × 300 pixels, a compound scaling coefficient ϕ = 3, and 60 output classes. The audio branch employs the ResNet-50 architecture with an input spectrogram size of 224 × 224 pixels, four residual stages, and an output feature dimension of 2048.
Both branches use the AdamW optimizer with an initial learning rate of $10^{-4}$ and a weight decay of $10^{-4}$, and the learning rate is scheduled via cosine annealing. The image branch is trained for 15 epochs and the audio branch for 30 epochs.
Since the image (31,821 samples) and audio (3861 samples) data in SSW60 are collected independently, are imbalanced in count, and have no instance-level correspondence, we adopt a species-level random pairing strategy to construct image-audio pairs for the fusion evaluation. The strategy iterates over the image test set; for every test image, an audio sample is randomly selected from the audio set of the same species to form a paired sample, so that both elements of a pair share the same species label. Because the pairing relies on the species-organized directory structure of SSW60, every pair belongs to the same bird species; however, since the two modalities were collected independently, the paired image and audio do not necessarily originate from the same individual or the same observation event and are aligned only at the species level. Given that the audio set is smaller than the image set, a given audio recording may be sampled multiple times or not at all within a single evaluation pass, while every test image is covered exactly once, ensuring that the fusion evaluation is complete with respect to the image test set. This pairing protocol is applied only at the fusion evaluation stage; the image and audio branches are trained independently on their respective modality-specific data following the official SSW60 train/test split, without requiring any image-audio pairing during training. This species-level pairing protocol is consistent with the design intent of SSW60 [16] as a fine-grained audiovisual classification benchmark, in which the two modalities provide complementary species-discriminative information rather than instance-level synchronized signals.
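The pairing protocol can be sketched in a few lines of Python. This is an illustrative reading of the description above, not the authors' code: the function name `pair_by_species`, the `(path, species)` tuple layout, and the fixed seed are all assumptions, and the audio pool is whatever set the caller passes in.

```python
import random
from collections import defaultdict


def pair_by_species(test_images, audio_clips, seed=0):
    """Pair every test image with a random audio clip of the same species.

    Both inputs are lists of (path, species) tuples (hypothetical layout).
    Each image appears exactly once; audio clips may repeat or go unused.
    """
    rng = random.Random(seed)

    # Group audio paths by species label.
    audio_by_species = defaultdict(list)
    for path, species in audio_clips:
        audio_by_species[species].append(path)

    pairs = []
    for image_path, species in test_images:
        candidates = audio_by_species.get(species)
        if not candidates:
            raise ValueError(f"no audio available for species {species!r}")
        pairs.append((image_path, rng.choice(candidates), species))
    return pairs


# Toy example with made-up file names.
images = [("img_a1.jpg", "A"), ("img_a2.jpg", "A"), ("img_b1.jpg", "B")]
audio = [("aud_a1.wav", "A"), ("aud_b1.wav", "B"), ("aud_b2.wav", "B")]
pairs = pair_by_species(images, audio)
```

Fixing the seed makes a single evaluation pass reproducible; because the audio draw is random, repeating the evaluation over several seeds would give a sense of how much the fused accuracy depends on the particular pairing.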
4.2. Evaluation Metrics
To comprehensively evaluate the classification performance, the following metrics are adopted. Top-1 accuracy is the proportion of samples for which the class with the highest predicted probability matches the ground truth, serving as the most direct performance measure in multi-class classification tasks:
$$\mathrm{Top\text{-}1} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\left(\hat{y}_i = y_i\right)$$
where $N$ is the total number of test samples, $\hat{y}_i$ is the predicted class for the $i$-th sample, $y_i$ is the ground-truth class, and $\mathbb{1}(\cdot)$ is the indicator function that returns 1 when the condition is true and 0 otherwise.
Top-5 accuracy is the proportion of samples for which the ground-truth class falls within the top five predicted classes. For fine-grained classification tasks with a large number of species, Top-5 accuracy reflects the model’s ability to rank candidate classes:
$$\mathrm{Top\text{-}5} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\left(y_i \in \mathrm{Top5}(\mathbf{p}_i)\right)$$
where $\mathrm{Top5}(\mathbf{p}_i)$ denotes the set of five classes with the highest probabilities in the predicted distribution $\mathbf{p}_i$ of the $i$-th sample.
The Macro-F1 is the macro-averaged F1 score, which first computes the F1 score (the harmonic mean of precision and recall) for each class and then takes the arithmetic mean across all classes, providing a balanced assessment of the model’s classification quality on multi-class tasks:
$$\mathrm{Precision}_c = \frac{TP_c}{TP_c + FP_c}, \qquad \mathrm{Recall}_c = \frac{TP_c}{TP_c + FN_c}$$
$$\mathrm{F1}_c = \frac{2\,\mathrm{Precision}_c\,\mathrm{Recall}_c}{\mathrm{Precision}_c + \mathrm{Recall}_c}, \qquad \text{Macro-F1} = \frac{1}{K}\sum_{c=1}^{K} \mathrm{F1}_c$$
where $TP_c$, $FP_c$, and $FN_c$ are the numbers of true positives, false positives, and false negatives for class $c$, respectively, and $K$ is the total number of classes. The Macro-F1 score assigns equal weight to each class, reflecting the model’s balanced performance across all categories, and is particularly suitable for evaluating classification on datasets with long-tailed distributions.
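The three metrics can be computed as in the following Python sketch, which uses only the standard library. The function names and the toy data are illustrative; assigning an F1 of 0 to a class with no true positives, false positives, or false negatives is a convention assumed here, not one stated in the text.

```python
def top_k_accuracy(probs, labels, k=1):
    """Fraction of samples whose true label is among the k most probable classes."""
    hits = 0
    for p, y in zip(probs, labels):
        ranked = sorted(range(len(p)), key=lambda c: p[c], reverse=True)
        hits += y in ranked[:k]
    return hits / len(labels)


def macro_f1(preds, labels, num_classes):
    """Unweighted mean of per-class F1, using F1_c = 2TP / (2TP + FP + FN)."""
    total = 0.0
    for c in range(num_classes):
        tp = sum(1 for p, y in zip(preds, labels) if p == c and y == c)
        fp = sum(1 for p, y in zip(preds, labels) if p == c and y != c)
        fn = sum(1 for p, y in zip(preds, labels) if p != c and y == c)
        denom = 2 * tp + fp + fn
        total += 2 * tp / denom if denom else 0.0  # assumed convention: empty class -> 0
    return total / num_classes


# Toy example: 4 samples, 3 classes.
probs = [[0.70, 0.20, 0.10],
         [0.10, 0.60, 0.30],
         [0.25, 0.40, 0.35],
         [0.20, 0.30, 0.50]]
labels = [0, 1, 2, 2]
preds = [max(range(3), key=lambda c: p[c]) for p in probs]

top1 = top_k_accuracy(probs, labels, k=1)
top2 = top_k_accuracy(probs, labels, k=2)
f1 = macro_f1(preds, labels, num_classes=3)
```

The expression $2TP_c / (2TP_c + FP_c + FN_c)$ is algebraically identical to the harmonic mean of precision and recall and avoids separate zero-division checks for the two intermediate quantities.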
4.3. Single-Modality Recognition Performance Comparison
To validate the effectiveness of the image and audio classification models adopted in this work, comparative experiments are conducted under single image modality and single-audio modality conditions against mainstream deep learning models. All compared models are initialized with ImageNet-pretrained weights and fine-tuned on the SSW60 dataset.
For the image classification experiment, ResNet-50, VGG-16, and MobileNetV3-Large are selected as baseline models. ResNet-50 is a widely adopted deep residual network that mitigates the vanishing gradient problem through residual connections. VGG-16 is a representative early deep CNN characterized by its straightforward architecture of stacked 3 × 3 convolutions. MobileNetV3-Large is a compact network designed for resource-constrained devices. All three baselines share the same training configuration as the proposed method: ImageNet-pretrained initialization, AdamW optimizer, initial learning rate of $10^{-4}$, cosine annealing schedule, and 15 training epochs. The performance comparison of image classification models is presented in Table 2.
As shown in Table 2, the EfficientNet-B3 model adopted in this work achieves the best performance across all three metrics. Compared with VGG-16, EfficientNet-B3 improves Top-1 accuracy by 7.74 percentage points, Top-5 accuracy by 2.44 percentage points, and Macro-F1 score by 8.42 percentage points, while containing only 8.8% of the parameters of VGG-16, demonstrating the superior balance between computational efficiency and performance afforded by compound scaling. Compared with ResNet-50, EfficientNet-B3 achieves a 1.80 percentage-point gain in Top-1 accuracy with approximately 52% fewer parameters, supporting the advantage of the EfficientNet family for fine-grained bird classification. MobileNetV3-Large, despite its highly compact design with only 4.28 M parameters and 0.23 GFLOPs, achieves a Top-1 accuracy of only 84.34%, which is 7.21 percentage points lower than EfficientNet-B3. This suggests that heavily compressed networks struggle with fine-grained classification, and that EfficientNet-B3 strikes the best balance between accuracy and efficiency among the compared image classifiers.
The compound scaling strategy of EfficientNet-B3 jointly optimizes network depth, width, and input resolution, endowing the model with stronger feature representation under a limited computational budget. The SE attention mechanism enables adaptive enhancement of discriminative feature channels while suppressing redundant information, making EfficientNet-B3 particularly effective for fine-grained feature extraction from bird images.
For the audio classification experiment, the Audio Spectrogram Transformer (AST), VGG-16, and EfficientNet-B3 are selected as baselines. AST is a representative model that adapts the Vision Transformer architecture to audio classification, modeling global dependencies in spectrograms through self-attention. VGG-16, as a classical CNN architecture, is also applicable to spectrogram classification. Both the proposed method and VGG-16 replicate the single-channel log-Mel spectrogram to three channels to match the ImageNet-pretrained input format, whereas AST uses its standard single-channel spectrogram input. All baselines share the same training configuration: pretrained weight initialization, AdamW optimizer, initial learning rate of $10^{-4}$, cosine annealing schedule, and 30 training epochs. The performance comparison of audio classification models is presented in Table 3.
As shown in Table 3, the ResNet-50 model adopted in this work achieves the best performance on the audio classification task. Compared with VGG-16, ResNet-50 improves Top-1 accuracy by 14.72 percentage points, Top-5 accuracy by 8.30 percentage points, and Macro-F1 score by 15.15 percentage points, a substantial improvement. Notably, ResNet-50 contains only 18.5% of the parameters of VGG-16, exhibiting clear advantages in both computational efficiency and classification performance. Compared with AST pretrained on AudioSet, the ResNet-50 model, using only ImageNet pretraining, still achieves gains of 4.91, 4.11, and 5.04 percentage points in Top-1 accuracy, Top-5 accuracy, and Macro-F1, respectively. These results indicate that, for the SSW60 bird audio classification task, the ResNet-50 architecture outperforms the Transformer-based AST model. When EfficientNet-B3 is applied to Mel spectrogram classification, it achieves a Top-1 accuracy of 67.88%, which is 0.32 percentage points lower than ResNet-50, with a 0.70 percentage-point gap on Macro-F1. Although the two accuracies are close, ResNet-50 runs more than twice as fast, at 4.06 ms versus 8.42 ms per sample. This observation supports our choice of ResNet-50 over EfficientNet-B3 for the audio branch, where its residual structure handles the time-frequency patterns of Mel spectrograms more efficiently.
CNNs such as ResNet-50 incorporate inductive biases including translation invariance, which align well with the local time–frequency patterns present in audio spectrograms. The characteristic elements of bird vocalizations—such as harmonics at specific frequencies and repetitive syllable patterns—are inherently local and repetitive, and CNNs are effective at capturing such patterns. In contrast, Transformers may require substantially larger datasets to learn comparable feature representations.
ResNet-50 effectively mitigates the vanishing gradient problem through residual connections, enabling stable training of a 50-layer network. Residual learning facilitates the learning of identity mappings, which helps preserve low-level feature information. In contrast, although VGG-16 has a simple architecture, its 16-layer depth limits feature representation capacity, and the absence of residual connections leads to lower training efficiency.
In summary, the EfficientNet-B3 and ResNet-50 models selected in this work achieve superior classification performance over baseline methods in their respective modalities, establishing high-quality single-modality prediction foundations for the subsequent multimodal fusion.
Although EfficientNet-B3 records the highest inference time among the compared image models at 8.44 ms, this remains at the millisecond level and translates to over 100 frames per second, which well exceeds the real-time requirements of practical bird monitoring video pipelines.
4.4. Validation of Multimodal Fusion Effectiveness
To validate the effectiveness of the proposed multimodal fusion method, multiple comparative experiments are designed and analyzed from two perspectives: comparison of fusion strategies and overall multimodal fusion effectiveness. The compared methods include:
(1) Image-Only: only the prediction of the image branch (EfficientNet-B3) is used as the final output, without modal fusion.
(2) Audio-Only: only the prediction of the audio branch (ResNet-50) is used as the final output, without modal fusion.
(3) Entropy Fusion: fusion weights are computed based on the information entropy of each modality’s predicted probability distribution. Lower entropy indicates a more certain prediction and is assigned a higher fusion weight. The weight is computed as:
where $H_{\mathrm{img}}$ and $H_{\mathrm{aud}}$ are the Shannon entropies of the predicted probability distributions of the image and audio modalities, respectively.
(4) Gap Fusion: fusion weights are computed based on the probability gap between the Top-1 and Top-2 classes in each modality’s predicted distribution. A larger gap indicates a more certain prediction and receives a higher weight:
(5) Trusted Multi-View Classification (TMC) [13]: a representative learning-based uncertainty fusion method that converts each branch’s output into Dirichlet evidence and combines the two branches via Dempster-Shafer theory. TMC requires modifying both branches to output evidence and end-to-end retraining with evidential losses.
(6) Combined Fusion (proposed method): jointly considers both entropy and probability gap as confidence indicators, computing fusion weights through weighted aggregation.
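The three confidence-based strategies can be illustrated with the following Python sketch. The exact weight formulas are those of Section 3.4 and are not reproduced here; the sketch therefore makes several assumptions that should be read as placeholders: entropy confidence is taken as one minus the normalized Shannon entropy, gap confidence as the Top-1 minus Top-2 probability, the two are mixed linearly with the coefficient β (β = 0 for pure entropy, β = 1 for pure gap), and the image weight α is the image confidence divided by the sum of both confidences. All function names are hypothetical.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (numerically stabilized)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_confidence(p):
    """Assumed form: 1 - H(p) / log K, so 1 means certain and 0 means uniform."""
    h = -sum(q * math.log(q) for q in p if q > 0)
    return max(0.0, 1.0 - h / math.log(len(p)))


def gap_confidence(p):
    """Probability gap between the Top-1 and Top-2 classes."""
    top1, top2 = sorted(p, reverse=True)[:2]
    return top1 - top2


def fuse(p_img, p_aud, beta=0.5, eps=1e-12):
    """Return (alpha, fused distribution); alpha is the image weight.

    beta = 0 gives entropy fusion, beta = 1 gap fusion, and 0.5 the
    combined setting. The linear mixing and the normalization are
    assumptions of this sketch.
    """
    def confidence(p):
        return (1 - beta) * entropy_confidence(p) + beta * gap_confidence(p)

    c_img, c_aud = confidence(p_img), confidence(p_aud)
    alpha = (c_img + eps) / (c_img + c_aud + 2 * eps)
    fused = [alpha * a + (1 - alpha) * b for a, b in zip(p_img, p_aud)]
    return alpha, fused


# An uncertain (and wrong) image prediction paired with a confident audio one.
alpha, fused = fuse([0.40, 0.38, 0.22], [0.05, 0.90, 0.05])
```

In this toy case the image prediction is nearly flat, so its weight falls well below 0.5 and the fused decision follows the confident audio prediction, which is the correction behavior the fusion strategies are designed to provide. No parameters are learned at any point, in line with the training-free design described in the text.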
The performance comparison of different fusion strategies is presented in Table 4.
As shown in Table 4, all fusion strategies achieve substantially better classification performance than single-modality methods. Compared with the image-only baseline, the combined fusion strategy improves Top-1 accuracy from 91.55% to 95.30%, a gain of 3.75 percentage points, and Macro-F1 score by 4.32 percentage points. This indicates that even though the standalone accuracy of the audio modality (68.20%) is substantially lower than that of the image modality (91.55%), a well-designed fusion strategy can still extract beneficial complementary information from the audio data to further improve overall performance. The comparison between multimodal and single-modality results validates the effectiveness of the proposed fusion strategy.
We additionally compare against TMC [13], a representative learning-based uncertainty fusion method. TMC achieves 95.09% Top-1 accuracy, which is 0.21 percentage points lower than the proposed Combined Fusion at 95.30%, with a 0.37 percentage-point gap on Macro-F1. Notably, TMC requires modifying both branches to output Dirichlet evidence and end-to-end retraining with evidential losses, whereas the proposed fusion module introduces no trainable parameters and can be applied directly on top of independently trained classifiers. This contrast highlights that the proposed dual-indicator confidence captures modality reliability in a parameter-free yet effective manner, achieving competitive accuracy without the architectural and training overhead of evidential fusion.
The combined fusion strategy adopted in this work achieves the best performance in both Top-1 accuracy and Macro-F1 score. Compared with entropy fusion, the combined strategy yields a 0.12 percentage-point gain in Top-1 accuracy; compared with gap fusion, the gain is 0.10 percentage points. These margins are small: all three confidence-based adaptive fusion strategies achieve comparable and strong performance, with Top-1 accuracies exceeding 95%. The entropy indicator, grounded in information theory, measures the overall uncertainty of the predicted distribution and performs slightly better on Top-5 accuracy. By considering entropy and probability gap simultaneously, the combined strategy provides a more comprehensive assessment of each modality’s prediction confidence and attains the best Top-1 accuracy and Macro-F1 score, which suggests that the two indicators are complementary.
Figure 5 shows the histograms of fusion weight distributions for the three strategies on the test set. The entropy fusion weights exhibit a pronounced left-skewed (negatively skewed) distribution, with a large number of samples concentrated in the 0.9–1.0 interval and a mean weight of 0.7946, indicating that the image modality dominates the vast majority of samples. This extreme distribution stems from the entropy indicator’s sensitivity to prediction certainty. However, it also restricts the contribution of the audio modality: even when the image prediction contains subtle errors, the audio information can barely influence the fusion decision. In contrast, the gap fusion and combined fusion strategies yield more balanced weight distributions that approximate a normal shape. The mean weight decreases to 0.6650 for gap fusion and further to 0.6344 for combined fusion, with both distributions primarily concentrated in the 0.4–0.8 interval. This gives the audio modality a larger share of the fusion weight, allowing it to contribute more effectively when the image prediction confidence is low. By integrating both entropy and probability gap indicators, the combined fusion strategy maintains sensitivity to high-confidence predictions while preventing excessive weight concentration on a single modality.
The performance gains from multimodal fusion primarily stem from two types of samples. The first type consists of samples where the image modality predicts incorrectly but the audio modality predicts correctly; when image quality is compromised (e.g., due to insufficient lighting, occlusion, or motion blur), the fusion strategy can leverage audio information to correct erroneous predictions. The second type consists of samples where both modalities are not fully confident yet point to the same correct class; in this case, the fused probability distribution becomes more concentrated, improving the accuracy of the final decision. Through more balanced weight allocation, the combined fusion strategy effectively handles both types of samples, thereby achieving the best fusion performance.
4.5. Sensitivity Analysis on the Confidence Combination Coefficient
To analyze how the relative weighting of the entropy and probability gap indicators affects fusion performance, this subsection presents a sensitivity analysis on the balancing coefficient β. As defined in Section 3.4.2, β ∈ [0, 1] controls the contribution ratio of the two indicators, with β = 0 corresponding to pure entropy, β = 1 to pure probability gap, and β = 0.5 to the default Combined Fusion setting. With all other configurations held fixed, β is evaluated across {0.00, 0.25, 0.50, 0.75, 1.00}; results are summarized in Table 5.
The fused Top-1 accuracy remains within a narrow range of 95.22–95.28% across the entire β interval, with a fluctuation of only 0.06 percentage points; Macro-F1 exhibits comparable stability. This demonstrates that the proposed fusion strategy is essentially insensitive to the choice of β. Among all settings, β = 0.5 achieves the joint highest Top-1 and the best Macro-F1, indicating that the two indicators capture complementary aspects of prediction reliability and that an equal-weight combination exploits both effectively. The average fusion weight α decreases monotonically from 0.6656 to 0.6103 as the strategy shifts from entropy-dominant to gap-dominant. Based on these observations, β is fixed at 0.5 for all fusion experiments in this work.
6. Conclusions
This paper proposes a confidence–adaptive audiovisual recognition method for fine-grained bird species classification. By integrating complementary information from both visual and auditory modalities, the proposed method effectively overcomes the inherent limitations of single-modality approaches and achieves accurate bird species classification. The main contributions and conclusions are summarized as follows.
(1) An image classification branch based on EfficientNet-B3 was constructed. This branch fully exploits the compound scaling strategy to jointly optimize network depth, width, and input resolution, and incorporates the SE attention mechanism to adaptively enhance discriminative feature channels. It achieves a Top-1 accuracy of 91.55% on the SSW60 dataset. Compared with ResNet-50 and VGG-16, EfficientNet-B3 attains superior classification performance with fewer parameters, confirming the advantage of neural architecture search-based efficient networks for fine-grained bird classification.
(2) An audio classification branch based on ResNet-50 was designed. This branch converts bird vocalizations into Mel spectrograms, extracts acoustic features through the residual network, and employs a dense sampling inference strategy to cover vocalization information across different temporal segments. Experimental results show that ResNet-50 achieves a Top-1 accuracy of 68.20% on the audio classification task, outperforming AST [17] and VGG-16. This finding suggests that, on medium-scale datasets, CNN architectures with convolutional inductive biases generalize better than Transformers.
(3) A confidence–adaptive fusion strategy was proposed. This strategy jointly considers information entropy and probability gap to assess the reliability of each modality’s prediction from complementary perspectives and dynamically computes fusion weights accordingly. Compared with entropy fusion and gap fusion, the combined strategy yields a more balanced weight distribution, maintaining sensitivity to high-confidence predictions while granting greater contribution to the audio modality. Experimental results demonstrate that the combined fusion strategy achieves a Top-1 accuracy of 95.30% and a Macro-F1 score of 95.12%, representing improvements of 3.75 and 4.32 percentage points over the image-only baseline, respectively. This strategy requires no trainable parameters and incurs negligible computational overhead, offering strong practicality.
Beyond addressing the limitations discussed in Section 5.2, two directions are particularly worth pursuing. First, given that real-world bird monitoring data are typically collected from geographically distributed cameras and microphones operated by different organizations, the resulting data exhibit pronounced non-IID characteristics, where the class distribution and modality availability can vary substantially across clients. Federated learning frameworks tailored to such non-IID settings are therefore particularly promising. At the platform level, the FedBirdAg platform proposed by Benhoussa et al. [38] provides a representative reference for low-energy federated training of bird-recognition models on distributed wireless smart cameras. At the algorithm level, addressing the label-skew and client-heterogeneity challenges inherent to such deployments would benefit from incorporating recent advances such as FedLC [39], which mitigates label-distribution skew via logits calibration, and FedProto [40], which enables federated prototype learning across heterogeneous clients. Adapting these paradigms to the audiovisual fusion setting studied here, particularly to handle clients that may possess different modality subsets or different species coverage, constitutes a promising direction for future investigation. Second, extending the proposed method to handle missing-modality scenarios, through modality-dropout training, modality-aware confidence calibration, or generative modality completion, would further improve its robustness in practical deployment, where one of the two modalities may be unavailable or severely degraded.