An Adaptive Audiovisual Fusion Method Based on Prediction Confidence for Fine Granularity Bird Species Recognition

Xu, Xinliang; Liu, Qiming; Wen, Xin; Zhao, Heng; Wang, Zhenhao; Wang, Chong

doi:10.3390/app16105113

by Xinliang Xu ¹, Qiming Liu ², Xin Wen ¹, Heng Zhao ¹, Zhenhao Wang ² and Chong Wang ²,*

¹ Harbin Power Supply Company, State Grid Heilongjiang Electric Power Co., Ltd., Harbin 150090, China

² School of Electrical Engineering, Northeast Electric Power University, Jilin 132012, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(10), 5113; https://doi.org/10.3390/app16105113

Submission received: 12 April 2026 / Revised: 14 May 2026 / Accepted: 14 May 2026 / Published: 20 May 2026

(This article belongs to the Special Issue AI-Based Supervised Prediction Models)

Abstract

To address the inherent limitations of single-modality approaches in fine-grained bird species recognition, this paper proposes an adaptive audiovisual fusion method based on prediction confidence. The proposed framework comprises three core components: an image classification branch, an audio classification branch, and a confidence–adaptive fusion module. The image branch employs EfficientNet-B3 to extract fine-grained visual features through compound scaling and squeeze-and-excitation (SE) attention. The audio branch utilizes ResNet-50 to classify Mel spectrograms converted from bird vocalizations, incorporating a dense sampling inference strategy to fully exploit complete audio information. For multimodal integration, a confidence–adaptive fusion strategy is introduced that jointly considers information entropy and probability gap to dynamically assess the reliability of each modality’s prediction, thereby assigning fusion weights at the sample level without any additional trainable parameters. Experiments on the SSW60 multimodal bird recognition dataset show that the image branch achieves a Top-1 accuracy of 91.55%, outperforming ResNet-50 (89.75%) and VGG-16 (83.81%); the audio branch reaches 68.20%, surpassing AST (63.29%) and VGG-16 (53.48%); and the fused model attains 95.30% Top-1 accuracy, a 3.75 percentage-point improvement over the image-only baseline and a 0.21 percentage-point gain over the learning-based TMC fusion baseline without introducing any trainable parameters, confirming the effectiveness of the proposed method.

Keywords: bird species recognition; multimodal fusion; confidence estimation; EfficientNet; ResNet; information entropy

1. Introduction

Power transmission infrastructure constitutes the foundation of reliable electricity distribution across regions. In recent years, however, bird-related incidents on transmission lines have become increasingly frequent, encompassing fault types such as flashovers caused by bird droppings, short circuits due to bird contact, and contamination of equipment. These events severely compromise the safe and stable operation of transmission networks. Statistical analyses indicate that bird-related accidents account for approximately 15% to 35% of total transmission line faults in certain regions, resulting in considerable economic losses and supply disruptions, with this proportion continuing to rise [1,2,3]. Although recent studies have applied deep learning methods to detect bird-related hazards on transmission lines, including bird nest detection from UAV imagery [4], fine-grained species recognition, which is essential for differentiated prevention measures, remains a challenging open problem.

The severity and frequency of bird-induced faults are closely associated with species characteristics, temporal patterns, and geographical conditions. Large raptors whose wingspans exceed safety clearances pose direct fault risks, whereas smaller species may trigger cumulative faults through nesting, droppings accumulation, and roosting behaviors. Conventional mitigation strategies—including visual deterrent devices, physical barriers, and periodic manual inspections—suffer from limited effectiveness and high labor costs, and lack species-specific differentiated prevention measures. As the power grid continues to expand, there is a growing demand for precise prevention and control of bird-related faults. Advanced bird monitoring and early warning methods can provide avian population data to grid operation and maintenance personnel and offer predictive alerts for potential incidents, thereby furnishing essential support for differentiated fault prevention [5].

Beyond power grid applications, accurate bird species recognition holds broader scientific and ecological significance. Birds are well-established indicators of biodiversity and ecosystem health, with their long-term population trends widely used to assess habitat quality and global environmental change [6]. Automated species recognition further enables large-scale passive acoustic monitoring, supporting biodiversity conservation, ecological research, and citizen science initiatives [7]. The same underlying technical capability, namely accurate fine-grained bird species recognition under field conditions, therefore underpins both ecological monitoring and industry-specific applications such as bird-strike prevention on power transmission lines.

In the field of bird recognition, deep learning has substantially advanced both visual and acoustic identification approaches. On the visual side, fine-grained classification methods have been widely employed to distinguish closely related species with subtle inter-class differences: recent works based on attention mechanisms and Vision Transformers [8] have achieved competitive performance on public benchmarks such as NABirds and CUB; meanwhile, transfer learning approaches built on efficient convolutional networks like EfficientNet have demonstrated a favorable balance between accuracy and efficiency for bird species classification [9]. On the acoustic side, deep learning models such as BirdNET have enabled the recognition of nearly a thousand bird species from passive acoustic recordings under field conditions [7]. Nevertheless, single-modality recognition methods exhibit inherent limitations: image-based recognition struggles under adverse weather, nighttime conditions, and field-of-view blind spots, while audio-based recognition faces challenges from background noise and multi-source interference, and cannot provide spatial location information or identify silent individuals. Therefore, multimodal fusion recognition strategies that integrate complementary information offer a more accurate and practically valuable approach to bird species identification.

In summary, although considerable progress has been made in both image-based and acoustic-based bird recognition, several limitations remain. First, single-modality methods are inadequate for complex and variable field environments. Second, while multimodal fusion has matured into systematic paradigms (including early, feature-level, and decision-level fusion) in adjacent domains such as autonomous driving, medical imaging, and remote sensing [10], existing fusion methods in bird recognition still predominantly rely on fixed weights or feature concatenation [11,12], failing to adaptively adjust according to the reliability of each modality’s prediction. Third, although recent learning-based fusion methods have introduced uncertainty estimation [13], they typically require additional trainable parameters and complex evidential modeling, and may suffer from weight degeneration toward a single modality. These limitations motivate the present work: to develop a parameter-free, sample-level adaptive fusion strategy guided by two complementary confidence indicators (information entropy and probability gap) for the typical fine-grained task of bird species recognition.

As illustrated in Figure 1, the central challenge in field bird species recognition is that the visual and acoustic modalities offer informative evidence under different conditions, but rarely both at once. In daytime open-view scenarios (Figure 1, Scenario A), the image modality provides distinctive visual cues, while the audio may be silent or noisy when the bird is at rest. Conversely, under dusk or partial occlusion (Figure 1, Scenario B), the image becomes degraded while a vocalizing bird still produces a distinctive acoustic signature. Existing fusion approaches that rely on fixed weighting or that require an additional trained fusion module cannot adapt to this sample-level variation in modality reliability. This work addresses the gap with a parameter-free, confidence–adaptive fusion strategy that judges, on each sample, which modality is currently more trustworthy, and lets the more reliable one lead the final prediction.

This paper proposes an audiovisual bird recognition method based on EfficientNet-B3 and ResNet-50, achieving accurate fine-grained species classification by leveraging the complementary strengths of visual and acoustic modalities. For the image branch, EfficientNet-B3 is adopted as the classifier; this model achieves an effective balance between computational efficiency and classification performance through compound scaling and the SE attention mechanism [14]. For the audio branch, ResNet-50 is employed to extract features from Mel spectrograms of bird vocalizations and perform classification [15]. To address the multimodal fusion problem, a confidence–adaptive fusion strategy is proposed that jointly considers information entropy and probability gap to dynamically compute fusion weights for each modality, enabling sample-level adaptive decision-making without additional trainable parameters. The experiments are conducted on the SSW60 multimodal bird recognition dataset [16], which contains image and audio data of 60 North American bird species and serves as a standard benchmark for audio–visual fine-grained classification research. The main contributions of this paper are as follows:

(1) An image classification branch based on EfficientNet-B3 is constructed, which fully exploits compound scaling and the SE attention mechanism to extract fine-grained features from bird images. On the SSW60 dataset, this branch achieves a Top-1 accuracy of 91.55%, outperforming conventional CNN models such as ResNet-50 and VGG-16.

(2) An audio classification branch based on ResNet-50 is designed, which converts bird vocalizations into Mel spectrograms and classifies them via the residual network, combined with a dense sampling inference strategy to fully utilize complete audio information. This branch achieves an accuracy of 68.20% on the audio classification task, outperforming AST [17] and VGG-16.

(3) A confidence–adaptive fusion strategy is proposed that jointly considers information entropy and probability gap to dynamically assess the reliability of each modality’s prediction. This strategy requires no trainable parameters and adaptively adjusts fusion weights based on the prediction confidence of each modality for the current sample, ultimately achieving a multimodal fusion accuracy of 95.30%, representing a 3.75 percentage-point improvement over the single-image modality.

The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 presents the proposed multimodal fusion recognition method, including the image classification branch, the audio classification branch, and the confidence–adaptive fusion strategy. Section 4 describes the experimental setup, evaluation metrics, and results analysis. Section 5 concludes this paper.

2. Related Work

This section reviews related work along four directions relevant to the proposed method: foreign object detection on transmission lines, bird image recognition, bird acoustic recognition, and multimodal fusion approaches; a concise statement of differentiation from prior work is then provided in Section 2.5.

2.1. Foreign Object Detection on Transmission Lines

Deep learning has provided effective solutions to the problem of foreign object detection on transmission lines. Wang et al. [18] proposed an improved YOLOv8m-based detection model for transmission lines by replacing the SPPF module with an SPPCSPC module, thereby enhancing multi-scale feature extraction capability and achieving accurate recognition of multiple foreign object categories on a Yunnan Power Grid dataset. Zheng et al. [19] introduced the GEB-YOLO model to tackle the challenges of diversity and scale variation in transmission line foreign object detection, enabling efficient detection under complex backgrounds through the fusion of multi-scale features and attention mechanisms. Li et al. [20] developed a YOLOv8-based detection algorithm incorporating weighted spatial attention, which effectively improves recognition accuracy by dynamically learning inter-feature dependencies. Chen et al. [21] constructed a dataset covering common foreign objects found on both railway and transmission lines and benchmarked the performance of mainstream detection models. Nan et al. [22] applied the RB-UNet semantic segmentation network to foreign object detection, achieving an mIoU of 88.43% on a dataset containing multiple object categories. Wu et al. [23] proposed an oriented bounding box regression method that effectively addresses the small-object detection problem. Qiu et al. [24] presented a multi-generative-model-based detection algorithm to overcome the challenge of scarce foreign object training data. However, the aforementioned studies rely predominantly on image-based monitoring, and their detection capabilities remain limited under adverse weather, nighttime conditions, and field-of-view blind spots.

2.2. Bird Image Recognition

Numerous deep-learning-based methods have been proposed for bird image recognition. Yuan et al. [25] developed an improved YOLOv8 bird detection model that achieves 94.8% accuracy on a Poyang Lake bird dataset. Zhang et al. [26] proposed the DC-YOLO model, which accurately detects bird targets and triggers bird deterrent devices in a timely manner. Li et al. [27] introduced the YOLO-Bird algorithm, combining the lightweight C2f-HLB feature extraction module with a small-object feature enhancement strategy to effectively improve the detection performance for small- and medium-sized birds. Kim et al. [28] benchmarked five models—including Faster R-CNN, R-FCN, and SSD—for bird detection in unmanned aerial vehicle imagery.

In fine-grained bird classification, the architectural design of convolutional neural networks (CNNs) plays a critical role. The EfficientNet family proposed by Tan et al. [14] simultaneously optimizes network depth, width, and input resolution through a compound scaling strategy, achieving superior performance on the ImageNet classification benchmark with substantially fewer parameters. ResNet, introduced by He et al. [15], effectively alleviates the vanishing gradient problem in deep networks via residual connections and has become a cornerstone architecture in image classification. The emergence of the Vision Transformer (ViT) has opened new avenues for fine-grained classification. Li et al. [29] proposed an improved ViT bird recognition model incorporating counterfactual reasoning, achieving 92.1% accuracy on the corresponding dataset. Park et al. [30] further enhanced fine-grained bird recognition by integrating ViT with multi-scale patch selection. These studies demonstrate that attention-based models can effectively capture discriminative local details in bird images, making them well suited for species classification tasks where inter-class differences are subtle.

2.3. Bird Acoustic Recognition

Acoustic recognition serves as an important complementary means for bird monitoring. Xie et al. [31] proposed a multi-feature fusion Transformer model that extracts log-Mel and MFCC features, achieving recognition accuracies of 93.8% and 92.1% on the Birdsdata and CBC datasets, respectively. Carvalho et al. [32] combined MFCC and Mel spectrogram features with deep learning architectures to achieve high-accuracy recognition of multiple bird species. Yang et al. [33] introduced a ResNet-based multi-feature attention detection method that suppresses noise interference through attention modules, attaining 93.2% accuracy on an urban audio classification task. The Audio Spectrogram Transformer (AST) proposed by Gong et al. [34] adapts the ViT architecture to audio classification by modeling global dependencies in spectrograms via self-attention, achieving strong performance on large-scale benchmarks such as AudioSet. However, methods relying solely on audio signals have inherent limitations: they cannot provide spatial location information and fail to identify individuals in a non-vocalizing state. In addition, acoustic detection faces challenges including equipment noise, wind noise, and multi-source interference.

2.4. Multimodal Fusion for Bird Recognition

Given the inherent drawbacks of single-modality recognition, multimodal fusion strategies integrate complementary information sources to provide more robust solutions for bird identification. Bold et al. [11] proposed a cross-domain deep feature fusion method combining visual and acoustic data for bird species classification, demonstrating the performance advantage of multimodal fusion over single-modality approaches. Gavali et al. [12] achieved higher recognition accuracy and environmental adaptability by combining image and audio data. Van Horn et al. [16] released the SSW60 multimodal bird recognition dataset, which contains image, audio, and video data of 60 North American bird species and provides a standard benchmark for audio–visual fusion research.

Regarding fusion strategies, existing methods fall mainly into two categories: early fusion and late fusion. Early fusion concatenates feature vectors from each modality and feeds them into a shared classifier for joint learning, which captures cross-modal interaction features but requires additional trainable parameters for the fusion module [11]. Late fusion performs a weighted combination of each modality classifier’s output; it is straightforward to implement and allows each branch to be optimized independently, but typically relies on fixed weights that cannot be dynamically adjusted according to sample-specific characteristics [12]. In recent years, adaptive fusion methods based on uncertainty estimation have attracted increasing attention. Gal et al. [34] demonstrated that the predictive uncertainty of deep neural networks can be estimated through metrics such as information entropy, providing a reliability measure for fusion decisions. Building on this idea, this paper proposes a confidence–adaptive fusion strategy that jointly considers information entropy and probability gap to dynamically allocate fusion weights, enabling sample-level adaptive decision-making.

2.5. Differentiation from Existing Approaches

Building on the literature reviewed above, this subsection explicitly distinguishes the proposed method from existing approaches in three respects. First, in contrast to feature concatenation [11] and fixed-weight fusion [12] commonly used in audiovisual bird recognition, the proposed method dynamically allocates fusion weights based on each sample’s prediction confidence, without requiring validation-set tuning. Second, unlike learning-based uncertainty fusion methods such as evidential trusted multi-view classification [13], the proposed fusion module introduces no trainable parameters, thereby avoiding weight degeneration toward a single modality and reducing deployment complexity. Third, instead of relying on a single confidence indicator, the proposed method jointly exploits two complementary indicators, namely information entropy (capturing overall distribution uncertainty) and the Top-1/Top-2 probability gap (capturing decision decisiveness), to construct a more robust sample-level confidence estimate. This design adapts the mature paradigm of confidence–aware fusion from adjacent domains [10] to the specific characteristics of fine-grained audiovisual bird species recognition.

3. Proposed Multimodal Bird Recognition Method

To overcome the inherent limitations of single-modality methods in fine-grained bird species recognition, this paper proposes a confidence–adaptive audiovisual recognition method. The proposed method integrates complementary information from both visual and acoustic modalities and dynamically evaluates the reliability of each modality’s prediction via information entropy and probability gap, thereby enabling sample-level adaptive fusion.

3.1. Overview of the Multimodal Fusion Framework

The overall framework of the proposed method is illustrated in Figure 2. It consists of three components: an image classification branch, an audio classification branch, and a confidence–adaptive fusion module. The image branch employs EfficientNet-B3 to extract visual features and output class prediction probabilities. The audio branch utilizes ResNet-50 to extract acoustic features from Mel spectrograms of bird vocalizations and produce class prediction probabilities. The fusion module computes the confidence of each modality’s prediction based on information entropy and probability gap, dynamically assigns fusion weights accordingly, and produces an adaptive weighted fusion of the two modalities’ outputs.

Compared with existing audiovisual fusion methods, the proposed approach has the following characteristics: (1) the fusion module requires no trainable parameters, thereby avoiding the weight degeneration problem inherent in learning-based fusion; (2) fusion weights are computed dynamically at the sample level, adapting to the prediction confidence of each modality on the current sample and fully exploiting inter-modal complementarity; and (3) the information-theoretic fusion strategy possesses a clear theoretical foundation and good interpretability.

As illustrated in Figure 2, the end-to-end inference pipeline proceeds in six steps:

Step 1: Input. A paired sample, consisting of a bird image and a bird audio recording, is fed into the system.

Step 2: Preprocessing. The image is resized to 300 × 300 and normalized using ImageNet statistics (Section 3.2.1). In parallel, the audio waveform is resampled to 16 kHz, transformed into a log-Mel spectrogram via STFT and a 128-band Mel filter bank, replicated along the channel dimension, and resized to 224 × 224 (Section 3.3.1).

Step 3: Feature extraction. The preprocessed image is processed by EfficientNet-B3, which leverages compound scaling and SE attention to extract discriminative fine-grained visual features (Section 3.2.2). The preprocessed spectrogram is processed by ResNet-50 with a dense sampling inference strategy (multiple temporal windows) to fully exploit complete audio information (Section 3.3.3).

Step 4: Probability output. Each branch produces a softmax probability distribution over the K bird species: p_img for the image branch and p_aud for the audio branch (Section 3.4.1).

Step 5: Confidence estimation. For each modality, two complementary confidence indicators are computed from its probability distribution: the Shannon entropy H_m (capturing overall distribution uncertainty) and the Top-1/Top-2 probability gap G_m (capturing decision decisiveness). The two indicators are linearly combined into a unified composite confidence C_m (Section 3.4.1 and Section 3.4.2).

Step 6: Adaptive fusion and output. The composite confidences of the two modalities are normalized to obtain the image-modality fusion weight α, with the audio modality assigned 1−α. The weighted average of the two probability distributions p_final is computed, and the final predicted species ŷ is output (Section 3.4.2).

The entire pipeline requires no trainable parameters in the fusion module, enabling efficient sample-level adaptive decision-making at inference time.
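Steps 4 to 6 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the exact definition of the composite confidence is given in Section 3.4, so the normalized-entropy term, the mixing coefficient `lam`, and the function names below are assumptions made for the example.

```python
import math

def composite_confidence(probs, lam=0.5):
    """Hypothetical composite confidence C_m: lam * entropy term + (1 - lam) * gap term."""
    k = len(probs)
    # Shannon entropy H_m of the softmax output (natural log).
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    # Assumed normalization: 1 for a one-hot output, 0 for a uniform one.
    entropy_conf = 1.0 - entropy / math.log(k)
    # Top-1/Top-2 probability gap G_m.
    top1, top2 = sorted(probs, reverse=True)[:2]
    return lam * entropy_conf + (1.0 - lam) * (top1 - top2)

def fuse(p_img, p_aud, lam=0.5, eps=1e-12):
    """Return (alpha, p_final, y_hat) for one paired sample; no trainable parameters."""
    c_img = composite_confidence(p_img, lam)
    c_aud = composite_confidence(p_aud, lam)
    # Normalize the two confidences into the image weight alpha; audio gets 1 - alpha.
    alpha = (c_img + eps) / (c_img + c_aud + 2.0 * eps)
    p_final = [alpha * pi + (1.0 - alpha) * pa for pi, pa in zip(p_img, p_aud)]
    y_hat = max(range(len(p_final)), key=p_final.__getitem__)
    return alpha, p_final, y_hat
```

With a peaked image prediction and a nearly flat audio prediction, `alpha` moves toward 1 and the image branch leads the decision; swapping the two inputs moves it toward 0.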

Figure 2. Framework of the proposed confidence–adaptive audiovisual fusion method for fine-grained bird species recognition.

3.2. Bird Recognition Model Under Image Modality

The image branch comprises image data preprocessing and feature extraction with the EfficientNet-B3 backbone.

3.2.1. Image Data Preprocessing

The input to the image branch is an RGB image of the target bird. Input images are first resized to 300 × 300 pixels to match the standard input size of EfficientNet-B3 and then normalized using the channel-wise statistics of the ImageNet dataset. Let the input image be I ∈ ℝ^{H×W×3}; the normalization is defined as:

{\hat{I}}_{c} = \frac{I_{c} - μ_{c}}{σ_{c}}, c \in \{R, G, B\}

(1)

where μ_{c} = [0.485, 0.456, 0.406] and σ_{c} = [0.229, 0.224, 0.225] are the per-channel (R, G, B) mean and standard deviation of the ImageNet dataset, respectively. This normalization aligns the input distribution with that of the pretrained model, facilitating effective feature extraction via transfer learning.
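As a small worked example of Equation (1), the sketch below normalizes a single RGB pixel. It assumes that 8-bit pixel values are first scaled to [0, 1], which is what the magnitude of the ImageNet statistics implies but is not stated explicitly in the text; the function name is illustrative.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)  # per-channel mean (R, G, B)
IMAGENET_STD = (0.229, 0.224, 0.225)   # per-channel standard deviation (R, G, B)

def normalize_pixel(rgb_uint8):
    """Scale an 8-bit RGB pixel to [0, 1] (assumed), then apply Equation (1) per channel."""
    scaled = [v / 255.0 for v in rgb_uint8]
    return [(v - m) / s for v, m, s in zip(scaled, IMAGENET_MEAN, IMAGENET_STD)]
```

A pure white pixel maps to roughly 2.2 to 2.6 per channel and a black pixel to roughly -2.1 to -1.8, so the network sees inputs on the same scale as during ImageNet pretraining.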

To enhance the generalization capability of the model, the following data augmentation strategies are applied during training: random horizontal flipping with a probability of 0.5; random rotation within the range of ±15°; and color jittering that randomly adjusts brightness, contrast, and saturation. These augmentation strategies simulate variations in bird posture and illumination conditions encountered in real-world scenarios, effectively improving the model’s robustness to input perturbations.

3.2.2. EfficientNet-B3 Backbone Network

EfficientNet is a family of efficient CNNs designed through neural architecture search and compound scaling. Unlike conventional approaches that adjust network depth, width, or input resolution independently, EfficientNet simultaneously optimizes all three dimensions via a compound scaling coefficient ϕ, achieving an optimal trade-off between computational cost and model performance. The compound scaling strategy is defined as:

d = α_{1}^{ϕ}, w = β_{1}^{ϕ}, r = γ_{1}^{ϕ}

(2)

where d, w, and r denote the scaling factors for network depth, width, and resolution, respectively; α₁, β₁, and γ₁ are the base scaling coefficients subject to the constraint α₁·β₁²·γ₁² ≈ 2; and ϕ is a compound coefficient that controls the overall model scale.
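The constraint can be checked numerically. The sketch below assumes the base coefficients 1.2, 1.1, and 1.15 reported in the original EfficientNet paper [14]; these values are not given in this article and are used here only for illustration. Because the cost of a convolutional network grows roughly with d·w²·r², the constraint means that each unit increase in ϕ approximately doubles the FLOPs.

```python
# Assumed base coefficients from the EfficientNet grid search (not stated in this article).
ALPHA_1, BETA_1, GAMMA_1 = 1.2, 1.1, 1.15

def constraint_value():
    """Left-hand side of the constraint alpha_1 * beta_1^2 * gamma_1^2 (should be close to 2)."""
    return ALPHA_1 * BETA_1 ** 2 * GAMMA_1 ** 2

def flops_multiplier(phi):
    """Approximate FLOPs growth relative to the baseline for compound coefficient phi."""
    return constraint_value() ** phi
```

Under these assumed coefficients the constraint evaluates to about 1.92, so the approximate doubling per step is slightly conservative.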

EfficientNet-B3 corresponds to ϕ = 3, with a depth scaling factor of 1.4, a width scaling factor of 1.2, and an input resolution of 300 × 300. Compared with the baseline EfficientNet-B0, the B3 variant substantially enhances feature representation capability while maintaining high computational efficiency, making it well suited for fine-grained bird species classification.

The fundamental building block of EfficientNet is the Mobile Inverted Bottleneck Convolution (MBConv) [35]. As shown in Figure 3, an MBConv block first expands the input channels to k times the original dimensionality via a 1 × 1 convolution (expansion ratio, typically k = 6), then performs spatial feature extraction using depthwise separable convolution, and finally compresses the channels back to the original dimensionality through another 1 × 1 convolution. Additionally, MBConv incorporates the Squeeze-and-Excitation (SE) attention mechanism [36], which learns inter-channel dependencies through global average pooling and a two-layer fully connected network, adaptively recalibrating feature responses across channels. Let the input feature map be X ∈ ℝ^{H×W×C}; the SE module is computed as:

s = σ (W_{2} \cdot ReLU (W_{1} \cdot GAP (X)))

(3)

\tilde{X} = s ⊙ X

(4)

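where GAP(·) denotes global average pooling; W₁ and W₂ are the learnable weight matrices of the two fully connected layers, with r being the channel reduction ratio of the SE bottleneck (distinct from the resolution scaling factor r in Equation (2)); σ(·) is the sigmoid activation function; and ⊙ denotes element-wise multiplication.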
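Equations (3) and (4) amount to three operations: squeeze, excite, and scale. The following pure-Python sketch uses nested lists so that it runs without any array library; the function name and the toy weight shapes are illustrative assumptions, and bias terms are omitted as in Equation (3).

```python
import math

def se_block(x, w1, w2):
    """Squeeze-and-Excitation on x, a list of C channels, each an H x W nested list.

    w1 has shape (C_hidden x C) and w2 has shape (C x C_hidden), where C_hidden
    plays the role of the reduced width. Returns the recalibrated map and the gates s.
    """
    # Squeeze: global average pooling gives one scalar per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]
    # Excitation: W1 followed by ReLU, then W2 followed by a sigmoid (Equation (3)).
    h = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    s = [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, h)))) for row in w2]
    # Scale: multiply every spatial position of channel c by its gate s_c (Equation (4)).
    out = [[[s_c * v for v in row] for row in ch] for s_c, ch in zip(s, x)]
    return out, s
```

Each gate lies strictly between 0 and 1, so the block can only attenuate channels relative to one another; it never changes the spatial layout of the feature map.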

After feature extraction by the EfficientNet-B3 backbone, global average pooling compresses the spatial features into a one-dimensional vector, which is then mapped to a K-dimensional output space through a fully connected layer, where K is the number of bird species. Let the backbone output feature map be F ∈ ℝ^{H′×W′×C′}; the classification output is computed as:

z_{img} = W_{cls} \cdot GAP (F) + b_{cls}

(5)

where z_img ∈ ℝ^K denotes the unnormalized log-probabilities (logits) output by the image branch, and W_cls and b_cls are the weight and bias parameters of the classification layer.

3.3. Bird Recognition Model Under Audio Modality

The audio branch consists of audio data preprocessing, data augmentation, and feature extraction with the ResNet-50 backbone enhanced by a dense sampling inference strategy.

3.3.1. Audio Data Preprocessing

The input to the audio branch is the raw waveform of bird vocalizations. The audio is first resampled to a standard sampling rate of 16 kHz to unify the temporal resolution across different sources. Subsequently, the one-dimensional time-domain signal is converted into a two-dimensional time–frequency representation via the Short-Time Fourier Transform (STFT). Let the discrete-time audio signal be x(n); the STFT is defined as:

X (m, ω) = \sum_{n = - \infty}^{\infty} x (n) \cdot w (n - m H) \cdot e^{- j ω n}

(6)

where w(·) is the window function (a Hann window is used in this work), m is the frame index, H is the hop length, and ω is the angular frequency. The FFT size is set to 512 with a hop length of 128 samples, corresponding to a temporal resolution of approximately 8 ms.
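To make the STFT settings concrete, the sketch below computes the magnitude spectrum of a single frame with a naive discrete Fourier transform. It is written for clarity rather than speed (a real pipeline would use an FFT routine), the function names are illustrative, and the phase is referenced to the start of the frame rather than to absolute time as in Equation (6), which leaves the magnitudes unchanged.

```python
import cmath
import math

def hann(n_fft):
    """Periodic Hann window of length n_fft."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_fft) for n in range(n_fft)]

def stft_frame(x, m, n_fft=512, hop=128):
    """One-sided magnitude spectrum (n_fft // 2 + 1 bins) of frame m, via a naive O(N^2) DFT."""
    window = hann(n_fft)
    start = m * hop  # frame m begins m hop lengths into the signal
    frame = [x[start + n] * window[n] for n in range(n_fft)]
    return [
        abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_fft) for n in range(n_fft)))
        for k in range(n_fft // 2 + 1)
    ]
```

At 16 kHz, a 512-point FFT yields 257 frequency bins spaced 31.25 Hz apart, and a hop of 128 samples advances 8 ms per frame, so a 1 kHz tone peaks in bin 32.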

To emulate the nonlinear perceptual characteristics of the human auditory system across different frequencies, the linear frequency axis is converted to the Mel scale. The mapping between the Mel scale and linear frequency is given by:

f_{mel} = 2595 \cdot \log_{10} (1 + f / 700)

(7)

where f is the linear frequency in Hz and f_mel is the corresponding Mel frequency. A filter bank of 128 triangular filters is applied to map the power spectrum onto the Mel frequency domain, yielding the Mel spectrogram. A logarithmic transformation is then applied to convert multiplicative noise into additive noise while compressing the dynamic range:

M = 10 \cdot \log_{10} (\max (S_{mel}, ϵ))

(8)

where S_mel is the Mel spectrogram, ϵ = 10⁻¹⁰ is a numerical stability constant that prevents taking the logarithm of zero, M ∈ ℝ^{F×T} is the log-Mel spectrogram, F = 128 is the number of Mel frequency bands, and T is the number of time frames.
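Equations (7) and (8) can be illustrated as follows. The sketch covers only the frequency mapping and the logarithmic compression, not the construction of the triangular filters themselves; the frequency range of 0 Hz to 8 kHz (the Nyquist frequency at a 16 kHz sampling rate) and the helper names are assumptions for the example.

```python
import math

def hz_to_mel(f_hz):
    """Equation (7): linear frequency in Hz to Mel frequency."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(f_mel):
    """Inverse of Equation (7)."""
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)

def mel_band_edges(n_mels=128, f_min=0.0, f_max=8000.0):
    """n_mels + 2 edge frequencies in Hz, equally spaced on the Mel axis (assumed range)."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_mels + 2)]

def log_mel(s_mel, eps=1e-10):
    """Equation (8): decibel-scaled Mel spectrogram, clamped below at eps."""
    return [[10.0 * math.log10(max(v, eps)) for v in row] for row in s_mel]
```

Equal spacing on the Mel axis places the band edges densely at low frequencies and sparsely at high frequencies, and the clamp at ϵ bounds the output from below at -100 dB.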

To accommodate the ImageNet-pretrained ResNet-50, the single-channel log-Mel spectrogram is replicated three times along the channel dimension to form a three-channel feature tensor S = [M; M; M] ∈ ℝ^{3×F×T}, matching the expected input format of the network. This channel-replication operation is a common practice when adapting ImageNet-pretrained vision backbones to audio classification [17], allowing the input format to remain compatible with the pretrained weights without modifying the backbone architecture.

The spectrogram is then normalized to the [0, 1] range:

\hat{M} = \frac{M - \min (M)}{\max (M) - \min (M) + ϵ_{1}}

(9)

where ϵ₁ = 10⁻⁸ is a numerical stability constant to prevent division by zero. The normalized tensor is then resized to 224 × 224 pixels and standardized using the ImageNet statistics (mean [0.485, 0.456, 0.406], standard deviation [0.229, 0.224, 0.225]) to match the input distribution of the pretrained ResNet-50.
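The normalization in Equation (9), followed by channel replication and ImageNet standardization, might look like the sketch below. The resize to 224 × 224 is omitted because it requires an interpolation routine, and the function names are illustrative rather than taken from the authors' code.

```python
def min_max_normalize(m, eps=1e-8):
    """Equation (9): rescale a log-Mel spectrogram (F x T nested list) to [0, 1]."""
    flat = [v for row in m for v in row]
    lo, hi = min(flat), max(flat)
    return [[(v - lo) / (hi - lo + eps) for v in row] for row in m]

def to_three_channels(m_hat, mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)):
    """Replicate the spectrogram into 3 channels and standardize each with ImageNet statistics."""
    return [[[(v - mu) / sd for v in row] for row in m_hat] for mu, sd in zip(mean, std)]
```

The constant ϵ₁ matters for silent or constant clips, where max(M) equals min(M): the output is then all zeros instead of a division-by-zero error.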

3.3.2. Data Augmentation Strategy

To improve the generalization ability and noise robustness of the audio classification model, the following data augmentation strategies are employed during training:

(1) Random temporal cropping: a contiguous time window is randomly sampled from the full spectrogram as a training sample. Let the temporal dimension of the original spectrogram be T; the crop window length is T_crop = 400 frames (approximately 3 s of audio), and the starting position t₀ is uniformly sampled from [0, T − T_crop]. This strategy encourages the model to learn invariance to the temporal position of vocalizations.

(2) Frequency masking: inspired by the SpecAugment method [37], consecutive frequency bands are randomly masked along the frequency axis. The starting band index f₀ is randomly selected from [0, F − F_mask], with a mask width of F_mask = 15; values in the band [f₀, f₀ + F_mask) are set to zero. This strategy forces the model to learn local invariance along the frequency dimension, enhancing robustness to partial frequency information loss.
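The two augmentations above can be sketched in plain Python as follows. The function names, the fixed random seed, and the 1250-frame toy spectrogram are assumptions for illustration; the sketch also assumes T ≥ T_crop, since the handling of shorter recordings is not specified in the text.

```python
import random

T_CROP = 400  # crop length in frames (from the text)
F_MASK = 15   # mask width in Mel bands (from the text)

def random_temporal_crop(spec, rng, t_crop=T_CROP):
    """Return an F x t_crop window whose start t0 is uniform on [0, T - t_crop]."""
    t = len(spec[0])
    t0 = rng.randint(0, t - t_crop)  # randint is inclusive at both ends
    return [row[t0:t0 + t_crop] for row in spec], t0

def frequency_mask(spec, rng, f_mask=F_MASK):
    """Zero the bands [f0, f0 + f_mask), with f0 uniform on [0, F - f_mask]."""
    f = len(spec)
    f0 = rng.randint(0, f - f_mask)
    masked = [[0.0] * len(row) if f0 <= i < f0 + f_mask else list(row)
              for i, row in enumerate(spec)]
    return masked, f0

rng = random.Random(0)                        # fixed seed only for repeatability
spec = [[1.0] * 1250 for _ in range(128)]     # toy 128 x 1250 spectrogram
crop, t0 = random_temporal_crop(spec, rng)
aug, f0 = frequency_mask(crop, rng)
```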

3.3.3. ResNet-50 Backbone Network

ResNet (Residual Network) effectively mitigates the vanishing gradient problem in deep networks by introducing skip connections, enabling the training of substantially deeper architectures. ResNet-50 comprises 49 convolutional layers and one fully connected layer, organized into an initial convolutional layer followed by four residual stages.

The residual block is the core building unit of ResNet. Let the input to the l-th layer be x^l; the output of the residual block is given by:

x^{l + 1} = \mathcal{F} (x^{l}, W^{l}) + x^{l}

(10)

where ℱ(·) is the residual function and W^l denotes the learnable parameters of the l-th layer. The skip connection allows gradients to propagate directly from deeper layers to shallower ones, significantly improving training efficiency in deep networks.

As shown in Figure 4, ResNet-50 adopts the bottleneck architecture, in which each residual block consists of three convolutional layers: a 1 × 1 convolution for dimensionality reduction, a 3 × 3 convolution for spatial feature extraction, and a 1 × 1 convolution for dimensionality restoration. This compress-then-expand design substantially reduces computational cost while preserving feature representation capacity. The four residual stages have channel dimensions of 256, 512, 1024, and 2048, respectively, with each stage containing multiple residual blocks.

After feature extraction by the ResNet-50 backbone, global average pooling compresses the spatial features into a 2048-dimensional vector, which is then mapped to a K-dimensional output space through a fully connected layer:

z_{aud} = W_{aud} \cdot GAP (F_{res}) + b_{aud}

(11)

where z_aud ∈ ℝ^K denotes the unnormalized logits output by the audio branch, and F_res is the output feature map of the final residual stage of ResNet-50.

Since random temporal cropping during training uses only a portion of the audio, a dense sampling strategy is adopted at inference to fully exploit the complete audio information. Specifically, multiple time windows are extracted from the full spectrogram using a fixed-stride sliding window, each is passed through ResNet-50 to obtain predictions, and the logits from all windows are averaged.

Let the number of sampling windows be N_w = 5, the window length be T_crop = 400, and the sliding stride be S = 150. The starting position of the i-th window is t_i = (i − 1) × S, i ∈ {1, 2, …, N_w}. The averaged logits across all windows serve as the prediction vector for the audio modality:

z_{aud} = \frac{1}{N_{w}} \sum_{i = 1}^{N_{w}} z_{aud}^{(i)}

(12)
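The dense sampling procedure can be sketched as follows, using the stated values N_w = 5, T_crop = 400, and S = 150. The per-window logits are made-up numbers standing in for ResNet-50 outputs, and the helper names are illustrative assumptions.

```python
N_W, T_CROP, STRIDE = 5, 400, 150  # values stated in the text

def window_starts(n_w=N_W, stride=STRIDE):
    """Start frames t_i = (i - 1) * S for i = 1..N_w."""
    return [(i - 1) * stride for i in range(1, n_w + 1)]

def average_logits(per_window_logits):
    """Element-wise mean of the per-window logit vectors, as in Eq. (12)."""
    n = len(per_window_logits)
    k = len(per_window_logits[0])
    return [sum(z[j] for z in per_window_logits) / n for j in range(k)]

starts = window_starts()                      # [0, 150, 300, 450, 600]
spans = [(t, t + T_CROP) for t in starts]     # last window covers frames 600..1000

# Toy logits for K = 3 classes from the five windows (illustrative only).
logits = [[2.0, 0.0, -1.0],
          [1.0, 1.0, 0.0],
          [3.0, -1.0, 0.0],
          [2.0, 0.0, 1.0],
          [2.0, 0.0, 0.0]]
z_aud = average_logits(logits)                # [2.0, 0.0, 0.0]
```

With these settings the last window ends at frame 1000, so the five windows cover the first 1000 frames of the spectrogram with an overlap of 250 frames between neighbours.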

3.4. Confidence-Based Adaptive Fusion Strategy

The central challenge of multimodal fusion lies in how to effectively integrate information from different modalities to improve the accuracy of the final decision. Existing fusion strategies are broadly categorized into three types: early fusion, late fusion, and hybrid fusion. Early fusion concatenates feature vectors from each modality and feeds them into a shared classifier for joint learning, which can capture cross-modal interactions but requires additional trainable parameters. Late fusion combines the outputs of individual modality classifiers through weighted aggregation; it is straightforward to implement and allows each branch to be optimized independently, but typically employs fixed or validation-tuned static weights that cannot adapt to sample-specific characteristics. Hybrid fusion combines the two, integrating information at both the feature level and the decision level.

Under practical conditions, the prediction reliability of different modalities varies considerably across samples. When the probability distribution output by the image modality for a given sample is highly concentrated, the image-based judgment is more reliable; conversely, a flatter distribution indicates greater classification uncertainty. The same applies to the audio modality. An ideal fusion strategy should therefore dynamically allocate fusion weights according to the prediction confidence of each modality on the current sample, allowing the more confident modality to dominate the final decision.

Information entropy is a classical measure of uncertainty in a probability distribution. For discrete distributions, lower entropy indicates a more concentrated distribution with less uncertainty, while higher entropy implies a more uniform distribution with greater uncertainty. In addition, the probability gap between the Top-1 and Top-2 predicted classes reflects the decisiveness of the model: a larger gap suggests a stronger preference for the top-ranked class and thus more reliable predictions. Based on the above analysis, this paper proposes to jointly consider information entropy and probability gap as two complementary confidence indicators, combining them through weighted aggregation to achieve sample-level adaptive fusion in which the more confident modality plays a dominant role.

3.4.1. Prediction Entropy and Confidence Computation

First, the logits output by each modality are converted to probability distributions via the softmax function:

p_{img} = Softmax (z_{img}), p_{aud} = Softmax (z_{aud})

(13)

where p_img, p_aud ∈ ℝ^K are the predicted probability distributions of the image and audio modalities, respectively.

The Shannon entropy of each modality’s predicted probability distribution is then computed as:

H_{img} = - \sum_{j = 1}^{K} p_{img, j} \log (p_{img, j} + ϵ_{2})

(14)

H_{aud} = - \sum_{j = 1}^{K} p_{aud, j} \log (p_{aud, j} + ϵ_{2})

(15)

where ϵ₂ = 10⁻⁸ is a numerical stability constant to prevent logarithmic underflow. The theoretical range of entropy is [0, log K]: the minimum value of 0 is attained when the distribution degenerates to a deterministic one (i.e., one class has probability 1 and all others have 0), while the maximum value of log K is reached under a uniform distribution.

Based on the inverse relationship with entropy, the confidence of each modality is defined as the reciprocal of its entropy:

C_{img} = \frac{1}{H_{img} + ϵ_{2}}, C_{aud} = \frac{1}{H_{aud} + ϵ_{2}}

(16)

This formulation ensures that the modality with lower entropy (i.e., a more certain prediction) receives a higher confidence score.
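Equations (13) to (16) can be illustrated with a short plain-Python sketch. The function names and the two toy logit vectors are assumptions chosen to contrast a sharply peaked prediction with a uniform one.

```python
import math

EPS2 = 1e-8  # stability constant from Eqs. (14)-(16)

def softmax(z):
    """Numerically stable softmax (the maximum is subtracted before exponentiation)."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def entropy(p):
    """Shannon entropy with the epsilon inside the logarithm, as in Eqs. (14)-(15)."""
    return -sum(pj * math.log(pj + EPS2) for pj in p)

def reciprocal_confidence(p):
    """Confidence defined as 1 / (H + eps), as in Eq. (16)."""
    return 1.0 / (entropy(p) + EPS2)

p_sharp = softmax([6.0, 1.0, 0.0, 0.0])  # one dominant class
p_flat = softmax([0.0, 0.0, 0.0, 0.0])   # uniform over K = 4 classes
h_sharp, h_flat = entropy(p_sharp), entropy(p_flat)

# The uniform case sits at the upper bound log K; the peaked case is far below it,
# so its reciprocal confidence is much larger.
print(h_sharp, h_flat, math.log(4))
```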

3.4.2. Probability Gap Indicator and Combined Fusion Strategy

This paper proposes a combined fusion strategy that simultaneously considers the entropy indicator and the probability gap indicator, yielding a more robust confidence estimate through weighted aggregation.

The probability gap indicator is defined as the difference between the Top-1 and Top-2 class probabilities; a larger gap indicates greater certainty in the model’s prediction:

G_{img} = p_{img}^{(1)} - p_{img}^{(2)}, G_{aud} = p_{aud}^{(1)} - p_{aud}^{(2)}

(17)

where p⁽¹⁾ and p⁽²⁾ denote the highest and second-highest values in the predicted probability distribution, respectively.

The normalized entropy and probability gap are combined through weighted aggregation to obtain the composite confidence:

C_{m} = β \cdot (1 - {\bar{H}}_{m}) + (1 - β) \cdot G_{m}, m \in \{img, aud\}

(18)

where \bar{H}_{m} = H_m/log K is the normalized entropy mapped to the [0, 1] interval, and β ∈ [0, 1] is a balancing coefficient that controls the relative importance of the entropy and probability gap indicators. In this work, β = 0.5 is adopted to equally balance the contributions of the two indicators.

The composite confidence scores of the two modalities are normalized to obtain the fusion weight for the image modality:

α = \frac{C_{img}}{C_{img} + C_{aud}}

(19)

Correspondingly, the fusion weight for the audio modality is 1 − α. By definition, α ∈ (0, 1); when the composite confidence of the image modality exceeds that of the audio modality, α > 0.5 and the image modality dominates the fusion, and vice versa.

The final fused prediction probability distribution is obtained by weighted averaging:

p_{final} = α \cdot p_{img} + (1 - α) \cdot p_{aud}

(20)

The classification decision selects the class with the highest probability in the fused distribution as the predicted result:

\hat{y} = \arg \max_{j} p_{final, j}, p_{final, j} \in p_{final}

(21)

In the proposed adaptive combined fusion strategy, the fusion weights are derived entirely from the predicted probability distributions of each modality, eliminating the need for additional training of a fusion module and avoiding the weight degeneration phenomenon caused by optimization objectives in learning-based fusion. The fusion weights are computed independently for each sample, allowing dynamic adjustment according to the prediction confidence of each modality. For example, when a sample has a clear image but the bird is silent, the image modality exhibits high confidence while the audio modality yields low confidence, and the fusion weight automatically shifts toward the image modality. Moreover, entropy, as a measure of uncertainty, directly reflects the model’s degree of certainty regarding its prediction, while the probability gap captures the clarity of the decision boundary. Using these as fusion criteria is both intuitive and highly interpretable. Furthermore, the fusion process involves only simple operations—softmax normalization, entropy computation, probability gap calculation, and weighted averaging—incurring negligible computational overhead.
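A minimal sketch of Equations (17) to (21) is given below for a single sample. It mirrors the silent-bird example above with a confident image prediction and a nearly flat audio prediction. The function names and probability values are illustrative assumptions, and the fallback to α = 0.5 when both confidences are zero is an added guard that is not part of the formulation in the text.

```python
import math

EPS2 = 1e-8
BETA = 0.5  # balancing coefficient used in this work

def normalized_entropy(p):
    """H / log K, mapping the entropy to [0, 1]."""
    h = -sum(pj * math.log(pj + EPS2) for pj in p)
    return h / math.log(len(p))

def probability_gap(p):
    """Top-1 minus Top-2 probability, Eq. (17)."""
    top1, top2 = sorted(p, reverse=True)[:2]
    return top1 - top2

def composite_confidence(p, beta=BETA):
    """Eq. (18): beta * (1 - normalized entropy) + (1 - beta) * gap."""
    return beta * (1.0 - normalized_entropy(p)) + (1.0 - beta) * probability_gap(p)

def fuse(p_img, p_aud, beta=BETA):
    c_img = composite_confidence(p_img, beta)
    c_aud = composite_confidence(p_aud, beta)
    total = c_img + c_aud
    alpha = c_img / total if total > 0 else 0.5   # guard is an assumption
    p_final = [alpha * a + (1.0 - alpha) * b for a, b in zip(p_img, p_aud)]
    y_hat = max(range(len(p_final)), key=p_final.__getitem__)
    return alpha, p_final, y_hat

p_img = [0.90, 0.05, 0.05]   # confident image prediction
p_aud = [0.30, 0.36, 0.34]   # nearly uninformative audio prediction
alpha, p_final, y_hat = fuse(p_img, p_aud)
print(alpha, y_hat)          # alpha is close to 1, so the image branch decides
```

In this toy case the audio branch alone would pick class 1, but its low composite confidence gives it a very small weight, and the fused decision follows the image branch.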

3.5. Training Strategy

A two-stage independent training strategy is adopted, in which the image and audio classification models are optimized separately. The fusion module requires no training.

The image branch is initialized with ImageNet-pretrained EfficientNet-B3 weights and fine-tuned on the bird image dataset. The training configuration is as follows: the AdamW optimizer is used with an initial learning rate of 10⁻⁴ and a weight decay of 10⁻⁴; the learning rate is scheduled using cosine annealing, which smoothly decays the learning rate from its initial value to a minimum following a cosine curve; training runs for 15 epochs with a batch size of 32. The loss function is the cross-entropy loss:

L_{img} = - \sum_{j = 1}^{K} y_{j} \log (p_{img, j})

(22)

where y is the one-hot encoded ground-truth label vector and p_{img, j} is the predicted probability of the j-th class.

The audio branch is initialized with ImageNet-pretrained ResNet-50 weights and fine-tuned on the bird audio dataset. Although audio spectrograms differ visually from natural images, the ImageNet-pretrained weights provide generic low-level features such as textures and edge detectors, which help accelerate model convergence. The training configuration is as follows: the AdamW optimizer is employed with an initial learning rate of 10⁻⁴ and a weight decay of 10⁻⁴; cosine annealing is used for learning rate scheduling; training runs for 30 epochs with a batch size of 16.

The loss function is likewise the cross-entropy loss:

L_{aud} = - \sum_{j = 1}^{K} y_{j} \log (p_{aud, j})

(23)

Both branches employ the cosine annealing learning rate schedule, which smoothly decays the learning rate from the initial value to a minimum following a cosine function:

η_{t} = η_{\min} + \frac{1}{2} (η_{\max} - η_{\min}) (1 + \cos (\frac{t π}{T_{\max}}))

(24)

where η_t is the learning rate at the t-th epoch, η_max is the initial maximum learning rate, η_min is the minimum learning rate (set to 10⁻⁶ in this work), and T_max is the total number of training epochs. Compared with a fixed learning rate, cosine annealing refines model parameters with smaller steps in the later training stages, effectively preventing loss oscillation near local minima and improving the final convergence quality.
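Equation (24) can be evaluated directly, as in the following sketch for the 15-epoch image branch. The function name is an assumption. The text does not state which scheduler implementation was used; PyTorch's `torch.optim.lr_scheduler.CosineAnnealingLR`, which takes `T_max` and `eta_min` arguments, follows the same closed form.

```python
import math

def cosine_lr(t, t_max, eta_max=1e-4, eta_min=1e-6):
    """Learning rate at epoch t under Eq. (24)."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t / t_max))

# Image branch: 15 epochs, initial rate 1e-4, minimum rate 1e-6.
schedule = [cosine_lr(t, 15) for t in range(16)]

# The rate starts at eta_max, ends at eta_min, and decreases at every epoch.
for t, lr in enumerate(schedule):
    print(t, f"{lr:.3e}")
```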

After the two-stage training is completed, both branches can perform inference independently. During inference, the logits from the two branches are obtained separately, and the final prediction is computed using the confidence–adaptive fusion strategy described in Section 3.4. The entire fusion process requires no training and directly derives fusion weights from information entropy and probability gap.

4. Experimental Results and Analysis

This section experimentally validates the effectiveness of the proposed multimodal fusion recognition method. The experimental environment and dataset are first described, followed by the definition of evaluation metrics. The experimental analysis is then conducted from two perspectives: single-modality recognition performance and multimodal fusion effectiveness.

4.1. Experimental Setup and Dataset Description

The proposed multimodal fusion bird recognition model is experimentally evaluated under the following hardware and software configuration: Windows 10 operating system, Intel Core i5-12400F processor, NVIDIA GeForce RTX 3070Ti GPU (8 GB VRAM), and 32 GB DDR4 RAM. The deep learning framework is PyTorch 1.12 with Python 3.8 and CUDA 11.6.

The Sapsucker Woods 60 (SSW60) dataset is adopted for experimental validation. SSW60 is a multimodal bird recognition benchmark released by the Cornell Lab of Ornithology, specifically designed for audio–visual fine-grained classification research. The dataset covers 60 North American bird species, all of which can be observed at the Sapsucker Woods Sanctuary in Ithaca, New York, spanning multiple taxonomic groups including songbirds, woodpeckers, and raptors. The dataset comprises multi-modal data sources. For images, it contains 21,600 field photographs from iNaturalist 2021 and 10,221 images from NABirds, totaling 31,821 independent image samples. For audio, it includes 3861 independently recorded WAV files at a sampling rate of 22,050 Hz, mono-channel, each approximately 10 s in duration, with expert-verified annotations confirming the presence of the target species’ vocalizations. The official train/test split provided with the dataset is strictly followed throughout this work.

For image data, input images are uniformly resized to 300 × 300 pixels to match the standard input size of EfficientNet-B3 and then normalized using the ImageNet channel-wise statistics. During training, data augmentation is applied, including random horizontal flipping, random rotation, and color jittering.

For audio data, the waveforms are first resampled to 16,000 Hz. Spectral features are then extracted via STFT with an FFT size of 512 and a hop length of 128 samples. A bank of 128 triangular filters maps the power spectrum onto the Mel frequency domain, followed by a logarithmic transformation to obtain log-Mel spectrograms. The single-channel spectrogram is replicated along the channel dimension to match the three-channel input format of ResNet-50, normalized, resized to 224 × 224 pixels, and standardized using the ImageNet statistics. During training, random temporal cropping with a window length of 400 frames is applied, along with frequency masking augmentation.

The detailed configuration parameters of each module are listed in Table 1. The image branch adopts the EfficientNet-B3 architecture with an input size of 300 × 300 pixels, a compound scaling coefficient ϕ = 3, and 60 output classes. The audio branch employs the ResNet-50 architecture with an input spectrogram size of 224 × 224 pixels, four residual stages, and an output feature dimension of 2048.

Both branches use the AdamW optimizer with an initial learning rate of 10⁻⁴ and a weight decay of 10⁻⁴, and the learning rate is scheduled via cosine annealing. The image branch is trained for 15 epochs and the audio branch for 30 epochs.

Since the image (31,821 samples) and audio (3861 samples) data in SSW60 are collected independently and are imbalanced in count without any instance-level correspondence, we adopt a species-level random pairing strategy to construct image-audio pairs for the fusion evaluation. The strategy traverses the image test set as the primary iteration, drawing each test image in turn; for every image, an audio sample is then randomly selected from the audio set of the same species to form a paired sample, with the resulting pair sharing the same species label. Because the pairing relies on the species-organized directory structure of SSW60, every pair always belongs to the same bird species; however, since the two modalities were independently collected, the paired image and audio do not necessarily originate from the same individual or the same observation event, but are aligned only at the species level. Given that the audio set is smaller than the image set, a given audio recording may be sampled multiple times or not at all within a single evaluation pass, while every test image is covered exactly once, ensuring that the fusion evaluation is complete with respect to the image test set. It is worth noting that this pairing protocol is applied only at the fusion evaluation stage; the image and audio branches are trained independently on their respective modality-specific data following the official SSW60 train/test split, without requiring any image-audio pairing during training. This species-level pairing protocol is consistent with the design intent of SSW60 [16] as a fine-grained audiovisual classification benchmark, in which the two modalities provide complementary species-discriminative information rather than instance-level synchronized signals.
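The species-level pairing protocol can be sketched as below. The file identifiers, species codes, function name, and the fixed seed are all hypothetical; the text does not state whether a seed was fixed.

```python
import random

def pair_by_species(image_items, audio_by_species, seed=0):
    """For each test image, draw one audio clip of the same species at random.

    image_items: list of (image_id, species) tuples, iterated exactly once each.
    audio_by_species: dict mapping species to a list of audio ids.
    """
    rng = random.Random(seed)  # seed is an assumption, for repeatability only
    pairs = []
    for image_id, species in image_items:
        audio_id = rng.choice(audio_by_species[species])
        pairs.append((image_id, audio_id, species))
    return pairs

# Hypothetical identifiers for illustration.
image_items = [("img_001", "amecro"), ("img_002", "amecro"), ("img_003", "blujay")]
audio_by_species = {"amecro": ["aud_a1", "aud_a2"], "blujay": ["aud_b1"]}
pairs = pair_by_species(image_items, audio_by_species)
```

Every image appears once, whereas an audio clip may be drawn several times or never, which matches the coverage properties described above.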

4.2. Evaluation Metrics

To comprehensively evaluate the classification performance, the following metrics are adopted. Top-1 accuracy is the proportion of samples for which the class with the highest predicted probability matches the ground truth, serving as the most direct performance measure in multi-class classification tasks:

\text{Top-1} = \frac{1}{N} \sum_{i = 1}^{N} I ({\hat{y}}_{i} = y_{i})

(25)

where N is the total number of test samples, ŷ_i is the predicted class for the i-th sample, y_i is the ground-truth class, and I(·) is the indicator function that returns 1 when the condition is true and 0 otherwise.

Top-5 accuracy is the proportion of samples for which the ground-truth class falls within the top five predicted classes. For fine-grained classification tasks with a large number of species, Top-5 accuracy reflects the model’s ability to rank candidate classes:

\text{Top-5} = \frac{1}{N} \sum_{i = 1}^{N} I (y_{i} \in \text{Top5} ({\hat{p}}_{i}))

(26)

where Top5(\hat{p}_{i}) denotes the set of five classes with the highest probabilities in the predicted distribution \hat{p}_{i} of the i-th sample.
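Both accuracy metrics are instances of a general Top-k accuracy, sketched below in plain Python. The function name and the toy predictions are assumptions; k = 2 is used only because the toy problem has three classes, whereas the evaluation above uses k = 1 and k = 5 with K = 60.

```python
def top_k_accuracy(probs, labels, k):
    """Fraction of samples whose true label is among the k most probable classes."""
    hits = 0
    for p, y in zip(probs, labels):
        top_k = sorted(range(len(p)), key=lambda j: p[j], reverse=True)[:k]
        hits += y in top_k
    return hits / len(labels)

# Toy predictions over 3 classes (illustrative values without ties).
probs = [[0.70, 0.20, 0.10],
         [0.35, 0.40, 0.25],
         [0.20, 0.30, 0.50],
         [0.50, 0.10, 0.40]]
labels = [0, 0, 2, 2]

top1 = top_k_accuracy(probs, labels, 1)  # 0.5: samples 1 and 3 are correct
top2 = top_k_accuracy(probs, labels, 2)  # 1.0: every true label is in the top two
```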

The Macro-F1 is the macro-averaged F1 score, which first computes the F1 score (the harmonic mean of precision and recall) for each class and then takes the arithmetic mean across all classes, providing a balanced assessment of the model’s classification quality on multi-class tasks:

P_{c} = \frac{TP_{c}}{TP_{c} + FP_{c}}, R_{c} = \frac{TP_{c}}{TP_{c} + FN_{c}}

(27)

F1_{c} = \frac{2 \cdot P_{c} \cdot R_{c}}{P_{c} + R_{c}}

(28)

\text{Macro-F1} = \frac{1}{K} \sum_{c = 1}^{K} F1_{c}

(29)

where TP_c, FP_c, and FN_c are the numbers of true positives, false positives, and false negatives for class c, respectively, and K is the total number of classes. The Macro-F1 score assigns equal weight to each class, reflecting the model’s balanced performance across all categories, and is particularly suitable for evaluating classification on datasets with long-tailed distributions.
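Equations (27) to (29) can be sketched as follows. The function name and toy labels are assumptions, and setting precision, recall, or F1 to zero when a denominator vanishes is a convention adopted here because the text does not specify one.

```python
def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1 scores."""
    f1_scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0  # zero-division convention (assumed)
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / num_classes

# Imbalanced toy example: class 0 is frequent, classes 1 and 2 are rare.
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 2, 2]

score = macro_f1(y_true, y_pred, 3)                              # (1 + 0 + 2/3) / 3
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)  # 5/6
```

Here the accuracy is 5/6 while the Macro-F1 is only 5/9, because the single missed sample of class 1 drives that class's F1 to zero and every class counts equally in the mean.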

4.3. Single-Modality Recognition Performance Comparison

To validate the effectiveness of the image and audio classification models adopted in this work, comparative experiments against mainstream deep learning models are conducted under image-only and audio-only conditions. The CNN models are initialized with ImageNet-pretrained weights, while AST uses AudioSet pretraining; all models are fine-tuned on the SSW60 dataset.

For the image classification experiment, ResNet-50, VGG-16, and MobileNetV3-Large are selected as baseline models. ResNet-50 is a widely adopted deep residual network that mitigates the vanishing gradient problem through residual connections. VGG-16 is a representative early deep CNN characterized by its straightforward architecture of stacked 3 × 3 convolutions. All three baselines share the same training configuration as the proposed method: ImageNet-pretrained initialization, AdamW optimizer, initial learning rate of 10⁻⁴, cosine annealing schedule, and 15 training epochs. The performance comparison of image classification models is presented in Table 2.

As shown in Table 2, the EfficientNet-B3 model adopted in this work achieves the best performance across all three metrics. Compared with VGG-16, EfficientNet-B3 improves Top-1 accuracy by 7.74 percentage points, Top-5 accuracy by 2.44 percentage points, and Macro-F1 score by 8.42 percentage points, while containing only 8.8% of the parameters of VGG-16, demonstrating the superior balance between computational efficiency and performance afforded by compound scaling. Compared with ResNet-50, EfficientNet-B3 achieves a 1.80 percentage-point gain in Top-1 accuracy with approximately 52% fewer parameters, confirming the advantage of the EfficientNet family for fine-grained bird classification. MobileNetV3-Large, despite its highly compact design with only 4.28 M parameters and 0.23 GFLOPs, achieves a Top-1 accuracy of merely 84.34%, which is 7.21 percentage points lower than EfficientNet-B3. This confirms that excessively compressed networks struggle with fine-grained classification, and that EfficientNet-B3 strikes the best balance between accuracy and efficiency among the compared image classifiers.

The compound scaling strategy of EfficientNet-B3 jointly optimizes network depth, width, and input resolution, endowing the model with stronger feature representation under a limited computational budget. The SE attention mechanism enables adaptive enhancement of discriminative feature channels while suppressing redundant information, making EfficientNet-B3 particularly effective for fine-grained feature extraction from bird images.

For the audio classification experiment, the Audio Spectrogram Transformer (AST), VGG-16, and EfficientNet-B3 are selected as baselines. AST is a representative model that adapts the Vision Transformer architecture to audio classification, modeling global dependencies in spectrograms through self-attention. VGG-16, as a classical CNN architecture, is also applicable to spectrogram classification. Both the proposed method and VGG-16 replicate the single-channel log-Mel spectrogram to three channels to match the ImageNet-pretrained input format, whereas AST uses its standard single-channel spectrogram input. All baselines share the same training configuration: pretrained weight initialization, AdamW optimizer, initial learning rate of 10⁻⁴, cosine annealing schedule, and 30 training epochs. The performance comparison of audio classification models is presented in Table 3.

As shown in Table 3, the ResNet-50 model adopted in this work achieves the best performance on the audio classification task. Compared with VGG-16, ResNet-50 improves Top-1 accuracy by 14.72 percentage points, Top-5 accuracy by 8.30 percentage points, and Macro-F1 score by 15.15 percentage points—a substantial improvement. Notably, ResNet-50 contains only 18.5% of the parameters of VGG-16, exhibiting clear advantages in both computational efficiency and classification performance. Compared with AST pretrained on AudioSet, the ResNet-50 model, using only ImageNet pretraining, still achieves gains of 4.91, 4.11, and 5.04 percentage points in Top-1 accuracy, Top-5 accuracy, and Macro-F1, respectively. These results indicate that, for the SSW60 bird audio classification task, the ResNet-50 architecture outperforms the Transformer-based AST model. When EfficientNet-B3 is applied to Mel spectrogram classification, it achieves a Top-1 accuracy of 67.88%, which is 0.32 percentage points lower than ResNet-50, with a 0.70 percentage point gap on Macro-F1. Although their accuracies are close, ResNet-50 runs more than twice as fast at 4.06 ms versus 8.42 ms per sample. This observation supports our choice of ResNet-50 over EfficientNet-B3 for the audio branch, where its residual structure handles the time-frequency patterns of Mel spectrograms more efficiently.

CNNs such as ResNet-50 incorporate inductive biases including translation invariance, which align well with the local time–frequency patterns present in audio spectrograms. The characteristic elements of bird vocalizations—such as harmonics at specific frequencies and repetitive syllable patterns—are inherently local and repetitive, and CNNs are effective at capturing such patterns. In contrast, Transformers may require substantially larger datasets to learn comparable feature representations.

ResNet-50 effectively mitigates the vanishing gradient problem through residual connections, enabling stable training of a 50-layer network. Residual learning facilitates the learning of identity mappings, which helps preserve low-level feature information. In contrast, although VGG-16 has a simple architecture, its 16-layer depth limits feature representation capacity, and the absence of residual connections leads to lower training efficiency.

In summary, the EfficientNet-B3 and ResNet-50 models selected in this work achieve superior classification performance over baseline methods in their respective modalities, establishing high-quality single-modality prediction foundations for the subsequent multimodal fusion.

Although EfficientNet-B3 records the highest inference time among the compared image models at 8.44 ms, this remains at the millisecond level and translates to over 100 frames per second, which well exceeds the real-time requirements of practical bird monitoring video pipelines.

4.4. Validation of Multimodal Fusion Effectiveness

To validate the effectiveness of the proposed multimodal fusion method, multiple comparative experiments are designed and analyzed from two perspectives: comparison of fusion strategies and overall multimodal fusion effectiveness. The compared methods include:

(1) Image-Only: only the prediction of the image branch (EfficientNet-B3) is used as the final output, without modal fusion.

(2) Audio-Only: only the prediction of the audio branch (ResNet-50) is used as the final output, without modal fusion.

(3) Entropy Fusion: fusion weights are computed based on the information entropy of each modality’s predicted probability distribution. Lower entropy indicates a more certain prediction and is assigned a higher fusion weight. The weight is computed as:

α = \frac{1 / H_{img}}{1 / H_{img} + 1 / H_{aud}}

(30)

where H_img and H_aud are the Shannon entropies of the predicted probability distributions of the image and audio modalities, respectively.

(4) Gap Fusion: fusion weights are computed based on the probability gap between the Top-1 and Top-2 classes in each modality’s predicted distribution. A larger gap indicates a more certain prediction and receives a higher weight:

α = \frac{G_{img}}{G_{img} + G_{aud}}

(31)

(5) Trusted Multi-View Classification (TMC) [13]: a representative learning-based uncertainty fusion method that converts each branch’s output into Dirichlet evidence and combines the two branches via Dempster-Shafer theory. TMC requires modifying both branches to output evidence and end-to-end retraining with evidential losses.

(6) Combined Fusion (proposed method): jointly considers both entropy and probability gap as confidence indicators, computing fusion weights through weighted aggregation.
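The two single-indicator weights in Equations (30) and (31) can be compared on one toy sample, as sketched below. The function names and probability values are assumptions. Note that Equation (30) simplifies algebraically to α = H_aud / (H_img + H_aud), and that the sketch assumes neither entropy is zero.

```python
import math

def entropy(p, eps=1e-8):
    """Shannon entropy with the same epsilon placement as Eqs. (14)-(15)."""
    return -sum(pj * math.log(pj + eps) for pj in p)

def gap(p):
    """Top-1 minus Top-2 probability."""
    a, b = sorted(p, reverse=True)[:2]
    return a - b

def entropy_fusion_alpha(p_img, p_aud):
    """Eq. (30): weights proportional to reciprocal entropy (entropies assumed nonzero)."""
    inv_img, inv_aud = 1.0 / entropy(p_img), 1.0 / entropy(p_aud)
    return inv_img / (inv_img + inv_aud)

def gap_fusion_alpha(p_img, p_aud):
    """Eq. (31): weights proportional to the probability gap."""
    return gap(p_img) / (gap(p_img) + gap(p_aud))

p_img = [0.97, 0.02, 0.01]  # confident image prediction
p_aud = [0.60, 0.30, 0.10]  # moderately confident audio prediction
alpha_ent = entropy_fusion_alpha(p_img, p_aud)
alpha_gap = gap_fusion_alpha(p_img, p_aud)
print(alpha_ent, alpha_gap)
```

On this single toy pair the entropy-based weight is more extreme than the gap-based weight (the latter is 0.95 / 1.25 = 0.76); this is one illustrative sample, not a statement about the test-set distributions.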

The performance comparison of different fusion strategies is presented in Table 4.

As shown in Table 4, all fusion strategies achieve significantly better classification performance than single-modality methods. Compared with the image-only baseline, the combined fusion strategy improves Top-1 accuracy from 91.55% to 95.30%, a gain of 3.75 percentage points, and Macro-F1 score by 4.32 percentage points. This indicates that even though the standalone accuracy of the audio modality (68.20%) is substantially lower than that of the image modality (91.55%), a well-designed fusion strategy can still extract beneficial complementary information from the audio data to further improve overall performance. The comparison between multimodal and single-modality results validates the effectiveness of the proposed fusion strategy.

We additionally compare against TMC [13], a representative learning-based uncertainty fusion method. TMC achieves 95.09% Top-1 accuracy, which is 0.21 percentage points lower than the proposed Combined Fusion at 95.30%, with a 0.37 percentage point gap on Macro-F1. Notably, TMC requires modifying both branches to output Dirichlet evidence and end-to-end retraining with evidential losses, whereas the proposed fusion module introduces no trainable parameters and can be applied directly on top of independently trained classifiers. This contrast highlights that the proposed dual-indicator confidence captures modality reliability in a parameter-free yet effective manner, achieving competitive accuracy without the architectural and training overhead of evidential fusion.

The combined fusion strategy adopted in this work achieves the best performance in both Top-1 accuracy and Macro-F1 score. Compared with entropy fusion, the combined strategy yields a 0.12 percentage-point gain in Top-1 accuracy; compared with gap fusion, the gain is 0.10 percentage points. The combined strategy simultaneously considers entropy and probability gap, providing a more comprehensive assessment of each modality’s prediction confidence. All three confidence-based adaptive fusion strategies achieve comparable and strong performance, with Top-1 accuracies exceeding 95%. The entropy indicator, grounded in information theory, measures the overall uncertainty of the predicted distribution and performs slightly better on Top-5 accuracy. The combined strategy leverages the complementary strengths of both indicators to achieve the best Top-1 accuracy and Macro-F1 score, demonstrating the complementary effect of multi-indicator fusion.

Figure 5 shows the histograms of fusion weight distributions for the three strategies on the test set. The entropy fusion weights are heavily skewed toward the upper end of the range, with a large number of samples concentrated in the 0.9–1.0 interval and a mean weight of 0.7946, indicating that the image modality dominates the vast majority of samples. This extreme distribution stems from the entropy indicator’s sensitivity to prediction certainty. However, it also restricts the contribution of the audio modality: even when the image prediction contains subtle errors, the audio information can barely influence the fusion decision. In contrast, the gap fusion and combined fusion strategies yield more balanced weight distributions that approximate a normal shape. The mean weight decreases to 0.6650 for gap fusion and further to 0.6344 for combined fusion, with both distributions primarily concentrated in the 0.4–0.8 interval. This characteristic grants the audio modality greater weight allocation, allowing it to contribute more effectively when the image prediction confidence is low. By integrating both entropy and probability gap indicators, the combined fusion strategy maintains sensitivity to high-confidence predictions while preventing excessive weight concentration on a single modality.

The performance gains from multimodal fusion primarily stem from two types of samples. The first type consists of samples where the image modality predicts incorrectly but the audio modality predicts correctly; when image quality is compromised (e.g., due to insufficient lighting, occlusion, or motion blur), the fusion strategy can leverage audio information to correct erroneous predictions. The second type consists of samples where both modalities are not fully confident yet point to the same correct class; in this case, the fused probability distribution becomes more concentrated, improving the accuracy of the final decision. Through more balanced weight allocation, the combined fusion strategy effectively handles both types of samples, thereby achieving the best fusion performance.

4.5. Sensitivity Analysis on the Confidence Combination Coefficient

To analyze how the relative weighting of the entropy and probability gap indicators affects fusion performance, this subsection presents a sensitivity analysis on the balancing coefficient β. As defined in Section 3.4.2, β ∈ [0, 1] controls the contribution ratio of the two indicators, with β = 0 corresponding to pure entropy, β = 1 to pure probability gap, and β = 0.5 to the default Combined Fusion setting. With all other configurations held fixed, β is evaluated across {0.00, 0.25, 0.50, 0.75, 1.00}; results are summarized in Table 5.
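To make the role of β concrete, the following sketch computes a per-sample image weight α from the two indicators. Section 3.4 gives the exact formulas used in this work; here the entropy confidence is assumed to be one minus the normalized Shannon entropy, the gap confidence is assumed to be the difference between the two largest probabilities, and α is assumed to be the image confidence divided by the sum of both confidences. All function names are hypothetical.

```python
import numpy as np

EPS = 1e-12  # guards against log(0) and division by zero

def entropy_confidence(p):
    """1 - H(p)/log(C): near 1 for a one-hot prediction, near 0 for a uniform one."""
    p = np.asarray(p, dtype=float)
    h = -np.sum(p * np.log(p + EPS), axis=-1)
    return 1.0 - h / np.log(p.shape[-1])

def gap_confidence(p):
    """Top-1 minus top-2 probability (assumed form of the probability gap)."""
    top2 = np.sort(np.asarray(p, dtype=float), axis=-1)[..., -2:]
    return top2[..., 1] - top2[..., 0]

def combined_confidence(p, beta=0.5):
    """beta = 0 gives pure entropy, beta = 1 gives pure gap, as in the text."""
    return (1.0 - beta) * entropy_confidence(p) + beta * gap_confidence(p)

def fuse(p_img, p_aud, beta=0.5):
    """Return the fused probabilities and the per-sample image weight alpha."""
    p_img = np.asarray(p_img, dtype=float)
    p_aud = np.asarray(p_aud, dtype=float)
    c_img = combined_confidence(p_img, beta)
    c_aud = combined_confidence(p_aud, beta)
    # Assumed normalization: image confidence as a share of total confidence.
    alpha = np.asarray((c_img + EPS) / (c_img + c_aud + 2 * EPS))
    fused = alpha[..., None] * p_img + (1.0 - alpha[..., None]) * p_aud
    return fused, alpha

# Hypothetical batch of two samples over three classes.
p_img = np.array([[0.90, 0.05, 0.05], [0.40, 0.35, 0.25]])
p_aud = np.array([[0.50, 0.30, 0.20], [0.10, 0.80, 0.10]])
for beta in (0.0, 0.5, 1.0):
    fused, alpha = fuse(p_img, p_aud, beta)
    print(beta, np.round(alpha, 3), fused.argmax(axis=-1))
```

Because the weights come directly from the two softmax outputs, a rule of this shape adds no trainable parameters. In this toy batch the confident image prediction receives α above 0.5 and the uncertain one receives α below 0.5 for every β tried.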

The fused Top-1 accuracy remains within a narrow range of 95.22–95.28% across the entire β interval, a spread of only 0.06 percentage points; Macro-F1 is comparably stable (95.00–95.06%). This indicates that the proposed fusion strategy is largely insensitive to the choice of β. Among all settings, β = 0.5 and β = 0.75 share the highest Top-1 accuracy (95.28%) and Macro-F1 (95.06%), suggesting that the two indicators capture complementary aspects of prediction reliability and that an equal-weight combination exploits both effectively. The average fusion weight α decreases monotonically from 0.6656 to 0.6103 as the strategy shifts from pure entropy (β = 0) to pure gap (β = 1). Based on these observations, β is fixed at 0.5 for all fusion experiments in this work.
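The spread quoted above can be recomputed directly from the columns of Table 5. The snippet below only reuses those reported values; the variable names are illustrative.

```python
# Values copied from Table 5 (Top-1 and Macro-F1 in percent).
betas    = [0.00, 0.25, 0.50, 0.75, 1.00]
top1     = [95.24, 95.26, 95.28, 95.28, 95.22]
macro_f1 = [95.00, 95.05, 95.06, 95.06, 95.03]
alpha    = [0.6656, 0.6493, 0.6348, 0.6219, 0.6103]

# Spread between the best and worst setting, in percentage points.
top1_spread = round(max(top1) - min(top1), 2)
f1_spread = round(max(macro_f1) - min(macro_f1), 2)

# Settings that reach the maximum of each metric (ties are kept).
best_top1 = [b for b, v in zip(betas, top1) if v == max(top1)]
best_f1 = [b for b, v in zip(betas, macro_f1) if v == max(macro_f1)]

# Check that the mean image weight falls strictly as beta grows.
alpha_is_decreasing = all(a > b for a, b in zip(alpha, alpha[1:]))

print(top1_spread, f1_spread, best_top1, best_f1, alpha_is_decreasing)
```

Both spreads come out at 0.06 percentage points, and β = 0.50 and β = 0.75 tie on both metrics, so the choice of β = 0.5 rests on the equal weighting of the two indicators, not on a measurable accuracy advantage.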

5. Discussion

5.1. Discussion of Results

The proposed confidence–adaptive fusion strategy yields consistent gains over single-modality baselines on SSW60, with Top-1 accuracy improving by 3.75 and 27.10 percentage points over the image-only and audio-only models, respectively. These gains primarily originate from two sample categories: those where the image modality misclassifies under poor visual conditions (e.g., low lighting, occlusion, motion blur) but the audio modality predicts correctly, and those where both modalities are uncertain yet point to the same class, leading to a sharper post-fusion distribution.

Comparison among the three fusion variants further reveals the complementary roles of the two confidence indicators. As shown in Figure 5, entropy-based weights are heavily concentrated near 0.9–1.0 (mean α = 0.7946), which suppresses the audio contribution even when image predictions contain subtle errors. The probability gap indicator yields a more balanced distribution (mean α = 0.6650), and the combined strategy further moderates it to α = 0.6344, allowing the audio modality to contribute meaningfully to uncertain samples. The combined strategy thus surpasses both single-indicator variants, and also outperforms TMC, a representative learning-based uncertainty fusion method, without introducing any trainable parameters.

These observations also explain why confidence–adaptive fusion is particularly effective for fine-grained classification. Fine-grained tasks, characterized by high inter-class similarity and locally discriminative features, often produce single-modality predictions with high entropy or small probability gaps—a low-decisiveness state in which fixed-weight fusion cannot tell which modality is more reliable. The dual-indicator confidence explicitly identifies this state and reallocates weight accordingly, which we view as the principal reason for the observed gain over the strong image-only baseline.

5.2. Limitations

While the proposed method achieves strong performance on the SSW60 benchmark, several limitations should be acknowledged. SSW60 is collected primarily as a fine-grained audiovisual benchmark for North American bird species under relatively favorable field conditions; it does not capture the specific environmental characteristics of power transmission line scenarios, including persistent electromagnetic interference, structural occlusion by towers and conductors, wide-angle surveillance viewpoints, low-light or infrared imagery, and characteristic acoustic noise from wind, corona discharge, and equipment vibration. The performance reported here should therefore be interpreted as a methodological proof-of-concept on a standard audiovisual benchmark, rather than a verified outcome in real transmission line settings. Future work will involve constructing a domain-specific dataset and evaluating the proposed method under such conditions, possibly through transfer learning or domain adaptation.

6. Conclusions

This paper proposes a confidence–adaptive audiovisual recognition method for fine-grained bird species classification. By integrating complementary information from the visual and auditory modalities, the method mitigates the inherent limitations of single-modality approaches and improves classification accuracy on SSW60. The main contributions and conclusions are summarized as follows.

(1) An image classification branch based on EfficientNet-B3 was constructed. This branch fully exploits the compound scaling strategy to jointly optimize network depth, width, and input resolution, and incorporates the SE attention mechanism to adaptively enhance discriminative feature channels. It achieves a Top-1 accuracy of 91.55% on the SSW60 dataset. Compared with ResNet-50 and VGG-16, EfficientNet-B3 attains superior classification performance with fewer parameters, confirming the advantage of neural architecture search-based efficient networks for fine-grained bird classification.

(2) An audio classification branch based on ResNet-50 was designed. This branch converts bird vocalizations into Mel spectrograms, extracts acoustic features through the residual network, and employs a dense sampling inference strategy to cover vocalization information across different temporal segments. Experimental results show that ResNet-50 achieves a Top-1 accuracy of 68.20% on the audio classification task, outperforming AST [17] and VGG-16. This finding suggests that, on medium-scale datasets, CNN architectures with convolutional inductive biases generalize better than Transformers.

(3) A confidence–adaptive fusion strategy was proposed. This strategy jointly considers information entropy and probability gap to assess the reliability of each modality’s prediction from complementary perspectives and dynamically computes fusion weights accordingly. Compared with entropy fusion and gap fusion, the combined strategy yields a more balanced weight distribution, maintaining sensitivity to high-confidence predictions while granting greater contribution to the audio modality. Experimental results demonstrate that the combined fusion strategy achieves a Top-1 accuracy of 95.30% and a Macro-F1 score of 95.12%, representing improvements of 3.75 and 4.32 percentage points over the image-only baseline, respectively. This strategy requires no trainable parameters and incurs negligible computational overhead, offering strong practicality.
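The dense sampling inference mentioned in item (2) can be pictured as sliding a fixed-length window over the whole Mel spectrogram and averaging the per-window class probabilities. The sketch below is written under that assumption: the window length, hop size, zero padding of short clips, and mean aggregation are illustrative choices, and `classify_window` is a hypothetical stand-in for the trained ResNet-50.

```python
import numpy as np

def dense_windows(spec, win, hop):
    """Yield fixed-width windows covering a (n_mels, n_frames) spectrogram."""
    n_frames = spec.shape[1]
    if n_frames < win:  # zero-pad short clips on the right (assumed policy)
        spec = np.pad(spec, ((0, 0), (0, win - n_frames)))
        n_frames = win
    starts = list(range(0, n_frames - win + 1, hop))
    if starts[-1] != n_frames - win:  # extra window so the tail is covered
        starts.append(n_frames - win)
    for s in starts:
        yield spec[:, s:s + win]

def dense_predict(spec, classify_window, win=224, hop=112):
    """Average class probabilities over all windows of one recording."""
    probs = [classify_window(w) for w in dense_windows(spec, win, hop)]
    return np.mean(probs, axis=0)

def classify_window(window, n_classes=60):
    """Hypothetical stand-in for the audio network: returns a softmax vector."""
    logits = np.linspace(0.0, 1.0, n_classes) * window.mean()
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(0)
spec = rng.random((128, 1000))  # hypothetical 128-band Mel spectrogram
p_audio = dense_predict(spec, classify_window)
print(p_audio.shape, round(float(p_audio.sum()), 6))
```

The averaged vector `p_audio` is what a fusion step would consume as the audio prediction. Other aggregation rules, such as a maximum over windows, would fit the same interface.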

Beyond addressing the limitations discussed in Section 5.2, two directions are particularly worth pursuing. First, given that real-world bird monitoring data are typically collected from geographically distributed cameras and microphones operated by different organizations, the resulting data exhibit pronounced non-IID characteristics, where the class distribution and modality availability can vary substantially across clients. Federated learning frameworks tailored to such non-IID settings are therefore particularly promising. At the platform level, the FedBirdAg platform proposed by Benhoussa et al. [38] provides a representative reference for low-energy federated training of bird-recognition models on distributed wireless smart cameras. At the algorithm level, addressing the label-skew and client-heterogeneity challenges inherent to such deployments would benefit from incorporating recent advances such as FedLC [39], which mitigates label-distribution skew via logits calibration, and FedProto [40], which enables federated prototype learning across heterogeneous clients. Adapting these paradigms to the audiovisual fusion setting studied here, particularly to handle clients that may possess different modality subsets or different species coverage, constitutes a promising direction for future investigation. Second, extending the proposed method to handle missing-modality scenarios, through modality-dropout training, modality-aware confidence calibration, or generative modality completion, would further improve its robustness in practical deployment, where one of the two modalities may be unavailable or severely degraded.

Author Contributions

Conceptualization, X.W., Q.L. and C.W.; methodology, Q.L., X.W. and C.W.; software, Q.L.; validation, H.Z., Z.W. and Q.L.; formal analysis, Q.L. and Z.W.; investigation, Q.L. and H.Z.; resources, X.X. and H.Z.; data curation, H.Z. and Q.L.; writing—original draft preparation, Q.L.; writing—review and editing, X.X., Z.W., X.W., H.Z. and C.W.; visualization, Q.L.; supervision, X.X. and C.W.; project administration, X.X.; funding acquisition, X.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Harbin Power Supply Company, State Grid Heilongjiang Electric Power Co., Ltd., through the science and technology project “Research on Bird Damage Diagnosis and Active Bird Repelling Device Methods for Transmission Lines”, grant number SGHLHROOKJJS2400814. The APC was funded by the same project.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. The data can be found at https://github.com/visipedia/ssw60 (accessed on 10 March 2026).

Conflicts of Interest

Authors Xinliang Xu, Xin Wen, and Heng Zhao were employed by State Grid Heilongjiang Electric Power Co., Ltd. Harbin Power Supply Company. All the authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

SE	Squeeze-and-Excitation
CNN	Convolutional Neural Network
ViT	Vision Transformer
MBConv	Mobile Inverted Bottleneck Convolution
STFT	Short-Time Fourier Transform
MFCC	Mel-Frequency Cepstral Coefficients
AST	Audio Spectrogram Transformer
GAP	Global Average Pooling
FFT	Fast Fourier Transform
RGB	Red, Green, Blue
mIoU	Mean Intersection over Union
R-CNN	Region-based Convolutional Neural Network
R-FCN	Region-based Fully Convolutional Network
SSD	Single Shot MultiBox Detector
YOLO	You Only Look Once
SSW60	Sapsucker Woods 60
SPPF	Spatial Pyramid Pooling—Fast
SPPCSPC	Spatial Pyramid Pooling Cross Stage Partial Connection

References

1. Qiu, W.; Liang, Y.; Wu, J.; Chen, G.; Liang, Y. Detection of Bird Species Related to Transmission Line Faults Based on Lightweight Convolutional Neural Network. IET Gener. Transm. Distrib. 2022, 16, 869–881.
2. Wang, S.; Ye, Z. Analysis of Bird Damage Accidents on Overhead Transmission Lines and Prevention Techniques. High Volt. Appar. 2011, 47, 61–67.
3. Liu, H.; Zhou, C.; Shen, Q.; Li, X. Birds Protection and Safety Research of Transmission Lines. In Proceedings of the IOP Conference Series: Earth and Environmental Science; IOP Publishing: Bristol, UK, 2020; Volume 467, p. 012132.
4. Xiang, Y.; Du, C.; Mei, Y.; Zhang, L.; Du, Y.; Liu, A. BN-YOLO: A Lightweight Method for Bird’s Nest Detection on Transmission Lines. J. Real-Time Image Process. 2024, 21, 194.
5. Qiu, Z.; Shi, D.; Kuang, Y.; Chen, J. Image Recognition of Harmful Bird Species Related to Transmission Line Outages Based on Deep Transfer Learning. High Volt. Eng. 2021, 47, 3785–3794.
6. Fraixedas, S.; Lindén, A.; Piha, M.; Cabeza, M.; Gregory, R.; Lehikoinen, A. A State-of-the-Art Review on Birds as Indicators of Biodiversity: Advances, Challenges, and Future Directions. Ecol. Indic. 2020, 118, 106728.
7. Kahl, S.; Wood, C.M.; Eibl, M.; Klinck, H. BirdNET: A Deep Learning Solution for Avian Diversity Monitoring. Ecol. Inform. 2021, 61, 101236.
8. Cui, S.; Hui, B. Dual-Dependency Attention Transformer for Fine-Grained Visual Classification. Sensors 2024, 24, 2337.
9. Mochurad, L. A New Efficient Classifier for Bird Classification Based on Transfer Learning. J. Eng. 2024, 2024, 8254130.
10. Qian, H.; Wang, M.; Zhu, M.; Wang, H. A Review of Multi-Sensor Fusion in Autonomous Driving. Sensors 2025, 25, 6033.
11. Bold, N.; Zhang, C.; Akashi, T. Cross-Domain Deep Feature Combination for Bird Species Classification with Audio-Visual Data. IEICE Trans. Inf. Syst. 2019, 102, 2033–2042.
12. Gavali, P.; Banu, J.S. A Novel Approach to Indian Bird Species Identification: Employing Visual-Acoustic Fusion Techniques for Improved Classification Accuracy. Front. Artif. Intell. 2025, 8, 1527299.
13. Han, Z.; Zhang, C.; Fu, H.; Zhou, J.T. Trusted Multi-View Classification with Dynamic Evidential Fusion. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 2551–2566.
14. Tan, M.; Le, Q. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114.
15. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
16. Van Horn, G.; Qian, R.; Wilber, K.; Adam, H.; Mac Aodha, O.; Belongie, S. Exploring Fine-Grained Audiovisual Categorization with the SSW60 Dataset. In Computer Vision—ECCV 2022; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2022; Volume 13668, pp. 271–288.
17. Gong, Y.; Chung, Y.A.; Glass, J. AST: Audio Spectrogram Transformer. In Proceedings of the Interspeech, Brno, Czech Republic, 30 August–3 September 2021; pp. 571–575.
18. Wang, Z.; Yuan, G.; Zhou, H.; Ma, Y.; Ma, Y. Foreign-Object Detection in High-Voltage Transmission Line Based on Improved YOLOv8m. Appl. Sci. 2023, 13, 12775.
19. Zheng, J.; Liu, H.; He, Q.; Li, G. GEB-YOLO: A Novel Algorithm for Enhanced and Efficient Detection of Foreign Objects in Power Transmission Lines. Sci. Rep. 2024, 14, 15769.
20. Li, Y.; Wang, S.; Chen, N.; Liu, X. The Transmission Line Foreign Body Detection Algorithm Based on Weighted Spatial Attention. Front. Neurorobot. 2024, 18, 1424158.
21. Chen, Z.; Yang, J.; Feng, Z.; Wang, L. RailFOD23: A Dataset for Foreign Object Detection on Railroad Transmission Lines. Sci. Data 2024, 11, 72.
22. Nan, S.; Liu, Y.; Zhang, X.; Wang, H. Transmission Line Foreign Object Segmentation Based on RB-UNet Algorithm. PeerJ Comput. Sci. 2024, 10, e2419.
23. Wu, Y.; Zhao, S.; Xing, Z.; Wei, Z.; Li, Y.; Li, Y. Detection of Foreign Objects Intrusion into Transmission Lines Using Diverse Generation Model. IEEE Trans. Power Deliv. 2023, 38, 3551–3560.
24. Qiu, Z.; Zhu, X.; Liao, C.; Chen, D. A Lightweight YOLOv4-EDAM Model for Accurate and Real-Time Detection of Foreign Objects Suspended on Power Lines. IEEE Trans. Power Deliv. 2023, 38, 1329–1340.
25. Ma, J.; Guo, J.; Zheng, X.; Fang, C. An Improved Bird Detection Method Using Surveillance Videos from Poyang Lake Based on YOLOv8. Animals 2024, 14, 3353.
26. Zou, C.; Liang, Y.-Q. Bird Detection on Transmission Lines Based on DC-YOLO Model. In Intelligent Information Processing X; Shi, Z., Vadera, S., Li, G., Eds.; IFIP Advances in Information and Communication Technology; Springer: Cham, Switzerland, 2020; Volume 581, pp. 222–232.
27. Li, P.; Luo, Y. YOLO-Bird: Small Bird Object Detection in Natural Scenes. Signal Image Video Process. 2025, 19, 351.
28. Hong, S.-J.; Han, Y.; Kim, S.-Y.; Lee, A.-Y.; Kim, G. Application of Deep-Learning Methods to Bird Detection Using Unmanned Aerial Vehicle Imagery. Sensors 2019, 19, 1651.
29. Chen, T.; Li, Y.; Qiao, Q. Fine-Grained Bird Image Classification Based on Counterfactual Method of Vision Transformer Model. J. Supercomput. 2023, 80, 6221–6239.
30. Zhang, Z.; Chen, Z.; Wang, Y.; Luo, X.; Xu, X. A Vision Transformer for Fine-Grained Classification by Reducing Noise and Enhancing Discriminative Information. Pattern Recognit. 2024, 145, 109979.
31. Xie, J.; Zhong, Y.; Zhang, J.; Liu, S.; Ding, C.; Triantafyllopoulos, A. A Novel Bird Sound Recognition Method Based on Multi-Feature Fusion and a Transformer Encoder. Sensors 2023, 23, 8099.
32. Carvalho, S.; Gomes, E.F. Automatic Classification of Bird Sounds: Using MFCC and Mel Spectrogram Features with Deep Learning. Vietnam J. Comput. Sci. 2023, 10, 39–54.
33. Yang, C.; Gan, X.; Peng, A.; Chen, L. ResNet Based on Multi-Feature Attention Mechanism for Sound Classification in Noisy Environments. Sustainability 2023, 15, 10762.
34. Gal, Y.; Ghahramani, Z. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In Proceedings of the International Conference on Machine Learning (ICML), New York, NY, USA, 19–24 June 2016; pp. 1050–1059.
35. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520.
36. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141.
37. Park, D.S.; Chan, W.; Zhang, Y.; Chiu, C.-C.; Zoph, B.; Cubuk, E.D.; Le, Q.V. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. In Proceedings of the Interspeech, Graz, Austria, 15–19 September 2019; pp. 2613–2617.
38. Benhoussa, S.; De Sousa, G.; Chanet, J.-P. FedBirdAg: A Low-Energy Federated Learning Platform for Bird Detection with Wireless Smart Cameras in Agriculture 4.0. AI 2025, 6, 63.
39. Zhang, J.; Li, Z.; Bo, B.; Xu, J.; Wu, S.; Ding, S.; Wu, C. Federated Learning with Label Distribution Skew via Logits Calibration. In Proceedings of the 39th International Conference on Machine Learning (ICML), Baltimore, MD, USA, 17–23 July 2022.
40. Tan, Y.; Long, G.; Liu, L.; Zhou, T.; Lu, Q.; Jiang, J.; Zhang, C. FedProto: Federated Prototype Learning across Heterogeneous Clients. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; Volume 36, pp. 8432–8440.

Figure 1. Motivation of the proposed confidence–adaptive audiovisual fusion.

Figure 3. Architecture of the EfficientNet-B3 model.

Figure 4. Architecture of the ResNet-50 model.

Figure 5. Histograms of fusion weight distributions for the three fusion strategies: (a) entropy fusion; (b) gap fusion; (c) combined fusion.

Table 1. Model configuration parameters.

Model	Parameter	Value	Model	Parameter	Value
EfficientNet-B3	Input size	300 × 300	ResNet-50	Input size	224 × 224
	Compound coeff.	3		Residual stages	4
	Depth factor	1.4		Channels	256-512-1024-2048
	Width factor	1.2		Output dim.	2048
	Dropout rate	0.3		Output classes	60
	Output classes	60

Table 2. Image classification performance comparison on SSW60.

Method	Top-1 (%)	Top-5 (%)	Macro-F1 (%)	Params (M)	FLOPs (G)	Inference (ms)
VGG-16	83.81	96.48	82.38	128.4	15.47	5.08
ResNet-50	89.75	98.34	88.91	25.6	4.13	3.88
MobileNetV3-Large	84.34	97.04	83.45	4.28	0.23	3.81
EfficientNet-B3	91.55	98.92	90.80	12.2	1.93	8.44

Table 3. Audio classification performance comparison on SSW60.

Method	Pretraining	Top-1 (%)	Top-5 (%)	Macro-F1 (%)	Params (M)	FLOPs (G)	Inference (ms)
VGG-16	ImageNet	53.48	80.54	52.91	138.4	15.47	4.43
AST	AudioSet	63.29	86.73	63.02	87.0	40.05	12.57
EfficientNet-B3	ImageNet	67.88	89.00	67.36	10.79	1.93	8.42
ResNet-50	ImageNet	68.20	88.84	68.06	25.63	4.13	4.06

Table 4. Fusion strategy comparison on SSW60.

Method	Top-1 (%)	Top-5 (%)	Macro-F1 (%)	Average Weight α
Image-Only	91.55	98.92	90.80	/
Audio-Only	68.20	88.84	68.06	/
TMC	95.09	99.46	94.75	/
Entropy Fusion	95.18	99.46	95.03	0.7946
Gap Fusion	95.20	99.42	95.00	0.6650
Combined Fusion	95.30	99.34	95.12	0.6344

Table 5. Sensitivity analysis on the balancing coefficient β.

β	Strategy	Top-1 (%)	Top-5 (%)	Macro-F1 (%)	Average Weight α
0.00	Entropy	95.24	99.46	95.00	0.6656
0.25	Entropy dominant	95.26	99.50	95.05	0.6493
0.50	Combined	95.28	99.46	95.06	0.6348
0.75	Gap dominant	95.28	99.48	95.06	0.6219
1.00	Gap	95.22	99.50	95.03	0.6103
