5.1. Our Findings
The competitive performance of the proposed ADFF-Net framework in terms of the integrated HS metric can be attributed to the dual-stream acoustic features and the ADFF module design, which combines attention-based feature fusion with a skip connection. First, combining the Mel-FBank and Mel-spectrogram features effectively balances the classification metrics (
Table 2). For example, MFCC features yield the highest specificity (88.16%) but extremely low sensitivity (9.43%), reflecting a severe imbalance in prediction performance. By contrast, Mel-FBank features achieve a higher HS score (57.69%), slightly outperforming Mel-spectrogram features. Importantly, the combination of Mel-FBank and Mel-spectrogram features consistently outperforms any single feature input or other pairwise combinations, demonstrating their strong complementarity. Second, the ADFF module achieves superior performance compared to baseline fusion strategies such as Concat-AST, AST-Concat, and ADFF-Net without the SC (
Table 3). While simple concatenation remains a common approach for feature integration, it often lacks the capacity to exploit complex dependencies between features. More advanced strategies, such as feature-wise linear modulation for context-aware computation [
50] or cross-attention fusion for integrating multi-stream outputs [
51], highlight the need for principled fusion mechanisms. ADFF-Net addresses this by leveraging attention-guided fusion and an SC to enhance feature interactions and stabilize learning. Overall, by jointly incorporating dual-branch acoustic inputs and an advanced fusion architecture, ADFF-Net achieves a balanced trade-off between sensitivity and specificity while reaching competitive state-of-the-art performance, suggesting robust behavior even under class-imbalanced conditions.
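To make the fusion idea concrete, the following is a minimal PyTorch sketch of attention-guided fusion of two feature streams with a skip connection. It is an illustrative simplification rather than the exact ADFF implementation; the gating formulation, tensor shapes, and the class name `AttentionFusion` are assumptions introduced here for exposition only.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Illustrative attention-guided fusion of two feature streams
    with a skip connection (not the exact ADFF module)."""
    def __init__(self, dim: int):
        super().__init__()
        # Scalar gate per token, computed from both streams jointly
        self.gate = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.ReLU(),
            nn.Linear(dim, 1),
            nn.Sigmoid(),
        )
        self.proj = nn.Linear(dim, dim)

    def forward(self, fbank_feat: torch.Tensor, spec_feat: torch.Tensor) -> torch.Tensor:
        # fbank_feat, spec_feat: (batch, tokens, dim) from the two branches
        alpha = self.gate(torch.cat([fbank_feat, spec_feat], dim=-1))  # (B, T, 1)
        fused = alpha * fbank_feat + (1.0 - alpha) * spec_feat          # attention-weighted mix
        # Skip connection: keep an un-gated path to stabilize training
        return self.proj(fused) + 0.5 * (fbank_feat + spec_feat)

# Usage with dummy features standing in for the Mel-FBank and Mel-spectrogram branches
fusion = AttentionFusion(dim=768)
out = fusion(torch.randn(2, 101, 768), torch.randn(2, 101, 768))
print(out.shape)  # torch.Size([2, 101, 768])
```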
It is also observed that ADFF-Net struggles to identify abnormal respiratory sounds in this RSC task (
Figure 10 and
Table 6), with particularly low SEN values for crackle and wheeze events (
Table 4). Several underlying factors may explain this problem. First, the ADFF module and AST backbone operate on time–frequency patches with a relatively coarse temporal resolution and global token aggregation. This design favors stable, high-energy patterns that characterize normal breathing but tends to smooth out short, low-energy transients. In particular, crackles and wheezes occupy only a few patches within a sound event and therefore contribute weakly to the aggregated feature representation. Second, the attention weights inside the ADFF module are supervised through the four-class cross-entropy loss. Under the severe class imbalance of the ICBHI2017 dataset, the learned attention maps are biased toward discriminating the majority normal class. Third, the objective function treats all training samples equally and may therefore drive the decision boundary toward higher specificity at the expense of sensitivity, as reflected in the confusion matrix in
Figure 10 and the per-class metrics in
Table 4. Together, these factors explain why ADFF-Net fails to reliably capture a substantial fraction of crackle and wheeze events.
On the ICBHI2017 database, there is still considerable room for improvement in the four-class classification task (
Table 6). The highest HS value (63.54%) is achieved by the BTS model [
47], which introduces a text–audio multi-modal approach leveraging metadata from respiratory sounds. Specifically, free-text descriptions derived from metadata, such as patient gender and age, recording device type, and recording location, are used to fine-tune a pre-trained multi-modal model. More broadly, similar advanced techniques are commonly adopted in recent SOTA methods, including large-scale models [
16,
28,
30], contrastive learning [
16,
29,
48], fine-tuning strategies [
28,
30], and data augmentation or cross-domain adaptation [
16,
28,
29]. A key limitation of this study is its relatively low class-wise SEN; even recent SOTA methods typically report SEN values of only between 42% and 46%. From a clinical perspective, such sensitivity is insufficient for medical diagnosis, because many abnormal breaths would be wrongly classified as normal sounds. Therefore, the proposed model should be regarded as a methodological contribution and a proof of concept on this public database, rather than as a deployable clinical decision-support tool in its current form. Notably, this limitation is not unique to ADFF-Net, but reflects a broader RSC challenge. Abnormal acoustic events such as crackles and wheezes are often subtle, short in duration, easily masked by background noise, and highly variable across patients and recording conditions. These intrinsic characteristics, combined with significant inter-class imbalance in the ICBHI dataset, make accurate detection particularly difficult for all existing deep learning systems. Accordingly, improving SEN remains an open research problem for the entire field rather than a deficiency specific to a single architecture.
Accurate classification of respiratory sounds into normal, crackle, wheeze, and both categories remains challenging for several reasons [
2,
6,
7,
8]. First, the acoustic boundaries between classes are often blurred. In clinical practice, crackles and wheezes frequently occur together, making the both category a composite of two abnormal patterns with overlapping spectral–temporal signatures. Second, substantial intra-class variability and inter-class similarity complicate discrimination. Intra-class variability arises from differences in age, body habitus, breathing effort, auscultation site, and disease stage. In turn, inter-class similarity may occur when classes share similar frequency ranges or when acoustic patterns are distorted by background noise or inconsistent recording quality. These phenomena are visible in the Mel-scale representations, as shown in
Figure 3,
Figure 4 and
Figure 5. Third, some abnormal sounds are faint, brief, or intermittent, making them easily confused with normal breathing. This challenge is reflected in the misclassification patterns shown in the confusion matrix (
Figure 10). Fourth, class imbalance is common. Normal recordings usually dominate public datasets, while crackle, wheeze, and both categories are under-represented [
5]. This imbalance biases model training and hinders reliable learning of minority classes. In summary, the task is difficult due to overlapping acoustic characteristics, high temporal variability, environmental and recording noise, and severe data imbalance [
52]. Addressing these challenges requires more expressive feature representations and robust classification strategies capable of reliably distinguishing the four respiratory sound categories.
5.2. Future Directions
While this study achieves substantial improvements in network design and classification performance on the four-class prediction task of the ICBHI2017 database, several directions remain open for future investigation. Given that the present study evaluates ADFF-Net only on the ICBHI2017 database and under a single dual-stream configuration, the current experimental evidence is not sufficient to support strong claims of broad generalizability beyond this database. In the short term, future work will examine ADFF-style modules on related problems, such as cough and heart sound analysis, where multi-representation and multi-feature schemes have already been shown to improve classification performance [
23,
25,
35,
38,
46]. Beyond medical acoustics, attention-based fusion of heterogeneous signals has also proven effective in other signal processing contexts, including environmental sound recognition and multi-modal medical diagnosis [
33,
36]. Therefore, evaluating ADFF-type fusion modules in these broader scenarios will be essential for rigorously establishing and refining their cross-dataset and cross-domain generalizability.
An important direction is balancing the number of cases across categories in respiratory sound databases [
4]. Imbalanced datasets often bias models toward majority classes, resulting in poor sensitivity for minority categories [
5]. Several strategies can be considered to address this issue. First, expanding clinical collaborations can help collect recordings from under-represented disease categories, ensuring a representative distribution of patient conditions. However, this approach is resource-intensive, requiring significant time, expert involvement, and funding. Second, data augmentation can be applied using signal processing techniques such as time-stretching, pitch shifting, noise injection, or spectrogram-level transformations to artificially increase the diversity of existing samples [
5]. Third, generative modeling offers a promising direction: realistic synthetic samples of rare pathological categories can be generated while maintaining clinical validity, using approaches such as conditional generative adversarial networks [
28,
53], variational autoencoders [
54], and generative adversarial diffusion [
55]. Fourth, domain adaptation provides another feasible solution by mitigating distribution shifts across recording devices, patient populations, or clinical centers [
9,
16,
28,
29,
30]. This can effectively increase the number of usable samples and improve the generalization of deep learning models. In addition, robust mathematical frameworks widely used in other domains to address class imbalance and enhance sensitivity are worth adopting. These include robust imbalance classification via deep generative mixture models [
56], class rebalancing through class-balanced dynamically weighted loss functions [
57], and class-imbalance learning based on theoretically guaranteed latent-feature rectification [
58].
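As a concrete illustration of the augmentation strategy discussed in this paragraph, the sketch below applies noise injection to a waveform and SpecAugment-style time/frequency masking to a Mel-spectrogram using NumPy only. The parameter values and array sizes are arbitrary placeholders, not settings used or validated in this study.

```python
import numpy as np

def add_noise(waveform: np.ndarray, snr_db: float = 20.0) -> np.ndarray:
    """Inject Gaussian noise at a target signal-to-noise ratio (dB)."""
    signal_power = np.mean(waveform ** 2) + 1e-12
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=waveform.shape)
    return waveform + noise

def spec_mask(mel: np.ndarray, max_freq_width: int = 8, max_time_width: int = 20) -> np.ndarray:
    """SpecAugment-style masking on a (n_mels, n_frames) Mel-spectrogram."""
    mel = mel.copy()
    n_mels, n_frames = mel.shape
    f0 = np.random.randint(0, n_mels - max_freq_width)
    t0 = np.random.randint(0, n_frames - max_time_width)
    mel[f0:f0 + np.random.randint(1, max_freq_width + 1), :] = mel.mean()
    mel[:, t0:t0 + np.random.randint(1, max_time_width + 1)] = mel.mean()
    return mel

# Toy usage on random data standing in for a respiratory-cycle recording
wave = np.random.randn(16000)             # 1 s at 16 kHz (placeholder)
mel = np.abs(np.random.randn(64, 101))    # placeholder Mel-spectrogram
augmented_wave = add_noise(wave, snr_db=15.0)
augmented_mel = spec_mask(mel)
```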
Diverse signal collection is another essential direction for advancing respiratory sound analysis. Relying solely on acoustic signals may restrict a model’s ability to capture all relevant information. Four complementary directions can be considered. First, fully exploiting acoustic signals remains fundamental. High-quality respiratory sound recordings are the core modality, and improvements such as multi-site recordings and standardized acquisition protocols [
4,
5] can enhance reliability while reducing noise-related biases. Second, incorporating additional modalities can provide richer diagnostic context. For instance, integrating physiological signals or chest imaging data can complement acoustic features and improve robustness in clinical decision-making [
46]. Third, leveraging contextual metadata, including patient age, gender, medical history, recording device type, and auscultation location, provides valuable cues that refine model predictions and support more personalized assessments [
47]. Fourth, transforming respiratory sounds into diverse quantitative forms, not limited to MFCCs, Mel-spectrograms, and deeply learned features [
16,
20,
22,
23,
24,
25,
26,
27,
28,
29,
30], facilitates comprehensive information representation [
5], enabling more effective identification of discriminative features and knowledge discovery.
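For reference, the sketch below computes two common Mel-scale representations with torchaudio: Kaldi-style log Mel filter-bank (FBank) features and a Mel-spectrogram converted to dB. The sampling rate, window settings, and number of Mel bins are illustrative assumptions and not necessarily the configuration used for ADFF-Net.

```python
import torch
import torchaudio

waveform = torch.randn(1, 16000)  # placeholder 1 s mono clip at 16 kHz
sample_rate = 16000

# Kaldi-style log Mel filter-bank energies: (frames, num_mel_bins)
fbank = torchaudio.compliance.kaldi.fbank(
    waveform, sample_frequency=sample_rate, num_mel_bins=64
)

# Mel-spectrogram in dB: (1, n_mels, frames)
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate, n_fft=400, hop_length=160, n_mels=64
)
mel_db = torchaudio.transforms.AmplitudeToDB()(mel_transform(waveform))

print(fbank.shape, mel_db.shape)
```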
Advances in network design are also key to driving further improvements in RSC performance. First, Transformer architectures remain the dominant backbone, often outperforming traditional deep networks. Their self-attention mechanisms allow models to capture long-range temporal dependencies in acoustic signals and highlight subtle but clinically relevant sound events [
14,
15,
16,
27,
28,
29,
30]. Second, extending large multi-modal models to integrate respiratory sounds, free-text patient descriptions, additional modalities, and acoustic feature interpretations can provide more comprehensive and context-aware representations [
47,
59]. This integration bridges the gap between signal-level patterns and clinical knowledge, thereby enhancing model robustness and generalization. In addition, other advanced strategies hold significant promise for boosting performance and clinical applicability. These include multi-task learning, which enables joint optimization of related tasks such as RSC and disease diagnosis [
60]; contrastive learning, which facilitates the extraction of discriminative embeddings under limited labeled data [
16,
29,
48]; and domain adaptation, which mitigates distribution shifts across recording devices, patient populations, and clinical centers [
9,
16,
28,
29,
30].
Beyond architectural design and feature fusion, loss-function engineering offers a promising direction for improving diagnostic sensitivity under severe class imbalance. In the ADFF-Net framework, the standard cross-entropy loss may not optimally address the skewed distribution of normal and pathological classes in the ICBHI2017 dataset [
4,
5]. Future studies could therefore explore class-weighted cross-entropy and Focal Loss to more aggressively penalize false negatives and to emphasize under-represented pathological categories such as crackles and wheezes [
52]. Systematically evaluating these loss functions in combination with the proposed ADFF module may help alleviate the high false-negative rate and enhance the clinical viability of respiratory sound classification models.
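For concreteness, the following is a minimal PyTorch sketch of the two loss functions mentioned above, class-weighted cross-entropy and Focal Loss. The class weights, the gamma value, and the assumed four-class ordering (normal, crackle, wheeze, both) are illustrative placeholders rather than values tuned or validated in this study.

```python
import torch
import torch.nn.functional as F

def weighted_ce(logits: torch.Tensor, targets: torch.Tensor,
                class_weights: torch.Tensor) -> torch.Tensor:
    """Cross-entropy with per-class weights to counter class imbalance."""
    return F.cross_entropy(logits, targets, weight=class_weights)

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               gamma: float = 2.0) -> torch.Tensor:
    """Focal loss: down-weights well-classified (mostly majority-class) samples."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    p_t = torch.exp(-ce)                  # probability of the true class
    return ((1.0 - p_t) ** gamma * ce).mean()

# Toy usage: 4 classes, e.g., normal / crackle / wheeze / both (assumed order)
logits = torch.randn(8, 4)
targets = torch.randint(0, 4, (8,))
weights = torch.tensor([0.3, 1.5, 1.5, 2.0])  # illustrative inverse-frequency weights
print(weighted_ce(logits, targets, weights), focal_loss(logits, targets))
```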
In the medical domain, where annotated datasets are often scarce and costly to obtain, relying solely on fully supervised learning is increasingly impractical. Expert labeling of respiratory sounds requires trained clinicians, careful listening, and sometimes consensus among multiple annotators, making large-scale, high-quality labels difficult to collect [
4,
46]. As a result, exploring alternative training paradigms that reduce dependence on precise labels has become essential. Promising directions include weakly supervised learning, which uses coarse or imperfect labels to guide model training [
61], and unsupervised learning, which discovers structure in data without relying on any labels [
62]. By encouraging the model to pull similar samples closer and push dissimilar samples apart in the feature space, contrastive learning, a powerful branch of self-supervised learning, has shown strong potential in respiratory sound analysis [
16,
29,
48] and medical text generation [
63]. These alternative paradigms can mitigate the limitations of small, imbalanced, or imperfectly labeled medical datasets and enhance the robustness of downstream classification models [
64]. Ultimately, integrating weakly supervised, unsupervised, and contrastive learning approaches may provide a more scalable and label-efficient pathway for future respiratory sound research.
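The sketch below shows a compact, SimCLR-style normalized-temperature contrastive (NT-Xent) objective in PyTorch, in which two augmented views of each recording form a positive pair. It is a generic illustration of the contrastive principle described above, not the specific formulation used in [16,29,48]; the temperature and embedding dimension are placeholders.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """SimCLR-style contrastive loss for two views z1, z2 of shape (B, D)."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2B, D), unit-norm embeddings
    sim = z @ z.t() / temperature                         # pairwise cosine similarities
    n = z.size(0)
    sim.fill_diagonal_(float("-inf"))                     # exclude self-similarity
    # The positive of sample i is its other augmented view (index i + B, modulo 2B)
    targets = (torch.arange(n, device=z.device) + n // 2) % n
    return F.cross_entropy(sim, targets)

# Toy usage: embeddings of two augmented views of 16 respiratory-cycle clips
loss = nt_xent(torch.randn(16, 128), torch.randn(16, 128))
```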
Given the inherent “black-box” nature of deep learning-based models, explainable AI (XAI) should be a central focus of future research aimed at clinical practice [
65,
66,
67,
68,
69]. A variety of XAI techniques can shed light on how models arrive at their predictions. For example, attention-map visualizations and gradient-based attribution methods [
66,
67] can identify the specific time–frequency regions that most strongly influence a model’s decisions. These tools allow researchers and clinicians to verify whether the model relies on physiologically meaningful acoustic cues rather than spurious noise or recording artifacts. Meanwhile, recent advances in Kolmogorov–Arnold Networks (KANs) provide an analytical and interpretable view by using explicit scalar functions along each input dimension to replace nonlinear transformations in deep learning [
68]. Building on this formulation, TaylorKAN further employs Taylor-series approximations to decompose these functions into analytically tractable low-order polynomial terms, enabling clear interpretation of how individual components contribute to model behavior [
69]. Most importantly, integrating XAI into respiratory sound analysis is essential for improving transparency, ensuring reliability, and strengthening clinical trust [
5]. Meaningful explanations can help clinicians understand why a model labels a segment as crackle, wheeze, both, or normal and can support more effective bias detection, with the ultimate goal of facilitating the translation of AI models from research environments to real-world clinical workflows.
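As a simple example of the gradient-based attribution mentioned above, the sketch below computes an input-gradient saliency map over a Mel-spectrogram, highlighting which time–frequency bins most influence the score of a chosen class. The `model` here is a small stand-in classifier, not ADFF-Net, and the input shape is an assumption.

```python
import torch

def saliency_map(model: torch.nn.Module, mel: torch.Tensor, target_class: int) -> torch.Tensor:
    """Input-gradient saliency: |d score(target_class) / d input| per T-F bin."""
    model.eval()
    mel = mel.clone().requires_grad_(True)       # (1, 1, n_mels, frames)
    score = model(mel)[0, target_class]          # logit of the class of interest
    score.backward()
    return mel.grad.abs().squeeze()              # (n_mels, frames) saliency

# Toy usage with a stand-in convolutional classifier (four output classes)
model = torch.nn.Sequential(
    torch.nn.Conv2d(1, 8, 3, padding=1), torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(), torch.nn.Linear(8, 4)
)
sal = saliency_map(model, torch.randn(1, 1, 64, 101), target_class=1)
print(sal.shape)  # torch.Size([64, 101])
```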
Despite their competitive performance, all SOTA models, including ours, share a persistent issue: relatively low SEN values (
Table 6). Such low sensitivity undermines clinical trust, as both clinicians and patients prioritize avoiding missed disease cases. This indicates that while current models reliably identify normal sounds, accurate detection of the subtle and diverse patterns of pathological events remains a major challenge [
11,
12,
13,
14,
15,
16]. Future work must therefore prioritize strategies that substantially improve sensitivity, such as targeted data augmentation for under-represented classes or loss functions that more heavily penalize false negatives. Additional improvements may come from better class balancing of the data, careful threshold tuning, and architectural designs that more effectively capture fine-grained acoustic features. Moreover, multi-center validation using additional datasets and real-world recordings will be essential for assessing generalizability and facilitating translation from laboratory research to clinical practice.
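As an illustration of the threshold-tuning idea mentioned above, the sketch below predicts “normal” only when its softmax probability exceeds a tunable threshold tau, otherwise falling back to the most likely abnormal class. The threshold would be selected on validation data to reach a desired sensitivity; the class index assigned to “normal” is an assumption for this example.

```python
import torch

NORMAL_IDX = 0  # assumed index of the normal class

def predict_with_threshold(logits: torch.Tensor, tau: float = 0.7) -> torch.Tensor:
    """Predict 'normal' only when its probability exceeds tau; otherwise choose
    the most likely abnormal class (raises sensitivity, usually at the cost of specificity)."""
    probs = torch.softmax(logits, dim=1)                       # (B, 4)
    abnormal = probs.clone()
    abnormal[:, NORMAL_IDX] = float("-inf")                    # mask the normal class
    preds = abnormal.argmax(dim=1)                             # best abnormal class
    preds[probs[:, NORMAL_IDX] > tau] = NORMAL_IDX             # keep only confident normals
    return preds

# Toy usage: sweep tau on validation logits to trade specificity for sensitivity
val_logits = torch.randn(32, 4)
for tau in (0.5, 0.7, 0.9):
    print(tau, predict_with_threshold(val_logits, tau).bincount(minlength=4))
```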