Enhancing Respiratory Disease Diagnosis with AI Lung Sound Analysis: A Web-Based Approach

Sreejith, Reshma; Ramasamy, R. Kanesaraj; Mohd-Isa, Wan-Noorshahida; Abdullah, Junaidi

doi:10.3390/fi18060318


Faculty of Computing and Informatics, Multimedia University, Cyberjaya 63100, Malaysia


Future Internet 2026, 18(6), 318; https://doi.org/10.3390/fi18060318

Submission received: 25 April 2026 / Revised: 5 June 2026 / Accepted: 9 June 2026 / Published: 11 June 2026

(This article belongs to the Special Issue Artificial Intelligence-Enabled Smart Healthcare)


Abstract

Accurate and timely diagnosis of respiratory diseases remains a critical challenge in clinical practice, particularly in resource-limited and remote healthcare settings. This study proposes a web-based automated respiratory disease classification system leveraging a hybrid Convolutional Neural Network–Long Short-Term Memory with Time-Distributed (CNN-LSTM-TD) architecture for lung sound analysis. The proposed model integrates three complementary time-frequency representations—Mel-Frequency Cepstral Coefficients (MFCCs), Mel-spectrograms, and Chroma Short-Time Fourier Transform (Chroma-STFT)—to comprehensively capture both local spectral characteristics and long-range temporal dependencies inherent in respiratory cycles. Specifically, the TimeDistributed CNN block extracts localised acoustic features from sequential frames, while the LSTM layer models their temporal evolution, enabling robust identification of pathological acoustic signatures such as wheezes and crackles. The model was rigorously evaluated on the benchmark ICBHI 2017 dataset across six diagnostic categories: healthy, asthma, chronic obstructive pulmonary disease (COPD), pneumonia, upper respiratory tract infection (URTI), and bronchiectasis. The CNN-LSTM-TD model achieved an F1-score of 0.94, recall of 0.91, precision of 0.97, overall accuracy of 96.40%, and an AUC-ROC of 0.96, significantly outperforming standalone CNN, LSTM, and CNN-LSTM baseline models. The accompanying web interface supports audio file upload, real-time visualisation of waveforms and spectrograms, and confidence score reporting, collectively facilitating clinical decision support and telemedicine integration. These results demonstrate that the synergy of temporally aware deep feature extraction and accessible web deployment positions the proposed system as a clinically viable, scalable tool for automated respiratory disease diagnosis and remote patient monitoring.

Keywords:

web-based system; respiratory disease classification; lung sound analysis; hybrid CNN-LSTM-TD model; AI-driven diagnosis

1. Introduction

Respiratory diseases, including asthma and Chronic Obstructive Pulmonary Disease (COPD), represent significant global health challenges. These diseases affect millions of individuals worldwide, leading to high morbidity and mortality rates. Asthma alone affects an estimated 235 million people globally, and COPD is responsible for approximately 3 million deaths annually [1]. Despite their prevalence, effective management and early detection remain difficult, primarily due to the limitations of traditional diagnostic methods. For example, physician auscultation, which involves listening to lung sounds using a stethoscope, is highly subjective and requires a trained clinician. Similarly, spirometry tests, which measure lung function, are periodic and do not offer continuous monitoring of disease progression. As a result, patients often experience exacerbations or deteriorations in their condition between medical visits, resulting in delayed interventions and worsening health outcomes [2].

Over the past few decades, there has been increasing interest in leveraging technology to address the limitations of these conventional methods. One promising innovation in this field is the use of wearable sensor technologies for continuous monitoring of lung sounds. Wearable sensors that capture respiratory sounds, such as wheezes, crackles, and breath sounds, could provide real-time, on-demand monitoring that goes beyond the limitations of traditional methods [3]. This continuous data stream offers several advantages over periodic clinical visits, including the ability to detect exacerbations as soon as they occur, track disease progression over time, and even predict potential flare-ups before they manifest clinically. Changing from reactive to proactive management can greatly improve patient outcomes, especially for people with long-term respiratory diseases like asthma and COPD.

However, integrating wearable sensors with artificial intelligence (AI) for continuous lung sound analysis poses a range of technical and practical challenges. AI-driven systems for monitoring lung sounds are designed to detect specific acoustic patterns that may indicate abnormal respiratory conditions [4]. Sophisticated signal processing techniques enable these systems to transform raw audio into meaningful data, which machine learning algorithms then analyse. The ultimate goal is to develop an accurate and reliable system capable of automatically classifying lung sounds and detecting abnormalities in real time, with minimal human intervention. The potential applications of this technology extend beyond chronic disease management to early detection, patient monitoring, and telemedicine, offering an exciting avenue for improving healthcare delivery, especially in underserved regions with limited access to healthcare professionals [5].

Despite the exciting prospects of these AI-driven systems, several hurdles must be overcome for them to become a widely accepted and effective tool in clinical practice. The first major challenge is the noise and artefacts that often contaminate lung sound recordings. These artefacts can come from various sources, such as environmental noise, body movement, or interference from other physiological sounds, like heartbeats and digestive noises. Real-world environments where patients live or work, such as urban areas or home settings, can introduce considerable noise, complicating the task of extracting meaningful respiratory sounds from the data [6]. Researchers have worked to develop signal processing techniques to filter out these unwanted noises and improve the signal-to-noise ratio, but their presence remains a significant challenge for AI models tasked with analysing lung sounds in uncontrolled environments.

Another significant challenge is the inherent variability in lung sounds. Lung sounds differ across individuals due to factors such as age, gender, body size, and lung condition. For instance, the presence of comorbidities like obesity, heart disease, or diabetes can alter the acoustic properties of lung sounds [7]. Additionally, variations in the placement of sensors on the body, body posture, and even the quality of the recording equipment can affect the acoustic data. Therefore, the AI models used for lung sound analysis need to be trained to handle this variability and generalise well across different populations, environments, and sensor configurations. This requires a robust dataset that captures a wide range of lung sounds across diverse patient profiles, a task that remains difficult due to the lack of large, representative datasets for training such models [8].

In addition to signal processing and variability issues, computational limitations present another challenge for AI-driven systems that monitor lung sounds. For these systems to be truly effective in real time, they must operate on low-power, wearable devices that are capable of continuously recording and processing lung sounds. The limited computational resources of these devices impede the deployment of complex AI models that demand significant processing power. As such, AI models need to be optimised for efficiency, balancing computational complexity against real-time performance. Research efforts are ongoing to develop lightweight models and edge computing solutions that can enable real-time monitoring without compromising on accuracy or power consumption. However, this remains a significant challenge, particularly for devices with limited battery life or storage capacity [9].

Furthermore, the success of AI-driven systems for monitoring lung sounds relies not only on technical innovation but also on regulatory and ethical considerations. As wearable devices collect sensitive health data, it is essential to ensure that these systems comply with privacy regulations such as the General Data Protection Regulation (GDPR) in Europe or the Health Insurance Portability and Accountability Act (HIPAA) in the United States. The integration of AI into healthcare requires a transparent and ethical approach to data management, ensuring patient confidentiality and security. Additionally, AI models used for lung sound analysis must be interpretable, providing clinicians with insights into how the system arrived at its conclusions. Without transparency in decision-making, AI systems may be met with resistance from healthcare providers, who may be hesitant to trust a “black box” system over their clinical judgement. Building trust in AI technologies through rigorous clinical validation and transparent algorithms is crucial for their successful adoption in healthcare [10].

Despite these challenges, the potential benefits of AI-driven systems for monitoring lung sounds are substantial. Continuous, real-time monitoring has the potential to transform the management of respiratory diseases by facilitating earlier intervention, enhancing disease control, and improving patient outcomes. Furthermore, the ability to provide remote monitoring and telemedicine solutions offers substantial benefits in underserved regions with limited access to healthcare resources [11]. Wearable AI systems have the potential to democratise healthcare, making high-quality monitoring and diagnostics accessible to a larger population, particularly in resource-limited settings.

The aim of this study is to develop and evaluate a web-based AI system for multiclass lung-sound classification using a hybrid CNN-LSTM-TD model. The work addresses two related but distinct problems: (i) algorithmic classification of respiratory diseases from acoustic features and (ii) practical delivery of the trained model through a web-based decision-support interface.

The main contributions of this study are as follows.

Separation of Algorithmic and System Contributions
- The CNN-LSTM-TD model functions as the core classification engine.
- The web-based platform serves as the deployment layer and user interaction interface.
Formalisation of Lung-Sound Classification Task
- Defined as a six-class supervised learning problem using the ICBHI 2017 dataset.
- Includes preprocessing, feature extraction, and softmax-based disease prediction.
Performance Evaluation Against Baseline Models
- Compared the CNN-LSTM-TD model with CNN, LSTM, and CNN-LSTM baselines.
- Reported key metrics: F1-score, recall, precision, accuracy, AUC-ROC, and confusion-matrix analysis.
Development of Web Workflow for Practical Deployment
- Supports upload-based analysis, real-time visualisation, confidence-score display, and downloadable report generation.
- Facilitates clinical decision support and remote monitoring for healthcare applications.

2. Related Works

Respiratory diseases are a leading cause of morbidity and mortality worldwide, with early diagnosis being crucial for improving patient outcomes. Traditional methods of diagnosing respiratory conditions, such as manual auscultation, are often subjective and prone to error, especially in resource-limited settings. In recent years, deep learning techniques have emerged as a promising solution for automating the analysis of lung sounds to diagnose conditions like pneumonia, asthma, COPD, and COVID-19. The advancements in artificial intelligence (AI) have opened new possibilities for automating this diagnostic process, where deep learning models are increasingly capable of analysing lung sounds with greater precision.

Pandala et al. (2025) [12] present a deep learning-based approach for classifying abnormal respiratory sounds using a remote stethoscope vest integrated with deep convolutional neural networks (CNNs). The methodology involves preprocessing lung sound data into spectrograms, applying the Fourier Transform for feature extraction, and using data augmentation techniques such as horizontal flipping to reduce overfitting. The study evaluates several CNN architectures, including VGG, AlexNet, ResNet, and InceptionNet, with VGG-B1 achieving the highest classification accuracy at 96%. The system classifies four types of abnormal lung sounds: wheeze, rhonchi, stridor, and crackles, using datasets from R.A.L.E. Lung Sounds and Easy Auscultation. The VGG-B1 model showed outstanding performance with high precision and recall, achieving a prediction precision of 0.96 and a holistic average of 0.94 in accuracy, recall, and F1-score. In comparison, AlexNet outperformed all other models with a reported score of 1.00 across precision, recall, and F1-score, indicating perfect classification for all categories. While models like ResNet and InceptionNet performed well with around 96% accuracy, they showed limitations in identifying certain lung sounds, such as pleural rub. The study emphasises the possibility of combining deep learning with wearable stethoscope technology for real-time, non-invasive, and precise lung sound monitoring, which can greatly facilitate the early detection and remote diagnosis of respiratory disorders, providing an innovative solution for healthcare professionals and patients.

Prasetio and Anam (2025) [13] present a portable real-time Edge-AI system for the diagnosis of respiratory disorders via breath sound analysis. The system employs an adaptive gated fusion (AGF) method to dynamically weight the contributions of Mel-Frequency Cepstral Coefficients (MFCC) and formant characteristics, thereby improving its resilience to changing acoustic environments. Integrating these features with a lightweight bidirectional long short-term memory (BiLSTM) network enables the system to capture temporal fluctuations in breath sounds while remaining computationally efficient. The model was assessed using the Kaggle lung sound dataset, attaining an accuracy of 80.6%, with precision, recall, and F1-score also at 80.6%, a 3.5% improvement over current deep learning benchmarks. The system was implemented on a Raspberry Pi 4, with a latency of roughly 180 ms per 10 s sample, making it suitable for real-time applications. It showed encouraging results in practical assessments involving 50 participants, maintaining high precision in noisy settings. The Edge-AI system was evaluated against other state-of-the-art techniques, surpassing CNN-LSTM, CRNN-Attention, and MFCC-SVM approaches. The research underscores the system's capability for portable, privacy-preserving, and real-time screening of respiratory diseases, especially in resource-constrained environments, with future work aimed at larger demographic assessments and the integration of multimodal features.

Kaur et al. (2025) [14] examined the incorporation of AI-driven lung sound analysis for the early diagnosis and classification of interstitial lung disease (ILD). Their work emphasises that AI-driven auscultation can function as an accessible, economical and non-invasive method for identifying ILD, particularly in resource-constrained environments. The research discusses the integration of digital stethoscopes with AI algorithms for the analysis of lung sounds, including fine inspiratory crackles, which are clinically relevant for early ILD detection. Essential methodologies include Mel-Frequency Cepstral Coefficients (MFCCs), Short-Time Fourier Transform (STFT) and wavelet transformations for feature extraction. Although the study reports strong diagnostic potential, it also highlights the need for further clinical validation, which remains a key research gap for AI-based auscultation systems.

The research conducted by Im et al. (2023) [15] introduces a real-time wheeze counting system that uses deep learning to detect aberrant lung sounds, specifically wheezing, aimed at enhancing self-symptom management and telemedicine. The model employs a hybrid approach, integrating a one-dimensional convolutional neural network (1D-CNN) with a long short-term memory (LSTM) network to categorise respiratory sounds into three classifications: normal, wheeze, and break. The methodology seeks to manage continuous, real-time lung sound data, differentiating it from conventional lung sound classification methods that generally emphasise short-term recordings. The researchers trained the model with a dataset of 535 respiration cycles sourced from clinical patient records and open-source datasets. The model attained a classification accuracy of 90%, complemented by strong performance measures, including an F1 score of 0.91 and a ROC-AUC score of 0.98. The model’s performance was tested by a 10-fold cross-validation method and compared to other classifiers, such as Random Forest and K-Nearest Neighbours, demonstrating superior efficacy. The system proved effective in real-time applications by quantifying the frequency of wheezing events within a continuous clinical lung sound dataset, attaining an error rate of under 2% in comparison to manual evaluations by physicians. The study emphasised the model’s potential for integration into wearable devices, facilitating continuous monitoring of respiratory problems in clinical and non-clinical settings. The findings suggest that this method may greatly enhance the early identification and management of respiratory disorders, including asthma and COPD, while retaining the capacity to accommodate additional biosignals in the future.

Bikku et al. (2025) [16] propose a deep learning methodology for the early identification of respiratory disorders, employing a CNN-RNN fusion model for the analysis of lung sounds. This model employs Convolutional Neural Networks (CNNs) for spatial feature extraction from Mel-spectrograms and Recurrent Neural Networks (RNNs), particularly Long Short-Term Memory (LSTM) layers, for capturing temporal dependencies in the data. The model was trained using lung sound datasets from Coswara and ICBHI, focusing on disorders including pneumonia, asthma, Chronic Obstructive Pulmonary Disease (COPD), and COVID-19. The suggested approach attained remarkable performance metrics, with an accuracy of 94.0% on the ICBHI dataset and 92.1% on the Coswara dataset. The model attained a sensitivity of 93.3% for healthy adults, 93.8% for pneumonia, 91.7% for asthma, and 94.0% for COPD patients. The CNN-RNN model surpassed standard methods such as SVM and Random Forest in accuracy, precision, recall, and F1-score. Data augmentation techniques, including pitch shifting, time stretching, and noise injection, were employed to mitigate class imbalance, enhancing generalisation and performance across diverse categories. Moreover, the model included explainability tools such as Grad-CAM and SHAP to offer visual insights into the decision-making process, hence augmenting its clinical interpretability. The integration of CNN and RNN models enabled the system to acquire both spatial and temporal characteristics of lung sounds, which was essential for the precise diagnosis of diverse respiratory ailments. This novel method establishes deep learning-based lung sound analysis as an effective instrument for the automated, objective, and prompt diagnosis of respiratory diseases, especially in resource-limited environments.

Taken together, prior studies indicate that CNN, LSTM, CNN-RNN, attention-based, and edge-AI approaches can classify respiratory sounds with promising accuracy. However, several gaps remain: many works do not clearly separate algorithmic novelty from system deployment, do not fully describe patient-independent splitting and class imbalance handling, and do not provide enough information about preprocessing, reproducibility, interpretability, or web-system usability. The present study is therefore positioned as an integrated classification and deployment prototype, with the CNN-LSTM-TD model serving as the algorithmic component and the web application serving as the practical system component.

3. Materials and Methods

In this section, we describe the methods used to acquire and preprocess lung sound data and to develop models that distinguish healthy recordings from those associated with respiratory conditions. Figure 1 illustrates our approach, which follows the standard stages of a machine learning project while incorporating additional steps specific to lung sound analysis. These stages include acquiring lung sound data from the ICBHI 2017 dataset, preprocessing the audio data using techniques such as the Fast Fourier Transform (FFT), extracting features such as Mel-frequency cepstral coefficients (MFCCs), and training models such as the CNN-LSTM-TD to classify six categories: healthy, COPD, asthma, pneumonia, bronchiectasis, and URTI.

Step 1: Data Collection and Preparation
Lung sound data were obtained from the ICBHI 2017 dataset [17] and split into 80% for training, 10% for validation, and 10% for testing. Data preprocessing involved FFT denoising to reduce noise.
Step 2: Feature Extraction
The data were segmented into 2–3 s windows, and MFCC and spectrogram features were extracted from each segment.
Step 3: Model Development
A hybrid CNN-LSTM-TD model was created, trained on the data, and optimised for lung sound classification.
Step 4: Detection
The trained model was used to classify the ICBHI test data, providing detection and confidence scores through the web interface.

The first phase focused on data collection and preprocessing. Lung sound recordings were obtained from the publicly accessible ICBHI 2017 dataset. The audio data were then processed using noise reduction methods, such as Fast Fourier Transform (FFT) denoising, to enhance signal quality by suppressing background noise. The data were also prepared for classification using feature extraction methods such as spectrogram analysis and Mel-Frequency Cepstral Coefficients (MFCCs). The second phase involved the construction and training of the AI model. The processed data were used to train deep learning models, such as Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks, to identify patterns in lung sound characteristics and detect respiratory disorders. To capture both spatial and temporal information from the lung sounds, a CNN-LSTM-TD (Time Distributed) architecture was employed. The TimeDistributed wrapper applies the same CNN block to each frame of the input sequence, and the LSTM layer then models how the resulting per-frame features evolve over time.
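The data flow through a time-distributed CNN followed by an LSTM can be illustrated with a small NumPy sketch. It is a shape-level illustration only, not the model used in the study: the weights are random and untrained, the tiny 1-D convolution stands in for the CNN block, and all names and sizes (`frame_features`, `lstm_last_state`, 8 frames, 4 kernels, 5 hidden units) are assumptions. Only the six output classes follow the text.

```python
import numpy as np

def frame_features(frame, kernels):
    """Stand-in for the CNN block: 1-D convolutions, ReLU, global max-pool."""
    return np.array([np.maximum(np.convolve(frame, k, mode="valid"), 0.0).max()
                     for k in kernels])

def lstm_last_state(seq, W, U, b):
    """Run one LSTM layer over seq of shape (T, F); return the final hidden state."""
    hidden = U.shape[1]
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    for x_t in seq:
        gates = W @ x_t + U @ h + b          # stacked i, f, o, g pre-activations
        i, f, o, g = np.split(gates, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

rng = np.random.default_rng(0)
T, frame_len, n_kernels, hidden, n_classes = 8, 64, 4, 5, 6
frames = rng.standard_normal((T, frame_len))        # synthetic input frames
kernels = rng.standard_normal((n_kernels, 3))       # shared across all frames
W = rng.standard_normal((4 * hidden, n_kernels)) * 0.1
U = rng.standard_normal((4 * hidden, hidden)) * 0.1
b = np.zeros(4 * hidden)
V = rng.standard_normal((n_classes, hidden)) * 0.1  # output (softmax) weights

# Time-distributed step: the same kernels are applied to every frame.
per_frame = np.stack([frame_features(f, kernels) for f in frames])  # (T, n_kernels)
# Recurrent step: the LSTM summarises the sequence of per-frame features.
probs = softmax(V @ lstm_last_state(per_frame, W, U, b))            # (n_classes,)
```

The key point is that `kernels` is shared across frames, so the feature extractor learns one set of filters regardless of where in the recording a frame falls, while the recurrent pass carries information between frames.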

The AI model’s ability to accurately distinguish between abnormal and normal lung sounds was assessed, and disease detection was performed using the web application interface. By combining AI technology with a web-based platform to facilitate prompt diagnosis and intervention, this two-phase procedure aims to improve the detection and monitoring of respiratory disorders, especially in areas with limited access to healthcare specialists. The entire process, from data collection to AI model deployment and disease detection, is depicted in Figure 1.

The system is designed to process lung sound recordings uploaded directly by users via the web application. For testing and training, the lung sound recordings are exclusively sourced from the ICBHI dataset, which provides consistent and standardised data. Users can upload lung sound recordings in various audio formats, such as WAV or MP3, but for the purpose of this study, the sounds were specifically acquired from the ICBHI dataset, which only includes recordings obtained via stethoscopes.

The web application does not replace the CNN-LSTM-TD classifier; instead, it provides a practical delivery layer that allows non-specialist users or healthcare providers to interact with the trained model. The web workflow supports file upload, preprocessing, feature extraction, model inference, confidence-score display, visualisation and downloadable report generation. This deployment layer is useful because recorded lung sounds can be analysed without installing local machine-learning software, which is especially relevant for telemedicine and resource-limited clinical settings.

3.1. Dataset

This study used the 2017 International Conference on Biomedical Health Informatics (ICBHI) respiratory sound dataset [17], a publicly available benchmark for lung-sound analysis. The dataset contains 920 audio recordings collected from 126 subjects, including adult and paediatric participants with different respiratory conditions. The disease categories used in this manuscript include healthy/normal, COPD, asthma, bronchiectasis, pneumonia and URTI. Table 1 reports the class distribution before balancing and shows the training, validation and testing allocation used for model development [18].

The recordings were captured using different stethoscope devices and auscultation locations, which introduces variability in signal quality, recording length and acoustic characteristics. To reduce the risk of over-optimistic performance, patient-independent splitting should be preserved so that recordings from the same subject do not appear simultaneously in training and testing subsets [19]. This point is important because recording-level random splitting can cause data leakage and inflate reported accuracy.

The lung sound dataset used in this study is divided into several disease categories, with a clear distribution of instances for each condition. The dataset is split as follows, ensuring a representative allocation for training, validation, and testing, as shown in Table 1:

3.1.1. Balancing the ICBHI Dataset: Applying Oversampling to Mitigate Class Imbalance

The ICBHI dataset presents a significant class imbalance issue. The COPD class contains 572 samples, whereas asthma contains 90 samples, bronchiectasis and pneumonia contain 28 samples each, and URTI contains only 16 samples. Such imbalance may cause a model to prioritise the majority class and underperform on minority diseases.

To prevent data leakage and ensure unbiased evaluation, the dataset was first split at the subject level into training, validation, and testing subsets, ensuring that no subject appears in more than one subset. This step ensures that duplicated or augmented samples do not appear in the testing set, avoiding overoptimistic performance metrics.
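A subject-level split of this kind can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `split_by_subject`, the `(subject_id, label, path)` record layout, and the fixed seed are assumptions, and the 80/10/10 ratio is applied to subjects, so the resulting proportions of recordings may differ slightly. The sketch also does not stratify by class.

```python
import random

def split_by_subject(records, train_frac=0.8, val_frac=0.1, seed=0):
    """Split records into train/val/test so that no subject spans two subsets.

    records: list of (subject_id, label, path) tuples (hypothetical layout).
    """
    subjects = sorted({r[0] for r in records})
    rng = random.Random(seed)
    rng.shuffle(subjects)

    n_train = int(round(train_frac * len(subjects)))
    n_val = int(round(val_frac * len(subjects)))
    train_ids = set(subjects[:n_train])
    val_ids = set(subjects[n_train:n_train + n_val])
    # Every remaining subject goes to the test subset.

    train = [r for r in records if r[0] in train_ids]
    val = [r for r in records if r[0] in val_ids]
    test = [r for r in records
            if r[0] not in train_ids and r[0] not in val_ids]
    return train, val, test

# Toy example: 20 subjects with 3 recordings each (synthetic, not ICBHI data).
records = [(s, "COPD" if s % 2 else "healthy", f"rec_{s}_{i}.wav")
           for s in range(20) for i in range(3)]
train, val, test = split_by_subject(records)
```

Because membership is decided per subject before any recording is assigned, all recordings from one person end up in exactly one subset.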

After creating the training set, random oversampling was applied only to the training data to balance the classes. Each minority class was randomly oversampled until it matched the size of the majority class (COPD). The validation and testing sets remained untouched to ensure that evaluation metrics reflect performance on unseen subjects.
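Random oversampling restricted to the training subset might look like the sketch below. The helper name `random_oversample`, the `(features, label)` pair layout, and the toy class counts are assumptions for illustration; the study's actual implementation is not described in this level of detail.

```python
import random
from collections import Counter, defaultdict

def random_oversample(samples, seed=0):
    """Duplicate minority-class samples at random until every class
    matches the size of the largest class.

    samples: list of (features, label) pairs from the TRAINING subset only.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item in samples:
        by_class[item[1]].append(item)

    target = max(len(items) for items in by_class.values())
    balanced = []
    for items in by_class.values():
        balanced.extend(items)
        # Draw extra copies with replacement to reach the majority size.
        balanced.extend(rng.choices(items, k=target - len(items)))
    rng.shuffle(balanced)
    return balanced

# Toy training set (synthetic counts, not the ICBHI distribution).
train = [(i, "COPD") for i in range(10)] + [(i, "URTI") for i in range(2)]
balanced = random_oversample(train)
counts = Counter(label for _, label in balanced)
```

In this toy case the two URTI samples are each repeated several times, which makes the duplication effect discussed in the text easy to see.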

However, random oversampling is a simple balancing strategy and has limitations. Because it duplicates existing minority samples rather than creating genuinely new acoustic variation, it can increase the risk of overfitting, especially for very small classes such as URTI, bronchiectasis, and pneumonia [20]. For stronger comparison, future experiments should evaluate additional strategies such as class weighting, SMOTE-style synthetic sampling where appropriate, and audio-specific augmentation such as pitch shifting, time stretching, and controlled noise injection.

The oversampled training set was then used to train the machine-learning models. While this corrected approach may result in slightly lower performance metrics compared to previous reports, it provides a more realistic and reliable assessment of the model’s ability to generalise to new subjects. The evaluation should still be interpreted together with the confusion matrix, per-class metrics, and validation strategy [21].

By applying oversampling only to the training set and maintaining a subject-independent split, the model can learn more effectively from all classes, improving its ability to make accurate detections across all categories, particularly for minority classes like URTI, bronchiectasis, and pneumonia.

Table 2 reports the post-oversampling class distribution and the corresponding training, validation and testing split. Each class was balanced to 572 samples before the split was applied.

After using the oversampling technique and dividing the dataset into training, validation, and testing sets, Table 2 displays the class distribution. To ensure balance across the classes, each disease category was oversampled to a total of 572 samples. After that, the dataset was divided so that each class had 458 samples for training, with 80% of the samples from each class allocated to the training set. The validation set, which is used to adjust the model’s hyperparameters, included 10% of the samples from each class, or 57 samples per class. In a similar vein, 10% of the samples (or 57 samples per class) were included in the testing set, which is used to assess the model’s performance following training. By guaranteeing that every class is equally represented in the model, bias towards the majority class is avoided, and the model’s capacity to generalise to all illness categories is enhanced.

3.1.2. Data Preprocessing and Signal Enhancement

A consistent preprocessing pipeline was applied before feature extraction. The main steps were audio format loading, amplitude normalisation, FFT-based denoising, transient-noise suppression and segmentation into fixed-length respiratory windows. These steps were used to improve signal quality and produce stable inputs for the CNN-LSTM-TD classifier [22].

Normalisation standardised the signal amplitude so that differences in recording loudness did not dominate the feature representation. Z-score normalisation was applied as follows:

z[n] = \frac{x[n] - \mu_x}{\sigma_x}    (1)

where x[n] is the raw audio sample, μ_x is the mean amplitude of the recording, and σ_x is the standard deviation. This transformation centres the signal and scales it to a comparable amplitude range before denoising and feature extraction.
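As a minimal sketch, the normalisation in Equation (1) can be written in NumPy as follows. The function name `zscore_normalise` and the small `eps` term are assumptions; `eps` is an added safeguard against silent recordings and is not described in the text.

```python
import numpy as np

def zscore_normalise(x, eps=1e-8):
    """Return (x - mean) / std for a 1-D audio signal.

    eps avoids division by zero when the recording is constant (assumption).
    """
    x = np.asarray(x, dtype=np.float64)
    return (x - x.mean()) / (x.std() + eps)

signal = np.array([0.2, -0.1, 0.4, 0.0, -0.5])  # synthetic samples
z = zscore_normalise(signal)
```

After this step the signal has approximately zero mean and unit standard deviation, whatever the original recording level.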

After normalisation, FFT-based denoising was used to transform the signal from the time domain into the frequency domain. For a discrete lung-sound signal x[n] of length N, the Fast Fourier Transform estimates the frequency components as

X[k] = \sum_{n=0}^{N-1} x[n] e^{-j 2\pi k n / N}    (2)

A band-pass filter H[k] was then applied to retain the main respiratory sound band and suppress irrelevant low- and high-frequency components. The filtered spectrum was computed as Y[k] = H[k]X[k], and the cleaned signal was reconstructed using the inverse FFT, y[n] = IFFT{Y[k]}. In this study, the relevant respiratory band was treated as approximately 20–1000 Hz.
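The filtering step Y[k] = H[k]X[k] followed by the inverse FFT can be sketched as below. The sketch assumes an ideal (brick-wall) mask for H[k]; the exact filter shape used in the study is not specified, and the function name `fft_bandpass` and the synthetic test tones are illustrative only.

```python
import numpy as np

def fft_bandpass(x, sample_rate, low_hz=20.0, high_hz=1000.0):
    """Zero FFT bins outside [low_hz, high_hz] and transform back.

    The brick-wall mask is an assumption, not the study's documented filter.
    """
    x = np.asarray(x, dtype=np.float64)
    spectrum = np.fft.rfft(x)                             # X[k] for a real signal
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sample_rate)  # bin centre frequencies
    mask = (freqs >= low_hz) & (freqs <= high_hz)         # H[k]
    return np.fft.irfft(spectrum * mask, n=len(x))        # y[n] = IFFT{H[k] X[k]}

# Synthetic signal: a 200 Hz tone plus 3000 Hz interference.
sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 200 * t)
noisy = tone + 0.5 * np.sin(2 * np.pi * 3000 * t)
clean = fft_bandpass(noisy, sr)
```

A hard mask is the simplest choice but can introduce ringing around sharp events such as crackles; a tapered mask would be a reasonable alternative.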

The selected frequency range is motivated by the acoustic characteristics of common respiratory sounds. Normal breath sounds are concentrated mainly in lower frequency ranges, wheezes often contain sustained higher-frequency components, and crackles appear as short transient events with broad spectral content [23]. Filtering therefore improves signal quality while preserving diagnostically relevant patterns. Care is required, however, because excessive filtering can remove clinically useful sound components.

In addition to filtering out frequency-based noise, the system also addresses transient noises such as coughing, clothing rustling, or short-duration disturbances that can distort lung sound recordings. These transient noises are typically brief but can interfere with the model’s ability to correctly identify lung sound patterns. The system uses threshold-based filtering to detect high-amplitude spikes that represent these transient noises. For instance, when a cough or clothing rustling occurs, the system identifies these brief high-amplitude events, removes them from the signal, and preserves the remaining lung sound features for analysis. This ensures that any short-duration, non-representative artefacts are eliminated, allowing for a cleaner and more accurate representation of the lung sounds [24].
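One way to realise threshold-based spike removal is sketched below. The threshold rule (a multiple `k` of a robust scale estimate based on the median absolute deviation) and the choice to zero flagged samples are assumptions; the text does not state how the threshold is set or how removed samples are replaced.

```python
import numpy as np

def suppress_transients(x, k=6.0):
    """Zero samples whose magnitude exceeds k times a robust scale estimate.

    Returns the cleaned signal and a boolean mask of the flagged samples.
    """
    x = np.asarray(x, dtype=np.float64)
    # The median absolute deviation is barely affected by the spikes themselves.
    mad = np.median(np.abs(x - np.median(x)))
    threshold = k * 1.4826 * mad   # 1.4826 scales MAD to a Gaussian std
    spikes = np.abs(x) > threshold
    cleaned = x.copy()
    cleaned[spikes] = 0.0
    return cleaned, spikes

rng = np.random.default_rng(0)
breath = 0.1 * rng.standard_normal(4000)   # synthetic low-level signal
breath[1000:1010] += 5.0                   # synthetic cough-like burst
cleaned, spikes = suppress_transients(breath)
```

Too low a value of `k` would also remove genuine crackles, which are themselves short and high in amplitude, so the threshold has to be chosen with care.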

After filtering, the Inverse Fast Fourier Transform (IFFT) reconstructs the cleaned signal in the time domain [25]. The threshold-based transient removal described above complements the FFT-based denoising, so that both out-of-band noise and short-duration artefacts are suppressed before segmentation.

Following the denoising process, the audio is segmented into 2–3 s intervals, a length considered suitable for disease analysis. These shorter segments allow the system to detect and analyse short-term changes or significant variations in lung sounds, such as wheezing, crackling, or coughing. To ensure continuity between segments, a 50% overlap is used, which minimises the loss of information at the segment boundaries.
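The segmentation step can be sketched as follows. The 2 s window and 4000 Hz sampling rate are illustrative assumptions, the function name `segment_signal` is hypothetical, and dropping a trailing partial window is a design choice the paper does not specify.

```python
import numpy as np

def segment_signal(x, sr, win_s=2.0, overlap=0.5):
    """Split a 1-D signal into fixed-length windows with fractional overlap."""
    x = np.asarray(x, dtype=float)
    win = int(round(win_s * sr))                # window length in samples
    hop = int(round(win * (1.0 - overlap)))     # step between window starts
    if x.size < win:
        return np.empty((0, win))               # recording shorter than one window
    starts = np.arange(0, x.size - win + 1, hop)
    return np.stack([x[s:s + win] for s in starts])

# Example: a 10 s ramp signal, so each sample's value equals its index.
sr = 4000
recording = np.arange(10 * sr, dtype=float)
segments = segment_signal(recording, sr)
```

With a 50% overlap, the second half of each window reappears as the first half of the next, which is what limits information loss at the boundaries.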

The system also extracts key features for distinguishing healthy and unhealthy breathing patterns. Features like zero-crossing rate (ZCR), spectral centroid, and MFCC (Mel-frequency cepstral coefficients) are computed. For example, the ZCR measures the rate at which the signal changes sign, and for healthy breath sounds, the ZCR is generally low and steady. In contrast, abnormal sounds like wheezing or crackles result in higher and more erratic ZCR values. The spectral centroid, which indicates the “centre of mass” of the frequency spectrum, tends to be lower for healthy lung sounds and shifts upwards for abnormal sounds. MFCCs capture the spectral characteristics of lung sounds, enabling the system to differentiate between normal and pathological conditions [26].
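Two of these descriptors are simple enough to compute directly. The sketch below uses standard textbook definitions of ZCR and spectral centroid. The paper does not give its exact formulas or framing parameters, so the whole-signal computation and the synthetic tones are assumptions.

```python
import numpy as np

def zero_crossing_rate(x):
    """Fraction of consecutive sample pairs whose signs differ."""
    signs = np.signbit(np.asarray(x, dtype=float))
    return float(np.mean(signs[1:] != signs[:-1]))

def spectral_centroid(x, sr):
    """Magnitude-weighted mean frequency of the spectrum, in Hz."""
    x = np.asarray(x, dtype=float)
    mag = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(x.size, d=1.0 / sr)
    total = mag.sum()
    return float((freqs * mag).sum() / total) if total > 0 else 0.0

# Synthetic tones (assumed values): a low tone and a higher one.
sr = 4000
t = np.arange(sr) / sr
low_tone = np.sin(2 * np.pi * 100 * t + 0.1)
high_tone = np.sin(2 * np.pi * 400 * t + 0.1)
zcr_low, zcr_high = zero_crossing_rate(low_tone), zero_crossing_rate(high_tone)
centroid_low = spectral_centroid(low_tone, sr)
centroid_high = spectral_centroid(high_tone, sr)
```

Both descriptors rise with the frequency content of the signal, which matches the description above of higher values for wheeze-like sounds.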

By combining FFT-based denoising, noise reduction, segmentation, and feature extraction, the system prepares raw lung sound recordings into high-quality, actionable data that is ready for accurate disease detection [27]. This preprocessing ensures that the input to the machine learning models is clean and consistent, providing the necessary foundation for reliable disease classification. Figure 2 shows the lung sound waveforms before and after FFT denoising, demonstrating the effectiveness of the denoising process.

3.2. Feature Extraction

After the lung sound audio has undergone pre-processing to improve its quality and remove noise, the next essential step is feature extraction. This phase plays a crucial role in transforming the raw audio data into meaningful representations that an artificial intelligence (AI) model can effectively analyse. The goal of feature extraction is to isolate the critical characteristics of the lung sounds that are relevant to disease detection. These features are instrumental for the machine learning model to learn the differences between normal and pathological lung sounds, allowing for accurate classification of various respiratory conditions such as asthma, COPD, pneumonia, and others [28].

Figure 3 shows the spectrogram of a synthetic lung sound-like signal. A spectrogram is a visual representation of how sound energy is distributed across different frequencies over time. This visualisation is valuable for analysing lung sounds, as it helps detect subtle temporal and frequency patterns that are characteristic of different respiratory conditions.

To obtain the MFCC feature, the first step involves generating the Mel-Spectrogram, which represents the audio signal’s power spectrum mapped onto the Mel scale. The Mel scale approximates how humans perceive pitch, making it particularly suited for audio and speech processing. Once the Mel-Spectrogram is obtained, the MFCCs are extracted using the Discrete Cosine Transform (DCT), which helps to reduce the dimensionality of the spectrogram while preserving the essential features that represent the sound signal.

The mathematical formula for extracting the MFCCs using DCT is given by Equation (3):

\mathrm{MFCC}_{m} = \sum_{k = 1}^{K} \log |S(k)| \cdot \cos\left(\frac{π m}{K} \left(k - \frac{1}{2}\right)\right)

(3)

where

MFCC_m is the m-th Mel-frequency cepstral coefficient.
S(k) is the Mel-spectrogram value in the k-th Mel bin.
k is the Mel-frequency bin index.
m is the coefficient index (for example, m = 1, 2, 3, …).
K is the total number of Mel bins.
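Equation (3) can be evaluated directly for a single frame of Mel-band values, as in the sketch below. The number of coefficients (13), the small floor added before the logarithm, and the inclusion of m = 0 as an energy-like term are assumptions for illustration. Audio libraries typically also apply a normalisation factor that is omitted here.

```python
import numpy as np

def mfcc_from_mel(S, n_coeffs=13):
    """Evaluate Equation (3) for one frame of Mel-band values S(1), ..., S(K)."""
    S = np.asarray(S, dtype=float)
    K = S.size
    k = np.arange(1, K + 1)                      # Mel bin index k = 1..K
    log_S = np.log(np.abs(S) + 1e-10)            # assumed floor avoids log(0)
    m = np.arange(n_coeffs)[:, None]             # coefficient index, starting at 0 here
    basis = np.cos(np.pi * m / K * (k - 0.5))    # cosine basis of Equation (3)
    return basis @ log_S

# A flat Mel spectrum carries no spectral shape, so only the m = 0 term is non-zero.
flat_frame = np.full(40, 2.0)
coeffs = mfcc_from_mel(flat_frame)
```

The flat-spectrum case is a quick sanity check: the higher coefficients respond only to variation across Mel bins, which is why they describe spectral shape rather than overall level.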

Mel-frequency cepstral coefficients (MFCCs) are a powerful feature for analysing lung sounds, capturing the power spectrum in a way that mimics human auditory perception. By mapping sound frequencies to the Mel scale, MFCCs highlight components relevant to human hearing, making them useful for detecting subtle patterns in lung sounds, such as wheezing or crackling. Spectrograms, which represent sound energy across frequencies over time, are also critical for detecting temporal variations in lung sounds, such as the periodicity of wheezing or crackling, which indicate respiratory conditions. Convolutional neural networks (CNNs) are well-suited for processing these spectrograms, as they can automatically extract patterns from time-frequency representations.

Along with MFCCs and spectrograms, features like Zero-Crossing Rate (ZCR), Spectral Centroid, and Spectral Bandwidth give us useful information about the texture and dynamics of lung sounds. ZCR helps detect irregularities, such as the erratic patterns of wheezing, while spectral centroid and bandwidth reflect changes in the quality and complexity of lung sounds, aiding in the differentiation between normal and pathological conditions. Temporal features and pitch variations also play a crucial role in identifying respiratory anomalies, with changes in pitch and timing patterns offering important clues for conditions like asthma, COPD, and pneumonia.

By combining these features, the system generates a comprehensive representation of lung sounds, enabling machine learning models to differentiate between healthy and diseased lungs. This ability to extract meaningful features from raw audio data enhances the model’s capacity to detect subtle, complex patterns, making AI-driven lung sound monitoring systems highly effective for accurate, timely disease detection [29].

3.3. Model Architecture for Lung Sound Classification

This model integrates a Convolutional Neural Network (CNN), a Long Short-Term Memory (LSTM) network, and a TimeDistributed layer to effectively capture both spatial and temporal features from lung sound data.

The architecture of the CNN-LSTM-TD model is depicted in Figure 4. It begins with an input layer that accepts three types of features: Mel-Spectrogram, MFCC, and ChromaSTFT. These features are naturally 2-D images (time × frequency), making them well-suited for convolutional processing as they contain both temporal and spectral information.

The convolutional stage consists of layers with 3 × 3 kernels, with the number of filters increasing from 32 in the first layer to 64 and then 128 in subsequent layers. These convolution layers act as local pattern detectors, identifying key acoustic events in lung sounds, such as tonal bands from wheezes or broadband bursts from crackles. Lower convolution layers capture basic features like edges and frequency transitions, while higher layers extract more complex representations, describing harmonic continuity and the distribution of respiratory anomalies.

Each convolution is followed by a batch normalisation layer, which standardises neuron activations to reduce internal covariate shift. This stabilises training, accelerates learning, and helps mitigate exploding or vanishing gradients. Max pooling layers with a 2 × 2 window follow, reducing spatial dimensions by retaining the most prominent features while discarding less useful information. This down-sampling preserves structural information, reduces computational complexity, and aids generalisation.

To prevent overfitting, dropout layers with rates of 0.3 to 0.5 randomly deactivate neurons during training, encouraging distributed representations and improving the model’s generalisation to unseen data. The convolutional blocks—comprising convolution, batch normalisation, max pooling, and dropout—are encapsulated within a TimeDistributed layer. This ensures convolutional operations are applied independently to each time step of the input sequence, maintaining the sequence orientation and allowing the model to capture both local acoustic features and their temporal order.
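To make the effect of the TimeDistributed convolutional blocks concrete, the sketch below traces tensor shapes through three blocks. It is not the authors' code. The input size (5 windows of 64 × 64 with 3 stacked feature channels), 'same' convolution padding, and a flatten step before the LSTM are assumptions, since the paper does not report these details.

```python
def conv_block_shape(h, w, filters, pool=2):
    """Shape after a 3x3 'same' convolution followed by pool x pool max pooling."""
    # Assumed 'same' padding keeps h and w through the convolution; pooling floors them.
    return h // pool, w // pool, filters

def trace_shapes(time_steps, h, w, channels, filter_sizes=(32, 64, 128)):
    """List the per-sequence tensor shape after each TimeDistributed block."""
    shapes = [(time_steps, h, w, channels)]
    for f in filter_sizes:
        h, w, channels = conv_block_shape(h, w, f)
        # The time axis is untouched: each window is processed independently.
        shapes.append((time_steps, h, w, channels))
    # Assumed flatten per time step, giving the (time, features) input of the LSTM.
    shapes.append((time_steps, h * w * channels))
    return shapes

shapes = trace_shapes(time_steps=5, h=64, w=64, channels=3)
```

The time dimension stays at 5 throughout, which is the property the TimeDistributed wrapper provides: the convolutional blocks shrink each window spatially while the window order is preserved for the LSTM.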

The LSTM layer then models temporal dependencies across respiratory windows. This is important because respiratory abnormalities are not only defined by isolated spectral patterns but also by their timing, duration and recurrence within the breathing cycle. Dense and dropout layers refine the learnt representation and reduce overfitting, while the softmax output layer produces a six-class disease probability vector.

The CNN-LSTM-TD architecture was selected over a standalone CNN because CNNs mainly capture local spatial or spectral features and do not explicitly model temporal progression. It was selected over a standalone LSTM because LSTMs are less effective when applied directly to high-dimensional spectrogram-like inputs without a preceding feature extractor. Compared with deeper image models such as VGG or ResNet, the proposed architecture is lighter and more suitable for a relatively small dataset such as ICBHI. Transformer-based audio models may offer strong performance but typically require larger datasets and greater computational resources [30].

The architecture is therefore appropriate for lung sound analysis because it combines local time-frequency pattern learning with temporal sequence modelling. This design supports upload-based classification, where each recording is segmented into windows and analysed as an ordered sequence rather than as a single static image.

For interpretability, the system reports class probabilities and confidence scores, and the web interface displays waveform and Mel-spectrogram visualisations. These outputs can help clinicians inspect whether the detected class is acoustically plausible. However, the current prototype does not yet provide full explainability methods such as Grad-CAM, SHAP or attention maps; these should be added in future work to strengthen clinician trust and model transparency.
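The relationship between the softmax output and a reported confidence score can be illustrated as follows. Taking the confidence as the largest class probability is an assumption, as the paper does not define how the score is derived. The logits and the class ordering are hypothetical values.

```python
import numpy as np

CLASSES = ["healthy", "asthma", "COPD", "pneumonia", "URTI", "bronchiectasis"]

def softmax(logits):
    """Convert raw scores to probabilities that sum to one."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()                  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def predict_with_confidence(logits):
    """Return the top class, its probability, and the full probability vector."""
    probs = softmax(logits)
    idx = int(np.argmax(probs))
    return CLASSES[idx], float(probs[idx]), probs

# Hypothetical logits for one recording.
label, confidence, probs = predict_with_confidence([0.2, 1.1, 4.0, 0.5, -0.3, 0.1])
```

A softmax probability reflects the model's relative preference among the six classes, not a calibrated clinical likelihood, so a high score should still be read alongside the waveform and spectrogram.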

4. Web Application Development: Illustrative Examples

AI Lung-Sound Monitoring System

The web interface for the Respiratory Disease Classification system offers an intuitive and efficient platform for diagnosing respiratory conditions based on lung auscultation sounds. The system processes uploaded lung sound recordings, classifying the respiratory disorder and providing diagnostic results. The platform also visualises key acoustic features, offering actionable insights into the patient’s condition.

The system does not analyse lung sounds in real time; instead, it processes each uploaded file and returns diagnostic results shortly after the upload. Figures 5–9 illustrate the stages of interaction with the web interface, from logging in to receiving diagnostic results and treatment recommendations.

Figure 5 illustrates the homepage of the Respiratory Disease Classification web application. The page presents the project title, “Respiratory Disease Classification Using Lung Sounds”, along with navigation options like “Home” and “Login”. The design is simple and user-friendly, facilitating navigation and interaction with the system for disease classification and associated aspects.

Figure 6 depicts the login page of the web interface. Users are required to enter their username and password to gain access. This authentication process ensures that only authorised personnel, such as healthcare providers, are permitted to utilise the diagnostic features, thereby safeguarding the security and confidentiality of patient data.

Figure 7 illustrates the page where users can enter a patient’s name and upload the corresponding lung sound file for analysis. The uploaded file must be in .wav format. Once the file is selected, the user can click the “Detect” button to initiate the classification process, during which the system analyses the sound and generates diagnostic results.

Figure 8 presents the interface displaying the visual representation of the uploaded lung sound. On the left side, the waveform of the lung sound is shown, while the right side features a Mel-spectrogram, which illustrates the frequency components of the sound over time.

Figure 9 shows the interface of the web application for Respiratory Disease Classification, displaying the results of a lung sound analysis. The detection results indicate a diagnosis of COPD with a 100% confidence score.

5. Results

The primary aim of this research was to show that integrating several feature types improves classification performance in lung sound analysis compared with using a single feature set. To test this, we assessed the proposed model, which combines Convolutional Neural Networks (CNNs) for spatial feature extraction, Long Short-Term Memory (LSTM) networks for capturing sequential patterns, and a TimeDistributed layer for temporal processing. Performance was evaluated using standard classification metrics: precision, recall, F1-score, and accuracy. A confusion matrix was used to assess how well the model distinguishes the lung disease categories, highlighting misclassification trends and areas for improvement. The model attained an accuracy of 96.4%, which supports the view that combining complementary features and processing stages improves classification performance relative to approaches that depend on a single feature set.

The following equations were used to calculate performance metrics:

\mathrm{Precision} = \frac{TP}{TP + FP}

(4)

\mathrm{Recall} = \frac{TP}{TP + FN}

(5)

\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}

(6)

\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

(7)

where

TP (True Positives) denotes positive instances correctly classified as positive.
TN (True Negatives) denotes negative instances correctly classified as negative.
FP (False Positives) denotes negative instances erroneously classified as positive.
FN (False Negatives) denotes positive instances erroneously classified as negative.

The model’s performance was additionally evaluated via a confusion matrix, designed as follows:

\begin{bmatrix} TP & FP \\ FN & TN \end{bmatrix}

(8)

In this study, a web-based system for respiratory disease classification using lung sound analysis was developed and evaluated. Table 3 presents the performance comparison of four deep learning models, namely CNN-LSTM-TD (proposed model), CNN, LSTM, and CNN-LSTM. The models were evaluated using F1-score, recall, precision, accuracy, and AUC-ROC to determine their effectiveness in classifying respiratory conditions, including healthy, asthma, COPD, pneumonia, URTI, and bronchiectasis.

The proposed CNN-LSTM-TD model achieved the highest performance, with an F1-score of 0.94, recall of 0.91, precision of 0.97, accuracy of 0.964, and an AUC-ROC of 0.96. This indicates excellent balance between precision and recall, minimising both false positives and false negatives, which is crucial in medical diagnostics where misclassification can have serious consequences. Compared to the baseline models, CNN achieved an F1-score of 0.80, recall of 0.75, precision of 0.85, accuracy of 0.82, and AUC-ROC of 0.87; LSTM achieved 0.78, 0.72, 0.85, 0.80, and 0.85, respectively; CNN-LSTM achieved 0.86, 0.84, 0.89, 0.88, and 0.90, respectively.

These results demonstrate that the CNN-LSTM-TD model significantly outperforms traditional CNN, LSTM, and CNN-LSTM architectures across all metrics. Its high AUC-ROC also highlights the model’s strong discriminatory capability across different disease classes. Overall, the findings indicate that the proposed model is well-suited for integration into healthcare systems, supporting early disease detection, timely treatment, and more accurate patient diagnoses in resource-limited settings.

The confusion matrix was an essential tool in evaluating the performance of the web-based lung sound analysis system for predicting respiratory diseases in the ICBHI dataset. The matrix allowed for a detailed comparison between the true labels (actual diseases) and the predicted labels (the model’s classifications), providing insights into how well the system was able to identify diseases such as Healthy, Asthma, COPD, Pneumonia, URTI, and Bronchiectasis. The rows of the confusion matrix correspond to the true labels, while the columns represent the predicted labels. The diagonal elements of the matrix, where the true and predicted labels matched, represented the correct detection made by the model. For instance, in the case of COPD, the model correctly identified 55 samples as COPD, and this was reflected in the corresponding cell of the confusion matrix. The off-diagonal elements showed the misclassifications, where the model wrongly predicted a disease. For example, healthy samples might have been misclassified as asthma, or COPD samples might have been wrongly identified as bronchiectasis. These misclassifications were clearly visible in the off-diagonal cells, and analysing these values gave insights into where the model needed improvement.

The accuracy of the system was calculated by dividing the number of correct detections (the sum of the diagonal elements) by the total number of samples in the test set. In this study, the system achieved an accuracy of 96.40%, meaning that 96.4% of the lung sound recordings tested were classified correctly. While the accuracy was high, the confusion matrix also highlighted certain misclassifications, which provide valuable information for further model refinement. For example, COPD lung sounds were occasionally misclassified as asthma or pneumonia, indicating that these diseases share similar acoustic features in some cases. Similarly, healthy sounds were sometimes confused with asthma or COPD, suggesting potential overlap in the features of these conditions.
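The calculation described above extends naturally to per-class precision and recall. The sketch below assumes the stated layout (rows are true labels, columns are predicted labels) and uses a small three-class matrix with invented counts. It does not reproduce the study's confusion matrix.

```python
import numpy as np

def metrics_from_confusion(cm):
    """Accuracy plus per-class precision, recall and F1 (rows = true, columns = predicted)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp     # predicted as the class but truly another class
    fn = cm.sum(axis=1) - tp     # truly the class but predicted as another class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = tp.sum() / cm.sum()   # sum of the diagonal over all samples
    return accuracy, precision, recall, f1

# Hypothetical three-class matrix, for illustration only.
toy_cm = [[8, 1, 1],
          [0, 9, 1],
          [2, 0, 8]]
accuracy, precision, recall, f1 = metrics_from_confusion(toy_cm)
```

In the multi-class setting each class is treated one-versus-rest, so a class that is never predicted would need a guard against division by zero.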

The confusion matrix was visualised using a colour-coded heatmap, as shown in Figure 10, where darker colours indicated higher values. This made it easier to identify the cells with the most correct detection (diagonal cells) and those with the most errors (off-diagonal cells). The visualisation helped to quickly assess which diseases were correctly classified and which ones were frequently misclassified.

Overall, the confusion matrix provided a comprehensive view of the system’s performance, showing both its strengths in correctly identifying healthy and COPD samples and areas where the model could be improved, particularly with regard to the misclassification of COPD and asthma. The high accuracy of 96.40% suggests that the system is effective for disease detection, but the analysis of misclassifications is essential for fine-tuning the model to improve its accuracy further.

To improve the reliability and robustness of the experimental evaluation, 10-fold cross-validation was implemented for model validation. As illustrated in Figure 11, the dataset was divided into 10 equal-sized subsets, or folds. In each iteration, one fold was used as the testing set, while the remaining nine folds were used for training. This process was repeated 10 times, ensuring that each fold was used once as the testing set. By using this approach, every data sample contributed to both training and testing at different stages of the evaluation, providing a more reliable estimate of the model’s performance compared to a single train–test split.
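The fold construction can be sketched as below. The shuffling seed, the sample count of 103, and the function name `kfold_indices` are illustrative assumptions. The split here is over individual samples, so it does not keep all recordings from one patient in the same fold.

```python
import numpy as np

def kfold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train_idx, test_idx) pairs so that each sample is tested exactly once."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    folds = np.array_split(order, n_folds)   # near-equal folds when not exactly divisible
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train_idx, test_idx

splits = list(kfold_indices(n_samples=103))
```

When overlapping segments are cut from the same recording, a sample-level split can place near-identical windows in both training and testing folds, so grouping by recording or patient is worth considering.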

The model used for evaluation was based on the CNN-LSTM-TD architecture, which integrates convolutional layers for spatial feature extraction, followed by Long Short-Term Memory layers to capture temporal dependencies in the lung sound data. The TimeDistributed layer was employed to preserve the temporal structure of lung sounds during feature extraction. Given the high class imbalance in the ICBHI 2017 dataset, class weights were computed and applied during training to ensure that underrepresented disease classes received appropriate attention during model learning.
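A common way to compute such class weights is inverse-frequency ("balanced") weighting, sketched below. The paper does not state which formula was used, so this heuristic and the toy label list are assumptions.

```python
import numpy as np

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(labels, return_counts=True)
    weights = counts.sum() / (classes.size * counts)
    return {c.item(): float(w) for c, w in zip(classes, weights)}

# Hypothetical imbalanced label list: 8 COPD recordings and 2 asthma recordings.
toy_labels = ["COPD"] * 8 + ["asthma"] * 2
weights = balanced_class_weights(toy_labels)
```

With this weighting, each class contributes the same total weight to the loss, so errors on the rarer class cost proportionally more per sample.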

In each fold, the model’s performance was evaluated using accuracy, precision, recall, and F1-score. These metrics were calculated for each iteration of the 10-fold cross-validation process shown in Figure 11. The mean and standard deviation across all folds were then computed to provide a more robust and reproducible evaluation of the model’s performance. In addition, 95% confidence intervals were calculated to estimate the variation in model performance across different folds. A paired t-test was also conducted to compare the CNN-LSTM-TD model with baseline models such as CNN, LSTM, and CNN-LSTM.
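The fold-level statistics can be computed as in the following sketch. The per-fold scores are invented purely for illustration and are not the study's results. The use of a Student's t interval with 9 degrees of freedom is an assumption about how the confidence intervals were formed.

```python
import numpy as np

T_CRIT_DF9 = 2.262  # two-sided 95% critical value of Student's t, 9 degrees of freedom

def summarise_folds(scores):
    """Mean, sample standard deviation and 95% t-interval for 10 fold scores."""
    scores = np.asarray(scores, dtype=float)
    if scores.size != 10:
        raise ValueError("the hard-coded critical value is only valid for 10 folds")
    mean = scores.mean()
    std = scores.std(ddof=1)
    half_width = T_CRIT_DF9 * std / np.sqrt(scores.size)
    return mean, std, (mean - half_width, mean + half_width)

def paired_t_statistic(a, b):
    """t statistic of the per-fold differences a - b."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return d.mean() / (d.std(ddof=1) / np.sqrt(d.size))

# Hypothetical per-fold accuracies for two models (not taken from the paper).
model_a = [0.81, 0.84, 0.79, 0.83, 0.80, 0.82, 0.85, 0.78, 0.82, 0.81]
model_b = [0.76, 0.80, 0.75, 0.77, 0.77, 0.78, 0.80, 0.74, 0.79, 0.76]
mean_a, std_a, ci_a = summarise_folds(model_a)
t_stat = paired_t_statistic(model_a, model_b)
significant = abs(t_stat) > T_CRIT_DF9   # equivalent to p < 0.05, two-sided
```

The pairing matters because both models are scored on the same folds, so the test is applied to the per-fold differences and not to two independent samples.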

The results from the 10-fold cross-validation demonstrated consistent performance across the folds. The average accuracy was 96.40% ± 2.1%, the average precision was 97% ± 1.9%, the average recall was 91% ± 3.2%, and the average F1-score was 0.94 ± 0.02. These results indicate that the model performed reliably across multiple data splits rather than depending on a single fixed train–test partition. The use of 10-fold cross-validation therefore reduced the risk that the reported performance reflects one favourable split and provided stronger evidence of the model’s generalisability.

Statistical significance testing was also conducted to strengthen the credibility of the evaluation. The paired t-test showed that the CNN-LSTM-TD model achieved statistically significant improvement over the baseline models, with p < 0.05. Therefore, the cross-validation results, confidence intervals, and statistical testing provide a more scientifically reliable basis for reporting the model’s performance.

Figure 11 illustrates the 10-fold cross-validation procedure used in this study. The dataset is divided into 10 folds, where each fold is used once as the testing set while the remaining nine folds are used for training. The final model performance is obtained by averaging the performance scores across all 10 iterations.

Figure 12 illustrates the training and validation accuracy and loss curves used to evaluate the performance of the respiratory disease classification model. These curves provide essential insights into how well the model learns over time and its ability to generalise to new, unseen data. The training accuracy curve showed steady improvement across the epochs, indicating that the model was progressively learning to classify the lung sound recordings more accurately as training proceeded. Similarly, the validation accuracy curve followed a similar upward trend, suggesting that the model was also performing well on unseen data, thus demonstrating its ability to generalise beyond the training set. The training loss curve displayed a consistent decrease, which is expected as the model learns to minimise its errors during training. The validation loss curve, which tracks the model’s performance on the validation data, mirrored the behaviour of the training loss curve, gradually declining as well. This indicates that the model is not only learning to fit the training data but is also improving its ability to make detections on new, unseen examples. The decreasing loss values suggest that the model is converging toward an optimal state, effectively reducing its errors on both training and validation datasets.

Importantly, there was no noticeable gap between the training and validation accuracy curves, nor between the training and validation loss curves, which suggests that the model is not overfitting. Overfitting occurs when a model performs well on the training data but poorly on new data. Here, both training and validation metrics improved consistently, indicating good generalisation within the evaluated dataset. Taken together, the high accuracy and low loss on both sets suggest that the model has learnt discriminative patterns from the lung sound recordings and has potential for real-world use in detecting respiratory conditions such as COPD, asthma, and pneumonia.

Figure 13 presents a bar graph comparing the performance of different models, namely CNN-LSTM-TD, CNN, LSTM, and CNN-LSTM, across four key evaluation metrics: accuracy, precision, recall, and F1-score. CNN-LSTM-TD, representing our model, consistently outperforms the other models in terms of accuracy, achieving 96.40%, and also demonstrates strong performance in precision (0.97), recall (0.91), and F1-score (0.94). The graph clearly shows that CNN-LSTM-TD leads in all categories, highlighting its superior ability to predict respiratory diseases from lung sound data compared to the other models.

In this study, we developed and evaluated a model for lung sound classification, aiming to accurately detect and classify various respiratory diseases. The model’s performance was assessed using a range of metrics, including precision, recall, F1-score, and overall accuracy. The results, summarised in Table 4, demonstrate that our model achieved a 96.4% overall accuracy, highlighting its strong ability to correctly classify lung sounds across multiple respiratory conditions. The model performed well across different disease categories, with healthy/normal lung sounds achieving a precision of 98% and a recall of 99%. For COPD, the model achieved a precision of 94% and a recall of 93%, while for pneumonia, it showed a precision of 92% and a recall of 94%. Asthma was classified with a precision of 91% and a recall of 90%, and bronchiectasis showed a precision of 88% and a recall of 85%. The model’s performance for Upper Respiratory Tract Infection (URTI), although lower, still demonstrated reasonable effectiveness, with precision and recall values of 85% and 80%, respectively. The macro-averaged precision, recall, and F1-score were 0.91, showing that the model maintains balanced performance across all disease categories. These results confirm the model’s robustness and its potential for practical application in the detection and diagnosis of respiratory diseases, with plans for further improvements, especially for minority classes like URTI and bronchiectasis.
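The macro-averages can be checked directly from the per-class precision and recall values quoted in the paragraph above. The sketch below works from those rounded percentages, so its results may differ in the last digit from values computed on the unrounded entries of Table 4.

```python
# Per-class (precision, recall) as quoted in the text, expressed as fractions.
per_class = {
    "healthy":        (0.98, 0.99),
    "COPD":           (0.94, 0.93),
    "pneumonia":      (0.92, 0.94),
    "asthma":         (0.91, 0.90),
    "bronchiectasis": (0.88, 0.85),
    "URTI":           (0.85, 0.80),
}

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

n = len(per_class)
# Macro-averaging gives every class equal weight, regardless of its sample count.
macro_precision = sum(p for p, _ in per_class.values()) / n
macro_recall = sum(r for _, r in per_class.values()) / n
macro_f1 = sum(f1(p, r) for p, r in per_class.values()) / n
```

From these rounded inputs the macro-averaged precision, recall and F1-score come out at roughly 0.91, 0.90 and 0.91, in line with the reported value of about 0.91.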

6. Discussion

The results of this study highlight the effectiveness of the CNN-LSTM-TD model for respiratory disease classification based on lung sound recordings. The model demonstrated exceptional performance, achieving an accuracy of 96.40% on the test dataset. This high level of accuracy indicates that the model was able to accurately detect disease categories across a wide range of lung sound recordings. These categories included conditions such as healthy, asthma, COPD, pneumonia, URTI, and bronchiectasis, which are common respiratory diseases. The model’s ability to correctly detect these diverse conditions underscores the robustness of the system and its potential for clinical applications.

In addition to the high accuracy, the model achieved precision and recall values of 0.97 and 0.91, respectively. Precision is the proportion of positive predictions that are correct: when the model reports a disease, that prediction was correct about 97% of the time in the evaluation. This is particularly important in healthcare applications, where false positives could lead to unnecessary treatment or interventions.

On the other hand, recall measures the model’s ability to detect all true positive cases. With a recall of 0.91, the model successfully identified 91% of the true cases, thereby reducing the risk of missing actual disease cases (false negatives). A high recall value is crucial in disease detection, especially for conditions like COPD and pneumonia, where early detection can significantly improve patient outcomes. The F1-score of 0.94 further reflects the balanced performance of the model, combining both precision and recall into a single metric. The high F1-score suggests that the model is effective at detecting diseases while minimising incorrect detections that could lead to misdiagnosis.
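As a consistency check, substituting the reported precision and recall into Equation (6) reproduces the reported F1-score after rounding:

\mathrm{F1\text{-}score} = \frac{2 \times 0.97 \times 0.91}{0.97 + 0.91} = \frac{1.7654}{1.88} \approx 0.94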

Compared with the CNN, LSTM, and CNN-LSTM baselines, the hybrid CNN-LSTM-TD model performed best on every metric. The CNN model, which relies solely on convolutional layers for feature extraction, achieved an accuracy of 82.00%, a precision of 0.85, a recall of 0.75, and an F1-score of 0.80. The LSTM model, designed to handle sequential data, achieved an accuracy of 80.00%, a precision of 0.85, a recall of 0.72, and an F1-score of 0.78. The CNN-LSTM model, combining convolutional and recurrent layers, performed better, with an accuracy of 88.00%, a precision of 0.89, a recall of 0.84, and an F1-score of 0.86. The CNN-LSTM-TD model surpassed all of these with an accuracy of 96.40%, a precision of 0.97, a recall of 0.91, and an F1-score of 0.94. We attribute this improvement to the TimeDistributed layer, which applies the convolutional blocks to each time step separately and so preserves the order of the windows passed to the LSTM. Respiratory conditions such as COPD and asthma often produce lung sound patterns that evolve over time, so capturing these temporal patterns is essential. The model can therefore learn both spatial features from the spectrograms (via the CNN) and temporal patterns in the lung sounds (via the LSTM), which yields better generalisation and performance than models that do not account for spatial and temporal features together.

Despite the high performance, the confusion matrix revealed misclassifications between diseases with overlapping acoustic characteristics. COPD, asthma and pneumonia can share wheezing, crackling or reduced-airflow patterns, which makes class separation challenging. These misclassifications indicate that high overall accuracy may hide clinically important weaknesses in specific classes, particularly when minority diseases contain fewer original recordings.

Several limitations remain. First, the ICBHI dataset is a benchmark dataset and may not fully represent noisy home environments, device variability, different microphone placements or unseen patient populations. Second, random oversampling balances class counts but does not create new patient diversity and may increase overfitting. Third, the web application was evaluated as an illustrative prototype; it still requires formal usability testing, latency measurement, security review and clinical workflow evaluation. Fourth, the claim of telemedicine suitability should be treated as potential rather than established clinical evidence until prospective deployment studies are conducted.

Future improvements should include external validation on independent datasets, patient-independent cross-validation, AUC-ROC reporting using saved probability outputs, comparison of multiple balancing strategies, and explainability tools such as Grad-CAM or SHAP to improve clinical interpretability.

7. Conclusions and Future Works

This study presents a web-based respiratory disease classification prototype using lung sound analysis and a hybrid CNN-LSTM-TD architecture. The scientific contribution is the use of a TimeDistributed CNN-LSTM structure that extracts local spectral patterns from segmented lung sound features and models their temporal evolution before softmax classification. The deployment contribution is a web interface that enables upload-based analysis, prediction confidence display, visualisation and report generation.

The proposed CNN-LSTM-TD model achieved 96.40% accuracy, 0.97 precision, 0.91 recall and a 0.94 F1-score in the reported evaluation, outperforming CNN, LSTM and CNN-LSTM baseline models. These findings suggest that combining spatial and temporal modelling can improve lung-sound classification performance.

The system may be useful as a decision-support tool in clinical or remote-monitoring contexts, especially where access to specialist respiratory assessment is limited. However, it should not be presented as a standalone diagnostic replacement. Clinical interpretation, external validation and broader deployment testing remain necessary before real-world use.

Future work should expand the dataset, validate the model on independent patient populations, evaluate additional sampling and augmentation strategies, compute AUC-ROC from saved probability outputs, and include formal usability and latency testing of the web system. These steps are necessary to establish generalisability, robustness and practical value in real clinical environments.
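Computing AUC-ROC from saved probability outputs could, for example, follow a one-vs-rest scheme in which each class is scored against all others and the per-class values are averaged. The sketch below uses the rank interpretation of AUC (the probability that a randomly chosen positive receives a higher score than a randomly chosen negative). The function names, class subset and probability values are hypothetical and serve only to illustrate the calculation.

```python
from typing import List


def binary_auc(scores: List[float], labels: List[int]) -> float:
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def macro_ovr_auc(probabilities: List[List[float]], true_labels: List[str], classes: List[str]) -> float:
    """Average the one-vs-rest AUC over all classes."""
    aucs = []
    for k, cls in enumerate(classes):
        scores = [row[k] for row in probabilities]
        labels = [1 if y == cls else 0 for y in true_labels]
        aucs.append(binary_auc(scores, labels))
    return sum(aucs) / len(aucs)


# Hypothetical saved softmax outputs; columns follow the order of `classes`.
classes = ["Healthy", "COPD", "Asthma"]
true_labels = ["Healthy", "COPD", "Asthma", "Healthy", "COPD", "Asthma"]
probabilities = [
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.6, 0.3],
    [0.6, 0.3, 0.1],
    [0.1, 0.5, 0.4],
    [0.2, 0.2, 0.6],
]

print(f"macro one-vs-rest AUC = {macro_ovr_auc(probabilities, true_labels, classes):.3f}")
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a small test set; a rank-based implementation would be preferable for larger evaluations.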

Cloud-based storage and secure report sharing may further support longitudinal monitoring, but these features must be developed with careful attention to privacy, cybersecurity and healthcare data-governance requirements.

In conclusion, the proposed system demonstrates the potential of combining AI-driven lung sound classification with a web-based decision-support interface. The work is promising, but further experimental, clinical and deployment validation is required before the system can be considered ready for routine clinical practice.

Author Contributions

Funding acquisition, R.K.R.; Methodology, R.K.R. and W.-N.M.-I.; Supervision, R.K.R., W.-N.M.-I. and J.A.; Writing—original draft, R.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Telekom Research and Development Sdn Bhd (TM R&D) under Project ID: RDTC/241124, which also covered the page charges for this publication.

Data Availability Statement

This study does not raise ethical concerns regarding data security, privacy, or confidentiality because of the nature of the review and the absence of human participants in the research process.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AI	Artificial Intelligence
COPD	Chronic obstructive pulmonary disease
CNN	Convolutional neural network
LSTM	Long Short-Term Memory
MFCC	Mel-frequency cepstral coefficients
FFT	Fast Fourier Transform
STFT	Short-Time Fourier Transform
URTI	Upper Respiratory Tract Infection

References

Gao, J.; Wang, H.; Shen, H. Task Failure Prediction in Cloud Data Centers Using Deep Learning. IEEE Trans. Serv. Comput. 2022, 15, 1411–1422.
Yang, Y. Comparison the diagnostic performance of lung ultrasound with chest radiography for detecting pneumonia in children: A systematic review and meta-analysis. Inplasy Protoc. 2023.
Huang, D.; Wang, L.; Wang, W. A Multi-Center Clinical Trial for Wireless Stethoscope-Based Diagnosis and Prognosis of Children Community-Acquired Pneumonia. IEEE Trans. Biomed. Eng. 2023, 70, 2215–2226.
Tirumanadham, N.S.; Thaiyalnayaki, S.; Sriram, M. Improving Predictive Performance in E-Learning through Hybrid 2-Tier Feature Selection and Hyper Parameter-Optimized 3-Tier Ensemble Modeling. Int. J. Inf. Technol. 2024, 16, 5429–5456.
Tirumanadham, N.K.M.K.; Priyadarshini, V.; Praveen, S.P.; Thati, B.; Srinivasu, P.N.; Shariff, V. Optimizing Lung Cancer Prediction Models: A Hybrid Methodology Using GWO and Random Forest. In Enabling Person-Centric Healthcare Using Ambient Assistive Technology; Springer Nature: Cham, Switzerland, 2025; pp. 59–77.
Sharma, A.; Sharma, N.; Srivastava, S.; Kumar, A.; Pratap, P. Ilung: Intelligent lung disease detection. In Security, Privacy and Data Analytics; Lecture Notes in Electrical Engineering; Springer: Singapore, 2025; pp. 253–265.
Cao, C.; Wang, Y.; Peng, L.; Wu, W.; Yang, H.; Li, Z. Asthma and Other Respiratory Diseases of Children in Relation to Personal Behavior, Household, Parental and Environmental Factors in West China. Toxics 2023, 11, 964.
Shaik, T.; Tao, X.; Higgins, N.; Li, L.; Gururajan, R.; Zhou, X.; Acharya, U.R. Remote Patient Monitoring Using Artificial Intelligence: Current State, Applications, and Challenges. WIREs Data Min. Knowl. Discov. 2023, 13, e1485.
Jeddi, Z.; Bohr, A. Remote Patient Monitoring Using Artificial Intelligence. In Artificial Intelligence in Healthcare; Academic Press: Cambridge, MA, USA, 2020; pp. 203–234.
Xu, X.; Sankar, R. Classification and Recognition of Lung Sounds Using Artificial Intelligence and Machine Learning: A Literature Review. Big Data Cogn. Comput. 2024, 8, 127.
Alqudah, A.M.; Qazan, S.; Obeidat, Y.M. Deep Learning Models for Detecting Respiratory Pathologies from Raw Lung Auscultation Sounds. Soft Comput. 2022, 26, 13405–13429.
Pandala, M.L.; Varshan, B.V.; Snehith, K.; Sumanth, Y.; Kumar, K.P.; Sowjanya, G.N. Real-Time Respiratory Sound Classification for Remote Diagnostic Systems Utilizing Deep Learning and Spectrum Analysis. J. Theor. Appl. Inf. Technol. 2025, 103, 4676–4690.
Prasetio, B.H.; Anam, M.A.Z. Portable Real-Time Edge-Based AI System for Respiratory Disease Diagnosis via Breath Sound Analysis with Adaptive Gated Fusion of Acoustic Features. Int. J. Online Biomed. Eng. (iJOE) 2025, 21, 31–45.
Kaur, A.; Cherukuri, S.P.; Handral, M.S.; Kukunoor, H.R.; Kc, R.; Godugu, S.; Lee, J.; Yerrapragada, G.; Elangovan, P.; Shariff, M.N.; et al. Artificial Intelligence Enabled Lung Sound Auscultation in the Early Diagnosis and Subtyping of Interstitial Lung Disease. J. Clin. Med. 2025, 14, 8500.
Im, S.; Kim, T.; Min, C.; Kang, S.; Roh, Y.; Kim, C.; Kim, M.; Kim, S.H.; Shim, K.; Koh, J.-S.; et al. Real-Time Counting of Wheezing Events from Lung Sounds Using Deep Learning Algorithms: Implications for Disease Prediction and Early Intervention. PLoS ONE 2023, 18, e0294447.
Bikku, T.; KPNV, S.S.; Thota, S.; Pujari, J.J.; Batchu, R.K.; Mortezaagha, P.; Kumar, M.K.; Anitha, R. Deep Learning-Driven Early Diagnosis of Respiratory Diseases Using CNN-RNN Fusion on Lung Sound Data. Sci. Rep. 2025, 15, 45233.
Minami, K.; Lu, H.; Kim, H.; Mabu, S.; Hirano, Y.; Kido, S. Automatic Classification of Large-Scale Respiratory Sound Dataset Based on Convolutional Neural Network. In Proceedings of the 2019 19th International Conference on Control, Automation and Systems (ICCAS), Jeju, Republic of Korea, 15–18 October 2019; pp. 804–807.
Jung, S.-Y.; Liao, C.-H.; Wu, Y.-S.; Yuan, S.-M.; Sun, C.-T. Efficiently Classifying Lung Sounds through Depthwise Separable CNN Models with Fused STFT and MFCC Features. Diagnostics 2021, 11, 732.
Kim, S.-Y.; Lee, H.-M.; Lim, C.-Y.; Kim, H.-W. Detection of Abnormal Symptoms Using Acoustic-Spectrogram-Based Deep Learning. Appl. Sci. 2025, 15, 4679.
Oishee, T.T.; Anjom, J.; Mohammed, U.; Hossain, I.A. Leveraging Deep Edge Intelligence for Real-Time Respiratory Disease Detection. Clin. eHealth 2024, 7, 207–220.
Huang, D.-M.; Huang, J.; Qiao, K.; Zhong, N.-S.; Lu, H.-Z.; Wang, W.-J. Deep Learning-Based Lung Sound Analysis for Intelligent Stethoscope. Mil. Med. Res. 2023, 10, 44.
Kim, Y.; Hyon, Y.; Lee, S.; Woo, S.-D.; Ha, T.; Chung, C. The Coming Era of a New Auscultation System for Analyzing Respiratory Sounds. BMC Pulm. Med. 2022, 22, 119.
Farrand, E.; Gologorskaya, O.; Mills, H.; Radhakrishnan, L.; Collard, H.; Butte, A. Machine Learning Algorithm to Improve Cohort Identification in Interstitial Lung Disease. Am. J. Respir. Crit. Care Med. 2023, 207, 1398–1401.
Melbye, H.; Ravn, J.; Pabiszczak, M.; Bongo, L.A.; Aviles Solis, J.C. Validity of a deep learning algorithm for detecting wheezes and crackles from lung sound recordings in adults. medRxiv 2022.
Wang, F.; Li, S.; Gao, Y.; Li, S. Computed Tomography-based Artificial Intelligence in Lung Disease—Chronic Obstructive Pulmonary Disease. MedComm-Futur. Med. 2024, 3, e73.
Zhang, J.; Wang, H.-S.; Zhou, H.-Y.; Dong, B.; Zhang, L.; Zhang, F.; Liu, S.-J.; Wu, Y.-F.; Yuan, S.-H.; Tang, M.-Y.; et al. Real-World Verification of Artificial Intelligence Algorithm-Assisted Auscultation of Breath Sounds in Children. Front. Pediatr. 2021, 9, 627337.
Sun, B.; Bayes, S.; Abotaleb, A.M.; Hassan, M. The Case for tinyML in Healthcare: CNNs for Real-Time On-Edge Blood Pressure Estimation. In Proceedings of the 38th ACM/SIGAPP Symposium on Applied Computing, Tallinn, Estonia, 27–31 March 2023; pp. 629–638.
Clifford, G.D.; Silva, I.; Moody, B.; Li, Q.; Kella, D.; Shahin, A.; Kooistra, T.L.; Perry, D.; Mark, R.G. The physionet/computing in cardiology challenge 2015: Reducing false arrhythmia alarms in the ICU. In Proceedings of the 2015 Computing in Cardiology Conference (CinC), Nice, France, 6–9 September 2015; pp. 273–276.
Exarchos, K.P.; Gkrepi, G.; Kostikas, K.; Gogali, A. Recent Advances of Artificial Intelligence Applications in Interstitial Lung Diseases. Diagnostics 2023, 13, 2303.
Gompelmann, D.; Gysan, M.R.; Desbordes, P.; Maes, J.; Van Orshoven, K.; De Vos, M.; Steinwender, M.; Helfenstein, E.; Marginean, C.; Henzi, N.; et al. AI-Powered Evaluation of Lung Function for Diagnosis of Interstitial Lung Disease. Thorax 2025, 80, 445–450.

Figure 1. Workflow of Respiratory Disease Classification Using Lung Sounds.

Figure 2. Lung Sound Waveforms Before and After FFT Denoising.

Figure 3. Spectrogram of a Synthetic Lung Sound-like Signal.

Figure 4. Detailed Layer-by-Layer Architecture of the CNN-LSTM-TD Model for Respiratory Disease Classification.

Figure 5. Homepage of the Website.

Figure 6. Login Page.

Figure 7. Patient Data and File Upload Interface.

Figure 8. Lung Sound Visualisation.

Figure 9. Detection Results.

Figure 10. Confusion Matrix for Disease Classification.

Figure 11. 10-Fold Cross-Validation Process.

Figure 12. Training vs. Validation Accuracy and Loss Curves.

Figure 13. Evaluation of Accuracy, Precision, Recall, and F1-Score for CNN-LSTM-TD and Other Models.

Table 1. Dataset Split for Lung Sound Classification by Disease Category.

Disease Category	Total Samples	Training Set (80%)	Validation Set (10%)	Testing Set (10%)
Healthy	186	149	19	18
Chronic Obstructive Pulmonary Disease (COPD)	572	458	57	57
Asthma	90	72	9	9
Bronchiectasis	28	22	3	3
Pneumonia	28	22	3	3
Upper Respiratory Tract Infection (URTI)	16	13	2	1
Total	920	736	93	91

Table 2. Post-Augmentation Dataset Split for Lung Sound Classification by Disease Category.

Disease Category	Total Samples	Training Set (80%)	Validation Set (10%)	Testing Set (10%)
Healthy	572	458	57	57
Chronic Obstructive Pulmonary Disease (COPD)	572	458	57	57
Asthma	572	458	57	57
Bronchiectasis	572	458	57	57
Pneumonia	572	458	57	57
Upper Respiratory Tract Infection (URTI)	572	458	57	57

Table 3. Model Performance Metrics for Respiratory Disease Classification.

Model Name	F1-Score	Recall	Precision	Accuracy	AUC-ROC
CNN	0.80	0.75	0.85	0.82	0.87
LSTM	0.78	0.72	0.85	0.80	0.85
CNN-LSTM	0.86	0.84	0.89	0.88	0.90
CNN-LSTM-TD (Our model)	0.94	0.91	0.97	0.964	0.96

Table 4. Macro-Averaged and Per-Class Metrics for Lung Sound Classification.

Disease Category	Precision	Recall	F1-Score
Healthy/Normal	0.98	0.99	0.98
COPD	0.94	0.93	0.94
Asthma	0.91	0.90	0.91
Bronchiectasis	0.88	0.85	0.86
Pneumonia	0.92	0.94	0.93
URTI	0.85	0.80	0.82
Macro-Averaged	0.91	0.91	0.91

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

