Article
Peer-Review Record

Lung Sound Classification Model for On-Device AI

Appl. Sci. 2025, 15(17), 9361; https://doi.org/10.3390/app15179361
by Jinho Park *, Chanhee Jeong, Yeonshik Choi, Hyuck-ki Hong and Youngchang Jo
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3:
Submission received: 25 July 2025 / Revised: 21 August 2025 / Accepted: 21 August 2025 / Published: 26 August 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

I'm delighted to receive the paper titled "Lung Sound Classification Model for On-Device AI". As the reviewer for this paper, I have read it thoroughly. Based on deep learning algorithms and the characteristics of lung auscultation sounds, the author proposes a lightweight lung sound classification model specifically designed for on-device (terminal device) environments. This approach employs Mel-Frequency Cepstral Coefficients (MFCC) to extract audio features. The extracted features are then stacked to form the model input. The lightweight model performs convolution operations that are specifically tailored to the characteristics of lung sounds in both the time and frequency domains. Comparative experimental results demonstrate that the proposed model achieves superior inference performance while maintaining a significantly smaller model size compared to traditional classification schemes, making it highly suitable for deployment on resource-constrained devices. Overall, the paper's topic holds practical application value, the method design is reasonable, and the experimental results are somewhat convincing. However, there are still some aspects that require further improvement and refinement.

In the Introduction and Related Works sections, while the relevant studies and the dataset used in this paper are introduced in great detail, it is recommended to streamline the comparative content regarding other similar studies, especially the content in Sections 2.2 and 2.3. It is advisable to directly describe how the research method in this paper performs classification and the expected outcomes, as well as directly elucidate how the research achievements realize On-Device AI, along with the anticipated practical value and scientific contributions.

At line 328, the section number should be Section 4.3. Please verify and make the correction.

This paper describes that extracting time-domain and frequency-domain features of lung sounds is effective, but is there a clinical correlation between the pathological characteristics of lung sounds and the model's classification results? A collaborative verification process involving doctors could be added.

The paper mentions selecting experimental equipment suitable for resource-constrained scenarios and conducting experimental comparisons, but it does not specify the inference time, power consumption, or memory usage. It is recommended to supplement relevant measured data on the target devices.

Suggestions for enhancing the paper's impact: It is recommended to provide algorithmic pseudocode or flowcharts at an appropriate section to visually demonstrate the specific implementation process of the proposed method. Additionally, making the model code and preprocessing pipeline publicly available would facilitate reproduction by other researchers.

Suggestions for enhancing the paper's impact: Deployment Case Studies: Present application cases of the model on real-world devices (e.g., smart stethoscopes) to demonstrate its practical utility.

In summary, the method chosen by the authors for model establishment demonstrates an innovative approach in terms of its generalization capability across different datasets (such as lung sounds collected from various hospitals), and has been thoroughly validated through experiments. However, there are still some aspects that require further improvement and refinement. If the authors can adequately address and enhance these issues, this paper has the potential to make significant contributions in the interdisciplinary field of medical AI and edge computing, and, once perfected, could offer novel insights for medical diagnosis in resource-constrained scenarios.

Author Response

 

RESPONSE TO THE REVIEWER

 

REVIEWER

Reviewer Comment 1:

In the Introduction and Related Works sections, while the relevant studies and the dataset used in this paper are introduced in great detail, it is recommended to streamline the comparative content regarding other similar studies, especially the content in Sections 2.2 and 2.3. It is advisable to directly describe how the research method in this paper performs classification and the expected outcomes, as well as directly elucidate how the research achievements realize On-Device AI, along with the anticipated practical value and scientific contributions.

 

Response:

Thank you for your comment. In accordance with your suggestion, we have revised the manuscript to clarify the comparison between Sections 2.2 and 2.3. Specifically, Section 2.2 now emphasizes that our proposed approach converts audio signals into MFCCs, Mel spectrograms, and chromagrams, which are then stacked as image inputs to the model. The model architecture employs Inception blocks with a relatively shallow depth, followed by a stabilizing convolutional layer before the final output. This design is expected to achieve high accuracy while maintaining a lightweight model size. Furthermore, as deployment requires compatibility with the target NPU, the model is quantized accordingly. In our study, we utilized the Hailo-8 module, and we demonstrate the practical utility of the proposed method through real-device experiments. Our approach contributes to the field by providing a scalable architecture that balances computational efficiency and classification performance, making it suitable for real-world deployment in resource-constrained environments. We updated the manuscript by adding the anticipated practical value and scientific contributions.

 

In the revised version of the paper:

Bardou et al. [18] extracted MFCC and LBP features from lung sound recordings and achieved high classification accuracy using a CNN-based model. Petmezas et al. [19] proposed a CNN-LSTM hybrid model that extracts features from STFT spectrograms and captures temporal dependencies for lung sound classification. Li et al. [20] proposed a residual network with augmented attention convolution, leveraging variable Q-factor wavelet and triple STFT transforms to enhance lung sound feature representation and classification performance.

 

Shuvo et al. [21] proposed a lightweight CNN with four convolutional blocks that classifies six respiratory conditions using scalogram-based features. Wanasinghe et al. [22] introduced a lightweight CNN that classifies ten lung sound categories using stacked representations of Mel spectrograms, MFCC, and chromagrams for efficient and low-complexity feature extraction.

In this study, we propose a scheme that uses stacked image representations composed of MFCC, mel spectrogram, and chroma features extracted from audio signals as input to the model. To achieve model compactness, Inception blocks are employed, and their output features are integrated to enhance model stability. This design is expected to achieve high classification accuracy while maintaining a lightweight model size.

 

This work offers practical value by demonstrating that a lightweight yet accurate model can operate efficiently on resource-constrained edge devices. Furthermore, it contributes to the field by providing a scalable architecture that balances computational efficiency and classification performance for respiratory sound analysis.

 

Reviewer Comment 2:

At line 328, the section number should be Section 4.3. Please verify and make the correction.

 

Response:

We have thoroughly checked and revised our paper for a better presentation.

 

In the revised version of the paper:

4.4. Classification Results in NPU

Equations (8)–(11) [equation content not reproduced in this record].

Reviewer Comment 3:

This paper describes that extracting time-domain and frequency-domain features of lung sounds is effective, but is there a clinical correlation between the pathological characteristics of lung sounds and the model's classification results? A collaborative verification process involving doctors could be added.

 

Response:

We have described the time-domain and frequency-domain features of lung sounds in the Introduction.

 

Lung sounds are typically classified into three main categories: wheezes, crackles, and rhonchi. Wheezes are continuous, high-pitched, abnormal sounds that occur when airflow is partially obstructed, often observed in patients suffering from pneumonia or interstitial pulmonary fibrosis [4, 5]. Crackles are sudden, discontinuous sounds heard during both inspiration and expiration, and are commonly associated with conditions such as asthma and Chronic Obstructive Pulmonary Disease (COPD). Rhonchi are low-pitched, coarse, snoring-like sounds produced by airway secretions and turbulent airflow. Accurate classification of these lung sounds requires detailed analysis of their acoustic properties. Normal respiratory sounds typically occupy a frequency range of 100 to 200 Hz and may be perceptible up to 800 Hz under sensitive detection conditions. In contrast, abnormal respiratory sounds span a broader frequency range, from approximately 200 to 2000 Hz, and are often characterized by lower pitch and more irregular patterns [6–8].

 

However, accurate disease classification based solely on auscultatory lung sounds remains a challenge for clinicians, due to the subjective nature of interpretation and the overlapping acoustic characteristics of various respiratory conditions. To address this issue, recent studies—including those cited in references [9, 10]—have explored the application of deep learning techniques to automatically analyze lung sounds and improve diagnostic accuracy.

In the current study, we were unable to incorporate a collaborative validation process involving medical professionals due to resource constraints. Nevertheless, we plan to develop a deployable diagnostic device in future work, through which clinical validation and real-world performance assessment can be conducted.

 

In the revised version of the paper:

Finally, we plan to implement the on-device AI module on a real stethoscope to evaluate the model’s performance, power consumption, and inference latency in a practical setting.

 

Reviewer Comment 4:

The paper mentions selecting experimental equipment suitable for resource-constrained scenarios and conducting experimental comparisons, but it does not specify the inference time, power consumption, or memory usage. It is recommended to supplement relevant measured data on the target devices.

Response:

To demonstrate the effectiveness of the proposed scheme in on-device environments, we have added evaluation results on inference time, power consumption, and memory usage in the manuscript. However, NPU memory usage was excluded due to restricted access to such information by the NPU module manufacturer.

 

In the revised version of the paper:

Table 8 presents the inference time and power consumption results measured on the NPU. Among the baseline models, MobileNetV2 achieves the fastest average inference time but exhibits a high standard deviation and the highest power consumption, indicating potential instability and inefficiency under real-time constraints. InceptionV2, EfficientNet, and Stacked models maintain relatively low inference times, but all show high variance in latency and elevated power demands. ResNet50 demonstrates stable performance with the lowest standard deviation among the baselines, though its average inference time remains the highest, limiting its real-time applicability. In contrast, the proposed model offers a more balanced and efficient trade-off. Although its average inference time is slightly higher than that of the lightweight architectures, it achieves notably low variability and moderate power consumption. Although 5-second audio segments are required for disease classification based on lung sound data, the proposed model completes inference in only 245.9 ms, enabling real-time processing.

 

Table 8. Result of inference time in NPU.

Metric | Proposed | ResNet50 | MobileNetV2 | InceptionV2 | Stacked | EfficientNet
Average | 245.9 ms | 269.77 ms | 150.27 ms | 166.43 ms | 157.74 ms | 166.84 ms
Standard Deviation | 197 ms | 186 ms | 241 ms | 260 ms | 271 ms | 264 ms
Power Consumption | 4.6 W | 3.8 W | 5.55 W | 4.95 W | 5.1 W | 4.45 W
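
The record does not show how these latency statistics were collected. As an illustration only, the sketch below measures per-inference latency and reports the mean and standard deviation used in Table 8; `run_inference` is a hypothetical stand-in for the actual Hailo runtime call, which is not reproduced here, and the power figures would come from an external meter rather than from code.

```python
# Sketch of collecting per-inference latency statistics such as those in Table 8.
# `run_inference` is a hypothetical callable wrapping the device's inference API.
import time
import statistics

def measure_latency(run_inference, inputs, warmup: int = 10):
    for x in inputs[:warmup]:          # warm-up runs are discarded
        run_inference(x)
    times_ms = []
    for x in inputs:
        t0 = time.perf_counter()
        run_inference(x)
        times_ms.append((time.perf_counter() - t0) * 1000.0)
    return statistics.mean(times_ms), statistics.stdev(times_ms)
```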

 

We analyzed the memory footprint of each model when deployed on a Raspberry Pi with NPU support. As shown in Table 9, the proposed model consumed approximately 60.3 MB of system memory, while ResNet50 exhibited the highest usage at 150.4 MB. In contrast, lightweight architectures such as MobileNetV2 and EfficientNet maintained significantly lower memory requirements at 15.1 MB and 20.2 MB, respectively. These results confirm the suitability of the proposed and lightweight models for real-time inference on memory-constrained edge devices. We also intended to analyze the resource usage of the NPU itself, but this was not possible because the NPU manufacturer does not expose this information.

 

Table 9. Result of memory usage in NPU deployment.

Memory | Proposed | ResNet50 | MobileNetV2 | InceptionV2 | Stacked | EfficientNet
RAM | 60.3 MB | 150.4 MB | 15.1 MB | 50.6 MB | 20.7 MB | 20.2 MB
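
The record likewise does not state how the RAM figures in Table 9 were obtained. One plausible approach on the Raspberry Pi host is to compare the process resident set size before and after loading the compiled network, for example with psutil; `load_model` below is a hypothetical placeholder.

```python
# Sketch of measuring a model's host-memory footprint via the process RSS.
import os
import psutil

proc = psutil.Process(os.getpid())
rss_before = proc.memory_info().rss

# model = load_model(...)  # hypothetical: load the compiled network here

rss_after = proc.memory_info().rss
print(f"Model memory footprint: {(rss_after - rss_before) / 1e6:.1f} MB")
```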

 

 

Reviewer Comment 5:                     

Suggestions for enhancing the paper's impact: It is recommended to provide algorithmic pseudocode or flowcharts at an appropriate section to visually demonstrate the specific implementation process of the proposed method. Additionally, making the model code and preprocessing pipeline publicly available would facilitate reproduction by other researchers.

Response:

Thank you for your comment. We updated the manuscript by adding the pseudocode of preprocessing process.

 

In the revised version of the paper:

Algorithm 1: Preprocessing of Lung Sound

Data: Lung sound signal
Result: Preprocessed audio feature
1: Sample the audio signal at the target sampling rate (sr)
2: Format the audio to the fixed input audio length
3: Extract the mel spectrogram feature using Eq. (3)
4: Extract the chroma feature using Eq. (5)
5: Extract the MFCC feature using Eq. (6)
6: Transform each audio feature into a 2D image (128 × 216)
7: Normalize the feature values
8: Stack the mel spectrogram, chroma, and MFCC features

 

Algorithm 1 shows the pseudocode for the lung sound preprocessing. The audio signal is first resampled according to the sampling rate (sr) to ensure temporal consistency across recordings. The signal is then formatted to match a fixed input duration (d seconds), using padding or truncation if necessary, to standardize the temporal dimension for model input. Next, three distinct audio features are extracted to capture complementary characteristics of the lung sounds. The mel spectrogram is computed as described in Equation (3), providing a time–frequency representation with perceptually scaled frequency bins. The chroma feature, extracted via Equation (5), captures the pitch class distribution, which is relevant for tonal components present in pathological sounds. Additionally, MFCCs are derived using Equation (6) to represent the spectral envelope of the sound, which is known to be effective in characterizing respiratory acoustics. Once these features are extracted, each is resized to a common image resolution of 128 × 216, allowing compatibility with CNN input formats. Feature normalization is applied to standardize value ranges. Finally, the three features (mel spectrogram, chroma, and MFCC) are stacked along the channel dimension.
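
For readers who want a concrete picture of Algorithm 1, the following is a minimal sketch of the preprocessing chain, assuming the standard librosa implementations stand in for the paper's Equations (3), (5), and (6); the sampling rate, number of mel bands, and number of MFCC coefficients are illustrative assumptions, not the authors' exact settings.

```python
# Minimal sketch of Algorithm 1: resample, fix length, extract three features,
# resize each to 128 x 216, normalize, and stack along the channel dimension.
import numpy as np
import librosa
import cv2

SR = 4000            # assumed sampling rate for lung sounds
DURATION = 5.0       # fixed input length in seconds
TARGET = (128, 216)  # (height, width) of each feature image

def preprocess(path: str) -> np.ndarray:
    # Steps 1-2: resample and pad/truncate to a fixed duration
    y, _ = librosa.load(path, sr=SR)
    n = int(SR * DURATION)
    y = np.pad(y, (0, max(0, n - len(y))))[:n]

    # Steps 3-5: extract the three time-frequency features
    mel = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=SR, n_mels=128))
    chroma = librosa.feature.chroma_stft(y=y, sr=SR)
    mfcc = librosa.feature.mfcc(y=y, sr=SR, n_mfcc=40)

    # Steps 6-7: resize each feature to 128 x 216 and normalize to [0, 1]
    def to_image(f: np.ndarray) -> np.ndarray:
        f = cv2.resize(f.astype(np.float32), (TARGET[1], TARGET[0]))
        return (f - f.min()) / (f.max() - f.min() + 1e-8)

    # Step 8: stack along the channel dimension -> shape (3, 128, 216)
    return np.stack([to_image(mel), to_image(chroma), to_image(mfcc)])
```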

 

Reviewer Comment 6:

Suggestions for enhancing the paper's impact: Deployment Case Studies: Present application cases of the model on real-world devices (e.g., smart stethoscopes) to demonstrate its practical utility.

Response:

Although on-device AI has been actively studied in the medical domain, real-world deployment remains limited. We are concurrently conducting research aimed at actual device integration, and we plan to validate the practical utility of our model through such implementation efforts as part of future work. We updated further the manuscript.

 

In the revised version of the paper:

Finally, we plan to implement the on-device AI module on a real stethoscope to evaluate the model’s performance, power consumption, and inference latency in a practical setting.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

Summary
This paper presents a lightweight CNN architecture for lung sound classification optimized for on-device deployment. The model uses stacked audio features (MFCC, Mel spectrograms, chromagrams) and inception blocks to achieve 79.74% accuracy on GPU and 89.49% on NPU, with a model size of 1.13 MB. While the application is relevant, the paper suffers from methodological gaps and presentation issues.

Major Strengths

  • Addresses important practical problem of on-device medical AI
  • Comprehensive evaluation on both GPU and NPU platforms
  • Lightweight architecture (1.13 MB) suitable for resource-constrained devices
  • Uses multiple complementary audio features

Major Weaknesses

  • Unexplained 10% accuracy improvement on NPU versus GPU
  • Limited dataset (1,298 samples after augmentation)
  • No cross-dataset evaluation
  • Missing ablation studies
  • Poor figure quality and presentation
  • No real-time performance metrics

Specific Comments and Revision Suggestions

  1. NPU Performance Anomaly: The 10% accuracy improvement on NPU (79.74% to 89.49%) is counterintuitive and inadequately explained. Quantization typically reduces accuracy. Authors must: (i) Provide detailed quantization methodology, (ii) Show intermediate results with different quantization levels, (iii) Verify results with multiple runs and statistical testing, and (iv) Consider if there's a bug in evaluation code.
  2. Limited Dataset: Only 1,298 samples after augmentation is insufficient for deep learning. Authors should: (i) Include additional public datasets (Coswara, SPRSound), (ii) Implement cross-dataset validation, and (iii) Discuss generalization limitations explicitly.
  3. Missing Ablation Studies: No justification for architectural choices: (i) Why inception blocks over depthwise separable convolutions? (ii) Impact of different feature combinations, (iii) Effect of model depth (why 3 stages?), (iv) Kernel size selection rationale.
  4. Incomplete Evaluation: (i) No real-time inference latency (only total time), (ii) Missing power consumption measurements, (iii) No comparison with other on-device frameworks (TensorFlow Lite, ONNX Runtime), (iv) Lack of statistical significance testing.
  5. Figure 2: The stacked feature visualization adds little value.
  6. Line 260: Equation numbering error (should be 7-10, not 6-9).
  7. How do you explain the significant accuracy improvement on NPU? Have you verified this isn't an evaluation artifact?
  8. Why not use modern lightweight architectures like MobileViT or EfficientNet as baselines?
  9. What is the actual real-time factor? Can the model process audio in real-time on target hardware?
  10. Have you tested the model in noisy clinical environments?
  11. Why not evaluate on the full ICBHI dataset without segmentation to 5-second clips?

Final Overall Recommendation: Major Revision
While the paper addresses an important problem and presents a lightweight solution, it has significant methodological issues that must be resolved:

  • The unexplained NPU performance improvement raises serious concerns about experimental validity,
  • Lack of ablation studies undermines the technical contribution,
  • Limited dataset and missing cross-dataset evaluation question generalizability,
  • Poor presentation quality (illegible figures, duplicated tables) detracts from the work.

The paper has potential but requires substantial improvements before publication.

Author Response

 

RESPONSE TO THE REVIEWER

 

REVIEWER

Reviewer Comment 1:

NPU Performance Anomaly: The 10% accuracy improvement on NPU (79.74% to 89.49%) is counterintuitive and inadequately explained. Quantization typically reduces accuracy. Authors must: (i) Provide detailed quantization methodology, (ii) Show intermediate results with different quantization levels, (iii) Verify results with multiple runs and statistical testing, and (iv) Consider if there's a bug in evaluation code.

 

Response:

Thank you for your comment.

(i) As you rightly noted, quantization generally leads to some degradation in model accuracy. However, as the number of target classes increases, the number of parameters the network must learn also grows, which can lead to overfitting toward specific classes. In GPU-based environments, this often results in increased bias toward minority classes and a decline in generalization performance. In contrast, quantization for NPU deployment reduces parameter precision, which can act as a form of regularization. This reduction in representational fidelity may help mitigate overfitting, thereby improving generalization in on-device settings.

(ii) The model parameters are trained using 32-bit floating point (Float32) precision. However, due to resource constraints, on-device environments do not support Float32 execution. To enable deployment on such devices, quantization must be performed in accordance with the requirements of the NPU module. In this study, we utilized the Hailo-8 NPU, which supports only 8-bit unsigned integer (UINT8) representation. Accordingly, Float32 parameters were quantized by mapping their values to the 0–255 range. Specifically, the minimum and maximum values of the Float32 parameters were linearly scaled such that the minimum was mapped to 0 and the maximum to 255.

(iii) We ..

(iv) We identified a bug in the inference time measurement within the evaluation code and have corrected it accordingly.

We revised the manuscript for clarification.

 

In the revised version of the paper:

In GPU-based training environments, high-capacity floating-point models may become overly sensitive to dominant class signals, resulting in biased predictions and reduced generalization capability. However, when such models are quantized and deployed on integer-based NPUs, the reduced numerical precision inherently acts as a form of regularization, mitigating the risk of overfitting and improving generalization. Quantization compresses the dynamic range of weights and activations, effectively suppressing the influence of overly dominant parameters and reducing the model's sensitivity to minor fluctuations. This attenuates class-specific bias, enabling more balanced inference across all classes. Empirically, we observed that the inference accuracy after quantization on the NPU surpasses that of the original GPU-based inference in certain scenarios. Furthermore, quantization implicitly reduces over-parameterization and acts as a noise stabilizer, which is particularly beneficial in cases where training data exhibits label imbalance or noise.

 

Due to limited computational resources, the device does not support inference using Float32. To enable model deployment in such environments, quantization must be performed in accordance with the requirements of the supported NPU module. The Hailo-8 NPU supports only 8-bit unsigned integer (UINT8) representation. Accordingly, all model parameters originally represented in Float32 were quantized by mapping their values to the range of 0–255. This mapping was performed by linearly rescaling each parameter based on its original minimum and maximum values, such that the minimum was mapped to 0 and the maximum to 255.
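
The sketch below illustrates the linear min-max mapping of Float32 values to the 0–255 UINT8 range described above; the actual Hailo toolchain performs calibration and per-layer handling internally, so this is only an illustration of the mapping itself.

```python
# Per-tensor min-max quantization: Float32 min -> 0, Float32 max -> 255 (UINT8).
import numpy as np

def quantize_uint8(w: np.ndarray):
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min) / 255.0 if w_max > w_min else 1.0
    q = np.round((w - w_min) / scale).astype(np.uint8)
    return q, scale, w_min

def dequantize(q: np.ndarray, scale: float, w_min: float) -> np.ndarray:
    return q.astype(np.float32) * scale + w_min

weights = np.random.randn(64, 3, 3).astype(np.float32)
q, scale, zero = quantize_uint8(weights)
print(np.abs(dequantize(q, scale, zero) - weights).max())  # quantization error
```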

 

Table 5. Result of lung sound classification.

Model | Accuracy | F1-score | Precision | Recall
Proposed | 79.68% | 0.789 | 0.809 | 0.759
ResNet50 [30] | 80.59% | 0.798 | 0.801 | 0.798
MobileNetV2 [31] | 55.43% | 0.495 | 0.541 | 0.412
InceptionV2 [32] | 54.48% | 0.427 | 0.493 | 0.41
Stacked [22] | 50.16% | 0.243 | 0.267 | 0.224
EfficientNet [33] | 60.44% | 0.552 | 0.576 | 0.535

 

 

Table 7. Result of lung sound classification in NPU.

Model | Accuracy | F1-score | Precision | Recall
Proposed | 89.46% | 0.87 | 0.89 | 0.87
ResNet50 [30] | 89.41% | 0.89 | 0.88 | 0.87
MobileNetV2 [31] | 71.89% | 0.66 | 0.73 | 0.66
InceptionV2 [32] | 47.74% | 0.42 | 0.41 | 0.36
Stacked [22] | 55.95% | 0.28 | 0.24 | 0.32
EfficientNet [33] | 40.63% | 0.273 | 0.45 | 0.265

 

 

Table 8. Result of inference time in NPU.

Metric | Proposed | ResNet50 | MobileNetV2 | InceptionV2 | Stacked | EfficientNet
Average | 245.9 ms | 269.77 ms | 150.27 ms | 166.43 ms | 157.74 ms | 166.84 ms
Standard Deviation | 197 ms | 186 ms | 241 ms | 260 ms | 271 ms | 264 ms
Power Consumption | 4.6 W | 3.8 W | 5.55 W | 4.95 W | 5.1 W | 4.45 W

 

 

Reviewer Comment 2:

Limited Dataset: Only 1,298 samples after augmentation is insufficient for deep learning. Authors should: (i) Include additional public datasets (Coswara, SPRSound), (ii) Implement cross-dataset validation, and (iii) Discuss generalization limitations explicitly.

 

Response:

The dataset initially consisted of 1,298 samples prior to augmentation. After applying data augmentation, the dataset was expanded to 19,894 samples. As you correctly pointed out, 1,298 samples are insufficient for training deep learning models; therefore, we applied augmentation techniques to address this limitation in sample size.

(i-iii) The ICBHI 2017 and KAUH datasets used in this study are publicly available and include a variety of respiratory diseases. However, datasets such as Coswara and SPRSound differ in structure and focus from the disease classification based on pulmonary auscultation sounds targeted in our work. We agree with your suggestion that cross-dataset evaluation would enhance the contribution of the study. Unfortunately, to the best of our knowledge, there are currently no other publicly available datasets that provide disease-specific labels based on lung sounds, making such validation infeasible at this stage. We therefore plan to evaluate the generalizability of our model on additional relevant datasets, should they become publicly available in the future. We updated the manuscript by adding a more detailed description.

 

In the revised version of the paper:

To address this limitation, we applied various data augmentation techniques, including tempo modification, pitch shifting, background noise addition, time shifting, and volume scaling. The time-stretch rate was set to 0.9, corresponding to a 10% reduction in playback speed, which reflects the natural variation in human speech tempo. For pitch shifting, we used steps of +1 and -1, simulating realistic pitch fluctuations commonly observed in human utterances. Background noise was added using a standard deviation of 0.005, representing a moderate level of ambient noise found in everyday environments, rather than excessively loud sources such as construction sites. Time shifting was applied by 0.5 seconds to model possible delays in utterance onset. Additionally, volume scaling was applied in the range of 0.5 to 1.5, capturing typical variations in vocal loudness. All augmentation parameters were carefully chosen to reflect realistic variations in human speech and phonation characteristics, thereby enhancing the model's robustness to natural acoustic variability. After augmentation, we used 19,894 data samples.
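
As a rough sketch of the augmentations listed above (time stretch of 0.9, pitch shift of plus or minus one semitone, Gaussian noise with a standard deviation of 0.005, a 0.5 s time shift, and volume scaling in [0.5, 1.5]), the following uses librosa and NumPy; the authors' exact implementation may differ.

```python
# Generate augmented variants of a lung sound clip using the stated parameters.
import numpy as np
import librosa

def augment(y: np.ndarray, sr: int) -> list[np.ndarray]:
    shift = int(0.5 * sr)
    return [
        librosa.effects.time_stretch(y, rate=0.9),           # tempo modification
        librosa.effects.pitch_shift(y, sr=sr, n_steps=+1),    # pitch shift up
        librosa.effects.pitch_shift(y, sr=sr, n_steps=-1),    # pitch shift down
        y + np.random.normal(0.0, 0.005, size=y.shape),       # background noise
        np.roll(y, shift),                                    # 0.5 s time shift
        y * np.random.uniform(0.5, 1.5),                      # volume scaling
    ]
```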

 

Furthermore, the ICBHI 2017 and KAUH datasets used in this study differ in disease types, recording conditions, and demographic factors, which may limit the model's generalizability. While public datasets such as Coswara and SPRSound include respiratory sounds, they are either limited to COVID-19 or focus on acoustic characteristics rather than disease classification, making them unsuitable for validating our model. Therefore, evaluating the model's performance in more diverse and real-world settings remains a key area for future work.

 

Reviewer Comment 3:

Missing Ablation Studies: No justification for architectural choices: (i) Why inception blocks over depthwise separable convolutions? (ii) Impact of different feature combinations, (iii) Effect of model depth (why 3 stages?), (iv) Kernel size selection rationale.

 

Response:

(i) We consider time–frequency features such as the Mel spectrogram, chroma, and MFCC. Asymmetric convolutional kernels are well-suited for capturing distinct patterns within these representations and have demonstrated superior performance in this context. While depthwise separable convolutions are effective at isolating individual features, they lack the ability to integrate and model relationships across feature dimensions. Therefore, they may be limited in capturing the interdependencies among the extracted features. We have added a more detailed explanation regarding the rationale behind the selection of Inception blocks in the revised manuscript.

 

(ii-iii) Thank you for your comment. As the reviewer suggested, we added ablation studies to the experiments. We updated the manuscript by adding the new experimental results.

 

(iv) We appreciate the reviewer's insightful comment regarding kernel size selection in our proposed model. In designing the architecture, we carefully considered the nature of the input features, specifically time–frequency representations such as Mel spectrograms, chromagrams, and MFCCs, which are characterized by rectangular spatial structures (e.g., 128 × 216). To effectively capture local patterns while maintaining computational efficiency, we employed asymmetric convolutional kernels (e.g., 1×3 and 3×1, 1×5 and 5×1) in place of conventional 3×3 or 5×5 kernels. This design maintains a comparable receptive field to standard kernels but offers reduced computational cost and parameter count. We have updated the manuscript to clarify this explanation.

 

In the revised version of the paper:

In designing the proposed scheme, we considered time–frequency representations such as the Mel spectrogram, chroma, and MFCC, each of which captures complementary aspects of the acoustic signal. Asymmetric convolutional kernels were selected for their ability to effectively model distinct temporal and spectral patterns within these features, while maintaining computational efficiency. Although depthwise separable convolutions can isolate individual feature maps with reduced complexity, they are inherently limited in capturing cross-feature dependencies and integrating information across dimensions. By contrast, the Inception block's multi-branch structure enables simultaneous extraction of features at multiple receptive fields and facilitates integration of diverse representations, thereby enhancing the model's capacity to capture the complex interdependencies present in lung sound data.

In designing the proposed model, careful consideration was given to the selection of kernel sizes to effectively capture the spatiotemporal patterns inherent in time–frequency representations such as MFCC, Mel spectrograms, and chromagrams. Given the rectangular shape of the input (e.g., 128 × 216), the architecture incorporates asymmetric convolutional kernels (specifically combinations of 1×3 and 3×1, as well as 1×5 and 5×1 convolutions) in place of standard 3×3 and 5×5 kernels. This design maintains a comparable receptive field while reducing computational complexity. These asymmetric kernels are particularly well-suited for processing spectrogram-like inputs, as they can independently model temporal variations (along the width) and frequency patterns (along the height). Additionally, 1×1 convolution layers are employed in transition blocks to adjust channel dimensions without altering spatial resolution, enabling flexible depth control while preserving localization information. Max pooling with a 3×3 kernel and stride 1 is also included in each Inception block branch to enhance local feature abstraction without reducing feature map size. Finally, the classifier uses a 3×3 convolution to integrate local features, followed by an adaptive max pooling layer to aggregate global features regardless of input size, ensuring compatibility with variable input resolutions and robust classification performance.
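
To make the block structure described above concrete, the following is a sketch of one Inception-style block with asymmetric 1×3/3×1 and 1×5/5×1 kernels, a 1×1 branch, and a 3×3 max-pooling branch with stride 1, written in PyTorch; the channel counts and exact branch composition are illustrative assumptions rather than the authors' configuration.

```python
# Inception-style block with asymmetric kernels operating on the stacked
# mel/chroma/MFCC input. Spatial size (128 x 216) is preserved by padding.
import torch
import torch.nn as nn

class AsymInceptionBlock(nn.Module):
    def __init__(self, in_ch: int, branch_ch: int = 16):
        super().__init__()
        self.branch1x1 = nn.Conv2d(in_ch, branch_ch, kernel_size=1)
        self.branch3 = nn.Sequential(                     # 1x3 followed by 3x1
            nn.Conv2d(in_ch, branch_ch, kernel_size=(1, 3), padding=(0, 1)),
            nn.Conv2d(branch_ch, branch_ch, kernel_size=(3, 1), padding=(1, 0)),
        )
        self.branch5 = nn.Sequential(                     # 1x5 followed by 5x1
            nn.Conv2d(in_ch, branch_ch, kernel_size=(1, 5), padding=(0, 2)),
            nn.Conv2d(branch_ch, branch_ch, kernel_size=(5, 1), padding=(2, 0)),
        )
        self.branch_pool = nn.Sequential(                 # 3x3 max pool, stride 1
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, branch_ch, kernel_size=1),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = torch.cat(
            [self.branch1x1(x), self.branch3(x), self.branch5(x), self.branch_pool(x)],
            dim=1,
        )
        return self.act(out)

x = torch.randn(1, 3, 128, 216)        # stacked mel/chroma/MFCC input
print(AsymInceptionBlock(3)(x).shape)  # torch.Size([1, 64, 128, 216])
```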

 

4.2. Proposed Scheme Validation

Table 5 presents the classification results according to the number of Inception blocks in the proposed scheme. The preceding number indicates the number of repetitions of the Inception block. In the case of 2T, the shallow network depth limits the model's ability to fully capture time–frequency features of the audio, resulting in lower accuracy, F1-score, precision, and recall. Conversely, 4T, with a deeper network, is able to extract and recognize audio features more effectively, thereby achieving higher performance; however, it also exhibits signs of overfitting as the number of layers increases. The 3T configuration, having a greater depth than 2T, demonstrates the highest performance with minimal overfitting to the audio features.

 

Table 5. Result of lung sound classification.

Model | Accuracy | F1-score | Precision | Recall
2T | 75.27% | 0.724 | 0.752 | 0.752
3T | 79.74% | 0.787 | 0.817 | 0.762
4T | 78.05% | 0.76 | 0.784 | 0.74

 

Table 6 shows the classification performance according to the audio feature combination. The configuration (mel spectrogram, chroma, MFCC) consistently yielded superior results compared to alternative permutations. This arrangement allows the model to process the global spectral energy distribution first (Mel spectrogram), refine the representation with harmonic and pitch-related information (chroma), and finally encode a compressed summary of the spectral envelope (MFCC). Such an ordering aligns with a hierarchical progression from low-level acoustic structure to mid-level harmonic content and finally to high-level compressed descriptors. This hierarchical feature integration appears to facilitate more stable learning in the initial convolutional layers, enabling the network to more effectively capture both temporal and frequency-domain dependencies present in lung sound data.

Based on these experimental results, we selected the configuration with three iterations of the Inception block. We also selected the audio feature combination of mel spectrogram, chroma, and MFCC.

 

Table 6. Classification performance according to the audio feature combination.

Feature Combination | Accuracy | F1-score | Precision | Recall
Mel spectrogram, Chroma, MFCC | 79.74% | 0.787 | 0.817 | 0.762
Chroma, MFCC, Mel spectrogram | 78.31% | 0.771 | 0.786 | 0.759
MFCC, Mel spectrogram, Chroma | 77.63% | 0.758 | 0.776 | 0.735

 

 

 

 

 

Reviewer Comment 4:

Incomplete Evaluation: (i) No real-time inference latency (only total time), (ii) Missing power consumption measurements, (iii) No comparison with other on-device frameworks (TensorFlow Lite, ONNX Runtime), (iv) Lack of statistical significance testing.

Response:

(i, ii) Thank you for your comment. We updated the manuscript by adding experimental results.

(iii) We consider an on-device environment with an NPU in this work. To deploy the model on the target device, it must be built using a framework supported by the NPU. If the model is implemented with an unsupported framework, it cannot be executed on the device. Therefore, as the NPU used in this study does not support frameworks such as TensorFlow Lite or ONNX Runtime, the experiments with these frameworks suggested by the reviewer are not feasible. We revised the manuscript to clarify this point.

(iv) To strengthen the statistical significance of the proposed scheme, we have included p-value results in our evaluation. These results have been added to the revised manuscript.

 

In the revised version of the paper:

Table 8 presents the inference time and power consumption results measured on the NPU. Among the baseline models, MobileNetV2 achieves the fastest average inference time but exhibits a high standard deviation and the highest power consumption, indicating potential instability and inefficiency under real-time constraints. InceptionV2, EfficientNet, and Stacked models maintain relatively low inference times, but all show high variance in latency and elevated power demands. ResNet50 demonstrates stable performance with the lowest standard deviation among the baselines, though its average inference time remains the highest, limiting its real-time applicability. In contrast, the proposed model offers a more balanced and efficient trade-off. Although its average inference time is slightly higher than that of the lightweight architectures, it achieves notably low variability and moderate power consumption. Although 5-second audio segments are required for disease classification based on lung sound data, the proposed model completes inference in only 245.9 ms, enabling real-time processing.

 

Table 8. Result of inference time in NPU.

Metric | Proposed | ResNet50 | MobileNetV2 | InceptionV2 | Stacked | EfficientNet
Average | 245.9 ms | 269.77 ms | 150.27 ms | 166.43 ms | 157.74 ms | 166.84 ms
Standard Deviation | 197 ms | 186 ms | 241 ms | 260 ms | 271 ms | 264 ms
Power Consumption | 4.6 W | 3.8 W | 5.55 W | 4.95 W | 5.1 W | 4.45 W

 

We analyzed the memory footprint of each model when deployed on a Raspberry Pi with NPU support. As shown in Table 9, the proposed model consumed approximately 60.3 MB of system memory, while ResNet50 exhibited the highest usage at 150.4 MB. In contrast, lightweight architectures such as MobileNetV2 and EfficientNet maintained significantly lower memory requirements at 15.1 MB and 20.2 MB, respectively. These results confirm the suitability of the proposed and lightweight models for real-time inference on memory-constrained edge devices. We also intended to analyze the resource usage of the NPU itself, but this was not possible because the NPU manufacturer does not expose this information.

 

Table 9. Result of memory usage in NPU deployment.

Memory | Proposed | ResNet50 | MobileNetV2 | InceptionV2 | Stacked | EfficientNet
RAM | 60.3 MB | 150.4 MB | 15.1 MB | 50.6 MB | 20.7 MB | 20.2 MB

 

Table 9. P-value between proposed scheme and existing schemes.

ResNet50 | MobileNetV2 | InceptionV2 | Stacked | EfficientNet
(p-values not reproduced in this record)

Table 9 shows the p-values between the proposed scheme and the existing schemes. The differences from the existing schemes were statistically significant, indicating that the observed improvement is highly unlikely to have occurred by chance.
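
The record does not state which statistical test produced these p-values. One plausible way to obtain them is a paired t-test over per-run (or per-fold) accuracies of the proposed model and each baseline, as sketched below; the accuracy lists are placeholders, not the authors' measurements.

```python
# Hypothetical example: paired t-test on repeated-run accuracies.
from scipy.stats import ttest_rel

proposed_acc = [0.795, 0.801, 0.792, 0.798, 0.800]   # placeholder values
baseline_acc = [0.551, 0.560, 0.548, 0.555, 0.553]   # placeholder values

t_stat, p_value = ttest_rel(proposed_acc, baseline_acc)
print(f"t = {t_stat:.3f}, p = {p_value:.4g}")
```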

 

Reviewer Comment 5:                     

Figure 2: The stacked feature visualization adds little value.

Response:

Thank you for your opinion. We have deleted Figure 2 from the manuscript.

 

Reviewer Comment 6:

Line 260: Equation numbering error (should be 7-10, not 6-9).

Response:

We have thoroughly checked and revised our paper for a better presentation.

 

In the revised version of the paper:

4.4. Classification Results in NPU

Equations (8)–(11) [equation content not reproduced in this record].

 

Reviewer Comment 7:

How do you explain the significant accuracy improvement on NPU? Have you verified this isn't an evaluation artifact?

Response:

As you rightly noted, quantization generally leads to some degradation in model accuracy. However, as the number of target classes increases, the number of parameters the network must learn also grows, which can lead to overfitting toward specific classes. In GPU-based environments, this often results in increased bias toward minority classes and a decline in generalization performance. In contrast, quantization for NPU deployment reduces parameter precision, which can act as a form of regularization. This reduction in representational fidelity may help mitigate overfitting, thereby improving generalization in on-device settings.

 

In the revised version of the paper:

In GPU-based training environments, high-capacity floating-point models may become overly sensitive to dominant class signals, resulting in biased predictions and reduced generalization capability. However, when such models are quantized and deployed on integer-based NPUs, the reduced numerical precision inherently acts as a form of regularization, mitigating the risk of overfitting and improving generalization. Quantization compresses the dynamic range of weights and activations, effectively suppressing the influence of overly dominant parameters and reducing the model's sensitivity to minor fluctuations. This attenuates class-specific bias, enabling more balanced inference across all classes. Empirically, we observed that the inference accuracy after quantization on the NPU surpasses that of the original GPU-based inference in certain scenarios. Furthermore, quantization implicitly reduces over-parameterization and acts as a noise stabilizer, which is particularly beneficial in cases where training data exhibits label imbalance or noise.

 

Reviewer Comment 8:

Why not use modern lightweight architectures like MobileViT or EfficientNet as baselines?

Response:

Thank you for your opinion. We added EfficientNet to the experiments as a comparison scheme and updated the manuscript by adding the new experimental results.

 

In the revised version of the paper:

EfficientNet shows higher performance than the other lightweight models because it scales the model according to depth, width, and resolution. However, this model focuses on one point in the image and is not well suited to audio features.

Model | Accuracy | F1-score | Precision | Recall
Proposed | 79.74% | 0.787 | 0.817 | 0.762
ResNet50 [30] | 80.62% | 0.796 | 0.806 | 0.806
MobileNetV2 [31] | 55.34% | 0.491 | 0.537 | 0.456
InceptionV2 [32] | 54.51% | 0.43 | 0.497 | 0.407
Stacked [22] | 50.06% | 0.239 | 0.268 | 0.222
EfficientNet [33] | 60.44% | 0.552 | 0.576 | 0.535

 

Table 6. Result of model parameters.

Model | Parameters | Model Size
Proposed | 264 K | 1.13 MB
ResNet50 [30] | 23.53 M | 90 MB
MobileNetV2 [31] | 2.24 M | 8.75 MB
InceptionV2 [32] | 5.98 M | 22.8 MB
Stacked [22] | 0.36 K | 1.4 MB
EfficientNet | 2.45 M | 9.55 MB
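
Parameter counts and Float32 model sizes such as those in Table 6 can be reproduced for any PyTorch model with a few lines; this is a generic sketch, not the authors' measurement script.

```python
# Count parameters and estimate the Float32 model size for an nn.Module.
import torch.nn as nn

def model_stats(model: nn.Module):
    n_params = sum(p.numel() for p in model.parameters())
    size_mb = n_params * 4 / (1024 ** 2)   # 4 bytes per Float32 parameter
    return n_params, size_mb
```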

 

 


Figure 6. Confusion matrix of each scheme. (a) Proposed scheme; (b) ResNet50; (c) MobileNetV2; (d) InceptionV2; (e) Stacked; (f) EfficientNet.

 

 

EfficientNet achieves high classification accuracy for the Healthy class, as well as COPD and asthma, which are relatively well represented and acoustically distinguishable. In contrast, substantial confusion is observed between Asthma, Heart Failure, and URTI, suggesting overlapping symptom-related acoustic features. Rare classes such as Bronchiolitis, Pleural Effusion, and Lung Fibrosis exhibit consistently low true positive rates, likely due to insufficient training samples and limited variability. Notably, Heart Failure is often misclassified as Healthy or Asthma, indicating challenges in differentiating cardiovascular-related sounds from respiratory ones.

 

Table 8 presents the inference time and power consumption results measured on the NPU. Among the baseline models, MobileNetV2 achieves the fastest average inference time but exhibits a high standard deviation and the highest power consumption, indicating potential instability and inefficiency under real-time constraints. InceptionV2, EfficientNet, and Stacked models maintain relatively low inference times, but all show high variance in latency and elevated power demands. ResNet50 demonstrates stable performance with the lowest standard deviation among the baselines, though its average inference time remains the highest, limiting its real-time applicability. In contrast, the proposed model offers a more balanced and efficient trade-off. Although its average inference time is slightly higher than that of the lightweight architectures, it achieves notably low variability and moderate power consumption. Although 5-second audio segments are required for disease classification based on lung sound data, the proposed model completes inference in only 245.9 ms, enabling real-time processing.

 

Table 7. Result of lung sound classification in NPU.

Model | Accuracy | F1-score | Precision | Recall
Proposed | 89.49% | 0.88 | 0.89 | 0.87
ResNet50 [30] | 89.46% | 0.89 | 0.89 | 0.88
MobileNetV2 [31] | 71.92% | 0.68 | 0.72 | 0.65
InceptionV2 [32] | 47.77% | 0.39 | 0.42 | 0.39
Stacked [22] | 55.99% | 0.27 | 0.25 | 0.31
EfficientNet [33] | 40.63% | 0.273 | 0.45 | 0.265

 

Table 8. Result of inference time in NPU.

Metric | Proposed | ResNet50 | MobileNetV2 | InceptionV2 | Stacked | EfficientNet
Average | 245.9 ms | 269.77 ms | 150.27 ms | 166.43 ms | 157.74 ms | 166.84 ms
Standard Deviation | 197 ms | 186 ms | 241 ms | 260 ms | 271 ms | 264 ms
Power Consumption | 4.6 W | 3.8 W | 5.55 W | 4.95 W | 5.1 W | 4.45 W

 


 

We analyzed the memory footprint of each model when deployed on a Raspberry Pi with NPU support. As shown in Table 9, the proposed model consumed approximately 60.3 MB of system memory, while ResNet50 exhibited the highest usage at 150.4 MB. In contrast, lightweight architectures such as MobileNetV2 and EfficientNet maintained significantly lower memory requirements at 15.1 MB and 20.2 MB, respectively. These results confirm the suitability of the proposed and lightweight models for real-time inference on memory-constrained edge devices. We also intended to analyze the resource usage of the NPU itself, but this was not possible because the NPU manufacturer does not expose this information.

 

Table 9. Result of memory usage in NPU deployment.

Memory | Proposed | ResNet50 | MobileNetV2 | InceptionV2 | Stacked | EfficientNet
RAM | 60.3 MB | 150.4 MB | 15.1 MB | 50.6 MB | 20.7 MB | 20.2 MB

 

 

Tan, M.; Le, Q. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning, PMLR, 2019; pp. 6105–6114.

 

 

Reviewer Comment 9:

What is the actual real-time factor? Can the model process audio in real-time on target hardware?

Response:

The proposed model is capable of real-time audio processing on the target hardware. Since our approach requires a 5-second audio segment for disease classification, we measured the inference time based on this input length. The proposed method achieves an inference time of 245.9 ms. As this is significantly shorter than the time required to collect the input audio (5 seconds), the system is able to operate in real time. We have added this clarification to the manuscript.
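
Expressed as a real-time factor (inference time divided by the duration of audio it covers), the figures quoted above give:

```latex
\mathrm{RTF} = \frac{t_{\text{inference}}}{t_{\text{audio}}}
             = \frac{245.9\ \text{ms}}{5000\ \text{ms}} \approx 0.049 \ll 1
```

A factor well below 1 means each 5-second segment is classified long before the next segment finishes recording.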

 

In the revised version of the paper:

Table 8 presents the inference time and power consumption results measured on the NPU. Among the baseline models, MobileNetV2 achieves the fastest average inference time but exhibits a high standard deviation and the highest power consumption, indicating potential instability and inefficiency under real-time constraints. InceptionV2, EfficientNet, and Stacked models maintain relatively low inference times, but all show high variance in latency and elevated power demands. ResNet50 demonstrates stable performance with the lowest standard deviation among the baselines, though its average inference time remains the highest, limiting its real-time applicability. In contrast, the proposed model offers a more balanced and efficient trade-off. Although its average inference time is slightly higher than that of the lightweight architectures, it achieves notably low variability and moderate power consumption. Although 5-second audio segments are required for disease classification based on lung sound data, the proposed model completes inference in only 245.9 ms, enabling real-time processing.

 

Table 8. Result of inference time in NPU.

Metric | Proposed | ResNet50 | MobileNetV2 | InceptionV2 | Stacked | EfficientNet
Average | 245.9 ms | 269.77 ms | 150.27 ms | 166.43 ms | 157.74 ms | 166.84 ms
Standard Deviation | 197 ms | 186 ms | 241 ms | 260 ms | 271 ms | 264 ms
Power Consumption | 4.6 W | 3.8 W | 5.55 W | 4.95 W | 5.1 W | 4.45 W

 

Reviewer Comment 10:

Have you tested the model in noisy clinical environments?

Response:

This study proposes a lightweight model for disease classification based on pulmonary auscultation sounds. For training and performance evaluation, we utilized publicly available datasets—ICBHI 2017 and KAUH. Since these datasets were not collected in noisy clinical environments, it was not possible to directly evaluate the model under such conditions. To address this limitation, we constructed a noise-augmented version of the dataset by adding background noise to simulate realistic clinical settings. Furthermore, to enhance the model’s robustness across various environments—including noisy conditions—we applied additional augmentation techniques such as tempo modification, pitch shifting, and time shifting. To clarify the description, we updated the manuscript.

 

In the revised version of the paper:

To address this limitation, we applied various data augmentation techniques, including tempo modification, pitch shifting, background noise addition, time shifting, and volume scaling. The time-stretch rate was set to 0.9, corresponding to a 10% reduction in playback speed, which reflects the natural variation in human speech tempo. For pitch shifting, we used steps of +1 and -1, simulating realistic pitch fluctuations commonly observed in human utterances. Background noise was added using a standard deviation of 0.005, representing a moderate level of ambient noise found in everyday environments, rather than excessively loud sources such as construction sites. Time shifting was applied by 0.5 seconds to model possible delays in utterance onset. Additionally, volume scaling was applied in the range of 0.5 to 1.5, capturing typical variations in vocal loudness. All augmentation parameters were carefully chosen to reflect realistic variations in human speech and phonation characteristics, thereby enhancing the model's robustness to natural acoustic variability. After augmentation, we used 19,894 data samples.

 

Reviewer Comment 11:

Why not evaluate on the full ICBHI dataset without segmentation to 5-second clips?

Response:

Deep learning models require inputs of consistent dimensions to ensure stable and comparable outputs. However, the ICBHI dataset contains audio recordings of varying lengths, ranging from 10 to 90 seconds. As a result, using the dataset without modification would lead to inconsistencies in the size of the input representations (e.g., spectrograms). To address this, we preprocessed the data by segmenting the recordings into 5-second clips, thereby ensuring uniform input dimensions across all samples.
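
A minimal sketch of this segmentation step is given below, assuming non-overlapping 5-second windows with zero-padding of the final clip; the authors' exact windowing (for example, any overlap) is not specified in this record.

```python
# Split a variable-length recording into fixed 5-second clips.
import numpy as np

def segment(y: np.ndarray, sr: int, clip_s: float = 5.0) -> list[np.ndarray]:
    clip_len = int(clip_s * sr)
    clips = []
    for start in range(0, len(y), clip_len):
        clip = y[start:start + clip_len]
        if len(clip) < clip_len:                      # zero-pad the last clip
            clip = np.pad(clip, (0, clip_len - len(clip)))
        clips.append(clip)
    return clips
```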

 

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

This paper presents a compact lung sound classification model tailored for edge AI applications. The approach involves transforming audio-derived features into stacked image-like inputs, which are then processed through a CNN architecture built with inception modules and convolutional layers. The goal is to balance accuracy with computational efficiency, making the model suitable for deployment on devices with limited resources. Evaluations on the ICBHI 2017 and KAUH datasets show that the model performs competitively against larger architectures like ResNet50, while maintaining a significantly smaller footprint. Despite these strengths, several aspects of the work warrant further clarification:

  1. The model combines MFCCs, Mel spectrograms, and chromagrams into a unified image input. It's unclear how the dimensionality of the stacked representation was determined. Was this configuration empirically selected, or based on prior work? An ablation study could help justify this choice by comparing it with models trained on individual feature sets.
  2. The use of inception-style blocks with parallel convolutions is an interesting choice. However, the reasoning behind the specific kernel sizes is not discussed. Were alternative configurations, such as asymmetric kernels or depthwise convolutions, considered during model development?
  3. Data Augmentation Parameters: The training pipeline includes time stretching, pitch shifting, and noise addition. More detail is needed on how the parameters for these augmentations were selected. Were they tuned experimentally, and if so, were there instances where certain augmentations degraded model performance?
  4. The confusion matrices reveal weak performance on underrepresented classes. Given the class imbalance, why was overall accuracy emphasized over macro-averaged metrics? Did the authors experiment with strategies such as oversampling, SMOTE, or class-weighted loss functions to address this issue?
  5. The paper notes that quantization sometimes improved the F1 score even as accuracy dropped. How does 8-bit quantization affect the model’s ability to extract meaningful features? Was quantization-aware training employed, or was post-training quantization used exclusively?
  6. The comparison is limited to MobileNetV2 and a stacked CNN. It's surprising that other lightweight models, such as EfficientNet-Lite or TinyML architectures, were not included. Wouldn’t these provide a more comprehensive benchmark for edge deployment?
  7. The reported inference time raises concerns about real-time deployment on wearables. Has the latency been measured on actual low-power hardware? If so, how does the model handle memory constraints? Were pruning or knowledge distillation explored as additional optimization strategies?
  8. It would be helpful to know whether any of the model’s errors were reviewed by clinicians. Do the misclassifications reflect common diagnostic challenges in auscultation, or are they more likely due to noise or labeling inconsistencies in the dataset?
  9. The two datasets originate from different regions. Could regional variations in disease prevalence or recording conditions limit the generalizability of the model? Has any cross-regional validation been attempted?
  10. Since the model is intended for on-device use, energy efficiency is a critical metric. Were power consumption measurements conducted on the NPU? How does the energy-per-inference compare to other models optimized for mobile deployment?

Author Response

 

RESPONSE TO THE REVIEWER

 

REVIEWER

Reviewer Comment 1:

The model combines MFCCs, Mel spectrograms, and chromagrams into a unified image input. It's unclear how the dimensionality of the stacked representation was determined. Was this configuration empirically selected, or based on prior work? An ablation study could help justify this choice by comparing it with models trained on individual feature sets.

 

Response:

Thank you for your comment. To strengthen the rationale for combining MFCC, Mel spectrogram, and chroma features, we have added experimental results comparing different combinations of these representations. The experiments demonstrated that the combination of Mel spectrogram, chroma, and MFCC yielded the highest classification performance. Based on these findings, we selected this combination for our proposed scheme. We revised the manuscript for clarification.

 

In the revised version of the paper:

Table 6 shows the classification performance according to the audio feature combination. The configuration (mel spectrogram, chroma, MFCC) consistently yielded superior results compared to alternative permutations. This arrangement allows the model to process the global spectral energy distribution first (Mel spectrogram), refine the representation with harmonic and pitch-related information (chroma), and finally encode a compressed summary of the spectral envelope (MFCC). Such an ordering aligns with a hierarchical progression from low-level acoustic structure to mid-level harmonic content and finally to high-level compressed descriptors. This hierarchical feature integration appears to facilitate more stable learning in the initial convolutional layers, enabling the network to more effectively capture both temporal and frequency-domain dependencies present in lung sound data.

Based on these experimental results, we selected the configuration with three iterations of the Inception block. We also selected the audio feature combination of mel spectrogram, chroma, and MFCC.

Table 6. Classification performance according to the audio feature combination.

Feature Combination | Accuracy | F1-score | Precision | Recall
Mel spectrogram, Chroma, MFCC | 79.74% | 0.787 | 0.817 | 0.762
Chroma, MFCC, Mel spectrogram | 78.31% | 0.771 | 0.786 | 0.759
MFCC, Mel spectrogram, Chroma | 77.63% | 0.758 | 0.776 | 0.735

 

Reviewer Comment 2:

The use of inception-style blocks with parallel convolutions is an interesting choice. However, the reasoning behind the specific kernel sizes is not discussed. Were alternative configurations, such as asymmetric kernels or depthwise convolutions, considered during model development?

 

Response:

We appreciate the reviewer’s insightful comment regarding kernel size selection in our proposed model. In designing the architecture, we carefully considered the nature of the input features—specifically time–frequency representations such as Mel spectrograms, chromagrams, and MFCCs—which are characterized by rectangular spatial structures (e.g., 128 × 216). To effectively capture local patterns while maintaining computational efficiency, we employed asymmetric convolutional kernels (e.g., 1×3 and 3×1, 1×5 and 5×1) in place of conventional 3×3 or 5×5 kernels. This design maintains a comparable receptive field to standard kernels but offers reduced computational cost and parameter count. While depthwise separable convolutions are effective at isolating individual features, they lack the ability to integrate and model relationships across feature dimensions. Therefore, they may be limited in capturing the interdependencies among the extracted features. We revised the manuscript to clarify this explanation.

 

In the revised version of the paper:

In designing the proposed model, careful consideration was given to the selection of kernel sizes to effectively capture the spatiotemporal patterns inherent in time–frequency representations such as MFCC, Mel spectrograms, and chromagrams. Given the rectangular shape of the input (e.g., 128 × 216), the architecture incorporates asymmetric convolutional kernels—specifically combinations of 1×3 and 3×1, as well as 1×5 and 5×1 convolutions—in place of standard 3×3 and 5×5 kernels. This design maintains a comparable receptive field while reducing computational complexity. These asymmetric kernels are particularly well-suited for processing spectrogram-like inputs, as they can independently model temporal variations (along the width) and frequency patterns (along the height). Additionally, 1×1 convolution layers are employed in transition blocks to adjust channel dimensions without altering spatial resolution, enabling flexible depth control while preserving localization information. Max pooling with a 3×3 kernel and stride 1 is also included in each Inception block branch to enhance local feature abstraction without reducing feature map size. Finally, the classifier uses a 3×3 convolution to integrate local features, followed by an adaptive max pooling layer to aggregate global features regardless of input size, ensuring compatibility with variable input resolutions and robust classification performance.
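To make the block structure concrete, the following PyTorch sketch implements an Inception-style block with the factorized kernels described above; the branch layout and channel counts are illustrative assumptions rather than the exact configuration of the proposed model.

```python
import torch
import torch.nn as nn

class AsymmetricInceptionBlock(nn.Module):
    """Sketch of an Inception-style block with asymmetric (factorized) kernels."""
    def __init__(self, in_ch: int, branch_ch: int = 32):
        super().__init__()
        # 1x1 transition: adjusts channel depth without changing spatial size.
        self.branch1x1 = nn.Conv2d(in_ch, branch_ch, kernel_size=1)
        # Factorized 3x3: a 1x3 (temporal) followed by a 3x1 (frequency) kernel.
        self.branch3x3 = nn.Sequential(
            nn.Conv2d(in_ch, branch_ch, kernel_size=(1, 3), padding=(0, 1)),
            nn.Conv2d(branch_ch, branch_ch, kernel_size=(3, 1), padding=(1, 0)),
        )
        # Factorized 5x5: a 1x5 followed by a 5x1 kernel.
        self.branch5x5 = nn.Sequential(
            nn.Conv2d(in_ch, branch_ch, kernel_size=(1, 5), padding=(0, 2)),
            nn.Conv2d(branch_ch, branch_ch, kernel_size=(5, 1), padding=(2, 0)),
        )
        # 3x3 max pooling with stride 1 keeps the feature-map size unchanged.
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, branch_ch, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        branches = [self.branch1x1(x), self.branch3x3(x),
                    self.branch5x5(x), self.branch_pool(x)]
        return torch.cat(branches, dim=1)  # concatenate along the channel axis

# Example: a stacked (Mel, chroma, MFCC) input of size 3 x 128 x 216.
x = torch.randn(1, 3, 128, 216)
y = AsymmetricInceptionBlock(in_ch=3)(x)
print(y.shape)  # torch.Size([1, 128, 128, 216])
```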

 

Reviewer Comment 3:

Data Augmentation Parameters: The training pipeline includes time stretching, pitch shifting, and noise addition. More detail is needed on how the parameters for these augmentations were selected. Were they tuned experimentally, and if so, were there instances where certain augmentations degraded model performance?

 

Response:

Thank you for your comment. We selected the augmentation parameters based on human auditory perception. The time-stretch rate reflects variability in patients’ respiratory rates, the pitch-shift range reflects the frequency changes observed in abnormal breath sounds, and the noise level reflects real clinical environments. The time-shift parameter accounts for variability in auscultation timing, and the volume-scaling parameter reflects stethoscope pressure, device sensitivity, and patient condition. We updated the manuscript with a more detailed description.

 

 

In the revised version of the paper:

To address this limitation, we applied various data augmentation techniques, including tempo modification, pitch shifting, background noise addition, time shifting, and volume scaling. The time stretch rate was set to 0.9, corresponding to a 10% decrease in playback speed, which reflects variability in patients’ respiratory rates. A pitch shift of +1 and -1 semitones was applied to simulate pathological frequency changes observed in abnormal breath sounds. The standard deviation of additive noise was set to 0.005, which reflects moderate ambient noise levels commonly encountered in real clinical environments, as opposed to extreme industrial noise. A time shift of 0.5 seconds was included to account for variability in auscultation timing or onset delays. Volume scaling was applied in the range of 0.5 to 1.5× to simulate variability in recording intensity due to stethoscope pressure, device sensitivity, or patient condition. These augmentation parameters were chosen to reflect both the physiological variability of lung sounds and the diversity of recording conditions encountered in real-world clinical settings. After augmentation, the dataset contained 19,894 samples.
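For reproducibility, a minimal sketch of these augmentations using librosa and NumPy is given below; the sampling rate and function structure are assumptions, while the parameter values follow the description above.

```python
import numpy as np
import librosa

def augment_lung_sound(y, sr=16000):
    """Illustrative augmentation sketch using the parameters described above."""
    out = []
    # Tempo modification: 10% slower playback (rate = 0.9).
    out.append(librosa.effects.time_stretch(y, rate=0.9))
    # Pitch shift: +1 and -1 semitones.
    out.append(librosa.effects.pitch_shift(y, sr=sr, n_steps=1))
    out.append(librosa.effects.pitch_shift(y, sr=sr, n_steps=-1))
    # Additive Gaussian noise with standard deviation 0.005.
    out.append(y + np.random.normal(0.0, 0.005, size=y.shape))
    # Time shift of 0.5 seconds (circular shift).
    out.append(np.roll(y, int(0.5 * sr)))
    # Volume scaling in the range 0.5x to 1.5x.
    out.append(y * np.random.uniform(0.5, 1.5))
    return out
```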

 

Reviewer Comment 4:

The confusion matrices reveal weak performance on underrepresented classes. Given the class imbalance, why was overall accuracy emphasized over macro-averaged metrics? Did the authors experiment with strategies such as oversampling, SMOTE, or class-weighted loss functions to address this issue?

Response:

Thank you for your valuable comment regarding class imbalance and evaluation metrics. We acknowledge that class imbalance can lead to biased performance assessments when relying solely on overall accuracy. To address this limitation, we have supplemented our evaluation by reporting multiple macro-averaged metrics—namely, macro accuracy, macro precision, macro recall, and macro F1-score—which equally reflect the performance across all classes. These additional metrics provide a more balanced and comprehensive evaluation of the model's classification capabilities under class-imbalanced conditions. Additionally, to further mitigate the effects of class imbalance, we adopted a repeated random subsampling strategy. Specifically, we randomly split the entire dataset into 80% for training and 20% for validation, and repeated this process 10 times. The final performance results reported in the manuscript represent the average values obtained across these 10 independent runs. We revised the manuscript to clarify the experimental results.

 

In the revised version of the paper:

In our experiments, 80% of the entire dataset was used for training, while the remaining 20% was allocated for validation. To mitigate potential performance bias due to data imbalance, each model was trained ten times, and the corresponding performance metrics were reported. All metrics were computed as macro-averaged scores.
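A minimal scikit-learn sketch of this repeated random subsampling protocol is shown below; `train_and_predict` is a hypothetical placeholder for the actual training routine, not part of the paper.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def repeated_subsampling_eval(X, y, train_and_predict, n_runs=10):
    """Average macro metrics over repeated random 80/20 splits.

    train_and_predict(X_tr, y_tr, X_va) is a hypothetical callable that trains
    the model and returns predictions for the validation samples.
    """
    scores = []
    for seed in range(n_runs):
        X_tr, X_va, y_tr, y_va = train_test_split(
            X, y, test_size=0.2, random_state=seed)
        y_pred = train_and_predict(X_tr, y_tr, X_va)
        acc = accuracy_score(y_va, y_pred)
        prec, rec, f1, _ = precision_recall_fscore_support(
            y_va, y_pred, average="macro", zero_division=0)
        scores.append((acc, prec, rec, f1))
    # Mean of accuracy, macro precision, macro recall, and macro F1 across runs.
    return np.mean(scores, axis=0)
```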

Reviewer Comment 5:                     

The paper notes that quantization sometimes improved the F1 score even as accuracy dropped. How does 8-bit quantization affect the model’s ability to extract meaningful features? Was quantization-aware training employed, or was post-training quantization used exclusively?

Response:

Thank you for your comment. As you rightly noted, quantization generally leads to some degradation in model accuracy. However, as the number of target classes increases, the number of parameters the network must learn also grows, which can lead to overfitting toward specific classes. In GPU-based environments, this often results in increased bias toward minority classes and a decline in generalization performance. In contrast, quantization for NPU deployment reduces parameter precision, which can act as a form of regularization. This reduction in representational fidelity may help mitigate overfitting, thereby improving generalization in on-device settings. We used post-training quantization exclusively. We updated the manuscript to clarify the description.

 

In the revised version of the paper:

In GPU-based training environments, high-capacity floating-point models may become overly sensitive to dominant class signals, resulting in biased predictions and reduced generalization capability. However, when such models are quantized and deployed on integer-based NPUs, the reduced numerical precision inherently acts as a form of regularization, mitigating the risk of overfitting and improving generalization. Quantization compresses the dynamic range of weights and activations, effectively suppressing the influence of overly dominant parameters and reducing the model's sensitivity to minor fluctuations. This attenuates class-specific bias, enabling more balanced inference across all classes. Empirically, we observed that the inference accuracy after quantization on the NPU surpasses that of the original GPU-based inference in certain scenarios. Furthermore, quantization implicitly reduces over-parameterization and acts as a noise stabilizer, which is particularly beneficial in cases where training data exhibits label imbalance or noise.

 

Due to limited computational resources, the device does not support inference using Float32. To enable model deployment in such environments, quantization must be performed in accordance with the requirements of the supported NPU module. The Hailo-8 NPU supports only 8-bit unsigned integers. Accordingly, all model parameters originally represented in Float32 were quantized by mapping their values to the range of 0–255. This mapping was performed by linearly rescaling each parameter based on its original minimum and maximum values, such that the minimum was mapped to 0 and the maximum to 255.
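A NumPy sketch of this linear min-to-max mapping is shown below as an illustration of the described rescaling; the actual Hailo toolchain performs its own calibration, so this is an approximation, not the vendor's implementation.

```python
import numpy as np

def quantize_uint8(w):
    """Linearly rescale a float32 tensor so that min -> 0 and max -> 255."""
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min) / 255.0 if w_max > w_min else 1.0
    q = np.round((w - w_min) / scale).astype(np.uint8)
    return q, scale, w_min  # scale and offset allow approximate dequantization

def dequantize(q, scale, w_min):
    """Approximate reconstruction of the original float32 values."""
    return q.astype(np.float32) * scale + w_min

w = np.random.randn(3, 3).astype(np.float32)
q, scale, offset = quantize_uint8(w)
print(np.abs(dequantize(q, scale, offset) - w).max())  # quantization error
```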

 

Reviewer Comment 6:

The comparison is limited to MobileNetV2 and a stacked CNN. It's surprising that other lightweight models, such as EfficientNet-Lite or TinyML architectures, were not included. Wouldn’t these provide a more comprehensive benchmark for edge deployment?

Response:

Thank you for your opinion. We added EfficientNet to the experiments as a comparison scheme and updated the manuscript by adding this new comparison scheme to the experimental results.

 

 

In the revised version of the paper:

EfficientNet shows higher performance than the other lightweight models because it scales the network along depth, width, and resolution. However, this model focuses on localized regions of the image and is less suitable for audio features.

Model               Accuracy   F1-score   Precision   Recall
Proposed            79.74%     0.787      0.817       0.762
ResNet50 [30]       80.62%     0.796      0.806       0.806
MobileNetV2 [31]    55.34%     0.491      0.537       0.456
InceptionV2 [32]    54.51%     0.43       0.497       0.407
Stacked [22]        50.06%     0.239      0.268       0.222
EfficientNet [33]   60.44%     0.552      0.576       0.535

 

Table 6. Result of model parameters.

Model               Parameters   Model Size
Proposed            264 K        1.13 MB
ResNet50 [30]       23.53 M      90 MB
MobileNetV2 [31]    2.24 M       8.75 MB
InceptionV2 [32]    5.98 M       22.8 MB
Stacked [22]        0.36 K       1.4 MB
EfficientNet        2.45 M       9.55 MB

 

 


Figure 6. Confusion matrix of each scheme. (a) Proposed scheme; (b) ResNet50; (c) MobileNetV2; (d) InceptionV2; (e) Stacked; (f) EfficientNet

 

 

The EfficientNet achieves high classification accuracy for the Healthy class, as well as COPD and asthma, which are relatively well-represented and acoustically distinguishable. In contrast, substantial confusion is observed between Asthma, Heart Failure, and URTI, suggesting overlapping symptom-related acoustic features. Rare classes such as Bronchiolitis, Pleural Effusion, and Lung Fibrosis exhibit consistently low true positive rates, likely due to insufficient training samples and limited variability. Notably, Heart Failure is often misclassified as Healthy or Asthma, indicating challenges in differentiating cardiovascular-related sounds from respiratory ones.

 

Table 8 presents the results of inference time and power consumption in NPU. Among the baseline models, MobileNetV2 achieves the fastest average inference time but exhibits the highest standard deviation and the highest power consumption, indicating potential instability and inefficiency under real-time constraints. InceptionV2, EfficientNet, and Stacked models maintain relatively low inference times, but all show high variance in latency and elevated power demands. ResNet50 demonstrates stable performance with the lowest standard deviation among the baselines, though its average inference time remains the highest, limiting its real-time applicability. In contrast, the Proposed model offers a more balanced and efficient trade-off. Although its average inference time is slightly higher than that of lightweight architectures, it achieves notably low variability and moderate power consumption. Although 5-second audio segments are required for disease classification based on lung sound data, the proposed model completes inference in only 245.9 milliseconds, enabling real-time processing.

 

Table 7. Result of lung sound classification in NPU.

Model               Accuracy   F1-score   Precision   Recall
Proposed            89.49%     0.88       0.89        0.87
ResNet50 [30]       89.46%     0.89       0.89        0.88
MobileNetV2 [31]    71.92%     0.68       0.72        0.65
InceptionV2 [32]    47.77%     0.39       0.42        0.39
Stacked [22]        55.99%     0.27       0.25        0.31
EfficientNet [33]   40.63%     0.273      0.45        0.265

 

Table 8. Result of inference time and power consumption in NPU.

Metric               Proposed    ResNet50    MobileNetV2   InceptionV2   Stacked     EfficientNet
Average              245.9 ms    269.77 ms   150.27 ms     166.43 ms     157.74 ms   166.84 ms
Standard Deviation   197 ms      186 ms      241 ms        260 ms        271 ms      264 ms
Power Consumption    4.6 W       3.8 W       5.55 W        4.95 W        5.1 W       4.45 W

 


 

We analyzed the memory footprint of each model when deployed on a Raspberry Pi with NPU support. As shown in Table 9, the proposed model consumed approximately 60.3 MB of system memory, while ResNet exhibited the highest usage at 150.4 MB. In contrast, lightweight architectures such as MobileNet and EfficientNet maintained significantly lower memory requirements at 15.1 MB and 20.2 MB, respectively. These results confirm the suitability of the proposed and lightweight models for real-time inference on memory-constrained edge devices. We also wanted to analyze the resource usage of the NPU, but this was not possible because the NPU manufacturer did not support this feature.

 

Table 9. Result of memory usage on the Raspberry Pi with NPU.

Memory   Proposed   ResNet50   MobileNetV2   InceptionV2   Stacked   EfficientNet
RAM      60.3 MB    150.4 MB   15.1 MB       50.6 MB       20.7 MB   20.2 MB
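For context, host-side memory figures of this kind can be approximated by sampling the resident set size of the inference process, for example with psutil; this is a hedged sketch, not the authors' exact measurement procedure.

```python
import os
import psutil

def rss_mb():
    """Resident set size of the current process in megabytes."""
    return psutil.Process(os.getpid()).memory_info().rss / (1024 ** 2)

baseline = rss_mb()
# ... load the model and run one inference here ...
print(f"Approximate model memory footprint: {rss_mb() - baseline:.1f} MB")
```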

 

 

Tan, M.; Le, Q. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning, PMLR, 2019; pp. 6105–6114.

 

Reviewer Comment 7:

The reported inference time raises concerns about real-time deployment on wearables. Has the latency been measured on actual low-power hardware? If so, how does the model handle memory constraints? Were pruning or knowledge distillation explored as additional optimization strategies?

Response:

We employed the Hailo-8 NPU, a low-power neural processing unit, to evaluate the inference performance of each model in realistic edge deployment scenarios. The device used in our experiments has relatively relaxed memory constraints, allowing us to focus on measuring computational efficiency and latency on a functional NPU-based system. However, in ultra-compact hardware designs with stricter memory limitations, additional optimization techniques such as model pruning or knowledge distillation, as the reviewer suggested, would be necessary to meet deployment requirements. At present, commercially available NPU boards that are both sufficiently compact and support real-time AI workloads are limited. Therefore, we consider hardware-constrained optimization for ultra-small platforms an important direction for future work.

 

Reviewer Comment 8:

It would be helpful to know whether any of the model’s errors were reviewed by clinicians. Do the misclassifications reflect common diagnostic challenges in auscultation, or are they more likely due to noise or labeling inconsistencies in the dataset?

Response:

We acknowledge the value of clinical validation and expert review in interpreting model errors. However, due to limited access to medical professionals during the course of this study, a formal evaluation by clinicians was not conducted. Based on our analysis of the confusion matrix and class distribution, we observed that most misclassifications occurred between classes with overlapping acoustic features or where significant class imbalance exists. While some of these errors may reflect genuine diagnostic ambiguity commonly encountered in auscultation, others could result from label inconsistencies or noise inherent in the dataset.

 

Reviewer Comment 9:

The two datasets originate from different regions. Could regional variations in disease prevalence or recording conditions limit the generalizability of the model? Has any cross-regional validation been attempted?

Response:

The ICBHI 2017 and KAUH datasets used in this study differ in terms of disease types, recording environments, ethnicity, age groups, and other environmental factors, which may limit the generalizability of the proposed model. To properly evaluate generalization performance, a dataset must include the same disease categories targeted in this study. Although Coswara contains respiratory audio data, it is limited to COVID-19, making it unsuitable for evaluating broader disease classification. Similarly, the SPRSound dataset includes various types of respiratory sounds but is designed for sound-type classification rather than disease-specific labeling. Therefore, neither dataset is appropriate for validating the disease classification task addressed in this study. We have identified this limitation and included a discussion in the revised manuscript. As future work, we plan to evaluate the model’s generalizability once suitable, publicly available datasets with disease-specific labels become accessible.

 

In the revised version of the paper:

As future work, we aim to reduce the model's computational complexity to shorten inference time while maintaining or improving classification performance under resource-constrained hardware conditions. Furthermore, the ICBHI 2017 and KAUH datasets used in this study differ in disease types, recording conditions, and demographic factors, which may limit the model's generalizability. While public datasets such as Coswara and SPRSound include respiratory sounds, they are either limited to COVID-19 or focus on acoustic characteristics rather than disease classification, making them unsuitable for validating our model. Therefore, evaluating the model's performance in more diverse and real-world settings remains a key area for future work. Finally, we plan to implement the on-device AI module on a real stethoscope to evaluate the model's performance, power consumption, and inference latency in a practical setting.

 

Reviewer Comment 10:

Since the model is intended for on-device use, energy efficiency is a critical metric. Were power consumption measurements conducted on the NPU? How does the energy-per-inference compare to other models optimized for mobile deployment?

Response:

Thank you for your insightful comment. We measured the power consumption associated with inference for each model. While the proposed method achieves faster inference time compared to ResNet-50, it exhibits higher power consumption. However, when compared to other lightweight models, the proposed method demonstrates lower power usage, indicating a favorable trade-off between efficiency and performance. We revised the manuscript to add the experimental results.

 

In the revised version of the paper:

Table 8 presents the results of inference time and power consumption in NPU. Among the baseline models, MobileNetV2 achieves the fastest average inference time but exhibits the highest standard deviation and the highest power consumption, indicating potential instability and inefficiency under real-time constraints. InceptionV2, EfficientNet, and Stacked models maintain relatively low inference times, but all show high variance in latency and elevated power demands. ResNet50 demonstrates stable performance with the lowest standard deviation among the baselines, though its average inference time remains the highest, limiting its real-time applicability. In contrast, the Proposed model offers a more balanced and efficient trade-off. Although its average inference time is slightly higher than that of lightweight architectures, it achieves notably low variability and moderate power consumption. Although 5-second audio segments are required for disease classification based on lung sound data, the proposed model completes inference in only 245.9 milliseconds, enabling real-time processing.

 

Table 8. Result of inference time and power consumption in NPU.

Metric               Proposed    ResNet50    MobileNetV2   InceptionV2   Stacked     EfficientNet
Average              245.9 ms    269.77 ms   150.27 ms     166.43 ms     157.74 ms   166.84 ms
Standard Deviation   197 ms      186 ms      241 ms        260 ms        271 ms      264 ms
Power Consumption    4.6 W       3.8 W       5.55 W        4.95 W        5.1 W       4.45 W
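As a rough, hedged estimate, energy per inference can be derived from the average latency and power draw in Table 8, assuming the reported power is the mean draw during inference:

```python
# Energy per inference = average power (W) x average latency (s), values from Table 8.
results = {
    "Proposed":     (4.60, 0.2459),
    "ResNet50":     (3.80, 0.26977),
    "MobileNetV2":  (5.55, 0.15027),
    "InceptionV2":  (4.95, 0.16643),
    "Stacked":      (5.10, 0.15774),
    "EfficientNet": (4.45, 0.16684),
}
for name, (watts, seconds) in results.items():
    print(f"{name:12s} ~{watts * seconds:.2f} J per inference")
# Proposed: ~1.13 J; ResNet50: ~1.03 J; MobileNetV2: ~0.83 J; ...
```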

 

We analyzed the memory footprint of each model when deployed on a Raspberry Pi with NPU support. As shown in Table 9, the proposed model consumed approximately 60.3 MB of system memory, while ResNet exhibited the highest usage at 150.4 MB. In contrast, lightweight architectures such as MobileNet and EfficientNet maintained significantly lower memory requirements at 15.1 MB and 20.2 MB, respectively. These results confirm the suitability of the proposed and lightweight models for real-time inference on memory-constrained edge devices. We also wanted to analyze the resource usage of the NPU, but this was not possible because the NPU manufacturer did not support this feature.

 

Table 9. Result of memory usage on the Raspberry Pi with NPU.

Memory   Proposed   ResNet50   MobileNetV2   InceptionV2   Stacked   EfficientNet
RAM      60.3 MB    150.4 MB   15.1 MB       50.6 MB       20.7 MB   20.2 MB

 

 

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

Summary and Contributions
The paper presents a lightweight CNN model for lung sound classification optimized for on-device deployment using NPUs. The key contributions include: (1) a novel inception-based architecture using asymmetric kernels for processing time-frequency representations, (2) comprehensive evaluation on Hailo-8 NPU with INT8 quantization, and (3) demonstration of real-time inference capability (245.9ms for 5-second audio segments). The model achieves 79.68% accuracy on GPU and surprisingly 89.46% on NPU after quantization.

 

Response to Previous Reviews
The authors have made substantial efforts to address reviewer concerns:

Adequately Addressed:

  • Added ablation studies for architecture choices (Tables 5-6)
  • Included statistical significance testing (p-values in Table 9)
  • Added EfficientNet as additional baseline
  • Clarified real-time performance (245.9ms << 5 seconds)
  • Added memory footprint analysis (Table 11)

Partially Addressed:

  • NPU accuracy improvement explanation relies heavily on a "regularization effect" hypothesis without empirical validation
  • Dataset augmentation details improved but cross-dataset validation remains infeasible
  • Power consumption measurements added but lack detailed analysis

Insufficiently Addressed:

  • The 10% accuracy improvement on NPU remains scientifically questionable despite the explanation provided
  • No comparison with quantization-aware training or other quantization methods
  • Limited discussion of clinical deployment challenges

 

Technical Merit & Scientific Rigor

  • The inception block design with asymmetric kernels is well-motivated for rectangular spectrograms
  • Statistical testing (p<0.001) strengthens the claims
  • However, the quantization methodology is oversimplified (linear mapping without calibration)
  • Missing critical details: quantization error analysis, per-layer sensitivity analysis, alternative quantization schemes

 

Novel Contribution

  • The work provides incremental advancement in on-device medical AI
  • The NPU deployment aspect is valuable but not groundbreaking
  • The accuracy improvement phenomenon, if validated, could be significant but requires more rigorous investigation

 

Critical Technical Issues

  1. Quantization Anomaly: The explanation that "quantization acts as regularization" is plausible but insufficient. The authors should: (i) Provide layer-wise activation distributions before/after quantization, (ii) Compare with proper quantization methods (symmetric vs asymmetric, per-channel vs per-tensor), (iii) Test on multiple random seeds to ensure reproducibility.
  2. Dataset Limitations: Only 1,298 original samples expanded to 19,894 through augmentation raises concerns about: (i) Overfitting to augmentation patterns, (ii) Limited disease diversity (ICBHI + KAUH only), (iii) No external validation set.
  3. Real-time Claims: While 245.9ms < 5s, the authors don't address: (i) Audio preprocessing overhead, (ii) Continuous streaming scenarios, (iii) Clinical integration requirements.

 

Presentation Quality

  • Generally well-written with clear structure
  • Figures are informative, especially confusion matrices
  • Some notation inconsistencies (e.g., equation numbering was corrected)

 

Journal Fit

  • Appropriate for Applied Sciences' biomedical engineering scope
  • Practical focus aligns with journal's applied nature
  • Could benefit from more discussion of clinical translation

 

Specific Recommendations for Improvement

  1. Conduct rigorous ablation study on quantization: (i) Compare INT8 quantization with FP16, dynamic quantization, (ii) Test quantization-aware training vs post-training quantization, (iii) Provide quantization error metrics per layer.
  2. Strengthen validation: (i) Report confidence intervals for all metrics, (ii) Test on held-out hospital data if possible, (iii) Provide learning curves showing training/validation dynamics.
  3. Address the accuracy improvement more scientifically: (i) Hypothesis testing for the improvement, (ii) Analysis of which classes benefit most from quantization, (iii) Comparison with Monte Carlo dropout or other uncertainty methods.
  4. Expand baseline comparisons: (i) Include TensorFlow Lite or ONNX Runtime results (even if simulated), (ii) Compare with knowledge distillation approaches, (iii) Test pruning + quantization combinations.
  5. Improve clinical relevance discussion: (i) Address regulatory considerations for medical devices, (ii) Discuss failure modes and safety considerations, (iii) Include clinician feedback or usability aspects.
  6. Add data flow diagram for the complete pipeline.
  7. Include more detailed NPU specifications.
  8. Discuss battery life implications for wearable deployment.

 

Final Recommendation: MINOR REVISION
The authors have made good faith efforts to address reviewer concerns and have substantially improved the manuscript. The work presents a valid contribution to on-device medical AI. However, the extraordinary claim of accuracy improvement after quantization requires more rigorous validation before publication. The requested revisions are focused and achievable within a minor revision timeframe.
The paper should be accepted after:

  • Providing more rigorous analysis of the quantization effect (minimum requirement),
  • Adding confidence intervals for key metrics,
  • Discussing limitations more thoroughly,
  • Clarifying the quantization methodology with standard terminology.

Author Response

 

RESPONSE TO THE REVIEWER

 

REVIEWER

Reviewer Comment 1:

Conduct rigorous ablation study on quantization: (i) Compare INT8 quantization with FP16, dynamic quantization, (ii) Test quantization-aware training vs post-training quantization, (iii) Provide quantization error metrics per layer.

 

Response:

We sincerely thank the reviewer for the valuable suggestions regarding quantization. Our manuscript primarily targets architectural compactness—i.e., proposing a lightweight DNN that reduces parameters and compute relative to existing models—rather than a comprehensive study of quantization strategies. For deployment, our target NPU (Hailo-8) currently supports only unsigned 8-bit integer (UINT8) inference; FP16 execution and dynamic quantization are not available on-device in our environment. Consequently, a full on-device comparison across FP16, dynamic quantization, and INT8 is not feasible at present.

 

Reviewer Comment 2:

Strengthen validation: (i) Report confidence intervals for all metrics, (ii) Test on held-out hospital data if possible, (iii) Provide learning curves showing training/validation dynamics.

 

Response:

Thank you for your comment.

(i) The existing schemes and the proposed scheme show p-values < 0.05 for all metrics, which means that the likelihood of the observed differences occurring by chance is low. We added the confidence intervals to the revised manuscript (a sketch of this computation is given after this list).

(ii) Other datasets either do not include the diseases we are targeting or contain data types other than lung sounds. Our study was conducted based on publicly available datasets, not in collaboration with a hospital, making testing using separate hospital data impossible.

(iii) We added the learning curves of the existing schemes based on the reviewer’s suggestion and revised the manuscript accordingly.
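The sketch below illustrates how such p-values and confidence intervals can be computed from the ten per-run accuracies; the arrays are placeholders, not the actual measurements.

```python
import numpy as np
from scipy import stats

# Placeholder accuracies from 10 independent runs (not the actual values).
proposed = np.array([0.795, 0.801, 0.792, 0.798, 0.803,
                     0.790, 0.799, 0.796, 0.800, 0.794])
baseline = np.array([0.551, 0.559, 0.548, 0.556, 0.545,
                     0.553, 0.560, 0.549, 0.552, 0.557])

# Two-sample t-test between the proposed scheme and a baseline.
t_stat, p_value = stats.ttest_ind(proposed, baseline)

# 95% confidence interval for the mean accuracy of the proposed scheme.
mean = proposed.mean()
half_width = stats.t.ppf(0.975, df=len(proposed) - 1) * stats.sem(proposed)
print(f"p-value: {p_value:.3g}, mean accuracy: {mean:.3f} +/- {half_width:.3f}")
```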

 

In the revised version of the paper:

Table 12 shows the p-values between the proposed scheme and the existing schemes in NPU. Although the p-values of the conventional methods before quantization vary, the p-values of the models quantized to the UINT8 format are below 0.05, indicating that the performance differences are statistically significant.

 

Table 12. P-values between the proposed scheme and existing schemes in NPU.

                ResNet50   MobileNetV2   InceptionV2   Stacked   EfficientNet

Figure 5 illustrates the training and loss curves of each scheme. As the training loss gradually decreases, a corresponding increase in classification accuracy is observed, indicating effective model learning. Moreover, the proposed scheme demonstrates stable training dynamics, with no significant fluctuations in either the loss or accuracy curves. The results also reveal that the model converges rapidly, highlighting the efficiency and robustness of the training process. We observed that the validation loss of both MobileNetV2 and EfficientNet did not decrease. This is attributed to the depthwise convolution mechanism of MobileNetV2 and the regularization strategy of EfficientNet, which were not effective in extracting discriminative features from audio spectrograms.

 

 

 

 


Figure 5. Training and loss curves of each scheme. (a) Loss curve of proposed scheme; (b) Training curve of proposed scheme; (c) Loss curve of ResNet50; (d) Training curve of ResNet50; (e) Loss curve of MobileNetV2; (f) Training curve of MobileNetV2; (g) Loss curve of InceptionV2; (h) Training curve of InceptionV2; (i) Loss curve of Stacked; (j) Training curve of Stacked; (k) Loss curve of EfficientNet; (l) Training curve of EfficientNet.

 

 

Reviewer Comment 3:

Address the accuracy improvement more scientifically: (i) Hypothesis testing for the improvement, (ii) Analysis of which classes benefit most from quantization, (iii) Comparison with Monte Carlo dropout or other uncertainty methods.

 

Response:

(i) We agree that accuracy gains should be supported statistically. We already report performance as the average of 10 independent training runs on the fixed test split.

(ii) We analyzed which classes benefit from quantization, as suggested by the reviewer. COPD and Pneumonia tend to rely primarily on large-scale structural patterns and are therefore less sensitive to quantization. We added this analysis to the manuscript.

(iii) We agree that uncertainty-aware baselines are informative in general. However, our contribution targets integer-only NPU deployment with strict latency/power budgets; Monte Carlo dropout and related sampling methods require multiple stochastic forward passes at inference, which are unsupported or impractical on our NPU pipeline and misaligned with the wearable duty-cycle constraints. As such, a fair on-device comparison is out of scope for this deployment-oriented paper. We have clarified in the manuscript that we do not claim quantization to be universally superior to uncertainty methods and we list this as a limitation and avenue for future work.

 

In the revised version of the paper:

Quantization improved the classification accuracy of the COPD and Pneumonia classes. Since these classes exhibit large-scale patterns, they are less sensitive to quantization, thereby reducing the likelihood of misclassification. Furthermore, such classes often possess strong channel specificity, which allows the normalization effect introduced during quantization to further enhance classification accuracy. In contrast, classes such as bronchiolitis, which rely on subtle high-frequency features, may suffer from information loss in the layers after quantization, leading to reduced classification accuracy.

 

As future work, we aim to reduce computational complexity in order to shorten inference time while maintaining or improving classification performance on resource-constrained devices. Since the ICBHI 2017 dataset and the KAUH dataset differ in disease types, recording conditions, and demographic characteristics, we will expand the evaluation to cover a more diverse real-world cohort. Public corpora such as Coswara and SPRSound are not directly suitable for this disease classification setting, as they are either COVID-19–centric or encompass a much broader acoustic scope. Therefore, validating the model in a wider context remains a key direction. We also plan to integrate an on-device AI module into an actual stethoscope and evaluate end-to-end accuracy, power consumption, and inference latency during real use. Complementary to these efforts, we will investigate hardware-efficient uncertainty estimation methods compatible with integer-only NPUs, and benchmark on-device calibration and risk-aware decision thresholds for quantized models under strict latency and power constraints. Finally, if limited probabilistic inference becomes feasible in the toolchain, we will explore bounded-sample Monte Carlo methods that can still be executed on the target hardware.

 

Reviewer Comment 4:

Expand baseline comparisons: (i) Include TensorFlow Lite or ONNX Runtime results (even if simulated), (ii) Compare with knowledge distillation approaches, (iii) Test pruning + quantization combinations.

Response:

(i) We included the results of the ONNX Runtime and revised the manuscript to add them.

(ii, iii) Thank you for your comment. We would have liked to explore other quantization approaches; however, the NPU supports only UINT8-based quantization, and this quantization approach is determined by the NPU vendor. This means the AI model must be executed in UINT8. If the NPU module supported various quantization approaches, we would test the AI model with those approaches as well. In addition, we focus on a lightweight model for lung sound classification, and many of the model-lightweighting techniques suggested are not suitable for this study.

 

 

In the revised version of the paper:

Table 10 shows the performance of each scheme in the ONNX framework. All schemes exhibited performance comparable to the GPU-based evaluations; however, a slight degradation was observed. This is because, although both the baseline model and the ONNX model operate with FLOAT32 precision, subtle variations occur during the conversion process due to parameter scaling, making the ONNX model less sensitive to fine-grained representations.

 

Table 10. Performance of each scheme in ONNX framework.

Model               Accuracy   F1-score   Precision   Recall
Proposed            79.44%     0.781      0.799       0.765
ResNet50 [30]       80.42%     0.781      0.788       0.78
MobileNetV2 [31]    55.24%     0.484      0.522       0.397
InceptionV2 [32]    54.36%     0.415      0.477       0.392
Stacked [22]        50.02%     0.228      0.256       0.204
EfficientNet [33]   60.26%     0.534      0.562       0.523
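For reference, an ONNX Runtime evaluation of this kind is typically set up as in the following hedged sketch; the placeholder network, file name, and tensor names are assumptions, not the authors' actual export script.

```python
import torch
import torch.nn as nn
import onnxruntime as ort

# Placeholder network standing in for the trained classifier (10 classes assumed).
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveMaxPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 10),
)
model.eval()

# Export the FLOAT32 model to ONNX with a stacked Mel/chroma/MFCC dummy input.
dummy = torch.randn(1, 3, 128, 216)
torch.onnx.export(model, dummy, "lung_model.onnx",
                  input_names=["audio_features"], output_names=["logits"])

# Run inference with ONNX Runtime on the CPU execution provider.
session = ort.InferenceSession("lung_model.onnx",
                               providers=["CPUExecutionProvider"])
logits = session.run(None, {"audio_features": dummy.numpy()})[0]
print(logits.argmax(axis=1))  # predicted class index
```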

 

 

Reviewer Comment 5:                     

Improve clinical relevance discussion: (i) Address regulatory considerations for medical devices, (ii) Discuss failure modes and safety considerations, (iii) Include clinician feedback or usability aspects.

Response:

Thank you for your opinion. Our contribution is a lightweight AI architecture. In the Proposed Scheme section, we address the audio data processing and the DNN architecture designed for lightweight deployment. We do not propose a medical device. Therefore, regulatory considerations for medical devices, failure modes, and safety considerations are outside the scope of this work.

 

Reviewer Comment 6:

Add data flow diagram for the complete pipeline.

Response:

We added pseudocode for the pipeline of the proposed scheme.

 

In the revised version of the paper:

Algorithm 1 shows the pseudocode for preprocessing the lung sounds. The audio signal is first resampled according to the sampling rate (sr) to ensure temporal consistency across recordings. The signal is then formatted to match a fixed input duration, using padding or truncation if necessary, to standardize the temporal dimension for model input. Next, three distinct audio features are extracted to capture complementary characteristics of the lung sounds. The mel spectrogram is computed as described in Equation (3), providing a time–frequency representation with perceptually scaled frequency bins. The chroma feature, extracted via Equation (5), captures pitch class distribution, which is relevant for tonal components present in pathological sounds. Additionally, MFCC are derived using Equation (6) to represent the spectral envelope of the sound, which is known to be effective in characterizing respiratory acoustics. Once these features are extracted, each is resized to a common image resolution of 128 × 216, allowing compatibility with CNN input formats. Feature normalization is applied to standardize value ranges. Finally, the three features—mel spectrogram, chroma, and MFCC—are stacked along the channel dimension.

 

Algorithm 1: Preprocessing of Lung Sound

Data: Lung sound signal
Result: Preprocessed audio feature
1: Sample the audio signal based on the sampling rate (sr)
2: Format the audio based on the input audio length
3: Extract the mel spectrogram feature using Eq. (3)
4: Extract the chroma feature using Eq. (5)
5: Extract the MFCC feature using Eq. (6)
6: Transform each audio feature to a 2D image (128 × 216)
7: Normalize the feature values
8: Stack the mel spectrogram, chroma, and MFCC
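A librosa-based sketch of Algorithm 1 is given below; the sampling rate, feature parameters, and resizing routine are assumptions, since the exact computations are defined by Equations (3), (5), and (6) in the paper.

```python
import numpy as np
import librosa
import cv2

def preprocess_lung_sound(path, sr=16000, duration=5.0, size=(216, 128)):
    """Sketch of Algorithm 1: returns a 3 x 128 x 216 stacked feature image."""
    # Steps 1-2: resample and pad/truncate to a fixed 5-second window.
    y, _ = librosa.load(path, sr=sr)
    y = librosa.util.fix_length(y, size=int(sr * duration))

    # Steps 3-5: extract Mel spectrogram, chroma, and MFCC features.
    mel = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128))
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)

    # Steps 6-7: resize each feature map to 128 x 216 and normalize to [0, 1].
    def to_image(f):
        f = cv2.resize(f.astype(np.float32), size)  # cv2 takes (width, height)
        return (f - f.min()) / (f.max() - f.min() + 1e-8)

    # Step 8: stack along the channel dimension (Mel, chroma, MFCC).
    return np.stack([to_image(mel), to_image(chroma), to_image(mfcc)], axis=0)
```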

 

 

Reviewer Comment 7:

Include more detailed NPU specifications.

Response:

We added NPU information to the revised manuscript. However, more detailed NPU specifications are not provided by the vendor. We have added the AI-related information available from the manufacturer to the revised manuscript.

 

In the revised version of the paper:

The Raspberry Pi is equipped with a 64-bit Arm Cortex-A76 CPU and 16 GB of RAM, while the Hailo-8 NPU delivers up to 26 Tera Operations Per Second (TOPS) of inference performance. The NPU supports an M.2 connection, operates from -40 ℃ to 80 ℃, and runs on a Linux-based system. Its base power consumption is 2.5 W.

 

Reviewer Comment 8:

Discuss battery life implications for wearable deployment.

Response:

Thank you for your opinion. We discuss the battery life implications in the Conclusions section. We updated the manuscript by adding this discussion.

 

In the revised version of the paper:

We measured an active power draw of 4.6 W for the NPU path. Under continuous operation, this implies very short endurance on typical wearable energy budgets: a 2 Wh pack (≈520 mAh at 3.85 V) lasts ≈0.43 h (≈26 min) and a 3 Wh pack (≈780–800 mAh) lasts ≈0.65 h (≈39 min). In addition to the NPU used in this study, there exist modules with various specifications. The NPU we employed delivers high performance with 26 TOPS; however, its relatively high power consumption poses challenges for long-term operation in wearable devices. Therefore, when deploying NPUs in such devices, a trade-off between performance and battery consumption must be carefully considered.
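The endurance figures above follow from a simple energy-budget calculation, sketched here under the assumption of continuous operation at the measured 4.6 W draw:

```python
# Battery endurance = capacity (Wh) / average power draw (W).
power_w = 4.6                      # measured NPU-path power draw
for capacity_wh in (2.0, 3.0):     # typical wearable battery budgets
    hours = capacity_wh / power_w
    print(f"{capacity_wh} Wh -> {hours:.2f} h ({hours * 60:.0f} min)")
# 2 Wh -> 0.43 h (26 min); 3 Wh -> 0.65 h (39 min)
```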

 

Reviewer Comment 9:

The 10% accuracy improvement on NPU remains scientifically questionable despite the explanation provided.

Response:

Thank you for your opinion. We updated the manuscript to clarify the description of the 10% accuracy improvement on the NPU. We also added the results for the audio data processing overhead.

 

In the revised version of the paper:

The three methods take an average of 42.22 ms to process the audio data.

 

In GPU-based training environments, high-capacity floating-point models may become overly sensitive to dominant class signals, resulting in biased predictions and reduced generalization capability. However, when such models are quantized from FP32 to UINT8 and deployed on integer-based NPUs, the reduced numerical precision inherently acts as a form of regularization, mitigating the risk of overfitting and improving generalization. Quantization compresses the dynamic range of weights and activations, effectively suppressing the influence of overly dominant parameters and reducing the model's sensitivity to minor fluctuations. This attenuates class-specific bias, enabling more balanced inference across all classes. Empirically, we observed that the inference accuracy after quantization on the NPU surpasses that of the original GPU-based inference in certain scenarios. In addition to this implicit-regularization view, we analyze that three deployment-side factors also contribute to the observed gains: First, batch-normalization parameters are folded into adjacent convolutions during export, eliminating potential running-statistics mismatch at inference; Second, per-channel weight scaling on the NPU equalizes channel contributions and reduces sensitivity to poorly conditioned filters; and third, deterministic integer accumulation with bounded activation ranges curtails extreme logits, yielding more stable and better-calibrated decisions. Furthermore, quantization implicitly reduces over-parameterization and acts as a noise stabilizer, which is particularly beneficial in cases where the training data exhibits label imbalance or noise.
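As an illustration of the batch-normalization folding mentioned above, the following PyTorch sketch shows the standard fold of BN parameters into a preceding convolution; it is not the Hailo exporter's actual implementation.

```python
import torch
import torch.nn as nn

def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold BatchNorm parameters into the preceding convolution.

    Standard algebra: w' = w * gamma / sqrt(var + eps),
                      b' = (b - mean) * gamma / sqrt(var + eps) + beta.
    """
    fused = nn.Conv2d(conv.in_channels, conv.out_channels,
                      conv.kernel_size, conv.stride, conv.padding, bias=True)
    std = torch.sqrt(bn.running_var + bn.eps)
    scale = bn.weight / std
    fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)
    bias = conv.bias.data if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.data = (bias - bn.running_mean) * scale + bn.bias.data
    return fused

conv, bn = nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8)
bn.eval()
x = torch.randn(1, 3, 32, 32)
with torch.no_grad():
    # The fused convolution reproduces conv followed by BN (in eval mode).
    print(torch.allclose(bn(conv(x)), fold_bn_into_conv(conv, bn)(x), atol=1e-5))
```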

Author Response File: Author Response.pdf
