Peer-Review Record

Voice as a Health Indicator: The Use of Sound Analysis and AI for Monitoring Respiratory Function

BioMedInformatics 2025, 5(2), 31; https://doi.org/10.3390/biomedinformatics5020031
by Nicki Lentz-Nielsen 1, Lars Maaløe 2, Pascal Madeleine 3 and Stig Nikolaj Blomberg 1,4,5,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 9 April 2025 / Revised: 28 May 2025 / Accepted: 29 May 2025 / Published: 7 June 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

1. Methodological Transparency: Although the authors note that several candidate algorithms were tested before settling on KNN, the manuscript doesn't present the selection process in sufficient detail. An explanation - what feature sets were tried, which preprocessing steps were indispensable, how hyperparameters were tuned, and why competing models fell short - would give the choice far more credibility.

2. Dataset Scope and Signal Processing: The study enrolls 23 volunteers, yet the reader is left guessing how many total observations those sessions produced or how often recordings had to be discarded. Equally unclear is the treatment of gaps, artefacts, and drift in the time-series data: Were missing segments imputed or simply ignored? Were signals normalized or smoothed before entering the pipeline? 

3. Literature Review: The related work review feels insufficient. Prior attempts at spirometry-free FEV₁ estimation - some of which use accelerometry, others acoustics or radar - receive only brief mention. A table comparing inputs, sample sizes, and error metrics would help readers understand how this study advances the field.

4. Model Choice and Interpretability: KNN emerges as the best model, yet the rationale for preferring it over ensemble methods (e.g., gradient boosting) or lightweight neural networks is not clearly articulated. A side-by-side comparison of bias-variance trade-offs, training times, and feature importance explanations could reveal whether KNN’s edge is substantive or merely incidental.

5. Clinical Relevance and Deployment Feasibility: The conclusion suggests that the approach could facilitate at-home monitoring for COPD; however, the discussion falls short of addressing practicalities. What hardware is required? How much calibration would a patient perform unaided? Could data be streamed securely to clinicians?

Author Response

Dear Reviewer,
Thank you for your in-depth feedback. These comments are highly appreciated, and we have incorporated your suggestions and comments as best we could to improve the readability and quality of the manuscript.

Six new citations have been added.

---------------------------------------------------------------------------------------------------------------------

  1. Methodological Transparency: Although the authors note that several candidate algorithms were tested before settling on KNN, the manuscript doesn't present the selection process in sufficient detail. An explanation - what feature sets were tried, which preprocessing steps were indispensable, how hyperparameters were tuned, and why competing models fell short - would give the choice far more credibility.

Thank you for the good feedback. Most of this information was in different tables and appendices. We have now added different sections that explain it directly in the manuscript to increase methodological transparency.

[Line 166-176]

This rolling window feature extraction inherently provided temporal smoothing of the acoustic characteristics before aggregate statistics were calculated for each recording. The extraction yielded a broad set of 431 features, encompassing time-domain characteristics (e.g., Zero Crossing Rate, Energy), spectral shape descriptors (e.g., Spectral Centroid, MFCCs, Chroma features), and their first derivatives (deltas) to capture temporal dynamics. Additionally, Linear Predictive Coding (LPC) coefficients were derived to model vocal tract properties, and Word Error Rate (WER) was calculated to assess transcription fidelity. Silence analysis provided statistics on pause durations [19]. These acoustically derived features, along with manually entered physiological data and a relative audio energy measure, were then subjected to aggregate statistical calculations (Appendix C).
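To make the excerpt above concrete, the following minimal sketch illustrates rolling-window extraction followed by per-recording aggregation, using pyAudioAnalysis (which the manuscript cites for silence analysis); the 50/25 ms window/step sizes and the mean/std aggregation are illustrative assumptions, not the study's exact configuration.

```python
# Minimal sketch: short-term (rolling-window) acoustic features, then
# recording-level aggregate statistics. Window/step sizes are assumptions.
import numpy as np
from pyAudioAnalysis import audioBasicIO, ShortTermFeatures

def extract_recording_features(path, win_s=0.050, step_s=0.025):
    fs, x = audioBasicIO.read_audio_file(path)
    x = audioBasicIO.stereo_to_mono(x)
    # Short-term features: ZCR, energy, spectral centroid, MFCCs, chroma, ...
    # plus their first derivatives (deltas=True).
    feats, names = ShortTermFeatures.feature_extraction(
        x, fs, int(win_s * fs), int(step_s * fs), deltas=True)
    # Aggregate each short-term feature track into recording-level statistics.
    agg = {}
    for i, name in enumerate(names):
        agg[f"{name}_mean"] = float(np.mean(feats[i]))
        agg[f"{name}_std"] = float(np.std(feats[i]))
    return agg
```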

[Line 197]

reducing the initial 431 features to a final set of 100 for model training.
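The selection criterion itself is not restated in this response, so the snippet below is purely hypothetical: a univariate filter (scikit-learn's SelectKBest) stands in for whichever method reduced the 431 features to the final 100.

```python
# Hypothetical stand-in for the 431 -> 100 feature reduction step.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 431))   # placeholder feature matrix
y_train = rng.normal(size=200)          # placeholder FEV1 targets

selector = SelectKBest(score_func=f_regression, k=100)
X_reduced = selector.fit_transform(X_train, y_train)
print(X_reduced.shape)  # (200, 100)
```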

[Line 210-219]

For hyperparameter tuning of various models (Table I), a 10-fold cross-validation approach within a GridSearchCV framework was employed. The specific hyperparameter search spaces are detailed in Table II. This optimization phase utilized a custom loss function (Equation 2), designed to heavily penalize larger prediction errors and minimize error variability.
However, following the hyperparameter tuning, the Root Mean Squared Error (RMSE) (Equation 3) on the validation set was employed as the primary metric to assess and select the best performing model, thereby simplifying interpretation and facilitating comparison.
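A sketch of this two-stage procedure, assuming a scikit-learn GridSearchCV workflow: a custom scorer (a stand-in for Equation 2, which is not reproduced in this response) drives the 10-fold tuning, while plain RMSE (Equation 3) on a held-out validation set guides the final model selection.

```python
import numpy as np
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 100))               # placeholder: 100 selected features
y = rng.normal(loc=1.5, scale=0.5, size=300)  # placeholder FEV1 values (L)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  random_state=0)

def custom_loss(y_true, y_pred):
    # Assumption: Equation 2 is not reproduced here; this stand-in penalizes
    # large errors (squared term) and error variability (variance term).
    err = y_true - y_pred
    return np.mean(err ** 2) + np.var(err)

scorer = make_scorer(custom_loss, greater_is_better=False)

# 10-fold cross-validated grid search driven by the custom loss.
param_grid = {"n_neighbors": [3, 5, 7, 9], "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsRegressor(), param_grid, scoring=scorer, cv=10)
search.fit(X_train, y_train)

# Final comparison/selection uses plain RMSE (Equation 3) on validation data.
rmse = np.sqrt(mean_squared_error(y_val, search.predict(X_val)))
```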

  2. Dataset Scope and Signal Processing: The study enrolls 23 volunteers, yet the reader is left guessing how many total observations those sessions produced or how often recordings had to be discarded. Equally unclear is the treatment of gaps, artefacts, and drift in the time-series data: Were missing segments imputed or simply ignored? Were signals normalized or smoothed before entering the pipeline?

Thank you for pointing this out. We have indeed not made the number of observations clear outside of Figure 1. To address this, we have added the following sections to the Materials and Methods to make it clearer:

[Line 124-138]: Addresses the concern regarding the number of observations and gives a short rationale prior to the details described below Figure 1.

An initial collection of 2946 audio files was gathered over nine months. To ensure consistent data quality, a multi-step filtering process was applied (Figure 1). This process removed audio files that: (1) lacked corresponding spirometry measurements, (2) were missing tablet identifiers due to technical issues preventing participant attribution, or failed to meet our quality thresholds based on (3) file size (min. 300 kB) and (4) transcription accuracy. Recordings that were incomplete (i.e., missing large segments of the expected poem), as indicated by failing to meet the file size or overall Word Error Rate (WER) thresholds, were thus excluded from the analysis. For recordings passing these initial quality checks, silences occurring within the spoken poems were not removed or imputed; instead, characteristics of these speech pauses (e.g., duration, frequency) were quantified using the pyAudioAnalysis library and included as predictive features, as detailed in Section 2.1 Feature Engineering. This resulted in 2084 audio files (2019 from participants with COPD, 34 from controls) for analysis, with details of exclusions provided in Figure 1.
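For illustration, the four exclusion criteria can be expressed as a simple filter; the 300 kB floor comes from the text, while the WER ceiling of 0.5 and the metadata field names are hypothetical.

```python
# Illustrative sketch of the multi-step quality filter described above.
import os

MIN_SIZE_BYTES = 300 * 1024  # minimum file size threshold from the text
MAX_WER = 0.5                # assumed transcription-accuracy cut-off

def keep_recording(path, meta):
    if meta.get("spirometry") is None:          # (1) no paired spirometry
        return False
    if meta.get("tablet_id") is None:           # (2) missing tablet identifier
        return False
    if os.path.getsize(path) < MIN_SIZE_BYTES:  # (3) file-size threshold
        return False
    if meta.get("wer", 1.0) > MAX_WER:          # (4) transcription accuracy
        return False
    return True
```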

[Line 143-147] Making it clearer how we addressed artefacts and audio files with too low quality.

Recording artefacts (e.g., significant background noise, non-speech sounds corrupting the poem reading) were handled primarily through this multi-step filtering protocol, which excluded recordings that did not meet quality standards rather than attempting direct artefact removal or correction within the audio signal itself.

[Line 157-162] Thank you for this point. We have enhanced the description by adding details on how we mitigated potential drift.

In the present study, short-duration (30-60 seconds) audio recordings were used, limiting the potential drift effect of the microphone. The use of a 'gold-standard' reference recording provided a stable benchmark for relative audio energy comparisons (detailed in Section 2.1 Feature Engineering), and the consistent application of per-recording quality control served to mitigate issues due to sensor drift.
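A minimal sketch of such a relative energy measure, assuming it is defined as a ratio of RMS energies against the reference recording (the exact definition is given in Section 2.1 of the manuscript and is not reproduced here):

```python
# Assumed definition: ratio of RMS energies against a fixed reference.
import numpy as np

def rms_energy(x):
    return np.sqrt(np.mean(np.asarray(x, dtype=float) ** 2))

def relative_energy(signal, reference_signal):
    # >1 means the recording is louder than the 'gold-standard' benchmark.
    return rms_energy(signal) / rms_energy(reference_signal)
```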

[Line 166-167] We have tried to make it clearer that the rolling window is the function that smooths the data.

This rolling window feature extraction inherently provided temporal smoothing of the acoustic characteristics before the aggregate statistics were calculated for each recording.

  3. Literature Review: The related work review feels insufficient. Prior attempts at spirometry-free FEV₁ estimation - some of which use accelerometry, others acoustics or radar - receive only brief mention. A table comparing inputs, sample sizes, and error metrics would help readers understand how this study advances the field.

This is a very relevant point. There are solutions other than audio for estimating FEV1 without spirometry. However, since the present manuscript is not a literature review, we decided not to include a table comparing all the inputs, sample sizes, etc. Instead, we have added the following section to address your comment regarding the insufficient review of other technologies (four new references were added):


[Line 50-63]
The pursuit of more amenable alternatives has led to investigations across various technological modalities. Beyond acoustic analysis, researchers have explored accelerometry-based systems, often involving wearable sensors to detect chest wall movements or physical activity patterns that correlate with respiratory status [9,10]. Non-contact methods, such as those utilizing radar technology, have also been investigated for their potential to remotely sense thoracic excursions and breathing rhythms to derive estimates of lung function parameters [11,12]. These diverse technological approaches aim to provide less burdensome FEV₁ estimation, though each carries its own considerations regarding accuracy, user acceptability, and deployment complexity in real-world environments. For instance, with radar-based systems, random body movements and the presence of other individuals within the sensing range can introduce confounding factors, complicating accurate physiological data acquisition [12].

Added references:

[9] Sang, B.; Wen, H.; Junek, G.; Neveu, W.; Di Francesco, L.; Ayazi, F. An accelerometer-based wearable patch for robust respiratory rate and wheeze detection using deep learning. Biosensors 2024, 14, 118. doi:10.3390/bios14030118.

[10] Minakata, Y.; Azuma, Y.; Sasaki, S.; Murakami, Y. Objective measurement of physical activity and sedentary behavior in patients with chronic obstructive pulmonary disease: Points to keep in mind during evaluations. J. Clin. Med. 2023, 12, 3254. doi:10.3390/jcm12093254.

[11] Wang, W.; Wan, Y.; Li, C.; Chen, Z.; Zhang, W.; Zhao, L.; Zhao, J.; Mu; Li, G. Millimetre-wave radar-based spirometry for the preliminary diagnosis of chronic obstructive pulmonary disease. IET Radar Sonar Navig. 2023, 17, 1874–1885. doi:10.1049/rsn2.12538.

[12] Islam, S.M.M. Radar-based remote physiological sensing: Progress, challenges, and opportunities. Front. Physiol. 2022, 13, 955208. doi:10.3389/fphys.2022.955208.

  4. Model Choice and Interpretability: KNN emerges as the best model, yet the rationale for preferring it over ensemble methods (e.g., gradient boosting) or lightweight neural networks is not clearly articulated. A side-by-side comparison of bias-variance trade-offs, training times, and feature importance explanations could reveal whether KNN’s edge is substantive or merely incidental.

Thank you for bringing this up. A side-by-side comparison of the bias-variance trade-offs, training times, etc. would indeed be very interesting, and the manuscript would benefit from it. Unfortunately, for ethical and legal reasons, we are no longer allowed to work with the dataset, so it is not possible to rerun the analyses at this time. However, we have added material to the discussion that aids the comparison, and we hope for your understanding.

[Line 348-366]
The K-Nearest Neighbours (KNN) model, configured with seven neighbours and distance-weighting, demonstrated the strongest performance (Table IV). The comparatively weaker performance of linear models (Table IV) suggests a non-linear relationship between the acoustic features and FEV1. KNN, as a non-parametric method, excels at capturing such local non-linearities without imposing strong assumptions on the data distribution, which likely contributed to its success in this context. While ensemble methods like XGBoost performed less well, KNN's selection as the optimal model was driven by its performance on the validation set across key metrics like RMSE, MAE, and MSE (see Table IV), indicating an advantage for this specific FEV1 estimation task. This choice is further supported by KNN's characteristic efficiency in training compared to more complex models like XGBoost or MLP (as generally outlined in Table I), and its inherent local interpretability, where predictions can be understood by examining neighbouring instances. Although ensemble methods can offer global feature importance, KNN's leading predictive accuracy, coupled with these practical considerations, rendered it the most suitable model for this study. The selection and tuning of the KNN model were guided by established principles of model configuration (e.g., [27, p. 159]), and the specific settings chosen for the KNN model were informed by examples of detailed setups, such as that presented in [28, Table 2].
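Expressed as code, the configuration named above might look as follows in scikit-learn; the StandardScaler step and the placeholder data are our assumptions, the scaler added because KNN distances are scale-sensitive.

```python
# Sketch of the selected configuration: 7 neighbours, inverse-distance weights.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 100))               # placeholder features
y_train = rng.normal(loc=1.5, scale=0.5, size=200)  # placeholder FEV1 (L)

knn_model = make_pipeline(
    StandardScaler(),  # assumed: scaling, since KNN is distance-based
    KNeighborsRegressor(n_neighbors=7, weights="distance"),
)
knn_model.fit(X_train, y_train)
fev1_pred = knn_model.predict(X_train[:5])
```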

Added references:

[27] Kuhn, M.; Johnson, K. Applied predictive modeling; Springer: New York, NY, USA, 2013.  doi:10.1007/978-1-4614-6849-3.

 

[28] Zargari Marandi, R.; Madeleine, P.; Omland, Ø.; Vuillerme, N.; Samani, A. An oculometrics-based biofeedback system to impede fatigue development during computer work: A proof-of-concept study. PLoS ONE 2019, 14, e0213704. doi:10.1371/journal.pone.0213704.

  5. Clinical Relevance and Deployment Feasibility: The conclusion suggests that the approach could facilitate at-home monitoring for COPD; however, the discussion falls short of addressing practicalities. What hardware is required? How much calibration would a patient perform unaided? Could data be streamed securely to clinicians?

This is an extremely interesting perspective, and we have added an entirely new section on this important matter.
[Line 404-423]

4.3. Practical Considerations for At-Home Deployment

     The clinical utility of an AI-driven voice analysis approach for COPD management hinges on its practical deployment in home settings. This study's design inherently addresses several feasibility aspects, particularly when contrasted with other technological modalities for respiratory monitoring. Firstly, hardware requirements are minimal; the system effectively utilized standard tablets, and future iterations could readily employ ubiquitous smartphones. This significantly lowers barriers to access and cost compared to approaches requiring dedicated wearable sensors for accelerometry-based systems [9, 10] or specialized non-contact radar equipment [11, 12]. Secondly, the burden of calibration and daily use on patients is low. Initial guidance on reading a scripted text into the device is straightforward, with complex audio processing and AI-driven FEV1 estimation handled by the backend. This contrasts with the potential need for precise sensor placement in wearable systems or managing environmental confounders pertinent to radar-based methods [12]. Finally, for integration into clinical workflows, secure and confidential data handling is paramount. Any system deployed would need robust encrypted transmission of audio data and FEV1 estimates to clinicians, adhering strictly to data protection regulations such as GDPR. These elements suggest a viable pathway for developing an accessible, less burdensome, and scalable tool for remote respiratory monitoring compared to some existing alternatives.
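As an illustration only, secure transmission could be as simple as an HTTPS POST to a clinician-facing endpoint; the URL, token, and payload schema below are hypothetical, with TLS providing the in-transit encryption mentioned above.

```python
# Illustrative-only sketch of encrypted transmission of a recording and its
# FEV1 estimate to a (hypothetical) clinician endpoint over HTTPS.
import requests

def upload_result(audio_path, fev1_estimate, token):
    with open(audio_path, "rb") as f:
        resp = requests.post(
            "https://clinic.example.org/api/fev1",  # hypothetical endpoint
            headers={"Authorization": f"Bearer {token}"},
            files={"audio": f},
            data={"fev1": str(fev1_estimate)},
            timeout=30,
        )
    resp.raise_for_status()
    return resp.json()
```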

 

 

Reviewer 2 Report

Comments and Suggestions for Authors

This manuscript develops a predictive model that uses voice recordings, collected at home from participants with moderate to severe COPD, to estimate FEV1 with encouraging performance. Minor revisions should be performed.

Specific comments:

(1) What are the drawbacks of the proposed model?

(2) Compared with the models in the literature, what are the improvements or advantages of the proposed model in this manuscript?

Author Response

Dear Reviewer,

Thank you for your in-depth feedback. These comments are highly appreciated, and we have incorporated your suggestions and comments as best we could to improve the readability and quality of the manuscript.

---------------------------------------------------------------------------------------------------------------------

  1. What are the drawbacks of the proposed model?

We have tried to make the drawbacks of such a model clearer in the discussion.

[Line 467-478]

A notable drawback of the current model is its reliance on active patient engagement: users must consciously perform daily scripted readings. While this still requires user interaction, it is less physically demanding and potentially more convenient than performing a full spirometry test. This requirement for active participation nevertheless underscores an area for future research, namely more passive monitoring systems. Furthermore, a limitation in developing the current model is the reliance on FEV1 data derived from unsupervised home spirometry for training. As Anand et al. [8] reported, unsupervised tests can yield different, often lower, FEV1 values compared to supervised tests. This inherent variability in the training target data may limit the precision achievable by the current model. To potentially enhance model accuracy and establish a more robust ground truth, future iterations could benefit from training data where FEV1 is assessed under supervised spirometry.

  2. Compared with the models in the literature, what are the improvements or advantages of the proposed model in this manuscript?

Thank you for the question; this is a vital point we had missed. This has been inserted into the discussion.

[Line 339-343]

Furthermore, a key methodological advantage of the current study is its high ecological validity, stemming from daily, unsupervised home recordings using a non-obstructive sensor such as a tablet microphone. This approach directly evaluates real-world performance [13] and addresses the challenges of unsupervised monitoring in domestic settings [8].

Reviewer 3 Report

Comments and Suggestions for Authors

The manuscript presents an innovative study on the use of voice analysis, combined with machine learning models, for estimating respiratory function (FEV1) in patients with COPD. It proposes a non-invasive approach based on daily home recordings, with accurate data processing and a robust comparison between different predictive algorithms.

The manuscript stands out for its innovative use of voice as a respiratory biomarker, a highly current approach in the field of digital medicine. The methodology is solid, with particular attention to data quality, preprocessing, and feature selection. The comparison between different machine learning models strengthens the validity of the results, while the clinical impact is significant, offering a promising solution for home monitoring of COPD as an alternative to traditional spirometry.

Areas for Improvement:

  • At the end of the introduction, the structure of the paper across its sections should be briefly outlined.
  • I would suggest adding a subsection within the introduction to explain the methodological advantages and outcomes compared to other related studies in the literature.
  • At the beginning of the Materials and Methods section, it would be helpful to include a graphical workflow, accompanied by a brief explanation, to immediately convey the overall methodology to the reader.
  • The metrics used for the performance comparison between algorithms (RMSE, MAPE, MAE, MSE, etc.) should be explained in the methodology section, including the relevant mathematical formulas properly numbered.
  • The presence of some significant outliers is correctly acknowledged, but a broader discussion on possible strategies for handling extreme data in future studies would be advisable.
  • In paragraph 2.3, I would suggest inserting a brief reference to the literature supporting the choice of the 80/20 sampling, 10-fold cross-validation, and the comparison between multiple machine learning algorithms. For this purpose, I propose citing the following references:

https://doi.org/10.3390/app14146016;  https://doi.org/10.1016/j.compbiomed.2024.109330.

  • The conclusions are overall appropriate and well aligned with the results presented. However, I would recommend slightly expanding this section by more clearly outlining potential future research directions and briefly deepening the discussion of the practical applicability of the proposed methodology. A more articulated closing would further enhance the overall quality of the manuscript.

Author Response

Please kindly see the attached file.

Author Response File: Author Response.pdf

Round 2

Reviewer 3 Report

Comments and Suggestions for Authors

The authors have made the requested changes.
Congratulations and best wishes for the publication.
