Article

Voice as a Health Indicator: The Use of Sound Analysis and AI for Monitoring Respiratory Function

by Nicki Lentz-Nielsen 1, Lars Maaløe 2, Pascal Madeleine 3 and Stig Nikolaj Blomberg 1,4,5,*

1 Koncern Digitalisering, Region Zealand, 4100 Ringsted, Denmark
2 Corti and Department of Applied Mathematics and Computer Science, Technical University of Denmark, 2800 Lyngby, Denmark
3 ExerciseTech, Department of Health Science and Technology, Faculty of Medicine, Aalborg University, 9260 Klarup, Denmark
4 Institute of Clinical Medicine, University of Copenhagen, 2200 Copenhagen, Denmark
5 CardioTech, Department of Health Science and Technology, Faculty of Medicine, Aalborg University, 9260 Klarup, Denmark
* Author to whom correspondence should be addressed.
BioMedInformatics 2025, 5(2), 31; https://doi.org/10.3390/biomedinformatics5020031
Submission received: 9 April 2025 / Revised: 28 May 2025 / Accepted: 29 May 2025 / Published: 7 June 2025

Abstract

Background: Chronic obstructive pulmonary disease (COPD) is projected to be the third-leading cause of death by 2030. Traditional spirometry for the monitoring of the forced expiratory volume in one second (FEV1) can provoke discomfort and anxiety. This study aimed to validate AI models using daily audio recordings as an alternative for FEV1 estimation in home settings. Methods: Twenty-three participants with moderate to severe COPD recorded daily audio readings of standardized texts and measured their FEV1 using spirometry over nine months. Participants also recorded biomarkers (heart rate, temperature, oxygen saturation) via tablet application. Various machine learning models were trained using acoustic features extracted from 2053 recordings, with K-nearest neighbor, random forest, XGBoost, and linear models evaluated using 10-fold cross-validation. Results: The K-nearest neighbors model achieved a root mean square error of 174 mL/s on the validation data. The limit of agreement (LoA) ranged from −333.21 to 347.26 mL/s. Despite an error range of −1252 to 1435 mL/s, most predictions fell within the LoA, indicating good performance in estimating the FEV1. Conclusions: The predictive model showed promising results, with a narrower LoA compared to traditional unsupervised spirometry methods. The AI models effectively used audio to predict the FEV1, suggesting a viable non-invasive approach for COPD monitoring that could enhance patient comfort and accessibility in home settings.

1. Introduction

Voice, speech, and respiratory sounds contain rich physiological information, driving interest in their analysis using artificial intelligence (AI) as accessible, non-invasive digital biomarkers for various health conditions [1]. This potential is particularly relevant in the monitoring of chronic diseases like chronic obstructive pulmonary disease (COPD), a long-term lung condition characterized by progressively reduced airflow. The global impact of COPD is substantial; the World Health Organization projects it to be the third-leading cause of death and seventh-leading cause of morbidity by 2030 [2].
Individuals with COPD typically perform spirometry tests to estimate their respiratory function, i.e., the forced expiratory volume in one second (FEV1). The FEV1 is a critical indicator of disease severity and exacerbations of COPD [3,4]. Therefore, the FEV1 also serves as a key outcome measure in evaluating the efficacy of interventions [5]. Spirometry testing can provoke discomfort and result in “extreme anxiety” [6]. Despite this, individuals generally view self-monitoring positively, yet the likelihood of undergoing spirometry testing decreases with milder COPD severity [7]. Furthermore, Anand et al. [8] reported that unsupervised spirometry tests produced lower FEV1 values (mean difference: −107 mL, LoA: −509, 296) compared to supervised spirometry tests.
This highlights the challenges with unsupervised spirometry tests in home settings. The pursuit of more amenable alternatives has led to investigations across various technological modalities. Beyond acoustic analysis, researchers have explored accelerometry-based systems, often involving wearable sensors to detect chest wall movements or physical activity patterns that correlate with the respiratory status [9,10]. Non-contact methods, such as those utilizing radar technology, have also been investigated for their potential to remotely sense thoracic excursions and breathing rhythms to derive estimates of lung function parameters [11,12]. These diverse technological approaches aim to provide less burdensome FEV1 estimation, although each carries its own considerations regarding accuracy, user acceptability, and deployment complexity in real-world environments. For instance, with radar-based systems, obtaining accurate physiological data can be difficult due to random body movements, and the presence of other individuals within the sensing range can introduce confounding factors, complicating accurate physiological data acquisition [12].
Using audio data may offer a way to overcome the aforementioned barriers associated with traditional spirometry and other sensor modalities. A few studies have already used audio data and machine learning to identify COPD or estimate respiratory function [13,14,15,16]. One approach analyzes cough sounds to predict COPD exacerbations [15]. However, it requires live sound monitoring, posing GDPR compliance challenges and concerns for those averse to constant surveillance—an issue that the current study seeks to mitigate through on-demand, user-initiated recordings. Nathan et al. [14] encountered similar issues with continuous monitoring. Conversely, Alam et al. [13] demonstrated promising results when using voice recordings from 26 asthma patients reading a standardized text to predict the FEV1%. One of the main barriers with such controlled datasets is ecological validity. Therefore, a key aim of the current study is to build upon these findings by specifically assessing FEV1 estimation from daily, user-initiated scripted readings under unsupervised, real-world domestic conditions, thereby prioritizing ecological validity and patient comfort, in contrast to continuous surveillance or highly controlled lab-based assessments.
In practice, frail and elderly individuals may struggle to adhere perfectly to any unsupervised monitoring protocol. This difficulty is exemplified by unsupervised spirometry tests, where lower FEV1 values have been observed compared to supervised tests [8].
A similar pattern of deviation is expected in daily readings of standardized texts, highlighting the need for an alternative to unsupervised spirometry tests for effective home monitoring. This study aimed to assess the validity of using audio data and machine learning to estimate the FEV1 in elderly patients with COPD through the analysis of daily scripted text readings. This approach was implemented and evaluated under real-world domestic conditions. By testing this approach in participants’ homes without supervision, we sought to demonstrate the potential of voice analysis using AI as an alternative to traditional unsupervised spirometry. The remainder of this paper outlines the methodology employed, presents the key results from our models, and discusses the implications of these findings for COPD home monitoring.

2. Materials and Methods

Twenty-three participants (12 males, 11 females; age = 73.1 ± 5.0 yrs), diagnosed with moderate to severe COPD and recruited from the same geographical area, took part in the study (see Table 3 in Section 3). Each participant was provided with a tablet and selected one of four poems to read out loud (see Appendix A). Participants were instructed to read the poem aloud and to assess and record physiological data every morning over a period of nine months. Measured biomarkers included the heart rate, body temperature, oxygen saturation, and the target variable, FEV1. After receiving initial instructions about self-monitoring, the participants conducted the monitoring under unsupervised conditions. All biomarkers were manually entered by the participants into an application on the tablet and subsequently stored on an SQL server. Additionally, the tablet was equipped with an application to record audio.
A single asymptomatic female adult served as a quality control participant. The control person read each poem aloud on a weekly basis throughout the study period, in a controlled and noise-free environment. This procedure established a “gold standard” for the minimal expected sound quality of the recordings. A general overview of the methods and process can be seen in Figure 1.

2.1. Data Processing

All data processing was performed using Python v3.9.18. The audio data were transcribed using OpenAI’s Whisper (package v20231117) with the multilingual large-v3 model configured for Danish [17]. Before being applied for transcription in this study, the Whisper model underwent a validation process to determine its accuracy on Danish-language audio. This process utilized audio samples from the Common Voice Corpus v16.1, hosted by Mozilla [18].
For this validation, only sentences containing between three and 100 words were selected for analysis. The word error rate (WER) computed for these sentences was 0.14 ± 0.24, in concordance with Radford et al. [17]. To ensure consistent data quality, a multi-step filtering process was applied (Figure 1). This process removed audio files that (1) lacked corresponding spirometry measurements, (2) were missing tablet identifiers due to technical issues preventing participant attribution, (3) fell below the minimum file size threshold (300 kB), or (4) failed the transcription accuracy thresholds. Recordings that were substantially incomplete (i.e., missing large segments of the expected poem), as indicated by failing to meet the file size or overall word error rate (WER) thresholds, were thus excluded from the analysis. For recordings passing these initial quality checks, silences occurring within the spoken poems were not removed or imputed; instead, the characteristics of these speech pauses (e.g., duration, frequency) were quantified using the pyAudioAnalysis library and included as predictive features, as detailed in Section 2.2. This resulted in 2053 audio files (2019 from participants with COPD, 34 from the control) for analysis, with details of the exclusions provided in Figure 2.
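As an illustration of the transcription and WER validation steps, the following is a minimal sketch using the openai-whisper and jiwer packages; the file path and reference sentence are placeholders, not the study’s actual assets.

```python
import whisper  # openai-whisper package
import jiwer

# Load the multilingual large-v3 model and force Danish decoding
model = whisper.load_model("large-v3")

def transcribe(audio_path: str) -> str:
    """Transcribe one recording in Danish and return the raw text."""
    result = model.transcribe(audio_path, language="da")
    return result["text"]

# Validation against a Common Voice-style reference sentence (placeholder text)
reference = "dette er en dansk referencesaetning"
hypothesis = transcribe("clip.mp3")
print("WER:", jiwer.wer(reference, hypothesis))  # study mean: 0.14 +/- 0.24
```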
All MP4 files smaller than 300 kB were removed, as these primarily consisted of instructions from the healthcare professional and were generally too short to contain the entire poem. The first author listened to the audio files to check their quality, ensuring that only non-relevant files were excluded. The primary approach to handling recording artefacts (e.g., significant background noise, non-speech sounds corrupting the poem reading) was implemented through this multi-step filtering protocol, leading to the exclusion of recordings that did not meet the quality standards, rather than attempting direct artefact removal or correction within the audio signal itself. The WER was calculated using the jiwer package v3.0.2. This metric was utilized to assess the alignment between the transcribed and reference texts. By calculating the WER using a rolling window and a set number of words, alignment between the audio and the reference text was secured by discarding all words preceding the best match (i.e., the lowest WER). This same strategy was also applied to eliminate any noise following the poem’s conclusion. Once alignment was ensured, any audio files with a WER exceeding the 80th percentile cutoff value of 0.667 for the starting segment or the 90th percentile cutoff value of 0.851 for the entire transcription were removed from the dataset. These were deemed to be of insufficient quality for further analysis (see Appendix B, which illustrates the distributions of the WERs before and after applying these thresholds). In the present study, short-duration (30–60 s) audio recordings were used, limiting the potential drift effect of the microphone. The use of a “gold-standard” reference recording provided a stable benchmark for relative audio energy comparisons (detailed in Section 2.2), and the consistent application of per-recording quality control served to mitigate issues due to sensor drift.
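The rolling-window alignment might look like the following sketch; the window length, whitespace tokenization, and search strategy are illustrative assumptions rather than the study’s verified implementation. Running the same scan backwards from the transcript’s end would analogously trim trailing noise.

```python
import jiwer

def trim_to_poem_start(transcript: str, poem_opening: str, window: int = 10) -> str:
    """Slide a fixed-size word window over the transcript, score each window
    against the poem's opening words with the WER, and discard everything
    before the best (lowest-WER) match."""
    words = transcript.split()
    best_wer, best_idx = float("inf"), 0
    for i in range(max(1, len(words) - window + 1)):
        candidate = " ".join(words[i:i + window])
        score = jiwer.wer(poem_opening, candidate)
        if score < best_wer:
            best_wer, best_idx = score, i
    return " ".join(words[best_idx:])
```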

2.2. Feature Engineering

A range of acoustic features was computed utilizing a rolling window approach, with windows of 0.1 s and an overlap of 0.05 s, employing the pyAudioAnalysis library v0.3.14. This rolling-window feature extraction inherently provided temporal smoothing of the acoustic characteristics before the aggregate statistics were calculated for each recording. The extraction yielded a broad set of 431 features, encompassing time-domain characteristics (e.g., zero crossing rate, energy), spectral shape descriptors (e.g., spectral centroid, MFCCs, chroma features), and their first derivatives (deltas) to capture temporal dynamics. Additionally, linear predictive coding (LPC) coefficients were derived to model vocal tract properties, and the word error rate (WER) was calculated to assess the transcription fidelity. Silence analysis provided statistics on pause durations [19]. These acoustically derived features, along with manually entered physiological data and a relative audio energy measure, were then subjected to aggregate statistical calculations (Appendix C).
These calculations included the mean, median, standard deviation, maximum, interquartile range, and total range of each feature across windows. This provided the model with a richer description of the data’s characteristics, encompassing both measures sensitive to extreme values (the mean, standard deviation, maximum, and range) and measures robust to them (the median and IQR), allowing the algorithm the flexibility to learn which statistical properties are most informative for prediction. Additionally, a relative energy comparison was performed to calculate the difference in energy between each audio file and the “gold-standard” reference audio.
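A minimal sketch of this extraction and aggregation, assuming pyAudioAnalysis and a recording already converted to WAV (the conversion step and file name are assumptions of this example):

```python
import numpy as np
from pyAudioAnalysis import audioBasicIO, ShortTermFeatures

fs, signal = audioBasicIO.read_audio_file("reading.wav")
# 0.1 s windows with a 0.05 s step, as in Section 2.2 (deltas are included)
feats, names = ShortTermFeatures.feature_extraction(
    signal, fs, int(0.1 * fs), int(0.05 * fs))

def aggregate(values: np.ndarray) -> dict:
    """Per-recording summary statistics of one feature across all windows."""
    q75, q25 = np.percentile(values, [75, 25])
    return {"mean": values.mean(), "median": np.median(values),
            "std": values.std(), "max": values.max(),
            "iqr": q75 - q25, "range": values.max() - values.min()}

recording_features = {n: aggregate(feats[i]) for i, n in enumerate(names)}
```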
Finally, all features were normalized to a range between zero and one using min-max scaling (Equation (1)), by fitting the scaler to the training data and then applying the same scaling to transform the test data.
$$X_{\text{scaled}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}} \tag{1}$$
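In scikit-learn terms, this corresponds to fitting a MinMaxScaler on the training split only; `X_train` and `X_test` below are assumed feature matrices from the preceding steps.

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
X_train_scaled = scaler.fit_transform(X_train)  # learns per-feature min and max
X_test_scaled = scaler.transform(X_test)        # reuses the training min/max, avoiding leakage
```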

2.3. Feature Selection

Features exhibiting internal variation below a threshold of 0.01 were excluded. Subsequently, a “VotingSelector” algorithm was employed, comprising four distinct selection methods: Kendall’s rank correlation coefficient, a variance inflation factor with a threshold of 10.0 [20], recursive feature elimination (RFE) with a linear support vector regressor kernel, and the Boruta method utilizing a random forest kernel [21]. Only features selected by at least two of these four methods were retained for inclusion in the final dataset, reducing the initial 431 features to a final set of 100 for model training.
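The “VotingSelector” logic might be wired together as in the following sketch; the vote counting, the significance threshold for Kendall’s tau, and the RFE target size are illustrative assumptions, not the study’s exact implementation.

```python
import numpy as np
import pandas as pd
from scipy.stats import kendalltau
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.svm import SVR
from statsmodels.stats.outliers_influence import variance_inflation_factor
from boruta import BorutaPy

def voting_select(X: pd.DataFrame, y: np.ndarray, min_votes: int = 2) -> list:
    """Keep features nominated by at least `min_votes` of four selectors."""
    votes = pd.Series(0, index=X.columns)

    # 1) Kendall's rank correlation: keep features rank-correlated with y
    for col in X.columns:
        tau, p = kendalltau(X[col], y)
        if p < 0.05:
            votes[col] += 1

    # 2) Variance inflation factor: keep features with VIF below 10.0
    for i, col in enumerate(X.columns):
        if variance_inflation_factor(X.values, i) < 10.0:
            votes[col] += 1

    # 3) Recursive feature elimination with a linear SVR kernel
    rfe = RFE(SVR(kernel="linear"), n_features_to_select=len(X.columns) // 2)
    rfe.fit(X, y)
    votes[X.columns[rfe.support_]] += 1

    # 4) Boruta with a random forest kernel
    boruta = BorutaPy(RandomForestRegressor(n_jobs=-1), n_estimators="auto")
    boruta.fit(X.values, y)
    votes[X.columns[boruta.support_]] += 1

    return list(votes[votes >= min_votes].index)
```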

2.4. Modeling

The dataset was randomly split into training and test sets using an 80/20 split. Hyperparameters were tuned using 10-fold cross-validation for a range of models: ridge, lasso, ElasticNet, random forest, XGBoost, K-nearest neighbors (KNN), and a multi-layer perceptron (see Table 1) [22].
Table 1 provides an overview of the candidate models and the rationale for their selection; the specific hyperparameter search spaces are detailed in Table 2. All training was performed with the scikit-learn library v1.3.0 and xgboost v1.7.3. The hyperparameter tuning used 10-fold cross-validation within a GridSearchCV framework. To refine the learning process and minimize the spread of predictions, this phase employed a custom loss function (Equation (2)), combining a quartic error term, which heavily penalizes larger prediction errors, with a variability term that penalizes error variance.
$$\text{loss} = \frac{1}{n}\sum_{i=1}^{n}\left(y_{\text{pred},i} - y_{\text{true},i}\right)^{4} + \operatorname{Var}\left(y_{\text{pred}} - y_{\text{true}}\right) \tag{2}$$
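A sketch of this tuning setup for one of the models (KNN shown); `X` and `y` are assumed to be the selected features and FEV1 targets from the preceding sections, and the grid values are illustrative of Table 2 rather than a verbatim copy.

```python
import numpy as np
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor

def quartic_variance_loss(y_true, y_pred):
    """Equation (2): mean quartic error plus the variance of the residuals."""
    resid = np.asarray(y_pred) - np.asarray(y_true)
    return np.mean(resid ** 4) + np.var(resid)

# Lower loss is better, so the scorer negates it for scikit-learn
scorer = make_scorer(quartic_variance_loss, greater_is_better=False)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)  # 80/20 split

grid = GridSearchCV(KNeighborsRegressor(),
                    param_grid={"n_neighbors": range(3, 15),
                                "weights": ["uniform", "distance"]},
                    scoring=scorer, cv=10)  # 10-fold cross-validation
grid.fit(X_train, y_train)
```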
However, following the hyperparameter tuning, the root mean squared error (RMSE) (Equation (3)) on the validation set was employed as the primary metric to assess and select the best-performing model, thereby simplifying interpretation and facilitating comparison.
$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_{\text{pred},i} - y_{\text{true},i}\right)^{2}} \tag{3}$$
The RMSE served as the primary evaluation metric, but the mean absolute percentage error (MAPE) (Equation (4)), mean absolute error (MAE) (Equation (5)), and mean squared error (MSE) (Equation (6)) are also reported for comprehensive model assessment. The MAPE (Equation (4)) expresses the average absolute error as a percentage of the actual values, offering a relative error measure. The MAE (Equation (5)) quantifies the average absolute difference between the predicted and actual values. The MSE (Equation (6)) measures the average of the squared errors, thereby assigning greater penalties to larger deviations.
$$\text{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \tag{4}$$
$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \tag{5}$$
$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2} \tag{6}$$
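For reference, all four reported metrics can be computed directly with scikit-learn; `y_true` and `y_pred` are assumed validation-set arrays.

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error,
                             mean_absolute_percentage_error,
                             mean_squared_error)

mse = mean_squared_error(y_true, y_pred)                      # Equation (6)
rmse = np.sqrt(mse)                                           # Equation (3)
mae = mean_absolute_error(y_true, y_pred)                     # Equation (5)
mape = 100 * mean_absolute_percentage_error(y_true, y_pred)   # Equation (4), in %
```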

3. Results

The initial analysis of the audio recordings revealed substantial variations in recording quality across participants. These variations necessitated the implementation of our filtering protocol (see Figure 2), which ensured uniform data quality for subsequent analysis. After applying all filtering steps, the participants generated an average of 87.8 ± 56.2 audio recordings during the nine-month study period (see Table 3). The K-nearest neighbors (KNN) model performed best among all tested models, with the number of neighbors set to seven and distance weighting. As shown in Table 4, it achieved an RMSE of 138.92 mL/s (relative RMSE of 8.73%) on the training set, with a standard deviation of 20.43 mL/s. On the validation set, the model yielded an RMSE of 173.73 mL/s (relative RMSE of 9.90%). The KNN model also demonstrated the lowest mean absolute error (MAE) of 84.98 mL/s and 93.82 mL/s on the training and validation sets, respectively. Furthermore, it exhibited the lowest mean squared error (MSE) among all models, with 19,716.86 (mL/s)² for the training set and 30,182.83 (mL/s)² for the validation set, indicating its superior overall accuracy and consistency in predictions.
Additionally, the mean difference between the predicted and actual FEV1 values was 7.03 mL/s, with the limit of agreement (LoA) ranging from −333.21 to 347.26 mL/s, as depicted in Figure 3, and an absolute range of −1252 to 1435 mL/s. Moreover, the model demonstrated a homogeneous spread across all FEV1 values, indicating no model bias towards either lower or higher FEV1 values. Other models, such as random forest and XGBoost, also showed good performance but were outperformed by KNN. The linear models (ridge, lasso, and ElasticNet) generally showed higher error rates, suggesting that the relationship between the audio features and lung function is non-linear.
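For context, the agreement statistics in Figure 3 follow the standard Bland–Altman construction (mean difference ± 1.96 standard deviations of the differences), sketched below under the assumption that `y_pred` and `y_true` are paired validation arrays.

```python
import numpy as np

diff = np.asarray(y_pred) - np.asarray(y_true)
mean_diff = diff.mean()                     # reported as 7.03 mL/s
loa_low = mean_diff - 1.96 * diff.std()     # reported as -333.21 mL/s
loa_high = mean_diff + 1.96 * diff.std()    # reported as 347.26 mL/s
```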
The feature selection process selected 100 features from the initial set of 431. Detailed information about the combinations of aggregated features is provided in Appendix D. Furthermore, the XGBoost model demonstrated competitive performance compared to KNN. An analysis of the learning curve (see Figure 4) indicates that the inclusion of more data could enhance its performance, potentially outperforming the KNN model.
Specifically, Figure 4 plots the model’s performance score as the negative RMSE (y-axis; scores closer to zero indicate better FEV1 prediction performance) against the increasing amount of patient data used for training (x-axis). It shows two curves: one representing the score achieved on the data that the model was trained on (training score) and another representing the score on data that it had not seen before (validation score), which indicates its real-world performance. For the XGBoost model, the validation score curve is still trending upwards (toward zero) as more data are added, and it has not yet flattened or converged closely with the training score curve. This pattern suggests that the model’s ability to generalize and yield accurate predictions on new, unseen patient data improves with more training examples, indicating that providing it with data from more participants or more recordings per participant could lead to further performance gains.
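This analysis corresponds to scikit-learn’s learning_curve utility with the negative RMSE as the score; the sketch below assumes the feature matrix `X` and targets `y` from Section 2.

```python
import numpy as np
from sklearn.model_selection import learning_curve
from xgboost import XGBRegressor

sizes, train_scores, val_scores = learning_curve(
    XGBRegressor(), X, y, cv=10,
    scoring="neg_root_mean_squared_error",
    train_sizes=np.linspace(0.1, 1.0, 10))

# A validation curve still rising (toward zero) at the largest training size
# suggests that additional data could further improve generalization
print(val_scores.mean(axis=1))
```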
The overall error distribution across all trained models is depicted in Figure 5, indicating a number of prediction errors surpassing the previously mentioned limits of agreement.

4. Discussion

This study aimed to explore the validity of using audio data and machine learning for the estimation of respiratory function, proposing this as an alternative to spirometry tests at home. While the KNN model showed promising performance (low RMSE/MAE) in estimating the FEV1, there was notable variation in the individual prediction accuracy, with some errors being quite large (range −1252 to 1435 mL/s). Our findings contribute to the ongoing development of home monitoring approaches for chronic respiratory diseases like COPD. The ability to estimate respiratory function using accessible technology directly in the patient’s environment aligns with the interest in solutions that may reduce the burden on healthcare systems and support patients in their self-management [23,24]. This approach, utilizing the microphones integrated into consumer-grade tablets or smartphones as low-cost sensors, increases the potential for scalability compared to methods requiring specialized medical devices. It presents possibilities for more frequent, longitudinal tracking of respiratory health, potentially complementing episodic clinic visits [25].

4.1. Model Performance

The results illustrated an LoA between the predicted and actual FEV1 values that was narrower, by approximately 125 mL/s, than the LoA between supervised and unsupervised spirometry [8], representing a key advantage of an audio-based approach for potentially more consistent unsupervised home monitoring. This suggests that the predictive model effectively captures key aspects of pulmonary function, resulting in a more consistent estimation of lung function compared with unsupervised spirometry tests. Nevertheless, for greater precision, supervised spirometry is still required. Additionally, Pramono et al. [26] highlighted similar advantages of automated audio analysis in their review of adventitious respiratory sound classification, where machine learning models demonstrated high accuracy in detecting abnormal respiratory sounds, such as wheezes and crackles. The current study expanded upon this foundation by utilizing audio recordings to predict the quantitative FEV1, offering a broader application of automated respiratory sound analysis in clinical practice.
The current study found an RMSE of 174 mL/s, lower than the 380 mL/s reported in Xu et al. [16]. This discrepancy can be attributed to the different auditory inputs utilized. Xu et al. [16] used forced cough sounds, whereas this study utilized speech patterns as participants read a scripted text. The utilization of speech may offer a more stable and controlled acoustic environment. Furthermore, a key methodological advantage in the current study is its high ecological validity, stemming from daily, unsupervised home recordings using non-obstructive sensors like tablet microphones. This approach directly evaluates the real-world performance [13] and addresses the challenges of unsupervised monitoring in domestic settings [8].
Additionally, Alam et al. [13] found that integrating speech and breathing sounds yields better predictions than using either alone, emphasizing the value of the complete auditory range in speech, including non-verbal sounds. A direct comparison of the results with those of Alam et al. is not feasible, as their evaluation focused on the FEV1%, in contrast to the FEV1 used in the current study. The K-nearest neighbors (KNN) model, configured with seven neighbors and distance weighting, demonstrated the strongest performance (Table 4). The comparatively weaker performance of linear models (Table 4) suggests a non-linear relationship between the acoustic features and FEV1. KNN, as a non-parametric method, excels in capturing such local non-linearities without imposing strong assumptions on the data distribution, which likely contributed to its success in this context. While ensemble methods like XGBoost performed less well, KNN’s selection as the optimal model was driven by its performance on the validation set across key metrics like the RMSE, MAE, and MSE (see Table 4), indicating an advantage for this specific FEV1 estimation task. This choice is further supported by KNN’s characteristic efficiency in training compared to more complex models like XGBoost or the MLP (as generally outlined in Table 1) and its inherent local interpretability, where the predictions can be understood by examining neighboring instances. Although ensemble methods can offer global feature importance, KNN’s leading predictive accuracy, coupled with these practical considerations, rendered it the most suitable model for this study. The selection and tuning of the KNN model were guided by established principles of model configuration (e.g., p. 159 in [27]), and the specific settings chosen for the KNN model were informed by examples of detailed setups, such as that presented in Table 2 in [28].
Similarly, Haider et al. [29] demonstrated that combining lung sound features with spirometry data significantly enhanced the diagnostic accuracy, achieving up to 100% classification accuracy for COPD versus non-COPD cases. While their work focused on classification rather than regression, their findings highlight the potential of integrating complementary audio features to improve model performance.

4.2. Feature Importance

Another factor potentially contributing to the differences in the observed performance compared to the work of Xu et al. [16] relates to the feature engineering and selection protocols employed. Xu et al. [16] utilized 52 audio features reduced via principal component analysis (PCA). While PCA is a common dimensionality reduction technique, its effectiveness relies on assumptions such as linearity and data scaling. Standard practice often involves standardizing the features prior to PCA to address the sensitivity to different measurement units [30]; Xu et al. applied Z-score normalization after PCA to the resulting components. Furthermore, PCA assumes linear relationships, whereas our findings when using various models suggest that the connection between the audio features used in our study and the FEV1 may be inherently non-linear. These differences in the dimensionality reduction approach and the apparent non-linearity of the problem space might favor the feature selection methods and non-linear models used in the present study. Haider et al. [29] similarly emphasized the importance of identifying key features for accurate modeling, finding that the median frequency and linear predictive coefficients (LPCs) were the most predictive lung sound parameters for COPD classification. This complements our approach, which highlights the value of features such as MFCCs and LPCs in contributing to the overall predictive capabilities of the model. These findings reinforce the need for robust feature selection techniques to capture a combination of high-impact variables, which may also benefit future hybrid models incorporating both speech and lung sounds.
A permutation importance analysis revealed that no specific features exhibited extreme importance to the model. This suggests that the model’s performance relied on the combined effect of multiple features. The feature selection protocol encompassed a diverse range of features; however, it was only the MFCCs and chroma vectors for which all aggregations were selected (see Appendix D). None of the time-domain features from the non-verbal segments of the audio were selected. While the time and frequency domains offer different characterizations of the signal, this outcome suggests that, for the prediction of the FEV1 in this dataset, the information provided by these specific time-domain features did not add significant predictive power beyond that already present in the selected frequency-domain features, possibly indicating redundancy in the context of the chosen features.
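A sketch of this analysis using scikit-learn’s permutation_importance; the fitted model `knn_model`, the held-out split `X_val`/`y_val`, and `feature_names` are assumed from the earlier sections.

```python
from sklearn.inspection import permutation_importance

result = permutation_importance(
    knn_model, X_val, y_val, scoring="neg_root_mean_squared_error",
    n_repeats=10, random_state=42)

# Uniformly small mean importances (no single dominant feature) would be
# consistent with the combined-effect interpretation described above
for name, imp in sorted(zip(feature_names, result.importances_mean),
                        key=lambda t: -t[1]):
    print(f"{name}: {imp:.4f}")
```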

4.3. Practical Considerations for At-Home Deployment

The clinical utility of an AI-driven voice analysis approach for COPD management relies on its practical deployment in home settings. This study’s design inherently addresses several feasibility aspects, particularly when contrasted with other technological modalities for respiratory monitoring. Firstly, the hardware requirements are minimal; the system effectively utilized standard tablets, and future iterations could readily employ ubiquitous smartphones. This significantly lowers the barriers to access and the cost compared to approaches requiring dedicated wearable sensors for accelerometry-based systems [9,10] or specialized non-contact radar equipment [11,12]. Secondly, the burden of calibration and daily use on patients is low. Initial guidance on reading a scripted text into the device is straightforward, with complex audio processing and AI-driven FEV1 estimation handled by the backend. This contrasts with the potential need for precise sensor placement in wearable systems or the management of environmental confounders pertinent to radar-based methods [12]. Finally, for integration into clinical workflows, secure and confidential data handling is paramount. Any system deployed would require the robust, encrypted transmission of audio data and FEV1 estimates to clinicians, adhering strictly to data protection regulations such as GDPR. These elements suggest a viable pathway for the development of an accessible, less burdensome, and scalable tool for remote respiratory monitoring compared to some existing alternatives.

4.4. Limitations

In this study, 23 elderly participants, all diagnosed with COPD, were recruited from the same geographical area. This resulted in a homogeneous group. Hence, this study did not consider the effects of different dialects or accents on the FEV1 estimations. However, this homogeneity also conferred certain advantages. Given the small population group, the homogeneity allowed for dataset splitting without the need to correct for individual differences. Consequently, the leave-one-subject-out approach was unnecessary. Under normal circumstances, the leave-one-subject-out approach is preferable, as it allows for predictions on unseen individuals [31].
Pramono et al. [26] noted that variations in recording environments and sensor placement significantly affect the quality and performance of respiratory sound analysis. Similar challenges were observed in this study, as the unsupervised home recordings occasionally exhibited reduced quality or non-adherence to the protocol, necessitating stringent quality control and the exclusion of some recordings (862 out of 2946 recordings). The extreme data values encountered in real-world settings are likely attributable to recording artefacts or transient issues with sound quality. Some of this variance may also be caused by true physiological outliers, as seen for Participant 10, who had a median FEV1 of 610 mL/s but managed to record a value of 1880 mL/s at one point. Distinguishing between artefactual extremes and genuine, clinically significant outliers is a non-trivial challenge. While our current filtering addresses quality issues, developing robust strategies to appropriately handle these diverse extreme values is an important area for future research. This would require deeper investigation to determine whether rule-based deterministic solutions, more sophisticated machine learning-based outlier detection tools, or hybrid approaches are most effective in preserving valuable clinical information while minimizing the impact of noise in sound data. This underscores the importance of refining the data collection protocols to ensure consistency in future implementations. The reliance on low-cost sensors, while advantageous for accessibility, can introduce variation, as observed in the data quality challenges encountered. Nevertheless, our study suggests that, with appropriate data preprocessing and machine learning models like KNN, physiological parameters such as the FEV1 can still be estimated. This highlights an important consideration: leveraging the convenience of home monitoring with low-cost sensors requires careful data handling and algorithms capable of managing noise and real-world conditions. Future work could explore adaptive filtering or environment-aware calibration to potentially improve the robustness [32].
This study did not consider potential comorbidities that could affect speech and breathing. However, attempting to incorporate all comorbidities into a predictive model would lead to overfitting and practical challenges in maintenance. Requiring daily self-monitoring through scripted recitations may intensify COPD patients’ awareness of their illness and impact their well-being. This daily burden and potential negative impact could, in turn, lead to inconsistencies in compliance (e.g., skipped days) and consequently affect data quality (e.g., rushed or incomplete recordings). However, these issues regarding the patient burden and variations are also inherent to conventional spirometry tests. A notable drawback of the current model is its reliance on active patient engagement. Users must consciously perform daily scripted readings. While this still requires user interaction, it is less physically demanding and potentially more convenient than performing a full spirometry test. The active participation requirements underscore an area for future research towards more passive monitoring systems. Furthermore, a limitation in developing the current model is the reliance on FEV1 data derived from unsupervised home spirometry for training. As Anand et al. [8] reported, unsupervised tests can yield different, often lower, FEV1 values compared to supervised tests. This inherent variability in the training target data may limit the precision achievable by the current model. To potentially enhance the model accuracy and establish a more robust ground truth, future iterations could benefit from training data where the FEV1 is assessed under supervised spirometry.
Finally, the potential of this home monitoring approach could extend beyond FEV1 tracking. Longitudinal voice data, possibly combined with inputs from other low-cost sensors (e.g., wearable activity trackers, environmental sensors, or simple pulse oximeters), might inform predictive models aiming for earlier awareness of COPD exacerbations [33]. Identifying these events earlier is a goal in COPD management, and voice biomarkers could represent a non-intrusive data stream for such multi-modal monitoring systems.

5. Conclusions

Developing a predictive model using voice recordings from participants with moderate to severe COPD at home enabled us to successfully estimate the FEV1 with encouraging performance. The KNN model achieved an RMSE of 174 mL/s and exhibited an LoA narrower than those observed between supervised and unsupervised spirometry. These findings indicate that ML models like KNN can effectively learn from audio data for FEV1 prediction, suggesting a viable pathway for non-invasive, audio-based assessments that could improve patient comfort and the data collection frequency in domestic environments. Such AI approaches have the potential to enhance COPD management. Future research could focus on validating these findings in broader populations and exploring the integration of multi-modal data for even more robust predictions, as well as investigating methods for autonomous patient monitoring that minimize active device engagement. This represents a step towards the more accessible and proactive monitoring of respiratory health.

Author Contributions

N.L.-N. was responsible for data analysis, coding, and manuscript writing. N.L.-N. also played a key role in conceptualizing the study, ensuring the robustness and accuracy of the analytical methods used. L.M. provided crucial supervision and ideas that were essential to the data analysis. L.M. also contributed to and critically reviewed the analytical methods and results. P.M. provided advice and comments essential to the data analysis and result interpretation. S.N.B. was the main supervisor and contributed to the development of the methodology and ensured the study’s adherence to relevant guidelines. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

This study was conducted in accordance with the Declaration of Helsinki and approved by the Regional Research Register, Region Zealand (Forskningsfortegnelsen, Region Sjælland, REG-100-2021, 1 November 2021).

Informed Consent Statement

Informed consent was obtained from all human participants in this study. Data obtained from participants were handled according to the General Data Protection Regulation (GDPR). Our research followed the Declaration of Helsinki’s ethical principles for medical research involving human participants.

Data Availability Statement

The data that support the findings of this study are not available.

Acknowledgments

The authors would like to thank the healthcare professionals and participants from Odsherred Municipality for their essential cooperation and support throughout the study. We would also like to offer our thanks to Lise-Lotte Teilman and Maya Milter, who were the main facilitators behind the PreCare Clinic, in which the study took place.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
Activation (MLP): The function determining the output of a neuron given an input (e.g., ‘ReLU’, ‘tanh’).
Algorithm (KNN): The method used to find the nearest neighbors (e.g., ‘ball_tree’, ‘kd_tree’, ‘brute’).
Alpha (regularization parameter): Controls the penalty strength applied to model complexity to prevent overfitting (used for L1, L2, or both, depending on the model).
Colsample_bytree (XGBoost): The fraction of features (columns) considered when building each tree.
COPD: Chronic obstructive pulmonary disease
FEV1: Forced expiratory volume in one second
Gamma (XGBoost): Minimum loss reduction required to perform a further partition (split) on a leaf node of the tree; acts as a regularization parameter.
Hidden_layer_sizes (MLP): Defines the architecture of the MLP, specifying the number of neurons in each hidden layer.
IQR: Interquartile range
KNN: K-nearest neighbors
L1_ratio (ElasticNet): Specifies the mixing proportion between L1 (lasso) and L2 (ridge) penalties in ElasticNet regularization (0 = L2, 1 = L1).
Learning_rate (XGBoost, MLP): Controls the step size at each iteration while moving toward a minimum of the loss function; influences the convergence speed and stability.
LoA: Limit of agreement
MAE: Mean absolute error
MAPE: Mean absolute percentage error
Max_depth: The maximum depth allowed for individual decision trees in an ensemble, controlling model complexity.
Max_iter (MLP): The maximum number of training iterations (epochs) allowed for the MLP solver to converge.
MFCC: Mel-frequency cepstral coefficient
Min_child_weight (XGBoost): The minimum sum of instance weights needed in a child node; acts as a regularization parameter.
Min_samples_leaf: The minimum number of data samples required to be present in a leaf node of a decision tree.
Min_samples_split: The minimum number of data samples required within a node to allow it to be split further in a decision tree.
MLP: Multi-layer perceptron
MSE: Mean squared error
N_estimators (random forest, XGBoost): The number of decision trees included in the ensemble model.
N_neighbors (KNN): The number of nearest neighbors considered to produce a prediction.
PCA: Principal component analysis
RF: Random forest
RFE: Recursive feature elimination
RMSE: Root mean squared error
Solver (MLP): The algorithm used to optimize the weights of the MLP during training (e.g., ‘Adam’).
Subsample (XGBoost): The fraction of the training data samples used to fit each individual tree.
Weights (KNN): Specifies how the influence of neighbors is weighted in the predictions (e.g., ‘uniform’: all equal; ‘distance’: closer neighbors have a greater influence).
WER: Word error rate
XGB: XGBoost

Appendix A

Appendix A.1. Sommerdrøm—Under et Blomstrende æbletræ Kumbel: Gruk—19

Under en æblekrone, som bierne summer i, går jeg i middagshi. Ned fra den lyse zone lyder en summetone, sommerens melodi. Mellem de hvide grene løber en luftig sti. Her er de to alene, blomstrende træ og bi. Dette er livets scene i sommerligt sceneri. Biens arbejdsomme færden blir til en alegori i drømmen den overgår i. Let som en ordleg er den: dét er den hvide verden, og dét er den kække bi!

Appendix A.2. Den Første Gang Jeg Så Dig af Kim Larsen and Kjukken

Den første gang, jeg så dig, det var en sommerdag En sommerdag da solen lyste klart Og alle engens blomster de stod i stakkevis Og vuggede som unge brudepar.
Og vinden kom så stille ned til stranden, hvor vi stod Og bølgerne de klukkede og kyssede din fod.
Den første gang jeg så dig, det var en sommer dag Den første gang du holdt mig i hånden.
Den første gang jeg så dig, der skinned’ sommer sky Så blendende som syv vilde svaner Jeg mødte dig i vrimlen et sted midt ind’ i byen Ofelia og den sidste mohikan.

Appendix A.3. Jeg Plukker Fløjlsgræs af Sigfred Pedersen and Knud Vad-Thomsen

En vår er kommet så mild og stille Med drømmedufte om mark og eng På sølvstreng klimprer den friske kilde Og fuld af sødme af fuglens sang.
Jeg plukker fløjlsgræs og ridderspore Jeg plukker mandstro og jomfrusko Men blodet banker i mine årer Mit lille hjerte har ingen ro Ja, blodet banker i mine årer Mit lille hjerte har ingen ro.
Jeg synger ikke, jeg danser ikke Der er så meget, jeg ikke når Er jeg bedrøvet? Jeg ved det ikke Jeg drømmer bare, og tiden går.

Appendix A.4. Jeg Vil La’ Lyset Brænde af Ray Dee Ohh

Stille sommeraftenvind Månen trækker i din sjæl Kalder ud, der hvor regnbuen ender La’ mig hviske til farvel.
Jeg vil la’ lyset brænde Og la’ min dør stå på klem Indtil du har set, hvad du ville Indtil du er hos mig igen Jeg vil la’ lyset brænde Og la’ min dør stå på klem Se, mit hjerte gløder i natten For at du kan finde vejen hjem.
Stille sommermorgenregn Gennem byen med mig selv Vinker højt mod din flyver på himlen Hør mig hviske til farvel.

Appendix B

Figure A1. Distribution of Word Error Rates (WER) for starting segments versus entire transcriptions before and after quality filtering. Panels (A) and (B) show WER distributions for starting segments and entire transcriptions before filtering, respectively. Panels (C) and (D) show the corresponding distributions after applying quality thresholds (80th percentile cutoff of 0.667 for starting segments and 90th percentile cutoff of 0.851 for entire transcriptions).

Appendix C

Table A1. Definitions of the features extracted from audio and physiological measurements. These features, described in Section 2.2, were used for the training and evaluation of the machine learning models.
Energy: The total magnitude of the signal, representing the amount of sound energy.
Entropy of Energy: The variation in energy throughout the signal, indicating dynamic changes.
Spectral Centroid: The balance point of the spectrum, indicating the brightness of a sound.
Spectral Spread: A measure of the spread of the spectrum, indicating the timbral texture.
Spectral Entropy: The randomness in the spectral amplitude distribution, a measure of signal complexity.
Spectral Flux: The rate of change in the spectral power, indicating the texture or timbral change.
Spectral Rolloff: The frequency below which a specified percentage of the total spectral energy lies.
Mel Frequency Cepstral Coefficient: Features that capture key aspects of the spectral envelope (shape of the power spectrum) of a sound, using the Mel frequency scale, which approximates human auditory perception.
Chroma Vector: A representation of the energy content within each pitch class, related to the harmonic and melodic content.
Chroma Deviation: The variation from a standard chroma vector, indicating deviations in harmonic content.
Zero Crossing Rate: The rate at which the signal changes sign, related to the frequency content of the signal.
Linear Predictive Coding: Coefficients derived from a model that predicts future signal samples based on past samples, often used to represent the vocal tract filter.
Word Error Rate: The rate of errors in speech transcription, indicating the accuracy of voice recognition.
Speech Pauses: The presence or duration of pauses in speech, potentially indicating respiratory issues or speech flow.
Heart Rate: The number of heart beats per minute, indicating cardiovascular health.
Body Temperature: The measured temperature of the body, an indicator of metabolic and overall health.
Oxygen Saturation: The percentage of oxygen carried by red blood cells to the body, indicating the respiratory efficiency.

Appendix D

Table A2. All selected features after the use of the voting selector. An X indicates the statistical measures selected by the feature selection algorithm for model inclusion.
Feature | Mean | Median | Standard Deviation | Max | IQR | Range | Non-Aggregated
Zero Crossing RateX X
Energy
Entropy of Energy X X
Spectral Centroid X X
Spectral Spread X
Spectral Entropy X
Spectral Rolloff XX
MFCCXXXXXX
Chroma VectorXXXXXX
Chroma Deviation
Linear Predictive CodingX
Pulse X
Temperature X
Oxygen Saturation X
Word Error Rate X
Speech Pauses
Relative Energy Difference X
Delta Zero Crossing Rate XX
Delta Energy
Delta Entropy of Energy X
Delta Spectral Centroid X X
Delta Spectral Spread
Delta Spectral Entropy XX
Delta Spectral Rolloff X X
Delta MFCC X XX
Delta Chroma Vector XX X
Delta Chroma DeviationX X X

References

  1. Bensoussan, Y.; Elemento, O.; Rameau, A. Voice as an AI Biomarker of Health—Introducing Audiomics. JAMA Otolaryngol. Head Neck Surg. 2024, 150, 283–284. [Google Scholar] [CrossRef] [PubMed]
  2. World Health Organization. Global Surveillance, Prevention and Control of Chronic Respiratory Diseases: A Comprehensive Approach; World Health Organization: Geneva, Switzerland, 2007.
  3. Kerkhof, M.; Voorham, J.; Dorinsky, P.; Cabrera, C.; Darken, P.; Kocks, J.W.H.; Sadatsafavi, M.; Sin, D.D.; Carter, V.; Tran, T.N.; et al. Association between COPD exacerbations and lung function decline during maintenance therapy. Thorax 2020, 75, 744–753. [Google Scholar] [CrossRef] [PubMed]
  4. Kakavas, S.; Kotsiou, O.S.; Perlikos, F.; Gourgoulianis, K.I.; Steiropoulos, P. Pulmonary function testing in COPD: Looking beyond the curtain of FEV1. NPJ Prim. Care Respir. Med. 2021, 31, 23. [Google Scholar] [CrossRef]
  5. Liu, X.L.; Tan, J.Y.; Wang, T.; Zhang, Q.; Zhang, M.; Yao, L.Q.; Chen, J.X. Effectiveness of home-based pulmonary rehabilitation for patients with chronic obstructive pulmonary disease: A meta-analysis of randomized controlled trials. Rehabil. Nurs. 2014, 39, 36–59. [Google Scholar] [CrossRef]
  6. Johnson, B.; Theobald, J.; Darcy, K.; Mottershaw, M.; Brassington, K.; Thickett, D.R. Improving spirometry testing by understanding patient preferences. ERJ Open Res. 2021, 7, 00766–02020. [Google Scholar] [CrossRef]
  7. Parsons, K.; Thomas, P.; Bevan-Smith, E.; Doran, O.; Sama, S. Patient perceived facilitators to greater self-management using home spirometry. Eur. Respir. J. 2022, 60, 2325. [Google Scholar] [CrossRef]
  8. Anand, R.; Topriceanu, C.C.; Keir, G.; Williamson, J.P.; Gao, J. Unsupervised home spirometry versus supervised clinic spirometry for respiratory disease: A systematic methodology review and meta-analysis. Eur. Respir. Rev. 2023, 32, 220135. [Google Scholar] [CrossRef] [PubMed]
  9. Sang, B.; Wen, H.; Junek, G.; Neveu, W.; Di Francesco, L.; Ayazi, F. An accelerometer-based wearable patch for robust respiratory rate and wheeze detection using deep learning. Biosensors 2024, 14, 118. [Google Scholar] [CrossRef]
  10. Minakata, Y.; Azuma, Y.; Sasaki, S.; Murakami, Y. Objective measurement of physical activity and sedentary behavior in patients with chronic obstructive pulmonary disease: Points to keep in mind during evaluations. J. Clin. Med. 2023, 12, 3254. [Google Scholar] [CrossRef]
  11. Wang, W.; Wan, Y.; Li, C.; Chen, Z.; Zhang, W.; Zhao, L.; Zhao, J.; Mu Li, G. Millimetre-wave radar-based spirometry for the preliminary diagnosis of chronic obstructive pulmonary disease. IET Radar Sonar Navig. 2023, 17, 1874–1885. [Google Scholar] [CrossRef]
  12. Islam, S.M.M. Radar-based remote physiological sensing: Progress, challenges, and opportunities. Front. Physiol. 2022, 13, 955208. [Google Scholar] [CrossRef] [PubMed]
  13. Alam, M.Z.; Patel, A.; Bui, F.M.; Fazel-Rezai, R.; Sazonov, E.; Bobhate, P.; Jaiswal, N.; Batsis, J.A.; Ramachandran, S.K.; McSharry, P.; et al. Predicting pulmonary function from the analysis of voice: A machine learning approach. Front. Digit. Health 2022, 4, 750226. [Google Scholar] [CrossRef] [PubMed]
  14. Nathan, V.; Paul, S.; Prioleau, T.; Niu, L.; Mortazavi, B.J.; Camargo, C.A.; Guttag, J.; Dy, J.; Jaimovich, D.; Colantonio, L.D.; et al. Assessment of chronic pulmonary disease patients using biomarkers from natural speech recorded by mobile devices. In Proceedings of the IEEE 16th International Conference on Wearable and Implantable Body Sensor Networks (BSN), Chicago, IL, USA, 19–22 May 2019. [Google Scholar] [CrossRef]
  15. Claxton, S.; Williams, G.; Roggen, D.; Rotheram, S.; Lam, C.; Howard, S.; Khawaja, S.; Price, D.B.; Crooks, M.G. Identifying acute exacerbations of chronic obstructive pulmonary disease using patient-reported symptoms and cough feature analysis. NPJ Digit. Med. 2021, 4, 107. [Google Scholar] [CrossRef] [PubMed]
  16. Xu, W.; Zhou, Y.; Zhao, M.; Wang, L.; Zhang, X.; Chen, Q.; Xie, Q.; Gao, B.; Li, B.; Shi, Y. A forced cough sound based pulmonary function assessment method by using machine learning. Front. Public Health 2022, 10, 1015876. [Google Scholar] [CrossRef] [PubMed]
  17. Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust speech recognition via large-scale weak supervision. In Proceedings of the International Conference on Machine Learning (ICML), Honolulu, HI, USA, 23–29 July 2023; PMLR 202. pp. 28491–28518. [Google Scholar]
  18. Mozilla. Common Voice Corpus v16.1. Available online: https://commonvoice.mozilla.org/da/datasets (accessed on 4 April 2024).
  19. Wiechern, B.; Liberty, K.A.; Pattemore, P.; Lin, E. Effects of asthma on breathing during reading aloud. Speech Lang. Hear. 2018, 21, 30–40. [Google Scholar] [CrossRef]
  20. Craney, T.A.; Surles, J.G. Model-Dependent Variance Inflation Factor Cutoff Values. Qual. Eng. 2002, 14, 391–403. [Google Scholar] [CrossRef]
  21. Kursa, M.B.; Rudnicki, W.R. Feature Selection with the Boruta Package. J. Stat. Softw. 2010, 36, 1–13. [Google Scholar] [CrossRef]
  22. Santamato, V.; Tricase, C.; Faccilongo, N.; Iacoviello, M.; Pange, J.; Marengo, A. Machine learning for evaluating hospital mobility: An Italian case study. Appl. Sci. 2024, 14, 6016. [Google Scholar] [CrossRef]
  23. Polisena, J.; Tran, K.; Cimon, K.; Hutton, B.; McGill, S.; Palmer, K.; Scott, R.E. Home telehealth for chronic obstructive pulmonary disease: A systematic review and meta-analysis. J. Telemed. Telecare 2010, 16, 120–127. [Google Scholar] [CrossRef]
  24. Chen, C.; Ding, S.; Wang, J. Digital health for aging populations. Nat. Med. 2023, 29, 1623–1630. [Google Scholar] [CrossRef]
  25. Majumder, S.; Mondal, T.; Deen, M.J. Wearable sensors for remote health monitoring. Sensors 2017, 17, 130. [Google Scholar] [CrossRef] [PubMed]
  26. Pramono, R.X.A.; Bowyer, S.; Rodriguez-Villegas, E. Automatic adventitious respiratory sound analysis: A systematic review. PLoS ONE 2017, 12, e0177926. [Google Scholar] [CrossRef] [PubMed]
  27. Kuhn, M.; Johnson, K. Applied Predictive Modeling; Springer: New York, NY, USA, 2013. [Google Scholar] [CrossRef]
  28. Zargari Marandi, R.; Madeleine, P.; Omland, Ø.; Vuillerme, N.; Samani, A. An oculometrics-based biofeedback system to impede fatigue development during computer work: A proof-of-concept study. PLoS ONE 2019, 14, e0213704. [Google Scholar] [CrossRef] [PubMed]
  29. Haider, N.S.; Singh, B.K.; Periyasamy, R.; Behera, A.K. Respiratory sound based classification of chronic obstructive pulmonary disease: A risk stratification approach in machine learning paradigm. J. Med. Syst. 2019, 43, 255. [Google Scholar] [CrossRef]
  30. Jolliffe, I.T.; Cadima, J. Principal component analysis: A review and recent developments. Philos. Trans. R. Soc. A 2016, 374, 20150202. [Google Scholar] [CrossRef]
  31. Dehghani, A.; Glatard, T.; Shihab, E. Subject cross validation in human activity recognition. arXiv 2019, arXiv:1904.02666. [Google Scholar]
  32. Cook, D.; Feuz, K.D.; Krishnan, N.C. Transfer learning for activity recognition: A survey. Knowl. Inf. Syst. 2013, 36, 537–556. [Google Scholar] [CrossRef]
  33. Sanchez-Morillo, D.; Fernandez-Granero, M.A.; Leon-Jimenez, A. Use of predictive algorithms in-home monitoring of chronic obstructive pulmonary disease and asthma: A systematic review. Chronic Respir. Dis. 2016, 13, 264–283. [Google Scholar] [CrossRef]
Figure 1. Schematic overview of the data processing and machine learning workflow, from initial data acquisition to final model selection. COPD: Chronic Obstructive Pulmonary Disease; RMSE: Root Mean Square Error; MAPE: Mean Absolute Percentage Error; MAE: Mean Absolute Error; MSE: Mean Squared Error; WER: Word Error Rate.
Figure 2. Flowchart illustrating the audio data screening and filtering process. Starting with 2946 collected audio files, sequential filters were applied to remove recordings with missing associated spirometry data or identifiers (n = 202), files smaller than 300 kb (n = 241), files with a high word error rate (WER) indicating poor quality or transcription alignment (n = 419), and files linked to invalid FEV1 measurements (0 mL/s, n = 31). This resulted in a final dataset of 2053 audio files (2019 from COPD participants, 34 from control) used for analysis.
Figure 3. Bland–Altman plot comparing predicted and actual lung function measurements. The plot shows the difference between measurements (y-axis) against their mean (x-axis), with a mean difference of 7.03 mL/s (solid line) and 95% limits of agreement from −333.21 to 347.26 mL/s (dashed lines). The scatter pattern suggests generally good agreement between the two measurement methods, with most differences falling within the limits of agreement, although there are some notable outliers at higher average values. The relatively symmetric distribution around the mean difference line indicates no systematic bias in the predictions.
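The mean difference and limits of agreement follow the standard Bland–Altman construction (mean difference ± 1.96 standard deviations of the differences). A minimal sketch:

```python
import numpy as np

def bland_altman(predicted: np.ndarray, actual: np.ndarray):
    """Mean difference and 95% limits of agreement (mean ± 1.96 SD)."""
    diff = predicted - actual
    mean_diff = diff.mean()
    sd_diff = diff.std(ddof=1)           # sample standard deviation
    loa = (mean_diff - 1.96 * sd_diff, mean_diff + 1.96 * sd_diff)
    return mean_diff, loa
```

Applied to the study's validation predictions, this construction should yield the reported mean difference of 7.03 mL/s and limits of −333.21 to 347.26 mL/s.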
Figure 4. Learning curve analysis of the XGBoost model, showing the negative root mean squared error (RMSE) against the number of training examples. The red line represents the training score, while the green line shows the cross-validation score with its standard deviation (shaded area). The converging trend between the training and validation scores as the training examples increase indicates reduced overfitting. Training performance ranges from approximately 0 to −50, while validation performance improves from −300 to −170 with increased data.
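A learning curve of this kind can be generated with scikit-learn's learning_curve utility. The sketch below uses the asterisked XGBoost hyperparameters from Table 2 and synthetic placeholder data, since the study's acoustic features are not distributed with the article.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import learning_curve
from xgboost import XGBRegressor

# Synthetic stand-in for the acoustic feature matrix and FEV1 targets.
X, y = make_regression(n_samples=500, n_features=40, noise=10.0, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    XGBRegressor(n_estimators=200, learning_rate=0.1, max_depth=4,
                 min_child_weight=3, gamma=0.0, subsample=0.7,
                 colsample_bytree=0.7),      # best values from Table 2
    X, y,
    cv=10,                                   # matches the study's 10-fold CV
    scoring="neg_root_mean_squared_error",   # negative RMSE, as plotted
    train_sizes=np.linspace(0.1, 1.0, 10),
)
print(train_scores.mean(axis=1))  # training curve (red line in Figure 4)
print(val_scores.mean(axis=1))    # cross-validation curve (green line)
```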
Figure 5. Box plot comparison of prediction error distributions across seven machine learning models (ridge, lasso, ElasticNet, RF, XGB, KNN, and MLP) on the validation set. Green triangles indicate mean errors, while the boxes show the interquartile range (IQR: 25th to 75th percentiles) with the median line. Whiskers extend to the most extreme points within 1.5 × IQR from the box edges, and points beyond these whiskers are considered outliers. The K-nearest neighbors (KNN) model demonstrates the best overall performance, with the smallest IQR and fewer extreme outliers compared to other models. While all models show symmetric error distributions centered near zero, they vary in their spread and outlier patterns.
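The plotting conventions described in the caption (mean markers, 1.5 × IQR whiskers) correspond to matplotlib's defaults for boxplot with means shown. A sketch with synthetic error arrays in place of the study's per-model validation errors:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in: model name -> validation errors (predicted minus actual, mL/s)
rng = np.random.default_rng(0)
errors = {m: rng.normal(0.0, 200.0, 300)
          for m in ["ridge", "lasso", "ElasticNet", "RF", "XGB", "KNN", "MLP"]}

fig, ax = plt.subplots()
ax.boxplot(list(errors.values()), showmeans=True, whis=1.5)  # mean markers, 1.5*IQR whiskers
ax.set_xticks(range(1, len(errors) + 1), labels=list(errors.keys()))
ax.axhline(0.0, linestyle="--", linewidth=0.8)               # zero-error reference
ax.set_ylabel("Prediction error (mL/s)")
plt.show()
```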
Table 1. Overview and rationale for selected machine learning regression models.
Ridge Regression (linear model with L2 regularization)
- Characteristics: Handles multicollinearity, shrinks coefficients, tests the simplest relationship form.
- Practical considerations: Computationally efficient (convex optimization). Highly interpretable via coefficient magnitudes, indicating feature influence. Requires feature scaling for coefficient comparison. Less prone to overfitting than standard linear regression in cases of multicollinearity.

Lasso Regression (linear model with L1 regularization)
- Characteristics: Provides sparse models via automatic feature selection (shrinks some coefficients to zero). Useful for identifying potentially key predictive features.
- Practical considerations: Computationally efficient. Performs feature selection, yielding sparse and potentially simpler models. High interpretability through non-zero coefficients. Requires feature scaling. Can be unstable with highly correlated features (may select one arbitrarily).

ElasticNet (linear model with L1 and L2 regularization)
- Characteristics: Combines ridge/lasso strengths. Robustly handles multicollinearity while performing feature selection. Useful when groups of correlated features exist.
- Practical considerations: Computationally efficient. Balances L1/L2 penalties to handle correlated features effectively while performing feature selection. High interpretability via coefficients. Requires feature scaling and tuning of two hyperparameters (alpha and l1_ratio).

Random Forest (RF; ensemble learning, bagging of decision trees)
- Characteristics: Captures non-linearities/interactions. Generally robust to outliers and feature scaling. Provides feature importance measures.
- Practical considerations: Ensemble method; training involves building numerous trees, potentially requiring significant computation time and memory, but parallelizable. Less sensitive to feature scaling than distance-based or linear models. Moderate interpretability: provides global feature importance scores; requires post hoc methods (e.g., SHAP) for reliable local explanations.

XGBoost (ensemble learning, gradient-boosted decision trees)
- Characteristics: High-performance algorithm capturing non-linearities/interactions effectively. Utilizes regularized boosting for improved generalization.
- Practical considerations: Advanced gradient boosting implementation, often achieving high predictive accuracy. Can be computationally intensive and requires careful hyperparameter tuning (e.g., learning rate, tree depth, regularization). Moderate interpretability: provides feature importance; local explanations typically rely on methods like SHAP.

K-Nearest Neighbors (KNN; instance-based, non-parametric learning)
- Characteristics: Captures local structure. Makes no strong assumptions about the underlying data distribution. Requires scaled features. Sensitive to irrelevant features ("curse of dimensionality").
- Practical considerations: Non-parametric, instance-based learner. Minimal training time (stores data), but prediction complexity scales with dataset size (potentially slow). Highly sensitive to feature scaling and choice of distance metric. Interpretability is high locally (can examine neighbors influencing a prediction) but lacks a global, summarized model.

Multi-Layer Perceptron (MLP; artificial neural network)
- Characteristics: Universal approximator capable of learning highly complex, non-linear functions. Represents a distinct modeling approach. Requires scaled features.
- Practical considerations: Flexible neural network model requiring significant data, computational resources (often GPU acceleration), and careful tuning (architecture, optimizer, regularization). Sensitive to feature scaling. Generally considered a "black box" due to low direct interpretability; understanding predictions relies heavily on post hoc explanation techniques (e.g., SHAP, LIME).
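As an illustration of how these seven regressors might be instantiated (scikit-learn plus xgboost; not the authors' published code), the sketch below wraps the scale-sensitive models in a StandardScaler pipeline, following the scaling notes in the table:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from xgboost import XGBRegressor

# Scale-sensitive models get a StandardScaler step; tree ensembles do not need one.
models = {
    "ridge": make_pipeline(StandardScaler(), Ridge()),
    "lasso": make_pipeline(StandardScaler(), Lasso()),
    "elasticnet": make_pipeline(StandardScaler(), ElasticNet()),
    "rf": RandomForestRegressor(),
    "xgb": XGBRegressor(),
    "knn": make_pipeline(StandardScaler(), KNeighborsRegressor()),
    "mlp": make_pipeline(StandardScaler(), MLPRegressor()),
}
```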
Table 2. All trained models, hyperparameters, and values. The asterisk indicates the best-performing values. All definitions of models and hyperparameters can be found in the Abbreviations section.
Model | Hyperparameter | Values
Ridge | alpha | 10.0 *, 50.0, 100.0, 150.0, 200.0, 250.0, 300.0
Lasso | alpha | 0.001 *, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0
ElasticNet | alpha | 0.01 *, 0.1, 0.5, 1.0, 5.0, 10.0, 50.0, 100.0
ElasticNet | l1_ratio | 0.1 *, 0.3, 0.5, 0.7, 0.9
Random Forest | n_estimators | 50, 100 *, 150, 200
Random Forest | max_depth | None, 5, 10, 15, 20 *
Random Forest | min_samples_split | 2 *, 3, 4, 5
Random Forest | min_samples_leaf | 1, 2 *, 3, 4
XGBoost | n_estimators | 50, 100, 150, 200 *
XGBoost | learning_rate | 0.01, 0.05, 0.1 *, 0.2
XGBoost | max_depth | 3, 4 *, 5, 6
XGBoost | min_child_weight | 1, 2, 3 *, 4
XGBoost | gamma | 0.0 *, 0.1, 0.2, 0.3
XGBoost | subsample | 0.5, 0.6, 0.7 *, 0.8
XGBoost | colsample_bytree | 0.5, 0.6, 0.7 *, 0.8
K-Nearest Neighbors | n_neighbors | 2, 3, 5, 7 *, 10, 12
K-Nearest Neighbors | weights | uniform, distance *
K-Nearest Neighbors | algorithm | ball_tree *, kd_tree, brute
Multi-Layer Perceptron | hidden_layer_sizes | (50,), (100,), (50, 50), (100, 100) *
Multi-Layer Perceptron | activation | relu *, tanh
Multi-Layer Perceptron | solver | adam
Multi-Layer Perceptron | alpha | 0.0001, 0.001, 0.01, 0.1 *
Multi-Layer Perceptron | learning_rate | constant, adaptive *
Multi-Layer Perceptron | max_iter | 500, 1000, 1500 *
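A grid of this kind can be searched with 10-fold cross-validation via scikit-learn's GridSearchCV. The sketch below shows the KNN grid exactly as listed in Table 2; variable names are hypothetical and the fit call assumes a (scaled) feature matrix is available.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

param_grid = {  # KNN grid as in Table 2
    "n_neighbors": [2, 3, 5, 7, 10, 12],
    "weights": ["uniform", "distance"],
    "algorithm": ["ball_tree", "kd_tree", "brute"],
}
search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid,
    cv=10,                                   # 10-fold cross-validation
    scoring="neg_root_mean_squared_error",   # RMSE-based model selection
)
# search.fit(X_train, y_train)
# On this study's data, the selected values were n_neighbors=7,
# weights="distance", and algorithm="ball_tree" (asterisks in Table 2).
```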
Table 3. Lung function statistics, sex, age, and number of audio files for each participant, including the median, interquartile range (IQR), minimum, and maximum FEV1.
(Median, IQR, Min, and Max refer to FEV1 in mL/s.)

Participant ID | Sex | Age | Audio Files | Median | IQR | Min | Max
1 | M | 76 | 161 | 1050 | 110 | 910 | 1310
2 | M | 70 | 225 | 1780 | 150 | 1490 | 2110
3 | M | 65 | 28 | 1865 | 412.5 | 1120 | 2380
4 | F | 72 | 109 | 1280 | 90 | 1110 | 1480
5 | M | 68 | 117 | 440 | 50 | 370 | 540
6 | M | 77 | 91 | 900 | 110 | 730 | 1130
7 | F | 76 | 8 | 770 | 47.5 | 700 | 810
8 | M | 79 | 113 | 780 | 50 | 720 | 920
9 | M | 75 | 66 | 1880 | 147.5 | 1610 | 2220
10 | F | 73 | 127 | 610 | 90 | 400 | 1880
11 | M | 75 | 13 | 1530 | 90 | 1380 | 1660
12 | M | 84 | 89 | 530 | 50 | 440 | 650
13 | F | 73 | 13 | 600 | 25 | 580 | 630
14 | F | 78 | 152 | 900 | 120 | 740 | 1080
15 | F | 64 | 87 | 910 | 75 | 770 | 1080
16 | F | 64 | 77 | 1490 | 180 | 650 | 2180
17 | M | 79 | 18 | 925 | 250 | 550 | 1190
18 | F | 72 | 126 | 1140 | 137.5 | 840 | 1560
19 | F | 76 | 146 | 1140 | 130 | 960 | 1500
20 | M | 68 | 16 | 455 | 30 | 420 | 520
21 | M | 72 | 88 | 2200 | 242.5 | 1840 | 2590
22 | F | 73 | 54 | 990 | 60 | 880 | 1180
23 | F | 73 | 95 | 600 | 120 | 380 | 880
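The per-participant summaries in Table 3 correspond to a simple group-by aggregation over the filtered recordings. A sketch with hypothetical column names:

```python
import pandas as pd

def participant_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-participant FEV1 summary, as in Table 3."""
    g = df.groupby("participant_id")["fev1_ml_s"]
    return pd.DataFrame({
        "n_files": g.size(),                          # number of audio files
        "median": g.median(),
        "iqr": g.quantile(0.75) - g.quantile(0.25),   # interquartile range
        "min": g.min(),
        "max": g.max(),
    })
```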
Table 4. Performance metrics of machine learning models for the estimation of FEV1 from audio data. The table shows the root mean square error (RMSE, mL/s), mean absolute percentage error (MAPE, %), mean absolute error (MAE, mL/s), and mean squared error (MSE, (mL/s)²) for both the training and evaluation datasets, together with the RMSE standard deviation for the training dataset. Models are ordered by increasing RMSE on the evaluation dataset (best-performing first).
Training
Model | RMSE | MAPE | MAE | MSE | RMSE Std
K-nearest neighbors | 138.92 | 8.73 | 84.98 | 19,716.86 | 20.43
XGBoost | 167.60 | 12.42 | 116.25 | 28,397.69 | 17.56
Random Forest | 180.75 | 12.12 | 112.74 | 33,296.43 | 25.00
Multi-Layer Perceptron | 202.42 | 15.09 | 145.45 | 41,644.93 | 25.88
Lasso | 250.14 | 21.20 | 190.78 | 62,831.40 | 16.24
Ridge | 275.07 | 23.62 | 205.38 | 75,884.98 | 14.85
ElasticNet | 280.49 | 24.18 | 209.27 | 78,887.91 | 14.64

Evaluation
Model | RMSE | MAPE | MAE | MSE
K-nearest neighbors | 173.73 | 9.90 | 93.82 | 30,182.83
XGBoost | 178.85 | 13.30 | 120.67 | 31,987.81
Random Forest | 202.99 | 14.74 | 123.43 | 41,202.95
Multi-Layer Perceptron | 227.83 | 16.52 | 157.04 | 51,904.96
Lasso | 264.16 | 21.06 | 190.88 | 69,781.04
Ridge | 300.74 | 25.91 | 217.3 | 90,446.22
ElasticNet | 307.23 | 26.69 | 222.49 | 94,391.37
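All four metrics in Table 4 are available in scikit-learn. A minimal sketch (note that scikit-learn's MAPE is returned as a fraction and is scaled to percent here, matching the table's units):

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error,
                             mean_absolute_percentage_error,
                             mean_squared_error)

def regression_metrics(y_true, y_pred):
    """RMSE/MAE in mL/s, MAPE in %, MSE in (mL/s)^2, as reported in Table 4."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "rmse": float(np.sqrt(mse)),
        "mape": 100.0 * mean_absolute_percentage_error(y_true, y_pred),
        "mae": mean_absolute_error(y_true, y_pred),
        "mse": mse,
    }
```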