Article

Voice as a Health Indicator: The Use of Sound Analysis and AI for Monitoring Respiratory Function

by Nicki Lentz-Nielsen 1, Lars Maaløe 2, Pascal Madeleine 3 and Stig Nikolaj Blomberg 1,4,5,*

1 Koncern Digitalisering, Region Zealand, 4100 Ringsted, Denmark
2 Corti and Department of Applied Mathematics and Computer Science, Technical University of Denmark, 2800 Lyngby, Denmark
3 ExerciseTech, Department of Health Science and Technology, Faculty of Medicine, Aalborg University, 9260 Klarup, Denmark
4 Institute of Clinical Medicine, University of Copenhagen, 2200 Copenhagen, Denmark
5 CardioTech, Department of Health Science and Technology, Faculty of Medicine, Aalborg University, 9260 Klarup, Denmark
* Author to whom correspondence should be addressed.
BioMedInformatics 2025, 5(2), 31; https://doi.org/10.3390/biomedinformatics5020031
Submission received: 9 April 2025 / Revised: 28 May 2025 / Accepted: 29 May 2025 / Published: 7 June 2025

Abstract

Background: Chronic obstructive pulmonary disease (COPD) is projected to be the third-leading cause of death by 2030. Traditional spirometry for the monitoring of the forced expiratory volume in one second (FEV1) can provoke discomfort and anxiety. This study aimed to validate AI models using daily audio recordings as an alternative for FEV1 estimation in home settings. Methods: Twenty-three participants with moderate to severe COPD recorded daily audio readings of standardized texts and measured their FEV1 using spirometry over nine months. Participants also recorded biomarkers (heart rate, temperature, oxygen saturation) via tablet application. Various machine learning models were trained using acoustic features extracted from 2053 recordings, with K-nearest neighbor, random forest, XGBoost, and linear models evaluated using 10-fold cross-validation. Results: The K-nearest neighbors model achieved a root mean square error of 174 mL/s on the validation data. The limit of agreement (LoA) ranged from −333.21 to 347.26 mL/s. Despite an error range of −1252 to 1435 mL/s, most predictions fell within the LoA, indicating good performance in estimating the FEV1. Conclusions: The predictive model showed promising results, with a narrower LoA compared to traditional unsupervised spirometry methods. The AI models effectively used audio to predict the FEV1, suggesting a viable non-invasive approach for COPD monitoring that could enhance patient comfort and accessibility in home settings.

1. Introduction

Voice, speech, and respiratory sounds contain rich physiological information, driving interest in their analysis using artificial intelligence (AI) as accessible, non-invasive digital biomarkers for various health conditions [1]. This potential is particularly relevant in the monitoring of chronic diseases like chronic obstructive pulmonary disease (COPD), a long-term lung condition characterized by progressively reduced airflow. The global impact of COPD is substantial; the World Health Organization projects it to be the third-leading cause of death and seventh-leading cause of morbidity by 2030 [2].
Individuals with COPD typically perform spirometry tests to estimate their respiratory function, i.e., the forced expiratory volume in one second (FEV1). The FEV1 is a critical indicator of disease severity and exacerbations of COPD [3,4]. Therefore, the FEV1 also serves as a key outcome measure in evaluating the efficacy of interventions [5]. Spirometry testing can provoke discomfort and result in “extreme anxiety” [6]. Despite this, individuals generally view self-monitoring positively, yet the likelihood of undergoing spirometry testing decreases with milder COPD severity [7]. Furthermore, Anand et al. [8] reported that unsupervised spirometry tests produced lower FEV1 values (mean difference: −107 mL, LoA: −509, 296) compared to supervised spirometry tests.
This highlights the challenges with unsupervised spirometry tests in home settings. The pursuit of more amenable alternatives has led to investigations across various technological modalities. Beyond acoustic analysis, researchers have explored accelerometry-based systems, often involving wearable sensors to detect chest wall movements or physical activity patterns that correlate with the respiratory status [9,10]. Non-contact methods, such as those utilizing radar technology, have also been investigated for their potential to remotely sense thoracic excursions and breathing rhythms to derive estimates of lung function parameters [11,12]. These diverse technological approaches aim to provide less burdensome FEV1 estimation, although each carries its own considerations regarding accuracy, user acceptability, and deployment complexity in real-world environments. For instance, with radar-based systems, obtaining accurate physiological data can be difficult due to random body movements, and the presence of other individuals within the sensing range can introduce confounding factors, complicating accurate physiological data acquisition [12].
Using audio data may offer a way to overcome the aforementioned barriers associated with traditional spirometry and other sensor modalities. A few studies have already used audio data and machine learning to identify COPD or estimate respiratory function [13,14,15,16]. One approach analyzes cough sounds to predict COPD exacerbations [15]. However, it requires live sound monitoring, posing GDPR compliance challenges and concerns for those averse to constant surveillance—an issue that the current study seeks to mitigate through on-demand, user-initiated recordings. Nathan et al. [14] encountered similar issues with continuous monitoring. Conversely, Alam et al. [13] demonstrated promising results when using voice recordings from 26 asthma patients reading a standardized text to predict the FEV1%. One of the main barriers with such controlled datasets is ecological validity. Therefore, a key aim of the current study is to build upon these findings by specifically assessing FEV1 estimation from daily, user-initiated scripted readings under unsupervised, real-world domestic conditions, thereby prioritizing ecological validity and patient comfort, in contrast to continuous surveillance or highly controlled lab-based assessments.
In practice, frail and elderly individuals may struggle to adhere perfectly to any unsupervised monitoring protocol. This difficulty is exemplified by unsupervised spirometry tests, where lower FEV1 values have been observed compared to supervised tests [8].
A similar pattern of deviation is expected in daily readings of standardized texts, highlighting the need for an alternative to unsupervised spirometry tests for effective home monitoring. This study aimed to assess the validity of using audio data and machine learning to estimate the FEV1 in elderly patients with COPD through the analysis of daily scripted text readings. This approach was implemented and evaluated under real-world domestic conditions. By testing this approach in participants’ homes without supervision, we sought to demonstrate the potential of voice analysis using AI as an alternative to traditional unsupervised spirometry. The remainder of this paper outlines the methodology employed, presents the key results from our models, and discusses the implications of these findings for COPD home monitoring.

2. Materials and Methods

Twenty-three participants (12 males, 11 females; age = 73.1 ± 5.0 yrs), diagnosed with moderate to severe COPD and recruited from the same geographical area, took part in the study (see Table 3 in Section 3). Each participant was provided with a tablet and selected one of four poems to read out loud (see Appendix A). Participants were instructed to read the poem aloud and to assess and record physiological data every morning over a period of nine months. Measured biomarkers included the heart rate, body temperature, oxygen saturation, and the target variable, FEV1. After receiving initial instructions about self-monitoring, the participants conducted the monitoring under unsupervised conditions. All biomarkers were manually entered by the participants into an application on the tablet and subsequently stored on an SQL server. Additionally, the tablet was equipped with an application to record audio.
A single asymptomatic female adult served as a quality control participant. The control person read each poem aloud on a weekly basis throughout the study period, in a controlled and noise-free environment. This procedure established a “gold standard” for the minimal expected sound quality of the recordings. A general overview of the methods and process can be seen in Figure 1.

2.1. Data Processing

All data processing was performed using Python v3.9.18. The audio data were transcribed using OpenAI’s Whisper (package v20231117) with the multilingual large-v3 model configured for Danish [17]. Before being applied for transcription in this study, the Whisper model underwent a validation process to determine its accuracy on Danish-language audio. This process utilized audio samples from the Common Voice Corpus v16.1, hosted by Mozilla [18].
For this validation, only sentences containing between three and 100 words were selected for analysis. The word error rate (WER) computed for these sentences was 0.14 ± 0.24, in concordance with Radford et al. [17]. To ensure consistent data quality, a multi-step filtering process was applied (Figure 1). This process removed audio files that (1) lacked corresponding spirometry measurements, (2) were missing tablet identifiers due to technical issues preventing participant attribution, (3) fell below the minimum file size threshold (300 kB), or (4) failed the transcription accuracy thresholds. Recordings that were substantially incomplete (i.e., missing large segments of the expected poem), as indicated by failing to meet the file size or overall word error rate (WER) thresholds, were thus excluded from the analysis. For recordings passing these initial quality checks, silences occurring within the spoken poems were not removed or imputed; instead, the characteristics of these speech pauses (e.g., duration, frequency) were quantified using the pyAudioAnalysis library and included as predictive features, as detailed in Section 2.2. This resulted in 2053 audio files (2019 from participants with COPD, 34 from the control) for analysis, with details of the exclusions provided in Figure 2.
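As an illustration of the transcription and WER validation steps, the following is a minimal sketch using the openai-whisper and jiwer packages; the file path and reference sentence are placeholders, not the study’s actual assets.

```python
import whisper  # openai-whisper package
import jiwer

# Load the multilingual large-v3 model and force Danish decoding
model = whisper.load_model("large-v3")

def transcribe(audio_path: str) -> str:
    """Transcribe one recording in Danish and return the raw text."""
    result = model.transcribe(audio_path, language="da")
    return result["text"]

# Validation against a Common Voice-style reference sentence (placeholder text)
reference = "dette er en dansk referencesaetning"
hypothesis = transcribe("clip.mp3")
print("WER:", jiwer.wer(reference, hypothesis))  # study mean: 0.14 +/- 0.24
```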
All MP4 files smaller than 300 kB were removed, as these primarily consisted of instructions from the healthcare professional and were generally too short to contain the entire poem. The first author listened to the audio files to check their quality, ensuring that only non-relevant files were excluded. The primary approach to handling recording artefacts (e.g., significant background noise, non-speech sounds corrupting the poem reading) was implemented through this multi-step filtering protocol, leading to the exclusion of recordings that did not meet the quality standards, rather than attempting direct artefact removal or correction within the audio signal itself. The WER was calculated using the jiwer package v3.0.2. This metric was utilized to assess the alignment between the transcribed and reference texts. By calculating the WER using a rolling window and a set number of words, alignment between the audio and the reference text was secured by discarding all words preceding the best match (i.e., the lowest WER). This same strategy was also applied to eliminate any noise following the poem’s conclusion. Once alignment was ensured, any audio files with a WER exceeding the 80th percentile cutoff value of 0.667 for the starting segment or the 90th percentile cutoff value of 0.851 for the entire transcription were removed from the dataset. These were deemed to be of insufficient quality for further analysis (see Appendix B, which illustrates the distributions of the WERs before and after applying these thresholds). In the present study, short-duration (30–60 s) audio recordings were used, limiting the potential drift effect of the microphone. The use of a “gold-standard” reference recording provided a stable benchmark for relative audio energy comparisons (detailed in Section 2.2), and the consistent application of per-recording quality control served to mitigate issues due to sensor drift.
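The rolling-window alignment might look like the following sketch; the window length, whitespace tokenization, and search strategy are illustrative assumptions rather than the study’s verified implementation. Running the same scan backwards from the transcript’s end would analogously trim trailing noise.

```python
import jiwer

def trim_to_poem_start(transcript: str, poem_opening: str, window: int = 10) -> str:
    """Slide a fixed-size word window over the transcript, score each window
    against the poem's opening words with the WER, and discard everything
    before the best (lowest-WER) match."""
    words = transcript.split()
    best_wer, best_idx = float("inf"), 0
    for i in range(max(1, len(words) - window + 1)):
        candidate = " ".join(words[i:i + window])
        score = jiwer.wer(poem_opening, candidate)
        if score < best_wer:
            best_wer, best_idx = score, i
    return " ".join(words[best_idx:])
```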

2.2. Feature Engineering

A range of acoustic features was computed utilizing a rolling window approach, with windows of 0.1 s and an overlap of 0.05 s, employing the pyAudioAnalysis library v0.3.14. This rolling-window feature extraction inherently provided temporal smoothing of the acoustic characteristics before the aggregate statistics were calculated for each recording. The extraction yielded a broad set of 431 features, encompassing time-domain characteristics (e.g., zero crossing rate, energy), spectral shape descriptors (e.g., spectral centroid, MFCCs, chroma features), and their first derivatives (deltas) to capture temporal dynamics. Additionally, linear predictive coding (LPC) coefficients were derived to model vocal tract properties, and the word error rate (WER) was calculated to assess the transcription fidelity. Silence analysis provided statistics on pause durations [19]. These acoustically derived features, along with manually entered physiological data and a relative audio energy measure, were then subjected to aggregate statistical calculations (Appendix C).
These calculations included the mean, median, standard deviation, maximum, interquartile range, and total range of each feature across windows. This provided the model with a richer description of the data’s characteristics, encompassing both measures sensitive to extreme values (the mean, standard deviation, maximum, and range) and measures robust to them (the median and IQR), allowing the algorithm the flexibility to learn which statistical properties are most informative for prediction. Additionally, a relative energy comparison was performed to calculate the difference in energy between each audio file and the “gold-standard” reference audio.
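A minimal sketch of this extraction and aggregation, assuming pyAudioAnalysis and a recording already converted to WAV (the conversion step and file name are assumptions of this example):

```python
import numpy as np
from pyAudioAnalysis import audioBasicIO, ShortTermFeatures

fs, signal = audioBasicIO.read_audio_file("reading.wav")
# 0.1 s windows with a 0.05 s step, as in Section 2.2 (deltas are included)
feats, names = ShortTermFeatures.feature_extraction(
    signal, fs, int(0.1 * fs), int(0.05 * fs))

def aggregate(values: np.ndarray) -> dict:
    """Per-recording summary statistics of one feature across all windows."""
    q75, q25 = np.percentile(values, [75, 25])
    return {"mean": values.mean(), "median": np.median(values),
            "std": values.std(), "max": values.max(),
            "iqr": q75 - q25, "range": values.max() - values.min()}

recording_features = {n: aggregate(feats[i]) for i, n in enumerate(names)}
```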
Finally, all features were normalized to a range between zero and one using min-max scaling (Equation (1)), by fitting the scaler to the training data and then applying the same scaling to transform the test data.
$$X_{\text{scaled}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}} \tag{1}$$
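In scikit-learn terms, this corresponds to fitting a MinMaxScaler on the training split only; `X_train` and `X_test` below are assumed feature matrices from the preceding steps.

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
X_train_scaled = scaler.fit_transform(X_train)  # learns per-feature min and max
X_test_scaled = scaler.transform(X_test)        # reuses the training min/max, avoiding leakage
```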

2.3. Feature Selection

Features exhibiting internal variation below a threshold of 0.01 were excluded. Subsequently, a “VotingSelector” algorithm was employed, comprising four distinct selection methods: Kendall’s rank correlation coefficient, a variance inflation factor with a threshold of 10.0 [20], recursive feature elimination (RFE) with a linear support vector regressor kernel, and the Boruta method utilizing a random forest kernel [21]. Only features selected by at least two of these four methods were retained for inclusion in the final dataset, reducing the initial 431 features to a final set of 100 for model training.
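The “VotingSelector” logic might be wired together as in the following sketch; the vote counting, the significance threshold for Kendall’s tau, and the RFE target size are illustrative assumptions, not the study’s exact implementation.

```python
import numpy as np
import pandas as pd
from scipy.stats import kendalltau
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.svm import SVR
from statsmodels.stats.outliers_influence import variance_inflation_factor
from boruta import BorutaPy

def voting_select(X: pd.DataFrame, y: np.ndarray, min_votes: int = 2) -> list:
    """Keep features nominated by at least `min_votes` of four selectors."""
    votes = pd.Series(0, index=X.columns)

    # 1) Kendall's rank correlation: keep features rank-correlated with y
    for col in X.columns:
        tau, p = kendalltau(X[col], y)
        if p < 0.05:
            votes[col] += 1

    # 2) Variance inflation factor: keep features with VIF below 10.0
    for i, col in enumerate(X.columns):
        if variance_inflation_factor(X.values, i) < 10.0:
            votes[col] += 1

    # 3) Recursive feature elimination with a linear SVR kernel
    rfe = RFE(SVR(kernel="linear"), n_features_to_select=len(X.columns) // 2)
    rfe.fit(X, y)
    votes[X.columns[rfe.support_]] += 1

    # 4) Boruta with a random forest kernel
    boruta = BorutaPy(RandomForestRegressor(n_jobs=-1), n_estimators="auto")
    boruta.fit(X.values, y)
    votes[X.columns[boruta.support_]] += 1

    return list(votes[votes >= min_votes].index)
```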

2.4. Modeling

The dataset was randomly split into training and test sets using an 80/20 split. Hyperparameters were tuned using 10-fold cross-validation for a range of models: ridge, lasso, ElasticNet, random forest, XGBoost, K-nearest neighbors (KNN), and a multi-layer perceptron (see Table 1) [22].
Table 1 provides an overview of the candidate models and the rationale for their selection; the specific hyperparameter search spaces are detailed in Table 2. All training was performed with the scikit-learn library v1.3.0 and xgboost v1.7.3. The hyperparameter tuning used 10-fold cross-validation within a GridSearchCV framework. To refine the learning process and minimize the spread of predictions, this phase employed a custom loss function (Equation (2)), combining a quartic error term, which heavily penalizes larger prediction errors, with a variability term that penalizes error variance.
$$\text{loss} = \frac{1}{n}\sum_{i=1}^{n}\left(y_{\text{pred},i} - y_{\text{true},i}\right)^{4} + \operatorname{Var}\left(y_{\text{pred}} - y_{\text{true}}\right) \tag{2}$$
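A sketch of this tuning setup for one of the models (KNN shown); `X` and `y` are assumed to be the selected features and FEV1 targets from the preceding sections, and the grid values are illustrative of Table 2 rather than a verbatim copy.

```python
import numpy as np
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor

def quartic_variance_loss(y_true, y_pred):
    """Equation (2): mean quartic error plus the variance of the residuals."""
    resid = np.asarray(y_pred) - np.asarray(y_true)
    return np.mean(resid ** 4) + np.var(resid)

# Lower loss is better, so the scorer negates it for scikit-learn
scorer = make_scorer(quartic_variance_loss, greater_is_better=False)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)  # 80/20 split

grid = GridSearchCV(KNeighborsRegressor(),
                    param_grid={"n_neighbors": range(3, 15),
                                "weights": ["uniform", "distance"]},
                    scoring=scorer, cv=10)  # 10-fold cross-validation
grid.fit(X_train, y_train)
```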
However, following the hyperparameter tuning, the root mean squared error (RMSE) (Equation (3)) on the validation set was employed as the primary metric to assess and select the best-performing model, thereby simplifying interpretation and facilitating comparison.
$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_{\text{pred},i} - y_{\text{true},i}\right)^{2}} \tag{3}$$
The RMSE served as the primary evaluation metric, but the mean absolute percentage error (MAPE) (Equation (4)), mean absolute error (MAE) (Equation (5)), and mean squared error (MSE) (Equation (6)) are also reported for comprehensive model assessment. The MAPE (Equation (4)) expresses the average absolute error as a percentage of the actual values, offering a relative error measure. The MAE (Equation (5)) quantifies the average absolute difference between the predicted and actual values. The MSE (Equation (6)) measures the average of the squared errors, thereby assigning greater penalties to larger deviations.
$$\text{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \tag{4}$$
$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \tag{5}$$
$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2} \tag{6}$$
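For reference, all four reported metrics can be computed directly with scikit-learn; `y_true` and `y_pred` are assumed validation-set arrays.

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error,
                             mean_absolute_percentage_error,
                             mean_squared_error)

mse = mean_squared_error(y_true, y_pred)                      # Equation (6)
rmse = np.sqrt(mse)                                           # Equation (3)
mae = mean_absolute_error(y_true, y_pred)                     # Equation (5)
mape = 100 * mean_absolute_percentage_error(y_true, y_pred)   # Equation (4), in %
```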

3. Results

The initial analysis of the audio recordings revealed substantial variations in recording quality across participants. These variations necessitated the implementation of our filtering protocol (see Figure 2), which ensured uniform data quality for subsequent analysis. After applying all filtering steps, the participants generated an average of 87.8 ± 56.2 audio recordings during the nine-month study period (see Table 3). The K-nearest neighbors (KNN) model performed best among all tested models, with the number of neighbors set to seven and distance weighting. As shown in Table 4, it achieved an RMSE of 138.92 mL/s (relative RMSE of 8.73%) on the training set, with a standard deviation of 20.43 mL/s. On the validation set, the model yielded an RMSE of 173.73 mL/s (relative RMSE of 9.90%). The KNN model also demonstrated the lowest mean absolute error (MAE) of 84.98 mL/s and 93.82 mL/s on the training and validation sets, respectively. Furthermore, it exhibited the lowest mean squared error (MSE) among all models, with 19,716.86 (mL/s)² for the training set and 30,182.83 (mL/s)² for the validation set, indicating its superior overall accuracy and consistency in predictions.
Additionally, the mean difference between the predicted and actual FEV1 values was 7.03 mL/s, with the limit of agreement (LoA) ranging from −333.21 to 347.26 mL/s, as depicted in Figure 3, and an absolute range of −1252 to 1435 mL/s. Moreover, the model demonstrated a homogeneous spread across all FEV1 values, indicating no model bias towards either lower or higher FEV1 values. Other models, such as random forest and XGBoost, also showed good performance but were outperformed by KNN. The linear models (ridge, lasso, and ElasticNet) generally showed higher error rates, suggesting that the relationship between the audio features and lung function is non-linear.
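For context, the agreement statistics in Figure 3 follow the standard Bland–Altman construction (mean difference ± 1.96 standard deviations of the differences), sketched below under the assumption that `y_pred` and `y_true` are paired validation arrays.

```python
import numpy as np

diff = np.asarray(y_pred) - np.asarray(y_true)
mean_diff = diff.mean()                     # reported as 7.03 mL/s
loa_low = mean_diff - 1.96 * diff.std()     # reported as -333.21 mL/s
loa_high = mean_diff + 1.96 * diff.std()    # reported as 347.26 mL/s
```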
The feature selection process selected 100 features from the initial set of 431. Detailed information about the combinations of aggregated features is provided in Appendix D. Furthermore, the XGBoost model demonstrated competitive performance compared to KNN. An analysis of the learning curve (see Figure 4) indicates that the inclusion of more data could enhance its performance, potentially outperforming the KNN model.
Specifically, Figure 4 plots the model’s performance score as the negative RMSE (y-axis; scores closer to zero indicate better FEV1 prediction performance) against the increasing amount of patient data used for training (x-axis). It shows two curves: one representing the score achieved on the data that the model was trained on (training score) and another representing the score on data that it had not seen before (validation score), which indicates its real-world performance. For the XGBoost model, the validation score curve is still trending upwards (toward zero) as more data are added, and it has not yet flattened or converged closely with the training score curve. This pattern suggests that the model’s ability to generalize and yield accurate predictions on new, unseen patient data improves with more training examples, indicating that providing it with data from more participants or more recordings per participant could lead to further performance gains.
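This analysis corresponds to scikit-learn’s learning_curve utility with the negative RMSE as the score; the sketch below assumes the feature matrix `X` and targets `y` from Section 2.

```python
import numpy as np
from sklearn.model_selection import learning_curve
from xgboost import XGBRegressor

sizes, train_scores, val_scores = learning_curve(
    XGBRegressor(), X, y, cv=10,
    scoring="neg_root_mean_squared_error",
    train_sizes=np.linspace(0.1, 1.0, 10))

# A validation curve still rising (toward zero) at the largest training size
# suggests that additional data could further improve generalization
print(val_scores.mean(axis=1))
```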
The overall error distribution across all trained models is depicted in Figure 5, indicating a number of prediction errors surpassing the previously mentioned limits of agreement.

4. Discussion

This study aimed to explore the validity of using audio data and machine learning for the estimation of respiratory function, proposing this as an alternative to spirometry tests at home. While the KNN model showed promising performance (low RMSE/MAE) in estimating the FEV1, there was notable variation in the individual prediction accuracy, with some errors being quite large (range −1252 to 1435 mL/s). Our findings contribute to the ongoing development of home monitoring approaches for chronic respiratory diseases like COPD. The ability to estimate respiratory function using accessible technology directly in the patient’s environment aligns with the interest in solutions that may reduce the burden on healthcare systems and support patients in their self-management [23,24]. This approach, utilizing the microphones integrated into consumer-grade tablets or smartphones as low-cost sensors, increases the potential for scalability compared to methods requiring specialized medical devices. It presents possibilities for more frequent, longitudinal tracking of respiratory health, potentially complementing episodic clinic visits [25].

4.1. Model Performance

The results illustrated an LoA between the predicted and actual FEV1 values that was narrower, by approximately 125 mL/s, than the LoA between supervised and unsupervised spirometry [8], representing a key advantage of an audio-based approach for potentially more consistent unsupervised home monitoring. This suggests that the predictive model effectively captures key aspects of pulmonary function, resulting in a more consistent estimation of lung function compared with unsupervised spirometry tests. Nevertheless, for greater precision, supervised spirometry is still required. Additionally, Pramono et al. [26] highlighted similar advantages of automated audio analysis in their review of adventitious respiratory sound classification, where machine learning models demonstrated high accuracy in detecting abnormal respiratory sounds, such as wheezes and crackles. The current study expanded upon this foundation by utilizing audio recordings to predict the quantitative FEV1, offering a broader application of automated respiratory sound analysis in clinical practice.
The current study found an RMSE of 174 mL/s, lower than the 380 mL/s reported in Xu et al. [16]. This discrepancy can be attributed to the different auditory inputs utilized. Xu et al. [16] used forced cough sounds, whereas this study utilized speech patterns as participants read a scripted text. The utilization of speech may offer a more stable and controlled acoustic environment. Furthermore, a key methodological advantage in the current study is its high ecological validity, stemming from daily, unsupervised home recordings using non-obstructive sensors like tablet microphones. This approach directly evaluates the real-world performance [13] and addresses the challenges of unsupervised monitoring in domestic settings [8].
Additionally, Alam et al. [13] found that integrating speech and breathing sounds yields better predictions than using either alone, emphasizing the value of the complete auditory range in speech, including non-verbal sounds. A direct comparison of the results with those of Alam et al. is not feasible, as their evaluation focused on the FEV1%, in contrast to the FEV1 used in the current study. The K-nearest neighbors (KNN) model, configured with seven neighbors and distance weighting, demonstrated the strongest performance (Table 4). The comparatively weaker performance of linear models (Table 4) suggests a non-linear relationship between the acoustic features and FEV1. KNN, as a non-parametric method, excels in capturing such local non-linearities without imposing strong assumptions on the data distribution, which likely contributed to its success in this context. While ensemble methods like XGBoost performed less well, KNN’s selection as the optimal model was driven by its performance on the validation set across key metrics like the RMSE, MAE, and MSE (see Table 4), indicating an advantage for this specific FEV1 estimation task. This choice is further supported by KNN’s characteristic efficiency in training compared to more complex models like XGBoost or the MLP (as generally outlined in Table 1) and its inherent local interpretability, where the predictions can be understood by examining neighboring instances. Although ensemble methods can offer global feature importance, KNN’s leading predictive accuracy, coupled with these practical considerations, rendered it the most suitable model for this study. The selection and tuning of the KNN model were guided by established principles of model configuration (e.g., p. 159 in [27]), and the specific settings chosen for the KNN model were informed by examples of detailed setups, such as that presented in Table 2 in [28].
Similarly, Haider et al. [29] demonstrated that combining lung sound features with spirometry data significantly enhanced the diagnostic accuracy, achieving up to 100% classification accuracy for COPD versus non-COPD cases. While their work focused on classification rather than regression, their findings highlight the potential of integrating complementary audio features to improve model performance.

4.2. Feature Importance

Another factor potentially contributing to the differences in the observed performance compared to the work of Xu et al. [16] relates to the feature engineering and selection protocols employed. Xu et al. [16] utilized 52 audio features reduced via principal component analysis (PCA). While PCA is a common dimensionality reduction technique, its effectiveness relies on assumptions such as linearity and data scaling. Standard practice often involves standardizing the features prior to PCA to address the sensitivity to different measurement units [30]; Xu et al. applied Z-score normalization after PCA to the resulting components. Furthermore, PCA assumes linear relationships, whereas our findings when using various models suggest that the connection between the audio features used in our study and the FEV1 may be inherently non-linear. These differences in the dimensionality reduction approach and the apparent non-linearity of the problem space might favor the feature selection methods and non-linear models used in the present study. Haider et al. [29] similarly emphasized the importance of identifying key features for accurate modeling, finding that the median frequency and linear predictive coefficients (LPCs) were the most predictive lung sound parameters for COPD classification. This complements our approach, which highlights the value of features such as MFCCs and LPCs in contributing to the overall predictive capabilities of the model. These findings reinforce the need for robust feature selection techniques to capture a combination of high-impact variables, which may also benefit future hybrid models incorporating both speech and lung sounds.
A permutation importance analysis revealed that no specific features exhibited extreme importance to the model. This suggests that the model’s performance relied on the combined effect of multiple features. The feature selection protocol encompassed a diverse range of features; however, it was only the MFCCs and chroma vectors for which all aggregations were selected (see Appendix D). None of the time-domain features from the non-verbal segments of the audio were selected. While the time and frequency domains offer different characterizations of the signal, this outcome suggests that, for the prediction of the FEV1 in this dataset, the information provided by these specific time-domain features did not add significant predictive power beyond that already present in the selected frequency-domain features, possibly indicating redundancy in the context of the chosen features.
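A sketch of this analysis using scikit-learn’s permutation_importance; the fitted model `knn_model`, the held-out split `X_val`/`y_val`, and `feature_names` are assumed from the earlier sections.

```python
from sklearn.inspection import permutation_importance

result = permutation_importance(
    knn_model, X_val, y_val, scoring="neg_root_mean_squared_error",
    n_repeats=10, random_state=42)

# Uniformly small mean importances (no single dominant feature) would be
# consistent with the combined-effect interpretation described above
for name, imp in sorted(zip(feature_names, result.importances_mean),
                        key=lambda t: -t[1]):
    print(f"{name}: {imp:.4f}")
```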

4.3. Practical Considerations for At-Home Deployment

The clinical utility of an AI-driven voice analysis approach for COPD management relies on its practical deployment in home settings. This study’s design inherently addresses several feasibility aspects, particularly when contrasted with other technological modalities for respiratory monitoring. Firstly, the hardware requirements are minimal; the system effectively utilized standard tablets, and future iterations could readily employ ubiquitous smartphones. This significantly lowers the barriers to access and the cost compared to approaches requiring dedicated wearable sensors for accelerometry-based systems [9,10] or specialized non-contact radar equipment [11,12]. Secondly, the burden of calibration and daily use on patients is low. Initial guidance on reading a scripted text into the device is straightforward, with complex audio processing and AI-driven FEV1 estimation handled by the backend. This contrasts with the potential need for precise sensor placement in wearable systems or the management of environmental confounders pertinent to radar-based methods [12]. Finally, for integration into clinical workflows, secure and confidential data handling is paramount. Any system deployed would require the robust, encrypted transmission of audio data and FEV1 estimates to clinicians, adhering strictly to data protection regulations such as GDPR. These elements suggest a viable pathway for the development of an accessible, less burdensome, and scalable tool for remote respiratory monitoring compared to some existing alternatives.

4.4. Limitations

In this study, 23 elderly participants, all diagnosed with COPD, were recruited from the same geographical area. This resulted in a homogeneous group. Hence, this study did not consider the effects of different dialects or accents on the FEV1 estimations. However, this homogeneity also conferred certain advantages. Given the small population group, the homogeneity allowed for dataset splitting without the need to correct for individual differences. Consequently, the leave-one-subject-out approach was unnecessary. Under normal circumstances, the leave-one-subject-out approach is preferable, as it allows for predictions on unseen individuals [31].
Pramono et al. [26] noted that variations in recording environments and sensor placement significantly affect the quality and performance of respiratory sound analysis. Similar challenges were observed in this study, as the unsupervised home recordings occasionally exhibited reduced quality or non-adherence to the protocol, necessitating stringent quality control and the exclusion of some recordings (862 out of 2946 recordings). The extreme data values encountered in real-world settings are likely attributable to recording artefacts or transient issues with sound quality. Some of this variance may also be caused by true physiological outliers, as seen for Participant 10, who had a median FEV1 of 610 mL/s but managed to record a value of 1880 mL/s at one point. Distinguishing between artefactual extremes and genuine, clinically significant outliers is a non-trivial challenge. While our current filtering addresses quality issues, developing robust strategies to appropriately handle these diverse extreme values is an important area for future research. This would require deeper investigation to determine whether rule-based deterministic solutions, more sophisticated machine learning-based outlier detection tools, or hybrid approaches are most effective in preserving valuable clinical information while minimizing the impact of noise in sound data. This underscores the importance of refining the data collection protocols to ensure consistency in future implementations. The reliance on low-cost sensors, while advantageous for accessibility, can introduce variation, as observed in the data quality challenges encountered. Nevertheless, our study suggests that, with appropriate data preprocessing and machine learning models like KNN, physiological parameters such as the FEV1 can still be estimated. This highlights an important consideration: leveraging the convenience of home monitoring with low-cost sensors requires careful data handling and algorithms capable of managing noise and real-world conditions. Future work could explore adaptive filtering or environment-aware calibration to potentially improve the robustness [32].
This study did not consider potential comorbidities that could affect speech and breathing. However, attempting to incorporate all comorbidities into a predictive model would lead to overfitting and practical challenges in maintenance. Requiring daily self-monitoring through scripted recitations may intensify COPD patients’ awareness of their illness and impact their well-being. This daily burden and potential negative impact could, in turn, lead to inconsistencies in compliance (e.g., skipped days) and consequently affect data quality (e.g., rushed or incomplete recordings). However, these issues regarding the patient burden and variations are also inherent to conventional spirometry tests. A notable drawback of the current model is its reliance on active patient engagement. Users must consciously perform daily scripted readings. While this still requires user interaction, it is less physically demanding and potentially more convenient than performing a full spirometry test. The active participation requirements underscore an area for future research towards more passive monitoring systems. Furthermore, a limitation in developing the current model is the reliance on FEV1 data derived from unsupervised home spirometry for training. As Anand et al. [8] reported, unsupervised tests can yield different, often lower, FEV1 values compared to supervised tests. This inherent variability in the training target data may limit the precision achievable by the current model. To potentially enhance the model accuracy and establish a more robust ground truth, future iterations could benefit from training data where the FEV1 is assessed under supervised spirometry.
Finally, the potential of this home monitoring approach could extend beyond FEV1 tracking. Longitudinal voice data, possibly combined with inputs from other low-cost sensors (e.g., wearable activity trackers, environmental sensors, or simple pulse oximeters), might inform predictive models aiming for earlier awareness of COPD exacerbations [33]. Identifying these events earlier is a goal in COPD management, and voice biomarkers could represent a non-intrusive data stream for such multi-modal monitoring systems.

5. Conclusions

Developing a predictive model using voice recordings from participants with moderate to severe COPD at home enabled us to successfully estimate the FEV1 with encouraging performance. The KNN model achieved an RMSE of 174 mL/s and exhibited an LoA narrower than those observed between supervised and unsupervised spirometry. These findings indicate that ML models like KNN can effectively learn from audio data for FEV1 prediction, suggesting a viable pathway for non-invasive, audio-based assessments that could improve patient comfort and the data collection frequency in domestic environments. Such AI approaches have the potential to enhance COPD management. Future research could focus on validating these findings in broader populations and exploring the integration of multi-modal data for even more robust predictions, as well as investigating methods for autonomous patient monitoring that minimize active device engagement. This represents a step towards the more accessible and proactive monitoring of respiratory health.

Author Contributions

N.L.-N. was responsible for data analysis, coding, and manuscript writing. N.L.-N. also played a key role in conceptualizing the study, ensuring the robustness and accuracy of the analytical methods used. L.M. provided crucial supervision and ideas that were essential to the data analysis. L.M. also contributed to and critically reviewed the analytical methods and results. P.M. provided advice and comments essential to the data analysis and result interpretation. S.N.B. was the main supervisor and contributed to the development of the methodology and ensured the study’s adherence to relevant guidelines. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

This study was conducted in accordance with the Declaration of Helsinki and approved by the Regional Research Register, Region Zealand (Forskningsfortegnelsen, Region Sjælland, REG-100-2021, 1 November 2021).

Informed Consent Statement

Informed consent was obtained from all human participants in this study. Data obtained from participants were handled according to the General Data Protection Regulation (GDPR). Our research followed the Declaration of Helsinki’s ethical principles for medical research involving human participants.

Data Availability Statement

The data that support the findings of this study are not available.

Acknowledgments

The authors would like to thank the healthcare professionals and participants from Odsherred Municipality for their essential cooperation and support throughout the study. We would also like to offer our thanks to Lise-Lotte Teilman and Maya Milter, who were the main facilitators behind the PreCare Clinic, in which the study took place.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
Activation (MLP): The function determining the output of a neuron given an input (e.g., ‘ReLU’, ‘tanh’).
Algorithm (KNN): The method used to find the nearest neighbors (e.g., ‘ball_tree’, ‘kd_tree’, ‘brute’).
Alpha (regularization parameter): Controls the penalty strength applied to model complexity to prevent overfitting (used for L1, L2, or both, depending on the model).
Colsample_bytree (XGBoost): The fraction of features (columns) considered when building each tree.
COPD: Chronic obstructive pulmonary disease
FEV1: Forced expiratory volume in one second
Gamma (XGBoost): Minimum loss reduction required to perform a further partition (split) on a leaf node of the tree; acts as a regularization parameter.
Hidden_layer_sizes (MLP): Defines the architecture of the MLP, specifying the number of neurons in each hidden layer.
IQR: Interquartile range
KNN: K-nearest neighbors
L1_ratio (ElasticNet): Specifies the mixing proportion between L1 (lasso) and L2 (ridge) penalties in ElasticNet regularization (0 = L2, 1 = L1).
Learning_rate (XGBoost, MLP): Controls the step size at each iteration while moving toward a minimum of the loss function; influences the convergence speed and stability.
LoA: Limit of agreement
MAE: Mean absolute error
MAPE: Mean absolute percentage error
Max_depth: The maximum depth allowed for individual decision trees in an ensemble, controlling model complexity.
Max_iter (MLP): The maximum number of training iterations (epochs) allowed for the MLP solver to converge.
MFCC: Mel-frequency cepstral coefficient
Min_child_weight (XGBoost): The minimum sum of instance weights needed in a child node; acts as a regularization parameter.
Min_samples_leaf: The minimum number of data samples required to be present in a leaf node of a decision tree.
Min_samples_split: The minimum number of data samples required within a node to allow it to be split further in a decision tree.
MLP: Multi-layer perceptron
MSE: Mean squared error
N_estimators (random forest, XGBoost): The number of decision trees included in the ensemble model.
N_neighbors (KNN): The number of nearest neighbors considered to produce a prediction.
PCA: Principal component analysis
RF: Random forest
RFE: Recursive feature elimination
RMSE: Root mean squared error
Solver (MLP): The algorithm used to optimize the weights of the MLP during training (e.g., ‘Adam’).
Subsample (XGBoost): The fraction of the training data samples used to fit each individual tree.
Weights (KNN): Specifies how the influence of neighbors is weighted in the predictions (e.g., ‘uniform’: all equal; ‘distance’: closer neighbors have a greater influence).
WER: Word error rate
XGB: XGBoost

Appendix A

Appendix A.1. Sommerdrøm—Under et Blomstrende æbletræ Kumbel: Gruk—19

Under en æblekrone, som bierne summer i, går jeg i middagshi. Ned fra den lyse zone lyder en summetone, sommerens melodi. Mellem de hvide grene løber en luftig sti. Her er de to alene, blomstrende træ og bi. Dette er livets scene i sommerligt sceneri. Biens arbejdsomme færden blir til en alegori i drømmen den overgår i. Let som en ordleg er den: dét er den hvide verden, og dét er den kække bi!

Appendix A.2. Den Første Gang Jeg Så Dig af Kim Larsen and Kjukken

Den første gang, jeg så dig, det var en sommerdag En sommerdag da solen lyste klart Og alle engens blomster de stod i stakkevis Og vuggede som unge brudepar.
Og vinden kom så stille ned til stranden, hvor vi stod Og bølgerne de klukkede og kyssede din fod.
Den første gang jeg så dig, det var en sommer dag Den første gang du holdt mig i hånden.
Den første gang jeg så dig, der skinned’ sommer sky Så blendende som syv vilde svaner Jeg mødte dig i vrimlen et sted midt ind’ i byen Ofelia og den sidste mohikan.

Appendix A.3. Jeg Plukker Fløjlsgræs af Sigfred Pedersen and Knud Vad-Thomsen

En vår er kommet så mild og stille Med drømmedufte om mark og eng På sølvstreng klimprer den friske kilde Og fuld af sødme af fuglens sang.
Jeg plukker fløjlsgræs og ridderspore Jeg plukker mandstro og jomfrusko Men blodet banker i mine årer Mit lille hjerte har ingen ro Ja, blodet banker i mine årer Mit lille hjerte har ingen ro.
Jeg synger ikke, jeg danser ikke Der er så meget, jeg ikke når Er jeg bedrøvet? Jeg ved det ikke Jeg drømmer bare, og tiden går.

Appendix A.4. Jeg Vil La’ Lyset Brænde af Ray Dee Ohh

Stille sommeraftenvind Månen trækker i din sjæl Kalder ud, der hvor regnbuen ender La’ mig hviske til farvel.
Jeg vil la’ lyset brænde Og la’ min dør stå på klem Indtil du har set, hvad du ville Indtil du er hos mig igen Jeg vil la’ lyset brænde Og la’ min dør stå på klem Se, mit hjerte gløder i natten For at du kan finde vejen hjem.
Stille sommermorgenregn Gennem byen med mig selv Vinker højt mod din flyver på himlen Hør mig hviske til farvel.

Appendix B

Figure A1. Distribution of Word Error Rates (WER) for starting segments versus entire transcriptions before and after quality filtering. Panels (A) and (B) show WER distributions for starting segments and entire transcriptions before filtering, respectively. Panels (C) and (D) show the corresponding distributions after applying quality thresholds (80th percentile cutoff of 0.667 for starting segments and 90th percentile cutoff of 0.851 for entire transcriptions).

Appendix C

Table A1. Definitions of the features extracted from audio and physiological measurements. These features, described in Section 2.2, were used for the training and evaluation of the machine learning models.
Energy: The total magnitude of the signal, representing the amount of sound energy.
Entropy of Energy: The variation in energy throughout the signal, indicating dynamic changes.
Spectral Centroid: The balance point of the spectrum, indicating the brightness of a sound.
Spectral Spread: A measure of the spread of the spectrum, indicating the timbral texture.
Spectral Entropy: The randomness in the spectral amplitude distribution, a measure of signal complexity.
Spectral Flux: The rate of change in the spectral power, indicating the texture or timbral change.
Spectral Rolloff: The frequency below which a specified percentage of the total spectral energy lies.
Mel Frequency Cepstral Coefficient: Features that capture key aspects of the spectral envelope (shape of the power spectrum) of a sound, using the Mel frequency scale, which approximates human auditory perception.
Chroma Vector: A representation of the energy content within each pitch class, related to the harmonic and melodic content.
Chroma Deviation: The variation from a standard chroma vector, indicating deviations in harmonic content.
Zero Crossing Rate: The rate at which the signal changes sign, related to the frequency content of the signal.
Linear Predictive Coding: Coefficients derived from a model that predicts future signal samples based on past samples, often used to represent the vocal tract filter.
Word Error Rate: The rate of errors in speech transcription, indicating the accuracy of voice recognition.
Speech Pauses: The presence or duration of pauses in speech, potentially indicating respiratory issues or speech flow.
Heart Rate: The number of heart beats per minute, indicating cardiovascular health.
Body Temperature: The measured temperature of the body, an indicator of metabolic and overall health.
Oxygen Saturation: The percentage of oxygen carried by red blood cells to the body, indicating the respiratory efficiency.

Appendix D

Table A2. All selected features after the use of the voting selector. An X indicates the statistical measures selected by the feature selection algorithm for model inclusion.
Feature | Mean | Median | Standard Deviation | Max | IQR | Range | Non-Aggregated
Zero Crossing RateX X
Energy
Entropy of Energy X X
Spectral Centroid X X
Spectral Spread X
Spectral Entropy X
Spectral Rolloff XX
MFCCXXXXXX
Chroma VectorXXXXXX
Chroma Deviation
Linear Predictive CodingX
Pulse X
Temperature X
Oxygen Saturation X
Word Error Rate X
Speech Pauses
Relative Energy Difference X
Delta Zero Crossing Rate XX
Delta Energy
Delta Entropy of Energy X
Delta Spectral Centroid X X
Delta Spectral Spread
Delta Spectral Entropy XX
Delta Spectral Rolloff X X
Delta MFCC X XX
Delta Chroma Vector XX X
Delta Chroma DeviationX X X

References

  1. Bensoussan, Y.; Elemento, O.; Rameau, A. Voice as an AI Biomarker of Health—Introducing Audiomics. JAMA Otolaryngol. Head Neck Surg. 2024, 150, 283–284. [Google Scholar] [CrossRef] [PubMed]
  2. World Health Organization. Global Surveillance, Prevention and Control of Chronic Respiratory Diseases: A Comprehensive Approach; World Health Organization: Geneva, Switzerland, 2007.
  3. Kerkhof, M.; Voorham, J.; Dorinsky, P.; Cabrera, C.; Darken, P.; Kocks, J.W.H.; Sadatsafavi, M.; Sin, D.D.; Carter, V.; Tran, T.N.; et al. Association between COPD exacerbations and lung function decline during maintenance therapy. Thorax 2020, 75, 744–753. [Google Scholar] [CrossRef] [PubMed]
  4. Kakavas, S.; Kotsiou, O.S.; Perlikos, F.; Gourgoulianis, K.I.; Steiropoulos, P. Pulmonary function testing in COPD: Looking beyond the curtain of FEV1. NPJ Prim. Care Respir. Med. 2021, 31, 23. [Google Scholar] [CrossRef]
  5. Liu, X.L.; Tan, J.Y.; Wang, T.; Zhang, Q.; Zhang, M.; Yao, L.Q.; Chen, J.X. Effectiveness of home-based pulmonary rehabilitation for patients with chronic obstructive pulmonary disease: A meta-analysis of randomized controlled trials. Rehabil. Nurs. 2014, 39, 36–59. [Google Scholar] [CrossRef]
  6. Johnson, B.; Theobald, J.; Darcy, K.; Mottershaw, M.; Brassington, K.; Thickett, D.R. Improving spirometry testing by understanding patient preferences. ERJ Open Res. 2021, 7, 00766–02020. [Google Scholar] [CrossRef]
  7. Parsons, K.; Thomas, P.; Bevan-Smith, E.; Doran, O.; Sama, S. Patient perceived facilitators to greater self-management using home spirometry. Eur. Respir. J. 2022, 60, 2325. [Google Scholar] [CrossRef]
  8. Anand, R.; Topriceanu, C.C.; Keir, G.; Williamson, J.P.; Gao, J. Unsupervised home spirometry versus supervised clinic spirometry for respiratory disease: A systematic methodology review and meta-analysis. Eur. Respir. Rev. 2023, 32, 220135. [Google Scholar] [CrossRef] [PubMed]
  9. Sang, B.; Wen, H.; Junek, G.; Neveu, W.; Di Francesco, L.; Ayazi, F. An accelerometer-based wearable patch for robust respiratory rate and wheeze detection using deep learning. Biosensors 2024, 14, 118. [Google Scholar] [CrossRef]
  10. Minakata, Y.; Azuma, Y.; Sasaki, S.; Murakami, Y. Objective measurement of physical activity and sedentary behavior in patients with chronic obstructive pulmonary disease: Points to keep in mind during evaluations. J. Clin. Med. 2023, 12, 3254. [Google Scholar] [CrossRef]
  11. Wang, W.; Wan, Y.; Li, C.; Chen, Z.; Zhang, W.; Zhao, L.; Zhao, J.; Mu Li, G. Millimetre-wave radar-based spirometry for the preliminary diagnosis of chronic obstructive pulmonary disease. IET Radar Sonar Navig. 2023, 17, 1874–1885. [Google Scholar] [CrossRef]
  12. Islam, S.M.M. Radar-based remote physiological sensing: Progress, challenges, and opportunities. Front. Physiol. 2022, 13, 955208. [Google Scholar] [CrossRef] [PubMed]
  13. Alam, M.Z.; Patel, A.; Bui, F.M.; Fazel-Rezai, R.; Sazonov, E.; Bobhate, P.; Jaiswal, N.; Batsis, J.A.; Ramachandran, S.K.; McSharry, P.; et al. Predicting pulmonary function from the analysis of voice: A machine learning approach. Front. Digit. Health 2022, 4, 750226. [Google Scholar] [CrossRef] [PubMed]
  14. Nathan, V.; Paul, S.; Prioleau, T.; Niu, L.; Mortazavi, B.J.; Camargo, C.A.; Guttag, J.; Dy, J.; Jaimovich, D.; Colantonio, L.D.; et al. Assessment of chronic pulmonary disease patients using biomarkers from natural speech recorded by mobile devices. In Proceedings of the IEEE 16th International Conference on Wearable and Implantable Body Sensor Networks (BSN), Chicago, IL, USA, 19–22 May 2019. [Google Scholar] [CrossRef]
  15. Claxton, S.; Williams, G.; Roggen, D.; Rotheram, S.; Lam, C.; Howard, S.; Khawaja, S.; Price, D.B.; Crooks, M.G. Identifying acute exacerbations of chronic obstructive pulmonary disease using patient-reported symptoms and cough feature analysis. NPJ Digit. Med. 2021, 4, 107. [Google Scholar] [CrossRef] [PubMed]
  16. Xu, W.; Zhou, Y.; Zhao, M.; Wang, L.; Zhang, X.; Chen, Q.; Xie, Q.; Gao, B.; Li, B.; Shi, Y. A forced cough sound based pulmonary function assessment method by using machine learning. Front. Public Health 2022, 10, 1015876. [Google Scholar] [CrossRef] [PubMed]
  17. Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust speech recognition via large-scale weak supervision. In Proceedings of the International Conference on Machine Learning (ICML), Honolulu, HI, USA, 23–29 July 2023; PMLR 202. pp. 28491–28518. [Google Scholar]
  18. Mozilla. Common Voice Corpus v16.1. Available online: https://commonvoice.mozilla.org/da/datasets (accessed on 4 April 2024).
  19. Wiechern, B.; Liberty, K.A.; Pattemore, P.; Lin, E. Effects of asthma on breathing during reading aloud. Speech Lang. Hear. 2018, 21, 30–40. [Google Scholar] [CrossRef]
  20. Craney, T.A.; Surles, J.G. Model-Dependent Variance Inflation Factor Cutoff Values. Qual. Eng. 2002, 14, 391–403. [Google Scholar] [CrossRef]
  21. Kursa, M.B.; Rudnicki, W.R. Feature Selection with the Boruta Package. J. Stat. Softw. 2010, 36, 1–13. [Google Scholar] [CrossRef]
  22. Santamato, V.; Tricase, C.; Faccilongo, N.; Iacoviello, M.; Pange, J.; Marengo, A. Machine learning for evaluating hospital mobility: An Italian case study. Appl. Sci. 2024, 14, 6016. [Google Scholar] [CrossRef]
  23. Polisena, J.; Tran, K.; Cimon, K.; Hutton, B.; McGill, S.; Palmer, K.; Scott, R.E. Home telehealth for chronic obstructive pulmonary disease: A systematic review and meta-analysis. J. Telemed. Telecare 2010, 16, 120–127. [Google Scholar] [CrossRef]
  24. Chen, C.; Ding, S.; Wang, J. Digital health for aging populations. Nat. Med. 2023, 29, 1623–1630. [Google Scholar] [CrossRef]
  25. Majumder, S.; Mondal, T.; Deen, M.J. Wearable sensors for remote health monitoring. Sensors 2017, 17, 130. [Google Scholar] [CrossRef] [PubMed]
  26. Pramono, R.X.A.; Bowyer, S.; Rodriguez-Villegas, E. Automatic adventitious respiratory sound analysis: A systematic review. PLoS ONE 2017, 12, e0177926. [Google Scholar] [CrossRef] [PubMed]
  27. Kuhn, M.; Johnson, K. Applied Predictive Modeling; Springer: New York, NY, USA, 2013. [Google Scholar] [CrossRef]
  28. Zargari Marandi, R.; Madeleine, P.; Omland, Ø.; Vuillerme, N.; Samani, A. An oculometrics-based biofeedback system to impede fatigue development during computer work: A proof-of-concept study. PLoS ONE 2019, 14, e0213704. [Google Scholar] [CrossRef] [PubMed]
  29. Haider, N.S.; Singh, B.K.; Periyasamy, R.; Behera, A.K. Respiratory sound based classification of chronic obstructive pulmonary disease: A risk stratification approach in machine learning paradigm. J. Med. Syst. 2019, 43, 255. [Google Scholar] [CrossRef]
  30. Jolliffe, I.T.; Cadima, J. Principal component analysis: A review and recent developments. Philos. Trans. R. Soc. A 2016, 374, 20150202. [Google Scholar] [CrossRef]
  31. Dehghani, A.; Glatard, T.; Shihab, E. Subject cross validation in human activity recognition. arXiv 2019, arXiv:1904.02666. [Google Scholar]
  32. Cook, D.; Feuz, K.D.; Krishnan, N.C. Transfer learning for activity recognition: A survey. Knowl. Inf. Syst. 2013, 36, 537–556. [Google Scholar] [CrossRef]
  33. Sanchez-Morillo, D.; Fernandez-Granero, M.A.; Leon-Jimenez, A. Use of predictive algorithms in-home monitoring of chronic obstructive pulmonary disease and asthma: A systematic review. Chronic Respir. Dis. 2016, 13, 264–283. [Google Scholar] [CrossRef]
Figure 1. Schematic overview of the data processing and machine learning workflow, from initial data acquisition to final model selection. COPD: Chronic Obstructive Pulmonary Disease; RMSE: Root Mean Square Error; MAPE: Mean Absolute Percentage Error; MAE: Mean Absolute Error; MSE: Mean Squared Error; WER: Word Error Rate.
Figure 2. Flowchart illustrating the audio data screening and filtering process. Starting with 2946 collected audio files, sequential filters were applied to remove recordings with missing associated spirometry data or identifiers (n = 202), files smaller than 300 kb (n = 241), files with a high word error rate (WER) indicating poor quality or transcription alignment (n = 419), and files linked to invalid FEV1 measurements (0 mL/s, n = 31). This resulted in a final dataset of 2053 audio files (2019 from COPD participants, 34 from control) used for analysis.
Figure 3. Bland–Altman plot comparing predicted and actual lung function measurements. The plot shows the difference between measurements (y-axis) against their mean (x-axis), with a mean difference of 7.03 mL/s (solid line) and 95% limits of agreement from −333.21 to 347.26 mL/s (dashed lines). The scatter pattern suggests generally good agreement between the two measurement methods, with most differences falling within the limits of agreement, although there are some notable outliers at higher average values. The relatively symmetric distribution around the mean difference line indicates no systematic bias in the predictions.
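The mean difference and limits of agreement follow the standard Bland–Altman construction (mean difference ± 1.96 standard deviations of the differences). A minimal sketch:

```python
import numpy as np

def bland_altman(predicted: np.ndarray, actual: np.ndarray):
    """Mean difference and 95% limits of agreement (mean ± 1.96 SD)."""
    diff = predicted - actual
    mean_diff = diff.mean()
    sd_diff = diff.std(ddof=1)           # sample standard deviation
    loa = (mean_diff - 1.96 * sd_diff, mean_diff + 1.96 * sd_diff)
    return mean_diff, loa
```

Applied to the study's validation predictions, this construction should yield the reported mean difference of 7.03 mL/s and limits of −333.21 to 347.26 mL/s.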
Figure 4. Learning curve analysis of the XGBoost model, showing the negative root mean squared error (RMSE) against the number of training examples. The red line represents the training score, while the green line shows the cross-validation score with its standard deviation (shaded area). The converging trend between the training and validation scores as the training examples increase indicates reduced overfitting. Training performance ranges from approximately 0 to −50, while validation performance improves from −300 to −170 with increased data.
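A learning curve of this kind can be generated with scikit-learn's learning_curve utility. The sketch below uses the asterisked XGBoost hyperparameters from Table 2 and synthetic placeholder data, since the study's acoustic features are not distributed with the article.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import learning_curve
from xgboost import XGBRegressor

# Synthetic stand-in for the acoustic feature matrix and FEV1 targets.
X, y = make_regression(n_samples=500, n_features=40, noise=10.0, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    XGBRegressor(n_estimators=200, learning_rate=0.1, max_depth=4,
                 min_child_weight=3, gamma=0.0, subsample=0.7,
                 colsample_bytree=0.7),      # best values from Table 2
    X, y,
    cv=10,                                   # matches the study's 10-fold CV
    scoring="neg_root_mean_squared_error",   # negative RMSE, as plotted
    train_sizes=np.linspace(0.1, 1.0, 10),
)
print(train_scores.mean(axis=1))  # training curve (red line in Figure 4)
print(val_scores.mean(axis=1))    # cross-validation curve (green line)
```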
Figure 5. Box plot comparison of prediction error distributions across seven machine learning models (ridge, lasso, ElasticNet, RF, XGB, KNN, and MLP) on the validation set. Green triangles indicate mean errors, while the boxes show the interquartile range (IQR: 25th to 75th percentiles) with the median line. Whiskers extend to the most extreme points within 1.5 × IQR from the box edges, and points beyond these whiskers are considered outliers. The K-nearest neighbors (KNN) model demonstrates the best overall performance, with the smallest IQR and fewer extreme outliers compared to other models. While all models show symmetric error distributions centered near zero, they vary in their spread and outlier patterns.
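The plotting conventions described in the caption (mean markers, 1.5 × IQR whiskers) correspond to matplotlib's defaults for boxplot with means shown. A sketch with synthetic error arrays in place of the study's per-model validation errors:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in: model name -> validation errors (predicted minus actual, mL/s)
rng = np.random.default_rng(0)
errors = {m: rng.normal(0.0, 200.0, 300)
          for m in ["ridge", "lasso", "ElasticNet", "RF", "XGB", "KNN", "MLP"]}

fig, ax = plt.subplots()
ax.boxplot(list(errors.values()), showmeans=True, whis=1.5)  # mean markers, 1.5*IQR whiskers
ax.set_xticks(range(1, len(errors) + 1), labels=list(errors.keys()))
ax.axhline(0.0, linestyle="--", linewidth=0.8)               # zero-error reference
ax.set_ylabel("Prediction error (mL/s)")
plt.show()
```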
Table 1. Overview and rationale for selected machine learning regression models.
Ridge Regression (linear model with L2 regularization)
- Characteristics: Handles multicollinearity, shrinks coefficients, tests the simplest relationship form.
- Practical considerations: Computationally efficient (convex optimization). Highly interpretable via coefficient magnitudes, indicating feature influence. Requires feature scaling for coefficient comparison. Less prone to overfitting than standard linear regression in cases of multicollinearity.

Lasso Regression (linear model with L1 regularization)
- Characteristics: Provides sparse models via automatic feature selection (shrinks some coefficients to zero). Useful for identifying potentially key predictive features.
- Practical considerations: Computationally efficient. Performs feature selection, yielding sparse and potentially simpler models. High interpretability through non-zero coefficients. Requires feature scaling. Can be unstable with highly correlated features (may select one arbitrarily).

ElasticNet (linear model with L1 and L2 regularization)
- Characteristics: Combines ridge/lasso strengths. Robustly handles multicollinearity while performing feature selection. Useful when groups of correlated features exist.
- Practical considerations: Computationally efficient. Balances L1/L2 penalties to handle correlated features effectively while performing feature selection. High interpretability via coefficients. Requires feature scaling and tuning of two hyperparameters (alpha and l1_ratio).

Random Forest (RF; ensemble learning, bagging of decision trees)
- Characteristics: Captures non-linearities/interactions. Generally robust to outliers and feature scaling. Provides feature importance measures.
- Practical considerations: Ensemble method; training involves building numerous trees, potentially requiring significant computation time and memory, but parallelizable. Less sensitive to feature scaling than distance-based or linear models. Moderate interpretability: provides global feature importance scores; requires post hoc methods (e.g., SHAP) for reliable local explanations.

XGBoost (ensemble learning, gradient-boosted decision trees)
- Characteristics: High-performance algorithm capturing non-linearities/interactions effectively. Utilizes regularized boosting for improved generalization.
- Practical considerations: Advanced gradient boosting implementation, often achieving high predictive accuracy. Can be computationally intensive and requires careful hyperparameter tuning (e.g., learning rate, tree depth, regularization). Moderate interpretability: provides feature importance; local explanations typically rely on methods like SHAP.

K-Nearest Neighbors (KNN; instance-based, non-parametric learning)
- Characteristics: Captures local structure. Makes no strong assumptions about the underlying data distribution. Requires scaled features. Sensitive to irrelevant features ("curse of dimensionality").
- Practical considerations: Non-parametric, instance-based learner. Minimal training time (stores data), but prediction complexity scales with dataset size (potentially slow). Highly sensitive to feature scaling and choice of distance metric. Interpretability is high locally (can examine neighbors influencing a prediction) but lacks a global, summarized model.

Multi-Layer Perceptron (MLP; artificial neural network)
- Characteristics: Universal approximator capable of learning highly complex, non-linear functions. Represents a distinct modeling approach. Requires scaled features.
- Practical considerations: Flexible neural network model requiring significant data, computational resources (often GPU acceleration), and careful tuning (architecture, optimizer, regularization). Sensitive to feature scaling. Generally considered a "black box" due to low direct interpretability; understanding predictions relies heavily on post hoc explanation techniques (e.g., SHAP, LIME).
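As an illustration of how these seven regressors might be instantiated (scikit-learn plus xgboost; not the authors' published code), the sketch below wraps the scale-sensitive models in a StandardScaler pipeline, following the scaling notes in the table:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from xgboost import XGBRegressor

# Scale-sensitive models get a StandardScaler step; tree ensembles do not need one.
models = {
    "ridge": make_pipeline(StandardScaler(), Ridge()),
    "lasso": make_pipeline(StandardScaler(), Lasso()),
    "elasticnet": make_pipeline(StandardScaler(), ElasticNet()),
    "rf": RandomForestRegressor(),
    "xgb": XGBRegressor(),
    "knn": make_pipeline(StandardScaler(), KNeighborsRegressor()),
    "mlp": make_pipeline(StandardScaler(), MLPRegressor()),
}
```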
Table 2. All trained models, hyperparameters, and values. The asterisk indicates the best-performing values. All definitions of models and hyperparameters can be found in the Abbreviations section.
Model | Hyperparameter | Values
Ridge | alpha | 10.0 *, 50.0, 100.0, 150.0, 200.0, 250.0, 300.0
Lasso | alpha | 0.001 *, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0
ElasticNet | alpha | 0.01 *, 0.1, 0.5, 1.0, 5.0, 10.0, 50.0, 100.0
ElasticNet | l1_ratio | 0.1 *, 0.3, 0.5, 0.7, 0.9
Random Forest | n_estimators | 50, 100 *, 150, 200
Random Forest | max_depth | None, 5, 10, 15, 20 *
Random Forest | min_samples_split | 2 *, 3, 4, 5
Random Forest | min_samples_leaf | 1, 2 *, 3, 4
XGBoost | n_estimators | 50, 100, 150, 200 *
XGBoost | learning_rate | 0.01, 0.05, 0.1 *, 0.2
XGBoost | max_depth | 3, 4 *, 5, 6
XGBoost | min_child_weight | 1, 2, 3 *, 4
XGBoost | gamma | 0.0 *, 0.1, 0.2, 0.3
XGBoost | subsample | 0.5, 0.6, 0.7 *, 0.8
XGBoost | colsample_bytree | 0.5, 0.6, 0.7 *, 0.8
K-Nearest Neighbors | n_neighbors | 2, 3, 5, 7 *, 10, 12
K-Nearest Neighbors | weights | uniform, distance *
K-Nearest Neighbors | algorithm | ball_tree *, kd_tree, brute
Multi-Layer Perceptron | hidden_layer_sizes | (50,), (100,), (50, 50), (100, 100) *
Multi-Layer Perceptron | activation | relu *, tanh
Multi-Layer Perceptron | solver | adam
Multi-Layer Perceptron | alpha | 0.0001, 0.001, 0.01, 0.1 *
Multi-Layer Perceptron | learning_rate | constant, adaptive *
Multi-Layer Perceptron | max_iter | 500, 1000, 1500 *
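A grid of this kind can be searched with 10-fold cross-validation via scikit-learn's GridSearchCV. The sketch below shows the KNN grid exactly as listed in Table 2; variable names are hypothetical and the fit call assumes a (scaled) feature matrix is available.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

param_grid = {  # KNN grid as in Table 2
    "n_neighbors": [2, 3, 5, 7, 10, 12],
    "weights": ["uniform", "distance"],
    "algorithm": ["ball_tree", "kd_tree", "brute"],
}
search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid,
    cv=10,                                   # 10-fold cross-validation
    scoring="neg_root_mean_squared_error",   # RMSE-based model selection
)
# search.fit(X_train, y_train)
# On this study's data, the selected values were n_neighbors=7,
# weights="distance", and algorithm="ball_tree" (asterisks in Table 2).
```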
Table 3. Lung function statistics, sex, age, and number of audio files for each participant, including the median, interquartile range (IQR), minimum, and maximum FEV1.
(Median, IQR, Min, and Max refer to FEV1 in mL/s.)

Participant ID | Sex | Age | Audio Files | Median | IQR | Min | Max
1 | M | 76 | 161 | 1050 | 110 | 910 | 1310
2 | M | 70 | 225 | 1780 | 150 | 1490 | 2110
3 | M | 65 | 28 | 1865 | 412.5 | 1120 | 2380
4 | F | 72 | 109 | 1280 | 90 | 1110 | 1480
5 | M | 68 | 117 | 440 | 50 | 370 | 540
6 | M | 77 | 91 | 900 | 110 | 730 | 1130
7 | F | 76 | 8 | 770 | 47.5 | 700 | 810
8 | M | 79 | 113 | 780 | 50 | 720 | 920
9 | M | 75 | 66 | 1880 | 147.5 | 1610 | 2220
10 | F | 73 | 127 | 610 | 90 | 400 | 1880
11 | M | 75 | 13 | 1530 | 90 | 1380 | 1660
12 | M | 84 | 89 | 530 | 50 | 440 | 650
13 | F | 73 | 13 | 600 | 25 | 580 | 630
14 | F | 78 | 152 | 900 | 120 | 740 | 1080
15 | F | 64 | 87 | 910 | 75 | 770 | 1080
16 | F | 64 | 77 | 1490 | 180 | 650 | 2180
17 | M | 79 | 18 | 925 | 250 | 550 | 1190
18 | F | 72 | 126 | 1140 | 137.5 | 840 | 1560
19 | F | 76 | 146 | 1140 | 130 | 960 | 1500
20 | M | 68 | 16 | 455 | 30 | 420 | 520
21 | M | 72 | 88 | 2200 | 242.5 | 1840 | 2590
22 | F | 73 | 54 | 990 | 60 | 880 | 1180
23 | F | 73 | 95 | 600 | 120 | 380 | 880
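The per-participant summaries in Table 3 correspond to a simple group-by aggregation over the filtered recordings. A sketch with hypothetical column names:

```python
import pandas as pd

def participant_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-participant FEV1 summary, as in Table 3."""
    g = df.groupby("participant_id")["fev1_ml_s"]
    return pd.DataFrame({
        "n_files": g.size(),                          # number of audio files
        "median": g.median(),
        "iqr": g.quantile(0.75) - g.quantile(0.25),   # interquartile range
        "min": g.min(),
        "max": g.max(),
    })
```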
Table 4. Performance metrics of machine learning models for the estimation of FEV1 from audio data. The table shows the root mean square error (RMSE, mL/s), mean absolute percentage error (MAPE, %), mean absolute error (MAE, mL/s), and mean squared error (MSE, (mL/s)²) for both the training and evaluation datasets, together with the RMSE standard deviation for the training dataset. Models are ordered by increasing RMSE on the evaluation dataset (best-performing first).
Training
Model | RMSE | MAPE | MAE | MSE | RMSE Std
K-nearest neighbors | 138.92 | 8.73 | 84.98 | 19,716.86 | 20.43
XGBoost | 167.60 | 12.42 | 116.25 | 28,397.69 | 17.56
Random Forest | 180.75 | 12.12 | 112.74 | 33,296.43 | 25.00
Multi-Layer Perceptron | 202.42 | 15.09 | 145.45 | 41,644.93 | 25.88
Lasso | 250.14 | 21.20 | 190.78 | 62,831.40 | 16.24
Ridge | 275.07 | 23.62 | 205.38 | 75,884.98 | 14.85
ElasticNet | 280.49 | 24.18 | 209.27 | 78,887.91 | 14.64

Evaluation
Model | RMSE | MAPE | MAE | MSE
K-nearest neighbors | 173.73 | 9.90 | 93.82 | 30,182.83
XGBoost | 178.85 | 13.30 | 120.67 | 31,987.81
Random Forest | 202.99 | 14.74 | 123.43 | 41,202.95
Multi-Layer Perceptron | 227.83 | 16.52 | 157.04 | 51,904.96
Lasso | 264.16 | 21.06 | 190.88 | 69,781.04
Ridge | 300.74 | 25.91 | 217.3 | 90,446.22
ElasticNet | 307.23 | 26.69 | 222.49 | 94,391.37
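All four metrics in Table 4 are available in scikit-learn. A minimal sketch (note that scikit-learn's MAPE is returned as a fraction and is scaled to percent here, matching the table's units):

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error,
                             mean_absolute_percentage_error,
                             mean_squared_error)

def regression_metrics(y_true, y_pred):
    """RMSE/MAE in mL/s, MAPE in %, MSE in (mL/s)^2, as reported in Table 4."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "rmse": float(np.sqrt(mse)),
        "mape": 100.0 * mean_absolute_percentage_error(y_true, y_pred),
        "mae": mean_absolute_error(y_true, y_pred),
        "mse": mse,
    }
```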