Respiratory Rehabilitation Index (R2I): Unsupervised Clustering Approach to Identify COPD Subgroups Associated with Rehabilitation Outcomes

Ester Marra; Piergiuseppe Liuzzi; Andrea Mannini; Isabella Romagnoli; Francesco Gigliotti

doi:10.3390/diagnostics15162053

IRCCS Fondazione Don Carlo Gnocchi Onlus, 50143 Firenze, Italy

Diagnostics 2025, 15(16), 2053; https://doi.org/10.3390/diagnostics15162053

This article belongs to the Section Machine Learning and Artificial Intelligence in Diagnostics

Abstract

Background/Objectives: Chronic obstructive pulmonary disease (COPD) is a progressive condition whose heterogeneous endotypes, clinical manifestations, and recovery pathways complicate the identification of reliable predictors of rehabilitation outcomes. Several respiratory and functional assessments are available with no consensus on the most predictive ones. While univariate markers may miss multifactorial interactions essential for prognosis, data-driven unsupervised clustering methods can integrate complex information from different sources. This study aimed to apply unsupervised clustering to identify pre-rehabilitation characteristics predictive of discharge outcomes for COPD patients undergoing pulmonary rehabilitation. Methods: A total of 126 COPD patients undergoing pulmonary rehabilitation were included in the analysis. Three assessments were performed at admission, namely the forced oscillation technique, spirometry, and the six-minute walk test (6MWT). The outcome was the change in 6MWT distance between admission and discharge. Unsupervised clustering methods were applied to admission variables to identify subgroups associated with outcomes. Results: Among the clustering algorithms tested, k-means (with N_cl = 2) provided the optimal solution. The resulting respiratory rehabilitation index (R2I) was significantly associated with the outcome dichotomized via the minimal clinically important difference of 30 m. Patients with R2I = 1, indicating severe functional and respiratory impairments, were associated with higher post-rehabilitation functional improvement (p = 0.032). While few functional parameters of 6MWT were statistically different between the groups identified by outcome, nearly all variables in the analysis exhibited significant distribution differences among the R2I clusters. 
Conclusions: These findings highlight the heterogeneity of COPD and the potential of unsupervised clustering to identify distinct patient subgroups, enabling more personalized rehabilitation strategies.

Keywords:

chronic obstructive pulmonary disease; unsupervised clustering; pulmonary rehabilitation

1. Introduction

Chronic obstructive pulmonary disease (COPD) represents a significant global health burden, leading to substantial morbidity, mortality, and health care costs [1]. Current estimates suggest a prevalence of around 3% in the general population, with projections indicating it will become the third leading cause of death and the seventh leading cause of disability-adjusted life years (DALYs) lost by 2050 [2]. In Italy, recent estimates indicate a slightly lower prevalence, affecting approximately 2–2.5% of the general population [2]. COPD is a progressive disease, often resulting in reduced quality of life and increased risk of exacerbations, hospitalizations, and mortality [3]. The current prognostic markers for COPD include a range of clinical, lung function, and imaging parameters. These may include lung capacity measures such as forced expiratory volume in the first second (FEV₁) [4], oxygen (O₂) saturation, and inflammatory biomarkers such as white blood cell count and C-reactive protein levels [5].

Pulmonary rehabilitation (PR) is a cornerstone intervention in COPD management, encompassing a multidisciplinary approach aimed at improving physical and psychological health. PR programs typically include exercise training, education about COPD management, and psychosocial support [6]. However, the response to PR varies widely across individuals, reflecting the significant heterogeneity of the disease. Indeed, COPD patients present with distinct phenotypes, each characterized by unique etiological, clinical, and prognostic profiles, which complicates efforts to predict rehabilitation outcomes and tailor personalized interventions [7]. Univariate markers such as FEV₁ [8], six-minute walk test (6MWT) covered distance [9], and scales assessing symptoms like the Medical Research Council (MRC) Dyspnoea Scale [10], have been widely used to assess COPD severity and response to rehabilitation.

However, these single-dimension measures may oversimplify the multifactorial nature of COPD, failing to capture the complex interconnections between clinical, functional, physiological, and psychological factors that shape individual recovery trajectories. This complexity highlights the need for analytical approaches capable of integrating multiple data sources to provide a more comprehensive characterization of disease variability.

Supervised machine learning methods, although highly effective for prediction tasks, rely on predefined outcome labels and therefore cannot uncover latent structures or describe the variability in patient responses.

Unsupervised machine learning techniques have increasingly been used to address the challenges posed by high-dimensional, heterogeneous data [11,12,13]. Their main strength lies in identifying complex, nonlinear relationships that traditional statistical methods may overlook, enabling a more data-driven stratification of patients. Although their clinical applicability is still constrained by the need for large and high-quality datasets, limited reproducibility, and reduced interpretability, these approaches have shown promise in supporting clinical decision-making and complementing conventional assessments.

Building on this rationale, unsupervised clustering is particularly suited to explore the underlying structure of the COPD population and identify patient subgroups with distinct recovery trajectories [14,15,16]. Recent applications of unsupervised clustering in COPD have demonstrated its potential and further support its use in this clinical context [17,18]. Despite the promise of clustering approaches, challenges remain in ensuring their reproducibility and clinical applicability. Variability in patient cohorts, data sources, and clustering methodologies across studies can lead to inconsistent results, raising concerns about the robustness of identified phenotypes.

This study employs unsupervised clustering methods to stratify COPD patients from clinical information at admission in a rehabilitation unit. The resulting index is then assessed for its predictive value on rehabilitation outcomes at discharge. To address the limitations of earlier studies, a set of variables from three distinct assessment domains, i.e., the 6MWT, forced oscillation technique (FOT), and spirometry, was considered. In addition, different clustering approaches were compared to verify consistency and robustness in subgroup identification.

2. Materials and Methods

2.1. Study Design and Collection

This study was based on both a prospective observational study (conducted from 2021 to 2022) and a retrospective observational study (from 2016 to 2018) carried out at the Pulmonary Rehabilitation Unit of IRCCS Fondazione Don Gnocchi ONLUS in Florence. The studies enrolled COPD patients undergoing an outpatient pulmonary rehabilitation program (PRP). PRP was conducted in accordance with the American Thoracic Society (ATS) and the European Respiratory Society (ERS) recommendations [19] and included education, aerobic exercise training for both upper and lower limbs, and breathing retraining. The studies shared the same inclusion criteria: patients had to meet the COPD definition outlined by GOLD standards [20]; the severity of airflow obstruction ranged from moderate to very severe according to the GOLD classification; participants were former smokers in stable condition for at least four weeks prior to enrollment; and they were receiving optimal standard treatment as recommended by GOLD guidelines. Patients with recent cardiovascular events or with neuromuscular or osteoarticular diseases that limited physical exercise and/or compromised lung mechanical properties were excluded from the PRP. The studies were approved by the Research Ethics Committee (r.n.18765_oss; r.n.15217_oss). All participants provided written informed consent at the time of assessment. The variables of interest were evaluated at two time points, namely admission (T0) and discharge (T1) over 20 sessions of the pulmonary rehabilitation program.

This analysis incorporated three distinct respiratory tests (Figure 1), including the FOT [21], spirometry [22], and the 6MWT [23], each serving a specific purpose in assessing respiratory function. The FOT procedure focused on assessing respiratory impedance by recording multiple measurements while participants breathed normally. This non-invasive technique provided detailed insights into the mechanical properties of the respiratory system, helping to identify potential abnormalities or dysfunctions [21]. Spirometry provided insightful data on lung volume and air-flow dynamics [22]. The 6MWT involved participants walking briskly for six minutes while vital signs and the distance covered were recorded. This test provided insights into participants’ functional capacity and endurance, offering a practical measure of their overall cardiopulmonary health [23].

Figure 1. Data collection (A) and clustering analysis (B) pipeline.

2.2. Data Preparation

The outcome of the study was the 6MWT covered distance variation between T0 and T1 (namely, Delta meters). Patients with missing outcome data were excluded from the analysis. In line with international clinical guidelines, specifically the ATS/ERS technical standard on field walking tests under chronic respiratory conditions [24], a minimal clinically important difference (MCID) threshold of 30 m was used to define clinically significant improvement (CSI) in the 6MWT. Consequently, the outcome was dichotomized as follows:

\mathrm{6MWT}_{\mathrm{CSI}} = \begin{cases} 0, & \text{if Delta meters} < 30 \\ 1, & \text{if Delta meters} \geq 30 \end{cases} \qquad (1)
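As a minimal sketch of Equation (1), the dichotomization can be written in Python as follows. The function name `dichotomize_6mwt`, the `mcid_m` argument, and the example values are illustrative assumptions and are not taken from the study code or data.

```python
def dichotomize_6mwt(delta_meters, mcid_m=30.0):
    """Return 1 if the change in 6MWT distance reaches the MCID, else 0."""
    return 1 if delta_meters >= mcid_m else 0


# Hypothetical Delta meters values (T1 minus T0), for illustration only.
example_deltas = [-12.0, 29.5, 30.0, 61.0]
example_labels = [dichotomize_6mwt(d) for d in example_deltas]
```

Note that a patient exactly at the threshold (30 m) is counted as improved, in line with the "greater than or equal to" condition of Equation (1).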

The independent variables of the study were collected at T0 from the three different assessment domains mentioned above, for a total of 26 variables. Specifically, within the FOT, respiratory system resistance (R_RS) and reactance (X_RS) were measured during inspiration at 5 Hz, along with its variation (ΔX_RS). Moreover, inspiratory time (T_I), the ratio of T_I to total time (T_I/T_TOT), expiratory time (T_E), mechanical ventilation (V_E), tidal volume (V_T), the percentage of respiratory flow (R_F%), and respiratory rate (R_R) were recorded. In spirometry, functional parameters were included, such as forced expiratory volume divided by slow vital capacity (FEV/SVC), FEV₁, total lung capacity (TLC), inspiratory capacity (IC), functional residual capacity (FRC), and residual volume (V_R). During the 6MWT, in addition to recording the total distance walked, patients were assessed from multiple perspectives, including O₂ levels, O₂ saturation, the Borg Dyspnoea Scale, and the Borg Scale for limb fatigue, measured twice, before and after the test.

A preliminary analysis was adopted to discard variables showing a cross-correlation greater than 0.8. Variables with missing values were imputed using a k-nearest neighbors (kNN)-based imputer from the Scikit-learn library [25]. Then, the remaining features were standardized by removing the mean and scaling to unit variance.
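The three preparation steps (correlation filter, kNN imputation, standardization) could be sketched as below. Several details are assumptions, since the text does not specify them: Pearson correlation in absolute value, dropping the later variable of each highly correlated pair, and `n_neighbors=5` for the imputer. The function names `drop_correlated` and `prepare_features` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler


def drop_correlated(df, threshold=0.8):
    """Drop the later variable of every pair whose |correlation| exceeds threshold."""
    corr = df.corr().abs()  # pairwise-complete Pearson correlation (assumed)
    # Keep only the strict upper triangle so each pair is inspected once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)


def prepare_features(df, threshold=0.8, n_neighbors=5):
    """Correlation filter, then kNN imputation, then z-score standardization."""
    reduced = drop_correlated(df, threshold=threshold)
    imputed = KNNImputer(n_neighbors=n_neighbors).fit_transform(reduced)
    scaled = StandardScaler().fit_transform(imputed)  # zero mean, unit variance
    return pd.DataFrame(scaled, columns=reduced.columns, index=reduced.index)
```

The order of the steps follows the text: variables are filtered first, so imputation and scaling only operate on the retained features.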

2.3. Clustering Methods

Patients were clustered according to four different unsupervised algorithms, including k-means [26], k-medoids [27], a Gaussian mixture model [28], and BIRCH (balanced iterative reducing and clustering using hierarchies) [29]. Input data for the unsupervised models were the independent variables of the analysis.

K-means partitions samples into a predefined number of groups by minimizing the within-cluster variance [26]. The algorithm was initialized with the k-means++ method, which selects initial centroids through distance-weighted probabilistic sampling so that they are spread across the data [30]. Computation was sped up using Elkan's method, which applies the triangle inequality to avoid unnecessary distance computations [31].
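A k-means configuration of this kind might look as follows in Scikit-learn. The values of `n_init` and `random_state` and the function name `fit_kmeans` are assumptions for illustration; the text only specifies k-means++ initialization and Elkan's method.

```python
import numpy as np
from sklearn.cluster import KMeans


def fit_kmeans(X, n_clusters=2, seed=0):
    """Fit k-means with k-means++ seeding and Elkan's accelerated updates."""
    model = KMeans(
        n_clusters=n_clusters,
        init="k-means++",   # distance-weighted sampling of initial centroids
        algorithm="elkan",  # triangle-inequality acceleration
        n_init=10,          # assumed number of restarts
        random_state=seed,  # assumed, for reproducibility
    )
    labels = model.fit_predict(X)
    return model, labels
```

Here `X` is expected to be the standardized feature matrix, with one row per patient.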

The k-medoids algorithm, a variation of k-means, partitions data into clusters by choosing representative points (medoids) and assigning each sample to the nearest medoid [27]. The algorithm was initialized with the k-medoids++ method (following an approach similar to k-means++).
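The alternating scheme behind k-medoids (assign each sample to its nearest medoid, then re-pick the medoid of each cluster) can be sketched with NumPy alone. This is an illustrative implementation, not the one used in the study: initialization is uniformly random rather than k-medoids++, Euclidean distance is assumed, and the function name `k_medoids` is hypothetical.

```python
import numpy as np


def k_medoids(X, k, n_iter=100, seed=0):
    """Alternating k-medoids; returns medoid indices and cluster labels."""
    rng = np.random.default_rng(seed)
    # Full pairwise Euclidean distance matrix (fine for small cohorts).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    medoids = rng.choice(len(X), size=k, replace=False)  # random init (assumed)
    for _ in range(n_iter):
        labels = D[:, medoids].argmin(axis=1)  # assignment step
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size:
                # New medoid: the member with the smallest total distance
                # to all other members of the same cluster.
                within = D[np.ix_(members, members)].sum(axis=0)
                new_medoids[j] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):
            break  # converged: medoids no longer change
        medoids = new_medoids
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels
```

Unlike k-means centroids, the returned medoids are indices of actual samples, so each cluster is represented by a real patient profile.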

The Gaussian mixture model (GMM) assumes data are generated from a mixture of Gaussian distributions [28]. It employs the expectation–maximization algorithm to estimate the distribution parameters and assigns points to clusters based on the maximum a posteriori probability [32]. The algorithm was initialized with the k-means++ method.

BIRCH constructs a feature tree with each of the nodes representing a subcluster. The feature tree expands dynamically as new data points are added [29].
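For illustration, BIRCH can be run through Scikit-learn as sketched below. The `threshold` value and the function name `fit_birch` are assumptions; the text does not report the tree parameters used.

```python
import numpy as np
from sklearn.cluster import Birch


def fit_birch(X, n_clusters=2, threshold=0.5):
    """Build the clustering-feature tree, then group its subclusters."""
    model = Birch(n_clusters=n_clusters, threshold=threshold)
    labels = model.fit_predict(X)
    return model, labels
```

After fitting, `model.subcluster_centers_` holds the centroids of the leaf subclusters, which are then merged into the requested number of final clusters.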

For each algorithm, the number of clusters varied between 2 and 15. The number of clusters, as well as the different initializations, were compared and selected by choosing the configuration that yielded the highest silhouette score [33]. Once the optimal number of clusters was identified, the clustering algorithm was selected based on the best compromise between the silhouette score and balance in the number of patients assigned to clusters.
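The selection of the number of clusters by silhouette score can be sketched as below, shown here for k-means only. The function name `select_n_clusters`, the number of restarts, and the seed are illustrative assumptions; the same loop would apply to the other algorithms.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def select_n_clusters(X, k_min=2, k_max=15, seed=0):
    """Return the k with the highest silhouette score, plus all scores."""
    scores = {}
    for k in range(k_min, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = silhouette_score(X, labels)  # mean silhouette over samples
    best_k = max(scores, key=scores.get)
    return best_k, scores
```

The silhouette score ranges from -1 to 1, with higher values indicating more compact and better separated clusters, so the returned dictionary can also be plotted against k.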

2.4. Statistical Analysis

Descriptive statistics were calculated before the imputation to provide a comprehensive overview of the effective absolute values. The median and interquartile range (IQR) values were reported for numerical variables, while for categorical variables, absolute frequencies and percentages were calculated. A comparative analysis was conducted between the subgroups identified by the dichotomized outcome. A Mann–Whitney test was performed for numerical variables, while a chi-squared test was conducted for categorical variables. After computing the cluster centroids, a second comparative analysis was conducted (Mann–Whitney test) to assess whether the outcome distributions in the cluster groups were statistically different. Later, the dichotomized outcome was compared with the cluster labels of each algorithm through a contingency table and a chi-squared analysis. Finally, on the model that reported the best results, a Mann–Whitney test was employed to evaluate whether there were statistically significant variations in the distribution of independent variables between the clusters.
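The two comparisons between cluster labels and the outcome could be sketched with SciPy as follows. This is not the study code: the function name `compare_clusters_with_outcome` is hypothetical, binary cluster labels coded 0/1 are assumed, and SciPy applies Yates' continuity correction to 2 × 2 tables by default, whereas the text does not state whether a correction was used.

```python
import numpy as np
from scipy.stats import chi2_contingency, mannwhitneyu


def compare_clusters_with_outcome(cluster_labels, delta_meters, mcid_m=30.0):
    """Mann-Whitney test on Delta meters and chi-squared test on the 2x2 table."""
    cluster_labels = np.asarray(cluster_labels, dtype=int)
    delta_meters = np.asarray(delta_meters, dtype=float)
    csi = (delta_meters >= mcid_m).astype(int)  # dichotomized outcome

    # Continuous outcome: compare Delta meters between the two clusters.
    _, p_mw = mannwhitneyu(
        delta_meters[cluster_labels == 0],
        delta_meters[cluster_labels == 1],
        alternative="two-sided",
    )

    # Dichotomized outcome: rows are clusters, columns are CSI = 0 / 1.
    table = np.zeros((2, 2), dtype=int)
    for c, o in zip(cluster_labels, csi):
        table[c, o] += 1
    # Yates' correction is SciPy's default for 2x2 tables (assumption here).
    chi2_stat, p_chi2, dof, _ = chi2_contingency(table)

    return {"table": table, "p_mannwhitney": p_mw,
            "chi2": chi2_stat, "p_chi2": p_chi2, "dof": dof}
```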

3. Results

3.1. Descriptive and Univariate Results

A total of 166 patients were initially enrolled, of whom 26 were excluded due to comorbidities, resulting in 140 patients included in the study. Among these, 14 patients had missing outcome data, leading to a final sample size of 126 patients analyzed. In this final cohort (median age 77 years [IQR = 10], males: 56), 50% of participants had a 6MWT_CSI = 1 (the median value of Delta meters was 29.5 [IQR = 61]). The preliminary correlational analysis reduced the number of variables from 26 to 20. None of the variables related to the FOT and spirometry showed significantly different distributions between the two groups stratified by outcome. Conversely, among the variables of the 6MWT, O₂ saturation and Borg Dyspnoea Scale rating measured at the beginning of the test, as well as total meters, significantly differed between the groups (Table 1).

Table 1. Descriptive statistics of the overall analysis sample, of patients with 6MWT clinically significant improvement, and of patients without 6MWT clinically significant improvement, with a comparative analysis between the two groups.

3.2. Cluster Analysis

The optimization of the number of clusters conducted for each of the clustering algorithms led to identical results for all: the configuration with two clusters was the one with the highest silhouette score (Figure 2). The silhouette scores for the two-cluster configuration were 0.20 for the Gaussian mixture model, 0.14 for BIRCH, 0.12 for k-means, and 0.08 for k-medoids.

Figure 2. Silhouette score values as the number of clusters varies between 2 and 15.

The number of patients assigned to each cluster was computed for each clustering method to assess group balancing. K-medoids and k-means clustering resulted in the most balanced distributions (N_cl0 = 61, N_cl1 = 65 and N_cl0 = 60, N_cl1 = 66, respectively); conversely, the Gaussian mixture model and BIRCH showed less cluster balance (N_cl0 = 11, N_cl1 = 115 and N_cl0 = 27, N_cl1 = 99, respectively).

Given these findings, the k-means clustering solution was considered the most appropriate for the analysis and is referred to as the respiratory rehabilitation index (R2I).

Concerning the comparison of clustering output with the dichotomized outcome, only k-means was statistically significant (χ² = 4.58, p = 0.032). Conversely, the continuous outcome distribution was significantly different between the two clusters (Mann–Whitney, p < 0.05) for all the proposed solutions. The median [IQR] of Delta meters in the two clusters was 21 [46.3] and 43.5 [74] for k-means, 25 [57] and 30 [60] for k-medoids, 20 [30.5] and 30 [57] for the GMM, and 21 [29] and 32 [63] for BIRCH (Figure 3). A radar plot illustrating the distribution of independent variables in the two clusters is provided for the R2I only (Figure 4). Several variables significantly differed between the two identified clusters (Table 2).
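As a quick consistency check of the reported k-means statistic: a 2 × 2 contingency table (R2I by 6MWT_CSI) has one degree of freedom, for which the upper-tail probability of the chi-squared distribution reduces to erfc(sqrt(χ²/2)). The sketch below uses only the Python standard library; the function name `chi2_pvalue_1dof` is illustrative.

```python
import math


def chi2_pvalue_1dof(chi2_stat):
    """Upper-tail p-value of a chi-squared statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2_stat / 2.0))


# Statistic reported in the text for the k-means solution.
p_value = chi2_pvalue_1dof(4.58)
```

With χ² = 4.58, this evaluates to approximately 0.032, consistent with the p-value reported above.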

Figure 3. Boxplot representation of Delta meters distribution between the two clusters for each algorithm classification.

Figure 4. Mean and standard deviation of k-means cluster centroids. Dashed blue and solid red represent R2I equal to 0 and 1, respectively.

Table 2. Comparison of median and IQR of all the variables at admission between R2I clusters. p-values associated with the Mann–Whitney test are reported. Statistically significant values are in bold.

4. Discussion

This study showed that unsupervised clustering techniques can stratify COPD patients into distinct subgroups based on pre-rehabilitation characteristics, offering insight into rehabilitation outcomes. The outcome measure, defined as the change in 6MWT distance between admission and discharge, was dichotomized based on the MCID threshold of 30 m. The optimal clustering solution was obtained using the k-means algorithm with two clusters, resulting in the R2I. The R2I, obtained from T0 data, was significantly associated with the outcome at T1 (p = 0.032): patients with more severe baseline functional and respiratory impairments (R2I = 1) were more likely to achieve a clinically significant post-rehabilitation improvement in walked distance. In particular, patients in R2I = 0, compared to those in R2I = 1, presented at admission with lower overall mechanical impairment (lower respiratory resistance values and smaller variations in reactance during the test), a more favorable ventilatory pattern and lung volumes, and better functional capacity, as indicated by higher walking performance, greater exercise tolerance, and lower perceived dyspnoea. Identifying these profiles through clustering before rehabilitation could help clinicians anticipate which patients are more likely to achieve meaningful functional improvement in the 6MWT and adapt the intensity, focus, and monitoring of PR programs accordingly, ultimately aiming to maximize individual benefits. While only a few functional parameters of the 6MWT, such as total distance and O₂ saturation, showed significant differences between the groups identified by the outcome, the R2I clusters revealed differences in nearly all pre-rehabilitation variables. These included parameters from both the FOT and spirometry, which were not evident in the outcome-based grouping, indicating that these respiratory measures play a critical role in patient stratification and may better capture the underlying heterogeneity in rehabilitation responses. Key variables that contributed most to the discrimination between R2I clusters included ΔX_RS, FEV/SVC, and IC. This multidimensional approach goes beyond single-domain assessments used in previous studies by capturing both respiratory mechanics and functional performance, providing a more accurate characterization of patient profiles.

From a methodological perspective, this study compares clustering algorithms, including k-medoids, the GMM, and BIRCH, in addition to k-means, ultimately selected as the most appropriate solution. The use of silhouette scores to choose the optimal number of clusters ensured an objective and reproducible approach, reinforcing the validity of the identified subgroups. These methodological strengths addressed critical gaps in the literature, where clustering solutions were often hindered by inconsistent methods and insufficient validation, resulting in a lack of reproducibility and practical relevance. By applying and comparing different clustering methodologies and achieving consistent subgroup identification across algorithms, this study enhanced confidence in the robustness of the R2I for patient stratification.

The most significant practical implication of this study is the potential to personalize rehabilitation strategies for COPD patients. COPD is a highly heterogeneous condition, with patients presenting diverse clinical profiles and responses to therapy, which often limits the effectiveness of standardized rehabilitation protocols. By stratifying patients into more homogeneous subgroups based on pre-rehabilitation features, unsupervised clustering techniques can contribute to understanding the relationship between pulmonary function impairment and mechanisms of response to PR. This approach enables the design of tailored rehabilitation programs with the potential to improve rehabilitation outcomes, reduce variability in responses, and support more effective patient management in clinical practice.

5. Limitations

The relatively small sample size may limit the generalizability of the findings to broader COPD populations. Moreover, conducting the study in a single rehabilitation center may have introduced bias linked to the specific population characteristics or local rehabilitation protocols.

6. Future Directions

Future research should focus on validating the R2I across larger COPD cohorts to enhance its generalizability and clinical applicability. Further investigations could benefit from the inclusion of additional clinical and functional variables, such as psychosocial factors (e.g., anxiety and depression [34]), comorbidities (e.g., cardiac, metabolic, orthopedic, or behavioral health problems [35]), and markers of skeletal muscle dysfunction [36]. These aspects are well established as influential determinants of rehabilitation outcomes in individuals with COPD. Incorporating them into a multidimensional framework may allow for more accurate patient stratification and could enhance the overall predictive value and clinical utility of the R2I.

7. Conclusions

This study shows that the unsupervised clustering of multidimensional admission data enables the identification of clinically meaningful subgroups of COPD patients undergoing pulmonary rehabilitation. By integrating 6MWT, FOT, and spirometry parameters, the R2I offers a data-driven stratification tool associated with rehabilitation outcomes. Specifically, patients with more severe pre-rehabilitation impairment (R2I = 1) were more likely to achieve clinically significant improvements in functional capacity, as measured by the 6MWT.

The R2I captured differences across a broad range of admission variables, many of which were not univariately associated with the outcome. These findings underscore the potential of unsupervised machine learning approaches to uncover hidden patterns in complex clinical data and support more personalized rehabilitation strategies.

Author Contributions

Conceptualization, P.L. and A.M.; methodology, E.M. and P.L.; software, E.M.; validation, I.R. and F.G.; formal analysis, E.M. and P.L.; investigation, E.M., F.G. and I.R.; resources, A.M., F.G. and I.R.; writing—original draft preparation, E.M.; writing—review and editing, P.L., A.M., I.R. and F.G.; visualization, E.M.; supervision, I.R., F.G. and A.M.; project administration, A.M.; funding acquisition, A.M. All authors have read and agreed to the published version of the manuscript.

Funding

This study was funded by the Italian Ministry of Health under the “Ricerca Corrente” program.

Institutional Review Board Statement

The studies were approved by the Research Ethics Committee of Comitato Etico Area Vasta Centro (approval code: r.n.18765_oss; r.n.15217_oss; approval date: 20 April 2020).

Informed Consent Statement

Written informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author for reproducibility purposes.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

6MWT	Six-Minute Walk Test
BIRCH	Balanced Iterative Reducing and Clustering using Hierarchies
CSI	Clinically Significant Improvement
COPD	Chronic Obstructive Pulmonary Disease
DALYs	Disability-Adjusted Life Years
FEV	Forced Expiratory Volume
FEV₁	Forced Expiratory Volume in the First Second
FRC	Functional Residual Capacity
FOT	Forced Oscillation Technique
GMM	Gaussian Mixture Model
IC	Inspiratory Capacity
IQR	Interquartile Range
kNN	k-Nearest Neighbors
MCID	Minimal Clinically Important Difference
MRC	Medical Research Council
PR	Pulmonary Rehabilitation
PRP	Pulmonary Rehabilitation Program
R2I	Respiratory Rehabilitation Index
R_F	Respiratory Flow
R_RS	Respiratory System Resistance
R_R	Respiratory Rate
SVC	Slow Vital Capacity
T_E	Expiratory Time
T_I	Inspiratory Time
TLC	Total Lung Capacity
T_TOT	Total Time
V_E	Mechanical Ventilation
V_R	Residual Volume
V_T	Tidal Volume
X_RS	Respiratory System Reactance

References

Safiri, S.; Carson-Chahhoud, K.; Noori, M.; Nejadghaderi, S.A.; Sullman, M.J.M.; Heris, J.A.; Ansarin, K.; Mansournia, M.A.; Collins, G.S.; Kolahi, A.-A.; et al. Burden of chronic obstructive pulmonary disease and its attributable risk factors in 204 countries and territories, 1990–2019: Results from the Global Burden of Disease Study 2019. BMJ 2022, 378, e069679. [Google Scholar] [CrossRef]
Wang, Z.; Lin, J.; Liang, L.; Huang, F.; Yao, X.; Peng, K.; Gao, Y.; Zheng, J. Global, regional, and national burden of chronic obstructive pulmonary disease and its attributable risk factors from 1990 to 2021: An analysis for the Global Burden of Disease Study 2021. Resp. Res. 2025, 26, 2. [Google Scholar] [CrossRef]
Wedzicha, J.A.; Seemungal, T.A. COPD exacerbations: Defining their cause and prevention. Lancet 2007, 370, 786–796. [Google Scholar] [CrossRef]
Vestbo, J.; Edwards, L.D.; Scanlon, P.D.; Yates, J.C.; Agusti, A.; Bakke, P.; Calverley, P.M.; Celli, B.; Coxson, H.O.; Crim, C.; et al. Changes in forced expiratory volume in 1 second over time in COPD. N. Engl. J. Med. 2011, 365, 1184–1192. [Google Scholar] [CrossRef] [PubMed]
Fermont, J.M.; Masconi, K.L.; Jensen, M.T.; Ferrari, R.; Di Lorenzo, V.A.P.; Marott, J.M.; Schuetz, P.; Watz, H.; Waschki, B.; Müllerova, H.; et al. Biomarkers and clinical outcomes in COPD: A systematic review and meta-analysis. Thorax 2019, 74, 439–446. [Google Scholar] [CrossRef] [PubMed]
Troosters, T.; Janssens, W.; Demeyer, H.; Rabinovich, R.A. Pulmonary rehabilitation and physical interventions. Eur. Respir. Rev. 2023, 32, 220222. [Google Scholar] [CrossRef] [PubMed]
Corlateanu, A.; Mendez, Y.; Wang, Y.; Garnica, R.d.J.A.; Botnaru, V.; Siafakas, N. Chronic obstructive pulmonary disease and phenotypes: A state-of-the-art. Pulmonology 2020, 26, 95–100. [Google Scholar] [CrossRef]
Jones, P.W.; Agusti, A.G.N. Outcomes and markers in the assessment of chronic obstructive pulmonary disease. Eur. Respir. J. 2006, 27, 822–832. [Google Scholar] [CrossRef]
Jenkins, S.C. Six-minute walk test in patients with COPD: Clinical applications in pulmonary rehabilitation. Physiotherapy 2007, 93, 175–182. [Google Scholar] [CrossRef]
Bestall, J.C.; A Paul, E.; Garrod, R.; Garnham, R.; Jones, P.W.; A Wedzicha, J. Usefulness of the Medical Research Council (MRC) dyspnoea scale as a measure of disability in patients with chronic obstructive pulmonary disease. Thorax 1999, 54, 581–586. [Google Scholar] [CrossRef]
Komorowski, M.; Green, A.; Tatham, K.C.; Seymour, C.; Antcliffe, D. Sepsis biomarkers and diagnostic tools with a focus on machine learning. EBioMedicine 2022, 86, 104394. [Google Scholar] [CrossRef]
Miller, R.J.H.; Bednarski, B.P.; Pieszko, K.; Kwiecinski, J.; Williams, M.C.; Shanbhag, A.; Liang, J.X.; Huang, C.; Sharir, T.; Hauser, M.T.; et al. Clinical phenotypes among patients with normal cardiac perfusion using unsupervised learning: A retrospective observational study. EBioMedicine 2024, 99, 104930. [Google Scholar] [CrossRef]
Alexander, N.; Alexander, D.C.; Barkhof, F.; Denaxas, S. Identifying and evaluating clinical subtypes of Alzheimer’s disease in care electronic health records using unsupervised machine learning. BMC Med. Inf. Decis. Mak. 2021, 21, 343. [Google Scholar] [CrossRef] [PubMed]
Burgel, P.R.; Paillasseur, J.-L.; Caillaud, D.; Tillie-Leblond, I.; Chanez, P.; Escamilla, R.; Court-Fortune, I.; Perez, T.; Carré, P.; Roche, N. Clinical COPD phenotypes: A novel approach using principal component and cluster analyses. Eur. Respir. J. 2010, 36, 531–539. [Google Scholar] [CrossRef] [PubMed]
Pikoula, M.; Quint, J.K.; Nissen, F.; Hemingway, H.; Smeeth, L.; Denaxas, S. Identifying clinically important COPD sub-types using data-driven approaches in primary care population-based electronic health records. BMC Med. Inf. Decis. Mak. 2019, 19, 1–14. [Google Scholar] [CrossRef] [PubMed]
Burgel, P.R.; Quint, J.K.; Nissen, F.; Hemingway, H.; Smeeth, L.; Denaxas, S. A simple algorithm for the identification of clinical COPD phenotypes. Eur. Respir. J. 2017, 50, 1701034. [Google Scholar] [CrossRef]
Chikhanie, Y.A.; Bailly, S.; Amroussa, I.; Veale, D.; Hérengt, F.; Verges, S. Clustering of COPD patients and their response to pulmonary rehabilitation. Respir. Med. 2022, 198, 106861.
Spruit, M.A.; Augustin, I.M.L.; Vanfleteren, L.E.; Janssen, D.J.A.; Gaffron, S.; Pennings, H.-J.; Smeenk, F.; Pieters, W.; van den Bergh, J.J.A.M.; Michels, A.-J.; et al. Differential response to pulmonary rehabilitation in COPD: Multidimensional profiling. Eur. Respir. J. 2015, 46, 1625–1635.
Rochester, C.L.; Vogiatzis, I.; Holland, A.E.; Lareau, S.C.; Marciniuk, D.D.; Puhan, M.A.; Spruit, M.A.; Masefield, S.; Casaburi, R.; Clini, E.M.; et al. An official American Thoracic Society/European Respiratory Society policy statement enhancing implementation, use, and delivery of pulmonary rehabilitation. Am. J. Respir. Crit. Care Med. 2015, 192, 1373–1386.
Agustí, A.; Celli, B.R.; Criner, G.J.; Halpin, D.; Anzueto, A.; Barnes, P.; Bourbeau, J.; Han, M.K.; Martinez, F.J.; de Oca, M.M.; et al. Global Initiative for Chronic Obstructive Lung Disease 2023 report. GOLD executive summary. Eur. Respir. J. 2023, 61, 2300239.
Oostveen, E.; MacLeod, D.; Lorino, H.; Farré, R.; Hantos, Z.; Desager, K.; Marchal, F. The FOT in clinical practice: Methodology, recommendations and future developments. Eur. Respir. J. 2003, 22, 1026–1041.
Graham, B.L.; Steenbruggen, I.; Miller, M.R.; Barjaktarevic, I.Z.; Cooper, B.G.; Hall, G.L.; Hallstrand, T.S.; Kaminsky, D.A.; McCarthy, K.; McCormack, M.C.; et al. Standardization of spirometry: 2019 update. Am. J. Respir. Crit. Care Med. 2019, 200, e70–e88.
Enright, P.L. The six-minute walk test. Respir. Care 2003, 48, 783–785.
Singh, S.J.; Puhan, M.A.; Andrianopoulos, V.; Hernandes, N.A.; Mitchell, K.E.; Hill, C.J.; Lee, A.L.; Camillo, C.A.; Troosters, T.; Spruit, M.A.; et al. An official systematic review of the European Respiratory Society/American Thoracic Society: Measurement properties of field walking tests in chronic respiratory disease. Eur. Respir. J. 2014, 44, 1447–1478.
Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
MacQueen, J. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 21 June–18 July 1965, 27 December 1965–7 January 1966; Le Cam, L.M., Neyman, J., Eds.; University of California Press: Oakland, CA, USA, 1967.
Park, H.S.; Jun, C.H. A simple and fast algorithm for K-medoids clustering. Expert Syst. Appl. 2009, 36, 3336–3341.
Rasmussen, C. The infinite Gaussian mixture model. Adv. Neural Inf. Process. Syst. 1999, 12, 554–560.
Zhang, T.; Ramakrishnan, R.; Livny, M. BIRCH: An efficient data clustering method for very large databases. ACM SIGMOD Rec. 1996, 25, 103–114.
Arthur, D.; Vassilvitskii, S. k-Means++: The Advantages of Careful Seeding; Technical Report; Stanford University: Palo Alto, CA, USA, 2006.
Elkan, C. Using the triangle inequality to accelerate k-means. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), Washington, DC, USA, 21–24 August 2003; pp. 147–153.
Moon, T.K. The expectation-maximization algorithm. IEEE Signal Process. Mag. 1996, 13, 47–60.
Rousseeuw, P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987, 20, 53–65.
Gordon, C.S.; Waller, J.W.; Cook, R.M.; Cavalera, S.L.; Lim, W.T.; Osadnik, C.R. Effect of pulmonary rehabilitation on symptoms of anxiety and depression in COPD: A systematic review and meta-analysis. Chest 2019, 156, 80–91.
Tunsupon, P.; Lal, A.; Abo Khamis, M.; Mador, M.J. Comorbidities in patients with chronic obstructive pulmonary disease and pulmonary rehabilitation outcomes. J. Cardiopulm. Rehabil. Prev. 2017, 37, 283–289.
Jaitovich, A.; Barreiro, E. Skeletal muscle dysfunction in chronic obstructive pulmonary disease: What we know and can do for our patients. Am. J. Respir. Crit. Care Med. 2018, 198, 175–186.

Figure 1. Data collection (A) and clustering analysis (B) pipeline.

Figure 2. Silhouette score values as the number of clusters varies between 2 and 15.
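
The sweep summarized in Figure 2 can be sketched as follows. This is a minimal illustration, assuming scikit-learn's KMeans and silhouette_score on standardized features; the synthetic data from make_blobs, the helper name silhouette_sweep, and the seed are hypothetical stand-ins and not the study's data or settings.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def silhouette_sweep(X, k_min=2, k_max=15, seed=0):
    """Return {k: mean silhouette score} for k-means with k clusters."""
    # Standardize so that no single variable dominates the Euclidean distance.
    X_std = StandardScaler().fit_transform(X)
    scores = {}
    for k in range(k_min, k_max + 1):
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        labels = model.fit_predict(X_std)
        scores[k] = silhouette_score(X_std, labels)
    return scores


# Hypothetical data: 126 samples and 20 features, drawn from two blobs.
X_demo, _ = make_blobs(n_samples=126, n_features=20, centers=2, random_state=0)
scores = silhouette_sweep(X_demo)
best_k = max(scores, key=scores.get)  # number of clusters with the highest score
```

Plotting scores against k would give a curve of the kind shown in the figure; the k with the highest value is the candidate number of clusters.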

Figure 3. Boxplots of the change in 6MWT distance (Delta meters) in the two clusters identified by each clustering algorithm.

Figure 4. Mean and standard deviation of k-means cluster centroids. Dashed blue and solid red represent R2I equal to 0 and 1, respectively.

Table 1. Descriptive statistics for the overall sample and for patients without (6MWT_CSI = 0) and with (6MWT_CSI = 1) a clinically significant improvement in 6MWT distance, with between-group comparisons.

Variables	Total (N = 126), median [IQR] or N (%)	6MWT_CSI = 0 (N = 63), median [IQR] or N (%)	6MWT_CSI = 1 (N = 63), median [IQR] or N (%)	p-Value
Sex, male	56 (44.4%)	32 (50.8%)	24 (38.1%)	0.151
Age, yr	77.00 [10.00]	75.00 [11.00]	78.00 [11.00]	0.373
R_RS	4.50 [1.62]	4.37 [1.97]	4.56 [1.56]	0.238
X_RS	−1.83 [1.33]	−1.66 [1.15]	−2.02 [1.58]	0.230
ΔX_RS	2.13 [3.45]	1.99 [3.87]	2.51 [2.74]	0.281
T_I	1.27 [0.53]	1.31 [0.48]	1.24 [0.55]	0.221
T_E	2.20 [0.85]	2.31 [1.01]	2.07 [0.68]	0.135
T_I/T_TOT	0.37 [0.06]	0.36 [0.07]	0.38 [0.05]	0.756
V_T	0.66 [0.33]	0.73 [0.37]	0.62 [0.30]	0.275
V_E	11.37 [4.16]	11.27 [3.74]	11.37 [4.67]	0.560
FEV/SVC	39.50 [24.25]	40.00 [26.00]	39.00 [24.00]	0.687
FEV₁	0.91 [0.52]	0.91 [0.52]	0.92 [0.53]	0.811
TLC	6.10 [2.68]	6.08 [3.00]	6.17 [2.54]	0.603
IC	1.66 [0.92]	1.66 [0.85]	1.63 [0.97]	0.709
O₂ saturation ⁱ	95.00 [4.00]	96.00 [3.00]	95.00 [3.00]	0.007
Borg Scale ⁱ	0.50 [2.00]	0.00 [1.00]	0.50 [2.00]	0.036
Borg Scale limbs ⁱ	0.00 [1.30]	0.00 [1.00]	0.00 [2.00]	0.491
O₂ ⁱ	0.00 [3.00]	0.00 [3.00]	0.00 [5.00]	0.709
O₂ saturation ^f	92.00 [7.00]	92.00 [7.00]	92.00 [7.00]	0.805
Borg Scale ^f	5.00 [2.00]	5.00 [2.00]	5.00 [2.00]	0.877
Borg Scale limbs ^f	3.00 [3.00]	3.00 [3.00]	3.00 [3.00]	0.877
Total meters	287.50 [148.00]	345.00 [120.00]	240.00 [112.00]	<0.001

Abbreviations: 6MWT, six-minute walk test; Borg Scale ^f, Borg Dyspnoea Scale at the end of 6MWT; Borg Scale ⁱ, Borg Dyspnoea Scale at the beginning of 6MWT; Borg Scale limbs ^f, Borg Scale for limb fatigue at the end of 6MWT; Borg Scale limbs ⁱ, Borg Scale for limb fatigue at the beginning of 6MWT; CSI, clinically significant improvement; FEV₁, forced expiratory volume in the first second; FEV/SVC, forced expiratory volume divided by slow vital capacity; IC, inspiratory capacity; IQR, interquartile range; O₂ ⁱ, oxygen level at the beginning of 6MWT; O₂ saturation ^f, oxygen saturation at the end of 6MWT; O₂ saturation ⁱ, oxygen saturation at the beginning of 6MWT; R_RS, respiratory system resistance; T_E, expiratory time; T_I, inspiratory time; T_I/T_TOT, ratio of T_I to total time; TLC, total lung capacity; V_E, mechanical ventilation; V_T, tidal volume; X_RS, respiratory system reactance; ΔX_RS, change in respiratory system reactance. Statistically significant p-values are in bold.
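
The grouping variable 6MWT_CSI used in Table 1 can be illustrated with a short sketch. It assumes the 30 m minimal clinically important difference stated in the abstract; treating the threshold as inclusive (a gain of exactly 30 m counts as improvement), the function name six_mwt_csi, and the example distances are assumptions made for illustration only.

```python
MCID_METERS = 30.0  # minimal clinically important difference for 6MWT distance


def six_mwt_csi(admission_m, discharge_m, mcid=MCID_METERS):
    """Return 1 if the gain in 6MWT distance reaches the MCID, else 0."""
    delta = discharge_m - admission_m
    # Assumption: the threshold is inclusive (delta >= mcid).
    return 1 if delta >= mcid else 0


# Hypothetical (admission, discharge) distances in meters.
patients = [(240.0, 290.0), (345.0, 350.0), (200.0, 230.0)]
labels = [six_mwt_csi(a, d) for a, d in patients]
print(labels)  # [1, 0, 1]
```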

Table 2. Median and IQR of all admission variables compared between R2I clusters. p-values are from the Mann–Whitney test. Statistically significant values are in bold.

Variables	R2I = 0, median	R2I = 0, IQR	R2I = 1, median	R2I = 1, IQR	p-Value
R_RS	3.74	1.44	5.06	1.61	<0.001
X_RS	−1.46	0.93	−2.37	1.18	<0.001
ΔX_RS	0.88	1.80	3.97	4.51	<0.001
T_I	1.43	0.45	1.12	0.46	<0.001
T_E	2.11	1.00	2.17	0.80	0.841
T_I/T_TOT	0.40	0.07	0.36	0.07	<0.001
V_T	0.74	0.31	0.62	0.30	<0.05
V_E	11.57	4.40	11.30	4.14	0.545
FEV/SVC	54.00	22.50	36.00	11.50	<0.001
FEV₁	1.23	0.55	0.72	0.31	<0.001
TLC	5.84	3.30	6.18	2.10	0.725
IC	2.00	0.85	1.44	0.55	<0.001
O₂ saturation ⁱ	95.00	4.00	95.00	3.00	0.349
Borg Scale ⁱ	0.00	0.50	1.00	2.00	0.001
Borg Scale limbs ⁱ	0.00	0.00	0.00	2.00	0.004
O₂ ⁱ	0.00	0.50	0.00	6.00	0.039
O₂ saturation ^f	93.00	6.00	90.00	7.00	0.003
Borg Scale ^f	3.00	3.00	5.00	3.00	<0.001
Borg Scale limbs ^f	2.50	2.80	4.00	2.00	0.001
Total meters	350.00	128.00	230.00	123.00	<0.001

Abbreviations: 6MWT, six-minute walk test; Borg Scale ^f, Borg Dyspnoea Scale at the end of 6MWT; Borg Scale ⁱ, Borg Dyspnoea Scale at the beginning of 6MWT; Borg Scale limbs ^f, Borg Scale for limb fatigue at the end of 6MWT; Borg Scale limbs ⁱ, Borg Scale for limb fatigue at the beginning of 6MWT; CSI, clinically significant improvement; FEV₁, forced expiratory volume in the first second; FEV/SVC, forced expiratory volume divided by slow vital capacity; IC, inspiratory capacity; IQR, interquartile range; O₂ ⁱ, oxygen level at the beginning of 6MWT; O₂ saturation ^f, oxygen saturation at the end of 6MWT; O₂ saturation ⁱ, oxygen saturation at the beginning of 6MWT; R_RS, respiratory system resistance; T_E, expiratory time; T_I, inspiratory time; T_I/T_TOT, ratio of T_I to total time; TLC, total lung capacity; V_E, mechanical ventilation; V_T, tidal volume; X_RS, respiratory system reactance; ΔX_RS, change in respiratory system reactance.
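
A row of Table 2 could be produced along the following lines. This is a hedged sketch that assumes SciPy's mannwhitneyu with a two-sided alternative; the helper compare_clusters, the simulated walking distances, and the cluster sizes (60 and 66) are hypothetical and do not reproduce the study data.

```python
import numpy as np
from scipy.stats import mannwhitneyu


def compare_clusters(values, cluster_labels):
    """Median, IQR per cluster (0 and 1) and the Mann-Whitney p-value."""
    values = np.asarray(values, dtype=float)
    cluster_labels = np.asarray(cluster_labels)
    summary = {}
    for c in (0, 1):
        group = values[cluster_labels == c]
        q1, med, q3 = np.percentile(group, [25, 50, 75])
        summary[c] = {"median": med, "iqr": q3 - q1}
    # Two-sided rank-based test, no normality assumption on the variable.
    _, p_value = mannwhitneyu(
        values[cluster_labels == 0],
        values[cluster_labels == 1],
        alternative="two-sided",
    )
    summary["p_value"] = p_value
    return summary


# Hypothetical data: admission walking distance (m) in two simulated clusters.
rng = np.random.default_rng(0)
walk_m = np.concatenate([rng.normal(350, 90, 60), rng.normal(230, 90, 66)])
r2i = np.array([0] * 60 + [1] * 66)
result = compare_clusters(walk_m, r2i)
```

Calling compare_clusters once per admission variable would fill the table row by row.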

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).