Comparison of Predictive Properties between Tools of Patient-Reported Outcomes: Risk Prediction for Three Future Events in Subjects with COPD

Background: Patient-reported outcome (PRO) measures must be evaluated for their discriminatory, evaluative, and predictive properties. However, their predictive capability remains unclear. We aimed to examine the predictive properties of several PRO measures for all-cause mortality, acute exacerbation of chronic obstructive pulmonary disease (COPD), and associated hospitalization. Methods: A total of 122 outpatients with stable COPD were prospectively recruited and completed six self-administered paper questionnaires at baseline: the COPD Assessment Test (CAT), St. George’s Respiratory Questionnaire (SGRQ), Baseline Dyspnea Index (BDI), Dyspnoea-12, Evaluating Respiratory Symptoms in COPD, and Hyland Scale. Cox proportional hazards analyses were conducted to examine the relationships with future outcomes. Results: A total of 66 patients experienced exacerbation, 41 were hospitalized, and 18 died. The BDI, SGRQ Total, SGRQ Activity, CAT, and Hyland Scale scores were significantly related to mortality (hazard ratio = 0.777, 1.027, 1.027, 1.077, and 0.951, respectively). The Hyland Scale score had the best predictive ability among the PRO measures, but its C-index did not reach that of the most commonly used measure, FEV1. Almost all clinical, physiological, and PRO measurements obtained at baseline were significant predictors of the first exacerbation and the first hospitalization due to it, with a few exceptions. Conclusions: Measures of health status and the global scale of quality of life, as well as some tools to assess breathlessness, were significant predictors of all-cause mortality, but their predictive capacity did not reach that of FEV1. In contrast, almost all baseline measurements were unexpectedly related to exacerbation and associated hospitalization.


Introduction
The importance of patient-reported outcome (PRO) measures when evaluating healthcare delivery and conducting scientific investigations has grown substantially [1][2][3][4]. Chronic obstructive pulmonary disease (COPD) is considered a model for the evaluation of PRO measures [5,6]. Guyatt and colleagues first developed the Chronic Respiratory Disease Questionnaire (CRQ) in 1987 to measure the disease-specific quality of life of individuals with COPD [7]. The updated Global Initiative for Chronic Obstructive Lung Disease (GOLD, 2011) proposed that symptoms be evaluated as health status measures using the COPD Assessment Test (CAT) [8][9][10], one of the PRO measuring tools recommended in clinical practice according to the international document [11]. PRO measures are thus considered essential when assessing a patient's COPD.
Health indicators, including PROs, can be discussed from three perspectives. First, they can differentiate between less and more severely ill patients (discriminatory property). Second, they can measure the amount of change (evaluative property). Third, they can forecast future outcomes (predictive property). In the twentieth century, forced expiratory volume in one second (FEV1) and age were believed to be the strongest mortality predictors in subjects with COPD [12]. Subsequently, several other predictors of mortality have emerged in the literature, including dyspnea, health status, exercise capacity, and physical activity [13][14][15][16]. Oga and colleagues reported that the St. George's Respiratory Questionnaire (SGRQ) Total score was able to predict mortality for a period of up to five years, but CRQ scores were not associated with seven-year mortality [15,17]. As COPD is a progressive disease, it is important to consider predictors of future outcomes, such as FEV1 and PRO measures, when assessing disease severity.
We hypothesized that individual PRO measures, each developed independently from a specific conceptual framework, are not interchangeable, although several PRO tools aimed at subjects with COPD have been reported in the literature [4,18]. For example, the SGRQ and CAT were developed to measure COPD-specific health status [8][9][10][19], Evaluating Respiratory Symptoms in COPD (E-RS) focuses on respiratory symptoms [20,21], and Dyspnoea-12 (D-12) is targeted at breathlessness [22][23][24][25]. They have generally been administered to subjects with COPD individually or in combination when multifaceted analysis and evaluation of outcome markers is required. It remains unclear whether the currently used PRO measures can predict future outcomes for patients with COPD and whether their predictive properties differ according to their individual conceptual frameworks. The purpose of this study was to examine the predictive properties of several different PRO measures in subjects with COPD and to investigate which of these measures best predicts mortality, acute exacerbation of COPD (AECOPD), and hospitalization due to AECOPD.

Participants
A total of 122 patients with stable COPD were recruited from our outpatient clinic at the Department of Respiratory Medicine of the National Center for Geriatrics and Gerontology between April 2013 and April 2019 and followed up until December 2019, for a maximum of six and a half years. The inclusion criteria were age over 50 years, former or current smoking with a cumulative history of more than 10 pack-years, and chronic fixed airflow limitation (described elsewhere as part of the hospital-based cohort study) [26]. The exclusion criteria included an exacerbation of COPD in the preceding three months.

Measurements
All eligible subjects completed lung function tests and PRO measurements at baseline on the same day. Participants were instructed to arrive at the study site at least 12 h after stopping bronchodilator use. During the visit, they were monitored by a physician while they inhaled long-acting bronchodilators and, more than 60 min later, underwent spirometry with the CHESTAC-8800 spirometer (Chest, Tokyo, Japan). The test was performed in the sitting position, and the highest values of the three measurements were analyzed. Residual volume was calculated using the closed-circuit helium method, and diffusing capacity for carbon monoxide (DL CO ) was determined using the single-breath technique [27].
The survival of all registered subjects was assessed until mid-December 2019. For those who did not attend an outpatient clinic, telephone or postal contacts with families or primary health practitioners were used to obtain information on mortality. Those whom we could not reach were regarded as having withdrawn. The period from entry to the last participation or event was recorded for analysis. AECOPD was defined as a worsening of respiratory symptoms requiring treatment with systemic corticosteroids or antibiotics, or both [28].
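The follow-up rule described above (time from entry to the event, or to the last participation when no event was observed) can be illustrated with a minimal sketch. The function name `follow_up_record` and all dates are hypothetical and are not taken from the study data; treating subjects without an observed event as censored at their last contact is an assumption about how such records are usually prepared for a Cox analysis.

```python
from datetime import date

def follow_up_record(entry, last_contact, event_date=None):
    """Return (days, event) for one subject.

    event is 1 when the outcome was observed and 0 when the subject is
    censored at the last contact (assumed handling, see lead-in).
    """
    end = event_date if event_date is not None else last_contact
    days = (end - entry).days
    if days < 0:
        raise ValueError("end of follow-up precedes entry")
    return days, (1 if event_date is not None else 0)

# Hypothetical subjects; the dates are invented for illustration only.
subject_with_event = follow_up_record(
    date(2014, 5, 1), date(2016, 1, 10), event_date=date(2016, 3, 2)
)
subject_censored = follow_up_record(date(2015, 9, 15), date(2019, 12, 13))
print(subject_with_event, subject_censored)
```

Each pair of duration and event flag is the form in which time-to-event data are typically passed to a survival model.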

Patient-Reported Measurements
Disease-specific health status was assessed with the CAT and SGRQ [8][9][10][19]. The CAT scores range from 0 to 40, with a score of zero indicating no impairment [8][9][10]. The SGRQ consists of 50 items, divided into three components: Symptoms, Activity and Impact, and the Total score is calculated [19]. Higher scores on the SGRQ indicate a more severely impaired health status. To assess the severity of breathlessness, we used the Baseline Dyspnea Index (BDI) and the Dyspnoea-12 (D-12) [22][23][24][29]. The BDI is composed of three categories: functional impairment, magnitude of task and magnitude of effort, rated by five grades from 0 (severe) to 4 (not impaired) for each [29]. The D-12 consists of 12 elements (7 physical and 5 emotional), and the D-12 Total scores range from 0 to 36, with higher scores denoting more severe dyspnea [22][23][24]. The total score of the E-RS indicates the severity of respiratory symptoms in general, where scores range from 0 to 40, with higher scores indicating more severe symptoms [20,21]. Three subscales are used to assess breathlessness (RS-Breathlessness), cough and sputum (RS-Cough and Sputum), and chest-related symptoms (RS-Chest Symptoms) [20,21]. Although E-RS is intended to be administered using accredited electronic devices, none were available in the Japanese version. Global health was also assessed using the Hyland Scale with scores ranging from 0 to 100, where 0 = 'might as well be dead' and 100 = 'perfect quality of life' [30]. All the validated Japanese versions were self-administered using a paper-based questionnaire under site supervision in the aforementioned order (in booklet form).

Statistical Methods
All results are expressed as mean ± standard deviation (SD). A p-value of less than 0.05 was considered statistically significant. Differences among the three groups were assessed with the Kruskal-Wallis test, and pairwise differences with the Steel-Dwass test. Univariate Cox proportional hazards analyses were performed to investigate the relationships between measurements at baseline and subsequent events. Results of the regression analyses are presented as hazard ratios (HR) with the corresponding 95% confidence intervals (CI). HRs were first calculated per unit of the actual measured value and then recalculated after z-score standardization to show the HR per SD change. The C-index of an event prediction model expresses how well it discriminates between individuals who do and do not experience the event, and it is often used to compare different measures or models. The closer the C-index is to 1.0, the better the risk prediction.
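As an illustration of the univariate analysis described above, the following is a minimal sketch of a one-covariate Cox proportional hazards fit on a z-score standardized predictor, yielding an HR per SD with a Wald 95% CI. It is not the software used in the study, which is not specified here. The Newton-Raphson routine, the Breslow handling of tied event times, the function names, and the toy data are all assumptions made for illustration.

```python
import math

def standardize(values):
    """Return z-scores (sample SD, n - 1 in the denominator)."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
    return [(v - mean) / sd for v in values]

def _partial_likelihood(beta, times, events, x):
    """Log partial likelihood, score and information at beta (Breslow ties)."""
    loglik = score = info = 0.0
    for i in range(len(times)):
        if not events[i]:
            continue  # censored subjects contribute only through risk sets
        s0 = s1 = s2 = 0.0
        for j in range(len(times)):
            if times[j] >= times[i]:  # subject j is still at risk at this event
                w = math.exp(beta * x[j])
                s0 += w
                s1 += w * x[j]
                s2 += w * x[j] * x[j]
        mean = s1 / s0
        loglik += beta * x[i] - math.log(s0)
        score += x[i] - mean
        info += s2 / s0 - mean * mean
    return loglik, score, info

def fit_univariate_cox(times, events, x, max_iter=50, tol=1e-9):
    """Newton-Raphson with step halving; returns (beta, standard error)."""
    beta = 0.0
    loglik, score, info = _partial_likelihood(beta, times, events, x)
    for _ in range(max_iter):
        step = score / info
        while True:
            trial = _partial_likelihood(beta + step, times, events, x)
            if trial[0] >= loglik or abs(step) < tol:
                break
            step /= 2.0  # back off if the likelihood got worse
        beta += step
        loglik, score, info = trial
        if abs(step) < tol:
            break
    return beta, math.sqrt(1.0 / info)

def hazard_ratio(beta, se, z=1.96):
    """Hazard ratio with a Wald 95% confidence interval."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

# Invented toy data: follow-up in months, event flag, symptom score.
months = [5, 8, 12, 15, 20, 24, 30, 36]
died = [1, 1, 0, 1, 1, 0, 1, 0]
score_raw = [30, 18, 25, 22, 10, 12, 15, 5]

beta, se = fit_univariate_cox(months, died, standardize(score_raw))
hr, lower, upper = hazard_ratio(beta, se)
print(f"HR per SD = {hr:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

Because the predictor is standardized before fitting, the exponentiated coefficient is directly the HR for a 1 SD change. In practice, an established survival analysis package would be preferable to a hand-written routine.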

Subject Characteristics and Scores Obtained at Baseline
During the study period, 122 consecutive patients (113 men) with mild to very severe COPD, with a wide range of FEV1 values, were investigated. The mean age and FEV1 were 74.5 ± 6.4 years and 1.72 ± 0.54 L (68.8 ± 20.3% pred), respectively. Ninety-four were former smokers, and 28 were current smokers. Patient characteristics and the results of the pulmonary function tests at baseline are shown in Table 1. According to the GOLD classification of airflow limitation [11], 41 subjects (33.6%) were in GOLD 1 (defined as FEV1 ≥ 80% predicted), 60 (49.2%) in GOLD 2 (50% ≤ FEV1 < 80% predicted), 14 (11.5%) in GOLD 3 (30% ≤ FEV1 < 50% predicted) and 7 (5.7%) in GOLD 4 (FEV1 < 30% predicted) (Table 2). The elderly population was more strongly represented than expected, and there were only a small number of patients with severe or very severe COPD. Table 2 shows the distribution of the PRO scores obtained at baseline. Almost all scores were shifted toward the milder end of each scale. The best possible score (a "floor effect", although for the BDI the best possible score is the ceiling) was observed on every scale except the SGRQ Total score and the Hyland Scale. Most of the scores obtained from the PRO measuring tools worsened with increasing severity of airflow limitation, with the exception of the D-12 Affective score and E-RS Cough and Sputum.
1 n = 121, 2 n = 120, 3 one patient receiving oxygen.
Table 2. Score distribution of questionnaires and comparison of scores obtained from patient-reported outcome measurements between GOLD 1, 2 and 3 + 4 groups classified by the severity of airflow limitation.
GOLD 2 vs. GOLD 3 + 4 (Steel-Dwass test), §§: p < 0.001, §: p < 0.01 GOLD 1 vs. GOLD 3 + 4 (Steel-Dwass test), ¶¶: p < 0.001, ¶: p < 0.05 Kruskal-Wallis test for three groups yields significant differences (p < 0.001) except for D-12 Affective Score and E-RS Cough and Sputum. 1 n = 119, 2 n = 120, 3 n = 57, 4 n = 58, 5 n = 59.
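The GOLD airflow-limitation grades quoted in this section can be written as a small lookup, and the reported group percentages can be recomputed from the counts. This is a minimal sketch: the function name `gold_grade` is hypothetical, and it assumes that airflow limitation has already been confirmed, so only the FEV1 % predicted thresholds stated in the text are applied.

```python
def gold_grade(fev1_percent_predicted):
    """Map FEV1 (% predicted) to GOLD grade 1-4 using the thresholds in the text."""
    if fev1_percent_predicted >= 80:
        return 1
    if fev1_percent_predicted >= 50:
        return 2
    if fev1_percent_predicted >= 30:
        return 3
    return 4

# Group sizes as reported in this section.
counts = {1: 41, 2: 60, 3: 14, 4: 7}
total = sum(counts.values())

# Percentage of the cohort in each grade, rounded to one decimal place.
shares = {grade: round(100 * n / total, 1) for grade, n in counts.items()}
print(total, shares)
```

The recomputed shares agree with the percentages given in the text for a cohort of 122 subjects.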

Episodes Identified during Follow-Up Periods
Of the 122 enrolled patients, 18 (14.8%) were confirmed to have died during the follow-up period. The observed period for mortality was 43.0 ± 44.5 months, with a median of 21.6 months and a range of 4 to 74 months (1324.2 ± 1377.0 days with a range of 138 to 2281 days). An episode of exacerbation was identified in 66 of 117 available subjects (56.4%). The mean duration from entry to last attendance or the first episode of exacerbation was 21.5 ± 16.0 months, ranging from 0 to 74 months (669.3 ± 497.0 days with a range of 7 to 2273 days). Forty-one of 119 available subjects (34.5%) were hospitalized for exacerbation at least once during the observation period, with a mean duration of 30.5 ± 25.0 months, ranging from 1 to 74 months (941.0 ± 761.0 days with a range of 54 to 2273 days). Table 3 shows the results of the univariate Cox proportional hazards model analyzing the association of major clinical measures and PRO scores with mortality. Crude Cox regression analysis of the raw predictors revealed statistically significant HRs for age, some of the physiological measures, including FVC, FEV1, FEV1/FVC, and DLco, and scores from some PRO measures, namely the BDI, SGRQ Total and Activity, CAT, and Hyland Scale. This demonstrates that these PRO measures are all significant mortality predictors in stable COPD. In other words, health status, the global quality of life scale, and some measurements of dyspnea are related to mortality. Because crude HRs are expressed per unit of each raw predictor, variables with different units are better compared after standardization. HRs per SD, obtained with the z-score transformation as a general standardization method, are therefore also shown in Table 3. The C-index, in turn, is often preferred for comparing different event prediction models. Among the significant mortality predictors, the C-index for FEV1 was the highest at 0.733.
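The C-index used for this comparison can be made concrete with a minimal sketch of Harrell's concordance for a single predictor. The function name and the toy data are hypothetical, and counting tied risk scores as half-concordant is one common convention, assumed here rather than taken from the study. For a measure such as FEV1, where lower values indicate higher risk, the negated value would be passed as the risk score.

```python
def concordance_index(times, events, risk):
    """Harrell's C: share of comparable pairs ranked correctly by the risk score.

    A pair (i, j) is comparable when subject i had the event strictly
    before subject j's event or censoring time.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0  # earlier event had the higher risk score
                elif risk[i] == risk[j]:
                    concordant += 0.5  # tie in risk counts as half (assumed convention)
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Invented toy data: follow-up time, event flag, and a risk score per subject.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_risk = [4, 3, 2, 1]
c_index = concordance_index(toy_times, toy_events, toy_risk)
print(c_index)
```

A value of 0.5 corresponds to ranking no better than chance, and values approaching 1.0 indicate better discrimination.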

Predictive Properties of Mortality
The comparison of the HR of mortality associated with significant PRO scores and established predictors of mortality, such as age and FEV1, is shown in Figure 1. The predictors are arranged in order of HR per SD change, that is, from the largest change in mortality risk, and the figure indicates how much the risk increases for a 1 SD increase (+) or decrease (−). This can be called a standardized illustration of the magnitude of the effect on mortality. The results of multivariate analysis based on the Cox proportional hazards model are known to be unstable when there are fewer than 20 events; since there were only 18 deaths in the present study, multivariate analysis was not performed.
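The standardized presentation described for Figure 1 can be sketched as follows. In a univariate Cox model with a linear term, the coefficient scales with the unit of the predictor, so an HR per unit can be rescaled to an HR per SD, and a protective HR (below 1) can be re-expressed as the risk increase for a 1 SD decrease. The function names are hypothetical. The per-point HRs are those reported in the abstract for the CAT and the Hyland Scale, whereas the SD values are invented placeholders, not the study's SDs.

```python
import math

def hr_per_sd(hr_per_unit, sd):
    """Rescale an HR per one unit of the predictor to an HR per one SD."""
    return math.exp(math.log(hr_per_unit) * sd)

def risk_increasing_hr(hr_sd):
    """Return (HR >= 1, direction).

    '+' means a 1 SD increase raises risk; '-' means a 1 SD decrease does.
    """
    if hr_sd >= 1.0:
        return hr_sd, "+"
    return 1.0 / hr_sd, "-"

# HR per point from the abstract; SD values are hypothetical placeholders.
examples = {"CAT": (1.077, 7.0), "Hyland Scale": (0.951, 18.0)}

rows = []
for name, (hr_unit, sd) in examples.items():
    hr, direction = risk_increasing_hr(hr_per_sd(hr_unit, sd))
    rows.append((name, direction, hr))

# Sort by magnitude so the largest change in risk comes first.
rows.sort(key=lambda row: row[2], reverse=True)
for name, direction, hr in rows:
    print(f"{name}: HR {hr:.2f} per 1 SD ({direction})")
```

Expressing every predictor as a risk increase per SD places measures with different units and directions on one comparable axis.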

Predictive Properties of AECOPD
HRs for much of the clinical information and physiological measures were statistically significant, revealing that the older the patient and the poorer their physiological measures, the greater the risk (Table 4). PaO 2 and BMI, however, were not associated with a greater risk. The majority of PRO tool scores were significantly associated with the first exacerbation, apart from the D-12 Affective score and RS-Cough and Sputum, which did not demonstrate a statistically significant predictive relationship. The highest C-index was 0.754 for FEV 1 . As previously described, Cox regression of standardized predictors with exacerbation is shown in Table 4, and the comparison of the HR of exacerbation associated with significant scores of PROs, age and FEV 1 is illustrated in Figure 2.

Predictive Properties of the First Hospitalization Due to Acute Exacerbation
Statistically significant HRs were observed for all measures, apart from BMI and the D-12 Affective score, in relation to the first hospitalization caused by acute exacerbation. Table 5 and Figure 3 show the results of the univariate analysis based on the Cox proportional hazards model for the data obtained at baseline and the time to hospitalization for the first AECOPD. Almost all clinical, physiological, and PRO measurements obtained at baseline except for BMI and the D-12 Affective score were significant predictors of first hospitalization for exacerbation.


Figure 3. Comparison of the HR (hazard ratio) of the first hospitalization due to acute exacerbation of COPD associated with significant scores from patient-reported outcome tools and established predictors of the first hospitalization for exacerbation, such as age and FEV1.

Discussion
The purpose of the current study was to determine whether PRO measures have risk-predictive ability. Six different PRO measures, yielding 14 scores including subscales, were examined. For mortality, five scores were statistically significant predictors: the SGRQ Total and CAT scores, which assess health status; the BDI score, which assesses dyspnea; the SGRQ Activity score, which assesses activity, one of the three components of health status; and the Hyland Scale score, a global score considered a very comprehensive assessment of health-related quality of life. To contrast these five scores with the other nine scores, for which no significant association with mortality could be found, it is necessary to consider the conceptual framework within which each of the scales was designed. The SGRQ Total, CAT, and Hyland Scale scores are considered to give a comprehensive overview of health-related quality of life and health status. We hypothesize that the prognostic importance of these scores derives from the fact that they encapsulate essential information in a condensed form.
The BDI score, a measure of dyspnea, was a significant predictor of mortality in the current study. It has been suggested that the SGRQ Activity score is analogous to the activity of daily life and can be used to evaluate dyspnea [31]. Reports studying the components of COPD-specific health status or health-related quality of life have indicated that dyspnea is involved in 30-40% of the scores [32]. The significance of the three scores considered comprehensive representations of health status or health-related quality of life may therefore also reflect the role of dyspnea as a significant prognostic factor.
In contrast, the D-12 Total score and its two subscales, which assess dyspnea, as well as the RS-Breathlessness score, a subscale of the E-RS, were not shown to be significant predictors. Although it is not easy to measure the perception of breathlessness because of its sensory quality and affective components, the D-12 is intended to scale breathlessness based on descriptors and to characterize precisely the sensory and affective dimensions of dyspnea. It has been reported that the BDI score was strongly and significantly correlated with mortality, whereas the peak Borg score at the end of progressive cycle ergometry was not [33]. Thus, studies differ in their findings as to whether dyspnea is a statistically significant prognostic factor. When comparing patient-reported outcome tools, we must avoid simple summaries because the results depend not only on the underlying conceptual constructs but also on the measuring properties. In the present study, the D-12 showed a highly skewed distribution of scores, which may have led to the negative results. The disparity between tools in their ability to forecast mortality could therefore have resulted from their measuring properties.
FEV1 had the highest C-index among the mortality predictors, suggesting that it is a better predictor of mortality than any of the PRO measures studied. Several factors have been reported to be better predictors of mortality than FEV1. In 2011, Waschki and colleagues reported that physical activity is a better predictor of mortality than FEV1 and is the best predictor of all-cause mortality [16]. The C-index of FEV1 was 0.75 in their study, which is comparable to our own finding of 0.733. Furthermore, they reported C-indices of 0.64 and 0.67 for the SGRQ Total and Activity scores, respectively, which corresponds to the findings of the present investigation and demonstrates their importance as predictors of mortality.
The PRO tool with the best predictive ability for mortality considered in this analysis was the Hyland Scale score. The C-index was higher for FEV1 than for the Hyland Scale score. In the crude (unstandardized) HRs, however, the Hyland Scale score may have appeared to be more strongly associated with mortality, especially when examining p values. The prognostic ability of the two is therefore considered to be very similar. In the previous literature, subtle differences in prognostic significance have been reported depending on the population and statistical methodology [16], and the present analysis does not necessarily place a lower value on the predictive value of PRO measures compared to FEV1. Therefore, it is not easy to discuss the relative merits of different indices in terms of risk-predictive ability. Such comparisons require assumptions about how much the prognosis changes for a given change in each indicator, and which analysis and which assumptions are best should always be debated; otherwise, it may appear that convenient methods of analysis are chosen according to the preferences of researchers. Of the PRO measures examined for their relationship to mortality, comprehensive measures of health status or health-related quality of life and some dyspnea-related scores were significant, whereas the other PRO measures were not. This negates the short-sighted notion that every PRO measure is predictive, and the constructs of each of the PRO tools will have to be fully considered.
This study also sought to determine the markers of clinically significant exacerbation and hospitalization [34][35][36]. We found that most of the clinical, physiological, and PRO measurements taken at the start were significant predictors of the first exacerbation and the first hospitalization caused by it, apart from BMI and the D-12 Affective score. PaO 2 and RS-Cough and Sputum were significant hospitalization predictors but not exacerbation predictors. Some reports have indicated an association between specific indicators and the emergence of exacerbation, but, to our knowledge, there have been only a few studies to compare indicators at baseline as an exacerbation predictor. The abundance of risk predictors in a real-world clinical setting may be one of the important results of the present study. In other words, for COPD patients with lower performance on these measures, caution should be exercised as exacerbations are more likely to develop. FEV 1 was found to have the highest C-index in both analyses for predicting exacerbation and hospitalization. This analysis of various PRO measures concluded that the most commonly used FEV 1 is superior for predicting the risk of AECOPD. The CAT is the PRO measure that is most frequently analyzed as a potential predictor of exacerbation. It has been reported to be capable of predicting exacerbation and hospitalization [37][38][39][40]. However, since the present study demonstrated that many PRO measurement tools could also predict exacerbation, care should be taken not to overemphasize the benefits of the CAT.
Some limitations of the present study should be mentioned. First, this single-center study was limited by the number of patients with COPD seen at the study site. Although this is a potential weak point, the cohort includes all patients with stable COPD seen in this hospital during the study period. It is possible, given the small sample size, that there was insufficient power to evaluate some associations. Second, because our study included predominantly men, generalization of these results to women with COPD may be unwarranted. Since the number of women with COPD is, in fact, quite low in Japan, the study reflects the reality of clinical COPD in our population. Lastly, although the Hyland Scale score, one of the global quality of life scales, topped the list of PRO tools for risk-predictive ability, to the best of our knowledge, there have been no previous reports on the clinical importance of the global quality of life scale for COPD. Because it is so simple, we have used it routinely in our laboratory for many years. Its role in the medical care of patients with COPD is an important topic for further study.

Conclusions
For mortality, five scores were statistically significant predictors: the SGRQ Total and CAT scores, which assess health status; the BDI score, which assesses dyspnea; the SGRQ Activity score, which assesses activity, one of the three components of health status; and the Hyland Scale score, a global score considered a very comprehensive assessment of health-related quality of life. The Hyland Scale score had the highest risk-predictive ability among the PRO measures, but its C-index did not reach that of the most commonly used measure, FEV1. Almost all clinical, physiological, and PRO measurements obtained at baseline were significant predictors of the first exacerbation and the first hospitalization for exacerbation, with a few exceptions. The results may depend not only on the underlying conceptual constructs but also on the measuring properties.

Informed Consent Statement:
Written informed consent was obtained from all participants.

Data Availability Statement:
Anonymized participant data will be available upon reasonable request to the corresponding author.