Rescuing Suboptimal Patient-Reported Outcome Instrument Data in Clinical Trials: A New Strategy

Background: Psychometric instruments such as the Repeatable Battery for the Assessment of Neuropsychological Status (RBANS) are commonly used under conditions for which they were not developed or validated. They may then generate troublesome data that could conceal potential findings. Methods: Based on a previously published refinement of the RBANS, we reanalyzed the data on 303 patients from two National Institutes of Health (NIH) trials in Parkinson’s disease and contrasted the results using the original versus refined scores. Results: Findings from the original RBANS scores were inconsistent; however, use of the refined scores produced potential findings that were in agreement with independent reports. Conclusion: This study demonstrates that, for negative trials using instrument scores as primary outcomes, it is possible to rescue potential findings. The key to this new strategy is to validate and refine the instrument for the specific disease and conditions under study and then to reanalyze the data. This study offers a demonstration of the strategy that may serve as a template for similar efforts.


Introduction
Since few health-related psychometric instruments have been "professionally developed" [1], it is common to see suboptimal instrument data in clinical trials. An example was the use of the Repeatable Battery for the Assessment of Neuropsychological Status (RBANS) [2] in two National Institutes of Health (NIH) Exploratory Trials in Parkinson's disease (NET-PD) [3,4]. The RBANS has been popular since its initial publication. According to Web of Science (Thomson Reuters; accessed 12 December 2017), the initial description of RBANS has been cited 472 times. Moreover, it has been translated and used in many other countries such as China [5], Japan [6], and Italy [7]. Its popularity may relate to its brevity. However, the original factor structure of RBANS was theory-driven [2], while multiple subsequent empirical studies have identified optimal factor structures that differ from the original (e.g., [8][9][10][11][12][13][14]). This has engendered significant concerns about the validity of the universal use of RBANS to assess cognitive function.
Two NET-PD trials tested four drugs for the treatment of movement impairment in PD: Creatine and Minocycline in FS1, and CoQ10 and GPI1485 in FS-TOO. The results indicated that Creatine and Minocycline might be promising [3,4]. The RBANS was used to assess cognition as a secondary outcome. Because we previously demonstrated that the original RBANS had not been validated for PD, it is not surprising that the RBANS assessments produced equivocal results [13]. Yet, we believe that better use of these problematic, but expensive, data is of critical and practical importance. We therefore set out to reanalyze the RBANS data from the two NET-PD trials based on the refined factor structure from our previous study [13], and then to contrast the results with those based on the original factor structure, in the hope of rescuing potential findings.

Patients
The two NET-PD trials recruited 858 early untreated PD patients from 42 sites in North America, and randomized 413 participants into six arms. Details can be found in earlier publications [3,4]. In total, 339 finished the 12-month follow-up visit. RBANS data were collected at the baseline and 12-month follow-up visits. After the deletion of patients with missing values for the RBANS data and of outliers detected by the Mahalanobis distance, 383 and 315 patients remained at baseline and follow-up, respectively. Since the change from baseline to follow-up in RBANS scores is used as the primary outcome, only patients with complete RBANS data at both baseline and follow-up were included in the final analysis of this study. Therefore, 303 patients were analyzed.

The RBANS
The original RBANS has 12 items and offers scores on five cognitive domains [2]. Each of the first four domains is measured by two items, while the last domain is measured by four. In addition, the RBANS offers a total score based on the five domain scores (Figure 1A). However, these six original RBANS scores are neither valid nor reliable in the two NET-PD trials [13]. Following psychometric analysis, we created a new instrument that retained half of the assessment items from the original RBANS and reorganized them into two domains [13]. The refined RBANS [13,15] for these trials has six items and offers scores on two new cognitive domains: word list memory (WLM) and story and figure memory (SFM), measured by three items each (Figure 1B). Specifically, six of the original RBANS items were excluded because they had low correlations with each other and with the remaining six items [13]. The refined RBANS is valid and reliable for these specific patients, and makes clinical sense [13]. However, a unified total RBANS score based on these two domains was not supported by a two-level factor structure [13].

Statistical Analysis
Data from the two trials were analyzed separately and in parallel. The normality of each variable was checked in order to choose appropriate analysis approaches. Chi-square tests were applied to categorical data, and analysis of variance (ANOVA) or Kruskal-Wallis tests were applied to continuous data. Demographic and disease characteristics of the participants at baseline were summarized and compared across the three arms within each of the two trials. Preliminary comparisons of change from baseline on cognitive abilities as measured by the original and refined RBANS, within each of the two trials, were implemented. In comparing the original RBANS with the refined RBANS, we focused on the consistency of the changes in cognitive scores. Comparing p-values would be flawed given the lack of power in these historical data (inadequate sample size) for assessing these secondary outcome measures [3,4]. Post hoc power analysis and sample size estimation were implemented to address the lack of power and small sample size issues. SPSS (Version 23, IBM Corp., Armonk, NY, USA) and SAS (Version 9.4, SAS Institute Inc., Cary, NC, USA) were used for data preparation and analysis. G*Power (Version 3.1.9.2, University of Düsseldorf, Düsseldorf, Germany) [16,17] was used for post hoc power analysis and sample size estimation.
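As an illustration of the nonparametric comparison described above, the following minimal Python sketch computes the Kruskal-Wallis H statistic for three arms. It is not the SPSS or SAS code used for the trials; the function name and the example change scores are hypothetical. Because there are three arms, the reference chi-square distribution has two degrees of freedom, for which the upper-tail probability is exactly exp(-H/2).

```python
import math
from itertools import chain


def kruskal_wallis_three_arms(arm_a, arm_b, arm_c):
    """Return (H, p) for a Kruskal-Wallis test across exactly three arms.

    Hypothetical helper for illustration only. Ties receive average ranks and
    H is divided by the usual tie-correction factor. The p-value uses the
    chi-square approximation with 2 degrees of freedom.
    """
    groups = [list(arm_a), list(arm_b), list(arm_c)]
    pooled = sorted(chain.from_iterable(groups))
    n_total = len(pooled)

    # Assign the average rank to every distinct value and accumulate the
    # tie term sum(t^3 - t) over runs of t equal values.
    rank_of = {}
    tie_term = 0
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j] == pooled[i]:
            j += 1
        # 0-based positions i..j-1 hold ranks i+1..j, whose mean is (i+1+j)/2.
        rank_of[pooled[i]] = (i + 1 + j) / 2
        t = j - i
        tie_term += t**3 - t
        i = j

    # H = 12 / (N (N + 1)) * sum(R_k^2 / n_k) - 3 (N + 1)
    rank_sum_term = sum(
        sum(rank_of[x] for x in g) ** 2 / len(g) for g in groups
    )
    h = 12 / (n_total * (n_total + 1)) * rank_sum_term - 3 * (n_total + 1)

    # Tie correction (undefined if every observation is identical).
    h /= 1 - tie_term / (n_total**3 - n_total)

    p = math.exp(-h / 2)  # chi-square upper tail, df = 2
    return h, p


if __name__ == "__main__":
    # Hypothetical change-from-baseline scores, not trial data.
    drug_1 = [3.1, 0.5, 4.2, 2.0, -1.0]
    drug_2 = [1.2, 0.0, 2.5, -0.5, 1.8]
    placebo = [0.9, -1.5, 1.1, 0.3, 2.2]
    print(kruskal_wallis_three_arms(drug_1, drug_2, placebo))
```

For small arms the chi-square approximation is rough, so exact or permutation-based p-values may be preferable in practice.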

Results
Demographic and disease characteristics at baseline are summarized in Table 1. No statistically significant difference was detected in any of these characteristics among the two treatment groups and one placebo group within each of the trials, and the characteristics are very similar within each of the two trials.
Preliminary comparisons of change from baseline on cognitive abilities are summarized in Table 2. In contrast to the original scores, when cognition is measured by the two refined RBANS scores, the trend over time is much more consistent. Out of the 12 changes on refined RBANS scores, 10 (83.3%) show increases, and only two show a decrease, both within one unit (−0.12, −0.83). None of the differences among groups, in change from baseline on either of the RBANS scores, is statistically significant (p > 0.05), which is unsurprising given that the study was underpowered for these secondary outcomes. Nevertheless, the refined RBANS provides much more consistent data trends, and may therefore give greater insight into treatment outcomes.
Post hoc power analysis indicated that, in order to have 80% power to detect the difference shown on WLM between Creatine and Placebo (3.00 vs. 1.25, Cohen's d = 0.30 [19]), 178 patients per group would be needed. Given the current sample size of 50 per group, the statistical power to detect this difference was 31%. Since the effect size of Creatine on WLM was the largest, post hoc analysis on others would result in a larger required sample size or would show lower power at the current sample size.
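The arithmetic behind these figures can be approximated with a short Python sketch. It assumes a two-sided, two-sample comparison at alpha = 0.05 and uses the normal approximation rather than the noncentral t distribution that G*Power uses for t-tests, so it reproduces the reported values only approximately. The function names are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and SD 1


def required_n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-sample comparison.

    Normal approximation: n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2.
    """
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)


def approx_power(d, n, alpha=0.05):
    """Approximate power with n patients per group (equal group sizes)."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = d * math.sqrt(n / 2)  # expected value of the test statistic
    # Probability of rejecting in either tail.
    return _STD_NORMAL.cdf(shift - z_alpha) + _STD_NORMAL.cdf(-shift - z_alpha)


if __name__ == "__main__":
    # Effect size and group size taken from the text above.
    print(required_n_per_group(0.30))
    print(round(approx_power(0.30, 50), 3))
```

With d = 0.30 this approximation gives about 175 patients per group and a power of about 0.32 at 50 per group, close to the 178 and 31% reported above. Differences of this size are expected from the approximation and from rounding of the effect size.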

Discussion
Given that suboptimal instrument data are commonly utilized in clinical trials, it is of great practical importance to make better use of these troublesome and expensive data in the hope of rescuing important potential findings. This study offers a template for such rescue efforts through the reanalysis of existing suboptimal instrument data. Findings from the two NET-PD trials that employed the Unified Parkinson's Disease Rating Scale (UPDRS) scores as the primary outcome indicated that two (Creatine, Minocycline) of the four tested drugs may be beneficial for PD patients [3], while the other two (CoQ10, GPI1485) may not [4]. Results from the present study indicate that, while using the original six RBANS scores showed no benefit of the treatments on cognitive ability as a secondary outcome, the use of the two refined RBANS scores may have produced a positive outcome had the sample sizes been larger (Table 2).
The inconsistencies in the trend among the original six RBANS scores are very troublesome. It is not easy to explain why a drug can help improve some cognitive abilities while impairing others (Table 2). This observation strengthens the conclusion that the RBANS was neither valid nor reliable in the two NET-PD trials studied here [13]. In contrast, the refined RBANS scores offer much more consistency in the trend, albeit in the absence of statistical significance. Most of the changes from baseline are increases, suggesting that the treatment may be improving each of the cognitive abilities. These increases in cognitive scores offer more practical support for the validity and reliability of the refined RBANS in both trials [13].
Another advantage of the refined versus the original RBANS is indicated by the large differences in the standard deviations (SD) of the changes from baseline (Table 2). When using the original RBANS, the SDs are very large relative to the means (e.g., for Creatine, Att has a mean of 0.65 and an SD of 13.17). After refinement, the SDs dropped substantially (e.g., for Creatine, WLM has a mean of 3.00 and an SD of 5.52).
Placebo effect [20] on the two refined RBANS scores is evidenced by the three increases in the two placebo groups. Participating in a clinical trial and receiving some kind of treatment may help patients feel better and can improve their cognition. However, these placebo effects are smaller than potential true treatment effects.
Lack of power due to the small sample size is the primary reason for the potentially physiologically important, but statistically negative, results. Had the sample size been large enough, these results might also have reached statistical significance. Future studies with appropriate sample sizes are therefore warranted.
Clearly, there are important limitations in this analysis related to statistical power. Take, for example, the observed differences on WLM between Creatine and Placebo in FS1 (details in Table 2). The observed change was 3.00 for Creatine and 1.25 for Placebo, with a Cohen's d of 0.30, which is between "small" and "medium" [19]. That is to say, the difference was arguably "clinically significant". However, due to the small sample size (46 in Creatine, 53 in Placebo, 99 total), the p-value was 0.15, and the difference was "statistically insignificant". This is a typical scenario in which a study finding is "clinically significant" but "statistically insignificant" because the sample size is not big enough. However, clinical significance should be an important deciding factor for medical studies, not simply statistical significance, because "p-values do not measure evidence" [20] (p. 619). In addition, recent reports have re-emphasized the severe issue of p-driven research (e.g., [21]), including the American Statistical Association (ASA) statement on p-values [22]. What the present study emphasizes, however, is that a properly designed and validated instrument, combined with an appropriate sample size, can provide both clinical and statistical significance.
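The link between the effect size, the group sizes, and the p-value in this example can be sketched as follows. The sketch assumes a pooled-variance two-sample comparison and approximates the t distribution by the standard normal, so it is only a rough consistency check on the reported p-value; the function names are hypothetical.

```python
import math
from statistics import NormalDist


def t_from_cohens_d(d, n1, n2):
    """Two-sample t statistic implied by Cohen's d and the two group sizes.

    With a pooled SD, t = d * sqrt(n1 * n2 / (n1 + n2)).
    """
    return d * math.sqrt(n1 * n2 / (n1 + n2))


def approx_two_sided_p(t):
    """Two-sided p-value using the normal approximation to the t distribution."""
    return 2 * (1 - NormalDist().cdf(abs(t)))


if __name__ == "__main__":
    # d = 0.30 with 46 (Creatine) and 53 (Placebo) patients, as in the text.
    t = t_from_cohens_d(0.30, 46, 53)
    print(round(t, 2), round(approx_two_sided_p(t), 2))
```

For these inputs the implied statistic is about 1.49 and the approximate p-value is about 0.14, in line with the reported 0.15 once rounding of d and the heavier tails of the t distribution are taken into account. The same effect size with several times as many patients per group would give a much smaller p-value, which is the point made in the paragraph above.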
In clinical trials that use psychometric instruments, it is critical to validate, or even refine, the instruments for the data collected before any formal statistical analysis. This is because most instruments are "not professionally developed" [1], and instrument validation is an "ongoing process" [23]. No instrument should be claimed to be "already validated"; rather, assessment instruments should be validated for the disease and population under study. Another good example appears in a recent review on oral health-related quality of life (OHRQoL) instruments [24]. Other recent literature re-emphasizes the importance of sound psychometric properties of an instrument [25][26][27].
For negative trials that used instrument scores as primary outcomes, the present findings offer a path to rescuing potential findings: validating and refining the instruments and then reanalyzing the data based on the refined instrument scores. Our study offers a demonstration of the new strategy for this type of promising effort.

Conclusions
This study demonstrates that, for negative trials using instrument scores as primary outcomes, it is possible to rescue potential findings. The key to this new strategy is to validate and refine the instrument for the specific disease and conditions under study and then to reanalyze the data. This study offers a demonstration of the strategy that may serve as a template for similar efforts.