Core Outcome Measurement Instruments for Clinical Trials of Total Knee Arthroplasty: A Systematic Review

(1) Background: We have updated knowledge of the psychometric qualities of patient-reported outcome measures and, for the first time, systematically reviewed and compared the psychometric qualities of physical tests for patients with knee osteoarthritis who are undergoing total knee arthroplasty. This work was conducted to facilitate the choice of the most appropriate instruments to use in studies and clinical practice. (2) Methods: A search of medical databases up to December 2019 identified the studies and thus the instruments used. The quality of the measurement properties was assessed by the Bot et al. criteria. (3) Results: We identified 20 studies involving 25 instruments. Half of the instruments were questionnaires (n = 13). Among the condition-specific instruments, the Oxford Knee Score, the Knee injury and Osteoarthritis Outcome Score, and the Western Ontario and McMaster Universities Osteoarthritis Index had the highest overall scores. Concerning generic tools, the Medical Outcomes Study Short-Form 36 (SF-36) and SF-12 obtained the highest overall score. For patient-specific tools, the Hospital Anxiety and Depression Scale ranked the highest. Some physical tests seemed robust in psychometric properties: the 6-min Walk Test, the five times Sit-To-Stand test, the Timed Up and Go test, strength testing of the knee flexors/extensors by isometric or isokinetic dynamometer, and the Pressure Pain Threshold. (4) Conclusion: To make stronger recommendations, key areas such as reproducibility, responsiveness to clinical change, and minimal important change still need more rigorous evaluations. Some promising physical tests (e.g., actimetry) lack validation and require rigorous studies to be used as a core set of outcomes in future studies.


Introduction
Knee osteoarthritis (OA) is a common degenerative osteoarticular disease characterized by pain, stiffness, and loss of mobility. It affects 2% to 10% of men and 1.6% to 15% of women over age 40, depending on geographical area and the definition of pathology [1]. According to a US study, 45% of adults over age 45 will have knee OA before the age of 85 [2]. OA has important consequences for quality of life, disability and mobility [1] and is associated with an increase in mortality [3]. It is a heterogeneous disorder characterized by various etiological factors, pathophysiological pathways, clinical phenotypes and prognoses [4]. Indeed, more than half of patients with radiographic OA are not symptomatic; for 50% to 70%, disease will not worsen radiographically in 2 years; for 27%, disease will progress slowly or moderately; and for 2%, disease will progress rapidly, which is known as rapidly destructive OA of the knee [5].
We lack long-term effective pharmacological treatment for OA, especially for frail patients, but nonpharmacological treatments (e.g., physical therapy, surgery) have been found effective [6][7][8]. The use of surgery is frequent: the 2010 prevalence of knee replacement in the total US population was 1.52%, with 4.7 million individuals undergoing total knee replacement [9]. In France in 2011, 86,000 knee replacements were performed, increasing by more than 33.0% between 2008 and 2013 [10].
In end-stage knee OA, total knee arthroplasty (TKA) is an effective intervention to reduce pain and improve function for most patients. However, after TKA, some patients still experience pain, loss of function, deficient muscle strength or reduced walking speed.
Outcomes of patients with knee OA undergoing or after TKA are evaluated with many different instruments used both in clinical practice and in research. The Outcome Measures in Rheumatology Clinical Trials (OMERACT) group defined a core set of outcome dimensions for clinical studies in knee OA at its medical stage: pain, physical function (the performance of daily activities), and patient global assessment [11,12]. In the same way, the International Classification of Functioning, Disability and Health (ICF) defined a core set of outcome dimensions: impairments of body functions and structures, activity limitations and participation restriction, and environmental factors [13]. These comply with the recommendations of several guidelines for outcome measurement in OA trials (European League Against Rheumatism [14], Osteoarthritis Research Society International [15], US Food and Drug Administration [11], and Slow-acting Drugs in Osteoarthritis [16]). However, these guidelines differ in their recommendations for the use of specific instruments [14,15] or simply lack any recommendations in this regard for knee OA undergoing or after TKA.
Many instruments, such as patient-reported outcome measures (PROMs) and physical tests, are available to assess the outcome dimensions of the OMERACT or ICF. However, which instruments are the most appropriate is unclear [17]. In fields like satisfaction after TKA, we lack an objective assessment tool to evaluate the impact of TKA and to better understand the heterogeneity in patients' post-surgery status [18]. Instrument selection should depend on the instrument's psychometric qualities and on practical considerations (e.g., time to complete, ease of scoring or use, mode of administration and distribution), that is, its usability in practice whatever its metrological properties. Psychometric qualities refer mainly to the validity and reliability of the measuring tool [19]. Validity is "the extent to which an instrument measures what it is intended to measure", and reliability is the extent to which the instrument is free of measurement error [19].
Several reviews of OA tools at the medical stage have been published, and a special issue on outcome measurements was published by Arthritis Care & Research [11,21,22,23]. Two separate studies [20,21] concluded that the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC), the Oxford Knee Score (OKS) and the Medical Outcomes Study Short-Form 36 (SF-36) could be recommended as primary measures in treatment studies, while noting that certain key areas such as reproducibility, responsiveness to clinical change, and minimal important change needed more rigorous evaluation before stronger recommendations could be made. However, none of these reviews focused on physical tests. In addition, an update 10 years later seems necessary to provide a complete overview of available instruments for patients with knee OA undergoing or after TKA. To assess the selected instruments, data on the descriptive and psychometric qualities of each instrument were collected and rated by using the same checklist [22,23].
The objective of this study was to give an overview of a core set of PROMs as well as physical tests used for clinical trials of individuals with knee OA undergoing TKA. This review will facilitate the choice of the most appropriate measurement instruments (PROMs and physical tests) for studies and clinical practice of patients with knee OA who are undergoing TKA.

Experimental Section
The review and analysis were conducted and reported in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-analyses (PRISMA) [24]. The systematic review protocol was registered in PROSPERO (CRD42020161878). Information regarding study selection, search strategy, inclusion and exclusion criteria, risk of bias and quality of evidence, data extraction, and psychometric qualities are shown in detail in the Supplementary Materials.

Search Strategy
PubMed, EMBASE, Web of Science, Cochrane Central Register of Controlled Trials and CINAHL databases were systematically searched for articles published from 2014 through December 2019. The broad computerized search strategy combined three blocks of key words: patients with OA of the knee undergoing TKA, outcome assessment, and controlled trials. Search terms are listed in Research Algorithm S1. Articles were first selected on the basis of their title and abstract by two independent reviewers (VR and AV). Inclusion was then decided by the same two independent reviewers after reading the full article. In case of doubt or concern, the full article was read by two other independent reviewers (FC and EC) and, if necessary, a third reviewer (SB) resolved disagreements.
Regarding clinimetric studies, a search following the same process and hierarchy was made using the PubMed database. In addition, references of the retrieved articles were screened for relevant articles.

Inclusion and Exclusion Criteria
Inclusion criteria were as follows: (1) Study design: all randomized controlled trials (RCTs) or clinical controlled trials (CCTs); (2) Patients: individuals with OA of the knee undergoing TKA; (3) Intervention: articles focused on unilateral and primary TKA (not total hip arthroplasty, because those patients were considered a different patient population); (4) Outcomes: data measured before and after TKA. Given the large number of articles published on this topic and considering that a certain number of outcomes have been used only recently (last 5 years), we included only articles published since 2014 (≤5 years); (5) Instruments: all outcomes (PROMs and physical tests) used as primary or secondary criteria. Finally, inclusion criteria at the level of clinimetric studies were as follows: the article's main focus was the psychometric evaluation of the instrument (since no checklist to rate psychometric evaluations based on item response theory [IRT] is currently available, only psychometric evaluations involving classical test theory were included and evaluations based on IRT were excluded); data for patients with knee OA undergoing TKA were published independently in case of mixed populations (e.g., patients with OA undergoing total hip arthroplasty and patients with OA undergoing TKA); and results had been published in English as a full report.

Quality Assessment and Data Extraction
The Physiotherapy Evidence Database (PEDro) scale [25] was used to assess risk of bias within studies. At the level of clinimetric studies, to facilitate comparisons with other studies from 10 years ago [7,26], fundamentally the same checklist of specific criteria for quality assessment of instruments was used (Table S1 for self-assessment tools and Table S2 for physical tests). This checklist was initially developed by Bot et al. [27] based on the work by Lohr et al. [28], the Scientific Advisory Committee of the Medical Outcomes Trust guidelines and the checklist developed by Bombardier and Tugwell [29]. All qualities were rated as positive, doubtful, or negative. If no or insufficient information was available, no rating was given. Two reviewers (VR and AV) independently assessed the psychometric qualities of each instrument. Disagreements between reviewers were resolved by discussion. The highest rating was assigned when ≥2 studies were found on the same psychometric qualities of the same instrument involving the same population.

Characteristics of the Instruments
The descriptive data recovered provide information about the target population, domain assessed, format of the instruments and mandatory equipment for physical tests (Table A1).

Psychometric Qualities
Psychometric qualities (validity, reproducibility, responsiveness, and interpretability) were assessed for each instrument in this specific population, namely knee OA, and more precisely TKA according to the COnsensus-based Standards for the selection of health status Measurement Instruments (COSMIN) recommendations [30]. See supplementary material for more detailed information (Table S1 for self-assessment tools and Table S2 for physical tests).

Overall Quality
As Bot et al. described [27,31], the overall score of each instrument was obtained by adding the number of positive ratings across its psychometric qualities (Table S1 for self-assessment tools and Table S2 for physical tests).
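The scoring rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the instrument names, the list of qualities and every rating below are hypothetical and are not data from this review, and the "+", "?", "-" and None codes are our own shorthand for positive, doubtful, negative and unrated.

```python
# Hypothetical ratings for illustration only; these are NOT data from this review.
# "+" = positive, "?" = doubtful, "-" = negative, None = no or insufficient information.
ratings = {
    "instrument_A": {
        "content_validity": "+",
        "construct_validity": "+",
        "reliability": "+",
        "agreement": None,
        "responsiveness": "?",
    },
    "instrument_B": {
        "content_validity": "+",
        "construct_validity": "-",
        "reliability": None,
        "agreement": None,
        "responsiveness": "+",
    },
}


def overall_score(instrument_ratings):
    """Return the number of psychometric qualities rated positive ("+")."""
    return sum(1 for rating in instrument_ratings.values() if rating == "+")


scores = {name: overall_score(r) for name, r in ratings.items()}
# Rank instruments from the highest to the lowest overall score.
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores, ranked)
```

Note that under such a rule an unrated quality and a negatively rated quality contribute equally (zero) to the overall score, so a low score may reflect missing studies rather than a poor instrument.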

Statistical Analysis
Except for the calculation of the kappa agreement coefficient between the two reviewers, statistical analyses were exclusively descriptive and were performed with Microsoft Office Excel 2019 (Microsoft Corp., Redmond, WA, USA). To assess the concordance between the two reviewers' assessments, we calculated a kappa coefficient with Stata v13 (StataCorp, College Station, TX, USA), treating the criterion as categorical and taking into account the number of categories of the variable studied.
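For readers unfamiliar with the kappa coefficient, the following is a minimal Python sketch of the unweighted Cohen's kappa for two raters and a categorical criterion. It is not the Stata procedure used in this study, and the include/exclude decisions shown are hypothetical, chosen only to make the arithmetic easy to follow.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters rating the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must rate the same non-empty set of items.")
    n = len(rater_a)
    # Observed proportion of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical category
    return (observed - expected) / (1 - expected)


# Hypothetical screening decisions for 10 articles (not data from this review).
reviewer_1 = ["include", "include", "exclude", "exclude", "exclude",
              "include", "exclude", "exclude", "exclude", "exclude"]
reviewer_2 = ["include", "exclude", "exclude", "exclude", "exclude",
              "include", "exclude", "exclude", "include", "exclude"]
kappa = cohen_kappa(reviewer_1, reviewer_2)
print(round(kappa, 3))
```

In this hypothetical example the reviewers agree on 8 of 10 articles, but because agreement expected by chance is 0.58, kappa is only about 0.52, which illustrates why kappa is reported rather than raw percentage agreement.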

Study Selection
A flow chart detailing the study selection process is shown in Figure 1. The initial searches returned a total of 2339 articles; 313 were duplicates. Titles and abstracts of the retrieved articles were assessed for suitability, leading to the retrieval of 105 full texts. Of these, 85 did not fulfill the inclusion criteria and reports for the remaining 20 studies were analyzed. The kappa statistic between two reviewers (VR and AV) was 0.74, which indicated good agreement [32].

Study Characteristics
The included studies involved 1997 participants (1005 in intervention groups and 992 controls). For 12 (60%) studies, the design was a randomized controlled trial (RCT). A summary of the included trials is shown in Table 1.

Risk of Bias Within Studies
Eleven trials were considered high quality (PEDro score >5/10), with a mean score of 5.9/10 across all trials (Table 1 and Table A2). The total PEDro scores were 8 for 1 trial [52], 7 for 8 trials [1,20,43,48,57,58,60,73], 6 for 2 trials [44,45], 5 for 6 trials [11,24,49,82,83,84] and 4 for 3 trials [40,85,86]. The PEDro items most frequently satisfied were eligibility criteria, outcome obtained in more than 85% of participants, the use of similar groups at baseline, measurements of variability for at least one key outcome, and between-group comparisons, which were evident in almost all reports. None of the trials reported blinding of participants, therapists or assessors, which is expected, given that these items are the most difficult to adhere to in trials of non-pharmacological interventions involving exercise. Nineteen trials reported an intention-to-treat analysis, 9 used allocation concealment and 10 used random allocation.
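The PEDro tallies reported above can be checked with simple arithmetic. The short Python sketch below uses only the counts given in the text (number of trials at each total score); the variable names are ours.

```python
# Number of trials at each total PEDro score, as reported in the text above.
trials_per_score = {8: 1, 7: 8, 6: 2, 5: 6, 4: 3}

n_trials = sum(trials_per_score.values())
total_points = sum(score * n for score, n in trials_per_score.items())
mean_score = total_points / n_trials
# High quality is defined in the text as a PEDro score >5/10.
n_high_quality = sum(n for score, n in trials_per_score.items() if score > 5)

print(n_trials, total_points, round(mean_score, 1), n_high_quality)
```

The counts sum to the 20 included trials, the 118 points give the reported mean of 5.9/10, and 11 trials exceed the threshold of 5.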

Responsiveness
Responsiveness was examined in almost all tools by various methods. With the definition of responsiveness used in this study, the KOOS [39,72,78], OKS [13,29,41,90], GDS [72], HADS [10,17,51,81,87] and NRS pain [82] had positive ratings for responsiveness to change. In TKA patients, at 6 months, the MIC scores for improvement in pain and physical function for the WOMAC were approximately 23 and 19, respectively, which were higher than the MDC [78]. For the NRS pain, the MIC was less than the MDC [82]. The remaining instruments had indeterminate ratings because only distribution-based methods were used, external clinical criteria or a control ("stable") population to determine whether change had indeed occurred were lacking, and the MIC was not defined.

Practical Burden
Practical burden on the patient (time to complete tool, ease of scoring) was assessed for almost all instruments. Only the KSS, HSS, SF-36 and SF-12 had doubtful or negative ratings, which were mainly related to some scoring difficulty.

Cultural Adaptation
Transcultural adaptation was not rated precisely because of the great subjectivity and frequently because of lack of clarity in the cross-cultural adaptation process. Table 3 and Table A3 show the cultural translations/adaptations of each tool.

Practical Burden
Practical burden on the patient (time to administer, ease of scoring) was assessed for almost all instruments. Only strength had negative ratings related to time to administer. Physical activity level with average steps/day measured by accelerometry (actimetry) also had a negative rating related to ease of scoring, with some signal processing difficulty.

Overall Score
Among the condition-specific PROMs, the OKS, KOOS and WOMAC had the highest overall scores with 10, 9 and 8 positive ratings respectively over the 12 criteria investigated. Concerning generic tools, the SF-36 and SF-12 obtained the highest overall score, with 6 positive ratings. For patient-specific tools, the GDS and HADS seem appropriate, with an overall positive rating of 9 and 8, respectively (Table 3 and Figure 2).
Regarding physical tests, some appeared to be quite robust in terms of psychometric properties (strength, PPT, 6MWT, 5STS and TUG), with an overall score ≥6/7. However, some tests lacked clinimetric studies: the Active Straight Leg Raise (ASLR), gait velocity, 10MWT, electromyography and actimetry, with an overall score ≤2 (Figure 3 and Table A5).

Discussion
In this study, we examined and compared the quality of the measurement properties of outcome measures (patient-reported and for the first time, physical tests) assessing rehabilitation outcomes for patients with knee OA undergoing TKA. There are 3 main findings in this review. First, a wide variety of PROMs are applied to measure outcomes in rehabilitation after knee arthroplasty, but only 3 (KOOS, WOMAC and OKS) have undergone an extensive validation process in knee OA before and/or after TKA. Second, important measurement attributes for evaluative instruments such as reproducibility, responsiveness to clinical change and definition of the MIC are still scarcely evaluated. Third, some physical tests were well evaluated (TUG, 5STS test, 6MWT, PPT) but others not at all (actimetry, electromyography, ASLR).
Of the 13 tools applied in knee arthroplasty rehabilitation, the KOOS, WOMAC and OKS have been completely studied for their measurement properties, including content validity, reliability, construct validity, responsiveness, and floor and ceiling effects. The GDS seems valid in younger populations but in view of the mean age of TKA patients may not be the best assessment and the HADS may be preferred. The Hamilton Depression Rating Scale (HDRS) is frequently used as a primary or secondary outcome, but no studies have been carried out in this type of population; nevertheless its validity has been assessed in a depressive population [96]. In the same way, for the ASLR, although frequently used, only one study used it in this type of population [93] but not really in terms of its psychometric properties. Nevertheless, clinimetric properties of the ASLR have been widely assessed in chronic low back pain [100].
Another key finding from this study is the persistent lack of essential requisites for evaluative instruments. Because evaluative PROMs are used to assess change in patients over time, these measures need to have high responsiveness and high reproducibility [101]. These instruments must be able to identify a change if it has occurred and to establish that it is beyond the range of measurement error. Therefore, the agreement parameter is vital (and preferred over the reliability parameter) because it concerns the absolute measurement error. The measurement error is essential for distinguishing whether the measured change is relevant or not. The MDC can be estimated from the measurement error and can be compared with the MIC. Knowing the MDC, the amount of measurement error and how these relate to the MIC provides insight into the meaning of changes in the instrument's values [32].
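The relation between measurement error, MDC and MIC can be illustrated with a short Python sketch. It assumes the formulas commonly used in clinimetrics, which are not stated in this article: SEM = SD × √(1 − ICC) and MDC95 = 1.96 × √2 × SEM. The standard deviation, ICC and MIC values are hypothetical and serve only to show the comparison.

```python
import math


def standard_error_of_measurement(sd, icc):
    """SEM from the between-subject SD and a reliability coefficient (assumed formula)."""
    if not 0.0 <= icc <= 1.0:
        raise ValueError("ICC must lie between 0 and 1.")
    return sd * math.sqrt(1.0 - icc)


def minimal_detectable_change_95(sem):
    """MDC at the 95% confidence level for a difference between two measurements."""
    return 1.96 * math.sqrt(2.0) * sem


# Hypothetical values for illustration (not data from this review).
sd_baseline = 10.0   # SD of scores, in instrument points
icc = 0.91           # test-retest reliability
mic = 10.0           # minimal important change, in instrument points

sem = standard_error_of_measurement(sd_baseline, icc)
mdc95 = minimal_detectable_change_95(sem)
# A change equal to the MIC is distinguishable from measurement error
# in an individual patient only if the MIC exceeds the MDC.
mic_exceeds_mdc = mic > mdc95
print(round(sem, 2), round(mdc95, 2), mic_exceeds_mdc)
```

With these hypothetical inputs the SEM is 3.0 points and the MDC95 about 8.3 points, so an MIC of 10 points lies beyond measurement error; with a lower ICC the same MIC could fall inside it.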
In this review, as in the study of Veenhof et al. [20], estimates of agreement were reported less frequently than were estimates of reliability. Of the 13 instruments, 6 (46%) reported a positively rated reliability parameter and only 2 (15%) presented an agreement parameter. Researchers tend to use reliability parameters more than agreement parameters, and agreement is a psychometric property that tends to be neglected in clinimetric studies in the medical sciences [101]. The other issue lies in determining the responsiveness of instruments for which a definition of the MDC is missing. According to Alviar et al. [21], "most studies examined responsiveness to clinical change by estimating effect sizes, standardized response means, and the t statistic, which could be affected by sample size and sample variation". In this review, as in Alviar et al. ten years ago, only a few studies defined the MDC to allow for meaningful interpretation of the obtained scores. Defining what constitutes a clinically meaningful change should remain a priority for future clinimetric studies.
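The distribution-based responsiveness indices mentioned in the quotation above can be sketched as follows. The definitions used are the conventional ones and are an assumption here, since the article does not spell them out: effect size = mean change / SD of baseline scores, and standardized response mean (SRM) = mean change / SD of change scores. The scores are hypothetical.

```python
from statistics import mean, stdev


def effect_size(baseline, follow_up):
    """Mean change divided by the SD of baseline scores (assumed definition)."""
    changes = [post - pre for pre, post in zip(baseline, follow_up)]
    return mean(changes) / stdev(baseline)


def standardized_response_mean(baseline, follow_up):
    """Mean change divided by the SD of the change scores (assumed definition)."""
    changes = [post - pre for pre, post in zip(baseline, follow_up)]
    return mean(changes) / stdev(changes)


# Hypothetical scores for five patients before and after surgery (not study data).
pre_scores = [40, 50, 45, 55, 60]
post_scores = [60, 65, 70, 70, 85]

es = effect_size(pre_scores, post_scores)
srm = standardized_response_mean(pre_scores, post_scores)
print(round(es, 2), round(srm, 2))
```

Both indices depend on the variability of the sample studied, which is why they cannot by themselves indicate whether a change is clinically meaningful or beyond measurement error.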
Although the HDRS has been widely used to evaluate depression, to our knowledge no clinimetric study has been done in OA patients. Moreover, criticism of its reliability has been growing in the last few years. A recent review by Bagby et al. [96] showed that the HDRS is "… psychometrically flawed…" and the authors suggested its revision.
Some physical tests were well evaluated, but others not at all. Actimetry is a recent and rapidly expanding technology whose main interest is that it allows the patient's functional abilities to be evaluated in everyday life. However, we lack consensus on the measurement method, signal processing and results interpretation (number of steps, activity counts, duration of activity at different intensity levels). The lack of consensus mainly concerns signal processing and the data extracted rather than the interpretation of these data, for which there is a known threshold of "normality". Studies of this subject seem necessary. Also, the tests presented here (with the exception of strength, EMG and actimetry) are simple to implement in everyday clinical practice (see necessary equipment in Table A1 and Table A5). However, although most have psychometric properties clearly evaluated in our population, the rules of good practice must be respected when administering these performance-based tests (see Table A1).
This study has several limitations. Given the large number of studies of TKA patients, only instruments used in CCTs or RCTs were included, so pertinent tools might have been missed. The definitions, approaches and methods used to assess the attributes varied among studies. We lack a gold standard to assess the measurement properties of PROMs [88]. The review included only studies in which measurement properties were assessed with classical test theory, and therefore recent studies using relatively newer approaches (e.g., IRT) might have been missed. The IRT method is progressively becoming a prominent tool in rehabilitation research since it facilitates the evaluation of the quality of psychometric properties of some instruments. However, we still lack explicit criteria for quality evaluation of the methods and results of studies using IRT models. The unfavorable or indeterminate ratings a tool received could be due to flaws in study methods, and not necessarily deficiency of the tool per se. In addition, because some tools have been extensively studied (e.g., WOMAC, SF-36), they had varied ratings per measurement attribute as compared with others with only one or a few clinimetric studies but positive ratings for the attributes. Publication bias is another limitation: clinimetric studies with negative results may not have been published, which would have precluded their inclusion.

Conclusions
We compared the measurement attributes of the various outcomes applied in studies of rehabilitation after TKA with a view to facilitating the choice of the most appropriate instruments for studies of patients with knee OA who are undergoing TKA. Physical tests were reviewed for the first time in the TKA population. Our analysis suggests that strength, PPT, 6MWT, 5STS and TUG tools are notable, with the 6MWT and TUG being the most extensively validated and therefore possibly the most appropriate to use. Overall, regarding PROMs, our findings corroborate results from previous studies suggesting that the KOOS, WOMAC, OKS, HADS and SF-36 are the most comprehensively tested tools in this population and are worth considering. Nevertheless, as already demonstrated, more rigorous evaluations in key areas, such as reproducibility, responsiveness to clinical change, and MIC, are still needed to make stronger recommendations. By differentiating the assessment field of each tool, we may potentially recommend the KOOS as the most relevant condition-specific questionnaire, the HADS as a patient-specific tool and the SF-36 as a generic one. Nevertheless, other promising assessments (e.g., actimetry) lack validation and require rigorous studies to be used as a core set of outcomes in future studies. This review could serve as a basis for instrument selection in future studies and help ensure the quality of their outcome measurement.

Supplementary Materials:
The following are available online at www.mdpi.com/xxx/s1 and detail the Experimental Section, Research algorithm S1: Search terms used in databases, Table S1: Checklist for rating the psychometric quality of self-assessment tools, Table S2: Checklist for rating the psychometric quality of physical tools.

Time to rise from a standard armchair, walk as quickly but as safely as possible a distance of 3 m, turn, walk back to the chair and sit down. Usual footwear and regular walking aids allowed and recorded. Fastest of two trials is recorded in seconds. Same chair is needed for re-testing.

Stair tests
SCT. Equipment and space: stopwatch; flight of 12 stairs with 18 cm (7 inch) step height and handrails. Unit: time (s). Instructions: Ascend and descend flight of 12 stairs as quickly as safe and comfortable. One handrail allowed but encouraged to only use legs. Total time to ascend and descend steps for one trial is recorded to nearest 100th second.

See Table 2 for abbreviations.

MDC95 = minimal detectable change at the 95% confidence level; + = positive; - = negative; ø = doubtful; pts = points; see Table 2 for additional abbreviations. * ø for the functional subscale; ** May have reduced validity in some populations (e.g., older people). § Floor effect; §§ Ceiling effect; §§§ Floor effect for sports and recreation subscale; §§§§ Floor and ceiling effect for some individual items but not at the scale level. # - for S-Anxiety. ˩ S-Anxiety more responsive to change than T-Anxiety.

OA = osteoarthritis; TKA = total knee arthroplasty; MDC95 = minimal detectable change at the 95% confidence level; + = positive; - = negative; ø = doubtful; NA = not applicable; see Table 2 for additional abbreviations. ˥ Domains: domain(s) explored by the test; ǂ see Table A1 for more details about equipment, space and instructions; * Standard error of measurement (SEM).